Phonological Rule Formalism
One influential formalism for stating phonological rules comes from Chomsky and Halle’s “The Sound Pattern of English” (SPE) (Chomsky1968?). In this formalism, rules have the general form:
\[A \rightarrow B / C\_D\]
Where: - \(A\) is the target (what changes) - \(B\) is the structural change (what \(A\) changes to) - \(C\) is the left context - \(D\) is the right context
This can be read as “\(A\) becomes \(B\) when it occurs between \(C\) and \(D\).”
The rule applies whenever the sequence \(CAD\) is found in the input string, transforming it to \(CBD\) in the output.
Converting Rules to FSTs
To convert a phonological rule \(A \rightarrow B / C\_D\) into an FST, we need to:
- Create an FST that recognizes the pattern \(CAD\) in the input
- Ensure that when this pattern is recognized, \(A\) is replaced by \(B\) in the output
- Ensure that all other segments are copied unchanged from input to output
The algorithm developed by (Kaplan1994?) and refined by (Mohri1997?) provides a systematic way to do this.
The Replace Operator
The core of the Kaplan and Kay algorithm is the “replace” operator, which can be defined as:
\[\text{replace}(A, B, C, D) = \{(x, y) \mid y \text{ is the result of replacing all instances of } A \text{ with } B \text{ in } x \text{ when } A \text{ occurs between } C \text{ and } D\}\]
This operator defines a regular relation that can be implemented as an FST.
The Formal Algorithm
The formal algorithm for converting a rule \(A \rightarrow B / C\_D\) into an FST involves several sophisticated steps. Here we follow the improved algorithm by (Mohri1997?), which fixed issues in the original (Kaplan1994?) construction.
Preliminaries
First, we define some basic concepts:
- Let \(\Sigma\) be the input/output alphabet
- Let \(A, B, C, D\) be regular expressions over \(\Sigma\)
- Define the “context language” \(L = C \circ A \circ D\), which represents all strings that match the rule context
- Define auxiliary FSAs for these languages:
- \(M_A\) recognizes language \(A\)
- \(M_B\) recognizes language \(B\)
- \(M_C\) recognizes language \(C\)
- \(M_D\) recognizes language \(D\)
- \(M_L\) recognizes language \(L\)
Marker Symbols
To handle rule application correctly, we introduce special marker symbols: - \(\#_1, \#_2, \#_3, \#_4\) (not in \(\Sigma\)) to mark boundaries - \(\sigma = \#_1, \tau = \#_2, \rho = \#_3, \eta = \#_4\)
Construction Components
The algorithm constructs several intermediate transducers:
1. Boundary Marking Transducer (\(T_{\text{Mark}}\))
This transducer identifies potential rule application sites and adds boundary markers:
\[T_{\text{Mark}} = \{(x, x') \mid x' \text{ is } x \text{ with } \sigma, \tau, \rho, \eta \text{ marking the boundaries of all occurrences of } L \text{ in } x\}\]
Where occurrences of \(L\) are marked as: - \(\sigma\) at the start of \(C\) - \(\tau\) at the boundary between \(C\) and \(A\) - \(\rho\) at the boundary between \(A\) and \(D\) - \(\eta\) at the end of \(D\)
Formally, \(T_{\text{Mark}}\) is constructed as:
\[T_{\text{Mark}} = \text{Id}(\Sigma^*) \circ T_{\sigma} \circ \text{Id}(\Sigma^*) \circ T_{\tau} \circ \text{Id}(\Sigma^*) \circ T_{\rho} \circ \text{Id}(\Sigma^*) \circ T_{\eta} \circ \text{Id}(\Sigma^*)\]
Where: - \(\text{Id}(L)\) is the identity transducer for language \(L\) - \(T_{\sigma}, T_{\tau}, T_{\rho}, T_{\eta}\) insert the respective boundary markers at appropriate positions
2. Replacement Transducer (\(T_{\text{Replace}}\))
This transducer performs the actual replacement between \(\tau\) and \(\rho\) markers:
\[T_{\text{Replace}} = \text{Id}((\Sigma \cup \{\sigma, \tau, \rho, \eta\})^* \circ \{\sigma\}) \circ \text{Id}((\Sigma \cup \{\sigma, \tau, \rho, \eta\})^* \circ \{\tau\}) \circ T_{A \rightarrow B} \circ \text{Id}(\{\rho\} \circ (\Sigma \cup \{\sigma, \tau, \rho, \eta\})^* \circ \{\eta\}) \circ \text{Id}((\Sigma \cup \{\sigma, \tau, \rho, \eta\})^*)\]
Where \(T_{A \rightarrow B}\) maps strings in \(A\) to their corresponding strings in \(B\).
3. Marker Removal Transducer (\(T_{\text{Unmark}}\))
This transducer removes all marker symbols:
\[T_{\text{Unmark}} = \{(x, y) \mid y \text{ is } x \text{ with all occurrences of } \sigma, \tau, \rho, \eta \text{ removed}\}\]
The Complete Algorithm
The complete rule transducer is constructed by composing these components:
\[T_{\text{Rule}} = T_{\text{Mark}} \circ T_{\text{Replace}} \circ T_{\text{Unmark}}\]
Mohri’s Improvements
(Mohri1997?) introduced three key improvements to the original Kaplan and Kay algorithm:
- Factorization: The replace operation is factored into simpler components, making it more efficient
- Filter Transducers: Special filter transducers are used to ensure correct applications in the presence of overlapping potential matches
- Determinization: The algorithm ensures that the resulting transducer can be determinized, which is crucial for efficiency
Filter Transducers
The filter transducers include:
Left-to-right filter (\(T_{\text{LTR}}\)): \[T_{\text{LTR}} = \{(x, x) \mid \text{no marker in } x \text{ is followed by a marker of lower index}\}\]
Containment filter (\(T_{\text{Contain}}\)): \[T_{\text{Contain}} = \{(x, x) \mid \text{every } \sigma \text{ in } x \text{ is matched with } \eta, \text{ and same for } \tau \text{ and } \rho\}\]
With these filters, the improved marker transducer is:
\[T_{\text{Mark}}' = T_{\text{Mark}} \circ T_{\text{LTR}} \circ T_{\text{Contain}}\]
Epsilon Handling
A critical aspect of the algorithm is the handling of epsilon transitions. Mohri’s approach uses special epsilon filters to ensure correct behavior when rules involve empty strings. These filters prevent the improper application of rules and ensure termination of the algorithm.
The CDRewrite Function
The complete algorithm is often implemented through a function called CDRewrite, which is formally defined as:
\[\text{CDRewrite}(A, B, C, D, \Sigma) = T\]
where: - \(A\) is a regular expression representing the target pattern - \(B\) is a regular expression representing the replacement pattern - \(C\) is a regular expression representing the left context - \(D\) is a regular expression representing the right context - \(\Sigma\) is the alphabet of the language - \(T\) is the resulting finite state transducer
The CDRewrite function constructs \(T\) as follows:
\[\text{CDRewrite}(A, B, C, D, \Sigma) = T_{\text{Mark}} \circ T_{\text{Replace}} \circ T_{\text{Unmark}}\]
where:
\[T_{\text{Mark}} = \text{Id}(\Sigma^*) \circ T_{\sigma} \circ \text{Id}(\Sigma^*) \circ T_{\tau} \circ \text{Id}(\Sigma^*) \circ T_{\rho} \circ \text{Id}(\Sigma^*) \circ T_{\eta} \circ \text{Id}(\Sigma^*) \circ T_{\text{LTR}} \circ T_{\text{Contain}}\]
\[T_{\text{Replace}} = \text{Id}((\Sigma \cup M)^* \sigma (\Sigma \cup M)^* \tau) \circ T_{A \rightarrow B} \circ \text{Id}(\rho (\Sigma \cup M)^* \eta (\Sigma \cup M)^*)\]
\[T_{\text{Unmark}} = \{(x, y) \mid y \text{ is } x \text{ with all occurrences of } \sigma, \tau, \rho, \eta \text{ removed}\}\]
where: - \(M = \{\sigma, \tau, \rho, \eta\}\) is the set of marker symbols - \(T_{\sigma}, T_{\tau}, T_{\rho}, T_{\eta}\) are transducers that insert the respective markers at appropriate positions - \(T_{A \rightarrow B}\) is a transducer that replaces strings matching \(A\) with their corresponding strings in \(B\) - \(T_{\text{LTR}}\) is the left-to-right filter that ensures proper marker ordering - \(T_{\text{Contain}}\) is the containment filter that ensures proper marker matching
This function encapsulates all the complexity of the algorithm, providing a simple interface for phonological rule implementation.
Example: German Final Devoicing
For German final devoicing, we can express the rule as:
\[\text{CDRewrite}(\{b \rightarrow p, d \rightarrow t, g \rightarrow k, v \rightarrow f, z \rightarrow s, \text{ʒ} \rightarrow \text{ʃ}\}, \epsilon, \#, \Sigma^*)\]
This creates an FST that implements the rule:
\[[+\text{obstruent}, +\text{voice}] \rightarrow [-\text{voice}] / \_ \#\]
Example: Turkish Vowel Harmony
For Turkish vowel harmony, we need to track the backness of the last vowel in the stem:
\[\text{CDRewrite}(E \rightarrow e, \{e, i, ö, ü\}C^*, \epsilon, \Sigma^*) \cup \text{CDRewrite}(E \rightarrow a, \{a, ı, o, u\}C^*, \epsilon, \Sigma^*)\]
This creates an FST that implements the rule:
\[E \rightarrow [αback] / [αback]C^*\_\]
Where \(E\) is an abstract vowel that takes on the backness of the preceding vowel.
Composition of Rules
One of the powerful aspects of the FST approach is that multiple rules can be composed into a single FST. If we have two rules represented as FSTs \(T_1\) and \(T_2\), we can create a new FST \(T = T_1 \circ T_2\) that applies both rules in sequence.
For example, if we have a rule for German final devoicing and another rule for German schwa epenthesis, we can compose them to create a single FST that applies both rules:
\[T = T_{\text{final-devoicing}} \circ T_{\text{schwa-epenthesis}}\]
This is particularly useful for modeling rule ordering in phonology, where the order in which rules apply can affect the final output.