Phonological Rule Formalism

One influential formalism for stating phonological rules comes from Chomsky and Halle’s “The Sound Pattern of English” (SPE) (Chomsky1968?). In this formalism, rules have the general form:

\[A \rightarrow B / C\_D\]

Where: - $A$ is the target (what changes) - $B$ is the structural change (what $A$ changes to) - $C$ is the left context - $D$ is the right context

This can be read as “$A$ becomes $B$ when it occurs between $C$ and $D$.”

The rule applies whenever the sequence $CAD$ is found in the input string, transforming it to $CBD$ in the output.

Converting Rules to FSTs

To convert a phonological rule $A \rightarrow B / C\_D$ into an FST, we need to:

Create an FST that recognizes the pattern $CAD$ in the input
Ensure that when this pattern is recognized, $A$ is replaced by $B$ in the output
Ensure that all other segments are copied unchanged from input to output

The algorithm developed by (Kaplan1994?) and refined by (Mohri1997?) provides a systematic way to do this.

The Replace Operator

The core of the Kaplan and Kay algorithm is the “replace” operator, which can be defined as:

\[\text{replace}(A, B, C, D) = \{(x, y) \mid y \text{ is the result of replacing all instances of } A \text{ with } B \text{ in } x \text{ when } A \text{ occurs between } C \text{ and } D\}\]

This operator defines a regular relation that can be implemented as an FST.

The Formal Algorithm

The formal algorithm for converting a rule $A \rightarrow B / C\_D$ into an FST involves several sophisticated steps. Here we follow the improved algorithm by (Mohri1997?), which fixed issues in the original (Kaplan1994?) construction.

Preliminaries

First, we define some basic concepts:

Let $\Sigma$ be the input/output alphabet
Let $A, B, C, D$ be regular expressions over $\Sigma$
Define the “context language” $L = C \circ A \circ D$, which represents all strings that match the rule context
Define auxiliary FSAs for these languages:
- $M_A$ recognizes language $A$
- $M_B$ recognizes language $B$
- $M_C$ recognizes language $C$
- $M_D$ recognizes language $D$
- $M_L$ recognizes language $L$

Marker Symbols

To handle rule application correctly, we introduce special marker symbols: - $\#_1, \#_2, \#_3, \#_4$ (not in $\Sigma$) to mark boundaries - $\sigma = \#_1, \tau = \#_2, \rho = \#_3, \eta = \#_4$

Construction Components

The algorithm constructs several intermediate transducers:

1. Boundary Marking Transducer ($T_{\text{Mark}}$)

This transducer identifies potential rule application sites and adds boundary markers:

\[T_{\text{Mark}} = \{(x, x') \mid x' \text{ is } x \text{ with } \sigma, \tau, \rho, \eta \text{ marking the boundaries of all occurrences of } L \text{ in } x\}\]

Where occurrences of $L$ are marked as: - $\sigma$ at the start of $C$ - $\tau$ at the boundary between $C$ and $A$ - $\rho$ at the boundary between $A$ and $D$ - $\eta$ at the end of $D$

Formally, $T_{\text{Mark}}$ is constructed as:

\[T_{\text{Mark}} = \text{Id}(\Sigma^*) \circ T_{\sigma} \circ \text{Id}(\Sigma^*) \circ T_{\tau} \circ \text{Id}(\Sigma^*) \circ T_{\rho} \circ \text{Id}(\Sigma^*) \circ T_{\eta} \circ \text{Id}(\Sigma^*)\]

Where: - $\text{Id}(L)$ is the identity transducer for language $L$ - $T_{\sigma}, T_{\tau}, T_{\rho}, T_{\eta}$ insert the respective boundary markers at appropriate positions

2. Replacement Transducer ($T_{\text{Replace}}$)

This transducer performs the actual replacement between $\tau$ and $\rho$ markers:

\[T_{\text{Replace}} = \text{Id}((\Sigma \cup \{\sigma, \tau, \rho, \eta\})^* \circ \{\sigma\}) \circ \text{Id}((\Sigma \cup \{\sigma, \tau, \rho, \eta\})^* \circ \{\tau\}) \circ T_{A \rightarrow B} \circ \text{Id}(\{\rho\} \circ (\Sigma \cup \{\sigma, \tau, \rho, \eta\})^* \circ \{\eta\}) \circ \text{Id}((\Sigma \cup \{\sigma, \tau, \rho, \eta\})^*)\]

Where $T_{A \rightarrow B}$ maps strings in $A$ to their corresponding strings in $B$.

3. Marker Removal Transducer ($T_{\text{Unmark}}$)

This transducer removes all marker symbols:

\[T_{\text{Unmark}} = \{(x, y) \mid y \text{ is } x \text{ with all occurrences of } \sigma, \tau, \rho, \eta \text{ removed}\}\]

The Complete Algorithm

The complete rule transducer is constructed by composing these components:

\[T_{\text{Rule}} = T_{\text{Mark}} \circ T_{\text{Replace}} \circ T_{\text{Unmark}}\]

Mohri’s Improvements

(Mohri1997?) introduced three key improvements to the original Kaplan and Kay algorithm:

Factorization: The replace operation is factored into simpler components, making it more efficient
Filter Transducers: Special filter transducers are used to ensure correct applications in the presence of overlapping potential matches
Determinization: The algorithm ensures that the resulting transducer can be determinized, which is crucial for efficiency

Filter Transducers

The filter transducers include:

Left-to-right filter ($T_{\text{LTR}}$): \[T_{\text{LTR}} = \{(x, x) \mid \text{no marker in } x \text{ is followed by a marker of lower index}\}\]
Containment filter ($T_{\text{Contain}}$): \[T_{\text{Contain}} = \{(x, x) \mid \text{every } \sigma \text{ in } x \text{ is matched with } \eta, \text{ and same for } \tau \text{ and } \rho\}\]

With these filters, the improved marker transducer is:

\[T_{\text{Mark}}' = T_{\text{Mark}} \circ T_{\text{LTR}} \circ T_{\text{Contain}}\]

Epsilon Handling

A critical aspect of the algorithm is the handling of epsilon transitions. Mohri’s approach uses special epsilon filters to ensure correct behavior when rules involve empty strings. These filters prevent the improper application of rules and ensure termination of the algorithm.

The CDRewrite Function

The complete algorithm is often implemented through a function called CDRewrite, which is formally defined as:

\[\text{CDRewrite}(A, B, C, D, \Sigma) = T\]

where: - $A$ is a regular expression representing the target pattern - $B$ is a regular expression representing the replacement pattern - $C$ is a regular expression representing the left context - $D$ is a regular expression representing the right context - $\Sigma$ is the alphabet of the language - $T$ is the resulting finite state transducer

The CDRewrite function constructs $T$ as follows:

\[\text{CDRewrite}(A, B, C, D, \Sigma) = T_{\text{Mark}} \circ T_{\text{Replace}} \circ T_{\text{Unmark}}\]

where:

\[T_{\text{Replace}} = \text{Id}((\Sigma \cup M)^* \sigma (\Sigma \cup M)^* \tau) \circ T_{A \rightarrow B} \circ \text{Id}(\rho (\Sigma \cup M)^* \eta (\Sigma \cup M)^*)\]

\[T_{\text{Unmark}} = \{(x, y) \mid y \text{ is } x \text{ with all occurrences of } \sigma, \tau, \rho, \eta \text{ removed}\}\]

where: - $M = \{\sigma, \tau, \rho, \eta\}$ is the set of marker symbols - $T_{\sigma}, T_{\tau}, T_{\rho}, T_{\eta}$ are transducers that insert the respective markers at appropriate positions - $T_{A \rightarrow B}$ is a transducer that replaces strings matching $A$ with their corresponding strings in $B$ - $T_{\text{LTR}}$ is the left-to-right filter that ensures proper marker ordering - $T_{\text{Contain}}$ is the containment filter that ensures proper marker matching

This function encapsulates all the complexity of the algorithm, providing a simple interface for phonological rule implementation.

Example: German Final Devoicing

For German final devoicing, we can express the rule as:

\[\text{CDRewrite}(\{b \rightarrow p, d \rightarrow t, g \rightarrow k, v \rightarrow f, z \rightarrow s, \text{ʒ} \rightarrow \text{ʃ}\}, \epsilon, \#, \Sigma^*)\]

This creates an FST that implements the rule:

\[[+\text{obstruent}, +\text{voice}] \rightarrow [-\text{voice}] / \_ \#\]

Example: Turkish Vowel Harmony

For Turkish vowel harmony, we need to track the backness of the last vowel in the stem:

\[\text{CDRewrite}(E \rightarrow e, \{e, i, ö, ü\}C^*, \epsilon, \Sigma^*) \cup \text{CDRewrite}(E \rightarrow a, \{a, ı, o, u\}C^*, \epsilon, \Sigma^*)\]

This creates an FST that implements the rule:

\[E \rightarrow [αback] / [αback]C^*\_\]

Where $E$ is an abstract vowel that takes on the backness of the preceding vowel.

Composition of Rules

One of the powerful aspects of the FST approach is that multiple rules can be composed into a single FST. If we have two rules represented as FSTs $T_1$ and $T_2$, we can create a new FST $T = T_1 \circ T_2$ that applies both rules in sequence.

For example, if we have a rule for German final devoicing and another rule for German schwa epenthesis, we can compose them to create a single FST that applies both rules:

\[T = T_{\text{final-devoicing}} \circ T_{\text{schwa-epenthesis}}\]

This is particularly useful for modeling rule ordering in phonology, where the order in which rules apply can affect the final output.

--- title: Phonological Rule Formalism bibliography: ../../references.bib --- One influential formalism for stating phonological rules comes from Chomsky and Halle's "The Sound Pattern of English" (SPE) [@Chomsky1968]. In this formalism, rules have the general form: $$A \rightarrow B / C\_D$$ Where: - $A$ is the target (what changes) - $B$ is the structural change (what $A$ changes to) - $C$ is the left context - $D$ is the right context This can be read as "$A$ becomes $B$ when it occurs between $C$ and $D$." The rule applies whenever the sequence $CAD$ is found in the input string, transforming it to $CBD$ in the output. ## Converting Rules to FSTs To convert a phonological rule $A \rightarrow B / C\_D$ into an FST, we need to: 1. Create an FST that recognizes the pattern $CAD$ in the input 2. Ensure that when this pattern is recognized, $A$ is replaced by $B$ in the output 3. Ensure that all other segments are copied unchanged from input to output The algorithm developed by @Kaplan1994 and refined by @Mohri1997 provides a systematic way to do this. ### The Replace Operator The core of the Kaplan and Kay algorithm is the "replace" operator, which can be defined as: $$\text{replace}(A, B, C, D) = \{(x, y) \mid y \text{ is the result of replacing all instances of } A \text{ with } B \text{ in } x \text{ when } A \text{ occurs between } C \text{ and } D\}$$ This operator defines a regular relation that can be implemented as an FST. ## The Formal Algorithm The formal algorithm for converting a rule $A \rightarrow B / C\_D$ into an FST involves several sophisticated steps. Here we follow the improved algorithm by @Mohri1997, which fixed issues in the original @Kaplan1994 construction. ### Preliminaries First, we define some basic concepts: 1. Let $\Sigma$ be the input/output alphabet 2. Let $A, B, C, D$ be regular expressions over $\Sigma$ 3. Define the "context language" $L = C \circ A \circ D$, which represents all strings that match the rule context 4. Define auxiliary FSAs for these languages: - $M_A$ recognizes language $A$ - $M_B$ recognizes language $B$ - $M_C$ recognizes language $C$ - $M_D$ recognizes language $D$ - $M_L$ recognizes language $L$ ### Marker Symbols To handle rule application correctly, we introduce special marker symbols: - $\#_1, \#_2, \#_3, \#_4$ (not in $\Sigma$) to mark boundaries - $\sigma = \#_1, \tau = \#_2, \rho = \#_3, \eta = \#_4$ ### Construction Components The algorithm constructs several intermediate transducers: #### 1. Boundary Marking Transducer ($T_{\text{Mark}}$) This transducer identifies potential rule application sites and adds boundary markers: $$T_{\text{Mark}} = \{(x, x') \mid x' \text{ is } x \text{ with } \sigma, \tau, \rho, \eta \text{ marking the boundaries of all occurrences of } L \text{ in } x\}$$ Where occurrences of $L$ are marked as: - $\sigma$ at the start of $C$ - $\tau$ at the boundary between $C$ and $A$ - $\rho$ at the boundary between $A$ and $D$ - $\eta$ at the end of $D$ Formally, $T_{\text{Mark}}$ is constructed as: $$T_{\text{Mark}} = \text{Id}(\Sigma^*) \circ T_{\sigma} \circ \text{Id}(\Sigma^*) \circ T_{\tau} \circ \text{Id}(\Sigma^*) \circ T_{\rho} \circ \text{Id}(\Sigma^*) \circ T_{\eta} \circ \text{Id}(\Sigma^*)$$ Where: - $\text{Id}(L)$ is the identity transducer for language $L$ - $T_{\sigma}, T_{\tau}, T_{\rho}, T_{\eta}$ insert the respective boundary markers at appropriate positions #### 2. Replacement Transducer ($T_{\text{Replace}}$) This transducer performs the actual replacement between $\tau$ and $\rho$ markers: $$T_{\text{Replace}} = \text{Id}((\Sigma \cup \{\sigma, \tau, \rho, \eta\})^* \circ \{\sigma\}) \circ \text{Id}((\Sigma \cup \{\sigma, \tau, \rho, \eta\})^* \circ \{\tau\}) \circ T_{A \rightarrow B} \circ \text{Id}(\{\rho\} \circ (\Sigma \cup \{\sigma, \tau, \rho, \eta\})^* \circ \{\eta\}) \circ \text{Id}((\Sigma \cup \{\sigma, \tau, \rho, \eta\})^*)$$ Where $T_{A \rightarrow B}$ maps strings in $A$ to their corresponding strings in $B$. #### 3. Marker Removal Transducer ($T_{\text{Unmark}}$) This transducer removes all marker symbols: $$T_{\text{Unmark}} = \{(x, y) \mid y \text{ is } x \text{ with all occurrences of } \sigma, \tau, \rho, \eta \text{ removed}\}$$ ### The Complete Algorithm The complete rule transducer is constructed by composing these components: $$T_{\text{Rule}} = T_{\text{Mark}} \circ T_{\text{Replace}} \circ T_{\text{Unmark}}$$ ### Mohri's Improvements @Mohri1997 introduced three key improvements to the original Kaplan and Kay algorithm: 1. **Factorization**: The replace operation is factored into simpler components, making it more efficient 2. **Filter Transducers**: Special filter transducers are used to ensure correct applications in the presence of overlapping potential matches 3. **Determinization**: The algorithm ensures that the resulting transducer can be determinized, which is crucial for efficiency #### Filter Transducers The filter transducers include: 1. **Left-to-right filter** ($T_{\text{LTR}}$): $$T_{\text{LTR}} = \{(x, x) \mid \text{no marker in } x \text{ is followed by a marker of lower index}\}$$ 2. **Containment filter** ($T_{\text{Contain}}$): $$T_{\text{Contain}} = \{(x, x) \mid \text{every } \sigma \text{ in } x \text{ is matched with } \eta, \text{ and same for } \tau \text{ and } \rho\}$$ With these filters, the improved marker transducer is: $$T_{\text{Mark}}' = T_{\text{Mark}} \circ T_{\text{LTR}} \circ T_{\text{Contain}}$$ ### Epsilon Handling A critical aspect of the algorithm is the handling of epsilon transitions. Mohri's approach uses special epsilon filters to ensure correct behavior when rules involve empty strings. These filters prevent the improper application of rules and ensure termination of the algorithm. ### The CDRewrite Function The complete algorithm is often implemented through a function called CDRewrite, which is formally defined as: $$\text{CDRewrite}(A, B, C, D, \Sigma) = T$$ where: - $A$ is a regular expression representing the target pattern - $B$ is a regular expression representing the replacement pattern - $C$ is a regular expression representing the left context - $D$ is a regular expression representing the right context - $\Sigma$ is the alphabet of the language - $T$ is the resulting finite state transducer The CDRewrite function constructs $T$ as follows: $$\text{CDRewrite}(A, B, C, D, \Sigma) = T_{\text{Mark}} \circ T_{\text{Replace}} \circ T_{\text{Unmark}}$$ where: $$T_{\text{Mark}} = \text{Id}(\Sigma^*) \circ T_{\sigma} \circ \text{Id}(\Sigma^*) \circ T_{\tau} \circ \text{Id}(\Sigma^*) \circ T_{\rho} \circ \text{Id}(\Sigma^*) \circ T_{\eta} \circ \text{Id}(\Sigma^*) \circ T_{\text{LTR}} \circ T_{\text{Contain}}$$ $$T_{\text{Replace}} = \text{Id}((\Sigma \cup M)^* \sigma (\Sigma \cup M)^* \tau) \circ T_{A \rightarrow B} \circ \text{Id}(\rho (\Sigma \cup M)^* \eta (\Sigma \cup M)^*)$$ $$T_{\text{Unmark}} = \{(x, y) \mid y \text{ is } x \text{ with all occurrences of } \sigma, \tau, \rho, \eta \text{ removed}\}$$ where: - $M = \{\sigma, \tau, \rho, \eta\}$ is the set of marker symbols - $T_{\sigma}, T_{\tau}, T_{\rho}, T_{\eta}$ are transducers that insert the respective markers at appropriate positions - $T_{A \rightarrow B}$ is a transducer that replaces strings matching $A$ with their corresponding strings in $B$ - $T_{\text{LTR}}$ is the left-to-right filter that ensures proper marker ordering - $T_{\text{Contain}}$ is the containment filter that ensures proper marker matching This function encapsulates all the complexity of the algorithm, providing a simple interface for phonological rule implementation. ## Example: German Final Devoicing For German final devoicing, we can express the rule as: $$\text{CDRewrite}(\{b \rightarrow p, d \rightarrow t, g \rightarrow k, v \rightarrow f, z \rightarrow s, \text{ʒ} \rightarrow \text{ʃ}\}, \epsilon, \#, \Sigma^*)$$ This creates an FST that implements the rule: $$[+\text{obstruent}, +\text{voice}] \rightarrow [-\text{voice}] / \_ \#$$ ## Example: Turkish Vowel Harmony For Turkish vowel harmony, we need to track the backness of the last vowel in the stem: $$\text{CDRewrite}(E \rightarrow e, \{e, i, ö, ü\}C^*, \epsilon, \Sigma^*) \cup \text{CDRewrite}(E \rightarrow a, \{a, ı, o, u\}C^*, \epsilon, \Sigma^*)$$ This creates an FST that implements the rule: $$E \rightarrow [αback] / [αback]C^*\_$$ Where $E$ is an abstract vowel that takes on the backness of the preceding vowel. ## Composition of Rules One of the powerful aspects of the FST approach is that multiple rules can be composed into a single FST. If we have two rules represented as FSTs $T_1$ and $T_2$, we can create a new FST $T = T_1 \circ T_2$ that applies both rules in sequence. For example, if we have a rule for German final devoicing and another rule for German schwa epenthesis, we can compose them to create a single FST that applies both rules: $$T = T_{\text{final-devoicing}} \circ T_{\text{schwa-epenthesis}}$$ This is particularly useful for modeling rule ordering in phonology, where the order in which rules apply can affect the final output.