Parsing

Parsing with Finite State Automata

While recognition simply determines whether a string belongs to a language, parsing goes further by assigning structure to the string. For regular languages, parsing typically involves identifying the specific path through the automaton that accepts the string.

The Parsing Problem

The parsing problem for FSAs can be defined as:

Input: A finite state automaton $G = \langle Q, \Sigma, \delta, q_0, F \rangle$ and a string $\boldsymbol\sigma \in \Sigma^*$
Output: If $\boldsymbol\sigma \in \mathbb{L}(G)$, a path through $G$ that accepts $\boldsymbol\sigma$; otherwise, an indication that no such path exists

Note that we’ll focus on nondeterministic finite automata (NFAs) throughout this discussion. This is sufficient because any deterministic finite automaton (DFA) can be represented as an NFA with exactly the same states and transitions, but without any nondeterministic choices or epsilon transitions.

Parsing Algorithm

To extend our recognition algorithm to a parsing algorithm, we need to keep track of the paths taken through the automaton. Similar to recognition, we’ll define this recursively.

First, let’s define a path as a sequence of symbol-state pairs, where each pair represents a transition in the automaton. For a path $p$, we denote the last state in the path as $\text{last-state}(p)$.

We need a helper function to handle epsilon transitions. Let’s define $\text{epsilon-paths}(q)$ as the set of all paths from state $q$ to any other state using only epsilon transitions:

\[ \text{epsilon-paths}(q) = \{\langle (\epsilon, q_1), (\epsilon, q_2), \ldots, (\epsilon, q_n) \rangle \mid q_1, \ldots, q_n \in Q \text{ and there is a path from } q \text{ to } q_n \text{ using only } \epsilon \text{ transitions}\} \]

Note that this set always includes the empty path $\langle \rangle$ from a state to itself via zero epsilon transitions.

We can now define a recursive parsing function $\text{parse-from}_G(\boldsymbol\sigma, P)$ where $P$ is a set of partial paths:

\[ \text{parse-from}_G(\boldsymbol\sigma, P) = \begin{cases} \{p \in P \mid \text{last-state}(p) \in F\} & \text{if } \boldsymbol\sigma = \epsilon \\ \text{parse-from}_G(\sigma_2\ldots\sigma_n, P') & \text{if } \boldsymbol\sigma = \sigma_1\sigma_2\ldots\sigma_n \\ \emptyset & \text{otherwise} \end{cases} \]

where: \[P' = \bigcup_{p \in P} \bigcup_{q' \in \delta(\text{last-state}(p), \sigma_1)} \bigcup_{e \in \text{epsilon-paths}(q')} \{p \circ \langle (\sigma_1, q') \rangle \circ e\}\]

Here, $\circ$ represents path concatenation. The set $P'$ contains all possible extensions of paths in $P$ after reading symbol $\sigma_1$, including any subsequent epsilon transitions.

The full parsing function is then:

\[ \text{parse}_G(\boldsymbol\sigma) = \text{parse-from}_G(\boldsymbol\sigma, \bigcup_{e \in \text{epsilon-paths}(q_0)} \{\langle \rangle \circ e\}) \]

This algorithm not only determines whether the string is accepted but also provides all possible paths through the automaton that lead to acceptance. If the result is an empty set, then the string is not in the language of the automaton.

For deterministic finite automata (DFAs), the algorithm simplifies because there are no epsilon transitions and each state has at most one transition on each symbol. In this case, there is at most one valid parse for any accepted string.

Ambiguity in Parsing

An important consideration in parsing with FSAs is ambiguity. A string may have multiple valid parses if the automaton is nondeterministic. In such cases, the parsing algorithm returns the set of all valid parses.

For example, consider an NFA that recognizes the language $(a|b)^*a(a|b)^*$. The string “aba” has multiple valid parses, as the middle “a” could be the one required by the pattern, or one of the outer “a”s could be. The parsing algorithm would return all these possibilities as distinct paths through the automaton.

Handling ambiguity is particularly important in linguistic applications, where different parses may correspond to different linguistic analyses.

Applications in Phonology

In phonological analysis, parsing algorithms for FSAs have several important applications:

1. Morphological Decomposition

Parsing can identify the morphological structure of words by tracing the path through an FSA that represents morphological rules.

For example, parsing the word “unhappiness” through an appropriate FSA might yield a path that identifies the prefix “un-”, the root “happy”, and the suffix “-ness”, along with the rules that combine them.

2. Phonological Rule Application

Parsing can reveal how phonological rules apply to derive surface forms from underlying representations.

For instance, parsing the word “cats” (/kæts/) might show how the plural morpheme /-z/ undergoes devoicing after the voiceless consonant /t/ to become /-s/.

3. Syllabification

Parsing can determine the syllable structure of words by identifying the specific path through a syllable structure FSA.

For example, parsing “strength” through the English syllable FSA would identify it as a single syllable with a complex onset /str/ and a complex coda /ŋθ/.

Extending Parsing for Richer Analyses

While basic FSA parsing provides valuable information about the structure of strings, linguistic applications often require richer analyses. Several extensions to basic parsing can provide more detailed linguistic information:

Weighted Parsing

By assigning weights to transitions in the FSA, we can implement weighted parsing that finds the most probable parse according to some probability distribution or cost function. This is particularly useful for handling ambiguity in natural language processing.

Feature-Based Parsing

We can extend FSAs to carry feature information along transitions, allowing for more detailed linguistic analyses. For example, transitions might carry information about phonological features, morphological categories, or syntactic properties.

Transducer-Based Parsing

Finite state transducers (FSTs) extend FSAs by associating an output symbol with each transition. Parsing with FSTs can simultaneously recognize a string and transform it, which is useful for modeling processes like morphological analysis or phonological alternations. We’ll see this in more detail in the next section.

--- title: Parsing bibliography: ../../references.bib --- ## Parsing with Finite State Automata While recognition simply determines whether a string belongs to a language, parsing goes further by assigning structure to the string. For regular languages, parsing typically involves identifying the specific path through the automaton that accepts the string. ### The Parsing Problem The *parsing problem* for FSAs can be defined as: - **Input**: A finite state automaton $G = \langle Q, \Sigma, \delta, q_0, F \rangle$ and a string $\boldsymbol\sigma \in \Sigma^*$ - **Output**: If $\boldsymbol\sigma \in \mathbb{L}(G)$, a path through $G$ that accepts $\boldsymbol\sigma$; otherwise, an indication that no such path exists Note that we'll focus on nondeterministic finite automata (NFAs) throughout this discussion. This is sufficient because any deterministic finite automaton (DFA) can be represented as an NFA with exactly the same states and transitions, but without any nondeterministic choices or epsilon transitions. ### Parsing Algorithm To extend our recognition algorithm to a parsing algorithm, we need to keep track of the paths taken through the automaton. Similar to recognition, we'll define this recursively. First, let's define a path as a sequence of symbol-state pairs, where each pair represents a transition in the automaton. For a path $p$, we denote the last state in the path as $\text{last-state}(p)$. We need a helper function to handle epsilon transitions. Let's define $\text{epsilon-paths}(q)$ as the set of all paths from state $q$ to any other state using only epsilon transitions: $$ \text{epsilon-paths}(q) = \{\langle (\epsilon, q_1), (\epsilon, q_2), \ldots, (\epsilon, q_n) \rangle \mid q_1, \ldots, q_n \in Q \text{ and there is a path from } q \text{ to } q_n \text{ using only } \epsilon \text{ transitions}\} $$ Note that this set always includes the empty path $\langle \rangle$ from a state to itself via zero epsilon transitions. We can now define a recursive parsing function $\text{parse-from}_G(\boldsymbol\sigma, P)$ where $P$ is a set of partial paths: $$ \text{parse-from}_G(\boldsymbol\sigma, P) = \begin{cases} \{p \in P \mid \text{last-state}(p) \in F\} & \text{if } \boldsymbol\sigma = \epsilon \\ \text{parse-from}_G(\sigma_2\ldots\sigma_n, P') & \text{if } \boldsymbol\sigma = \sigma_1\sigma_2\ldots\sigma_n \\ \emptyset & \text{otherwise} \end{cases} $$ where: $$P' = \bigcup_{p \in P} \bigcup_{q' \in \delta(\text{last-state}(p), \sigma_1)} \bigcup_{e \in \text{epsilon-paths}(q')} \{p \circ \langle (\sigma_1, q') \rangle \circ e\}$$ Here, $\circ$ represents path concatenation. The set $P'$ contains all possible extensions of paths in $P$ after reading symbol $\sigma_1$, including any subsequent epsilon transitions. The full parsing function is then: $$ \text{parse}_G(\boldsymbol\sigma) = \text{parse-from}_G(\boldsymbol\sigma, \bigcup_{e \in \text{epsilon-paths}(q_0)} \{\langle \rangle \circ e\}) $$ This algorithm not only determines whether the string is accepted but also provides all possible paths through the automaton that lead to acceptance. If the result is an empty set, then the string is not in the language of the automaton. For deterministic finite automata (DFAs), the algorithm simplifies because there are no epsilon transitions and each state has at most one transition on each symbol. In this case, there is at most one valid parse for any accepted string. ### Ambiguity in Parsing An important consideration in parsing with FSAs is ambiguity. A string may have multiple valid parses if the automaton is nondeterministic. In such cases, the parsing algorithm returns the set of all valid parses. For example, consider an NFA that recognizes the language $(a|b)^*a(a|b)^*$. The string "aba" has multiple valid parses, as the middle "a" could be the one required by the pattern, or one of the outer "a"s could be. The parsing algorithm would return all these possibilities as distinct paths through the automaton. Handling ambiguity is particularly important in linguistic applications, where different parses may correspond to different linguistic analyses. ## Applications in Phonology In phonological analysis, parsing algorithms for FSAs have several important applications: ### 1. Morphological Decomposition Parsing can identify the morphological structure of words by tracing the path through an FSA that represents morphological rules. For example, parsing the word "unhappiness" through an appropriate FSA might yield a path that identifies the prefix "un-", the root "happy", and the suffix "-ness", along with the rules that combine them. ### 2. Phonological Rule Application Parsing can reveal how phonological rules apply to derive surface forms from underlying representations. For instance, parsing the word "cats" (/kæts/) might show how the plural morpheme /-z/ undergoes devoicing after the voiceless consonant /t/ to become /-s/. ### 3. Syllabification Parsing can determine the syllable structure of words by identifying the specific path through a syllable structure FSA. For example, parsing "strength" through the English syllable FSA would identify it as a single syllable with a complex onset /str/ and a complex coda /ŋθ/. ## Extending Parsing for Richer Analyses While basic FSA parsing provides valuable information about the structure of strings, linguistic applications often require richer analyses. Several extensions to basic parsing can provide more detailed linguistic information: ### Weighted Parsing By assigning weights to transitions in the FSA, we can implement weighted parsing that finds the most probable parse according to some probability distribution or cost function. This is particularly useful for handling ambiguity in natural language processing. ### Feature-Based Parsing We can extend FSAs to carry feature information along transitions, allowing for more detailed linguistic analyses. For example, transitions might carry information about phonological features, morphological categories, or syntactic properties. ### Transducer-Based Parsing Finite state transducers (FSTs) extend FSAs by associating an output symbol with each transition. Parsing with FSTs can simultaneously recognize a string and transform it, which is useful for modeling processes like morphological analysis or phonological alternations. We'll see this in more detail in the next section.