Recognition

In the previous sections, we’ve explored how finite state automata (FSAs) can be used to define and generate languages. Now, we’ll examine the other direction: given a string, how do we determine whether it belongs to the language defined by an FSA?

The Recognition Problem

We’ve already seen an example of the recognition problem: when we discussed matching with regular expressions. In (grammars-as-languages?), we formally defined matching as a function:

\[\text{match}: R(\Sigma)\;\times\;\Sigma^* \rightarrow \{\top, \bot\}\]

\[\text{match}(\rho, \sigma) = \begin{cases}\top & \text{if } \sigma \in \text{eval}(\rho) \\ \bot & \text{otherwise}\end{cases}\]

Alternatively, we viewed it as a function from strings to booleans parameterized by a regular expression $\rho$:

\[\text{match}_\rho: \Sigma^* \rightarrow \{\top, \bot\}\]

This parameterized view parallels how we’ll define our recognition algorithm for FSAs below: as a function parameterized by a particular grammar. Regular expression matching and FSA recognition are closely related because regular expressions and FSAs have equivalent expressive power, both defining exactly the class of regular languages.

Recognition Algorithm

For a given FSA $G = \langle Q, \Sigma, \delta, q_0, F \rangle$ and an input string $\boldsymbol\sigma = \sigma_1\sigma_2\ldots\sigma_n$, we can define a recognition function $\text{recognize}(G, \boldsymbol\sigma)$ that returns true if $\boldsymbol\sigma \in \mathbb{L}(G)$ and false otherwise.

First, we need a helper function to handle epsilon transitions. Let’s define $\text{epsilon-reachable}(q)$ as the set of all states reachable from state $q$ via epsilon transitions only:

\[ \text{epsilon-reachable}(q) = \{q'\in Q \mid \text{there exists a path from } q \text{ to } q' \text{ using only } \epsilon \text{ transitions}\} \]

Note that this set always includes $q$ itself, since a state is reachable from itself via zero epsilon transitions.

We can extend this to sets of states:

\[ \text{epsilon-reachable}(S) = \bigcup_{q \in S} \text{epsilon-reachable}(q) \]

Now, we can define our recognition algorithm recursively. Let $\text{recognize-from}_G(\boldsymbol\sigma, S)$ be a function that determines whether the string $\boldsymbol\sigma$ can be accepted starting from any state in the set $S$:

\[ \text{recognize-from}_G(\boldsymbol\sigma, S) = \begin{cases} \text{true} & \text{if } \boldsymbol\sigma = \epsilon \text{ and } S \cap F \neq \emptyset \\ \text{recognize-from}_G(\sigma_2\ldots\sigma_n, S') & \text{if } \boldsymbol\sigma = \sigma_1\sigma_2\ldots\sigma_n \\ \text{false} & \text{otherwise} \end{cases} \]

where $S' = \text{epsilon-reachable}\left(\bigcup_{q \in S} \delta(q, \sigma_1)\right)$, which represents all states reachable after reading $\sigma_1$ from any state in $S$, including any subsequent epsilon transitions.

The full recognition function is then:

\[ \text{recognize}_G(\boldsymbol\sigma) = \text{recognize-from}_G(\boldsymbol\sigma, \text{epsilon-reachable}(\{q_0\})) \]

This algorithm starts by finding all states reachable from the initial state via epsilon transitions, then processes each symbol in the input string, tracking the set of possible states at each step. After processing the entire string, it checks if any of the current states is an accepting state.

For deterministic finite automata (DFAs), the algorithm simplifies because there are no epsilon transitions, and $\delta(q, \sigma)$ always returns a single state rather than a set of states. In this case:

\[ \text{recognize}_G(\boldsymbol\sigma) = \begin{cases} \text{true} & \text{if } \text{process}_G(\boldsymbol\sigma, q_0) \in F \\ \text{false} & \text{otherwise} \end{cases} \]

where $\text{process}_G(\boldsymbol\sigma, q)$ is defined as:

\[ \text{process}_G(\boldsymbol\sigma, q) = \begin{cases} q & \text{if } \boldsymbol\sigma = \epsilon \\ \text{process}_G(\sigma_2\ldots\sigma_n, \delta(q, \sigma_1)) & \text{if } \boldsymbol\sigma = \sigma_1\sigma_2\ldots\sigma_n \text{ and } \delta(q, \sigma_1) \text{ is defined} \\ \text{undefined} & \text{otherwise} \end{cases} \]

This algorithm has a time complexity of $O(n)$ for DFAs and $O(n \cdot |Q|)$ for NFAs, where $n$ is the length of the input string and $|Q|$ is the number of states in the automaton.

--- title: Recognition bibliography: ../../references.bib --- In the previous sections, we've explored how finite state automata (FSAs) can be used to define and generate languages. Now, we'll examine the other direction: given a string, how do we determine whether it belongs to the language defined by an FSA? ## The Recognition Problem We've already seen an example of the recognition problem: when we discussed matching with regular expressions. In @grammars-as-languages, we formally defined matching as a function: $$\text{match}: R(\Sigma)\;\times\;\Sigma^* \rightarrow \{\top, \bot\}$$ $$\text{match}(\rho, \sigma) = \begin{cases}\top & \text{if } \sigma \in \text{eval}(\rho) \\ \bot & \text{otherwise}\end{cases}$$ Alternatively, we viewed it as a function from strings to booleans parameterized by a regular expression $\rho$: $$\text{match}_\rho: \Sigma^* \rightarrow \{\top, \bot\}$$ This parameterized view parallels how we'll define our recognition algorithm for FSAs below: as a function parameterized by a particular grammar. Regular expression matching and FSA recognition are closely related because regular expressions and FSAs have equivalent expressive power, both defining exactly the class of regular languages. ### Recognition Algorithm For a given FSA $G = \langle Q, \Sigma, \delta, q_0, F \rangle$ and an input string $\boldsymbol\sigma = \sigma_1\sigma_2\ldots\sigma_n$, we can define a recognition function $\text{recognize}(G, \boldsymbol\sigma)$ that returns true if $\boldsymbol\sigma \in \mathbb{L}(G)$ and false otherwise. First, we need a helper function to handle epsilon transitions. Let's define $\text{epsilon-reachable}(q)$ as the set of all states reachable from state $q$ via epsilon transitions only: $$ \text{epsilon-reachable}(q) = \{q'\in Q \mid \text{there exists a path from } q \text{ to } q' \text{ using only } \epsilon \text{ transitions}\} $$ Note that this set always includes $q$ itself, since a state is reachable from itself via zero epsilon transitions. We can extend this to sets of states: $$ \text{epsilon-reachable}(S) = \bigcup_{q \in S} \text{epsilon-reachable}(q) $$ Now, we can define our recognition algorithm recursively. Let $\text{recognize-from}_G(\boldsymbol\sigma, S)$ be a function that determines whether the string $\boldsymbol\sigma$ can be accepted starting from any state in the set $S$: $$ \text{recognize-from}_G(\boldsymbol\sigma, S) = \begin{cases} \text{true} & \text{if } \boldsymbol\sigma = \epsilon \text{ and } S \cap F \neq \emptyset \\ \text{recognize-from}_G(\sigma_2\ldots\sigma_n, S') & \text{if } \boldsymbol\sigma = \sigma_1\sigma_2\ldots\sigma_n \\ \text{false} & \text{otherwise} \end{cases} $$ where $S' = \text{epsilon-reachable}\left(\bigcup_{q \in S} \delta(q, \sigma_1)\right)$, which represents all states reachable after reading $\sigma_1$ from any state in $S$, including any subsequent epsilon transitions. The full recognition function is then: $$ \text{recognize}_G(\boldsymbol\sigma) = \text{recognize-from}_G(\boldsymbol\sigma, \text{epsilon-reachable}(\{q_0\})) $$ This algorithm starts by finding all states reachable from the initial state via epsilon transitions, then processes each symbol in the input string, tracking the set of possible states at each step. After processing the entire string, it checks if any of the current states is an accepting state. For deterministic finite automata (DFAs), the algorithm simplifies because there are no epsilon transitions, and $\delta(q, \sigma)$ always returns a single state rather than a set of states. In this case: $$ \text{recognize}_G(\boldsymbol\sigma) = \begin{cases} \text{true} & \text{if } \text{process}_G(\boldsymbol\sigma, q_0) \in F \\ \text{false} & \text{otherwise} \end{cases} $$ where $\text{process}_G(\boldsymbol\sigma, q)$ is defined as: $$ \text{process}_G(\boldsymbol\sigma, q) = \begin{cases} q & \text{if } \boldsymbol\sigma = \epsilon \\ \text{process}_G(\sigma_2\ldots\sigma_n, \delta(q, \sigma_1)) & \text{if } \boldsymbol\sigma = \sigma_1\sigma_2\ldots\sigma_n \text{ and } \delta(q, \sigma_1) \text{ is defined} \\ \text{undefined} & \text{otherwise} \end{cases} $$ This algorithm has a time complexity of $O(n)$ for DFAs and $O(n \cdot |Q|)$ for NFAs, where $n$ is the length of the input string and $|Q|$ is the number of states in the automaton.