NSST constraints

Definition 1. (Streaming String Transducer)

An NSST is a tuple $$T = (Q, \Sigma, \Gamma, X, E, q_0, F)$$, where $$\Sigma \operatorname{and} \Gamma$$ are input and output alphabet respectively, $$X$$ is a finite set of variables, and $$Q$$ a finite set of states. $$E \subseteq Q \times \Sigma \times Q \times S$$ is the transition function where S is the set of substitution s:$$X \rightarrow (X \cup \Gamma)^{\ast}$$, and F is a partial function from Q to $$(X \cup \Gamma)^{\ast}$$.

Like FA, a successful run of T on $$w = \sigma_1 \ldots \sigma_n$$ is the sequence of states $$q_0 \xrightarrow{\sigma_1 / s_1} q_1 \xrightarrow{\sigma_2 / s_2} \ldots \xrightarrow{\sigma_n / s_n} q_n$$ where $$(q_i, \sigma_{i + 1}, q_{i + 1}, s_{i + 1}) \in E$$ and $$F (q_n)$$ is defined. The output of a run of T is defined as $$\operatorname{Out} (r) = s_{\varepsilon} \circ s_1 \circ s_2 \ldots s_n \circ F (q_n)$$ where $$s_{\varepsilon}$$ is the empty substitution which maps all variables to $$\varepsilon$$.

For word w, the output of w by T is the set $$T (w) = \{ \operatorname{Out} (r) |r\operatorname{is}a\operatorname{successful}\operatorname{run}\operatorname{of}T\operatorname{on}w \}$$.

Theorem 1. (PreImage) Given NSST $$T = (Q_T, \Sigma, \Sigma', \Gamma, E, q_{0, T}, F_T)$$ and NFA $$M = (Q_M, \Sigma', \delta_M, q_{0, M}, F_M)$$, there is an NFA $$N = (Q_N, \Sigma, \delta_N, q_{0, N}, F_N)$$, such that for all word w on $$\Sigma$$, $$w \in L (N)$$ if and only if there is a word $$w' \in T (w)$$ such that $$w' \in L (M)$$. N is computable from T and M.

We omit the correctness proof for now, but give a construction of N:

Let $$Q_N = Q_T \times P (Q_M)^{Q_M \times \Gamma}$$, thus the state of N is denoted by a pair $$(q, \rho)$$ where q is a state of T and $$\rho \in Q_M \times \Gamma \rightarrow P (Q_M)$$ is the simulation function. The initial state of N is $$(q_{0, T}, \rho_{\varepsilon})$$ where $$\rho_{\varepsilon} (q, x) = \{ q \}$$ for all state q and variable x.

For a state $$q \in Q_M$$, a simulation function $$\rho$$, and a string $$w \in (\Gamma \cup \Sigma')^{\ast}$$, we recursively define the evaluation of w on q and $$\rho$$, denoted by $$U (q, \rho, w)$$ as

$$\displaystyle U (q, \rho, w) = \left\{ \begin{array}{ll} \delta_M^{\ast} (U (q, \rho, w'), a) & w = w' a \wedge a \in \Sigma'\\ \bigcup_{q' \in U (q, \rho, w')} \rho (q', a) & w = w' a \wedge a \in \Gamma\\ \{ q \} & w = \varepsilon \end{array} \right.$$

For all substitution $$s \in \Gamma \rightarrow (\Gamma \cup \Sigma')^{\ast}$$, the simulation function $$\rho$$ updated by s, denoted by $$R (\rho, s)$$, is another simulation function defined as follows:

$$\displaystyle R (\rho, s) (q, x) = U (q, \rho, s (x))$$

Now, consider the transition function of N. For all state $$(q, \rho) \in Q_N$$, if $$(q, a, q', s) \in E$$ for some character a and substitution s, then:

$$\displaystyle ((q, \rho), a, (q', R (\rho, s))) \in \delta_N$$

At last, the terminal state set $$F_N$$ contains all the states $$(q, \rho)$$ which satisfy the following properties:

1. $$F_T (q)$$ is defined

2. $$U (q_{0, M}, \rho, F_T (q)) \cap F_M \neq \emptyset$$

Note 1. (Intuition)

The construction above is actually very simple. The automaton N simulates the run of NSST T, together with the summary of M when a variable of T in inputed. The construction should be exponential.

The construction does not use the so-called copyless property[1, 2], thus the expressive power of NSST in this context is greater than 2FT.

Definition 2. (Two way NSST?) The definition of 2NSST is just like 2FT, allowing bidirectional move of NSST's head.

Note 2.

One thing is worth noting about this idea of 2NSST: the pre-image of 2NSST is still computable. We just use a 2FA N to simulate the 2NSST, in the same manner of the construction above. We can then transform the 2FA into a one-way FA in exponential time.

However, the expressive power of 2NSST is unknown (yet). I highly suspect that 2NSST is equivalent to NSST, since the bidirectional move of head provides the same function as variables, to some extent. Anyway, the expressive power of NSST should be sufficient now.

All the function expressible by 2FT is also expressible by NSST, like split. Below is an example of what more NSST can express.

Definition 3. (Regular Expression with capturing group, regex)

Suppose $$\Gamma$$ is some set of variables, x is an element of $$\Gamma$$. a is a character in $$\Sigma$$

$$\displaystyle \alpha \,:\, = \varepsilon |a| \alpha + \alpha | \alpha \circ \alpha | \alpha^{\ast} | (\alpha) \%x$$

WLOG, we assume a variable occurs at most once in a regex.

the semantics of regex is defined as tuple $$(w, w_x, w_y, \ldots)$$ where w is the whole string matched, and $$w_x$$ is the string matched by the capturing group marked by x, etc. A more formal definition involves match trees and recursive definition. See[3, 4]

Definition 4. (ReplaceAll with capturing group and back reference)

Consider the constriant $$y =\operatorname{replaceAll} (x, e, r)$$, where e is a regex and $$r \in (\Sigma \cup \Gamma)^{\ast}$$.

The semantics of replaceAll in this form is like the original one, with leftmost and longest match of e in x replaced, except that when replacing, every variable in r is substituted by the string captured by the corresponding capturing group in e.

A more formal definition should be proposed in the future.

Note 3.

The capturing process might not be unique, though. For example, $$(a^+) \%x (a^+) \%y$$ matches any string $$a^n$$ with n greater than 1, but the match over string 'aaa' has two possibilities ('aa' for x, or 'a' for x).

Thus, this form of replaceAll becomes nondeterministic.

Remark 1. Given e and r, we could use NSST to model this function.

Again, we give a construction of NSST, but omit the proof:

Suppose A=$$(Q, F, \delta, \Sigma, q_0)$$ is the NFA constructed by Glushkov method on e. Here, $$Q = \{ q_0, q_1, \ldots, q_n \}$$. Each state other than $$q_0$$ corresponds to an occurrence of character in e.

For every subexpression v of e, by the definition of Glushkov construction, there must be a subgraph of A which corresponds to v. Use A(v) to denote the subgraph. For a state q of A, define $$\Gamma (q) = \{ x \in \Gamma |v\operatorname{is}a\operatorname{capturing}\operatorname{group}\operatorname{of}x\operatorname{in}e \wedge q \in A (v) \}$$, in other words, $$\Gamma (q)$$ is the set of variables which should be updated when entering q.

Furthermore, for a subexpression v, we use Start(v) and End(v) to denote the starting and ending states of A(v) (see the definition of Glushkov construction).

Let NSST $$T = (Q', \Sigma, \Sigma, \{ \operatorname{acc} \} \cup \Gamma, E, q_0', F')$$, we construct T using a similar method as parsing automata. The initial state is $$(\{ q_0 \}, \operatorname{left}, \emptyset)$$.

• The transition within left mode is just the same as the original parsing automata. The substitution is s(acc)=acc.a and s(x)=x for every x other than acc, where a is the input character in this step.

• Suppose $$(\rho \{ q_0 \}, \operatorname{left}, S) \in Q'$$, and $$a \in \Sigma$$. If $$\delta (S, a) \cap F = \emptyset$$, $$\delta^{\ast} (\rho, a) \cap F = \emptyset$$ and $$\delta (q_0, a) \nsubseteq \delta (S, a) \cup \delta^{\ast} (\rho, a)$$, then for every $$q' \in \delta (q_0, a)$$, we have

$$\displaystyle ((\rho \{ q_0 \}, \operatorname{left}, S), a, (\{ q' \}, \operatorname{long}, \delta (S, a) \cup \delta^{\ast} (\rho, a)), s) \in E$$

where s is the substitution defined as: s(x) = x.a for every x in $$\Gamma (q')$$, and s(y)=y for every other variable y.

• Suppose $$(\{ q \}, \operatorname{long}, S) \in Q'$$, and $$a \in \Sigma$$. If $$\delta (S, a) \cap F = \emptyset$$, and $$\delta (q, a) \nsubseteq \delta (S, a)$$, then for every $$q' \in \delta (q_0, a)$$, we have

$$\displaystyle ((\{ q \}, \operatorname{long}, S), a, (\{ q' \}, \operatorname{long}, \delta (S, a)), s) \in E$$

where s is defined as follows:

Suppose $$V = \{ v|v\operatorname{is}a\operatorname{Kleene}\operatorname{star}\operatorname{in}e \wedge q \in \operatorname{End} (v) \wedge q' \in \operatorname{Start} (v) \}$$. Let P be the set of all variables captured in V. For all variable x in the set $$\Gamma (q') \backslash P$$, s(x)=acc.a. For all variable y in the set $$P\backslash \Gamma (q')$$, $$s (x) = \varepsilon$$. For all variable z in the set $$P \cap \Gamma (q')$$, s(z)=a. All the other variables do not change.

• Suppose $$(\{ q \}, \operatorname{long}, S) \in Q'$$, and $$a \in \Sigma$$. If $$\delta (S, a) \cap F = \emptyset$$ and $$\delta (q, a) \cap F \neq \emptyset$$, then for every state $$q' \in \delta (q, a) \cap F$$, we have

$$\displaystyle ((\{ q \}, \operatorname{long}, S), a, (\{ q_0 \}, \operatorname{left}, \delta (S, a) \cup q'), s) \in E$$

where s is defined as: $$s (\operatorname{acc}) =\operatorname{acc}.a_1 .a_2 \ldots a_n$$, where $$r = a_1 \ldots a_n$$. $$s (x) = \varepsilon$$ for every other variable x.

The output function $$F'$$ is simple: $$F' (q) =\operatorname{acc} \quad \operatorname{iff} \quad q = (-, \operatorname{left}, -)$$.

## Bibliography

 Rajeev Alur and Pavol Černý. Expressiveness of streaming string transducers. Page 12.

 Rajeev Alur and Jyotirmoy V. Deshmukh. Nondeterministic Streaming String Transducers. In Luca Aceto, Monika Henzinger, and Jiří Sgall, editors, Automata, Languages and Programming, volume 6756 of Lecture Notes in Computer Science, pages 1–20. Springer Berlin Heidelberg.

 Cezar Câmpeanu, Kai Salomaa, and Sheng Yu. A FORMAL STUDY OF PRACTICAL REGULAR EXPRESSIONS. 14(06):1007–1018.

 Benjamin Carle and Paliath Narendran. On Extended Regular Expressions. In Adrian Horia Dediu, Armand Mihai Ionescu, and Carlos Martín-Vide, editors, Language and Automata Theory and Applications, volume 5457 of Lecture Notes in Computer Science, pages 279–289. Springer Berlin Heidelberg.

 Taolue Chen, Yan Chen, Matthew Hague, Anthony W. Lin, and Zhilin Wu. What is decidable about string constraints with the ReplaceAll function. 2:3–1.

 Emmanuel Filiot and Pierre-Alain Reynier. Copyful Streaming String Transducers. In Matthew Hague and Igor Potapov, editors, Reachability Problems, volume 10506 of Lecture Notes in Computer Science, pages 75–86. Springer International Publishing.