Grammatical Inference as a Principal Component Analysis Problem

Rapha¨ l Bailly e RAPHAEL . BAILLY @ LIF. UNIV- MRS . FR Francois Denis ¸ FRANCOIS . DENIS @ LIF. UNIV- MRS . FR Liva Ralaivola LIVA . RALAIVOLA @ LIF. UNIV- MRS . FR Laboratoire d'Informatique Fondamentale de Marseille, Aix-Marseille Universit´ , F-13013 Marseille, FRANCE e

Abstract
One of the main problems in probabilistic grammatical inference consists in inferring a stochastic language, i.e. a probability distribution, in some class of probabilistic models, from a sample of strings independently drawn according to a fixed unknown target distribution p. Here, we consider the class of rational stochastic languages composed of stochastic languages that can be computed by multiplicity automata, which can be viewed as a generalization of probabilistic automata. Rational stochastic languages p have a useful algebraic characterization: all the mappings up : v  p(uv) lie in a finite dimen sional vector subspace Vp of the vector space R  composed of all real-valued functions defined over  . Hence, a first step in the grammatical inference process can consist in identifying the subspace Vp . In this paper, we study the possibility of using Principal Component Analysis to achieve this task. We provide an inference algorithm which computes an estimate of this space and then build a multiplicity automaton which computes an estimate of the target distribution. We prove some theoretical properties of this algorithm and we provide results from numerical simulations that confirm the relevance of our approach.

ence is to infer an estimate of p in some class of probabilistic models (Carrasco & Oncina, 1994; Thollard et al., 2000). Here, we consider the class of rational stochastic languages composed of stochastic languages that can be computed by a multiplicity automaton (Beimel et al., 2000; Denis & Esposito, 2008). A multiplicity automaton (MA) is a tuple , Q, , ,  where Q is a set of states,  : Q  R is an initialization function,  : Q  R is a termination function and  : Q ×  × Q  R is a transition function. The class of rational stochastic languages (S rat ()) strictly encompasses the set of stochastic languages that can be defined by probabilistic automata, or equivalently, Hidden Markov Models. The main interest in considering rational stochastic languages relies in the fact that they have an algebraic characterization which proves to be useful for inference purpose (Denis et al., 2006). For any string u and any stochastic language p, let us denote by up  the function defined by up(v) = p(uv). Then, a stochastic  language p is rational if and only if the set {up|u   }  spans a finite dimensional vector subspace Vp of the vector space R  composed of all real-valued functions defined over  . The dimension of Vp is called the rank of p. It coincides with the minimal number of states needed by a multiplicity automaton to compute p. Hence, a first step in the grammatical inference process can consists in identifying the subspace {up|u   }. In this paper, we study  the possibility to use Principal Component Analysis (PCA) techniques to achieve this task (see (Clark et al., 2006) for another use of PCA in the field of Grammatical Inference). Next section (Section 2) gives a high level overview of the results presented in this work. Preliminaries on rational stochastic languages are given in Section 3. The main inference algorithm is described in Section 4. Consistency properties are proved in Section 5, which also contains the method to infer the rank of the target. Some experiments are provided in Section 6.

1. Introduction
A stochastic language over the finite alphabet  is a probability distribution defined on the set of strings  , i.e. a mapping p :   R which satisfies (i) 0  p(u)  1 for any string u and (ii) u p(u) = 1. Given a set of strings independently drawn according to a fixed unknown stochastic language p, a usual goal in grammatical inferAppearing in Proceedings of the 26 th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

2. Overview of Results
The class of rational stochastic languages is included in the Hilbert space l2 ( ) composed of all real valued func-

Grammatical Inference as a PCA Problem
2 tions r such that the sum u r(u) is convergent and equipped with the corresponding dot product. Given a sample of strings S drawn according to an unknown stochastic language p and a dimension parameter k, we build the  k-dimensional subspace VS,k of R  that minimizes 2    u ||upS - VS,k upS || where pS denotes the empirical distribution induced by S and where V denotes the orthogonal projection of the subspace V . Then, we use the  linear dependencies between the projections on VS,k of the elements upS to build a multiplicity automaton AS . The   method is consistent: if pS = p, the space VS,k is equal  to the space spanned by the set {up|u   }, and the au tomaton AS computes the target p. And we prove that if k is equal to the rank of the target, the linear dependencies  computed in the inferred space VS,k converge to the correct ones with a rate of convergence equal to O(|S|-1/2 ). Lastly, we show that the dimension d of the target p can be inferred from the sample S: in order to build the subspace  VS,k , using PCA techniques, we compute the eigenvalues 1  2  . . .  m  0 of a positive semi-definite matrix N and we show that the sum i>d i of the last m - d eigen values tends to 0 as the size of the sample S increases. This property provides an algorithm to infer the dimension d.

denoted by Rrat  . A stochastic language is a formal series p which only takes non negative values and such that u p(u) = 1. A stochastic language defines a probability distribution over  . A rational stochastic language (Denis & Esposito, 2008) is a stochastic language which can be computed by a multiplicity automaton. The set of all stochastic languages (resp. rational stochastic languages) defined on  is denoted by S() (resp. S rat ()). It can be shown that all probability distributions computed by Hidden Markov Models (or equivalently, probabilistic automata) are rational stochastic languages but the converse is false. The values of a rational stochastic language decrease exponentially fast. It can be shown that for any p  S rat (), there exists 0 <  < 1 such that for any integer n  0, p(n ) = O(n ). Let A = , Q, A , A , A be an MA that computes the rational stochastic language p and suppose that A has a minimal number of states d (equal to the dimension of the space spanned by {up|u   }).  Let B = , Q, B , B , B be another MA defined on the same set of states as A. It can be shown (Denis et al., 2006) that if the parameters of B converge to the parameters of A, then, the series computed by B converges to the series computed by A for the norm || · ||1 . More precisely, for any > 0, there exists  > 0 such that if |A (q) - B (q)| < , |A (q) - B (q)| < , |A (q, x, q ) - B (q, x, q )| <  for any states q, q and any letter x then u |p(u) - rB (u)| < . As a first consequence, if the parameters of B are sufficiently close to the parameters of A, then the series computed by B is absolutely convergent and u rB (u) is arbitrarily close to 1. A second consequence is that the absolutely convergent series rB can be used to compute a stochastic language prB which approximates p (see algorithm 1). It can be shown that if u |p(u)-rB (u)| < then, u |prB (u) - rB (u)| < (3+) and therefore, 1- u |p(u) - prB (u)| = O(). We consider the Hilbert space l2 ( ) composed of the rational series r  R  such that w r(w)2 <  and where the inner product ·, · is defined by r, s = 2 1/2 . w r(w)s(w). Hence, ||r|| = w r(w) rat    Clearly, S ( )  l2 ( ) and if r  l2 ( ) then ur  l2 ( ). Given a vector subspace V of l2 ( ), we  denote by V the orthogonal projection on V .

3. Preliminaries
Let  be the set of strings on the finite alphabet . The empty string is denoted by , and the length of a string u is denoted by |u|. For any integer k, we denote by k the set {u   | |u| = k} and by k the set {u   | |u|  k}. Let L   . We denote by pref (L) (resp. suf (L)) the set {u   |v   uv  L} (resp. {u   |v   vu  L}). A formal power series is a mapping r from  to R. The set of all formal power series is denoted by R  . It is an R-vector space. For any series r and any string u   , we denote by ur the series defined by ur(w) = r(uw).   A multiplicity automaton (MA) is a tuple , Q, , ,  where Q is a finite set of states,  : Q ×  × Q  R is the transition function,  : Q  R is the initialization function and  : Q  R is the termination function. We extend the transition function  to Q ×  × Q by (q, wx, q ) = q Q (q, w, q )(q , x, q ) and (q, , q ) = 1 if q = q and 0 otherwise, for any q, q  Q, x   and w   . For any MA A = , Q, , ,  , we define the series rA by rA (w) = q,q Q (q)(q, w, q ) (q ). A series r is rational if it can be computed by a multiplicity automaton. It can be proved that a series r is rational iff the vector subspace of R  spanned by the set {ur|u   } has a  finite dimension, which is called the rank of r. This dimension coincides with the minimal number of states needed for a MA to compute r. The family of all rational series is

4. Principle of the algorithm
Let p  S rat () be a rational stochastic language, let Vp be the vector space spanned by {up|u   } and let  d = dim(Vp ). Let S be a sample independently drawn according to p and let pS be the empirical distribution defined from S. We first build an estimate VS of Vp from S. Then, we show that VS can be used to build a MA AV whose as-

Grammatical Inference as a PCA Problem

Algorithm 1 RandomDraw. Defines a stochastic language pr such that for any string u, pr (u) = 0 if r(u)  0 and pr (u)  r(u)/r( ) otherwise. Data : An absolutely convergent rational series r s.t. u r(u) > 0. Result : A string u drawn randomly X = {}   For x  , let px = max(0, xr( ))  Let p = max(0, r(). Let s = xX px Draw an element z of X according to the multinomial model (px /s)xX if z =  then return  else return z+RandomDraw(zr)  end if sociated rational series approximates the target p. In this section, we shall implictly suppose that the dimension d of Vp is known. We will show in the next section how it may be estimated from the data. 4.1. Estimating the target space Let k  0 be an integer. The first step consists in finding  the k-dimensional vector subspace VS,k of l2 ( ) which minimizes the distance to {upS |u   }: 
 VS,k = Argmindim(V )=k u  Since the support of pS is finite, VS,k can be computed by using PCA. Let v1 , . . . , vm be an enumeration of suf (S). For any u  pref (S), let xu be the vector of Rm defined 1 by xu [i] = upS (vi ) - Card(suf (S)) vsuf (S) upS (v).   Let XS be the matrix containing the vectors xu as rows. The matrix MS = XS XS is positive semi-definite and the eigenvectors (q1 , . . . , qk ) corresponding to the k largest  (positive) eigenvalues form an orthogonal basis of VS,k .

· S,B = qi (). The automaton AS,B computes a rational series which only depends on S and V . Indeed, let us define the linear operator  :   L(R  ) by V V  x  (v)  if u =  if u = xv

(u) =

Let r be the series computed by AS,B . The following proposition shows that r(u) = (u)(pS )(), which does not depend on B. Proposition 1. For any string u, r(u) = (u)(pS )(). Proof. It can easily be shown, by induction on the length of u, that q B (q, u, q )q = (u)(q). Then, we have: r(u) =
q,q

(q)(q, u, q ) (q )   (q)
q q

=

(q, u, q )q  ()

=
q

(q)(u)(q) () (q)q()
q

= (u)(

||upS - V upS ||2 .  

= (u)(pS )().

Note that if we take V = Vp and if we set (q) to the i-th component of Vp (p) in the basis (q1 , . . . , qk ), then (u) = uVp and r = p: the automaton computes the  target p, which can be seen as a weak consistency property of the algorithm. 4.3. The algorithm

4.2. Building the automaton Now, let V be a k-dimensional vector subspace of l2 ( ) and let B = {q1 , . . . , qk } be a basis of V . We define the multiplicity automaton AS,B = , QS,B , S,B , S,B , S,B by · QS,B = {q1 , . . . , qk },
j j · S,B (qi , x, qj ) = i,x where i,x is the j-th component of V (xqi ) in the basis (q1 , . . . , qk ) 

Algorithm 4.3 takes a sample S and an estimate of the rank d of the target space as input and outputs a multiplicity au tomaton. It first computes the space VS,k , then it computes  an automaton AS,B from S and a basis of VS,k , which defines a rational series that only depends on S and d. 4.4. Example Let us consider the stochastic language p defined by the probabilistic automaton described on Figure 1 and let S be a sample composed of 1000 examples independently drawn according to p. Table 1 shows the beginning of the array XS (before centering).

· S,B (qi ) = i where i is the i-th component of V (pS ) in the basis (q1 , . . . , qk ),

Grammatical Inference as a PCA Problem

Algorithm 2 Building an automaton corresponding to a sample S and a dimension d Data : A sample S = {si   , 1  i  |S|} i.i.d. according to a distribution p, a dimension d Result : A Multiplicity Automaton A W = {wi , 1  i  |W |}  prefixes of S U = {ui , 1  i  |U |}  suffixes of S X  a |W | × |U | matrix 1  Xi,j  wi pS (uj ) - |U | vU upS (v)  N =XX (i , qi )  eigenvalues of N in decreasing order, and corresponding eigenvectors B  (q1 , . . . , qd ) /*B = orth. proj. on the vector subspace spanned by B*/ for i from 1 to d do (qi )  the i-th coordinate of B (pS ) in B  (qi )  qi () for each x   do for j from 1 to d do (qi , x, qj )  the j-th coordinate of B (xqi ) in  B end for end for end for return A = , {q1 , . . . , qn }, , , 

Table 2. The first values computed with the target automaton A (Fig. 1 (a)), the first learned MA A1 (Fig. 1 (b)), the second learned MA A2 (Fig. 1 (c)) and the stochastic language prA2 derived from A2 by using algorithm RandomDraw.
pA rA1 rA2 prA2  0.0 0.057 0.000 0.000 a 0.083 0.021 0.092 0.10 b 0.083 0.041 0.080 0.086 aa 0.028 0.007 0.025 0.028 ab 0.028 0.015 0.028 0.030 ba 0.069 0.015 0.071 0.077 bb 0.069 0.030 0.066 0.072

b,1/2 a,1/4 b,1/4

a,1/3 b,1/3

(a)

1

p0

p1

1/3

a,0.36 b,0.72

a,0.36 b,0.72 a,-0.41 b,-0.23

a,0.04 b,0.36

(b)

0.17

p0

(c)

0.17

p1 a,-0.08 b,0.11

p2

0.06

0.33

0.33

-0.91

Table 1. First rows and columns of XS (before centering).
XS  a b ...  0.0 0.091 0.08 a 0.091 0.027 0.074 b 0.08 0.026 0.058 aa 0.027 0.009 0.016 ab 0.026 0.01 0.026 ba 0.074 0.008 0.055 bb 0.058 0.007 0.052 ...

Figure 1. A target stochastic language defined by the probabilistic automaton (a). The learned automaton with d = 1 (b) and d = 2 (c). The parameters have been trunked after the second decimal. The quadratic distance between the target (a) and the learned automaton is 1.76 · 10-2 for the first automaton (b) and 1.49 · 10-4 for the second (c).

The 3 largest eigenvalues of the matrix MS = XS XS are (in decreasing order) 7.71 · 10-1 , 2.14 · 10-1 , 5.98 · 10-3 . Figure 1 shows the automaton output by the learning algorithm for d = 1 and d = 2. The quadratic distance between the target and the learned automaton is 1.758 · 10-2 for the first automaton (d = 1), 1.487 · 10-4 for the second (d = 2) and 2.325 · 10-4 for d = 3 (the automaton is not represented). We can remark that the distance is minimal when the number of states is correct. Table 2 shows the first values computed by the target and the learned automaton.

Proposition 2. Let S be a sample i.i.d. according to a rational stochastic language p with rank d. Then  E[ VS,d (wpS ) - wp ]  0 uniformly wrt w as the size of   S increases. Proof. Let w   . As pS (w) is a binomial distribution with parameters |S| and p(w), V[pS (w)] = p(w)(1-p(w))  p(w) . Thus, E[ wpS - wp 2 ]    |S| |S| E[pS (wu) - p(wu)]2  u p(wu) = p(w ) . There|S| |S| fore, E[ wpS - wp 2 ] tends to 0 as the size of S increases.  
u


5. Consistency
5.1. Consistency when the rank of the target is known We show that the solution computed by the algorithm converges to the target as the size of the sample S increases.

We have wpS - Vp (wpS )  wpS - wp since wp       Vp . Furthermore,
nN p( ) |S|
n

w

E[ wpS - wp 2 ]   
O( ) |S|
n

w

p(w ) |S|

=


nN

= O(1/|S|) for some 0 <

 < 1. Therefore, (E[ wpS - Vp (wpS ) ])2  (E[ wpS -   

Grammatical Inference as a PCA Problem

wp ])2  E[ wpS - wp 2 ]       2 w E[ wpS - wp ] =  O(1/|S|) which proves that E[ VS,d (wpS ) - wp ]    E[ wpS - Vp (wpS ) + wpS - wp ]  O(|S|-1/2 ).    
   That is, E[ VS,d (wpS ) - wp ]  0 uniformly wrt w as the size of S increases.

Proof. Let such that
k - /2 

> 0. There exists Wk,   , |Wk, | < 
max min X
wWk,

dim(V )=k vV \{0}

V (wp) 

2

= k,
2

It can easily be deduced from Proposition 2 that if d u1 p, . . . , up p form a basis of Vp , if vp =    i=1 i ui p d   ^  and if VS,d (vpS ) =  i=1 i VS,d (ui pS ), then -1/2 E[M ax( i - i )] = O(|S| ^ ). In particular, linear relations between residuals are exactly identified in the limit. 5.2. Convergence of eigenvalues We will use two results about eigenvalues: Lemma 1 (Shawe-Taylor et al. (2001)). Let M be the variance-covariance matrix of the set {wp, w  W }, with  |W | = m. Let {i , 1  i  m} be the set of eigenvalues of M . Let Vd be the d-dimension vector subspace wich minimizes wW wp - Vd (wp) 2 . Then, d s.t.   1  d  m, · d = maxdim(V )=d minvV \{0} ·
d+1im wW

We have also k,  k ( wWk, 2  since Wk,   ). w V (wp)

V (wp) 


There exists N such that w  Wk, , v  R , n > N , v (wp) 2 - v (wpSn ) 2  2|Wk, | because v is 1  lipschitz. Let k, =
dim(V )=k vV \{0}

max

min

V (wpSN ) 
w

2

 k,N

We have: k,N  k,  k, + /2  k + This is true for every , and we have the conclusion. Algorithm 3 first finds a lower bound of the rank with the method we just described, then it detects the point (greater than the bound previously found) on the curve of the logarithm of the eigenvalues that corresponds to the highest second order slope (see the last loop of the algorithm).

v (wp) 2 ; 

i =

wW

wp - Vd (wp) 2 .  

6. Numerical Simulations
We carry out two types of simulations: first we focus on the ability of our algorithm to retrieve the structure and coefficients of a specific automaton, and then we perform experiments on randomly drawn automata. We intend to study the possibility of determining the rank of the target p from a finite sample S i.i.d. according to p. Comparison with other methods to find the structure (or the number of states) of a statistical model from a learning sample are not carried out here and are left for future work. However, we would like to stress that existing inference algorithms either provide models of far lower expressiveness (such as PDFA), or are designed to work in an asymptotic way, and still need be tuned to obtain an efficient implementation (such as DEES). We therefore anticipate our method to provide more conclusive empirical results. 6.1. Rank Estimation Procedures We propose four criteria to estimate the target rank.

Proposition 3. Let S be a sample i.i.d. according to a rational stochastic language p with rank d, and pS the empirical distribution deduced from S. Let V be the orthogonal  projection on V . Let VS be the d-dimension vector sub space wich minimizes wpref (S) wpS - VS (wpS ) 2 ,   V  = Res(p). Let {i , 1  i  m} be the set of eigenvalues of the variance-covariance matrix of {wpS , w   pref (S)}, with m = |pref (S)|. Then E[
d+1im

i ] tends to 0 as the size of S increases.

Proof. E[ d+1i|pref (S)| i ] =  wpref (S) wpS - M 2  (wpS ) V S   |S| , with M = max{|w|, w  pref (S)}. This tends to 0 as |S| tends to infinity. Proposition 4. Let S be an infinite sample i.i.d. according to a rational stochastic language p with rank d, and Sn the n first elements. Let pS the empirical distribution deduced from S, and pS the one deduced from Sn . Let k = k,n =
dim(V )=k vV \{0}

max

min

V (wp) 
w

2

dim(V )=k vV \{0}

max

min

V (wpSn ) 
w

2

Eigenvalue-based Test We use the algorithm described before, based on eigenvalues, but we don't take the value Lmax as bound, we use instead annother value. Let |S| 1 p (w) = pS (w) + |S| be a smoothing of pS . Variance of p(w) is estimated with V (w) =
p (w)(1-p (w)) . |S|

Then lim inf n k,n  k .

Finally:

Grammatical Inference as a PCA Problem

Algorithm 3 Finding the rank d Data : A sample S = {si , 1  i  |S|} i.i.d. according to a distribution p Result : A dimension d W = {wi , 1  i  |W |}  prefixes of S U = {ui , 1  i  |U |}  suffixes of S Lmax  max{|si |, si  S} M  a (|W |, |U |) matrix |{s =w uj ,s S}| Mi,j  wi pS (uj ) = k i|S| k  N = MMT (i )1i|U |  eigenvalues of N in decreasing order d  |U | sum  0 max while sum < L|S| and d > 0 do sum  sum + d dd-1 end while dd+1 ed d   d+2 2
d+1

are l1 and l2 distances, and KL-divergence. D=
w10

d(p(w), pn (w))

6.2. Single case study We start with the case of a single 5-state automaton on the alphabet  = {0, 1} described on Fig. 2. Fig. 3 represents the eigenvalues curves, in log scale, for a sample size of 1000, 5000, 20000 and 100000 strings.

Figure 2. The 5-states automaton

F IG . 5 ­ L'automate A

F IG . 6 ­ n=100

F IG . 7 ­ n=3000

while e < |U | - 2 do ee+1 e if   e+2 then 2 e+1 de end if end while return d

One can see the eigenvalues decrease exponentially, so the sum from the i-th eigenvalue to the last one is close to the i-th eigenvalue. Considering this, the eigenvalues-based algorithm identifies the highest second order slope point under the horizontal line, representing the value Lmax /|S|:
IG . 9 ­ n=3000000 F . 8 ­ n=50000 With a sample size ofIG1000, the retained Fdimension is 3 while for larger sizes the correct dimension is identified.

F IG . 10 ­ n=100×

 = 0.1, sur des Sn des n In the case of the l1 distance,échantillons été constituéspartir premiers mots de S échantillon ded the correct rank is foundéchantillon notre constitué for L'automate de la Fig. 10 a réalisé à d'un 100×500 all samples. premiers mots de S dupliqués 500 fois. L'automate de la Fig. 13 a été réalisé à part
1 1

Fig. 5). Les automates suivants sont ceux calculés par notre programme pour un par

Lmax =
uW,vW

pS (uv) +

V (uv)

0.1

0.01

0.001

We then estimate the "elbow" position by maximizing the second order slope of the eigenvalues logarithm curve, that is, if {1 , . . . , n } are the eigenvalues, computing i arg maxi i+2 for i greater than the dimension previ2 i+1 ously found. Distance to Test Sample For each target automaton, we build automata for rank d from 1 to 11. We want to check here if it is possible to infer the rank by minimizing the distance beetween each of the eleven automata and a test sample. The inferred automata do not, in general, compute a probability distribution. We simulate one from each inferred automaton with Algorithm 1. We then compute here an approximation of the distance beetween the target probability p and each inferred probability pn for 1  n  11, only for strings of length lower than 10. The distances considered

0.0001

1e-05 2 1 4 6

échantillon S1000×5000 constitué des 1000 premiers mots de S dupliqués 5000 fois, celu Fig. 14 à partir d'un échantillon S10000×500 constitué des 10000 premiers mots de S du 500 fois. Les nombre entre parenthèses représentent l'erreur théorique commise sur l ficient, par la méthode de calcul décrite au-dessus. L'absence de valeur signifie que le -1 n'a pu aboutir : (1 - Mk Mk - M )  0. Plusieurs remarques peuvent être faites concernant ces résultats. La première conc rapidité avec laquelle l'algorithme trouve le bon nombre d'états. L'étendue (1) de l'int de confiance ne joue pas un grand rôle dans l'estimation des coefficients. Elle joue par un rôle dans la mesure de la précision du calcul des coefficients, et dans l'estimat nombre d'états de l'automate. L'étendue (1) de l'intervalle de confiance est grande com à ce qu'on pourrait en attendre.
0.1 0.01 0.001 0.0001 1e-05 8 10 1e-06 2 4 6 8 10 1 0.1 0.01

0.1

0.01 0.001 0.001 0.0001 0.0001 1e-05 1e-05 1e-06

1e-06

1e-07

1e-07 2 4 6 8 10

1e-08 2 4 6 8 10

Figure 3. Curves of Eigenvalues (in logarithm scale) for samF . 11 ­ n=5000000 F IG . 12 ­ n=7000000 ple size of 1000, 5000,IG20000 and 100000 strings, from left to right and top to down. The horizontal line represents the value Lmax /|S|. 14

F IG . 13 ­ n=1000×

6.3. General case study We implement the following protocol. We first randomly generate 500 probabilistic 4-state automata on the alphabet

Grammatical Inference as a PCA Problem

 = {0, 1} and we generate for each of them one sample of 1000, 5000, 20000 and 100000 strings. From each sample, we generate a data matrix on which we perform a PCA. The test for the dimension is a first estimation of the target rank. We generate weighted automata for dimensions from 1 to 11, and we compute the l1 , l2 -distances and the KL-divergence between each target automaton and its aproximations of ranks from 1 to 11. This gives three other estimations of the target rank. Randomly generated automata Each coefficient of the initial vector and final vector is uniformly generated between 0 and 1. Each coefficient of the transition matrices is uniformly generated with probability 1/3, and is equal to 0 with probability 2/3. We first divide the initial vector's coordinates by their sum. Then, for every state, we do the sum of final coefficient and all the outgoing transitions. This must sum up to 1, so we divide each term by the sum to normalize. We obtain this way a weighted automaton wich computes a probability distribution over  . Data Matrix We do not consider the whole data matrix but the one formed with the prefixes and suffixes of length lower or equal than 4, in lexicographic order. Let us call this set W = { , 0, 1, 00, 01, 10, 11, . . . , 1110, 1111}. |W | = 31. The matrix XS is defined by: XS i,j = pS (wi wj ). After performing the PCA, we obtain 31 eigenvalues and eigenvectors. The Fig.4 represents the curve of the average eigenvalues computed by the algorithm for each sample size. One can see clearly the "elbow" at position 5 with a sample size greater than 20000. The horizontal line represents the value Lmax /|S|. The Fig. 5, 6, 7 and 8 represents the compared dimensions obtained by the eigenvalues-based algorithm, l1 and l2 -distance minimization, and KL-divergence minimization. Results From our experiments of rank prediction, one can observe that: · The KL-divergence criterion performances decrease with the sample size. · l1 and l2 minimization criteria slowly increase with the sample size. We chose to compute distances with the real target distribution, in order to have a convenient way to compare results. In reality, one only knows the empirical distribution

1

1

0.1 0.1

0.01 0.01 0.001

0.001 0.0001

0.0001

2

4

6

8

10

1e-05

2

4

6

8

10

1

1

0.1

0.1

0.01

0.01

0.001

0.001

0.0001

0.0001

1e-05

1e-05

1e-06

2

4

6

8

10

1e-06

2

4

6

8

10

Figure 4. Curves of the average eigenvalues (in logarithm scale) for sample size of 1000, 5000, 20000 and 100000 strings, from left to right and top to down. The horizontal line represents the average value Lmax /|S|.

deduced from learning sample, and this should decrease performances of those three criteria. · The eigenvalues criterion performances seem to increase faster than the others when the size of the sample increases.

7. Conclusion
We introduce a new approach in probabilistic grammatical inference which differs from previous one on several aspects. Most classical inference algorithms used in the field of probabilistic grammatical build an automaton or a grammar iteratively from a sample S; starting from an automaton composed of only one state, they have to decide whether a new state must be added to the structure. This iterative decision relies on a statistical test with a known drawback: as the structure grows, the test relies on less and less examples. Instead of this iterative approach, we tackle the problem globally and our algorithm computes in one step the space needed to build the output automaton. That is, we have reduced the problem set in the classical probabilistic grammatical inference framework into a classical optimization problem. We now need to experimentally compare our approach to existing ones on real data: this is a work in progress. A further consequence of our approach is that it will be possible to introduce non linearity via the kernel PCA technique developed in (Shawe-Taylor et al., 2001) and by the Hilbert space embedding of distributions proposed in (Smola et al., 2007).

Acknowledgments
This work is partially supported by the IST Program of the EC, under the FP7 Pascal 2 Network of Excellence, ICT216886-NOE and by the ANR project SEQUOIA.

Grammatical Inference as a PCA Problem
250 Eigenvalues l1 distance l2 distance Kullback-Leibler 200 200 Eigenvalues l1 distance l2 distance Kullback-Leibler

150 150

100 100

50

50

0 0 2 4 6 8 10 12

0 0 2 4 6 8 10 12

Figure 5. Estimated rank with the four methods |S| = 1000.
Eigenvalues l1 distance l2 distance Kullback-Leibler

Figure 7. Estimated rank with the four methods |S| = 20000.
300 Eigenvalues l1 distance l2 distance Kullback-Leibler

200

250

150

200

150 100 100

50 50

0 0 2 4 6 8 10 12

0 0 2 4 6 8 10 12

Figure 6. Estimated rank with the four methods |S| = 5000.

Figure 8. Estimated rank with the four methods |S| = 100000.

References
Beimel, A., Bergadano, F., Bshouty, N. H., Kushilevitz, E., & Varricchio, S. (2000). Learning functions represented as multiplicity automata. Journal of the Association for Computing Machinery, 47, 506­530. Carrasco, R. C., & Oncina, J. (1994). Learning stochastic regular grammars by means of a state merging method. Second International Colloquium on Grammatical Inference (pp. 139­152). Springer. Clark, A., Flor^ ncio, C. C., & Watkins, C. (2006). Lane guages as hyperplanes: Grammatical inference with string kernels. 17th European Conference on Machine Learning (pp. 90­101). Springer. Denis, F., & Esposito, Y. (2008). On rational stochastic languages. Fundamenta Informaticae, 86, 41­77. Denis, F., Esposito, Y., & Habrard, A. (2006). Learning ra-

tional stochastic languages. 19th Conference on Learning Theory (pp. 274­288). Springer. Shawe-Taylor, J., Cristianini, N., & Kandola, J. S. (2001). On the concentration of spectral properties. Advances in Neural Information Processing Systems, 14 (pp. 511­ 517). MIT Press. Smola, A. J., Gretton, A., Song, L., & Sch¨ lkopf, B. o (2007). A Hilbert space embedding for distributions. 18th International Conference on Algorithmic Learning Theory (pp. 13­31). Springer. Thollard, F., Dupont, P., & de la Higuera, C. (2000). Probabilistic DFA Inference using Kullback-Leibler Divergence and Minimality. 17th International Conference on Machine Learning (pp. 975­982). Morgan Kaufmann.