Lyndon Words with a Fixed Standard Right Factor Fr´d´rique Bassino ee Julien Cl´ment e Cyril Nicaud Given a totally ordered alphabet A, a Lyndon word is a word that is strictly smaller, for the lexicographical order, than any of its conjugates (i.e., all words obtained by a circular permutation on the letters). Lyndon words were introduced by Lyndon [7] under the name of "standard lexicographic sequences" in order to give a base for the free Lie algebra over A. The set of Lyndon words is denoted by L. For instance, with a binary alphabet A = {a, b}, the first Lyndon words until length five are L = {a, b, ab, aab, abb, aaab, aabb, abbb, aaaab, aaabb, aabab, aabbb, ababb, abbbb, . . . }. Note that a non-empty word is a Lyndon word if and only if it is strictly smaller than any of its proper suffixes. The standard (suffix) factorization of Lyndon words plays a central role in the study of free Lie algebras (see [6], [8], [9]). For a Lyndon word w L \ A not reduced to a letter, the pair (u, v ) of Lyndon words such that w = uv and v of maximal length is called the standard factorization. The words u and v are called the left factor and right factor of the standard factorization. Equivalently, the right factor v of the standard factorization of a Lyndon word w can be defined as the smallest proper suffix of w for the lexicographical order. For instance we have the following standard factorizations: aaabaab = aaab · aab aaababb = a · aababb aabaabb = aab · aabb. This structure encodes a non-associative operation, either a commutator in the free group [3], or a Lie bracketing [6]; both constructions leads to bases of the free Lie algebra. Our goal is to get a better insight of the standard factorization of Lyndon words in order to design better algorithms for the construction of the Lyndon tree. A naive algorithm for such a construction has time complexity O(n2 ) in the worst case. The characterization of the set of Lyndon words with a given right standard factor is also a first step towards the average case analysis of the height of Lyndon trees in order to give a more accurate view of the observed behavior of related algorithms. To state our main results, we need to recall first what are regular languages. A language L is a set of words over a fixed alphabet A. The structurally simplest (yet non trivial) languages are the regular languages that can be defined in a variety of ways: by regular expressions and by finite automata. Concatenation of languages is denoted by a product (L1 · L2 = {w1 w2 | w1 L1 , w2 L2 }). Union of languages is the a ordinary set union. The empty word is denoted by nd the Kleene star operator is understood as L = + L + L · L + · · · . A regular language over an alphabet A is built by recursively applying concatenation, union and Kleene star operator to the singleton languages { } and { } ( A). A regular expression is a description of a regular language (most commonly using symbols "·, +, "). Our main result can hence be stated as follows: The set of Lyndon words with a fixed standard right factor is a regular language whose regular expression is given in Theorem 1. One can then associate to a Lyndon word w a binary tree T (w) called its Lyndon tree recursively built in the Theorem 2 gives the corresponding enumerative generfollowing way: ating function. ­ if w is a letter, then T (w) is a leaf labeled by w, Let A = {a1 < · · · < aq = } be a totally ordered q ary alphabet where denotes the greatest symbol of A. ­ otherwise T (w) is an internal node having T (u) e and T (v ) as children where u · v is the standard We denote A the set of finite words. We consider th lexicographical order < over all non-empty words of A factorization of w. defined by the extension of the order over A. Let w be Institut Gaspard Monge, Universit´ de Marne-la-Vall´e, a word of A \ { } , the successor S (w) of w = v i , e e France where is a symbol of A \ { } and i 0, is defined Copyright © 2004 by the Association for Computing Machinery, Inc. and the Society for industrial and Applied Mathematics. All Rights reserved. Printed in The United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Association for Computing Machinery, 1515 Broadway, New York, NY 10036 and the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688 653 by S (w) = v , being the smallest letter greater than in A. For any Lyndon word v , we define the set of words Xv = {v , S (v ), S 2 (v ), . . . , S k (v ) = }. Note that k = 1 + q × |v | - i=1 i × |v |i where q is the cardinality of the alphabet A, |v | is the length of v and |v |i is the number of occurrences of the ith letter of the alphabet A in v . Examples. (i) for A = {a, b}, v = aabab and Xaabab = {aabab, aabb, ab, b}. (ii) for A = {a, b, c}, v = bbc and Xbbc = {bbc, bc, c}. We define the generating functions Xv (z ) of Xv and Xv (z ) of Xv where |w| is the length of a word w: w w Xv (z ) = z |w| and Xv (z ) = z |w | . Xv Xv q As the set Xv is a code, the elements of Xv are (finite) sequences of elements of Xv (see [5]): Xv (z ) = 1 . 1 - Xv (z ) One gets from Theorem 1 the generating function of the set Fv = {uv L | u · v is the standard factorization}. Theorem 2. Let v be a Lyndon word over a q -ary By construction, v is the smallest element of Xv A alphabet. The generating function of the set Fv of for the lexicographical order. Moreover for any Lyndon Lyndon words having a right standard factor v is + 1 . word u, u > v if and only if u Xv . w qz - 1 z |w | = z |v | Fv (z ) = + For any letter A, denote A = {a A | a } 1 - Xv (z ) Fv the set of letters smaller than . For a language L, we This generating function can further be used to enuuse the convenient notation L+ = L \ { }. We recall also that the class of regular languages is closed under merate or get the asymptotic behavior of the numbers of Lyndon words with fixed right standard factor (see [5]). the set difference operator "\". However the next step would be to sum rational funcTheorem 1. Let v L beginning by a letter A tions when the right standard factor v runs over the and u A . Then uv is a Lyndon word with u · v as set of Lyndon words. The fact that the set of Lyndon + standard factorization if and only if u (A Xv ) \ Xv . words is a structurally complex language (for instance Hence the set Fv of Lyndon words having v as right not even context-free [1]) gives a hint on the difficulty standard factor is a regular language. of this study. For instance, consider the alphabet {a, b, c} and the References Lyndon word v = bbc, then the set of Lyndon words having v as right standard factor is the language ((a + [1] J. Berstel, L. Boasson, The set of Lyndon words is b)(bbc + bc + c) ) \ (bbc + bc + c)+ . We list benot context-free, Bull. Eur. Assoc. Theor. Comput. Sci. low words of (a + b)(bbc + bc + c) until length four EATCS 63 (1997) 139­140. and underline words which are not in (bbc + bc + c)+ [2] K. S. Booth. Lexicographically least circular subb a strings. Inform. Process. Lett., 10(4-5):240­242, 1980. [3] K. Chen, R. Fox, R. Lyndon, Free differential calculus ac bc IV: The quotient groups of the lower central series, abc acc bbc bcc Ann. Math. 58 (1958) 81­95. abbc abcc acbc accc bbbc bbcc bcbc bccc. One of the basic properties of the set of Lyndon words, thoroughly used in the proof, is that every word w of A is uniquely factorizable as a non increasing product of Lyndon words w= 1 2 . . . n, i L, 1 2 · · · n. Moreover this decomposition can be computed efficiently in linear time and space [4]. We remark that when we apply this decomposition to "shifted" Lyndon words (words obtained by deleting the first symbol) the last factor is exactly the right standard factor. In the same vein, using a result due to Booth [2], one can also give a reject algorithm to generate random Lyndon words of a given length n in linear time on average. [4] J.-P. Duval, Factorizing words over an ordered alphabet, Journal of Algorithms 4 (1983) 363­381. [5] P. Fla jolet, R. Sedgewick, Analytic combinatorics­ symbolic combinatorics, Book in preparation, (Individual chapters are available as INRIA Research reports at http://www.algo.inria.fr/flajolet/publist.html) (2002). [6] M. Lothaire, Combinatorics on Words, Vol. 17 of Encyclopedia of mathematics and its applications, AddisonWesley, 1983. [7] R. Lyndon, On Burnside problem I, Trans. American Math. Soc. 77 (1954) 202­215. [8] C. Reutenauer, Free Lie algebras, Oxford University Press, 1993. [9] F. Ruskey, J. Sawada, Generating Lyndon brackets: a basis for the n-th homogeneous component of the free Lie algebra, Journal of Algorithms 46 (2003) 21­26. 654