WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China

Efficient Mining of Frequent Sequence Generators

Chuancong Gao, Jianyong Wang, Yukai He, Lizhu Zhou
Tsinghua University, Beijing, 100084, P.R. China
{gaocc07, heyk05}@mails.tsinghua.edu.cn, {jianyong, dcszlz}@tsinghua.edu.cn

ABSTRACT

Sequential pattern mining has raised great interest in the data mining research field in recent years. However, to the best of our knowledge, no existing work studies the problem of frequent sequence generator mining. In this paper we present a novel algorithm, FEAT (abbr. Frequent sEquence generATor miner), to perform this task. Experimental results show that FEAT is more efficient than traditional sequential pattern mining algorithms, generates a more concise result set, and is very effective for classifying Web product reviews.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database applications - Data Mining

General Terms: Algorithms, Performance

Keywords: Sequence Generators, Sequence, Web Mining

Copyright is held by the author/owner(s). WWW 2008, April 21-25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04.

1. INTRODUCTION

Sequential pattern mining has raised great interest in the data mining research field in recent years. Various mining methods have been proposed, including sequential pattern mining [1][5] and closed sequential pattern mining [6][7]. Sequential pattern mining has also shown its utility for Web data analysis, such as mining Web log data [2] and identifying comparative sentences in Web forum postings and product reviews [3]. However, there is no existing work on mining frequent sequence generators, where a sequence generator is informally defined as one of the minimal subsequences in an equivalence class. Thus, generators have the same ability to describe an equivalence class as the other subsequences of that class, and according to the MDL principle [4], generators are preferable to all sequential patterns for Web page and product review classification.

In the rest of this paper, we first give a formal problem formulation and present our solution in Section 2, then report the performance study in Section 3. We conclude the study in Section 4.

2. MINING SEQUENTIAL GENERATORS

2.1 Problem Formulation

An input sequence database SDB contains a set of input sequences, where an input sequence is an ordered list of items (each item can appear multiple times in a sequence) and can be denoted by S = e1 e2 ... en. Given a prefix Spre = e1 e2 ... ei of sequence S, we define the projected sequence of Spre w.r.t. S as e_{i+1} e_{i+2} ... e_n. The complete set of projected sequences of Spre w.r.t. each sequence in SDB is called the projected database of Spre w.r.t. SDB, denoted by SDB_{Spre}. Given a subsequence Sp = e_{p1} e_{p2} ... e_{pm}, its support sup(Sp) is defined as the number of sequences in SDB each of which contains Sp, denoted by |SDB_{Sp}|. Given a user-specified minimum support threshold min_sup, Sp is said to be frequent if sup(Sp) ≥ min_sup holds. Sp is called a sequence generator iff there exists no sequence Sp′ such that Sp′ < Sp (i.e., Sp properly contains Sp′) and sup(Sp′) = sup(Sp). In addition, given a sequence S = e1 e2 ... en and an item e′, we denote e1 e2 ... e_{i-1} e_{i+1} ... e_n by S^(i), e_i e_{i+1} ... e_j by S_(i,j), and e1 e2 ... en e′ by <S, e′>.

Given a minimum support threshold min_sup and an input sequence database SDB, the task of frequent sequence generator mining is to mine the complete set of sequence generators that are frequent in SDB.
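To make these definitions concrete, the following is a minimal Python sketch (ours, not part of the original paper) of prefix projection and support counting. A sequence is modeled as a list of items, and the helper names project, projected_database, and support are illustrative.

    # Illustrative sketch of prefix projection and support counting.
    # A sequence is a list of items; a database is a list of sequences.

    def project(seq, prefix):
        """Return the projected sequence of `prefix` w.r.t. `seq` (the
        suffix after the leftmost subsequence match), or None if `seq`
        does not contain `prefix`."""
        pos = 0
        for item in prefix:
            while pos < len(seq) and seq[pos] != item:
                pos += 1
            if pos == len(seq):
                return None          # prefix not contained in seq
            pos += 1                 # move past the matched item
        return seq[pos:]

    def projected_database(sdb, prefix):
        """SDB_prefix: projected sequences of all sequences containing prefix."""
        return [p for p in (project(s, prefix) for s in sdb) if p is not None]

    def support(sdb, pattern):
        """sup(pattern) = |SDB_pattern|."""
        return len(projected_database(sdb, pattern))

    # Example: projecting and counting <a, b> in a toy database.
    sdb = [list("abcb"), list("acbc"), list("bca")]
    print(projected_database(sdb, list("ab")))  # [['c', 'b'], ['c']]
    print(support(sdb, list("ab")))             # 2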
2.2 Pruning Strategy

A naïve approach to mining the set of frequent sequence generators is to first apply a sequential pattern mining algorithm to find the set of frequent subsequences and then check whether each frequent subsequence is a generator. However, this is inefficient, as it cannot prune the unpromising parts of the search space. In this subsection we propose two novel pruning methods, Forward Prune and Backward Prune, which can be integrated with the pattern-growth enumeration framework [5] to speed up the mining process. We first introduce Theorems 1 and 2, which form the basis of the pruning methods; due to limited space we omit their proofs here.

THEOREM 1. Given two sequences Sp1 and Sp2, if Sp1 < Sp2 (i.e., Sp1 is a proper subsequence of Sp2) and SDB_{Sp1} = SDB_{Sp2}, then any extension to Sp2 cannot be a generator.

THEOREM 2. Given a subsequence Sp = e1 e2 ... en and an item e′, if SDB_{Sp} = SDB_{Sp^(i)} for some i (i = 1, 2, ..., n), then we have that SDB_{<Sp, e′>} = SDB_{<Sp^(i), e′>}.

LEMMA 1 (Forward Prune). Given a subsequence Sp and an item e′, let Sp′ = <Sp, e′>; if sup(Sp) = sup(Sp′) and for any locally frequent item u of Sp′ we always have SDB_{<Sp, u>} = SDB_{<Sp′, u>}, then Sp′ can be safely pruned.

PROOF. Easily derived from Theorem 1.

LEMMA 2 (Backward Prune). Given Sp = e1 e2 ... en, if there exist an index i (i = 1, 2, ..., n-1) and a corresponding index j (j = i+1, i+2, ..., n) such that SDB_{(Sp)_(1,j)} = SDB_{((Sp)_(1,j))^(i)}, then Sp can be safely pruned.

PROOF. Easily derived from Theorems 2 and 1.

Note that a similar check has been adopted in a closed sequential pattern mining algorithm, CloSpan [7]. Here we adapt the technique to the setting of sequence generator mining.

2.3 Generator Checking Scheme

The preceding pruning techniques can be used to prune the unpromising parts of the search space, but they cannot guarantee that each mined frequent subsequence S = e1 e2 ... en is a generator. We devise a generator checking scheme, shown in Theorem 3, to perform this task; it can be carried out efficiently during the pruning process by checking whether there exists an index i (i = 1, 2, ..., n) such that |SDB_S| = |SDB_{S^(i)}|, since sup(S) = |SDB_S| holds.

THEOREM 3. A sequence S = e1 e2 ... en is a generator if and only if there exists no index i with 1 ≤ i ≤ n such that sup(S) = sup(S^(i)).

PROOF. Easily derived from the definition of a generator and the well-known downward closure property of sequence support.
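To illustrate how Theorem 3 is applied, below is a self-contained Python sketch (ours; all identifiers are illustrative) that decides whether a sequence is a generator by comparing its support against that of every one-item-deletion subsequence S^(i).

    # Sketch of the generator check implied by Theorem 3: S is a
    # generator iff no single-item deletion S^(i) preserves its support.

    def contains(seq, pattern):
        """True iff `pattern` is a subsequence of `seq`."""
        it = iter(seq)
        return all(item in it for item in pattern)  # greedy left-to-right match

    def support(sdb, pattern):
        return sum(1 for seq in sdb if contains(seq, pattern))

    def is_generator(sdb, s):
        """Apply Theorem 3: compare sup(S) with sup(S^(i)) for every i."""
        sup_s = support(sdb, s)
        for i in range(len(s)):
            if support(sdb, s[:i] + s[i + 1:]) == sup_s:  # S^(i) has equal support
                return False
        return True

    sdb = [list("abcb"), list("acbc"), list("bca")]
    print(is_generator(sdb, list("ab")))   # True
    print(is_generator(sdb, list("abc")))  # False

On this toy database, <a, b> is a generator, while <a, b, c> is not: deleting b yields <a, c>, which has the same support of 2.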
2.4 Algorithm

By integrating the preceding pruning methods and the generator checking scheme with a traditional pattern-growth framework [5], we can easily derive the FEAT algorithm shown in Algorithm 1. Given a prefix sequence Sp, FEAT first finds all of Sp's locally frequent items, uses each locally frequent item i to grow Sp, and builds the projected database for the new prefix Sp^i (lines 2,3,4). It adopts both the forward and backward pruning techniques to prune the unpromising parts of the search space (lines 8,11), and uses the generator checking scheme to judge whether the new prefix is a generator (lines 7,9,11,12). Finally, if the new prefix cannot be pruned, FEAT recursively calls itself with the new prefix as input (lines 14,15).

Algorithm 1: FEAT(Sp, SDB_{Sp}, min_sup, FGS)
Input: Prefix sequence Sp, Sp's projected database SDB_{Sp}, minimum support min_sup, and result set FGS
 1  begin
 2      foreach i in localFrequentItems(SDB_{Sp}, min_sup) do
 3          Sp^i ← <Sp, i>;
 4          SDB_{Sp^i} ← projectedDatabase(SDB_{Sp}, Sp^i);
 5          canPrune ← false;
 6          isGenerator ← true;
 7          if sup(SDB_{Sp}) = sup(SDB_{Sp^i}) then
 8              canPrune ← ForwardPrune(Sp, SDB_{Sp}, Sp^i, SDB_{Sp^i});
 9              isGenerator ← false;
10          if not canPrune then
11              BackwardPrune(Sp^i, SDB_{Sp^i}, canPrune, isGenerator);
12              if isGenerator then
13                  FGS ← FGS ∪ {Sp^i};
14              if not canPrune then
15                  FEAT(Sp^i, SDB_{Sp^i}, min_sup, FGS);
16  end
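For experimentation, here is a simplified, self-contained Python sketch (ours) of the pattern-growth recursion behind Algorithm 1. It mirrors FEAT's control flow but omits the ForwardPrune and BackwardPrune subroutines, keeping only the Theorem 3 generator filter, so it behaves like the naive baseline of Section 2.2 rather than the full pruned algorithm; all identifiers are illustrative.

    # Simplified sketch of the FEAT-style pattern-growth recursion,
    # without the Forward/Backward Prune optimizations.
    from collections import Counter

    def project(proj_db, item):
        """Grow the current prefix by `item`: for each projected sequence
        containing `item`, keep the suffix after its first occurrence."""
        return [seq[seq.index(item) + 1:] for seq in proj_db if item in seq]

    def local_frequent_items(proj_db, min_sup):
        """Items appearing in at least min_sup projected sequences."""
        counts = Counter()
        for seq in proj_db:
            counts.update(set(seq))        # count each item once per sequence
        return [item for item, c in counts.items() if c >= min_sup]

    def contains(seq, pattern):
        it = iter(seq)
        return all(item in it for item in pattern)

    def support(sdb, pattern):
        return sum(1 for seq in sdb if contains(seq, pattern))

    def feat(prefix, proj_db, sdb, min_sup, fgs):
        """Pattern-growth enumeration with the Theorem 3 generator filter.
        Lemmas 1 and 2 would hook in here to cut the search space."""
        for item in local_frequent_items(proj_db, min_sup):
            new_prefix = prefix + [item]
            new_db = project(proj_db, item)
            sup_p = len(new_db)            # sup(new_prefix) = |SDB_new_prefix|
            if all(support(sdb, new_prefix[:i] + new_prefix[i + 1:]) != sup_p
                   for i in range(len(new_prefix))):
                fgs.append((new_prefix, sup_p))   # passes the generator check
            feat(new_prefix, new_db, sdb, min_sup, fgs)

    sdb = [list("abcb"), list("acbc"), list("bca")]
    generators = []
    feat([], sdb, sdb, 2, generators)
    print(generators)  # e.g. [(['a','b'], 2), (['a','c'], 2), (['c','b'], 2)]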
3. PERFORMANCE EVALUATION

We conducted an extensive performance study to evaluate the FEAT algorithm on a computer with an Intel Core 2 Duo E6550 CPU and 2GB of memory. Due to limited space, we only report the results for some real datasets. The first dataset, Gazelle, is a Web clickstream dataset containing 29,369 sequences of Web page views. The second dataset, ProgramTrace, is a program trace dataset. The third dataset, Office07Review, contains 320 consumer reviews of Office 2007 collected from Amazon.com, in which 240 and 80 reviews are labeled as positive and negative, respectively.

Figure 1 shows the runtime efficiency comparison between FEAT and PrefixSpan, a state-of-the-art algorithm for mining all sequential patterns.

[Figure 1: Runtime Efficiency Comparison. Runtime in seconds (log scale) vs. minimum support threshold (in %): a) Gazelle; b) ProgramTrace.]

Figure 1a) demonstrates that for the sparse dataset Gazelle, FEAT is slightly slower than PrefixSpan when the minimum support threshold is high; however, with a minimum support threshold of less than 0.026%, FEAT is significantly faster than PrefixSpan. This also validates that our pruning techniques are very effective: without pruning, FEAT would need to generate the same set of sequential patterns as PrefixSpan and then perform generator checking to remove the non-generators, and thus could be no faster than PrefixSpan. Figure 1b) shows that for the dense dataset ProgramTrace, FEAT is significantly faster than PrefixSpan at any minimum support threshold. For example, PrefixSpan took nearly 200,000 seconds to finish even at a minimum support of 100%, while FEAT took less than 0.02 seconds.

We also used generators and sequential patterns as features to build SVM and naïve Bayesian classifiers. The results for the Office07Review dataset show that the generator-based and sequential pattern-based models achieve almost the same accuracy. For example, with a minimum support of 2% and a minimum confidence of 75%, both the generator-based and the sequential pattern-based naïve Bayesian classifiers achieve the same best accuracy of 80.6%. Since generators can be mined more efficiently, the generator-based approach has an edge over the sequential pattern-based approach.

4. CONCLUSIONS

In this paper we studied the problem of mining sequence generators, which, to the best of our knowledge, has not been explored previously. We proposed two novel pruning methods and an efficient generator checking scheme, and devised a frequent generator mining algorithm, FEAT. An extensive performance study shows that FEAT is more efficient than the state-of-the-art sequential pattern mining algorithm PrefixSpan, and is very effective for classifying Web product reviews. In the future we will further explore its applications in Web page classification and clickstream data analysis.

5. ACKNOWLEDGEMENTS

This work was partly supported by the 973 Program under Grant No. 2006CB303103, and by the Program for New Century Excellent Talents in University under Grant No. NCET-07-0491, State Education Ministry of China.

6. REFERENCES

[1] R. Agrawal, R. Srikant. Mining sequential patterns. ICDE'95.
[2] J. Chen, T. Cook. Mining contiguous sequential patterns from Web logs. WWW'07 (Posters track).
[3] N. Jindal, B. Liu. Identifying comparative sentences in text documents. SIGIR'06.
[4] J. Li, et al. Minimum description length principle: Generators are preferable to closed patterns. AAAI'06.
[5] J. Pei, J. Han, et al. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01.
[6] J. Wang, J. Han. BIDE: Efficient mining of frequent closed sequences. ICDE'04.
[7] X. Yan, J. Han, R. Afshar. CloSpan: Mining closed sequential patterns in large datasets. SDM'03.