Pacific Symposium on Biocomputing 13:90-101(2008) September 24, 2007 22:4 Proceedings Trim Size: 9in x 6in main TRELLIS+: AN EFFECTIVE APPROACH FOR INDEXING GENOME-SCALE SEQUENCES USING SUFFIX TREES BENJARATH PHOOPHAKDEE AND MOHAMMED J. ZAKI Dept. of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, 12180 E-mail: {phoopb,zaki}@cs.rpi.edu With advances in high-throughput sequencing methods, and the corresponding exponential growth in sequence data, it has become critical to develop scalable data management techniques for sequence storage, retrieval and analysis. In this paper we present a novel disk-based suffix tree approach, called Trellis+, that effectively scales to massive amount of sequence data using only a limited amount of main-memory, based on a novel string buffering strategy. We show experimentally that Trellis+ outperforms existing suffix tree approaches; it is able to index genome-scale sequences (e.g., the entire Human genome), and it also allows rapid query processing over the disk-based index. Availability: TRELLIS+ source code is available online at http://www.cs.rpi.edu/ zaki/software/trellis 1. Intro duction Sequence data banks have been collecting and disseminating an exponentially increasing amount of sequence data. For example, the most recent release of GenBank contains over 77 Gbp (giga, i.e., 109 , base-pairs) from over 73 million sequence entries. Anticipated advances in rapid sequencing technology, applied to metagenomics (i.e., study of genomes recovered from environmental samples) or rapid, low-cost human genome sequencing, will yield a vast amount of short sequence reads. Individual genomes can also be enormous (e.g., the Amoeba dubia genome is estimated to be 670 Gbp a ). It is thus crucial to develop scalable data management techniques for storage, retrieval and analysis of complete and partial genomes. In this paper we fo cus on disk-based suffix trees as the index structure for effective massive sequence data management. Suffix trees have been used to efficiently solve a variety of problems in biological sequence analysis, such as exact and approximate sequence matching, repeat finding, and sequence assembly (via all pairs suffix-prefix matching) 9 , as well as anchor finding for genome alignment 4 . Suffix trees can be constructed in time and space linear in the sequence length 16 , provided the tree fits entirely in the main memory. A variety of efficient in-memory suffix tree construction algorithms have been proposed 8,6 . However, these algorithms do not scale up when the input sequence is extremely large. Several disk-based suffix tree algorithms have been proposed recently. Some of the approaches 11,12,15 completely abandon the use of suffix links This work was supported in part by NSF Career award IIS-0092978, and NSF grants EIA-0103708 and EMT-0432098. a Database of Genome Sizes: http://www.cbs.dtu.dk/databases/DOGS/ Pacific Symposium on Biocomputing 13:90-101(2008) September 24, 2007 22:4 Proceedings Trim Size: 9in x 6in main and sacrifice the theoretically superior linear construction time in exchange for a quadratic time algorithm with better locality of reference. Some approaches 11,12,2 also suffer from the skewed partitions problem. They build prefix-based partitions of the suffix tree relying on a uniform distribution of prefixes, which is generally not true for sequences in nature. This results in partitions of non-uniform size, where some are very small, and others are to o large to fit in memory. Methods that do not have the skew problem and that also maintain suffix links, have also been proposed 1,3 . However, these metho ds do not scale up to the human genome level. The only known suffix tree metho ds that can handle the entire human genome include TDD 15 and Trellis 13 . Trellis was shown to outperform TDD by over 3 times. However, these metho ds still assume that the input sequence can fit in memory, which limits their suitability for indexing massive sequence data. Other suffix trees variants 10 , and other disk-based sequence indexing structures like String B-trees 7 and external suffix arrays 5,14 have also been proposed to handle large sequences. A comparison between TDD 15 and the DC3 5 metho d for disk-based suffix arrays suggests that TDD is twice as fast 15 . In this paper we present a novel disk-based suffix tree indexing algorithm, called Trellis+, for massive sequence data. Trellis+ effectively handles genome-scale sequences and beyond with only a limited amount of main-memory. We show that Trellis+ is over twice as fast as Trellis, especially with restricted amount of memory. Trellis+ is able to index the entire human genome (approx. 3Gbp) in about 11 hours, using only 512MB of memory, and on average queries take under 0.06 seconds, over various query lengths. To the best of our knowledge these are the fastest reported time with such a limited amount of main-memory. 2. Preliminary Concepts Let denote a set of characters (or the alphabet), and let || [0,2] [1,2] [2,2] [6,6] ACG CG G $ denote its cardinality. Let be the set of all possible strings 1 2 3 6 (or sequences) that can be constructed using . Let $ be [3,6] [6,6] [3,6] [6,6] [3,6] [6,6] the terminal character, used to ACG $ ACG $ ACG $ mark the end of a string. Let S = s0 s1 s2 . . . sn-1 be the in0 3 1 4 2 5 put string where S and its length |S | = n. The ith suf- Figure 1. Suffix tree TS for S = ACGACG$. fix of S is represented as Si = si si+1 si+2 . . . sn-1 . For convenience, we append the terminal character to the string, and refer to it by sn . The suffix tree of the string S , denoted as TS , stores all the suffixes of S in a tree structure, where suffixes that share a common prefix lie on the same path from the ro ot of the tree. A suffix tree has two kinds of nodes: internal and leaf no des. An internal node in the suffix tree, except the root, 0 Pacific Symposium on Biocomputing 13:90-101(2008) September 24, 2007 22:4 Proceedings Trim Size: 9in x 6in main has at least 2 children, where each edge to a child begins with a different character. Since the terminal character is unique, there are as many leaves in the suffix tree as there are suffixes, namely n + 1 leaves (counting $ as the "empty" suffix). Each leaf node thus corresponds to a unique suffix Si . Let (v ) denote the substring obtained by concatenating all characters from the ro ot to no de v . Each internal node v also maintains a suffix link to the internal no de w, where (w) is the immediate suffix of (v ). A suffix tree example is given in Fig. 1; circles represent internal nodes, square no des denote leaves, and dashed lines indicate suffix links. Internal nodes are labeled in depth-first order, and leaf nodes are labeled by the suffix start position. The edges are also shown in the encoded form, giving the start and end positions of the edge label. 3. The Basic Trellis+ Approach Trellis+ follows the same overall approach as Trellis 13 . Let S denote the input sequence, which may be a single genome, or the string obtained by concatenating many sequences. Trellis+ follows a partitioning and merging approach to build a disk-based suffix tree. The main idea is to maintain a complete suffix tree as a collection of several prefix-based subtrees. Trellis+ has three main steps: i) prefix creation, ii) partitioning, and iii) merging. In the prefix crea) Sequence Partitioning R0 R r-1 R1 ation phase Trellis+ creates a list of variableb) Suffix trees T TR TR length prefixes {P0 , P1 ,R for Ri ˇ ˇ ˇ , Pm-1 }. Each TR ,P TR ,P prefix Pi is chosen TP so that its frequency c) Sub-trees for d) Merging in the input string prefix Pj in Ri TR ,P DISK TR ,P S do es not exceed a maximum frequency threshold, tm , deterFigure 2. Overview of Trellis+ mined by the mainmemory limit, which guarantees that the prefix-based sub-tree TPi , composed of all the suffixes beginning with Pi as a prefix, will fit in the available main-memory. The variable prefix set is computed iteratively; in each iteration prefixes up to a given length are counted (those that exceed the frequency threshold tm in the last iteration). + In the partitioning phase, the input string S is split into r = ntp 1 segments (Fig. 2, step a), where n = |S | and tp is the segment size threshold, chosen so that the resulting suffix tree TRi for each segment Ri (Fig. 2, step b) fits in main-memory. Note that TRi contains all the suffixes of S that start only in segment Ri ; TRi is constructed using the in-memory Ukkonen's algorithm 16 . Each resulting suffix tree TRi from a given segment is further split into smaller subtrees TRi ,Pj (Fig. 2, step c), that share a common S 0 1 r-1 0 0 1 m-1 j 0 j r-1 j Pacific Symposium on Biocomputing 13:90-101(2008) September 24, 2007 22:4 Proceedings Trim Size: 9in x 6in main prefix Pj , which are then stored on the disk. After pro cessing all segments Ri , in the merging phase, Trellis+ merges all the subtrees TRi ,Pj for each prefix Pj from the different partitions Ri into a merged suffix subtree TPj (Fig. 2, step d). Note that TPj is guaranteed to fit in memory due to the choice of tm threshold. The merging for a given prefix Pj proceeds in steps; at each stage i, let Mi denote the current merged tree obtained after processing subtrees TR0 ,Pj through TRi ,Pj for segments R0 through Ri . In the next step we merge TRi+1 ,Pj from segment Ri+1 with Mi to obtain Mi+1 , and so on (for i [0, r - 1]). The merging is done recursively in a depth-first manner, by merging labels on all child edges, from the root to the leaves. The final merged tree Mr-1 is the full prefixed suffix tree TPj , which is then stored back on the disk. The complete suffix tree is simply a forest of these prefix-based subtrees (TPj ). Note that Trellis+ has an optional suffix link recovery phase, but we omit its description due to space limitations; see 13 for additional details. 4. Trellis+: Optimizations for Massive Sequences In this section, we intro duce two optimizations to the original Trellis. The first optimization is based on a simple observation that larger suffix subtrees can be created in the partitioning phase under the same memory restriction. As a result, there is less disk management overhead, and fewer merge operations are required, speeding up the algorithm. The second optimization is a novel string buffering strategy. The buffer is based on several techniques, which together remove the limitation of Trellis that requires the input sequence to fit entirely in memory. This means Trellis+ can index sequences that are much larger than the available memory. 4.1. Larger Segment Size Trellis+ uses two thresholds, tp and tm , to ensure that the suffix subtrees for a given segment TRi and a given prefix TPj , respectively, can fit in memory. Let |S | = n be the sequence length, M be the available mainmemory (in bytes), and let si and sl be the size of an internal and leaf no de. Typically, the number of internal nodes in the suffix tree is about 0.8 times the number of leaf nodes. During the partitioning phase, the sequence corresponding to the segment Ri is kept in memory in a compressed form, costing tp /4 bytes space (since we use 2 bits to encode each of the 4 DNA bases). Since TRi has tp leaf nodes and 0.8tp internal nodes, tp is chosen to satisfy the following equation: M tp + (0.8si + sl )tp = tp 1 (1) M 4 4 + (0.8si + sl ) During the merging phase, we use the threshold tm to ensure that TPj can fit in memory. Tpj has tm leaf and 0.8tm internal nodes. Additionally, new internal no des, on the order of 0.6tm , are created during the edge merge Pacific Symposium on Biocomputing 13:90-101(2008) September 24, 2007 22:4 Proceedings Trim Size: 9in x 6in main operations. Furthermore, since all segments can be accessed, we would need to keep the entire input string S in memory, taking up space n/4 bytes (this limitation will be removed in Sec. 4.2). Thus tm is chosen to satisfy the following equation: M M-n n 4 + (0.8si + sl + 0.6si )tm = tm 4 (1.4si + sl ) (2) Trellis uses a global threshold t = min(tp , tm ) to control the overall memory usage. However, note that tm is always smaller than tp (since t n), and this means that as the input sequence length increases, Trellis must cho ose smaller and smaller thresholds, resulting in a corresponding increase in the number of segments, degrading the overall index construction time. Our first optimization is based on a simple but effective observation that the partitioning phase need not use the global t threshold. Trellis+ uses the larger tp value for the partitioning phase, since Eq. (1) already guarantees that TRi will fit in M bytes. For the merging phase Trellis+ uses the smaller tm value given by Eq. (2) to guarantee that each TPj fits under M . This means that Trellis+ uses fewer, larger partitions, resulting in fewer tree merge operations, and fewer disk I/O operations, yielding faster overall running times. Note however that there is no difference in the number of variable length prefixes, since the same threshold t = tm is used. 4.2. The String Buffer During the partitioning phase Trellis+ needs to keep the current input string segment Ri in memory. However, for the merging phase, without any optimization, Trellis+ would require the entire input string in memory. To remove this memory bottleneck, Trellis+ uses a novel string buffering technique, which requires only a small amount of memory to be assigned to the input string during the merging phase, thus enabling Trellis+ to scale to extremely large sequences. The string buffering strategy relies on several different techniques, each uniquely important because of its impact on the buffer hit rate. The basic idea behind the buffer design is to keep the characters most likely to be accessed in memory, and to load the rest from disk as needed. 4.2.1. Edge Index Shifting The goal of the index shifting technique is to restrict the character accesses during the merging phase to a small region of the input sequence. This small region of the input string can then be kept in memory as a part of the string buffer, hence increasing the buffer hit rate. Recall that a suffix tree edge is represented by two indexes, [start, end], denoting its edge label S [start . . . end]. The basic observation is that these indexes need not be unique so long as they denote the same string label. Pacific Symposium on Biocomputing 13:90-101(2008) September 24, 2007 22:4 Proceedings Trim Size: 9in x 6in main For example, an edge with 1e+06 label "AT " may use the in100000 dexes [0, 1] or [1000, 1001] to 10000 enco de its label, as long as S [0] = A, S [1] = T , and 1000 S [1000] = A, S [1001] = T . An100 other important observation is 10 that the edge lengths between 1 1 10 100 1000 10000 two internal no des, i.e., interEdge Length nal edge lengths, are generally Figure 3. Distribution of internal edge lengths short. For example, using Human Chromosome I (approx. 200Mbp), we found that most internal edge lengths fall between 1 and 25 characters, and the ma jority are only a few characters long (the mean length is only 6.7), as shown in Fig. 3. Frequency 1e+07 (a) Figure 4. (b) (a) Index Shifting, (b) Percentage of Indexes Shifted To implement the index shifting technique, a small "guide" suffix tree is independently maintained, built from the first 2Mbp of Human Chromosome I. Prior to writing each internal edge in any subtree TRi to the disk, we search for its string label in the guide suffix tree. If found, we switch the edge's current indexes to the indexes found in the guide tree. The edge index shifting is illustrated in Fig. 4(a); here, two edges from the partition R50 have their edge indexes shifted to indexes at the beginning of the input string. Based on the data from all the partitions for the complete Human genome (using 512MB memory), as shown in Fig. 4(b), we found that on average 97% of the internal edge label indexes can be shifted to the range [0 . . . 2 × 106 ) via this optimization. This behavior is not entirely surprising, since the genome contains many short repeats, most of which are likely to have been encountered in the first 2Mbp segment of the genome (which is confirmed by Fig. 4(b)). In addition to the guide tree, the string S [0 . . . 2 × 106 ) is also stored in the memory (requiring 0.5MB space after compression) as part of the string buffer because it will be heavily accessed during the merging step. The guide suffix tree requires about 70MB mem- Pacific Symposium on Biocomputing 13:90-101(2008) September 24, 2007 22:4 Proceedings Trim Size: 9in x 6in main ory. Furthermore, as mentioned previously, additional internal nodes are created during the subtree merging phase. Trellis+ also shifts these indexes to be in the range [0 . . . 2 × 106 ). 4.2.2. Buffering Internal Edge Labels Fig. 4(b) shows that approximately 3% of the internal edge labels are still not found in the guide suffix tree. These leftover pairs of internal edge indexes are recorded during the partitioning phase whenever index shifting cannot be applied. Then, during the merging phase, the substrings corresponding to these index ranges are loaded directly into the main memory. These strings are also compressed using 2 bits per character. In all of our experiments (even for the complete human genome), the memory required to keep these substrings consumes at most 20MB. 4.2.3. Buffering Current Segment Subtrees TRi ,Pj are always merged starting from segment R0 to the last partition Rr-1 for each prefix Pj . When the ith subtree is being merged with the intermediate merged prefix-subtree Mi-1 (from partitions R0 through Ri-1 ), the substring from partition Ri is more heavily accessed than those of the previous partitions. Based on this observation, Trellis+ always keeps the string corresponding to the current partition Ri in memory, which tp requires 4 bytes of space. 4.2.4. Leaf Edge Label Encoding The index shifting optimization can only be applied to internal nodes, and not to the leaf no des, since the leaf edge lengths are typically an order of magnitude longer than internal node edge lengths. Nevertheless, we observed that generally only a few characters from the beginning of the leaf edges are accessed during merging (before a mismatch occurs). This is because leaves are relatively deep in the tree and lengthy exact matches do not o ccur to o frequently. Therefore, merging does not require too many leaf character accesses. To guarantee that the more frequently accessed characters are readily in memory, we allow 64 bits to store the first 29 characters (which require 58 bits, with 2 bits per character) of each leaf label. The last 6 bits are used as an offset to denote the number of current valid characters for the leaf edge. Initially all 29 characters are valid, but characters towards the end become invalid if an internal node is created as a result of merging the leaf edge with another edge. The encoded strings are stored with their respective leaf nodes, and not actually in the memory buffer. Since disk accesses are expensive, the encoded strings are loaded on an as needed basis (we found that 15 ­ 35% of leaves are not accessed at all during the merge). The memory required for leaf edge label encoding is at most 8tm bytes per prefix. We found that about 93 ­ 97% of leaf characters accessed during the merge can be found using the encoded labels. Pacific Symposium on Biocomputing 13:90-101(2008) September 24, 2007 22:4 Proceedings Trim Size: 9in x 6in main 4.2.5. String Buffer Summary As for the rest of the characters that are a buffer miss (i.e., not captured by any of the above optimizations), they are directly read from the disk. We found that the input sequence disk access pattern resulting from the buffer misses during the merge has very poor locality of reference, i.e., it is almost completely random, with the exception that short consecutive range of characters are accessed together. These short ranges represent the labels of the edges being merged. Therefore, we keep a small label buffer of size 256KB to store the characters that require a direct disk access: each disk read fetches 256KB consecutive characters at a time. The total amount of memory required for all of the optimization constituting the string buffer can be calculated by adding the amounts of memory required for each technique: 0.5MB for the index shifting, 70MB for the tp guide tree, 20MB for buffering internal edge labels, 4×106 MB for buffering 8tm current segment, 106 MB for leaf edge label encoding, and 0.25MB for the small label buffer. The total string buffer size is thus well under 100MB, using 512MB memory limit (using Eqs.(1) and (2) to compute tp and tm ). Note that like Trellis, Trellis+ has O(n) space and O(n2 ) time complexity in the worst case, due to the O(n2 ) worst-case merging phase time. In practice the running time is O(n log n); see 13 for a detailed complexity analysis of Trellis. 5. Exp eriments We now present an experimental study on the performance of Trellis+. We compare Trellis+ only against Trellis since we showed 13 that Trellis outperforms other disk-based suffix methods like TDD 15 , DynaCluster 3 , TOPQ 1 and so on. TDD 15 was in turn shown to have much better performance than the Hunt's method 11 , and even a state-of-the-art suffix array metho d, DC3 5 . Note that we were not able to compare with ST-Merge 15 (an extension of TDD, designed to scale to sequences larger than memory), since its implementation is not currently available from its authors. All experiments were performed on an Apple Power Mac G5 machine with 2.7GHz pro cessor, 512KB cache, 4GB main-memory, and 400GB disk space. The maximum amount of main-memory usage across all experiments was restricted to 512MB; this memory limit applies to all internal data structures including those for the suffix tree, memory buffers and the input string. Both Trellis+ and Trellis were compiled with the GNU g++ compiler v. 3.4.3 and were run in 32-bit mode; they produce identical suffix trees. The sequence data used in all experiments are segments of the human genome ranging from size 200Mbp to 2400Mbp, as well as the entire human genome. To study the effects of the two optimizations, we denote by Trellis+nb the version of Trellis+ that only has the large segment size optimization but no string buffer, and we denote by Trellis+b, the version that has both the larger segment and string buffer optimizations. Pacific Symposium on Biocomputing 13:90-101(2008) September 24, 2007 22:4 Proceedings Trim Size: 9in x 6in main 5.1. Effect of Larger Segment Size Here we study the effects of the larger segment size, without the string buffer. Trellis+nb has larger and therefore fewer partitions than Trellis, since for Trellis the number of partitions is O( tn ) and the value of tp p decreases as the sequence length n increases, resulting in many partitions (as shown in Fig. 5(a)). Therefore, when indexing a very large sequence, the performance of Trellis suffers when tp is small, because of a large number of partitions. In contrast, since the partitioning threshold tp for Trellis+nb remains constant regardless of n, its number of partitions increases at a much slower rate, as shown in Fig. 5(b). 1e+07 1800 1600 1400 Partitioning Phase Threshold (tp) 9e+06 8e+06 TRELLIS TRELLIS+B TRELLIS+NB 6e+06 5e+06 4e+06 3e+06 2e+06 TRELLIS+NB/B TRELLIS 600 1000 1400 1800 Sequence Length (Mbp) # Partitions 7e+06 1200 1000 800 600 400 200 0 200 600 1000 1400 1800 2200 1e+06 200 Sequence Length (Mbp) (a) Partitioning Threshold (tp ) Figure 5. 700 600 TRELLIS TRELLIS+B TRELLIS+NB (b) Number of Partitions Effect of Larger Segment Size on Partitioning Phase 500 TRELLIS 450 TRELLIS+B TRELLIS+NB 400 350 300 250 200 150 100 50 0 200 400 TRELLIS 350 TRELLIS+B TRELLIS+NB 300 250 200 150 100 50 600 1000 1400 1800 2200 0 200 600 1000 1400 1800 2200 Partitioning Time (mins) Total Time (mins) 500 400 300 200 100 0 200 600 1000 1400 1800 2200 Sequence Length (Mbp) Sequence Length (Mbp) Merging Time (mins) Sequence Length (Mbp) (a) Total Running Time (mins) Figure 6. (b) Partitioning Time Running Time Comparison (c) Merging Time The timings of Trellis+nb in comparison to Trellis are shown in Figs. 6(a), 6(b), and 6(c), which show the total time, partitioning phase time, and merging phase time for Trellis+nb versus Trellis, as we increase the sequence length from 200Mbp to 1.8Gbp. We find that Trellis+nb consistently outperforms Trellis, especially when the input sequence size is much larger than the available memory (which is only 512MB). For example, Trellis+nb is about twice as fast as Trellis for the 1.8Gbp input sequence. This is directly a consequence of the larger, Pacific Symposium on Biocomputing 13:90-101(2008) September 24, 2007 22:4 Proceedings Trim Size: 9in x 6in main fewer partitions used by Trellis+nb, which result in a much faster partitioning phase (see Fig. 6(b)). The impact of larger segment sizes on the merging phase is not much (see Fig. 6(c)), but Trellis+nb still has faster merge times, since there are fewer partitions to be merged for each prefixbased subtree TPj . 100 90 4500 4000 Buffer Hit Percentage Merge Time (sec) 80 70 60 50 40 30 20 0 5 10 Partition Number 15 20 ALL SI+SM+BI SI+SM SI+BI SI 3500 3000 2500 2000 1500 1000 500 SI BI S+ Optimizations +S SI M +S SI M AL L +B I (a) Buffer Hit Rate Figure 7. (b) Buffer Optimizations Times Effect of String Buffer Optimizations 5.2. Effect of String Buffer We now investigate the effect of the string buffering strategy. First we report the difference in the buffer hit rate and merging phase time for Trellis+b using the different combinations of buffering optimizations. Fig. 7(a) shows the buffer hit rate for all the characters accessed during the subtree merging operations, using as input string Human Chromosome I (with length approx. 200Mbp), with the 512MB memory limit. The hit rates are shown only for the first 20 partitions, but the same trend continues for the remaining partitions. In the figure, SI denotes the internal edge index shifting, SM denotes index shifting during merge phase, BI denotes buffering internal labels, and ALL denotes all the buffering optimizations. We can clearly see that internal edge index shifting alone yields a buffer hit rate of over 50%. Combination of optimizations yield higher hit rates, so that when all the optimization are combined we achieve a buffer hit rate of over 90%. Fig. 7(b) shows effect of the improved buffer hit rates on the running time of the merging phase in Trellis+b. All the optimizations results in a four-fold decrease in time. Comparing the total running time, and the times for the partitioning and merging phases (shown in Figs. 6(a), 6(b), and 6(c)), we find that initially Trellis+nb (that does not use the string buffer) outperforms Trellis+b (that uses string buffer). However, as the input sequence becomes much larger, Trellis+nb is left with less memory to construct the tree, because it has to maintain the entire compressed input string in memory. Consequently, beyond a certain sequence length, Trellis+b starts to outperform Trellis+nb. In fact, without string buffer, we were not able to run Trellis+nb on an input of size larger than 1.8Gbp, whereas with the string buffer Trellis+b can construct the disk-based suffix tree for Pacific Symposium on Biocomputing 13:90-101(2008) September 24, 2007 22:4 Proceedings Trim Size: 9in x 6in main the entire Human genome. For a 2.4Gbp sequence, Trellis+b took about 8.3 hrs (500 mins, as shown in Fig. 6(a)), and for the ful l Human genome (with over 3Gbp length), Trellis+b finished in about 11 hours using only 512MB memory! b 7e+06 4000 Merging Phase Threshold (tm) # Variable Length Prefixes 6e+06 5e+06 4e+06 3e+06 2e+06 TRELLIS TRELLIS+NB TRELLIS+B 600 1000 1400 1800 2200 Sequence Length (Mbp) 3500 3000 2500 2000 1500 1000 500 TRELLIS TRELLIS+NB TRELLIS+B 1e+06 200 0 200 600 1000 1400 1800 2200 Sequence Length (Mbp) (a) Merging Threshold (tm ) Figure 8. (b) Number of Prefixes Effect on the Merging Threshold and Number of Variable Length Prefixes Fig. 8(a) shows the merging phase threshold tm , and Fig. 8(b) shows the number of variable-length prefixes for Trellis+b and Trellis+nb. Since Trellis+nb has to retain the entire input string in memory during the merging phase, with increasing sequence length Trellis+nb has less amount of memory remaining, resulting in smaller tm and many more prefixes. On the other hand, for Trellis+b the number of prefixes grows very slowly. Overall, as shown in Figs. 6(b) and 6(c), the suffix buffer allows Trellis+b to scale gracefully for sequence much larger than the available memory, whereas Trellis+nb could not run for an input string longer than 1.8Gbp (with 512MB memory). 5.3. Query Times Query Time (secs) Query Times on the Human Genome 0.065 We now briefly discuss the query 0.06 time performance on the disk-based suffix tree created by Trellis+ on the entire human genome (which 0.055 o ccupies about 71GB on disk). 500 queries of different lengths ranging 0.05 from 40bp to 10,000bp were genQuery Length (bp) erated from random starting positions in the human genome. FigFigure 9. Average Query Times ure 9 shows the average query times over the 500 random queries for each query length (using 2GB memory). The average query time for even the longest query (with length 10,000bp) took under 0.06s, showing the effectiveness of disk-based suffix tree indexing in terms of the query performance (see 13 for more details). b We showed earlier 13 that Trellis can index the entire human genome in ab out 4 hours with 2GB memory. 60 40 10 80 0 00 10 00 80 00 60 00 40 00 20 00 10 0 80 0 60 0 40 0 20 0 Pacific Symposium on Biocomputing 13:90-101(2008) September 24, 2007 22:4 Proceedings Trim Size: 9in x 6in main 6. Conclusion In this paper we have presented effective optimization strategies which enable Trellis+ to handle genome-scale sequences, using only a limited amount of main memory. Trellis+ is suitable for indexing entire genomes, or massive amounts of short sequence read data, such as those resulting from cheap genome sequencing and metagenomics pro jects. For the latter case, we simply concatenate all the short reads into a single long sequence S and index it. In addition we maintain an auxiliary index on disk that allows one to lo ok up for each suffix position Si , the corresponding sequence id, and offset into the short read. Using all pairs suffix-prefix matching 9 , our disk based suffix tree index can enable rapid sequence assembly, and can also enable other next generation sequence analysis applications. References 1. S.J. Bedathur and J.R. Haritsa. Engineering a fast online p ersistent suffix tree construction. In 20th Int'l Conference on Data Engineering, 2004. 2. A.L. Brown. Constructing genome scale suffix trees. In 2nd Asia-Pacific Bioinformatics Conference, 2004. 3. C.-F. Cheung, J.X. Yu, and H. Lu. Constructing suffix tree for gigabyte sequences with megabyte memory. IEEE Transactions on Know ledge and Data Engineering, 17(1):90­105, 2005. 4. A.L. Delcher, A. Phillippy, J. Carlton, and S.L. Salzb erg. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Research, 30(11):2478­2483, 2002. 5. R. Dementiev, J. K¨rkk¨inen, J. Mehnert, and P. Sanders. Better external aa memory suffix array construction. In Workshop on Algorithm Engineering and Experiments, 2005. 6. M. Farach-Colton, P. Ferragina, and S. Muthukrishnan. On the sortingcomplexity of suffix tree construction. J. of the ACM, 47(6):987­1011, 2000. 7. P. Ferragina and R. Grossi. The string B-tree: a new data structure for string search in external memory and its applications. Journal of the ACM, 46(2):236­280, 1999. 8. R. Giegerich, S. Kurtz, and J. Stoye. Efficient implementation of lazy suffix trees. Software Practice & Experience, 33(11):1035­1049, 2003. 9. D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997. 10. K. Heumann and H. W. Mewes. The hashed p osition tree (HPT): A suffix tree variant for large data sets stored on slow mass storage devices. In 3rd South American Workshop on String Processing, 1996. 11. E. Hunt, M.P. Atkinson, and R.W. Irving. A database index to large biological sequences. In 27th Int'l Conference on Very Large Data Bases, 2001. 12. R. Japp. The top-compressed suffix tree: A disk-resident index for large sequences. In BNCOD Bioinformatics Workshop, 2004. 13. B. Phoophakdee and M. J. Zaki. Genome-scale disk-based suffix tree indexing. In ACM SIGMOD Int'l Conference on Management of Data, 2007. 14. K. Sadakane and T. Shibuya. Indexing huge genome sequences for solving various problems. Genome Informatics, 12:175­183, 2001. 15. Y. Tian, S. Tata, R.A. Hankins, and J.M. Patel. Practical methods for constructing suffix trees. VLDB Journal, 14(3):281­299, 2005. 16. E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3), 1995.