Practical methods for constructing suffix trees

Tian, Yuanyuan; Tata, Sandeep; Hankins, Richard A.; Patel, Jignesh M.

doi:10.1007/s00778-005-0154-8

Practical methods for constructing suffix trees

Regular Paper
Published: 26 September 2005

Volume 14, pages 281–299, (2005)
Cite this article

The VLDB Journal Aims and scope Submit manuscript

Yuanyuan Tian¹,
Sandeep Tata¹,
Richard A. Hankins² &
…
Jignesh M. Patel¹

244 Accesses
58 Citations
9 Altmetric
Explore all metrics

Abstract

Sequence datasets are ubiquitous in modern life-science applications, and querying sequences is a common and critical operation in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of queries on sequence datasets, including evaluating exact and approximate string matches, and finding repeat patterns. However, methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not fit in the available main memory. Even when the suffix tree fits in memory, it turns out that the processor cache behavior of theoretically optimal suffix tree construction methods is poor, resulting in poor performance. Currently, there are a large number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios are not well characterized.

In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n²) worst-case complexity outperforms popular linear time algorithms like Ukkonen and McCreight, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we describe two approaches. First, we present a buffer management strategy for the O(n²) algorithm. The resulting new algorithm, which we call “Top Down Disk-based” (TDD), scales to sizes much larger than have been previously described in literature. This approach far outperforms the best known disk-based construction methods. Second, we present a new disk-based suffix tree construction algorithm that is based on a sort-merge paradigm, and show that for constructing very large suffix trees with very little resources, this algorithm is more efficient than TDD.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

References

Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2, 53–86 (2004)
Article Google Scholar
Aluru, S.: Suffix Trees and Suffix Arrays, Handbook of Data Structures and Applications. CRC Press (2004)
Andersson, A., Nilsson, S.: Efficient implementation of suffix trees. Software: Pract Exp. 25(2), 129–141 (1995)
Google Scholar
Apostolico, A., Szpankowski, W.: Self-alignments in words and their applications. J. Algorithms 13(3), 446–467 (1992)
Article Google Scholar
Apweiler, R., Bairoch, A., Wu, C.H., Barker, W., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M.J., Natale, D., O'Donovan, A.C., Redaschi, N., Yeh, L.L.: Uniprot: The universal protein knowledgebase. Nucl. Acids Res. 32(D), 115–119 (2004)
Article PubMed Google Scholar
Atkinson, M., Jordan, M.: Providing orthogonal persistence for java. In Proceedings of the 12th European Conference on Object-Oriented Programming, pp. 383–395 (1998)
Bedathur, S.J., Haritsa, J.R.: Engineering a fast online persistent suffix tree construction. In: Proceedings of the 20th International Conference on Data Engineering, pp. 720–731 (2004)
Bentley, J.L., Sedgewick, R.: Fast algorithms for sorting and searching strings. In Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 360–369 (1997)
Blumer, A., Ehrenfeucht, A., Haussler, D.: Average sizes of suffix trees and DAWGs. Discrete Applied Mathematics 24(1), 37–45 (1989)
Article Google Scholar
Carvalho, A., Freitas, A., Oliveira, A., Sagot, M.-F.: A parallel algorithm for the extraction of structured motifs. In Proceedings of the 2004 ACM Symposium on Applied Computing, pp. 147–153 (2004)
Cheng, L.-L., Cheung, D., Yiu, S.-M.: Approximate string matching in DNA sequences. In: Proceeings of the 8th International Conference on Database Systems for Advanced Applications, pp. 303–310 (2003)
Cheung, C.-F., Yu, J.X., Lu, H.: Constructing suffix tree for gigabyte sequences with megabyte memory. IEEE Transactions on Knowledge and Data Engineering 17(1), 90–105 (2005)
Article Google Scholar
Clifford, R., Sergot, M.J.: Distributed and paged suffix trees for large genetic databases. In: Proceedings of 14th Annual Symposium on Combinatorial Pattern Matching, pp. 70–82, 2003.
Crauser, A., Ferragina, P.: A theoretical and experimental study on the construction of suffix arrays in external memory and its applications. Algorithmica 32(1), 1–35 (2002)
Article MathSciNet Google Scholar
Deep-Shallow Suffix Array and BWT Construction Algorithms. http://www.mfn.unipmn.it/~manzini/lightweight/.
Delcher, A., Kasif, S., Fleischmann, R., Peterson, J., White, O., Salzberg, S.: Alignment of whole genomes. Nucleic Acids Res. 27(11), 2369–2376 (1999)
Article PubMed Google Scholar
Delcher, A., Phillippy, A., Carlton, J., Salzberg, S.: Fast algorithms for large-scale genome alignment and comparision. Nucleic Acids Res. 30(11), 2478–2483 (2002)
Article PubMed Google Scholar
Dementiev, R., Kärkkäinen, J., Mehnert, J., Sanders, P.: Better external memory suffix array construction. In: Proceedings of the 7th Workshop on Algorithm Engineering and Experiments (2005)
Devroye, L., Szpankowski, W., Rais, B.: A note on the height of suffix trees. SIAM J. Comput. 21(1), 48–53 (1992)
Article Google Scholar
External Memory Suffix Array Construction Project. http://i10www.ira.uka.de/dementiev/esuffix/docu/index.html.
Farach, M.: Optimal suffix tree construction with large alphabets. In: Proceedings of the 38th Annual Symposium on Foundations of Computer Science, pp. 137–143. IEEE Computer Society (1997)
Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction. Journal of The ACM 47(6), 987–1011 (2000)
Article Google Scholar
GenBank, NCBI, 2004. http://www.ncbi.nlm.nih.gov/GenBank.
Giegerich, R., Kurtz, S.: From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction. Algorithmica 19(3), 331–353 (1997)
Google Scholar
Giegerich, R., Kurtz, S., Stoye, J.: Efficient implementation of lazy suffix trees. Soft. Pract. Exp. 33(11), 1035–1049 (2003)
Article Google Scholar
Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press (1997)
Heumann, K., Mewes, H.-W.: The hashed position tree (HPT): A suffix tree variant for large data sets stored on slow mass storage devices. In: Proceedings of the 3rd South American Workshop on String Processing, pp. 101–115 (1996)
Hunt, E., Atkinson, M.P., Irving, R.W.: A database index to large biological sequences. The VLDB J. 7(3), 139–148 (2001)
Google Scholar
Intel Corporation. The IA-32 Intel Architecture Optimization Reference Manual. Intel (Order Number 248966) (2004)
Intel Corporation. The IA-32 Intel Architecture Software Developer's Manual: System Programming Guide, vol. 3. Intel (Order Number 253668) (2004)
Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In: Proceedings of the 13th International Conference on Automata, Languages and Programming, pp. 943–955 (2003)
Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, pp. 181–192 (2001)
Kim, D.K., Sim, J.S., Park, H., Park, K.: Linear-time construction of suffix arrays. In: Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching, pp. 186–199 (June 2003)
Ko, P., Aluru, S.: Space efficient linear-time construction of suffix arrays. In: Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching, pp. 200–210 (June 2003)
Kurtz, S.: Reducing space requirement of suffix trees. Soft: Pract. Exp. 29(13), 1149–1171 (1999)
Article Google Scholar
Kurtz, S., Choudhuri, J.V., Ohlebusch, E., Schleiermacher, C., Stoye, J., Giegerich, R.: REPuter: The manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res. 29, 4633–4642 (2001)
Article PubMed Google Scholar
Kurtz, S., Phillippy, A., Delcher, A., Smoot, M., Shumway, M., Antonescu, C., Salzberg, S.: Versatile and open software for comparing large genomes. Genome Bio. 5(R12) (2004)
Manzini, G.: Two space saving tricks for linear time LCP array computation. In Proceedings of the 9th Scandinavian Workshop on Algorithm Theory, pp. 372–383 (2004)
Manzini, G., Ferragina, P.: Engineering a lightweight suffix array construction algorithm. Algorithmica 40(1), 33–50 (2004)
Article Google Scholar
McCreight, E.M.: A space-economical suffix tree construction algorithm. Journal of The ACM 23(2), 262–272 (1976)
Article Google Scholar
Meek, C., Patel, J.M., Kasetty, S.: Oasis: An online and accurate technique for local-alignment searches on biological sequences. In Proceedings of 29th International Conference on Very Large Data Bases, pp. 910–921 (2003)
Navarro, G., Baeza-Yates, R., Tariho, J.: Indexing methods for approximate string matching. IEEE Data Eng. Bull. 24(4), 19–27 (2001)
Google Scholar
Patel, J.M.: The role of declarative querying in bioinformatics. OMICS: J. Integr. Biol. 7(1), 89–92 (2003)
Article Google Scholar
Pettersson, M.: Perfctr: Linux performance montioring counters driver. http://user.it.uu.se/~mikpe/linux/perfctr.
Project Gutenberg. http://www.gutenberg.net.
Schurmann, K.-B., Stoye, J.: Suffix-tree construction and storage with limited main memory. Technical Report 2003-06, Univeristy of Bielefeld, Germany (2003)
STXXL Library. http://i10www.ira.uka.de/dementiev/stxxl.shtml.
Szpankowski, W.: Average-Case Analysis of Algorithms on Sequences. John Wiley and Sons (2001)
Tata, S., Hankins, R.A., Patel, J.M.: Practical suffix tree construction. In: Proceedings of 30th International Conference on Very Large Data Bases, pp. 36–47 (2004)
The Growth of GenBank, NCBI, 2004. http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html.
The MUMmer Software. http://www.tigr.org/software/mummer/.
Ukkonen, E.: Constructing suffix-trees on-line in linear time. In Proceedings of the IFIP 12th World Computer Congress on Algorithms, Software, Architecture: Information Processing, pp. 484–492 (1992)
Vitter, J.S., Shriver, M.: Algorithms for parallel memory: Two-level memories. Algorithmica 12, 110–147 (1994)
Article Google Scholar
Weiner, P.: Linear pattern matching algorithms. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory, pp. 1–11 (1973)
Yona, S., Tsadok, D.: ANSI C implementation of a suffix tree. http://cs.haifa.ac.il/~shlomo/suffix_tree.

Download references

Author information

Authors and Affiliations

University of Michigan, 1301 Beal Avenue, Ann Arbor, MI, 48109-2122, USA
Yuanyuan Tian, Sandeep Tata & Jignesh M. Patel
Microarchitecture Research Lab, Intel Corp., 2200 Mission College Blvd, Santa Clara, CA, 95054, USA
Richard A. Hankins

Authors

Yuanyuan Tian
View author publications
You can also search for this author in PubMed Google Scholar
Sandeep Tata
View author publications
You can also search for this author in PubMed Google Scholar
Richard A. Hankins
View author publications
You can also search for this author in PubMed Google Scholar
Jignesh M. Patel
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Richard A. Hankins.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Tian, Y., Tata, S., Hankins, R.A. et al. Practical methods for constructing suffix trees. The VLDB Journal 14, 281–299 (2005). https://doi.org/10.1007/s00778-005-0154-8

Download citation

Received: 14 October 2004
Accepted: 01 July 2005
Published: 26 September 2005
Issue Date: September 2005
DOI: https://doi.org/10.1007/s00778-005-0154-8

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Practical methods for constructing suffix trees

Abstract

Access this article

Similar content being viewed by others

Longest Common Substring with Approximately k Mismatches

Stratified random sampling from streaming and stored data

The Clustal Omega Multiple Alignment Package

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Practical methods for constructing suffix trees

Abstract

Access this article

Similar content being viewed by others

Longest Common Substring with Approximately k Mismatches

Stratified random sampling from streaming and stored data

The Clustal Omega Multiple Alignment Package

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation