Skip to main content
Log in

Practical methods for constructing suffix trees

  • Regular Paper
  • Published:
The VLDB Journal Aims and scope Submit manuscript

Abstract

Sequence datasets are ubiquitous in modern life-science applications, and querying sequences is a common and critical operation in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of queries on sequence datasets, including evaluating exact and approximate string matches, and finding repeat patterns. However, methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not fit in the available main memory. Even when the suffix tree fits in memory, it turns out that the processor cache behavior of theoretically optimal suffix tree construction methods is poor, resulting in poor performance. Currently, there are a large number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios are not well characterized.

In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n2) worst-case complexity outperforms popular linear time algorithms like Ukkonen and McCreight, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we describe two approaches. First, we present a buffer management strategy for the O(n2) algorithm. The resulting new algorithm, which we call “Top Down Disk-based” (TDD), scales to sizes much larger than have been previously described in literature. This approach far outperforms the best known disk-based construction methods. Second, we present a new disk-based suffix tree construction algorithm that is based on a sort-merge paradigm, and show that for constructing very large suffix trees with very little resources, this algorithm is more efficient than TDD.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Similar content being viewed by others

References

  1. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2, 53–86 (2004)

    Article  Google Scholar 

  2. Aluru, S.: Suffix Trees and Suffix Arrays, Handbook of Data Structures and Applications. CRC Press (2004)

  3. Andersson, A., Nilsson, S.: Efficient implementation of suffix trees. Software: Pract Exp. 25(2), 129–141 (1995)

    Google Scholar 

  4. Apostolico, A., Szpankowski, W.: Self-alignments in words and their applications. J. Algorithms 13(3), 446–467 (1992)

    Article  Google Scholar 

  5. Apweiler, R., Bairoch, A., Wu, C.H., Barker, W., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M.J., Natale, D., O'Donovan, A.C., Redaschi, N., Yeh, L.L.: Uniprot: The universal protein knowledgebase. Nucl. Acids Res. 32(D), 115–119 (2004)

    Article  PubMed  Google Scholar 

  6. Atkinson, M., Jordan, M.: Providing orthogonal persistence for java. In Proceedings of the 12th European Conference on Object-Oriented Programming, pp. 383–395 (1998)

  7. Bedathur, S.J., Haritsa, J.R.: Engineering a fast online persistent suffix tree construction. In: Proceedings of the 20th International Conference on Data Engineering, pp. 720–731 (2004)

  8. Bentley, J.L., Sedgewick, R.: Fast algorithms for sorting and searching strings. In Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 360–369 (1997)

  9. Blumer, A., Ehrenfeucht, A., Haussler, D.: Average sizes of suffix trees and DAWGs. Discrete Applied Mathematics 24(1), 37–45 (1989)

    Article  Google Scholar 

  10. Carvalho, A., Freitas, A., Oliveira, A., Sagot, M.-F.: A parallel algorithm for the extraction of structured motifs. In Proceedings of the 2004 ACM Symposium on Applied Computing, pp. 147–153 (2004)

  11. Cheng, L.-L., Cheung, D., Yiu, S.-M.: Approximate string matching in DNA sequences. In: Proceeings of the 8th International Conference on Database Systems for Advanced Applications, pp. 303–310 (2003)

  12. Cheung, C.-F., Yu, J.X., Lu, H.: Constructing suffix tree for gigabyte sequences with megabyte memory. IEEE Transactions on Knowledge and Data Engineering 17(1), 90–105 (2005)

    Article  Google Scholar 

  13. Clifford, R., Sergot, M.J.: Distributed and paged suffix trees for large genetic databases. In: Proceedings of 14th Annual Symposium on Combinatorial Pattern Matching, pp. 70–82, 2003.

  14. Crauser, A., Ferragina, P.: A theoretical and experimental study on the construction of suffix arrays in external memory and its applications. Algorithmica 32(1), 1–35 (2002)

    Article  MathSciNet  Google Scholar 

  15. Deep-Shallow Suffix Array and BWT Construction Algorithms. http://www.mfn.unipmn.it/~manzini/lightweight/.

  16. Delcher, A., Kasif, S., Fleischmann, R., Peterson, J., White, O., Salzberg, S.: Alignment of whole genomes. Nucleic Acids Res. 27(11), 2369–2376 (1999)

    Article  PubMed  Google Scholar 

  17. Delcher, A., Phillippy, A., Carlton, J., Salzberg, S.: Fast algorithms for large-scale genome alignment and comparision. Nucleic Acids Res. 30(11), 2478–2483 (2002)

    Article  PubMed  Google Scholar 

  18. Dementiev, R., Kärkkäinen, J., Mehnert, J., Sanders, P.: Better external memory suffix array construction. In: Proceedings of the 7th Workshop on Algorithm Engineering and Experiments (2005)

  19. Devroye, L., Szpankowski, W., Rais, B.: A note on the height of suffix trees. SIAM J. Comput. 21(1), 48–53 (1992)

    Article  Google Scholar 

  20. External Memory Suffix Array Construction Project. http://i10www.ira.uka.de/dementiev/esuffix/docu/index.html.

  21. Farach, M.: Optimal suffix tree construction with large alphabets. In: Proceedings of the 38th Annual Symposium on Foundations of Computer Science, pp. 137–143. IEEE Computer Society (1997)

  22. Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction. Journal of The ACM 47(6), 987–1011 (2000)

    Article  Google Scholar 

  23. GenBank, NCBI, 2004. http://www.ncbi.nlm.nih.gov/GenBank.

  24. Giegerich, R., Kurtz, S.: From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction. Algorithmica 19(3), 331–353 (1997)

    Google Scholar 

  25. Giegerich, R., Kurtz, S., Stoye, J.: Efficient implementation of lazy suffix trees. Soft. Pract. Exp. 33(11), 1035–1049 (2003)

    Article  Google Scholar 

  26. Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press (1997)

  27. Heumann, K., Mewes, H.-W.: The hashed position tree (HPT): A suffix tree variant for large data sets stored on slow mass storage devices. In: Proceedings of the 3rd South American Workshop on String Processing, pp. 101–115 (1996)

  28. Hunt, E., Atkinson, M.P., Irving, R.W.: A database index to large biological sequences. The VLDB J. 7(3), 139–148 (2001)

    Google Scholar 

  29. Intel Corporation. The IA-32 Intel Architecture Optimization Reference Manual. Intel (Order Number 248966) (2004)

  30. Intel Corporation. The IA-32 Intel Architecture Software Developer's Manual: System Programming Guide, vol. 3. Intel (Order Number 253668) (2004)

  31. Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In: Proceedings of the 13th International Conference on Automata, Languages and Programming, pp. 943–955 (2003)

  32. Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, pp. 181–192 (2001)

  33. Kim, D.K., Sim, J.S., Park, H., Park, K.: Linear-time construction of suffix arrays. In: Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching, pp. 186–199 (June 2003)

  34. Ko, P., Aluru, S.: Space efficient linear-time construction of suffix arrays. In: Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching, pp. 200–210 (June 2003)

  35. Kurtz, S.: Reducing space requirement of suffix trees. Soft: Pract. Exp. 29(13), 1149–1171 (1999)

    Article  Google Scholar 

  36. Kurtz, S., Choudhuri, J.V., Ohlebusch, E., Schleiermacher, C., Stoye, J., Giegerich, R.: REPuter: The manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res. 29, 4633–4642 (2001)

    Article  PubMed  Google Scholar 

  37. Kurtz, S., Phillippy, A., Delcher, A., Smoot, M., Shumway, M., Antonescu, C., Salzberg, S.: Versatile and open software for comparing large genomes. Genome Bio. 5(R12) (2004)

  38. Manzini, G.: Two space saving tricks for linear time LCP array computation. In Proceedings of the 9th Scandinavian Workshop on Algorithm Theory, pp. 372–383 (2004)

  39. Manzini, G., Ferragina, P.: Engineering a lightweight suffix array construction algorithm. Algorithmica 40(1), 33–50 (2004)

    Article  Google Scholar 

  40. McCreight, E.M.: A space-economical suffix tree construction algorithm. Journal of The ACM 23(2), 262–272 (1976)

    Article  Google Scholar 

  41. Meek, C., Patel, J.M., Kasetty, S.: Oasis: An online and accurate technique for local-alignment searches on biological sequences. In Proceedings of 29th International Conference on Very Large Data Bases, pp. 910–921 (2003)

  42. Navarro, G., Baeza-Yates, R., Tariho, J.: Indexing methods for approximate string matching. IEEE Data Eng. Bull. 24(4), 19–27 (2001)

    Google Scholar 

  43. Patel, J.M.: The role of declarative querying in bioinformatics. OMICS: J. Integr. Biol. 7(1), 89–92 (2003)

    Article  Google Scholar 

  44. Pettersson, M.: Perfctr: Linux performance montioring counters driver. http://user.it.uu.se/~mikpe/linux/perfctr.

  45. Project Gutenberg. http://www.gutenberg.net.

  46. Schurmann, K.-B., Stoye, J.: Suffix-tree construction and storage with limited main memory. Technical Report 2003-06, Univeristy of Bielefeld, Germany (2003)

  47. STXXL Library. http://i10www.ira.uka.de/dementiev/stxxl.shtml.

  48. Szpankowski, W.: Average-Case Analysis of Algorithms on Sequences. John Wiley and Sons (2001)

  49. Tata, S., Hankins, R.A., Patel, J.M.: Practical suffix tree construction. In: Proceedings of 30th International Conference on Very Large Data Bases, pp. 36–47 (2004)

  50. The Growth of GenBank, NCBI, 2004. http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html.

  51. The MUMmer Software. http://www.tigr.org/software/mummer/.

  52. Ukkonen, E.: Constructing suffix-trees on-line in linear time. In Proceedings of the IFIP 12th World Computer Congress on Algorithms, Software, Architecture: Information Processing, pp. 484–492 (1992)

  53. Vitter, J.S., Shriver, M.: Algorithms for parallel memory: Two-level memories. Algorithmica 12, 110–147 (1994)

    Article  Google Scholar 

  54. Weiner, P.: Linear pattern matching algorithms. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory, pp. 1–11 (1973)

  55. Yona, S., Tsadok, D.: ANSI C implementation of a suffix tree. http://cs.haifa.ac.il/~shlomo/suffix_tree.

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Richard A. Hankins.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Tian, Y., Tata, S., Hankins, R.A. et al. Practical methods for constructing suffix trees. The VLDB Journal 14, 281–299 (2005). https://doi.org/10.1007/s00778-005-0154-8

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00778-005-0154-8

Keywords

Navigation