
Automatic structuring and retrieval of large text files

Published: 01 February 1994



        Reviews

        Harold Borko

        Salton is a well-known and respected teacher and researcher in the areas of text processing and information storage and retrieval, as well as a prolific writer on these topics. This paper, written in collaboration with his students, describes the results of experiments involving the structuring of large bodies of text by linking excerpts containing related subject matter. The experiments utilize many common text processing procedures, some of which have been developed by Salton and his colleagues, plus a number of innovative strategies for text matching and retrieval. The database used in these experiments consists of 25,000 articles contained in the 29 volumes of the Funk and Wagnalls New Encyclopedia, about 65 MB of text.

        Salton states that, to retrieve relevant information from such a large body of text, the use of keywords connected by Boolean operators would give inadequate results. He proposes a vector processing model in which both the stored documents and the search requests are represented by sets (vectors) of terms without Boolean operators. Searching is done by comparing query vectors with the vectors representing the stored documents and retrieving those items found to be similar to the queries. In the reported experiment, an existing encyclopedia article is used as a search request, and the goal is to retrieve related articles. The formula for computing the vector similarity function is given and discussed, and the search results, using a number of different conditions, are displayed in easily readable tables and figures.

        As Salton points out, “The evaluation of information retrieval or text linking operations is a major unsolved problem” (p. 105). The problem is largely due to the difficulty in obtaining accurate relevance judgments, which are needed to compute recall and precision parameters. Recognizing that one cannot obtain absolute performance measurements, this experiment computes relative performance results for the different search runs in order to judge the effectiveness of the retrieval methodologies. Both the experimental and the evaluation procedures are reasonable and appropriate. No mention is made, however, of the cost and computer processing time required to prepare keyword vectors for the encyclopedia articles and to conduct the search either by a simple matching procedure or by the use of relevance feedback. Could these techniques be practically and affordably applied to existing information retrieval databases? At the very least, this question should have been considered.
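
As an illustration of the vector matching summarized in the review, the sketch below represents each text as a vector of term counts and ranks stored articles by their cosine similarity to a query article. It is only a minimal sketch under stated assumptions: the review does not reproduce the paper's similarity formula or term weighting, so raw term counts and the cosine measure are stand-ins, and the names `term_vector`, `cosine_similarity`, and `rank_articles` are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a vector of raw term counts.

    Assumption: whitespace tokenization and raw counts stand in for
    whatever term weighting the paper actually uses.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def rank_articles(query_text, articles):
    """Return (similarity, title) pairs, most similar article first.

    `articles` maps a title to its body text; the query is itself a
    text, mirroring the use of an existing article as a search request.
    """
    query = term_vector(query_text)
    scored = [
        (cosine_similarity(query, term_vector(body)), title)
        for title, body in articles.items()
    ]
    return sorted(scored, reverse=True)
```

No Boolean operators appear anywhere in the query: every stored article receives a graded similarity score, and retrieval amounts to taking the top of the ranked list.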


        • Published in

          Communications of the ACM, Volume 37, Issue 2
          Feb. 1994
          97 pages
          ISSN: 0001-0782
          EISSN: 1557-7317
          DOI: 10.1145/175235

          Copyright © 1994 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

