ABSTRACT
The computation of page importance in a huge dynamic graph has recently attracted a lot of attention because of the web. Page importance, or page rank is defined as the fixpoint of a matrix equation. Previous algorithms compute it off-line and require the use of a lot of extra CPU as well as disk resources (e.g. to store, maintain and read the link matrix). We introduce a new algorithm OPIC that works on-line, and uses much less resources. In particular, it does not require storing the link matrix. It is on-line in that it continuously refines its estimate of page importance while the web/graph is visited. Thus it can be used to focus crawling to the most interesting pages. We prove the correctness of OPIC. We present Adaptive OPIC that also works on-line but adapts dynamically to changes of the web. A variant of this algorithm is now used by Xyleme.We report on experiments with synthetic data. In particular, we study the convergence and adaptiveness of the algorithms for various scheduling strategies for the pages to visit. We also report on experiments based on crawls of significant portions of the web.
- S. Abiteboul, G. Cobena, J. Masanes, and G. Sedrati. A first experience in archiving the french web. ECDL, 2002.]] Google ScholarDigital Library
- S. Abiteboul, M. Preda, and G. Cobena. Computing web page importance without storing the graph of the web (extended abstract). IEEE-CS Data Engineering Bulletin, Volume 25, 2002.]]Google Scholar
- Alta vista. http://www.altavista.com/.]]Google Scholar
- Andrei Z. Broder and al. Graph structure in the web. WWW9/Computer Networks, 2000.]] Google ScholarDigital Library
- Steve Chien, Cynthia Dwork, Ravi Kumar, Dan Simon, and D. Sivakumar. Link evolution: Analysis and algorithms. In Workshop on Algorithms and Models for the Web Graph (WAW), 2002.]]Google Scholar
- Junghoo Cho, Hector García-Molina, and Lawrence Page. Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1-7):161--172, 1998.]] Google ScholarDigital Library
- Kai Lai Chung. Markov chains with stationary transition probabilities. Springer, 1967.]]Google Scholar
- Colin Cooper and Alan M. Frieze. A general model of undirected web graphs. In European Symposium on Algorithms, pages 500--511, 2001.]] Google ScholarDigital Library
- Y. Dong D. Zhang. An efficient algorithm to rank web resources. 9 th International World Wide Web Conference, 2000.]] Google ScholarDigital Library
- F. R. Gantmacher. Applications of the theory of matrices. In Interscience Publishers, pages 64--79, 1959.]]Google Scholar
- Google. http://www.google.com/.]]Google Scholar
- T. Haveliwala. Efficient computation of pagerank. Technical report, Stanford University, 1999.]]Google Scholar
- H. Garcia-Molina J. Cho. Synchronizing a database to improve freshness. SIGMOD, 2000.]] Google ScholarDigital Library
- M.R. Henzinger J. Dean. Finding related pages in the world wide web. 8th International World Wide Web Conference, 1999.]] Google ScholarDigital Library
- A. Broder K. Bharat. Estimating the relative size and overlap of public web search engines. 7th International World Wide Web Conference (WWW7), 1998.]] Google ScholarDigital Library
- Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604--632, 1999.]] Google ScholarDigital Library
- G.V. Meghabghab. Google's web page ranking applied to different topological web graph structures. JASIS 52(9), 2001.]] Google ScholarDigital Library
- Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. ACM Computing Surveys, 28(1):33--37, 1996.]] Google ScholarDigital Library
- Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web, 1998.]]Google Scholar
- M. Preda. Data acquisition for an xml warehouse. DEA thesis Paris 7 University, 2000.]]Google Scholar
- L. Page S. Brin. The anatomy of a large-scale hypertextual web search engine. WWW7 Conference, Computer Networks 30(1--7), 1998.]] Google ScholarDigital Library
- B. Dom S. Chakrabarti, M. van den Berg. Focused crawling: a new approach to topic-specific web resource discovery. 8th World Wide Web Conference, 1999.]] Google ScholarDigital Library
- L. Giles S. Lawrence. Accessibility and distribution of information on the web. Nature, 1999.]]Google Scholar
- Search-engine watch. www.searchenginewatch.com/.]]Google Scholar
- S. Toledo. Improving the memory-system performance of sparse-matrix vector multiplication. IBM Journal of Research and Development, 41(6):711--??, 1997.]] Google ScholarDigital Library
- C.J. van Rijsbergen. Information retrieval. London, Butterworths, 1979.]] Google ScholarDigital Library
- Lucie Xyleme. A dynamic warehouse for xml data of the web. IEEE Data Engineering Bulletin, 2001.]]Google ScholarDigital Library
- Xyleme. www.xyleme.com.]]Google Scholar
Index Terms
- Adaptive on-line page importance computation
Recommendations
Current challenges in web crawling
ICWE'13: Proceedings of the 13th international conference on Web EngineeringWeb crawling, a process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents starting from a simple program for website backup to a major web search engine. Due to an ...
Random web crawls
WWW '07: Proceedings of the 16th international conference on World Wide WebThis paper proposes a random Web crawl model. A Web crawl is a (biased and partial) image of the Web. This paper deals with the hyperlink structure, i.e. a Web crawl is a graph, whose vertices are the pages and whose edges are the hypertextual links. Of ...
Uncovering the unarchived web
SIGIR '14: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrievalMany national and international heritage institutes realize the importance of archiving the web for future culture heritage. Web archiving is currently performed either by harvesting a national domain, or by crawling a pre-defined list of websites ...
Comments