Article

Adaptive on-line page importance computation

Authors:
Serge Abiteboul

INRIA Domaine de Voluceau, Rocquencourt, France

INRIA Domaine de Voluceau, Rocquencourt, France
View Profile

,
Mihai Preda

Xyleme SA, Saint-Cloud, France

Xyleme SA, Saint-Cloud, France
View Profile

,
Gregory Cobena

INRIA Domaine de Voluceau, Rocquencourt, France

INRIA Domaine de Voluceau, Rocquencourt, France
View Profile

WWW '03: Proceedings of the 12th international conference on World Wide WebMay 2003Pages 280–290https://doi.org/10.1145/775152.775192

Published:20 May 2003Publication History

WWW '03: Proceedings of the 12th international conference on World Wide Web

Pages 280–290

ABSTRACT

The computation of page importance in a huge dynamic graph has recently attracted a lot of attention because of the web. Page importance, or page rank is defined as the fixpoint of a matrix equation. Previous algorithms compute it off-line and require the use of a lot of extra CPU as well as disk resources (e.g. to store, maintain and read the link matrix). We introduce a new algorithm OPIC that works on-line, and uses much less resources. In particular, it does not require storing the link matrix. It is on-line in that it continuously refines its estimate of page importance while the web/graph is visited. Thus it can be used to focus crawling to the most interesting pages. We prove the correctness of OPIC. We present Adaptive OPIC that also works on-line but adapts dynamically to changes of the web. A variant of this algorithm is now used by Xyleme.We report on experiments with synthetic data. In particular, we study the convergence and adaptiveness of the algorithms for various scheduling strategies for the pages to visit. We also report on experiments based on crawls of significant portions of the web.

References

S. Abiteboul, G. Cobena, J. Masanes, and G. Sedrati. A first experience in archiving the french web. ECDL, 2002.]] Google ScholarDigital Library
S. Abiteboul, M. Preda, and G. Cobena. Computing web page importance without storing the graph of the web (extended abstract). IEEE-CS Data Engineering Bulletin, Volume 25, 2002.]]Google Scholar
Alta vista. http://www.altavista.com/.]]Google Scholar
Andrei Z. Broder and al. Graph structure in the web. WWW9/Computer Networks, 2000.]] Google ScholarDigital Library
Steve Chien, Cynthia Dwork, Ravi Kumar, Dan Simon, and D. Sivakumar. Link evolution: Analysis and algorithms. In Workshop on Algorithms and Models for the Web Graph (WAW), 2002.]]Google Scholar
Junghoo Cho, Hector García-Molina, and Lawrence Page. Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1-7):161--172, 1998.]] Google ScholarDigital Library
Kai Lai Chung. Markov chains with stationary transition probabilities. Springer, 1967.]]Google Scholar
Colin Cooper and Alan M. Frieze. A general model of undirected web graphs. In European Symposium on Algorithms, pages 500--511, 2001.]] Google ScholarDigital Library
Y. Dong D. Zhang. An efficient algorithm to rank web resources. 9 th International World Wide Web Conference, 2000.]] Google ScholarDigital Library
F. R. Gantmacher. Applications of the theory of matrices. In Interscience Publishers, pages 64--79, 1959.]]Google Scholar
Google. http://www.google.com/.]]Google Scholar
T. Haveliwala. Efficient computation of pagerank. Technical report, Stanford University, 1999.]]Google Scholar
H. Garcia-Molina J. Cho. Synchronizing a database to improve freshness. SIGMOD, 2000.]] Google ScholarDigital Library
M.R. Henzinger J. Dean. Finding related pages in the world wide web. 8th International World Wide Web Conference, 1999.]] Google ScholarDigital Library
A. Broder K. Bharat. Estimating the relative size and overlap of public web search engines. 7th International World Wide Web Conference (WWW7), 1998.]] Google ScholarDigital Library
Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604--632, 1999.]] Google ScholarDigital Library
G.V. Meghabghab. Google's web page ranking applied to different topological web graph structures. JASIS 52(9), 2001.]] Google ScholarDigital Library
Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. ACM Computing Surveys, 28(1):33--37, 1996.]] Google ScholarDigital Library
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web, 1998.]]Google Scholar
M. Preda. Data acquisition for an xml warehouse. DEA thesis Paris 7 University, 2000.]]Google Scholar
L. Page S. Brin. The anatomy of a large-scale hypertextual web search engine. WWW7 Conference, Computer Networks 30(1--7), 1998.]] Google ScholarDigital Library
B. Dom S. Chakrabarti, M. van den Berg. Focused crawling: a new approach to topic-specific web resource discovery. 8th World Wide Web Conference, 1999.]] Google ScholarDigital Library
L. Giles S. Lawrence. Accessibility and distribution of information on the web. Nature, 1999.]]Google Scholar
Search-engine watch. www.searchenginewatch.com/.]]Google Scholar
S. Toledo. Improving the memory-system performance of sparse-matrix vector multiplication. IBM Journal of Research and Development, 41(6):711--??, 1997.]] Google ScholarDigital Library
C.J. van Rijsbergen. Information retrieval. London, Butterworths, 1979.]] Google ScholarDigital Library
Lucie Xyleme. A dynamic warehouse for xml data of the web. IEEE Data Engineering Bulletin, 2001.]]Google ScholarDigital Library
Xyleme. www.xyleme.com.]]Google Scholar

Index Terms

Adaptive on-line page importance computation
1. Information systems
  1. Information systems applications
    1. Data mining
2. Mathematics of computing
  1. Discrete mathematics
    1. Graph theory
  2. Mathematical analysis
    1. Numerical analysis
      1. Computations on matrices

Recommendations

Current challenges in web crawling
ICWE'13: Proceedings of the 13th international conference on Web Engineering

Web crawling, a process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents starting from a simple program for website backup to a major web search engine. Due to an ...
Read More
Random web crawls
WWW '07: Proceedings of the 16th international conference on World Wide Web

This paper proposes a random Web crawl model. A Web crawl is a (biased and partial) image of the Web. This paper deals with the hyperlink structure, i.e. a Web crawl is a graph, whose vertices are the pages and whose edges are the hypertextual links. Of ...
Read More
Uncovering the unarchived web
SIGIR '14: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval

Many national and international heritage institutes realize the importance of archiving the web for future culture heritage. Web archiving is currently performed either by harvesting a national domain, or by crawling a pre-defined list of websites ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
WWW '03: Proceedings of the 12th international conference on World Wide Web
May 2003
772 pages
ISBN:1581136803
DOI:10.1145/775152
Conference Chairs:
Gusztáv Hencsey
MTA SZTAKI, Hungary
,
Bebo White
Stanford Linear Accelerator Center, USA
,
Program Chairs:
Yih-Farn Robin Chen
AT&T Labs -- Research, USA
,
László Kovács
MTA SZTAKI, Hungary
,
Steve Lawrence
Google Inc., USA
Copyright © 2003 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 20 May 2003
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
hyperlink
page importance
web graph
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate1,899of8,196submissions,23%
Upcoming Conference
WWW '24

Sponsor:

sigweb

The ACM Web Conference 2024

May 13 - 17, 2024

Singapore , Singapore
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 133
  Total Citations
  View Citations
- 1,656
  Total Downloads
- Downloads (Last 12 months)11
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Adaptive on-line page importance computation

WWW '03: Proceedings of the 12th international conference on World Wide Web

ABSTRACT

References

Cited By

Index Terms

Recommendations

Current challenges in web crawling

Random web crawls

Uncovering the unarchived web

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Adaptive on-line page importance computation

WWW '03: Proceedings of the 12th international conference on World Wide Web

ABSTRACT

References

Cited By

Index Terms

Recommendations

Current challenges in web crawling

Random web crawls

Uncovering the unarchived web

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media