Designing a Regional Crawler for Distributed and Centralized Search Engines

Milad Shokouhi [HREF1], Department of Computer Engineering [HREF2] , Bu-Ali Sina University [HREF3], Hamedan, Iran.shokouhi@ce.basu.ac.ir

Pirooz Chubak [HREF4], Department of Computer Engineering [HREF5] , Sharif University of Technology [HREF6], Tehran, Iran.chubak@ce.sharif.edu

Abstract

Today, by the growth of WWW, the significance and popularity of search engines are increasing day by day. However, today web crawlers are unable to update their huge search engine indexes concurrent to the growth in the information available on the web. Most of times they download some unimportant pages and ignore the pages that their probability of being searched is noticeable. This sometimes leads to incapability of search engines for giving up to date information to the users. Regional Crawler that we introduce as a new crawling strategy in this paper, improves the problem of updating and finding new pages to some extent by gathering users' common needs and interests in a certain domain, which can be as small as a LAN in a department of a university or as huge as a country. It crawls the pages containing information about these interests at first, instead of crawling the web without any predefined order. In this paper, we design the Regional Crawler architecture and introduce its application in centralized and distributed search engines.

Introduction

The Significance of search engines

Surveys show that the use of internet is growing world wide both in servers and in clients. For instance, the number of web sites which was less than 3 millions in 1998 has become more than 9 millions by the year 2002 (HREF 8). The number of internet users which was less than 160 millions in 1998 increased to more than 600 millions by the year 2002 (HREF 9) and (HREF 10).

Above statistics, show the huge amount of information on WWW and the growth of people's requests for these information in different parts of the world. The necessity of an efficient tool for finding what users are looking for makes the significance of search engines more obvious. Search engines take users' queries as input and produce a sequence of URLs that match the query according to a rank they calculate for each document on the web (M. Hollander, 2003; G. Salton, M.J. McGill ,1983).

What are crawlers?

Web crawler is a program that traverses the internet automatically by retrieving a web page and then recursively retrieving all linked pages (S. Mayocchi ,2001; P. Boldi, B. Codenotti, M. Santini & S. Vigna, 2002).Because of the huge amount of information on the web, it takes several weeks to crawl it totally. (HREF 5; A. Heydon, M. Najork, 1999). Even the largest search engines, like Google and Altavista, cover only limited parts of the web and much of their data are out of date several months of the year (T. Suel, V. Shkapenyuk, 2002). Therefore, they usually do not cover some important information that changes hourly or daily (like news). Most of the recent works done on crawling strategies attempt to minimize the number of pages that need to get downloaded, or maximize the benefit obtained per downloaded page (T. Suel, V. Shkapenyuk, 2002).

Regional Crawler Method

In this method, the crawling strategy is based on users' interests and needs in certain domains. These needs and interests are determined according to common characteristics of the users like geographical location, age, membership and job. Regional Crawler uses these interests as basic data for crawling strategy. In the other words, people in the same region are more likely to search for similar subjects and ignore the other categories that may be important for people in other areas. For example, people in Iran are usually searching for information about soccer and Middle East news, but in the U.S users are more likely to search for baseball events. Even people in a CS department usually look for similar information, (computer science articles for example), so the region could even be defined as small as a LAN. The more a document contains common interests of different domains; the more is its chance for getting crawled. We have designed the architecture of regional crawler for two most common search engine structures called Centralized and Distributed.

A Web Crawler Design

The first crawler, Mathew Gray's Wanderer, was written in the spring of 1993, roughly coinciding with the first release of NCSA MOSAIC (A. Heydon, M. Najork, 1999) and (HREF 9).

Fig. 1.Web Crawler Architecture

Figure 1 shows an easy architecture for web crawler:

Crawler Manager: takes a set of URLs from Link Extractor and sends the Next URL to the DNS resolver to obtain its IP address. This saves a lot of time because spiders do not have to send requests to DNS every time they want to download a page.

Robots.txt file: are the means by which web authors express their wish as to which pages they want the crawlers to avoid. Crawlers must respect authors' wishes as well.

Spider: downloads robots.txt file and other pages that are requested by the crawler manager and permitted by web authors. The robots.txt files are sent to crawler manager for processing and extracting? the URLs. The other downloaded files are sent to a central indexer.

Link Extractor: look through the pages downloaded by the spiders, extracts URLs from the links in those pages and sends the URLs to the crawler manager for downloading afterwards.

Any crawler must fulfill following two issues (T. Suel, V. Shkapenyuk, 2002):

1) It must have a good crawling strategy

2) It has to have a highly optimized system architecture that can download a large number of pages per seconds.

Most of search engines use more than one crawler and manage them in a distributed method. This has following benefits:

Using Regional Crawler in Centralized Search Engines

In centralized search engines, there is a central URL store, which sends URLs to the crawler for processing and downloading. The mechanism that leads to the production of a list of ranked URLs to get downloaded, determines the crawling strategy. There are three major crawling strategies for centralized crawlers in the literature (F. Menczer, G. Pant, P. Srinivasan and M. Ruiz, 2001).

In this section, we will describe our crawling strategy, which is a new kind of Best-First strategy.

The main in detail architecture for crawlers in Centralized Search Engines is just like figure 2:

Fig. 2.Architecture of a Centeralized Regional Crawler

In this figure, valid URL Queue extracts the valid URLs from robot.txt and sends them to URL Ranker to rank the URLs according to user's interests for getting crawled sooner and more often. The Interest Manager? has to specify the users' interests. A Page Processing Unit in Search Engine extracts the keywords of pages and submits them to URL Ranker.

As we mentioned earlier, the regional crawler decides which IP to download next according to the interests of the users in specific domains. Therefore, we must find a method to map users' interest to IP addresses that have sent their queries to the centralized search engine. The sequence of URLs that spider must download is determined by the common interests of a certain region. Each region is known by the IP addresses of the users with similar interests. These interests must be the specific characteristics of that domain, this means we ignore some public interests like Soccer, which is a common interest in most of regions. The regions granularity could be as small as a LAN or as big as a country. We map the IP addresses to the interests of domains according to the known characteristics of different environments such as country, location, age range, job and other common characteristics of the users in a domain.

As an example, if the discussed domain is a CS department LAN in a university, the probability of requesting pages about computer science articles, other universities web sites and computer related topics is much more than business related web pages. Therefore, we manually specify the interest of different regions in Region Interest table as illustrated in Table I.

Table I

REGION INTERESTS 

Region

IP Addresses

Region Interests

Sharif University CS LAN

81.31.164.1-254

Computer Engineering, Robocup, ACM

Iran

Set of Iran IPs

Iran Politics, Wrestling, Soccer

We obtain the keywords of each URL from the search engine Page Processing Unit. For example, http://ce.sharif.edu is the URL of computer engineering department of Sharif University of Technology in Iran and the retrieved keywords would be "Computer Engineering", "ACM Contest", and "Robocup" etc. Therefore, because of the similarity between region interests of Bu-Ali Sina University CS LAN(BASU LAN) and the discussed URL keywords, the URL Ranker should give a higher rank to this URL and its linked pages, so that the BASU LAN spider, crawl it more often.

Using Regional Crawler in Distributed and Agent Based Search Engines

Agent Based Search Engines

The current search engine architectures have a high cost (M. Chau, D. Zeng, H. Chen, M. Huang & D. Hendriawan, 2002) and some problems and limitations, which we will point to some of them in below: (B. Ling,2002)

Generally, a Multi-Agent architecture has five advantages over the past methods:

1) Flexibility

2) Robustness

3) Distributed Computing

4) Facilitating development/Maintenance

5) Facilitating Research

The Architecture of most Agent-Based search engines like (T. Bauer, D. B. Leake ,2002; S. Pelletier, S. Pierre, H. Hoang ,2003; M. Chau, D. Zeng, H. Chen, M. Huang & D. Hendriawan, 2002; B. Ling,2002 )is based on a Three-Layer Model (Hermans, 1996). The main idea of this Three-Layer model is to divide the internet structure into three layers and devote some particular activities to each layer.

Fig. 3.Three Layer Architecture

According to figure 3, the requesters are the users who enter the query into the system and have an individual unique user profile. User profile contains the user interests and the results of previous searches (S. Pelletier, S. Pierre, H. Hoang ,2003). Providers section also contains the services and information of the providers that are being searched for pages related to users' queries. Intermediaries are responsible for matching the users' requests with the information available from the providers or information which have become accessible by the other users according to their user profile.

Regional Crawler in Agent Based Search Engines

As we explained before, current distributed and Agent-Based search engines are usually constructed based on a Three-Layer structure. Generally the main structure of a Personal Agent in most of Agent-Based search engines is just like figure 4 (B. Ling,2000).

Fig. 4.Personal Agent in Agent-Based Search Engines

User Profiles were discussed in last section completely. Search Agent will search the internet and User Interface Agent acts as an interface between the user and the whole Personal Agent structure and enables the collaboration between the user and Agents for searching and entering the queries. Middle Agent plays the most important role in this architecture. It's a bridge between users and providers in the way that providers would announce the services that they provide and users will ask for their needs on the other hand and the Middle Agent would act as a Match Maker between those groups. The Advantage of this methodology is that some user profiles would be devoted to the users which show their needs, interests and past search results. When a query entered by a user and got ready for the search. Middle Agent would get the query and specify the subject of it, then it would search through the user profiles and User Agents to find a similar user according to the public profiles and get information from its past search results for the same or similar queries and send these results as an answer for the user who has entered the query. By adding the Regional Crawler as a Regional Agent to the above architecture we would have Figure 5:

Fig. 5.Regional Crawler in Agent-Based Architecture

Regional Agent is responsible for collecting the users' interests of a specific region. Users with similar User Profile will be gathered together by Reinforcement Learning methods (R. S. Sutton, A. G. Barto, 1998) or Supervised learning (S. Chakrabarti, 2000) depending on the Middle-Agent architecture. Regional Agent will search for users with similar interests and gather them in a unique public agent. Then we devote a special crawler for each regional agent and ask it to crawl the web in the way that it can satisfy the users' interests. This means that the crawler should look for the pages related to region interests before the other topics available on the web. Since the crawlers are in cooperation with Search Agents, Regional Agent will ask Search Agents to update the important web pages (look for important new pages) by announcing the user interests and needs to them. The important point in this architecture is that by implementing the RL methods, regions domain will be unlimited and as an example two Fans of a particular soccer club in two different locations of the world would be in the same region. So by adding a regional Agent to the Architectures above, we expect the important pages from the users (user agents) perspective become updated more frequently. In this Agent-Based architecture, all the collaborations, cooperation, and any massage exchanging, between two agents is based on the knowledge query and manipulation language KQML. KQML is a protocol for exchanging information and knowledge between agents (M. N. Huhnus, L. M. Stephens). According to this architecture, the Regional Crawler shown as Regional Agent would be located in the Intermediaries Section discussed before.

Future Work

Since the idea of Regional Crawler is quite new in the literature, there seems to be a lot of work to do on this topic. For example, attempts should be made on improving the profiling methods in agent-based crawlers by using reinforcement-learning (R. S. Sutton, A. G. Barto, 1998) methods. In addition, realization of changes in a domain interests and adjusting the pages to be crawled are so important. For example in some weeks, people may get interested in political news in a domain according to the political situation occurring in the outside world relating to that domain, like a country.

Another idea is giving a higher rank to domains, which send more queries to our search engine. This causes more people to get better and more updated results than before.

Conclusion

In this paper we designed a new web crawler called Regional Crawler and discussed about the role it plays in different search engine architectures. The main advantage of Regional Crawler over the other kinds of web crawlers is updating and finding the new most important pages according to users' needs and interests. So, in any specific Region the crawling method is affected by the results of retrieved interests. This leads to a noticeable reduction in the size of the index in the search engine and it saves great amount of memory and network resource.

As an example, the people of US are interested in baseball, so if any pages are added recently to a baseball site about a new tournament on near future, search engines should find these new pages sooner instead of new pages related to the forecasting news of a small village in South Africa! The old methods do not cover this requirement, and all we have done is to improve the way of crawling to cover these requirements.

The more important a page is, the more frequently it will be looked for, updated and become available which current search engines are unable of giving such a service to users.

References

T. Bauer, D. B. Leake, (2002). "Calvin: A Multi Agent Personal Information Retrieval System", Department of Computer Science, Lindley Hall, Indiana University.

S. Pelletier, S. Pierre, H. Hoang, "Modeling a Multi-Agent System for Retrieving Information from Distributed Sources", Journal of Computing and Information Technology-CIT 11, 2003, 1, 1-10

M. Chau, D. Zeng, H. Chen, M. Huang, D. Hendriawan, (2002). "Design and Evaluation of a Multi-agent Collaborative Web Mining System", Decision Support Systems

B. Ling, (2000). "Enhanced Co-operative Knowledge Sharing Model for Agent Based Information Retrieval", M.S.c Thesis, Staffordshire University

S. Mayocchi,(2001). "A Web Crawler for Automated Location of Genomic Sequences", Department of Computer Systems and Electrical Engineering, University of Queensland, BA Thesis

P. Boldi, B. Codenotti, M. Santini, S. Vigna, (2002). "UniCrawler: A Scalable Fully Distributed Web Crawler" in Proceedings of AusWeb02, the Eighth Australian World Wide Web Conference

A. Heydon, M. Najork, (1999). "Mercator: A Scalable, Extensible Web Crawler" in World Wide Web, 2(4): 219-229

S. Brin, L. Page, (1998). "The Anatomy of a Large-Scale Hypertextual Web Search Engine" In Proceedings of the Seventh World-Wide Web Conference

S. Chakrabarti, (2000). "Data mining for hypertext: A tutorial survey, SIGKDD Exploration" in Newsletter of special Internet Group (SIG) on knowledge Discovery & Data Mining ,ACM 1(2),pages 1-11

M. Hollander, (2003) "Google's Page Rank Algorithm to Better Internet Searching", University of Minnesota, Morris Computer Science Seminar

G. Salton, M.J. McGill, (1983). Introduction to Modern Information Retrieval, McGraw-Hill Book Co., New York

R. S. Sutton, A. G. Barto, (1998). Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA

T. Suel, V. Shkapenyuk, (2002). "Design and Implementation of a High-Performance Distributed Web Crawler" In Proceedings of the 18th International Conference on Data Engineering (ICDE'02), San Jose, CA Feb. 26--March 1, pages 357--368

S. Brin, R. Motwani, L. Page and T. Winograd, (1998). "What Can You Do with the Web in Your Pocket" in Bulletin of IEEE Computer Society Technical Committee on Data Engineering

F. Menczer, G. Pant, P. Srinivasan and M. Ruiz, (2001). "Evaluating Topic-Driven Web Crawlers" In Proceedings of the 24th annual International ACM/SIGIR Conference, New Orleans, USA

M. N. Huhnus, L. M. Stephens, "Multi Agent Systems and Societies of Agents, Multi Agent Systems A Modern Approach to Distributed Artificial Intelligence", edited by Gerhard Weiss ,The MIT Press Cambridge, Massachusetts London, England

S. Chakrabarti, M. van den Berg, B. Domc, (1999). " Focused crawling: a new approach to topic specific Web resource discovery" in Proceedings of 8th International WWW Conference

N. Angkawattanawit, A. Rungsawang, (2002). "Learnable Crawling: An Efficient Approach to Topic-specific Web Resource Discovery" in The 2nd International Symposium on Communications and Information Technology (ISCIT 2002)

F. Menczer and R. Belew, (2000). "Adaptive retrieval agents: Internalizing local context and scaling up to the Web", Machine Learning, 39(2{3):203{242

Hypertext References

HREF1
http://ce.basu.ac.ir/~shokouhi
HREF2
http://ce.basu.ac.ir
HREF3
http://www.basu.ac.ir
HREF4
http://ce.sharif.edu/~chubak
HREF5
http://ce.sharif.edu
HREF6
http://www.sharif.edu
HREF7
http://www.stanford.edu/services/websearch/Google/TGIF/outof-index.html
HREF8
http://wcp.oclc.org/stats.html
HREF9
http://www.upsdell.com/BrowserNews/stat-growth.html
HREF10
http://www.isc.org/ds/WWW-200301/index.html
HREF11
http://www.mit.edu/people/mkgray/net/background.html

Copyright

Milad Shokouhi, © 2004. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.