Comparison of Major Web Search Engine Overlap: 2005 and 2007

Amanda Spink, Faculty of Information Technology, Queensland University of Technology. ah.spink@qut.edu.au

Bernard J. Jansen, College of Information Sciences and Technology, The Pennsylvania State University. jjansen@ist.psu.edu

Changru Wang, Infospace, Inc. – Search & Directory. Changru.Wang@infospace.com

Abstract

This paper provides preliminary results from a study examining the overlap among the results retrieved by four major Web search engines for a large set of 19,332 queries. Previous studies show a lack of overlap in the results returned by Web search engines for the same queries. Our large-scale study measured the overlap of first page results (both non-sponsored and sponsored) across four major Web search engines (Google, Live, Ask, and Yahoo!) using a large number of queries randomly selected from Infospace, Inc. logs from April 2007. We then compared these results to the results retrieved for the same queries by the meta-search engine Dogpile.com. The percentage of total results unique to only one of the four major Web search engines was 88.3 percent, with 8.9 percent of total search results found on two of the four Web search engines, 2.2 percent on three engines, and 0.6 percent found across all four Web search engines. This level of Web search engine overlap is smaller than that found in July 2005 data and reflects the growing differences in Web search engine retrieval and ranking. The results point to the value of meta-search engines in Web retrieval for overcoming the biases of individual search engines.

Introduction

Web search engines differ from one another in crawling reach, frequency of updates, sponsored search advertisers, and relevancy rankings. Studies show a lack of overlap in Web search engine results, as no single Web search engine indexes all Websites. This paper reports preliminary results from a study examining the overlap among the results retrieved for the same queries by four major Web search engines (Ask, Live, Google and Yahoo!) using April 2007 data. We compare these with the results retrieved for the same queries by the meta-search engine Dogpile.com. This large-scale study builds on the previous Dogpile overlap study, which used July 2005 data to examine search engine differences and the performance capabilities of single and meta-search engines (Spink, Jansen, Koshman & Blakely, 2006).

Related Studies

Web search engine crawling and retrieval studies have been an important area of Web research. Gordon and Pathak (1999) report that approximately 93 percent of results were retrieved by only one Web search engine. Nicholson (2000) found low Web search engine overlap. Mowshowitz and Kawaguchi (2005) examined the difference between Web search engine results and an expected distribution. Egghe and Rousseau (2006) analyze information retrieval (IR) system overlap from a mathematical perspective, and Bar-Ilan (2005) discusses a statistical comparison of overlap in Web search engines. Spink, Jansen, Koshman and Blakely (2006) found only a 3.2 percent overlap in the first page of results for 10,000 queries run against four major Web search engines. In summary, studies consistently report low overlap among Web search engine results. The study reported in this paper builds on the previous study by Spink, Jansen, Koshman and Blakely (2006), which used July 2005 data, and reports preliminary results on overlap trends among Web search engine results using April 2007 data. Studying trends in Web search engine overlap and performance is an important area of Web research (Spink and Jansen, 2004).

Research Questions

The goals of our research were to:

  1. Measure the current overlap (i.e., shared results) on the first results page across four major Web search engines for a wide range of search queries using April 2007 data, and compare it with the previous Dogpile overlap study that used July 2005 data.
  2. Measure the degree to which the Web meta-search engine Dogpile.com provides the most highly ranked search results from the four major single source Web search engines.
  3. Determine the differences in the first page of search results and their rankings (i.e., each Web search engine’s view of the most relevant content) across the four single source Web search engines, including both sponsored and non-sponsored results.

Research Design

Data Collection

To ensure a random and representative query sample, we used the following steps to generate the query list. We pulled 19,332 random queries from the server access log files of the Dogpile.com Web meta-search engine from April 2007. We selected the queries from one weekday and one weekend day of the log files to ensure a more diverse set of users. We removed all duplicate queries to ensure a unique list, and we also removed non-alphanumeric terms not usually processed by Web search engines. For each of the 19,332 queries in the list, each of the four single Web search engines was queried in sequence between April 17 and 18, 2007, and the results (non-sponsored and sponsored) from the first results page were captured and stored in a database. Most Web users do not enter more than two queries per search session and view few results pages (Spink & Jansen, 2004); therefore, examining overlap levels for first page results is the most important for Web searchers.
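A minimal sketch of the query-list preparation described above is given below (Python is used purely for illustration; the specific cleaning rules, such as lower-casing, and the function name are assumptions, not the actual code used in the study):

```python
import re

def prepare_query_list(log_lines):
    """Clean raw log queries: strip characters that Web search engines
    do not usually process and remove duplicates, preserving order."""
    seen, queries = set(), []
    for line in log_lines:
        query = re.sub(r"[^A-Za-z0-9 ]+", " ", line).strip().lower()
        query = re.sub(r"\s+", " ", query)  # collapse repeated whitespace
        if query and query not in seen:
            seen.add(query)
            queries.append(query)
    return queries

# Hypothetical log extract:
sample = ["Britney  Spears!!", "britney spears", "cheap flights to hawaii"]
print(prepare_query_list(sample))
# ['britney spears', 'cheap flights to hawaii']
```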

Data Analysis

After collecting the results data for the 19,332 queries, we ran an overlap algorithm based on the URL of each result. We ran the algorithm against each query to determine the search results overlap by query. When the URL of a result from one engine exactly matched the URL of a result from one or more of the other engines, a duplicate match was recorded for that query. The first result page overlap for each query was then summarized across all 19,332 queries to generate the overlap metrics.

For a given query, we retrieved the URL of each result for each Web search engine from the database and compiled a complete result set for that query in the following fashion. Begin with an empty complete result set. For each result R returned by engine E, if R is not yet in the complete set, add it and flag it as contained in engine E; if R *is* already in the complete set (i.e., it is not unique, having been added by a preceding engine), flag the existing entry as also being contained in engine E. Whether a result is already in the complete set is determined by a simple string comparison between the URL of the current result and the URLs of the results already in the complete set. After processing all results from all Web search engines, we have a complete result set in which each result is marked as appearing in at least one engine and at most the maximum number of engines (in this case, four). The different combinations (in engine X only, in engine Y only, in engine Z only, in both engine X and engine Y but not engine Z, etc.) are then counted and added to the overlap metrics being collected.
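A minimal sketch of this per-query overlap computation is given below (Python is used purely for illustration; the engine names, data layout, and function name are assumptions). It keys each URL to the set of engines returning it and then tallies how many engines share each result:

```python
from collections import Counter

def overlap_counts(results_by_engine):
    """results_by_engine maps an engine name to the list of first page
    result URLs it returned for a single query. Returns a Counter keyed
    by the number of engines sharing each URL (1 = unique, 4 = all four)."""
    engines_per_url = {}
    for engine, urls in results_by_engine.items():
        for url in urls:
            # An exact string comparison of URLs decides whether the result
            # is already in the complete result set.
            engines_per_url.setdefault(url, set()).add(engine)
    return Counter(len(engines) for engines in engines_per_url.values())

# Hypothetical first page results for one query:
query_results = {
    "Google": ["http://a.com", "http://b.com"],
    "Yahoo!": ["http://a.com", "http://c.com"],
    "Ask":    ["http://d.com"],
    "Live":   ["http://a.com", "http://b.com"],
}
print(overlap_counts(query_results))
# -> 1: 2, 2: 1, 3: 1 (two unique results, one URL shared by two engines,
#    one URL shared by three engines)
```

The full analysis also records which specific engine combination returned each result (e.g., Google and Ask but not Yahoo! or Live); this can be obtained by tallying the frozenset of engine names for each URL rather than just their count.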

Results

First Page Overlap Results

Table 1 shows the number of unique and shared first page results for the four Web search engines across the 19,332 queries.

 

Table 1. Search engine overlap

Unique to one engine:
  Google only                     147,712
  Yahoo! only                     190,475
  Ask only                        159,749
  Live Search only                187,496

Shared by two engines:
  Google & Yahoo!                  11,056
  Google & Ask                     21,582
  Google & Live                    11,447
  Yahoo! & Ask                      6,739
  Live Search & Yahoo!             12,688
  Live Search & Ask                 5,600

Shared by three engines:
  Google, Yahoo! & Ask              5,338
  Google, Yahoo! & Live             5,541
  Yahoo!, Ask & Live                2,012
  Google, Ask & Live                4,045

Shared by all four engines:
  Yahoo!, Google, Live & Ask        4,955

 

Overall, the majority of the results a single source Web search engine returns on its first results page for a given query are unique to that Web search engine. The differences in each Web search engine’s indexing and ranking methodologies affect the results a Web searcher receives when submitting the same query to these engines. Therefore, while the engines in this study may find quality content for some queries, they do not always find, or in some cases present, all of the best content for a given query on their first results page. Table 2 shows that the percentage of shared results declines as more Web search engines are required to return the same result.

 

Table 2. Overlap results

                                             July 2005    April 2007
% of results unique to one engine               84.9%        88.3%
% of results shared by any two engines          11.4%         8.9%
% of results shared by any three engines         2.6%         2.2%
% of results shared by all four engines          1.1%         0.6%

 

The overlap between Google, Yahoo!, Live and Ask declined from July 2005 to April 2007. In April 2007, only 2.2 percent of results were shared by any three Web search engines (down from 2.6 percent in July 2005), and only 0.6 percent were shared by all four (down from 1.1 percent in July 2005). First page search results from the top Web search engines were more distinct in April 2007 than in July 2005; the top four search engines have further diverged since July 2005 in terms of search results. A key issue is whether this trend will continue as each engine continues to modify its crawling and ranking technologies.
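For reference, the April 2007 percentages in Table 2 can be reproduced directly from the Table 1 counts; a minimal sketch of that arithmetic (Python used purely for illustration) is:

```python
# Counts taken from Table 1 (April 2007 first page results).
unique = 147_712 + 190_475 + 159_749 + 187_496              # 685,432
two    = 11_056 + 21_582 + 11_447 + 6_739 + 12_688 + 5_600  # 69,112
three  = 5_338 + 5_541 + 2_012 + 4_045                      # 16,936
four   = 4_955

total = unique + two + three + four                         # 776,435

for label, count in [("unique to one engine", unique),
                     ("shared by two engines", two),
                     ("shared by three engines", three),
                     ("shared by all four engines", four)]:
    print(f"{label}: {count / total:.1%}")
# unique to one engine: 88.3%
# shared by two engines: 8.9%
# shared by three engines: 2.2%
# shared by all four engines: 0.6%
```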

Dogpile.com Results

Table 3 outlines the results that Dogpile.com displays on its first result page.

 

Table 3. Results returned by Dogpile.com (total first page Dogpile.com results for the 19,332 queries = 355,345)

                              Total Returned by   Total Included in   % Included in
                              the Four Engines    Dogpile.com         Dogpile.com
Matched with all 4 engines           4,955              4,849             97.9%
Matched with any 3 engines          16,936             15,927             94.0%
Matched with any 2 engines          69,112             54,287             78.5%
Unique to any one engine           685,432            167,573             24.4%

 

Table 3 shows that Dogpile.com displays a high proportion of the results that are returned by multiple Web search engines; for example, of the 4,955 results returned by all four engines, 4,849 (97.9 percent) also appeared on Dogpile.com's first results page. Individual Web search engines provide limited, non-comprehensive coverage of the Websites related to a query, whereas Web meta-search engines provide the most comprehensive coverage of Websites related to a query.

 

First Results Page Non-Sponsored Results Unique to Search Engine

 

Table 4. First results page non-sponsored results unique to search engine

           % of Non-Sponsored Results   % of Non-Sponsored Results
           Unique to the Engine         Overlapping With 1+ Other Engines
Google              76.1%                          23.9%
Yahoo!              77.5%                          22.3%
Ask                 83.1%                          16.5%
Live                77.6%                          22.3%

 

Isolating just the non-sponsored search results further supports the finding that each search engine has a different view of the Web. Searching only one search engine can prevent a searcher from finding the best result for their query. For those using a Web search engine to research a topic, this data highlights the need to search multiple sources, whether the topic is ancient Mayan civilization or vacation packages to Hawaii.

 

Yahoo! and Google Sponsored Link Overlap

When looking at sponsored link overlap, it makes sense to focus on Yahoo! and Google, as they supply sponsored links to the majority of search engines on the Web, including Live and Ask. The study found that Yahoo! returned 64,046 sponsored links across the 19,332 queries while Google returned 42,075 sponsored links. However, the majority of these were unique to each engine. Table 5 shows the sponsored links that overlapped between any two engines (Google, Yahoo!, Live, and Ask).

 

Table 5. Sponsored links overlapped between any two engines (Google, Yahoo!, Live, and Ask)

                   Unique             Overlapping        % of Engines' Sponsored
                   Sponsored Links    Sponsored Links    Links Overlapped
Google & Yahoo!        101,436              4,682                4.6%
Google & Live          106,341              4,506                4.2%
Google & Ask            68,800             20,312               29.5%
Yahoo! & Live          128,282              4,539                3.5%
Yahoo! & Ask           107,037              4,049                3.8%
Live & Ask             111,063              4,752                4.3%

 

The study also illustrated the known relationship between Google and Ask. Through a partnership, Google supplies Ask with a feed of its advertisers, which Ask incorporates into its results pages. This partnership is reflected in the data by the noticeably higher overlap of sponsored results between Google and Ask.

 

Non-Sponsored Links Overlapped Between Any Two Web Search Engines (Google, Yahoo!, Live, and Ask)

 

Table 6. Non-sponsored links overlapped between any two engines (Google, Yahoo!, Live and Ask)

                   Unique                 Overlapping            % of Engines' Non-Sponsored
                   Non-Sponsored Links    Non-Sponsored Links    Links Overlapped
Google & Yahoo!        323,327                 21,995                    6.8%
Google & Live          313,649                 21,397                    6.8%
Google & Ask           317,667                 15,606                    4.9%
Yahoo! & Live          319,894                 20,542                    6.4%
Yahoo! & Ask           323,931                 14,732                    4.5%
Live & Ask             316,629                 11,758                    3.7%

 

Total Links Overlapped Between Any Two Web Search Engines (Google, Yahoo!, Live, and Ask)

 

Table 7. Total links overlapped between any two engines (Google, Yahoo!, Live, and Ask)

                   Unique          Overlapping     % of Engines' Total
                   Total Links     Total Links     Links Overlapped
Google & Yahoo!       423,590          26,890              6.3%
Google & Live         419,472          25,988              6.2%
Google & Ask          385,776          35,920              9.3%
Yahoo! & Live         447,392          25,196              5.6%
Yahoo! & Ask          429,780          19,044              4.4%
Live & Ask            427,192          16,612              3.9%

 

Discussion

The study findings build upon our previous study (Spink, Jansen, Koshman & Blakely, 2006), which used data from July 2005 to investigate the overlap of Web search engines. The current study shows that the first page results returned by the four major Web search engines included in this study continue to differ from one another, and that the difference is widening. The major Web search engines had fewer results in common in 2007 than in 2005 on the first results page for any given query. Different Web search engines use different technologies and yield different first page search results.

The results of this study also emphasize that the major Web search engines (Ask, Live, Google and Yahoo!) have built and developed proprietary indexing and ranking methods that lead different engines to return different results for the same queries. The findings suggest that meta-search technology such as Dogpile.com, by drawing on the collective content, resources, and ranking capabilities of the major Web search engines, provides a more comprehensive first results page containing potentially relevant results from the top Web search engines. This study suggests that using a meta-search engine reduces the time spent searching multiple Web search engines while still providing the top ranked results from the single Web search engines.

Overall, users need to understand more about Web search engine capabilities, coverage and limitations. Single Web search engines have both strengths and weaknesses, and users need a better understanding of the differences in coverage and recall among them. In addition, the functionality of single and meta-search engines needs to be more thoroughly compared. People who use only one Web search engine may be missing useful information by not accessing multiple Web search engines or Web meta-search engines. Users also need more information on the size of the Web, how quickly it grows, and the degree to which Web search engines index Websites in particular topic areas.

Conclusion and Further Research

Our study findings confirm and extend previous research results regarding Web search engine overlap. This study shows that different Web search engines continue to have different capabilities and that the overlap among Web search engine results is now lower than it was in 2005. Web meta-search engines each provide a different and unique perspective on the Web. We are conducting ongoing overlap studies to examine additional dimensions of overlap and result rankings. Further studies are also needed to examine retrieved results beyond the second results page.

References

Bar-Ilan, J. (2005). Comparing rankings of search results on the Web. Information Processing and Management, 41(6), 973-986.

Egghe, L., & Rousseau, R. (2006). Classical retrieval and overlap measures satisfy the requirements for rankings based on a Lorenz curve. Information Processing and Management, 42(1), 106-120.

Gordon, M., & Pathak, P. (1999). Finding information on the World Wide Web: The retrieval effectiveness of search engines. Information Processing and Management, 35, 141-180.

Mowshowitz, A., & Kawaguchi, A. (2005). Measuring search engine bias. Information Processing and Management, 41, 193-205.

Nicholson, S. (2000). Raising reliability of Web search tool research through replication and chaos theory. Journal of the American Society for Information Science, 51(8), 724-729.

Spink, A., & Jansen, B. J. (2004). Web Search: Public Searching of the Web. Dordrecht: Springer.

Spink, A., Jansen, B. J., Koshman, S., & Blakely, C. (2006). A study of results overlap and uniqueness among major Web search engines. Information Processing and Management, 42(5), 1379-1390.

Copyright

Amanda Spink, Bernard J. Jansen, and Changru Wang, © 2008. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.