Bernard J. Jansen, Assistant Professor, College of Information Sciences and Technology, The Pennsylvania State University, 329F Information Sciences and Technology Building, University Park, PA, 16802, USA. jjansen@acm.org
Sherry Koshman, Assistant Professor, School of Information Sciences, University of Pittsburgh, 135 N. Bellefield Ave., Pittsburgh, PA, USA 15260. skoshman@sis.pitt.edu
Amanda Spink, Professor, Faculty of Technology, Queensland University of Technology, Gardens Point Campus, 2 George St, GPO Box 2434, Brisbane QLD 4000. ah.spink@qut.edu.au
Various interface and algorithmic techniques are under development to assist Web searchers with managing the volume of information available on the Web. Clustering of Web search engine results is one such area. User searching with clustered results in operational Web environments is not well understood. This paper reports on a Web usage analysis of Vivisimo.com, a Web meta-search engine that dynamically clusters users’ search results in real time. The research questions are: (1) What are the characteristics of Web searching on a clustering search engine, such as Vivisimo? (2) How are searchers interacting with clustered results? (3) What are the visitation patterns of searchers using Vivisimo? We analyzed data collected from 25 April – 2 May 2004, representing 100% of site traffic. Results show that approximately 50% of searchers who interact with results do use clusters, but only 2% of interactions go beyond the top-level clusters. These results provide new insight into search characteristics with a cluster-based Web search engine and point to the need for better results visualization methods.
Effective interface design is critical for a number of applications
such as Web system design, information architecture, ecommerce
decisions, and clustering of search results. The cumbersome results
lists generated by traditional Web search engines are a well-recognized
problem in Web information retrieval (Chen & Dumais, 2000).
Clustering, which provides users with a means of viewing groups of
similar search results, can potentially enhance the effectiveness of
Web search (Kim & Chan, 2003).
The application of clustering to Web search engine technology is a
novel approach that offers structure to the information deluge often
faced by Web searchers. Researchers have studied clustering methods in
research labs (Chen & Cooper, 2001; Zamir & Etzioni, 1999; Zeng,
et al., 2004). However, there has been little research into Web
searchers’ interaction with clustered search engine results in a
naturalistic Web environment. This investigation’s objective is
to better understand the nature of user interaction with an operational
cluster-based Web search engine. How do searchers interact with
clustered results? How do users search on cluster-based search engines?
How often do searchers frequent such search engines?
We conducted a quantitative Web usage analysis to examine user queries
presented to the system. The overall goal of this research is to extend
further the line of user interaction research in Web searching and
specifically with clustering as implemented by the Vivisimo search
engine, which is a common industry standard display for Web results
clustering.
This investigation uses transaction log analysis to study Vivisimo
usage characteristics. Transaction logs capture the interactions
between Web systems and the users of those systems. Web usage mining using
transaction log analyses offers an unobtrusive method for studying user
interactions with Web search engines. This study extends this line of
research of Web search engines to a cluster-based operational
environment. In this context, an operational environment refers to a
publicly available commercial search engine on the Web. User
interaction with clusters in operational environments is currently
unexplored and therefore not well understood.
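As a minimal sketch of what transaction log analysis involves, the following Python fragment groups log records by user and derives two per-session measures: the number of queries and the elapsed time. The record layout (user identifier, timestamp, query), the sample values, and the rule that all records from one user form one session are assumptions made for illustration; they are not Vivisimo's actual log format or the sessionization used in this study.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log records: (user_id, timestamp, query). The field layout is
# an assumption for illustration, not Vivisimo's actual log format.
LOG = [
    ("u1", "2004-04-25 10:00:00", "mark twain"),
    ("u1", "2004-04-25 10:02:30", "mark twain biography"),
    ("u2", "2004-04-25 11:15:00", "real estate"),
]

def summarize_sessions(records):
    """Group records by user; return {user_id: (query_count, duration_seconds)}."""
    by_user = defaultdict(list)
    for user_id, stamp, _query in records:
        by_user[user_id].append(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"))
    summary = {}
    for user_id, stamps in by_user.items():
        # Duration is the time between the first and last recorded interaction.
        duration = (max(stamps) - min(stamps)).total_seconds()
        summary[user_id] = (len(stamps), duration)
    return summary

SESSIONS = summarize_sessions(LOG)
```

A real analysis would also need a rule for splitting one user's records into separate sessions, which this sketch deliberately leaves out.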
In related work, log analysis was used to evaluate user interaction
with Grouper, a clustering interface for Web search engine results
(Zamir & Etzioni, 1999). The findings showed that users tended to
examine more clusters than hypothesized. The logs were also analyzed to
compare Grouper with a traditional text-based interface, HuskySearch,
in terms of the number of documents users clicked on. The results
showed that Grouper users more often followed multiple documents,
whereas HuskySearch users more often followed a single document.
The research questions are: (1) What are the characteristics of Web
searching on a clustering search engine, such as Vivisimo?, (2) How are
searchers interacting with clustered results?, and (3) What are the
visitation patterns of searchers using Vivisimo?
Search characteristics are defined as query structure (frequency,
length, and repetition), terms per query, term co-occurrence, session
length, session frequency, and session duration. Cluster use is defined
as the pattern and frequency of cluster manipulation by Vivisimo
searchers. Visitation patterns are measurements of daily usage of
Vivisimo by individual searchers.
The Vivisimo interface contains a dialog box for inputting queries and supports Boolean and exact phrase matching (http://www.vivisimo.com).
The default search source is the Web and a drop-down menu provides
options for additional source selection (e.g. CBC, CNN, Wisenut).
Searches can be limited by domain or host name, by link content, Web
page or Uniform Resource Locator (URL) information.
Vivisimo offers an “Advanced” search form containing options for source and language selection, defining the number and display of search results, deciding how links should be opened, and whether or not the content filter is applied. After a user submits a query, Vivisimo presents the clusters using a tree metaphor, which is similar to that used for viewing folders in Windows Explorer. The clusters appear on the left side of the page and the results pages are featured on the right of the main search page (Figure 1).

Figure 1: Vivisimo Interface
Unlike typical Web search engines, which present lists of search
output, Vivisimo’s clustering feature creates dynamic post-search
categories in a meta-searching environment. Users can click on cluster
labels to retrieve results pages from that cluster. Clusters can be
expanded by clicking on the plus sign to reveal sub-clusters and the
cluster tree may be elongated by clicking on the “More”
option. Search terms can be entered in the “Find in
clusters” search box to search the clusters.
Results pages are first displayed in response to the initial
search. Further results pages are retrieved when the user clicks on the
clusters, and additional results pages may be selected at the bottom of
the window. Hyperlinks may be accessed for individual items and Web
pages may be previewed, opened in the results frame, or opened in a new
window.
An item on the results pages may be identified within the clusters
by clicking on the “show in clusters” option next to the
item. This highlights the clusters on the tree that contain the item.
The “Details” feature shows the number of results for the
sources searched.
We now address our
first research question: What are the
characteristics of Web searching on a clustering search engine, such as
Vivisimo?
| Term 1 | Term 2 | Occurrences | % |
| new | | 1,484 | 13.4% |
| what | is | 1,346 | 12.1% |
| history | of | 1,144 | 10.3% |
| of | pictures | 985 | 8.9% |
| real | estate | 940 | 8.5% |
| for | sale | 890 | 8.0% |
| download | free | 883 | 8.0% |
| high | school | 875 | 7.9% |
| how | a | 867 | 7.8% |
| windows | xp | 860 | 7.8% |
| university | of | 806 | 7.3% |
| Total | | 11,080 | 100.0% |

| Query length (terms) | Occurrences | % |
| 0 | 3,251 | 0.4% |
| 1 | 174,338 | 18.8% |
| 2 | 278,377 | 30.0% |
| 3 | 212,738 | 22.9% |
| 4 | 121,864 | 13.1% |
| 5 | 61,974 | 6.7% |
| 6 | 29,321 | 3.2% |
| 7 | 14,626 | 1.6% |
| 8 | 7,475 | 0.8% |
| 9 | 4,933 | 0.5% |
| 10 | 2,723 | 0.3% |
| >10 | 15,683 | 1.7% |
| Total | 927,303 | 100.0% |

| Rank | Query | Occurrence | % |
| 1 | "Mark Twain" | 688 | 0.07% |
| 2 | Looney Tunes | 493 | 0.05% |
| 3 | Google | 488 | 0.05% |
| 4 | Cloning | 428 | 0.05% |
| 5 | yahoo | 273 | 0.03% |
| 6 | Ebay | 257 | 0.03% |
| 7 | Sex | 243 | 0.03% |
| 8 | | 185 | 0.02% |
| 9 | dictionary | 141 | 0.02% |
| 10 | yahoo.com | 135 | 0.01% |
| Total | | 3,331 | 0.36% |
Session length is the number of queries per session. Table 4 shows that the highest percentage of sessions (41.8%) contained one query. The majority of sessions (72.6%) contained one, two, or three queries.
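The shares reported for session length can be recomputed directly from the counts in Table 4. The short Python sketch below does this; the counts are copied from the table, while the dictionary and function names are hypothetical and chosen only for this example.

```python
# Session counts by session length (queries per session), copied from Table 4.
SESSION_COUNTS = {
    "1": 115064, "2": 54094, "3": 30735, "4": 19538, "5": 13322,
    "6": 9300, "7": 6963, "8": 5180, "9": 4036, "10": 3037, ">10": 14187,
}

def share(counts, keys):
    """Return the percentage of all sessions whose length is in `keys`."""
    total = sum(counts.values())
    return 100.0 * sum(counts[k] for k in keys) / total

ONE_QUERY = share(SESSION_COUNTS, ["1"])
UP_TO_THREE = share(SESSION_COUNTS, ["1", "2", "3"])
print(f"{ONE_QUERY:.1f}% one query, {UP_TO_THREE:.1f}% one to three queries")
```

Working from the raw counts rather than from the rounded percentages avoids accumulating rounding error when several rows are combined.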
Table 4: Session Length
| Session length (queries) | Occurrence | % |
| 1 | 115,064 | 41.8% |
| 2 | 54,094 | 19.6% |
| 3 | 30,735 | 11.2% |
| 4 | 19,538 | 7.1% |
| 5 | 13,322 | 4.8% |
| 6 | 9,300 | 3.4% |
| 7 | 6,963 | 2.5% |
| 8 | 5,180 | 1.9% |
| 9 | 4,036 | 1.5% |
| 10 | 3,037 | 1.1% |
| >10 | 14,187 | 5.1% |
| Total | 275,456 | 100.0% |

| Session Duration | Occurrences | % |
| Less than 1 minute | 125,241 | 45.5% |
| 1 to 5 minutes | 30,275 | 11.0% |
| 5 to 10 minutes | 15,592 | 5.7% |
| 10 to 15 minutes | 9,801 | 3.6% |
| 15 to 30 minutes | 16,197 | 5.9% |
| 30 to 60 minutes | 14,412 | 5.2% |
| 1 to 2 hours | 13,079 | 4.7% |
| 2 to 3 hours | 7,958 | 2.9% |
| 3 to 4 hours | 6,345 | 2.3% |
| More than 4 hours | 36,566 | 13.3% |
| Total | 275,466 | 100.0% |


| No. of Days User Visited Search Engine | No. of Users | % | No. of Sessions During Time Period | % |
| 2 | 17,762 | 46.5% | 35,524 | 46.5% |
| 3 | 8,393 | 22.0% | 16,786 | 22.0% |
| 4 | 5,405 | 14.1% | 10,810 | 14.1% |
| 5 | 3,905 | 10.2% | 7,810 | 10.2% |
| 6 | 1,393 | 3.6% | 2,786 | 3.6% |
| 7 | 886 | 2.3% | 1,772 | 2.3% |
| 8 | 460 | 1.2% | 920 | 1.2% |
| Total | 38,204 | 100.0% | 76,408 | 100.0% |

| Sessions | Total | Sun | Mon | Tues | Wed | Thurs | Fri | Sat | Sun |
| All Users | 275,456 | 29,476 | 44,630 | 43,524 | 43,898 | 42,630 | 38,653 | 28,532 | 4,113 |
| | 100% | 10.7% | 16.2% | 15.8% | 15.9% | 15.5% | 14.0% | 10.4% | 1.5% |
| Repeat Users | 120,088 | 9,339 | 20,397 | 20,764 | 20,778 | 19,988 | 17,832 | 9,050 | 1,940 |
| | 100% | 7.8% | 17.0% | 17.3% | 17.3% | 16.6% | 14.8% | 7.5% | 1.6% |

| Queries | Total | Sun | Mon | Tues | Wed | Thurs | Fri | Sat | Sun |
| All | 927,303 | 91,438 | 157,945 | 153,785 | 152,476 | 146,921 | 128,396 | 86,326 | 10,016 |
| | 100% | 9.9% | 17.0% | 16.6% | 16.4% | 15.8% | 13.8% | 9.3% | 1.1% |
| Repeat | 541,754 | 39,466 | 97,277 | 96,888 | 96,199 | 91,338 | 77,944 | 37,657 | 4,985 |
| | | 7.3% | 18.0% | 17.9% | 17.8% | 16.9% | 14.4% | 7.0% | 0.9% |
The analysis of a Vivisimo transaction log showed that while clustering is a new feature in search engine technology that is actively used by Vivisimo searchers, search characteristics such as term co-occurrence remain relatively stable in comparison to earlier Web research. Future research includes examining cluster usage on a per-query basis and investigating user interaction with Vivisimo to determine patterns in cluster label selection, the depth of clusters selected, and use of the “find in clusters” feature. Observing user interaction with the system in an empirical study would help to understand searching behavior based on real-time usage of a cluster-based interface.
Cacheda, F. and Viña, A. (2001). "Experiences retrieving information in the World Wide Web," in Proceedings of the 6th IEEE Symposium on Computers and Communications, Hammamet, Tunisia, pp. 72-79.
Chen, H. and Dumais, S. (2000). "Bringing order to the Web: automatically categorizing search results," in Proceedings of the SIGCHI conference on Human factors in computing systems, The Hague, The Netherlands, pp. 145-152.
Chen, H-M. and Cooper, M.D. (2001). "Using clustering techniques to detect usage patterns in a Web-based information system," Journal of the American Society for Information Science and Technology, vol. 52, pp. 888-904.
Hearst, M.A. and Pedersen, J.O. (1996). "Reexamining the cluster hypothesis: scatter/gather on retrieval result," in Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, Zurich, Switzerland, pp. 76-84.
Jansen, B.J., Spink, A. and Saracevic, T. (2000). "Real Life, Real Users, and Real Needs: A Study and Analysis of User Queries on the Web," Information Processing and Management, vol. 36, pp. 207-227.
Jansen, B.J., Spink, A. and Pedersen, J. (2005). "Trend Analysis of AltaVista Web Searching," Journal of the American Society for Information Science and Technology, vol. 56, pp. 559-570.
Kim, H.R. and Chan, P.K. (2003). "Learning Implicit User Interest Hierarchy for Context in Personalization," in Proceedings of the 8th International Conference on Intelligent User Interfaces, Miami, Florida, USA, pp. 101-108.
Leydesdorff, L. (1989). "Words and co-words as indicators of intellectual organization," Research Policy, vol. 18, pp. 209-223.
Montgomery, A. and Faloutsos, C. (2001). "Identifying web browsing trends and patterns," IEEE Computer, vol. 34, pp. 94-95.
Silverstein, C., Henzinger, M., Marais, H. and Moricz, M. (1999). "Analysis of a Very Large Web Search Engine Query Log," SIGIR Forum, vol. 33, pp. 6-12.
Zamir, O. and Etzioni, O. (1999). "Grouper: a dynamic clustering interface for Web search results," Computer Networks, vol. 31, pp. 1361-1374.
Zeng, H., He, Q., Chen, Z., Ma, W. and Ma, J. (2004). "Learning to cluster Web results," in Proceedings of the SIGIR conference on research and development in information retrieval, Sheffield, England, pp. 210-216.
We would like to thank Vivisimo for providing the data for this analysis, without which we could not have conducted this research.