Investigating Usage of the Vivisimo Clustering Search Engine Interface

Bernard J. Jansen, Assistant Professor, College of Information Sciences and Technology, The Pennsylvania State University, 329F Information Sciences and Technology Building, University Park, PA, 16802, USA. jjansen@acm.org

Sherry Koshman, Assistant Professor, School of Information Sciences, University of Pittsburgh, 135 N. Bellefield Ave., Pittsburgh, PA, USA 15260. skoshman@sis.pitt.edu

Amanda Spink, Professor, Faculty of Technology, Queensland University of Technology, Gardens Point Campus, 2 George St, GPO Box 2434, Brisbane QLD 4000. ah.spink@qut.edu.au

Abstract

Various interface and algorithmic techniques are under development to assist Web searchers with managing the volume of information available on the Web. Clustering of Web search engine results is one such area. User searching with clustered results in operational Web environments is not well understood. This paper reports on a Web usage analysis of Vivisimo.com, a Web meta-search engine that dynamically clusters users’ search results in real time. The research questions are: (1) What are the characteristics of Web searching on a clustering search engine, such as Vivisimo?, (2) How are searchers interacting with clustered results?, and (3) What are the visitation patterns of searchers using Vivisimo? We analyzed data collected from 25 April – 2 May 2004, representing 100% of site traffic. Results show that approximately 50% of searchers who interact with results do use clusters, but only 2% of interactions go beyond the top-level clusters. These results provide new insight into search characteristics with a cluster-based Web search engine and point to the need for better results visualization methods.

Introduction

Effective interface design is critical for a number of applications such as Web system design, information architecture, ecommerce decisions, and clustering of search results. The cumbersome results lists generated by traditional Web search engines are a well-recognized problem in Web information retrieval (Chen & Dumais, 2000). Clustering, which provides users with a means of viewing groups of similar search results, can potentially enhance the effectiveness of Web search (Kim & Chan, 2003).
The application of clustering to Web search engine technology is a novel approach that offers structure to the information deluge often faced by Web searchers. Researchers have studied clustering methods in research labs (Chen & Cooper, 2001; Zamir & Etzioni, 1999; Zeng et al., 2004). However, there has been little research into Web searchers’ interaction with clustered search engine results in a naturalistic Web environment. This investigation’s objective is to understand better the nature of user interaction with an operational cluster-based Web search engine. How do searchers interact with clustered results? How do users search on cluster-based search engines? How often do searchers frequent such search engines?
We conducted a quantitative Web usage analysis to examine user queries presented to the system. The overall goal of this research is to extend further the line of user interaction research in Web searching and specifically with clustering as implemented by the Vivisimo search engine, which is a common industry standard display for Web results clustering.

Related Work

This investigation uses transaction log analysis to study Vivisimo usage characteristics. Transaction logs capture the interactions between Web systems and their users. Web usage mining using transaction log analysis offers an unobtrusive method for studying user interactions with Web search engines. This study extends this line of research on Web search engines to a cluster-based operational environment. In this context, an operational environment refers to a publicly available commercial search engine on the Web. User interaction with clusters in operational environments is currently unexplored and therefore not well understood.
In related work, a log analysis was used to evaluate user interaction with Grouper, a clustering interface for Web search engine results (Zamir & Etzioni, 1999). The findings showed that users tended to examine more clusters than hypothesized. The logs were also analyzed to compare Grouper with a traditional text-based interface, HuskySearch, in order to determine the number of documents clicked on by the users. Their results showed that users more often followed multiple documents with the Grouper clustering interface and more often followed a single document with HuskySearch.

Research Questions

The research questions are: (1) What are the characteristics of Web searching on a clustering search engine, such as Vivisimo?, (2) How are searchers interacting with clustered results?, and (3) What are the visitation patterns of searchers using Vivisimo?
Search characteristics are defined as query structure (frequency, length, repeated), terms per query, term co-occurrence, session length, session frequency, and session duration. Cluster use is defined as the pattern and frequency of cluster manipulation by Vivisimo searchers. Visitation patterns are measurements of daily usage of Vivisimo by individual searchers.

Research Design

Vivisimo.com

The Vivisimo interface contains a dialog box for inputting queries and supports Boolean and exact phrase matching (http://www.vivisimo.com).
The default search source is the Web, and a drop-down menu provides options for additional source selection (e.g., CBC, CNN, Wisenut). Searches can be limited by domain or host name, by link content, or by Web page or Uniform Resource Locator (URL) information.

Vivisimo offers an “Advanced” search form containing options for source and language selection, defining the number and display of search results, deciding how links should be opened, and whether or not the content filter is applied. After a user submits a query, Vivisimo presents the clusters using a tree metaphor, which is similar to that used for viewing folders in Windows Explorer. The clusters appear on the left side of the page and the results pages are featured on the right of the main search page (Figure 1).

Figure 1: Vivisimo Interface

Unlike typical Web search engines, which present lists of search output, Vivisimo’s clustering feature creates dynamic post-search categories in a meta-searching environment. Users can click on cluster labels to retrieve results pages from that cluster. Clusters can be expanded by clicking on the plus sign to reveal sub-clusters and the cluster tree may be elongated by clicking on the “More” option. Search terms can be entered in the “Find in clusters” search box to search the clusters.

The results pages are initially displayed as a result of the initial search. Results pages are retrieved when the user clicks on the clusters and additional results pages may be selected at the bottom of the window. Hyperlinks may be accessed for individual items and Web pages may be previewed, opened in the results frame, or opened in a new window.

An item on the results pages may be identified within the clusters by clicking on the “show in clusters” option next to the item. This highlights the clusters on the tree that contain the item. The “Details” feature shows the number of results for the sources searched.

Data Collection

The Vivisimo transaction log data used for this study represents a one-week period from April 25 to May 02, 2004. The transaction log recorded 100% of the traffic on the Vivisimo Web site during this period and contained 927,303 queries.

Data Analysis

The transaction log is a flat ASCII file, which was imported into a relational database, and a unique identifier for each record was assigned. Using four fields (User Identification, Date, Time of Day, and Query Terms), the initial query was located and the chronological series of actions on a given day was recreated to represent a user session.

A term is any series of characters separated by white space or other separator. A query is the entire string of terms submitted by a searcher in a given instance of interaction. A session is the entire series of queries submitted by a user during one interaction with the Web search engine on a given day. An identical query is a query that is a copy of a previous query within the same user session. A repeat query is a query submitted more than once, irrespective of the user.
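
These definitions can be made concrete with a minimal sketch. The records below are hypothetical and are not drawn from the Vivisimo log, the field layout is an assumption, and terms are split on white space only (other separators are not modeled).

```python
from collections import Counter

# Hypothetical log records: (user_id, date, time, query). Not real Vivisimo data.
records = [
    ("u1", "2004-04-25", "10:00:00", "new york hotels"),
    ("u1", "2004-04-25", "10:02:10", "new york hotels"),  # identical query (same session)
    ("u2", "2004-04-25", "11:15:00", "new york hotels"),  # repeat query (different user)
    ("u2", "2004-04-26", "09:00:00", "windows xp"),
]


def split_terms(query):
    """Split a query into terms on white space."""
    return query.split()


# A session groups one user's queries on one day.
sessions = {}
for user, date, _time, query in records:
    sessions.setdefault((user, date), []).append(query)

# Identical queries: copies of an earlier query within the same session.
identical_count = sum(len(qs) - len(set(qs)) for qs in sessions.values())

# Repeat queries: query strings submitted more than once, irrespective of user.
query_counts = Counter(query for _u, _d, _t, query in records)
repeat_queries = {q for q, n in query_counts.items() if n > 1}

print(len(sessions), identical_count, sorted(repeat_queries))
# prints: 3 1 ['new york hotels']
```

Note that the same query string can be both an identical query (within one session) and a repeat query (across the whole log); the two counts answer different questions.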

The transaction log contained searches from both human users and agents. This analysis focused on only those queries submitted by humans rather than by some automated process. Given that there is no way to accurately distinguish human from non-human searchers, most researchers utilizing transaction logs for data collection must either ignore the issue (Cacheda & Vina, 2001; Jansen et al., 2005) or assume some temporal or interaction cut-off (Silverstein et al., 1999).

We used the latter approach, separating sessions having 100 or fewer queries into an individual transaction log. We selected this cut-off because it is almost 50 times greater than the reported mean search session length for human Web searchers, and it assured that human searches were not excluded. Although this cut-off probably admitted some agent or common user terminal sessions, we assumed that the analysis retrieved a subset of the transaction log that contained queries submitted primarily by human searchers, yet remained broad enough not to introduce bias through too low a cut-off threshold.
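
The cut-off can be sketched as a simple filter. This is an illustration only: the record layout and the helper name `filter_sessions` are assumptions, and the sketch counts every logged query in a (user, date) session toward the limit.

```python
from collections import Counter

MAX_QUERIES_PER_SESSION = 100  # cut-off stated in the text


def filter_sessions(records, max_queries=MAX_QUERIES_PER_SESSION):
    """Keep only records whose (user, date) session has at most max_queries queries."""
    counts = Counter((user, date) for user, date, _query in records)
    return [r for r in records if counts[(r[0], r[1])] <= max_queries]


# Hypothetical data: one session with 101 queries and one with 2 queries.
records = [("agent", "2004-04-25", f"q{i}") for i in range(101)]
records += [("human", "2004-04-25", "mark twain"), ("human", "2004-04-25", "cloning")]

kept = filter_sessions(records)
print(len(kept))  # prints: 2
```

A session with exactly 100 queries is retained, matching the "100 or fewer" wording.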

When a searcher submits a query, then views a document, and returns to the search engine, the Vivisimo server logs this second visit with the identical user identification and query, but with a new time (i.e., the time of the second visit). Vivisimo assigns a unique code to identify a user’s multiple interactions with the system. This information is beneficial in determining how many of the retrieved results pages the searcher visited from the search engine; however, it also introduces duplicate queries.

To address this issue, the transaction log was collapsed by combining all identical queries submitted by the same user to give us the unique queries for analyzing sessions, queries and terms, and pages of results viewed. The complete un-collapsed sessions were used in order to obtain an accurate measure of the session duration and the number of results pages visited. When the sessions were collapsed, the number of identical queries by the same user was recorded in a separate field within the remaining records.
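
A minimal sketch of the collapsing step follows. It assumes records arrive in chronological order and treats a query as identical when the same user submits the same string on the same day; the field name `times_submitted` is hypothetical and stands in for the separate count field described above.

```python
def collapse(records):
    """Merge identical queries within a session, keeping the first record and a count."""
    collapsed = {}
    for user, date, time, query in records:
        key = (user, date, query)
        if key in collapsed:
            collapsed[key]["times_submitted"] += 1
        else:
            collapsed[key] = {
                "user": user,
                "date": date,
                "first_time": time,
                "query": query,
                "times_submitted": 1,
            }
    return list(collapsed.values())


# Hypothetical log excerpt, not real Vivisimo data.
log = [
    ("u1", "2004-04-25", "10:00:00", "mark twain"),
    ("u1", "2004-04-25", "10:03:30", "mark twain"),  # logged again on return from a document
    ("u1", "2004-04-25", "10:05:00", "cloning"),
]

rows = collapse(log)
print([(r["query"], r["times_submitted"]) for r in rows])
# prints: [('mark twain', 2), ('cloning', 1)]
```

The un-collapsed `log` is left untouched, so it remains available for the duration and results-page measurements.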

In addition to the fields for unique identifier and number of identical queries, we included a field within each record containing the length of the query, measured in terms. We also generated two other tables for the collapsed data set, one for term data and one for co-occurrence data. The term table contains fields for a term and the number of times that term occurs in the complete data set. The co-occurrence table contains fields for term pairs and the number of times each pair occurs within the data set, irrespective of order.
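
The term and co-occurrence tables can be approximated with two counters. This is a sketch under stated assumptions, not the database procedure used in the study: terms are lower-cased and split on white space, and a pair is counted once per query in which both terms appear.

```python
from collections import Counter
from itertools import combinations


def build_tables(queries):
    """Return (term_counts, pair_counts) for a list of query strings."""
    term_counts = Counter()
    pair_counts = Counter()
    for query in queries:
        terms = query.lower().split()
        term_counts.update(terms)
        # Sorting the distinct terms makes ("york", "new") and ("new", "york")
        # map to the same key, so pairs are counted irrespective of order.
        pair_counts.update(combinations(sorted(set(terms)), 2))
    return term_counts, pair_counts


# Hypothetical queries, not real Vivisimo data.
terms, pairs = build_tables(["new york hotels", "york new", "New York"])
print(terms["new"], pairs[("new", "york")], pairs[("hotels", "new")])
# prints: 3 3 1
```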

The database now contains four tables (un-collapsed data set, collapsed data set, terms, and co-occurrence). The data from these four tables were analyzed to investigate our research questions. The analysis was conducted using database queries (usually a series of layered queries), Visual Basic for Applications scripts, or a combination of the two. A series of UNIX text manipulation commands was used to parse and calculate statistics on some of the clustering data. Key fields were extracted from the log file for the clustering analysis, and each query was identified by a unique Vivisimo-assigned code.

Results

We now address our first research question: What are the characteristics of Web searching on a clustering search engine, such as Vivisimo?

Term Characteristics

It is occasionally difficult to determine the specific usage of a term intended by a searcher outside the framework of a particular query. In these instances, a term co-occurrence analysis is more helpful. Table 1 presents the top term co-occurrence pairs for the data set; each percentage is the pair’s share of the 11,080 occurrences of the listed pairs. All of the listed pairs are phrases or portions of natural language queries.
Table 1: Term Co-occurrence

Term 1       Term 2       Occurrences   %
new          york         1,484         13.4%
what         is           1,346         12.1%
history      of           1,144         10.3%
of           pictures     985           8.9%
real         estate       940           8.5%
for          sale         890           8.0%
download     free         883           8.0%
high         school       875           7.9%
how          a            867           7.8%
windows      xp           860           7.8%
university   of           806           7.3%
Total                     11,080        100.0%

Query Characteristics

The highest percentage of queries contained two terms and the majority of queries (71.7%) contained one, two or three terms. Few queries contained six or more terms (Table 2).

Table 2: Query Length

Length   Occurrences   %
0        3,251         0.4%
1        174,338       18.8%
2        278,377       30.0%
3        212,738       22.9%
4        121,864       13.1%
5        61,974        6.7%
6        29,321        3.2%
7        14,626        1.6%
8        7,475         0.8%
9        4,933         0.5%
10       2,723         0.3%
>10      15,683        1.7%
Total    927,303       100.0%



Table 3 displays the top repeat queries in the data set. There is a wide distribution of queries in the data, and the top ten repeat queries together represent less than one half of one percent (0.36%) of the total number of queries.

Table 3: Top Repeat Queries

Rank    Query          Occurrences   %
1       "Mark Twain"   688           0.07%
2       Looney Tunes   493           0.05%
3       Google         488           0.05%
4       Cloning        428           0.05%
5       yahoo          273           0.03%
6       Ebay           257           0.03%
7       Sex            243           0.03%
8       paris hilton   185           0.02%
9       dictionary     141           0.02%
10      yahoo.com      135           0.01%
Total                  3,331         0.36%

Session Characteristics

Session length is the number of queries per session. Table 4 shows that the highest percentage of sessions (41.8%) contained one query. The majority of sessions (72.6%) contained one, two, or three queries.

Table 4: Session Length

Length   Occurrences   %
1        115,064       41.8%
2        54,094        19.6%
3        30,735        11.2%
4        19,538        7.1%
5        13,322        4.8%
6        9,300         3.4%
7        6,963         2.5%
8        5,180         1.9%
9        4,036         1.5%
10       3,037         1.1%
>10      14,187        5.1%
Total    275,456       100.0%

Session duration was measured from the time the first query was submitted until the user departed the search engine for the last time (i.e., did not return) on a given day. This definition allows for the measurement of the total user time on the search engine and the time spent viewing the first and all subsequent Web documents, except the final document. The final viewing time is not available because time stamps are recorded only by the Web search engine server, which logs no further record once the user leaves for the last time. A limitation of this type of naturalistic study is that the time between leaving for a Web document and returning to the search engine may not have been spent entirely viewing that document.
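
As a minimal sketch of this measurement, session duration is the span between the first and last server time stamps of a user's un-collapsed records on one day. The `HH:MM:SS` time format and the function name are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def session_duration(time_stamps):
    """Elapsed time between the first and last logged record of one session."""
    times = sorted(datetime.strptime(t, "%H:%M:%S") for t in time_stamps)
    return times[-1] - times[0]


# Hypothetical session: a query, a return visit, and a final return visit.
stamps = ["10:00:00", "10:02:10", "11:34:01"]
duration = session_duration(stamps)
under_one_minute = duration < timedelta(minutes=1)

print(duration, under_one_minute)  # prints: 1:34:01 False
```

A session with a single logged record has a duration of zero under this definition, since no later time stamp exists to mark the end of the final document view.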

Concerning the aggregate statistics for session duration, the mean session duration is one hour, thirty-four minutes, and one second (1:34:01), but the mode is less than a minute. The minimum session was approximately one second, while the maximum session spanned nearly 24 hours (23:59:44). Almost half of the sessions (45.5%) were less than a minute in length (Table 5).

Table 5: Distribution of Session Duration

Session Duration     Occurrences   %
Less than 1 minute   125,241       45.5%
1 to 5 minutes       30,275        11.0%
5 to 10 minutes      15,592        5.7%
10 to 15 minutes     9,801         3.6%
15 to 30 minutes     16,197        5.9%
30 to 60 minutes     14,412        5.2%
1 to 2 hours         13,079        4.7%
2 to 3 hours         7,958         2.9%
3 to 4 hours         6,345         2.3%
More than 4 hours    36,566        13.3%
Total                275,466       100.0%


Interaction with Clusters

We now address our second research question: How are searchers interacting with clustered results?
Vivisimo has a proprietary clustering feature, and we investigated its usage. The interface permits viewing the results without interacting with the clusters. The clusters may also be at several levels (i.e., top, second, third, etc.), with each level containing a narrower cluster of results. To view a cluster, a searcher may click on it to expand it, in which case there may be an aggregate number of the results, more clustering levels, or a combination of both. There may be more clusters than appear on the first page, in which case a user may view the next page of clustering results.

For each query submission, the Vivisimo search returns three frame records within the transaction log. The form frame defines the frame set including tree and list frames. The tree frame represents clusters, and the list frame corresponds to the results pages.
We classified each interaction with the clustering feature. Interactions with the clusters themselves were classified by the number of clusters expanded at each level at that instance (e.g., top level, second level, third level, etc.).
More than 85% of the time, users did not interact with the clustering feature. Just under 6% of the searchers viewed more than the initial cluster. The most common cluster expansion was One Top Level Cluster.

A sizeable percentage of searchers (1.06%) viewed the cluster results without expanding any clusters (i.e., clicked the top cluster that lists all of the results). So, most users had little interaction with the clusters; however, a small number of users had ten clusters expanded at any one time, and some users expanded to the twenty-sixth level of clusters.

Clusters were expanded in 12.6% of the records not including the initial query results in this data set. This distribution represents user interaction with the system following the presentation of initial search results (Figure 2).

Figure 2:  Post-Search Results Cluster Expansion
Cluster expansion indicates user activity in clicking on the “+” sign next to a cluster label and expanding the tree to reveal sub-clusters. Figure 3 shows the distribution and extent to which clusters were expanded. Clusters are most frequently expanded once, which represents about 60% of all cluster interactions. The maximum number of clusters expanded in a record is 26.

Figure 3: Total Cluster Expansion

Visitation Pattern Characteristics

We now address our third research question: What are the visitation patterns of searchers using Vivisimo?
We first examine the number of sessions of repeat users, that is, users who visited Vivisimo on more than one day. We isolated the unique user identification codes of these users to see how many times they visited the Vivisimo search engine during the 8-day period.

From Table 6, we see that 68% of the users visited the search engine 2 or 3 times, accounting for 47% of the sessions. Only about 1% of the users visited the search engine on all 8 days of the collection period. However, the last day of the data collection was not a full day (i.e., only slightly more than four hours). Another 2.3% of the users visited the search engine on seven days, so this percentage may be a better indicator of the percentage of daily repeat users.

Table 6: Sessions by Repeat Users of Vivisimo

No. of Days User         No. of    %         No. of Sessions      %
Visited Search Engine    Users               During Time Period
2                        17,762    46.5%     35,524               46.5%
3                        8,393     22.0%     16,786               22.0%
4                        5,405     14.1%     10,810               14.1%
5                        3,905     10.2%     7,810                10.2%
6                        1,393     3.6%      2,786                3.6%
7                        886       2.3%      1,772                2.3%
8                        460       1.2%      920                  1.2%
Total                    38,204    100.0%    76,408               100.0%



We next examined the usage levels by day, shown in Table 7. In Table 7, rows one and two show the number of sessions each day and the respective percentage of the entire data set. Rows three and four show the same for the repeat users.
Looking at the all users rows, there are 275,456 total sessions generated by 193,572 unique users. The mean is 34,432 sessions per day, with weekdays showing a higher percentage of sessions. If we ignore the 4-hour period on Sunday, 2 May, the mean is 38,763 sessions.
Table 7: Sessions per Day for Repeat Users of Vivisimo

               Total     Sun      Mon      Tues     Wed      Thurs    Fri      Sat      Sun
                         28-Mar   29-Mar   30-Mar   31-Mar   1-Apr    2-Apr    3-Apr    4-Apr
All Users      275,456   29,476   44,630   43,524   43,898   42,630   38,653   28,532   4,113
  %            100%      10.7%    16.2%    15.8%    15.9%    15.5%    14.0%    10.4%    1.5%
Repeat Users   120,088   9,339    20,397   20,764   20,778   19,988   17,832   9,050    1,940
  %            100%      7.8%     17.0%    17.3%    17.3%    16.6%    14.8%    7.5%     1.6%



Examining repeat users, there are 120,088 sessions from 38,204 users. So, 20% of Vivisimo users accounted for 44% of the sessions. The mean is 15,011 sessions per day (16,878 sessions ignoring 2 May), again with the weekdays showing a higher percentage of sessions.
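
The session means and shares reported above follow directly from the totals in Table 7. The short check below reproduces the arithmetic; the helper name `means_per_day` is ours, and the final-day counts are the last column of Table 7.

```python
DAYS = 8  # collection days, the last of which covers only about four hours


def means_per_day(total, final_day):
    """Mean per day over all days, and over the seven full days only."""
    return total / DAYS, (total - final_day) / (DAYS - 1)


# Session counts from Table 7: (total, sessions on the final partial day).
all_mean, all_mean_full = means_per_day(275_456, 4_113)
rep_mean, rep_mean_full = means_per_day(120_088, 1_940)

user_share = 38_204 / 193_572      # repeat users as a share of unique users
session_share = 120_088 / 275_456  # their share of all sessions

print(round(all_mean), round(all_mean_full))    # prints: 34432 38763
print(round(rep_mean), round(rep_mean_full))    # prints: 15011 16878
print(f"{user_share:.1%} {session_share:.1%}")  # prints: 19.7% 43.6%
```

The unrounded shares (19.7% and 43.6%) are the values reported in the text as 20% and 44%.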
We next examine the query level of analysis for these repeat users.

Queries

We calculated the number of queries that the 38,204 repeat searchers submitted during the data collection period, as shown in Table 8.
Looking at the all users rows, there are 927,303 queries generated by 193,572 unique users. The mean is 115,913 queries per day, with weekdays showing a higher percentage of queries. If we ignore the 4-hour period on Sunday, 2 May, the mean is 131,041 queries.


Table 8: Queries per Day by Repeat Users of Vivisimo

               Total     Sun      Mon       Tues      Wed       Thurs     Fri       Sat      Sun
                         28-Mar   29-Mar    30-Mar    31-Mar    1-Apr     2-Apr     3-Apr    4-Apr
All Users      927,303   91,438   157,945   153,785   152,476   146,921   128,396   86,326   10,016
  %            100%      9.9%     17.0%     16.6%     16.4%     15.8%     13.8%     9.3%     1.1%
Repeat Users   541,754   39,466   97,277    96,888    96,199    91,338    77,944    37,657   4,985
  %            100%      7.3%     18.0%     17.9%     17.8%     16.9%     14.4%     7.0%     0.9%




Approximately 80% of queries were entered during weekdays, with about 5% fewer queries submitted per day on the weekends.
Examining repeat users, there are 541,754 queries from 38,204 users. Therefore, 20% of Vivisimo users accounted for 58% of the queries. The mean is 67,719 queries per day (76,681 queries ignoring 2 May), again with the weekdays showing a higher percentage of queries.
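
The weekday and repeat-user shares of queries can be checked against the daily counts in Table 8. The list below copies the all-users row in column order; treating list positions 1 through 5 as Monday through Friday is the only assumption.

```python
# Queries per day from Table 8, in column order (first Sunday through final Sunday).
all_queries = [91_438, 157_945, 153_785, 152_476, 146_921, 128_396, 86_326, 10_016]
repeat_queries_total = 541_754

total = sum(all_queries)
weekday_share = sum(all_queries[1:6]) / total  # Monday through Friday
repeat_share = repeat_queries_total / total

print(total)                   # prints: 927303
print(f"{weekday_share:.1%}")  # prints: 79.7%
print(f"{repeat_share:.1%}")   # prints: 58.4%
```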

Discussion

Term frequency data showed a wide distribution. Interestingly, 60% of the top ten term co-occurrence pairs (including one inverted term pair) were found in previous term co-occurrence analyses of search engine data sets (Leydesdorff, 1989). Some elements of search strings are common over long periods. Repeated queries were widely distributed, and the top repeated queries represented less than one-half of one percent of the total number of queries. Web information is heterogeneous, and the nature of repeated query entries reflects the span of topic coverage.

The findings indicate that higher numbers of queries were presented to the system during weekdays. The queries were generally brief. The highest percentage of queries (30%) contained two terms and the majority of queries (71.7%) contained one, two, or three terms. This is a slightly higher percentage than earlier research which showed that 60% of Web searchers used 1 or 2 terms (Jansen et al., 2000). Hence, this result is not unique to a clustering environment.

Higher percentages of search sessions occurred on weekdays. Almost half of the search sessions (41.8%) contained one query. The session duration mode value was less than one minute, and almost half of the sessions (45.5%) fell into this category. The search session times were shorter than those shown in previous research. In an AltaVista study, 72% of sessions were less than 5 minutes and 82% were less than 15 minutes (Jansen et al., 2005).
 
Concerning the user interaction with clusters, the higher percentage of list records shows that more results pages were viewed than cluster expansions. This means that the users clicked on clusters to retrieve results pages or they clicked on the “more” option to retrieve more results pages. The record analysis shows that almost half of the post-search user interactions involve clicking on Vivisimo clusters; however, expanding the cluster tree is infrequently used.

Implications

From a search engine design perspective, the general use of clusters in synthesizing Web search results may represent a more efficient method for the display and rendering of Web search results. The brevity of search session times with Vivisimo implies that using clusters offers a more direct approach to finding the information that users are seeking. Interaction with the clusters points toward a similar pattern. Cluster clicking activity was well represented in the data, and the initial cluster display may have been sufficient to resolve the information need, thus reducing the user’s need to extend the cluster tree to find more cluster labels. This supposition would need to be tested further in usability studies.

From an interface design perspective, the direct manipulation of clusters works well in the handling of search results. Clicking on cluster labels is better utilized than elongating the cluster tree. Cluster label selection may be more intuitive since it is analogous to clicking on hyperlinks or file folders, whereas the cluster tree expansion option is not immediately visible to the end user.

However, Vivisimo’s search entry dialogue box, which is similar to that of traditional search engine technologies, did not impact the nature of query construction. The general characteristics of user searches were similar to those of non-clustering or traditional search engines. Earlier cluster-related user studies support these findings. In the Scatter/Gather study, Hearst and Pedersen (1996) showed that documents similar to each other are more relevant than non-similar ones. Their users interacted well with clusters, and initial results showed that they selected clusters containing the most relevant documents. Similarly, Chen and Dumais (2000) found that the user interface which categorized Web search results measured better with users than the traditional list interface.

Results from this study also show that about 20% of the searchers account for the majority of Web searching usage. Practical consideration can be given to adopting personalization features and other methods that are directed toward increasing the relevance of results for these high-volume users.

Conclusions and Future Work

The analysis of a Vivisimo transaction log showed that while clustering is a new feature in search engine technology that is actively used by Vivisimo searchers, search characteristics such as term co-occurrence remain relatively stable in comparison to earlier Web research. Future research includes the examination of cluster usage on a per query basis and investigating user interaction with Vivisimo to specifically determine patterns in cluster label selection, depth of clusters selected and use of “find in clusters” feature. Observing user interaction with the system in an empirical study would help to understand searching behavior based on real-time usage of a cluster-based interface.


References

Cacheda, F. and Viña, A. (2001). "Experiences retrieving information in the World Wide Web," in Proceedings of the 6th IEEE Symposium on Computers and Communications, Hammamet, Tunisia, pp. 72-79.

Chen, H. and Dumais, S. (2000). "Bringing order to the Web: automatically categorizing search results," in Proceedings of the SIGCHI conference on Human factors in computing systems, The Hague, The Netherlands, pp. 145-152.

Chen, H-M. and Cooper, M.D. (2001). "Using clustering techniques to detect usage patterns in a Web-based information system," Journal of the American Society for Information Science and Technology, vol. 52, pp. 888-904.

Hearst, M.A. and Pedersen, J.O. (1996). "Reexamining the cluster hypothesis: scatter/gather on retrieval result," in Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, Zurich, Switzerland, pp. 76-84.

Jansen, B.J., Spink, A. and Saracevic, T. (2000). "Real Life, Real Users, and Real Needs: A Study and Analysis of User Queries on the Web," Information Processing and Management, vol. 36, pp. 207-227.

Jansen, B.J., Spink, A. and Pedersen, J. (2005). "Trend Analysis of AltaVista Web Searching," Journal of the American Society for Information Science and Technology, vol. 56, pp. 559-570.

Kim, H.R. and Chan, P.K. (2003). "Learning Implicit User Interest Hierarchy for Context in Personalization," in Proceedings of the 8th International Conference on Intelligent User Interfaces, Miami, Florida, USA, pp. 101 - 108.

Leydesdorff, L. (1989). "Words and co-words as indicators of intellectual organization," Research Policy, vol. 18, pp. 209-223.

Montgomery, A. and Faloutsos, C. (2001). "Identifying web browsing trends and patterns," IEEE Computer, vol. 34, pp. 94-95.

Silverstein, C., Henzinger, M., Marais, H. and Moricz, M. (1999). "Analysis of a Very Large Web Search Engine Query Log," SIGIR Forum, vol. 33, pp. 6-12.

Zamir, O. and Etzioni, O. (1999). "Grouper: a dynamic clustering interface for Web search results," Computer Networks, vol. 31, pp. 1361-1374.

Zeng, H., He, Q., Chen, Z., Ma, W. and Ma, J. (2004). "Learning to cluster Web results," in Proceedings of the SIGIR conference on research and development in information retrieval, Sheffield, England, pp. 210-216.

 

Acknowledgements

We would like to thank Vivisimo for providing the data for this analysis without which we could not have conducted this research.


Copyright

<Jansen, Koshman & Spink>, © 2006. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.