Investigating Usage of the Vivisimo Clustering Search Engine Interface

Bernard J. Jansen, Assistant Professor, College of Information Sciences and Technology, The Pennsylvania State University, 329F Information Sciences and Technology Building, University Park, PA, 16802, USA. jjansen@acm.org

Sherry Koshman, Assistant Professor, School of Information Sciences, University of Pittsburgh, 135 N. Bellefield Ave., Pittsburgh, PA, USA 15260. skoshman@sis.pitt.edu

Amanda Spink, Professor, Faculty of Technology, Queensland University of Technology, Gardens Point Campus, 2 George St, GPO Box 2434, Brisbane QLD 4000. ah.spink@qut.edu.au

Abstract

Various interface and algorithmic techniques are under development to assist Web searchers in managing the volume of information available on the Web. Clustering of Web search engine results is one such area. User searching with clustered results in operational Web environments is not well understood. This paper reports on a Web usage analysis of Vivisimo.com, a Web meta-search engine that dynamically clusters users’ search results in real time. The research questions are: (1) What are the characteristics of Web searching on a clustering search engine, such as Vivisimo? (2) How are searchers interacting with clustered results? and (3) What are the visitation patterns of searchers using Vivisimo? We analyzed data collected from 25 April to 2 May 2004, representing 100% of site traffic. Results show that approximately 50% of searchers who interact with results do use clusters, but only 2% of interactions go beyond the top-level clusters. These results provide new insight into search characteristics with a cluster-based Web search engine and point to the need for better results visualization methods.

Introduction

Effective interface design is critical for a number of applications such as Web system design, information architecture, ecommerce decisions, and clustering of search results. Cumbersome search results lists generated by traditional Web search engines are a well-recognized problem in Web information retrieval (Chen & Dumais, 2000). Clustering, which provides users with a means of viewing groups of similar search results, can potentially enhance the effectiveness of Web search (Kim & Chan, 2003).
The application of clustering to Web search engine technology is a novel approach that offers structure to the information deluge often faced by Web searchers. Researchers have studied clustering methods in research labs (Chen & Cooper, 2001; Zamir & Etzioni, 1999; Zeng et al., 2004). However, there has been little research into Web searchers’ interaction with clustered search engine results in a naturalistic Web environment. This investigation’s objective is to understand better the nature of user interaction with an operational cluster-based Web search engine. How do searchers interact with clustered results? How do users search on cluster-based search engines? How often do searchers frequent such search engines?
We conducted a quantitative Web usage analysis to examine user queries presented to the system. The overall goal of this research is to extend further the line of user interaction research in Web searching and specifically with clustering as implemented by the Vivisimo search engine, which is a common industry standard display for Web results clustering.

Related Work

This investigation uses transaction log analysis to study Vivisimo usage characteristics. Transaction logs capture the interactions between Web systems and the users of those systems. Web usage mining using transaction log analyses offers an unobtrusive method for studying user interactions with Web search engines. This study extends this line of research on Web search engines to a cluster-based operational environment. In this context, an operational environment refers to a publicly available commercial search engine on the Web. User interaction with clusters in operational environments is currently unexplored and therefore not well understood.
In related work, a log analysis was used to evaluate user interaction with Grouper, a clustering interface for Web search engine results (Zamir & Etzioni, 1999). The findings showed that users tended to examine more clusters than hypothesized. The logs were also analyzed to compare Grouper with a traditional text-based interface, HuskySearch, in order to determine the number of documents clicked on by the users. Their results showed that users more often followed multiple documents when using the Grouper clustering interface and more often followed a single document when using HuskySearch.

Research Questions

The research questions are: (1) What are the characteristics of Web searching on a clustering search engine, such as Vivisimo?, (2) How are searchers interacting with clustered results?, and (3) What are the visitation patterns of searchers using Vivisimo?
Search characteristics are defined as query structure (frequency, length, repeated), terms per query, term co-occurrence, session length, session frequency, and session duration. Cluster use is defined as the pattern and frequency of cluster manipulation by Vivisimo searchers. Visitation patterns are measurements of daily usage of Vivisimo by individual searchers.

Research Design

Vivisimo.com

The Vivisimo interface contains a dialog box for inputting queries and supports Boolean and exact phrase matching (http://www.vivisimo.com).
The default search source is the Web, and a drop-down menu provides options for additional source selection (e.g., CBC, CNN, Wisenut). Searches can be limited by domain or host name, by link content, or by Web page or Uniform Resource Locator (URL) information.

Vivisimo offers an “Advanced” search form containing options for source and language selection, defining the number and display of search results, deciding how links should be opened, and whether or not the content filter is applied. After a user submits a query, Vivisimo presents the clusters using a tree metaphor, which is similar to that used for viewing folders in Windows Explorer. The clusters appear on the left side of the page and the results pages are featured on the right of the main search page (Figure 1).

Figure 1: Vivisimo Interface

Unlike typical Web search engines, which present lists of search output, Vivisimo’s clustering feature creates dynamic post-search categories in a meta-searching environment. Users can click on cluster labels to retrieve results pages from that cluster. Clusters can be expanded by clicking on the plus sign to reveal sub-clusters and the cluster tree may be elongated by clicking on the “More” option. Search terms can be entered in the “Find in clusters” search box to search the clusters.

The results pages are initially displayed as a result of the initial search. Results pages are retrieved when the user clicks on the clusters and additional results pages may be selected at the bottom of the window. Hyperlinks may be accessed for individual items and Web pages may be previewed, opened in the results frame, or opened in a new window.

An item on the results pages may be located within the clusters by clicking on the “show in clusters” option next to the item. This highlights the clusters on the tree that contain the item. The “Details” feature shows the number of results for the sources searched.

Data Collection

The Vivisimo transaction log data used for this study represents a one-week period from April 25 to May 02, 2004. The transaction log recorded 100% of the traffic on the Vivisimo Web site during this period and contained 927,303 queries.

Data Analysis

The transaction log is a flat ASCII file, which was imported into a relational database, and a unique identifier for each record was assigned. Using four fields (User Identification, Date, Time of Day, and Query Terms), the initial query was located and the chronological series of actions on a given day was recreated to represent a user session.
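
The session reconstruction described above can be sketched in Python as follows. This is a minimal illustration, not the authors' relational-database procedure; the record layout, field order, and sample values are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical log records: (user_id, date, time_of_day, query).
# The layout and the sample values are illustrative, not taken from the Vivisimo log.
records = [
    ("u1", "2004-04-25", "09:15:02", "mark twain"),
    ("u2", "2004-04-25", "09:15:40", "real estate"),
    ("u1", "2004-04-25", "09:01:10", "twain biography"),
    ("u1", "2004-04-26", "14:30:00", "windows xp"),
]

def build_sessions(records):
    """Group records by (user, date) and order each group by time of day."""
    sessions = defaultdict(list)
    for user_id, date, time_of_day, query in records:
        sessions[(user_id, date)].append((time_of_day, query))
    # Zero-padded HH:MM:SS strings sort chronologically as plain strings.
    return {key: sorted(actions) for key, actions in sessions.items()}

sessions = build_sessions(records)
```

Keying on the user identification and the date mirrors the definition of a session as one user's activity on a given day; the first element of each sorted list is the initial query.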

A term is any series of characters separated by white space or other separator. A query is the entire string of terms submitted by a searcher in a given instance of interaction. A session is the entire series of queries submitted by a user during one interaction with the Web search engine on a given day. An identical query is a query that is a copy of a previous query within the same user session. A repeat query is a query submitted more than once, irrespective of the user.
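
As a rough illustration of these definitions, the sketch below splits queries into terms and finds repeat queries. It assumes that white space is the only separator and that queries are compared after lower-casing; neither choice is specified in the text, and the sample queries are invented.

```python
from collections import Counter

def split_terms(query):
    """Split a query into terms on white space (other separators are not handled)."""
    return query.split()

# Invented sample queries; the empty string stands for a query with zero terms.
queries = ["new york hotels", "Mark Twain", "mark twain", ""]
lengths = [len(split_terms(q)) for q in queries]

# Repeat queries: the same query submitted more than once, irrespective of the user.
# Lower-casing before counting is an assumption of this sketch.
counts = Counter(q.lower() for q in queries if q)
repeat_queries = {q: n for q, n in counts.items() if n > 1}
```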

The transaction log contained searches from both human users and agents. This analysis focused on only those queries submitted by humans rather than by some automated process. Given that there is no way to accurately distinguish human from non-human searchers, most researchers utilizing transaction logs for data collection must either ignore the issue (Cacheda & Vina, 2001; Jansen et al., 2005) or assume some temporal or interaction cut-off (Silverstein et al., 1999).

We used the latter approach, separating sessions having 100 or fewer queries into an individual transaction log. We selected this cut-off because it is almost 50 times greater than the reported mean search session for human Web searchers, and it ensured that human searches were not excluded. Although this cut-off probably introduced some agent or common user terminal sessions, we assumed that the analysis yielded a subset of the transaction log that contained queries submitted primarily by human searchers, yet remained broad enough not to introduce bias through too low a cut-off threshold.
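
The cut-off can be expressed as a simple filter. The sketch below assumes sessions are held as a mapping from a session key to its list of queries; that representation and the sample data are hypothetical, and only the threshold of 100 comes from the text.

```python
MAX_QUERIES_PER_SESSION = 100  # cut-off stated in the text

# Hypothetical mapping of session key -> list of queries in that session.
sessions = {
    ("u1", "2004-04-25"): ["mark twain", "twain biography"],
    ("bot7", "2004-04-25"): ["q%d" % i for i in range(500)],
    ("u2", "2004-04-26"): ["real estate"] * 100,
}

def keep_likely_human(sessions, max_queries=MAX_QUERIES_PER_SESSION):
    """Keep sessions with at most max_queries queries."""
    return {key: qs for key, qs in sessions.items() if len(qs) <= max_queries}

human_sessions = keep_likely_human(sessions)
```

A session with exactly 100 queries is retained, matching the "100 or fewer" wording.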

When a searcher submits a query, then views a document, and returns to the search engine, the Vivisimo server logs this second visit with the identical user identification and query, but with a new time (i.e., the time of the second visit). Vivisimo assigns a unique code to identify a user’s multiple interactions with the system. This information is useful for determining how many of the retrieved results pages the searcher visited from the search engine; however, it also introduces duplicate queries.

To address this issue, the transaction log was collapsed by combining all identical queries submitted by the same user to give us the unique queries for analyzing sessions, queries and terms, and pages of results viewed. The complete un-collapsed sessions were used in order to obtain an accurate measure of the session duration and the number of results pages visited. When the sessions were collapsed, the number of identical queries by the same user was recorded in a separate field within the remaining records.
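
One possible way to collapse a session while recording the number of identical queries is sketched below. The sample session is invented, and the approach (counting query strings within one user's session) is an assumption about how the collapse could be done, not a description of the authors' database queries.

```python
from collections import Counter

# Hypothetical un-collapsed session: (time_of_day, query) for one user on one day.
session = [
    ("09:00:00", "mark twain"),
    ("09:02:10", "mark twain"),   # logged again on return from a document
    ("09:05:30", "twain biography"),
    ("09:07:45", "mark twain"),
]

def collapse(session):
    """Return one row per distinct query, in first-seen order, with its count."""
    counts = Counter(query for _, query in session)
    # Counter keeps insertion order (Python 3.7+), so first-seen order is preserved.
    return [(query, n) for query, n in counts.items()]

collapsed = collapse(session)
```

The un-collapsed list is left untouched, so it remains available for measuring session duration and results pages visited.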

In addition to the fields for unique identifier and number of identical queries, we included a field within each record containing the length of the query, measured in terms. We also generated two other tables for the collapsed data set, one for term data and one for co-occurrence data. The term table contains fields for a term and the number of times that term occurs in the complete data set. The co-occurrence table contains fields for term-term pairs and the number of times each pair occurs within the data set, irrespective of order.
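
The term and co-occurrence tables can be approximated with two counters. In this sketch the sample queries are invented, terms are split on white space, and each unordered pair is counted once per query; the last point is an assumption, since the text does not say how pairs repeated within a single query were handled.

```python
from collections import Counter
from itertools import combinations

# Invented sample queries standing in for the collapsed data set.
queries = ["new york hotels", "york new", "real estate new york"]

term_counts = Counter()
pair_counts = Counter()
for query in queries:
    terms = query.split()
    term_counts.update(terms)
    # Sort each pair so ("york", "new") and ("new", "york") share one key.
    pairs = {tuple(sorted(p)) for p in combinations(set(terms), 2)}
    pair_counts.update(pairs)
```

Sorting the two terms of a pair is what makes the count independent of the order in which they were typed.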

The database now contains four tables (un-collapsed data set, collapsed data set, terms, and co-occurrence). The data from these four tables were analyzed to investigate our research questions. The analysis was conducted using queries (usually a series of layered queries), Visual Basic for Applications scripts, or a combination of the two. A series of UNIX text manipulation commands was used to parse and calculate statistics on some of the clustering data. Key fields were extracted from the log file for the clustering analysis, and each query was identified by a unique Vivisimo-assigned code.

Results

We now address our first research question: What are the characteristics of Web searching on a clustering search engine, such as Vivisimo?

Term Characteristics

It is occasionally difficult to determine the specific usage of a term intended by a searcher outside the framework of a particular query. In these instances, a term co-occurrence analysis is more helpful. All eleven of the top term co-occurrence pairs are phrases or portions of natural language queries. Table 1 presents the top term co-occurrences for the data set; each percentage is calculated as a portion of the total occurrences of the pairs listed.
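
Table 1: Term Co-occurrence

| Term | Term | Occurrences | % |
|---|---|---|---|
| new | york | 1,484 | 13.4% |
| what | is | 1,346 | 12.1% |
| history | of | 1,144 | 10.3% |
| of | pictures | 985 | 8.9% |
| real | estate | 940 | 8.5% |
| for | sale | 890 | 8.0% |
| download | free | 883 | 8.0% |
| high | school | 875 | 7.9% |
| how | a | 867 | 7.8% |
| windows | xp | 860 | 7.8% |
| university | of | 806 | 7.3% |
| Total | | 11,080 | 100.0% |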

Query Characteristics

The highest percentage of queries contained two terms and the majority of queries (71.7%) contained one, two or three terms. Few queries contained six or more terms (Table 2).

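Table 2: Query Length

| Length | Occurrences | % |
|---|---|---|
| 0 | 3,251 | 0.4% |
| 1 | 174,338 | 18.8% |
| 2 | 278,377 | 30.0% |
| 3 | 212,738 | 22.9% |
| 4 | 121,864 | 13.1% |
| 5 | 61,974 | 6.7% |
| 6 | 29,321 | 3.2% |
| 7 | 14,626 | 1.6% |
| 8 | 7,475 | 0.8% |
| 9 | 4,933 | 0.5% |
| 10 | 2,723 | 0.3% |
| >10 | 15,683 | 1.7% |
| Total | 927,303 | 100.0% |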



Table 3 displays the top repeat queries in the data set. There is a wide distribution of queries in the data, and the top repeat queries together represent approximately one third of one percent (0.36%) of the total number of queries.

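Table 3: Top Repeat Queries

| Rank | Query | Occurrences | % |
|---|---|---|---|
| 1 | "Mark Twain" | 688 | 0.07% |
| 2 | Looney Tunes | 493 | 0.05% |
| 3 | Google | 488 | 0.05% |
| 4 | Cloning | 428 | 0.05% |
| 5 | yahoo | 273 | 0.03% |
| 6 | Ebay | 257 | 0.03% |
| 7 | Sex | 243 | 0.03% |
| 8 | paris hilton | 185 | 0.02% |
| 9 | dictionary | 141 | 0.02% |
| 10 | yahoo.com | 135 | 0.01% |
| Total | | 3,331 | 0.36% |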

Session Characteristics

Session length is the number of queries per session. Table 4 shows that the highest percentage of sessions (41.8%) contained one query. The majority of sessions (72.6%) contained one, two, or three queries.

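Table 4: Session Length

| Length | Occurrences | % |
|---|---|---|
| 1 | 115,064 | 41.8% |
| 2 | 54,094 | 19.6% |
| 3 | 30,735 | 11.2% |
| 4 | 19,538 | 7.1% |
| 5 | 13,322 | 4.8% |
| 6 | 9,300 | 3.4% |
| 7 | 6,963 | 2.5% |
| 8 | 5,180 | 1.9% |
| 9 | 4,036 | 1.5% |
| 10 | 3,037 | 1.1% |
| >10 | 14,187 | 5.1% |
| Total | 275,456 | 100.0% |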

Session duration was measured from the time the first query was submitted until the user departed the search engine for the last time (i.e., does not return) on a given day. This definition allows for the measurement of the total user time on the search engine and the time spent viewing the first and all subsequent Web documents, except the final document. The final viewing time is not available since the Web search engine server records the time stamp. A limitation of this type of naturalistic study is that the time between visits from the Web document to the search engine may not have been entirely spent viewing the Web document.
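
Under this definition, session duration is the span between the first and last logged time stamps of the un-collapsed session. The sketch below assumes time stamps are available as HH:MM:SS strings within a single day; the format and the sample values are hypothetical.

```python
from datetime import datetime

# Hypothetical time stamps of every logged visit in one un-collapsed session.
visit_times = ["09:00:00", "09:02:10", "10:34:01"]

def session_duration(visit_times):
    """Return the time between the first and last logged visit as a timedelta."""
    parsed = [datetime.strptime(t, "%H:%M:%S") for t in visit_times]
    return max(parsed) - min(parsed)

duration = session_duration(visit_times)
```

Because only server-side time stamps are used, a session with a single logged visit has a duration of zero in this sketch, and the time spent on the final document is never included.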

Concerning the aggregate statistics for session duration, the average session duration is one hour, thirty-four minutes, and one second (1:34:01), but the mode is less than a minute. The minimum session was approximately a second, while the maximum session spanned nearly a 24-hour period (23:59:44). Almost half of the sessions (45.5%) were less than a minute in length (Table 5).

Table 5: Distribution of Session Duration

| Session Duration | Occurrences | % |
|---|---|---|
| Less than 1 minute | 125,241 | 45.5% |
| 1 to 5 minutes | 30,275 | 11.0% |
| 5 to 10 minutes | 15,592 | 5.7% |
| 10 to 15 minutes | 9,801 | 3.6% |
| 15 to 30 minutes | 16,197 | 5.9% |
| 30 to 60 minutes | 14,412 | 5.2% |
| 1 to 2 hours | 13,079 | 4.7% |
| 2 to 3 hours | 7,958 | 2.9% |
| 3 to 4 hours | 6,345 | 2.3% |
| More than 4 hours | 36,566 | 13.3% |
| Total | 275,466 | 100.0% |


Interaction with Clusters

We now address our second research question: How are searchers interacting with clustered results?
Vivisimo has a proprietary clustering feature, and we investigated its usage. The feature permits viewing the results without interacting with the clusters. The clusters may also be at several levels (i.e., top, second, third, etc.), with each level offering a narrower cluster of results. To view a cluster, a searcher may click on it to expand it, in which case there may be an aggregate number of the results, more clustering levels, or a combination of both. There may be more clusters than appear on the first page, in which case a user may view the next page of clustering results.

For each query submission, the Vivisimo search returns three frame records within the transaction log. The form frame defines the frame set including tree and list frames. The tree frame represents clusters, and the list frame corresponds to the results pages.
We classified each interaction with the clustering feature. For specific interactions with the clusters themselves, we classified them based on the number of clusters at each level expanded at that instance (e.g., top level, second level, third level, etc.).
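
A classification of this kind could be sketched as below. The Vivisimo log format is not described in the text, so the input (a list giving the depth of every cluster expanded in a record, with 1 as the top level) and the label wording are purely hypothetical.

```python
from collections import Counter

def classify(expanded_levels):
    """Summarize one record by how many clusters are expanded at each level.

    expanded_levels is a hypothetical list holding the depth (1 = top level)
    of every cluster expanded in that record.
    """
    if not expanded_levels:
        return "no clusters expanded"
    per_level = Counter(expanded_levels)
    return ", ".join(f"{n} at level {lvl}" for lvl, n in sorted(per_level.items()))

# Three invented records: no expansion, one top-level cluster, and a deeper expansion.
labels = [classify(r) for r in ([], [1], [1, 1, 2])]
```
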
More than 85% of the time, users did not interact with the clustering feature. Just under 6% of the searchers viewed more than the initial cluster. The most common cluster expansion was One Top Level Cluster.

A sizeable percentage of searchers (1.06%) viewed the cluster results without expanding any clusters (i.e., clicked the top cluster that lists all of the results). Most users therefore had little interaction with the clusters; however, a small number of users had ten clusters expanded at any one time, and some users expanded to the twenty-sixth level of clusters.

Clusters were expanded in 12.6% of the records not including the initial query results in this data set. This distribution represents user interaction with the system following the presentation of initial search results (Figure 2).

Figure 2:  Post-Search Results Cluster Expansion
Cluster expansion indicates user activity in clicking on the “+” sign next to a cluster label and expanding the tree to reveal sub-clusters. Figure 3 shows the distribution and extent to which clusters were expanded. Clusters are most frequently expanded once, representing about 60% of all cluster interactions. The maximum number of clusters expanded in a record is 26.

Figure 3: Total Cluster Expansion

Visitation Pattern Characteristics

We now address our third research question: What are the visitation patterns of searchers using Vivisimo?
We first examine the number of sessions of repeat users, that is, users who visited Vivisimo on more than one day. We isolated their unique user identification codes to see how many times these users visited the Vivisimo search engine during the 8-day period.

From Table 6, we see that 68% of the users visited the search engine 2 or 3 times, accounting for 47% of the sessions. Only approximately 1% of the users visited the search engine on all 8 days of the collection period. However, the last day of the data collection was not a full day (i.e., only slightly more than four hours). Because 2.3% of the users made seven repeated visits to the search engine, this percentage may be a better indicator of the percentage of daily repeat users.

Table 6: Sessions by Repeat Users of Vivisimo

| No. of Days User Visited Search Engine | No. of Users | % | No. of Sessions During Time Period | % |
|---|---|---|---|---|
| 2 | 17,762 | 46.5% | 35,524 | 46.5% |
| 3 | 8,393 | 22.0% | 16,786 | 22.0% |
| 4 | 5,405 | 14.1% | 10,810 | 14.1% |
| 5 | 3,905 | 10.2% | 7,810 | 10.2% |
| 6 | 1,393 | 3.6% | 2,786 | 3.6% |
| 7 | 886 | 2.3% | 1,772 | 2.3% |
| 8 | 460 | 1.2% | 920 | 1.2% |
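
A visitation count of this kind can be derived from the session keys alone. The sketch below assumes one (user, date) pair per session and treats a user seen on two or more days as a repeat user; the sample data are invented and the variable names are illustrative.

```python
from collections import Counter

# Hypothetical (user_id, date) pairs, one per session.
session_keys = [
    ("u1", "2004-04-25"), ("u1", "2004-04-26"), ("u1", "2004-04-27"),
    ("u2", "2004-04-25"), ("u2", "2004-04-28"),
    ("u3", "2004-04-30"),
]

# Number of distinct days on which each user appears.
days_per_user = Counter(user for user, _ in set(session_keys))

# Distribution of repeat users by number of days visited (two or more days).
repeat_distribution = Counter(d for d in days_per_user.values() if d >= 2)
```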