Creating a keyword search vocabulary for Course Finder

Dey Alexander, Principal Consultant, Dey Alexander Consulting [HREF1], PO Box 2655, Cheltenham, VIC 3192. dey@deyalexander.com.au

Scott Rippon, User Interface Designer, IT Services Division, Monash University [HREF2], Building 203, 700 Blackburn Rd, Clayton, VIC 3800. scott.rippon@its.monash.edu.au

Guy Sangwine, User Interface Designer, IT Services Division, Monash University [HREF2], Building 203, 700 Blackburn Rd, Clayton, VIC 3800. guy.sangwine@its.monash.edu.au

Abstract

This paper describes the creation of a keyword search vocabulary for Course Finder, Monash University's web-based course catalogue. It describes the approach that was taken, the basic structure of the vocabulary, and lessons learned during the design process.

Introduction

Course Finder is a web-based application providing access to Monash University's catalogue of degree and diploma courses. It is an important component of the university's prospective student website. In its second phase of development, stakeholders decided that a keyword search would improve the usability of the application.

A full-text search would have been the easiest approach. However, the application only had access to course descriptions, not the descriptions of subjects taught within each course. Course descriptions were not a rich enough source of keywords, so it was clear that we would need to generate a vocabulary to drive the keyword search.

We began by looking for existing vocabularies. However, none provided a relevant set of terms to match Monash's course offerings and they were generally far too broad in overall scope. We decided to start work on developing a vocabulary from scratch.

Developing the vocabulary structure

We focused first on the structure of the vocabulary. A key consideration was making the vocabulary manageable. We needed a structure that was clear and easy to maintain without specialist knowledge of search. We decided on the following:

"Study area" and "study level" were terms used by those involved in student recruitment at Monash. They were already in use in the first version of Course Finder, and so could easily be adopted as the key organising terms for the vocabulary.

Because the vocabulary was held in a relational database, each term could be assigned a set of variant terms (misspelt or mistyped forms of the term). Our earlier user research had shown that typing and spelling errors were quite common, particularly amongst school leavers and international students.
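The relationship between terms and their variants can be sketched as two tables. This is a minimal illustration only: the table names, column names and the "psychology" example are assumptions made for the sketch, not the schema or data used in Course Finder.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not Course Finder's own.
SCHEMA = """
CREATE TABLE term (
    id    INTEGER PRIMARY KEY,
    label TEXT NOT NULL UNIQUE,
    kind  TEXT NOT NULL          -- e.g. 'primary study area'
);
CREATE TABLE variant (
    id       INTEGER PRIMARY KEY,
    term_id  INTEGER NOT NULL REFERENCES term(id),
    spelling TEXT NOT NULL UNIQUE -- misspelt or mistyped form
);
"""


def build_demo_db():
    """Create an in-memory database holding one term and two variants."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO term (id, label, kind) VALUES (?, ?, ?)",
        (1, "psychology", "primary study area"),
    )
    conn.executemany(
        "INSERT INTO variant (term_id, spelling) VALUES (?, ?)",
        [(1, "pyschology"), (1, "psycology")],
    )
    conn.commit()
    return conn


def resolve(conn, user_input):
    """Return the vocabulary term for an exact or variant match, else None."""
    word = user_input.strip().lower()
    row = conn.execute(
        "SELECT label FROM term WHERE label = ? "
        "UNION "
        "SELECT t.label FROM variant v JOIN term t ON t.id = v.term_id "
        "WHERE v.spelling = ?",
        (word, word),
    ).fetchone()
    return row[0] if row else None
```

With a structure like this, adding a newly observed misspelling is a single row insert, which suits maintenance by someone without specialist knowledge of search.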

Populating the vocabulary with terms

When it came to populating the vocabulary with terms, we first worked on the primary study area terms. We considered the existing terms used in the first version of Course Finder, but soon realised that they did not provide the level of granularity required. The university offers courses that are very broad-ranging in subject matter, such as the Bachelor of Arts, but it also offers courses that are very narrow in focus, like the Diploma in Languages (Spanish). Using "Languages, literature, cultures, linguistics" as a primary organising term would mean that a user searching on "Ukrainian" would get the Diploma in Languages (Spanish) in the result set. This was clearly not desirable. We did, however, retain the existing set of study area terms for the browse alternative to the keyword search, as the image below shows.

Screenshot showing browse options from Course Finder

Working with diplomas and graduate diplomas, which represented the narrowest study area coverage, we developed the primary study area terms. As far as possible, we used terms from course names so that they would be familiar to those maintaining the vocabulary. We then used course descriptions to generate an initial list of related study area terms and related career terms, before turning to descriptions of subjects offered within degrees and diplomas to further populate the vocabulary. Finally, we added synonyms.

We then moved on to the primary study level terms. The first terms we created were "diploma", "bachelor", "master", "doctorate", etc. We considered using "undergraduate" and "postgraduate" as primary study level terms but opted instead to add them as related study level terms to each of the primary terms we had already created. The rationale was simply ease of maintenance (applying two primary terms to a course would take longer than applying one).

The project sponsor, the subject matter expert with whom we worked closely during the vocabulary development, was keen to have abbreviations for each of the course names added to the vocabulary. She thought students might search on terms such as BA (for Bachelor of Arts). Without an existing keyword search, we had no data on this kind of search behaviour, but in user testing we had seen several examples of users searching on MBA. So a further set of primary study level terms, representing course names such as Master of Business Administration, was added. To those we added a series of related study level terms, such as "MBA".
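The relationship between primary and related study level terms can be pictured as a simple mapping. The sketch below is an assumption about how such a lookup might work, not a description of the Course Finder implementation; the dictionary contents are drawn from the examples in the text and the function name is hypothetical.

```python
# Primary study level terms mapped to their related study level terms.
# Illustrative data only, based on the examples given in the text.
RELATED_STUDY_LEVEL = {
    "bachelor": {"undergraduate"},
    "master": {"postgraduate"},
    "doctorate": {"postgraduate"},
    "master of business administration": {"mba", "postgraduate"},
}


def primary_levels_for(search_term):
    """Return the primary study level terms matched by a search term.

    A search term matches a primary term if it equals the primary term
    itself or any of its related terms.
    """
    term = search_term.strip().lower()
    return sorted(
        primary
        for primary, related in RELATED_STUDY_LEVEL.items()
        if term == primary or term in related
    )
```

Only one primary term needs to be applied to each course; the related terms ("undergraduate", "MBA") are attached to the primary term once and are inherited by every course that carries it.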

Stemming and stop words

We decided to use stemming technology with the vocabulary (Frakes, 1992). Stemming reduces each term entered into the vocabulary to its stem by removing suffixes (e.g. "biostatistics" becomes "biostat"). The advantage is that the vocabulary does not need to include every word that shares a stem (e.g. "biostatistics", "biostatistician", "biostatisticians"). Stemming is also applied to users’ search terms, so the search input can be matched with the keyword stems included in the vocabulary. We liaised with the development team and tested their recommended stemming algorithm. We had duplicate terms removed from the vocabulary and the remaining terms presented in alphabetical order for ease of maintenance. The image below shows how the vocabulary terms appear within the maintenance tool.

Screenshot showing the maintenance tool and how stemming has been used in the vocabulary
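The way stemming is applied to both the vocabulary and the search input can be sketched as follows. The suffix list here is a toy invented for the illustration; it is not the algorithm the development team recommended, which the paper does not name, and a production stemmer uses a much richer rule set.

```python
# Toy suffix list chosen only for this illustration (an assumption).
SUFFIXES = ("isticians", "istician", "istics", "ians", "ian", "ing", "es", "s")


def stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem letters."""
    word = word.strip().lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def build_stem_index(terms):
    """Stem every vocabulary term, drop duplicates, sort alphabetically."""
    return sorted({stem(term) for term in terms})


def matches(user_input, stem_index):
    """Stem the search input the same way and look it up in the index."""
    return stem(user_input) in stem_index
```

For example, `build_stem_index(["biostatistics", "biostatistician", "biostatisticians"])` collapses the three words into the single entry `"biostat"`, and a search on any of the three then matches that entry.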

Most search engines ignore common terms, such as "the", "in" and "is" when processing the user's search input. These terms are referred to as "stop words" (Sullivan, 2003). Excluding stop words speeds up search performance. However, we did not want to use a standard set of stop words. First, the word "it" is usually considered a stop word. In testing a range of university websites we had seen many instances where students searching for information technology courses would use this term and be puzzled when it produced no search results (Alexander, 2005). Some concluded the university did not offer such courses. In each case, this conclusion was inaccurate. Second, we had not seen many searches where phrases or natural language had been used. We decided to set out with a very small list of stop words and monitor the search logs to see if more were required.
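A deliberately small stop word list might be applied as in the sketch below. The words in the list are assumptions made for illustration (the paper does not give the actual list); the one point taken from the text is that "it" is left out so that searches for information technology courses are not discarded.

```python
# Small, illustrative stop word list. "it" is intentionally absent.
STOP_WORDS = frozenset({"the", "in", "is", "a", "of", "and"})


def filter_query(query):
    """Lower-case the query, split on whitespace, and drop stop words."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]
```

Under these assumptions, a search on "courses in IT" keeps both "courses" and "it", whereas a standard stop word list would have removed "it" as well.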

Maintenance of the vocabulary

Our final tasks were to create a maintenance interface to the vocabulary and develop a maintenance procedure based on simple search log analysis to ensure that the vocabulary adapted to fit the searching behaviour of users.

Creating the maintenance tool interface was a challenge. We had little time and were limited in the technology we could use (AJAX-style interactions would have been more efficient than loading or reloading whole pages; see Maurer, 2006). Our design evaluations were restricted to a few design walkthroughs. Even these resulted in several design iterations before we passed on our designs to the development team. Since there was only likely to be one person responsible for the maintenance of the vocabulary, we were able to show her how to use the tool, and we produced a short user manual.

Screenshot showing the search log analysis options in the maintenance tool

Our maintenance tool included access to search engine logs and a simple statistical summary of search behaviour (see image above). We advised the maintainer to check the search summary on a weekly basis, at least for the first few months after launch. For any searches that should have produced results but did not, adjustments could be made to the vocabulary. For any searches that produced very large result sets, tweaking the vocabulary and re-testing the search allowed for further improvements. Over time, as search performance improved, the intervals between maintenance checks could be lengthened.
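The kind of summary that supports this procedure can be sketched in a few lines. The log format (pairs of query and result count), the threshold of 50 results and the function name are all assumptions made for the illustration; the paper does not describe how the maintenance tool computed its statistics.

```python
from collections import Counter


def summarise(log_entries, large_threshold=50):
    """Summarise a search log given as (query, result_count) pairs.

    Returns the queries that produced no results and the queries that
    produced more than large_threshold results, most frequent first.
    """
    no_results = Counter()
    large_sets = Counter()
    for query, result_count in log_entries:
        # Normalise case and whitespace so repeats are counted together.
        normalised = " ".join(query.lower().split())
        if result_count == 0:
            no_results[normalised] += 1
        elif result_count > large_threshold:
            large_sets[normalised] += 1
    return {
        "no_results": no_results.most_common(),
        "large_result_sets": large_sets.most_common(),
    }
```

Frequent entries under "no_results" are candidates for new terms or variants, while entries under "large_result_sets" suggest terms that may be applied too broadly.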

Lessons learned

The development of the vocabulary was part of a much larger project and we did not allocate sufficient time for the activities involved.

We had hoped to build on or borrow from existing vocabularies, and investigating these cost us several valuable days. Granularity of the vocabulary was the key. Had we realised this at the start, we would have had more time to spend on designing the maintenance tool.

We ended up using the maintenance tool to finalise development of the vocabulary, and discovered a shortcut feature that would have been useful: automatically populating primary study area terms assigned to degrees that also form a double-degree.
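Such a shortcut was not built, but the idea can be sketched as taking the union of the primary study area terms already assigned to each component degree. The function name, data structure and example terms below are hypothetical.

```python
def terms_for_double_degree(component_courses, primary_terms):
    """Combine the primary study area terms of the component degrees.

    component_courses: names of the degrees forming the double degree.
    primary_terms: mapping of course name to its set of primary terms.
    """
    combined = set()
    for course in component_courses:
        combined.update(primary_terms.get(course, ()))
    return sorted(combined)
```

The maintainer would then only need to review the proposed terms for the double degree rather than enter each one again by hand.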

References

Alexander, D. (2005). "How usable are university websites? A report on a study of the prospective student experience." [HREF4]

Frakes, W.B. (1992). Information retrieval: data structures and algorithms. Prentice-Hall Inc, Upper Saddle River, NJ, USA.

Maurer, D. (2006). "Usability for Rich Internet Applications", Digital Web Magazine [HREF5]

Sullivan, D. (2003). "What are stop words?" [HREF3]

Hypertext References

HREF1
http://www.deyalexander.com.au
HREF2
http://www.monash.edu.au/
HREF3
http://searchenginewatch.com/facts/article.php/2156061
HREF4
http://ausweb.scu.edu.au/aw05/papers/refereed/alexander/
HREF5
http://www.digital-web.com/articles/usability_for_rich_internet_applications/

Copyright

© 2006 Dey Alexander, Scott Rippon, Guy Sangwine. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.