Personalised metasearch: Augmenting your brain

Nathan Bailey [HREF1], Manager, Flexible Learning and Teaching Program, Information Technology Services [HREF2], PO Box 3A, Monash University [HREF3], Victoria, 3800. Nathan.Bailey@its.monash.edu

Abstract

The Working Council of CIOs (Business Wire, 2001) reported that knowledge workers spend 35% of their productive time searching for information online, and 40% cannot find the information they need to do their jobs on their intranets.

The core activity of Universities is generating, managing and disseminating information. With such a significant portion of time spent looking for information, there is a huge opportunity for Universities and other organisations to improve efficiency simply by directing information more effectively to staff and students. This paper describes a simple yet powerful search broker that integrates personal and public information into search results.

Introduction

The problem

Various studies indicate that the average worker spends anything from 25% (InformationWeek, 2003) to 40% [HREF4] of their time searching for information. Using the most credible figure of around 35% and a minimum wage of $431.40/week [HREF5], that's around $151 per week spent looking for information (and hopefully most of us are earning more than the minimum wage!). All this equates to several hundreds of thousands of dollars spent by the average University on searching, every week.

These statistics are heavily bandied about by document management and content management companies because this is their core value proposition. But it isn't enough to have a well-catalogued, well-structured organisation-wide website or document archive — there are still dozens of other resources that people need to access that wouldn't appear in most document management or content management systems. These range from highly personal files, such as those on your local hard drive or your email folders, right through to highly public resources such as the World Wide Web, with resources of varying levels of restriction in-between.

How does the average person know where to look? Just where did they read that key fact last week? Was it an email from Joe, or on Bob's website, or ...? We need some way to search all of these resources at once — a metasearch.

Some definitions...

Metasearching, also known as 'federated searching', allows a user to search across multiple resources at once, i.e. to initiate several concurrent searches and then view the collated results.

A portal is a personalised website that presents specific information and services to you, according to your profile. my.monash is Monash's enterprise information portal that provides a "one stop shop" for staff and students to access everything they need in managing and carrying out their relationship with Monash.

Metasearch engines

Metasearch engines [HREF6] have been around for a number of years, allowing you to shop for the best price (e.g. mysimon.com [HREF7]) or simply to search multiple search engines at once (e.g. search.com [HREF8]). The limitation of these systems is that they can only search and collate results from public resources. Any resources to which you may have access (either through your organisation or your personal files) cannot easily be integrated into these metasearches. This is where the enterprise information portal fills the gap.

Portals

Portals continue to be the flavour of the month with large vendors (such as IBM and CA) and research analysts (like Gartner). Enterprise information portals [HREF9] provide organisations with an integrated view of all the information and services relevant to them, according to the profile the enterprise has for them.

A newer flavour is being added to the portal pie, with a focus on content management systems, intranets and personalised applications [HREF10]. Vendors' key value proposition now leans towards access to increasingly relevant information (through well-categorised content in content management systems) and increasingly relevant applications (customised to the organisation and to the individual's role in that organisation).

In this vein, a personalised metasearch provides the panacea we are looking for — both the power of the metasearch, to integrate results from a range of resources, and the power of the portal, to provide access to personal, highly relevant resources. Together, they provide a means to supply a powerful searching environment that displays integrated results across a range of relevant resources.

Personalised metasearch

Overview

The my.monash portal is in the process of being extended with a personalised metasearch using a set of simple but powerful tools. The diagram below describes the basic architecture:

Diagram of the Search Broker

Users connect to the my.monash portal and type in their search. The search is dispatched to the search broker for distribution. A set of agents (the triangles in the diagram above) is oriented toward searching specific resources; these agents carry out their work in parallel so that as many results as possible are returned in as short a time as possible. The search broker then aggregates the results and sends them back to the browser for display.
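
The broker itself can be quite small. The sketch below shows the dispatch-and-collate flow in Perl (the language the broker's tools come from); the agent subroutines and result fields are illustrative stand-ins rather than the actual my.monash code, and the agents are called sequentially here for clarity even though the real broker runs them in parallel.

    #!/usr/bin/perl
    # Minimal sketch of the broker's dispatch-and-collate flow.
    # Agent subroutines and result fields are illustrative stand-ins.
    use strict;
    use warnings;

    # Each agent takes the query string and returns a list of result
    # hashrefs, e.g. { title => ..., url => ..., score => ... }.
    my %agents = (
        white_pages => sub { my ($q) = @_; return () },   # LDAP search goes here
        web         => sub { my ($q) = @_; return () },   # screen-scraper goes here
        email       => sub { my ($q) = @_; return () },   # IMAP search goes here
    );

    sub broker_search {
        my ($query) = @_;
        my @results;

        # The real broker dispatches these in parallel with a timeout
        # (see 'Performance and capacity'); sequential here for clarity.
        for my $name (sort keys %agents) {
            push @results, $agents{$name}->($query);
        }

        # Rank the collated results before returning them to the portal
        # page for display (see 'Aggregation algorithm').
        return sort { ($b->{score} || 0) <=> ($a->{score} || 0) } @results;
    }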

Leveraging the power of collaborative development

One of the core values of ITS (Information Technology Services) at Monash is to use existing solutions where they provide equivalent, cost-effective solutions to the business need, rather than building our own (and re-inventing the wheel). The search broker draws on two types of existing resources — public collections of search tools and a specialist metasearch engine for library resources.

By far the most powerful collection of tools is freely available. Two leading sets are those on CPAN [HREF11], the Perl community's public library of code (especially the 'WWW::Search' series), and Mozilla [HREF12], which provides a powerful metasearcher of its own, code-named 'Mycroft' [HREF13] (after Sherlock Holmes' brother, since it is based on Apple's Sherlock metasearcher). Together, these two collections provide access to hundreds of web search engines in a way that is easy to integrate. Even more importantly, these tools are actively maintained by their authors, saving Monash any possible costs in keeping up to date with changing search specifications.

How it works

For some time now, various tools have been made available for integrating with web sites. These tools essentially act as a browser, sending and receiving content in the standard way the site expects (see 'Search interfaces' below).

Perl provides 'LWP::UserAgent', a powerful browser that copes with pretty much everything a normal browser can do, except for JavaScript and Java. Java itself has a similar browser library, as do most other leading languages. Being able to customise browser interactions is great, but one of the challenges of integrating with websites is the cost of maintenance: every update to the target site can break the integration.
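
For example, a minimal fetch with LWP::UserAgent looks like the sketch below; the target URL and query parameter are placeholders for whatever the remote search form actually expects:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    $ua->timeout(30);    # don't wait forever on a slow resource

    # The URL and 'q' parameter are placeholders for the target site's
    # real search form.
    my $response = $ua->get('http://www.example.com/search?q=portal');

    if ($response->is_success) {
        my $html = $response->content;    # the same page a browser would see
        # ... parse the result links out of $html here ...
    }
    else {
        warn 'Search failed: ' . $response->status_line . "\n";
    }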

This is where the power of a development community comes in to play. Perl's WWW::Search library is a set of dozens of search interfaces to a range of key web resources, each maintained by separate authors or groups of authors. As these key sites change, the authors release new updates — and if you need it fixed now, then you always have the source code to fix it yourself (or to pay someone else to).
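
Using one of these backends follows the pattern in the WWW::Search documentation; the backend named below is the module's own canonical example and is used purely for illustration:

    use strict;
    use warnings;
    use WWW::Search;

    # 'AltaVista' is the backend used in WWW::Search's own synopsis;
    # any installed backend could be named here instead.
    my $search = WWW::Search->new('AltaVista');
    $search->native_query(
        WWW::Search::escape_query('enterprise information portal')
    );

    while (my $result = $search->next_result()) {
        # Every backend returns result objects with a common interface.
        print $result->url, "\t", ($result->title || ''), "\n";
    }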

Similarly, the Mycroft community maintain hundreds of searches that integrate with Mozilla to provide individual and metasearch capabilities. These work in almost exactly the same fashion as WWW::Search, so there is considerable scope for exchange between the two (i.e. converting a Mycroft plugin to WWW::Search would be a relatively trivial exercise, especially with WWW::Scraper::Sherlock — see [HREF14] for a web-based example).

What about non-web resources?

Our metasearch also needs to integrate with key resources such as the white pages directory, your personal email, the files on shared drives and your local PC, and a broad range of resources using various standard or proprietary protocols (e.g. a z39.50-based library catalogue).

Again, the power of open source comes into play, with a community-maintained library of agents. Integration with email (POP/IMAP), libraries (z39.50 [HREF15]), directories (LDAP), calendars (iTIP and iMIP [HREF16]) and direct database connections (e.g. Oracle/PostgreSQL) is all supported and actively improved. These agents provide quick access to search the various repositories of information that are available within the organisation (or externally, if you have access).
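
As an illustration, a white pages agent can be written in a few lines with Net::LDAP; the host, base DN and attributes below are placeholders rather than Monash's actual directory settings:

    use strict;
    use warnings;
    use Net::LDAP;

    sub search_white_pages {
        my ($query) = @_;

        # Placeholder host and base DN; substitute the real directory here.
        my $ldap = Net::LDAP->new('ldap.example.edu') or return ();
        $ldap->bind;    # anonymous bind

        my $mesg = $ldap->search(
            base   => 'ou=People,o=Example University',
            filter => "(cn=*$query*)",
            attrs  => [qw(cn title mail telephoneNumber)],
        );
        return () if $mesg->code;    # LDAP error: contribute no results

        my @results;
        for my $entry ($mesg->entries) {
            push @results, {
                title   => scalar $entry->get_value('cn'),
                snippet => join(', ', grep { defined }
                                $entry->get_value('title'),
                                $entry->get_value('mail'),
                                $entry->get_value('telephoneNumber')),
            };
        }
        $ldap->unbind;
        return @results;
    }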

Search interfaces (a glance under the hood)

This form of integration is sometimes called 'screen-scraping' [HREF17], because essentially the browser is grabbing the web "screen" and "scraping" out the relevant bits of content.

Diagram of a Search Agent

The search interface, or 'agent', reads the same search form that a normal browser would, sends the same search request and views the same results set. In fact, there is no way for the remote site to know whether any particular transaction is being carried out by an agent or by a normal user driving their desktop web browser!
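
A stripped-down scraping agent might look like the sketch below; the form URL, field name and the extraction pattern are invented for illustration and would need to match whatever the real target site sends back (the WWW::Search backends wrap exactly this kind of per-site knowledge):

    use strict;
    use warnings;
    use LWP::UserAgent;

    sub scrape_search {
        my ($query) = @_;
        my $ua = LWP::UserAgent->new;

        # Submit the same form a desktop browser would submit; the URL
        # and the 'q' field are placeholders for the real site's form.
        my $response = $ua->post('http://www.example.com/search', { q => $query });
        return () unless $response->is_success;

        # "Scrape" the result links out of the returned HTML. A pattern
        # this naive breaks easily, which is why each WWW::Search backend
        # carries its own site-specific parsing.
        my $html = $response->content;
        my @results;
        while ($html =~ m{<a\s+href="([^"]+)"[^>]*>([^<]+)</a>}gi) {
            push @results, { url => $1, title => $2 };
        }
        return @results;
    }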

Of course, publishers can still monitor log files for excessive activity from single hosts (but that could just be a heavily used proxy) and scan online for possible infringements (e.g. use Google to search for any sites talking about providing illegitimate metasearches, though obviously they wouldn't find intranets and enterprise information portals). Legitimate organisations wouldn't wish to advocate illegitimate access anyway, which leads into the next question: legality and IP restrictions.

Is this legal?

With increasing attention being paid to intellectual property, digital rights management and copyright are becoming significant issues for Universities. Inevitably, the question arises "Is 'screen-scraping' legal?", i.e. can such a metasearch engine, that draws content from so many different sources, still comply with IP restrictions?

This question needs to be answered separately for each of the three kinds of content (public, restricted and personal). Public information is actually the greatest concern. Many organisations derive significant income or marketing advantage from their websites, and some are even cautious about standard 'integration' techniques like deep linking (i.e. linking directly to a resource rather than just the home page).

Public sites

Three links appear in the footer of almost every website now: a legal disclaimer, a privacy statement and terms of use. It is important that integrators carefully review the terms of use to check whether what they are doing is explicitly permitted. If there is any ambiguity, approval should be sought from the publishing organisation. Many organisations are willing to grant it, recognising that things like metasearchers and deep linking drive traffic to their website. This is especially true for enterprise portals, which often serve a specific demographic (e.g. for Universities, students) that the publisher may be interested in attracting.

Now the world's most recognised brand (yes, even more than Coke!) [HREF18], Google actively encourages integration, with the Mycroft plugin being written by a Google staffer. On a broader scale, a whole development community has sprung up around the Google API that allows integrated searching (at this stage only for non-commercial use) [HREF19].

Restricted sites

For restricted sites, a license for accessing and using the information has already been established, so reuse may well already be explicitly approved. In this case, the standard terms of use of the site may not be sufficient, as there is likely to be a more specific agreement in place. Consultation with the licensing officer (who may deal directly with the publisher) should clarify what can legitimately be done.

Alternatively, the simplest restricted sites to get permission for should be internal ones, containing enterprise data generated by other organisational units.

Personal content

This category is the simplest of all — if you don't want to metasearch your private materials — don't! :-) More specifically, most of these resources can be access-controlled in some way, which will ensure that access only comes by expected paths or people. But if not, customising your metasearch should provide sufficient control to avoid searching materials you don't want exposed to the metasearcher.

Advocating for search interfaces

In the previous section we discussed how websites can't easily detect illegitimate agents. But in the case of legitimate agents, we want the publisher to be able to track them. We want them to know how much our agent is driving traffic to their website. This kind of positive marketing strengthens the argument for opening up to search agents. And in the case of agents like Mycroft, it also advocates for the organisation's cause: standards compliance and Mozilla support instead of an IE-only web. Therefore, most agents do in fact report themselves, either via a special form field or in the user-agent (browser identification) string.
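
With LWP, for example, announcing the agent is a one-line setting; the identification string below is just an illustrative choice, not a standard:

    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;

    # Report who we are so the publisher's logs can attribute our traffic;
    # the string itself is an example only.
    $ua->agent('my.monash-metasearch/0.1 (+http://portal.example.edu/)');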

Aggregation algorithm

So with the full support of savvy publishers and a rich set of search tools, we are successfully gathering dozens of results to present back to the user. The key question here is — how do we sequence the results from the various result sets?

Aggregating results is challenging, because it isn't always clear what the user wants. Which result is most relevant for their query? Some search engines provide 'relevance' indices that indicate a scale of relevance. Mycroft uses this to aggregate results in its metasearch function. However, this presumes that the relevance indices are comparable between search engines, and indeed, that they supply such indices at all.

Many engines don't provide indices at all, e.g. a white pages directory search of staff and students, or a search of your own email folders. Some (necessarily somewhat arbitrary) algorithm must therefore evaluate the respective priorities of the result sets. The algorithm currently used by my.monash's metasearch is as follows (a simplified scoring sketch appears after the list):

  1. Exact matches are more interesting than approximate matches
  2. People are more interesting than paper
  3. Personalised results are more interesting than general ones
  4. Titles / link text are better indicators than the rest of the content
  5. Home pages (i.e. URLs ending in a slash, or URLs that are just host names) are more relevant than other pages
  6. More recent articles are more interesting than old ones
  7. Word proximity is an indicator of relevance (i.e. the closer the search terms are to each other, the better)
  8. URLs with the search term in them are more relevant than those without.
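
The sketch below shows how rules of this kind can be folded into a single numeric score per result and then used to sort the aggregate set. The weights, the result fields and the subset of rules implemented are illustrative only, not the actual my.monash weighting.

    use strict;
    use warnings;

    # Score one result hashref against the query terms. The weights and
    # the subset of rules shown are illustrative, not the real broker's.
    sub score_result {
        my ($result, @terms) = @_;
        my $score = 0;
        my $title = lc($result->{title} || '');
        my $url   = lc($result->{url}   || '');

        for my $term (map { lc } @terms) {
            $score += 5 if $title =~ /\Q$term\E/;    # Rule 4: term in title/link text
            $score += 2 if $url   =~ /\Q$term\E/;    # Rule 8: term in URL
        }
        $score += 4 if $result->{exact};                       # Rule 1: exact match
        $score += 3 if ($result->{type} || '') eq 'person';    # Rule 2: people before paper
        $score += 3 if $result->{personalised};                # Rule 3: personalised result
        $score += 2 if $url =~ m{^[a-z]+://[^/]+/?$};          # Rule 5: home page
        return $score;
    }

    # Sort the collated results by descending score.
    sub aggregate {
        my ($terms, @results) = @_;
        return sort {
            score_result($b, @$terms) <=> score_result($a, @$terms)
        } @results;
    }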

Each of these is an arguable proposition: perhaps users are more interested in paper than people, or in public information than in their email archive. Clearly, at some stage in the future, these kinds of factors will need to be customisable so that users can fine-tune the behaviour of the metasearch. This simple model works reasonably well at present, but deserves further research.

The major limitation of the broker is that it does not have access to the full data records, as most search engines do, so it can't use that extra information when determining relevance. Thus the relevance decisions it makes are intrinsically weaker than, for example, Google's PageRank [HREF20].

Some worked examples!

This is a wonderful idea conceptually, but "show me the money!" Let's see this thing in action! There are two key groups we're trying to service here — the general knowledge worker, trying to identify something, and the researcher (be it student or academic) who has a range of interests they are reviewing on an ongoing basis.

Digging up the dirt...

Increasingly, in order to get work done, "it is not what you know but who you know." Many projects are successful because of "goodwill networks" of aligned contributors. How can such a network be grown and fertilised? What about a metasearch that includes white pages information, schedules, contributions in public forums and other public documents (the web)?

Below is a sample of a search for me. You can see my name, my title, organisational affiliations, contact details, my schedule for today and a photo of me that links off to our organisational chart, showing who I work with and where. As a direct "person match", this result rates higher than the other aggregate results from public, internal and restricted sources (Rules 1 and 2).

The next result is my personal domain. It ranks highly as a top-level domain whose title contains just my name (Rules 4 and 5). Another couple of home pages rate similarly highly, before the results generally trail off into pages that merely mention my name.

Search for 'Nathan Bailey'

Whilst it does favour exact personal results, the metasearch engine doesn't yet "recognise" that it has found a person and do anything special beyond presenting the basic white pages information. It could go off and search known mail archives (both internal and public), known intranets or other specific resources that may provide insight into that person.

Supporting research

As a researcher in the field of portals, I often search for new information about portals, knowledge management, corporate workflows, etc. Instead of having to review many different information repositories, the metasearcher allows me to search many of them at once. Below is a search for 'enterprise information portal' across a semi-public resource (the Monash web; semi-public because the search includes restricted documents) and two internal, restricted resources (ProQuest's ABI/Inform and LexisNexis' collection of major Australian newspapers).

The first two results hit the top because they have one of the search terms ('portal') in their link text (Rule 4). The next two incorporate the full search term and are relatively recent (Rules 1, 6 and 7).

The next link doesn't have 'portal' in the title or content, but it does have it in the URL (Rule 8, which is possibly just a variant of Rules 4 and 5). The remaining links are simply included in the order they were found, since they don't mention any of the search terms explicitly. This is disappointing since, for example, the EIRG are actually quite interested in the portals field, but because their home page doesn't mention it, we can't "promote" their result.

Search for 'Enterprise Information Portal'

Leveraging further

Obviously a metasearch engine is useful for research (either informal or academic), but this is only half of what we can do. Rather than just allowing people to search, we can store their searches and advise them when new results become available. I could receive an SMS saying "You've got new search results", or an email, or simply review my stored searches next time I visit the metasearcher page, like visiting a bookmark.

Drawing on the portal's strengths, we can go further still, and try to predict what search topics a particular person might be interested in, by reviewing the types of mailing list/discussion forums they participate in, what papers they have published, the journals they have published in, their stated research interests, etc. As my focus of research changes, my autosearcher changes with me!

And then, we can encourage collaboration with other researchers using amazon.com-style matching: "people who (research in this field, read this journal, participate in this forum) also ..." But we risk diverging — the facilitation of research by portals is an entire topic of its own.

Performance and capacity

Integrating a diverse set of resources can be challenging, as the whole can suffer from the weakest link in the chain. Two strategies have been adopted to improve performance, and a third remains available. Firstly, every request is handed off to an individual agent that carries out that search. This allows the resources to be searched in parallel, rather than visited one after another.

Secondly, a timeout has been introduced that ensures that all results are returned within a certain time (currently 30 seconds). This value could be made editable by the user, and users could even drill down to set different values for different resources. Any requests taking longer than the timeout are terminated. However, results are streamed back as soon as they are received, so even terminated searches may contribute partial results to the aggregate result set. The broker ensures that whatever results are available are returned to the browser within the set timeout.
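
One simple way to get both behaviours in Perl is to fork one child process per agent and give each a hard deadline with alarm. The sketch below illustrates the approach; it is not the broker's actual code, and the real broker additionally streams results on to the browser as they arrive rather than returning them in one batch.

    use strict;
    use warnings;

    my $TIMEOUT = 30;    # seconds; the broker's current setting

    # Fork one child per agent; each child runs under a hard deadline and
    # writes its results (one tab-separated line each) back down a pipe.
    # The agents are code refs that take the query and return result hashrefs.
    sub parallel_search {
        my ($query, @agents) = @_;
        my @readers;

        for my $agent (@agents) {
            pipe(my $reader, my $writer) or die "pipe: $!";
            my $pid = fork();
            die "fork: $!" unless defined $pid;

            if ($pid == 0) {    # child: run the search under a deadline
                close $reader;
                eval {
                    local $SIG{ALRM} = sub { die "timeout\n" };
                    alarm $TIMEOUT;
                    print {$writer} "$_->{url}\t$_->{title}\n"
                        for $agent->($query);
                    alarm 0;
                };
                exit 0;         # anything already written is still usable
            }
            close $writer;      # parent keeps only the read end
            push @readers, $reader;
        }

        # Collect whatever each child managed to return before its deadline;
        # the children all run concurrently while we read.
        my @lines;
        for my $reader (@readers) {
            push @lines, <$reader>;
            close $reader;
        }
        wait() for @readers;    # reap the finished children
        return @lines;
    }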

Finally, further separation of function could be done to prevent any one part of the broker system from becoming a bottleneck. The most likely candidate would be the aggregator with its expensive parsing and ranking rules. Given that the University has established network-fabric-based load balancing (i.e. within our core routers), if performance does become an issue, a farm of brokers could easily be set up to share the load.

An important caveat

This system is still in a prototype state, so there are no raving user reports or detailed usage statistics available at the time of writing. However, it should be made available as a pilot service in the near future. If publishing timelines permit, this section will be rewritten to incorporate some usage analysis and feedback, otherwise this information will be provided as part of the presentation at the conference.

Further work

A number of Universities have bought metasearch engines, such as Ex Libris' MetaLib [HREF21] and FDI's ZPORTAL [HREF22]. CAUL's AARLIN project [HREF23] is probably the premier example in this area. These tools provide a powerful mechanism for metasearching across library resources, including links back to local library copies of articles in the results set. Once in place, these engines can become another agent in the pool for searching, allowing even more sophisticated cross-resource searches.

Finally, there are e-print and other online repositories (e.g. SMETE [HREF24], the Networked Digital Library of Theses and Dissertations [HREF25] and the Open Archives Initiative [HREF26] with the OAIster [HREF27] metasearcher) and metadata repositories such as EdNA [HREF28] that also provide possible pools of information to search. These could be codified in the WWW::Search way to allow both this and other search tools to leverage them.

Conclusion

Metasearchers provide an excellent way for Universities to leverage their immense base of knowledge. As the statistics described at the start of this paper clearly highlight, all organisations are becoming increasingly dependent on information management. This relatively cheap technology employs the power of the open source community to provide a rich toolset for both knowledge workers and researchers alike. Information wants to be free, and metasearch-powered enterprise portals are here to help it happen!


References

Kontzer, Tony (2003). "Search On" in InformationWeek, Jan 20, 2003, Manhasset. Available online [HREF29].

Business Wire (February 27th, 2001), Working Council of CIOs as cited in [HREF30]. See also Feldman, S. and Sherman, C. (2001) The high cost of not finding information, IDC Whitepaper.

Hypertext References

HREF1
http://polynate.net/
HREF2
http://its.monash.edu/
HREF3
http://monash.edu/
HREF4
Forrester Research, as cited by Documentum, http://www.documentum.fr/events/02_21_02_web.htm
HREF5
http://www.actu.asn.au/public/about/minimumwage.html
HREF6
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/MetaSearch.html
HREF7
http://mysimon.com/
HREF8
http://search.com/
HREF9
http://ausweb.scu.edu.au/aw01/papers/refereed/treloar/paper.html
HREF10
http://ausweb.scu.edu.au/aw02/papers/edited/bailey/paper.html
HREF11
http://search.cpan.org/
HREF12
http://mozilla.org/
HREF13
http://mycroft.mozdev.org/
HREF14
http://www.sherch.com/~pldms/cgi-bin/sherch.pl
HREF15
http://www.loc.gov/z3950/agency/
HREF16
http://www.ietf.org/html.charters/calsch-charter.html
HREF17
http://www.elsewhere.org/jargon/html/entry/screen-scraping.html
HREF18
http://www.fool.com/news/take/2003/mft/mft03021102.htm
HREF19
http://www.google.com/apis/
HREF20
http://www.iprcom.com/papers/pagerank/
HREF21
http://www.aleph.co.il/MetaLib/
HREF22
http://www.fdusa.com/products/zportal.html
HREF23
http://www.aarlin.edu.au/
HREF24
http://www.smete.org/public/about_smete/activities/technology/federated_search/
HREF25
http://jin.dis.vt.edu/fedsearch/ndltd/support/search-catalog.html. See also http://www.dlib.org/dlib/september98/powell/09powell.html which describes the implementation of the NDLTD and provides references to other relevant research.
HREF26
http://www.openarchives.org/
HREF27
http://oaister.umdl.umich.edu/
HREF28
http://www.edna.edu.au/metadata/
HREF29
http://informationweek.com/story/showArticle.jhtml?articleID=6500054
HREF30
http://www.verity.com/pdf/MK0424_ROI_DidYouKnow.pdf
HREF31
http://www.searchtools.com/slides/searchengines2002/sem2002-17.html

Copyright

Nathan Bailey, © 2003. The author assigns to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.