Simply seeding search engines

Debbie Campbell [HREF1], Director, Coordination Support Branch [HREF2], National Library of Australia [HREF3], Parkes Place, Canberra, ACT 2600. Email: dcampbel@nla.gov.au

Abstract

Although the National Library of Australia has an ongoing programme of digitisation to increase the accessibility of its collections, raising awareness of the existence of the digital versions of these collections requires a programme of activity in its own right. One tool that may be used is a Web search engine such as Google, or a specialised engine such as OAIster [HREF4]. Search engines can be seeded with reliable resource locators for digital objects based on URLs or Digital Object Identifiers (DOIs). However, the locators must be persistent. The persistence of unique digital identifiers encourages ongoing citability, thereby increasing awareness and use of digital objects. This paper describes how the National Library is attempting to ensure the ongoing use of its “digital collections” of books, journals, maps, pictures, photographs, manuscripts and music.

Introduction

One of the aims of the National Library’s recent Resource Discovery Service Project was to examine an appropriate relationship between the Library’s online information services and commercial search engines [HREF2]. The Library’s Electronic Information Resources Strategies and Action Plan 2001-2002 recognises that a search engine is usually the first point of access to information on the Web [HREF5].

This is considered to be true for any information seeker, unless a particular entry point is integrated by design into the desktop platform, or the institution or employer underwriting the access mechanism blocks access to general Web searching.

One search engine in particular, Google, is known to be commonly used by academics and researchers internationally, as well as by the general public. This is due to its usually reliable pinpointing of information, in part because of its method of ranking results according to link popularity; it also examines the words used to describe each link. To date, Google has also resisted the commercialisation of its result sets that other search engines have been party to.

While Google is currently the search engine of choice for many, it can serve only as a representative example in any strategy to increase the exposure of quality information, because the circumstances of search engines change.

Current relationships

Researchers rely on Google to search across disciplines and resources in the absence of centralised, visible Australian tools and services. Many Australian institutions, cultural and academic, have created or are in the process of creating aggregations of high value online content. The aggregations are usually stored in databases commonly referred to as ‘The Hidden Web’ because search engines, including Google, are unable to harvest them.

The National Library has overcome this problem to a certain extent by writing standard metadata describing the aggregation (i.e. at the directory level), embedding it in home pages and other selected pages of an online service, and seeding the URL for the service directly into several search engines. This process is considered to be relatively successful, as it exposes valuable, consistent Library-authored metadata and can be statistically significant. It is also worth noting that Google favours web sites with longevity, which partly explains the Library’s success.
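
As an illustration of embedding collection-level metadata in a home page, the following minimal sketch renders a description as HTML meta tags. It assumes Dublin Core element names, which the paper does not specify; the function name dublin_core_meta and the sample record values are hypothetical and do not reproduce the Library's actual metadata.

```python
from html import escape


def dublin_core_meta(record):
    """Render a dict of element names and values as HTML <meta> tags.

    Assumption: Dublin Core elements are used, with the common "DC." prefix.
    """
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, value in record.items():
        lines.append(
            '<meta name="DC.{}" content="{}">'.format(
                escape(element, quote=True), escape(value, quote=True)
            )
        )
    return "\n".join(lines)


# Hypothetical collection-level description (not the Library's real record).
example_record = {
    "title": "Example Pictures Collection",
    "publisher": "Example Library",
    "type": "Collection",
    "identifier": "http://www.example.org/pictures/",
}

print(dublin_core_meta(example_record))
```

The resulting lines would be pasted into the head of the service's home page, so that a harvester visiting the seeded URL finds the same consistent description each time.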

While metadata at the directory or service level is important, there would be a greater benefit to searchers if item-level records were available to search engines. For example, this would enable Google to direct searchers to culturally significant materials and increase the citations of those materials, which should in turn be reflected in higher Google rankings. One measurable target is to have each item appear on the first page of results displayed.

Seeding simply

The National Library considered two methods for increasing awareness and use of its high quality digital collections. Although the methods are technically mutually exclusive, their intended outcomes are complementary. Firstly, items can be made available to Google for harvesting by exposing their URLs in an HTML file. Listing collection-level URLs on a Web page should be sufficient if the digital items are only one or two levels (sub-directories) away from the collection-level address. Because a search engine's harvester follows links rather than issuing search queries against a repository, a hierarchical approach to the construction of a unique URL for each digital object is more effective.
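
The first method can be sketched as a small script that writes a plain HTML page of links for a harvester to follow. This is only an illustration: the function name build_seed_page and the example.org URLs are hypothetical, and the hierarchical path layout shown is one possible scheme rather than the Library's own.

```python
from html import escape


def build_seed_page(title, item_urls):
    """Return an HTML page that lists each item URL as a followable link."""
    links = "\n".join(
        '  <li><a href="{0}">{0}</a></li>'.format(escape(url, quote=True))
        for url in item_urls
    )
    return (
        "<html>\n<head><title>{}</title></head>\n"
        "<body>\n<ul>\n{}\n</ul>\n</body>\n</html>\n"
    ).format(escape(title), links)


# Hypothetical hierarchical URLs: collection, then sub-collection, then item.
example_urls = [
    "http://www.example.org/pictures/wharves/item-0001.html",
    "http://www.example.org/pictures/wharves/item-0002.html",
]

print(build_seed_page("Example collection listing", example_urls))
```

Because every item is reachable by an ordinary link from the seeded page, the harvester never needs to submit a query to the underlying database.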

In the National Library’s Pictures Catalogue, a persistent URL has been assigned to each digitised image. For example, the image of Upper Coomera Wharf by artist Edwin Bode, 1859-1926 has a persistent identifier of http://nla.gov.au/nla.pic-an5776589-v and its related metadata record has a persistent identifier of nla.pic-an5776589 [HREF6].

Unfortunately, this form of identifier is not recognised by Google. The Google harvester only recognises image URLs that end in a standard file type, in this case .GIF, .JPG or .PNG. In particular, the Library's form of identifier is not supported by the Google image search service [HREF7].
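
The mismatch can be illustrated with a simple check of whether a URL ends in one of the three file types named above. This is a sketch of the rule as described in this paper, not of Google's actual harvester logic; the function name and the example.org URL are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# The three file types named in the text; treated here as an assumption.
RECOGNISED_IMAGE_EXTENSIONS = {".gif", ".jpg", ".png"}


def has_recognised_image_extension(url):
    """Return True if the URL path ends in one of the listed image types."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in RECOGNISED_IMAGE_EXTENSIONS


# The persistent identifier quoted above has no standard file extension.
print(has_recognised_image_extension("http://nla.gov.au/nla.pic-an5776589-v"))
# A conventional (hypothetical) image URL does.
print(has_recognised_image_extension("http://www.example.org/images/wharf.JPG"))
```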

Bode, Edwin, 1859-1926. Upper Coomera wharf [picture]

Despite the best intentions of creating a persistent locator which can be migrated to an international digital identifier standard in the future, an alternative method is necessary for search engines to find these cultural gems.

Seeding reliably

The second method for seeding a search engine is based on the technologies provided by the Open Archives Initiative (OAI). The OAI defines a protocol for metadata harvesting, and a range of open source software is available to support harvestable collections of metadata. Metadata describing digital items is usually stored in repositories separate from the objects themselves. Both the OAIster project of the University of Michigan Library and the DP9 project of Old Dominion University’s Digital Library Research Group have used this approach [HREF8].
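
To show what a harvest looks like in practice, the sketch below builds an OAI-PMH ListRecords request and extracts identifiers and titles from a response. The verb, the oai_dc metadata prefix and the XML namespaces come from the OAI-PMH specification; the base URL, the sample response and the function names are hypothetical, and the response is parsed from a local string so that no network access is needed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL a harvester would request to list a repository's records."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def extract_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        records.append((identifier, title))
    return records


# Hypothetical, heavily trimmed response from a repository at example.org.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:item-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example picture</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("http://www.example.org/oai"))
print(extract_records(SAMPLE_RESPONSE))
```

A real harvester would also need to follow resumption tokens for large result sets, which this sketch omits.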

The DP9 project exposes harvested metadata records to regular search engines through a gateway for Web crawlers. “DP9 does this by providing consistent URLs for repository records, and converting them to OAI queries against the appropriate repository when the URL is requested. This allows search engines to index the ‘deep Web’ contained within OAI compliant repositories.” Any agency willing to have its collections harvested makes an OAI-compliant server available to the gateway.
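
The gateway idea quoted above can be sketched as a function that translates a stable, crawler-friendly URL into an OAI-PMH GetRecord query. This is not DP9's actual URL scheme or code: the /record/<repository>/<identifier> path layout, the repository table and all example.org addresses are hypothetical. Only the GetRecord verb and its identifier and metadataPrefix arguments are taken from the OAI-PMH specification.

```python
from urllib.parse import unquote, urlencode, urlparse

# Hypothetical table of known repositories and their OAI base URLs.
REPOSITORIES = {"example": "http://www.example.org/oai"}


def gateway_url_to_oai_query(gateway_url):
    """Translate /record/<repository>/<identifier> into a GetRecord request."""
    parts = urlparse(gateway_url).path.strip("/").split("/", 2)
    if len(parts) != 3 or parts[0] != "record":
        raise ValueError("not a gateway record URL: " + gateway_url)
    _, repository, identifier = parts
    base_url = REPOSITORIES[repository]  # KeyError if the repository is unknown
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": unquote(identifier),
            "metadataPrefix": "oai_dc",
        }
    )
    return base_url + "?" + query


print(
    gateway_url_to_oai_query(
        "http://gateway.example.org/record/example/oai:example.org:item-1"
    )
)
```

A crawler sees only the stable gateway URL; the translation to a repository query happens when the URL is requested, which is what lets ordinary search engines index records held in a database.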

The National Library has set up, in test mode, an OAI-compliant repository in conjunction with its Digital Collections Manager. This will allow search engines to be appropriately seeded in the future.

Persistence

The National Library of Australia has worked with other cultural agencies to establish ‘best practice’ for the permanent citability of digitised and digital materials [HREF9]. This will help to release important cultural heritage collections for use by future generations.

Acknowledgments

For the implementation:
Tony Boston; Director, Digital Services; National Library of Australia; tboston@nla.gov.au

For the concept:
Kent Fitch; kfitch@nla.gov.au

Hypertext references

HREF1
http://www.nla.gov.au/nla/staffpaper/2003/dcampbell1.html
HREF2
http://www.nla.gov.au/initiatives/resdisc.html
HREF3
http://www.nla.gov.au/
HREF4
http://oaister.umdl.umich.edu/o/oaister/
HREF5
http://www.nla.gov.au/policy/electronic/resourcesplan2002.html
HREF6
http://nla.gov.au/nla.pic-an5776589
HREF7
http://images.google.com
HREF8
http://arc.cs.odu.edu:8080/dp9/index.jsp
HREF9
http://www.nla.gov.au/initiatives/persistence.html

Copyright

Debbie Campbell, © 2003. The author assigns to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.