Systems for providing searchable access to collections of HTML documents


David Morton, Information Technology Services, The University of Melbourne, Parkville, Victoria, 3052, Australia. Phone +61-3-9344 4516. Fax +61-3-9347 4803. Email: d.morton@its.unimelb.edu.au

Sandra Silcot, Information Technology Services, The University of Melbourne, Parkville, Victoria, 3052, Australia. Phone +61-3-9344 8034. Fax +61-3-9347 4803. Email: ssilcot@www.unimelb.edu.au. Home Page: Sandra's Web [HREF 1]


Keywords: keyword searching, index HTML documents, WAIS, glimpse, World Wide Web (WWW, W3)

Introduction

Keyword searching is a facility that has now become very familiar to most users of on-line services. Users of library databases, CD-ROMs and other on-line bibliographic services have been using keyword searching facilities for some time.

The functionality of many HTML documents or sets of HTML documents can be significantly enhanced by providing a similar style of access to them through keyword searching. A keyword searching facility will turn a set of HTML documents into much more than merely an electronic 'mirror' of the paper version.

As the World Wide Web has developed, a number of different systems have emerged to provide searching facilities for HTML documents. This paper examines the requirements for keyword searching of HTML documents at the University of Melbourne and the subsequent implementation of an effective system for access to HTML documents using a keyword searching system.

Background

The University of Melbourne established its Campus Wide Information System [HREF 2], based on the World Wide Web, in June of 1994. Since then the use of the service has undergone rapid growth. The volume of World Wide Web traffic generated from within the university is currently doubling every five to six weeks. The number of Faculty and departmental servers [HREF 3] has steadily increased and currently stands at over forty.

The main university CWIS server acts as a central directory pointing to the various Faculty and Departmental servers. In addition it provides access to a number of university-wide services, such as the telephone directory service, and is also used to publish a variety of documents with university-wide scope. These include the Undergraduate Handbook [HREF 4], the Research Report [HREF 5] and the Strategic Plan 1994 to 1996 [HREF 6]. All of these documents include a keyword searching facility to enhance their functionality.

Functionality and software requirements

A searching system should be easy to use and easy to maintain. It should be efficient and should not consume excessive system resources. To be effective from the user's point of view, it should have good response time and provide the searching facilities they have become accustomed to when using other on-line searching systems.

The CWIS team established a number of specific requirements for such a searching system. These can be categorised into the following sections.

System requirements

A number of specific system requirements were identified. Some of these were required, while others were desired.

1. Conformance with established international standards (Desired)

Where possible, software used at the University of Melbourne should conform with international standards. In the area of indexing and communication with databases, the relevant standard was considered to be Z39.50 [HREF 7].

This standard defines a protocol for the searching and retrieval of information from databases on different computers across a network. It does not concern itself with the internal workings of an indexing and retrieval system but, rather, how that system communicates with other database engines and, in particular, how they exchange information contained within their databases.

2. Software should reside in the public domain (Required)

Any software used to implement a keyword searching system must reside in the public domain. This ensures maximum flexibility for the system: in particular, the source code is available for inspection, the software can be modified locally to suit our requirements, and there are no licensing costs.

3. The keyword searching facility must run under UNIX and integrate seamlessly with the World Wide Web (Required)

Since the main university CWIS server runs under UNIX it was essential that the searching system also run under UNIX. However, it need not be exclusively UNIX based. Indeed, support for servers on other platforms, particularly Macintosh and Windows, would be an advantage.

The searching system to be implemented must integrate fully with the World Wide Web. WWW browsers on all platforms should be able to submit queries and display query results. The search results should contain 'live' links to referenced HTML documents.

4. The indexing and retrieval system should be efficient in its use of system resources (Desired)

Indexes created from HTML documents should not occupy too much disk space. It would be unacceptable if the index occupied substantially more space than the original collection it indexed; ideally, the index should add only a small amount of extra disk space.

The retrieval process should not consume too much CPU time or memory. This is particularly the case if searching is conducted on the same machine that is serving up the original HTML documents. This was the anticipated configuration in this case. Our concern was to ensure that the capacity of the CWIS server to deliver documents was not going to be severely degraded by people using it to perform keyword searches.

Administration/Maintenance

The keyword searching system should make it easy to create a new index of a set of HTML documents. It should also be easy to update an existing index if the source documents change.

The ease of creation and maintenance of a search index will be determined by a number of factors. These include the ease of creating a new index, the degree of control over what is indexed, the scope for automating reindexing, and support for incremental re-indexing.

User functionality

The purpose of a keyword searching facility is to allow people to locate the information they require effectively. To this end it is vital that the software used meets a number of criteria that determine how well the searching facility will meet people's needs.

These can be categorised into a number of sections.

1. Performance

The keyword searching facility must function quickly enough that people conducting searches will not grow impatient and discontinue the search. Ideally, on long searches, a progress report should be included.

2. Ease of use

The search interface must be easy to use. It should be based on an HTML forms interface, with the search engine invoked through a cgi-bin script, and should offer simple controls over the search settings.

If possible, matches should be displayed in context to allow the user to decide more easily if particular 'hits' are relevant. Search results should be presented in a natural order, not simply in the order in which they were found, and should not contain 'visible' HTML.

It must be simple to transfer from the search results to the original HTML document. Ideally, this should be by a 'live' link.

3. Functionality

The keyword search engine should provide support for boolean searches using 'AND' and 'OR' at a minimum. It should also include a facility to perform partial word searches.

There should be a facility to control the maximum number of documents returned from a search, and also control over case-sensitive searching.

Several of the HTML collections to be indexed contained numbers that needed to be searchable. An example is the Undergraduate Handbook [HREF 4], whose subject codes need to be searchable.

Initial WAIS Implementation

The first keyword searching system implemented was based on WAIS (Wide Area Information Servers). WAIS is a database indexing and retrieval engine that was developed as a joint project between Apple Computer, Thinking Machines Corporation and Dow Jones.

WAIS is more than a simple database system to index and retrieve documents. It also incorporates a sophisticated networking component that allows search results to be exchanged from one computer to another using the Z39.50 protocol.

WAIS can index a variety of different document formats including HTML. It also has support for plain text files and image file formats including JPEG and GIF. (Image files only have their filenames indexed.)

The WAIS project resulted in a commercial version distributed by WAIS Inc. [HREF 8] and a public domain version known as freeWAIS [HREF 9].

The freeWAIS distribution includes an indexing program (waisindex), server software (waisserver) for making the database available over the network, a search engine for local wais databases (waisquery), a search engine for wais databases on other computers on the network (waissearch) and wais client software for vt100 terminals (swais) and X stations (xwais).

Most World Wide Web clients cannot communicate directly with a WAIS server; instead they need a Web to WAIS gateway. The function of the gateway is to take a query from a Web client and format it into a query suitable for submission to the specified WAIS database. Once the query has been performed by the WAIS server, the gateway formats the results for display back on the Web client.
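By way of illustration, a minimal sketch of such a gateway written in Perl is shown below. It is not the wwwwais.c or kid-of-wais code: the host, port, database name and the waissearch options are illustrative assumptions and should be checked against the local freeWAIS installation.

    #!/usr/local/bin/perl
    # Minimal sketch of a Web to WAIS gateway (illustrative only).
    # Assumes the query arrives in QUERY_STRING as "query=search+terms".

    ($query = $ENV{'QUERY_STRING'}) =~ s/^.*query=//;
    $query =~ s/\+/ /g;                                  # undo form encoding of spaces
    $query =~ s/%([0-9A-Fa-f]{2})/pack("C", hex($1))/ge;
    $query =~ s/[^\w\s.-]//g;                            # crude sanitising of the terms

    # Host, port and database name are placeholders.
    @hits = `waissearch -h localhost -p 210 -d HANDBOOK '$query' 2>&1`;

    print "Content-type: text/html\n\n";
    print "<title>Search results</title>\n<h1>Search results</h1>\n<ul>\n";
    foreach $line (@hits) {
        # A real gateway parses the waissearch output and turns each hit
        # into a live link; here the raw output lines are simply listed.
        print "<li>$line";
    }
    print "</ul>\n";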

The World Wide Web to WAIS gateway

The initial choice for a World Wide Web to WAIS gateway was wwwwais.c [HREF 10] developed by Kevin Hughes [HREF 11] at EIT. At that stage the software was at version 2.2.

The principal disadvantage with the wwwwais.c gateway was that the results from a search of a WAIS database were displayed showing the file names of matching documents, rather than their HTML titles. This often made it difficult to determine if a particular document in the results list was of interest.

Instead, an alternative gateway program, kid-of-wais [HREF 12], based on the script wais.pl distributed with the NCSA server software and on wwwwais.c, was chosen for use. It includes a facility to display the titles of Web documents (where present) that are found by a search.

An additional script, called print-hit-bold [HREF 12], was used with kid-of-wais. Its function is to position the browser at the first occurrence of a search term within a document that is selected for viewing from the results list.

The search interface

A forms-based cgi-bin Perl script was developed to provide the search interface and transfer the search terms and options to the WAIS gateway.

A copy of the script was made and slightly modified every time a new search database was created.
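A cut-down sketch of the form-generating part of such a script is shown below. It is illustrative only: the gateway URL, database name and field names are placeholders rather than the exact values used in the production script.

    #!/usr/local/bin/perl
    # Sketch of the form-generating half of the search interface
    # (illustrative only; the gateway URL and database name are placeholders).

    print "Content-type: text/html\n\n";
    print "<title>Search the Undergraduate Handbook</title>\n";
    print "<h1>Search the Undergraduate Handbook</h1>\n";
    print "<form method=\"GET\" action=\"/cgi-bin/kidofwais.pl/HANDBOOK\">\n";
    print "Search for: <input name=\"query\" size=\"30\">\n";
    print "Maximum matches: <select name=\"maxhits\">\n";
    print "<option>10 <option selected>40 <option>60\n";
    print "</select>\n";
    print "<input type=\"submit\" value=\"Search\">\n";
    print "</form>\n";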

Figure 1. The form to search the Undergraduate Handbook using WAIS.

Figure 1 shows the forms-based interface that was created to access the search engine. It provides a simple interface for carrying out a search of a WAIS database (in this case the Undergraduate Handbook [HREF 4]). Boolean searches can be used and there is a pop-up menu to control the maximum number of matches to be returned from the search.

In this particular search, a second search option was available to allow searching for subject codes. This was an entirely separate search that used a grep-like lookup of a file mapping subject codes to document URLs.

This was necessary because WAIS did not perform well with numbers in the subject code format. For example, a search for the subject code 131-201 would locate that subject, but also about 30 others, all of which contained one or the other of the two numbers but not always both.

It appeared that the search input string was not being treated as a single string; instead, the two numbers making up the code were being OR'ed together.
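A sketch of the kind of lookup used for the separate subject-code search is shown below. The name and layout of the mapping file (one subject code, URL and title per line) are illustrative assumptions, not the actual data file.

    #!/usr/local/bin/perl
    # Sketch of the grep-like subject code lookup (illustrative only).
    # Assumes a mapping file with one "code url title" entry per line.

    $code = $ARGV[0];                         # e.g. "131-201"
    open(MAP, "subject-codes.txt") || die "cannot open mapping file: $!";

    print "Content-type: text/html\n\n";
    print "<h1>Subjects matching $code</h1>\n<ul>\n";
    while (<MAP>) {
        next unless /^\Q$code\E\s/;           # match the code at the start of a line
        ($c, $url, $title) = split(' ', $_, 3);
        print "<li><a href=\"$url\">$c $title</a>\n";
    }
    print "</ul>\n";
    close(MAP);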

Normal text searches of the WAIS database, however, generally worked well. For example, the results of the search shown in Figure 1 can be seen in Figure 2.

Figure 2. Search results of the Undergraduate Handbook using the search term 'Foucault and philosophy'.

The results of the search show the matching documents using their title rather than their file name. Each match is a 'live' link to the actual document. The results are displayed in the order of their WAIS 'relevancy' score from highest to lowest. The score is calculated on the basis of the frequency of matches with the search terms and their location within the document.

In addition, the size and type of document is displayed. This is useful for large documents or if the match is with an image file or other non-text file. People with slow network connections can choose not to download a file if it is very large.

Evaluation of WAIS implementation against defined criteria

This section looks at the performance of the keyword searching system against the criteria established in the previous section, 'Functionality and software requirements'.

System Requirements

System requirements defined general requirements for the searching system including the nature of the software and the platform it should run on.

1. Standards

WAIS was specifically designed to conform with Z39.50, the standard for exchanging information between databases on different computers.

There is no specific standard for World Wide Web to WAIS gateways, although the gateway must communicate with the Web server using the Common Gateway Interface (CGI) standard. The Common Gateway Interface is a standard for external gateway programs to interface with information servers such as HTTP servers.

Kid-of-wais is a script that conforms with the CGI guidelines.

2. Public Domain

The WAIS version used was freeWAIS 0.3. This is a public domain implementation of the WAIS protocol and its source code is available.

Kid-of-wais and print-hit-bold are both in the public domain and their source code is available.

3. WWW Integration on Unix Platform

The WAIS server program is specifically designed to run on UNIX computers. To the best knowledge of the authors, it does not run on other platforms.

The kid-of-wais gateway is written in Perl. While Perl is available on UNIX, Macintosh and Windows platforms, kid-of-wais has been specifically designed to work with UNIX HTTP servers.

4. Disk Space and System Resource Usage

WAIS indexes do not seem to be particularly efficient. For example, the Undergraduate Handbook itself occupies 15.5MB of disk space. The WAIS index of the Handbook occupies a further 6MB of disk space.

Although WAIS fell within the range regarded as acceptable in the criteria, it was still considered to use excessive disk space for its indexes.

Administration/Maintenance

Administration/Maintenance defined the requirements for the running of the searching system from the server side.

1. Ease of new index creation

A Perl script was created to carry out the indexing process. This was relatively easy to copy and adapt to index a new set of HTML documents.

The main disadvantage was that the process is manual and required reasonable UNIX skills. This did not cause problems because those people with the privileges to administer the server at this level all had good UNIX skills.
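A simplified sketch of the kind of indexing wrapper used is shown below. The directory names and excluded patterns are placeholders, and the waisindex option letters should be checked against the freeWAIS documentation; the actual script also handled the file type and directory controls described in the next sub-section.

    #!/usr/local/bin/perl
    # Sketch of a waisindex wrapper script (illustrative only).
    # Paths are placeholders and the waisindex options should be verified
    # against the local freeWAIS manual pages.

    $index = "/usr/local/wais/HANDBOOK";      # where the index files are written
    $docs  = "/usr/local/www/HB";             # root of the HTML collection

    # Select only the wanted file types, skipping working directories
    # (the patterns here are purely illustrative).
    @files = `find $docs -type f -print`;
    chomp(@files);
    @files = grep { /\.(html|gif|jpe?g)$/ && !m{/drafts/} } @files;

    # Build (or rebuild) the WAIS index of the selected files.
    system("waisindex", "-d", $index, "-T", "HTML", @files) == 0
        || die "waisindex failed: $?";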

2. Ease of control over what is indexed

The indexing script described above includes controls over which file types should be indexed and which should not, and over which directories are to be included in the index.

The maintenance of the script was a manual process.

3. Automation of reindexing

As implemented, reindexing of the WAIS database was not automated. Instead, the indexing script had to be re-run when a change to the HTML collection was made.

It would be feasible, however, to automate this process with 'cron' and 'make' so that the indexing script would be automatically run whenever the HTML documents were modified. There would be a delay, however, between the updates and when cron ran the script. The length of the delay would depend on how often cron was set to look for changes in the HTML files. The main reason this system was not implemented was lack of time and resources.
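A sketch of how such automation might look is given below; the paths, file names and crontab schedule are illustrative assumptions only.

    #!/usr/local/bin/perl
    # Sketch of a reindex-if-changed script intended to be run from cron
    # (illustrative only). An example crontab entry might be:
    #     0 2 * * * /usr/local/wais/bin/reindex-handbook.pl
    # All paths and script names below are placeholders.

    $docs    = "/usr/local/www/HB";
    $stamp   = "/usr/local/wais/HANDBOOK.last-indexed";
    $indexer = "/usr/local/wais/bin/index-handbook.pl";

    # In the spirit of 'make', re-run the indexing script only if some HTML
    # file is newer than the timestamp file left by the previous run.
    # (The timestamp file must be created by hand before the first run.)
    $changed = `find $docs -name '*.html' -newer $stamp -print | head -1`;
    exit 0 unless $changed;

    system($indexer) == 0 || die "indexing failed: $?";
    system("touch", $stamp) == 0 || die "could not update timestamp: $?";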

4. Incremental re-indexing

freeWAIS 0.3 does have a facility for incremental re-indexing of its databases. This function was not used in this implementation as it would have required modification to the indexing script, and such a modification was not given a high priority.

User functionality

User functionality defined the requirements for the searching system from the user's perspective.

1. Performance (speed)

The speed of keyword searches was considered adequate although by no means stunning. For example, the search shown in Figure 1 above took between 5 and 10 seconds depending on the load on the server.

Documents selected from the results page would typically take slightly longer to load. This was because they were being processed through the print-hit-bold CGI script to position the browser at the first match in the document.

2. Ease of use

The ease of use section of the criteria was further sub-divided into the following sub-sections.

Forms Interface for searching

A forms-based search interface (see Figure 1 above) was developed in Perl. The script first displayed a search form and was then re-invoked to send the search request to kid-of-wais (the WWW to WAIS gateway).

The form was simple to use and provided a satisfactory interface to search a set of HTML documents.

Matches displayed in Context

The results of the search showed the title of any matching documents but did not show the actual matches inside the document. This made it difficult to determine whether a particular document was relevant and worth viewing.

Relevancy or rank ordering

WAIS ranks the results according to a relevancy score based on the number of matches and on whether matches occur in document titles. In practice it was difficult to work out why results were ranked in a particular order, and the WAIS score did not seem to be a useful ranking of the search results.

No HTML displayed in results

The search results showed the HTML title of the documents (if present). Since no document contents were displayed as part of the search results, no HTML was present.

Hyperlinks from results to actual HTML files

WAIS can return URLs to documents that match the search. These were included in the results page so that a 'live' link existed to the actual documents. Print-hit-bold would then position the browser at the location of the first match in the document when a link was followed.

3. Functionality

The user functionality section of the criteria was further sub-divided into the following sub-sections.

Booleans

WAIS has support for boolean search terms including 'AND', 'OR' and 'NOT'. Complex searches involving multiple boolean search terms are also supported. For example WAIS will handle a search of the following form.

(computer and political) or science

Maximum match control

A WAIS index has a built-in maximum number of documents that can be returned from a search, defined at the time of indexing; this value can be varied downward from that limit. A facility was built into the form to specify the maximum number of documents returned from a search, up to a maximum of 60.

Case sensitive control

WAIS does not have a facility to control the case sensitivity of searches. By default, searches are not case sensitive.

Partial Words

WAIS cannot perform partial word searches. By default, all WAIS searches are exact matches on the specified search strings. This was seen as a significant problem with the WAIS searching system.

Numbers

WAIS does appear to index numbers, but it did not search for number strings in the way that was required. An additional searching system based on grep had to be developed to allow number searching in the desired way.

Summary

The WAIS system provided adequate simple keyword searching in most circumstances. However, it was not as simple as it should be to create a new index, particularly if some control over what was to be indexed was required.

The principal concerns we had with the WAIS implementation were with user functionality. It was often difficult to determine if a particular match was relevant or not, other than by viewing the actual document. Partial word matches were not available, there was no control over case sensitive searches and numbers were not handled in the way we required.

For these reasons we looked for a better solution for our keyword searching facility.

Glimpse

GLIMPSE [HREF 13], which stands for GLobal IMPlicit SEarch, is a public domain general purpose indexing and query scheme for Unix file systems. In comparison to WAIS, Glimpse builds a very small index, typically well under 10% of the size of the indexed text, yet still allows fast and very flexible full-text retrieval. Search options include Boolean queries, approximate matching (partial words and misspellings), case sensitivity, and regular expressions. In contrast to WAIS, a Glimpse search will display the actual lines where a match occurred as opposed to filenames only. Its HTTP extensions provide the unusual capability to both search and browse from a single HTML form.

Basic Architecture

The Glimpse package consists of four major components: agrep, the file searching engine; glimpseindex, the indexing program; glimpse, the query program; and the glimpsehttp extensions for use with World Wide Web servers. These components and their functions are described below.

agrep (Wu and Manber 1991; Wu and Manber 1992)

agrep is a C program which functions as the file searching engine. Powerful pattern matching is possible as the syntax of regular expressions in agrep is generally the same as for grep. In addition, agrep supports run-time options such as the number of misspellings allowed, user specified record delimiters, case sensitivity, partial or whole word matches, and "best match".

glimpseindex

Glimpse does not index files automatically - you have to run glimpseindex. This C program traverses any given directory recursively and indexes all text-based files.

There are three indexing options: a tiny index (2-3% of the total size of all files), a small index (7-8%) and a medium-size index (20-30%). In all cases only single words are stored in the index, always as lower-case. In general, the larger the index the faster the search.

The index consists of several files with the prefix .glimpse_ stored in the indexed directory. By default, glimpseindex ignores non-text files such as compressed files although these can be indexed with the provision of a user defined filter program. Numbers may be indexed if required. Glimpseindex stores the full filenames of all the files that were indexed in the file .glimpse_filenames. It is possible to exclude, or include, specified files or directory paths by specifying patterns in the .glimpse_exclude and .glimpse_include files, respectively.

An incremental indexing option, which adds to the index any files which have been created or modified since the index was built, is available for all but the largest size indexes. One can also increment the index by adding a nominated file.
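By way of illustration, the fragment below shows how glimpseindex might be driven for an HTML collection, in the manner of the amgr script described later. The option letters, and the use of -H to keep the index in the collection's directory, are our reading of the Glimpse documentation and should be checked against the glimpseindex manual page; the directory is a placeholder.

    #!/usr/local/bin/perl
    # Sketch of driving glimpseindex from a script (illustrative only;
    # check option letters against the manual page).

    $archive = "/usr/local/www/HB";           # placeholder collection root

    # Build a small (roughly 7-8%) index of the collection, indexing numbers
    # as well as words so that subject codes can be found. -H names the
    # directory in which the .glimpse_* index files are kept.
    system("glimpseindex -o -n -H $archive $archive") == 0
        || die "glimpseindex failed: $?";

    # Later runs can pick up files created or modified since the last index
    # was built (incremental indexing).
    system("glimpseindex -f -H $archive $archive") == 0
        || die "incremental glimpseindex failed: $?";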

glimpse

This C program performs the searches against the index. It includes all of agrep and when directed to a file, rather than an index, it is functionally equivalent to agrep: the pattern can be any agrep legal pattern (eg. wild cards, boolean ANDs and ORs, classes of characters, regular expressions).

Most but not all of agrep's options are supported, including the number of misspellings, record separators, case sensitivity, interpretation of meta-characters, output of filenames or matched records, whole or partial words, and best match mode. There is no size limit for simple patterns or Boolean AND patterns but complicated patterns are currently limited to 30 characters.

The powerful -F option of glimpse limits the search based on pattern matches against filenames (boolean ORs are supported). This enables both filename and file content matching to be combined, by running agrep against .glimpse_filenames to filter the results of the index search.

Another useful option specifies whether the proximity of boolean matches should be restricted to a single record, or the whole file.
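The fragment below illustrates the kinds of glimpse invocations involved, as they might be issued from a Perl script. The -F and -W options are those described above; the remaining option letters, and the use of ';' to express a Boolean AND in the pattern, are our reading of the Glimpse documentation, and the index directory is a placeholder.

    # Illustrative glimpse invocations (not the actual aglimpse code).

    $index = "/usr/local/www/HB";             # placeholder index location

    # Case-insensitive search, with the Boolean AND limited to a single
    # record (the default record is one line).
    @hits = `glimpse -i -H $index 'foucault;philosophy'`;

    # The same search with -W, so the two terms may match anywhere in a file.
    @hits = `glimpse -i -W -H $index 'foucault;philosophy'`;

    # Use -F to restrict the search to files whose names match a pattern.
    @hits = `glimpse -i -H $index -F 'subjects' 'examination'`;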

glimpsehttp [HREF 14]

The components described so far can all be executed from a Unix shell. Additional components for the management and delivery of Glimpse search capabilities to HTTP servers are provided by glimpsehttp. These HTTP-related extensions, the archive manager (amgr) and the CGI search script (aglimpse), are described below.

amgr (archive manager)

amgr is a Perl script which allows an HTTP server administrator to define and control the indexing of multiple collections of HTML. An administrator can control archives by defining for each archive the root of the directory tree to be indexed, a descriptive title, the type of index (tiny or small), and whether numbers are indexed.

A key feature of amgr is that as it indexes an HTML collection, it creates a special file, typically named ghindex.html, that contains a forms-based interface to the index. The definition and content of the search form is defined in a site wide template from which specific instances of the form are intelligently cloned and customised for that sub-directory. The archive title, and a pre-defined administrator address, are included on each form.

In addition to containing the forms interface to a search, each instance of the form can contain hypertext links to search forms in each sub-directory and hypertext links to all files in the current directory. If the name of the hyperlinked file is present in a file named .description in the directory, then a user specified title is taken from the .description file and used as the hypertext link text instead of the filename. Thus in the ghindex.html form, users are able to browse as well as search: they see a hyperlinked descriptive title for every file present in the current directory and/or are able to jump to the ghindex.html form in any sub-directory. (NB: the ghindex forms are built so that searches invoked from sub-directories are limited to that sub-section of the directory tree using the -F option of glimpse described above.)

aglimpse

aglimpse is a CGI compliant Perl script which is invoked when the user submits the search. It does the work of parsing the user's search options, running a glimpse search of the index, formatting the results as HTML, and passing them back to the server for display by the user's browser.

The user interface controls supported in version 2.1 of the Glimpse package include case sensitivity, partial or whole word match, number of misspellings, maximum files returned, and the maximum matches per file returned. Limiting a search to a particular directory is handled via the PATH_INFO CGI variable, which is included at the end of the ACTION= specification of the search form.
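A sketch of how this works is shown below; it is illustrative only, and the URLs, variable names and index location are placeholders rather than the actual aglimpse code.

    # With a search form whose ACTION is, say, "/cgi-bin/aglimpse/HB/arts",
    # the HTTP server passes "/HB/arts" to the script in the PATH_INFO
    # environment variable (illustrative sketch only).

    $subtree = $ENV{'PATH_INFO'};
    $subtree =~ s{^/+}{};                     # strip the leading slash

    # Limit the index search to that part of the tree using glimpse's
    # -F filename matching option.
    $index = "/usr/local/www";                # placeholder index location
    @hits  = `glimpse -H $index -F '$subtree' '$query'`;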

Implementation of Glimpse with local extensions

Installation of version 2.0 of the Glimpse package on a DEC 5100 running Ultrix 4.3 was relatively straightforward. The glimpseHTTP Perl scripts required some simple pathname customisation and a copy of the scripts for each HTTP server.

Our initial experiences with Glimpse were good but could have been better. The package promised a great deal of functionality but as distributed would require considerable manual effort to administer. For example, there was no assistance provided to build the .description files, used to provide meaningful descriptive titles in the browsing section of the search forms. Furthermore, even if the .description file was present, its descriptive titles were not used in the search results display returned to users.

Because the glimpseHTTP extensions were written in Perl, customising them and adding functionality was straightforward. A brief description of our functional enhancements to glimpseHTTP follows.

Figure 3. The form to search the Undergraduate Handbook using Glimpse.

HTML Titles returned in searches

To achieve this additional feature we wrote a Perl script, makedescfile.pl, which recursively processes a directory, extracts the HTML title enclosed in <title>...</title> tags from each file, and creates in each directory a .description file containing these descriptive titles.
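A simplified sketch of the idea is shown below; it is not the actual makedescfile.pl, and the layout assumed for the .description file (one filename and title per line) is an illustrative assumption.

    #!/usr/local/bin/perl
    # Sketch of extracting HTML titles into per-directory .description
    # files (illustrative only, not the actual makedescfile.pl).

    use File::Find;

    %titles = ();

    sub wanted {
        return unless /\.html?$/;
        open(F, $_) || return;                # File::Find chdirs into each directory
        local $/;                             # slurp the whole file
        $text = <F>;
        close(F);
        if ($text =~ m{<title>\s*(.*?)\s*</title>}is) {
            $titles{$File::Find::dir} .= "$_ $1\n";
        }
    }

    find(\&wanted, @ARGV);                    # walk the directories given

    # Write one .description file in each directory that had HTML titles.
    foreach $d (keys %titles) {
        open(D, "> $d/.description") || next;
        print D $titles{$d};
        close(D);
    }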

We then modified the indexing function of the archive manager (amgr) script to provide the option of automatically running makedescfile.pl. We also modified aglimpse, the CGI interface program, to obtain an HTML file's title from the .description file and display that title in the search results.

Figure 4. Search results of the Undergraduate Handbook using Glimpse with the search term 'foucault and philosophy' .

Archive manager administrative enhancements

Customised addresses on index forms

It soon became apparent that the existing hardcoded administrator address was a limitation for an HTTP site with multiple collections of HTML maintained by different authors. A simple modification of the amgr Perl script added extra fields enabling the administrator name and email address to be specified individually for each archive (site-wide default values are included). The search form template was also modified to include this data in an <address> tag.

Customised templates

The amgr script as distributed allowed a single site-wide search template. We modified it so that a template file can be specified for each archive, allowing search forms to be customised to suit an individual collection.

We later allowed the actual name of the search form to vary from the hardcoded default of ghindex.html, so that a site-wide or upper level index would not inadvertently overwrite the search forms of another index.

aglimpse CGI-form functional enhancements

These enhancements simply added the required interface elements for useful agrep features which were not being exploited. Specifically:

Proximity control for boolean queries

We added a checkbox to the form template to exploit the -W option of glimpse. This option changes the scope of Boolean queries to be the whole file, rather than a single record (the default record delimiter is a newline). If the user's boolean search finds no matches in a single line, they are advised to try the search again with booleans matching across the whole file. Matches of boolean terms within the proximity of a single line are likely to be more relevant, so we made this the default.

Selection of multiple sub-directories for searching

The aglimpse script was further modified to build an OR pattern of matching filenames if the relevant fields were present in the form, in addition to its standard way of limiting a search to a sub-directory via PATH_INFO. This enables a user to select, via custom checkboxes and in any combination, the sub-directory paths to search without needing to navigate up and down individual sub-directories; a sketch of the approach is given below. At the time of writing, there was no way of automatically generating these controls on the form, as it is designed for highly customised and specialised search forms.
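In the sketch below, the form field name, the use of ',' as an OR separator in the -F pattern, and the index location are illustrative assumptions rather than the actual aglimpse code.

    # Sketch of turning sub-directory checkboxes into a -F pattern
    # (illustrative only).

    # Assume the decoded form fields are in %form, with multiple values of
    # the "subdir" checkbox field separated by null characters.
    @chosen = split(/\0/, $form{'subdir'});   # e.g. ("HB/arts", "HB/science")

    if (@chosen) {
        # Join the chosen paths into an OR pattern for glimpse's -F option.
        $fpattern = join(',', @chosen);
        @hits = `glimpse -H $index -F '$fpattern' '$query'`;
    } else {
        # Fall back to the standard PATH_INFO-based restriction.
        @hits = `glimpse -H $index -F '$subtree' '$query'`;
    }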

Filename or file text searching

The user can now control whether the search text will be applied to filenames or to the index itself. For the former, aglimpse tells glimpse to search .glimpse_filenames; for the latter, it searches the index. This has useful and powerful consequences for structured collections of HTML which employ naming conventions in their filenames, since it allows the user to retrieve particular types of files based on those conventions.

For example, the 3200 file collection making up the Undergraduate Handbook employs naming conventions. The filename type of search is used to search for subject codes eg. a search for '/131-1' will retrieve the titles of all first year History subjects (code 131).

Display controls to return titles, not matches

By specifying that the search should return one match only per file, aglimpse will now display titles only. This is in contrast to the default search which will display every matching line within each file.

Formatting of results

Other minor adjustments were made to the appearance of the returned HTML, including parsing out any HTML tags and displaying each matched file as a paragraph rather than as a 'loud' <H3> heading.

Forms design

The controls for maximum matches and maximum files returned were changed from text entry boxes to pop-up menus. Other minor adjustments were made to the processing of the template file to place the form and its controls at the front, provide a hotlink to the browsing section, and enable sub-directories and/or filenames to be included or omitted in particular templates.

Evaluation of Glimpse implementation against defined criteria

This evaluation considers the performance of the Glimpse package modified as described against the criteria discussed earlier.

System Requirements

1. Standards

Glimpse is not Z39.50 compliant.

The aglimpse HTTP interface script is CGI compliant.

2. Public Domain

The package is in the public domain and source code is available. At the time of writing and based on discussions with the author of the glimpseHTTP extensions, it is likely that most if not all of these modifications will soon be incorporated in a future distributed version of the Glimpse package. This demonstrates the benefits of public domain code and collaborative development.

3. WWW Integration on Unix Platform

Glimpse is well integrated with Unix and its HTTP gateway is a simple CGI compliant script. A minor problem was encountered with access permissions on indexes, but this was overcome by a small change to the archive manager.

Glimpse is not available on other platforms.

4. Disk Space and System Resource Usage

Space usage is extremely efficient and our experiences justify the claims made by Glimpse's authors. For example, the medium sized index built for the 15 Megabyte University Handbook collection of over 3000 files occupied just under 1 Megabyte, representing only 6-7% of the size of the total collection.

Indexing of very large collections may run into memory limitations on some systems, but the time required to index a collection is acceptable.

No problems have been encountered with excessive resource usage in the conduct of searches.

Administration/Maintenance

1. Ease of new index creation

The archive manager makes creating a new index extremely easy and fast.

2. Ease of control over what is indexed

It is quite simple to exert fine-grained control over the paths and file extensions that are to be indexed by glimpseindex, by defining patterns in the .glimpse_exclude and .glimpse_include files. Further simple enhancements are planned to amgr to automate building include and exclude files for new collections based on site defaults.

3. Automation of reindexing

Reindexing can be automated by running glimpseindex commands from cron. There is no way as yet of having these commands generated by, or controlled via, the amgr program, which is interactive in nature. This could be overcome in the future by automating the running of amgr with Expect.

4. Incremental reindexing

The amgr cannot as yet handle this, so incremental reindexing must be done manually. It is expected that the limitation on incremental indexing of large indexes will be removed in a future release.

User functionality

1. Performance (speed)

The speed of glimpse depends mainly on the number and sizes of the files that contain a match and only to a second degree on the total size of all indexed files. If the pattern is reasonably uncommon, then all matches will be reported in a few seconds even in extremely large collections.

Glimpse finds whole phrases by splitting any multiword pattern into its set of words and looking for all of them in the index (which is single word based). It then applies agrep to find the combined pattern in files which contain the single words of the phrase. This can be slow for cases where both words are very common, but their combination is not.

Overall our experience has shown that the performance of glimpse for HTML searches is very acceptable, with most searches of the Handbook index completing within 3-10 seconds. Filename based searches are extremely fast because the search is a simple agrep search of a single file, .glimpse_filenames.

2. Ease of use

GlimpseHTTP rates extremely well on all the ease of use criteria defined with the exception of rank ordering of search results. This could be rectified with further minor amendments to aglimpse. The degree of fine-grained control of what is searched and how it is searched is excellent. The default values usually provide adequate results and the controls are located unobtrusively so as not to confront the user with unnecessary complexity.

The results display is excellent, with the full text of matches shown, and hypertext links to found files and lines which can take the user to the place where the match was found in the retrieved file. Where the HTML source is unwrapped (ie. no embedded newlines in paragraphs), the matches show an entire self-contained HTML element such as a paragraph.

3. Functionality

All functionality criteria such as partial words, number searching, maximum match control, and case sensitivity are met in full. Simple Boolean searches, such as multiple ANDs and multiple ORs, work well. Some difficulties with combined ANDs and ORs were encountered and had not been resolved at the time of writing.

In addition, glimpseHTTP provided additional and powerful functions which exceeded our expectations. The capacity to browse as well as search is very useful but could be better implemented. Unfortunately, directory and file browse lists do not observe the rules defined in the include and exclude files, although it is anticipated that it would not be difficult to add this feature. This means, for example, that directories that have been specifically excluded from indexing will still appear in the list of sub-directories for browsing. Naturally, the normal server controls still apply and if a directory is protected by access controls this will remain the case.

The capacity to search filenames lends extremely powerful retrieval features to large collections with naming conventions. The capacity to select sub-directories in any combination via custom checkboxes was an unexpected benefit which is extremely useful for some collections.

Conclusion

As a search system for HTML collections, we conclude that the Glimpse package offers many advantages over the more commonly used WAIS in all areas except Z39.50 compliance: system requirements, administration and maintenance, and, most importantly, user functionality are all dramatically better than with WAIS, for both searching and browsing.

The capacity to integrate filename and text based searching offers opportunities for HTML authors to structure and name their collections in ways which offer extremely powerful search and presentation capabilities to users.

Being in the public domain, the Glimpse package is well suited to further enhancement, and customised modifications and extensions [HREF 15] to suit particular collections are already making their appearance on the Web.

However, Glimpse's main strength for Web providers is as a general purpose search and browse facility which, once installed, makes it simple to index new collections and customise the search interface to suit the nature of the collection.

References

Wu and Manber, "Fast Text Searching With Errors," Technical report #91-11, Department of Computer Science, University of Arizona, June 1991 (available by anonymous ftp from ftp://cs.arizona.edu/agrep/agrep.ps.1).

Wu and Manber, "Agrep -- A Fast Approximate Pattern Searching Tool", To appear in USENIX Conference 1992 January (available by anonymous ftp from ftp://cs.arizona.edu/agrep/agrep.ps.2).

Hypertext References

HREF 1
http://www.unimelb.edu.au/~ssilcot - Sandra Silcot's home page
HREF 2
http://www.unimelb.edu.au - The University of Melbourne Campus Wide Information home page.
HREF 3
http://www.unimelb.edu.au/cgi-bin/nph-depts - The University of Melbourne Faculty and Department servers page.
HREF 4
http://www.unimelb.edu.au/HB/ - The University of Melbourne Undergraduate Handbook.
HREF 5
http://www.unimelb.edu.au/research.report/ - The University of Melbourne Research Report.
HREF 6
http://www.unimelb.edu.au/StrategicPlans/StratPlan.html - 'Building on Quality', the University of Melbourne Strategic Plan 1994-96.
HREF 7
http://vinca.cnidr.org/protocols/z3950/z3950v3d10.html - Definition of the Z39.50 protocol for exchange of information between databases. Tenth Draft (V3D10).
HREF 8
http://server.wais.com/ - WAIS Inc. home page.
HREF 9
ftp://ftp.cnidr.org/pub/NIDR.tools/freewais/ - ftp site for the freewais distribution.
HREF 10
http://www.eit.com/software/wwwwais/wwwwais.html - Home page for the wwwwais.c program. A World Wide Web to WAIS gateway.
HREF 11
http://www.eit.com/people/kev.html - Home page of Kevin Hughes, developer of 'wwwwais.c'.
HREF 12
http://www.cso.uiuc.edu/grady.html - Information on 'kid-of-wais' World Wide Web to WAIS gateway and 'print-hit-bold'
HREF 13
http://glimpse.cs.arizona.edu:1994/index.html - Glimpse Home page.
HREF 14
http://glimpse.cs.arizona.edu:1994/glimpsehttp.html - Glimpse http gateway documentation
HREF 15
http://glimpse.cs.arizona.edu:1994/ghttp/contrib/tree.html - Additional contributions to Glimpse page.

Copyright

© Southern Cross University, 1995. Permission is hereby granted to use this document for personal use and in courses of instruction at educational institutions provided that the article is used in full and this copyright statement is reproduced. Permission is also given to mirror this document on WorldWideWeb servers. Any other usage is expressly prohibited without the express permission of Southern Cross University.
