Sandra Silcot, Information Technology Services, The University of Melbourne, Parkville, Victoria, 3052, Australia. Phone +61-3-9344 8034. Fax +61-3-9347 4803. Email: ssilcot@www.unimelb.edu.au. Home Page: Sandra's Web [HREF 1]
Keyword searching is a facility that has now become very familiar to most users of on-line services. Users of library databases, CD-ROMs and other on-line bibliographic services have been using keyword searching facilities for some time.
The functionality of many HTML documents or sets of HTML documents can be significantly enhanced by providing a similar style of access to them through keyword searching. A keyword searching facility will turn a set of HTML documents into much more than merely an electronic 'mirror' of the paper version.
As the World Wide Web has developed, a number of different systems have emerged to provide searching facilities for HTML documents. This paper examines the requirements for keyword searching of HTML documents at the University of Melbourne and the subsequent implementation of an effective system for access to HTML documents using a keyword searching system.
The University of Melbourne established its Campus Wide Information System [HREF 2], based on the World Wide Web, in June of 1994. Since then the use of the service has undergone rapid growth. The volume of World Wide Web traffic generated from within the university is currently doubling every five to six weeks. The number of Faculties or departments [HREF 3] running servers has steadily increased to the current level of over forty servers.
The main university CWIS server acts as a central directory pointing to the various Faculty and Departmental servers. In addition it provides access to a number of university-wide services, such as the telephone directory service, and is also used to publish a variety of documents with university-wide scope. These include the Undergraduate Handbook [HREF 4], the Research Report [HREF 5] and the Strategic Plan 1994 to 1996 [HREF 6]. All of these documents include a keyword searching facility to enhance their functionality.
A searching system should be easy to use and easy to maintain. It should be efficient and not consume too much system resources. To be effective from the user's point of view, it should have good response time and provide the searching facilities they have become accustomed to when using other on-line searching systems.
The CWIS team established a number of specific requirements for such a searching system. These can be categorised into the following sections.
A number of specific system requirements were identified. Some were essential, while others were merely desirable.
Where possible, software used at the University of Melbourne should conform with international standards. In the area of indexing and communication with databases, the relevant standard was considered to be Z39.50 [HREF 7].
This standard defines a protocol for the searching and retrieval of information from databases on different computers across a network. It does not concern itself with the internal workings of an indexing and retrieval system but, rather, how that system communicates with other database engines and, in particular, how they exchange information contained within their databases.
Any software used to implement a keyword searching system must reside in the public domain. This ensures the maximum flexibility for the system. In particular this has the following benefits.
Access to the source code allows site specific modifications to improve performance or enhance security.
The price is right. Systems that reside in the public domain are, almost by definition, more affordable than proprietary systems.
Easier to change horses mid-stream. Once a financial commitment has been made and data is in a proprietary format it is much more difficult to abandon one solution in favour of another solution that may emerge.
Since the main university CWIS server runs under UNIX it was essential that the searching system also run under UNIX. However, it need not be exclusively UNIX based. Indeed, support for servers on other platforms, particularly Macintosh and Windows, would be an advantage.
The searching system to be implemented must integrate fully with the World Wide Web. WWW browsers on all platforms should be able to submit queries and display query results. The search results should contain 'live' links to referenced HTML documents.
Indexes created from HTML documents should not occupy too much disk space. It would be alarming if the index occupied substantially more space than the original collection it indexed. In fact, it would be much more desirable if the index occupied only a small amount of extra disk space.
The retrieval process should not consume too much CPU time or memory. This is particularly the case if searching is conducted on the same machine that is serving up the original HTML documents. This was the anticipated configuration in this case. Our concern was to ensure that the capacity of the CWIS server to deliver documents was not going to be severely degraded by people using it to perform keyword searches.
The keyword searching system should make it easy to create a new index of a set of HTML documents. It should also be easy to update an existing index if the source documents change.
The ease of creation and maintenance of a search index is determined by a number of factors, which are discussed in the evaluation sections below.
The purpose of a keyword searching facility is for people to use it effectively to locate the information they require. To this end it is vital that the software used meets a number of criteria that will determine how well the searching facility will meet people's needs.
These can be categorised into a number of sections.
The keyword searching facility must function quickly enough that people conducting searches will not grow impatient and discontinue the search. Ideally, on long searches, a progress report should be included.
The search interface must be easy to use. It should be based on an HTML forms interface, with the search engine invoked via a cgi-bin script, and should offer simple controls for search settings.
If possible, search results should be displayed in context to allow the user to decide more easily whether particular 'hits' are relevant. Results should also be displayed in a natural order, not simply the order in which they were found, and they should not contain 'visible' HTML.
It must be simple to transfer from the search results to the original HTML document. Ideally, this should be by a 'live' link.
The keyword search engine should provide support for boolean searches using 'AND' and 'OR' at a minimum. It should also include a facility to perform partial word searches.
There should be capacity to control the maximum number of documents returned from a search and also control over case sensitive searching.
Several of the HTML collections to be indexed contained numbers that needed to be indexed. An example is the Undergraduate Handbook [HREF 4], which includes subject codes that should be searchable.
The first keyword system implemented was based on WAIS (Wide Area Information Servers), a database index and retrieval engine developed as a joint project between Apple Computer, Thinking Machines Corporation and Dow Jones.
WAIS is more than a simple database system to index and retrieve documents. It also incorporates a sophisticated networking component that allows search results to be exchanged from one computer to another using the Z39.50 protocol.
WAIS can index a variety of different document formats including HTML. It also has support for plain text files and image file formats including JPEG and GIF. (Image files only have their filenames indexed.)
The WAIS project resulted in a commercial version distributed by WAIS Inc. [HREF 8] and a public domain version known as freewais [HREF 9].
The freewais distribution includes an indexing program (waisindex), server software (waisserver) for making the database available over the network, a search engine for local wais databases (waisquery), a search engine for wais databases on other computers on the network (waissearch) and wais client software for vt100 terminals (swais) and X stations (xwais).
Most World Wide Web clients cannot communicate directly with a wais server. Instead they need a Web to wais gateway. The function of the gateway is to take a query from a Web client and format it into a query suitable for submission to the specified wais database. Once the query has been performed by the wais server the gateway formats the results for display back on the Web client.
The initial choice for a World Wide Web to wais gateway was wwwwais.c [HREF 10] developed by Kevin Hughes [HREF 11] at EIT. At that stage the software was at version 2.2.
The principal disadvantage with the wwwwais.c gateway was that the results from a search of a wais database were displayed showing the file names of matching documents, rather than their HTML titles. This often made it difficult to determine if a particular document in the results list was of interest.
Instead, an alternative gateway program, kid-of-wais [HREF 12], based on the script wais.pl distributed with the NCSA server software and on wwwwais.c, was chosen for use. It includes a facility to display the titles of Web documents (if present) that are found by a search.
An additional script, called print-hit-bold [HREF 12] was used with kid-of-wais. Its function is to position the browser at the first occurrence of a search term within a document that is selected for viewing from the results list.
A forms-based cgi-bin Perl script was developed to provide the search interface and transfer the search terms and options to the WAIS gateway.
A copy of the script was made and slightly modified every time a new search database was created.

Figure 1. The form to search the Undergraduate Handbook using WAIS.
Figure 1 shows the forms-based interface that was created to access the search engine. It provides a simple interface for carrying out a search of a WAIS database (in this case the Undergraduate Handbook [HREF 4]). Boolean searches can be used and there is a pop-up menu to control the maximum number of matches to be returned from the search.
In this particular search, a second search option was available to allow searching for subject codes. This was an entirely separate search that used a grep-like search of a file that mapped subject codes to document URLs.
This was necessary because WAIS did not perform well with numbers in the subject code format. For example, a search for the subject code 131-201 would locate that subject, but also about 30 others, all of which contained one or the other of the numbers but not always both.
It seemed that the search input string was not being treated as a single string; rather, the two numbers making up the code were being OR'ed together.
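The grep-style subject code search described above can be sketched as follows. This is an illustration only: the lookup file format ('code', tab, URL) and all names here are assumptions, not the actual CWIS Perl implementation.

```python
# Illustrative sketch of a grep-style lookup of subject codes against a
# file mapping codes to document URLs. Unlike the WAIS search, the code
# is treated as a single exact string, so '131-201' matches only itself.

def load_code_map(lines):
    """Parse lines of 'subject-code<TAB>URL' into a dictionary."""
    code_map = {}
    for line in lines:
        code, url = line.strip().split("\t")
        code_map[code] = url
    return code_map

def lookup(code_map, code):
    """Exact-match lookup of a single subject code."""
    return code_map.get(code)

# Hypothetical lookup file contents and paths, for illustration.
lookup_file = [
    "131-201\t/handbook/arts/history/131-201.html",
    "131-101\t/handbook/arts/history/131-101.html",
]
codes = load_code_map(lookup_file)
print(lookup(codes, "131-201"))  # -> /handbook/arts/history/131-201.html
```

Because the whole code is the lookup key, the spurious partial matches produced by WAIS cannot occur.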
Normal text searches of the WAIS database generally worked well however. For example, the results of the search shown in Figure 1 can be seen in Figure 2.

Figure 2. Search results of the Undergraduate Handbook using the search term 'Foucault and philosophy'.
The results of the search show the matching documents using their title rather than their file name. Each match is a 'live' link to the actual document. The results are displayed in the order of their WAIS 'relevancy' score from highest to lowest. The score is calculated on the basis of the frequency of matches with the search terms and their location within the document.
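The ranking described above can be illustrated with a minimal frequency-based scoring sketch. The actual WAIS weighting scheme is more elaborate; the particular title boost used here is an assumption for illustration only.

```python
# Minimal sketch of relevance ranking in the spirit of the WAIS score:
# term frequency in the body, with matches in the title weighted higher.

def score(doc, terms, title_boost=10):
    """Sum term frequencies, boosting occurrences in the document title."""
    body = doc["body"].lower().split()
    title = doc["title"].lower().split()
    s = 0
    for t in terms:
        t = t.lower()
        s += body.count(t)                  # frequency of matches in body
        s += title_boost * title.count(t)   # title matches count more
    return s

docs = [
    {"title": "Philosophy", "body": "foucault and philosophy of history"},
    {"title": "Chemistry", "body": "philosophy of chemistry"},
]
ranked = sorted(docs, key=lambda d: score(d, ["foucault", "philosophy"]),
                reverse=True)
print([d["title"] for d in ranked])  # -> ['Philosophy', 'Chemistry']
```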
In addition, the size and type of document is displayed. This is useful for large documents or if the match is with an image file or other non-text file. People with slow network connections can choose not to download a file if it is very large.
This section looks at the performance of the keyword searching system against the criteria that were established in the previous section 'Functionality and software requirements'.
System requirements defined general requirements for the searching system including the nature of the software and the platform it should run on.
WAIS was specifically designed to conform with Z39.50, the standard for exchanging information between databases on different computers.
There is no specific standard for World Wide Web to WAIS gateways, although the gateway must communicate with the Web server using the Common Gateway Interface (CGI) standard. The Common Gateway Interface is a standard for external gateway programs to interface with information servers such as HTTP servers.
Kid-of-wais is a script that conforms with the CGI guidelines.
The WAIS version used was freeWAIS 0.3. This is a public domain implementation of the WAIS protocol. The source code is available.
Kid-of-wais and print-hit-bold are both in the public domain and their source code is available.
The WAIS server program is specifically designed to run on UNIX computers. To the best knowledge of the authors, it does not run on other platforms.
The kid-of-wais gateway is written in Perl. While Perl is available on UNIX, Macintosh and Windows platforms, kid-of-wais has been specifically designed to work with UNIX HTTP servers.
WAIS indexes do not seem to be particularly efficient. For example, the Undergraduate Handbook itself occupies 15.5MB of disk space. The WAIS index of the Handbook occupies a further 6MB of disk space.
Although WAIS fell within the range regarded as acceptable in the criteria it was still regarded as using excessive disk space for its indexes.
Administration/Maintenance defined the requirements for the running of the searching system from the server side.
A Perl script was created to carry out the indexing process. This was relatively easy to copy and adapt to index a new set of HTML documents.
The main disadvantage was that the process is manual and required reasonable UNIX skills. This did not cause problems because those people with the privileges to administer the server at this level all had good UNIX skills.
The indexing script described above includes controls over the file types that should be indexed and those that should not. The script is also used to control which directories are to be indexed and which ones should not be included.
The maintenance of the script was a manual process.
As implemented, reindexing of the WAIS database was not automated. Instead, the indexing script had to be re-run when a change to the HTML collection was made.
It would be feasible, however, to automate this process with 'cron' and 'make' so that the indexing script would be automatically run whenever the HTML documents were modified. There would be a delay, however, between the updates and when cron ran the script. The length of the delay would depend on how often cron was set to look for changes in the HTML files. The main reason this system was not implemented was lack of time and resources.
freeWAIS 0.3 does have a facility for incremental re-indexing of its databases. This function was not used in this implementation as it would have required modification to the indexing script. Such a modification was not given a high priority.
Ease of use and user functionality defined the requirements for the searching system from the user's perspective.
The speed of keyword searches was considered adequate although by no means stunning. For example, the search shown in Figure 1 above took between 5 and 10 seconds depending on the load on the server.
Documents selected from the results page would typically take slightly longer to load. This was because they were being processed through the print-hit-bold CGI script to position the browser at the first match in the document.
The ease of use section of the criteria was further sub-divided into the following sub-sections.
A forms-based search interface (see Figure 1 above) was developed in Perl. The script first displayed a search form and was then re-invoked to send the search request to kid-of-wais (the WWW to WAIS gateway).
The form was simple to use and provided a satisfactory interface to search a set of HTML documents.
The results of the search showed the title of any matching documents. It did not show the actual matches inside the document. This made it difficult to determine if a particular document was relevant and one you wanted to actually view.
WAIS ranks the results according to a relevancy score based on the number of matches and if matches occur in document titles. In practice it was difficult to work out why results were ranked in a particular order. The WAIS score did not seem to be a useful ranking of the search results.
The search results showed the HTML title of the documents (if present). Since no document contents were displayed as part of the search results, no HTML was present.
WAIS can return URLs to documents that match the search. These were included in the results page so that a 'live' link existed to the actual documents. Print-hit-bold would then position the browser at the location of the first match in the document when a link was followed.
The user functionality section of the criteria was further sub-divided into the following sub-sections.
WAIS has support for boolean search terms including 'AND', 'OR' and 'NOT'. Complex searches involving multiple boolean search terms are also supported. For example WAIS will handle a search of the following form.
(computer and political) or science
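A boolean query of this kind can be evaluated against per-document word sets. In this illustrative sketch the query parser is omitted and the boolean structure is expressed directly using Python's own and/or operators; WAIS itself parses the query string.

```python
# Sketch of boolean query evaluation over per-document word sets,
# mirroring the query '(computer and political) or science'.

def matches(doc_text, predicate):
    """Split a document into a word set and apply a boolean predicate."""
    words = set(doc_text.lower().split())
    return predicate(words)

# The query above, expressed directly as a predicate on the word set.
query = lambda w: ("computer" in w and "political" in w) or "science" in w

docs = {
    "a": "computer political studies",
    "b": "computer graphics",        # 'computer' alone does not match
    "c": "science of cooking",
}
hits = sorted(name for name, text in docs.items() if matches(text, query))
print(hits)  # -> ['a', 'c']
```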
A WAIS index has a built-in maximum number of documents that can be returned from a search. This is defined at the time of indexing, and the value can be varied downward from that limit. A facility was built into the form to specify the maximum number of documents returned from a search, up to a maximum of 60.
WAIS does not have a facility to control the case sensitivity of searches. By default, searches are not case sensitive.
WAIS cannot perform partial word searches. By default, all WAIS searches are exact matches on the specified search strings. This was seen as a significant problem with the WAIS searching system.
WAIS does index numbers, but it did not search for number strings in the way that was required. An additional searching system based on grep had to be developed to support number searching in the desired way.
The WAIS system provided an adequate searching system that provided simple keyword searching in most circumstances. It was not as simple as it should be to create a new index, particularly if some control over what was to be indexed was required.
The principal concerns we had with the WAIS implementation were with user functionality. It was often difficult to determine if a particular match was relevant or not, other than by viewing the actual document. Partial word matches were not available, there was no control over case sensitive searches and numbers were not handled in the way we required.
For these reasons we looked for a better solution for our keyword searching facility.
GLIMPSE [HREF 13], which stands for GLobal IMPlicit SEarch, is a public domain general purpose indexing and query scheme for Unix file systems. In comparison to WAIS, Glimpse builds a very small index, typically well under 10% of the size of the indexed text, yet still allows fast and very flexible full-text retrieval. Search options include Boolean queries, approximate matching (partial words and misspellings), case sensitivity, and regular expressions. In contrast to WAIS, a Glimpse search will display the actual lines where a match occurred as opposed to filenames only. Its HTTP extensions provide the unusual capability to both search and browse from a single HTML form.
The Glimpse package consists of five major components: a pattern-matching engine (agrep), an indexer (glimpseindex), a search engine (glimpse), an archive manager (amgr), and a CGI gateway (aglimpse). These major components and their functions are described below.
agrep is a C program which functions as the file
searching engine. Powerful pattern matching is possible as the syntax
of regular expressions in agrep is generally the same as for
grep. In addition, agrep supports run-time
options such as the number of misspellings allowed, user specified
record delimiters, case sensitivity, partial or whole word matches, and
"best match".
Glimpse does not automatically index files; you have to run glimpseindex. This C program will traverse any given directory recursively, indexing all text-based files.
There are three indexing options: a tiny index (2-3% of the total size of all files), a small index (7-8%) and a medium-size index (20-30%). In all cases only single words are stored in the index, always as lower-case. In general, the larger the index the faster the search.
The index consists of several files with the prefix
.glimpse_ stored in the indexed directory. By default,
glimpseindex ignores non-text files such as compressed
files although these can be indexed with the provision of a user
defined filter program. Numbers may be indexed if required.
Glimpseindex stores the full filenames of all the files
that were indexed in the file .glimpse_filenames. It is
possible to exclude, or include, specified files or directory paths by
specifying patterns in the .glimpse_exclude and
.glimpse_include files, respectively.
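The effect of the include and exclude files can be sketched as a filename filter. Note that glimpseindex's actual pattern syntax may differ from the shell-style globs used in this illustration.

```python
# Sketch of include/exclude filtering in the style of the
# .glimpse_include and .glimpse_exclude files, using shell-style globs.
import fnmatch

def filter_files(filenames, include, exclude):
    """Keep files matching an include pattern unless an exclude matches."""
    kept = []
    for name in filenames:
        if exclude and any(fnmatch.fnmatch(name, p) for p in exclude):
            continue  # an exclude pattern takes precedence
        if include and not any(fnmatch.fnmatch(name, p) for p in include):
            continue  # with includes present, a file must match one
        kept.append(name)
    return kept

# Hypothetical file list and patterns, for illustration.
files = ["handbook/arts.html", "handbook/arts.html.bak", "logs/access.log"]
print(filter_files(files, include=["handbook/*"],
                   exclude=["*.bak", "logs/*"]))
# -> ['handbook/arts.html']
```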
An incremental indexing option, which adds to the index any files which have been created or modified since the index was built, is available for all but the largest size indexes. One can also increment the index by adding a nominated file.
This C program performs the searches against the index. It includes
all of agrep and when directed to a file, rather than an
index, it is functionally equivalent to agrep: the pattern
can be any agrep legal pattern (eg. wild cards, boolean
ANDs and ORs, classes of characters, regular expressions).
Most but not all of agrep's options are supported,
including the number of misspellings, record separators, case
sensitivity, interpretation of meta-characters, output of filenames or
matched records, whole or partial words, and best match mode. There is
no size limit for simple patterns or Boolean AND patterns but
complicated patterns are currently limited to 30 characters.
The powerful -F option of glimpse
limits the search based on pattern matches against filenames (boolean
ORs are supported). This enables both filename and file content
matching to be combined, by running agrep against
.glimpse_filenames to filter the results of the index
search.
Another useful option specifies whether the proximity of boolean matches should be restricted to a single record, or the whole file.
The components so far described can all be executed from a Unix shell. Additional components for the management and delivery of glimpse search capabilities to HTTP servers are provided by glimpseHTTP. These HTTP related extensions include:
amgr is a Perl script which allows an HTTP server
administrator to define and control the indexing of multiple
collections of HTML. An administrator can control archives by defining
for each archive the root of the directory tree to be indexed, a
descriptive title, the type of index (tiny or small), and
whether numbers are indexed.
A key feature of amgr is that as it indexes an HTML collection, it creates a special file, typically named ghindex.html, that contains a forms-based interface to the index. The content of the search form is defined in a site-wide template from which specific instances of the form are intelligently cloned and customised for each sub-directory. The archive title, and a pre-defined administrator address, are included on each form.
In addition to containing the forms interface to a search, each
instance of the form can contain hypertext links to search forms
in each sub-directory and hypertext links to all files in the current
directory. If the name of the hyperlinked file is present in a file
named .description in the directory, then a user specified
title is taken from the .description file and used as the
hypertext link text instead of the filename. Thus in the
ghindex.html form, users are able to browse as well as
search: they see a hyperlinked descriptive title for every file
present in the current directory and/or are able to jump to the
ghindex.html form in any sub-directory (NB: the
ghindex forms are built so that searches invoked from
sub-directories are limited to that sub-section of the directory tree
using the -F option of glimpse
described above).
aglimpse is a CGI compliant Perl script which is invoked
when the user submits the search. It does the work of parsing the
user's search options, running a glimpse search of the
index, formatting the results as HTML, and passing them back to the
server for display by the user's browser.
The user interface controls supported in version 2.1 of the Glimpse
package include case sensitivity, partial or whole word match, number
of misspellings, maximum files returned, and the maximum matches per
file returned. Limiting a search to a particular directory is handled via the PATH_INFO CGI variable, which is included at the end of the ACTION= specification of the search form.
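The directory-limiting mechanism can be sketched as follows. The use of PATH_INFO follows the CGI convention, but the sanitisation and search-root logic here are illustrative, not aglimpse's actual code.

```python
# Sketch of deriving a search root from the CGI PATH_INFO variable,
# restricting the search to a sub-directory of the archive.
import posixpath

def search_root(archive_base, environ):
    """Join the archive base with the (sanitised) PATH_INFO suffix."""
    suffix = environ.get("PATH_INFO", "").lstrip("/")
    path = posixpath.normpath(posixpath.join(archive_base, suffix))
    if not path.startswith(archive_base):  # refuse '..' escapes
        return archive_base
    return path

# Hypothetical archive path, for illustration.
env = {"PATH_INFO": "/handbook/arts"}
print(search_root("/www/archives", env))  # -> /www/archives/handbook/arts
```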
Installation of version 2.0 of the Glimpse package on a DEC 5100 running Ultrix 4.3 was relatively straightforward. The glimpseHTTP Perl scripts required some simple pathname customisation and a copy of the scripts for each HTTP server.
Our initial experiences with Glimpse were good but could have been
better. The package promised a great deal of functionality but as
distributed would require considerable manual effort to administer. For
example, there was no assistance provided to build the
.description files, used to provide meaningful descriptive
titles in the browsing section of the search forms. Furthermore, even
if the .description file was present, its descriptive
titles were not used in the search results display returned to users.
Because the glimpseHTTP extensions were in Perl,
customising and adding additional functionality was straightforward. A
brief description of our functional enhancements to
glimpseHTTP follows.

Figure 3. The form to search the Undergraduate Handbook using Glimpse.
To achieve this additional feature we wrote a Perl script,
makedescfile.pl, which would recursively process a
directory, extract the HTML title enclosed in
<title>...</title> tags, and create the
.description file containing these descriptive titles in
each directory.
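The core of such a script can be sketched in a few lines. This is an illustration in Python rather than the actual makedescfile.pl, and the exact .description line format shown is an assumption.

```python
# Sketch of extracting an HTML file's <title> and producing a
# '.description' line pairing the filename with its descriptive title.
import re

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def description_line(filename, html):
    """Return 'filename title'; fall back to the filename if no title."""
    m = TITLE_RE.search(html)
    title = m.group(1).strip() if m else filename
    return "%s %s" % (filename, title)

html = "<html><head><title>History 131-201</title></head></html>"
print(description_line("131-201.html", html))
# -> 131-201.html History 131-201
```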
We then modified the indexing function of the archive manager
(amgr) script to provide the option of automatically
running makedescfile.pl. We then modified
aglimpse, the CGI interface program, to obtain a html
file's title from the .description file and display its
title in the search results.

Figure 4. Search results of the Undergraduate Handbook using Glimpse with the search term 'foucault and philosophy' .
It was soon apparent that the existing hardcoded administrator address was
a limitation for an HTTP site with multiple collections of HTML
maintained by different authors. It was a simple modification of the
amgr Perl script to add extra fields to enable the
administrator name, and email address, to be specified individually for
each archive (site-wide default values are included). The search form
template was also modified to include this data in an
<address> tag.
The amgr script as distributed allowed a single
site-wide search template. We modified it to enable a template file to
be specified for each archive, enabling customised search forms to suit
an individual collection.
We later allowed the actual name of the search form to vary from the
hardcoded default of ghindex.html, so that a site-wide or
upper level index would not inadvertently overwrite the search forms
of another index.
These enhancements simply added the required interface elements to
useful agrep features which were not being exploited.
Specifically:
We added a checkbox to the form template to exploit the
-W option of glimpse. This option changes the scope of
Boolean queries to be the whole file, rather than a single record (the
default record delimiter is a newline). If the user's boolean search
finds no matches in a single line, they are advised to try the search
again with booleans matching across the whole file. Obviously matches
of boolean terms within the proximity of a single line are likely to be
more relevant, so we made this the default.
The aglimpse script was further modified to build an OR pattern of
matching filenames if the relevant fields were present in the form, in
addition to its standard way of limiting a search to a sub-directory
via PATH_INFO. This has the effect of enabling a user to
select via custom checkboxes, in any combination, the sub-directory
paths to search without the user needing to navigate up and down
individual sub-directories. At the time of writing, there was no way of
automatically generating these controls on the form as it is designed
for highly customised and specialised search forms.
The user can now control whether search text will be applied to
filenames or the index itself. In the case of the former,
aglimpse tells glimpse to search
.glimpse_filenames. For the latter, aglimpse
tells glimpse to search the index. This has useful and
powerful consequences for structured collections of HTML which employ
naming conventions in filenames. Based on naming conventions, it is
possible for the user to retrieve certain types of files.
For example, the 3200 file collection making up the Undergraduate
Handbook employs naming conventions. The filename type of search is
used to search for subject codes eg. a search for '/131-1'
will retrieve the titles of all first year History subjects (code
131).
By specifying that the search should return one match only per file,
aglimpse will now display titles only. This is in contrast
to the default search which will display every matching line within
each file.
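The 'one match per file' display mode amounts to capping the number of matching lines reported for each file, as this illustrative sketch shows (the names here are assumptions, not aglimpse's code):

```python
# Sketch of reporting matching lines per file, with an optional cap:
# max_per_file=1 reproduces the titles-only style of display.

def matching_lines(files, term, max_per_file=None):
    """Map filename -> matching lines, optionally capped per file."""
    results = {}
    for name, text in files.items():
        hits = [line for line in text.splitlines() if term in line]
        if hits:
            results[name] = hits[:max_per_file] if max_per_file else hits
    return results

files = {"arts.html": "philosophy 101\nphilosophy 201\nhistory 101"}
print(matching_lines(files, "philosophy"))                  # every match
print(matching_lines(files, "philosophy", max_per_file=1))  # first only
```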
Other minor adjustments were made to the appearance of the returned HTML, including parsing out any HTML tags, and displaying each matched file as a paragraph rather than the rather "loud" <H3> heading.
The controls for maximum matches, and the maximum files returned, were changed from text entry boxes to pop-up menus. Other minor adjustments were made to the processing of the template file to place the form and its controls at the front, provide a hotlink to the browsing section, and to enable sub-directories and/or filenames to be included or omitted in particular templates.
This evaluation considers the performance of the Glimpse package modified as described against the criteria discussed earlier.
Glimpse is not Z39.50 compliant.
The aglimpse HTTP interface script is CGI
compliant.
The package is in the public domain and source code is available. At
the time of writing and based on discussions with the author of the
glimpseHTTP extensions, it is likely that most if not all
of these modifications will soon be incorporated in a future
distributed version of the Glimpse package. This demonstrates the
benefits of public domain code and collaborative development.
Glimpse is well integrated with Unix and its HTTP gateway is a simple CGI compliant script. A minor problem was encountered with access permissions on indexes but was overcome by a minor change to the archive manager.
Glimpse is not available on other platforms.
Space usage is extremely efficient and our experiences justify the claims made by Glimpse's authors. For example, the medium sized index built for the 15 Megabyte University Handbook collection of over 3000 files occupied just under 1 Megabyte, representing only 6-7% of the size of the total collection.
Indexing of very large collections may run into memory limitations on some systems, but the time required to index a collection is acceptable.
No problems have been encountered with excessive resource usage in the conduct of searches.
The archive manager makes creating a new index extremely easy and fast.
It is quite simple to exert fine-grained control over the paths and file extensions that are to be indexed by glimpseindex, by defining patterns in the .glimpse_exclude and .glimpse_include files. Further simple enhancements to amgr are planned to automate the building of include and exclude files for new collections based on site defaults.
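For illustration, a collection's index directory might contain control files along these lines (the patterns and paths shown here are hypothetical, not taken from our installation):

```shell
# .glimpse_include -- patterns for files that should be indexed
*.html
*.txt

# .glimpse_exclude -- patterns for paths and files to omit from the index
/drafts/
*.gif
*.jpg
```

Exclude patterns take precedence, so working directories and image files can be kept out of an otherwise broadly defined collection.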
Scheduled reindexing can be achieved by running glimpseindex commands from cron. There is as yet no way to have these commands generated by, or controlled via, the amgr program, which is interactive in nature. This could be overcome in the future by automating the running of amgr with Expect.
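Such a scheduled rebuild could be expressed as a crontab entry of the following form (the paths shown are hypothetical):

```shell
# Rebuild the Handbook index at 2 a.m. each day; -H names the
# directory in which the index files are kept.
0 2 * * * /usr/local/bin/glimpseindex -H /usr/local/www/handbook /usr/local/www/handbook >/dev/null 2>&1
```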
The amgr cannot yet handle incremental reindexing, so it must be done manually. It is expected that the limitation on incremental indexing of large indexes will be removed in a future release.
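Pending that release, a manual incremental update can be run directly with glimpseindex's incremental option, which uses modification dates to reindex only new or changed files (the paths here are hypothetical):

```shell
# Incrementally update an existing index rather than rebuilding it
glimpseindex -f -H /usr/local/www/handbook /usr/local/www/handbook
```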
The speed of glimpse depends mainly on the number and sizes of the files that contain a match and only to a second degree on the total size of all indexed files. If the pattern is reasonably uncommon, then all matches will be reported in a few seconds even in extremely large collections.
Glimpse finds whole phrases by splitting any multiword
pattern into its set of words and looking for all of them in the index
(which is single word based). It then applies agrep to find the
combined pattern in files which contain the single words of the phrase.
This can be slow for cases where both words are very common, but their
combination is not.
Overall our experience has shown that the performance of glimpse for
HTML searches is very acceptable, with most searches of the Handbook
index completing within 3-10 seconds. Filename based searches are
extremely fast because the search is a simple agrep search
of a single file, .glimpse_filenames.
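A filename search therefore reduces to a single agrep invocation over the stored list of indexed filenames, along these lines (the index path is hypothetical):

```shell
# Case-insensitive search of indexed filenames only
agrep -i 'handbook' /usr/local/www/handbook/.glimpse_filenames
```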
GlimpseHTTP rates extremely well on all of the ease-of-use criteria defined earlier, with the exception of rank ordering of search results; this could be rectified with further minor amendments to aglimpse. The degree of fine-grained control over what is searched, and how it is searched, is excellent. The default values usually produce adequate results, and the controls are located unobtrusively so as not to confront the user with unnecessary complexity.
The results display is excellent: the full text of matches is shown, with hypertext links to the found files and lines that take the user to the place in the retrieved file where the match occurred. Where the HTML source is unwrapped (i.e. no embedded newlines within paragraphs), each match shows an entire self-contained HTML element, such as a paragraph.
All functionality criteria such as partial words, number searching, maximum match control, and case sensitivity are met in full. Simple Boolean searches, such as multiple ANDs and multiple ORs, work well. Some difficulties with combined ANDs and ORs were encountered and had not been resolved at the time of writing.
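In Glimpse's query syntax, inherited from agrep, ';' expresses AND and ',' expresses OR, so searches of the kind described correspond to command lines such as the following (the search terms and index path are hypothetical):

```shell
glimpse -i -H /usr/local/www/handbook 'computing;science'   # both words (AND)
glimpse -i -H /usr/local/www/handbook 'handbook,calendar'   # either word (OR)
```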
In addition, glimpseHTTP provides powerful functions that exceeded our expectations. The capacity to browse as well as search is very useful, but could be better implemented. Unfortunately, directory and file browse lists do not observe the rules defined in the include and exclude files, although it is anticipated that adding this feature would not be difficult. This means, for example, that directories specifically excluded from indexing will still appear in the list of sub-directories for browsing. Naturally, the normal server controls still apply: if a directory is protected by access controls, this remains the case.
The capacity to search filenames lends extremely powerful retrieval features to large collections with naming conventions. The capacity to select sub-directories in any combination via custom checkboxes was an unexpected benefit which is extremely useful for some collections.
As a search system for HTML collections, we conclude that the Glimpse package offers many advantages over the more commonly used WAIS in all areas except Z39.50 compliance: its system requirements, its administration and maintenance, and, most importantly, its user functionality for both searching and browsing are dramatically better than those of WAIS.
The capacity to integrate filename and text based searching offers opportunities for HTML authors to structure and name their collections in ways which offer extremely powerful search and presentation capabilities to users.
Being in the public domain, the Glimpse package is well suited to further enhancement and customised modification, and extensions [HREF 15] tailored to particular collections are making their appearance on the Web.
However, Glimpse's main strength for Web providers is as a general purpose search and browse facility which, once installed, makes it simple to index new collections and customise the search interface to suit the nature of the collection.
Wu, S. and Manber, U., "Fast Text Searching With Errors", Technical Report TR 91-11, Department of Computer Science, University of Arizona, June 1991 (available by anonymous ftp from ftp://cs.arizona.edu/agrep/agrep.ps.1).
Wu, S. and Manber, U., "Agrep -- A Fast Approximate Pattern Searching Tool", Proceedings of the Winter 1992 USENIX Conference, January 1992 (available by anonymous ftp from ftp://cs.arizona.edu/agrep/agrep.ps.2).
AusWeb95 The First Australian WorldWideWeb Conference