Expanding Web functionality by incorporating external programs - a case study of a search system.


Adrian Vanzyl, Unit of Medical Informatics, Monash University, 867 Centre Road, East Bentleigh, 3165. Email: adrian.vanzyl@med.monash.edu.au. Web: http://www.monash.edu.au/informatics/Default.html. 61-3-579-3188 Voice. 61-3-570-1382 Fax.
Keywords: WWW, CGI, Information Retrieval

Introduction

One of the most critical features inherent in the World Wide Web is the ability to serve non-static documents. This allows dynamic processes such as search systems, databases or real-time imaging systems to be integrated into the Internet. Traditionally, significant effort had to be expended to provide distributed access to an existing piece of software. An existing standalone search system, for example, would need network handling routines added to it, and would need customised client programs written for it so that users at distributed sites could connect to the search system, run a search, and view the results. Where multiple client platforms have to be supported, the development load escalates even further. Using the Internet and World Wide Web, in particular the http protocol, a developer can rapidly link an existing package into the Internet, freed from the need to develop low-level network interface code and from the need to develop customised client applications. Diagram One illustrates the way in which clients and servers interact using the Hypertext Transfer Protocol (http) [1] and other protocols.

It is worth noting the following implications of this model.

1) Client software.

- multi-protocol and multi-purpose. All the popular graphical interface clients currently in use (such as Netscape and Mosaic) support access to multiple server protocols (in addition to http for true web documents). Users thus become familiar with the interface, and are not required to learn yet another software program to use a new tool such as a search system.
- cross platform. All these client programs run on the major graphical interface operating systems, including X Windows, Windows and Macintosh. This saves the custom application developer the need to redevelop cross platform tools. It also has significant implications for user training and support.

2) TCP/IP compliance

By passing information between the clients (such as Netscape) and the servers (such as an httpd server) using TCP/IP, the following advantages are realised:

- true distributed access. If both client and server are connected to the Internet, there is no restriction on where they can be physically located.
- multiple network standards support - TCP/IP can be transported over and between different common local area network standards, including AppleTalk and Novell.

3) http Servers

All http servers attempt to adhere to an internationally accepted set of standards. They all support the common gateway interface, with appropriate modifications depending on which platform they are running on. This means that:

- if the custom application (database, search system or other) only runs on one platform (such as a Macintosh), it now becomes possible to give users on any other platform (such as Windows) access to that application's functionality, by linking the application (on the Macintosh) to the appropriate server running on the same platform as the application
- any kind of data can be served back to the user through the http protocol. This includes dynamically generated pages of html text (such as the result of a complex database query or the result of a search), as well as other data types. There are multiple examples of devices such as video cameras, vending machines and coffee pots being linked into the web so that they can deliver information about their current state as it exists at the time the client makes his or her request.

Integrating a search system into the Web

Our development of a search engine was a direct result of user need. Having developed a prototype under Unix with a WAIS [6] search engine, we decided to move our server to a local machine for ease of maintenance. Our primary web server has been running on a Macintosh, and more recently a PowerPC. Since there was no equivalent to WAIS for the Macintosh, and since we had an existing search system (Total Research, written by Chris Priestley), the decision was made to make the necessary changes to the search system to allow it to integrate with the Macintosh-based web server. We aimed to maintain the current user interface as far as possible. The following screen dump shows the original Macintosh interface.

Running a simple search on the novel "Flatland" by Edwin A. Abbott demonstrates the following features:

- keyword-in-context display of matched keywords. This means that each matched keyword is displayed lined up with all other matched keywords, and surrounded by a fragment of the adjacent text to give the user the context within which the match occurred
- the total number of matches found
- a pop-up list of search types (including Boolean and near searches)
- the find box into which a search is typed.

By clicking on a line, the actual file is opened and scrolled to the position of the match, with the match highlighted. The following screen dump illustrates this.

The underlying engine searches the documents in real time; they are not pre-indexed. This means that there is no overhead required to store the indices, no overhead in generating the indices, and no danger of missing hits for document sets that are constantly changing (such as a real-time news feed, or a web site where users can freely add new documents). It does, however, mean that the effective searchable document space has to be less than 10 megabytes in size or search times become excessive.

Creating the Web version of the search system

To link a custom application into the World Wide Web, it is necessary for the developer to do two things. The first is to become familiar with the Common Gateway Interface (CGI) [2], which defines the way in which the web server passes client information to the custom application, and how it expects information to be passed back to it. The second is to become familiar with Hypertext Markup Language (html) [3], which defines the way in which information has to be returned to the client so that it is formatted and displayed in the way the developer intended.
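
To make the second half of this concrete, the fragment below is a minimal sketch of the output side of a CGI program written in C. It assumes a Unix-style server, where the program writes a header line, a blank line and then html to standard output (under MacHTTP the equivalent data is returned via an AppleEvent); the form field name and script name are invented for illustration.

    #include <stdio.h>

    int main(void)
    {
        /* The Content-type line tells the server (and thus the client)
           what follows; the blank line ends the header.                */
        printf("Content-type: text/html\r\n\r\n");

        /* Everything from here on is ordinary html, generated on the fly. */
        printf("<html><head><title>Search</title></head><body>\n");
        printf("<h1>Search the document set</h1>\n");
        printf("<form method=\"post\" action=\"/tr-www.cgi\">\n");
        printf("Find: <input type=\"text\" name=\"find\">\n");
        printf("<input type=\"submit\" value=\"Search\">\n");
        printf("</form></body></html>\n");
        return 0;
    }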

We placed the following limits on implementation possibilities:

- the system had to run on a Macintosh
- ideally it had to run in native mode on the PowerPC since search speed was critical
- it had to interface with the MacHTTP [5] web server.

Given that the original search engine was written in C, we examined the following languages and environments for achieving the above:

AppleScript
This flexible scripting language is widely used by web maintainers on the Macintosh. It requires, however, that the application to be integrated support an appropriate set of AppleEvents. We decided against this option since we plan to move the engine to other platforms, including Windows and Unix.

PERL
Perl excels at string handling, and is commonly used for forms handling with Unix based web servers. There is a good Perl implementation for the Macintosh, but not yet for Windows. We examined the option of writing a set of library routines for the search engine (in C) and integrating these with a set of Perl scripts. This would provide a flexible solution with the option to adapt to cross platform implementation without too much difficulty. We decided against this option on the basis that there was no PowerPC native implementation of Perl, and that for this simple case all the string handling could be done in C code.

C
The original engine was written in C, and we had a compiler that produces native PowerPC RISC code. This combination made us choose C as the language in which to implement the web based search engine. We hoped to realise significant speed improvements on the PowerPC, and this was indeed the case. By using ANSI code as far as possible, we hope to make the transition to other platforms as smooth as possible. A Windows version has been implemented, with several additional features not found in the Macintosh version.

Implementation

Initial implementation consisted of two steps:

- creating the appropriate interface code to handle the CGI events produced by the server, and translating these into the equivalent commands and instructions that users entered directly into the original search system
- marking up the returned search strings in html, so that they display on screen in a manner reproducing the original system as closely as possible.
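
As a rough illustration of the second step, the fragment below sketches how one line of the context list might be marked up: each match becomes an html link back to the search engine's CGI, carrying the parameters that identify the match (the meaning of those parameters is explained in the section on state below). The Match structure, function name and context text are invented for illustration and are not the actual Total Research source.

    #include <stdio.h>

    /* Illustrative only: one matched keyword with its surrounding context. */
    typedef struct {
        const char *file;    /* full path of the matched document */
        long        offset;  /* character position of the match   */
        int         length;  /* length of the matched word        */
        const char *left;    /* text fragment before the match    */
        const char *word;    /* the matched word itself           */
        const char *right;   /* text fragment after the match     */
    } Match;

    /* Emit one keyword-in-context line as a link back to the engine. */
    static void emit_context_line(const Match *m)
    {
        printf("<a href=\"/tr-www.cgi?__get,%s,%ld,%d\">%s%s%s</a>\n",
               m->file, m->offset, m->length, m->left, m->word, m->right);
    }

    int main(void)
    {
        /* File, offset and length are taken from the example URL later
           in the paper; the context fragments are made up.             */
        Match m = { "/Docs/Flatland.txt", 18753L, 3,
                    "text before the match ", "man", " text after the match" };
        emit_context_line(&m);
        return 0;
    }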

The following screen dumps show the first two of the three main interface screens. Both were easy to design, as they required no more than a minimal understanding of html forms design and markup.

The first screen shows the standard interface for starting a search.

The second shows the context list of results. Note that this mimics the original standalone application completely.

State dependent information in a stateless system

Generating the final step, in which the user clicks on a match and is then presented with the actual document with the matched word highlighted, presented several problems. These can be generalised to the issue of maintaining state information in a stateless system.

A simpler example illustrates this issue well. A set of html documents are presented to a student. These documents explain a certain concept, and have a quiz section at the end. At the start of the session, the student enters his or her name, and at the completion of the session, a score is presented for all questions attempted, and a list of incorrect responses is available, linked back to the relevant document for instant review.

The World Wide Web was designed to be stateless. When a user/client requests a specific document from the server, a connection is established between them, the document is delivered, and then the connection is closed. The server maintains no real information about who the user is, how far they have progressed through the quiz, and certainly maintains no information about what their score is at that point in time. To maintain such information requires significant extra work by the author, in terms of writing scripts to store this state information for the client between different calls to the server.

The general solution is as follows:

- the first time the user/client connects to this set of documents, they are required to register and enter a username. A script stores this username on the server, along with the user's current score and the document currently being viewed.
- after registering, the first of the documents in the set is returned to the user. The document returned, however, cannot simply be the original document; it needs to be a specially modified copy which has embedded within it the username of the user for whom it was intended. In this way, when the user submits their response (for example, by answering the quiz at the bottom of the page), the document returned to the server can identify the user, and the scripts underlying the whole process of assessment and user tracking can then proceed.
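
One common way to embed that state is as a hidden field in the html form returned to the user, so that it travels back to the server with the next submission. The fragment below is a sketch of this idea only; the field names, script name and page content are invented, not taken from an actual courseware system.

    #include <stdio.h>

    /* Sketch: return a copy of a quiz page with the student's username
       embedded as a hidden form field, so the next submission identifies
       the user.  Names and content are illustrative only.               */
    static void emit_quiz_page(const char *username)
    {
        printf("Content-type: text/html\r\n\r\n");
        printf("<html><body>\n");
        printf("<p>... the teaching document and its quiz questions ...</p>\n");
        printf("<form method=\"post\" action=\"/quiz.cgi\">\n");
        /* the state the server needs back with the next request */
        printf("<input type=\"hidden\" name=\"user\" value=\"%s\">\n", username);
        printf("Answer: <input type=\"text\" name=\"answer\">\n");
        printf("<input type=\"submit\" value=\"Submit answer\">\n");
        printf("</form></body></html>\n");
    }

    int main(void)
    {
        emit_quiz_page("jsmith");   /* example username only */
        return 0;
    }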

Solutions have been proposed that track users and the documents they have accessed by analysing the log file produced by the server software. This approach suffers from the following problems:

- the user information may in some cases only be written to the log after the request has been sent to the CGI script which handles this interaction. Thus, at the time the script is called, the information it requires is not yet in the log file
- scanning the log file is a time consuming process
- when a user logs in from a different machine, they can be confused with another previous user from that machine. This is particularly relevant where users come into the system through dial up lines, where their machine name and IP number are dynamically assigned and change with every connection.

For our search engine, we had to store the following state information, and embed it within each of the links listed in the context display:

- the name and full path of the document in which this match occurs
- the position of the match within the document
- the length of the matched word (to know how many characters to highlight for the match in the returned document).

The URL [4] produced for each match on the word "man" in the screen dump above is of the following form:
http://informatics.med.monash.edu.au/tr-www.cgi?__get,/Docs/Flatland.txt,18753,3

The __get command is an internal command understood by the search engine, and it specifies the three parameters as listed above.
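
A sketch of how the engine might split such a request into its command and three parameters is shown below; the function and variable names are invented for illustration, not the actual Total Research code.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Split a request such as "__get,/Docs/Flatland.txt,18753,3" into its
       command and three parameters.  Returns 1 on success, 0 otherwise.  */
    static int parse_get_request(char *query, char **path,
                                 long *offset, int *length)
    {
        char *cmd = strtok(query, ",");
        char *p   = strtok(NULL, ",");
        char *off = strtok(NULL, ",");
        char *len = strtok(NULL, ",");

        if (cmd == NULL || strcmp(cmd, "__get") != 0 ||
            p == NULL || off == NULL || len == NULL)
            return 0;                /* malformed, or not a __get request */

        *path   = p;                 /* document name and full path       */
        *offset = atol(off);         /* position of the match             */
        *length = atoi(len);         /* characters to highlight           */
        return 1;
    }

    int main(void)
    {
        char  query[] = "__get,/Docs/Flatland.txt,18753,3";
        char *path;
        long  offset;
        int   length;

        if (parse_get_request(query, &path, &offset, &length))
            printf("file %s, offset %ld, length %d\n", path, offset, length);
        return 0;
    }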

When the search engine receives such a __get request, it does the following:

- opens the appropriate file
- reads into a buffer a number of characters before and after the match
- inserts some html code into this buffer which highlights the match and places an html named anchor around it
- finally it adds html code to the start of the buffer which has a link to the exact position of the match within the buffer, as well as a link to the original document in case the user wishes to download the entire document.
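
A simplified sketch of these steps in C appears below. The buffer size, helper name and exact html emitted are assumptions made for illustration (a real implementation would also escape html special characters in the document text); it is not the actual engine source.

    #include <stdio.h>

    #define CONTEXT 1000   /* characters kept either side of the match */

    /* Sketch only: serve the fragment of a document around one match. */
    static void serve_match(const char *path, long offset, int length)
    {
        FILE  *fp = fopen(path, "r");
        long   start;
        char   before[CONTEXT + 1], word[256], after[CONTEXT + 1];
        size_t nb, nw, na;

        if (fp == NULL || length <= 0 || length >= (int)sizeof word)
            return;

        /* read a window of text before, at, and after the match */
        start = offset - CONTEXT;
        if (start < 0) start = 0;
        fseek(fp, start, SEEK_SET);
        nb = fread(before, 1, (size_t)(offset - start), fp); before[nb] = '\0';
        nw = fread(word,   1, (size_t)length,           fp); word[nw]   = '\0';
        na = fread(after,  1, CONTEXT,                  fp); after[na]  = '\0';
        fclose(fp);

        printf("Content-type: text/html\r\n\r\n");
        /* links added to the start of the buffer, as described above */
        printf("<a href=\"#match\">Go to the match</a> | ");
        printf("<a href=\"%s\">Retrieve the whole document</a>\n<hr>\n", path);
        /* the match itself is highlighted and given a named anchor */
        printf("<pre>%s", before);
        printf("<a name=\"match\"><b>%s</b></a>", word);
        printf("%s</pre>\n", after);
    }

    int main(void)
    {
        /* parameters taken from the example URL above */
        serve_match("/Docs/Flatland.txt", 18753L, 3);
        return 0;
    }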

The returned buffer appears as follows in a web client.

Conclusion

The single most difficult problem to solve was that of generating the html links for the list of context hits, so that each link 'knew' which file it related to, and which position within that file it had to link to. The solution described is to create the link so that it calls the custom application with additional parameters that describe these positional variables.

The creation of dynamic html documents, such as that for the last screen in our example where the actual document with the match is displayed, is straightforward to implement. However, doing so in a language such as C is not elegant. Six lines of Perl code can often achieve the same result as sixty lines of C code, and are easier to understand and faster to test and debug. Having written the whole system in C, it is our opinion that C should only be used to write code that is time critical. A language optimised for string handling, such as Perl, can dramatically reduce development time for those parts of the system which are string intensive and not time critical.

Developing code that functions cross platform can be trivial or extremely complex, depending on the following factors:

- what language is used? ANSI C or Perl works almost without change on different platforms.
- are machine specific library routines used? Code to recursively scan a directory tree, for example, is drastically different on the Macintosh, Windows and Unix.
- does the server being used support the CGI standard, and how is it implemented? Each of the operating systems mentioned above has servers that implement the standard in slightly different ways. The most significant difference is in the way that the message is passed to the CGI application. On the Macintosh, this is done using a custom set of AppleEvents. Under Windows it is primarily done by writing the parameters to a file, and under Unix it is a combination of environment variables and standard input, as sketched below for the Unix case.
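
The fragment below sketches the Unix case only: a GET request arrives in the QUERY_STRING environment variable, and a POST request arrives as CONTENT_LENGTH bytes on standard input. It is a simplified illustration rather than production code.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sketch of reading a CGI request under Unix; as noted in the text,
       MacHTTP delivers the same data in an AppleEvent, and Windows
       servers typically write the parameters to a file.                */
    static char *read_request(void)
    {
        const char *method = getenv("REQUEST_METHOD");
        const char *query;
        char       *buf;

        if (method != NULL && strcmp(method, "POST") == 0) {
            const char *cl  = getenv("CONTENT_LENGTH");
            long        len = (cl != NULL) ? atol(cl) : 0;

            buf = malloc((size_t)len + 1);
            if (buf == NULL) return NULL;
            fread(buf, 1, (size_t)len, stdin);   /* form data on stdin */
            buf[len] = '\0';
            return buf;
        }

        /* GET: the query string is everything after the '?' in the URL */
        query = getenv("QUERY_STRING");
        if (query == NULL) return NULL;
        buf = malloc(strlen(query) + 1);
        if (buf != NULL) strcpy(buf, query);
        return buf;
    }

    int main(void)
    {
        char *request = read_request();

        printf("Content-type: text/plain\r\n\r\n");
        printf("Received: %s\n", request ? request : "(nothing)");
        free(request);
        return 0;
    }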

We are strongly in favour of using a cross platform language that has standard library calls to protect the software author from the low-level implementation details of CGI calls, and that allows for reuse of modules and sharing of code between developers. Currently Perl is closest to achieving this standard.

References

[1] The http protocol, http://info.cern.ch/hypertext/WWW/Protocols/Overview.html

[2] Common Gateway Interface, http://hoohoo.ncsa.uiuc.edu/cgi/

[3] Hypertext Markup Language, html, http://info.cern.ch/hypertext/WWW/MarkUp/MarkUp.html

[4] Universal Resource Locators, URLs, http://info.cern.ch/hypertext/WWW/Addressing/Addressing.html

[5] MacHTTP server system, http://www.biap.com/

[6] WAIS Information servers, http://www.wais.com/


Copyright

© Southern Cross University, 1994. Permission is hereby granted to use this document for personal use and in courses of instruction at educational institutions provided that the article is used in full and this copyright statement is reproduced. Permission is also given to mirror this document on WorldWideWeb servers. Any other usage is expressly prohibited without the express permission of Southern Cross University.