
It is worth noting the following implications of this model.
1) Client software.
- multi protocol and multipurpose.
All the popular graphical interface clients currently in use (such as
Netscape and Mosaic) support access to multiple server protocols (in
addition to http for true web documents). Users thus become familiar with
a single interface, and are not required to learn yet another software
program to use a new tool such as a search system.
- cross platform. All these client programs run on the major graphical
interface operating systems, including X Windows, Windows and Macintosh. This
saves the custom application developer from having to redevelop cross platform
tools. It also has significant implications for user training and support.
2) TCP/IP compliance
By passing information between the clients (such as Netscape) and the servers (such as an httpd server) using TCP/IP, the following advantages are realised:
- true distributed access. If both client and server are connected to the
Internet, there is no restriction on where they can be physically located
- multiple network standards support. TCP/IP can be transported over and
between different common local area network standards, including AppleTalk
and Novell.
3) http Servers
All http servers attempt to adhere to an internationally accepted set of standards. They all support the Common Gateway Interface (CGI) [2], with appropriate modifications depending on the platform on which they are running. This means that:
- if the custom application (database, search system or other) only runs
on one platform (such as a Macintosh), it now becomes possible to give users
on any other platform (such as Windows) access to that application's
functionality, by linking the application (on the Macintosh) to the
appropriate server running on the same platform as the application
- any kind of data can be served back to the user through the http
protocol. This includes dynamically generated pages of html text (such as
the result of a complex database query or the result of a search), as well
as other data types. There are multiple examples of devices such as video
cameras, vending machines and coffee pots being linked into the web so that
they can deliver information about their current state as it exists at the
time the client makes his or her request (a simple sketch of such a
dynamically generated page follows).
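As a rough illustration of this last point, the sketch below (assuming a Unix-style server that relays the program's standard output back to the client; it is an illustration only, not code from the system described later) generates a page at the moment of the request:

    #include <stdio.h>
    #include <time.h>

    /* A minimal CGI program: the server runs it for each request and
       returns whatever it writes to standard output to the client. */
    int main(void)
    {
        time_t now = time(NULL);

        /* The content-type header and the blank line are required by CGI. */
        printf("Content-type: text/html\r\n\r\n");

        printf("<html><head><title>Current state</title></head><body>\n");
        printf("<p>This page was generated at %s</p>\n", ctime(&now));
        printf("</body></html>\n");
        return 0;
    }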

Running a simple search using the novel "Flatland" by Edwin A. Abbott demonstrates the following features:
- keyword in context display of matched keywords. This means that each
matched keyword is displayed aligned with all the other matches, and
surrounded by a fragment of the adjacent text to give the user the context
within which the match occurred
- the total number of matches found
- a pop up list of search types (including boolean and near searches)
- the find box into which a search is typed.
Clicking on a line opens the actual file, scrolled to the position of the match, with the match highlighted. The following screendump illustrates this.

We placed the following limits on implementation possibilities:
- the system had to run on a Macintosh
- ideally it had to run in native mode on the PowerPC since search speed
was critical
- it had to interface with the MacHTTP [5] web
server.
Given that the original search engine was written in C, we examined the following languages and environments for achieving the above:
AppleScript
This flexible scripting language is widely used by
web maintainers on the Macintosh. It requires, however, that the application
to be integrated support an appropriate set of AppleEvents. We decided
against this option since we plan to move the engine to other platforms,
including Windows and Unix.
Perl
Perl excels at string handling, and is commonly used for
forms handling with Unix based web servers. There is a good Perl
implementation for the Macintosh, but not yet for Windows. We examined the
option of integrating a set of library routines for the search engine
(written in C), and integrating these with a set of Perl scripts. This
would provide a flexible solution with the option to adapt to cross platform
implementation without too much difficulty. We decided against this option on
the basis that there was no PowerPC native implementation of Perl, and that
for this simple case all the string handling could be done in C code.
C
The original engine was written in C, and we had a compiler
that produces native PowerPC RISC code. This combination made us choose C as
the language in which to implement the web based search engine. We hoped to
realise significant speed improvements on the PowerPC, and this was indeed
the case. By using ANSI C as far as possible, we hope to make the
transition to other platforms as smooth as possible. A Windows version has
been implemented, with several additional features not found in the Macintosh
version.
The main implementation work consisted of:
- creating the appropriate interface code to handle the CGI events
produced by the server, and translating these into the equivalent commands
and instructions that users entered directly into the original search system
- marking up the returned search strings in html, so that they display
on screen in a manner reproducing the original system as closely as possible
(a sketch of this markup step follows the list).
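The second of these tasks is conceptually simple. The sketch below shows roughly how a single keyword in context line could be marked up; the function names, buffer sizes and layout are our own for illustration and are not taken from the engine itself:

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper: copy src into dest, escaping the characters
       that have special meaning in html. */
    static void html_escape(char *dest, const char *src, size_t destsize)
    {
        size_t used = 0;
        for (; *src != '\0' && used + 7 < destsize; src++) {
            switch (*src) {
            case '<': strcpy(dest + used, "&lt;");  used += 4; break;
            case '>': strcpy(dest + used, "&gt;");  used += 4; break;
            case '&': strcpy(dest + used, "&amp;"); used += 5; break;
            default:  dest[used++] = *src;          break;
            }
        }
        dest[used] = '\0';
    }

    /* Hypothetical helper: emit one keyword-in-context line, with the
       matched word shown in bold between its left and right context. */
    static void emit_kwic_line(FILE *out, const char *left,
                               const char *match, const char *right)
    {
        char buf[512];

        html_escape(buf, left, sizeof buf);
        fprintf(out, "%s", buf);

        html_escape(buf, match, sizeof buf);
        fprintf(out, "<b>%s</b>", buf);

        html_escape(buf, right, sizeof buf);
        fprintf(out, "%s<br>\n", buf);
    }

    int main(void)
    {
        /* One line from a search for the word "man". */
        emit_kwic_line(stdout, "... the ", "man", " of Flatland ...");
        return 0;
    }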
The following screen dumps show the first two of the three main interface screens. These were both easy to design, as they required no more than a minimal understanding of html forms design and html markup.
The first screen shows the standard interface for starting a search.
The second shows the context list of results. Note that this mimics the original standalone application completely.
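To give an indication of how little is involved, the first screen's search form amounts to not much more than the html emitted by a sketch such as the following (the action path, field names and list of search types shown here are illustrative only, not the ones used by the actual engine):

    #include <stdio.h>

    /* Emit a minimal search form of the kind shown in the first screen.
       The action path, field names and list of search types are
       illustrative only. */
    int main(void)
    {
        printf("Content-type: text/html\r\n\r\n");
        printf("<html><head><title>Search</title></head><body>\n");
        printf("<form method=\"POST\" action=\"/tr-www.cgi\">\n");
        printf("Search type: <select name=\"searchtype\">\n");
        printf("<option>keyword</option>\n");
        printf("<option>boolean</option>\n");
        printf("<option>near</option>\n");
        printf("</select>\n");
        printf("Find: <input type=\"text\" name=\"query\">\n");
        printf("<input type=\"submit\" value=\"Search\">\n");
        printf("</form></body></html>\n");
        return 0;
    }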
A simpler example illustrates the issue of maintaining state information well. A set of html documents is presented to a student. These documents explain a certain concept, and have a quiz section at the end. At the start of the session, the student enters his or her name, and at the completion of the session, a score is presented for all questions attempted, and a list of incorrect responses is available, linked back to the relevant document for instant review.
The World Wide Web was designed to be stateless. When a user/client requests a specific document from the server, a connection is established between them, the document is delivered, and then the connection is closed. The server maintains no real information about who the user is, how far they have progressed through the quiz, and certainly maintains no information about what their score is at that point in time. To maintain such information requires significant extra work by the author, in terms of writing scripts to store this state information for the client between different calls to the server.
The general solution is as follows:
- the first time the user/client connects to this set of documents, they
are required to register and enter a username. A script stores this username
on the server, along with information about the user's current score, and
current document being viewed.
- after registering, the first of the documents in the set is returned to
the user. The document returned, however, cannot simply be the original
document, but needs to be a specially modified copy of this document, which
has embedded within it the username of the user for whom it was intended
(one way of doing this is sketched after the list).
In this way, when the user submits their response (for example, by answering
the quiz at the bottom of the page), the document returned to the server can
identify the user, and the scripts underlying the whole process of assessment
and user tracking can then proceed.
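One common way of embedding the username, assuming the quiz is an ordinary html form, is to insert a hidden form field into the copy before it is returned. The sketch below illustrates the idea; the placeholder string, field name and file name are ours, not part of any particular system:

    #include <stdio.h>
    #include <string.h>

    /* Sketch: copy an html quiz document to the client, replacing a
       placeholder line with a hidden form field carrying the student's
       username. The placeholder, field name and file name are
       illustrative only; a real system would also need to escape any
       special characters in the username. */
    static void emit_personalised_copy(const char *path, const char *username)
    {
        char line[1024];
        FILE *in = fopen(path, "r");
        if (in == NULL)
            return;

        while (fgets(line, sizeof line, in) != NULL) {
            if (strstr(line, "<!--USERNAME-->") != NULL)
                printf("<input type=\"hidden\" name=\"username\" value=\"%s\">\n",
                       username);
            else
                fputs(line, stdout);
        }
        fclose(in);
    }

    int main(void)
    {
        printf("Content-type: text/html\r\n\r\n");
        emit_personalised_copy("quiz1.html", "jsmith");
        return 0;
    }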
Solutions have been proposed based on analysis of the log file produced by the server software, to track users and which documents they have accessed. This approach suffers from the following problems:
- the user information may in some cases only be written to the log after
the request has been sent to the CGI script which handles this interaction.
Thus at the time of calling the script, the information required by it is not
yet in the log file
- scanning the log file is a time consuming process
- when a user logs in from a different machine, they can be confused with
another previous user of that machine. This is particularly relevant where
users come into the system through dial up lines, where their machine name
and IP number are dynamically assigned, and change with every connection.
For our search engine, we had to store the following state information, and embed it within each of the links listed in the context display:
- the name and full path of the document in which this match occurs
- the position of the match within the document
- the length of the matched word (to know how many characters to
highlight for the match in the returned document).
The URL [4] produced for each match on the word man
in the screen dump above is of the following form:
http://informatics.med.monash.edu.au/tr-www.cgi?__get,/Docs/Flatland.txt,18753,3
The __get command is an internal command understood by the search engine, and it specifies the three parameters as listed above.
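Parsing such a request is straightforward. A sketch of splitting the comma separated query string into its parts (variable names are ours) might look like this:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sketch: split a query string of the form
       "__get,/Docs/Flatland.txt,18753,3" into its four comma separated
       fields: command, document path, match position and match length. */
    int main(void)
    {
        char query[] = "__get,/Docs/Flatland.txt,18753,3"; /* example input */
        char *command  = strtok(query, ",");
        char *path     = strtok(NULL, ",");
        char *position = strtok(NULL, ",");
        char *length   = strtok(NULL, ",");

        if (command == NULL || path == NULL ||
            position == NULL || length == NULL) {
            fprintf(stderr, "malformed request\n");
            return 1;
        }

        if (strcmp(command, "__get") == 0) {
            long offset   = atol(position); /* byte position of the match */
            int  matchlen = atoi(length);   /* characters to highlight */
            printf("open %s, seek to %ld, highlight %d characters\n",
                   path, offset, matchlen);
        }
        return 0;
    }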
When the search engine receives such a __get request, it does the following (a code sketch follows the list):
- opens the appropriate file
- reads into a buffer a number of characters before and after the match
- inserts some html code into this buffer which highlights the match and
places an html named anchor around it
- finally it adds html code to the start of the buffer which has a link
to the exact position of the match within the buffer, as well as a link to the
original document in case the user wishes to download the entire document.
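The following sketch gives the flavour of these steps. The buffer sizes, function names and markup details are ours rather than those of the engine, and a complete implementation would also escape any html special characters in the extracted text:

    #include <stdio.h>

    #define CONTEXT 200  /* characters of context either side of the match */

    /* Sketch of handling a __get request: read a window of text around
       the match, wrap the match in a named anchor and bold markup, and
       prefix the buffer with links to the match and to the original
       document. (A complete version would also escape <, > and & in the
       extracted text.) */
    static void serve_match(const char *path, long position, int length)
    {
        char before[CONTEXT + 1], match[256], after[CONTEXT + 1];
        long start;
        size_t nbefore, nmatch, nafter;
        FILE *in;

        if (length <= 0 || length >= (int)sizeof match)
            return;
        in = fopen(path, "rb");
        if (in == NULL)
            return;

        /* Read the text before the match, the match itself, and the text after. */
        start = position - CONTEXT;
        if (start < 0)
            start = 0;
        fseek(in, start, SEEK_SET);
        nbefore = fread(before, 1, (size_t)(position - start), in);
        before[nbefore] = '\0';
        nmatch = fread(match, 1, (size_t)length, in);
        match[nmatch] = '\0';
        nafter = fread(after, 1, CONTEXT, in);
        after[nafter] = '\0';
        fclose(in);

        printf("Content-type: text/html\r\n\r\n");
        /* Links to the exact position of the match and to the whole document. */
        printf("<p><a href=\"#match\">Go to match</a> | "
               "<a href=\"%s\">Retrieve the entire document</a></p>\n", path);
        /* The extracted text, with the match highlighted and anchored. */
        printf("<pre>%s<a name=\"match\"><b>%s</b></a>%s</pre>\n",
               before, match, after);
    }

    int main(void)
    {
        /* Example: the match on the word "man" at offset 18753 shown above. */
        serve_match("/Docs/Flatland.txt", 18753, 3);
        return 0;
    }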
The returned buffer appears as follows in a web client.


The creation of dynamic html documents, such as the last screen in our example, where the actual document containing the match is displayed, is straightforward to implement. However, doing so in a language such as C is not elegant. Six lines of Perl can often achieve the same result as sixty lines of C, and are easier to understand and faster to test and debug. Having written the whole system in C, it is our opinion that C should only be used to write code that is time critical. A language optimised for string handling, such as Perl, can dramatically reduce development time for those parts of the system which are string intensive and not time critical.
Developing code that functions cross platform can be trivial or extremely complex, depending on the following factors:
- what language is used? ANSI C or Perl works almost without change on
different platforms.
- are machine specific library routines used? Code to recursively scan a
directory tree for example is drastically different for the Macintosh, Windows
and Unix.
- does the server being used support the CGI standard, and how is it
implemented? Each of the operating systems mentioned above has servers that
implement the standard in slightly different ways. The most significant
difference is in the way that the message is passed to the CGI application.
On the Macintosh, this is done using a custom set of AppleEvents. Under
Windows it is primarily done by writing the parameters to a file, and under
Unix it is a combination of environment variables and standard input (the
Unix case is sketched below).
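As an indication of what the Unix case looks like, a CGI program typically obtains its parameters as sketched below; this is a generic illustration rather than code from our engine:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sketch: how a CGI program obtains its parameters under Unix.
       GET requests arrive in the QUERY_STRING environment variable;
       POST requests arrive on standard input, with the number of bytes
       given by CONTENT_LENGTH. */
    int main(void)
    {
        const char *method = getenv("REQUEST_METHOD");
        const char *query  = getenv("QUERY_STRING");

        printf("Content-type: text/plain\r\n\r\n");

        if (query != NULL)
            printf("Query string: %s\n", query);

        if (method != NULL && strcmp(method, "POST") == 0) {
            const char *lenstr = getenv("CONTENT_LENGTH");
            long len = (lenstr != NULL) ? atol(lenstr) : 0;
            char *body = malloc((size_t)len + 1);
            if (body != NULL && len > 0) {
                size_t got = fread(body, 1, (size_t)len, stdin);
                body[got] = '\0';
                printf("Posted form data: %s\n", body);
            }
            free(body);
        }
        return 0;
    }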
We are strongly in favour of using a cross platform language that has standard library calls to protect the software author from the low level implementation details of CGI calls, and that allows for reuse of modules and sharing of code between developers. Currently Perl is closest to achieving this standard.
[2] Common Gateway Interface, http://hoohoo.ncsa.uiuc.edu/cgi/
[3] Hypertext Markup Language, html, http://info.cern.ch/hypertext/WWW/MarkUp/MarkUp.html
[4] Universal Resource Locators, URLs, http://info.cern.ch/hypertext/WWW/Addressing/Addressing.html
[5] MacHTTP server system, http://www.biap.com/
[6] WAIS Information servers, http://www.wais.com/