Steve Ball, Department of Computer Science, Australian National University, ACTON, ACT 0200, Australia. Phone +61 6 249 5146 Fax: +61 6 249 0010 Steve.Ball@tcltk.anu.edu.au Home Page [HREF 1]
World Wide Web, Embed, Embeddable, Embedded, Components, Tcl, Tk, Tcl/Tk, SurfIt!, Plume, HTTP, HTML, CSS, XML
Many application developers wish to add the ability to access World Wide Web resources to their programs. This paper describes two general-purpose libraries that provide easy access to Web documents: a library for retrieving and handling Web documents and a library for displaying HTML and XML documents.
Figure 1: Typical Architecture Of Web Application
SurfIt! [HREF 2] is a World Wide Web browser which has been available for use by Internet users and developers since 1995. This browser may be used as a stand-alone, general-purpose Web browser, in the same manner as Netscape Navigator or Microsoft Internet Explorer. While the development of SurfIt! as a browser has continued since its first release, many application developers who have used the browser have expressed much interest in using components of it to add Web access to their own applications. These are developers who do not wish to implement the user interface of their Web applications using a standard Web browser. Instead their applications will present a customised user interface within which Web documents will be presented to the user in some fashion. Examples of applications which may wish to present a customised user interface include Internet-based electronic games, specialised Intranet applications and applications using online HTML documents for their help subsystem. Developers of these applications require a library which will assist them in gaining access to Web resources, and to display Web documents as necessary. A typical architecture for these types of applications is shown in Figure 2.
Figure 2: Typical Architecture Of A Stand-Alone Web Application
There is no reason why developers could not write their own code for accessing remote servers using HTTP and for displaying HTML documents. A minimal HTTP client package has been written in approximately 100 lines of Tcl script code. Similarly, Uhler's HTML library [HREF 7] contains a 10-line HTML parser. However, the burden is then upon the application developer to ensure correct operation of the protocol handler and adherence to the protocol specification. Also, layered functionality, such as document caching, quickly adds to "code bloat". Uhler's complete HTML display library, incorporating document display code, is actually over 1000 lines of Tcl script code.
With this requirement in mind, the implementation of the next release of the SurfIt! browser, which has been renamed "Plume", has been undertaken with a view to releasing component libraries separately. The development of a subroutine library to provide functions for performing elementary Web-related tasks is not new: CERN's libwww C library [HREF 3] was first developed circa 1990. However, that library is not easy to incorporate into an application. Ease of incorporation is the major design goal of the component libraries provided by Plume.
In order for a program to be able to manipulate World Wide Web documents it is first necessary to be able to retrieve a document's data into the program's address space. This is the basic purpose of the Document Handling Package (DHP). However, the task of manipulating a document is not as simple as presenting the data to the application.
The Document Handling Package provides an extensible framework for applications to perform all of the necessary processing on a document. It hides the complexity of basic document handling, such as accessing proxy servers and document caching, and allows the application to process documents at a high level according to the document's media type.
Accordingly, the DHP needs to provide the following functions:
document. Loading a document is as
simple as issuing the Tcl command:
document loaduri URL
The document command has been created to deal with WWW documents, but
is not limited to use with the
World Wide Web. Although all documents are referred to by their URL,
a non-Web application can refer to
local files using the file: scheme.
When a document is loaded with the loaduri method shown above, a scheme handler is invoked to manage the transfer of data according to the protocol specified by the document's URL. An interface, the document scheme command, is provided to allow the application to extend the package with new scheme handlers. Handlers for the file: and http: schemes are built in, with the http: scheme handler supporting HTTP/1.1 [HREF 5] client access.
Requests for the loading of documents are placed in a queue. There are three queues: high, normal and low priority. Document loads are scheduled in priority order when the necessary resources become available. Resources may be restricted in various ways: for example, there may be a limit on the number of open channels allowed, or loads from a particular document server may be pipelined over a persistent channel. The prioritising of document fetching allows the application to favour certain categories of documents over others. For example, downloading an image map or an applet is more important than downloading the document background.
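The scheduling behaviour described above can be sketched in a few lines of plain Tcl. This is only an illustration under stated assumptions, not the package's implementation: the sched namespace, the procedure names queueLoad, startNext and finishLoad, and the limit of two concurrent loads are all hypothetical, and "starting" a load is reduced to recording its URL.

```tcl
# Hypothetical sketch: none of these names belong to the Document Handling Package.
namespace eval sched {
    variable queues
    array set queues {high {} normal {} low {}}
    variable active 0      ;# loads currently in progress
    variable maxActive 2   ;# assumed limit on open channels
    variable started {}    ;# order in which loads were started
}

# Place a request on one of the three queues, then try to start work.
proc sched::queueLoad {url {priority normal}} {
    variable queues
    if {![info exists queues($priority)]} {
        error "unknown priority \"$priority\""
    }
    lappend queues($priority) $url
    startNext
}

# Start queued loads in priority order while resources allow.
proc sched::startNext {} {
    variable queues
    variable active
    variable maxActive
    variable started
    foreach priority {high normal low} {
        while {$active < $maxActive && [llength $queues($priority)] > 0} {
            set url [lindex $queues($priority) 0]
            set queues($priority) [lrange $queues($priority) 1 end]
            incr active
            lappend started $url   ;# a real scheduler would open a channel here
        }
    }
}

# Called when a transfer completes; frees a slot for the next request.
proc sched::finishLoad {} {
    variable active
    incr active -1
    startNext
}
```

With both slots busy, a request queued later at high priority is started before an earlier low priority request as soon as a slot becomes free.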
The system distinguishes between document load requests and the actual retrieval of the document data. This allows a document to be "loaded" concurrently in more than one request, but for the data to be fetched only once. For example, a HTML document could be displayed to the user, who might then wish to save the document to a file before all of the document's data has been received. This request pattern is handled transparently by the Document Handling Package.
Figure 3: DHP Document Transfer
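One way to separate load requests from data retrieval is to keep a list of consumers per URL and to start a transfer only for the first request. The sketch below is an assumption about how such bookkeeping could look, not the package's code: the coalesce namespace and the request and deliver procedures are hypothetical, and the network transfer is replaced by explicit calls to deliver.

```tcl
# Hypothetical sketch; request, deliver and the fetch log are illustrative names.
namespace eval coalesce {
    variable consumers   ;# array: URL -> list of command prefixes awaiting data
    variable received    ;# array: URL -> data received so far
    variable fetches {}  ;# URLs for which a transfer was actually started
}

# Register interest in a document. Only the first request starts a transfer;
# a later request is first brought up to date with the data received so far.
proc coalesce::request {url consumer} {
    variable consumers
    variable received
    variable fetches
    if {![info exists consumers($url)]} {
        set consumers($url) {}
        set received($url) ""
        lappend fetches $url   ;# a real package would open the connection here
    } elseif {$received($url) ne ""} {
        uplevel #0 [linsert $consumer end $received($url)]
    }
    lappend consumers($url) $consumer
}

# Called as each block of data arrives; every attached consumer sees it.
proc coalesce::deliver {url data} {
    variable consumers
    variable received
    append received($url) $data
    foreach consumer $consumers($url) {
        uplevel #0 [linsert $consumer end $data]
    }
}
```

A consumer that displays the document and a consumer that saves it can both attach to the same URL; the second receives the data already transferred and then shares the remaining blocks.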
When the application makes a document request it must specify the purpose of the request: is the document to be saved to a file, or displayed to the user? This is accomplished by giving a -target option to the document loaduri command. The value for -target may be file, variable or auto (for automatic media handlers, see below). In addition, the option -targetid gives any necessary further information, such as the filename for a -target file argument, or the window name for a media handler which displays a document graphically. Following are some examples of how documents can be loaded for different purposes:
document loaduri uri -target variable -targetid myDoc
document loaduri uri -target file -targetid /home/user/web/myDoc
document loaduri uri -target auto -targetid .app.www
The first command stores the document data in the Tcl variable myDoc.
The second command stores the document data in the given file.
The last command passes the data to an automatic media handler, and
requests
that the handler displays the document in the Tk window .app.www.
These options can be abbreviated. If the value for the -target option begins with "." then it is assumed to be a Tk window name and the document is passed to an automatic media handler. If the value for the -target option starts with a directory separator ("/" for Unix, "/" or "\" for Windows and ":" for Macintosh) then it is assumed to be a filename and the document data is copied into that file. Hence the examples from above may be shortened to:
document loaduri uri -target /home/user/web/myDoc
document loaduri uri -target .app.www
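The abbreviation rules amount to a test on the first character of the -target value. The following sketch shows that test in isolation; expandTarget is a hypothetical helper written for illustration, not a command provided by the package, and its platform argument is an assumption made so that the separator rules can be exercised on any system.

```tcl
# Hypothetical helper, not part of the package: expands an abbreviated
# -target value into an explicit {target targetid} pair.
proc expandTarget {value {platform unix}} {
    # Directory separators per platform, as listed in the text.
    switch -- $platform {
        unix      {set separators {/}}
        windows   {set separators {/ \\}}
        macintosh {set separators {:}}
        default   {error "unknown platform \"$platform\""}
    }
    set first [string index $value 0]
    if {$first eq "."} {
        # Tk window name: hand the document to an automatic media handler.
        return [list auto $value]
    }
    if {[lsearch -exact $separators $first] >= 0} {
        # Leading directory separator: treat the value as a filename.
        return [list file $value]
    }
    # Anything else is an unabbreviated target with no implied targetid.
    return [list $value {}]
}
```

For instance, expandTarget .app.www yields the pair auto .app.www, and expandTarget /home/user/web/myDoc yields file /home/user/web/myDoc.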
In addition, the application may specify a -command option which gives a Tcl script to be evaluated in the same manner as an automatic media handler (see below). This option allows the application to perform customised processing of document data, or to "eavesdrop" on a data transfer. This may be especially useful when the media handler is supplied by a third party.
document type handler command. When a
document is loaded with an auto target, the handler which
is registered to accept that document's media type is invoked
and given the document's data. Certain handlers may declare
that the value given by the -targetid argument is the Tcl
command to invoke to process the document data. This is usually
how Tk (mega-)widgets that display documents to the user are configured; see below.
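A registry that maps media types to handlers can be sketched with glob-style matching. Everything here is an assumption made for illustration: the mediatypes namespace, the procedures registerHandler and findHandler, and in particular the rule that the pattern with the fewest wildcards wins. The text does not say how the package chooses between several matching handlers.

```tcl
# Hypothetical registry; registerHandler and findHandler are illustrative names.
namespace eval mediatypes {
    variable handlers {}   ;# list of {pattern command} pairs, in registration order
}

proc mediatypes::registerHandler {pattern command} {
    variable handlers
    lappend handlers [list $pattern $command]
}

# Return the command for the most specific pattern matching the media type.
# Specificity is approximated by counting "*" wildcards: fewer is better.
# An empty string is returned when nothing matches.
proc mediatypes::findHandler {mediaType} {
    variable handlers
    set best ""
    set bestStars 3
    foreach pair $handlers {
        foreach {pattern command} $pair break
        if {![string match -nocase $pattern $mediaType]} continue
        set stars [regexp -all {\*} $pattern]
        if {$stars < $bestStars} {
            set best $command
            set bestStars $stars
        }
    }
    return $best
}
```

Registering a handler for */* then acts as a fallback that is used only when no more specific pattern, such as image/* or text/html, matches.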
In order to make creating automatic media handlers easier and to provide a flexible interface, the Document Handling Package defines an interface to media handlers that uses a method familiar to Tcl programmers. The media handler is evaluated at "interesting" stages of the document load process, and the command has certain arguments appended to it before being passed to the Tcl interpreter for evaluation. The scheme handler defines which stages of the load process are "interesting". The following arguments may be appended to the media handler command, along with arguments allowing access to document and load meta-data.
document type handler */* ;# accept all media types
proc watchLoad {event args} {
    switch -- $event {
        end {
            # Variable "myVar" now has document data
        }
        default {
            # Could act upon other events too
        }
    }
}
document loaduri uri -target variable -targetid myVar -command watchLoad
The following commands will be executed as the data transfer occurs:
watchLoad begin docstate loadstate
watchLoad progress docstate loadstate {Connected to server}
watchLoad data docstate loadstate <>
watchLoad data docstate loadstate <>
watchLoad end docstate loadstate
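The sequence of calls above follows the usual Tcl convention of appending arguments to a command prefix and evaluating the result at global level. The sketch below reproduces that convention without the package: notify and simulateLoad are hypothetical stand-ins for a scheme handler, and docstate and loadstate are passed as literal placeholder strings because the structure of the real state arguments is not described here.

```tcl
# Hypothetical driver code: notify and simulateLoad are not package commands.

# Append the event name, placeholder state tokens and any extra arguments
# to the application's command prefix, then evaluate it at global level.
proc notify {command event args} {
    uplevel #0 [concat $command [list $event docstate loadstate] $args]
}

# Issue the same sequence of events as a completed transfer would.
proc simulateLoad {command chunks} {
    notify $command begin
    notify $command progress {Connected to server}
    foreach chunk $chunks {
        notify $command data $chunk
    }
    notify $command end
}

# Application callback: collect data events and note when the load ends.
# Progress and any other events fall through to the empty default branch.
proc watchLoad {event docstate loadstate args} {
    global body finished
    switch -- $event {
        begin   { set body ""; set finished 0 }
        data    { append body [lindex $args 0] }
        end     { set finished 1 }
        default {}
    }
}

simulateLoad watchLoad [list "first block, " "second block"]
```

Because the callback is a command prefix rather than a fixed procedure name, the application can bind extra leading arguments of its own, for example a widget path, before the event arguments are appended.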
A design goal of this library, as with the Document Handling Package, is to provide an easy-to-use system that is highly flexible and customisable. Another goal is to be able to seamlessly interface the WWW megawidget with the Document Handling Package.
When a HTML megawidget is created a widget command is also created for the application to control the widget. For example, the Tcl command:
html .app.www.html
creates a new Tcl command .app.www.html in addition to creating the widget itself. The widget command supports the common Tk widget methods, such as the configure method to change the widget's configuration options. It also has a number of methods to control the content of the widget: the HTML document.
The widget command provides a HTML element level interface to the HTML
document. The application uses the widget command's element method
to retrieve or modify the elements at
run-time. For example, to get the HTML text for the entire document,
the application would issue the Tcl command:
.app.www.html element get html
To get only elements that are in the document's <HEAD> section,
the application would use the command:
.app.www.html element get head
HTML:parse procedure. The parser generates a Tcl script which may be
evaluated to cause procedures to be invoked to process the
document, typically to display it to the user.
This script may also be regarded as a parse tree for the
HTML document.
There are three tables used by the parser to derive the document's structure. The first lists whether an element is a "container" element or an empty element. The second lists the content model for each container element. These two tables are derived directly from the HTML 3.2 DTD. The third describes how to imply the existence of elements, given the context within which a start tag appears. This last table would not be necessary if all Web documents conformed strictly to the HTML DTD, but HTML allows tags to be omitted where they are easily derived. For example, many Web documents omit the <HTML>, <HEAD> and <BODY> elements. The parse tree returned by the parser allows the application to manipulate the HTML document without having to be concerned with implied end tags, and so on.
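The idea of implying omitted elements from a table can be illustrated with a much reduced example. The sketch below is not Plume's parser and does not use its tables: the imply namespace, the parentOf table (which covers only a handful of elements) and the startTags procedure are hypothetical, and only implied start tags are handled, not implied end tags.

```tcl
# Hypothetical table covering only a few elements: each entry names the
# container inside which the element is expected to appear.
namespace eval imply {
    variable parentOf
    array set parentOf {
        head  html
        body  html
        title head
        p     body
        h1    body
    }
}

# Return the start tags to open, implied ones first, so that "tag"
# appears in a valid context given the list of currently open elements.
proc imply::startTags {tag openStack} {
    variable parentOf
    set toOpen [list $tag]
    set current $tag
    while {[info exists parentOf($current)]} {
        set parent $parentOf($current)
        if {[lsearch -exact $openStack $parent] >= 0} break
        set toOpen [linsert $toOpen 0 $parent]
        set current $parent
    }
    return $toOpen
}
```

A document that begins directly with a TITLE start tag therefore has HTML and HEAD opened on its behalf, while a P start tag met when only HTML is open causes BODY to be implied.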
Because Plume's HTML parser is completely table-driven, it is straightforward to define new SGML elements. Some application developers find this feature attractive because it allows them to display arbitrary SGML documents, rather than having to create or generate HTML documents. This property is currently being exploited to develop support for the display of XML [HREF 8] documents.
All aspects of document presentation are controlled by Cascading Style Sheets (CSS). As with the HTML parser's DTD representation, the CSS implementation is table-driven. This approach has the advantage of allowing new CSS properties to be defined. When a stylesheet is loaded, it is parsed and a table is created which is used during the display process. Cascaded stylesheets may be subsequently loaded, and their tables are merged together to form a final display table.
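Merging stylesheet tables can be pictured as a property-by-property override. The sketch below is a simplification under stated assumptions: the table layout (selector mapped to a list of property and value pairs) and the mergeStyles procedure are hypothetical, the rule that a later stylesheet always wins ignores the origin, specificity and importance rules of the full CSS cascade, and the dict command requires Tcl 8.5 or later.

```tcl
# Hypothetical display tables: selector -> {property value ...}.
# Properties from "override" replace those in "base"; all others are kept.
proc mergeStyles {base override} {
    set merged $base
    dict for {selector props} $override {
        if {[dict exists $merged $selector]} {
            dict set merged $selector \
                [dict merge [dict get $merged $selector] $props]
        } else {
            dict set merged $selector $props
        }
    }
    return $merged
}

# A default stylesheet followed by a stylesheet loaded later.
set defaults {h1 {font-size 24pt color black} p {color black}}
set author   {h1 {color navy} em {font-style italic}}
set display  [mergeStyles $defaults $author]
```

In the merged table the h1 entry keeps its font-size from the first stylesheet and takes its color from the second, and the em entry is added unchanged.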
text/html
documents, as well as image/gif, image/x-portable-pixmap and
image/x-bitmap documents (these are the image formats that Tk can
display, and there are extensions which allow JPEG and TIFF image
formats to be displayed). It provides the load method for handling
document events. All that remains for the application programmer to do is to
connect the two systems together. This is done with the commands:
www .www ;# Creates a WWW megawidget called .www
.www configure -loadcommand {document loaduri -target {.www load}}
The WWW megawidget's -loadcommand script is invoked
whenever the megawidget requires a document to be loaded, for example
when a hypertext anchor is activated. The value given for this option
specifies the widget itself as the target of a document load.
The Document Handling Package provides high-level management of Web documents, including their retrieval from remote document servers, local in-memory or on-disk caching and media type dependent processing.
The WWW Megawidget provides several utilities for parsing and displaying HTML documents. It also provides overall management of the loading and displaying of HTML documents.
Steve Ball ©, 1997. The author assigns to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The author also grants a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers, and for the document to be published on mirrors on the World Wide Web. Any other usage is prohibited without the express permission of the author.