Currently, a hypermedia project can be considered large if it must deal with gigabytes of textual information, gigabytes of graphical material and/or terabytes of digitised video. The requirements for high performance differ depending on whether you are considering the initial development of the hyperbase or the deployment and the ongoing use of the system. Some elements of hypermedia, such as video, are computationally intensive through all phases of development and use.
Another major component of a hypermedia system is its textbase. The efficient retrieval of text from large textbases, and even the analysis of text - typical end-use functions - have been achievable for some time on a range of systems (although high performance implementations provide significant speed improvements and open other application areas to treatment). However, recent experience demonstrates that significant use of high performance computing resources is required even for the textual components of a hypermedia system, especially during the initial development.
However, by far the most important high performance computing aspect of developing large hyperbases arises as a result of adopting effective strategies for the management of hyperlinks.
The PArliament Sound Text and IMage Environment Project is a two-year project that aims to demonstrate the applicability of advanced computational techniques in the area of hypermedia by building a demonstration hyperbase based on the text, video, sound and image data of the Australian Parliament. In the process, the project also aims to develop an extensive toolkit that can be used to build similar-sized systems for other datasets.
For a detailed description of the PASTIME Project, see the paper Hypermedia in the Australian Parliament [HREF 2]. Other details on the project can be found at the PASTIME home page [HREF 3].
Shortly, the Australian Parliament will provide Internet access to some of its data holdings via the Parliament Home Page [HREF 4]. Technology from the PASTIME Project has been used to assist in providing this public access, but it should be noted that not all of the material or functions described in this paper or in Hypermedia in the Australian Parliament are publicly available. In particular, no video or audio material is available, the textual material is a small subset of the holdings we have been experimenting with in the project, no postscript images of the Hansards are available, and no WAIS-compliant free text searching is available. However, where possible, screen images of what these capabilities look like have been provided in the paper Hypermedia in the Australian Parliament.
In order to minimise the extent of human involvement, software tools are needed to automatically (i) acquire data, either through capture of new data or conversion of existing data; (ii) generate indexes (either free-text indexes or HTML index pages); and (iii) identify and establish hyperlinks.
Data Acquisition
Like Parliament, most large organisations intending to build a hyperbase already have significant amounts of data that need to be converted to the appropriate hyperbase formats (e.g., HTML for text, GIF or JPEG for images, MPEG for video, and so on). The sheer volume involved dictates the need for automatic conversion tools.
While many conversion tools exist (e.g., rtftohtml), these tools often produce a very rudimentary translation, and can not address the frequent need to restructure or partition monolithic datasets into smaller, logical units of retrieval. For example, the primary data source for Hansard data is a single Wordperfect file per day per chamber, which may represent several hundred pages of the printed Hansard. However, each day's Hansard can be viewed as a sequence of distinct reports on individual Parliamentary activities - e.g., the presentation of a petition, the making of a speech by an individual Member or Senator in relation to proposed legislation, the asking of a question in question time and the provision of a response - and it is these distinct text segments that are the proper atomic pieces of information in the hyperbase.
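To make the segmentation step concrete, the sketch below splits one day's exported text into one HTML file per item. The header patterns and file-naming scheme are purely illustrative (the real Hansard headings are considerably more varied), and Perl is used for brevity; as discussed below, the production converters are built with Flex.

    #!/usr/bin/perl -w
    # Illustrative segmentation sketch: split one day's exported Hansard text into
    # atomic HTML items, one file per speech, question or petition.
    # The header patterns and file names are hypothetical.
    use strict;

    my $n = 0;
    my $out;

    sub close_item { if ($out) { print $out "</body></html>\n"; close $out; } }

    while (my $line = <>) {
        # Hypothetical item header, e.g. "SPEECH  Mr KEATING (Prime Minister)"
        if ($line =~ /^(SPEECH|QUESTION|PETITION)\s+(.+?)\s*$/) {
            my ($type, $who) = ($1, $2);
            close_item();
            $n++;
            open($out, '>', sprintf("item%04d.html", $n)) or die "open failed: $!";
            print $out "<html><head><title>$type: $who</title></head><body>\n";
            print $out "<h1>$type</h1>\n<p><b>$who</b></p>\n";
            next;
        }
        next unless $out;                  # skip any preamble before the first item
        chomp $line;
        if ($line =~ /\S/) { print $out "$line\n"; }
        else               { print $out "<p>\n"; }    # blank line starts a new paragraph
    }
    close_item();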
Another problem with the standard conversion tools is that they cannot be used to discover and extract relational information from the source dataset - document type, dates, titles and other headings, page numbers, speaker information, and so on - which is needed to index each document and to identify the subcollection to which it belongs.
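A converter of the same shape can emit this relational information as it goes. The minimal sketch below, using the same hypothetical item headers as above, writes one tab-separated record per item for later use in indexing:

    #!/usr/bin/perl -w
    # Illustrative metadata extraction: scan a day's exported text and emit one
    # tab-separated record per item (type, speaker, title, page) for use in
    # index generation.  The header and field patterns are hypothetical.
    use strict;

    my ($type, $speaker, $title, $page);

    sub flush { print join("\t", $type, $speaker, $title, $page), "\n" if defined $type; }

    while (my $line = <>) {
        if ($line =~ /^(SPEECH|QUESTION|PETITION)\s+(.+?)\s*$/) {
            my ($t, $s) = ($1, $2);
            flush();                                  # finish the previous item's record
            ($type, $speaker, $title, $page) = ($t, $s, '', '');
        }
        elsif ($line =~ /^TITLE:\s*(.+?)\s*$/) { $title = $1; }
        elsif ($line =~ /^\[PAGE\s+(\d+)\]/)   { $page  = $1; }
    }
    flush();                                          # finish the last item's record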
The analysis, classification and partitioning of data into atomic pieces, and then into related subcollections, is quite amenable to classic data analysis techniques, and these tasks can then be realised via lexical tools such as Flex. However, deriving the pattern-set for recognising source data components and the action-set for generating target data objects is an iterative, empirical process. Typically, the system designer devises a set of source patterns and corresponding translation actions for the dataset, and then tests that the patterns have produced the correct conversions. In the case of the Parliamentary textual material, which comprises approximately 2 gigabytes of text, a run across the source data takes approximately 8 hours on a high-powered Sun 10 workstation - an overnight run. However, for complex datasets, it may take tens or even hundreds of test runs to perfect the pattern and action sets. The situation is exacerbated by the fact that the designer may need only tens of minutes to devise each new set of patterns and conversion actions for each test run - but must wait a day before getting the test results of any refinements and corrections. Once the set of patterns and conversion actions has stabilised, it will then take only one run, a matter of hours, to convert the entire existing source dataset to the required hyperbase format; but it may have taken many months to reach this stage, during which only a few weeks of the designer's time have been effectively utilised.
The bottleneck here is clearly the speed of the lexical tools and the hardware on which they run. From an algorithmic perspective, the lexical tool Flex is hard to fault in its runtime performance, which leaves only the hardware. Among single-processor systems, Sun 10 workstations are highly efficient (especially with 240 megabytes of RAM and high-speed, high-volume disks). The task therefore has a clear high-performance computing component, and for this reason we are porting our conversion tools to run on ANU's massively parallel AP1000 computer [HREF 5] (which has 2 gigabytes of RAM, 128 parallel processors, 32 disks, and can sustain a disk I/O rate between the processors and the disks of approximately 50 megabytes per second).
While this work has yet to be completed, projections based on the disk and processor performance of the AP1000, and histograms of the runtime behaviour of the Flex-generated conversion routines on the Sun 10, indicate that the time required for a test run across 2 gigabytes of data should be reduced from 8 hours to about 10 minutes. Consequently, the time taken to convert significant volumes of source data to the appropriate hyperbase format should drop from the order of a year to a couple of weeks.
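The AP1000 port itself is not reproduced here, but one natural decomposition of the conversion workload is simple data parallelism over the per-day source files. The workstation-level sketch below illustrates that idea only; the converter name, paths and worker count are hypothetical.

    #!/usr/bin/perl -w
    # Illustration of data-parallel conversion: each worker process runs a
    # (hypothetical) Flex-generated converter over its own share of the per-day
    # source files.  Worker count, paths and converter name are placeholders.
    use strict;

    my @days    = glob "source/*.txt";     # one exported file per sitting day per chamber
    my $workers = 4;

    for my $w (0 .. $workers - 1) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        next if $pid;                      # parent: launch the next worker
        for my $i (0 .. $#days) {
            next unless $i % $workers == $w;               # round-robin assignment of days
            system("./convert < $days[$i] > converted/day$i.html") == 0
                or warn "conversion failed on $days[$i]\n";
        }
        exit 0;                            # this worker has finished its share
    }
    1 while wait() != -1;                  # parent waits for all workers to finish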
Index Generation
It is rare for a large dataset to form a single homogeneous collection. Rather, large datasets tend to give rise to a set of heterogeneous subcollections, perhaps hierarchically related. These subcollections need to be indexed; for example, there is an index page for the tracks in this conference, each item of which is a link to an index page for the papers in that track, each item of which is in turn a link to the particular paper.
If these index pages are manually created, then as hyperbase items are added or removed the index pages will need to be manually edited. In the case of large hyperbases, this process is extremely time consuming and prone to error. Consequently, hyperbase index pages should be automatically generated and dynamically maintained. This is not complicated - the use of Perl, Flex and CGI scripts for this purpose is well understood: at the time an index page is requested from the hyperbase server, a tailor-made Perl or Flex routine is called via the CGI to dynamically generate the index page (it should be noted that while Perl scripts are efficient for small tasks, Flex provides a much more runtime-efficient solution for complex indexing tasks).
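As a sketch of this approach (the directory layout and titles are invented for illustration), a Perl CGI routine of the following shape regenerates a subcollection's index page on each request:

    #!/usr/bin/perl -w
    # Illustrative CGI index generator: list every item in one subcollection
    # directory as a link, so the index page is rebuilt on each request rather
    # than edited by hand.  The directory path is hypothetical.
    use strict;

    my $dir = "/hyperbase/hansard/reps/1994-11-08";
    opendir(my $dh, $dir) or die "Cannot read $dir: $!";
    my @items = sort grep { /\.html$/ } readdir($dh);
    closedir($dh);

    print "Content-type: text/html\n\n";
    print "<html><head><title>Index of $dir</title></head><body>\n";
    print "<h1>Items in this subcollection</h1>\n<ul>\n";
    for my $f (@items) {
        (my $title = $f) =~ s/\.html$//;   # a real routine would use the item's stored title
        print qq(<li><a href="$f">$title</a>\n);
    }
    print "</ul>\n</body></html>\n";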
However, as with the task of data conversion, the design and testing of scripts is an iterative and empirical matter with respect to a given dataset. Similarly, the use of high performance computing systems can significantly reduce the elapsed time required to devise effective scripts.
All text items in the hyperbase are also indexed via a freeWAIS-sf database. The main reason for this is to enable the effective handling of concept-based hyperlinks (for which, see the discussion below). The time required to build a text index for 2 gigabytes of data is approximately 30 days on a Sun 10. This is not an excessive time if the text index is built once, after the conversion, segmentation, and index-page generation routines have been devised.
However, in order to test the effectiveness of Flex scripts, it is often useful to have a free text database of the atomic articles generated. For this reason, we are using the free text database software written for the AP1000 and reported on most recently in (Hawking and Thistlewaite 1994). This provides the capability to load and search the entire textbase within tens of minutes.
Problems with Identification and Establishment of Hyperlinks
A defining feature of hyperbases is their hyperlinks. Hyperlinks are rarely components of single documents; rather, they relate one document or item of information to another, separate item, and as such are essentially relational. Consequently, as the contents of the hyperbase change, the validity of the links also changes.
With HTML-based hyperbases, links are embedded or anchored in a document. This is problematic for a number of reasons: (i) if an item I is removed from the hyperbase, then all other items that make links to I need to be identified and edited to remove those links; (ii) if an item is added to the hyperbase, then all other items that have potential links to it need to be identified and edited to add the links; and (iii) read-only documents (such as those on CD-ROM) cannot be augmented with hyperlinks anchored in them.
Other systems, such as Hyper-G [HREF 6], provide for links to be kept separately from the document (which addresses problem (iii)), and unlike HTML they support bi-directional links, which make it easier to identify the link sources (which addresses problem (i)). However, managing the link information is still difficult, and problem (ii) remains. Moreover, it is now clear that HTML dominates the hypermedia user community, and that commercially viable hyperbase systems must live with the realities of HTML and wait for the HTML definition to be improved (a process that is beyond the control of individual hyperbase system designers).
As with other aspects of large hyperbases, manual identification, establishment and maintenance of hyperlinks is too time-consuming and too prone to error, and software tools are needed to perform these tasks quickly, reliably and automatically.
When items are added to the hyperbase (or converted in batch to establish the initial hyperbase), only HTML markup for formatting the document is generated - no anchors are embedded. This markup is a property of the document alone - it is independent of the other contents of the hyperbase. Associated with each subcollection of items in the hyperbase is a set of patterns. When a client requests a document, a Flex script is called via the CGI, which then embeds anchors in any expressions that match the patterns.
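A minimal rendering of this serving step, assuming a small hand-written pattern table (the phrases and target URLs below are invented for illustration; the project's filters are generated with Flex for speed), is:

    #!/usr/bin/perl -w
    # Illustrative serve-time link embedding: the stored document carries
    # formatting markup only; anchors are added as it is streamed to the client.
    # The pattern table and target URLs are hypothetical.
    use strict;

    my %links = (
        'Native Title Bill 1994' => '/bills/native-title-1994.html',
        'Mr Keating'             => '/members/keating.html',
    );
    # Longest phrases first, so a longer name is preferred over any phrase it contains.
    my $alt = join '|', map { quotemeta } sort { length $b <=> length $a } keys %links;

    print "Content-type: text/html\n\n";
    while (my $line = <>) {
        $line =~ s{($alt)}{<a href="$links{$1}">$1</a>}g;
        print $line;
    }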
The first point to note about this strategy is that it provides a solution to all three problems mentioned in the previous section. Basic HTML documents in the hyperbase do not need editing, manually or automatically, to add or remove hyperlinks, and read-only documents can be supported.
The second point is that this solution is fully compliant with the existing HTML standard.
Other points to note are that (i) the currency of hyperlinks is not dependent on maintaining a separate database of link information, and accurately reflects the validity of links at the time the document is served by the hyperbase server; (ii) a consistent user-interaction paradigm is maintained; and (iii) the runtime efficiency of the pattern matching is very high, and requires no special hardware.
A final point to note is that the tools we are developing for embedding links can optionally respect or override HTML anchors already present in the base HTML document. That is, the base HTML document may itself contain hyperlinks; these can either be left unaltered, with their anchoring expressions essentially invisible to the pattern-matching functions, or made visible to the matching functions, in which case such links can be removed or modified by the matching functions.
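One way to realise the option of respecting existing anchors, again with an invented pattern table, is to split the document on its existing <a>...</a> elements and apply the link patterns only to the text between them:

    #!/usr/bin/perl -w
    # Illustrative "respect existing anchors" filter: text inside anchors already
    # present in the base document is left untouched; the (hypothetical) link
    # patterns are applied only to the plain text between them.
    use strict;

    my %links = ('Native Title Bill 1994' => '/bills/native-title-1994.html');
    my $alt   = join '|', map { quotemeta } keys %links;

    my $html  = do { local $/; <> };               # slurp the whole document
    my @parts = split m{(<a\b.*?</a>)}is, $html;   # odd-numbered parts: existing anchors
    for (my $i = 0; $i < @parts; $i += 2) {        # even-numbered parts: plain text
        $parts[$i] =~ s{($alt)}{<a href="$links{$1}">$1</a>}g;
    }
    print "Content-type: text/html\n\n", join('', @parts);

Making the existing anchors visible to the matching functions instead is simply a matter of including the odd-numbered parts in the substitution loop.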
We are currently working on a version of these routines that can be activated as client-side functions via the CCI (in addition to the server-side functions). This will provide the capability for requesting clients to uniformly add, modify or remove hyperlinks from HTML documents provided by remote servers. For example, I may have private access to a very comprehensive hyperbase of world geographic information, and when I view any (appropriately matching) document I want all geographic place names to be linked to the items in my geographic hyperbase. Alternatively, I may know that a nearby server is a mirror site for a number of other sites that are much more expensive to reach over the network - I can then uniformly substitute all references to those remote sites with references to the closer mirror site.
However, discovering a reliable set of patterns is a computationally intensive task requiring high performance computing resources. Moreover, maximising the soundness and completeness of the pattern set is a non-trivial task.
It is also important to distinguish between different types of link: structural links, which relate one entire document or piece of information to another (e.g., an item to its index page, or to the next or previous item in the index, or to the corresponding video clip, or to the corresponding PostScript version of the original document) and which can be used to form hierarchies or sub-collections of documents; referential links, which map a word or phrase to the item that the word or phrase names (e.g., from "Native Title Bill 1994" in a Hansard article to the Bill itself); and intensional links, which link words or phrases to other objects on the basis of some a priori shared meaning or some a posteriori shared relationship.
These issues are discussed in another paper currently in preparation, although some discussion of them occurs in the paper Hypermedia in the Australian Parliament [HREF 2].