Using XSL and XQL for efficient, customised access to dictionary information

Kevin Jansz [HREF1], Department of Linguistics, University of Sydney, Australia. kjansz@sultry.arts.usyd.edu.au

Wee Jim Sng [HREF2], School of Applied Science, Nanyang Technological University, Singapore. jimemail@singnet.com.sg

Nitin Indurkhya [HREF3], School of Applied Science, Nanyang Technological University, Singapore. nitin@cs.usyd.edu.au

Christopher Manning [HREF4], Departments of Computer Science and Linguistics, Stanford University, USA. manning@cs.stanford.edu


Abstract

XML is highly suited to representing richly structured information such as dictionary content, and conversely this well-structured storage of the information enables innovative browsing interfaces. We demonstrated this previously with the development of Kirrkirr, a web-based application that allows users to interactively explore a dictionary of Warlpiri (a Central Australian language) stored in XML format. Two key design issues are customised presentation and efficient access. The greater the level of customisation, the broader the range of users Kirrkirr can accommodate. Efficient access is important if the application is to scale up to larger, more complex dictionaries. In this paper, we discuss these two issues and describe the use of XSL and XQL to further enhance Kirrkirr. While Kirrkirr already provides a number of interfaces to the lexical information, the challenge in creating new features and providing even greater flexibility lies in allowing the user to access certain parts of the XML database without additional memory and time overhead. With the indexing techniques of the original Kirrkirr enhanced, users may use XSL to personalise the information they access. Emerging technologies such as XQL offer the potential for efficient access not only to dictionary entries but also to the fields within the entries. Performance evaluation suggests that the use of XSL and XQL has had a very significant impact on Kirrkirr. The results can easily be seen to apply to a broad range of similar applications.


Introduction

Computational Lexicography

A language is more than individual words with a definition. It is a vast network of associations between words and within and across the concepts represented by words. The work described here is part of a broader project that has the general aim of providing people with a better understanding of this conceptual map. In particular, traditional paper dictionaries offer very limited ways for making such networks of meaning visible, whereas, on a computer, there are, in theory, no such limitations to the way information can be displayed. However, while dictionaries on computers are now commonplace, there has been little attempt to utilise the potential of the new medium. Most existing electronic dictionaries, whether on the web or on CDROM, present a plain, search-oriented representation of the paper version. In contrast, our goal has been to build fun dictionary tools that are effective for browsing and incidental language learning. Minimally, they should be as effective for browsing as the process of flicking through pages of a paper dictionary, but beyond that we aim to use the computer to provide new means of effective browsing.

For instance, we can show words grouped by synonyms (as in a thesaurus), antonyms, and many other types of linguistic categories rather than merely spelling. The key to the effectiveness of this type of browsing is that the user has control over the level of complexity in the relationships that are presented to them.

Initial focus: Warlpiri

The initial focus of our project was Warlpiri, an Australian Aboriginal language spoken in the Tanami desert northwest of Alice Springs. There were a number of factors influencing this choice:

Overview of Kirrkirr

Details of the first version of Kirrkirr, our Warlpiri dictionary browser, have been described before (Jansz 1998; Jansz, Manning and Indurkhya 1999a; Jansz, Manning and Indurkhya 1999b). The ideas in the design are quite general and applicable to other dictionaries as well. As an environment for the interactive exploration of dictionaries, it attempts to more fully utilise graphical interfaces, hypertext, multimedia, and different ways of indexing and accessing information.

Written in Java, it can either be run over the web (high bandwidth) or run locally (here Java's main advantage is cross-platform support). As shown in Figure 1, it is currently made up of five main modules:

[screen shot of Kirrkirr]

Figure 1: Kirrkirr

As reflected in the system modules, this application is unlike any other e-dictionary as it was designed to cater for the needs of Warlpiri speakers with various levels of competence. Features such as the searching facility allow information to be accessed easily and quickly, while incorporation of animation and sounds makes the dictionary usable by speakers with little to no background with the language.

Efficient Access of XML content

XDI: An index for XML-based Dictionaries

Storing the lexical database in an XML formatted file is an effective middle ground between the structure and built-in querying of a relational database and the flexibility and portability of a plain text document. The strengths of this approach were best appreciated in the development of the Kirrkirr dictionary browser.

Initial testing showed that if the program simply read in the entire XML file (about 10Mb of text) and stored it as parsed data structures in memory, memory usage was excessively high. A simple solution was to create an index file that contained the words, information about cross-references for the graphical display, and the file position of each word's entry in the XML file. As a result, only the index need reside in memory, and of the 9300 entries in the dictionary, only those requested by the user are read in and processed. The use of an index resulted in a significant performance improvement. This approach worked well with the XML parser being used (Microstar's Ælfred parser). Although the parser (like other XML parsers of which we are aware) was built to parse an entire file at a time, it was relatively easy to adapt the code to allow processing of just one entry. The use of an index file was also well suited to usage over the Web. Because the parser only processes parts of the XML file when they are required, the system is very efficient and can be used even over a low bandwidth connection. Once whole entries are parsed, they are kept temporarily in a memory cache that speeds up subsequent accesses to the same entry in the browsing session. The XDI scheme is described in Figure 2.

[XML Dictionary Indexing]

Figure 2: XDI: Indexing the XML lexical database for better information access
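The indexing idea can be illustrated with a minimal Java sketch. It is not the Kirrkirr implementation: the class name XdiIndexSketch and its methods are hypothetical, the cross-reference information held in the real index is omitted, and the sketch assumes a UTF-8 file in which every entry is written as <ENTRY>...</ENTRY> without attributes and contains an <HW> element. Index construction reads the file once (in practice this would be a separate, offline step); afterwards a lookup seeks directly to one entry.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a headword -> file position index over an XML dictionary. */
class XdiIndexSketch {
    private final Path xmlFile;
    // headword -> {byte offset of "<ENTRY>", byte length of the whole entry}
    private final Map<String, long[]> index = new HashMap<>();
    // entries already read during this session
    private final Map<String, String> cache = new HashMap<>();

    XdiIndexSketch(Path xmlFile) {
        this.xmlFile = xmlFile;
        try {
            byte[] bytes = Files.readAllBytes(xmlFile);
            // ISO-8859-1 maps one byte to one char, so string positions equal byte offsets.
            String raw = new String(bytes, StandardCharsets.ISO_8859_1);
            int start = raw.indexOf("<ENTRY>");
            while (start >= 0) {
                int end = raw.indexOf("</ENTRY>", start) + "</ENTRY>".length();
                int hwStart = raw.indexOf("<HW>", start) + "<HW>".length();
                int hwEnd = raw.indexOf("</HW>", hwStart);
                // Decode only the headword bytes as UTF-8 (assumed file encoding).
                String headword =
                        new String(bytes, hwStart, hwEnd - hwStart, StandardCharsets.UTF_8);
                index.put(headword, new long[] {start, end - start});
                start = raw.indexOf("<ENTRY>", end);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the XML text of one entry, reading only that entry from disk. */
    String entry(String headword) {
        String cached = cache.get(headword);
        if (cached != null) {
            return cached;
        }
        long[] position = index.get(headword);
        if (position == null) {
            return null; // headword not in the dictionary
        }
        try (RandomAccessFile file = new RandomAccessFile(xmlFile.toFile(), "r")) {
            byte[] buffer = new byte[(int) position[1]];
            file.seek(position[0]);
            file.readFully(buffer);
            String xml = new String(buffer, StandardCharsets.UTF_8);
            cache.put(headword, xml);
            return xml;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: writes the given XML to a temporary file and indexes it. */
    static XdiIndexSketch fromString(String xml) {
        try {
            Path file = Files.createTempFile("dictionary", ".xml");
            file.toFile().deleteOnExit();
            Files.write(file, xml.getBytes(StandardCharsets.UTF_8));
            return new XdiIndexSketch(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The string returned by entry() is a well-formed fragment for a single entry, so it could be handed to an XML parser on its own, which is the property the adapted parser relies on.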

The idea of using an index is well-known from database systems. By using an index, Kirrkirr implicitly recognises the dictionary as a database (albeit a richly structured one) and XDI can be seen as a customised index for such a database. One might well argue that if one is to view the dictionary as a database, then perhaps it is better to use off-the-shelf database products that have their own customised index structures for efficient access. However, this is a tradeoff with preserving the rich structure in the dictionary. By using XDI, Kirrkirr tries to strike a balance between the indexing capabilities of standard database systems and the expressiveness and portability of XML. The cost of a customised solution is that system complexity goes up, and one is precluded from taking advantage of alternative solutions to the general problem of XML information retrieval.

XQL: A standard query language for XML content

The Potential of XQL

XQL is a set of extensions to the Extensible Stylesheet Language (XSL) specification that allows developers using XML to easily execute powerful, complex queries on XML documents. Proposed to the W3C by representatives from Microsoft, Texcel and WebMethods in 1998, it competes with the SQL-oriented XML-QL, whose specification was submitted to the W3C by AT&T Labs. We will not discuss XML-QL further here, but only the XQL API underlying the new data access model of Kirrkirr.

Traditionally, structured queries have been used primarily for relational or object-oriented databases, and documents were queried with relatively unstructured full-text queries. Although sophisticated query engines for structured documents have existed for some time, they have not been a mainstream application. XML documents are structured documents - they blur the distinction between data and documents, allowing documents to be treated as data sources, and traditional data sources to be treated as documents. Some XML documents are nothing more than an ASCII representation of data that might traditionally have been stored in a database. Others are documents containing very little structure beyond the use of headers and tables. Kirrkirr is somewhere in between: an e-dictionary that has complex recursive structure, but also much relatively unstructured free text, and clearly needs effective query mechanisms for access.

Database developers have taken for granted the ability to execute queries on data stores for decades. However, XML being a young data technology, querying functionality had been very limited. XQL gives developers the querying functionality they have become used to in the database world, including the following:

The major differences between SQL and XQL are summarized in Table 1. It is clear that XQL is an invaluable tool with huge potential for accessing dictionary information stored in XML format.

SQL | XQL
The database is a set of tables. | The database is a set of one or more XML documents.
Queries are done in SQL, a query language that uses tables as a basic model. | Queries are done in XQL, a query language that uses the structure of XML as a basic model.
The FROM clause determines the tables which are examined by the query. | A query is given a set of input nodes from one or more documents, and examines those nodes and their descendants.
The result of a query is a table containing a set of rows. | The result of a query is a set of XML document nodes, which can be wrapped in a root node to create a well-formed XML document.

Table 1: Main differences between SQL and XQL

DOM (Document Object Model)

XQL implementations typically operate on a model of the XML document known as the DOM (Document Object Model). The DOM is a platform-independent, programming-language-neutral application programming interface (API) for HTML and XML documents. Its core outlines a family of types that represent all the objects that make up an XML document: elements, attributes, entity references, comments, textual data and processing instructions. With that, it defines the logical structure of documents and the way a document is accessed and manipulated. (DOM specifies how XML documents are represented as objects, so that they may be used in object oriented programs.)

Increasingly, XML is being used as a way of representing many different kinds of information that may be stored in diverse systems, and much of this would traditionally be seen as data rather than as documents. Nevertheless, XML presents this data as documents, and the DOM may be used to manage this data to allow programs to access and modify the content and the structure of XML documents from within applications. Anything found in an XML document can be accessed, changed, deleted, or added using the DOM, except for the XML internal and external subsets for which DOM interfaces have not yet been provided.

After the XML document has been parsed into a collection of objects conforming to DOM, the object model can be manipulated in any way that makes sense. This mechanism is also known as the "random access" protocol, as any part of the data can be visited at any time. The DOM usually resides in memory (it is the output of the XML parser), but it can also be stored on disk (to save on the time needed to parse the XML repeatedly) as a Persistent DOM (PDOM). When an XML document is large and not likely to change much, as is the case for dictionaries, using its PDOM representation can significantly speed up XQL querying.
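As a small illustration of this object model, the following Java sketch parses a document into an in-memory DOM using the standard JDK parser (not the PDOM discussed here), reports the kinds of node it finds, and allows any node to be visited or changed directly. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical sketch: an XML document as a tree of typed DOM nodes. */
class DomSketch {
    /** Parses XML text into an in-memory DOM tree. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Names the DOM node type of a node. */
    static String kind(Node node) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE: return "element";
            case Node.TEXT_NODE: return "text";
            case Node.COMMENT_NODE: return "comment";
            case Node.PROCESSING_INSTRUCTION_NODE: return "processing-instruction";
            default: return "other";
        }
    }

    /** Lists the node types of the direct children of a node, in document order. */
    static List<String> childKinds(Node parent) {
        List<String> kinds = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            kinds.add(kind(children.item(i)));
        }
        return kinds;
    }
}
```

Once parsed, a node such as the headword element can be reached through calls like getFirstChild() or getElementsByTagName("HW"), and its content changed with setTextContent(), without re-reading the source text.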

XML Path Language (XPath)

The primary purpose of XPath [HREF12] is to address parts of an XML document by operating on the abstract, logical structure of an XML document, rather than its surface syntax. XPath gets its name from its use of a path notation (as in URLs) for navigating through the hierarchical structure of an XML document. XPath is useful as it provides a common model and syntax to express the patterns required by queries and transformation patterns for XML documents. It is intended primarily as a component that can be used by other specifications and does not define any conformance criteria for independent implementations (of XPath). This is why XQL (compatible with XPath) is used in Kirrkirr instead of a specific XPath implementation.

Using XQL in Kirrkirr

The XQL search engine and the PDOM used for the new dictionary representation in Kirrkirr originated from a research project at GMD-IPSI [HREF9], the Institute for Integrated Publication and Information Systems of the German National Research Centre for Information Technology. The PDOM incorporates consolidated concepts from many years of leading edge research and development in the fields of federated databases and document management systems. Persistency is achieved by indexed, binary files. The XML document is parsed once and stored in binary form, accessible to DOM operations without the overhead of re-parsing it whenever information is required. The implementation uses a robust and efficient mix of indexing and query optimisation techniques. A cache architecture further boosts performance. This approach scales very well beyond the limitations of main memory. The PDOM is generated from the XML dictionary. Subsequently, Kirrkirr uses XQL to query the PDOM. Parsing of the XML need not be done repeatedly (it is only necessary when the dictionary changes) and access is faster.

The following is a simplified XML hierarchy of the dictionary.

[example DOM]

Figure 3: A sample DOM tree

The dictionary is a sequence of many entries, which include some subset of a large number of dictionary components, including a headword (HW element) and perhaps one or more pictures (IMAGE element).

To find an entry whose headword (<HW>) is 'jaja', the following query ([ ] is the filter clause, equivalent to the WHERE clause in SQL) may be used:

/DICTIONARY/ENTRY[HW='jaja']

Alternatively, if the PDOM index is known, say index = 9 for the word jaja, we can use the query:

/DICTIONARY/ENTRY[9]

The above queries are very slow to execute, and the time taken depends very much on the number of <ENTRY> nodes in <DICTIONARY>. This is bad news even for the 9300 entries of the current Warlpiri dictionary, and would be totally impractical for something like a large English dictionary, which might have 100,000 or more headwords.
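Readers without access to an XQL engine can try both query forms with the XPath implementation bundled with the JDK (javax.xml.xpath), since the two expressions are also valid XPath. The sketch below is a stand-in, not the GMD-IPSI XQL engine; the class name is hypothetical and the document is a tiny in-memory sample. Note that XPath 1.0 counts positions from 1, which may not match the indexing convention of a particular XQL engine.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical sketch: filter and positional queries using the JDK's XPath engine. */
class EntryQuerySketch {
    /** Evaluates an XPath expression against XML text and returns the matching nodes. */
    static NodeList query(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (NodeList) xpath.evaluate(
                    expression,
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException("query failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<DICTIONARY>"
                + "<ENTRY><HW>word1</HW></ENTRY>"
                + "<ENTRY><HW>jaja</HW></ENTRY>"
                + "</DICTIONARY>";
        // Filter clause: every ENTRY is tested until the headword matches.
        System.out.println(query(xml, "/DICTIONARY/ENTRY[HW='jaja']").getLength());
        // Positional clause: selects the second ENTRY (1-based in XPath 1.0).
        System.out.println(query(xml, "/DICTIONARY/ENTRY[2]").item(0).getTextContent());
    }
}
```

With the filter form, a general-purpose engine has to test each ENTRY in turn unless it keeps an index on HW, which is consistent with the dependence on the number of entries noted above.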

One solution is to split the DOM representation into multiple smaller DOM trees. However, this approach does not scale very well if more words are added to the dictionary in the future and the increase in code complexity is difficult to justify.

Fortunately, implementations of the DOM API provide efficient ways to extract a child node given its index in the parent's node list, and this can be used to execute the above queries more efficiently and extract the required entries. After extracting the required entry from the DOM tree, there may be a need to query the multiple descendant IMGI nodes to extract the filenames of the image files of a dictionary entry (<ENTRY>). This can be very simply achieved by the following query:

ENTRY/IMAGE//IMGI

The '//' denotes recursive descent and in the above context means finding all the IMGI nodes under the IMAGE node.
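The combination of direct child access and a relative query can be sketched in Java as follows, again using the JDK's DOM and XPath classes in place of the PDOM and XQL engine. The class name is hypothetical, and the sketch assumes that each IMGI element holds the image filename as its text and that no whitespace separates the entries (otherwise text nodes would shift the child indexes). Because the ENTRY node itself is the query context here, the relative path is written IMAGE//IMGI.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical sketch: fetch an entry by child index, then query inside it. */
class ImageLookupSketch {
    /** Returns the image filenames of the entry at the given 0-based child index. */
    static List<String> imageFiles(String xml, int entryIndex) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Direct access to the child node by its index in the parent's node list.
            Node entry = doc.getDocumentElement().getChildNodes().item(entryIndex);
            // Relative query: all IMGI descendants of the entry's IMAGE child.
            NodeList images = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("IMAGE//IMGI", entry, XPathConstants.NODESET);
            List<String> files = new ArrayList<>();
            for (int i = 0; i < images.getLength(); i++) {
                files.add(images.item(i).getTextContent());
            }
            return files;
        } catch (Exception e) {
            throw new IllegalStateException("image lookup failed", e);
        }
    }
}
```

The expensive search over all entries is avoided because the entry is located by position; the query then only has to examine the small subtree of that one entry.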

Customised Presentation of Dictionary Content

Yet another virtue of storing the dictionary in XML format is that the content of the data is abstracted from the way it is going to be used or displayed. This makes XML data very flexible in the way it can be formatted. At a fairly basic and straightforward level, the formatting of the textual content can be controlled by an eXtensible Stylesheet Language (XSL) stylesheet.

HTML-based presentation

Formatted dictionary entries with hypertext links are an adaptation of traditional dictionary structure that is useful for displaying and navigating related words. The functionality to let the user point and click on a referenced word to jump immediately to its corresponding entry is a better way of displaying written dictionary entries than requiring users to remember referenced words and perform a new search for their entry.

At the time of Kirrkirr's initial development, the challenge of formatting the XML data in a useable format was finding an XSL processor that implemented at least the basic features of the draft standard.

While the msxsl application from Microsoft was effective in generating HTML from an XSL file, its limitation to Windows platforms and its inability to deal with large amounts of data meant it could not be included with the Kirrkirr application or applet. Hence the HTML needed to be pre-generated in a process that required each entry to be put in a separate XML file and then fed individually into msxsl to generate a HTML file (9000+ files in all, one for each entry in the dictionary). These files would then be packaged together with Kirrkirr.

Using XSL for enhanced customisation

An XSL file can be used to convert an XML document into a specific format such as HTML, Rich Text Format (RTF) or Postscript. This process is done by an XSL processor, which takes the XML file and the XSL file and creates the formatted file. The structure of an XSL document is essentially a list of rules that tell the XSL processor what to produce when it encounters certain "target-elements" in the XML file. XSL Transformations are now at the stage of a W3C Recommendation (XSLT [HREF14]).

The development of XSL tools is still at a fairly early stage, but nevertheless there is increased availability of XSL processors. In particular, with James Clark's Java-based XSL processor XT, it became possible for Kirrkirr to dynamically generate the formatted dictionary entries in real time. As shown in Figure 4 (continued from Figure 2), Kirrkirr takes the XML document object returned from the XML parser and together with the XSL style sheet specified by the user has the XSL processor generate a HTML document.

[HTML using XSL]

Figure 4: The dynamic HTML generation process in Kirrkirr

While HTML generation at runtime eliminated the need for HTML files to be packaged with the application (as discussed in Section 4.1), the key benefit was the ability to let the user customise the formatting process (the performance implications are discussed in Section 5.4). At present the user can select from a few pre-defined style sheets, which vary from including all the details in the dictionary entry (including related words, semantic domains, examples, senses, sub-senses etc.) to simply displaying the word, its definition and the links to its related words. An important aspect of providing this level of customisation is that elementary users need not be overwhelmed by the detailed lexical information in the dictionary and can still have access to the rich network of word relationships.
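The effect of choosing between style sheets can be sketched with the XSLT processor bundled with the JDK (javax.xml.transform) standing in for XT. Everything specific in the sketch is an assumption made for illustration: the class name, the two tiny style sheets, and the GLOSS and EXAMPLE element names, which are not taken from the Warlpiri dictionary's actual markup.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical sketch: the same entry rendered with a brief or a full style sheet. */
class StylesheetChoiceSketch {
    private static final String OPEN =
            "<xsl:stylesheet version='1.0'"
            + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='html' indent='no'/>"
            + "<xsl:template match='ENTRY'>"
            + "<p><b><xsl:value-of select='HW'/></b>"
            + "<xsl:text>: </xsl:text><xsl:value-of select='GLOSS'/>";
    private static final String CLOSE = "</p></xsl:template></xsl:stylesheet>";

    /** Brief style: headword and gloss only (GLOSS is an assumed element name). */
    static final String BRIEF = OPEN + CLOSE;

    /** Full style: additionally lists every EXAMPLE (also an assumed element name). */
    static final String FULL = OPEN
            + "<xsl:for-each select='EXAMPLE'>"
            + "<br/><i><xsl:value-of select='.'/></i>"
            + "</xsl:for-each>"
            + CLOSE;

    /** Applies the chosen style sheet to one entry and returns the generated HTML. */
    static String render(String entryXml, String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter html = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(entryXml)),
                    new StreamResult(html));
            return html.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }
}
```

Calling render(entryXml, StylesheetChoiceSketch.BRIEF) or render(entryXml, StylesheetChoiceSketch.FULL) on the same entry yields different HTML from unchanged data, which is the separation of content from presentation that makes user-selectable formatting cheap to offer.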

Performance Issues

While the use of XQL and XSL appears, on paper, to be useful and to advance the state-of-the-art, critical questions about efficiency and performance remain unanswered. In this section we examine some of these issues within the context of Kirrkirr. The tests in this section were conducted on a Pentium-133 machine with 48 Mbytes of RAM and running JDK 1.2.1. We deliberately used a comparatively slow and old machine because this is more typical of the equipment likely to be available to potential users of the Warlpiri dictionary.

Impact on start-up time

Method | Size of index file | Size of file to be loaded at start-up | Start-up time needed
XML+XDI | Index file: 2.13Mb | 2.13Mb | 7min
One-PDOM | 12.5Mb | N.A. | 13min 04s
One PDOM + Index | PDOM: 12.5Mb; Index: 520Kb | 520Kb | 3min 30s
Segmented PDOM + Index | PDOM: 12.5Mb; Index file: 454Kb | 454Kb | 55.48s

Table 2: Comparison of the start-up times for Kirrkirr with different representations.

Table 2 summarises the start-up time comparison. XML+XDI is the original Kirrkirr. In One-PDOM, the dictionary information needed is extracted from one monolithic PDOM. The performance of this method is very unsatisfactory: it is evident that a start-up index created from the PDOM is required. This start-up index need not be as complex as the XDI; it need only hold the bare minimum of information (just enough to get the rest by querying the PDOM). The use of such an index reduces the start-up time to half that of the original method. An optimised and enhanced version of this is also considered, in which the dictionary information is broken into blocks. Part of the dictionary can be loaded and the user interface created and presented to the user, while the rest of the dictionary loads in the background. This reduces the start-up time significantly, to under a minute, and allows the user to explore some of the words while the rest are being loaded.
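The segmented loading strategy can be sketched in Java as follows. This is an illustration under simplifying assumptions rather than Kirrkirr's code: the class and method names are hypothetical, and each block is represented simply as a list of headwords that has already been read from its index segment.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: load the first index block up front, the rest in the background. */
class SegmentedLoaderSketch {
    // Thread-safe set, since the background thread and the UI both touch it.
    private final Set<String> headwords = ConcurrentHashMap.newKeySet();
    private Thread background;

    /** Returns as soon as the first block is loaded; later blocks load concurrently. */
    void start(List<List<String>> blocks) {
        if (blocks.isEmpty()) {
            return;
        }
        headwords.addAll(blocks.get(0)); // the user interface could be shown after this
        List<List<String>> rest = blocks.subList(1, blocks.size());
        background = new Thread(() -> rest.forEach(headwords::addAll));
        background.setDaemon(true);
        background.start();
    }

    /** Number of headwords available to the user so far. */
    int loadedCount() {
        return headwords.size();
    }

    /** Blocks until every segment has been loaded. */
    void awaitFullyLoaded() {
        if (background == null) {
            return;
        }
        try {
            background.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while loading", e);
        }
    }
}
```

The perceived start-up time is then governed by the size of the first block rather than by the whole index, at the cost of some words being temporarily unavailable.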

Dictionary Information Extraction

One of the main justifications for using XQL was speedier access using a standardised mechanism. Here we examine whether this holds in practice and, where it does not, what more needs to be done.

Method | Time taken to extract dictionary information for a word
XDI | 118 ms
XQL + PDOM | 3700 ms (first query); 400 ms (subsequent queries)

Table 3: Summary of time taken for dictionary information extraction

Table 3 compares the time taken for dictionary information extraction. The original method stores all the dictEntries (a dictEntry is an object encapsulating the most often used information of a dictionary entry) in a Hashtable in main memory. There are two test results for the XQL method because the self-optimizing cache architecture of the XQL engine ensures faster queries after the first.

Extraction of dictionary information by querying the PDOM is much slower than getting it from a Hashtable: even using the faster timing for comparison, it is more than three times slower (400 ms versus 118 ms). The difference in speed is not a major concern when information for only one word is needed (for example, when a user selects a word from the list), because the delay is small relative to the reaction time of the user. Unfortunately, when dictionary information for a large number of words is required (say during a search), performance drops significantly, as the extra time required for each entry accumulates over the large number of entries extracted.

[caching with PDOM]

Figure 5: Faster dictionary information retrieval through addition of a cache.

A solution that matches the speed of the XDI retrieval and fast start-up time (PDOM implementation) is possible by adding a "cache" (basically a HashMap) between the PDOM and the user interface. All dictEntries retrieved from the PDOM are stored in the cache. Now only the first request for a dictEntry not present in the cache will incur a penalty; subsequent requests for the dictEntry can simply be answered by the cache. This cache, unlike a hardware cache, does not use any replacement policy (e.g. FIFO, Least Recently Used) to displace dictEntries from it.
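A minimal Java sketch of such a cache is given below. The names are hypothetical, entries are represented as plain strings rather than dictEntry objects, and the slow lookup function stands in for an XQL query against the PDOM.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: a HashMap cache in front of a slow entry lookup. */
class DictEntryCacheSketch {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> slowLookup; // stands in for an XQL query
    private int backendCalls = 0;

    DictEntryCacheSketch(Function<String, String> slowLookup) {
        this.slowLookup = slowLookup;
    }

    /** Returns the entry for a headword, querying the backend only on a cache miss. */
    String get(String headword) {
        String entry = cache.get(headword);
        if (entry == null) {
            backendCalls++;
            entry = slowLookup.apply(headword);
            if (entry != null) {
                cache.put(headword, entry); // never evicted: no replacement policy
            }
        }
        return entry;
    }

    /** How many times the slow backend was actually consulted. */
    int backendCalls() {
        return backendCalls;
    }
}
```

Because nothing is ever evicted, the cache can grow until it holds every entry the user has visited; for a dictionary of this size that is bounded by the number of entries, but a larger dictionary might call for a size limit. The sketch is not thread-safe, so concurrent access would need synchronisation.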

Creation of Index file

The original Kirrkirr takes around 32min 26s to create the XDI index (on the same Pentium 133 machine). Apart from being slow, the index created was far too big and required too much time to load during start-up (as discussed in Section 5.1).

The new dictionary representation requires a smaller index (containing only essential dictionary information) for program start-up. Two steps are involved in creating this new index:

  1. Parsing of XML file to generate the PDOM.
  2. Creation of the index required at start-up from the PDOM.

Step | Time Taken
XML -> PDOM | 3min 2s
PDOM -> Index | 10min 52s
Total | 13min 54s

Table 4: Time taken for each step of the creation of start-up index (new version).

As shown above, the start-up index is now created in 13min 54s rather than 32min 26s, which is about 2.3 times faster (an improvement of 133% in speed).

XSL Presentation

Although the technique of pre-generating the HTML files for each of the dictionary entries has the benefit of taking the load off the application, there were a number of reasons for moving to dynamic generation of HTML.

With a HTML file for each entry (9000+) in a single folder, the application not only becomes cumbersome but also places a performance load on the file system. As demonstrated in Table 5, the performance impact of performing the XSL processing at the instant the user requests a word is not significant.

Step | Time Taken
First word | 4s
Other words | 1-2s

Table 5: Time taken by XSL processor to generate a HTML entry (new version).

There is a noticeable load on the system when processing the first word request, as the Java security framework establishes that the user has sufficient permission to access the file system. Subsequent formatted entry requests are processed noticeably faster.

The overhead of dynamic XSL generation of HTML is low enough to justify inclusion of the interactive formatting functionality.

Conclusions and Future Work

In this paper we have described the integration of XQL and XSL components into the Kirrkirr interface for dictionary visualisation. A great advantage in using XML for the storage of data is that it can be converted into any format simply by specifying rules in an XSL style sheet. We have discussed how various XSL style sheets can customise output for different user needs, and in the future we plan to continue exploring the use of XSL for producing printed versions of the dictionary as well.

Being able to produce multiple customised versions of printed dictionaries at low cost (in terms of human labour) is another central need in the domain of dictionaries for languages with few speakers. We are also exploring the development of a GUI interface for defining XSL stylesheets, so they can more easily be customised by users. For efficient access to and searching of dictionary information, we have explored replacing the original custom indexing of Kirrkirr with XQL. While XQL offers many advantages in terms of flexibility and standardisation, our current results suggest that a combination of custom indexing and XQL is the way to achieve optimal performance.

In other work not reported here, we have also continued to investigate the usability of dictionary software by diverse user populations, and to explore various alternative dictionary interface modules. While the focus of all this research has been on Warlpiri, this research (and the software constructed) can easily be applied to other languages, including other Australian languages and English. In general, our hope is to provide a better understanding of the usefulness and practicality of innovative dictionary browsing environments. Flexible and efficient infrastructure components of the sort we have explored here are a necessary foundational part of this research.

References

M. Corris, C. Manning, S. Poetsch, J. Simpson, Dictionaries and endangered languages, to appear. In David Bradley & Maya Bradley (eds) Language Endangerment and Language Maintenance: an active approach.

K. Jansz, "Intelligent processing, storage and visualisation of dictionary information", Computer Science Honours Thesis, Basser Department of Computer Science, University of Sydney, 1998 [HREF6]

K. Jansz, C. Manning, N. Indurkhya, "Kirrkirr: Interactive Visualisation and Multimedia from a Structured Warlpiri Dictionary", Ausweb 99 [HREF7]

K. Jansz, C. Manning, N. Indurkhya, "Kirrkirr: A Java-based visualisation tool for XML dictionaries of Australian Languages", XML/SGML Asia Pacific 1999 [HREF8]

M. Laughren and D. Nash, "Warlpiri Dictionary Project: aims, method, organisation and problems of definition" Papers in Australian Linguistics No. 15: Australian Aboriginal Lexicography pp 109-133, Pacific Linguistics 1983.

Hypertext References

HREF1
http://www.sultry.arts.usyd.edu.au/kjansz
HREF2
http://web.singnet.com.sg/~jimemail
HREF3
http://www.cs.usyd.edu.au/~nitin
HREF4
http://www.sultry.arts.usyd.edu.au/cmanning
HREF5
http://www.sultry.arts.usyd.edu.au/kirrkirr
HREF6
http://www.sultry.arts.usyd.edu.au/kjansz/thesis/
HREF7
http://ausweb.scu.edu.au/aw99/papers/manning/
HREF8
http://www.allette.com.au/xmlasia99/presentations.html (XML/SGML 99)
HREF9
http://xml.darmstadt.gmd.de/xql/index.html (GMD-IPSI XQL Engine Version 1.0.2 index page)
HREF10
http://www.texcel.no/whitepapers/xql-design.html (Robie, J. 1998, The Design of XQL.)
HREF11
http://metalab.unc.edu/xql/xql-proposal.html (Robie, J. 1999, XQL (XML Query Language))
HREF12
http://www.w3.org/TR/1999/REC-xpath-19991116.html (XML Path Language (XPath) Version 1.0, W3C Recommendation 16 November 1999)
HREF13
http://www.w3.org/TR/xsl (Extensible Stylesheet Language (XSL) Version 1.0, W3C Working Draft 27 March 2000)
HREF14
http://www.w3.org/TR/xslt (XSL Transformations (XSLT) Version 1.0, W3C Recommendation 16 November 1999)

Copyright

Kevin Jansz, Wee Jim Sng, Christopher Manning and Nitin Indurkhya, © 2000. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.




AusWeb2K, the Sixth Australian World Wide Web Conference, Southern Cross University, PO Box 157 , Lismore NSW 2480, Australia Email:AusWeb2K@scu.edu.au