Steve Ball, Zveno
Most major word processing applications, such as Microsoft Word and Open Office, now have the ability to save in XML format. However, it is often the case that the schema used by the word processor is not desirable for further processing or storage of the document. In this case it is necessary to convert the word processor's XML representation of the document into an application-defined schema. However, a word processing document is essentially an unstructured document: it is a long list of paragraphs. Hence the task of converting a word processing document into a structured document is a specialised type of grouping problem.
The XSLT language is a natural choice for converting a document from one XML-based schema into another XML-based schema. Grouping is a common problem when developing XSLT stylesheets and there are a number of well-known design patterns for achieving the desired structures in the result document. These techniques include positional grouping and Muenchian grouping. A description of each type of grouping technique is given, along with an explanation of how the grouping method works. While all of these methods do produce the correct result, in the case of the transformation of large word processing documents their runtime performance is unacceptable.
A new technique has been developed for handling documents that typically result from word processing applications. This technique uses modal template processing to achieve its results. It exhibits fast runtime performance using standard XSLT v1.0 features. A description of the technique given, along with an explanation of how high performance is gained.
Table of Contents
Grouping together different elements in a XML source document [XML]into an alternative structure in the result document is a common problem when developing XSL stylesheets [XSLT]. There are many variations on the problem of grouping, one example is as follows. Given a source XML document that lists AusWeb conference chairs:
Example 1.
<chairs>
<chair>
<name>Paul Thistlewaite</name>
<year>1995</year>
<year>1996</year>
<year>1997</year>
<year>1998</year>
</chair>
<chair>
<name>Helen Ashman</name>
<year>1995</year>
<year>1996</year>
<year>1997</year>
<year>1998</year>
</chair>
<chair>
<name>Roger Debreceny</name>
<year>1999</year>
</chair>
<chair>
<name>Allan Ellis</name>
<year>1999</year>
<year>2000</year>
<year>2001</year>
<year>2002</year>
<year>2003</year>
<year>2004</year>
</chair>
</chairs>
Note that the example document has a uniform structure: the chairs element contains a sequence of chair element, which in turn contain a single name element followed by a sequence of year elements. For display purposes it may be desirable to rearrange this list of names into a two-column table, such as:
Example 2.
<table>
<row>
<entry>Paul Thistlewaite</entry>
<entry>Helen Ashman</entry>
</row>
<row>
<entry>Roger Debreceny</entry>
<entry>Allan Ellis</entry>
</row>
</table>
This is an example of positional grouping. Solving this type of problem using XSLT involves the following steps:
Identify the nodes that are to be grouped
Process only one of the nodes, typically the first
When processing that node, find the other nodes in that group
So, for the above problem a solution might be as follows:
Example 3.
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
<xsl:strip-space elements='*'/>
<xsl:template match='chairs'>
<!-- Step 1: We need to grouped together chair
elements pair-wise.
-->
<table>
<!-- Step 2: Select for processing all the
odd-numbered chair elements.
-->
<xsl:apply-templates
select='chair[position() mod 2 = 1]'/>
</table>
</xsl:template>
<!-- This template will only process the odd-numbered
chair elements.
-->
<xsl:template match='chair'>
<row>
<entry>
<xsl:apply-templates select='name'/>
</entry>
<!-- Step 3: Now find the even-numbered
chair elements.
-->
<entry>
<xsl:apply-templates
select='following-sibling::chair[1]/name'/>
</entry>
</row>
</xsl:template>
</xsl:stylesheet>
Another type of grouping problem [Tennison] is grouping by content. For example, given the same source document as above it may be desirable to find the conference chairs for a given year:
Example 4.
<chairs year='1997'> <name>Paul Thistlewaite</name> <name>Helen Ashman</name> </chairs> <chairs year='2002'> <name>Allan Ellis</name> </chairs>
A stylesheet to produce this result is as follows:
Example 5.
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
<xsl:strip-space elements='*'/>
<!-- Step 1: Group by year -->
<xsl:template match='/'>
<!-- Step 2: Select the first node that is
the year we are interested in.
-->
<xsl:apply-templates
select='/chairs/chair/year[. = "1997"][1]'/>
<xsl:apply-templates
select='/chairs/chair/year[. = "2002"][1]'/>
</xsl:template>
<!-- Step 3: For a particular year,
find all chairs.
-->
<xsl:template match='year'>
<chairs year='{.}'>
<xsl:apply-templates
select='preceding-sibling::name|
ancestor::chair/following-sibling::chair
[year = current()]/name'/>
</chairs>
</xsl:template>
<xsl:template match='name'>
<xsl:copy/>
</xsl:template>
</xsl:stylesheet>
A common use of either type of grouping is to derive structure from a flat-structured data source [Pawson]. That is, to have a deeper element hierarchy in the result document than existed in the source document. As an example, below is an extract of another version of the sample XML source document (the full document may be found in flatsource.xml). Note that its structure is also uniform: it is a recurring sequence of title, year, chair (repeatable) and location elements.
Example 6.
<conferences> <title>AusWeb 95</title> <year>1995</year> <chair>Paul Thistlewaite</chair> <chair>Helen Ashman</chair> <location>Ballina Beach Resort</location> <title>AusWeb 96</title> <year>1996</year> <chair>Paul Thistlewaite</chair> <chair>Helen Ashman</chair> <location>Conrad Jupiters Casino</location> <title>AusWeb 97</title> <year>1997</year> <chair>Paul Thistlewaite</chair> <chair>Helen Ashman</chair> <location>Conrad Jupiters Casino</location> . . . </conferences>
The structure of the above sample document is very flat - the element hierarchy is only two elements deep. It would be much better to have a deeper element hierarchy, as shown below (the complete document may be found in deeper.xml).
Example 7.
<conferences>
<conference>
<title>AusWeb 95</title>
<year>1995</year>
<location>Ballina Beach Resort</location>
<chairs>
<name>Paul Thistlewaite</name>
<name>Helen Ashman</name>
</chairs>
</conference>
<conference>
<title>AusWeb 96</title>
<year>1996</year>
<location>Conrad Jupiters Casino</location>
<chairs>
<name>Paul Thistlewaite</name>
<name>Helen Ashman</name>
</chairs>
</conference>
<conference>
<title>AusWeb 97</title>
<year>1997</year>
<location>Conrad Jupiters Casino</location>
<chairs>
<name>Paul Thistlewaite</name>
<name>Helen Ashman</name>
</chairs>
</conference>
.
.
.
</conferences>
The desired result document now has an element hierarchy depth of 4 elements. In order to achieve the required structure, grouping by content is necessary. In this case, the Meunchian method [Meunch] may be used to identify which nodes belong to a particular group. The Meunchian method involves the following steps:
First find that node which delimits the desired group.
Select a set of nodes, usually in a particular axis, that includes all nodes to be grouped. This set will probably contain nodes that are not part of the group.
Apply a predicate to the selected set of nodes that compares the identity of the next delimiting node to the actual delimiting node found in step 1. There are two common methods for identifying a node:
Use the generate-id() function. This function always returns the same, unique string for a given node.
Use the | set union operator combined with the count() function. The union operation eliminates duplicates, so the count of a node unioned with itself will be 1.
Further, this is a two-level grouping problem: first, the content is grouped by conference and second, the conference chairs are grouped together.
An XSL stylesheet to transform the original source document into the desired result document is as follows:
Example 8.
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
<xsl:strip-space elements='*'/>
<!-- Step 1: Group title, year, chair and location
together.
-->
<xsl:template match='conferences'>
<conferences>
<!-- Step 2: Process the title elements,
which being each group.
-->
<xsl:apply-templates select='title'/>
</conferences>
</xsl:template>
<!-- Step 3: For each title, find the related
nodes. These occur before the next title.
-->
<xsl:template match='title'>
<!-- Find the next location element
which delimits this group.
-->
<xsl:variable
name='nextLocation'
select='following-sibling::location[1]'/>
<conference>
<xsl:apply-templates
select='.|following-sibling::year[1]|
$nextLocation'/>
<!-- Now group together the chair elements
for this particular conference.
-->
<chairs>
<xsl:apply-templates
select='following-sibling::chair
[count($nextLocation|
following-sibling::location[1]) = 1]'/>
</chairs>
</conference>
</xsl:template>
<xsl:template match='chair'>
<name>
<xsl:apply-templates/>
</name>
</xsl:template>
<xsl:template match='*'>
<xsl:copy/>
</xsl:template>
</xsl:stylesheet>
This solution utilizes positional grouping for the first level of grouping (conferences) and Meunchian grouping for the second level of grouping (conference chairs).
Methods for grouping in XSLT are suitable when the total number of nodes being grouped is relatively small. Also, grouping methods are easily applied when the document is uniform and the grouping is only one- or two-levels deep. However, there are significant classes of documents where these conditions are not the case. One such class of documents is word processing documents that are saved as XML; both Microsoft Word 2003 [MSWord] and Open Office [OpenOffice] have this capability.
Word processing documents are fundamentally unstructured. Where they do have structure, it is related to the physical structuring of the visual appearance of the document. For example, a document may have a section in single column followed by a section in two columns. Many organisations wish to convert word processing documents into a logically structured format, such as DocBook [Walsh] or TEI [TEI]. In this case, physical structures are of no consequence and so should be discarded. Instead, the logical structures must be identified; this is most easily and reliably achieved by marking-up the document using "styles" but can also be done by applying heuristics.
When converting a word processing document into a structured form, the first step is to "normalise" the document. This involves applying a XSL stylesheet that for the most part copies the document's data identically to the result, but removes the physical structures. Other unnecessary data may also be removed during this proces s, such as style information. The data that remains is a flat list of paragraphs, as shown below:
Example 9.
<office:document-content
xmlns:office="http://openoffice.org/2000/office"
xmlns:text="http://openoffice.org/2000/text"
office:class="text" office:version="1.0">
<office:body>
<text:p text:style-name="chair">Paul Thistlewaite</text:p>
<text:p text:style-name="Standard">1995</text:p>
<text:p text:style-name="Standard">1996</text:p>
<text:p text:style-name="Standard">1997</text:p>
<text:p text:style-name="Standard">1998</text:p>
<text:p text:style-name="chair">Helen Ashman</text:p>
<text:p text:style-name="Standard">1995</text:p>
<text:p text:style-name="Standard">1996</text:p>
<text:p text:style-name="Standard">1997</text:p>
<text:p text:style-name="Standard">1998</text:p>
<text:p text:style-name="chair">Roger Debreceny</text:p>
<text:p text:style-name="Standard">1999</text:p>
<text:p text:style-name="chair">Allan Ellis</text:p>
<text:p text:style-name="Standard">1999</text:p>
<text:p text:style-name="Standard">2000</text:p>
<text:p text:style-name="Standard">2001</text:p>
<text:p text:style-name="Standard">2002</text:p>
<text:p text:style-name="Standard">2003</text:p>
<text:p text:style-name="Standard">2004</text:p>
<text:p text:style-name="Standard"/>
</office:body>
</office:document-content>
This small example contains 20 paragraphs. However, some applications must routinely deal with documents that contain well in excess of 2000 paragraphs.
Structured document schemata, such as DocBook or TEI or of an equivalent complexity, often have a number of sectioning components. For example, DocBook has a sectioning hierarchy as follows:
In fact, DocBook defines 5 numbered sections so potentially the depth of the sectioning element hierarchy is 8. From the perspective of grouping, this is a multi-level grouping problem and is far more complex than one- or two-level grouping.
An XSL stylsheet that transforms the Open Office document above into a structured form shown earlier is as follows:
Example 10.
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
xmlns:office="http://openoffice.org/2000/office"
xmlns:text="http://openoffice.org/2000/text"
exclude-result-prefixes='office text'>
<xsl:strip-space elements='*'/>
<xsl:template match='office:body'>
<chairs>
<xsl:apply-templates
select='text:p[@text:style-name = "chair"]'/>
</chairs>
</xsl:template>
<xsl:template match='text:p[@text:style-name = "chair"]'>
<chair>
<name>
<xsl:apply-templates/>
</name>
<xsl:variable name='nextChair'
select='following-sibling::text:p
[@text:style-name = "chair"][1]'/>
<xsl:choose>
<xsl:when test='$nextChair'>
<xsl:apply-template
select='following-sibling::text:p
[generate-id(
following-sibling::text:p[@text:style-name = "chair"][1]) =
generate-id($nextChair)]'/>
</xsl:when>
<xsl:otherwise>
<!-- Take all following paragraphs
to the end of the document
-->
<xsl:apply-templates select='following-sibling::text:p'/>
</xsl:otherwise>
</xsl:choose>
</chair>
</xsl:template>
<xsl:template match='text:p[@text:style-name = "Standard"]'>
<year>
<xsl:apply-templates/>
</year>
</xsl:template>
</xsl:stylesheet>
In addition to the complexity of multi-level grouping, structured text documents exhibit a non-uniform pattern of sectioning components. That is, although the sectioning components are well-defined (books contain chapters, chapters contain sect1s, and so on) it is not necessarily the case that section components will appear at certain levels. With some schemata, it is also the case that certain elements may appear to different levels in the hierarchy. For example, some sect1 elements may contain sect2 children, whereas others may not. Also, lists or tables may occur at any level in the component hierarchy.
It is possible to deal with these various characteristics when converting word processing documents to structured documents. Handling multi-level documents causes the complexity of the application of Meunchian methods to rise markedly, but the problem is still tractable. However, the application of Muenchian methods to large, flat structures results in very poor performance. This is because the method is essentially calculating the intersection of two sets.
Determining a set nodes along a single axis is efficient, but applying a predicate to every node in the set is very costly. This cost is incurred for every node in every group. The performance cost for documents of different sizes is given below. It is important to note that the result of using the Meunchian method is correct, the desired result document is produced, it is only the runtime performance that is undesirable. Also, some XSLT processors provide an extension function to calculate the in tersection of two nodesets, which obviates the need for the approaches described here. However, for maximum portability it is best to avoid the use of extensions.
Description | Small | Medium | Large |
|---|---|---|---|
Level 1 | 1.85s | 814s | 24+ hrs |
Level 2 | 0.45s | 4.6s | ? |
Notes:
All times were measured on an Apple Macintosh PowerBook, 800MHz G4 PowerPC, 512MB RAM, using Gnome libxml2 and libxslt [Gnome].
In order to achieve two-level grouping, the processing was performed using two separate stylesheets; one to find the first level groups and a second to find all second level groups. Note how finding the second level groups is much faster, because the sibling chains have been shortened by the first grouping stylesheet.
The "Small" document is approximately 100 paragraphs. The "Medium" document is approximately 2000 paragraphs. The "Large" document is approximately 10000 paragraphs.
Although it has been shown that the Muenchian method performs poorly on word processing style documents, it must be realised that the problem is not the size of the document per se; the real problem is that word processing style documents have large, flat structures. The Meunchian method has unacceptable behaviour when grouping occurs in structures that have large numbers of sibling elements. Large (semi-)structured documents may not necessarily exhibit the same bad performance. In fact, once some structures have been established in the document (perhaps in a processing cha in) and the sibling lengths have been reduced the Meunchian method may once again be successfully applied.
So far, only a simple use of the Meunchian method has been shown. It is in fact possible to dramatically speed-up processing by using keys, via the key() function. In order to take advantage of keys, a key is created for each level of grouping desired in the result document. Each key then provides a fast lookup for the nodes in each group at that level. The key for each level of grouping must combine the key at previous levels, and so the complexity of this approach increases with the depth of grouping.
The XSL stylesheet solution shown above is modified to use a key:
Example 11.
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
xmlns:office="http://openoffice.org/2000/office"
xmlns:text="http://openoffice.org/2000/text"
exclude-result-prefixes='office text'>
<xsl:strip-space elements='*'/>
<xsl:key name='immediate-nodes'
match='node()[not([@text:style-name = "chair"])]'
use='generate-id(preceding-sibling::text:p
[@text:style-name = "chair"][1])'/>
<xsl:template match='office:body'>
<chairs>
<xsl:apply-templates
select='text:p[@text:style-name = "chair"]'/>
</chairs>
</xsl:template>
<xsl:template match='text:p[@text:style-name = "chair"]'>
<chair>
<name>
<xsl:apply-templates/>
</name>
<xsl:apply-templates
select='key("immediate-nodes", generate-id())'/>
</chair>
</xsl:template>
<xsl:template match='text:p[@text:style-name = "Standard"]'>
<year>
<xsl:apply-templates/>
</year>
</xsl:template>
</xsl:stylesheet>
This solution improves the runtime performance, as shown in the table below. Note that the time required to build the key(s) is now significant; the actual application of the templates is very fast.
Small | Medium | Large |
|---|---|---|
2.43s | 141s | 3.9hrs |
Notes:
The same documents were used for timing these results, as well as the same hardware and software.
A single stylesheet was used to perform both first- and second-level grouping.
All of the methods described so far result in performance times that are unacceptable. They are also difficult to use with non-uniform structures. A new method has been developed that results in acceptable performance. This method is based on an approach described by Mike Kay in [Pawson], but extended to handle multi-level grouping. Instead of using the Meunchian method to calculate the intersection of two sets, this method iterates through the document one node at a time and "remembers" the context of the node in order to build the desired structures in the result document. Template modes are used to set the current context.
The new method uses the same basic techniques of grouping: identify the nodes to be grouped, process the node that is the start of the group and while processing that node gather together the nodes in that group. In addition, to achieve high performance some additional rules are introduced:
Use the idiom following-sibling::*[1] to step through the source document a single node at a time.
When performing an expensive computation, save the result in a variable and pass that variable as a parameter. That is, avoid re-computing the same (expensive) result.
Send duplicate nodesets for processing, but also include a parameter that determines when to stop processing, to avoid duplicating data. Use a mode to set the component level.
Separate grouping problems into different stylesheets, and chain the processing of a document through those stylesheets. This simplifies the development of the stylesheets and can improve performance (by reducing the number of templates in a stylesheet).

To illustrate the new method, a XSL stylesheet is shown below that transforms the Open Office example document into a structured form, using the new approach:
Example 12.
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
xmlns:office="http://openoffice.org/2000/office"
xmlns:text="http://openoffice.org/2000/text"
exclude-result-prefixes='office text'>
<xsl:strip-space elements='*'/>
<xsl:template match='office:body'>
<chairs>
<!-- Step 1: Group "chair" nodes together.
-->
<xsl:apply-templates
select='text:p[@text:style-name = "chair"]'/>
</chairs>
</xsl:template>
<xsl:template match='text:p[@text:style-name = "chair"]'>
<chair>
<name>
<xsl:apply-templates/>
</name>
<!-- Find the end delimiter for this group.
-->
<xsl:variable name='nextChair'
select='following-sibling::text:p
[@text:style-name = "chair"][1]'/>
<xsl:choose>
<xsl:when test='$nextChair'>
<!-- Process the next paragraph.
Set "year" mode to signify that we're
looking for year values.
-->
<xsl:apply-templates
select='following-sibling::text:p[1]'
mode='year'>
<!-- Include the end delimiter so that
we know when to stop.
-->
<xsl:with-param name='nextChair'
select='$nextChair'/>
</xsl:apply-templates>
</xsl:when>
<xsl:otherwise>
<xsl:apply-templates
select='following-sibling::text:p[1]'
mode='year'/>
</xsl:otherwise>
</xsl:choose>
</chair>
</xsl:template>
<!-- This template only has effect when we're
looking for year values.
-->
<xsl:template
match='text:p[@text:style-name = "Standard"]'
mode='year'>
<xsl:param name='nextChair' select='/..'/>
<xsl:choose>
<xsl:when test='not($nextChair)'>
<!-- This is the last group -->
<year>
<xsl:apply-templates/>
</year>
<xsl:apply-templates
select='following-sibling::text:p[1]'
mode='year'/>
</xsl:when>
<!-- Have we reached the end of this group? -->
<xsl:when
test='generate-id(.) =
generate-id($nextChair)'/>
<xsl:otherwise>
<!-- Still in a group -->
<year>
<xsl:apply-templates/>
</year>
<xsl:apply-templates
select='following-sibling::text:p[1]'
mode='year'/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<!-- Eliminate all unmatched elements to avoid
duplicates.
-->
<xsl:template match='text:p'/>
</xsl:stylesheet>
Extending this solution to multiple component levels is very straight-forward. The template that matches each level must check for a terminating condition. Any combination of explicit parameters or heuristics may be used as a terminating condition. The template is also passed a nodeset of further nodes to process.
As an example, the following word processing document is to be transformed into a DocBook document:
This document has three levels of headings, using styles named (surprisingly) Heading 1, Heading 2 and Heading 3. A common heuristic is that each level of heading delimits the content of the heading before it. Hence this is a three-level grouping problem. An extract of an XSL stylesheet to transform this document into DocBook is shown below (the complete stylesheet may be found in stylesheet6.xsl). This extract shows the templates for handling one heading level.
Example 13.
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
xmlns:office="http://openoffice.org/2000/office"
xmlns:text="http://openoffice.org/2000/text"
exclude-result-prefixes='office text'>
<xsl:strip-space elements='*'/>
<xsl:template match='office:body'>
<article>
<title>
<xsl:apply-templates
select='text:h[@text:style-name = "Heading 1"]'/>
</title>
<!-- Handle any paragraphs appearing in the "front-matter" -->
<xsl:apply-templates
select='text:h[@text:style-name = "Heading 2"][1]/
preceding-sibling::text:p'/>
<!-- Now create the sections.
Find all of the candidate sections,
store in a variable,
then kick off the processing with the first section -->
<xsl:variable name='sections'
select='text:h[@text:style-name = "Heading 2"]'/>
<xsl:apply-templates
select='$sections[1]' mode='section'>
<xsl:with-param name='sections' select='$sections'/>
</xsl:apply-templates>
</article>
</xsl:template>
<xsl:template
match='text:h[@text:style-name = "Heading 2"]'
mode='section'>
<xsl:param name='sections' select='/..'/>
<sect1>
<title>
<xsl:apply-templates/>
</title>
<!-- Edge condition: the last section must be handled separately -->
<xsl:choose>
<xsl:when test='$sections'>
<!-- Are there any subsections in this section?
If so, store them in a variable -->
<xsl:variable
name='subsections'
select='$sections[1]/
preceding-sibling::text:h
[@text:style-name = "Heading 2"]'/>
<xsl:apply-templates
select='following-sibling::*[1]'
mode='subsection'>
<xsl:with-param
name='nextSection'
select='$sections[1]'/>
<xsl:with-param
name='subsections'
select='$subsections'/>
</xsl:apply-templates>
</xsl:when>
<xsl:otherwise>
<!-- This is similar, but the subsections
are determined differently -->
...
</xsl:otherwise>
</xsl:choose>
</sect1>
<!-- Now continue with the next section -->
<xsl:apply-templates
select='$sections[1]'
mode='section'>
<xsl:with-param name='$sections[position() != 1]'/>
</xsl:apply-templates>
</xsl:template>
<xsl:template match='text:h'
mode='subsection'>
<xsl:param name='nextSection'
select='/..'/>
<xsl:param name='subsections'
select='/..'/>
<xsl:choose>
<!-- Have we reached the end delimiter
for this subsection? If so stop -->
<xsl:when test='generate-id() = generate-id($nextSection)'/>
<xsl:when test='@text:style-name = "Heading 2"'>
<!-- Find the (potential) end delimiter
for this subsection -->
<xsl:variable
name='nextSubsection'
select='following-sibling::text:h
[@text:style-name = "Heading 2"][1]'/>
<!-- Find the subsubsections, if any.
Store them in a variable -->
<xsl:variable
name='subsubsections'
select='following-sibling::text:h
[@text:style-name = "Heading 3"]'/>
<sect2>
...
</sect2>
<!-- Are there more subsections *within this group*
to continue with? -->
<xsl:if test='count($subsections|$nextSubsection) =
count($subsections)'>
<xsl:apply-templates
select='$nextSubsection'
mode='subsection'>
...
</xsl:apply-templates>
</xsl:if>
</xsl:when>
...
</xsl:choose>
</xsl:template>
.
.
.
</xsl:stylesheet>
This solution improves the runtime performance to an acceptable level, as shown in the table below:
Small | Medium | Large |
|---|---|---|
0.48s | 10.57s | 1020s |
Notes:
The same documents were used for timing these results, as well as the same hardware and software.
The stylesheet used performs two-level grouping.
Most modern word processing applications offer the ability to save a document in an XML format. However, the schema used by the word processing application is usually not suitable for storage or further processing of the document. Word processing documents are essentially flat structures; they are a long list of paragraphs. Building up deeper structures in the document using XSLT is a particular type of grouping problem.
Grouping is a common problem in XSLT and a number of techniques to achieve grouping have been shown. These include positional and Meunchian grouping, both with and without using keys. Although these techniques do produce the correct result document, their runtime performance (even with keys) has been shown to be unacceptable.
A new technique has been developed, based on the use of modal template processing. This technique has been shown to be able to handle multi-level grouping tasks, as well as non-uniform grouping structures. Although the technique is verbose (even by XML standards!), the application of the technique is mostly boilerplate. Future work is aimed at developing a code-generator to automatically produce the boilerplate templa tes given a description of the input start/end delimiters and the desired result structures.
The new technique described achieves high performance. The comparison of the various technqies on documents of different sizes is summarised below. For a baseline comparison, the timing of an identity transformation is given.
Description | Small | Medium | Large |
|---|---|---|---|
(Identity) | 0.25s | 1.29s | 14s |
Meunchian level 1 | 1.85s | 814s | 24+ hrs |
Meunchian level 2 | 0.45s | 4.6s | ? |
Meunchian with keys | 2.43s | 141s | 3.9hrs |
Modal templates | 0.48s | 10.57s | 1020s |
eXtensible Markup Language, Second Edition. Tim Bray (Ed). W3C Recommendation October 2000.
XSL Transformations (XSLT). James Clark. W3C Recommendation November 1999.
Jeni Tennison. Jeni's XSLT Pages: Grouping.
XSLT FAQ. Flat file transformation.
XSLT FAQ. Grouping.
Microsoft Word 2003. Microsoft Office Online Home Page.
Open Office. OpenOffice.org: Home Page.
DocBook: The Definitive Guide. Norm Walsh. Read The Book.
Text Encoding Initiative. TEI Website.
Gnome libxml. Daniel Veillard. xmlsoft.org.