Multi-level Non-uniform Grouping of Very Large Flat Structured Documents

Steve Ball, Zveno

Abstract

Most major word processing applications, such as Microsoft Word and Open Office, now have the ability to save in XML format. However, it is often the case that the schema used by the word processor is not desirable for further processing or storage of the document. In this case it is necessary to convert the word processor's XML representation of the document into an application-defined schema. However, a word processing document is essentially an unstructured document: it is a long list of paragraphs. Hence the task of converting a word processing document into a structured document is a specialised type of grouping problem.

The XSLT language is a natural choice for converting a document from one XML-based schema into another XML-based schema. Grouping is a common problem when developing XSLT stylesheets and there are a number of well-known design patterns for achieving the desired structures in the result document. These techniques include positional grouping and Muenchian grouping. A description of each type of grouping technique is given, along with an explanation of how the grouping method works. While all of these methods do produce the correct result, in the case of the transformation of large word processing documents their runtime performance is unacceptable.

A new technique has been developed for handling documents that typically result from word processing applications. This technique uses modal template processing to achieve its results. It exhibits fast runtime performance using standard XSLT v1.0 features. A description of the technique given, along with an explanation of how high performance is gained.


Table of Contents

Grouping In XSLT v1.0
Handling Large, Multi-level, Non-uniform, Flat-structured Documents
Grouping in Large, Multi-level, Non-uniform, Flat-structured Documents
Conclusion
References

Grouping In XSLT v1.0

Grouping together different elements in a XML source document [XML]into an alternative structure in the result document is a common problem when developing XSL stylesheets [XSLT]. There are many variations on the problem of grouping, one example is as follows. Given a source XML document that lists AusWeb conference chairs:

Example 1. 

Source Document
<chairs>
  <chair>
    <name>Paul Thistlewaite</name>
    <year>1995</year>
    <year>1996</year>
    <year>1997</year>
    <year>1998</year>
  </chair>
  <chair>
    <name>Helen Ashman</name>
    <year>1995</year>
    <year>1996</year>
    <year>1997</year>
    <year>1998</year>
  </chair>
  <chair>
    <name>Roger Debreceny</name>
    <year>1999</year>
  </chair>
  <chair>
    <name>Allan Ellis</name>
    <year>1999</year>
    <year>2000</year>
    <year>2001</year>
    <year>2002</year>
    <year>2003</year>
    <year>2004</year>
  </chair>
</chairs>

Note that the example document has a uniform structure: the chairs element contains a sequence of chair element, which in turn contain a single name element followed by a sequence of year elements. For display purposes it may be desirable to rearrange this list of names into a two-column table, such as:

Example 2. 

Desired Result Document 1
<table>
  <row>
    <entry>Paul Thistlewaite</entry>
    <entry>Helen Ashman</entry>
  </row>
  <row>
    <entry>Roger Debreceny</entry>
    <entry>Allan Ellis</entry>
  </row>
</table>

This is an example of positional grouping. Solving this type of problem using XSLT involves the following steps:

  1. Identify the nodes that are to be grouped

  2. Process only one of the nodes, typically the first

  3. When processing that node, find the other nodes in that group

Figure 1. 

Positional Grouping

So, for the above problem a solution might be as follows:

Example 3. 

XSL Stylesheet 1
<xsl:stylesheet version='1.0'
  xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>

  <xsl:strip-space elements='*'/>

  <xsl:template match='chairs'>

    <!-- Step 1: We need to grouped together chair
	 elements pair-wise.
      -->

    <table>
      <!-- Step 2: Select for processing all the
	   odd-numbered chair elements.
	-->
      <xsl:apply-templates
	 select='chair[position() mod 2 = 1]'/>
    </table>
  </xsl:template>

  <!-- This template will only process the odd-numbered
       chair elements.
    -->
  <xsl:template match='chair'>
    <row>
      <entry>
	<xsl:apply-templates select='name'/>
      </entry>

      <!-- Step 3: Now find the even-numbered
	   chair elements.
	-->

      <entry>
	<xsl:apply-templates
	   select='following-sibling::chair[1]/name'/>
      </entry>
    </row>
  </xsl:template>

</xsl:stylesheet>

Another type of grouping problem [Tennison] is grouping by content. For example, given the same source document as above it may be desirable to find the conference chairs for a given year:

Example 4. 

Desired Result Document 2
<chairs year='1997'>
  <name>Paul Thistlewaite</name>
  <name>Helen Ashman</name>
</chairs>
<chairs year='2002'>
  <name>Allan Ellis</name>
</chairs>

A stylesheet to produce this result is as follows:

Example 5. 

XSL Stylesheet 2
<xsl:stylesheet version='1.0'
  xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>

  <xsl:strip-space elements='*'/>

  <!-- Step 1: Group by year -->

  <xsl:template match='/'>
    <!-- Step 2: Select the first node that is
	 the year we are interested in.
      -->
    <xsl:apply-templates
       select='/chairs/chair/year[. = "1997"][1]'/>
    <xsl:apply-templates
       select='/chairs/chair/year[. = "2002"][1]'/>
  </xsl:template>

  <!-- Step 3: For a particular year,
       find all chairs.
    -->
  <xsl:template match='year'>
    <chairs year='{.}'>
      <xsl:apply-templates
	 select='preceding-sibling::name|
		 ancestor::chair/following-sibling::chair
		   [year = current()]/name'/>
    </chairs>
  </xsl:template>

  <xsl:template match='name'>
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>

A common use of either type of grouping is to derive structure from a flat-structured data source [Pawson]. That is, to have a deeper element hierarchy in the result document than existed in the source document. As an example, below is an extract of another version of the sample XML source document (the full document may be found in flatsource.xml). Note that its structure is also uniform: it is a recurring sequence of title, year, chair (repeatable) and location elements.

Example 6. 

Source Document
<conferences>
  <title>AusWeb 95</title>
  <year>1995</year>
  <chair>Paul Thistlewaite</chair>
  <chair>Helen Ashman</chair>
  <location>Ballina Beach Resort</location>

  <title>AusWeb 96</title>
  <year>1996</year>
  <chair>Paul Thistlewaite</chair>
  <chair>Helen Ashman</chair>
  <location>Conrad Jupiters Casino</location>

  <title>AusWeb 97</title>
  <year>1997</year>
  <chair>Paul Thistlewaite</chair>
  <chair>Helen Ashman</chair>
  <location>Conrad Jupiters Casino</location>
.
.
.
</conferences>

The structure of the above sample document is very flat - the element hierarchy is only two elements deep. It would be much better to have a deeper element hierarchy, as shown below (the complete document may be found in deeper.xml).

Example 7. 

Desired Structured Result Document
<conferences>
  <conference>
    <title>AusWeb 95</title>
    <year>1995</year>
    <location>Ballina Beach Resort</location>
    <chairs>
      <name>Paul Thistlewaite</name>
      <name>Helen Ashman</name>
    </chairs>
  </conference>
  <conference>
    <title>AusWeb 96</title>
    <year>1996</year>
    <location>Conrad Jupiters Casino</location>
    <chairs>
      <name>Paul Thistlewaite</name>
      <name>Helen Ashman</name>
    </chairs>
  </conference>
  <conference>
    <title>AusWeb 97</title>
    <year>1997</year>
    <location>Conrad Jupiters Casino</location>
    <chairs>
      <name>Paul Thistlewaite</name>
      <name>Helen Ashman</name>
    </chairs>
  </conference>
.
.
.
</conferences>

The desired result document now has an element hierarchy depth of 4 elements. In order to achieve the required structure, grouping by content is necessary. In this case, the Meunchian method [Meunch] may be used to identify which nodes belong to a particular group. The Meunchian method involves the following steps:

  1. First find that node which delimits the desired group.

  2. Select a set of nodes, usually in a particular axis, that includes all nodes to be grouped. This set will probably contain nodes that are not part of the group.

  3. Apply a predicate to the selected set of nodes that compares the identity of the next delimiting node to the actual delimiting node found in step 1. There are two common methods for identifying a node:

    • Use the generate-id() function. This function always returns the same, unique string for a given node.

    • Use the | set union operator combined with the count() function. The union operation eliminates duplicates, so the count of a node unioned with itself will be 1.

Further, this is a two-level grouping problem: first, the content is grouped by conference and second, the conference chairs are grouped together.

An XSL stylesheet to transform the original source document into the desired result document is as follows:

Example 8. 

XSL Stylesheet 3
<xsl:stylesheet version='1.0'
  xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>

  <xsl:strip-space elements='*'/>

  <!-- Step 1: Group title, year, chair and location
       together.
    -->

  <xsl:template match='conferences'>
    <conferences>

      <!-- Step 2: Process the title elements,
	   which being each group.
	-->
      <xsl:apply-templates select='title'/>
    </conferences>
  </xsl:template>

  <!-- Step 3: For each title, find the related
       nodes.  These occur before the next title.
    -->

  <xsl:template match='title'>

    <!-- Find the next location element
	 which delimits this group.
      -->
    <xsl:variable
       name='nextLocation'
       select='following-sibling::location[1]'/>

    <conference>
      <xsl:apply-templates
	 select='.|following-sibling::year[1]|
		 $nextLocation'/>

      <!-- Now group together the chair elements
	   for this particular conference.
	-->
      <chairs>
	<xsl:apply-templates
	   select='following-sibling::chair
		   [count($nextLocation|
		   following-sibling::location[1]) = 1]'/>
      </chairs>
    </conference>
  </xsl:template>

  <xsl:template match='chair'>
    <name>
      <xsl:apply-templates/>
    </name>
  </xsl:template>
  <xsl:template match='*'>
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>

This solution utilizes positional grouping for the first level of grouping (conferences) and Meunchian grouping for the second level of grouping (conference chairs).

Handling Large, Multi-level, Non-uniform, Flat-structured Documents

Methods for grouping in XSLT are suitable when the total number of nodes being grouped is relatively small. Also, grouping methods are easily applied when the document is uniform and the grouping is only one- or two-levels deep. However, there are significant classes of documents where these conditions are not the case. One such class of documents is word processing documents that are saved as XML; both Microsoft Word 2003 [MSWord] and Open Office [OpenOffice] have this capability.

Word processing documents are fundamentally unstructured. Where they do have structure, it is related to the physical structuring of the visual appearance of the document. For example, a document may have a section in single column followed by a section in two columns. Many organisations wish to convert word processing documents into a logically structured format, such as DocBook [Walsh] or TEI [TEI]. In this case, physical structures are of no consequence and so should be discarded. Instead, the logical structures must be identified; this is most easily and reliably achieved by marking-up the document using "styles" but can also be done by applying heuristics.

When converting a word processing document into a structured form, the first step is to "normalise" the document. This involves applying a XSL stylesheet that for the most part copies the document's data identically to the result, but removes the physical structures. Other unnecessary data may also be removed during this proces s, such as style information. The data that remains is a flat list of paragraphs, as shown below:

Figure 2. 

Open Office Document

Example 9. 

Extract Of Open Office Document Content
<office:document-content
   xmlns:office="http://openoffice.org/2000/office"
   xmlns:text="http://openoffice.org/2000/text"
   office:class="text" office:version="1.0">
  <office:body>
    <text:p text:style-name="chair">Paul Thistlewaite</text:p>
    <text:p text:style-name="Standard">1995</text:p>
    <text:p text:style-name="Standard">1996</text:p>
    <text:p text:style-name="Standard">1997</text:p>
    <text:p text:style-name="Standard">1998</text:p>
    <text:p text:style-name="chair">Helen Ashman</text:p>
    <text:p text:style-name="Standard">1995</text:p>
    <text:p text:style-name="Standard">1996</text:p>
    <text:p text:style-name="Standard">1997</text:p>
    <text:p text:style-name="Standard">1998</text:p>
    <text:p text:style-name="chair">Roger Debreceny</text:p>
    <text:p text:style-name="Standard">1999</text:p>
    <text:p text:style-name="chair">Allan Ellis</text:p>
    <text:p text:style-name="Standard">1999</text:p>
    <text:p text:style-name="Standard">2000</text:p>
    <text:p text:style-name="Standard">2001</text:p>
    <text:p text:style-name="Standard">2002</text:p>
    <text:p text:style-name="Standard">2003</text:p>
    <text:p text:style-name="Standard">2004</text:p>
    <text:p text:style-name="Standard"/>
  </office:body>
</office:document-content>

This small example contains 20 paragraphs. However, some applications must routinely deal with documents that contain well in excess of 2000 paragraphs.

Structured document schemata, such as DocBook or TEI or of an equivalent complexity, often have a number of sectioning components. For example, DocBook has a sectioning hierarchy as follows:

set
book, article
preface, chapter, appendix
section, sect1
sect2

In fact, DocBook defines 5 numbered sections so potentially the depth of the sectioning element hierarchy is 8. From the perspective of grouping, this is a multi-level grouping problem and is far more complex than one- or two-level grouping.

An XSL stylsheet that transforms the Open Office document above into a structured form shown earlier is as follows:

Example 10. 

XSL Stylesheet 4
<xsl:stylesheet version='1.0'
   xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
   xmlns:office="http://openoffice.org/2000/office"
   xmlns:text="http://openoffice.org/2000/text"
   exclude-result-prefixes='office text'>

  <xsl:strip-space elements='*'/>

  <xsl:template match='office:body'>
    <chairs>
      <xsl:apply-templates
	 select='text:p[@text:style-name = "chair"]'/>
    </chairs>
  </xsl:template>

  <xsl:template match='text:p[@text:style-name = "chair"]'>
    <chair>
      <name>
	<xsl:apply-templates/>
      </name>

      <xsl:variable name='nextChair'
	select='following-sibling::text:p
		  [@text:style-name = "chair"][1]'/>

      <xsl:choose>
	<xsl:when test='$nextChair'>

	  <xsl:apply-template
	     select='following-sibling::text:p
	       [generate-id(
		  following-sibling::text:p[@text:style-name = "chair"][1]) =
	        generate-id($nextChair)]'/>

	</xsl:when>
	<xsl:otherwise>

	  <!-- Take all following paragraphs
	       to the end of the document
	    -->

	  <xsl:apply-templates select='following-sibling::text:p'/>
	</xsl:otherwise>
      </xsl:choose>
    </chair>
  </xsl:template>

  <xsl:template match='text:p[@text:style-name = "Standard"]'>
    <year>
      <xsl:apply-templates/>
    </year>
  </xsl:template>

</xsl:stylesheet>

In addition to the complexity of multi-level grouping, structured text documents exhibit a non-uniform pattern of sectioning components. That is, although the sectioning components are well-defined (books contain chapters, chapters contain sect1s, and so on) it is not necessarily the case that section components will appear at certain levels. With some schemata, it is also the case that certain elements may appear to different levels in the hierarchy. For example, some sect1 elements may contain sect2 children, whereas others may not. Also, lists or tables may occur at any level in the component hierarchy.

It is possible to deal with these various characteristics when converting word processing documents to structured documents. Handling multi-level documents causes the complexity of the application of Meunchian methods to rise markedly, but the problem is still tractable. However, the application of Muenchian methods to large, flat structures results in very poor performance. This is because the method is essentially calculating the intersection of two sets.

Figure 3. 

Intersection Of Two Sets

Determining a set nodes along a single axis is efficient, but applying a predicate to every node in the set is very costly. This cost is incurred for every node in every group. The performance cost for documents of different sizes is given below. It is important to note that the result of using the Meunchian method is correct, the desired result document is produced, it is only the runtime performance that is undesirable. Also, some XSLT processors provide an extension function to calculate the in tersection of two nodesets, which obviates the need for the approaches described here. However, for maximum portability it is best to avoid the use of extensions.

Muenchian Method Performance

Description

Small

Medium

Large

Level 1

1.85s

814s

24+ hrs

Level 2

0.45s

4.6s

?

Notes:

Although it has been shown that the Muenchian method performs poorly on word processing style documents, it must be realised that the problem is not the size of the document per se; the real problem is that word processing style documents have large, flat structures. The Meunchian method has unacceptable behaviour when grouping occurs in structures that have large numbers of sibling elements. Large (semi-)structured documents may not necessarily exhibit the same bad performance. In fact, once some structures have been established in the document (perhaps in a processing cha in) and the sibling lengths have been reduced the Meunchian method may once again be successfully applied.

So far, only a simple use of the Meunchian method has been shown. It is in fact possible to dramatically speed-up processing by using keys, via the key() function. In order to take advantage of keys, a key is created for each level of grouping desired in the result document. Each key then provides a fast lookup for the nodes in each group at that level. The key for each level of grouping must combine the key at previous levels, and so the complexity of this approach increases with the depth of grouping.

The XSL stylesheet solution shown above is modified to use a key:

Example 11. 

XSL Stylesheet 5
<xsl:stylesheet version='1.0'
   xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
   xmlns:office="http://openoffice.org/2000/office"
   xmlns:text="http://openoffice.org/2000/text"
   exclude-result-prefixes='office text'>

  <xsl:strip-space elements='*'/>

  <xsl:key name='immediate-nodes'
	   match='node()[not([@text:style-name = "chair"])]'
	   use='generate-id(preceding-sibling::text:p
		[@text:style-name = "chair"][1])'/>

  <xsl:template match='office:body'>
    <chairs>
      <xsl:apply-templates
	 select='text:p[@text:style-name = "chair"]'/>
    </chairs>
  </xsl:template>

  <xsl:template match='text:p[@text:style-name = "chair"]'>
    <chair>
      <name>
	<xsl:apply-templates/>
      </name>

      <xsl:apply-templates
	 select='key("immediate-nodes", generate-id())'/>

    </chair>
  </xsl:template>

  <xsl:template match='text:p[@text:style-name = "Standard"]'>
    <year>
      <xsl:apply-templates/>
    </year>
  </xsl:template>

</xsl:stylesheet>

This solution improves the runtime performance, as shown in the table below. Note that the time required to build the key(s) is now significant; the actual application of the templates is very fast.

Muenchian Method With Keys Performance

Small

Medium

Large

2.43s

141s

3.9hrs

Notes:

Grouping in Large, Multi-level, Non-uniform, Flat-structured Documents

All of the methods described so far result in performance times that are unacceptable. They are also difficult to use with non-uniform structures. A new method has been developed that results in acceptable performance. This method is based on an approach described by Mike Kay in [Pawson], but extended to handle multi-level grouping. Instead of using the Meunchian method to calculate the intersection of two sets, this method iterates through the document one node at a time and "remembers" the context of the node in order to build the desired structures in the result document. Template modes are used to set the current context.

The new method uses the same basic techniques of grouping: identify the nodes to be grouped, process the node that is the start of the group and while processing that node gather together the nodes in that group. In addition, to achieve high performance some additional rules are introduced:

To illustrate the new method, a XSL stylesheet is shown below that transforms the Open Office example document into a structured form, using the new approach:

Example 12. 

XSL Stylesheet 5
<xsl:stylesheet version='1.0'
   xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
   xmlns:office="http://openoffice.org/2000/office"
   xmlns:text="http://openoffice.org/2000/text"
   exclude-result-prefixes='office text'>

  <xsl:strip-space elements='*'/>

  <xsl:template match='office:body'>
    <chairs>

      <!-- Step 1: Group "chair" nodes together.
	-->

      <xsl:apply-templates
	 select='text:p[@text:style-name = "chair"]'/>
    </chairs>
  </xsl:template>

  <xsl:template match='text:p[@text:style-name = "chair"]'>
    <chair>
      <name>
	<xsl:apply-templates/>
      </name>

      <!-- Find the end delimiter for this group.
	-->

      <xsl:variable name='nextChair'
	select='following-sibling::text:p
		[@text:style-name = "chair"][1]'/>

      <xsl:choose>
	<xsl:when test='$nextChair'>

	  <!-- Process the next paragraph.
	       Set "year" mode to signify that we're
	       looking for year values.
	    -->

	  <xsl:apply-templates
	     select='following-sibling::text:p[1]'
	     mode='year'>

	    <!-- Include the end delimiter so that
		 we know when to stop.
	      -->

	    <xsl:with-param name='nextChair'
			    select='$nextChair'/>
	  </xsl:apply-templates>
	</xsl:when>

	<xsl:otherwise>
	  <xsl:apply-templates
	     select='following-sibling::text:p[1]'
	     mode='year'/>
	</xsl:otherwise>
      </xsl:choose>
    </chair>
  </xsl:template>

  <!-- This template only has effect when we're
       looking for year values.
    -->

  <xsl:template
     match='text:p[@text:style-name = "Standard"]'
     mode='year'>

    <xsl:param name='nextChair' select='/..'/>

    <xsl:choose>
      <xsl:when test='not($nextChair)'>
	<!-- This is the last group -->
	<year>
	  <xsl:apply-templates/>
	</year>

	<xsl:apply-templates
	   select='following-sibling::text:p[1]'
	   mode='year'/>

      </xsl:when>

      <!-- Have we reached the end of this group? -->
      <xsl:when
	 test='generate-id(.) =
	       generate-id($nextChair)'/>

      <xsl:otherwise>
	<!-- Still in a group -->
	<year>
	  <xsl:apply-templates/>
	</year>

	<xsl:apply-templates
	   select='following-sibling::text:p[1]'
	   mode='year'/>

      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Eliminate all unmatched elements to avoid
       duplicates.
    -->
  <xsl:template match='text:p'/>

</xsl:stylesheet>

Extending this solution to multiple component levels is very straight-forward. The template that matches each level must check for a terminating condition. Any combination of explicit parameters or heuristics may be used as a terminating condition. The template is also passed a nodeset of further nodes to process.

As an example, the following word processing document is to be transformed into a DocBook document:

Figure 4. 

Example Word Processing Document

This document has three levels of headings, using styles named (surprisingly) Heading 1, Heading 2 and Heading 3. A common heuristic is that each level of heading delimits the content of the heading before it. Hence this is a three-level grouping problem. An extract of an XSL stylesheet to transform this document into DocBook is shown below (the complete stylesheet may be found in stylesheet6.xsl). This extract shows the templates for handling one heading level.

Example 13. 

XSL Stylesheet 6
<xsl:stylesheet version='1.0'
   xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
   xmlns:office="http://openoffice.org/2000/office"
   xmlns:text="http://openoffice.org/2000/text"
   exclude-result-prefixes='office text'>

  <xsl:strip-space elements='*'/>

  <xsl:template match='office:body'>
    <article>
      <title>
	<xsl:apply-templates
	   select='text:h[@text:style-name = "Heading 1"]'/>
      </title>

      <!-- Handle any paragraphs appearing in the "front-matter" -->

      <xsl:apply-templates
	 select='text:h[@text:style-name = "Heading 2"][1]/
		 preceding-sibling::text:p'/>

      <!-- Now create the sections.
	   Find all of the candidate sections,
	   store in a variable,
	   then kick off the processing with the first section -->

      <xsl:variable name='sections'
		    select='text:h[@text:style-name = "Heading 2"]'/>

      <xsl:apply-templates
	 select='$sections[1]' mode='section'>

	<xsl:with-param name='sections' select='$sections'/>
      </xsl:apply-templates>
    </article>
  </xsl:template>

  <xsl:template
     match='text:h[@text:style-name = "Heading 2"]'
     mode='section'>

    <xsl:param name='sections' select='/..'/>

    <sect1>
      <title>
	<xsl:apply-templates/>
      </title>

      <!-- Edge condition: the last section must be handled separately -->

      <xsl:choose>
	<xsl:when test='$sections'>

	  <!-- Are there any subsections in this section?
	       If so, store them in a variable -->

	  <xsl:variable
	     name='subsections'
	     select='$sections[1]/
		     preceding-sibling::text:h
		     [@text:style-name = "Heading 2"]'/>

	  <xsl:apply-templates
	     select='following-sibling::*[1]'
	     mode='subsection'>

	    <xsl:with-param
	       name='nextSection'
	       select='$sections[1]'/>

	    <xsl:with-param
	       name='subsections'
	       select='$subsections'/>
	  </xsl:apply-templates>
	</xsl:when>

	<xsl:otherwise>

	  <!-- This is similar, but the subsections
	       are determined differently -->
...
	</xsl:otherwise>
      </xsl:choose>
    </sect1>

    <!-- Now continue with the next section -->

    <xsl:apply-templates
       select='$sections[1]'
       mode='section'>
      <xsl:with-param name='$sections[position() != 1]'/>
    </xsl:apply-templates>
  </xsl:template>

  <xsl:template match='text:h'
		mode='subsection'>
    <xsl:param name='nextSection'
	       select='/..'/>
    <xsl:param name='subsections'
	       select='/..'/>

    <xsl:choose>
      <!-- Have we reached the end delimiter
	   for this subsection?  If so stop -->
      <xsl:when test='generate-id() = generate-id($nextSection)'/>

      <xsl:when test='@text:style-name = "Heading 2"'>

	<!-- Find the (potential) end delimiter
	     for this subsection -->

	<xsl:variable
	   name='nextSubsection'
	   select='following-sibling::text:h
		   [@text:style-name = "Heading 2"][1]'/>

	<!-- Find the subsubsections, if any.
	     Store them in a variable -->

	<xsl:variable
	   name='subsubsections'
	   select='following-sibling::text:h
		   [@text:style-name = "Heading 3"]'/>

	<sect2>
...
	</sect2>

	<!-- Are there more subsections *within this group*
	     to continue with? -->

	<xsl:if test='count($subsections|$nextSubsection) =
		      count($subsections)'>

	  <xsl:apply-templates
	     select='$nextSubsection'
	     mode='subsection'>
...
	  </xsl:apply-templates>
	</xsl:if>
      </xsl:when>
...
    </xsl:choose>
  </xsl:template>
.
.
.
</xsl:stylesheet>

This solution improves the runtime performance to an acceptable level, as shown in the table below:

New Approach Performance

Small

Medium

Large

0.48s

10.57s

1020s

Notes:

Conclusion

Most modern word processing applications offer the ability to save a document in an XML format. However, the schema used by the word processing application is usually not suitable for storage or further processing of the document. Word processing documents are essentially flat structures; they are a long list of paragraphs. Building up deeper structures in the document using XSLT is a particular type of grouping problem.

Grouping is a common problem in XSLT and a number of techniques to achieve grouping have been shown. These include positional and Meunchian grouping, both with and without using keys. Although these techniques do produce the correct result document, their runtime performance (even with keys) has been shown to be unacceptable.

A new technique has been developed, based on the use of modal template processing. This technique has been shown to be able to handle multi-level grouping tasks, as well as non-uniform grouping structures. Although the technique is verbose (even by XML standards!), the application of the technique is mostly boilerplate. Future work is aimed at developing a code-generator to automatically produce the boilerplate templa tes given a description of the input start/end delimiters and the desired result structures.

The new technique described achieves high performance. The comparison of the various technqies on documents of different sizes is summarised below. For a baseline comparison, the timing of an identity transformation is given.

Muenchian Method Performance

Description

Small

Medium

Large

(Identity)

0.25s

1.29s

14s

Meunchian level 1

1.85s

814s

24+ hrs

Meunchian level 2

0.45s

4.6s

?

Meunchian with keys

2.43s

141s

3.9hrs

Modal templates

0.48s

10.57s

1020s

References

[XML]

eXtensible Markup Language, Second Edition. Tim Bray (Ed). W3C Recommendation October 2000.

[XSLT]

XSL Transformations (XSLT). James Clark. W3C Recommendation November 1999.

< dt>[Tennison]

Jeni Tennison. Jeni's XSLT Pages: Grouping.

[Pawson]

XSLT FAQ. Flat file transformation.

[Meunch]

XSLT FAQ. Grouping.

[MSWord]

Microsoft Word 2003. Microsoft Office Online Home Page.

[OpenOffice]

Open Office. OpenOffice.org: Home Page.

[Walsh]

DocBook: The Definitive Guide. Norm Walsh. Read The Book.

[TEI]

Text Encoding Initiative. TEI Website.

[Gnome]

Gnome libxml. Daniel Veillard. xmlsoft.org.

Copyright

Steve Ball, Zveno, © 2004. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.