Creating and managing documents with LifeWeb


Thuy-Linh Nguyen, School of Computer Science and Software Engineering, Monash University Thuy.Linh.Nguyen@csse.monash.edu.au

Heinz Schmidt, School of Computer Science and Software Engineering, Monash University Heinz.Schmidt@csse.monash.edu.au


Abstract

With increasingly complex enterprise demands and the explosion of the web, serious problems with the Web are showing. Among them, management and maintenance are two of the most well-known, inherent in the inflexibility, the lack of infrastructure and the missing semantic structure of web documents. LifeWeb has been proposed as an approach towards solving these problems by providing an object-oriented model with the life design for Web documents. Key concepts of LifeWeb are centred around the semantic structure of documents, and its separation from presentation and raw contents material. Selective access and customised presentations are expressed as rules of so-called activation schemes. Link accesses activate these schemes. This paper describes the underlying model and the implementation of a prototype for LifeWeb and explains how a GUI application can be developed on top of the LifeWeb engine to facilitate and simplify the authoring and management of Web documents.


Introduction

The coming of the World Wide Web has brought an enormous growth in publishing multimedia content and presenting it via the Internet. An important problem in this new domain is the dynamics of the publication. While the content of paper-based documents is fairly fixed, Web content can be changed at any time. The current Web system, however, does not provide ways to manage such changes. A number of tools have emerged supporting the authoring and managing of Web documents. Typically they focus on drag-and-drop HTML generation and some rudimentary "project" management of HTML files. Unfortunately, most Web content is still written straight in HTML, the assembler language of the Web.

A document normally possesses three distinct properties, namely content, structure and presentation. HTML, however, defines them all together. HTML structural elements cannot be used to express the semantic structure of the whole document and are usually treated in a presentation-oriented manner. These and other presentational specifications are hard coded into the HTML source file, and the file system of the server machine is usually used to express the document's semantic structure. For instance, one might put each component in a separate directory, such as ../mybook/chapter1/ for the Chapter1 component, and ../mybook/chapter2/ for the Chapter2 component, etc.

Changes to structured documents often need to be synchronised among their components. For example, the removal of Chapter1 would require all following chapters to be renumbered, all references to Chapter1 removed, and a new table of contents generated. These maintenance operations are typically carried out manually using commands and tools native to the server machine, i.e. orthogonal to the Web [HREF5]. If the file system has been used to represent the document structure, manipulations to Web resources are required in both the file system and the Web system, which are essentially two disjoint domains [HREF5]. This process is expensive and prone to mistakes and inconsistencies. Its problems are well documented [HREF5, HREF6].

LifeWeb [NGU98.2] has been proposed as a step towards addressing these problems by providing a data model for the Web document system using proven object-oriented technology. Object-orientation is well-known for its flexibility and other features such as extensibility, manageability, maintainability, and reusability.

LifeWeb

Documents

LifeWeb [NGU98.1, NGU98.2] is an object-oriented model for the Web document system. Figure 1 shows the top level of the core design of the model using the Unified Modeling Language [UML] notation [HREF3].

Figure 1. LifeWeb - top level object diagram

In Fig. 1, the central class is Document. In LifeWeb a document is decomposed into objects, each holding an internal state defined by a set of attributes and a behaviour defined by a set of methods. A Document is viewed in three dimensions: structure, material and presentation.

The material dimension (to the right of Document in Fig. 1) accounts for the raw materials that compose a document, such as text, graphics, sound, video, etc. The presentation view includes presentation directives such as table, list, frame, or article parameters and styles which determine how the material is presented to human readers.

The hierarchical structure is captured by decomposing a Document into Structural Components recursively. For a book, such components could be its volume, part, chapter, section, paragraph, etc. A Document can have zero or more such components, each of which is a Document itself. For this purpose, Structural Component inherits from Document (see Fig. 1). We expect that at the lowest level in the hierarchy Documents will consist of Material and Presentation only. Some contents may also be presented using a mixture of Structural Components and Presentation without explicit material. For instance, forms with contents to be provided by the user can be generated in this way. In general, various mixtures of structure, contents and presentation are meaningful and therefore possible in LifeWeb.
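The recursive decomposition described above can be sketched in a few lines. This is a minimal illustration in Python, not the authors' Java classes: the field names, the `add` and `depth` helpers, and the reduction of Material and Presentation to single-field records are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Material:
    """Raw content (hypothetical minimal form: just a text payload)."""
    data: str


@dataclass
class Presentation:
    """Formatting directive (hypothetical: only a style name)."""
    style: str


@dataclass
class Document:
    """A node viewed along three dimensions: structure, material, presentation."""
    name: str
    material: Optional[Material] = None
    presentation: Optional[Presentation] = None
    components: List["StructuralComponent"] = field(default_factory=list)

    def add(self, component: "StructuralComponent") -> "StructuralComponent":
        self.components.append(component)
        return component

    def depth(self) -> int:
        """Height of the component tree rooted at this document."""
        return 1 + max((c.depth() for c in self.components), default=0)


@dataclass
class StructuralComponent(Document):
    """A component is itself a Document, so the decomposition is recursive."""


book = Document("mybook")
chapter = book.add(StructuralComponent("chapter1"))
# A lowest-level component carrying material and presentation only.
chapter.add(StructuralComponent("section1.1", Material("text"), Presentation("plain")))
```

Because a component is a Document, any operation written for a Document (here `depth`) applies unchanged at every level of the hierarchy.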

This recursive structure of LifeWeb documents allows the system to take care of the structural maintenance and customisation of documents without affecting the materials being restructured or the presentation of these materials. Such restructuring may include inserting, removing and replacing document components, possibly while maintaining their ordering and their integrity within the document hierarchy. Most structural links (such as the table of contents and navigational links) within the document can also be generated automatically by the system at runtime, eliminating a large maintenance cost. Structural Component objects hold a numberingScheme attribute, which defines whether a collection of them should be regarded as a sequence (possibly numbered), or a simple unordered set. It also specifies the format for numbering them.
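To illustrate why runtime generation removes the renumbering chore, the following Python sketch derives chapter numbers from the current position of each component instead of storing them. The class name, the `"decimal"` and `"none"` scheme values, and the choice to keep the scheme on the parent are assumptions for the example; the paper does not specify the format of numberingScheme.

```python
from typing import List, Optional


class Component:
    """A structural component whose children are either numbered or unordered."""

    def __init__(self, title: str, numbering_scheme: str = "decimal"):
        self.title = title
        # Assumed values: "decimal" numbers the children as a sequence,
        # "none" treats them as an unordered set.
        self.numbering_scheme = numbering_scheme
        self.children: List["Component"] = []

    def insert(self, child: "Component", index: Optional[int] = None) -> "Component":
        if index is None:
            self.children.append(child)
        else:
            self.children.insert(index, child)
        return child

    def remove(self, child: "Component") -> None:
        self.children.remove(child)


def table_of_contents(node: Component, prefix: str = "") -> List[str]:
    """Derive numbers from current positions, so nothing is stored per chapter."""
    lines = []
    for position, child in enumerate(node.children, start=1):
        if node.numbering_scheme == "decimal":
            number = f"{prefix}{position}"
            lines.append(f"{number} {child.title}")
            lines.extend(table_of_contents(child, number + "."))
        else:
            lines.append(child.title)
            lines.extend(table_of_contents(child, prefix))
    return lines


book = Component("LifeWeb")
intro = book.insert(Component("Introduction"))
model = book.insert(Component("Model"))
model.insert(Component("Documents"))
model.insert(Component("Materials"))
book.insert(Component("Prototype"))

before = table_of_contents(book)
book.remove(intro)  # the following chapters are renumbered on the next call
after = table_of_contents(book)
```

Removing the first chapter requires no edits to the remaining ones: the next call to `table_of_contents` simply reflects the new positions.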

Presentation

The format and layout of documents are defined by the Presentation objects. A repository of Presentation objects can be defined and reused for many document components. This separation of presentation from other aspects of the document frees authors from worrying about the document format and layout at authoring time, allowing them to concentrate on the provision of the content and its hierarchical and hyperlink structure. The task of formatting and layout can be carried out separately and, if necessary, professionally by a graphic design expert. It also makes it possible to define different presentations for the same content for various specific needs such as customisation, maintenance, and so on. The definition in Fig. 1 also permits documents without presentation. In this case some default or external presentation will be chosen (see below).

Materials

Material objects hold the real contents populating the given document hierarchy. These are unstructured, possibly fragmented, raw materials used to fill in the document objects or their structural components. Subclasses of the Material class (not shown in Fig. 1) define the actual data type of the content, such as text, image, audio, video, script, and so on. The same Material object can be shared and reused by several Document or Structural Component objects. This could considerably reduce maintenance costs in the case of frequently recurring text or graphics.

Hyperlinks

The separation between documents and materials also helps reduce the infamous broken link problem among remote sites, where references to a remote object are no longer valid because of its removal or migration. As Material objects are wrapped up in objects of the Document branch, they and their hardware-dependent attributes, such as their physical location and thus their URLs, are private to the document and hidden from the clients. Documents thus provide a level of indirection, making clients less dependent on the actual location of the document material. The referential integrity within the document itself and with the underlying file system is maintained by the LifeWeb system. Hyperlink is a class encapsulating the source and an anchor (text or image) from which the link emanates, a reference at which the link terminates, and a scope to specify how wide and how deep the link is defined. Hyperlinks are thus bi-directional and can be queried to trace the sources for a given destination, to find out whether the latter is still referenced by any remote document before it is removed. The scope attribute allows hyperlinks to be defined recursively in the document tree, saving maintenance costs in the case of frequently recurring links.
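The bi-directional query described above amounts to indexing each link by both of its ends. The Python sketch below shows one way this could look; `LinkRegistry`, its method names and the `"node"`/`"subtree"` scope values are hypothetical and are not taken from the LifeWeb prototype.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hyperlink:
    source: str          # id of the document the link emanates from
    anchor: str          # anchor text (or an image reference)
    target: str          # id of the document at which the link terminates
    scope: str = "node"  # hypothetical values: "node" or "subtree"


class LinkRegistry:
    """Indexes links by both ends so they can be followed in either direction."""

    def __init__(self) -> None:
        self._by_source: Dict[str, List[Hyperlink]] = defaultdict(list)
        self._by_target: Dict[str, List[Hyperlink]] = defaultdict(list)

    def add(self, link: Hyperlink) -> None:
        self._by_source[link.source].append(link)
        self._by_target[link.target].append(link)

    def links_from(self, source: str) -> List[Hyperlink]:
        return list(self._by_source.get(source, []))

    def sources_of(self, target: str) -> List[str]:
        return sorted({link.source for link in self._by_target.get(target, [])})

    def can_remove(self, target: str) -> bool:
        """A document is safe to remove only if nothing still refers to it."""
        return not self._by_target.get(target)


registry = LinkRegistry()
registry.add(Hyperlink("chapter1", "see the model", "chapter2"))
registry.add(Hyperlink("appendix", "model details", "chapter2"))
```

Here `registry.can_remove("chapter2")` is false while two documents still point at it, which is the check a safe-removal operation would need.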

In summary, the object-oriented design of LifeWeb and its separation into the three relatively independent aspects of contents, structure and presentation promise uniform and simplified methods for web document authoring and management, which are rooted in the document structure itself. The resulting methods allow documents to be maintained and managed efficiently, their layout and format suitably controlled, the authoring process facilitated, and referential integrity preserved or enhanced.

The Extensible Markup Language (XML) [HREF4]

The Extensible Markup Language, abbreviated XML, is the product of the collective work of the World Wide Web Consortium [HREF4]. Like the Hypertext Markup Language (HTML), it derives from the Standard Generalized Markup Language (SGML) [ISO8879] and allows the use of tags (including nested tags) to describe elements in a document. Unlike HTML, however, which defines only a fixed tag set for a single document type, XML is really a metalanguage. It permits the definition of tags and other markup syntax (including the definition of HTML in terms of XML). An XML document, for example, might look as follows:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter SYSTEM "http://www.lifeweb.org/schema/book.dtd">
<chapter id="chapter1" heading="CHAPTER 1">
  <section id="section1.1" heading="Section 1.1">
    This is the text of section 1
  </section>
</chapter>

Obviously, this example uses <chapter> and <section> tags not present in HTML. Such tags are defined in a Document Type Definition (DTD). A DTD is a single file, or several files, which formally define a particular type of document. It describes the schema of the corresponding document type, specifying factors such as what element types there are, their attributes and attribute types, their possible nesting, and so on. The DTD for the XML document in the above example defines the two elements chapter and section as follows:

<!ELEMENT chapter (section)>
<!ATTLIST chapter
    id        ID      #REQUIRED
    heading   CDATA   #IMPLIED>
<!ELEMENT section (#PCDATA|section)*>
<!ATTLIST section
    id        ID      #REQUIRED
    heading   CDATA   #IMPLIED>

When a DTD is declared in an XML document, as the file http://www.lifeweb.org/schema/book.dtd is in the above example, a validating parser uses it to check the XML document as the document is parsed.
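As a small illustration of how such a document is consumed, the Python sketch below parses a well-formed chapter and section document with the standard library and lists each element with its id and heading. Note that `xml.etree.ElementTree` is a non-validating parser: it reads the document type declaration but does not fetch or apply the DTD, so this shows parsing only, not validation. The `outline` helper is a name introduced for the example.

```python
import xml.etree.ElementTree as ET

# A well-formed chapter/section document. Bytes are used because the text
# carries its own encoding declaration.
XML_SOURCE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter SYSTEM "http://www.lifeweb.org/schema/book.dtd">
<chapter id="chapter1" heading="CHAPTER 1">
  <section id="section1.1" heading="Section 1.1">
    This is the text of section 1
  </section>
</chapter>
"""


def outline(element, depth=0):
    """Return one indented line per element: tag, id and heading."""
    line = f"{'  ' * depth}{element.tag} {element.get('id')}: {element.get('heading')}"
    lines = [line]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(XML_SOURCE)
section_text = root.find("section").text.strip()
print("\n".join(outline(root)))
```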

The openness of XML and the expressive power derived from its meta-level are its major advantages. Various document types can be defined and domain-specific "jargons" specified, meaningful to the respective communities.

XML also differs from HTML in that it forces the separation between document content and presentation, imposing the use of stylesheets to specify how the user-defined elements are presented in the browser.

LifeWeb XML

LifeWeb is a syntax-independent data model and, in principle, supports the binding to any schema description language that supports object orientation, such as the Interface Definition Language (IDL) [HREF2], XML [HREF4], and so on. We have chosen XML since it has become a W3C Recommendation and will very likely dominate in the Web of the future. A LifeWeb Schema (LWS) is represented as a Document Type Definition (DTD) [HREF4], where each LifeWeb class corresponds to a DTD element. In the example above, two classes, chapter and section, are defined. XML, however, is not fully object-oriented and inheritance is not supported. We provide for inheritance by the use of a special attribute "superclass" known to the LifeWeb system. The value of this attribute is the name of the superclass.
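The effect of the "superclass" attribute can be sketched as a walk up a chain of class names, merging attribute declarations so that more specific classes override general ones. The Python below is an illustration under assumptions: the in-memory `SCHEMA` dictionary, the class names in it and the merge order are invented for the example, and the paper does not describe how the LifeWeb system resolves the attribute internally.

```python
# Hypothetical in-memory form of a LifeWeb Schema: each class (DTD element)
# lists its own attribute declarations and names its superclass, if any.
SCHEMA = {
    "Document": {
        "superclass": None,
        "attributes": {"id": "ID", "heading": "CDATA"},
    },
    "StructuralComponent": {
        "superclass": "Document",
        "attributes": {"numberingScheme": "CDATA"},
    },
    "chapter": {
        "superclass": "StructuralComponent",
        "attributes": {},
    },
}


def ancestors(name, schema):
    """Follow the superclass attribute from a class up to the root."""
    chain, seen = [], set()
    while name is not None:
        if name in seen:
            raise ValueError(f"cyclic superclass chain at {name!r}")
        seen.add(name)
        chain.append(name)
        name = schema[name]["superclass"]
    return chain


def effective_attributes(name, schema):
    """Merge declarations root-first so subclasses override their ancestors."""
    merged = {}
    for cls in reversed(ancestors(name, schema)):
        merged.update(schema[cls]["attributes"])
    return merged


chapter_chain = ancestors("chapter", SCHEMA)
chapter_attributes = effective_attributes("chapter", SCHEMA)
```

With this arrangement, chapter declares nothing itself yet ends up with id, heading and numberingScheme through its ancestors.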

A LifeWeb object essentially holds information describing its own structure. Thus it could be expressed in any knowledge description language that supports object metadata and nesting, such as Telos [MYL90], XML [HREF4], etc. Again we have chosen XML as it is becoming the universally accepted representation for Web documents. Each component of the document is an XML element identified by a document-wide unique id. Its properties are described as attribute-value pairs. Attributes of the LifeWeb object are essentially the object metadata and can be designed for use in resource discovery, resource grading, and so on. Syntactically there is no difference between a LifeWeb document and an XML document. However, a LifeWeb object does not directly contain its materials, its presentation or structural components. Instead it references these. This enforces the complete separation of the structural aspect from other aspects of the document.

Custom Activation Schemes

Presentation is handled on the basis of structural units in LifeWeb. According to Fig. 1, each document entity can be associated with a presentation object, which is responsible for formatting the document object. We propose to use the well-established theory of translation schemes known from compiler construction [ARJ88] for activating the processing of documents and for customised selections and presentations either provided by the author of the web documents or driven by client profiles and search contexts.

Note that a document component hierarchy is very similar to a parse tree (or directed acyclic graph) of the flat presentation of a given document. In this analogy, the materials associated with a document component become the terminals (leaves) of such a tree. A document node itself manifests a production rule expressing the sequential composition of a document from its next lower-level components. In [KS89] the second author has shown how object-oriented grammars and translation schemes can be used for describing the decoration and activation of complex structured artifacts and for simplifying their customisation and maintenance.

These ideas are currently being applied to LifeWeb by viewing different presentations as named translation schemes. In LifeWeb we call these activation schemes. The name of a given activation scheme is associated with at most one presentation object for any given document. Each such named presentation corresponds to an activation rule for that document level. Together, the activation rules of the same name define the activation scheme. Calling for a particular presentation invokes only the rules of the requested scheme. As shown in [KS89], such a scheme can be implemented as a set of methods of the same name along the hierarchy of objects and their respective classes. The schemes permit the selection of relevant (or omission of irrelevant) substructures or links, or different presentations, possibly directed by parameters passed along the levels of the hierarchy or by attributes stored at the given level. Attributes stored with documents are typically related to their contents, history or author. Parameters passed ultimately represent client data or profiles, as they stem from client requests for specific searches or customised presentation.
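The idea of a scheme as a set of same-named rules along the hierarchy can be sketched as follows. This Python example is an assumption-laden illustration rather than the mechanism of [KS89] or of the prototype: the `rules` dictionary, the fallback `default_rule`, the scheme names `"full"` and `"by_author"`, and the use of an author attribute for selection are all invented for the example.

```python
class Document:
    def __init__(self, name, text="", **attributes):
        self.name = name
        self.text = text
        self.attributes = attributes   # e.g. author, history
        self.components = []
        self.rules = {}                # scheme name -> rule, at most one each

    def add(self, component):
        self.components.append(component)
        return component

    def activate(self, scheme, params):
        """Fire the rule of the requested scheme at this level only."""
        rule = self.rules.get(scheme, default_rule)
        return rule(self, scheme, params)


def default_rule(doc, scheme, params):
    """Emit this level's text, then apply the same scheme to every component."""
    output = [doc.text] if doc.text else []
    for component in doc.components:
        output.extend(component.activate(scheme, params))
    return output


def by_author_rule(doc, scheme, params):
    """Select only the components whose author attribute matches the request."""
    output = [doc.text] if doc.text else []
    for component in doc.components:
        if component.attributes.get("author") == params.get("author"):
            output.extend(component.activate(scheme, params))
    return output


book = Document("book")
book.rules["by_author"] = by_author_rule
chapter1 = book.add(Document("chapter1", "Chapter on models", author="nguyen"))
chapter1.add(Document("section1.1", "Section on documents"))
book.add(Document("chapter2", "Chapter on grammars", author="schmidt"))

full = book.activate("full", {})
selected = book.activate("by_author", {"author": "nguyen"})
```

The same tree yields the whole book under one scheme and a single author's chapter under another; the parameter passed with the request drives the selection, and the stored attribute decides it at each level.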

A component document is visited by either following the internal reference from a document to its structural component or by following an external hyperlink from one document to the next. Thus when LifeWeb follows such a link in the context of a given request, it activates the link, so to speak, by firing the respective activation scheme at that level, i.e., by activating the respective presentation rule with the relevant parameters. Activation rules can select subdocuments, thus applying the current scheme to the respective object. Alternatively, they can fire off other actions, for instance primitive presentation functions such as the generation of a table, of forms, or other XML output. The activation of another scheme is also possible as one of the actions not directly following the hierarchical structure. In this way, searches for other relevant documents, or retrieval from databases, can be expressed.

In our present proof-of-concept prototype, we have realised only a table and a list format as initial primitive activations. The web author simply provides values for the necessary attributes of the required format, such as the number of rows and columns, the item delimiter, etc., and the presentation object will put the document contents into the proper format.
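A rough idea of what such primitive presentation objects do is given below. The Python classes, their constructor parameters and the plain-text output with `" | "` as cell separator are assumptions for the example; the prototype's actual attribute names and output markup are not given in the paper.

```python
class ListPresentation:
    """Joins the contents of a document level with an author-chosen delimiter."""

    def __init__(self, delimiter=", "):
        self.delimiter = delimiter

    def format(self, items):
        return self.delimiter.join(items)


class TablePresentation:
    """Lays the contents out row by row with a fixed number of columns."""

    def __init__(self, columns):
        if columns < 1:
            raise ValueError("a table needs at least one column")
        self.columns = columns

    def format(self, items):
        rows = [items[i:i + self.columns] for i in range(0, len(items), self.columns)]
        return "\n".join(" | ".join(row) for row in rows)


contents = ["text", "image", "audio"]
as_list = ListPresentation("; ").format(contents)
as_table = TablePresentation(columns=2).format(contents)
```

The contents are untouched in both cases; only the presentation object attached to the document level differs.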

If no presentation object is specified, the object will be displayed as is without any formatting (for example, as a text chunk), unless formatting is handled externally by stylesheets such as the Extensible Stylesheet Language (XSL) [HREF4] or Cascading Style Sheets (CSS) [HREF4]. Either way, the presentational aspect is separated from the structural and content aspects of the document. All three views of the document (structure, content, presentation) are thus supported.

Implementing LifeWeb

The LifeWeb model is implemented as a set of Java classes and servlets. A request for a LifeWeb document is received by the LifeWebServlet, which loads the document and passes it on to an XML parser for processing. We have derived an LWParser from the XML Parser (XML4J) developed by IBM alphaWorks [HREF1] to parse the document into LifeWeb objects. ElementHandlers can be registered with the parser before the document is read, allowing LifeWeb elements to be properly processed while the document is being parsed. Our system has three groups of ElementHandlers: the ContentHandlers are responsible for loading the contents and inserting hyperlinks into the document; the StructureHandlers for laying out and ordering structural components; and the PresentationHandlers, which are registered with the StructureHandlers, for formatting structural components before they are returned to the LifeWebServlet. All the handler objects interact with LifeWeb objects for services specific to each LifeWeb class. The fully filled-in and formatted document is finally returned to the servlet to be written back to the client. Figure 2 depicts the architecture of the system.
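The handler registration pattern can be sketched independently of the Java implementation. The Python below does not reproduce the XML4J or LWParser interfaces: `LWParserSketch`, `register` and the two handler functions are hypothetical names, and handlers are dispatched by element tag after the tree is built, which is a simplification of handling elements during parsing.

```python
import xml.etree.ElementTree as ET


class LWParserSketch:
    """Dispatches each parsed element to the handlers registered for its tag."""

    def __init__(self):
        self._handlers = {}

    def register(self, tag, handler):
        self._handlers.setdefault(tag, []).append(handler)

    def parse(self, xml_text):
        root = ET.fromstring(xml_text)
        results = []
        for element in root.iter():  # visits elements in document order
            for handler in self._handlers.get(element.tag, []):
                results.append(handler(element))
        return results


def structure_handler(element):
    """Stands in for a StructureHandler: records the component being ordered."""
    return f"structure:{element.get('id')}"


def content_handler(element):
    """Stands in for a ContentHandler: loads the element's own text."""
    return f"content:{(element.text or '').strip()}"


parser = LWParserSketch()
parser.register("chapter", structure_handler)
parser.register("section", structure_handler)
parser.register("section", content_handler)

events = parser.parse(
    '<chapter id="chapter1"><section id="section1.1">Text</section></chapter>'
)
```

Handlers are registered before the document is read, as in the text, so adding a new group of handlers does not require changes to the parser itself.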

 

Figure 2 - LifeWeb system architecture

 

Creating and managing documents with LifeWeb

The design of the LifeWeb system allows for LifeWeb objects to be created and managed in a simple "drag-and-drop" paradigm. Figure 3 shows a potential user interface of the LWManager, which is to be built on top of the LifeWeb engine. It is very similar to the Windows File Manager. The LWManager has two views, Material View and Presentation View. In each view the window always has two parts, one containing objects of the document branch, and the other objects of the material branch or presentation branch.

In the Material View the Document part shows the structure of (part of) a Document object, which is very similar to that of a DOS or UNIX file system. Items from the Material part can be dragged and dropped onto those in the Document part, to be included in a Document object. (Note that an inclusion of a Material object in a Document object is represented as a reference to the former in the latter. As Material objects are separated from Document objects, the contents of the included Material objects are not copied into the Document objects.) A Material object can be included in multiple Document objects, so it is possible to include, for example, the same logo in each section. Dragging and dropping items within the Material part behaves exactly like dragging and dropping files and folders in a file system. Items in the Document part can also be included or referenced (hyperlinked) in one another by the same drag-drop operation, with inclusion being the default action (left-click, drag and drop), while reference is selected with a right-click, drag and drop. The usual restrictions are imposed on these operations, such as that a component cannot be included in its sub-component or itself, and so on.
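The restriction that a component cannot be included in itself or in one of its own sub-components is a cycle check on the document tree. A minimal Python sketch follows; the method names `contains`, `can_include` and `include` are assumptions for the example and are not part of the LWManager design described here.

```python
class Document:
    def __init__(self, name):
        self.name = name
        self.components = []

    def contains(self, other):
        """True if other is this document or any of its transitive components."""
        return other is self or any(c.contains(other) for c in self.components)

    def can_include(self, component):
        """A drop is legal only if the dragged component does not contain the target."""
        return not component.contains(self)

    def include(self, component):
        if not self.can_include(component):
            raise ValueError(f"{component.name} cannot be included in {self.name}")
        self.components.append(component)


book = Document("book")
chapter = Document("chapter1")
section = Document("section1.1")
book.include(chapter)
chapter.include(section)
```

A user interface would call `can_include` while the item is being dragged, so that an illegal drop target can be refused before anything is changed.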

Figure 3 - LWManager: Material View

In the Presentation View dragging and dropping items from the Presentation part onto items in the Document part creates an association between the two. The user will then be asked to enter values specific to each Presentation object (Figure 4).

Figure 4 - LWManager: Presentation View

Hyperlinks can generally be inserted in objects of the document branch using the Insert menu, or in a drag-drop operation as described above within the same document.

Figure 5 - LWManager: File/New Menu

The user finds (part of) the LifeWeb schema in the File/New menu, where a class can be chosen to create a new object. Figure 5 shows the File/New menu in the Material View with the upper part containing classes of the Document branch and the lower part classes of the Material branch. In the Presentation View Presentation classes will be shown instead of Material classes. The Generate XML… menu command will generate the XML file based on the information visually presented in the LWManager.

Each object has a set of properties that can be displayed and edited as shown in Figure 6. Some properties such as Header and Footer are defined recursively: a value set on a component applies to its sub-components unless it is overridden by another value at a lower level.
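One plausible reading of such recursively defined properties is a lookup that climbs from an object towards the root until a value is found. The Python sketch below illustrates that reading; `LWObject`, `resolve` and the sample property values are hypothetical and are not taken from the prototype.

```python
class LWObject:
    def __init__(self, name, parent=None, **properties):
        self.name = name
        self.parent = parent
        self.properties = dict(properties)

    def resolve(self, key, default=None):
        """Return the nearest value for key, searching from this level upwards."""
        node = self
        while node is not None:
            if key in node.properties:
                return node.properties[key]
            node = node.parent
        return default


book = LWObject("book", Header="LifeWeb", Footer="Monash University")
chapter = LWObject("chapter1", parent=book, Header="Chapter 1")
section = LWObject("section1.1", parent=chapter)

section_header = section.resolve("Header")
section_footer = section.resolve("Footer")
```

The section defines neither property, so it takes its Header from the chapter, which overrides the book, and its Footer from the book.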

Figure 6 - LWManager: Object Property

As real contents in a LifeWeb document are only linked and not embedded, user access to the document is independent of the physical location of these contents, provided that the internal referential integrity of the document is maintained. The move() method in the Material class provides this support. Referential integrity to remote documents still cannot be controlled. A LifeWeb document, however, being so compact, can be used to describe an entire web site. Existing hierarchical structure will then be modelled by the hierarchy of documents and their components. Relational (link) structures interweaving these documents will be represented by local Hyperlinks whose integrity is maintained by LifeWeb. At worst, an artificial "top" document will have to be introduced to fit the whole into the expected structure. Clients will not be aware of the top document. The number of document URLs to be maintained explicitly by local web authors would thus be considerably reduced, and with it the likelihood of incomplete or inconsistent links.
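How a move() operation can preserve internal referential integrity is sketched below: documents refer to a material by a stable id, and only the material object knows its physical location. The paper names the move() method but not its signature, so the parameter, the shared `store` mapping and the file paths in this Python example are assumptions.

```python
class Material:
    def __init__(self, material_id, location):
        self.material_id = material_id
        self.location = location  # physical location, hidden from clients

    def move(self, new_location):
        """Relocate the content; the id that documents refer to is unchanged."""
        self.location = new_location


class Document:
    def __init__(self, name, store):
        self.name = name
        self._store = store        # shared mapping: material id -> Material
        self.material_ids = []

    def include(self, material):
        self._store[material.material_id] = material
        self.material_ids.append(material.material_id)

    def locations(self):
        return [self._store[mid].location for mid in self.material_ids]


store = {}
logo = Material("logo", "/srv/site/images/logo.gif")
home = Document("home", store)
about = Document("about", store)
home.include(logo)
about.include(logo)

logo.move("/srv/media/logo.gif")  # one update, no document is edited
```

Both documents resolve the logo to its new location after the move, because neither of them ever stored the location itself.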

Related work

Data modelling has recently become a topic of considerable research interest on the Web, with XML in the lead, followed by many related and supporting technologies such as XLink, XSL (Extensible Stylesheet Language), XSchema, RDF (Resource Description Framework), RDFS (Resource Description Framework Schema) [HREF4], and so on. XML provides the syntax to represent Web documents, RDF the data model to describe Web resources. LifeWeb differs from both in that it is a syntax-neutral model focusing on representing Web documents by describing them using metadata. It can use both the XML syntax to represent its objects and the RDF concepts to describe its resources. Of the above, LifeWeb is the only model that provides a processing capability to Web objects.

Some authoring tools such as NetFusion also offer structural and presentational views. The structural view in NetFusion, however, relates to the structure of the web site, not that of the document.

Hyper-G is a system originally developed at the University of Graz that separates the hierarchical structure of documents from their hyperlinks. Links are automatically generated and not embedded in the documents themselves [Mau98]. Hyper-G uses SGML, a precursor of XML. Hyper-G, however, does not separate contents and presentation. Furthermore, Hyper-G is a very large system which reportedly is difficult to customise or extend. It is written in C and C++, and Java bindings are not available to our knowledge. More recently the system has been commercialised under the name of Hyperwave and has attracted a number of European and international awards for its innovative separation of links from documents (separating and automating the indexing of web documents). With the commercialisation, however, the research prototypes and underlying concepts of Hyperwave have become proprietary and inaccessible for research and study. We have not found any project approaching the management of live web documents from a content point of view.

 

 

Future work and conclusion

This paper describes the implementation of a prototype of LifeWeb and explains how this model can facilitate the process of creating and managing Web sites in a simple drag-and-drop paradigm. This has been made possible by augmenting Web objects with processing capabilities and by separating the structural, content and presentational aspects of the document. The LifeWeb project does not aim at providing its own schematic model or syntax, but will use that of the language chosen to represent the LifeWeb model, currently the Document Type Definition (DTD) [HREF4]. A DTD, however, does not fully support object-orientation. While many object-oriented schemas are being proposed (XSchema, RDFS, etc.), we simply provide for the requirement by the use of a special attribute "superclass" and a systematic activation of generalised presentation rules.

The initial prototype and its efficiency are promising, albeit still simplistic at this stage. In this prototype we have not yet incorporated XLink or other emerging hyperlink proposals. Future research will also include object genes and a life design concept [NGU98.2], whereby some structural and presentation rules are self-activating and describe the acquisition of structure in an unstructured collection of documents or the adaptation of document structure and/or presentation to an evolving environment and changing user or author requirements.

Acknowledgement

We are grateful to Dr. Sayed Sajeev who patiently read draft versions of this paper and provided helpful feedback.

Bibliography

[ARJ88] A.V. Aho, R. Sethi and J.D. Ullman, "Compilers: Principles, Techniques, and Tools", 1988

[Byte95] "Hyper-G organizes the Web", BYTE Nov. 1995

[KS89] B. Kramer and H-W. Schmidt, "Developing Integrated Environments with ASDL" pp. 98-107, IEEE Software, 1/1989

[Mau98] H. Maurer "Web-based knowledge management" pp. 122-123, IEEE Computer, 3/1998

[MYL90] J. Mylopoulos et al, "Telos: Representing Knowledge about Information Systems" pp. 325-362, ACM Transactions on Information Systems, Vol 8, No 4, 10/1990

[NGU98.1] T-L. Nguyen et al, "Object-Oriented Modeling of Multimedia Document" pp. 578-582, Proceedings of WWW7 Conference, Computer Networks and ISDN Systems: The International Journal of Computer and Telecommunications Networking, Vol 30, No 1-7, 1998

[NGU98.2] T-L. Nguyen et al, "LifeWeb: An Object-Oriented Model for the Web" pp. 301-308, Proceedings of SCI'98 & ISAS'98 Conference, Vol. 3, 1998

[ING95] D.B. Ingham et al, "Bringing Object Oriented Technology to the Web", WWW4 International Conference Proceedings [HREF5], 1995

[ING97] D.B. Ingham et al, "Supporting Highly Manageable Web Services", WWW6 International Conference Proceedings [HREF6], 1997

[OMG] OMG IDL Syntax and Semantics [HREF2]

[UML] UML Reference [HREF3]

[W3C] World-Wide-Web Consortium [HREF4]

[XML] XML Parser in Java, IBM alphaWorks [HREF1]

Hypertext References

HREF1
http://www.alphaWorks.ibm.com/Home/
HREF2
ftp://www.omg.org/pub/docs/formal/97-09-08.pdf
HREF3
http://www.rational.com/uml/references/
HREF4
http://w3.org
HREF5
http://w3objects.ncl.ac.uk/pubs/bootw/. Also at http://www.w3.org/pub/Conferences/WWW4/Papers2/141/
HREF6
http://w3objects.ncl.ac.uk/pubs/shmws/


Copyright

Thuy-Linh Nguyen and Heinz Schmidt, (C) 1999. The author assigns to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The author also grants a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.




AusWeb99, Fifth Australian World Wide Web Conference, Southern Cross University, PO Box 157, Lismore NSW 2480, Australia Email: "AusWeb99@scu.edu.au"