Beyond Word: simple website creation via a support tool for web based communication, publishing and teaching

Daniel Barnes, Web Publisher, SET Portfolio. Email: daniel.barnes@rmit.edu.au

Linda Pannan, Director, Online Learning - Development & Research, SET Portfolio. Email: linda.pannan@rmit.edu.au

Neeraj Arora, Consultant Programmer. Email: narora@narora.net

ASSETT Research Group, Advancing Scholarship and Science Education through Technology, RMIT University, Melbourne, 3000.

Abstract

Our experience of having to convert considerable volumes of large MS Word documents into navigable websites, amongst other problems, prompted the search for a simple tool that transforms MS Word documents into well structured, indexed, standalone websites at the click of a few buttons. This paper describes the development and success of such a tool. It is now operational and provides an open web-based system that is fully scalable and has no ongoing costs. The websites it creates are of professional quality as well as being searchable, reusable and fully customisable.

The aim of the work was to make the web more accessible to teaching staff, to allow them to publish academic standard websites without knowledge of HTML tools and to broaden the base of academic publishers on the web. Academics may use this tool to produce web based learning materials and it could be useful to them (and others) in publishing other materials such as academic papers on the web, although this is yet to be explored.

Introduction

Distribution of learning and information resources via the Web has become an accepted, and often expected, part of the educational experience. Student uptake continues to increase, yet studies of current levels of use of digital technology and online development tools by many educators reveal a definite preference for easy to use software, with the use of more complex tools being the realm of a few enthusiasts (Bennett, Priest, and McPherson 1999; Pannan and McGovern 2003). These educators are nevertheless well aware of the potential of the Web as a learning resource and appreciate the possibilities of interactivity via the online medium, but they lack the resources, the time needed to develop the learning materials and environments they wish for their students, and the time to learn the skills that might enable them to do all this efficiently (O'Reilly, Ellis and Newton 2000; Pannan and McGovern 2003; Rossiter 1999).

Many academics rely on professional electronic media developers to create and maintain their web based learning materials. But, as Holt and Segrave (2003) warn, there is the possibility of technology and pedagogic disjunctions emerging that will diminish the educational value of such material unless the academic teacher takes a foreground role in providing educational input in support of student learning. An alternative is the provision of simple tools that will allow the academic to take a more active role in the development process. Although a plethora of tools for web production is available, many are complex, costly or simply not suited to an academic learning environment, which gives weight to the question posed by Laurillard (2002): ‘can they take a form that is simple enough to place the control of learning design with the teacher, rather than the software designer?’

Use of MS Word

Microsoft Office has been a standard tool of choice for authoring content in academia for over a decade. While this software has proven very useful in the printing, sharing and dissemination of course materials, Microsoft Word and PowerPoint have limitations in online learning environments. The structures inherent in Word documents, such as major topics and subtopics, are not converted to their web equivalent representation during the <Save as a Web Page> function provided for Word conversion. The resultant HTML page is a completely linear web resource navigable only by scrolling to reveal a few paragraphs of content at any time within a screen that is without context to the entire document. The viewer does not have a reminder of where they have been or where they might be headed, causing disorientation and quelling motivation.

Converting MS Word documents to web standard formats suitable for online learning requires the academic content creator to have expertise in web publishing or access to web publishers. This web publishing step requires repetitive manual copying and pasting of each page, repetitive manual hyperlinking from the menu panel to the appropriate content, and repetitive reformatting for web viewing. There is considerable room for error, which in turn requires rigorous proofing. Given that all this is performed on a document that has already been created in MS Word, and presumably already structured, styled and proofed by the academic as they created it, this is clearly unnecessary double handling.

Our experience in developing course materials for international online learning delivery has been that considerable volumes of large Word documents required conversion to navigable websites. The time pressures, for both the academic and the web publishing teams, were enormous. Delays in meeting author deadlines often left only a matter of weeks for conversion to HTML. Given that the authors had already considered the pedagogy when creating the Word documents, and assuming that the same learning designs can be effective across different media (Mayer 2003), the question has to be asked as to why this material couldn’t be automatically converted to a standalone navigable website that reflected this pedagogy.

Possible automated production of a well structured, indexed website

As authors create MS Word documents they consider the structure of the content, grouping it into logical and manageable chunks. Significant shifts in content are generally identified by different headings, and the relative importance of the content is weighted by heading levels. This simple process is a standard and conveys much implicit meaning to the reader, and the learner. Main headings indicate what the key learning objectives may be, subheadings suggest how significant a concept may be to their study, and the table of contents or index provides a clear guide to how key concepts fit together.

A common standard in online web material is a left menu panel that loads the appropriate content into the page when activated. This menu panel is much like a Table of Contents in MS Word, breaking content into logical chunks and placing each chunk into the context of the whole. Generally, menu panels remain visible on each page. Websites also often have a banner at the top of the page, which is consistent for every page and provides a generic label for the content, again rather like the header or footer in an MS Word document.

The identification of simple linkages between these two representations of the same resource material raises the possibility of finding or developing a tool that converts a simple MS Word document into a standard well structured and indexed website. If Word styles could be used to generate a table of contents, which behaves as a navigation panel, and also to break the content into the corresponding chunks of information, then automation of the repetitive conversion tasks should be possible. Web publishers might then have the time available to value-add to the online courseware, for example by applying design aesthetics, improving graphics and creating animations. Also, if academics can be removed from the web publishing they gain valuable time for enhancing the content. Appropriate use of both specialist resources should lead to enhancement of the learning experiences developed for the learner.

This paper explores the process of automated conversion of Word documents into indexed websites. First, current tools for this task are investigated, the specific requirements are explored, and issues encountered in development of the tool are described in brief. Finally, the implementation and future developments are discussed, culminating in concluding remarks.

Automating MS Word to Website conversions

Currently available tools

Several software tools for conversion of Word documents to stand-alone navigable websites are currently on the market. A review of these software products against the following tool competency criteria

  1. Usability for content creators
  2. Flexibility of the software to process the idiosyncrasies of varied content creators
  3. Customisability of output
  4. Reusability of input and output
  5. Scalability of the software up to University wide implementation
  6. Affordability including startup and ongoing costs
  7. System requirements

established that our requirements could not be met by existing third-party software. While some software appears easy to use, the output cannot be customised. Other software requires content to be created in a specific Word template. All require some initial installation of the software. Where start-up costs are low, ongoing costs are high. Licensing fees make University wide rollouts costly and would require considerable consultation before implementation. The two software packages of most interest are ‘Word To Web v.2.5’ (SolutionSoft 2004) and ‘HTML Express v.6.0.5’ (Logictran 2003). The former requires an intermediate understanding of web publishing and website structures, does not allow easy post-conversion customisation and is excessively expensive if implemented at University scale. The latter is less complicated to use and does allow post-conversion customisation with a cascading style sheet; however, it has ongoing yearly maintenance costs.

While several organisations report satisfaction with some third-party software, the criteria they employed for product implementation were not as rigid, and the method of delivery not as flexible, as is demanded by the academic sector (State Services Commission 2003). No software product currently available offered the usability, flexibility, customisability, scalability and affordability we required.

Our requirements

The key specifications for an MS Word to navigable Website conversion tool that will provide adequate functionality for the academic sector are identified as follows.

1. System must require very minimal user training.

Ideally there would be no training requirement as use of the tool will be simple and well guided; the skills the MS Word users already have should be adequate. Reasons for this requirement are that there is no budget for such training, particularly if the tool is made available University-wide, and that a training requirement will act as a barrier to use due to lack of time and motivation factors. Considering authors are currently not actively involved in the web development, it seems unlikely they would perceive a benefit from undertaking the training.

It is also recognised that some users will lack even a minimal knowledge of Microsoft Word styles. Keeping user training to a minimum for these users would require real-time processing, in which the user manually nominates the specific locations where the document will be split into web pages.

A further implication of minimal user training is that the system must not alter the appearance of the input file during the conversion process, so that the output is ‘what you see is what you get’.

Diagram 1: User interaction diagram from MS Word file to standalone website conversion

2. System should have a web based interface.

A web based interface became a critical requirement because client-side installation of software across a large institution is time consuming, and therefore costly. Such activity needs careful consideration and considerable justification before it can occur. Further, client-side installation inevitably requires ongoing IT support. A web-based interface that downloads software onto the desktop was also rejected, as IT functions in large organisations carefully manage levels of security, and many users would not be administrators of their own desktops and would therefore be unable to authorise software downloads from the web.

A less apparent implication of this requirement is that the system will not be able to manage or store user files, shifting the responsibility of file management back to the user and eliminating the inevitable ongoing support such functions require.

3. System must allow customisation and enhancement of website post-conversion.

Reallocating web development resources to enhancing learning materials was a key driver in identifying the need for this system. The potential for ready re-purposing of the material for multiple deliveries is very attractive and is a step towards the goal of object-orientated development.

Diagram 2a: The WoW web interface

Diagram 2b: WoW handling MS Word styles to specify page breaks

A tool to place ‘Word on the Web’

Overview

The WoW (Word on the Web) system we developed operates as described in Diagram 1. The content creator converts their Word document to an HTML file using the standard Word <Save as a Web Page> function. Next, they access the WoW web page (see Diagrams 2a and 2b) to select one of the processing options and upload their HTML file, plus any associated files such as graphics, as a zip file. The server then processes this input zip file according to the selected option and generates a standalone website in a frameset, with an extracted Cascading Style Sheet. This output is placed on the server for a limited time and the user is informed of its availability via email.
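The frameset output described above can be illustrated with a minimal Python sketch. This is not the WoW implementation, which is written in PHP and PERL; the file names (menu.html, page1.html, styles.css), the frame names and the 25%/75% column split are all assumptions made for illustration.

```python
from html import escape


def build_frameset(title: str, menu_file: str = "menu.html",
                   first_page: str = "page1.html") -> str:
    """Return a frameset page with a fixed left menu and a content frame."""
    return (
        "<html>\n"
        f"<head><title>{escape(title)}</title></head>\n"
        '<frameset cols="25%,75%">\n'
        f'  <frame name="menu" src="{escape(menu_file)}">\n'
        f'  <frame name="content" src="{escape(first_page)}">\n'
        "</frameset>\n"
        "</html>\n"
    )


def build_menu(entries: list[tuple[str, str]],
               css_file: str = "styles.css") -> str:
    """Return the menu page; each link loads its page into the content frame."""
    links = "\n".join(
        f'<p><a href="{escape(href)}" target="content">{escape(text)}</a></p>'
        for text, href in entries
    )
    return (
        "<html>\n"
        "<head>"
        f'<link rel="stylesheet" type="text/css" href="{escape(css_file)}">'
        "</head>\n"
        f"<body>\n{links}\n</body>\n"
        "</html>\n"
    )
```

Because the menu and every content page would link to the same extracted style sheet, editing that one file is enough to restyle the whole site after conversion.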

Technical design, constraints and implementation

The tools used in developing this web based application were selected on the basis of their reliability, stability under load and ease of use, to minimise development time. Apache is used as the web server, the PHP scripting language shortened deployment time, and Solaris and PERL provide text processing power together with reliability and stability in running critical applications. The design of WoW as a remote processing system, centrally managed to ease upgrades and patches in further development stages, places some constraints on its use. Hence, WoW is an open system that does not encompass document storage, does not store user data, profiles or input/output files and, therefore, does not need a login.

Two options for processing the HTML input document are offered in WoW:

  1. Styled document processing, which requires no further interaction from users.
  2. Tagged document processing, which enables users to define the sub-pages of their website.
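As an illustration of option 1, the following Python sketch splits an HTML document into chunks at each heading in a single pass. It is a sketch under stated assumptions rather than the WoW code: it assumes that Word heading styles arrive in the HTML as h1 to h3 elements, the class name HeadingSplitter is hypothetical, and comments and declarations in the input are simply ignored.

```python
from html.parser import HTMLParser


class HeadingSplitter(HTMLParser):
    """Split HTML into chunks at each h1/h2/h3 start tag, in one pass."""

    SPLIT_TAGS = {"h1", "h2", "h3"}

    def __init__(self):
        # Keep character references as written so the markup is not altered.
        super().__init__(convert_charrefs=False)
        self.chunks = []          # each chunk: {"level", "title", "html"}
        self._in_heading = False

    def _append(self, text):
        # Content before the first heading goes into a level 0 chunk.
        if not self.chunks:
            self.chunks.append({"level": 0, "title": "", "html": ""})
        self.chunks[-1]["html"] += text

    def handle_starttag(self, tag, attrs):
        if tag in self.SPLIT_TAGS:
            # A heading starts a new chunk; its level comes from the tag name.
            self.chunks.append({"level": int(tag[1]), "title": "", "html": ""})
            self._in_heading = True
        self._append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._append(f"</{tag}>")
        if tag in self.SPLIT_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.chunks[-1]["title"] += data   # heading text becomes the menu entry
        self._append(data)

    def handle_entityref(self, name):
        self._append(f"&{name};")

    def handle_charref(self, name):
        self._append(f"&#{name};")
```

Each chunk's title and level would supply one entry in the navigation panel, and its html would become one page of the site.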

In option 2 the user is required to insert hidden tags in the form of hidden Word text, such as “WoW_level_1: <Title for the index>” into their Word document. This tag is converted to an HTML anchor, when the Word document is converted to HTML, and indicates the position of the document breaks, and the level of the web page to be created.
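A marker of this kind could be recognised along the following lines. This Python sketch assumes the marker text survives conversion in exactly the form shown in the example above; the form of the anchor that Word actually emits is not specified here, and the function names are hypothetical.

```python
import re

# Assumed marker format, based on the example "WoW_level_1: <Title for the index>".
MARKER = re.compile(r"WoW_level_(\d+):\s*(.+)")


def parse_marker(text: str):
    """Return (level, title) for a WoW marker string, or None if it is not one."""
    match = MARKER.fullmatch(text.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()


def toc_lines(markers):
    """Indent each title by its level to give a plain-text table of contents."""
    return ["  " * (level - 1) + title for level, title in markers]
```

The level number decides where the entry sits in the navigation hierarchy, and the title becomes the text of the menu link.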

The WoW design separates presentation from processing, and consists of four sub-components, which are atomic with respect to software design. These are:

  1. the WoW HTML interface, including success and error pages
  2. the PHP/PERL script, to process the files submitted using the WoW HTML interface
  3. the PERL script, to process the input HTML files
  4. the CRON (Scheduler) job, to organise the running of the PERL scripts and produce the output zip file, and the email that notifies the user of the URL pointing to its location.

For the first component, the web designer is free to make an interface fitting their specific need. This includes making static success or error pages. The interface submits the input files and selected options to the backend WoW script responsible for collecting files from the user. A CRON (scheduler on UNIX) job is initiated to run at a defined time. This takes a snapshot (list and/or contents) of all files deposited at that time and runs the WoW processing script on them, serially. Since a CRON job runs like a user process, there are no time constraints imposed on the script. The system stores no files belonging to the user and the output file is also deleted after a set time.
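The scheduled batch step can be sketched in Python as follows. This is an illustration only, not the WoW scripts: the inbox and outbox directories, the function names and the process callback (a stand-in for the WoW processing script) are assumptions, as is the choice to delete each input file once it has been processed.

```python
import time
from pathlib import Path
from typing import Callable, Optional


def snapshot(inbox: Path) -> list[Path]:
    """List the zip files present right now; later uploads wait for the next run."""
    return sorted(inbox.glob("*.zip"))


def run_batch(inbox: Path, process: Callable[[Path], None]) -> int:
    """Process every file in the snapshot one after another; return the count."""
    jobs = snapshot(inbox)
    for job in jobs:
        process(job)   # hypothetical stand-in for the WoW processing script
        job.unlink()   # assumption: the uploaded input is removed once processed
    return len(jobs)


def purge_expired(outbox: Path, max_age_seconds: float,
                  now: Optional[float] = None) -> list[str]:
    """Delete output files older than max_age_seconds; return their names."""
    now = time.time() if now is None else now
    removed = []
    for item in sorted(outbox.glob("*.zip")):
        if now - item.stat().st_mtime > max_age_seconds:
            item.unlink()
            removed.append(item.name)
    return removed
```

Taking the snapshot once at the start of the run means files uploaded while the batch is in progress are left for the next scheduled run.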

If heavy use of the system occurs, load balancing strategies may need to be invoked. Since there is no constraint in WoW that requires session management, load balancing can take place through use of DNS round robin resolution, or from the gateway to the server cluster. The bulk of the processing occurs offline, and the main component of this is the CRON jobs (scheduled jobs, processes or programs) that initiate the non-interactive WoW scripts. These can be configured to run in off-peak hours for a particular WoW server, which distributes the load on the server over time and improves the system performance as well as the shared server's throughput. Hence, WoW can run on many shared, load balanced machines, reducing or removing the need for specialised hardware.

Finally, it is worth noting that a third, real time processing option was explored. This option enabled users who had neither styled nor tagged their documents to insert sub-page splits interactively. It was eventually abandoned, as long processing times for some documents exceeded the server time-out. Investigation of the cause of the problem identified several issues of relevance to the former two options.

Diagram 3: An example of Website output from MS Word course notes converted through WoW

The real time processing option used PHP scripts served by Apache with libphp4.so (PHP's dynamically loadable module for Apache). The scripts were written to accept a zip file containing the MS Word created HTML document and related images. After decompression, the XML parser functions made available by PHP were used to process the HTML file, which contains the xHTML produced by Word.

Unfortunately, the parse time increases with the number of MS Word tags in the document, and removing or cleaning up these tags is not an option since it would spoil the formatting we are trying to preserve. Added to this, real time processing requires the system to re-parse the previously parsed sections of the HTML document on every split request, both to locate the split and to locate the bounds of the data between consecutive real time splits. Further, the real time option requires more than one user initiated request over a common context, that is, the extracted Cascading Style Sheet, the input document, the output website and the two bounds on the chunked data. This creates the need for a common session management backend, such as a database server, if load balancing is to be maintained. These issues all result in an excessive processing overhead for the real time option.

By comparison, the styled or tagged processing options require only one pass of the parser through the HTML document. While long processing times and server time-outs occurred for the real time processing of very large documents and documents with many MS Word tags, no time-outs were experienced for these two options. Since the number of document splits does not alter the total number of parses performed on the HTML document for the first two options, only the overall length of the document, the number of tags and, possibly, interference from the shared server processes could lead to a scheduler time-out in this system, and this is unlikely to occur readily.
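The difference in parsing work can be made concrete with a deliberately simplified cost model in Python. The model is an assumption made for illustration, not a measurement of WoW: it counts parser tokens only, and it assumes each real time split request re-parses the document from the start up to the position of that split.

```python
def tokens_parsed_single_pass(doc_tokens: int) -> int:
    """Styled or tagged processing: the document is parsed once, end to end."""
    return doc_tokens


def tokens_parsed_real_time(split_positions: list[int]) -> int:
    """Real time processing: every split request re-parses up to its own position."""
    return sum(split_positions)


# A 10,000 token document split at five evenly spaced points.
splits = [2000, 4000, 6000, 8000, 10000]
single = tokens_parsed_single_pass(10000)   # 10000 tokens
real_time = tokens_parsed_real_time(splits) # 30000 tokens
```

Under this model the single pass cost depends only on document length, whereas the real time cost grows with both the length and the number of splits.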

WoW! Where is it now?

Current status

The system is currently in the user-testing and implementation phase. A group of users was identified during the initial scoping phase based on their perceived need for a system such as this. They are completing a usability test that will be analysed in a report on the interface, help support and functionality of the system. This will highlight where users are becoming confused or frustrated, and test how intelligent and robust the functionality is. Some further development of WoW may be required.

Use of WoW

Potential users of the WoW system within the University include academics, Web Publishers, Research and PhD students, and administrators, who could use the system for:

There are also a number of potential uses outside the University, particularly in activities that require online publication of manuals, policies and procedures, and support documentation. WoW would be particularly useful for functional areas that do not have existing web publishing expertise.

Advantages of WoW

The advantages of the system include that operation of the WoW modules is transparent to the user, and no user training is required (see Diagram 5). It is easily accessible on the Web and is fully scalable, no client-side installation is required and there are no ongoing costs. WoW is able to process any styled Word document, it places minimal demands on the server and no file management or storage is required. Finally, the Website output produced by WoW is searchable, reusable and fully customisable.

Currently, WoW produces output based on the HTML input provided by the end user, which leaves responsibility for the visual impact of the resultant standalone website with the end user. At present, WoW performs the repetitive task of splitting and chunking the content into logical parts, as defined by the user. WoW does not insert client-side scripting code, nor does it standardise content in a template, which might make the output more usable but would reduce the flexibility for the user.

Initial feedback from our first users indicates that the automation of the repetitive conversion tasks reduces the scope for human error and minimises the editing time, allowing more time for enhancing course materials. The removal of double handling by the content author and the web developer is appreciated, and it is anticipated that the creation of only one document for use in both print and web delivery will make it simpler for more frequent updating of web based materials to be performed.

Diagram 4: An example of Website output from a WoW-converted MS Word academic paper

Future development

MS Word produces xHTML/HTML that does not comply with W3C HTML or xHTML standards (W3C 2004a; W3C 2004b; W3C 2004c). The development of WoW has focused on a WYSIWYG principle, using the MS Word code to retain the formatting in the original submission irrespective of its compliance with W3C standards. WoW uses an HTML parser from the library of Perl modules (CPAN 2001) that does not depend on the validity of the document. The following compromises were necessary to produce WYSIWYG output renderable in all browsers:

A priority for future development will be overcoming MS-Word non-conformity to W3C standards, producing conformant documents that do not sacrifice formatting while making site content accessible to a variety of audiences, including disabled learners (Alexander 2003). Our aim is to find a generic solution that would be forward compatible with all MS-Word versions.

An elegant solution currently being explored is using XML, XSLT and CSS to produce websites that obey formatting and presentation rules. This phase of development would involve drafting a schema or DTD that is generic enough to accommodate most kinds of data with its hierarchy. Content from submitted documents would be extracted and placed in an XML document obeying the schema or DTD. Designers and other professionals could then develop XSLT and CSS that process these XML documents to produce websites/HTML conforming to W3C standards as laid down at that point in time (W3C 2004d). If the schema or DTD were to change in the future, XSLT would be used to change all old documents to their new versions. Most of the tasks outlined above could be done automatically. With content separated from presentation, we could have the same content presented in PDF, PS, HTML, RTF, using FLASH, or any method that has a converter built to interpret the XML content (Reenskaug 2004). An example of the integration of XML, XSLT and CSS is presented at the Arora (2004) website.
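The extraction step could look something like the following Python sketch, which places document chunks into an XML structure using the standard library. The element and attribute names (document, section, heading, body, level) are purely hypothetical, since no schema or DTD has been drafted; the XSLT transformation itself is not shown.

```python
import xml.etree.ElementTree as ET


def chunks_to_xml(title: str, chunks: list[dict]) -> str:
    """Wrap chunks (level, title, text) in a hypothetical <document> structure."""
    root = ET.Element("document", {"title": title})
    for chunk in chunks:
        # One <section> per chunk; the heading level is kept as an attribute.
        section = ET.SubElement(root, "section", {"level": str(chunk["level"])})
        ET.SubElement(section, "heading").text = chunk["title"]
        ET.SubElement(section, "body").text = chunk["text"]
    return ET.tostring(root, encoding="unicode")
```

Once content is held in a neutral structure of this kind, presentation decisions can be deferred entirely to whichever stylesheet or converter is applied to it.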

Other future developments, indicated through user-testing, may include options to enhance indexing, such as 'Use javascript drop-down menus (collapsible/non-collapsible) in table of contents', and provide templates as an option to improve the output website created, such as 'Standardise to template A/B/C' (where examples of templates A/B/C are provided). Indexing to specific words and content may also be introduced in future revisions. For example, an index such as 'Integration 52, 98' will link the page numbers (and the file numbers) to the appropriate pages and anchor to the spot where the word 'Integration' appears in the page.
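A word index of this kind might be built as in the Python sketch below. It is illustrative only: the page<N>.html naming scheme, the use of the term itself as the anchor name and the function names are assumptions, and the sketch presumes a matching anchor has already been placed where the word appears in each page.

```python
import re
from collections import defaultdict


def build_index(pages: dict[int, str], terms: list[str]) -> dict[str, list[int]]:
    """Map each term to the page numbers on which it appears as a whole word."""
    index = defaultdict(list)
    for number in sorted(pages):
        for term in terms:
            pattern = rf"\b{re.escape(term)}\b"
            if re.search(pattern, pages[number], re.IGNORECASE):
                index[term].append(number)
    return dict(index)


def format_entry(term: str, numbers: list[int]) -> str:
    """Render an index entry as links, assuming pages are saved as page<N>.html."""
    links = ", ".join(
        f'<a href="page{n}.html#{term}">{n}</a>' for n in numbers
    )
    return f"{term} {links}"
```

For example, if the word appears on pages 52 and 98, build_index returns those two numbers for 'Integration' and format_entry turns them into two links.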

Diagram 5: An email prompts the user to download their converted files from the server to their desktop.

Conclusion

Many academics rely on professional web publishers and media developers to create and maintain their web based learning materials simply because they do not have the required time or the skill level to perform this task efficiently themselves. Unfortunately, the demand often outweighs the supply of support available. As the availability of at least some online or web based support materials for face-to-face courses becomes an accepted, and often expected, part of the educational experience, it seems an appropriate time to introduce a simple to use and targeted tool to assist teaching staff to publish academic and learning materials on the web.

This paper reports a tool that is simple to use, requires no knowledge of HTML and transforms MS Word documents into well structured, professional quality, navigable, standalone websites that are also searchable, reusable and fully customisable. The tool is an open web-based system that is fully scalable and has no ongoing costs. It promises to make the web more accessible to teaching staff and to allow them to publish academic standard websites without knowledge of HTML tools. The extracted cascading style sheet allows for easy and timely customisation of the website for different online deliveries.

Further advantages of this ideal conversion process were identified as automating repetitive tasks, removing double handling by the content author and the web developer, creation of only one document for use in both print and web delivery, reducing the scope for human error and editing time, freeing up time for enhancing course materials, and most importantly, meeting deadlines with better quality web based learning resources. WoW!

References

Alexander, D. (2003). How Accessible Are Australian University Web Sites? AusWeb 2003-The Ninth Australian World Wide Web Conference, Gold Coast, July 2003, accessed April 2004 [HREF1].

Arora, N. (2004). Neeraj Arora's Website, viewed 20 May 2004, [HREF2].

Bennett, S., Priest, A. and McPherson, C. (1999). Learning about online learning: An approach to staff development for university teachers. Aust J Educational Technology. 15(3), 207-221.

CPAN (2001). HTML-TokeParser-Simple-2.2, viewed 20 May 2004, [HREF3].

Holt, D. and Segrave, S. (2003). Creating and sustaining quality e-learning environments of enduring value for teachers and learners. In G.Crisp, D.Thiele, I.Scholten, S.Barker and J.Baron (Eds), Interact, Integrate, Impact: Proceedings of the 20th Annual Conference of the Australasian Society for Computers in Learning in Tertiary Education. Adelaide, 7-10 December 2003, vol. 1, pp 226-235.

Laurillard, D. (2002). Design tools for E-learning. Keynote address, in A. Williamson, C. Gunn, A. Young, and T. Clear (Eds), Winds of change in the sea of learning: Proceedings of the 19th Annual Conference of Australasian Society for Computers in Learning in Tertiary Education. Auckland, NZ, 8-11 December 2002, vol. 1, pp 3-4.

Logictran (2003). HTML Express: the easy way to go from Word to RTF on the Web, viewed 29 March 2004, [HREF4].

Mayer, E. (2003). The promise of multimedia learning: using the same instructional design methods across different media. Journal of the European Association for Research on Learning and Instruction (EARLI) 13, 125-139.

O'Reilly, M., Ellis, A. and Newton, D. (2000). The Role of University Web Pages in Staff Development: Supporting Teaching and Learning Online. AusWeb2K-The Sixth Australian World Wide Web Conference, Cairns, June 2000, accessed April 2004 [HREF5].

Pannan, L. and McGovern, J. (2003). Mainstreaming online delivery: staff experience and perceptions. In G.Crisp, D.Thiele, I.Scholten, S.Barker and J.Baron (Eds), Interact, Integrate, Impact: Proceedings of the 20th Annual Conference of the Australasian Society for Computers in Learning in Tertiary Education. Adelaide, 7-10 December 2003, vol. 1, pp 396-406.

Reenskaug, T.M.H. (2004). Model-View-Controller (MVC), viewed 20 May 2004, [HREF6].

Rossiter, D. (1999). Building a Web-based Framework to embed the teaching and learning of technological literacy. In R. Debreceny and A. Ellis (Eds) The web after a decade, Proceedings of AusWeb ’99, The Fifth Australian World Wide Web Conference. Lismore, July 1999, pp 554-560.

SolutionSoft (2004). WordToWeb 2.5: Automatic HTML Publishing for Microsoft Word, viewed 29 March 2004, [HREF7].

State Services Commission (2003). e-governments.govt.nz: Survey of Word to HTML conversion solutions, viewed 29 March 2004, [HREF8].

W3C (2004a). Technical Reports and Publications, viewed 20 May 2004, [HREF9] [HREF10] [HREF11] [HREF12].

W3C (2004b). Markup Validation Service v0.6.5, viewed 20 May 2004, [HREF13].

W3C (2004c). CSS Validation Service, viewed 20 May 2004, [HREF14].

W3C (2004d). Extensible Markup Language (XML), viewed 20 May 2004, [HREF15].

Hypertext References

HREF1 http://ausweb.scu.edu.au/aw03/papers/alexander3/

HREF2 http://www.narora.net

HREF3 http://search.cpan.org/~ovid/HTML-TokeParser-Simple-2.2/Simple.pm

HREF4 http://www.logictran.net/products/

HREF5 http://ausweb.scu.edu.au/aw2k/papers/o_reilly/paper.html

HREF6 http://heim.ifi.uio.no/~trygver/themes/mvc/mvc-index.html

HREF7 http://www.solutionsoft.com/w2w.htm

HREF8 http://www.e-government.govt.nz/web-guidelines/word-to-html-conversion.asp

HREF9 http://www.w3.org/TR/xhtml11

HREF10 http://www.w3.org/TR/html401

HREF11 http://www.w3.org/TR/xslt

HREF12 http://www.w3.org/TR/REC-CSS2

HREF13 http://validator.w3.org

HREF14 http://jigsaw.w3.org/css-validator

HREF15 http://www.w3.org/XML

Copyright

Linda Pannan, Daniel Barnes, Neeraj Arora, © 2004. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.