From HTML to XHTML: A Practical User’s Guide to Upgrading a Web Site

Dale Burnett, Professor Emeritus, University of Lethbridge, Lethbridge, Alberta, Canada. dburnet2@telusplanet.net

Allan Ellis, School of Commerce and Management, Southern Cross University, Lismore, NSW, Australia. allan.ellis@scu.edu.au

Abstract

Users of new technologies are always faced with the challenge of keeping up to date with ongoing developments. Many users were introduced to the Web when HTML was the only mark-up language available. The current skill set and usage patterns of these users, particularly when they have remained casual users, often do not reflect the newest languages and protocols or the latest available tools. This paper outlines the evolution of Web languages and protocols and provides a practical case history on how to upgrade a Web site from HTML to XHTML.

Introduction

In the beginning there was HTTP and HTML. If you authored Web pages in the early 90’s you were familiar with these acronyms and a range of tags that you created using a text editor. Creating a Web page was like hand stitching a garment. You added tags at key places to get the desired presentation effect when viewed with a browser. Soon new special purpose tools such as HTML editors, with WYSIWYG capability, became available. Creating a Web page became a lot easier as authors did not have to deal with the underlying code - the software generated it for them. In time a new set of standards and languages including XML, XHTML and CSS emerged.

Now authors have the potential to create much more sophisticated sets of Web pages where global changes can be made very easily and where content can be organized so that it can be viewed in a variety of ways on a range of devices. But old habits die hard. Many Web authors, some with long experience but not extensive technical knowledge, have still not made the transition from HTML to XHTML and these new standards. This paper reviews the general issues, describes the underlying features of these new standards, and indicates why they should be used. Finally, it provides an example of upgrading an HTML Web site to one using XHTML and CSS.

In the Beginning

A markup language allows text to be combined with additional information. This additional information can be about features such as structure and layout. Tim Berners-Lee’s pioneering work with Robert Cailliau at CERN resulted in the development of a handful of tags that made up the earliest version of HyperText Markup Language (HTML). They were first described in detail in a 1991 document [HREF1] but not formally defined until several years later. Anyone who authored Web pages in the early and mid 90’s can probably recall many of them. Remember <b> this is in bold </b>. If you attended the only tutorial offered at AusWeb95 you were shown how to use these tags to mark up text and view it with a Web browser.

In his 1999 book, “Weaving the Web”, Berners-Lee explains that the notion of using tags was inspired by the existing publishing language Standard Generalized Markup Language (SGML). SGML is in fact a complex meta-language that can be used to define various markup languages of which HTML is but one example. You can consider HTML an SGML application that defines a specific set of tags for marking up Web pages. Of course this basic tag set has evolved. A brief history of HTML reads as follows:

1993:        HTML was published as an IETF working draft (Berners-Lee & Connolly, 1993) [HREF2]
1995:        HTML 2.0 was published as IETF RFC 1866 (Berners-Lee & Connolly, 1995) [HREF3]
1996-1997:   Extensions that allow for tables, client-side image maps and internationalization were progressively introduced.
1997:        HTML 3.2 was published as a W3C recommendation (Raggett) [HREF4]
1997:        HTML 4.0 specification was released by W3C [HREF5]
2000:        The ‘text/html’ media type was defined (Connolly & Masinter) [HREF6] (This document made the earlier IETF HTML specifications obsolete.)
2008:        HTML 5 was published by W3C as a working draft (Hickson & Hyatt) [HREF7]

Documents marked up in HTML can be transferred from computer to computer using the client/server communications protocol HyperText Transfer Protocol (HTTP). Version 1.1 was published in 1999 [HREF8] and is still in common use today. With HTML and HTTP running on top of the already existing Transmission Control Protocol (TCP), the Web was born (Gillies & Cailliau, 2000). The output of the first generation of Web authors was based on these early standards.

Along Comes XML

Why change things? The unfortunate fact is that HTML has limited use. This is not surprising as it was originally conceived to aid the sharing of reports and research papers between scientists in the high-energy physics community. Tim Berners-Lee and Robert Cailliau had no idea of the diversity of demands that would be placed on the Web as it expanded into the broader community and moved from handling the display of simple text files to running complex, interactive multimedia presentations. As an illustration of this limitation consider the example of the <em> (for emphasis) tag. It does not specify what form the emphasis should take; that will vary depending on the settings of the individual user’s system. Such ambiguity cannot be tolerated if you are handling complex, media-rich files.

So while SGML is too complicated, HTML is too simple. This created an opportunity, and in the late 1990’s a group of people including Jon Bosak, Tim Bray and James Clark came up with the Extensible Markup Language (XML). It is termed extensible because it allows users to define their own elements. It can therefore facilitate the sharing of structured data across different information systems such as Internet-connected computers. Like SGML, XML is not itself a markup language; it is a specification for defining markup languages.
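To make the word extensible concrete, here is a minimal sketch in Python using the standard library module xml.etree.ElementTree. The element names (journal, entry, weather, note) and the data are invented for this illustration and do not come from any published vocabulary.

```python
import xml.etree.ElementTree as ET

# A made-up vocabulary: none of these element names comes from a standard.
JOURNAL_XML = """
<journal owner="Example Author">
  <entry date="2008-01-15">
    <weather>sunny</weather>
    <note>Started converting the site to XHTML.</note>
  </entry>
  <entry date="2008-01-16">
    <weather>snow</weather>
    <note>Wrote the first style sheet.</note>
  </entry>
</journal>
"""

root = ET.fromstring(JOURNAL_XML.strip())

# Because the tags describe the data, a program can ask for it by name.
snowy_days = [
    entry.get("date")
    for entry in root.findall("entry")
    if entry.findtext("weather") == "snow"
]
print(snowy_days)  # ['2008-01-16']
```

The point is not the particular tags but that the author chose them. HTML offers no <weather> tag, so the same information marked up in HTML would be reduced to undifferentiated paragraphs.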

The recommendation for XML 1.0 was published by the World Wide Web Consortium (W3C) in February 1998 and subsequently went through a number of updates with the 4th Edition being released in 2006 [HREF9].

Asked “Can you explain XML in less than half a page?”, McGrath (1998) answered “No”, but then offered the following for anyone prepared to read only half a page:

“XML is a computer language for describing information. So too is HTML. XML improves on the HTML approach and makes the Web a better place in which to do business, to learn, and to have fun. HTML is a great technology that has changed the world. However, a great deal of information is lost when data is converted into HTML – information that, if preserved, can be used to build a whole new world of computer applications on the Web.” (p. 6)

Enter XHTML and CSS

Extensible HTML (XHTML) is a markup language defined as an application of XML. It is sometimes referred to as the latest version of HTML. More precisely, it is HTML’s successor: the familiar HTML tag set reformulated under the stricter syntax rules of XML. HTML is the antecedent technology to XHTML.

A good description of the relationship between HTML, XML and XHTML is to be found in Wikipedia:

 “Whereas HTML is an application of Standard Generalized Markup Language (SGML), a very flexible markup language, XHTML is an application of XML, a more restrictive subset of SGML. Because they need to be well-formed, true XHTML documents allow for automated processing to be performed using standard XML tools—unlike HTML, which requires a relatively complex, lenient, and generally custom parser. XHTML can be thought of as the intersection of HTML and XML in many respects, since it is a reformulation of HTML in XML.” [HREF10]

You can confirm this for yourself if you compare the original XHTML W3C Recommendation, XHTML 1.0 [HREF11], with traditional HTML code.
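The claim that well-formed XHTML can be handled by standard XML tools can be tried out directly. The sketch below uses Python's built-in XML parser, which knows nothing about Web pages in particular. The page content and file names are invented for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# A small, well-formed XHTML page (content invented for illustration).
XHTML_PAGE = """
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Daily Journal</title></head>
  <body>
    <h1>January 15</h1>
    <p>See the <a href="index.html">home page</a> or the
       <a href="archive.html">archive</a>.<br /></p>
  </body>
</html>
"""

# The same kind of content written as loose, old-style HTML.
LOOSE_HTML = "<HTML><BODY><P>See the <A HREF=index.html>home page</A><BR></BODY></HTML>"


def list_links(markup):
    """Return the href of every <a> element, using a generic XML parser."""
    root = ET.fromstring(markup.strip())
    return [a.get("href") for a in root.iter(XHTML_NS + "a")]


print(list_links(XHTML_PAGE))  # ['index.html', 'archive.html']

try:
    ET.fromstring(LOOSE_HTML)
    loose_is_well_formed = True
except ET.ParseError:
    loose_is_well_formed = False
print(loose_is_well_formed)  # False
```

A browser would display the loose HTML without complaint, which is exactly why HTML needs the lenient, custom parser mentioned in the quotation. The generic XML parser rejects it at the unquoted attribute value.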

One other language that is crucial in the move forward from basic HTML is Cascading Style Sheets (CSS), which first became a W3C recommendation in 1996 and whose text/css media type was registered in 1998 [HREF12]. This is a style sheet language used to describe the presentation of a document written in a markup language, and its most common application is to style Web pages written in HTML or XHTML.

From Technical Specifications to Practical User Applications

Most users want to know just enough technical detail to make their job of actual, hands-on Web authoring easier. So enough about technical specifications. Let’s move on to the practical textbooks and readily available software that relate to this subject.

There are numerous books that discuss XML, XHTML and CSS. For example, Westermann (2002) and Holzner (2004) focus on XML, McGrath (1998) deals with E-Commerce applications, and Boumphrey (1998), Holzschlag (2003) and McFarland (2006) discuss Cascading Style Sheets. Unfortunately, at present much of the software to support XML is still relatively technical and really only of use to those with a fair degree of computer programming experience.

At the same time authoring software is improving. For example, the latest versions of Dreamweaver CS3 [HREF13] and Adobe GoLive [HREF14] provide a WYSIWYG environment for producing XHTML and CSS files. This is a necessary feature for many Web authors who have difficulty understanding the underlying code but who are quite productive using a WYSIWYG platform. W3C [HREF15] maintains Web sites for XML [HREF16], XHTML [HREF17] and CSS [HREF18]. Online tutorials for XML [HREF19], XHTML [HREF20] and CSS [HREF21] are also available.

There is a saying in writing science books for the general public that every equation reduces the number of readers by half. A similar guideline likely applies to Web authoring: every line of code reduces the number of readers by half. With that in mind, let’s continue.

Web Authoring Today

What should a Web author know in 2008 about the actual authoring process? It depends on many factors.

First, one should have both a broad and a deep understanding of the content being presented. Second, one should have a basic sense of how to organize the material into a defensible structure that communicates the intended message. Third, one must be able to communicate clearly. Fourth, one should be able to translate the results of integrating the above three considerations into a web site. Given that the first three considerations already involve a serious ongoing commitment of time and energy, what is a reasonable expectation for the learning of an additional technical skill and technique, particularly when that skill is itself in a state of evolution? Compounding the problem, at least at first glance, is the fact that one is often not really interested in coding.

How much is enough, and what should that enough look like? Clearly there are no definitive answers to these two questions, but nonetheless one can make a few rough first approximations. A few hours, every now and then, does not seem unreasonable. This is time spent keeping an eye on trends and general issues without becoming mired in the details. Also a few hours learning a new skill, or coming to grips with a new idea or way of thinking, seems fair. Much of this will undoubtedly occur in some form of a need-to-know situation. A particular problem suddenly becomes a priority item and one learns something new while considering various alternatives. The learning is unstructured but effective. But sometimes there is also a place for a more structured and efficient approach to learning. This is when one decides to read a book or journal article, follow the contents of a Web site, attend a seminar, participate in a workshop or take a course with the goal of deliberately learning something new. The goal is more future oriented than in a need-to-know situation. One learns in the expectation that it will be useful later, even though that future situation may be ill-defined at the time. How often have we heard, or perhaps said, “Trust me. This will be useful later.” If the individual is strongly self-directed, this may more properly be, “I trust my judgment. This will be useful later.”

The fundamental shift in thinking when working with XHTML and CSS is to separate the content of the Web page from the formatting commands. Basically one first creates a data file (the XHTML file) that contains the content as well as information about the structure of the data, and another file (the CSS file) that describes how the data should appear on various devices such as computer screens, printers, cell phones, iPods, …

This merits repeating. The idea is to create two files. One file contains all of the material (text, images, media files) that one wants to share and the second file contains a set of commands that specifies how that content is to appear. This is quite different from the traditional approach to web page coding where the two activities are blended into the same document (the HTML file). This also means that one needs to learn two different coding languages (XHTML and CSS). The first is relatively easy as XHTML looks very much like HTML. Most of the differences involve a tightening up of the standards (e.g. every element must be closed, and all tag and attribute names must be in lower case). But the syntax for CSS is totally new.
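The tightening up of the standards is easiest to see side by side. The following sketch uses Python's standard html.parser module to re-emit a line of old-style HTML with lower-case names, quoted attribute values and explicitly closed empty elements. The class name XhtmlTidier and the sample markup are my own inventions, and the sketch is deliberately limited: it does not, for example, supply closing tags that the original author left out.

```python
from html import escape
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never hold content; XHTML writes them as, e.g., <br />.
EMPTY_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class XhtmlTidier(HTMLParser):
    """Re-emit simple HTML with lower-case names, quoted attributes
    and explicitly closed empty elements. Illustrative only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already lower-cased tag and attribute names.
        attr_text = "".join(
            ' {}="{}"'.format(name, escape(value if value is not None else name))
            for name, value in attrs
        )
        closer = " />" if tag in EMPTY_ELEMENTS else ">"
        self.parts.append("<" + tag + attr_text + closer)

    def handle_endtag(self, tag):
        # Empty elements were already closed when they were opened.
        if tag not in EMPTY_ELEMENTS:
            self.parts.append("</" + tag + ">")

    def handle_data(self, data):
        self.parts.append(escape(data, quote=False))


def tidy(html_text):
    tidier = XhtmlTidier()
    tidier.feed(html_text)
    tidier.close()
    return "".join(tidier.parts)


old_html = "<P>First line<BR>Second line <IMG SRC=photo.jpg ALT=Photo></P>"
new_xhtml = tidy(old_html)
print(new_xhtml)
# <p>First line<br />Second line <img src="photo.jpg" alt="Photo" /></p>

ET.fromstring(new_xhtml)  # raises ParseError if the result is not well-formed
```

Comparing the input and the printed output shows the main rules at a glance: upper case becomes lower case, attribute values gain quotation marks, and <BR> and <IMG> acquire a closing slash.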

Why would we want to do this? Why should we separate what has always been a seamless integrated whole into two discrete parts? Why indeed. This is a case of one step backward and two steps forward. The backward step is having to think about a display in a totally new way. The two steps forward are: (1) the power behind a change in format, and (2) the ease of preparing one set of data for multiple platforms. In the case of a Web site with numerous pages, perhaps hundreds, if one decides to change the appearance of a heading, one can make the change with only a few keystrokes in a CSS file. Similarly if one wants to display a data set on a printer as well as an iPod, one need only create a separate CSS file for each device and leave the actual data file untouched. It’s the perfect way to “prepare once” and be able to display in “multiple ways”.
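The “prepare once, display in multiple ways” idea can be sketched as follows. One XHTML page names a separate style sheet for each kind of device through the media attribute of its link elements. The file names screen.css, print.css and handheld.css are hypothetical. The short Python program simply reads the page back and reports which style sheet governs which medium.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# File names (screen.css, print.css, handheld.css) are hypothetical.
PAGE = """
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Daily Journal</title>
    <link rel="stylesheet" type="text/css" href="screen.css" media="screen" />
    <link rel="stylesheet" type="text/css" href="print.css" media="print" />
    <link rel="stylesheet" type="text/css" href="handheld.css" media="handheld" />
  </head>
  <body>
    <h1>January 15</h1>
    <p>The content is written once and is not repeated per device.</p>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# Map each output medium to the style sheet that governs it.
sheets = {
    link.get("media"): link.get("href")
    for link in root.iter(XHTML_NS + "link")
    if link.get("rel") == "stylesheet"
}
print(sheets)
# {'screen': 'screen.css', 'print': 'print.css', 'handheld': 'handheld.css'}
```

Changing the look of every heading on a hundred-page site then means editing one rule in screen.css, while the body of each page stays untouched.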

The next surprise (shock might be a better word) is with the way one conceptualizes the structure of the data. It is not hard, but it is different. At its most fundamental level, the idea is the same as with using a table to define a layout. Only here the focus is on individual cells. Each cell is a rectangle. The position of the cell is specified by giving the coordinates of the top left corner together with information about the width of the cell. There is also information about the font (type, size, color, alignment) of the text that will be found within this cell. Cells will automatically expand in length to accommodate whatever text is identified as being within that cell. The actual text is stored in the XHTML file. The description of the cell location and size as well as the formatting information of the text within the cell is defined in the CSS file.
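As a rough illustration of such a cell description, the sketch below prints one CSS rule for an area of the page that the XHTML file would mark as <div id="content">. The area name, the measurements and the helper function css_rule are all assumptions made for this example. They are not taken from my own site or from any tool's output.

```python
def css_rule(selector, declarations):
    """Format one CSS rule: a selector followed by property: value pairs."""
    lines = ["{} {{".format(selector)]
    for prop, value in declarations.items():
        lines.append("  {}: {};".format(prop, value))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical area name and measurements, chosen only for illustration.
content_box = {
    "position": "absolute",    # place the box at fixed coordinates
    "left": "220px",           # x coordinate of the top left corner
    "top": "120px",            # y coordinate of the top left corner
    "width": "540px",          # no height given: the box grows with its text
    "font-family": "Verdana, Arial, sans-serif",
    "font-size": "90%",
    "color": "#333333",
    "text-align": "left",
}

rule_text = css_rule("#content", content_box)
print(rule_text)
```

The printed rule begins with #content { and lists one property per line. The selector #content is what ties the rule to the matching area in the XHTML file, and the absence of a height property is what lets the cell lengthen to fit its text.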

Given that one has a relatively clear idea of what one is trying to accomplish, the remaining question is how to do all of this. As before with HTML, there are two basic approaches. One is to learn all of the markup tags and manually code the respective XHTML and CSS files. Fair enough, if that is your preference. The other approach is to learn some form of WYSIWYG authoring package.

A Case History: Upgrading an Existing Web Site

Much of what follows will be a blend of the personal history of one of the authors (Burnett) and concomitant generalizations. Let me begin with a short personal story. I began my professional career over 40 years ago as a programmer in a large corporation. After a few months’ training I was given responsibility for a fairly large and important program. After about 3 weeks I had a program that was at the 95% complete stage – there were just a few bugs left to resolve. Time passed. About 2 months later I was still at the 95% stage. My strategy of correcting one small problem at a time was not working. As I would add some code to correct a problem I would often create a new problem. It is difficult to change a strategy when it has worked well – after all I was 95% of the way done. Another month went by. I was still at 95%. This was not good. But something else was happening during these three months. I was learning an awful lot about the program and I was learning a lot about what the program was really supposed to do. I finally decided to scrap the whole thing. By then the code was simply a mess of patches upon patches upon patches. I sat down on the weekend with a clean sheet of paper and wrote a totally new program in two days. It took another two days to correct a few typos and simple bugs and I was finished. The hardest part was deciding to abandon my earlier effort and begin anew. I will come back to this story later.

In 1995 I taught myself a new set of skills: HTML. I wanted to be able to create a Web page like the few I had seen at that time and the way to do that was to write a series of commands that could be read by a browser (Mosaic at first, and then Netscape) and displayed as a Web page. It was magic. This was also the time of AusWeb95 in Ballina which I attended. Now fast forward to  2005. I have been creating Web pages for just over a decade. My HTML skills have remained quite modest. Instead I have learned to use a WYSIWYG editor (Dreamweaver) that allowed me to create Web pages by essentially thinking of myself as using a word processor with a few special features such as inserting images and sound files and learning how to add links so the viewer could branch to another page. This was fine as long as one was only adding text. But one also wanted to have a modest degree of control over the layout of the page. The breakthrough idea here was that of a table. Table commands were initially set up to create … , you guessed it, tables. But then someone realized that one could specify the size of different rows and columns and further, one could place images as well as formatted text in any cell. Essentially this meant that one could use a table to define the layout of the page. A structured way to approach design had arrived, and it was easy. Well, at least the technical aspects were easy. Taste and aesthetics were more problematic.

Discussing the authoring of Web pages and sites is primarily a matter of matching the needs of the viewer with those of the presenter. It is a multi-dimensional world and the matching process involves a number of these dimensions (level and type of technical expertise; available time and resources; attitude). The following sections assume that the reader has some previous experience creating Web pages, but does not assume that the person is technically sophisticated.

My Web Site

I have been maintaining a personal daily Web site for over two years. The site was originally written using an early version of what was then Macromedia’s Dreamweaver software with little regard for the actual coding. The output was in traditional HTML markup. Once in a while something unexpected would happen and it was necessary to get involved with the code, but this was relatively rare. Preparing a Web page was very similar to word processing. Type and format the text. Screen layout was accomplished by the judicious use of tables. Thus a screen display was thought of as a table of, say, three columns and four rows. Then something (e.g. text, images) would be inserted into each of the cells. Definitely not rocket science. Design was primarily at the level of web site file management: where are the various pages stored (i.e. in some form of hierarchical folder structure) and how are the links to these pages organized.

But I began to notice a new collection of books in the stores.  Many of these books mentioned an organization called W3C and new coding languages such as XML, XHTML and CSS. These first books were fairly typical of what one would expect from a group of very sophisticated leading-edge software developers: unintelligible to anyone outside the group. The term user-friendly was synonymous with wuss. It was a case of if you had to ask the price, then you probably couldn’t afford it. If you had to ask about XML, then it was probably too technical/difficult for you to understand. While that may be a defensible attitude if you are with the geek crowd, it definitely has the effect of excluding those that might want to join. Many web sites were similar. They were excellent sources of information if one already was familiar with the code and liked to work with it [HREF22, HREF23, HREF24, HREF25, HREF26]. They also provide a rich resource for the history and development of the Web and the evolution of standards. But they were a bit daunting to the uninitiated.

Now let’s imagine the following scenario: you have created a web site that consists of a hundred Web pages and which utilizes your tried and tested HTML skills. You have used tables to format the pages and have added a substantial amount of content. Suddenly you are faced with the task of upgrading it to reflect the latest standards of Web authoring and to gain the new level of functionality that they offer. What do you do?

There are two basic approaches. One is to follow a well-laid-out sequence of steps to modify the code. This sequence of steps is almost guaranteed to result in a new set of files that meets the requirements. Holzschlag (2003) gives an example of this approach in the seventh chapter of her book. This may make sense, but only if one already understands the reasons behind each step, which in turn means that one needs to study the basic ideas of specifying the structure and formatting principles in a CSS file as well as removing much of this information from the original HTML file. Care is needed, because any errors introduced along the way can be hard to locate and fix. My earlier story about patches on patches is relevant here as well.

The second approach is to bite the bullet, embrace the new concepts and create a new CSS file from scratch. Creating the XHTML file is relatively easy in comparison since it primarily is a matter of removing tags from the original HTML file.

Regardless of which approach one selects, one must also decide whether to use a WYSIWYG software package or to code the files directly. In my case the ease and speed of using Dreamweaver made it an obvious choice. But, and this is an important but, there will still occur situations where the unexpected is on the screen and your only recourse is to look at the code and try tweaking a few commands. Sometimes this is easier than it looks as you can often guess where the problem lies by comparing the code with the resulting display and then simply changing a parameter or a tag. However it does help to have a basic familiarity with the syntax of both XHTML and CSS.

An Example of Upgrading a Web Site to XHTML

Here is a screen shot of the basic Web page from my 2007 daily journal:

[Figure: screen1, the 2007 daily journal page]

Here is a screen shot of the basic Web page from 2008:

[Figure: screen2, the 2008 daily journal page]

First, let me highlight the basic changes, then I will describe how the new page was created. The most obvious change is the creation of a left margin that contains all of the links to other pages on the site. There are usually trade-offs when one makes a change. In this case the trade-off is between a reduced width for the daily content versus the increased flexibility of immediate access to all parts of the site. It must be stressed that this change is not due to the switch from HTML to XHTML. It is due to increased experience with the site and a recognition that continuous attention to the design will more than likely produce improvements.

This is very much like my experience forty years ago. Sometimes it is better to begin with a clean screen and start over than it is to try to modify an existing screen to incorporate a new feature. The new screen consists of only a few elements. There is the left margin, the top heading area and the main content area.

With Dreamweaver CS3 (Gunter & Valade, 2008; Karlins, 2007; McFarland, 2007) the basic creation of a page like this is absurdly simple. One selects New from the File menu and receives a screen full of options.

1. Select HTML for Page Type (more about this in a moment – see step 3)
2. Select “2 column liquid, left sidebar, header and footer” for Layout,
3. Select XHTML 1.0 Transitional for DocType (this means that the page will be in XHTML not HTML), and
4. Select Create New File for Layout CSS.

That is all one needs to do. Dreamweaver will then create a CSS file and an XHTML file with some dummy data in each area. All of the necessary code has been created for you. Now you can treat the page like a word processor and insert your content.

But there is a lot happening behind the screen. Dreamweaver has created two files and has also created the code so that the two files closely interact. The XHTML file identifies the various areas of the screen (left sidebar, heading, content, and footer). It also places some content in each area. The second file, the CSS file, contains all of the commands for controlling the appearance of the content in each of the areas.
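For readers who would like to see roughly what such an XHTML file contains, here is a hand-written sketch. It is not Dreamweaver's actual output: the id values (header, sidebar, content, footer) and the file name layout.css are hypothetical. The DOCTYPE line is the standard one for XHTML 1.0 Transitional. The Python code around the page merely lists the named areas and the style sheet the page points to.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# The id values and the file name layout.css are hypothetical.
PAGE = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Daily Journal</title>
    <link rel="stylesheet" type="text/css" href="layout.css" />
  </head>
  <body>
    <div id="header"><h1>Daily Journal</h1></div>
    <div id="sidebar"><p><a href="index.html">Home</a></p></div>
    <div id="content"><p>Today's entry goes here.</p></div>
    <div id="footer"><p>Last updated in 2008.</p></div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# The named areas are what the rules in the CSS file attach to.
areas = [div.get("id") for div in root.iter(XHTML_NS + "div")]
stylesheet = root.find(".//" + XHTML_NS + "link").get("href")

print(areas)       # ['header', 'sidebar', 'content', 'footer']
print(stylesheet)  # layout.css
```

Notice that the page itself says nothing about position, width or color. Each div only gives an area a name, and the single link element is the point at which the two files interact.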

What remains for the author? Everything that turns the display from a boilerplate template to a highly personalized, colorful and idiosyncratic attraction. The main modification to the XHTML file will involve changing the number and character of the different areas. Changing the actual content is much easier as it only involves typing and insertion of images. Changes to the CSS file are more complex as there are so many options. The good news is that all of the options may be selected from various pull-down menus. The bad news is that it is not always obvious what effect a particular option will have on the resulting display.  There is still some substantive learning to be undertaken, but at least it is no longer at the level of code.

The final step is verifying that your files meet the standards for XHTML and CSS. This is not a totally straightforward issue as there is still some debate about the fine details of some standards. There are two steps to the process. First, Dreamweaver provides a validation button that will identify most of the potential problems. Just click on the button and it will scan your file, or your complete Web site, for possible problems. Second, the W3C maintains interactive Web sites [HREF27, HREF28] that will examine your files and identify any problems. Authors whose pages pass are encouraged to display a small icon that indicates that the page satisfies W3C standards. The overall situation is complicated by the fact that different proprietary browsers react differently to some code. Thus you may have a file that fails to meet a particular standard and yet the browser may handle it in a way that keeps everything working. The converse is also true. Sometimes you may have a file that satisfies all of the standards and yet the browser may display it in an inappropriate manner.
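For those comfortable with a few lines of code, a rough preliminary check is also possible before a page goes anywhere near a validator. The sketch below asks Python's built-in XML parser whether a piece of markup is well-formed. The function name well_formedness_problem is my own. Note the limits of this check: being well-formed (tags properly nested and closed) is a weaker condition than being valid XHTML, which is what the W3C service tests.

```python
import xml.etree.ElementTree as ET


def well_formedness_problem(markup):
    """Return None if the markup parses as XML, else the parser's message."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as error:
        # The message includes the line and column where parsing failed.
        return str(error)
    return None


good = "<p>First line<br />Second line</p>"
bad = "<p>First line<br>Second line</p>"  # <br> is never closed

print(well_formedness_problem(good))  # None
print(well_formedness_problem(bad))   # a message naming the line and column
```

The line and column in the message give a starting point for the kind of code tweaking described earlier, although the true cause of an error often lies a little before the position reported.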

Concluding Remarks

The Web is a dynamic entity. It continues to evolve as the numerous people involved in its use interact and think of new applications and uses (Tapscott & Williams, 2006). There has been a gradual shift from HTML to XHTML and CSS. Recent developments in software authoring packages are moving the focus from the domain of code to that of a visual WYSIWYG environment. Such developments are likely to increase the acceptance of the new standards among authors who are reluctant to become involved in the detailed technical aspects of Web page creation.

References

Berners-Lee, T. (1999). Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor. Harper Collins, New York.

Boumphrey, F. (1998). Professional Style Sheets for HTML and XML. Wrox Press Ltd, Birmingham.

Gillies, J. & Cailliau, R. (2000). How the Web was Born: The Story of the World Wide Web. Oxford University Press, Oxford.

Gunter, S. K. & Valade, J. (2008). Dreamweaver CS3 and Flash CS3 Professional. Wiley, Hoboken, N.J.

Holzner, S. (2004). SAMS Teach Yourself XML in 21 Days. SAMS, Indianapolis, Indiana.

Holzschlag, M. E. (2003). Cascading Style Sheets: The Designer’s Edge. San Francisco.

Karlins, D. (2007). Adobe Dreamweaver CS3: How-Tos – 100 Essential Techniques. Adobe Press, Berkeley, Calif.

McFarland, D. S. (2006). CSS: The Missing Manual. O’Reilly Media, Sebastopol, Calif.

McFarland, D. S. (2007). Dreamweaver CS3: The Missing Manual. O’Reilly Media, Sebastopol, Calif.

McGrath, S. (1998). XML by Example: Building E-Commerce Applications. Prentice Hall, NJ.

Tapscott, D. & Williams, A. D. (2006). Wikinomics. Penguin, Toronto.

Westermann, E. (2002). Learn XML in a Weekend. Premier Press, Cincinnati, Ohio.

Hypertext References

HREF1
http://www.w3.org/History/19921103-hypertext/hypertext/WWW/MarkUp/Tags.html
HREF2
http://www.w3.org/MarkUp/draft-ietf-iiir-html-01.txt
HREF3
http://tools.ietf.org/html/rfc1866
HREF4
http://www.w3.org/TR/REC-html32
HREF5
http://www.w3.org/TR/REC-html40-971218/
HREF6
http://tools.ietf.org/html/rfc2854
HREF7
http://www.w3.org/TR/html5/
HREF8
http://tools.ietf.org/html/rfc2616
HREF9
http://www.w3.org/TR/2006/REC-xml-20060816/
HREF10
http://en.wikipedia.org/wiki/XHTML
HREF11
http://www.w3.org/TR/xhtml1/
HREF12
http://tools.ietf.org/html/rfc2318
HREF13
http://www.adobe.com/ap/products/dreamweaver/?sdid=BDMGJ
HREF14
http://www.adobe.com/products/golive/
HREF15
http://www.w3.org/
HREF16
http://www.w3.org/XML/
HREF17
http://www.w3.org/TR/xhtml1/
HREF18
http://www.w3.org/Style/CSS/
HREF19
http://www.w3schools.com/xml/default.asp
HREF20
http://www.w3schools.com/xhtml/default.asp
HREF21
http://www.w3schools.com/css/default.asp
HREF22
http://www.evolt.org/article/The_XHTML_Transition_It_s_not_that_difficult/17/9953/index.html
HREF23
http://webkit.org/blog/68/understanding-html-xml-and-xhtml/
HREF24
http://www.irt.org/articles/js192/index.htm
HREF25
http://www.builderau.com.au/program/web/soa/Making-the-switch-to-XHTML/
HREF26
http://whitepapers.zdnet.co.uk/0,1000000651,260006696p,00.htm
HREF27
http://validator.w3.org/
HREF28
http://jigsaw.w3.org/css-validator/

Copyright

Dale Burnett and Allan Ellis, © 2008. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.