Getting Rid of the Webmaster


Andrew Waugh, CSIRO Division of Information Technology, 723 Swanston Street, Carlton, VIC, 3053 Phone +61 3 9282 2666, Fax +61 3 9282 2600 Email: andrew.waugh@dit.csiro.au

Rowan McKenzie, CSIRO Division of Information Technology, 723 Swanston Street, Carlton, VIC, 3053 Phone +61 3 9282 2666, Fax +61 3 9282 2600 Email: rowan.mckenzie@dit.csiro.au


Keywords: WorldWideWeb, Intranets
The work reported in this paper has been funded in part by the Cooperative Research Centres Program through the Department of the Prime Minister and Cabinet of the Commonwealth Government of Australia.

Abstract

This paper describes a Web submission tool designed to be used in a corporate web environment. The submission tool allows the owners of pages to submit them for inclusion in the corporate web. During the submission process, the tool extracts metadata for use by search engines and checks for invalid hypertext links. The goal of the tool is to reduce the effort required to maintain a reliable corporate web and, in turn, the cost of running one.

Introduction

Corporate Webs are increasingly seen as an efficient means of distributing information within an organisation. The attraction of the Web is that it supports 'just in time' information distribution.

Traditional information distribution in an organisation involved copying documents to everyone who might be interested. Apart from wasting paper, this approach inevitably delivered copies to people who had no interest in the information (or no need to see it). Equally, the distribution invariably missed some people who should have received the information.

Even if the information was correctly distributed to the right people, it had often been lost by the time it was needed. Some people, on the other hand, never lost information, but they never updated it either, so old information floated around the organisation causing confusion.

The best way of retrieving information in an organisation is to identify the person responsible for it and to retrieve it from them only when it is required. There is then no need to keep track of information that floods across your desk, and no risk of using out-of-date information.

A corporate web is simply an extension of this idea. The information is made available on the Web by the person responsible for it. Users retrieve information when it is required.

Corporate webs already exist. However, we do not believe that they have reached their full potential, and they will not do so until the cost of making information available on the web is reduced.

Maintaining the infrastructure of a corporate web involves the activities normally undertaken by a 'Webmaster'. These activities include adding new and modified pages to the web, ensuring pages are consistent in style, and checking and fixing broken hypertext links.

While it is essential to undertake these activities to maintain a reliable and useful corporate web, they are an overhead as they are not directly related to making information available. Such overheads should be reduced as far as possible as they reduce the economic viability of a corporate web. A single Webmaster would cost an organisation $80,000 to $100,000 per year including overheads. It makes economic sense to attempt to reduce the number of Webmasters used by automating some of their functions.

The purpose of this paper is to describe one tool designed to reduce the cost of maintaining a corporate web.

Very few WWW tools currently address this problem. Some tools address the checking and repair of links; these are briefly discussed in the section on hypertext links below.

The term 'intranet' has recently become popular for the concept of adopting internet technology, particularly Web technology, within organisations. We prefer the term 'Corporate Web' in this paper as we are specifically interested in Web technology and not in other internet technology.

Web Page Submission Tool

A web submission tool is used by the maintainer of a web page to submit new pages to the corporate web, to make modifications to existing pages, or to delete pages from the corporate web.

The tool consists of three main modules: a browser, a client-side submission module, and a server-side submission module. The browser allows the user to inspect pages from the corporate HTTP server and select links in pages for modification; the presentation is similar to conventional graphical web browsers. The client-side submission module provides a dialogue through which the user chooses the head of a set of HTML pages on their machine for submission and specifies options such as metadata. The server-side submission module accepts HTML pages from the client for inclusion in the repository and performs the necessary modifications to existing pages in the repository. This module also handles security and page ownership so that only the owner can make further modifications.

The Java language was chosen for implementing the submission tool because of its suitability to network applications and its platform independence. Java's standardised Application Programming Interface was a further attraction.

The submission tool can perform the following functions:

  1. Separate ownership of the web infrastructure from ownership of the information.
  2. Authenticate the user to ensure that they have the authority to add, modify, or delete the pages concerned.
  3. Check the information being submitted for quality and errors. This could include checking the content (e.g. spelling), checking for valid HTML (including checking for disallowed extensions to HTML), and checking that hypertext links within the submitted pages are valid and do not point outside the corporate web.
  4. Generate metadata from the web pages for input to a corporate search engine.
  5. Check for, and flag, hypertext links outside the submitted pages that have become invalid.
  6. Copy the pages to the corporate web servers.
  7. Add, if necessary, hypertext links from the existing corporate web pages to the newly submitted pages.

The following sections expand on some of these functions.
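
As a concrete (though purely illustrative) sketch, these functions might map onto a server-side Java interface along the following lines; all of the names here are our own invention, not those of the implementation:

    // Illustrative sketch only: one possible shape for the server-side
    // submission module. All names are hypothetical.
    public interface SubmissionServer {

        // Check that the user is who they claim to be and is allowed
        // to act on pages under the given location.
        boolean authenticate(String userId, String password, String targetPath);

        // Add a new set of pages, owned by the authenticated user,
        // to the corporate web at the given location.
        void submitPages(String owner, String targetPath, byte[][] pages)
            throws SubmissionException;

        // Replace pages already owned by the authenticated user.
        void modifyPages(String owner, String targetPath, byte[][] pages)
            throws SubmissionException;

        // Remove pages; links pointing at them elsewhere in the corporate
        // web are deactivated and the owners of those links are notified.
        void deletePages(String owner, String targetPath)
            throws SubmissionException;
    }

    // Hypothetical exception reported back to the client-side module.
    class SubmissionException extends Exception {
        public SubmissionException(String reason) { super(reason); }
    }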

Separation of Powers

The submission tool separates ownership of the corporate web infrastructure from ownership of the information in the web. This separation assists in improving the reliability of the corporate web.

If the information owners are responsible for running their portion of the corporate web, technical reliability suffers. Information owners may not have the skills or resources that are available to specialised system administrators. The servers used (often desktop PCs) are usually not as reliable as corporate servers, and the network links to the servers usually do not have the capacity to reliably serve popular pages.

On the other hand, if the system administrators are responsible for the information content of the corporate web, the reliability of the information suffers. System administrators usually cannot keep track of changes to information as well as the people who are responsible for that information.

By separating the ownership of the infrastructure from the information, the reliability of accessing the web can be guaranteed and the reliability of the information on the corporate web can be improved.

Validity of Hypertext Links

It is easy to check a corporate web for invalid hypertext links. It is more difficult to economically maintain hypertext links. Consider the steps required to fix a broken hypertext link:
  1. Read the log file to identify the broken link.
  2. Call up the page on which the broken link occurs to identify its context.
  3. Work out what happened to the linked page. This can be very time consuming; particularly if the page has been modified as the link may no longer be relevant.
  4. Edit the page with the broken link to either fix the link or delete it.
Fixing a broken hypertext link takes at least 10 minutes, and more likely 20 or 30 minutes. In a four-hour morning one Webmaster could consequently fix only between 8 and 24 links; not many in a large, dynamic corporate web. Much of this process can be automated.

First, the Web submission tool can ensure that any hypertext links added to the corporate web are valid.

Second, it can support a system to fix links that are broken as pages are deleted or moved. Note that we believe that the submission tool should not *prevent* links being broken when pages are deleted or moved.

The only way of preventing links from being broken when a page is deleted is to prevent deletion until all links to the page have been removed. But the owner of the page may not own the pages which point to it, and so may not be able to delete those links directly. The owner could ask the owners of the links to make the necessary changes, but cannot force them to do so. However, it may be essential for the page to be moved or changed; for example, for legal reasons. This conflict between ownership rights can only be resolved by allowing owners of pages the freedom to delete or move pages without requiring links pointing to those pages to be updated.

While the submission tool cannot prevent links from being broken, it can leave the corporate web in an internally consistent state. It can also notify the owners of the broken links and assist them in making repairs.

The first step is to identify and deactivate broken links. This prevents users of the corporate web from receiving errors while the link is being fixed and improves the web's apparent reliability. Identifying broken links can be done by scanning the web, but it is probably preferable to maintain a database of links as part of the submission tool. Having identified the broken links, the next step is to deactivate them. This involves 'turning off' the link so that users cannot 'click' on it. Deactivating broken links can be done automatically before the page is deleted or moved.
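
As an illustration of what 'turning off' a link could involve, the following sketch (our own, simplified; a real implementation would use the link database rather than pattern matching over raw HTML) removes the anchor tags around links to a deleted page while keeping the link text visible:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch only: deactivate links to a deleted page by removing the
    // surrounding anchor tags but keeping the link text visible.
    public class LinkDeactivator {

        public static String deactivate(String html, String deadUrl) {
            // Match <a ... href="deadUrl" ...> ... </a>, case-insensitively.
            Pattern p = Pattern.compile(
                "<a\\s[^>]*href\\s*=\\s*\"" + Pattern.quote(deadUrl) + "\"[^>]*>(.*?)</a>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
            Matcher m = p.matcher(html);
            // Keep only the anchor text (group 1), so readers still see the
            // words but can no longer follow the broken link.
            return m.replaceAll("$1");
        }
    }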

The second step is to assist the owner of the broken link in fixing (or deleting) the link. The implementation of access control on modifying web pages implies that the system keeps track of who 'owns' particular pages. This information is used to email the owner advising that a particular link has been deactivated because it was broken. The email must contain sufficient information to allow the owner to identify the broken link, and should include as much information as possible about what happened to the page the link pointed to. This information should be included twice: once in a human-readable form and once in a machine-processable form.
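
Purely as an illustration (the addresses and field names below are invented, not a format the tool defines), such a notification might look like:

    To: page-owner@corp.example.com
    Subject: Broken link deactivated on one of your pages

    The link labelled "1996 project budget" on your page
    http://web.corp.example.com/finance/overview.html has been
    deactivated because the page it pointed to was moved.

    ---- machine-processable section ----
    Source-Page:  http://web.corp.example.com/finance/overview.html
    Broken-Link:  http://web.corp.example.com/projects/budget96.html
    Reason:       moved
    New-Location: http://web.corp.example.com/archive/budget96.html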

The machine-processable form allows the email to be fed into tools which support the owner in fixing the broken link. A simple tool would open the web page for editing, delete the broken link, and then allow the owner to edit the text around the link (if necessary). If the link was broken because the page it pointed to had been moved, a more complex tool could automatically open the relocated page so that the owner could verify that the link is still useful. If so, the broken link could be automatically repaired and reactivated. Finally, if the page had been replaced by a new page (not necessarily in the same location or on exactly the same topic), the repair tool could start by searching for the new page; for example, by searching for a page with a similar title.

Automatically deactivating broken links should greatly improve the apparent reliability of the corporate web; this is important when it is being used by non-technical staff. Providing tools to support the repair of the broken links reduces repair time (and hence cost) and allows repairs to be made by people other than the Webmaster.

An aside: this approach will not support repairs to links outside the corporate web (e.g. hot links from individuals' web pages). Assistance in fixing these links can be provided by extending the error page to contain the correct hypertext links. If the exact replacement page is not known, the corporate web could perform a search for likely pages and return these on the error page. The system could be extended to silently redirect broken links, but returning an error page is preferable because it indicates to the user that a link has broken and prompts them to fix it.

At least two existing tools support the detection and repair of broken links: Morningside's SITEMAN and Adobe's SiteMill. Both are primarily concerned with maintaining hyperlinks. They:

  1. check sets of web pages to ensure that hyperlinks are still valid;
  2. check for orphans (web pages with no pointers to them); and
  3. allow links to be changed as pages are moved.
Neither tool appears to address the issue of deleting pages.

Standard Components

Most corporate webs have a standard style which usually includes:
  1. a header, often including the corporate logo;
  2. a standard background; and
  3. a footer, usually including standard buttons to go 'up' and to 'home'.

It is currently easy for page owners to forget these standard components. Forgetting the standard buttons at the foot of a page is very common, and is now extremely bad practice. These days few people browse webs; most use search engines to locate pages of interest directly. Unfortunately, these pages often lack links 'up' to the parent page or to a 'home' page. The result is that users can locate pages, but then cannot browse the web to reach 'nearby' pages which are also of interest. It is extremely common, for example, for a search to direct you to a page which contains part of a document with no way of reaching the complete document.

The submission tool could easily add these standard components as the page is added to the corporate web.

An alternative, however, would be for the web server to add these standard components dynamically to pages as it responds to HTTP requests. This has a number of advantages:

  1. The components are not stored with each page. This can save significant amounts of storage.
  2. Changing the corporate style (e.g. the logo) only involves changing one thing, instead of modifying every page.
It would be desirable to have the ability to change the style depending on the page being retrieved and also to suppress the corporate style wholly or partially on certain pages.
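
A minimal sketch of this dynamic approach (ours, for illustration only; a production server would hook this into its normal request handling) wraps each stored page in the standard components as it is served:

    import java.util.regex.Matcher;

    // Sketch only: add the standard corporate components to a page as it is
    // served, instead of storing them in every page in the repository.
    public class StyleWrapper {

        private final String header;   // e.g. corporate logo and banner
        private final String footer;   // e.g. the standard 'up' and 'home' buttons

        public StyleWrapper(String header, String footer) {
            this.header = header;
            this.footer = footer;
        }

        // Insert the standard components just inside the <body> element.
        // A per-page flag allows the corporate style to be suppressed.
        public String wrap(String page, boolean suppressStyle) {
            if (suppressStyle) {
                return page;
            }
            String withHeader = page.replaceFirst("(?i)<body[^>]*>",
                    "$0" + Matcher.quoteReplacement(header));
            return withHeader.replaceFirst("(?i)</body>",
                    Matcher.quoteReplacement(footer) + "$0");
        }
    }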

Capture of Metadata

Few users browse the web these days. It is more common to use a search engine to find interesting pages and to go directly to them. A search engine is consequently an essential part of a corporate web. A major function of the submission tool is the management of the metadata used by the search engine.

Metadata is information about data. In the context of a corporate web, this includes

  1. descriptive information about a web page (e.g. title, authors, subject, keywords);
  2. retrieval information (e.g. host, location, retrieval protocol);
  3. maintenance information (e.g. owner, disposal instructions).

This metadata needs to be captured when a web page is created. The metadata needs to be updated when the page is modified and removed when the page is deleted. The issues for the submission tool are:

  1. Capturing the new or changed metadata; and
  2. Managing the addition of this metadata to, and its deletion from, the search engine's storage.

Capturing the metadata is the most difficult part of the exercise. Metadata may be created manually (e.g. through a pop-up window), derived automatically from the web page, or produced by some combination of the two.

For much of the metadata, automatic capture and maintenance by the submission tool is the most appropriate method. This particularly includes the retrieval information (e.g. host and location). Some of the descriptive information can be captured from structured HTML documents (e.g. title) and some maintenance information (e.g. owner) can be supplied by the submission tool itself.
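
A minimal sketch of this kind of automatic capture (illustrative only; the field names and record format are our own assumptions, not the tool's actual schema) pulls the title from the HTML and adds the retrieval and ownership details that the submission tool already knows:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch only: derive basic metadata for the corporate search engine
    // from a submitted page plus information the submission tool already holds.
    public class MetadataExtractor {

        private static final Pattern TITLE = Pattern.compile(
            "<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        public static Map<String, String> extract(String html, String url, String owner) {
            Map<String, String> metadata = new LinkedHashMap<>();

            // Descriptive information taken from the structured HTML.
            Matcher m = TITLE.matcher(html);
            if (m.find()) {
                metadata.put("title", m.group(1).trim());
            }

            // Retrieval information supplied automatically by the tool.
            metadata.put("location", url);
            metadata.put("protocol", "http");

            // Maintenance information known to the submission tool.
            metadata.put("owner", owner);
            return metadata;
        }
    }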

The most difficult metadata to capture is that requiring human judgement. The ideal solution would be for the creator or maintainer of a web page to fill in a pop-up window with a description of the page. Although easy to implement, this approach has a number of problems. The first is simply motivation: it is difficult to encourage some people to fill in forms; how many people fill in the pop-up form when Word first saves a document? It is possible to make the fields mandatory, but this just leads to subject entries of 'aaaa' and similar strings. The second problem is ensuring consistent terminology in the descriptions of pages, both across descriptions of similar pages and over time. The problem of consistency can be addressed by training, or by providing a taxonomy which encourages owners to use similar terminology.

Instead of requiring the owner or maintainer of the page to categorise the page, this task can be delegated to a specialist cataloger. A specialist cataloger can be highly trained and this reduces the problem of ensuring consistency amongst page descriptions and over time. Because each page is added to the corporate web through the submission tool, the submission tool can automatically refer each new page to the cataloger and this ensures that pages are not overlooked. Unfortunately, a cataloger is yet another overhead on the running of a corporate web.

Automatic categorisation of web pages is the final option. This is attractive as it is relatively cheap and can be applied to all web pages. Automatic classification could be as simple as storing the complete text of the page in the search engine and using full text searching; there are other alternatives, including probabilistic classification (a trivial frequency-based sketch is given after the list below). Important issues with automatic classification are:

  1. using the context of the web page to guide the classification process (current web search tools often return hits on pages that form parts of other documents);
  2. suppressing duplicate results; and
  3. taxonomies: most automatic classification schemes base their decisions on the words which appear in the document, so ensuring consistency of indexing terms between documents is difficult.
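
As a sketch of the crudest end of this spectrum (for illustration only; this is simple keyword counting, not a serious classification algorithm), candidate index terms can be derived from word frequencies in the page text:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch only: the crudest form of automatic classification, counting
    // word frequencies in a page's text to suggest candidate index terms.
    public class SimpleClassifier {

        public static Map<String, Integer> wordFrequencies(String pageText) {
            Map<String, Integer> counts = new HashMap<>();
            // Strip HTML tags, then split on anything that is not a letter.
            String text = pageText.replaceAll("<[^>]*>", " ").toLowerCase();
            for (String word : text.split("[^a-z]+")) {
                if (word.length() > 3) {   // ignore very short words
                    counts.merge(word, 1, Integer::sum);
                }
            }
            return counts;
        }
    }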

Hybrid automatic/manual classification processes are also possible. For example, metadata extraction could begin with automatic extraction, and the derived classification keywords could then be presented to the owner or maintainer for refinement.

The methods used to classify pages are likely to change over time as more effective algorithms are discovered. We believe that a language like Java will be of benefit here. Instead of installing a classification program on each desktop in an organisation, Java allows the program to be downloaded as required from the server. Updating and changing the classification program is consequently easy.
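
As a sketch of what this might look like with current Java facilities (the URL and class name below are invented for illustration), the classifier could be fetched from the corporate server at run time rather than installed on every desktop:

    import java.net.URL;
    import java.net.URLClassLoader;

    // Sketch only: load the current classification program from the corporate
    // server at run time, so updating it means changing one copy on the server.
    public class ClassifierLoader {

        public static Object loadClassifier() throws Exception {
            // Hypothetical location of the latest classifier code.
            URL[] source = { new URL("http://web.example.com/tools/classifier.jar") };
            ClassLoader loader = URLClassLoader.newInstance(source);
            Class<?> classifierClass = loader.loadClass("Classifier");
            // Instantiate via the no-argument constructor.
            return classifierClass.getDeclaredConstructor().newInstance();
        }
    }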

Conclusion

We do not actually believe that a large corporate web could be run without a Webmaster. But equally, we believe that it is not cost effective for a Webmaster to perform low level management functions. By automating these functions, the administration cost of running a corporate web will fall and make this technology more attractive. This paper has outlined a tool to automate some web management functions.

The first function is control over the contents of the corporate web. By providing authentication and access control, the tool allows the owners of information in an organisation to retain control over the content of their web pages, while ceding control over the web infrastructure to corporate system administrators. This improves both access reliability and the reliability of the information on the web.

The second function is the generation and maintenance of metadata for the corporate search engine. Generating metadata requires a complex mix of automatic extraction of information from the web page and manual creation or tuning of that information by the owner. The use of Java allows the submission tool to be downloaded as required from the central server. This allows the tool to be changed easily, without the headache of updating thousands of installed copies.

The final function addressed in this paper is support for the maintenance of links; in particular, the automatic deactivation of links that are broken when web pages are relocated. The owners of pages containing these links are automatically notified, and an additional tool is provided to minimise the repair cost.


Copyright

CSIRO ©, 1996. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM, and for the document to be published on mirrors on the World Wide Web. Any other usage is prohibited without the express permission of the authors.