Rowan McKenzie, CSIRO Division of Information Technology, 723 Swanston Street, Carlton, VIC, 3053. Phone: +61 3 9282 2666, Fax: +61 3 9282 2600, Email: rowan.mckenzie@dit.csiro.au
Traditional information distribution in an organisation involved copying documents to everyone who might be interested. Apart from being a waste of paper, copies were inevitably delivered to people who had no interest in, or need to see, the information. Equally, the distribution invariably missed some people who should have received the information.
Even if the information was correctly distributed to the right people, the information had often been lost by the time it was to be used. Some people, on the other hand, never lost information, but they never updated it either, and so old information floated around in the system causing confusion.
The best way of retrieving information in an organisation is to identify the person responsible for the information and to retrieve it from them only when it is required. There is then no need to keep track of information that floods across your desk, nor any risk of using out-of-date information.
A corporate web is simply an extension of this idea. The information is made available on the Web by the person responsible for it. Users retrieve information when it is required.
Corporate Webs already exist. However, we do not believe that corporate webs have reached their full potential. We believe that this will not happen until the costs of making information available on the web are reduced.
Maintaining the infrastructure of a corporate web involves the activities normally undertaken by a 'Webmaster'. These activities include adding new and modified pages to the web, ensuring pages are consistent in style, and checking and fixing broken hypertext links.
While it is essential to undertake these activities to maintain a reliable and useful corporate web, they are an overhead as they are not directly related to making information available. Such overheads should be reduced as far as possible as they reduce the economic viability of a corporate web. A single Webmaster would cost an organisation $80,000 to $100,000 per year including overheads. It makes economic sense to attempt to reduce the number of Webmasters used by automating some of their functions.
The purpose of this paper is to describe one tool designed to reduce the cost of maintaining a corporate web.
It appears that very few WWW tools currently address this problem. Some tools address the issue of checking and repairing links and these will be briefly discussed in the section below.
The term 'intranet' has recently become popular for the concept of adopting internet technology, particularly Web technology, within organisations. We prefer the term 'Corporate Web' in this paper as we are specifically interested in Web technology and not in other internet technology.
The tool consists of three main modules: a browser, a client-side submission module, and a server-side submission module. The browser allows the user to inspect pages from the corporate HTTP server and select links in pages for modification; the presentation is similar to conventional graphical web browsers. The client-side submission module provides a dialog through which the user chooses the head of a set of HTML pages on their machine for submission and specifies options such as metadata. The server-side submission module accepts HTML pages from the client for inclusion in the repository and performs the necessary modifications to existing pages in the repository. This module also handles security and page ownership so that only the owner can make further modifications.
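As a concrete illustration of how the client-side and server-side submission modules might interact, the following sketch posts the head page of a set of HTML pages to the server-side module over HTTP. The endpoint URL, header names and ownership convention are illustrative assumptions, not the tool's actual protocol.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

// Client-side submission sketch: read the head page of a set of HTML pages
// and post it to a hypothetical server-side submission module.
public class SubmitPage {
    public static void main(String[] args) throws IOException {
        Path headPage = Path.of(args[0]);            // head of the page set on the user's machine
        byte[] html = Files.readAllBytes(headPage);

        URL endpoint = new URL("http://corporate-web.example/submit");   // assumed endpoint
        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/html");
        conn.setRequestProperty("X-Page-Owner", System.getProperty("user.name"));          // for ownership checks
        conn.setRequestProperty("X-Page-Location", "/projects/" + headPage.getFileName()); // desired location

        try (OutputStream out = conn.getOutputStream()) {
            out.write(html);                          // send the page body
        }
        System.out.println("Server responded: " + conn.getResponseCode()
                           + " " + conn.getResponseMessage());
    }
}
```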
The Java language was chosen for implementing the submission tool because of its suitability to network applications and its platform independence. Java's standardised Application Programming Interface was a further attraction.
The submission tool can perform the following functions: separating ownership of information from ownership of the web infrastructure, checking hypertext links and assisting in their repair, adding standard components to pages, and capturing metadata for the corporate search engine.
The following sections expand on some of these functions.
If the information owners are responsible for running their portion of the corporate web, technical reliability suffers. Information owners may not have the skills or resources that are available to specialised system administrators. The servers used (often desktop PCs) are usually not as reliable as corporate servers, and the network links to the servers usually do not have the capacity to reliably serve popular pages.
On the other hand, if the system administrators are responsible for the information content of the corporate web the information reliability suffers. System administrators usually cannot keep track of changes to information as well as the people who are responsible for that information.
By separating the ownership of the infrastructure from the information, the reliability of accessing the web can be guaranteed and the reliability of the information on the corporate web can be improved.
First, the Web submission tool can ensure that any hypertext links added to the corporate web are valid.
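A link check of this kind might look like the following sketch, in which the repository is modelled as a simple set of page paths. The regular-expression extraction of links and the test for "internal" links are simplifications; the actual tool would consult its own link database and a proper HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: validate that links in a submitted page point at pages already in the repository.
public class LinkValidator {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns the internal links in `html` that do not resolve to a known page.
    public static List<String> brokenLinks(String html, Set<String> knownPages) {
        List<String> broken = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String target = m.group(1);
            boolean internal = !target.startsWith("http");   // crude test for in-repository links
            if (internal && !knownPages.contains(target)) {
                broken.add(target);
            }
        }
        return broken;
    }

    public static void main(String[] args) {
        Set<String> repository = Set.of("/index.html", "/projects/report.html");
        String page = "<p>See the <a href=\"/projects/report.html\">report</a> and "
                    + "the <a href=\"/projects/old.html\">old plan</a>.</p>";
        System.out.println("Broken links: " + brokenLinks(page, repository));
    }
}
```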
Second, it can support a system to fix links that are broken as pages are deleted or moved. Note that we believe that the submission tool should not *prevent* links being broken when pages are deleted or moved.
The only way of preventing links from being broken when a page is deleted is to prevent deletion until all links to the page have been removed. But the owner of the page may not own the pages which point to that page and so may not be able to directly delete those links. The owner could ask the owners of the links to make the necessary changes, but cannot force them to make these changes. However, it may be essential for the page to be deleted or moved, for example for legal reasons. This conflict between ownership rights can only be resolved by allowing owners of pages the freedom to delete or move pages without requiring links pointing to that page to be updated.
If the submission tool cannot prevent links from being broken, it can leave the corporate web in an internally consistent state. It can also notify the owners of the broken links and assist them in repairs.
The first step is to identify and deactivate broken links. This prevents users of the corporate web from receiving errors while the link is being fixed and improves the web's apparent reliability. Identifying broken links can be done by a scan of the web, but it is probably preferable to maintain a database of links as part of the submission tool. Having identified the broken links, the next step is to deactivate them. This involves 'turning off' the link so that users cannot 'click' on it. Deactivating broken links can be done automatically before the page has been deleted or moved.
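The following sketch shows one way a broken link could be deactivated: the anchor markup is removed so the link can no longer be followed, while the link text and a record of the old target are retained for later repair. The comment-based marker is an assumption of this sketch, not the tool's actual representation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: deactivate every anchor pointing at a target that has been deleted or moved.
public class LinkDeactivator {
    // Replaces the anchors with their plain text, annotated with a comment recording the old target.
    public static String deactivate(String html, String deadTarget) {
        Pattern anchor = Pattern.compile(
            "<a\\s+href=\"" + Pattern.quote(deadTarget) + "\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = anchor.matcher(html);
        return m.replaceAll("$1<!-- deactivated link: " + deadTarget + " -->");
    }

    public static void main(String[] args) {
        String page = "Read the <a href=\"/plans/1996.html\">1996 plan</a> for details.";
        System.out.println(deactivate(page, "/plans/1996.html"));
    }
}
```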
The second step is to assist the owner of the broken link in fixing (or deleting) the link. The implementation of access control on modifying web pages implies that the system keeps track of who 'owns' particular pages. This information is used to email the owner that a particular link has been deactivated because it was broken. The email must contain sufficient information to allow the owner to identify the broken link, and should include as much information as possible about what happened to the page the link pointed to. This information should be included twice: once in a human-readable form and once in a machine-processable form.
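A notification combining the two forms might look like the sketch below. The 'X-Broken-Link-*' field names are hypothetical; the paper only requires that the same information appear in both a human-readable and a machine-processable form.

```java
// Sketch of the notification sent to the owner of a deactivated link.
public class BrokenLinkNotice {
    public static String build(String ownerEmail, String pageWithLink,
                               String oldTarget, String newTarget) {
        StringBuilder mail = new StringBuilder();
        mail.append("To: ").append(ownerEmail).append('\n');
        mail.append("Subject: Link deactivated on ").append(pageWithLink).append("\n\n");
        // Human-readable form.
        mail.append("A link on your page ").append(pageWithLink)
            .append(" pointed at ").append(oldTarget)
            .append(", which has been ")
            .append(newTarget == null ? "deleted." : "moved to " + newTarget + ".")
            .append("\nThe link has been deactivated; please repair or remove it.\n\n");
        // Machine-processable form, for the repair tool.
        mail.append("X-Broken-Link-Page: ").append(pageWithLink).append('\n');
        mail.append("X-Broken-Link-Target: ").append(oldTarget).append('\n');
        if (newTarget != null) {
            mail.append("X-Broken-Link-New-Target: ").append(newTarget).append('\n');
        }
        return mail.toString();
    }

    public static void main(String[] args) {
        System.out.print(build("owner@example.org", "/projects/index.html",
                               "/plans/1996.html", "/plans/archive/1996.html"));
    }
}
```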
The machine processable form allows the email to be fed into tools which support the owner in fixing the broken link. A simple tool would open the web page for editing, delete the broken link, and then allow the owner to edit the text around the link (if necessary). If the link was broken because the page it pointed to had been moved, a more complex tool could automatically open the relocated page so that the owner could verify that the link is still useful. If so, the broken link could be automatically repaired and reactivated. Finally, if the page had been replaced by a new page (not necessarily in the same location or on exactly the same topic) the repair tool could start by searching for the new page. This could be by searching for a page with a similar title.
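For the simplest case, where the machine-processable notice records the page's new location, the repair could amount to re-pointing the deactivated link and reactivating it, as in this sketch (which assumes the deactivation marker used in the earlier sketch).

```java
// Sketch: restore a deactivated link once a new target is known.
public class LinkRepairer {
    public static String repair(String html, String linkText, String oldTarget, String newTarget) {
        String deactivated = linkText + "<!-- deactivated link: " + oldTarget + " -->";
        String restored = "<a href=\"" + newTarget + "\">" + linkText + "</a>";
        return html.replace(deactivated, restored);
    }

    public static void main(String[] args) {
        String page = "Read the 1996 plan<!-- deactivated link: /plans/1996.html --> for details.";
        System.out.println(repair(page, "1996 plan", "/plans/1996.html", "/plans/archive/1996.html"));
    }
}
```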
Automatically deactivating broken links should greatly improve the apparent reliability of the corporate web; this is important when it is being used by non-technical staff. Providing tools to support the repair of the broken links reduces repair time (and hence cost) and allows repairs to be made by people other than the Webmaster.
An aside: this approach will not support repairs to links outside the corporate web (e.g. hot links from individuals' web pages). Assistance in fixing these links can be achieved by extending the error page to contain the correct hypertext links. If the exact replacement page is not known, the corporate web could perform a search for likely pages and return these on the error page. The system could be extended to silently redirect broken links, but returning an error page is preferable as it indicates to the user that a link has broken and prompts them to fix it.
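A minimal version of such an error page might be generated as follows; the candidate pages would in practice come from a title search of the corporate web, and the markup shown is purely illustrative.

```java
import java.util.Map;

// Sketch: build an error page that reports a broken link and suggests likely replacements.
public class BrokenLinkErrorPage {
    public static String render(String requestedPath, Map<String, String> candidates) {
        StringBuilder html = new StringBuilder();
        html.append("<html><body><h1>Page not found</h1>");
        html.append("<p>The page ").append(requestedPath)
            .append(" has been moved or deleted. It may have been replaced by:</p><ul>");
        for (Map.Entry<String, String> c : candidates.entrySet()) {
            html.append("<li><a href=\"").append(c.getKey()).append("\">")
                .append(c.getValue()).append("</a></li>");
        }
        html.append("</ul></body></html>");
        return html.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("/plans/1996.html",
            Map.of("/plans/archive/1996.html", "1996 Corporate Plan (archived)")));
    }
}
```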
At least two existing tools support the detection and repair of broken links: Morningside's SITEMAN and Adobe's SiteMill. Both are aimed primarily at maintaining hyperlinks.
A corporate web typically requires standard components, such as navigation buttons, on every page. It is currently easy for page owners to forget these standard components. Forgetting the standard buttons at the foot of a page is very common and is now extremely bad practice. These days few people browse webs; most use search engines to directly locate pages of interest. Unfortunately, these pages often lack links 'up' to the parent page or to a 'home' page. The result is that users can locate pages but then cannot browse the web to access 'nearby' pages which are also of interest. It is extremely common, for example, for a search to direct you to a page which contains part of a document with no way of going to the complete document.
The submission tool could easily add these standard components as the page is added to the corporate web.
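For example, a footer with 'up' and 'home' links could be appended as the page enters the repository, along the lines of this sketch; the footer markup and the way the parent and home locations are supplied are assumptions, and the real tool would derive them from the page's position in the corporate web.

```java
// Sketch: append a standard navigation footer to a page as it is added to the repository.
public class StandardComponents {
    public static String addFooter(String html, String parentUrl, String homeUrl) {
        String footer = "<hr><p class=\"corporate-footer\">"
                      + "<a href=\"" + parentUrl + "\">Up</a> | "
                      + "<a href=\"" + homeUrl + "\">Home</a></p>";
        int bodyEnd = html.toLowerCase().lastIndexOf("</body>");
        if (bodyEnd < 0) {
            return html + footer;                  // page has no </body>; just append
        }
        return html.substring(0, bodyEnd) + footer + html.substring(bodyEnd);
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Quarterly Report</h1></body></html>";
        System.out.println(addFooter(page, "/reports/index.html", "/index.html"));
    }
}
```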
An alternative, however, would be for the web server to add these standard components dynamically to pages as it responds to HTTP requests. This has a number of advantages:
Metadata is information about data. In the context of a corporate web, this includes descriptive information about a page (such as its title and subject), retrieval information (such as the host and location of the page), and maintenance information (such as the page's owner).
This metadata needs to be captured when a web page is created, updated when the page is modified, and removed when the page is deleted. The issue for the submission tool is how to support each of these steps.
Capturing the metadata is the most difficult part of the exercise. Metadata may be created manually (e.g. through a pop-up window), derived automatically from the web page, or produced by some combination of automatic and manual creation.
For much of the metadata, automatic capture and maintenance by the submission tool is the most appropriate method. This particularly includes the retrieval information (e.g. host and location). Some of the descriptive information can be captured from structured HTML documents (e.g. title) and some maintenance information (e.g. owner) can be supplied by the submission tool itself.
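The automatically capturable metadata might be gathered roughly as follows; the field names are illustrative rather than the tool's actual metadata schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: metadata the submission tool can capture without human input.
// Retrieval information (host, location) is known to the tool, the title can be
// lifted from structured HTML, and the owner is the submitting user.
public class MetadataCapture {
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static Map<String, String> capture(String html, String host, String location, String owner) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("host", host);
        metadata.put("location", location);
        metadata.put("owner", owner);
        Matcher m = TITLE.matcher(html);
        metadata.put("title", m.find() ? m.group(1).trim() : "(untitled)");
        return metadata;
    }

    public static void main(String[] args) {
        String page = "<html><head><title>1997 Budget</title></head><body>...</body></html>";
        System.out.println(capture(page, "www.corporate.example", "/finance/budget.html", "finance-officer"));
    }
}
```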
The most difficult metadata to capture is that requiring human judgement. The ideal solution would be for the creator or maintainer of a web page to fill in a pop-up window with a description of the page. Although easy to implement, there are a number of problems with this approach. The first is simply motivation. It is difficult to encourage some people to fill in forms; how many people fill in the pop-up form when Word first saves a document? It is possible to make it mandatory to fill in the fields but this just leads to subject entries of 'aaaa' and other similar strings. The second problem is ensuring consistent terminology in the descriptions of pages. Consistency should be found both in the descriptions of similar pages and over time. The problem of consistency can be addressed by training or by the provision of a taxonomy which encourages owners to use similar terminology.
Instead of requiring the owner or maintainer of the page to categorise the page, this task can be delegated to a specialist cataloger. A specialist cataloger can be highly trained and this reduces the problem of ensuring consistency amongst page descriptions and over time. Because each page is added to the corporate web through the submission tool, the submission tool can automatically refer each new page to the cataloger and this ensures that pages are not overlooked. Unfortunately, a cataloger is yet another overhead on the running of a corporate web.
Automatic categorisation of web pages is the final option. This is attractive as it is relatively cheap and can be applied to all web pages. Automatic classification could be as simple as storing the complete text of the page in the search engine and using full text searching. There are other alternatives, including probabilistic classification. There are, however, important issues with automatic classification.
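To give a feel for how simple automatic classification can be, the following sketch extracts subject keywords by counting the most frequent terms after markup and common stop words are removed; a production classifier (probabilistic or otherwise) would be considerably more sophisticated.

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch: a very simple automatic classifier based on term frequency.
public class SimpleClassifier {
    private static final Set<String> STOP_WORDS =
            Set.of("the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on");

    public static List<String> keywords(String html, int howMany) {
        String text = html.replaceAll("<[^>]+>", " ").toLowerCase();   // crude markup removal
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.split("[^a-z]+")) {
            if (word.length() > 2 && !STOP_WORDS.contains(word)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(howMany)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Budget review</h1><p>The budget review covers "
                    + "travel, travel claims and equipment purchases.</p></body></html>";
        System.out.println(keywords(page, 3));
    }
}
```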
Hybrid automatic/manual classification processes are also possible. For example, metadata extraction could commence with automatic extraction. The derived classification keywords could be presented to the owner/maintainer to improve upon.
The methods used to classify pages are likely to change over time as more effective algorithms are discovered. We believe that a language like Java will be of benefit here. Instead of installing a classification program on each desktop in an organisation, Java allows the program to be downloaded as required from the server. Updating and changing the classification program is consequently easy.
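A rough sketch of this arrangement using a URL-based class loader is shown below; the code base URL, class name and classifier interface are hypothetical, and the downloaded class is assumed to have been compiled against the same interface.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.List;

// Sketch: fetch the current classification program from the central server
// instead of installing it on every desktop. All names are hypothetical.
public class RemoteClassifierLoader {

    // Interface the downloaded classifier is assumed to implement.
    public interface Classifier {
        List<String> classify(String html);
    }

    public static Classifier load() throws Exception {
        URL codeBase = new URL("http://corporate-web.example/tools/");            // assumed code base
        URLClassLoader loader = new URLClassLoader(new URL[] { codeBase });       // downloads classes on demand
        Class<?> cls = loader.loadClass("au.example.webtools.CurrentClassifier"); // hypothetical class name
        return (Classifier) cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        Classifier classifier = load();
        System.out.println(classifier.classify("<html><body>Annual report</body></html>"));
    }
}
```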
The first function is control over the contents of the corporate web. By providing authentication and access control, the tool allows the owners of information in an organisation to retain control over the content of their web pages, whilst ceding control over the web infrastructure to corporate system administrators. This improves both access reliability and the reliability of the information on the web.
The second function is the generation and maintenance of metadata for the corporate search engine. Generating metadata requires a complex mix of automatic extraction of information from the web page and manual creation or tuning of the information by the owner. The use of Java allows the submission tool to be downloaded as required from the central server. This allows the submission tool to be changed easily, without the headache of updating thousands of copies of the tool.
The final function addressed in this paper is support for the maintenance of links; in particular, the automatic deactivation of links that are broken when web pages are relocated. The owners of pages containing these links are automatically notified, and an additional tool is provided to minimise the repair cost.