Yet Another Attempt to Alleviate the Stale Links Problem


Andrew Humphrey, ITS, The University of Melbourne, Parkville 3052, Australia. Phone: +61 3 9344 7497, Fax: +61 3 9347 4803, Email: a.humphrey@its.unimelb.edu.au

Keywords

hyperlinks, databases, repetition

Abstract

Fundamentally, the design of the hypertext anchor tag and the image tag in HTML is flawed: their definitions allow no scope for destinations to have knowledge of the links pointing at them. This paper presents one method for overcoming this oversight. The method externalises the meta information about a page and stores it in a database. Omnimark, a programming language with explicit support for SGML, was used to facilitate data conversion and to ease the programming task.


Introduction

One of the strengths of the World Wide Web is also one of its weaknesses: the ability to make navigable links to almost anywhere or anything. It is this flexibility that is its downfall. The person creating the link has no guarantee that the destination will exist at any point in the future, or, worse, that the content of the destination will remain semantically the same as when the link was made.

This paper presents yet another attempt at solving this well-known problem. The approach taken is to externalise the storage and management of ``meta information'' about a document. First, a description of the problems and the motivation for trying to correct them is presented; this is followed by a description of the solution, and finally some mention is made of further work planned on the basis of this idea.

Limitations

This paper does not attempt to solve the problem of semantic change in documents; it merely presents a method for reducing the confusion that arises from ``broken'' or incorrect links and outdated information. The method is also most applicable to an intranet-style arrangement where most of the documents and links are resident on one, or at most a small number of, web servers.

The Problem Space

The fundamental problem with hyperlinks in HTML is that they are ``one-way'' in nature. The destination of a link has no idea that it is being referred to and thus has no way of letting the link, or more significantly the author of the link, know that it has been removed or semantically changed. The solution presented here is to change this and attempt to make links ``two-way''.

The flexibility of the web and its ubiquity in higher educational institutions has led to the World Wide Web being used for information dissemination in the same manner that noticeboards and memos were used in the pre-network age. This means that documents stored on a web server often have a period of relevance after which they should be removed. Managing the removal of expired documents is simple enough; the difficult part is removing all the links to those documents, and again the problem is that documents have no knowledge of the links that point at them.

A Solution

Preliminaries

The method used to make links ``two-way'' is to take them out of the HTML, store them in a relational database, and use the indexing and searching capabilities of the database to facilitate the process. The rest of this paper describes the methods used to realise this idea.

At this point we introduce the ideas of a soft expiry time and a hard expiry time. The soft expiry time is the point at which the information contained within a page is considered old. The hard expiry time is the point after which the information contained within a page is incorrect and should no longer be accessible. This information is not carried within an HTML page and cannot easily be generated automatically, so it is necessary for the author of a page to provide it at the time the page is ``published''.
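
As a concrete illustration of the distinction, the check reduces to a small classification routine along the following lines (a sketch only; the function name and the three state labels are illustrative, not part of the system described here):

  from datetime import datetime

  def expiry_state(soft_expiry, hard_expiry, now=None):
      """Classify a page against its soft and hard expiry times.

      Returns "current" if the page is still up to date, "stale" if it is
      past its soft expiry (old but still accessible), and "expired" if it
      is past its hard expiry (incorrect and should not be served).
      """
      now = now or datetime.now()
      if now >= hard_expiry:
          return "expired"
      if now >= soft_expiry:
          return "stale"
      return "current"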

The architecture

The architecture used in this paper is relatively simple.

[Figure: system architecture]

A module was written for the Apache web server that deals with text/html pages. It parses them looking for <LINKID> tags and replaces each with the relevant link. This module communicates with an mSQL database which holds the link and expiry information. Omnimark was used for the initial data conversion task and also in maintenance tasks. An external module was written for Omnimark to enable it to communicate directly with an mSQL database.

An important change to existing practice was made: HTML pages must be submitted to a publishing process before they are visible from the Web at large. Individuals can no longer directly access the document directory of the web server. They are instead required to go through an HTML form, using the PUT method of HTTP/1.0, to submit either a single page or a zip or stuffit archive containing a tree of files for processing.

The Omnimark Program

Omnimark is a programming language designed to facilitate the processing of SGML documents; it has powerful pattern matching and parsing capabilities. The Omnimark script used for the initial data conversion of an existing HTML tree and the one used for inserting new pages follow exactly the same steps (a sketch of this processing appears after the list):

  1. Ensure that the file being submitted is parsable HTML.
  2. Assign a unique id to the page.
  3. Check to see if the page is in the ``pending'' table; if so, rewrite any now-valid links as <LINKID number> tags.
  4. For every <A>...</A> or <IMG> tag, check whether the destination page exists, and record the link in the database (or as a pending link if the destination has not yet been published).
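
Expressed in Python rather than Omnimark, and with hypothetical helper names (the validate callable and the db.* calls stand in for the real validation and mSQL operations; they are not the actual implementation), the processing is roughly the following sketch:

  import re

  # Matches <A HREF="..."> and <IMG SRC="..."> tags (a simplification of what
  # an SGML-aware parser such as Omnimark actually does).
  A_OR_IMG = re.compile(r'<(a|img)\b[^>]*\b(?:href|src)="([^"]+)"', re.IGNORECASE)

  def publish_page(path, html, db, validate, soft_expiry, hard_expiry):
      # 1. Ensure that the submitted file is parsable HTML.
      if not validate(html):
          raise ValueError("page rejected: HTML does not parse")

      # 2. Assign a unique id to the page and record its expiry times.
      page_id = db.insert_page(path, soft_expiry, hard_expiry)

      # 3. If other pages hold pending links to this page, those links are now
      #    valid and can be rewritten as <LINKID number> tags.
      for pending in db.pending_links_to(path):
          db.promote_pending_link(pending, page_id)

      # 4. For every anchor or image, check whether the destination exists and
      #    record the link, or a pending link if it does not.
      for kind, dest in A_OR_IMG.findall(html):
          remote_id = db.page_id_for(dest)
          if remote_id is not None:
              db.insert_link(page_id, remote_id, kind=kind.lower())
          else:
              db.insert_pending_link(page_id, dest, kind=kind.lower())
      return page_id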

The Link Rewriting Algorithm

The Apache module uses the following algorithm to rewrite LINKID tags into proper HTML <A>...</A> and <IMG> tags. PENDINGLINK tags are simply written out as their link text, i.e. as standard, non-hyperlinked text. (A sketch of this rewriting follows the list.)

  1. For each LINKID tag, check the database to determine the remote end of the link.
  2. If the link id is found in the database, check that the remote end has not expired. If it is past the soft expiry date, output an <A>...</A> tag with a following warning image. If it is past the hard expiry date, or if the link is not found, do not output a tag; instead output the link text as normal text.
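
A minimal sketch of this rewriting, in Python rather than as an Apache module, assuming a lookup callable that stands in for the mSQL query and returns a record with url, kind, text, soft_expiry and hard_expiry fields (url of None meaning the destination page has been removed; the warning image path is purely illustrative):

  import re
  from datetime import datetime

  LINKID = re.compile(r'<LINKID\s+(\d+)>', re.IGNORECASE)

  def rewrite_links(html, lookup, now=None):
      """Replace <LINKID n> tags with <A> or <IMG> markup, degrading to
      plain text when the destination is missing or hard-expired and
      adding a warning image when it is merely soft-expired."""
      now = now or datetime.now()

      def replace(match):
          link = lookup(int(match.group(1)))
          if link is None:
              return ""                              # unknown id: emit nothing
          # Removed or hard-expired destination: stored link text only.
          if link.url is None or now >= link.hard_expiry:
              return link.text
          if link.kind == "image":
              return f'<IMG SRC="{link.url}" ALT="{link.text}">'
          anchor = f'<A HREF="{link.url}">{link.text}</A>'
          # Soft-expired destination: keep the link but flag it.
          if now >= link.soft_expiry:
              anchor += ' <IMG SRC="/icons/stale.gif" ALT="[may be out of date]">'
          return anchor

      return LINKID.sub(replace, html)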

The Publishing Interface

Authors are free to use any method they desire to generate the HTML they wish to publish; the changes occur at the time they wish to make it available to others. At that time the author must use a web form to specify the file(s) they wish to publish together with their soft and hard expiry times. Upon submission of the form the author is given immediate feedback as to whether the attempt was successful, with the reasons for failure if applicable. The primary reason for rejection is HTML that does not parse. A nice side effect of this process is that it ensures all pages on the site are valid HTML, and with slight modifications to the Omnimark program it can also ensure that desired attributes or tags are present (e.g. width, height and alt attributes for image tags, or META tags containing indexing information).
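
For instance, the check for desired image attributes amounts to something like the following sketch (written in Python here; in the real system this kind of check lives in the Omnimark program):

  import re

  IMG_TAG = re.compile(r'<IMG\b[^>]*>', re.IGNORECASE)
  REQUIRED_ATTRS = ("width", "height", "alt")

  def missing_img_attributes(html):
      """Return (tag, missing_attributes) pairs for every <IMG> tag that
      lacks one of the required attributes."""
      problems = []
      for tag in IMG_TAG.findall(html):
          missing = [attr for attr in REQUIRED_ATTRS
                     if not re.search(rf'\b{attr}\s*=', tag, re.IGNORECASE)]
          if missing:
              problems.append((tag, missing))
      return problems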

The Database

The database containing the link and image information is a fairly simple one consisting of three tables: one for links, one for pending links and a third containing information about the pages themselves. The page table has columns for a unique id for the page, the URL of the page, the soft expiry date, the hard expiry date, the author, the creation date and the last modified date. The links tables have columns for the unique link id, the id of the destination page (the remote id), the link text (or alt text in the case of an image), the type of link (image or anchor) and a flag indicating whether the link is external or internal.

To speed up retrieval of data from the database, indexes have been created on the URL and id fields of the page table, and on the id, URL and remote id fields of the links tables. The other fields are never used in a SQL WHERE clause during web server operations, so indexes on them are omitted to conserve disk space.
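
A rough rendering of such a schema in SQL, shown here through Python's sqlite3 purely as a stand-in for mSQL (the column names and types are illustrative, not the actual schema):

  import sqlite3

  # Illustrative schema only; mSQL's type system and index syntax differ
  # slightly from SQLite's.
  SCHEMA = """
  CREATE TABLE pages (
      id          INTEGER PRIMARY KEY,  -- unique page id
      url         TEXT NOT NULL,
      author      TEXT,
      created     TEXT,
      modified    TEXT,
      soft_expiry TEXT,
      hard_expiry TEXT
  );

  CREATE TABLE links (
      id        INTEGER PRIMARY KEY,    -- unique link id (the LINKID number)
      page_id   INTEGER NOT NULL,       -- page containing the link
      remote_id INTEGER,                -- destination page (the remote end)
      url       TEXT,                   -- destination URL
      text      TEXT,                   -- link text, or alt text for images
      kind      TEXT,                   -- 'anchor' or 'image'
      external  INTEGER DEFAULT 0       -- flag: external (1) or internal (0)
  );

  CREATE TABLE pending_links (
      id        INTEGER PRIMARY KEY,
      page_id   INTEGER NOT NULL,
      url       TEXT,                   -- destination not yet published
      text      TEXT,
      kind      TEXT,
      external  INTEGER DEFAULT 0
  );

  -- Indexes on the fields used in WHERE clauses by the web server
  -- (the id fields are already covered by the primary keys).
  CREATE INDEX pages_url_idx    ON pages(url);
  CREATE INDEX links_remote_idx ON links(remote_id);
  CREATE INDEX links_url_idx    ON links(url);
  CREATE INDEX pending_url_idx  ON pending_links(url);
  """

  conn = sqlite3.connect(":memory:")
  conn.executescript(SCHEMA)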


Future Work

Currently the Apache module calls an external program to do the parsing of the web page, and this forking of extra processes is far from acceptable for a high-volume web site. Incorporating an HTML parser into Apache is a large task and one beyond the scope of the current project; however, should this scheme succeed, it would be an obvious improvement.

There is no method provided for authors to change the meta information stored about a page, nor is there any feedback mechanism to tell authors about now-broken links and the like after they have published their pages. This sort of functionality should be easy to add with the W3-mSQL feature of the mSQL database; time has so far prevented this from happening.

This work could be extended to force authors to categorise their works, to facilitate indexing and searching.

Another path of exploration is some automation of the process. Currently documents to be published must be submitted via a form, which is cumbersome for large numbers of files, so the ability to combine many files into one transaction was provided (via tar, zip or stuffit archives). This still requires human intervention. A friendlier interface would be one closer to current practice: authors are given access to the document space, and the act of ftping or copying (via samba or CAP) a document into the web tree automatically triggers the publishing process, with feedback sent to the author via email.


Copyright

© Andrew Humphrey, 1997. The author assigns to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The author also grants a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers, and for the document to be published on mirrors on the World Wide Web. Any other usage is prohibited without the express permission of the author.

