Implementing conceptual linking on today's Web

Duncan Martin, Web Technology Research Group, University of Nottingham, Jubilee Campus, Nottingham, United Kingdom. djm@cs.nott.ac.uk

Mark Truran, Web Technology Research Group, University of Nottingham, Jubilee Campus, Nottingham, United Kingdom. mat@cs.nott.ac.uk

Dr. Helen Ashman, Web Technology Research Group, University of Nottingham, Jubilee Campus, Nottingham, United Kingdom. hla@cs.nott.ac.uk

Abstract

In this paper, we discuss a modular approach to extended linking with the Goate link-enabling proxy. We describe which tasks are assigned to the proxy and which form part of the language modules. We then present a proposed conceptual linking language implemented as a Goate module.

Introduction

Goate (Martin, 2002) is an application that facilitates extended linking in ordinary HTML browsers. The theory behind Goate is that, although HTML links are limited in that they are uni-directional and single-headed, they are still capable of emulating higher-level linking behaviours.

For example, two uni-directional links combined form a bi-directional link; a collection of uni-directional links forms a multi-headed link (Ashman, 2000); and specifying the destination point within a document is possible provided the destination document's author has pre-declared that point.
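The emulation described above can be illustrated with a minimal sketch. The function names and the representation of a link as a (source, target) pair are assumptions made for illustration; they are not taken from Goate itself.

```python
from collections import defaultdict

# Illustrative assumption: a uni-directional HTML link is a (source, target) pair.

def bidirectional(a, b):
    """Emulate one bi-directional link with two uni-directional links."""
    return [(a, b), (b, a)]

def multi_headed(source, targets):
    """Emulate one multi-headed link with a collection of uni-directional links."""
    return [(source, target) for target in targets]

def heads_by_source(links):
    """Group uni-directional links by source to recover each link's heads."""
    table = defaultdict(list)
    for source, target in links:
        table[source].append(target)
    return dict(table)

links = bidirectional("a.html", "b.html") + multi_headed("a.html", ["c.html", "d.html"])
print(heads_by_source(links))
```

Grouping by source shows how a set of ordinary links sharing a source can be read back as a single higher-level link.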

Goate itself does not implement any particular linking language, but instead provides a translation service between various high-level linking languages and low-level HTML linking. In order to provide these facilities, Goate must be able to write to both the source and destination documents, something that is not possible on the public Internet. Instead of writing to the actual pages on the server, Goate alters the documents in transit by functioning as an HTTP proxy. This behaviour is explained in more detail in the following section.

THE ROLE OF THE PROXY

The Goate proxy provides the following facilities:

Interception of communication

As previously mentioned, in order for Goate to implement extended linking it needs to be able to alter the documents that are delivered to the browser, which it achieves by operating as an HTTP proxy.

An HTTP proxy is a relay for HTTP requests. Clients send requests to the proxy, which then sends these requests on to the destination server. Replies are sent back to the proxy, which then forwards them on to the browser. Proxies such as Squid [HREF1] are already in widespread use and perform network infrastructure functions such as caching and controlling access to the Internet.
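As a rough sketch of the relay idea, the following shows a request handler that forwards a request and optionally rewrites the reply before returning it. The names `make_relay`, `fetch` and `rewrite` are hypothetical, and the network layer is replaced by an injected function so that the example stays self-contained; it is not Goate's actual implementation.

```python
from typing import Callable, Optional

def make_relay(fetch: Callable[[str], str],
               rewrite: Optional[Callable[[str], str]] = None) -> Callable[[str], str]:
    """Build a handler that relays a request and optionally alters the reply.

    fetch   -- stands in for the request sent on to the destination server
    rewrite -- stands in for any alteration made to the document in transit
    """
    def handle(url: str) -> str:
        body = fetch(url)              # proxy forwards the request to the server
        if rewrite is not None:
            body = rewrite(body)       # a content-altering proxy changes the reply here
        return body                    # reply is forwarded on to the browser
    return handle

# A fake origin server, used in place of real network access.
origin = {"http://example.org/": "<p>hello</p>"}

plain_proxy = make_relay(origin.__getitem__)
altering_proxy = make_relay(origin.__getitem__, rewrite=str.upper)
print(plain_proxy("http://example.org/"))
print(altering_proxy("http://example.org/"))
```

A caching proxy and a link-inserting proxy differ only in what happens between the fetch and the return.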

Proxies that alter content as it passes through this relay process (Brooks, 1995)(Barrett, 1998) do already exist in systems such as DLS (Carr, 1995)(De Roure, 1999) and Webvise (Grønback, 1999), although they differ from Goate in that they aim to use the browser as one viewer in a larger semi-open hypertext system (we say ‘semi-open’ as only a limited subset of all viewers for any document type is supported).

Correction of documents

It is important that language modules working on a document can be assured of the document's structure. HTML browsers have traditionally been very tolerant of syntactic errors, so our proxy must be equipped to deal with malformed and erroneous pages quickly and efficiently.

Goate does this by correcting documents to be well-formed after retrieval; that is to say, every opening tag has a closing tag, and tags are closed in the reverse of the order in which they were opened. Other requirements, such as attribute values being quoted, are also enforced before the language modules are called.
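The kind of correction described above might look like the following sketch, which uses Python's standard `html.parser` module. It is only an illustration under simplifying assumptions: the class name, the short list of void tags and the decision to drop stray closing tags are ours, and comments, doctypes and unescaped ampersands in text are not handled.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: a small, incomplete set of elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}

class WellFormedWriter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.output = []
        self.open_tags = []  # stack of currently open elements

    def handle_starttag(self, tag, attrs):
        # Re-emit every attribute with a quoted, escaped value.
        quoted = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
        )
        if tag in VOID_TAGS:
            self.output.append("<{}{} />".format(tag, quoted))
        else:
            self.open_tags.append(tag)
            self.output.append("<{}{}>".format(tag, quoted))

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: drop it
        # Close everything opened since `tag`, in reverse order of opening.
        while self.open_tags:
            opened = self.open_tags.pop()
            self.output.append("</{}>".format(opened))
            if opened == tag:
                break

    def handle_data(self, data):
        self.output.append(data)

    def handle_entityref(self, name):
        self.output.append("&{};".format(name))

    def handle_charref(self, name):
        self.output.append("&#{};".format(name))

def correct(html_text):
    writer = WellFormedWriter()
    writer.feed(html_text)
    writer.close()
    while writer.open_tags:  # close anything still open at end of document
        writer.output.append("</{}>".format(writer.open_tags.pop()))
    return "".join(writer.output)

print(correct("<p><b class=x>bold<i>both</b>"))
```

The stack of open tags is what guarantees that elements are closed in the reverse of the order in which they were opened.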

Language modules

A language module is a distinct piece of code that is linked to the proxy at run-time. This relationship is similar to that between a Web browser and a plug-in: the plug-in is not part of the browser itself but interfaces closely with it.

The basic function of the language modules is to ‘understand’ links of a certain type that are contained within a document. For example, you may have a module to process embedded XLinks [HREF2].

The interaction between language modules and the proxy is discussed in more detail below. For the time being, it is enough to know that once the language modules have been called, extended links within the document will have been translated into an internal Goate format and stored in a database.

Insertion of links and anchors

Once the high-level links have been processed by their respective language modules, they are removed from the source document. The proxy then uses a database query to retrieve source links relating to this page. Each link is inserted into the document tree in an internal format known as Goate-link. A Goate-link has no particular textual representation; it is a data structure used to store link destination, link directionality, and a reference to the specification that created the link.
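To make the Goate-link structure concrete, here is a minimal sketch of a record holding the three stored items (destination, directionality, and a reference to the creating specification), together with a database lookup by source page. The table layout, column names and the use of SQLite are assumptions for illustration only; the paper does not specify Goate's storage schema.

```python
import sqlite3
from dataclasses import dataclass
from enum import Enum

class Direction(Enum):
    FORWARD = "forward"
    BACKWARD = "backward"

@dataclass(frozen=True)
class GoateLink:
    destination: str        # where the link leads
    direction: Direction    # link directionality
    specification_id: int   # reference to the specification that created the link

def open_store():
    """Create an in-memory link store (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE links (source TEXT, destination TEXT, direction TEXT, spec_id INTEGER)"
    )
    return db

def links_for_page(db, page_url):
    """Retrieve the source links relating to one page as GoateLink records."""
    rows = db.execute(
        "SELECT destination, direction, spec_id FROM links WHERE source = ? ORDER BY rowid",
        (page_url,),
    )
    return [GoateLink(dest, Direction(direction), spec) for dest, direction, spec in rows]

db = open_store()
db.execute("INSERT INTO links VALUES (?, ?, ?, ?)", ("a.html", "b.html#p5", "forward", 1))
db.execute("INSERT INTO links VALUES (?, ?, ?, ?)", ("c.html", "a.html", "backward", 2))
print(links_for_page(db, "a.html"))
```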

Also inserted are any in-page anchors (i.e. <a name="..." />), which form the destination points for links specified on other pages.

Link appearance and rendering

The final stage of processing our downloaded document is to transform the links from their Goate representation to a HTML representation.

As has been discussed elsewhere (Weinreich, 2001), underlining as a visual cue for linking is not without problems. Goate therefore follows the recommendations of "The look of the link" and uses background shading to identify links, with different colours distinguishing between single-headed and multi-headed links, and between forward and backward links.

Sections that occur within one level of Goate-link become single-headed links. The HTML presentation of these links is trivial: normal <a href> tags are used, with in-line CSS style to set the colour. If a browser is unable to display CSS style in this way, the link takes the browser's default appearance for any HTML link.
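A single-headed link rendered in this way could be produced by something like the sketch below. The function name and the specific colour values are placeholders chosen for illustration; the paper does not state which colours Goate uses.

```python
from html import escape

# Placeholder colours: the actual shades used by Goate are not given in the text.
COLOURS = {"forward": "#ffffcc", "backward": "#ccffff"}

def render_single_headed(text, href, direction="forward"):
    """Render a single-headed link as an ordinary anchor with in-line CSS shading."""
    style = "background-color: {}; text-decoration: none".format(COLOURS[direction])
    return '<a href="{}" style="{}">{}</a>'.format(
        escape(href, quote=True), style, escape(text)
    )

print(render_single_headed("cake", "cookery.xml?a=1&b=2"))
```

Because the styling is in-line, a browser that ignores it still receives a plain, working anchor.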

Sections contained by more than one level of Goate-link are multi-headed links. Where possible these are rendered as a pop-up box using JavaScript and CSS. A number of possible renderings for browsers unable to support this are under consideration.

Since Goate has knowledge of the specific browsing-platform being used (where a browsing platform is a combination of browser software and operating system), it can tailor the HTML/JavaScript/CSS accordingly. With this approach we hope to support a wide range of browsers, namely: Internet Explorer, Mozilla, Opera, Konqueror, Netscape 4.7 (as well as any other browser via ‘text-only’ mode).

THE ROLE OF LANGUAGE MODULES

The key role of a language module is one of ‘end-point evaluation’. Given a link destination specification (of a type the module supports) and a document, the module must be able to return the points in that document referred to (if any).

There are two occasions where the end-point evaluation function in the module is called, which will now be discussed.

Time of evaluation approaches

Take as an example a link specification on a source HTML page that equates to a link to the fifth paragraph on the destination page. At the time this link is seen we know neither the precise location of the fifth paragraph (in terms of an XML element node or byte offset) nor whether the destination page has a fifth paragraph at all.

One approach to this uncertainty is to display the link, and work out where the fifth paragraph is (if indeed it exists at all) when the destination document is delivered. We call this mode speculative linking.

Alternatively, the module could load and parse the destination page to check if the destination specified exists. If not, the source link is never displayed. This mode we call confirmed linking. Both of these approaches have advantages and disadvantages. Goate will support both, with the specific module dictating which one is in use for any given link.

In terms of end-point evaluation, confirmed linking works out the end-point when the source link is displayed, whilst speculative linking delays this until the destination document is loaded, with the proxy calling the end-point evaluation function at this later point.

In either case, the module uses an API call to store the information about the source link, e.g. the destination page, link specification used and evaluating language. If this was a confirmed link then the destination point previously worked out is also stored, to save having to re-evaluate the link on destination document delivery.
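The two evaluation modes can be contrasted in a short sketch. Everything named here (`StoredLink`, `nth_paragraph`, `process_source_link`, `on_destination_delivery`, the `"paragraph:N"` specification syntax, and the list used as a store) is hypothetical and stands in for the module API, which the paper does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class StoredLink:
    destination_page: str
    specification: str
    language: str
    end_point: Optional[int] = None  # only filled in for confirmed links

def nth_paragraph(spec: str, document: str) -> Optional[int]:
    """Toy end-point evaluator: 'paragraph:N' -> character offset, or None."""
    wanted = int(spec.split(":")[1])
    offset, count = 0, 0
    for block in document.split("\n\n"):
        if block.strip():
            count += 1
            if count == wanted:
                return offset
        offset += len(block) + 2  # account for the blank-line separator
    return None

def process_source_link(spec, page, mode, fetch: Callable[[str], str],
                        store: List[StoredLink]) -> bool:
    """Return True if the source link should be displayed."""
    if mode == "confirmed":
        point = nth_paragraph(spec, fetch(page))  # evaluate now
        if point is None:
            return False                          # destination missing: hide link
        store.append(StoredLink(page, spec, "toy", end_point=point))
    else:  # speculative: display now, evaluate on delivery
        store.append(StoredLink(page, spec, "toy"))
    return True

def on_destination_delivery(link: StoredLink, document: str) -> Optional[int]:
    """Reuse a confirmed end-point, or evaluate a speculative link late."""
    if link.end_point is not None:
        return link.end_point
    return nth_paragraph(link.specification, document)

pages = {"dest.html": "one\n\ntwo\n\nthree"}
store: List[StoredLink] = []
shown = process_source_link("paragraph:5", "dest.html", "confirmed", pages.__getitem__, store)
```

The trade-off is visible in the control flow: confirmed linking pays for a fetch up front but never shows a dead link, while speculative linking shows the link immediately and may find no end-point later.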

SUGGESTED LANGUAGES

Using the modular approach we hope to encourage many linking languages to be implemented, using Goate as the infrastructure. Below are a few examples of languages that could be implemented in this way:

CLING

One of the modules that we will be writing for the Goate proxy is known as CLING (conceptual linking in Goate). The CLING module will enable ontological linking by adding to the XLink vocabulary. The XLink specification has at present two basic kinds of link - 'simple' and 'extended'. While the simple XLink offers little in the way of stand-alone functionality previously unavailable in the <A> or <IMG> HTML tags, the extended XLink supplies many of the higher-level linking capabilities discussed above (bi-directional links, n-ary linking, etc.). However, both share the following identifying features:

We propose to extend the XLink specification by adding a third base type - a link of type 'concept' - that differs significantly from this approach by delaying target resolution (based on a document content search) to the moment of traversal.

Usage

Like simple and extended XLinks, a link of type 'concept' is ultimately an element. In the following example, the <demo> tag will serve as a sample linking element:

<demo
xlink:type="concept"
xlink:href="cookery.xml"
xlink:title="narrow conceptual link"
xlink:key="cake">Click here for cake!
</demo>

The link is clearly classified for processing as a conceptual link using the type attribute. The remainder of the XLink then goes on to specify the target resource ('cookery.xml') and the target concept ('cake'), which we have labelled the ‘key’. Possible variations from this simple framework then allow for multiple key descriptions (delimited by commas) to refine the accuracy of the link. Familiar regular expressions can also be used to broaden the focus of the locator attribute, thereby selecting a subset of resources. Here, for example, we have a conceptual link with a locator attribute that points towards an entire domain rather than a single document:

<demo
xlink:type="concept"
xlink:href="www.cookery.com/*"
xlink:title="broad conceptual link"
xlink:key="lobster, dinner, party">Click here for Lobster recipes!
</demo>
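As an illustration of how such an element might be read, the sketch below parses a concept link with Python's standard `xml.etree.ElementTree`. Two assumptions are made that go beyond the examples above: the element carries an explicit `xmlns:xlink` declaration (required for a namespace-aware parser), and the `*` in the locator is treated as a shell-style wildcard via `fnmatch` rather than as a full regular expression. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from fnmatch import fnmatchcase

XLINK = "http://www.w3.org/1999/xlink"  # the standard XLink namespace URI

def parse_concept_link(xml_text):
    """Return locator, keys, title and text of a concept link, or None."""
    element = ET.fromstring(xml_text)
    if element.get("{%s}type" % XLINK) != "concept":
        return None
    raw_keys = element.get("{%s}key" % XLINK, "")
    return {
        "locator": element.get("{%s}href" % XLINK),
        "title": element.get("{%s}title" % XLINK),
        # Multiple key descriptions are delimited by commas.
        "keys": [key.strip() for key in raw_keys.split(",") if key.strip()],
        "text": (element.text or "").strip(),
    }

def locator_matches(locator, url):
    """Assumption: '*' in a locator behaves as a shell-style wildcard."""
    return fnmatchcase(url, locator)

sample = """<demo xmlns:xlink="http://www.w3.org/1999/xlink"
xlink:type="concept"
xlink:href="www.cookery.com/*"
xlink:title="broad conceptual link"
xlink:key="lobster, dinner, party">Click here for Lobster recipes!
</demo>"""

link = parse_concept_link(sample)
print(link["keys"], locator_matches(link["locator"], "www.cookery.com/recipes/lobster.html"))
```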

Information retrieval

CLING is responsible for analysing and flagging occurrences of key concepts in the resources identified in the locator, and then passing this information back to Goate as internal links. Its job is therefore primarily information reconnaissance, and as such it can borrow heavily from techniques and models developed in the information retrieval (IR) community. When the CLING module parses a document it examines individual words and removes those that lend no conceptual weight to the text by using a ‘stop list’ of common English pronouns, articles and prepositions. Next, all of the nouns in the remaining word list are stemmed, that is to say, all instances of plurals are reduced to singular expressions (‘goats’ becomes ‘goat’ and so on). Finally, the remaining words are processed for conceptual similarity with the key attribute using a thesaurus-based comparison. Words that share conceptual propinquity are flagged and an internal link is created, which can then be passed back to Goate for further processing.
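The three steps above (stop list, stemming, thesaurus comparison) can be sketched as follows. The stop list, the thesaurus entries and the plural-stripping rule are tiny illustrative stand-ins, and unlike the description above this sketch stems every remaining word rather than nouns only, since it has no part-of-speech tagger.

```python
import re

# Illustrative stand-ins: a real stop list and thesaurus would be far larger.
STOP_LIST = {"the", "a", "an", "of", "in", "on", "for", "to", "and", "it", "they", "are", "is"}
THESAURUS = {"cake": {"cake", "gateau", "sponge"}}

def stem(word):
    """Naive plural reduction: 'goats' -> 'goat', 'parties' -> 'party'."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def flag_concepts(text, key):
    """Return (word position, word) pairs conceptually related to the key."""
    related = THESAURUS.get(key, {key})
    hits = []
    for position, token in enumerate(re.findall(r"[a-z]+", text.lower())):
        if token in STOP_LIST:
            continue  # lends no conceptual weight
        if stem(token) in related:
            hits.append((position, token))
    return hits

print(flag_concepts("The sponges and the goats are in the kitchen", "cake"))
```

Each flagged position is a candidate anchor point for an internal link.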

Multiple resources, multiple targets

When the locator specifies a large subset of resources to be searched, an unmanageable quantity of targets may arise. If this is the case then CLING is also responsible for truncating the list of targets to a suitable size, by applying relevance criteria, before communicating with Goate. To do this the stop list is applied and noun stemming is performed as above. Those pages containing conceptually appropriate words (or features) are selected and then represented using what is known as the Vector Space Model. In this model a matrix is created in which the frequency of features is plotted against the host documents. Each feature in the matrix is then weighted using the inverse of Shannon’s information theory, so that the more often a word occurs in a document the more likely that document will be a suitable target (Salton, 1988).
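The matrix of feature frequencies against documents, and the truncation of the target list, might be sketched as below. Because the exact weighting scheme is not spelled out here, this sketch simply ranks documents by raw feature frequency; that choice, and the function names, are assumptions made for illustration rather than CLING's actual scoring.

```python
from collections import Counter

def term_document_matrix(documents, features):
    """Map each document name to its vector of feature frequencies."""
    matrix = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        matrix[name] = [counts[feature] for feature in features]
    return matrix

def top_targets(documents, features, limit):
    """Keep at most `limit` documents, highest total feature frequency first."""
    matrix = term_document_matrix(documents, features)
    ranked = sorted(matrix, key=lambda name: (-sum(matrix[name]), name))
    return [name for name in ranked if sum(matrix[name]) > 0][:limit]

# Documents are assumed to be already stop-listed and stemmed token lists.
documents = {
    "a.html": ["lobster", "dinner", "lobster"],
    "b.html": ["party"],
    "c.html": ["goat"],
}
features = ["lobster", "dinner", "party"]
print(term_document_matrix(documents, features))
print(top_targets(documents, features, 2))
```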

CLING also takes advantage of working theories generated by the search engine community, in particular peer reference link currency analysis. Using this approach, the more times a web page is used as the target of a link by its conceptual peers, the more likely that that page is reliable or informative. Given a subset of resources, CLING can track the relationships and interconnections between those resources to perform a similar filtering operation on a much smaller scale, thereby delivering the most appropriate target.
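A very small version of this peer-reference filtering is sketched below: within a given subset of resources, pages are ranked by how many distinct peers link to them. The graph representation and function name are assumptions; real link analysis as used by search engines is considerably more elaborate than a raw in-link count.

```python
def rank_by_peer_links(link_graph):
    """Rank pages in a subset by the number of distinct peers linking to them.

    link_graph maps each page in the subset to the pages it links to.
    Links to pages outside the subset, self-links and duplicates are ignored.
    """
    pages = set(link_graph)
    inbound = {page: 0 for page in pages}
    for source, targets in link_graph.items():
        for target in set(targets):
            if target in pages and target != source:
                inbound[target] += 1
    # Most-referenced first; ties broken alphabetically for a stable order.
    return sorted(pages, key=lambda page: (-inbound[page], page))

graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "c", "d", "elsewhere"],
}
print(rank_by_peer_links(graph))
```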

Advantages of a conceptual link

The advantages of a conceptual link and conceptual hypermedia in general have been known and discussed for some time (Carr, 2001), although work has so far concentrated on linking resources that share ontological similarities rather than enabling the more powerful any-concept linking that CLING can provide (Truran, 2002). A broadly expressed conceptual link should exhibit the following beneficial properties:

Future work

An obvious line of enquiry is the metadata issue. How can CLING use languages such as DAML [HREF7], OIL (Fensel, 2000) or RDF [HREF8] to speed-optimise or refine its resource reconnaissance? How will the move from thesaurus to ontology affect the complexity and usability of what is, in essence, a relatively simple system?

Another related issue is the pre-evaluation of resources. At present CLING evaluates all resources it is passed ‘on the fly’ using the key attribute as a guide. Passed a large enough subset of resources, this evaluation process can entail substantial (i.e. unworkable) delays. Pre-evaluation of web pages offers a possible solution but comes with its own cost in terms of both storage requirements (how many web pages should be cached) and processing overhead (when and how often the evaluation occurs). Evaluation ‘on-spec’ for an as yet undeclared concept also significantly increases the likelihood that CLING would need to adopt an established metadata language to annotate each resource as it went through the pipeline.

ACKNOWLEDGMENTS

This work is supported by the EPSRC, grant number 20164.

References

Ashman, H. (2000). "Relation modelling sets of hypermedia links and navigation". Computer Journal 43. OUP.

Barrett, R. and Maglio, P.P (1998). "Intermediaries: new places for producing and manipulating Web content". Proceedings of the 7th International World Wide Web Conference.

Brooks, C., Mazer, M.S., Meeks, S. and Miller, J. (1995)."Application-Specific Proxy Servers as HTTP Stream Transducers". Proceedings of the 4th International World Wide Web Conference.

Carr, L.A., DeRoure, D., Hall, W. and Hill, G. (1995) "The distributed link service: A tool for publishers, authors and readers". Proceedings of the 4th International World Wide Web Conference.

Carr, L., Hall, W., Bechhofer, S., Goble, C. (2001). "Conceptual Linking: Ontology-based Open Hypermedia", Proceedings of the Tenth International World Wide Web Conference, p.334-342.

De Roure, D., El-Beltagy, S., Carr, L. and Hall, W. (1999). "A Distributed Link Service using Query Routing". Poster session of the 8th International World Wide Web Conference.

Fensel, D., Horrocks, I., Van Harmelen, F., Decker, S., Erdmann, M., Klein, M. (2000). "Oil in a nutshell", Lecture Notes in Artificial Intelligence p1-12, Proceedings of the 12th European Workshop on Knowledge Acquisition, Modeling, and Management, Springer-Verlag

Grønback, K., Sloth, L. and Ørbæk, P. (1999). "Webvise: browser and proxy support for open hypermedia structuring mechanisms of the World Wide Web". Proceedings of the 8th International World Wide Web Conference.

Martin, D., Ashman, H. (2002). "Goate: XLink and beyond", Proceedings of ACM HT'02.

Salton, G. (1988). "Automatic text processing: the transformation, analysis and retrieval of information by computer", Reading, Mass: Wokingham, Addison Wesley.

Truran, M., Ashman, H. (2002) "Human Error and the Semantic Web", Proceedings of the 4th EPSRC PRAC conference.

Weinreich, H., Obendorf, H. and Lamersdorf, W. (2001). "The look of the link - Concepts for the user interface of extended hyperlinks". Proceedings of ACM Hypertext 01.

Hypertext References

HREF1
http://www.squid-cache.org/
HREF2
http://www.w3c.org/TR/xlink/
HREF3
http://www.ruby-talk.org/cgi-bin/scat.rb/ruby/ruby-talk/37623/
HREF4
http://slashdot.org/article.pl?sid=02/04/06/1354223/
HREF5
http://www.w3.org/TR/xpath/
HREF6
http://www.w3c.org/TR/xptr/
HREF7
http://www.daml.org/
HREF8
http://www.w3.org/TR/2000/CR-rdf-schema-20000327/

Copyright

Andrew Treloar © 2000. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.