The Layout Independent Index Service


Arkadi Kosmynin (arkadi@mel.dit.csiro.au) Andrew Waugh (ajw@mel.dit.csiro.au) CSIRO Division of Information Technology
Keywords: Indexing, Searching, Databases

Introduction

Finding information on the Internet is the problem addressed by Resource Discovery. Global queries are expensive or impossible because the databases are large and distributed. In addition, it is hard to apply traditional database optimisation techniques: the Internet environment changes so rapidly that it is not possible to design a permanent database structure or to optimise the database indices.

In this paper we describe the structure and algorithms of an indexing mechanism which is able to answer queries like "find this instance" or "find the set of matching instances". The distinguishing feature of this index is that no knowledge of the information structure of the database, or of the layout of the information in the database, is required to construct a query. The structure of information in the index is easy to modify "on the fly", so it is possible to make the database "learn", that is, to modify itself so that its structure and organisation are adjusted to the query flow, based on statistics gathered from that flow.

Two features allow us to achieve these characteristics: we store all the database structure information (such as the names and relations of the types) with the data itself, and we store every elementary value (including type names) in the index. This preserves the semantics of the information and increases the effectiveness of query processing. It also means the database can automatically adjust to a changing environment.

In this paper we also discuss how index information can be distributed and how a resource discovery service can be organised.

One of the aims of this project is to simplify searching on the Web, and the building of Web tools, by providing a more intelligent, flexible and effective index service.

The problem

The major task of a resource discovery system is similar to that of a library catalogue system, but is much more complex. The first difference is that a library catalogue system has only a few types of object to search for, primarily books and papers. This uniformity allows librarians to choose relevant attributes to search on and to index those attributes efficiently. In a resource discovery system, however, the number of types of resource to search for is potentially unlimited. Although the standard attributes which have been used to catalogue traditional resources are still useful, new types of resource are very likely to demand new cataloguing techniques. Even for traditional resources, new computerised storage and retrieval techniques will probably require new cataloguing methods. The cataloguing techniques appropriate to a World Wide Web page, or to a single frame from a video, are unknown. Worse, it is not yet clear what features these new resources, or new ways of looking at traditional resources, will require. Any resource discovery system must allow for the development of ideas and concepts we do not yet envisage.

When people say "I want to find the object," it usually means that they have partial knowledge of the object, and want the search system to find matching objects based on this partial knowledge. This is frequently not easy to do even for one fixed type of object. For example, although library catalogue systems are able to find a book based on the title, author name or subject keywords, there are many other possible types of attributes a user might specify. The content of the cover picture may seem an unlikely attribute to search on, but in some cases might significantly narrow the search scope. This problem becomes much more complex when we deal with an unlimited and growing set of object types. The problem of multiple, unplanned, ways of specifying a resource is the first problem which we address in this paper.

The second difference between a resource discovery system and a library catalogue system is that library catalogues can give an authoritative answer for the whole library collection, and sometimes even for other libraries as well. In the current state of Internet resource discovery there are no authoritative answers; it is as if the request had to be repeated for each library shelf, with the responses returned over a satellite connection. There are hundreds of thousands of sites (or "shelves") on the Internet offering information, but, to find anything, these sites must be individually contacted. The problem of publishing information on the Internet, and of the automatic routing of queries to the correct server, is the second problem addressed in this paper.

In this paper we discuss the General Index for Resource Discovery (GIRD). GIRD is an attempt to solve the problems described above.

Terminology

In this paper we use the resource discovery terminology of Whois++ [WHO]. Data is held on many different base level servers. An abstract (a centroid in Whois++) is prepared of the collection of data on each base level server. Each abstract is sent to one or more index servers. Each index server collects abstracts from one or more base level servers and indexes these abstracts. Client programs send requests to index servers, which use the information from the abstracts to identify the base level servers which can answer the query. The client program is then referred to these base level servers to retrieve the actual answer. Whois++ also allows an index server to form an abstract of the collection of abstracts it holds and to send this abstract to other index servers. This allows information to be distributed amongst many servers.

GIRD has a very similar distribution concept. The abstract format we use to exchange information is SGML [SGML]. It is assumed that, if a document is not already in SGML, it can easily be 'marked up' into that format. It should be noted that the HyperText Markup Language (HTML) [HTML] used on the Web is an SGML application. The major part of this paper describes the database which we use to store these SGML abstracts.

At the end of this paper we compare our proposed system with Harvest, Whois++, and network Spiders/Robots.

Indexing in GIRD

In this section we discuss our solution to the first problem: a database which is independent of a fixed schema and can grow and develop as new types of resources are defined and new ways of referring to existing resources are developed. This is a key part of an integrated resource discovery solution, because federation of index servers is easier to achieve if the index servers do not depend on the information structure within each index.

We have built software that decomposes SGML structured documents into our index structure for fast access. The incremental indexing rate is about one second for every 2 KB of SGML; this can be improved. We have been experimenting with a 1.3 MB file which contains information about the CSIRO organisational structure and staff. The resulting index is about 2.9 MB. We could easily build in compression to reduce this size, but there is a trade-off between compression and performance. In this case we expect a compressed index of about 0.9 MB, but the resulting performance penalty is not yet known.
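
As an illustration, the sketch below decomposes a small structured fragment into index rows in which every type name and every elementary value carries a reference back to its parent node. It is a minimal sketch only: the row layout, the function names, and the use of an XML parser in place of a full SGML parser are our assumptions, not the implemented system.

import itertools
import xml.etree.ElementTree as ET

def decompose(root):
    # Flatten a parsed abstract into index rows. Type names are indexed
    # exactly like values, and every row keeps a parent reference so the
    # structure can be traced at query time.
    ids = itertools.count()
    rows = []
    def walk(elem, parent):
        node = next(ids)
        rows.append(("type", elem.tag, node, parent))
        for word in (elem.text or "").split():
            rows.append(("value", word, node))  # a value's parent is its node
        for child in elem:
            walk(child, node)
    walk(root, None)
    return rows

doc = ET.fromstring(
    "<name><fir_name>John</fir_name><fam_name>Smith</fam_name></name>")
for row in decompose(doc):
    print(row)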

Synonym and homonym problems

Synonyms are different words with the same meaning, for example 'dog' and 'hound'. They cause problems because different people naturally refer to the same object by different names. Another type of synonym arises from the use of different languages to name an object; this is particularly important in an international network which identifies non-textual resources such as photographs. A third type arises from language drift, as the meanings of words change over time.

Homonyms are groups of words pronounced or spelt in the same way, but having different meanings. The problem of homonyms arises when federating databases in which different types have the same name but different meanings.

Synonym and homonym problems exist even in traditional databases, despite the much tighter control over design that is possible there compared with the Internet.

The synonym problem

Basically, there are two approaches to solving the problem of synonyms. The first is to provide a standard set of names for objects, much as librarians use a controlled vocabulary when cataloguing. Although this approach will prove useful, it has serious problems on the Internet. First, objects will have different names in catalogues based on different languages. Second, agreement on standard names takes time, and by the time the 'standard' name is agreed upon, non-standard names will be widely used. Third, names will change over time and between regions even in the same language. However, we can assume that publishers who are interested in people using the information they have published will take the effort to name their information carefully. If a publisher does not know the accepted name for an attribute or object, they can import a dictionary of existing names from an index and choose an appropriate one. If no existing name is acceptable, the publisher may provide a header that contains lists of synonyms for the document. The index server will then record these synonyms when the abstract is loaded into the server. The information contained in the synonym list will also be used for query processing.
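
As a minimal sketch of this mechanism, assume the synonym header simply maps each preferred name to its alternatives; the table below and the function that consults it are our illustration, not the actual server interface.

synonyms = {"person": ["staff_member", "employee"],
            "department": ["dept", "division"]}

def expand(term):
    # Return every name under which a query term might appear in the
    # index, so the query can be matched against all of them.
    names = {term}
    for preferred, alternatives in synonyms.items():
        if term == preferred or term in alternatives:
            names.add(preferred)
            names.update(alternatives)
    return names

print(expand("dept"))   # {'dept', 'department', 'division'}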

The second approach to solving the problem of synonyms is to ignore it. The publisher may use any attribute and object names they wish, but they then reduce the likelihood that a client program will find the published information.

The homonym problem

This problem is unavoidable, as each publisher may structure an object differently. In addition, different instances of an object may have different contents; for example, some people have published papers and some have not.

Consequently, different instances of an object must be allowed to have different structures. This does not cause any problems at the indexing stage, but it does cause problems when answering queries. First, it decreases the effectiveness of queries because it enlarges the potential search set: for example, if we mix cars and motorcycles into the same type, we have to look for a car or a motorcycle in the combined set of cars and motorcycles. Second, it decreases the effectiveness of queries because it makes it necessary to trace more possible connections between two objects. This problem is explained in more detail in the next section.

Search algorithm

The search algorithm for the data structure described earlier appears, at first, to be simple: find the elementary values mentioned in the query, trace their ancestors using the references, and take the intersection of the resulting sets; the answer has been found. But the requirement of a dynamic and diverse information structure introduces complexity into this algorithm.
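
The simple form of the algorithm can be sketched as follows, assuming the index layout of the decomposition sketch above: an inverted map from each value to the nodes containing it, and a map from each node to its parent. The helper names are ours.

def ancestors(node, parent):
    # Trace a node's chain of parent references up to the root.
    result = set()
    while node is not None:
        result.add(node)
        node = parent.get(node)
    return result

def find(values, index, parent):
    # Intersect the ancestor sets of every query value; the surviving
    # nodes are those connected to all the values in the query.
    common = None
    for value in values:
        reachable = set()
        for node in index.get(value, []):
            reachable |= ancestors(node, parent)
        common = reachable if common is None else common & reachable
    return common or set()

# Index of <name><fir_name>John</fir_name><fam_name>Smith</fam_name></name>
index = {"John": [1], "Smith": [2]}
parent = {1: 0, 2: 0}
print(find(["John", "Smith"], index, parent))   # {0}: the <name> node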

As described in the previous section, objects of the same type may have different structure. For example, some names might have the form:

<name>John Smith</name>

while others might have the form:

<name>
	<fir_name> John </fir_name>
	<fam_name> Smith </fam_name>
</name>

So, if a query looks for a name which is connected to the words "John" and "Smith", the algorithm must follow both paths: the direct one, and the one through the objects of type fir_name and fam_name. The more variations in the structure of an object, the more possible paths there are to search.

In principle, the algorithm has to generate all the possible search paths (called the "search strategies") before a search. It is not feasible to do this for every query, so the strategies must be generated once (for the first query that needs them) and stored with the type information. But then a new problem arises: what if a new object, with a new structure, is added after the search strategies affected by this new object have been generated? The strategies have to be corrected or regenerated, and this may potentially affect every strategy in the database in order to allow access to the new information.

Fortunately, there are several mitigating factors. First, immediate access to freshly published information is frequently not critical in resource discovery, so the regeneration can often wait; these events can be processed a couple of times a day. Second, the problem only arises when an information object with a new information structure is published, which should not happen often. Third, only the existing strategies have to be regenerated (or removed), not all possible strategies.
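
A minimal sketch of this scheme, with illustrative names throughout: strategies are generated on first use, kept with the type information, and simply discarded when a newly published structure touches the types they involve.

strategies = {}   # (target type, value type) -> list of search paths

def get_strategy(target, value_type, generate_paths):
    # Generate a strategy for the first query that needs it; reuse the
    # stored copy for every later query.
    key = (target, value_type)
    if key not in strategies:
        strategies[key] = generate_paths(target, value_type)
    return strategies[key]

def on_new_structure(new_types):
    # Regenerate (here: remove) only the existing strategies affected by
    # the newly published structure; they are recomputed on next use.
    for key in list(strategies):
        if set(key) & set(new_types):
            del strategies[key]

fake = lambda target, vtype: [[target, vtype]]   # stand-in path generator
get_strategy("name", "fam_name", fake)           # computed and cached
on_new_structure({"fam_name"})                   # a new structure uses fam_name
print(strategies)                                # {} - rebuilt on the next query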

The flexibility of the information structure does not only decrease the effectiveness of query processing, however. It also provides a way to adjust the information structure for better query processing, because it allows us to add additional connections between objects to shorten the most heavily used strategies, when this does not cause ambiguity.

We are currently working on the prototype query engine. The first version of the engine will process simple queries such as:

get <component name> of <component name>
	where <word, ...> in <component name>

The <component name> will not usually be an attribute.

Examples:

get name of department where Smith in person (Find the name of the department Smith works in.)

get name of person where Mathematics and Statistics in department (List the staff of the Mathematics and Statistics Department.)
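
A minimal parser for this query form might look like the sketch below. It assumes exactly the shape shown above, with the word list separated by commas, so a multi-word phrase such as "Mathematics and Statistics" is a single item.

import re

def parse(query):
    # Split the query on its four keywords; component names may contain
    # spaces, so we cannot simply tokenise on whitespace.
    m = re.match(r"get (.+) of (.+) where (.+) in (.+)$", query)
    target, scope, words, component = (g.strip() for g in m.groups())
    return {"get": target, "of": scope,
            "words": [w.strip() for w in words.split(",")],
            "in": component}

print(parse("get name of department where Smith in person"))
print(parse("get name of person where Mathematics and Statistics in department"))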

Publishing

Provided the problems of diverse information structure and uniform query language have been overcome, how can a resource be published?

To publish a piece of information is to make it available for search and retrieval. To achieve this, a publisher must send a structured description of the resource, along with a Uniform Resource Locator (URL) [URL], to the index server. By "structured" we mean some way of indicating the semantic meaning and the relations of the information components. Currently we use a subset of SGML for this purpose.
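
A sketch of the publishing step is shown below. The host name, the wire format, and the send transport are all assumptions of this sketch; only the pairing of a URL with a structured SGML description comes from the proposal.

ABSTRACT = """<staff_member>
<name>John Smith</name>
<department>Mathematics and Statistics</department>
</staff_member>"""

def publish(url, abstract, send):
    # The description of the resource and the URL at which the resource
    # itself can be retrieved travel to the index server together.
    send("index.example.org", {"url": url, "abstract": abstract})

publish("http://www.example.org/staff/jsmith.html", ABSTRACT,
        send=lambda host, message: print("to", host, ":", message["url"]))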

Distribution and automatic query routing

Automatic query routing is the routing of a query, without human intervention, to the right server: the one which has the information to answer the query. Automatic intelligent query routing is a very desirable part of an integrated resource discovery solution. If achieved, it would save enormous amounts of searching time.

The queries users ask of a resource discovery system can be divided into two types. The first type asks for "any useable information on this subject". Examples of this type are "How can I decode a MIME message?" and "Where can I get this software?". Once a single thread of the net of information about a subject is found, the linkages can be followed to find related information. Frequently, this first thread is a 'frequently asked questions' (FAQ) anthology or something similar. Since this type of query is common, FAQs (and similar information) should be replicated and widely distributed.

The second type of question asks for a very particular piece of information, or for complete information on some subject: "Find the e-mail address of the person about whom I know that ..." or "Find all papers written by Dr X on subject Y". Generally, any particular query will be rare, and consequently there is no reason to widely distribute the information which can answer it. But any piece of information can become popular. This leads to the conclusion that a desirable resource discovery solution must not only support answering such detailed queries, but must also be capable of automatically adjusting the replication and distribution of information.

To support these two types of queries, we suggest a combined network and centralised relationship among the index servers. There would be a dedicated centre, or a small set of centres, containing all the index information available. In addition, there would be a network of primary index servers that contain replicated copies of the most popular information and accumulate knowledge for query routing. The primary servers are the first servers client programs interact with. This architecture is designed to avoid accessing the centralised servers as much as possible by distributing answers to common queries. General queries will quickly encounter one of the primary servers and be answered; only queries needing detailed or complete information must be referred to the central servers.

Every primary server has a cache of all the information objects which have been retrieved at the server or through it. The client program first submits a query to the local server. If the query requires complete information, it is passed to the central servers. Otherwise the local server processes the query against its local cache. If this local lookup fails, the server routes the query to the most relevant server according to some (heuristic) criteria, and the procedure starts again there. If the query is not answered after some fixed number of iterations, it is routed to the central servers to be answered.

We plan to base our query routing decisions on two sources of information. The first source is information exchange among the index servers: the most significant subjects (according to some criterion) stored in a primary server's cache are periodically exported to the server's neighbours. The second source is the results of previous query routing. After a server answers a query, the result is returned through the same servers which originally routed the query. These servers can then identify the server which can answer queries of this type; they adjust their routing heuristics or, if the information is likely to be popular, store the reply in their cache.
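
The routing loop and the hint update described in the last two paragraphs might be sketched as follows; the data structures, the hop limit, and the scoring rule are illustrative assumptions only.

MAX_HOPS = 3

def route(subject, server, centre, hops=0, complete=False):
    # Queries needing complete information, and queries that have
    # wandered too long, go straight to the centre.
    if complete or hops >= MAX_HOPS:
        return centre[subject]
    if subject in server["cache"]:
        return server["cache"][subject]
    if not server["neighbours"]:
        return centre[subject]
    # Heuristic choice: the neighbour most often seen answering queries
    # on this subject.
    best = max(server["neighbours"],
               key=lambda n: server["hints"].get((n["name"], subject), 0))
    answer = route(subject, best, centre, hops + 1)
    # The reply passes back through this server: remember who answered.
    key = (best["name"], subject)
    server["hints"][key] = server["hints"].get(key, 0) + 1
    return answer

centre = {"MIME": "see the MIME FAQ"}
a = {"name": "A", "cache": {"MIME": "see the MIME FAQ"},
     "neighbours": [], "hints": {}}
b = {"name": "B", "cache": {}, "neighbours": [a], "hints": {}}
print(route("MIME", b, centre))   # answered from A's cache; B records a hint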

Consequently, computational and communication resources are never spent for nothing: if they do not directly answer a query, they at least help disseminate knowledge about where information is held.

There are many possible variations in the strategy for storing information from replies in the cache. Caching need not happen every time a reply contains information the server did not previously know: the server might only cache the information if the cost of retrieving it from the server which holds it is higher than some threshold, and if the data is expected to be frequently used. In other words, if it is cheap enough to re-retrieve the data, there is no point in caching it.
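
The admission rule of the previous paragraph reduces to a sketch like the one below; the cost model and both thresholds are assumptions of this sketch.

COST_THRESHOLD = 10.0          # cost of re-retrieving, arbitrary units
POPULARITY_THRESHOLD = 1.0     # expected queries per day

def should_cache(retrieval_cost, expected_queries_per_day):
    # Cheap data is simply re-retrieved on demand; only expensive and
    # frequently wanted data earns a place in the cache.
    if retrieval_cost <= COST_THRESHOLD:
        return False
    return expected_queries_per_day >= POPULARITY_THRESHOLD

print(should_cache(retrieval_cost=2.0, expected_queries_per_day=50))   # False
print(should_cache(retrieval_cost=40.0, expected_queries_per_day=5))   # True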

It should be possible to adjust the strategies of query routing and cache substitution so that the servers gradually specialise in particular topics. For example, suppose server A contains data X1 in its cache. Neighbouring servers B, C, and D continually ask A for X1, but do not store a copy of the data in their own caches because they do not consider it necessary according to their cache substitution criteria. This flow of queries to A for X1 will keep X1 in the cache at A as long as X1 is popular. Now suppose that some client program asks server B for information X2, which is related to X1 but is not available on B. The query is routed to A because A contains X1 and X2 is related to X1. If neither A nor any nearby server holds X2, A will have to submit the query to a remote server, or even to the centre, and download the data. A's cache now contains two related pieces of data: X1 and X2. By this process, related information will gradually be collected on server A.

Previous approaches

Three current systems for indexing data on the net are centroids (used in Whois++), Harvest, and Spiders/Robots. In this section we compare GIRD with these systems.

Harvest

Harvest [HAR] is a general purpose resource discovery system. It consists of a suite of programs which implement the various functions of a complete system. The system we have proposed performs similar functions to the broker (index server) in Harvest.

Harvest has the concept of a 'gatherer': the component which produces the abstract from the data stored in a base level server. We have not considered this functionality in this paper, as a number of methods could be used for generating the SGML abstracts used by our system.

Abstracts in Harvest are exchanged using the Summary Object Interchange Format (SOIF). This format is attribute based, and there are standard sets of attributes for describing particular types of resource. This has the advantage of reducing the severity of the synonym problem described earlier in this paper. It does not solve the problem, however, and it has the disadvantage of being relatively inflexible as new types of resources are defined or new ways of describing resources are needed. Our abstracts are described using SGML, with no fixed tags or DTD. This maximises the severity of the synonym problem, but increases flexibility.

Harvest can use different search engines (databases) to support its index server (broker). Currently two are supported: Glimpse and Nebula. This allows the system administrator to trade access speed for database size. The database proposed in this paper has the advantage over Glimpse and Nebula in that it directly supports structured information. Potentially, this allows much more flexible and powerful searching capabilities.

Harvest does not support distribution of the information in the index servers (brokers). A broker cannot form an abstract of its own data and give it to another broker. Harvest envisages that brokers will specialise in particular subjects, but no mechanism is described to allow a user to find the particular broker which specialises in the topic of interest. (It should be noted that Harvest does allow a broker's information to be replicated in its entirety.) Our proposed system has an information structure which allows distribution of information.

Whois++

Whois++ [WHO] is a more restricted system than Harvest. Instead of being a general purpose resource discovery tool, Whois++ is designed as a directory service which stores information about 'objects' and permits 'white pages' lookups (i.e. look up an attribute of a known object) and 'yellow pages' queries (i.e. find all objects with a given characteristic).

Whois++ abstracts are based on relatively fixed 'templates': specifications of the attributes which can be used to describe a particular type of object. This is similar to the SOIF approach used in Harvest, and has the same advantages and disadvantages when compared with the approach we have used.

Unlike Harvest, Whois++ does have a model of how information can be distributed amongst many index servers. Information from abstracts can be distributed from one index server to another. A protocol is defined to describe how queries are passed from one index server to another to find the information to answer the query. To achieve this, the servers are arranged in a rough hierarchy.

Spiders and Robots

Spiders and robots [SPI] are programs which actively explore a linked resource network such as the World Wide Web. They can be considered to be a Harvest gatherer which is not fixed to a particular base level server, but which contacts many base level servers according to a policy and retrieves information. The controlling policy is set by the owner of the spider or robot, not by the publisher of the information. This means that the activities of spiders and robots often have significant resource implications for publishers. We have preferred to adopt the model in which publishers make information available to index servers, as this allows publishers to control their own resources.

Spiders and robots have a major advantage in a system where resources on different machines are linked (such as the WWW), because linked documents are likely to be closely related. A spider or robot will follow links from host to host, quickly building up information about related resources. Forming an abstract of the data on one base level server will miss related resources held on other servers, and if another server is not indexed, those resources might not be found. In a system (such as ours, Harvest, or Whois++) which concentrates on the resources available at a particular site, it is easy to miss related resources.

Conclusions and future work

We have described how SGML abstracts can be stored in an index server without requiring a fixed DTD or a fixed database structure. (WWW pages are written in HTML, an SGML application, and hence can be stored in the server.) We have implemented a prototype system for indexing the abstracts, and we have described how queries can be answered using this index.

A mechanism has been proposed by which index servers can co-operate so that queries can be answered even though primary servers do not store all of the indexing information. We compared our approach with Whois++, Harvest and Spiders/Robots.

We plan to undertake the following work.

References

[URL] T. Berners-Lee, L. Masinter and M. McCahill (Eds). Uniform Resource Locators (URL). Request for Comments 1738, December 1994.

[HTML] T. Berners-Lee and D. Connolly. HyperText Markup Language Specification - 2.0. Internet Draft, 8 February 1995.

[WHO] P. Deutsch, R. Schoultz, P. Faltstrom and C. Weider. Architecture of the WHOIS++ Service. Internet Draft, January 1995.

[HAR] C. M. Bowman, P. B. Danzig, D. R. Hardy, U. Manber, M. F. Schwartz and D. P. Wessels. Harvest: A Scalable, Customizable Discovery and Access System. Technical Report CU-CS-732-94, Department of Computer Science, University of Colorado, Boulder. August 1994, revised March 1995.

[SPI] M. Koster. Guidelines for Robot Writers. Technical Report, 1994.

[SGML] E. van Herwijnen. Practical SGML. Kluwer Academic, Boston, 1994.


Copyright

© Southern Cross University, 1995. Permission is hereby granted to use this document for personal use and in courses of instruction at educational institutions provided that the article is used in full and this copyright statement is reproduced. Permission is also given to mirror this document on WorldWideWeb servers. Any other usage is expressly prohibited without the express permission of Southern Cross University.
