In this paper we describe the structure and algorithms of an indexing mechanism that can answer queries such as "find the instance" or "find the set of instances". The distinguishing feature of this index is that no knowledge of the information structure of the database, or of the layout of the information in it, is required to construct a query. The structure of information in the index is easy to modify "on the fly", so it is possible to make the database "learn", that is, be self-modifying, adjusting its structure and organisation to the query flow on the basis of flow statistics.
Two features allow us to achieve these characteristics: all the database structure information (such as type names and the relations between types) is stored with the data itself, and every elementary value (including type names) is stored in the index. This preserves the semantics of the information and increases the effectiveness of query processing. It also means the database can automatically adjust to a changing environment.
In this paper we also discuss how index information can be distributed and how a resource discovery service can be organised.
One of the aims of this project is to simplify searching on the Web, and the building of Web tools, by providing a more intelligent, flexible and effective index service.
When people say "I want to find the object," it usually means that they have partial knowledge of the object, and want the search system to find matching objects based on this partial knowledge. This is frequently not easy to do even for one fixed type of object. For example, although library catalogue systems are able to find a book based on the title, author name or subject keywords, there are many other possible types of attributes a user might specify. The content of the cover picture may seem an unlikely attribute to search on, but in some cases might significantly narrow the search scope. This problem becomes much more complex when we deal with an unlimited and growing set of object types. The problem of multiple, unplanned, ways of specifying a resource is the first problem which we address in this paper.
The second difference between a resource discovery system and a library catalogue system is that library catalogues can give an authoritative answer for the whole library collection, and sometimes even for other libraries as well. In the current state of Internet resource discovery there are no authoritative answers; it is as if the request had to be repeated for each library shelf, with the responses received over a satellite connection. There are hundreds of thousands of sites (or "shelves") on the Internet offering information but, to find anything, these sites must be contacted individually. The problem of publishing information on the Internet and automatically routing queries to the correct server is the second problem addressed in this paper.
In this paper we discuss the General Index for Resource Discovery (GIRD). GIRD is an attempt to solve the problems described above.
GIRD has a very similar distribution concept. The abstract format we use to exchange information is SGML [SGML]. It is assumed that, if a document is not already in SGML, it can easily be 'marked up' into that format. It should be noted that the HyperText Markup Language (HTML) [HTML] used on the Web is an SGML application. The major part of this paper describes the database which we use to store these SGML abstracts.
At the end of this paper we compare our proposed system with Harvest, Whois++, and network Spiders/Robots.
Currently, we have built software that decomposes SGML structured documents into our index structure for fast access. The incremental indexing rate is about one second per 2 KB of SGML; this can be improved. We have been experimenting with a 1.3 MB file which contains information about the CSIRO organisational structure and staff. The resulting index is about 2.9 MB. We can easily build in compression to reduce this size, but there is a trade-off between compression and performance. We expect to achieve a compressed index of about 0.9 MB in this case; the resulting performance penalty, however, is not yet known.
Homonyms are groups of words pronounced or spelt in the same way but having different meanings. The problem of homonyms arises when federating databases in which different types have the same name but different meanings.
Problems with synonyms and homonyms exist even in traditional databases, despite the design of a database being far more tightly controlled than the publication of information on the Internet.
The second approach to solving the problem of synonyms is to ignore it: the publisher may use any attribute and object names they wish. By doing so, however, the publisher potentially reduces the likelihood that a client program will find the published information.
Consequently, different instances of an object must be allowed to have different structures. This does not cause any problems at the indexing stage, but it does cause problems when answering queries. First, it reduces the effectiveness of queries because it enlarges the potential search set: if we mix cars and motorcycles into the same type, we have to look for a car or a motorcycle in the combined set of cars and motorcycles. Second, it reduces effectiveness because more possible connections between two objects must be traced. This problem is explained in more detail in the next section.
As described in the previous section, objects of the same type may have different structure. For example, some names might have the form:
<name>John Smith</name>
while others might have the form:
<name> <fir_name> John </fir_name> <fam_name> Smith </fam_name> </name>
So, if a query is looking for a name connected to the words "John" and "Smith", the algorithm must follow both paths: the direct one, and the one through objects of type fir_name and fam_name. The more variations there are in the structure of an object, the more possible paths there are to search.
Before a search, the algorithm has to generate all the possible search paths (these are called the "search strategies"). It is not feasible to do this before each search, so they must be generated once (for the first query) and stored with the type information. But then a new problem arises: what if a new object, with a new structure, is added after the search strategies affected by this new object have been generated? The strategies have to be corrected or regenerated. Potentially this may affect every strategy in the database in order to allow access to the new information.
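As a rough illustration of what strategy generation involves (the catalogue of structures and the function below are our own sketch, not the actual GIRD algorithm), the index can enumerate one set of paths per alternative structure of the target type:

from itertools import product

# Hypothetical structure catalogue: each type maps to the alternative
# child-type sequences observed in published instances; an empty tuple
# means the type holds words directly.
STRUCTURES = {
    "name": [(), ("fir_name", "fam_name")],   # flat and split forms of <name>
    "fir_name": [()],
    "fam_name": [()],
}

def strategies(type_name, structures=STRUCTURES):
    """Return one set of paths per alternative structure of type_name."""
    result = []
    for alternative in structures.get(type_name, [()]):
        if not alternative:                    # words are stored directly in this type
            result.append([[type_name]])
            continue
        child_options = [strategies(child, structures) for child in alternative]
        for combo in product(*child_options):  # one structure choice per child
            result.append([[type_name] + path for group in combo for path in group])
    return result

for s in strategies("name"):
    print(s)
# [['name']]                                    -- the direct form
# [['name', 'fir_name'], ['name', 'fam_name']]  -- the split form

Publishing an instance with a previously unseen structure changes such a catalogue, which is exactly why the stored strategies may then need to be corrected or regenerated.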
Fortunately, this is rarely a serious problem. First, immediate access to freshly published information is usually not critical in resource discovery, so the regeneration can often wait for some time; these events can be processed a couple of times each day. Second, the problem only arises when an information object with a new information structure is published, which should not happen often. Third, only the existing strategies have to be regenerated (or removed), not all possible strategies.
The flexibility of information structure, however, does not only decrease the effectiveness of query processing. It also provides a way to adjust the information structure for better query processing, because it allows us to add extra connections between objects to shorten the most frequently used strategies where this does not cause ambiguity.
We are currently working on the prototype query engine. The first version of the engine will process simple queries such as:
get <component name> of <component name> where <word, ...> in <component name>
The <component name> will not usually be an attribute.
Examples:
get name of department where Smith in person (Find the name of the department in which Smith works.)
get name of person where Mathematics and Statistics in department (List the staff of the Mathematics and Statistics Department.)
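As a rough sketch of how such queries might be taken apart (the prototype's actual parser is not described here, so the function and its details are our own), the four components of the query form can be extracted as follows:

import re

QUERY_RE = re.compile(
    r"^get\s+(?P<target>.+?)\s+of\s+(?P<scope>.+?)"
    r"\s+where\s+(?P<words>.+?)\s+in\s+(?P<container>.+)$",
    re.IGNORECASE,
)

def parse_query(text):
    """Split a 'get ... of ... where ... in ...' query into its four parts."""
    m = QUERY_RE.match(text.strip())
    if m is None:
        raise ValueError("unrecognised query: %r" % text)
    parts = m.groupdict()
    # The grammar allows a comma-separated word list; how phrases such as
    # "Mathematics and Statistics" should be split is left open here.
    parts["words"] = [w.strip() for w in parts["words"].split(",")]
    return parts

print(parse_query("get name of department where Smith in person"))
print(parse_query("get name of person where Mathematics and Statistics in department"))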
To publish a piece of information is to make it available for search and retrieval. To achieve this, a publisher must send a structured description of the resource, along with a Uniform Resource Locator (URL) [URL], to the index server. By "structured" we mean some way of indicating the semantic meaning and relations of the information components. Currently we use a subset of SGML for this purpose.
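For illustration only (the person, the URL and the index server's interface below are hypothetical), publishing then amounts to handing an index server the SGML description together with the URL of the resource it describes:

# Hypothetical example of publishing a resource description.
sgml_abstract = (
    "<person>"
    " <name> <fir_name> John </fir_name> <fam_name> Smith </fam_name> </name>"
    " <department> Mathematics and Statistics </department> "
    "</person>"
)
resource_url = "http://www.example.org/staff/jsmith.html"   # hypothetical URL

def publish(index_server, abstract, url):
    """Send the structured description and the URL it refers to for indexing."""
    index_server.index(abstract, url)                        # assumed interface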
The queries users ask of a resource discovery system can be divided into two types. The first type asks for "any useful information on this subject". Examples of this type are "How can I decode a MIME message?" and "Where can I get this software?". Once a single thread of the net of information about a subject is found, the linkages can be followed to find related information. Frequently, this first thread is a 'frequently asked questions' (FAQ) anthology or something similar. Since this type of query is common, the FAQ (and similar information) should be replicated and widely distributed.
The second type of question asks for a very particular piece of information, or for complete information on some subject: "Find the e-mail address of a person about whom I know that ..." or "Find all papers written by Dr X on subject Y". Any particular query of this kind will generally be rare, so there is no reason to distribute widely the information that can answer it; but any piece of information can become popular. This leads to the conclusion that a desirable resource discovery solution must not only support answering such detailed queries, but also be capable of automatically adjusting the replication and distribution of information.
To support these two types of queries, we suggest a combination of networked and centralised relationships among the index servers. There would be a dedicated centre, or a small set of centres, containing all the available index information. In addition, there would be a network of primary index servers that contain replicated copies of the most popular information and accumulate knowledge for query routing. The primary servers are the first servers client programs interact with. This architecture is designed to avoid accessing the centralised servers as much as possible by distributing the answers to common queries. General queries will quickly encounter one of these primary servers and be answered; only queries needing detailed or complete information must be referred to the central servers.
Every primary server has a cache of all the information objects which have been retrieved on the server or through the server. The client program first submits a query to the local server. If the query requires complete information, it is passed to the central servers. Otherwise the local server processes the query in its local cache. If this local lookup fails, the server routes the query to the most relevant server according to some (heuristic) criteria. There, the procedure starts again. If the query is not answered after some fixed number of iterations, the query is routed to the central servers to be answered.
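The following sketch summarises this query-handling loop; the classes, the hop limit and the heuristic are our own illustration rather than a defined protocol:

from dataclasses import dataclass, field

MAX_HOPS = 4  # assumed bound on routing iterations before falling back to the centre

@dataclass
class IndexServer:
    name: str
    cache: dict = field(default_factory=dict)       # query -> locally held answer
    neighbours: list = field(default_factory=list)  # peer primary servers
    central: "IndexServer" = None                   # the central server(s), here just one

    def answer_locally(self, query):
        return self.cache.get(query)

    def most_relevant_neighbour(self, query):
        # Placeholder heuristic; a real server would use routing knowledge
        # accumulated from exported subjects and previous replies.
        return self.neighbours[0] if self.neighbours else None

def handle_query(server, query, needs_complete=False, hops=0):
    if needs_complete:                        # detailed or complete queries go straight to the centre
        return server.central.answer_locally(query)
    answer = server.answer_locally(query)
    if answer is not None:                    # answered from the local cache
        return answer
    peer = server.most_relevant_neighbour(query)
    if peer is None or hops >= MAX_HOPS:      # no promising peer, or too many hops
        return server.central.answer_locally(query)
    return handle_query(peer, query, hops=hops + 1)

centre = IndexServer("centre", cache={"smith e-mail": "answer held only at the centre"})
b = IndexServer("B", neighbours=[IndexServer("A", central=centre)], central=centre)
print(handle_query(b, "smith e-mail"))        # routed B -> A -> centre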
We plan to base our query routing decisions on two sources of information. The first source is information exchange among the index servers: the most significant subjects (according to some criteria) stored in a primary server's cache are periodically exported to the server's neighbours. The second source is the results of previous query routing. After a server answers a query, the result is returned through the same servers which originally routed the query; these servers can then identify the server that can answer queries of this type. They adjust their routing heuristics or, if the information is likely to be popular, store the reply in their cache.
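In minimal form (the dictionaries and names are our own), the second source amounts to each server on the return path noting which peer answered, and keeping the reply itself when it seems popular:

def note_reply(routing_table, cache, subject, answering_server, reply, looks_popular):
    """Record which server answers queries on this subject; cache popular replies."""
    routing_table[subject] = answering_server   # used by later routing decisions
    if looks_popular:
        cache[subject] = reply                  # subject to the cache substitution criteria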
Consequently, computational and communication resources are never wasted: even if they do not directly answer a query, they at least help disseminate knowledge about where information is located.
There are many possible variations in the strategy of storing information from replies in the cache. Caching of information need not happen every time a reply contains information the server did not previously know. The server might only cache the information if the cost of retrieving the information from the server which contains it is higher than some threshold and if this data is expected to be frequently used. In other words, if it is cheap enough to re-retrieve the data, there is no point in caching it.
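In a toy form (the thresholds and names are our own illustration), such a caching rule might look like this:

RETRIEVAL_COST_THRESHOLD = 1.5   # assumed relative cost above which re-fetching is expensive
EXPECTED_USE_THRESHOLD = 5       # assumed requests per day that justify keeping a copy

def should_cache(retrieval_cost, expected_requests_per_day):
    """Cache only when re-retrieval is costly and the data is expected to be used often."""
    return (retrieval_cost > RETRIEVAL_COST_THRESHOLD
            and expected_requests_per_day > EXPECTED_USE_THRESHOLD)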
It should be possible to adjust the strategies of query routing and cache substitution so that the servers gradually specialise in particular topics. For example, suppose server A holds data X1 in its cache. Neighbouring servers B, C, and D continually ask A for X1, but do not store a copy in their own caches because their cache substitution criteria do not consider it necessary. This flow of queries keeps X1 in A's cache for as long as X1 is popular. Now suppose that some client program asks server B for information X2, which is related to X1 but not available on B. The query is routed to A because A contains X1, and X2 is related to X1. If neither A nor any nearby server holds X2, A will have to submit the query to a remote server, or even to the centre, and download the data. A's cache now contains two related pieces of data: X1 and X2. By this process, related information is gradually collected on server A.
Harvest has the concept of a 'gatherer'. This is the component which produces the abstract from the data stored in a base level server. We have not considered this functionality in this paper as a number of methods could be used for generating the SGML abstracts used by our system:
Abstracts in Harvest are exchanged using the SOIF protocol. This information is based on attributes, and there are standard sets of attributes to describe a particular type of resource. This has the advantage of reducing the severity of the synonym problem described earlier in this paper. It does not solve the problem, however, and has the disadvantage of being relatively inflexible as new types of resources are defined or new ways of describing resources are needed. Our abstracts are described using SGML, with no fixed tags or DTD. This maximises the severity of the synonym problem, but increases flexibility.

Harvest can use different search engines (databases) to support its index server (broker). Currently two are supported: Glimpse and Nebula. This allows the system administrator to trade access speed for database size. The database proposed in this paper has the advantage over Glimpse and Nebula that it directly supports structured information. Potentially, this allows much more flexible and powerful searching capabilities.
Harvest does not support distribution of the information in its index servers (brokers). A broker cannot form an abstract of its own data and give it to another broker. Harvest envisages that brokers will specialise in particular subjects, but no mechanism is described to allow a user to find the particular broker which specialises in the topic of interest. (It should be noted that Harvest allows the information to be replicated in its entirety.) Our proposed system has an information structure that allows distribution of information.
Whois++ abstracts are based on relatively fixed 'templates': specifications of the attributes which can be used to describe a particular type of object. This is similar to the SOIF approach used in Harvest, and has the same advantages and disadvantages when compared with the approach we have used.
Unlike Harvest, Whois++ does have a model of how information can be distributed amongst many index servers. Information from abstracts can be distributed from one index server to another, and a protocol is defined to describe how queries are passed from one index server to another to find the information needed to answer them. To achieve this, the servers are arranged in a rough hierarchy.
Spiders and robots have a major advantage in a system where resources on different machines are linked (such as WWW). Linked documents are likely to be closely related. A spider or robot will follow links from host to host, quickly building information about related resources. Forming an abstract of the data on one base level server will miss those related resources not on that server. If the other server is not indexed, that resource might not be found. In a system (such as ours, Harvest, or Whois++) which concentrates on resources available at a particular site, it is easy to miss related resources.
A mechanism has been proposed by which index servers can co-operate so that queries can be answered even though primary servers do not store all of the indexing information. We compared our approach with Whois++, Harvest and Spiders/Robots.
We plan to undertake the following work.
[HTML] HyperText Markup Language Specification 2.0. T. Berners-Lee, D. Connolly, et al. Internet Draft, February 8, 1995.
[WHO] Architecture of the WHOIS++ Service. Peter Deutsch, Rickard Schoultz, Patrik Faltstrom and Chris Weider. Internet Draft, January 1995.
[HAR] Harvest: A Scalable, Customizable Discovery and Access System. C. Mic Bowman, Peter B. Danzig, Darren R. Hardy, Udi Manber, Michael F. Schwartz and Duane P. Wessels. Technical Report CU-CS-732-94, Department of Computer Science, University of Colorado, Boulder. August 1994, revised March 1995.
[SPI] Guidelines for Robot Writers. Martijn Koster. Technical Report, 1994.
[SGML] Practical SGML. Eric van Herwijnen. Kluwer Academic, Boston, 1994.