As the number of sites which provide information grows, it becomes increasingly difficult for users to find the information which interests them. There is no central authority controlling the quantity, quality, or subject matter of the information, nor are there any agreed standards for indexing or categorisation. For example, there is no standard for keywords, nor is there a standard classification scheme for the WWW such as that set up by the ACM to classify the computing literature.
To meet the need for finding information in the already large and rapidly growing web, numerous indexes of various types have been constructed. In the absence of standards, the indexes attempt to group documents in a variety of ways:
A potential pitfall of indexes is that they can contain too much information, which can be as bad as holding too little. In a large index, a person looking for information can be overwhelmed by reams of data and many "false hits" in response to their queries. Indeed, the whole idea of an index is to give a more compact representation than the full set of information itself.
To solve this potential problem, the WWW.AU index aims to point users at a single "definitive" document from each site. Together with a pointer at that document, the index stores a brief description of the whole site. In this way, queries can be more accurately matched and a shorter list of matches given to the user.
The entries in the index include information about the category, organisation, and description of the site, as well as a password for later modification. These entries can be added by any user of the World Wide Web via a forms interface. In practice, most of the entries have been added by the owners of the sites and by the author of this paper.
The author also periodically checks new entries and sometimes modifies the given descriptions or the given categories to keep the index accurate. Users of the index also provide feedback which is used to correct errors and delete sites which are no longer operative.
Occasionally, a new category is created manually when there is a clear need. This usually happens when one category is growing too large and there is a logical way to reorganise the groupings, or in response to feedback from people who enter their sites into the index but feel that no existing category accurately reflects the content of their site.
These web robots differ in the details of how the traversal is carried out, what documents are retained and for how long, and which parts of the documents are used for index searches. These can include the title, the initial text in the body, and the hypertext information relating to the pattern of the links.
An alternative philosophy is used by Aliweb [HREF 7] which provides an Archie-like indexing for the Web. This system requires owners of information to maintain a special document at their own site describing the information which they make available. The owner registers the address of the special document with Aliweb, which then accesses it on a daily basis in order to maintain an up-to-date index in a central place.
An advantage of this approach over the robots mentioned above is that there is no need for the robot to guess which are the important documents at any site and which parts of those documents best capture the semantic content that is most useful for indexing purposes. With Aliweb, in contrast, a person who is likely to understand the overall content of the site, namely the owner, is charged with that task. The downside is that the notation used to write the special documents is somewhat complex, which could partially explain the relatively low number of sites contributing those documents in practice.
A more generalised system for collecting and indexing information is the Harvest [HREF 8] Information Discovery and Access System. This provides a set of flexible tools which can be configured to parse different types of information and thereby create custom indexes. Among other uses, Harvest has been configured to collect web home pages, rather than attempt to collect all web documents.
It is probably also more uniform than indexes produced by the amalgamation of idiosyncratic descriptions produced individually by document owners. This is the case because the traditional indexer has a wider view of the field of documents to be indexed. Some World Wide Web indexes which are related to this more traditional "moderated" approach are listed below.
This method of growing and maintaining the index is consistent with the overall aim for WWW.AU which is to keep the index as small as possible and of high quality in order to help users find information without transmitting long lists of potential sites containing a large number of false hits.
The URL is the "Uniform Resource Locator" which points at the definitive document for that site. The category gives a first-level classification of the content area of the site (such as "Music" or "Computer_Science"). The node and network are components of the URL which are extracted automatically and used in conjunction with the category for indexing purposes.
The organisation and branch fields are typically used to give the name of the organisation which owns the site and a department or sub-group (e.g. CSIRO, Division of Information Technology). The password is supplied at the time the entry is first created, and is then required in order to modify the entry at any later time. The description can contain any text describing the whole collection of information which defines the site. Again to encourage quality rather than quantity, this description is limited in length to 256 characters.
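The fields described above could be modelled as a simple record. The following Python sketch is illustrative only and is not the WWW.AU implementation. The class name `Entry`, the field names, and in particular the rules used to derive the node and network are assumptions: the paper says only that they are components of the URL which are extracted automatically.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

MAX_DESCRIPTION = 256  # length limit stated in the paper


@dataclass
class Entry:
    """Hypothetical record for one site in the index."""
    url: str
    category: str
    organisation: str
    branch: str
    password: str
    description: str

    def __post_init__(self):
        # Enforce the 256-character limit on the description.
        if len(self.description) > MAX_DESCRIPTION:
            raise ValueError("description exceeds 256 characters")

    @property
    def node(self) -> str:
        # Assumption: the node is the full host name in the URL.
        return urlsplit(self.url).hostname or ""

    @property
    def network(self) -> str:
        # Assumption: the network is the host name minus its first label.
        _, _, rest = self.node.partition(".")
        return rest
```

Under these assumptions, an entry whose URL is `http://www.example.edu.au/index.html` would have the node `www.example.edu.au` and the network `example.edu.au`.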
A decision was made to start with a fixed set of categories based on an analysis of the actual WWW sites in Australia in late 1994. People adding new sites must choose the nearest fit from amongst the existing categories. The author of this paper reviews the new entries periodically, adjusting categories where that is felt to more accurately reflect a site's content.
Occasionally, new categories are created and existing entries regrouped accordingly. This happens either in response to feedback that no category properly matches a particular site, or if a category is growing too large and there is a logical way to split and recombine categories to keep the spread of entries per category more even.
This policy has been borne out in practice since owners of sites do tend to go out of their way to provide feedback, frequently requesting more detailed categories for their own sites. In the event, only four new categories have been needed for the index to date. The new categories created were Internet_Services, Media, Science, and Sport. The latter two were created in response to entries which had no really suitable match to existing categories. Media was created because of the high growth rate in this category during early 1995. The Internet_Services category was created from entries which had previously been placed in the rapidly growing Commercial and Network categories. The actual number of entries per category and how this has changed over time is described elsewhere in this paper.
Each entry must have a unique URL and this is used as the key field for the index. No two entries in the database can share the same URL. The data entry form also asks for a password, which is needed to amend the entry later, a main topic (e.g. the name of the organisation), a subtopic (e.g. the name of a department), and a concise description of the site of at most 256 characters.
Finally, the person adding the entry selects one of the existing categories, and activates the "Add to Index" button. On receipt of the information, the WWW.AU index automatically carries out integrity checking on the fields, and sends back either an acknowledgment showing the entry which has been automatically added to the database, or a message saying that the entry has not been added and listing the problem found in the data. In the latter case, most web browsers allow the user to go back one page to the filled in form, modify the form as required, and resubmit.
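The integrity checking described above might look roughly like the following Python sketch. Only the unique-URL rule and the 256-character limit are taken from the paper. The function name `check_entry`, the form field names, and the remaining checks are assumptions made for illustration.

```python
MAX_DESCRIPTION = 256  # length limit stated in the paper


def check_entry(form, existing_urls, categories):
    """Return a list of problems found in a submitted form (empty if none).

    form          -- dict of submitted fields (hypothetical field names)
    existing_urls -- set of URLs already in the index
    categories    -- set of category names currently defined
    """
    problems = []
    url = form.get("url", "").strip()
    if not url:
        problems.append("URL is missing")
    elif url in existing_urls:
        # The URL is the key field, so it must be unique.
        problems.append("URL is already in the index")
    if not form.get("password"):
        problems.append("password is missing")
    if form.get("category") not in categories:
        problems.append("category is not one of the existing categories")
    if len(form.get("description", "")) > MAX_DESCRIPTION:
        problems.append("description is longer than 256 characters")
    return problems
```

An empty list would correspond to the acknowledgment case, and a non-empty list to the message listing the problems found.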
As with the case of adding a new entry, integrity checking is carried out to ensure that the database is not accidentally updated by an incomplete or wrongly filled in form.
The main feedback received to date requiring deletions comes from users of the index who find that a particular listed site no longer responds to them. However, these users could not use a deletion form anyway, as they do not know the password of the offending site. Feedback of this nature is acted upon by the maintainer of the index.
The maintainer also occasionally deletes sites to keep the size of the index down, even if the sites are still working. This is done when one site is reachable from another closely related site (usually running on the same host and maintained by the same person), and when it is felt that the description of the closely related site would suggest the existence of the other site to a reasonable user. In other words, the two sites are really one site by the definition given in this paper.
The second search method provided is free text searching. This enables a user to enter any string and ask for those entries which match. This is very handy when the user knows a phrase, word, or part of a word which is likely to be contained in the descriptions or key fields of any relevant entries. In contrast, the first search method is useful for browsing through entries on related subjects.
At present, the database has three key fields which are the category, the network and the node in that order. Consider the first example mentioned above. The search request search.cgi?WWW.AUDB/URL.db does not include a category and therefore the whole database is selected. If there were fewer than 20 records, they would all be listed. Since there are more than 20 records, they are classified by the first key field, namely by category. This returns an index of categories, each entry looking somewhat like the second example.
The second example search.cgi?WWW.AUDB/URL.db+Media restricts the database to the "Media" category. If there are not too many entries in that category, they will all be listed. Otherwise, they will be further indexed according to the next key field, in this case the network. This would yield an index of entries similar to the third example.
Thus for browsing the WWW.AU index, users do not need to perform any typing. They can simply click their way through a hierarchy of progressively narrower indexes, until a short list of entries is finally produced.
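The browsing behaviour just described can be sketched as follows. This is a hedged Python illustration rather than the actual `search.cgi` logic: the key field order and the threshold of 20 come from the text, while the function name `browse`, the record layout (a list of dicts), and the treatment of exactly 20 records are assumptions.

```python
from collections import Counter

KEY_FIELDS = ["category", "network", "node"]  # order given in the paper
LIST_LIMIT = 20  # fewer matches than this are listed in full


def browse(records, selected):
    """Return ("entries", matches) or ("index", [(value, count), ...]).

    selected holds the key values chosen so far, one per key field,
    in the order of KEY_FIELDS (an empty list selects everything).
    """
    matches = [
        r for r in records
        if all(r[field] == value for field, value in zip(KEY_FIELDS, selected))
    ]
    depth = len(selected)
    # Short lists, or lists with no key field left to split on, are shown in full.
    if len(matches) < LIST_LIMIT or depth == len(KEY_FIELDS):
        return ("entries", matches)
    # Otherwise classify the matches by the next key field.
    counts = Counter(r[KEY_FIELDS[depth]] for r in matches)
    return ("index", sorted(counts.items()))
```

Each click in the hierarchy then corresponds to calling `browse` again with one more value appended to `selected`.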
The example above will yield a list of all sites concerned with tennis. The search is carried out on key fields and on descriptive fields. All searches are case independent.
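A case-independent free text search of this kind can be sketched in a few lines of Python. The function name `free_text_search` and the exact list of fields searched are assumptions. The paper states only that both key fields and descriptive fields are searched.

```python
# Assumed set of searchable fields: the three key fields plus descriptive ones.
SEARCH_FIELDS = ["category", "network", "node",
                 "organisation", "branch", "description"]


def free_text_search(records, text):
    """Return the records in which any searchable field contains text."""
    needle = text.lower()  # lower-case both sides for case independence
    return [
        r for r in records
        if any(needle in str(r.get(field, "")).lower() for field in SEARCH_FIELDS)
    ]
```

Because the match is on substrings, a query such as `tenn` would also find entries mentioning "Tennis".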
It is interesting to note that this growth was not uniform over the different categories of sites. A table showing the variations is given below. These figures are affected by the introduction of four new categories over the period, Internet_Services, Media, Science, and Sport. In particular, the Internet_Services category took some of the entries from the Commercial and Network categories, and the Science category reduced the Research category.
Category            Dec 94  March 95  Growth  % Growth
Architecture             2         7       5       250
Art                      7        28      21       300
Astronomy                3        13      10       333
Business_Studies         3        24      21       700
Chemistry                3         8       5       167
Commercial              13        33      20       154
Computer_Centre          4         7       3        75
Computer_Science        13        19       6        46
Computing               20        53      33       165
Earth_Science            2         7       5       250
Education                2        19      17       850
Engineering              8        37      29       363
Environment             14        50      36       257
Government               6        32      26       433
Internet_Services        -        56      56         -
Library                 10        38      28       280
Mathematics              4        23      19       475
Media                    -        15      15         -
Medical                  9        44      35       389
Miscellaneous           14        23       9        64
Movie                    1         8       7       700
Multimedia               3        13      10       333
Music                    2        21      19       950
Network                 13        33      20       154
Personal                 4        20      16       400
Physics                  4        22      18       450
Research                18        16      -2       -11
School                   3        19      16       533
Science                  -         9       9         -
Social_Science           7        43      36       514
Sport                    -        16      16         -
Student                  2         5       3       150
TAFE                     1         1       0         0
University              40        47       7        18
TOTAL                  235       809     574       244

The graph above shows the 10 categories which experienced the most growth (400% and upwards) over the period December 1994 to March 1995. The main explanation for these high growth numbers is probably the low base from which they started. By contrast, a category such as Computer_Science shows a lower growth rate of 46% (still significant over a four month period!), largely because a good number of Computer Science Departments had already constructed web sites before December 1994. The same is probably true of general University sites.
The TAFE category is remarkable for its low initial base and not a single additional entry over the four month period. This is in marked contrast to the School category, for example. As mentioned above, the drop in the Research category is accounted for by the introduction of the new Science category.
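The percentage growth figures quoted for the categories are consistent with the usual formula, 100 × (March count − December count) / December count, rounded to the nearest whole number. The small Python check below assumes that formula and half-up rounding. The function name `percent_growth` is hypothetical.

```python
import math


def percent_growth(before, after):
    """Percentage growth from before to after, rounded half up.

    Returns None when before is 0, since growth from nothing is undefined
    (as for categories that did not exist in December 1994).
    """
    if before == 0:
        return None
    return math.floor(100 * (after - before) / before + 0.5)


# Counts taken from the category table in this paper.
assert percent_growth(13, 19) == 46     # Computer_Science
assert percent_growth(8, 37) == 363     # Engineering
assert percent_growth(18, 16) == -11    # Research
assert percent_growth(235, 809) == 244  # TOTAL
```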
Another perspective on these categories can be obtained by studying the actual usage statistics for the WWW.AU index.

Surprisingly, the daily usage is not even, with a peak occurring on Tuesdays and Wednesdays. No explanation for this observation is at hand. There was also a peak usage during the first week of January, possibly related to the holiday season, when the number of hits approached 1,000 per day.
It is quite remarkable that the top twenty hosts are each making over 50 hits per month on the index. Surely at least some of these represent a set of users accessing the Internet via a common host. The top 100 hosts are accessing the WWW.AU index at least 25 times each per month. Even the top 1,000 hosts are responsible for at least 6 hits each per month.
The WWW.AU index has been accessed from approximately 40 different countries. The graph above shows the top 10 regions in terms of the number of hits during the four week period, mid February to mid March.
The percentage of hits coming from each of these regions can be compared to the percentages from the same regions two months earlier. It is interesting that in the December to January period, Australian users comprised only one third of the usage, whereas this grew to two thirds only two months later. The percentage from most of the other domains dropped accordingly, with the biggest drop being from US sites.
This graph shows the most used categories in the index. This was measured by counting the database entries accessed by users over a four week period (mid February to mid March).
The most accessed category is that containing the general University home pages, although this dropped a little in relative terms from two months earlier. The biggest relative increases were in the Commercial and Computing categories.
The index is serving at least 2,000 users per month, with the top 1,000 of these hitting the index at least six times per month.
A key feature of the index has been the focus on keeping it small and accurate, partly by culling entries which effectively point at different parts of what can logically be considered one site. The accuracy comes partially from the authors themselves providing a succinct description of their site together with a definitive URL. However, all entries are moderated to achieve a level of consistency and uniformity.
A big advantage has been gained from storing the entries in a database rather than hard-coding HTML documents. Hard-coded documents are difficult to maintain and develop an inertia which can render them obsolete fairly rapidly. WWW.AU makes it relatively easy to add and modify entries, with the indexes adjusting themselves immediately and automatically. This occurs since the HTML for the indexes is generated on the fly when and as required by incoming requests.
A further advantage of storing the index as a database is that it can more easily be totally reorganised. For example, the look and feel can be readily experimented with by simply tweaking a few software parameters in the computer programs which generate the HTML responses from the database in real time. This would be much harder to achieve if the layout of the entries was tied down ahead of time by building it in to a large number of HTML documents.
On the search front, it would be convenient to make the string matching facility more powerful and general by adding a regular expression facility for specifying more complex matches.
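One way such a regular expression facility might work is sketched below using Python's standard `re` module. This is a possible design and not something the paper specifies: the function name `regex_search` and the record layout are assumptions, and case-independent matching is kept to mirror the existing free text search.

```python
import re


def regex_search(records, pattern, fields):
    """Return records where any of the given fields matches the pattern.

    Raises re.error if the pattern is not a valid regular expression.
    """
    compiled = re.compile(pattern, re.IGNORECASE)
    return [
        r for r in records
        if any(compiled.search(str(r.get(field, ""))) for field in fields)
    ]
```

A pattern such as `tennis|squash` would then select entries mentioning either sport in a single query.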
A further area for improvement lies in the HTML forms used for updating the index. For example, it would be handy to automatically generate an update form whose default values are the existing values of the desired record in the database. At present, the forms begin with standard default values which often need to be replaced in their entirety.
Finally, in the area of deletions, one cannot in practice rely on the owners of a site to register a deletion when the site is dismantled or moved. More often than not, the first indication of this type of situation comes from user feedback. Users become concerned when clicking on an index entry no longer gets the user to the promised destination. A possible solution worth exploring is to automatically poll each URL in the database from time to time, maintaining statistics as to the outcome. If the site is unavailable sufficiently often, it could be scheduled for automatic deletion.
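The polling idea could be prototyped along the following lines. This Python sketch is speculative: the function names, the shape of the statistics, and the thresholds (`min_attempts`, `max_failure_rate`) are all assumptions chosen for illustration, since the paper proposes the approach without fixing any details.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=10):
    """Return True if the URL can be fetched without an HTTP or network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError, ValueError):
        return False


def record_poll(stats, urls, probe=is_reachable):
    """Run one polling round, updating stats[url] = [failures, attempts]."""
    for url in urls:
        counts = stats.setdefault(url, [0, 0])
        counts[1] += 1
        if not probe(url):
            counts[0] += 1


def deletion_candidates(stats, min_attempts=5, max_failure_rate=0.8):
    """Return URLs that failed often enough to be considered for deletion.

    Both thresholds are hypothetical and would need tuning in practice.
    """
    return sorted(
        url for url, (failures, attempts) in stats.items()
        if attempts >= min_attempts and failures / attempts >= max_failure_rate
    )
```

Requiring a minimum number of attempts before a site becomes a candidate guards against removing a site that was merely down during a single poll. The `probe` parameter allows the network check to be replaced when testing.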
Number of users
Over any one month period, there are at least 2,000 different users of the WWW.AU index. Actually, it is difficult to measure distinct users, but we know that there are at least 2,000 different "hosts" (i.e. computers) accessing the index. It is quite likely that some of these hosts, and especially the high usage hosts, represent a number of people availing themselves of the index.

Usage by region

Usage by category

Table of usage by category
The graph above is taken from part of the following table. Note that all categories attracted at least a respectable amount of interest. Even half a percent represents about 100 hits per month.
Category Jan % March % Diff
University 18.1 16.0 -2.1
Commercial 6.3 13.2 6.9
Computing 3.7 9.4 5.7
Personal 8.8 6.5 -2.2
Miscellaneous 8.8 6.5 -2.3
Network 2.5 5.4 2.9
Government 4.4 4.3
Movie 5.2 4.3
Multimedia 5.2 3.1 -2.1
Research 2.0 2.5
Art 4.4 2.5 -1.9
Education 2.2 2.4
Music 3.4 2.4 -1.1
Library 2.7 2.3
Environment 1.8 2.3
Computer_Science 2.2 1.8
Medical 1.8 1.6
Engineering 1.6 1.5
Astronomy 1.2 1.4
Computer_Centre 1.4 1.3
Student 1.4 1.3
TAFE 1.7 1.3
Earth_Science 1.6 1.2
Social_Science 1.6 1.2
Business_Studies 1.4 1.2
School 1.0 1.1
Architecture 1.3 0.7
Chemistry 0.7 0.5
Mathematics 0.8 0.4
Physics 0.5 0.3
Conclusions
The WWW.AU index of Australian web sites has been operative for over four months. Its growth in terms of number of sites indexed has been quite remarkable. Even more noteworthy is the level of use, two thirds of which is coming from Australia and one third from other countries. That level of use almost doubled over a two month period, to over 21,000 hits per month in March 1995.
Further work
The idea of maintaining control of the set of categories has been successful, but challenges will arise as the database grows larger. It is envisioned that an extra level of categories will need to be added as the number of entries per category gets too large. For example, the Science category could have sub-categories such as Physics, Chemistry, Astronomy and General. The database software can already handle this by simply adding an additional key field to the database description. Similarly, it is easy to reorganise the database because of the flexible way in which the entries are stored.
Acknowledgments
I would like to acknowledge Jamie Scuglia and Jason Baragry for a number of valuable suggestions and considerable programming effort.
Hypertext References
Copyright
© Southern Cross University and Leslie Goldschager 1995. Permission is hereby granted to use this document for personal use and in courses of instruction at educational institutions provided that the article is used in full and this copyright statement is reproduced. Permission is also given to mirror this document on WorldWideWeb servers. Any other usage is expressly prohibited without the express permission of Southern Cross University.
Return
to the AusWeb95 Table of Contents