BOTH: Cooperative Automatic Web Navigation
and Hierarchical Filtering
Michele Angelaccio, Department of "Informatica, Sistemi e Produzione",
University of Rome "Tor Vergata",
v. della Ricerca Scientifica 1, Rome, Italy Phone +39-6-72594478
Fax +39-6-2020519
Email: angelaccio@utovrm.it
Home Page: Michele Angelaccio [HREF 1]
L. Zamburru,
D. Genovese
Keywords: WorldWideWeb, Browsing Tools, Information Filtering, Cooperative Agents,
Network Computing.
Introduction
The exponential growth of the World-Wide Web (W3 for short)
is largely due to the ease of publishing pages on
the Internet using the hypertext-based HTML language. However, the enormous
resulting hypertext graph (the Web Space) makes it difficult to search for documents in the
absence of a uniform and efficient organization.
Keeping a large amount of data well organized is difficult. In fact, the notion of "well
organized" is highly subjective
and personal: what one user finds clear and easy to browse may be difficult for users
with different needs or backgrounds.
Therefore, in addition to browsing, the user must resort to information discovery tools.
These are of two types:
- Subject Trees/Subject Catalogues
- index-based Search Engines
A Subject Tree is a catalogue of Web resources organized in a browsable hierarchy of titles,
each covering a topic. It can also be searched in order to speed up navigation, but user
browsing remains the fundamental
interaction paradigm.
The number of Subject Trees is increasing, and often the proposed links point to other Subject
Trees devoted to more specific topics.
An index-based tool offers instead an up-to-date database of W3 resources that can be queried
by means of a form-based interface in order to locate a given document.
Even though such tools are queried through a conventional browser and their answers are returned
as hypertext links, their purpose
is to avoid the browsing process. Examples of Subject Trees are Yahoo, BUBL, etc., whereas the
most widespread index-based tools are Lycos, WebCrawler, Altavista, etc.
Furthermore, the Yahoo Catalogue provides a search tool operating over its local pages, with the
possibility of redirecting the query to other search tools.
Unfortunately, due to the rapidly increasing size of the WebSpace, these tools are
becoming insufficient.
Examples of inadequacy are:
- the list of answers extracted from the databases indexed by a search tool
may be so long
that manually browsing from the first and most promising link may
be preferable to waiting for further answers;
- no indexed database can cover the entire WebSpace, so in some cases the
desired document may not be indexed and can only be reached by starting from some other returned link.
Thus the information discovery process appears as a two-stage process:
- information retrieval (carried out by means of Subject Tree tools and/or index-based search
tools);
- information filtering of the documents pointed to by the retrieved documents.
Among the solutions proposed in the field of search tool optimization (see
Bush Library, Hamline University, 1995 [HREF5]
for a list of sites that discuss search tools) there are:
- the Meta-Search Engine tools, which interface more than
one search tool in order to mask their differences and collect the answers;
- the
Web Ants project [HREF3],
introduced with the purpose of
increasing the efficiency and the quality of search engines by using
network parallelism in
the implementation of the indexing and database-searching phases.
Fish Search [HREF2]
(De Bra et al. 1994)
and Letizia (Lieberman 1995) are two examples of tools that can be used to support
the information filtering phase.
Fish Search recursively exploits the links contained in each document, starting from the current
page and taking the user query into account to collect relevant documents. Originally
the spider program was part of the client browser; a later form-based version tries to
overcome this inefficiency by implementing the spider as a server-side functionality.
However, the main difficulty is that the user has no access to the structure of the
document graph: the output is a list
of links, similar to that of an index-based search tool.
Letizia is based, instead, on a browsing-assistance agent that learns from the user's actions
and autonomously follows the links on the current page to give hints to the user during
the browsing process.
We feel that the main shortcomings of the above tools are the inadequacy of the output shown
and the inability to convey
how documents were retrieved.
In both cases the path connecting the retrieved documents is not displayed, and the user is
forced to navigate in order to understand the topology of the document graph.
What is needed is the capability to visualize the documents explored by the spider
in an efficient way,
showing the inter-document connections. This would help the user in the information
filtering phase by discarding
unfruitful paths as soon as possible.
This requires integrating the search process with the
visualization of the retrieved documents.
BOTH (Browsing support Oriented To Hierarchies) is a search tool that offers:
- A document-tree based user interface supporting information filtering
- It interfaces the user with the search engine and presents the information view
as a dynamic HTML page containing a tree, each of whose nodes is a
link to a document retrieved by the search engine.
- A cooperative distributed search engine
- This implements an
automatic browsing system over the space of relevant documents. Its architecture can be thought of
as a distributed information gathering system (Spider) that searches starting from a
specified set of links. "Browsing agents" navigate and cooperate to filter
the information, coordinated by a "master agent" that interacts with
the user.
Its name derives from its being both a cooperative search engine that performs
automatic
navigation and an information filtering support, via a graphical visualization of the retrieved
document graph.
We remark that in this way BOTH is able to
integrate the browsing and searching paradigms.
The user can activate BOTH in the following cases:
- to filter the WebSpace pointed to by starting links determined with an index-based
search tool;
- to filter the WebSpace pointed to by a Subject Tree tool, going beyond its local pages.
The query submission includes the definition of the navigation parameters that
characterize the relevant documents graph.
The BOTH spider then starts searching for the documents in the WebSpace and stores them
in a local cache (a mirror of the document graph).
To guarantee an effective visualization of the documents in the mirror and an
efficient support for filtering, the
user interface is implemented by means of a dedicated HTML page (HotView) that shows a
dynamically updated document-links tree.
The HotView page helps the user in the information filtering phase: the
relevant pages on the most promising paths of the tree can be selected
through a form-based mechanism embedded into the hypertextual tree that
allows the selection of one or more tree nodes.
After selecting the most interesting pages, the user can submit either a new query on the
interest graph starting from the selected pages, or a request for the concatenation of the
selected pages.
Section 2 gives a description of the user interface with an example. Section 3
shows the architecture and Section 4 discusses some applications.
The Information Filtering Model
In order to explain the characteristics of the BOTH interface, we first give a more precise
description of the process of searching for
relevant documents. We assume that this process is described by an Information
Filtering model with the following characteristics:
- a browsing process carried out on a hypertextual document graph, called the relevant
documents graph (RDG), defined by:
- a set of initial URLs retrieved in a previous phase (the information retrieval phase);
- all the pages reachable from the initial pages that contain
occurrences of some of the keywords;
- the selection of a set of links to the most relevant pages.
To clarify with an example, consider the query (Q) "Michelangelo OR Paintings",
which aims to find documents describing the paintings of Michelangelo Buonarroti.
The RDG is defined by:
- a set of initial pages regarding ancient paintings or Michelangelo's works
(we assume that a page focusing on the paintings of Michelangelo has not been retrieved
in the information retrieval phase);
- all the pages reachable from the initial pages whose content matches the query Q.
The main difficulties in this process are:
- wrong paths
- In some cases irrelevant pages are retrieved and examined. This is typically a source
of inefficiency, due to the time spent downloading files and the increased network traffic.
For example, a path could be Arts -> Renaissance ->
{Raffaello, Leonardo, ...},
where the leaves do not contain information about Michelangelo.
- multiple sources of information (delocalization)
- In some cases the information is distributed over different documents.
For instance, there could be various documents, each dealing with the ancient paintings hosted in
a given museum, in which Michelangelo is
mentioned. In the absence of a unique reference, the user is forced to navigate among different
pages.
To overcome these difficulties, BOTH furnishes a set of features in support of the information
filtering phase (filtering functions):
- search parameters definition (a suitable form is accessible via the World-Wide Web):
- BOTH needs to know the URLs of a set of pages
from which to start the search. These pages correspond to the output of the localization phase
mentioned in Section 1; to this purpose, a part of the query form is devoted to the
specification of the initial page URLs (initial pages specification);
- the user inserts a list of keywords, interpreted as a boolean
OR (query definition). A stopword mechanism has been implemented (i.e. BOTH ignores highly
recurrent words occurring in the query); a minimal matching sketch is given after this list;
- maximum depth of the graph extracted from the RDG, i.e. the upper
bound on the length of the paths followed by the spider (maximum exploration depth);
- maximum outdegree of the graph extracted from the RDG, i.e. the upper bound on
the number of links to be explored in each page (maximum exploration breadth);
- use of a local cache that mirrors the RDG pages retrieved by the Spider;
- visualization of the mirror in a dedicated HTML page (HotView). In this page the
hypertext structure is reassembled using the links of the original hypertext and displayed as
a tree by means of HTML list tags. The tree data structure is computed by the breadth-first
algorithm described in Section 3;
- output of all the documents selected in the HotView page, merged into a single HTML answer
page (called the Result page). This page also contains a form devoted to query refinement,
which takes as initial pages the ones selected in the HotView.
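As an illustration of the query definition and of the score/abstract node-items described later, the following Python sketch shows one plausible way to apply an OR-query with stopword removal to a page, count keyword matches, and extract an abstract from the phrases containing a keyword. It is only a minimal sketch: the stopword list, the phrase segmentation and all function names are our assumptions, not the actual BOTH (C) implementation.

    import re

    # Illustrative stopword list; the paper does not specify the set used by BOTH.
    STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "to", "for"}

    def prepare_query(keywords):
        """Drop highly recurrent words; the remaining terms are OR-combined."""
        return [k.lower() for k in keywords if k.lower() not in STOPWORDS]

    def score(text, terms):
        """Score number = total number of keyword matches in the page text."""
        words = re.findall(r"[a-z0-9]+", text.lower())
        return sum(words.count(t) for t in terms)

    def abstract(text, terms, max_phrases=3):
        """Concatenate phrases (here: sentences) containing at least one keyword."""
        phrases = re.split(r"(?<=[.!?])\s+", text)
        hits = [p for p in phrases if any(t in p.lower() for t in terms)]
        return " ... ".join(hits[:max_phrases])

    # The query Q of the example above:
    terms = prepare_query(["Michelangelo", "Paintings"])
    page = "Renaissance paintings. Michelangelo's frescoes are famous."
    print(score(page, terms))      # 2
    print(abstract(page, terms))   # both sentences contain a keyword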
The user takes advantage of the compactness of the HotView to select the most relevant
pages.
We show below a sketched HotView that would be obtained by submitting to BOTH
the query Q introduced above.
- Link 1 - Arts
  - Link 1.1
  - Link 1.2 - Renaissance Paintings
    - Link 2.2 ...
  - Link 1.3
- .....
This tree structure shows the pages stored in the mirror by the Spider and preserves the
search order. For example, the most relevant page might be the one corresponding
to link 1.3.
A key role among the filtering functions is played by the visualization of the document
references carried
out by the HotView.
Its main characteristics can be summarized as follows (HotView features):
- hypertext topology
- The hypertext organization of the RDG is shown in a hierarchical way. Each node
of the tree displayed in the HotView is represented by the following node-items:
- document references
- the title of the page and the hot link to the original
document in W3;
- filtering support items
- a score number, evaluated on the basis of the number of
keyword matches, and an abstract, obtained by concatenating a set of phrases each
containing at least one keyword.
The advantage is that these features are combined with the capability to browse a
tree instead of a list of links.
For instance, if two documents have the same
score number, the one with the better sequence of score numbers along the path from the root
could be more promising.
- redundancy
- The WebSpace is a directed graph that may contain cycles, such as those corresponding to
back links. BOTH outputs a tree resulting from a visit of the RDG with a
cycle-avoiding mechanism. This corresponds to a spanning tree algorithm, except for the
nodes with indegree greater than one: these nodes are replicated for each incoming arc,
in order to make evident all the possible browsing paths that reach them.
- dynamic visualization
- The HTML code of the HotView page is updated during the search process. When a new
tree level is available in the local cache, the corresponding node-items are added to the
HotView. While the Spider checks the outgoing links of a page, a blinking icon
is displayed near the page title.
- page selection made on the tree
- By applying the HTML form syntax directly to the code corresponding to the tree,
relevant pages can be filtered, providing an easy way to extract the corresponding
subgraph from the displayed tree (see the sketch after this list). This is accomplished by
means of a check-box for
each node. When the user selects a check-box, the corresponding page is referred to by
its local cache address instead of the original URL.
- page merging request
- A click on the retrieve button activates the merging of the local files
corresponding to the selected pages. The result is sent to the remote W3 client (browser),
providing an easy way to obtain a compact view of two or more pages dealing with the same
subject. In this way documents can be merged without navigating
the tree and without explicitly saving each page locally.
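To make the HotView mechanics concrete, here is a minimal Python sketch of how such a tree could be emitted as nested HTML list tags wrapped in a form, with a check-box per node and the score/abstract node-items. The node fields, the CGI action URL and the form field names are illustrative assumptions; the actual page generated by BOTH is not published in this paper.

    from html import escape

    def render_node(node):
        """One node-item: check-box, title linked to the original W3 document,
        score number, and abstract; children are rendered as a nested list."""
        item = ('<li><input type="checkbox" name="sel" value="%s"> '
                '<a href="%s">%s</a> [score: %d]<br><i>%s</i>'
                % (node["cache_id"], escape(node["url"]),
                   escape(node["title"]), node["score"], escape(node["abstract"])))
        if node["children"]:
            item += "<ul>%s</ul>" % "".join(render_node(c) for c in node["children"])
        return item + "</li>"

    def render_hotview(roots):
        """Wrap the tree in a form so that selected nodes can be retrieved
        or used as starting points of a refinement query."""
        tree = "".join(render_node(r) for r in roots)
        return ('<form method="post" action="/cgi-bin/both"><ul>%s</ul>'
                '<input type="submit" name="retrieve" value="Retrieve"></form>' % tree)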
Figures 1-3 show a typical search session.
Fig. 1 - Search form
Figure 1 shows the query form accessible via the World-Wide Web. It contains:
- fields for setting the initial pages.
By using a meta-search approach, BOTH can start the search from
a Subject Tree home
page, or can query a search engine and take the answers with maximum score into account.
Alternatively, the user can use a
personal hot list by pasting URLs into a text area;
- a text box in which to insert the keywords;
- fields used to change the default settings of the search parameters.
Figure 2 is an example of a HotView in which some nodes are selected;
Figure 3 corresponds to the
so-called Result
page, obtained by merging the selected documents. The Result page contains a query form in
which the user can
choose, among the selected pages, the starting points for a new refinement search.
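A minimal sketch of this merging step, under the assumption that each selected page has a cached copy addressed by an identifier: the Result page is obtained by concatenating the cached files and appending the refinement form. The cache layout and the form field names are hypothetical.

    import pathlib

    CACHE_DIR = pathlib.Path("/tmp/both-cache")   # assumed location of the mirror

    def result_page(selected_ids, keywords=""):
        """Concatenate the cached copies of the selected pages into a single
        HTML answer page and append the query-refinement form."""
        parts = ["<html><body><h1>Result page</h1>"]
        for cid in selected_ids:
            parts.append("<hr>" + (CACHE_DIR / cid).read_text(errors="replace"))
        parts.append('<form method="post" action="/cgi-bin/both">')
        for cid in selected_ids:     # the selected pages become initial pages
            parts.append('<input type="hidden" name="start" value="%s">' % cid)
        parts.append('<input type="text" name="keywords" value="%s">'
                     '<input type="submit" name="refine" value="Refine">'
                     '</form></body></html>' % keywords)
        return "".join(parts)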
In the next section we give a brief description of the BOTH components and of the way
the search engine is organized.
The software architecture
Recent advances in index-based searching
(Web Ants [HREF3])
and disconnected browsing
(WebIQ [HREF4])
have pointed out the importance of a distributed implementation of the
information gathering system. This is even more true in BOTH, because the tree
updating must be interleaved with the search engine in order to perform the
hierarchical visualization as fast as possible.
We have chosen the following master/slave architecture for the search engine:
- a coordinator agent (the master), which coordinates the slaves and
creates and updates the HotView page;
- a pool of micro-browser agents (the slaves), each of which executes the
automatic navigation and the filtering/caching of the retrieved documents.
The coordinator (HVC - HotView Coordinator) and the micro-browsers (MBW) are allocated
on a heterogeneous network of PCs and workstations and run on a wide range of UNIX-like
platforms. The master-slave communications are implemented by means of
a dedicated C communication library using the TCP/IP protocol suite.
Each MBW receives a retrieval request from the HVC and activates concurrent Network Agents,
whose function is to contact the remote HTTP servers and download the requested
documents. To guarantee a low traffic density on the TCP/IP (Wide Area) network,
the files retrieved by the Network Agents are stored in a distributed disk cache defined on the
LAN.
The MBWs also calculate the scores and extract the
abstracts.
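The following Python sketch approximates the behaviour just described: a micro-browser dispatches a batch of URLs to a pool of concurrent Network Agents, each of which downloads a document only if no cached copy already exists. The cache path, the hashing scheme and the pool size are assumptions; the real MBWs are C processes communicating over TCP/IP.

    import hashlib
    import pathlib
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    CACHE = pathlib.Path("/tmp/both-cache")      # shared disk cache on the LAN
    CACHE.mkdir(parents=True, exist_ok=True)

    def fetch(url):
        """Network Agent: download a page unless a cached copy already exists,
        so that WAN traffic for already-retrieved pages is avoided."""
        key = hashlib.md5(url.encode()).hexdigest()
        path = CACHE / key
        if not path.exists():
            with urllib.request.urlopen(url, timeout=10) as response:
                path.write_bytes(response.read())
        return url, path.read_text(errors="replace")

    def micro_browser(urls, agents=4):
        """Retrieve a batch of URLs with a pool of concurrent Network Agents."""
        with ThreadPoolExecutor(max_workers=agents) as pool:
            return list(pool.map(fetch, urls))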
The HVC, instead, accepts incoming search requests and controls the slaves' activity. A
tree structure containing information about the retrieved pages is created and updated for each
active user. Periodically, the HVC uses these structures to update the HTML code of the HotView.
BOTH keeps the HotView file in a directory accessible via the World-Wide Web during the search;
therefore the user can check the search progress at any time and use the partial results.
BOTH is a multiuser system, although it is not designed for free access by the entire Web
community. All the HotView pages (each user has a unique identifier) are stored together,
and a page is removed after a fixed period of time, in order to allow unattended search
sessions: the user submits a query and reconnects later to see the results. In this way,
the normal browsing activity is not interrupted.
The construction of each document tree T
displayed in the HotView page proceeds as follows:
- (level T(0)) the root is a node specified by the user;
- (level T(i+1), i < d) for each node n' in T(i), a maximum
of w external links is followed by the Spider, except when n' has
already been visited (weak cycle avoiding property).
All the retrieved documents matching the query are linked to n' in the next level of T.
All nodes in T correspond to a document matching the query.
The parameters d and w are specified in the query form. A search time-out
and an overall expansion limit can also be set.
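A compact Python rendition of this construction, assuming helper functions fetch, extract_links and matches (they stand in for the Network Agents and for the query filter), is the following; it keeps replicated nodes but, by the weak cycle avoiding property, never expands the same URL twice.

    from collections import deque

    def build_tree(root_url, d, w, fetch, extract_links, matches):
        """Breadth-first construction of a document tree T.
        d = maximum exploration depth, w = maximum exploration breadth.
        fetch(url) -> page text; extract_links(text) -> list of URLs;
        matches(text) -> True if the page satisfies the OR-query."""
        root = {"url": root_url, "children": []}      # level T(0): user-specified
        expanded = set()                              # weak cycle avoiding
        queue = deque([(root, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth >= d or node["url"] in expanded: # already-visited nodes are
                continue                              # kept but not re-expanded
            expanded.add(node["url"])
            text = fetch(node["url"])
            for link in extract_links(text)[:w]:      # at most w external links
                page = fetch(link)
                if matches(page):                     # only matching documents
                    child = {"url": link, "children": []}
                    node["children"].append(child)    # replicated per incoming arc
                    queue.append((child, depth + 1))
        return root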

Fig. 4 - System architecture.
The overall organization of BOTH is illustrated in Figure 4, which shows the
Spider components
and a gateway component. The latter process parses the CGI form
data coming from the HTTP server, distinguishing the following three user request
types (a dispatch sketch is given after the list):
- Query request: contains the list of the initial URLs, the query, all the navigation
parameters (maximum depth and maximum outdegree of the retrieved documents graph) and the
time-out settings;
- Result page request: contains a list of document references corresponding to the
HotView selection;
- Query refinement request: similar to the query request, but specified in the Result
page (the initial pages are among those selected in the HotView).
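A hedged sketch of the gateway's dispatch logic follows: the CGI form data are parsed and classified into the three request types above. All field names and default values are invented for illustration; the exact interface of the original gateway is not given in this paper.

    import os
    from urllib.parse import parse_qs

    def parse_request(query_string=None):
        """Classify an incoming CGI request into one of the three types handled
        by the gateway. All field names here are illustrative, not BOTH's own."""
        raw = query_string if query_string is not None \
            else os.environ.get("QUERY_STRING", "")
        form = parse_qs(raw)
        if "retrieve" in form:      # Result page request: HotView selection
            return "result_page", {"selection": form.get("sel", [])}
        if "refine" in form:        # Query refinement request from the Result page
            return "query_refinement", {
                "start": form.get("start", []),
                "keywords": form.get("keywords", [""])[0].split()}
        return "query", {           # plain Query request
            "start": form.get("url", []),
            "keywords": form.get("keywords", [""])[0].split(),
            "depth": int(form.get("depth", ["3"])[0]),      # maximum depth d
            "breadth": int(form.get("breadth", ["5"])[0]),  # maximum outdegree w
            "timeout": int(form.get("timeout", ["300"])[0])}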
The coordinator receives from the gateway process the message corresponding to the user request
and schedules each micro-browser
according to a given policy. After starting the URL loading, each micro-browser filters the
incoming text by applying the search keywords and returns to the coordinator the list of the
next URLs to be loaded. Concurrently, the HVC creates the HotView page and, in cooperation
with the micro-browsers, decides which URLs must be loaded next.
The inefficiency due to high network load is mitigated
by the local cache, which avoids network access for pages already retrieved by the Spider.
Applications
The combination of the Spider and of the hierarchical visualization of the retrieved pages
augments the capability to filter a relevant document graph. BOTH can therefore be
used whenever the user, after an information retrieval phase, has too many answers
to explore. In particular, among the possible benefits provided by BOTH there are:
- a way to search for a page reachable through a Subject Tree when the corresponding
server provides no search tool;
- a compact view of a relevant Web document graph;
- a way to avoid unnecessary page downloads during the browsing process;
- the capability to merge related information located in different pages.
To illustrate the fourth benefit in particular, Figures 5 and 6 show another example of the
use of BOTH. The goal is to create, starting from a list of newspaper sites, an HTML page
showing how news items covering the same subject have been reported
by different newspapers. BOTH merges the original text extracted from
the different newspaper sites (see the HotView in Fig. 5) into the Result page (Fig. 6).
The user thus obtains a compact view of the different journalistic points of view without
browsing through each newspaper site.

Fig. 5 - A HotView of Italian news.

Fig. 6 - The Result page merging the information distributed in the HotView selection
displayed in Fig. 5.
Conclusion
Among the search tools aimed at assisting user browsing, BOTH may be considered
a first step towards a solution of the Information Discovery problem on the Web
that does not mask the hypertextual structure.
Its main property is to combine automatic navigation of a well-defined document graph with a
hierarchical synthesis of the document graph being explored.
This allows a fast selection of the most promising links and an easy way to obtain a compact view
of information spread over the retrieved documents.
Among the open issues there are:
- a load balancing strategy to better allocate each search subtask;
- a performance evaluation studying the evolution of system parameters such as cache size and
response time;
- other possible applications requiring a distributed Spider, such as disconnected browsing
for mobile computing;
- the implementation of more advanced types of proxy servers by means of the BOTH approach.
References
Lieberman, H. (1995) Letizia: An Agent That Assists Web Browsing.
Proceedings of AAAI 95 - AI Applications in Knowledge Navigation and Retrieval,
MIT, Cambridge MA, USA, 1995, pp. 97-103.
De Bra, P., Houben, G. J., Kornatzky, Y. (1994) Navigational Search in
the World-Wide Web.
URL: ftp://ftp.win.tue.nl/pub/infosystems/www/Mosaic-2.1-fish-short-paper.tar.gz
Hypertext References
- HREF1 - http://russell.ce.utovrm.it/home/angelac/index.html (Home Page: Michele Angelaccio)
- HREF2 - http://www.win.tue.nl:82/win/cs/is/debra/wwwf94/article.html (De Bra et al., "Search for arbitrary information in the WWW: the fish search for Mosaic")
- HREF3 - http://agent2.lycos.com:8001/web (WebAnts Progress Report)
- HREF4 - http://www.cs.purdue.edu/homes/wittmam (WebIQ Design Report)
- HREF5 - http://www.hamline.edu/library/links/comparisons.html (Web Search Tools comparison, Bush Library, Hamline University, 1995)
Copyright
Michele Angelaccio, Lorenzo Zamburru, Domenico Genovese ©, 1996. The authors assign to Southern Cross University
and other educational and non-profit institutions a non-exclusive licence to use
this document for personal use and in courses of instruction, provided that the article
is used in full and this copyright statement is reproduced. The authors also
grant a non-exclusive licence to Southern Cross University to publish this document
in full on the World Wide Web and on CD-ROM and in printed form with the conference
papers, and for the document to be published on mirrors on the World Wide Web.
Any other usage is prohibited without the express permission of the authors.
AusWeb96 The Second Australian WorldWideWeb Conference
"ausweb96@scu.edu.au"