The notion of wrapping a web server to produce XML documents from unstructured web pages is driven by the need for structured data that can be used by a variety of applications. The web contains vast amounts of information that cannot be used by most applications because it targets a human audience. A solution is to automate the browsing process and convert the unstructured extracted information into a more structured format such as XML. This is called wrapping. We have used two different tools to wrap several tourist sites into XML. The tools are Norfolk, a system developed by the CSIRO TED group, and W4F, initially developed at the University of Pennsylvania and now a commercial product. This report describes our practical experience with the tools and compares them. The comparison highlights features required by a wrapper system to support real applications.
XML (Extensible Markup Language) allows semantic information to be stored within a document alongside the underlying data. In contrast with HTML (HyperText Markup Language), which is mainly concerned with specifying the layout of a page, XML separates the data from the information used to lay out the page. XML documents can be displayed in much the same way as an HTML page by defining a stylesheet for the page. This allows XML tags to have meaning for the application generating or processing the XML page, and hence allows machine processing of the information in the page.
The word ‘wrapper’ originates from the database community. A wrapper in this context is used as a mediator between several databases and an application ([7],[11]). In a similar way, in the web environment a wrapper converts information from HTML documents into structured information (like XML). The structured information can be saved for later use, such as answering queries, or generated dynamically on demand through a Web interface or from an application.
Although there is promising research on wrapper induction ([1], [3]) that automatically generates wrappers from examples, this technique is not yet effective in generating real wrappers for Web sites. The approach we chose was to manually write wrappers using a scripting language designed for the purpose of wrapping. We specifically used two tools for this purpose: Norfolk and W4F. Norfolk is a language and a system developed by the TED group since 1997, initially for creating virtual documents from heterogeneous sources [5]. It has recently been extended to cater for the creation of XML documents for the purpose of wrapping. W4F (World Wide Web Wrapper Factory) began as a research tool developed by the University of Pennsylvania (Penn Database Group) [2]. It has now become a wrapping product.
We have used Norfolk and W4F to wrap different tourist sites to get XML data for another project demonstrator. The sites we wrapped were www.fodors.com and www.intown.com.au. Wrapping is worthwhile only for sites that offer a lot of data in a regular format (e.g. a list of hotels) or a large set of similar pages (e.g. hotels for different cities). This is generally the case when web pages are generated from a database. Wrapping is complicated, however, by the fact that the HTML generated may not be regular (this may be so even if the page appears to be regular when viewed by a browser). A second use of wrapping is to wrap a page in which data is updated very frequently (for example a forecast page or a stock quote page). In this case the benefit comes from wrapping the same page thousands of times. We have not explored this second type of application.
In this paper, we outline some of the difficulties encountered with Norfolk and W4F and compare them as wrapper tools in general. This involved testing a variety of wrappers written for the same web server using each of the two tools.
A wrapper has three main tasks: first, to retrieve a web document; second, to extract the relevant information from the web page; and third, to map the information into the XML format required by a specific application.
Let us take the example of the tourist Web site http://www.fodors.com/ and the page that lists hotel information for Melbourne. An actual retrieved page[2] is shown in Figure 1. The wrapper retrieves the page using its URL and then extracts the relevant information. The page displays details for 19 hotels in Melbourne, and for each of them we want to extract the hotel's name, address, phone, fax, price range and description.

The objective of the wrapper is to extract the relevant information from the HTML page and to transform it into a format (XML) that can be easily reused by applications. Figure 2 shows the HTML source of the page to be wrapped (on the left) and the resulting XML (on the right) after the information has been extracted.

It is expected that the same wrapper can be used to wrap hotel information available for other cities on the same site by changing the URL of the page that is passed to the wrapper.
The two wrappers we have used work from a tree representation of the Web page. This takes advantage of the hierarchical structure of the tags (when present). Figure 3 shows a simplified version of the tree associated with the HTML page presented in Figure 2. The labels in the tree are the names of the tags, while the values of the leaves are the actual textual content of the elements or attributes defined by the tags.

Both Norfolk and W4F have a language that uses the notion of tree path to locate the information to be extracted. An absolute path in an HTML tree is an expression that describes how to access a specific element by traversing the tree from the root, specifying each child along the path. For example, if we suppose there is only one table in the document body, the absolute path expression in W4F:
html.body.table.tr[0].td[0].b.text (1)
will access the node that contains the string "THE ADELPHI".
A relative path is a path that is described using descendants rather than specifying each node. For example, in Norfolk:
html..table..td
will retrieve all the elements (sub-trees) with label td in all the table elements.
In both Norfolk and W4F paths can be selected when some conditions are fulfilled (expressed with a where condition). For example, using Norfolk syntax:
select html..table as $table where $table..td contains "THE ADELPHI" (2)
will only select tables that contain the string "THE ADELPHI" in some cell.
The XML Path Language (XPath) [6] is a W3C recommendation for addressing parts of an XML document using location paths. XPath expressions are more complex than those described here, as they use more than the two axes "child" and "descendant" for traversing the tree (e.g. "parent" and "ancestor"). Norfolk and W4F were designed before XPath and can be seen as simplified versions of XPath using a different syntax.
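The path styles above can be sketched with Python's standard xml.etree.ElementTree, whose limited XPath subset also offers child and descendant axes. This is a toy illustration over invented sample data, not the actual Norfolk or W4F implementation (and note that ElementTree predicates are 1-based where W4F indices are 0-based):

```python
import xml.etree.ElementTree as ET

# A toy version of the parsed hotel-page tree from Figure 3.
page = ET.fromstring("""
<html><body>
  <table>
    <tr><td><b>THE ADELPHI</b></td><td>187 Flinders Lane</td></tr>
    <tr><td><b>THE WINDSOR</b></td><td>111 Spring Street</td></tr>
  </table>
</body></html>
""")

# Absolute path, like W4F's html.body.table.tr[0].td[0].b.text
node = page.find("body/table/tr[1]/td[1]/b")
print(node.text)          # THE ADELPHI

# Descendant axis, like Norfolk's html..table..td
cells = page.findall(".//td")
print(len(cells))         # 4

# Conditional selection, like:
#   select html..table as $table where $table..td contains "THE ADELPHI"
tables = [t for t in page.findall(".//table")
          if any("THE ADELPHI" in "".join(td.itertext())
                 for td in t.findall(".//td"))]
print(len(tables))        # 1
```

The same three operations (absolute path, descendant search, conditional selection) cover most of what both wrapper languages express for locating data.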
The Norfolk system has been developed by the CSIRO since 1996 ([5], [9]). It was initially designed for reusing information from structured and semi-structured data sources and reassembling the information into virtual documents, in particular sets of HTML pages. An example of such an application is our project home page ([HREF1]), where staff pages and publications are dynamically generated from data stored in an SQL database and a proprietary publication directory system. One obvious benefit is that a publication is entered once into the system but can be displayed on any page where it is relevant.
Norfolk has been used since 1997 in a number of internal and external projects to extract information from HTML Web pages and integrate it with other information before displaying it to the user. This amounts to wrapping HTML pages into new HTML documents. An example is our Experts Search ([HREF3]), where results from a search engine and corporate home pages are integrated into a new page [13]. Since XML is becoming a standard for integrating information, we have extended Norfolk to generate XML documents instead of HTML documents. Norfolk can be used in batch mode for generating physical XML files, or in dynamic mode. In the latter, XML documents are created when required and possibly displayed with XSLT stylesheets, or accessed by other wrappers.
Norfolk is currently an advanced prototype that can be downloaded from the Web (see [HREF4]). For a full description of the Norfolk language the reader can refer to earlier publications. Some examples will be given when making comparisons with W4F. We chose W4F for comparison with Norfolk because it was freely available at the time and very popular in the research community. It was also one of the very few available research systems able to deal with the real applications we were aiming at.
Extending the evaluation criteria proposed in [3] for evaluating wrapper induction, we will evaluate Norfolk and W4F (wrapper scripting) according to their expressiveness, ease of use, efficiency, and robustness to change.
The expressiveness of a wrapper scripting language has to be evaluated for the three tasks of accessing pages, extracting and mapping information. Norfolk allows the integration of the three tasks within the wrapper definition, whereas W4F has a clear separation between them. We will see the respective advantages of each approach while comparing the two wrappers in more details along the three tasks mentioned above. The task of accessing the pages will be considered in the section on combining wrappers.
Extraction in both Norfolk and W4F is done using the paths of the relevant information in the parsed HTML tree. In addition, some basic string matching rules are used (regular expressions).
This process begins by pruning the tree to reach the relevant subtrees and then applying the matching rules on the parts of the tree that contain the necessary information.
In the following example, the wrapper extracts the text of row j (tr) of a table whose second tr element contains the string "We found" and whose row j contains the string "Convert it".
There is no real difference, other than syntactic, between how the two tools find the proper table and extract the specific information. The major difference is that Norfolk introduces explicit variables ($table and $tr) to make the joins on table and tr, while W4F uses index variables (i and j).
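The row-join just described can be sketched in plain Python over the same kind of parsed tree. This is a hypothetical illustration of the logic (the function name and sample data are ours); neither tool is implemented this way:

```python
import xml.etree.ElementTree as ET

table_src = """
<table>
  <tr><td>header</td></tr>
  <tr><td>We found 19 hotels</td></tr>
  <tr><td>Convert it to AUD</td></tr>
</table>
"""

def rows_matching(table, anchor_index, anchor_text, row_text):
    """Return the text of rows j such that row[anchor_index] contains
    anchor_text and row j itself contains row_text -- the join that
    Norfolk expresses with $table/$tr variables and W4F with indices i, j."""
    rows = table.findall("tr")
    if anchor_text not in "".join(rows[anchor_index].itertext()):
        return []
    return ["".join(r.itertext()) for r in rows if row_text in "".join(r.itertext())]

table = ET.fromstring(table_src)
print(rows_matching(table, 1, "We found", "Convert it"))
# ['Convert it to AUD']
```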
In the next sections, we compare in more detail the expressiveness of the two languages for path expressions.
W4F and Norfolk have relatively similar ways of navigating the tree representation of the parsed web page.
For example, html.body.table[1].tr[0].td[2] would be interpreted in the same way by both W4F and Norfolk, although the default for the index value is different.
W4F has three ways of accessing nodes in a tree:
"." indicates a direct child "->" will search any part of the tree using depth-first traversal "-/>" will only search within the scope of the current sub-tree |
Norfolk has only two ways:
"." indicates a direct child ".." will search within the scope of the current sub-tree |
In Norfolk, extraction rules return nested lists of trees on which one can recursively apply other extraction rules. Appropriate functions (namely text, sgml, xml) will return the text or the underlying tagged source associated with the node. The textual value of a node or an attribute can also be extracted directly with an expression such as
html.body.table[1]..td[2].#VALUE
which is equivalent to:
text(html.body.table[1]..td[2])
The fundamental difference between the two languages is that path expressions in Norfolk never fail: they simply return an empty list if the expression does not correspond to any actual node. With W4F, the expressions must refer to a real path in the tree for the wrapper to execute successfully. To prevent W4F from failing on irregular or incomplete data, defaults (exceptions) must be specified for all extraction rules. This is because the majority of websites, even ones that are relatively consistent, will have some fields of information missing.
An example of this in W4F is:
html.body->table[0].tr[2]->font[1].txt, match("(.*?) Fax"), default("");
In case a restaurant does not give a fax number, the wrapper will generate an empty string. If the default rule is not specified, W4F will fail to wrap this page.
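The effect of W4F's match/default pair can be sketched in Python: apply a regular expression to the extracted text and fall back to a default when the field is absent. This is illustrative only; the function name and sample strings are our own:

```python
import re

def extract(text, pattern, default=""):
    """Mimic W4F's match(...) with default(...): return the first capture
    group if the pattern matches, otherwise the default value."""
    m = re.search(pattern, text or "")
    return m.group(1) if m else default

print(extract("(03) 9650 7555 Fax", r"(.*?) Fax"))   # '(03) 9650 7555'
print(extract("no fax listed", r"(.*?) Fax"))        # ''
```

Because the default absorbs the failure, a missing fax number yields an empty string rather than aborting the whole wrapper run.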
HTML and XML documents are generally semi-structured documents, i.e. documents with an irregular and incomplete structure. We believe that the Norfolk semantics for interpreting path expressions are better suited to semi-structured documents than the W4F semantics. The Norfolk semantics are similar to those defined by the W3C for XPath expressions.
W4F requires an explicit definition of the mapping rules, and these must correspond in order and number to the extraction rules. A useful feature is that W4F generates a DTD from these mapping rules. This obviously guarantees that the resulting XML document will conform to the DTD. This is simple and useful, but restrictive for most applications. Indeed, it is often necessary to have optional elements in a DTD, and this is not possible with W4F since the mapping rules are relatively simplistic. It is also common for an application to have a predefined DTD to which the generated XML documents should conform.
Norfolk does not provide a DTD as a result of wrapping. Instead it can explicitly map to a previously defined DTD format. This is more appropriate when it is necessary to apply one DTD to all the kinds of tourism information we wrapped. However, Norfolk does not verify that the XML document resulting from the wrapping conforms to the DTD. This has to be checked separately with appropriate tools until the wrapper is validated.
Another situation often encountered when wrapping a web server is the need to add static values for some elements of the generated XML document, because one does not need or want to extract them from the pages. For example, when wrapping a hotel website, each page may contain a list of hotels in a particular city. The name of the city does not need to be extracted since it is already known (or found in the parameter for the page). W4F does not allow the addition of static information in the mapping. This is a significant disadvantage compared with Norfolk, which permits the inclusion of additional tags and values in the resulting pages.
The only way of creating a static value in W4F is to use a default value:
html.body->table[1].tr[3]->web.txt, default("http://www.fodors.com")
This creates a tag with a static value that is repeated for all the elements this extraction rule maps into, which might not be desired. In Norfolk, the wrapper only needs to include the tag with the value in it, and it will be embedded into the generated document, e.g. in a tag "source":
<source>http://www.fodors.com</source>
In Norfolk the values could also be taken from input parameters passed to the wrapper.
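This style of mapping can be sketched as follows: static or parameter-supplied values are simply emitted alongside the extracted fields when the XML is built. The tag names and function here are our own illustration, not Norfolk syntax:

```python
import xml.etree.ElementTree as ET

def build_hotel_xml(extracted, city, source_url):
    """Combine extracted fields with static/parameter values, as Norfolk
    allows: 'city' comes from a wrapper parameter, 'source' is static."""
    hotel = ET.Element("hotel")
    for tag, value in extracted.items():
        ET.SubElement(hotel, tag).text = value
    ET.SubElement(hotel, "city").text = city            # wrapper parameter
    ET.SubElement(hotel, "source").text = source_url    # static value
    return ET.tostring(hotel, encoding="unicode")

xml = build_hotel_xml({"name": "THE ADELPHI"}, "Melbourne", "http://www.fodors.com")
print(xml)
```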
In summary, Norfolk and W4F have equivalent expressive power regarding path expressions. W4F is a bit more declarative and simpler, but the Norfolk language offers better semantics for extraction from semi-structured documents, as well as being more flexible for mapping into predetermined DTD formats.
Writing wrappers often involves extracting data from a set of pages rather than a single page. When wrapping pages that describe a list of movies, for example, the description of a movie may be on a different page from the page listing the cinema where it is screened. Another common situation is to first extract a list from an initial page (e.g. a list of cities), then follow the links from the list elements to new pages, extracting further elements with another wrapper.
Both Norfolk and W4F allow the retrieval of several pages within one wrapper. However, a W4F wrapper only allows extraction from one of the retrieved pages, while the other pages are just used as navigational pages. This is done in a separate retrieval section. Although the separation between retrieval and extraction makes the language more readable, it is a limitation that prevents W4F wrappers from extracting, combining and mapping information from a set of linked pages. It is possible to write a set of independent wrappers and to combine them using a Java program, but this goes beyond the scope of the wrapping language.
Norfolk combines the wrapper language with basic crawler functions that allow access to any number of linked pages, extracting and combining information from any of them. This can be done within a single wrapper, or by combining wrappers that can pass parameters and results to each other. Norfolk is well designed for wrapping a set of pages, linked or not, into XML data and for recursively calling wrappers. A declarative language for following links has been defined for Norfolk [8], but it has not been implemented. Currently the part of the language used for crawling is very procedural and requires the URL to be explicitly extracted from the link before Norfolk opens the HTTP connection and gets the linked document. Even so, it is flexible and convenient enough to write realistic and complex wrappers.
W4F offers a very limited way of following links and requires a Java program to combine specialised wrappers. It is not possible using the W4F language to build a single wrapper that extracts information from several pages.
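The crawl-and-wrap pattern that a single Norfolk wrapper can express is sketched below. In practice the fetch step would open an HTTP connection (e.g. with urllib); here it is a parameter so the sketch stays self-contained, and the site data is invented:

```python
import xml.etree.ElementTree as ET

def wrap_linked_pages(index_html, fetch, wrap_page):
    """Extract the link URLs from an index page, fetch each linked page,
    and apply a per-page wrapper to it -- the pattern Norfolk supports
    within one wrapper and W4F requires external Java code for."""
    index = ET.fromstring(index_html)
    results = []
    for a in index.findall(".//a"):
        url = a.get("href")          # explicit URL extraction, as in Norfolk
        results.append(wrap_page(fetch(url)))
    return results

# Toy site: an index of cities linking to per-city hotel pages.
pages = {
    "/melbourne": "<hotels><hotel>THE ADELPHI</hotel></hotels>",
    "/sydney": "<hotels><hotel>THE RUSSELL</hotel></hotels>",
}
index_html = ('<ul><li><a href="/melbourne">Melbourne</a></li>'
              '<li><a href="/sydney">Sydney</a></li></ul>')
names = wrap_linked_pages(index_html, pages.get,
                          lambda src: ET.fromstring(src).findtext("hotel"))
print(names)   # ['THE ADELPHI', 'THE RUSSELL']
```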
W4F has an extraction wizard which gives the absolute tree path for any part of the web page. This is a very good idea because, when developing a wrapper, you must first locate the information you are interested in extracting, and the HTML structure is often very complex. However, the wizard only shows these paths when the mouse hovers over the relevant item (the path disappears too quickly) or by viewing the source of the page after it has been run through the extraction wizard. Norfolk has no such wizard, and one is clearly missing. A wizard more integrated with the wrapping language would be desirable.
Machine learning-based wrappers such as [1] usually offer a good user interface, since the user needs to give examples of what and how to wrap before the system can deduce the wrapping rules. Although scripting-based wrappers are programmed manually, it would be very useful to supply them with a wrapping user interface. Debugging is much easier with Norfolk since intermediate results can be printed. Debugging incorrect mappings can be very difficult with W4F.
Since a wrapper is usually used dynamically and regularly to get updated data, it is important that it is efficient. We ran timing tests both on Norfolk wrappers (against each other) and on two wrappers written using Norfolk and W4F for the same web server.
Three different Norfolk wrappers were tested against each other, wrapping information from http://www.fodors.com/ for 25 cities to get the list of restaurants, hotels and activities for each city. Norfolk generated 25 XML documents for each wrapper. The tests were run 11 times at different times of the day. Table 1 shows the details of the tests. The wrappers themselves can be found in Appendix A.
The only odd result was obtained in the test run on 11/01/01 at approximately 10:30am. This was an unexpected result and is most likely due to a problem with the server at the time. The averages calculated did not include this result.
The most notable difference is that the restaurants wrapper took the longest and the activities wrapper took the least time in wrapping. This can be attributed to the amount of extraction and mapping the wrapper has to do. The wrapper retrieves the page once only, so that is not an affecting factor in this case.

These tests show that the time of the wrapping itself (extraction and mapping) is a significant part of the total time, which includes the server access and the download of the pages. It was important to determine this before carrying out tests comparing Norfolk and W4F.
This phase of testing was for the purpose of determining whether Norfolk was slower than W4F. We expected it to be, since the Norfolk language is interpreted whilst W4F is compiled into Java classes. However, as Table 2 indicates, there is no major difference between the performance of the two on average. In fact, Norfolk is slightly, but not significantly, faster.
The two wrappers were both using the same tourism website (http://www.fodors.com). However, only the hotel pages were tested for the same twenty-five cities.

General studies suggest that web pages, on average, are prone to changes every month. The rate of change is particularly high for commercial sites generated from databases since this simplifies maintenance and changes can be made consistently at a low cost. Wrappers, such as Norfolk and W4F, that strongly rely on the underlying HTML tag structure to locate information are vulnerable to changes and can be broken very easily.
It is a well-known problem that references within trees are hard to maintain when the tree is modified. Much attention has been paid to this problem in the Hypertext and Web community, specifically for maintaining semantic links within documents. In [4] a solution is proposed, using a node signature based on a combination of the absolute path of the node, the string value of the node and those of its neighbours, together with an associated repair algorithm.
The problem for wrappers is not exactly the same as for referential links. With referential links, one tries to keep the reference on actual content in order to keep the link semantically relevant; if the content of the target node has changed too much, the link may become irrelevant. In wrappers, on the other hand, one wants to maintain references to placeholders for content that is expected to change (e.g. the price of a product). The value of the content cannot be part of the signature of the node, although the content of a neighbouring node can be used.
Incomplete path expressions with conditions, like those presented in section 4.1, are more robust to structural change than absolute paths from the root. However, they are not strong enough to cope with large changes, and even small changes can break a wrapper. When this happens, the wrapper should at least be able to locate and report the problem.
We have started to investigate signatures that compare the values (strings) extracted by the wrapper with a domain vocabulary and complain when a wrapped value is not of the expected type. For example, common expected words in a hotel list would be "hotel", "room", "pool", etc. In the IBM system ANDES [10], a lot of attention is paid to data validation. Their system monitors the quality of data extraction (e.g. the valid numeric range of a value) and alerts the administrator when adjustments to the wrapper are required.
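The vocabulary check we are investigating can be sketched as follows. This is an illustration of the idea only (the vocabulary, names, and threshold are our own), not the actual implementation:

```python
HOTEL_VOCABULARY = {"hotel", "room", "rooms", "pool", "bar", "restaurant", "suite"}

def looks_like_hotel_text(value, vocabulary=HOTEL_VOCABULARY, threshold=1):
    """Flag a wrapped value as suspicious when it contains fewer than
    'threshold' words from the expected domain vocabulary -- a hint that
    the site structure changed and the wrapper extracted the wrong node."""
    words = {w.strip(".,;:").lower() for w in value.split()}
    return len(words & vocabulary) >= threshold

print(looks_like_hotel_text("Charming hotel with 34 rooms and a pool"))  # True
print(looks_like_hotel_text("Click here to log in"))                     # False
```

A wrapper could run such a check on every extracted field and report the page and path whenever the check fails.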
More research is needed to make wrappers much more robust to change than they are today. One approach would be to replace top-down, structure-first extraction with bottom-up, content-first extraction. Another would be to develop strong repair algorithms based on tree differences.
Wrapping in general aims to make the job of gathering and converting information more efficient. For this, a wrapping tool must be easy to use for the human who writes the wrapper. It must also be fast in wrapping.
Our conclusion is that W4F and Norfolk are very effective in wrapping Web sites and relatively easy to use. The two tools offer a comparable language for wrapping single Web pages (comparable expressiveness), and take about the same time to wrap pages (comparable efficiency). Norfolk supports more complex wrapping from sets of linked pages and can run in dynamic or batch mode. W4F offers a wizard interface that is helpful for beginners.
Although Norfolk was found to be more flexible and powerful for writing actual wrappers, it lacks a user-friendly interface similar to the W4F wizard.
Both tools need robust techniques for dealing with frequent changes in the wrapped sites if those sites are to be wrapped regularly.
[1] This work was developed when Sabine Jabbour, student at Monash University, was doing her internship at CSIRO.
[2] The screen dump has been generated in 2001 and the page format has changed since.
[1] Brad Adelberg, Matthew Denny (1999), "NoDoSE version 2.0", in SIGMOD Record, 28(2), pp. 540-543, 1999.
[2] Fabien Azavant, Arnaud Sahuguet (2000), World Wide Web Wrapper Factory (W4F), User Manual, April 2000.
[3] Nicholas Kushmerick (2000), "Wrapper induction: Efficiency and expressiveness", in Artificial Intelligence 118, (2000) 15-68, Elsevier (Ed.)
[4] Thomas A. Phelps and Robert Wilensky (1999), "Robust Intra-document locations", in Proc. of WWW9 Conference, Amsterdam, 1999.
[5] Anne-Marie Vercoustre and François Paradis (1997), "A Descriptive Language for Information Object Reuse through Virtual Documents", in 4th International Conference on Object-Oriented Information Systems (OOIS'97), Brisbane, Australia, pp299-311, 10-12 November, 1997.
[6] J. Clark and S. DeRose (eds) (1999), XML Path Language (XPath) Version 1.0, W3C Recommendation, November 1999, http://www.w3c.org/TR/xpath.
[7] G. Mecca, P. Merialdo, P. Atzeni (1999), "ARANEUS in the Era of XML", in IEEE Data Engineering Bulletin, Special Issue on XML, September, 1999. http://www.dia.uniroma3.it/Araneus/
[8] Anne-Marie Vercoustre and François Paradis (1998), "Reuse of Linked Documents through Virtual Document Prescriptions", in Lecture Notes in Computer Science 1375, Proceedings of Electronic Publishing '98, Saint-Malo, France, pp499-512, 1-3 April, 1998.
[9] François Paradis and Anne-Marie Vercoustre (1998), "A Language for Publishing Virtual Documents on the Web", in International Workshop on the Web and Databases, Valencia, Spain, March 1998.
[10] Jussi Myllymaki (2001), "Effective Web data Extraction with Standard XML Technologies", in 10th International World Wide Web Conference (WWW10), Hong Kong, May 2001.
[11] J. Hammer, M. Breunig, H. Garcia-Molina, S. Nestorov, V. Vassalos, R. Yerneni (1997). "Template-Based Wrappers in the TSIMMIS System". In the 26th SIGMOD International Conference on Management of Data, Tucson, Arizona, May 1997.
[12] L.Liu, W. Han, D. Buttler, C.Pu, and W.Wang (1999), "An XML-based Wrapper generator for Web Information Extraction", in Sigmod'99, Philadelphia, 1999.
[13] Nick Craswell, David Hawking, Anne-Marie Vercoustre and Peter Wilkins (2001), "P@NOPTIC Expert: Searching for Experts not just for Documents", Poster in Ausweb 2001, April 2001. http://ausweb.scu.edu.au/aw01/papers/edited/vercoustre/paper.html
© Copyright 1997-2001, CSIRO Australia. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web. No Rights to Research Data is given. CSIRO and the Author/s remain free to use their own research data including tables, formulae, diagrams and the outputs of scientific instruments.