Web Metadata Compared: Institution, Community, Internet

Baden Hughes [HREF1], Research Fellow, Department of Computer Science and Software Engineering [HREF2], The University of Melbourne [HREF3], Victoria, 3010, Australia. badenh@csse.unimelb.edu.au

Eve Young, Coordinator - Digital Repositories, Information Services [HREF4], The University of Melbourne [HREF3], Victoria, 3010, Australia. e.young@unimelb.edu.au

Abstract

In this paper, we evaluate the use of web metadata at The University of Melbourne. We outline the background to the development of an institutional metadata standard, describe previous work within the institutional and external contexts, and compare how usage patterns for web metadata and for more specific publication-oriented metadata differ across various institutions, communities of practice, and the web in general.

Introduction

Before the advent of 'deep content aware' search engines, many popular web search engines required metadata of various types as the basis for their indexing operations. As such, in the mid to late 1990s many organizations invested heavily in the creation of web metadata, particularly in the form of HTML meta tag content, in an attempt to exert maximum leverage over third party information discovery engines. A derivative of this effort was the development of internal standards for the use of such metadata elements and associated extensions in the enterprise context.

In this paper, we systematically evaluate the usage of one application of web metadata: the largely administrative web metadata in The University of Melbourne's web content. Our interest is to review how web metadata is used within the institutional context, and to contrast these usage patterns with other available data regarding web metadata. Hence we correlate our findings against work by other researchers in the same enterprise, in communities such as the Open Archives Initiative and the Open Language Archives Community, and on the open web. In doing so we discover that there are some general similarities, but that overall, web metadata publishing varies considerably. We reflect on a number of the challenges this empirical study reveals, particularly in the institutional context of web content management and resource discovery.

Web Metadata at The University of Melbourne

The University of Melbourne's World Wide Web Publishing Policies and Guidelines (The University of Melbourne, 1999-2001), first published in 1999, specified the inclusion of nine largely administrative meta tags, some relating to authorization and currency of the page's information.

In May 2001, a white paper was written (Young, 2001) which identified the growing importance of metadata and the emergence of standards such as Dublin Core, and recommended that the university consider the application of a metadata standard to official web pages. The University of Melbourne's Information Strategy Advisory Committee (ISAC) endorsed these recommendations in July 2001, and requested the establishment of an expert committee (the "Metadata Working Group", hereafter MWG) to advise on the implementation of a uniform approach to the creation of metadata on university web sites.

After consideration, the MWG recommended Dublin Core (Dublin Core Metadata Initiative, 2004) as the metadata standard, using 12 of the then 15 elements and some of the date qualifiers. The recommendation was made on the basis that Dublin Core was an international standard; was relatively easy to implement; was used by many other universities and governments; suited current requirements and environment; and ensured future interoperability with other emerging metadata standards such as IMS and LOM, which may be used on campus in other contexts.

Thus The University of Melbourne's metadata standard directly imports from Dublin Core the following elements: DC.Title, DC.Creator, DC.Subject and Keywords, DC.Description and Description, DC.Publisher, DC.Contributor, DC.Rights, DC.Date, DC.Date.Modified, DC.Language, DC.Format, DC.Identifier.

The entire set of Dublin Core metadata elements was not adopted, since only these core elements were viewed as being of interest to the institution. Adopting a minimal barrier to entry was one related consideration; however, we see later that even the mandated minimum is met only a very small proportion of the time.

It is also worthwhile noting that the adoption process included no provision for the underlying Dublin Core standard changing and hence for the institutional adoption to be modified accordingly. An additional weakness, recognised at the time but not addressed, was the potential for web content standards and publishing practices to change considerably.

Additionally, the MWG identified the importance of metadata in the administration and management of the web pages, including content and publishers, and the integration with other projects such as the development of a web archiving strategy and the implementation of a content management system for web publishing. In essence, this overloaded the institutional metadata standard, which was no longer a purely descriptive vocabulary, and extended its influence into publishing workflows. Some (informally expressed) local administrative extensions were added to the core Dublin Core elements: UM.Creator.Email, UM.Date.ReviewDue, UM.Authoriser.Name, UM.Authoriser.Title, UM.Maintainer.Name, UM.Maintainer.Email.

The MWG developed a draft metadata standard based on Dublin Core that was initially endorsed in July 2003. This was successfully trialled prior to implementation in the Metadata Pilot Project and was initially implemented on web pages required to use the university web page templates, as specified in the guidelines (The University of Melbourne, 1999-2001). An initial report on this project can be found in Young and Booth (2003).

Once a standard for institutional web metadata had been developed, higher order applications were built on the assumption that the metadata standard itself would be followed. The most notable example is the institutional search engine at The University of Melbourne, which has custom built functionality allowing for search over metadata fields only (both Dublin Core based, and institutional types).
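Applications that assume the standard is followed depend on each page actually carrying the listed elements. As an illustration only, the following minimal Python sketch checks one HTML page for the element names given above. It is not the tooling used at The University of Melbourne; the choice to treat every listed DC and UM element as required, the names EXPECTED_ELEMENTS and missing_elements, and the sample page are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Element names taken from the lists above. Treating every one of them as
# required is an assumption of this sketch, not a statement of the standard.
EXPECTED_ELEMENTS = [
    "DC.Title", "DC.Creator", "DC.Subject", "DC.Description", "DC.Publisher",
    "DC.Contributor", "DC.Rights", "DC.Date", "DC.Date.Modified",
    "DC.Language", "DC.Format", "DC.Identifier",
    "UM.Creator.Email", "UM.Date.ReviewDue", "UM.Authoriser.Name",
    "UM.Authoriser.Title", "UM.Maintainer.Name", "UM.Maintainer.Email",
]


class MetaNameCollector(HTMLParser):
    """Collects the names of meta tags that have non-empty content."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # A tag with an empty content attribute is treated as absent.
        if name and content and content.strip():
            self.names.add(name.lower())


def missing_elements(html_text, expected=EXPECTED_ELEMENTS):
    """Return the expected element names not found in the page, in order."""
    collector = MetaNameCollector()
    collector.feed(html_text)
    collector.close()
    return [name for name in expected if name.lower() not in collector.names]


if __name__ == "__main__":
    # Hypothetical page fragment, for illustration only.
    sample = (
        '<html><head>'
        '<meta name="DC.Title" content="Example page">'
        '<meta name="DC.Creator" content="">'
        '</head><body></body></html>'
    )
    print(missing_elements(sample))
```

Element names are compared case-insensitively, since HTML authors vary in capitalisation; a stricter check could compare names exactly.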

Data Sources

There are few publicly available sources which consider web metadata as the focus of research. In this paper, we utilise several different sources (described below), and attempt to draw comparisons between them, while recognising that their different domain groundings may affect the overall results.

Institution

In previous work, Hughes and Young (2005) conducted a systematic broad coverage review of the use of the institutional metadata standard across all web content at The University of Melbourne. A full crawl of The University of Melbourne web presence was conducted in March 2005 and the same URL set was acquired again in August 2005. It should be emphasised that this is real web data from the institution, rather than a hand curated set.

The software used was the Internet Archive's Heritrix suite (Internet Archive, n.d.), an open-source, extensible, web-scale, archival-quality web crawler project. In total 57Gb of data was retrieved from www.unimelb.edu.au and its associated sub-domains over a period of 146 hours. A total of 1,431,645 documents were retrieved from 632 HTTP servers. Of these ~1.4 million documents, 659,171 (~46%) were of the MIME type text/html, the explicit target of The University of Melbourne's web metadata standard. The remaining 54% of web documents were of various other MIME types.

A number of caveats are in order with regard to the crawl methodology and derivative statistics. We were unable to collect web content from institutional sites which implemented access restrictions, either via robots.txt or at OSI Layer 4 or higher. Nor were we able to collect web content from institutional sites which required any form of explicit authentication (basic or otherwise).

A number of findings can be made with regard to the crawled data:

HTML no longer dominant: The University of Melbourne's metadata creation processes have been primarily oriented towards creating Dublin Core-extended metadata as simple HTML meta tags (The University of Melbourne Web Centre, 2004). However, data gathered from the March 2005 crawl shows that pure HTML content is in fact no longer the largest constituent by either count or size. As such, a metadata standard which assumes that HTML embedded metadata is sufficient to cover the majority of online document types is increasingly outmoded, whether for reasons of expression or of compatibility with content creation tools. Clearly there is a need for a revised metadata standard addressing newer formats such as PDF.

Web-accessibility of rich document types: Many formats are not addressed by The University of Melbourne's guidelines for metadata creation but do offer some potential for restricted metadata inclusion, eg MS Word, MS Excel and PDF. However, document types such as XML do not easily allow for the embedding of metadata internal to the resource unless this is specifically considered at design time.

The emergence of dynamic documents: An analysis of the URIs which do not have the .htm or .html file extension shows that many (around 38%) of these documents are in fact dynamic, that is, generated server side on demand by technologies such as PHP, ASP and JSP. The University of Melbourne's web metadata standard currently gives no consideration to the inclusion of metadata in automatically generated documents of this type, although this would be possible using simple templates.
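To illustrate the template approach mentioned for dynamic documents, the sketch below generates meta tags from a mapping of element names to values and inserts them into a page skeleton at render time. It is a minimal example under stated assumptions: Python stands in for the server-side technologies named above, and the skeleton PAGE_TEMPLATE, the function names and the sample values are hypothetical rather than drawn from the university's templates.

```python
from html import escape
from string import Template

# Hypothetical page skeleton; the university's real templates are not shown
# in this paper.
PAGE_TEMPLATE = Template(
    "<html>\n<head>\n<title>$title</title>\n$meta\n</head>\n"
    "<body>\n$body\n</body>\n</html>\n"
)


def render_meta_tags(fields):
    """Return one HTML meta tag per line for a mapping of name to value."""
    return "\n".join(
        '<meta name="{}" content="{}">'.format(escape(name), escape(str(value)))
        for name, value in fields.items()
    )


def render_page(title, body_html, fields):
    """Fill the skeleton, escaping the title and all metadata values."""
    return PAGE_TEMPLATE.substitute(
        title=escape(title),
        meta=render_meta_tags(fields),
        body=body_html,  # assumed to be trusted, already-formed HTML
    )


if __name__ == "__main__":
    # Sample values are invented for illustration.
    print(render_page(
        "Search results",
        "<p>Generated on demand.</p>",
        {
            "DC.Title": "Search results",
            "DC.Language": "en",
            "UM.Maintainer.Email": "webmaster@example.org",
        },
    ))
```

Because the metadata is produced by the same code path that produces the page, fields such as the title or modification date can be filled from values the application already holds, rather than typed in by an author.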

Community

Open Archives Initiative

Ward (2003) considered the use of Dublin Core based metadata in the context of the Open Archives Initiative (OAI). Ward's data was acquired by directly harvesting a large number of OAI data providers over an extended period of time and, although somewhat dated given how frequently OAI data providers change, is once again real data rather than a curated sample. Overall, Ward's findings were that a core sub-set of metadata elements was used widely, but that non-core metadata elements were used in a significantly smaller proportion throughout several hundred OAI repositories. The detail of her analysis is included later for comparison.

Open Language Archives Community

Hughes (2004) performed an in-depth analysis of a particular OAI sub-community, the Open Language Archives Community (OLAC), considering how community extensions to Dublin Core metadata were used. The data was again collected by harvesting real OLAC records, in the course of developing a metadata quality assessment framework, rather than by selective harvesting. Hughes' findings were that, in broad terms, the earlier trend analysis of Ward (2003) held for language archives, although specific qualified Dublin Core extensions pertinent to this user community contrasted markedly with broader norms (for example, the use of language metadata elements).

Other Communities

It should be recognised that other community metadata standards exist; in the Australian context these include both the Australian Government Locator Service (AGLS) and the EdNA Metadata Standard. Neither of these standards is considered in this paper.

Web

Recently, researchers at Google have released findings from an interesting study on web authoring statistics (Google Inc., 2005), derived from an analysis of 1 billion web pages. This work included a consideration of the use of metadata on the web at large, based on pages which appeared within the Google index. The findings of this study were that Dublin Core-based metadata elements do not rank in the top 10 metadata elements appearing on web pages, but that they do occur within the top 50 elements.

Unfortunately, the Google study only published very coarse grained summary information, and so a direct comparison between metadata in the enterprise and on the broader internet is not possible.

Web Metadata Compared

In Figure 1, we show the frequency of use of metadata elements for surveys which provide quantitative data.

Ward (2003)  | %    | Hughes and Young (2005) | %
creator      | 21.5 | description             | 28.5
identifier   | 17.2 | subject                 | 26.4
title        | 11.4 | title                   | 23.9
date         | 11.1 | publisher               | 18.9
type         | 10.7 | creator                 | 18.7
subject      | 6.6  | date                    | 18.4
description  | 6.2  | contributor             | 16.3
rights       | 4.2  | language                | 6.2
publisher    | 3.1  | format                  | 5.3
coverage     | 2.7  | identifier              | 4.9

Figure 1: Frequency of Metadata Element Use

A number of observations can be drawn from the statistics above.

First, we can observe that there is a considerable difference in the rankings between Ward's (2003) findings from the OAI and our earlier findings for web metadata element occurrence in Hughes and Young (2005). This may be explained by the domain differences, although a more advanced analysis may show that there are other factors to be considered.

Second, we can see that the relative frequencies of the elements are substantially different: in particular, the Hughes and Young (2005) data shows that overall there is a more consistent use of Dublin Core-based metadata on the institutional web than in the broader OAI. In part this may be attributed to a single domain of authorship within The University of Melbourne.

Third, we can observe also that the relative ratios between individual Dublin Core elements are significantly different. Consider, for example, the differences between the top ranked elements from each study and their counterparts: Ward (2003) shows the top ranked element, creator, used 21.5% of the time, whereas Hughes and Young (2005) show creator used 18.7% of the time. Conversely, Hughes and Young (2005) show the top ranked element, description, used 28.5% of the time, whereas Ward (2003) shows description used only 6.2% of the time.

In Figure 2, we tabulate the rank order of the frequency of occurrence of web metadata elements from surveys, without regard for the exact quantities.

Rank | Ward (2003) | Hughes (2004) | Hughes and Young (2005) | Google (2005)
1    | creator     | subject       | description             | title
2    | identifier  | title         | subject                 | language
3    | title       | description   | title                   | creator
4    | date        | date          | publisher               | subject
5    | type        | identifier    | creator                 | publisher
6    | subject     | creator       | date                    | description
7    | description | format        | contributor             | identifier
8    | rights      | type          | language                | date
9    | publisher   | contributor   | format                  | format
10   | coverage    | publisher     | identifier              | rights

Figure 2: Rank Order of Web Metadata Elements From Various Sources

A number of observations can be drawn from the statistics above.

It is clear that the four different sources have considerably different rank orders for the top 10 Dublin Core-based elements. At depth 1 of the ranked elements, there is no similarity. At depth 2, subject and title each appear in two sources. At depth 3, all 4 have title, and 2 each have description, subject and creator. At depth 4, all 4 have title, 3 have subject, and 2 each have description, creator and date. At depth 5 (50% of the way through the result set), all 4 have title, 3 each have subject and creator, and 2 each have description and date (as well as identifier and publisher). This shows that the same set of elements is emerging across all surveys regardless of their sources, although the relative ordering may be slightly different owing to particular use cases and discovery paradigms: title, subject, creator, description and date are the core descriptive elements across all data sets.
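The depth-by-depth comparison above can be reproduced mechanically. The following Python sketch transcribes the four rank orders from Figure 2 and counts, for a given depth, how many sources include each element among their top-ranked entries. The names RANKINGS and shared_at_depth are introduced here for illustration and the threshold of two sources is an assumption of this sketch.

```python
from collections import Counter

# Rank orders transcribed from Figure 2, most frequent element first.
RANKINGS = {
    "Ward (2003)": [
        "creator", "identifier", "title", "date", "type",
        "subject", "description", "rights", "publisher", "coverage",
    ],
    "Hughes (2004)": [
        "subject", "title", "description", "date", "identifier",
        "creator", "format", "type", "contributor", "publisher",
    ],
    "Hughes and Young (2005)": [
        "description", "subject", "title", "publisher", "creator",
        "date", "contributor", "language", "format", "identifier",
    ],
    "Google (2005)": [
        "title", "language", "creator", "subject", "publisher",
        "description", "identifier", "date", "format", "rights",
    ],
}


def shared_at_depth(rankings, depth, min_sources=2):
    """Map each element to the number of sources whose top `depth`
    entries contain it, keeping elements found in at least
    `min_sources` sources."""
    counts = Counter()
    for ranked in rankings.values():
        counts.update(set(ranked[:depth]))
    return {element: n for element, n in counts.items() if n >= min_sources}


if __name__ == "__main__":
    for depth in range(1, 6):
        shared = shared_at_depth(RANKINGS, depth)
        # Sort by descending count, then alphabetically, for stable output.
        ordered = sorted(shared.items(), key=lambda item: (-item[1], item[0]))
        print(depth, ordered)
```

The function reports every element shared by at least two sources at the chosen depth, so raising min_sources to 3 or 4 isolates the elements on which most or all of the surveys agree.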

Reflections and Challenges for the Future

Large scale search engines such as Google no longer use web metadata in the form of HTML meta tag information, but rather perform full text indexing (Richardson, 2004). Hence the benefit to general web search of metadata creation according to a given standard within the institution is almost zero for external searchers, although it may still retain currency for other administrative purposes, eg the authorization of web content publication. This leads to the need to distinguish the institution's need for web content management, and how metadata facilitates this goal, from the web search experience in general, and to decouple the two. This transition was clearly not foreseen by The University of Melbourne, despite its canonicalization of a web metadata standard well after this trend was apparent.

The recent introduction at The University of Melbourne of an institution wide Content Management System (CMS) is likely to have pervasive effects over the longer term. Notably, the institutional metadata standards failed to address distributed content creation (or underestimated the pervasive effect of making publish-to-web type technologies available to all staff), the context at which the new CMS is targeted. Embedding metadata creation at the point of content creation, and leveraging inbuilt capacity to create as much core metadata as possible, is certainly desirable. The functional requirements specifications for the CMS included the provision of numerous core web metadata fields (both Dublin Core-based and institutional) automatically populated from within the publishing workflow, while others must be supplied by content creators prior to saving within the CMS. Hence from this point on, new web pages and modified versions of old web pages will be forced to include correct institutional metadata, without requiring specific attention by page authors. We recognise that certain metadata elements, eg subject, benefit greatly from human input and cannot be fully automated.

A return to first principles with regard to the currency of the Dublin Core Metadata Set and the expression of The University of Melbourne web metadata is warranted. While the existing metadata standard was in part driven by resource discovery needs, these motivations have largely been surpassed with the arrival of fully featured web search engines, and as such, the role of institutional metadata appears largely to be administrative. One could even question if there is a need for a metadata standard (either broad coverage or institutional) at all given the low impact of existing efforts.

Finally, it may be interesting to correlate our findings against other community metadata standards eg AGLS or EdNA, a process which depends on the availability of suitable comparative data.

Conclusion

Despite being identified as one of the leading universities with regard to metadata implementation (Ivanova, 2004), on reflection we see that The University of Melbourne still faces significant challenges in the deployment of metadata across institutional web content. Even over a two year period we have found that compliance is in fact a moving target, given the evolution of external standards, web content creation tools, and web content demography. While a strong basis for institutional metadata was formed by the adoption of Dublin Core, the disparate content creation environment and rapidly changing composition of web content have induced a less than satisfactory application of these standards. Automated metadata creation and assessment, forming a significant component of future work, may address this problem in part, although only a longitudinal study with adequately established baseline metrics will demonstrate whether we are any closer to effective use of web metadata. With the pending deployment of a new institution wide content management system, the inconsistencies in web metadata will be addressed over time. However, the fact that broad coverage web search engines no longer rely on this data type for indexing operations reduces the effectiveness of this process considerably. The existing institutional web search engine does leverage The University of Melbourne web metadata standard, and as such local users are likely to see an improvement in their ability to effectively fulfill their information needs from this revision.

References

Dublin Core Metadata Initiative, 2004. Dublin Core Metadata Element Set, Version 1.1: Reference Description. http://dublincore.org/documents/dces/. Available online [HREF5]. Last accessed 5 May 2006.

Google Inc., 2005. Web Authoring Statistics. http://code.google.com/webstats/index.html. Available online [HREF6]. Last accessed 15 March 2006.

Baden Hughes and Eve Young, 2005. If we're not there yet, how far do we have to go? A review of web metadata at The University of Melbourne. Proceedings of DC-ANZ 2005. http://eprints.unimelb.edu.au/archive/00000923/. Available online [HREF7]. Last accessed 15 March 2006.

Baden Hughes, 2004. Metadata Quality Evaluation: Experience from the Open Language Archives Community. Proceedings of the 7th International Conference on Asian Digital Libraries. LNCS 3130. Springer Verlag. pp. 320-329.

Internet Archive, n.d. Heritrix. http://crawler.archive.org. Available online [HREF8]. Last accessed 15 March 2006.

Nelly Ivanova, 2004. Metadata and Australian Universities: An Environmental Scan. Proceedings of AusWeb 2004: The Tenth Australian World Wide Web Conference. Southern Cross University. http://ausweb.scu.edu.au/aw04/papers/refereed/ivanova/paper.html. Available online [HREF9]. Last accessed 15 March 2006.

Joanna Richardson, 2004. Competing in a World Scooped by Google. Proceedings of AusWeb 2004: The Tenth Australian World Wide Web Conference. Southern Cross University. http://ausweb.scu.edu.au/aw04/papers/refereed/richardson2/. Available online [HREF10]. Last accessed 15 March 2006.

Jewel Ward, 2003. A Quantitative Analysis of Unqualified Dublin Core Metadata Element Set Usage within Data Providers Registered with the Open Archives Initiative. Proceedings of the 2003 Joint Conference on Digital Libraries. IEEE Computer Society Press. pp. 315-317.

Eve Young, 2004. Metadata at The University of Melbourne. http://buffy.lib.unimelb.edu.au/ird/metadata/index.htm. Available online [HREF11]. Last accessed 15 March 2006.

Eve Young and Martine Booth, 2003. The University of Melbourne Metadata Implementation. Proceedings of DC-ANZ Conference 2003. http://www.dc-anz.org/conf2003/DC-ANZAgenda.html. Available online [HREF12]. Last accessed 15 March 2006.

The University of Melbourne Web Centre, 2004. University metadata standard: what, why & how. Manuscript, The University of Melbourne. http://www.unimelb.edu.au/webcentre/training/trainingfiles/metadata_training_notes.pdf. Available online [HREF13]. Last accessed 15 March 2006.

Hypertext References

HREF1: http://www.csse.unimelb.edu.au/~badenh/
HREF2: http://www.csse.unimelb.edu.au/
HREF3: http://www.unimelb.edu.au/
HREF4: http://www.infodiv.unimelb.edu.au
HREF5: http://dublincore.org/documents/dces/
HREF6: http://code.google.com/webstats/index.html
HREF7: http://eprints.unimelb.edu.au/archive/00000923/
HREF8: http://crawler.archive.org
HREF9: http://ausweb.scu.edu.au/aw04/papers/refereed/ivanova/paper.html
HREF10: http://ausweb.scu.edu.au/aw04/papers/refereed/richardson2/
HREF11: http://buffy.lib.unimelb.edu.au/ird/metadata/index.htm
HREF12: http://www.dc-anz.org/conf2003/DC-ANZAgenda.html
HREF13: http://www.unimelb.edu.au/webcentre/training/trainingfiles/metadata_training_notes.pdf

Copyright

Baden Hughes and Eve Young, © 2006. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.