Collecting Low-Density Language Materials on the Web

Timothy Baldwin [HREF1], Senior Lecturer, Department of Computer Science and Software Engineering [HREF2], The University of Melbourne [HREF3], Victoria, 3010, Australia. tim@csse.unimelb.edu.au

Steven Bird [HREF4], Associate Professor, Department of Computer Science and Software Engineering [HREF2], The University of Melbourne [HREF3], Victoria, 3010, Australia. sb@csse.unimelb.edu.au

Baden Hughes [HREF5], Research Fellow, Department of Computer Science and Software Engineering [HREF2], The University of Melbourne [HREF3], Victoria, 3010, Australia. badenh@csse.unimelb.edu.au

Abstract

Most web content exists in a few dozen languages. Hundreds of other languages - the 'low-density languages' - are only represented in scarce quantities on the web. How can we locate, store and describe these low-density resources? In particular, how can we identify linguistically interesting resources, such as translation sets and multilingual documents? In this paper we describe ongoing research in which we integrate a number of discrete systems (language data crawler, automated metadata generation tools, language data repositories and federated search services) to address the identification, retrieval, description, storage and access issues for low-density language materials from the web.

Introduction

Finding resources of interest on the web is often a low precision activity, despite the sophistication of popular search engines. The best results are often achieved with metasearch services, which aggregate results for the same query from many different search engines, in an effort to more efficiently fulfill a user's information needs. In contrast, for the specific task of identifying 'linguistically interesting' resources such as translation sets and multilingual documents, we propose that domain-specific techniques can increase the likelihood of finding relevant URLs. Such techniques include exploiting taxonomies of language names and linguistic terms, and leveraging connections between linguistic communities and place names.

Our research is addressing the information discovery challenge for web-based resources in low-density languages. This research is grounded in the Australian context, where there is a strongly multilingual environment, active content creation by speakers of languages other than English, and interest among immigrant communities in retaining linguistic and cultural links (both with other immigrants from the same backgrounds and with countries of origin). We are integrating existing software to build digital collections of web-based minority language materials and providing services for users to interact with these found objects, both directly and via intermediaries.

Language-Centric Resource Discovery

The methodology of collating and publishing material on the web for the purposes of linguistic research and language technology development has been very successful. This success is due in large measure to the quality of web search engines and the ability of users to adapt to the exigencies of keyword queries. However, therein lies a critical weakness: as the web grows, resource discovery has become a hit-and-miss affair; it is easy for users to be inundated with irrelevant resources, and to miss important resources because they did not try enough combinations and translations of query terms. A recent attempt to address this problem is OLAC, the Open Language Archives Community (Open Language Archives Community, n.d.). OLAC applies digital library technologies from the Open Archives Initiative (Open Archives Initiative, n.d.) to support a worldwide virtual library of language resources. OLAC users can search over 30,000 resources in over 30 language archives simultaneously, using keyword or fielded search over the stored metadata, by way of either a customised search engine (Hughes and Kamat, 2005) or through a DP9 gateway accessible to Google.
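OLAC metadata is exchanged using the OAI protocol for metadata harvesting (OAI-PMH). The Python sketch below issues a ListRecords request against an OLAC data provider and prints an identifier and title for each record; the base URL is a placeholder to be replaced with a real provider's endpoint, and support for the olac metadata prefix by that provider is an assumption of the illustration.

import urllib.request
import xml.etree.ElementTree as ET

# Placeholder endpoint: substitute the base URL of a real OLAC data provider.
BASE_URL = "http://www.example.org/oai"
REQUEST = BASE_URL + "?verb=ListRecords&metadataPrefix=olac"

NS = {"oai": "http://www.openarchives.org/OAI/2.0/",
      "dc": "http://purl.org/dc/elements/1.1/"}

with urllib.request.urlopen(REQUEST) as response:
    tree = ET.parse(response)

for record in tree.iterfind(".//oai:record", NS):
    identifier = record.findtext("oai:header/oai:identifier", default="", namespaces=NS)
    title = record.findtext(".//dc:title", default="", namespaces=NS)
    print(identifier, title)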

However, OLAC has two major shortcomings of its own. First, it can only be used to search resources which have already been catalogued. Second, most of the resources, once found, cannot be accessed because they are not available on the web. The shortcomings of the available resource discovery technologies are most easily understood with the help of the following hypothetical scenarios. We assume that the user is a language researcher, teacher, or learner, and is searching for specific language resources:

Scenario 1: Finding resources for a specified language. A user searches the web using the name of the language. However, the user experiences low precision as this language name is also a normal word in other languages. The user also experiences low recall, since there is a variety of spellings for the language name, and since most texts in the language do not explicitly identify the language anyway. Attempts to refine the search by limiting its scope to a country (e.g. site:.ar) do not prune the result set to a manageable size, and eliminate some of the most useful resources created by the diaspora. Additional keywords like "dictionary" must be entered in several languages (e.g. "worterbuch", "diccionario"), requiring more effort for limited returns.
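One simple way to mechanise the expansion described in this scenario is to cross a list of known spellings of the language name (as might be drawn from a taxonomy such as the Ethnologue) with resource-type keywords in several languages. The Python sketch below illustrates the idea; the variant and keyword lists are invented placeholders, not data from our system.

from itertools import product

# Illustrative placeholders: spelling variants of a language name and
# resource-type keywords in several languages.
language_variants = ["Wagiman", "Wageman", "Wakiman"]
resource_keywords = ["dictionary", "worterbuch", "diccionario", "dictionnaire"]

def expand_queries(variants, keywords):
    """Cross language-name variants with keywords to produce candidate search queries."""
    return ['"%s" %s' % (name, keyword) for name, keyword in product(variants, keywords)]

for query in expand_queries(language_variants, resource_keywords):
    print(query)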

Scenario 2: Finding resources by proximity. A research project is investigating some geographical region like the Afghanistan/Pakistan borderland, and requires information on the linguistic situation, including numbers of languages, population per language, literacy level, and available language resources. Web search using the name of the political region only turns up materials about companies and government agencies. Another project is investigating Australia's Western Desert Languages, and the user discovers that it is necessary to search for each of a dozen language names (and variant spellings) separately, a tedious and error-prone process.

Scenario 3: Finding examples of a linguistic construction. A user wants to find examples of sentences that contain multiword expressions that match a specified template, such as verb-particle constructions involving the word 'up' (e.g. '... put the team up'). Searching just on the word 'up' turns out to be fruitless. The user picks a verb at random and tries using Google's starred expressions, e.g. 'put * * up'. This finds a handful of examples but gives no sense of which verb-particle constructions most often involve 'up'. By using automatic means to identify instances of particular constructions in the web corpora collected in this project and mapping the results onto a language-universal annotation schema of syntactic construction types, we are able to support such queries.
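The kind of automatic identification referred to above can be approximated with part-of-speech tagging. The Python sketch below uses NLTK to count verbs followed within a few words by the particle 'up' in a handful of example sentences; it is a deliberately crude stand-in for the construction identification and annotation schema used in the project.

from collections import Counter
import nltk

# Requires the NLTK tokeniser and tagger models, e.g.:
#   nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

sentences = [
    "The coach put the team up for the night.",
    "She looked up the word in the dictionary.",
    "He gave up after the third attempt.",
]

counts = Counter()
for sentence in sentences:
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    for i, (word, tag) in enumerate(tagged):
        if not tag.startswith("VB"):
            continue
        # Look a few tokens ahead for 'up', allowing an intervening object noun phrase.
        for next_word, next_tag in tagged[i + 1:i + 4]:
            if next_word.lower() == "up" and next_tag in ("RP", "RB", "IN"):
                counts[word.lower()] += 1
                break

print(counts.most_common())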

System Architecture

In response, we are integrating a range of existing software to provide end-to-end services enabling efficient resource discovery. The overall architecture consists of four components:

Language Crawler: In the discovery phase, we use LangGator (Hughes, 2005), a high-precision, domain-focused web crawler, to identify and retrieve URIs of "linguistic interest" from the web. LangGator is based on systematic query expansion, metasearch, and rank aggregation preceding URI content retrieval.
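To give a feel for the rank aggregation step, the Python sketch below merges ranked URI lists returned by different search engines using a simple reciprocal-rank score; the scoring scheme and the toy result lists are illustrative stand-ins rather than the aggregation method LangGator actually implements.

from collections import defaultdict

def aggregate(ranked_lists):
    """Merge ranked URI lists: earlier positions score higher, summed across engines."""
    scores = defaultdict(float)
    for results in ranked_lists:
        for position, uri in enumerate(results):
            scores[uri] += 1.0 / (position + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Toy result lists from two metasearch back-ends.
engine_a = ["http://example.org/dictionary", "http://example.org/texts"]
engine_b = ["http://example.org/texts", "http://example.org/wordlist"]
print(aggregate([engine_a, engine_b]))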

Metadata Creation: In the description phase, we are adapting automated metadata creation tools such as the generalised DC-Dot (Powell, 2000) to create DC-based OLAC metadata (Simons and Bird, 2003). Where fine-grained distinctions are required for accurate description, we are adapting best-of-breed machine learning tools such as the libiViaMetadata library (Paynter, 2005) to handle the classification task. Other machine learning approaches are being adopted specifically for language identification problems (Hughes et al., 2006), using international standard taxonomies for languages such as the Ethnologue (Gordon (ed.), 2005). The resulting metadata descriptions are published to OLAC as an OAI Static Repository (van de Sompel and Lagoze (eds), 2002).
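As a concrete illustration of the target record format, the Python sketch below emits a minimal Dublin Core record with an OLAC language refinement; the element values are placeholders and the attribute details are simplified, so the output should be checked against the OLAC metadata specification rather than taken as authoritative.

import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OLAC = "http://www.language-archives.org/OLAC/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
for prefix, uri in (("dc", DC), ("olac", OLAC), ("xsi", XSI)):
    ET.register_namespace(prefix, uri)

record = ET.Element("{%s}olac" % OLAC)
title = ET.SubElement(record, "{%s}title" % DC)
title.text = "Example wordlist harvested from the web"   # placeholder value
language = ET.SubElement(record, "{%s}language" % DC)
language.set("{%s}type" % XSI, "olac:language")
language.set("{%s}code" % OLAC, "xxx")                    # placeholder language code

print(ET.tostring(record, encoding="unicode"))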

Language Archive: Having located, retrieved and described URIs of linguistic interest, we are storing a local copy of the URI content in a digital repository. This approach is strongly motivated by the relatively short half-life of web content.
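A minimal sketch of this kind of local archiving, assuming a simple hash-keyed directory layout rather than our actual repository software, is given below: each retrieved URI's content is stored under its content hash alongside a small provenance record noting where and when it was fetched.

import hashlib
import json
import time
import urllib.request
from pathlib import Path

ARCHIVE_DIR = Path("archive")   # assumed local directory for the illustration

def archive(uri):
    """Fetch a URI, store its bytes under a SHA-1 content hash, and record provenance."""
    with urllib.request.urlopen(uri) as response:
        content = response.read()
    digest = hashlib.sha1(content).hexdigest()
    ARCHIVE_DIR.mkdir(exist_ok=True)
    (ARCHIVE_DIR / digest).write_bytes(content)
    provenance = {"uri": uri, "sha1": digest,
                  "fetched": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())}
    (ARCHIVE_DIR / (digest + ".json")).write_text(json.dumps(provenance))
    return digest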

Language-Aware Search Engine: We are leveraging the OLAC Search Engine (Hughes and Kamat, 2005), together with new user interfaces allowing for geographically-oriented search, and programmatic interfaces for service-based interaction with the collections.

Collaboration

From the outset, this project has been viewed as a shared effort with a number of other parties. In particular, we are collaborating with:

Conclusion

To provide end-user services and fulfill users' information needs more efficiently, we are integrating specialised components for the discovery and description of web-based language data. Our aim is to increase the likelihood that end users interested in low-density languages will be able to locate relevant web resources efficiently.

References

Raymond G. Gordon, Jr. (ed.), 2005. Ethnologue: Languages of the World (15th Edition). SIL International, Dallas.

Baden Hughes, 2005. Towards Effective and Robust Strategies for Finding Web Resources for Lesser Used Languages. Proceedings of Lesser Used Languages and Computer Linguistics 2005. Springer Verlag.

Baden Hughes, Timothy Baldwin, Steven Bird, Jeremy Nicholson and Andrew MacKinlay, 2006. Reconsidering Language Identification for Written Language Resources. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC2006). European Language Resources Association.

Baden Hughes and Amol Kamat, 2005. A Metadata Search Engine for Digital Language Archives. D-Lib Magazine 11(2), February 2005. http://www.dlib.org/dlib/february05/hughes/02hughes.html. Available online [HREF7]. Last accessed 13/4/2006.

William D. Lewis, n.d. Online Database of Interlinear Text (ODIN). http://www.csufresno.edu/odin. Available online [HREF9]. Last accessed 8/5/2006.

Open Archives Initiative, n.d. Open Archives Initiative. http://www.openarchives.org/. Available online [HREF11]. Last accessed 8/5/2006.

Open Language Archives Community, n.d. Open Language Archives Community. http://www.language-archives.org/. Available online [HREF10]. Last accessed 8/5/2006.

Gordon Paynter, 2005. Developing Practical Automatic Metadata Assignment and Evaluation Tools for Internet Resources. Proceedings of the Fifth ACM/IEEE Joint Conference on Digital Libraries (JCDL). ACM Press. pp. 291-300.

Andy Powell, 2000. DC-Dot Dublin Core Metadata Editor. http://www.ukoln.ac.uk/metadata/dcdot/. Available online [HREF6]. Last accessed 13/4/2006.

Gary Simons and Steven Bird, 2003. The Open Language Archives Community: An infrastructure for distributed archiving of language resources. Literary and Linguistic Computing, Volume 18, pp. 117-128.

Herbert van de Sompel and Carl Lagoze (eds), 2002. Specification for an OAI Static Repository and an OAI Static Repository Gateway. Open Archives Initiative. http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm. Available online [HREF8]. Last accessed 13/4/2006.

Hypertext References

HREF1
http://www.csse.unimelb.edu.au/~tim/
HREF2
http://www.csse.unimelb.edu.au/
HREF3
http://www.unimelb.edu.au/
HREF4
http://www.csse.unimelb.edu.au/~sb/
HREF5
http://www.csse.unimelb.edu.au/~badenh/
HREF6
http://www.ukoln.ac.uk/metadata/dcdot/
HREF7
http://www.dlib.org/dlib/february05/hughes/02hughes.html
HREF8
http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm
HREF9
http://www.csufresno.edu/odin
HREF10
http://www.language-archives.org
HREF11
http://www.openarchives.org/

Copyright

Timothy Baldwin, Steven Bird and Baden Hughes, © 2006. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.