Collecting Low-Density Language Materials on the Web

Timothy Baldwin, Senior Lecturer, Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia. Email: tim@csse.unimelb.edu.au

Steven Bird, Associate Professor, Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia. Email: sb@csse.unimelb.edu.au

Baden Hughes, Research Fellow, Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia. Email: badenh@csse.unimelb.edu.au


Keywords

low-density languages, web crawling, language identification, metadata, digital archives, domain-specific web search


Abstract

Most web content exists in a few dozen languages. Hundreds of other languages, the 'low-density languages', are represented on the web only in small quantities. How can we locate, store and describe these low-density resources? In particular, how can we identify linguistically interesting resources, such as translation sets and multilingual documents? In this paper we describe ongoing research in which we integrate a number of discrete systems (a language data crawler, automated metadata generation tools, language data repositories and federated search services) to address the identification, retrieval, description, storage and access of low-density language materials on the web.


All materials Copyright AusWeb06. The Twelfth Australasian World Wide Web Conference, Australis Noosa Lakes, from 1st to 5th July 2006