Developing a dynamic, multilingual, sense-sensitive dictionary from Web search interactions

Gavin Smith, Student, School of Computer and Information Science, [HREF1], University of South Australia, [HREF2], GPO Box 2471 Adelaide, South Australia 5001. smigs003@students.unisa.edu.au

Mark Truran, Lecturer, School of Computer Science, [HREF3], University of Teesside, [HREF4], Tees Valley, TS1 3BA, United Kingdom. m.a.truran@tees.ac.uk

Helen Ashman, Associate Professor, School of Computer Science and Information Science, [HREF1], University of South Australia, [HREF2], GPO Box 2471 Adelaide, South Australia 5001. Helen.Ashman@unisa.edu.au

Abstract

A number of dynamic, Internet-based techniques have recently been developed to address specific issues within the field of information retrieval. Although often complementary, these techniques have been developed in isolation, each with specific goals and promising results. Of particular note to cross-language information retrieval (CLIR) is progress on disambiguation and synonym generation using dynamic term acquisition. Together, these provide the basis for a promising CLIR resource: a dynamic (in the sense of term acquisition), sense-sensitive, multilingual dictionary. Such a resource has large potential, directly targeting the open out-of-vocabulary problem. Specifically, this paper describes work on extending and merging two approaches, namely co-active intelligence and query clustering.

I. Introduction

Co-active intelligence (Truran 2005) has recently been shown to have the potential to provide an important resource for cross-language information retrieval (CLIR): a dynamic, potentially broad, sense-sensitive multilingual dictionary (Ashman et al. 2007). Coupled with a technique known as query clustering (Beeferman and Berger 2000; Cui et al. 2003; Gao et al. 2007; Wen et al. 2002; Xue et al. 2004), such a resource has the potential to address the open problem of translating the ever-changing set of arbitrary terms within natural language, known as the out-of-vocabulary (OOV) problem. Such a dictionary has the potential to impact not only CLIR but also related fields such as machine translation and the development of writing aids. While the exact language coverage of such a technique is unknown (although it grows as data continues to be collected), this paper reports preliminary observations on the theoretical construction of such a resource and highlights its applicability in the context of current state-of-the-art CLIR systems.

The translation of terms is useful in a number of areas. Most obviously, translation is needed so that speakers of one language can access information in another. This has a commercial impact as, for example, major software developers need to translate manuals and support information into as many as 20 different languages. CLIR is a slightly different application of translation, as the aim is to submit a search request in one language and to be presented with appropriate search results in any language.

Automatic translation is by no means a solved problem. Significant effort continues to be invested in it; for example, a new €17M project funded by Science Foundation Ireland incorporates a major automatic translation component1.

The rest of the paper is organised as follows: Section II presents details on the theoretical development of a dynamic, multilingual, sense-sensitive dictionary, including a number of challenges and solutions. Section III presents related work. Section IV highlights the benefits and place of such a resource in the context of the state of the art within CLIR.

II. Development of a dynamic, multilingual, sense-sensitive dictionary

Co-active Intelligence: Creating sense sensitivity

Word and phrase sense disambiguation is a complex and open problem. Traditional techniques leverage either static resources such as WordNet or the surrounding context, as is common in document translation. While context-based disambiguation is well known, within CLIR it tends to perform poorly due to the lack of context in typical queries. Static, dictionary-based methods are also sub-optimal, being expensive to build and inherently limited in the number of terms they contain and hence can translate, leading to the well-known out-of-vocabulary (OOV) problem (Ballesteros and Croft 1998). To address this problem a number of researchers have recently turned to Internet-based approaches (Gracia et al. 2006; Truran 2005). Co-active intelligence is one such approach, and is adopted in the work presented here. Other approaches, which are less suited to the creation of the dictionary-style resource proposed here, are covered later in the related work section.

Truran (2005) defines co-active intelligence as the process of disambiguation by developing clusters of web resources under sense-sensitive labels derived from search queries. For example, at least two clusters would exist for the word 'jaguar': one for the car brand and one for the animal. The clusters are developed by mining search selections from Internet search engines (e.g. Google), on the basic premise that within a given search session a user will search for only a single semantic meaning of a query, and therefore resources selected by a user in a given search session (co-selected resources) are semantically related. This notion of relatedness is further strengthened when the same associations occur in multiple search sessions by multiple users. In co-active intelligence each term can then lead to a number of clusters, with each cluster being a set of resources relating to a specific meaning of the term. In this way clusters can be seen as containers with distinct labels - terms of a particular semantic sense.
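To make this clustering step concrete, the following is a minimal sketch, not Truran's implementation, of how co-selected resources might be aggregated into sense clusters from session-level click data; the session log format and the co-selection threshold are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def build_sense_clusters(sessions, min_support=2):
    """Group co-selected resources into candidate sense clusters.

    `sessions` is an iterable of (query, [clicked_urls]) pairs, one per
    search session.  Resources co-selected in a session are assumed to
    share the query's intended sense; associations observed in at least
    `min_support` sessions are kept.  Each connected component of the
    resulting co-selection graph becomes one sense cluster for the query.
    """
    pair_counts = defaultdict(int)
    for query, clicks in sessions:
        for a, b in combinations(sorted(set(clicks)), 2):
            pair_counts[(query, a, b)] += 1

    # Adjacency lists of reliable co-selections, per query.
    graph = defaultdict(lambda: defaultdict(set))
    for (query, a, b), count in pair_counts.items():
        if count >= min_support:
            graph[query][a].add(b)
            graph[query][b].add(a)

    # Connected components -> sense clusters.
    clusters = defaultdict(list)
    for query, adjacency in graph.items():
        seen = set()
        for node in adjacency:
            if node in seen:
                continue
            component, stack = set(), [node]
            while stack:
                current = stack.pop()
                if current in component:
                    continue
                component.add(current)
                stack.extend(adjacency[current] - component)
            seen |= component
            clusters[query].append(component)
    return clusters

sessions = [
    ("jaguar", ["jaguar.com", "jaguarusa.com"]),
    ("jaguar", ["jaguar.com", "jaguarusa.com"]),
    ("jaguar", ["en.wikipedia.org/wiki/Jaguar", "bigcatrescue.org"]),
    ("jaguar", ["en.wikipedia.org/wiki/Jaguar", "bigcatrescue.org"]),
]
print(build_sense_clusters(sessions))  # two clusters: car sites, animal sites
```

In this toy run the two distinct selection patterns for 'jaguar' fall into separate components, reflecting the car and animal senses described above.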

Query Clustering: Developing a 'live thesaurus'

The generation of a dynamic thesaurus from web logs is not a new idea, having been described by Beeferman and Berger (2000) on the basis of search query clustering. Search query clustering is the process in which query logs are mined to provide a measure of similarity between search queries (Beeferman and Berger 2000; Cui et al. 2003; Gao et al. 2007; Wen et al. 2002; Xue et al. 2004). Such a method is inherently dynamic, as query logs are continually generated and mined, updating the underlying query clusters and hence the similarity relationships between query terms. As with co-active intelligence, this is achieved by mining 'click-through' data, i.e. user search result selections: a query is said to be related to a document if that document is selected as a result of a search for the query. If two queries have enough of the same resources linked to them via this method, the two queries are deemed to be similar. Typically the notion of similarity is quantified with a similarity metric (Wen et al. 2002). Development of a 'live thesaurus' then proceeds by applying a threshold or ranking based on the similarity metric (Gao et al. 2007), or by using the mined query-document correlations and document-document term correlations to bridge the query and document terms (Cui et al. 2003).
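A minimal sketch of one such similarity metric follows. It measures the overlap between the sets of documents clicked for two queries; this Jaccard-style formulation is illustrative only and is not the exact metric of Wen et al. (2002).

```python
def query_similarity(clicks_q1, clicks_q2):
    """Jaccard-style similarity between two queries based on the
    documents clicked for each (illustrative, not the metric of
    Wen et al. 2002)."""
    docs1, docs2 = set(clicks_q1), set(clicks_q2)
    if not docs1 or not docs2:
        return 0.0
    return len(docs1 & docs2) / len(docs1 | docs2)

# Queries whose result selections overlap heavily are treated as related.
print(query_similarity(["a.com", "b.com", "c.com"], ["b.com", "c.com", "d.com"]))  # 0.5
```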

A co-active 'live thesaurus'

Much like query clustering, co-active intelligence generates query-labelled resource cluster pairs dynamically from web logs. In a similar fashion to query clustering, similarity measures can then be applied based on the observation of identical resources within different clusters, effectively highlighting cluster overlap. As indicated by Ashman et al. (2007), such overlap, when statistically significant, shows a clear semantic equivalence between the terms, for which a transitive relationship holds. This differs from query clustering, in which the generated synonym relations are not transitive. Consider the query term 'jaguar'. Traditional query clustering would find the synonym relations ('jaguar', 'car'), ('jaguar', 'big cats') and ('jaguar', 'jaguar animal'). Such relationships are not transitive, as the synonym relation ('car', 'big cats') is obviously not valid. The co-active intelligence based approach would find the same relationships, but 'jaguar' would be uniquely tagged per sense instance, i.e. ('jaguar'_1, 'car'), ('jaguar'_2, 'big cats'), ('jaguar'_2, 'jaguar animal'). Between any instances of uniquely tagged terms a transitive relation is therefore maintained: the relationship ('jaguar animal', 'big cats') is valid and no incorrect associations are generated.
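The difference can be illustrated with a short sketch that takes sense-tagged synonym pairs of the kind described above and derives the additional relations implied by transitivity; the tagging scheme ('jaguar'_1, 'jaguar'_2) is the illustrative notation used in this paper.

```python
from itertools import combinations

def transitive_synonyms(sense_tagged_pairs):
    """Derive all synonym pairs implied by transitivity over
    sense-tagged relations: terms linked to the same sense-tagged
    instance belong to one equivalence class."""
    groups = {}
    for sense_term, synonym in sense_tagged_pairs:
        groups.setdefault(sense_term, set()).add(synonym)

    derived = set()
    for sense_term, synonyms in groups.items():
        members = synonyms | {sense_term}
        derived.update(combinations(sorted(members), 2))
    return derived

pairs = [("jaguar_1", "car"), ("jaguar_2", "big cats"), ("jaguar_2", "jaguar animal")]
# ('big cats', 'jaguar animal') is derived; ('car', 'big cats') is not.
print(transitive_synonyms(pairs))
```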

The ability to mine transitive relations is highly beneficial in the creation of a translation resource, with recent work on transitive mining of dictionaries reporting a 75% increase in retrieval in an image search context for 'minor' languages (Etzioni et al. 2007).

Ensuring coverage: Data requirements

Co-active cluster overlap, like query clustering, is a statistically based approach and as such requires a significant amount of data to produce interesting results. Looking at the similar query clustering technique, results are typically reported from web logs ranging in size from 22GB containing nearly five million query sessions (Cui et al. 2003) to a set of filtered web logs containing seven million unique queries, each of which had been selected at least five times (Gao et al. 2007).

When generalizing the 'live thesaurus' to a multilingual translation resource it is expected that these data requirements will increase further, since we utilize only a subset of the generated 'live thesaurus'. As described by Ashman et al. (2007), there are a number of ways cross-language term overlap could occur. Consider a user searching for a translation or definition of the English word 'dog' into Swedish ('hund'), or vice versa. The user may search using either 'dog' or 'hund' but still select similar pages: either pages that give the correct translation, or, if the user is bilingual, pages that simply relate to that meaning. Ashman et al. describe the latter as 'bridging' activity. In the former case the pages can be considered bridging resources - resources that contain terms in both languages and are returned in either case. Here the document itself provides the translation link, rather than a bilingual human. The process is still co-active, however, as it is the human interaction that selects the resource as relevant in both languages.

As previously mentioned, cluster overlap can be generated by 'bridging' activity or by bridging documents, and this overlap provides the basis for the creation of a dynamic, sense-sensitive multilingual dictionary. As this process is independent of the disambiguation stage, the amount of overlap can be expected to be similar to that achieved by query clustering. While the exact number of generated synonyms has not been released, these techniques have produced promising results in near-synonym dependent tasks (Cui et al. 2003; Gao et al. 2007). On a less positive note, however, such cluster overlap generation relies heavily on two uncertain factors. Firstly, in order for co-active intelligence to use bridging documents, the search engine must rank those documents sufficiently highly for queries in both of the languages appearing in the document. If this does not happen the user will not see the document in their search results and will not select it, preventing co-active intelligence from associating the webpage with the search term. Secondly, bridging activity relies on a large number of bilingual searches - searches where a user queries in one language and then selects a result in another. This relies on the search engine adequately returning mixed-language results and also on the presence of a large number of bilingual users who exhibit this selection behavior. While we have little knowledge of the former, the latter presents a problem: we are simply re-mining the CLIR already provided by the search engine for the given terms.

While the above arguments may be pessimistic, the basis for generating a sense-sensitive multilingual dictionary is not eroded; the process may just be slow at acquiring terms. Even if bridging documents are ranked low, cluster overlap will still build up, either because some results will still be ranked highly or because of deep searches, where a user goes past the first page of results. In addition, while bridging activity may not provide results beyond those found by the CLIR techniques within the search engine, it will positively link terms with the correct sense. Such sense discrimination is rarely done, with most dictionary-based methods for translation using sense-insensitive dictionaries. Essentially we are not only re-mining but also refining and extending the CLIR process already present in the search engine.

In order to make the process more efficient, and to reduce the amount of data required to generate significantly weighted clusters, two new techniques are proposed. The first aims to tighten weak overlap between clusters. Based on shallow content analysis of images - a cross-language resource - it trawls weak associations between clusters to either strengthen or refute the statistical similarity.

The technique is as follows: the documents in each cluster are parsed and their content images compared. If the images are the same, the two clusters are brought closer together. In essence this primarily mines the pages on the Internet that have been written in two or more languages. Using images removes the need to assume that both pages have been written with exactly the same content, replacing it with the broader assumption that the major content is equivalent. This reflects the reality in a number of cases, such as the Swedish train site www.sj.se and the website of a German castle, www.neuschwanstein.de/. A generalisation of this technique is to relax the requirement that the pictures be exactly the same file. This leads in a small way into content analysis, a departure from pure co-active intelligence. Such content analysis, however, is language-independent and works across all clusters, thus maintaining most of the principles of co-active intelligence.

Perhaps the bigger drawback of this approach is the computational complexity it raises. Such a process would need to operate more like a crawler. As with smart indexing of webpages, the process would work on a feasible, restricted subset of the content available on the Internet. The primary restriction is to pages identified through co-active intelligence as being relevant, i.e. pages that have been placed into clusters by users. Further restrictions on which images to process can also be made - for instance, only clusters with a small overlap, say from the bridging-document co-active process, so that analysis is done primarily to strengthen an already present weak association into a significant one. In this way the process augments the more 'pure' co-active intelligence approaches. Other restrictions could include clusters that share documents from the same domain - a heuristic based on the perceived target documents mentioned earlier, content translations provided by websites - or clusters that contain a large number of identical images. In line with the principles of co-active intelligence this process is again language-independent and works across all clusters in all languages.
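A minimal sketch of the image-comparison step follows. It compares the sets of image references found in the documents of two weakly associated clusters; the HTML-parsing helper and the fetching of pages are assumptions for illustration, not part of the co-active framework itself.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ImageSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = set()

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.add(value)


def page_images(page_url):
    """Return the set of image references on a page.

    Comparing image URLs corresponds to the strict 'same file' case;
    hashing the downloaded image bytes instead would implement the
    relaxed generalisation discussed above, at greater cost.
    """
    parser = ImageSrcParser()
    with urlopen(page_url) as response:  # assumes the page is reachable
        parser.feed(response.read().decode("utf-8", errors="ignore"))
    return parser.sources


def cluster_image_overlap(cluster_a_urls, cluster_b_urls):
    """Fraction of image references shared between the documents of two
    weakly associated clusters; a high value supports strengthening the
    association between them."""
    images_a = set().union(*(page_images(u) for u in cluster_a_urls)) if cluster_a_urls else set()
    images_b = set().union(*(page_images(u) for u in cluster_b_urls)) if cluster_b_urls else set()
    if not images_a or not images_b:
        return 0.0
    return len(images_a & images_b) / len(images_a | images_b)
```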

The second technique is based on the previously mentioned ability to process synonym relationships between the identified clusters in a transitive fashion. Transitive systems have been proposed in many cases to increase the effectiveness of translation systems, particularly to increase vocabulary size. Examples include the use of pivot languages when direct translations between the source and target language do not exist (Ballesteros and Sanderson 2003), and recent work on merging data from dictionaries and Wiktionaries (Etzioni et al. 2007). In both cases, particularly with pivot languages, results have demonstrated the validity of such techniques. For co-active cluster overlap, a system similar to that described by Etzioni et al. is proposed: the generation of a translation graph in which paths through the graph suggest word translations that are not initially obvious.
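The following sketch illustrates the translation-graph idea: terms are nodes, mined equivalences (including cross-language cluster overlaps) are edges, and a breadth-first search over the graph surfaces indirect translations via pivot terms. It is an illustration of the general idea rather than the system of Etzioni et al. (2007), and the example edges are invented.

```python
from collections import deque


def translation_paths(edges, source, target, max_hops=3):
    """Breadth-first search over a translation graph.

    `edges` is an iterable of (term_a, term_b) equivalences mined from
    cluster overlap or existing dictionaries.  A path from `source` to
    `target` within `max_hops` suggests a candidate translation, even
    when no direct edge exists.
    """
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    queue = deque([(source, [source])])
    visited = {source}
    while queue:
        term, path = queue.popleft()
        if term == target:
            return path
        if len(path) > max_hops:
            continue
        for neighbour in graph.get(term, ()):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, path + [neighbour]))
    return None


# Invented example: no direct English-Swedish edge for 'dog', but a path
# exists via a German pivot term mined from another cluster overlap.
edges = [("dog", "Hund(de)"), ("Hund(de)", "hund(sv)")]
print(translation_paths(edges, "dog", "hund(sv)"))  # ['dog', 'Hund(de)', 'hund(sv)']
```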

While this is still work in progress, such techniques serve to reduce the amount of data required by extracting and examining additional information from the web logs.

Regardless of the exact size of the query logs available and any additional techniques introduced, it is obvious that the coverage of such web logs, and hence of the mined associations, is limited to the domain from which the data comes. This was highlighted in preliminary experiments conducted using a small sample of approximately 330,000 queries recorded from a single university over a twelve-month period. As expected, the clusters that formed were statistically weak, and they were also domain-limited, reflecting the teaching focuses of the particular university. In terms of future research this is a major bottleneck: while we are in the process of acquiring further data, additional web logs - particularly logs from universities or organizations with a high proportion of bilingual or multilingual employees or students - would be of great benefit.

In terms of the motivation for the development of such a resource, this observation of limited domain coverage matters less. Consider a large, multilingual organization wanting to increase the performance of CLIR. Such an organization would inherently have a limited domain for its queries, and many common day-to-day searches may use terms not found in a traditional static dictionary. This is a specific instance of the OOV problem, one which co-active intelligence is particularly apt at dealing with. While in general this will prevent the co-active multilingual dictionary from becoming 'complete' in the sense of listing all language terms, it will contain those commonly used within the organization, enabling cross-language information retrieval within the required domain. In essence the co-active approach shares multilingual knowledge across the organization.

III. Related work

The field of CLIR involves searching for documents in one or more target languages based on a query in a source language. Approaches can be roughly categorized as shown in Table 1 below.

Query Translation
  - Generalised2 dictionary-based query translation, comprising:
      - Static bilingual dictionaries: produce a number of target-language translations per term
      - Corpora-based techniques: use probabilities that words translate, derived from analysis of parallel corpora
      - Internet-based techniques: mine Internet resources to determine translations between terms
  - Machine Translation

Document Translation
  - Machine Translation

Table 1 Categorization of CLIR approaches

Document Translation

Document translation refers to the process of translating documents in the target language into the source language so that a monolingual search can be performed on them. Since entire documents have to be translated, machine translation is used. This process is computationally expensive: translating the near-infinite pool of documents on the Internet is out of reach of modern processing efforts, and certainly more complex than the alternative of simply translating the query from the source language into the target language. While hybrid approaches have been proposed (Jianqiang and Douglas 2006; McCarley 1999), the query translation approach is the most common (Christof and Bonnie 2005; Kishida et al. 2007).

Query Translation

Within the query translation category two further approaches exist, namely generalised dictionary-based approaches and machine translation. In the latter case machine translation is used to translate the query into a (typically single) equivalent query. While quite popular, machine translation techniques in this setting suffer from the lack of context typically present in short queries (Braschler et al. 2000). They also tend to have trouble dealing with the informal grammar a query typically contains, and are prone to meaning loss because only a single translated query is selected as output.

Generalised dictionary-based query translation is the second category of query translation in CLIR and generally involves the following stages (a minimal sketch of the pipeline follows the list):

  1. Break the query down into language units
  2. Optionally identify the sense of the units (perform word sense disambiguation)
  3. Translate the units using either a sense sensitive or traditional multilingual dictionary (the former if step 2 was performed). Multiple translations are typically kept.
  4. Use numerous techniques to improve the result
  5. Search using the translation (monolingual IR)
  6. Optional: Additional techniques to improve/merge the result(s)

The sub-categories of this technique are then based on the type of translation resource ('dictionary') they employ.

Static dictionary methods

Static dictionaries are the most common form of translation resource used in CLIR. While they are the simplest form of dictionary, they suffer from a range of problems, the most prominent being the out-of-vocabulary (OOV) problem, ambiguity, and the failure to translate multi-term concepts such as phrases (Kishida 2005).

The OOV problem occurs when the machine translating the query comes across a word it does not know, or does not have a translation for. Typically this happens with proper nouns, slang and newly coined terms, such as 'blog', that may not have existed when the static dictionary was created. A promising recent attempt involved combining multi-language dictionaries through graph-based methods to find additional translations via transitive paths between the languages (Etzioni et al. 2007). Using this technique the authors report a 75% increase in correct image retrieval in the first 15 pages. They note, however, that typically only sense-insensitive dictionaries are available, and that this hinders their effort. To counter this they introduce a probabilistic method, but indicate that it could be improved and highlight this as future work. In addition, their method still suffers from the OOV problem, though to a lesser degree, with the authors reporting a threefold increase in translatable terms at a precision of 0.8. Other attempts to solve the OOV problem come from Internet-based dictionary efforts, which are described in a later section.

The ambiguity problem is multifaceted. Not only are there at least two senses for the word 'bat', but there may also be multiple words in Swedish for a single sense, some of which are no longer in common usage. Monolingual disambiguation followed by translation with sense-sensitive dictionaries has been proposed to counter the first problem (Gracia et al. 2006), and query log analysis has recently been proposed for the second (Gao et al. 2007). These approaches are described in more detail below.

Gracia et al. (2006) propose a method that first creates a kind of monolingual sense-sensitive dictionary from multiple ontologies. Using this dictionary they iteratively expand the query according to each semantic meaning and use frequency statistics from Google searches to disambiguate the monolingual query. In automatic mode this essentially selects the most common semantic interpretation of the keyword set, while in semi-automatic mode it provides a list of possible interpretations to the user. The disambiguated monolingual query can then be translated more accurately using a sense-sensitive multilingual dictionary. Because it is based on the Internet, this approach to the disambiguation problem is very broad. However, coupled with the ontological approach, it is limited in the first instance by the OOV problem: while multiple ontologies are used, ontologies are not dynamic and must be explicitly created and maintained, so they suffer from problems similar to those of static dictionaries. The process therefore suffers from the lack of a dynamic sense-sensitive dictionary.
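The frequency-based selection idea can be sketched as follows; the search_hit_count function, the sense expansions and the counts are hypothetical stand-ins for the search-engine statistics and ontology terms used by Gracia et al., not their actual system.

```python
def most_common_interpretation(keyword, sense_expansions, search_hit_count):
    """Pick the semantic interpretation whose expanded query is most
    frequent on the web (the 'automatic mode' idea described above).

    `sense_expansions` maps a sense label to extra terms drawn from an
    ontology; `search_hit_count` is a caller-supplied function returning
    an estimated result count for a query string (hypothetical here).
    """
    scores = {
        sense: search_hit_count(f"{keyword} {' '.join(terms)}")
        for sense, terms in sense_expansions.items()
    }
    return max(scores, key=scores.get)


# Invented example values standing in for real search-engine statistics.
fake_counts = {"jaguar car engine": 900_000, "jaguar animal rainforest": 150_000}
expansions = {"car": ["car", "engine"], "animal": ["animal", "rainforest"]}
print(most_common_interpretation("jaguar", expansions,
                                 lambda q: fake_counts.get(q, 0)))  # 'car'
```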

Gao et al. (2007) propose another solution to the ambiguity problem that covers a range of issues. Their approach, based on query logs, works by first translating the source query into a number of candidate ('rough') translations using a number of methods, including machine-readable dictionaries and parallel-corpora-based methods. They then resolve poor translations based on the assumption that a query in one language will, at some point, have been asked by a user in the target language. To achieve this they mine a similarity measure between the translation(s) of the source query and the queries within the log files, using the query clustering techniques described previously. Using monolingual similarity measures a threshold value is established, and only the queries in the log files associated with the 'rough' translations of the source query at a similarity above this threshold are considered as translations; the 'rough' translations themselves are then discarded. This reduces the effect of the ambiguity problem when out-of-date or uncommon translations are selected, as they are 'corrected' back into queries from the log files, which inherently contain only queries in current usage. Such an approach works well because there is a larger amount of data on which to perform monolingual similarity measures.

While this technique uses the same data source as co-active intelligence (click-through data from query logs) in the final stages of query translation, the technique and its intended use are distinctly different. It uses mined query terms as possible translations, determining which one to use by assigning similarity measures to each mined query term: the translation of the source query, obtained independently, is compared to the mined terms and the closest-matching mined term substituted as the final translated query. In this way the technique is a translation refinement technique and does not address problems that occur in the earlier stages, such as ambiguity in the initial translation. In contrast, co-active clustering aids the initial translation itself, with issues such as ambiguity being the primary concern.
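The refinement step can be sketched as follows; the threshold value, the example logged queries and the click data are invented, and the sketch illustrates the general idea rather than the method of Gao et al. (2007).

```python
def query_similarity(clicks_a, clicks_b):
    """Jaccard-style overlap of clicked documents (as sketched earlier)."""
    a, b = set(clicks_a), set(clicks_b)
    return len(a & b) / len(a | b) if a and b else 0.0


def refine_translation(rough_translations, logged_queries, clicks_by_query,
                       threshold=0.3):
    """Replace 'rough' candidate translations with real queries from the
    target-language log whose click patterns are sufficiently similar.

    `clicks_by_query` maps each query (rough or logged) to the documents
    clicked for it.  The rough translations themselves are discarded."""
    refined = set()
    for rough in rough_translations:
        for logged in logged_queries:
            similarity = query_similarity(clicks_by_query.get(rough, []),
                                          clicks_by_query.get(logged, []))
            if similarity >= threshold:
                refined.add(logged)
    return refined


# Invented example: the dictionary offers two rough Swedish translations,
# but only one of them behaves like a query real users actually issue.
clicks = {
    "hund": ["sv.wikipedia.org/Hund", "hundar.se"],
    "hundar": ["sv.wikipedia.org/Hund", "hundar.se", "kennel.se"],
    "jycke": ["slangopedia.se/jycke"],
}
print(refine_translation(["hund", "jycke"], ["hundar"], clicks))  # {'hundar'}
```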

Parallel and comparable corpora based methods

The second approach within query translation methods is to use parallel or comparable corpora to generate equivalences between terms, in essence creating a similarity thesaurus. A parallel corpus is a collection of documents that have been directly translated. Examples include translations of the Bible (Chew and Abdelali 2007) and the Europarl corpus, a collection of the proceedings of the European Parliament from 1996 onwards translated into 11 different languages. In addition, governments in bilingual countries such as Canada have parliamentary proceedings and official records directly translated, creating further parallel corpora (Koehn 2005).

Techniques range from sentence and even direct word alignment on parallel corpora to a number of statistical methods. While such methods have shown good results, the approach is limited by the availability of parallel corpora (Kishida 2005). Not only is it limited to the language pairs for which parallel corpora are available, it is also limited in domain: for example, there is little parallel text containing significant amounts of informal language, as such text is not methodically translated in large quantities. In this regard these techniques generally fall short on their own. Some automated attempts at extracting parallel corpora from the Internet have been proposed, such as PTMiner and an associated filtering system by Nie and Cai (2001). While they report good results, the approach uses static, hand-selected 'anchor text' (such as 'in Chinese' or 'Chinese version') to identify candidate pages, which must be developed for each language; pages are then paired using similarity measures based on the filename. For a recent summary of techniques used to mine parallel corpora see Kishida (2005). Such techniques are still quite limited, though, and have yet to provide a complete solution for CLIR.

In light of the limited availability of parallel corpora, researchers have turned to comparable corpora. Comparable corpora relax the requirement that documents be exact translations: documents need only be close translations or, at the most relaxed level, simply highly related. Typical examples include pairs of news stories from the same time period published in two or more languages. While each document was written by a different group of people under different editors, the documents can be aligned so that story pairs match (Tuomas et al. 2007), making the document pairs highly related in content. Two major approaches to mining comparable corpora have been proposed. The first looks for comparable sections of the documents; a recent paper by Munteanu and Marcu (2006) aligns sub-sequences of arbitrary granularity, typically below the sentence level, and the selected sections are then processed using techniques developed for parallel corpora. The second approach uses statistical methods, with co-occurrence statistics being popular. An example of such a technique is presented by Diab and Finch (2000), whose approach is based on distribution profiles: equivalent terms in comparable documents will have similar distributions. A more recent approach introduces phonetic information to help mine terms (Wai et al. 2007). In all cases the approaches start with corpora that are known to be on approximately the same topic. This requirement limits the available corpora to specific domains where approximate content translations exist, although the availability of news sources means the range of vocabulary is seen as rather extensive.
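A minimal sketch of the distribution-profile idea follows: each term is represented by a vector of occurrence counts across aligned comparable document pairs, and terms whose profiles are most similar across the two languages are proposed as equivalents. This illustrates the general principle rather than the model of Diab and Finch (2000), and the toy corpora are invented.

```python
import math


def profile(term, documents):
    """Occurrence-count vector for a term across an aligned sequence of
    documents (one slot per comparable document pair)."""
    return [doc.lower().split().count(term) for doc in documents]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_equivalent(source_term, source_docs, target_terms, target_docs):
    """Propose the target-language term whose distribution across the
    aligned comparable documents best matches the source term's."""
    source_profile = profile(source_term, source_docs)
    scores = {t: cosine(source_profile, profile(t, target_docs)) for t in target_terms}
    return max(scores, key=scores.get)


# Invented, date-aligned toy corpora (English / Swedish news snippets).
english = ["the dog show opened today", "weather is mild", "a dog won the prize"]
swedish = ["hundutstallningen oppnade idag hund", "vadret ar milt", "en hund vann priset"]
print(best_equivalent("dog", english, ["hund", "vadret", "priset"], swedish))  # 'hund'
```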

Internet based techniques

In both static dictionary-based approaches and corpus-based approaches, language coverage problems tend to occur. As mentioned, this OOV problem is one of the core problems facing CLIR research. While comparable corpus mining goes some way towards solving the problem by generating large sets of data from which to compare and mine terms, other innovative techniques based on mining different aspects of the Internet have been proposed. The previously mentioned paper by Gracia et al. (2006) is one of these: by leveraging the large number of ontologies freely available on the Internet, it attempts to achieve larger language coverage. Like approaches using machine-readable dictionaries, however, the technique is limited by the number of specifically created resources available.

A popular approach has been to find webpages where 'courtesy translations' appear - manual translations of terms by the content developer that appear near the original term. Such pages can be found using techniques such as searching for the term to be translated only in pages of the target language (Wen-Hsiang et al. 2002), and statistical analysis can then be performed to extract translations (Zhang and Vines 2004). These techniques have been shown to be 'fairly error prone', and recently a hybrid method using linguistic patterns alongside co-occurrence measures was proposed as a solution (Zhou et al. 2007).
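The extraction step can be illustrated with a short sketch that counts which target-language tokens most often appear within a small window of the source term across retrieved target-language snippets; the snippets, the window size and the toy stopword list are all simplifying assumptions for illustration.

```python
from collections import Counter

# Toy stopword list for the invented Swedish snippets below.
STOPWORDS = {"en", "ar", "eller", "min", "vad", "(", ")", "?"}


def candidate_translations(source_term, snippets, window=3, top_n=3):
    """Count which target-language tokens most often occur within a small
    window of the source term across retrieved snippets; frequent
    neighbours are candidate 'courtesy translations'."""
    counts = Counter()
    source = source_term.lower()
    for snippet in snippets:
        tokens = snippet.lower().split()
        for i, token in enumerate(tokens):
            if token != source:
                continue
            lo, hi = max(0, i - window), i + window + 1
            neighbours = tokens[lo:i] + tokens[i + 1:hi]
            counts.update(t for t in neighbours if t != source and t not in STOPWORDS)
    return counts.most_common(top_n)


# Invented Swedish-language snippets mentioning the English term 'blog'
# alongside its local equivalent.
snippets = [
    "en blogg ( blog ) ar en webbplats",
    "min blogg eller blog uppdateras ofta",
    "vad ar en blog ? en blogg helt enkelt",
]
print(candidate_translations("blog", snippets))  # 'blogg' surfaces as the top candidate
```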

An alternative but similar approach to searching for 'courtesy translations' is statistical mining of pages retrieved for the source term in the target language. In a recent attempt this approach was extended to take into account potentially different semantic meanings in order to identify more relevant web pages to mine (Fang et al. 2006).

While promising results have been reported for most Internet-based approaches, there is still room for improvement, with authors reporting retrieval rates of less than 90% of a monolingual baseline in their respective papers.

IV. Benefits of a co-active sense-sensitive multilingual dictionary

The use of a sense-sensitive multilingual dictionary has a number of benefits. Firstly, it provides the means to make a distinction between word senses, enabling users to choose a sense. Since a source query may have multiple senses, each of which may have multiple translations, the number of candidate translations can in some cases grow rapidly, obscuring relevant results if the correct translation is not identified (Hull and Grefenstette 1996), whether automatically, as in Gracia et al. (2006), or by human interaction. Secondly, the co-active nature of the dictionary provides at least a partial solution to the OOV problem, with the ability to address terms that cannot be found in static dictionaries. As such, even incomplete co-active multilingual dictionaries are of benefit within CLIR systems, providing a 'learning' component to complement static general resources. Finally, co-active intelligence provides an increased ability to generate synonyms, and hence translations, by allowing the use of transitive relations across cluster overlap.

Looking at the need for such a resource within state-of-the-art approaches, Etzioni et al. (2007) indicate that sense-sensitive dictionaries would improve their transitive approach to large multilingual dictionary mining. Such a resource would also facilitate the resolution of potential problems of incomplete term coverage in the multi-ontology work of Gracia et al. (2006). Finally, the resource provides an opportunity to improve the work of Gao et al. (2007) by providing a better initial approximate, sense-sensitive translation, in combination with the monolingual disambiguation work of Gracia et al. (2006).

V. Conclusion

This paper reported preliminary observations on the development of a broad, dynamic multilingual translation resource and highlighted the benefits of such a resource in the context of state-of-the-art research. Backed by strong research in both query clustering and the more recent co-active intelligence, the approach has the potential to provide a solution to the out-of-vocabulary problem as well as to address the ambiguity problem. It is of particular importance for less discussed technical, medical or scientific domains, for which large machine-readable dictionaries or corpora do not exist.

VI. References

Ashman, Zhou, Goulding, Brailsford and Truran (2007). "The Global Perpetual Dictionary of Everything" in Proceedings of the 13th Australasian WWW Conference 2007.

Ballesteros, L. and Croft, W. (1998). "Resolving ambiguity for cross-language retrieval" in Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval 1998 p.64-71.

Ballesteros, L. and Sanderson, M. (2003). "Addressing the lack of direct translation resources for cross-language retrieval" in Proceedings of the twelfth international conference on Information and knowledge management 2003 p.147-152.

Beeferman, D. and Berger, A. (2000). "Agglomerative clustering of a search engine query log" in Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining 2000 p.407-416.

Braschler, M., Krause, J., Peters, C. and Schauble, P. (2000). "Cross-Language Information Retrieval (CLIR) Track Overview" in Proceedings of the TREC8 2000 p.25-34.

Chew, P. and Abdelali, A. (2007). "Benefits of the 'Massively Parallel Rosetta Stone': Cross-Language Information Retrieval with over 30 Languages" in Proceedings of the Annual Meeting - Association for Computational Linguistics 2007 v.45 p.872-879.

Christof, M. and Bonnie, J.D. (2005). "Iterative translation disambiguation for cross-language information retrieval" in Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval 2005 p.520-527.

Cui, H., Wen, J.-R., Nie, J.-Y. and Ma, W.-Y. (2003). "Query expansion by mining user logs" in IEEE Transactions on Knowledge and Data Engineering v.15 n.4 p.829-839.

Diab, M. and Finch, S. (2000). "A statistical word-level translation model for comparable corpora" in Proceedings of the Conference on Content-based multimedia information access (RIAO) 2000.

Etzioni, O., Reiter, K., Soderland, S. and Sammer, M. (2007). "Lexical Translation with Application to Image Search on the Web" in Proceedings of the MT Summit XI 2007.

Fang, G., Yu, H. and Nishino, F. (2006). "Chinese-English Term Translation Mining Based on Semantic Prediction" in Proceedings of the COLING/ACL on Main conference poster sessions 2006 p.199-206.

Gao, W., Niu, C., Nie, J.-Y., Zhou, M., Hu, J., Wong, K.-F. and Hon, H.-W. (2007). "Cross-lingual query suggestion using query logs of different languages" in Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval 2007 p.463-470.

Gracia, J., Trillo, R., Espinoza, M. and Mena, E. (2006). "Querying the web: a multiontology disambiguation method" in Proceedings of the 6th international conference on Web engineering 2006 p.241-248.

Hull, D.A. and Grefenstette, G. (1996). "Querying across languages: a dictionary-based approach to multilingual information retrieval" in Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval 1996 p.49-57.

Jianqiang, W. and Douglas, W.O. (2006). "Combining bidirectional translation and synonymy for cross-language information retrieval" in Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval 2006 p.202-209.

Kishida, K. (2005). "Technical issues of cross-language information retrieval: a review" in Information Processing & Management v.41 n.3 p.433-455.

Kishida, K., Chen, K., Lee, S., Kuriyama, K., Kando, N. and Chen, H.H. (2007). "Overview of CLIR Task at the Sixth NTCIR Workshop" in Proceedings of the NTCIR-6 Workshop Meeting 2007 p.1-19.

Koehn, P. (2005). "Europarl: A parallel corpus for statistical machine translation" in Proceedings of the MT Summit X: The tenth machine translation summit 2005 p.79-86.

McCarley, J.S. (1999). "Should we translate the documents or the queries in cross-language information retrieval?" in Proceedings of the 37th annual meeting of the Association for Computational Linguistics 1999 p.208-214.

Munteanu, D.S. and Marcu, D. (2006). "Extracting parallel sub-sentential fragments from non-parallel corpora" in Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the ACL 2006 p.81-88.

Nie, J.-Y. and Cai, J. (2001). "Filtering noisy parallel corpora of web pages" in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics 2001 v.1 p.453-458.

Truran, M. (2005). The Theory and Practice of Co-active Search. PhD Thesis. University of Nottingham.

Tuomas, T., Jorma, L., Kalervo, J., Martti, J. and Heikki, K. (2007). "Creating and exploiting a comparable corpus in cross-language information retrieval" in ACM Transactions on Information Systems v.25 n.1 p.4.

Wai, L., Shing-Kit, C. and Ruizhang, H. (2007). "Named entity translation matching and learning: With application for mining unseen translations" in ACM Transactions on Information Systems v.25 n.1 p.2.

Wen, J.R., Nie, J.Y. and Zhang, H.J. (2002). "Query clustering using user logs" in ACM Transactions on Information Systems v.20 n.1 p.59-81.

Wen-Hsiang, L., Lee-Feng, C. and Hsi-Jian, L. (2002). "Translation of web queries using anchor text mining" in ACM Transactions on Asian Language Information Processing (TALIP) v.1 n.2 p.159-172.

Xue, G.-R., Zeng, H.-J., Chen, Z., Yu, Y., Ma, W.-Y., Xi, W. and Fan, W. (2004). "Optimizing web search using web click-through data" in Proceedings of the thirteenth ACM international conference on Information and knowledge management 2004 p.118-126.

Zhang, Y. and Vines, P. (2004). "Using the web for automated translation extraction in cross-language information retrieval" in Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval 2004 p.162-169.

Zhou, D., Truran, M., Brailsford, T. and Ashman, H. (2007). "NTCIR-6 Experiments using Pattern Matched Translation Extraction" in Proceedings of the NTCIR Workshop 6 2007 p.145-151.

Hypertext References

HREF1
http://www.cis.unisa.edu.au
HREF2
http://www.unisa.edu.au
HREF3
http://www.tees.ac.uk/schools/SCM/
HREF4
http://www.tees.ac.uk/
HREF5
http://www.sfi.ie/content/content.asp?section_id=674&language_id=1

Footnotes

1. The 'Next Generation Localization' project [HREF5]

2. 'Generalised' simply refers to the use of a dictionary like resource, to look up terms, be this a machine-readable dictionary, statistics based on corpora or other WWW methods. More information on these topics is presented later in this section.

Copyright

Gavin Smith, Mark Truran, Helen Ashman © 2008. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.