Dr. Rafael A. Calvo [HREF1], Web Engineering Group, The University of Sydney [HREF2], NSW, 2006. rafa@ee.usyd.edu.au
Prof. Jae-Moon Lee [HREF1], Hansung University [HREF3], Web Engineering Group, The University of Sydney [HREF2], NSW, 2006. jmlee@ee.usyd.edu.au
News articles represent some of the most popular and commonly accessed content on the web. This paper describes how machine learning and automatic document classification techniques can be used to manage large numbers of news articles. We work with more than 800,000 Reuters news stories and classify them using Naive Bayes and k-Nearest Neighbours approaches. The articles are stored in NewsML format, commonly used for content syndication. The methodology developed would enable a web-based routing system to automatically filter and deliver news to users based on an interest profile.
Information overload, in which individuals are faced with an oversupply of content, could become the road rage of the new millennium. For this content to become useful information and empower users, rather than frustrate or confuse them, we need novel ways of delivering only what is needed, at the right time and in the right format. News stories are the most important source of up-to-date information about the world around us, and represent some of the most frequently updated, highest quality content on the web. It is therefore important to develop ways to process news efficiently. In this paper we study machine learning models and apply them to the automatic classification of large numbers of online news articles.
Language technologies have provided excellent tools for information retrieval by exploiting the content being indexed. In the early days of the web, search engines and classification systems used only information from unstructured text to index and find content (e.g. AltaVista); later they used the topology of the net to determine which of the indexed information was more important (e.g. Google). Meanwhile, progress in document classification has been less noticeable, and most web page classification is still done manually by thousands of human indexers, in private catalogues or in large-scale ones such as Yahoo! and the Open Directory Project. Instead of using human labour, automatic document classification tools [3,11,12] use statistical models similar to those in information retrieval. These systems can be used to create the hierarchical taxonomies of content commonly found on the web, on e-learning systems and even on traditional library information systems.
News story articles are written by reporters from all over the world. These reporters often work for news agencies such as the Associated Press and Reuters. These agencies collect the news, edit it and sell bundles of articles to the periodicals accessed by web users (e.g. news.yahoo.com, the Sydney Morning Herald, etc.). It is important for both the agencies and the periodicals to have an organized, well-managed stream of news. News is normally classified according to taxonomies that are relevant to readers (e.g. "politics", "Iraq" or "Oil"). This classification can be very difficult because it requires human expertise to spot relationships between the taxonomy and the documents. Even experts do not agree on what should go where, and inter-indexer consistency in classification varies considerably [9].
Automatic classification techniques use algorithms that learn from these human classifications, so they can only do as well as the human training data provided. In addition, different algorithms can learn different types of patterns in the data. In order to compare the classification performance of different algorithms, researchers use a set of standardised benchmarks, each with a particular dataset and a well-defined task. The most popular classification benchmark during the late nineties was a Reuters collection called Reuters-21578 (based on Reuters-22173), with 21,578 documents that had to be classified into about 100 different categories [4,13]. This benchmark is still used to compare the performance of different algorithms, but the challenges now lie in moving towards larger-scale document classification [8]. In 2002, Reuters released a new research dataset with over 800,000 documents, which we discuss in this paper.
Several Machine Learning (ML) algorithms have been used successfully in the past [4,7,11,13]. They include Neural Networks, Naive Bayes, Support Vector Machines (SVM) and k-Nearest Neighbours (kNN). Each of these methods has its advantages and limitations in classification performance and scalability. The choice of algorithm will depend on the application and the amount of data to be used. In web applications, efficiency is of particular importance, since the large number of users and volume of data can make some algorithms unfeasible.
Section 2 of this paper describes the Reuters RCV1 collection used in this project, focusing on the challenges offered by its size and the richness of its XML structure. Section 3 describes the ML algorithms we have used, namely Naive Bayes and k-Nearest Neighbours. Section 4 describes how we have improved a classification framework [12] in order to make it scalable for web applications. Section 5 describes the performance results for the classifier and the improvements in memory and CPU requirements. Section 6 concludes and summarizes our future work in the area.
The Reuters RCV1 Corpus [9] consists of all English news stories (806,791) published by Reuters between 20/8/1996 and 19/8/1997. The news is stored as files in XML format, using a NewsML Document Type Definition (DTD). NewsML is an open standard being developed by the International Press and Telecommunications Council (IPTC). The news is written by approximately 2,000 reporters and then classified by Reuters specialists in a number of ways. The classified news articles are then syndicated by websites such as news.yahoo.com and news.google.com, or by periodicals like the Sydney Morning Herald that may or may not have a website.
Due to seasonal variations, the number of stories per day is not constant. In addition, weekdays average 2,880 stories per day compared to 480 on weekends. Approximately 3.7GB is required to store the uncompressed XML files.
The NewsML schema contains metadata produced by human indexers about themes and categories. When two humans index a document differently they create inter-indexer variations. These can be measured using a correction rate C = (NC/NE) * 100, where NE is the number of stories indexed by an editor and NC is the number of times the editor has been corrected by a second editor. Untrained editors will normally have a higher C than more experienced ones, but even among experienced editors correction rates of 10% are common. In the RCV1 collection there are correction rates of up to 77%. Since ML algorithms learn from examples (classifications done by humans), correction rates are an important limiting factor on their performance. Performance measures in classification systems are really a measure of how much they correlate with the human classifiers; it is not possible for a student to be better than the teacher, or at least it is not possible for the teacher to know so.
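As a worked example with invented numbers: an editor who indexed NE = 200 stories and was corrected NC = 20 times by a second editor would have a correction rate of C = (20/200) * 100 = 10%.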
RCV1 data is stored in XML documents that provide the metadata normally required by news agencies and periodicals who need to deliver the stories to end users. NewsML defines a rich schema, shown simplified in Figure 1; it has elements for title, headline, text, copyright and several types of classification. For our experiments we have used all three available classifications (topic, country and industry) as a single task.

Figure 1: An example of a NewsML document
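To make the structure concrete, the following is a minimal sketch of extracting the fields discussed above with Python's standard xml.etree parser. The sample document and its element names (title, headline, text, codes) are simplified assumptions based on Figure 1, not the exact RCV1 DTD, and a production system would use a faster SAX-style parser as discussed later in the paper.

```python
import xml.etree.ElementTree as ET

# A simplified stand-in for an RCV1 NewsML item; the real DTD is richer.
SAMPLE = """
<newsitem itemid="2330" date="1996-08-20">
  <title>Oil prices rise on supply fears</title>
  <headline>Oil prices rise on supply fears</headline>
  <text><p>Crude futures climbed on Tuesday...</p></text>
  <metadata>
    <codes class="bip:topics:1.0"><code code="M14"/></codes>
    <codes class="bip:countries:1.0"><code code="USA"/></codes>
  </metadata>
</newsitem>
"""

def parse_item(xml_string):
    root = ET.fromstring(xml_string)
    return {
        "title": root.findtext("title"),
        "headline": root.findtext("headline"),
        # Concatenate all paragraphs of the story body.
        "text": " ".join(p.text or "" for p in root.iter("p")),
        # Collect every classification code (topic, country, industry).
        "codes": [c.get("code") for c in root.iter("code")],
    }

print(parse_item(SAMPLE))
```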
Web applications will increasingly exploit this type of metadata. As discussed by researchers studying the concept of the semantic web [2], the next revolution in the Internet will come when web applications have access to structured collections of information, and to sets of inference rules that can be used to perform automated reasoning. This project aims at producing such applications.
Language technology is an emerging field with researchers from different backgrounds. Mathematicians, linguists and computer scientists are producing theories and software tools that manage content more efficiently. This research includes speech processing, information retrieval and document classification. The models in document classification are similar to those in the other areas, and an extensive literature describes them in detail [4,7,11,14]. For the sake of brevity, in this section we only summarize the basic concepts.
The essence of these classification techniques is that documents and categories can be represented in a high-dimensional vector space. Since classification consists of assigning documents to predefined categories, machine learning techniques have to learn the mapping between the two vector spaces, from documents to categories.
The simplest model to represent a document as a vector is the binary representation. In this model, a vector has the dimension of the dictionary, and the occurrence of a word puts a 1 in the corresponding element, all other elements being 0.
The next level of complexity, and the one we have used, is called Term Frequency (TF): the value of each element is the number of occurrences of the term in the document. A term that appears more often is usually more important. This is not always the case, since words like articles and prepositions often add little information value. These terms are called stop-words and are normally eliminated from the document. Another technique to reduce the number of terms is stemming, where words are reduced to their common stem (e.g. "runner" and "runners" are reduced to "run").
Finally, more elaborate models take into account the idea that terms appearing in many documents tell us little about an individual one, so an Inverse Document Frequency (IDF) weight is used. These models require computing statistics across the whole corpus, making them much more computationally expensive. Since our goal in this project was to study scalable techniques that could be used in web applications, we have used the simpler TF vector representation.
Once the documents have been represented as vectors, they can be further reduced using statistical reduction techniques such as χ² (chi-square) or term frequency. These techniques select the terms that have the highest impact on, or correlation with, the classification. We do not discuss them here, as several other sources describe them in detail [4,7,11,13].
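As a minimal sketch of the representation pipeline just described, the following builds TF vectors with stop-word removal, crude suffix stripping in place of a real stemmer (e.g. Porter's), and a simple document-frequency term selection standing in for χ². The stop-word list and thresholds are illustrative assumptions:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "on"}  # tiny illustrative list

def stem(word):
    # Crude suffix stripping; a real system would use e.g. the Porter stemmer.
    for suffix in ("ers", "er", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tf_vector(text):
    tokens = [stem(w) for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(tokens)  # term -> frequency in this document

def select_terms(vectors, min_df=2):
    # Keep only terms that occur in at least min_df documents.
    df = Counter(t for v in vectors for t in v)
    keep = {t for t, n in df.items() if n >= min_df}
    return [Counter({t: f for t, f in v.items() if t in keep}) for v in vectors]

docs = ["the runner runs in the race", "oil prices rise on supply fears", "runners race to the finish"]
vectors = select_terms([tf_vector(d) for d in docs])
```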
Machine learning algorithms are trained on a subset of the corpus and later tested on a different subset to measure their performance. Several ML algorithms have been used successfully in a number of applications [3,11], including Naive Bayes and kNN, both used here.
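To make one of the two algorithms concrete, here is a minimal multinomial Naive Bayes sketch over TF vectors like those above. This is the generic textbook formulation with Laplace smoothing, not the framework's actual implementation; kNN would instead rank training documents by similarity (e.g. cosine) to the test vector and vote among the k nearest.

```python
import math
from collections import Counter, defaultdict

def train_nb(vectors, labels):
    """vectors: list of Counter (term -> tf); labels: parallel list of classes."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)  # class -> aggregated term frequencies
    vocab = set()
    for v, c in zip(vectors, labels):
        term_counts[c].update(v)
        vocab.update(v)
    return class_counts, term_counts, vocab

def predict_nb(vector, model):
    class_counts, term_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, n in class_counts.items():
        denom = sum(term_counts[c].values()) + len(vocab)  # Laplace smoothing
        # score = log P(c) + sum over terms of tf * log P(term | c)
        score = math.log(n / total)
        for t, tf in vector.items():
            score += tf * math.log((term_counts[c][t] + 1) / denom)
        if score > best_score:
            best, best_score = c, score
    return best
```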
The success of a classification is often stated with two standard efficacy measures: precision and recall. Precision is the probability that if a random document is classified as part of category C, this classification is correct. Recall is the probability that if a random document ought to be classified under C, this classification is actually made. The two quantities are highly and inversely correlated. We can achieve perfect recall by assigning all documents to all categories, but at the cost of the worst possible precision. We can also achieve high precision by almost never assigning a document to a category, but this would make the classifier useless.
The trade-off between recall and precision is controlled by setting the classifiers' parameters. Both values should be provided to describe the performance. Another common performance measure is the F1-measure, which combines the two:

F1 = (2 * P * R) / (P + R), where P is precision and R is recall.
When dealing with multiple classes there are two ways of averaging these measures: macro-average and micro-average. The macro-average weights all classes equally, regardless of how many documents belong to them. The micro-average weights all documents equally, thus favouring performance on common classes. Different classifiers perform differently on common and rare categories. Learning algorithms are trained more often on more populated classes, thus risking local over-fitting.
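A small sketch of the two averaging schemes, given per-class contingency counts (true positives, false positives, false negatives); the toy numbers in the usage line are invented for illustration:

```python
def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro(counts):
    """counts: dict mapping class -> (tp, fp, fn)."""
    # Micro-average: pool the counts, so common classes dominate.
    TP = sum(tp for tp, fp, fn in counts.values())
    FP = sum(fp for tp, fp, fn in counts.values())
    FN = sum(fn for tp, fp, fn in counts.values())
    miP, miR = TP / (TP + FP), TP / (TP + FN)
    # Macro-average: average per-class scores, so rare classes count equally.
    maP = sum(tp / (tp + fp) for tp, fp, fn in counts.values()) / len(counts)
    maR = sum(tp / (tp + fn) for tp, fp, fn in counts.values()) / len(counts)
    return (miP, miR, f1(miP, miR)), (maP, maR, f1(maP, maR))

# A common class dominates the micro but not the macro averages:
print(micro_macro({"common": (90, 10, 10), "rare": (1, 4, 4)}))
```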
Object Oriented Application Frameworks (OOAF) are software engineering artefacts that improve reusability of design and implementation [5,6,10]. The classification framework used in this project [HREF4] [12] was designed to provide a consistent way to build document categorization systems. It allows the developer to focus on the quality of the more critical and complex modules by allowing reuse of common, simple base modules. In this project we study how the framework can be used in a news story classification system, and we extend the framework to be more scalable, as required in web applications, by adding an object buffering module.
One of the most serious limitations when training ML algorithms is the amount of data required to achieve satisfactory performance, and the scalability issues that arise when training on it. Memory requirements are often the first obstacle to managing large corpora, especially for a text categorization system, where the ML algorithms are trained on very large vector spaces and large amounts of data. Most operating systems deal with this problem by providing virtual memory, but these general-purpose techniques are not always sufficient or appropriate for special applications like ours. In those cases, applications often include an adaptive buffer management tool that allows the application to allocate specific amounts of memory.

Figure 2: Run time object diagram
We have extended our text categorization framework [12] so it can handle much larger corpora. Figure 2 shows the object diagram of the important classes at run time. A circle in Figure 2 represents an object and an arrow represents a relationship, meaning a category/document 'HAS-A' feature vector (FV). Each document object corresponds exactly to a document in the training or testing sets, and each category object corresponds exactly to a category. For RCV1 there are over 800K documents, so the framework needs to manage that number of document objects and the same number of feature vectors. The object buffer can be configured to store as many objects as possible in RAM, fetching others from the hard disk as they are required. Thus, there is only a constant number of objects in memory, and the system can be optimised for the available resources, independently of the amount of data.
The buffered object module we implemented is described in Figure 3. The module supports a unique persistent Object Identifier (OID) for every object stored in it. The framework then uses this OID instead of the object or its reference. If the framework needs an object stored in the buffered object module, it makes a request to the buffered object manager using the OID. The buffered object manager first searches for the requested object in the buffer pool. If it finds the object, it returns the object's reference; otherwise it requests the object from the buffered file manager, which tries to minimize disk I/O. The buffered file manager reads the requested object from disk and returns it to the buffered object manager, which first stores the received object in the buffer pool and then returns it to the requester in the framework.

Figure 3: Architecture of the buffered object module
In order to optimise memory usage, the buffered object manager uses a constant amount of memory. Our current implementation follows the Least Recently Used (LRU) strategy [1]. In this strategy, common for paging in operating systems, the least recently used (read or written) object (or page) is selected to be taken out of the buffer (paged out). The same rule is used in other cache systems when selecting which cache entry to flush. In future work we plan to add new buffering strategies, such as Most Recently Used (MRU). We expect that the choice of strategy will depend on the Machine Learning algorithm: for example, Naive Bayes might work better with an LRU strategy and kNN with an MRU strategy.
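A minimal sketch of such a fixed-capacity, OID-keyed LRU buffer using Python's OrderedDict. The pickle files stand in for the framework's buffered file manager, and the class and method names are our own illustrative choices, not the framework's API:

```python
import os
import pickle
from collections import OrderedDict

class BufferedObjectManager:
    """Keeps at most `capacity` objects in RAM, evicting the least recently used."""

    def __init__(self, directory, capacity=10000):
        self.directory = directory
        self.capacity = capacity
        self.pool = OrderedDict()  # OID -> object, ordered by recency of use

    def _path(self, oid):
        return os.path.join(self.directory, f"{oid}.pkl")

    def put(self, oid, obj):
        self.pool[oid] = obj
        self.pool.move_to_end(oid)
        if len(self.pool) > self.capacity:
            victim_oid, victim = self.pool.popitem(last=False)  # LRU eviction
            with open(self._path(victim_oid), "wb") as f:       # page out to disk
                pickle.dump(victim, f)

    def get(self, oid):
        if oid in self.pool:
            self.pool.move_to_end(oid)          # mark as recently used
            return self.pool[oid]
        with open(self._path(oid), "rb") as f:  # miss: read back from disk
            obj = pickle.load(f)
        self.put(oid, obj)
        return obj
```

For strictly sequential sweeps over a corpus larger than the buffer, LRU can perform poorly because each object is evicted shortly before it is needed again on the next pass; this is one reason an MRU policy may suit some algorithms better, as noted above.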
With a scalable framework we are able to show the feasibility of automatically classifying large numbers of news stories. Since news stories in the RCV1 dataset are in NewsML format, the extensions to the framework need to include more efficient XML parsing. We first tried the Perl and C implementations of the SAX package; adapting one of the libraries implemented in C and integrating it into the framework produced the best results.
We tested the extended framework on the Reuters RCV1 data with three series of experiments: one to determine how the Naive Bayes and kNN algorithms performed for different amounts of training data; a second to see whether the NewsML structure could improve performance by giving different weights to the title of the story; and a third to measure how they scaled, in computation time, with different amounts of data.
Although the precision of the classification itself should not be modified by the extensions to the framework (buffering objects), we expected that being able to train the algorithms with larger amounts of data would improve performance. In the first set of experiments we studied performance for 10, 50 and 100 days of training data.
The data was selected from the original RCV1 dataset using a simple strategy: we randomly chose 80% of the dates as the training set and 20% as the test set. With this approach the total number of stories in each set is not an exact percentage, since the number of stories varies from day to day. This sampling strategy has the disadvantage of producing test sets with somewhat different statistics; for example, a particular date (e.g. Bush's election or September 11) could introduce a new type of document that did not appear in the training set. The notation XTr-YTe means that the news stories of X days are used as training documents and the news stories of Y days are used as testing documents. Table 1 and Table 2 describe the data used in each subset; 10Tr, 50Tr and 100Tr mean that we used 10, 50 and 100 days for training.
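A sketch of this date-level sampling (the seed and helper name are illustrative); splitting on whole dates rather than on individual stories is what makes the per-set story counts inexact:

```python
import random

def split_by_date(dates, train_fraction=0.8, seed=42):
    """Randomly assign whole dates (not individual stories) to train/test."""
    dates = list(dates)
    random.Random(seed).shuffle(dates)
    cut = int(len(dates) * train_fraction)
    return dates[:cut], dates[cut:]

train_days, test_days = split_by_date(f"1996-08-{d:02d}" for d in range(20, 32))
```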
The second goal of the experiments was to see if the NewsML structure could be used to improve performance, and to find guidelines on how to do so. We expected that giving more weight to the more important attributes of the NewsML schema (e.g. the title) would be beneficial. The content of the title element is similar to the headline. In the experiments we fixed the weighting factor of the headline and text elements to 1, and tried several weighting factors for the title, T = 0, 1, 3, 7, 17; a weighting factor of T means that each occurrence of a word in the title contributes T times to its frequency count in the document.
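Under this reading, the title weighting amounts to counting each title term T times when building the TF vector. A minimal sketch, reusing the field names from the parsing sketch above:

```python
from collections import Counter

def weighted_tf(item, title_weight=3):
    """Headline and text terms count once; title terms count `title_weight` times."""
    counts = Counter((item["headline"] + " " + item["text"]).lower().split())
    for term, tf in Counter(item["title"].lower().split()).items():
        counts[term] += title_weight * tf  # T = 0 removes the title's contribution
    return counts
```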
Figures 4 through 7 show the performance results for the different subsets. Figures 4 and 6 are the results for the Naive Bayes classifier, and indicate that a micro-precision (miP) of 80-90% is achievable. We must remember two things. First, due to the inter-indexer variations, 100% precision is not possible, for humans or machines. Second, these results only show the correlation with the human classifiers; we assume the humans are always right, which is not always the case, and in fact the machines might sometimes do better.
Figures 5 and 7 show the results for the kNN classifier. The classification performance is better than for the Naive Bayes classifier, particularly in the recall measures. The computational performance of kNN is an obstacle, and we have not performed tests for all the subsets. kNN is not optimal for web applications such as the one we discuss in this paper, since all the processing is performed at test time (there is no training); in a real application, all the computation would be performed when the news arrives and needs to be redirected. kNN showed much better recall (miR = 0.55) than Naive Bayes (miR = 0.19). This means kNN is able to assign more documents to some category, which may be of great importance in some applications. Precision (the proportion of correct classifications) was somewhat degraded by using kNN, but not considerably.
Analysing the performance of the different weighting schemes, we find that the optimum T seems to be around 3. We can also see that increasing the number of training days from 10 to 50 improved the miP but does not seem to improve the miF1 measure considerably.
Data Type    | Training Data        | Testing Data
             | Files     | Bytes    | Files    | Bytes
10Tr-2Te     | 23,298    | 70MB     | 5,872    | 18MB
50Tr-10Te    | 107,543   | 331MB    | 14,936   | 46MB
100Tr-20Te   | 228,089   | 701MB    | 43,008   | 133MB

Table 1: Amount of data for the selected subsets
Data Type    | Categories | Categories Per File | Files Per Category
10Tr-2Te     | 670        | 5.17                | 43.54
50Tr-10Te    | 718        | 5.18                | 170.58
100Tr-20Te   | 730        | 5.22                | 371.37

Table 2: Categories for the selected subsets

Figure 4: Performance of Naive Bayes for 10Tr-2Te subset

Figure 5: Performance of kNN for 10Tr-2Te subset

Figure 6: Performance of Naive Bayes for 50Tr-10Te subset

Figure 7: Performance of kNN for 50Tr-10Te subset
A third set of experiments was performed to measure computational efficiency. We measured three kinds of execution time: preprocessing, training and testing. The x axis in Figure 8 represents the three subsets studied and the y axis represents the execution time measured in seconds. Preprocessing includes reading the training documents, vectorising them and storing them to disk as needed; it is therefore the same for all the algorithms. The training time depends on the machine learning algorithm; for Naive Bayes, as shown, it is linear. Finally, the testing time refers to categorizing the documents, and is also dependent on the given algorithm. The execution times of preprocessing, training and testing all increase linearly, showing good scalability. This means that the object buffering module worked well.

Figure 8: Execution Time of Naive Bayes
We have shown the feasibility of using automatic document classification for large-scale management of news stories. We have focused on scalability issues that are particularly important to web applications.
The classification performance shown in these experiments is extremely promising and would allow real web applications to use this type of functionality. We obtained 0.9 miP, meaning that 9 out of 10 stories were correctly classified. The miR, and therefore the miF1, are somewhat lower because some classes contain few examples and are harder to classify. Naive Bayes classifiers produce lower quality classifications but seem better suited for applications where classification needs to be performed in real time; kNN instead produces better classifications but places too much load on classification time, since there is no training.
Since we wanted to assess the possibility of using automatic classification techniques for real web applications, we chose a sampling strategy that produces lower classification performance, but is more scalable and therefore realistic.
We have extended our classification framework in order to manage large numbers of documents. The object buffer has been successful, increasing the number of documents we were able to manage by an order of magnitude and reducing the computational time to approximately one fifth.
Using faster XML parsing in the pre-processing stage proved important in managing large numbers of NewsML documents, so very efficient parsers are required by content syndicators who manage large quantities of news stories.
This project will continue by further studying the scalability issues of other algorithms and by integrating the classifier into real web applications. This will be done following the integration of the framework with the Postgres database, as described in [13], and using it in web application frameworks as described in [12].
This project was partially funded by the Australian Research Council, through a Linkage International Fellowship, Hansung University and the University of Sydney. The authors also wish to acknowledge Ken Williams for developing the classification framework and supporting its use.