Laura Thomson [HREF1], Lecturer, School of Computer Science and Information Technology, RMIT University, GPO Box 2476V [HREF2], Melbourne, Victoria, 3001. laura@cs.rmit.edu.au
Previous studies of web access logs for user profiling and hence document personalization conducted analyses based principally on web site content, usage, and topology. In this work the analysis was conducted on the basis of what could transparently be learned about the end user. The user's IP address was used to obtain publicly available information about the user, specifically their geographical location, top level domain, and the contents of their organizational homepage. This personal information was then combined with standard content and usage analysis. Standard data mining algorithms were applied to web logs from an academic website using a generalization-based clustering feature set with the various types of user data used as the class variable. A number of useful nuggets of information were found, demonstrating the feasibility of this approach.
Data mining of web server access logs to discover information about a web site's users is a relatively new but growing area of research.
A web server access log contains in sequential order all the page requests that have been made of a particular web server. Each request tells us the IP of the requesting user, the date and time of the request and the page requested. The basic procedure for mining this data is described by Cooley et al (1997) as follows. The log data is "cleaned" to remove noise such as requests for images, style sheets, and applets. From the remaining data we can reconstruct user transactions. A user transaction consists of the set of requests that make up a single visit to the website by a single user, or at least what appears to be a single user based on the IP address and timing of the requests. The set of user transactions can then be analyzed to look for patterns in user behaviour. This information can then be used for web site personalization.
Heer (2002) describes the information that can be extracted from web access logs as Content, Usage, and Topology (CUT). Content refers to examining the content of the web pages viewed by an individual user and treating these as the field of interest of that user. Usage refers to seeing which pages are visited by an individual user. Topology refers to analyzing the link structure of the site. All of this analysis centers on the log data alone, with the user information (the IP) used purely for transaction identification.
In this work logs from an academic web site (RMIT Computer Science) were used. Personal data about the user was added to the web log usage data before data mining algorithms were applied to the data. The personal data used was readily available information obtained from a user's IP address: the geographic location of that IP, the type of organization (based on the top level domain associated with the IP address), and the homepage of the user's organization (based on the hostname retrieved from the user's IP address).
This information was used to classify users and examine the preferences of different types. More specifically, the differences between users from inside and outside our institution, inside and outside Australia, from different countries, and with different top level domains were examined. This work follows the work of Ciesielski and Lalani (2003), where sessions were classified according to their origin inside or outside the organization and inside or outside Australia. Classification by country of origin and top level domain were added.
In addition, a novel technique of attempting to extract site users' homepages was used. A set of heuristics were applied to the homepage text in order to classify the end user's organization. This classification was then added to the data set.
The web access logs used in these experiments are from the RMIT School of Computer Science and Information Technology web site, formerly located at HREF3.
The site is, at the time of writing, being moved over to a new CMS but these results are from the pre-CMS version of the site.
The log used in the experiments was extracted from the logs for the month of February 2003 and contains 3563 unique user sessions.
A log extract is shown in Figure 1.
Figure 1: Sample log entries
This figure shows three entries in the log. The first shows a request from IP 63.60.195.131 on the 1st of February for the homepage ("GET /"). The second shows a request for a staff member's personal page (/~caspar/turbo_mazda_323.htm) from a different IP. The third request is for the CS logo (csitwblogo.gif) from the same IP as our first visitor.
The fields in the log are as follows: the user's IP; the username and password entered by the user in response to any authentication request (in this case all are blank); the date and time; the raw HTTP request received by the web server; the HTTP response code issued by the server in response to the request; the number of bytes of data returned in response to the request; the referring page, if any; and the user agent making the request (typically a browser).
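The field list above can be illustrated with a small parser. This is a minimal sketch, assuming the log follows the common Apache-style "combined" layout; the regular expression, the field names, and the sample line (including its time, byte count, and user agent) are illustrative assumptions and are not taken from the original log.

```python
import re

# Assumed layout: Apache-style "combined" log format. Field names are illustrative.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Split the raw request ("GET / HTTP/1.0") into method and path.
    parts = entry["request"].split()
    entry["method"] = parts[0] if parts else ""
    entry["path"] = parts[1] if len(parts) > 1 else ""
    entry["status"] = int(entry["status"])
    return entry


# Hypothetical entry, modelled on the first request described above.
sample = ('63.60.195.131 - - [01/Feb/2003:00:00:13 +1100] "GET / HTTP/1.0" '
          '200 9321 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"')
entry = parse_line(sample)
```

Blank authentication fields appear as "-" in this layout, which is why the second and third fields are matched as arbitrary non-space tokens.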
The first and third requests are from the same IP (and assumed user), while the second request is from a different IP. This data illustrates some issues that we need to deal with in regards to data cleaning and session extraction.
Certain requests were removed from the data. In HTTP, each image on a page is requested as a separate HTTP request. Typically you will see a pattern in the log of a page being requested, then the user's browser silently requesting all the other files it needs to display that page. All requests for images, class files, style sheets, and .ico files were removed as they do not add any additional information.
Requests that obviously emanate from proxy servers or spider programs were also removed. This was done by applying rules about the hostname and user agent of the requestor. Specifically, requests from hostnames containing the word proxy and from user agents that match known search engine spiders were removed.
Finally, requests that resulted in an HTTP error were removed. These may be useful for some forms of analysis (such as link checking) but not in this scenario. These requests were detected by looking at the HTTP response code. Codes in the 400 and 500 series represent these types of errors.
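The three cleaning rules above can be sketched as a single filter. This is an illustration only: the extension list and the spider markers are assumptions, since the exact lists used in the study are not given, and `keep_request` is a hypothetical helper name.

```python
from urllib.parse import urlsplit

# Illustrative lists; the exact extensions and spider names used are assumptions.
IGNORED_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".ico", ".css", ".class")
SPIDER_MARKERS = ("googlebot", "slurp", "crawler", "spider")


def keep_request(path, status, hostname, user_agent):
    """Return True if a request survives the cleaning rules described above."""
    # Rule 1: drop images, class files, style sheets, and .ico files.
    clean_path = urlsplit(path).path.lower()
    if clean_path.endswith(IGNORED_EXTENSIONS):
        return False
    # Rule 2: drop requests from proxies and known spiders.
    if "proxy" in hostname.lower():
        return False
    agent = user_agent.lower()
    if any(marker in agent for marker in SPIDER_MARKERS):
        return False
    # Rule 3: drop requests that ended in a 400- or 500-series error.
    if status >= 400:
        return False
    return True
```

For example, `keep_request("/csitwblogo.gif", 200, "host.example.com", "Mozilla/4.0")` is rejected by the first rule, while a request for "/" from the same hypothetical host is kept.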
As you can see in Figure 1, requests that form part of the same user session are not necessarily located one after another in the log file. Sessions can be reconstructed by putting a set of requests from the same IP together. In this case a set of requests from the same IP on the same day are considered to be a single user session. There are other more sophisticated approaches to sessionization such as those described by Cooley et al (1999) which may be tried in future.
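The same-IP, same-day rule can be sketched as a simple grouping step. The timestamp format is assumed to be the Apache-style one, and the second IP address and the "/research/" request below are hypothetical values chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime


def build_sessions(requests):
    """Group (ip, timestamp, path) tuples into sessions keyed by (ip, calendar day)."""
    sessions = defaultdict(list)
    for ip, timestamp, path in requests:
        # Assumed timestamp format: 01/Feb/2003:00:00:13 +1100
        day = datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S %z").date()
        sessions[(ip, day)].append(path)
    return dict(sessions)


# Requests from the same IP need not be adjacent in the log.
requests = [
    ("63.60.195.131", "01/Feb/2003:00:00:13 +1100", "/"),
    ("192.0.2.10", "01/Feb/2003:00:00:20 +1100", "/~caspar/turbo_mazda_323.htm"),
    ("63.60.195.131", "01/Feb/2003:00:01:05 +1100", "/research/"),
]
sessions = build_sessions(requests)
```

A limitation of this rule is that two different people sharing one IP on the same day are merged into one session, and one person visiting across midnight is split into two.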
For the purpose of these experiments a feature set based on the directories within the web site structure that end users had visited was used. For each transaction a set of Boolean fields showing which of these directories had been visited was generated. For example, if users had visited any number of pages in the directory /research, they would receive a Y against this feature.
Since the examined website is largely organized into topic subdirectories, this gives us some of the advantages of content based feature extraction without the overhead of actually doing so. This approach, called "Generalization-based clustering", combines the usage and content based analysis and was first suggested by Fu (1999).
A list of the directories used is shown in Table 1.
| Directory |
|---|
| general |
| international |
| courses |
| undergraduate |
| dualaward |
| doubledegree |
| research |
| shortcourse |
| online |
| scholarship |
| academic_program_files |
| plagiarism |
| subjectguides |
| timetables |
| employment |
| rules |
| students |
| helpdesk |
| tsg |
| Results |
| staff |
In addition to these directories two further features were added to the set: an indicator of whether the particular user visited the site homepage, and an indicator of whether the user visited any of the staff members' homepages (each of which is in its own directory). All staff homepages were grouped together to simplify the analysis.
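The feature set described above can be sketched as follows. The directory names are those listed in Table 1; the feature names `homepage` and `staff_homepage`, and the assumption that a staff homepage is any path whose first segment starts with "~" (as in /~caspar/), are illustrative choices rather than details confirmed by the study.

```python
# Directory names as listed in Table 1.
DIRECTORIES = [
    "general", "international", "courses", "undergraduate", "dualaward",
    "doubledegree", "research", "shortcourse", "online", "scholarship",
    "academic_program_files", "plagiarism", "subjectguides", "timetables",
    "employment", "rules", "students", "helpdesk", "tsg", "Results", "staff",
]


def session_features(paths):
    """Build the Y/N feature record for one session from its requested paths."""
    features = {name: "N" for name in DIRECTORIES}
    features["homepage"] = "N"
    features["staff_homepage"] = "N"
    for path in paths:
        segments = [segment for segment in path.split("/") if segment]
        if not segments:
            # A request for "/" is the site homepage.
            features["homepage"] = "Y"
        elif segments[0].startswith("~"):
            # Assumption: staff homepages live under /~username/.
            features["staff_homepage"] = "Y"
        elif segments[0] in features:
            features[segments[0]] = "Y"
    return features


features = session_features(["/", "/research/index.html", "/~caspar/turbo_mazda_323.htm"])
```

Any number of visits to the same directory produces a single Y, so the record captures which areas of the site were visited, not how often.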
Further features based on the user data were added in each of the experiments. These are detailed in the sections on the experiments below.
Experiments were run using the Waikato Environment for Knowledge Analysis or WEKA, available from the University of Waikato in New Zealand (Witten and Frank, 1999). In this work the EM clustering algorithm and the OneR and J48 classifiers were used. The OneR classifier tries to give a single class predictive rule based on the data supplied. The J48 classifier produces a set of rules.
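To make the idea of a single predictive rule concrete, the following is a simplified sketch of the one-rule principle for nominal attributes. It is not WEKA's OneR implementation; the function name, the attribute names, and the five toy records are hypothetical.

```python
from collections import Counter, defaultdict


def one_r(rows, class_key):
    """Pick the single attribute whose value -> majority-class rule is most accurate."""
    best = None
    attributes = [key for key in rows[0] if key != class_key]
    for attr in attributes:
        # Count class labels for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[class_key]] += 1
        # The rule predicts the majority class for each attribute value.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        correct = sum(c.most_common(1)[0][1] for c in counts.values())
        if best is None or correct > best[2]:
            best = (attr, rule, correct)
    attr, rule, correct = best
    return attr, rule, correct / len(rows)


# Hypothetical toy sessions, not data from the study.
rows = [
    {"students": "Y", "research": "N", "in_australia": "Y"},
    {"students": "Y", "research": "Y", "in_australia": "Y"},
    {"students": "N", "research": "Y", "in_australia": "N"},
    {"students": "N", "research": "N", "in_australia": "N"},
    {"students": "N", "research": "Y", "in_australia": "Y"},
]
attribute, rule, accuracy = one_r(rows, "in_australia")
```

On this toy data the chosen attribute is `students`, with a rule that maps Y to Y and N to N and is correct for four of the five records.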
This experiment (and the next) was conducted as a follow up to the experiments in Ciesielski and Lalani (2003). In those experiments the same access logs as used here were analyzed using a number of different feature sets, including the first 3 and last 2 pages visited in a session, and an incidence matrix showing whether the user had visited any of the top 20 most visited pages in the site. In this study a different feature set was used.
Here the generalization-based clustering approach is used in conjunction with a class variable to show whether end user sessions originated inside or outside the university.
User location was determined by looking up each user IP in the DNS server to obtain a hostname, and checking whether that hostname contained the string "rmit". This covers anyone logged in on campus as well as students and staff who have dialled up via the university's modems. Students and staff logging in from an outside provider will not be recognized as being an "inside RMIT" session. This is the same classification approach used in Ciesielski and Lalani (2003).
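The lookup and substring test can be sketched as below. The helper names are hypothetical; `socket.gethostbyaddr` is the standard library reverse lookup, and a failed lookup is treated here as an outside session, which is an assumption.

```python
import socket


def reverse_lookup(ip):
    """Return the hostname for an IP via reverse DNS, or None if none is found."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def is_inside_rmit(hostname):
    """Classify a session as inside RMIT if its hostname contains 'rmit'."""
    return hostname is not None and "rmit" in hostname.lower()
```

For example, `is_inside_rmit("laura.cs.rmit.edu.au")` is True, while a hostname belonging to an outside provider is classified as outside even when the person behind it is a student or staff member.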
Using the EM clusterer, three principal clusters of users were found in this experiment.
The first group, staff homepage visitors, visited a staff member's homepage only; that is, they did not enter via the site's main navigation pages. This group represented 46% of user sessions in the sample. These users were from outside RMIT.
The second group, browsers, was characterized by visiting all the first level pages in the site (that is, all the pages available from the home page), but none of the second level pages. This group represented 28% of users. These users were from inside RMIT. This represents the surfing behaviour of our students and staff seeking department information.
The third cluster, representing 10% of users, visited the site homepage only. These users were from inside RMIT. This cluster is likely to be caused by the fact that the site homepage is the default homepage for machines within the School. These users have opened a browser, seen the homepage, and then moved on to pages outside the scope of our web logs.
The OneR classifier does not produce a useful rule with this data. (Since there are more visitors from outside RMIT than inside RMIT, it predicts that visitors are likely to be from outside RMIT, which is not particularly interesting. The J48 classifier gives the same result.)
This experiment was again done to repeat the work in Ciesielski and Lalani (2003) with the directory based feature set rather than the feature set used in that study. Also, that work uses hostname to determine whether a visitor is from inside Australia or outside Australia (that is, hostnames ending in .au are within Australia). In this work a more sophisticated approach is used.
Each user IP was looked up using the GeoIP library [HREF4], the web industry standard for IP localization. A field was added to the user session record indicating whether the user was inside Australia (Y) or outside Australia (N).
The three largest clusters found mirrored the clusters found in the inside RMIT/outside RMIT experiment.
The largest, consisting of 45% of visitors, visited only staff homepages. These visitors were from outside Australia.
The second largest cluster, representing 25% of users, visited the homepage and first level pages. These users were from inside Australia.
The third largest cluster, representing 12% of users, visited the site homepage only. These users were from inside Australia.
The OneR classifier found that if visitors go to the students directory then they are from Australia, and if they do not they are from outside Australia. This rule was correct in 78% of instances.
The J48 classifier shows this same basic classification scheme in action, with more rules that bring the classification accuracy up to 79%. Since the increase in accuracy is marginal, this classifier is not reproduced here.
The goal of this experiment was to investigate whether patterns of user behaviour were associated with users originating in particular countries. Each user IP was looked up using the GeoIP library [HREF4] and the country name added to the user session record. This was new work.
Using the EM clustering algorithm, we found two principal clusters of users, representing similar groups to those found in the inside/outside RMIT and inside/outside Australia experiments.
Browsers were characterized by visiting all the first level pages in the site (that is all the pages available from the home page), but none of the second level pages. Browsers originated from Australia, the USA, India, and Indonesia. They represented 27% of users.
Staff homepage visitors visited a staff member's homepage only, that is, they did not enter via the site's main navigation pages. These visitors were overwhelmingly from the United States, followed by smaller clusters from Australia, Canada, the UK and Germany. There were five times as many USA visitors in this category as there were visitors from Australia. This cluster represented 46% of users.
We found a third "non-cluster" of visitors from the USA and Australia who did not visit any of the directories in our list. These visitors visited a wide variety of pages throughout the site, to which there was no obvious pattern.
The OneR classifier suggests with 54% accuracy that users who visit the students directory are from Australia, and those who do not are likely to be from the United States. This largely reflects where most of our visitors are located, and given the low accuracy the rule is not a significant finding.
The J48 classification models produced could correctly classify 56% of records at best. (Adding additional classification rules does not increase accuracy by much over the OneR classifier.)
The goal of this experiment was to look for a correlation between user behaviour and the top level domain (TLD) of the organization from which they originated. Each user IP was looked up to obtain a hostname where possible. This was new work.
A classification variable was added to each user session record representing the domain type, that is, com, edu, net, gov, or other. Countries where .ac is used instead of .edu were converted to edu. Countries where .co is used instead of .com were converted to com. (These countries include the United Kingdom and New Zealand.) The category 'other' includes IPs for which a hostname could not be obtained (the majority) and various other less usual top level domains such as org, biz, and info.
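The mapping from hostname to domain type can be sketched as follows. The function name is hypothetical, and the handling of country-code hostnames (using the second-level label, so that a hostname ending in .edu.au counts as edu) is an assumption about how the conversion was applied.

```python
GENERIC_TYPES = ("com", "edu", "net", "gov")


def domain_type(hostname):
    """Map a hostname to one of com, edu, net, gov, or other."""
    if not hostname:
        # No hostname could be obtained for this IP.
        return "other"
    labels = hostname.lower().rstrip(".").split(".")
    tld = labels[-1]
    if tld in GENERIC_TYPES:
        return tld
    # Assumption: for two-letter country codes, classify by the second-level label.
    if len(tld) == 2 and len(labels) >= 2:
        second = labels[-2]
        if second in GENERIC_TYPES:
            return second
        if second == "ac":
            return "edu"
        if second == "co":
            return "com"
    return "other"
```

Under these rules `domain_type("host.example.ac.uk")` gives edu and `domain_type("host.example.co.nz")` gives com, while less usual domains such as org fall into other.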
Overall, the TLD breakdown was as shown in Table 2.
| TLD | % of visitors |
|---|---|
| gov | <1% |
| edu | 14% |
| other | 26% |
| net | 28% |
| com | 32% |
The EM clusterer produced five main clusters.
The largest, representing 30% of users, visited only individual staff members' homepages. These users came from other (35%), com (29%), and net (28%) TLDs. These numbers reflect the overall distribution of visitors and are therefore not particularly interesting.
The second cluster, representing 28% of users, visited the home page and top level pages. These users came from net (35%), com (24%), and edu (23%) domains. In this case users from net and edu are over-represented and users from com are under-represented. Users from edu and net would typically represent our own students on campus or at home, and potentially prospective students.
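The over- and under-representation noted above can be checked by dividing each TLD's share of the cluster by its share of all visitors in Table 2. A ratio above 1 indicates over-representation. The variable names are illustrative.

```python
# Share of all visitors (Table 2) and share of the second cluster, in percent.
overall_share = {"net": 28, "com": 32, "edu": 14}
cluster_share = {"net": 35, "com": 24, "edu": 23}

# Ratio of cluster share to overall share for each TLD.
representation = {
    tld: cluster_share[tld] / overall_share[tld] for tld in cluster_share
}

for tld, ratio in sorted(representation.items()):
    label = "over-represented" if ratio > 1 else "under-represented"
    print(f"{tld}: {ratio:.2f} ({label})")
```

The ratios are roughly 1.25 for net, 1.64 for edu, and 0.75 for com, which agrees with the description of this cluster.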
The third group, representing 17% of users, followed no obvious surfing pattern in the data and were mostly from other TLDs.
The fourth group, representing 12% of users, visited a mix of staff directory pages, staff homepages, and some visited the timetable pages. These users were from com domains (95%).
The fifth and final group, representing 10% of users, visited only the homepage. These were mostly net TLD users (43%).
The OneR classifier could correctly classify only 36% of records at best, which is close to chance and therefore not statistically significant.
The J48 classifier could correctly classify only 39% of records at best, which is also close to chance and therefore not statistically significant.
This experiment made use of a novel idea: that it is possible to extract information about end users from their organizational homepages. The procedure used for this was as follows.
Where a hostname could be obtained from a user's IP address, an attempt was made to convert the hostname to a URL. For example, if a user logged in from hostname laura.cs.rmit.edu.au, their URL would be determined as www.cs.rmit.edu.au.
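One rule consistent with the example above is to replace the first label of the hostname with "www". This is a sketch of that single rule only; the function name and the treatment of short hostnames are assumptions, and the actual conversion may have involved further cases.

```python
def organisation_url(hostname):
    """Guess an organization's web server name from a user's hostname."""
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) < 3:
        # Assumption: very short hostnames are kept whole and prefixed with www.
        return "www." + ".".join(labels)
    # Replace the machine name (first label) with www.
    return "www." + ".".join(labels[1:])


url = organisation_url("laura.cs.rmit.edu.au")
```

Here `url` is "www.cs.rmit.edu.au". A scheme such as http:// would be added before contacting the server, and the guess can fail where no such web server exists.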
This web server was then contacted and the web page found as the index page was downloaded. This page was then analyzed according to a simple set of heuristic rules. These rules check for indicators that the user is connecting from an ISP, or from a university, research institute, or other educational body. The resulting assessment of the user's organization type is added to each user session record as an orgtype field which is used as a classification variable. The company orgtype represents users from a com or net TLD which does not appear to be an ISP.
| Organization Type | % |
|---|---|
| other | 55.5% |
| company | 25.0% |
| isp | 5.3% |
| other_edu | 3.0% |
| university | 10.8% |
| research | 0.3% |
As you can see in Table 3, more than half of user organization types are still classified as "other". Further work is being done on classification of organization type using homepage indexing, and this should improve results.
The EM clusterer produces three main clusters.
The largest, representing 28% of users, represents those users who visit the homepage and first level pages (the browsers). 55% of these visitors are from orgtype other, 19% from universities, and 17% from companies. In this cluster, universities are over-represented and companies are under-represented. Based on the previous experiments, this cluster is biased by our own students.
The next two clusters are particularly interesting. Both represent the staff homepage visitor group discussed elsewhere. In the second largest cluster (22% of sessions) we have users who visited another page or two (to no particular pattern) as well as visiting homepages. 85% of these users have orgtype company, and 14% have orgtype ISP.
The third group, representing 19% of users, visited a staff homepage and no other pages. 99% of this group had orgtype other. The discrepancy between groups two and three represents a true data mining nugget.
The OneR classifier could correctly classify only 56% of records at best, which is close to chance and therefore not statistically significant.
The J48 classifier could correctly classify only 57% of records at best, which is also close to chance and therefore not statistically significant.
Many researchers have applied standard data mining algorithms to access logs to look for "golden nuggets" of information, for example Chen et al (1996), Cooley et al (1997), Yang et al (2001), Ciesielski and Lalani (2003), and Mobasher (2004). These are used principally for web site document personalization as in Mobasher et al (1999) or prefetching and caching as in Yang et al (2001). Some research focuses on better and more accurate methods for data cleaning and transaction extraction, for example Cooley et al (1999).
Most researchers turn their attention to new and better ways of mining data from the logs. Some focus on better feature extraction as in Ciesielski and Lalani (2003) and others on better algorithms as in Chen et al (1996).
The idea of using user location (inside or outside the organization, inside or outside Australia) as a class variable comes from Ciesielski and Lalani (2003). While personalization systems that require user signup and login and therefore store personal user data are well known (see for example Mobasher (2004)), the idea of trying to discover as much about the user as possible based only on their IP is novel.
Extracting as much information as can be obtained about an end user seems a promising basis for web access log analysis which has not yet been fully explored. Behavioural differences between user groups can be found from information that can be looked up once a user makes their first request from a website, telling us their IP, hostname, location, and organization type. In particular, the homepage heuristic analysis performed in this study provided interesting results. These differences between user types can be applied to adapt web site structure and information to the demonstrated preferences of that type of user.
The author would like to thank Dr. James Thom, Dr. Vic Ciesielski and Anand Lalani for their helpful discussions.
M.-S. Chen, J.-S. Park, and P. S. Yu. Data Mining for Path Traversal Patterns in a Web Environment. Proceedings of the 16th International Conference on Distributed Computing Systems, pages 385-392, May 27-30 1996.
Ciesielski, V., and Lalani, A., Data mining of web access logs from an academic web site. In Proceedings of the Third International Conference on Hybrid Intelligent Systems (HIS'03), 2003.
Cooley, R., Srivastava, J., Mobasher, B., Web Mining: Information and Pattern Discovery on the World Wide Web, Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), November 1997.
Cooley, R., Mobasher, B., and Srivastava, J. Data preparation for mining world wide web browsing patterns. Knowledge and Information Systems, 1(1), February 1999.
Fu, Y., Sandhu, K., and Shih, M., A Generalization-Based Approach to Clustering of Web Usage Sessions, in Proceedings of WEBKDD 1999 (San Diego CA, August 1999), 21-38.
Heer, Jeffrey, and Ed Huai Hsin Chi, Separating the Swarm: categorization methods for user sessions on the web, in Proceedings of the CHI 2002 Conference on Human Factors in Computing Systems (CHI-02), pages 243-250, New York 2002.
B. Mobasher, Web Usage Mining and Personalization, in Practical Handbook of Internet Computing, CRC Press 2004.
B. Mobasher, R. Cooley, and J. Srivastava. Automatic personalization based on Web usage mining. Technical Report TR99010, Department of Computer Science, DePaul University, 1999.
Qiang Yang, Haining Henry Zhang, and Ian Tianyi Li. Mining Web logs for prediction models in WWW caching and prefetching. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'01), San Francisco, August 2001.
Witten, Ian H., and Frank, Eibe, Data mining: Practical machine learning tools and techniques with Java implementations. Morgan Kaufmann 1999.