Sandrine Balbo, Steve Goschnick, Derek Tong
Department of Information Systems
The University of Melbourne
Parkville VIC 3010 - Australia
Cécile Paris
ICT Centre, CSIRO
Locked bag 17
North-Ryde NSW 1670 - Australia
The WWW is now ubiquitous, and yet its usability is still of major concern. Usability testing methods are able to identify flaws prior to the launch of a site. However, their application typically involves direct observation, requiring availability of participants and evaluators in a synchronised manner. This, in turn, implies tight schedules with little leeway for flexibility. In this paper, we present WAUTER (Web Automatic Usability Testing EnviRonment), a suite of open source tools to assist in web usability evaluation, capturing and comparing intended vs. actual use of a web site. WAUTER harnesses web user visits to do so and is also intended to support remote evaluation.
Usability evaluation. Web log monitoring. Task models. Automatic analysis.
The WWW is now ubiquitous, and yet its usability is still of major concern. Usability testing methods are able to identify flaws prior to the launch of a site. However, their application typically involves direct observation, requiring availability of participants and evaluators in a synchronised manner. This, in turn, implies tight schedules with little leeway for flexibility.
Useful automatic web evaluation tools and techniques remain the holy grail of many web site evaluation practitioners today. The traditional usability evaluation and analysis approach is to employ test subjects in usability laboratories (typically 5 to 10 people), or to observe users in the context of their everyday web usage, and then analyse hours of video. However, this approach is getting harder to justify in the competitive global environment. Given the large number of visitors that most web sites attract, particularly in comparison with the small number of test subjects used in a typical laboratory-based usability test, it is exceptionally useful to harness those site visits, in particular for automatic web site evaluation and analysis.
This is what we propose to do in this project, through an interrelated set of open source software tools that we are currently developing (see Figure 1), which together we call WAUTER – Web Automatic Usability Testing EnviRonment ( http://wauter.weeweb.com.au ). Besides providing tools to assist in web usability evaluation and harnessing web visits to do so, WAUTER also supports the remote usability evaluation of web sites, which is clearly desirable to avoid the time-and-space related constraints imposed by traditional usability techniques as mentioned above.
All tools in WAUTER function as standalone applications and are currently operational. Those completed before the current WAUTER project and strategy began are now being integrated into a unified set of tools.
The remainder of this paper is structured as follows. First, we present a brief discussion of related work, followed by the motivation for this research. We then describe the basic concepts of WAUTER before presenting each tool in the suite in detail.


Figure 1. WAUTER overview
We conclude with a summary of the current state of the WAUTER suite of open source tools, and outline our plan for delivering those near-term features still outstanding. We also speculate on the possible uses of the complete WAUTER suite and probable additions to future versions.
Usability.gov defines usability as follows: “Usability is the measure of the quality of a user's experience when interacting with a product or system”. In other words, usability covers questions such as: can users find what they are looking for efficiently? Does the web site support users in achieving their goals effectively and to their satisfaction?
Usability evaluation is an integral step in the development of a successful web site. While automated evaluation can only be applied after the site has been completed, it nevertheless has many practical uses and advantages, expanding on those identified in [7] and [18], including:
In addition, due to the nature of the web and the use of machine-readable HTML in the evaluation process, automation tools enable the collection and analysis of a large amount of data. These data are useful for usability purposes but may also serve other purposes.
Because of these many advantages, a number of computational tools aimed at automating usability evaluation have been developed, so many in fact that several authors have categorised the genre. The remainder of this section does not cover the tools and methods themselves but briefly presents the main results from these categorisations and surveys.
Ivory characterises automated evaluation tools as fitting within one of the five categories listed in Table 1 below [8, chapter 10]:
Table 1. Ivory’s categories of evaluation tools:
1. Performance measurement tools: tools that assess the performance of a server, such as through traffic analysis.
2. Log file analysis tools: tools that analyse log files to identify problems in the interface.
3. Guideline review tools: tools that assess the deviation of a web site from its original design.
4. Textual analysis tools: tools that analyse problems relating to confusing headings or links in a web page.
5. Information-seeking simulation tools: tools that mimic the browsing process of users.
Performance measurement tools become useful as the site in question attracts significant web traffic.
Log file capture generates large files that require statistical analysis tools to interpret. Log file analysis tools embody approaches to streamline that analysis without requiring specialised statistical skills.
Guideline review tools apply a set of design guidelines to a site. The global adoption and open-design nature of the WWW has led to a large number of web site design guidelines, many of which offer conflicting views of what is important for web site design [6]. Hence, tools that review a site from a guideline point of view are specific to a given set of guidelines.
Several findings emerge from existing surveys of usability evaluation methods and tools. First, there are actually few automatic tools for the Web. This is highlighted by [7], where, out of the 132 methods surveyed, only 29 applied to the Web, and of these, only 9 offered some form of automation. In another survey, Winckler et al. looked at 23 usability evaluation methods and 12 automatic tools, selected from 49 case studies of Web evaluation [18].
Second, these surveys emphasise the fact that, when it comes to web evaluation, most of the tools found deal with HTML code verification or searching for broken links. Other tools mainly perform checks based on the W3C Web Content Accessibility Guidelines [17] and Section 508 accessibility requirements [15]. These tools thus provide a severely restricted measure of a site's usability.
Third, regardless of the degree of automation of the tools used, usability findings can vary considerably depending on the evaluators and methods involved: [7] reported that, in some instances, less than 1% overlap in usability findings was observed amongst the evaluation methods used.
We conclude from the findings in these surveys and elsewhere that there is still a lot of scope to develop useful automated evaluation tools for the web, and that, as a result of their limitations, current automated tools are largely underutilised in the usability testing of websites.
So, why should we bother to come up with ‘yet another set of tools’ to support usability evaluation? As we have just outlined, there is still a need to provide simple but effective support to the evaluator. Our specific approach is motivated by the following beliefs and desires:
In the words of Karis, “Even if an automated technique finds only a subset of existing problems, if it is efficient and easy to use, and could be used on areas that would otherwise receive no attention, then it makes sense to use it” [10]. Clearly, the development and use of the HTML code verification tools mentioned above fits this premise, and we agree that this is a worthwhile use of tools, even if they address only a limited range of usability criteria.
In our opinion, while there are automated tools, few of them actually address the primary issue of usability, that is: how well a user manages to perform the tasks for which the web site is designed.
An aspect of usability evaluation is ensuring that the system being tested adheres to ergonomic guidelines (rules), keeping the focus on users' needs. Creating automated tools for web site evaluation requires formalising these rules, at either a syntactic or a lexical level, so that they can be applied to the code that renders the user interface. However, not all ergonomic rules can be automated in a straightforward manner. Generally, ergonomic rules require high-level descriptions and information about user intentions (e.g., tasks), which cannot easily be recovered automatically simply by looking at relationships amongst related documents. This is indeed advocated by [18] as a way to improve the quality of evaluation methods, as evidenced by the following quote: “..we need more information about (the) system and users than is currently available. Currently, such information about (the) user is made implicitly in the design but (can) not be automatically extracted by tools”.
Bringing this additional information to bear on the problem, and then being able to exploit ergonomic guidelines in automated evaluation tools, is the main problem we are addressing with WAUTER. To this end, WAUTER provides a functional set of tools that automate the capture and analysis phases of particular usability evaluation methods through the use of log files and task models, while also providing a supporting environment for a usability engineer to critique the results. The role of such an environment in the context of general web usability testing is outlined below.
Figure 1 shows an overview of the WAUTER suite of tools.
The three basic elements of WAUTER are:
WAUTER rests on a simple but effective idea: comparing the task a user is intended to perform on a web site with the task they actually perform, using a series of heuristic rules that enable this comparison of intended and actual use.
WAUTER's three basic elements match the first two stages that Ivory [8] identifies for any evaluation tool, capture and analysis, and provide tools to automate these phases. Although the WAUTER suite does not provide an automatic critique tool, it allows an engineer to enter time-stamped annotations about possible problems with the web site design.
We now present each of the three basic elements of WAUTER in turn.
Task modelling has long been advocated by the research community as a beneficial technique for usability design and evaluation [14]. As pointed out in Balbo et al. [3], task models can play various roles within the software development life cycle. For example, task models are already being used to automatically generate portions of the user interface code [11] and of the end-user software documentation [4], as well as for knowledge acquisition [2] and for usability evaluation [9].
Importantly, task models can be used to represent the intended task of the user when using a system, which can also be seen as a high-level description of the user's intention. In the WAUTER project, we extend the use of task models one step further to include automated usability evaluation. We thus model the intended task of the user with a task model. In WAUTER, we employ a specific task modelling notation, the Diane+ notation [16]. We present below the basics necessary to understand the examples used in this paper (see figures 2, 3 and 4).
The DIANE+ notation can express:
Within WAUTER, WIMM and ATMA combined allow the practitioner to automatically create an initial task model for the web site under evaluation. This initial model can then be manually edited through a graphical editing tool, TAMOT. The result is a task model that captures the intended use of the web site. This model provides the baseline for analysing the actual use of the web site, as recorded in log files, explained below. In a sense, the automatically induced and then edited task model replaces the formal construction of tasks in a manually devised and executed usability test. The tools themselves are described in the next section.

Figure 2. The top level decomposition of the task of buying a cinema ticket via the www.villagecinemas.com.au web site.
As an example of a task model in the DIANE+ notation, figures 2, 3 and 4 represent the intended use of a web site that enables the user to book cinema tickets online. The figures were generated using ATMA and then TAMOT.

Figure 3. The decomposition of one of the sub-goals in figure 2: Search by cinema

Figure 4. The decomposition of one of the sub-goals in figure 3: finish choosing
The capture of web navigation information is done through proxy-based logging, as opposed to server-side or client-side logging. This overcomes the problems inherent in server-side logging, such as caching affecting the reliability of the data recorded, and the many events that cannot be captured by the server (such as interactions internal to the browser). Since the proxy-based system is installed on the client machine, it also overcomes the practical issue of access to the server. The proxy-based system also has an advantage over a purely client-based system in that it is not browser-dependent. The proxy-based software intercepts the HTML code (regardless of how it was generated on the server) en route to the browser and inserts within it appropriate logic (in standard JavaScript) to capture all the user events of interest.
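As a rough illustration of this capture step (the actual WIMM script is not reproduced here; the event names, record fields and collection endpoint below are assumptions made for the example), the following TypeScript sketch shows the kind of logic an intercepting proxy might inject into a page to timestamp page loads, clicks and form changes:

```typescript
// Minimal sketch, not the actual WIMM script: injected logic that records
// timestamped user events and sends them to a hypothetical logging endpoint.
interface LoggedEvent {
  type: string;        // e.g. "click", "input", "pageload"
  target: string;      // short description of the element involved
  timestamp: number;   // milliseconds since the epoch
  url: string;         // page on which the event occurred
}

function describeTarget(el: EventTarget | null): string {
  const node = el as HTMLElement | null;
  if (!node || !node.tagName) return "unknown";
  return node.tagName.toLowerCase() + (node.id ? "#" + node.id : "");
}

function record(type: string, target: EventTarget | null): void {
  const event: LoggedEvent = {
    type,
    target: describeTarget(target),
    timestamp: Date.now(),
    url: window.location.href,
  };
  // "/wauter-log" is a placeholder endpoint, not part of the published tool.
  navigator.sendBeacon("/wauter-log", JSON.stringify(event));
}

// Capture the user events of interest: page loads, clicks and form input.
window.addEventListener("load", () => record("pageload", null));
document.addEventListener("click", (e) => record("click", e.target), true);
document.addEventListener("change", (e) => record("input", e.target), true);
```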
The WAUTER analysis uses a set of pre-defined heuristics to assess any deviation of the actual use from the intended use. Here we extend the definition of guideline review tools given in Table 1: we consider deviation not only from the original design, but also from the original usage intention.
The basic concept of heuristic rules for comparing intended and actual use was created in earlier work by the first author [1]. WAUTER now builds upon this work and extends the initial idea to the web environment. The heuristic rules have been developed by manually analysing a number of log files, which allowed us to identify patterns of behaviour that reveal potential usability problems. We now briefly describe them.
Rule 1 : Direction shift.
A direction shift is detected from the task model when the user stops progressing along a set path in the tree. For example, a user may start searching by cinema (going along the left-hand path in figure 2), but then interrupt this initial search and start searching by movie instead (right-hand path in figure 2). In the Village Cinemas context, this reflects the difficulty users encounter when trying to find a cinema, as the only means provided is to browse through a long list of cinemas (cf. Table 2). Abandoning that route and trying the other option, looking for a film, is what we call a direction shift.
Table 2. The list of cinemas from the Village Cinema web site
Airport West, Ballarat, Bendigo, Century City Walk, Cinema Europa Jam Factory, Cinema Europa Knox, Cinema Europa Southland, City Centre, Coburg Drive-In, Cranbourne, Crown, Dandenong, Doncaster Twin, Fountain Gate, Frankston, Geelong, Gold Class Century City, Gold Class Crown, Gold Class Geelong, etc.
Rule 1 allows the automatic analyser to detect this sort of behaviour and to notify the usability engineer, via the annotations, of potential problems.
Rule 2 : Immediate cancelling of an action.
Cancelling an action immediately after starting it may denote a navigation problem within the interface.
Rule 3 : Re-occurrence of actions.
The repetition of an elementary action (i.e., an action that is not decomposed further in the task model) may denote a lack of feedback. For example, clicking a button twice in a form may indicate a slow response from the system that misled the user into thinking that their action had not been registered by the web site.
Rule 4 : Irrelevant actions.
Irrelevant actions have no meaning within the web site under test. For example, trying to click on text that is neither a hyperlink nor a button.
Rule 5 : Timing.
The ATMA tool enables the usability engineer to associate a timing with each task or action. If, during the actual interaction by a user, the time taken is significantly greater than the initially allocated timing, an annotation is created to highlight the discrepancy, in both the log file and the task model. This is especially useful when the user is filling in forms or buying over the web, to monitor the discrepancy between the expected time to achieve a goal and the actual time users take to achieve it.
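As a minimal sketch of how such a rule might be applied mechanically (the actual WEMA implementation and file formats are not shown here; the event and annotation shapes below are assumptions for illustration), the following TypeScript fragment flags re-occurrences of an elementary action within a short time window, in the spirit of Rule 3:

```typescript
// Illustrative sketch only: detecting Rule 3 (re-occurrence of an elementary
// action) over a sequence of logged events. The record shapes are assumed,
// not the real WIMM/WEMA formats.
interface LogEvent {
  action: string;      // identifier of the elementary action, e.g. "submit-order"
  timestamp: number;   // milliseconds since the epoch
}

interface Annotation {
  rule: string;
  message: string;
  timestamp: number;
}

// Flag an action repeated within a short window as a possible feedback problem.
function detectReoccurrence(events: LogEvent[], windowMs = 3000): Annotation[] {
  const annotations: Annotation[] = [];
  for (let i = 1; i < events.length; i++) {
    const prev = events[i - 1];
    const curr = events[i];
    if (curr.action === prev.action && curr.timestamp - prev.timestamp <= windowMs) {
      annotations.push({
        rule: "Rule 3: Re-occurrence of actions",
        message: `"${curr.action}" repeated after ${curr.timestamp - prev.timestamp} ms; possible lack of feedback`,
        timestamp: curr.timestamp,
      });
    }
  }
  return annotations;
}
```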
WAUTER is a suite of five interrelated software tools to allow for the capture of both the intended and actual uses and their comparison. The software includes a proxy-based logging system (WIMM), a task-based log file analysis tool (WEMA), support for task model generation and editing (through ATMA and TAMOT), and, finally, a tool combining capture and visualisation (WEPN). We now present each tool in turn.
WIMM is the proxy-based event-capturing tool for web-based navigation. It is able to record events such as page changes and form input (radio button selection, text field entry, and so on), and it stores the event details in an XML-based log file. This log file is later used as input to analysis tools such as WEPN, WEMA or ATMA. WIMM also provides filtering options so that the XML file can be customised to capture specific event types.
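To give a feel for how such an XML log might be filtered downstream (the element and attribute names below are assumptions, not the actual WIMM schema), here is a small TypeScript sketch:

```typescript
// Sketch of filtering a WIMM-style XML log by event type, using assumed
// element ("event") and attribute ("type") names rather than the real schema.
function filterEvents(xmlLog: string, keepType: string): Element[] {
  const doc = new DOMParser().parseFromString(xmlLog, "application/xml");
  return Array.from(doc.getElementsByTagName("event")).filter(
    (e) => e.getAttribute("type") === keepType
  );
}

// Example: keep only the form-input events from a small log fragment.
const sampleLog = `<log>
  <event type="pagechange" url="/cinemas" time="1200"/>
  <event type="input" field="cinema" value="Geelong" time="5300"/>
</log>`;
const inputEvents = filterEvents(sampleLog, "input");
```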
ATMA takes a set of WIMM log files that represent a particular task (e.g., ‘Purchasing a movie ticket') conducted within the web site and automatically generates a corresponding intended task model. To achieve this, the input files are created by WIMM as it monitors a usability engineer going through a controlled, predefined navigation of the website. The usability engineer attempts to navigate all the different paths that can be taken to complete the task. We assume the usability engineer knows the site well.
The ATMA-generated task model is created in the Diane+ format and stored in an XML file compatible with the TAMOT system. TAMOT is a task model creator and editor [12]. Utilising the Diane+ notation, TAMOT allows a human-computer interaction practitioner to generate a task model, or edit an existing one, through a simple drag-and-drop interface. Figure 5 displays the user interface of the TAMOT tool. The practitioner can then create a report of the task model as a set of HTML files, or save it as an XML file compatible with the WEMA system. This allows TAMOT to be used as an editor, to correct or improve any deficiencies in the output produced by the ATMA automatic generation process.
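The induction step ATMA performs can be pictured roughly as follows (this is the general idea, not ATMA's actual algorithm; all names and structures below are assumptions for illustration): several recorded navigation paths for the same task are merged into a tree, so that different routes from a common parent become alternative sub-tasks.

```typescript
// Rough sketch of merging recorded navigation paths into a task tree.
interface TaskNode {
  name: string;
  children: Map<string, TaskNode>;
}

function mergePaths(paths: string[][], rootName = "task"): TaskNode {
  const root: TaskNode = { name: rootName, children: new Map() };
  for (const path of paths) {
    let node = root;
    for (const step of path) {
      if (!node.children.has(step)) {
        node.children.set(step, { name: step, children: new Map() });
      }
      node = node.children.get(step)!;
    }
  }
  return root;
}

// Two recorded walks through the cinema-booking task, one searching by
// cinema and one by movie, merged under a single root task.
const model = mergePaths([
  ["home", "search-by-cinema", "choose-session", "buy-ticket"],
  ["home", "search-by-movie", "choose-session", "buy-ticket"],
]);
```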

Figure 5. The TAMOT user interface displaying the two tasks presented in figures 2 and 3.
WEMA is based on the EMA system discussed in [1], refined to focus specifically on tasks within a web context. WEMA takes two inputs: a file containing a task model representing the task under evaluation, and a log file representing an end-user's attempt to complete the task. WEMA annotates both files in accordance with the set of predefined heuristic patterns (discussed above) that identify potential problems in the web site design. The task model input is in the form of a TAMOT XML file, either created manually in TAMOT itself or generated automatically through the ATMA system. The second input is a WIMM log file recording the end-user's navigation while completing the task (not to be confused with the WIMM file generated by the usability engineer while navigating the site to produce the “ideal” task model).
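As an illustration of the comparison WEMA performs between its two inputs (again a sketch under assumed data structures, not WEMA's actual implementation), the fragment below detects a direction shift in the sense of Rule 1: consecutive logged actions that fall under different top-level branches of the intended task model.

```typescript
// Sketch: flag a "direction shift" when the user leaves one alternative
// branch of the intended task model and starts a sibling branch instead.
interface IntendedTask {
  name: string;
  subtasks: IntendedTask[];   // alternative or sequential decompositions
}

// Return the name of the top-level branch under which an action appears.
function branchOf(model: IntendedTask, action: string): string | undefined {
  for (const branch of model.subtasks) {
    const stack = [branch];
    while (stack.length > 0) {
      const node = stack.pop()!;
      if (node.name === action) return branch.name;
      stack.push(...node.subtasks);
    }
  }
  return undefined;
}

// Emit a note whenever consecutive actions belong to different branches.
function detectDirectionShift(model: IntendedTask, actions: string[]): string[] {
  const notes: string[] = [];
  let currentBranch: string | undefined;
  for (const action of actions) {
    const branch = branchOf(model, action);
    if (branch && currentBranch && branch !== currentBranch) {
      notes.push(`Direction shift: moved from "${currentBranch}" to "${branch}" at "${action}"`);
    }
    if (branch) currentBranch = branch;
  }
  return notes;
}
```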
The main functionality of WEPN is to help a usability engineer visually analyse the user interacting with a web site. It existed as a standalone system prior to the WAUTER integration [5]. It takes a WIMM log file, together with up to four digital video images of the task being performed by the end-user (i.e., four different camera angles, including the screen image), along with other files that contain information pertaining to the actual structure of the web site (such as the names of the HTML files which together make up the site). The WEPN environment (see figure 6) includes an interactive site map (synchronised with the video images), which is a visual representation of the web site structure portrayed as a network of nodes and links. The usability engineer can click on nodes in the site map to observe the video images from when the end-user visited that particular page of the web site. WEPN also allows the usability engineer to add annotations at any time; these are also synchronised with both the site map and the video images. Finally, WEPN exports a list of these annotations in HTML. If the original WIMM file has previously been annotated through the WEMA system, those annotations are automatically added to the existing annotations within WEPN.

Figure 6. The WEPN user interface.
WAUTER involves two main categories of actors: usability engineers and end-users.
There are two types of usability engineers that may use WAUTER. Each requires general knowledge of usability testing.
Some of the tools in the WAUTER environment existed before the project and have now been integrated to form part of the environment: TAMOT [12] and WEPN [5]. The remaining tools we have described, WIMM, ATMA and WEMA, were built in 2004 and are all at the stage of robust prototypes, ready to be released as open source. We believe that the only way we can now grow WAUTER and push its limits is to open it to the World-Wide Web and Human-Computer Interaction communities and get it adopted and used more widely.
We conclude this paper with a comparison of the WAUTER suite to two commercially available tools that also embrace the capture-and-analysis approach WAUTER follows, namely MORAE and Noldus Observer.
The MORAE tool is similar to a combination of the WIMM and WEPN systems within WAUTER. However, MORAE is not tailored to a web environment, and it is unclear whether browser-specific events can be appropriately identified and captured within it. MORAE also does not visualise the web site in the way the WEPN site map does, and thus the analysis of end-user interaction is limited to a video capture of either the end-user or the desktop screen, without an overarching visualisation of where the problem lies in relation to the web site structure. WEPN also allows for four independent video captures of the end-user interaction, whereas MORAE allows for fixed inputs of one desktop-screen capture and one video capture. However, MORAE does provide a more extensive annotation system, highlighting some possible future extensions to the WEPN system.
The Noldus Observer tool is a commercial tool, with a significant existing marketplace, for recording and analysing the behaviours of both humans and animals. It requires the observing researcher to manually score or record the behaviours of test subjects, either live during the experiment or from video footage and other recordings. The scoring of behaviour is usually done by encoding keys on a computer keyboard to represent certain test subjects, certain behaviours (e.g., talking), and certain secondary modifiers (e.g., another person with whom the test subject may be interacting). That is, while Noldus Observer automatically combines data about multiple test subjects and behaviours against a common timeline, it requires the actual scoring of events to be entered manually by the observing researcher. WAUTER, in contrast, is designed to capture and identify behaviour automatically, within the sub-field of web site evaluation.
With respect to video footage, Noldus Observer can now record up to two video images simultaneously, while WEPN offers four simultaneous DV video views, allowing for more flexibility in complex user situations. Noldus Observer has a considerable array of graphing and elementary statistical functions built into the product. The researcher sets up a study in Noldus Observer by defining a Configuration consisting of subjects, behaviours and modifiers. It is a general tool for observing behaviour, meaning that a task model could be constructed within its subject-behaviour-modifier framework. It is a relatively expensive product.
We have just started evaluating the tools by using them in a usability evaluation of the CSIRO internal web site; this evaluation is currently underway. In it, we ask CSIRO staff to perform a number of predefined tasks on the intranet, under the observation of two usability evaluation experts, while, at the same time, the tools log their actions. The experts' observations will then serve as a benchmark for evaluating the utility of the WAUTER suite of tools themselves.
We have presented our current suite of tools to assist in web site evaluation. In the future, we are looking at two separate extensions of WAUTER:
ATMA: Automatic Tasks Model Author
TAMOT: TAsk MOdeling Tool
WAUTER: Web Automatic Usability Testing EnviRonment
WEMA: Web EMA, where EMA is the French abbreviation for Automatic Mechanism for usability Evaluation
WEPN: Web Evaluation Path Navigator
WIMM: Web Interface Monitoring and Management
This work wouldn't exist without the effort of many students from around the world and various CSIRO staff involved in the Isolde project [13]. The WAUTER research project is being funded by a University of Melbourne - CSIRO collaborative research program under grant number 13643.