Carmine Sellitto, Lecturer, School of Information Systems, PO Box 14428, Victoria University, Melbourne, Victoria, 8001. Carmine.sellitto@vu.edu.au
The advent of the Web has seen an increasing incidence of writers citing online documents in their publications, a practice that some have argued is invalid because of the ephemeral and transient nature of Web pages. This paper reports on a study that examined the citation of Web-located bibliographical references in academic articles. After evaluating over two thousand references, it was found that nearly 50% of all citations used in papers referred to a Web-located resource. Moreover, a relatively high proportion (46%) of these Web references were invalid, in that they could not be found at the author-specified URL. The major reason for invalid or missing Web citations was the commonly encountered HTTP 404 message, followed by server time-out and bad host error messages. The .edu domain was associated with the greatest number of missing citations.
The widespread adoption of the Internet, and in particular the World Wide Web, as a publishing medium has significantly altered the manner in which information is disseminated and shared. Arguably, the Web has become the preferred medium for delivering business and organisational information to an organisation's constituents, allowing information to be effectively collated and presented in a useful form. The powerful publishing capability of the Web is evident in the numerous informational models that have been proposed in the general information systems and information science literature (Davenport 1997; O'Brien 2001; Turban et al. 2002), where the substitute value of the Web in delivering traditionally produced information products is highlighted. Just as more and more information has become easily accessible via the Web, so there has been a corresponding increase in references to Web-located resources by authors of papers, reports and scholarly works (Spinellis 2003). Various authors have questioned the permanency of these cited Web resources and reported on disappearance rates of URLs in scholarly papers, focusing mainly on journal articles (Lawrence et al. 2001; Rumsey 2002; Spinellis 2003). The use and permanence of Web-located references cited in conference proceedings articles appear to have been overlooked. This may be due to a perception that conference publications do not have the same prestige and scholarly value as journal items. However, conference publication is important because it provides avenues for less established or new researchers to publish their work, and fosters a formal and informal scholarly and collaborative environment. This study examines the loss of Web-located references that have been cited in a set of peer-reviewed conference articles.
The ancient Roman and Greek writers considered it a duty to report on the general conversations of the populace without having to verify their sources, originality or even correctness; the reader was the one who decided what to believe (Veyne 1988). In those times writers were collectors of oral history and provided a means of recording and transmitting cultural traditions. Veyne compares the job of these early story-tellers to that of the modern-day journalist. In the Middle Ages, as a result of the powerful influence of the church, writing and authorship took on a copyist perspective. The church disapproved of original thought, and writers were engaged primarily in the reproduction of existing religious scriptures and texts. Writers had a stenographic function in not altering God's view of truth, and any originality in writing was discouraged (Czarniawska-Joerges 1998). Moreover, Boyle (1992) indicates that church writers had a selfless obligation to find and reproduce divine manuscripts in an attempt to propagate the faith. By the end of the seventeenth century the practice of referencing began to appear in written works, a practice that was invariably associated with the nascent university and academic professions of the time (Czarniawska-Joerges 1998). Veyne (1988) indicates that part of the interchange amongst the seventeenth-century scholarly community involved soliciting collegial comment on new ideas and theorems, possibly a precursor of the formal modern-day academic peer review applied to publications. New works might have been considered controversial; hence, the onus was on the writer to justify their arguments and inferences in support of these new ideas. Consequently, the bibliographical reference and the practice of citation appear to have evolved as an important feature that gave credence and justification to the author's original thoughts, theorems, models and assertions.
Grafton (1997) chronicles the history of citation and referencing, inferring that it is a subtle way of documenting the progress of knowledge as well as portraying the evolution of modern scholarship. At the same time that individuals commenced exploring and documenting original ideas, the invention and adoption of mass printing and replication made publishing available to many citizens (Hesse 1996). The mass printing phenomenon resulted in a growing number of textual falsifications and plagiarism that impacted significantly on those in academia and the professions, the main originators of new ideas. This copying problem eventually led to the concept of copyright as addressed by Queen Anne's Statute of 1710, which enshrined quasi-legal ownership rights of original ideas for authors and their publishers. Recognition and ownership of copyright invariably allowed a remunerative payment of some form to be gained by the rightful owner. Thus, from a copyright perspective, referring to a previous author's work in one's own manuscript not only recognises the ownership of original ideas of others, but addresses a fundamental copyright condition through a symbolic payment via the use of citation (Czarniawska-Joerges 1998). Today, authors of reputable published works make customary use of citation and referencing to other resources in an attempt to support and buttress their own ideas, a process which also positions their work in context with others (Webster and Watson 2002; Zerby 2002). Indeed, citation practice in the academic literature review signals an awareness of ethical publishing principles and behaviour that is commensurate with recognising the knowledge ownership of other writers. Moreover, the contributory recognition of previous authors allows an author to stand on the shoulders of others in furthering their own work.
The ability to access information via the Web has allowed authors to substitute some of the traditional paper-based resources such as books, journals, reports and notes with an electronic equivalent. Moreover, with the vast quantity of easily accessible documentation available on the Web, many authors cite Uniform Resource Locators (URLs) as part of the attribution process when acknowledging supporting material in their publications (Rumsey 2002; Spinellis 2003). The URL represents a unique Internet locator of a digital information resource and is generally written as an alphanumeric string (Powell 2003). The URL as an addressing mechanism has various sections, with each section providing an important component of Internet addressing. Typically these sections identify the transfer method (such as http, ftp or gopher), the host server and its domain, and the path to the resource on that host.
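The sections of a URL can be illustrated with a short sketch. The URL below is a hypothetical example chosen for illustration (it is not the generic URL from the original paper), and the variable names are assumptions; the sketch simply uses Python's standard `urllib.parse` module to separate the parts.

```python
from urllib.parse import urlsplit

# Hypothetical URL used only for illustration; it is not taken from the study.
example_url = "http://www.example.edu/papers/1995/paper.html"

parts = urlsplit(example_url)

scheme = parts.scheme                         # transfer method, e.g. http, ftp or gopher
host = parts.hostname                         # name of the server holding the resource
top_level_domain = host.rsplit(".", 1)[-1]    # final label of the host name
path = parts.path                             # location of the resource on that server

print(scheme, host, top_level_domain, path)
```

A citation remains usable only while every one of these parts still resolves: the host must exist, and the resource must still sit at the same path.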
Numerous style manuals elaborate on how Web-located references should be cited, with the specific attributes including the URL and the date the resource was accessed. However, there is an assumption that the resource, like traditional printed publications, has a permanency associated with its creation. Web-located resource permanency in the context of this study refers to a Web-located resource being easily located at the point specified by the referenced URL in the article. The issue of permanency has become an important feature of referencing because Web-located information is being increasingly cited in both general and academic publications. When an author cites a resource located on the Web there is a fundamental assumption of resource permanency, that is, that the particular information resource will be found at the cited location. However, given the increase in citations to Web-located resources, there has been concern about the way that URL references are disappearing (Rumsey 2002). The concern with disappearing Web resources is manifested through the growing incidence of broken links, an issue first flagged by Kahle (1997), who suggested that the average lifetime of a URL might be just 44 days. Various authors have subsequently reported the problem of disappearing URLs and the undermining impact on scholarly works (Davis and Cohen 2001; Lawrence et al. 2001; Zhang 2001; Markwell and Brooks 2002; Rumsey 2002; Markwell and Brooks 2003; Spinellis 2003). The ability to find, access and interpret URL-cited references not only underpins scholarly publications, but is also significant in the way many have interpreted the Web as a medium that provides vast amounts of information and overcomes geographical barriers to accessing that information. Consequently, the phenomenon of disappearing URLs tends to challenge these notions.
AusWeb is an Australian-based conference that was first held in 1995 and has provided a major forum for both industry and academics within Australia to discuss the rapidly evolving technologies and usage of the Web. The conference archive (http://ausweb.scu.edu.au/aw04/archive/index.html) served as the source of papers for evaluation.
A total of 123 articles associated with the Education and Training stream were selected from the conference archive as the focus of the study. Bibliographic references were defined as those references that appeared as a list at the end of the article under a Reference, Hypertext References or Bibliography heading. No attempt was made to evaluate individual references for their content value; it was assumed that citations were of equal importance in contributing to an article's theoretical base. The World Wide Web Consortium's (W3C) Link Checker was used to evaluate broken links associated with bibliographic references that cited a Web-located resource. Link Checker is a freely available online service (http://validator.w3.org/checklink) that tests a submitted Web page for broken hypertext links and reports the types of HTTP messages these links encounter.
Each online article in the Education and Training stream from the years 1995 to 2003 was examined to check that all bibliographic reference links were active, that is, marked up as hypertext links. Any URLs not marked up as hypertext links were noted and tested manually. The Link Checker results allowed the identification of non-active Web-located references, the HTTP messages associated with these references, and the top-level domains (for example, edu, gov, net) that these broken links referred to. Articles were manually checked for the total number of citations used.
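The kind of test that a link checker applies to each cited URL can be sketched as follows. This is a minimal illustration under stated assumptions, not the W3C Link Checker's implementation: the function names `classify_status` and `check_reference`, the labels, and the use of a HEAD request are all hypothetical choices, and only Python's standard `urllib` modules are used.

```python
import urllib.error
import urllib.request

# Labels loosely follow the HTTP message types reported in this study;
# the grouping itself is an assumption made for illustration.
STATUS_LABELS = {
    404: "not found",
    504: "timeout",
    502: "bad gateway",
    403: "restricted",
}


def classify_status(status):
    """Map an HTTP status code to 'active', one of the labels above, or 'other'."""
    if 200 <= status < 400:
        return "active"
    return STATUS_LABELS.get(status, "other")


def check_reference(url, timeout=10.0):
    """Request a cited URL and classify the outcome (hypothetical helper)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404.
        return classify_status(error.code)
    except OSError:
        # No HTTP response at all: host not found, refused or timed out.
        return "unreachable"
```

Calling `check_reference` on each URL in a reference list would give a per-reference status that could then be tallied by year. Some servers do not answer HEAD requests properly, so a fuller checker might fall back to GET.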
A total of 123 papers were examined. Collectively, the papers contained 2162 references of which 48.1% (1041) cited a Web-located resource.
The average number of Web references per paper ranged from a low of 3.5 in 1997 to a high of 12.3 in 2001. The average number of Web-located references per paper was 8.5 across all articles assessed. The 1996 papers, from the second year the conference was run, had the highest absolute number of Web-located references (193) of any year. The greatest number of Web-located references cited by a single paper was 41, while numerous papers cited no Web-located references at all. Table 1 summarises annual paper citation results for the 1995 to 2003 period.
| Year | Papers (N) | Total references (N) | Web references (N) | Average citations per paper | Average Web references per paper |
| 1995 | 18 | 236 | 166 | 13.1 | 9.2 |
| 1996 | 21 | 344 | 193 | 16.4 | 9.2 |
| 1997 | 12 | 181 | 63 | 15.9 | 3.5 |
| 1998 | 3 | 55 | 30 | 18.3 | 10.0 |
| 1999 | 15 | 212 | 90 | 14.1 | 6.0 |
| 2000 | 11 | 185 | 97 | 16.8 | 8.8 |
| 2001 | 15 | 359 | 184 | 23.9 | 12.3 |
| 2002 | 16 | 336 | 116 | 21.0 | 7.3 |
| 2003 | 12 | 254 | 102 | 21.2 | 8.5 |
Web-located references were checked to see if they could be located at the specific URL cited in the paper. Table 2 summarises results from Link Checker indicating the status of missing Web-located references found in the papers.
| Year | Total references (N) | Web references (N) | Active Web references (N) | Missing Web references (N) | Missing Web references as a percentage of all references |
| 1995 | 236 | 166 | 55 | 111 | 47% |
| 1996 | 344 | 193 | 49 | 144 | 42% |
| 1997 | 181 | 63 | 27 | 36 | 20% |
| 1998 | 55 | 30 | 19 | 11 | 20% |
| 1999 | 212 | 90 | 35 | 55 | 26% |
| 2000 | 185 | 97 | 66 | 31 | 17% |
| 2001 | 359 | 184 | 132 | 52 | 14% |
| 2002 | 336 | 116 | 87 | 29 | 9% |
| 2003 | 254 | 102 | 93 | 9 | 4% |
| Total | | | | 478 | |
| Missing as a proportion of Web references | | | | 46% | |
Since 1995, there has been a progressive loss of Web-located references as a proportion of all the cited references. Over that period, 46% (478) of Web-located references could not be found at the Web address cited by authors. The 2003 results indicate that as little as 4 months after papers were published, missing Web-located references already amounted to some 4% of all references cited (9 of the 102 Web-located references, or about 9%). The results indicate that, in general, the older the paper, the higher the proportion of missing Web-located references, which is consistent with the findings of the earlier studies cited above that have undertaken similar investigations into missing Web-located resources.
Four main types of HTTP messages associated with missing Web references were found in the study; these are summarised in Table 3.
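The percentages reported in Table 2 follow from simple arithmetic on the yearly counts. The sketch below is illustrative only: the counts are copied from Table 2, while the names `TABLE_2` and `percent` are assumptions introduced for the example.

```python
# Yearly counts copied from Table 2:
# year -> (total references, Web references, missing Web references)
TABLE_2 = {
    1995: (236, 166, 111),
    1996: (344, 193, 144),
    1997: (181, 63, 36),
    1998: (55, 30, 11),
    1999: (212, 90, 55),
    2000: (185, 97, 31),
    2001: (359, 184, 52),
    2002: (336, 116, 29),
    2003: (254, 102, 9),
}


def percent(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


total_refs = sum(total for total, _, _ in TABLE_2.values())
web_refs = sum(web for _, web, _ in TABLE_2.values())
missing_refs = sum(missing for _, _, missing in TABLE_2.values())

# Overall loss, expressed against Web references and against all references.
missing_share_of_web = percent(missing_refs, web_refs)
missing_share_of_all = percent(missing_refs, total_refs)

# Per-year loss: first against all references (as in Table 2), then against
# Web references only.
for year, (total, web, missing) in TABLE_2.items():
    print(year, percent(missing, total), percent(missing, web))
```

On these figures the 478 missing references amount to about 46% of the 1041 Web-located references and about 22% of all 2162 references. Expressing the yearly loss against Web references rather than all references gives higher values, for example about 9% rather than 4% for 2003.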
| Year | Web references (N) | Missing Web references (N) | HTTP 404 | HTTP 504 (Timeout) | HTTP 502 (Bad Gateway) | HTTP 403 (Restricted) |
| 1995 | 166 | 111 | 41 | 27 | 39 | 4 |
| 1996 | 193 | 144 | 110 | 15 | 18 | 1 |
| 1997 | 63 | 36 | 23 | 9 | 4 | 0 |
| 1998 | 30 | 11 | 5 | 2 | 3 | 1 |
| 1999 | 90 | 55 | 43 | 4 | 4 | 4 |
| 2000 | 97 | 31 | 17 | 8 | 4 | 2 |
| 2001 | 184 | 52 | 27 | 16 | 8 | 1 |
| 2002 | 116 | 29 | 21 | 3 | 4 | 1 |
| 2003 | 102 | 9 | 7 | 0 | 2 | 0 |
| Totals | 1041 | 478 | 294 | 84 | 86 | 14 |
| As a proportion of missing Web references | | | 61.5% | 17.6% | 18.0% | 2.9% |
HTTP error codes 502 and 504 indicate server-side problems; these were almost equally split between a network time-out being received (504) and the host server not being found (502). The HTTP 404 error, indicating a missing page, was the most commonly encountered message and represented 61.5% of all HTTP messages associated with missing Web references. This indicates that Web-located resources that were referenced in articles at the time of publishing had disappeared from the location designated by the URL. Moreover, this finding tends to reinforce the transient nature of the Web as a publishing medium in which information placed on a Web page is non-static and has a time dimension associated with it, a feature that allows information to be easily reformed, updated, altered or deleted.
The top-level domain most often associated with missing URLs was the education domain (edu), which accounted for 63.4% of missing Web references. The edu domain is associated with educational organisations, and Australian and international universities were prominently represented in the URLs investigated. Many authors in 1995 and 1996 referred to resource files that used ftp and gopher as a transfer method. Consequently, as servers that supported these transfer methods disappeared or were phased out, a large number of missing references that used ftp and gopher were detected in these years. The number of missing ftp and gopher references in recent years is negligible because HTTP is now the predominant transfer method on the Web. Table 4 summarises the types of domains identified with missing Web-located references.
| Year | Missing Web references (N) | .edu | .com | .org | .gov | .net | ftp, gopher & no domain |
| 1995 | 111 | 71 | 7 | 3 | 1 | 0 | 29 |
| 1996 | 144 | 114 | 9 | 2 | 2 | 1 | 16 |
| 1997 | 36 | 21 | 4 | 1 | 1 | 3 | 6 |
| 1998 | 11 | 1 | 7 | 0 | 0 | 0 | 3 |
| 1999 | 55 | 32 | 7 | 3 | 4 | 0 | 9 |
| 2000 | 31 | 11 | 5 | 4 | 2 | 0 | 9 |
| 2001 | 52 | 30 | 9 | 3 | 2 | 3 | 5 |
| 2002 | 29 | 17 | 3 | 4 | 1 | 1 | 3 |
| 2003 | 9 | 6 | 2 | 1 | 0 | 0 | 0 |
| Totals | 478 | 303 | 53 | 21 | 13 | 8 | 80 |
| As a percentage of all missing Web references | | 63.4% | 11.1% | 4.4% | 2.7% | 1.7% | 16.7% |
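A grouping of missing URLs by domain, along the lines of Table 4, could be produced with a sketch such as the following. The rule used here is an assumption, not the procedure documented in the study: an http URL is assigned to the first of edu, com, org, gov or net found when reading its host name from the right (so that a host ending in .edu.au counts as edu), and everything else falls into the catch-all column. The sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

KNOWN_DOMAINS = ("edu", "com", "org", "gov", "net")
OTHER = "ftp, gopher & no domain"


def domain_category(url):
    """Assign a URL to a Table 4 style column (assumed grouping rule)."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return OTHER
    # Read host labels right to left so that example.edu.au maps to .edu.
    for label in reversed(parts.hostname.split(".")):
        if label in KNOWN_DOMAINS:
            return "." + label
    return OTHER


# Hypothetical missing URLs, for illustration only.
sample_missing_urls = [
    "http://www.example.edu.au/course/notes.html",
    "http://www.example.edu/paper.html",
    "http://www.example.com/report.html",
    "ftp://ftp.example.org/pub/file.txt",
    "gopher://gopher.example.edu/11/docs",
]

tally = Counter(domain_category(url) for url in sample_missing_urls)
print(tally)
```

Grouping by transfer method first keeps ftp and gopher references out of the domain columns, which matches the separate final column used in Table 4.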
This study examined the loss of Web-located references for a set of education and training papers in the AusWeb Conference archive. The study is one of the few (but increasing number of) studies that have examined the permanency of Web-located references cited in academic articles. Moreover, the study specifically focussed on a conference that uses the World Wide Web as a core component of its publication process. Some 1041 Web-located resources that authors referred to in supporting the theoretical foundations of their articles were examined, and 46% of these resources could not be located at the URL that was specified. The major reason for missing Web references was that the page was not found (HTTP 404), a finding that not only reinforces the notion of Web pages being ephemeral and time reliant in nature, but supports the notion that Web pages lack a sense of stability when it comes to scholarly citation.
The inability to locate Web-located resources at the author-cited URL tends to undermine the Web as a medium that has allowed diverse document publication and significantly enhanced information access. The loss of Web-located resources is important considering the widespread practice of using the Web as an effective medium for information delivery and dissemination. Arguably, this effectiveness is time reliant with short-term applicability, as this study shows that Web-published documents do disappear, creating instability in the medium. Hence, in the Web environment, even though a large volume of resources is easily accessible, there appears to be a certain degree of predictability that a cited URL may disappear, be redirected, become restricted in access or be altered from its original form.
From the scholarly literature review perspective, the inability to locate a resource at the specified URL weakens some of the theoretical foundations that the author has used to underpin the article. Furthermore, the way that authors cite others (standing on their shoulders, allowing them to see a little further), although valid at the time of publication, becomes invalid once the citations disappear. The historical evolution of referencing and citation behaviour has been based on the paradigm of document publication and ownership (through copyright) at a point in time; however, traditional citation methods do not appear well suited to the non-static and evolving Web document. Moreover, just as numerous documents and information sources were lost due to the use of unstable publishing media after the widespread adoption of printing technologies (Kahle 1997), disappearing URLs may be a symptom of the Web being a relatively new technology that is yet to mature as an established and stable medium.
Future research should undertake studies that expand on these findings in other articles published in academic proceedings. These proceedings may be from conferences that use the Web as the primary publishing venue, or from those that publish proceedings using the traditional non-electronic press. Moreover, a comprehensive side-by-side comparison of the way authors value Web referencing in journal and conference articles has not yet been undertaken. Such a study could elaborate on comparative same-author citation habits and experiences, providing insight into author citation behaviour and allowing best practices to be defined. A further avenue of research should examine the barriers to proposed solutions that address the loss of Web-located resources; some solutions have been documented but presently lack critical mass in implementation.
An interesting paradox in this study is that all AusWeb conference articles in the publicly accessible archive had persisted, including some articles from 1995. Hence, any author who cited one of these articles in the AusWeb archive would have allowed readers to successfully access the document 100% of the time when examining it as a bibliographic source. The persistence of the AusWeb archive may be closely linked to a well-planned and scalable directory structure on the Web server. Moreover, it may also be interesting to examine the history of URLs for all past AusWeb conferences to see whether they have remained stable from their time of first publication on the Web, or whether they are still accessible on the AusWeb site but have been modified over the years. Thus, the question arises as to whether the way AusWeb publishes and manages the conference electronic proceedings is an exemplary model for addressing some of the issues associated with disappearing references to Web-located resources.
Boyle J. (1992). A Theory of Law and Information: Copyright, Spleens, Blackmail and Insider Trading. California Law Review, 80 (6): pp. 1416-1538.
Czarniawska-Joerges B. (1998). A Narrative Approach to Organizational Studies, Thousand Oaks, CA: Sage.
Davenport T. H. (1997). Information Ecology: Mastering the Information and Knowledge Environment, New York: Oxford University Press.
Davis P. M. and Cohen S. A. (2001). The Effect of the Web on Undergraduate Citation Behavior 1996-1999. Journal of the American Society for Information Science and Technology, 52 (4): pp. 309-314.
Grafton A. (1997). The Footnote: A Curious History, Cambridge, Massachusetts: Harvard University Press.
Hesse C. (1996). Books in Time. The Future of the Book. Berkeley, LA: University of California.
Kahle B. (1997). Preserving the Internet. Scientific American, 276 (3): pp. 72-74.
Lawrence S., Coetzee F., Glover E., Pennock D. M., Flake G. and Nielsen F. (2001). Persistence of Web References in Scientific Research. IEEE Computer, 34 (2): pp. 26-31.
Markwell J. and Brooks D. W. (2002). Broken Links: The Ephemeral Nature of Educational WWW Hyperlinks. Journal of Science Education and Technology, 11: pp. 105-108.
Markwell J. and Brooks D. W. (2003). Link Rot Limits the Usefulness of Web-based Educational Material in Biochemistry and Molecular Biology. Biochemistry and Molecular Biology Education, 31: pp. 69-72.
O'Brien J. A. (2001). Management Information Systems: Managing Information Technology in the Internetworked Enterprise, 5th edition. Boston: McGraw-Hill.
Powell T. (2003). HTML & XHTML: The Complete Reference, 4th edition. Berkeley: Osborne/McGraw-Hill.
Rumsey M. (2002). Runaway Train: Problems of Permanence, Accessibility, and Sustainability in the use of Web Sources in Law Review Citations. Law Library Journal, 94: pp. 27-39.
Spinellis D. (2003). The Decay and Failures of Web References. Communications of the ACM, 46 (1): pp. 71-77.
Turban E., McLean E., Wetherbe J., Bolloju N. and Davison R. (2002). Information Technology Management: Transforming Business in the Digital Economy, New York: John Wiley.
Veyne P. (1988). Did the Greeks Believe in Their Myths?, Chicago: University of Chicago.
Webster J. and Watson R. T. (2002). Analyzing the Past to Prepare the Future: Writing a Literature Review. MIS Quarterly, 26 (2): pp. xiii-xxiii.
Zerby C. (2002). The Devils Advocate: A History of Footnotes, NY: Touchstone.
Zhang Y. (2001). Scholarly Use of Internet-Based Electronic Resources. Journal of the American Society for Information Science and Technology, 52 (8): pp. 628-654.