Hashes DO grow on trees - Document integrity at every level

Helen Ashman, University of Nottingham, U.K.
hla@cs.nott.ac.uk


Abstract

This abstract reports work in progress on a novel application of a document integrity mechanism known as hashing. The key element of this application is the repeated application of hashing at multiple levels of a document. This has substantial benefits over the normal application of hashing for document integrity, because it makes it possible to discover what has been changed in the document and, in some cases, how it has been changed.

Introduction

There are many reasons for wanting to ensure that a document has not been changed from a known version or original: the need for mutual trust between parties digitally signing documents, being sure of the origin of a document, knowing whether a document is the original or a faithful copy of the original, knowing whether links pointing to internal elements of documents have been displaced, and so on.

The work reported in this abstract involves developing a multi-layered approach to document integrity. It requires a document to be somehow represented in a tree structure, either in its native form such as with XML documents, or purely for the purpose of the integrity checking.

Background

This work arose out of the need to test whether an image was the original or a true copy of it, including its digital watermark. The work we first encountered in this area computed hashes over sections of the image, rather than over the whole document, so that the image could be checked on arrival at its destination for a legitimate watermark and for its general integrity.

There are two issues arising out of this:

  1. the granularity of sectioning the image
  2. how to actually section the image

Granularity

The purpose of hashing (or computing the result of a hash function) is to create an almost unique "fingerprint" or message digest of a document.

Hashing involves the computation of a "hard" function over an arbitrary chunk of data. The function is chosen so that it is computationally infeasible to find the original data that generated the resulting hash, and also computationally infeasible to find another chunk of data that generates the same hash.
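As a concrete illustration (the original work does not prescribe a particular hash function), the following Python sketch computes a whole-document fingerprint using SHA-256 from the standard hashlib module; changing even a single character of the document produces a completely different digest.

    import hashlib

    def fingerprint(data: bytes) -> str:
        # An (almost) unique fingerprint, or message digest, of the data.
        return hashlib.sha256(data).hexdigest()

    # A one-character change produces a completely different digest.
    print(fingerprint(b"The quick brown fox jumps over the lazy dog"))
    print(fingerprint(b"The quick brown fox jumps over the lazy cog"))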

However, the smaller the amount of data being hashed, the easier it is to perform a "brute-force" search, computing the hashes of all possible chunks of data of that size to discover which was the original data. This means that the data is subject to illicit change much more easily.

On the other hand, smaller chunks of data being hashed make it possible to discover where the document has been changed, at least in a general area.

So here is a conflict in requirements - the document needs to be sectioned into smaller chunks so that the general area of any changes can be discovered, but the smaller the section, the easier it is to fraudulently change the document, thereby negating the entire hashing procedure!

The simple solution is to hash the document at many levels, providing security against fraud by hashing the entire document, while at the same time, providing a mechanism to "drill down" to discover the exact location of any changes. The algorithm to perform this is described below.

Turning arbitrary documents into tree structures

The next step is to work out how to section the document at many levels. The key requirement is that the document must somehow be cast into a tree structure.

This is necessary in order to enable the "drilling down" to discover where the document has been changed, as the algorithm will show.

In some cases, the document is already in a tree structure, for example it may be an XML document. However in other cases, such as image data, the data needs to be artificially represented in a tree structure.

It is also necessary to decide on the lowest level of document decomposition required, which may be determined by how many possible "atoms" there are at that level and by how important it is to be able to recover the original document from the changed version.

The algorithm

The algorithm is, very simply, to first hash the entire document, then to hash each of the next-lower-level elements in whole, and then to hash each level in turn below that until the lowest level is reached.
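A minimal sketch of this algorithm for a tree-structured document follows; the Node class and the function names are illustrative, not taken from the original work. The root hash covers the whole document, and each element's hash covers that element in whole, i.e. its entire subtree.

    import hashlib
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        # A minimal stand-in for any tree-structured document, e.g. an XML element.
        text: str = ""
        children: list = field(default_factory=list)

    def serialise(node: Node) -> bytes:
        # An element "in whole" is the element together with its entire subtree.
        return node.text.encode() + b"".join(serialise(c) for c in node.children)

    def hash_tree(node: Node) -> dict:
        # Hash the whole element, then repeat for each lower level in turn.
        return {"hash": hashlib.sha256(serialise(node)).hexdigest(),
                "children": [hash_tree(c) for c in node.children]}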

In the example of applying this algorithm to image data, we use a quad-tree representation (Samet, 1984). This way of representing data is used in many areas for two-dimensional data, including representation of geographic data. The same notion can be extended to any other dimension of data where necessary.

Quad-tree decomposition works by dividing an image into quadrants, and iteratively dividing each quadrant into yet more quadrants. Decomposition of a given quadrant stops when there is not enough information in it to warrant further breakdown. So, for example, geographic data is continually broken into quadrants, stopping when there is very little data in a given quadrant, such as a desert or a body of water.

See (Brigger, 1995) for a description of quad-tree representations.

So, to apply the hashing algorithm to image data, we do the following, starting with the whole document:

  1. compute the hash of the entire image
  2. divide the image into four quadrants and compute the hash of each quadrant in whole
  3. repeat the division and hashing for each quadrant in turn, until the lowest level of decomposition is reached
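A minimal Python sketch of this procedure is given below, assuming a square greyscale image whose side is a power of two, stored as a list of rows of pixel values; the function names, parameters and the min_size cut-off are illustrative rather than part of the original work.

    import hashlib

    def region_bytes(image, x, y, w, h) -> bytes:
        # Serialise the pixels of the region with top-left corner (x, y) and size w by h.
        return b"".join(bytes(image[y + dy][x:x + w]) for dy in range(h))

    def quadtree_hashes(image, x, y, w, h, min_size=8) -> dict:
        # Hash the region in whole, then recurse into its four quadrants until
        # the chosen lowest level of decomposition (min_size) is reached.
        node = {"region": (x, y, w, h),
                "hash": hashlib.sha256(region_bytes(image, x, y, w, h)).hexdigest(),
                "children": []}
        if w > min_size and h > min_size:
            hw, hh = w // 2, h // 2
            for qx, qy in ((x, y), (x + hw, y), (x, y + hh), (x + hw, y + hh)):
                node["children"].append(quadtree_hashes(image, qx, qy, hw, hh, min_size))
        return node

For a 256 by 256 image, the top-level call would be quadtree_hashes(image, 0, 0, 256, 256), which returns the whole-image hash at the root and the quadrant hashes at each level below it.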

The hashes can be stored in a number of ways, but in general it is useful to send them separately, or at least to encrypt them (as is done in the PGP secure mailing tool), so that a fraudster is unable to re-compute the hash on a fraudulent copy of a document and pass it off as the original hash.
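One illustrative way to make the stored hashes tamper-evident is sketched below: it attaches a keyed message authentication code (HMAC) over the serialised hash tree, playing the same role as the encryption or separate delivery suggested above. This is an assumption of this sketch, not a mechanism prescribed by the original work.

    import hashlib, hmac, json

    def protect_hashes(hash_tree: dict, key: bytes) -> dict:
        # Without the secret key, a fraudster cannot replace the stored hashes
        # with ones recomputed over an altered document.
        payload = json.dumps(hash_tree, sort_keys=True).encode()
        return {"hashes": hash_tree,
                "mac": hmac.new(key, payload, hashlib.sha256).hexdigest()}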

Tracing changes

When a document arrives at its destination, the recipient can check its integrity very swiftly, merely by computing the hash of the top level, i.e. the hash of the entire document. Only if this computation fails to match the expected whole-document hash are the lower levels of hashing inspected.

To find where the document has been changed, the recipient needs to compute all the hashes at a given level, then "drill down" and compute hashes at lower levels only for those quadrants whose hashes failed to match the expected value. So if a single pixel has been changed, only one quadrant at each level will fail the hash test.
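The drill-down can be sketched as follows, reusing the region_bytes and quadtree_hashes helpers from the earlier image example; only quadrants whose hashes fail to match are descended into.

    import hashlib
    # Assumes region_bytes() from the earlier quad-tree sketch is in scope.

    def find_changed_regions(image, node):
        # Recompute the hash for this region; if it matches, the whole subtree is intact.
        x, y, w, h = node["region"]
        if hashlib.sha256(region_bytes(image, x, y, w, h)).hexdigest() == node["hash"]:
            return []
        # Lowest level reached: report this region as changed.
        if not node["children"]:
            return [node["region"]]
        # Otherwise drill down only into the quadrants that fail the test.
        changed = []
        for child in node["children"]:
            changed.extend(find_changed_regions(image, child))
        return changed

A single changed pixel then produces exactly one failing quadrant at each level, and the search touches only those quadrants.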

Finding out how the document has been changed involves a brute-force search to discover the (or an) original data that would have generated the hash. This is the only possible approach, given that hash algorithms always lose data (by design), so it is impossible to reconstruct data directly from its hash. Because of the nature of hashing, more than one chunk of data may generate the same hash, so the recipient will be presented with a small number of candidates for the original data.

The benefit, however, is that the original data can be recovered using the brute-force approach. This is not feasible with the normal use of hashing, as the size of the data being hashed is deliberately chosen to be resistant to brute-force attacks. So combining hashes over small chunks, for easy recovery of the original data, with hashes over large chunks, for security, gives the best of both worlds.
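A sketch of such a brute-force recovery for a very small leaf follows; the two-byte leaf size is chosen purely so the example runs in a fraction of a second, and all names are illustrative.

    import hashlib
    from itertools import product

    def brute_force_leaf(target_hash: str, length: int) -> list:
        # Enumerate every possible byte string of the given (small) length and
        # return those whose hash matches: the candidates for the original data.
        matches = []
        for candidate in product(range(256), repeat=length):
            data = bytes(candidate)
            if hashlib.sha256(data).hexdigest() == target_hash:
                matches.append(data)
        return matches

    # Example: recover a two-byte leaf (only 65,536 candidates to test).
    leaf_hash = hashlib.sha256(b"\x10\x7f").hexdigest()
    print(brute_force_leaf(leaf_hash, 2))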

Summary

The work reported here is work in progress, and there remains much to do, for example, finding the best way to store hashes for faster access to lower levels when drilling down to find changed data. We also want to try out this approach on tree-structured documents such as XML documents, and on unstructured text documents, such as system files on the network (checking for illicit changes by unauthorised persons).

However, the principle of multi-layered hashing clearly offers the best of both worlds in that it can be used not only to show when documents have been changed, as is usual in secure document systems, but also to show where, and sometimes even how, a document has been changed.

References

(Brigger, 1995) Patrick Brigger, doctoral thesis, section "Quad-tree decomposition", 1995, http://ltssg3.epfl.ch/pub_files/brigger/thesis_html/node21.html

(Samet, 1984) Hanan Samet, "The Quadtree and Related Hierarchical Data Structures", Computing Surveys, Vol. 16, No. 2, June 1984, pp. 187-257.


Copyright

Helen Ashman, (C) 2000. The author assigns to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The author also grants a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.
