Helen Ashman,
University of Nottingham, U.K.
hla@cs.nott.ac.uk
The work reported in this abstract involves developing a multi-layered approach to document integrity. It requires a document to be represented in a tree structure, either in its native form, as with XML documents, or purely for the purposes of integrity checking.
There are two issues arising out of this: the first is the hashing procedure itself, and the second is the representation of the document as a tree. Each is discussed in turn below.
Hashing involves the computation of a "hard" function over an arbitrary chunk of data. The function is chosen so that it is computationally infeasible to find the original data that generated the resulting hash, and additionally so that it is computationally infeasible to find another chunk of data that generates the same hash.
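For illustration, such a hash might be computed as follows. This is a minimal sketch in Python using the standard hashlib library; SHA-256 is chosen here purely as an example of such a "hard" function.

import hashlib

def hash_chunk(data: bytes) -> str:
    # Compute a cryptographic hash of an arbitrary chunk of data.
    # The digest reveals nothing practical about the original data, and
    # finding another chunk with the same digest is computationally infeasible.
    return hashlib.sha256(data).hexdigest()

print(hash_chunk(b"an arbitrary chunk of data"))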
However, the smaller the chunk of data being hashed, the easier it is to perform a "brute-force" search, computing the hashes of all possible chunks of that size to discover which one was the original data. This means the data is much more easily subject to illicit change.
On the other hand, hashing smaller chunks of data makes it possible to discover where the document has been changed, at least to within a general area.
So here is a conflict in requirements: the document needs to be sectioned into smaller chunks so that the general area of any changes can be discovered, but the smaller the section, the easier it is to fraudulently change the document, thereby negating the entire hashing procedure!
The simple solution is to hash the document at many levels, providing security against fraud by hashing the entire document while at the same time providing a mechanism to "drill down" to discover the exact location of any changes. The algorithm to perform this is described below.
Representing the document as a tree structure is necessary in order to enable the "drilling down" that discovers where the document has been changed, as the algorithm will show.
In some cases the document is already in a tree structure, for example if it is an XML document. However, in other cases, such as image data, the data needs to be artificially represented as a tree.
It is also necessary to decide on the lowest level of document decomposition required; this may be determined by how many possible "atoms" there are at that level and by how important it is to be able to recover the original document from the changed version.
In the example of applying this algorithm to image data, we use a quad-tree representation (Samet, 1984). This representation is used in many areas for two-dimensional data, including geographic data, and the same notion can be extended to data of any other dimensionality where necessary.
Quad-tree decomposition works by dividing an image into quadrants, and iteratively dividing each quadrant into yet more quadrants. Decomposition of a given quadrant stops when there is not enough information in it to warrant further breakdown. For example, geographic data is continually broken into quadrants, stopping when a quadrant contains very little data, such as a desert or a body of water.
See (Brigger, 1995) for a description of quad-tree representations.
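As an illustrative sketch only, quad-tree decomposition of a square image whose side is a power of two might be written as follows; the has_enough_detail test is a hypothetical stand-in for whatever stopping condition is appropriate to the data.

def quadtree_decompose(image, x, y, size, min_size):
    # Recursively divide the square region at (x, y) into quadrants,
    # stopping when the region is as small as min_size or carries too
    # little information to warrant further breakdown.
    region = [row[x:x + size] for row in image[y:y + size]]
    if size <= min_size or not has_enough_detail(region):
        return {"x": x, "y": y, "size": size}
    half = size // 2
    return {
        "x": x, "y": y, "size": size,
        "children": [
            quadtree_decompose(image, x,        y,        half, min_size),
            quadtree_decompose(image, x + half, y,        half, min_size),
            quadtree_decompose(image, x,        y + half, half, min_size),
            quadtree_decompose(image, x + half, y + half, half, min_size),
        ],
    }

def has_enough_detail(region):
    # Hypothetical stopping test: a region that is completely uniform
    # (e.g. a featureless desert or body of water in geographic data)
    # is not decomposed any further.
    values = {pixel for row in region for pixel in row}
    return len(values) > 1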
So, to apply the hashing algorithm to image data, we do the following, starting with the whole document: hash the entire image, divide it into quadrants, hash each quadrant, and recurse into each quadrant until the chosen lowest level of decomposition is reached.
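A minimal sketch of this multi-level hashing, assuming a square greyscale image whose side is a power of two and again using SHA-256 purely as an example hash function, might look like this:

import hashlib

def hash_region(image, x, y, size):
    # Hash the raw pixel values of the square region at (x, y).
    data = bytes(pixel for row in image[y:y + size] for pixel in row[x:x + size])
    return hashlib.sha256(data).hexdigest()

def multilevel_hashes(image, x, y, size, min_size):
    # Hash the region, then recurse into its four quadrants until the
    # chosen lowest level of decomposition (min_size) is reached.
    # The result is a tree of hashes: the root covers the whole image,
    # the leaves cover the smallest quadrants.
    node = {"x": x, "y": y, "size": size, "hash": hash_region(image, x, y, size)}
    if size > min_size:
        half = size // 2
        node["children"] = [
            multilevel_hashes(image, cx, cy, half, min_size)
            for cx, cy in [(x, y), (x + half, y), (x, y + half), (x + half, y + half)]
        ]
    return node

# Example: an 8x8 greyscale image, hashed down to 2x2 quadrants.
image = [[(r * 8 + c) % 256 for c in range(8)] for r in range(8)]
hash_tree = multilevel_hashes(image, 0, 0, 8, 2)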
The hashes can be stored in a number of ways, but in general it is useful to send them separately, or at least to encrypt them (as is done in the PGP secure mail tool), so that a fraudster is unable to re-compute the hashes over a fraudulent copy of a document and pass them off as the original hashes.
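As one possible illustration of this protection (a sketch only, not PGP itself), the serialised hash tree could be accompanied by an authentication tag computed with a secret key shared between sender and recipient, so that hashes recomputed over a fraudulent copy cannot be passed off as the originals:

import hashlib
import hmac
import json

def protect_hash_tree(hash_tree, key: bytes) -> dict:
    # Attach an authentication tag to the serialised hash tree; without
    # the key, a fraudster cannot produce a valid tag for altered hashes.
    payload = json.dumps(hash_tree, sort_keys=True).encode()
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"hashes": hash_tree, "tag": tag}

def verify_hash_tree(package: dict, key: bytes) -> bool:
    # Check the tag before trusting the received hashes.
    payload = json.dumps(package["hashes"], sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, package["tag"])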
To find where the document has been changed, the recipient needs to compute all the hashes at a given level, then "drill down" and compute hashes at lower levels only for those quadrants whose hashes failed to match the expected values. So, should a single pixel have been changed, only one quadrant at each level will fail the hash test, namely the quadrant containing the changed pixel.
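Continuing the sketch above (and reusing hash_region, image and hash_tree from it), the recipient's drill-down might look like this:

def locate_changes(image, expected):
    # Compare the recomputed hash of this region with the expected hash;
    # drill down into the quadrants only if the comparison fails, and
    # report the smallest quadrants that still fail at the lowest level.
    x, y, size = expected["x"], expected["y"], expected["size"]
    if hash_region(image, x, y, size) == expected["hash"]:
        return []                       # this region is unchanged
    children = expected.get("children")
    if not children:
        return [(x, y, size)]           # lowest level: report this quadrant
    failed = []
    for child in children:
        failed.extend(locate_changes(image, child))
    return failed

# A single changed pixel causes exactly one quadrant per level to fail,
# so the search narrows straight down to the smallest affected quadrant.
tampered = [row[:] for row in image]
tampered[3][5] = 0
print(locate_changes(tampered, hash_tree))   # e.g. [(4, 2, 2)]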
Finding out how the document has been changed involves a brute-force search to discover the (or an) original data that would have generated the hash. Hash algorithms lose information by design, so it is impossible to reconstruct data directly from its hash; a brute-force search is the only way to recover the original data. Because of the nature of hashing, more than one chunk of data can generate the same hash, so the recipient will be presented with a small number of candidates for the original data.
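To illustrate why this recovery is feasible only at the lowest levels, the following sketch (again reusing hash_region from the earlier sketch) brute-forces a 2x2 greyscale quadrant under the simplifying, purely illustrative assumption that at most one pixel was changed; an unconstrained search grows rapidly with chunk size, which is exactly the property relied upon at the higher levels.

import hashlib
from itertools import product

def recover_quadrant(tampered_quad, expected_hash):
    # Brute-force search for original 2x2 quadrant data that would have
    # generated the expected hash, assuming at most one pixel was changed
    # (4 positions x 256 values = 1024 candidates). Because hashing loses
    # information, more than one candidate may match.
    flat = [pixel for row in tampered_quad for pixel in row]
    candidates = []
    for pos, value in product(range(4), range(256)):
        trial = flat[:]
        trial[pos] = value
        if hashlib.sha256(bytes(trial)).hexdigest() == expected_hash:
            candidates.append([trial[0:2], trial[2:4]])
    return candidates

# Example: recover the quadrant at (4, 2) flagged by locate_changes above.
quad = [row[4:6] for row in tampered[2:4]]
original_hash = hash_region(image, 4, 2, 2)   # in practice this comes from the received hash tree
print(recover_quadrant(quad, original_hash))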
The benefit, however, is that the original data can be recovered using the brute-force approach. This is not feasible with the normal use of hashing, as the data size is deliberately chosen for security, i.e. chosen to be resistant to brute-force attacks. Combining hashes of small chunks, for easy recovery of the original data, with hashes of large chunks, for security, gives the best of both worlds.
The principle of multi-layered hashing can thus be used not only to show when documents have been changed, as is usual in secure document systems, but also to show where and sometimes even how a document has been changed.
(Brigger, 1995) Patrick Brigger, doctoral thesis, section "Quad-tree decomposition", 1995, http://ltssg3.epfl.ch/pub_files/brigger/thesis_html/node21.html
(Samet, 1984) Hanan Samet, "The Quadtree and Related Hierarchical Data Structures", Computing Surveys, Vol. 16, No. 2, June 1984, pp. 187-257.