Evidence Integrity

What a Hash Value Proves, and What It Does Not Prove

A practical explanation of data integrity, binary matching, and common overstatements in hashing claims.

Introduction

After years of working with digital evidence, you hear this claim often: "The files have the same hash, so they are the same." It sounds technical and conclusive. Without context, it is neither.

Hash values are one of the most important tools in digital forensics. They are highly useful when applied correctly and easy to overstate when they are not. The goal here is to explain what a hash value proves and what it does not prove.

Start With an Example

Take a look at Figures A and B below.

At first glance, it's pretty obvious that these images are completely different. Different subject matter. Different content. No way to argue that these are the same picture.

However, they both generate the exact same MD5 hash:

253DD04E87492E4FC3471DE5E776BC3D

Figure A: ship image used in MD5 collision example

Figure B: plane image used in MD5 collision example

There was no mistake in calculating these hashes. Each one is the correct MD5 value for its file's underlying data, and the two values match. That should sound paradoxical.

Hash values are often called "digital fingerprints," a label that suggests uniqueness. By that logic, this should be impossible: two different files should never share the same digital fingerprint.

But they can.

This phenomenon is called a hash collision. Because a hash function maps inputs of any size to a fixed-length output, different inputs must sometimes share an output, and with MD5 such pairs can be engineered deliberately. It is also a good reminder of another important characteristic of a hash function:

A hash function does not care how a file appears or what it represents. It processes only the file's bytes.

Put simply, a hash function is unaware of meaning and significance. It does not recognize that one file depicts a ship while the other shows a plane. All it sees is a sequence of bytes and the math applied to them.

This is the whole point here.

What a Hash Value Actually Is

A hash is the output of an algorithm applied to input data. A file goes in, and a fixed-length value comes out, usually written as a string of hexadecimal characters.

Two characteristics make hashes useful:

The same input produces the same output.
Even a small change to the input almost always produces a different output.

That is why hashing is effective for integrity checks. It is also why precision in wording matters.
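The two characteristics above can be seen in a few lines of Python. This is a minimal sketch using the standard hashlib module; the sample inputs and the helper names sha256_hex and differing_positions are made up for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def differing_positions(a: str, b: str) -> int:
    """Count the character positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Hypothetical sample inputs that differ by a single character.
original = b"Evidence item 001"
altered = b"Evidence item 002"

first = sha256_hex(original)
second = sha256_hex(original)   # same input hashed a second time
changed = sha256_hex(altered)   # input with one character changed

print("same input, same output:", first == second)
print("small change, different output:", first != changed)
print("hex positions that differ:", differing_positions(first, changed), "of", len(first))
```

A one-character edit typically changes most positions of the digest, which is why a later mismatch is such a sensitive signal that something in the data changed.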

What a Hash Value Actually Proves

1) Data has not changed

When a hash value generated at acquisition matches one calculated later, that match supports the conclusion that the data has not changed in between. This is central to forensic imaging, evidence handling, and chain of custody validation.
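As a rough illustration of the acquire-then-verify pattern, the sketch below hashes a file in chunks and compares the result with a previously recorded value. The function names hash_file and verify_file, the file name, and the sample bytes are assumptions for this example, not part of any particular forensic tool.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_file(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large images need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, recorded_hash, algorithm="sha256"):
    """Return True if the file's current hash equals the recorded value."""
    return hash_file(path, algorithm) == recorded_hash.strip().lower()


with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical stand-in for an acquired image.
    image = Path(workdir) / "disk_image.bin"
    image.write_bytes(b"\x00" * 4096 + b"sample evidence data")

    acquisition_hash = hash_file(image)                  # recorded at acquisition
    unchanged = verify_file(image, acquisition_hash)     # checked later, untouched

    with open(image, "ab") as handle:                    # simulate a one-byte alteration
        handle.write(b"\x01")
    still_matches = verify_file(image, acquisition_hash)

print("verified before alteration:", unchanged)
print("verified after alteration:", still_matches)
```

The check only says whether the bytes still produce the recorded value. It says nothing about who handled the file or why it changed.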

2) Two files match at the binary level

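Matching hashes under the same algorithm indicate that two files are identical at the binary level, with one caveat: the algorithm must resist collisions. With a weak algorithm such as MD5, as the example above shows, matching hashes do not guarantee matching bytes. With that caveat, hash matching is useful for duplicate detection, copy verification, and preservation checks.

However, "same data" should not be casually translated into "same meaning."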
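One way to make a binary-level claim concrete is to pair the hash comparison with a direct byte-for-byte comparison. The sketch below assumes small files that fit in memory; compare_files, sha256_of, and the file names are hypothetical, while filecmp.cmp with shallow=False is the standard library's content comparison.

```python
import filecmp
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a whole file at once (fine for small files only)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def compare_files(path_a, path_b):
    """Report both the hash match and a direct byte-for-byte comparison."""
    return {
        "sha256_match": sha256_of(path_a) == sha256_of(path_b),
        "bytes_identical": filecmp.cmp(path_a, path_b, shallow=False),
    }


with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "report.pdf"
    renamed_copy = Path(workdir) / "copy_renamed.dat"   # same bytes, different name
    edited = Path(workdir) / "edited.pdf"               # one extra byte

    source.write_bytes(b"hypothetical file contents")
    renamed_copy.write_bytes(source.read_bytes())
    edited.write_bytes(source.read_bytes() + b"!")

    copy_result = compare_files(source, renamed_copy)
    edit_result = compare_files(source, edited)

print("renamed copy:", copy_result)
print("edited file:", edit_result)
```

The renamed copy matches even though its name and extension differ, because neither is part of the hashed data. The comparison covers the bytes and nothing else.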

What a Hash Value Does Not Prove

It does not prove what kind of file it is (image, video, document, or otherwise).
It does not prove what the file means in context.
It does not prove who created, viewed, or modified the file.
It does not prove when an event occurred.
It does not prove the reliability of a process that is undocumented or uncontrolled.

The MD5 example above adds one more limit: a hash match does not always prove uniqueness. Two different images can be engineered to produce the same MD5 value. For that reason, MD5 is widely considered unsuitable for security-sensitive claims of uniqueness.
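A common safeguard is to compute a stronger algorithm alongside MD5 and to treat an MD5-only match with caution. The following is a minimal sketch under that assumption; the WEAK_ALGORITHMS set, the function names, and the wording of the interpretations are illustrative choices, not a standard.

```python
import hashlib

# Algorithms with publicly demonstrated collisions (assumed list for this sketch).
WEAK_ALGORITHMS = {"md5", "sha1"}


def digests(data, algorithms=("md5", "sha256")):
    """Compute one hex digest per algorithm for the same bytes."""
    # Note: md5 may be unavailable on some restricted (for example FIPS) Python builds.
    return {name: hashlib.new(name, data).hexdigest() for name in algorithms}


def match_report(data_a, data_b, algorithms=("md5", "sha256")):
    """Return, per algorithm, whether the two inputs produce the same digest."""
    a, b = digests(data_a, algorithms), digests(data_b, algorithms)
    return {name: a[name] == b[name] for name in algorithms}


def interpret(report):
    """Turn a per-algorithm match report into a cautious plain-language statement."""
    strong_results = [m for name, m in report.items() if name not in WEAK_ALGORITHMS]
    weak_match = any(m for name, m in report.items() if name in WEAK_ALGORITHMS)
    if strong_results and all(strong_results):
        return "match under a collision-resistant algorithm"
    if weak_match:
        return "match under a weak algorithm only: byte equality not established"
    return "no match"


print(interpret(match_report(b"same bytes", b"same bytes")))
print(interpret(match_report(b"ship", b"plane")))
print(interpret({"md5": True, "sha256": False}))  # the pattern an engineered collision would show
```

An engineered MD5 collision pair, like the ship and plane images, would be expected to show the third pattern: equal under MD5, different under SHA-256.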

Practical Implications

In practice, hashing is often overstated when reports skip limitations. Common mistakes include using a hash match to claim content identity, relying on weak algorithms without context, and omitting the protocol used to generate and verify results.
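One way to avoid omitting the protocol is to store the method alongside the value. The sketch below records the algorithm, tool, input size, and time of calculation with each digest. The HashRecord fields and the item label are hypothetical and would need to follow whatever documentation standard a lab actually uses.

```python
import hashlib
import json
import platform
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class HashRecord:
    """A hash value together with the method used to produce it (hypothetical schema)."""
    item_label: str
    algorithm: str
    hex_digest: str
    size_bytes: int
    tool: str
    computed_at_utc: str


def make_record(item_label, data, algorithm="sha256"):
    """Hash the data and capture how, with what, and when it was hashed."""
    return HashRecord(
        item_label=item_label,
        algorithm=algorithm,
        hex_digest=hashlib.new(algorithm, data).hexdigest(),
        size_bytes=len(data),
        tool=f"Python {platform.python_version()} hashlib",
        computed_at_utc=datetime.now(timezone.utc).isoformat(),
    )


# Hypothetical evidence item and contents.
record = make_record("ITEM-001", b"sample evidence data")
print(json.dumps(asdict(record), indent=2))
```

A record like this lets a reviewer repeat the calculation with the same algorithm and see at a glance whether a weak one was used.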

A better framing is straightforward: hash matching supports data sameness, not content sameness.

Proper Hashing Mindset

The safest statement is this: hash values show matching bytes under a defined method. They do not provide narrative context by themselves.

Conclusion

Hash values remain indispensable in digital forensic work. They are among the best tools for integrity verification and data comparison. But they prove a narrow claim. They do not independently establish meaning, identity of actor, timeline, or uniqueness, especially when older algorithms such as MD5 are involved. Understanding that boundary is essential for accurate legal and forensic analysis.