Approximate matching promises to be a powerful addition to the exact matching provided by the NSRL RDS. This page highlights the work the NSRL and its colleagues are doing to develop, assess and promote approximate matching technologies.
NIST is interested in establishing the “science of approximate matching” and in developing use cases that will allow the forensics community to understand the benefits of approximate matching. Both of these will help enable the technology transfer of approximate matching from the lab to operational use in computer forensics (and possibly other fields).
Traditional hash sets, such as the NSRL RDS, enable the exact comparison of files: either file A is identical to file B (i.e. it is a bit-for-bit copy) or it is not. These hashes have the advantage of being reasonably quick to compute and quick to compare, but they cannot assess how different two files are when they are not identical: the files may be unrelated, or they may differ by a single flipped bit. Of greater interest are those situations in which files A and B represent different versions of a file (e.g. an email, an application or a binary library) or where file A is partly or wholly contained within file B.
Approximate matching addresses these situations. In other words, it provides a means to assess and quantify the relationship between two files beyond same/not same.
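The contrast between exact and approximate comparison can be illustrated with a small sketch. It uses SHA-256 from Python's standard library as the exact hash and `difflib.SequenceMatcher` as a stand-in similarity measure. That stand-in is an assumption made purely for illustration: it is not ssdeep, sdhash, or any algorithm used by the NSRL, and the function names and test data are hypothetical.

```python
import hashlib
import random
from difflib import SequenceMatcher


def exact_match(a: bytes, b: bytes) -> bool:
    """Exact comparison: True only if the two inputs hash identically."""
    return hashlib.sha256(a).digest() == hashlib.sha256(b).digest()


def toy_similarity(a: bytes, b: bytes) -> float:
    """Illustrative similarity score in [0, 1]; NOT a forensic algorithm."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Hypothetical test data: a 512-byte "file", a copy with one bit flipped,
# and an unrelated 512-byte "file".
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(512))
flipped = bytearray(original)
flipped[100] ^= 0x01          # flip a single bit in byte 100
flipped = bytes(flipped)
unrelated = bytes(rng.getrandbits(8) for _ in range(512))

for name, other in [("one flipped bit", flipped), ("unrelated", unrelated)]:
    print(name,
          "| exact match:", exact_match(original, other),
          "| toy similarity:", round(toy_similarity(original, other), 3))
```

The exact hash reports "no match" in both cases and cannot tell them apart, while the similarity score is close to 1 for the single flipped bit and much lower for the unrelated data.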
Widespread adoption of this technology has been hampered by the computationally intensive and time-consuming nature of comparing each file in an investigation with a large database of approximate hashes. However, we are approaching the time when this will be feasible, and the NSRL intends to be a central resource for approximate matching adoption in the forensic community.
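A rough sketch of why the cost differs: an exact hash can be looked up in a set with one membership test per file, whereas a naive approximate search scores each file against every entry in the reference database. The code below assumes a brute-force search with no indexing and uses Jaccard similarity over toy integer sets as a placeholder signature; the names, sizes, and threshold are hypothetical, and real tools may reduce the work with indexing or filtering.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """Placeholder similarity between two toy signatures."""
    return len(a & b) / len(a | b)


def exact_lookups(case_sigs, reference_set):
    """Exact matching: one set-membership test per case file."""
    hits = [s for s in case_sigs if s in reference_set]
    return hits, len(case_sigs)


def brute_force_matches(case_sigs, reference_sigs, score, threshold):
    """Approximate matching, naive form: score every case/reference pair."""
    hits, comparisons = [], 0
    for c in case_sigs:
        for r in reference_sigs:
            comparisons += 1
            if score(c, r) >= threshold:
                hits.append((c, r))
    return hits, comparisons


# Hypothetical sizes: 1,000 reference signatures, 100 case files.
reference_sigs = [frozenset(range(i, i + 10)) for i in range(1000)]
case_sigs = [frozenset(range(i, i + 10)) for i in range(0, 1000, 10)]

exact_hits, lookups = exact_lookups(case_sigs, set(reference_sigs))
approx_hits, comparisons = brute_force_matches(
    case_sigs, reference_sigs, jaccard, threshold=0.8)

print("exact lookups performed:", lookups)
print("approximate comparisons performed:", comparisons)
```

Under these assumptions the number of approximate comparisons is the product of the two collection sizes, so it grows with the reference database, while the number of exact lookups depends only on the number of case files.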
NIST's Special Publication defining approximate matching is available here (PDF)
If you have questions or comments, please email us at nsrl [at] nist.gov.
We envision two broad categories of use:
1) Clustering and comparison of similar files:
Documents: manuscripts, email trails and IM logs, program source code, browser logs, etc. Potential use in establishing shared communications, plagiarism, IP violations, document versioning...
Executables: installed applications, operating system libraries, downloaded software. Potential use in IP protection, malware detection, increasing coverage of conventional hash sets...
Multimedia: music, video, e-books, images. Potential use in license enforcement/IP protection, tamper detection (e.g. CP insertion, other steganography).
2) Detection of file objects included in/extracted from larger files:
Documents: manuscripts, email trails and IM logs, program source code, browser logs, etc. Potential use in establishing plagiarism, IP violations, illicit object detection.
Executables: installed applications, operating system libraries, downloaded software. Potential use in IP protection, malware detection, illicit object detection.
Multimedia: music, video, e-books, images. Potential use in license enforcement/IP protection, tamper detection (e.g. CP insertion, other steganography).
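The two categories above correspond to two different questions: how much do two files resemble each other, and how much of one file appears inside another. A minimal sketch of that distinction follows, using sets of overlapping byte n-grams ("shingles") with Jaccard resemblance and a one-sided containment score. This is a generic textbook technique chosen for illustration, not the ssdeep or sdhash algorithm; the shingle length `k`, the function names, and the random test data are all assumptions.

```python
import random


def shingles(data: bytes, k: int = 8) -> set:
    """Set of all overlapping k-byte substrings of data."""
    if len(data) < k:
        return {data}
    return {data[i:i + k] for i in range(len(data) - k + 1)}


def resemblance(a: bytes, b: bytes, k: int = 8) -> float:
    """Category 1: symmetric Jaccard similarity of the two shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


def containment(a: bytes, b: bytes, k: int = 8) -> float:
    """Category 2: fraction of a's shingles that also occur in b."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa)


rng = random.Random(0)


def rand_bytes(n: int) -> bytes:
    return bytes(rng.getrandbits(8) for _ in range(n))


# Hypothetical "versions": v2 is v1 with 20 bytes inserted in the middle.
v1 = rand_bytes(2000)
v2 = v1[:1000] + rand_bytes(20) + v1[1000:]

# Hypothetical "embedded object": a 500-byte object inside a larger file.
embedded = rand_bytes(500)
container = rand_bytes(1000) + embedded + rand_bytes(1000)

print("resemblance(v1, v2):", round(resemblance(v1, v2), 3))
print("resemblance(embedded, container):",
      round(resemblance(embedded, container), 3))
print("containment(embedded, container):",
      round(containment(embedded, container), 3))
```

In this toy setting the two versions score high on resemblance, while the embedded object scores low on resemblance against its container but high on containment, which is why the two use categories call for different measures.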
The NSRL makes available data sets computed with the ssdeep and sdhash algorithms. See the Non-RDS Hash Sets page.
Are there uses that you or your organization would like us to include in our assessments? If so, please email us at nsrl [at] nist.gov.
ssdeep downloads and resources: http://ssdeep.sourceforge.net/
sdhash downloads and resources: http://roussev.net/sdhash/sdhash.html