Criminals today may try to hide the digital evidence of their misdeeds by burying it in the mass of code that underpins the graphical user interface that now dominates computer use. Unless you know what you are looking for specifically, these particular bits can be very hard to find.

One of the most basic ways to conceal digital evidence is to change the file type. For example, a spreadsheet containing incriminating evidence is given a .jpg extension and hidden among thousands of vacation photos. Finding a needle in a haystack like this is a daunting task.

Until now, the only easy way to look for file types has been to search by file extension (e.g., .jpg, .png, .pdf), but extensions can easily be renamed, which defeats any search for the altered files. A second way to find altered files is to inspect the “magic bytes” with a hex editor. The hex editor displays the block of byte values at the beginning of the file, which an examiner can use to identify the file type. The problem with this method is that it is time-consuming, as every file must be opened to be identified. Another problem is that there is no standard for magic bytes, so some file types do not use them. The technique also works only on binary files, and the byte signatures have no predetermined length. All these difficulties make for a very leaky file detection method.
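The magic-byte check can be scripted rather than done by hand in a hex editor, although the limitations described above still apply. The following is a minimal sketch, not a forensic tool: the signature table is a small illustrative subset, and the function names (`sniff_type`, `is_mismatch`, `check_file`) are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Illustrative subset of well-known file signatures; real tools use far larger tables.
SIGNATURES = {
    b"\xff\xd8\xff": "jpg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip",  # ZIP container, also used by .xlsx, .docx, .pptx
}

# Extensions treated as equivalent, and extensions that are ZIP containers.
EXTENSION_ALIASES = {"jpeg": "jpg"}
ZIP_BASED = {"zip", "xlsx", "docx", "pptx"}


def sniff_type(header: bytes) -> Optional[str]:
    """Return the type implied by the leading bytes, or None if unrecognized."""
    for magic, kind in SIGNATURES.items():
        if header.startswith(magic):
            return kind
    return None


def is_mismatch(filename: str, header: bytes) -> bool:
    """True when the extension disagrees with a recognized signature."""
    detected = sniff_type(header)
    if detected is None:
        return False  # no known signature, so this check cannot judge the file
    ext = Path(filename).suffix.lower().lstrip(".")
    ext = EXTENSION_ALIASES.get(ext, ext)
    if detected == "zip":
        return ext not in ZIP_BASED
    return ext != detected


def check_file(path: Path) -> bool:
    """Read the first bytes of a file on disk and test it for a mismatch."""
    with path.open("rb") as fh:
        return is_mismatch(path.name, fh.read(16))
```

Under these assumptions, a modern spreadsheet renamed to `holiday.jpg` would be flagged because its leading bytes are a ZIP signature rather than a JPEG one. A file with no recognized signature passes silently, which is exactly the leak the article describes.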

A third method, proposed by Konstantinos Karampidis and Giorgos Papadourakis in a recent paper in the Journal of Digital Forensics, Security, and Law, uses computational intelligence to ferret out these altered file types in three steps:

It is a three-stage process involving feature extraction, feature selection and classification, as illustrated in Figure 1. Initially all files from the dataset are loaded and the features are extracted. Afterwards, feature selection is accomplished using a genetic algorithm and finally a neural network performs the classification. Byte Frequency Distribution (BFD) is used as a feature extraction method. In order to create the BFD, the number of occurrences of each byte value in an input file is counted and an array with elements from 0 to 255 is created. Then each element of the array is normalized by dividing with the maximum occurrence. The final result is a file containing 256 features for each instance. The next stage is feature selection, in order to decrease the number of features. Feature selection is the procedure of finding and selecting the minimum number of the most informative relevant features.
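The byte frequency distribution described above can be sketched in a few lines. This is an illustrative reading of the quoted description, not the authors' code; the function name `byte_frequency_distribution` is hypothetical, and returning all zeros for an empty file is an assumption the description does not cover.

```python
from typing import List


def byte_frequency_distribution(data: bytes) -> List[float]:
    """Count each byte value (0-255) and normalize by the largest count."""
    counts = [0] * 256
    for value in data:  # iterating over bytes yields integers 0..255
        counts[value] += 1
    peak = max(counts)
    if peak == 0:
        # Assumption: an empty file yields an all-zero feature vector.
        return [0.0] * 256
    return [count / peak for count in counts]


# Example: "a" (byte 97) occurs twice and "b" (byte 98) once.
features = byte_frequency_distribution(b"aab")
print(len(features), features[97], features[98])  # prints: 256 1.0 0.5
```

Because every value is divided by the maximum count, the most frequent byte always scores 1.0 and file size drops out of the comparison.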

As a search method, a genetic algorithm was used… As a fitness function the Correlation based Feature Selection (CFS) (MA Hall, 1999) algorithm is utilized. This algorithm evaluates the candidate solutions from the genetic algorithm and chooses those which include features highly associated to the file type category and low correlated with each other, by calculating each candidate’s solution merit.
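The merit score can be illustrated with the standard CFS heuristic from Hall (1999), which rewards feature subsets that correlate with the class and penalizes subsets whose features correlate with each other. The sketch below assumes the two average correlations have already been computed and are non-negative; how the paper's implementation measures them is not stated here, and the function name `cfs_merit` is hypothetical.

```python
import math


def cfs_merit(k: int, mean_feature_class_corr: float,
              mean_feature_feature_corr: float) -> float:
    """CFS merit of a subset of k features: k*r_cf / sqrt(k + k*(k-1)*r_ff)."""
    if k < 1:
        raise ValueError("a feature subset needs at least one feature")
    if mean_feature_feature_corr < 0:
        # Assumption: correlations are given as non-negative magnitudes.
        raise ValueError("expected a non-negative feature-feature correlation")
    numerator = k * mean_feature_class_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator


# Four features, each moderately predictive of the file type:
print(cfs_merit(4, 0.5, 0.0))  # uncorrelated with each other -> 1.0
print(cfs_merit(4, 0.5, 1.0))  # fully redundant features    -> 0.5
```

The two calls show why redundancy is penalized: four fully redundant features score no better than a single one of them would on its own.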

The third and final stage is classification, which was performed with a one hidden layer neural network using the backpropagation algorithm.

Tested on a dataset produced by Caltech containing 914 images in 101 subfolders, the proposed three-step method identified forged jpg and gif images with 100% accuracy and forged png images with 98.81% accuracy.

The researchers also tested the dataset with more traditional methods such as the k-means algorithm, but the detection accuracy was far lower.

Computational intelligence as a way of detecting altered file types looks to have a promising future in digital forensics. Much testing and validation still has to be done, but the addition of one more effective tool for uncovering criminal behavior is surely welcome.

[Find the full paper here.]