Big data is now intimately entwined with the world of forensic investigations. It’s the new normal; with so many sensor points around us, it’s estimated that by 2025 the average connected person will have over 4,800 interactions with IoT devices every day.

The movement of data storage from physical hard drives in computers to virtual storage in the cloud has altered the digital forensics landscape. This difficulty is multiplied by the huge increase in the volume of data (both stored and real-time) and the variety of formats in which the data now appears. Even the basic tenet of repeatability, in its strictest sense, may fall by the wayside. This gives renewed urgency to the need for forensic investigators to validate their tools, make them more reliable, and improve documentation so that it stands up to scrutiny.

Because cloud data can be modified remotely very quickly, the deployment of well-trained Digital Evidence First Responders (DEFRs) has been proposed. Since it would be impossible to capture all the data at once, DEFRs would have to perform triage and then decide which relevant parts of the data should be collected in the time allowed.

Techniques used for extracting valuable information and insights from Big Data may now also be employed in digital forensics.

These techniques may include:

  • Neural Networks – Used to detect patterns in large amounts of unclassified data without explicitly programmed rules. By studying examples, neural networks learn and improve as they go. Such networks do well in image and speech recognition, so forensic investigators can employ them to search image and speech files.


  • Natural Language Processing (NLP) – Used in big data to read and extract information from sources such as emails. Statistics-based NLP relies on machine learning that improves automatically: the more input data it sees, the greater its accuracy.


  • Audio and image forensics – AI-based learning techniques can not only sharpen sound but also search through large volumes of audio data for particular patterns. The same applies to searching through images. Forensic scientists helping police prosecute an alleged pedophile, for example, can search for child pornography on the accused’s storage media and in the cloud.


  • Random Forests – A technique that combines a large number of decision trees to carry out machine learning tasks. It is also useful for ranking the importance of the variables in a given problem. Random forest techniques deliver quick results and handle incomplete data better than many other techniques.


  • MapReduce – A technique best used for handling large amounts of unstructured data. The Map step of this technique brings structure to unstructured data so that it can be processed. The data is then processed in parallel on a number of connected computers, which provide fault tolerance and redundancy. The result is both processing speed and reliability.
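To make the neural-network bullet concrete, here is a minimal sketch of "learning from examples rather than rules": a single-neuron perceptron, the simplest ancestor of modern neural networks, that learns the logical OR function purely from labeled samples. This is an illustrative toy, not a production forensics tool; the function names are our own.

```python
# Minimal perceptron sketch: learns a decision rule from labeled examples,
# with no explicit if/then rules programmed in.

def train(samples, epochs=20, lr=0.1):
    """samples: list of ((x1, x2), target) pairs with targets 0 or 1."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - pred          # 0 when correct; +/-1 when wrong
            w[0] += lr * err * x1        # nudge weights toward the target
            w[1] += lr * err * x2
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Learn OR from examples only:
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train(samples)
```

Real forensic pattern matching in images or speech uses deep, multi-layer versions of this same idea, trained on millions of examples instead of four.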
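The statistics-based NLP bullet can be sketched with a tiny naive Bayes classifier that flags emails as relevant or routine based only on word statistics gathered from training examples; the class name and labels are hypothetical, and accuracy grows as more labeled mail is fed in.

```python
# Tiny statistics-based text classifier (naive Bayes with Laplace smoothing).
# Every call to train() adds word statistics, so more data means better accuracy.
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter()            # label -> document count

    def train(self, text, label):
        self.label_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def classify(self, text):
        words = text.lower().split()
        vocab = len({w for c in self.word_counts.values() for w in c})
        total_docs = sum(self.label_counts.values())
        best, best_score = None, float("-inf")
        for label in self.label_counts:
            # log prior + log likelihood of each word under this label
            score = math.log(self.label_counts[label] / total_docs)
            total = sum(self.word_counts[label].values())
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / (total + vocab))
            if score > best_score:
                best, best_score = label, score
        return best

nb = NaiveBayes()
nb.train("transfer funds to the offshore account", "relevant")
nb.train("wire the funds tonight", "relevant")
nb.train("lunch meeting tomorrow at noon", "routine")
nb.train("see you at lunch", "routine")
```

An investigator could then ask `nb.classify("move the funds offshore")` to prioritize which messages in a large mailbox deserve human review first.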
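For the image-forensics bullet, one standard and simple building block is hash matching: sweeping a storage volume and comparing each file's cryptographic hash against a database of hashes of known illegal images, so known material is found without anyone viewing every file. The sketch below uses only the standard library; the function names are illustrative.

```python
# Sweep a directory tree and flag files whose SHA-256 hash appears in a
# known-bad hash set (the way investigators match against curated databases
# of known contraband images).
import hashlib
import os

def sha256_of_file(path, chunk=65536):
    """Hash a file incrementally so large evidence files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def scan_for_known_files(root, known_hashes):
    """Return paths under `root` whose contents match a known hash."""
    hits = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if sha256_of_file(path) in known_hashes:
                    hits.append(path)
            except OSError:
                pass  # unreadable file: log and continue in real tooling
    return hits
```

Exact hashing only catches identical files; the AI-based techniques described above extend this to altered or previously unseen images, which hashes cannot match.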
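The random-forest bullet can be illustrated with a deliberately tiny version of the idea: many simple trees (here, one-split "stumps"), each trained on a random bootstrap sample of the data, voting on the answer, with variable importance read off from which feature the trees chose. Real forests use full decision trees and libraries such as scikit-learn; this stdlib-only sketch just shows the mechanics.

```python
# Miniature random forest: an ensemble of decision stumps trained on
# bootstrap samples, combined by majority vote.
import random
from collections import Counter

def train_stump(X, y):
    """Pick the (feature, threshold, sign) split with the fewest errors."""
    best, best_err = None, float("inf")
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for sign in (1, -1):
                preds = [1 if sign * (row[f] - t) > 0 else 0 for row in X]
                err = sum(p != label for p, label in zip(preds, y))
                if err < best_err:
                    best_err, best = err, (f, t, sign)
    return best

def stump_predict(stump, row):
    f, t, sign = stump
    return 1 if sign * (row[f] - t) > 0 else 0

def train_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    n = len(X)
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest

def forest_predict(forest, row):
    votes = Counter(stump_predict(s, row) for s in forest)
    return votes.most_common(1)[0][0]              # majority vote

def feature_importance(forest):
    """Rank variables by how often the trees chose to split on them."""
    return Counter(f for f, _, _ in forest)
```

Because each tree sees only a random resample, the ensemble tolerates noisy or missing records that would derail a single model, which is the property the bullet above highlights.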
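Finally, the MapReduce bullet can be sketched with the classic word-count example: Map turns raw text chunks into structured (key, value) pairs, a shuffle groups the pairs by key, and Reduce collapses each group into a result. The sketch runs the phases sequentially in one process; in a real framework such as Hadoop, the chunks would be mapped in parallel on many machines, which is where the speed and fault tolerance come from.

```python
# MapReduce word count, with the three phases written out explicitly.
from collections import defaultdict

def map_phase(chunk):
    """MAP: turn an unstructured text chunk into structured (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """SHUFFLE: group intermediate pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """REDUCE: collapse each key's values into a final result."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(chunks):
    pairs = []
    for chunk in chunks:   # a real cluster would map the chunks in parallel
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(pairs))
```

For an investigator, the same pattern scales keyword or artifact counting from one laptop's worth of text to terabytes spread across a cluster.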