Some Software:
Here are a few scripts that I wrote for data recovery. I assume that you have
successfully mounted the hard drive in question. You are welcome to use the
provided scripts, and you will likely need to change them to fit your needs.
The files come with absolutely no warranty.
In my PHYS476/576 course "Computational Physics" I address issues concerning
data integrity, data safety, data lifecycle, data forensics, data recovery,
feature extraction, and workflow design.
Recovery of Pictures:
Recovery of pdfs:
- Pruning: move doc, xls, ppt, and pdfs < 100kB out of the recovered directories.
- Processing: convert pages 1 ... 5 of the remaining pdfs into cleartext.
- Processing: search the cleartext files by keyword and move matching
files to a keyword labeled directory.
- Processing: remove duplicate recovered files in keyword labeled directories.
In my experience this step is crucial, since it reduces the file count by a factor of 10 to 20.
- Auxiliary: Count the number of files in the recovered directories.
- Auxiliary: Remove a pdf if no corresponding cleartext file exists in the keyword labeled directories.
Be careful: in this step you will lose pdfs with no corresponding cleartext file, for example scanned pdfs.
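The pruning, conversion, keyword, and de-duplication steps above might be sketched as follows. The directory names (recovered/, small/, keywords/physics/) and the keyword "physics" are placeholders of my own, not taken from the original scripts; adjust paths, sizes, and keywords to your own recovery run.

```shell
#!/bin/sh
# Sketch of the pruning and processing steps; run on a copy of your data first.
mkdir -p small keywords/physics

# Pruning: move doc, xls, ppt, and pdf files smaller than 100 kB out of the way.
find recovered -type f \
     \( -name '*.pdf' -o -name '*.doc' -o -name '*.xls' -o -name '*.ppt' \) \
     -size -100k -exec mv {} small/ \;

# Processing: convert pages 1 ... 5 of each remaining pdf to cleartext
# (pdftotext is part of poppler-utils).
for f in recovered/*.pdf; do
    [ -e "$f" ] || continue
    pdftotext -f 1 -l 5 "$f" "${f%.pdf}.txt"
done

# Processing: move pdf/cleartext pairs whose text matches a keyword.
for t in recovered/*.txt; do
    [ -e "$t" ] || continue
    grep -qi 'physics' "$t" && mv -- "$t" "${t%.txt}.pdf" keywords/physics/
done

# Processing: remove byte-identical duplicates (assumes names without spaces).
md5sum keywords/physics/*.pdf 2>/dev/null | sort |
    awk 'seen[$1]++ { print $2 }' | xargs -r rm --

# Auxiliary: count the files in the recovered directories.
find recovered small keywords -type f | wc -l
```

Note that the checksum step only catches byte-identical copies; two scans of the same document will survive it.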
Recovered files often have meaningless names, such as "f523231216.pdf", so automatic renaming is desirable, especially when the number of recovered files is large (over 100000 files in my tests!). Initially I thought that I might be able to extract bibliographic information from the cleartext files. While the information is in the file, it is unstructured in the sense that the desired information cannot be extracted easily. I think it is more promising to use software such as "cermine", which attempts to convert pdfs into structured (metatagged) xml files and thereby facilitates filtering.
In runtime tests I found that generating an xml file takes ~30s on average, resulting in ~35 days for converting 100000 files. In contrast, the pruning step reduces the number of files by ~40-50% (in my experience), and removal of duplicates reduces this by an additional factor of 10 to 20, leading to conversion times of ~0.9 to 2 days. Therefore, pruning and pre-treatment of the recovered files is vital.
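The time estimates above can be checked quickly with awk (assuming 30 s per file, 86400 s per day, ~50% of files surviving pruning, and a de-duplication factor of 10 to 20):

```shell
# Sanity check of the conversion-time estimates (30 s per file).
awk 'BEGIN { printf "all 100000 files: %.1f days\n", 100000*30/86400 }'
# After pruning (~50% remain) and de-duplication (factor 10 to 20):
awk 'BEGIN { printf "best case:  %.2f days\n", 100000*0.5/20*30/86400 }'
awk 'BEGIN { printf "worst case: %.2f days\n", 100000*0.5/10*30/86400 }'
```

This reproduces the ~35 days for the full set and ~0.9 to 1.7 days after pruning and de-duplication.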
- Auxiliary: Remove a cleartext file if no corresponding pdf file exists in the keyword labeled directories.
Be careful: in this step you will lose cleartext files with no corresponding pdf file.
- Auxiliary: Remove all characters other than "a-z,A-Z,." from the filenames in keyword labeled directories.
Depending on the filename of the recovered file, it may contain characters such as "(", "_", or "?".
In these cases, Linux surrounds the filename with single quotes, hindering efficient further text processing.
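A minimal sketch of this clean-up uses tr to strip everything outside the allowed set. Run it inside one keyword labeled directory; names that would collide with an existing file, or would become empty, are skipped rather than overwritten:

```shell
#!/bin/sh
# Sketch: rename files in the current directory, keeping only a-z, A-Z, and "."
# so that the shell no longer needs to quote the names.
for f in *; do
    [ -f "$f" ] || continue
    clean=$(printf '%s' "$f" | tr -cd 'a-zA-Z.')
    [ "$clean" = "$f" ] && continue            # already clean
    [ -n "$clean" ] && [ ! -e "$clean" ] && mv -- "$f" "$clean"
done
```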