Metadata (data about data) has become a fairly well-known aspect of electronic discovery. Many tools can read, and even change, metadata, so a forensic analyst should treat it with caution. Even so, examining it can be an important step in your investigation.
Tools for metadata extraction
Google will find several million sites if you search for metadata extraction. I personally prefer open source tools so I can evaluate exactly what is happening, explain it thoroughly if required and, if necessary, modify the program. Here are a few links to sites for various tools:
My preference is Harlan’s tool, as it does a great job and I can see all the source (although I don’t enjoy programming in Perl all that much).
Installing File-MSWord and wmd.pl on Linux
Here are the steps to install Harlan’s tool on Linux. You must first have Perl installed (I’m using 5.8.8); then use CPAN to install the necessary modules.
- perl -MCPAN -e 'install Bundle::CPAN' (install CPAN bundle)
- perl -MCPAN -e shell (run the shell)
- install OLE::Storage
- install OLE::PropertySet
- install Startup
- install Carp (no need to force if you get that error)
- install Unicode::Map
- wget http://www.cpan.org/modules/by-authors/id/H/HC/HCARVEY/File-MSWord-0.1.zip
- unzip File-MSWord-0.1.zip
The package comes with a sample script called testwd.pl. It was written for Windows, but if you pass it to the perl interpreter explicitly it will run fine on Linux. To run the script, simply type:
- perl testwd.pl /doc/to/analyze.doc
and the metadata will be printed to standard output. If you get an error that the MSWord.pm file cannot be found, copy the file (from the zip archive) into one of the directories on your Perl include path (@INC), which is listed in the error message.
Harlan wrote a script called wmd.pl, which makes the output more reader-friendly, and I am posting a copy here with his permission. It can be hard to locate this file on the Internet.
Microsoft Office file formats
It may be helpful at times to reference the specifications for the various Microsoft Office file formats. With the latest release of Microsoft Office (2007), the file formats are now XML-based and are an open, published standard. Before this, the file formats were binary and not as easily accessible. You may find the following links helpful:
- How to extract information from Office files by using Office file formats and schemas
- Ecma Office Open XML File Formats Standard
- Microsoft Office Binary File Formats
- Word 97 structure from wvWare notes
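To make the difference between the two format families concrete, here is a minimal sketch in Python (not a language used elsewhere in this article, and not part of Harlan’s tools). It assumes that a binary Office file starts with the OLE2 compound file signature, and that an Office 2007 file is a ZIP archive with its core metadata stored at the conventional location docProps/core.xml. The function names detect_office_format and read_core_properties are my own, chosen for illustration only.

```python
import zipfile
import xml.etree.ElementTree as ET

# Signature of the OLE2 compound file container used by binary Office files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# XML namespaces used in docProps/core.xml of an Office Open XML package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}


def detect_office_format(path):
    """Return 'ole2', 'ooxml', or 'unknown' for the file at path."""
    with open(path, "rb") as fh:
        header = fh.read(8)
    if header == OLE2_MAGIC:
        return "ole2"
    # An Office Open XML document is a ZIP archive containing [Content_Types].xml.
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            if "[Content_Types].xml" in zf.namelist():
                return "ooxml"
    return "unknown"


def read_core_properties(path):
    """Return a dict of core metadata fields found in an Office Open XML file.

    Assumes the core properties part sits at docProps/core.xml, which is where
    Office normally writes it; a stricter reader would follow _rels/.rels.
    """
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    fields = {
        "creator": "dc:creator",
        "last_modified_by": "cp:lastModifiedBy",
        "created": "dcterms:created",
        "modified": "dcterms:modified",
        "revision": "cp:revision",
    }
    props = {}
    for name, tag in fields.items():
        node = root.find(tag, NS)
        if node is not None and node.text:
            props[name] = node.text
    return props
```

You would call detect_office_format on a file first, and only call read_core_properties when it returns 'ooxml'. Files reported as 'ole2' still need a binary parser such as Harlan’s. As with any metadata, treat the values as leads to be corroborated, since they can be edited.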
If you have found other tools helpful, please let me know so I can update this article.