Signature Creation Walkthrough
Development of file format signatures follows five basic steps: collection of a corpus of test files, identification of files to examine, examination of those files, creation of the signature, and testing of the signature.
- Collection of test files
Sample files can be found in institutional archives, online archival repositories (such as the Internet Archive), at online FTP sites, and at various websites devoted to preserving historical electronic files.
For this document I will be using files collected from the modland.com website. The site says "probably the largest module archive in the world. Currently there are 456411 modules online in 374 different formats, old and new!"
These mods are sound files used by computer games created from the 1980s onward. Few file format specifications existed at the time, so game creators developed their own formats for use in games. The Modland collection is a fantastic resource to practice PRONOM signature development on. The website also maintains a collection of file format specifications, which are a great resource to use in the development of signatures.
Commands used in this document are all run in a Linux environment, but can also be run on Windows machines with an appropriate software environment.
The modland website can be downloaded in its entirety using the wget command.
wget -m modland.com
In the root directory of the site is a file called allmods.zip which contains a 25MB text file listing all of the files on the site. Here are the first 10 files listed.
259547 Ace Tracker/505/bekiffte maschinen.am
116994 Ace Tracker/505/blazing wings.am
73421 Ace Tracker/505/blocks.am
45667 Ace Tracker/505/calling.am
395546 Ace Tracker/505/corridor extreme.am
16774 Ace Tracker/505/dazed.am
646599 Ace Tracker/505/distant.am
14779 Ace Tracker/505/dsp chip.am
20523 Ace Tracker/505/eclosure.am
39842 Ace Tracker/505/egzacutable.am
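The listing pairs a size in bytes with a path, so it can be summarised by format directory before deciding which files to examine. The following is a minimal sketch, assuming each line is a size, then whitespace, then a path whose first component is the format name. The file name `allmods.txt` and the function names are illustrative assumptions, not part of the Modland site or this repository.

```python
from collections import Counter


def parse_listing_line(line):
    """Split one listing line into (size_in_bytes, path).

    Assumes the layout '<size><whitespace><path>'; the path may contain spaces.
    """
    size, path = line.strip().split(None, 1)
    return int(size), path


def count_by_format(lines):
    """Count files per top-level directory (assumed to be the format name)."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _, path = parse_listing_line(line)
        counts[path.split("/", 1)[0]] += 1
    return counts


if __name__ == "__main__":
    # 'allmods.txt' is an assumed name for the text file inside allmods.zip.
    with open("allmods.txt", encoding="utf-8", errors="replace") as listing:
        for fmt, total in count_by_format(listing).most_common(10):
            print(f"{total:8d}  {fmt}")
```

Counting per directory gives a quick view of which formats have enough samples to make signature development worthwhile.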
- Identification of files to examine
Once you have gained familiarity with the modland collection, you will be able to check if these files in the Ace Tracker format have documentation associated with them. The file format documents are located in the pub/documents/format_documentation/ directory. Unfortunately there is no documentation for this format.
The Ace Tracker files are located at /pub/modules/Ace Tracker/. There are a total of 57 .am files in the collection. Copy those files to a single directory.
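Because the files sit in several subdirectories, copying them by hand is tedious. A rough sketch of the copy step follows, assuming the mirror was saved under a local `modland.com` directory; the destination name `am_samples` and the function name are hypothetical.

```python
import shutil
from pathlib import Path


def collect_files(source_root, dest_dir, extension=".am"):
    """Copy every file with the given extension under source_root into dest_dir.

    Returns the list of destination paths that were written. If two files
    share a name, the first one found is kept and later ones are skipped.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(source_root).rglob("*")):
        if path.is_file() and path.suffix.lower() == extension:
            target = dest / path.name
            if target.exists():
                continue  # do not overwrite an earlier copy
            shutil.copy2(path, target)
            copied.append(target)
    return copied


if __name__ == "__main__":
    # Both paths are assumptions about the local layout of the mirror.
    files = collect_files("modland.com/pub/modules/Ace Tracker", "am_samples")
    print(f"Copied {len(files)} files")
```

The reported count can be compared with the number of .am files expected from the listing.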
- Examination of files
I begin my examination of these files by running the Unix 'file' command on the files.
file *
0 00.am: data
1-800-natalie.am: data
20000.am: data
400 beats.am: data
a long time apart 1.am: data
a perfect moment.am: data
ace 2 introduction.am: data
behind a slime gate.am: data
bekiffte maschinen.am: data
No help there: the file command does not recognize the format. Next I run siegfried against the files (siegfried uses the PRONOM format signature database).
sf *
filename : '0 00.am'
filesize : 66418
modified : 2017-12-24T01:31:28-05:00
errors   :
matches  :
  - ns      : 'wikidata'
    id      : 'UNKNOWN'
    format  :
    version :
    mime    :
    basis   :
    warning : 'no match; possibilities based on extension are fmt/917, fmt/918, fmt/919, fmt/920, fmt/921'
Siegfried does not recognize the files, so the signature is not in the PRONOM database.
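With many files, reading the report by eye becomes slow. The sketch below lists the files siegfried could not identify, assuming the report was produced with `sf -json` and has a top-level `files` list whose entries carry `filename` and `matches` (each match with an `id`). That layout is an assumption to check against your siegfried version, and the function name is illustrative.

```python
import json


def unidentified_files(sf_json_text):
    """Return filenames for which every match has the id 'UNKNOWN'.

    An entry with no matches at all is also treated as unidentified.
    """
    report = json.loads(sf_json_text)
    unknown = []
    for entry in report.get("files", []):
        ids = [match.get("id") for match in entry.get("matches", [])]
        if all(identifier == "UNKNOWN" for identifier in ids):
            unknown.append(entry["filename"])
    return unknown


if __name__ == "__main__":
    # Assumes the report was saved first, e.g. sf -json * > report.json
    with open("report.json", encoding="utf-8") as handle:
        for name in unidentified_files(handle.read()):
            print(name)
```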
Next I run trid against the files. trid works differently from file or siegfried: its format signatures are created automatically by running heuristic analysis against sample files. This means that format signatures created for trid may or may not be unique to a specific file format. However, trid can return some information on nearly 13,000 file types (12,974 definitions in the version used here), so it's a good sanity check in the development of new signatures.
trid *
TrID/32 - File Identifier v2.24 - (C) 2003-16 By M.Pontello
Definitions found: 12974
Analyzing...
File: 0 00.am
100.0% (.AM) Ace Tracker module (12504/2/4)
Trid identifies the file! For the time being we will ignore this result and continue to develop our signature independently of trid; afterwards we can go back and see whether our signature matches the trid signature.
The next step is to examine the files using a hex editor; I am using Okteta.
We can see from the examination of the three files that each file begins with the string "AM01". There is also what appears to be a common string, "ASD1", later in the file. This information is probably sufficient to uniquely identify these files, but we can run a few more processes against all of the files in the directory (while it's possible to do this in a hex editor, it can be quite time-consuming).
The file lcs.py in this repository is a small Python program which examines binary files for unique strings. Specifically, the program identifies the Longest Common Sequence within the files: it examines the beginning of each file for strings which all the files have in common. lcs.py takes two arguments on the command line: the first is the extension to examine, and the second is how many bytes of the file to search. Generally, file format signatures occur at the beginning of the file. Running lcs.py against the first 8 bytes of the .am files produces:
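The same check a hex editor provides can be scripted for a whole directory. This is a small sketch, not part of this repository: it prints the leading bytes of each file and the offset at which "ASD1" first appears. The function names and the 64 KiB search limit are arbitrary assumptions.

```python
import sys


def read_header(path, length=16):
    """Return the first `length` bytes of a file."""
    with open(path, "rb") as handle:
        return handle.read(length)


def find_marker(path, marker, limit=65536):
    """Return the offset of `marker` within the first `limit` bytes, or -1."""
    with open(path, "rb") as handle:
        return handle.read(limit).find(marker)


def describe(path):
    """One-line summary: header as hex, header as bytes, offset of ASD1."""
    header = read_header(path)
    offset = find_marker(path, b"ASD1")
    return f"{path}: {header.hex()} {header!r} ASD1 at {offset}"


if __name__ == "__main__":
    # Usage (hypothetical script name): python peek.py *.am
    for name in sys.argv[1:]:
        print(describe(name))
```

If the offset of "ASD1" differs from file to file, that suggests the marker is not at a fixed position and would need a variable-position sequence in a signature.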
lcs am 8
First 8 binary bytes of 0 00.am are: b'AM01\x00\x00\x00\x7f'
First 8 hexadecimal bytes of 0 00.am are: 414d30310000007f'
First 8 Binary bytes of 1-800-natalie.am are: b'AM01\x00\x00\x00\x7f'
First 8 hexadecimal bytes of 1-800-natalie.am are: 414d30310000007f'
Found the common sequence 414d30310000007f at location (0, 17)
ASCII version of common sequence is AM01\x00\x00\x00\x7f
Checking other files in directory.
File 0 00.am contains the string 414d30310000007f' at location (0, 17)
File 1-800-natalie.am contains the string 414d30310000007f' at location (0, 17)
File 20000.am contains the string 414d30310000007f' at location (0, 17)
File 400 beats.am contains the string 414d30310000007f' at location (0, 17)
File a long time apart 1.am contains the string 414d30310000007f' at location (0, 17)
File a perfect moment.am contains the string 414d30310000007f' at location (0, 17)
File ace 2 introduction.am DOES NOT CONTAIN THE STRING.
File behind a slime gate.am contains the string 414d30310000007f' at location (0, 17)
File bekiffte maschinen.am contains the string 414d30310000007f' at location (0, 17)
...
File synasium.am contains the string 414d30310000007f' at location (0, 17)
File synthpop.am contains the string 414d30310000007f' at location (0, 17)
File tears in rain.am contains the string 414d30310000007f' at location (0, 17)
File the inuit experience.am contains the string 414d30310000007f' at location (0, 17)
File tightrope.am contains the string 414d30310000007f' at location (0, 17)
File tiny.am contains the string 414d30310000007f' at location (0, 17)
File treshold.am contains the string 414d30310000007f' at location (0, 17)
File warm waves.am DOES NOT CONTAIN THE STRING.
File waving.am contains the string 414d30310000007f' at location (0, 17)
47 files with matches.
10 files without matches.
We can see that 47 of the files share the same first 8 bytes (16 hexadecimal characters), but 10 do not. Re-running lcs.py with a smaller byte count:
lcs am 4
First 4 binary bytes of 0 00.am are: b'AM01'
First 4 hexadecimal bytes of 0 00.am are: 414d3031'
First 4 Binary bytes of 1-800-natalie.am are: b'AM01'
First 4 hexadecimal bytes of 1-800-natalie.am are: 414d3031'
Found the common sequence 414d3031 at location (0, 9)
ASCII version of common sequence is AM01
Checking other files in directory.
File 0 00.am contains the string 414d3031' at location (0, 9)
File 1-800-natalie.am contains the string 414d3031' at location (0, 9)
File 20000.am contains the string 414d3031' at location (0, 9)
File 400 beats.am contains the string 414d3031' at location (0, 9)
File a long time apart 1.am contains the string 414d3031' at location (0, 9)
File a perfect moment.am contains the string 414d3031' at location (0, 9)
...
File synasium.am contains the string 414d3031' at location (0, 9)
File synthpop.am contains the string 414d3031' at location (0, 9)
File tears in rain.am contains the string 414d3031' at location (0, 9)
File the inuit experience.am contains the string 414d3031' at location (0, 9)
File tightrope.am contains the string 414d3031' at location (0, 9)
File tiny.am contains the string 414d3031' at location (0, 9)
File treshold.am contains the string 414d3031' at location (0, 9)
File warm waves.am contains the string 414d3031' at location (0, 9)
File waving.am contains the string 414d3031' at location (0, 9)
57 files with matches.
0 files without matches.
All 57 of the test files begin with the same four bytes, 414d3031, which the hex editor showed us to be "AM01". This matches the trid identification (which was created by a Python program performing the same basic analysis).
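The shrinking from 8 bytes to 4 can also be found in one pass by computing the longest prefix shared by every file. The sketch below is not the lcs.py implementation; it swaps in a simpler longest-common-prefix technique that answers the same question for signatures anchored at the beginning of the file. All names are illustrative.

```python
from pathlib import Path


def common_prefix(headers):
    """Return the longest byte string that every header starts with."""
    headers = list(headers)
    if not headers:
        return b""
    prefix = headers[0]
    for header in headers[1:]:
        n = 0
        # Advance while both byte strings agree at position n.
        while n < min(len(prefix), len(header)) and prefix[n] == header[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def directory_prefix(directory, extension=".am", length=8):
    """Common leading bytes across all files with `extension` in `directory`."""
    headers = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() == extension:
            with open(path, "rb") as handle:
                headers.append(handle.read(length))
    return common_prefix(headers)


if __name__ == "__main__":
    prefix = directory_prefix(".")
    print(f"{len(prefix)} common bytes: {prefix.hex()} {prefix!r}")
```

A prefix that is shorter than expected usually means a few files differ after the magic bytes, which is a prompt to look at those files individually.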
- Creation of format signature
We now have enough information to tentatively create a new format signature. Using the PRONOM Signature Development Utility we can create a signature file for siegfried to use.
NOTE: A PRONOM signature can be very complex; see the PRONOM Workshop Slides for more information on the creation of signatures. We could make the following signature more complex by adding the string "ASD1" in addition to the "AM01" string. This would be done by using the Add Sequence button in the Signature Development Utility, entering the hex 41534431 in the Signature field, and changing the Anchor setting to Variable.
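To see what the two-sequence variant would accept, the matching rule can be approximated in a few lines. This is only a sketch of the idea (one sequence anchored at the start of the file, one allowed anywhere after it), not how DROID or siegfried implement matching; the function names are hypothetical.

```python
def to_signature_hex(text):
    """Hex string for an ASCII marker, as it would be typed into a signature field."""
    return text.encode("ascii").hex()


def matches_two_sequences(data, bof=b"AM01", variable=b"ASD1"):
    """True if data starts with `bof` and contains `variable` somewhere after it."""
    return data.startswith(bof) and data.find(variable, len(bof)) != -1


if __name__ == "__main__":
    print(to_signature_hex("AM01"))
    print(to_signature_hex("ASD1"))
    print(matches_two_sequences(b"AM01\x00\x00\x00\x7f....ASD1...."))
```

The stricter rule reduces the chance of false positives, at the cost of rejecting any genuine file that happens to lack the second marker.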
The mimetype was identified by running the Linux mimetype command:
mimetype *.am
0 00.am: application/octet-stream
000ace.am: application/octet-stream
1-800-natalie.am: application/octet-stream
20000.am: application/octet-stream
400 beats.am: application/octet-stream
a long time apart 1.am: application/octet-stream
Save the resulting XML file locally to your computer.
- Testing of signature
Follow the siegfried how-to to create a test signature and run it against the test files.
sudo cp Ace-Tracker-module-v1.0-signature-file.xml /usr/share/siegfried/custom
sudo roy build -extend Ace-Tracker-module-v1.0-signature-file.xml am.sig
sf -sig am.sig *.am
---
filename : '0 00.am'
filesize : 66418
modified : 2017-12-24T01:31:28-05:00
errors   :
matches  :
  - ns      : 'wikidata'
    id      : 'dev/1'
    format  : 'Ace Tracker module'
    version : '1.0'
    mime    : 'application/octet-stream'
    basis   : 'extension match am; byte match at 0, 4'
    warning :
We can see that siegfried now recognizes the file! The last step is to run the new signature against a large existing corpus of records to ensure that the signature is unique to the format you are testing.
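Before running the full signature against a large corpus, a quick script can flag the two kinds of disagreement worth investigating. The sketch below only approximates the signature by checking the first four bytes directly; it is not a substitute for running siegfried with the new signature file, and the corpus path and function name are assumptions.

```python
from pathlib import Path


def audit_corpus(root, magic=b"AM01", extension=".am"):
    """Walk `root` and report where the magic bytes and the extension disagree.

    Returns (unexpected, missed):
      unexpected: files that start with `magic` but lack the extension
      missed:     files with the extension that do not start with `magic`
    """
    unexpected, missed = [], []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        with open(path, "rb") as handle:
            has_magic = handle.read(len(magic)) == magic
        has_ext = path.suffix.lower() == extension
        if has_magic and not has_ext:
            unexpected.append(path)
        elif has_ext and not has_magic:
            missed.append(path)
    return unexpected, missed


if __name__ == "__main__":
    # 'corpus' is a placeholder for the directory holding the test corpus.
    unexpected, missed = audit_corpus("corpus")
    print(f"{len(unexpected)} possible false positives")
    print(f"{len(missed)} .am files without the magic bytes")
```

Files in the first group may belong to another format that happens to share the same leading bytes, which would argue for adding a second sequence to the signature.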
Now we can check the trid identification for .am files, and we see that our signature matches the one in trid.
At this point you can zip up the sample files and the PRONOM XML file and submit the format signature. Congratulations, you have successfully created a simple file format signature! Once the signature is added to the PRONOM database, anyone using DROID or siegfried will be able to identify .am files. You may also look into submitting the information to the Unix file command maintainers, and possibly to other sites which identify file formats (e.g. http://fileformats.archiveteam.org/wiki/Main_Page).