Extending Libmagic for Identification of Science Resources

Pratik Shrivastava

I am a second-year master's student in Information Management at the University of Illinois at Urbana-Champaign (UIUC). I received my B.Tech in Information and Communication Technology from DA-IICT, Gandhinagar, Gujarat, in 2008. Before starting my master's, I worked as a Senior Software Developer at Oracle for 4 years, and I have more than 8 years of professional experience in the software industry. My research interests revolve around data analytics, data provenance, and data mining. In my free time, I like playing soccer and cricket, and I also enjoy cooking.

Project Description: 

Reliable determination of file formats is necessary to ensure that appropriate processing can be applied to a file. This is especially important when files are intended to be reused in the future, since any knowledge of the producing system may be lost. There are many subtle variations in file formats that have significant implications for consumers. For example, many metadata standards are serialized as XML (text/xml or application/xml media type), but more detail is required for actual processing of the metadata. This information is usually available through a combination of the namespace(s) and schema(s) referenced by the XML. Manual interpretation of this information is relatively straightforward but error-prone, because the differences between closely related formats can be subtle.
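As an illustration of the namespace and schema information mentioned above, here is a minimal Python sketch that reads the root element's namespace and any xsi:schemaLocation hints from an XML document. The function name describe_xml and the example.org namespace used in the usage note are hypothetical and chosen only for illustration; this is not how the file command itself inspects XML.

```python
import xml.etree.ElementTree as ET

# Standard namespace of the XML Schema instance attributes (xsi:schemaLocation).
XSI = "http://www.w3.org/2001/XMLSchema-instance"


def describe_xml(text):
    """Return the root element's namespace, local name, and schema hints.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(text)
    # ElementTree encodes qualified names as "{namespace}localname".
    if root.tag.startswith("{"):
        namespace, local = root.tag[1:].split("}", 1)
    else:
        namespace, local = None, root.tag
    # xsi:schemaLocation holds whitespace-separated (namespace, URL) pairs.
    tokens = root.get("{%s}schemaLocation" % XSI, "").split()
    schemas = dict(zip(tokens[0::2], tokens[1::2]))
    return {"namespace": namespace, "root": local, "schemas": schemas}
```

Given a document whose root element is record in the made-up namespace http://example.org/meta/1.0, the function would report that namespace, the local name record, and a mapping from each declared namespace to its schema URL. Two metadata dialects that share the text/xml media type could then be told apart by comparing these values.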

The goal of this project is to extend the capabilities of the Linux file command (or its equivalents on OS X and Windows) to allow automatic identification of common science metadata and data formats. Two main activities are anticipated to achieve this goal: (1) supporting additional file formats by extending or adding to the existing "magic" configuration files used by the file command, which contain rules that identify files by matching patterns within them; and (2) providing a simple REST service that accepts a file (or a portion thereof) and returns a JSON-encoded response containing the identification of the file as provided by the file command.
- https://github.com/threatstack/libmagic
- http://jhove.openpreservation.org/
- https://linux.die.net/man/1/file
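
The REST service described in the second activity could be sketched roughly as follows, using only the Python standard library. This is an assumption-laden sketch, not the project's implementation: the names identify, IdentifyHandler, and serve, the JSON keys, and the port are all hypothetical. The only external behavior relied on is that the file command reads standard input when given "-" and supports the --brief and --mime-type options.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer


def identify(data, run=subprocess.run):
    """Pipe bytes to the file command; return its description and MIME type.

    Hypothetical helper; `run` is injectable so the function can be
    exercised without the file binary being installed.
    """
    def ask(*flags):
        # "-" makes file read standard input; --brief omits the file name.
        done = run(["file", "--brief", *flags, "-"], input=data,
                   stdout=subprocess.PIPE, check=True)
        return done.stdout.decode("utf-8", "replace").strip()

    return {"description": ask(), "mime_type": ask("--mime-type")}


class IdentifyHandler(BaseHTTPRequestHandler):
    """Accept a POSTed file body and answer with a JSON identification."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        payload = json.dumps(identify(body)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(host="127.0.0.1", port=8000):
    """Run the sketch service until interrupted (host and port are assumptions)."""
    HTTPServer((host, port), IdentifyHandler).serve_forever()
```

Calling serve() starts the service locally, after which a client can POST the first portion of a file and receive a JSON object with the description and MIME type reported by the file command. Since most magic rules match patterns near the start of a file, sending only a leading portion is often enough, though how much is needed depends on the rules involved.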

Primary Mentor: 
Dave Vieglais