Tika is a Java class library available from the Apache Software Foundation. It supports media type detection based on file type signatures, metadata extraction, and text parsing and extraction.
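To illustrate what detection "based on file type signatures" means, the following is a minimal sketch in plain Java. It is not Tika's implementation and does not use the Tika API: the class name SignatureSniffer, its small signature table, and the application/octet-stream fallback are assumptions chosen for illustration. It compares the leading bytes of a file against a few well-known signatures (PDF, PNG, GIF, ZIP).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of signature ("magic byte") detection; not Tika code. */
public class SignatureSniffer {

    /** One table entry: the leading bytes of a format and the media type they imply. */
    private static final class Signature {
        final byte[] magic;
        final String mediaType;

        Signature(byte[] magic, String mediaType) {
            this.magic = magic;
            this.mediaType = mediaType;
        }
    }

    // A tiny hand-picked table; a real detector knows far more formats.
    private static final List<Signature> SIGNATURES = new ArrayList<>();
    static {
        SIGNATURES.add(new Signature(
                "%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf"));
        SIGNATURES.add(new Signature(
                new byte[] {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'}, "image/png"));
        SIGNATURES.add(new Signature(
                new byte[] {'G', 'I', 'F', '8'}, "image/gif"));
        SIGNATURES.add(new Signature(
                new byte[] {'P', 'K', 0x03, 0x04}, "application/zip"));
    }

    /** Returns true if data begins with every byte of prefix, in order. */
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Maps the first bytes of a file to a media type, with a generic fallback. */
    public static String detect(byte[] head) {
        for (Signature s : SIGNATURES) {
            if (startsWith(head, s.magic)) {
                return s.mediaType;
            }
        }
        return "application/octet-stream"; // assumed fallback for unknown content
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample)); // prints "application/pdf"
    }
}
```

Because the decision depends only on the content, a renamed file (for example a PDF saved with a .txt extension) is still identified by its leading bytes.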
Supported Document Formats:
The Tika application can be run either from the command line or through a graphical user interface (GUI). Tika is written in Java, and the class library can be used directly in other programs where needed.
Those with advanced programming skills can extend Tika to meet specific project or analysis needs not covered by the basic release. It is an open source project of the Apache Software Foundation and is available under the Apache License, Version 2.0 (ALv2).