Unable to identify txt file type using DROID command prompt


Jerry Kuang

Oct 28, 2015, 3:36:02 PM
to droid-list
Hello everyone,

I just found that DROID is unable to identify txt file types in command line mode. I wonder whether you have any suggestions for solving this.

The command I use is the following.
java -jar droid-command-line-6.1.5.jar -q -Ns DROID_SignatureFile_V82.xml -Nr "C:\droid-binary-6.1.5-bin" 

The result is like the following:
[Inline image 1: screenshot of the result, not preserved]

Thank you
Jerry

ross-spencer

Oct 28, 2015, 3:44:10 PM
to droid-list


binarySignatureIdentifier.matchBinarySignatures(request);


Where we have two functions: 

matchExtensions()
matchBinarySignatures()

The two need to be called separately. 

It would be good if I could consult the original storyboards, but IIRC the design of the CSV output at the time was to provide unambiguous identification on the command line in the simplest form possible: identified, or not.

matchExtensions() would normally be used to provide information on extension matches, such as will always occur on a .txt file (that is actually just .txt). While it might be useful to raise a GitHub issue asking for extension-based identification to be returned in the CSV output (via a new command line argument), an alternative tool might be needed in your workflow.
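As a rough illustration of that two-step logic (not DROID's actual code), here is a minimal Python sketch. The function name identify() and the tiny EXTENSION_PUIDS table are hypothetical; a real table would be derived from the PRONOM signature file. Binary signature results are assumed to arrive as a list of PUIDs from some earlier step.

```python
from pathlib import Path

# Hypothetical, deliberately tiny extension table for illustration only.
# A real lookup would be built from the PRONOM signature file.
EXTENSION_PUIDS = {"txt": "x-fmt/111"}


def identify(filename, binary_puids):
    """Return (puid, basis) pairs, preferring binary signature matches.

    binary_puids is assumed to be the list of PUIDs found by a separate
    binary signature step; it is empty for plain text files.
    """
    if binary_puids:
        return [(puid, "signature") for puid in binary_puids]

    # No signature match: fall back to the file extension, which is the
    # step the no-profile command line output does not report.
    ext = Path(filename).suffix.lstrip(".").lower()
    puid = EXTENSION_PUIDS.get(ext)
    return [(puid, "extension")] if puid else []
```

With this sketch, a plain .txt file with no signature hit comes back as an extension-only match, while any signature hit takes precedence over the extension.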

TNA have a DROID UTF-8 validator: https://github.com/digital-preservation/utf8-validator 
But you might be better off with a tool like file on Linux, or Siegfried, which will provide information like this:

filename : 'Running DROID.txt'
filesize : 3083
modified : 2014-10-17T02:48:26+13:00
errors   :
matches  :
  - id      : pronom
    puid    : x-fmt/111
    format  : 'Plain Text File'
    version :
    mime    : 'text/plain'
    basis   : 'extension match; text match ASCII'
    warning :

--

I think that should provide an okay summary to the best of my knowledge right now, TNA or someone else on the list may be able to add more. 

Hope that helps.

Ross

Andy Jackson

Oct 28, 2015, 6:43:29 PM
to droid-list
Ross, thanks for the information about DROID and matchExtensions() - I hadn't realised I'd missed that part of the DROID logic when building the Nanite DroidDetector (https://github.com/openpreserve/nanite). I'll be able to make a fresh release that includes this soon.

As for identifying text, this is an area where I find combining DROID with other tools works well. Both the 'file' tool and Apache Tika take a sample from the start of a file and analyse it using heuristics that are pretty good at differentiating plain text from binary. These tools will also attempt to determine the character set/encoding that the text is using, which is handy. Personally, I start with the results from Apache Tika but then allow the results from DROID to override them when DROID makes a positive match via a binary or container signature.
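A minimal Python sketch of that override rule, under stated assumptions: combine() is a hypothetical helper, the DROID result is assumed to be a dict with "method" and "mime" keys, and the method values "Signature" and "Container" are assumed labels for positive matches. Neither tool is actually invoked here.

```python
# Assumed labels for a positive DROID match (binary or container signature).
POSITIVE_METHODS = {"Signature", "Container"}


def combine(tika_mime, droid_match):
    """Pick a MIME type, letting a positive DROID match override Tika.

    tika_mime: the MIME type reported by Tika (a string).
    droid_match: a dict with hypothetical "method" and "mime" keys,
    or None when DROID reported nothing.
    """
    if droid_match and droid_match.get("method") in POSITIVE_METHODS:
        return droid_match["mime"]
    # Extension-only or missing DROID results do not override Tika.
    return tika_mime
```

An extension-only DROID result is deliberately ignored in this sketch, so the Tika heuristics decide for plain text.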

HTH,
Andy Jackson

Jerry Kuang

Nov 4, 2015, 1:53:52 PM
to droid-list
Thank you Ross and Andy, both of you provided very helpful information on this issue.

Could anyone from the National Archives answer this question?

Thank you in advance.

Jerry

Brian O'Reilly

Nov 5, 2015, 6:08:24 AM
to droid-list
Hi Jerry
 
Whilst the "no profile" syntax from the command line will not identify text files (or other files identified by external signature, i.e. extension), you can run a profile from the command line using a different set of flags, and this will make the identifications in a similar way to the GUI version of DROID:
e.g.  > droid -a "C:\myFiles\myTextFilesDirectory" -p "C:\droid\profiles\textProfile.droid" 
 
The -a flag can take either a directory or a file (including multiple entries). If you then open the generated profile in the GUI, you should see that text files have been identified by extension. Unfortunately, the "-q" flag has no effect in this approach, and in the current version a lot of Hibernate-related log information is output to the console. This will be fixed in the next release, though in the meantime you can redirect the output to a file, e.g. from the Windows command prompt:
> droid -a C:\myFiles\myTextFilesDirectory -p C:\droid\profiles\textProfile.droid > C:\droid\log.txt
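If the profile run is driven from a script rather than typed at the prompt, the same redirection could be sketched in Python as below. This is an assumption-laden sketch: build_droid_command() and run_droid() are hypothetical helper names, "droid" is assumed to be on the PATH, and only the -a and -p flags shown above are used. Unlike the plain ">" redirect, this version also sends stderr to the log file.

```python
import subprocess


def build_droid_command(resources, profile, droid="droid"):
    """Build the argument list for a profile run using -a and -p."""
    return [droid, "-a", *resources, "-p", profile]


def run_droid(resources, profile, log_path, droid="droid"):
    """Run DROID and write its console output to log_path.

    Returns the process exit code. stderr is merged into the log as an
    extra step beyond what the shell example does.
    """
    command = build_droid_command(resources, profile, droid)
    with open(log_path, "w", encoding="utf-8") as log:
        completed = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    return completed.returncode
```

For example, run_droid([r"C:\myFiles\myTextFilesDirectory"], r"C:\droid\profiles\textProfile.droid", r"C:\droid\log.txt") would mirror the command above.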
 
Hopefully this will be of help to you.
Regards, Brian

Jerry Kuang

Nov 11, 2015, 11:08:40 AM
to droid-list
Hi Brian,

Appreciate your information. 

Do you plan to include functions for identifying text formats and others by external signatures in no-profile mode?

Another issue is that I compared the profile mode results with the no-profile mode results and found that the PUIDs returned for jar files by these two approaches are different. Please refer to the screenshots.

Thank you
Jerry




Jerry Kuang

Nov 11, 2015, 11:41:42 AM
to droid-list
Hi Brian,

The filter flag does not offer many filter fields for me to further filter and process the results. I wonder whether you would consider adding filter fields here, such as FILE_PATH.

The no-profile mode identification results are neat and clear, since the file name and PUID are enough for me, so if what I mentioned in my previous comment could be included in a future release, that would be excellent.

Thank you
Jerry

Jerry Kuang

Nov 11, 2015, 2:25:09 PM
to droid-list
Hi Brian,

Is there any way to filter out the records whose URI starts with "zip:file"? Please see the screenshot below. It seems the current filter flag does not support filtering by URI.

Thank you

Jerry

ross-spencer

Nov 11, 2015, 5:44:16 PM
to droid-list
Hi Jerry,

RE: Filtering on URI - A feature of this analysis code I wrote is that it pulls apart the URI so that you can filter on exactly that: https://github.com/exponential-decay/droid-sqlite-analysis

This was previously how I structured the database and so it wasn't exposed to users. I've quickly created an export mechanism so you can export a new deconstructed DROID CSV that will have the following columns:

ID PARENT_ID URI URI_SCHEME FILE_PATH DIR_NAME NAME METHOD STATUS SIZE TYPE EXT LAST_MODIFIED EXTENSION_MISMATCH MD5_HASH FORMAT_COUNT PUID MIME_TYPE FORMAT_NAME FORMAT_VERSION

I'd like to see the standard DROID output include the URI scheme too, but I don't think I've requested this anywhere officially. Perhaps one for the issues list? Also, as an interface, adding a new column is likely to break compatibility with some users' code, so that is one TNA would need to manage.
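For a one-off filter without the analysis tool, the scheme can also be split off the URI column directly. The following Python sketch is only an illustration: rows_without_scheme() is a hypothetical helper, and it assumes a DROID CSV export with a URI column whose values begin with the scheme followed by a colon, as in "zip:file:...".

```python
import csv
import io


def rows_without_scheme(csv_text, unwanted="zip"):
    """Return the CSV rows whose URI scheme is not the unwanted one.

    The scheme is taken to be everything before the first colon of the
    URI column, so "zip:file:/..." has the scheme "zip".
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = []
    for row in reader:
        scheme = row["URI"].partition(":")[0].lower()
        if scheme != unwanted:
            kept.append(row)
    return kept
```

Passing the text of an export to rows_without_scheme() keeps the ordinary file: records and drops the entries found inside zip containers.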

Ross



Jerry Kuang

Nov 12, 2015, 4:22:22 PM
to droid-list
Hi Ross,

The analysis tool for CSV file is quite useful. Thank you for showing this to me. I will try this in my application.

Jerry

Brian O'Reilly

Jan 12, 2016, 11:43:07 AM
to droid-list
Hi Jerry
Sorry not to have come back to you earlier.  The no-profile mode is intended to provide a lightweight alternative approach with basic functionality.  There are no plans to include recognition by extension or other enhancements in the future, though we do review requests periodically so will bear it in mind (along with the filter suggestions you also raised).  We're currently testing the next release and will look at the identification issue you raised.  Possibly, though, you had different signature files in each case: in no-profile mode there is no default signature file, so you have to specify the signature files explicitly (-Ns for the binary signature file, as in your command, and -Nc for the container signature file).
Regards, Brian

