Unable to identify txt file type using DROID command prompt

Jerry Kuang

Oct 28, 2015, 3:36:02 PM

to droid-list

Hello everyone,

I just found that DROID is unable to identify .txt files in command-line mode. Does anyone have any suggestions for solving this?

The command I use is the following.

java -jar droid-command-line-6.1.5.jar -q -Ns DROID_SignatureFile_V82.xml -Nr "C:\droid-binary-6.1.5-bin"

The result is like the following:

Thank you

Jerry

ross-spencer

Oct 28, 2015, 3:44:10 PM

to droid-list

The code hasn't been changed too much since I worked on it. The class being used on the command line is: https://github.com/digital-preservation/droid/blob/5bac17322b46f6a9b80ae95d597d24cc9c6b858d/droid-command-line/src/main/java/uk/gov/nationalarchives/droid/command/action/NoProfileRunCommand.java

And the key is the function called here: https://github.com/digital-preservation/droid/blob/5bac17322b46f6a9b80ae95d597d24cc9c6b858d/droid-command-line/src/main/java/uk/gov/nationalarchives/droid/command/action/NoProfileRunCommand.java#L153

binarySignatureIdentifier.matchBinarySignatures(request);

It belongs to this class: https://github.com/digital-preservation/droid/blob/f3ca932b1b98f84be30a64b75569b87f54ef7101/droid-core/src/main/java/uk/gov/nationalarchives/droid/core/BinarySignatureIdentifier.java

Where we have two functions:

matchExtensions()

matchBinarySignatures()

The two need to be called separately.

It would be good if I could consult the original storyboards, but IIRC the design of the CSV output at the time was to provide unambiguous identification on the command line in the simplest form possible: identified, or not.

matchExtensions() would normally be used to provide information on extension matches, such as the one that will always occur for a .txt file (one that really is just plain text). While it might be useful to raise a GitHub issue asking for extension identifications to be returned in the CSV output (via a new command line argument), an alternative tool might be needed in your workflow.
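To illustrate why the two calls matter, here is a minimal Python sketch of the general idea: try a binary signature first, and only fall back to the extension when nothing matches. This is not DROID's code. The lookup tables and function names are hypothetical stand-ins for what DROID reads from the PRONOM signature file, and the two PUIDs are included only as examples.

```python
import os

# Hypothetical lookup tables for illustration only; DROID loads the real
# signatures and extensions from the PRONOM signature file.
BINARY_SIGNATURES = {b"PK\x03\x04": "x-fmt/263"}  # ZIP local file header
EXTENSIONS = {".txt": "x-fmt/111"}                # Plain Text File


def match_binary(data):
    """Return the PUID whose magic bytes start the data, or None."""
    for magic, puid in BINARY_SIGNATURES.items():
        if data.startswith(magic):
            return puid
    return None


def match_extension(name):
    """Return the PUID registered for the file extension, or None."""
    ext = os.path.splitext(name)[1].lower()
    return EXTENSIONS.get(ext)


def identify(name, data):
    """Binary signature match first, extension match as a fallback."""
    puid = match_binary(data)
    if puid is not None:
        return puid, "signature"
    puid = match_extension(name)
    if puid is not None:
        return puid, "extension"
    return None, "unidentified"
```

With only the first step in place, a plain text file has no magic bytes to match and comes back unidentified, which is the behaviour described in the original question.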

TNA have a DROID UTF-8 validator: https://github.com/digital-preservation/utf8-validator

But you might be better off with a tool like File in Linux, or Siegfried which will provide information like this:

filename : 'Running DROID.txt'
filesize : 3083
modified : 2014-10-17T02:48:26+13:00
errors :
matches :
  - id : pronom
    puid : x-fmt/111
    format : 'Plain Text File'
    version :
    mime : 'text/plain'
    basis : 'extension match; text match ASCII'
    warning :

--

I think that should provide an okay summary to the best of my knowledge right now, TNA or someone else on the list may be able to add more.

Hope that helps.

Ross

Andy Jackson

Oct 28, 2015, 6:43:29 PM

to droid-list

Ross, thanks for the information about DROID and matchExtensions() - I hadn't realised I'd missed that part of the DROID logic when building the Nanite DroidDetector (https://github.com/openpreserve/nanite). I'll be able to make a fresh release that includes this soon.

As for identifying text, this is an area where I find combining DROID with other tools works well. Both the 'file' tool and Apache Tika take a sample from the start of a file and analyse it using heuristics that are pretty good at differentiating plain text from binary. These tools will also attempt to determine the character set/encoding that the text is using, which is handy. Personally, I start with the results from Apache Tika but then allow the results from DROID to override them when DROID makes a positive match via a binary or container signature.
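The override rule described above could be sketched roughly as follows. This is only an illustration in Python, not code from Nanite, Tika or DROID. The function name and arguments are hypothetical, and the method labels ("Signature", "Container", "Extension") are an assumption about how the DROID result is reported to the caller.

```python
# Methods treated as a positive DROID match (assumed labels).
POSITIVE_METHODS = {"signature", "container"}


def combine(tika_mime, droid_mime, droid_method):
    """Start from the Tika result; let DROID override it only when DROID
    matched via a binary or container signature."""
    if droid_mime and droid_method and droid_method.lower() in POSITIVE_METHODS:
        return droid_mime
    return tika_mime


# Plain text: DROID has no positive match, so the Tika result is kept.
print(combine("text/plain; charset=UTF-8", None, None))
# Container match: the DROID result replaces the Tika result.
print(combine("application/zip", "application/java-archive", "Container"))
```

An extension-only result from DROID is deliberately not treated as a positive match here, so the heuristic result from the text-aware tool is kept in that case.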

HTH,
Andy Jackson

Jerry Kuang

Nov 4, 2015, 1:53:52 PM

to droid-list

Thank you, Ross and Andy. You both provided very helpful information on this issue.

Could anyone from the National Archives answer this question?

Thank you in advance.

Jerry

Brian O'Reilly

Nov 5, 2015, 6:08:24 AM

to droid-list

Hi Jerry

Whilst the "no profile" syntax from the command line will not identify text files (or other files identified by external signature, i.e. extension), you can run a profile from the command line using a different set of flags, and this will make the identifications in a similar way to the GUI version of DROID:

e.g. > droid -a "C:\myFiles\myTextFilesDirectory" -p "C:\droid\profiles\textProfile.droid"

The -a flag can take either a directory or a file (including multiple entries). If you then open the generated profile in the GUI, you should see that text files have been identified by extension. Unfortunately, the "-q" flag has no effect in this approach, and in the current version a lot of Hibernate-related log information is output to the console. This will be fixed in the next release, though in the meantime you can redirect the output to a file, e.g. from the Windows command prompt:

> droid -a C:\myFiles\myTextFilesDirectory -p C:\droid\profiles\textProfile.droid > C:\droid\log.txt

Hopefully this will be of help to you.

Regards, Brian

Jerry Kuang

Nov 11, 2015, 11:08:40 AM

to droid-list

Hi Brian,

Appreciate your information.

Do you plan to include functions for identifying text and other formats by external signature in no-profile mode?

Another issue: I compared the profile-mode results with the no-profile results and found that the PUIDs returned for a jar file differ between the two approaches. Please refer to the screenshots.

Thank you

Jerry

Jerry Kuang

Nov 11, 2015, 11:41:42 AM

to droid-list

Hi Brian,

The filter flag does not offer many fields for me to further filter and process the results. I wonder whether you would consider adding filter fields here, such as FILE_PATH.

The no-profile mode identification results are neat and clear, since file name and PUID are enough for me, so if what I mentioned in my previous comment could be included in a future release, that would be excellent.

Thank you

Jerry

Jerry Kuang

Nov 11, 2015, 2:25:09 PM

to droid-list

Hi Brian,

Is there any way to filter the records whose URI starts with "zip:file"? Please see the following screenshot. It seems the current filter flag does not support filtering by URI.

Thank you

Jerry

ross-spencer

Nov 11, 2015, 5:44:16 PM

to droid-list

Hi Jerry,

RE: Filtering on URI - A feature of this analysis code I wrote is to pull apart the URI so that you can filter on exactly that https://github.com/exponential-decay/droid-sqlite-analysis

This was previously how I structured the database and so it wasn't exposed to users. I've quickly created an export mechanism so you can export a new deconstructed DROID CSV that will have the following columns:

ID PARENT_ID URI URI_SCHEME FILE_PATH DIR_NAME NAME METHOD STATUS SIZE TYPE EXT LAST_MODIFIED EXTENSION_MISMATCH MD5_HASH FORMAT_COUNT PUID MIME_TYPE FORMAT_NAME FORMAT_VERSION

I'd like the standard DROID output to include the URI scheme too, but I don't think I've requested this anywhere officially. Perhaps one for the issues list? Also, as an interface, adding a new column is likely to break compatibility with some users' code, and so that is one TNA would need to manage.
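If all that is needed is to pick out the rows whose URI starts with "zip:", a few lines of Python over a DROID CSV export may be enough. The sketch below is independent of droid-sqlite-analysis. The sample rows and file paths are made up, and it assumes only that the export has a header row with a URI column, as in the column list above.

```python
import csv
import io


def uri_scheme(uri):
    """Return the text before the first colon, e.g. 'zip' or 'file'."""
    return uri.partition(":")[0]


def rows_with_scheme(csv_text, scheme):
    """Keep only the rows whose URI column uses the given scheme."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if uri_scheme(row["URI"]) == scheme]


# Made-up sample data with a reduced set of columns.
SAMPLE = '''"ID","URI","NAME","PUID"
"1","file:/C:/data/archive.zip","archive.zip","x-fmt/263"
"2","zip:file:/C:/data/archive.zip!/readme.txt","readme.txt","x-fmt/111"
'''

for row in rows_with_scheme(SAMPLE, "zip"):
    print(row["NAME"], row["PUID"])
```

To drop the archive members instead of selecting them, the comparison in rows_with_scheme can be inverted.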

Ross

Jerry Kuang

Nov 12, 2015, 4:22:22 PM

to droid-list

Hi Ross,

The analysis tool for the CSV file is quite useful. Thank you for showing it to me. I will try it in my application.

Jerry

Brian O'Reilly

Jan 12, 2016, 11:43:07 AM

to droid-list

Hi Jerry
Sorry not to have come back to you earlier. The no-profile mode is intended to provide a lightweight alternative approach with basic functionality. There are no plans to include recognition by extension or other enhancements in the future, though we do review requests periodically so will bear it in mind (along with the filter suggestions you also raised). We're currently testing the next release and will look at the identification issue you raised. Possibly though you had different signature files in each case - in no profile mode, there is no default signature file and you have to specify the signature file to use via the -Nc flag.
Regards, Brian