markdown in PRONOM?


ross-spencer

Dec 13, 2017, 12:19:18 PM
to droid-list
Hi All - I notice Markdown isn't recognised as a format in PRONOM yet. Difficult one, because what signature could be used? And should there be another .md extension? What's the general consensus on these concerns? 

Ross Spencer

Dec 13, 2017, 12:21:23 PM
to droid...@googlegroups.com
Correction, '.md' doesn't appear as an extension. Just noticed that the extension search in PRONOM is a generous one. 



--
You received this message because you are subscribed to the Google Groups "droid-list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to droid-list+unsubscribe@googlegroups.com.
To post to this group, send email to droid...@googlegroups.com.
Visit this group at https://groups.google.com/group/droid-list.
For more options, visit https://groups.google.com/d/optout.

Matt Palmer

Dec 14, 2017, 3:53:55 AM
to droid-list
Hi Ross, I remember looking at textual format recognition when we were doing DROID 6.

The only way to do it efficiently, as far as I can see, is to create a new kind of signature. It's possible to recognise languages with keywords, and structured and semi-structured text with other heuristics. You also have to recognise the character encoding.
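To make the encoding point concrete, here is a minimal Python sketch of telling text from binary while guessing the encoding. It is an illustration only, not code from DROID or from anyone in this thread: the function name `sniff_text`, the list of tolerated control bytes, and the Latin-1 fallback are all assumptions.

```python
import codecs

# Byte-order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Control bytes that routinely appear in plain text (assumed list).
_TEXT_CONTROLS = b"\t\n\r\x0c"


def sniff_text(data: bytes):
    """Return a guessed encoding name if data looks like text, else None."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding
    # NUL and other control bytes are treated as evidence of binary content.
    if any(b < 0x20 and b not in _TEXT_CONTROLS for b in data):
        return None
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to a single-byte guess (an assumption).
        return "latin-1"
    return "ascii" if all(b < 0x80 for b in data) else "utf-8"
```

A real identifier would probably sample only part of a large file, in which case a multi-byte character cut off at the sample boundary needs handling; this sketch assumes it is given the whole content.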

I do have code to recognise text files as opposed to binary, extracting their encoding at the same time. We would also need efficient keyword recognisers, such as the Aho-Corasick search algorithm, to avoid grinding DROID to a halt.
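For readers unfamiliar with Aho-Corasick, the sketch below shows the idea in Python: all keywords are compiled into one automaton, and the text is then scanned once regardless of how many keywords there are. This is a teaching sketch under assumed names (`KeywordMatcher`, `find_all`), not the code referred to above and not anything that exists in DROID.

```python
from collections import deque


class KeywordMatcher:
    """Aho-Corasick automaton: finds every keyword in one pass over the text."""

    def __init__(self, keywords):
        self.goto = [{}]   # goto[state][char] -> next state (the keyword trie)
        self.fail = [0]    # fail[state] -> state for the longest proper suffix
        self.out = [[]]    # out[state] -> keywords that end at this state
        for word in keywords:
            self._add(word)
        self._build_failure_links()

    def _add(self, word):
        state = 0
        for ch in word:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(word)

    def _build_failure_links(self):
        # Breadth-first, so shallower states are finished before deeper ones.
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit keywords that end at the suffix state.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def find_all(self, text):
        """Yield (start_index, keyword) for each occurrence in text."""
        state = 0
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for word in self.out[state]:
                yield i - len(word) + 1, word
```

Usage is along the lines of `KeywordMatcher(["<?xml", "<html"]).find_all(text)`. Matching here is case-sensitive and works on decoded strings, so it presupposes that the encoding has already been identified.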

Definitely something which would be generally useful given how many text file formats there are. Would also be good for the existing HTML and XML binary signatures which are fairly inefficient and inaccurate at the moment.

Cheers

Matt

Matt Palmer

Dec 14, 2017, 3:54:44 AM
to droid-list
Of course, just adding the md extension would be a quick win...

Dclipsham

Dec 14, 2017, 10:00:19 AM
to droid-list
Hi Ross,

Generally speaking, we prefer to add new PRONOM entries where we also have a signature ready to go. Clearly there are a lot of 'uncharacterisable'* formats though, and we feel it would be better to have a descriptive entry without a signature than to have no entry at all. We tend to add these on request - see e.g. http://www.nationalarchives.gov.uk/PRONOM/fmt/1119, Jupyter Notebook, added in v93 at the request of the University of Edinburgh.

So, would you like to write and submit a description for Markdown? Next release will likely be late-Jan/early Feb.


David


*uncharacterisable given current PRONOM syntax limitations and signature identification methodology - I'm not wanting to preclude or diminish alternative identification approaches.

Dclipsham

Dec 14, 2017, 10:13:47 AM
to droid-list
On a side note, we ran an internal Machine Learning hackathon at The National Archives last week, and one of the teams looked at using ML approaches to attempt to distinguish between different types of source code (just using 'random' content from GitHub as both training and validation data). There were some promising early results, especially given the 2-day time-frame, and we'd like to look at it further.

Matt Palmer

Dec 14, 2017, 5:04:43 PM
to droid...@googlegroups.com
That sounds like a really interesting approach - I hope you do look into it more and let us know what you find!



Nick Krabbenhoeft

Jan 2, 2018, 1:25:41 PM
to droid-list
Just saw this thread, and chiming in to encourage publishing some of the results of the ML source code recognition.

Also, GitHub has a good list of common extensions for text-based formats, stored as YAML in its Linguist repo. Here's the entry for Markdown: https://github.com/github/linguist/blob/master/lib/linguist/languages.yml#L2612
Their recognized markdown extensions include .md, .markdown, .mdown, .mdwn, .mkd, .mkdn, .mkdown, .ron, and .workbook
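As a rough illustration of extension-only matching with that list, here is a small Python sketch. The function and constant names are made up for the example, and the set simply copies the extensions quoted above rather than reading Linguist's file.

```python
from pathlib import PurePath

# Extensions as quoted in this post; not read from Linguist itself.
MARKDOWN_EXTENSIONS = frozenset({
    ".md", ".markdown", ".mdown", ".mdwn", ".mkd",
    ".mkdn", ".mkdown", ".ron", ".workbook",
})


def has_markdown_extension(filename: str) -> bool:
    """Case-insensitive check of the final extension only."""
    return PurePath(filename).suffix.lower() in MARKDOWN_EXTENSIONS
```

This is no stronger than any other extension match: a renamed file defeats it, and an extension shared with another format would be reported as Markdown.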

Dclipsham

Jan 3, 2018, 4:49:51 AM
to droid-list
Thanks Nick. I think we're aiming to put something together for a conference submission. Not sure which conference yet though!

Thank you also for the Linguist link.

David

rspe...@artefactual.com

Feb 16, 2018, 5:42:24 PM
to droid-list
Hi David, 

What do you think of something like this as a description?

Markdown was created by John Gruber and Aaron Swartz circa 2004. The purpose of 
Markdown is to let users write clean text-based documents that do not suffer 
from the legibility issues of other 'markup' formats. 

Markdown uses combinations of characters, for example, hashes (pound signs), 
asterisks, and combinations of square and round brackets to prefix or suffix 
parts of text. The symbols provide instructions to an interpreter. A 
single hash '#' that prefixes a line of text, for example, is an instruction to
make that line a top-level header in a formatted document. Two hashes '##' are an
instruction to render, or output, a secondary header. And so on.

Ultimately the result of writing markdown is a document that can be parsed into 
a well-formed version of other presentation languages such as HTML or XHTML.

There is no single specification for Markdown, nor is there a single canonical
output. That is, Markdown syntax could be converted into many other file types. 

A good description of the background of Markdown and a list of Markdown 
'flavors' can be found on the Archiveteam Just Solve It File Formats Wiki: 

Wikipedia also provides a thorough description of Markdown and its syntax: 

For some reason I can't easily access the PRONOM server as I type (very slow response rates to my HTTP requests) so I can't verify if there is already an MD extension listed. Given the lack of signature that we can use, we should probably also avoid the risk of multiple identification like .TXT inside the database. 

Happy to re-draft this pending yours or community comments. 

Ross

Dclipsham

Feb 21, 2018, 5:27:45 AM
to droid-list
Thanks Ross, really thorough.

I'd like to change the word 'good' in the fifth paragraph for 'detailed' if that's okay. We try to avoid qualitative/judgment terms where we can.

Are you still experiencing the same symptoms with PRONOM behaving slowly? It's very zippy where I am, but perhaps to be expected. Are there any other actions that are taking a particularly long time to execute?

MD isn't yet in PRONOM. The latest release was November 2017.

David

Ross Spencer

Feb 21, 2018, 10:26:06 AM
to droid...@googlegroups.com
Hi David,

‘Detailed’ probably doesn’t fit for the same reason. Maybe ‘Further descriptions can be found here [x] and here [x].’ / ‘Additional’, etc.

Cheers, 
Ross

dclipsham

Feb 21, 2018, 12:11:44 PM
to droid...@googlegroups.com
Thanks Ross, 

I was thinking detailed in a literal 'contains detail' sense rather than the qualitative measure, but I can see the ambiguity so I'll likely go with 'additional' to avoid that ambiguity.

Thanks again,
David






Kuldar Aas

Jun 4, 2018, 3:08:06 AM
to droid-list
+1 for having DROID support "markup-signatures" or some other form of textual signature. The specific national problem we have is that much of our received content is digitally signed, i.e. in the local formats "ddoc" and "bdoc", which are effectively XML files. Currently we are forced to develop specific ingest workflow forks which detect the ddoc/bdoc files by their extension. It would be much nicer if we could get DROID to identify the format in a more trustworthy way and as part of the standard ingest workflow. We don't mind doing the signatures ourselves, but with the current bit-signature logic it's a nightmare.

(Kuldar from the National Archives of Estonia)