fmt/6, fmt/142 and wave_format_pcm


Andrea Byrne

Aug 3, 2016, 7:22:12 PM
to droid-list

Hello all,
My colleague, format enthusiast Ross Spencer, and I have been doing a little bit of research on WAVE files. I went down a rabbit hole and was wondering two things about PRONOM history and future directions re: signatures for WAVE files.

I have two WAVE files from our organization, Archives New Zealand. Both were encoded with PCM audio. However, one was made using Apple software and has a RIFF FLLR chunk with about 4k bytes of zeroes before the data chunk. The other file doesn't have the FLLR chunk, and the data chunk is 16 bytes after the format chunk. The file with the FLLR chunk identifies as fmt/6 (WAVE classic, I suppose!) and the one without identifies as fmt/142 (waveformatex). I am wondering if these are indeed two different formats, or if fmt/142 is the preferred signature and could be improved to include WAVE files that were encoded with the FLLR chunk [I guess this could be done by changing the sig from 52494646{4}57415645666D7420[!10]{3}[!FEFF]{16-1000}64617461 to something like 52494646{4}57415645666D7420[!10]{3}[!FEFF]{16-4000}64617461], or if the waveformatex signature was specifically developed to exclude WAVE files that have the FLLR chunk, because they are understood to be two different formats. I'm happy to share these files with interested parties.


Also, it looks like the fmt/142 files we have are not identifying as fmt/141 (PCMWAVEFORMAT) because the length of the format chunk is 18 bytes, not 16. Do the extra two bytes fundamentally change the file's format? In the same vein, I was wondering how useful it would be to develop a signature for wave_format_pcm, or instead to refine fmt/141 to include cases where the format chunk is longer than expected.
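
For anyone who wants to check their own files for the two things described above (an extra chunk such as FLLR between fmt and data, and a format chunk of 18 rather than 16 bytes), here is a minimal Python sketch of a RIFF chunk walker. The function names and the synthetic sample are made up for illustration; the sample is not one of the Archives New Zealand files, and the 4044-byte FLLR size is only a stand-in for "about 4k".

```python
import struct


def list_chunks(data: bytes):
    """Return (chunk_id, offset, size) for each top-level chunk of a RIFF/WAVE byte string."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    chunks = []
    pos = 12  # first chunk starts after 'RIFF', the 4-byte size and 'WAVE'
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        chunks.append((chunk_id.decode("ascii", "replace"), pos, size))
        pos += 8 + size + (size & 1)  # RIFF chunks are padded to an even length
    return chunks


def make_chunk(chunk_id: bytes, payload: bytes) -> bytes:
    """Build one chunk: 4-byte id, little-endian size, payload, optional pad byte."""
    pad = b"\x00" if len(payload) & 1 else b""
    return chunk_id + struct.pack("<I", len(payload)) + payload + pad


# Synthetic sample (an assumption for illustration): an 18-byte fmt chunk,
# a zero-filled FLLR chunk, then a tiny data chunk.
fmt18 = struct.pack("<HHIIHHH", 1, 2, 44100, 176400, 4, 16, 0)
body = (b"WAVE" + make_chunk(b"fmt ", fmt18)
        + make_chunk(b"FLLR", bytes(4044))
        + make_chunk(b"data", bytes(8)))
sample = b"RIFF" + struct.pack("<I", len(body)) + body

for chunk_id, offset, size in list_chunks(sample):
    print(chunk_id, offset, size)
```

On a real file you would pass the bytes read from disk instead of the synthetic sample.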

I'd be happy to discuss/research/develop this further if it's wanted by the community.

Thanks!
Andrea Byrne


Dclipsham

Aug 4, 2016, 9:16:22 AM
to droid...@googlegroups.com
Hi Andrea, thanks for the submission. 

On the FLLR chunk, judging by this thread: http://lists.apple.com/archives/coreaudio-api/2011/Aug/msg00052.html - it would appear that FLLR is an Apple-specific construct, and that its inclusion could inhibit WAV playback on certain decoders (search '"FLLR" chunk' on Google for plenty more examples of playback issues). 

For this reason, I'm wondering whether it might be worth having a specific PRONOM entry for such WAVs, noting the potential issue. I'd be interested to hear what others think.


Note that fmt/6 does not necessarily imply 'classic' WAVE. We would ordinarily expect a WAVE file to conform to one of fmt/141, fmt/142, or fmt/143 (assuming it's not BWAV or RF64, of course). Although a file identified as fmt/6 could be classic 'WAVEFORMAT', it is more likely that its structure does not conform to one of our more specific format types and is therefore noteworthy. For BWAV (e.g. fmt/1) we explain this in our 'generic' entry - "A BWAVE identified as generic by DROID likely uses an encoding format unknown to PRONOM, or perhaps a structural difference, and users are encouraged to let the PRONOM team know should this occur." It's probably useful for us to include similar text on fmt/6.
 

Yes, the 2 extra bytes do change the file type, as the 2 bytes signify the 'cbSize' member, but this difference may be immaterial. Microsoft cover this here: https://msdn.microsoft.com/en-us/library/windows/hardware/ff536383(v=vs.85).aspx 
"The only difference is that WAVEFORMATEX contains a cbSize member and PCMWAVEFORMAT does not. According to the WAVEFORMATEX specification, cbSize is ignored if wFormatTag = WAVE_FORMAT_PCM (because cbSize is implicitly zero); cbSize is used for all other formats. Thus, in the case of a PCM format, PCMWAVEFORMAT and WAVEFORMATEX contain the same information and can be treated identically." 
So a WAVEFORMATEX must contain the cbSize member but can have the format tag WAVE_FORMAT_PCM, whereas plain PCMWAVEFORMAT cannot have a cbSize member.
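
To make the distinction concrete, here is a rough Python sketch that names the structure a 'fmt ' chunk payload most plausibly holds, going only by its length and format tag. The function name and the returned labels are my own for illustration and are not PRONOM terminology. Note that the FEFF in the signatures is the on-disk little-endian form of the tag value 0xFFFE.

```python
import struct

WAVE_FORMAT_PCM = 0x0001
WAVE_FORMAT_EXTENSIBLE = 0xFFFE  # stored on disk as the bytes FE FF


def describe_fmt(payload: bytes) -> str:
    """Classify a 'fmt ' chunk payload by its length and wFormatTag."""
    if len(payload) < 16:
        return "shorter than PCMWAVEFORMAT"
    (format_tag,) = struct.unpack_from("<H", payload, 0)
    if len(payload) == 16:
        # 16 bytes leaves no room for a cbSize member
        return "PCMWAVEFORMAT layout (no cbSize)"
    if len(payload) < 18:
        return "unexpected length"
    (cb_size,) = struct.unpack_from("<H", payload, 16)
    if format_tag == WAVE_FORMAT_EXTENSIBLE:
        return "WAVEFORMATEXTENSIBLE"
    if format_tag == WAVE_FORMAT_PCM:
        # cbSize is present but ignored for PCM, per the Microsoft text quoted above
        return "WAVEFORMATEX with WAVE_FORMAT_PCM"
    return f"WAVEFORMATEX (cbSize={cb_size})"


pcm16 = struct.pack("<HHIIHH", 1, 2, 44100, 176400, 4, 16)
pcm18 = pcm16 + struct.pack("<H", 0)
print(describe_fmt(pcm16))
print(describe_fmt(pcm18))
```

The two sample payloads carry identical audio parameters; only the trailing cbSize differs, which is the case Andrea describes.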

A further useful reference is here:

David

Tyler Thorsted

Aug 4, 2016, 11:47:53 AM
to droid-list
All,

If I understand right, the standard allows for other chunks before the data chunk, the only requirement being that the fmt chunk comes first and the data chunk sometime after.

"The WAVE form is defined as follows. Programs must expect (and ignore) any unknown chunks encountered, as with all RIFF forms. However, fmt must always occur before data, and both of these chunks are mandatory in a WAVE file."
Since the FLLR chunk is simply a "filler", like the JUNK chunk, and doesn't contain anything important, I don't think it needs a special fmt ID. However, I think a good preservation plan would be to extract the existence of the chunk for migration planning, since it appears many parsers have trouble ignoring the chunk.

I was able to confirm that a WAVE file saved out with QuickTime gets the FLLR chunk added and that DROID then identifies it as fmt/6. Using BWF MetaEdit to add Broadcast Wave data takes priority in DROID, and the file comes out correctly as a BWAVE (fmt/703) with the FLLR chunk still intact.

My 2 cents.

Tyler

Dclipsham

Aug 4, 2016, 7:56:30 PM
to droid-list
Thank you Tyler. I think with Andrea's and your own investigations and examples, plus Dave Rice's input into the interpretation of the specification [see Twitter, but specifically https://twitter.com/dericed/status/761325854182010880], it will be appropriate to extend the 16-1000 byte wildcard to a 16-* wildcard, rather than create a new entry. Where format risk assessments exist (e.g. http://wiki.dpconline.org/index.php?title=File_Formats_Assessments), I would very much recommend that the playback risk evident in some decoders (where the decoder's interpretation is stricter than the specification itself allows) is explicitly acknowledged.
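
As a rough illustration of why the upper bound matters, the Python sketch below approximates the fmt/142 signature as a regular expression and tries it against a synthetic file with a zero-filled FLLR chunk. The translation of the PRONOM syntax into a regex is my own reading of it (in particular that [!FEFF] consumes two bytes), the helper names are hypothetical, and this is not how DROID matches signatures internally.

```python
import re
import struct


def wavex_pattern(max_gap):
    """Regex approximating the fmt/142 signature; max_gap=None stands for {16-*}."""
    upper = b"" if max_gap is None else str(max_gap).encode("ascii")
    return re.compile(
        rb"RIFF.{4}WAVEfmt "   # 52494646{4}57415645666D7420
        rb"[^\x10].{3}"        # [!10]{3}: chunk size whose first byte is not 0x10
        rb"(?!\xFE\xFF).{2}"   # [!FEFF]: format tag other than WAVE_FORMAT_EXTENSIBLE
        rb".{16," + upper + rb"}"  # {16-1000}, {16-4000} or {16-*}
        rb"data",              # 64617461
        re.DOTALL,
    )


def make_chunk(chunk_id: bytes, payload: bytes) -> bytes:
    return chunk_id + struct.pack("<I", len(payload)) + payload


# Synthetic file (an assumption for illustration): 18-byte fmt chunk,
# 4044 zero bytes of FLLR, then the data chunk.
fmt18 = struct.pack("<HHIIHHH", 1, 2, 44100, 176400, 4, 16, 0)
body = (b"WAVE" + make_chunk(b"fmt ", fmt18)
        + make_chunk(b"FLLR", bytes(4044))
        + make_chunk(b"data", bytes(8)))
sample = b"RIFF" + struct.pack("<I", len(body)) + body

for bound in (1000, 4000, None):
    matched = wavex_pattern(bound).match(sample) is not None
    print(bound, matched)
```

In this synthetic case the gap between the format tag and the data chunk id is 4068 bytes, so only the open-ended form matches; a fixed bound has to guess how large a filler chunk can be.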

Calvin Lawrence

Aug 26, 2016, 5:27:27 PM
to droid-list
Is the detailed information referred to in this thread generally available to users?  It would be really helpful to know the rationale behind different entries for a given format in the registry.  Otherwise, as far as I can tell, except for the PUID some of them are identical.  I looked at dpconline and it wasn't as clear as what is stated here.

Thank you,
Calvin

ross-spencer

Aug 28, 2016, 11:02:55 PM
to droid-list
Hi Calvin, David, Tyler, et al.


The addition of a wildcard signature wouldn't help from this perspective - but if that's what is needed then that is important to understand as well. 

We've our small set that I've put up on the OPF website - we've many more - but they're all likely to be pretty homogeneous coming from the same source. We'd need a more varied set to make more useful changes/compromises. For example, creating matching BOF and EOF signatures for chunks book-ending the audio data? - understanding if this is feasible etc. 

Does TNA or anyone on this list have access to a corpus, or corpora, or ability to make one, that could help test strategies for making WAV type signatures? 

Ross

Calvin Lawrence

Aug 29, 2016, 10:32:03 AM
to droid-list
If there are specific WAV formats that are needed I can check our collections for them.

Dclipsham

Sep 20, 2016, 6:04:34 AM
to droid...@googlegroups.com
For reference, below are all the current .wav extension signatures. I do not want to remove wildcards without a better option; as Dave Rice observed "[PRONOM]...always seems to presume from samples rather than specs" and I'm keen to correct that perception where possible. I believe the performance implications of the handling of legitimate wildcards are a problem for the tools, not the registry. WAV in particular would benefit from better handling of 'families' of formats that have similar signatures and I see that as a challenge for the identification algorithm. That said, I'm always open to suggestions for signature improvements that don't compromise accuracy.

On corpora, the bulk of the BWF work came via Jay at NLNZ, and Tyler et al at LDS Church. At The National Archives we don't receive many WAVs at all and I've used Audacity and various DAW tools for creating my own samples.

Wav Signatures

Broadcast 0 Generic - fmt/1

52494646{4}57415645*62657874{350}0000

Broadcast 0 MPEG - fmt/706

52494646{4}57415645*62657874{350}0000*666D7420{4}5000
52494646{4}57415645*666D7420{4}5000*62657874{350}0000

Broadcast 0 PCM - fmt/703

52494646{4}57415645*62657874{350}0000*666D7420{4}0100
52494646{4}57415645*666D7420{4}0100*62657874{350}0000

Broadcast 0 WAVEX - fmt/709

52494646{4}57415645*666D7420{4}FEFF*62657874{350}0000
52494646{4}57415645*62657874{350}0000*666D7420{4}FEFF

Broadcast 1 Generic - fmt/2

52494646{4}57415645*62657874{350}0100

Broadcast 1 MPEG - fmt/707

52494646{4}57415645*62657874{350}0100*666D7420{4}5000
52494646{4}57415645*666D7420{4}5000*62657874{350}0100

Broadcast 1 PCM - fmt/704

52494646{4}57415645*62657874{350}0100*666D7420{4}0100
52494646{4}57415645*666D7420{4}0100*62657874{350}0100

Broadcast 1 WAVEX - fmt/710

52494646{4}57415645*666D7420{4}FEFF*62657874{350}0100
52494646{4}57415645*62657874{350}0100*666D7420{4}FEFF

Broadcast 2 Generic - fmt/527

52494646{4}57415645*62657874{350}0200

Broadcast 2 MPEG - fmt/708

52494646{4}57415645*62657874{350}0200*666D7420{4}5000
52494646{4}57415645*666D7420{4}5000*62657874{350}0200

Broadcast 2 PCM - fmt/705

52494646{4}57415645*62657874{350}0200*666D7420{4}0100
52494646{4}57415645*666D7420{4}0100*62657874{350}0200

Broadcast 2 WAVEX - fmt/711

52494646{4}57415645*666D7420{4}FEFF*62657874{350}0200
52494646{4}57415645*62657874{350}0200*666D7420{4}FEFF

EXIF Audio 2.0 - x-fmt/397

52494646{4}57415645*666D7420{4}(01|07|11)00*4C495354{4}6578696665766572{4}30323030*64617461

EXIF Audio 2.1 - x-fmt/389

52494646{4}57415645*666D7420{4}(01|07|11)00*4C495354{4}6578696665766572{4}30323130*64617461

EXIF Audio 2.2 - x-fmt/396

52494646{4}57415645*666D7420{4}(01|07|11)00*4C495354{4}6578696665766572{4}30323230*64617461

WAVE - fmt/6

52494646{4}57415645*666D7420{18-*}64617461 (N.B. simplifying to just 52494646{4}57415645 (RIFF{4}WAVE) as suggested by NLNZ. This PUID's purpose as a catchall for WAVs that don't conform to a more specific entry is better served by a simpler signature)

RF64 - fmt/712

52463634FFFFFFFF5741564564733634 (N.B. changing 0xFFFFFFFF to {4} as per MIT Libraries observations)

RF64 MBWF - fmt/713

52463634FFFFFFFF5741564564733634*62657874 (N.B. changing 0xFFFFFFFF to {4} as per MIT Libraries observations)

WAV PCM - fmt/141

52494646{4}57415645666D7420100000000100{14-1000}64617461 (N.B. changing to {14-*} as per Andrea's submission)

WAV WAVEX - fmt/142

52494646{4}57415645666D7420[!10]{3}[!FEFF]{16-1000}64617461 (N.B. changing to {16-*} as per Andrea's submission)

WAV WAVEXTENSIBLE - fmt/143

52494646{4}57415645666D7420{4}FEFF{38-1000}64617461 (N.B. changing to {38-*} as per Andrea's submission)
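
For readers less used to reading the hex, here is a small Python sketch that decodes the fixed byte sequences used in the signatures above into their ASCII chunk ids, and the two-byte format tags into their little-endian values. The helper names are just for illustration.

```python
# Fixed byte sequences that appear in the signatures listed above.
TOKENS = [
    "52494646", "57415645", "666D7420", "64617461", "62657874",
    "4C495354", "6578696665766572", "52463634", "64733634",
]


def decode_token(hex_token: str) -> str:
    """Turn a run of hex digits from a signature into its ASCII text."""
    return bytes.fromhex(hex_token).decode("ascii")


def format_tag(hex_token: str) -> int:
    """Read a two-byte wFormatTag as stored on disk (little-endian)."""
    return int.from_bytes(bytes.fromhex(hex_token), "little")


for token in TOKENS:
    print(token, repr(decode_token(token)))

for tag in ("0100", "5000", "FEFF"):
    print(tag, hex(format_tag(tag)))
```

So 0100 is tag 0x1 (PCM), 5000 is 0x50 (MPEG) and FEFF is 0xfffe (WAVE_FORMAT_EXTENSIBLE).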

ross-spencer

Sep 21, 2016, 1:01:28 AM
to droid-list
It's an understandable position if I can state the inverse: 'to receive (create signatures) from specs rather than to presume from samples', and you do say, where possible.

But on the broader point, for developers of signatures, it seems that this position could be in danger of not reflecting every way signatures can or may need to be developed:
  • for example, handling implementations which output files differently from the specification
  • testing against variants in the specification for which samples cannot be found (gold-plating?)
  • and, reading a specification from left to right, or BOF to EOF, 'up to 1GB+- of (x) type of data' does read like something that requires a wildcard. {0-x} style signatures shouldn't help performance here either. BUT, if you read from EOF to BOF, which the specification probably won't do, then it may be possible to express the bytes immediately after a wildcard (*) as a more precise offset from EOF, which follows the specification and its implementations (samples) and meets the requirements regarding accuracy. These are alternatives I think would be reasonable to seek over time, depending on what the samples show us.
Regarding the tools and performance, one of the main issues I raise in the report above is not simply that DROID is slower when we introduce a certain number of similar signatures each with a wildcard, but also that, with its default settings, DROID won't always match wildcard (*) signatures because of its initial maximum bytes-to-scan value. For audio type data, as an example, we will see this happen because of the size of its payload. I think it is here that the interchange between registry/tool/accuracy is thrown into relief.
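
A toy Python sketch of that scan-limit effect, under stated assumptions: the regex is my own approximation of the fmt/1 signature, the 65536 limit is only an illustrative value, and the scan is simplified to a window from BOF only, which is not a description of how DROID actually applies its setting. The synthetic file puts the bext chunk after a 200,000-byte data payload.

```python
import re
import struct
from typing import Optional

# Approximation of 52494646{4}57415645*62657874{350}0000 (fmt/1)
BWF_V0 = re.compile(rb"RIFF.{4}WAVE.*bext.{350}\x00\x00", re.DOTALL)


def matches(data: bytes, max_scan: Optional[int]) -> bool:
    """Match against the first max_scan bytes only; None means scan everything."""
    window = data if max_scan is None else data[:max_scan]
    return BWF_V0.match(window) is not None


def make_chunk(chunk_id: bytes, payload: bytes) -> bytes:
    return chunk_id + struct.pack("<I", len(payload)) + payload


fmt16 = struct.pack("<HHIIHH", 1, 2, 44100, 176400, 4, 16)
body = (b"WAVE" + make_chunk(b"fmt ", fmt16)
        + make_chunk(b"data", bytes(200_000))   # large audio payload
        + make_chunk(b"bext", bytes(602)))      # bext after the payload, version 0
sample = b"RIFF" + struct.pack("<I", len(body)) + body

print(matches(sample, 65536))  # False: 'bext' lies beyond the scanned window
print(matches(sample, None))   # True: an unlimited scan reaches it
```

The signature is correct in both runs; only the scan window decides whether it can be seen to match.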

I agree about working towards better alternatives, but collecting samples is difficult (like here, collecting so many WAV files across such a corpus of signatures), so it's hard when we can't all make them available to each other. I previously produced a list of all wildcard signatures and made a call for doing this work as a community. In case the list is useful it is here: https://github.com/exponential-decay/digital-preservation-stage-boss-one/blob/master/wildcard-signature-information/PRONOM-wildcard-signatures-v86.csv - perhaps there's a hack-a-thon in this someday.

Also, if it helps, when your new release is out as described above, I can re-run the tests done in the original report to understand the impact, which might be useful in any future discussions around tools and their performance.



Matt Palmer

Sep 21, 2016, 8:05:00 AM
to droid-list
One suggestion for a future DROID version might be to have a "max scan length override" for particular signatures/formats.  This could be supplied in a separate file, or in the configuration.

It would allow DROID to scan for longer for particular format signatures - so you could turn on infinite scanning for wave files, but keep a more restricted approach for formats which don't need it.  Best of both worlds...?
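
To show the shape of the idea only, here is a minimal Python sketch of such an override table. Everything in it is hypothetical: the names, the default value and the choice of PUIDs are made up for illustration and do not correspond to any existing DROID setting or file format.

```python
from typing import Optional

DEFAULT_MAX_SCAN = 65536  # illustrative default, an assumption

# Hypothetical per-signature overrides; None means "scan the whole file".
SCAN_OVERRIDES = {
    "fmt/6": None,
    "fmt/141": None,
    "fmt/142": None,
    "fmt/143": None,
}


def scan_limit(puid: str) -> Optional[int]:
    """Return the scan limit to use when testing the signature for this PUID."""
    return SCAN_OVERRIDES.get(puid, DEFAULT_MAX_SCAN)


def scan_window(data: bytes, puid: str) -> bytes:
    """Return the part of the file a signature for this PUID would be tested against."""
    limit = scan_limit(puid)
    return data if limit is None else data[:limit]


payload = bytes(100_000)
print(len(scan_window(payload, "fmt/142")))  # 100000
print(len(scan_window(payload, "fmt/18")))   # 65536
```

The table could equally live in a separate file or in the configuration, as suggested above.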

Regards,

Matt.