Loss of streams when normalizing MXF to MKV


Scott Shepherd

Dec 28, 2018, 1:21:19 PM
to archivematica
Hi all,

Looking at the Preservation Planning section of the Archivematica demo/sandbox, it appears the default rule for MXF files is to normalize to FFV1/MKV for preservation. But if the original MXF file has caption or timecode/data streams, don't they get stripped out of the MKV? That's what happens whenever I test with ffmpeg on my own, and it concerns me because we want the normalized preservation master to maintain the essential characteristics of the original file, including the additional streams.

How do you deal with this issue? Do you just accept the loss and move on (knowing you still have the original MXF), or is there another option I'm missing?

See attached for examples of the types of streams I'm talking about in the original MXF files.

Thanks,
Scott
Attachments: MXF timecode streams.PNG, MXF text streams.PNG

Ashley Blewer

Dec 28, 2018, 1:40:57 PM
to archiv...@googlegroups.com
Hey Scott!

Thanks for asking this question.

You are right that the ffmpeg script we use is going to strip out the timecode data streams. If they are important to you, you could possibly extract them (use the ffmpeg -map feature to pull them out) and carry them as additional sidecar files, or add them to the Matroska file as an Attachment (anything can be added as an Attachment, and subtitle tracks are typically stored there). That would allow you to keep the data within the Matroska file, but it would require extra steps and modification of the Archivematica Format Policy Registry.
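Roughly, something along these lines (untested, and the filenames and the 0:d:0 stream index are just placeholders; check your file with ffprobe first):

  # see which data stream you want to pull out
  ffprobe -hide_banner input.mxf

  # copy the first data stream into a sidecar file (the raw "data" muxer just dumps the bytes)
  ffmpeg -i input.mxf -map 0:d:0 -c copy -f data timecode_sidecar.bin

  # attach the sidecar to the normalized Matroska file as an Attachment
  ffmpeg -i normalized.mkv -map 0 -c copy -attach timecode_sidecar.bin -metadata:s:t mimetype=application/octet-stream output.mkv

Whether the data stream copy works will depend on how ffmpeg handles that particular data codec, so definitely test on your own files.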

If you're interested in exploring this further, I can recommend the CELLAR working group at the IETF, which is working on exactly these kinds of issues for archivists while Matroska is being standardized. It might be helpful to bring this up with the group; some of the experts there might have an answer on the best way to do this mapping if you are interested in the benefits of FFV1+MKV as a supplemental normalized file and also require the timecode data stream.

Something else to consider is the long-term storage cost of keeping both the MXF and the FFV1/MKV file. They'll both be quite big, so it may not be within budget to normalize these files; it might make more sense to double down on your preferred format instead.

Anyone else have experience with timecode data?

Ashley



--
Ashley Blewer
AV Preservation Specialist
Artefactual Systems, Inc.

l...@lrcd.com

Dec 28, 2018, 2:05:55 PM
to archivematica
Hi,


On Friday, December 28, 2018 at 9:21:19 AM UTC-9, Scott Shepherd wrote:

See attached for examples of the types of streams I'm talking about in the original MXF files.

Can you provide a link to a short sample MXF file that contains these streams?

Scott Shepherd

Dec 28, 2018, 2:54:21 PM
to archiv...@googlegroups.com
Lou, I can't share the files I have on hand, but there is one here under "MXF Samples" that has timecode streams: http://www.freemxf.org/ . It doesn't have the text/caption streams, unfortunately.

Ashley, thanks for the quick response. I have tried mapping and have had trouble, but it may just require some more learning. I may reach out to CELLAR as well, as you suggested. I'm not even sure of the value of the timecode streams or why you'd need more than one; I'm just concerned about losing what's already there. My bigger concern is the caption streams--they contain closed captions in multiple languages and I'd hate to lose that. I do have one potential (though very manual) method of extracting the captions as sidecar files, which might be a good last-resort option. You also raise a good point about whether it would make more sense to just keep the original MXF as-is. All good things to consider at this early stage.

Scott


Scott Shepherd

Jan 4, 2019, 11:59:22 AM
to archivematica
All,

I'm still curious if anyone has experience normalizing to MKV and losing information. It's not just MXF files, though those have been the biggest issue for me. I also see, for example, that .MTS files from video cameras have a data stream that appears to contain the date and time of each frame. Going to MKV strips this stream.
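For anyone curious, this is how I've been checking which streams are there to lose (filenames are just examples):

  # list every stream, with its type and codec
  ffprobe -hide_banner -show_entries stream=index,codec_type,codec_name,codec_tag_string input.MTS

  # or just the data streams
  ffprobe -hide_banner -select_streams d -show_streams input.MTS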

Is anyone actually implementing this process of normalizing to FFV1/MKV, and have you experienced this type of loss? If so, do you just accept the loss knowing that the essence of the video is preserved or do you have a work-around?

Thanks,

Scott Shepherd
Audiovisual Preservation - Church History Department
The Church of Jesus Christ of Latter-day Saints

Kieran O Leary

Jan 4, 2019, 12:14:53 PM
to archiv...@googlegroups.com
Hi


On Fri, 4 Jan 2019, 16:59 Scott Shepherd <sheph...@gmail.com> wrote:
All,

I'm still curious if anyone has experience normalizing to MKV and losing information.

We normalise uncompressed video and DPX to FFV1/MKV in the Irish Film Institute. We are not Archivematica users, but we deal with the issue you are experiencing: we lose the timecode data track that exists in our source v210/MOV files due to the current lack of data track support in Matroska.
We can accept this loss as these timecode tracks only contain the starting timecode value anyhow; all other timecodes are generated based on duration from the first frame and the frame rate. ffmpeg stores the starting timecode value as a metadata tag, so we have access to the key piece of metadata and can easily create a timecode track from that tag.
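As a rough illustration (example paths, not our exact script, and the FFV1 settings here are just common choices):

  # read back the starting timecode value that ffmpeg exposes as a metadata tag
  ffprobe -hide_banner -show_entries format_tags=timecode:stream_tags=timecode input.mov

  # a typical v210/MOV to FFV1/MKV normalisation, copying the audio as-is
  ffmpeg -i input.mov -map 0:v -map 0:a -c:v ffv1 -level 3 -g 1 -slices 16 -slicecrc 1 -c:a copy output.mkv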
With dpx, we use rawcooked to ensure that key metadata is retained by storing it externally.

We experimented with normalising ProRes/MOV to FFV1/MKV and decided that it was not a good idea right now, as things like clean aperture values are not currently migrated to Matroska.

So it's a case-by-case basis. If you are experiencing loss in the normalisation and it's a significant loss, maybe don't normalise?

Best,

Kieran O'Leary


Scott Shepherd

Jan 4, 2019, 4:09:39 PM
to archivematica
Kieran,

Thank you for your insight. It's very helpful to understand what you've experienced. We're just exploring the concept of normalization and trying to identify kinks such as this so we can decide how to handle them. Ideally, we'd normalize everything to a common format in order to reduce the risk of obsolescence and minimize the number of formats in the digital repository. But then there's the reality we have to consider; it may be that there are some outliers we have to treat differently.

Scott Shepherd

Kieran O Leary

Jan 4, 2019, 5:34:02 PM
to archiv...@googlegroups.com
Hi Scott,

We are in a similar boat. I don't know a huge amount about your MTS files, but I'd imagine there might be some self-description within the bitstream of those files that would not survive re-encoding to FFV1 or anything else.
I am hoping to formalise all this into a policy soon; perhaps I could run it by you, as I think we are doing similar research? The policy will be something along the lines of:
1. Aim for complete reversibility (zip/tar/RAWcooked).
2. If #1 is not possible, and essential metadata and context will not be lost, it is appropriate to normalise the bitstreams if they can be proven to decode to the same RGB values (like v210 producing the same framemd5 values as the normalised FFV1/MKV; see the sketch after this list).
3. If #2 is not possible, attempts should be made to add support for such a normalisation in an open source project such as FFmpeg.
4. If #3 is not possible, normalisation should not be carried out, as the process cannot be classed as lossless in any sense. Attempts should be made to ensure that open source playback software can render the file correctly.
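A rough sketch of the framemd5 check in #2 (example filenames):

  # per-frame checksums of the decoded video from the source file
  ffmpeg -i original_v210.mov -map 0:v -f framemd5 original.framemd5

  # the same for the normalised file, then compare the two
  ffmpeg -i normalised_ffv1.mkv -map 0:v -f framemd5 normalised.framemd5
  diff original.framemd5 normalised.framemd5

The hashes should match frame for frame if the normalisation is lossless. Timestamp columns can differ between containers, so in practice you may need to compare just the hash column rather than diffing the whole files.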


Something like that, just off the top of my head. It will need to be refined more for sure to account for corner cases.

Best,

Kieran.

Ashley Blewer

Jan 4, 2019, 5:42:25 PM
to archiv...@googlegroups.com
Hey all,

Kieran, thank you so much for these use cases; I think they are very helpful.

For some context, RAWcooked is an application that leverages the preservation and compression features of Matroska, FFV1, and FLAC for storage but allows the files to be extracted back to their original content, and it works with a variety of DPX files/flavors!
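Basic usage is roughly along these lines (check the RAWcooked documentation for the exact options and where the output lands):

  # encode a DPX sequence folder into a single FFV1/Matroska file
  rawcooked /path/to/dpx_folder/

  # decode the resulting MKV back to the original DPX files to confirm reversibility
  rawcooked dpx_folder.mkv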

Ashley

Scott Shepherd

Jan 4, 2019, 6:22:01 PM
to archivematica
Kieran,

I like your idea of a decision tree that describes what's ideal, then descending from there. We were using a very complicated decision tree to sort out incoming born-digital files (looking at the codec in addition to the container) and normalizing only the most at risk. What we're hoping to do in the future is to greatly simplify that tree by normalizing ALL incoming born-digital files to one (or very few) preservation formats and keeping the original as well. The following document makes an excellent case for this approach, pointing out that the alternative "wait and see" approach is very risky: https://www.archivematica.org/download/EvelynMcLellan-PreservationPlanning.pdf

Some of my thinking has centered on which properties of the original file absolutely must be maintained and which we are okay with changing or losing. The timecode piece I'm less certain about, but, as in my initial post, I am very concerned about losing things like closed caption data. For example, I have a DNxHD/MXF file with three different caption languages embedded in the 436M track of the MXF. Technically, captions can always be re-created, but that's sure a lot of work. That's why I'm thinking of suggesting to our team that even if we normalize most things to MKV, we may have reason to use MXF as an alternative (especially if the original is already an MXF).
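For reference, commands along these lines are the kind of thing I've been testing (filenames are examples). As far as I can tell, the lavfi "subcc" trick only applies when the captions are also embedded in the video essence as CEA-608/708, so I don't know that it helps when they live only in a separate 436M data track:

  # enumerate the data tracks, i.e. the caption/timecode streams that would be dropped
  ffprobe -hide_banner -select_streams d -show_streams input.mxf

  # pull embedded CEA-608/708 captions out to an SRT sidecar
  ffmpeg -f lavfi -i "movie=input.mxf[out0+subcc]" -map 0:s:0 captions.srt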

Scott

Andrew Berger

Jan 4, 2019, 8:31:59 PM
to archivematica
Hi all,

It sounds like we have some similar MXF files and we have decided not to normalize them at this point. Partly this is because of the same issues surrounding captions and other streams, but our thinking has also been informed by looking at the larger processes that produce these files, at storage costs, and at potential contexts for re-use. Taking each of these considerations in order:

1. The processes that create the files

In our case, the MXF files represent television shows where each episode is an edited version of a live event held by the museum (generally a talk or panel discussion). The creation of these files is outsourced to a third-party vendor who specializes in meeting broadcast requirements. It proved very difficult to create these particular files in-house - all production up to the creation of the MXF is done in-house - and having to re-do it would be a real burden. Even if we normalized, we would keep the MXF for at least as long as it met broadcaster needs.

One thing we might change, since we're in the position of creating the files in the first place, is having captions delivered as sidecar files in addition to being embedded in the MXF files. The most recent delivery of episodes included subtitles as .scc files but this does not seem to have been the result of a deliberate request. If you have input into how files are delivered, it might be worth trying to go further up the workflow to get captions in a format that could be later combined with normalized MKV.
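If the captions do arrive as .scc sidecars, my understanding is that reasonably recent ffmpeg can read them, so combining them with a normalized MKV later might look something like this (untested sketch, example filenames):

  # convert the SCC captions to SRT
  ffmpeg -i captions.scc captions.srt

  # mux the converted captions into the normalized Matroska file as a subtitle track
  ffmpeg -i normalized.mkv -i captions.srt -map 0 -map 1 -c copy output.mkv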

2. Storage costs

In my experience, transcoding to FFV1/MKV from a format like MXF increases the file size, sometimes substantially. Of course this depends on the specific characteristics of the source, but while I've seen significant savings going from v210/uncompressed to FFV1/MKV, I haven't seen that coming from most other digital video sources in our collection[1]. Combined with keeping the original MXF, normalization would add significantly to storage costs.

As a side note, there would also be an additional transcoding step if we were to store as MKV but deliver in another format. This would be difficult to fit into our current workflows, though it might not be a problem in your context. If we were to store both the original and the normalized file, then the question is largely around storage costs and risk management via file format choices[2]. 

3. Contexts of re-use

The MXF files aren't the only files we preserve from live events. There are also edit masters (usually in Prores) and highly compressed H264 versions that are uploaded to Youtube. The MXF files are actually shorter than the other files because they've been edited to fit a television time slot. Most requests for partial footage of an event actually go through the Prores file.

This means the most likely re-use scenario for the MXF version is for re-broadcast, in which case it likely would need to be delivered as MXF again. This actually came up once last year when a broadcast partner had an issue with their asset management system and couldn't access two episodes previously sent them. We sent them the MXF files, on a hard drive via overnight delivery, and they were able to fulfill their broadcast schedule. So even though we don't have a lot of internal need for the MXF files, they still serve a purpose as MXF and are likely to continue to do so for a while. 

Apologies for going on at length about considerations that might not apply to you, but I think it can be worth looking beyond technical file format assessments when analyzing whether or not to normalize. I should also add that our decision not to normalize is specifically not to normalize at the time of ingest. The question can be reopened, and I would welcome a clear normalization target format for born digital video in the future.

Andrew Berger
Computer History Museum

[1] This also generally happens with Prores, H264, and DV, which we also do not normalize. In a context where storage costs were less of an issue, we might save the originals plus normalized files. But at the moment we are relying more on the preservation and maintenance of A/V playback software, rather than particular formats, for continued access in the future. We do check that every file can be rendered in ffmpeg and/or VLC before ingest. 

[2] We are going to move to FFV1/MKV for digitization, as the storage savings are substantial over v210 and offset any extra file management work that this might create. At the moment, our in-house team needs files in a format like Prores to be able to use them with their editing tools but FFV1/MKV could be more widely supported in the future.

Kieran O Leary

Jan 4, 2019, 8:34:54 PM
to archiv...@googlegroups.com
Hi
Replies below.
On Fri, 4 Jan 2019, 23:22 Scott Shepherd <sheph...@gmail.com> wrote:
Kieran,

I like your idea of a decision tree that describes what's ideal, then descending from there. We were using a very complicated decision tree to sort out incoming born-digital files (looking at the codec in addition to the container) and normalizing only the most at risk.

What were most at risk? There are very few AV formats out there that aren't supported by FFmpeg.

What we're hoping to do in the future is to greatly simplify that tree by normalizing ALL incoming born-digital files to one (or very few) preservation formats and keeping the original as well.

Keeping the original is a great way to go. We kept the original XDCAM EX cards when we migrated the AV streams to single concatenated MKV files, but for DPX/v210 we delete the originals and only keep the normalised FFV1/MKV. We keep three copies on LTO, so retaining originals plus normalised copies can have serious storage repercussions.
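For anyone curious, one way to do that kind of card concatenation is ffmpeg's concat demuxer; a simplified sketch, not necessarily exactly what we ran:

  # mylist.txt lists one clip per line, in order:
  #   file 'clip_001.MP4'
  #   file 'clip_002.MP4'
  ffmpeg -f concat -safe 0 -i mylist.txt -c copy concatenated.mkv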

The following document makes an excellent case for this approach, pointing out that the alternative "wait and see" approach is very risky: https://www.archivematica.org/download/EvelynMcLellan-PreservationPlanning.pdf

Some of my thinking has centered on which properties of the original file absolutely must be maintained and which we are okay with changing or losing. The timecode piece I'm less certain about, but, as in my initial post, I am very concerned about losing things like closed caption data. For example, I have a DNxHD/MXF file with three different caption languages embedded in the 436M track of the MXF. Technically, captions can always be re-created, but that's sure a lot of work. That's why I'm thinking of suggesting to our team that even if we normalize most things to MKV, we may have reason to use MXF as an alternative (especially if the original is already an MXF).

That does sound like a case where you should just keep the original, alright.
Minimizing the number of file formats in your care is definitely preferable, in my opinion.

Best,

Kieran

Stephen McConnachie

Jan 5, 2019, 11:24:26 AM
to archivematica
Hi Scott, Ashley, Kieran, Andrew,

What a great discussion! I often feel we don’t talk about normalisation pros and cons and strategies enough in the a/v domain, so this is ace.

I loved Kieran Kunhya's presentation at NTTW3 last autumn, where one of the messages was 'Archivists, relax (a bit): FFmpeg has you covered for most things, it's going nowhere, it's written in C, etc.'

We joked at work afterwards that it should be called Plenty Of Time To Wait... But it did seriously give me a reassuring feeling about normalising or retaining: I totally agree with Kieran’s case-by-case approach, and I’d emphasise too that institutional context is important, and can mean there’s no absolute right and wrong.

For example, we acquire AS-11 50i MXFs from UK broadcasters, compliant with the industry-wide Digital Production Partnership standard. We definitely don't aim to normalise those to FFV1/MKV; it seems counter-productive to do that. But we might normalise MXFs acquired from a less standards-facing production context. And we are about to start normalising our uncompressed v210 MOVs from mass videotape digitisation workflows to FFV1/Matroska, and like Kieran we accept the loss of some ancillary data in the output files.

Maybe we could have a normalisation strand in NTTW4....

Cheers,
Stephen

Ashley Blewer

Jan 5, 2019, 1:33:58 PM
to archiv...@googlegroups.com
Andrew and Stephen, thanks for the excellent thinking around your decision-making when it comes to normalization! I think this discussion is broadly beneficial for institutions.

As I mentioned earlier, normalization in the context of Archivematica means 1) making a copy of the original and not deleting the original and 2) making an access copy. In Archivematica, if you choose to normalize, you can normalize for preservation and/or for access (which is h264/mp4 by default, IIRC). Outside of Archivematica, normalization often implies keeping only the normalized format, so I just feel that is important to mention. Storage size and cost are absolutely massive when dealing with a/v assets, so I expect most institutions do not have the capacity to normalize for preservation in the Archivematica context, because the file sizes are huge.

I'm not going to take a hard stance in any direction, because fundamentally it is up to each institution to research and understand their needs as they relate to decisions around this complex and controversial(?) topic, such as existing resources for computer processing, storage capacity, collecting policies, limiting the file formats their successors will have to worry about in the future, etc. Some institutions have decided to perform normalization, but many do not. I think the points laid out by Andrew above are great to consider when doing the preservation planning work prior to getting started with a major initiative, and I agree wholeheartedly about the benefits of MKV/FFV1/LPCM at the digitization stage, when an archivist is forced to make such a strict decision for the digital life of an originally analog asset.

I also agree with the decision to keep large collections of files in their own formats -- this isn't a "there must be one format to rule them all" kind of situation, and it depends on the content, the provenance, the collection missions of the institution, all the stuff you learn about in archives school, basically. The slide deck Scott references is a high-level overview of the decision-making process Artefactual went through years ago when choosing which formats are best suited as default options for optimal formats, but that only applies if normalization is what the institution wants. All of the normalization options are heavily configurable in the Format Policy Registry, also known as the Preservation Planning tab, so you can normalize to any ol' thing you want.

We are a complex and broad field, which is why Archivematica is set up with so many decision points, allowing each institution to decide what is best for them!

Ashley


Ashley Blewer

Jan 5, 2019, 1:36:35 PM
to archiv...@googlegroups.com
Oh, and for people following along, here is the video that Stephen mentions, from the No Time To Wait conference. This was one of my favorite talks of the whole conference, because it gives some context on the process of reverse-engineering proprietary video codecs for FFmpeg.

Scott Shepherd

Jan 7, 2019, 6:08:07 PM
to archivematica
Kieran and all,

I love the statement in the PDF I mentioned before about the advantages of normalizing at the point of ingest: "Adopting a wait and see approach means putting off an undefined amount of work for an indefinite period of time at an unknown cost." 

We get files donated from all over the world in every format imaginable. After watching our digital team write and attempt to maintain such a vast decision tree, I had an epiphany: in the time they were spending determining risk levels for hundreds of AV and non-AV file types, we could have normalized them ALL and gone to lunch! Well, you know what I mean. The human work goes into developing the process, which can then be automated. This frees up human thinking for other tasks.

That's why I'm exploring the idea of normalizing everything upfront. I figure if we have the original AND a normalized copy, we're covered if either becomes obsolete. I realize that not every institution can keep both, and some of the responses to this post have made good arguments for alternative approaches. But I still think the "normalize upfront" concept is worth thinking through to whatever extent makes sense. Maybe you can't normalize video, but what about audio where the file sizes are so much smaller? It doesn't have to be an all-or-nothing kind of thing. 
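As a concrete example of the audio case, something as simple as this covers a lot of incoming formats (example filenames; I believe Archivematica's default audio preservation target is WAVE, but don't quote me on that):

  # normalize a compressed audio file to 24-bit PCM in a WAV container
  ffmpeg -i input.mp3 -c:a pcm_s24le output.wav

  # or to FLAC, if lossless compression is preferred
  ffmpeg -i input.wav -c:a flac output.flac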

To echo what others have said, each institution has to consider its own needs and resources in conjunction with the technical aspects of preservation. Thanks to all who have contributed to this discussion. It is very enlightening. 

I second the idea of having a normalization discussion at NTTW. 

Scott