Writes with old schema fail after dataset update


Chris Hirstein

unread,
May 19, 2015, 6:49:20 PM5/19/15
to cdk...@cloudera.org
We are seeing writes fail with an AvroRuntimeException when the dataset has been updated with a new schema but the data is written with the old schema. It's due to a deep copy in the DatasetRecordWriter [1]: the copy uses the schema from the dataset descriptor rather than the schema from the caller, which leads to the exception even when the two schemas are compatible. Could there be a check to see whether the dataset's schema is compatible with the schema the data is being written with?
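
For illustration, here's roughly the shape of what we're doing (the dataset URI, schema files, and field names below are made up):

    // Records are built with the old schema the job was compiled against,
    // but the dataset descriptor has since been updated to a newer,
    // compatible schema.
    Schema oldSchema = new Schema.Parser().parse(new File("user_v1.avsc"));

    Dataset<GenericRecord> users = Datasets.load(
        "dataset:hdfs:/data/users", GenericRecord.class);

    GenericRecord record = new GenericRecordBuilder(oldSchema)
        .set("id", 1L)
        .set("name", "example")
        .build();

    DatasetWriter<GenericRecord> writer = users.newWriter();
    try {
      // Throws AvroRuntimeException: the deep copy in DatasetRecordWriter
      // uses the descriptor's (new) schema instead of oldSchema.
      writer.write(record);
    } finally {
      writer.close();
    }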

Ryan Blue

unread,
May 19, 2015, 6:57:27 PM5/19/15
to Chris Hirstein, cdk...@cloudera.org
Hi Chris, thanks for letting us know about this.

I think the problem is related to what Joey recently ran into, where the
schema used by the writer needs to match the data rather than the
dataset. To fix this, we're working on the `asSchema` method to safely
change the runtime schema. There's a PR for this:

https://github.com/kite-sdk/kite/pull/346

I'll try to get a fix for this into the upcoming 1.1.0 release.
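
Roughly, the idea is that you'd pin the runtime schema to the one your
records were actually built with. Just a sketch, assuming `asSchema`
ends up on the View API the way the PR proposes (the URI below is made
up):

    // oldSchema is the schema the records were actually built with.
    View<GenericRecord> users = Datasets.load(
        "dataset:hdfs:/data/users", GenericRecord.class);

    DatasetWriter<GenericRecord> writer =
        users.asSchema(oldSchema).newWriter();
    try {
      writer.write(record);  // validated against oldSchema, not the
                             // descriptor's current schema
    } finally {
      writer.close();
    }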

One odd thing that I wasn't aware of is that deep copy. Anyone know why
there is a deep copy in the writer if we aren't using avro-reflect?

rb


--
Ryan Blue
Software Engineer
Cloudera, Inc.

Chris Hirstein

unread,
May 20, 2015, 9:02:47 AM5/20/15
to cdk...@cloudera.org, chir...@gmail.com
Thanks for the quick response, Ryan! I'll keep an eye on this PR and look for the 1.1.0 release.

Thanks,
Chris

Ben Roling

unread,
May 20, 2015, 9:44:15 AM5/20/15
to Chris Hirstein, cdk...@cloudera.org
>> One odd thing that I wasn't aware of is that deep copy. Anyone know why
>> there is a deep copy in the writer if we aren't using avro-reflect?

I don't know for sure, but I assumed it was for Parquet. I may not be reading it correctly, but it looks to me like Parquet could buffer pointers to parts of the original records, so a deep copy is required to make sure the Parquet output isn't corrupted if the original record object is mutated after the write call.

I'd been meaning to ask about this and, if my assumption is correct, whether there could be a mechanism to disable the copy when Parquet isn't being used.
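
To illustrate what I mean, a hypothetical example of the hazard I think the copy is guarding against:

    // If the writer only buffered a reference to the caller's record,
    // mutating the record after write() would also change the buffered
    // row before it is flushed.
    GenericRecord record = new GenericRecordBuilder(schema)
        .set("name", "first")
        .build();

    writer.write(record);          // writer may keep a pointer to the record
    record.put("name", "second");  // without a copy, the buffered data changes too

    // The copy in DatasetRecordWriter amounts to Avro's deep copy:
    GenericRecord copy = GenericData.get().deepCopy(schema, record);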


Ben Roling

unread,
May 20, 2015, 9:55:20 AM5/20/15
to Chris Hirstein, cdk...@cloudera.org
Actually, git-blame shows the copy was added for CDK-496, which doesn't look directly linked to Parquet. I can't dig into it any further right now, but perhaps that will ring some bells for someone.

Chris Hirstein

unread,
Sep 24, 2015, 11:29:53 AM9/24/15
to CDK Development, chir...@gmail.com
I am still seeing an issue after upgrading to 1.1.0. The exception is now being thrown in the Avro appender, since it uses the schema from the descriptor.

Do you have any suggestions for handling a deployment scenario where the dataset can be updated with a new, passively compatible version of the schema while old processing is still writing with the previous version?

Our use case has multiple Oozie coordinators running workflows on different frequencies that write to the same partitioned dataset. When deploying new versions of code, we allow the old processing to complete before the new instances begin running. This means we can have a period where two versions of the schema are being written to the same dataset.
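
For context, by "passive" I mean the new schema can still read data written with the old one. A sketch of that check using Avro's SchemaCompatibility (schema file names are made up):

    Schema oldSchema = new Schema.Parser().parse(new File("user_v1.avsc"));
    Schema newSchema = new Schema.Parser().parse(new File("user_v2.avsc"));

    // The new (reader) schema must be able to read data written with the
    // old (writer) schema, i.e. the update is backward compatible.
    SchemaCompatibility.SchemaPairCompatibility result =
        SchemaCompatibility.checkReaderWriterCompatibility(newSchema, oldSchema);

    if (result.getType() !=
        SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE) {
      throw new IllegalStateException(
          "new schema cannot read data written with the old schema: "
              + result.getDescription());
    }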



Ryan Blue

unread,
Sep 24, 2015, 12:39:50 PM9/24/15
to Chris Hirstein, CDK Development
On 09/24/2015 08:29 AM, Chris Hirstein wrote:
> I am still seeing an issue after upgrading to 1.1.0. The exception is
> now being thrown in the Avro appender, since it uses the schema from
> the descriptor.
>
> Do you have any suggestions for handling a deployment scenario where
> the dataset can be updated with a new, passively compatible version of
> the schema while old processing is still writing with the previous
> version?
>
> Our use case has multiple Oozie coordinators running workflows on
> different frequencies that write to the same partitioned dataset. When
> deploying new versions of code, we allow the old processing to
> complete before the new instances begin running. This means we can
> have a period where two versions of the schema are being written to
> the same dataset.

Chris,

You're right: it looks like although the view's schema is checked to
see whether it can be used to write, we don't pass it through to the
writer. Sorry about that; it's an oversight on my part. I think we just
need to pass the view's schema through, since we already validate that
it can be used for writing.

I've opened KITE-1081 [1] to track this. It's not too difficult, so feel
free to fix it before I have time and I'll check it in. Thanks, Chris!

rb

[1]: https://issues.cloudera.org/browse/KITE-1081

Chris Hirstein

unread,
Sep 28, 2015, 11:14:32 PM9/28/15
to CDK Development, chir...@gmail.com
Will do. Thanks, Ryan!

meghana nataraj

unread,
Oct 12, 2015, 12:07:36 PM10/12/15
to CDK Development, chir...@gmail.com
Hi Ryan,

I have opened a pull request [1] for the issue, KITE-1081. Please take a look at it whenever you find time. 


Thanks,
Meghana