Invalid Sync when running Hive query against Avro file that is being appended to

Steve

Apr 28, 2015, 3:59:32 PM
to cdk...@cloudera.org
I have been testing the ability to execute a query using Impala or Hive against a table in which the data is stored in Avro, and the Avro file is being appended to during the query by a single thread.

During one of my tests, the Hive query failed due to the following exception:

Caused by: org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!

I repeated the Hive query several times during the append operation and also after it was completed.  I received the same exception every time.

I also ran the same query using Impala and received the same exception.

I can see how this is possible while the Avro file is being appended to, but it seems that it should not happen after the Avro file is closed.

How is it possible that the sync marker can become invalid during this scenario?

Thank you.



Ryan Blue

Apr 28, 2015, 4:09:46 PM
to Steve, cdk...@cloudera.org
Hi Steve,

Could you talk a little more about how you're writing the file? I don't
think Kite will make the file visible until the entire file is closed to
avoid partial files. If you're not using Kite, then it's a little hard
for me to comment on what is happening that might cause this.

More inline...

On 04/28/2015 12:59 PM, Steve wrote:
> I have been testing the ability to execute a query using Impala or Hive
> against a table in which the data is stored in Avro, and the Avro file
> is being appended to during the query by a single thread.
>
> During one of my tests, the Hive query failed due to the following
> exception:
>
> Caused by: org.apache.avro.AvroRuntimeException: java.io.IOException:
> Invalid sync!
>
> I repeated the Hive query several times during the append operation and
> also after it was completed. I received the same exception every time.

Looks like the data is getting corrupted, but reading a file while it is
being written shouldn't be the cause. It isn't surprising to get this
exception while a file is being written, but getting it afterward means
the file is corrupt.

> I also ran the same query using Impala and received the same exception.

Yeah, more confirmation that it is corrupted.

> I can see how this is possible while the avro file is being appended to,
> but it seems that it should not happen after the avro file is closed.
>
> How is it possible that the sync marker can become invalid during this
> scenario?

I think reading while appending to the file might just be a red herring.
I don't see how that would corrupt it.

rb


--
Ryan Blue
Software Engineer
Cloudera, Inc.

Steve

Apr 29, 2015, 10:12:49 AM
to cdk...@cloudera.org, steve.h...@gmail.com
Hi Ryan,

I am writing to an Avro file using the Avro APIs (DataFileWriter). In my test, I was writing thousands of records without closing and reopening the file between writes, and then closing it at the end of the test. I am now testing the scenario in which I close the file after each append.

I know the file is visible, because my (Hive/Impala) query (when it doesn't return the invalid sync error) will return results while the append test is running.

-Steve

Steve

May 1, 2015, 10:13:04 AM
to cdk...@cloudera.org, steve.h...@gmail.com
I was able to reproduce this error by killing my append job before it completed. I'm chalking this up to a corrupt Avro file, since I'm not able to reproduce it otherwise.

Thanks.
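
For anyone who lands on this thread with the same exception: the sketch below is one way to find where a container file stops being well formed. It is not the Avro library's code and not something used in this thread. It only assumes the object container layout from the Avro specification (the `Obj\x01` magic, a metadata map, a 16-byte sync marker, then blocks that each end with the same marker), and the names `check_sync_markers`, `make_container` and `encode_long` are hypothetical helpers made up for illustration. It does not decode records, so it needs no schema or codec support.

```python
import io

MAGIC = b"Obj\x01"   # container file magic from the Avro specification
SYNC_SIZE = 16       # length of the sync marker in bytes


def read_long(stream, allow_eof=False):
    """Decode one zigzag varint long; return None on a clean EOF if allowed."""
    result = shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            if allow_eof and shift == 0:
                return None
            raise EOFError("stream ended inside a varint")
        result |= (raw[0] & 0x7F) << shift
        if not raw[0] & 0x80:
            return (result >> 1) ^ -(result & 1)
        shift += 7


def read_exact(stream, n):
    """Read exactly n bytes or raise EOFError."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("stream ended early")
    return data


def skip_metadata(stream):
    """Skip the header's map<string, bytes> without interpreting it."""
    while True:
        count = read_long(stream)
        if count == 0:
            return
        if count < 0:              # negative count is followed by a byte size
            count = -count
            read_long(stream)
        for _ in range(count):
            read_exact(stream, read_long(stream))   # key
            read_exact(stream, read_long(stream))   # value


def check_sync_markers(stream):
    """Return (number_of_good_blocks, problem), where problem is None if clean."""
    if read_exact(stream, len(MAGIC)) != MAGIC:
        raise ValueError("not an Avro container file")
    skip_metadata(stream)
    sync = read_exact(stream, SYNC_SIZE)
    good = 0
    while True:
        offset = stream.tell()
        try:
            count = read_long(stream, allow_eof=True)
            if count is None:
                return good, None
            size = read_long(stream)
            if size < 0:
                return good, f"negative block size at offset {offset}"
            read_exact(stream, size)
            marker = read_exact(stream, SYNC_SIZE)
        except EOFError:
            return good, f"truncated block at offset {offset}"
        if marker != sync:
            return good, f"invalid sync after block at offset {offset}"
        good += 1


# Demonstration helpers only: they build a tiny in-memory file with opaque payloads.
def encode_long(n):
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while n & ~0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def make_container(payloads, sync):
    meta = encode_long(1) + encode_long(10) + b"avro.codec" + encode_long(4) + b"null" + encode_long(0)
    body = b"".join(encode_long(1) + encode_long(len(p)) + p + sync for p in payloads)
    return MAGIC + meta + sync + body
```

To try it on a real file, open a local copy in binary mode and pass the file object to `check_sync_markers`. In this sketch a writer killed mid-block shows up as a truncated block, and bytes appended after such a partial block show up as an invalid sync, because the reader consumes part of the next block as payload and then finds the wrong 16 bytes where the marker should be.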