Batch write for DatasetWriter


Manthosh Kumar T

Oct 4, 2016, 6:43:44 AM
to CDK Development
Hi,
Using Java, say I create a Hive Dataset that stores data in the Avro file format, and I periodically (say, every 2 hours) push around 2 million records. I construct each record, call DatasetWriter.write() on it, and finally call DatasetWriter.close(). So each time I call write() for a record, it's appended to a temporary Avro file in HDFS until close() is called.
Writing 2 million records this way might involve a lot of I/O (kindly correct me if I'm wrong). Is this fine? Is there a way to write data in batches?

Thanks,
Manthosh

Joey Echeverria

Oct 4, 2016, 6:24:48 PM
to Manthosh Kumar T, CDK Development
Hi Manthosh!

Avro, and therefore Kite, will flush records to disk each time you hit
the sync interval, which defaults to ~64KB. So you won't be generating
all of the I/O at once. When you close the writer, that forces a flush
of the final Avro block and closes the temporary file, which is then
renamed to a non-temporary name. So you're getting the benefit of
buffered I/O with the record-at-a-time API, without delaying all of
the I/O until the call to close.
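
To make that buffering concrete, here is a minimal, self-contained sketch of
the idea in plain Java. It is a toy model, not Avro's or Kite's actual code:
the class BlockBufferedWriter, its methods, and the in-memory sink standing in
for the temporary HDFS file are all hypothetical names chosen for
illustration. It only shows records accumulating in memory until roughly the
sync interval is reached, then being written out as one block, with close()
flushing whatever is left. The final rename step is not modeled.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Toy model of block-buffered, record-at-a-time writing (hypothetical, not Avro's implementation). */
public class BlockBufferedWriter implements AutoCloseable {
    private final OutputStream sink;   // stands in for the temporary file
    private final int syncInterval;    // approximate bytes to buffer before flushing a block
    private final ByteArrayOutputStream block = new ByteArrayOutputStream();
    private int flushCount = 0;

    public BlockBufferedWriter(OutputStream sink, int syncInterval) {
        this.sink = sink;
        this.syncInterval = syncInterval;
    }

    /** Appends one record to the in-memory block; flushes once the block reaches the sync interval. */
    public void write(String record) throws IOException {
        block.write(record.getBytes(StandardCharsets.UTF_8));
        if (block.size() >= syncInterval) {
            flushBlock();
        }
    }

    /** Writes the buffered block to the sink and resets the buffer; does nothing if the buffer is empty. */
    private void flushBlock() throws IOException {
        if (block.size() == 0) {
            return;
        }
        block.writeTo(sink);
        block.reset();
        flushCount++;
    }

    /** Flushes the final, partially filled block and closes the sink. */
    @Override
    public void close() throws IOException {
        flushBlock();
        sink.close();
    }

    public int getFlushCount() {
        return flushCount;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        // 64,000 bytes is used here as a stand-in for the ~64KB default mentioned above.
        BlockBufferedWriter writer = new BlockBufferedWriter(file, 64_000);
        for (int i = 0; i < 100_000; i++) {
            writer.write("record-" + i + "\n");
        }
        writer.close();
        System.out.println("blocks flushed: " + writer.getFlushCount()
                + ", bytes written: " + file.size());
    }
}
```

In this model, at most about one sync interval's worth of data sits in memory
at any time, and the number of writes to the sink depends on the total bytes
written rather than on the number of write() calls.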

I hope that helps!

-Joey

Manthosh Kumar T

Oct 5, 2016, 3:15:28 AM
to CDK Development, mant...@gmail.com
Hi Joey,
Thanks for the reply. Does the data stay in memory until the sync interval is hit? And how do I increase the sync interval?

P.S. Pardon my inexperience with Avro files.