How to delete 1 or more records from a dataset


Mats Naslund

Feb 8, 2016, 7:48:47 PM
to CDK Development
What would be the best approach for deleting one or more records from a partitioned dataset? When I attempt to use deleteAll I get "Cannot cleanly delete view", and I've read: "Because records are grouped into files by partition values, we either have to delete an entire file because we know all records match the deletion criteria, or rewrite a file to remove the matching records. We decided to delete entire files and reject deletes that require rewriting." Does this still apply? How difficult would it be to change this?

Thank you

Micah Whitacre

Feb 9, 2016, 9:54:01 AM
to CDK Development
Yes, that still applies: removing records means either deleting an entire file or rewriting the entire file.
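
As a rough illustration of that rule, here is a toy model in plain Java. It is not Kite's implementation and uses none of its API: the class name, the method names, and the idea of one in-memory list per partition "file" are all assumptions made for the sketch. A delete goes through only when every partition file matches the criteria entirely or not at all.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Toy model only: each map entry stands for one partition "file" and its records.
// PartitionDeleteSketch and its methods are hypothetical, not part of the Kite API.
final class PartitionDeleteSketch {

    // A delete is "clean" when each file matches completely or not at all,
    // so no file would have to be rewritten.
    static <R> boolean canDeleteCleanly(Map<String, List<R>> partitions, Predicate<R> matches) {
        for (List<R> records : partitions.values()) {
            long hits = records.stream().filter(matches).count();
            if (hits != 0 && hits != records.size()) {
                return false; // partial match: this file would need a rewrite
            }
        }
        return true;
    }

    // Drops whole partition files whose records all match and rejects anything else.
    // Returns the number of files removed. The map must be mutable.
    static <R> int deleteAll(Map<String, List<R>> partitions, Predicate<R> matches) {
        if (!canDeleteCleanly(partitions, matches)) {
            throw new UnsupportedOperationException("Cannot cleanly delete view");
        }
        int before = partitions.size();
        partitions.values().removeIf(
                records -> !records.isEmpty() && records.stream().allMatch(matches));
        return before - partitions.size();
    }
}
```

In this model, a predicate that lines up with a partition (for example, everything in one year) succeeds, while a predicate that picks out a single record inside a file is rejected with the same message you saw.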

Manipulating a single record is not trivial because records are typically stored inside a larger file, and that larger file is typically compressed. In the case of data stored as Avro, the data is block compressed, so you'd be manipulating a specific block in a file.

The restriction also follows from HDFS itself, which is append-only and doesn't support modifying existing data.[1]
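
For a sense of what "rewriting a file" involves when in-place edits are not available, here is a minimal copy-and-swap sketch. It works on a local text file with one record per line through java.nio.file, not on HDFS or Avro, and RewriteSketch and rewriteWithout are hypothetical names; a real rewrite would also have to decode and re-encode the compressed blocks.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Local-filesystem stand-in for a rewrite: one record per line, no compression.
// RewriteSketch is a hypothetical name used only for this illustration.
final class RewriteSketch {

    // Copies every record that should survive into a sibling temp file,
    // then replaces the original with it. Returns the number of records dropped.
    static int rewriteWithout(Path file, Predicate<String> shouldDelete) throws IOException {
        List<String> all = Files.readAllLines(file);
        List<String> kept = all.stream()
                .filter(shouldDelete.negate())
                .collect(Collectors.toList());
        Path tmp = Files.createTempFile(file.toAbsolutePath().getParent(), "rewrite-", ".tmp");
        Files.write(tmp, kept);
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
        return all.size() - kept.size();
    }
}
```

Even in this simplified form, dropping one record costs a full read and a full write of the file, which is why deletes that need it are rejected.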

If you need to modify individual records you might look at something like HBase.

Mats Naslund

Feb 9, 2016, 10:57:33 AM
to CDK Development
Thank you