A failed file move operation in Hive may result in data loss


Carol Chapman

Jul 25, 2022, 10:44:49 PM
to MR3
For non-ACID tables, Hive writes the data to a temporary directory, then deletes the files in the target table, and finally moves the files in the temporary directory to the target table.

If something goes wrong at the last step, Hive abandons the remaining operations and only logs the exception, yet the operation still reports success. At this point the temporary directory is cleared, resulting in data loss.
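
To make the failure mode concrete, here is a minimal Python sketch that simulates the sequence described above on a local filesystem. It is not Hive's actual code: the function names `overwrite_table` and `demo`, the `fail_on_move` flag, and the directory layout are all hypothetical, and the error handling only mirrors the behavior as reported in this post.

```python
import shutil
import tempfile
from pathlib import Path


def overwrite_table(staging: Path, target: Path, fail_on_move: bool = False) -> str:
    """Simulate the delete-then-move steps; return the status the caller sees."""
    # Step 2: delete the existing files in the target table directory.
    for old in target.iterdir():
        old.unlink()
    try:
        # Step 3: move the staged files into the target directory.
        for new in list(staging.iterdir()):
            if fail_on_move:
                raise OSError("simulated move failure")
            shutil.move(str(new), str(target / new.name))
    except OSError as exc:
        # The failure is only logged; it is not propagated to the caller.
        print(f"move failed: {exc}")
    finally:
        # The staging directory is cleaned up regardless of the outcome.
        shutil.rmtree(staging)
    return "success"


def demo() -> tuple[str, list[str]]:
    root = Path(tempfile.mkdtemp())
    target = root / "table"
    staging = root / "staging"
    target.mkdir()
    staging.mkdir()
    (target / "old_data").write_text("previous rows")
    # Step 1: the query result is written to the staging directory.
    (staging / "new_data").write_text("query result")
    status = overwrite_table(staging, target, fail_on_move=True)
    remaining = sorted(p.name for p in target.iterdir())
    shutil.rmtree(root)
    return status, remaining


if __name__ == "__main__":
    print(demo())
```

With `fail_on_move=True`, the simulated operation still reports success while the target directory ends up empty: the old file was deleted in step 2 and the staged file was removed during cleanup.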

Sungwoo Park

Jul 27, 2022, 3:43:30 AM
to MR3
This might be true in Hive (where renaming can be used instead of moving), but the data loss is limited only to the result of the query being executed, so I think it is not a critical problem.

--- Sungwoo

Carol Chapman

Jul 27, 2022, 10:00:28 PM
to MR3
This can cause very serious problems with INSERT OVERWRITE statements, because it results in data loss during ETL.

Sungwoo Park

Jul 27, 2022, 10:36:31 PM
to MR3
Yes, this could be a problem with Hive and not specifically with Hive-MR3. In fact, I think there are several patches dealing with insert overwrite that have not been merged into Hive 3.

--- Sungwoo

Carol Chapman

Jul 27, 2022, 10:43:02 PM
to MR3
Yes, this is indeed a problem in Hive. If there is a relevant patch, can we merge it first?

Sungwoo Park

Aug 2, 2022, 8:45:55 AM
to MR3
Figuring out which patches to merge is the main problem. Besides, backporting a patch is not always feasible, and sometimes it takes a lot of time to find the earlier patches that it depends on. If you know of a specific (critical) patch that is missing in Hive-MR3, please let me know. For your problem, you could just use transactional tables.
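
As a rough illustration of the transactional-table suggestion, the following Python sketch builds the kind of DDL statement typically used to create a full ACID table in Hive 3 (a managed table stored as ORC with `'transactional'='true'`). The helper `transactional_ddl`, the table name `etl_output`, and the columns are hypothetical; check the exact table properties against your own Hive configuration.

```python
def transactional_ddl(table: str, columns: dict[str, str]) -> str:
    """Build a CREATE TABLE statement for a transactional (ACID) ORC table."""
    # Render the column list as "name TYPE, name TYPE, ...".
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns.items())
    return (
        f"CREATE TABLE {table} ({cols}) "
        "STORED AS ORC "
        "TBLPROPERTIES ('transactional'='true')"
    )


if __name__ == "__main__":
    # Hypothetical table and columns, for illustration only.
    print(transactional_ddl("etl_output", {"id": "INT", "payload": "STRING"}))
```

The resulting statement can be run through Beeline or any other Hive client before pointing the ETL job at the new table.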

Cheers,

Sungwoo
