
recover pickled data: pickle data was truncated


iMath

Dec 25, 2021, 7:01:20 AM
Normally, the shelve data should be read and written by only one process at a time, but unfortunately it was read and written simultaneously by two processes, which corrupted it. Is there any way to recover all the data in it? Currently I just get a "pickle data was truncated" exception after reading a portion of the data.

Data and code here: https://drive.google.com/file/d/137nJFc1TvOge88EjzhnFX9bXg6vd0RYQ/view?usp=sharing

Marco Sulla

Dec 26, 2021, 8:45:24 AM
Use a semaphore.

On Sun, 26 Dec 2021 at 03:30, iMath <redsto...@163.com> wrote:
>
> Normally, the shelve data should be read and write by only one process at a time, but unfortunately it was simultaneously read and write by two processes, thus corrupted it. Is there any way to recover all data in it ? Currently I just get "pickle data was truncated" exception after reading a portion of the data?
>
> Data and code here :https://drive.google.com/file/d/137nJFc1TvOge88EjzhnFX9bXg6vd0RYQ/view?usp=sharing
> --
> https://mail.python.org/mailman/listinfo/python-list

Barry Scott

Dec 26, 2021, 11:49:23 AM


> On 26 Dec 2021, at 13:44, Marco Sulla <Marco.Sul...@gmail.com> wrote:
>
> Use a semaphore.
>
> On Sun, 26 Dec 2021 at 03:30, iMath <redsto...@163.com> wrote:
>>
>> Normally, the shelve data should be read and write by only one process at a time, but unfortunately it was simultaneously read and write by two processes, thus corrupted it. Is there any way to recover all data in it ? Currently I just get "pickle data was truncated" exception after reading a portion of the data?

You have lost the data in that case.

You will need to do what Marco suggests and lock access to the file.
How you do that depends on your OS. If it is a Unix-like OS then you
will likely want to use fcntl.flock().
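A minimal sketch of what flock-based locking could look like on a Unix-like OS (the file names here are placeholders, not from the original code):

```python
import fcntl
import shelve

# Hypothetical sentinel file. We lock this instead of the shelve files
# themselves, because their actual names vary with the dbm backend.
LOCK_PATH = "mydata.lock"

def update_shelf(key, value):
    with open(LOCK_PATH, "w") as lockf:
        fcntl.flock(lockf, fcntl.LOCK_EX)   # blocks until no other process holds the lock
        try:
            with shelve.open("mydata") as db:   # "mydata" is a placeholder name
                db[key] = value
        finally:
            fcntl.flock(lockf, fcntl.LOCK_UN)
```

The lock is also released automatically when lockf is closed, but unlocking explicitly keeps the critical section obvious. Readers should take the same lock (LOCK_SH is enough for them) before opening the shelf.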

Barry

iMath

Dec 29, 2021, 10:50:53 AM
> You have lost the data in that case.

But I found the size of the shelve file didn't change much, so I guess the data are still in it. I just wonder if there is any way to recover my data.

Chris Angelico

Dec 29, 2021, 1:24:45 PM
On Thu, Dec 30, 2021 at 4:32 AM iMath <redsto...@163.com> wrote:
>
> > You have lost the data in that case.
>
> But I found the size of the file of the shelve data didn't change much, so I guess the data are still in it , I just wonder any way to recover my data.

Unless two conflicting versions got interleaved, in which case I
strongly advise you NOT to try unpickling it.

If you really feel like delving into it, try manually decoding the
pickle stream, but be very careful.
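If you do go down that road, the standard library's pickletools can disassemble a stream without executing it, which at least shows where the stream stops making sense. A small sketch (the truncation here is simulated, not the OP's actual file):

```python
import pickle
import pickletools

good = pickle.dumps({"name": "alice", "phone": "555-0100"})
pickletools.dis(good)        # prints the opcode listing of a valid stream

truncated = good[:-5]        # simulate "pickle data was truncated"
try:
    pickletools.dis(truncated)
except Exception as exc:     # dis() raises at the point the stream breaks down
    print("stream breaks:", exc)
```

The opcode listing shows offsets, so you can see how far into the file the stream is still well formed before deciding whether anything is salvageable.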

ChrisA

Avi Gross

Dec 29, 2021, 1:54:38 PM
I am not an expert on the topic, but my first reaction is that it depends on how
the data is corrupted, and we do not know that. So I am addressing a more
general concept here.

Some algorithms break if a single byte or even bit changes and nothing
beyond that point makes sense. Many encryption techniques are like that and
adding or deleting a byte might throw things off completely.

But if your problem is that two processes or threads wrote interleaved output and
yet produced a file of similar size, then yes, in some cases some of the data
could be retrieved, albeit fragmentary and unreliable. If they both included,
say, a data structure with names and phone numbers, it is possible you get two
partial or complete copies and maybe retrieve a phone number you can try to see
if it works. But the tax authorities might not react favorably to your recovery
of a business expense if it is possible the currency amount was corrupted and
perhaps a few zeroes were appended at the end.

For some mission-critical purposes, I am sure people have come up with many
ideas including perhaps making multiple copies before an exit spread across
multiple disks and sites or reading the file back in and checking it. But
corruption can happen for many reasons including at the level of the disk it
is written to.

Marco Sulla

Dec 29, 2021, 2:13:21 PM
On Wed, 29 Dec 2021 at 18:33, iMath <redsto...@163.com> wrote:
> But I found the size of the file of the shelve data didn't change much, so I guess the data are still in it , I just wonder any way to recover my data.

I agree with Barry, Chris and Avi. IMHO your data is lost. Unpickling
it by hand is harsh work and maybe unreliable.

Is there any reason you can't simply add a semaphore to avoid writing
at the same time and re-run the code and regenerate the data?
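For completeness: if both writers are launched from the same parent process, a multiprocessing.Lock is a simple way to serialise access (fully independent processes would need a file lock such as fcntl.flock instead). A sketch with placeholder file and key names:

```python
import multiprocessing
import shelve

def writer(lock, key, value):
    # Only one process may be inside this block at a time.
    with lock:
        with shelve.open("shared_data") as db:   # "shared_data" is a placeholder name
            db[key] = value

if __name__ == "__main__":
    lock = multiprocessing.Lock()
    procs = [multiprocessing.Process(target=writer, args=(lock, f"k{i}", i))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    with shelve.open("shared_data") as db:
        assert db["k0"] == 0 and db["k3"] == 3
```

The lock must be created in the parent and passed to each child; two unrelated processes cannot share a multiprocessing.Lock, which is why flock was suggested for the OP's scenario.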

iMath

Dec 31, 2021, 5:55:15 AM
Thanks for your replies! I didn't think of adding a semaphore around writes to the pickle data before, which is how it got corrupted.
Since my data was collected during daily usage, I cannot re-run the code and regenerate it.
In order to avoid corrupting my data again, and the complexity of using a semaphore, I am now using JSON text to store my data.

Barry

Jan 1, 2022, 8:09:47 AM


> On 31 Dec 2021, at 17:53, iMath <redsto...@163.com> wrote:
That will not fix the problem. You will end up with corrupt json.

If you have one writer and one reader then maybe you can use the fact that a rename is atomic.

Writer does this:
1. Create a new json file in the same folder but with a tmp name.
2. Rename the file from its tmp name to the public name.

The reader will just read the public name.
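A sketch of that pattern in Python (file names are placeholders); os.replace is the atomic rename, so a reader sees either the complete old file or the complete new one, never a partial write:

```python
import json
import os
import tempfile

def atomic_write_json(path, data):
    # The temp file must live in the same directory as the target,
    # because rename is only atomic within a single filesystem.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())     # make sure bytes hit the disk first
        os.replace(tmp, path)        # atomic rename to the public name
    except BaseException:
        os.unlink(tmp)
        raise

def read_json(path):
    with open(path) as f:
        return json.load(f)
```

Note this makes each write safe on its own; it does not serialise two racing writers, which still needs locking as discussed above.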

I am not sure what happens in your world if the writer runs a second time before the data is read.

In that case you need to create a queue of files to be read.

But if the problem is two process racing against each other you MUST use locking.
It cannot be avoided for robust operations.

Barry


> --
> https://mail.python.org/mailman/listinfo/python-list

Marco Sulla

Jan 1, 2022, 11:14:13 AM
I agree with Barry. You can create a folder or a file with
pseudo-random names. I recommend you to use str(uuid.uuid4())
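For example (the name pattern is just illustrative):

```python
import uuid

# A pseudo-random temp name; uuid4 collisions are negligible, so two
# writers will never fight over the same temporary file.
tmp_name = f"data.{uuid.uuid4()}.json.tmp"
print(tmp_name)   # e.g. data.<random-uuid>.json.tmp
```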


Barry Scott

Jan 2, 2022, 7:37:47 AM


> On 1 Jan 2022, at 16:13, Marco Sulla <Marco.Sul...@gmail.com> wrote:
>
> I agree with Barry. You can create a folder or a file with
> pseudo-random names. I recommend you to use str(uuid.uuid4())

At work and personally I use iso-8601 timestamps to make the files unique and easy to
find out when they were created.

:>>> t = datetime.datetime.now()
:>>> t
datetime.datetime(2022, 1, 2, 12, 34, 1, 267935)
:>>> t.strftime('%Y-%m-%dT%H-%M-%S')
'2022-01-02T12-34-01'
:>>>

That is good enough as long as you create the files slower than once a second.

Oh and yes, use JSON; it is a far better way of exchanging data than pickle.
Easy to read and check, and it can be processed in many languages.

Barry



iMath

Jan 5, 2022, 3:44:22 AM
Thanks for all your kind help, wish you a promising year!