
MemoryError and Pickle


Fillmore

Nov 21, 2016, 6:28:05 PM

Hi there, Python newbie here.

I am working with large files. For this reason I figured that I would
capture the large input into a list and serialize it with pickle for
later (faster) usage.
Everything has worked beautifully until today when the large data (1GB)
file caused a MemoryError :(

Question for experts: is there a way to refactor this so that data may
be filled/written/released as the scripts go and avoid the problem?
code below.

Thanks

data = list()
for line in sys.stdin:
    try:
        parts = line.strip().split("\t")
        t = parts[0]
        w = parts[1]
        u = parts[2]

        #let's retain in-memory copy of data
        data.append({"ta": t,
                     "wa": w,
                     "ua": u
                     })

    except IndexError:
        print("Problem with line :"+line, file=sys.stderr)
        pass

#time to save data object into a pickle file

fileObject = open(filename,"wb")
pickle.dump(data,fileObject)
fileObject.close()

John Gordon

Nov 21, 2016, 6:43:44 PM

In <o0vvtm$1rpo$1...@gioia.aioe.org> Fillmore <fillmor...@hotmail.com> writes:


> Question for experts: is there a way to refactor this so that data may
> be filled/written/released as the scripts go and avoid the problem?
> code below.

That depends on how the data will be read. Here is one way to do it:

fileObject = open(filename, "w")
for line in sys.stdin:
parts = line.strip().split("\t")
fileObject.write("ta: %s\n" % parts[0])
fileObject.write("wa: %s\n" % parts[1])
fileObject.write("ua: %s\n" % parts[2])
fileObject.close()

But this doesn't use pickle format, so your reader program would have to
be modified to read this format. And you'll run into the same problem if
the reader expects to keep all the data in memory.
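
A minimal sketch of such a reader, just to illustrate (it reuses the same filename variable as above, assumes every record is exactly three "key: value" lines, and handles one record at a time instead of holding them all in memory):

with open(filename) as fileObject:
    record = {}
    for line in fileObject:
        key, _, value = line.rstrip("\n").partition(": ")
        record[key] = value
        if len(record) == 3:  # a full ta/wa/ua record has been collected
            # do something with the record here
            record = {}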

--
John Gordon A is for Amy, who fell down the stairs
gor...@panix.com B is for Basil, assaulted by bears
-- Edward Gorey, "The Gashlycrumb Tinies"

Peter Otten

Nov 21, 2016, 7:40:46 PM

Fillmore wrote:

> Hi there, Python newbie here.
>
> I am working with large files. For this reason I figured that I would
> capture the large input into a list and serialize it with pickle for
> later (faster) usage.

But is it really faster? If the pickle is, let's say, twice as large as the
original file, it should take roughly twice as long to read the data...
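
A rough way to check is to time both paths on a smaller sample; the file names below are hypothetical placeholders:

import pickle
import time

start = time.perf_counter()
with open("sample.tsv") as f:           # hypothetical raw tab-separated sample
    rows = [line.rstrip("\n").split("\t") for line in f]
print("parse tsv:", time.perf_counter() - start)

start = time.perf_counter()
with open("sample.pickle", "rb") as f:  # hypothetical pickle of the same rows
    rows = pickle.load(f)
print("unpickle:", time.perf_counter() - start)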


Chris Kaynor

Nov 21, 2016, 7:49:59 PM

On Mon, Nov 21, 2016 at 3:43 PM, John Gordon <gor...@panix.com> wrote:
> In <o0vvtm$1rpo$1...@gioia.aioe.org> Fillmore <fillmor...@hotmail.com> writes:
>
>
>> Question for experts: is there a way to refactor this so that data may
>> be filled/written/released as the scripts go and avoid the problem?
>> code below.
>
> That depends on how the data will be read. Here is one way to do it:
>
> fileObject = open(filename, "w")
> for line in sys.stdin:
> parts = line.strip().split("\t")
> fileObject.write("ta: %s\n" % parts[0])
> fileObject.write("wa: %s\n" % parts[1])
> fileObject.write("ua: %s\n" % parts[2])
> fileObject.close()
>
> But this doesn't use pickle format, so your reader program would have to
> be modified to read this format. And you'll run into the same problem if
> the reader expects to keep all the data in memory.

If you want to keep using pickle, you should be able to pickle each
item of the list to the file one at a time. As long as the file is
kept open (or seeked to the end), you should be able to dump without
overwriting the old data, and read starting at the end of the previous
pickle stream.

I haven't tested it, so there may be issues (if it fails, you can try
using dumps and writing to the file by hand):

Writing:
with open(filename, 'wb') as fileObject:
    for line in sys.stdin:
        pickle.dump(line, fileObject)

Reading:
with open(filename, 'rb') as fileObject:
    while True:
        try:
            line = pickle.load(fileObject)
        except EOFError:  # pickle.load raises EOFError at the end of the file
            break
        # do something with line


It should also be noted that if you do not need to support multiple
Python versions, you may want to specify a protocol to pickle.dump to
use a better version of the format. -1 will use the latest (best if
you only care about one version of Python); 4 is currently the latest
version (added in 3.4), which may be useful if you need
forward-compatibility but not backwards-compatibility; 2 is the latest
version available in Python 2 (added in Python 2.3). See
https://docs.python.org/3.6/library/pickle.html#data-stream-format for
more information.
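
For example, a minimal sketch of passing a protocol explicitly (the data and file name here are placeholders):

import pickle

data = [{"ta": "a", "wa": "b", "ua": "c"}]  # placeholder data

with open("data.pickle", "wb") as fileObject:
    # protocol=-1 selects the highest protocol this interpreter supports
    pickle.dump(data, fileObject, protocol=-1)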

Steve D'Aprano

Nov 21, 2016, 8:00:01 PM

But the code is more complex, therefore faster.

That's how it works, right?

*wink*




--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

Steve D'Aprano

Nov 21, 2016, 8:40:31 PM

On Tue, 22 Nov 2016 10:27 am, Fillmore wrote:

>
> Hi there, Python newbie here.
>
> I am working with large files. For this reason I figured that I would
> capture the large input into a list and serialize it with pickle for
> later (faster) usage.
> Everything has worked beautifully until today when the large data (1GB)
> file caused a MemoryError :(

At what point do you run out of memory? When building the list? If so, then
you need more memory, or smaller lists, or avoid creating a giant list in
the first place.

If you can successfully build the list, but then run out of memory when
trying to pickle it, then you may need another approach.

But as always, to really be sure what is going on, we need to see the full
traceback (not just the "MemoryError" part) and preferably a short, simple
example that replicates the error:

http://www.sscce.org/



> Question for experts: is there a way to refactor this so that data may
> be filled/written/released as the scripts go and avoid the problem?

I'm not sure what you are doing with this data. I guess you're not just:

- read the input, one line at a time
- create a giant data list
- pickle the list

and then never look at the pickle again.

I imagine that you want to process the list in some way, but how and where
and when is a mystery. But most likely you will later do:

- unpickle the list, creating a giant data list again
- process the data list

So I'm not sure what advantage the pickle offers, except as make-work. Maybe
I've missed something, but if you're running out of memory processing the
giant list, perhaps a better approach is:

- read the input, one line at a time
- process that line


and avoid building the giant list or the pickle at all.
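
For instance, a rough sketch of that shape, where process() is a hypothetical stand-in for whatever you actually do with each record:

import sys

def process(record):
    # hypothetical: replace with the real per-record work
    pass

for line in sys.stdin:
    try:
        t, w, u = line.strip().split("\t")
    except ValueError:
        print("Problem with line:", line, file=sys.stderr)
        continue
    process({"ta": t, "wa": w, "ua": u})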


> code below.
>
> Thanks
>
> data = list()
> for line in sys.stdin:
>     try:
>         parts = line.strip().split("\t")
>         t = parts[0]
>         w = parts[1]
>         u = parts[2]
>         #let's retain in-memory copy of data
>         data.append({"ta": t,
>                      "wa": w,
>                      "ua": u
>                      })
>     except IndexError:
>         print("Problem with line :"+line, file=sys.stderr)
>         pass
>
> #time to save data object into a pickle file
>
> fileObject = open(filename,"wb")
> pickle.dump(data,fileObject)
> fileObject.close()

Let's re-write some of your code to make it better:

data = []
for line in sys.stdin:
    try:
        t, w, u = line.strip().split("\t")
    except ValueError as err:
        print("Problem with line:", line, file=sys.stderr)
        continue
    data.append({"ta": t, "wa": w, "ua": u})

with open(filename, "wb") as fileObject:
    pickle.dump(data, fileObject)


It's not obvious where you are running out of memory, but my guess is that it
is most likely while building the giant list. You have a LOT of small
dicts, each one with exactly the same set of keys. You can probably save a
lot of memory by using a tuple, or better, a namedtuple.

py> from collections import namedtuple
py> struct = namedtuple("struct", "ta wa ua")
py> x = struct("abc", "def", "ghi")
py> y = {"ta": "abc", "wa": "def", "ua": "ghi"}
py> sys.getsizeof(x)
36
py> sys.getsizeof(y)
144


So each of those little dicts {"ta": t, "wa": w, "ua": u} in your list
potentially uses as much as four times the memory a namedtuple would use.
So using a namedtuple might very well save enough memory to avoid the
MemoryError altogether.


from collections import namedtuple

struct = namedtuple("struct", "ta wa ua")
data = []
for line in sys.stdin:
    try:
        t, w, u = line.strip().split("\t")
    except ValueError as err:
        print("Problem with line:", line, file=sys.stderr)
        continue
    data.append(struct(t, w, u))

with open(filename, "wb") as fileObject:
    pickle.dump(data, fileObject)


And as a bonus, when you come to use the record, instead of having to write:

line["ta"]

to access the first field, you can write:

line.ta