Clarification on How to Handle "Large" Outputs

EpiGrad

Nov 27, 2011, 12:30:58 AM
to PiCloud
I'm working on developing some code to run on PiCloud that involves
somewhat large NumPy arrays. Essentially, each run of a highly
parallel job generates an array of about 7 columns and 5000 rows. At
the end of the day, it would be nice to have the collected data - or
chunks of it - in a nice convenient CSV file for later analysis on my
local machine.

Batching any number of runs so I don't have to call 10,000 jobs means
the data will almost inevitably exceed the 16 MB per-job limit of
cloud.call. I'm not entirely sure how to go about using the
cloud.files interface for this, though. Should each job within
cloud.call write a file and use cloud.files.put() to store that text
file in the Cloud Files system, after which I pull them all onto my
local machine using cloud.files.get, or is there a more logical way
to do that?
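
For a rough sense of scale, the size of one run's array can be estimated as below. This is only a back-of-the-envelope sketch: it assumes 8-byte float64 values, treats the 16 MB limit as 16 * 1024**2 bytes, and ignores any serialization overhead, none of which is stated in the thread.

```python
# Rough size estimate for one run's output array.
ROWS, COLS = 5000, 7            # array shape given in the question
BYTES_PER_VALUE = 8             # assumption: float64 values
LIMIT_BYTES = 16 * 1024 ** 2    # assumption: "16 MB" read as 16 MiB

bytes_per_run = ROWS * COLS * BYTES_PER_VALUE
max_runs_per_job = LIMIT_BYTES // bytes_per_run

print(bytes_per_run)       # 280000
print(max_runs_per_job)    # 59
```

Under those assumptions, a batch of a few dozen runs might still fit in a single return value, while anything much larger would not, which is why storing the output in files looks attractive.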

Thanks,

Eric

Ken Elkabany

Nov 27, 2011, 6:26:42 PM
to pic...@googlegroups.com
Hi Eric,

Yes, you're correct. Each job should be responsible for saving its own data to cloud.files.

An efficient way to do what you want is to generate the CSV in memory using the Python csv module and then upload it using cloud.files.putf. This avoids writing the file to disk, and then redundantly reading the file from disk to upload it. Here's a quick example:

import csv
from cStringIO import StringIO

import cloud

# create a file-like object that resides purely in memory
f = StringIO()

# create a csv writer object
w = csv.writer(f)

# you can call this function repeatedly to write as many rows as you want
# this writes a row with values from 0 to 9
w.writerow(range(10))

# (optional) you can see that the csv writer is writing to the StringIO obj
print f.getvalue()

# rewind the buffer in case putf reads from the current position
f.seek(0)

# save your csv to cloud.files
cloud.files.putf(f, 'name_for_data')

# you can retrieve the data at a later time
# get will save the csv to a file of the same name
cloud.files.get('name_for_data')
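
The same idea can be stretched to cover a whole array per job and a final merge on the local machine. The following is a minimal sketch, not PiCloud's own API: rows_to_csv_buffer and merge_csv_files are hypothetical helper names, the code is written for Python 3 with io.StringIO (under Python 2 you would swap in cStringIO as in the example above), and it deliberately leaves out the cloud.files calls so it can be run anywhere.

```python
import csv
import io


def rows_to_csv_buffer(rows, header=None):
    """Write rows to an in-memory CSV and return the rewound buffer.

    rows can be any iterable of sequences; for a 2-D NumPy array,
    passing array.tolist() gives plain Python numbers.
    (Hypothetical helper, not part of any library.)
    """
    buf = io.StringIO(newline="")
    writer = csv.writer(buf)
    if header is not None:
        writer.writerow(header)
    writer.writerows(rows)
    buf.seek(0)  # rewind so a reader starting at the current position sees everything
    return buf


def merge_csv_files(paths, out_path, has_header=False):
    """Concatenate per-job CSV files into one local CSV file.

    If has_header is True, the first line of each input is treated as a
    header and only the first file's header is kept.
    Returns the number of data rows written.
    (Hypothetical helper, not part of any library.)
    """
    count = 0
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for i, path in enumerate(paths):
            with open(path, newline="") as src:
                reader = csv.reader(src)
                if has_header:
                    head = next(reader, None)
                    if i == 0 and head is not None:
                        writer.writerow(head)
                for row in reader:
                    writer.writerow(row)
                    count += 1
    return count
```

Inside a job, the buffer returned by rows_to_csv_buffer would be what gets passed to cloud.files.putf, with a distinct name per job so the files do not overwrite each other. After pulling the files down with cloud.files.get, merge_csv_files can combine them into the single CSV for later analysis.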

How large is the input data for each function call? Are the functions taking in a CSV and outputting a CSV? How long does each run take when you haven't batched jobs together?

Ken