Save all results in one Google Cloud file


Patrice De Saint Steban

unread,
Mar 22, 2014, 1:24:37 PM3/22/14
to app-engine-...@googlegroups.com
Hello,

I want to do a database export job to CSV. I'm thinking of using a MapReduce job: read all the datastore entities with DatastoreInput, emit the CSV string for the current entity in the mapper, and use the ValueProjectionReducer to just pass the values through.
But the CloudSqlFileOutput writes one file for each shard.
How can I export all the CSV lines to a single file with the MapReduce library?

Thanks.

Tom Kaitchuck

unread,
Mar 24, 2014, 2:31:04 PM3/24/14
to app-engine-...@googlegroups.com
The output class determines the number of reduce shards. If you want a single file of output, configure your output class to produce a single file, and it will run a single reduce shard. This will not affect the parallelism of your mapper.
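As a toy sketch of the relationship Tom describes (the class and function names here are illustrative stand-ins, not the App Engine MapReduce API): the number of reduce shards follows the number of output files, while the mapper's parallelism is configured independently.

```python
# Illustrative sketch only: these names are made up and are NOT the
# App Engine MapReduce API; they model "output shards drive reduce shards".

class FileOutput:
    def __init__(self, num_files):
        self.num_files = num_files  # one output file per reduce shard

def run_job(records, mapper, reducer, output, map_shards=8):
    # Map phase: parallelism (map_shards) is unaffected by the output config.
    pairs = [kv for r in records for kv in mapper(r)]
    # Shuffle into as many reduce shards as the output has files.
    shards = [[] for _ in range(output.num_files)]
    for k, v in pairs:
        shards[hash(k) % output.num_files].append((k, v))
    # Reduce phase: one "file" of CSV lines per shard.
    return [reducer(shard) for shard in shards]

csv_mapper = lambda e: [(e["id"], "%s,%s" % (e["id"], e["name"]))]
identity_reducer = lambda pairs: "\n".join(v for _, v in sorted(pairs))

files = run_job([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
                csv_mapper, identity_reducer, FileOutput(num_files=1))
# With num_files=1, every CSV line lands in the single output "file".
```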


--
You received this message because you are subscribed to the Google Groups "Google App Engine Pipeline API" group.
To unsubscribe from this group and stop receiving emails from it, send an email to app-engine-pipeli...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jeffrey Tratner

unread,
Sep 18, 2014, 1:03:01 PM9/18/14
to app-engine-...@googlegroups.com
You could also grab all the files generated by the reducer and then compose them together via the GCS API - https://developers.google.com/storage/docs/json_api/v1/objects/compose
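For what compose buys you here: it concatenates up to 32 source objects, in the order listed, into one object, with no download/re-upload. A local sketch of those semantics (the actual GCS calls are left out; shard names are hypothetical examples):

```python
def compose(parts):
    """Mimic GCS objects.compose: concatenate sources in the order given."""
    assert len(parts) <= 32, "GCS compose accepts at most 32 sources per call"
    return b"".join(parts)

# Shard files typically come back with numbered suffixes; sorting the names
# before composing preserves the shard order in the final CSV.
shard_files = {"out-shard-00001": b"2,b\n", "out-shard-00000": b"1,a\n"}
merged = compose([shard_files[name] for name in sorted(shard_files)])
```

For more than 32 shards you would compose in batches of 32, then compose the intermediate objects.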

Emanuele Ziglioli

unread,
Oct 1, 2014, 11:25:24 PM10/1/14
to app-engine-...@googlegroups.com
I do all this in a task queue, but I can only export up to 150,000 entities (average size: 733 bytes) before I hit the 10-minute limit.
So I'm looking at alternatives.
The problem I have with MapReduce is that I need to preserve the output order, while with MapReduce entities arrive out of order.
So perhaps I should be using the Pipeline API and have each task fetch a subset of entities, then hand the job over to the next task serially.

I'd be curious to see if others are following this approach.

Tom Kaitchuck

unread,
Oct 2, 2014, 3:45:52 PM10/2/14
to app-engine-...@googlegroups.com
The different Map shards are going to be operating on different segments of the data in parallel. So while each shard will process its segment in the order it receives it from the datastore query, the parallelism will destroy the overall ordering. If you want a specific order in the output, use the item that you want to order by as the key when you emit from Map. (In this case you could emit the value of the datastore key as the key, and whatever data you want to associate with it as the value.) This way the sort stage will order the data for you before it is handed to the reducer.
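A toy sketch of that idea (not the library's API): parallel map shards emit in arbitrary order, but because each pair carries the ordering field as its key, the sort stage restores the desired order before reduce.

```python
import random

def toy_mapreduce(entities, mapper, reducer, map_shards=4):
    # Split entities across shards; each shard maps its slice independently,
    # so the combined emit order across shards is effectively arbitrary.
    shards = [entities[i::map_shards] for i in range(map_shards)]
    emitted = []
    for shard in random.sample(shards, len(shards)):
        for e in shard:
            emitted.extend(mapper(e))
    # The built-in sort stage orders pairs by key before reduce sees them.
    emitted.sort(key=lambda kv: kv[0])
    return reducer(emitted)

# Per Tom's advice: emit the field you want to order by (here, a date)
# as the MapReduce key, and the CSV line as the value.
mapper = lambda e: [(e["date"], "%s,%s" % (e["date"], e["value"]))]
reducer = lambda pairs: [v for _, v in pairs]

entities = [{"date": "2014-%02d-01" % m, "value": m} for m in range(1, 10)]
lines = toy_mapreduce(entities, mapper, reducer)
```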


Emanuele Ziglioli

unread,
Oct 2, 2014, 4:54:54 PM10/2/14
to app-engine-...@googlegroups.com
Hi Tom,

Doesn't that mean that all the entities (or at least their keys) would have to be kept in memory in order to be sorted?
That wouldn't really be possible with App Engine's instances; there isn't enough memory for the numbers I'm talking about: >150,000 entities.
I can't really think of a simple solution.
What I have now is a loop: 
1. query for 150 keys, order by date
2. async batch get entities from memcache
3. query for the next 150 keys
4. serialize
5. repeat from 2

I got to that number (150) by measuring how long the whole cycle takes. Anything bigger wouldn't give any more gain.
I divide the total number of entities by the time it takes, and the maximum I got was around 280 entities per second.
It's soooo slow!
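The loop above can be sketched against a stand-in datastore (the query and batch-get calls are faked here; the real App Engine query/memcache APIs differ):

```python
def export_serially(query_keys, get_entities, serialize, batch_size=150):
    """Keys-only query in date order, batch-fetch, serialize, repeat."""
    out, cursor = [], 0
    while True:
        keys = query_keys(cursor, batch_size)       # steps 1/3: next 150 keys
        if not keys:
            break
        entities = get_entities(keys)               # step 2: batch get
        out.extend(serialize(e) for e in entities)  # step 4: serialize
        cursor += len(keys)                         # step 5: repeat
    return out

# Fake backing store, already ordered by date as the real query would be.
store = [{"key": k, "date": k, "value": k * 2} for k in range(400)]
rows = export_serially(
    query_keys=lambda c, n: [e["key"] for e in store[c:c + n]],
    get_entities=lambda keys: [store[k] for k in keys],
    serialize=lambda e: "%d,%d" % (e["date"], e["value"]),
)
```

The structure makes the bottleneck visible: each batch waits on the previous one, so throughput is capped by one round trip per 150 entities.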

Tom Kaitchuck

unread,
Oct 2, 2014, 7:42:41 PM10/2/14
to app-engine-...@googlegroups.com
The MapReduce library is able to sort datasets much larger than your memory. This is a major part of its power. I routinely run TB-size jobs using the Java version.

The library will do something similar to what you describe above, but it has optimized batching on reads and writes, automatic error handling that is transparent to your program's logic, and can operate in parallel on huge datasets.


Emanuele Ziglioli

unread,
Oct 2, 2014, 7:47:15 PM10/2/14
to app-engine-...@googlegroups.com


On Friday, 3 October 2014 12:42:41 UTC+13, Tom Kaitchuck wrote:
The MapReduce library is able to sort datasets much larger than your memory. This is a major part of its power. I routinely run TB-size jobs using the Java version.

Really?! I thought MapReduce would just use normal task queues, which means each task is limited to the instance memory size.
How can a single task exceed that? I'd like to know more. Are you sure you're not looking at the combined memory size of many different tasks?
 

The library will do something similar to what you describe above, but it has optimized batching on reads and writes, automatic error handling that is transparent to your program's logic, and can operate in parallel on huge datasets.

If what you said is true, then it's the holy grail! ;-)
 


Tom Kaitchuck

unread,
Oct 2, 2014, 8:08:48 PM10/2/14
to app-engine-...@googlegroups.com
On Thu, Oct 2, 2014 at 4:47 PM, Emanuele Ziglioli <the...@emanueleziglioli.it> wrote:
On Friday, 3 October 2014 12:42:41 UTC+13, Tom Kaitchuck wrote:
The MapReduce library is able to sort datasets much larger than your memory. This is a major part of its power. I routinely run TB-size jobs using the Java version.

Really?! I thought MapReduce would just use normal task queues, which means each task is limited to the instance memory size.
How can a single task exceed that? I'd like to know more. Are you sure you're not looking at the combined memory size of many different tasks?

Yes. It never exceeds the memory required in a single task, but instead uses many tasks.
Each individual task is limited, but the job is broken into three stages (see: https://cloud.google.com/appengine/docs/java/dataprocessing/):

The first stage, Map, just does a streaming pass mapping input to K,V pairs. (Little memory required.)
The second stage (built-in, no code required) does a tiered merge sort of that data. (This requires passing over the data N times, where N is proportional to log(DataSize)/log(InstanceMemorySize).) So larger instances will process the data faster, but any size instance should work.
The third stage, Reduce, is similar to Map in that it is just being streamed data.

The idea is that if you can structure your problem so that the sort does all the heavy lifting, then you just need to write two simple functions.
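The sort stage described above can be sketched as a minimal external merge sort: each pass only ever holds one memory-sized run (or one head element per run) in RAM, which is why the dataset being sorted can far exceed instance memory.

```python
import heapq

def external_sort(records, memory_limit):
    """Sort a stream using runs no larger than memory_limit, then merge.

    Mirrors the idea of the MapReduce sort stage: only one memory-sized
    chunk (pass 1) or one element per run (pass 2) is in RAM at a time.
    """
    # Pass 1: cut the input into sorted runs that each fit in memory.
    runs = [sorted(records[i:i + memory_limit])
            for i in range(0, len(records), memory_limit)]
    # Pass 2: k-way merge the runs, streaming one element per run at a time.
    return list(heapq.merge(*runs))

data = [97, 3, 55, 12, 88, 1, 42, 7, 63, 29]
result = external_sort(data, memory_limit=3)
```

A real tiered sort additionally caps the merge fan-in per pass (hence the factor of 32 in Tom's later formula), merging runs in multiple rounds when there are too many to merge at once.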

Tom Kaitchuck

unread,
Oct 2, 2014, 8:23:59 PM10/2/14
to app-engine-...@googlegroups.com
Sorry, my math is screwy above. A more exact formula is that in P passes it can process data as large as: 32^(P+1) * (MemoryAvailableForSort) * ReduceShards, where the overhead is typically ~64 MB. So in the above case, where ReduceShards is 1, a 1 TB job should be able to be processed in 2 passes on an F4-1G or B8 instance, and in 3 passes on smaller instances. Obviously this could be improved in situations that don't require a single reduce shard. Even so, this shouldn't take very long, since each pass over the data involves many instances working in parallel.