Expensive DataStore writes on _AE_Pipeline_Barrier

Caio Iglesias

Jul 15, 2014, 1:41:35 AM
to app-engine-...@googlegroups.com
After running several pipelines I noticed that my datastore write usage was piling up.
I checked Appstats and a single pipeline/run was doing as many as 378 datastore writes.
I narrowed it down to _AE_Pipeline_Barrier's blocking_slots list of keys for the "start" and "finalize" keynames.
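
For what it's worth, my rough reading of the datastore billing docs is that a new entity put costs about 2 write ops plus 2 more per indexed property value, so an indexed list of slot keys adds up quickly. A purely hypothetical back-of-the-envelope:

# Rough estimate only; the "2 + 2 per indexed value" figures are my reading of
# the billing docs, and other_indexed is a guess at how many other indexed
# properties a barrier entity carries.
def barrier_put_cost(num_blocking_slots, other_indexed=3):
    return 2 + 2 * (num_blocking_slots + other_indexed)

print(barrier_put_cost(8))  # a barrier waiting on 8 slots -> ~24 write ops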

I'm running a denormalizing pipeline that streams to bigquery, pretty much like this:

class IncomingItem(base_handler.PipelineBase):
    def run(self, keyname, store, sku, order_id):

        # product branch
        product = yield Product(sku)
        manufacturer = yield Manufacturer(product)

        # order branch
        order = yield Order(order_id)
        customer = yield Customer(order)
        address = yield Address(order)
        gender = yield Gender(order)

        # merge everything and stream the denormalized record to BigQuery
        denormalized = yield MergeAndSave(keyname, product, manufacturer, order, customer, address, gender, store)
        yield BigQueryStream(denormalized)

Am I pushing the number of dependent pipelines too far? Should I break this up to reduce the dependencies?

David Hardwick

Jul 15, 2014, 11:00:32 AM
to app-engine-...@googlegroups.com
It would be great if the pipeline overhead (barriers, slots, jobs, etc.) could be written to Cloud SQL instead. A lot of this data is ephemeral (pipelines typically only run for a few hours or days), so costs would be lower both for reads/writes and for storage, since old, ephemeral data would be easier and cheaper to clean up in Cloud SQL.





Caio Iglesias

Jul 15, 2014, 11:40:48 AM
to app-engine-...@googlegroups.com

Yeah, well... but since it's already on the datastore, I'm just thinking about tweaking my logic.

I phrased my question badly, since I can't actually reduce the dependencies. What I'm thinking is that I should have fewer sibling pipelines. I'll reorganize the code that way and see if I can get the datastore writes down.

Too many keys on that indexed list property.

Caio Iglesias

Jul 18, 2014, 4:50:01 PM
to app-engine-...@googlegroups.com
By running child pipelines I'm saving some writes. I'm down to 208, of which 156 are on _AE_Pipeline_Slot.

class IncomingItem(base_handler.PipelineBase):
    def run(self, keyname, store, sku, order_id):

        product = yield Product(sku)
        
        order = yield Order(order_id)
        
        denormalized = yield MergeAndSave(keyname, product, product.manufacturer, order, order.customer, order.address, order.gender, store)
        yield BigQueryStream(denormalized)
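
For product.manufacturer (and the order.* values) to resolve like that, Product and Order expose those values as named output slots. A minimal synchronous sketch for Product (load_product and load_manufacturer are stand-ins for the real lookups):

class Product(base_handler.PipelineBase):
    # manufacturer is declared as a named output slot so the parent can
    # reference product.manufacturer on the future returned by the yield
    output_names = ['manufacturer']

    def run(self, sku):
        product = load_product(sku)  # hypothetical helper
        self.fill(self.outputs.manufacturer, load_manufacturer(product))
        return product  # the return value fills the default output slot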

By passing forward just product and order to MergeAndSave, I should be able to reduce those writes even further, I guess, since there will be fewer slots to fill.

class IncomingItem(base_handler.PipelineBase):
    def run(self, keyname, store, sku, order_id):

        product = yield Product(sku)
        
        order = yield Order(order_id)
        
        denormalized = yield MergeAndSave(keyname, product, order, store)
        yield BigQueryStream(denormalized)
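
The idea would be to fold the related records into Product's (and Order's) default output, so MergeAndSave only needs a single slot per branch. Roughly (again, load_* are stand-in helpers):

class Product(base_handler.PipelineBase):
    # Hypothetical: return one combined dict as the default output, so the
    # parent no longer passes separate manufacturer/customer/... slots around.
    def run(self, sku):
        product = load_product(sku)
        product['manufacturer'] = load_manufacturer(product)
        return product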

Exploding index possibly averted.