How to avoid duplicate writes with glom?


muxueqz(张明源)

Dec 26, 2013, 5:24:50 AM
to dpark...@googlegroups.com
Hi,
I've recently been using DPark to write data to ElasticSearch, POSTing to ElasticSearch's bulk API with urllib2. I've noticed that when using glom, the data sometimes comes out duplicated, which presumably has to do with map-reduce's retry-on-failure behavior. Is there a good way to avoid this?

Windreamer

Dec 26, 2013, 6:03:29 AM
to dpark-users
Hi 明源,

This is not just a glom problem; it is inherent to the dpark/spark computation model, and there is currently no good way around it.

For DPark user-defined functions with side effects, you have to ensure yourself that the side effects stay consistent when a task is re-executed.

---
Windreamer
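
One way to make this particular side effect consistent under re-execution is to derive the ElasticSearch document _id deterministically from the record content, so a retried task overwrites the same documents instead of appending duplicates. A minimal sketch; the bulk endpoint, index/type names, and record layout below are assumptions for illustration, not code from this thread:

# -*- coding: utf-8 -*-
# Sketch: idempotent bulk indexing from a glom()'ed partition.
# The _id is derived from the record content, so re-running the same
# task rewrites the same documents rather than creating duplicates.
# URL, index and type names are placeholders.
import hashlib
import json
import urllib2

BULK_URL = 'http://localhost:9200/_bulk'  # assumed endpoint

def post_bulk(records):
    # records is one partition (a list), as produced by glom()
    lines = []
    for rec in records:
        body = json.dumps(rec, sort_keys=True)   # stable serialization
        doc_id = hashlib.md5(body).hexdigest()   # deterministic across retries
        lines.append(json.dumps(
            {'index': {'_index': 'myindex', '_type': 'mytype', '_id': doc_id}}))
        lines.append(body)
    payload = '\n'.join(lines) + '\n'  # the bulk API needs a trailing newline
    urllib2.urlopen(BULK_URL, payload).read()
    return len(records)

# usage, mirroring the pattern described in the thread:
# rdd.glom().map(post_bulk).collect()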

muxueqz(张明源)

Dec 26, 2013, 9:11:09 PM
to dpark...@googlegroups.com
Yes, indeed.
I've also noticed that mergeSplit followed by glom().map() sometimes drops data; I haven't found the cause yet.

Windreamer

Dec 26, 2013, 9:15:08 PM
to dpark-users
Really? Is there a simple, reproducible example?

---
Windreamer

muxueqz(张明源)

Dec 26, 2013, 9:17:43 PM
to dpark...@googlegroups.com
Not yet.
Was mergeSplit designed only for saveAsText, or can it also be used before map, the way I'm using it?

田忠博

Dec 27, 2013, 5:09:15 AM
to dpark-users
It can be used before map; mergeSplit is mainly for merging an excessive number of splits.
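
A small usage sketch of mergeSplit before map, assuming a numSplits keyword and a textFile input (neither confirmed by this thread; check dpark's rdd.py for the actual signature):

# Sketch: merge an RDD with many small splits down to fewer splits
# before glom()/map(), reducing the number of bulk POSTs.
# numSplits is an assumed parameter name; verify against dpark/rdd.py.
from dpark import DparkContext

dc = DparkContext()
rdd = dc.textFile('/path/to/input')      # hypothetical input
merged = rdd.mergeSplit(numSplits=16)    # collapse into 16 splits
merged.glom().map(post_bulk).collect()   # post_bulk as sketched earlier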