Intermittent failures in mesos mode


策马江湖

Jul 11, 2014, 4:02:53 AM
to dpark...@googlegroups.com
Task

Clean the logs by filtering out records whose channle_id is not in the database.
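
A minimal sketch of the filtering described above. This is not the original script, which is not reproduced in this post. The tab-separated record layout, the position of the channle_id field, and the names `parse_channel_id` and `filter_records` are all assumptions made for illustration. In a DPark job the same predicate would presumably be passed to an RDD `filter`, with the set of valid IDs shipped to the workers as a broadcast variable.

```python
# Hypothetical sketch: drop log records whose channle_id is not in a known set.
# Assumed record layout: tab-separated fields, channle_id in the first field.


def parse_channel_id(line):
    """Return the channle_id field of a log line, or None if the line is empty."""
    line = line.rstrip("\n")
    if not line:
        return None
    return line.split("\t")[0]


def filter_records(lines, valid_ids):
    """Keep only the lines whose channle_id appears in valid_ids."""
    valid = set(valid_ids)  # a set gives constant-time membership tests
    return [line for line in lines if parse_channel_id(line) in valid]


if __name__ == "__main__":
    # In the real job, the valid IDs would be loaded from the database.
    sample = ["1001\t/index\n", "9999\t/unknown\n", "1002\t/about\n"]
    for line in filter_records(sample, {"1001", "1002"}):
        print(line.rstrip("\n"))
```

Keeping the predicate a pure function of the line and an in-memory set means it behaves the same in multiprocess mode and on a remote executor, which helps rule out the filter itself as the source of an intermittent failure.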

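Program


For simplicity, only the main code is kept.

Problem

It works fine in multiprocess mode, but in mesos mode it fails with some probability.

Output

The mesos output is identical whether the run fails or succeeds: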

[2014-07-11 15:17:13,011] INFO     [dbconn] : MYSQL - CONNECTION <mysql.connector.connection.MySQLConnection object at 0x1f49710> - SUCCEED
[2014-07-11 15:17:13,481] DEBUG    [broadcast] : broadcast started: tcp://v-bj-test-83:62486
[2014-07-11 15:17:13,483] DEBUG    [broadcast] : guide start at tcp://v-bj-test-83:62486
[2014-07-11 15:17:13,625] INFO     [broadcast] : broadcast a1e8d19b-040d-4375-93df-4c855a07e7e7 in 1 blocks
[2014-07-11 15:17:13,625] DEBUG    [broadcast] : server started at tcp://v-bj-test-83:55590
[v-bj-test-84] 2014-07-11 15:17:14,594 [WARNING] [executor@v-bj-test-84] default webserver at 5055 not available
[v-bj-test-85] 2014-07-11 15:17:14,601 [WARNING] [executor@v-bj-test-85] default webserver at 5055 not available
[2014-07-11 15:17:15,269] DEBUG    [broadcast] : server recv: 1 ('a1e8d19b-040d-4375-93df-4c855a07e7e7', 0)
[2014-07-11 15:17:15,319] DEBUG    [broadcast] : server recv: 1 ('a1e8d19b-040d-4375-93df-4c855a07e7e7', 0)
[2014-07-11 15:17:46,807] DEBUG    [broadcast] : server recv: 1 ('a1e8d19b-040d-4375-93df-4c855a07e7e7', 0)
[2014-07-11 15:17:46,886] DEBUG    [broadcast] : server recv: 1 ('a1e8d19b-040d-4375-93df-4c855a07e7e7', 0)
[2014-07-11 15:18:13,859] DEBUG    [broadcast] : server recv: 1 ('a1e8d19b-040d-4375-93df-4c855a07e7e7', 0)
[2014-07-11 15:18:15,832] DEBUG    [broadcast] : server recv: 1 ('a1e8d19b-040d-4375-93df-4c855a07e7e7', 0)
[2014-07-11 15:18:44,724] DEBUG    [broadcast] : server recv: 1 ('a1e8d19b-040d-4375-93df-4c855a07e7e7', 0)
[2014-07-11 15:18:45,847] DEBUG    [broadcast] : server recv: 1 ('a1e8d19b-040d-4375-93df-4c855a07e7e7', 0)
[2014-07-11 15:19:10,934] DEBUG    [broadcast] : server recv: 1 ('a1e8d19b-040d-4375-93df-4c855a07e7e7', 0)
[2014-07-11 15:19:14,903] DEBUG    [broadcast] : server recv: 1 ('a1e8d19b-040d-4375-93df-4c855a07e7e7', 0)
[2014-07-11 15:19:41,911] DEBUG    [broadcast] : server recv: 1 ('a1e8d19b-040d-4375-93df-4c855a07e7e7', 0)
[2014-07-11 15:19:44,890] DEBUG    [broadcast] : server recv: 1 ('a1e8d19b-040d-4375-93df-4c855a07e7e7', 0)
[2014-07-11 15:20:11,671] DEBUG    [broadcast] : server recv: 1 ('a1e8d19b-040d-4375-93df-4c855a07e7e7', 0)
[2014-07-11 15:22:51,663] DEBUG    [broadcast] : Sending stop notification to all servers ...
[2014-07-11 15:22:51,665] DEBUG    [broadcast] : server recv: 0 None
[2014-07-11 15:22:51,666] DEBUG    [broadcast] : stop Broadcast server tcp://v-bj-test-83:55590


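Checking with echo $? shows that the run sometimes succeeds and sometimes fails.

I can't figure out why; any pointers from the experts would be much appreciated. Many thanks!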
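
Because the failure is intermittent, one way to quantify it is to run the job several times and tally the exit codes, which is the same information echo $? reports for a single run. The sketch below assumes the job can be started as an ordinary command; the function name `tally_exit_codes` and the stand-in command are hypothetical.

```python
import subprocess
import sys
from collections import Counter


def tally_exit_codes(cmd, runs):
    """Run cmd `runs` times and count how often each exit code occurs."""
    counts = Counter()
    for _ in range(runs):
        # Capture stdout and stderr so the job's output does not clutter the tally.
        result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        counts[result.returncode] += 1
    return counts


if __name__ == "__main__":
    # Stand-in command that always exits with 0; replace it with the real job invocation.
    cmd = [sys.executable, "-c", "import sys; sys.exit(0)"]
    print(dict(tally_exit_codes(cmd, 3)))
```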



田忠博(Zhongbo Tian)

Jul 11, 2014, 4:19:24 AM
to dpark-users
Hi sunzy,

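   It seems the gist URL you provided requires a login to open, so I can't see your script. I don't see anything unusual in the log, but I suspect it isn't the complete log, since there is no job scheduling information in it. If convenient, please provide the script and a more complete log.

Thanks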




策马江湖

Jul 11, 2014, 4:32:17 AM
to dpark...@googlegroups.com
Thanks for the reply.

I didn't expect that a public gist on gitlab would still require a login, sorry about that!

New gist URL:


Log:

I0711 16:15:16.348577 24616 status_update_manager.cpp:300] Received status update TASK_RUNNING (UUID: 19cb0cf9-c3a0-e04e-75b1-1d483846d0d0) for task 5:216:5 of framework 201407111456-1393230090-5050-27021-0003
I0711 16:15:16.353021 24616 status_update_manager.cpp:351] Forwarding status update TASK_RUNNING (UUID: 19cb0cf9-c3a0-e04e-75b1-1d483846d0d0) for task 5:216:5 of framework 201407111456-1393230090-5050-27021-0003 to mas...@10.1.11.83:5050
I0711 16:15:16.349350 24612 slave.cpp:1798] Handling status update TASK_LOST (UUID: dc20874b-a858-b37f-364e-4dc22f1534c6) for task 5:216:5 of framework 201407111456-1393230090-5050-27021-0003 from @0.0.0.0:0
I0711 16:15:16.361174 24612 status_update_manager.cpp:300] Received status update TASK_LOST (UUID: dc20874b-a858-b37f-364e-4dc22f1534c6) for task 5:216:5 of framework 201407111456-1393230090-5050-27021-0003
I0711 16:15:16.365046 24612 status_update_manager.cpp:375] Received status update acknowledgement (UUID: 19cb0cf9-c3a0-e04e-75b1-1d483846d0d0) for task 5:216:5 of framework 201407111456-1393230090-5050-27021-0003
I0711 16:15:16.369014 24612 status_update_manager.cpp:351] Forwarding status update TASK_LOST (UUID: dc20874b-a858-b37f-364e-4dc22f1534c6) for task 5:216:5 of framework 201407111456-1393230090-5050-27021-0003 to mas...@10.1.11.83:5050
I0711 16:15:16.376309 24612 status_update_manager.cpp:375] Received status update acknowledgement (UUID: dc20874b-a858-b37f-364e-4dc22f1534c6) for task 5:216:5 of framework 201407111456-1393230090-5050-27021-0003
I0711 16:15:16.490314 24612 slave.cpp:1191] Asked to shut down framework 201407111456-1393230090-5050-27021-0003 by mas...@10.1.11.83:5050
I0711 16:15:16.490407 24612 slave.cpp:1216] Shutting down framework 201407111456-1393230090-5050-27021-0003
I0711 16:15:16.496587 24612 slave.cpp:2449] Shutting down executor 'default' of framework 201407111456-1393230090-5050-27021-0003
I0711 16:15:21.501266 24609 slave.cpp:2518] Killing executor 'default' of framework 201407111456-1393230090-5050-27021-0003
I0711 16:15:21.507995 24609 process_isolator.cpp:268] Killed the following process trees:
[
--- 24799 python
]
I0711 16:15:21.866076 24613 process_isolator.cpp:479] Telling slave of terminated executor 'default' of framework 201407111456-1393230090-5050-27021-0003
I0711 16:15:21.866204 24613 slave.cpp:2178] Executor 'default' of framework 201407111456-1393230090-5050-27021-0003 has terminated with signal Killed
I0711 16:15:21.871322 24613 slave.cpp:2308] Cleaning up executor 'default' of framework 201407111456-1393230090-5050-27021-0003
I0711 16:15:21.875337 24613 slave.cpp:2380] Cleaning up framework 201407111456-1393230090-5050-27021-0003
I0711 16:15:21.879290 24613 status_update_manager.cpp:262] Closing status update streams for framework 201407111456-1393230090-5050-27021-0003
I0711 16:15:21.875402 24611 gc.cpp:56] Scheduling '/tmp/mesos/slaves/201407111456-1393230090-5050-27021-1/frameworks/201407111456-1393230090-5050-27021-0003/executors/default/runs/531954cf-4f5f-4c45-81f4-b64e323475b9' for removal
I0711 16:15:21.887215 24611 gc.cpp:56] Scheduling '/tmp/mesos/slaves/201407111456-1393230090-5050-27021-1/frameworks/201407111456-1393230090-5050-27021-0003/executors/default' for removal
I0711 16:15:21.891232 24611 gc.cpp:56] Scheduling '/tmp/mesos/slaves/201407111456-1393230090-5050-27021-1/frameworks/201407111456-1393230090-5050-27021-0003' for removal


On Friday, July 11, 2014 at 4:19:24 PM UTC+8, 田忠博 wrote:

田忠博(Zhongbo Tian)

Jul 11, 2014, 5:18:33 AM
to dpark-users
Hi sunzy,

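   Judging from the mesos slave log, it looks like some tasks failed (Task Lost). You will need to look more closely into why they failed; it could be a permissions problem on the executor node, or the task may have been killed for using too much memory. Of course, when some tasks fail, Dpark will try to retry them, and whether the retry succeeded cannot be determined from a single slave's log. If you can provide the script's complete output (both standard output and standard error), there may be further clues.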
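
To see which tasks were reported lost across several slave logs, the status update lines can be summarized per task. The following is a rough sketch that assumes each log entry sits on a single line in the format shown in the slave log above; the regular expression and the names `task_states` and `lost_tasks` are illustrative, not part of mesos or DPark.

```python
import re

# Matches lines such as:
#   ... Received status update TASK_LOST (UUID: ...) for task 5:216:5 of framework ...
# Acknowledgement lines are skipped because no TASK_* state follows "status update".
STATUS_RE = re.compile(
    r"status update (TASK_[A-Z]+) \(UUID: [0-9a-f-]+\)\s+for task (\S+) of"
)


def task_states(lines):
    """Map each task id to the distinct states reported for it, in the order seen."""
    states = {}
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue
        state, task_id = match.groups()
        seen = states.setdefault(task_id, [])
        if state not in seen:
            seen.append(state)
    return states


def lost_tasks(lines):
    """Return the ids of tasks that were reported as TASK_LOST at least once."""
    return sorted(t for t, seen in task_states(lines).items() if "TASK_LOST" in seen)


if __name__ == "__main__":
    # Reads slave log lines from standard input and prints the lost task ids.
    import sys

    for task_id in lost_tasks(sys.stdin):
        print(task_id)
```

A task id that shows up as lost here only says that one attempt failed on that slave; whether a retry succeeded elsewhere still has to be checked against the scheduler's own output.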