My bolts keep getting shut down

川 杜

Apr 1, 2012, 6:19:35 AM
to storm-user
Hi all,

Can anyone help me with this problem?
I set up a 12-worker cluster on 3 boxes. My jobs mainly collect logs,
compute summary counts, and push the summary to a database every
5 minutes. However, my bolts keep getting shut down because of
timeouts.

2012-04-01 17:10:57 supervisor [INFO] Shutting down and clearing state
for id 6bd310a9-d885-45f7-8efe-1931dda89e28. State: :timed-out,
Heartbeat: #:backtype.storm.daemon.common.WorkerHeartbeat{:time-secs
1333271426, :storm-id "RealTimeAnalysis-1-1333209246", :task-ids (8
18), :port 10022}
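
A quick sanity check on the log line above is to compare the heartbeat's :time-secs value with the timestamp of the supervisor log line. The sketch below assumes the supervisor host logs in UTC+8 (an assumption; it is consistent with the nimbus log later in the thread, where a task start time of 1333268749 lines up with a 16:25:49 log line). The function name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the supervisor host writes its log timestamps in UTC+8.
LOCAL_TZ = timezone(timedelta(hours=8))


def heartbeat_age_secs(log_time: str, heartbeat_time_secs: int) -> int:
    """Seconds between a worker's last heartbeat and the supervisor log line."""
    logged_at = datetime.strptime(log_time, "%Y-%m-%d %H:%M:%S")
    logged_at = logged_at.replace(tzinfo=LOCAL_TZ)
    return int(logged_at.timestamp()) - heartbeat_time_secs


# Values copied from the "Shutting down and clearing state" line.
age = heartbeat_age_secs("2012-04-01 17:10:57", 1333271426)
print(age)  # 31
```

Under that timezone assumption the last heartbeat was 31 seconds old when the supervisor acted. That is just over 30 seconds, which as far as I know is Storm's default for supervisor.worker.timeout.secs (treat that default as an assumption), so the worker appears to have stopped heartbeating rather than to have been killed by mistake.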

Because my job first sums each subset and then combines those sums
into a total for the whole set, each shutdown causes data loss. I
cannot get enough information from the logs.


My supervisor.log


2012-04-01 16:25:49 supervisor [INFO] Launching worker with command:
java -server -Xmx768m -Djava.library.path=/usr/local/lib:/opt/local/
lib:/usr/lib -Dlogfile.name=worker-10023.log -
Dlog4j.configuration=storm.log.properties -cp /home/op/work/storm/
storm-0.6.2/storm-0.6.2.jar:/home/op/work/storm/storm-0.6.2/lib/ring-
jetty-adapter-0.3.11.jar:/home/op/work/storm/storm-0.6.2/lib/commons-
io-1.4.jar:/home/op/work/storm/storm-0.6.2/lib/json-simple-1.1.jar:/
home/op/work/storm/storm-0.6.2/lib/slf4j-log4j12-1.5.8.jar:/home/op/
work/storm/storm-0.6.2/lib/log4j-1.2.16.jar:/home/op/work/storm/
storm-0.6.2/lib/servlet-api-2.5.jar:/home/op/work/storm/storm-0.6.2/
lib/clojure-contrib-1.2.0.jar:/home/op/work/storm/storm-0.6.2/lib/
reflectasm-1.01.jar:/home/op/work/storm/storm-0.6.2/lib/
carbonite-1.0.0.jar:/home/op/work/storm/storm-0.6.2/lib/
guava-10.0.1.jar:/home/op/work/storm/storm-0.6.2/lib/clout-0.4.1.jar:/
home/op/work/storm/storm-0.6.2/lib/core.incubator-0.1.0.jar:/home/op/
work/storm/storm-0.6.2/lib/commons-fileupload-1.2.1.jar:/home/op/work/
storm/storm-0.6.2/lib/compojure-0.6.4.jar:/home/op/work/storm/
storm-0.6.2/lib/commons-lang-2.5.jar:/home/op/work/storm/storm-0.6.2/
lib/tools.macro-0.1.0.jar:/home/op/work/storm/storm-0.6.2/lib/curator-
client-1.0.1.jar:/home/op/work/storm/storm-0.6.2/lib/jline-0.9.94.jar:/
home/op/work/storm/storm-0.6.2/lib/httpcore-4.1.jar:/home/op/work/
storm/storm-0.6.2/lib/httpclient-4.1.1.jar:/home/op/work/storm/
storm-0.6.2/lib/servlet-api-2.5-20081211.jar:/home/op/work/storm/
storm-0.6.2/lib/hiccup-0.3.6.jar:/home/op/work/storm/storm-0.6.2/lib/
snakeyaml-1.9.jar:/home/op/work/storm/storm-0.6.2/lib/asm-3.2.jar:/
home/op/work/storm/storm-0.6.2/lib/junit-3.8.1.jar:/home/op/work/storm/
storm-0.6.2/lib/jzmq-2.1.0.jar:/home/op/work/storm/storm-0.6.2/lib/
commons-codec-1.4.jar:/home/op/work/storm/storm-0.6.2/lib/
libthrift7-0.7.0.jar:/home/op/work/storm/storm-0.6.2/lib/ring-
core-0.3.10.jar:/home/op/work/storm/storm-0.6.2/lib/ring-
servlet-0.3.11.jar:/home/op/work/storm/storm-0.6.2/lib/commons-
logging-1.1.1.jar:/home/op/work/storm/storm-0.6.2/lib/
zookeeper-3.3.3.jar:/home/op/work/storm/storm-0.6.2/lib/
clojure-1.2.0.jar:/home/op/work/storm/storm-0.6.2/lib/joda-
time-1.6.jar:/home/op/work/storm/storm-0.6.2/lib/slf4j-api-1.5.8.jar:/
home/op/work/storm/storm-0.6.2/lib/minlog-1.2.jar:/home/op/work/storm/
storm-0.6.2/lib/curator-framework-1.0.1.jar:/home/op/work/storm/
storm-0.6.2/lib/clj-time-0.3.0.jar:/home/op/work/storm/storm-0.6.2/lib/
jetty-util-6.1.26.jar:/home/op/work/storm/storm-0.6.2/lib/
jetty-6.1.26.jar:/home/op/work/storm/storm-0.6.2/lib/commons-
exec-1.1.jar:/home/op/work/storm/storm-0.6.2/lib/kryo-1.04.jar:/home/
op/work/storm/storm-0.6.2/lib/jsr305-1.3.9.jar:/home/op/work/storm/
storm-0.6.2/log4j:/home/op/work/storm/storm-0.6.2/conf:/home/op/work/
storm/state/supervisor/stormdist/RealTimeAnalysis-1-1333209246/
stormjar.jar backtype.storm.daemon.worker
RealTimeAnalysis-1-1333209246 80954a44-5bc6-4a67-aa98-2794447c5fd6
10023 a6dd4003-9fdf-40c5-9a92-8a1e75330459
2012-04-01 16:25:49 supervisor [INFO]
a6dd4003-9fdf-40c5-9a92-8a1e75330459 still hasn't started
2012-04-01 16:25:50 supervisor [INFO]
a6dd4003-9fdf-40c5-9a92-8a1e75330459 still hasn't started
2012-04-01 16:25:50 supervisor [INFO]
a6dd4003-9fdf-40c5-9a92-8a1e75330459 still hasn't started
2012-04-01 16:25:51 supervisor [INFO]
a6dd4003-9fdf-40c5-9a92-8a1e75330459 still hasn't started
2012-04-01 16:25:51 supervisor [INFO]
a6dd4003-9fdf-40c5-9a92-8a1e75330459 still hasn't started
2012-04-01 16:25:52 supervisor [INFO]
a6dd4003-9fdf-40c5-9a92-8a1e75330459 still hasn't started
2012-04-01 17:10:57 supervisor [INFO] Shutting down and clearing state
for id 6bd310a9-d885-45f7-8efe-1931dda89e28. State: :timed-out,
Heartbeat: #:backtype.storm.daemon.common.WorkerHeartbeat{:time-secs
1333271426, :storm-id "RealTimeAnalysis-1-1333209246", :task-ids (8
18), :port 10022}
2012-04-01 17:10:57 supervisor [INFO] Shutting down 80954a44-5bc6-4a67-
aa98-2794447c5fd6:6bd310a9-d885-45f7-8efe-1931dda89e28
2012-04-01 17:10:57 supervisor [INFO] Shut down 80954a44-5bc6-4a67-
aa98-2794447c5fd6:6bd310a9-d885-45f7-8efe-1931dda89e28
2012-04-01 17:10:57 supervisor [INFO] Launching worker with assignment
#:backtype.storm.daemon.supervisor.LocalAssignment{:storm-id
"RealTimeAnalysis-1-1333209246", :task-ids (8 18)} for this supervisor
80954a44-5bc6-4a67-aa98-2794447c5fd6 on port 10022 with id
3940aac7-70a1-452d-ad81-64a3b79a88c3

Nathan Marz

Apr 3, 2012, 5:36:43 PM
to storm...@googlegroups.com
Can you send over the nimbus.log for your topology? Also, did you override any of the supervisor.worker.* configs? Finally, what version of Storm is this with?
--
Twitter: @nathanmarz
http://nathanmarz.com

川 杜

Apr 5, 2012, 10:43:26 PM
to storm-user
Nathan, thank you very much for your reply.

I am using 0.6.2, and I have not overridden any supervisor.worker.*
configs.

It seems that the worker first found that its connection to ZooKeeper
was broken. About 20 seconds later it tried to reconnect to the same
ZooKeeper session, which had already timed out and been deleted. The
worker then tried to rebuild the connection to ZooKeeper and failed.
I cannot find any application output in the worker log, so I don't
know what happened.

In my bolt, I start a timer task (every 5 minutes) that runs some
time-consuming work (several seconds). Will that affect Storm's
performance?

The worker's log looks like this:
2012-04-04 14:45:22 ClientCnxn [INFO] Client session timed out, have
not heard from server in 13400ms for sessionid 0x136695d83c700b1,
closing socket connection and attempting reconnect
2012-04-04 14:45:33 ConnectionStateManager [INFO] State change:
SUSPENDED
2012-04-04 14:45:33 ConnectionStateManager [WARN] There are no
ConnectionStateListeners registered.
2012-04-04 14:45:36 ClientCnxn [INFO] Opening socket connection to
server data1.wf.ppweb.com.cn/192.168.0.101:2181
2012-04-04 14:45:36 ClientCnxn [INFO] Socket connection established to
data1.wf.ppweb.com.cn/192.168.0.101:2181, initiating session
2012-04-04 14:45:46 cluster [WARN] Received event :disconnected::none:
with disconnected Zookeeper.
2012-04-04 14:45:47 ConnectionState [WARN] Session expired event
received
2012-04-04 14:45:47 ClientCnxn [INFO] Unable to reconnect to ZooKeeper
service, session 0x136695d83c700b1 has expired, closing socket
connection
2012-04-04 14:45:47 ZooKeeper [INFO] Initiating client connection,
connectString=192.168.0.101:2181/storm sessionTimeout=20000
watcher=com.netflix.curator.ConnectionState@24cd59e9
2012-04-04 14:45:47 ClientCnxn [INFO] Opening socket connection to
server /192.168.0.101:2181
2012-04-04 14:45:50 ClientCnxn [INFO] Socket connection established to
data1.wf.ppweb.com.cn/192.168.0.101:2181, initiating session
2012-04-04 14:45:50 ConnectionStateManager [INFO] State change: LOST
2012-04-04 14:45:50 ConnectionStateManager [WARN] There are no
ConnectionStateListeners registered.
2012-04-04 14:45:50 cluster [WARN] Received event :expired::none: with
disconnected Zookeeper.
2012-04-04 14:45:53 ClientCnxn [INFO] Session establishment complete
on server data1.wf.ppweb.com.cn/192.168.0.101:2181, sessionid =
0x136695d83c700ca, negotiated timeout = 20000
2012-04-04 14:45:53 ConnectionStateManager [INFO] State change:
RECONNECTED
2012-04-04 14:45:53 ConnectionStateManager [WARN] There are no
ConnectionStateListeners registered.
2012-04-04 14:45:59 ClientCnxn [INFO] EventThread shut down
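
The numbers in this worker log fit together as follows. The ZooKeeper client gives up on a connection after two thirds of the session timeout without hearing from the server, which is why the first line reports about 13400 ms for a 20000 ms session. The sketch below also estimates when the server would have expired the session. That estimate rests on a simplifying assumption (expiry one full session timeout after the last contact), and the log timestamps only have one-second resolution, so read it as a rough check rather than an exact reconstruction.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT_MS = 20000  # sessionTimeout / negotiated timeout in the log


def zk_read_timeout_ms(session_timeout_ms: int) -> int:
    """Client-side read timeout: two thirds of the session timeout."""
    return session_timeout_ms * 2 // 3


# Timestamps copied from the worker log.
gave_up_at = datetime(2012, 4, 4, 14, 45, 22)      # "Client session timed out"
silent_for = timedelta(milliseconds=13400)         # "not heard from server in 13400ms"
reconnected_at = datetime(2012, 4, 4, 14, 45, 36)  # "Socket connection established"

last_heard = gave_up_at - silent_for
# Assumption: the server expires the session one full timeout after last contact.
expires_around = last_heard + timedelta(milliseconds=SESSION_TIMEOUT_MS)

print(zk_read_timeout_ms(SESSION_TIMEOUT_MS))  # 13333
print(reconnected_at > expires_around)         # True
```

By this estimate the client got its socket back several seconds after the session would have expired, which matches the "Session expired event received" line and the new session id that follows it.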


My nimbus log looks like this:
2012-04-01 01:00:22 nimbus [INFO] Cleaning inbox ... deleted:
stormjar-2b2796d7-a3d1-41ed-8099-e6784e3dbf97.jar
2012-04-01 16:25:49 nimbus [INFO] Task RealTimeAnalysis-1-1333209246:2
timed out
2012-04-01 16:25:49 nimbus [INFO] Task
RealTimeAnalysis-1-1333209246:12 timed out
2012-04-01 16:25:49 nimbus [INFO] Reassigning
RealTimeAnalysis-1-1333209246 to 10 slots
2012-04-01 16:25:49 nimbus [INFO] Reassign ids: [2 12]
2012-04-01 16:25:49 nimbus [INFO] Available slots:
(["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10023]
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10023])
2012-04-01 16:25:49 nimbus [INFO] Setting new assignment for storm id
RealTimeAnalysis-1-1333209246:
#:backtype.storm.daemon.common.Assignment{:master-code-dir "/home/op/
work/storm/state/nimbus/stormdist/
RealTimeAnalysis-1-1333209246", :node->host
{"c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" "data0.wf.ppweb.com.cn",
"80954a44-5bc6-4a67-aa98-2794447c5fd6" "data1.wf.ppweb.com.cn",
"b5b94eb7-485a-493b-8780-d9835dab3e2e" "data2.wf.ppweb.com.cn"}, :task-
>node+port {1 ["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10020], 2
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10023], 3
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10020], 4
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10021], 5 ["80954a44-5bc6-4a67-
aa98-2794447c5fd6" 10021], 6 ["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81"
10021], 7 ["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10022], 8
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10022], 9
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10022], 10
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10023], 11
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10020], 12
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10023], 13
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10020], 14
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10021], 15
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10021], 16
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10021], 17
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10022], 18
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10022], 19
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10022], 20
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10023], 21
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10020]}, :task->start-time-
secs {1 1333209247, 2 1333268749, 3 1333209247, 4 1333209247, 5
1333209247, 6 1333209247, 7 1333209247, 8 1333209247, 9 1333209247, 10
1333209247, 11 1333209247, 12 1333268749, 13 1333209247, 14
1333209247, 15 1333209247, 16 1333209247, 17 1333209247, 18
1333209247, 19 1333209247, 20 1333209247, 21 1333209247}}
2012-04-01 17:48:44 nimbus [INFO] Task
RealTimeAnalysis-1-1333209246:14 timed out
2012-04-01 17:48:44 nimbus [INFO] Reassigning
RealTimeAnalysis-1-1333209246 to 10 slots
2012-04-01 17:48:44 nimbus [INFO] Reassign ids: [4 14]
2012-04-01 17:48:44 nimbus [INFO] Available slots:
(["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10020]
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10023])
2012-04-01 17:48:44 nimbus [INFO] Setting new assignment for storm id
RealTimeAnalysis-1-1333209246:
#:backtype.storm.daemon.common.Assignment{:master-code-dir "/home/op/
work/storm/state/nimbus/stormdist/
RealTimeAnalysis-1-1333209246", :node->host
{"c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" "data0.wf.ppweb.com.cn",
"80954a44-5bc6-4a67-aa98-2794447c5fd6" "data1.wf.ppweb.com.cn",
"b5b94eb7-485a-493b-8780-d9835dab3e2e" "data2.wf.ppweb.com.cn"}, :task-
>node+port {1 ["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10020], 2
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10023], 3
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10020], 4 ["80954a44-5bc6-4a67-
aa98-2794447c5fd6" 10020], 5 ["80954a44-5bc6-4a67-aa98-2794447c5fd6"
10021], 6 ["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10021], 7
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10022], 8 ["80954a44-5bc6-4a67-
aa98-2794447c5fd6" 10022], 9 ["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81"
10022], 10 ["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10023], 11
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10020], 12
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10023], 13
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10020], 14
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10020], 15
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10021], 16
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10021], 17
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10022], 18
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10022], 19
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10022], 20
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10023], 21
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10020]}, :task->start-time-
secs {1 1333209247, 2 1333268749, 3 1333209247, 4 1333273724, 5
1333209247, 6 1333209247, 7 1333209247, 8 1333209247, 9 1333209247, 10
1333209247, 11 1333209247, 12 1333268749, 13 1333209247, 14
1333273724, 15 1333209247, 16 1333209247, 17 1333209247, 18
1333209247, 19 1333209247, 20 1333209247, 21 1333209247}}
2012-04-01 18:19:34 nimbus [INFO] Task
RealTimeAnalysis-1-1333209246:10 timed out
2012-04-01 18:19:34 nimbus [INFO] Task
RealTimeAnalysis-1-1333209246:20 timed out
2012-04-01 18:19:34 nimbus [INFO] Reassigning
RealTimeAnalysis-1-1333209246 to 10 slots
2012-04-01 18:19:34 nimbus [INFO] Reassign ids: [10 20]
2012-04-01 18:19:34 nimbus [INFO] Available slots:
(["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10021]
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10023])
2012-04-01 18:19:34 nimbus [INFO] Setting new assignment for storm id
RealTimeAnalysis-1-1333209246:
#:backtype.storm.daemon.common.Assignment{:master-code-dir "/home/op/
work/storm/state/nimbus/stormdist/
RealTimeAnalysis-1-1333209246", :node->host
{"c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" "data0.wf.ppweb.com.cn",
"80954a44-5bc6-4a67-aa98-2794447c5fd6" "data1.wf.ppweb.com.cn",
"b5b94eb7-485a-493b-8780-d9835dab3e2e" "data2.wf.ppweb.com.cn"}, :task-
>node+port {1 ["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10020], 2
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10023], 3
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10020], 4 ["80954a44-5bc6-4a67-
aa98-2794447c5fd6" 10020], 5 ["80954a44-5bc6-4a67-aa98-2794447c5fd6"
10021], 6 ["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10021], 7
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10022], 8 ["80954a44-5bc6-4a67-
aa98-2794447c5fd6" 10022], 9 ["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81"
10022], 10 ["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10021], 11
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10020], 12
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10023], 13
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10020], 14
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10020], 15
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10021], 16
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10021], 17
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10022], 18
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10022], 19
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10022], 20
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10021], 21
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10020]}, :task->start-time-
secs {1 1333209247, 2 1333268749, 3 1333209247, 4 1333273724, 5
1333209247, 6 1333209247, 7 1333209247, 8 1333209247, 9 1333209247, 10
1333275574, 11 1333209247, 12 1333268749, 13 1333209247, 14
1333273724, 15 1333209247, 16 1333209247, 17 1333209247, 18
1333209247, 19 1333209247, 20 1333275574, 21 1333209247}}
2012-04-01 19:16:42 nimbus [INFO] Task RealTimeAnalysis-1-1333209246:7
timed out
2012-04-01 19:16:42 nimbus [INFO] Task
RealTimeAnalysis-1-1333209246:17 timed out
2012-04-01 19:16:42 nimbus [INFO] Reassigning
RealTimeAnalysis-1-1333209246 to 10 slots
2012-04-01 19:16:42 nimbus [INFO] Reassign ids: [7 17]
2012-04-01 19:16:42 nimbus [INFO] Available slots:
(["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10023]
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10023])
2012-04-01 19:16:42 nimbus [INFO] Setting new assignment for storm id
RealTimeAnalysis-1-1333209246:
#:backtype.storm.daemon.common.Assignment{:master-code-dir "/home/op/
work/storm/state/nimbus/stormdist/
RealTimeAnalysis-1-1333209246", :node->host
{"c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" "data0.wf.ppweb.com.cn",
"80954a44-5bc6-4a67-aa98-2794447c5fd6" "data1.wf.ppweb.com.cn",
"b5b94eb7-485a-493b-8780-d9835dab3e2e" "data2.wf.ppweb.com.cn"}, :task-
>node+port {1 ["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10020], 2
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10023], 3
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10020], 4 ["80954a44-5bc6-4a67-
aa98-2794447c5fd6" 10020], 5 ["80954a44-5bc6-4a67-aa98-2794447c5fd6"
10021], 6 ["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10021], 7
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10023], 8 ["80954a44-5bc6-4a67-
aa98-2794447c5fd6" 10022], 9 ["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81"
10022], 10 ["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10021], 11
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10020], 12
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10023], 13
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10020], 14
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10020], 15
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10021], 16
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10021], 17
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10023], 18
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10022], 19
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10022], 20
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10021], 21
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10020]}, :task->start-time-
secs {1 1333209247, 2 1333268749, 3 1333209247, 4 1333273724, 5
1333209247, 6 1333209247, 7 1333279002, 8 1333209247, 9 1333209247, 10
1333275574, 11 1333209247, 12 1333268749, 13 1333209247, 14
1333273724, 15 1333209247, 16 1333209247, 17 1333279002, 18
1333209247, 19 1333209247, 20 1333275574, 21 1333209247}}
2012-04-01 19:47:44 nimbus [INFO] Task RealTimeAnalysis-1-1333209246:9
timed out
2012-04-01 19:47:44 nimbus [INFO] Task
RealTimeAnalysis-1-1333209246:19 timed out
2012-04-01 19:47:44 nimbus [INFO] Reassigning
RealTimeAnalysis-1-1333209246 to 10 slots
2012-04-01 19:47:44 nimbus [INFO] Reassign ids: [9 19]
2012-04-01 19:47:44 nimbus [INFO] Available slots:
(["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10022]
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10023])
2012-04-01 19:47:44 nimbus [INFO] Setting new assignment for storm id
RealTimeAnalysis-1-1333209246:
#:backtype.storm.daemon.common.Assignment{:master-code-dir "/home/op/
work/storm/state/nimbus/stormdist/
RealTimeAnalysis-1-1333209246", :node->host
{"c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" "data0.wf.ppweb.com.cn",
"80954a44-5bc6-4a67-aa98-2794447c5fd6" "data1.wf.ppweb.com.cn",
"b5b94eb7-485a-493b-8780-d9835dab3e2e" "data2.wf.ppweb.com.cn"}, :task-
>node+port {1 ["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10020], 2
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10023], 3
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10020], 4 ["80954a44-5bc6-4a67-
aa98-2794447c5fd6" 10020], 5 ["80954a44-5bc6-4a67-aa98-2794447c5fd6"
10021], 6 ["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10021], 7
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10023], 8 ["80954a44-5bc6-4a67-
aa98-2794447c5fd6" 10022], 9 ["b5b94eb7-485a-493b-8780-d9835dab3e2e"
10022], 10 ["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10021], 11
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10020], 12
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10023], 13
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10020], 14
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10020], 15
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10021], 16
["c45b8f05-4f2c-4a9f-85fb-5e64f6a31d81" 10021], 17
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10023], 18
["80954a44-5bc6-4a67-aa98-2794447c5fd6" 10022], 19
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10022], 20
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10021], 21
["b5b94eb7-485a-493b-8780-d9835dab3e2e" 10020]}, :task->start-time-
secs {1 1333209247, 2 1333268749, 3 1333209247, 4 1333273724, 5
1333209247, 6 1333209247, 7 1333279002, 8 1333209247, 9 1333280864, 10
1333275574, 11 1333209247, 12 1333268749, 13 1333209247, 14
1333273724, 15 1333209247, 16 1333209247, 17 1333279002, 18
1333209247, 19 1333280864, 20 1333275574, 21 1333209247}}
2012-04-01 20:00:28 nimbus [INFO] Delaying event :remove for 30 secs
for RealTimeAnalysis-1-1333209246
2012-04-01 20:00:28 nimbus [INFO] Updated
RealTimeAnalysis-1-1333209246 with status {:type :killed, :kill-time-
secs 30}
2012-04-01 20:00:58 nimbus [INFO] Killing topology:
RealTimeAnalysis-1-1333209246
2012-04-01 20:00:59 nimbus [INFO] Cleaning up
RealTimeAnalysis-1-1333209246
2012-04-01 20:03:24 nimbus [INFO] Uploading file from client to /home/
op/work/storm/state/nimbus/inbox/stormjar-
edb4c029-7c1b-408d-8af3-176771aafb34.jar
2012-04-01 20:03:24 nimbus [INFO] Finished uploading file from
client: /home/op/work/storm/state/nimbus/inbox/stormjar-
edb4c029-7c1b-408d-8af3-176771aafb34.jar
2012-04-01 20:03:24 nimbus [INFO] Received topology submission for
RealTimeAnalysis with conf {"storm.id"
"RealTimeAnalysis-2-1333281804", "DB_PORT" "27017",
"Global_DB_COLLECTION" "globalState", "Partner_DB_COLLECTION"
"partnerState", "SubPartner_DB_COLLECTION" "subPartnerState",
"com.yuncheng.realtime.sourceCount" "3", "DB_NAME" "p2pstatus",
"DB_USER" "p2pstatus", "LoadBalanceSubPartner_RUN_DELAY" "10",
"SubPartner_RUN_INTERVAL" "300000", "Partner_RUN_INTERVAL" "300000",
"com.yuncheng.realtime.cluster" "wf", "Global_RUN_DELAY" "20000",
"DB_ADDRESS" "bj2.ppweb.com.cn", "com.yuncheng.realtime.source_0"
"192.168.0.100:10100", "topology.kryo.register" nil,
"Global_RUN_INTERVAL" "300000", "topology.workers" 10,
"com.yuncheng.realtime.source_1" "192.168.0.101:10100",
"com.yuncheng.realtime.source_2" "192.168.0.102:10100",
"SubPartner_RUN_DELAY" "5000", "Partner_RUN_DELAY" "10000",
"LoadBalanceSubPartner_RUN_INTERVAL" "10000", "display_port" "tcp://
192.168.0.100:20000", "DB_PASSWD" "status"}

川 杜

Apr 5, 2012, 10:51:25 PM
to storm-user
By the way, my task is to count page views every 5 minutes, and the
timeouts and restarts of the bolt do damage my results (if the bolt
stops in the middle of a 5-minute window, my page view count is only
about half). If timeouts are a normal situation, how is a counting job
supposed to work? Is there another option?

PS: the peak request rate on my system is about 10k per second.

Looking forward to your reply.

Regards.

James Xu

Apr 5, 2012, 11:13:07 PM
to storm...@googlegroups.com
I think this issue is worth paying close attention to. Because all the Storm daemons are fail-fast, it is a little hard for us to write things like aggregations: you need to keep some state in memory, but the daemon may die at any time, and then you lose the results.

Nathan: how about adding a hook for the "die" event, so that developers can at least save the state to a database?

2012/4/6 川 杜 <nau...@gmail.com>

Nathan Marz

Apr 8, 2012, 9:03:07 PM
to storm...@googlegroups.com
It does seem possible that your Zookeeper is getting overloaded. One thing you can try is increasing "nimbus.task.timeout.secs" on your Nimbus machine to see if that helps. You should also carefully check your Zookeeper setup against these docs to make sure it's set up correctly: http://zookeeper.apache.org/doc/r3.3.4/zookeeperAdmin.html

Let me know if any of that helps.

Nathan Marz

Apr 8, 2012, 9:04:36 PM
to storm...@googlegroups.com
There's no reliable way to ensure that the die event is called (imagine that the node suddenly disappears). Since there's no way to make it reliable, it wouldn't be a useful addition IMO.


2012/4/5 James Xu <xumingmin...@gmail.com>

Ted Dunning

Apr 8, 2012, 11:15:30 PM
to storm...@googlegroups.com
It is much better to persist partial counts to networked storage periodically and only acknowledge tuples when you persist the counts that include them.  When you fail, you can recover state from this log.
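
The pattern described here can be sketched roughly as follows. This is plain Python rather than real Storm code: InMemoryStore stands in for the networked storage, the ack callback stands in for the collector's ack, and every class and method name is hypothetical. The point is only the ordering: counts are written out first, and the tuples behind them are acknowledged afterwards.

```python
from collections import Counter


class InMemoryStore:
    """Stand-in for networked storage (hypothetical; use a real database)."""

    def __init__(self):
        self.counts = {}

    def increment_all(self, deltas):
        for key, delta in deltas.items():
            self.counts[key] = self.counts.get(key, 0) + delta


class PersistThenAckCounter:
    """Buffers counts and acks tuples only after their counts are stored."""

    def __init__(self, store, ack, flush_every=100):
        self.store = store
        self.ack = ack                  # callback that acknowledges one tuple
        self.flush_every = flush_every  # illustrative batch size
        self.pending_counts = Counter()
        self.pending_tuples = []

    def execute(self, tuple_id, key):
        self.pending_counts[key] += 1
        self.pending_tuples.append(tuple_id)
        if len(self.pending_tuples) >= self.flush_every:
            self.flush()

    def flush(self):
        if not self.pending_tuples:
            return
        # If this write raises, nothing below runs and nothing is acked.
        self.store.increment_all(dict(self.pending_counts))
        for tuple_id in self.pending_tuples:
            self.ack(tuple_id)
        self.pending_counts.clear()
        self.pending_tuples.clear()


acked = []
store = InMemoryStore()
counter = PersistThenAckCounter(store, acked.append, flush_every=3)
for tuple_id, page in enumerate(["/a", "/b", "/a", "/a"]):
    counter.execute(tuple_id, page)

print(store.counts)  # {'/a': 2, '/b': 1}
print(acked)         # [0, 1, 2]
```

The fourth tuple is still buffered and unacked. If the worker died at this point, that tuple would time out and could be replayed, so its count is delayed rather than lost. One caveat of this sketch: a crash after the write but before the acks would replay tuples that were already counted, so the counts are at-least-once unless the store write is made idempotent.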

2012/4/8 Nathan Marz <natha...@gmail.com>

abhinav

Jul 6, 2012, 1:14:53 PM
to storm...@googlegroups.com
I've been running several topologies that used to store some state in the bolts - but had to move away from that pattern because of Storm's fail-fast nature. I got around it by using a solution identical to the one described by Ted - storing partial state in a Redis store and acking only after the storage call returned successfully. This has been working well so far (about 2 months) on a three node (m1.large) cluster with 3-5k messages flying through the topology per second.

Cheers,
Abhi

On Fri, Jul 6, 2012 at 5:29 AM, HOHO <huweis...@gmail.com> wrote:
I have the same problem as you, so I would like to know whether you have solved it.
Could you give me some advice?
Thank you very much!


On Sunday, April 1, 2012 at 6:19:35 PM UTC+8, 川 杜 wrote: