Re: [EM] receive_data not being called before a disconnect


James Tucker

Aug 26, 2012, 12:53:49 PM
to eventm...@googlegroups.com
Seeing data arrive on the wire (which is what Wireshark shows you) is not the same as the app reading that data.

The RST is a forceful teardown, and it's likely that much of the data you saw in the capture can no longer make it to the app once the reset has been received.

The reason is likely composed of several layers:

 * Your app is processing incoming data relatively slowly
 * EM is being blocked from reads due to large amounts of work being done (there's a low general tick rate in the reactor)
 * EM hasn't had a chance to read all the data before the RST is received.
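
The last two points can be pictured with a toy loop. This is a minimal sketch using only Ruby's standard library, not EventMachine's actual reactor; `sends_before_error_read`, `total_sends`, and `batch_size` are hypothetical names made up for illustration. The peer's reply is already sitting in the kernel buffer, but the loop only checks for readable sockets between batches of work, so a large batch delays the read. That delay is the window in which an RST could arrive first.

```ruby
require "socket"

# Toy stand-in for a reactor tick: do a batch of "sends", then poll for reads.
# Returns how many sends completed before the app first read the reply.
def sends_before_error_read(total_sends:, batch_size:)
  app, peer = UNIXSocket.pair
  peer.write("ERR") # the reply is in the kernel buffer; the app has not read it yet

  sent = 0
  while sent < total_sends
    # Work done inside one tick; nothing is read while this runs.
    sent += [batch_size, total_sends - sent].min

    # Only now does the loop get a chance to notice readable data.
    ready, = IO.select([app], nil, nil, 0)
    next unless ready

    app.read_nonblock(16)
    break
  end
  sent
ensure
  app.close if app
  peer.close if peer
end

puts sends_before_error_read(total_sends: 1000, batch_size: 1000)
puts sends_before_error_read(total_sends: 1000, batch_size: 10)
```

With one large batch the reply is only read after all 1000 sends; with batches of 10 it is read after the first 10. The sketch shows the delay only; it does not reproduce the RST itself.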

As for other races, I can't really advise there without seeing the code, but it's worth mentioning, as most people don't seem to realize, that Fibers can cause much the same race conditions as threads. You won't race on individual expressions (not that you generally do in MRI anyway), but you can still race in your application structure. Example: https://gist.github.com/1220800. The synchrony stuff might feel nice to some folks, but to sell it as a "race condition free, fast alternative to threads" is incorrect on every count.
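
To make the fiber point concrete, here is a minimal sketch of a lost update using only core `Fiber`. It is not the code from the gist linked above; `run_racing_fibers` and the counter are hypothetical, and `Fiber.yield` stands in for any point where a fiber waits on non-blocking I/O.

```ruby
# Two fibers increment a shared counter, each yielding between read and write.
# Returns the final counter value.
def run_racing_fibers
  counter = 0

  make_incrementer = lambda do
    Fiber.new do
      seen = counter       # read shared state
      Fiber.yield          # stands in for waiting on non-blocking I/O
      counter = seen + 1   # write back a value based on a stale read
    end
  end

  a = make_incrementer.call
  b = make_incrementer.call

  a.resume # a reads 0, then yields
  b.resume # b reads 0, then yields
  a.resume # a writes 1
  b.resume # b also writes 1, overwriting a's update

  counter
end

puts run_racing_fibers
```

Two increments ran, yet the counter ends at 1. No single expression was interrupted; the race lives in the read, yield, write structure.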

Regarding troubleshooting, you need to unwind the events that are occurring in your app's timeline (not entirely trivial now that you're hiding state away in stacks (fibers)). You'll probably need to scatter logging throughout your code, but as you've observed, timing-based race conditions are non-trivial to debug in high-level languages because changes made for debugging cause significant changes to the runtime profile. This is why simpler always wins in these environments in the long run.
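
One way to make logging perturb the timing less is to record events in memory and print them only afterwards. The following is a minimal sketch, assuming a hypothetical `EventTrace` helper that is not part of EventMachine or em-synchrony. It will not remove the observer effect, only shrink it.

```ruby
# Hypothetical in-memory trace: appending to an array is cheap compared with
# writing a log line, so the calls disturb the timeline less.
class EventTrace
  def initialize
    @events = []
  end

  # Record a label together with a monotonic timestamp.
  def record(label)
    @events << [Process.clock_gettime(Process::CLOCK_MONOTONIC), label]
  end

  # Labels in the order they were recorded.
  def labels
    @events.map(&:last)
  end

  # Print the timeline, with times relative to the first event.
  def dump(io = $stderr)
    return if @events.empty?

    start = @events.first[0]
    @events.each { |time, label| io.puts format("%10.6f %s", time - start, label) }
  end
end

trace = EventTrace.new
trace.record("send 1")
trace.record("unbind")
trace.dump
```

Calling `record` from callbacks such as `receive_data` and `unbind`, and `dump` once the run is over, would give an ordered timeline to unwind.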

If I were attacking the code, I would most likely refactor for simplicity first. By that I don't mean "ease of authorship" (e.g. synchrony); I mean runtime simplicity: few, purposeful steps. When I talk about simplicity, I mean the whole application stack, not just "the code". That includes all the libraries you use, and the interpreter. Once you have a simple enough program you can reason about portions of it with precision, and that reasoning will lead you to the problems. Alternatively you can try more complex debugging techniques, but I suspect you'll come up against the Ruby tragedy of tooling: switching fibers, wrangling the reactor, and so forth are not really tooled at all in our environments. You're probably on 1.9, which also means that most of the more useful gdb scripts you might find out there won't work for you, and the backtraces will be less useful unless you know the VM internals.

Good luck.


On Aug 23, 2012, at 11:47 AM, Jonathan Hyman <hyma...@gmail.com> wrote:

I'm using EventMachine to write an Apple Push Notification Service. The way that this works is that I open up a connection to Apple and send notifications, and if they receive an invalid notification then they send back a packet with data indicating an error and then drop the connection. I use receive_data to read the error. When I send a small number of notifications, this works great: I'll send a dozen or so notifications with a few purposefully malformed payloads and get data back in receive_data. However, if I try to send thousands of notifications with a purposefully malformed payload in the middle, I expect to receive the error packet in receive_data. I confirmed that the packet is received by my machine using Wireshark, but in some cases the receive_data callback is never called. I do see the unbind callback caused by the connection dropping, but it appears that the data that comes before that is being ignored. In Wireshark, this looks like a FIN packet followed by a RST.

The way in which I wrote this is using EM-Resque (https://github.com/SponsorPay/em-resque), which uses Redis as a queue for jobs. My EventMachine process looks at the queue for push notification jobs and if it finds one, sends push notifications. EM-Resque uses EM::Synchrony.sleep in between checking for new jobs. EM::Synchrony.sleep uses Fibers and yields them to be non-blocking. Thinking that perhaps the Fiber yielding was the cause of this, I rewrote that loop to not yield but instead use EM::Timers. Something like this:

work_loop = lambda do
  # look for a new job and process it
  EM::Timer.new(1) { EM.next_tick(&work_loop) }
end

This did not solve the problem, so I don't think it's related to the EM::Synchrony.sleep.

I am also seeing weird races here; sometimes it works and I get the data back in receive_data, but sometimes it does not. If I remove a log statement in between each send, that increases the probability of it succeeding. So my gut tells me there's a race going on here.

My best guess is that the disconnect is coming in and flushing out any data that would be sent to receive_data (the reverse of close_connection without writing). I'm not sure that is the case, and I don't know whether a disconnect can drop data on the floor, but that's my current hypothesis.

Does anyone have any suggestions for troubleshooting? 

