I have done a lot of experiments over the past few weeks and came to
a few interesting conclusions. First some background, then issues,
solutions and conclusions.
I wrote a test harness for a poker server that understands the
different binary packets and can send and receive them. The harness
launches each "script" in a separate unbound thread that connects to
the server via TCP and does its work.
The main goals of the project were: easy scripting, very high number
of connections from the harness (a few thousand) and running on
Windows. I develop on Mac OSX but have a Windows machine for testing
and to run the poker server.
Another key goal was to support the server encryption. SSL encryption
is done in a weird way that requires attaching read/write OpenSSL
BIOs to the SSL descriptor so that SSL encrypts to/from memory.
Encrypted chunks are then taken from the BIOs and sent as payload in
server packets.
Overall, I probably spent about 4 weeks writing the server and about
2 more weeks grappling with the various issues. The issues centered
around 1) the program trashing memory like no tomorrow, 2)
intermittent crashes on Windows and 3) not being able to launch a
high number of connections on Windows before crashing.
I significantly improved trashing of memory by switching to plain
Haskell structures from nested lists of wxHaskell-style properties
(attr := value). Intermittent crashes were harder to troubleshoot,
especially given that things were running smoothly on Mac OSX.
Stack traces pointed into libcrypto (part of OpenSSL) and thus to the
BIOs that I was allocating. I guessed that OpenSSL was maxing out
some resources and closed the leak by explicitly freeing the SSL
descriptor which freed the associated BIO structures. Then things got
weirder as my program started crashing in a different place entirely
with stack traces like this:
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x3139322e
0x0027c174 in s8j1_info ()
(gdb) where
#0 0x0027c174 in s8j1_info ()
#1 0x0021c9f4 in StgRunIsImplementedInAssembler () at StgCRun.c:576
#2 0x0021cdc4 in schedule (mainThread=0x1100360,
initialCapability=0x308548) at Schedule.c:932
#3 0x0021dd6c in waitThread_ (m=0x1100360, initialCapability=0x0) at
Schedule.c:2156
#4 0x0021dc50 in scheduleWaitThread (tso=0x13c0000, ret=0x0,
initialCapability=0x0) at Schedule.c:2050
#5 0x00219548 in rts_evalLazyIO (p=0x29b47c, ret=0x0) at RtsAPI.c:459
#6 0x001e4768 in main (argc=2262116, argv=0x308548) at Main.c:104
I took waitThread_ as a clue and started digging deeper.
Whenever I connect to the server or send a command I wait for X
seconds and if not connected or desired command is not received I
throw an exception which fails the script. I implemented the timeout
combinator a couple of different ways, including that in the
Asynchronous Exceptions paper but it did not help. I think the issue
has to do with killing threads that are using FFI. Although I'm
killing threads that call the Haskell connectTo, hGetBuf, etc. I
think it's still FFI.
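
For readers following along, here is a minimal sketch of a timeout combinator of the kind described above. It is not the harness's own code: the name withTimeLimit is made up for illustration, and it uses today's Control.Exception API rather than what was available at the time. The worker and a timer thread race to fill one MVar. The worker is killed from a helper thread so that the caller is not held up if the worker is stuck inside a foreign call that cannot be interrupted.

```haskell
import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, throwIO, try)

-- | Run an action with a time limit given in microseconds.
-- Returns Nothing if the limit expires first, Just the result otherwise.
-- Exceptions thrown by the action are rethrown in the caller.
-- 'withTimeLimit' is a hypothetical name used only for this sketch.
withTimeLimit :: Int -> IO a -> IO (Maybe a)
withTimeLimit micros action = do
  box <- newEmptyMVar
  -- The worker reports either its result or the exception it died with.
  worker <- forkIO $ try action >>= putMVar box . Just
  -- The timer reports Nothing once the limit has passed.
  timer <- forkIO $ threadDelay micros >> putMVar box Nothing
  outcome <- takeMVar box
  -- Kill the loser asynchronously: killThread can block while the
  -- target sits in a foreign call, and the caller should not wait.
  _ <- forkIO (killThread worker)
  killThread timer
  case outcome of
    Nothing        -> return Nothing
    Just (Left e)  -> throwIO (e :: SomeException)
    Just (Right a) -> return (Just a)
```

This sketch does not clean up the helper threads if the caller itself is interrupted while waiting, so treat it as a starting point only.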
I disposed of timeouts entirely, leaving connectTo as it is and using
hWaitForInput on my socket handle to simulate timeouts. This improved
things tremendously and I'm now able to run a few thousand
unbound script threads on Windows with OpenSSL FFI and everything.
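
A rough sketch of what simulating a timeout with hWaitForInput can look like. The helper name readLineWithin is hypothetical, and reading a line is only a stand-in for reading a binary packet. hWaitForInput takes its timeout in milliseconds and throws an EOF error if the other side has closed the connection.

```haskell
import System.IO (Handle, hGetLine, hWaitForInput)

-- | Wait up to the given number of milliseconds for input on a handle,
-- then read one line. Nothing means the wait timed out.
-- 'readLineWithin' is a hypothetical helper, not part of the harness.
readLineWithin :: Int -> Handle -> IO (Maybe String)
readLineWithin millis h = do
  ready <- hWaitForInput h millis
  if ready
    then Just <$> hGetLine h
    else return Nothing
```

As far as I know, the GHC documentation warns that without the -threaded runtime a call to hWaitForInput with a non-negative timeout blocks all other Haskell threads for its duration, so this approach needs care when thousands of threads are involved.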
Memory usage is still higher than I would have liked and crashes in
OpenSSL still happen when the number of threads/memory usage is
really high so there's still room for improvement. I should probably
go back to using a foreign finalizer (SSL_free) on the SSL
descriptors rather than freeing them explicitly as the freeing does
not happen if a script fails mid-way.
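
To illustrate the foreign finalizer idea, here is a small sketch that uses a malloc'ed block and finalizerFree as stand-ins for the SSL descriptor and SSL_free, so that it runs without OpenSSL. The Descriptor type and both function names are invented for this example. The point is that the finalizer runs when the garbage collector drops the descriptor, whether or not the script got as far as an explicit free.

```haskell
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, newForeignPtr, withForeignPtr)
import Foreign.Marshal.Alloc (finalizerFree, mallocBytes)
import Foreign.Ptr (Ptr)

-- Stand-in for an SSL descriptor. In the real harness this would wrap
-- the SSL pointer, and the finalizer would be a pointer to SSL_free,
-- imported with something like (an assumption, not tested here):
--   foreign import ccall "&SSL_free" sslFreePtr :: FunPtr (Ptr SSL -> IO ())
newtype Descriptor = Descriptor (ForeignPtr Word8)

-- | Allocate a 64-byte block and attach a finalizer that frees it
-- once the Descriptor becomes unreachable.
newDescriptor :: IO Descriptor
newDescriptor = do
  raw <- mallocBytes 64
  Descriptor <$> newForeignPtr finalizerFree raw

-- | Use the raw pointer; the block is kept alive for the duration.
withDescriptor :: Descriptor -> (Ptr Word8 -> IO a) -> IO a
withDescriptor (Descriptor fp) = withForeignPtr fp
```

Finalizers run at the garbage collector's discretion, so memory held on the C side may be released later than with an explicit free.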
I'm quite satisfied with my first Haskell project. I love Haskell and
will continue hacking away with it. This list is invaluable in the
depth of offered help whereas #haskell (IRC) is invaluable when speed
matters. I'm quite amazed at the things I have been able to do, the
expressiveness of Haskell and the clean looks.
Clean looks can be deceptive, though, as they can hide code of
amazing complexity. Fundeps, existential types, HList take a while to
grasp. Also, I feel somewhat like a pioneer and I definitely got more
than a fair share of arrows in my back.
I had GHC run out of memory during compilation (fixed by SPJ), had it
quit midway during compilation with an error about generated extents
being too large in assembler code. I had GHC crash at runtime with an
error like "fromJust not returning Just, this could not be
happening!". Yesterday's error topped them all:
internal error: update_fwd: unknown/strange object 0
Please report this as a bug to glasgow-ha...@haskell.org,
or http://www.sourceforge.net/projects/ghc/
I think I got this when using +RTS -C0 -c.
Overall, the experience with Haskell has been exhilarating and I'm
already preparing to use it on my next projects like detecting
collusion in poker as well as rake optimization (Dazzle paper very
helpful here!). Still, I think that GHC can be a bit rough around the
edges and I would think twice about writing high-performance network
apps with it.
Thanks, Joel
P.S. The Glasgow Distributed Haskell (GdH) people are supposed to
have a mailing list and I would love to share my findings with them
but I could not find the mailing list itself.
_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
What would your impression be of building an application in Haskell
versus Erlang from a practical point of view given your experiences
with this project and the Erlang poker server?
My feelings having developed a little with Erlang and embarking on a
Haskell project are that the learning curve is far steeper with
Haskell but it is far more elegant and readable. I'm still climbing
that curve though (IO makes me want to pull my hair out).
Thanks for writing up that post mortem. There's lots of good info in
there, especially for a newbie like myself.
Cheers,
Scott
> What would your impression be of building an application in Haskell
> versus Erlang from a practical point of view given your experiences
> with this project and the Erlang poker server?
I would have been done much faster and with far less trouble. The
scripting would have been a royal pain in the rear for the customer,
though. But, again, I would have been done much faster as network
clients/servers is what Erlang excels at. That and concurrency.
Haskell... I'm still trying to figure out why reading from a Chan
with getChanContents and then printing out the contents works and
doing the same with readChan and looping blocks. Or why the app now
crashes violently on Mac OSX but works without a hitch on Windows.
And I still don't have a good timeout combinator.
I felt very excited this morning given the newly found love between
my app and Windows but the excitement lasted only until I realized
that hWaitForInput blocks all other threads :-(.
> My feelings having developed a little with Erlang and embarking on
> a Haskell project are that the learning curve is far steeper with
> Haskell but it is far more elegant and readable. I'm still climbing
> that curve though (IO makes me want to pull my hair out).
Unless lightning strikes and tomorrow morning I figure out what's the
deal with the spurious Mac OSX crashes, I think this might be my last
network app in Haskell. I should really be spending time on the
business end of the app instead of figuring out platform differences
and the like.
Joel
Joel, I think it's fantastic that you've been pushing on Haskell in the
way you have. What I learn from your experience is that the *language*
is pretty good for what you wanted to do (esp lightweight concurrency)
but the *libraries* in the area of networking are lacking both
functionality and (more particularly) robustness.
I hope you don't abandon Haskell altogether. Without steady, friendly
pressure from applications-end folk like you, things won't improve.
It's incredibly valuable feedback. But I can see that when you have to
deliver something next week you can't wait around for someone to
get around to fixing your problem. (They aren't paid either!) Maybe
you can use Haskell for something less mission-critical, so that you can
keep up the pressure?
Meanwhile, let me utter my customary encouragement to the Haskell
community out there: please pitch in and help! Haskell will only break
into real applications, of the kind Joel has been writing, if we can
offer robust libraries, and that depends utterly on you. Don't wait for
someone else to do it.
Simon
Jan
> I hope you don't abandon Haskell altogether. Without steady, friendly
> pressure from applications-end folk like you, things won't improve.
Nah, I'm just having a very frustrating Friday. I think I need some
direction in which to dig and a bit of patience over the weekend. For
example,
What does this mean precisely? My take is that the GHC runtime is
trying to call a C function. This much I gathered from the source
code. It also seems that since I do not see another library at #0
then the issue is within GHC. Is that the right take on it?
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x3139322e
0x0027c174 in s8j1_info ()
(gdb) where
#0 0x0027c174 in s8j1_info ()
#1 0x0021c9f4 in StgRunIsImplementedInAssembler () at StgCRun.c:576
#2 0x0021cdc4 in schedule (mainThread=0x1100360,
initialCapability=0x308548) at Schedule.c:932
#3 0x0021dd6c in waitThread_ (m=0x1100360, initialCapability=0x0) at
Schedule.c:2156
#4 0x0021dc50 in scheduleWaitThread (tso=0x13c0000, ret=0x0,
initialCapability=0x0) at Schedule.c:2050
#5 0x00219548 in rts_evalLazyIO (p=0x29b47c, ret=0x0) at RtsAPI.c:459
#6 0x001e4768 in main (argc=2262116, argv=0x308548) at Main.c:104
> It's incredibly valuable feedback. But I can see that when you
> have to
> deliver something next week you can't wait around for someone to
> get around to fixing your problem. (They aren't paid either!) Maybe
> you can use Haskell for something less mission-critical, so that
> you can
> keep up the pressure?
I can't change who I am, I just gotta push the envelope. I would not
have stood the pain of doing this project in Erlang, for example,
what with all the nested data structures, etc.
I'm not waiting for someone to fix my problem, I would gladly fix it
myself if I understood where the problem is. It used to be fairly
clear before when the stack trace pointed to one of the OpenSSL
libraries. In this particular case I don't even know how to start
debugging this. Do I set a break point in s8j1_info? But it's
something else periodically, like s34n_info.
Do I inspect the C code somehow? But how do I do that? How do I debug
the GHC runtime?
Thanks, Joel
On Nov 18, 2005, at 10:42 AM, Jan Stoklasa (gmail) wrote:
> Hi,
> so sad, so true...
> At least haskell ideas sneak into mainstream languages under
> disguise (LINQ
> anyone?). C-Java-C# syntax that business "developers" and their
> bosses love
> so much is mandatory so the result lack the beauty we all know and
> appreciate, but it is kinda nice to see functional programming going
> mainstream at last. Maybe, "Lambda" is the IT buzzword of next
> decade :-).
> On Nov 18, 2005, at 10:17 AM, Simon Peyton-Jones wrote:
>
>> I hope you don't abandon Haskell altogether. Without steady,
>> friendly pressure from applications-end folk like you, things won't
>> improve.
>
> Nah, I'm just having a very frustrating Friday. I think I need some
> direction in which to dig and a bit of patience over the weekend. For
> example,
>
> What does this mean precisely? My take is that the GHC runtime is
> trying to call a C function. This much I gathered from the source
> code. It also seems that since I do not see another library at #0
> then the issue is within GHC. Is that the right take on it?
The stack trace doesn't mean much at all I'm afraid - GHC doesn't use
the C stack, so any stack trace generated for a crash inside the Haskell
code is mostly useless. It does tell you the block in which the crash
happened (s8j1_info), and it tells you that the crash was in Haskell and
not C. The rest of the frames on the stack are from the GHC runtime,
and you'll pretty much always see these same frames on the stack for any
crash inside Haskell code.
How we normally proceed for a crash like this is as follows: examine
where the crash happened and determine whether it is a result of heap or
stack corruption, and then attempt to trace backwards to find out where
the corruption originated from. Tracing backwards means running the
program from the beginning again, so it's essential to have a
reproducible example. Without reproducibility, we have to use a
combination of debugging printfs and staring really hard at the code,
which is much more time consuming (and still requires being able to run
the program to make it crash with debugging output turned on).
You can get debugging output by compiling your program with -debug, and
then running it with some of the -D<something> options (use +RTS -? for
a list, +RTS -Ds is a good one to start with).
Cheers,
Simon
> You can get debugging output by compiling your program with -debug,
> and
> then running it with some of the -D<something> options (use +RTS -?
> for
> a list, +RTS -Ds is a good one to start with).
I'm still working on a repro case but here's what I get...
+RTS -Ds
..
scheduler: checking for threads blocked on I/O
sched: -->> running thread 1103 ThreadRunGHC ...
sched: --<< thread 1103 (ThreadRunGHC) stopped: is blocked on an MVar
all threads:
thread 1225 @ 0x1539000 is not blocked
thread 1224 @ 0x1506aa4 is not blocked
thread 1223 @ 0x15066a4 is not blocked
..
scheduler: checking for threads blocked on I/O
sched: -->> running thread 1107 ThreadRunGHC ...
Segmentation fault
1107 is not blocked in the list of all threads. What options should I
try next?
Thanks, Joel
> On Nov 18, 2005, at 1:55 PM, Simon Marlow wrote:
>
>> You can get debugging output by compiling your program with -debug,
>> and then running it with some of the -D<something> options (use +RTS
>> -? for a list, +RTS -Ds is a good one to start with).
>
> I'm still working on a repro case but here's what I get...
>
> +RTS -Ds
> ...
> scheduler: checking for threads blocked on I/O
> sched: -->> running thread 1103 ThreadRunGHC ...
> sched: --<< thread 1103 (ThreadRunGHC) stopped: is blocked on an MVar
> all threads:
> thread 1225 @ 0x1539000 is not blocked
> thread 1224 @ 0x1506aa4 is not blocked
> thread 1223 @ 0x15066a4 is not blocked
> ...
> scheduler: checking for threads blocked on I/O
> sched: -->> running thread 1107 ThreadRunGHC ...
> Segmentation fault
>
> 1107 is not blocked in the list of all threads. What options should I
> try next?
That doesn't tell us much unfortunately. Can you send a disassembly of
the block in which the crash happened?
Is it always the same block, BTW? Does changing the heap size (+RTS
-H<size>) have any effect?
Cheers,
Simon
> That doesn't tell us much unfortunately. Can you send a
> disassembly of
> the block in which the crash happened?
>
> Is it always the same block, BTW? Does changing the heap size (+RTS
> -H<size>) have any effect?
I don't think changing the heap size has any effect. I tried a run
with -H512m and the only difference was that it crashed at 0x00000005
with the same kernel protection failure. The address for s34n_info is
the same, everything else the same, including stack trace and
addresses and offsets in it.
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00000000
0x0024ef88 in s34n_info ()
(gdb) where
#0 0x0024ef88 in s34n_info ()
#1 0x00211eb4 in StgRunIsImplementedInAssembler () at StgCRun.c:576
#2 0x0020f048 in schedule (mainThread=0x1100360,
initialCapability=0x2fd508) at Schedule.c:932
#3 0x0020fff0 in waitThread_ (m=0x1100360, initialCapability=0x0) at
Schedule.c:2156
#4 0x0020fed4 in scheduleWaitThread (tso=0x13c0000, ret=0x0,
initialCapability=0x0) at Schedule.c:2050
#5 0x0020cd70 in rts_evalLazyIO (p=0x29216c, ret=0x0) at RtsAPI.c:459
#6 0x001d80fc in main (argc=2212180, argv=0x2fd508) at Main.c:104
(gdb) disas 0x0024ef88
Dump of assembler code for function s34n_info:
0x0024ef70 <s34n_info+0>: mr r10,r25
0x0024ef74 <s34n_info+4>: addi r9,r25,8
0x0024ef78 <s34n_info+8>: mr r25,r9
0x0024ef7c <s34n_info+12>: cmplw cr7,r9,r26
0x0024ef80 <s34n_info+16>: bgt- cr7,0x24efb4 <s34n_info+68>
0x0024ef84 <s34n_info+20>: lwz r2,4(r14)
0x0024ef88 <s34n_info+24>: lbzx r0,r2,r15
0x0024ef8c <s34n_info+28>: cmpwi cr7,r0,0
0x0024ef90 <s34n_info+32>: bne- cr7,0x24efc4 <s34n_info+84>
0x0024ef94 <s34n_info+36>: lis r2,42
0x0024ef98 <s34n_info+40>: lwz r2,20668(r2)
0x0024ef9c <s34n_info+44>: stw r2,4(r10)
0x0024efa0 <s34n_info+48>: stw r15,0(r9)
0x0024efa4 <s34n_info+52>: addi r14,r9,-4
0x0024efa8 <s34n_info+56>: lwz r29,0(r22)
0x0024efac <s34n_info+60>: mtctr r29
0x0024efb0 <s34n_info+64>: bctr
0x0024efb4 <s34n_info+68>: li r0,8
0x0024efb8 <s34n_info+72>: stw r0,108(r27)
0x0024efbc <s34n_info+76>: lwz r29,-4(r27)
0x0024efc0 <s34n_info+80>: b 0x24efac <s34n_info+60>
0x0024efc4 <s34n_info+84>: addi r15,r15,1
0x0024efc8 <s34n_info+88>: addi r25,r9,-8
0x0024efcc <s34n_info+92>: lis r29,37
0x0024efd0 <s34n_info+96>: addi r29,r29,-4240
0x0024efd4 <s34n_info+100>: b 0x24efac <s34n_info+60>
0x0024efd8 <s34n_info+104>: .long 0x21
0x0024efdc <s34n_info+108>: .long 0x240000
End of assembler dump.
This is not quite the error that I was expecting but they could be
related, I'm not sure. In any case, you can retrieve the repro
project thusly:
darcs get http://test.wagerlabs.com/postmortem
You need OpenSSL to build these so don't forget to add -lssl -lcrypto
to either ghc or ghci.
I would appreciate if we could all collectively look at this as
things are either weird or I'm missing something obvious. I will
apply any patches sent to me.
I run like this:
ghci -fglasgow-exts -lssl -lcrypto
:l Server
main
ghci -fglasgow-exts -lssl -lcrypto
:l Client
main
I get in the server window:
interactive: unknown exception
14:51:39: ThreadId 1: Accepted new connection: {handle: <socket: 5>}
14:51:39: ThreadId 1: Verify locations: 1
14:51:39: ThreadId 1: sslGetError: 2
14:51:39: ThreadId 4: Starting SSL handshake...
14:51:39: ThreadId 4: Reading from BIO...
14:51:39: ThreadId 4: Waiting for BIO 0x01108670
14:51:39: ThreadId 4: waitForBio: gotta wait a bit...
If you look at SSL.hs you will see that I'm calling threadDelay right
after this message. No other messages are produced. This tells me
that threadDelay is throwing an exception.
Why would it, though? And how can I tell what the exception is? If I
comment out the threadDelay then I get the exception somewhere in the
expect code after bytes are sent to the other side.
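
One way to find out what an exception is would be to wrap the suspect step and turn whatever it throws into text. The sketch below assumes the current Control.Exception API (SomeException and try), which is newer than what this thread was written against, and the name reportFailure is made up for the example.

```haskell
import Control.Exception (SomeException, try)

-- | Run an action and, if it throws, return the exception rendered as
-- text and prefixed with a label saying which step failed.
-- 'reportFailure' is a hypothetical helper, not part of SSL.hs.
reportFailure :: String -> IO a -> IO (Either String a)
reportFailure label action = do
  r <- try action
  case r of
    Right a -> return (Right a)
    Left e  -> return (Left (label ++ ": " ++ show (e :: SomeException)))
```

Wrapping the delay as reportFailure "waitForBio" (threadDelay 100000) and logging the Left case would show whether the delay itself is the step that throws.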
Overall, my intent is to get this to work for 1 thread and then try,
say, 5 or 10 thousand.
Thanks, Joel
> I hope you don't abandon Haskell altogether. Without steady, friendly
> pressure from applications-end folk like you, things won't improve.
> It's incredibly valuable feedback. But I can see that when you
> have to
> deliver something next week you can't wait around for some someone to
> get around to fixing your problem. (They aren't paid either!) Maybe
> you can use Haskell for something less mission-critical, so that
> you can
> keep up the pressure?
Here is some feedback on a negative experience I had with Haskell
recently (really about the only negative experience :)
I was playing with one of the Haskell OpenGL libraries (actually it's
a refined FFI) over the summer and some things about it rubbed me the
wrong way. I wanted to try fixing them but I really couldn't figure
out how to get ahold of the code and start hacking. I found some
candidates, but it seemed like old cvs repositories or something. I
was confused, ran out of time and moved on. Why do I bring it up?
If it had been obvious where to get an official copy of the library I
could have tried sending in some patches to make things work the way
I wanted. I'm a huge fan of darcs repositories, BTW.
Thanks,
Jason
- Cale
Hmmm, as the OpenGL/GLUT/OpenAL/ALUT guy I have to admit that I should really,
really update the web pages about those packages. But anyway: Asking on any
Haskell mailing list (there is even one especially for the OpenGL/GLUT
packages) normally gives you fast response times. Without even knowing that
there is a problem, there is nothing I can fix. :-) And don't hesitate to ask
questions about the usage of those packages, because this is valuable
feedback, too. Regarding the repository: The normal fptools repository is the
"official" one for those packages. But IIRC, most GHC binary packages include
OpenGL/GLUT support, so there is normally no urgent need for a home-made
version. All packages are already cabalized, but I have to admit that I have
never tried to build them on their own.
Cheers,
S.
I'm fixing the server side and once that is done will clean up SSL at
the end of the handshake and launch a few thousand clients. It's not
a good repro case yet although I would love to know why withTimeOut
is throwing that exception.
Joel
On Nov 18, 2005, at 5:02 PM, Christian Maeder wrote:
> Sorry, I can only show you my output on
> Linux turing 2.6.11.4-21.9, but I don't know what's going on and
> will not have more time this week.
>
> Cheers Christian
>
> maeder@turing:/local/maeder/haskell/postmortem> ./server
> 17:55:14: ThreadId 1: Accepted new connection: {handle: <socket: 4>}
> 17:55:14: ThreadId 1: Verify locations: 1
> 17:55:14: ThreadId 1: sslGetError: 2
> 17:55:14: ThreadId 4: Starting SSL handshake...
> 17:55:14: ThreadId 4: Reading from BIO...
> 17:55:14: ThreadId 4: Waiting for BIO 0x080d10d8
> 17:55:14: ThreadId 4: waitForBio: gotta wait a bit...
> server: unknown exception
The server just sits there, goes through the SSL handshake and...
does nothing else. The clients go through the handshake with the
server and do nothing else. The handshake goes through X number of
times and then the client crashes.
On Nov 18, 2005, at 1:55 PM, Simon Marlow wrote:
> How we normally proceed for a crash like this is as follows: examine
> where the crash happened and determine whether it is a result of
> heap or
> stack corruption, and then attempt to trace backwards to find out
> where
> the corruption originated from. Tracing backwards means running the
> program from the beginning again, so it's essential to have a
> reproducible example. Without reproducibility, we have to use a
> combination of debugging printfs and staring really hard at the code,
> which is much more time consuming (and still requires being able to
> run
> the program to make it crash with debugging output turned on).
Simon
Would you guys be willing to guide me through this? I could then
possibly become the next Mac OSX expert :-).
I have the disassembler dumps, etc. I do not know how to approach
this problem. I read up a bit on the GHC internals, STG, code
generation, etc.
Thanks, Joel
P.S. Please feel free to take the email exchange offline, could be
too boring for everyone else
On Nov 21, 2005, at 9:35 AM, Simon Peyton-Jones wrote:
> If it's MacOS specific, we're not going to be much help at GHC HQ,
> because we don't have any (Macs that is). Wolfgang Thaller is the
> MacOS
> expert, but maybe there are others now?
What about the non-OSX issue of using a Chan to collect traces from
thousands of threads?
It's not working very well for me when I use readChan in a loop (see
the code). getChanContents works much better but then the logger
thread is stuck forever and everything else that waits on it is stuck
as well.
The output from logger (Util.hs) stops after a few lines and thus
memory taken starts to grow because all the output sent to the chan
is not being processed.
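
For comparison, a minimal sketch of a logger thread that drains a Chan with readChan in a loop. This is not the code from Util.hs: the names are invented, and the Maybe wrapper is one assumed way of telling the logger to stop so that nothing waits on it forever.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)

-- Just carries a log line; Nothing asks the logger to shut down.
type LogChan = Chan (Maybe String)

-- | Start a logger thread that passes every line to the given sink.
-- Returns the channel to write to and an MVar that is filled once the
-- logger has seen the shutdown message.
startLogger :: (String -> IO ()) -> IO (LogChan, MVar ())
startLogger sink = do
  chan <- newChan
  done <- newEmptyMVar
  let loop = do
        msg <- readChan chan
        case msg of
          Just line -> sink line >> loop
          Nothing   -> putMVar done ()
  _ <- forkIO loop
  return (chan, done)

-- | Queue one line for the logger.
logMsg :: LogChan -> String -> IO ()
logMsg chan = writeChan chan . Just

-- | Ask the logger to stop and wait until it has drained the channel.
stopLogger :: (LogChan, MVar ()) -> IO ()
stopLogger (chan, done) = writeChan chan Nothing >> takeMVar done
```

A Chan is unbounded, so if the sink is slower than the writers the queue still grows. This sketch does not address that.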
Thanks, Joel
On Nov 21, 2005, at 9:35 AM, Simon Peyton-Jones wrote:
> If it's MacOS specific, we're not going to be much help at GHC HQ,
> because we don't have any (Macs that is). Wolfgang Thaller is the
> MacOS
> expert, but maybe there are others now?
Greetings, Bane.