Practice meeting, Dec. 17th


Lyle Kopnicky

Dec 16, 2018, 7:09:06 PM
to pdxfunc
When: Dec. 17th, 6:30-8:30pm
Where: Collective Agency Downtown, Suite 1108, 511 SW 10th Ave, Portland, OR (opposite side of the floor from the elevators)

We'll continue our discussion of Chapter 4, "Dataflow Parallelism: The Par Monad", of Parallel and Concurrent Programming in Haskell. We'll briefly discuss the exercise from the section "Rate-Limiting the Producer", then move on to the section "Example: A Conference Timetable". At the meeting anyone will be able to present and contribute to the discussion.

Meetup link to RSVP:

Hope to see you then!

Lyle Kopnicky

Dec 17, 2018, 12:50:27 AM
to pdxfunc
By the way, I fixed the code I had previously written to implement the exercise on p. 69, which we revised in the meeting on Nov. 19th.

This exercise was to take Stream.hs and rsa-pipeline.hs and introduce rate limiting into the streaming by generating Fork constructors, which consumers use to trigger production of the next chunk of data.
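For anyone who wasn't at the November meeting, the core of the exercise is to extend the IList type with a Fork constructor that carries a suspended Par computation, roughly along these lines (a simplified sketch of the book's hint; the gists may differ in details):

import Control.Monad.Par
import Control.DeepSeq (NFData (..))

-- Sketch of the rate-limited stream type. The Fork constructor holds a
-- suspended Par computation that, when the consumer forks it, produces
-- the next chunk of the stream.
data IList a
  = Nil
  | Cons a (IVar (IList a))
  | Fork (Par ()) (IList a)

type Stream a = IVar (IList a)

-- One possible NFData instance: we can't force the Par () inside a Fork,
-- so we only force the list elements.
instance NFData a => NFData (IList a) where
  rnf Nil          = ()
  rnf (Cons a _)   = rnf a
  rnf (Fork _ lst) = rnf lst

The consumers (streamFold, streamMap) then each need a case for Fork that forks the embedded computation.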

You can find the revised files in these gists:

You may recall that we spent the entire meeting fleshing out the streamMap function, only to find that when we ran it, it had a runtime error.

So, what was the problem? It turns out that when handling the Fork constructor in streamMap, we forgot to actually fork! As a result, some code ended up waiting on an IVar that never got filled in. I'm not sure why it decided to error out at that point rather than blocking indefinitely.
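In code terms, the Fork case in streamMap's inner loop needs to do something like this (a simplified sketch built on the types above, not the exact code from the gist; in particular, this version doesn't re-emit Fork constructors on the output stream):

streamMap :: NFData b => (a -> b) -> Stream a -> Par (Stream b)
streamMap fn instrm = do
  outstrm <- new
  fork $ loop instrm outstrm
  return outstrm
 where
  -- Walk the input stream via its IVars.
  loop strm out = do
    ilst <- get strm
    consume ilst out

  -- Walk an IList directly; the part of a chunk that follows a Fork is a
  -- plain IList rather than an IVar.
  consume Nil out = put out Nil
  consume (Cons h t) out = do
    newtl <- new
    put out (Cons (fn h) newtl)
    loop t newtl
  consume (Fork p rest) out = do
    fork p            -- the missing call: actually start producing the next chunk
    consume rest out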

There was also a second bug, which caused duplicate entries in the output. The problem was an off-by-one error: the generator reaches the end of a chunk when i == chunkSize - 1, not when i == chunkSize.
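Just to illustrate the boundary condition with a toy example (not the gist code): counting an index i from 0 within each chunk, the chunk has to end when i == chunkSize - 1, otherwise every chunk comes out one element too long.

-- Toy chunking function, counting an index from 0 within each chunk.
chunksOf :: Int -> [a] -> [[a]]
chunksOf chunkSize = go 0 []
 where
  go _ acc []     = [reverse acc | not (null acc)]
  go i acc (x:xs)
    | i == chunkSize - 1 = reverse (x:acc) : go 0 [] xs   -- end of chunk here
    | otherwise          = go (i + 1) (x:acc) xs

-- chunksOf 2 "abcdef" == ["ab","cd","ef"]; with the condition i == chunkSize,
-- every chunk would pick up an extra element.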

How did I find these bugs? I instrumented the loops with calls to Debug.Trace.trace. That way I could print out values such as the loop index, or note which case branch was taken. The strict evaluation of the Par monad made it easy to guarantee the traces would be printed.
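For anyone who hasn't used it, the instrumentation looks something like this (a toy example, not the actual pipeline code):

{-# LANGUAGE BangPatterns #-}
import Control.Monad.Par
import Debug.Trace (trace)

-- Toy example of tracing a loop inside Par: each iteration reports its
-- index and which branch it took.
sumTo :: Int -> Par Int
sumTo n = go 0 0
 where
  go i !acc
    | i == n    = trace ("go: finished at i = " ++ show i) (return acc)
    | otherwise = trace ("go: looping, i = " ++ show i) (go (i + 1) (acc + i))

main :: IO ()
main = print (runPar (sumTo 3))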

I also simplified the main loop. Instead of running the encryption/decryption on long strings, I just had it generate a stream from a small list, "abcd", and then do a streamFold that produced the result in reverse. I varied the chunk size and fork distance to produce edge cases. This is how I found the duplicate-entry bug, but I was not getting the "no result" error.
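Concretely, the simplified test looked roughly like this (a sketch, assuming the rate-limited Stream module from the gists is in scope, with streamFromList taking chunkSize and forkDistance arguments; the values here are just illustrative):

-- Assumes the rate-limited streamFromList and streamFold are imported.
main :: IO ()
main = print $ runPar $ do
  s <- streamFromList 2 1 "abcd"   -- chunkSize = 2, forkDistance = 1
  streamFold (flip (:)) [] s       -- fold that builds the result in reverse: "dcba"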

That meant the problem had to be caused by streamMap. So I added another step to my stream that incremented each character. I found that if streamFromList was set to high values for chunkSize and forkDistance, then streamMap had no problem, even if it had to produce multiple chunks. The problem occurred when I set chunkSize and forkDistance the same for both functions, or when the forkDistance in streamFromList was smaller than the output length and streamMap was used.

Therefore the problem was likely in the case where streamMap handled the Fork constructor. I examined the cases by eye, comparing the implementation of streamMap to streamFromList. This is where I noticed that streamMap was missing the call to fork a new thread. I fixed that, and everything worked.

In terms of run times, when the encryption/decryption runs on /usr/dict/words, the original program takes 18 seconds on my 2-core machine. With a chunkSize of 16 and a forkDistance of 8 on all pipeline steps, the rate-limited version takes 13.5 seconds. With a chunkSize of 200 and a forkDistance of 100, it got down closer to 12 seconds. So we must be gaining some efficiency from the rate-limiting. It'd be interesting to discuss why. The amount of garbage generated and collected seems to be about the same, so that doesn't explain it.

See you at the meeting tomorrow night!

- Lyle

Lyle Kopnicky

Dec 17, 2018, 1:22:39 AM
to pdxfunc
Actually, I was mistaken. The rate-limited version generates and collects a lot more garbage. Total allocation is 20GB as opposed to 0.5GB for the original version. Peak heap size is 18.5MB in the rate-limited version, 10MB in the original. So it's curious that it's significantly faster even while generating much more garbage.

ThreadScope also shows that the original version has a lot of unused time on one of the HECs, whereas the rate-limited version keeps both HECs occupied pretty much all the time. Again, I'm not sure how that works, since it should be bottlenecked by the encryption time.

- Lyle

Lyle Kopnicky

Dec 17, 2018, 1:49:40 AM
to pdxfunc
Oh, of course. The reason it's getting better core utilization is that in the original version, each pipeline stage used a single thread and so was fixed to a single core, but in the rate-limited version it's generating a new thread per chunk, so it's able to distribute the work across the available cores. So sometimes it's doing encryption on two cores at once. That's an advantage of chunking we've seen before.

As for the memory usage, we can discuss tomorrow why it's 40x as much. Part of it is the thread overhead.

karoyakani

Dec 18, 2018, 2:11:17 AM
to pdxfunc
Great work Lyle, thanks.

I don't recall if we discussed the "40x" memory usage issue in the meeting. I'd appreciate it if you could follow it up with a comment.

Another concern, about the IVar issue: your streamFromList carries along two IVars, which makes the code look complicated.
Here's a Stack Overflow question, "Parallel Haskell. Rate-Limiting the Producer", that has an answer (although it's only a hint and not a complete solution) from "Zeta" that uses only one IVar.
I think it's worth studying.

- TJ

Lyle Kopnicky

Dec 22, 2018, 12:05:14 AM
to pdx...@googlegroups.com
Thanks, TJ!

Yeah, it occurred to me on the way home that I forgot to bring up the memory issue. We could talk about it next time. The first thing I would do is just look back at prior chapters where we used chunked parallel data vs. unchunked, serial data. If the chunking means we use 40x the memory, then that explains it. If not, we have to look at other parameters.

I looked at the post you linked to. The solution that Zeta posted is incorrect: it does not implement forkDistance as intended by Simon Marlow. They are essentially just producing chunks of chunkSize, with the first chunk being smaller, of length forkDistance. The solution does do rate limiting, but it never runs any of the production in parallel with consumption; the forks are always at the end of each chunk. This is why they were able to simplify the code.

So, for example, suppose that forkDistance is 4, and chunkSize is 8. Then Zeta’s solution would produce a chain like this:

1111F 22222222F 33333333F 44444444F

…where the digit indicates the order of the chunk, and F indicates the forking point.

But look at what Marlow says in the book: "The Fork doesn’t have to be at the end; for example, the list might be produced in chunks of 200 elements, with the first Fork being at the 100 element mark, and every 200 elements thereafter. This would mean that at any time there would be at least 100 and up to 300 elements waiting to be consumed.”

Zeta’s first chunk is not full-sized. Also if the chunks were size 200, with forkDistance 100, there would never be 300 elements waiting to be consumed. There could only ever be 200 elements waiting.

This is why my solution is more complicated: it’s following Marlow’s instructions correctly. My chunks look more like this:

1111F1111 2222F2222 3333F3333 4444F4444

…so that forking is done in the middle of consuming a chunk. If the chunks were size 200, with forkDistance 100, that would give the opportunity for up to 300 elements to be waiting to be consumed: 100 from the rest of the chunk following the fork, and 200 from the next chunk.

Because the fork can be in the middle of the chunk, this is why I need to generate extra IVars: nextChunk, newtl, and nextNextChunk.
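To make that concrete, here's roughly the shape of a producer that puts the Fork in the middle of each chunk (a simplified sketch using the IList/Stream types from my earlier message, not the exact gist code; the helper names are just for illustration, and it assumes 0 <= forkDistance < chunkSize):

streamFromList :: NFData a => Int -> Int -> [a] -> Par (Stream a)
streamFromList chunkSize forkDistance xs = do
  var <- new
  fork $ produceChunk xs var
  return var
 where
  -- Produce one chunk of up to chunkSize elements into 'var'.  The element
  -- at index forkDistance is wrapped in a Fork whose Par () computation
  -- produces the following chunk into a fresh IVar.
  produceChunk [] var = put var Nil
  produceChunk ys var = emit 0 chunk var
   where
    (chunk, rest) = splitAt chunkSize ys

    emit _ [] var = put var Nil               -- short final chunk: nothing follows
    emit i (z:zs) var
      | i == forkDistance && not (null rest) = do
          nextChunk   <- new                  -- IVar for the head of the next chunk
          restOfChunk <- buildRest (z:zs) nextChunk
          put var (Fork (produceChunk rest nextChunk) restOfChunk)
      | otherwise = do
          newtl <- new
          put var (Cons z newtl)
          emit (i + 1) zs newtl

  -- Build the part of the chunk that follows the Fork as a plain IList,
  -- whose last tail is the (still empty) IVar for the next chunk.
  buildRest [z]    nextChunk = return (Cons z nextChunk)
  buildRest (z:zs) nextChunk = do
    tl   <- new
    ilst <- buildRest zs nextChunk
    put tl ilst
    return (Cons z tl)

The extra IVars TJ asked about correspond to nextChunk here (created before the Fork is emitted, so the consumer can keep walking into the next chunk once the forked computation fills it) and the fresh tail allocated for each Cons.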

I hope that clears things up. I should probably post a response to Zeta’s answer.

See you at the next meeting!
Lyle
