Performance Tests and Tuning


Graeme Defty

Mar 15, 2011, 3:36:12 AM
to neoto...@googlegroups.com, Tony Arcieri, Sean Cribbs, Erlang Questions
Sorry all. This is likely to be a biggish post (and may well turn out to be unreadable if the tables don't reproduce well ;-)

I would certainly appreciate someone commenting on the logic of my approach.

OK - here's what I have done so far, using the Reia PEG as a test case....

It seemed that fprof was the best tool to use, but it measures elapsed time, not CPU time. I could not find a tool to give me processor time used. :-(    (Is that really so hard to do in a VM?!?)
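For anyone who wants to reproduce this kind of table, the session is just apply/profile/analyse. The profiled call below is a stand-in (in the runs here it was the generated parser's entry point applied to the Reia PEG source):

```erlang
%% Minimal fprof session. lists:seq/2 is a stand-in target; substitute
%% your generated parser's entry point and input.
Result = fprof:apply(lists, seq, [1, 1000]),     %% run and trace the target call
ok = fprof:profile(),                            %% crunch the trace (fprof.trace)
ok = fprof:analyse([{dest, "fprof.analysis"}]),  %% write the readable table
1000 = length(Result).
```

The ACC/OWN columns in the analysis file correspond to 'Tot' and 'Own' in the tables below.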

As a baseline, I used the last version of the memory branch that I posted.  The top few lines of the result were as follows:

"<0.39.0>"     1940718   undefined   52034.028                         13.2247

Module         Function                Arity   Count     Tot        Own        Unit        %ageTotal  /p_all
neotoma_parse  p_all                   4       138932    72683.399  3934.601   0.02832      7.56%     1.0000
neotoma_parse  p_attempt               4       130960    69777.613  3884.261   0.02966      7.46%     0.9872
neotoma_parse  p                       5       51709     73383.129  3834.867   0.07416      7.37%     0.9747
neotoma_parse  get_memo                1       51709     5872.097   2540.128   0.04912      4.88%     0.6456
lists          foldl                   3       95022     5179.214   2277.290   0.02397      4.38%     0.5788
neotoma_parse  -p_charclass/1-fun-0-'  4       14558     6096.801   2265.258   0.15560      4.35%     0.5757
proplists      lookup                  2       137078    3229.624   2216.337   0.01617      4.26%     0.5633
neotoma_parse  -p_string/1-fun-0-'     4       110587    4273.063   2200.807   0.01990      4.23%     0.5593
neotoma_parse  p_advance_index         2       95022     10534.828  2080.734   0.02190      4.00%     0.5288
neotoma_parse  memoize                 2       42376     4736.719   1954.612   0.04613      3.76%     0.4968
neotoma_parse  p_scan                  4       53795     72608.579  1629.444   0.03029      3.13%     0.4141
unicode        characters_to_list      1       47433     3260.534   1588.838   0.03350      3.05%     0.4038
neotoma_parse  -p_not/1-fun-0-'        3       37517     3341.361   1537.576   0.04098      2.95%     0.3908
neotoma_parse  -p_anything/0-fun-0-'   2       34785     9531.222   1458.524   0.04193      2.80%     0.3707
neotoma_parse  memo_table_name         0       95476     1899.382   1354.767   0.01419      2.60%     0.3443
proplists      is_defined              2       74994     1764.019   1212.677   0.01617      2.33%     0.3082
               garbage_collect         537     1         1039.164   1039.164   1039.16400   2.00%     0.2641
re             run                     3       14558     1056.997   994.229    0.06829      1.91%     0.2527
ets            lookup                  2       52794     1155.080   950.885    0.01801      1.83%     0.2417
ets            insert                  2       42676     1240.016   877.454    0.02056      1.69%     0.2230

For those unfamiliar with fprof: 'Count' is the number of times a function is executed, 'Tot' is the total time spent in a function including time in called functions, and 'Own' is the time spent in a function EXcluding called functions (the bit in which we are most interested).

The figures on the top line, level with <0.39.0>, are the totals for the whole process.

The 'Unit' column is one I added; it shows the 'Own' time for a function divided by the number of times it is executed (i.e. the unit time for one execution ;-) ), and the '%ageTotal' column shows the Own time as a percentage of the total own time.

The bad news, as can be seen from this %age column, is that there is no particular 'hotspot', so no obvious place to focus our efforts.

SO . . . I just started at the top of the list. The function p_all seemed the biggest user, but looking at the code, it is not easy to see how to make it any slicker. The same is true of p_attempt and p.

The first line of interest is that for get_memo. I have wondered for a while whether it would be better to use a proplist to store the caching information, so that was my first experiment. My initial try showed about a 50% reduction in time - then I remembered this was elapsed time, and so highly sensitive to whatever else was using resources on my machine at the time.

To combat this variability, the best I could think of was to scale everything relative to the p_all time (which I did not expect to improve anyway). This is what you see in the far-right column, where the own time for each function is shown as a fraction of the p_all time.
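In other words, the two derived columns are just ratios; a sketch, with example numbers taken from the p_all and get_memo rows of the baseline table:

```erlang
%% Derived columns in the tables: Unit = Own/Count, and the far-right
%% /p_all column is each function's Own time divided by p_all's Own time.
Unit = fun(Own, Count) -> Own / Count end,
RelToPAll = fun(Own, PAllOwn) -> Own / PAllOwn end,

%% Baseline get_memo row: Own = 2540.128, Count = 51709; p_all Own = 3934.601.
true = abs(Unit(2540.128, 51709) - 0.04912) < 1.0e-4,        %% the Unit column
true = abs(RelToPAll(2540.128, 3934.601) - 0.6456) < 1.0e-3. %% the /p_all column
```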

Memoisation
Having changed the memoisation routines to use a proplist, I got the following results:
"<0.39.0>"     1940528   undefined   26105.719   50.17%   11.9094

Module         Function                Arity   Count     Tot        Own       Unit      %ageTotal  /p_all
neotoma_parse  p_all                   4       138932    49375.193  2192.023  0.01578    8.40%     1.0000
neotoma_parse  p                       5       51709     49669.187  1948.556  0.03768    7.46%     0.8889
neotoma_parse  p_attempt               4       130960    47517.007  1896.901  0.01448    7.27%     0.8654
neotoma_parse  get_memo                1       51709     4467.727   1271.97   0.02460    4.87%     0.5803
neotoma_parse  memoize                 2       42376     3598.339   1195.318  0.02821    4.58%     0.5453
neotoma_parse  p_advance_index         2       95022     7174.674   1161.685  0.01223    4.45%     0.5300
lists          foldl                   3       95022     3465.169   1090.941  0.01148    4.18%     0.4977
neotoma_parse  -p_string/1-fun-0-'     4       110587    2499.068   1051.826  0.00951    4.03%     0.4798
neotoma_parse  p_scan                  4       53795     49261.369  957.463   0.01780    3.67%     0.4368
proplists      lookup                  2       137078    2055.560   953.497   0.00696    3.65%     0.4350
re             run                     3       14558     1030.919   844.034   0.05798    3.23%     0.3850
unicode        characters_to_list      1       47433     2274.426   708.93    0.01495    2.72%     0.3234
neotoma_parse  memo_table_name         0       95476     1422.423   676.268   0.00708    2.59%     0.3085
               garbage_collect         537     1         674.150    669.530   1.00000    2.56%     0.3054
neotoma_parse  -p_not/1-fun-0-'        3       37517     2014.170   630.088   0.01679    2.41%     0.2874
ets            lookup                  2       52794     1037.220   611.673   0.01159    2.34%     0.2790
proplists      is_defined              2       74994     1209.329   579.755   0.00773    2.22%     0.2645
ets            insert                  2       42676     936.285    577.493   0.01353    2.21%     0.2635
neotoma_parse  -p_charclass/1-fun-0-'  4       14558     3818.591   564.022   0.03874    2.16%     0.2573
neotoma_parse  -p_anything/0-fun-0-'   2       34785     6316.239   489.329   0.01407    1.87%     0.2232


As can be seen, there is some improvement in get_memo (down from 0.6456 to 0.5803), but sadly there is a corresponding INcrease in memoize (up from 0.4968 to 0.5453), which reduces the gain to just 0.05 p_all execution-equivalents (if you will excuse my somewhat clumsy scale). Actually, the impacts are complex, because all sorts of other things have shifted, so I concentrated on the figure at the top, which shows the total elapsed time in terms of p_all executions. This has come down from 13.225 to 11.909, a reduction of about 10%. Not stunning, but a worthwhile improvement.
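For the record, the proplist variant is shaped roughly like the sketch below. This is illustrative only - the real neotoma_parse helpers differ, and keeping one proplist per input index in the process dictionary is an assumption, not the actual implementation:

```erlang
-module(memo_sketch).
-export([get_memo/2, memoize/3]).

%% Hedged sketch of memoisation via a proplist instead of an ets table.
%% One proplist of {RuleName, Result} pairs is kept per input Index,
%% here (as an assumption) in the process dictionary.
get_memo(Index, Name) ->
    case erlang:get({memo, Index}) of
        undefined ->
            {error, not_found};
        Plist ->
            case proplists:lookup(Name, Plist) of
                none           -> {error, not_found};
                {Name, Result} -> {ok, Result}
            end
    end.

memoize(Index, Name, Result) ->
    Old = case erlang:get({memo, Index}) of
              undefined -> [];
              Plist     -> Plist
          end,
    erlang:put({memo, Index}, [{Name, Result} | Old]),
    Result.
```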

The next obvious target was p_advance_index. It seemed to me that the index structure was somewhat over-engineered, so I tried replacing it with a simple dotted pair ([X|Y]). That HAS to produce some benefits, right?
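Concretely, the experiment swaps the nested-tuple index (of the shape {{line, L}, {column, C}}) for a bare cons cell. A sketch of the advancing step - the function body here is illustrative, not the actual p_advance_index:

```erlang
-module(index_sketch).
-export([advance/2]).

%% Index as a dotted pair [Line | Column] instead of a nested tuple.
%% A newline bumps the line and resets the column; anything else just
%% bumps the column. Illustrative only.
advance($\n, [Line | _Col]) -> [Line + 1 | 1];
advance(_Ch, [Line | Col])  -> [Line | Col + 1].
```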

Well the results were as follows:

"<0.39.0>"     1940530   undefined   26095.127   50.15%   13.4288

Module         Function                Arity   Count     Tot        Own       Unit      %ageTotal  /p_all
neotoma_parse  p_all                   4       138932    51173.965  1943.224  0.01399    7.45%     1.0000
neotoma_parse  p                       5       51709     51657.472  1923.512  0.03720    7.37%     0.9899
neotoma_parse  p_attempt               4       130960    49330.829  1855.634  0.01417    7.11%     0.9549
neotoma_parse  get_memo                1       51709     4568.310   1265.543  0.02447    4.85%     0.6513
neotoma_parse  p_advance_index         2       95022     7860.237   1165.753  0.01227    4.47%     0.5999
proplists      lookup                  2       137078    2279.512   1160.885  0.00847    4.45%     0.5974
lists          foldl                   3       95022     3769.014   1107.37   0.01165    4.24%     0.5699
neotoma_parse  -p_string/1-fun-0-'     4       110587    2909.453   1047.395  0.00947    4.01%     0.5390
neotoma_parse  memoize                 2       42376     3365.578   962.101   0.02270    3.69%     0.4951
neotoma_parse  p_scan                  4       53795     51087.806  910.642   0.01693    3.49%     0.4686
re             run                     3       14558     1021.338   885.071   0.06080    3.39%     0.4555
neotoma_parse  -p_charclass/1-fun-0-'  4       14558     4297.158   766.53    0.05265    2.94%     0.3945
               garbage_collect         537     1         738.733    732.901   1.00000    2.81%     0.3772
unicode        characters_to_list      1       47433     2487.362   723.41    0.01525    2.77%     0.3723
neotoma_parse  memo_table_name         0       95476     1529.930   663.84    0.00695    2.54%     0.3416
ets            lookup                  2       52794     1100.060   629.197   0.01192    2.41%     0.3238
proplists      is_defined              2       74994     1355.632   606.665   0.00809    2.32%     0.3122
ets            insert                  2       42676     876.211    588.44    0.01379    2.25%     0.3028
neotoma_parse  -p_not/1-fun-0-'        3       37517     2047.480   547.493   0.01459    2.10%     0.2817
neotoma_parse  -p_anything/0-fun-0-'   2       34785     6542.401   545.422   0.01568    2.09%     0.2807

So the total has gone from 13.2247 to 13.4288. An INCREASE of about 1.5% ! ! !

Hmmm - Well maybe dotted pair handling is not so great after all. I will try a 2-item list next.

But you can see that this does not look like it is going to be easy. One way to improve things is to tighten up the structure of the PEG itself. I am sure I could simplify it, especially around the definitions of the lowest-level elements, and thus reduce the number of times all these key functions are executed (p_all, p, etc.), but that seems a bit like putting the cart before the horse. I do not really want to bend my PEG (if you will excuse the expression) for the sake of performance, but of course if that is what it takes . . .

All ideas gratefully received.

g
_____________________________________________
On 15 March 2011 03:56, Tony Arcieri <tony.a...@medioh.com> wrote:
Any plans for further performance improvements in the roadmap? :)


On Sun, Mar 13, 2011 at 4:25 PM, Sean Cribbs <seanc...@gmail.com> wrote:
It's been a year coming, so I'm proud to announce a new version of
Neotoma, the PEG/Packrat toolkit and parser-generator.  The primary
feature of version 1.5 is that it uses significantly less memory and
also supports UTF8 (via binaries), thanks to some awesome work by
Graeme Defty, who has also converted Reia to use a PEG.

You can get the latest source on Github: https://github.com/seancribbs/neotoma

Neotoma is rebar-friendly as well, so adding it to a rebar project is
a simple deps declaration like so:

   {neotoma, "1.5", {git, "git://github.com/seancribbs/neotoma", {tag, "1.5"}}}

Hope to see some of you at Erlang Factory SF next week.

Cheers,

Sean Cribbs



--
Tony Arcieri
Medioh! Kudelski

Sean Cribbs

Mar 15, 2011, 9:06:54 AM
to Graeme Defty, neoto...@googlegroups.com, Tony Arcieri, Erlang Questions
Graeme,

Maybe it's just me, but it looks like the largest chunk of time is spent in garbage collection, so the natural way to optimize is to make fewer copies of things. Also, it's natural that your list version of the index counter is slightly slower: tuples have faster access since they are of fixed length.

Sean

Graeme Defty

Mar 15, 2011, 11:17:55 AM
to Sean Cribbs, neoto...@googlegroups.com, Tony Arcieri, Erlang Questions
Yes, the garbage collection (although VERY variable) is always a significant chunk. That seems to come under the "optimise the PEG" category to some extent. I guess there is some scope for examining each routine and making it as copy-free as possible.

WRT the index, perhaps then the best option is a duple (on the basis that eliminating the 'line' and 'column' atoms would avoid some effort in matching). I will give it a go.

The main message seems to be though . . . "no silver bullet" - well WHAT a surprise ;-)

g
______________________________

Graeme Defty

Mar 15, 2011, 8:54:07 PM
to Sean Cribbs, neoto...@googlegroups.com, Tony Arcieri, Erlang Questions
Sean,

On reflection, dotted pairs are even MORE fixed-length than tuples. Tuple sizes are at least decided by the programmer; dotted pairs are fixed at 2 (which is why I chose them). That also has the benefit of using only a single cons cell versus whatever the tuple takes. Less space means less garbage collection, so there should be some benefit there - so I am still a little puzzled. Still, I don't know the implementation details, so maybe you are right.
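The space point can actually be checked from the shell: erts_debug:flat_size/1 reports a term's heap footprint in words, and a cons cell of two small integers is indeed one word smaller than the equivalent 2-tuple (which carries an extra header word):

```erlang
%% Heap words for the two index shapes. Small integers are immediates
%% (stored inline), so only the container itself costs heap space.
2 = erts_debug:flat_size([1 | 2]),  %% one cons cell: 2 words
3 = erts_debug:flat_size({1, 2}).   %% tuple header + 2 elements: 3 words
```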

On the other ideas for improvement: I also thought of adding an offset into the input to the index structure, to save multiple copies of the input being made, but after reading a bit about binaries and their various forms, it seemed that copies were probably not being made anyway (subject to the ets point below).

> I also want to investigate using the process dictionary a bit more instead of ETS,
> or at least get a comparison metric of the two approaches.

This was the same as my idea of using a proplist, my thought being that ets was probably making a copy of the memoised input (and output) and that a proplist would avoid that happening (instead just storing pointers into the structure). Again though, I am not sure, so I do not know where the 10% improvement really originates.
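That reasoning rests on a documented property of ets: terms are copied into the table on insert and copied back out on lookup (large binaries excepted - those are reference-counted), whereas a proplist on the process heap shares structure with the terms it was built from. A minimal illustration:

```erlang
%% ets copies terms both ways; a heap proplist just holds references.
Tab = ets:new(memo_demo, [set]),
Big = lists:seq(1, 1000),
true = ets:insert(Tab, {key, Big}),    %% Big is copied into the table here
[{key, Big}] = ets:lookup(Tab, key),   %% ...and a fresh copy is built here
Plist = [{key, Big}],                  %% this, by contrast, stores a pointer
{key, Big} = proplists:lookup(key, Plist).
```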

I still keep coming back to the fact that even if we manage the impossible and eliminate one of these functions *totally*, we will still only save about 7% at best! p_all really doesn't do much but call other functions, and yet takes the most time. My eye is drawn to the 'Count' column: 130,000+ executions each for p_all and p_attempt. Given that the PEG source is about 44k chars in total, that represents about three calls to each of those functions for every character in the source!

OTOH, p is called only 51,700 times,  about 1.16 times per character of input, which really doesn't seem too bad.

Maybe I just need to make the input smaller, but that doesn't help us compete with leex/yecc. If my memory serves me right, we are somewhere around 6x slower in executing the Reia unit tests - only acceptable if you are parsing a small amount of source.

We seem to have a way to go.

g
_________________________________________________
On 15 March 2011 20:06, Sean Cribbs <seanc...@gmail.com> wrote:

Graeme Defty

Mar 22, 2011, 8:22:37 AM
to neoto...@googlegroups.com, Sean Cribbs
Sean,

I was a little puzzled by your comment "...use the functions in the includes file instead of having them in both places."

When I looked into the function I saw what you meant! I could not remember copying the function source into the test module, but I guess I must have done it at some time.

This must have caused you more than a little confusion (not to mention doubts regarding my sanity). My apologies.

I have decided that, in view of that, I will delete my fork and start again from scratch. I will create a new benchmark for testing performance enhancements, using the latest version (with the proplist changes etc. in) as the baseline.

Cheers,

g
___________________________________