Performance Tests and Tuning


Graeme Defty

Mar 15, 2011, 3:36:12 AM
to neoto...@googlegroups.com, Tony Arcieri, Sean Cribbs, Erlang Questions
Sorry all. This is likely to be a biggish post (and may well turn out to be unreadable if the tables don't reproduce well ;-)

I would certainly appreciate someone commenting on the logic of my approach.

OK - here's what I have done so far, using the Reia PEG as a test case....

It seemed that fprof was the best tool to use, but it measures elapsed time, not CPU time. I could not find a tool to give me processor time used. :-(    (Is that really so hard to do in a VM?!?)
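For anyone who wants to reproduce this kind of table, the session is just apply/profile/analyse. The profiled call below is a stand-in (in the runs here it was the generated parser's entry point applied to the Reia PEG source):

```erlang
%% Minimal fprof session. lists:seq/2 is a stand-in target; substitute
%% your generated parser's entry point and input.
Result = fprof:apply(lists, seq, [1, 1000]),     %% run and trace the target call
ok = fprof:profile(),                            %% crunch the trace (fprof.trace)
ok = fprof:analyse([{dest, "fprof.analysis"}]),  %% write the readable table
1000 = length(Result).
```

The ACC/OWN columns in the analysis file correspond to 'Tot' and 'Own' in the tables below.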

As a baseline, I used the last version of the memory branch that I posted.  The top few lines of the result were as follows:

"<0.39.0>"     1940718   undefined   52034.028                         13.2247

Module         Function                Arity   Count     Tot        Own        Unit        %ageTotal  /p_all
neotoma_parse  p_all                   4       138932    72683.399  3934.601   0.02832      7.56%     1.0000
neotoma_parse  p_attempt               4       130960    69777.613  3884.261   0.02966      7.46%     0.9872
neotoma_parse  p                       5       51709     73383.129  3834.867   0.07416      7.37%     0.9747
neotoma_parse  get_memo                1       51709     5872.097   2540.128   0.04912      4.88%     0.6456
lists          foldl                   3       95022     5179.214   2277.290   0.02397      4.38%     0.5788
neotoma_parse  -p_charclass/1-fun-0-'  4       14558     6096.801   2265.258   0.15560      4.35%     0.5757
proplists      lookup                  2       137078    3229.624   2216.337   0.01617      4.26%     0.5633
neotoma_parse  -p_string/1-fun-0-'     4       110587    4273.063   2200.807   0.01990      4.23%     0.5593
neotoma_parse  p_advance_index         2       95022     10534.828  2080.734   0.02190      4.00%     0.5288
neotoma_parse  memoize                 2       42376     4736.719   1954.612   0.04613      3.76%     0.4968
neotoma_parse  p_scan                  4       53795     72608.579  1629.444   0.03029      3.13%     0.4141
unicode        characters_to_list      1       47433     3260.534   1588.838   0.03350      3.05%     0.4038
neotoma_parse  -p_not/1-fun-0-'        3       37517     3341.361   1537.576   0.04098      2.95%     0.3908
neotoma_parse  -p_anything/0-fun-0-'   2       34785     9531.222   1458.524   0.04193      2.80%     0.3707
neotoma_parse  memo_table_name         0       95476     1899.382   1354.767   0.01419      2.60%     0.3443
proplists      is_defined              2       74994     1764.019   1212.677   0.01617      2.33%     0.3082
               garbage_collect         537     1         1039.164   1039.164   1039.16400   2.00%     0.2641
re             run                     3       14558     1056.997   994.229    0.06829      1.91%     0.2527
ets            lookup                  2       52794     1155.080   950.885    0.01801      1.83%     0.2417
ets            insert                  2       42676     1240.016   877.454    0.02056      1.69%     0.2230

For those unfamiliar with fprof: 'Count' is the number of times a function is executed, 'Tot' is the total time spent in a function including time in called functions, and 'Own' is the time spent in a function EXcluding called functions (the bit in which we are most interested).

The figures on the top line, level with <0.39.0>, are the totals for the whole process.

The 'Unit' column is one I added; it shows the 'Own' time for a function divided by the number of times it is executed (i.e. the unit time for one execution ;-) ), and the '%ageTotal' column shows the Own time as a percentage of the total own time.

The bad news, as can be seen from this %age column, is that there is no particular 'hotspot', so no obvious place to focus our efforts.

SO . . . I just started at the top of the list. The function p_all seemed the biggest user, but looking at the code, it is not easy to see how to make it any slicker. The same is true of p_attempt and p.

The first line of interest is that for get_memo. I have wondered for a while whether it would be better to use a proplist to store the caching information, so that was my first experiment. My initial try showed about a 50% reduction in time - then I remembered this was elapsed time, and so highly sensitive to whatever else was using resources on my machine at the time.

To combat this variability, the best I could think of was to scale everything relative to the p_all time (which I did not expect to improve anyway). This is what you see in the far-right column, where the own time for each function is shown as a fraction of the p_all time.
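In other words, the two derived columns are just ratios; a sketch, with example numbers taken from the p_all and get_memo rows of the baseline table:

```erlang
%% Derived columns in the tables: Unit = Own/Count, and the far-right
%% /p_all column is each function's Own time divided by p_all's Own time.
Unit = fun(Own, Count) -> Own / Count end,
RelToPAll = fun(Own, PAllOwn) -> Own / PAllOwn end,

%% Baseline get_memo row: Own = 2540.128, Count = 51709; p_all Own = 3934.601.
true = abs(Unit(2540.128, 51709) - 0.04912) < 1.0e-4,        %% the Unit column
true = abs(RelToPAll(2540.128, 3934.601) - 0.6456) < 1.0e-3. %% the /p_all column
```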

Memoisation
Having changed the memoisation routines to use a proplist, I got the following results:
"<0.39.0>"     1940528   undefined   26105.719   50.17%   11.9094

Module         Function                Arity   Count     Tot        Own       Unit      %ageTotal  /p_all
neotoma_parse  p_all                   4       138932    49375.193  2192.023  0.01578    8.40%     1.0000
neotoma_parse  p                       5       51709     49669.187  1948.556  0.03768    7.46%     0.8889
neotoma_parse  p_attempt               4       130960    47517.007  1896.901  0.01448    7.27%     0.8654
neotoma_parse  get_memo                1       51709     4467.727   1271.97   0.02460    4.87%     0.5803
neotoma_parse  memoize                 2       42376     3598.339   1195.318  0.02821    4.58%     0.5453
neotoma_parse  p_advance_index         2       95022     7174.674   1161.685  0.01223    4.45%     0.5300
lists          foldl                   3       95022     3465.169   1090.941  0.01148    4.18%     0.4977
neotoma_parse  -p_string/1-fun-0-'     4       110587    2499.068   1051.826  0.00951    4.03%     0.4798
neotoma_parse  p_scan                  4       53795     49261.369  957.463   0.01780    3.67%     0.4368
proplists      lookup                  2       137078    2055.560   953.497   0.00696    3.65%     0.4350
re             run                     3       14558     1030.919   844.034   0.05798    3.23%     0.3850
unicode        characters_to_list      1       47433     2274.426   708.93    0.01495    2.72%     0.3234
neotoma_parse  memo_table_name         0       95476     1422.423   676.268   0.00708    2.59%     0.3085
               garbage_collect         537     1         674.150    669.530   1.00000    2.56%     0.3054
neotoma_parse  -p_not/1-fun-0-'        3       37517     2014.170   630.088   0.01679    2.41%     0.2874
ets            lookup                  2       52794     1037.220   611.673   0.01159    2.34%     0.2790
proplists      is_defined              2       74994     1209.329   579.755   0.00773    2.22%     0.2645
ets            insert                  2       42676     936.285    577.493   0.01353    2.21%     0.2635
neotoma_parse  -p_charclass/1-fun-0-'  4       14558     3818.591   564.022   0.03874    2.16%     0.2573
neotoma_parse  -p_anything/0-fun-0-'   2       34785     6316.239   489.329   0.01407    1.87%     0.2232


As can be seen, there is some improvement in get_memo (down from 0.6456 to 0.5803), but sadly there is a corresponding INcrease in memoize (up from 0.4968 to 0.5453), which reduces the gain to just 0.05 p_all execution-equivalents (if you will excuse my somewhat clumsy scale). Actually, the impacts are complex, because all sorts of other things have shifted, so I concentrated on the figure at the top, which shows the total elapsed time in terms of p_all executions. This has come down from 13.225 to 11.909, a reduction of about 10%. Not stunning, but a worthwhile improvement.
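For the record, the proplist variant is shaped roughly like the sketch below. This is illustrative only - the real neotoma_parse helpers differ, and keeping one proplist per input index in the process dictionary is an assumption, not the actual implementation:

```erlang
-module(memo_sketch).
-export([get_memo/2, memoize/3]).

%% Hedged sketch of memoisation via a proplist instead of an ets table.
%% One proplist of {RuleName, Result} pairs is kept per input Index,
%% here (as an assumption) in the process dictionary.
get_memo(Index, Name) ->
    case erlang:get({memo, Index}) of
        undefined ->
            {error, not_found};
        Plist ->
            case proplists:lookup(Name, Plist) of
                none           -> {error, not_found};
                {Name, Result} -> {ok, Result}
            end
    end.

memoize(Index, Name, Result) ->
    Old = case erlang:get({memo, Index}) of
              undefined -> [];
              Plist     -> Plist
          end,
    erlang:put({memo, Index}, [{Name, Result} | Old]),
    Result.
```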

The next obvious target was p_advance_index. It seemed to me that the index structure was somewhat over-engineered, so I tried replacing it with a simple dotted pair ([X|Y]). That HAS to produce some benefits, right?
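Concretely, the experiment swaps the nested-tuple index (of the shape {{line, L}, {column, C}}) for a bare cons cell. A sketch of the advancing step - the function body here is illustrative, not the actual p_advance_index:

```erlang
-module(index_sketch).
-export([advance/2]).

%% Index as a dotted pair [Line | Column] instead of a nested tuple.
%% A newline bumps the line and resets the column; anything else just
%% bumps the column. Illustrative only.
advance($\n, [Line | _Col]) -> [Line + 1 | 1];
advance(_Ch, [Line | Col])  -> [Line | Col + 1].
```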

Well the results were as follows:

"<0.39.0>"     1940530   undefined   26095.127   50.15%   13.4288

Module         Function                Arity   Count     Tot        Own       Unit      %ageTotal  /p_all
neotoma_parse  p_all                   4       138932    51173.965  1943.224  0.01399    7.45%     1.0000
neotoma_parse  p                       5       51709     51657.472  1923.512  0.03720    7.37%     0.9899
neotoma_parse  p_attempt               4       130960    49330.829  1855.634  0.01417    7.11%     0.9549
neotoma_parse  get_memo                1       51709     4568.310   1265.543  0.02447    4.85%     0.6513
neotoma_parse  p_advance_index         2       95022     7860.237   1165.753  0.01227    4.47%     0.5999
proplists      lookup                  2       137078    2279.512   1160.885  0.00847    4.45%     0.5974
lists          foldl                   3       95022     3769.014   1107.37   0.01165    4.24%     0.5699
neotoma_parse  -p_string/1-fun-0-'     4       110587    2909.453   1047.395  0.00947    4.01%     0.5390
neotoma_parse  memoize                 2       42376     3365.578   962.101   0.02270    3.69%     0.4951
neotoma_parse  p_scan                  4       53795     51087.806  910.642   0.01693    3.49%     0.4686
re             run                     3       14558     1021.338   885.071   0.06080    3.39%     0.4555
neotoma_parse  -p_charclass/1-fun-0-'  4       14558     4297.158   766.53    0.05265    2.94%     0.3945
               garbage_collect         537     1         738.733    732.901   1.00000    2.81%     0.3772
unicode        characters_to_list      1       47433     2487.362   723.41    0.01525    2.77%     0.3723
neotoma_parse  memo_table_name         0       95476     1529.930   663.84    0.00695    2.54%     0.3416
ets            lookup                  2       52794     1100.060   629.197   0.01192    2.41%     0.3238
proplists      is_defined              2       74994     1355.632   606.665   0.00809    2.32%     0.3122
ets            insert                  2       42676     876.211    588.44    0.01379    2.25%     0.3028
neotoma_parse  -p_not/1-fun-0-'        3       37517     2047.480   547.493   0.01459    2.10%     0.2817
neotoma_parse  -p_anything/0-fun-0-'   2       34785     6542.401   545.422   0.01568    2.09%     0.2807

So the total has gone from 13.2247 to 13.4288. An INCREASE of about 1.5% ! ! !

Hmmm - Well maybe dotted pair handling is not so great after all. I will try a 2-item list next.

But you can see that this does not look like it is going to be easy. One way to improve things is to tighten up the structure of the PEG itself. I am sure I could simplify it, especially around the definitions of the lowest-level elements, and thus reduce the number of times all these key functions are executed (p_all, p, etc.), but that seems a bit like putting the cart before the horse. I do not really want to bend my PEG (if you will excuse the expression) for the sake of performance, but of course if that is what it takes . . .

All ideas gratefully received.

g
_____________________________________________
On 15 March 2011 03:56, Tony Arcieri <tony.a...@medioh.com> wrote:
Any plans for further performance improvements in the roadmap? :)


On Sun, Mar 13, 2011 at 4:25 PM, Sean Cribbs <seanc...@gmail.com> wrote:
It's been a year coming, so I'm proud to announce a new version of
Neotoma, the PEG/Packrat toolkit and parser-generator.  The primary
feature of version 1.5 is that it uses significantly less memory and
also supports UTF8 (via binaries), thanks to some awesome work by
Graeme Defty, who has also converted Reia to use a PEG.

You can get the latest source on Github: https://github.com/seancribbs/neotoma

Neotoma is rebar-friendly as well, so adding it to a rebar project is
a simple deps declaration like so:

   {neotoma, "1.5", {git, "git://github.com/seancribbs/neotoma", {tag, "1.5"}}}

Hope to see some of you at Erlang Factory SF next week.

Cheers,

Sean Cribbs



--
Tony Arcieri
Medioh! Kudelski

Sean Cribbs

Mar 15, 2011, 9:06:54 AM
to Graeme Defty, neoto...@googlegroups.com, Tony Arcieri, Erlang Questions
Graeme,

Maybe it's just me, but it looks like the largest chunk of time is spent in garbage collection, so the natural way to optimize is to make fewer copies of things. Also, it's natural that your list version of the index counter is slightly slower: tuples have faster access since they are of fixed length.

Sean

Graeme Defty

Mar 15, 2011, 11:17:55 AM
to Sean Cribbs, neoto...@googlegroups.com, Tony Arcieri, Erlang Questions
Yes, the garbage collection (although VERY variable) is always a significant chunk. That seems to come under the "optimise the PEG" category to some extent. I guess there is some scope for examining each routine and making it as copy-free as possible.

WRT the index, perhaps then the best option is a duple (on the basis that eliminating the 'line' and 'column' atoms would avoid some effort in matching). I will give it a go.

The main message seems to be though . . . "no silver bullet" - well WHAT a surprise ;-)

g
______________________________

Graeme Defty

Mar 15, 2011, 8:54:07 PM
to Sean Cribbs, neoto...@googlegroups.com, Tony Arcieri, Erlang Questions
Sean,

On reflection, dotted pairs are even MORE fixed-length than tuples. Tuple sizes are at least decided by the programmer; dotted pairs are fixed at 2 (which is why I chose them). That also has the benefit of using only a single cons cell versus whatever the tuple takes. Less space means less garbage collection, so there should be some benefit there - so I am still a little puzzled. Still, I don't know the implementation details, so maybe you are right.
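The space point can actually be checked from the shell: erts_debug:flat_size/1 reports a term's heap footprint in words, and a cons cell of two small integers is indeed one word smaller than the equivalent 2-tuple (which carries an extra header word):

```erlang
%% Heap words for the two index shapes. Small integers are immediates
%% (stored inline), so only the container itself costs heap space.
2 = erts_debug:flat_size([1 | 2]),  %% one cons cell: 2 words
3 = erts_debug:flat_size({1, 2}).   %% tuple header + 2 elements: 3 words
```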

On the other ideas for improvement: I also thought of adding an offset into the input to the index structure, to save multiple copies of the input being made, but after reading a bit about binaries and their various forms, it seemed that copies were probably not being made anyway (subject to the ets point below).

> I also want to investigate using the process dictionary a bit more instead of ETS,
> or at least get a comparison metric of the two approaches.

This was the same as my idea of using a proplist, my thought being that ets was probably making a copy of the memoised input (and output) and that a proplist would avoid that happening (instead just storing pointers into the structure). Again though, I am not sure, so I do not know where the 10% improvement really originates.
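That reasoning rests on a documented property of ets: terms are copied into the table on insert and copied back out on lookup (large binaries excepted - those are reference-counted), whereas a proplist on the process heap shares structure with the terms it was built from. A minimal illustration:

```erlang
%% ets copies terms both ways; a heap proplist just holds references.
Tab = ets:new(memo_demo, [set]),
Big = lists:seq(1, 1000),
true = ets:insert(Tab, {key, Big}),    %% Big is copied into the table here
[{key, Big}] = ets:lookup(Tab, key),   %% ...and a fresh copy is built here
Plist = [{key, Big}],                  %% this, by contrast, stores a pointer
{key, Big} = proplists:lookup(key, Plist).
```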

I still keep coming back to the fact that even if we manage the impossible and eliminate one of these functions *totally*, we will still only save about 7% at best! p_all really doesn't do much but call other functions, and yet takes the most time. My eye is drawn to the 'Count' column: 130,000+ executions each for p_all and p_attempt. Given that the PEG source is about 44k chars in total, that represents about three calls to each of those functions for every character in the source!

OTOH, p is called only 51,700 times,  about 1.16 times per character of input, which really doesn't seem too bad.

Maybe I just need to make the input smaller, but that doesn't help us compete with leex/yecc. If my memory serves me right, we are somewhere around 6x slower in executing the Reia unit tests - only acceptable if you are parsing a small amount of source.

We seem to have a way to go.

g
_________________________________________________
On 15 March 2011 20:06, Sean Cribbs <seanc...@gmail.com> wrote:

Graeme Defty

Mar 22, 2011, 8:22:37 AM
to neoto...@googlegroups.com, Sean Cribbs
Sean,

I was a little puzzled by your comment "...use the functions in the includes file instead of having them in both places."

When I looked into the function I saw what you meant! I could not remember copying the function source into the test module, but I guess I must have done it at some time.

This must have caused you more than a little confusion (not to mention doubts regarding my sanity). My apologies.

I have decided that, in view of that, I will delete my fork and start again from scratch. I will create a new benchmark for testing performance enhancements, using the latest version (with the proplist changes etc. in) as the baseline.

Cheers,

g
___________________________________