what is the complexity of regexp.MustCompile()?

550 views
Skip to first unread message

Ray Pereda

unread,
Jun 3, 2020, 1:07:12 PM6/3/20
to golang-nuts
I believe that the complexity of regexp.MustCompile() is linear based on this comment in the regexp package overview.
"The regexp implementation provided by this package is guaranteed to run in time linear in the size of the input"

What is the complexity of regexp.MustCompile()? Is it linear in the length of the regular expression?

-ray



Ray Pereda

unread,
Jun 3, 2020, 5:42:57 PM6/3/20
to golang-nuts
Typo fix, regexp#MatchString and related Match functions are linear. That is an amazing guarantee and makes regular
expressions 10x more useful than regexps in most other programming languages.

Question: Does the compile of regular expressions also guaranteed linear runtime? I'm considering regular expressions with
1000s of keywords; very long regexps. I looked at the source and it was not obvious if that is the case. Intricate code. 

Michael Jones

unread,
Jun 3, 2020, 6:22:08 PM6/3/20
to Ray Pereda, golang-nuts
If you have thousands of fixed strings, a map is your friend. Remarkably so. 

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/f51125e7-f66e-45e8-af3a-381f96071d9a%40googlegroups.com.
--
Michael T. Jones
michae...@gmail.com

Axel Wagner

unread,
Jun 3, 2020, 7:16:26 PM6/3/20
to Ray Pereda, golang-nuts
Hi,

I can very much recommend reading this list of blog posts by Russ Cox. They explain the theory behind all of this. And AIUI, yes, compilation should also be linear. The cost, notably, is that the regexp package actually can only compile actual regular expressions which only recognize regular languages, by omitting features such as back-references. Other regexp-engines are more powerful, at the expense of no longer being able to have this linear runtime characteristic. So there is still a tradeoff (though I tend to agree with the one made by RE2 and thus the regexp-package).


Amnon Baron Cohen

unread,
Jun 8, 2020, 4:15:50 AM6/8/20
to golang-nuts
Should we care?

Regular expressions are generally small.
So the asymptotic complexity is not particularly important.

But regular expressions are often used to search large amounts of input.

regexp gives us fast, guaranteed linear search times.
But we pay for this with slower compilation times.

In my opinion, this is a good tradeoff.

Axel Wagner

unread,
Jun 8, 2020, 8:40:36 AM6/8/20
to Amnon Baron Cohen, golang-nuts
Hi Amnon,

if you read the blog posts I linked above, you'll find examples of where we care very much. RE2 was developed for enabling regular expression search in a large source code corpus. In that scenario, the attacker controls both the regular expression and (to a degree) the text to be searched. If they could craft expression/text pairs that are costly to compile and/or match, then this could enable a denial of service attack.

So, guaranteeing linear compile- *and* match-times is actually pretty relevant for some real-world use-cases.

Best,

Axel

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.

Robert Engels

unread,
Jun 8, 2020, 8:54:05 AM6/8/20
to Axel Wagner, Amnon Baron Cohen, golang-nuts
Attempting to prevent DOS attacks through algorithm efficiency never works - you have to have resource throttling. 

I’m guessing the IO cost of pulling the text in this case has a better chance of creating a DOS than the regex compile. 

On Jun 8, 2020, at 7:40 AM, 'Axel Wagner' via golang-nuts <golan...@googlegroups.com> wrote:



Axel Wagner

unread,
Jun 8, 2020, 9:05:01 AM6/8/20
to Robert Engels, Amnon Baron Cohen, golang-nuts
On Mon, Jun 8, 2020 at 2:53 PM Robert Engels <ren...@ix.netcom.com> wrote:
Attempting to prevent DOS attacks through algorithm efficiency never works

Uhm, no, it totally does. Code Search is a real-world example of it working at least once. 
 
you have to have resource throttling. 

Well, yes. But reigning in algorithmic complexity makes that far easier and . If the cost of a query is proportional to its length, you can just limit the length of queries to some gratuitous upper bound of reasonable. But if it's quadratic or even exponential, that no longer works; if the cost can be doubled just by adding a character to the query, you have to restrict query length to a restrictive *lower* bound on reasonable - which is inherently user-unfriendly.
 
Really, I'd argue that algorithmic efficiency is the *only* real effective measure against a cost DoS.

I’m guessing the IO cost of pulling the text in this case has a better chance of creating a DOS than the regex compile. 

You might guess that, but you'd be wrong. Again, just look at the graph on top of this blog post. You get *minutes* of match-times for queries and corpus of a couple hundred characters.

Regexp-based code search couldn't exist without carefully designing around algorithmic complexity.

Robert Engels

unread,
Jun 8, 2020, 9:20:00 AM6/8/20
to Axel Wagner, Amnon Baron Cohen, golang-nuts
I don’t see anything in the blog post referring to compilation time - only runtime. That being said I couldn’t review the graphs on my phone. 

Even still, to prevent DOS you can bound the compilation time as easily as the runtime. If the lib doesn’t support it a simple fork to add an emit with cancel out stage is all that is required. 

On Jun 8, 2020, at 8:04 AM, 'Axel Wagner' via golang-nuts <golan...@googlegroups.com> wrote:



Axel Wagner

unread,
Jun 8, 2020, 9:27:09 AM6/8/20
to Robert Engels, Amnon Baron Cohen, golang-nuts
On Mon, Jun 8, 2020 at 3:19 PM Robert Engels <ren...@ix.netcom.com> wrote:
I don’t see anything in the blog post referring to compilation time - only runtime.

Sure. It's more a general point about you saying algorithmic complexity doesn't matter.
 
That being said I couldn’t review the graphs on my phone. 

Even still, to prevent DOS you can bound the compilation time as easily as the runtime. If the lib doesn’t support it a simple fork to add an emit with cancel out stage is all that is required. 

Or, you know, make sure that it just has a guaranteed linear complexity. Then you don't even need "an emit with a cancel out stage".

FTR, the whole issue here is a) someone asked about the algorithmic complexity of a stdlib function and b) people told them off saying the question isn't relevant. It is. You might not care (you should, IMO. But if you don't, that's on you). But it's definitely a relevant question to ask.

Robert Engels

unread,
Jun 8, 2020, 9:39:32 AM6/8/20
to Axel Wagner, Amnon Baron Cohen, golang-nuts
I will claim it also doesn’t matter because you need external bounding anyway. Take a simple recursive directory listing. It is linear time bounded. Perform it on the top level on a sufficiently large directory and it might run for days consuming TBs of memory - easily becoming a DOS attack point. So regardless of the complexity you need other constraints anyway. Build those in at the request handling level to avoid DOS and UX issues. 

On Jun 8, 2020, at 8:27 AM, 'Axel Wagner' via golang-nuts <golan...@googlegroups.com> wrote:



Thomas Bushnell, BSG

unread,
Jun 8, 2020, 11:43:47 AM6/8/20
to Robert Engels, Axel Wagner, Amnon Baron Cohen, golang-nuts
The OP was about MustCompile, so I think it's clear they are not using patterns passed in by external requests.

Axel Wagner

unread,
Jun 8, 2020, 12:02:46 PM6/8/20
to Thomas Bushnell, BSG, Robert Engels, Amnon Baron Cohen, golang-nuts
On Mon, Jun 8, 2020 at 5:41 PM Thomas Bushnell, BSG <tbus...@google.com> wrote:
The OP was about MustCompile, so I think it's clear they are not using patterns passed in by external requests.

I don't think that's clear at all. How do you assume patterns from external sources can be matched, if not by compiling them?

Thomas Bushnell, BSG

unread,
Jun 8, 2020, 12:06:38 PM6/8/20
to Axel Wagner, Robert Engels, Amnon Baron Cohen, golang-nuts
On Mon, Jun 8, 2020 at 12:02 PM Axel Wagner <axel.wa...@googlemail.com> wrote:
On Mon, Jun 8, 2020 at 5:41 PM Thomas Bushnell, BSG <tbus...@google.com> wrote:
The OP was about MustCompile, so I think it's clear they are not using patterns passed in by external requests.

I don't think that's clear at all. How do you assume patterns from external sources can be matched, if not by compiling them?

"MustCompile is like Compile but panics if the expression cannot be parsed."

Thomas 

Axel Wagner

unread,
Jun 8, 2020, 12:07:06 PM6/8/20
to Robert Engels, Amnon Baron Cohen, golang-nuts
On Mon, Jun 8, 2020 at 3:38 PM Robert Engels <ren...@ix.netcom.com> wrote:
I will claim it also doesn’t matter because you need external bounding anyway. Take a simple recursive directory listing. It is linear time bounded.

Depends on what we're talking about as the "input" here. It is certainly not linearly bound by the name of the directory, as you pointed out. So the input would have to be the FS then.
 
Perform it on the top level on a sufficiently large directory and it might run for days consuming TBs of memory - easily becoming a DOS attack point. So regardless of the complexity you need other constraints anyway. Build those in at the request handling level to avoid DOS and UX issues. 

I disagree. Because for this to be the case, the attacker would then have to be able to provide a sufficiently large FS as an input. Of course you shouldn't design a system where someone can run directory listings against an arbitrarily large FS with an arbitrarily short query. But the reason is *precisely* that in that scenario, a directory listing is *not* bounded by the input-size.

That's the core of the argument here: If your algorithm is linear or sublinear in the input-size, then any attacker wanting to consume resources, must spent a proportional amount of their own resources.

Axel Wagner

unread,
Jun 8, 2020, 12:08:00 PM6/8/20
to Thomas Bushnell, BSG, Robert Engels, Amnon Baron Cohen, golang-nuts
Oh, true, I overlooked that detail. But FTR, they clarify their question subsequently and specifically ask whether compilation is linear in expression size. No Must* in that clarification.

Robert Engels

unread,
Jun 8, 2020, 1:18:13 PM6/8/20
to Axel Wagner, Thomas Bushnell, BSG, Amnon Baron Cohen, golang-nuts
If the input regex string is bounded by a typical user input box - the compilation time will be negligible when compared to the runtimes.   

On Jun 8, 2020, at 11:07 AM, Axel Wagner <axel.wa...@googlemail.com> wrote:



Axel Wagner

unread,
Jun 8, 2020, 3:27:16 PM6/8/20
to Robert Engels, golang-nuts
And that exact logic is what wouldn't work, if compilation or matching would be exponential, for example.
Being able to say "as long as the inputs don't get astronomically long, the running time of the algorithm will be reasonable" is *exactly* the conclusion that a small complexity class gives you.

Robert Engels

unread,
Jun 8, 2020, 5:23:01 PM6/8/20
to Axel Wagner, golang-nuts
You can’t say it arbitrary regex inputted by a user then claim they can be arbitrarily long - these are incompatible. For any reasonable user input the compile time is negligible, and it is trivial to limit the input size to something a user could input. 

On Jun 8, 2020, at 2:27 PM, Axel Wagner <axel.wa...@googlemail.com> wrote:



Axel Wagner

unread,
Jun 8, 2020, 5:42:16 PM6/8/20
to Robert Engels, golang-nuts
They don't have to be *arbitrarily* long. They actually can be quite short. 2^n is large, even for relatively small n.
Again, I refer to the graphs in the blog post above. We are talking about input length of less than 100 characters to get CPU churn of *minutes*. That's not a huge limit for a search-box.

Yes, the compile-time is negligible - but it's negligible *because it's linear in the input size*. And match-times are only small, because RE2 made the conscious decision (for *exactly* the reasons I outline here) to forego features that were very common and even expected of regexp-engines, so it could guarantee a linear match time. That was novel at the time - almost no engine did that and the most commonly used most certainly didn't. When people in this thread say "regular expressions are great because they give linear match time on the matched text size", they forget that most "regular expression" engines don't actually give you that. They only give that, if you specifically design with that in mind.

I also explained why "it is trivial to limit the input size to something a user could input" is simply wrong, in practice. If you have exponential growth, the difference between 20 and 30 characters (again, we are talking search-terms here, it is not at all unreasonable to look for a 30 character identifier) is three orders of magnitude. If the curve is too steep, you can't reasonably separate between sensible input lengths from attacks. If you get a nice, linear curve, that becomes trivial. Just multiply any sensible limit you could think of by 100 and you're still going to end up well within spec.

And even if we ignore the DoS-potential - just considering what that does to your load-patterns, that some user's queries will be quasi free, while others will be a million times or more as expensive is a good argument to pay attention to this. Just from an operational POV.

Seriously - you should *absolutely* consider the algorithmic complexity of functions you are calling with user-provided input.

Wojciech S. Czarnecki

unread,
Jun 8, 2020, 6:22:49 PM6/8/20
to golan...@googlegroups.com
Dnia 2020-06-08, o godz. 16:22:24
Robert Engels <ren...@ix.netcom.com> napisał(a):

> it is trivial to limit the input size to something a user could input.

With exponential complexity simple regex /(x+x+)+y/ blows up at input of 20 to 30 x-es.
See: https://www.regular-expressions.info/catastrophic.html

[Cut long explanations... Axel just posted most of what I was writing regarding trade-offs).

Hope this helps,

--
Wojciech S. Czarnecki
<< ^oo^ >> OHIR-RIPE

Kurtis Rader

unread,
Jun 8, 2020, 6:48:10 PM6/8/20
to Wojciech S. Czarnecki, golang-nuts
You're talking past each other. Robert is talking about limiting the length of the regex, not the string(s) evaluated by the regex. It should be possible to compile any regex of a reasonable length in a matter of microseconds. Regardless of whether the application of the regex to a given input is near linear (as in the case of the Go RE implementation) or exponential (as in the case of PCRE). I'm pretty sure Robert is not arguing that the scaling problems of the regex implementation used by Perl, and too many others, can be mitigated simply by limiting the size of the string to be matched by the regex. If compiling a regex of reasonable length takes a non-negligible amount of time something is wrong.

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.


--
Kurtis Rader
Caretaker of the exceptional canines Junior and Hank

Axel Wagner

unread,
Jun 8, 2020, 7:23:26 PM6/8/20
to Kurtis Rader, Wojciech S. Czarnecki, golang-nuts
On Tue, Jun 9, 2020 at 12:48 AM Kurtis Rader <kra...@skepticism.us> wrote:
You're talking past each other. Robert is talking about limiting the length of the regex, not the string(s) evaluated by the regex.

So am I. Note that e.g. a Code Search based on PCRE would break, even if you limit *both* (or rather, any limit causing it to not break would result in a completely useless piece of software).

It should be possible to compile any regex of a reasonable length in a matter of microseconds. Regardless of whether the application of the regex to a given input is near linear (as in the case of the Go RE implementation) or exponential (as in the case of PCRE).

This might be, where we talk past each other. I am using application as an example for concrete numbers on how quickly exponential growth can devolve. Of course, compilation of a regexp is fast - I am aware of that. As it turns out, so is application of a regexp, if you use RE2 (or Go's regexp package).

What I take issue with are the statements that a) the question about the complexity of compiling a regexp is irrelevant, because b) limiting the algorithmic complexity of a function to counteract resource exhaustion attacks "never works". RE2 is an excellent example to show that it does work; it is carefully designed for linear complexity of the combined compilation+matching of a regular expression.

I'm not saying compiling a regular expression is an exponential operation. I'm saying *if it was*, you could never reasonably build something like Code Search. And we can use the fact that *application* indeed is (in most engines) exponential in the combined input length of expression+search text as a good basis to make that case.

Of course we don't even need this fantasy world to make the case - after all, saying "you can't build Code Search on PCRE, but you can on RE2" is just as effective to prove the effectiveness of lowering algorithmic complexity. It's just that OP's question was about compilation, so to talk about why OP's question *is* relevant, we need to talk about what would happen if the answer *wasn't* "it's linear".

(I also genuinely don't understand the instinct to tell someone "why would you care?" instead of telling them the answer, FWIW. Especially in a case like this, where the answer is pretty simple)

I'm pretty sure Robert is not arguing that the scaling problems of the regex implementation used by Perl, and too many others, can be mitigated simply by limiting the size of the string to be matched by the regex. If compiling a regex of reasonable length takes a non-negligible amount of time something is wrong.

On Mon, Jun 8, 2020 at 3:22 PM Wojciech S. Czarnecki <oh...@fairbe.org> wrote:
Dnia 2020-06-08, o godz. 16:22:24
Robert Engels <ren...@ix.netcom.com> napisał(a):

> it is trivial to limit the input size to something a user could input.

With exponential complexity simple regex /(x+x+)+y/ blows up at input of 20 to 30 x-es.
See: https://www.regular-expressions.info/catastrophic.html

[Cut long explanations... Axel just posted most of what I was writing regarding trade-offs).

Hope this helps,

--
Wojciech S. Czarnecki
 << ^oo^ >> OHIR-RIPE

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/20200609002207.0a161adf%40xmint.


--
Kurtis Rader
Caretaker of the exceptional canines Junior and Hank

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.

Michael Jones

unread,
Jun 8, 2020, 7:39:16 PM6/8/20
to Ray Pereda, golang-nuts
Ray, only the discussion is exponential.

Robert Engels

unread,
Jun 8, 2020, 9:12:33 PM6/8/20
to Michael Jones, Ray Pereda, golang-nuts
The OP specifically asked about compilation - in fact it’s in the title. You stated the compilation complexity is a DoS attack vector. I argued that it is not. That is all. 

On Jun 8, 2020, at 6:39 PM, Michael Jones <michae...@gmail.com> wrote:



Thomas Bushnell, BSG

unread,
Jun 9, 2020, 11:12:45 AM6/9/20
to Robert Engels, Michael Jones, Ray Pereda, golang-nuts
I'm surprised none of you have taken all this time to just go read the code, where it is clearly linear.

Axel Wagner

unread,
Jun 9, 2020, 11:24:38 AM6/9/20
to Thomas Bushnell, BSG, golang-nuts
If you actually read OPs second E-Mail, they did and they didn't find it very clear. With that in mind, your message isn't very nice.
(Also, not to be repetitive or anything: ~80% of the messages in this thread aren't actually concerned with what the complexity class *is*, but  whether it matters)

Thomas Bushnell, BSG

unread,
Jun 9, 2020, 11:43:13 AM6/9/20
to Axel Wagner, golang-nuts
On Tue, Jun 9, 2020 at 11:23 AM Axel Wagner <axel.wa...@googlemail.com> wrote:
If you actually read OPs second E-Mail, they did and they didn't find it very clear. With that in mind, your message isn't very nice.
(Also, not to be repetitive or anything: ~80% of the messages in this thread aren't actually concerned with what the complexity class *is*, but  whether it matters)

The OP stopped participating in this exciting thread a long time ago. I went and read through the code, and it seems pretty clear to me that it's linear. 

Andy Balholm

unread,
Jun 11, 2020, 12:01:37 PM6/11/20
to Thomas Bushnell, BSG, Axel Wagner, golang-nuts
It’s apparently quadratic in some cases. Yesterday fuzzing on github.com/andybalholm/cascadia found an input that triggered a timeout. The time was being spent compiling a 180-KB regex (which I’m attaching to this message). If I concatenate two copies of this file, the combined regex takes at least four times as long to compile.

I plan to investigate further, and see if I can find a way to reproduce the issue that doesn’t look like it was generated by a fuzzer.

Andy

regex.txt

Robert Engels

unread,
Jun 11, 2020, 3:27:21 PM6/11/20
to Andy Balholm, Thomas Bushnell, BSG, Axel Wagner, golang-nuts
Why would you ever allow that regex?

On Jun 11, 2020, at 11:01 AM, Andy Balholm <andyb...@gmail.com> wrote:

It’s apparently quadratic in some cases. Yesterday fuzzing on github.com/andybalholm/cascadia found an input that triggered a timeout. The time was being spent compiling a 180-KB regex (which I’m attaching to this message). If I concatenate two copies of this file, the combined regex takes at least four times as long to compile.

I plan to investigate further, and see if I can find a way to reproduce the issue that doesn’t look like it was generated by a fuzzer.

Andy

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.

Andy Balholm

unread,
Jun 11, 2020, 3:56:44 PM6/11/20
to Robert Engels, Thomas Bushnell, BSG, Axel Wagner, golang-nuts
Obviously any reasonable input validation or length limit would disallow it. 

The time requirement is only quadratic, not exponential, so it takes ridiculously long inputs to cause a problem.

Andy

Andy Balholm

unread,
Jun 11, 2020, 4:36:42 PM6/11/20
to Robert Engels, Thomas Bushnell, BSG, Axel Wagner, golang-nuts
Here is a simpler reproducer: https://play.golang.org/p/82UBmyfyqV-

(Running it on the playground isn’t much use because of the playground’s fake time, though.)

It just repeats the string “D||” (with two vertical bars) to make the regex. The runtime is roughly quadratic in the number of repetitions.

Andy

Ray Pereda

unread,
Jun 11, 2020, 5:12:45 PM6/11/20
to Andy Balholm, Robert Engels, Thomas Bushnell, BSG, Axel Wagner, golang-nuts
On my laptop, I compiled Andy's code from here: https://play.golang.org/p/82UBmyfyqV-
I imported the output into google sheets and plotted it with a bar chart.
The runtime strongly looks linear. 

Still, the constant in the O(n) runtime can possibly be improved with effort profiling the code and pull requests.
There many man-months in fine-tuning the regexp engines in other languages like Perl, Python, C PCRE libraries, and Java.
Any recommendations for Go profilers for this work would be appreciated.

Ray




You received this message because you are subscribed to a topic in the Google Groups "golang-nuts" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/golang-nuts/8M-qVbRKIWA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/DECADAAC-2F98-463C-9202-132D6D0F93F6%40gmail.com.


--
Ray

Andy Balholm

unread,
Jun 11, 2020, 5:18:43 PM6/11/20
to Ray Pereda, Robert Engels, Thomas Bushnell, BSG, Axel Wagner, golang-nuts
It’s more than linear. 10000 repetitions take 73 times the time of 1000 repetitions. 

Andy

Ray Pereda

unread,
Jun 11, 2020, 5:56:22 PM6/11/20
to Andy Balholm, Robert Engels, Thomas Bushnell, BSG, Axel Wagner, golang-nuts
Oops, I screwed up the plot in google sheets.
I was only plotting the regexp length values.
I fixed the plot, and only plotted the y values, runtime.
The x-values are linearly spaced.

There is a curve. It doesn't look linear. I think Andy is right.
I did a quadratic curve fit with https://mycurvefit.com/
y = 7.763291 - 0.002980518*x + 0.000008260165*x^2
with an R^2 value 0.9998

Ray

--
Ray

Axel Wagner

unread,
Jun 11, 2020, 5:57:50 PM6/11/20
to golang-nuts
The Graph is clearly not linear. Another way to see this is to print out the ratio of time taken and i. If the cost is linear, you'd expect that ratio to be constant. When I run this code https://play.golang.org/p/v1JVQkOXnEH on my machine I get

1000 6.821361ms 6821.361
2000 27.229439ms 13614.7195
3000 65.352307ms 21784.102333333332
4000 118.262424ms 29565.606
5000 184.231412ms 36846.2824
6000 265.085287ms 44180.881166666666
7000 341.980378ms 48854.339714285714
8000 468.367349ms 58545.918625
9000 566.985419ms 62998.37988888889
10000 715.327926ms 71532.7926
11000 863.203647ms 78473.05881818182
12000 1.046590109s 87215.84241666667
13000 1.208058338s 92927.56446153847
14000 1.367802614s 97700.18671428571
15000 1.599170348s 106611.35653333334
16000 1.856199076s 116012.44225
17000 2.035878934s 119757.58435294118
18000 2.29658726s 127588.18111111112
19000 2.57228423s 135383.3805263158

You can see the ratio increasing by ~7K on each increment (though there's of course noise). This means the graph is pretty clearly quadratic.

Kurtis Rader

unread,
Jun 11, 2020, 7:48:05 PM6/11/20
to Axel Wagner, golang-nuts
On Thu, Jun 11, 2020 at 2:57 PM 'Axel Wagner' via golang-nuts <golan...@googlegroups.com> wrote:
The Graph is clearly not linear. Another way to see this is to print out the ratio of time taken and i. If the cost is linear, you'd expect that ratio to be constant. When I run this code https://play.golang.org/p/v1JVQkOXnEH on my machine I get...

The poor scalability of the example provided by Andy is due to the empty alternation. Using "D|" repeated as many times as you desire results in linear time and a constant number of memory allocations to compile the sequence. Changing the segment to "D||"  causes two allocations for every "||" instance. While "D||D" is legal it's also somewhat silly since it matches anything. Inserting a repetition between the two "|" (e.g., "D|d*|"), or a fixed length expression that is a different length (e.g., "D|xy|") results in the same behavior. Putting a fixed length expression between the two "|" that is the same length as the expression on the other side results in linear time even if the expression is a char class; e.g., "D|[xy]|". This doesn't surprise me given the research papers linked to earlier in this thread. Whether the pathological alternation case can be optimized to reduce the number of allocations is an open question.

Andy Balholm

unread,
Jun 11, 2020, 8:40:43 PM6/11/20
to Kurtis Rader, Axel Wagner, golang-nuts
Right. That’s why I left the double bar in my example. 

Basically all the time is spent appending to a linked list in regexp/syntax.patchlist.append. Which makes sense, because appending to a linked list when you only have a head pointer is O(n) in the length of the list. So building the whole list that way is O(n^2).

Andy

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.

Ray Pereda

unread,
Jun 11, 2020, 9:09:42 PM6/11/20
to Andy Balholm, Kurtis Rader, Axel Wagner, golang-nuts

I think that regexp/syntax.patchlist.append refers to line 42 here

Using a singly linked list with a pointer to the tail pointer would give constant time append

Ray

You received this message because you are subscribed to a topic in the Google Groups "golang-nuts" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/golang-nuts/8M-qVbRKIWA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/C52F7F0C-1DCA-44B6-ACE8-2F96066D62CD%40gmail.com.

Andy Balholm

unread,
Jun 11, 2020, 10:26:01 PM6/11/20
to Ray Pereda, Kurtis Rader, Axel Wagner, golang-nuts
Reply all
Reply to author
Forward
0 new messages