It's possible to count the number of captures in a given regexp:
def count_captures regexp
Regexp.union([regexp, //]).match('').captures.length
end
p count_captures /(ax)is/ # => 1
p count_captures /((ax)is)/ # => 2
p count_captures /(?:(ax)is)/ # => 1
With that information, you can calculate offsets. Calculating offsets
would still result in a linear scan (since you'd have to keep a list of
the offsets, then find the one that matched), and since we can use named
captures on 1.9, I'm not sure an offset based solution would be great.
> This is the proof-of-concept: https://gist.github.com/1798985.
Seems good, but you don't need a *args to Regexp.union. :-)
> I believe this is correct *as long as* the user regexps have no
> backreferences, because if you have a = /(..)\1/ and b = /(.)\1/ then "xx"
> matches b, but does not match the union a|b because the backreference \1 in
> b now refers to the group in a.
>
> OK, so this is the question: do you guys use backreferences in custom
> inflections? If you didn't we could consider ruling them out for 4.0 to be
> able to implement this.
If you can rule out backreferences, it seems possible you could build a
state machine that would perform in linear time to the string length.
It seems though, with your algorithm, as the string gets longer, the
original implementation is faster:
https://gist.github.com/1805359
Do we know what kind of data we're dealing with? Somewhere between
strings of length 10 and strings of length 100, the original code
outperforms this version. If we're only dealing with shorter strings,
maybe this newer code would be better.
One last comment, the line:
name = md.names.detect {|n| md[n]}
The length of `names` increases as the number of rules is added to the
system. Which means in the case of a match, this algorithm will still
be O(n), where n is the number of rules defined. We can see this by
testing strings that match first and strings that match last:
https://gist.github.com/1805536
I'm using AS 3.2.1, so "zombies" will be best case, and a bunch of x's
will be worst.
If we're willing to limit the type of expressions provided to the
inflector, it's possible to construct something that matches the string
regardless of the number of rules (entries in "inflections.rb"), and
proportional to the string length.
MEOW MEOW MEOW MEOW
<3 <3 <3 <3 <3 <3
--
Aaron Patterson
http://tenderlovemaking.com/
It's possible to count the number of captures in a given regexp:
def count_captures regexp
Regexp.union([regexp, //]).match('').captures.length
end
With that information, you can calculate offsets.
Seems good, but you don't need a *args to Regexp.union. :-)
If you can rule out backreferences, it seems possible you could build a
state machine that would perform in linear time to the string length.
It seems though, with your algorithm, as the string gets longer, the
original implementation is faster:
https://gist.github.com/1805359
Do we know what kind of data we're dealing with? Somewhere between
strings of length 10 and strings of length 100, the original code
outperforms this version. If we're only dealing with shorter strings,
maybe this newer code would be better.
name = md.names.detect {|n| md[n]}
The length of `names` increases as the number of rules is added to the
system. Which means in the case of a match, this algorithm will still
be O(n)
where n is the number of rules defined. We can see this by
testing strings that match first and strings that match last:
https://gist.github.com/1805536
I'm using AS 3.2.1, so "zombies" will be best case, and a bunch of x's
will be worst.
Ya, but if we're going to put arbitrary restrictions on the type of
matches people can do (i.e. no backreferences), you may as well use an
engine that can execute in O(n) time (n = string length). Otherwise,
what's the point?
> It seems though, with your algorithm, as the string gets longer, the
> > original implementation is faster:
> >
> > https://gist.github.com/1805359
> >
> > Do we know what kind of data we're dealing with? Somewhere between
> > strings of length 10 and strings of length 100, the original code
> > outperforms this version. If we're only dealing with shorter strings,
> > maybe this newer code would be better.
> >
>
> Yes, I think this approach is worthwhile because in practice I believe
> Rails and Rails applications singularize and pluralize English words.
>
>
> name = md.names.detect {|n| md[n]}
> >
> > The length of `names` increases as the number of rules is added to the
> > system. Which means in the case of a match, this algorithm will still
> > be O(n)
>
>
> Yes, yes, but seems unavoidable an O(n) that still outperforms the original
> algorithm with the sizes I've tested (I don't mean it does not with lager
> sets of inflections, only I measured the typical size which is what most
> people use).
How did you determine the length of the strings that most people use? I
think seeing this data would be useful in speeding up this code.
Ya, but if we're going to put arbitrary restrictions on the type of
matches people can do (i.e. no backreferences), you may as well use an
engine that can execute in O(n) time (n = string length). Otherwise,
what's the point?
How did you determine the length of the strings that most people use? I
think seeing this data would be useful in speeding up this code.
Nowadays long strings get a performance boost. That does not make sense statistically speaking, English words should be the fast ones.
Interesting. Have you investigated expanding the regular expressions
and doing hash based replacement via gsub!? Since we can know the
replacements in advance, it's possible to compile a hash and use it for
the replacement. If the hash misses, we can fall back to a linear scan.
Here's a quick implementation as an example. We could probably optimize
more of the expressions:
https://gist.github.com/1806575
and doing hash based replacement via gsub!? Since we can know theInteresting. Have you investigated expanding the regular expressions
replacements in advance, it's possible to compile a hash and use it for
the replacement. If the hash misses, we can fall back to a linear scan.
Here's a quick implementation as an example. We could probably optimize
more of the expressions:
https://gist.github.com/1806575
I have a patch with the new approach implemented and have compared
runtimes: no difference, there's no speedup.
The 6x speedup in the proof of concept happens because the script is
not doing this apparently innocent check over the default 9
uncountable words:
inflections.uncountables.any? { |inflection| result =~
/\b#{inflection}\Z/i }
which is present in ActiveSupport::Inflections#apply_inflections.
In that method, if you return the inflection right way instead of
doing this check, you see the 6x.
--
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group.
To post to this group, send email to rubyonra...@googlegroups.com.
To unsubscribe from this group, send email to rubyonrails-co...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/rubyonrails-core?hl=en.
https://github.com/rails/rails/commit/d3071db1200e90c0533f75b967c4afb519656d00
which exploits the fact that uncountables are not regexps, but words.
It is not entirely backwards compatible because the words are not
required today to be lowercase (though in practice they normally are).
This is interesting. I tried a similar optimization parsing RSS feeds,
replacing the list of strings in a when clause with a regex. It was 35%
faster. Then I combined all the regex with a group with a unique character in
it to determine which sub-pattern matched. No faster than the list of string
approach. Interesting. There is something I don't understand about regex
performance.
Jeffrey