Description:
RegExp: Add support for table-based character class
code generation. This is performance neutral for
all our tests, but a factor 6 faster for the Unicode
based regexp in the new test (and much more compact
code).
Please review this at https://chromiumcodereview.appspot.com/9854020/
SVN Base: http://v8.googlecode.com/svn/branches/bleeding_edge/
Affected files:
M src/arm/regexp-macro-assembler-arm.h
M src/arm/regexp-macro-assembler-arm.cc
M src/bytecodes-irregexp.h
M src/ia32/regexp-macro-assembler-ia32.h
M src/ia32/regexp-macro-assembler-ia32.cc
M src/interpreter-irregexp.cc
M src/jsregexp.cc
M src/mips/regexp-macro-assembler-mips.h
M src/regexp-macro-assembler-irregexp-inl.h
M src/regexp-macro-assembler-irregexp.h
M src/regexp-macro-assembler-irregexp.cc
M src/regexp-macro-assembler-tracer.h
M src/regexp-macro-assembler-tracer.cc
M src/regexp-macro-assembler.h
M src/x64/regexp-macro-assembler-x64.h
M src/x64/regexp-macro-assembler-x64.cc
A test/mjsunit/unicodelctest-no-optimization.js
A test/mjsunit/unicodelctest.js
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc
File src/jsregexp.cc (right):
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1555
src/jsregexp.cc:1555: Label* even_label,
Odd/even doesn't make sense without the ranges2 array. Consider
something like in_range, out_of_range.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1573
src/jsregexp.cc:1573:
A comment about odd/even/ranges2 would be helpful here.
If I understand correctly there is also an implicit precondition on
values in ranges2:
for all i, j: (ranges2[i] >> kTableSizeBits) == (ranges[j] >>
kTableSizeBits).
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1607
src/jsregexp.cc:1607: j < ranges2->at(i + 1) - base && j < kSize;
Is (ranges2->at(i + 1) - base) == ranges2->at(i + 1) & kMask?
If yes, I would prefer the kMask version because it is consistent with j
initializer.
If not, I don't understand this part.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1626
src/jsregexp.cc:1626: // Gets a series of segments representing a
character class. If the character
Gets a series of segment boundaries ...
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1669
src/jsregexp.cc:1669: int c = ranges2->at(i);
If you extract the body of the loop into a function, then you could
avoid SearchOrder trickery:
1. Search for a single char range.
2. If found, call the function on this char range.
3. Otherwise, call the function on the first range.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1674
src/jsregexp.cc:1674: Label rest;
'fall_through' seems to be a better name than 'rest'.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1677
src/jsregexp.cc:1677: // Cut out the single range by rewriting the
array.
This implicitly creates a new range with endpoints ranges2[i-1] ..
ranges2[i+2]. A comment that it preserves the oddity of the labels would
be helpful.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1684
src/jsregexp.cc:1684: GenerateBranches(masm,
I think start_index + 1 can be larger that end_index - 1, but
GenerateBranches doesn't handle it.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1735
src/jsregexp.cc:1735: if (ranges2->at(new_start_index) > border) break;
I would expect to break on >=
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1749
src/jsregexp.cc:1749: int binary_chop_index = (end_index + start_index)
/ 2;
Why not (end_index + new_start_index) / 2 ?
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1750
src/jsregexp.cc:1750: if (border - 1 > String::kMaxAsciiCharCode &&
border >= String::kMaxAsciiCharCode
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1752
src/jsregexp.cc:1752: last - first > kSize * 4 &&
Are 4s above magic?
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1757
src/jsregexp.cc:1757: while (scan_forward_for_section_border <
end_index) {
This loop construct appears twice. Maybe put it into a separate
function?
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1778
src/jsregexp.cc:1778: }
Can we extract the code above into a separate function that returns a
new_start_index and a border? There are many edge cases, I would be
happy if those would be unit tested.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1886
src/jsregexp.cc:1886: ZoneList<int>* ranges2 =
This fits into one line. Can we find a better name than 'ranges2'? How
about 'range_endpoints', 'range_points', 'range_boundaries',
'alternating_ranges'.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1890
src/jsregexp.cc:1890: if (cc->is_negated()) zeroth_entry_is_failure =
false;
zeroth_entry_is_failure = !cc->is_negated();
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1896
src/jsregexp.cc:1896: ranges2->Add(0);
I wonder why we need ranges2[0] to be zero? GenerateBranch doens't seem
to rely on ranges2[0] == 0 and uses min_char.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1904
src/jsregexp.cc:1904: int ranges2_length = ranges2->length() - 1;
In 'ranges2_length' the 'length' suffix is misleading. I think it is not
a length but an index, so it should be called something like
'end_index'.
https://chromiumcodereview.appspot.com/9854020/diff/11001/src/x64/regexp-macro-assembler-x64.cc
File src/x64/regexp-macro-assembler-x64.cc (right):
https://chromiumcodereview.appspot.com/9854020/diff/11001/src/x64/regexp-macro-assembler-x64.cc#newcode578
src/x64/regexp-macro-assembler-x64.cc:578: Label* on_in_range) {
Maybe special-case if 'from' is zero (or even if 'to' is 0xffff)
https://chromiumcodereview.appspot.com/9854020/diff/11001/src/x64/regexp-macro-assembler-x64.cc#newcode579
src/x64/regexp-macro-assembler-x64.cc:579: __ lea(rax,
Operand(current_character(), -from));
lea->leal (saves a byte).
https://chromiumcodereview.appspot.com/9854020/diff/11001/src/x64/regexp-macro-assembler-x64.cc#newcode602
src/x64/regexp-macro-assembler-x64.cc:602: __ and_(rbx,
Immediate(kTableMask));
I'd consider swapping these:
movq(rbx, Immediate(kTableMask));
and_(rbx, current_character());
(speculating that the constant load could hoist better).
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1555
src/jsregexp.cc:1555: Label* even_label,
On 2012/03/28 13:14:08, ulan wrote:
> Odd/even doesn't make sense without the ranges2 array. Consider
something like
> in_range, out_of_range.
Done.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1573
src/jsregexp.cc:1573:
On 2012/03/28 13:14:08, ulan wrote:
> A comment about odd/even/ranges2 would be helpful here.
> If I understand correctly there is also an implicit precondition on
values in
> ranges2:
> for all i, j: (ranges2[i] >> kTableSizeBits) == (ranges[j] >>
kTableSizeBits).
Done.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1607
src/jsregexp.cc:1607: j < ranges2->at(i + 1) - base && j < kSize;
On 2012/03/28 13:14:08, ulan wrote:
> Is (ranges2->at(i + 1) - base) == ranges2->at(i + 1) & kMask?
> If yes, I would prefer the kMask version because it is consistent with
j
> initializer.
> If not, I don't understand this part.
The answer is yes, and it is changed to use the kMask version.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1626
src/jsregexp.cc:1626: // Gets a series of segments representing a
character class. If the character
On 2012/03/28 13:14:08, ulan wrote:
> Gets a series of segment boundaries ...
Done.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1669
src/jsregexp.cc:1669: int c = ranges2->at(i);
On 2012/03/28 13:14:08, ulan wrote:
> If you extract the body of the loop into a function, then you could
avoid
> SearchOrder trickery:
> 1. Search for a single char range.
> 2. If found, call the function on this char range.
> 3. Otherwise, call the function on the first range.
Done.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1674
src/jsregexp.cc:1674: Label rest;
On 2012/03/28 13:14:08, ulan wrote:
> 'fall_through' seems to be a better name than 'rest'.
Renamed to dummy since it is not actually used.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1677
src/jsregexp.cc:1677: // Cut out the single range by rewriting the
array.
On 2012/03/28 13:14:08, ulan wrote:
> This implicitly creates a new range with endpoints ranges2[i-1] ..
ranges2[i+2].
> A comment that it preserves the oddity of the labels would be helpful.
Done.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1684
src/jsregexp.cc:1684: GenerateBranches(masm,
On 2012/03/28 13:14:08, ulan wrote:
> I think start_index + 1 can be larger that end_index - 1, but
GenerateBranches
> doesn't handle it.
We do not get into this if() unless they are 6 apart.
Assert added.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1735
src/jsregexp.cc:1735: if (ranges2->at(new_start_index) > border) break;
On 2012/03/28 13:14:08, ulan wrote:
> I would expect to break on >=
I think this is correct. If there is a border right on the edge we want
to cut right before the border, but we don't want the new start_index to
be on the border, since that would make a trivial zero-width range at
the start.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1749
src/jsregexp.cc:1749: int binary_chop_index = (end_index + start_index)
/ 2;
On 2012/03/28 13:14:08, ulan wrote:
> Why not (end_index + new_start_index) / 2 ?
If we are doing binary chop then we would want to do it exactly in the
middle, but new_start_index is one page higher than the middle.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1750
src/jsregexp.cc:1750: if (border - 1 > String::kMaxAsciiCharCode &&
On 2012/03/28 13:14:08, ulan wrote:
> border >= String::kMaxAsciiCharCode
This line makes a cut at the ASCII non-ASCII boundary because the ASCII
area is so important for performance. Even a Cyrillic text like the
Hippo article on todays Russian Wikipedia front page is 20% ASCII. In
order to bypass the binary chop branch tree for ASCII this is the
correct test. Comment added, though it does not explicitly mention the
Hippo.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1752
src/jsregexp.cc:1752: last - first > kSize * 4 &&
On 2012/03/28 13:14:08, ulan wrote:
> Are 4s above magic?
Yes. I changed it to 2, which makes little difference to performance,
but makes more sense (there's no difference between binary chop and just
one at a time when there are only two pages left).
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1757
src/jsregexp.cc:1757: while (scan_forward_for_section_border <
end_index) {
On 2012/03/28 13:14:08, ulan wrote:
> This loop construct appears twice. Maybe put it into a separate
function?
The two occurrences are slightly different. A function that has 7 lines
of body and returns two values will have much bigger overhead, and given
that they are not doing exactly the same I can't see a nice way to do
it.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1778
src/jsregexp.cc:1778: }
On 2012/03/28 13:14:08, ulan wrote:
> Can we extract the code above into a separate function that returns a
> new_start_index and a border? There are many edge cases, I would be
happy if
> those would be unit tested.
Extracted, and more tests added. It's fuzzing rather than unit, but it
improves the coverage and caught an off-by-one error.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1886
src/jsregexp.cc:1886: ZoneList<int>* ranges2 =
On 2012/03/28 13:14:08, ulan wrote:
> This fits into one line. Can we find a better name than 'ranges2'? How
about
> 'range_endpoints', 'range_points', 'range_boundaries',
'alternating_ranges'.
range_boundaries, but I changed it to just ranges inside most of the
other functions because anything longer will cause cascading reformats
to stay within 80 columns.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1890
src/jsregexp.cc:1890: if (cc->is_negated()) zeroth_entry_is_failure =
false;
On 2012/03/28 13:14:08, ulan wrote:
> zeroth_entry_is_failure = !cc->is_negated();
Done.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1896
src/jsregexp.cc:1896: ranges2->Add(0);
On 2012/03/28 13:14:08, ulan wrote:
> I wonder why we need ranges2[0] to be zero? GenerateBranch doens't
seem to rely
> on ranges2[0] == 0 and uses min_char.
Done.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newcode1904
src/jsregexp.cc:1904: int ranges2_length = ranges2->length() - 1;
On 2012/03/28 13:14:08, ulan wrote:
> In 'ranges2_length' the 'length' suffix is misleading. I think it is
not a
> length but an index, so it should be called something like
'end_index'.
Done.
http://codereview.chromium.org/9854020/diff/11001/src/x64/regexp-macro-assembler-x64.cc
File src/x64/regexp-macro-assembler-x64.cc (right):
http://codereview.chromium.org/9854020/diff/11001/src/x64/regexp-macro-assembler-x64.cc#newcode578
src/x64/regexp-macro-assembler-x64.cc:578: Label* on_in_range) {
On 2012/03/28 14:02:10, Lasse Reichstein Nielsen wrote:
> Maybe special-case if 'from' is zero (or even if 'to' is 0xffff)
Never happens.
http://codereview.chromium.org/9854020/diff/11001/src/x64/regexp-macro-assembler-x64.cc#newcode579
src/x64/regexp-macro-assembler-x64.cc:579: __ lea(rax,
Operand(current_character(), -from));
On 2012/03/28 14:02:10, Lasse Reichstein Nielsen wrote:
> lea->leal (saves a byte).
I think you mean on x64.
Fixed.
http://codereview.chromium.org/9854020/diff/11001/src/x64/regexp-macro-assembler-x64.cc#newcode602
src/x64/regexp-macro-assembler-x64.cc:602: __ and_(rbx,
Immediate(kTableMask));
On 2012/03/28 14:02:10, Lasse Reichstein Nielsen wrote:
> I'd consider swapping these:
> movq(rbx, Immediate(kTableMask));
> and_(rbx, current_character());
> (speculating that the constant load could hoist better).
OTOH a mov can be eliminated completely by the register renamer.
http://codereview.chromium.org/9854020/diff/11001/src/x64/regexp-macro-assembler-x64.cc#newcode602
src/x64/regexp-macro-assembler-x64.cc:602: __ and_(rbx,
Immediate(kTableMask));
I don't think the register renamer will do that since the value of
current_character() needs to be retained as well. A move will use a
different physical register.
http://codereview.chromium.org/9854020/diff/15001/src/jsregexp.cc
File src/jsregexp.cc (right):
http://codereview.chromium.org/9854020/diff/15001/src/jsregexp.cc#newcode1843
src/jsregexp.cc:1843: ASSERT_LT(new_end_index, end_index);
Also: start_index <= new_end_index && new_start_index <= end_index.
http://codereview.chromium.org/9854020/diff/15001/src/jsregexp.cc#newcode1969
src/jsregexp.cc:1969: if (range.from() == 0) {
I think we don't need to check for range[i] == 0. GenerateBranches
doesn't count the range [minchar ... range_boundaries[0]], so the oddity
of ranges should be correct without this check.
for (int i = 0; i <= last_valid_range; i++) {
CharacterRange& range = ranges->at(i);
range_boundaries->Add(range.from());
range_boundaries->Add(range.to() + 1);
}
http://codereview.chromium.org/9854020/diff/15001/src/mips/regexp-macro-assembler-mips.h
File src/mips/regexp-macro-assembler-mips.h (right):
http://codereview.chromium.org/9854020/diff/15001/src/mips/regexp-macro-assembler-mips.h#newcode84
src/mips/regexp-macro-assembler-mips.h:84: virtual void
CheckCharacterInRange(uc16 from,
These functions do not have implementation. If it is intentional please
ignore this comment.
http://codereview.chromium.org/9854020/diff/15001/test/mjsunit/unicodelctest-no-optimization.js
File test/mjsunit/unicodelctest-no-optimization.js (right):
http://codereview.chromium.org/9854020/diff/15001/test/mjsunit/unicodelctest-no-optimization.js#newcode110
test/mjsunit/unicodelctest-no-optimization.js:110: var re = new
RegExp("[" + cc + "]");
Maybe also test negated character class?
http://codereview.chromium.org/9854020/diff/11001/src/x64/regexp-macro-assembler-x64.cc#newcode602
src/x64/regexp-macro-assembler-x64.cc:602: __ and_(rbx,
Immediate(kTableMask));
On 2012/03/29 15:25:32, Lasse Reichstein wrote:
> I don't think the register renamer will do that since the value of
> current_character() needs to be retained as well. A move will use a
different
> physical register.
My understanding is that after renaming you have gone from 2-register
instructions to 3-register instructions so the mov disappears.
http://codereview.chromium.org/9854020/diff/15001/src/jsregexp.cc
File src/jsregexp.cc (right):
http://codereview.chromium.org/9854020/diff/15001/src/jsregexp.cc#newcode1843
src/jsregexp.cc:1843: ASSERT_LT(new_end_index, end_index);
On 2012/03/29 16:19:17, ulan wrote:
> Also: start_index <= new_end_index && new_start_index <= end_index.
Done.
http://codereview.chromium.org/9854020/diff/15001/src/jsregexp.cc#newcode1969
src/jsregexp.cc:1969: if (range.from() == 0) {
On 2012/03/29 16:19:17, ulan wrote:
> I think we don't need to check for range[i] == 0. GenerateBranches
doesn't count
> the range [minchar ... range_boundaries[0]], so the oddity of ranges
should be
> correct without this check.
> for (int i = 0; i <= last_valid_range; i++) {
> CharacterRange& range = ranges->at(i);
> range_boundaries->Add(range.from());
> range_boundaries->Add(range.to() + 1);
> }
There is an explicit assumption in GenerateBranches that min_char is
strictly less than ranges[start_index]. I made that assumption explicit
with an assert and kept this code, which makes the assumption true.
http://codereview.chromium.org/9854020/diff/15001/src/mips/regexp-macro-assembler-mips.h
File src/mips/regexp-macro-assembler-mips.h (right):
http://codereview.chromium.org/9854020/diff/15001/src/mips/regexp-macro-assembler-mips.h#newcode84
src/mips/regexp-macro-assembler-mips.h:84: virtual void
CheckCharacterInRange(uc16 from,
On 2012/03/29 16:19:17, ulan wrote:
> These functions do not have implementation. If it is intentional
please ignore
> this comment.
It is intentional. I don't have time to fix this right now, so landing
will break MIPS.
http://codereview.chromium.org/9854020/diff/15001/test/mjsunit/unicodelctest-no-optimization.js
File test/mjsunit/unicodelctest-no-optimization.js (right):
http://codereview.chromium.org/9854020/diff/15001/test/mjsunit/unicodelctest-no-optimization.js#newcode110
test/mjsunit/unicodelctest-no-optimization.js:110: var re = new
RegExp("[" + cc + "]");
On 2012/03/29 16:19:17, ulan wrote:
> Maybe also test negated character class?
Done.
src/x64/regexp-macro-assembler-x64.cc:602: __ and_(rbx,
Immediate(kTableMask));
On 2012/03/30 07:46:28, Erik Corry wrote:
> On 2012/03/29 15:25:32, Lasse Reichstein wrote:
> > I don't think the register renamer will do that since the value of
> > current_character() needs to be retained as well. A move will use a
different
> > physical register.
> My understanding is that after renaming you have gone from 2-register
> instructions to 3-register instructions so the mov disappears.
But I tested it and your version is marginally faster.