http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc
File src/jsregexp.cc (right):
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newc...
src/jsregexp.cc:1555: Label* even_label,
On 2012/03/28 13:14:08, ulan wrote:
> Odd/even doesn't make sense without the ranges2 array. Consider
something like
> in_range, out_of_range.
Done.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newc...
src/jsregexp.cc:1573:
On 2012/03/28 13:14:08, ulan wrote:
> A comment about odd/even/ranges2 would be helpful here.
> If I understand correctly there is also an implicit precondition on
values in
> ranges2:
> for all i, j: (ranges2[i] >> kTableSizeBits) == (ranges[j] >>
kTableSizeBits).
Done.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newc...
src/jsregexp.cc:1607: j < ranges2->at(i + 1) - base && j < kSize;
On 2012/03/28 13:14:08, ulan wrote:
> Is (ranges2->at(i + 1) - base) == ranges2->at(i + 1) & kMask?
> If yes, I would prefer the kMask version because it is consistent with
j
> initializer.
> If not, I don't understand this part.
The answer is yes, and it is changed to use the kMask version.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newc...
src/jsregexp.cc:1626: // Gets a series of segments representing a
character class. If the character
On 2012/03/28 13:14:08, ulan wrote:
> Gets a series of segment boundaries ...
Done.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newc...
src/jsregexp.cc:1669: int c = ranges2->at(i);
On 2012/03/28 13:14:08, ulan wrote:
> If you extract the body of the loop into a function, then you could
avoid
> SearchOrder trickery:
> 1. Search for a single char range.
> 2. If found, call the function on this char range.
> 3. Otherwise, call the function on the first range.
Done.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newc...
src/jsregexp.cc:1674: Label rest;
On 2012/03/28 13:14:08, ulan wrote:
> 'fall_through' seems to be a better name than 'rest'.
Renamed to dummy since it is not actually used.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newc...
src/jsregexp.cc:1677: // Cut out the single range by rewriting the
array.
On 2012/03/28 13:14:08, ulan wrote:
> This implicitly creates a new range with endpoints ranges2[i-1] ..
ranges2[i+2].
> A comment that it preserves the oddity of the labels would be helpful.
Done.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newc...
src/jsregexp.cc:1684: GenerateBranches(masm,
On 2012/03/28 13:14:08, ulan wrote:
> I think start_index + 1 can be larger that end_index - 1, but
GenerateBranches
> doesn't handle it.
We do not get into this if() unless they are 6 apart.
Assert added.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newc...
src/jsregexp.cc:1735: if (ranges2->at(new_start_index) > border) break;
On 2012/03/28 13:14:08, ulan wrote:
> I would expect to break on >=
I think this is correct. If there is a border right on the edge we want
to cut right before the border, but we don't want the new start_index to
be on the border, since that would make a trivial zero-width range at
the start.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newc...
src/jsregexp.cc:1749: int binary_chop_index = (end_index + start_index)
/ 2;
On 2012/03/28 13:14:08, ulan wrote:
> Why not (end_index + new_start_index) / 2 ?
If we are doing binary chop then we would want to do it exactly in the
middle, but new_start_index is one page higher than the middle.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newc...
src/jsregexp.cc:1750: if (border - 1 > String::kMaxAsciiCharCode &&
On 2012/03/28 13:14:08, ulan wrote:
> border >= String::kMaxAsciiCharCode
This line makes a cut at the ASCII non-ASCII boundary because the ASCII
area is so important for performance. Even a Cyrillic text like the
Hippo article on todays Russian Wikipedia front page is 20% ASCII. In
order to bypass the binary chop branch tree for ASCII this is the
correct test. Comment added, though it does not explicitly mention the
Hippo.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newc...
src/jsregexp.cc:1752: last - first > kSize * 4 &&
On 2012/03/28 13:14:08, ulan wrote:
> Are 4s above magic?
Yes. I changed it to 2, which makes little difference to performance,
but makes more sense (there's no difference between binary chop and just
one at a time when there are only two pages left).
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newc...
src/jsregexp.cc:1757: while (scan_forward_for_section_border <
end_index) {
On 2012/03/28 13:14:08, ulan wrote:
> This loop construct appears twice. Maybe put it into a separate
function?
The two occurrences are slightly different. A function that has 7 lines
of body and returns two values will have much bigger overhead, and given
that they are not doing exactly the same I can't see a nice way to do
it.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newc...
src/jsregexp.cc:1778: }
On 2012/03/28 13:14:08, ulan wrote:
> Can we extract the code above into a separate function that returns a
> new_start_index and a border? There are many edge cases, I would be
happy if
> those would be unit tested.
Extracted, and more tests added. It's fuzzing rather than unit, but it
improves the coverage and caught an off-by-one error.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newc...
src/jsregexp.cc:1886: ZoneList<int>* ranges2 =
On 2012/03/28 13:14:08, ulan wrote:
> This fits into one line. Can we find a better name than 'ranges2'? How
about
> 'range_endpoints', 'range_points', 'range_boundaries',
'alternating_ranges'.
range_boundaries, but I changed it to just ranges inside most of the
other functions because anything longer will cause cascading reformats
to stay within 80 columns.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newc...
src/jsregexp.cc:1890: if (cc->is_negated()) zeroth_entry_is_failure =
false;
On 2012/03/28 13:14:08, ulan wrote:
> zeroth_entry_is_failure = !cc->is_negated();
Done.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newc...
src/jsregexp.cc:1896: ranges2->Add(0);
On 2012/03/28 13:14:08, ulan wrote:
> I wonder why we need ranges2[0] to be zero? GenerateBranch doens't
seem to rely
> on ranges2[0] == 0 and uses min_char.
Done.
http://codereview.chromium.org/9854020/diff/6001/src/jsregexp.cc#newc...
src/jsregexp.cc:1904: int ranges2_length = ranges2->length() - 1;
On 2012/03/28 13:14:08, ulan wrote:
> In 'ranges2_length' the 'length' suffix is misleading. I think it is
not a
> length but an index, so it should be called something like
'end_index'.
Done.
http://codereview.chromium.org/9854020/diff/11001/src/x64/regexp-macr...
File src/x64/regexp-macro-assembler-x64.cc (right):
http://codereview.chromium.org/9854020/diff/11001/src/x64/regexp-macr...
src/x64/regexp-macro-assembler-x64.cc:578: Label* on_in_range) {
On 2012/03/28 14:02:10, Lasse Reichstein Nielsen wrote:
> Maybe special-case if 'from' is zero (or even if 'to' is 0xffff)
Never happens.
http://codereview.chromium.org/9854020/diff/11001/src/x64/regexp-macr...
src/x64/regexp-macro-assembler-x64.cc:579: __ lea(rax,
Operand(current_character(), -from));
On 2012/03/28 14:02:10, Lasse Reichstein Nielsen wrote:
> lea->leal (saves a byte).
I think you mean on x64.
Fixed.
http://codereview.chromium.org/9854020/diff/11001/src/x64/regexp-macr...
src/x64/regexp-macro-assembler-x64.cc:602: __ and_(rbx,
Immediate(kTableMask));
On 2012/03/28 14:02:10, Lasse Reichstein Nielsen wrote:
> I'd consider swapping these:
> movq(rbx, Immediate(kTableMask));
> and_(rbx, current_character());
> (speculating that the constant load could hoist better).
OTOH a mov can be eliminated completely by the register renamer.
http://codereview.chromium.org/9854020/