Attempt to make \b\w+ faster. Slight performance increase on, e.g., string unpacking. (issue507051)

1 view
Skip to first unread message

l...@chromium.org

unread,
Dec 18, 2009, 9:26:28 AM12/18/09
to erik....@gmail.com, v8-...@googlegroups.com
Reviewers: Erik Corry,

Message:
Early warning review. I'll be removing some of the changes again.


http://codereview.chromium.org/507051/diff/1/4
File src/ast.h (right):

http://codereview.chromium.org/507051/diff/1/4#newcode1623
src/ast.h:1623: enum Type { GREEDY, NON_GREEDY, POSSESSIVE };
In preparation for identifying potentially possessive repetitions.

http://codereview.chromium.org/507051/diff/1/10
File src/jsregexp.h (right):

http://codereview.chromium.org/507051/diff/1/10#newcode1258
src/jsregexp.h:1258: int preloaded_characters_offset_;
Remembers not only the count of preloaded characters, but also its
cp_offset, so we don't have to clear it when advancing.
But I don't think it does anything. I'll remove it again. Ignore this
part of the change.

Description:
Attempt to make \b\w+ faster. Slight performance increase on, e.g., string
unpacking.

Please review this at http://codereview.chromium.org/507051

Affected files:
M src/arm/regexp-macro-assembler-arm.h
M src/arm/regexp-macro-assembler-arm.cc
M src/ast.h
M src/flag-definitions.h
M src/heap.cc
M src/ia32/regexp-macro-assembler-ia32.h
M src/ia32/regexp-macro-assembler-ia32.cc
M src/jsregexp.h
M src/jsregexp.cc
M src/parser.cc
M src/regexp-macro-assembler-tracer.h
M src/regexp-macro-assembler-tracer.cc
M src/regexp-macro-assembler.h
M src/x64/regexp-macro-assembler-x64.h
M src/x64/regexp-macro-assembler-x64.cc
M test/cctest/test-regexp.cc


l...@chromium.org

unread,
Dec 18, 2009, 10:51:49 AM12/18/09
to erik....@gmail.com, v8-...@googlegroups.com

erik....@gmail.com

unread,
Jan 4, 2010, 7:56:40 AM1/4/10
to l...@chromium.org, v8-...@googlegroups.com
I don't see the test that uses the posessive syntax.

LGTM


http://codereview.chromium.org/507051/diff/5001/6001
File src/arm/regexp-macro-assembler-arm.cc (right):

http://codereview.chromium.org/507051/diff/5001/6001#newcode473
src/arm/regexp-macro-assembler-arm.cc:473: if (!preloaded) {
Seems like this repeated code could be moved into an "if (type != '*')
{" clause above.

http://codereview.chromium.org/507051/diff/5001/6006
File src/ia32/regexp-macro-assembler-ia32.cc (right):

http://codereview.chromium.org/507051/diff/5001/6006#newcode485
src/ia32/regexp-macro-assembler-ia32.cc:485: if (!preloaded) {
And here.

http://codereview.chromium.org/507051/diff/5001/6006#newcode647
src/ia32/regexp-macro-assembler-ia32.cc:647: __
sub(Operand(current_character()), Immediate(0x0b));
I don't see how you can get away with destroying the current character
register like this?

http://codereview.chromium.org/507051/diff/5001/6008
File src/jsregexp.cc (right):

http://codereview.chromium.org/507051/diff/5001/6008#newcode1630
src/jsregexp.cc:1630: int NegativeLookaheadChoiceNode::EatsAtLeast(int
still_to_find,
Formatting still not right.

http://codereview.chromium.org/507051/diff/5001/6008#newcode2056
src/jsregexp.cc:2056: true,
Comment after boolean arguments.

http://codereview.chromium.org/507051/diff/5001/6008#newcode2134
src/jsregexp.cc:2134: assembler->GoTo(on_non_word);
You only need this line if expect_word_character.

http://codereview.chromium.org/507051/diff/5001/6008#newcode4101
src/jsregexp.cc:4101: SetRelation CharacterRange::WordCharacterRelation(
Can't you implement this using Merge using a fraction of the lines?

http://codereview.chromium.org/507051/diff/5001/6008#newcode4416
src/jsregexp.cc:4416: kInsideFirst = 1,
Should probably be unified with SetRelation::kInFirst.

http://codereview.chromium.org/507051/diff/5001/6008#newcode4420
src/jsregexp.cc:4420:
Missing newline?

http://codereview.chromium.org/507051/diff/5001/6008#newcode4421
src/jsregexp.cc:4421: // Utility function from CharacterRange::Merge.
Adds a range at the end of
from -> for

http://codereview.chromium.org/507051/diff/5001/6008#newcode4485
src/jsregexp.cc:4485: kInsideFirst = 1,
Here it is again.

http://codereview.chromium.org/507051/diff/5001/6008#newcode5001
src/jsregexp.cc:5001: case AFTER_NONWORD_CHARACTER:
Can't we do better for these last two?

http://codereview.chromium.org/507051/diff/5001/6008#newcode5040
src/jsregexp.cc:5040: if (text.type == TextElement::ATOM) {
What about case independent?

http://codereview.chromium.org/507051/diff/5001/6009
File src/jsregexp.h (right):

http://codereview.chromium.org/507051/diff/5001/6009#newcode839
src/jsregexp.h:839: // Non-default types. Used for modifying a boundary
node if its surrounding
Non-default should be non-standard?

http://codereview.chromium.org/507051/diff/5001/6009#newcode841
src/jsregexp.h:841: AFTER_NONWORD_CHARACTER,
The name seems misleading. We can be after a non-word without being at
a boundary. How about NONWORD_WORD_BOUNDARY and WORD_NONWORD_BOUNDARY?

http://codereview.chromium.org/507051/diff/5001/6014
File src/x64/regexp-macro-assembler-x64.cc (right):

http://codereview.chromium.org/507051/diff/5001/6014#newcode506
src/x64/regexp-macro-assembler-x64.cc:506: if (!preloaded) {
And here.

http://codereview.chromium.org/507051

l...@chromium.org

unread,
Jan 6, 2010, 12:00:49 PM1/6/10
to erik....@gmail.com, v8-...@googlegroups.com
Addressed comments.
Added support for word/non-word matching on ARM. Please check these.
Still needs to add test for possessive syntax.


http://codereview.chromium.org/507051/diff/5001/6001
File src/arm/regexp-macro-assembler-arm.cc (right):

http://codereview.chromium.org/507051/diff/5001/6001#newcode473
src/arm/regexp-macro-assembler-arm.cc:473: if (!preloaded) {

Sadly not. We should only preload if we support the character class.
That would mean first switching on the type to see if we should preload,
and then switching again to do the actual matching - which means
duplicating the selection logic and keeping the two duplicates in sync.

Ok, I rewrote it completely, moving the loading out of the check
function and relying on the caller to load the character. Much better
that way, and more consistent with other check functions.

http://codereview.chromium.org/507051/diff/5001/6006
File src/ia32/regexp-macro-assembler-ia32.cc (right):

http://codereview.chromium.org/507051/diff/5001/6006#newcode647
src/ia32/regexp-macro-assembler-ia32.cc:647: __
sub(Operand(current_character()), Immediate(0x0b));
It might be a coincidence, but we never use the loaded character after
checking for a character class.
But it's unnecessary to destroy it. I'll change the sub to a lea.

http://codereview.chromium.org/507051/diff/5001/6008#newcode1630
src/jsregexp.cc:1630: int NegativeLookaheadChoiceNode::EatsAtLeast(int
still_to_find,

Fixed.

Added.

http://codereview.chromium.org/507051/diff/5001/6008#newcode2134
src/jsregexp.cc:2134: assembler->GoTo(on_non_word);

Hence the "if(expect_word_character) {" above :)

http://codereview.chromium.org/507051/diff/5001/6008#newcode4101
src/jsregexp.cc:4101: SetRelation CharacterRange::WordCharacterRelation(

I could, but Merge needs to actually build the three sets (first only,
intersection, second only), while this function only determines whether
there would be something in the sets.

If something should be reused, it would probably be by abstracting the
actions of Merge (using either templates or an accumulator object with
virtual methods) so that it didn't need to build the result, but could
take any action for each range it outputs. I don't think it's much
simpler, though.

The meanings are different. This enum represents the location of a
single element (in the four areas of a Venn diagram). SetRelation
represents how two sets relate, i.e., which areas of the Venn diagram
for the two sets contain elements.

If there is any relation, it would be that SetRelation::kInFirst = 1 <<
kInsideFirst (etc).
I'll move this enum to the header file and implement the SetRelation
enum in terms of this enum.

http://codereview.chromium.org/507051/diff/5001/6008#newcode4420
src/jsregexp.cc:4420:
added

http://codereview.chromium.org/507051/diff/5001/6008#newcode4421
src/jsregexp.cc:4421: // Utility function from CharacterRange::Merge.
Adds a range at the end of

Fixed

One of the ones in this file is superflous. I'll retain the on at top
level only and move it to the .h file.

http://codereview.chromium.org/507051/diff/5001/6008#newcode5001
src/jsregexp.cc:5001: case AFTER_NONWORD_CHARACTER:

Not precisely enough. If the ComputeFirstCharacterSet result is used for
anything else than detecting boundaries (and I want to use it for
detecting possessive loops too), we need to know the exact first
character set, not just that it's a subset of word characters or subset
of non-word characters.

Also, the meaning of AFTER_NONWORD_CHARACTER does not assume anything of
the next character - even though we can currently only introduce it by
detecting a following word character after a boundaray assertion.

http://codereview.chromium.org/507051/diff/5001/6008#newcode5040
src/jsregexp.cc:5040: if (text.type == TextElement::ATOM) {

If I understand it correctly, we convert ranges and atoms to include
their case-equivalent characters when we create the Node from the AST.
The case independence flag shouldn't matter after this.

http://codereview.chromium.org/507051/diff/5001/6009#newcode839
src/jsregexp.h:839: // Non-default types. Used for modifying a boundary
node if its surrounding

Possibly. I didn't really like either.
It's a type that doesn't correspond to one directly expressible in the
syntax. Perhaps I shouldn't try to find a short way to say that. I'll
elaborate.

http://codereview.chromium.org/507051/diff/5001/6009#newcode841
src/jsregexp.h:841: AFTER_NONWORD_CHARACTER,

The name is correct for what it's used for. We actually reduce the
AT_BOUNDARY to AFTER_NONWORD_CHARACTER and only check that the previous
character is a non-word character. We have statically determined that
the next character must be a word character, otherwise we will fail
anyway.

If it was named NONWORD_WORD_BOUNDARY we would expect to need to check
both the previous and next characters.

http://codereview.chromium.org/507051

Reply all
Reply to author
Forward
0 new messages