how to capture words until a specific word


ke...@restlater.com

Nov 15, 2020, 6:13:44 PM
to Treetop Development
I'd like to emulate this regular expression, except matching words rather than characters.

^([^d]+)

If you attempt to match "abcdefg", that regex would capture "abc".

In other words, I want to capture words until a specific word, or any in a list of specific words.

Imagine the input is "Alice was beginning to get very tired of sitting by her sister", and the disallowed word is "to"; then I would like to capture the words "Alice was beginning".

Suggestions?

Martin J. Dürst

Nov 16, 2020, 12:49:02 AM
to treet...@googlegroups.com, ke...@restlater.com
I'd suggest using non-greedy (lazy) matching for the part that you want to find,
followed by a positive lookahead for the word you want to stop at, or $ (end of line/string).

This will match as little as it can (because it's non-greedy), stopping just before
the word you don't want to be included. Please look up "non-greedy match"
and "positive lookahead" to get more information.
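
As a rough sketch of that idea with a plain Ruby regex (not Treetop): a lazy capture followed by a lookahead that accepts either a whole stop word or the end of the line. The helper name `words_before` and its parameters are made up for illustration.

```ruby
# Hypothetical helper: return the text before the first stop word,
# or the whole line if no stop word occurs.
def words_before(line, stop_words)
  stops = Regexp.union(stop_words).source   # e.g. "to|other"
  # \A      anchor at the start of the string
  # (.*?)   lazy capture: take as little as possible
  # (?=...) lookahead: a whole stop word (\b...\b) or end of line ($)
  pattern = /\A\s*(.*?)\s*(?=\b(?:#{stops})\b|$)/
  line[pattern, 1]
end

puts words_before("Alice was beginning to get very tired", ["to"])
# prints "Alice was beginning"
puts words_before("Alice was very tired", ["to"])
# prints "Alice was very tired"
```

The `\b` word boundaries keep "to" from matching inside a longer word such as "tomorrow", and the `$` alternative is what lets the whole line through when no stop word is present.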

Regards, Martin.

ke...@restlater.com

Nov 16, 2020, 1:40:42 PM
to Treetop Development
Thank you Martin.

I left out one of the needs - if none of the specified words is encountered, I want to match the whole line.

Again, matching the behavior of this regex except with words, not characters: ^([^d]+)

When I use positive lookahead, I encounter two problems:

 - a positive lookahead by itself causes the match to fail if the word is not encountered.
 - a positive lookahead combined with another option ( &specified_words / word ) is apparently an infinite loop.

When I encounter the specified word, I want to either stop the match *without failing*, or match the rest of the string but segregate it from the part before the specified word.

In other words, I don't want to just fail.

I'll try to add my grammar and some samples.

Thank you again. Your comments inspired several new attempts.

-Kelly

Clifford Heath

Nov 17, 2020, 4:36:01 AM
to Treetop Development
I assume that your specified_words can match any word. That is the reason for your infinite loop - a lookahead consumes no input, so it can keep succeeding at the same position without ever advancing.

You need something like this:

rule word_char
  [A-Za-z]
end

rule word
  word_char+ !word_char
end

rule space
  [ \t]
end

rule stop_word
  ( 'to'
  / 'other'
  / 'words'
  / 'here'
  ) !word_char
  /
  "\n"    # Stop at end of line
end

rule sentence
  space* ( !stop_word word space* )+
end

Notice how all white-space must be explicitly skipped. PEG parsers have no separate lexer stage, so lexing is included in the parser rules.
The `sentence` rule says skip any space, then look for one or more words (as long as it is not a stop word, including newline), skipping any space after each word.
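
To make the control flow concrete, here is a hand-written Ruby imitation of the `sentence` rule using `StringScanner` from the standard library. It is only a sketch of what the rule does, not code generated by Treetop; the method name `leading_words` and the `STOP_WORDS` constant are invented for illustration.

```ruby
require 'strscan'

STOP_WORDS = %w[to other words here].freeze  # same list as the stop_word rule

# Hand-written imitation of the `sentence` rule (not Treetop output).
# Returns the words before the first stop word, or nil if there are none,
# mirroring the way `( ... )+` fails when it cannot match at least once.
def leading_words(input, stop_words = STOP_WORDS)
  # stop_word: a listed word not followed by a word_char, or a newline
  stop = /(?:#{Regexp.union(stop_words).source})(?![A-Za-z])|\n/
  scanner = StringScanner.new(input)
  words = []
  scanner.skip(/[ \t]*/)                 # space*
  loop do
    break if scanner.match?(stop)        # !stop_word: test without consuming
    word = scanner.scan(/[A-Za-z]+/)     # word: word_char+ !word_char
    break unless word                    # no word here, so the repetition ends
    words << word
    scanner.skip(/[ \t]*/)               # space*
  end
  words.empty? ? nil : words
end

p leading_words("Alice was beginning to get very tired of sitting by her sister")
# prints ["Alice", "was", "beginning"]
p leading_words("tomorrow never comes")
# prints ["tomorrow", "never", "comes"]
```

When no stop word appears, the loop simply runs until there are no more words, so the whole line is captured without the match failing. One difference from the grammar: a regex alternation backtracks, while a PEG ordered choice does not, so the two could disagree if one stop word were a prefix of another; none in this list is.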

Clifford Heath