std::regex and multiline flag

William Fisher

Jul 21, 2013, 5:14:25 PM
to std-dis...@isocpp.org
I've noticed that std::regex in C++11 doesn't support a `multiline` mode, as described in ECMA-262. With multiline mode, the meta-characters `^` and `$` match the beginning of line and end of line, in addition to the beginning and end of the string.

Was there a reason that multiline was omitted from the standard?  The ECMA-262 spec has an unambiguous definition of `LineTerminator` which treats a CR-LF as an empty line.

The beginning-of-line position can be matched with (?:^|\n), but this is clumsy because the \n is consumed as part of the match instead of being matched as a zero-width position.
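
To illustrate why the workaround above is clumsy, here is a minimal sketch. The helper name `first_words` and the sample pattern are only assumptions for illustration. Because the whole match may start with the consumed \n, the text of interest has to be pulled out through a capture group.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the first word of every line in `text`.
std::vector<std::string> first_words(const std::string& text) {
    // (?:^|\n) stands in for a multiline ^. The word is capture group 1,
    // because match[0] may also contain the consumed '\n'.
    static const std::regex re(R"((?:^|\n)(\w+))");
    std::vector<std::string> words;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        words.push_back((*it)[1].str());
    }
    return words;
}
```

Calling `first_words("foo 1\nbar 2\nbaz 3")` should yield foo, bar and baz; a real multiline ^ would not need the extra group.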

Thanks,
Bill

jonatha...@gmail.com

Jul 22, 2013, 8:00:45 AM
to std-dis...@isocpp.org


On Sunday, July 21, 2013 10:14:25 PM UTC+1, William Fisher wrote:
> I've noticed that std::regex in C++11 doesn't support a `multiline` mode, as described in ECMA-262. With multiline mode, the meta-characters `^` and `$` match the beginning of line and end of line, in addition to the beginning and end of the string.

Can match_not_bol and match_not_eol be used to do something approximately similar, so ^ and $ match the beginning and end of lines, not the string?

 
> Was there a reason that multiline was omitted from the standard?

The answer is almost certainly "because it wasn't proposed" and the reason it wasn't proposed is almost certainly because it isn't in Boost.Regex, or wasn't in Boost.Regex when that was proposed for TR1.

Nicol Bolas

Jul 22, 2013, 12:52:07 PM
to std-dis...@isocpp.org


On Sunday, July 21, 2013 2:14:25 PM UTC-7, William Fisher wrote:
> I've noticed that std::regex in C++11 doesn't support a `multiline` mode, as described in ECMA-262. With multiline mode, the meta-characters `^` and `$` match the beginning of line and end of line, in addition to the beginning and end of the string.
>
> Was there a reason that multiline was omitted from the standard? The ECMA-262 spec has an unambiguous definition of `LineTerminator` which treats a CR-LF as an empty line.

Does this mean that any file saved in the standard DOS text file format (and loaded without conversion, perhaps for performance reasons) will be considered to have a bunch of empty lines in it? Because that sounds like a good reason not to do this.

William Fisher

Jul 22, 2013, 3:08:35 PM
to std-dis...@isocpp.org, jonatha...@gmail.com
On Monday, July 22, 2013 5:00:45 AM UTC-7, jonatha...@gmail.com wrote:


> On Sunday, July 21, 2013 10:14:25 PM UTC+1, William Fisher wrote:
>> I've noticed that std::regex in C++11 doesn't support a `multiline` mode, as described in ECMA-262. With multiline mode, the meta-characters `^` and `$` match the beginning of line and end of line, in addition to the beginning and end of the string.
>
> Can match_not_bol and match_not_eol be used to do something approximately similar, so ^ and $ match the beginning and end of lines, not the string?

Effectively, match_not_bol disables ^ so it will never match anything. match_not_eol does the same for $. The use of the `line` terminology drew my attention here too. It got me thinking that multiline mode may have been proposed and then removed, but my search of mailing list archives turned up nothing.
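
A small sketch of that behaviour, for anyone who wants to check it. The helper name `found` is made up for this example; the flags are the standard `std::regex_constants::match_not_bol` and `match_not_eol`. The inputs are single-line on purpose, so the result does not depend on how a given library treats embedded newlines.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: does `pattern` match anywhere in `text` under `flags`?
bool found(const std::string& text, const std::string& pattern,
           std::regex_constants::match_flag_type flags =
               std::regex_constants::match_default) {
    return std::regex_search(text, std::regex(pattern), flags);
}
```

With this, `found("abc", "^a")` succeeds, while `found("abc", "^a", std::regex_constants::match_not_bol)` fails because ^ is no longer allowed to match at the start of the input. The same holds for "c$" with match_not_eol.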
 
>> Was there a reason that multiline was omitted from the standard?
>
> The answer is almost certainly "because it wasn't proposed" and the reason it wasn't proposed is almost certainly because it isn't in Boost.Regex, or wasn't in Boost.Regex when that was proposed for TR1.


Thanks for the answer,
-Bill

William Fisher

Jul 22, 2013, 3:22:04 PM
to std-dis...@isocpp.org
Yes, in ECMAScript multiline mode, ^ and $ both match between the CR and the LF of a DOS line ending, so every CR-LF in a DOS text file produces an empty line.

> 'a\r\nb\n'.match(/^.*$/gm)
[ 'a', '', 'b', '' ]

The ECMAScript LineTerminators are { \n, \r, \u2028, \u2029 }.
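
For comparison, a rough C++ sketch that reproduces the line boundaries seen in the JavaScript result above without using std::regex at all. The function names are assumptions for this example, and std::u32string is used only so that \u2028 and \u2029 are single code units. Each terminator is treated on its own, which is why CR-LF yields an empty line.

```cpp
#include <string>
#include <vector>

// True for the four ECMAScript LineTerminator code points.
bool is_line_terminator(char32_t c) {
    return c == U'\n' || c == U'\r' || c == U'\u2028' || c == U'\u2029';
}

// Hypothetical helper: splits `text` at every line terminator, treating
// each terminator separately (so CR-LF leaves an empty line in between).
std::vector<std::u32string> split_lines(const std::u32string& text) {
    std::vector<std::u32string> lines(1);  // start with one empty line
    for (char32_t c : text) {
        if (is_line_terminator(c)) {
            lines.emplace_back();          // terminator: begin a new line
        } else {
            lines.back().push_back(c);     // ordinary character
        }
    }
    return lines;
}
```

For the input U"a\r\nb\n" this gives the four lines a, empty, b, empty, the same as the /^.*$/gm result.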

-Bill