
simple regex pattern sought


Roedy Green

May 25, 2012, 5:45:59 PM
I often have to search for things of the form

"xxxxx"
or
'xxxxx'

where xxx is anything not " or '. It might be Russian or English or
any other language.

What is the cleanest way to do that?
--
Roedy Green Canadian Mind Products
http://mindprod.com
I would be quite surprised if the NSA (National Security Agency)
did not have a computer program to scan bits of shredded
documents and electronically put them back together like a giant
jigsaw puzzle. This suggests you cannot just shred, you must also burn.
.

markspace

May 25, 2012, 5:55:01 PM
On 5/25/2012 2:45 PM, Roedy Green wrote:
> I often have to search for things of the form
>
> "xxxxx"
> or
> 'xxxxx'
>
> where xxx is anything not " or '. It might be Russian or English or
> any other language.
>
> What is the cleanest way to do that?


Would this work?

'[^']+'|"[^"]+"
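A minimal sketch of how that alternation could be tried from Java. The class and method names are made up for illustration, and the sample sentence is an assumption, not something from Roedy's data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedAlternationDemo {
    // The suggested pattern as a Java string literal: only the double quotes need a backslash.
    static final Pattern QUOTED = Pattern.compile("'[^']+'|\"[^\"]+\"");

    /** Collects every non-empty quoted run found in text, delimiters included. */
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // The escaped letters in the second quoted string are Cyrillic.
        for (String hit : findAll("He said \"that's it\" and '\u041c\u0438\u0440'.")) {
            System.out.println(hit);
        }
    }
}
```

Each branch only excludes its own delimiter, so an apostrophe inside double quotes is accepted, and nothing in the pattern is specific to ASCII.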

Lew

May 25, 2012, 5:55:07 PM
Roedy Green wrote:
> I often have to search for things of the form
>
> "xxxxx"
> or
> 'xxxxx'
>
> where xxx is anything not " or '. It might be Russian or English or
> any other language.
>
> What is the cleanest way to do that?

Using a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.

--
Lew


markspace

May 25, 2012, 6:04:29 PM
On 5/25/2012 2:55 PM, Lew wrote:

>
> Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.
>

This would match "John's restaurant" as "John'.

The first quote matches ", John does not contain either ' or " as
specified, and the last character class matches the '. Not I think what
is wanted.
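The mismatch can be reproduced with a few lines of Java. This is only a sketch; the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MismatchedQuoteDemo {
    // Either quote may open the match and either quote may close it.
    static final Pattern EITHER_QUOTE = Pattern.compile("[\"'][^\"']+[\"']");

    /** Returns the first match in text, or null if there is none. */
    static String firstMatch(String text) {
        Matcher m = EITHER_QUOTE.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The opening " pairs with the apostrophe, so this prints "John'
        System.out.println(firstMatch("\"John's restaurant\""));
    }
}
```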


Lew

May 25, 2012, 6:03:29 PM
"([\"'])[^\"']+\\1"

That way you match the opening quote.

(The extra backslashes are to escape the characters in the string. Regex sees one fewer per each set.)
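
A small sketch of the back-reference version in use (class and method names are hypothetical). Group 1 captures whichever quote opened the string and \1 demands the same character at the end. Note that the middle class still excludes both quote characters, so a string with the other quote inside it is not matched at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackReferenceDemo {
    // ([\"']) captures the opening quote; \\1 requires the same quote to close.
    static final Pattern SAME_QUOTE = Pattern.compile("([\"'])[^\"']+\\1");

    /** Collects every match found in text, delimiters included. */
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = SAME_QUOTE.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("say \"hi\" or 'yo'"));      // both quoted strings
        System.out.println(findAll("\"John's restaurant\""));   // empty list
    }
}
```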

--
Lew

Robert Klemme

May 25, 2012, 6:12:34 PM
That does not match quoting properly. Better do something like

"([\"'])[^\"']*\\1"

Still I prefer

"\"[^\"]*\"|'[^']*'"

Because it allows for quotes of the other type inside quotes.

With proper escaping (using \ as escape char, any other works, too) this
becomes

"\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"

Kind regards

robert


package rx;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Quotes {

    private static final Pattern Q1 = Pattern.compile("([\"'])[^\"']*\\1");
    private static final Pattern Q2 = Pattern.compile("\"[^\"]*\"|'[^']*'");
    private static final Pattern Q3 =
        Pattern.compile("\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'");

    public static void main(String[] args) {
        System.out.println(Q1);
        for (final Matcher m = Q1.matcher("'a' \"b\" 'c'"); m.find();) {
            System.out.println(m.group());
        }

        System.out.println(Q2);
        for (final Matcher m = Q2.matcher("'a' \"b\" 'c'"); m.find();) {
            System.out.println(m.group());
        }

        System.out.println(Q3);
        for (final Matcher m = Q3.matcher("'a' \"\\\"b\" 'c'"); m.find();) {
            System.out.println(m.group());
        }
    }

}


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

markspace

May 25, 2012, 9:43:59 PM
On 5/25/2012 3:12 PM, Robert Klemme wrote:

> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"


This looks overly baroque to me. You don't need to escape \ single
quotes ' in a Java string, and I don't think you need to in a regex
either (although I didn't check that). I'm also not seeing the need for
the parenthesis around the character classes [] (but again, without
having tried it, I could be wrong). And the dot . inside the
parenthesis just looks wrong.

Great post overall though.

Roedy Green

May 26, 2012, 9:19:37 AM
On Sat, 26 May 2012 00:12:34 +0200, Robert Klemme
<short...@googlemail.com> wrote, quoted or indirectly quoted
someone who said :

>On 25.05.2012 23:55, Lew wrote:
>> Roedy Green wrote:
>>> I often have to search for things of the form
>>>
>>> "xxxxx"
>>> or
>>> 'xxxxx'
>>>
>>> where xxx is anything not " or '. It might be Russian or English or
>>> any other language.

/*
 * [TestRegexFindQuotedString.java]
 *
 * Summary: Finding a quoted String with a regex.
 *
 * Copyright: (c) 2012 Roedy Green, Canadian Mind Products, http://mindprod.com
 *
 * Licence: This software may be copied and used freely for any purpose but military.
 *          http://mindprod.com/contact/nonmil.html
 *
 * Requires: JDK 1.7+
 *
 * Created with: JetBrains IntelliJ IDEA IDE http://www.jetbrains.com/idea/
 *
 * Version History:
 *  1.0 2012-05-25 initial release
 */
package com.mindprod.example;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import static java.lang.System.out;

/**
 * Finding a quoted String with a regex.
 *
 * @author Roedy Green, Canadian Mind Products
 * @version 1.0 2012-05-25 initial release
 * @since 2012-05-25
 */
public class TestRegexFindQuotedString
{
    // ------------------------------ CONSTANTS ------------------------------

    private static final String lookIn = "George said \"that's the ticket\"." +
                                         " Jeb replied '\"ticket?\" what ticket'." +
                                         " \"How na\u00efve!\"." +
                                         " empty: \"\"" +
                                         " 'unbalanced\"";

    // -------------------------- STATIC METHODS --------------------------

    /**
     * exercise that pattern to see what it can find
     */
    static void exercisePattern( Pattern pattern )
    {
        out.println();
        out.println( "Pattern: " + pattern.toString() );
        // Matchers are used both for matching and finding.
        final Matcher m = pattern.matcher( lookIn );
        while ( m.find() )
        {
            out.println( m.group( 0 ) );
        }
    }

    // --------------------------- main() method ---------------------------

    /**
     * test harness
     *
     * @param args not used
     */
    public static void main( String[] args )
    {
        // We want to find Strings of the form "xx'xx" or 'xx"xx'
        // We want to avoid the following problems:
        // 1. Works even if String contains foreign languages, even Russian or accented letters.
        // 2. If starts with " must end with ", if starts with ' must end with '.
        // 3. ' is ok inside "...", and " is ok inside '...'
        // 4. We don't worry about how to use ' inside '...'.

        // here are some suggested techniques:

        exercisePattern( Pattern.compile( "[\"']\\p{Print}+?[\"']" ) ); // fails 1 2 3

        exercisePattern( Pattern.compile( "[\"'][^\"']+[\"']" ) ); // fails 2 3

        exercisePattern( Pattern.compile( "([\"'])[^\"']+\\1" ) ); // fails 3, uses a capturing group.

        exercisePattern( Pattern.compile( "\"[^\"]+\"|'[^']+'" ) ); // works, rejects empty strings by Mark Space.

        exercisePattern( Pattern.compile( "\"[^\"]*\"|'[^']*'" ) ); // works, accepts empty strings by Robert Klemme.

        exercisePattern( Pattern.compile(
                "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts empty strings
        // (?: ) is a non-capturing group. This is Robert Klemme's contribution. I don't understand how it works.
    }
}

markspace

May 26, 2012, 10:19:12 AM
On 5/26/2012 6:19 AM, Roedy Green wrote:

> exercisePattern( Pattern.compile( "\"[^\"]+\"|'[^']+'" ) ); //
> works, rejects empty strings by Mark Space.


If you want it to accept empty strings, replace the +'s with *'s. You
didn't specify empty strings in your original problem statement, so I
decided to disallow them.
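
The difference between the two quantifiers shows up on a small input. A sketch, with an invented class name and sample text:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyQuotesDemo {
    /** Counts how many times regex is found in text. */
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "a \"\" b 'x'";
        // With + the empty pair "" is skipped and only 'x' is found.
        System.out.println(countMatches("\"[^\"]+\"|'[^']+'", text)); // prints 1
        // With * the empty pair "" counts as a match as well.
        System.out.println(countMatches("\"[^\"]*\"|'[^']*'", text)); // prints 2
    }
}
```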

Thanks for posting that SSCCE, btw. I was too lazy to cook one up.


Robert Klemme

May 26, 2012, 10:37:09 AM
On 26.05.2012 03:43, markspace wrote:
> On 5/25/2012 3:12 PM, Robert Klemme wrote:
>
>> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"
>
>
> This looks overly baroque to me. You don't need to escape \ single
> quotes ' in a Java string,

I didn't.

> and I don't think you need to in a regex
> either (although I didn't check that).

There is also no regexp escaping of single quotes either. The only
regexp escaping you can see are the \\\\ which translate into \\ in the
string which is a literal backslash for the regexp engine.
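
The two layers of escaping can be seen by printing the string before handing it to the regex engine. A sketch only; the class and constant names are hypothetical.

```java
import java.util.regex.Pattern;

class EscapeLayersDemo {
    // Java source: four backslashes and a dot.
    static final String ESCAPED_CHAR = "\\\\.";

    public static void main(String[] args) {
        // After the compiler has processed the literal, three characters remain.
        System.out.println(ESCAPED_CHAR);          // prints \\.
        System.out.println(ESCAPED_CHAR.length()); // prints 3
        // The regex engine reads the two backslashes as one literal backslash
        // and the dot as "any character", so a backslash plus a quote matches.
        System.out.println(Pattern.matches(ESCAPED_CHAR, "\\\"")); // prints true
        System.out.println(Pattern.matches(ESCAPED_CHAR, "ab"));   // prints false
    }
}
```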

> I'm also not seeing the need for
> the parenthesis around the character classes [] (but again, without
> having tried it, I could be wrong).

It's not parenthesis around character classes but around the alternative
of "match a backslash followed by any char" and "any char which is not
backslash or the opening quote type of this string variant".

> And the dot . inside the parenthesis just looks wrong.

It isn't - see above.

> Great post overall though.

Thank you! It does seem to need some time to sink in though... :-)

Kind regards

robert

markspace

May 26, 2012, 10:57:07 AM
On 5/26/2012 6:19 AM, Roedy Green wrote:

> exercisePattern( Pattern.compile(
> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts
> empty strings
> // (?: ) is a non-capturing group. This is Robert Klemme's
> contribution. I don't understand how it works.


Ah, OK, so here's my contribution to your excellent SSCCE. First this
pattern is basically the same as mine. It uses alternation (the
vertical bar |) to pick a string delimited by either ' or "

Here's his regex string without the extra escapes for Java:

"(?:\\.|[^\"])*"|'(?:\\.|[^\'])*'
^^^^^^^^^^^^^^^^

Let's look at just the first half for a moment, without the (?:\\. part.

"[^\"]*"
^^^^^^^^
12     3
Example for the first part:
1. " string starts with double quote
2. [^\"]* doesn't contain a "
3. " ends with double quote

Same for the second half of the string.

Notice he's using * instead of +'s, which is why his matches 0 width
strings.

The other part didn't appear in your problem statement, but many string
syntaxes (Java and JavaScript literals, for example) allow characters to be
escaped with a backslash. E.g., 'Bob\'s your uncle.' So his inclusion is
very reasonable.

So Robert adds (\\.|[^\"])* to the first part, which is

(\\.|[^\"])*
12 345     6

1. Start a group
2. A backslash. It needs to be escaped for regex, hence \\
3. . is regex "any character". 2 and 3 together mean "match \ followed
by any character"
4. OR (alternation again)
5. character class, negated (the ^), meant to match anything except \ or ".
I think this is a mistake: \" is just an escaped ", so the \ itself needs
to be escaped (\\) to be excluded.
6. zero or more.

Then after that mess, he does the obvious thing and makes the group
non-capturing, so the regex does a little less work.

"(?:\\.|[^\"])*"

Phew! Next, he adds one alternation and does the same for a ' delimited
string.

|'(?:\\.|[^\'])*'

Same thing, just ' instead of ".

Finally I think this could be simplified slightly with Lew's
back-reference idea.

(['"])(?:\\.|[^\1\\])*

(Untested.) This allows empty strings between delimiters; instead of a
* use + for only non-empty strings between the quotes.



My executive summary:

Regex is a great rapid development tool, except when it isn't. You
realize your problem is simple, and you could have hand-coded a parser
to do this much quicker than all these news post exchanges?

markspace

May 26, 2012, 11:06:46 AM
On 5/26/2012 7:37 AM, Robert Klemme wrote:
> On 26.05.2012 03:43, markspace wrote:
>> On 5/25/2012 3:12 PM, Robert Klemme wrote:
>>
>>> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"
>>
...
>> and I don't think you need to in a regex
>> either (although I didn't check that).
>
> There is also no regexp escaping of single quotes either. The only
> regexp escaping you can see are the \\\\ which translate into \\ in the
> string which is a literal backslash for the regexp engine.


Yes, there is, although I think it's a typo. Both \\\" and \\' get
passed to the regex as \" and \', which means just a single character "
and ' respectively.

You're right about the rest of it though. With so many \'s floating
around, I have a hard time reading Java regex!


> It's not parenthesis around character classes but around the alternative
> of "match a backslash followed by any char" and "any char which is not
> backslash or the opening quote type of this string variant".


Yup, I totally missed this too. Thanks for pointing it out.

Robert Klemme

May 26, 2012, 11:13:05 AM
Oh, right, thanks for finding that!

> 6. zero or more.
>
> Then after that mess, he does the obvious thing and adds non-capturing
> group, to make the regex do a little less work.
>
> "(?:\\.|[^\"])*"
>
> Phew! Next, he adds one alternation and does the same for a ' delimited
> string.
>
> |'(?:\\.|[^\'])*'
>
> Same thing, just ' instead of ".
>
> Finally I think this could be simplified slightly with Lew's
> back-reference idea.
>
> (['"])(?:\\.|[^\1\\])*
>
> (Untested.) This allows empty strings between delimiters; instead of a *
> use + for only non-empty strings between the quotes.

Interesting approach - but it doesn't work. Simple test with
Pattern.compile("(.)[a\\1]"):

Exception in thread "main" java.util.regex.PatternSyntaxException:
Illegal/unsupported escape sequence near index 6
(.)[a\1]
^
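
The test is easy to repeat. A minimal sketch (the class and method names are invented) that reports whether a regex compiles:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class BackrefInClassDemo {
    /** Returns true if regex compiles, false if Pattern.compile rejects it. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            System.out.println(e.getDescription() + ": " + e.getPattern());
            return false;
        }
    }

    public static void main(String[] args) {
        // A back-reference inside a character class is rejected ...
        System.out.println(compiles("(.)[a\\1]")); // prints false
        // ... while the same back-reference outside the class is accepted.
        System.out.println(compiles("(.)a\\1"));   // prints true
    }
}
```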

> My executive summary:
>
> Regex is a great rapid development tool, except when it isn't. You
> realize your problem is simple, and you could have hand-coded a parser
> to do this much quicker than all these news post exchanges?

Maybe, maybe not.

Kind regards

robert

Robert Klemme

May 26, 2012, 11:34:49 AM
On 26.05.2012 17:06, markspace wrote:
> On 5/26/2012 7:37 AM, Robert Klemme wrote:
>> On 26.05.2012 03:43, markspace wrote:
>>> On 5/25/2012 3:12 PM, Robert Klemme wrote:
>>>
>>>> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"
>>>
> ...
>>> and I don't think you need to in a regex
>>> either (although I didn't check that).
>>
>> There is also no regexp escaping of single quotes either. The only
>> regexp escaping you can see are the \\\\ which translate into \\ in the
>> string which is a literal backslash for the regexp engine.
>
>
> Yes, there is, although I think it's a typo. Both \\\" and \\' get
> passed to the regex as \" and \', which means just a single character "
> and ' respectively.

Right you are - both times: there is regexp escaping and it was in fact
a typo (missing \\)!
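
For the record, a sketch of the pattern with the missing \\ restored next to the version as posted. The class, constant and method names are made up, and the unterminated sample input is an assumption chosen to make the difference visible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuoteDemo {
    // As posted: the classes only exclude the quote, so a backslash is still allowed there.
    static final Pattern AS_POSTED =
        Pattern.compile("\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'");
    // With the missing escape restored: the classes exclude the backslash too.
    static final Pattern CORRECTED =
        Pattern.compile("\"(?:\\\\.|[^\\\\\"])*\"|'(?:\\\\.|[^\\\\'])*'");

    /** Returns the first match of pattern in text, or null if there is none. */
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String escaped = "\"a\\\"b\"";      // the six characters "a\"b"
        String unterminated = "\"abc\\\"";  // the six characters "abc\" with no closing quote
        System.out.println(firstMatch(AS_POSTED, escaped));
        System.out.println(firstMatch(CORRECTED, escaped));
        // After backtracking, the posted version treats the escaped quote as the closing one.
        System.out.println(firstMatch(AS_POSTED, unterminated));
        // The corrected version finds nothing here and prints null.
        System.out.println(firstMatch(CORRECTED, unterminated));
    }
}
```

On well-formed input both versions agree; they only differ when the matcher has to backtrack over an escape sequence.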

> You're right about the rest of it though. With so many \'s floating
> around, I have a hard time reading Java regex!

That's true for other languages as well - the basic reason is that the
same character is used for

- escaping in strings
- escaping in regular expressions
- escaping in the source text (in this case we could pick another
character)

>> It's not parenthesis around character classes but around the alternative
>> of "match a backslash followed by any char" and "any char which is not
>> backslash or the opening quote type of this string variant".
>
>
> Yup, I totally missed this too. Thanks for pointing it out.

You're welcome! Thank you again for finding the missing escape.

Cheers

Peter Duniho

May 26, 2012, 1:07:54 PM
On Sat, 26 May 2012 17:34:49 +0200, Robert Klemme wrote:

> [...]
>> You're right about the rest of it though. With so many \'s floating
>> around, I have a hard time reading Java regex!
>
> That's true for other languages as well

Not C#, which allows string literals to be prefaced with the @ symbol to
disable compiler escaping.

In fact, I'll bet C# wasn't the first language to have such a feature.
Surely there are many other languages that also avoid the issue.

markspace

May 26, 2012, 1:08:58 PM
On 5/26/2012 8:13 AM, Robert Klemme wrote:
> On 26.05.2012 16:57, markspace wrote:
>> Finally I think this could be simplified slightly with Lew's
>> back-reference idea.
>>
>> (['"])(?:\\.|[^\1\\])*
>>
>> (Untested.) This allows empty strings between delimiters; instead of a *
>> use + for only non-empty strings between the quotes.
>
> Interesting approach - but it doesn't work. Simple test with
> Pattern.compile("(.)[a\\1]"):
>
> Exception in thread "main" java.util.regex.PatternSyntaxException:
> Illegal/unsupported escape sequence near index 6
> (.)[a\1]
> ^


Yup, [] is for characters, and \1 could be a string. Gets rejected. I
think you could use "negative lookahead" to say "not this string" when
parsing. Gets kinda ugly though.

<http://www.regular-expressions.info/conditional.html>

Java:

"(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1"

Regex:

(['"])(?:\\.|(?!\1|\\).)+\1
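
A short sketch of the lookahead version on its own, separate from the full test program. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadQuoteDemo {
    // (?!\1|\\). means: one character that is neither the opening quote nor a backslash.
    static final Pattern QUOTED =
        Pattern.compile("(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1");

    /** Returns the first match in text, or null if there is none. */
    static String firstMatch(String text) {
        Matcher m = QUOTED.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("'Bob\\'s your uncle.'")); // the whole quoted string
        System.out.println(firstMatch("\"John's restaurant\"")); // the whole quoted string
        System.out.println(firstMatch("\"\""));                  // null, because of the +
    }
}
```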

I re-did Roedy's test program to be a bit more clear about what it was
looking for, and the results. This could be even cleaner if it was run
with a JUnit test harness.

At this point though the regex is basically just a mess. Download antlr
and get an XML/HTML grammar from online.



package quicktest;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import static java.lang.System.out;

/**
*
* @author Brenden
*/
public class MindProdRegex {

}

/*
 * [TestRegexFindQuotedString.java]
 *
 * Summary: Finding a quoted String with a regex.
 *
 * Copyright: (c) 2012 Roedy Green, Canadian Mind Products, http://mindprod.com
 *
 * Licence: This software may be copied and used freely for any purpose but military.
 *          http://mindprod.com/contact/nonmil.html
 *
 * Requires: JDK 1.7+
 *
 * Created with: JetBrains IntelliJ IDEA IDE http://www.jetbrains.com/idea/
 *
 * Version History:
 *  1.0 2012-05-25 initial release
 */

/**
* Finding a quoted String with a regex.
*
* @author Roedy Green, Canadian Mind Products
* @version 1.0 2012-05-25 initial release
* @since 2012-05-25
*/
class TestRegexFindQuotedString
{
    // ------------------------------ CONSTANTS ------------------------------

    // Pairs: the text to search, then the expected first match
    // ("" when no match is expected).
    private static final String[] vectors =
    {"Basic: George said \"that's the ticket\".",
     "\"that's the ticket\"",
     "Nested: Jeb replied '\"ticket?\" what ticket'.",
     "'\"ticket?\" what ticket'",
     "Non-ASCII: \"How na\u00efve!\".",
     "\"How na\u00efve!\"",
     " empty: \"\"xx",
     "\"\"",
     " escaped: 'Bob\\'s your uncle.'",
     "'Bob\\'s your uncle.'",
     " 'unbalanced\"",
     "",
    };

// -------------------------- STATIC METHODS--------------------------

    /**
     * exercise that pattern to see what it can find
     */
    static void exercisePattern( Pattern pattern )
    {
        out.println();
        out.println( "Pattern: " + pattern.toString() );
        for( int i = 0; i < vectors.length; i+=2 ) {
            String test = vectors[i];
            String result = vectors[i+1];
            final Matcher m = pattern.matcher( test );
            boolean found = m.find();
            String groupString = found ? m.group() : null;
            // an empty expected result means the pattern should find nothing
            boolean correct = found ? groupString.equals( result ) : result.isEmpty();
            out.println( test+", found: "+ found +
                    ", correct: "+correct+" ("+groupString+")");
        }
    }

// --------------------------- main() method---------------------------

    /**
     * test harness
     *
     * @param args not used
     */
    public static void main( String[] args )
    {
        // We want to find Strings of the form "xx'xx" or 'xx"xx'
        // We want to avoid the following problems:
        // 1. Works even if String contains foreign languages, even Russian or accented letters.
        // 2. If starts with " must end with ", if starts with ' must end with '.
        // 3. ' is ok inside "...", and " is ok inside '...'
        // 4. We don't worry about how to use ' inside '...'.

        // here are some suggested techniques:

        exercisePattern( Pattern.compile( "[\"']\\p{Print}+?[\"']" ) ); // fails 1 2 3

        exercisePattern( Pattern.compile( "[\"'][^\"']+[\"']" ) ); // fails 2 3

        exercisePattern( Pattern.compile( "([\"'])[^\"']+\\1" ) ); // fails 3, uses a capturing group.

        exercisePattern( Pattern.compile( "\"[^\"]+\"|'[^']+'" ) ); // works, rejects empty strings by Mark Space.
        exercisePattern( Pattern.compile(
                "(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1" ) ); // works, rejects empty strings by Mark Space.

        exercisePattern( Pattern.compile( "\"[^\"]*\"|'[^']*'" ) ); // works, accepts empty strings by Robert Klemme.
        exercisePattern( Pattern.compile(
                "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts empty strings
        // (?: ) is a non-capturing group. This is Robert Klemme's contribution. I don't understand how it works.
    }
}

Lew

May 26, 2012, 5:07:08 PM
As I corrected in my very next post.

--
Lew
Honi soit qui mal y pense.
http://upload.wikimedia.org/wikipedia/commons/c/cf/Friz.jpg

Roedy Green

May 26, 2012, 5:14:44 PM
On Sat, 26 May 2012 10:08:58 -0700, markspace <-@.> wrote, quoted or
indirectly quoted someone who said :

>I re-did Roedy's test program to be a bit more clear about what it was
>looking for, and the results. This could be even cleaner if it was run
>with a JUnit test harness.

Thanks Brendan. I have incorporated your suggestions plus a bit more
polishing.

See http://mindprod.com/jgloss/regex.html#FINDQUOTED

for a formatted listing + output.

The next task, probably procrastinated, is to solve it with a little
finite state automaton that decodes \x as well, and a simpler version
without. If a newbie is interested in tackling that, they can look at
my Java snippet parser as part of JPrep/JDisplay and strip it down.
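
For anyone who wants a starting point, here is one way such an automaton might look. It is a sketch under assumptions, not the JPrep/JDisplay parser: the class name, the three states and the choice to decode \x to plain x are all invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedStringScanner {
    private enum State { OUTSIDE, INSIDE, ESCAPE }

    /** Returns the contents of each quoted string in text, with \x decoded to x. */
    static List<String> scan(String text) {
        List<String> found = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        State state = State.OUTSIDE;
        char delimiter = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (state) {
                case OUTSIDE:
                    if (c == '"' || c == '\'') {
                        delimiter = c;          // remember which quote opened the string
                        current.setLength(0);
                        state = State.INSIDE;
                    }
                    break;
                case INSIDE:
                    if (c == '\\') {
                        state = State.ESCAPE;   // the next character is taken literally
                    } else if (c == delimiter) {
                        found.add(current.toString());
                        state = State.OUTSIDE;
                    } else {
                        current.append(c);      // includes the other kind of quote
                    }
                    break;
                case ESCAPE:
                    current.append(c);
                    state = State.INSIDE;
                    break;
            }
        }
        return found; // a string still open at end of input is dropped
    }

    public static void main(String[] args) {
        System.out.println(scan("say \"that's\" and 'Bob\\'s' and \"\""));
    }
}
```

Unlike the regexes, this scanner does not retry from a later position when a string is left unterminated; whether that is wanted depends on the input. A version without escape decoding would simply drop the ESCAPE state.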

markspace

May 26, 2012, 9:34:00 PM
On 5/26/2012 2:07 PM, Lew wrote:
> markspace wrote:
>> Lew wrote:
>>> Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I
>>> don't know.
>>>
>> This would match "John's restaurant" as "John'.
>>
>> The first quote matches ", John does not contain either ' or " as
>> specified,
>> and the last character class matches the '. Not I think what is wanted.
>
> As I corrected in my very next post.
>


Unfortunately that one doesn't work either. The central part, [^"'],
doesn't allow a match of a ' if the starting delimiter was a ", and that
doesn't match Roedy's spec. "John's restaurant" wouldn't be matched at
all, because the matcher couldn't match past the ' to get to the ".

I think the easiest is to write out a grammar for the expression, then
translate to regex.

QUOTED_STRING := SQUOTED_STRING | DQUOTED_STRING

SQUOTED_STRING := ' NON_S_QUOTE + '

DQUOTED_STRING := " NON_D_QUOTE + "

NON_S_QUOTE := [^']

NON_D_QUOTE := [^"]

At this point the grammar is very clear. (Note I haven't included
Robert's \x escape sequences.) I think it's worth learning to use antlr
rather than regex, which tends to obfuscate more than it helps.
However, a literal translation into regex isn't hard, and a literal
translation avoids mis-optimizations.
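
As a sketch of what such a literal translation might look like in Java, with one constant per grammar rule. The class and method names are hypothetical, and the escape sequences are left out, as in the grammar above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GrammarToRegex {
    // One constant per grammar rule, composed bottom-up.
    static final String NON_S_QUOTE = "[^']";
    static final String NON_D_QUOTE = "[^\"]";
    static final String SQUOTED_STRING = "'" + NON_S_QUOTE + "+'";
    static final String DQUOTED_STRING = "\"" + NON_D_QUOTE + "+\"";
    static final Pattern QUOTED_STRING =
        Pattern.compile(SQUOTED_STRING + "|" + DQUOTED_STRING);

    /** Returns the first quoted string in text, or null if there is none. */
    static String firstMatch(String text) {
        Matcher m = QUOTED_STRING.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(QUOTED_STRING.pattern());
        System.out.println(firstMatch("\"John's restaurant\""));
    }
}
```

The assembled pattern is the same alternation suggested at the start of the thread; the named pieces just make it easier to see which rule each fragment came from.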


Lew

May 27, 2012, 2:39:14 PM
Very illuminating. Thank you.