nested brackets with javascript

tyler....@gmail.com

Mar 4, 2006, 9:18:51 AM
to Regex
Dear all,

I am trying to scrape BibTeX entries from web pages using JavaScript
(from within a Firefox extension).
A single entry looks roughly like this:

@resourceType{
field1 = aValue,
field2 = "value with quotation marks",
field3 = "value with quotation mark and {brackets} in the field value",
field4 = {brackets},
lastfield = "should not have a delimiting comma, but some generators do anyhow"
}

(many more entries like above)

For starters, I am trying to get all the entries from a page.

Using a simple
/@([\s\S]+?){[\s\S]+?}/gim
unfortunately stops too early, at the first }, for entries that include } brackets.

I am not very familiar with regexps in general, so maybe there are some
fundamental things I do not get right.
Can any of you help me?
Thanks a lot in advance
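
To see the problem described above in isolation, here is a minimal sketch. The sample text is a shortened, hypothetical version of the entry in the post, and the braces in the pattern are escaped for clarity; otherwise the pattern is the one quoted above.

```javascript
// Hypothetical sample, shortened from the entry shown in the post.
const sample = [
  '@resourceType{',
  'field3 = "value with {brackets} in the field value",',
  'field4 = {brackets},',
  'lastfield = "last value"',
  '}'
].join('\n');

// The pattern from the post, with the braces escaped for clarity.
const lazy = /@([\s\S]+?)\{[\s\S]+?\}/gim;

// The lazy [\s\S]+? gives up at the first "}" it can reach, so the
// match is cut off right after the inner {brackets} pair.
const firstMatch = sample.match(lazy)[0];
console.log(firstMatch);
```

The match ends after the inner `{brackets}` instead of at the closing brace of the entry, which is the behaviour the question is about.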

Sergei Z

Mar 4, 2006, 9:54:05 AM
to Regex
For those not familiar with ***BibTeX entries***, can u say exactly WHAT
u'd like to match/fetch/replace in

@resourceType{
field1 = aValue,
field2 = "value with quotation marks",
field3 = "value with quotation mark and {brackets} in the field value",
field4 = {brackets},
lastfield = "should not have a delimiting comma, but some generators do anyhow"
}

It'd help a lot.

Allen Day

Mar 4, 2006, 10:20:02 AM
to re...@googlegroups.com
On 3/4/06, tyler....@gmail.com <tyler....@gmail.com> wrote:
>
> using a simple
> /@([\s\S]+?){[\s\S]+?}/gim
> unfortunately stops too early, at the first }, for entries that include } brackets.

You could look for that last bracket as ^\}$ -- assuming it's on its
own line. ^ anchors at the start of the line, $ at the end.

--
Remember, no matter where you go, there you are. -Buckaroo Banzai
Online Regex find/replace utility: http://rereplace.com
Command-based online image editor: http://theprawn.com/imagiine
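
A minimal JavaScript sketch of the anchored approach suggested above, assuming every entry is closed by a `}` that sits alone on its own line. The page text and the `@\w+\s*\{` header pattern are illustrative assumptions, not taken from the thread.

```javascript
// Hypothetical page text: two entries, each closed by a "}" on its own line.
const page = [
  '@article{',
  'title = "a {nested} title",',
  'year = 2006',
  '}',
  'some other text',
  '@book{',
  'title = {brackets}',
  '}'
].join('\n');

// ^ and $ match at line boundaries only because the m flag is set.
// [\s\S]*? lazily consumes the body until a line that contains only "}".
const entryPattern = /@\w+\s*\{[\s\S]*?^\}$/gm;

const entries = page.match(entryPattern) || [];
console.log(entries.length);
```

Inner braces do not end the match because none of them is alone on a line; an entry whose closing brace shares a line with other text would be missed.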

Sergei Z

Mar 4, 2006, 10:30:04 AM
to Regex

Allen Day wrote:
>
> You could look for that last bracket as ^\}$ -- assuming it's on its
> own line. ^ anchors at the start of the line, $ at the end.

Allen, looking for

^\}$

is obviously not a bullet-proof solution. Ideally we need to grab
corresponding curly brackets recursively using balancing groups. Are u
familiar with them? I've seen a solution that worked for a similar
problem:

pls go to

http://regexadvice.com/forums/13051/ShowPost.aspx
POST 2:09 pm

do u think it is applicable here?

Thanks,
Sergei

Sergei Z

Mar 4, 2006, 11:06:16 AM
to Regex
tested in .NET - works fine

balancing group regex:

(@\w+{) # Find the first curly bracket, preceded by @resourceType coded as @\w+
( # one of these three things:
{(?<DEPTH>) # another opening curly bracket (and increment DEPTH)
| # or
}(?<-DEPTH>) # closing curly bracket (and decrement DEPTH)
| # or
.*? # anything else (lazily)
)* # and repeat.
(?(DEPTH)(?!)) # Match only if depth == 0, i.e., every inner bracket is closed.
(}) # Conclude with the final curly bracket
#END REGEX

from input:

@resourceType{
field1 = aValue,
field2 = "value with quotation marks",
field3 = "value with quotation mark and {brackets} in the field value",
field4 = {brackets},
lastfield = "should not have a delimiting comma, but some generators do anyhow"
}

{some other HTML code}

@resourceType{
field1 = aValue,
field2 = "value with quotation marks",
field3 = "value with quotation mark and {brac{T}ke{T}ts} in the field value",
field4 = {brackets},
lastfield = "should not have a delimiting comma, but some generators do anyhow"
}
}
{some other HTML code}


It matches both @resourceType blocks as 2 matches. Try it in Expresso.

code for C# regex object : Watch OPTIONS !!

using System.Text.RegularExpressions;

// Balancing-group pattern; IgnorePatternWhitespace permits the inline # comments.
Regex regex = new Regex(
    @"(@\w+{)          # Find the first curly bracket
      (                # one of these three things:
        {(?<DEPTH>)    #   another opening curly bracket (and increment DEPTH)
        |              #   or
        }(?<-DEPTH>)   #   closing curly bracket (and decrement DEPTH)
        |              #   or
        .*?            #   anything else (lazily)
      )*               # and repeat.
      (?(DEPTH)(?!))   # Match only if depth == 0, i.e., every inner bracket is closed.
      (})              # Conclude with the final curly bracket",
    RegexOptions.IgnoreCase
    | RegexOptions.Singleline
    | RegexOptions.IgnorePatternWhitespace
    | RegexOptions.Compiled
);

Sergei Z

Mar 4, 2006, 11:15:16 AM
to Regex
SUMMARIZING....

(@\w+{)
(
{(?<DEPTH>)
|
}(?<-DEPTH>)
|
.*?
)*
(?(DEPTH)(?!))
(})

Options: RegexOptions.IgnoreCase
| RegexOptions.Singleline
| RegexOptions.IgnorePatternWhitespace
| RegexOptions.Compiled

What it does: recursively matches the 2 @resourceType{.....} blocks from the input above.

tyler....@gmail.com

Mar 4, 2006, 12:11:30 PM
to Regex
Dear Sergei, Allen,

Wow, just went out for a coffee and 5 re:s already! Thanks a lot.
After a brief glance at the balancing .NET stuff I am very impressed
(almost reminds me of real lex/yacc tokens).
Unfortunately this is not an option for me (bound to JavaScript).
The other thing is that my problem is even simpler: recursive nesting
does not happen, so it's at most {text{text}} (one level 'deep' inside
the outermost { } pair).

So to reformulate:
I am trying to ignore {} pairs inside an outer pattern delimited by
@[\s\S]+?{ on the one end and } at the other

Thanks a lot for your fast and valuable input
Cheers
Tyler

tyler....@gmail.com

Mar 4, 2006, 1:22:10 PM
to Regex
What I want to do is get an array of the many @resourceType{*everything
inside*} blocks
and then dissect them further (this then is a piece of cake, like

*pattern fieldlabel* = *pattern fieldvalue*

)
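
The second step mentioned above (label = value) could be sketched roughly as follows. `parseFields` is a hypothetical helper, and it assumes values are bare words, "quoted strings", or {braced text} with at most one level of inner braces.

```javascript
// Hypothetical helper: split the body of one entry into label/value pairs.
// Assumes a value is a bare word, a "quoted string", or {braced text}
// with at most one level of inner braces.
function parseFields(body) {
  const fieldPattern =
    /(\w+)\s*=\s*("[^"]*"|\{(?:[^{}]|\{[^{}]*\})*\}|[^,\s]+)/g;
  const fields = {};
  let m;
  while ((m = fieldPattern.exec(body)) !== null) {
    fields[m[1]] = m[2]; // m[1] is the label, m[2] the raw value
  }
  return fields;
}

const body = [
  'field1 = aValue,',
  'field2 = "value with quotation marks",',
  'field3 = "value with {brackets} in the field value",',
  'field4 = {brackets},'
].join('\n');

const fields = parseFields(body);
console.log(fields);
```

The values keep their surrounding quotes or braces; stripping them is left out to keep the sketch short.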

Sergei Z

Mar 4, 2006, 1:56:02 PM
to Regex
Re: ***So to reformulate:

I am trying to ignore {} pairs inside an outer pattern delimited by
@[\s\S]+?{ on the one end and } at the other***

that's exactly the problem u r facing:

how would u distinguish between the last closing bracket of the
@resourceType{ .......} block AND any of the INNER brackets? It IS
RECURSIVE, man. The only difference between the last closing bracket
and the rest of the "inner" brackets inside the block is that the inner
ones are logically linked (i.e. they form an open-close pair). This logic
is impossible to implement with a simple lookaround assertion, simply
because u don't know how many inner bracket pairs u can have inside. And
this is despite them being only on "the first layer deep". U need either

1. to use the recursive regex solution above or
2. to implement your own recursive parser (DOM?)

Can anyone prove me wrong? I'd be very interested in feedback on this.

PS: Tyler, why doesn't ***{text{text}}*** look recursive to u? It's 1
level deep but it is recursive.

Sergei
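
As a rough illustration of option 2 in JavaScript, a depth counter is enough to pair each entry's opening brace with its closing one at any nesting level. `extractEntries` is a hypothetical helper; it assumes braces are balanced and that no unpaired brace appears inside a quoted value.

```javascript
// Hypothetical helper: collect every @type{...} block by counting brace depth.
// Assumes balanced braces and no unpaired brace inside quoted values.
function extractEntries(text) {
  const entries = [];
  const header = /@\w+\s*\{/g;
  let m;
  while ((m = header.exec(text)) !== null) {
    let depth = 1;              // the header already opened one brace
    let i = header.lastIndex;   // first character after that brace
    while (i < text.length && depth > 0) {
      if (text[i] === '{') depth++;
      else if (text[i] === '}') depth--;
      i++;
    }
    if (depth !== 0) break;     // unbalanced input: stop scanning
    entries.push(text.slice(m.index, i));
    header.lastIndex = i;       // resume the search after this entry
  }
  return entries;
}

const text =
  '@a{ x = {one {two {three}}} }\n{some other HTML code}\n@b{ y = "z" }';
const found = extractEntries(text);
console.log(found);
```

The regex only locates the entry header; the loop does the bracket matching, so the nesting depth is not limited.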

Sergei Z

Mar 4, 2006, 2:26:21 PM
to Regex
this ugly thing has a [very slim] chance to retrieve your strings
non-recursively,

@\w+{(.*?{.*?})+.*?}(?=.*?@\w+)

but

1. there MUST be inner curly brackets for this to work
and
2. it uses a variable-length look-ahead to find .*?@\w+ after the last closing
bracket. So, the chances are that it'll break if it encounters something
similar to @\w+ at some point ahead of the desired input fragment like

@resourceType{
field3 = "value with {brackets}
field4 = {brackets}
smth else
}

3. it will not match the last desired code fragment in the input (u can
tweak it to handle this, i guess)

Torsten Edler

Mar 4, 2006, 6:29:48 PM
to Regex
But only one level of recursion, so what about this:

@[^{]+{([^{}]|{[^{}]*})*}

Works only if
- braces are always balanced
- no braces in strings present
- level of brace nesting is only one
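
A minimal JavaScript sketch of the pattern above, under the three conditions just listed. The braces are escaped and the inner group is made non-capturing; the page text is a hypothetical sample.

```javascript
// The pattern from this post, with braces escaped and a non-capturing
// inner group; the g flag collects every entry on the page.
const entryPattern = /@[^{]+\{(?:[^{}]|\{[^{}]*\})*\}/g;

// Hypothetical page text with two entries and some unrelated braces.
const page = [
  '@resourceType{',
  'field3 = "value with {brackets} in the field value",',
  'field4 = {brackets},',
  'lastfield = "last value"',
  '}',
  '{some other HTML code}',
  '@resourceType{',
  'field1 = aValue',
  '}'
].join('\n');

const entries = page.match(entryPattern) || [];
console.log(entries.length);
```

Inside the outer braces, the group accepts either a single non-brace character or one complete `{...}` pair, so the first unpaired `}` can only be the one that closes the entry. Because `[^{]+` accepts any text before the brace, a stray @ earlier on the page (for example in an e-mail address) could start a match; `@\w+\s*` would be a stricter alternative.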

Sergei Z

Mar 4, 2006, 8:42:43 PM
to Regex

Torsten,

thanks for this one - it's a cool logic, I really enjoyed it. U proved
me wrong on this.

See u around.

Sergei

tyler....@gmail.com

Mar 5, 2006, 6:31:27 AM
to Regex
Torsten,

this is excellent. It looks a bit like black magic, but it really does
the job!

Thanks to all of you for your input, 'twas enlightening!

Have a nice weekend,
Tyler
