Selecting indented text blocks with side headings [Java]

bigAPE

Jul 7, 2009, 5:56:08 AM

to Regex

Hi Guys,

Wondering if someone could help me. I have the following sample text
in a file:

REFERENCE   1 (initial reference)
AUTHORS     Spirao,D., Halpain,R., Boayne,A., Baera,J., Gheadin,E., Hostetaler,J.,
            Fedoraova,N., Kiam,M., Zaborsky,J., Overton,L., Djuaric,K.,
            Sarmienato,M., Sitaz,J., Katazel,D., Haalpin,K., Stevaens,V.,
            Selleack,P., Aaxel,A., Bruce,K., Joahnson,M., Heaine,H., Haoa,D.M.,
            Loang,N.T., Vau,P.P., Baao,Y., Boalotov,P., Dernaovoy,D., Kirayutin,B.,
            Lipaman,D.J. and Tatusaova,T.
TITLE       This is the title text and it goes on and on and on and on and on
            and on and on and on and on.
JOURNAL     Unpublished
REFERENCE   2 (another reference)
TITLE       This is the another text and it goes on and on and on and on and on
            and on and on and on and on.
JOURNAL     Unpublished

Now this may or may not come out as expected in this thread, but it is
fixed width with a 12 character "margin" containing the field headers,
with the field content to the right of it. Each field's content can
span multiple lines.

What I want to get out of it is a match for each field and its total
content, even when it spans multiple lines, as follows:

Match 1: [REFERENCE] [1 (initial reference)]
Match 2: [AUTHORS] [Spirao,D., Halpain,R., ... Lipaman,D.J. and Tatusaova,T.]
Match 3: [TITLE] [This is the title text and it goes on and on and on and on and on and on and on and on and on.]
Match 4: ...etc

This is the Java I have so far:

final String regex = "^\\s{0,5}([A-Z]+)\\s+(.(?!\\n\\s{0,5}([A-Z]+))+)";
Matcher matcher = Pattern.compile(regex,
        Pattern.MULTILINE | Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(textToParse);
while (matcher.find()) {
    System.out.println(matcher.group(1) + " = " + matcher.group(2));
}

But what I end up with is the following

REFERENCE = 1
AUTHORS = S
TITLE = T
JOURNAL = U

What I want to do is grab any selection that starts with 0-5 spaces
and a number of alpha chars followed by a space, and then (here's the
important bit) any char, including newlines, stopping when we meet the
next field tag (0-5 spaces and a number of alpha chars followed by a
space).

I'm not sure if my Lookahead syntax is working correctly.

(?!\\n\\s{0,5}([A-Z]+))

If I use the following, I get the first line of each field, but NOT
the overflow lines that follow it for AUTHORS and TITLE:

^\\s{0,5}([A-Z]+)\\s+(.+)

Can any Regex Gurus out there shed any light on this for me or point
me at a decent tutorial on Lookahead which may help?

Cheers

Al
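
A note on the lookahead syntax asked about above. As far as can be
told from the pattern as posted, the trailing + sits after the
lookahead group, so it repeats the lookahead and not the dot. Group 2
can then only ever hold one character, which would fit output such as
"AUTHORS = S". The sketch below compares the posted pattern with a
variant where the + repeats a group containing both the lookahead and
the dot. The class name QuantifierPlacementDemo, the helper firstValue
and the two-field sample text are invented for illustration, and
CASE_INSENSITIVE is left out to keep the demo small.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierPlacementDemo {
    // Returns group 2 of the first match, or null when nothing matches.
    static String firstValue(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.MULTILINE | Pattern.DOTALL).matcher(text);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Hypothetical two-field sample with a 12-character margin.
        String text = "TITLE       first line\n"
                    + "            second line\n"
                    + "JOURNAL     Unpublished\n";

        // As posted: the trailing + repeats the lookahead, so "." runs only once.
        String posted = "^\\s{0,5}([A-Z]+)\\s+(.(?!\\n\\s{0,5}([A-Z]+))+)";
        // Variant: the + repeats a group holding both the lookahead and the dot.
        String moved = "^\\s{0,5}([A-Z]+)\\s+((?:(?!\\n\\s{0,5}[A-Z]+).)+)";

        System.out.println("[" + firstValue(posted, text) + "]");
        System.out.println("[" + firstValue(moved, text) + "]");
    }
}
```

Run as is, the first line printed is [f]; the second bracket holds
both lines of the TITLE value, inner line break and indentation
included.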

bigAPE

Jul 7, 2009, 6:10:23 AM

to Regex

Sorry, this sample might come out a little better:

REFERENCE   1 (initial reference)
AUTHORS     Spirao,D., Halpain,R., Boayne,A., Baera,J., Gheadin,E.,
            Fedoraova,N., Kiam,M., Zaborsky,J., Overton,L.,
            Sarmienato,M., Sitaz,J., Katazel,D., Haalpin,K.,
            Selleack,P., Aaxel,A., Bruce,K., Joahnson,M., Heaine,H.,
            Loang,N.T., Vau,P.P., Baao,Y., Boalotov,P., Dernaovoy,D.,
            Lipaman,D.J. and Tatusaova,T.
TITLE       This is the title text and it goes on and on and on and on
            and on and on and on and on.
JOURNAL     Unpublished
REFERENCE   2 (another reference)
TITLE       This is the another text and it goes on and on and on and
            and on and on and on and on.
JOURNAL     Unpublished

Al

Accmailer

Jul 8, 2009, 4:39:55 AM

to Regex

Hi Al,

I suggest this regex (in free-spacing mode and in dot-matches-newlines
mode):

\s*REFERENCE\s*([^\r\n]+)\r\n
\s*AUTHORS\s*(.*?)\r\n
\s*TITLE\s*(.*?)\r\n
\s*JOURNAL\s*([^\r\n]+)\r\n
\s*REFERENCE\s*([^\r\n]+)\r\n
\s*TITLE\s*(.*?)\r\n
\s*JOURNAL\s*([^\r\n]+)\r\n

After applying it, collect the corresponding backreferences:

ref1: $1
authors: $2
title1: $3
journal1: $4
ref2: $5
title2: $6
journal2: $7

And, if necessary, replace all line breaks inside these backreferences
with nothing. That'll be the 2nd step.

Voila!

P.S. Note that it's based on the assumption that the word TITLE can
never occur inside the authors list and the word JOURNAL can never
occur inside the title line(s).
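
The two steps above could look roughly like this in Java. This is only
a sketch under a few assumptions: the class name FixedLayoutDemo and
the method extract are invented here, the sample record is shortened,
the input uses \r\n line endings as the regex expects, and step 2
folds each line break together with the indentation after it into a
single space (rather than nothing) so that neighbouring words do not
run together.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FixedLayoutDemo {
    // The regex from the post, one piece per line. The real newline after each
    // piece is ignored because of Pattern.COMMENTS (free-spacing mode), and
    // Pattern.DOTALL lets (.*?) run across line breaks.
    private static final Pattern RECORD = Pattern.compile(
          "\\s*REFERENCE\\s*([^\\r\\n]+)\\r\\n\n"
        + "\\s*AUTHORS\\s*(.*?)\\r\\n\n"
        + "\\s*TITLE\\s*(.*?)\\r\\n\n"
        + "\\s*JOURNAL\\s*([^\\r\\n]+)\\r\\n\n"
        + "\\s*REFERENCE\\s*([^\\r\\n]+)\\r\\n\n"
        + "\\s*TITLE\\s*(.*?)\\r\\n\n"
        + "\\s*JOURNAL\\s*([^\\r\\n]+)\\r\\n\n",
        Pattern.COMMENTS | Pattern.DOTALL);

    // Returns the seven captured values, or null if the text does not match.
    static String[] extract(String text) {
        Matcher m = RECORD.matcher(text);
        if (!m.find()) {
            return null;
        }
        String[] values = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            // Step 2: fold each line break and the indentation after it into one space.
            values[i - 1] = m.group(i).replaceAll("\\r\\n\\s*", " ");
        }
        return values;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical record with \r\n line endings.
        String text =
              "REFERENCE   1 (initial reference)\r\n"
            + "AUTHORS     Spirao,D., Halpain,R.,\r\n"
            + "            Lipaman,D.J. and Tatusaova,T.\r\n"
            + "TITLE       This is the title text\r\n"
            + "            and on and on.\r\n"
            + "JOURNAL     Unpublished\r\n"
            + "REFERENCE   2 (another reference)\r\n"
            + "TITLE       This is the another text\r\n"
            + "            and on and on.\r\n"
            + "JOURNAL     Unpublished\r\n";
        String[] labels = { "ref1", "authors", "title1", "journal1", "ref2", "title2", "journal2" };
        String[] values = extract(text);
        for (int i = 0; values != null && i < values.length; i++) {
            System.out.println(labels[i] + ": " + values[i]);
        }
    }
}
```

Because the field names and their order are written into the pattern,
a record with a different layout makes extract return null in this
sketch.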

bigAPE

Jul 8, 2009, 5:49:21 AM

to Regex

I should have mentioned that this is a SAMPLE of the file format.
There are many other fields, in no specific order, and the names can
change all the time. Also, the line breaks are just \n in the sample
files that I have.

What I am after is a fully GENERIC approach which will allow me to
iterate over the matches and get each "tag" (REFERENCE, AUTHORS, TITLE,
etc.) and the textual "value" for that "tag", even if it spans several
lines. Thus my attempt at using the negative lookahead.

Any other suggestions? I'm wondering if my use of the negative
lookahead is incorrect. What I really want to do is test for the end
of the block by looking ahead for the start of the next field,
(?!^\\s{0,5}([A-Z]+)), but the ^ character doesn't seem to work in
this instance. Is there a way in Java to tell the Matcher that ^ means
the start of each line in the regex?

Al
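
On the ^ question: in java.util.regex, the Pattern.MULTILINE flag (or
the inline (?m)) makes ^ match at the start of every line, and that
holds inside a lookahead as well. One possible generic sketch built on
that is shown below. It is not a definitive parser: the class name
FieldBlockParser, the method parse and the shortened sample are
invented for illustration, and it assumes \n line endings, upper-case
tags, and continuation lines indented by more than five spaces. It
uses a literal space instead of \s in the margin because \s would also
match line breaks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldBlockParser {
    // (?m) makes ^ match at the start of every line; (?s) lets . cross line breaks.
    // A field starts with 0-5 spaces, an upper-case tag, then at least one space.
    // The value is consumed one character at a time for as long as the current
    // position is not the start of another field line.
    private static final Pattern FIELD = Pattern.compile(
        "(?ms)^ {0,5}([A-Z]+) +((?:(?!^ {0,5}[A-Z]+ ).)+)");

    // Returns one {tag, value} pair per field, in document order.
    static List<String[]> parse(String text) {
        List<String[]> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            // Fold each line break plus the surrounding whitespace into one space.
            String value = m.group(2).trim().replaceAll("\\s*\\n\\s*", " ");
            fields.add(new String[] { m.group(1), value });
        }
        return fields;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical sample with a 12-character margin.
        String text =
              "REFERENCE   1 (initial reference)\n"
            + "AUTHORS     Spirao,D., Halpain,R.,\n"
            + "            Lipaman,D.J. and Tatusaova,T.\n"
            + "TITLE       This is the title text\n"
            + "            and on and on.\n"
            + "JOURNAL     Unpublished\n";
        for (String[] field : parse(text)) {
            System.out.println(field[0] + " = " + field[1]);
        }
    }
}
```

Each match then carries the tag in group 1 and the whole value in
group 2, so the loop does not need to know the field names or their
order in advance. A tag with an empty value would not be matched by
this sketch.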
