bigAPE
Jul 7, 2009, 5:56:08 AM
to Regex
Hi Guys,
I'm wondering if someone could help me. I have the following sample
text in a file:
REFERENCE   1 (initial reference)
AUTHORS     Spirao,D., Halpain,R., Boayne,A., Baera,J., Gheadin,E., Hostetaler,J.,
            Fedoraova,N., Kiam,M., Zaborsky,J., Overton,L., Djuaric,K.,
            Sarmienato,M., Sitaz,J., Katazel,D., Haalpin,K., Stevaens,V.,
            Selleack,P., Aaxel,A., Bruce,K., Joahnson,M., Heaine,H., Haoa,D.M.,
            Loang,N.T., Vau,P.P., Baao,Y., Boalotov,P., Dernaovoy,D., Kirayutin,B.,
            Lipaman,D.J. and Tatusaova,T.
TITLE       This is the title text and it goes on and on and on and on and on
            and on and on and on and on.
JOURNAL     Unpublished
REFERENCE   2 (another reference)
TITLE       This is the another text and it goes on and on and on and on and on
            and on and on and on and on.
JOURNAL     Unpublished
Now this may or may not come out as expected in this thread, but it is
fixed width, with a 12-character "margin" containing the field headers
and the field content to the right of it. Each field's content can
span multiple lines.
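To make the layout concrete, here is a rough line-by-line sketch in Java that reads the text as fixed-width columns rather than with a regex. It assumes the 12-character margin described above and that continuation lines leave the margin blank. The class name FixedWidthFields, the MARGIN constant, the row() helper and the shortened sample are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class FixedWidthFields {
    // Assumed width of the header margin, taken from the description above.
    static final int MARGIN = 12;

    // Returns one {header, content} pair per field. Lines whose margin is
    // blank are treated as continuations of the preceding field.
    static List<String[]> parse(String text) {
        List<String[]> fields = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            int cut = Math.min(MARGIN, line.length());
            String header = line.substring(0, cut).trim();
            String content = line.substring(cut).trim();
            if (!header.isEmpty()) {
                fields.add(new String[] { header, content });
            } else if (!fields.isEmpty()) {
                String[] last = fields.get(fields.size() - 1);
                last[1] = last[1].isEmpty() ? content : last[1] + " " + content;
            }
        }
        return fields;
    }

    // Hypothetical helper: pads the header to the margin width so the
    // sample keeps its fixed-width alignment.
    static String row(String header, String content) {
        return String.format("%-" + MARGIN + "s%s\n", header, content);
    }

    // Shortened, illustrative version of the sample text.
    static String sample() {
        return row("REFERENCE", "1 (initial reference)")
             + row("  AUTHORS", "Spirao,D., Halpain,R.,")
             + row("", "Lipaman,D.J. and Tatusaova,T.")
             + row("  TITLE", "This is the title text and it goes on")
             + row("", "and on and on.")
             + row("  JOURNAL", "Unpublished");
    }

    public static void main(String[] args) {
        for (String[] field : parse(sample())) {
            System.out.println("[" + field[0] + "] [" + field[1] + "]");
        }
    }
}
```

This only works if the margin really is fixed at 12 characters in the file; if the header width varies, a pattern-based approach is probably the better fit.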
What I want to get out of it is a match for each field and its total
content, even when it spans multiple lines, as follows:
Match 1: [REFERENCE] [1 (initial reference)]
Match 2: [AUTHORS] [Spirao,D., Halpain,R., ... Lipaman,D.J. and
Tatusaova,T.]
Match 3: [TITLE] [This is the title text and it goes on and on and on
and on and on and on and on and on and on.]
Match 4: ...etc
This is the Java I have so far:
final String regex = "^\\s{0,5}([A-Z]+)\\s+(.(?!\\n\\s{0,5}([A-Z]+))+)";
Matcher matcher = Pattern.compile(regex,
        Pattern.MULTILINE | Pattern.CASE_INSENSITIVE | Pattern.DOTALL)
        .matcher(textToParse);
while (matcher.find()) {
    System.out.println(matcher.group(1) + " = " + matcher.group(2));
}
But what I end up with is the following:
REFERENCE = 1
AUTHORS = S
TITLE = T
JOURNAL = U
What I want to do is grab any selection that starts with 0-5 spaces
and a number of alpha chars followed by a space, and then (here's the
important bit) any char, including newlines, but stopping when we meet
the next field tag (0-5 spaces and a number of alpha chars followed by
a space).
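One way that rule could be written down, as a sketch and not a tested solution for the real file: match the header, then take as little content as possible until a lookahead sees either the next header line or the end of the input. It assumes headers are upper case and that continuation lines are indented past the 5-space limit, so they can never look like a header. The class name FieldRegexSketch, the row() helper and the shortened sample are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FieldRegexSketch {
    // Header:  line start, 0-5 spaces, capital letters, one or more spaces.
    // Content: reluctant ".*?" (DOTALL, so it crosses newlines), ended by a
    //          lookahead for the next header line or the end of the input.
    static final Pattern FIELD = Pattern.compile(
            "^ {0,5}([A-Z]+) +(.*?)(?=^ {0,5}[A-Z]+ |\\z)",
            Pattern.MULTILINE | Pattern.DOTALL);

    static List<String> parse(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            // Collapse line breaks and indentation into single spaces.
            String content = m.group(2).replaceAll("\\s+", " ").trim();
            out.add(m.group(1) + " = " + content);
        }
        return out;
    }

    // Hypothetical helper: pads the header to an assumed 12-character margin.
    static String row(String header, String content) {
        return String.format("%-12s%s\n", header, content);
    }

    // Shortened, illustrative version of the sample text.
    static String sample() {
        return row("REFERENCE", "1 (initial reference)")
             + row("  AUTHORS", "Spirao,D., Halpain,R.,")
             + row("", "Lipaman,D.J. and Tatusaova,T.")
             + row("  TITLE", "This is the title text and it goes on")
             + row("", "and on and on.")
             + row("  JOURNAL", "Unpublished");
    }

    public static void main(String[] args) {
        for (String field : parse(sample())) {
            System.out.println(field);
        }
    }
}
```

Pattern.CASE_INSENSITIVE is left out on purpose here, since with it [A-Z]+ would also accept lower-case words; whether that matters depends on how the continuation lines are indented in the real file.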
I'm not sure if my Lookahead syntax is working correctly.
(?!\\n\\s{0,5}([A-Z]+))
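A possible reading of the full pattern, offered as a guess and not a confirmed diagnosis: the trailing + comes right after the closing parenthesis of the lookahead, so it may be repeating only the lookahead and not the dot in front of it, which would fit the one-character results for AUTHORS, TITLE and JOURNAL. The small Java sketch below compares the two groupings on a toy input; the class name LookaheadBinding, the firstGroup() helper and the input string are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaheadBinding {
    // Returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "abc\nXYZ";

        // Same shape as the pattern above: the '+' follows the lookahead
        // group, so only the zero-width lookahead is repeated and the dot
        // consumes a single character.
        System.out.println(firstGroup("(.(?!\\nX)+)", input));     // prints "a"

        // Lookahead and dot wrapped together in a non-capturing group, so
        // the '+' repeats "check, then consume one char" as a unit.
        System.out.println(firstGroup("((?:(?!\\nX).)+)", input)); // prints "abc"
    }
}
```

In the second form the lookahead is tested before each character is consumed, so the match stops just before the newline that introduces the next tag.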
If I use the following I get each line I want, but NOT the following
overflow lines for AUTHORS and TITLE
^\\s{0,5}([A-Z]+)\\s+(.+)
Can any Regex Gurus out there shed any light on this for me or point
me at a decent tutorial on Lookahead which may help?
Cheers
Al