Regex problem

Hendrik Maryns

Oct 8, 2007, 8:48:11 AM
(This is in Java, but the regex is general, therefore x-post to
c.l.p.m., f-up to c.l.j.h.)

Hi all,

I want to discard the header of some file. The header is everything
before a line beginning with "#BOS". However, I do not want #BOS to be
part of the match, since I need it later on.

I thought of using a regex to do that. I came up with

.*(?s)(?=#BOS)

However, this gave me nothing.
(To be precise, I have

Scanner corpus = new Scanner(inFile);
Pattern header = Pattern.compile(".*(?s)(?=#BOS)", Pattern.MULTILINE);
corpus.skip(header);

and it gives me

java.util.NoSuchElementException
at java.util.Scanner.skip(Scanner.java:1706)
at de.uni_tuebingen.sfb.lichtenstein.binarytrees.Converter2.main(Converter2.java:61)

so if any of the Java people see a problem there, please point it out.)

So to pinpoint my problem: I want a regex which matches any number of
lines until it finds a line beginning with #BOS, but does not include
#BOS in the match.

Other tries looked like this:

.*?(?s)(?=#BOS)
(.|\n)*?(?=#BOS) (this freezes the program)
.*(?=#BOS) with the MULTILINE option to Pattern.compile
.*(?s)^(?=#BOS)

and several others, but I find no solution. So my last resort is asking
here.

TIA, H.
--
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html

it_says_BALLS_on_your forehead

Oct 8, 2007, 10:43:07 AM
On Oct 8, 8:48 am, Hendrik Maryns <hendrik_mar...@despammed.com>
wrote:

The script below does what I think you're asking. Also, have you
thought about capturing the body instead of capturing the header and
skipping it?

i.e.

my ( $body ) = $str =~ m/(\#BOS.*)/s;

----------------------

#!perl
use strict; use warnings;

my $str = slurp_string();

print ">>$str<<\n";

my ( $header ) = $str =~ m/^(.*?)(?=\#BOS)/s;

print "Header: >>$header<<\n";


sub slurp_string {
    my $text = do { local $/; <DATA> };
    return $text;
}

__DATA__
line one
line 2
#BOS
content


OUTPUT:
-------------

bash-2.03$ ./regex.pl
>>line one
line 2
#BOS
content
<<
Header: >>line one
line 2
<<
bash-2.03$

Lew

Oct 8, 2007, 10:58:50 AM
it_says_BALLS_on_your forehead wrote:
> #!perl

Please translate that into Java.

--
Lew

Abigail

Oct 8, 2007, 12:08:52 PM
Hendrik Maryns (hendrik...@despammed.com) wrote on VCLI September
MCMXCIII in <URL:news:fed8ua$v7n$1...@newsserv.zdv.uni-tuebingen.de>:
** (This is in Java, but the regex is general, therefore x-post to
** c.l.p.m., f-up to c.l.j.h.)

I don't read the latter, so I won't post just there. Followups set to
clpm though.

** I want to discard the header of some file. The header is everything
** before a line beginning with "#BOS". However, I do not want #BOS to be
** part of the match, since I need it later on.
**
** I thought of using a regex to do that. I came up with
**
** .*(?s)(?=#BOS)

That changes the meaning of . *after* matching .*

/(?s).*(?=#BOS)/

would do, although I would write it as:

/^.*(?=#BOS)/s

Note that due to the .*, it will match everything up to the *last* occurrence
of #BOS. You might want to write that differently if you want to remove things
up to the first #BOS, for instance (untested):

/^[^#]*(?:#(?!BOS)[^#]*)*#BOS/

which does some loop unrolling, avoids the usage of .*? (which can be
costly), and doesn't need (?s) because there's no . in the pattern.

Note that I anchored the pattern to the beginning of the string. This
should speed up the case where no #BOS is present in the string matched
against.
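
To make the flag-placement point concrete, here is a small Java sketch. The class name HeaderRegexDemo, the helper prefixMatch, and the sample text are made up for illustration. It compares the original pattern with the variants discussed above, using Matcher.lookingAt, which anchors the match at the start of the input. One caveat: the unrolled pattern as posted ends in a literal #BOS and therefore consumes the marker; the sketch wraps that final #BOS in a lookahead, on the assumption that the marker should stay out of the match, as the original question asked.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the sample text is invented for illustration.
class HeaderRegexDemo {
    static final String SAMPLE =
        "line one\nline 2\n#BOS\ncontent\n#BOS\nmore\n";

    // Returns the text matched at the start of the input,
    // or null if the pattern does not match there.
    static String prefixMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // (?s) after .* : the dot still stops at the first line terminator,
        // so the lookahead never sees #BOS and there is no match.
        System.out.println(prefixMatch(".*(?s)(?=#BOS)", SAMPLE));

        // (?s) first, greedy: matches up to the LAST #BOS.
        System.out.println(prefixMatch("(?s).*(?=#BOS)", SAMPLE));

        // (?s) first, lazy: matches up to the FIRST #BOS.
        System.out.println(prefixMatch("(?s).*?(?=#BOS)", SAMPLE));

        // Unrolled loop, no dot at all; final #BOS kept in a lookahead
        // (an assumption, see the note above). Also stops at the first #BOS.
        System.out.println(
            prefixMatch("[^#]*(?:#(?!BOS)[^#]*)*(?=#BOS)", SAMPLE));
    }
}
```

None of these patterns checks that #BOS sits at the beginning of a line; adding the m flag and a ^ inside the lookahead would be one way to require that.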


Abigail
--
perl5.004 -wMMath::BigInt -e'$^V=Math::BigInt->new(qq]$^F$^W783$[$%9889$^F47]
.qq]$|88768$^W596577669$%$^W5$^F3364$[$^W$^F$|838747$[8889739$%$|$^F673$%$^W]
.qq]98$^F76777$=56]);$^U=substr($]=>$|=>5)*(q.25..($^W=@^V))=>do{print+chr$^V
%$^U;$^V/=$^U}while$^V!=$^W'

Lew

Oct 8, 2007, 1:43:24 PM
> Hendrik Maryns (hendrik...@despammed.com) wrote on VCLI September
> MCMXCIII in <URL:news:fed8ua$v7n$1...@newsserv.zdv.uni-tuebingen.de>:
> ** (This is in Java, but the regex is general, therefore x-post to
> ** c.l.p.m., f-up to c.l.j.h.)

Abigail wrote:
> I don't read the latter, so I won't post just there. Followups set to
> clpm though.

But the OP /does/ read clj.help, and pointed out that his problem is in Java,
so redirecting the answers away from clj.help is pure arrogance.

--
Lew

Tad McClellan

Oct 8, 2007, 6:45:10 PM

[ f-up set to a newsgroup that I participate in. ]


Lew <l...@lewscanon.com> wrote:
>> Hendrik Maryns (hendrik...@despammed.com) wrote on VCLI September
>> MCMXCIII in <URL:news:fed8ua$v7n$1...@newsserv.zdv.uni-tuebingen.de>:
>> ** (This is in Java, but the regex is general, therefore x-post to
>> ** c.l.p.m., f-up to c.l.j.h.)

                ^^^^^^^^^^^^^^^^


> Abigail wrote:
>> I don't read the latter, so I won't post just there. Followups set to

                                                        ^^^^^^^^^^


>> clpm though.
>
> But the OP /does/ read clj.help, and pointed out that his problem is in Java,


And he will see Abigail's helpful answer there.

So what's the problem?


> so redirecting the answers away from clj.help is pure arrogance.


He did not redirect answers away!

His post containing an answer was posted to the newsgroup that
the OP asked for.

Abigail does not participate in clj.help, and so won't
be able to see any followups to his post.

Dumping stuff into a newsgroup you do not read is arrogance.


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"

Roedy Green

Oct 8, 2007, 11:16:00 PM
On Mon, 08 Oct 2007 14:48:11 +0200, Hendrik Maryns
<hendrik...@despammed.com> wrote, quoted or indirectly quoted
someone who said :

>I want to discard the header of some file. The header is everything
>before a line beginning with "#BOS". However, I do not want #BOS to be
>part of the match, since I need it later on.

You are allowed to use tools like indexOf and substring too. There is
no rule that says you have to use the single tool Regex to solve the
entire problem.
--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Roedy Green

Oct 9, 2007, 12:00:38 AM
On Tue, 09 Oct 2007 03:16:00 GMT, Roedy Green
<see_w...@mindprod.com.invalid> wrote, quoted or indirectly quoted
someone who said :

>You are allowed to use tools like indexOf and substring too. There is
>no rule that says you have to use the single tool Regex to solve the
>entire problem.

The most common error people make is trying to solve a problem with
one immensely complicated regex. Often you can solve the problem more
simply and faster with several simpler regexes and some indexOfs and
substrings.
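
As a rough illustration of the indexOf/substring approach, the sketch below strips everything before the first line that begins with #BOS. The class and method names are hypothetical, and it assumes the whole text is already in memory as a String and that lines end in \n or \r\n. Returning an empty string when no marker line exists is an arbitrary choice made for the example.

```java
// Hypothetical helper; assumes the whole text is already in memory.
class HeaderStripper {
    static final String MARKER = "#BOS";

    // Returns the input starting at the first line that begins with MARKER,
    // or the empty string if there is no such line.
    static String stripHeader(String text) {
        if (text.startsWith(MARKER)) {
            return text; // no header at all
        }
        // A marker at the start of a later line is always preceded by '\n'.
        int i = text.indexOf("\n" + MARKER);
        return i < 0 ? "" : text.substring(i + 1);
    }

    public static void main(String[] args) {
        String sample = "line one\nline 2\n#BOS\ncontent\n";
        System.out.print(stripHeader(sample));
    }
}
```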

Hendrik Maryns

Oct 10, 2007, 10:31:59 AM
it_says_BALLS_on_your forehead wrote:

I did not find appropriate methods in the Scanner API to do that.

> my ( $header ) = $str =~ m/^(.*?)(?=\#BOS)/s;

Thanks, this works, after adapting it to Java.
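
The adapted Java code was not posted, so the following is only a guess at what such an adaptation could look like with Scanner.skip. The class name, the method name, and the sample input are hypothetical. The m flag and the ^ inside the lookahead are an addition that restricts the match to a #BOS at the start of a line, as described in the original question.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical adaptation; the poster's actual Java code was not shown.
class ScannerHeaderSkip {
    // (?s): the dot also matches line terminators.
    // (?m): ^ matches at the start of every line.
    // The lazy .*? stops at the first line that begins with #BOS.
    static final Pattern HEADER = Pattern.compile("(?sm).*?(?=^#BOS)");

    // Skips the header and returns the next line, which starts with #BOS.
    // Scanner.skip throws NoSuchElementException if no such line exists.
    static String firstLineAfterHeader(Scanner corpus) {
        corpus.skip(HEADER);
        return corpus.nextLine();
    }

    public static void main(String[] args) {
        Scanner corpus = new Scanner("line one\nline 2\n#BOS\ncontent\n");
        System.out.println(firstLineAfterHeader(corpus));
        System.out.println(corpus.nextLine());
    }
}
```

Note that the scanner still has to buffer the whole header while it searches for the marker, so this is sequential only from the caller's point of view.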

Hendrik Maryns

Oct 10, 2007, 10:44:13 AM
Roedy Green wrote:

> On Tue, 09 Oct 2007 03:16:00 GMT, Roedy Green
> <see_w...@mindprod.com.invalid> wrote, quoted or indirectly quoted
> someone who said :
>
>> You are allowed to use tools like indexOf and substring too. There is
>> no rule that says you have to use the single tool Regex to solve the
>> entire problem.
>
> The most common error people make is trying to solve a problem with
> one immensely complicated regex. Often you can solve the problem more
> simply and faster with several simpler regexes and some indexOfs and
> substrings.

I acknowledge that, but I would like to process the file sequentially,
and I see no immediate possibility to do it as you suggest without
reading in the whole file, which is otherwise unnecessary.

Could you elaborate a little bit how this would work?

Thanks, H.

Roedy Green

Oct 10, 2007, 7:44:00 PM
On Wed, 10 Oct 2007 16:44:13 +0200, Hendrik Maryns
<hendrik...@despammed.com> wrote, quoted or indirectly quoted
someone who said :

>I acknowledge that, but I would like to process the file sequentially,
>and I see no immediate possibility to do it as you suggest without
>reading in the whole file, which is otherwise unnecessary.
>
>Could you elaborate a little bit how this would work?

I don't have a complete solution, but you could read the first N
chars of the file, then do a startsWith.

Patricia Shanahan

Oct 10, 2007, 8:05:59 PM
Roedy Green wrote:
> On Wed, 10 Oct 2007 16:44:13 +0200, Hendrik Maryns
> <hendrik...@despammed.com> wrote, quoted or indirectly quoted
> someone who said :
>
>> I acknowledge that, but I would like to process the file sequentially,
>> and I see no immediate possibility to do it as you suggest without
>> reading in the whole file, which is otherwise unnecessary.
>>
>> Could you elaborate a little bit how this would work?
>
> I don't have a complete solution, but you could read the first N
> chars of the file, then do a startsWith.

In the original message, Hendrik Maryns said:

> I want to discard the header of some file. The header is everything
> before a line beginning with "#BOS". However, I do not want #BOS to be
> part of the match, since I need it later on.


If the end of header is a line, why not read the file a line at a time,
using a BufferedReader? Write a very simple regex that matches any line
beginning with "#BOS", or apply startsWith to each line. Discard lines
until one matches, then go into normal processing beginning with that line.
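
A minimal sketch of that line-at-a-time approach, under a few assumptions: the class and method names are made up, a StringReader stands in for the real file, and the body lines are collected into a list only so the result is easy to inspect (real code would process each line where the comment indicates).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the StringReader in main stands in for a FileReader.
class LineHeaderSkipper {
    // Discards lines until one starts with #BOS, then returns that line
    // and all following lines. Returns an empty list if no line matches.
    static List<String> readBody(BufferedReader in) throws IOException {
        List<String> body = new ArrayList<>();
        boolean inBody = false;
        String line;
        while ((line = in.readLine()) != null) {
            if (!inBody && line.startsWith("#BOS")) {
                inBody = true; // header ends here; keep this line
            }
            if (inBody) {
                body.add(line); // normal processing would go here
            }
        }
        return body;
    }

    public static void main(String[] args) throws IOException {
        String sample = "line one\nline 2\n#BOS\ncontent\n";
        try (BufferedReader in = new BufferedReader(new StringReader(sample))) {
            for (String line : readBody(in)) {
                System.out.println(line);
            }
        }
    }
}
```

Only one line is held at a time while the header is being discarded, which addresses the concern about reading in the whole file.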

Patricia

Roedy Green

Oct 10, 2007, 10:57:14 PM
On Wed, 10 Oct 2007 23:44:00 GMT, Roedy Green
<see_w...@mindprod.com.invalid> wrote, quoted or indirectly quoted
someone who said :

>I don't have a complete solution, but you could read the first N
>chars of the file, then do a startsWith.

If your stream is a disk file, it would not be that extravagant to
read the file once to determine the presence of the header, then open
it again with the knowledge of its presence. You can use skip to hop
over it the second time if needed.

Another possibility is to use NIO, which will let you process the
data multiple times.

Lew

Oct 11, 2007, 1:51:36 AM
Hendrik Maryns wrote:
>> The script below does what I think you're asking. Also, have you
>> thought about capturing the body instead of capturing the header and
>> skipping it?

it_says_BALLS_on_your forehead wrote:


> I did not find appropriate methods in the Scanner API to do that.

Huh? He's talking about a different regex, not different methods.

--
Lew
