> How to read a file, one "record" (of more lines, with a consistent
> record delimiter) at a time?
> RECORD1
> some
> text
> RECORD2
> some
> other
> text
> RECORD3
> and
> much
> more
> text
> RECORD4
> etc.
> thanks
Probably the simplest way is an inner loop that calls READ-LINE, accumulating the lines, until it sees the delimiter, nested inside an outer loop that does whatever you want with each record. The inner loop needs an EOF check as well. You can write it as an explicit loop, a while/until construct, or an IF or COND with an explicit transfer of control. You could even use recursion.
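A minimal sketch of that nested-loop idea in Common Lisp. The names READ-RECORD and MAP-RECORDS are hypothetical, and it assumes the delimiter is a whole line compared with STRING= (the default "*RECORD*" is taken from later in the thread); it is one possible shape, not the poster's code.

```lisp
(defun read-record (stream delimiter)
  "Collect lines from STREAM until a line equal to DELIMITER or end of file.
Returns two values: the list of lines, and true when EOF ended the record."
  (loop for line = (read-line stream nil nil)
        until (or (null line) (string= line delimiter))
        collect line into lines
        finally (return (values lines (null line)))))

(defun map-records (function stream &optional (delimiter "*RECORD*"))
  "Call FUNCTION on each record (a list of lines) read from STREAM."
  (loop
    (multiple-value-bind (lines eofp) (read-record stream delimiter)
      ;; Assumption: a record with no lines (for example the one before
      ;; a leading delimiter) is skipped rather than reported.
      (when lines (funcall function lines))
      (when eofp (return)))))

;; Usage on an in-memory stream:
(with-input-from-string
    (in (format nil "*RECORD*~%some~%text~%*RECORD*~%other~%"))
  (let ((records '()))
    (map-records (lambda (record) (push record records)) in)
    (nreverse records)))
;; => (("some" "text") ("other"))
```

Only one record is held in memory at a time, which matters for a large file; whether skipping empty records is acceptable depends on the file's actual structure.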
hans <schatzer.joh...@gmail.com> writes:
> How to read a file, one "record" (of more lines, with a consistent
> record delimiter) at a time?
> [...]
What's the variable "$/"? Check perlvar(1perl).
-- Carl Lei (XeCycle)
Department of Physics, Shanghai Jiao Tong University
OpenPGP public key: 7795E591
Fingerprint: 1FB6 7F1F D45D F681 C845 27F7 8D71 8EC4 7795 E591
XeCycle <xecy...@gmail.com> writes:
> hans <schatzer.joh...@gmail.com> writes:
>> How to read a file, one "record" (of more lines, with a consistent
>> record delimiter) at a time?
>> [...]
> What's the variable "$/"? Check perlvar(1perl).
Sorry, I thought I was in comp.lang.perl.misc.
But I recommend Perl, too.
-- Carl Lei (XeCycle)
+---------------
| How to read a file, one "record" (of more lines, with a consistent
| record delimiter) at a time?
| RECORD1
| some
| text
| RECORD2
| some
| other
| text
| RECORD3
| and
| much
| more
| text
| RECORD4
| etc.
+---------------
If your record delimiter is actually something that matches the
regexp pattern "RECORD[0-9]+", then you can match/parse it very
easily with MISMATCH and PARSE-INTEGER:
> (loop with lines = '("RECORD1"
                       "some"
                       "text"
                       "RECORD2"
                       "some"
                       "other"
                       "text")
        for line in lines
        for delim-p = (eql 6 (mismatch "RECORD" line))
        for datum = (if delim-p (parse-integer line :start 6) line)
        collect (list delim-p datum))
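Building on the same MISMATCH / PARSE-INTEGER test, the flat list can be grouped into one entry per record. This is a sketch only: GROUP-NUMBERED-RECORDS is a hypothetical name, lines before the first header are silently dropped, and a body line such as "RECORDS" would make PARSE-INTEGER signal an error.

```lisp
(defun group-numbered-records (lines)
  "Group LINES into entries of the form (number body-line ...)."
  (let ((records '()))          ; newest record first, bodies reversed
    (dolist (line lines)
      (cond ((eql 6 (mismatch "RECORD" line))
             ;; A header: start a new record keyed by its number.
             (push (list (parse-integer line :start 6)) records))
            (records
             ;; A body line: attach it to the most recent record.
             (push line (cdr (first records))))))
    ;; Restore file order for the records and for each body.
    (nreverse
     (mapcar (lambda (record)
               (cons (car record) (reverse (cdr record))))
             records))))

(group-numbered-records
 '("RECORD1" "some" "text" "RECORD2" "some" "other" "text"))
;; => ((1 "some" "text") (2 "some" "other" "text"))
```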
The file is 145 MB, has about 20000 records, a record may have over
500 lines, but the record separator is simply and always *RECORD* on a
separate line.
Sorry for the above complication with RECORD1, RECORD2 ...
In Perl you would simply do
$/ = "*RECORD*";
as Tim Bradshaw and XeCycle say.
hans <schatzer.joh...@gmail.com> writes:
> On Oct 31, 3:14 am, "Pascal J. Bourguignon" <p...@informatimago.com>
> wrote:
>> hans <schatzer.joh...@gmail.com> writes:
>> > How to read a file, one "record" (of more lines, with a consistent
>> > record delimiter) at a time?
>> > [...]
>> By programming. That is, using one's brain.
>> I don't understand this kind of question. What problem do you have?
>> Do you have a problem of not knowing lisp I/O primitives?
>> Do you have a problem of not knowing how to read structured files?
>> Do you have a problem of not recognizing the structure of the file
>> (i.e. not being able to come up with a specification)?
> The file is 145 MB, has about 20000 records, a record may have over
> 500 lines, but the record separator is simply and always *RECORD* on a
> separate line.
> Sorry for the above complication with RECORD1, RECORD2 ...
Ok, so it seems you can recognize more or less the structure of the
file.
You say "separator", but in your example, it looks like the 'RECORD'
token is a prefix. You must choose what file structure you have:
    file   ::= { record } .
    record ::= 'RECORD' { line } .

    file   ::= { record } .
    record ::= { line } 'RECORD' .

    file   ::= [ record { 'RECORD' record } ] .
    record ::= { line } .
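To make the difference between the three grammars concrete, here is a sketch that splits a list of lines under each reading. SPLIT-RECORDS and its :GRAMMAR keyword are hypothetical names, and the token is assumed to be a whole line; the point is only that the same input yields different records (or an error) depending on which grammar you pick.

```lisp
(defun split-records (lines &key (token "*RECORD*") (grammar :separator))
  "Split LINES into records under GRAMMAR: :PREFIX, :SUFFIX or :SEPARATOR."
  (let ((records '())    ; finished records, newest first
        (current '())    ; lines of the record being built, newest first
        (started nil))   ; :PREFIX only: has the first header been seen?
    (flet ((finish ()
             (push (reverse current) records)
             (setf current '())))
      (dolist (line lines)
        (cond ((string/= line token)
               (when (and (eq grammar :prefix) (not started))
                 (error "Line ~S appears before the first ~A" line token))
               (push line current))
              ;; The first header of a :PREFIX file opens a record
              ;; without closing a previous one.
              ((and (eq grammar :prefix) (not started))
               (setf started t))
              (t (finish))))
      (ecase grammar
        (:prefix (when started (finish)))
        (:suffix (when current
                   (error "~D trailing line(s) after the last ~A"
                          (length current) token)))
        (:separator (finish)))
      (nreverse records))))

(split-records '("*RECORD*" "a" "*RECORD*" "b") :grammar :prefix)
;; => (("a") ("b"))
(split-records '("*RECORD*" "a" "*RECORD*" "b") :grammar :separator)
;; => (NIL ("a") ("b"))
```

Under :SUFFIX the same input signals an error, because the line "b" is never terminated by a token.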
But you didn't answer the other questions:
>> Do you have a problem of not knowing lisp I/O primitives?
>> Do you have a problem of not knowing how to read structured files?
"Pascal J. Bourguignon" <p...@informatimago.com> writes:
>> The file is 145 MB, has about 20000 records, a record may have over
>> 500 lines, but the record separator is simply and always *RECORD* on a
>> separate line.
>> Sorry for the above complication with RECORD1, RECORD2 ...
> Ok, so it seems you can recognize more or less the structure of the
> file.
> You say "separator", but in your example, it looks like the 'RECORD'
> token is a prefix. You must choose what file structure you have:
It's a wonderful illustration of Perl vs. CL differences.
Pascal Bourguignon mentioned possible variants of input grammar. Each of
them is fairly trivial to code in CL, and it might be just as trivial in
Perl. But that's not how people program in Perl, apparently: they
recognize $/ as something that has a chance to work, and they go on and
use it because it's simple and terse and "beautiful". Now let's look
closer at this beauty.
$/="*RECORD*" is obviously wrong:
*RECORD*
An item, that would set the *RECORD* straight.
*RECORD*
A previous record triggered a bug.
$/="\n*RECORD*\n" is somewhat better, but the *first* record header (if
there are headers) won't be recognized as record separator anymore.
(For a variant without \n's, we get an empty first record in this case,
but, of course, Perl people would "solve" it by ignoring empty records).
Now, a line-sensitive regular expression could be useful as separator
instead, but $/ is _not_ a regex, so we're out of luck. It's "better" to
leave it as $/="*RECORD*". A good Perl programmer would _document_ the
problem with inline *RECORD*s; that's the maximum quality we could
reasonably expect.
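The two failure modes can be reproduced without Perl. The sketch below only models the fact that $/ is a plain string matched anywhere in the input; SPLIT-ON-SUBSTRING, COUNT-DELIMITER-LINES and *SAMPLE* are hypothetical names, and the sample text is the one from this post.

```lisp
(defparameter *sample*
  (format nil "*RECORD*~%An item, that would set the *RECORD* straight.~%*RECORD*~%A previous record triggered a bug.~%"))

(defun split-on-substring (text separator)
  "Split TEXT at every occurrence of SEPARATOR, wherever it occurs.
SEPARATOR is assumed to be non-empty."
  (let ((pieces '())
        (start 0))
    (loop
      (let ((pos (search separator text :start2 start)))
        (push (subseq text start pos) pieces)
        (if pos
            (setf start (+ pos (length separator)))
            (return (nreverse pieces)))))))

(defun count-delimiter-lines (text delimiter)
  "Count the lines of TEXT that consist of DELIMITER and nothing else."
  (with-input-from-string (in text)
    (loop for line = (read-line in nil nil)
          while line
          count (string= line delimiter))))

;; Substring splitting also cuts at the inline *RECORD*:
(length (split-on-substring *sample* "*RECORD*"))            ; => 4
;; A whole-line comparison sees only the two real headers:
(count-delimiter-lines *sample* "*RECORD*")                  ; => 2
;; With newlines around the separator the first header is missed and
;; stays at the front of the first piece:
(search "*RECORD*"
        (first (split-on-substring *sample* (format nil "~%*RECORD*~%"))))
;; => 0
```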
Seriously, such a thing is "perfect" as one-shot throwaway code _only_.
But when I want to massage a text file once and forget about it, I'd
better open it in the _editor_, and with some replace-regexps it will
become a 145 MB file of S-expressions, which I'll then read with
CL:READ.
> On 2011-10-31, Tim Bradshaw <t...@tfeb.org> wrote:
>> hans <schatzer.joh...@gmail.com> wrote:
>>> How to read a file, one "record" (of more lines, with a consistent
>>> record delimiter) at a time?
>> I will no doubt be crucified for saying so but: Perl. Read it in Perl, spit
>> out sexps from Perl, and read those with Lisp.
> No way. There is a new text mangler with Lisp roots.
- :vars on the inner collect ensures that an empty collect still
produces a binding for the text variable (a binding to the empty list nil),
even if there is no match.
Kaz Kylheku <k...@kylheku.com> wrote:
> No way. There is a new text mangler with Lisp roots.
I don't think this is really different: my point wasn't really "use Perl",
it was "use the appropriate tool" (OK, I should have said that). There
probably are cases where there is a real reason to use x for everything,
but generally the "reason" is some kind of invented thing in people's
minds, and in fact it is just fine to use a combination of tools: AWK or
Perl or txr or what-have-you for file-munging, and Lisp or ... for other
bits.
> But that's not how people program in Perl,
> apparently: they recognize $/ as something that has a chance to work,
> and they go on and use it because it's simple and terse and
> "beautiful". Now let's look closer at this beauty.
> $/="*RECORD*" is obviously wrong:
> *RECORD*
> An item, that would set the *RECORD* straight.
> *RECORD*
> A previous record triggered a bug.
> $/="\n*RECORD*\n" is somewhat better, but the *first* record header
> (if there are headers) won't be recognized as record separator
> anymore. (For a variant without \n's, we get an empty first record in
> this case, but, of course, Perl people would "solve" it by ignoring
> empty records).
> Now, a line-sensitive regular expression could be useful as separator
> instead, but $/ is _not_ a regex, so we're out of luck. It's "better"
> to leave it as $/="*RECORD*". A good Perl programmer would _document_ the
> problem with inline *RECORD*s; that's the maximum quality we could
> reasonably expect.
To solve your "problem", a Perl programmer would probably just read and
discard the first header, and then set $/ to "\n*RECORD*\n". Your
strawman Perl programmers are too incompetent, you should fire them.
Carlos <an...@quovadis.com.ar> writes:
> To solve your "problem", a Perl programmer would probably just read and
> discard the first header, and then set $/ to "\n*RECORD*\n". Your
> strawman Perl programmers are too incompetent, you should fire them.
As we can see, even a competent, caring Perl programmer proposes "read
and discard" instead of "read, check and discard", and that's in the
discussion of correctness. Why would I need a strawman?
-- Regards, Anton Kovalenko
+7(916)345-34-02 | Elektrostal' MO, Russia
> [Anton Kovalenko <an...@sw4me.com>, 2011-10-31 20:59]
> [...]
> To solve your "problem", a Perl programmer would probably just read and
> discard the first header, and then set $/ to "\n*RECORD*\n". Your
> strawman Perl programmers are too incompetent, you should fire them.
If you discard the first header, but that record is empty,
then you're again left with a header which does not match
"\n*RECORD*\n"
\n is not a good substitute for anchors like ^ and $ which are not
character matches, but a semantic extension to regexes.
-- Alan Perlis Epigram 32. Programmers are not to be measured by their ingenuity
and their logic but by the completeness of their case analysis.
Anton Kovalenko <an...@sw4me.com> wrote:
> As we can see, even a competent, caring Perl programmer proposes "read
> and discard" instead of "read, check and discard", and that's in the
> discussion of correctness. Why would I need a strawman?
It's this kind of thing that makes me want to take Lisp programmers out and
shoot them.
Tim Bradshaw <t...@tfeb.org> writes:
>> As we can see, even a competent, caring Perl programmer proposes "read
>> and discard" instead of "read, check and discard", and that's in the
>> discussion of correctness. Why would I need a strawman?
> It's this kind of thing that makes me want to take Lisp programmers out and
> shoot them.
Your own suggestion to spit out sexps was perfectly sane (and it
doesn't need Perl, which is a good sign). What's ridiculous here is not
Perl, or Perl's $/, it's how people stick to a specific Perl feature
($/), even after it was shown to be a wrong tool in a number of ways
(Kaz Kylheku noticed an additional danger of empty records).
A similar thing could happen with CL. Imagine that we're parsing
command-line arguments, and there's one that should be an
integer-bounded range, like 1222-33334. Let's use parse-integer. Then
it turns out that 0xDEAD-0xDEEF is also valid, and 0177-0755 should be
octal and it's silently misinterpreted as decimal. Let's insert some
special cases and still use parse-integer. Then it turns out that we
accept 0x+12-0x+FF, which we shouldn't, and we insert some more code but
_still_ use parse-integer. Then 0-283 turns out to be misdetected as
octal and signals an error on 8...
Surely it _could_ happen with CL, but I have yet to see it happening.
> On 2011-10-31, Carlos <an...@quovadis.com.ar> wrote:
> > [Anton Kovalenko <an...@sw4me.com>, 2011-10-31 20:59]
> > [...]
> > To solve your "problem", a Perl programmer would probably just read
> > and discard the first header, and then set $/ to "\n*RECORD*\n".
> > Your strawman Perl programmers are too incompetent, you should fire
> > them.
> If you discard the first header, but that record is empty,
> then you're again left with a header which does not match
> "\n*RECORD*\n"
> \n is not a good substitute for anchors like ^ and $ which are not
> character matches, but a semantic extension to regexes.
Come on, you are testing a sketched algorithm against a made-up specification.
He was talking about *RECORD* being not a separator but a header. Now
you say there can be empty records? Then the Perl programmer would set
$/ to "*RECORD*\n" and join records if needed.
My point is that Perl programmers aren't necessarily stupid. That's all.
Oh, and also that Perl's augmented read-line simplifies the solution a
lot.
> > To solve your "problem", a Perl programmer would probably just read
> > and discard the first header, and then set $/ to "\n*RECORD*\n".
> > Your strawman Perl programmers are too incompetent, you should fire
> > them.
> As we can see, even a competent, caring Perl programmer proposes "read
> and discard" instead of "read, check and discard", and that's in the
> discussion of correctness. Why would I need a strawman?
Because I said "read and discard the first header", not "read and
discard anything whatsoever".
> > > To solve your "problem", a Perl programmer would probably just
> > > read and discard the first header, and then set $/ to
> > > "\n*RECORD*\n". Your strawman Perl programmers are too
> > > incompetent, you should fire them.
> > As we can see, even a competent, caring Perl programmer proposes
> > "read and discard" instead of "read, check and discard", and that's
> > in the discussion of correctness. Why would I need a strawman?
> Because I said "read and discard the first header", not "read and
> discard anything whatsoever".
                   ^^^^^^^^^^
I think this "whatsoever" here isn't right; I withdraw it.
Anton Kovalenko <an...@sw4me.com> writes:
>>> As we can see, even a competent, caring Perl programmer proposes "read
>>> and discard" instead of "read, check and discard", and that's in the
>>> discussion of correctness. Why would I need a strawman?
>> It's this kind of thing that makes me want to take Lisp programmers out and
>> shoot them.
[...]
> [I]t's how people stick to a specific Perl feature
> ($/), even after it was shown to be a wrong tool in a number of ways
> (Kaz Kylheku noticed an additional danger of empty records).
[...]
> Surely it _could_ happen with CL, but I have yet to see it happening.
Well, that was a gross overstatement: it happens all the time with
FORMAT ("~a-~a" is incorrect for making symbol names from other symbol
names, but widely used). And I have an idea why it happens with Perl and
with FORMAT, but not with most other CL stuff.
If we leave out FORMAT, CL doesn't have "killer features", that is,
things so shining with elegance and brevity that we're instantly tempted
to use them. There's nothing magic about PARSE-INTEGER, or SEARCH, or
MAPCAR..., you can write your own and use it, sometimes without any
performance penalty. When a tool is appropriate, you use it; when it's
not quite there, you roll your own. The original tool we wanted to use
usually provides some good hints on the interface we want to export
(e.g. our own parse-c-integer could take string, end, start, radix,
junk-allowed too, and :test & :key are useful for many other things).
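As an illustration of rolling your own, here is a sketch of the hypothetical parse-c-integer mentioned above, plus a range parser for the command-line example from the earlier post. It is narrower than the interface described (only :start and :end, no :radix or :junk-allowed), accepts unsigned literals only, and PARSE-C-RANGE is likewise an invented name.

```lisp
(defun parse-c-integer (string &key (start 0) (end (length string)))
  "Parse an unsigned C-style integer literal in STRING from START to END.
A 0x or 0X prefix selects hexadecimal, a leading 0 followed by more
characters selects octal, anything else is decimal."
  (flet ((digits (pos radix)
           ;; PARSE-INTEGER alone would accept a sign and surrounding
           ;; whitespace, so check every character first.
           (unless (and (< pos end)
                        (loop for i from pos below end
                              always (digit-char-p (char string i) radix)))
             (error "Not a C integer literal: ~S" (subseq string start end)))
           (parse-integer string :start pos :end end :radix radix)))
    (cond ((and (< (1+ start) end)
                (char= (char string start) #\0)
                (char-equal (char string (1+ start)) #\x))
           (digits (+ start 2) 16))
          ((and (< (1+ start) end)
                (char= (char string start) #\0))
           (digits (1+ start) 8))
          (t (digits start 10)))))

(defun parse-c-range (string)
  "Parse LOW-HIGH and return the two bounds as two values."
  (let ((dash (position #\- string)))
    (unless dash (error "No dash in range ~S" string))
    (values (parse-c-integer string :end dash)
            (parse-c-integer string :start (1+ dash)))))

(parse-c-range "0177-0755")       ; => 127, 493
(parse-c-range "0xDEAD-0xDEEF")   ; => 57005, 57071
(parse-c-range "0-283")           ; => 0, 283
```

A lone "0" falls through to the decimal branch, so "0-283" is not misread as octal, and "0x+12" is rejected because "+" is not a hexadecimal digit.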
In Perl, OTOH, _any_ feature is a killer feature. How would I roll my
own $/ or $_, if they were not there? Therefore, each feature that we're
using for a specific task has a chance of becoming addictive: it looks
like too much work to do if we dare to throw it away, even if it's not
really so hard for a specific task. It's not hard in Perl, after all, to
read a line at a time in a loop, check for "*RECORD*", collect a list --
that kind of boring thing we would do in CL.
> - :vars on the inner collect ensures that an empty collect still
> produces a binding for the text variable (a binding to the empty list nil),
> even if there is no match.
Enough of the trivial Hello, World stuff, and on to a more robust, realistic
solution to the problem.
New requirements:
- produce literals, and escape occurrences of " and single
escapes within literals
- catch RECORDX where X is not a number
- enforce that records start with RECORD<NUM>
We use a filter (filters are based on a trie data structure) to do the
stringification. A sprinkle of TXR's "blub-style for the Java spewing masses"
exception handling for the errors. We define a custom exception, derived
from exception type error.
We tighten the record collect with :gap 0 so that it does not skip nonmatching
garbage in its search for a header (not because we have to, but just for the
hell of it).
Look, Ma, one single regex used. For what regexes are designed for:
recognizing/validating a token.
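The TXR program itself is not reproduced in this text. For comparison, here is a Common Lisp sketch of the same three requirements; it is not TXR and not the poster's code. The names BAD-RECORD-HEADER, HEADER-NUMBER and RECORDS-TO-SEXPS are hypothetical, escaping is delegated to PRIN1, and the header check uses MISMATCH rather than a regex.

```lisp
(define-condition bad-record-header (error)
  ((line :initarg :line :reader bad-record-header-line))
  (:report (lambda (condition stream)
             (format stream "Bad record header: ~S"
                     (bad-record-header-line condition)))))

(defun header-number (line)
  "Return N if LINE is RECORD<N>, NIL if LINE does not start with RECORD.
Signals BAD-RECORD-HEADER for RECORD followed by anything but digits."
  (let ((m (mismatch "RECORD" line)))
    (cond ((and m (< m 6)) nil)                  ; an ordinary line
          ((and m (every #'digit-char-p (subseq line 6)))
           (parse-integer line :start 6))
          (t (error 'bad-record-header :line line)))))

(defun records-to-sexps (in out)
  "Copy records from stream IN to stream OUT as lists (N line ...)."
  (let ((current nil))                  ; (line ... N), newest first
    (flet ((flush ()
             (when current
               ;; PRIN1 escapes double quotes and backslashes in strings.
               (with-standard-io-syntax
                 (prin1 (reverse current) out))
               (terpri out))))
      (loop for line = (read-line in nil nil)
            while line
            do (let ((n (header-number line)))
                 (cond (n (flush) (setf current (list n)))
                       (current (push line current))
                       ;; Text before any header: records must start
                       ;; with RECORD<NUM>.
                       (t (error 'bad-record-header :line line)))))
      (flush))))

(read-from-string
 (with-input-from-string
     (in (format nil "RECORD1~%say \"hi\"~%RECORD2~%a\\b~%"))
   (with-output-to-string (out)
     (records-to-sexps in out))))
;; first value => (1 "say \"hi\"")
```

Note that a body line which merely begins with RECORD (for example "RECORDS") also signals the error; whether that is desirable depends on the data.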
Carlos <an...@quovadis.com.ar> wrote:
> My point is that Perl programmers aren't necessarily stupid. That's all.
My point was that as well, with the additional one that Lisp programmers
are often really disturbingly literal-minded (I'd like to believe it's just
the 8 of them remaining in cll, but I don't).
Anton Kovalenko <an...@sw4me.com> wrote:
> Your own suggestion to spit out sexps was perfectly sane (and it
> doesn't need Perl, which is a good sign). What's ridiculous here is not
> Perl, or Perl's $/, it's how people stick to a specific Perl feature
> ($/), even after it was shown to be a wrong tool in a number of ways
> (Kaz Kylheku noticed an additional danger of empty records).