Get a count of records


Kun Lin

Jun 8, 2022, 4:35:50 PM
to pymarc Discussion
Is there a way I could use pymarc to get a count of how many records are in a file? It's a MARCXML file.

Thanks

Geoffrey Spear

Jun 9, 2022, 8:14:50 AM
to pym...@googlegroups.com
Here's what I do to count records in files in MARC21 transmission format: https://github.com/Wooble/marctools/blob/main/marctools/marccount.py

It shouldn't require too much modification to do XML instead, I think.
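
For the MARCXML case specifically, a minimal sketch (this isn't the linked marccount.py, which reads binary MARC) could use pymarc's parse_xml_to_array; note that it parses every record into memory, so a streaming approach may be preferable for very large files:

import sys

import pymarc

def count_marcxml_records(path):
    # parse_xml_to_array returns a list of Record objects,
    # so the count is just the length of that list.
    records = pymarc.parse_xml_to_array(path)
    return len(records)

if __name__ == "__main__":
    print(count_marcxml_records(sys.argv[1]))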


Ed Summers

Jun 9, 2022, 8:17:33 AM
to pym...@googlegroups.com
Hmm, I wonder if this could be the start of a CLI tool that gets installed with pymarc?

Geoffrey Spear

Jun 9, 2022, 8:24:11 AM
to pym...@googlegroups.com
Works for me. I had larger ambitions than the two tools I ended up with when I started the marctools repo, but those two functions (counting records and paging through a human-readable display of them) eliminated about 95% of what I was using MarcEdit for, so I stopped there. They're certainly small enough that it's silly to install them separately from PyPI.


Ed Summers

Jun 9, 2022, 8:51:06 AM
to pym...@googlegroups.com
That’s your call. I think there’s some value in keeping them separate, but I had forgotten about marctools, so maybe having batteries included might be good for letting people know that the tool is there?

Kun Lin

Jun 9, 2022, 11:34:34 AM
to pymarc Discussion
Ok. So you would have to loop through the file once to get the record count. What I was envisioning is that when processing a large MARC file, I could show a progress bar for how far along the processing is. Looping through the file twice would probably slow it down, I guess.
Kun
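
A sketch of that two-pass idea for a MARCXML file, using pymarc's streaming map_xml so neither pass holds the whole file in memory (the function names here are hypothetical):

import pymarc

def process_with_progress(path, handle_record):
    # First pass: just count the records.
    total = 0
    def count(_record):
        nonlocal total
        total += 1
    pymarc.map_xml(count, path)

    # Second pass: do the real work, reporting progress as we go.
    done = 0
    def handle(record):
        nonlocal done
        done += 1
        handle_record(record)
        print(f"\rrecord {done} of {total}", end="", flush=True)
    pymarc.map_xml(handle, path)
    print()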

Geoffrey Spear

Jun 9, 2022, 12:58:33 PM
to pym...@googlegroups.com
It depends on just how big the file is, and on how much slower whatever processing you're doing is than simply parsing the file.

I've done a lot of projects where I get a MARC file out of my ILS, make some changes, and write the records back to the ILS. Saving the records is so much slower than anything else I'm doing that reading the file twice doesn't make a noticeable difference in the runtime, and it's worth it to get progress. In the worst case, your processing is super fast per record and just adding a progress bar doubles the runtime.

You might be able to do something clever by checking the size of the file and estimating how far through it you are, but I'm not sure there's a good way, while iterating the records, to get the size of the XML each record came from.
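
For binary MARC (MARC21 transmission format), where the reader consumes the file handle record by record, that estimate is more straightforward; a sketch, assuming pymarc's MARCReader:

import os

import pymarc

def process_with_estimate(path, handle_record):
    total_bytes = os.path.getsize(path)
    with open(path, "rb") as fh:
        for record in pymarc.MARCReader(fh):
            handle_record(record)
            # fh.tell() is the position after the record just read, so
            # the ratio is a rough progress fraction with no second pass.
            print(f"\r{fh.tell() / total_bytes:.0%}", end="", flush=True)
    print()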

Rob Loomis

Jun 9, 2022, 1:51:49 PM
to pym...@googlegroups.com
Pretty sure neither .mrk nor MARCXML contains any header data that would provide a record count. That means that whether you code a loop yourself or it's a built-in pymarc function, the only way to get that value is to loop through the data. However, if all you're doing is getting a count, that loop may run fast enough that it doesn't add a noticeable amount of time to your process. Might be worth a try. I have a script that gets a record count and gives me a breakdown by encoding level, and it seems to run pretty fast. Of course, file size will be a factor.
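
A sketch along those lines (hypothetical, not Rob's actual script): one streaming pass over a MARCXML file that counts records while tallying encoding level from Leader position 17. The filename is made up:

from collections import Counter

import pymarc

def count_by_encoding_level(path):
    levels = Counter()
    def tally(record):
        # Leader/17 is the encoding level (' ' = full, '3' = abbreviated, ...).
        levels[record.leader[17]] += 1
    pymarc.map_xml(tally, path)
    return sum(levels.values()), levels

total, breakdown = count_by_encoding_level("records.xml")
print(total, dict(breakdown))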

-- 
Rob E. Loomis
Acquisitions and Discovery Department
North Carolina State University Libraries
Box 7131
Raleigh, NC 27695
(919) 515-4094