
Re: AW: [librecat-dev] A common MARC record path language


Patrick Hochstenbach

Jan 21, 2014, 3:56:47 AM
to Klee, Carsten, vo...@gbv.de, librec...@mail.librecat.org, perl...@perl.org
Hi Carsten

Apologies for the late reply; it took a while to get the system booted
after the winter vacation.

You are right in the discussion about which parts should be specified by a
MARCspec language and which parts should be implemented as operations on
the nodes found. I gave the examples not as a hint for the implementation
language (e.g. whether it requires regular expressions or not) but as
examples of MARC in the wild (non-standard tags) and MARC combined with
cataloging rules (where subfields and characters in front of a subfield
have a special meaning).

In daily work I often encounter mapping rules which involve these special
subfield cases (“Take everything from the 245 until you hit the first /
before a subfield”). These things can’t easily be expressed (or can they?) in
XPath when using XSLT, or in MARCspec when using tools like Catmandu, but they
are very common and can be shared across tools. I think these would be
candidates to formalise.


Cheers
Patrick

On 06/01/14 16:33, "Klee, Carsten" <Carste...@sbb.spk-berlin.de> wrote:

>
>On the other hand I could imagine something like "100[0]" for the first
>100 field (author) and "100[1]" for the second and so on. But what
>about repeatable subfields? Maybe someone requires the first subfield "a"
>of the second 100 field. Besides, the characters "[" and "]" are also
>valid subfield codes (see [2]).
>
>With substrings it is more complicated. I only could imagine using
>regular expressions. Maybe something like 245a[Œ\s(.*)]_10. But for
>usability reasons this might be better left to the applications. Isn't
>there something in Catmandu like
>marc_map('245','my.title', -substring-after => 'Œ '); ??
>
>Maybe you have another solution for that?
>
>Another issue I suspect with your last example under
>https://metacpan.org/pod/Catmandu::Fix::marc_map
>
># Copy all 100 subfields except the digits to the 'author' field
>marc_map('100^0123456789','author');
>
>In the current MARCspec this would be interpreted as "a reference to
>subfields ^, 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 of field 100". This is
>because "^" is a valid subfield code (see [2]).
>
>So far... I would be happy to read more comments on this.
>
>Cheers!
>
>Carsten
>
>
>[1] <https://github.com/cKlee/marc-spec/issues>
>[2] <http://www.loc.gov/marc/specifications/specrecstruc.html#varifields>
>_______________________________________________
>Carsten Klee
>Abt. Überregionale Bibliographische Dienste IIE
>Staatsbibliothek zu Berlin – Preußischer Kulturbesitz
>
>Fon: +49 30 266-43 44 02
>
>> -----Original Message-----
>> From: Patrick Hochstenbach [mailto:Patrick.Ho...@UGent.be]
>> Sent: Friday, 20 December 2013 14:06
>> To: vo...@gbv.de; librec...@mail.librecat.org; perl...@perl.org
>> Cc: Klee, Carsten
>> Subject: Re: [librecat-dev] A common MARC record path language
>>
>> Hi
>>
>> Thanks for this initiative to formalise the path language for MARC
>> records. In Catmandu our path language is better described at:
>> https://metacpan.org/pod/Catmandu::Fix::marc_map. It would be an easy fix
>> for us to follow Carsten's MARC spec rules and I will gladly implement it
>> for our community.
>>
>> We see these types of MARC paths in programming libraries such as the
>> projects mentioned below but also in products like XSLT, SolrMarc,
>> ILS vendors who need them to define how to index MARC, and standardisation
>> bodies that provide mapping rules (e.g.
>> http://www.loc.gov/standards/mods/mods-mapping.html). I tried to make a
>> small roundup of these projects in the past but it would be great to have
>> a more extensive look at all current practices.
>>
>> In our Catmandu project we found that XPath expressions are too verbose for our
>> librarians to interpret and in practice tied to XSLT programming, which
>> requires quite some programming skills to read and interpret.
>>
>> Our paths are very much simplified but still seem to lack some things that
>> are available in the MARC data model which would be great to have
>> available in the MARCspec syntax:
>>
>> - Notion of pointing to the first item (first author)
>> - Supporting locally defined MARC (sub)fields (e.g. Ex Libris exports
>> contain all kinds of Z30, CAT, etc. fields)
>> - Support for pointing to subfields that follow a specific character
>> (e.g. in titles I would like to point to everything after the '/' in a 245
>> field).
>>
>> Cheers and have a nice holiday
>>
>> Patrick
>>
>>
>> On 19/12/13 13:16, "Jakob Voß" <vo...@gbv.de> wrote:
>>
>> >Hi,
>> >
>> >Carsten Klee specified a simple path language for MARC records, called
>> >"MARC spec". In short it is a formal syntax to refer to selected parts
>> >of a MARC record (similar to XPath for XML):
>> >
>> >http://collidoscope.de/lld/marcspec-as-string.html
>> >http://cklee.github.io/marc-spec/marc-spec.html#examples
>> >
>> >Similar languages have been invented before but not with a strict
>> >specification, as far as I know. For instance the perl Catmandu::MARC
>> >supports references to MARC fields:
>> >
>> >https://metacpan.org/pod/Catmandu::Fix::Inline::marc_map
>> >https://metacpan.org/source/NICS/Catmandu-MARC-0.103/lib/Catmandu/Fix/Inline/marc_map.pm#L26
>> >
>> >Could you please have a look at MARC spec and join forces to get a
>> >common syntax that can be used among different tools? So
>> >
>> >- If your tool does not support all aspects of MARC spec, please
>> >implement the missing parts.
>> >
>> >- If your tool supports more than is included in MARC spec, help extend
>> >the syntax at https://github.com/cKlee/marc-spec/
>> >
>> >- If your tool uses a different syntax to refer to parts of MARC,
>> >please think about modifying it to align with MARC spec.
>> >
>> >Cheers,
>> >Jakob
>> >
>> >--
>> >Jakob Voß <jakob...@gbv.de>
>> >Verbundzentrale des GBV (VZG) / Common Library Network
>> >Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
>> >+49 (0)551 39-10242, http://www.gbv.de/
>> >
>> >_______________________________________________
>> >librecat-dev mailing list
>> >librec...@mail.librecat.org
>> >http://mail.librecat.org/mailman/listinfo/librecat-dev
>

Klee, Carsten

Feb 18, 2014, 11:47:17 AM
to Patrick Hochstenbach, vo...@gbv.de, librec...@mail.librecat.org, perl...@perl.org
Hi Patrick,

I'm sorry for my even later reply. Since your mail, MARCspec [1] has developed a lot. Unfortunately I haven't found a solution for the issue you described. Honestly, I'm not sure I understand the problem you want to solve here.

I understand that there is MARC data combined with cataloging rules. We don't use this approach within our MARC, so I'm not really aware of the problems involved.

You mentioned the typical case “Take everything from the 245 until you hit the first / before a subfield”. I thought about this, but didn't come even close to a way of expressing it in MARCspec. How is this solved in Catmandu now?

If you have any suggestions, how this could be expressed in a string, please give me a hint.

The only thing I can imagine, is a reference to a character within a subfield. Something like

245$a[0]/-1

could be read as "A reference to the last character of the first subfield 'a' of field 245". Then you could check whether the referenced character is "/". But I think that doesn't solve your problem, right?
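
A minimal sketch of how an application might resolve such a character reference. The record layout (an array of fields, each with a list of [code, value] pairs) and the function name charOfSubfield are assumptions made for illustration; they are not part of MARCspec or Catmandu.

```javascript
// Hypothetical record layout: each field is { tag, subfields: [[code, value], ...] }.
const mozartRecord = [
  {
    tag: "245",
    subfields: [
      ["a", "Concerto per piano n. 21, K 467"],
      ["h", "[sound recording] /"],
      ["c", "W.A. Mozart"],
    ],
  },
];

// Return the character at position `pos` (negative values count from the end)
// of the `index`-th occurrence of subfield `code` in the first field with `tag`.
function charOfSubfield(record, tag, code, index, pos) {
  const field = record.find((f) => f.tag === tag);
  if (!field) return undefined;
  const values = field.subfields
    .filter(([c]) => c === code)
    .map(([, v]) => v);
  const value = values[index];
  if (value === undefined) return undefined;
  const i = pos < 0 ? value.length + pos : pos;
  return value[i];
}

// Roughly what 245$a[0]/-1 and 245$h[0]/-1 would refer to in this sample.
console.log(charOfSubfield(mozartRecord, "245", "a", 0, -1));
console.log(charOfSubfield(mozartRecord, "245", "h", 0, -1));
```

In this sample the "/" sits at the end of $h, not $a, so a caller would have to probe every subfield to find the mark, which illustrates why a single character reference does not cover the use case.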

However I would be very glad, if MARCspec gets adopted by Catmandu. If you're interested, I've written some algorithm rules for MARCspec parsers [2]. They are very comprehensive. Maybe there is a smarter algorithm, but this might give you some clue.

Cheers!

Carsten

[1] <http://cklee.github.io/marc-spec/marc-spec.html>
[2] <https://github.com/cKlee/marc-spec/blob/master/marc-spec-parser-rules.md>
_______________________________________________
Carsten Klee
Abt. Überregionale Bibliographische Dienste IIE
Staatsbibliothek zu Berlin – Preußischer Kulturbesitz

Fon: +49 30 266-43 44 02


> -----Original Message-----
> From: Patrick Hochstenbach [mailto:Patrick.Ho...@UGent.be]
> Sent: Tuesday, 21 January 2014 09:57
> To: Klee, Carsten; vo...@gbv.de; librec...@mail.librecat.org;
> perl...@perl.org
> Subject: Re: AW: [librecat-dev] A common MARC record path language

Thomas Berger

Feb 18, 2014, 7:03:34 PM
to Klee, Carsten, Patrick Hochstenbach, vo...@gbv.de, librec...@mail.librecat.org, perl...@perl.org

On 18.02.2014 17:47, Klee, Carsten wrote:

> I understand that there is MARC data combined with cataloging rules. We
> don't use this approach within our MARC. So I'm not really aware of the problematics.

"Your" MARC however will be very much interested in "/" (or "=") as the first
character of some subfield in 245 if I recall correctly. Not such a big
difference I would think. But maybe a slight complication of the matter,
since MARCspec should have to cope with both approaches...

Thomas Berger

Klee, Carsten

Feb 19, 2014, 8:27:58 AM
to t...@gymel.com, Patrick Hochstenbach, vo...@gbv.de, librec...@mail.librecat.org, perl...@perl.org
Hi Thomas and Patrick!

I think the whole problem lies in the limited expressivity of strings. MARCspec is quite close to XPath in its approach, but without regular expressions and functions like first(), last() etc. But even with XPath it would be pretty hard to get the character before a subfield in a MARCXML file.

The only solution I can think of, is using regular expressions. And I'm not convinced that bringing this into MARCspec is a good idea. As I already mentioned in the spec, MARCspec is not independent from the application using MARCspec. Taking regular expressions into MARCspec wouldn't make the application more usable, but would blow up the specification.

One example:

The data in field 245 is:

"$aConcerto per piano n. 21, K 467$h[sound recording] /$cW.A. Mozart"

The desired result is (rule: take everything from 245 until the string ' /$' appears):

"Concerto per piano n. 21, K 467 [sound recording]"

Imagine a MARCspec with regular expressions. // pseudo code coming up!

marcspec = "245.match(/(.*)\s\/\$/)"
titleData = getMARCspec(record, marcspec)
print titleData[1]
// should result in "$aConcerto per piano n. 21, K 467$h[sound recording]"

Now pretty much the same, but without the regular expression in the MARCspec.

marcspec = "245"
titleData = getMARCspec(record, marcspec).match(/(.*)\s\/\$/)
print titleData[1]
// should result in "$aConcerto per piano n. 21, K 467$h[sound recording]"

You see, nothing is gained here.

But an application could provide a special function like

function takeEverythingFromSpecUntilYouHitBeforeSubfield(marcspec, hitWhat, record)
{
    // get the data before the / or = or else
    var regex = new RegExp("(.*)\\s\\" + hitWhat + "\\$")
    var data = getMARCspec(record, marcspec).match(regex)[1]

    // now split on subfield codes; the first element is the empty string
    // in front of the first subfield code, so skip it
    var dataSplit = data.split(/\$[a-z0-9]/)

    // join the subfield contents into the result, separated by single spaces
    return dataSplit.slice(1).join(" ")
}

In Catmandu or elsewhere the user calls the function

takeEverythingFromSpecUntilYouHitBeforeSubfield("245","/",record)

--> this should result in the desired "Concerto per piano n. 21, K 467 [sound recording]".
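
For readers who want to try this out, here is a self-contained sketch of the same idea. It assumes a toy record that maps a tag to its field content as a "$"-delimited string; getFieldString is a hypothetical stand-in for getMARCspec that only resolves a bare tag. Unlike the greedy (.*) above, which stops at the last mark in the field, this version stops at the first one, which is what the rule "until you hit the first /" asks for.

```javascript
// Hypothetical record layout: tag -> field content as a "$"-delimited string.
const sampleRecord = {
  "245": "$aConcerto per piano n. 21, K 467$h[sound recording] /$cW.A. Mozart",
};

// Stand-in for getMARCspec: resolves a bare field tag only.
function getFieldString(record, tag) {
  return record[tag] || "";
}

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return the subfield contents of `tag` up to the first `mark` that sits
// directly in front of a subfield code, joined by single spaces.
// If the mark never occurs, the whole field is returned.
function takeUntilMarkBeforeSubfield(record, tag, mark) {
  const field = getFieldString(record, tag);
  const stop = new RegExp("\\s" + escapeRegExp(mark) + "\\$");
  const cut = field.search(stop);
  const head = cut === -1 ? field : field.slice(0, cut);
  // Element 0 is the empty string in front of the first subfield code.
  return head.split(/\$[a-z0-9]/).slice(1).join(" ");
}

console.log(takeUntilMarkBeforeSubfield(sampleRecord, "245", "/"));
```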

If there is any other approach you can think of, pleeeease make a proposal or give me a substantial discussion here. Otherwise I can't see any options solving this problem in MARCspec.

Cheers!

Carsten
_______________________________________________
Carsten Klee
Abt. Überregionale Bibliographische Dienste IIE
Staatsbibliothek zu Berlin - Preußischer Kulturbesitz

Fon: +49 30 266-43 44 02

> -----Original Message-----
> From: Thomas Berger [mailto:T...@Gymel.com]
> Sent: Wednesday, 19 February 2014 01:04
> To: Klee, Carsten; 'Patrick Hochstenbach'
> Cc: vo...@gbv.de; librec...@mail.librecat.org; perl...@perl.org
> Subject: Re: [librecat-dev] A common MARC record path language

Patrick Hochstenbach

Feb 19, 2014, 2:36:40 PM
to Klee, Carsten, t...@gymel.com, vo...@gbv.de, librec...@mail.librecat.org, perl...@perl.org
Hi Carsten

Thanks for the new spec. I think it is a great initiative to align the many projects that process MARC records. Here are some general remarks that I hope we can use to discuss the spec in more depth.

What I'm missing when reading the specification is a separate use-case document. In the spec I see sections like the introduction of "2 Expressing MARCspecs as string" and "2.1" which are design concerns that require a separate discussion from the formal part of the document. I mean, I can agree or disagree with the design concerns, but with the formal section I should be able to say whether it is correct or not.

The discussion we have here in this email thread deserves a separate document of use-cases. Producing Linked Data is only one of the cases. SolrMarc is about transforming MARC into something that can be sent to Solr. In ILS systems you might use it to point to parts of MARC you want to display in a web interface. In Catmandu you might want to produce reports. Every use-case can have its own needs to make parts of MARC easily addressable.

We need tools like easyM2R, SolrMarc and Catmandu not only because of the verbosity of XPath or because it is tied to one possible serialization of MARC. Of course I love to write

100$a instead of /marc:record/marc:datafield[@tag='100']/marc:subfield[@code='a']

This opens up a new class of easy DSL tools to process our datasets.

But this treats MARC as a document key-value exchange format for bibliographical data. And I can't agree with that, or at least not in a strict sense. I can just as easily state that MARC is a mark-up language that requires more processing after the first mappings have been made. E.g. if you want to map 260$c to an xsd:date field you really need to get rid of the trailing dot '.' at the end. MARC is a key-value exchange format only as a first approximation.
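
As a small illustration of that post-processing step, the sketch below strips trailing punctuation from a 260$c-style value and pulls out a four-digit year. The function names are hypothetical, and real 260$c values are far more varied than this handles; it only covers the simple "trailing dot" case mentioned above.

```javascript
// Remove trailing whitespace and ISBD-style punctuation such as ".", ",", ";", ":" or "/".
function stripTrailingPunctuation(value) {
  return value.replace(/[\s.,;:\/]+$/, "");
}

// Return the first run of four digits as a year string, or null if there is none.
function yearFrom260c(value) {
  const match = /\d{4}/.exec(stripTrailingPunctuation(value));
  return match ? match[0] : null;
}

console.log(stripTrailingPunctuation("1998."));
console.log(yearFrom260c("c1913."));
```

A value without a plain four-digit year yields null here, so the caller can decide whether to skip the field or report it.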

Using cataloging rules you can get much more information out of the record. And I wonder if in a second approximation we could add paths that implement some of that logic.

For instance, as a silly example:

245{/$.} : could evaluate to everything in 245 until you hit the first /$$subfield

In Catmandu... well, we don't have a spec for that. We do the same things as in easyM2R and SolrMarc and create a small DSL of functions that get MARCspecs as input. Of course we could all agree on the same collection of functions like move_field, split_field, copy_field etc. But I hope there are other options as well.

Cheers
Patrick

________________________________________
From: Klee, Carsten [Carste...@sbb.spk-berlin.de]
Sent: Wednesday, February 19, 2014 2:27 PM
To: 't...@gymel.com'; Patrick Hochstenbach
Cc: vo...@gbv.de; librec...@mail.librecat.org; perl...@perl.org
Subject: AW: [librecat-dev] A common MARC record path language

Thomas Berger

Feb 19, 2014, 5:05:54 PM
to Klee, Carsten, Patrick Hochstenbach, vo...@gbv.de, librec...@mail.librecat.org, perl...@perl.org

Hi Carsten,

> I think the whole problem lies in the limited expressivity of strings.
> MARCspec is pretty much close to XPath at its approach, but without regular
> expressions and functions like first(), last() etc. But even with XPath it would
> be pretty hard to get the character before a subfield in a MARCXML file.
>
> The only solution I can think of, is using regular expressions. And I'm not
> convinced that bringing this into MARCspec is a good idea. As I already
> mentioned in the spec, MARCspec is not independent from the application using
> MARCspec. Taking regular expressions into MARCspec wouldn't make the application
> more usable, but would blow up the specification.

Agreed, therefore regular expressions or other /general/ mechanisms
should not be the way to go (for specifying MARCspecs - specific implementations
may realize it using a regexp implementation at hand).

Thus, yes, the limited expressiveness of strings demands that the most
typical and most important "operations" on MARC records be
expressible. But if it's too limited (say it could only extract fields
or has blind spots - parts of record data which cannot be accessed at all)
it wouldn't be of any use.

Thus MARCspecs need a convincing approach to the peculiarities of MARC
records:

Subfields are not always data elements in a proper sense, sometimes
they are just marks interspersed into the field content.

And as Patrick pointed out there is the presence of non-MARC delimiters
(markup) which is crucial for processing of some (sub)fields.

Many fields contain "ensembles" of subfields with one nature, accompanied
by other, more data-like subfields of a different nature:

- Most subfields in 700 are a simple copy of some (hypothetical) authority
record's 100, however $e and/or $4 denote the function of that person with
respect to the work described by the record at hand - and repeatable $0's
are just complementary to the "core" subfields, which may well be $a,$b,$c,$d,
$f,$g,$j,$k,$l,$n,$p,$q,$t and $u (some of them repeatable, and don't even
dare to change anything in their order). Use cases might include /selection/
based on one or more of the more data-like subfields and /reduction/ of the
field to a form suitable for further processing (indexing without $e, display
including $e, or with deviant formatting of $e, with reference to today's
slightly silly discussion on AUTOCAT concerning photographers acting as authors
and authors acting as photographers, to the perplexity of patrons ...).

- Same issue with most fields 77X: most subfields pertain to the work,
some are the individual "coordinates" within this work for that part
described by the given record.

- The 245 example (and also the $e in 100's) may demonstrate a need to
/partition/ a field at certain spots - maybe before or after subfields
meeting some content condition.

- Ubiquitous (in the specification, maybe not in the "field") are $6 and
$8's. If MARCspecs could make thusly interwoven fields accessible
as ensembles - that would be an enormous benefit!

From my limited experience the "unclear" nature of subfields really is the
hard part in MARC processing: If you delve into subfield processing too
early you get data fragments almost or completely impossible to reassemble
into something meaningful. On the other hand looking at fields as a whole
gives you more chances to understand what it is about but you're going
to choke on the weeding out necessary to proceed.

Thus, maybe due to my limited experience in MARC processing, I'd very much
appreciate MARCspec as a grammar to formulate those tasks that really
matter (and are hard to get 100% right). To achieve that - cf.
Patrick's reply again - one or several "processing paradigms" for MARC
records should serve as a base and - for clarity's sake - should be made
explicit in the MARCspec specification.

Thomas

Klee, Carsten

Feb 24, 2014, 2:39:30 AM
to t...@gymel.com, Patrick Hochstenbach, vo...@gbv.de, librec...@mail.librecat.org, perl...@perl.org
Hi Thomas and Patrick!

Thank you both for bringing the discussion forward. I must admit that I'm having some problems following here. I read your mails multiple times, really trying to understand your demands. After reading this [1], I hope I'm getting closer.

I just want to sum up what I think I've understood so far. Please correct me if I'm wrong..

-- When it comes to cataloging based delimiters (punctuation), there is some inner semantic to the content of the subfields. E.g. "=$b" in field 245 means something different than ":$b".

-- There may be data you want to get as a whole which is spread over multiple subfields. This information cannot be described by a range of subfields, but only by the closure through punctuation. E.g. in the field

245 00$aHeritage Books archives.$pUnderwood biographical dictionary.$nVolumes 1 & 2 revised$h[electronic resource] /$cLaverne Galeener-Moore.

the data you want to get is

Heritage Books archives. Underwood biographical dictionary. Volumes 1 & 2 revised [electronic resource]

Is this what you mean when you say something like "Get me all from field XXX until you hit Y"? I guess so.

-- Therefore the order of subfields is crucial. While MARCspec allows subfields stated in any order, a result should preserve the subfield order emerging in the field.

-- Some fields are linked through specific subfields. There may be some data you want to get dependent on linkage from other fields. I'm not sure if I have an example for this. Maybe you could provide one.

Finally I've found a nice example on the MARC21 website [2] (section $i - Relationship information). My question is whether you want to achieve something like this:

Source:
100 1# $aVerdi, Giuseppe, $d1813-1901.
245 10 $aOtello :$bin full score /$cGiuseppe Verdi.
700 1# $iLibretto based on (work) $aShakespeare, William, $d1564-1616. $tOthello.
787 08 $ireproduction of (manifestation) $aVerdi, Giuseppe, 1813-1901. $tOtello.$d Milano: Ricordi, c1913

Result (user display):
Verdi, Giuseppe, 1813-1901. Otello : in full score / Giuseppe Verdi
Reproduction of Verdi, Giuseppe, 1813-1901. Otello. Milano : Ricordi, c1913
Libretto based on Shakespeare, William, 1564-1616. Othello.

Is this something you want to express within a MARCspec?

Anyhow, a collection of use cases is a great idea. That would help to discover the tasks a MARCspec should cope with. But I really need your help here. Maybe a wider audience would also be helpful?

Cheers!

Carsten

[1] <http://marc-must-die.info/index.php?title=MARC_issues>
[2] <http://www.loc.gov/marc/bibliographic/bd76x78x.html>
_______________________________________________
Carsten Klee
Abt. Überregionale Bibliographische Dienste IIE
Staatsbibliothek zu Berlin - Preußischer Kulturbesitz

Fon: +49 30 266-43 44 02

> -----Original Message-----
> From: Thomas Berger [mailto:T...@Gymel.com]
> Sent: Wednesday, 19 February 2014 23:06
> To: Klee, Carsten; 'Patrick Hochstenbach'
> Cc: vo...@gbv.de; librec...@mail.librecat.org; perl...@perl.org
> Subject: Re: [librecat-dev] A common MARC record path language
>

Thomas Berger

Feb 24, 2014, 5:18:03 AM
to Klee, Carsten, Patrick Hochstenbach, vo...@gbv.de, librec...@mail.librecat.org, perl...@perl.org
Carsten,


> Thank you both for bringing the discussion forward. I must admit that I'm
> having some problems following here. I read your mails multiple times, really
> trying to understand your demands. After reading this [1], I hope I'm getting
> closer.

You could also consider grokking Jason Thomale's "Interpreting MARC: Where's the
Bibliographic Data?" < http://journal.code4lib.org/articles/3832 > (preceding
Karen Coyle's more widely known article "MARC21 as Data: A Start"
< http://journal.code4lib.org/articles/5468 >).


> I just want to sum up what I think I've understood so far. Please correct me if I'm wrong..
>
> -- When it comes to cataloging based delimiters (punctuation), there is some
> inner semantic to the content of the subfields. E.g. "=$b" in field 245 means
> something different than ":$b".

Yes and no: In your example

787 08 $ireproduction of (manifestation) $aVerdi, Giuseppe, 1813-1901.
$tOtello.$d Milano: Ricordi, c1913

three of the four subfields have internal structure which is likely to
be exploited as in

display "$ireproduction of (manifestation)" without the text in parentheses
as the left column in a table or styled differently (introductory phrase
in italics and/or followed by a colon)

display "$aVerdi, Giuseppe, 1813-1901." as "Verdi, Giuseppe, (1813-1901)"
succeeded by a colon if $t is next

display the title $tOtello. in italics, index it somewhere

extract the place "Milano" from "$dMilano: Ricordi, c1913" before ":"

display copyright signs more nicely than "c" (applies to 787$d).
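
A rough sketch of three of these display rules, applied to the subfield values of the 787 example. The helper names are hypothetical and the rules are deliberately naive; they only show that each subfield needs its own small piece of logic after it has been extracted.

```javascript
// Subfield values copied from the 787 example; the helper names are hypothetical.
const relation787 = {
  i: "reproduction of (manifestation) ",
  a: "Verdi, Giuseppe, 1813-1901. ",
  t: "Otello.",
  d: " Milano: Ricordi, c1913",
};

// Drop any parenthesised qualifier such as "(manifestation)".
function relationshipLabel(i) {
  return i.replace(/\s*\([^)]*\)/g, "").trim();
}

// Take the text in front of the first ":" as the place of publication.
function placeFromImprint(d) {
  const colon = d.indexOf(":");
  return (colon === -1 ? d : d.slice(0, colon)).trim();
}

// Replace a "c" directly in front of a four-digit year with a copyright sign.
function prettyCopyright(d) {
  return d.replace(/\bc(\d{4})\b/g, "\u00A9$1").trim();
}

console.log(relationshipLabel(relation787.i));
console.log(placeFromImprint(relation787.d));
console.log(prettyCopyright(relation787.d));
```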


245$b is only a notorious example where a subfield not only combines
several concepts, as in 787$d, but where there is no fixed "first one" and
therefore its meaning has to be deduced from punctuation information that is
unfortunately (but as usual) not "in" the subfield itself but immediately
preceding it.

Furthermore the ensemble $a+$t+$d constitutes a unit (*one* citation) which
for many cases should not be torn apart.

[There's also the case of 100$c as a kind of unspecific container for
any of the several different classes of information to be injected in
the heading according to AACR2 or RDA: professions, bynames, indications
of rank etc. But there is no non-MARC markup except "," and it's almost
impossible to reverse engineer $c to the factual information ("Spanish
king") underlying the heading]


> -- There may be data you want to get at whole, which spread over multiple
> subfields. This information is cannot be described by the range of subfields,
> but with the closure through punctuation. E.g. in the field
>
> 245 00$aHeritage Books archives.$pUnderwood biographical dictionary.$nVolumes 1 & 2 revised$h[electronic resource] /$cLaverne Galeener-Moore.
>
> the data you want to get is
>
> Heritage Books archives. Underwood biographical dictionary. Volumes 1 & 2 revised [electronic resource]

I think 245 is one of the many cases where specific information can be
/deduced/ from (MARC and ISBD) markup in the field but it would be
dangerous to state that e.g. 245$h /contains data/. It is tempting to
speak or think in terms of "subfield content", i.e. something data-like
which is implicitly terminated by the next subfield mark. (The " /"
actually does not belong to $h when attempting to view it as data; it's
just an indication that the next subfield mark to follow will probably be $c.)
Thus 245 is in XML lingo "mixed content" with most of the prescribed
punctuation /outside/ the children data elements. As usual, MARCspec
also cannot boldly declare that the permissible results should be regarded
as "the text" or "the data" - both views are legitimate and have to be
taken into account.

To achieve the string you just gave is either trivial (prevalent AACR2 practice
with ISBD punctuation always provided in the record: "Fetch the field and
substitute $.? by a single space") or involves much magic (coming D-A-CH
practice with ISBD punctuation generally not provided: "Fetch the field,
analyze the subfield marks and enhance it with proper ISBD punctuation").
[o.k. I see: You either stripped $c from it or the content after "/" or
the specific constellation of the trailing "/" immediately preceding
$anything or specifically $c - ISBD knows about a "parallel statement
of responsibility" like in "Our Mission / by Corporate Body A = Notre
Commande / par Corporation A" but I don't know offhand how this is coded
in AACR2+MARC for current examples]

And - as I'm not a typewriter - I rather would like to process the content of
245 with the help of the semantic clues given by the MARC encoding. Something
with only ". " as remaining delimiters is not much help. (And retrieving
more refined components like $a, $b etc. afterwards and match them to
specific parts of the combined string above seems to be very much work -
comparable to automatic tagging of OCR results)


> Is this what you mean when want to say something like "Get me all from field XXX until you hit Y"? I guess so.

As I understand the purpose of MARCspec it is kind of "hit and run":

It is not a MOM (MARC Object Model) or rather an object model for
any format derived from ISO 2709 and its concepts of files, records,
(flavors of) fields and subfields and therefore no abstract API
can be specified (prescribing that some operation X is defined on
record objects and yields field objects).

MARCspecs can only be applied to records and yield an implementation-dependent
something; /preferably/ this something should be a list of
some other things.

To be more specific: There may be implementations which indeed return a single
string
"Heritage Books archives. Underwood biographical dictionary. Volumes 1 & 2
revised [electronic resource]"
but I would consider these to be very special.

Other implementations might return a /string/

$aHeritage Books archives.$pUnderwood biographical dictionary.$nVolumes 1 & 2
revised$h[electronic resource] /

or a string

$aHeritage Books archives$pUnderwood biographical dictionary$nVolumes 1 & 2
revised$helectronic resource

and a third implementation could produce a list like

$aHeritage Books archives.
$pUnderwood biographical dictionary.
$nVolumes 1 & 2 revised
$h[electronic resource] /


and maybe a fourth implementation based on MARCXML and implemented within
the XML DOM would yield an (unserialized) XML fragment.
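
To make the difference between these result shapes concrete, here is a small sketch that derives three of them from the same subfield list. The list-of-pairs layout and the function names are assumptions for illustration only; no existing implementation is implied.

```javascript
// The part of the 245 example in front of $c, as [code, value] pairs.
const titleSubfields = [
  ["a", "Heritage Books archives."],
  ["p", "Underwood biographical dictionary."],
  ["n", "Volumes 1 & 2 revised"],
  ["h", "[electronic resource] /"],
];

// Shape 1: one display string; subfield marks become spaces and the
// trailing " /" is dropped.
function asDisplayString(subfields) {
  return subfields.map(([, value]) => value).join(" ").replace(/\s*\/$/, "");
}

// Shape 2: one string that keeps the subfield marks and the punctuation.
function asMarkedString(subfields) {
  return subfields.map(([code, value]) => "$" + code + value).join("");
}

// Shape 3: a list with one entry per subfield.
function asList(subfields) {
  return subfields.map(([code, value]) => "$" + code + value);
}

console.log(asDisplayString(titleSubfields));
console.log(asMarkedString(titleSubfields));
console.log(asList(titleSubfields));
```

Only the third shape keeps the semantic clues of the MARC encoding in a form that later processing steps can address individually.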



> -- Therefore the order of subfields is crucial. While MARCspec allows
> subfields stated in any order, a result should preserve the subfield order
> emerging in the field.

For "extract me /this/ subfield" there is no difference.

For "extract me those 15 subfields which might occur and I'm going to
name now" I'm not so sure: The cases above were to my impression more
like "partition me field 245 at some interesting position I provide"
or "give me anything from 783 except $i"

Furthermore the tasks of selection and extraction might (I'm speculating)
sometimes involve different (sets of) subfield tags: Select me those
6XX with either no $5 or a $5 "for me" and extract the "proper content"
(i.e. everything but control subfields but including $3?).
Or: Select any 651 with $2rswk [there is a nexus to i2=7 wich might
be disregarded?] and give me $a and either $e or $4 from the results
(Some implementations would return a list of lists? As with any
selections the primary list would be one where the members correspond
to the fields matched)
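
The split between selection and extraction can be sketched as two separate steps. The field layout, the function names and all sample values below are made up for illustration; the point is only that the subfields used to select a field ($2 here) need not be the ones that are returned, and that the result is naturally a list of lists.

```javascript
// Made-up sample fields; each field is { tag, subfields: [[code, value], ...] }.
const subjectFields = [
  { tag: "651", subfields: [["a", "Berlin"], ["2", "rswk"]] },
  { tag: "651", subfields: [["a", "Gent"], ["4", "sample-code"], ["2", "rswk"]] },
  { tag: "651", subfields: [["a", "Berlin (Germany)"]] },
  { tag: "650", subfields: [["a", "Cataloging"], ["2", "rswk"]] },
];

// Selection: keep fields with the given tag that carry subfield `code` equal to `value`.
function selectFields(fields, tag, code, value) {
  return fields.filter(
    (f) => f.tag === tag && f.subfields.some(([c, v]) => c === code && v === value)
  );
}

// Extraction: from each field keep only the wanted codes, in record order.
// The outer list has one member per field, the inner list one per subfield.
function extractSubfields(fields, codes) {
  return fields.map((f) =>
    f.subfields.filter(([c]) => codes.includes(c)).map(([, v]) => v)
  );
}

const picked = extractSubfields(
  selectFields(subjectFields, "651", "2", "rswk"),
  ["a", "e", "4"]
);
console.log(JSON.stringify(picked));
```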



> -- Some fields are linked through specific subfields. There may be some data
> you want to get dependent on linkage from other fields. I'm not sure if I have
> an example for this. Maybe you could provide one.


cf. "Appendix A - Control Subfields" of the MARC21 documentation at
< http://www.loc.gov/marc/bibliographic/ecbdcntf.html >: I was especially
alluding to $6 and $8, which provide two MARC21-specific ways of denoting
that data in different fields complements each other (giving the same
information in different scripts, or coding a kind of "tabular data" by
prescribing an order for the fields). Although it is common practice, there is
neither a rule that MARC records consist of fields sorted by label nor that
the order of fields in the record matters (i.e. transports information).
It's just that display along increasing field labels gives a close
approximation to ISBD ordering.

I have no information on how often these fields occur (AFAIK original-script
cataloguing has to utilize $6), but stumbling upon some field with $6
and having to retrieve the associated content is a higher-order operation
that /could/ at least be facilitated by MARCspecs if not done automagically.
An example given is

245 10$6880-03$aSosei to kako :$bNihon Sosei Kako Gakkai shi.
880 10$6245-03/$1$a[Title in Japanese script] :$b[Subtitle in Japanese script].

Here $6 contains the three-digit field number of the associated field and
a two-digit "random number" to make the link unique (there may be several
880's, each associated with at most one non-880 field). Therefore, upon
seeing the 245 with $6 content "880-03", the task is:
Retrieve the (only) 880 whose $6 content starts with exactly
"245-03".

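As a rough sketch of that lookup, assuming the same made-up record layout
as a list of dicts with (code, value) subfields: the function names are
hypothetical, and the sample assumes that the "/$1" in the example belongs
to the $6 content, which is why the comparison is "starts with" rather than
"equals". The extra 880 with "100-01" is invented to show that other 880's
are skipped.

```python
# Hypothetical layout: a record is a list of fields, each a dict with a
# "tag" and an ordered list of (code, value) subfield pairs.

def first_subfield(field, code):
    """Return the value of the first subfield with this code, or None."""
    for c, value in field["subfields"]:
        if c == code:
            return value
    return None

def linked_880(record, field):
    """Follow $6 from a regular field to its associated 880, if any."""
    link = first_subfield(field, "6")
    if field["tag"] == "880" or link is None or not link.startswith("880-"):
        return None
    occurrence = link[4:6]                       # "03" in "880-03"
    wanted = f"{field['tag']}-{occurrence}"      # "245-03"
    for candidate in record:
        back = first_subfield(candidate, "6") or ""
        if candidate["tag"] == "880" and back.startswith(wanted):
            return candidate
    return None

record = [
    {"tag": "245", "subfields": [("6", "880-03"),
                                 ("a", "Sosei to kako :"),
                                 ("b", "Nihon Sosei Kako Gakkai shi.")]},
    {"tag": "880", "subfields": [("6", "100-01/$1"),
                                 ("a", "[Name in Japanese script]")]},
    {"tag": "880", "subfields": [("6", "245-03/$1"),
                                 ("a", "[Title in Japanese script] :"),
                                 ("b", "[Subtitle in Japanese script].")]},
]

partner = linked_880(record, record[0])   # the third field of the record
```
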

Thomas

PHILLIPS M.E.

Feb 25, 2014, 6:50:03 AM2/25/14
to Thomas Berger, Klee, Carsten, Patrick Hochstenbach, vo...@gbv.de, librec...@mail.librecat.org, perl...@perl.org
> You also could consider to grok Jason Thomale's "Interpreting MARC:
> Where's the Bibliographic Data?" < http://journal.code4lib.org/articles/3832 >

That's a very good article, as it highlights the problems of the prescribed punctuation both getting in the way of extracting parts of the data and its role in providing extra context to the subfields.

> It is not a MOM (MARC Object Model) or rather an object model for
> any format derived from ISO 2709 and its concepts of files, records,
> (flavors of) fields and subfields and therefore no abstract API
> can be specified (prescribing that some operation X is defined on
> record objects and yields field objects).

If we are just talking about ISO 2709, the whole family of MARC formats in general, then you have to remember that UNIMARC and obsolete formats like UKMARC have very different requirements. UKMARC and UNIMARC are actually much easier to work with than MARC21 because the ISBD punctuation is not carried in the record but is generated from the subfield tags. So you don't have to say "give me the 245 $a and $b but strip / off the end if present" because the slash is not there. And there is a different subfield tag to introduce a parallel title, so you don't need to distinguish :$b from =$b.
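
To make the "give me the 245 $a and $b but strip / off the end if present" step concrete, here is a minimal Python sketch. The field layout (a dict with a list of (code, value) subfields), the function name and the way the sample title is split into subfields are assumptions for illustration only, not taken from any real record or library.

```python
import re

def title_proper(field):
    """Join 245 $a and $b and drop a trailing ' /' left over from ISBD."""
    parts = [value.strip()
             for code, value in field["subfields"]
             if code in ("a", "b")]
    return re.sub(r"\s*/\s*$", "", " ".join(parts))

# Invented MARC21-style encoding of a title, with the slash carried in $b.
f245 = {"tag": "245", "subfields": [
    ("a", "Interpreting MARC :"),
    ("b", "where's the bibliographic data? /"),
    ("c", "Jason Thomale."),
]}

title = title_proper(f245)
```

Note that the colon before $b survives in the result: the sketch only handles the one mark named above, which is part of why this kind of clean-up tends to grow rule by rule.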

In the UK most libraries have been MARC21 for a decade or more now. I don't know how much use is still made of UNIMARC, or the other national formats, nor how good they were. It seems as though in the last twenty years many countries have made moves towards MARC21 because of the sheer numbers of records available in that format. It's just a pity that it's possibly the worst of the ISO 2709 formats to work with if you want to repurpose the data!

I hope that BIBFRAME is not going to make the same mistakes. I have not been following that initiative in detail, but I've seen a few examples of data with punctuation hanging about at the end. Hard to tell whether it's prescribed punctuation or copying from the book.

The title field, in particular, is much more akin to HTML markup than data fields in a database. In antiquarian cataloguing rules like DCRM, the emphasis is on exact transcription from the title page, where the presence or absence of punctuation can make a difference in identifying variant editions. In MARC21 we get the crazy situation where the cataloguers transcribe the exact punctuation from the title page and *add* the ISBD punctuation to the MARC21 record. This makes it very hard to present the lay-person with anything meaningful.

Matthew

--
Matthew Phillips
Head of Digital and Bibliographic Services,
Durham University Library, Stockton Road, Durham, DH1 3LY
+44 (0)191 334 2941

Thomas Berger

Feb 25, 2014, 7:33:33 AM2/25/14
to PHILLIPS M.E., Klee, Carsten, Patrick Hochstenbach, vo...@gbv.de, librec...@mail.librecat.org, perl...@perl.org
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 25.02.2014 12:50, PHILLIPS M.E. wrote:

> If we are just talking about ISO 2709, the whole family of MARC formats in
> general, then you have to remember that UNIMARC and obsolete formats like UKMARC
> have very different requirements. UKMARC and UNIMARC are actually much easier to
> work with than MARC21 because the ISBD punctuation is not carried in the record
> but is generated from the subfield tags. So you don't have to say "give me the
> 245 $a and $b but strip / off the end if present" because the slash is not
> there.

Same thing with MARC21: the punctuation regime for the record is governed by
Leader pos. 18 ("descriptive cataloging form"), which currently offers a choice
mainly between "AACR2", "ISBD with punctuation" and "ISBD without punctuation",
with no code(s) yet for "RDA".

Here in Germany there is a strong tradition that cataloguers shall not enter
punctuation when the field granularity of the underlying database allows its
automatic generation for display or conversion to other formats
(what I mean is: punctuation is generated when converting from the internal
format to MARC in cases where MARC is not as granular as the internal format).

This applies to RAK data in the union databases and its transport via MAB2 or
MARC21 and it is also the intention to carry this on when switching from RAK
to RDA.

[There's also been the regulation for the D-A-CH application layer to move
punctuation which cannot be eliminated to the start of the subfield "it
belongs to", e.g.

245 $a title = $b parallel title

becomes

245 $a title $b = parallel title

probably on the prospect that this could ease processing...]
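
A minimal Python sketch of such a rewrite, assuming subfields are held as
(code, value) pairs: the function name and the set of marks treated as
movable ("=", ":", ";", "/") are guesses for illustration and are not taken
from the D-A-CH rules themselves.

```python
import re

# Assumed set of marks that may be moved; purely illustrative.
TRAILING_MARK = re.compile(r"\s*([=:;/])\s*$")

def move_punctuation_forward(subfields):
    """Move a trailing mark from each subfield to the start of the next."""
    result = [[code, value] for code, value in subfields]
    for current, following in zip(result, result[1:]):
        match = TRAILING_MARK.search(current[1])
        if match:
            current[1] = current[1][:match.start()]
            following[1] = f"{match.group(1)} {following[1]}"
    return [tuple(subfield) for subfield in result]

before = [("a", "title ="), ("b", "parallel title")]
after = move_punctuation_forward(before)
# after holds $a "title" and $b "= parallel title"
```

A mark at the end of the last subfield is left alone, since there is no
following subfield to receive it.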

Best regards
Thomas Berger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iJwEAQECAAYFAlMMjZ0ACgkQYhMlmJ6W47NLLgP+KJcGwEad9zbYoUNRQer/+XBd
L39rvnWDMK6XOmW5NL+M3FQFSfArT2iJ1eyIuni92gLMfURG+z96SrKVQNEcF+IL
DVglbTE4+6OqNGf61YcwBA3x/k+MVrmqGKLqoKE7R43FgaYHKk3s7PlYaf1au9mz
z9nNz/hZDEXmujNIxJ8=
=uVi7
-----END PGP SIGNATURE-----