Consolidation of MARC tag content on the basis of RecordNumber

33 views
Skip to first unread message

Parthasarathi Mukhopadhyay

unread,
Apr 17, 2022, 2:50:06 PM4/17/22
to openr...@googlegroups.com
Dear all

I have a MARC dataset, which has been imported into OpeRefine through MarcEdit.

It looks like this (included only tags 020, 082, 245, 650 through text faceting) -

RecordNumber

Tags

Indicators

Content


1

020

\\

$a9781315882673 (e-book : PDF)


1

245

00

$aAsexualities$h[electronic resource] :$bfeminist and queer perspectives /$cedited by Karli June Cerankowski and Megan Milks.


1

650

\0

$aFeminist theory.


1

650

\0

$aQueer theory.


1

650

\0

$aSex.


1

650

\0

$aSexual attraction.


1

650

\0

$aSexual desire disorders.


1

650

\0

$aSexual orientation.


2

020

\\

$a9781137590312 (ebook:PDF)$9978-1-137-59031-2


2

082

04

$a305.3$223


2

245

10

$aIntersectionality and LGBT Activist Politics$h[electronic resource] :$bMultiple Others in Croatia and Serbia /$cedited by Bojan Bilić, Sanja Kajinić.


2

650

\0

$aEthnicity.


2

650

14

$aGender Studies.


2

650

24

$aSocial Structure, Social Inequality.


3

020

\\

$c{dollar}85.10


3

082

04

$a306.74309548$222


3

245

20

$aSexual behavior among male college students in Southern India :$bimplications for HIV risk and prevention :$by David S. Hausner


3

650

\0

$aMale students$xSexual behavior$zIndia, South


3

650

\0

$aHIV-positive bisexual men$zIndia, South



Now, what I want is one column against every record containing ISBN (tag 020)| DDC (082)|Title&Subtitle (tag 245 $a $b)|Subject headings (to be separated by ;)|; something like this for the record number 2 -

2

020

\\

$a9781137590312 (ebook:PDF)$9978-1-137-59031-2

9781137590312||305.3||Intersectionality and LGBT Activist Politics: Multiple Others in Croatia and Serbia||Ethnicity.;Gender Studies.;Social Structure, Social Inequality.

2

082

04

$a305.3$223

2

245

10

$aIntersectionality and LGBT Activist Politics$h[electronic resource] :$bMultiple Others in Croatia and Serbia /$cedited by Bojan Bilić, Sanja Kajinić.

2

650

\0

$aEthnicity.

2

650

14

$aGender Studies.

2

650

24

$aSocial Structure, Social Inequality.

The purpose is to use the resultant column to create another OR project from it with ISBN|DDC|Title&Subtitle|Descriptors for measuring subject cataloguing quality for a given library against a predefined gold standard.

Regards

Parthasarathi Mukhopadhyay

Professor, Department of Library and Information Science,

University of Kalyani, Kalyani - 741 235 (WB), India

Owen Stephens

unread,
Apr 21, 2022, 7:31:59 AM4/21/22
to OpenRefine
Hi Parthasarathi - there are probably a few ways to approach this but it's going to take a few steps I think as you've got some complexity in the key you want to create

I'd start with creating the parts of the key you want to create separately and then join them together at the end. First of all I'd get the fields you want to work with in a single column each (i.e. a column for ISBN info, a column for title info, a column for subject info). There are several ways to do this including using Transpose by Key/Value, but I'd suggest as follows:

  1. Make sure you are in Row mode
  2. Add a text facet on the "Tags" column
  3. In the facet, select "020" to filter to the 020 rows (to do the ISBN info) and then from the "Content" column use "Add column based on this column" to create your "ISBN" column
  4. Repeat step 3 for the 245 and 650 tags
You should now have three new columns: ISBN, Title, Subject

These three columns will need some manipulation to get the values you want to create your key - this is the most difficult part of the process. I'm not sure I can give full instructions here as there could be some factors in your full data set that mean you need some additional clean up but hopefully I can give you enough here:

ISBN
--------
In the new ISBN column do a transform like:

filter(value.split("$"),v,v.startsWith("a"))[0].substring(1)

This should get you the content of just the $a subfield (in theory the $a subfield isn't repeatable in MARC, and the transformation above assumes that this is the case for your MARC - you may want to double check this with a custom text facet like:

filter(value.split("$"),v,v.startsWith("a")).length()

before you do the transformation and if any of these are >1 then investigate.

Now the ISBN column will contain values like:
9781315882673 (e-book : PDF)

You might need to do some tidy up to make sure there are no hyphens or spaces in the ISBN, and you want to drop everything except the actual ISBN - so a transformation like:

value.toUppercase().replace(/[ -]/,"").find(/\d{12}[0-9X]/)[0]

Again - I'm basing this on the sample data you shared and some guesses about potential issues - so you may need to refine this a little to get exactly what you need

Title
--------
Working with the title is going to be a pain honestly. Because of the punctuation in the MARC it's tricky to create a reliable process to get the data in the way you have it in your example. To show an example with a 245 like 
"$aIntersectionality and LGBT Activist Politics$h[electronic resource] :$bMultiple Others in Croatia and Serbia /$cedited by Bojan Bilić, Sanja Kajinić."
 the colon between the title ($a) and subtitle ($b) is actually in the $h subfield - this is because the MARC is designed to give a readable string, not a machine-processable string. So extracting the $a and $b from the title will lose the colon. But if the $h field is omitted then the colon would be at the end of the $a substring (or omitted if there is no $b) and you have a different requirement for getting your eventual title string.

Anyway your basic need is something like a transform:

forEach(filter(value.split("$"),v,or(v.startsWith("a"),v.startsWith("b"))),w,w.substring(1)).join(": ")

You could then do something to try and remove any repeated colons and trailing back slashes (caused by the punctuation at the end of whichever subfield precedes a statement of responsibility in the $c subfield (arrgh - MARC is a mess!)

Subject
------------
In the new Subject column do a transform like:

filter(value.split("$"),v,v.startsWith("a"))[0].substring(1)

This assumes you are only interested in the $a subfield - I'm not sure this is the case but that's the only thing present in your examples

Creating your final output
----------------------------------------
Now you want to start working with the Records instead of Rows. Then you can join the cells together in the columns for each data type, and then join the columns together for your final outcome:

  1. Use "Blank down" on the RecordNumber column
  2. Switch to Records mode
  3. In the ISBN column us "Join together multi-valued cells". If you need to account for potential multiple 020 in a single record you'll need a separator here and decide what you do with the result
  4. In the Title column us "Join together multi-valued cells". If you need to account for potential multiple 245 in a single record you'll need a separator here and decide what you do with the result
  5. In the Subject column us "Join together multi-valued cells". We know there can be multiple 650s - based on your example you want to use semi-colon as the separator
  6. On one of the columns use Edit columns -> Join columns - select the relevant columns, put them in the right order, decide what to do with nulls, specify pipe character as your separator
Hopefully this gives you the final outcome you need and you can work from there!

Best wishes and good luck

Owen 

Owen Stephens

unread,
Apr 21, 2022, 7:36:36 AM4/21/22
to OpenRefine
Oh - I forgot to say in the steps to get the various subfields like the following on the ISBN column

filter(value.split("$"),v,v.startsWith("a"))[0].substring(1)

You may need to specify in the transform to use a blank on error - as otherwise in situations where the field lacks a $a you'll end up with the original string from the field

Parthasarathi Mukhopadhyay

unread,
Apr 21, 2022, 1:40:54 PM4/21/22
to openr...@googlegroups.com

Hello Owen

What can I say? An absolute genius direction ....

Thanks a ton.

The column now looks like


9781137590312 ||
Intersectionality and LGBT Activist Politics: Multiple Others in Croatia and Serbia||
Ethnicity.;Gender Studies.;Social Structure, Social Inequality.

It can now serve as an input to the next project for further works.

Heartfelt thanks and best regards





--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/8b2c8334-ab34-42dc-b3e1-c023e5ff002bn%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages