Using OpenRefine to get DEWEY numbers

Petros Liveris

Feb 6, 2019, 4:27:30 AM

to OpenRefine

Hello,

I have a list of values (Owen may be more familiar with this) that were supposed to contain DEWEY numbers:

https://www.cheatography.com//davidpol/cheat-sheets/dewey-decimal-classification-system/pdf/

I am trying to extract from each of these values the DEWEY number it may contain.

As Ettore rightly pointed out:

"But I'm sure it will not work for a 4th special case you did not mention in your three examples"

Here are some example values where I extracted a number as DEWEY when I should not have (see the table at the end of this message).

Is there a way in OpenRefine to see the patterns present in my data, so that I can extract only the real DEWEY numbers?

One idea would be to have facets based on patterns such as:

1 .* \d\d\d.\d\d .*

2 \d\d\d .*

3 .* \d\d\d.\d\d\d .*

Source value | Value Extracted | Match with DEWEY
15475 | 154 | TRUE
ΕΙΚ MAY 27571 ΕΙΚ | 275 | TRUE
Μ (P) SΑΡ 2009 | 200 | TRUE
20.949 5 PΥΧ ΑΝΤ.2 | 949 | TRUE

Ettore Rizza

Feb 6, 2019, 4:56:16 AM

to OpenRefine

Hello,

OK, it's obviously clearer when you explain what you want to get.

According to Wikidata, the format constraint (expressed in regex) for a Dewey classification is:

\d{3}|\d{3}\.\d |[12456]--\d |3[ABC]?--\d

You can test this regex here.

In OpenRefine 3 and onwards, this is like using:

screenshot-127.0.0.1-3333-2019.02.06-10-51-15.png

value.find(/\d{3}|\d{3}\.\d |[12456]--\d |3[ABC]?--\d /)[0]
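
To see why an unanchored search with this pattern can pick up fragments of longer numbers, here is a minimal Python sketch. It is an illustration only: Python's `re.search` stands in for GREL's `find` and is not an exact equivalent, the pattern is assumed to use `\d+` for the trailing digits, and the names `DEWEY` and `first_hit` are made up for this example.

```python
import re

# Assumed form of the pattern, with "\d+" for the trailing digits.
DEWEY = re.compile(r"\d{3}|\d{3}\.\d+|[12456]--\d+|3[ABC]?--\d+")

def first_hit(value):
    """Return the first substring matching the pattern anywhere in value, or None."""
    m = DEWEY.search(value)
    return m.group(0) if m else None

print(first_hit("15475"))     # 154
print(first_hit("20.949 5"))  # 949
print(first_hit("025.431"))   # 025
```

Because the alternatives are tried from left to right, the bare `\d{3}` branch succeeds before `\d{3}\.\d+` is considered, so a search of this kind never returns the decimal part. It also happily matches three digits inside a longer run of digits.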

Hope this helps,

Ettore

Ettore Rizza

Feb 6, 2019, 5:13:37 AM

to OpenRefine

By the way, this seems like a good example of an XY problem: your first question was not about the problem you wanted to solve, but about the solution you thought was the right one. ;)

Petros Liveris

Feb 6, 2019, 5:19:55 AM

to OpenRefine

Thank you very much,

All this information is very helpful, and of course so is your remark about my asking the wrong question.

The above regex still returns 154 as the DEWEY number from 15475, and from 20.949 it selects 949.

This is why I am looking for a facet-based way to see all the patterns that occur in the values, so that I can decide which ones to handle.

For example, the facet containing 15475 (and all other five-digit values) would be \d{5}, and the one for 20.949 would be \d\d.\d\d\d, so I would know which values I need to deal with.

Perhaps there is a way to build facets with every digit replaced by the character d and every letter by a placeholder character, say c, leaving all punctuation (".", ",", "(", ")", etc.) as it is?

Again I am asking a question in terms of how it could be solved, so please accept my apologies; I am trying to find a good solution...
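
The masking idea described above can be sketched in a few lines of Python. This is only an illustration of the proposal, not an OpenRefine feature: the function names `mask` and `group_by_mask` are hypothetical, and "letter" is taken to mean anything Python's `str.isalpha` accepts, which includes Greek characters.

```python
from collections import defaultdict

def mask(value):
    """Replace every digit with 'd' and every letter with 'c'; keep the rest."""
    return "".join(
        "d" if ch.isdigit() else "c" if ch.isalpha() else ch
        for ch in value
    )

def group_by_mask(values):
    """Map each mask to the list of original values that produce it."""
    groups = defaultdict(list)
    for v in values:
        groups[mask(v)].append(v)
    return dict(groups)

samples = ["15475", "30912", "20.949 5 ABC", "301 XYZ"]
for pattern, members in group_by_mask(samples).items():
    print(pattern, members)
```

Each distinct mask plays the role of one facet entry, so unusual shapes such as `ddddd` or `dd.ddd d ccc` stand out and can be reviewed before any extraction rule is written.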

Ettore Rizza

Feb 6, 2019, 5:33:50 AM

to OpenRefine

"The above regex still returns 154 as the DEWEY number from 15475, and from 20.949 it selects 949."

But that's exactly the behavior you seem to be expecting, based on the example you posted above where all the "match with Dewey" are TRUE. :-|

I'm not sure what you want to do. Is it to match only whole Dewey classifications?

Petros Liveris

Feb 6, 2019, 5:43:58 AM

to OpenRefine

My mistake.

It was marked TRUE because 154 exists in DEWEY:

https://www.cheatography.com//davidpol/cheat-sheets/dewey-decimal-classification-system/pdf/

But in this field the librarian entered a value that I should not have taken into account in the first place, because it is completely wrong.

So the TRUE here is a false positive, which I need to avoid.

The same goes for 20.949, where the DEWEY number lies in the digits before the dot (the librarian probably meant 020), while the regex selects 949.

So if I go with this approach, I will not only fail to correct all the values, I will also alter some of them in a very bad way. That is why I am trying to use OpenRefine facets to visualize what the DEWEY data look like. Then I will write an appropriate regex that captures, for instance, only the value stored in $1 of (\d\d\d)(\s) or (\d\d\d)(\.\d\d).

This way I can be fairly sure that I extract real DEWEY numbers, and not garbage put in the DEWEY field by librarians.
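
A small Python sketch may help when testing the capture-group idea. One thing it shows is that a pattern like `(\d\d\d)(\s)` on its own still matches the tail of a longer number, so some left boundary is probably needed as well. The lookbehind used below is an assumption added for this sketch, not something proposed in the thread, and the names `LOOSE`, `STRICT` and `strict_dewey` are hypothetical.

```python
import re

# Three digits followed by whitespace, with no constraint on what precedes them.
LOOSE = re.compile(r"(\d\d\d)(\s)")

# Three digits NOT preceded by a digit or a dot (assumed left boundary),
# followed by whitespace or by a dot and two digits.
STRICT = re.compile(r"(?<![\d.])(\d{3})(?=\s|\.\d\d)")

def strict_dewey(value):
    """Return the captured three-digit class, or None if nothing qualifies."""
    m = STRICT.search(value)
    return m.group(1) if m else None

print(LOOSE.search("ABC 27571 ABC").group(1))  # 571, a false positive
print(strict_dewey("ABC 27571 ABC"))           # None
print(strict_dewey("20.949 5 ABC"))            # None
print(strict_dewey("025.43 ABC"))              # 025
print(strict_dewey("301 XYZ"))                 # 301
```

Note that this sketch, like the patterns it is based on, ignores a three-digit number at the very end of a value, because nothing follows it; whether that case should count is a decision to make after looking at the facets.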

Thank you again for your patience in reading all this. I hope it is now clear what I am trying to achieve.

Ettore Rizza

Feb 6, 2019, 6:56:15 AM

to OpenRefine

Do you want something like this?

screenshot-127.0.0.1-3333-2019.02.06-12-42-29.png

If so, here is the formula I used to create the new column:

forEach(value.replace(/\d/, "1").replace(/\p{L}/, "A").split(' '), e, e.match(/(\d{3}|\d{3}\.\d+|[12456]--\d+|3[ABC]?--\d+)/))

If you want to examine manually all the patterns of numbers and letters in your cells (the solution you propose), you can create a custom text facet with this formula:

value.replace(/\p{L}/, "A").replace(/\d/, "\\\\d").split(' ')

This will give you something like this:

screenshot-127.0.0.1-3333-2019.02.06-12-52-52.png
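
For anyone who wants to reproduce this kind of facet outside OpenRefine, here is a rough Python approximation of the facet formula above. It is a sketch under assumptions: the character class `[^\W\d_]` is used as a stand-in for `\p{L}`, and the names `token_shapes` and `shape_counts` are hypothetical.

```python
import re
from collections import Counter

def token_shapes(value):
    """Mask letters as 'A' and digits as a literal backslash-d, then split on spaces."""
    letters_masked = re.sub(r"[^\W\d_]", "A", value)
    # A function is used as the replacement so the backslash is inserted literally.
    digits_masked = re.sub(r"\d", lambda m: r"\d", letters_masked)
    return digits_masked.split(" ")

def shape_counts(values):
    """Count how often each token shape occurs across all values."""
    counts = Counter()
    for v in values:
        counts.update(token_shapes(v))
    return counts

samples = ["15475", "20.949 5 ABC", "301 XYZ", "025.43 DEF"]
for shape, n in shape_counts(samples).most_common():
    print(n, shape)
```

The counts correspond to what a text facet would show next to each choice, which makes rare shapes easy to spot.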

If this is not what you want, you should really post an example of your original data and the expected results (and not an example of false positives that you want to avoid).

Cheers,

Ettore

PS : There was an error in the regex I posted above. For some reason, the + had disappeared when I copied it from Wikidata.

Ettore Rizza

Feb 6, 2019, 6:58:54 AM

to OpenRefine

Sorry, bad copy-paste. The formula I used in the first screenshot is:

forEach(value.split(' '), e, e.match(/(\d{3}|\d{3}\.\d+|[12456]--\d+|3[ABC]?--\d+)/)[0]).join('')
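
The key point of this formula is that GREL's `match` tests each space-separated piece against the whole pattern rather than searching inside it. A minimal Python sketch of the same idea follows, using `re.fullmatch` as the counterpart. It simply skips pieces that do not match, which may not mirror exactly how OpenRefine treats them, and the names `DEWEY` and `dewey_tokens` are illustrative.

```python
import re

DEWEY = re.compile(r"(\d{3}|\d{3}\.\d+|[12456]--\d+|3[ABC]?--\d+)")

def dewey_tokens(value):
    """Join the space-separated tokens that match the pattern in full."""
    hits = []
    for token in value.split(" "):
        m = DEWEY.fullmatch(token)
        if m:
            hits.append(m.group(1))
    return "".join(hits)

print(repr(dewey_tokens("15475")))         # ''
print(repr(dewey_tokens("20.949 5 ABC")))  # ''
print(repr(dewey_tokens("025.431 ABC")))   # '025.431'
print(repr(dewey_tokens("301 XYZ")))       # '301'
```

Because the whole token has to match, five-digit runs and malformed values such as 20.949 are rejected, while a well-formed class number with decimals is returned complete.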

Petros Liveris

Feb 6, 2019, 7:00:24 AM

to OpenRefine

Your second screenshot seems to do the job; I am going to test it right away.

Thank you again.
