Inserting or removing specific characters in a string

278 views
Skip to first unread message

Abby

unread,
Aug 10, 2018, 1:39:30 PM8/10/18
to OpenRefine

I am new to OpenRefine and working on a project where I'm trying to standardize Library of Congress call numbers.  Each cell in the column will have a different value, but I need to standardize punctuation and spacing.  Here are a couple of examples of the format I need to achieve:

DT170 .C663 2000 
QP519.9 .N83 A38 2014 

The files that I have vary widely from this format.  Some have too few spaces (e.g. DT170.C6632000), too many spaces (QP519 .9 .N83 A38 2014), or are missing decimal points (DT170 C663 2000).  I can facet the lists to isolate cells according to each of these issues, but I'm running into trouble fixing the formatting.  OpenRefine seems to only allow editing of all instances of a character, when I need to edit only one and leave the rest alone.  For instance, I thought I had fixed the problem of a missing space before a decimal followed by a letter (e.g. changing QP519.9.N83 to QP519.9 .N83) using if(value.splitByCharType()[4].contains(“.”),replace(value,value.splitByCharType()[4],” .”),value) but this added a space before any decimal point (e.g. QP519 .9 .N83).

Is it possible to change or remove a specific single character in a text string without altering other instances of that character in the string?

Hosioneh

unread,
Aug 10, 2018, 1:59:46 PM8/10/18
to OpenRefine
it's hard to offer best solution without reviewing data. But this might help.
value.partition('.')[0]+' '+value.partition('.')[1]+value.partition('.')[2]

It will add a space after first matching dot.

Thad Guidry

unread,
Aug 10, 2018, 3:24:46 PM8/10/18
to openr...@googlegroups.com
Hi Abby,

Regex is what will work best for your particular needs to find the patterns that you dislike and build Facets for those disliked patterns.

Start with a simple pattern first...  2 whitespaces together, we do not like...so let's find them...

1. Facet -> Custom text facet
2.   value.find(/\s\s/)

For replacment functions, you can use our replace()

Replace 2 spaces with only 1 space
value.replace(" ", " ")

Iterate with Facets to narrow the working row set and to help review...then use value.replace as necessary.

Let us know if you need more,


--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--

Abby

unread,
Aug 10, 2018, 4:54:55 PM8/10/18
to OpenRefine
Thank you!  This was very helpful. 

John Little

unread,
Aug 10, 2018, 4:57:11 PM8/10/18
to openr...@googlegroups.com
Your question:  Is it possible to change or remove a specific single character in a text string without altering other instances of that character in the string?  I would say, yes, if using the value.replace function.  replace can take input as a regex pattern or literal characters.  Anyway, I'm no expert on LC classification "numbers" but I composed something that works with your two examples.  I'm not sure how generalizable it will be.  

regex:  (\w+\d+)\.(\d+)\.(.*)

implementation:  value.replace(/(\w+\d+)\.(\d+)\.(.*)/,"$1.$2. $3")

Explained:
\w = "word" character
\d = digit or number 
\  = escape character ; what follows, follows literally.  e.g. \. means '.'  (escaped the period to distinguish it from the regex wildcard which is '.' and so creates ambiguity in this case)
+ = quantifier, 1 or more
* = quantifier, 0 or more  (so .* means match a whole bunch of anything)
() = capture group ; then referenced by the order identified, e.g. $1, $3, $2

Thad Guidry

unread,
Aug 10, 2018, 5:17:12 PM8/10/18
to openr...@googlegroups.com
John certainly has a nice shotgun approach !

We just do not like shotguns because someone often gets shot...you.

We prefer lots of little needles...that at most will only cause a single prick, and no deaths will occur :)

Its a primary reason we have Facets...so you don't get shot and miss something in the data.

Take it from me, be wise, inspect with good Facets to narrow the surface area of any Transformations you will apply on columns, omit the outliers and work through your data methodically.

--

John Little

unread,
Aug 10, 2018, 5:27:49 PM8/10/18
to openr...@googlegroups.com
Good wisdom.  Well put.  The facet features of OpenRefine are there for a reason -- and they are an awesome functional approach.  For best results leverage the facets.  Even if there are shortcuts with data, relying on other people's algorithmic solutions if they have not investigated the dataset can lead to bad results.  Being methodical will almost always be the best approach.  

Do what Thad says (not what I do ;-j )

--
John 

--
Reply all
Reply to author
Forward
0 new messages