Split Multi-valued strings via nth occurence

264 views
Skip to first unread message

GJ

unread,
Jul 1, 2019, 4:29:44 AM7/1/19
to OpenRefine
Hi

Is it possible to split a multi-valued cell where there is no separator, but there is a pattern? i.e. I want to split by the 5th occurrence of a particular character.

Example:

I have a string like this (which contains 4 values):

Pakistan||||Asia|Laccadive Islands||||India|Vietnam||||Asia|India||||Asia|

The result I want is:

Pakistan||||Asia|
Laccadive Islands||||India|
Vietnam||||Asia|
India||||Asia|

Or this:

Pakistan||||Asia|@Laccadive Islands||||India|@Vietnam||||Asia|@India||||Asia|

The pattern is that each value contains 5 pipemarks and the 5th pipemark always represents the final character of each value.

I thought this expression might work:

value.split(/\|{5}/).join("@")

But this nth occurrence expression seems to be looking for consecutive occurrences.

I think I have a slightly convoluted multi-step solution to this, but I thought I'd see if anyone knows of a way of splitting these values with one expression.

Best wishes

Graham




Owen Stephens

unread,
Jul 1, 2019, 4:39:17 AM7/1/19
to OpenRefine
I can't think of any way to do this in one step off the top of my head, although there probably is a way of expressing it as a regular expression.

If the pattern is exactly as you have here - the last | always isolated from the others, then I'd probably do a replace first like:

value.replace(/([^|])\|([^|])/,"$1~$2")

This replaces any isolated | with an ~ symbol instead. Then you can split on the ~ symbol.

However, if the pattern is more varied (e.g. sometimes the last pipe character is adjacent to the others in the string) then I can't think of a simple solution

Owen

Jevon, Graham

unread,
Jul 1, 2019, 4:49:37 AM7/1/19
to openr...@googlegroups.com

Thanks Owen

 

Unfortunately one of the problems is that the pattern can be more varied (e.g. a value could end with 1-5 pipes).

 

I haven’t tested it yet, but I think my plan is to do something like this:

  1. Split multi valued cells by “|”
  2. Transform cells: value+”|”
  3. Export to Excel
  4. Column A: add unique number every 5 rows
  5. Put back into open refine
  6. Records row
  7. Join multi-valued cells

Best wishes

 

Graham

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/66d5467e-0d18-4a44-be60-cb5ce431004b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


 
******************************************************************************************************************
Experience the British Library online at www.bl.uk
The British Library’s latest Annual Report and Accounts : www.bl.uk/aboutus/annrep/index.html
Help the British Library conserve the world's knowledge. Adopt a Book. www.bl.uk/adoptabook
The Library's St Pancras site is WiFi - enabled
*****************************************************************************************************************
The information contained in this e-mail is confidential and may be legally privileged. It is intended for the addressee(s) only. If you are not the intended recipient, please delete this e-mail and notify the postm...@bl.uk : The contents of this e-mail must not be disclosed or copied without the sender's consent.
The statements and opinions expressed in this message are those of the author and do not necessarily reflect those of the British Library. The British Library does not take any responsibility for the views of the author.
*****************************************************************************************************************
Think before you print

Owen Stephens

unread,
Jul 1, 2019, 5:05:28 AM7/1/19
to OpenRefine
OK - I have a simpler way (simpler in the sense you can do it all in OpenRefine anyway!)

If you use 'smartSplit' using pipe character as the separator, you'll get a split string in which every 5th component is the place you want to do your eventual split. Using 'smartSplit' rather than just 'split' allows you to preserve all the components from a split even if this results in an array containing empty elements.

So, once you have the resulting array, you can step through it using 'forEachIndex' - which is a way of stepping through an array and knowing which item in the array you are currently working with. In this way you can check if you are on the 5th/10th/15th/etc component and add an extra character into that bit of the string. Finally join the array back together to get back to your starting string with some additional character added (I'm going to use ~ in my example code).

You can then do a split based on this new character.

So the code I've used just now to test is:

forEachIndex(value.smartSplit("|"),i,v,if(mod(i,5)==4,v+"~",v)).join("|")

I think that will work - let me know if any issues

Owen

To unsubscribe from this group and stop receiving emails from it, send an email to openr...@googlegroups.com.

Jevon, Graham

unread,
Jul 1, 2019, 5:47:11 AM7/1/19
to openr...@googlegroups.com

Thanks Owen!

 

Yes, that is definitely simpler and seems to work perfectly. In fact, it pretty much does exactly what I was hoping to achieve.

 

I just added .replace(“~|”,”|~”) to the end of the expression so that the ~ separator follows the 5th pipemark.

 

Thanks again!

To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/2fed0f00-57bb-469a-8ff6-0bef5b8d4c3b%40googlegroups.com.

Owen Stephens

unread,
Jul 1, 2019, 5:50:29 AM7/1/19
to OpenRefine
Having done a bit of research into how to express 'nth occurrence of a character' as a regular expression, you could use:

value.replace(/((?:[^|]*\|){5})/,"$1~")

To get the additional split character into the original string. Then you could split on that

Owen

Jevon, Graham

unread,
Jul 1, 2019, 6:04:30 AM7/1/19
to openr...@googlegroups.com

Thanks Owen

 

That also works perfectly. I’m sure a similar problem will crop up again. So hopefully the logic behind these will come in handy on future occasions as well.

To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/e0154273-1593-4569-a239-868d834a90c3%40googlegroups.com.

Ettore Rizza

unread,
Jul 1, 2019, 6:41:43 AM7/1/19
to OpenRefine
Hello Graham,

Personally, I use GREL for simple cases and switch to Jython as soon as things get a little complicated. GREL is not a complete language, it cannot do everything. Owen has shown that a solution can often be found, but not everyone is Owen. :)

Python, on the other hand, is a complete language that allows you to do just about anything. It requires more learning, but this learning can be valued elsewhere than in OpenRefine, especially in the world of digital humanities. In addition, it has such a large user base that almost every possible question has already been asked on StackOverflow. As a result, you need to know just enough Python to adapt the code found on the net. That means a dozen hours of training. 

For example, if I check your question in Google by adding the Python keyword, I come across this StackOverflow post where I just have to pick a solution and adapt it:

def nth_repl(s, sub, repl, nth):

    find = s.find(sub)

    # if find is not p1 we have found at least one match for the substring
    i = 1

    # loop util we find the nth or we find no match
    while find != -1 and i != nth:
        # find + 1 means we start at the last match start index + 1
        find = s.find(sub, find + 1)
        i += 1
    # if i  is equal to nth we found nth matches so replace

    if i == nth:
        s= s[:find]+repl+s[find+1:]
        return s[:find] + nth_repl(s[find:], sub, repl, nth)
    else:
        return s

return nth_repl(value, "|", "\n", 5)


This solution is probably not the shortest, there would probably be a one-liner equivalent using the itertools module, but you understood the idea.

Best regards,

Owen

The information contained in this e-mail is confidential and may be legally privileged. It is intended for the addressee(s) only. If you are not the intended recipient, please delete this e-mail and notify the post...@bl.uk : The contents of this e-mail must not be disclosed or copied without the sender's consent.

The statements and opinions expressed in this message are those of the author and do not necessarily reflect those of the British Library. The British Library does not take any responsibility for the views of the author.

*****************************************************************************************************************

Think before you print

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/2fed0f00-57bb-469a-8ff6-0bef5b8d4c3b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.

To unsubscribe from this group and stop receiving emails from it, send an email to openr...@googlegroups.com.

Ettore Rizza

unread,
Jul 1, 2019, 6:44:05 AM7/1/19
to OpenRefine

Jevon, Graham

unread,
Jul 2, 2019, 4:36:55 AM7/2/19
to openr...@googlegroups.com

Thanks Ettore

 

I recently started an online python tutorial and I’ve probably done about a 12 hours, so I’ll see If I’ve learnt enough to adapt this.

To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/f9ea1a90-4628-4d22-8168-74a0f6dfb894%40googlegroups.com.

Owen Stephens

unread,
Jul 4, 2019, 5:54:48 AM7/4/19
to OpenRefine
Always worth re-sharing :)

https://xkcd.com/353/

I just need someone to draw a similar one for GREL :)

Thad Guidry

unread,
Jul 4, 2019, 11:39:38 AM7/4/19
to openr...@googlegroups.com
Anecdotal information about Python Regex and retrieving a particular Group.

\number

Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.


^(first)(second)(third)\3$

Result:  "third"

My Split function that I use:

# Python Split Kth Occurance
split_char = "|"
# K specifies which occurrence of split_char to split on
K = 5
temp = value.split(split_char)
res = split_char.join(temp[:K]), split_char.join(temp[K:])
return res


Reply all
Reply to author
Forward
0 new messages