converting a list within cells into boolean columns

David Bain

unread,

May 6, 2021, 3:47:56 PM5/6/21

to OpenRefine

I'm new to Openrefine and I haven't found documentation or examples for my usecase.

Any guidance would be greatly appreciated! For example could this be solved with clustering?

Use case

I have a dataset with cells containing lists of items.

e.g. a tools column may look like this

Tools

spanner, wrench,hammer

nail hitting tool, wrench, screw driver

screwdriver, big hammer, spanner

My goal is transform this single "Tools" column into corresponding boolean columns (after cleaning up alternative names, e.g. hammer is also called big hammer and nail hitting tool).

The output would look like this:

spanner, wrench, hammer, screw driver

yes, yes, yes, no

no, yes, yes, yes

yes, no, yes, yes

Please ask clarifying questions if anything is unclear.

mbec...@g.harvard.edu

unread,

May 6, 2021, 5:41:06 PM5/6/21

to OpenRefine

Someone may provide a more elegant/flexible solution, but a simple way to do this is to open the dropdown menu on the Tools column, select Edit column --> Add column based on this column... and use the expression:

value.contains("spanner")

Repeat as needed for each boolean you need.

Much more on this in the documentation on string functions.

Best,

Michael

David Bain

unread,

May 6, 2021, 6:34:55 PM5/6/21

to OpenRefine

Thanks @Michael.......Interesting.

This does work. I used Jython instead of GREL, something like this:

names = ['hammer','big hammer','nail hitting tool']

return any(name in value.lower() for name in names)

I was reflecting a bit more and this really is a tagging issue. The tools column is equivalent to a "tags" or "subject" column. Unfortunately, the users have tagged things inconsistently, hence why I needed a list of possibilities for hammer.

Give that I have variations of the same "tag".

Now is there a way to generate all the columns in one step?

Thad Guidry

unread,

May 6, 2021, 6:55:09 PM5/6/21

to openr...@googlegroups.com

If you have an inconsistent pattern in a column...

you might look at our Cluster feature.

(many tutorials online "openrefine clustering tutorial")

1. Perform a Text Facet on the column.

2. Click the Cluster button on the Facet.

3. Retype the values you wish in the Clustering dialog and merge.

4. Repeat as necessary.

Then you might be left with some consistency in order to perform those additional transformations or aggregate boolean functions.

Thad

https://www.linkedin.com/in/thadguidry/

https://calendly.com/thadguidry/

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/c030d529-5e6b-464c-9364-a16eb5c335fcn%40googlegroups.com.

David Bain

unread,

May 6, 2021, 9:45:39 PM5/6/21

to OpenRefine

Interestingly I haven't yet needed the clustering for this use case.

Here's where I've reached:

Since my users provided me a comma separated list of alternative names, I needed to be able to cut and paste the exact data structure that they gave me. As a result I'm doing:

Edit column --> Add column based on this column... with a Jython script that looks like this:

data = {

"canonical name":"hammer",

"alternative names":"hammer,big hammer,nail hitting tool"

}

names = [name.strip() for name in data["alternative names"].split(',')]

names.append(data["canonical name"])

# in case the value is null it will return 0

if value:

return any(name.lower() in value.lower() for name in names)

return 0

David Bain

unread,

May 6, 2021, 9:46:29 PM5/6/21

to OpenRefine

Next I'm working on automating this so I don't have do the new columns one at a time.

Owen Stephens

unread,

May 11, 2021, 5:29:45 AM5/11/21

to OpenRefine

OpenRefine is designed to work down a column, so to do this in a single step you may want to look at writing an expression that builds a single list of true/false (or 1/0) values which you can then split into columns afterwards. So from the original cell (or a duplicate of the original cell if you need to preserve the original) you could do something like:

output_list = []

data = [{

"canonical name":"spanner",

"alternative names":"spanner"

},

{

"canonical name":"wrench",

"alternative names":"wrench"

},

{

"canonical name":"hammer",

"alternative names":"hammer,big hammer,nail hitting tool"

},

{

"canonical name":"screwdriver",

"alternative names":"screwdriver, screw driver"

},

]

for datum in data:

names = [name.strip() for name in datum["alternative names"].split(',')]

names.append(datum["canonical name"])

# in case the value is null it will return 0

if value:

output_list.append(str(any(name.lower() in value.lower() for name in names)))

else:

output_list.append(str(0))

return ','.join(output_list)

This would give you a comma separate list of values that you could then split into columns using the OpenRefine menu option Edit Column -> Split into several columns, with a comma as the separator

To be honest at this point I do wonder if OpenRefine is the appropriate tool, especially if you want to automate it. If you are able to write the Python, it wouldn't be much of a step from what you have to writing the whole thing in a Python script

There maybe other factors in play of course, but I wouldn't stick to OpenRefine for this type of task unless it was giving me some other benefit