Creating a new column based on two other columns.

207 views
Skip to first unread message

JD Schwarzenbach

unread,
Apr 13, 2017, 2:04:39 PM4/13/17
to OpenRefine

I created a table in Refine that contains user ID's and review cities of the city in which each of their reviews take place


I am trying to figure out the best way to determine the users "local" city by creating a new column assigning where they are a local. 

I want to determine this by taking a users most frequently reviewed city (>50%) and assigning that city to their user id as their "local city"

But I am stumped on how to wrangle this in refine. 

Any suggestions?

Ettore Rizza

unread,
Apr 13, 2017, 4:14:29 PM4/13/17
to OpenRefine
HI JD,

Another user asked a similar question this afternoon. The answer will be essentially the same, except for the formula to use (since you do not want to know the number of different cities rewieved by a user, but which is cited more than half of the times). I think the simplest for you is to use this little Jython code by adding a column based on "business_city". Here is the code (do not forget to put the Open Refine window in Python/Jython mode):

liste = row['record']['cells']['business_cities']['value'] #make sure that the column name "business_cities" is spelled correctly

for cities in liste:
    count
= liste.count(cities)
   
if count >= len(liste)/2.0:
       
return cities

This formula assumes that there is no tie. If a user has cited two cities exactly 50% of the time, the code will just return the first. It can be modified if you want.

Feel free to ask for more details if something doesn't seem clear to you.

Ettore Rizza

unread,
Apr 13, 2017, 4:16:56 PM4/13/17
to OpenRefine
Oups, I forgot the link to the similar question and the answer : https://groups.google.com/forum/#!topic/openrefine/nuLGxvsJerQ


JD Schwarzenbach

unread,
Apr 14, 2017, 1:34:04 AM4/14/17
to OpenRefine
Thank you so much man! Really appreciate it. This is just what we needed!
Reply all
Reply to author
Forward
0 new messages