Pandas: finding word distance / relevance in a dataframe


Ch. Shakeel Mumtaz

Sep 15, 2014, 2:25:38 PM
to pyd...@googlegroups.com

I am new to pandas and Python. I want to find common words in my data set. For example, I have a list of companies ["Microsoft.com", "Microsoft", "Microsoft com", "apple", ...]. I have around 1M such company names, and I want to find the related ones, e.g. Microsoft.com, Microsoft, and Microsoft com all belong to the single keyword "Microsoft".

This is what I did, but it's very slow:

import pandas as pd
from pandas import DataFrame
from fuzzywuzzy import fuzz  # assuming the fuzzywuzzy package for fuzz.ratio

companies = pd.read_csv('/tmp/companies.csv', error_bad_lines=False)
unique_companies = companies.groupby(['company'])['company'].unique()
df = DataFrame(columns=['name', 'leven', 'fuzzy'])

# one row per unique company name
for company in unique_companies:
    df = df.append({'name': company[0], 'leven': [], 'fuzzy': []}, ignore_index=True)

# pairwise fuzzy comparison of every row against every other row
for index1, series1 in df.iterrows():
    for index2, series2 in df.iterrows():
        ratio = fuzz.ratio(series1['name'], series2['name'])
        if ratio > 85:
            series1['fuzzy'].append({'name': series2['name'], 'value': ratio})


Can anyone guide me on how to do this efficiently?

Paul Hobson

Sep 15, 2014, 3:51:14 PM
to pyd...@googlegroups.com
Hard to know without seeing precisely what your desired output is like.

That said, first thing I see is that:
unique_companies = companies.groupby(['company'])['company'].unique()
Could be simplified to:
unique_companies = companies['company'].unique()

Second thing I would do is set the index when you initialize the dataframe to pre-allocate the memory:
df = DataFrame(columns=['leven', 'fuzzy'], index=unique_companies)
Then, with the itertools (standard library) module, you could do just one loop:

import itertools

for c1, c2 in itertools.product(unique_companies, unique_companies):
    ratio = fuzz.ratio(c1, c2)
    if ratio > 85:
        df.loc[c1, 'fuzzy'] = c2
        df.loc[c1, 'value'] = ratio
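Since fuzz.ratio is symmetric, a possible refinement (just a sketch, assuming fuzzywuzzy's fuzz and the dataframe above) is to loop over unordered pairs with itertools.combinations and record each match in both directions, roughly halving the comparisons:

import itertools
from fuzzywuzzy import fuzz  # assumed fuzzy-matching library

# each unordered pair is visited once; fuzz.ratio(a, b) == fuzz.ratio(b, a)
for c1, c2 in itertools.combinations(unique_companies, 2):
    ratio = fuzz.ratio(c1, c2)
    if ratio > 85:
        # record the match for both names
        df.loc[c1, 'fuzzy'] = c2
        df.loc[c1, 'value'] = ratio
        df.loc[c2, 'fuzzy'] = c1
        df.loc[c2, 'value'] = ratio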
Just some thoughts.
-p


Miki Tebeka

Sep 17, 2014, 1:35:13 AM
to pyd...@googlegroups.com
IMO what you're trying to do here is cluster words. There are many ways to do that - here's one example using scikit-learn.
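As a rough sketch of that idea (assuming scikit-learn's TfidfVectorizer and KMeans; the character n-gram ranges and the cluster count are guesses you would have to tune on the real data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

names = ["Microsoft.com", "Microsoft", "Microsoft com", "apple", "aple"]

# character n-grams make near-duplicate spellings look similar to each other
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))
X = vectorizer.fit_transform(names)

# n_clusters=2 is arbitrary here; pick/tune it for your data set
km = KMeans(n_clusters=2, random_state=0)
labels = km.fit_predict(X)

for name, label in zip(names, labels):
    print(label, name)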

Ch. Shakeel Mumtaz

Sep 18, 2014, 4:48:19 AM
to pyd...@googlegroups.com

I have a dataframe like this:

        apple aple  apply
apple     0     0      0
aple      0     0      0
apply     0     0      0


I want to calculate the string distance between each pair, e.g. apple -> aple. My desired end result is:

        apple aple  apply
apple     0     32     14
aple      32    0      30
apply     14    30     0


Currently this is the code I am using (but it's very slow for big data):

from simhash import Simhash  # assuming the simhash package

columns = df.columns
for r in columns:
    for c in columns:
        # m is the word-by-word distance matrix being filled in
        m[r][c] = Simhash(r).distance(Simhash(c))


Can you please guide me on how to do this efficiently?

Ch. Shakeel Mumtaz

Sep 18, 2014, 5:42:14 AM
to pyd...@googlegroups.com

Can you give me an example of using k-means on text data?

Tom Augspurger

Sep 18, 2014, 9:19:12 AM
to pyd...@googlegroups.com
Scikit-Learn uses floats or ints for everything. In Paul's example the words are transformed into a numeric matrix with scikit-learn's TfidfVectorizer.
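For instance, a tiny sketch of that transformation (the character n-gram settings here are an assumption):

from sklearn.feature_extraction.text import TfidfVectorizer

words = ["apple", "aple", "apply"]

# each word becomes a row of numbers, one column per character n-gram
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 3))
X = vectorizer.fit_transform(words)

print(X.shape)      # (3, number_of_ngrams)
print(X.toarray())  # the numeric matrix scikit-learn's estimators work on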

Ch. Shakeel Mumtaz

Sep 18, 2014, 9:32:01 AM
to pyd...@googlegroups.com

I tried scikit-learn's KMeans algorithm, but it is not suitable for string-based distance or relevance. So I tried hierarchical clustering using the Jaro distance as the distance measure, but that is also slow for a large dataset.

Here is the code I used:

from jellyfish import jaro_distance
import numpy as np
import scipy.cluster.hierarchy

words = unique_companies

def d(coord):
    i, j = coord
    return 1 - jaro_distance(words[i], words[j])

# indices of the upper triangle give each unordered pair of words once
pair_indices = np.triu_indices(len(words), 1)
# condensed pairwise distance vector, as expected by linkage()
distances = np.apply_along_axis(d, 0, pair_indices)
linkage_matrix = scipy.cluster.hierarchy.linkage(distances)

Also, how do I display this output?
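One rough sketch for pulling flat groups out of that linkage matrix is scipy.cluster.hierarchy.fcluster (the 0.25 distance cutoff below is an arbitrary guess that would need tuning):

from scipy.cluster.hierarchy import fcluster

# cut the dendrogram at an arbitrary distance threshold
labels = fcluster(linkage_matrix, t=0.25, criterion='distance')

for word, label in zip(words, labels):
    print(label, word)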

Paul Hobson

Sep 18, 2014, 9:41:23 PM
to pyd...@googlegroups.com
On Thu, Sep 18, 2014 at 1:48 AM, Ch. Shakeel Mumtaz <itsha...@gmail.com> wrote:

Currently this is the code I am using (but it's very slow for big data):

columns = df.columns
for r in columns:
    for c in columns:
        m[r][c] = Simhash(r).distance(Simhash(c))

This is really inefficient. You should stack up your dataframe and apply your function. 

That would look like this:

from io import StringIO
import pandas

data = StringIO("""\
word1   apple aple  apply
apple     0     0      0
aple      0     0      0
apply     0     0      0
""")

df = pandas.read_table(data, sep='\s+', index_col='word1')
df.columns.names = ['word2']
df = df.stack().reset_index().rename(columns={0: 'score'})
print(df)

which gives me:

   word1  word2  score
0  apple  apple      0
1  apple   aple      0
2  apple  apply      0
3   aple  apple      0
4   aple   aple      0
5   aple  apply      0
6  apply  apple      0
7  apply   aple      0
8  apply  apply      0

And then you can use apply like this:

df['score'] = df.apply(lambda r: Simhash(r['word1']).distance(Simhash(r['word2'])), axis=1)
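And if you need the square matrix form back at the end, a quick sketch would be to pivot the long-form result:

# reshape the long-form scores back into a word-by-word matrix
matrix = df.pivot(index='word1', columns='word2', values='score')
print(matrix)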