Pandas: finding word distance / relevance in a dataframe


Ch. Shakeel Mumtaz

Sep 15, 2014, 2:25:38 PM
to pyd...@googlegroups.com

I am new to pandas and Python. I want to find common words in my data set. For example, I have a list of companies ["Microsoft.com", "Microsoft", "Microsoft com", "apple", ...]. I have around 1M such company names, and I want to find the related ones, e.g. Microsoft.com, Microsoft, and Microsoft com all belong to the single keyword "Microsoft".

This is what I did, but it's very slow:

import pandas as pd
from pandas import DataFrame
from fuzzywuzzy import fuzz  # assuming the fuzzywuzzy package for fuzz.ratio

companies = pd.read_csv('/tmp/companies.csv', error_bad_lines=False)
unique_companies = companies.groupby(['company'])['company'].unique()
df = DataFrame(columns=['name', 'leven', 'fuzzy'])

# one row per unique company name
for company in unique_companies:
    df = df.append({'name': company[0], 'leven': [], 'fuzzy': []}, ignore_index=True)

# pairwise fuzzy comparison of every row against every other row
for index1, series1 in df.iterrows():
    for index2, series2 in df.iterrows():
        ratio = fuzz.ratio(series1['name'], series2['name'])
        if ratio > 85:
            series1['fuzzy'].append({'name': series2['name'], 'value': ratio})


Can anyone guide me on how to do this efficiently?

Paul Hobson

Sep 15, 2014, 3:51:14 PM
to pyd...@googlegroups.com
Hard to know without seeing precisely what your desired output is like.

That said, first thing I see is that:
unique_companies = companies.groupby(['company'])['company'].unique()
Could be simplified to:
unique_companies = companies['company'].unique()

Second thing I would do is set the index when you initialize the dataframe to pre-allocate the memory:
df = DataFrame(columns=['leven', 'fuzzy'], index=unique_companies)
Then, with the itertools (standard library) module, you could do just one loop:

import itertools

for c1, c2 in itertools.product(unique_companies, unique_companies):
    ratio = fuzz.ratio(c1, c2)
    if ratio > 85:
        df.loc[c1, 'fuzzy'] = c2
        df.loc[c1, 'value'] = ratio
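Since fuzz.ratio is symmetric, a possible refinement (just a sketch, assuming fuzzywuzzy's fuzz and the dataframe above) is to loop over unordered pairs with itertools.combinations and record each match in both directions, roughly halving the comparisons:

import itertools
from fuzzywuzzy import fuzz  # assumed fuzzy-matching library

# each unordered pair is visited once; fuzz.ratio(a, b) == fuzz.ratio(b, a)
for c1, c2 in itertools.combinations(unique_companies, 2):
    ratio = fuzz.ratio(c1, c2)
    if ratio > 85:
        # record the match for both names
        df.loc[c1, 'fuzzy'] = c2
        df.loc[c1, 'value'] = ratio
        df.loc[c2, 'fuzzy'] = c1
        df.loc[c2, 'value'] = ratio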
Just some thoughts.
-p


Miki Tebeka

Sep 17, 2014, 1:35:13 AM
to pyd...@googlegroups.com
IMO what you're trying to do here is cluster words. There are many ways to do that - here's one example using scikit-learn.
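As a rough sketch of that idea (assuming scikit-learn's TfidfVectorizer and KMeans; the character n-gram ranges and the cluster count are guesses you would have to tune on the real data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

names = ["Microsoft.com", "Microsoft", "Microsoft com", "apple", "aple"]

# character n-grams make near-duplicate spellings look similar to each other
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))
X = vectorizer.fit_transform(names)

# n_clusters=2 is arbitrary here; pick/tune it for your data set
km = KMeans(n_clusters=2, random_state=0)
labels = km.fit_predict(X)

for name, label in zip(names, labels):
    print(label, name)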

Ch. Shakeel Mumtaz

Sep 18, 2014, 4:48:19 AM
to pyd...@googlegroups.com

I have a dataframe like this:

        apple aple  apply
apple     0     0      0
aple      0     0      0
apply     0     0      0


I want to calculate the string distance between each pair, e.g. apple -> aple. My desired end result is:

        apple aple  apply
apple     0     32     14
aple      32    0      30
apply     14    30     0


Currently this is the code I am using (but it's very slow for big data):

from simhash import Simhash  # assuming the simhash package

columns = df.columns
for r in columns:
    for c in columns:
        # m is the word-by-word distance matrix being filled in
        m[r][c] = Simhash(r).distance(Simhash(c))


Can you please guide me on how to do this efficiently?

Ch. Shakeel Mumtaz

Sep 18, 2014, 5:42:14 AM
to pyd...@googlegroups.com

Can you give me an example of using k-means on text data?

Tom Augspurger

Sep 18, 2014, 9:19:12 AM
to pyd...@googlegroups.com
Scikit-Learn uses floats or ints for everything. In Paul's example the words are transformed into a numeric matrix with scikit-learn's TfidfVectorizer.
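For instance, a tiny sketch of that transformation (the character n-gram settings here are an assumption):

from sklearn.feature_extraction.text import TfidfVectorizer

words = ["apple", "aple", "apply"]

# each word becomes a row of numbers, one column per character n-gram
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 3))
X = vectorizer.fit_transform(words)

print(X.shape)      # (3, number_of_ngrams)
print(X.toarray())  # the numeric matrix scikit-learn's estimators work on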

Ch. Shakeel Mumtaz

Sep 18, 2014, 9:32:01 AM
to pyd...@googlegroups.com

I tried scikit-learn's KMeans algorithm, but it is not suitable for string-based distance or relevance. So I tried hierarchical clustering using the Jaro distance as the distance measure, but that is also slow for a large dataset.

Here is the code I used:

from jellyfish import jaro_distance
import numpy as np
import scipy.cluster.hierarchy

words = unique_companies

def d(coord):
    i, j = coord
    return 1 - jaro_distance(words[i], words[j])

# indices of the upper triangle give each unordered pair of words once
pair_indices = np.triu_indices(len(words), 1)
# condensed pairwise distance vector, as expected by linkage()
distances = np.apply_along_axis(d, 0, pair_indices)
linkage_matrix = scipy.cluster.hierarchy.linkage(distances)

Also, how do I display this output?
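One rough sketch for pulling flat groups out of that linkage matrix is scipy.cluster.hierarchy.fcluster (the 0.25 distance cutoff below is an arbitrary guess that would need tuning):

from scipy.cluster.hierarchy import fcluster

# cut the dendrogram at an arbitrary distance threshold
labels = fcluster(linkage_matrix, t=0.25, criterion='distance')

for word, label in zip(words, labels):
    print(label, word)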

Paul Hobson

Sep 18, 2014, 9:41:23 PM
to pyd...@googlegroups.com
On Thu, Sep 18, 2014 at 1:48 AM, Ch. Shakeel Mumtaz <itsha...@gmail.com> wrote:

Currently this is the code I am using (but it's very slow for big data):

columns = df.columns
for r in columns:
    for c in columns:
        m[r][c] = Simhash(r).distance(Simhash(c))

This is really inefficient. You should stack up your dataframe and apply your function. 

That would look like this:

from io import StringIO
import pandas

data = StringIO("""\
word1   apple aple  apply
apple     0     0      0
aple      0     0      0
apply     0     0      0
""")

df = pandas.read_table(data, sep='\s+', index_col='word1')
df.columns.names = ['word2']
df = df.stack().reset_index().rename(columns={0: 'score'})
print(df)

which gives me:

   word1  word2  score
0  apple  apple      0
1  apple   aple      0
2  apple  apply      0
3   aple  apple      0
4   aple   aple      0
5   aple  apply      0
6  apply  apple      0
7  apply   aple      0
8  apply  apply      0

And then you can use apply like this:

df['score'] = df.apply(lambda r: Simhash(r['word1']).distance(Simhash(r['word2'])), axis=1)
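And if you need the square matrix form back at the end, a quick sketch would be to pivot the long-form result:

# reshape the long-form scores back into a word-by-word matrix
matrix = df.pivot(index='word1', columns='word2', values='score')
print(matrix)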