I am new to pandas and Python. I want to find common words in my data set. For example, I have a list of companies ["Microsoft.com", "Microsoft", "Microsoft com", "apple" ...] etc. The list has around 1M such companies, and I want to find the related words, e.g. Microsoft.com, Microsoft and Microsoft com all belong to one keyword, "Microsoft". This is what I did, but it's very slow:
    unique_companies = companies.groupby(['company'])['company'].unique()

Could be simplified to:

    unique_companies = companies['company'].unique()

Second thing I would do is set the index when you initialize the dataframe, to pre-allocate the memory:

    from pandas import DataFrame

    df = DataFrame(columns=['leven', 'fuzzy'], index=unique_companies)

Then, with the itertools (standard) library, you could do it in just one loop:

    import itertools

    # fuzz is the same object your own code uses for fuzz.ratio
    for c1, c2 in itertools.product(unique_companies, unique_companies):
        ratio = fuzz.ratio(c1, c2)
        if ratio > 85:
            df.loc[c1, 'fuzzy'] = c2
            df.loc[c1, 'value'] = ratio

Just some thoughts.

-p
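
A possible refinement of the loop above, as a rough sketch only: itertools.product visits every ordered pair, so each pair of names is scored twice and every name is also scored against itself. itertools.combinations visits each unordered pair once. The similarity() helper below is a stand-in built on the standard library's difflib so the sketch runs on its own; it is not fuzz.ratio, and similar_pairs() is a hypothetical name. This roughly halves the work, but it is still quadratic in the number of names.

```python
import itertools
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in for fuzz.ratio (an assumption): similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def similar_pairs(names, threshold=85):
    """Return (name1, name2, score) for each unordered pair scoring above threshold."""
    matches = []
    # combinations() yields each unordered pair once and never pairs a name with itself
    for c1, c2 in itertools.combinations(names, 2):
        score = similarity(c1.lower(), c2.lower())
        if score > threshold:
            matches.append((c1, c2, score))
    return matches


unique_companies = ["Microsoft.com", "Microsoft", "Microsoft com", "apple"]
pairs = similar_pairs(unique_companies, threshold=80)
for name1, name2, score in pairs:
    print(name1, "|", name2, "|", score)
```

The list of matches can be turned into a dataframe in one step at the end, which avoids assigning into df.loc once per match inside the loop.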
I have a dataframe like this:

           apple  aple  apply
    apple      0     0      0
    aple       0     0      0
    apply      0     0      0

I want to calculate the string distance between each pair, e.g. apple -> aple. My desired end result is:

           apple  aple  apply
    apple      0    32     14
    aple      32     0     30
    apply     14    30      0

Currently this is the code I am using (but it's very slow for big data):

    columns = df.columns
    for r in columns:
        for c in columns:
            m[r][c] = Simhash(r).distance(Simhash(c))

Can you please guide me on how to do it efficiently?

       word1  word2  score
    0  apple  apple      0
    1  apple   aple      0
    2  apple  apply      0
    3   aple  apple      0
    4   aple   aple      0
    5   aple  apply      0
    6  apply  apple      0
    7  apply   aple      0
    8  apply  apply      0

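
One way such a pair table could be built from the column labels, sketched with plain Python so it runs without pandas. The variable names words and rows are illustrative, not from the thread.

```python
import itertools

# In the thread these would be the labels in df.columns
words = ["apple", "aple", "apply"]

# One row per ordered pair of words, with the score initialised to 0
rows = [
    {"word1": w1, "word2": w2, "score": 0}
    for w1, w2 in itertools.product(words, repeat=2)
]

for i, row in enumerate(rows):
    print(i, row["word1"], row["word2"], row["score"])

# With pandas installed, pd.DataFrame(rows) turns this into a dataframe
# with the columns word1, word2 and score.
```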
And then you can use apply like this:

    df['score'] = df.apply(lambda r: Simhash(r['word1']).distance(Simhash(r['word2'])), axis=1)
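
Note that apply with axis=1 still calls the lambda once per row, and both hashes are rebuilt for every pair. A rough sketch of one way to cut the work: fingerprint each word once, score each unordered pair once, and mirror the value. It assumes the distance is symmetric and that a word's distance to itself is 0. pairwise_distances() is a hypothetical helper, and the toy fingerprint and distance below only stand in for Simhash so the sketch runs without third-party packages.

```python
import itertools


def pairwise_distances(words, fingerprint, distance):
    """Return {(w1, w2): d} for all ordered pairs, fingerprinting each word once."""
    # Fingerprint every word a single time instead of once per comparison
    prints = {w: fingerprint(w) for w in words}
    # Assumption: the distance from a word to itself is 0
    result = {(w, w): 0 for w in words}
    # Visit each unordered pair once and mirror the value (assumes symmetry)
    for w1, w2 in itertools.combinations(words, 2):
        d = distance(prints[w1], prints[w2])
        result[(w1, w2)] = d
        result[(w2, w1)] = d
    return result


# Toy stand-ins (assumptions, not Simhash): the set of letters in a word,
# and the number of letters that appear in only one of the two sets.
def toy_fingerprint(word):
    return frozenset(word)


def toy_distance(a, b):
    return len(a ^ b)


scores = pairwise_distances(["apple", "aple", "apply"], toy_fingerprint, toy_distance)
```

With the Simhash class from the question, the same helper would presumably be called as pairwise_distances(words, Simhash, lambda a, b: a.distance(b)), since that mirrors the Simhash(r).distance(Simhash(c)) call already used in the thread.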