
How to accumulate word frequencies from multiple CSV files into a common output file


kishan.samp...@gmail.com

Jun 22, 2017, 7:16:28 AM
I want to write a common file that accumulates word frequencies from multiple CSV files: if the same word is repeated, its frequency in the common file should be increased. Can anyone help me, please?


import re
import operator
import string

class words:
    def __init__(self, fh):
        self.fh = fh
    def read(self):
        for line in self.fh:
            yield line.split()

if __name__ == "__main__":
    frequency = {}
    document_text = open('data_analysis.csv', 'r')
    common1_file = open("common_file1.csv", "r")

    text_string = document_text.read().lower()
    match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)

    text_string_one = common1_file.read().lower()
    match_pattern_one = re.findall(r'\b[a-z]{3,15}\b', text_string_one)
    #print("match_pattern"+(str(match_pattern)))
    for word in match_pattern:
        for word1 in match_pattern_one:
            count = frequency.get(word,0)
            count1 = frequency.get(word1,0)
            if word1 == word:
                frequency[word] = count + count1
            else:
                frequency[word] = count

    frequency_list = frequency.keys()
    text_file = open("common_file1.csv", "w")
    for words in frequency_list:
        data = (words, frequency[words])
        print(data)
        #text_file = open("common_file1.csv", "w")
        #for i in data:
        #store_fre = (str(data)+"\n")
        text_file.write(str(data)+"\n")

    text_file.close()


This is the code I have written so far, but I am not getting satisfactory results.

bream...@gmail.com

Jun 22, 2017, 8:49:46 AM
A quick glance suggests you need to close common_file1.csv before you open it in write mode.

You can simplify your code by using a Counter: https://docs.python.org/3/library/collections.html#collections.Counter
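
A minimal sketch of that approach (untested, reusing the file names and regex from your post; note it reads both files before reopening common_file1.csv for writing):

import re
from collections import Counter

frequency = Counter()
for name in ('data_analysis.csv', 'common_file1.csv'):
    with open(name) as f:  # read each input before anything is rewritten
        frequency.update(re.findall(r'\b[a-z]{3,15}\b', f.read().lower()))

with open('common_file1.csv', 'w') as out:
    for word, count in frequency.most_common():
        out.write('%s,%d\n' % (word, count))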

Kindest regards.

Mark Lawrence

Jussi Piitulainen

Jun 22, 2017, 3:46:40 PM
Dennis Lee Bieber writes:

> # lowercase all, open hyphenated and / separated words, parens,
> # etc.
> ln = ln.lower().replace("/", " ").replace("-", " ").replace(".", " ")
> ln = ln.replace("\\", " ").replace("[", " ").replace("]", " ")
> ln = ln.replace("{", " ").replace("}", " ")
> wds = ln.replace("(", " ").replace(")", " ").replace("\t", " ").split()

A pair of methods, str.maketrans to make a translation table and then
.translate on every string, lets you do all that in one step:

spacy = r'\/-.[]{}()'
tr = str.maketrans(dict.fromkeys(spacy, ' '))

...

ln = ln.translate(tr)

But those seem to be only in Python 3.
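
For example, with a made-up sample line:

spacy = r'\/-.[]{}()'
tr = str.maketrans(dict.fromkeys(spacy, ' '))
print("look-alike (fish/fowl)".translate(tr))
# -> 'look alike  fish fowl '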

> # for each word in the line
> for wd in wds:
> # strip off leading/trailing punctuation
> wd = wd.strip("\\|'\";'[]{},<>?~!@#$%^&*_+= ")

You have already replaced several of those characters with spaces.

> # do we still have a word? Skip any with still embedded
> # punctuation
> if wd and wd.isalpha():
> # attempt to update the count for this word

But for quick and dirty work I might use a very simple regex, probably
literally this regex:

wordy = re.compile(r'\w+')

...

for wd in wordy.findall(ln): # or .finditer, but I think it's newer
...


However, if the OP really is getting their input from a CSV file, they
shouldn't need methods like these. Because surely it's then already an
unambiguous list of words, to be read in with the csv module? Or else
it's not yet CSV at all after all? I think they need to sit down with
someone who can walk them through the whole exercise.
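
Something like this sketch, say, assuming each field of the file really is one word (the file name is taken from the original post):

import csv
from collections import Counter

frequency = Counter()
with open('data_analysis.csv', newline='') as f:
    for row in csv.reader(f):
        frequency.update(word.lower() for word in row if word)
print(frequency.most_common(10))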

Jussi Piitulainen

Jun 23, 2017, 2:49:19 AM
Dennis Lee Bieber writes:

> On Thu, 22 Jun 2017 22:46:28 +0300, Jussi Piitulainen declaimed the
> following:
>
>>
>> A pair of methods, str.maketrans to make a translation table and then
>> .translate on every string, allows to do all that in one step:
>>
>> spacy = r'\/-.[]{}()'
>> tr = str.maketrans(dict.fromkeys(spacy, ' '))
>>
>> ...
>>
>> ln = ln.translate(tr)
>>
>> But those seem to be only in Python 3.
>>
>
> Well -- I wasn't trying for "production ready" either; mostly
> focusing on the SQLite side of things.

I know, and that's a sound suggestion if the OP is ready for that.

I just like those character translation methods, and I didn't like it
when you first took the time to call a simple regex "line noise" and
then proceeded to post something that looked much more noisy yourself.

>> However, if the OP really is getting their input from a CSV file,
>> they shouldn't need methods like these. Because surely it's then
>> already an unambiguous list of words, to be read in with the csv
>> module? Or else it's not yet CSV at all after all? I think they need
>> to sit down with someone who can walk them through the whole
>> exercise.
>
> The OP file extensions had CSV, but there was no sign of the csv
> module being used; worse, it looks like the write of the results file
> has no formatting -- it is the repr of a tuple of (word, count)!

Exactly. Too many things like that make me think they are not ready for
more advanced methods.

> I'm going out on a limb and guessing the regex being used to
> find words is accepting anything separated by leading/trailing space
> containing a minimum of 3 and maximum of 15 characters in the set
> a..z. So could be missing first and last words on a line if they don't
> have the leading or trailing space, and ignoring "a", "an", "me",
> etc., along with "mrs." [due to .] In contrast, I didn't limit on
> length, and tried to split "look-alike" into "look" and "alike" (and
> given time, would have tried to accept "people's" as a possessive).

I'm not sure I like the splitting of look-alike (I'm not sure that I
like not splitting it either) but note that the regex does that for
free.

The \b in the original regex matches the empty string at a position
where there is a "word character" on only one side. It recognizes a
boundary at the beginning of a line and at whitespace, but also at all
the punctuation marks.
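
For example, with a made-up sample string:

import re
print(re.findall(r'\b[a-z]{3,15}\b', "mrs. o'leary's look-alike"))
# -> ['mrs', 'leary', 'look', 'alike']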

You guess right about the length limits. I wouldn't use them, and then
there's no need for the boundary markers any more: my \w+ matches
maximal sequences of word characters (even in foreign languages like
Finnish or French, and even in upper case, also digits).

To also match "people's" and "didn't", use \w+'\w+, and to match both with
and without the ', make the trailing part optional: \w+('\w+)?. Here the
notation really does start to become noisy, because one must prevent the
parentheses from "capturing" the group:

import re
wordy = re.compile(r''' \w+ (?: ' \w+ )? ''', re.VERBOSE)
text = '''
Oliver N'Goma, dit Noli, né le 23 mars 1959 à Mayumba et mort le 7 juin
2010, est un chanteur et guitariste gabonais d'Afro-zouk.
'''

print(wordy.findall(text))

# ['Oliver', "N'Goma", 'dit', 'Noli', 'né', 'le', '23', 'mars', '1959',
# 'à', 'Mayumba', 'et', 'mort', 'le', '7', 'juin', '2010', 'est', 'un',
# 'chanteur', 'et', 'guitariste', 'gabonais', "d'Afro", 'zouk']

Not too bad?

But some punctuation really belongs in words. And some doesn't. And
everything becomes hard and every new heuristic turns out too strict or
too lenient and things that are not words at all may look like words, or
it may not be clear whether something is a word or is more than one word
or is less than a word or not like a word at all. Should one be amused?
Should one despair?

:)

Peter Otten

Jun 23, 2017, 12:13:53 PM
Dennis Lee Bieber wrote:

> On Fri, 23 Jun 2017 09:49:06 +0300, Jussi Piitulainen
> <jussi.pi...@helsinki.fi> declaimed the following:
>
>>I just like those character translation methods, and I didn't like it
>>when you first took the time to call a simple regex "line noise" and
>>then proceeded to post something that looked much more noisy yourself.
>>
>
> Tediously long (and likely slow running), but I'd think each .replace()
> would have been self-explanatory.
> Above content saved (in a write-only file? I don't recall the times
> I've searched my post archives) for potential future use. I should plug it
> into my demo and see how much speed improvement I get.

Most of the potential speedup can be gained from using collections.Counter()
instead of the database. If necessary, write the counter's contents into the
database in a second step.
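
A sketch of that two-step idea (the database file, table and column names here are invented, since the original demo's schema isn't shown):

import sqlite3
from collections import Counter

words = ['ham', 'spam', 'ham']  # stand-in for the word stream
frequency = Counter(words)      # step 1: count everything in memory

con = sqlite3.connect('wordcount.db')
con.execute('CREATE TABLE IF NOT EXISTS freq (word TEXT PRIMARY KEY, count INTEGER)')
# step 2: write the counter's contents to the database in one pass
con.executemany('INSERT OR REPLACE INTO freq VALUES (?, ?)', frequency.items())
con.commit()
con.close()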

mbyr...@gmail.com

Jun 23, 2017, 8:04:16 PM
On Thursday, June 22, 2017 at 12:16:28 PM UTC+1, kishan.samp...@gmail.com wrote:
Dictionary 'frequency' is updated only with values of 0.

If the aim is to get a count of occurrences for each word
where the word exists in both input files, you could replace this:

for word in match_pattern:
    for word1 in match_pattern_one:
        count = frequency.get(word,0)
        count1 = frequency.get(word1,0)
        if word1 == word:
            frequency[word] = count + count1
        else:
            frequency[word] = count

with this:

all_words = match_pattern + match_pattern_one
word_set = set(match_pattern) & set(match_pattern_one)
while word_set:
    word = word_set.pop()
    count = all_words.count(word)
    frequency[word] = count
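
For comparison, a sketch of the same selection using collections.Counter (as suggested earlier in the thread), which avoids rescanning the combined list with all_words.count for every word:

from collections import Counter

all_counts = Counter(match_pattern) + Counter(match_pattern_one)
common = set(match_pattern) & set(match_pattern_one)
frequency = {word: all_counts[word] for word in common}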

Other observations:
- Reading from and writing to the csv files is not utilising the csv format
- The regex may be too restrictive, so not all expected words are extracted
- The output is written to one of the input files, overwriting the original content of that input file

SANS SANTOSH

Jun 27, 2017, 3:57:40 PM
On Sat 24. Jun 2017 at 13:27, Mark Byrne <mbyr...@gmail.com> wrote:

> A problem (possibly the problem) is the lines which use the get function:
> count = frequency.get(word,0)
>
> Since the dictionary is empty at the start of the loop, the get function is
> passing a value of 0 to count and count1.
> The subsequent updates to the dictionary are applying a value of zero
>
> As a result, the file being written to contains count values of 0 for each
> word.
>
> Possible fix is to replace this:
>
> count = frequency.get(word,0)
> count1 = frequency.get(word1,0)
> if word1 == word:
>     frequency[word] = count + count1
> else:
>     frequency[word] = count
>
> with this:
>
> if word1 == word:
>     if word in frequency:
>         frequency[word] += 1
>     else:
>         frequency[word] = 1
> --
> https://mail.python.org/mailman/listinfo/python-list
>

Jerry Hill

Jun 27, 2017, 4:07:18 PM
On Fri, Jun 23, 2017 at 4:07 PM, Mark Byrne <mbyr...@gmail.com> wrote:

> Possible fix is to replace this:
>
> count = frequency.get(word,0)
> count1 = frequency.get(word1,0)
> if word1 == word:
>     frequency[word] = count + count1
> else:
>     frequency[word] = count
>
> with this:
>
> if word1 == word:
>     if word in frequency:
>         frequency[word] += 1
>     else:
>         frequency[word] = 1
>

Have you considered replacing your frequency dict with a
defaultdict(int)? That way you could boil the whole thing down to:

from collections import defaultdict
frequency = defaultdict(int)
...
<more code here>
...
if word1 == word:
    frequency[word] += 1

The defaultdict takes care of the special case for you. If the value is
missing, it's defaulted to 0.
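
For instance, a tiny made-up example:

from collections import defaultdict

frequency = defaultdict(int)
for word in ['ham', 'spam', 'ham']:
    frequency[word] += 1   # a missing key starts at 0 automatically
print(dict(frequency))     # -> {'ham': 2, 'spam': 1}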

--
Jerry