
Re: Using dictionary key as a regular expression class


Terry Reedy

Jan 22, 2010, 8:46:35 PM
to pytho...@python.org
On 1/22/2010 4:47 PM, Chris Jones wrote:
> I was writing a script that counts occurrences of characters in source code files:
>
> #!/usr/bin/python
> import codecs
> tcounters = {}
> f = codecs.open('/home/gavron/git/screen/src/screen.c', 'r', "utf-8")
> for uline in f:
>     lline = []
>     for char in uline[:-1]:
>         lline += [char]

Same as, but slower than, lline.append(char); however, this loop just
uselessly copies uline[:-1]

> counters = {}
> for i in set(lline):
>     counters[i] = lline.count(i)

slow way to do this

> for c in counters.keys():
>     if c in tcounters:
>         tcounters[c] += counters[c]
>     else:
>         tcounters.update({c: counters[c]})

I do not see the reason for intermediate dict
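The whole job can be done by incrementing one running dict as each character is read; a minimal sketch (the helper name and the `io.StringIO` sample input are invented stand-ins for the real file object):

```python
import io

def char_counts(stream):
    # Tally every character into one running dict -- no per-line dict needed.
    counts = {}
    for line in stream:
        for ch in line:
            counts[ch] = counts.get(ch, 0) + 1
    return counts

# Works the same on a real file object, e.g. codecs.open(path, 'r', 'utf-8').
counts = char_counts(io.StringIO("abba\nba"))
```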

> counters = {}

duplicate line

> for c in tcounters.keys():
>     print c, '\t', tcounters[c]

To only count ascii chars, as should be the case for C code,

achars = [0]*95   # printable ASCII, ord 32..126
for c in open('xxx', 'c'):
    i = ord(c) - 32
    if 0 <= i < 95:   # negative indices would wrap around, so guard explicitly
        achars[i] += 1

for i, n in enumerate(achars):
    print chr(i+32), n

or sum subsets as desired.
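A self-contained sketch of the fixed-array approach: each printable ASCII char (ord 32..126) gets a slot, everything else (newlines, tabs, non-ASCII) is skipped, and subsets can be summed afterwards. The function name and sample text are made up:

```python
def ascii_counts(text):
    # One slot per printable ASCII char, ord 32..126; skip everything else.
    counts = [0] * 95
    for ch in text:
        i = ord(ch) - 32
        if 0 <= i < 95:
            counts[i] += 1
    return counts

counts = ascii_counts("int main() {}\n")

# "Sum subsets as desired", e.g. all letters:
letters = sum(n for i, n in enumerate(counts) if chr(i + 32).isalpha())
```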

Terry Jan Reedy

Chris Jones

Jan 22, 2010, 9:58:48 PM
to pytho...@python.org
On Fri, Jan 22, 2010 at 08:46:35PM EST, Terry Reedy wrote:
> On 1/22/2010 4:47 PM, Chris Jones wrote:
>> I was writing a script that counts occurrences of characters in source code files:
>>
>> #!/usr/bin/python
>> import codecs
>> tcounters = {}
>> f = codecs.open('/home/gavron/git/screen/src/screen.c', 'r', "utf-8")
>> for uline in f:
>>     lline = []
>>     for char in uline[:-1]:
>>         lline += [char]
>
> Same as, but slower than, lline.append(char); however, this loop just
> uselessly copies uline[:-1]

I'll change that.

Do you mean I should just read the file one character at a time?

That was my original intention but I didn't find the way to do it.

>> counters = {}
>> for i in set(lline):
>>     counters[i] = lline.count(i)
>
> slow way to do this
>
>> for c in counters.keys():
>>     if c in tcounters:
>>         tcounters[c] += counters[c]
>>     else:
>>         tcounters.update({c: counters[c]})
>
> I do not see the reason for intermediate dict

Couldn't find a way to increment the counters in the 'grand total'
dictionary. I always ended up with the counter values for the last input
line :-(

Moot point if I can do a for loop reading one character at a time till
end of file.

>> counters = {}
>
> duplicate line

And totally useless since I never reference it after that. Something I
moved elsewhere and forgot to delete.

Sorry about that.

>> for c in tcounters.keys():
>>     print c, '\t', tcounters[c]

Literals, comments, €'s..?

> To only count ascii chars, as should be the case for C code,
>
> achars = [0]*95   # printable ASCII, ord 32..126
> for c in open('xxx', 'c'):
>     i = ord(c) - 32
>     if 0 <= i < 95:   # negative indices would wrap around, so guard explicitly
>         achars[i] += 1
>
> for i, n in enumerate(achars):
>     print chr(i+32), n
>
> or sum subsets as desired.

Thanks much for the snippet, let me play with it and see if I can come
up with a Unicode/utf-8 version.. since while I'm at it I might as well
write something a bit more general than C code.

Since utf-8 is backward-compatible with 7bit ASCII, this shouldn't be
a problem.

> Terry Jan Reedy

Thank you for your comments!

CJ

Terry Reedy

Jan 23, 2010, 2:45:41 AM
to pytho...@python.org
On 1/22/2010 9:58 PM, Chris Jones wrote:
> On Fri, Jan 22, 2010 at 08:46:35PM EST, Terry Reedy wrote:

> Do you mean I should just read the file one character at a time?

Whoops, my misdirection (you can .read(1), but this is s l o w).
I meant to suggest processing it a char at a time.

1. If not too big,

for c in open(x, 'rb').read():  # left .read() off
# 'b' will get bytes, though ord(c) is the same for ascii chars,
# byte or unicode

2. If too big for that,

for line in open():
    for c in line:  # or left off this part


>> To only count ascii chars, as should be the case for C code,
>>
>> achars = [0]*95   # printable ASCII, ord 32..126
>> for c in open('xxx', 'c'):
>>     i = ord(c) - 32
>>     if 0 <= i < 95:   # negative indices would wrap around, so guard explicitly
>>         achars[i] += 1
>>
>> for i, n in enumerate(achars):
>>     print chr(i+32), n
>>
>> or sum subsets as desired.
>
> Thanks much for the snippet, let me play with it and see if I can come
> up with a Unicode/utf-8 version.. since while I'm at it I might as well
> write something a bit more general than C code.
>
> Since utf-8 is backward-compatible with 7bit ASCII, this shouldn't be
> a problem.

For any extended ascii, use a larger array without decoding (until print,
if need be). For unicode, add an encoding to open and 'c in line' will
return unicode chars. Then use *one* dict or defaultdict. I think
something like:

from collections import defaultdict
d = defaultdict(int)
...
d[c] += 1 # if c is new, d[c] defaults to int() == 0
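Filled in as a runnable fragment (the sample string is arbitrary):

```python
from collections import defaultdict

counts = defaultdict(int)
for ch in "abracadabra":
    counts[ch] += 1          # a missing key defaults to int() == 0

# dict(counts) == {'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1}
```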

Terry Jan Reedy

Chris Jones

Jan 23, 2010, 4:22:39 AM
to pytho...@python.org
On Sat, Jan 23, 2010 at 02:45:41AM EST, Terry Reedy wrote:
> On 1/22/2010 9:58 PM, Chris Jones wrote:

>> Do you mean I should just read the file one character at a time?
>
> Whoops, my misdirection (you can .read(1), but this is s l o w).
> I meant to suggest processing it a char at a time.

Right.. that's how I understood it - i.e. asking python for the next
character, and not worrying about how much is retrieved from the disk in
one pass.

> 1. If not too big,
>
> for c in open(x, 'rb').read():  # left .read() off
> # 'b' will get bytes, though ord(c) is the same for ascii chars,
> # byte or unicode
>
> 2. If too big for that,
>
> for line in open():
>     for c in line:  # or left off this part

Well the script is not going to process anything larger than a few
kilobytes, but all the same that's something I want to understand
better.

Isn't there any way I can tell python to retrieve a fairly large chunk
of the disk file, like 4-8K, maybe.. and increment a pointer behind the
scenes while I iterate so that I have access to characters one at a
time. I mean, that should be pretty fast, since disk access would be
minimal, and no data would actually be copied.. I would have thought
that 1. above would cause python to do something like that behind the
scenes.
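Buffered iteration like that can be spelled with the two-argument form of the built-in iter(), which calls read(size) repeatedly until it returns the empty-string sentinel at end of file; a sketch, with the chunk size and `io.StringIO` sample stream chosen arbitrarily:

```python
import io

def iter_chars(f, size=4096):
    # Read the file in size-char chunks, but hand back one char at a time.
    for chunk in iter(lambda: f.read(size), ''):
        for ch in chunk:
            yield ch

chars = list(iter_chars(io.StringIO("ab\ncd"), size=2))
```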

[..]

>> Thanks much for the snippet, let me play with it and see if I can
>> come up with a Unicode/utf-8 version.. since while I'm at it I might
>> as well write something a bit more general than C code.
>>
>> Since utf-8 is backward-compatible with 7bit ASCII, this shouldn't be
>> a problem.

> For any extended ascii,

You mean 8-bit encodings, like latin1, right?

> use larger array without decoding (until print, if need be). For
> unicode, add encoding to open and 'c in line' will return unicode
> chars. Then use *one* dict or defaultdict. I think something like

> from collections import defaultdict
> d = defaultdict(int)
> ...
> d[c] += 1 # if c is new, d[c] defaults to int() == 0

I don't know python, so I basically googled for the building blocks of
my little script.. and I remember seeing defaultdict(int) mentioned some
place or other, but somehow I didn't understand what it did.

Cool feature.

Even if it's a bit wasteful, with unicode/utf-8, it looks like working
with the code points, either as dictionary keys or as index values into
an array might make the logic simpler - i.e. for each char, obtain its
code point 'cp' and add one to dict[cp] or array[cp] - and then loop and
print all non-zero values when the end-of-file condition is reached.
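That tally-by-code-point logic might look something like this sketch (the helper name is invented):

```python
from collections import defaultdict

def codepoint_counts(text):
    # Tally characters by Unicode code point; only code points
    # actually seen end up as keys, so every stored count is non-zero.
    counts = defaultdict(int)
    for ch in text:
        counts[ord(ch)] += 1
    return counts

counts = codepoint_counts("caf\u00e9")   # 'café'
```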

Food for thought in any case.

CJ
