Python Data Utils

Jesse Aldridge

Apr 6, 2008, 12:43:29 AM

to

In an effort to experiment with open source, I put a couple of my
utility files up here:
http://github.com/jessald/python_data_utils/tree/master
What do you think?

Gabriel Genellina

Apr 6, 2008, 3:13:41 AM

to pytho...@python.org

On Sun, 06 Apr 2008 01:43:29 -0300, Jesse Aldridge
<JesseA...@gmail.com> wrote:

> In an effort to experiment with open source, I put a couple of my
> utility files up here:
> http://github.com/jessald/python_data_utils/tree/master
> What do you think?

Some names are a bit obscure - "universify"?
Docstrings would help too, and blank lines, and in general following PEP8
style guide.
find_string is a much slower version of the find method of string objects,
same for find_string_last, contains and others.
And I don't see what you gain from things like:
def count( s, sub ):
    return s.count( sub )
it's slower and harder to read (because one has to *know* what S.count
does).
Other functions may be useful but without even a docstring it's hard to
tell what they do.
delete_string, as a function, looks like it should delete some string, not
return a character; I'd use a string constant DELETE_CHAR, or just DEL,
its name in ASCII.

In general, None should be compared using `is` instead of `==`, and
instead of `type(x) is type(0)` or `type(x) == type(0)` I'd use
`isinstance(x, int)` (unless you use Python 2.1 or older; from 2.2 on,
int, float, str, list... are types themselves).
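
A minimal sketch of both points, written in Python 3 syntax. The function name `describe` is made up for illustration and is not part of the posted modules:

```python
def describe(x):
    """Classify a value using identity and isinstance checks."""
    # None is a singleton, so test identity rather than equality.
    if x is None:
        return "missing"
    # bool is a subclass of int, so rule it out first if that matters.
    if isinstance(x, bool):
        return "bool"
    if isinstance(x, int):
        return "int"
    if isinstance(x, str):
        return "str"
    return type(x).__name__


print(describe(None), describe(3), describe("3"), describe(2.5))
```

Unlike a `type(x) == type(0)` comparison, isinstance also accepts subclasses of int, which is usually what a caller wants.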

Files.py is similar - a lot of more or less common things with a different
name, and a few wheels reinvented :)

Don't feel bad, but I would not use those modules because there is no net
gain, and even a loss in legibility. If you develop your code alone,
that's fine, you know what you wrote and can use it whenever you please.
But for others to use it, it means that they have to learn new ways to say
the same old thing.

--
Gabriel Genellina

Konstantin Veretennicov

Apr 6, 2008, 7:14:14 AM

to Jesse Aldridge, pytho...@python.org

Would you search for, install, learn and use these modules if *someone
else* created them?

--
kv

Jesse Aldridge

Apr 6, 2008, 10:32:27 AM

to

Thanks for the detailed feedback. I made a lot of modifications based
on your advice. Mind taking another look?

> Some names are a bit obscure - "universify"?
> Docstrings would help too, and blank lines

I changed the name of universify and added docstrings to every
function.

> ...PEP8

I made a few changes in this direction, feel free to take it the rest
of the way ;)

> find_string is a much slower version of the find method of string objects,

Got rid of find_string, and contains. What are the others?

> And I don't see what you gain from things like:
> def count( s, sub ):
>     return s.count( sub )

Yeah, got rid of that stuff too. I ported these files from Java a
while ago, so there was a bit of junk like this lying around.

> delete_string, as a function, looks like it should delete some string, not
> return a character; I'd use a string constant DELETE_CHAR, or just DEL,
> its name in ASCII.

Got rid of that too :)

> In general, None should be compared using `is` instead of `==`, and
> instead of `type(x) is type(0)` or `type(x) == type(0)` I'd use
> `isinstance(x, int)` (unless you use Python 2.1 or older, int, float, str,
> list... are types themselves)

Changed.

So, yeah, hopefully things are better now.

Soon developers will flock from all over the world to build this into
the greatest data manipulation library the world has ever seen! ...or
not...

I'm tired. Making code for other people is too much work :)

Jesse Aldridge

Apr 6, 2008, 10:34:11 AM

to

On Apr 6, 6:14 am, "Konstantin Veretennicov" <kveretenni...@gmail.com>
wrote:

> On Sun, Apr 6, 2008 at 7:43 AM, Jesse Aldridge <JesseAldri...@gmail.com> wrote:
> > In an effort to experiment with open source, I put a couple of my
> > utility files up here:
> > http://github.com/jessald/python_data_utils/tree/master
> > What do you think?
>
> Would you search for, install, learn and use these modules if *someone
> else* created them?
>
> --
> kv

Yes, I would. I searched a bit for a library that offered similar
functionality. I didn't find anything. Maybe I'm just looking in the
wrong place. Any suggestions?

John Machin

Apr 6, 2008, 6:10:51 PM

to

On Apr 7, 12:32 am, Jesse Aldridge <JesseAldri...@gmail.com> wrote:
> Thanks for the detailed feedback. I made a lot of modifications based
> on your advice. Mind taking another look?
>
> > Some names are a bit obscure - "universify"?
> > Docstrings would help too, and blank lines
>
> I changed the name of universify and added docstrings to every
> function.

Docstrings go *after* the def statement.

>
> > ...PEP8
>
> I made a few changes in this direction, feel free to take it the rest
> of the way ;)

I doubt anyone will bother to take up your invitation. A few simple
changes would reduce eyestrain e.g. changing "( " to "(" and " )" to
")".

>
> > find_string is a much slower version of the find method of string objects,
>
> Got rid of find_string, and contains. What are the others?

It seems that you could usefully spend some time reading the
documentation on str methods ... instead of asking other people to do
your job for you, unpaid. E.g. look for a str method that you could
use instead of at least one of the confusingly named "is_white" and
"is_blank"?
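
One candidate, offered as a hedged hint: `str.isspace()` is true only when the string is non-empty and every character is whitespace. Whether that matches what is_white or is_blank were meant to do is an assumption, since their code is not shown in this thread. The `is_blank` below is a hypothetical stand-in that also counts the empty string as blank:

```python
def is_blank(s):
    """True for the empty string or a string of only whitespace (assumed meaning)."""
    # strip() removes leading and trailing whitespace; nothing left means blank.
    return not s.strip()


for s in ["", " ", " \t\n", " x "]:
    # isspace() is False for "", which is the one case where the two differ.
    print(repr(s), s.isspace(), is_blank(s))
```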

>
> > And I don't see what you gain from things like:
> > def count( s, sub ):
> >     return s.count( sub )
>
> Yeah, got rid of that stuff too. I ported these files from Java a
> while ago, so there was a bit of junk like this lying around.

The penny drops :-)

> > delete_string, as a function, looks like it should delete some string, not
> > return a character; I'd use a string constant DELETE_CHAR, or just DEL,
> > its name in ASCII.
>
> Got rid of that too :)
>
> > In general, None should be compared using `is` instead of `==`, and
> > instead of `type(x) is type(0)` or `type(x) == type(0)` I'd use
> > `isinstance(x, int)` (unless you use Python 2.1 or older, int, float, str,
> > list... are types themselves)
>
> Changed.

Not in all places ... look at the ends_with function. BTW, this should
be named something like "fuzzy_ends_with".

Why all the testing against None? If you have a convention that ""
means that a value is known to be the zero-length string whereas None
means that the true value is unknown, then:
(1) you should document that convention
(2) you should use it consistently e.g. fuzzy_match(None, None) should
return False.
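
A rough sketch of that convention, assuming None means "value unknown" and "" means "known to be empty". The real fuzzy_match is not shown in the thread, so the body here (case-insensitive comparison after collapsing whitespace) is purely illustrative:

```python
def fuzzy_match(a, b):
    """Hypothetical loose comparison; None means the value is unknown."""
    # Two unknown values are not known to be equal, so never match on None.
    if a is None or b is None:
        return False
    return " ".join(a.split()).lower() == " ".join(b.split()).lower()


print(fuzzy_match(None, None))           # False
print(fuzzy_match("", ""))               # True
print(fuzzy_match("Dog  Cat", "dog cat"))  # True
```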

The get_before function returns None in one case and "" in another; is
this accidental or deliberate?

>
> So, yeah, hopefully things are better now.
>
> Soon developers will flock from all over the world to build this into
> the greatest data manipulation library the world has ever seen! ...or
> not...
>
> I'm tired. Making code for other people is too much work :)

When you recover, here are a few more things to consider:

1. Testing if obj is a str or unicode object is best done by
isinstance(obj, basestring) ... you don't need an is_string function.

2. make_fuzzy function: first two statements should read "s =
s.replace(.....)" instead of "s.replace(.....)".

3. Fuzzy matching functions are specialised to an application; I can't
imagine that anyone would be particularly interested in those that you
provide. A basic string normalisation-before-comparison function would
usefully include replacing multiple internal whitespace characters by
a single space.

4. get_after('dog cat', 3) is a baroque and slow way of doing 'dog
cat'[3+1:]

5. Casual inspection of your indentation function gave the impression
that it was stuffed ... verified by a simple test:

>>> for i in range(11):
...     stest = ' ' * i + 'x'
...     print i, indentation(stest, 4)
...
0 0
1 1
2 0
3 0
4 1
5 2
6 1
7 1
8 2
9 3
10 2
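
Points 2, 4 and 5 can be sketched as follows, in Python 3 syntax. Several things here are assumptions: the characters that `make_fuzzy` replaces are invented, and `indentation` is taken to mean "number of whole indent levels of leading spaces", which may not be what the original function intended:

```python
def make_fuzzy(s):
    """Hypothetical normaliser; the characters replaced here are made up."""
    # str.replace returns a new string; without the assignment the result is lost.
    s = s.replace("-", " ")
    s = s.replace("_", " ")
    return s.lower()


def indentation(line, width=4):
    """Assumed meaning: whole indent levels of leading spaces."""
    leading = len(line) - len(line.lstrip(" "))
    return leading // width


print(make_fuzzy("Dog-Cat_Fish"))   # dog cat fish
print("dog cat"[3 + 1:])            # cat
for i in range(11):
    print(i, indentation(" " * i + "x", 4))
```

Under that reading of `indentation`, the loop would report 0 for 0 to 3 leading spaces, 1 for 4 to 7, and 2 for 8 to 10.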

HTH,
John

Gabriel Genellina

Apr 6, 2008, 7:08:27 PM

to pytho...@python.org

On Sun, 06 Apr 2008 11:34:11 -0300, Jesse Aldridge
<JesseA...@gmail.com> wrote:

> On Apr 6, 6:14 am, "Konstantin Veretennicov" <kveretenni...@gmail.com>
> wrote:
>> On Sun, Apr 6, 2008 at 7:43 AM, Jesse Aldridge
>> <JesseAldri...@gmail.com> wrote:
>> > In an effort to experiment with open source, I put a couple of my
>> > utility files up here:
>> > http://github.com/jessald/python_data_utils/tree/master
>> > What do you think?
>>
>> Would you search for, install, learn and use these modules if *someone
>> else* created them?
>

> Yes, I would. I searched a bit for a library that offered similar
> functionality. I didn't find anything. Maybe I'm just looking in the
> wrong place. Any suggestions?

Haven't you heard that Python comes with "batteries included"?
For most operations in your modules, you don't need anything more than
what Python already provides. Most of the whitespace, find, search,
replace variants are already in the standard library, or trivially
implemented with the existing tools [1]. Even more specialized functions
like pattern_to_regex are already there (see fnmatch.translate [2]).
So I think you would benefit a lot from reading the Library Reference
documentation [3]; as it says in the docs main page: "keep this under your
pillow".
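
As a small illustration of the fnmatch.translate point (Python 3 syntax; the file names are made up, and whether this covers everything pattern_to_regex did is an assumption):

```python
import fnmatch
import re

# translate() turns a shell-style wildcard pattern into a regex string.
regex = fnmatch.translate("*.txt")
matcher = re.compile(regex)

for name in ["notes.txt", "notes.txt.bak", "README"]:
    print(name, bool(matcher.match(name)))
```

The exact regex text that translate() returns varies between Python versions, so it is safer to compile and use it than to compare it against a literal.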

[1] http://docs.python.org/lib/string-methods.html
[2] http://docs.python.org/lib/module-fnmatch.html
[3] http://docs.python.org/lib/
--
Gabriel Genellina

Jesse Aldridge

Apr 7, 2008, 2:22:03 AM

to

> Docstrings go *after* the def statement.

Fixed.

> changing "( " to "(" and " )" to ")".

Changed.

I attempted to take out everything that could be trivially implemented
with the standard library.
This has left me with... 4 functions in S.py. One of them is used
internally, and the others aren't terribly awesome :\ But I think the
ones that remain are at least a bit useful :)

> The penny drops :-)

yeah, yeah

> Not in all places ... look at the ends_with function. BTW, this should
> be named something like "fuzzy_ends_with".

fixed

> fuzzy_match(None, None) should return False.

changed

> 2. make_fuzzy function: first two statements should read "s =
> s.replace(.....)" instead of "s.replace(.....)".

fixed

> 3. Fuzzy matching functions are specialised to an application; I can't
> imagine that anyone would be particularly interested in those that you
> provide.

I think it's useful in many cases. I use it all the time. It helps
guard against annoying input errors.

> A basic string normalisation-before-comparison function would
> usefully include replacing multiple internal whitespace characters by
> a single space.

I added this functionality.

> 5. Casual inspection of your indentation function gave the impression
> that it was stuffed

Fixed

Thanks for the feedback.

John Machin

Apr 7, 2008, 5:17:24 AM

to

On Apr 7, 4:22 pm, Jesse Aldridge <JesseAldri...@gmail.com> wrote:
>
> > changing "( " to "(" and " )" to ")".
>
> Changed.

But then you introduced more.

>
> I attempted to take out everything that could be trivially implemented
> with the standard library.
> This has left me with... 4 functions in S.py. One of them is used
> internally, and the others aren't terribly awesome :\ But I think the
> ones that remain are at least a bit useful :)

If you want to look at stuff that can't be implemented trivially using
str/unicode methods, and is more than a bit useful, google for
mxTextTools.

>
> > A basic string normalisation-before-comparison function would
> > usefully include replacing multiple internal whitespace characters by
> > a single space.
>
> I added this functionality.

Not quite. I said "whitespace", not "space".

The following is the standard Python idiom for removing leading and
trailing whitespace and replacing one or more whitespace characters
with a single space:

def normalise_whitespace(s):
    return ' '.join(s.split())

If your data is obtained by web scraping, you may find some people use
'\xA0' aka NBSP to pad out fields. The above code will get rid of
these if s is unicode; if s is str, you need to chuck
a .replace('\xA0', ' ') in there somewhere.
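
In Python 3 terms, where unicode became str and the old str became bytes, the same behaviour can be sketched like this. The sample strings are made up, and the bytes branch assumes a single-byte encoding such as Latin-1 in which NBSP is the byte 0xA0:

```python
def normalise_whitespace(s):
    """Strip the ends and collapse each whitespace run to one space."""
    return " ".join(s.split())


text = "  dog\t\tcat\xa0\xa0fish \n"
# On a text (unicode) string, split() treats NBSP as whitespace.
print(repr(normalise_whitespace(text)))

raw = b"dog\xa0\xa0cat  fish"
# On a byte string, split() only knows ASCII whitespace, so replace NBSP first.
cleaned = b" ".join(raw.replace(b"\xa0", b" ").split())
print(cleaned)
```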

HTH,
John

Jesse Aldridge

Apr 7, 2008, 11:24:02 AM

to

> But then you introduced more.

oops. old habits...

> mxTextTools.

This looks cool, so does the associated book - "Text Processing in
Python". I'll look into them.

> def normalise_whitespace(s):
>     return ' '.join(s.split())

Ok, fixed.

> a .replace('\xA0', ' ') in there somewhere.

Added.

Thanks again.