python tool: finding duplicate code

Michal Wallace

May 29, 2002, 8:48:44 AM

In "Refactoring: Improving the Design of Existing Code",
Martin Fowler and Kent Beck list duplicate code as their
number one "Code Smell".

Since I'm looking to clean up some of my projects, I went
looking for a way to find duplicate lines in my Python
project... So I wrote a little Python program to do it for
me.

Features:

- lists all pairs of overlapping files
- shows how many lines are in common
- shows each duplicated line
- ignores indentation
- filters out try, pass, if __name__=="__main__", etc.

It's just string-matching, so it won't find duplicate logic
with different variable names or layout, but it *can* find
cut and paste issues.
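
A minimal sketch of the string-matching approach described above. This is
not the actual overlaps.py: the names find_overlaps and significant_lines
and the contents of the TRIVIAL filter list are assumptions made for
illustration.

```python
from itertools import combinations

# Lines too common to count as duplication. The exact list is an
# assumption; the original tool filters "try, pass, etc."
TRIVIAL = {
    "", "pass", "try:", "else:", "return",
    'if __name__=="__main__":', 'if __name__ == "__main__":',
}


def significant_lines(text):
    """Return the set of stripped, non-trivial, non-comment lines in text."""
    lines = set()
    for raw in text.splitlines():
        line = raw.strip()  # stripping makes the match ignore indentation
        if line in TRIVIAL or line.startswith("#"):
            continue
        lines.add(line)
    return lines


def find_overlaps(sources):
    """Map each pair of file names to the sorted list of lines they share.

    sources is a dict of file name -> file contents.
    """
    per_file = {name: significant_lines(text) for name, text in sources.items()}
    overlaps = {}
    for a, b in combinations(sorted(per_file), 2):
        common = per_file[a] & per_file[b]
        if common:
            overlaps[(a, b)] = sorted(common)
    return overlaps
```

Calling find_overlaps with a dict of file contents gives, for each pair of
overlapping files, the duplicated lines; len() of each list is the number
of lines in common.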

(hmm... Come to think of it, someone could probably find
*some* duplicate logic by running source files through the
tokenizer first. I wonder if that would work...)
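
One way the tokenizer idea might look, as a rough sketch using the
standard tokenize module. The placeholder names ID, NUM and STR and the
function normalized_lines are assumptions, not part of any existing tool.
Each logical line is reduced to a token string in which identifiers and
literals are replaced by placeholders, so lines that differ only in
variable names or spacing compare equal.

```python
import io
import keyword
import tokenize

# Token types that carry no logic and are dropped before comparison.
SKIP = {
    tokenize.COMMENT, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER,
}


def normalized_lines(source):
    """Return one normalized token string per logical line of source."""
    result, current = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        if tok.type == tokenize.NEWLINE:
            # NEWLINE marks the end of a logical line.
            if current:
                result.append(" ".join(current))
                current = []
        elif tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            current.append("ID")  # any identifier
        elif tok.type == tokenize.NUMBER:
            current.append("NUM")  # any numeric literal
        elif tok.type == tokenize.STRING:
            current.append("STR")  # any string literal
        else:
            current.append(tok.string)  # keywords and operators kept as-is
    if current:
        result.append(" ".join(current))
    return result
```

The normalized lines could then be fed to the same line-matching step
instead of the raw text. Replacing every name with the same placeholder is
deliberately crude and will report some false matches.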

Anyway, just thought I'd share:

http://cvs.sabren.com/sixthdev/cvsweb.cgi/sdunit/overlaps.py?rev=1.1

Cheers,

- Michal http://www.sabren.net/ sab...@manifestation.com
------------------------------------------------------------
Learn to build web apps! http://www.webAppWorkshop.com/
------------------------------------------------------------

Tim Peters

May 29, 2002, 10:17:51 PM
[Michal Wallace]

> In "Refactoring: Improving the Design of Existing Code",
> Martin Fowler and Kent Beck list duplicate code as their
> number one "Code Smell"
> ...

> It's just string-matching, so it won't find duplicate logic
> with different variable names or layout, but it *can* find
> cut and paste issues.
>
> (hmm... Come to think of it, someone could probably find
> *some* duplicate logic by running source files through the
> tokenizer first. I wonder if that would work...)

Brenda Baker has done some interesting work on this problem (not with Python
in mind, but million-line C systems):

http://cm.bell-labs.com/who/bsb/

Her "On Finding Duplication and Near-Duplication in Large Software Systems"
is a good entry into the literature.

I have a self-serving reason for mentioning this: if somebody whips up a
fast suffix tree for Python, I could put it to good use in ameliorating
difflib.py's worst-case time sinks <wink>.
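
To illustrate why suffix structures help with duplication, here is a
small sketch that finds the longest repeated run in a sequence. It is not
a suffix tree and it is not fast: it uses a naively sorted list of
suffixes (a simple suffix array) as a stand-in, and the name
longest_repeat is hypothetical.

```python
def longest_repeat(seq):
    """Return the longest run occurring at least twice in seq.

    seq may be a string or a list of lines; occurrences may overlap.
    """
    # Sort suffix start positions by suffix contents: a naive suffix array.
    order = sorted(range(len(seq)), key=lambda i: seq[i:])
    best = seq[:0]  # empty string or empty list, matching the input type
    # The longest repeated run is a common prefix of two suffixes that
    # end up adjacent in sorted order.
    for i, j in zip(order, order[1:]):
        n = 0
        while i + n < len(seq) and j + n < len(seq) and seq[i + n] == seq[j + n]:
            n += 1
        if n > len(best):
            best = seq[i:i + n]
    return best
```

Sorting whole suffixes copies and compares a lot of data, which is exactly
the cost a real suffix tree construction avoids.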

Michal Wallace

May 30, 2002, 12:51:16 PM
On Wed, 29 May 2002, Tim Peters wrote:

> > (hmm... Come to think of it, someone could probably find
> > *some* duplicate logic by running source files through the
> > tokenizer first. I wonder if that would work...)
>
> Brenda Baker has done some interesting work on this
> problem (not with Python in mind, but million-line C
> systems):
>
> http://cm.bell-labs.com/who/bsb/
>
> Her "On Finding Duplication and Near-Duplication in Large
> Software Systems" is a good entry into the literature.
>
> I have a self-serving reason for mentioning this: if
> somebody whips up a fast suffix tree for Python, I could
> put it to good use in ameliorating difflib.py's worst-case
> time sinks <wink>.

Hey Tim,

Thanks for the link! I found a JavaScript version of a
suffix tree algorithm online. I ported it to Python and it
seems to work... Unfortunately the original code is very
hard to understand. I did a straight port and then tried to
clean it up and make it a little more object-oriented, but
when I started looking into the algorithm, I just couldn't
track what was going on.

Then I spent another couple hours trying to rebuild it from
scratch using a pythonic style, but again I couldn't get my
mind around the algorithm... Anyway, I spent all night
messing around with this stuff, and I'm giving up. If
someone wants to take a look, go here:

http://cvs.sabren.com/sixthdev/cvsweb.cgi/sdunit/

NastySuffixTree.py is the working version.

SuffixTree.py is the cleaner version I tried to build, but
it doesn't implement the whole algorithm.

The JavaScript version is here:

http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Suffix/

There's also a suffix tree module written in C, with a
Python binding, here:

http://www-hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/

Cheers,

- Michal http://www.sabren.net/ sab...@manifestation.com
------------------------------------------------------------

Switch to Cornerhost! http://www.cornerhost.com/
High Powered Hosting - With a Human Touch. :)
------------------------------------------------------------

Ira Baxter

Jun 1, 2002, 9:40:43 AM
If you are interested in strong clone detection,
you should check out
http://www.semdesigns.com/Products/Clone/index.html

--
Ira Baxter, Ph.D. CTO Semantic Designs
www.semdesigns.com 512-250-1018

"Michal Wallace" <sab...@manifestation.com> wrote in message
news:mailman.1022676587...@python.org...
