Clone Digger is slow

Haoyu Bai

Mar 3, 2009, 10:26:38 AM

to clonedigg...@googlegroups.com

Hi,

I tried to run Clone Digger on the Java project Apache Wicket. It ran
for 6 hours and finally hung. I have also tried running it on some
small codebases, and the results were pretty good. I think it is just
very slow on large projects. Is the algorithm inherently slow, or is
there some part of Clone Digger that could be improved to speed it up?

Thanks!

-- Haoyu Bai

Peter Bulychev

Mar 4, 2009, 12:51:23 AM

to clonedigg...@googlegroups.com

Hello.

How many lines of code are there in your project?
We've recently run CD on eclipse-jdtcore with 146K lines of code and it took 2h43m on a modern PC.

Can you provide the output of Clone Digger? If you do, I'll try to figure out what's wrong.

Also, did you remove:

Automatically generated sources.
Tests.
Third party libraries.

from the project tree?

2009/3/3 Haoyu Bai <divi...@gmail.com>

--
Best regards,
Peter Bulychev.

Peter Bulychev

Mar 4, 2009, 1:12:58 AM

to clonedigg...@googlegroups.com

I didn't answer your main question.

Clone Digger works on the abstract syntax tree, and its algorithm has nonlinear complexity.
However, we introduced several heuristics into it, and it has worked fine for me on large projects (by large projects I mean < 200K LOC).

Tests and automatically generated sources contain many similar pieces of code and similar statements, and this leads to bad performance.
That is why I asked you to remove them.
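One way to act on this without touching the original checkout is to run the tool on a pruned copy of the tree. The following is only a sketch: the directory names and file patterns are assumptions chosen for illustration, not Wicket's actual layout and not a Clone Digger feature, and `copy_pruned_tree` is a hypothetical helper.

```python
import fnmatch
import os
import shutil

# Hypothetical exclusion rules; adjust them to the project's real layout.
EXCLUDED_DIRS = {"test", "tests", "generated", "third_party"}
EXCLUDED_FILE_PATTERNS = ["*Test.java", "Test*.java"]


def is_excluded_file(name):
    """Return True if the file name matches one of the exclusion patterns."""
    return any(fnmatch.fnmatchcase(name, p) for p in EXCLUDED_FILE_PATTERNS)


def copy_pruned_tree(src, dst):
    """Copy src to dst, skipping excluded directories and files.

    Returns the number of files copied.
    """
    copied = 0
    for root, dirs, files in os.walk(src):
        # Editing dirs in place stops os.walk from descending into them.
        dirs[:] = [d for d in dirs if d.lower() not in EXCLUDED_DIRS]
        rel = os.path.relpath(root, src)
        target = os.path.normpath(os.path.join(dst, rel))
        os.makedirs(target, exist_ok=True)
        for name in files:
            if is_excluded_file(name):
                continue
            shutil.copy2(os.path.join(root, name), os.path.join(target, name))
            copied += 1
    return copied
```

The pruned copy can then be passed to the clone detector in place of the original tree.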

If you need something much faster, you can use text-based clone detection tools like DuDe (http://loose.upt.ro/iplasma/dude.html),
but their reports will contain more false positives. For instance, a report can contain a clone that spans the very end of one function and the beginning of the next function. Such a clone certainly cannot be refactored.
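To illustrate the kind of false positive meant here, the sketch below does a naive comparison of fixed-size windows of lines. It is not how DuDe or any other specific tool works; the Java-like sample and the name `find_duplicate_windows` are made up for the example.

```python
from collections import defaultdict

SAMPLE = """\
class A {
    int size() {
        int n = count();
        return n;
    }
    void reset() {
        clear();
        init(1);
    }
}
class B {
    int total() {
        int n = count();
        return n;
    }
    void reset() {
        clear();
        init(2);
    }
}
"""


def find_duplicate_windows(lines, window):
    """Map each repeated run of `window` stripped lines to its start indices."""
    stripped = [line.strip() for line in lines]
    seen = defaultdict(list)
    for start in range(len(stripped) - window + 1):
        seen[tuple(stripped[start:start + window])].append(start)
    # Keep only windows that occur more than once.
    return {text: starts for text, starts in seen.items() if len(starts) > 1}


if __name__ == "__main__":
    duplicates = find_duplicate_windows(SAMPLE.splitlines(), window=5)
    for text, starts in duplicates.items():
        print(starts, text)
```

The only repeated window in the sample covers the last two statements of one method, its closing brace, and the first two lines of `reset()`. A purely textual tool reports it as a clone even though no single function could be extracted from it, whereas an AST-based tool compares whole syntactic units and would not pair these lines.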

You can also look at another AST-based tool, CloneDR (http://www.semdesigns.com/Products/Clone/). It is commercial, but an evaluation version is available, which reports only the top 9 clones.

2009/3/4 Peter Bulychev <peter.b...@gmail.com>

Haoyu Bai

Mar 4, 2009, 10:06:05 AM

to clonedigg...@googlegroups.com

Hi,

The codebase has 269K lines of code and comments, and CD reported
24,000 statements. I don't want to reproduce the output right now, but
as far as I can remember, CD finally hung at the "Choosing pattern for
each statement..." stage, while working on the 20,000th statement. I
believe it hung because its CPU usage was only 1% at that time.

Indeed, Wicket's source code contains tests that I haven't removed,
because the test code is mixed with other code. I think this is a
common style in software development, so I suggest adding a filter
mechanism that lets us filter out functions like testFoo().
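A rough sketch of what such a name-based filter could look like is given below. It is purely hypothetical: nothing in this thread says Clone Digger has such an option, and the pattern and the function `keep_function` are assumptions made for illustration.

```python
import re

# Hypothetical convention: a test function is named "test" or starts with
# "test" followed by an uppercase letter or an underscore (testFoo, test_bar).
TEST_NAME = re.compile(r"^test([A-Z_]|$)")


def keep_function(name, exclude=TEST_NAME):
    """Return True if a function with this name should be analysed."""
    return exclude.search(name) is None


if __name__ == "__main__":
    names = ["testFoo", "test_bar", "testify", "render", "test"]
    print([n for n in names if keep_function(n)])
```

Matching on the character after "test" keeps ordinary names such as testify from being dropped by accident.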

I have also run CCFinder on the same codebase, and it is much faster.
However, I prefer Clone Digger because it is open source!

Thanks!

-- Haoyu Bai


zpcspm

Mar 18, 2009, 10:45:34 AM

to Clone Digger general

On Mar 3, 5:26 pm, Haoyu Bai <divine...@gmail.com> wrote:
> very slow on large projects. Is the algorithm inherently slow, or
> there's some part of Clone Digger could be improved to speed up?

I would suggest running the Python code under psyco. This
usually improves performance.
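For reference, a minimal sketch of how psyco is usually enabled, placed at the top of a script's entry point. Psyco only supports Python 2 on 32-bit x86, so the import is guarded and the script runs unchanged where psyco is not installed. `JIT_ENABLED` is a hypothetical flag added for illustration, and how much speedup Clone Digger would actually gain is not something this thread establishes.

```python
# Enable psyco if it is installed; otherwise fall back to the plain interpreter.
try:
    import psyco  # Python 2, 32-bit x86 only

    psyco.full()  # ask psyco to compile every function it can
    JIT_ENABLED = True
except ImportError:
    JIT_ENABLED = False

if __name__ == "__main__":
    print("psyco enabled:", JIT_ENABLED)
```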
