How to analyze relatively big source trees with CloneDigger?

zpcspm

Sep 22, 2008, 10:40:03 AM
to Clone Digger general
I've run CloneDigger against a source tree that contains almost 12
megabytes of Python source files. It went well until the Python
process ate all my free RAM (1 GB) and all my swap space (512 MB).
After that I witnessed heavy disk I/O activity for about an hour,
while CloneDigger stayed frozen on the file it was parsing when it
filled all available virtual memory. Are there any workarounds for
such situations? Of course it is possible to run CloneDigger against
subsets of the source files, or even against single files, but that
won't provide an overall view of the whole source tree.

I was thinking about CloneDigger being able to use some kind of disk
storage (an sqlite database perhaps?) for its temporary data like ASTs
and clones, just to avoid using a lot of RAM. Are there any plans to
implement such a feature?

Another option that could help improve speed would be to use some
kind of cache, for example skipping files that haven't changed since
the last run. The previous idea about using persistent storage for
temporary data fits well into this context; a rough sketch of how the
two could combine follows.
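
Concretely, I imagine something along these lines (the stdlib ast
module and pickle are only stand-ins for whatever structures
CloneDigger actually builds, and the table and file names are just
examples):

import ast
import hashlib
import pickle
import sqlite3

def cached_parse(conn, path):
    """Return the parse tree for path, reusing the stored one if the file is unchanged."""
    source = open(path, 'rb').read()
    digest = hashlib.sha1(source).hexdigest()
    row = conn.execute('SELECT tree FROM ast_cache WHERE path = ? AND digest = ?',
                       (path, digest)).fetchone()
    if row is not None:
        return pickle.loads(row[0])  # unchanged since the last run: reuse it
    tree = ast.parse(source, filename=path)  # changed or new: parse and store
    conn.execute('INSERT OR REPLACE INTO ast_cache VALUES (?, ?, ?)',
                 (path, digest, pickle.dumps(tree)))
    conn.commit()
    return tree

conn = sqlite3.connect('clonedigger_cache.db')
conn.execute('CREATE TABLE IF NOT EXISTS ast_cache'
             ' (path TEXT PRIMARY KEY, digest TEXT, tree BLOB)')

The trees still have to be loaded back for comparison, so this alone
doesn't bound peak memory, but it keeps parsed data out of RAM between
runs and makes the skip-unchanged-files check trivial.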

Peter Bulychev

Sep 22, 2008, 4:03:07 PM
to clonedigg...@googlegroups.com
I know there are some problems with performance. But I think they should first be addressed by eliminating performance bottlenecks (because 1 GB produced from 12 MB is not a very good result :) ). I eliminated some of them during this summer, but some still exist.

At what point did Clone Digger stop?

Depending on the phase in which the problem arose, I can suggest the following workarounds (an example invocation follows the list):
using the --fast option
increasing the --hashing-depth option
increasing the --size-threshold option
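
For example (the numbers here are only a starting point, and the exact
command depends on how Clone Digger is installed on your machine):

python clonedigger.py --fast --hashing-depth 2 --size-threshold 10 path/to/your/project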

Also I suggest removing automatically generated sources and tests from the source tree of your project.
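
If you don't want to touch the real tree, you can run on a filtered
copy instead. A minimal sketch (the ignore patterns are only examples,
adjust them to your project's layout):

import shutil
import tempfile

# Copy the project into a scratch directory, leaving out tests and
# generated code, then point Clone Digger at the copy.
scratch = tempfile.mkdtemp()
shutil.copytree('path/to/your/project', scratch + '/project',
                ignore=shutil.ignore_patterns('test*', '*_generated.py'))
print(scratch + '/project')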

2008/9/22 zpcspm <zpc...@gmail.com>



--
Best regards,
Peter Bulychev.

zpcspm

Sep 22, 2008, 4:32:09 PM
to Clone Digger general
On Sep 22, 11:03 pm, "Peter Bulychev" <peter.bulyc...@gmail.com>
wrote:
> At what point did Clone Digger stop?
Like I said, it happened while CloneDigger was still parsing source
files. Once the Python process had eaten all available virtual memory,
I saw no progress (CloneDigger moving on to the next file) for about
an hour, so I just pressed Ctrl-C.

> Depending on the phase in which the problem arose, I can suggest the
> following workarounds:
> using the --fast option
> increasing the --hashing-depth option
> increasing the --size-threshold option

I'd be happy to see CloneDigger succeed at this task with the default
set of options, because I agree that the gap between 12 MB and 1 GB is
very big. Since Python frees the memory for an object as soon as there
are zero references to it, my blind guess is that some objects simply
live longer than needed. I'm a bit skeptical about the use of inner
functions and recursion in CloneDigger, but I can't prove these are
the bottlenecks.
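
A rough way to test that guess from the outside is to parse the whole
tree twice, once keeping every AST alive and once dropping them, and
compare peak memory. This uses the standard library parser as a
stand-in for CloneDigger's own parsing, so it's only a proxy, and the
path is a placeholder:

import ast
import os
import resource  # Unix-only

def parse_tree(root, keep_trees):
    """Parse every .py file under root, optionally keeping all ASTs alive."""
    kept = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            if not name.endswith('.py'):
                continue
            try:
                tree = ast.parse(open(os.path.join(dirpath, name), 'rb').read())
            except (SyntaxError, ValueError):
                continue  # skip files the parser rejects
            if keep_trees:
                kept.append(tree)  # simulates holding every AST for later comparison
    # ru_maxrss is reported in kilobytes on Linux
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Run each variant in a separate process, since peak RSS never goes down.
print(parse_tree('path/to/your/project', keep_trees=True))

If the keep-everything run explodes and the other one doesn't, the
trees themselves are the cost; if both stay small, the memory is going
somewhere else.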

> Also I suggest removing automatically generated sources and tests from the
> source tree of your project.

This advice has a practical point in terms of performance, but
sometimes I run CloneDigger against tests on purpose. Test code is
still code, so redundancy in it is just as bad, and CloneDigger helps
find it. I'm not very enthusiastic about excluding tests just to mask
a possible performance issue that can eventually be fixed.