Comment on revision r51 in apache-rat-pd

codesite...@google.com

unread,

Aug 23, 2009, 5:42:42 PM8/23/09

to apache-rat...@googlegroups.com

maka82 commented on revision r51 in project apache-rat-pd.
Details are at http://code.google.com/p/apache-rat-pd/source/detail?r=51

Line-by-line comments:

File:
/trunk/src/main/java/org/apache/rat/pd/engines/google/GoogleCodeSearchParser.java
(r51)
===============================================================================

Line 288: String[] lines = posibleCutAndPastedCode.split("\n *");
-------------------------------------------------------------------------------
There is bug in this code. I noticed that, even if just one line produce
query URL with length greater than 1024, then this splitting algorithm
start endless loop. Problem is that this function splits code into lines -
but there is not guarantee that URL produced from this line is not longer
then 1024. One of possible solutions is to recognise such lines and split
them to smaller peaces. But I don't see valid criteria for splitting. If it
is blank space or dot, this line will be not splited and will cause
application to hang:
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
But this is very common situation in programming:
//////////////////////////////////////////////////////////////////
//// Comments or something...
/////////////////////////////////////////////////////////////////

Code is splited into lines because Google code search that way works more
accurate.

Google code search can work with queries with length below 2000 characters,
however they recommend queries below 1024.

Respond to these comments at
http://code.google.com/p/apache-rat-pd/source/detail?r=51
--
You received this message because you starred this review, or because
your project has directed all notifications to a mailing list that you
subscribe to.
You may adjust your review notification preferences at:
http://code.google.com/hosting/settings

Egor Pasko

unread,

Aug 24, 2009, 10:35:35 AM8/24/09

to apache-rat...@googlegroups.com

IMHO, the main source of problems in this method is it's complexity.
For me the method is hard to understand also because the javadoc is
not very specific about what the method is doing.

The comment says that the code is split and produces URLs with such
and such length properties. From this description I do not understand
why combineLines() is being used. Is that to search for code that was
copied and joined?

A general question: why a simple splitting by '\n' and 1024 characters
does not work?

--
Egor Pasko

Marija Šljivović

unread,

Aug 24, 2009, 3:19:14 PM8/24/09

to apache-rat...@googlegroups.com

Hi!

This function is basic sliding window algorithm implementation. Input
for this function are lines. Window starts from first line and then
expand (concatenate first n lines) until this concatenation produce
URL with length lesser then 1024 characters(
generateUrlFrom(line[1]+line[2]+...+line[n])<1024). Then it add this
last-url-with-length-lesser-then-1024 to return list and starts again
from line[n+1].
This ensure that we send to GoogleCodeSearch query with maximum length
(for number of queries optimisation ) and that this query length will
never be greater then 1024 (avoiding GoogleCodesearch to throw
exception).
It is important when working with GoogleCodeSearch to generate
line-based queries. Only then found words are "sticked together" [1].

Problem is that any line alone can produce URL longer then any limit.
Then this algorithm breaks. Even if algorithm will not break, long URL
will produce SearchEngine exception.
Possible solution is split this problematic lines. Question is by
which criteria? This splitting must be done careful - elsewhere we can
split line on the middle of important word . It is best to split line
on dot or on white space. But sometimes this characters are not
contained in line which causing problem.

There are very simple ways to fix this problem, I am just trying to
get the best solution.

>The comment says that the code is split and produces URLs with such
>and such length properties. From this description I do not understand
>why combineLines() is being used. Is that to search for code that was
>copied and joined?

Sorry for poor javadoc. combineLins() is just "current window" in this
sliding window algorithm.

[1] http://groups.google.com/group/google-code-search/browse_thread/thread/2064067f869ddc71

Marija

Egor Pasko

unread,

Aug 25, 2009, 5:05:21 AM8/25/09

to apache-rat...@googlegroups.com

2009/8/24 Marija Šljivović <mak...@gmail.com>:

Marija, thanks for explaining! I am sorry for being lazy at
understanding what the function is doing from the code. I was just
expecting to see this information in the function's comment.

My general advice on commenting functions is: tell _what_ the function
does, but do _not_ mention _how_ it does the work. Thus, worth
mentioning in the comment are:
* line-based queries
* joining lines to minimize the amount of queries
* following the length limit

I would recommend to name the function differently. The name
"splitLongUrl" is not my favourite because:
* it does not split the URL since it does not take a single URL as a
parameter :)
* it does not say anything about what kind of split this is

My naming suggestion would be going this direction:
fitCodeLinesIntoUrls() or leastUrlsArrayForCode().

On the problem of splitting in the middle of a word .. hm .. I do not
see anything really bad here. Okay, we may miss some plagiarism
because of this. But, anyway, it is better than hang or crash with
exception. If you really want to match every word, you can split into
strings with an overlap of, say, 30 characters.

Finally, the code looked to me not really easy in understanding and
not very optimal. I wanted to share some knowledge on how StringReader
could be used with avoiding creation of new array, redundant string
appends, scary nested whiles with indexes modified in a tricky way.

You can take it if you like (same license:), I am not sure it works,
did not run it a single time.

try {
BufferedReader br = new BufferedReader(
new StringReader(possibleCutAndPastedCode));
StringBuilder urlBuilder = new StringBuilder();
while ((String fullLine = br.readLine()) != null) {
String line = fullLine.substring(...); // Strip leading spaces.

// Fill the urlBuilder greedily until the limit is reached.
int increasedLength = urlBuilder.length() + line.length();
if (increasedLength <= length) {
urlBuilder.append(line);
continue;
}
toRet.add(createUrl(urlBuilder.toString()));

// If the line is too long, split it by length trivially.
while (line.length() >= length) {
toRet.add(createurl(line.substring(0, length)));
line = line.substring(length, line.length() + 1);
}
urlBuilder = new StringBuilder(line);
}
if (urlBuilder.length() > 0) {
toRet.add(createUrl(urlBuilder.toString()));
}
} catch (IOException e) {
throw new RuntimeException("Failed to organize a piecie of code in
URLS.", e);
}
return toRet;

--
Egor Pasko

Marija Šljivović

unread,

Aug 28, 2009, 7:30:52 AM8/28/09

to apache-rat...@googlegroups.com

Hi,Egor

Unfortunately, this algorithm didn't work but looking at it I got some
ideas so I have again approach making sliding-window algorithm.
Interesting is that this algorithm does not have to be square
complexity as I thought,but it can be with linear complexity.

The code that I have sent to you is an example of an algorithm with
linear complexity which founds all subarrays of an array
which satisfy some condition. In our code, the condition is that URL
made from a subarray(of a string) does not have a lenght more then
1024.

public List<String> allSubArrays(String line, int maximumLength)
throws MalformedURLException {

ArrayList<String> toRet = new ArrayList<String>();

StringBuffer sb = new StringBuffer();
String current = "";
for (int i = 0; i < line.length(); i++) {

sb.append(line.charAt(i));
//is length of URL greater of 1024?
if
(createUrl(sb.toString()).toString().length() <= maximumLength) {
//we store last good substring
current = sb.toString();
} else {
//last good substring is substring
with maximum length which URL is < 1024
toRet.add(current);
//clear string buffer
sb.setLength(0);
//store current char in string buffer
sb.append(line.charAt(i));
}
}
//last substring of initial string is not added, but
it should be.
toRet.add(current);
return toRet;
}

This way, the lines which URLs are longer that 1024 can be broken on a lines
with maximum lenght which just the same have URLs with good lenght.

So, we will:
1. break String posiblePlagiarisedCod on a lines
2. possible too long lines will be handle with the above mentioned algorithm
and convert them in a few smaller (or even better first try to
separate them by using dot or blank)
3. from all of this lines we will made URLs with maximal lenght for
GoogleCodeSearch the way we are making them now :)

Interesting is that this way we can simplify and other two
sliding-window algorithm in application
and even both algorithms in GoogleCodesearchParser replace with one
function which will be call with different parameters.

Marija

Egor Pasko

unread,

Sep 4, 2009, 7:35:30 AM9/4/09

to apache-rat...@googlegroups.com

2009/8/28 Marija Šljivović <mak...@gmail.com>:
>
> Hi,Egor

Hello, Marija, sorry for my late, a very-very late response. I am
feeling guilty :)

In general, I think going to linear complexity is totally awesome. It
also makes the code easier to read. (less 'break', 'continue'
statements)

After taking a look at the new code some comments jumped into my mind:
* the name allSubArrays suggests .. like it wants some function/class
as a parameter (that would determine the criteria for the url that
fits) instead of length .. maybe you were also thinking about this
(so, either it should be renamed or the different parameter set)
* going char by char may be generally slow, some 'buffering' may be reasonable

I like these ideas.

> Interesting is that this way we can simplify and other two
> sliding-window algorithm in application
> and even both algorithms in GoogleCodesearchParser replace with one
> function which will be call with different parameters.