comparing files without using arrays

saiji...@gmail.com

unread,

Feb 16, 2007, 1:45:48 AM2/16/07

to

hi all,

got a problem with comparing two large files(A file with 8000
words and another file of size 8 MB with some content in it.. the
frequency of the 8000 words in the other file must be found)

we tried it using arrays..

but we get array out of bounds exception even after increasing the
heap space(-Xmx512m) in run time..

can someone help us out with feasible methods for solving this
problem..

Thanks..

denJs

unread,

Feb 16, 2007, 4:09:49 AM2/16/07

to

You can use two arrays of any size as buffers for both files,
fill the buffers and compare the content in a loop.

Gordon Beaton

unread,

Feb 16, 2007, 4:30:21 AM2/16/07

to

On 15 Feb 2007 22:45:48 -0800, saiji...@gmail.com wrote:
> we get array out of bounds exception even after increasing the heap
> space(-Xmx512m) in run time..

If you're getting an array bounds exception then your problems aren't
due to lack of memory, they're caused by attempting to access beyond
the end of the array (regardless of its size).

An array of size N has indices 0 to N-1. Attempting to use an index
outside this range will cause an array bounds error.

/gordon

--
[ don't email me support questions or followups ]
g o r d o n + n e w s @ b a l d e r 1 3 . s e

Chris Dollin

unread,

Feb 16, 2007, 4:51:07 AM2/16/07

to

saiji...@gmail.com wrote:

> got a problem with comparing two large files(A file with 8000
> words and another file of size 8 MB with some content in it.. the
> frequency of the 8000 words in the other file must be found)
>
> we tried it using arrays..
>
> but we get array out of bounds exception even after increasing the
> heap space(-Xmx512m) in run time..

(a) The array-indexing-exception should tell you /where/ it happened.
Look at that piece of code.

(b) The exception happens because you're trying to us an index i that
isn't 0 <= i < array.length. (Note: /less-than/ array.length.)
Make sure the index makes sense.

(c) This has nothing to do with how much heap space you've allocated.
If there wasn't enough room for the arrays, you would have got an
OutOfMemory error.

(d) Your summary of what you want to do suggests /strongly/ that arrays
are not the data structure you want, since you need to associate
with each word from the word-file another value viz the count of
times it occurs in the other-file. This to me suggests a different
data type.

--
Chris "electric hedgehog" Dollin
"It's just the beginning we've seen" - Colosseum, /Tomorrow's Blues/

Gordon Beaton

unread,

Feb 16, 2007, 4:49:39 AM2/16/07

to

On 15 Feb 2007 22:45:48 -0800, saiji...@gmail.com wrote:

> got a problem with comparing two large files(A file with 8000 words
> and another file of size 8 MB with some content in it.. the
> frequency of the 8000 words in the other file must be found)

I've addressed your array bounds exception in another post, but
thought I'd suggest a couple of things anyway...

Only the word list needs to be kept in memory, for example using a
Hashmap or other structure that lets you easily map words to counters.

The other file can be read and processed one line at a time. You never
need to hold more than one line's worth of information. From your
problem description, I can't see a need to read the entire file into
any structure if that's what you're doing now.

Read one line, tokenize it into words, then look up each word to
update the corresponding counter. Repeat with the next line, etc.

Eric Sosman

unread,

Feb 16, 2007, 8:04:33 AM2/16/07

to

saiji...@gmail.com wrote:
> hi all,
>
> got a problem with comparing two large files(A file with 8000
> words and another file of size 8 MB with some content in it.. the
> frequency of the 8000 words in the other file must be found)
>
> we tried it using arrays..
>
> but we get array out of bounds exception even after increasing the
> heap space(-Xmx512m) in run time..

You are chasing the wrong fox. ArrayIndexOutOfBoundsException
means that somewhere in your program there is an array of N things
(ints, Strings, whatever) and that you have tried to use it with
an index that is <0 or >=N. It does not mean that your program is
running low on memory; that one produces OutOfMemoryError.

> can someone help us out with feasible methods for solving this
> problem..

Read the 8000-word file and enter each of its words as
the key in a Map<String,Integer> with the Integer value zero.
Then read the other file, word by word. For each word, do
a get(word) on the Map. If get(word) returns null, the word
is not one of the 8000 you care about, so ignore it. On the
other hand if get(word) returns an Integer, add one to that
value and update the Map: put(word, biggerInteger). At the
end, traverse the entire Map to retrieve the results.

--
Eric Sosman
eso...@acm-dot-org.invalid

Patricia Shanahan

unread,

Feb 16, 2007, 11:51:13 AM2/16/07

to

Eric Sosman wrote:
...

> Read the 8000-word file and enter each of its words as
> the key in a Map<String,Integer> with the Integer value zero.
> Then read the other file, word by word. For each word, do
> a get(word) on the Map. If get(word) returns null, the word
> is not one of the 8000 you care about, so ignore it. On the
> other hand if get(word) returns an Integer, add one to that
> value and update the Map: put(word, biggerInteger). At the
> end, traverse the entire Map to retrieve the results.
>

As an alternative to using Integer, I sometimes use a small Counter
class that wraps an int with increment() and get() operations. get() is
defined to return the number of times the Counter's increment() method
has been called.

Patricia

Mark Rafn

unread,

Feb 16, 2007, 12:09:13 PM2/16/07

to

<saiji...@gmail.com> wrote:
> but we get array out of bounds exception even after increasing the
>heap space(-Xmx512m) in run time..

ArrayIndexOutOfBoundsException won't be fixed by increasing heap space. It's
a bug in your code.
--
Mark Rafn da...@dagon.net <http://www.dagon.net/>

Alex Hunsley

unread,

Feb 16, 2007, 4:09:24 PM2/16/07

to

denJs wrote:
> You can use two arrays of any size as buffers for both files,
> fill the buffers and compare the content in a loop.

It's not a direct file comparison the OP is after ....

>
>
>
>

Alex Hunsley

unread,

Feb 16, 2007, 4:14:52 PM2/16/07

to

As the others have said, you have a bug in your program - you're trying
to access beyond the end of the array, at a guess.

As for a very basic strategy for your problem, read the 'word' file into
memory. Put the words into a Set (e.g. HashSet)[1]. Then parse the 8meg
file - read 10k chunks at a time, for example (more efficient this way),
splitting each chunk into words, and check if each word in a chunk is in
the HashSet. If so, increase its count. (So you may want a HashMap, not
a HashSet, because then it would be easy to map from each word to its
frequency count.)

[1] Using HashSet this way gives you much better efficiency than the
naive way of storing the words in a simple array, and comparing every
word in the array each time!

lex

saiji...@gmail.com

unread,

Feb 17, 2007, 2:44:11 AM2/17/07

to

thnx for ur suggestions... even i had the plan of opting for hashset
before.. Actually after finding the freqency array i got to apply
SVD(singular value decomposition), which needs a 2D array as an
input.

and what is that mapset all about??can i convert that into a 2d array
later by any means???

for now i hav used the array list concept..and converted that into 2d
array and hav applied svd(but i cant do this for large files)...

Boaz...@gmail.com

unread,

Feb 17, 2007, 8:09:46 AM2/17/07

to

sorry but i would like to know more about what you are trying to do,
you mentioned a compersion and a word freq. count
by compersion do you mean a lexigraphic compersion (char by char)? or
just compering the freq. of occurence of each word?

for checking the freq. of each word a HashTable(or any other
datastructure which provide you with the ability to set your own index
type an value with a corresponding value object) would be the best for
you
why? cause you can set the Index of each "Cell" as the Word itself
(and by that earning efficency)while the value is an Integer which
represents the number of occurnces (i recommend making a class that
contains an int memeber and have a method increment() instead of
creating an Integer instance for each incremetation. as said here
before).

inserting - put(Object key, Object value)
Hashtable numbers = new Hashtable();
numbers.put("one", new Integer(1));
numbers.put("two", new Integer(2));
numbers.put("three", new Integer(3));

retriving - get(Objectt key)

Integer n = (Integer)numbers.get("two");

for a help with the compersion itself i would require more info about
what you are trying to do and what are your limitations (unless all
the compersion is of the freq. of the words in both your files)

here is a ref to hashtable class doc
http://java.sun.com/j2se/1.5.0/docs/api/java/util/Hashtable.html

Lew

unread,

Feb 17, 2007, 10:56:58 AM2/17/07

to

Boaz...@gmail.com wrote:
> for checking the freq. of each word a HashTable

if you are in J2ME, HashMap for JSE.

- Lew

Lew

unread,

Feb 17, 2007, 11:07:52 AM2/17/07

to

Boaz...@gmail.com wrote:
> retriving - get(Objectt key)
>
> Integer n = (Integer)numbers.get("two");

Good exposition. Suggestion: use generics.

Map<String, Integer> freqs = new HashMap<String, Integer>();
// or use Map<String, Wrapper> as Boaz.Jan suggested
...
int n = freqs.get( "two" ); // Wrapper w = freqs.get("two");

With the Wrapper you can use clever idioms like

Wrapper w;
if ( (w = freqs.get( word )) == null )
{
freqs.put( word, new Wrapper() );
}
else
{
w.increment();
}

- Lew

Boaz...@gmail.com

unread,

Feb 17, 2007, 11:35:21 AM2/17/07

to

On Feb 17, 6:07 pm, Lew <l...@nospam.lewscanon.com> wrote:

didnt want to enter the concept of generics on a post clearly abit
less advanced
so copied the non generic example from the jse api docs about
hashtable
and sorry about the Hashtable - old habbit :) (the Dictionary class is
now obsolete and therefor so is hashtable... sorry again :)http://
java.sun.com/j2se/1.5.0/docs/api/java/util/Dictionary.html)

Lew

unread,

Feb 17, 2007, 12:02:59 PM2/17/07

to

Boaz...@gmail.com wrote:
> didnt want to enter the concept of generics on a post clearly abit
> less advanced
> so copied the non generic example from the jse api docs about

One could argue that generics are just as easy and less error-prone compared
to the type casts needed for raw types.

- Lew

Chris Uppal

unread,

Feb 17, 2007, 12:31:40 PM2/17/07

to

Lew wrote:

It seems reasonable to assume that someone (the OP) who doesn't yet know about
HashMap and its friends will not yet know about Java generics either.

-- chris