Message from discussion Migrating 2+ years of version 7 data takes an hour and counting...
From: "Austen Ito" <austen....@gmail.com>
Date: Mon, 27 Aug 2007 13:24:51 -1000
Local: Mon, Aug 27 2007 7:24 pm
Subject: Re: [hackystat-dev] Re: Migrating 2+ years of version 7 data takes an hour and counting...
Hi Philip,

I found a huge error on my part.  Read below for more details...

> [1] The entire contents of each data file must be read into memory at
> once (due to the call to unmarshal());

This turned out not to be a problem at all.  I commented out the
SensorShell code and invoked the sensor with different heap arguments.
The large file, the one with 12,000+ entries, only caused an OOM
exception when I dropped the heap argument to less than 20MB.

> When you now uncomment the SensorShell lines, the out of memory
> problem starts up again.  What this means is that you've now isolated
> the problem as being Issue [2].

While I was looking at the code I found a _huge_ mistake on my part.
I was loading the SensorShell with a new key-value map after each
attribute in an XML entry was parsed, instead of once per entry.  This
caused the original OOM exception that would occur before I invoked
send().  I can now invoke the sensor, without increasing the JVM heap
size, to read through large data files.
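
To make the mistake concrete, here is a minimal Java sketch.  It is
not the actual sensor code: FakeShell, addPerAttribute, and
addPerEntry are hypothetical names, and copying the map on every
attribute is only an assumption about what "a new key-val map after
each attribute" amounted to.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EntryMapSketch {
  /** Hypothetical stand-in for the shell's buffer: it keeps every map it is given. */
  static final class FakeShell {
    final List<Map<String, String>> buffer = new ArrayList<>();

    void add(Map<String, String> keyValMap) {
      buffer.add(keyValMap);
    }
  }

  /** The mistake (as assumed here): a fresh map is buffered after every attribute. */
  static int addPerAttribute(FakeShell shell, String[][] attributes) {
    Map<String, String> keyValMap = new LinkedHashMap<>();
    for (String[] attr : attributes) {
      keyValMap.put(attr[0], attr[1]);
      shell.add(new LinkedHashMap<>(keyValMap)); // one buffered map per attribute
    }
    return shell.buffer.size();
  }

  /** The fix: collect all attributes first, then buffer a single map per entry. */
  static int addPerEntry(FakeShell shell, String[][] attributes) {
    Map<String, String> keyValMap = new LinkedHashMap<>();
    for (String[] attr : attributes) {
      keyValMap.put(attr[0], attr[1]);
    }
    shell.add(keyValMap); // one buffered map per entry
    return shell.buffer.size();
  }
}
```

With k attributes per entry, the first variant buffers k maps per
entry rather than one, so the buffer grows roughly k times faster
before send() is ever reached.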

The problems that still exist are:

[1] Figuring out a good estimate of when to clear the SensorShell
"buffer" by invoking send().  After talking with a coworker, I may
investigate how object sizes grow to find a good time to invoke
send().  He pointed me to a link:
http://rrandomized.blogspot.com/2005/09/yahoo-sizeof-function-in-java...
This may be a better approach than invoking send() after an arbitrary
number of entries have been parsed.
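
As a rough illustration of a size-based trigger, the sketch below
decides when to send based on how full the heap is.  This is not the
sizeof technique from the linked post; it is a coarser alternative
that relies only on java.lang.Runtime.  HeapFlushPolicy and the
fraction parameter are hypothetical, and the used-heap figure is only
an estimate because it also counts garbage that has not been
collected yet.

```java
class HeapFlushPolicy {
  /** Pure decision rule: flush once used heap exceeds the given fraction of max heap. */
  static boolean shouldFlush(long usedBytes, long maxBytes, double fraction) {
    return usedBytes > (long) (maxBytes * fraction);
  }

  /** Approximate used heap as reported by the JVM (includes uncollected garbage). */
  static long usedHeapBytes() {
    Runtime rt = Runtime.getRuntime();
    return rt.totalMemory() - rt.freeMemory();
  }

  /** Convenience wrapper that applies the rule to the current JVM. */
  static boolean shouldFlushNow(double fraction) {
    return shouldFlush(usedHeapBytes(), Runtime.getRuntime().maxMemory(), fraction);
  }
}
```

The sensor loop could then call something like
HeapFlushPolicy.shouldFlushNow(0.5) after each entry and invoke
send() whenever it returns true.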

[2] Finding the reason why an OOM exception still occurs when sending
data for a long period of time.  I decided to send the large amount of
test data using the fixed code and found that an OOM exception occurs
after an hour or so.  I was invoking send() after every 1000 entries
and did not increase the heap at all.  It may be the case that I can
increase the heap and the OOM will not occur.  In any case, I'm
thinking of profiling my sensor to see where the problem is.
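
For reference, the "send after N entries" pattern can be sketched as
below.  BatchingBuffer and its Sink callback are hypothetical
stand-ins, not the SensorShell API; the point is only that the
pending list is cleared after each send, so a correctly implemented
buffer should never hold more than N entries at once.

```java
import java.util.ArrayList;
import java.util.List;

class BatchingBuffer<T> {
  /** Hypothetical callback standing in for whatever actually transmits a batch. */
  interface Sink<T> {
    void send(List<T> batch);
  }

  private final int batchSize;
  private final Sink<T> sink;
  private final List<T> pending = new ArrayList<>();
  private int sendCount = 0;

  BatchingBuffer(int batchSize, Sink<T> sink) {
    if (batchSize < 1) {
      throw new IllegalArgumentException("batchSize must be >= 1");
    }
    this.batchSize = batchSize;
    this.sink = sink;
  }

  /** Buffers one entry and sends automatically once batchSize entries are pending. */
  void add(T entry) {
    pending.add(entry);
    if (pending.size() >= batchSize) {
      flush();
    }
  }

  /** Sends whatever is pending and releases the references so they can be collected. */
  void flush() {
    if (pending.isEmpty()) {
      return;
    }
    sink.send(new ArrayList<>(pending));
    pending.clear();
    sendCount++;
  }

  int sendCount() {
    return sendCount;
  }

  int pendingCount() {
    return pending.size();
  }
}
```

Running the migration with batchSize set to 1000, 100, and 10 and
comparing when the OOM appears is one way to tell a buffer that is
merely too large from one that leaks across sends.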

Due to my error, I think that you are right that SensorShell is
implemented correctly.  I'm going to do some investigation on object
sizes and profiling to see if I can get the sensor working.  I'm
curious to see where the problem is that causes the sensor to blow up.
 It looks like you can go back to work on the high-level analysis
stuff ;)

Thanks Philip.

austen

On 8/26/07, Philip Johnson <john...@hawaii.edu> wrote:

> Hi Austen,

> Thanks for the prompt response.  I took a look at the code and here's
> what I observe:

> First, the issue can be localized to the structure of a very small
> amount of code in the MigrationOption.execute() method.  Basically,
> this code can be paraphrased as follows:

> for (File sensorDataFile : directory) {
>   JAXBContext context = JAXBContext.newInstance(class);
>   Unmarshaller unmarshaller = context.createUnmarshaller();
>   Data data = (Data) unmarshaller.unmarshal(sensorDataFile);
>   for (entry : data) {
>     massageEntry(entry);
>     shell.add(entry);
>   }
> }
> shell.send();

> The problem is out of memory errors. There are three basic scaling
> issues with this code:

> [1] The entire contents of each data file must be read into memory at
> once (due to the call to unmarshal());
> [2] The entire contents of each file are added to the shell before any
> of it is sent;
> [3] The entire contents of the directory are added to the shell before
> any of the data is sent.

> Problem [3] is trivial to fix (and I suggested it in my last email),
> which is to send the data after each file is loaded. This just
> requires moving the send() call into the loop:

> for (File sensorDataFile : directory) {
>   JAXBContext context = JAXBContext.newInstance(class);
>   Unmarshaller unmarshaller = context.createUnmarshaller();
>   Data data = (Data) unmarshaller.unmarshal(sensorDataFile);
>   for (entry : data) {
>     massageEntry(entry);
>     shell.add(entry);
>   }
>   shell.send();
> }

> This doesn't solve the situation in which a single data file is very
> large, which unfortunately occurs in your circumstances.

> To solve this, you have to address either or both of issues [1] or
> [2].  The first thing I would do is some diagnosis.  Get your very
> big data file, and run it over your migration code, but comment out
> the sensorshell stuff:

> for (File sensorDataFile : directory) {
>   JAXBContext context = JAXBContext.newInstance(class);
>   Unmarshaller unmarshaller = context.createUnmarshaller();
>   Data data = (Data) unmarshaller.unmarshal(sensorDataFile);
> //  for (entry : data) {
> //    massageEntry(entry);
> //    shell.add(entry);
> //  }
> //  shell.send();
> }

> What this will tell you is whether reading the entire data file into
> memory (i.e. Issue [1]) is at least one of the sources of the memory
> problem. It might be useful to run your system with a few different
> heap values, to see just how much heap you need to allocate to simply
> read the big XML file into memory successfully.

> If the system blows up with the sensorshell code commented out, then
> I can suggest two ways to resolve it:

> a.  Buy more RAM for your machine (or move to a machine with more RAM
> for the migration), enabling you to increase the heap size to the
> point where the entire data file can be unmarshalled into memory at
> once. Then your JAXB approach is OK.

> b.  Replace the use of JAXB with a custom written SAX event-driven
> parser.  What this effectively allows you to do is intermingle the
> XML processing of your file with the sensorshell sending of the data,
> somewhat like the following:

> for (File sensorDataFile : directory) {
>   SAXParserFactory factory = SAXParserFactory.newInstance();
>   factory.newSAXParser().parse(sensorDataFile, handler);
> }

> where 'handler' is a callback to a method that will process a single
> data entry:

> massageEntry(entry);
> shell.add(entry);

> The good news with SAX is that the contents of the entire XML file are
> never required to be in memory all at once, only a single entry is,
> so your code can scale to an arbitrarily large V7 data file without
> having to scale your hardware to arbitrarily large heap size. :-)
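
A minimal sketch of such a handler follows, assuming (hypothetically)
that each data entry is an XML element whose fields are stored as
attributes.  The actual V7 file layout is not shown in this thread,
so the element name is passed in as a parameter, and
StreamingEntryParser and EntryCallback are illustrative names only.
The callback is where massageEntry(entry) and shell.add(entry) would
go.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class StreamingEntryParser {
  /** Hypothetical per-entry callback, invoked as soon as one entry has been read. */
  interface EntryCallback {
    void onEntry(Map<String, String> entry);
  }

  /** Fires the callback once per element named entryElement, passing its attributes. */
  static final class EntryHandler extends DefaultHandler {
    private final String entryElement;
    private final EntryCallback callback;
    private int count = 0;

    EntryHandler(String entryElement, EntryCallback callback) {
      this.entryElement = entryElement;
      this.callback = callback;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
      if (!entryElement.equals(qName)) {
        return; // ignore wrapper elements
      }
      Map<String, String> entry = new LinkedHashMap<>();
      for (int i = 0; i < attrs.getLength(); i++) {
        entry.put(attrs.getQName(i), attrs.getValue(i));
      }
      count++;
      callback.onEntry(entry); // only this one entry is held in memory here
    }

    int count() {
      return count;
    }
  }

  /** Streams the input through SAX and returns the number of entries seen. */
  static int parse(InputStream in, String entryElement, EntryCallback callback) {
    EntryHandler handler = new EntryHandler(entryElement, callback);
    try {
      SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
    } catch (Exception e) {
      throw new IllegalStateException("SAX parse failed", e);
    }
    return handler.count();
  }

  /** Convenience overload for small in-memory documents (e.g. in tests). */
  static int parseString(String xml, String entryElement, EntryCallback callback) {
    return parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
        entryElement, callback);
  }
}
```

For a real data file, a java.io.FileInputStream would be passed to
parse() in place of the in-memory string.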

> The final issue, [2], potentially still remains.

> Let's say that you discover that if you comment out the sensorshell
> stuff, then the MigrationOption does not throw an out of memory
> exception even with a nominal setting for the heap size when parsing
> the largest of your data files. That would be great. Or, that it does
> throw the exception but that you've fixed it by moving to a SAX
> parser.

> When you now uncomment the SensorShell lines, the out of memory
> problem starts up again.  What this means is that you've now isolated
> the problem as being Issue [2].

> I can think of two reasons for this:

> a. The SensorShell is implemented correctly, but given the available
> heap size, you are exceeding it by adding too many entries before
> sending.  In this case, the solution is simple. Either add more heap,
> or invoke send() more frequently such that the problem goes away.
> This really isn't bogus, it's just reality. If you give Java 50MB of
> heap, then you can't store a 51MB string in it no matter how hard you
> try.

> b. The SensorShell is implemented incorrectly, such that no matter
> how frequently you invoke send(), an out of memory error occurs.
> This would be due to the SensorShell not releasing resources
> appropriately after a send().  I've looked at the code and it appears
> to be OK in this regard, but I could be missing something. To test
> this, you just run the system a few times with a counter that invokes
> send() after N entries are added.  If the system blows up around the
> same time regardless of whether the counter is set to 1000, 100, or
> 10, then there's a problem with the SensorShell.

> All right, I hope that gives you some additional ideas to play with.
> Let me know how it goes and what I can do to help.

> Cheers,
> Philip

