Hi all, I was testing sending massive amounts (over 2 years worth) of Hackystat 7 data to the Hackystat 8 sensorbase service and came across a couple interesting issues:
1. Loading hundreds of thousands of entries in the sensorshell and sending the data throws an exception. I found that the putSensorDataBatch method in org.hackystat.sensorbase.client.SensorBaseClient creates a huge XML string that cannot be sent to the server. I can send data from 5000 or so entries just fine.
My current approach is a bogus one. In order to avoid the SensorBaseClientException thrown by sending mass amounts of data, I am invoking send() after every 1000 entries.
As Aaron says: akihisa56: that is bogus.
My approach is especially bogus since the exception is due to the size of the entries rather than their number. If one file has n entries with a lot of attributes, the same exception may be thrown. Philip, is there a way to send a large data representation with REST in batches that are small enough for the server to receive? I'm not sure how data is received in REST, or whether the sending client can "know" the maximum amount of data that can be sent.
org.hackystat.sensorbase.client.SensorBaseClientException: 1001: Unable to complete the HTTP call due to a communication error with the remote server. Error writing request body to server
    at org.hackystat.sensorbase.client.SensorBaseClient.putSensorDataBatch(SensorBaseClient.java:621)
    at org.hackystat.sensorshell.command.SensorDataCommand.send(SensorDataCommand.java:71)
    at org.hackystat.sensorshell.SensorShell.send(SensorShell.java:610)
2. It takes so long to send all of the data that Autosend is invoked while I'm sending data to the server. This causes an exception to be thrown. An example error case may be if data is being sent to the server by an ant task in the background and DevEvent data from Eclipse is sent by Autosend.
Exception in thread "Timer-2" java.util.ConcurrentModificationException
    at java.util.AbstractList$Itr.checkForComodification(Unknown Source)
    at java.util.AbstractList$Itr.next(Unknown Source)
    at org.hackystat.sensorshell.command.SensorDataCommand.send(SensorDataCommand.java:68)
    at org.hackystat.sensorshell.SensorShell.send(SensorShell.java:612)
    at org.hackystat.sensorshell.command.AutoSendCommand$AutoSendCommandTask.run(AutoSendCommand.java:98)
    at java.util.TimerThread.mainLoop(Unknown Source)
    at java.util.TimerThread.run(Unknown Source)
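For readers unfamiliar with this exception: it is what java.util list iterators throw when the list is modified while it is being iterated, which fits a timer thread sending while another thread is still adding. One common remedy is to swap the buffer out under a lock before iterating. The sketch below illustrates that idea only; the class SafeBuffer and its methods are hypothetical and are not the actual SensorShell code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical buffer for illustration; not the real SensorDataCommand. */
class SafeBuffer<T> {
  private final Object lock = new Object();
  private List<T> pending = new ArrayList<>();

  /** Called by the producer, e.g. a migration loop adding entries. */
  void add(T entry) {
    synchronized (lock) {
      pending.add(entry);
    }
  }

  /**
   * Called by any sender (a manual send or a timer thread). Swaps in a fresh
   * list while holding the lock, so the returned batch is never touched by a
   * concurrent add() while the caller iterates over it.
   */
  List<T> drain() {
    synchronized (lock) {
      List<T> batch = pending;
      pending = new ArrayList<>();
      return batch;
    }
  }
}
```

A sender would call drain() and then iterate over the returned list at leisure; entries added in the meantime land in the new list and go out with the next send.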
The above cases are rare. I don't think people will be sending years worth of data unless they are migrating over to Version 8. Migrating the data will also most likely be a one-shot deal. I'm going to let the migration run while I'm sleeping. Maybe I'll find more interesting things in the morning. ;)
I am really interested in this experience, and want to put some effort into massaging the SensorShell so that it can deal with this kind of situation gracefully and efficiently.
My initial thoughts are:
* There's nothing bogus about needing to invoke send() periodically in order to 'clean out' the buffer. Indeed, that's much more optimal than somehow loading up an arbitrary amount of data on the client end and then, in one gigantic http PUT, sending it to the server. Instead, by breaking it down into smaller chunks, both the client and server can be busy at the same time, which should reduce the overall time required.
* What is bogus, as you and Aaron note, is that the invocation of send() is based upon the number of entries rather than the size of the payload. As you note, this is brittle. What we want to achieve is a kind of balance where the client and server are both working efficiently together.
* I never thought about the AutoSend issue! The solution is to set AutoSend to 0 before starting the data migration.
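To make the second point concrete, here is a minimal sketch of triggering the send on payload size rather than entry count. Everything in it is an assumption for illustration: the class SizeBoundedBatcher, the byte budget, and the sender callback standing in for the real HTTP PUT are all hypothetical, and the size is approximated by the UTF-8 length of each serialized entry.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: flushes when the buffered payload would exceed a byte budget. */
class SizeBoundedBatcher {
  private final int maxBytes;
  private final Consumer<List<String>> sender; // stand-in for the real HTTP PUT
  private final List<String> buffer = new ArrayList<>();
  private int bufferedBytes = 0;

  SizeBoundedBatcher(int maxBytes, Consumer<List<String>> sender) {
    this.maxBytes = maxBytes;
    this.sender = sender;
  }

  /** Adds one serialized entry, flushing first if it would overflow the budget. */
  void add(String entryXml) {
    int size = entryXml.getBytes(StandardCharsets.UTF_8).length;
    if (!buffer.isEmpty() && bufferedBytes + size > maxBytes) {
      flush();
    }
    buffer.add(entryXml);
    bufferedBytes += size;
  }

  /** Sends whatever is buffered (if anything) and resets the counters. */
  void flush() {
    if (buffer.isEmpty()) {
      return;
    }
    sender.accept(new ArrayList<>(buffer));
    buffer.clear();
    bufferedBytes = 0;
  }
}
```

With this shape, a file of many small entries and a file of a few huge entries both produce requests of roughly the same size. A single entry larger than the budget is still sent, just on its own.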
What I'd like to do, if you're agreeable, is the following:
- Let me look at the SensorShell/SensorBase code, make some adjustments, and then let you know of the new version(s).
- You re-try the test migration, and see if my hacks result in any improvements.
- When we're satisfied with the results, then we write up a Wiki page on "Version 7 to Version 8 migration", that documents how to do it and what people can expect. For example, no one (including me) would think that setting AutoSend to 0 would be necessary.
Let me know if you come up with other issues when you wake up.
Hi Philip, Sounds good. I'll try again when you have massaged the sensorshell/sensorbase code a bit. I checked the migration in the morning and it sadly failed due to an OOM exception. I might have to look at managing the objects that are loaded into the shell.
austen
On 8/24/07, Philip Johnson <john...@hawaii.edu> wrote:
> --On August 24, 2007 12:23:52 AM -1000 Austen Ito > <austen....@gmail.com> wrote:
This is textbook. I was able to reproduce the error perfectly in about 5 minutes.
Before I dive into this deeper, I do notice one immediate design problem from looking at the output:
$ java -jar xmldata-cli.jar -migration ../hackystat-data-test/users testUser foo bar
Hackystat SensorShell Version: 8.0.825
SensorShell started at: Sun Aug 26 11:39:43 HST 2007
Using Sensor Properties in:
Type 'help' for a list of commands.
Host: http://localhost:9876/sensorbase/ is available.
User ad...@hackystat.org is authorized to login at this host.
AutoSend set to 10 minutes
Processing /hackystat-data-test/users/testUser/data/Activity/2006-10-08.xml
Processing /hackystat-data-test/users/testUser/data/Activity/2006-10-09.xml
Processing /hackystat-data-test/users/testUser/data/Activity/2006-10-10.xml
Processing /hackystat-data-test/users/testUser/data/Activity/2006-10-12.xml
Processing /hackystat-data-test/users/testUser/data/Activity/2006-10-13.xml
About to send the following sensor data:
<Timestamp SDT Owner Tool Resource Runtime {Properties}>
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
The question I have is: why did you choose to read in _all_ of the data before sending _any_ of it?
Could you avoid the problem by simply interleaving the reading and the sending on a file-by-file basis? So, the output might look like:
Processing /hackystat-data-test/users/testUser/data/Activity/2006-10-08.xml
About to send the following sensor data:
Processing /hackystat-data-test/users/testUser/data/Activity/2006-10-09.xml
About to send the following sensor data:
Processing /hackystat-data-test/users/testUser/data/Activity/2006-10-10.xml
About to send the following sensor data:
Processing /hackystat-data-test/users/testUser/data/Activity/2006-10-12.xml
About to send the following sensor data:
Processing /hackystat-data-test/users/testUser/data/Activity/2006-10-13.xml
About to send the following sensor data:
If you're careful to ensure that you make all of the intermediate data structures available for GC after each process+send cycle (i.e. set instance variables to null, clear() collection instances, etc), then this might be about the simplest way to make the system scalable to an indefinite number of files.
Austen, could you please try that design change out and let me know what happens? I'm happy to look into this further if that change to the design doesn't fix the problem, but to me it's the necessary first step.
One other small issue: I noticed when running this that the system appears to get the <email> and <password> from the command line, but the sensorbase <host> from the v8.server.properties file. That seems a little weird to me. If we're going to override the v8.sensor.properties <email> and <password> values, shouldn't we just go all the way and override the <host> property as well by supplying it on the command line?
I'm glad you were able to reproduce the error. Here are my replies to your questions/comments.
> The question I have is: why did you choose to read in _all_ of the data before sending _any_ of it?
> Could you avoid the problem by simply interleaving the reading and the sending on a file-by-file basis?
The distribution I posted on the wiki was my previous implementation before the "bogus" one where I would send data every 1000 entries. I gave you that one so you could see the exception. The problem with reading a file and immediately sending data is that some files have a large number of entries that will cause the exception to occur. For example, I just tested sending data from one Referentia Version 7 data file that has over 12,000 entries. That caused an OOM exception if I didn't increase the heap and the data did not send once I did increase the heap.
> If you're careful to ensure that you make all of the intermediate data structures available for GC after each process+send cycle (i.e. set instance variables to null, clear() collection instances, etc), then this might be about the simplest way to make the system scalable to an indefinite number of files.
The test of sending data every 1000 entries eventually caused an OOM exception to be thrown about 2.5 hours in. After we get this issue resolved, I will need to make sure to clean up the data structures correctly.
> One other small issue: I noticed when running this that the system appears to get the <email> and <password> from the command line, but the sensorbase <host> from the v8.server.properties file. That seems a little weird to me. If we're going to override the v8.sensor.properties <email> and <password> values, shouldn't we just go all the way and override the <host> property as well by supplying it on the command line?
Yes I agree that it is a bit weird. It also felt a bit weird when I was coding it. I'll fix that up.
Let me know what you think the next steps should be. Thanks for the quick response ;)
austen
On 8/26/07, Philip Johnson <philipmjohn...@gmail.com> wrote:
Thanks for the prompt response. I took a look at the code and here's what I observe:
First, the issue can be localized to the structure of a very small amount of code in the MigrationOption.execute() method. Basically, this code can be paraphrased as follows:
for (File sensorDataFile : directory) {
  JAXBContext context = JAXBContext.newInstance(class);
  Unmarshaller unmarshaller = context.createUnmarshaller();
  Data data = (Data) unmarshaller.unmarshal(sensorDataFile);
  for (entry : data) {
    massageEntry(entry);
    shell.add(entry);
  }
}
shell.send();
The problem is out of memory errors. There are three basic scaling issues with this code:
[1] The entire contents of each data file must be read into memory at once (due to the call to unmarshal());
[2] The entire contents of each file are added to the shell before any of it is sent;
[3] The entire contents of the directory are added to the shell before any of the data is sent.
Problem [3] is trivial to fix (and I suggested it in my last email), which is to send the data after each file is loaded. This just requires moving the send() call into the loop:
for (File sensorDataFile : directory) {
  JAXBContext context = JAXBContext.newInstance(class);
  Unmarshaller unmarshaller = context.createUnmarshaller();
  Data data = (Data) unmarshaller.unmarshal(sensorDataFile);
  for (entry : data) {
    massageEntry(entry);
    shell.add(entry);
  }
  shell.send();
}
This doesn't solve the situation in which a single data file is very large, which unfortunately occurs in your circumstances.
To solve this, you have to address either or both of issues [1] or [2]. The first thing I would do is some diagnosis. Get your very big data file, and run it over your migration code, but comment out the sensorshell stuff:
for (File sensorDataFile : directory) {
  JAXBContext context = JAXBContext.newInstance(class);
  Unmarshaller unmarshaller = context.createUnmarshaller();
  Data data = (Data) unmarshaller.unmarshal(sensorDataFile);
  // for (entry : data) {
  //   massageEntry(entry);
  //   shell.add(entry);
  // }
  // shell.send();
}
What this will tell you is whether reading the entire data file into memory (i.e. Issue [1]) is at least one of the sources of the memory problem. It might be useful to run your system with a few different heap values, to see just how much heap you need to allocate to simply read the big XML file into memory successfully.
If the system blows up with the sensorshell code commented out, then I can suggest two ways to resolve it:
a. Buy more RAM for your machine (or move to a machine with more RAM for the migration), enabling you to increase the heap size to the point where the entire data file can be unmarshalled into memory at once. Then your JAXB approach is OK.
b. Replace the use of JAXB with a custom written SAX event-driven parser. What this effectively allows you to do is intermingle the XML processing of your file with the sensorshell sending of the data, somewhat like the following:
where 'handler' is a callback to a method that will process a single data entry:
massageEntry(entry); shell.add(entry);
The good news with SAX is that the contents of the entire XML file is never required to be in memory all at once, only a single entry is, so your code can scale to an arbitrarily large V7 data file without having to scale your hardware to arbitrarily large heap size. :-)
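As a rough illustration of the SAX option, the sketch below streams a file and hands each entry to a callback as soon as its start tag is seen. It uses only the standard JAXP SAX API. The class name EntryStreamer, the element name "Entry", and the assumption that each entry's fields are stored as XML attributes are all guesses made for the example, not a description of the real V7 file format.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical streaming reader; the "Entry" element name is an assumption. */
class EntryStreamer {
  /** Parses the stream and invokes handler once per Entry element, one at a time. */
  static void stream(InputStream xml, Consumer<Map<String, String>> handler) {
    DefaultHandler saxHandler = new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"Entry".equals(qName)) {
          return; // ignore the root element and anything else
        }
        // Copy this entry's attributes; nothing from earlier entries is retained here.
        Map<String, String> entry = new LinkedHashMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
          entry.put(atts.getQName(i), atts.getValue(i));
        }
        handler.accept(entry);
      }
    };
    try {
      SAXParserFactory.newInstance().newSAXParser().parse(xml, saxHandler);
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("Could not parse sensor data file", e);
    }
  }
}
```

In the migration code, the callback passed to stream() would be the place to call massageEntry(entry) and shell.add(entry), with a send() every so often.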
The final issue, [2], potentially still remains.
Let's say that you discover that if you comment out the sensorshell stuff, then the MigrationOption does not throw an out of memory exception even with a nominal setting for the heap size when parsing the largest of your data files. That would be great. Or, that it does throw the exception but that you've fixed it by moving to a SAX parser.
When you now uncomment the SensorShell lines, the out of memory problem starts up again. What this means is that you've now isolated the problem as being Issue [2].
I can think of two reasons for this:
a. The SensorShell is implemented correctly, but given the available heap size, you are exceeding it by adding too many entries before sending. In this case, the solution is simple. Either add more heap, or invoke send() more frequently such that the problem goes away. This really isn't bogus, it's just reality. If you give Java 50MB of heap, then you can't store a 51MB string in it no matter how hard you try.
b. The SensorShell is implemented incorrectly, such that no matter how frequently you invoke send(), an out of memory error occurs. This would be due to the SensorShell not releasing resources appropriately after a send(). I've looked at the code and it appears to be OK in this regard, but I could be missing something. To test this, you just run the system a few times with a counter that invokes send() after N entries are added. If the system blows up around the same time regardless of whether the counter is set to 1000, 100, or 10, then there's a problem with the SensorShell.
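The counter experiment in (b) could be driven by something like the sketch below. The class SendIntervalExperiment is hypothetical, and the add and send callbacks are stand-ins for the real shell.add() and shell.send() calls.

```java
import java.util.function.IntConsumer;

/** Hypothetical harness; add and send stand in for the SensorShell calls. */
class SendIntervalExperiment {
  /**
   * Feeds totalEntries to add, invoking send after every n additions,
   * and returns the number of times send was invoked.
   */
  static int run(int totalEntries, int n, IntConsumer add, Runnable send) {
    int sends = 0;
    int sinceLastSend = 0;
    for (int i = 0; i < totalEntries; i++) {
      add.accept(i);
      sinceLastSend++;
      if (sinceLastSend == n) {
        send.run();
        sends++;
        sinceLastSend = 0;
      }
    }
    if (sinceLastSend > 0) { // send the final partial batch
      send.run();
      sends++;
    }
    return sends;
  }
}
```

Running the same data set with n set to 1000, 100, and 10 and noting roughly how many entries were processed before any failure gives the comparison described above.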
All right, I hope that gives you some additional ideas to play with. Let me know how it goes and what I can do to help.
It occurred to me that another possible memory problem in your situation can be created by the fact that the SensorShell echoes all of the data it is about to send (in one single string, no less) prior to sending.
This behavior was simply a bootstrapping/debugging mechanism, which is no longer needed now that we have other mechanisms for detecting sensor data transmission (i.e. the sensordataviewer).
So, I've just removed those lines and committed the changes to the SensorShell. Austen, you might want to update your local copy and use this version in your current development.
I found a huge error on my part. Read below for more details...
> [1] The entire contents of each data file must be read into memory at once (due to the call to unmarshal());
This turned out to not be a problem at all. I commented out the sensorshell code and invoked the sensor with different heap arguments. The large file, the one with 12,000+ entries, only caused an OOM exception when I dropped the heap argument to less than 20MB.
> When you now uncomment the SensorShell lines, the out of memory problem starts up again. What this means is that you've now isolated the problem as being Issue [2].
While I was looking at the code I found a _huge_ mistake on my part. I was loading sensor shell with a new key-val map after each attribute in an XML entry was parsed. This caused the original OOM exception that would occur before I invoked send(). I can now invoke the sensor, without increasing the JVM heap size, to read through large data files.
The problems that still exist are:
[1] Figuring out a good estimate of when to clear the sensorshell "buffer" by invoking send(). After talking with a coworker, I may investigate how object sizes grow to find out a good time to invoke send(). He pointed me to a link: http://rrandomized.blogspot.com/2005/09/yahoo-sizeof-function-in-java... . This may be a better approach than invoking send() after an arbitrary number of entries have been parsed.
[2] Finding the reason why an OOM exception still occurs when sending data for a long period of time. I decided to send the large amount of test data using the fixed code and found that an OOM exception occurs after an hour or so. I was invoking send after 1000 entries and did not increase the heap at all. It may be the case that I can increase the heap and the OOM will not occur. In any case, I'm thinking of profiling my sensor to see where the problem is.
Due to my error, I think that you are right that SensorShell is implemented correctly. I'm going to do some investigation on object sizes and profiling to see if I can get the sensor working. I'm curious to see where the problem is that causes the sensor to blow up. It looks like you can go back to work on the high-level analysis stuff ;)
Thanks Philip.
austen
On 8/26/07, Philip Johnson <john...@hawaii.edu> wrote:
--On Monday, August 27, 2007 1:24 PM -1000 Austen Ito <austen....@gmail.com> wrote:
> [1] Figuring out a good estimate of when to clear the sensorshell "buffer" by invoking send.
I'm not sure whether this is really worth optimizing. Basically, an "optimal" solution only saves you a certain number of HTTP request overheads. HTTP request overhead is typically very small, so for it to be noticeable, you have to be able to reduce it by at least 2-3 orders of magnitude. For example, let's say I have to send 1000 entries to the server.
If I call send() after each entry is added, then I've generated 1000 HTTP requests.
I can reduce by 1 order of magnitude by calling send() after every 10 entries, which generates only 100 HTTP requests.
I can reduce by 2 orders of magnitude by calling send() after every 100 entries, which generates only 10 HTTP requests.
And, finally, by 3 orders of magnitude if I call it only once.
Of course, the space-time tradeoff is that you need 1, 2, or 3 orders of magnitude larger buffers to hold these entries prior to the transmission.
For a file of 10,000 entries, if I send() after every 100 entries are added, I generate 100 HTTP requests. If I send() after every 1000 entries are added, I generate 10 requests, so I only eliminate 90 requests. If the request overhead is, say, 0.25 seconds, then I've saved 22.5 seconds, which just doesn't seem like a lot to me.
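The arithmetic above can be checked with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and the 0.25 second overhead is the same guess used in the text, not a measurement.

```java
class RequestOverhead {
  /** Number of HTTP requests needed to send totalEntries in batches of batchSize. */
  static long requests(long totalEntries, long batchSize) {
    return (totalEntries + batchSize - 1) / batchSize; // ceiling division
  }

  /** Seconds of request overhead saved by moving from smallBatch to largeBatch. */
  static double secondsSaved(long totalEntries, long smallBatch, long largeBatch,
      double overheadSeconds) {
    return (requests(totalEntries, smallBatch) - requests(totalEntries, largeBatch))
        * overheadSeconds;
  }

  public static void main(String[] args) {
    System.out.println(requests(10_000, 100));   // requests with a buffer of 100
    System.out.println(requests(10_000, 1000));  // requests with a buffer of 1000
    System.out.println(secondsSaved(10_000, 100, 1000, 0.25));
  }
}
```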
What I would propose you do is a send() after a "reasonable but conservative" number of entry additions--maybe 250 or so. Unless you've got your heap size set really low and/or your entries have a gimungous number of attributes, that should not cause an out of memory exception if the rest of the code is working right. For a big migration, it might be that you end up waiting around one additional hour for it to complete, but it might take you 6 hours to write and debug the code to save you that hour of waiting. :-)
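A rough sketch of the "send() after every N additions" idea. The Shell interface below is a hypothetical stand-in, not the real SensorShell API, and entries are plain strings only to keep the example self-contained.

```java
class BatchedSender {
  /** Hypothetical stand-in for the shell; the real SensorShell API may differ. */
  interface Shell {
    void add(String entry);
    void send();
  }

  /**
   * Adds every entry to the shell, invoking send() whenever batchSize entries
   * have been buffered, plus once at the end for any remainder.
   * Returns the number of send() calls made.
   */
  static int migrate(Iterable<String> entries, Shell shell, int batchSize) {
    int buffered = 0;
    int sends = 0;
    for (String entry : entries) {
      shell.add(entry);
      buffered++;
      if (buffered == batchSize) {
        shell.send();
        sends++;
        buffered = 0;
      }
    }
    if (buffered > 0) {
      shell.send(); // flush the final partial batch
      sends++;
    }
    return sends;
  }
}
```

Making batchSize a parameter also makes it easy to rerun the migration with 1000, 100, or 10 when checking whether the crash time depends on the buffer size.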
> [2] Finding the reason why an OOM exception still occurs when sending data for a long period of time. I decided to send the large amount of test data using the fixed code and found that an OOM exception occurs after an hour or so. I was invoking send after 1000 entries and did not increase the heap at all. It may be the case that I can increase the heap and the OOM will not occur. In any case, I'm thinking of profiling my sensor to see where the problem is.
Given that I'm saving you time on Issue [1] by proposing that you set it to 250 and forget about it, I am hoping you can devote your newly freed-up time to Issue [2], which seems more interesting.
I can think of a few scenarios to explain what you're seeing:
(a) After about an hour, the migration code encounters some "weird" data (maybe some sensor data where each entry is indeed a few orders of magnitude larger than anything that came before). That produces the OOM. In that case, changing the order in which the migration mechanism encounters the data changes the time at which it crashes.
(b) It has nothing to do with the data: no matter what order the data is encountered in, the system always crashes after an hour or so.
(c) 1000 entries, in combination with your heap size, puts the system too near the limit. Reducing the buffer size to 250 fixes things and the system can run indefinitely.
Shoots, it looks like you can invoke the garbage collector occasionally if you want!
When doing these tests, I would try to minimize the amount of output from my migration code. Maybe just generate a timestamp and the name of the XML data file when it is read in, and another time stamp when the system completes the process of sending all of it to the V8 server. (perhaps along with the total number of entries that were processed and sent in that file.) Then you can correlate the time info in your JConsole chart with the data that's being sent, and get a better idea of whether heap is being used up slowly but consistently, or whether there's a sudden spike in heap usage when it encounters a specific kind of data file that puts it over the edge.
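One possible shape for that minimal per-file output, sketched with only standard JDK calls. The class name, method names, and log format are assumptions for illustration; the heap figure comes from Runtime and is only a coarse cross-check against what JConsole shows.

```java
import java.util.Date;

class MigrationLog {
  /** Approximate heap currently in use, in bytes. */
  static long usedHeapBytes() {
    Runtime rt = Runtime.getRuntime();
    return rt.totalMemory() - rt.freeMemory();
  }

  /** One summary line per data file: name, start and end times, entry count, heap in use. */
  static String summary(String fileName, Date start, Date end, long entries) {
    long elapsedMillis = end.getTime() - start.getTime();
    return String.format("%s: started %tT, finished %tT, %d ms, %d entries, %d MB heap in use",
        fileName, start, end, elapsedMillis, entries, usedHeapBytes() / (1024 * 1024));
  }

  public static void main(String[] args) {
    Date start = new Date();
    // ... read one XML data file and send its entries here ...
    Date end = new Date();
    System.out.println(summary("example-data.xml", start, end, 0));
  }
}
```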
This sounds like fun. Wish I was there. :-) Let me know what you find out!
Unfortunately, you've now got me thinking about this problem.
I am guessing that the behavior of this application involves the following cycle:
- spend a few seconds reading in the data and getting ready to send it
- spend a few seconds _waiting_ for the HTTP request to complete.
- go back to reading in the data and getting ready to send it.
- spend a few seconds _waiting_ for the HTTP request to complete.
It occurs to me that it might well be that over the course of a run, up to half of the time the client is essentially idle, waiting for its HTTP request to complete. That's the real cause of inefficiency in the system.
Now, _after_ you've totally gotten rid of the out of memory errors, you might want to recreationally think about how to speed this sucker up. The goal, I think, is to minimize the time the system spends in an idle state just waiting for an HTTP request to complete.
There are lots and lots of ways to think about this problem. One way would be to have just two threads and kind of alternate between them: as soon as the first thread emits the send(), the second one goes off and starts reading in data, and vice versa.
Another way might be to have a pool of N "worker" threads with one "master" thread, and divide up the work among them. Let's say there are 10 worker threads and 1000 files to process. The master thread divides up the 1000 files into 10 batches of 100 files each, and gives each thread its own batch of files to process. In this case, the odds that all 10 would be waiting at the same time become low. Of course, you now want to make sure your server process can handle the onslaught!
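A sketch of the master/worker idea using java.util.concurrent. The processFile callback is hypothetical and stands for "parse one V7 file and send its entries"; each worker would also need its own shell instance or some other thread-safe way to send, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class WorkerPoolSketch {
  /** Splits the file list into at most 'workers' batches of near-equal size. */
  static List<List<String>> partition(List<String> files, int workers) {
    List<List<String>> batches = new ArrayList<>();
    int size = (files.size() + workers - 1) / workers; // ceiling division
    for (int i = 0; i < files.size(); i += size) {
      batches.add(new ArrayList<>(files.subList(i, Math.min(i + size, files.size()))));
    }
    return batches;
  }

  /** The "master": hands one batch to each worker thread and waits for all to finish. */
  static void run(List<String> files, int workers, Consumer<String> processFile)
      throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    for (List<String> batch : partition(files, workers)) {
      pool.execute(() -> batch.forEach(processFile));
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS); // arbitrary upper bound for the sketch
  }
}
```

With 1000 files and 10 workers, partition() yields 10 batches of 100 files each, matching the example in the text.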
There are a bunch of others--nonblocking HTTP, and so forth.
I never thought migration would be so interesting!
Could you tell me the delimiter between each key-value pair so I can separate them out when migrating the data? I think that would be faster than me looking through data files trying to find a pattern ;)
austen
On 8/27/07, Austen Ito <austen....@gmail.com> wrote:
> Hi Philip,
> You are right. This is very interesting! I'm going to try out the JConsole app and watch what happens. I'll give you all the juicy details of the things that I find.
> austen
> On 8/27/07, Philip Johnson <philipmjohn...@gmail.com> wrote:
> > Hi Austen,
> > In other words, don't create a new context and unmarshaller for every file. I know that the context is quite expensive to create. I doubt this is a silver bullet for your problems, but it might help a little bit.
That's excellent! Actually, it looks like increasing the heap size to 1024M was a bit of overkill---while the default (64M) was not enough, it looks like 128M would have been sufficient.
> Also, I noticed that the pMap attributes are included in some entries:
> Could you tell me the delimiter between each key-value pair so I can separate them out when migrating the data? I think that would be faster than me looking through data files trying to find a pattern ;)
Actually, it's easier than that. First, copy the implementation of the SensorDataPropertyMap class into your migration code. The JavaDoc for that class is here:
Then, feed the string from the encoded pMap into the constructor, and you've now got a SensorDataPropertyMap instance you can extract the data from and then feed into the v8 SensorData getProperties() return value.
> Actually, it's easier than that. First, copy the implementation of the SensorDataPropertyMap class into your migration code. The JavaDoc for that class is here:
Oh good. That is _much_ easier. Thanks Philip.
austen
On 9/1/07, Philip Johnson <john...@hawaii.edu> wrote:
> That's excellent! Actually, it looks like increasing the heap size to 1024M was a bit of overkill---while the default (64M) was not enough, it looks like 128M would have been sufficient.
> > Also, I noticed that the pMap attributes are included in some entries:
> > Could you tell me the delimiter between each key-value pair so I can separate them out when migrating the data? I think that would be faster than me looking through data files trying to find a pattern ;)
> Actually, it's easier than that. First, copy the implementation of the SensorDataPropertyMap class into your migration code. The JavaDoc for that class is here:
> Then, feed the string from the encoded pMap into the constructor, and you've now got a SensorDataPropertyMap instance you can extract the data from and then feed into the v8 SensorData getProperties() return value.
So, Austen, now that you've solved the memory issues and are cleaning up details (i.e. property maps), it would be very interesting (at least to me) to insert some timing code to find out what proportion of time is spent waiting on HTTP requests (I suspect it could be as much as 50% of the execution time). It would also be interesting to see how much time, on average, it takes to migrate a single sensor data entry.
If you're interested in checking this out, it should be pretty simple. First, create a long that you increment each time you migrate a single sensor data item. Second, create a Date instance when you start execution and right before you end execution. Subtract those two to get total wall clock time in milliseconds. Divide that by the total number of sensor data items to get average milliseconds per sensor data item migration.
Finally, generate two Date instances on either side of the SensorShell.send() call, then subtract them immediately after each send() to find out how long the HTTP request took to process. Keep a running tally of those milliseconds in some global counter.
Just before exiting, print out the statistics: how many milliseconds total, how many sensor data transmissions total, how many milliseconds per sensor data transmission, how many milliseconds total spent waiting for send() to complete, and what percentage of the total time was spent waiting for send().
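The counters described above might look roughly like the following. This is a sketch with made-up class and method names; it uses System.currentTimeMillis() instead of subtracting Date instances, which gives the same millisecond values, and the real SensorShell.send() call would go inside the Runnable passed to timeSend().

```java
class MigrationTimer {
  private final long startMillis = System.currentTimeMillis();
  private long entries;    // sensor data items migrated so far
  private long sendMillis; // running tally of time spent inside send()

  void entryMigrated() {
    entries++;
  }

  /** Runs the given send action and adds its duration to the running tally. */
  void timeSend(Runnable send) {
    long before = System.currentTimeMillis();
    send.run();
    sendMillis += System.currentTimeMillis() - before;
  }

  static double millisPerEntry(long totalMillis, long entries) {
    return entries == 0 ? 0.0 : (double) totalMillis / entries;
  }

  static double percentSending(long sendMillis, long totalMillis) {
    return totalMillis == 0 ? 0.0 : 100.0 * sendMillis / totalMillis;
  }

  /** Call once, just before exiting. */
  String report() {
    long total = System.currentTimeMillis() - startMillis;
    return String.format(
        "total=%d ms, entries=%d, avg=%.1f ms/entry, sending=%d ms (%.1f%%)",
        total, entries, millisPerEntry(total, entries),
        sendMillis, percentSending(sendMillis, total));
  }
}
```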
It might be interesting to do a couple of test runs with a subset of the data where you change the number of sensor data entries that you buffer before sending from 250 to maybe 100 and then maybe 500, just to see if this changes these runtime characteristics.
I'm hoping this is just 15 minutes of coding, and the benefit is that it gives us a concrete sense for how much we could speed up performance of the SensorShell by implementing a multi-threaded solution to reduce the wait time for the client. I could see this as being helpful to us, not only in migration scenarios, but anytime there are large amounts of data being sent to the server.
If you don't get around to doing this, no worries!
Hi Philip, I did the test runs per your request. It seems that the majority of the time is spent sending the data. I do believe that a multi-threaded solution would not only be useful but fun as well.
I ran 6 test cases, each sending 33417 entries and varying the number of entries buffered before each send. Here are the stats (I rounded all numbers to the nearest tenth):
First run: 1 entry buffer size
Average Time Per Migration: 45.9 milliseconds
Total Time Spent Sending: 1495070 milliseconds
Total Percentage of the Time Spent Sending: 97.4%
Total Execution Time: 1535025 milliseconds

Second run: 100 entry buffer size
Average Time Per Migration: 5.8 milliseconds
Total Time Spent Sending: 175604 milliseconds
Total Percentage of the Time Spent Sending: 91.0%
Total Execution Time: 193101 milliseconds

Third run: 250 buffer size
Average Time Per Migration: 5.5 milliseconds
Total Time Spent Sending: 166854 milliseconds
Total Percentage of the Time Spent Sending: 90.7%
Total Execution Time: 183999 milliseconds

Fourth run: 500 entry buffer size
Average Time Per Migration: 5.0 milliseconds
Total Time Spent Sending: 152260 milliseconds
Total Percentage of the Time Spent Sending: 90.7%
Total Execution Time: 167914 milliseconds

Fifth run: 1000 entry buffer size
Average Time Per Migration: 4.9 milliseconds
Total Time Spent Sending: 147598 milliseconds
Total Percentage of the Time Spent Sending: 90.3%
Total Execution Time: 163308 milliseconds

Sixth run: 5000 entry buffer size
Average Time Per Migration: 4.5 milliseconds
Total Time Spent Sending: 136131 milliseconds
Total Percentage of the Time Spent Sending: 90.2%
Total Execution Time: 150883 milliseconds
Assuming my calculations are correct, it seems that the majority of the time is spent waiting on HTTP requests. The results are consistent with each other: when more data is read before each send, the total execution time drops. When the number of requests increases, for example by sending a request after every entry, the execution time balloons. Although the execution time keeps dropping as the buffer grows, the rate of improvement from the 2nd test run on is much smaller.
Hope that is helpful. If the results seem bogus, I can go ahead and commit another developer release with my benchmarking code. Let me know if you have any questions.
austen
On 9/2/07, Philip Johnson <philipmjohn...@gmail.com> wrote:
> So, Austen, now that you've solved the memory issues and are cleaning up details (i.e. property maps), it would be very interesting (at least to me) to insert some timing code to find out what proportion of time is spent waiting on HTTP requests (I suspect it could be as much as 50% of the execution time). It would also be interesting to see how much time, on average, it takes to migrate a single sensor data entry.
> If you're interested in checking this out, it should be pretty simple. First, create a long that you increment each time you migrate a single sensor data item. Second, create a Date instance when you start execution and right before you end execution. Subtract those two to get total wall clock time in milliseconds. Divide that by the total number of sensor data items to get average milliseconds per sensor data item migration.
> Finally, generate two Date instances on either side of the SensorShell.send() call, then subtract them immediately after each send() to find out how long the HTTP request took to process. Keep a running tally of those milliseconds in some global counter.
> Just before exiting, print out the statistics: how many milliseconds total, how many sensor data transmissions total, how many milliseconds per sensor data transmission, how many milliseconds total spent waiting for send() to complete, and what percentage of the total time was spent waiting for send().
> It might be interesting to do a couple of test runs with a subset of the data where you change the number of sensor data entries that you buffer before sending from 250 to maybe 100 and then maybe 500, just to see if this changes these runtime characteristics.
> I'm hoping this is just 15 minutes of coding, and the benefit is that it gives us a concrete sense for how much we could speed up performance of the SensorShell by implementing a multi-threaded solution to reduce the wait time for the client. I could see this as being helpful to us, not only in migration scenarios, but anytime there are large amounts of data being sent to the server.
> If you don't get around to doing this, no worries!
* The results seem internally consistent. For example, the changes in average time per migration correlate quite well with the buffer size, which indicates to me that your counters are working correctly.
* I guessed "as much as 50%" of the execution time would be waiting on send(). I was off by a factor of 2---it spends almost 100% of the execution time waiting on send(). :-)
* Changing the buffer size from 250 to 5000 has about a 20% impact on overall throughput. On the other hand, if we could change the system so that a thread is always processing data and the client is never completely paused waiting for a send() to complete, we could theoretically improve throughput by 900%. (Under the assumption that the server would service requests just as fast even though the load on it would be substantially higher.)
* Our baseline right now (with a 250 buffer size) is 5.5 milliseconds per migration of a single sensor data entry, or about 11,000 sensor data instances per minute. That's really not too shabby by itself and I am sure way better than Version 7.
Thus, a non-multithreaded solution that sets the buffer size at 250 is quite reasonable in terms of minimizing the required heap size yet getting decent performance. Austen, I would recommend that you complete the SensorDataProperty stuff, package everything up, and make a "single threaded" release so that we can start to build on that version.
Once we've got a stable, functionally complete version of the system in single thread mode, then the fun can begin on throughput optimization. If it were me, I might start by trying a solution that divides the files into N batches and spawns an individual thread to process each of them. I would also make sure to put my SensorBase on a separate machine so that the experiments are not affected by context switching back and forth (unless you're running on a quad core system or something.) Collect the same data on a per thread basis, but also get the overall start and end times. Then re-run the system, starting with N=1 (the base condition that should replicate your current findings), then N=2, 3, 4, 5, 6, 9, 12, 15, and 20. You might even want to make 2 or 3 runs at each N just to make sure the results are consistent for a given N.
What I hypothesize you'll find is the following:
* For small N, speedup should be almost linear. The per-thread data on average time per migration should not change as N increases (indicating the server can handle the load for small N). Individual threads will continue to spend about 90% of their time waiting, which is fine. The overall client, however, will complete much more quickly because while one thread is blocked, another is doing something.
* As N gets higher, two things should happen at some point: (1) The server starts to feel the load, which will be evidenced by the average time per migration for an individual thread getting higher. (2) The overall client is now working near 100% of the time and thus further increases in N don't utilize it any more effectively.
If this is true, then spawning just 3 threads, for example, will result in up to a threefold increase in throughput, to about 30,000 sensor data instances per minute.
If this is actually what happens, then we should (a) make a new release of the multi-threaded migration package because it works, and (b) start thinking about how to make a multi-threaded strategy available to other applications who need high throughput.
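The hypothesis above can be written down as a very simple idealized model: speedup grows linearly with the number of threads until the client is busy all the time, and it is capped at 1 / (1 - waitFraction). The model, class name, and method names are assumptions for illustration only; it ignores server slowdown under load, which is exactly what the proposed experiments are meant to measure.

```java
class ThroughputModel {
  /**
   * Idealized speedup for 'threads' client threads, each idle for
   * waitFraction of its time, assuming the server never slows down.
   */
  static double speedup(int threads, double waitFraction) {
    double ceiling = 1.0 / (1.0 - waitFraction); // client busy 100% of the time
    return Math.min(threads, ceiling);
  }

  /** Entries per minute given the single-thread cost per entry and a speedup factor. */
  static double entriesPerMinute(double millisPerEntry, double speedup) {
    return 60_000.0 / millisPerEntry * speedup;
  }

  public static void main(String[] args) {
    double wait = 0.9;     // roughly the measured share of time spent in send()
    double baseline = 5.5; // ms per entry with a 250 entry buffer
    for (int n : new int[] {1, 2, 3, 6, 12, 20}) {
      double s = speedup(n, wait);
      System.out.printf("N=%d speedup=%.1f entries/min=%.0f%n",
          n, s, entriesPerMinute(baseline, s));
    }
  }
}
```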