Java instance startup time out of control


Jeff Schnitzer

Jun 16, 2012, 5:56:33 PM
to Google App Engine
We're having a big problem with instance startup time. It varies
between 20s and 60+ seconds, and lately it's tending towards the high
end. We're starting to experience downtime because instances get
deadlined before they go active.

This app is well optimized for GAE. There's no classpath scanning and
it doesn't try to eagerly load data. On a good day it starts in
20s... so at this point there's not really much I can do.

I have a cron task that performs a db cleanup once a minute, and since
crons can run over 60s I can eventually get one instance started,
which is enough to serve traffic at the moment. But at this point I
can no longer deploy code over old versions because the appserver
restart will fail.

Please help.

Jeff

Rafael Dipold

Jun 16, 2012, 8:52:27 PM
to google-a...@googlegroups.com

Per

Jun 17, 2012, 7:46:16 AM
to google-a...@googlegroups.com
Hi Jeff, sounds awful.

Even 20s on a good day is a lot IMO. We're using Wicket, which is on the heavy side, and still the startup typically only takes 10 to 15s, and that's including rendering an initial page too (so it basically involves firing up all subsystems too, loading all classes, contacting the database several times, filling some caches, etc).

I'm guessing you're experiencing some "new release testing" by the Google team, so that might account for a certain slowdown. We've seen our average latency increase from 250ms to 380ms overnight, 2 days ago. Maybe your slowdown is related to a similar service change.

But it sounds like your application startup is too slow to begin with. Do you have logging in there which can pinpoint what parts take how much time? Would be interested to learn about that.

Cheers,
Per

Richard Watson

Jun 17, 2012, 8:00:07 AM
to google-a...@googlegroups.com
I have a strong suspicion that disk access with many files is an issue.  Have you made any effort to package your classes as a jar file?

It'd be great to get some instrumentation that shows disk vs in-memory setup time.

Looks a bit fiddly to do (multiple projects), but here's some Q&A around it:


Jeff Schnitzer

Jun 17, 2012, 3:36:31 PM
to google-a...@googlegroups.com
On Sun, Jun 17, 2012 at 5:00 AM, Richard Watson
<richard...@gmail.com> wrote:
> I have a strong suspicion that disk access with many files is an issue.
>  Have you made any effort to package your classes as a jar file?

I have the same suspicion. I have done tests in the past (packaging
it up by hand) and found about a 20% improvement in startup time. It
was significant but not enough to justify the effort of hacking this
into the deployment process. Of course, my project has grown since
then, so maybe it's time to revisit this.

My biggest concern is that even if I cut my startup time in half
(which I suspect is optimistic), I'm still going to be at the mercy of
a blip in GAE. The startup time varies by a factor of 3 even when
GAE is in "normal" state. This variability is very hard to work with,
especially since (from comments on this list) I suspect sub-20s java
instance startups are difficult to attain in real-world projects.

Another problem is that this variance makes testing hard since I have
to do statistical measurement of startup time. It's hard to know if
changes have a real effect or are just a quirk of getting deployed to
a faster/slower part of the cluster.
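
One way to make such comparisons less noisy is to collect several loading-request times per build and compare the median and the slowest/fastest ratio rather than single runs. A minimal sketch, assuming the samples are copied by hand from the request logs; the class name `StartupStats` and the sample values in `main` are hypothetical, not measurements from this thread:

```java
import java.util.Arrays;

public class StartupStats {
    /** Median of the samples in seconds; assumes at least one sample. The input is not modified. */
    static double median(double[] samples) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Ratio of the slowest to the fastest sample, i.e. the "varies by a factor of N" figure. */
    static double spread(double[] samples) {
        double min = Arrays.stream(samples).min().getAsDouble();
        double max = Arrays.stream(samples).max().getAsDouble();
        return max / min;
    }

    public static void main(String[] args) {
        // Hypothetical startup times (seconds) for two builds of the same app.
        double[] buildA = {22, 48, 51, 25, 60};
        double[] buildB = {20, 41, 44, 23, 52};
        System.out.println("A: median=" + median(buildA) + " spread=" + spread(buildA));
        System.out.println("B: median=" + median(buildB) + " spread=" + spread(buildB));
    }
}
```

With only a handful of samples per build the medians can still differ by chance, so this narrows the guesswork rather than removing it.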

I guess I have three specific complaints right now (completely
independent from specific techniques to make my app faster):

1) Startup time should not vary this much. If HRD latency varied by
a factor of 3, someone would probably sound an alarm. Whatever
datasource classes are being loaded from should be more consistent
(assuming that's the problem).

2) The deadline for startup requests is too short. In pharmacology,
one measure of the safety of drugs is the Therapeutic Index -
basically, the ratio of the amount that will kill you divided by the
amount that you need for an effective dose. For alcohol, it's about
10:1. If typical startup time is 20s, then the Therapeutic Index for
GAE is 3:1 - waaaaay too close, especially considering that startup
times seem to vary by a factor of 3 normally.

3) Google, we really need more transparency on this issue. We have a
lot of people speculating and doing trial-and-error experiments, but
not much in the way of official guidance. We shouldn't need to guess
what will work and what won't - please tell us what's going on, why
does a process that takes 3s on my local box take 60s+ in production?
If we understand the underlying mechanism, we can (hopefully) design
around it.

Thanks,
Jeff

Jeff Schnitzer

Jun 17, 2012, 4:45:58 PM
to google-a...@googlegroups.com
Here's some more information:

I have two environments, a production environment and a sandbox
environment (same code, different appids). Production loading
requests usually take ~50 seconds, with occasional (rare) 25s loads.
Sandbox consistently takes 25s, with *very* rare numbers higher.

I log a few timing metrics at startup. Behavior is consistent with
slow classloading, but I can't think of a good way to be certain. Is
it possible to instrument the classloader and get load timings for
specific classes?
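
Short of instrumenting the classloader itself, one rough approximation is to time each explicit load of the classes that get registered at startup. The sketch below is an assumption about how that could look, not something the GAE runtime provides: the class `ClassLoadTimer` and its methods are hypothetical, and each measurement includes whatever superclasses and static initializers the load drags in, so it gives an upper bound per class rather than a precise figure.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ClassLoadTimer {
    /** One measurement: the class that was requested and the elapsed wall time. */
    static final class Timing {
        final String className;
        final long nanos;

        Timing(String className, long nanos) {
            this.className = className;
            this.nanos = nanos;
        }
    }

    private final List<Timing> timings = new ArrayList<>();

    /** Loads and initializes the named class, recording how long the call took (even on failure). */
    public Class<?> load(String className) {
        long start = System.nanoTime();
        try {
            return Class.forName(className);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Cannot load " + className, e);
        } finally {
            timings.add(new Timing(className, System.nanoTime() - start));
        }
    }

    /** Returns a copy of the recorded timings, slowest first. */
    public List<Timing> slowestFirst() {
        List<Timing> sorted = new ArrayList<>(timings);
        sorted.sort(Comparator.comparingLong((Timing t) -> t.nanos).reversed());
        return sorted;
    }
}
```

A class that is already loaded returns almost immediately, so the numbers are only meaningful on a cold instance and depend on the order in which classes are requested.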

I've optimized my app in all the obvious ways: lazily load everything
that can be, eliminate any classpath scanning, use Objectify, etc.
However, it still requires a lot of classes to get my app started, and
I don't see any way around it:

* Objectify requires every class to be registered up front. This is
taking 5-8 seconds to register 36 entity classes (plus a fair bit of
embedded structure). I am intimately familiar with what Ofy does
during registration - this isn't a computational cost.

* I'm using Guice (development mode), but this isn't really the core
of the problem. The issue is that I'm using JAX-RS (Resteasy) to
support my REST API. In order for the @Path annotations to be
recognized, every resource class must be registered - and thus loaded.
There's ~80 of these classes. The Guice initialization part (which
includes Objectify registration) takes ~30s in production, ~15s in
sandbox.

I don't see an obvious way to optimize this short of going back to
Java programming circa 2002. As long as my sitemap is defined by
@Path annotations and not a bunch of text in an xml file, those
classes are going to have to be loaded before my app starts. Entity
classes must be registered and introspected before any data is loaded
from the datastore.

I'm not even asking for classpath scanning... I just want normal
classes to load in reasonable time. This problem is going to get
significantly worse over the next year. We add classes every day; our
app gets more features, not fewer. I guess my next experiment will
need to be JARing my WEB-INF/classes.

Rafael: Thanks for the links... John Patterson's comment on David
Chandler's article was interesting. I use Guice AOP in several rather
important places; I really hesitate to remove it. At the very least
it would require adding a lot of tedious, error-prone code to replace
my interceptors.

I would like to hear from someone who has measured the effects of
stripping unused code out of third-party jars. Does that really help?
It would be heinously complicated to maintain this, but I'm very
concerned that I'm hitting a scalability limit of appengine -
something like "your application cannot be more complex than N
classes".

Thanks,
Jeff
> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/google-appengine/-/CzSFxHEhj3QJ.
>
> To post to this group, send email to google-a...@googlegroups.com.
> To unsubscribe from this group, send email to
> google-appengi...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.

Richard Watson

Jun 17, 2012, 4:46:41 PM
to google-a...@googlegroups.com
On Sunday, June 17, 2012 9:36:31 PM UTC+2, Jeff Schnitzer wrote:
 
> My biggest concern is that even if I cut my startup time in half
> (which I suspect is optimistic), I'm still going to be at the mercy of
> a blip in GAE. The startup time varies by a factor of 3 even when
> GAE is in "normal" state.

There's a chance creating a jar will reduce the variability as well, in that one file load offers reduced opportunity for issues compared to a couple hundred file accesses, each of which could cause a delay. However, it's hard to believe it's a silver bullet or surely it'd be built into the deployment process by default?

Cesium

Jun 17, 2012, 6:21:02 PM
to google-a...@googlegroups.com
Dang Jeff,

This sounds brutal.

The only aspect I could test and collect data on is Objectify. I'm not really sure what most of those other technologies are.

If there is something I could set up in parallel for testing (something I could actually understand), just holler.

David

PS, maybe something as simple as loading a well-known, big-ass jar?

Thomas Wiradikusuma

Jun 17, 2012, 8:57:03 PM
to google-a...@googlegroups.com
Just my 2 cents,

If indeed our app needs to be single-JARred and obfuscated (at least removing unused code), IMO that feature should be baked into the tool, probably triggered with an extra flag.

Takashi Matsuo

Jun 17, 2012, 11:44:29 PM
to google-a...@googlegroups.com
Hi Jeff,

> * Objectify requires every class to be registered up front. This is
> taking 5-8 seconds to register 36 entity classes (plus a fair bit of
> embedded structure). I am intimately familiar with what Ofy does
> during registration - this isn't a computational cost.
>
> * I'm using Guice (development mode), but this isn't really the core
> of the problem. The issue is that I'm using JAX-RS (Resteasy) to
> support my REST API. In order for the @Path annotations to be
> recognized, every resource class must be registered - and thus loaded.
> There's ~80 of these classes. The Guice initialization part (which
> includes Objectify registration) takes ~30s in production, ~15s in
> sandbox.
>
> I don't see an obvious way to optimize this short of going back to
> Java programming circa 2002. As long as my sitemap is defined by
> @Path annotations and not a bunch of text in an xml file, those
> classes are going to have to be loaded before my app starts. Entity
> classes must be registered and introspected before any data is loaded
> from the datastore.

I'm sorry if I'm missing something, but I don't think this kind of
introspection is fundamentally necessary, because the class definitions
don't change within a single version, so you could introduce some
caching mechanism.

Also, I think there are some workarounds, like setting an appropriate
number of Min Idle Instances, or using backend instances. Didn't they
help you here at all?

> I'm not even asking for classpath scanning... I just want normal
> classes to load in reasonable time. This problem is going to get
> significantly worse over the next year. We add classes every day; our
> app gets more features, not fewer. I guess my next experiment will
> need to be JARing my WEB-INF/classes.

'Making class loading faster' would definitely be a constructive
feature request. It's worth filing an issue and seeing how much
interest it attracts.

Hi Thomas,
I think this is also good feedback, especially if creating a single
JAR improves performance. I'd appreciate it if you could file
an issue.

-- Takashi

--
Takashi Matsuo | Developer Advocate | tma...@google.com

Jeff Schnitzer

Jun 18, 2012, 3:58:29 AM
to google-a...@googlegroups.com
On Sun, Jun 17, 2012 at 8:44 PM, Takashi Matsuo <tma...@google.com> wrote:
>
> I'm sorry if I'm missing something, but I don't think this kind of
> introspection is fundamentally necessary, because the class definitions
> don't change within a single version, so you could introduce some
> caching mechanism.
>
> Also, I think there are some workarounds, like setting an appropriate
> number of Min Idle Instances, or using backend instances. Didn't they
> help you here at all?

Hi Takashi. The problem is that I can't get a single instance
started. Min idle instances won't help me if my app can't start
before the 60s cutoff deadline. Caching doesn't help because I can't
get an instance off the ground in the first place.

If my understanding is correct - that network classloading is the
major lag - then here is the rough summary of my problem:

* My "sitemap" (ie the mapping of URIs to code) is determined by
@Path annotations on 80+ classes. This is the JAX-RS way. The
alternative is the history of defining all URIs in xml files like
web.xml or struts.xml - an approach that was wholly abandoned by the
Java community at least 5 years ago.

* In order to serve even a single request, all 80+ of those classes
need to be loaded. The actual # is quite a bit larger because those
classes depend on other classes, and presumably there's some sort of
transitive classloading process - but this exceeds my knowledge of
java classloading.

* To perform the very first datastore operation, there's quite a lot
of classloading required to get the persistence system off the ground.
This isn't really optional. The first datastore request could be
"load key Vehicle(123)" and the persistence mechanism needs to be able
to understand that this is a polymorphic Airplane. So any kind of ORM
system (like Objectify) needs to classload and introspect every
possible class *before* any requests hit the datastore.

* Consequently, there is a minimum number of classes that my app
must load at startup time. There is no way to lazy-load them because
they are all necessary to 1) establish the JAX-RS sitemap and 2)
establish the persistence context.
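
The persistence point above can be illustrated with a toy registry. This is not Objectify's actual implementation; `EntityRegistry`, `register`, `resolve`, and the discriminator strings are hypothetical names used only to show why a stored type marker cannot be mapped to a subclass that was never registered (and therefore never loaded):

```java
import java.util.HashMap;
import java.util.Map;

public class EntityRegistry {
    // Hypothetical entity hierarchy, mirroring the Vehicle/Airplane example.
    static class Vehicle {}
    static class Airplane extends Vehicle {}

    // Stored discriminator value -> concrete class to instantiate.
    private final Map<String, Class<? extends Vehicle>> byDiscriminator = new HashMap<>();

    /** Must be called for every concrete subclass before the first datastore read. */
    public void register(String discriminator, Class<? extends Vehicle> type) {
        byDiscriminator.put(discriminator, type);
    }

    /** Maps the discriminator found on a stored entity to the class that should be built. */
    public Class<? extends Vehicle> resolve(String discriminator) {
        Class<? extends Vehicle> type = byDiscriminator.get(discriminator);
        if (type == null) {
            throw new IllegalStateException("No class registered for discriminator: " + discriminator);
        }
        return type;
    }
}
```

Passing `Airplane.class` to `register` is itself what forces the class to be loaded, which is the up-front cost being described.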

> 'Making class loading faster' would definitely be a constructive
> feature request. It's worth filing an issue and seeing how much
> interest it attracts.

Here: http://code.google.com/p/googleappengine/issues/detail?id=7706

Note that we - the user community - really have no idea if
classloading is the issue. Is it? We're guessing based on observed
behavior; we just know that it takes an oddly long time for our app
instances to start. It would be helpful if someone from Google
described the underlying architecture so that the community could both
provide constructive feedback and figure out workarounds.

Also, while this happens to be hitting me directly, I urge you guys to
take this as seriously as possible - I'm about as GAE-savvy as anyone
gets without a @google.com email address, and I can't think of a
workaround for this problem. I'm generally very sympathetic to making
my app work "the GAE way" instead of using traditional JavaEE design
patterns, but "remove features" is not really an option. My codebase
is going to grow significantly before the end of the year.

> On Mon, Jun 18, 2012 at 9:57 AM, Thomas Wiradikusuma
> <wiradi...@gmail.com> wrote:
>> Just my 2 cents,
>>
>> If indeed our app needs to be single-JARred and obfuscated (at least removing unused code), IMO that feature should be baked into the tool, probably triggered with an extra flag.
>
> I think this is also good feedback, especially if creating a single
> JAR improves performance. I'd appreciate it if you could file
> an issue.

Someone in Google - possibly you - knows if this will be the case
without us having to guess. Can you describe how the GAE classloader
works? Does it make a separate network request per classfile? Please
don't make us guess at what will improve startup performance, give us
some guidance.

Thanks,
Jeff

Michael Hermus

Jun 18, 2012, 10:26:11 AM
to google-a...@googlegroups.com
Jeff,

If by "going back to Java programming circa 2002", you mean not using annotation processing that requires full classpath scanning, I think that is in fact the only solution right now. Based on my limited research, I think you really need to stay away from that in order to avoid cold start time problems with GAE Java. In fact, the 'Best Practices' section of the Objectify wiki is one of the first places I saw this.

Although you say you have no classpath scanning, the JAX-RS @Path processing is exactly that. In addition, I am almost sure that Guice is scanning the entire classpath for @Inject annotations as well. I wholeheartedly agree it stinks that you have to avoid leveraging awesome frameworks like those (and hence having better, more maintainable code) in order to have a properly functioning GAE Java application. I would gladly star any feature request you make in this direction. However, for now I am staying away.

As a reference, I use only Objectify and a few other standard java libraries, but no heavyweight frameworks, and my _ah_warmup requests have been recently averaging about 3.5 seconds.

-Mike

Jeff Schnitzer

Jun 18, 2012, 1:34:17 PM
to google-a...@googlegroups.com
This is incorrect. Guice does not perform classpath scanning, and
while classpath scanning is nice for making JAX-RS @Path annotations
work, it is optional and I have disabled it.

The way it works is that you explicitly register the classes that have
annotations. So each of them individually must be classloaded at
startup. Despite the lack of classpath scanning, this process is
still taking an excessive amount of time. Using Objectify is similar;
all entity classes must be explicitly registered and introspected
before datastore operations begin.

So... this problem goes way beyond classpath scanning. The problem is
getting classes loaded up front, and this problem doesn't expose
itself until your application reaches a significant level of
complexity. By this standard, Objectify is a "heavyweight" framework
- the only way around loading entity classes up-front is to use the
low-level API.

I am impressed by your 3.5 second startup time. Does that include a
datastore hit (ie Objectify registration)? How big is your project?
Would you complete this straw poll, filling in your answers? Everyone
else reading with Java instances, would you do the same?

My project:

# of classes in WEB-INF/classes: 619
(cd war/WEB-INF/classes; ls -R | grep class | wc -l)

Size of WEB-INF/classes: 3.3M
(cd war/WEB-INF/classes; du -sh .)

# of jars in WEB-INF/lib: 54

Size of WEB-INF/lib: 42M
(25M of this is GAE SDK)

# of classes registered with Objectify: 36 (plus maybe half that again
in @Embed and @Serialize classes)

# of classes registered with other means (any explicit classloading,
ie JAX-RS): 100+

Fastest observed startup time: 20s
Typical startup time: 50s
Slowest startup time: deadlined 60s+

I readily acknowledge that I have a fairly large number of jar
dependencies. However, I'm not scanning them. They're also (almost)
all essential for certain features; I do a lot of integration with
third-party APIs. At best I can get rid of one or two by rewriting a
few sections of code.

Also... this project isn't really that big as enterprise projects go.
I've worked with much, much larger codebases in the past. I shudder
to think what that would do to appengine :-(

Jeff

Tom Phillips

Jun 18, 2012, 3:46:05 PM
to Google App Engine
My Project:

# of classes in WEB-INF/classes: 1074 (jarred into a single jar for
faster deployment)
Size of WEB-INF/classes: 9.0M (jars down to 3.3M)
# of jars in WEB-INF/lib: 87
Size of WEB-INF/lib: 90M

Frameworks initialized:

JDO PMF (3-6secs)
JAXB single context (4-10 secs)
Guice (2-4 secs) - just dipping my toes back in the water with guice
on gae, only one experimental @Inject. Abandoned Guice previously due
to initialization time concerns.

JDO classes ~ 14 entity types
JAXB beans ~ 146
Guice - single @Inject

Fastest observed startup time: 11s
Typical startup time: 20s
Slowest startup time: 25s+

A few weeks ago I must have been moved to a new data center. Prior to
that I was seeing between 2-3 times worse performance, for months.
Both for instance initialization (which was 20-50+ secs), and general
request latencies.

This application is relatively small compared to some of the J2EE
applications I've worked on in the past.

/Tom

Richard Watson

Jun 18, 2012, 3:54:27 PM
to google-a...@googlegroups.com

# of classes in WEB-INF/classes:  304
Size of WEB-INF/classes:  737KB (1.5MB on disk)
# of jars in WEB-INF/lib:  56
Size of WEB-INF/lib: 50.8M
(25M of this is GAE SDK)

# of classes registered with Objectify: 12
# of classes registered with other means:  5?  (Guice)

Fastest observed startup time:  10s
Typical startup time: 15s
Slowest startup time: 25s 

 I have another small app using Objectify for a few classes, no magic, startup about 3.5 seconds.

Stefano Ciccarelli

Jun 18, 2012, 4:20:31 PM
to google-a...@googlegroups.com
My project is GAE + GWT.
 
> # of classes in WEB-INF/classes: 619
> (cd war/WEB-INF/classes; ls -R | grep class | wc -l)
5421 (packed in a single .jar)

> Size of WEB-INF/classes: 3.3M
> (cd war/WEB-INF/classes; du -sh .)
58M

> # of jars in WEB-INF/lib: 54
36

> Size of WEB-INF/lib: 42M
> (25M of this is GAE SDK)
76M

> # of classes registered with Objectify: 36 (plus maybe half that again
> in @Embed and @Serialize classes)
325 (more or less)

> # of classes registered with other means (any explicit classloading,
> ie JAX-RS): 100+
26

> Fastest observed startup time: 20s
> Typical startup time: 50s
> Slowest startup time: deadlined 60s+

In the last 3-4 weeks: 12s, 14s, 20s
Until about a month ago: 35s, 50s, deadlined 60s+

I do not know why and I do not know how, but in recent weeks the startup times have improved significantly, without changing a single line of code.

However, like you, I too have had serious problems starting the instances in the past months.

Then, magically, it all worked out.

Joakim

Jun 18, 2012, 4:40:21 PM
to google-a...@googlegroups.com
I've been pondering for some time now why none of the frameworks seem to have realized that the configuration will never change after the build is complete. They should all ship something that generates an XML config from the class annotations (an Ant plugin, an annotation processor for javac, anything). I can't imagine the amount of resources wasted globally because of the lack of this (though that likely says more about my imagination than anything else).
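
As a rough illustration of that idea, here is a sketch of what the runtime half might look like if a build step had already written out a path-to-class-name table. Everything in it is an assumption for illustration: `LazyRouteTable` and `handlerFor` are hypothetical names, it is not how JAX-RS or Resteasy work, and the build-time generator itself is not shown. The point is only that with class names stored as strings, a handler class is loaded on the first request for its path instead of at startup.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LazyRouteTable {
    // Path -> fully qualified class name, assumed to be generated at build time.
    private final Map<String, String> classNameByPath;
    // Classes that have actually been loaded so far.
    private final Map<String, Class<?>> loaded = new ConcurrentHashMap<>();

    public LazyRouteTable(Map<String, String> classNameByPath) {
        this.classNameByPath = classNameByPath;
    }

    /** Returns the handler class for a path, loading it on first use; null if the path is unknown. */
    public Class<?> handlerFor(String path) {
        Class<?> cached = loaded.get(path);
        if (cached != null) {
            return cached;
        }
        String className = classNameByPath.get(path);
        if (className == null) {
            return null;
        }
        try {
            Class<?> cls = Class.forName(className);
            loaded.put(path, cls);
            return cls;
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Route table names a missing class: " + className, e);
        }
    }

    /** Number of handler classes loaded so far. */
    public int loadedCount() {
        return loaded.size();
    }
}
```

This only defers the cost: the first request to each path still pays for loading that handler and everything it references.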

My project:

# of classes in WEB-INF/classes:
Zero (I jar)

Size of WEB-INF/classes:
0M


# of jars in WEB-INF/lib:
44

Size of WEB-INF/lib:
34.5M


# of classes registered with Objectify:
Zero (I still haven't moved from JDO)


# of classes registered with other means (any explicit classloading, ie JAX-RS):
100+

Fastest observed startup time:  35s
Typical startup time: 45s
Slowest startup time: deadlined 60s+

Michael Hermus

Jun 18, 2012, 4:56:20 PM
to google-a...@googlegroups.com
Interesting; thanks for the clarification. Even though there is no classpath scanning, it does seem like you are loading AND introspecting the majority of your classes upon initialization. Regardless, clearly it is confounding that it takes so long.

I don't have access to all the information you requested from my current location, but I will try to get it later. I can safely say that my project is indeed much smaller than yours. I do know that I have 16 Entity classes registered with Objectify (not including any @Serialize), and 24 jars in WEB-INF/lib. As I said, no other frameworks are used.

In _ah_warmup, I load the DAO classes, which register Entity objects via static initializer. To be fair, I am using a Servlet for _ah_warmup, and so probably defer the cost of JSP initialization (which I believe you mentioned somewhere that you do not use, anyway). From a brief look at the logs, it seems as though the first JSP request after a warmup takes about 1 second longer than normal. This would indicate a total initialization time of around 4-5 seconds. I should probably point _ah_warmup at a JSP page to get a more accurate cold start average.

Per

Jun 18, 2012, 8:43:44 PM
to google-a...@googlegroups.com
Hi Jeff,

just as a comparison, we have 33 classes that get initialised with Objectify, and it takes merely 2 seconds. We had all the problems you mentioned prior to jarring things up, and since then performance has improved vastly. I thought Google had fixed the problem (they made some comment about this 9 months ago or so, when we posted our summary at http://www.small-improvements.com/app-engine-performance-tuning) but maybe they haven't.

We're also using Apache Wicket, and while initialising the mounted paths, we're actually referring to hundreds of additional Java files, which again refer to some 2 to 5 inner classes each. This was *killing* us before some nice fellow suggested jarring the classes, and now even that merely takes 4 seconds during startups.

Maybe it also helps a little that we're on F4 these days, but all the performance tuning we did back then was on F1. I'm guessing you're on F4 anyway, if you're this desperate.

BTW, I had also considered stripping unused classes from the jars, but it was really the class *loading*, not the parsing of Jar files, that was causing the slowdown. I'm guessing it had to do with file access ultimately, and that each file access on the VM needs to be verified by the secure Classloader, and that this is simply tons more efficient if you're just looking at the same jar file all the time.

So, add that target to your ant file, and let us know how you go! :)

<target name="createjar" depends="copycerts" description="Creates a jar from the classes folder">
    <jar jarfile="${libs}/small-improvements.jar" basedir="${classes}"/>
    <delete dir="${classes}"/>
</target>
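
For builds that don't use Ant, roughly the same packaging step can be sketched with the JDK's `java.util.jar` classes. The class name `ClassesJarPacker` and the paths passed to it are hypothetical, and unlike the target above it does not delete the classes directory afterwards:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ClassesJarPacker {
    /** Packs every regular file under classesDir into jarFile and returns the number of entries written. */
    public static int pack(Path classesDir, Path jarFile) throws IOException {
        // Collect the file list first so the walk is closed before the jar is written.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(classesDir)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(jarFile);
             JarOutputStream jar = new JarOutputStream(out)) {
            for (Path file : files) {
                // Jar entry names always use '/' regardless of the platform separator.
                String name = classesDir.relativize(file).toString().replace('\\', '/');
                jar.putNextEntry(new JarEntry(name));
                Files.copy(file, jar);
                jar.closeEntry();
            }
        }
        return files.size();
    }
}
```

The resulting jar would still need to be placed in WEB-INF/lib, and the loose classes removed from WEB-INF/classes, for the app to load from the jar alone.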



Cheers,
Per

Kyaw Tun

Jun 18, 2012, 10:33:16 PM
to Google App Engine
Hi Jeff,

How about optimizing with ProGuard (http://proguard.sourceforge.net/index.htm)?
It performs dead code elimination and can repackage everything into a
single jar file.

KT

Jason Collins

Jun 19, 2012, 12:23:57 PM
to Google App Engine
Joakim, you took the words right out of my mouth.

It seems, in distributed computing with a solid versioning mechanism,
that everything that can be moved to a build-time operation is a
(potentially) huge win for global cost/performance.

j

Jeff Schnitzer

Jun 19, 2012, 2:05:29 PM
to google-a...@googlegroups.com
As someone who develops one of these frameworks, I can explain exactly
why they don't generate XML metamodels at build time:

1) It's an enormous amount of work to develop both the tools and the
metamodel. Java has a built-in metamodel (annotations) which is easy
to work with and well understood by most developers.

2) It imposes significantly upon the end developer. Instead of just
including a jar with your deployment, now you need to hook into the
build system (ant, maven, or eclipse?). 90% of developers don't know
how to do this and even fewer actually want to bother.

3) This is only an issue on GAE, not on other platforms. GAE is a
tiny part of the Java developer community so there are very few tools
available to support hooking into the build system, converting
annotations to a metamodel, and exposing this metamodel at runtime.
We're lucky to have the appengine-specific OS projects that we do.

4) This classloader limitation is an undocumented (and hopefully
temporary) issue in GAE. Hell, we don't even know that this is the
problem - there's no official statement on the matter, just a lot of
trial-and-error-informed speculation. I've been developing for years
on GAE and didn't realize the scope of this problem; I always thought
"avoid classpath scanning and I'll be ok". Well it turns out that
it's more complicated than that. And it only hits apps that reach a
certain critical level of complexity... which I finally have.

The upshot is that I wouldn't expect a lot of frameworks to adopt this
model anytime soon. It's not that there aren't any - take a look at
Slim3, it builds a metamodel at compile time. However, run some tests
- AFAIK, they built the metamodel for the purpose of avoiding
reflection overhead at runtime (a non-issue), not to avoid
classloading, so it's possible that loading all those pre-generated
stubs could make the problem even worse. This harkens back to one of
my earlier points - it's hard to develop workarounds for a problem
that we only barely understand. I could spend weeks developing a
"solution" to this problem that might not even provide significant
benefit.

Jeff

Renzo Nuccitelli

Jun 19, 2012, 8:12:12 PM
to google-a...@googlegroups.com
I started with Java on GAE about 3 years ago. At that time I was trying to use Spring, and startup was very expensive. I ran some experiments, and in one of them I removed all frameworks and the startup time decreased by 90%.

Since Java is not very productive without frameworks, I gave Python a try. Since it doesn't need to load all the code on startup, I wrote a small Python framework (ZenWArch) that imports the relevant request handler code only when the application needs it. Now startup takes 1 to 3s on my applications, even as the source grows.

I don't like Language A versus Language B flame wars. It just seems that Python fits better on GAE. It's just a matter of using the right tool for the problem.

I hope Java gets better on GAE.

Jeff Schnitzer

Jun 19, 2012, 9:34:14 PM
to google-a...@googlegroups.com
On Tue, Jun 19, 2012 at 5:12 PM, Renzo Nuccitelli <ren...@gmail.com> wrote:
>
> I don't like flames Language A versus Language B. It just seems that
> Python fits better on GAE. It's just a matter of using the right tool for
> the problem.

As someone who builds both Python and Java apps on GAE, let me assure
you that there are plenty of issues in Pythonland as well. There is
no perfect language or perfect platform, just different problems to
solve or work around. The ones you are unfamiliar with always seem
hardest.

I like Python, but I think it is naive to claim that Python fits
better on GAE. There are many, many considerations beyond instance
startup time.

Jeff

Takashi Matsuo

Jun 20, 2012, 12:15:30 AM
to google-a...@googlegroups.com
Thanks everyone for the constructive discussion!

First of all, before it becomes a sort of "frame" (please understand
I'm not saying you're inciting people), please keep in mind that we
want every runtime to be a first-class citizen of App Engine.

Secondly, we have introduced significant improvements to the Java
Runtime in the past few years, so it should be much, much faster than it
was 3 years ago.

Jeff,

FYI, I've just started an internal discussion about this issue, and
we're taking this issue very seriously.
I also promise that I'll continue pushing the core engineering team
hard with this issue.

Thanks again,

-- Takashi

Takashi Matsuo

Jun 20, 2012, 12:34:32 AM
to google-a...@googlegroups.com
Oops. Obviously I meant "flame".

Richard Watson

Jun 20, 2012, 3:40:51 AM
to google-a...@googlegroups.com
I've been using the Eclipse Plugin up until now, which is great but
can be a touch frustrating. I had to make some big changes in my app
anyway, so I've just (finally) set up an ant script for
build-gwt-jar-deploy. So far, just under 10 seconds on first startup,
about 6-8 seconds on following startups (after shutting the instance
down).

I do have some GWT loading issues but I think that's related to
first-run caching and can be fixed by code splitting to reduce the
initial download.

Renzo Nuccitelli

Jun 20, 2012, 5:43:51 PM
to google-a...@googlegroups.com
Jeff,

I am very familiar with the cold start problem on Java and have searched a lot for a solution, including building my own. The cold start has been a problem with no reasonable solution for more than 2 years. This issue was decisive for me, since a lot of users gave up on some of my apps after waiting more than 20s for a page to load. So I switched to Python and solved my clients' problem. Even if the decision was naive, my clients were/are happy and that's what matters to me.

I completely agree that there is no perfect language or platform, but I don't see how this common-sense information helps with the cold start issue. What is your solution for it?

I should have written "...Python fits GAE better, IN MY OPINION" to try to avoid the flame ;).

Takashi,

I am very glad to hear about the plan to have no second-class citizens on GAE. To be honest, switching to Python at that time was great and gave me the impression that it was the preferred language, since the platform started out with it. More than that, I've seen many more questions about Java problems on this group than about Python. This fact doesn't prove my opinion, since the number of Java users may be greater than the number of Go or Python users. But it is very odd, at least :)

As I have said, I got reasonable startup times after removing heavy frameworks like Spring. I guess the need to load all classes increases this time a lot, but I could not find a way to do a "lazy load start" on Java like I do on Python (ZenWArch). Maybe this information can help in your team's discussion.
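
A rough Java analogue of the "lazy load start" idea could look like the sketch below. It is only an illustration under assumptions: Handler, LazyHandlerRegistry and HelloHandler are hypothetical names, not part of ZenWArch, the servlet API or the GAE SDK, and the code uses Java 8 syntax. Note that the JVM already loads classes lazily on first use, so this only helps when startup code would otherwise reference or instantiate every handler eagerly.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical handler interface; not part of any GAE or servlet API. */
interface Handler {
    String handle(String path);
}

/**
 * Maps URL paths to handler class names and instantiates a handler only
 * when its path is first requested, so unused handlers are never loaded.
 */
final class LazyHandlerRegistry {
    private final Map<String, String> classNames = new ConcurrentHashMap<>();
    private final Map<String, Handler> loaded = new ConcurrentHashMap<>();

    /** Registering stores only the class name; no class is loaded here. */
    void register(String path, String handlerClassName) {
        classNames.put(path, handlerClassName);
    }

    /** Loads and instantiates the handler on first lookup, then caches it. */
    Handler lookup(String path) {
        return loaded.computeIfAbsent(path, p -> {
            String name = classNames.get(p);
            if (name == null) {
                throw new IllegalArgumentException("No handler for " + p);
            }
            try {
                return (Handler) Class.forName(name)
                        .getDeclaredConstructor()
                        .newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot load " + name, e);
            }
        });
    }

    /** Number of handlers that have actually been instantiated so far. */
    int loadedCount() {
        return loaded.size();
    }
}

/** Example handler used only for illustration. */
final class HelloHandler implements Handler {
    @Override
    public String handle(String path) {
        return "hello from " + path;
    }
}
```

A single front servlet could call lookup(request path) and delegate to the result, so a cold instance only pays for the handler behind the first URL it serves.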

 Renzo.

Thomas Wiradikusuma

Jun 21, 2012, 11:18:51 PM
to google-a...@googlegroups.com
I have updated http://code.google.com/p/googleappengine/issues/detail?id=7706 with this information.

Richard Watson

Jun 23, 2012, 1:22:34 AM
to google-a...@googlegroups.com
Hi Will,

I also tried bundling (most [1]) jars into one but it didn't seem to move the needle at all, once classes were jarred.  I did perceive a lower initial-RAM level - I think it was about 4 or 5 megs lower but I didn't test that too carefully.  Is your load-time difference compared to unjarred classes, or to multiple jars only?

I did think to try bundling only the jars my app would need on startup, to reduce the overall initial load.  Does Java have to inspect the contents of all jar files to figure out where required classes are?

Richard

[1] GAE deployment complained my 1-jar solution was too big, so I wrote an ant task to jar up only the jars below a certain size and leave the few bigger ones alone. Went from 50+ to about 8. But again, no perceived load-time improvement once the classes were jarred.

On Friday, June 22, 2012 8:35:11 AM UTC+2, Will Rayner wrote:
Hi all,

I've also been battling with Java warmup times. Last week I had a startup time of at least 37 seconds. Now it's hovering around 16.

My performance improvements were made by bundling all my dependencies together into a single jar. I've been using the excellent gradle gae plugin (https://github.com/bmuschko/gradle-gae-plugin), which integrates with https://github.com/musketyr/gradle-fatjar-plugin/. This could easily be integrated with an existing gradle project in under an hour.

We're using Resteasy, Htmleasy, soy templates, hibernate orm and validator. There were about 60 jars in my WEB-INF/lib.

Regards,
Will Rayner



Per

Jun 23, 2012, 6:42:03 AM
to google-a...@googlegroups.com
Hi Richard,


IMO it's not important to put all JAR files into one. It's important to reduce your absolute number of files. Each initial file access costs a little time, but it doesn't really matter if you access 30 jar files or 1. What makes a difference is whether App Engine needs to load 2000 class files one by one, or if they are all inside a handful of jars.

I'd also consider moving all other files, like property-files and HTML files (in case your app parses them) into those jars. Made a huge difference for me.
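
To see how many loose files a deployment actually contains, something like the following could be run against the war directory before and after repackaging. This is a minimal sketch, not a GAE tool: LooseFileCounter is a made-up name, and the default path war/WEB-INF/classes is just the layout mentioned elsewhere in this thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

final class LooseFileCounter {

    /** Counts regular files under root whose names end with the given suffix. */
    static long count(Path root, String suffix) throws IOException {
        // Files.walk holds open directory handles, so close the stream when done.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(suffix))
                        .count();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location of unjarred classes in an exploded war.
        Path root = Paths.get(args.length > 0 ? args[0] : "war/WEB-INF/classes");
        System.out.println(count(root, ".class") + " loose .class files under " + root);
        System.out.println(count(root, ".properties") + " loose .properties files under " + root);
    }
}
```

If the counts are in the thousands, jarring those files is probably worth trying first; if they are already near zero, merging the remaining jars is unlikely to change much.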

Good luck!
Per

Rafael Dipold

Jun 23, 2012, 11:26:47 PM
to google-a...@googlegroups.com
Hi Jeff,

I'm not as experienced with GAE as you are (less than 1 year), but I'd like to make my contribution.

I have found in recent tests that Guice is a major culprit in startup time. Guice runs a bytecode generator (com.google.inject.internal.BytecodeGen newFastClass) for classes on every startup, which increases the start time by up to 13 seconds (Guice alone!!).

So I changed my project to use PicoContainer instead of Guice, and my startup time decreased by 11-12 seconds!

Today I use the Vraptor framework (http://vraptor.caelum.com.br/en) + PicoContainer + Objectify 4 with classpath scanning, and my startup averages 5 seconds.

Here are the JARs in my lib directory (34 MB):

appengine-api-1.0-sdk-1.6.6.jar
appengine-api-labs-1.6.6.jar
appengine-jsr107cache-1.6.6.jar
commons-fileupload-1.2.1.jar
commons-io-1.3.2.jar
gmultipart.jar
guava-r07.jar
hamcrest-all-1.2RC3.jar
iogi-0.9.1.jar
javassist-3.14.0.GA.jar
json-20090211.jar
jsr107cache-1.1.jar
log4j-1.2.16.jar
mirror-1.5.1.jar
objectify-4.0a3.jar
paranamer-2.2.jar
picocontainer-2.13.6.jar
scannotation-1.0.3.jar
slf4j-api-1.6.1.jar
slf4j-log4j12-1.6.1.jar
vraptor-3.4.1.jar
vraptor-gae.jar
xstream-xstream-1.3.2-SNAPSHOT-GAE.jar  

Jeff Schnitzer

Jun 24, 2012, 3:53:06 PM
to google-a...@googlegroups.com
Interesting. From John Patterson's comments, it sounds like I can
remove bytecode generation by disabling the AOP stuff in Guice.
Unfortunate because I rely on interceptors pretty heavily, but I can
probably find an alternative. Thanks for the suggestion; I will do
some experimentation.

We're now seeing startup times in 20-40s range in production, which
could be related to a big code refactoring we just pushed (not focused
on startup time) or could be coincidence. At any rate we're not in
panic mode anymore.

Jeff

Jeff Schnitzer

Jun 24, 2012, 8:00:47 PM
to google-a...@googlegroups.com
Experiment #1: JARing my classes.

Times are measured by shutting down instance, hitting a URL, looking
at request time in logs. Repeat until bored.

First, the "control". My app (sandbox appid), normally deployed
(classes in WEB-INF/classes):

21118ms
23849ms
35995ms
20556ms
21620ms
23718ms
34446ms
42948ms
22487ms
32722ms
34511ms
31883ms

Redeployed, same code but with classes JARed instead of in WEB-INF/classes:

19240ms
19386ms
19912ms
27517ms
20400ms
20483ms
20186ms
19517ms
20352ms
19528ms
20856ms

What's interesting is that this change didn't improve the best-case
load times by much, but it almost eliminated the crazy variance. This
is a huge win.
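
For anyone who wants to quantify "variance" rather than eyeball it, here is a small sketch that summarizes the two sample sets above. The class and method names (StartupStats, median, spread) are made up for illustration; the numbers are copied from the two lists in this message.

```java
import java.util.Arrays;

final class StartupStats {
    // Classes in WEB-INF/classes (the "control" runs above), in ms.
    static final long[] LOOSE = {
        21118, 23849, 35995, 20556, 21620, 23718,
        34446, 42948, 22487, 32722, 34511, 31883
    };
    // Same code with classes JARed, in ms.
    static final long[] JARED = {
        19240, 19386, 19912, 27517, 20400, 20483,
        20186, 19517, 20352, 19528, 20856
    };

    static long min(long[] ms) {
        return Arrays.stream(ms).min().getAsLong();
    }

    static long max(long[] ms) {
        return Arrays.stream(ms).max().getAsLong();
    }

    /** Difference between the slowest and fastest run. */
    static long spread(long[] ms) {
        return max(ms) - min(ms);
    }

    /** Middle value of the sorted samples (mean of the two middle values if even). */
    static double median(long[] ms) {
        long[] sorted = ms.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        System.out.printf("loose: min=%d median=%.1f max=%d spread=%d%n",
                min(LOOSE), median(LOOSE), max(LOOSE), spread(LOOSE));
        System.out.printf("jared: min=%d median=%.1f max=%d spread=%d%n",
                min(JARED), median(JARED), max(JARED), spread(JARED));
    }
}
```

On these samples the median drops from 27866 ms to 20186 ms and the max-min spread from 22392 ms to 8277 ms, while the best case only moves from 20556 ms to 19240 ms.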

Conclusion: Use this ant script for deployment:

<target name="deploy">
    <delete dir="${staging.dir}" />
    <mkdir dir="${staging.dir}" />

    <copy todir="${staging.dir}">
        <fileset dir="war">
            <exclude name="WEB-INF/classes/**" />
            <exclude name="WEB-INF/appengine-generated/**" />
        </fileset>
    </copy>
    <jar destfile="${staging.dir}/WEB-INF/lib/classes.jar"
         basedir="${classes.dir}" />

    <appcfg action="update" war="${staging.dir}" />
</target>

Jeff Schnitzer

Jun 24, 2012, 8:02:38 PM
to google-a...@googlegroups.com
Oh, an added bonus: Deployment is faster since it clones half the
number of files.

Jeff

Jeff Schnitzer

Jun 24, 2012, 8:58:42 PM
to google-a...@googlegroups.com
Experiment #2: Bigger better faster frontends

Using JARed classes, with F2 frontends:

19409ms
18516ms
17125ms
18056ms
17152ms
18708ms
28104ms
16821ms
18074ms
16859ms
18311ms

Small but noticeable improvement, maybe 10%?

Same deployment, with F4 frontend:

12063ms
9070ms
10037ms
8617ms
10024ms
10656ms
8871ms
9330ms
9019ms
9253ms

Hot damn!

I'm not entirely sure how to interpret these results. An F4 is about
twice as fast as an F1. This suggests the problem is significantly
computational. Except that an F2 is more or less the same speed as an
F1, which suggests the problem is almost entirely I/O.

Maybe F2 instances aren't actually twice the CPU power of an F1?
Maybe F4 instances get some special I/O priority? Anyone want to
speculate?
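
To put rough numbers on the comparison, the sketch below computes median-based speedups from the samples in this thread. It assumes the JARed runs from Experiment #1 were on F1 instances, which is not stated explicitly; InstanceClassComparison and its methods are hypothetical names.

```java
import java.util.Arrays;

final class InstanceClassComparison {
    // JARed-classes samples from Experiment #1; ASSUMED to be F1 runs.
    static final long[] F1 = {
        19240, 19386, 19912, 27517, 20400, 20483,
        20186, 19517, 20352, 19528, 20856
    };
    // F2 samples from this message, in ms.
    static final long[] F2 = {
        19409, 18516, 17125, 18056, 17152, 18708,
        28104, 16821, 18074, 16859, 18311
    };
    // F4 samples from this message, in ms.
    static final long[] F4 = {
        12063, 9070, 10037, 8617, 10024,
        10656, 8871, 9330, 9019, 9253
    };

    /** Median is used instead of the mean so one slow outlier does not dominate. */
    static double median(long[] ms) {
        long[] sorted = ms.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** How many times faster the candidate is than the baseline, by median. */
    static double speedup(long[] baseline, long[] candidate) {
        return median(baseline) / median(candidate);
    }

    public static void main(String[] args) {
        System.out.printf("F2 vs F1: %.2fx%n", speedup(F1, F2));
        System.out.printf("F4 vs F1: %.2fx%n", speedup(F1, F4));
    }
}
```

By this measure the F2 runs come out roughly 1.1x faster and the F4 runs roughly 2.2x faster than the assumed F1 baseline, which matches the "maybe 10%" and "about twice" impressions above.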

Jeff


Brandon Wirtz

Jun 24, 2012, 10:20:58 PM
to google-a...@googlegroups.com
F4 Vs F2 Vs F1

I'll bet money, your numbers are right and your conclusion is wrong. F4's
have more memory, and more CPU, and more "IO" but the difference I'm 90%
certain is that you get a whole VM all to yourself so your neighbors aren't
stealing from you. :-) so you get a crap ton more throughput. Watch F4's
talk to memcache, It is a whole other beast.

Why didn't you test on an F8? You can do some really, really fun things on
an F8. (which isn't really 4.8 GHZ :-) ) You move to an F4 with F8 backends
and you will see Appengine through the Rose Colored glasses Brandon does,
where the world is always happy, and the numbers never change, and life is
all sunshine and rainbows and unicorns .

It's all those lousy neighbors who load huge frameworks, that Deadline
Exceed on startup, cook the CPU and the IO, and then crash only to start up
30 seconds later and do it again... No losers trying to stuff 129 megs
of stuff into memory along with 64 megs of startup, overflowing and going
through re-spin up.

Nah, co-habitation sucks. Pony up the rent and kick out the room mates, your
life will be happier and you will get laid more often too.

Thomas Wiradikusuma

Jun 25, 2012, 10:13:58 PM
to google-a...@googlegroups.com
I went further by JARring everything (classes + small JARs, except the Scala JAR and GAE Labs JAR), but I don't see a big difference :(
From ~18sec to ~15sec.

Cesium

Jun 26, 2012, 9:02:22 AM
to google-a...@googlegroups.com
I'm really liking Brandon's conclusions.
David

Hernan Liendo

Jun 26, 2012, 4:46:24 PM
to google-a...@googlegroups.com
+1

de Witte

Jun 27, 2012, 5:05:45 AM
to google-a...@googlegroups.com


On Monday, June 18, 2012 22:20:31 UTC+2, Stefano Ciccarelli wrote:
My project is GAE + GWT.
 
# of classes in WEB-INF/classes: 619
(cd war/WEB-INF/classes; ls -R | grep class | wc -l)
5421 (packed in a single .jar)


These are mainly GWT client class files. You can delete them; there is no need to send them to the server side. That will reduce your upload size to a few MB instead of 58.

Jeff Schnitzer

Jul 4, 2012, 3:07:22 PM
to google-a...@googlegroups.com
Reviving this thread again:

Our production appid now blows the 60s deadline and won't start new
instances. The exact same code on our sandbox appid starts in under
20s.

We have instance(s) running and serving traffic so we're not down.
But I can't get a new version warmed up, so I can't deploy new code.
I've filed a production issue.

Many attempts to start an instance fail. The "hot instance" theory
certainly makes sense but does that mean my app is pinned to a hot
instance? This seems like a bad idea.

It seems to me that startup requests really need to be allowed more
than 60s. The risk of a GAE slowdown producing user-facing downtime
is high... and based on the (highly unscientific) poll in this thread,
my normally-20s startup times are not atypical for a Java app.

Jeff

Richard Watson

Jul 4, 2012, 4:13:56 PM
to google-a...@googlegroups.com
Options I can think of:
1) Using a bigger instance. You said it had some effect on startup
time? Rather pay more while you figure it out. Also, it might kick
you off any hot instances, if you don't fit on there anymore.
2) Seeing if you can log which specific parts of the startup are
slowing it down. Print stuff out using static initialisers if you have
to.
3) Pay the $500pm for premier support, if there's some way to do that
temporarily. Not sure whether you can afford to run it indefinitely,
but it sounds like this is costing you more than that at this point.
This is the most distasteful option, like paying tax for knowing how
to make your app start.
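
For option 2, a very small timing helper may be enough to see where the seconds go. The sketch below is an assumption-laden illustration: StartupTimer is a made-up class, and the phase names in the usage note are placeholders for whatever your app actually does at startup.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class StartupTimer {
    // LinkedHashMap keeps phases in the order they were first timed.
    private final Map<String, Long> phases = new LinkedHashMap<>();

    /** Runs the work and records its wall-clock duration under the phase name. */
    void time(String phase, Runnable work) {
        long start = System.nanoTime();
        try {
            work.run();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // Repeated phase names accumulate instead of overwriting.
            phases.merge(phase, elapsedMs, Long::sum);
        }
    }

    /** Recorded durations in milliseconds, keyed by phase name. */
    Map<String, Long> phases() {
        return phases;
    }

    /** One "phase: N ms" line per phase, suitable for writing to the log. */
    String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : phases.entrySet()) {
            sb.append(e.getKey()).append(": ").append(e.getValue()).append(" ms\n");
        }
        return sb.toString();
    }
}
```

Wrapping each startup step, for example timer.time("injector", ...) and timer.time("entity registration", ...), and logging report() at the end of the first request should show whether the time goes to one subsystem or is spread evenly across class loading.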

What are the differences between your sandbox and prod? Just users,
or data as well?

Jeff Schnitzer

Jul 4, 2012, 4:42:34 PM
to google-a...@googlegroups.com
On Wed, Jul 4, 2012 at 1:13 PM, Richard Watson <richard...@gmail.com> wrote:
> Options I can think of:
> 1) Using a bigger instance. You said it had some effect on startup
> time? Rather pay more while you figure it out. Also, it might kick
> you off any hot instances, if you don't fit on there anymore.

Definitely an option we have considered. As a bootstrapped startup
with low revenue (so far), each $ counts... but if we keep having
problems this will be the solution (assuming it solves the problem).
At this point we have started instances on the new version.

> 2) Seeing if you can log which specific parts of the startup are
> slowing it down. Print stuff out using static initialisers if you have
> to.

If only I had a couple more days in the week :-)

> 3) Pay the $500pm for premier support, if there's some way to do that
> temporarily. Not sure whether you can afford to run it indefinitely,
> but it sounds like this is costing you more than that at this point.
> This is the most distasteful option, like paying tax for knowing how
> to make your app start.

Totally not in the budget, unfortunately.

> What are the differences between your sandbox and prod? Just users,
> or data as well?

The startup time difference is independent of the datastore. I see
this erratic startup problem with requests that only touch the
urlfetch service (a https proxy for mapquest OSM tiles). The observed
behavior difference between production and sandbox seems to be a
function of the appid itself.

Jeff

Richard Watson

Jul 4, 2012, 4:51:59 PM
to google-a...@googlegroups.com
On Wed, Jul 4, 2012 at 10:42 PM, Jeff Schnitzer <je...@infohazard.org> wrote:
> The observed
> behavior difference between production and sandbox seems to be a
> function of the appid itself.

If that's true, 1) start new app with data copy, 2) move Cloudfront
and DNS to point to new app, 3) profit.

(4. wait until some similarly weird thing happens again. 5. start
working on AWS-Objectify)

Richard