Instance failed to start accepting connections

1,091 views
Skip to first unread message

John McTeague

unread,
Oct 1, 2014, 10:11:44 AM10/1/14
to vcap...@cloudfoundry.org
We are experiencing the following problem across multiple instances of CF, all running across the same openstack instance, when pushing or scaling applications

We are running v177, 187 and 188 running on trusty stemcells 2707 (for v177) and 2719.2 for 187 and 188.

For all buildpack types (java, ruby and go) we frequently get  Instance (index x) failed to start accepting connections and the app instances fail to start. If we are scaling to multiple instances, 1 or 2 occasionally succeed.

While possibly co-incidental, pushing and scaling works fine for the first few actions but eventually we see consistent problems across the instance invoking actions on the apps.

The apps we are pushing are basic hello world apps that take only a few seconds to startup. We have a two minute default_health_check_timeout set on the dea. If an instance starts then they start fairly quickly. We have experimented with different DEA sizes and number of DEA's. No change was observed.

For the java applications, we do not see Server startup in X ms in the logs, which suggests tomcat never completes its startup process, but the "failed to start accepting connections" error is consistent across buildpacks.

The common points between these is the same openstack. Liberal security group policies have been applied to allow traffic on all ports and protocols.


Can anyone shed any light on this?

Ben Hale

unread,
Oct 1, 2014, 11:49:37 AM10/1/14
to vcap...@cloudfoundry.org
For the java applications, we do not see Server startup in X ms in the logs, which suggests tomcat never completes its startup process, but the "failed to start accepting connections" error is consistent across buildpacks.

John,

To begin with, you're not alone :)  I've been chasing this problem since yesterday on PWS.  So the problem is Cloud Foundry-wide, not just limited to Open Stack.  In my analysis, we see the same error, but Tomcat 8 (or embedded Tomcat 7 in Spring Boot applications) services requests just fine.  As an example:

2014-10-01T09:47:55.82+0100 [STG]    OUT -----> Downloaded app package (15M)
2014-10-01T09:47:57.96+0100 [STG]    OUT -----> Downloaded app buildpack cache (211M)
2014-10-01T09:47:58.10+0100 [STG]    ERR Cloning into '/tmp/buildpacks/java-buildpack'...
2014-10-01T09:47:59.22+0100 [STG]    OUT -----> Java Buildpack Version: 717d8bf | https://github.com/cloudfoundry/java-buildpack.git#717d8bf
2014-10-01T09:47:59.32+0100 [STG]    OUT -----> Downloading Open Jdk JRE 1.8.0_20 from https://download.run.pivotal.io/openjdk/lucid/x86_64/openjdk-1.8.0_20.tar.gz (found in cache)
2014-10-01T09:48:00.53+0100 [STG]    OUT        Expanding Open Jdk JRE to .java-buildpack/open_jdk_jre (1.2s)
2014-10-01T09:48:00.55+0100 [STG]    OUT -----> Downloading Spring Auto Reconfiguration 1.5.0_RELEASE from https://download.run.pivotal.io/auto-reconfiguration/auto-reconfiguration-1.5.0_RELEASE.jar (found in cache)
2014-10-01T09:48:00.56+0100 [STG]    OUT        Modifying /WEB-INF/web.xml for Auto Reconfiguration
2014-10-01T09:48:00.58+0100 [STG]    OUT -----> Downloading Tomcat Instance 8.0.12 from https://download.run.pivotal.io/tomcat/tomcat-8.0.12.tar.gz (found in cache)
2014-10-01T09:48:00.74+0100 [STG]    OUT        Expanding Tomcat to .java-buildpack/tomcat (0.1s)
2014-10-01T09:48:00.76+0100 [STG]    OUT -----> Downloading Tomcat Lifecycle Support 2.3.0_RELEASE from https://download.run.pivotal.io/tomcat-lifecycle-support/tomcat-lifecycle-support-2.3.0_RELEASE.jar (found in cache)
2014-10-01T09:48:00.78+0100 [STG]    OUT -----> Downloading Tomcat Logging Support 2.3.0_RELEASE from https://download.run.pivotal.io/tomcat-logging-support/tomcat-logging-support-2.3.0_RELEASE.jar (found in cache)
2014-10-01T09:48:00.80+0100 [STG]    OUT -----> Downloading Tomcat Access Logging Support 2.3.0_RELEASE from https://download.run.pivotal.io/tomcat-access-logging-support/tomcat-access-logging-support-2.3.0_RE... (found in cache)
2014-10-01T09:48:10.19+0100 [STG]    OUT -----> Uploading droplet (60M)
2014-10-01T09:48:35.54+0100 [DEA]    OUT Starting app instance (index 0) with guid 68c1a1b9-f46e-44b7-bf20-57953ae46805
2014-10-01T09:48:39.70+0100 [App/0]  OUT [CONTAINER] org.apache.coyote.http11.Http11NioProtocol        INFO    Initializing ProtocolHandler ["http-nio-62140"]
2014-10-01T09:48:39.71+0100 [App/0]  OUT [CONTAINER] org.apache.catalina.startup.Catalina              INFO    Initialization processed in 549 ms
2014-10-01T09:48:39.72+0100 [App/0]  OUT [CONTAINER] org.apache.catalina.core.StandardService          INFO    Starting service Catalina
2014-10-01T09:48:39.72+0100 [App/0]  OUT [CONTAINER] org.apache.catalina.core.StandardEngine            INFO    Starting Servlet Engine: Apache Tomcat/8.0.12
2014-10-01T09:48:39.75+0100 [App/0]  OUT [CONTAINER] org.apache.catalina.startup.HostConfig            INFO    Deploying web application directory /home/vcap/app/.java-buildpack/tomcat/webapps/ROOT
2014-10-01T09:48:42.46+0100 [App/0]  OUT [CONTAINER] ing.AutoReconfigurationServletContainerInitializer INFO    Initializing ServletContext with Auto-reconfiguration ApplicationContextInitializers
2014-10-01T09:48:42.46+0100 [App/0]  OUT [CONTAINER] lina.core.ContainerBase.[Catalina].[localhost].[/] INFO    No Spring WebApplicationInitializer types detected on classpath
2014-10-01T09:48:42.86+0100 [App/0]  OUT [CONTAINER] lina.core.ContainerBase.[Catalina].[localhost].[/] INFO    Initializing Spring FrameworkServlet 'dispatcherServlet'
2014-10-01T09:48:43.07+0100 [App/0]  OUT [CONTAINER] n.spring.CloudProfileApplicationContextInitializer INFO    Adding 'cloud' to list of active profiles
2014-10-01T09:48:43.07+0100 [App/0]  OUT [CONTAINER] g.CloudPropertySourceApplicationContextInitializer INFO    Adding 'cloud' PropertySource to ApplicationContext
2014-10-01T09:48:43.14+0100 [App/0]  OUT [CONTAINER] udAutoReconfigurationApplicationContextInitializer INFO    Adding cloud service auto-reconfiguration to ApplicationContext
2014-10-01T09:48:43.52+0100 [App/0]  OUT [CONTAINER] ing.DataSourceCloudServiceBeanFactoryPostProcessor INFO    Auto-reconfiguring beans of type javax.sql.DataSource
2014-10-01T09:48:43.55+0100 [App/0]  OUT [CONTAINER] ing.DataSourceCloudServiceBeanFactoryPostProcessor INFO    No matching service found. Skipping auto-reconfiguration.
2014-10-01T09:48:43.56+0100 [App/0]  OUT [CONTAINER] n.spring.MongoCloudServiceBeanFactoryPostProcessor INFO    Auto-reconfiguring beans of type org.springframework.data.mongodb.MongoDbFactory
2014-10-01T09:48:43.56+0100 [App/0]  OUT [CONTAINER] n.spring.MongoCloudServiceBeanFactoryPostProcessor INFO    No matching service found. Skipping auto-reconfiguration.
2014-10-01T09:48:43.57+0100 [App/0]  OUT [CONTAINER] .spring.RabbitCloudServiceBeanFactoryPostProcessor INFO    Auto-reconfiguring beans of type org.springframework.amqp.rabbit.connection.ConnectionFactory
2014-10-01T09:48:43.57+0100 [App/0]  OUT [CONTAINER] .spring.RabbitCloudServiceBeanFactoryPostProcessor INFO    No matching service found. Skipping auto-reconfiguration.
2014-10-01T09:48:43.57+0100 [App/0]  OUT [CONTAINER] n.spring.RedisCloudServiceBeanFactoryPostProcessor INFO    Auto-reconfiguring beans of type org.springframework.data.redis.connection.RedisConnectionFactory
2014-10-01T09:48:43.58+0100 [App/0]  OUT [CONTAINER] n.spring.RedisCloudServiceBeanFactoryPostProcessor INFO    No matching service found. Skipping auto-reconfiguration.
2014-10-01T09:48:44.79+0100 [App/0]  OUT [CONTAINER] org.apache.catalina.startup.HostConfig            INFO    Deployment of web application directory /home/vcap/app/.java-buildpack/tomcat/webapps/ROOT has finished in 5,036 ms
2014-10-01T09:48:44.83+0100 [App/0]  OUT [CONTAINER] org.apache.tomcat.util.net.NioSelectorPool        INFO    Using a shared selector for servlet write/read
2014-10-01T09:48:44.84+0100 [DEA]    ERR Instance (index 0) failed to start accepting connections
2014-10-01T09:49:42.50+0100 [RTR]    OUT ben-web-servlet-2-application.cfapps.io - [01/10/2014:08:49:42 +0000] "GET / HTTP/1.1" 200 2 "-" "curl/7.30.0" 10.10.2.122:61614 x_forwarded_for:"89.21.231.30" vcap_request_id:338d909e-a06c-446a-6d1f-d8c6dc6df29f response_time:0.097188002 app_id:68c1a1b9-f46e-44b7-bf20-57953ae46805

You'll notice that towards the end the DEA reports "failed to start accepting connections", but then immediately routes a request successfully.  There's clearly a problem (that message is wrong), but other than the annoyance I don't believe that we're seeing any other effect.  Definitely needs to be higher priority if you're seeing routing problems as well.


-Ben Hale
Cloud Foundry Java Experience

John McTeague

unread,
Oct 1, 2014, 11:54:19 AM10/1/14
to vcap...@cloudfoundry.org
Interesting, thanks for the update!

We had similar issues with v2.5 of the java buildpack from a fresh v188 install. My first guess was it was related to that, but rolling back to v2.4 didnt help.

John


On Wednesday, October 1, 2014 4:49:37 PM UTC+1, Ben Hale wrote:
For the java applications, we do not see Server startup in X ms in the logs, which suggests tomcat never completes its startup process, but the "failed to start accepting connections" error is consistent across buildpacks.

John,

To begin with, you're not alone :)  I've been chasing this problem since yesterday on PWS.  So the problem is Cloud Foundry-wide, not just limited to Open Stack.  In my analysis, we see the same error, but Tomcat 8 (or embedded Tomcat 7 in Spring Boot applications) services requests just fine.  As an example:

2014-10-01T09:47:55.82+0100 [STG]    OUT -----> Downloaded app package (15M)
2014-10-01T09:47:57.96+0100 [STG]    OUT -----> Downloaded app buildpack cache (211M)
2014-10-01T09:47:58.10+0100 [STG]    ERR Cloning into '/tmp/buildpacks/java-buildpack'...
2014-10-01T09:47:59.22+0100 [STG]    OUT -----> Java Buildpack Version: 717d8bf | https://github.com/cloudfoundry/java-buildpack.git#717d8bf
2014-10-01T09:47:59.32+0100 [STG]    OUT -----> Downloading Open Jdk JRE 1.8.0_20 from https://download.run.pivotal.io/openjdk/lucid/x86_64/openjdk-1.8.0_20.tar.gz (found in cache)
2014-10-01T09:48:00.53+0100 [STG]    OUT        Expanding Open Jdk JRE to .java-buildpack/open_jdk_jre (1.2s)
2014-10-01T09:48:00.55+0100 [STG]    OUT -----> Downloading Spring Auto Reconfiguration 1.5.0_RELEASE from https://download.run.pivotal.io/auto-reconfiguration/auto-reconfiguration-1.5.0_RELEASE.jar (found in cache)
2014-10-01T09:48:00.56+0100 [STG]    OUT        Modifying /WEB-INF/web.xml for Auto Reconfiguration
2014-10-01T09:48:00.58+0100 [STG]    OUT -----> Downloading Tomcat Instance 8.0.12 from https://download.run.pivotal.io/tomcat/tomcat-8.0.12.tar.gz (found in cache)
2014-10-01T09:48:00.74+0100 [STG]    OUT        Expanding Tomcat to .java-buildpack/tomcat (0.1s)
2014-10-01T09:48:00.76+0100 [STG]    OUT -----> Downloading Tomcat Lifecycle Support 2.3.0_RELEASE from https://download.run.pivotal.io/tomcat-lifecycle-support/tomcat-lifecycle-support-2.3.0_RELEASE.jar (found in cache)
2014-10-01T09:48:00.78+0100 [STG]    OUT -----> Downloading Tomcat Logging Support 2.3.0_RELEASE from https://download.run.pivotal.io/tomcat-logging-support/tomcat-logging-support-2.3.0_RELEASE.jar (found in cache)
2014-10-01T09:48:00.80+0100 [STG]    OUT -----> Downloading Tomcat Access Logging Support 2.3.0_RELEASE from https://download.run.pivotal.io/tomcat-access-logging-support/tomcat-access-logging-support-2.3.0_RE... (found in cache)
2014-10-01T09:48:10.19+0100 [STG]    OUT -----> Uploading droplet (60M)
2014-10-01T09:48:35.54+0100 [DEA]    OUT Starting app instance (index 0) with guid 68c1a1b9-f46e-44b7-bf20-57953ae46805
2014-10-01T09:48:39.70+0100 [App/0]  OUT [CONTAINER] org.apache.coyote.http11.Http11NioProtocol        INFO    Initializing ProtocolHandler ["http-nio-62140"]
2014-10-01T09:48:39.71+0100 [App/0]  OUT [CONTAINER] org.apache.catalina.startup.Catalina              INFO    Initialization processed in 549 ms
2014-10-01T09:48:39.72+0100 [App/0]  OUT [CONTAINER] org.apache.catalina.core.StandardService          INFO    Starting service Catalina
2014-10-01T09:48:39.72+0100 [<span
...

Ben Hale

unread,
Oct 1, 2014, 11:57:10 AM10/1/14
to vcap...@cloudfoundry.org
We had similar issues with v2.5 of the java buildpack from a fresh v188 install. My first guess was it was related to that, but rolling back to v2.4 didnt help. 

We've been heading down the same investigation path clearly.  I've replicated the problem on PWS using Tomcat 7, Tomcat 8, Java 7, Java 8, Java Buildpack v2.4, and Java Buildpack v2.5.  The fact that you're seeing it with Ruby and Go buildpacks as well indicates that it's not related to the buildpacks.

Dieu Cao

unread,
Oct 1, 2014, 12:30:18 PM10/1/14
to vcap...@cloudfoundry.org
I've put a bug in tracker and will have a pair look into it today.

-Dieu
Cloud Foundry Runtime PM

Luan Santos

unread,
Oct 1, 2014, 1:25:52 PM10/1/14
to vcap...@cloudfoundry.org
Hi John and Ben,

We're trying to reproduce this problem, we're not seeing this with our example apps. Could you provide us with an app that elicits the issue?

Thanks,

Luan and Jim, CF Runtime Team

John McTeague

unread,
Oct 1, 2014, 4:45:24 PM10/1/14
to vcap...@cloudfoundry.org
Luan, the cf-scale-boot app highlights this problem for me - https://github.com/cf-platform-eng/cf-scale-boot. In fact any app I deploy in any buildpack. https://github.com/longnguyen11288/go-web-cf is a go example i tried that also failed.

Ben, interestingly our problem also started yesterday. Co-incidence?

John

James Myers

unread,
Oct 2, 2014, 1:35:43 PM10/2/14
to vcap...@cloudfoundry.org
Hi John,

We currently are seeing this error in our logs when scaling applications, however we have yet to see this error when pushing an application. In the cases when we do see it, we never actually see the instance in a crashed/failing state (similar to what Ben Hale is reporting). We also have successfully been able to curl/contact the instances that have "failed to start accepting connections" a short while after seeing the error. Thus this is a case in which the instance actually successfully started, however we are incorrectly logging an error message in the logs. When you observe this error when scaling applications, do you actually see a crashed/failing instance? Or are you just observing the error in the logs? Could you also provide us with the output from `cf logs` during a scale or push when the failure occurs?

Thanks,

Jim & Alex, CF Runtime Team

John McTeague

unread,
Oct 2, 2014, 4:28:14 PM10/2/14
to vcap...@cloudfoundry.org
To clarify what I have seen, some pushes do fail, the app attempts to start several times before giving up. Each failure ends with the accepting connections error. 

Occasionally it does start. If it does start, a scale up will just as likely fail for some of the nodes, 1 or 2 may work at best.

You are correct that after some period of time the instances may come up. One test we attempted was to scale to 10 instances. It took several hours but eventually all 10 started. The logs were full of failed to start accepting connection errors. Other times it only took 5-10 mins (prior to this issue it only took 1-2 mins).

I hope this helps, I'll dump a log here tomorrow from the office (its similar to what Ben posted but for a v2.4 java buildpack).

If you need any other data or assistance feel free to get in touch. I or my colleague will happy to share a webex if you need to explore the issue during UK or EST office hours.

John

John McTeague

unread,
Oct 3, 2014, 4:24:13 AM10/3/14
to vcap...@cloudfoundry.org
I'm going to contradict my previous comment somewhat.

Starting with a clean v188 environment (java buildpack 2.4) this morning I pushed cf-scale-boot. It started fine but the error did appear in the log.

I then ran cf scale cf-scale-boot -i 4

The second instance started almost immediately but it too put the error in the log.

Instances 3 and 4 continuously tried to start but failed. After several attempts the instances show as crashing when running cf app cf-scale-boot.

The log is attached.
cf_log_scale_boot.log

Luan Santos

unread,
Oct 6, 2014, 5:55:42 PM10/6/14
to vcap...@cloudfoundry.org
Hi John,

We're still having trouble reproducing this error on our development/staging environments, we're also investigating whether or not the issues seen on PWS are the same as the one you're experiencing. I'd like to get some more information on the details of your environment, could you send us the output of `bosh vms --vitals`? 
A webex session could indeed be useful to speed up this investigation. We're available 9-6 Pacific so we could probably find a time that works there. Let us know the time that works for you.

Thanks,
Luan

john mcteague

unread,
Oct 6, 2014, 6:00:49 PM10/6/14
to vcap...@cloudfoundry.org, aitk...@gmail.com

Luan, I will set up a WebEx for tomorrow between you and my colleague in NYC and send the output of the bosh command you requested.

John.

--
You received this message because you are subscribed to a topic in the Google Groups "Cloud Foundry Developers" group.
To view this discussion on the web visit https://groups.google.com/a/cloudfoundry.org/d/msgid/vcap-dev/7284dcff-b9d5-48ea-a583-f3c4246c3a38%40cloudfoundry.org.

John McTeague

unread,
Oct 7, 2014, 4:19:20 AM10/7/14
to vcap...@cloudfoundry.org, aitk...@gmail.com
The output of bosh vms --vitals is:

+---------------------------+---------+---------------+----------------+-----------------------+------+------+------+--------------+-------------+------------+------------+------------+
| Job/index                 | State   | Resource Pool | IPs            |         Load          | CPU  | CPU  | CPU  | Memory Usage | Swap Usage  | System     | Ephemeral  | Persistent |
|                           |         |               |                | (avg01, avg05, avg15) | User | Sys  | Wait |              |             | Disk Usage | Disk Usage | Disk Usage |
+---------------------------+---------+---------------+----------------+-----------------------+------+------+------+--------------+-------------+------------+------------+------------+
| unknown/unknown           | running | az1_cf_small  | 172.16.3.6     | 0.00, 0.01, 0.05      | 0.0% | 0.0% | 0.0% | 3% (61.9M)   | 0% (0B)     | 46%        | 0%         | n/a        |
| unknown/unknown           | running | az1_external  | 172.16.2.6     | 0.00, 0.01, 0.05      | 0.0% | 0.1% | 0.1% | 3% (66.2M)   | 0% (0B)     | 46%        | 0%         | n/a        |
| unknown/unknown           | running | az2_cf_small  | 172.16.3.7     | 0.00, 0.01, 0.05      | 0.0% | 0.0% | 0.1% | 3% (63.7M)   | 0% (0B)     | 46%        | 0%         | n/a        |
| unknown/unknown           | running | az2_external  | 172.16.2.7     | 0.00, 0.01, 0.05      | 0.1% | 0.0% | 0.0% | 3% (66.1M)   | 0% (0B)     | 46%        | 0%         | n/a        |
| cloud_controller_az1/0    | running | az1_cf_medium | 172.16.3.112   | 0.00, 0.01, 0.05      | 0.2% | 0.1% | 0.0% | 20% (775.9M) | 0% (660.0K) | 46%        | 8%         | n/a        |
| cloud_controller_az2/0    | running | az2_cf_medium | 172.16.3.113   | 0.00, 0.01, 0.05      | 0.2% | 0.1% | 0.0% | 19% (762.0M) | 0% (144.0K) | 46%        | 8%         | n/a        |
| core-mysql-az1/0          | running | az1_db        | 172.16.4.101   | 0.03, 0.05, 0.05      | 0.0% | 0.1% | 0.0% | 4% (574.4M)  | 0% (0B)     | 46%        | 1%         | 3%         |
| core-mysql-az1/1          | running | az1_db        | 172.16.4.103   | 0.00, 0.01, 0.05      | 0.1% | 0.0% | 0.0% | 3% (514.5M)  | 0% (0B)     | 46%        | 1%         | 3%         |
| core-mysql-az2/0          | running | az2_db        | 172.16.4.102   | 0.00, 0.01, 0.05      | 0.1% | 0.0% | 0.0% | 3% (513.4M)  | 0% (0B)     | 46%        | 1%         | 3%         |
| core-mysql-haproxy/0      | running | az2_db        | 172.16.4.104   | 0.00, 0.01, 0.05      | 0.0% | 0.0% | 0.0% | 1% (200.8M)  | 0% (0B)     | 46%        | 0%         | n/a        |
|                           |         |               | 10.100.132.248 |                       |      |      |      |              |             |            |            |            |
| dea_az1/0                 | running | az1_dea       | 172.16.3.116   | 0.00, 0.01, 0.05      | 0.2% | 0.2% | 0.0% | 7% (1.1G)    | 0% (0B)     | 46%        | 3%         | n/a        |
| dea_az2/0                 | running | az2_dea       | 172.16.3.117   | 0.00, 0.01, 0.05      | 0.2% | 0.2% | 0.0% | 10% (1.6G)   | 0% (0B)     | 46%        | 3%         | n/a        |
| etcd_az1/0                | running | az1_cf_small  | 172.16.3.103   | 0.06, 0.11, 0.13      | 1.0% | 3.4% | 1.7% | 6% (118.9M)  | 0% (0B)     | 46%        | 1%         | 0%         |
| etcd_az2/0                | running | az2_cf_small  | 172.16.3.104   | 0.05, 0.08, 0.07      | 0.4% | 0.4% | 1.9% | 5% (102.3M)  | 0% (0B)     | 46%        | 1%         | 0%         |
| haproxy_az1/0             | running | az1_cf_small  | 172.16.2.101   | 0.00, 0.01, 0.05      | 0.0% | 0.1% | 0.1% | 4% (85.8M)   | 0% (0B)     | 46%        | 1%         | n/a        |
|                           |         |               | 10.100.132.246 |                       |      |      |      |              |             |            |            |            |
| haproxy_az2/0             | running | az2_cf_small  | 172.16.2.102   | 0.00, 0.01, 0.05      | 0.1% | 0.2% | 0.0% | 4% (80.2M)   | 0% (0B)     | 46%        | 0%         | n/a        |
|                           |         |               | 10.100.132.247 |                       |      |      |      |              |             |            |            |            |
| health_manager_az1/0      | running | az1_cf_small  | 172.16.3.105   | 0.00, 0.01, 0.05      | 0.4% | 0.0% | 0.1% | 6% (130.1M)  | 0% (0B)     | 46%        | 2%         | n/a        |
| health_manager_az2/0      | running | az2_cf_small  | 172.16.3.106   | 0.00, 0.01, 0.05      | 0.3% | 0.1% | 0.0% | 6% (127.9M)  | 0% (0B)     | 46%        | 1%         | n/a        |
| loggregator_az1/0         | running | az1_cf_small  | 172.16.3.108   | 0.00, 0.01, 0.05      | 0.1% | 0.1% | 0.0% | 4% (79.0M)   | 0% (0B)     | 46%        | 1%         | n/a        |
| loggregator_az2/0         | running | az2_cf_small  | 172.16.3.109   | 0.00, 0.01, 0.05      | 0.1% | 0.1% | 0.0% | 4% (77.0M)   | 0% (0B)     | 46%        | 1%         | n/a        |
| loggregator_traffic_az1/0 | running | az1_cf_small  | 172.16.3.110   | 0.00, 0.01, 0.05      | 0.1% | 0.2% | 0.0% | 4% (78.7M)   | 0% (0B)     | 46%        | 1%         | n/a        |
| loggregator_traffic_az2/0 | running | az2_cf_small  | 172.16.3.111   | 0.00, 0.01, 0.05      | 0.0% | 0.1% | 0.0% | 4% (72.8M)   | 0% (0B)     | 46%        | 1%         | n/a        |
| login_az1/0               | running | az1_cf_small  | 172.16.3.150   | 0.00, 0.01, 0.05      | 0.1% | 0.0% | 0.1% | 17% (348.5M) | 0% (0B)     | 46%        | 7%         | n/a        |
| login_az2/0               | running | az2_cf_small  | 172.16.3.151   | 0.00, 0.01, 0.05      | 0.0% | 0.1% | 0.0% | 17% (342.8M) | 0% (0B)     | 46%        | 7%         | n/a        |
| nats_az1/0                | running | az1_cf_small  | 172.16.3.101   | 0.00, 0.01, 0.05      | 0.3% | 0.1% | 0.0% | 6% (110.3M)  | 0% (0B)     | 46%        | 2%         | n/a        |
| nats_az2/0                | running | az2_cf_small  | 172.16.3.102   | 0.00, 0.01, 0.05      | 0.2% | 0.2% | 0.1% | 5% (99.3M)   | 0% (0B)     | 46%        | 2%         | n/a        |
| postgres/0                | running | az1_cf_small  | 172.16.4.110   | 0.00, 0.01, 0.05      | 0.2% | 0.1% | 0.0% | 5% (90.8M)   | 0% (0B)     | 46%        | 1%         | 0%         |
| router_az1/0              | running | az1_cf_small  | 172.16.3.114   | 0.00, 0.01, 0.05      | 0.2% | 0.2% | 0.0% | 5% (94.1M)   | 0% (0B)     | 46%        | 1%         | n/a        |
| router_az2/0              | running | az2_cf_small  | 172.16.3.115   | 0.00, 0.01, 0.05      | 0.2% | 0.4% | 0.0% | 5% (100.4M)  | 0% (0B)     | 46%        | 1%         | n/a        |
| uaa_az1/0                 | running | az1_cf_small  | 172.16.3.107   | 0.00, 0.01, 0.05      | 0.0% | 0.1% | 0.1% | 18% (369.4M) | 0% (0B)     | 46%        | 7%         | n/a        |
| uaa_az2/0                 | running | az2_cf_small  | 172.16.3.130   | 0.00, 0.01, 0.05      | 0.1% | 0.0% | 0.1% | 19% (377.9M) | 0% (0B)     | 46%        | 7%         | n/a        |
+---------------------------+---------+---------------+----------------+-----------------------+------+------+------+--------------+-------------+------------+------------+------------+

Neil Aitken

unread,
Oct 7, 2014, 3:41:19 PM10/7/14
to vcap...@cloudfoundry.org, aitk...@gmail.com
Just a quick summary of the meeting today:

DEA VM vitals were in good shape.  Grepping some DEA and Warden logs showed no errors, other than the top level app timeout itself.

We tried out a few things with sample apps from cf-acceptance-tests:
  Ruby and go apps scaled to 20 instances quickly (though they still threw errors for connection timeout in the logs, they actually started up)
  The sample java app failed the same way we've been seeing with our samples like scale-boot

Changing the ruby hello-world app to have some arbitrary large files so the droplet size is comparable with java, still scaled quickly to 10 instances.

So we think we've narrowed down to something with the java buildpack.

-----

After the meeting I tried the same test with buildpack v2.5, but the behavior was unchanged - first instance starts relatively quickly, subsequent instances after scale request timeout and fail to start up,.

Ben Hale

unread,
Oct 8, 2014, 5:47:50 AM10/8/14
to vcap...@cloudfoundry.org, aitk...@gmail.com
Luan, the cf-scale-boot app highlights this problem for me - https://github.com/cf-platform-eng/cf-scale-boot. In fact any app I deploy in any buildpack. https://github.com/longnguyen11288/go-web-cf is a go example i tried that also failed.

We tried out a few things with sample apps from cf-acceptance-tests:
  Ruby and go apps scaled to 20 instances quickly (though they still threw errors for connection timeout in the logs, they actually started up)

Neil,

You say that the problem is with the Java Buildpack and yet it's happening with both the Ruby and Go buildpacks.  In addition, you can try this with any previous version of the Java buildpack (e.g. -b https://github.com/cloudfoundry/java-buildpack.git#v2.4, -b https://github.com/cloudfoundry/java-buildpack.git#v2.3, -b https://github.com/cloudfoundry/java-buildpack.git#v2.2-b https://github.com/cloudfoundry/java-buildpack.git#v2.1.2) , all of which used to work without a problem, and now no longer do.  With those two data points, what in your analysis leads you to think that it's a problem specifically with the latest version of the Java Buildpack?

As well I noticed on the status update that you are having trouble reproducing the problem.  The Cloud Foundry Java Experience team stands ready to help with that problem since our system tests are reliably reproducing the problem at least 120 times an hour.

zrob...@pivotal.io

unread,
Oct 8, 2014, 7:28:09 PM10/8/14
to vcap...@cloudfoundry.org, aitk...@gmail.com
Hi Neil,

We have successfully been able to reproduce this issue on a couple of environments and will continue to investigate.

Ben,

It looks like some issues are getting crossed up on this thread, a brief summary is:

 - The error messages in the logs are a red herring, they are showing up every time an app starts successfully, a separate fix is in place for that ( I think it may be on PWS today)
 - The real issue is that when scaling apps (say from 1 to 10) we see that some instances actually do fail to start
 - So far we have only reproduced this using java apps - the others we have tried are ruby and go
 - It does not occur on every environment

So the question is, what does a java app need that we are not able to provide on these environments?  We're going to follow up in email with some more info for you Ben.

Thanks,
Zach and Jim

Ken Krueger

unread,
Oct 9, 2014, 1:53:56 PM10/9/14
to vcap...@cloudfoundry.org
I just saw an announcement from PWS support that this should be resolved as of now.

--
You received this message because you are subscribed to the Google Groups "Cloud Foundry Developers" group.
To view this discussion on the web visit https://groups.google.com/a/cloudfoundry.org/d/msgid/vcap-dev/872c63c3-c912-4d9a-ad33-65431ef03564%40cloudfoundry.org.

To unsubscribe from this group and stop receiving emails from it, send an email to vcap-dev+u...@cloudfoundry.org.



--
Ken Krueger  
Manager, Global Education Delivery
407 256 9737 Mobile
kenkrueger65 Skype
Education questions?  educ...@pivotal.io

Dieu Cao

unread,
Oct 9, 2014, 1:58:36 PM10/9/14
to vcap...@cloudfoundry.org
We resolved an issue where the specific error message "Instance failed to start accepting connections" was being logged gratuitously in v189 even when apps were starting up correctly.
We are still investigating the specific issue that Neil and John are experiencing as Zach describes in his last post.

-Dieu
To unsubscribe from this group and stop receiving emails from it, send an email to vcap-dev+unsubscribe@cloudfoundry.org.

zrob...@pivotal.io

unread,
Oct 9, 2014, 2:54:02 PM10/9/14
to vcap...@cloudfoundry.org
Neil,

On our environments we were able to solve the start up problem using information from this thread: https://groups.google.com/a/cloudfoundry.org/forum/#!searchin/vcap-dev/java$20entropy/vcap-dev/GogAsIvBeVo/emkFTDXZhWUJ

Could you try adding the suggested param '-Djava.security.egd=file:/dev/urandom' to your start up command and see if it resolves your problems?

I'm not actually sure that it is a property that you want, but if it does indeed cause things to start correctly then we can look into whether it is a setting we want by default or at least start thinking about some approaches to avoid this issue.

Thanks,
Zach & Eric

 

To unsubscribe from this group and stop receiving emails from it, send an email to vcap-dev+u...@cloudfoundry.org.

Neil Aitken

unread,
Oct 9, 2014, 3:55:34 PM10/9/14
to vcap...@cloudfoundry.org, zrob...@pivotal.io
Hi,

This tallies with what I'm seeing from the jstack (kill -3) output - 

2014-10-09T19:39:46.17+0000 [App/0]   OUT 2014-10-09 19:39:46
2014-10-09T19:39:46.17+0000 [App/0]   OUT Full thread dump OpenJDK 64-Bit Server VM (25.20-b22 mixed mode):
2014-10-09T19:39:46.17+0000 [App/0]   OUT "localhost-startStop-1" #13 daemon prio=5 os_prio=0 tid=0x000000000228f000 nid=0x39 runnable [0x00007f463cc5a000]
2014-10-09T19:39:46.17+0000 [App/0]   OUT    java.lang.Thread.State: RUNNABLE
2014-10-09T19:39:46.17+0000 [App/0]   OUT       at java.io.FileInputStream.readBytes(Native Method)
2014-10-09T19:39:46.17+0000 [App/0]   OUT       at java.io.FileInputStream.read(FileInputStream.java:246)
2014-10-09T19:39:46.17+0000 [App/0]   OUT       at sun.security.provider.SeedGenerator$URLSeedGenerator.getSeedBytes(SeedGenerator.java:539)
2014-10-09T19:39:46.17+0000 [App/0]   OUT       at sun.security.provider.SeedGenerator.generateSeed(SeedGenerator.java:144)
2014-10-09T19:39:46.17+0000 [App/0]   OUT       at sun.security.provider.SecureRandom$SeederHolder.<clinit>(SecureRandom.java:192)
2014-10-09T19:39:46.17+0000 [App/0]   OUT       at sun.security.provider.SecureRandom.engineNextBytes(SecureRandom.java:210)
2014-10-09T19:39:46.17+0000 [App/0]   OUT       - locked <0x00000000f334ace0> (a sun.security.provider.SecureRandom)
2014-10-09T19:39:46.17+0000 [App/0]   OUT       at java.security.SecureRandom.nextBytes(SecureRandom.java:457)
2014-10-09T19:39:46.17+0000 [App/0]   OUT       - locked <0x00000000f334b700> (a java.security.SecureRandom)
2014-10-09T19:39:46.17+0000 [App/0]   OUT       at java.security.SecureRandom.next(SecureRandom.java:480)
2014-10-09T19:39:46.17+0000 [App/0]   OUT       at java.util.Random.nextInt(Random.java:329)
...

Adding this into the manifest.yml:

  env:
   JAVA_OPTS: -Djava.security.egd=file:///dev/urandom

Apps now scale up quickly, so yeah, looks like insufficient entropy on the boxes, and the pseudo rand workaround resolves it.  I guess the remaining question then is whether there are any security implications .

thanks for your help,
Neil

Dieu Cao

unread,
Oct 13, 2014, 4:49:35 PM10/13/14
to vcap...@cloudfoundry.org, zrob...@pivotal.io
There's a great article I found about /dev/random vs /dev/urandom .
I found it to be very convincing that using /dev/urandom instead of /dev/random is fine.
Reply all
Reply to author
Forward
0 new messages