1 minute delay between getJobManager and selectCluster - client time out

26 views
Skip to first unread message

Jelez Raditchkov

unread,
Jun 21, 2016, 11:53:08 AM6/21/16
to genie
The last several day we experience an odd problem where client applications get disconnected due to timeout when submitting jobs to genie.
We test on AWS private VPC subnets. it takes very consistently 1 minute between chooseClusterForJob call and selectCluster


2016-06-21 15:21:22,166 INFO  com.netflix.genie.server.jobmanager.JobManagerFactory:70 [http-nio-8080-exec-3] [getJobManager] called
2016-06-21 15:21:22,167 DEBUG com.netflix.genie.server.services.impl.jpa.ClusterConfigServiceJPAImpl:180 [http-nio-8080-exec-3] [chooseClusterForJob] Called


We see 10 occurrences within 1 minute of the following messages:
2016-06-21 15:22:13,067 DEBUG com.netflix.genie.server.util.NetUtil:86 [Thread-6] [getHostName] called
2016-06-21 15:22:13,067 DEBUG com.netflix.genie.server.util.NetUtil:113 [Thread-6] [getCloudHostName] called

after one minute:
2016-06-21 15:22:23,013 INFO  com.netflix.genie.server.services.impl.RandomizedClusterLoadBalancerImpl:48 [http-nio-8080-exec-3] [selectCluster] called


All hosts have AWS hostnames and AWS DNS and resolve well. Any ideas?

Cheers!

Steven Wong

unread,
Jun 21, 2016, 1:49:12 PM6/21/16
to genie
Since it sounds like the issue is repeatedly happening, if it's reproducible you can try to take a few thread dumps during that one-minute period and see what's happening. If it's not reproducible on demand, perhaps you can schedule periodic thread dumps to catch the next recurrences of the issue.

--
You received this message because you are subscribed to the Google Groups "genie" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genieoss+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jelez Raditchkov

unread,
Jun 21, 2016, 8:38:46 PM6/21/16
to genie
ok this is what is happening. any help will be appreciated!

Genie database has the following tables:
mysql> SELECT TABLE_NAME, TABLE_ROWS FROM `information_schema`.`tables` WHERE `table_schema` = 'genie';
+---------------------+------------+
| TABLE_NAME          | TABLE_ROWS |
+---------------------+------------+
| Application         |          2 |
| Application_configs |          2 |
| Application_jars    |          9 |
| Application_tags    |         14 |
| Cluster             |          1 |
| Cluster_Command     |          4 |
| Cluster_configs     |          3 |
| Cluster_tags        |          8 |
| Command             |          4 |
| Command_configs     |          0 |
| Command_tags        |         36 |
| Job                 |          0 |
| Job_tags            |          0 |
+---------------------+------------+
13 rows in set (0.01 sec)

Genie is executing the following command: 
SELECT DISTINCT t11.id
                t12.element 
FROM   Cluster t0 
       INNER JOIN Cluster_Command t1 
               ON t0.id = t1.clusters_id 
       INNER JOIN Cluster_tags t7 
               ON t0.id = t7.cluster_id 
       INNER JOIN Cluster_tags t8 
               ON t0.id = t8.cluster_id 
       INNER JOIN Cluster_tags t9 
               ON t0.id = t9.cluster_id 
       INNER JOIN Cluster_Command t10 
               ON t0.id = t10.clusters_id 
       INNER JOIN Command t2 
               ON t1.Commands_id = t2.id 
       INNER JOIN Command t11 
               ON t10.Commands_id = t11.id 
       INNER JOIN Command_tags t3 
               ON t2.id = t3.Command_id 
       INNER JOIN Command_tags t4 
               ON t2.id = t4.Command_id 
       INNER JOIN Command_tags t5 
               ON t2.id = t5.Command_id 
       INNER JOIN Command_tags t6 
               ON t2.id = t6.Command_id 
       INNER JOIN Command_tags t12 
               ON t11.id = t12.Command_id 
WHERE  ( t2.status = 'ACTIVE' 
         AND t0.status = 'UP' 
         AND t3.element = 'ver:1.6.1' 
         AND t4.element = 'env:dev' 
         AND t5.element = 'Command:spark-submit' 
         AND t6.element = 'type:spark' 
         AND t7.element = 'type:yarn' 
         AND t8.element = 'env:dev' 
         AND t9.element = 'ver:2.7.2' ) 
ORDER  BY t11.id ASC

It returns: 36 rows in set (25.62 sec)

Apparently the query above is not optimal and needs improvement. Any advice?

The DB is running on AWS RDS: db.m4.large instance

Any advice? Specific MySQL tuning that may help? 

A Mean Nerd

unread,
Jun 22, 2016, 11:46:19 AM6/22/16
to genie
+1 
have the same exact issue. 

Jelez Raditchkov

unread,
Jun 22, 2016, 5:18:09 PM6/22/16
to genie
how is this working at netflix at scale? It seems that for each criteria it adds a join and mysql is creeping out. What magic tunings/optimizations are used in mysql to make this work?
I hope I am not catching anything but this cannot work as  is. genie version 2.3.3.

hope someone can come up with some sort of solution

Tom Gianos

unread,
Jun 22, 2016, 8:36:35 PM6/22/16
to genie

Hi Jalez,

Unfortunately you’re having issues at a bad time as the two primary Genie engineers at Netflix (myself and Amit) are out of the office for personal reasons. We’ll probably be unable to get you an in depth answer for a couple of weeks. To give you a relatively quick answer yes Genie 2 has been running at scale (40,000+ jobs per day) for a couple of years now and we haven’t seen the issues you’re describing.

Our RDS is a db.m2.4xlarge running MySQL 5.6.21 with the default mysql 5.6 option and parameter groups so no special tweaks there. On the client side we’re running with a connection pool of size 20 for each Genie node.

As you can see our database is far bigger than yours:

mysql> SELECT TABLE_NAME, TABLE_ROWS FROM `information_schema`.`tables` WHERE `table_schema` = 'genie';
+---------------------+------------+
| TABLE_NAME          | TABLE_ROWS |
+---------------------+------------+
| Application         |         22 |
| Application_configs |         21 |
| Application_jars    |         47 |
| Application_tags    |         96 |
| Cluster             |         87 |
| Cluster_Command     |       9356 |
| Cluster_configs     |       4663 |
| Cluster_tags        |       4155 |
| Command             |         62 |
| Command_configs     |         45 |
| Command_tags        |        288 |
| Job                 |    2371443 |
| Job_tags            |    9166369 |
+---------------------+------------+
13 rows in set (0.09 sec)

I will say one difference I can see on the surface is that we don’t put in nearly as many tags to our cluster and command tags as you seem to be. What kind of tags are you sending in in your command and cluster criteria? Additionally how are you tagging your clusters and commands?

There may very well be something wrong with the query. You can take a look at how it is working in Genie 2 here. There is likely some inefficiency in using isMember on large tag sets. I vaguely remember Amit writing it this way to avoid multiple queries to the database.

I will say that we’re not actively developing against Genie 2 anymore and are hard at work getting Genie 3 ready for production use at Netflix. The database schema has been optimized to collapse the tags into rows within the tables rather than as join tables (which were auto generated by OpenJPA in Genie 2 rather than static schema in Genie 3 via DDL files). As such the logic for cluster selection was re-written using like statements (see here) and will hopefully be far better performance. This was done because at the scale we were running we DID see large performance degradation on job search due to millions of Job tag to Job record joins.

Hope some of that helps a little bit at least while we’re out. If you see something that can be quickly fixed in Genie 2 PR’s are always welcome. Let me know if there is something else specific I can answer while being out of office and I’ll try to help.

Thanks,
Tom

Reply all
Reply to author
Forward
0 new messages