[1.4.0m2] High CPU usage in dev using cassandra persistence


Andy Moreland

Oct 30, 2017, 6:31:49 PM
to Lagom Framework Users
Hi folks,

A few people (including myself) at my company are experiencing extremely high CPU usage in dev: roughly 800% sustained.

For context, we have about 11 services, most of which have a couple of read-side handlers. We're using the in-JVM Cassandra and Kafka in dev, wiping the data after each run, so that shouldn't be relevant.

After my application starts (with `runAll`), it consistently burns 800% CPU (on an 8-core machine) on Linux, and ~600% CPU (on an 8-hyper-thread machine) on macOS.

I profiled the application with YourKit and nothing super obvious stood out to me. Potentially relevant observations:

(1) ForkJoinPool.scan/ForkJoinPool.park together consume the majority of the application's CPU time.
(2) In the YourKit Cassandra event log I see about 34 Cassandra queries per second, each of which looks roughly like:
`SELECT * FROM keyspace.eventsbytag1 WHERE ...`

These originate from `EventsByTagFetcher` in the Akka Persistence Cassandra plugin. My application code shouldn't be doing anything. I haven't 100% ruled out an infinite loop between services or something like that, but I don't think that's the case, because it would _probably_ show up in the logs.

I'd appreciate any insight into this problem. I'm not sure how best to debug it -- happy to provide whatever logs or profiles would help, I'm just not sure which are most useful right now.

Best,

Andy

Tim Moore

Nov 2, 2017, 9:46:28 PM
to Andy Moreland, Lagom Framework Users
Hi Andy,

Those queries are the ones that read-side processors use to poll for events.

I have a few questions, and a few suggestions for things you can check:

Is it the Lagom process or the Cassandra process that is burning CPU? In case you weren't aware, Lagom is not actually running Cassandra in the same JVM; it forks a separate process. You can use `jps -l` and look for a process with "Cassandra" in the name to find the PID.
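
For example, something like this should find it (a sketch; the exact main class name of the forked process may vary, so just search for "cassandra"):

jps -l | grep -i cassandra    # prints "<pid> <main class>" for the forked Cassandra JVM
top -p <pid>                  # watch that PID's CPU on Linux (use `top -pid <pid>` or Activity Monitor on macOS)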

I think scan and park indicate that the ForkJoinPool is waiting for work. Are you sure it's CPU time and not wall clock time? In either case, this would indicate a mostly idle service, so that also suggests that it is Cassandra consuming CPU.

As I said before, these queries are normal behavior for read-side processors. There are a few things that determine the number and frequency of the queries:

When you are using sharded event tags, you'll have one polling loop per shard for each read-side processor, and each of those will issue the query every three seconds by default.

So if you're using 11 services * ~2-3 read-side processors per service * 4 shards per event tag, then the rate of queries you're seeing seems to be in the right ballpark.
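
To put rough numbers on that (taking ~2.5 read-side processors per service as the midpoint, and the default 3-second poll interval):

11 services × 2.5 read-side processors × 4 shards ≈ 110 polling streams
110 polling streams ÷ 3 s per poll ≈ 37 queries/s

which is right around the ~34 queries per second you measured.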

You can adjust the number of event tag shards to change the amount of parallelism in your read-side processors. The number of shards you want will likely differ between development and production, in which case I'd recommend driving it from a config property. Be aware, however, that changing the number of shards after events have been written means you can lose ordering consistency: ordinarily, all events for a given persistence ID are written to the same sharded tag, but changing the number of shards changes which shard a persistence ID maps to. It's OK to change it in development/test if you're wiping the data every time.
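
As a minimal sketch of what driving the shard count from config could look like in the Scala API (the `my-service.read-side.num-shards` key and the OrderEvent names are made up for illustration, and ConfigFactory.load() is used for brevity; in a real service you'd more likely use the injected configuration):

import com.lightbend.lagom.scaladsl.persistence.{AggregateEvent, AggregateEventTag, AggregateEventTagger}
import com.typesafe.config.ConfigFactory

sealed trait OrderEvent extends AggregateEvent[OrderEvent] {
  // every OrderEvent is tagged with one of the numShards sharded tags
  override def aggregateTag: AggregateEventTagger[OrderEvent] = OrderEvent.Tag
}

object OrderEvent {
  // e.g. my-service.read-side.num-shards = 4 in application.conf,
  // overridden with a smaller value in your dev config
  private val numShards = ConfigFactory.load().getInt("my-service.read-side.num-shards")

  val Tag = AggregateEventTag.sharded[OrderEvent](numShards)
}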

There are also many tuning parameters in Akka Persistence Cassandra that can affect this. Look at https://github.com/akka/akka-persistence-cassandra/blob/v0.58/core/src/main/resources/reference.conf for the complete list, especially the cassandra-query-journal section (https://github.com/akka/akka-persistence-cassandra/blob/v0.58/core/src/main/resources/reference.conf#L531).

In particular, you can adjust the cassandra-query-journal.refresh-interval property to poll less frequently than every 3s. The trade-off here is that there will potentially be a longer delay before a read-side processor picks up a new event.

You can mitigate this by setting the cassandra-journal.pubsub-minimum-interval property (https://github.com/akka/akka-persistence-cassandra/blob/v0.58/core/src/main/resources/reference.conf#L255-L264). This causes Akka Persistence Cassandra to send an internal pub-sub broadcast message when it writes an event, which the read-side processors automatically subscribe to. When they receive this message, they issue a new poll immediately. This allows you to set a much longer refresh-interval without seeing a huge delay when there is a new event. It won't help much if you are actually writing a high volume of events, but in development, where things are mostly idle, it will reduce the amount of background activity.
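
For example, something along these lines in your application.conf (the values here are just illustrative, not recommendations):

cassandra-query-journal {
  # poll much less often than the default 3s...
  refresh-interval = 20s
}

cassandra-journal {
  # ...but let a write trigger an immediate poll via pub-sub
  pubsub-minimum-interval = 1s
}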

Cheers,
Tim

--
Tim Moore
Senior Engineer, Lagom, Lightbend, Inc.

James Roper

Nov 3, 2017, 12:15:07 AM
to Tim Moore, Andy Moreland, Lagom Framework Users
I've been considering whether we should add a feature in Lagom to automatically disable event sharding in dev mode. This looks like an opportune time to do so. There is no need to shard events in dev mode.

Also, two other features that are related:

* Read number of shards from config
* Read tag name from config

The former is useful for obvious reasons. The latter is useful for a specific use case I've encountered: running a multi-datacenter Cassandra setup with one Akka cluster per datacenter (assuming each persistent entity is only ever written to from the same datacenter) and local-quorum writes. In that setup you'll want to run read sides in each datacenter. To do that, you'll want to use a different tag for events produced by each datacenter, so that read sides only consume events emitted by their local datacenter; this avoids missing events due to datacenter partitions when using local-quorum consistency.



Andy Moreland

Nov 6, 2017, 3:15:13 AM
to Lagom Framework Users
Thank you both for the thoughtful replies. I've tried tweaking the Cassandra persistence config like so:

cassandra-query-journal {
  eventual-consistency-delay = 10s
  refresh-interval = 30s
}


I still see extremely high CPU usage, in the same neighborhood as before. The Cassandra query count is way down. I also moved Cassandra onto a separate machine, so I'm fairly certain that it's just the Lagom services I'm looking at now.

I will continue looking around with YourKit.


Andy