AppScale database cleanup


Aakash Jain

Aug 24, 2016, 4:08:56 PM
to AppScale Community
The Cassandra database (created by AppScale) on one of my machines (hosted on GCE) has grown over time and is nearing disk capacity (~250 GB). What is the best way to delete old data from Cassandra?

Also, is there any setting in AppScale that auto-deletes old data (e.g. automatically deleting all data older than 6 months)?

Thanks in advance.

-Aakash

chris....@appscale.com

Aug 24, 2016, 6:27:49 PM
to AppScale Community
Hi Aakash,

The dashboard application has a tendency to keep a lot of historical data. The groomer usually cleans most of this up, but you might want to get a rough distribution of how much data each app is using. You can approximate this with the following CQL query:

SELECT COUNT(*) FROM "ENTITIES__"
WHERE token(key) > token(textAsBlob('appscaledashboard'))
AND token(key) < token(textAsBlob('appscaledashboard\x01'));

The above query will show the number of entity rows that the dashboard application is using. You can do the same for your other apps to see which is using the most data.
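If cqlsh isn't already on your PATH, it ships alongside the other Cassandra binaries. A quick sketch of opening a session against the right keyspace on a database node (the /opt/cassandra path below matches recent AppScale layouts; older versions may keep it under AppDB/cassandra/bin):

/opt/cassandra/cassandra/bin/cqlsh --keyspace Keyspace1

Then paste the query above at the cqlsh prompt.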

If you have any "last_modified" timestamp fields on your entities, that would be the easiest way to delete older entities. Your application can query and delete them as needed.
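For example, a deletion loop along these lines would work. This is just a sketch using the GAE ndb API; the LogRecord model and last_modified field are placeholders for your own kinds:

from datetime import datetime, timedelta

from google.appengine.ext import ndb


class LogRecord(ndb.Model):
    # Placeholder model; any kind with a timestamp property works the same way.
    last_modified = ndb.DateTimeProperty(auto_now=True)


def delete_older_than(days=180, batch=500):
    cutoff = datetime.utcnow() - timedelta(days=days)
    query = LogRecord.query(LogRecord.last_modified < cutoff)
    # Keys-only fetches keep the RPCs small; loop until nothing is left.
    keys = query.fetch(batch, keys_only=True)
    while keys:
        ndb.delete_multi(keys)
        keys = query.fetch(batch, keys_only=True)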

Another thing you may consider is a Cassandra compaction if it has not been run recently. When data is deleted from Cassandra, it still exists on disk until a compaction is run. Be aware that your deployment may suffer performance issues while the compaction is running.

Which version of AppScale are you using? I ask because 3.0 drops the journal table, which typically takes up a very large portion of the disk space used by an AppScale deployment. You may not be ready to upgrade yet, particularly because it requires a conversion process that may take a while on a 250GB cluster, but it's something to keep in mind.

Also, how many nodes do you have in your Cassandra cluster, and what is the replication factor?

To answer your question about a setting for auto-deleting application data: AppScale does not have that feature built in (beyond what the groomer does for the dashboard data). Cassandra does store metadata about when data was written (accessible with the writetime CQL function), so it's theoretically possible to achieve what you are asking for, but it would require a bit of work. You'd want to delete the corresponding entries in the index tables along with the entity data.
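For instance, something along these lines shows the write timestamps (microseconds since the epoch). Note that the value column name here is an assumption about the schema, so run DESCRIBE TABLE "ENTITIES__" first and substitute a real non-key column:

SELECT key, writetime(value) FROM "ENTITIES__" LIMIT 5;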

-Chris

Aakash Jain

Aug 30, 2016, 8:38:58 PM
to AppScale Community
Thanks Chris for the detailed reply.

I have only 1 node in my cluster, with a replication factor of 1 (no replication). It's a single machine deployed on GCE with 13 GB of memory.

I got 14 million entries for appscaledashboard, 3 million entries for one of my apps, and for another app (which holds a huge amount of data) I get "Request did not complete within rpc_timeout." (I don't see any errors in the logs when I run this.)

A few questions:
1) How can I delete all the old appscaledashboard data (we don't use it)?
2) How can I fix the rpc_timeout error? Or is there another (more efficient) command I can run to get approximate numbers?
3) How can I run a Cassandra compaction? Does AppScale have any built-in tool/script for that?

Thanks
Aakash

chris....@appscale.com

Aug 30, 2016, 9:13:34 PM
to AppScale Community
  1. You can do `AppDB/delete_all_records.py cassandra appscaledashboard`. That script has changed fairly recently, so if you are on 2.9.0 or earlier, I would download the latest version of the script from the appscale repo before running it.
  2. If there's only one app whose entity count you don't know, the easiest thing would be to run `/opt/cassandra/cassandra/bin/nodetool cfstats Keyspace1.ENTITIES__` (nodetool might be in AppDB/cassandra/bin if you are on an older version of AppScale). There should be an entry for 'Number of keys (estimate)'. You can take this number and subtract the totals from the other apps to get an approximation of how much data the unknown app is using (see the combined sketch after this list).
  3. `nodetool compact`.
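Putting 2 and 3 together, something like this should do it (paths are for recent AppScale versions; adjust if nodetool lives under AppDB/cassandra/bin on yours):

NODETOOL=/opt/cassandra/cassandra/bin/nodetool
$NODETOOL cfstats Keyspace1.ENTITIES__ | grep -i 'number of keys'
$NODETOOL compact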

Aakash Jain

Sep 19, 2016, 6:25:39 PM
to AppScale Community
Hi Chris,

Thanks for the detailed reply. I was able to delete the appscaledashboard data and run Cassandra compaction.

However, that didn't help as much as I hoped. I still need to delete the old data from my apps.

Is there a way I can add a TTL field to all new data going into Cassandra through AppScale?

Thanks
Aakash

chris....@appscale.com

Sep 19, 2016, 7:15:22 PM
to AppScale Community
The App Engine API doesn't provide a way to specify an entity's TTL, so that's probably not something that we will add as a feature to AppScale for the foreseeable future.

The typical way to accomplish what you are talking about in GAE is to give your entities a timestamp field and add a cron job that queries and deletes entities that are older than a certain date.
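As a sketch, the cron.yaml side of that would look something like this, where /tasks/cleanup is a placeholder URL for whatever handler runs your query-and-delete loop (like the one I sketched earlier in this thread):

cron:
- description: purge entities older than six months
  url: /tasks/cleanup
  schedule: every 24 hours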

If for some reason that approach does not work for you, I can point you to some places in the datastore server code where you can hard-code a TTL value for all data inserted into Cassandra. However, I would highly recommend trying the timestamp field approach first: modifying the datastore code will make AppScale upgrades much more difficult, and it can have unpredictable side effects since we don't test that particular use case.

Aakash Jain

Sep 24, 2016, 12:14:17 PM
to AppScale Community
Hi Chris,

"Querying and deleting entities that are older than a certain date" worked well. Thanks for the suggestion. After deleting old entities and re-running Cassandra compaction, Cassandra space reduced by ~75GB. 

The ENTITIES__ table is now just 12GB. However, the JOURNAL__ table seems to be unchanged (99GB). Is that expected? Can/should I reduce the JOURNAL__ table as well?

Below is the space utilization:

[/opt/appscale/cassandra/Keyspace1]#du -hs *
99G JOURNAL__ 
12G ENTITIES__ 
5.0G ASC_PROPERTY__ 
3.0G COMPOSITE_INDEXES__ 
2.1G DSC_PROPERTY__ 
277M KINDS__ 
16M Standard1 
36K APPS__
36K __key__
36K METADATA__
36K USERS__
4.0K APP_IDS__ 


Thanks
Aakash

chris....@appscale.com

Sep 26, 2016, 12:36:56 AM
to AppScale Community
It's definitely possible to clean up the journal data. The best way, in my opinion, is to upgrade to AppScale 3.1.0. There is a conversion process, so there will be some downtime, but it shouldn't be more than 10-15 min for the size of your data. To upgrade, you should first upgrade your tools with `pip install --upgrade appscale-tools`. After that, you can run `appscale down` and then `appscale upgrade`. After the conversion process finishes, it will completely drop the JOURNAL__ table, which will free up quite a bit of space.
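In short, the sequence is (run from wherever your appscale-tools are installed):

pip install --upgrade appscale-tools
appscale down      # stops the deployment; downtime begins here
appscale upgrade   # runs the conversion, then drops JOURNAL__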

If downtime is not an option, let me know. There are other options, but they are a bit trickier and less thorough.