Database corruption


Aakash Jain

Feb 24, 2017, 5:31:26 PM
to AppScale Community
One of my AppScale apps, which had been working fine for a long time, suddenly stopped working and I am seeing a lot of exceptions. Manually running remote_api also gives me similar exceptions (it sometimes works and sometimes throws an exception for the same query). Also, 4 datastore_server processes are continuously using about 100% CPU. Rebooting the machine doesn't help. Can someone look at these exceptions and suggest what might be wrong? Is my Cassandra database corrupted? How can I debug/fix it?
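
For reference, this is roughly how the remote_api session below was set up (a minimal sketch only; the credentials, handler path, and server address shown are placeholders rather than my real values):

from google.appengine.ext.remote_api import remote_api_stub

def auth_func():
    # Placeholder credentials for illustration only.
    return ('user@example.com', 'password')

# Point the datastore_v3 (and other) API stubs at the remote endpoint.
# Passing None as the app id makes the stub fetch it from the server.
remote_api_stub.ConfigureRemoteApi(
    None, '/_ah/remote_api', auth_func,
    servername='146.148.38.99:8080')  # assumed host:port; adjust as needed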


app-queues> QueueStatus.all(projection=['bot_id'], distinct=True).fetch(1)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/root/appscale/AppServer/google/appengine/ext/db/__init__.py", line 2157, in fetch
    return list(self.run(limit=limit, offset=offset, **kwargs))
  File "/root/appscale/AppServer/google/appengine/ext/db/__init__.py", line 2326, in next
    return self.__model_class.from_entity(self.__iterator.next())
  File "/root/appscale/AppServer/google/appengine/datastore/datastore_query.py", line 2892, in next
    next_batch = self.__batcher.next()
  File "/root/appscale/AppServer/google/appengine/datastore/datastore_query.py", line 2754, in next
    return self.next_batch(self.AT_LEAST_ONE)
  File "/root/appscale/AppServer/google/appengine/datastore/datastore_query.py", line 2791, in next_batch
    batch = self.__next_batch.get_result()
  File "/root/appscale/AppServer/google/appengine/api/apiproxy_stub_map.py", line 615, in get_result
    return self.__get_result_hook(self)
  File "/root/appscale/AppServer/google/appengine/datastore/datastore_query.py", line 2528, in __query_result_hook
    self._batch_shared.conn.check_rpc_success(rpc)
  File "/root/appscale/AppServer/google/appengine/datastore/datastore_rpc.py", line 1222, in check_rpc_success
    rpc.check_success()
  File "/root/appscale/AppServer/google/appengine/api/apiproxy_stub_map.py", line 581, in check_success
    self.__rpc.CheckSuccess()
  File "/root/appscale/AppServer/google/appengine/api/apiproxy_rpc.py", line 155, in _WaitImpl
    self.request, self.response)
  File "/root/appscale/AppServer/google/appengine/ext/remote_api/remote_api_stub.py", line 285, in MakeSyncCall
    handler(request, response)
  File "/root/appscale/AppServer/google/appengine/ext/remote_api/remote_api_stub.py", line 334, in _Dynamic_Next
    self._Dynamic_RunQuery(query, query_result, cursor_id)
  File "/root/appscale/AppServer/google/appengine/ext/remote_api/remote_api_stub.py", line 295, in _Dynamic_RunQuery
    'datastore_v3', 'RunQuery', query, query_result)
  File "/root/appscale/AppServer/google/appengine/ext/remote_api/remote_api_stub.py", line 200, in MakeSyncCall
    self._MakeRealSyncCall(service, call, request, response)
  File "/root/appscale/AppServer/google/appengine/ext/remote_api/remote_api_stub.py", line 234, in _MakeRealSyncCall
    raise pickle.loads(response_pb.exception())
ProtocolBufferReturnError: 500




Logs from datastore_server-4000.log:
ERROR:root:Lock /appscale/apps/appscaledashboard/locks/appscaledashboard%00%00RequestLogLine%3Aapp-queues146.148.38.9915691894024000000%01 in use by /appscale/apps/appscaledashboard/txids/tx0027118185
WARNING:root:Concurrent transaction exception for app id appscaledashboard with info acquire_additional_lock: There is already another transaction using /appscale/apps/appscaledashboard/locks/appscaledashboard%00%00RequestLogLine%3Aapp-queues146.148.38.9915691894024000000%01 lock
WARNING:root:Trying again to acquire lockinfo acquire_additional_lock: There is already another transaction using /appscale/apps/appscaledashboard/locks/appscaledashboard%00%00RequestLogLine%3Aapp-queues146.148.38.9915691894024000000%01 lock with retry #5
ERROR:root:Doing a rollback on transaction id 83796396 for app id app-queues
ERROR:root:((), {})
Traceback (most recent call last):
  File "/root/appscale/AppDB/zkappscale/zktransaction.py", line 1142, in notify_failed_transaction
    for item in self.run_with_retry(self.handle.get_children, txpath):
  File "/usr/local/lib/python2.7/dist-packages/kazoo/client.py", line 267, in _retry
    return self._retry.copy()(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/kazoo/retry.py", line 123, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/kazoo/client.py", line 1031, in get_children
    return self.get_children_async(path, watch, include_data).get()
  File "/usr/local/lib/python2.7/dist-packages/kazoo/handlers/threading.py", line 102, in get
    raise self._exception
NoNodeError: ((), {})

Logs from datastore_server-4001.log:
WARNING:root:Concurrent transaction exception for app id appscaledashboard with info acquire_additional_lock: There is already another transaction using /appscale/apps/appscaledashboard/locks/appscaledashboard%00%00RequestLogLine%3Aapp-queues146.148.38.9915691893915000000%01 lock
WARNING:root:Trying again to acquire lockinfo acquire_additional_lock: There is already another transaction using /appscale/apps/appscaledashboard/locks/appscaledashboard%00%00RequestLogLine%3Aapp-queues146.148.38.9915691893915000000%01 lock with retry #5
ERROR:root:Notify failed transaction removing lock: /appscale/apps/appscaledashboard/txids/tx0027118272
ERROR:root:Notify failed transaction removing lock: /appscale/apps/appscaledashboard/txids/tx0027118273
ERROR:root:Notify failed transaction removing lock: /appscale/apps/appscaledashboard/txids/tx0027118248
ERROR:root:Notify failed transaction removing lock: /appscale/apps/appscaledashboard/txids/tx0027118275
ERROR:root:Doing a rollback on transaction id 83796438 for app id app-queues
ERROR:root:((), {})
Traceback (most recent call last):
  File "/root/appscale/AppDB/zkappscale/zktransaction.py", line 1142, in notify_failed_transaction
    for item in self.run_with_retry(self.handle.get_children, txpath):
  File "/usr/local/lib/python2.7/dist-packages/kazoo/client.py", line 267, in _retry
    return self._retry.copy()(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/kazoo/retry.py", line 123, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/kazoo/client.py", line 1031, in get_children
    return self.get_children_async(path, watch, include_data).get()
  File "/usr/local/lib/python2.7/dist-packages/kazoo/handlers/threading.py", line 102, in get
    raise self._exception
NoNodeError: ((), {})


It would be great if someone could help quickly, as this is a live server and the outage affects many users on my team.

Meni Vaitsi

Feb 24, 2017, 5:48:01 PM
to appscale_community
Hi Aakash,

What is the output of appscale status?
Is this a single-node deployment? Can you also send the output of df -h from all the nodes?

Thanks
-Meni

--
Meni Vaitsi
Software Engineer
AppScale Systems, Inc.


Aakash Jain

Feb 24, 2017, 6:00:45 PM
to appscale_...@googlegroups.com
Hi Meni,

Thanks for the quick reply.

It is a single-node deployment: a single machine on Google Compute Engine. Below is the output of the commands you asked for. Can you please tell me what ProtocolBufferReturnError indicates?


[~]# appscale status
Status of node at 146.148.38.99:
    Currently using 6.0 Percent CPU and 34.00 Percent Memory
    Hard disk is 69 Percent full
    Is currently: load_balancer, taskqueue_master, zookeeper, db_master, taskqueue, memcache, database, shadow, login, appengine
    Database is at 146.148.38.99
    Is in cloud: cloud1
    Current State: Preparing to run AppEngine apps if needed
    Hosting the following apps: app-queues
    The number of AppServers for app app-queues is: 1


[~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       493G  323G  150G  69% /
udev            6.4G  8.0K  6.4G   1% /dev
tmpfs           1.3G  256K  1.3G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            6.4G  400K  6.4G   1% /run/shm


Thanks
Aakash






Chris Donati

Feb 24, 2017, 6:17:10 PM
to appscale_...@googlegroups.com
Hi Aakash,

The ProtocolBufferReturnError indicates that the datastore server encountered an exception that it did not know how to handle. The logs you posted from the datastore are tracebacks from processing a request from the dashboard.
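
To illustrate, the remote_api stub unpickles the server's exception and re-raises it on the client, so while we track this down you could wrap the query in a retry along these lines (just a sketch; the import path below is from the bundled SDK and may differ in your checkout):

import time
from google.net.proto.ProtocolBuffer import ProtocolBufferReturnError

def fetch_with_retry(query, limit=1, attempts=3, delay=2):
    # Retry a datastore fetch that intermittently fails with a server-side 500.
    for attempt in range(attempts):
        try:
            return query.fetch(limit)
        except ProtocolBufferReturnError:
            # The datastore server hit an unhandled exception (HTTP 500).
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

# e.g. fetch_with_retry(QueueStatus.all(projection=['bot_id'], distinct=True))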

It's unlikely that your Cassandra data is corrupted. However, I don't have enough info from these logs to know what the problem is. Would you mind emailing me the full datastore_server logs?


Aakash Jain

Feb 25, 2017, 1:00:30 AM
to appscale_...@googlegroups.com
Thanks Chris for the details.

On further analysis, it seems the problem was caused by too much load on the system. Tightening the firewall rules helped the node come back to normal.

-Aakash




To unsubscribe from this group and all its topics, send an email to appscale_community+unsub...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages