2 Questions regarding server stability


ctasada

Sep 24, 2012, 5:21:22 AM
to project-...@googlegroups.com
Hi everyone,

I'm still having some stability problems in my Voldemort Servers. I have 2 questions regarding it:

1) EnvironmentFailureException in the Voldemort Server
[2012-08-10 12:33:21,330 voldemort.store.bdb.BdbStorageEngine] ERROR com.sleepycat.je.EnvironmentFailureException: (JE 4.1.17) Environment must be closed, caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 4.1.17) /home/voldemort/voldemort/server/bin/../../stores-caronte/data/bdb/protobufTax fetchTarget of 0x2b06/0xc8c0e0 parent IN=589382 IN class=com.sleepycat.je.tree.BIN lastFullVersion=0x2b8e/0x9fdfaa parent.getDirty()=true state=0 LOG_FILE_NOT_FOUND: Log file missing, log is likely invalid. Environment is invalid and must be closed. 

I'm using BDB 4.1.17 with some changes from Vinoth, and although it's working much better, I still have problems from time to time. I'm going to upgrade to 4.1.21 since it seems to fix some of those problems. Are there any known problems with that version?

Also, from time to time I see the following trace:

[2012-08-10 12:33:21,331 voldemort.server.niosocket.AsyncRequestHandler] ERROR  
java.lang.NullPointerException
at voldemort.store.bdb.BdbStorageEngine.attemptCommit(BdbStorageEngine.java:415)
at voldemort.store.bdb.BdbStorageEngine.delete(BdbStorageEngine.java:372)
at voldemort.store.bdb.BdbStorageEngine.delete(BdbStorageEngine.java:68)
at voldemort.store.logging.LoggingStore.delete(LoggingStore.java:90)
at voldemort.store.rebalancing.RedirectingStore.delete(RedirectingStore.java:194)
at voldemort.store.rebalancing.RedirectingStore.delete(RedirectingStore.java:60)
at voldemort.store.invalidmetadata.InvalidMetadataCheckingStore.delete(InvalidMetadataCheckingStore.java:71)
at voldemort.store.invalidmetadata.InvalidMetadataCheckingStore.delete(InvalidMetadataCheckingStore.java:41)
at voldemort.store.DelegatingStore.delete(DelegatingStore.java:49)
at voldemort.store.stats.StatTrackingStore.delete(StatTrackingStore.java:52)
at voldemort.store.stats.StatTrackingStore.delete(StatTrackingStore.java:39)
at voldemort.server.protocol.vold.VoldemortNativeRequestHandler.handleDelete(VoldemortNativeRequestHandler.java:366)
at voldemort.server.protocol.vold.VoldemortNativeRequestHandler.handleRequest(VoldemortNativeRequestHandler.java:72)
at voldemort.server.niosocket.AsyncRequestHandler.read(AsyncRequestHandler.java:120)
at voldemort.utils.SelectorManagerWorker.run(SelectorManagerWorker.java:98)
at voldemort.utils.SelectorManager.run(SelectorManager.java:194)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

I'll apply a patch to solve it, but that only treats the symptom, since the real problem is caused by BDB.
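
As a rough illustration of the kind of defensive patch mentioned above, the sketch below guards a commit against a null transaction handle. It is not the actual Voldemort code: `Txn`, `CommitGuard`, and this `attemptCommit` are hypothetical stand-ins, and it assumes the NPE comes from a transaction reference that is still null when the commit is attempted. A guard like this only hides the symptom; the invalid BDB environment still has to be fixed separately.

```java
// Hypothetical stand-in for a storage-engine transaction handle (not the BDB JE API).
interface Txn {
    void commit() throws Exception;
    void abort();
}

final class CommitGuard {
    private CommitGuard() {}

    /**
     * Tries to commit and reports whether it succeeded.
     * A null handle is treated as "nothing to commit" instead of raising an NPE.
     */
    static boolean attemptCommit(Txn txn) {
        if (txn == null) {
            return false; // assumption: the engine failed before a transaction existed
        }
        try {
            txn.commit();
            return true;
        } catch (Exception e) {
            txn.abort(); // roll back so the handle is not left open
            return false;
        }
    }
}
```

With this guard, `CommitGuard.attemptCommit(null)` returns `false` rather than throwing, so the caller can report a failed delete instead of an unhandled exception.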

2) Right now I have a cluster with 6 nodes. The nodes are not identical: 3 of them are newer, with more cores and memory. My question is: can I configure those servers with more NIO selectors and more bdb.cache memory? Could that cause problems synchronizing metadata between servers?
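
For concreteness, the per-node tuning being asked about might look like the fragment below on one of the newer machines. This is only a sketch: the property names `nio.connector.selectors` and `bdb.cache.size` are assumed to match the Voldemort version in use, and the values are purely illustrative, so check both against your own build before applying them.

```properties
# server.properties on one of the newer nodes (illustrative values only)
# Assumption: these property names match the Voldemort version in use.
nio.connector.selectors=16
bdb.cache.size=4G
```

The older nodes would keep their existing, smaller values in their own copy of the file.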

Thanks.

Regards,
Carlos.

ctasada

Oct 2, 2012, 4:43:45 PM
to project-...@googlegroups.com
Hi guys,

No one?

Vinoth Chandar

Oct 2, 2012, 5:18:37 PM
to project-...@googlegroups.com
Carlos,

I have not seen these before. EnvironmentFailureExceptions do happen if a disk goes bad and such, but not specifically with 4.1.17.
Can you point to the version or branch you are running off? Also, 4.1.21 was basically made with some changes to the BDB5 preupgrade script, so I'm not sure what extra fixes are in there.

For 2), essentially, you will be throwing more resources at some machines. This might be okay in general, but make sure you don't have preferred_reads or something similar set, since if you block on a fast and a slow node, you will only see the performance of the slow node anyway.
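
The point about blocking on a fast and a slow node can be made concrete with a small calculation. The sketch below is not Voldemort code; `QuorumLatency` and `latencyMs` are hypothetical names. It assumes a request completes once a given number of replicas have answered, so the observed latency is that of the slowest replica the request had to wait for.

```java
import java.util.Arrays;

final class QuorumLatency {
    private QuorumLatency() {}

    /**
     * Time until `required` replicas have answered, given each replica's
     * response time in milliseconds: the required-th smallest value.
     */
    static long latencyMs(int required, long... replicaMs) {
        if (required < 1 || required > replicaMs.length) {
            throw new IllegalArgumentException(
                "required must be between 1 and the replica count");
        }
        long[] sorted = replicaMs.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        return sorted[required - 1];
    }
}
```

For a fast node answering in 2 ms and a slow node answering in 20 ms, `QuorumLatency.latencyMs(2, 2L, 20L)` gives 20, while `QuorumLatency.latencyMs(1, 2L, 20L)` gives 2. Extra hardware on the fast node only shows up when the request does not also have to wait for the slow one.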


Thanks
Vinoth

Vinoth Chandar

Oct 2, 2012, 5:26:13 PM
to project-...@googlegroups.com
https://github.com/vinothchandar/voldemort/blob/pidscan/src/java/voldemort/store/bdb/BdbStorageEngine.java#L468

is what we are testing now. So these NPEs should be taken care of. If you are simply slapping 4.1.17 or greater onto 0.96 voldemort, please don't.

Carlos Tasada

Oct 2, 2012, 6:01:18 PM
to project-...@googlegroups.com
Hi Vinoth,

Thanks for your answers. I'll double-check my configurations to make sure that I don't have any bottleneck with the old hardware.

Regarding BDB 4.1.21, you're right: it only has some changes in the preupgrade code, but 4.1.20 includes some other fixes regarding the "lock files".

What do you mean by 'slapping' 4.1.17 onto Voldemort 0.96? My local changes include the library plus code changes. It has been working fine for some time with my modified 0.91 version. I'm still testing the migration to 0.96.

Vinoth Chandar

Oct 2, 2012, 9:24:29 PM
to project-...@googlegroups.com
Since you mentioned you are testing some of my code, I was wondering what exactly you are using.
By "slapping" BDB 4.1.17, what I meant was: are you simply updating the BDB version on an existing Voldemort codebase? The most important change I have made is getting rid of the BDB sorted duplicates usage, which is necessary for any migration to a higher version. Otherwise, you will see disk growth compared to 4.0.92 due to the problems I outlined in the blog.

>> 4.1.20 includes some other fixes regarding the "lock files".
Point 1 in the change log addresses deferred-write DBs, which I don't think relate to Voldemort anyway.

We are testing https://github.com/voldemort/voldemort/compare/master...release-096li8 and if confirmed, we will release some conversion scripts so people can migrate their data over.