NFS failover not working

harish

Feb 9, 2010, 8:41:12 AM
to Scribe Server
I'm using scribe to log over NFS (my primary store) and want to fail
over to the local hard drive (my secondary store) if NFS goes down.
I've tested the secondary with NFS unavailable at startup, and it
works. Lately, however, I've been having issues when NFS goes down
(specifically, when the machine hosting the NFS-mounted drive crashes)
after I already have threads open and streaming logs to scribe. In
that case, the secondary does not seem to be used at all. When NFS
goes down for a short period of time, I don't seem to lose any logging
information and everything seems to remain stable. After a longer
outage, however, I lose information (once NFS finally comes back up)
and the processes producing the logs (apache, etc.) begin to stall.

Does anyone have any clue how I can configure scribe to behave
properly in this scenario? Should I log to the machine hosting the
NFS-mounted drive as my primary, and have that machine run another
scribe instance that relays the messages to the NFS drive?

Thanks,
-Harish

Gautam Roy

Feb 9, 2010, 11:53:40 AM
to scribe...@googlegroups.com
Can you post the configuration you are using? It seems to me that if
your primary goes down for a long time, the secondary could be running
out of memory. In that case it will return TRY_LATER to your client,
which I am guessing will either keep retrying to send messages without
success or drop the messages itself.

harish

Feb 9, 2010, 3:07:25 PM
to Scribe Server
Here is the config (btw, the secondary should be going to disk, not
memory - but I don't see anything appear on the filesystem when NFS
goes down), please do let me know if I have something set up
incorrectly/stupidly:

port=1463
max_msg_per_second=2000000
check_interval=3


# DEFAULT - forward all messages to Scribe on port 1463
<store>
category=default
type=buffer

target_write_size=20480
max_write_interval=1
buffer_send_rate=1
retry_interval=30
retry_interval_range=10

<primary>
type=file
fs_type=std
file_path=/data/var/log
base_filename=thisisoverwritten
rotate_period=daily
rotate_hour=0
rotate_minute=0
</primary>

<secondary>
type=file
fs_type=std
file_path=/var/log/scribe
base_filename=thisisoverwritten
max_size=3000000
</secondary>
</store>

Gautam Roy

Feb 9, 2010, 3:41:02 PM
to scribe...@googlegroups.com
The configuration looks fine. Two things come to mind.
1. Does the user you run scribed as have permission to write to
/var/log/scribe?
2. How much data are you logging? If it's a small amount of data,
20480 bytes or fewer, it will remain in the queue and not be logged.
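
One quick way to check the first point is to test the directory as the
same user scribed runs as. This is only a minimal sketch:
check_writable is a hypothetical helper, not part of scribe, and
/var/log/scribe is the secondary file_path from the config above.

```shell
# Hypothetical helper (not part of scribe): report whether the current
# user can create files in the given directory.
check_writable() {
    dir="$1"
    # Creating a file needs both write and search (execute) permission.
    if [ -d "$dir" ] && [ -w "$dir" ] && [ -x "$dir" ]; then
        echo "writable: $dir"
    else
        echo "NOT writable: $dir"
        return 1
    fi
}

# Run as the user scribed runs as, for example:
# check_writable /var/log/scribe
```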

What do the print messages from the shell you run scribed from tell you?

Best,
Gautam

harish

Feb 9, 2010, 6:01:24 PM
to Scribe Server
Hi Gautam,

1. The permissions should be fine. When I've tested the failover
before without NFS (for instance, by setting the primary store to a
directory which doesn't exist), the messages got logged to the
secondary store directory. I believe I've also tested shutting down
NFS gracefully (I will do so again tonight), and messages were logged
to the secondary store. The specific scenario in which messages don't
get logged to the secondary (and are lost forever) is when NFS crashes
unexpectedly because the hosting machine goes down.
2. The failure occurred over a span of about eight hours on a server
running apache, so it definitely would have exceeded 20480 bytes.

I'm using a script to run scribed in daemon mode, so I'm not sure how
to examine logged messages. Is it not advisable to run scribed in
this way?

Thanks,
-Harish

Gautam Roy

Feb 9, 2010, 7:27:03 PM
to scribe...@googlegroups.com
Hey Harish,

There's nothing wrong with running scribed in daemon mode. We use
LOG_OPER (defined in env_default.h) in a bunch of places in the source
code to print messages that help with debugging when there is a
problem. They are sent to stderr, so if you run scribed from a shell
you should see these messages printed. When daemonizing, look at
wherever stderr is redirected to.
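
If the startup script discards stderr, something along these lines
would keep those messages. It is a sketch only: start_scribed is a
hypothetical wrapper, and the binary, config, and log paths in the
usage comment are placeholders for whatever your install uses.

```shell
# Hypothetical wrapper: start scribed in the background and append its
# stderr (where the LOG_OPER messages go) to a file.
start_scribed() {
    scribed_bin="$1"   # path to the scribed binary
    conf="$2"          # path to the scribe config file
    errlog="$3"        # file that will collect stderr output
    nohup "$scribed_bin" "$conf" >/dev/null 2>>"$errlog" &
}

# Placeholder paths, adjust to your setup:
# start_scribed /pathto/scribed /pathto/scribe.conf /var/log/scribed.stderr
```

You can then follow the file with tail -f while reproducing the
outage.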

Best,
Gautam

harish

Feb 9, 2010, 10:14:41 PM
to Scribe Server
There aren't any error messages coming up through stderr...

Here is a related question - when using scribe over NFS, should I have
the client mount the NFS drive in 'soft' failure mode? I'm not very
familiar with NFS and have left most things to their default values.
I'm now reading about hard vs. soft failure mode and it seems like I
should be using the 'soft' mode, but it would be great to get some
advice from someone more experienced than me.

Gautam Roy

Feb 10, 2010, 2:12:05 AM
to scribe...@googlegroups.com
No idea about the NFS configuration. By error logs I mean the messages
scribed prints. Try running scribe using
> /pathto/scribed configuration

You will see log messages. Something like
[~/Code/scribe/examples]$ ../src/scribed example1.conf
[Tue Feb 9 23:11:02 2010] "setrlimit error (setting max fd size)"
[Tue Feb 9 23:11:02 2010] "STATUS: STARTING"
[Tue Feb 9 23:11:02 2010] "STATUS: configuring"
[Tue Feb 9 23:11:02 2010] "got configuration data from file <example1.conf>"
[Tue Feb 9 23:11:02 2010] "CATEGORY : default"
[Tue Feb 9 23:11:02 2010] "Creating default store"
[Tue Feb 9 23:11:02 2010] "configured <1> stores"
[Tue Feb 9 23:11:02 2010] "STATUS: "
[Tue Feb 9 23:11:02 2010] "STATUS: ALIVE"
[Tue Feb 9 23:11:02 2010] "Starting scribe server on port 1463"
Thrift: Tue Feb 9 23:11:02 2010 libevent 2.0.3-alpha method kqueue


There will be more messages when messages are received/forwarded.

Best,
Gautam


harish

Feb 10, 2010, 2:19:08 PM
to Scribe Server
Hi Gautam,

I'll try to restart scribed in a terminal when I have a moment and see
what gets printed out during an nfs failure - I don't have a great
development environment in which to test things though.

In the meantime, I went ahead and tested soft failure mode on the nfs
client and scribe is now behaving as expected (failing over to a local
drive when nfs goes down). While this fixes the problem, I am still
worried that I'm setting myself up for larger problems down the line -
would be great to hear from any scribe over nfs users out there with
suggestions on how best to configure scribe and an nfs client to work
well with each other during an nfs outage.
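
For anyone comparing notes, a soft mount of the kind described above
might look roughly like this. The server name and export path are
hypothetical, and the timeo/retrans values are only illustrative, not
a recommendation. The mount point matches the primary file_path from
the config earlier in the thread.

```shell
# Hypothetical server and export; adjust to your environment.
NFS_SERVER=nfs-host.example.com
NFS_EXPORT=/export/scribe
MOUNT_POINT=/data/var/log

# soft:    fail a request with an I/O error after the retries are
#          exhausted, instead of blocking indefinitely (hard, the default).
# timeo:   time to wait before retrying, in tenths of a second.
# retrans: number of retries before a soft mount gives up.
NFS_OPTS="soft,timeo=50,retrans=3"

# Print the command for review instead of running it (mounting needs root).
echo "mount -t nfs -o $NFS_OPTS $NFS_SERVER:$NFS_EXPORT $MOUNT_POINT"
```

One caveat worth knowing: the nfs(5) man page warns that soft timeouts
can cause silent data corruption in some cases, so this trades data
integrity on the NFS side for client responsiveness.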

-Harish
