[scribe-server] Failover problems when writing to HDFS

Wouter de Bie

May 7, 2010, 11:09:13 AM
to Scribe Server
Hi all,

We're currently having problems writing to HDFS when the connection to
the namenode becomes unavailable. hdfsWrite() always returns the number
of bytes it was asked to write, even if nothing was actually written.
The HDFS client then keeps trying to reconnect, and does so for 45
minutes. This is done in Client.java, line 307:

} catch (SocketTimeoutException toe) {
/* The max number of retries is 45,
* which amounts to 20s*45 = 15 minutes retries.
*/
handleConnectionFailure(timeoutFailures++, 45, toe);
}


There is some code in hdfs.c that tries to catch an exception from the
Java client, but it never seems to get that exception (or maybe only
after 45 minutes). This is in hdfs.c, line 1005:

if (invokeMethod(env, NULL, &jExc, INSTANCE, jOutputStream,
HADOOP_OSTRM, "write",
"([B)V", jbWarray) != 0) {
errno = errnoFromException(jExc, env,
"org.apache.hadoop.fs."
"FSDataOutputStream::write");
length = -1;
}

Here we should receive -1, but apparently we never do, since the
timeout takes 45 minutes. On Monday we'll try lowering the 45-minute
timeout, but since it's hardcoded we're not sure how to proceed, or
whether it will actually work.
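
Roughly, the change we have in mind looks like the sketch below. This
is only a sketch against our tree; the maxTimeoutRetries field and the
config key are our own invention, not something in stock Hadoop, so
treat it as an illustration rather than a real patch:

// Hypothetical local patch to org.apache.hadoop.ipc.Client -- make the
// retry limit configurable instead of hardcoding 45 attempts.
private int maxTimeoutRetries;

// in the Client constructor, where conf is the Hadoop Configuration:
this.maxTimeoutRetries =
    conf.getInt("ipc.client.connect.max.retries.on.timeouts", 45);

// in the method containing the catch block quoted above:
} catch (SocketTimeoutException toe) {
  /* Retry at most maxTimeoutRetries times (roughly 20s per attempt)
   * before letting the failure propagate to the caller. */
  handleConnectionFailure(timeoutFailures++, maxTimeoutRetries, toe);
}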

When rolling the file, scribe tries to open a new HDFS file, and when
that fails it starts writing to the secondary store. But if the open
never fails, we never write to the secondary store. On top of that, the
data that was supposedly written to HDFS is still sitting in the HDFS
client. I assume it gets written once a reconnect happens (when the
namenode is up again), but that might take a while. After some time,
scribe just returns TRY_AGAIN, because it can't write at all.

For us, this is kind of a blocker, since we don't want scribe to go
down when the namenode is down.

Have any of you experienced the same issue?

// Wouter

Travis Crawford

May 7, 2010, 7:07:23 PM
to scribe...@googlegroups.com
Interesting find.

Do all your scribes write directly to HDFS? Do you have a tree of
scribes? If you log to a local scribe and forward to a scribe that
writes to HDFS, producers buffer locally when the HDFS-writing scribe
starts pushing back.
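
On the producer side that setup looks roughly like this (just a
sketch; the hostname, port and paths are placeholders, not our actual
config):

<store>
category=default
type=buffer
retry_interval=30

  <primary>
  type=network
  remote_host=scribe-hdfs-writer.example.com
  remote_port=1463
  </primary>

  <secondary>
  type=file
  fs_type=std
  file_path=/tmp/scribe_buffer
  base_filename=thisisoverwritten
  max_size=40000000
  </secondary>

</store>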

In practice HDFS goes down so infrequently that we haven't fully
explored failure modes around that case.

--travis



prutser

May 8, 2010, 1:32:06 PM
to scribe...@googlegroups.com
On Sat, May 8, 2010 at 1:07 AM, Travis Crawford <travisc...@gmail.com> wrote:
> Interesting find.
>
> Do all your scribes write directly to HDFS? Do you have a tree of
> scribes?

We currently log to a local scribe that logs directly to HDFS.

> If you log to a local scribe and forward to a scribe that writes to
> HDFS, producers buffer locally when the HDFS-writing scribe starts
> pushing back.

That might be an option, but running extra scribes to work around the issue feels a bit like a hack. Is that the way you do it? I haven't tried writing to other scribes yet; how does failover work there?

> In practice HDFS goes down so infrequently that we haven't fully
> explored failure modes around that case.

When using a tree of scribes it shouldn't be such a big problem, but in our case it is. When HDFS goes down, we can't write much before scribe fails. This means we need to fail over to the secondary store as soon as possible.

// Wouter

Gautam Roy

May 10, 2010, 1:27:23 PM
to scribe...@googlegroups.com
Hi,

We use the following type of configuration when writing to HDFS files,
which is basically a buffer store with another buffer store as its
primary. We use two HDFS clusters on the same physical cluster to deal
with the namenode being a single point of failure.

With the replay_buffer=no configuration, scribe does try to transfer
data from secondary to primary when, say, hdfs://dfsscribe3 comes back
up again after being down for a while.

So basically the functionality becomes: try writing to the first HDFS
cluster, else try writing to the second HDFS cluster, and if both fail,
buffer on local disk.

With this kind of setup you will have to set up the right copier scripts
to collect your data from two logical clusters.

Hope that helps,

Gautam



port=1456
max_msg_per_second=1000000
check_interval=1
max_queue_size=100000000
num_thrift_server_threads=3

# DEFAULT
<store>
category=default
type=buffer
max_write_interval=1
retry_interval=120
buffer_send_rate=5
must_succeed=yes

  <primary>
  type=buffer
  retry_interval=600
  replay_buffer=no

    <primary>
    type=file
    fs_type=hdfs
    file_path=hdfs://dfsscribe3:9000/user/scribe
    create_symlink=no
    use_hostname_sub_directory=yes
    base_filename=thisisoverwritten
    max_size=1000000000
    rotate_period=hourly
    add_newlines=1
    write_stats=no
    rotate_on_reopen=yes
    </primary>

    <secondary>
    type=file
    fs_type=hdfs
    file_path=hdfs://dfsscribe4:9000/user/scribe
    create_symlink=no
    use_hostname_sub_directory=yes
    base_filename=thisisoverwritten
    max_size=1000000000
    rotate_period=hourly
    add_newlines=1
    write_stats=no
    rotate_on_reopen=yes
    </secondary>

  </primary>

  <secondary>
  type=file
  file_path=/mnt/d0/scribe
  base_filename=thisisoverwritten
  max_size=40000000
  </secondary>

</store>








prutser

May 11, 2010, 2:59:15 AM
to scribe...@googlegroups.com
Hi Gautam,

Thanks for the info! This is very helpful!

// Wouter