UTF-8 Encoding problems with tail and hdfs collectorSink

dboek

Sep 2, 2010, 3:17:37 AM

to Flume Users

I'm currently using Flume 0.9.1

Flume tails a file that is encoded in UTF-8; opening the file shows me
ä, ö, ü and other special characters correctly. When I open the seq
files in Hadoop, which were transmitted and stored by Flume through
the collectorSink in raw format, all special characters like ä, ö, ü
are broken and show up as sequences like Ã¤. It seems there might be a
conversion between UTF-8 and another encoding somewhere. Or is the raw
output format the problem? Or do I have to tell Flume explicitly to
handle everything as UTF-8?
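
The symptom above (ä turning into Ã¤) is the usual pattern when UTF-8
bytes get interpreted as a single-byte encoding such as ISO-8859-1. A
minimal Python sketch, run outside Flume and purely illustrative, that
reproduces the pattern and checks whether a chunk of stored bytes looks
double-encoded. Both helper names are hypothetical, and the check is
only a heuristic:

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode those bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("latin-1")


def looks_double_encoded(raw: bytes) -> bool:
    """Heuristic: True if undoing one UTF-8 -> Latin-1 round trip still
    leaves valid UTF-8 that contains at least one non-ASCII byte."""
    try:
        once = raw.decode("utf-8")       # what a UTF-8 viewer would show
        inner = once.encode("latin-1")   # undo a byte-to-char widening
        inner.decode("utf-8")            # must still be valid UTF-8
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False
    return any(b >= 0x80 for b in inner)


# "\u00e4" is ä; hex output makes the byte-level difference visible.
print(misread_as_latin1("\u00e4"))
print("\u00e4".encode("utf-8").hex())                     # clean UTF-8
print(misread_as_latin1("\u00e4").encode("utf-8").hex())  # double-encoded
```

If the bytes in the seq file look like the second hex dump, the data was
already damaged before it was written, and the viewer is not at fault.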

Thanks for any help,

Daniel

Jonathan Hsieh

Sep 3, 2010, 7:54:08 PM

to dboek, Flume Users

Daniel,

We have tried to keep everything as byte arrays to avoid character encoding problems, but it looks like we may have missed some spots.

I've looked at the RawOutputFormat and it doesn't look like the culprit.

I think the bug is in TailSource -- it reads lines using a method (readLine) which does character set interpretation.
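
To illustrate the suspected mechanism, here is a rough Python sketch, not Flume code. It assumes the readLine in question behaves like java.io.RandomAccessFile.readLine, which turns each byte into one character by zero-extending it; whether TailSource actually uses that method is an assumption here. The function names are hypothetical. Re-encoding such a string as UTF-8 produces the damaged bytes, while keeping each line as a byte array leaves the data intact.

```python
import io


def read_lines_widening_bytes(stream):
    """Mimic a readLine that maps every byte to one character (assumed behavior)."""
    for raw in stream:
        yield "".join(chr(b) for b in raw.rstrip(b"\r\n"))


def read_lines_as_bytes(stream):
    """Byte-preserving alternative: split on newlines, never decode."""
    for raw in stream:
        yield raw.rstrip(b"\r\n")


data = "Gr\u00fc\u00dfe\n".encode("utf-8")  # "Grüße" plus newline, as UTF-8

# Widened characters, later written out as UTF-8: every non-ASCII byte doubles.
bad = next(read_lines_widening_bytes(io.BytesIO(data))).encode("utf-8")

# Bytes in, bytes out: identical to the original line.
good = next(read_lines_as_bytes(io.BytesIO(data)))

print(len(good), good.hex())
print(len(bad), bad.hex())
```

The second line of output is longer because each of the two bytes of ü and ß has itself been encoded as a two-byte UTF-8 sequence.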

Can you file this as a bug in the JIRA? (issues.cloudera.org)

Thanks,

Jon.

This sounds like a bug.

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera

// j...@cloudera.com

dan...@skycheck.com

Sep 6, 2010, 1:44:10 PM

to Flume Users

Dear Jon,

Thanks for your help. I just filed a bug for it.

Regards,

Daniel

