UTF-8 Encoding problems with tail and hdfs collectorSink

dboek

Sep 2, 2010, 3:17:37 AM
to Flume Users
I'm currently using Flume 0.9.1

Flume tails a file that is encoded in UTF-8; opening the file shows me
ä, ö, ü and other special characters correctly. When I open the seq files
in Hadoop, which were transmitted and stored by Flume through the
collectorSink in raw format, all special characters like ä, ö, ü are
broken, e.g. ä. It seems there is a conversion somewhere between UTF-8
and another encoding. Or is the raw output format the problem? Or do I
have to tell Flume explicitly to handle everything in UTF-8?

Thanks for any help,

Daniel

Jonathan Hsieh

Sep 3, 2010, 7:54:08 PM
to dboek, Flume Users
Daniel,

We have tried to keep everything as byte arrays to avoid character encoding problems, but it looks like we may have missed some spots.

I've looked at the RawOutputFormat and it doesn't look like the culprit.

I think the bug is in TailSource: it reads lines using a method (readLine) that does character set interpretation.
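
To illustrate the kind of failure described here: the thread does not say which readLine TailSource calls, so the sketch below assumes it behaves like java.io.RandomAccessFile.readLine(), which turns every byte into one char instead of decoding UTF-8. The class and helper names (ReadLineEncodingDemo, readLineLossy, repair, writeSample) are made up for this example and are not Flume code.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical demo class, not part of Flume.
class ReadLineEncodingDemo {

    // Reads the first line with RandomAccessFile.readLine(). That method maps
    // each byte to one char (high eight bits zero), which is effectively an
    // ISO-8859-1 decode, so multi-byte UTF-8 sequences come out as two chars.
    static String readLineLossy(File f) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            return raf.readLine();
        }
    }

    // Recovers the original bytes from the byte-per-char string and decodes
    // them as UTF-8. Works only because no byte was dropped, just misread.
    static String repair(String lossy) {
        byte[] raw = lossy.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Writes the given text to a temporary file as UTF-8.
    static File writeSample(String text) throws IOException {
        File f = File.createTempFile("tail-demo", ".log");
        f.deleteOnExit();
        Files.write(f.toPath(), text.getBytes(StandardCharsets.UTF_8));
        return f;
    }

    public static void main(String[] args) throws IOException {
        // "\u00e4,\u00f6,\u00fc" is "ä,ö,ü", escaped so the source file's own
        // encoding cannot affect the demo.
        String original = "\u00e4,\u00f6,\u00fc";
        String lossy = readLineLossy(writeSample(original + "\n"));

        System.out.println("chars written: " + original.length());
        System.out.println("chars read:    " + lossy.length());
        System.out.println("matches original:  " + lossy.equals(original));
        System.out.println("repaired matches:  " + repair(lossy).equals(original));
    }
}
```

Under that assumption, ä (UTF-8 bytes C3 A4) comes back as the two characters Ã and ¤, which matches the symptom Daniel reported. Reading raw bytes and splitting on the newline byte, or decoding explicitly as UTF-8, would avoid the misinterpretation.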

Can you file this as a bug in the JIRA? (issues.cloudera.org)

Thanks,
Jon.


This sounds like a bug.  
--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera

dan...@skycheck.com

Sep 6, 2010, 1:44:10 PM
to Flume Users
Dear Jon,

Thanks for your help. I just filed a bug for it.


Regards,

Daniel