The ability to specify TextLine encoding

Jiacheng Guo

Aug 6, 2012, 3:13:26 AM
to cascadi...@googlegroups.com
Hi,
   I'm new to Scalding. I'm trying to read a file encoded in GB2312 on Hadoop in TextLine format. How can I specify the encoding?

Oscar Boykin

Aug 6, 2012, 7:09:14 PM
to cascadi...@googlegroups.com, Chris Wensel
+Chris

You will need to subclass.

See the trait here:


Here is the class we are wrapping:


From a quick glance, I don't see how to set the encoding there.

Chris?

--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To view this discussion on the web visit https://groups.google.com/d/msg/cascading-user/-/tQdJ553GISMJ.
To post to this group, send email to cascadi...@googlegroups.com.
To unsubscribe from this group, send email to cascading-use...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en.



--
Oscar Boykin :: @posco :: https://twitter.com/intent/user?screen_name=posco

Chris K Wensel

Aug 6, 2012, 7:28:12 PM
to Oscar Boykin, cascadi...@googlegroups.com
TextLine uses the Hadoop TextInputFormat, so it only has the features that the Hadoop TextInputFormat supports.


ckw

Chris K Wensel

Aug 6, 2012, 7:30:39 PM
to cascadi...@googlegroups.com, Oscar Boykin

Jiacheng Guo

Aug 7, 2012, 3:31:28 AM
to cascadi...@googlegroups.com, Oscar Boykin
Actually, it is possible to read GBK from TextInputFormat; it is just a little tricky:
new String(text.getBytes(), 0, text.getLength(), "GBK")  // text is the Hadoop Text value

Cascading, however, uses Text.toString(), which tries to decode the result as UTF-8. The toString method even tries to fix the string by replacing illegal characters, which makes it impossible for me to fix the encoding in Scalding.
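
The difference between the two decodings can be sketched with plain JDK calls, without Hadoop. This is only an illustration, not Cascading's or Hadoop's code: the object name GbkDecodeSketch and the helper decode are hypothetical, a byte array plus an explicit length stands in for the Text buffer, and it assumes the JVM ships the GBK charset.

```scala
import java.nio.charset.Charset

// Hypothetical sketch: a byte array plus a length stands in for a Hadoop Text value.
object GbkDecodeSketch {
  // Decode only the first `length` bytes of the buffer with the named charset.
  // Malformed input is replaced with U+FFFD rather than raising an error.
  def decode(buffer: Array[Byte], length: Int, charsetName: String): String =
    new String(buffer, 0, length, Charset.forName(charsetName))

  def main(args: Array[String]): Unit = {
    val original = "\u4e2d\u6587" // two Chinese characters, written as escapes
    val gbkBytes = original.getBytes(Charset.forName("GBK"))

    // Simulate a backing buffer that is longer than the valid data,
    // which is why the explicit length matters.
    val buffer = java.util.Arrays.copyOf(gbkBytes, gbkBytes.length + 4)

    val asGbk = decode(buffer, gbkBytes.length, "GBK")
    val asUtf8 = decode(buffer, gbkBytes.length, "UTF-8")

    println(asGbk == original)  // GBK decoding recovers the text
    println(asUtf8 == original) // UTF-8 decoding does not
  }
}
```

Once the bytes have been decoded as UTF-8, the invalid sequences have already been replaced, so re-encoding the resulting string cannot bring the original GBK bytes back. That is why the charset has to be applied to the raw bytes.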



Chris K Wensel

Aug 7, 2012, 12:18:46 PM
to cascadi...@googlegroups.com, Oscar Boykin
I'll look to see if I can include an encoding option in the API at some point.

ckw


Jiacheng Guo

Aug 8, 2012, 10:20:50 PM
to cascadi...@googlegroups.com, Oscar Boykin
Thanks, this will really help.

Francesco Montecuccoli

Feb 2, 2014, 5:35:11 PM
to cascadi...@googlegroups.com, Oscar Boykin
Hello,
I added charset encoding support to the TextLine source by subclassing the TextLineScheme trait (in com/twitter/scalding/FileSource.scala) like this:

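// Imports assumed to match the aliases used in FileSource.scala
import cascading.scheme.hadoop.{TextLine => CHTextLine}
import cascading.scheme.local.{TextLine => CLTextLine}
import cascading.tuple.Fields
import com.twitter.scalding._

// TextLineScheme variant whose charset can be overridden (defaults to UTF-8)
trait TextLineCharsetScheme extends TextLineScheme {
  val charset = "UTF-8"
  override def localScheme = new CLTextLine(new Fields("offset","line"), Fields.ALL, charset)
  override def hdfsScheme = HadoopSchemeInstance(new CHTextLine(CHTextLine.DEFAULT_SOURCE_FIELDS, charset))
}

// TextLine-like source that takes the charset name as its second argument
case class TextLineCharset(p : String, c : String) extends FixedPathSource(p) with TextLineCharsetScheme {
  override val charset = c
}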

By using the TextLineCharset source instead of TextLine, you can pass the encoding string.
I managed to compile it with Scala 2.9.2 on Scalding 0.8.5 and it works fine, but with Scala 2.10 I get a type mismatch error that I cannot figure out:

type mismatch;
[error]  found   : cascading.scheme.hadoop.TextLine
[error]  required: cascading.scheme.Scheme[_, _, _, _, _]
[error] Note: org.apache.hadoop.mapred.JobConf <: Any (and cascading.scheme.hadoop.TextLine <: cascading.scheme.Scheme[org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,Array[Object],Array[Object]]), but Java-defined class Scheme is invariant in type Config.
[error] You may wish to investigate a wildcard type such as `_ <: Any`. (SLS 3.2.10)
[error] Note: org.apache.hadoop.mapred.RecordReader <: Any (and cascading.scheme.hadoop.TextLine <: cascading.scheme.Scheme[org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,Array[Object],Array[Object]]), but Java-defined class Scheme is invariant in type Input.
[error] You may wish to investigate a wildcard type such as `_ <: Any`. (SLS 3.2.10)
[error] Note: org.apache.hadoop.mapred.OutputCollector <: Any (and cascading.scheme.hadoop.TextLine <: cascading.scheme.Scheme[org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,Array[Object],Array[Object]]), but Java-defined class Scheme is invariant in type Output.
[error] You may wish to investigate a wildcard type such as `_ <: Any`. (SLS 3.2.10)
[error] Note: Array[Object] <: Any (and cascading.scheme.hadoop.TextLine <: cascading.scheme.Scheme[org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,Array[Object],Array[Object]]), but Java-defined class Scheme is invariant in type SourceContext.
[error] You may wish to investigate a wildcard type such as `_ <: Any`. (SLS 3.2.10)
[error] Note: Array[Object] <: Any (and cascading.scheme.hadoop.TextLine <: cascading.scheme.Scheme[org.apache.hadoop.mapred.JobConf,org.apache.hadoop.mapred.RecordReader,org.apache.hadoop.mapred.OutputCollector,Array[Object],Array[Object]]), but Java-defined class Scheme is invariant in type SinkContext.
[error] You may wish to investigate a wildcard type such as `_ <: Any`. (SLS 3.2.10)
[error]   override def hdfsScheme = HadoopSchemeInstance(new CHTextLine(CHTextLine.DEFAULT_SOURCE_FIELDS, charset))
[error]                                                  ^
[error] one error found


How can this type mismatch be solved?
Thanks
Francesco