Issue reading HDFS files that contain Chinese characters


young rex

Jul 5, 2013, 5:47:20 AM
to rha...@googlegroups.com
Hi all,
I want to process HDFS files that contain Chinese characters, with tab as the delimiter; a sample is below.
3592367796239058	#趣找网#分享个好玩的活动给你们,都来看看吧。[礼物]  地址:http://t.cn/zH3sf2K
3592379158700163	#不旅行不青春#大家都来围观吧,给力奖品必须给力的来支持,好运来袭吧[礼物]  地址:http://t.cn/zH380tF
3592379204606985	#家的感动瞬间#喜欢就转发吧,下一个幸运会是你吗  地址:http://t.cn/zH33qsN
3592379209041862	#老友吧#[笑哈哈]期待好运呀,哇哈,支持!!!给力活动支持  地址:http://t.cn/zH334NZ
3592379209041745	#电信充值卡#[笑哈哈]哈哈,终于等到这个活动喽。等了好长时间了,来吧,大奖  地址:http://t.cn/zH1E5md
3592379255187784	好活动一定要珍惜,人生最珍贵的不是“得不到”和“已失去”而是现在能把握的幸福!平淡是真。[给力]
My code is below:

#!/usr/bin/env Rscript
library(rmr2)

map <- function(k, v) {
  # emit the first tab-separated field as key, the second as value
  keyval(v[[1]], v[[2]])
}

reduce <- function(k, vv) {
  keyval(k, vv)
}

mapreduce(
  input = "/user/superman/senti/weibo-data.txt",
  output = "/user/superman/senti/output",
  input.format = make.input.format("csv", sep = "\t"),
  output.format = "text",
  map = map)
But I got the following error:

Loading required package: rmr2
Loading required package: Rcpp
Loading required package: methods
Loading required package: RJSONIO
Loading required package: bitops
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
Error in value[[3L]](cond) : line 56 did not have 2 elements
Calls: <Anonymous> ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>
Execution halted
13/07/05 17:42:55 INFO streaming.PipeMapRed: MRErrorThread done
13/07/05 17:42:55 INFO streaming.PipeMapRed: log:null
R/W/S=761/0/0 in:NA [rec/s] out:NA [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null
HOST=null
USER=superman
HADOOP_USER=null
last Hadoop input: |null|
last tool output: |null|
Date: Fri Jul 05 17:42:55 CST 2013
java.io.IOException: Broken pipe
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:282)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
        at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
        at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:110)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:662)


13/07/05 17:42:55 WARN streaming.PipeMapRed: java.io.IOException: Broken pipe
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:282)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
        at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:569)
        at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:125)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:662)

13/07/05 17:42:55 INFO streaming.PipeMapRed: mapRedFinished
13/07/05 17:42:55 WARN streaming.PipeMapRed: java.io.IOException: Bad file descriptor
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:282)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
        at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:569)
        at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
        at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:662)

13/07/05 17:42:55 INFO streaming.PipeMapRed: mapRedFinished
13/07/05 17:42:55 INFO mapred.LocalJobRunner: Map task executor complete.
13/07/05 17:42:55 INFO streaming.StreamJob:  map 0%  reduce 0%
13/07/05 17:42:55 WARN mapred.LocalJobRunner: job_local1776539990_0001
java.lang.Exception: java.io.IOException: log:null
R/W/S=761/0/0 in:NA [rec/s] out:NA [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null
HOST=null
USER=superman
HADOOP_USER=null
last Hadoop input: |null|
last tool output: |null|
Date: Fri Jul 05 17:42:55 CST 2013
java.io.IOException: Broken pipe
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:282)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
        at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
        at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:110)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:662)


        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:404)
Caused by: java.io.IOException: log:null
R/W/S=761/0/0 in:NA [rec/s] out:NA [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null
HOST=null
USER=superman
HADOOP_USER=null
last Hadoop input: |null|
last tool output: |null|
Date: Fri Jul 05 17:42:55 CST 2013
java.io.IOException: Broken pipe
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:282)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
        at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
        at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:110)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:662)


        at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:126)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:662)
13/07/05 17:42:56 INFO streaming.StreamJob: Job running in-process (local Hadoop)
13/07/05 17:42:56 ERROR streaming.StreamJob: Job not successful. Error: NA
13/07/05 17:42:56 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce,  :
  hadoop streaming failed with error code 1
I am wondering whether rmr2 can handle Chinese characters. Any comments would be appreciated.

Antonio Piccolboni

Jul 5, 2013, 1:30:56 PM
to RHadoop Google Group
I think it can, but you need to use UTF-8 encoding, which may not be the first choice for Chinese characters. Beyond UTF-8 you need to define a binary input format, as the text variety only admits UTF-8. See HADOOP-1722 for details on the Java side. So you need something like

make.input.format(mode = "binary", ...)

That will switch streaming to accept non-UTF-8 data. The other argument you need to provide is a function of a binary readable connection and a number: it should read approximately that many records from the connection (the exact number is not important) and return a key-value pair generated with keyval, the same pair that will be passed to the map function as two separate arguments, key and value.
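To make the shape concrete, here is an untested sketch (the tab-splitting reader is just an illustration for this data set, not rmr2 code):

library(rmr2)

tsv.reader <- function(con, nrecs) {
  # read roughly nrecs lines; return NULL to signal end of input
  lines <- readLines(con, n = nrecs)
  if (length(lines) == 0) return(NULL)
  parts <- strsplit(lines, "\t", fixed = TRUE)
  # first field as key, second as value: the same pair the map function gets
  keyval(sapply(parts, `[`, 1),
         sapply(parts, `[`, 2))
}

tsv.format <- make.input.format(mode = "binary", format = tsv.reader)
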
Have you tried to read a small subset of your data from local disk with read.table? If that works, you could try

make.input.format(mode= "binary", format = make.csv.input.format(), streaming.format = "org.apache.hadoop.streaming.AutoInputFormat")


Pass it to mapreduce as the input.format argument and see what happens. help(make.input.format) will be a useful read in this process.
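Putting the pieces together with the paths from the original post (again just a sketch):

fmt <- make.input.format(
  mode = "binary",
  format = make.csv.input.format(),
  streaming.format = "org.apache.hadoop.streaming.AutoInputFormat")

mapreduce(
  input = "/user/superman/senti/weibo-data.txt",
  output = "/user/superman/senti/output",
  input.format = fmt,
  map = map)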


Antonio




young rex

Jul 6, 2013, 11:10:09 AM
to rha...@googlegroups.com, ant...@piccolboni.info
Hi Antonio,
Thanks very much for your quick response. I tried to read the file from local disk with read.table; it can't be read correctly, but I can read it correctly with read.csv. Could you please give me some advice based on this situation? Thanks.

Rex

Antonio Piccolboni

Jul 8, 2013, 12:24:51 PM
to rha...@googlegroups.com, ant...@piccolboni.info
read.csv is nothing but read.table with certain options set in a certain way, so my advice is the same as in the previous message, with this one variant:

make.input.format(mode = "binary", format = make.csv.input.format(<options to read.table here>), streaming.format = "org.apache.hadoop.streaming.AutoInputFormat")

Where it says <options to read.table here>, you should provide the same options that make read.table behave like read.csv; help(read.csv) describes them.
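Concretely, help(read.csv) shows that read.csv(file) is equivalent to

read.table(file, header = TRUE, sep = ",", quote = "\"",
           dec = ".", fill = TRUE, comment.char = "")

so those are the options to pass (minus header and file, which mapreduce controls).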


Antonio

Antonio Piccolboni

Jul 8, 2013, 12:55:57 PM
to rha...@googlegroups.com, ant...@piccolboni.info
Could you provide me with a small sample of the data, provided there aren't any confidentiality issues attached to it? I saw you posted a few records already. You can send it to rha...@revolutionanalytics.com. 100 records will do.


Antonio

young rex

Jul 8, 2013, 11:56:40 PM
to rha...@googlegroups.com
Hi Antonio,
Thanks for your help. Please find the data file attached; I sent you 10,000 records, so feel free to pick 100 of them.

BTW, when I follow your advice to use make.csv.input.format, I get an error message indicating that there is no make.csv.input.format function. I am using rmr2_2.2.1.tar.gz and R version 2.15.0 (2012-03-30). Could you please share your comments about this?


--
------------------
13916777959
msn:youn...@hotmail.com
weibo-data.rar

Antonio Piccolboni

Jul 9, 2013, 1:25:02 AM
to RHadoop Google Group

It's a private function, so you need to prefix it with rmr2::: (the triple colon reaches unexported objects).
This is not encouraged or supported; it's just for the sake of the experiment. If it works, we'll find a better way to reuse that function. But I think there will be other problems; let me test your data and I'll let you know.
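For example (a sketch only, since it relies on rmr2 internals; the read.table options shown are placeholders):

fmt <- make.input.format(
  mode = "binary",
  format = rmr2:::make.csv.input.format(sep = "\t", fill = TRUE),  # ::: reaches the unexported function
  streaming.format = "org.apache.hadoop.streaming.AutoInputFormat")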

Antonio


young rex

Jul 9, 2013, 3:37:40 AM
to rha...@googlegroups.com
It's okay, thanks Antonio.

Antonio Piccolboni

Jul 9, 2013, 12:40:54 PM
to RHadoop Google Group
Could you share the exact read.csv command with which I am supposed to read this? No matter what I do, I get a single column and about twice as many records as I should. If we can't read in the small, we won't in the large.


Antonio

young rex

Jul 9, 2013, 10:15:49 PM
to rha...@googlegroups.com
Hi Antonio,
Please find my command below:
a <- read.csv(file("/home/superman/isocial/sentiment/weibo-data.txt"), header = FALSE, sep = "\t", encoding = "utf-8")
And I use the following commands to check whether the data is loaded successfully:
a[[2]][1]
a[[1]][2]

As Chinese is not your language, it could be difficult for you to check whether the file has loaded successfully; please feel free to let me know if you have any questions about this.

Rex
Best Regards

Antonio Piccolboni

Jul 9, 2013, 10:36:27 PM
to RHadoop Google Group
This input format works for me

make.input.format("csv", encoding="utf-8", quote = "\"", fill = TRUE,comment.char = "", sep = "\t")

The idea is that if you can read your file with read.table (of which read.csv is just a variant with different defaults) and the encoding is ASCII or UTF-8, you should be able to read it with mapreduce. The only options you can't specify are header, file, nrows, col.names and row.names. It's all in the help for make.input.format.

Antonio

young rex

Jul 9, 2013, 11:36:13 PM
to rha...@googlegroups.com
Hi Antonio,
Thanks very much for your help; your solution works for me, but there are some problems in the HDFS file I generated. I don't know the reason; maybe these problems are caused by the Chinese characters. I will continue to investigate.
FYI, the errors are below:
The first 6 lines are not correct, as they should have ids with them.
For the id 3591188953463792, it should be followed by "转发微博".

1.没有心情的心情是怎样的心情。2.在不同的经度,纬度,邂逅不同的人。3.一个人只有在静下心来的时候,才能看到最真的自己。4.一个人一句话,足以让你回忆一辈子。5.别动不动就哭,世界上没那么多爱你的人。6.跟自己说声对不起,因为曾经为了别人难为了自己。7.青春并不忧伤,却被我们演绎的如此凄凉。	
#1元租赁美国百年户外品牌整套露营装#我在不停的转发,就让我奖中奖中奖吧,我好期盼能够得到哦,希望在博主这能好运连连  地址:http://t.cn/zHde8D0	
#2013年中大促#强烈支持转发。转发此微博:期待~  地址:http://t.cn/zHnmcik	
#20万拥有太湖一套房#小编辛苦啦,我会一直支持你们!  地址:http://t.cn/zH31Te7	
3530471382430953	转发微博
3533065332432400	是一种刚强的意志 //@白百何:青春不是生命的一个阶段,它是一种精神状态[赞]
3533208782562449	就要旅游去了,但是怎么高兴不起来呢。  http://t.cn/zjqskLv
3533785037480677	 //@柯蓝://@雾满拦江:
3538004087967675	【快来微博搜索 找朋友 吧!】通过关注分组帮你寻觅散落在微博上的身边好友;独有星座筛选功能,帮你在朋友关系圈中找感情和事业最佳配对!http://t.cn/zjC6PAh
3540683807368088	缓慢思考  答案便萌生出来  于是透彻领悟  听到的声响 泪如雨下  在意的 却总是伤害我
3543473439477398	春节版的微博Android客户端好喜庆~我给大家拜个年送个红包#让红包飞# mini汽车、千台HTC手机等你拿!猛戳:http://t.cn/zYLWxxq 新版特性:1.新增精美新年主题;2.全新话题功能;3.支持分享到微信!http://t.cn/zY5SGaL
3543519027697663	[馋嘴]@乐属我佳
3544567473334631	 @乐属我佳 如来,[嘻嘻]
3544679737811062	 @乐属我佳
3544679839057754	[兔子]@乐属我佳
3545285781756747	转发微博
3545286054459769	//@徐凡1991:转发微博
3545286972995826	转发微博
3545322477016534	我正在使用#微博二维码#,扫描下面的二维码就能关注我啦,快来和我一起玩转微博吧~ 我在:http://t.cn/zYJCbNm
3545323240567558	明天,明天噢! 貌似很纠结,也很期待。[嘻嘻]不管怎样都很开心的,有情人的陪伴…[偷笑] 我在这里:http://t.cn/zjJM7rt
3545554682606600	//@于正1978:[哈哈][哈哈][哈哈][哈哈]接财神!@宁财神
3591188953463792	
3546623147781339	转发微博
3546671864646880	 @乐属我佳
3546689577518621	转发微博
3548057734274342	转发微博
3548058556060679	人生就是如此吧,得不到了永远会是好的,应该学会知足。[嘻嘻]@刘佳佳0915
3548060955658148	真纠结!@乐属我佳
3548061404066954	转发微博
3548801442416379	转发微博
3548801471778707	转发微博
3548801518165870	转发微博
3548812335449099	我刚领取了微号731678197,【搜索】或【@】731678197就可以快速找到我。超炫靓号、专属标识、闪亮勋章~ 我就是微博潮人,你也来加入吧( http://t.cn/zOWeMad )
3548817380688886	某 这句戳到我了
3548820400566585	即使发现自己错了 也要坚持走下去 你给我的 是一种近乎偏执的态度
3548820933535019	有木有花样美男的感觉 就是那种靠脸蛋取胜的男孩纸 哦哈哈
3552205195578206	
3548939498646751	原来可以这样…
3577512506827103	
3549273289767137	元宵节爬山的哦!嘿嘿…  开心…
3549272576500446	
3549273545265088	转发微博
3549506509247128	我刚刚更换了个人主页封面图,欢迎大家围观哦~@刘佳佳0915 http://weibo.com/3211291673/profile
3549507198113305	早上还没吃饭呢,就这样馋我
3549507349353875	
3549507528524241	转发微博
3549691901399324	今天我看见它开了,有种快乐在心里! 你就是美,就是美,美、美、美、、、、[偷笑] 我在:http://t.cn/zjxblNy
3549692040670001	转发微博
3549692523110872	转发微博
3549692875201660	转发微博
3549693131146222	好酷

Antonio Piccolboni

Jul 10, 2013, 2:34:04 AM
to RHadoop Google Group
Could you pick a more sensible output format? Like csv with the same encoding as the input?

Antonio

young rex

Jul 10, 2013, 9:59:45 PM
to rha...@googlegroups.com, ant...@piccolboni.info
Hi Antonio,
I am sorry for the late reply; I just tested it with the following code:
output.format = make.output.format("csv", sep = "\t")

The output is not correct: all the Chinese characters disappeared. You can check it below; I don't know if it happens on your side:
2124 1750	
2132 997	
2133 1540	
2135 3549	
2134 3438	
2136 1457	
2137 1348	
2123 3312	
2138 63	
2139 3219	
1287 1548	
2140 27	
1288 2804	
1289 3619


On Wednesday, July 10, 2013 at 2:34:04 PM UTC+8, Antonio Piccolboni wrote:

young rex

Jul 11, 2013, 5:35:56 AM
to rha...@googlegroups.com, ant...@piccolboni.info
Hi Antonio,
I modified my code as below; now the output is better than the previous one.

map <- function(k,v) {
key =as.character(v[[1]])
value=as.character(v[[2]])
   keyval(key,value)
}

mapreduce(
  input = "/user/superman/senti/weibo-data.txt",
  output = "/user/superman/senti/output",
  input.format=make.input.format("csv", encoding="utf-8", quote = "\"", fill = TRUE,comment.char = "", sep = "\t"),
  output.format=make.output.format("csv",sep="\t",quote=FALSE),
  map = map
)

Antonio Piccolboni

Jul 11, 2013, 2:45:14 PM
to RHadoop Google Group
I think you need to set the encoding of the output as well; see help(write.table). The options you give to make.output.format("csv", ...) are passed to write.table, with a few exceptions.


Antonio

young rex

Jul 12, 2013, 10:21:39 AM
to rha...@googlegroups.com, ant...@piccolboni.info
Thanks Antonio. When I add the following code to my script, I get garbled characters in my output. Should I convert the encoding of the string?

map <- function(k,v) {
key =as.character(v[[1]])
value=as.character(v[[2]])
        some_txt=segmentCN(value,nature=TRUE, nosymbol=FALSE, recognition = FALSE)
   keyval(c(some_txt),key)
}

mapreduce(
  input = "/user/superman/senti/weibo-data.txt",
  output = "/user/superman/senti/output",
  input.format=make.input.format("csv", encoding="utf-8", quote = "\"", fill = TRUE,comment.char = "", sep = "\t"),
  output.format=make.output.format("csv",sep="\t",quote=FALSE),
  map = map
)

segmentCN is a function from the package Rwordseg; it is used to segment a Chinese sentence. You can find the package here: https://r-forge.r-project.org/scm/viewvc.php/pkg/Rwordseg/?diff_format=s&root=rweibo&sortby=file&pathrev=37 and install it with install.packages("Rwordseg", repos = "http://R-Forge.R-project.org").

Rex


On Friday, July 12, 2013 at 2:45:14 AM UTC+8, Antonio Piccolboni wrote:

Antonio Piccolboni

Jul 12, 2013, 1:09:47 PM
to RHadoop Google Group
Interesting, but how is it related to the problem at hand? Experiment with write.table. You may find that if you specify the same encoding you specified for the input (because they are the same strings, after all), it works:

write.table(zz, fileEncoding = "utf-8", file = "/tmp/weibo.out")

(zz holds a few lines read in with read.table from the sample you sent me.) I think that's exactly what I suggested two messages ago; sorry if I am repeating myself, but maybe this time it will be clearer. Now take whatever options worked with write.table (with a few exceptions listed in help(make.output.format)) and make an output format with them:

make.output.format("csv",  fileEncoding="utf-8")

Now you can use this with to.dfs or mapreduce as in 

mapreduce("/Users/antonio//Downloads/weibo-data.txt", "/tmp/weibo.out", input.format=make.input.format("csv", encoding="utf-8", quote = "\"", fill = TRUE,comment.char = "", sep = "\t"),output.format = make.output.format("csv",  fileEncoding="utf-8"))

Of course you may still want to adjust the quoting and other details, and this mapreduce call is just the identity, so provide sensible map and/or reduce functions; but it seems to me this addresses your question.

Antonio

young rex

Jul 15, 2013, 4:49:55 AM
to rha...@googlegroups.com, ant...@piccolboni.info
Hi Antonio,
Sorry, I think I didn't make it clear; let me clarify below.
When I execute the following code, I get the correct result, with no garbled characters in the result file:
map <- function(k,v) {
key =as.character(v[[1]])
value=as.character(v[[2]])
       
   keyval(key, value)
}

But when I execute the following code, I find lots of garbled characters in the result file:

map <- function(k,v) {
key =as.character(v[[1]])
value=as.character(v[[2]])
        some_txt=segmentCN(value,nature=TRUE, nosymbol=FALSE, recognition = FALSE)
   keyval(c(some_txt),key)
}

The result file:
c("�, "f", "�", "�, "�, "�� "��", "�, "�, "�� "��", "�, "�� "��", "�, "��", "�, "�� "��", "�, "�� "��", "�, "��", "�, "�� "��", "�, "�[", "�, "�� "��", "�, "�� "��", "�, "�h", "ttp", "t", "cn", "zh", "3", "sf", "2", "k")	3592367796239058
c("�, "�� "��", "�, "�� "��", "�, "�� "��", "�, "�� "��", "�, "�� "�, "�� "��", "�, "�� "��", "�, "�� "�� "��", "�, "�� "�, "�� "��", "�, "�� "�", "�, "�, "�� "��", "�, "�� "��", "�, "�, "�, "�� "��", "�, "�� "��", "�, "�h", "ttp", "t", "cn", "zh", "380", "tf")	3592379158700163
c("�, "�� "��", "�, "��", "�, "�, "�, "�� "��", "�, "�, "�� "��", "�, "�� "��", "��", "�, "�� "��", "�, "�� "��", "�, "�� "��", "�, "�� "��", "�, "�h", "ttp", "t", "cn", "zh", "33", "qsn")	3592379204606985
c("Կ", "�� "��", "�, "�, "�� "��", "�, "�]", "�", "�", "��", "�, "�� "��", "��, "�, "�� "��", "�, "�� "��", "�, "�, "�� "��", "�, "�� "��", "�, "�� "��", "�, "�� "�", "�, "�, "�, "�� "��", "�, "�h", "ttp", "t", "cn", "zh", "334", "nz")	3592379209041862
c("�, "�� "��", "�, "�� "�� "��", "�, "�� "��", "�, "�]", "�, "�� "��", "�, "�� "��", "�, "�� "��", "�, "�� "��", "�, "�, "�� "�", "�, "�� "�� "��", "�, "��, "�, "�� "��", "�, "�� "�� "�, "ɽ", "�", "�, "��", "�, "�, "�� "��", "�, "�h", "ttp", "t", "cn", "zh", "1", "e", "5", "md")	3592379209041745
So I guess the output of segmentCN(value, nature = TRUE, nosymbol = FALSE, recognition = FALSE) is not correct. To verify whether segmentCN can produce correct output, I tested it with the following code:

map <- function(k,v) {
key =as.character(v[[1]])
value="非常感谢安东尼奥"

        some_txt=segmentCN(value,nature=TRUE, nosymbol=FALSE, recognition = FALSE)
   keyval(c(some_txt),key)
}

I get the correct result, so I believe the function segmentCN itself has no problem. I guess the root cause is that the input to segmentCN does not have the correct encoding, but I don't know how to verify that. Could you please share your comments about this?
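One way I can think of to check it is a map that only reports what R believes the encoding to be (just a sketch):

map.check <- function(k, v) {
  value <- as.character(v[[2]])
  # Encoding() returns "UTF-8", "latin1" or "unknown" per string;
  # "unknown" means R will assume the worker's native locale
  keyval(Encoding(value), 1)
}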

I also post the source code of segmentCN here:


##' A function to segment a Chinese sentence into words.
##' 
##' @title Segment a sentence.
##' @param strwords A Chinese sentence in UTF-8.
##' @param analyzer A Java analyzer object.
##' @param nature Whether to recognise the nature of the words.
##' @param nosymbol Whether to keep symbols in the sentence.
##' @param recognition Whether to recognise person names automatically.
##' @return A vector of words (a list if the input is a vector) which have been segmented.
##' @author Jian Li <\email{rwe...@sina.com}>
##' @examples \dontrun{
##' segmentCN("hello world!")
##' }

segmentCN <- function(strwords, analyzer = get("Analyzer", envir = .RwordsegEnv),
                      nature = FALSE, nosymbol = TRUE, recognition = TRUE) {
  if (!is.character(strwords)) stop("Please input character!")
  if (length(strwords) == 1) {
    if (nature) {
      strfunc <- ifelse(recognition, "segWordNature", "segWordNatureNoRecog")
      OUT <- .jcall(analyzer, "S", strfunc, strwords)
      Encoding(OUT) <- "UTF-8"
      OUT <- gsub(" +", " ", OUT)
      if (nzchar(OUT)) {
        OUT <- strsplit(OUT, split = " ")[[1]]
        OUT <- gsub(":.*$", "", OUT)
        splitlist <- strsplit(OUT[nzchar(OUT)], split = "\\|")
        OUT <- sapply(splitlist, FUN = function(X) X[[1]])
        names(OUT) <- sapply(splitlist, FUN = function(X) X[[2]])
      }
    } else {
      strfunc <- ifelse(recognition, "segWord", "segWordNoRecog")
      OUT <- .jcall(analyzer, "S", strfunc, strwords)
      Encoding(OUT) <- "UTF-8"
      OUT <- gsub(" +", " ", OUT)
      if (nzchar(OUT)) OUT <- strsplit(OUT, split = " ")[[1]]
    }
    if (nosymbol && any(nzchar(OUT))) {
      OUT <- OUT[grep("[\u4e00-\u9fa5]|[a-z]|[0-9]", OUT)]
      OUT <- gsub("\\[|\\]", "", OUT)
      OUT <- OUT[nzchar(OUT)]
    }
    if (length(OUT) == 0) OUT <- ""
    return(OUT)
  } else {
    return(lapply(strwords, segmentCN, analyzer, nature, nosymbol))
  }
}

Thanks 
Rex



On Saturday, July 13, 2013 at 1:09:47 AM UTC+8, Antonio Piccolboni wrote:

Antonio Piccolboni

Jul 17, 2013, 11:55:34 AM
to rha...@googlegroups.com, ant...@piccolboni.info
Have you tried to fix the output encoding as I explained? 


Antonio



young rex

Jul 18, 2013, 1:16:56 AM
to rha...@googlegroups.com, ant...@piccolboni.info
Hi Antonio,
Thanks for your feedback. I tried your suggestion as below:
output.format = make.output.format("csv", sep = "\t", quote = FALSE, fileEncoding = "utf-8")

But it doesn't work for me. Then I tried the following code, and it works:
map <- function(k,v) {
key =as.character(v[[1]])
value=as.character(v[[2]])
        Encoding(value) = "UTF-8"  # the new line
        some_txt=segmentCN(value,nature=TRUE, nosymbol=FALSE, recognition = FALSE)
   keyval(c(some_txt),key)
}

Now the problem is resolved, but I don't know why it works after I added the new line. Anyway, I really appreciate your help; you are so kind and enthusiastic. Thanks very much.
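A plausible explanation: Encoding(x) <- "UTF-8" converts nothing; it merely declares that the bytes already in x are UTF-8. Strings coming out of the deserialized input are often marked "unknown", in which case downstream code (such as the rJava call inside segmentCN) assumes the native locale and mangles them. A minimal illustration, assuming a UTF-8 system:

x <- rawToChar(as.raw(c(0xe4, 0xb8, 0xad, 0xe6, 0x96, 0x87)))  # the UTF-8 bytes of "中文"
Encoding(x)               # "unknown": R has made no claim about these bytes yet
Encoding(x) <- "UTF-8"    # declare (not convert) the existing bytes as UTF-8
x                         # now treated as UTF-8 end to end
iconv(x, "UTF-8", "GBK")  # by contrast, iconv actually re-encodes the bytes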