Convert Unicode to Text in Ruby


Liliana Gaspar

Jan 11, 2014, 2:45:33 PM
to ruby_i...@googlegroups.com
Hello,

I downloaded a tab-delimited file and am trying to write a script to read it, but the lines are coming out like this:

"\xFF\xFEu\x00s\x00e\x00r\x00-\x00r\x00e\x00p\x00o\x00r\x00t\x00-\x00s\x00e\x00a
\x00r\x00c\x00h\x00-\x00r\x00e\x00s\x00u\x00l\x00t\x00s\x00-\x002\x000\x001\x004
\x000\x001\x000\x009\x001\x002\x000\x006\x000\x007\x00-\x00G\x00M\x00T\x00.\x00\
t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\
x00\r\x00\n"

I believe I need to convert (unicode?) to simple text. Is there a string method that does this? I searched the documentation, but couldn't understand which one does the trick. Below is what I see when I open the file in a regular text editor, for the line quoted above:

"user-report-search-results-20140109120607-GMT."
(and a series of tabs)

Thank you !

Eamonn Webster

Jan 13, 2014, 7:47:52 AM
to ruby_i...@googlegroups.com
The file is in UTF-16. Googling for "ruby reading utf-16 files" yields:

File.open(path, 'rb:utf-16') do |f|
  ...
end
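
As a hedged illustration of the idea above, here is a minimal sketch. It assumes the file is little-endian UTF-16 with a byte order mark (the \xFF\xFE prefix in the question suggests this) and asks Ruby to transcode to UTF-8 while reading. The helper name utf16_lines and the sample data are made up for the example.

```ruby
require 'tempfile'

# Read a UTF-16LE file line by line as UTF-8 text.
# "BOM|UTF-16LE" strips a leading byte order mark if one is present;
# binary mode ("rb") is needed for BOM detection with UTF-16.
# utf16_lines is a hypothetical helper name.
def utf16_lines(path)
  File.open(path, 'rb:BOM|UTF-16LE:UTF-8') do |f|
    f.each_line.map(&:chomp)
  end
end

# Build a small sample file (made-up data) and read it back.
Tempfile.create(['sample', '.txt']) do |file|
  file.close
  data = "\uFEFFuser-report\t\t\r\nLogin Name\tNumber of Posts\r\n"
  File.binwrite(file.path, data.encode('UTF-16LE'))
  p utf16_lines(file.path)
end
```

Because the transcoding happens before the lines are split, each line comes back as an ordinary UTF-8 String.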



--
You received this message because you are subscribed to the Google Groups "Ruby Ireland" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ruby_ireland...@googlegroups.com.
To post to this group, send an email to ruby_i...@googlegroups.com.
Visit this group at http://groups.google.com/group/ruby_ireland.
For more options, visit https://groups.google.com/groups/opt_out.

Andrei Balcanasu

Jan 13, 2014, 9:41:56 AM
to ruby_i...@googlegroups.com
For safety, add this to the top of the file if you are using Ruby 1.9:

#encoding: utf-8


--
Regards,
Andrei

Liliana Gaspar

Jan 14, 2014, 11:17:18 AM
to ruby_i...@googlegroups.com
Hello Eamonn,

I've tried this code, just to print the first lines and check:

count = 0
File.open(filename,'rb:utf-16') do |f|
    f.each_line do |line|
        p line
        count += 1
        if count == 5 then break end
    end
end


And this is the result:

=> first line:

"\uFEFFuser-report-search-results-20140109120607-GMT.\t\t\t\t\t\t\t\t\t\t\t\t\t\t\r\x0A"

(there is still something left there but I can work around it with a regex)

=> second line and next ones look like this:
"\x00\x4C\x00\x6F\x00\x67\x00\x69\x00\x6E\x00\x20\x00\x4E\x00\x61\x00\x6D\x00\x65\x00\t\x00\x4E\x00\x75\x00\x6D\x00\x62\x00\x65\x00\x72\x00\x20\x00\x6F\x00\x66\
x00\x20\x00\x50\x00\x6F\x00\x73\x00\x74\x00\x73\x00\t\x00\x54\x00\x6F\x00\x74\x00\x61\x00\x6C\x00\x20\x00\x4C\x00\x6F\x00\x67\x00\x69\x00\x6E\x00\x73\x00\t\x00\
x54\x00\x6F\x00\x74\x00\x61\x00\x6C\x00\x20\x00\x4D\x00\x69\x00\x6E\x00\x75\x00\x74\x00\x65\x00\x73\x00\x20\x00\x4F\x00\x6E\x00\x6C\x00\x69\x00\x6E\x00\x65\x00\
t\x00\x54\x00\x6F\x00\x74\x00\x61\x00\x6C\x00\x20\x00\x41\x00\x64\x00\x6D\x00\x69" [....etc...]

In notepad++ this is how the second line shows, and what I would need when reading my file:
"Login Name\tNumber of Posts\tTotal Logins\tTotal Minutes Online" [...etc...]

Thanks !
:-)

Liliana Gaspar

Jan 14, 2014, 11:19:12 AM
to ruby_i...@googlegroups.com
Thanks Andrei :-)

Oisin Hurley

Jan 14, 2014, 11:41:21 AM
to ruby_i...@googlegroups.com, Liliana Gaspar
 
> "\uFEFFuser-report-search-results-20140109120607-GMT.\t\t\t\t\t\t\t\t\t\t\t\t\t\t\r\x0A"

So the 

  p line 

is in effect calling 'puts line.inspect', rather than simply printing the thing to the terminal - I guess that is why the BOM (\uFEFF) and the tab characters and the Windows newline \r\x0A at the end are visible. Maybe. 

Use 

 puts line

just to do a common-or-garden print of the line.
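
To make the difference concrete, here is a small sketch. The sample string is made up to resemble the first line in the thread (BOM, text, tabs, Windows newline).

```ruby
# Made-up sample resembling the first line in the thread: BOM, text, tabs, CRLF.
SAMPLE_LINE = "\uFEFFuser-report\t\t\r\n"

# p prints the inspect form, so escapes such as \t and \r\n are visible.
p SAMPLE_LINE
puts SAMPLE_LINE.inspect  # prints the same text as the p call above

# puts writes the characters themselves: the BOM and tabs are not visible
# on screen, but they are still part of the String.
puts SAMPLE_LINE
puts SAMPLE_LINE.length
```

Either way the String is the same; only the way it is displayed differs.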

In terms of your goal, if you are looking to read in a TSV file (tab-separated values), the built-in CSV processor in Ruby 1.9.x should be able to manage it 

require 'csv'
table = CSV.read(filename, { :col_sep => "\t" })

should slurp the whole file in. I’m guessing here at what you are attempting to achieve :)

 —oh

--
Oisin Hurley
@oisin

Liliana Gaspar

Jan 14, 2014, 12:57:34 PM
to ruby_i...@googlegroups.com, Liliana Gaspar
Hi Oisin,

puts does print the line without the BOM; however, when reading the file the system still sees the BOM inside the String, the same as it does with \n or \t, in which case it still has to be dealt with. Hence my use of p, so I can actually see everything I have to process when reading those lines... I think... but I couldn't thoroughly confirm this. Look at this:

With the code:

count = 0
File.open(filename,'rb:utf-16') do |f|
    f.each_line do |line|
        p line  # =>
        puts line.class # =>
        puts line
        puts line.class
        if line.start_with?('user-report') then puts true else puts false end

        count += 1
        if count == 5 then break end
    end
end


I get:


"\uFEFFuser-report-search-results-20140109120607-GMT.\t\t\t\t\t\t\t\t\t\t\t\t\t\t\r\x0A"
String
user-report-search-results-20140109120607-GMT.
?String
C:/Users/liliana_gaspar/RUBY/count_email_verified.rb:63:in `start_with?': incompatible character encodings: UTF-16 and UTF-8 (Encoding::CompatibilityError)


As for CSV, it doesn't accept \r or \n at the end of the line. In other words, it would have to be a clean / perfect tab-delimited file. Here is the error:

Unquoted fields do not allow \r or \n (line 2). (CSV::MalformedCSVError)

Thanks

Liliana Gaspar

Jan 14, 2014, 1:04:00 PM
to ruby_i...@googlegroups.com, Liliana Gaspar
Sorry, I forgot to give you my code for CSV

count = 0
CSV.read(filename, { :col_sep => "\t" }) do |row|
    p row

    count += 1
    if count == 5 then break end
end

Andrei Balcanasu

Jan 14, 2014, 1:04:20 PM
to ruby_i...@googlegroups.com
Check this page about the CSV class. Especially for the list of options: http://ruby-doc.org/stdlib-1.9.3/libdoc/csv/rdoc/CSV.html#method-c-new

You should be able to configure :row_sep => "\r\n"

Regarding the encoding error from start_with?, I think this could be sorted by changing the header to
#encoding: utf-16

You have a file encoded in UTF-16 and you are trying to search a UTF-8 string over UTF-16 text. They need to be of the same encoding.
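
The mismatch described above can be reproduced in isolation. This is a minimal sketch with a made-up sample string; it assumes the data is held as UTF-16LE and shows two ways of bringing both sides to the same encoding.

```ruby
# Hypothetical sample: text held in UTF-16LE, as it might be after reading the file.
utf16_line = "user-report-search".encode('UTF-16LE')

# Comparing against a UTF-8 literal raises, because the encodings are incompatible.
begin
  utf16_line.start_with?('user-report')
rescue Encoding::CompatibilityError => e
  puts "Error: #{e.message}"
end

# Option 1: transcode the data to UTF-8 before comparing.
puts utf16_line.encode('UTF-8').start_with?('user-report')

# Option 2: transcode the literal to the data's encoding instead.
puts utf16_line.start_with?('user-report'.encode('UTF-16LE'))
```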


--
Regards,
Andrei

Eamonn Webster

Jan 14, 2014, 1:19:36 PM
to ruby_i...@googlegroups.com
The #encoding: utf-16 comment sets the encoding of the Ruby script itself, not of the data.

To convert each row call row!.encode('utf-8')
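
As a rough sketch of converting a row, assuming the row is an Array of String fields that are still UTF-16LE: Array has no encode method, so each field is transcoded in turn. The helper name to_utf8 and the sample fields are made up for the example.

```ruby
# Hypothetical row: an Array of fields still encoded as UTF-16LE.
row = ["Login Name", "Number of Posts"].map { |s| s.encode('UTF-16LE') }

# Array has no encode method, so transcode each String field individually.
# to_utf8 is a hypothetical helper name.
def to_utf8(fields)
  fields.map { |field| field.encode('UTF-8') }
end

utf8_row = to_utf8(row)
p utf8_row
p utf8_row.map(&:encoding)
```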

Liliana Gaspar

Jan 14, 2014, 3:35:29 PM
to ruby_i...@googlegroups.com
Hello,

In fact,    #encoding: utf-16   produces this error:

=> UTF-16 is not ASCII compatible (ArgumentError)

So I went back to     #encoding: utf-8    ...should I keep it?

As for the CSV, I noticed I needed to use the .foreach method if I wanted to read each line. I tested this code:

count = 0
CSV.foreach(filename, { :row_sep => '\r\n', :col_sep => '\t' }) do |row|
    p row!.encode('utf-8')

    count += 1
    if count == 5 then break end
end

...which still produces this error:
Unquoted fields do not allow \r or \n (line 1). (CSV::MalformedCSVError)

After some searches:
- Based on http://stackoverflow.com/questions/11548637/csv-unquoted-fields-do-not-allow-r-or-n-line-2 I tested :row_sep => :auto and made some progress on the first line.
- I believe Eamonn meant row.encode!('utf-8') and not row!.encode('utf-8').
- I cannot call .encode! directly on row, because it's an Array and I get a NoMethodError. I have to call it separately on each column / index of row, which is a String.
- Using either 'utf-8' or 'utf-16' as the argument of encode! isn't enough. The output is "\u00A0\u25A0u\u0000s\u0000e\u0000r\u0000-"[...etc...] and "\uFEFF\u00A0\u25A0u\u0000s\u0000e\u0000r\u0000-\u0000r\u0000e\u0000p\u0000o"[...etc...] respectively. I had to specify the destination encoding as 'ASCII', so row[0].encode!('ASCII', 'utf-16'). utf-8 produces an error, as the BOM is not recognised: "\xFF" on UTF-8 (Encoding::InvalidByteSequenceError)

So now I got this version of the code:

count = 0
CSV.foreach(filename, { :row_sep => :auto, :col_sep => '\t' }) do |row|
    p row[0].encode!('ASCII','utf-16')

    count += 1
    if count == 5 then break end
end

...which reads the first line with some minor problems(*) that I can sort out in other ways, but then produces an error again on the second. Here is the output:


"user-report-search-results-20140109120607-GMT.\t\t\t\t\t\t\t\t\t\t\t\t\t\t"
Unquoted fields do not allow \r or \n (line 2). (CSV::MalformedCSVError)

(*)
- It seems the '\t' is not being recognised; if it were, we wouldn't see it in the output, and row[0] would be "user-report-search-results-20140109120607-GMT." only. But I can trim it.

Andrei Balcanasu

Jan 14, 2014, 3:45:19 PM
to ruby_i...@googlegroups.com
Hi,

This is great progress. I was wrong about utf-16 in the header.

I think you should use double quotes (") instead of single quotes (') when you write escape sequences such as \t, \n, \r

This is what I get in irb:

1.9.3-p327 :005 > puts 'a\ta'
a\ta
 => nil
1.9.3-p327 :006 > puts "a\ta"
a a
 => nil

'\t' is a literal backslash followed by the letter t, while "\t" is the tab character.
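
The same point as a short script, in case it helps to see why a single-quoted separator never matches a real tab (a small sketch with a made-up sample string):

```ruby
single = '\t'   # two characters: a backslash and the letter t
double = "\t"   # one character: a tab

p single.length
p double.length

# Splitting tab-separated text only works with the real tab character.
p "a\tb".split(single)
p "a\tb".split(double)
```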

so your line should be:

CSV.foreach(filename, { :row_sep => :auto, :col_sep => "\t" }) do |row|

Don’t copy paste the above line as my email program adds fancy double quotes.

Note: Encoding is always horrible to deal with.

-- 
Regards,
Andrei Balcanasu

Eamonn Webster

Jan 14, 2014, 3:53:10 PM
to ruby_i...@googlegroups.com
This works for me:
The encoding 'UTF-16:UTF-8' means read as UTF-16, then convert to UTF-8:

CSV.foreach(filename, { :row_sep => :auto, :col_sep => "\t", :encoding => 'UTF-16:UTF-8' }) do |row|
  p row
end
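
For anyone who wants to try this end to end, here is a self-contained sketch along the same lines. It is an illustration under assumptions, not the exact code from the thread: it uses 'UTF-16LE:UTF-8' (assuming a little-endian file), passes the options as keyword arguments rather than a braces hash, and strips a leading BOM by hand. The helper name read_utf16_tsv and the sample data are made up.

```ruby
require 'csv'
require 'tempfile'

# Parse a tab-separated UTF-16LE file into rows of UTF-8 strings.
# read_utf16_tsv is a hypothetical helper name, not part of the CSV library.
def read_utf16_tsv(path)
  rows = []
  # "UTF-16LE:UTF-8" means: read as UTF-16LE, transcode to UTF-8 before parsing.
  CSV.foreach(path, col_sep: "\t", encoding: 'UTF-16LE:UTF-8') { |row| rows << row }
  # If the BOM came through as U+FEFF on the first field, remove it.
  first = rows.dig(0, 0)
  rows[0][0] = first.delete_prefix("\uFEFF") if first
  rows
end

# Build a small sample file (made-up data): BOM, UTF-16LE text, Windows newlines.
Tempfile.create(['report', '.tsv']) do |file|
  file.close
  data = "\uFEFFLogin Name\tNumber of Posts\r\nalice\t3\r\n"
  File.binwrite(file.path, data.encode('UTF-16LE'))
  p read_utf16_tsv(file.path)
end
```

Note the double-quoted "\t" for col_sep; a single-quoted '\t' would not match the tabs in the file.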

Liliana Gaspar

Jan 15, 2014, 4:49:53 AM
to ruby_i...@googlegroups.com
Hi Eamonn,

That worked perfectly!

There was still the issue that the row wasn't getting split by '\t', but I found out that I needed to use "\t", so double quotation marks rather than single quotation marks.

Thank you all for your help.

Liliana Gaspar

Jan 15, 2014, 4:51:45 AM
to ruby_i...@googlegroups.com
Hi Andrei,

Just saw your comment now about the double quotation marks. It's good information. I did realise I had to use them in the end, but I didn't know why.. and now I do! :)

Oisin Hurley

Jan 15, 2014, 3:49:41 PM
to ruby_i...@googlegroups.com
Great to hear you got it sorted! Go @rubyireland :) Consider putting
this tiny code snippet in a gist (gist.github.com) with a descriptive
name so that Google can index it and people looking for a similar
solution can find it faster!

--oh