French sentences appearing weird in Rails Website

Ritvvij Parrikh

May 15, 2013, 7:30:45 AM

I have a Rails app. One of my clients is importing French text, which
is displayed incorrectly. See the example below:

1. str = "--- \nFrench: \"3. Combien de r\\xC3\\xA9gions y a-t-il au Cameroon?\"\nEnglish: 3. How many regions are there in Cameroon?\n"

Can someone assist please?

I am thinking along the following lines:

2. str = str.gsub('"', '')

3. (Need to add a line here that replaces each \\ in str with a single \)

4. str = str.force_encoding("iso-8859-1")

5. str = str.encode('UTF-8')

In step 3, I was thinking of something like

str = str.gsub(/\\\\/, "\\")

Or, if possible, somehow capture the output of puts (or a similar
function) back into str. For example:

> puts str

---

French: 3. Combien de r\xC3\xA9gions y a-t-il au Cameroon?

English: 3. How many regions are there in Cameroon?

Even that would work. Can someone please assist?
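
One thing worth noting: once the literal in step 1 is parsed, str holds a single backslash before each "xC3"/"xA9", so a pattern that looks for two consecutive backslashes would find nothing to replace. What probably needs to happen instead is turning the literal text "\xC3\xA9" into the two bytes it names. A minimal sketch of that idea follows, assuming the escapes always have the form backslash, "x", two hex digits, and that the bytes are UTF-8; the helper name unescape_hex_bytes is hypothetical and not from the thread.

```ruby
# Hypothetical helper (not from the thread): converts runs of literal
# "\xNN" escape text into the bytes they name and labels them as UTF-8.
def unescape_hex_bytes(text)
  text.gsub(/(?:\\x\h\h)+/) do |run|
    # Collect every two-digit hex value in this run of escapes.
    bytes = run.scan(/\\x(\h\h)/).flatten.map(&:hex)
    # Pack the values into a byte string, then declare it to be UTF-8.
    bytes.pack('C*').force_encoding('UTF-8')
  end
end

str = "French: 3. Combien de r\\xC3\\xA9gions y a-t-il au Cameroon?"
puts unescape_hex_bytes(str)
```

Consecutive escapes are handled as one run so that a multi-byte UTF-8 character is rebuilt in a single step rather than byte by byte.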

Simon Krahnke

May 17, 2013, 4:52:35 PM

* Ritvvij Parrikh <ritv...@gmail.com> (2013-05-15) wrote:

>I have a Rails app. One of my clients is importing French Text which
>is appearing weirdly. Check below example:
>
> 1. str = "--- \nFrench: \"3. Combien de r\\xC3\\xA9gions y a-t-il
>au Cameroon?\"\nEnglish: 3. How many regions are there in Cameroon?\n"
>
>Can someone assist please?
>
>I am thinking on following lines:
>
> 2. str = str.gsub('"', '')
>
> 3. **Need to add a line which replaces \\ in the str above to just
>\**
>
> 4. str = str.force_encoding("iso-8859-1")

No, "\xc3\xa9" is UTF-8, not ISO-8859-1. At least, that makes much more
sense in UTF-8.
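
A small sketch of why the two bytes read better as UTF-8, assuming nothing beyond core Ruby 1.9+; decode_bytes is a hypothetical helper name used only for illustration.

```ruby
# Hypothetical helper: label raw bytes with an encoding, then convert the
# result to UTF-8 so both interpretations can be compared and printed.
def decode_bytes(bytes, source_encoding)
  bytes.pack('C*').force_encoding(source_encoding).encode('UTF-8')
end

as_utf8   = decode_bytes([0xC3, 0xA9], 'UTF-8')
as_latin1 = decode_bytes([0xC3, 0xA9], 'ISO-8859-1')

puts "UTF-8:      #{as_utf8} (#{as_utf8.size} character)"
puts "ISO-8859-1: #{as_latin1} (#{as_latin1.size} characters)"
```

Read as UTF-8, the bytes form the single character "é", which fits "régions". Read as ISO-8859-1, they form two unrelated characters.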

> 5. str = str.encode('UTF-8')
>
>In step 3, I was thinking of something like
>
> str = str.gsub(/\\\\/, "\\")

Yeah.

mfg, simon .... l

Charles Calvert

May 29, 2013, 5:44:03 PM

On Wed, 15 May 2013 04:30:45 -0700 (PDT), Ritvvij Parrikh
<ritv...@gmail.com> wrote in
<a2f5ed36-27d0-46cc...@j2g2000pbx.googlegroups.com>:

>I have a Rails app. One of my clients is importing French Text which
>is appearing weirdly. Check below example:
>
> 1. str = "--- \nFrench: \"3. Combien de r\\xC3\\xA9gions y a-t-il
>au Cameroon?\"\nEnglish: 3. How many regions are there in Cameroon?\n"

As Simon said, this text is encoded in UTF-8. You need to process it
as such. Are you using Ruby 1.8 or 1.9?

[snip rest]
--
Charles Calvert | Websites
Celtic Wolf, Inc. | Web Applications
http://www.celticwolf.com/ | Software
(703) 580-0210 | Databases

Simon Krahnke

May 29, 2013, 7:28:33 PM

* Charles Calvert <cb...@yahoo.com> (23:44) wrote:

>On Wed, 15 May 2013 04:30:45 -0700 (PDT), Ritvvij Parrikh
>

>> I have a Rails app. One of my clients is importing French Text which
>> is appearing weirdly. Check below example:
>>
>> 1. str = "--- \nFrench: \"3. Combien de r\\xC3\\xA9gions y a-t-il
>> au Cameroon?\"\nEnglish: 3. How many regions are there in Cameroon?\n"
>
> As Simon said, this text is encoded in UTF-8. You need to process it
> as such. Are you using 1.8 or 1.9?

Are there versions of 1.8 that support encodings for strings?

mfg, simon .... l

Charles Calvert

May 29, 2013, 9:26:00 PM

On Thu, 30 May 2013 01:28:33 +0200, Simon Krahnke <over...@gmx.li>
wrote in <87obbtb...@xts.gnuu.de>:

For file I/O in 1.8, the only option I'm aware of is the iconv library
(http://ruby-doc.org/stdlib-1.8.7/libdoc/iconv/rdoc/Iconv.html).

1.9, on the other hand, has built-in support for encoded strings and
conversion for file i/o. Here's some demo code that I wrote for a
talk that I gave on Unicode in Ruby:

#!/usr/bin/env ruby
# encoding: UTF-8

# Write a string literal (UTF-8, per the magic comment above) to disk.
File.open('utf8.txt', 'w') do |file|
  puts "Writing a UTF-8 file"
  file.write('Tomás')
  puts ""
end

# Read the file back, declaring its external (on-disk) encoding.
File.open('utf8.txt', 'r:UTF-8') do |file|
  puts "Reading the UTF-8 file"
  puts "File external encoding: #{file.external_encoding}"
  puts "File contains:"
  line_count = 1
  file.each_line do |line|
    puts "#{line_count}: #{line}"
    line_count += 1
    puts ""
  end
end

# Read again, transcoding to UTF-16LE in memory (mode is external:internal).
File.open('utf8.txt', 'r:UTF-8:UTF-16LE') do |file|
  puts "Reading the UTF-8 file and storing in memory as UTF-16 little endian"
  puts "File external encoding: #{file.external_encoding}"
  puts "File internal encoding: #{file.internal_encoding}"
  puts "In memory representation contains:"
  line_count = 1
  file.each_line do |line|
    puts "#{line_count}: contains #{line.size} characters and " \
         "#{line.bytesize} bytes in encoding #{line.encoding.name}"
    line_count += 1
  end
  puts ""
end

Simon Krahnke

May 31, 2013, 1:43:03 AM

* Charles Calvert <cb...@yahoo.com> (2013-05-30) wrote:

>On Thu, 30 May 2013 01:28:33 +0200, Simon Krahnke <over...@gmx.li>
>wrote in <87obbtb...@xts.gnuu.de>:
>
>>* Charles Calvert <cb...@yahoo.com> (23:44) schrieb:
>>
>>>On Wed, 15 May 2013 04:30:45 -0700 (PDT), Ritvvij Parrikh
>>>
>>>> I have a Rails app. One of my clients is importing French Text which
>>>> is appearing weirdly. Check below example:
>>>>
>>>> 1. str = "--- \nFrench: \"3. Combien de r\\xC3\\xA9gions y a-t-il
>>>> au Cameroon?\"\nEnglish: 3. How many regions are there in Cameroon?\n"
>>>
>>> As Simon said, this text is encoded in UTF-8. You need to process it
>>> as such. Are you using 1.8 or 1.9?
>>
>>Are there versions of 1.8 that support encodings for strings?
>
>For file i/o, the only option of which I'm aware is the iconv library
>(http://ruby-doc.org/stdlib-1.8.7/libdoc/iconv/rdoc/Iconv.html).
>
>1.9, on the other hand, has built-in support for encoded strings and
>conversion for file i/o. Here's some demo code that I wrote for a
>talk that I gave on Unicode in Ruby:
>
>#!/usr/bin/env ruby
># encoding: UTF-8
>
>File.open('utf8.txt', 'w') do |file|
> puts "Writing a UTF-8 file"
> file.write('Tomás')

That string is UTF-8 because of the default encoding specified in the
encoding magic comment above.

But why is the file written in UTF-8? For the same reason?

Thanks for the examples.

mfg, simon .... l

Charles Calvert

Jun 1, 2013, 11:00:36 AM

On Fri, 31 May 2013 07:43:03 +0200, Simon Krahnke <over...@gmx.li>
wrote in <87k3mfb...@xts.gnuu.de>:

>* Charles Calvert <cb...@yahoo.com> (2013-05-30) schrieb:

[snip]

>>1.9, on the other hand, has built-in support for encoded strings and
>>conversion for file i/o. Here's some demo code that I wrote for a
>>talk that I gave on Unicode in Ruby:
>>
>>#!/usr/bin/env ruby
>># encoding: UTF-8
>>
>>File.open('utf8.txt', 'w') do |file|
>> puts "Writing a UTF-8 file"

>> file.write('Tomás')

>
>That String is UTF-8 because of the default encoding specified in the
>encoding magic comment above.

Correct.

>But why is the File written in UTF-8, because of the same reason?

I believe so, though I haven't checked the source to verify.

>Thanks for the examples.

You're welcome.

Simon Krahnke

Jun 2, 2013, 9:08:04 AM

* Charles Calvert <cb...@yahoo.com> (17:00) wrote:

> On Fri, 31 May 2013 07:43:03 +0200, Simon Krahnke <over...@gmx.li>
>

>>* Charles Calvert <cb...@yahoo.com> (2013-05-30) schrieb:
>>
>>> 1.9, on the other hand, has built-in support for encoded strings and
>>> conversion for file i/o. Here's some demo code that I wrote for a
>>> talk that I gave on Unicode in Ruby:
>>>
>>> #!/usr/bin/env ruby
>>> # encoding: UTF-8
>>>
>>> File.open('utf8.txt', 'w') do |file|
>>> puts "Writing a UTF-8 file"

>>> file.write('Tomás')

>>
>> That String is UTF-8 because of the default encoding specified in the
>> encoding magic comment above.
>
> Correct.
>
>> But why is the File written in UTF-8, because of the same reason?
>
> I believe so, though I haven't checked the source to verify.

But you can make it explicit, like you did for reading, can't you? I
think that would be a good idea, to keep things local. Someone might
change the encoding of the source file, and then the written file will
have a different encoding. Some other application might still try to
read that file as UTF-8, though.

For string literals there is no way to declare the encoding locally.
Let's just hope that whoever changes the encoding doesn't think it
is magically done by just changing the comment.

mfg, simon .... l

Charles Calvert

Jun 5, 2013, 12:58:55 PM

On Sun, 02 Jun 2013 15:08:04 +0200, Simon Krahnke <over...@gmx.li>
wrote in <87fvx0b...@xts.gnuu.de>:

>* Charles Calvert <cb...@yahoo.com> (17:00) schrieb:
>
>> On Fri, 31 May 2013 07:43:03 +0200, Simon Krahnke <over...@gmx.li>
>>
>>>* Charles Calvert <cb...@yahoo.com> (2013-05-30) schrieb:
>>>
>>>> 1.9, on the other hand, has built-in support for encoded strings and
>>>> conversion for file i/o. Here's some demo code that I wrote for a
>>>> talk that I gave on Unicode in Ruby:
>>>>
>>>> #!/usr/bin/env ruby
>>>> # encoding: UTF-8
>>>>
>>>> File.open('utf8.txt', 'w') do |file|
>>>> puts "Writing a UTF-8 file"

>>>> file.write('Tomás')

>>>
>>> That String is UTF-8 because of the default encoding specified in the
>>> encoding magic comment above.
>>
>> Correct.
>>
>>> But why is the File written in UTF-8, because of the same reason?
>>
>> I believe so, though I haven't checked the source to verify.
>
>But you can make it explicit, like you did for reading, can't you.

Yes, as well as specifying an in-memory encoding that is different
from the file's encoding on disk.

> I think that would be a good idea, to keep things local. Someone
> might change the encoding of the file, and then the file will have
> a different encoding.

Except that specifying the encoding doesn't transform the data if the
actual encoding is something other than what you specified. Maybe I
misunderstood you.
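
The difference between declaring an encoding and converting to one can be sketched as follows. This is only an illustration in core Ruby 1.9+; the helper names relabel and transcode are hypothetical.

```ruby
# Relabel only: the bytes stay exactly as they were.
def relabel(string, encoding)
  string.dup.force_encoding(encoding)
end

# Transcode: the bytes are rewritten to represent the same characters.
def transcode(string, from, to)
  string.dup.force_encoding(from).encode(to)
end

raw = [0xE9].pack('C*') # "é" stored as a single ISO-8859-1 byte

mislabelled = relabel(raw, 'UTF-8')
p mislabelled.bytes           # => [233]
p mislabelled.valid_encoding? # => false

converted = transcode(raw, 'ISO-8859-1', 'UTF-8')
p converted.bytes             # => [195, 169]
p converted.valid_encoding?   # => true
```

Declaring the wrong encoding leaves the data untouched and merely produces a string that is invalid under its label. Only an actual conversion changes the bytes.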

> Some other application might try to read the file as UTF-8, though.

Yes. You have to be careful with encodings. :)

>For string literals there is no way to declare the encoding locally,

No, but you can escape them (e.g. "\x00\x50\x00\x65\x00\xF1\x00\x61")
if you need a literal in an encoding other than the default.

>Let's just hope that the one who changes the encoding doesn't think it
>is magically done by just changing the comment.

True.

Simon Krahnke

Jun 6, 2013, 12:59:37 PM

* Charles Calvert <cb...@yahoo.com> (18:58) wrote:

>On Sun, 02 Jun 2013 15:08:04 +0200, Simon Krahnke <over...@gmx.li>
>wrote in <87fvx0b...@xts.gnuu.de>:
>
>>* Charles Calvert <cb...@yahoo.com> (17:00) schrieb:
>>
>>> On Fri, 31 May 2013 07:43:03 +0200, Simon Krahnke <over...@gmx.li>
>>>
>>>>* Charles Calvert <cb...@yahoo.com> (2013-05-30) schrieb:
>>>>
>>>>> 1.9, on the other hand, has built-in support for encoded strings and
>>>>> conversion for file i/o. Here's some demo code that I wrote for a
>>>>> talk that I gave on Unicode in Ruby:
>>>>>
>>>>> #!/usr/bin/env ruby
>>>>> # encoding: UTF-8
>>>>>
>>>>> File.open('utf8.txt', 'w') do |file|
>>>>> puts "Writing a UTF-8 file"
>>>>> file.write('Tomás')
>>>>
>>>> That String is UTF-8 because of the default encoding specified in the
>>>> encoding magic comment above.
>>>
>>> Correct.
>>>
>>>> But why is the File written in UTF-8, because of the same reason?
>>>
>>> I believe so, though I haven't checked the source to verify.

I've looked through the code, and it looks to me like the default is
Encoding.default_external, which seems to be initialized from the locale,
not from the source file's encoding. I can't find a way to get the source
file's encoding from within Ruby.
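
For what it's worth, Ruby 1.9 and later expose the encoding of the current source file through the __ENCODING__ keyword. A small sketch that prints it next to the process-wide defaults; the helper name encoding_report is hypothetical, and the values printed depend on the magic comment and the locale of whoever runs it.

```ruby
# Hypothetical helper: collects the encodings discussed in this thread.
def encoding_report
  {
    source_file:      __ENCODING__,              # from the magic comment, or the default
    string_literal:   'abc'.encoding,            # plain literals take the source encoding
    default_external: Encoding.default_external, # derived from the locale unless overridden
    default_internal: Encoding.default_internal  # nil unless configured
  }
end

encoding_report.each { |name, value| puts "#{name}: #{value.inspect}" }
```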

>> But you can make it explicit, like you did for reading, can't you.
>
> Yes, as well as specifying an in-memory encoding that is different
> from the file's encoding on disk.

puts and the like seem to just dump that internal encoding out, right?

>> I think that would be a good idea, to keep things local. Someone
>> might change the encoding of the file, and then the file will have
>> a different encoding.
>
> Except that specifying the encoding doesn't transform the data if the
> actual encoding is something other than what you specified. Maybe I
> misunderstood you.

That was based on false premises anyway. The internal encoding doesn't
determine the default encoding of written files; the locale does.

>> Some other application might try read the file as UTF-8, though.
>
> Yes. You have to be careful with encodings. :)

That application, too, should expect the file to be encoded as the
locale says.

>>For string literals there is no way to declare the encoding locally,
>
> No, but you can escape them (e.g. "\x00\x50\x00\x65\x00\xF1\x00\x61")
> if you need a literal in an encoding other than the default.

But that string will still have an encoding attached to it that says it
is in the source file's encoding.

>> Let's just hope that the one who changes the encoding doesn't think it
>> is magically done by just changing the comment.
>
> True.

I've seen people who seemed to think that on usenet.

mfg, simon .... l

Charles Calvert

Jun 10, 2013, 6:27:42 PM

On Thu, 06 Jun 2013 18:59:37 +0200, Simon Krahnke <over...@gmx.li>
wrote in <877gi7b...@xts.gnuu.de>:

>* Charles Calvert <cb...@yahoo.com> (18:58) schrieb:
>
>>On Sun, 02 Jun 2013 15:08:04 +0200, Simon Krahnke <over...@gmx.li>
>>wrote in <87fvx0b...@xts.gnuu.de>:
>>
>>>* Charles Calvert <cb...@yahoo.com> (17:00) schrieb:
>>>
>>>> On Fri, 31 May 2013 07:43:03 +0200, Simon Krahnke <over...@gmx.li>
>>>>
>>>>>* Charles Calvert <cb...@yahoo.com> (2013-05-30) schrieb:
>>>>>
>>>>>> 1.9, on the other hand, has built-in support for encoded strings and
>>>>>> conversion for file i/o. Here's some demo code that I wrote for a
>>>>>> talk that I gave on Unicode in Ruby:
>>>>>>
>>>>>> #!/usr/bin/env ruby
>>>>>> # encoding: UTF-8
>>>>>>
>>>>>> File.open('utf8.txt', 'w') do |file|
>>>>>> puts "Writing a UTF-8 file"

>>>>>> file.write('Tomás')

>>>>>
>>>>> That String is UTF-8 because of the default encoding specified in the
>>>>> encoding magic comment above.
>>>>
>>>> Correct.
>>>>
>>>>> But why is the File written in UTF-8, because of the same reason?
>>>>
>>>> I believe so, though I haven't checked the source to verify.
>
>I've looked through the code and it looks to me like the default is
>Encoding.default_external, which seems to be initialized by the locale,
>not the file's encoding. I can't find a place to find the source file's
>encoding from within Ruby.

That makes sense from what I've seen. Detecting the encoding of a
file without a BOM is a tricky process, and there are libraries to do
it, so building it into the core seems like overkill.

>>> But you can make it explicit, like you did for reading, can't you.
>>
>> Yes, as well as specifying an in-memory encoding that is different
>> from the file's encoding on disk.
>
>puts and the like seem to just dump that internal encoding out, right?

The internal encoding of the string, yes.

>>> I think that would be a good idea, to keep things local. Someone
>>> might change the encoding of the file, and then the file will have
>>> a different encoding.
>>
>> Except that specifying the encoding doesn't transform the data if the
>> actual encoding is something other than what you specified. Maybe I
>> misunderstood you.
>
>That was based on false premises anyway. The internal encoding doesn't
>inform the default encoding of files written, the locale does.

From my testing, it appears to be the encoding of the string written
to the file, rather than the locale.
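
That observation can be checked with a short experiment. The sketch below assumes Ruby 2.1 or later (for Tempfile.create) and a default setup in which Encoding.default_internal is nil; bytes_written is a hypothetical helper name.

```ruby
require 'tempfile'

# Hypothetical helper: writes `string` to a scratch file opened with `mode`
# and returns the raw bytes that ended up on disk.
def bytes_written(string, mode)
  Tempfile.create('encoding-demo') do |tmp|
    tmp.close
    File.open(tmp.path, mode) { |file| file.write(string) }
    File.binread(tmp.path).bytes
  end
end

e_acute = "\u00E9" # "é" as a UTF-8 string

p bytes_written(e_acute, 'w')                    # => [195, 169]
p bytes_written(e_acute.encode('UTF-16LE'), 'w') # => [233, 0]
p bytes_written(e_acute, 'w:ISO-8859-1')         # => [233]
```

With a plain 'w' mode the string's own bytes go to disk unchanged, whatever its encoding. Conversion only happens once an external encoding is named in the mode string.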

>>> Some other application might try to read the file as UTF-8, though.
>>
>> Yes. You have to be careful with encodings. :)
>
>Which too should expect to find the file be encoded with what the locale
>says.

I never assume when it comes to user input. :)

>>>For string literals there is no way to declare the encoding locally,
>>
>> No, but you can escape them (e.g. "\x00\x50\x00\x65\x00\xF1\x00\x61")
>> if you need a literal in an encoding other than the default.
>
>But that string will still have an encoding attributed with it that says
>file's encoding.

String#force_encoding is useful there.
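
A sketch of that combination, assuming the escaped bytes quoted above are meant as UTF-16BE (the thread does not say so explicitly); escaped_utf16be_literal is a hypothetical helper name.

```ruby
# Hypothetical helper: the escaped bytes from the post, relabelled as
# UTF-16BE. The .dup avoids modifying the literal itself.
def escaped_utf16be_literal
  "\x00\x50\x00\x65\x00\xF1\x00\x61".dup.force_encoding('UTF-16BE')
end

name = escaped_utf16be_literal
puts name.encoding        # UTF-16BE
puts name.size            # 4
puts name.bytesize        # 8
puts name.encode('UTF-8') # the same four characters, converted for display
```

Without the force_encoding call the literal would still carry the source file's encoding, which is the concern raised above.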