read a file and remove Mojibake chars

Daiyue Weng

Apr 7, 2016, 5:30:43 AM

Hi, when I read a file, the resulting string contains Mojibake chars at the
beginning. The code is:

file_str = open(file_path, 'r', encoding='utf-8').read()
print(repr(open(file_path, 'r', encoding='utf-8').read()))

Part of the printed string, including the Mojibake chars, looks like this:

'锘縶\n "name": "__NAME__"'

I tried to remove the non-UTF-8 chars using this code:

def read_config_file(fname):
    with open(fname, "r", encoding='utf-8') as fp:
        for line in fp:
            line = line.strip()
            line = line.decode('utf-8','ignore').encode("utf-8")

    return fp.read()

but it doesn't work. How can I remove the Mojibake in this case?

Many thanks

Ben Finney

Apr 7, 2016, 5:41:15 AM

Daiyue Weng <daiyu...@gmail.com> writes:

> Hi, when I read a file, the file string contains Mojibake chars at the
> beginning

You are explicitly setting an encoding to read the file; that is good,
since Python should not guess the input encoding.

That is good because the question of which text encoding is correct gets
dealt with immediately. I am guessing the text encoding may not be what
you expect.

Are you certain the text encoding is “utf-8”? Can you verify that with
whatever created the file — what text encoding does it use to write that
file?
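
If the program that wrote the file is unknown, looking at the raw bytes can help. A minimal sketch, assuming it is enough to inspect the first few bytes for a byte order mark; the function name `sniff_bom` and the `count` parameter are illustrative, not part of any library:

```python
import codecs

def sniff_bom(path, count=16):
    """Return the first `count` raw bytes and the codec name of any BOM found."""
    with open(path, "rb") as f:  # binary mode: no decoding takes place
        head = f.read(count)
    # Check the longer marks first: the UTF-32 LE mark begins with
    # the same two bytes as the UTF-16 LE mark.
    boms = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in boms:
        if head.startswith(bom):
            return head, name
    return head, None
```

Printing `repr(head)` shows exactly what is on disk, independent of any encoding guess. A file without a byte order mark returns `None` for the name, which by itself says nothing about its encoding.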

--
\ “Advertising is the price companies pay for being unoriginal.” |
`\ —Yves Béhar, _New York Times_ interview 2010-12-30 |
_o__) |
Ben Finney

Peter Otten

Apr 7, 2016, 5:50:16 AM

Daiyue Weng wrote:

> Hi, when I read a file, the file string contains Mojibake chars at the
> beginning, the code is like,
>
> file_str = open(file_path, 'r', encoding='utf-8').read()
> print(repr(open(file_path, 'r', encoding='utf-8').read()))
>
> part of the string (been printing) containing Mojibake chars is like,
>
> '锘縶\n "name": "__NAME__"'
>
> I tried to remove the non utf-8 chars using the code,
>
> def read_config_file(fname):
>     with open(fname, "r", encoding='utf-8') as fp:
>         for line in fp:
>             line = line.strip()
>             line = line.decode('utf-8','ignore').encode("utf-8")
>
>     return fp.read()
>
> but it doesn't work, so how to remove the Mojibakes in this case?

I'd first investigate whether the file can be decoded correctly using an
encoding other than UTF-8. But if it's really hopeless and your best bet is
to ignore all non-ASCII characters, try:

def read_config_file(fname):
    # errors="ignore" silently drops every byte that is not valid ASCII
    with open(fname, "r", encoding="ascii", errors="ignore") as f:
        return f.read()

Chris Angelico

Apr 7, 2016, 8:52:06 AM

On Thu, Apr 7, 2016 at 6:47 PM, Daiyue Weng <daiyu...@gmail.com> wrote:
> Hi, when I read a file, the file string contains Mojibake chars at the
> beginning, the code is like,
>
> file_str = open(file_path, 'r', encoding='utf-8').read()
> print(repr(open(file_path, 'r', encoding='utf-8').read()))
>
> part of the string (been printing) containing Mojibake chars is like,
>
> '锘縶\n "name": "__NAME__"'
>
> I tried to remove the non utf-8 chars using the code,
>
> def read_config_file(fname):
>     with open(fname, "r", encoding='utf-8') as fp:
>         for line in fp:
>             line = line.strip()
>             line = line.decode('utf-8','ignore').encode("utf-8")
>
>     return fp.read()
>
> but it doesn't work, so how to remove the Mojibakes in this case?

This won't work as it currently stands. You're looping over the file,
stripping, *DE*coding (which shouldn't work - although in Python 2, it
sorta-kinda might), re-encoding, and then dropping the lines on the
floor. Then, after you've closed the file, you try to read from it. So
yeah, it doesn't work.
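
For comparison, a minimal sketch of the same function with those problems removed, assuming the intent was to strip whitespace from each line and return the resulting text. It does nothing about the odd characters themselves; the `kept` list is an illustrative name:

```python
def read_config_file(fname):
    """Read a UTF-8 text file and return its lines, stripped, as one string."""
    kept = []
    with open(fname, "r", encoding="utf-8") as fp:
        for line in fp:
            # In Python 3 each line is already a str, so there is
            # nothing to decode and no reason to re-encode.
            kept.append(line.strip())
    # All reading was done while the file was still open.
    return "\n".join(kept)
```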

But if you're able to read the file *at all* using your original code,
it must be a correctly-formed UTF-8 stream. The probability that
random non-ASCII bytes just happen to be UTF-8 decodable is
vanishingly low, so I suspect your data issue has nothing to do with
encodings.
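
A small sketch of that check, assuming a strict decode is an acceptable test of well-formedness; `is_valid_utf8` is an illustrative helper, not a standard function:

```python
def is_valid_utf8(data):
    """Return True if the bytes decode as UTF-8 without errors."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True
```

Any Python str encoded with `.encode("utf-8")` passes this check, including text that looks like garbage on screen. Passing it shows only that the bytes are well-formed UTF-8, not that the text is what the author meant.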

ChrisA

Random832

Apr 7, 2016, 10:20:13 AM

On Thu, Apr 7, 2016, at 04:47, Daiyue Weng wrote:
> Hi, when I read a file, the file string contains Mojibake chars at the
> beginning, the code is like,
>
> file_str = open(file_path, 'r', encoding='utf-8').read()
> print(repr(open(file_path, 'r', encoding='utf-8').read()))
>
> part of the string (been printing) containing Mojibake chars is like,
>
> '锘縶\n "name": "__NAME__"'

Based on a hunch, I tried something:

"锘縶" happens to be the GBK/GB18030 interpretation of the bytes "ef bb bf
7b", which is a UTF-8 byte order mark followed by "{".
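
A minimal sketch of how such a string can arise, assuming the file started with a UTF-8 byte order mark and was decoded as GBK somewhere upstream. The sample text and variable names are illustrative:

```python
import codecs

original = '{\n "name": "__NAME__"'

# What the first program wrote: a UTF-8 BOM followed by UTF-8 text.
raw = codecs.BOM_UTF8 + original.encode("utf-8")

# What a second program sees if it wrongly decodes those bytes as GBK.
# The four bytes ef bb bf 7b are consumed as two double-byte characters.
misread = raw.decode("gbk")

print(repr(misread[:2]))
```

The BOM and the opening brace merge into two characters, and the rest of the text, being plain ASCII, comes through unchanged.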

So what happened is that someone wrote text in UTF-8 with a byte-order
marker, and someone else read this as GBK/GB18030 and wrote the
resulting characters as UTF-8. So it may be easier to simply
special-case it:

if file_str[:2] == '锘縶': file_str = '{' + file_str[2:]
elif file_str[:2] == '锘縖': file_str = '[' + file_str[2:]

In principle, the whole process could be reversed with
file_str = file_str.encode('gbk').decode('utf-8'), but that would be
overkill if the string contains no other non-ASCII characters and can't
contain anything at the start except these. Plus, if there are any other
non-ASCII characters in the string, it's anyone's guess as to whether they
survived the process in a way that allows you to reverse it.
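
A hedged sketch of that reversal, assuming the damage really was a single UTF-8 to GBK misread. The function name `undo_gbk_misread` is illustrative, and the sketch swaps in the 'utf-8-sig' codec so that the recovered byte order mark is dropped as well:

```python
def undo_gbk_misread(text):
    """Try to reverse a UTF-8-read-as-GBK mix-up; return text unchanged on failure."""
    try:
        # Encoding as GBK recreates the original bytes; decoding those
        # as UTF-8 gives the intended text. 'utf-8-sig' also removes a
        # leading byte order mark if one is present.
        return text.encode("gbk").decode("utf-8-sig")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Some character did not survive the round trip, so leave
        # the input alone rather than guess.
        return text
```

Pure ASCII text passes through unchanged, so the function is harmless on files that were never damaged. When the round trip fails, the special-case approach above is the safer fallback.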