Combine the lines of a file containing Chinese characters into one line.

Hongyi Zhao

Nov 28, 2011, 5:20:20 AM

Hi all,

I have a file that contains Chinese characters, and I want to
combine all of its lines into one line. Details follow.

Suppose the file I'm going to manipulate is named 222.

werner@debian:~$ cat -vte 222
M-?M-FM-QM-'M-5M-DM-DM-ZM-9M-&M-HM--_10058706^M|112^M|M-UM-BM-DM-KM-FM-
w^M|10058706^M|1984M-DM-j12M-TM-BM-5M-Z1M-0M-f$

werner@debian:~$ awk '{printf $0}' 222 > 333

Then I opened the file 333 with gedit, but found that the contents were not
combined into one line. They look just the same as in the original
file 222.

So I ran the following command:

werner@debian:~$ cat -vte 333
M-?M-FM-QM-'M-5M-DM-DM-ZM-9M-&M-HM--_10058706^M|112^M|M-UM-BM-DM-KM-FM-
w^M|10058706^M|1984M-DM-j12M-TM-BM-5M-Z1M-0M-fwerner@debian:~$

Any hints on this issue?

Regards
--
.: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.

Chris F.A. Johnson

Nov 28, 2011, 5:42:18 AM

On 2011-11-28, Hongyi Zhao wrote:
> Hi all,
>
> I've a file which including Chinese characters in it. Now, I want to
> combine the lines in this file into one line.

To remove all newlines:

tr -d '\n' < "$file" > "$newfile"

To replace them with spaces:

tr '\n' ' ' < "$file" > "$newfile"

--
Chris F.A. Johnson, author <http://shell.cfajohnson.com/>
===================================================================
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
Pro Bash Programming: Scripting the GNU/Linux Shell (2009, Apress)

Janis Papanagnou

Nov 28, 2011, 6:32:15 AM

On 28.11.2011 11:20, Hongyi Zhao wrote:

(It's amazing how many questions you ask. Don't you feel it would be
better for you to spend the time experimenting a bit more yourself?)

> Hi all,
>
> I've a file which including Chinese characters in it. Now, I want to
> combine the lines in this file into one line. See the following for
> detail:
>
> Suppose the file I'm going to manipulate is names as 222.
>
> werner@debian:~$ cat -vte 222
> M-?M-FM-QM-'M-5M-DM-DM-ZM-9M-&M-HM--_10058706^M|112^M|M-UM-BM-DM-KM-FM-
> w^M|10058706^M|1984M-DM-j12M-TM-BM-5M-Z1M-0M-f$
>
> werner@debian:~$ awk '{printf $0}' 222> 333
>
> Then, I open the file 333 with gedit, but find that the contents are not
> combined into one line. It seems that they are just the same as looked
> in the original file 222.

That's probably because of gedit. You seem to have CRs in your data,
which your editor may interpret as newlines, but awk does not.
In awk, use gsub() to replace or remove those characters. You
(probably) want all \r and \n removed.

Janis

Hongyi Zhao

Nov 29, 2011, 4:22:03 AM

On Mon, 28 Nov 2011 12:32:15 +0100, Janis Papanagnou wrote:

> That's probably because of gedit? - You seem to have CRs in your data,
> which may be intepreted as the newline symbols in your editor, but not
> in awk. In awk use gsub() to replace/remove those characters. You want
> (probably) all \r and \n removed.

The following command does the trick:

$ tr -d "\r\n" < myfile

I've tried all of the following awk forms, but none of them gives the
expected result:

$ awk '{ sub(/\r$/, ""); print }' myfile
$ awk '{ gsub(/\r\n/, ""); print }' myfile
$ awk '{ gsub(/\\r\\n/, ""); print }' myfile

Best regards

Janis Papanagnou

Nov 29, 2011, 6:50:07 AM

On 29.11.2011 10:22, Hongyi Zhao wrote:
> On Mon, 28 Nov 2011 12:32:15 +0100, Janis Papanagnou wrote:
>
>> That's probably because of gedit? - You seem to have CRs in your data,
>> which may be intepreted as the newline symbols in your editor, but not
>> in awk. In awk use gsub() to replace/remove those characters. You want
>> (probably) all \r and \n removed.
>
> The following command will does the trick:
>
> $ tr -d "\r\n"< myfile
>
> I've tried all of the following forms of awk, but none of them giving the
> expected result:
>
> $ awk '{ sub(/\r$/, ""); print }' myfile

This removes one \r just at the end of the line, but your data
has the ^M spread all over the line. To remove all ^M use just

awk '{ gsub(/\r/, ""); print }' myfile

> $ awk '{ gsub(/\r\n/, ""); print }' myfile

This one removes only a \r immediately followed by a \n, and since awk
has already stripped the \n record terminator, it never matches here.
To remove all instances of \r and \n you'd have to use

gsub(/[\r\n]/,"")

In addition, you should not add the \n back with your print statement;
use printf instead

{ gsub(/[\r\n]/,""); printf "%s", $0 }

(The latter can also be achieved by redefining RS appropriately;
using gawk, for example, by RS="\0", or generally by using some
string that won't appear in the data.)

Janis

Hongyi Zhao

Nov 29, 2011, 7:47:07 AM

On Tue, 29 Nov 2011 12:50:07 +0100, Janis Papanagnou wrote:

[snipped]

> This removes one \r just at the end of the line, but your data has the
> ^M spread all over the line. To remove all ^M use just
>
> awk '{ gsub(/\r/, ""); print }' myfile
>
>> $ awk '{ gsub(/\r\n/, ""); print }' myfile
>
> This one removes only a sequence of \r followed by \n. To remove all
> instances of \r an \n you'd have to use
>
> gsub(/[\r\n]/,"")
>
> In addition you should not add \n again by your print statement,
> use printf instead
>
> { gsub(/[\r\n]/,""); printf "%s", $0 }
>
> (The latter can also be achieved by redefining RS appropriately;
> using gawk, for example, by RS="\0", or generally by using some string
> that won't appear in the data.)

The following command will also do the trick:

$ awk ' {printf "%s", $0 }' myfile > newfile

Regards

Janis Papanagnou

Nov 29, 2011, 8:33:43 AM

That won't remove all the CRs (coded as \r and often displayed as ^M, as
seems to be the case in your data). Reinspect the output, or run

od -c newfile

to see what's actually in the data. If it turns out to be okay, glad to hear it.
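
A minimal sketch of that inspection step, for anyone following along. The sample bytes and the helper name count_byte are made up for illustration and are not taken from the original file; the point is only that awk's default record splitting consumes LF but leaves CR in the output.

```shell
# Hypothetical sample: three fields separated by CR, one final LF.
sample() { printf 'a\rb\rc\n'; }

# Count occurrences of one byte (given as a tr escape such as '\r') on stdin.
count_byte() { tr -cd "$1" | wc -c | tr -d ' '; }

# Plain printf joins the LF-terminated records but keeps the CRs.
sample | awk '{ printf "%s", $0 }' | count_byte '\r'

# Removing \r explicitly leaves no CR behind.
sample | awk '{ gsub(/\r/, ""); printf "%s", $0 }' | count_byte '\r'
```

The first pipeline reports the CRs that are still present in the output; the second reports none. Running od -c on the real output gives the same information byte by byte.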

Janis

>
> Regards

Ron

Nov 29, 2011, 8:50:15 AM

We all know how Chinese like to copy other people's ideas instead of
thinking for them selves. It's easier, it's cheaper & it's faster!

Chinese people think: better well stolen than badly made.

Wait I shouldn't have said that, now I've hurt their feelings and pride!

beter goed gestolen dan slecht gemaakt

On 28-11-11 wk 48 12:32, Janis Papanagnou wrote:

Kaz Kylheku

Nov 29, 2011, 4:11:18 PM

On 2011-11-28, Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> Am 28.11.2011 11:20, schrieb Hongyi Zhao:
>
> (It's amazing how many questions you ask. Don't you feel it would be
> better for you to spend the time experimenting a bit more yourself?)

If H.Z. stopped asking questions, it would probably kill this newsgroup, though.

Ron

Nov 29, 2011, 7:04:45 PM

No it won't, it would only kill your intelligence.

On 29-11-11 wk 48 22:11, Kaz Kylheku wrote:

Hongyi Zhao

Nov 29, 2011, 7:58:26 PM

On Wed, 30 Nov 2011 01:04:45 +0100, Ron wrote:

> No it won't, it would only kill your intelligence.

Thanks for your fair criticism and serious suggestions.

But if you had learned something from "The Book of Changes" by the sage Fuxi
of China, your attitude would probably look more moderate ;-)

Hongyi Zhao

Nov 29, 2011, 8:08:18 PM

On Tue, 29 Nov 2011 12:50:07 +0100, Janis Papanagnou wrote:

> { gsub(/[\r\n]/,""); printf "%s", $0 }

After some trial and error, it looks like this one is the winner for
my case.

Furthermore, if I want to combine every 5 lines into 1 line, what
should the code look like? I've done some experiments but failed to figure it
out. Any hints will be highly appreciated.

Janis Papanagnou

Nov 29, 2011, 8:22:19 PM

On 30.11.2011 02:08, Hongyi Zhao wrote:
> On Tue, 29 Nov 2011 12:50:07 +0100, Janis Papanagnou wrote:
>
>> { gsub(/[\r\n]/,""); printf "%s", $0 }
>
> After do some try and error, it looks like this one is the winner for
> case.
>
> Furthermore, if I want to combine every 5 lines into 1 line, what's the
> code should looks like?

Wasn't that the original question already? Doesn't the proposed code
solve that?

If you mean that you want to collect blocks of 5 lines, then you can
add after

{ gsub(/[\r\n]/,""); printf "%s", $0 }

a second condition/action-block

!(NR%5) { print "" }

which will add a newline character after every fifth line.
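
Put together, the two blocks can be sketched as below. The function name join_every and the END block are additions for illustration, not part of the suggestion above; the END block only terminates a last group that has fewer than n lines. The sketch assumes LF-terminated input, so that each line is one awk record.

```shell
join_every() {
    # $1 = number of input lines per output line; data is read from stdin.
    awk -v n="$1" '
        { gsub(/\r/, ""); printf "%s", $0 }   # print the record without a newline
        !(NR % n) { print "" }                # close the output line after every n-th record
        END { if (NR % n) print "" }          # terminate a final, shorter group
    '
}

seq 12 | join_every 5
```

With seq 12 as input this prints 12345, 678910 and 1112 on three separate lines.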

> I've do some experiments but failed to figure it
> out. Any hints will be highly appreciated.

In case I misunderstood; could you provide test samples (with ASCII
only, please, no Chinese characters) so that we understand what failed?

>
> Best regards

Seebs

Nov 29, 2011, 8:21:42 PM

On 2011-11-30, Hongyi Zhao <hongy...@gmail.com> wrote:
> Furthermore, if I want to combine every 5 lines into 1 line, what's the
> code should looks like? I've do some experiments but failed to figure it
> out. Any hints will be highly appreciated.

Hint #1: Discuss in more detail what the experiments were. What did you
try? Why did you think it would work? What actually happened? What did
that teach you about the subjects of your experiments?

The thing that continues to mystify me about your posts is that nothing
you've posted yet has suggested that you have yet achieved the stage of
believing that computers behave in predictable patterns.

Let's look again at this:

{ gsub(/[\r\n]/,""); printf "%s", $0 }

Can you explain WHY this appears to combine lines? If not, wouldn't it make
sense to study this more and come to understand it before you go ahead to
the next thing?

Hint #2: If you understand why that works, then:
SOMETHING_GOES_HERE { printf "\n" }
will solve your problem. The SOMETHING_GOES_HERE should be pretty obvious.

-s
--
Copyright 2011, all wrongs reversed. Peter Seebach / usenet...@seebs.net
http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
I am not speaking for my employer, although they do rent some of my opinions.

Hongyi Zhao

Nov 29, 2011, 9:27:22 PM

On Wed, 30 Nov 2011 01:21:42 +0000, Seebs wrote:

> { gsub(/[\r\n]/,""); printf "%s", $0 }

gsub(/[\r\n]/,"") means: for each substring of the current record, i.e., $0,
that matches \r (carriage return) or \n (line feed), substitute an
empty string, "". In other words, delete every carriage return and
line feed from each line of the input file.

printf "%s", $0 means: format the current record and display it as an
unquoted string.

Because all of the newline control characters (\r\n on Windows, \n on
Linux, and \r on Mac) have been deleted by gsub, and each line is then
printed as an unquoted string, the file is combined into one line as
a single string.

Regards

Loki Harfagr

Nov 30, 2011, 7:43:47 AM

Wed, 30 Nov 2011 01:08:18 +0000, Hongyi Zhao did cat :

> On Tue, 29 Nov 2011 12:50:07 +0100, Janis Papanagnou wrote:
>
>> { gsub(/[\r\n]/,""); printf "%s", $0 }
>
> After do some try and error, it looks like this one is the winner for
> case.
>
> Furthermore, if I want to combine every 5 lines into 1 line, what's the
> code should looks like? I've do some experiments but failed to figure
> it out. Any hints will be highly appreciated.
>
> Best regards

as you asked for a hint, here's one:
seq 32 | awk '{a=a$0} !(NR%5){print a ORS;a=""}'
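
Spelled out, the hint collects lines in a variable and prints that variable after every fifth line. A sketch of the same idea follows; the function name join_buffered is made up here, and the END block is an addition, since the one-liner as posted never prints leftover lines such as 31 and 32 (and its print a ORS emits an extra blank line after each group).

```shell
join_buffered() {
    # $1 = number of input lines per output line; data is read from stdin.
    awk -v n="$1" '
        { buf = buf $0 }                      # append the current line to the buffer
        !(NR % n) { print buf; buf = "" }     # every n-th line: flush and reset the buffer
        END { if (buf != "") print buf }      # flush a final, shorter group
    '
}

seq 32 | join_buffered 5
```

The last output line here is 3132, the two lines left over after six full groups of five.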

Loki Harfagr

Nov 30, 2011, 7:59:41 AM

Wed, 30 Nov 2011 00:58:26 +0000, Hongyi Zhao did cat :

> On Wed, 30 Nov 2011 01:04:45 +0100, Ron wrote:
>
>> No it won't, it would only kill your intelligence.
>
> Thanks for your fair criticism and serious suggestions.
>
> But if you have learned something from "The Book of Changes"

should that title somewhat answer to Seebs previous anguish ?-)

>> The thing that continues to mystify me about your posts is that nothing
>> you've posted yet has suggested that you have yet achieved the stage of
>> believing that computers behave in predictable patterns.

> by sage
> Fuxi of China, probably your attitude will looked more moderate ;-)

not willing to start a geopolitics thread surgeon but the previously
published sages of China, even Sunzi were not exactly dealing or even
looking like a game of BoP or a part in Storytron (though...) like
the recent herpetical liberalism outbreak in modern China leaders do ,-)

Hongyi Zhao

Nov 30, 2011, 8:07:42 AM

On Wed, 30 Nov 2011 02:22:19 +0100, Janis Papanagnou wrote:

> Wasn't that the original question already? Doesn't the proposed code do
> solve that?
>
> If you mean that you want to collect blocks of 5 lines, then you can add
> after
>
> { gsub(/[\r\n]/,""); printf "%s", $0 }
>
> a second condition/action-block
>
> !(NR%5) { print "" }
>
> which will add the newline character on every fifth line.
>
>> I've do some experiments but failed to figure it out. Any hints will
>> be highly appreciated.
>
> In case I misunderstood; could you provide test samples (with ASCII
> only, please, no Chinese characters) so that we understand what failed?

In the current case I must handle the Chinese characters in the file.
I've made a minimal example file with 10 lines and 1 Chinese
character on each line. The following code combines
every 5 lines into 1 line:

$ awk '{ gsub(/[\r\n]/,""); printf "%s", $0 } !(NR%5) { print "" }' minimal_example

But when I apply the above code to my actual data, it combines all of
the contents into one line. So I think the two files may have different
encodings. To check the encodings, I ran the following
commands:

$ file minimal_example
555: UTF-8 Unicode text

$ file actual_data
222: ISO-8859 text, with CR line terminators

So, in this case, I want to convert actual_data to UTF-8
with the following command:

$ iconv -f ISO_8859-1 -t utf8 actual_data > new_file

After this, when I open new_file with gedit, I find only garbled
characters. Any hints?

Regards

Hongyi Zhao

Nov 30, 2011, 8:16:51 AM

On Wed, 30 Nov 2011 13:07:42 +0000, Hongyi Zhao wrote:

> $ file minimal_example
> 555: UTF-8 Unicode text

555 is the original file name of minimal_example. Sorry for the
inconsistency.

>
> $ file actual_data
> 222: ISO-8859 text, with CR line terminators

Again, 222 is the original file name of actual_data.

Hongyi Zhao

Nov 30, 2011, 8:29:52 AM

On Wed, 30 Nov 2011 12:43:47 +0000, Loki Harfagr wrote:

> as you asked for a hint, here's one:
> seq 32 | awk '{a=a$0} !(NR%5){print a ORS;a=""}'

Thanks a lot; it's an obscure hint for me, though ;-)

Regards

Janis Papanagnou

Nov 30, 2011, 8:50:25 AM

On 30.11.2011 14:07, Hongyi Zhao wrote:
> [...]

>
> But when I use the above code to my actual data, it will combine all of
> the contents into one line. So, I think it maybe the different encoding
> of these two files. For checking the encoding, I do the following
> commands:

> [...]
> Any hints?

I suggest setting the locale environment variables appropriately and
defining awk's RS according to your line terminators.
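
One way to read that advice, as a sketch only: treat the input as raw bytes with LC_ALL=C and make CR the record separator, since file reported CR line terminators for the real data. The function name join_cr_records is hypothetical, and whether the actual file ends every record with a bare CR is an assumption.

```shell
join_cr_records() {
    # $1 = number of CR-terminated records per output line; data is read from stdin.
    LC_ALL=C awk -v n="$1" '
        BEGIN { RS = "\r" }                   # records end at CR, not LF
        { gsub(/\n/, ""); printf "%s", $0 }   # drop any stray LF, print without a newline
        !(NR % n) { print "" }                # end the output line after n records
        END { if (NR % n) print "" }          # terminate a final, shorter group
    '
}

printf 'a\rb\rc\rd\re\rf\n' | join_cr_records 5
```

With the default RS, a file whose only line breaks are CRs reaches awk as a single record, so NR never gets to 5 and everything ends up on one line. Splitting on CR restores the per-record count that the NR%5 test relies on.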

>
> Regards

Seebs

Nov 30, 2011, 2:58:03 PM

On 2011-11-30, Hongyi Zhao <hongy...@gmail.com> wrote:

> On Wed, 30 Nov 2011 01:21:42 +0000, Seebs wrote:
>> { gsub(/[\r\n]/,""); printf "%s", $0 }

> gsub(/[\r\n]/,"") means for each substring matching the \r (carriage
> return) or \n (line feed) in current record, i.e., $0, substitute them
> with a empty string, "", i.e., delete all of the control characters
> "carriage return" or "line feed" from each lines of the input file.

Okay.

Now, what will be different if you omit the gsub?

> printf "%s", $0 means formatted the current record and display it as an
> unquoted string.

What do you mean "formatted"?

> Due to all of the control symbols for new lines (for windows it is \r\n
> , for linux, it is \n, and for mac, it is \r ) have been deleted by gsub
> and then formatted each lines as the unquoted string, the file will be
> combined into one line as one string.

So try it without the gsub and see what it does!

(Hint: For plain ASCII text, it does exactly the same thing.)
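
The experiment can be sketched like this. The two helper names are made up for illustration. awk removes the LF record terminator before the action runs, so on LF-only input the gsub has nothing left to delete; it only matters when the data contains CRs.

```shell
with_gsub()    { awk '{ gsub(/[\r\n]/, ""); printf "%s", $0 }'; }
without_gsub() { awk '{ printf "%s", $0 }'; }

# LF-only input: both variants produce identical output.
printf 'one\ntwo\nthree\n' | with_gsub;    echo
printf 'one\ntwo\nthree\n' | without_gsub; echo

# Input with CRs: only the gsub variant removes them (compare the byte counts).
printf 'one\r\ntwo\r\n' | with_gsub    | wc -c
printf 'one\r\ntwo\r\n' | without_gsub | wc -c
```

Both of the first two commands print onetwothree. For the CR input, the byte counts are 6 with gsub and 8 without, the difference being the two CRs that survive.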

Bill Marcum

Dec 1, 2011, 1:09:25 AM

On 2011-11-30, Hongyi Zhao <hongy...@gmail.com> wrote:
>

> The current case is that I must manipulate the Chinese characters in it.
> I've made a minimal example file with 10 lines in it and 1 Chinese
> character on each line. By using the following code, it will combine
> every 5 lines into 1 line:
>
> $ awk ' { gsub(/[\r\n]/,""); printf "%s", $0 }!(NR%5) { print "" } '
> minimal_example
>
> But when I use the above code to my actual data, it will combine all of
> the contents into one line. So, I think it maybe the different encoding
> of these two files. For checking the encoding, I do the following
> commands:
>
> $ file minimal_example
> 555: UTF-8 Unicode text
>
> $ file actual_data
> 222: ISO-8859 text, with CR line terminators
>
> So, In this case, I want to convert the actual_data to UTF-8 encoding
> with the following command:
>
> $ iconv -f ISO_8859-1 -t utf8 actual_data > new_file
>
> After this, when I open the new_file with gedit, I only find some messy
> codes. Any hints?
>
> Regards

For some files, the "file" command can only guess about the contents.
You know that your file contains Chinese characters, so it isn't
ISO-8859-1.

--
We've had 30 years of center-right to extreme-right governance and the
country has gotten steadily worse. The triumph of politics is that at
least half the country blames the Democrats for this.

Hongyi Zhao

Dec 1, 2011, 1:42:04 AM

On Thu, 01 Dec 2011 01:09:25 -0500, Bill Marcum wrote:

> For some files, the "file" command can only guess about the contents.
> You know that your file contains Chinese characters, so it isn't
> ISO-8859-1.

So, how can I determine its encoding accurately? And once I know it, can I
convert the file to UTF-8 with iconv?

Ben Finney

Dec 1, 2011, 8:43:29 AM

Hongyi Zhao <hongy...@gmail.com> writes:

> So, how can I know its encoding accurately?

The only way to know for certain is to find out how it was created.
There is no sure way to know the encoding only by inspecting the data.

You can know many encodings that it *is not* – the ones that fail when
trying to read the byte stream. But there's no generally-applicable way
to be certain what the encoding *is* for a given byte stream.

> From therein can I convert it to UTF-8 Unicode text format with iconv?

Once you know the original encoding (and assuming that the data contains
no Unicode-incompatible characters), yes.
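
The elimination approach can be sketched with iconv itself. The function name try_encodings and the candidate list in the usage comment are assumptions for illustration; nothing in this thread establishes which encoding the file really uses. A clean decode only means the encoding was not ruled out: a single-byte encoding such as ISO-8859-1 assigns a character to every byte value, so it never fails, which is consistent with the earlier conversion producing garbled text rather than an error.

```shell
try_encodings() (
    # $1 = file to test; remaining arguments = candidate source encodings.
    # The body runs in a subshell, so the variables stay local to the function.
    file=$1
    shift
    for enc in "$@"; do
        if iconv -f "$enc" -t UTF-8 "$file" >/dev/null 2>&1; then
            printf '%s: decodes without error\n' "$enc"
        else
            # Either the bytes are invalid in this encoding,
            # or this iconv does not know the encoding name.
            printf '%s: ruled out\n' "$enc"
        fi
    done
)

# Usage (file name from the thread; the candidates are guesses):
#   try_encodings actual_data UTF-8 GB18030 BIG5 ISO-8859-1
```

Any candidate that survives still has to be checked by eye, for example by converting with it and reading the result in an editor.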

--
\ “The double standard that exempts religious activities from |
`\ almost all standards of accountability should be dismantled |
_o__) once and for all.” —Daniel Dennett, 2010-01-12 |
Ben Finney

Robert Bonomi

Dec 1, 2011, 10:07:49 AM

In article <jb77js$6e1$1...@aspen.stu.neva.ru>,

Hongyi Zhao <hongy...@gmail.com> wrote:
>On Thu, 01 Dec 2011 01:09:25 -0500, Bill Marcum wrote:
>
>> For some files, the "file" command can only guess about the contents.
>> You know that your file contains Chinese characters, so it isn't
>> ISO-8859-1.
>
>So, how can I know its encoding accurately?

If you use a machine with the right features, it is quite easy.

If the computer has functional ROM ("<R>ead <O>perator's <M>ind"), it will
know what to do, just by your looking at the text.

Unfortunately, "Read Operator's Mind" functionality is only effective for
a limited number of uses. This is why the specifications will say
something like 'ROM: 64k'. After 64k (in this example) uses, "Read
Operator's Mind" is worn out, and the computer must be replaced.

Ron

Dec 2, 2011, 10:14:45 AM

I don't want to learn about a nearly 5000 year old ide0logy, if I see
how Ch1n3s3 greed and sl4very destroys modern global economies.

The root cause of all this misery is that Ch1n3s3 government makes HUGE
profits by enforcing ide0logies that keeps Ch1n3s3 hungry, p00r & humble

We don't need a wise Ch1n3s3 philosopher to see fowl play in the make.
Because Ch1n4 refuses to raise the value of the Yuan, w3stern economies
are now artificially lowering their demand for Ch1n3se products.

You'll probably keep on ignoring the truth anyway...

Guess why Ch1n3s3 are deprived from an unc3nsored Internet.
Guess why Chin3s3 are deprived from d3m0cracy
Guess why Chin3s3 are deprived from fr33dom

Because you're more productive for Beijing.

The richer the 3mperor becomes, the more you believe in his human
violating ideology. It's a vicious circle you can't escape from.

I'm not against the Ch1n3s3, but I'm against the consequences of global
destructiveness caused by Ch1n3s3 v1olation of hum4n r1ghts!

If you wonder why I'm encrypting words, it's because I want this message
to pass the great Ch1n3s3 F1rewall and reach Ch1na's people.

The above is my book of changes!

On 30-11-11 wk 48 01:58, Hongyi Zhao wrote:

Janis Papanagnou

Dec 2, 2011, 11:08:07 AM

On 02.12.2011 16:14, Ron wrote:
> [...]
> The root cause of all this misery is [...]

> You'll probably keep on ignoring the truth anyway...

It's certainly easy to name a single problem of the world and
make that responsible for the mess we currently have [in the
whole world]. I am sure you are convinced about your "truth"
as others are about their [perceived] "truth". Though, what
you wrote makes it quite apparent that you seem to be still far
from understanding what's going on. Don't worry; you are not
alone. Others, though, seem to be less pretentious in sharing
their "Truth"[tm]. Anyway, who cares; this is the Internet, and
there's always at least one who is wrong... http://xkcd.com/386/

BTW, you are top-posting, and your posting is [OT] here; please
abstain from that misbehaviour. Thanks. Good night. And good luck.

[followup-to: talk.bizarre]

> [...]