> When writing into a Unicode text file in Java, given that the
> stream encoding was set to »UTF-8«, what is the proper, best
> or canonical way to terminate a line?
I suggest the Unicode mailing list
http://www.unicode.org/consortium/distlist.html
or one of the comp.lang.java* newsgroups.
--
In memoriam Alan J. Flavell
http://groups.google.com/groups/search?q=author:Alan.J.Flavell
> When writing into a Unicode text file in Java, given that the
> stream encoding was set to »UTF-8«, what is the proper, best
> or canonical way to terminate a line?
Let's ignore the programming aspect as well as the encoding (UTF-8 vs.
something else, such as UTF-16), for the time being. The primary question is
how to terminate a line in Unicode.
> This has to do with the question whether the specification of
> the line terminator of a proper »Unicode text file« is the
> responsibility of Unicode or of the operating system (or other
> protocols used).
Both, but basically the latter.
Unicode defines several line breaking characters. It does not define "the"
line breaking character. The characters include U+2028 LINE SEPARATOR (LS),
which unambiguously means line break; but it is rarely used. Other line
break characters may have different semantics, by operating system or other
software. Using them does not violate the Unicode standard in any way,
though it creates portability problems.
See section 5.8 (Newline Guidelines) in the Unicode standard;
http://www.unicode.org/versions/Unicode5.0.0/ch05.pdf
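As a rough illustration of the "several line breaking characters" point, here is a minimal Java sketch. It assumes Java 8 or later, where the regex construct \R matches CR LF as a unit as well as lone LF, VT, FF, CR, NEL (U+0085), LS (U+2028) and PS (U+2029). The class and method names are made up for this example.

```java
import java.util.Arrays;
import java.util.List;

public class LineBreakSplit {
    // Split on any line break that java.util.regex's \R recognizes.
    // CR LF counts as one break, not two.
    static List<String> splitLines(String text) {
        return Arrays.asList(text.split("\\R"));
    }

    public static void main(String[] args) {
        // LF, CR LF, lone CR and U+2028 LINE SEPARATOR mixed in one string.
        String text = "one\ntwo\r\nthree\rfour\u2028five";
        System.out.println(splitLines(text)); // [one, two, three, four, five]
    }
}
```

A reader written this way accepts most conventions a sender might have used; it says nothing about which one a writer should emit.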
So the pragmatic question is: what do the recipients (i.e., software that
will process your file) recognize as line break?
> And also, whether Java does any translations of "\n" and other
> codes, when writing to UTF-8.
There are three conceptual levels involved here. First, "\n" means something
in Java, and you need to check Java references for that. Generally, escapes
like "\n" are defined as indicating line break, without specifying a
particular character or string; in practice, this means that it is
interpreted, by a programming language compiler or interpreter, in a
system-dependent manner, as a character or a string that works as line break
in the underlying system. Second, this character or string has to be
presented according to some character code standard, such as ASCII or Unicode.
Finally, the character or string has to be represented using some encoding,
such as UTF-8 - but this is smooth sailing as soon as we know the character
or string, as coded in some known code, and the encoding has been decided
on.
> So a related question removes
> Java from the discussion by asking: What is the proper byte
> sequence to terminate a line of a »UTF-8 text file«?
The proper character or string depends on the agreements on line breaking
characters. The rest is simple and algorithmic. There is nothing special
happening here; the line break character(s) are encoded as any other
characters. In most cases, line break will be presented as CR, LF, or CR LF
pair. CR is octet 0D and LF is octet 0A in UTF-8.
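To make the byte level concrete, the following small Java sketch prints the UTF-8 octets of a few line break candidates. The helper name hex is hypothetical; the only library call that matters is String.getBytes with StandardCharsets.UTF_8.

```java
import java.nio.charset.StandardCharsets;

public class TerminatorBytes {
    // Render the UTF-8 encoding of a string as space-separated hex octets.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("LF     -> " + hex("\n"));     // 0A
        System.out.println("CR     -> " + hex("\r"));     // 0D
        System.out.println("CR LF  -> " + hex("\r\n"));   // 0D 0A
        System.out.println("U+2028 -> " + hex("\u2028")); // E2 80 A8
    }
}
```

CR and LF are in the ASCII range, so each takes a single octet; U+2028 takes three.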
> A finer detail would be the question, whether the last
> line of a »proper Unicode text file« needs to be terminated
> by something, or whether the lines are separated.
No terminator is needed. Of course, specific data formats or conventions or
programs may require a trailing line break, whereas some programs may treat
it as an indication of an empty line at the end of the file.
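The two readings of a trailing line break can be seen in Java itself. This is only a sketch with invented names: BufferedReader.readLine treats the final break as a terminator, while String.split with a negative limit treats it as a separator followed by an empty last line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TrailingBreak {
    // Read lines the way BufferedReader does: a final line break
    // terminates the last line and does not start a new, empty one.
    static List<String> readLines(String text) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readLines("a\nb\n").size());     // 2
        System.out.println(readLines("a\nb").size());       // 2
        // Separator semantics: the trailing LF yields a third, empty element.
        System.out.println("a\nb\n".split("\n", -1).length); // 3
    }
}
```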
--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
> My specific application is a demonstration program showing how
> to write a »text file« with Java and then how to read it again
> in the context of a Usenet discussion.
>
> In this case, technically, I might use any character sequence
> as a line terminator that does not occur within a line,
That's correct, though the text file isn't really a (plain) text file if its
line terminator is not one of the characters designated for such use in
character code standards. This implies that in such a case, it cannot be
smoothly displayed and otherwise processed with tools for plain text files
(like Notepad or Emacs).
> The most likely two candidates in Java are
>
> \n Unicode 10 (decimal)
> %n The line separator of the operating system
> (might be a sequence of characters)
I'm not a specialist in Java issues, but it seems to me that the
specifications for the language designate "\n" as line break and
specifically as line feed, LF, U+000A, i.e. as Unicode 10 decimal. This is
somewhat obscure (since the operating system need not use such a convention)
and reflects lack of rigorous standardization of the language.
I guess both candidates are feasible, with no clear preference, but the
context and purpose may make one of them preferable. If you think about the
possibilities of using the file in the particular environment (operating
system), %n might be slightly better. If you think about wider
processability, the \n might be better. When the file is used, as such, in
another environment - with a different line break convention - \n is safer
than %n. It is more probable that an unknown recipient is able to handle
U+000A as line break in a plain text file than that it can handle your
operating system's line break, if it is something exotic.
Generally, I'd vote for \n, since normally the purpose of writing a UTF-8
file is to produce something that is portable across systems.
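For the demonstration program, the difference between the two candidates might be sketched as follows. The class and method names are hypothetical. The sketch relies on two documented behaviours: a Writer's write method passes "\n" through unchanged, and the format specifier %n expands to System.lineSeparator().

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class WriteUtf8Lines {
    // Write each line followed by the given terminator, encoded as UTF-8.
    // Writer.write performs no line break translation of its own.
    static void writeLines(Path file, List<String> lines, String terminator)
            throws IOException {
        try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                out.write(line);
                out.write(terminator);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("demo", ".txt");
        writeLines(file, Arrays.asList("first", "second"), "\n");
        // 5 + 1 + 6 + 1 octets, regardless of the operating system.
        System.out.println(Files.size(file)); // 13
        // %n is the platform's separator, so this file would differ by system.
        System.out.println(String.format("%n").equals(System.lineSeparator())); // true
        Files.delete(file);
    }
}
```

Note that methods such as PrintWriter.println and BufferedWriter.newLine use the platform separator, so they behave like %n rather than like "\n".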
> Or, if I would want to write a tutorial itself as a »UTF-8
> text file« for further distribution (»Content-Type:
> text/plain; charset=UTF-8«), what should be used as a line
> separator in this case?
CR LF (U+000D U+000A), because that's mandatory for text/plain by the
definition of this Internet (MIME) media type, and any subtype of text:
"The canonical form of any MIME "text" subtype MUST always represent a line
break as a CRLF sequence. Similarly, any occurrence of CRLF in MIME "text"
MUST represent a line break. Use of CR and LF outside of line break
sequences is also forbidden.
This rule applies regardless of format or character set or sets involved."
Source: RFC 2046, clause 4.1, available e.g. at
http://www.mhonarc.org/~ehood/MIME/2046/rfc2046.html#4.1
Thus, anything delivered with Internet message headers indicating it as
"text" (in the media type sense) MUST use CR LF for line breaks. Of course,
"MUST" is to be understood as a normative requirement; you might be able to
violate it without serious consequences. In practice, programs tend to be
more permissive, accepting a lone CR or a lone LF as line break as well, and
this has been explicitly specified for some subtypes, such as text/html, see
http://www.w3.org/TR/html401/struct/text.html#didx-line_break
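If text produced with some other convention is to be sent as text/plain, one possible approach is to convert every line break to CR LF just before transmission. The following is a minimal sketch with an invented class name; it only considers CR, LF and CR LF as input conventions.

```java
public class CrlfNormalizer {
    // Convert CR LF, lone CR and lone LF to CR LF.
    // The alternation lists CR LF first so an existing pair is not doubled.
    static String toCrlf(String text) {
        return text.replaceAll("\r\n|\r|\n", "\r\n");
    }

    public static void main(String[] args) {
        String mixed = "a\nb\r\nc\rd";
        System.out.println(toCrlf(mixed).equals("a\r\nb\r\nc\r\nd")); // true
    }
}
```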
Short answer:
Since UTF-8 is nothing but an extension of ASCII, it inherits
all the historic mess that ASCII already had with regard
to line termination and other control characters.
Honest answer:
The proper way, IMHO, to terminate a plain-text line is LF, the
convention that Unix has always used and that MacOS switched
to in more recent years. Sadly, Microsoft and some IETF protocols
continue to prefix each LF with a technically completely
unnecessary CR byte, for no practical reason whatsoever.
CRs should be ignored and gradually phased out, if just to
save memory and bandwidth and rid numerous protocols and APIs
of the otherwise unnecessary (and dangerous!) distinction
between text and binary files.
The LINE SEPARATOR and PARAGRAPH SEPARATOR characters that
Unicode added seem not to be used in practice. I have yet to
encounter one of them in the wild and I increasingly believe
that they will suffer the exact same fate as most of the
C0/C1 ASCII control characters did, namely become still-born
design-by-committee inventions that continue to waste precious
code space and confuse developers to this day.
Markus
--
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain
That is correct.
http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.10.6
> This is somewhat obscure (since the operating system need not use such
> a convention) and reflects lack of rigorous standardization of the
> language.
I don't think that is the reason. To pick another language, although
the C standard allows \n to be something other than U+000A, I believe
it is mostly very old C compilers (1980s vintage, especially for
MacOS) which do so. Especially in today's networked environment,
where you often generate text on one operating system and send it to
another (either immediately or later), having \n vary based on
operating system creates as many problems as it solves. (Having a way
to write a text file using the operating system conventions is a
different matter, via mechanisms like the C text-mode fopen or Java's
line.separator system property).
My recommendation would be to use U+000A whenever you can, which is
most of the time. This is the native convention for MacOS X and Unix,
and even on Windows, the vast majority of programs can cope.