This is a question regarding the function
com.google.common.io.Files.readLines(File file, Charset charset).
Can we have a version that doesn't need a Charset?
i.e.
List<String> readLines(File file)
I admit I don't know much about Charsets and what they are, so I'm
guessing there is a Good Reason for requiring a Charset when reading a
file. All I can argue is that I've been using a "List<String> readLines
(File file)" in practice for years without ever having to know what
Charset I was using.
This seems like an example of the Java boilerplate that Guava purports
to reduce. I can just see lots of code with the line "List<String>
lines = Files.readLines(file, Charsets.UTF_8)". If I am just banging out
some scripting code, should I need to worry about Charsets?
A related question...
The javadoc describes the charset parameter as
"the character set used when writing the file"
If I didn't write the file, how do I find the Charset? Is there a
Files.getCharset(File file) function? And if so, perhaps this could be
incorporated into a Files.readLines(File file) function?
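To illustrate why a charset generally cannot be read off a file with certainty, here is a minimal sketch using only the JDK (the class name CharsetGuessDemo is hypothetical). The same bytes decode without complaint under two different charsets and yield different text, so the bytes alone do not say which reading is the intended one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetGuessDemo {
    // Decodes the same bytes under a caller-chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e, encoded as UTF-8:
        // the accented letter becomes the two bytes 0xC3 0xA9.
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Decoded as UTF-8: four characters, the original text.
        String asUtf8 = decode(utf8Bytes, StandardCharsets.UTF_8);
        // Decoded as ISO-8859-1: five characters, because each byte
        // is mapped to its own character. No error is raised.
        String asLatin1 = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(asUtf8.length() + " chars vs " + asLatin1.length() + " chars");
    }
}
```

Heuristic detectors can rank the candidates, but as the sketch suggests they are guessing from content, not reading a stored property of the file.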
I could have guessed this stems from a problem with Java's original
design :-)
> The trouble is that if we now had a method which didn't take a charset, we'd
> be torn between going along with Java's normal policy of using the platform
> default - which is almost always the wrong thing to do -
Does this depend on what you are doing though? If I am writing files
and reading them back on the same machine (which is hardly an uncommon
use case), then using the platform default is fine, is it not?
> Then you were almost certainly just using the platform default and *hoping* it
> was right. You can get away with this for years if you're lucky... but
> equally you can end up overwriting immensely valuable customer data with
> garbage if you're unlucky. I'm sure you know the question you need to ask
> yourself ;)
And if I am not working with "immensely valuable customer data"?
I agree with all your points. It does seem to fly in the face of
practical experience though. How do other languages handle this? In
Ruby you can do File.foreach("myfile.txt") do |line|. Should this be
changed to File.foreach("myfile.txt", UTF_8) do |line|? Can you read
files in Perl or Python without specifying the character set? And in
practice does this cause big problems for everyone?
Hopefully this API is aimed at all types of Java users, not just
enterprise programmers reading massive text files of immensely
valuable customer data. Sometimes I just want to dump some data from
one place and read it back in another. Will the world really end if I
don't explicitly specify the character set?
> That depends - do you want your program to produce garbage? If so, you
> should absolutely pick a random charset. Otherwise, you should really know
> what charset your file is in, and use that :)
I think this is over the top. There are millions of lines of code out
there reading and writing text files, completely oblivious to
character sets, and everything is running fine. I'm not saying that
Charsets are not relevant (as you've said, in some cases they clearly
are), but that in some (perhaps many) cases they are not that
important at all.
What about "make the simple things easy and the hard things possible"?
It's great that this API supports different Charsets (that's the "hard
things possible" bit) but how about making the simple stuff easy? Why
does this incidental complexity (the fact that we have different
character sets for text files) have to permeate every corner of the
API?
> There are heuristics which can be applied, but they won't be perfect for all
> cases. I'd be happy to have some heuristic detector elsewhere - but it
> should be completely separate from this method, IMO.
That's interesting to know. Looking around I see some libraries for
"guessing" the character set.
If this is an architecture decision ("All text input in Guava MUST
specify a Charset") then that's fine. Not everyone is writing
mission-critical code though. For those users, it would be nice to have
a simpler API.
Andreas
2010/1/28 Andreas <awm...@gmail.com>:
> This is a question regarding the function
> com.google.common.io.Files.readLines(File file, Charset charset).
> Can we have a version that doesn't need a Charset?
I would argue very strongly against this. Java's decision to make the
default be "whatever the platform chooses" is a terrible idea.
>> Can we have a version that doesn't need a Charset?
>
> I would argue very strongly against this. Java's decision to make the
> default be "whatever the platform chooses" is a terrible idea. .NET took the
> better option of defaulting to UTF-8 for most of the API (but then confused
> things by making Encoding.Default mean the platform default, not the default
> used when you didn't specify anything).
It may have been wrong, but it was basically the decision made by the
entire software industry back in the 80s - that it was a good idea
for each vendor and for each country to have different, implicit
text encodings, and for software to work differently in different
"locales". It's not just Java - Python 3000 uses a system default
encoding in a similar way. I believe using the platform encoding,
as Java does, is likely the right pragmatic decision,
even though it is "a terrible idea". The hope is that future systems
will move towards UTF-8 as the default system encoding,
as Linux has done.
Martin
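For anyone wanting to see what "the platform default" is on a given machine, a minimal JDK-only sketch follows (the class name PlatformDefaultDemo is hypothetical). Charset.defaultCharset() is the charset that charset-less constructors such as new FileReader(file) fall back to.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PlatformDefaultDemo {
    // Reports whether this JVM's default charset is UTF-8.
    static boolean defaultIsUtf8() {
        return Charset.defaultCharset().equals(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The value can differ between machines, operating systems,
        // locales, and JVM versions or startup flags.
        System.out.println("Platform default charset: " + Charset.defaultCharset());
        System.out.println("Default is UTF-8: " + defaultIsUtf8());
    }
}
```

Running this on two machines that exchange files is a quick way to check whether relying on the default would be safe between them.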
--
guava-...@googlegroups.com.
http://groups.google.com/group/guava-discuss?hl=en
unsubscribe: guava-discus...@googlegroups.com
This list is for discussion; for help, post to Stack Overflow instead:
http://stackoverflow.com/questions/ask
Use the tag "guava".
Jon has articulated my opinion on the matter much better than I was going to. :-)
If someone could file an issue saying to please include helpful pointers to Charsets.UTF_8 in the doc of each Charset-accepting method, I think it sounds like a good idea.
I still think you are forcing this complexity on everybody, instead of
just those who need it. Which has traditionally been the Java way I
guess, especially when it comes to I/O.
I think in practice I will write my own utility method
List<String> readLines(File)
that would just do a call to Guava's readLines(file, Charsets.UTF_8).
Of course, when I am reading files that may have a different encoding
I will not use this function :-)
And yes, please add some explanations for all the programmers like me
who don't know about Charsets.
And thank you Jon for having the patience to explain all this stuff to
me. I've learned a lot from this thread :-)
Andreas
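A utility method along the lines Andreas describes might look like the sketch below. The class name Utf8Files is hypothetical, and to keep the example self-contained it uses the JDK's java.nio.file.Files rather than Guava; with Guava on the classpath the method body would instead be the readLines(file, Charsets.UTF_8) call mentioned above.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;

final class Utf8Files {
    private Utf8Files() {}

    // Reads all lines of the file, always decoding the bytes as UTF-8.
    // Not suitable for files that may be in another encoding.
    static List<String> readLines(File file) throws IOException {
        return Files.readAllLines(file.toPath(), StandardCharsets.UTF_8);
    }
}
```

The charset decision is still made explicitly, just once, in a place that is easy to find and change.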
I wouldn't mind *Utf8 helper methods. If the simpler methods
encouraged people to use UTF-8 instead of non-Unicode charsets that
would be a good thing. The obvious downside is doubling the size of
the public API. Let's start with adding documentation that mentions
Charsets.UTF_8 and re-evaluate again in the future.
If I could be so bold as to make a suggestion: if you're worried about the size
of the API doubling, I'd say it would be better to avoid *Utf8 methods and
instead have a public static class on Files that implements all the relevant
methods, passing in its specific Charset for each one.
i.e. you would call Files.Utf8.readLines(file);
Utf8 would look like this:
public static final class Utf8 {
    public static List<String> readLines(File file) throws IOException {
        return Files.readLines(file, Charsets.UTF_8);
    }
    // other methods taking a charset...
}
The public API wouldn't expand without the user digging into the Utf8 class,
and with static imports code could still be concise.
I should probably include the disclaimer that I don't actually use Guava, I
just follow the mailing list for some interesting discussions and thought I'd
throw something out there :-)
Kind regards,
Graham Allan
In Guava I can do this
List<String> lines = CharStreams.readLines(new FileReader(filename)); // No Charset specified!!
which does the same thing as the (hypothetical) function we have been
discussing.
List<String> lines = Files.readLines(filename); // will not compile. Hazardous operation!
Is this a problem?
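Note that new FileReader(filename) does pick a charset; it silently uses the platform default. A sketch of the explicit equivalent, using only the JDK (the class name ExplicitReader is hypothetical), is below. With Guava, the same InputStreamReader could be handed to CharStreams.readLines instead of the manual loop.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

final class ExplicitReader {
    private ExplicitReader() {}

    // Reads all lines, decoding with the charset the caller names
    // instead of whatever the platform default happens to be.
    static List<String> readLines(File file, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset))) {
            List<String> lines = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        }
    }
}
```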
I don't see what the big deal is with having to provide a Charset argument.
It's good practice to always provide a character set and having programmers
think about the character set issue is a good thing when they haven't had to do
so in the past. Plus, using the code is a simple import of Charsets and then
Charsets.UTF_8.
As a user of Guava I don't mind using Charsets.UTF_8 everywhere.
Blair
I'd agree, I was only suggesting it as an alternative to *Utf8 helper methods,
when it's desirable to keep the public API footprint small.
>If you're a fan of static imports (I know not everyone is; I like it
> occasionally for things like newArrayList, but I wouldn't use it here) you
> could import Charsets.UTF_8 statically and then use:
>
>Files.readLines(filename, UTF_8);
Yeah, that's obviously a reasonable thing to expect...
But (and I suspect I'm painting the dog house here) if there's to be more than
one call in the same source file, the charset is likely to be homogeneous, so
providing it every time would be boilerplate. Setting it in one place (the
static import) is one way to avoid that, albeit with all the downsides of
static imports...
Disregarding the validity of providing a default charset in the first place,
would you say this way would be preferable to *Utf8 helper methods? (Just out
of curiosity).
Kind regards,
Graham
> This is one of those places where I think instance methods would be better.
> If there were some class called, say, LineReader (er.. there is actually,
> but it's not quite the same as what we're currently discussing), it could
> look something like this:
> public class LineReader {
>   private LineReader() {...}
>   public static LineReader newInstance(Charset cs) { ... }
>   public List<String> readLines(File file) throws IOException { ... }
> }
> This is, of course, no better for one-off on-the-fly usages, which actually
> wind up being more verbose:
> LineReader.newInstance(Charsets.UTF_8).readLines(file). But it's a bigger win
> for repeated usages with the same charset.
> Since I like dependency injection, I also like that I can create a single
> instance for my whole application and just inject it everywhere I need it.
> Even better, I can stub it out with something that doesn't read files in my
> tests.
This makes a lot of sense, although I can imagine a number of people
would be put off by the need to create an instance of something just
to call utility methods. If "LineReader" delegated to the static
utility class (Files), people could have their cake and eat it too.
(Although, the cake here would still require a Charset... but it'd be
baked in when you want to eat it.)
Sam
Easy enough to work around this for 99% of cases with a default
implementation (or two):
LineReader.UTF8.readLines(...);
LineReader.PLATFORM_DEFAULT_ENCODING_DO_NOT_USE_ARE_YOU_MAD.readLines(...);
Maybe with names that are more static-import friendly...
Paul
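Putting the last few messages together, an instance-based reader with a preset constant could be sketched roughly as follows. Everything here is hypothetical: the class is named CharsetLineReader to avoid clashing with Guava's existing LineReader, and it delegates to the JDK's java.nio.file.Files so the example is self-contained (delegating to Guava's Files.readLines would work the same way).

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;
import java.util.Objects;

final class CharsetLineReader {
    // Preset instance for the common case, so callers need not name a charset.
    static final CharsetLineReader UTF_8 = new CharsetLineReader(StandardCharsets.UTF_8);

    private final Charset charset;

    private CharsetLineReader(Charset charset) {
        this.charset = Objects.requireNonNull(charset, "charset");
    }

    // Factory for any other charset; the choice is made once, here.
    static CharsetLineReader newInstance(Charset charset) {
        return new CharsetLineReader(charset);
    }

    // Reads all lines of the file using the charset fixed at construction.
    List<String> readLines(File file) throws IOException {
        return Files.readAllLines(file.toPath(), charset);
    }
}
```

One-off callers would write CharsetLineReader.UTF_8.readLines(file), while applications using dependency injection could bind a single instance and swap it for a stub in tests.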
It's also possible to keep Files as-is and add a new LineReader as Brian
described, with the method implementations delegating to Files
(passing the Charset from the LineReader constructor). This makes it
slightly easier for existing users to continue using the library as-is
and also opens the door for mocking & DI (as Brian mentioned). It
does make it a little confusing that there are two ways to access the
same functionality, though.
Sam
On Jan 29, 6:29 am, Jon Skeet <sk...@pobox.com> wrote:
> It's encouraging you to think about something that you really *should* be
> thinking about. It's a decision to be made - and using the platform default
> encoding is a choice which can easily lead to data loss. As I mentioned
> before, if we could sensibly default to UTF-8 that wouldn't be nearly so bad
> - but then we'd be out of line with the other Java APIs :(
Wouldn't it be safer to default to UTF_8 than force programmers to make a
possibly-uninformed choice of character set?
The advantage with UTF_8 (and also US_ASCII) is that if you read a file and
have incorrectly guessed the encoding, your program will start throwing
exceptions very quickly, because the probability of any file looking like one
of these encodings when it isn't is very small.
Now imagine a programmer discovering the readLines method, not knowing
which character set to pick, and opting for 8859-1 because it is familiar to
them (I made a similar mistake a few years ago in a program that used an
HTML parser). They will never see an exception, because every possible
sequence of bytes is a valid 8859-1 string. This is far more likely to cause
data loss.
This only applies to reading of course, not to writing. Writing UTF-8 when
the receiver expects GB18030, for example, can cause corruption to go
undetected.
Finn
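Whether a wrong guess actually surfaces as an exception depends on how the bytes are decoded. In the JDK, new String(bytes, charset) and InputStreamReader substitute the replacement character U+FFFD for malformed input, whereas a CharsetDecoder configured with CodingErrorAction.REPORT throws. The sketch below (class and method names are hypothetical) shows the strict variant and contrasts UTF-8 with ISO-8859-1 on the same bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class StrictDecoding {
    private StrictDecoding() {}

    // Decodes the bytes, throwing CharacterCodingException instead of
    // silently substituting U+FFFD when the input does not fit the charset.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Convenience check: true if the bytes decode cleanly under the charset.
    static boolean isValid(byte[] bytes, Charset charset) {
        try {
            decodeStrict(bytes, charset);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For example, the ISO-8859-1 bytes for "cafe" with an accented final letter (0x63 0x61 0x66 0xE9) are rejected by the strict UTF-8 and US-ASCII decoders, but any byte sequence at all is accepted as ISO-8859-1, which is the asymmetry Finn describes.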