Can we have a version of Files.readLines that doesn't take a Charset?


Andreas

Jan 28, 2010, 3:32:25 PM
to guava-discuss
This is a question regarding the function
com.google.common.io.Files.readLines(File file, Charset charset).

Can we have a version that doesn't need a Charset?

i.e.
List<String> readLines(File file)

I admit I don't know much about Charsets and what they are, so I'm
guessing there is a Good Reason for requiring a Charset when reading a
file. All I can argue is that I've been using a "List<String> readLines
(File file)" in practice for years without ever having to know what
Charset I was using.

This seems an example of Java boilerplate that Guava purports to
reduce. I can just see lots of code with the line "List
lines=Files.readLines(file, Charsets.UTF_8)". If I am just banging out
some scripting code, should I need to worry about Charsets??

A related question...

The javadoc describes the charset parameter as
"the character set used when writing the file"

If I didn't write the file, how do I find the Charset? Is there a
Files.getCharset(File file) function? And if so, perhaps this could be
incorporated into a File.readLines(File file) function?


Jon Skeet

Jan 28, 2010, 3:40:41 PM
to guava-...@googlegroups.com
2010/1/28 Andreas <awm...@gmail.com>

This is a question regarding the function
com.google.common.io.Files.readLines(File file, Charset charset).

Can we have a version that doesn't need a Charset?

I would argue very strongly against this. Java's decision to make the default be "whatever the platform chooses" is a terrible idea. .NET took the better option of defaulting to UTF-8 for most of the API (but then confused things by having Encoding.Default mean the platform default, not the default used when you didn't specify anything).

The trouble is that if we now had a method which didn't take a charset, we'd be torn between going along with Java's normal policy of using the platform default - which is almost always the wrong thing to do - and using something sensible like UTF-8, which is then inconsistent with the rest of the Java API.

i.e.
List<String> readLines(File file)

I admit I don't know much about Charsets and what they are, so I'm
guessing there is a Good Reason for requiring a Charset when reading a
file. All I can argue is that I've been using a "List<String> readLines
(File file)" in practice for years without ever having to know what
Charset I was using.

Then you were almost certainly just using the platform default and hoping it was right. You can get away with this for years if you're lucky... but equally you can end up overwriting immensely valuable customer data with garbage if you're unlucky. I'm sure you know the question you need to ask yourself ;)
 
This seems an example of Java boilerplate that Guava purports to
reduce. I can just see lots of code with the line "List
lines=Files.readLines(file, Charsets.UTF_8)". If I am just banging out
some scripting code, should I need to worry about Charsets??

That depends - do you want your program to produce garbage? If so, you should absolutely pick a random charset. Otherwise, you should really know what charset your file is in, and use that :)
 
A related question...

The javadoc describes the charset parameter as
"the character set used when writing the file"

If I didn't write the file, how do I find the Charset? Is there a
Files.getCharset(File file) function? And if so, perhaps this could be
incorporated into a File.readLines(File file) function?

You can't - there's no foolproof way of doing this, as a file can be perfectly valid in multiple charsets, but mean different things.

There are heuristics which can be applied, but they won't be perfect for all cases. I'd be happy to have some heuristic detector elsewhere - but it should be completely separate from this method, IMO.

Jon
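The point that a file can be valid in several charsets while meaning different things can be made concrete with a minimal sketch. It uses only the JDK (`StandardCharsets` arrived in Java 7, after this thread), and the class and method names are made up for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: one byte sequence, decoded under two charsets.
final class SameBytesDifferentText {

    // Decode raw bytes with an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // 0xC3 0xA9 is the UTF-8 encoding of U+00E9 (e with acute accent).
    static byte[] sampleBytes() {
        return new byte[] {(byte) 0xC3, (byte) 0xA9};
    }
}
```

Decoding `sampleBytes()` as UTF-8 gives the single character U+00E9, while decoding the same bytes as ISO-8859-1 gives the two characters U+00C3 and U+00A9. Neither call reports an error, which is why the bytes alone cannot tell you which reading was intended.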

Andreas

Jan 28, 2010, 7:28:21 PM
to guava-discuss

> > Can we have a version that doesn't need a Charset?
>
> I would argue very strongly against this. Java's decision to make the
> default be "whatever the platform chooses" is a terrible idea.

I could have guessed this stems from a problem with Java's original
design :-)

> The trouble is that if we now had a method which didn't take a charset, we'd
> be torn between going along with Java's normal policy of using the platform
> default - which is almost always the wrong thing to do -

Does this depend on what you are doing though? If I am writing files
and reading them back on the same machine (which is hardly an uncommon
use case), then using the platform default is fine, is it not?

> Then you were almost certainly just using the platform default and *hoping* it
> was right. You can get away with this for years if you're lucky... but
> equally you can end up overwriting immensely valuable customer data with
> garbage if you're unlucky. I'm sure you know the question you need to ask
> yourself ;)

And if I am not working with "immensely valuable customer data"?

I agree with all your points. It does seem to fly in the face of
practical experience though. How do other languages handle this? In
Ruby you can do File.foreach("myfile.txt") do|line|. Should this be
changed to File.foreach("myfile.txt", UTF_8) do|line| ? Can you read
files in Perl or Python without specifying the character set? And in
practice does this cause big problems for everyone?

Hopefully this API is aimed at all types of Java users, not just
enterprise programmers reading massive text files of immensely
valuable customer data. Sometimes I just want to dump some data from
one place and read it back in another. Will the world really end if I
don't explicitly specify the character set?

> That depends - do you want your program to produce garbage? If so, you
> should absolutely pick a random charset. Otherwise, you should really know
> what charset your file is in, and use that :)

I think this is over the top. There are millions of lines of code out
there reading and writing text files, completely oblivious to
character sets, and everything is running fine. I'm not saying that
Charsets are not relevant (as you've said, in some cases they clearly
are), but that in some (perhaps many) cases they are not that
important at all.

What about "make the simple things easy and the hard things possible"?
It's great that this API supports different Charsets (that's the "hard
things possible" bit) but how about making the simple stuff easy? Why
does this incidental complexity (the fact that we have different
character sets for text files) have to permeate every corner of the
API?

> There are heuristics which can be applied, but they won't be perfect for all
> cases. I'd be happy to have some heuristic detector elsewhere - but it
> should be completely separate from this method, IMO.

That's interesting to know. Looking around I see some libraries for
"guessing" the character set.

If this is an architecture decision ("All text input in Guava MUST
specify a Charset") then that's fine. Not everyone is writing mission
critical code though. For those users, it would be nice to have a
simpler API.

Andreas

Tim Peierls

Jan 28, 2010, 8:15:45 PM
to guava-...@googlegroups.com
On Thu, Jan 28, 2010 at 3:40 PM, Jon Skeet <sk...@pobox.com> wrote:
2010/1/28 Andreas <awm...@gmail.com>
This is a question regarding the function com.google.common.io.Files.readLines(File file, Charset charset).
Can we have a version that doesn't need a Charset?

I would argue very strongly against this. Java's decision to make the default be "whatever the platform chooses" is a terrible idea.

Maybe, but if failing to respond to a request like this results in people turning away from Files.readLines (and, by extension, from Guava), that would be a shame. Can you think of a less terrible approach that doesn't simply slam the door? How about helpful documentation in readLines that suggests how a Charset value might be obtained?

--tim

Martin Buchholz

Jan 28, 2010, 9:43:54 PM
to guava-...@googlegroups.com
On Thu, Jan 28, 2010 at 12:40, Jon Skeet <sk...@pobox.com> wrote:

>> Can we have a version that doesn't need a Charset?
>
> I would argue very strongly against this. Java's decision to make the
> default be "whatever the platform chooses" is a terrible idea. .NET took the
> better option of defaulting to UTF-8 for most of the API (but then confused
> things by having Encoding.Default mean the platform default, not the default used
> when you didn't specify anything).

It may have been wrong, but it was basically the decision made by the
entire software industry back in the 80s - that it was a good idea
for each vendor and for each country to have different, implicit,
text encodings and for software to work differently in different
"locales". It's not just java - python 3000 uses a system default
encoding in a similar way. I believe using the platform encoding,
as java does, is likely the right pragmatic decision,
even though it is "a terrible idea". The hope is that future systems
will move towards UTF-8 as the default system encoding,
as Linux has done.

Martin

Jon Skeet

Jan 29, 2010, 1:29:03 AM
to guava-...@googlegroups.com
2010/1/29 Andreas <awm...@gmail.com>

> The trouble is that if we now had a method which didn't take a charset, we'd
> be torn between going along with Java's normal policy of using the platform
> default - which is almost always the wrong thing to do -

Does this depend on what you are doing though? If I am writing files
and reading them back on the same machine (which is hardly an uncommon
use case),  then using the platform default is fine, is it not?

Not necessarily, no. The platform default may very well not contain all the characters you're trying to handle - I'm on Windows, for example, and my default encoding is Windows-1252 - an 8-bit character encoding. That means that there are only 256 characters I'll be able to write and read without losing data. It's like having a default image format which limits you to 256 colours - except it's far more insidious than that, because image formats generally identify themselves.
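A minimal sketch of that data loss, under stated assumptions: ISO-8859-1 stands in for an 8-bit platform default such as Windows-1252 (ISO-8859-1 is guaranteed to exist in every JVM), and the class name is hypothetical. Encoding and then decoding with the same charset is what "write a file and read it back on the same machine" amounts to:

```java
import java.nio.charset.Charset;

// Hypothetical demo class: simulates writing text and reading it back
// with one and the same charset.
final class RoundTrip {

    static String roundTrip(String text, Charset charset) {
        // String.getBytes(Charset) does not fail on characters the charset
        // cannot represent; it substitutes the charset's replacement bytes
        // (a '?' for ISO-8859-1), so the loss is silent.
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }
}
```

With ISO-8859-1, `roundTrip("caf\u00e9", ...)` comes back intact, but three Japanese characters such as `"\u65e5\u672c\u8a9e"` come back as `"???"`. With UTF-8 both survive unchanged.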
 
> Then you were almost certainly just using the platform default and *hoping* it
> was right. You can get away with this for years if you're lucky... but
> equally you can end up overwriting immensely valuable customer data with
> garbage if you're unlucky. I'm sure you know the question you need to ask
> yourself ;)

And if I am not working with "immensely valuable customer data"?

I agree with all your points. It does seem to fly in the face of
practical experience though. How do other languages handle this?  In
Ruby you can do File.foreach("myfile.txt") do|line|. Should this be
changed to File.foreach("myfile.txt", UTF_8) do|line| ? Can you read
files in Perl or Python without specifying the character set? And in
practice does this cause big problems for everyone?

In all of these cases, I'd expect that either they're using a better default than "the default platform encoding" or they can easily lose data. In the case of Ruby (at least until relatively recently), I believe the platform default encoding is used for the strings themselves (rather than Unicode) - but I'm not a Ruby expert.

If you're only reading and writing ASCII, it will probably work fine... but it makes it really easy to get it wrong, and you may well not spot it until it's too late and you've failed to store some data correctly.

Hopefully this API is aimed at all types of Java users, not just
enterprise programmers reading massive text files of immensely
valuable customer data. Sometimes I just want to dump some data from
one place and read it back in another. Will the world really end if I
don't explicitly specify the character set?

It depends whether you count not being able to store all characters as world-ending.

If this were only going to be used for unimportant data, I wouldn't care so much - but I've seen plenty of experienced Java developers fall into the trap of using the default encoding accidentally, leading to subtle bugs which only show themselves after data has been lost.
 
> That depends - do you want your program to produce garbage? If so, you
> should absolutely pick a random charset. Otherwise, you should really know
> what charset your file is in, and use that :)

I think this is over the top. There are millions of lines of code out
there reading and writing text files, completely oblivious to
character sets, and everything is running fine. I'm not saying that
Charsets are not relevant (as you've said, in some cases they clearly
are), but that in some (perhaps many) cases they are not that
important at all.

It depends what you mean by "everything is running fine". I'm sure a lot of those systems would fail if they were given some data which happened to lie outside their platform's default encoding - or if you moved the file to another computer in the same company, but in a different country. Not only that, but the failure might not be obvious even then. Who's going to check every field in every file they copy? No, the character data will just be misinterpreted, and the bad data will propagate to other files and other systems.

This is an issue in the real world, IMO.
 
What about "make the simple things easy and the hard things possible"?
It's great that this API supports different Charsets (that's the "hard
things possible" bit) but how about making the simple stuff easy? Why
does this incidental complexity (the fact that we have different
character sets for text files) have to permeate every corner of the
API?

It's encouraging you to think about something that you really should be thinking about. It's a decision to be made - and using the platform default encoding is a choice which can easily lead to data loss. As I mentioned before, if we could sensibly default to UTF-8 that wouldn't be nearly so bad - but then we'd be out of line with the other Java APIs :(
 
> There are heuristics which can be applied, but they won't be perfect for all
> cases. I'd be happy to have some heuristic detector elsewhere - but it
> should be completely separate from this method, IMO.

That's interesting to know. Looking around I see some libraries for
"guessing" the character set.

If this is an architecture decision ("All text input in Guava MUST
specify a Charset") then that's fine. Not everyone is writing mission
critical code though. For those users, it would be nice to have a
simpler API.

I could get behind there being File.readLinesUtf8(...) as a shortcut for using Charsets.UTF_8... would that be any better for you? But I think that keeping data accurate is important even when you're not writing mission critical code.

Jon

Jon Skeet

Jan 29, 2010, 1:33:36 AM
to guava-...@googlegroups.com
2010/1/29 Tim Peierls <t...@peierls.net>
Absolutely - anything which doesn't encourage people into making inappropriate choices would be fine. Another method called Files.readLinesUtf8 would be okay by me too, albeit slightly ugly/redundant. But I think it's a good thing to encourage developers to understand that when they're saving text they are making a decision.

To give an analogy, if you had an image API where the image itself didn't have any concept of a file format, would you expect there to be Image.save(filename) which picked some platform-specific default image format? Wouldn't it be entirely natural for it to be Image.save(filename, imageFormat)? This is really the same kind of thing, except that image file formats identify themselves so much better.

If I hadn't seen so much code accidentally using the default platform encoding, I wouldn't have as much of a problem with this - but it's an area where Java's defaults do mean that data is lost, and that's something I feel passionately about.

Jon

Kevin Bourrillion

Jan 29, 2010, 12:13:19 PM
to guava-...@googlegroups.com
Jon has articulated my opinion on the matter much better than I was going to.  :-)

If someone could file an issue saying to please include helpful pointers to Charsets.UTF_8 in the doc of each Charset-accepting method, I think it sounds like a good idea.


--
guava-...@googlegroups.com.
http://groups.google.com/group/guava-discuss?hl=en
unsubscribe: guava-discus...@googlegroups.com
 
This list is for discussion; for help, post to Stack Overflow instead:
http://stackoverflow.com/questions/ask
Use the tag "guava".



--
Kevin Bourrillion @ Google
internal:  http://go/javalibraries
external: guava-libraries.googlecode.com

Nikolas Everett

Jan 29, 2010, 1:51:02 PM
to guava-...@googlegroups.com
On Fri, Jan 29, 2010 at 12:13 PM, Kevin Bourrillion <kev...@google.com> wrote:
Jon has articulated my opinion on the matter much better than I was going to.  :-)

If someone could file an issue saying to please include helpful pointers to Charsets.UTF_8 in the doc of each Charset-accepting method, I think it sounds like a good idea.

+1 for this.  

I've been following this with some interest for about 22 hours, or so gmail informs me.  

I'm sure I'm guilty of using the platform default encoding when it was a bad idea.  I file this in the same space as String.toLowerCase without a locale.  Reading files and toLowerCase just look so innocuous you think "What could go wrong if I do it the simple way?" and so you do it the simple way and then you make poor Turkish guy's life hell.

Nik
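The toLowerCase pitfall mentioned above can be reproduced in a few lines. This is a minimal sketch with a hypothetical class name; it passes the locale explicitly rather than relying on the platform default, which is the same discipline as passing a Charset:

```java
import java.util.Locale;

// Hypothetical demo class for locale-sensitive lower-casing.
final class LowerCaseDemo {

    // Lower-case with an explicit locale instead of the platform default.
    static String lower(String text, Locale locale) {
        return text.toLowerCase(locale);
    }
}
```

Under the Turkish locale (`Locale.forLanguageTag("tr")`), `lower("TITLE", ...)` turns the capital I into a dotless ı (U+0131), so the result no longer equals `"title"`. With `Locale.ROOT` the result is the plain ASCII `"title"`. Code that calls `toLowerCase()` with no argument gets one or the other depending on the machine it runs on.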

Jon Skeet

Jan 29, 2010, 2:49:21 PM
to guava-...@googlegroups.com
I've filed issue 319: http://code.google.com/p/guava-libraries/issues/detail?id=319

Jon

2010/1/29 Kevin Bourrillion <kev...@google.com>

Andreas

Jan 29, 2010, 2:50:58 PM
to guava-discuss
Having looked around more at Guava, I can see requiring a Charset
is consistent with the rest of the API. So fair enough.

I still think you are forcing this complexity on everybody, instead of
just those who need it. Which has traditionally been the Java way I
guess, especially when it comes to I/O.

I think in practice I will write my own utility method

List<String> readLines(File)

that would just do a call to Guava's readLines(file, Charsets.UTF_8).
Of course, when I am reading files that may have a different encoding
I will not use this function :-)

And yes, please add some explanations for all the programmers like me
who don't know about Charsets.

And thank you Jon for having the patience to explain all this stuff to
me. I've learned a lot from this thread :-)

Andreas
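A wrapper of the kind described in this post might look like the sketch below. To stay self-contained it uses the JDK's `java.nio.file.Files.readAllLines` (Java 7+) as a stand-in; with Guava on the classpath the body would instead delegate to Guava's `Files.readLines(file, Charsets.UTF_8)`. The class name `Utf8Files` is an assumption made for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;

// Hypothetical helper for code that has decided, once, that its files
// are UTF-8. Note: Files here is java.nio.file.Files, not Guava's.
final class Utf8Files {
    private Utf8Files() {}

    // Reads every line of the file, decoding it as UTF-8.
    static List<String> readLines(File file) throws IOException {
        return Files.readAllLines(file.toPath(), StandardCharsets.UTF_8);
    }
}
```

The charset decision is still being made; it is just made in one named place instead of at every call site.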


Chris Nokleberg

Jan 29, 2010, 4:18:08 PM
to guava-...@googlegroups.com
On Thu, Jan 28, 2010 at 22:29, Jon Skeet <sk...@pobox.com> wrote:
> I could get behind there being File.readLinesUtf8(...) as a shortcut for
> using Charsets.UTF_8... would that be any better for you? But I think that
> keeping data accurate is important even when you're not writing mission
> critical code.

I wouldn't mind *Utf8 helper methods. If the simpler methods
encouraged people to use UTF-8 instead of non-Unicode charsets that
would be a good thing. The obvious downside is doubling the size of
the public API. Let's start with adding documentation that mentions
Charsets.UTF_8 and re-evaluate again in the future.

Raymond Conner

Jan 29, 2010, 9:33:42 PM
to guava-...@googlegroups.com
Wikipedia has a wealth of information on this topic, you can start here if you want to know more:

http://en.wikipedia.org/wiki/Character_encoding

This sort of problem is exactly why XML is supposed to have that bit at the beginning for specifying the encoding. Of course, sometimes the encoding affects you before you can read that far into the file. See this as well:


That exact problem (from Windows-edited XML) caused our software to fail and really annoyed our customer. Don't make the mistake of assuming these things are not important.

- Ray Conner

Graham Allan

Jan 30, 2010, 8:31:46 AM
to guava-...@googlegroups.com
>I wouldn't mind *Utf8 helper methods. If the simpler methods
>encouraged people to use UTF-8 instead of non-Unicode charsets that
>would be a good thing. The obvious downside is doubling the size of
>the public API. Let's start with adding documentation that mentions
>Charsets.UTF_8 and re-evaluate again in the future.

If I could be so bold as to make a suggestion, if you're worried about the size
of the API doubling I'd say it would be better to avoid *Utf8 methods and
instead have a public static class on Files that implemented all the relevant
methods, passing in its specific Charset for each one.

i.e. you would call - Files.Utf8.readLines(file);

Utf8 would look like this:

public static final class Utf8 {
  public static List<String> readLines(File file) throws IOException {
    return Files.readLines(file, Charsets.UTF_8);
  }
  // other methods taking a charset...
}

The public API wouldn't expand without the user digging into the Utf8 class,
and with static imports code could still be concise.

I should probably include the disclaimer that I don't actually use Guava, I
just follow the mailing list for some interesting discussions and thought I'd
throw something out there :-)

Kind regards,
Graham Allan

Andreas

Jan 30, 2010, 4:08:39 PM
to guava-discuss
A related question

In Guava I can do this

List<String> lines=CharStreams.readLines(new FileReader(filename)); // No Charset specified!!

which does the same thing as the (hypothetical) function we have been
discussing.

List<String> lines=Files.readLines(filename) // will not compile. Hazardous operation!

Is this a problem?

Jon Skeet

Jan 30, 2010, 4:19:54 PM
to guava-...@googlegroups.com
Yes - it's basically because FileReader is an awful class. It doesn't just default to the platform-default encoding - it doesn't allow you to specify any other encoding! It's fundamentally broken, IMO - it should at least expose a constructor taking a Charset or charset name. I always avoid it, using an InputStreamReader wrapped around a FileInputStream.

Jon
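The InputStreamReader-around-FileInputStream combination mentioned above can be sketched as follows. The class name is hypothetical, and the sketch uses Java 8's `BufferedReader.lines()` for brevity, which postdates this thread:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: bytes come from a stream, and the charset used to
// turn them into characters is always stated by the caller.
final class ExplicitCharsetReader {

    // Decodes the stream with the given charset and splits it into lines.
    // The caller owns the stream and is responsible for closing it.
    static List<String> readLines(InputStream in, Charset charset) {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
        return reader.lines().collect(Collectors.toList());
    }

    // File variant: open the file as raw bytes, then decode explicitly.
    static List<String> readLines(File file, Charset charset) throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            return readLines(in, charset);
        }
    }
}
```

Unlike `new FileReader(filename)`, there is no overload here that lets the platform default slip in unnoticed.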

Blair Zajac

Jan 30, 2010, 4:22:08 PM
to guava-...@googlegroups.com
Graham Allan wrote:
>> I wouldn't mind *Utf8 helper methods. If the simpler methods
>> encouraged people to use UTF-8 instead of non-Unicode charsets that
>> would be a good thing. The obvious downside is doubling the size of
>> the public API. Let's start with adding documentation that mentions
>> Charsets.UTF_8 and re-evaluate again in the future.
>
> If I could be so bold as to make a suggestion, if you're worried about the size
> of the API doubling I'd say it would be better to avoid *Utf8 methods and
> instead have a public static class on Files that implemented all the relevant
> methods, passing in its specific Charset for each one.

I don't see what the big deal is with having to provide a Charset argument.
It's good practice to always provide a character set and having programmers
think about the character set issue is a good thing when they haven't had to do
so in the past. Plus, using the code is a simple import of Charsets and then
Charsets.UTF_8.

As a user of Guava I don't mind using Charsets.UTF_8 everywhere.

Blair

Jon Skeet

Jan 30, 2010, 4:31:09 PM
to guava-...@googlegroups.com
If you're a fan of static imports (I know not everyone is; I like it occasionally for things like newArrayList, but I wouldn't use it here) you could import Charsets.UTF_8 statically and then use:

Files.readLines(filename, UTF_8);

Jon 

Graham Allan

Jan 30, 2010, 6:33:31 PM
to guava-...@googlegroups.com
>I don't see what the big deal is with having to provide a Charset argument.
> It's good practice to always provide a character set and having programmers
> think about the character set issue is a good thing when they haven't had
> to do so in the past. Plus, using the code is a simple import of Charsets
> and then Charsets.UTF_8.
>

I'd agree, I was only suggesting it as an alternative to *Utf8 helper methods,
when it's desirable to keep the public API footprint small.

>If you're a fan of static imports (I know not everyone is; I like it
> occasionally for things like newArrayList, but I wouldn't use it here) you
> could import Charsets.UTF_8 statically and then use:
>
>Files.readLines(filename, UTF_8);

Yeah, that's obviously a reasonable thing to expect...

But (and I suspect I'm painting the dog house here) if there's to be more than
one call in the same source file, the charset is likely to be homogeneous, so
providing it every time would be boilerplate. Setting it in one place (the
static import) is one way to avoid that, albeit with all the downsides of
static imports...

Disregarding the validity of providing a default charset in the first place,
would you say this way would be preferable to *Utf8 helper methods? (Just out
of curiosity).

Kind regards,
Graham

Brian Duff

Jan 30, 2010, 9:33:04 PM
to guava-...@googlegroups.com
On Sat, Jan 30, 2010 at 3:33 PM, Graham Allan <grundl...@googlemail.com> wrote:
>I don't see what the big deal is with having to provide a Charset argument.
> It's good practice to always provide a character set and having programmers
> think about the character set issue is a good thing when they haven't had
> to do so in the past.  Plus, using the code is a simple import of Charsets
> and then Charsets.UTF_8.
>

I'd agree, I was only suggesting it as an alternative to *Utf8 helper methods,
when it's desirable to keep the public API footprint small.

>If you're a fan of static imports (I know not everyone is; I like it
> occasionally for things like newArrayList, but I wouldn't use it here) you
> could import Charsets.UTF_8 statically and then use:
>
>Files.readLines(filename, UTF_8);

Yeah, that's obviously a reasonable thing to expect...

But (and I suspect I'm painting the dog house here) if there's to be more than
one call in the same source file, the charset is likely to be homogeneous, so
providing it every time would be boilerplate.

This is one of those places where I think instance methods would be better. If there were some class called, say, LineReader (er.. there is actually, but it's not quite the same as what we're currently discussing), it could look something like this:

public class LineReader {
  private LineReader() {...}

  public static LineReader newInstance(Charset cs) { ... }

  public List<String> readLines(File file) throws IOException { ... }
}

This is, of course, no better for one-off on the fly usages (which actually wind up being more verbose: LineReader.newInstance(Charsets.UTF_8).readLines()). But it's a bigger win for repeated usages with the same charset.

Since I like dependency injection, I also like that I can create a single instance for my whole application and just inject it everywhere I need it. Even better, I can stub it out with something that doesn't read files in my tests.
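One way the elided bodies in the outline above could be filled in is sketched here. It is not an existing Guava class: the name `CharsetLineReader` is chosen to avoid clashing with Guava's real LineReader, and the JDK's `java.nio.file.Files.readAllLines` stands in for Guava's `Files.readLines` so the example is self-contained.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.util.List;
import java.util.Objects;

// Hypothetical class: the charset is chosen once, at construction time.
final class CharsetLineReader {
    private final Charset charset;

    private CharsetLineReader(Charset charset) {
        this.charset = charset;
    }

    // Static factory, mirroring the newInstance(Charset) shape in the post.
    static CharsetLineReader newInstance(Charset charset) {
        return new CharsetLineReader(Objects.requireNonNull(charset, "charset"));
    }

    // Every read through this instance uses the same charset.
    List<String> readLines(File file) throws IOException {
        return Files.readAllLines(file.toPath(), charset);
    }
}
```

An application would create one instance, for example `CharsetLineReader.newInstance(StandardCharsets.UTF_8)`, and pass it to whatever needs to read text files.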
 
Setting it in one place (the
static import) is one way to avoid that, albeit with all the downsides of
static imports...

Disregarding the validity of providing a default charset in the first place,
would you say this way would be preferable to *Utf8 helper methods? (Just out
of curiosity).

Kind regards,
Graham

Sam Berlin

Jan 30, 2010, 10:06:49 PM
to guava-...@googlegroups.com
On Sat, Jan 30, 2010 at 9:33 PM, Brian Duff <bd...@google.com> wrote:

> This is one of those places where I think instance methods would be better.
> If there were some class called, say, LineReader (er.. there is actually,
> but it's not quite the same as what we're currently discussing), it could
> look something like this:
> public class LineReader {
>   private LineReader() {...}
>   public static LineReader newInstance(Charset cs) { ... }
>   public List<String> readLines(File file) throws IOException { ... }
> }
> This is, of course, no better for one-off on the fly usages (which actually
> wind up being more verbose:
> LineReader.newInstance(Charsets.UTF_8).readLines()). But it's a bigger win
> for repeated usages with the same charset.
> Since I like dependency injection, I also like that I can create a single
> instance for my whole application and just inject it everywhere I need it.
> Even better, I can stub it out with something that doesn't read files in my
> tests.

This makes a lot of sense, although I can imagine a number of people
would be put off by the need to create an instance of something just
to call utility methods. If "LineReader" delegated to the static
utility class (Files), people could have their cake and eat it too.
(Although, the cake here would still require a Charset.. but it'd be
baked in when you want to eat it.)

Sam

Paul Cowan

Jan 30, 2010, 10:11:01 PM
to guava-...@googlegroups.com
Sam Berlin wrote:
> This makes a lot of sense, although I can imagine a number of people
> would be put off by the need to create an instance of something just
> to call utility methods.

Easy enough to work around this for 99% of cases with a default
implementation (or two)

LineReader.UTF8.readLines(...);

LineReader.PLATFORM_DEFAULT_ENCODING_DO_NOT_USE_ARE_YOU_MAD.readLines(...);

Maybe with names that are more static-import friendly...

Paul

Sam Berlin

Jan 30, 2010, 10:27:03 PM
to guava-...@googlegroups.com
On Sat, Jan 30, 2010 at 10:11 PM, Paul Cowan <google...@funkwit.com> wrote:
> Sam Berlin wrote:
>>
>> This makes a lot of sense, although I can imagine a number of people
>> would be put off by the need to create an instance of something just
>> to call utility methods.
>
> Easy enough to work around this for 99% of cases with a default
> implementation (or two)
>
> LineReader.UTF8.readLines(...);
>
> LineReader.PLATFORM_DEFAULT_ENCODING_DO_NOT_USE_ARE_YOU_MAD.readLines(...);

Also possible to keep Files as-is and add a new LineReader as Brian
described, with the method implementations delegating to Files
(passing the Charset from the LineReader constructor). This makes it
slightly easier for existing users to continue using the library as-is
and also opens the door for mocking & DI (as Brian mentioned). It
does make it a little confusing that there are two ways to access the
same functionality, though.

Sam

Chris Nokleberg

Jan 30, 2010, 10:30:37 PM
to guava-...@googlegroups.com
The reason the methods are all static is mostly historical--they were
replacing a hodgepodge of existing static methods and it was much
easier to switch over the many callers to a similar API. I agree that
an instance-based API has some merit, although if you are going to go
instance-based I think you'd probably want to go all the way and have
it encapsulate the file too, and not just the Charset. Methods like
readLines(), toByteArray(), etc., could all be no-arg methods. You'd
just need a couple of static factory methods, one of which could
default to UTF-8.
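A rough sketch of what such an instance-based shape could look like follows. Everything in it is hypothetical (the class name `TextFile`, the factory names `of` and `utf8`), and it is built on JDK calls rather than Guava so that it stands alone:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Objects;

// Hypothetical class: binds a file and its charset together once, so the
// read methods take no arguments.
final class TextFile {
    private final Path path;
    private final Charset charset;

    private TextFile(Path path, Charset charset) {
        this.path = Objects.requireNonNull(path, "path");
        this.charset = Objects.requireNonNull(charset, "charset");
    }

    // Factory for callers who know the file's encoding.
    static TextFile of(Path path, Charset charset) {
        return new TextFile(path, charset);
    }

    // Factory that defaults to UTF-8.
    static TextFile utf8(Path path) {
        return new TextFile(path, StandardCharsets.UTF_8);
    }

    // Decodes the whole file with the bound charset.
    List<String> readLines() throws IOException {
        return Files.readAllLines(path, charset);
    }

    // Raw bytes; the charset plays no part here.
    byte[] toByteArray() throws IOException {
        return Files.readAllBytes(path);
    }
}
```

The default lives in a factory whose name says "utf8", so a reader of the calling code can still see which encoding was chosen.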

Kevin Bourrillion

Jan 31, 2010, 2:04:24 PM
to guava-...@googlegroups.com
This is exactly what I have always wanted us to do.  Chris and I discussed it at the start, but he was absolutely right that we needed a clean and proper API that we could actually migrate all of our Google code onto without excessive pain so we could get rid of our awful legacy stuff.


Kevin Bourrillion

Jan 31, 2010, 2:07:28 PM
to guava-...@googlegroups.com
On Sat, Jan 30, 2010 at 7:06 PM, Sam Berlin <sbe...@gmail.com> wrote:

This makes a lot of sense, although I can imagine a number of people
would be put off by the need to create an instance of something just
to call utility methods.

Note that that concern doesn't hold much water with us; see Joiner, Splitter, CharMatcher, Ordering, etc.


finnw

Feb 1, 2010, 11:10:54 AM
to guava-discuss
On Jan 29, 6:29 am, Jon Skeet <sk...@pobox.com> wrote:
> It's encouraging you to think about something that you really *should* be
> thinking about. It's a decision to be made - and using the platform default
> encoding is a choice which can easily lead to data loss. As I mentioned
> before, if we could sensibly default to UTF-8 that wouldn't be nearly so bad
> - but then we'd be out of line with the other Java APIs :(

Wouldn't it be safer to default to UTF_8 than force programmers to make a
possibly-uninformed choice of character set?

The advantage with UTF_8 (and also US_ASCII) is that if you read a file and
have incorrectly guessed the encoding, your program will start throwing
exceptions very quickly, because the probability of any file looking like one
of these encodings when it isn't is very small.

Now imagine a programmer discovering the readLines method, not knowing
which character set to pick, and opting for 8859-1 because it is familiar to
them (I made a similar mistake a few years ago in a program that used an
HTML parser.) They will never see an exception, because every possible
sequence of bytes is a valid 8859-1 string. This is far more likely to cause
data loss.

This only applies to reading of course, not to writing. Writing UTF-8 when
the receiver expects GB18030 for example, can cause corruption to go
undetected.

Finn
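The fail-fast argument can be checked with a small sketch, with one caveat worth stating: whether a wrong guess produces an exception depends on how the decoder is configured. `new String(bytes, charset)` substitutes a replacement character instead of throwing, and readers built directly from a Charset typically behave the same way, so the sketch below asks a `CharsetDecoder` to report errors explicitly. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical demo class: strict decoding that rejects invalid input.
final class StrictDecoding {

    // Returns true only if every byte sequence is valid in the charset.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Malformed or unmappable input for this charset.
            return false;
        }
    }
}
```

The four bytes `0x63 0x61 0x66 0xE9` ("café" as written by an 8859-1 program) are rejected by strict UTF-8 and US-ASCII decoding, but accepted as 8859-1, as is any other byte sequence.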

Jon Skeet

Feb 1, 2010, 11:16:55 AM
to guava-...@googlegroups.com
On 1 February 2010 16:10, finnw <fin...@gmail.com> wrote:
On Jan 29, 6:29 am, Jon Skeet <sk...@pobox.com> wrote:
> It's encouraging you to think about something that you really *should* be
> thinking about. It's a decision to be made - and using the platform default
> encoding is a choice which can easily lead to data loss. As I mentioned
> before, if we could sensibly default to UTF-8 that wouldn't be nearly so bad
> - but then we'd be out of line with the other Java APIs :(

Wouldn't it be safer to default to UTF_8 than force programmers to
make a possibly-uninformed choice of character set?

<snip>
 
This only applies to reading of course, not to writing.  Writing UTF-8
when the receiver expects GB18030 for example, can cause corruption to go
undetected.

... and there's the rub. It would be bad enough for us to have defaults in one place which went against the JDK defaults... but to be inconsistent about defaulting even within a single API sounds like a bad idea to me.

Jon
 