Splitter API - How do I extract tabular values easily?

Andreas

Mar 17, 2010, 9:05:00 PM
to guava-discuss
I'm using Splitter to extract values from a tabular text file. How do
I refer to the columns in the table by number?

I was imagining this:

Splitter splitter=Splitter.on(',');
List<String> values=splitter.split(line);
String name=values.get(3);
String postcode=values.get(10);
etc....

Instead I get an Iterable back from split. At first glance, this seems
inconvenient when splitting tabular data, where each column index has
a specific meaning.
What is the recommended way to get a value with a certain index out of
the returned Iterable?

Looking round Guava, I see there is Iterables.get or
Iterables.toArray. So perhaps something like this:

String[] values=Iterables.toArray(splitter.split(line), String.class);
String name=values[3];
String postcode=values[10];

Is this the best I can do using the Guava splitter, or am I missing
something?

Kevin Bourrillion

Mar 17, 2010, 9:11:11 PM
to guava-...@googlegroups.com
You can use ImmutableList.copyOf() to turn the result of split() into a list.

It would be nice to have splitToList(), or to have split() return a FluentIterable which would have .toImmutableList() on it.  Unfortunately, because of our internal development processes we have a prohibition on circular dependencies between packages. :-(

--
Kevin Bourrillion @ Google
internal:  http://goto/javalibraries
external: http://guava-libraries.googlecode.com

Andreas

Mar 17, 2010, 10:32:02 PM
to guava-discuss
So we have

List<String> values = ImmutableList.copyOf(splitter.split(line));

That's better than what I was going to use, but still not great.

Could splitToList() return an ArrayList<String> ? No problems with
circular dependencies there?

List<String> values = splitter.splitToList(line);


Johan Van den Neste

Mar 19, 2010, 12:23:35 PM
to guava-...@googlegroups.com
> So we have
>
> List<String> values = ImmutableList.copyOf(splitter.split(line));
>
> That's better than what I was going to use, but still not great.

But why is it not great?

--
Johan Van den Neste

Nikolas Everett

Mar 19, 2010, 12:49:57 PM
to guava-...@googlegroups.com
On Wed, Mar 17, 2010 at 10:32 PM, Andreas <awm...@gmail.com> wrote:
> So we have
>
> List<String> values = ImmutableList.copyOf(splitter.split(line));
>
> That's better than what I was going to use, but still not great.
>
> Could splitToList() return an ArrayList<String> ? No problems with
> circular dependencies there?


I prefer it to return an Iterable.  That way splitter.split(stringBuffer) doesn't have a chance of creating a one squidillion entry list.

When dealing with tabular data I prefer to do it like this:
Iterator<String> itr = splitter.split(line).iterator();
String name = itr.next();
double unitPrice = Double.parseDouble(itr.next());
...etc...
just so it's super clear what the format is.

You could even catch NoSuchElementException and wrap it in a nicer one about malformed lines.
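
A minimal sketch of that iterator-style parsing, assuming a two-column line of name and unit price. The names `LineItemParser`, `LineItem`, and `MalformedLineException` are hypothetical and not part of Guava. The method takes any `Iterable<String>`, so the result of `splitter.split(line)` could be passed in directly. Catching `NumberFormatException` as well is an extra assumption beyond what the post suggests.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class LineItemParser {
    /** Hypothetical unchecked exception for lines with missing or unparsable fields. */
    public static class MalformedLineException extends RuntimeException {
        public MalformedLineException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical value type holding the two columns of one line. */
    public static final class LineItem {
        public final String name;
        public final double unitPrice;

        LineItem(String name, double unitPrice) {
            this.name = name;
            this.unitPrice = unitPrice;
        }
    }

    /** Reads the fields in order: first the name, then the unit price. */
    public static LineItem parse(Iterable<String> fields) {
        Iterator<String> itr = fields.iterator();
        try {
            String name = itr.next();
            double unitPrice = Double.parseDouble(itr.next());
            return new LineItem(name, unitPrice);
        } catch (NoSuchElementException e) {
            // Thrown by itr.next() when the line ran out of fields.
            throw new MalformedLineException("line has fewer than 2 fields", e);
        } catch (NumberFormatException e) {
            // Assumption: a non-numeric price is also reported as a malformed line.
            throw new MalformedLineException("unit price is not a number", e);
        }
    }
}
```

With Guava on the classpath, a call might look like `LineItemParser.parse(splitter.split(line))`.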

Nik

Etienne Neveu

Mar 19, 2010, 6:14:53 PM
to guava-discuss
I'd love to see some kind of FluentIterable (with convenience methods
such as toImmutableList(), toArrayList(), toHashSet(),
filteredWith(Predicate), ...). Too bad it's not possible due to this
circular dependency problem.

I don't like the idea of adding splitToList() / splitToArrayList() /
splitToImmutableList() to the Splitter itself. What would then stop us
from adding splitToSet(), splitToImmutableSet(), splitToArray()? I
think that while it would help readability in some cases, it would
also "pollute" the API and make it harder to learn.

Kent Beck touches on this subject in "Implementation Patterns" (pages
87-88 - sections: "Conversion", "Conversion method", "Conversion
Constructor"). I agree with him. Some quotes:
- "It’s not worth introducing a new dependency just to have a
convenient expression of conversion."
- "conversion methods become unwieldy when there are an unbounded
number of potential conversions"
- "These disadvantages lead me to use conversion methods sparingly and
only in situations where I am converting to objects of similar type."
(great book BTW)

I think that, unless we could get some kind of FluentIterable (which
does not seem likely, due to circular dependency requirements), an
Iterable is the perfect return type:
- you may use it "as is" if you simply want to iterate on the
splitting result
- you may dump it in the collection of your choice (whether ArrayList
or some custom collection). It may not read as nicely, but it's
flexible
- unless you decide you want to put the result in a Collection, the
objects are not allocated (as would be the case if, for example, the
Splitter returned an array - like String.split()).
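
To illustrate the last point, here is a plain-JDK sketch of a lazy split. It is a stand-in written for this example, not Guava's implementation, and the class name `LazySplit` is hypothetical. Each field is cut out of the input only when `next()` is called, so a caller that stops early never pays for the remaining fields.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical lazy splitter on a single separator character. */
public final class LazySplit implements Iterable<String> {
    private final String input;
    private final char separator;

    public LazySplit(String input, char separator) {
        this.input = input;
        this.separator = separator;
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            // Index where the next field starts; -1 once the input is exhausted.
            private int start = 0;

            @Override
            public boolean hasNext() {
                return start >= 0;
            }

            @Override
            public String next() {
                if (start < 0) {
                    throw new NoSuchElementException();
                }
                // Search for the next separator only now, on demand.
                int end = input.indexOf(separator, start);
                String field;
                if (end < 0) {
                    field = input.substring(start);
                    start = -1;
                } else {
                    field = input.substring(start, end);
                    start = end + 1;
                }
                return field;
            }
        };
    }
}
```

By contrast, `String.split()` builds the whole array before returning, whether or not the caller reads every element.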

For now, I would do:
List<String> splitList = Lists.newArrayList(splitter.split(line));
or, with static imports
List<String> splitList = newArrayList(splitter.split(line));

If I see myself using it in many places, I guess I could encapsulate
it in some kind of static method.
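
Such a static method could be sketched as follows. The class name `SplitUtil` and the method name `toList` are hypothetical. The sketch uses only the JDK and accepts any `Iterable<String>`, which is what `splitter.split(line)` returns.

```java
import java.util.ArrayList;
import java.util.List;

public final class SplitUtil {
    private SplitUtil() {}

    /** Copies every element of the Iterable, in iteration order, into a new ArrayList. */
    public static List<String> toList(Iterable<String> parts) {
        List<String> result = new ArrayList<>();
        for (String part : parts) {
            result.add(part);
        }
        return result;
    }
}
```

A call site would then read `List<String> values = SplitUtil.toList(splitter.split(line));`, which keeps the copying detail in one place.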

- Etienne

Andreas

Mar 21, 2010, 9:48:07 PM
to guava-discuss
>> List<String> values = ImmutableList.copyOf(splitter.split(line));
>> That's better than what I was going to use, but still not great.

>But why is it not great?

The code above is dealing with the *how* (the technical details of
copying data around between different formats) as well as the *what*
(splitting comma delimited text into separate fields). Code that deals
only with *what* you are trying to achieve is easier to understand and
to maintain.

> I prefer it to return an Iterable. That way splitter.split(stringBuffer) doesn't have a chance of creating a one squidillion entry list.

>an Iterable is the perfect return type:


> - you may use it "as is" if you simply want to iterate on the splitting result
> - you may dump it in the collection of your choice (whether ArrayList or some custom collection). It may not read as nicely, but it's flexible
> - unless you decide you want to put the result in a Collection, the objects are not allocated (as would be the case if, for example, the Splitter returned an array - like String.split()).

Those are all valid points. The fact that the Splitter has the option
of returning an Iterable is a big improvement over String.split.
Iterable is certainly the most flexible type. As always, there is a
trade off between flexibility and convenience.

There are two very different use cases for a String splitter.

The first is where you have data of an unknown length. For instance,
separating the text on a web page into distinct words. Iterable is
great for this use case.

The second use case is where the data is of a fixed length and a known
format. For example, tabular text. In this case, Iterable is not
ideal. The data is of a known length, so memory allocation is not an
issue. The extra task of converting the Iterable to a List every time
you call split doubles the number of calls required to use the API.

>I don't like the idea of adding splitToList() / splitToArrayList() /
>splitToImmutableList() to the Splitter itself. What would then stop us
>from adding splitToSet(), splitToImmutableSet(), splitToArray()? I
>think that while it would help readability in some cases, it would
>also "pollute" the API and make it harder to learn.

Yes. Every member of the API should "pull its weight". Adding ALL
those methods would indeed be bad design. I would argue that
supporting a whole other group of users by providing one extra
function is worth it. By only providing an Iterable, you are ignoring
this second use case.

As a concrete suggestion, I would add a single method:
List<String> splitToList(CharSequence sequence)

Andreas

Mar 21, 2010, 9:50:53 PM
to guava-discuss

> When dealing with tabular data I prefer to do it like
> Iterator<String> itr = splitter.split(line).iterator();
> String name = itr.next();
> double unitPrice = Double.parseDouble(itr.next());
> ...etc...
> just so it's super clear what the format is.

I have used this approach. It has some maintainability problems. What
happens if you comment out or remove one of the calls to itr.next()?
The rest of your data is now going into the wrong columns. I've found
that using fixed column numbers is clearer.
It is also not ideal for data with a large number of columns where you
are only using 2 or 3 of the columns. You need a lot of calls to
itr.next() just to get at the columns you actually want.
The advantage of this method is that it is easier to change the data
format itself by inserting or removing columns (where this is
possible / desired).
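
A sketch of the fixed-column-number style, assuming the fields have already been copied into a `List<String>` (for example with `ImmutableList.copyOf(splitter.split(line))`). The class `AddressRow` is hypothetical, and the column positions are the example indices 3 and 10 from the first post. Naming the indices and checking the column count up front is one way to keep a removed line from silently shifting the remaining columns.

```java
import java.util.List;

public final class AddressRow {
    // Hypothetical column positions, matching the indices used earlier in the thread.
    private static final int NAME = 3;
    private static final int POSTCODE = 10;
    private static final int MIN_COLUMNS = POSTCODE + 1;

    public final String name;
    public final String postcode;

    private AddressRow(String name, String postcode) {
        this.name = name;
        this.postcode = postcode;
    }

    /** Picks out only the columns of interest, ignoring all the others. */
    public static AddressRow fromFields(List<String> fields) {
        if (fields.size() < MIN_COLUMNS) {
            throw new IllegalArgumentException(
                    "expected at least " + MIN_COLUMNS + " columns but got " + fields.size());
        }
        return new AddressRow(fields.get(NAME), fields.get(POSTCODE));
    }
}
```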

Andreas

Mar 21, 2010, 9:58:55 PM
to guava-discuss
>For now, I would do:
>List<String> splitList = Lists.newArrayList(splitter.split(line));
>or, with static imports
>List<String> splitList = newArrayList(splitter.split(line));

Thanks. That's perhaps a bit nicer.
