Some of these suggest interface changes (better now than later, even
if we're all (still) tired of this topic).
(1) Iterable vs. Collection
Right now the methods return Collections. They could instead return
Iterables (which only have an iterator() method).
Arguments for:
- makes it harder for people to screw up our data models by modifying
the Collection returned
- allows us to return (and use internally) arrays, which are Iterables
against:
- we need to provide extra methods (size, contains for each
Collection-returning method, probably).
(2) An easy one (I hope): what purpose does the interface Data serve?
It has no methods and seems to specify no behavior; if it has no
present use I'd suggest removing it (and adding it back in later if
appropriate).
(3) Rather than reinvent Predicate, why not use the
commons-collections interface by that name, which comes with many
useful utilities?
(4) What is the Scale interface for? Some class-level documentation
would be handy; the method-level docs are obscure without the larger
context.
I hope that you all have enjoyed your holidays.
Regards,
Joshua
--
joshua.o...@gmail.com...................www.ics.uci.edu/~jmadden
Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for. -- Bill Watterson
My opinions are too rational and insightful to be those of any organization.
Thanks for keeping the conversation lively. I've been busy writing
reports during the winter break but now, I'm back. So, I start by
wishing a happy new year to everyone in the list.
I would like to raise another issue that is the "portability" of the
interface for other languages.
The Java classes are well designed but quite specific.
Therefore, I would avoid relying on Java idioms as much as possible when
they don't translate well into other languages.
This is why I would prefer:
Iterable<V> getNeighbors(V v)
int getNeighborCount(V v)
boolean isNeighbor(V v1, V v2)
to
Collection<V> getNeighbors(V v)
Furthermore, Iterators are nice, but when you have to create them when
iterating a large graph, the cost is important, due to the allocation
and garbage-collection.
On the same line, AWT Components that initially had only the
Rectangle getBounds()
method, now have
Rectangle getBounds(Rectangle r)
to avoid allocating rectangles when possible. I implemented iterators
the same way in my toolkit:
Iterator<V> getNeighbors(V v, Iterator<V> i)
so that the same iterator object could be reused. It also saves the
double indirection of Iterable<V>.
I know it is ugly but quite effective.
Iterating over a graph becomes:
Iterator<V> vi = null;
for (Iterator<V> v = graph.getNodes(null); v.hasNext(); ) {
for (vi = graph.getNeighbors(v, vi); vi.hasNext(); ) { // note that vi
is reused
V neighbor = vi.next();
...
}
}
Another idiom that Colt likes is mapper. However, it does not translates
well in C/C++ that don't have inner classes or lamda functions.
Therefore, I would vote for NOT using Collections.
Joshua's question 2:
> (2) An easy one (I hope): what purpose does the interface Data serve?
> It has no methods and seems to specify no behavior; if it has no
> present use I'd suggest removing it (and adding it back in later if
> appropriate).
I am not sure of the purpose of the Data interface in Java that has a
toplevel Object type. For other languages, it is handy.
> (3) Rather than reinvent Predicate, why not use the
> commons-collections interface by that name, which comes with many
> useful utilities?
Again, this is a JavaCentric proposition but that would be fine if we
had a Tuple object that would carry the Table and the row.
BTW, I am definitively in favor of adding a Tuple object if, like the
Iterator, it can be reused. It then becomes a bridge between the table
model and the object model at an acceptable cost. Otherwise, it is too
expensive.
> (4) What is the Scale interface for? Some class-level documentation
> would be handy; the method-level docs are obscure without the larger
> context.
Agreed. I just worked on these items that I call Axis. It could be
called AxisScale.
I added some javadoc to give an example. Maybe Jeff wants to elaborate.
Joshua O'Madadhain a écrit :
> ------------------------------------------------------------------------
>
>
> No virus found in this incoming message.
> Checked by AVG - http://www.avg.com
> Version: 8.0.176 / Virus Database: 270.10.3/1878 - Release Date: 06/01/2009 07:56
>
>
>> On the same line, AWT Components that initially had only the
>> Rectangle getBounds()
>> method, now have
>> Rectangle getBounds(Rectangle r)
>> to avoid allocating rectangles when possible. I implemented iterators
>> the same way in my toolkit:
>> Iterator<V> getNeighbors(V v, Iterator<V> i)
>> so that the same iterator object could be reused.
>>
>
> I'd like to see the implementation of this method. My understanding
> is that Iterator objects are not meant to be re-used, and I don't know
> of any way to do so. This is, in fact, at least part of why I have
> been advised to not return Iterators by people whose Java programming
> expertise I respect (the rest of the argument is made nicely here:
> http://jroller.com/tackline/entry/keep_iterators_to_yourself).
>
>
Reusing iterators is easy:
Iterator<E> getLinks(V v, Iterator<V> old) {
MyEdgeIterator ei = null;
if (old instanceof MyEdgeIterator) {
ei = (MyEdgeIterator) old;
ei.reset(whatever);
}
else {
ei = new MyEdgeIterator(whatever);
}
return ei;
}
I'd be happy to strictly conform to Java good practices but the price to
pay is huge. The code above works also
when a null iterator is supplied so it fits beginners/lazy programmers
practice but it also allows for optimizations.
I understand Java syntax is not optimally designed for that as you point
out.
>> It also saves the
>> double indirection of Iterable<V>.
>>
>
> >From a syntactic standpoint, this is a non-issue as of Java 1.5. Rather than
>
> for (Iterator<V> it = graph.getNodes(); it.hasNext(); ) {
> V v = it.next();
> ...
>
> this is much cleaner:
>
> for (V v : graph.getNodes()) {
> ...
>
> If you're talking about the extra method call overhead...it seems to
> me that Java itself comes with much greater overhead than that, and
> I'm not sure why we're using Java at all if we're really that
> concerned about such things.
>
>
Not method call overhead, allocation/GC overhead. Creating a new object
for iterating each entity in the graph is expensive.
>> I know it is ugly but quite effective.
>>
>> Iterating over a graph becomes:
Here is the corrected version. It is ok to allocate one or two iterators
for a graph traversal but not as many as there are nodes.
Iterator<V> vi = null;
for (Iterator<V> iter = graph.getNodes(null); iter.hasNext(); ) {
V v = iter.next();
for (vi = graph.getNeighbors(v, vi); vi.hasNext(); ) { // vi is reused
V neighbor = vi.next();
...
}
}
>> Another idiom that Colt likes is mapper. However, it does not translates
>> well in C/C++ that don't have inner classes or lamda functions.
>>
>
> Not sure what you're referring to here. Can you clarify? (Java
> doesn't have lambda functions either...)
>
It has inner classes that I see as a sort of palliative for
lambda/closure in functional languages.
Yep, exactly, plus all the handy methods.
> Version: 8.0.176 / Virus Database: 270.10.5/1881 - Release Date: 07/01/2009 17:59
>
>
I suppose it could. Has this been a problem in practice for you in
recent years?
> Reusing iterators is easy:
>
> Iterator<E> getLinks(V v, Iterator<V> old) {
> MyEdgeIterator ei = null;
> if (old instanceof MyEdgeIterator) {
> ei = (MyEdgeIterator) old;
> ei.reset(whatever);
> }
> else {
> ei = new MyEdgeIterator(whatever);
> }
> return ei;
> }
>
> I'd be happy to strictly conform to Java good practices but the price to
> pay is huge. The code above works also
> when a null iterator is supplied so it fits beginners/lazy programmers
> practice but it also allows for optimizations.
> I understand Java syntax is not optimally designed for that as you point
> out.
Ah. So in order to make this optimization possible you have to
provide custom Iterators that can be reset. This is not something
that's really natural to Java programmers in my experience.
The price of conforming to Java good practices may be less efficient
code in some regards (although it's not clear to me that that's
actually a big issue in practice these days). The price of _not_
doing so is that our APIs are more confusing and lead to more
confusing code. Generally speaking, unless we're talking about
large-scale computing, if there's a tradeoff between programmer time
and computer time, I know which one I want to optimize for. :)
>>> I know it is ugly but quite effective.
>>>
>>> Iterating over a graph becomes:
> Here is the corrected version. It is ok to allocate one or two iterators
> for a graph traversal but not as many as there are nodes.
>
> Iterator<V> vi = null;
> for (Iterator<V> iter = graph.getNodes(null); iter.hasNext(); ) {
> V v = iter.next();
> for (vi = graph.getNeighbors(v, vi); vi.hasNext(); ) { // vi is reused
> V neighbor = vi.next();
> ...
> }
> }
Seems to me that if you're really worried about this, you would return
arrays or Lists so that you wouldn't have to use iterators at all. (I
am not recommending this! :) )
Joshua O'Madadhain a écrit :
> On Fri, Jan 9, 2009 at 12:18 AM, Jean-Daniel Fekete
> <Jean-Dani...@laposte.net> wrote:
>
>> I am in favor of the Iterable-plus-size-and-contains methods but I hate
>> the allocation/GC overhead that is large when iterating large graphs.
>> The allocation/GC overhead can be larger than the computing code, for
>> example when collecting statistics or searching for connected components.
>>
>
> I suppose it could. Has this been a problem in practice for you in
> recent years?
>
>
Yes. I visualize the structure of Wikipedia (page to page links). The
French Wikipedia is 600,000 pages and 6,000,000 links.
See http://www.aviz.fr/~elm/projects/zame.html
I propose two levels of interface, one that adheres strictly to standard
java good practices and a second level to allow optimizations.
Just like the Component.getBounds() methods. One allocates whereas the
other can avoid allocating.
>>>> I know it is ugly but quite effective.
>>>>
>>>> Iterating over a graph becomes:
>>>>
>> Here is the corrected version. It is ok to allocate one or two iterators
>> for a graph traversal but not as many as there are nodes.
>>
>> Iterator<V> vi = null;
>> for (Iterator<V> iter = graph.getNodes(null); iter.hasNext(); ) {
>> V v = iter.next();
>> for (vi = graph.getNeighbors(v, vi); vi.hasNext(); ) { // vi is reused
>> V neighbor = vi.next();
>> ...
>> }
>> }
>>
>
> Seems to me that if you're really worried about this, you would return
> arrays or Lists so that you wouldn't have to use iterators at all. (Iby
> am not recommending this! :) )
>
There is a tension when designing interfaces. You want to abstract the
underlying structure but you want to keep performance and resources
reasonable.
In our case, since we design "models" of the data structures, there is
always the possibility of using the underlying implementation when you
know it and use whatever concrete data structure it maintains directly
if it provides more efficient mechanisms.
This iterator thing will provide a 10x increase of speed compared to the
standard one. If you want a 100x, you have to access the raw data
structure, that's my point.
Jean-Daniel.
> Joshua