Suggested changes for biogo/io/featio

31 views
Skip to first unread message

Manolis

unread,
Oct 23, 2015, 1:10:12 PM10/23/15
to biogo-dev
Motivation: There two things that bug me in the featio package. The first is that the Read() methods return an interface (feat.Feature) and the second is that feat.Feature itself hides the Orientation. Both of them force me to write unnecessary client code to extract the data that I need. I know that I can write my own types to encapsulate the reader, do type assertions etc but I though it would be better to contact you to see if we can make a better API for the clients.

I suggest the following alternatives.

1. Decouple the file parsers themselves from the featio.Reader interface.
If I'm not mistaken the main reason that the Readers return feat.Feature is to satisfy the featio.Reader interface. So I suggest we move the gff and bed parsers in a new parser package and allow them to return concrete types. To maintain backwards compatibility we can rewrite the featio/gff and featio/bed to encapsulate the new parsers and satisfy the featio.Reader.
Advantages: 
a) Of course this does not reduce client code when you actually want the featio.Reader interface but it can possibly allow cleaner code when you want to encapsulate the parsers under e.g. an orientedFeatio.Reader
b) Avoid the featio.Reader altogether and use the parsers directly accessing their concrete types if one wants to.
Disadvantages: I cannot think of any except maybe if you consider the new structure more complex.

2.  Add the Orientation method in the feat.Feature interface.
Advantages: 
a) Reduce client code that requires the Orientation often (for me this is 90% of my work).
Disadvantages:
a) Breaks backwards compatibility and since feat.Feature is a very key player in biogo it would probably create a mess.

3.  Change featio.Reader interface so that the Read method returns a (new) feat.OrientedFeature interface.
Advantages:
a) Reduce client code that requires the Orientation often.
Disadvantages:
a) Breaks backwards compatibility. Perhaps with less side-effects than alternative #2.
b) Then probably the package shouldn't be named featio but sortedfeatio or sth 

I'm in favor of #1 but I would also love to see #2 at some point.
What are your thoughts? 

Dan Kortschak

unread,
Oct 23, 2015, 7:55:43 PM10/23/15
to Manolis, biogo-dev
On Fri, 2015-10-23 at 10:10 -0700, Manolis wrote:
> Motivation: There two things that bug me in the featio package. The first
> is that the Read() methods return an interface (feat.Feature) and the
> second is that feat.Feature itself hides the Orientation. Both of them
> force me to write unnecessary client code to extract the data that I need.
> I know that I can write my own types to encapsulate the reader, do type
> assertions etc but I though it would be better to contact you to see if we
> can make a better API for the clients.

The issue here is that not all features have an orientation.

In all cases where you are reading from a stream, you must know the
format of the stream and therefore you must know the underlying type. My
pattern is to call `f := sc.Feature().(T)` at the first line of the
scanner loop. I don't think this is cognitively or mechanically costly.

> I suggest the following alternatives.
>
> 1. Decouple the file parsers themselves from the featio.Reader interface.
> If I'm not mistaken the main reason that the Readers return feat.Feature is
> to satisfy the featio.Reader interface.

This is correct.

> So I suggest we move the gff and
> bed parsers in a new parser package and allow them to return concrete
> types. To maintain backwards compatibility we can rewrite the featio/gff
> and featio/bed to encapsulate the new parsers and satisfy the featio.Reader.
> Advantages:
>
> a) Of course this does not reduce client code when you actually want the
> featio.Reader interface but it can possibly allow cleaner code when you
> want to encapsulate the parsers under e.g. an orientedFeatio.Reader
>
> b) Avoid the featio.Reader altogether and use the parsers directly
> accessing their concrete types if one wants to.
>
> Disadvantages: I cannot think of any except maybe if you consider the new
> structure more complex.

More code to maintain and reduced genericity. And significantly,
breaking backwards compatibility.

> 2. Add the Orientation method in the feat.Feature interface.
> Advantages:
>
> a) Reduce client code that requires the Orientation often (for me this is
> 90% of my work).

If you are reading a format that has orientation and you do what I
suggest above, you get this naturally.

> Disadvantages:
>
> a) Breaks backwards compatibility and since feat.Feature is a very key
> player in biogo it would probably create a mess.

This is a deal breaker since the problem is not an issue of correctness.

> 3. Change featio.Reader interface so that the Read method returns a (new)
> feat.OrientedFeature interface.
> Advantages:
>
> a) Reduce client code that requires the Orientation often.
>
> Disadvantages:
>
> a) Breaks backwards compatibility. Perhaps with less side-effects than
> alternative #2.
>
> b) Then probably the package shouldn't be named featio but sortedfeatio or
> sth
>
>
> I'm in favor of #1 but I would also love to see #2 at some point.
> What are your thoughts?

I don't really see the need. When I use these things, I am also nearly
always using it for stranded feature analysis. The cost to me as a user
is one additional line. For this cost I get improved performance (a
single assertion amortised against repeated feat.Feature method calls)
and as a developer, reduced maintenance costs.

I'm happy to hear additional arguments about this, but the problems as
presented don't seem compelling. Perhaps an example use case where you
see this as a problem might help me understand the importance.

Dan

Manolis

unread,
Oct 26, 2015, 6:18:55 PM10/26/15
to biogo-dev, mns....@gmail.com
On Friday, October 23, 2015 at 7:55:43 PM UTC-4, Dan Kortschak wrote:
On Fri, 2015-10-23 at 10:10 -0700, Manolis wrote:
> Motivation: There two things that bug me in the featio package. The first
> is that the Read() methods return an interface (feat.Feature) and the
> second is that feat.Feature itself hides the Orientation. Both of them
> force me to write unnecessary client code to extract the data that I need.
> I know that I can write my own types to encapsulate the reader, do type
> assertions etc but I though it would be better to contact you to see if we
> can make a better API for the clients.

The issue here is that not all features have an orientation.
That's not entirely true.
For BED the ones that don't have it (BED3-BED5) can return NotOriented. That't perfectly legit.
As for the GFF.Feature, this one should have it and return the strand.
 
In all cases where you are reading from a stream, you must know the
format of the stream and therefore you must know the underlying type. My
pattern is to call `f := sc.Feature().(T)` at the first line of the
scanner loop. I don't think this is cognitively or mechanically costly.

That's not always true. Sometimes you don't know the format of the stream a priori.
func DoSth (r featio.Reader) {} does not know the underlying type of r.
 
> I suggest the following alternatives.
>
> 1. Decouple the file parsers themselves from the featio.Reader interface.
> If I'm not mistaken the main reason that the Readers return feat.Feature is
> to satisfy the featio.Reader interface.

This is correct.

> So I suggest we move the gff and
> bed parsers in a new parser package and allow them to return concrete
> types. To maintain backwards compatibility we can rewrite the featio/gff
> and featio/bed to encapsulate the new parsers and satisfy the featio.Reader.
> Advantages:
>
> a) Of course this does not reduce client code when you actually want the
> featio.Reader interface but it can possibly allow cleaner code when you
> want to encapsulate the parsers under e.g. an orientedFeatio.Reader
>
> b) Avoid the featio.Reader altogether and use the parsers directly
> accessing their concrete types if one wants to.
>
> Disadvantages: I cannot think of any except maybe if you consider the new
> structure more complex.

More code to maintain and reduced genericity. And significantly,
breaking backwards compatibility.

I actually believe that it will increase genericity. You will have parsers that work as tokenizers/scanners and can even be used directly without having to dive into the rest of biogo. You will still have types that satisfy the featio.Reader interface (all you have to do is just embed the parser types and create the appropriate signature for the Read method).
For backwards compatibility you are absolutely right it will break stuff that access the concrete types.
Let's see the following example. A function that scans a bed file of initially unknown type and prints the entries that match any from a list of oriented features.
I leave out the rules for matching but the important thing is that it requires the orientation.
func printMatchingBedFeats(file string, orientedFeats []OrientedFeature) {
fileReader, _ := os.Open(file)
r, _ := bed.NewReader(fileReader, guessBedType(file))
for {
f, err := r.Read()
if err != nil {
// handle error
}
var ori feat.Orientation
if o, ok := f.(feat.Orienter); ok {
ori = o.Orientation()
}
if matchFound(orientedFeats, f.Start(), f.End(), ori) {
fmt.Print(f)
}
}
}

The bottom line is that these lines 
var ori feat.Orientation
if o, ok := f.(feat.Orienter); ok {
ori = o.Orientation()
}
can spread in the code. To avoid this we would need to create a new interface and assert f into that and then use that from there on or perhaps even uglier like in the example above pass ori around together with f. Obviously in this case it is not such a big deal but I hope you agree that it is not the cleanest solution. It would be much nicer to have an API that gave access to the things we need most often.

Also a maybe not so irrelevant point from my admittedly small experience with Go. As a best practice, I avoid returning interfaces from functions or at least return the widest possible interface. On the other hand, I use the narrowest possible interface to define input requirements for a function. The argument for the first case is that using an unnecessarily narrow interface I limit the interfaces that the output can satisfy and then I have to use assertions. The argument for the second case is that using an unnecessarily wide interface limits the number of interfaces that can satisfy the requirements. I think the featio.Reader suffers from the first.

BTW I completely agree with you about backwards compatibility concerns. This is very important and potentially a deal breaker that requires further discussion.


 

Manolis

unread,
Oct 26, 2015, 6:28:05 PM10/26/15
to biogo-dev, mns....@gmail.com
Edit: I think the featio.Reader suffers from the first. => I think the readers in the featio package suffer from the first.

Dan Kortschak

unread,
Oct 28, 2015, 9:53:12 PM10/28/15
to Manolis, biogo-dev
On Mon, 2015-10-26 at 15:18 -0700, Manolis wrote:
> On Friday, October 23, 2015 at 7:55:43 PM UTC-4, Dan Kortschak wrote:
> > The issue here is that not all features have an orientation.
> >
> That's not entirely true.
> For BED the ones that don't have it (BED3-BED5) can return NotOriented.
> That't perfectly legit.
> As for the GFF.Feature, this one should have it and return the strand.

For the classes of features that exist in featio, you are right, but
features are more general than that.

> > In all cases where you are reading from a stream, you must know the
> > format of the stream and therefore you must know the underlying type. My
> > pattern is to call `f := sc.Feature().(T)` at the first line of the
> > scanner loop. I don't think this is cognitively or mechanically costly.
> >
>
> That's not always true. Sometimes you don't know the format of the stream a
> priori.
> func DoSth (r featio.Reader) {} does not know the underlying type of r.

In this case you don't know that the feature stream that you have holds
the data that you need for the analysis. feat.Feature is the lowest
common denominator.

If you are in a position to not require some of the data elements of a
feature, then you can make a feat.Feature implementation that implements
additional methods to conditionally return the wrapped feat.Feature
obtained from the Scanner. Alternatively a filter func that does the
same thing can be used. I do these - usually the latter unless I need
the type to match an interface.


> I actually believe that it will increase genericity. You will have parsers
> that work as tokenizers/scanners and can even be used directly without
> having to dive into the rest of biogo. You will still have types that
> satisfy the featio.Reader interface (all you have to do is just embed the
> parser types and create the appropriate signature for the Read method).
> For backwards compatibility you are absolutely right it will break stuff
> that access the concrete types.

It doesn't just break that. It breaks anything that previously
implemented feat.Feature and doesn't return the orientation.

Part of the reason behind the absence of Orientation is the biogical
distinction between orientation (the general concept) and strand (the
nucleic acid-specific concept which is isomorphic). In hindsight, this
is probably unfortunate - maybe sequences should have had orientation
rather than strand. But it is what it is now. It's unreasonable to have
both Orientation and Strand on a seq.Sequence. I am involved in
designing APIs with two methods that return essentially the same thing
(we are still working through this - but that case is far more
conceptually complicated than here), and it's not something I want to
add to seqquences and features.
This is how I would do that if I found that it was happening a lot.

type orientedFeature interface {
feat.Feature
feat.Oriented
}

type oriented struct { feat.Feature }

func (f oriented) Orientation() feat.Orientation { return feat.NotOriented }

func asOriented(f feat.Feature) orientedFeature {
if f, ok := f.(orientedFeature); ok {
return f
}
return oriented{f}
}

func printMatchingBedFeats(file string, orientedFeats []OrientedFeature) {
fileReader, _ := os.Open(file)
r, _ := bed.NewReader(fileReader, guessBedType(file))
for {
f, err := r.Read()
if err != nil {
// handle error
}
if matchFound(orientedFeats, oriented(f)) {
fmt.Print(f)
}
}
}



> Also a maybe not so irrelevant point from my admittedly small experience
> with Go. As a best practice, I avoid returning interfaces from functions or
> at least return the widest possible interface. On the other hand, I use the
> narrowest possible interface to define input requirements for a function.

Rather than using o() and O(), I'd advise you use 𝛺() for your method
set complexity. When it's internal, this is generally fairly easy to
define.

Manolis

unread,
Nov 4, 2015, 8:51:40 PM11/4/15
to biogo-dev, mns....@gmail.com
Thanks for taking the time to give me some coding alternatives. I appreciate it.

> Also a maybe not so irrelevant point from my admittedly small experience
> with Go. As a best practice, I avoid returning interfaces from functions or
> at least return the widest possible interface. On the other hand, I use the
> narrowest possible interface to define input requirements for a function.

Rather than using o() and O(), I'd advise you use 𝛺() for your method
set complexity. When it's internal, this is generally fairly easy to
define.

I did not understand this one.
 

Dan Kortschak

unread,
Nov 4, 2015, 9:14:23 PM11/4/15
to Manolis, biogo-dev
On Wed, 2015-11-04 at 17:51 -0800, Manolis wrote:
> > Also a maybe not so irrelevant point from my admittedly small
> experience
> > > with Go. As a best practice, I avoid returning interfaces from
> functions
> > or
> > > at least return the widest possible interface. On the other hand,
> I use
> > the
> > > narrowest possible interface to define input requirements for a
> > function.
> >
> > Rather than using o() and O(), I'd advise you use &#120570() for
> your method
> > set complexity. When it's internal, this is generally fairly easy
> to
> > define.
>
>
> I did not understand this one.
>
Rather than making the interface the smallest it can be for an input and
the largest it can be for an output, you make it the closest it can
reasonably be for all cases - this is close to impossible for exported
interfaces, but generally readily achievable for internal interfaces.

Reply all
Reply to author
Forward
0 new messages