SymbolList interface : a question about the design

17 views
Skip to first unread message

Murray Patterson

unread,
Oct 27, 2014, 10:36:54 AM10/27/14
to biopp-he...@googlegroups.com
Hello,

I'm in the process of writing a new type of sequence class, which inherits from the Sequence interface, where Sequence subsequently inherits from SymbolList interface.

when doing all of this, I notice (amongst several others), for example, that SymbolList has a pure virtual function

virtual const std::vector<int>& getContent() const = 0;

while it does not contain any  std::vector<int>  (content_ , for example) object!! (containing such an object when there exists a pure virtual function that returns such an object is recommended design of an interface).  So when one goes to override the pure virtual getContent function (a perfectly normal thing which one must do when implementing an interface), one must also define a member object of type std::vector<int> as well!!  I find this surprising

To get around this, for my own type of sequence class, I've just defined an object of type  std::vector<int>  in my list of variables.  Then I go to see (the logical consequence of not defining such an object in SymbolList) that each class that inherits from Sequence (hence indirectly from SymbolList), such as BasicSequence and SequenceWithAnnotation, etc., each define their own (redundant)  std::vector<int> content_  objects.  Why wasn't this just defined in the SymbolList interface, where it should have been defined?  Is there a good reason for doing this that I do not see?

thanks,
Murray

Julien Yann Dutheil

unread,
Oct 28, 2014, 3:16:13 PM10/28/14
to biopp-he...@googlegroups.com
Hi Murray,

You are definitely raising a good point.

First of all, in Bio++, we use a "pure" object approach, meaning that interfaces do not contain any field (they are "pure" abstract classes, with virtual methods only). Yet I agree with you that having a virtual method outputting a reference is extremely silly, as it imposes the implementing class to have such a field (this only leaves the freedom to chose the variable name :s).

So what is the flaw here? I tend to think that actually the method itself should not be in the interface. The interface should define the int operator[size_t], and that's it, because it would make sense for instance to have a Sequence object based on a deque<int> and not a vector<int>, and this is not possible in the current design.

I will update the code accordingly.

Many thanks for your feedback,

Julien.

Murray Patterson

unread,
Oct 28, 2014, 6:11:57 PM10/28/14
to biopp-he...@googlegroups.com
Hi Julien,

yes, I agree to this approach, I think it would be an improvement!  so thanks very much!

Cheers,
Murray

Murray Patterson

unread,
Oct 29, 2014, 10:34:12 AM10/29/14
to biopp-he...@googlegroups.com
Hi Julien,

okay, the reason I'm into the source code of bio++ is because I'm trying to add a way to allow probabilities in the input as per the thread named "bppsuite : a question about the input format" ... I believe you addressed this post as well.  Hence, I take it you know the source quite well, and so I'd like to get your opinion on the following, which is sort of related to the above design issue, i.e., what sparked this current thread

What I've done to allow probabilities in the input is to define a new type of sequence called ProbabilisticSeqeunce, which inherits from Sequence.  The idea of the ProbabilisticSequence, is that instead of a sequence of characters ACG ... each position is a (discrete) PDF on the alphabet, i.e., the BasicSequence "ACG" would be the ProbabilisticSeqeunce "<1,0,0,0> <0,1,0,0> <0,0,1,0>" on the alphabet {A,C,G,T}, and so the sequence "content" of the ProbabilisticSequence is not even a vector<int> or even a vector<string> at all, but rather a matrix, in fact a bpp::DataTable object

And, so in addition to suppressing the existence of virtual const std::vector<int> & getContent() const = 0 , I wonder if one should make it even more generalized, i.e., to allow the content to be any type? ... perhaps to create a sequence "Content" class that inherits from Cloneable?  This would quite a major overhaul of bio++, since at the base, all sequences contain some sort of content that can be represented as a vector<int>, and so I would not vote for this, based on my lack of knowledge on how widely that ProbabilisticSequence will be used by bio++ users ... at the moment it's being added only for the specific use of our project

another way I thought of just now to deal with ProbabilisticSequence objects, is to implement a canonical encoding of a DataTable into a vector<int> ... indeed one exists, we can just pick one and implement it ... then the ProbabilisticSequence fits into this framework of sequence content being vector<int> objects.  Of course, the individual elements of this vector<int> would be gobbledegook because its some complex encoding that can only be translated to and from a DataTable to make any sense of it

what do you think?  Just trying to think of best practices right now.  In any case, I've pulled the git repo of the development version of bio++ and I've added ProbabilisticSequence and a few interfaces for it, and it's fitting in fairly fine so far, compiling and running.  Eventually, when it's stable I'd like to push it to the remote repo for everyone to use

thanks and Cheers,
Murray
 




Murray Patterson

unread,
Oct 30, 2014, 6:28:46 PM10/30/14
to biopp-he...@googlegroups.com
hello again Julien,

I have another question regarding now the design of the Site class ... I figure it's best to keep it within this thread since its part of same issue, that and the fact that the title of this thread is "a question about design" ... sorry if this is sort of turning into a rant, or at least a bombardment of design questions or comments, but I have a few other thoughts/ideas

after more reflection, and dealing with the code, I realized that having the Sequence interface contains a vector of "Content" objects, where the Content object can contain more complex things than integers, may not be the best thing for speed -- especially since I'm unsure how many people will be defining Sequence objects that cannot be represented as integers

and indeed, this ProbabilisticSequence class that I have defined that inherits from Sequence, is a pretty clean way of representing the type of sequence that I want (except for this little drawback of the vector<int> & getContent() which we both agree is a design flaw, and will be henceforth fixed) ... I see now that indeed the Sequence interface provides exactly that function, to be inherited by whatever type of sequence one wishes to devise

However, I'm now into the stage of dealing with aligned ProbabilisticSequences, the logical way to handle this being to translate it into a set of Site objects, since it is my understanding that the Site object is just a vertical Sequence, i.e., the "site" at a set of sequences, while a "sequence" is just a sequence at a set of sites ... that is a Site is (sequence) content with a position.  However, I see here that Site inherits from BasicSymbolList ... should it not just inherit from SymbolList in the same way that Sequence does?  For instance, since BasicSequence inherits from Sequence and BasicSymbolList, shouldn't it be the case also that we have instead a "BasicSite" that inherits from Site (which in turn inherits just from SymbolList) and BasicSymbolList?  Is there any reason why Site needs to inherit from BasicSymbolList and not just SymbolList?

thanks again,
Murray

Julien Yann Dutheil

unread,
Oct 31, 2014, 4:18:30 AM10/31/14
to biopp-he...@googlegroups.com
Dear Murray,

No problem for having a long thread (would have better fitted in the devel forum though...).

I start by the easy one: yes Site should inherit from SymbolList. In fact, so did Sequence before, until we needed to have other types of sequences (SequenceWithAnnotation, etc). We then renamed the Sequence object into a BasicSequence, and created the Sequence interface. We could do the same thing for Site of course. As there was no need for it so far, it was not done. If you would like to, you can rename the current Site object into a BasicSite class, and create a Site interface with the necessary virtual method. I can integrate this to Bio++ (I'm still working on the getContent suppression, which leads to quite some (good) changes... more on that later).

Now concerning your other point (ProbalisticSequence), I agree with your conclusions: having a more general 'content' class would come at the cost of efficiency (mostly memory usage). Do I understand right that you have managed to fit you probabilities into a vector of int :s ?

The alternative would be, in my opinion, to create an interface of a higher level, that is, above SymbolList. This could be something like "SequentialData" for instance, which could be a template. this would allow you to plug your Sequence type on the hierarchy. Yet, all likelihood classes, for instance, do not work with such objects and would probably need some modification. This would require to dissociate completely the likelihood recursion from the data (I think that is a bit the idea behind the BEAGLE library for instance, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3243739/).

Thanks again for sharing your thoughts,

Julien.

Murray Patterson

unread,
Oct 31, 2014, 11:25:30 AM10/31/14
to biopp-de...@googlegroups.com, biopp-he...@googlegroups.com
hello again Julien,

thanks for moving my post to the devel forum, indeed this is the appropriate place for it (I'm new to the forums, so I didn't know this forum existed)

I see, so site inherits from BasicSymbolList, just because there was no need not to do it, i.e., indeed it would be more complex to create a Site interface, and then a BasicSite which inherits from this Site interface and BasicSymbolList, so why do this if uneccessary ... good to hear

so for the meantime, what I'm going to do is just define a "ProbabilisticSite" class (to go along with my ProbabilisticSequence) that will inherit directly from the Site class, and then just go from there.  This will be the fastest way for me to get what need from the current state of Bio++, i.e., the minimal "clean-ish" way to go about handling these probabilistic seqeunces -- because ultimately, that's my only goal, just extending bppAncestor for the uses of my particular project

but, down the road, if people find this added functionality useful, I will really do it in the proper way, i.e., following exactly the lines of all the other classes like BasicSequence create a "ProbabilisticSymbolList", where ProbabilisticSeqeunce inherits from Sequence interface and this symbol list, and then a ProbabilisticSite which inherits from the (would be) Site (interface) and this symbol list -- indeed I could make it a even a bit more general, and have a "GeneralizedX" for X \in {SymbolList, Sequence, Site, ...}, rather than just "ProababilisticX".  Let's see how it goes

thanks for all the answers and advice -- it really clarifies many things, and this thread also serves as a great record to refer to, should I come back someday and completely elaborate this new functionality

Cheers,
Murray

Julien Yann Dutheil

unread,
Nov 4, 2014, 4:50:34 PM11/4/14
to biopp-de...@googlegroups.com, biopp-he...@googlegroups.com
Dear Murray,

I committed an update where the getContent method is removed (bpp-seq, phyl, popgen & bppsuite). Was more work than expected, as this method was used in several places. Yet, the beautiful thing that confirms the previous design was not intelligent enough, is that everytime this method was used, I could replace the code by a more efficient one :)

Changes are on the 'master' branches of the libs.

Best,

Julien.

Murray Patterson

unread,
Nov 17, 2014, 8:20:42 AM11/17/14
to biopp-de...@googlegroups.com, biopp-he...@googlegroups.com
Dear Julien,

great!  Yes I saw this change the last time I came in and did a 'git pull' while developing my modification ... indeed, quite a bit was updated

In any case, I've also finished my modification of Bio++ to allow for these Probabilistic Sequences.  I've tested it for what I will need it for, and it works just fine, however it's by no means ready to be integrated into the 'master' branch.  To do so, I'd probably need a bit of your advice on some design issues, i.e., so that my code conforms to the way that Bio++ has been (very well) designed.  For the most part, I've made classes that inherit from Sequence or Site, but in some places (in particular VectorSiteContainer) I've made quite a few modifications ... it's not clear to me if I've made the right choice, i.e., perhaps I should have made a VectorProbabilisticSiteContainer that inherits from VectorSiteContainer.

In any case, with the eventual goal of integrating this functionality into Bio++, I think the best thing to do would be to create a branch for this added functionality.  Is it possible to create a branch (to/from which I can push/pull changes) on the same remote repo (hosted somewhere in montpellier I believe) that the 'master' branch is on?  This would allow you and others to view what changes have been made, and the possibility to give feedback on how this could be integrated.  If not, I could host this branch on my own repo (I have an account with bitbucket), to which I could give you -- and whomever else wants -- access.

thanks again,
Murray

Julien Yann Dutheil

unread,
Nov 18, 2014, 5:56:35 AM11/18/14
to biopp-de...@googlegroups.com, biopp-he...@googlegroups.com
Hi Murray,

Good to know you could achieve your goals. I'll be happy to discuss any design related issue.
As for the integration into Bio++, the way to go would be as you said by first creating a branch. Before that, however, it is important to discuss a few points, as there are actually several possibilities:
- You can keep you classes as part of your Bio++ based project
- You can develop you own library, which your project links to, but making your code available outside that particular project
- You can integrate your code in an existing library.

The choice of option depends both on the generality of your code and your engagement in its maintenance! No need to say that the last option is the most constraining, and is to be reserved for code with potential broad usage and with high level of maintenance.

I tend to like option 2 as a good compromise. Yet having a library for 2 classes might not be ideal, so this would depend on your project, if other classes and methods could be distributed along. Note that such a library could be hosted and distributed on the Bio++ server.

Cheers,

Julien.

Murray Patterson

unread,
Nov 18, 2014, 10:18:09 AM11/18/14
to biopp-de...@googlegroups.com, biopp-he...@googlegroups.com
Hi Julien,

yes, I agree that option 2 is the best way to go.  The code has been developed in a fairly general fashion to allow for others to use it or to extend it (so I'd like to go beyond option 1), and yet option 3 would seem overkill, especially since it's uncertain how widely this is going to be used

so let's go for that, i.e., option 2.  Currently, I have the modified code as well as its commit history on my personal machine, ready to be pushed to the branch that will be created

Cheers,
Murray

Julien Yann Dutheil

unread,
Nov 18, 2014, 2:11:05 PM11/18/14
to biopp-de...@googlegroups.com
Well actually, there seems to be sone support for option 1, so maybe we can give it a try. I will contact you directly for the procedure...

Ps(@bastien): i recall here that with the new Google group setting, one cannot post to the forum by email any longer, only via the online interface!

Murray Patterson

unread,
Nov 18, 2014, 4:44:01 PM11/18/14
to biopp-de...@googlegroups.com
sounds good! talk to you soon

Murray
Reply all
Reply to author
Forward
0 new messages