general questions about pystruct


daniele....@vub.ac.be

Dec 18, 2015, 6:32:24 AM
to pystruct
Dear Andreas,

first of all, thanks for the effort you put into the development of open source and freely available ML libraries. (I'm a PhD student in Bioinformatics and a *heavy user* of sklearn.)

Since most of the problems we face in bioinformatics have an underlying structure, I'm also experimenting with pystruct, but in order to better grasp the general concepts I have some (I hope not too stupid) questions. They are just to understand the reasoning behind this very nice library!

1) As far as I understand, in the original SVMstruct formalism proposed by [Tsochantaridis et al. (2005)], the implementation of the psi function (joint_feature(x, y) in pystruct) was up to the user, since it is problem dependent (the same goes for the loss function, for example). In pystruct you make it more general (and user friendly) by introducing Models based, for example, on the CRF/maximum-margin random field formalism, and you made this choice because it is a widely used framework that can natively model most graph-based structured prediction problems, thus sparing users from coding an ad-hoc psi/joint_feature function for each problem. Is this statement reasonably right? ;)

2) If 1) is correct, have you ever benchmarked (or do you have an idea of) how the "general" joint_feature implementations given by the pystruct Models perform with respect to joint_feature and loss functions specifically designed/tuned for a particular structured problem X? In other words, using the Models gives a huge speedup in the implementation of predictors and makes it a lot easier (thanks for that!), but should it be considered a way of "trading" performance for user friendliness? Or, in your experience, are the CRF models in general good enough for most use cases, so that there is little/no point in designing an ad-hoc psi/loss for each specific problem?

3) From your experience, do you think the field of structured prediction (and pystruct) is ready to tackle difficult problems, like predicting structures with a *huge* possible solution space? Or will every "good" search in such huge spaces be infeasible by definition?

4) For pystruct, is it correct to say that the more structured the problem is, the better the outcome of the prediction? For example, I'm working with proteins (which can be thought of as linear polymers), so the easiest structure I could associate with my problems is a linear chain, as in your OCR example. A difference is that the "relations" between the elements composing the polymer (aminoacids) are weaker than the grammatical or semantic relationships in NLP, probably yielding a less constrained chain "structure" underlying the problem.
Do you think it is in general better to pose the prediction problem in the most constrained way possible, or should pystruct work equally well with a low level of dependence between the outputs? For example, I have sequences of aminoacids "ACDEAFHG...KLH" associated with binary labels "00001111100...00000" in which the 1s are grouped, forming "islands". Usually this problem is solved with UNstructured prediction, using an SVM to predict the 1/0 class for every single aminoacid, but since the 1s form islands, I tried various models from pystruct (from ChainCRF to EdgeFeatureGraphCRF) to model the dependencies in the labels. Unfortunately I can't get good predictions even on the training set itself (AUC <= 0.60) and I always get random predictions on the test set. I have to say that the problem is very noisy and unbalanced (I corrected the imbalance by setting the class_weight variable during the initialization of the pystruct model), but a regular unstructured SVM from sklearn always outperforms pystruct (even though the best AUC is around 0.75 even for the best unstructured SVM, so the problem is difficult).
Do you have any suggestions? (This is a toy problem I'm using for my first steps with pystruct.) I'm training with FrankWolfeSSVM and I tried many configurations of training parameters without noticeable improvement. Do you think that the limited amount of "structure" in the problem could be the cause of these bad performances? Going back to question 3: in your experience, is it better for pystruct to predict a sequence of length n composed of 1s and 0s (and thus relatively little structure among the labels, with a solution space of 2^n), or the same sequence but with a higher number of class labels (say, label symbols going from 1 to 10, with a solution space of 10^n)?


thanks in advance for your patience and time,
sorry for the colloquial form and the possibly stupid questions
Daniele Raimondi

Shuyang Sheng

Dec 19, 2015, 1:13:05 AM
to pystruct
I can give a few short answers (can't guarantee that Andreas would agree with me).

(1) A CRF is a probabilistic model that directly models the posterior density p(Y | X, w) on a graph, and should not be confused with the graph itself. To put it another way, we have 3 components here: (1) a graph, (2) a probabilistic model (e.g. a CRF), and (3) a training algorithm (e.g. an SSVM). Other probabilistic models can also be used on graphs; however, generative models such as MRFs are a bit cumbersome in API design, because a prior function for the node potentials must be specified.

(2) What are the other candidates in the comparison you described?

(3) Graphical models have theoretical limits. If you have a very complicated graph (e.g. a fully connected graph), inference cannot be done in polynomial time: the time complexity of inference will be k^N, where k is the number of possible values for y and N is the number of nodes. If you are dealing with images that have millions of pixels, k^N time means forever. Real-valued solution spaces (i.e. continuous y) should be even more complicated; I'm not very familiar with that.
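To put rough numbers on this, here is a tiny back-of-the-envelope calculation (plain Python, not pystruct code) comparing brute-force enumeration, which costs k^N, with exact dynamic-programming inference on a chain, which is only O(N * k^2):

```python
# Toy illustration: cost of exact inference on a labeling problem
# with k possible labels and N nodes.
def brute_force_cost(k, n):
    """Enumerating every joint labeling: k^N candidates."""
    return k ** n

def chain_viterbi_cost(k, n):
    """Dynamic programming on a chain: O(N * k^2) operations."""
    return n * k * k

# Even a modest binary chain shows the gap:
print(brute_force_cost(2, 50))    # 1125899906842624 joint labelings
print(chain_viterbi_cost(2, 50))  # 200 operations
```

This is why restricting the graph to a chain (or tree) makes exact inference tractable, while dense graphs force approximate methods.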

(4) Designing a model for a specific application is beyond my knowledge...

daniele....@vub.ac.be

Dec 20, 2015, 4:16:49 PM
to pystruct

Dear Shuyang Sheng,

thanks for your kind answer!




(2) what are other candidates in this comparison that you described?


For example, I can imagine a comparison between the "frequency based" model used as an illustrative example in [Tsochantaridis et al. (2005)] and one of the graph models available in pystruct.

 thanks again!

DR

Colin Lea

Dec 20, 2015, 9:02:37 PM
to pystruct

It seems like there are a handful of misconceptions that perhaps I can clear up. Let me give a brief overview of structured prediction models like SSVMs. Sorry if any of this is review. 


There are three necessary components for a structured-output classifier: the model/representation, learning, and inference. The model is used to define features and relationships between objects. For example, a linear chain CRF model is typically defined with a node at each timestep (comprising a unary potential/feature I'll denote U) and a pairwise potential/feature between each pair of consecutive timesteps that I'll denote P. Psi, the joint feature function from the Structural SVM formulation, is a function of the unary and pairwise terms. In particular, it is the sum over all timesteps of the U's and P's corresponding to each class. If we have classes 'A' and 'B', Psi will have 6 terms: Psi = [U_A, U_B, P_AA, P_AB, P_BA, P_BB], where each term (e.g. U_A) is the sum of all unaries for class A over a particular training sequence. The model component of the classifier is where domain knowledge comes in: this is where you try to encode the relationships between your aminoacids.
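To make this concrete, here is a minimal sketch (plain numpy, not the actual pystruct implementation) of such a Psi for a binary chain with a single scalar feature per node, using the [U_A, U_B, P_AA, P_AB, P_BA, P_BB] layout described above:

```python
import numpy as np

# Hypothetical sketch of the joint feature Psi for a binary linear
# chain: unary sums per class, plus transition counts per label pair.
def joint_feature(x, y):
    """x: (T,) scalar node features; y: (T,) labels in {0, 1} (A=0, B=1)."""
    unary = np.zeros(2)
    pairwise = np.zeros((2, 2))
    for t in range(len(y)):
        unary[y[t]] += x[t]                # accumulate U for the active class
        if t > 0:
            pairwise[y[t - 1], y[t]] += 1  # count each label transition
    return np.concatenate([unary, pairwise.ravel()])

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])
print(joint_feature(x, y))  # [3. 7. 1. 1. 0. 1.]
```

The learned weight vector w has the same 6-slot layout, and the score of a labeling is simply the dot product w . Psi(x, y).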


The second component is learning. I don’t know if there are direct comparisons between pyStruct and SVM^struct but Andreas has done in-depth comparisons between Block Coordinate Frank Wolfe, vanilla Stochastic Gradient Descent, and Cutting Planes. These are all algorithms for solving the Structural SVM learning problem. Some of the results are in his thesis — you can find a draft of this somewhere on this message board if you search. In summary his results show that BCFW tends to perform better (faster and better convergence) in several applications. In my own experience I have gotten better results using stochastic gradient descent with RMSProp step updates. Unfortunately all of my experiments were done with a (currently private) structured prediction library I’ve been writing in Julia. 


Third is inference. I'm not going to harp on this; it is highly dependent on your model. Essentially you need a way of predicting the labels for every node in your graph. Depending on the structure of your problem this can be done approximately (e.g. using variational inference or sampling-based methods) or exactly. 
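For a linear chain, exact inference is just max-sum dynamic programming (Viterbi). A minimal sketch in plain numpy (pystruct delegates this to its inference backends, so this is illustrative only):

```python
import numpy as np

# Minimal max-sum (Viterbi-style) decoder for a chain: at each step keep
# the best score of ending in each label, then backtrack.
def viterbi(unary, pairwise):
    """unary: (T, K) node scores; pairwise: (K, K) transition scores."""
    T, K = unary.shape
    score = unary[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + pairwise   # (prev label, current label)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + unary[t]
    labels = np.zeros(T, dtype=int)
    labels[-1] = score.argmax()
    for t in range(T - 1, 0, -1):
        labels[t - 1] = back[t, labels[t]]
    return labels

# Per-position scores favor [0, 1, 0, 1], but a "sticky" transition
# matrix rewards keeping the same label across neighbors:
unary = np.array([[2., 0.], [0., 1.], [1., 0.], [0., 3.]])
pairwise = np.array([[1.5, -1.5], [-1.5, 1.5]])
print(viterbi(unary, pairwise))  # [1 1 1 1]
```

Note how the sticky pairwise term overrides the per-position preference [0, 1, 0, 1] and merges the flips into a single island of 1s, which is exactly the smoothing behavior one would hope for in the island-labeling problem above.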


Now, back to your questions. Regarding #3 I would say yes, you can tackle problems like yours with structured prediction. Even if the ambient sample space is huge, the point of a structured prediction model is to reduce a lot of the complexity. As Shuyang said above, having a fully connected graph quickly becomes intractable. By making assumptions (for example by assigning which nodes are connected in the graph) you reduce the problem from intractable to tractable. Even if your final model is very complex there are often approximate inference methods that can still give you reasonable predictions.


I don't fully understand your application, but it makes sense to me why your model (#4) may not be performing well. It seems like each letter in your aminoacid example is an individual node. Effectively you have two unary weights (one for '0' and one for '1') and four pairwise weights (0->0, 0->1, 1->0, 1->1). The '0' unary will give a high score to letters (e.g. A or C) that tend to receive a '0' label, and vice versa for '1'. This implies that your prediction at timestep 't' is largely a function of the letter at that same 't'. However, based on the simple example you included, it appears that the individual letters don't matter that much; what does matter is the local sequence of letters. For example, perhaps 'ABC' is always a '1' and 'CBA' is always a '0'. In that case you want to model your unaries differently. Perhaps each unary should be a function of the past 3 letters. Off the top of my head I can't recall how you would encode this in pyStruct. If you confirm the details above, perhaps I can help you think about how to formulate it. 


I hope this was clear enough. Let me know what other questions you might have!


By the way, questions like this are never "stupid." ;)

daniele....@vub.ac.be

Dec 21, 2015, 6:11:18 PM
to pystruct
Dear Colin,

thanks a lot for your kind and super-exhaustive answer. :) I'll take some time to study this prediction paradigm more and hopefully understand the whole thing better, and after the holidays I'll come back with some more targeted questions.

Thanks again!
Daniele


Colin Lea

Dec 21, 2015, 7:56:24 PM
to pystruct
No problem! I would be happy to video (or audio) chat at some point if it would be more useful. Structured prediction is a hard area to get started in because it requires a lot of background knowledge. 

Enjoy the holiday!

colin

Shuyang Sheng

Dec 22, 2015, 3:08:25 AM
to pystruct
Thanks for the added explanation. It was my mistake that I didn't clearly define and distinguish inference from learning; glad that you cleared that up.


To Daniele,

All in all, I found it quite difficult to explain graphical models in a few words, so here's a list of books that I found very useful:

  1. Structured Learning and Prediction in Computer Vision. By Sebastian Nowozin and Christoph H. Lampert.
  2. Markov Random Fields for Vision and Image Processing. Edited by Andrew Blake, Pushmeet Kohli, and Carsten Rother.
  3. Probabilistic Graphical Models: Principles and Techniques. By Daphne Koller and Nir Friedman.
To get a general idea of graphical models, all you need to read is the first two or three chapters of (1) and (2), but if you are interested in more details you can keep reading. (3) imo has a very systematic notation, and is probably preferred if you'd like to dig deep :)

Andreas Mueller

Dec 29, 2015, 11:56:30 AM
to pyst...@googlegroups.com
Thanks to everybody who already answered :)
There are indeed no stupid questions, and structured prediction is a rather tricky subject, not used by many practitioners (yet).

1) More or less. I provided some implementations of psi that are common for the kinds of problems I worked on (in particular computer vision).
So if you have graph labeling problems where all the nodes represent the same kind of entity, you can reuse them.
I think providing as many commonly used "psi"s as possible makes it more user-friendly. You can still write your own model, as you can with SVM^struct.

2) I'm not sure I understand. If you think a model that is not in pystruct is more appropriate for your application, you should code it up yourself (in principle; it might not be as easy in practice ;).

3) I hope so. For sequences, recurrent neural nets (in particular LSTMs) are becoming more popular at the moment, but there are also combinations of neural nets and CRFs that might be promising. If you don't have "enough" data for training a recurrent net, using structured prediction is definitely the right choice.
(also see my answer for 4)

4) I am surprised that you get worse results than an unstructured SVM. Are you using a linear SVM as reference, or a kernel SVM?
The main drawback of directly using pystruct on your features is that all potentials will be linear in the input features. That is VERY limiting in some sense.
Possible ways around that are to use non-linear feature mappings (like the RBFSampler or Nystroem in sklearn), or stagewise learning.
In stagewise learning, you first train an unstructured model like an SVM or a Random Forest on your problem. Then you train a structured model
whose input features are the predicted probabilities of the unstructured model.
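A library-free sketch of the stagewise idea: here the "unstructured model" is just a per-letter frequency estimate standing in for an SVM or Random Forest, and all function names are made up for illustration. Its predicted probabilities become the per-node features you would then feed to a structured model such as a ChainCRF:

```python
import numpy as np
from collections import defaultdict

# Stage 1: a stand-in unstructured model, P(y=1 | letter) estimated
# from training frequencies (an SVM/Random Forest would go here).
def fit_unstructured(sequences, labels):
    pos, tot = defaultdict(float), defaultdict(float)
    for seq, lab in zip(sequences, labels):
        for letter, y in zip(seq, lab):
            pos[letter] += y
            tot[letter] += 1.0
    return {c: pos[c] / tot[c] for c in tot}

# Stage 2 inputs: one probability feature per position, per sequence.
def stagewise_features(sequences, proba):
    return [np.array([[proba.get(c, 0.5)] for c in seq]) for seq in sequences]

train_seqs = ["ACCA", "CCAA"]
train_labels = [[0, 1, 1, 0], [1, 1, 0, 0]]
proba = fit_unstructured(train_seqs, train_labels)
X = stagewise_features(["CA"], proba)
print(proba["C"], proba["A"])  # 1.0 0.0
```

The structured model then only has to learn how to smooth these (already informative) probabilities along the chain, instead of learning a linear potential directly on sparse one-hot inputs.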

PR welcome for adding an example on "letters" for both strategies ;)


Hth,
Andy



daniele....@vub.ac.be

Jan 5, 2016, 4:45:24 AM
to pystruct
Dear Andreas,

thanks for the answers and happy new year!

I get worse results with both linear and RBF SVMs. Actually, the problem is that I get almost random results on both training and testing! Here are the confusion matrices and the scores obtained on the training set itself and on the test set.

TRAIN:
--------------------------------------------
       | 1                     | 0           |
pred1  | TP: 1016              | FP: 11125   |
pred0  | FN: 3888              | TN: 216573  |

Sen = 0.207
Spe = 0.951
Acc = 0.935
Bac = 0.579
Pre = 0.084
MCC = 0.102
AUC = 0.579
--------------------------------------------

TEST:
--------------------------------------------
       | 1               | 0           |
pred1  | TP: 503         | FP: 9248    |
pred0  | FN: 4096        | TN: 202315  |

Sen = 0.109
Spe = 0.956
Acc = 0.938
Bac = 0.533
Pre = 0.052
MCC = 0.046
AUC = 0.533
--------------------------------------------


These are the outputs for 100 steps of FrankWolfeSSVM training.

dual: 7.322260, dual_gap: 7956981.415651, primal: 7956988.737911
dual: 78.622115, dual_gap: 7088164.720121, primal: 7088243.342236
dual: 145.360738, dual_gap: 7361463.327332, primal: 7361608.688071
dual: 214.217034, dual_gap: 7463452.563506, primal: 7463666.780539
dual: 283.846964, dual_gap: 7768881.205342, primal: 7769165.052306
dual: 354.871427, dual_gap: 7911905.501601, primal: 7912260.373028
dual: 424.708660, dual_gap: 7858108.259299, primal: 7858532.967960
dual: 494.497318, dual_gap: 7990976.179498, primal: 7991470.676815
dual: 564.371249, dual_gap: 8110837.349558, primal: 8111401.720807
dual: 634.577875, dual_gap: 8182879.335770, primal: 8183513.913645

Using more iterations (or other methods in pystruct.learners) the results don't change much (how many iterations do you suggest?). Do these optimization scores suggest that something wrong is happening?

A colleague of mine also tried to train a toy model in which the class labels (1 and 0) were used as features, and he inexplicably obtained near-random performance in that case too, which is very strange I think, since the learner just has to learn to output 1s when the input feature has value 1.

If necessary I can post the code of these simple predictors!

thanks,
Daniele


Andreas Mueller

Jan 5, 2016, 3:01:50 PM
to pyst...@googlegroups.com
The results on the toy problem sound very odd.
Can you post the code for that?

And yes, the optimization for the Frank-Wolfe doesn't look good, maybe it is just slow, though.
How many samples and how many features do you have?
You might want to try one of the other optimizers, or maybe rescale your features.

Do I see it correctly that your data is about 100:1 imbalanced?

daniele....@vub.ac.be

Jan 5, 2016, 5:59:50 PM
to pystruct

Just to know, what does a proper FW optimization look like? Which of the 3 values is most important, just to have an idea? Or should I simply look at how it behaves in the pystruct examples? In those cases, though, only a few iterations are used.

I have ~600 protein sequences in the training set and ~600 in the test set. These sequences can be quite long, as you can see from the confusion matrix, in which the number of aminoacids is shown (just in case: aminoacids (aka residues) are the monomers composing the protein sequence; proteins are linear polymers of aminoacids). So, very roughly, the average length of the sequences in the training set is ~220k/600 = 366 aminoacids per sequence. (Each aminoacid is annotated with a 1 or 0 label.)

I tried various lengths for the feature vectors, from 20 to ~400. All the feature vectors so far encode the target aminoacid, or the target plus some flanking aminoacids.

[optional explanation of the feature vectors!]
 In the simplest case the feature vector for each aminoacid j contains only the aminoacid type (20 possible types) using one-hot encoding (a 20-dimensional vector with only one position set to 1 and everything else to 0). In the UNstructured prediction case it is common practice to consider a "window" of residues centred on the target in order to provide some information about the local context. E.g., the feature vector for predicting the aminoacid in position j using a window of w residues is w*20-dimensional (with only w 1s in it) and encodes the aminoacids from j-((w-1)/2) to j+((w-1)/2) (assuming that window sizes w can only be odd: the central residue plus the symmetrical flanking ones). I tried various configurations, comparing pystruct results to UNstructured SVM models, both linear and RBF. (Sorry for the lengthy and probably unnecessary explanation.)
[END optional explanation of the feature vectors!]
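For what it's worth, the windowed one-hot encoding described above can be sketched like this (plain numpy; `window_features` and the constants are hypothetical names for illustration, not part of pystruct):

```python
import numpy as np

# Windowed one-hot encoding: position j is described by the amino-acid
# types at j-(w-1)/2 .. j+(w-1)/2, giving a w*20-dimensional vector with
# up to w ones. Positions outside the sequence stay as all-zero blocks.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {a: i for i, a in enumerate(AMINO_ACIDS)}

def window_features(seq, w=3):
    assert w % 2 == 1, "window size must be odd"
    half = (w - 1) // 2
    X = np.zeros((len(seq), w * 20))
    for j in range(len(seq)):
        for k, pos in enumerate(range(j - half, j + half + 1)):
            if 0 <= pos < len(seq):
                X[j, k * 20 + INDEX[seq[pos]]] = 1.0
    return X

X = window_features("ACDE", w=3)
print(X.shape)          # (4, 60)
print(int(X[0].sum()))  # 2  (left neighbor falls outside the sequence)
```

In this layout each sequence yields a (length, w*20) array, which is the per-node feature format a chain model expects.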

I tried other optimizers, with no good results. The features are in one-hot encoding form, so binary vectors with very sparse 1s; I don't think rescaling is needed.

The dataset is heavily unbalanced, yes! I tried to correct for the imbalance by setting the class_weight variable during the initialization of the pystruct model, in a similar way to what I do with sklearn models.

I'll try once more to see if I can get things to work following your suggestions (thanks!), and if it still doesn't work I'll post the code. Thanks a lot!

Daniele


Andreas Mueller

Jan 6, 2016, 1:44:50 PM
to pyst...@googlegroups.com


On 01/05/2016 05:59 PM, daniele....@vub.ac.be wrote:
>
> The dataset is heavily unbalanced, yes! I tried to correct for the
> unbalancement by setting the class_weight variable during the
> initialization of the pystuct model, in a similar way to what I do
> with sklearn models.
The class_weight works differently in PyStruct and scikit-learn
(libsvm). I implemented the libsvm version for the unstructured SVM (the
rescale_C parameter), but I don't think it's possible to apply it to
structured SVMs.

For the FW: the primal should go down, the dual should go up, and the
duality gap should go to zero.
Sorry, I'm pretty busy at the moment and I'm not sure I can help you
much more right now.

Andy

Lars Petersson

Feb 6, 2017, 11:17:13 PM
to pystruct
Hi Andreas,

Sorry for bringing this old thread to life again, but you mentioned something in it I find interesting for my use case.

You have provided examples in the repository showing how to construct various models, and as you say below, "if you have graph labeling problems where all the nodes represent the same kind of entity, you can reuse them."

In my use case I have various nodes representing different entities. This results in the nodes having different label spaces, different length feature vectors and, as a consequence, different size/shape parameter matrices.

Questions: Can a situation like that be dealt with by pystruct? Is there code available I can draw inspiration from when implementing my own model with such properties? I apologize if this question has been answered elsewhere; if so, could you please point me to it?

Cheers,

/Lars

PS. Thanks for providing this library :)

JL Meunier

Feb 17, 2017, 11:01:02 AM
to pystruct
Hi Lars,

I faced the exact same problem and went on to create my own model! 

My goal was exactly what you describe: dealing with "nodes representing different entities. This results in the nodes having different label spaces, different length feature vectors and, as a consequence, different size/shape parameter matrices"

Have a look at https://github.com/jlmeunier/pystruct          (I released it this week :) )

I have successfully tested it on an extension of the Snake example, but not on my initial problem yet. So any feedback is welcome!

JL

Lars Petersson

Feb 18, 2017, 6:44:10 AM
to pystruct
That's great!

I had a quick read of your extensions over at your github fork, and it seems to be very similar to what I'm after. I'll give it a try Monday/Tuesday at work and will probably have some questions for you :)

BTW, does your way of extending pystruct apply also to latent models with heterogeneous nodes?

Thanks,

/Lars

Jean-Luc Meunier

Feb 18, 2017, 6:52:17 AM
to pyst...@googlegroups.com

Questions are welcome!

I've not looked at latent models. For sure, they are not covered by the current extension.

But maybe the extension done on AD3+ will support an extension to the latent model.

I'll have a look next week as well.

JL
