(2) what are other candidates in this comparison that you described?
It seems like there are a handful of misconceptions that perhaps I can clear up. Let me give a brief overview of structured prediction models like SSVMs. Sorry if any of this is review.
There are three necessary components for a structured-output classifier: the model/representation, learning, and inference. The model is used to define features and relationships between objects. For example, a linear-chain CRF model is typically defined with a node at each timestep (comprising a unary potential/feature I'll denote U) and a pairwise potential/feature between each pair of consecutive timesteps that I'll denote P. Psi, the joint feature function from the Structural SVM formulation, is a function of the unary and pairwise terms. In particular, it is the sum over all timesteps of the U's and P's corresponding to each class. If we have classes 'A' and 'B', Psi will have 6 terms: Psi = [U_A, U_B, P_AA, P_AB, P_BA, P_BB], where each term (e.g. U_A) is the sum of all unaries for class A over a particular training sequence. The model component of the classifier is where domain knowledge comes in: this is where you try to encode the relationships between your amino acids.
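To make that concrete, here is a minimal sketch of such a Psi for a two-class chain. The function name `psi` and the scalar unary features are illustrative only, not the pystruct API (pystruct computes this internally in its Models):

```python
# Sketch of the joint feature function Psi for a two-class ("A"/"B")
# linear chain: sum of unaries per class plus counts of each transition.

def psi(xs, ys):
    """xs: list of per-timestep unary feature values (scalars here),
    ys: list of labels in {"A", "B"}.
    Returns [U_A, U_B, P_AA, P_AB, P_BA, P_BB]."""
    U = {"A": 0.0, "B": 0.0}
    P = {("A", "A"): 0, ("A", "B"): 0, ("B", "A"): 0, ("B", "B"): 0}
    for x, y in zip(xs, ys):
        U[y] += x                      # sum unaries for each class
    for y_prev, y_next in zip(ys, ys[1:]):
        P[(y_prev, y_next)] += 1       # count each consecutive transition
    return [U["A"], U["B"],
            P[("A", "A")], P[("A", "B")], P[("B", "A")], P[("B", "B")]]

print(psi([1.0, 2.0, 3.0], ["A", "A", "B"]))
# → [3.0, 3.0, 1, 1, 0, 0]
```

The learned weight vector w then has one entry per term of Psi, and the score of a labeling is just the dot product w · Psi(x, y).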
The second component is learning. I don't know if there are direct comparisons between pystruct and SVM^struct, but Andreas has done in-depth comparisons between Block-Coordinate Frank-Wolfe (BCFW), vanilla Stochastic Gradient Descent, and Cutting Planes. These are all algorithms for solving the Structural SVM learning problem. Some of the results are in his thesis; you can find a draft somewhere on this message board if you search. In summary, his results show that BCFW tends to perform better (faster and better convergence) in several applications. In my own experience I have gotten better results using stochastic gradient descent with RMSProp step updates. Unfortunately all of my experiments were done with a (currently private) structured prediction library I've been writing in Julia.
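In case the RMSProp variant is unfamiliar: the idea is just to scale each coordinate of the (sub)gradient step by a running average of squared gradients. A generic sketch (not tied to any SSVM library; the toy quadratic objective stands in for the structured hinge loss):

```python
import numpy as np

def rmsprop_step(w, g, cache, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update: g is a (sub)gradient at w."""
    cache = decay * cache + (1 - decay) * g ** 2   # running avg of g^2
    w = w - lr * g / (np.sqrt(cache) + eps)        # per-coordinate scaling
    return w, cache

target = np.array([1.0, -2.0, 0.5])                # toy optimum
w = np.zeros(3)
cache = np.zeros(3)
for _ in range(100):
    g = 2 * (w - target)                           # gradient of ||w - target||^2
    w, cache = rmsprop_step(w, g, cache)
print(np.round(w, 2))
```

With an SSVM you would plug in the subgradient of the regularized structured hinge loss on each sampled example instead of the toy gradient.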
Third is inference. I’m not going to harp on this. This is highly dependent on your model. Essentially you need a way of predicting what the labels are for every node in your graph. Depending on the structure of your problem this can be done approximately (e.g. using variational inference or sampling-based methods) or exactly.
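For a linear chain, exact MAP inference is the classic Viterbi dynamic program. A small NumPy sketch (pystruct's chain models do this internally; the names here are mine):

```python
import numpy as np

def viterbi(unary, pairwise):
    """Exact MAP labeling of a chain.
    unary: (T, K) per-timestep/state scores; pairwise: (K, K) transition scores."""
    T, K = unary.shape
    score = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    score[0] = unary[0]
    for t in range(1, T):
        prev = score[t - 1][:, None] + pairwise    # (K, K): from -> to
        back[t] = np.argmax(prev, axis=0)
        score[t] = unary[t] + np.max(prev, axis=0)
    # backtrack from the best final state
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

unary = np.array([[2.0, 0.0], [0.0, 0.1], [0.0, 2.0]])
pairwise = np.array([[1.0, 0.0], [0.0, 1.0]])      # favour staying in state
print(viterbi(unary, pairwise))
# → [0, 1, 1]
```

For loopy graphs you would replace this with an approximate method such as loopy belief propagation or a sampler.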
Now, back to your questions. Regarding #3 I would say yes, you can tackle problems like yours with structured prediction. Even if the ambient sample space is huge, the point of a structured prediction model is to reduce a lot of the complexity. As Shuyang said above, having a fully connected graph quickly becomes intractable. By making assumptions (for example by assigning which nodes are connected in the graph) you reduce the problem from intractable to tractable. Even if your final model is very complex there are often approximate inference methods that can still give you reasonable predictions.
I don’t fully understand your application, but it makes sense to me why your model (#4) may not be performing well. It seems like each letter in your amino acid example is an individual node. Effectively you have two unary weights (one for ‘0’ and one for ‘1’) and four pairwise weights (0->0, 0->1, 1->0, 1->1). The ‘0’ unary will give a high score to letters (e.g. A or C) that tend to receive a ‘0’ label, and vice versa for ‘1’. This implies that your prediction at timestep ’t’ is largely a function of the letter at that same ’t’. However, based on the simple example you included, it appears that the individual letters don’t matter that much; what does matter is the local sequence of letters. For example, perhaps ‘ABC’ is always a ‘1’ and ‘CBA’ is always a ‘0’. In that case you want to model your unaries differently. Perhaps each unary should be a function of the past 3 letters. Off the top of my head I can’t recall how you would encode this in pystruct. If you confirm the details above, perhaps I can help you think about how to formulate it.
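One generic way to make each unary depend on the previous few letters is a sliding-window one-hot encoding, so the feature vector at position t covers letters t-2..t. This is a plain NumPy sketch (the helper names are mine, not the pystruct API), though the resulting (n_nodes, n_features) array per sequence is the shape chain models typically consume:

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids
IDX = {a: i for i, a in enumerate(ALPHABET)}

def window_features(seq, w=3):
    """One-hot encode a window of w letters ending at each position,
    padding positions before the start with zeros."""
    n, k = len(seq), len(ALPHABET)
    X = np.zeros((n, w * k))
    for t in range(n):
        for j in range(w):
            pos = t - j
            if pos >= 0:
                X[t, j * k + IDX[seq[pos]]] = 1.0
    return X

X = window_features("ACDE")
print(X.shape)   # → (4, 60)
```

With such features the unary weights can score local 3-letter contexts rather than single letters, while the pairwise terms still smooth the label sequence.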
I hope this was clear enough. Let me know what other questions you might have!
Dear Andreas,
first of all, thanks for the effort you have put into the development of open-source and freely available ML libraries. (I'm a PhD student in Bioinformatics and a *heavy user* of sklearn.)
Since most of the problems we face in bioinformatics have an underlying structure, I'm also experimenting with pystruct, but, in order to better grasp the general concepts, I have some (I hope not too stupid) questions. It is just to understand the reasoning behind this very nice library!
1) As far as I have understood, in the original SVM^struct formalism proposed by [Tsochantaridis et al. (2005)], the implementation of the psi function (joint_feature(x, y) in pystruct) was up to the user, since it is problem dependent (the same goes for the loss function, for example). In pystruct you made it more general (and user friendly) by introducing Models based (for example) on the CRF/maximum-margin Markov network formalism, and you made this choice because it is a widely used framework that can natively model most graph-based structured prediction problems, thus sparing users from coding an ad-hoc psi/joint_feature function for each problem. Is this statement reasonably right? ;)
2) If 1) is correct, have you ever benchmarked (or do you have an idea of) how the "general" joint_feature implementation given by the pystruct Models performs with respect to joint_feature and loss functions specifically designed/tuned for a particular structured problem X? In other words, using the Models gives a huge speedup in the implementation of predictors and makes it a lot easier (thanks for that!), but should it be considered a way of "trading" some performance for user-friendliness? Or, in your experience, are the CRF models in general good enough for most use cases, so that there is little/no point in designing an ad-hoc psi/loss for each specific problem?
3) From your experience, do you think the field of structured prediction (and pystruct) is ready to tackle difficult problems, like predicting structures with a *huge* space of possible solutions, or will every "good" search in such huge spaces be infeasible by definition?
4) For pystruct, is it correct to say that the more structured the problem is, the better the outcome of the prediction? For example, I'm working with proteins (which can be thought of as linear polymers), so the easiest structure I could associate with my problems is a linear chain, like in your OCR example. A difference is that the "relations" between the elements composing the polymer (amino acids) are less strong than the grammatical or semantic relationships in NLP, probably yielding a less constrained chain "structure" underlying the problem.
Do you think it is in general better to pose the prediction problem in the most constrained way possible, or should pystruct work equally well with a low level of dependence between the outputs? For example, I have sequences of amino acids "ACDEAFHG...KLH" associated with binary labels "00001111100...00000" in which the 1s are grouped, forming "islands". Usually this problem is solved with UNstructured prediction, using an SVM to predict the 1/0 class for every single amino acid, but since the 1s form islands, I tried to use various models from pystruct (from ChainCRF to EdgeFeatureGraphCRF) to model the dependencies in the labels. Unfortunately I can't get good predictions even on the training set itself (AUC <= 0.60) and I always get random predictions on the test set. I have to say that the problem is very noisy and unbalanced (I corrected for the imbalance by setting the class_weight variable when initializing the pystruct model), but a regular unstructured SVM from sklearn always outperforms pystruct (even though the best AUC is around 0.75 even for the best unstructured SVM, so the problem is difficult).
Do you have any suggestions? (This is a toy problem I'm using for my first steps with pystruct.) I'm training with FrankWolfeSSVM and I have tried many configurations of training parameters without noticeable improvement. Do you think that the limited amount of "structure" in the problem could be the cause of these bad performances? Going back to question 3: in your experience, is it better for pystruct to predict a sequence of length n composed of 1s and 0s (and thus relatively little structure among the labels, with a solution space of 2^n), or the same sequence but with a higher number of class labels (say, label symbols going from 1 to 10, with a solution space of 10^n)?
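For concreteness, a toy version of my setup (the helper names are mine; the shapes are those chain models in pystruct expect: X a list of (n_nodes, n_features) arrays, y a list of (n_nodes,) integer label arrays):

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids

def one_hot(seq):
    """Per-position one-hot encoding of an amino-acid sequence."""
    X = np.zeros((len(seq), len(ALPHABET)))
    for t, a in enumerate(seq):
        X[t, ALPHABET.index(a)] = 1.0
    return X

seqs = ["ACDEAFHG", "KLHACDEA"]
labels = [np.array([0, 0, 0, 1, 1, 1, 0, 0]),
          np.array([0, 1, 1, 1, 0, 0, 0, 0])]   # the 1s form "islands"

X = [one_hot(s) for s in seqs]
y = labels
print(X[0].shape, y[0].shape)   # → (8, 20) (8,)
```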
thanks in advance for your patience and time,
sorry for the colloquial form and the possibly stupid questions
Daniele Raimondi
--
You received this message because you are subscribed to the Google Groups "pystruct" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pystruct+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Questions are welcome!
I've not looked at latent models. For sure, they are not covered
by the current extension.
But maybe the work done on AD3+ will support an extension to the latent model.
I'll have a look next week as well.
JL