OK, so here's a sequence of remarks...
There are a large number of "training" and "learning" algorithms. Superficially, some of these seem very different from one another. However, if you compare them and ask "what properties do they have in common?", you gain a lot of insight. You gain even more if you disentangle the data representation from the algorithm. For example, "sparse matrix factorization" has been a primary problem that Amazon, Facebook and Google try to solve. The matrix has "consumers" for rows and "product preferences" for columns, with tens or hundreds of millions of consumers and millions of products. (In some of the competitions, "movie reviews" were a stand-in for "products".) This can be approached as an old-school linear algebra problem, where you apply some smoothing functions, compute some eigenvectors, interpolate across missing values, etc. I'm tempted to call this "boring old linear algebra", and it is interesting only because a good solution will allow GOOG/AMZN/FB to make billions in profits by targeting me with the right kind of advertisements.
You can also change perspective. Here's a super-quick sketch. A graph can be represented as an adjacency matrix: a matrix whose rows and columns are indexed by vertices, with entries 0 or 1, where 1 means "there is an edge connecting these two". If you replace the 0/1 entries with fractional values, you can interpret this as a weighted graph. If the weights in each row sum to 1.0, you can interpret it as a Markov matrix. Back in the day when Google had only two employees, the grand discovery was that you didn't have to put the entire matrix into RAM in order to solve it -- the Brin-Page (PageRank) algorithm solves the Markov matrix problem by paging into RAM only 0.00001% of the graph at a time.
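To make that concrete, here's a toy power-iteration sketch of the textbook PageRank recurrence (my own illustration -- the graph, damping factor and tolerance are made up; Google's actual code obviously did much more). The point is that the matrix only ever appears as a sparse edge list, never as a dense array in RAM:

```python
def pagerank(edges, n, damping=0.85, tol=1e-9, max_iter=100):
    """Power iteration on a sparse Markov matrix.

    edges: dict mapping source vertex -> list of its target vertices.
    The full n x n matrix is never materialized; we only touch edges.
    """
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        # Teleportation term: (1 - damping) spread uniformly.
        new = [(1.0 - damping) / n] * n
        # Each vertex splits its current rank among its out-edges.
        for src, targets in edges.items():
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new[dst] += share
        if sum(abs(a - b) for a, b in zip(new, rank)) < tol:
            rank = new
            break
        rank = new
    return rank

# Tiny 3-vertex example: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
edges = {0: [1, 2], 1: [2], 2: [0]}
r = pagerank(edges, 3)
print(r, sum(r))  # the ranks sum to 1.0: probability is what's conserved
```

In a real deployment, the inner loop over `edges` is exactly the part that gets paged in from disk a chunk at a time.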
But the consumer/product-preference matrix is different. Almost all entries in the matrix are unknown. (It's impossible for 10 million consumers to express preferences for a million products.) The matrix factorization is M = L D R, where L and R are sparse matrices -- L of dimension 10M x 1K and R of dimension 1K x 1M -- and D is a dense 1K x 1K matrix. You can solve for D with standard neural-net techniques.
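Here's a toy numpy sketch of that factorization: fit M = L D R by gradient descent on only the observed entries. Everything here is a made-up stand-in for the real problem (20 x 15 instead of 10M x 1M, arbitrary learning rate and iteration count), but the key move -- masking the loss so unknown entries contribute nothing -- is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 20, 15, 3

# Ground truth, so recovery can be checked; only ~30% of entries observed.
true_L = rng.normal(scale=0.7, size=(n_users, k))
true_D = rng.normal(scale=0.7, size=(k, k))
true_R = rng.normal(scale=0.7, size=(k, n_items))
M = true_L @ true_D @ true_R
mask = rng.random((n_users, n_items)) < 0.3   # True where a preference is known

def masked_rmse(L, D, R):
    err = (L @ D @ R - M) * mask
    return np.sqrt((err ** 2).sum() / mask.sum())

# Small random init for the three factors.
L = rng.normal(scale=0.1, size=(n_users, k))
D = rng.normal(scale=0.1, size=(k, k))
R = rng.normal(scale=0.1, size=(k, n_items))

rmse_init = masked_rmse(L, D, R)
lr = 0.01
for _ in range(5000):
    err = (L @ D @ R - M) * mask          # error only on observed entries
    L -= lr * err @ (D @ R).T             # gradient of 0.5*||err||^2 w.r.t. L
    D -= lr * L.T @ err @ R.T             # ... w.r.t. D
    R -= lr * (L @ D).T @ err             # ... w.r.t. R
rmse_final = masked_rmse(L, D, R)
print(rmse_init, rmse_final)
```

At the real scale you'd also want sparsity pressure on L and R (the whole point is that they stay sparse), which this sketch omits.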
Curiously, the natural-language dictionaries in Link Grammar have the same structure. Many words have very similar grammatical structure (e.g. "all nouns" vs. "all adjectives") -- put these into L. The product D R encodes the grammar. If you look at the dictionary, you will spot that D R is itself also factorized: R is a collection of "disjuncts" that commonly occur together (for example <b-minus> or <mv-coord> or <verb-rq-aux> in link-grammar), and the "guts" of the English language lie in D -- a dense (not sparse) matrix that interconnects word classes (e.g. "words.n.2.x", one of the elements of L) to <b-minus> and <mv-coord> (elements of R).
The task of grammar learning is to perform this factorization: find L and D and R. You can find L and R with Bayesian methods, and D with neural-net methods. Or you can use other algorithms, e.g. the algos from the consumer-preference movie-ranking competitions.
BTW: Link Grammar uses "disjuncts" and "costs", but these are really just "seeds/germs" from a "sheaf"; the sampled data forms a very sparse matrix. This is not a "deep" statement; it's "obvious" and "shallow" once you see it. But, for whatever reason, almost no one ever "sees it", and even then, they almost never leverage the power behind it. (The power being that you can move algorithms from one kind of data representation to another.)
OK... here is another change in perspective. When you read the neural-net papers, the vast majority of them talk about "cosine distance" or, like you, blurt out "SoftMax" without ever thinking about it. I believe this is a serious error. It is exposed by another change of perspective. So...
When one says "linear algebra", or "vector", or "matrix", this immediately implies "Euclidean space". That is because vectors and matrices have transformational properties (1-tensor, 2-tensor) that are "natural" in Euclidean space. The cosine product is preserved in Euclidean space under rotations. (It's a 0-tensor; it's kind of like a Casimir invariant.)
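A quick numpy sanity check of that invariance (my own illustration, not from any paper): an orthogonal (rotation-like) map leaves dot products, and hence cosines, unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
v, w = rng.normal(size=3), rng.normal(size=3)

# Build an orthogonal matrix Q via QR decomposition of a random matrix.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))

dot_before = v @ w
dot_after = (Q @ v) @ (Q @ w)   # rotate both vectors, re-take the dot product
print(dot_before, dot_after)    # identical up to floating-point error
```

The dot product (and with it, cos(theta)) is the quantity that rotations conserve -- which is exactly why it is "natural" only when your data really does live in a Euclidean space.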
But who ever said that consumer preferences live in Euclidean space? Where does it say that Euclidean space is the natural setting for neural nets? Nowhere. Absolutely nowhere. If you shift focus and look at the "matrix" (the 2-tensor) not as a linear-algebra thing, but instead as a weighted graph, then you can see other possibilities. My favorite is mutual information (yes, some people are sick of me saying this over and over...). If you take a neural-net algo which has a cos(theta) = (v dot w) / |v||w|, where v and w are vectors, and replace it by MI = log[(v dot w) / ((v dot va)(w dot wa))] (see skippy.pdf for details), you get not a Euclidean space but a probability space (a simplex). The conserved quantity is not the scalar dot product, but instead the sum of probabilities. (This is not my discovery; it was noted a decade or two ago by a handful of authors, and has been ignored by 99% of the neural-net literature. Ignored, not rejected; this does not appear to be a conscious decision, but rather just forgotten/unobserved. Would it improve NN learning? Who knows?)
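Here's a tiny numpy illustration of that contrast in conservation laws (the vector and the Markov matrix are made up; for the MI formula itself, see skippy.pdf). A stochastic map conserves total probability but not the Euclidean norm -- the mirror image of what a rotation does:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])    # a point on the probability simplex

# A column-stochastic (Markov) map: each column sums to 1.
S = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])
q = S @ p

print(q.sum())                                # 1.0: probability is conserved
print(np.linalg.norm(p), np.linalg.norm(q))   # Euclidean norm is NOT conserved
```

So the "natural" symmetries of the simplex are the stochastic maps, not the rotations -- which is why cosine distance is a strange thing to impose on probability-like data.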
There are a bunch of these kinds of "changes of perspective"; I try to sketch them in the "skippy.pdf" paper. It's all "green field" development: fairly obvious, direct and immediate possibilities, angles and approaches that have been overlooked and remain unexplored for whatever reason. I suspect mainstream researchers have FOMO about other discoveries, and are too busy to explore some really quite promising algos. Kind of a sociology-of-science question. And, just to ding Ben a bit: he keeps telling me that this is all "obvious and trivial", but if it's so obvious and trivial, then why has no one investigated any of these avenues yet? Why aren't people publishing results? Harrumph.
Anyway, this is where I am currently stalled. One issue is that there are so many different avenues and possibilities that look promising that it's hard to pick just one. And each avenue requires a lot of work to explore. It's hard to do single-handedly; collaborations are more effective.
-- Linas