I am trying to figure out whether I can use Caffe for word embeddings. Here is a simplified architecture:
The input to the network consists of n ≈ 10 words, where each word is represented as a one-hot binary vector of ~100K dimensions. Each of the n input words is connected to its own distinct set of ~100 hidden units, so there are ~1,000 hidden units in total. The set of weights from a word position to its hidden units is shared across all positions, i.e. a word causes the same activation in its hidden units (the word's embedding) no matter which position it occupies. These ~1,000 hidden units are connected to further hidden layers and finally to a softmax output.
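To make this concrete, here is a rough sketch of what I imagine the net would look like in pycaffe's NetSpec, with the weight sharing done through Caffe's named-parameter mechanism. The layer names, the batch size of 64, the 500-unit hidden layer, and the softmax-over-vocabulary output are placeholders of mine, not part of the design above:

```python
# Rough NetSpec sketch of the tied-embedding architecture described above.
# Sizes follow the question: 10 words, ~100K vocab, ~100 embedding dims.
# Batch size, the 500-unit hidden layer, and the softmax size are
# placeholder choices of mine.
import caffe
from caffe import layers as L

VOCAB, WORDS, EMBED = 100000, 10, 100

n = caffe.NetSpec()
# One dense blob per instance holding the 10 concatenated one-hot vectors.
n.data = L.Input(shape=dict(dim=[64, WORDS * VOCAB]))
n.label = L.Input(shape=dict(dim=[64]))

# Split the input into one ~100K-dim slice per word position.
words = L.Slice(n.data, ntop=WORDS, axis=1,
                slice_point=[VOCAB * i for i in range(1, WORDS)])

# One InnerProduct per position; giving every layer's weight blob the
# same param name ("embed_w") makes Caffe share (tie) the weights.
embeds = []
for i, w in enumerate(words):
    ip = L.InnerProduct(w, num_output=EMBED, bias_term=False,
                        param=[dict(name='embed_w')])
    setattr(n, 'embed%d' % i, ip)
    embeds.append(ip)

# The ~1,000-dim concatenation of the per-word embeddings.
n.concat = L.Concat(*embeds, axis=1)
n.fc1 = L.InnerProduct(n.concat, num_output=500)  # further hidden layer(s)
n.relu1 = L.ReLU(n.fc1, in_place=True)
n.score = L.InnerProduct(n.relu1, num_output=VOCAB)
n.loss = L.SoftmaxWithLoss(n.score, n.label)

print(n.to_proto())  # emits the prototxt for this net
```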
The problem is the size of the input. Even though the number of weights is manageable (~10M in the example above: one shared 100K × 100 embedding matrix), each input instance is ~1M-dimensional (10 × 100K), although only 10 of those entries are non-zero. This makes storing the training set in the standard Caffe way (dense blobs in an LMDB/LevelDB or HDF5 file) rather tricky. Any suggestions?
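To quantify why: with float32 blobs and, say, 1M training instances (an assumed example size, not part of the setup above), the dense encoding is hopeless, while the word indices alone are tiny:

```python
# Storage cost of the dense one-hot encoding vs. just the word indices.
# 1M training instances is an assumed example size.
WORDS, VOCAB, N = 10, 100000, 1000000

dense_bytes = N * WORDS * VOCAB * 4  # float32, fully dense blobs
index_bytes = N * WORDS * 4          # one int32 index per non-zero entry

print("dense  : %.1f TB" % (dense_bytes / 1e12))  # ~4.0 TB
print("indices: %.1f MB" % (index_bytes / 1e6))   # ~40.0 MB
```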