Hmm, implementing that with Caffe will be cumbersome at best. There is currently no layer that computes this accuracy, and writing one won't really work: at runtime a layer only has access to the labels in the current batch, not to all labels, so it cannot compute your accuracy. The "easiest" way is probably a wrapper script in Python that feeds all test samples through the (trained) net, saves the predictions, then loads all available label vectors by hand and computes your accuracy score. As I said, not very nice.
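A rough sketch of what such a wrapper script could look like, using only NumPy. Everything named here is an assumption for illustration: `predict_batch` is a hypothetical callable that you would write around the trained net's forward pass, and I am guessing that your accuracy counts a sample as correct when its true label vector is the closest (Euclidean) of all available label vectors to the prediction. Swap in your own rule if it differs.

```python
import numpy as np


def collect_predictions(predict_batch, samples, batch_size=64):
    """Feed all samples through the trained net in batches and stack the outputs.

    predict_batch is a hypothetical callable (not part of Caffe) that takes an
    array of samples and returns an array of predictions, one row per sample.
    With pycaffe it would typically wrap the net's forward pass.
    """
    outputs = [predict_batch(samples[i:i + batch_size])
               for i in range(0, len(samples), batch_size)]
    return np.concatenate(outputs, axis=0)


def nearest_label_accuracy(predictions, true_labels, all_labels):
    """Fraction of samples whose true label is the nearest of ALL label vectors.

    Assumed metric: Euclidean distance between prediction and label vector.
    predictions: (N, D), true_labels: (N, D), all_labels: (M, D).
    """
    # Squared distances between every prediction and every label, shape (N, M).
    diffs = predictions[:, None, :] - all_labels[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Label vector closest to each prediction, shape (N, D).
    nearest = all_labels[np.argmin(dists, axis=1)]
    # Compare by value so duplicate label vectors still count as a match.
    correct = np.all(np.isclose(nearest, true_labels), axis=1)
    return float(np.mean(correct))
```

The distance matrix needs N x M x D floats in memory, so for a large label set you would have to compute it in chunks.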
On the other hand, this concept of accuracy seems very strange to me. The training and test sets are only samples of a theoretically very large population, so it does not really make sense to compare the individuals in these subsets with each other while ignoring that there might be other individuals with even better-fitting labels that just happen not to be in the train/test sets.
Jan