@martin: I don't think we need to save checkpoints of the runs for future reference; just having text files (JSON/CSV) of the results in the results-repo is enough. I mean, maybe the final, trained checkpoint could be useful for something, but keeping all intermediate checkpoints (one after every x epochs) around seems like overkill. Ideally checkpoints are a transient thing and they get converted (evaluated) into their metrics after a run.
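Roughly what I have in mind, as a sketch (`evaluate()` and the file layout are placeholders, not our actual code):

```python
import csv
import pathlib

def evaluate(checkpoint_path: pathlib.Path) -> dict:
    # Placeholder: load the checkpoint and run the benchmark on it.
    return {"checkpoint": checkpoint_path.name, "accuracy": 0.0}

def consume_checkpoints(run_dir: pathlib.Path, results_csv: pathlib.Path) -> None:
    """Turn every checkpoint of a run into metrics rows, then drop it."""
    write_header = not results_csv.exists() or results_csv.stat().st_size == 0
    with results_csv.open("a", newline="") as f:
        writer = None
        for ckpt in sorted(run_dir.glob("*.ckpt")):
            metrics = evaluate(ckpt)
            if writer is None:
                writer = csv.DictWriter(f, fieldnames=list(metrics))
                if write_header:
                    writer.writeheader()
            writer.writerow(metrics)
            ckpt.unlink()  # checkpoint has served its purpose, delete it
```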
A run is already a separate table in the DB; that's what run_id refers to. So at least in the DB the benchmark definition etc. will be there, and metrics refer to the run they belong to via the run_id field.
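Something like this, if it helps (just a sketch; every column name other than run_id is my guess, not the actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE runs (
    run_id     INTEGER PRIMARY KEY,
    benchmark  TEXT NOT NULL,   -- benchmark definition lives with the run
    started_at TEXT
);
CREATE TABLE metrics (
    run_id    INTEGER NOT NULL REFERENCES runs(run_id),
    name      TEXT NOT NULL,
    value     REAL,
    iteration INTEGER,
    timestamp TEXT
);
""")
```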
@lie: Are global_iteration and local_iteration ever used at the same time? In async there's not really a global_iteration, or am I mistaken? Because then we could just call it "iteration". What's the difference between an "epoch" and an "iteration"?
For timestamp, right now we just save wall time (it was the easiest to implement). In the future it might make sense to switch to time since start. What do you think?
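For reference, the two options are basically this (just a sketch): relative time is wall time minus a start time recorded once per run, so wall time is the more primitive of the two.

```python
import time

RUN_START = time.time()  # recorded once when the run begins

def wall_timestamp() -> float:
    return time.time()  # what we store now

def relative_timestamp() -> float:
    return time.time() - RUN_START  # seconds since the run started
```

One thing in favor of wall time: as long as the run's start time is in the DB, relative time can always be derived afterwards.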
Technically we can have a run in the DB hold an optional pointer to a previous run, to keep track of restarted runs. That would allow us to just combine them in the end. Or at least then we wouldn't have a problem with separating the metrics of multiple runs.
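Sketch of how that could look, assuming we add a nullable previous_run_id column to the runs table (the helper names are made up):

```python
import sqlite3

def run_chain(conn: sqlite3.Connection, run_id: int) -> list[int]:
    """Follow previous_run_id back to the original run; oldest first."""
    chain = []
    current = run_id
    while current is not None:
        chain.append(current)
        row = conn.execute(
            "SELECT previous_run_id FROM runs WHERE run_id = ?", (current,)
        ).fetchone()
        current = row[0] if row else None
    return list(reversed(chain))

def combined_metrics(conn: sqlite3.Connection, run_id: int) -> list:
    """Metrics of a restarted run together with all its predecessors."""
    chain = run_chain(conn, run_id)
    placeholders = ",".join("?" * len(chain))
    return conn.execute(
        f"SELECT * FROM metrics WHERE run_id IN ({placeholders}) "
        "ORDER BY iteration",
        chain,
    ).fetchall()
```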