Hello Anubhav,
thanks for taking the time to look at these issues. In parallel with your debugging, I examined the recover_offline call by running its commands step by step for one particular launch that had been marked as RUNNING again after being FIZZLED.
Here, I will illustrate what I noticed with database screenshots:
As described before, "updated_on" is set to the current date every time I call "lpad recover_offline -w PATH_TO_THE_APPROPRIATE_WORKER_FILE":
Running the first few lines of the recovery code
    m_launch = self.get_launch_by_id(launch_id)
    try:
        self.m_logger.debug("RECOVERING fw_id: {}".format(m_launch.fw_id))
        # look for ping file - update the Firework if this is the case
        ping_loc = os.path.join(m_launch.launch_dir, "FW_ping.json")
        if os.path.exists(ping_loc):
            ping_dict = loadfn(ping_loc)
            self.ping_launch(launch_id, ptime=ping_dict['ping_time'])
on the ping file with content '{"ping_time": "2019-07-28T12:54:43.213215"}' modifies the database entry as expected:
After the first part of the lines you pointed out,
    offline_data = loadfn(offline_loc)
    if 'started_on' in offline_data:
        m_launch.state = 'RUNNING'
        for s in m_launch.state_history:
            if s['state'] == 'RUNNING':
                s['created_on'] = reconstitute_dates(offline_data['started_on'])
        l = self.launches.find_one_and_replace({'launch_id': m_launch.launch_id},
                                               m_launch.to_db_dict(), upsert=True)
the state history is still consistent:
At this point, the Firework has not been touched and still looks like this. After the next lines,
    fw_id = l['fw_id']
    f = self.fireworks.find_one_and_update({'fw_id': fw_id},
                                           {'$set':
                                               {'state': 'RUNNING',
                                                'updated_on': datetime.datetime.utcnow()
                                                }
                                            })
the Firework's "updated_on" is set to the current time:
That is what you described. However, I do not yet understand where the state setter you mention comes into play; I will have to look at that tomorrow.
The launch's state_history is still consistent up to this point. The next block,
    if 'checkpoint' in offline_data:
        m_launch.touch_history(checkpoint=offline_data['checkpoint'])
        self.launches.find_one_and_replace({'launch_id': m_launch.launch_id},
                                           m_launch.to_db_dict(), upsert=True)
calls "touch_history" again, this time without any ptime argument, and thus overwrites the previous change with the current time:
Since the FW_offline.json contains a non-empty "checkpoint" entry,
{"launch_id": 12392, "started_on": "2019-07-24T12:54:41.031150", "checkpoint": {"_task_n": 0, "_all_stored_data": {}, "_all_update_spec": {}, "_all_mod_spec": []}}
these lines are executed. That is how the current time enters the state history. What is the actual purpose of a "checkpoint"? I could not find much documentation on this.
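To make the overwrite described above concrete, here is a minimal sketch. It is not the FireWorks code: SketchLaunch is a hypothetical stand-in, and I am assuming that touch_history falls back to the current time whenever no time is passed, which is what the database entries suggest.

```python
import datetime


class SketchLaunch:
    """Hypothetical stand-in for a launch; NOT the FireWorks Launch class."""

    def __init__(self):
        self.state_history = [{"state": "RUNNING", "updated_on": None}]

    def touch_history(self, update_time=None, checkpoint=None):
        # Assumed behaviour: use the current UTC time if no time is given.
        if update_time is None:
            now = datetime.datetime.now(datetime.timezone.utc)
            update_time = now.replace(tzinfo=None)
        if checkpoint:
            self.state_history[-1]["checkpoint"] = checkpoint
        self.state_history[-1]["updated_on"] = update_time


# Time taken from the FW_ping.json content quoted earlier in this post.
ping_time = datetime.datetime(2019, 7, 28, 12, 54, 43, 213215)

launch = SketchLaunch()

# Step 1: the ping file is applied, so "updated_on" equals the ping time.
launch.touch_history(ping_time)
after_ping = launch.state_history[-1]["updated_on"]

# Step 2: the checkpoint is applied without a time, so "updated_on"
# silently becomes the moment the recovery is run.
launch.touch_history(checkpoint={"_task_n": 0})
after_checkpoint = launch.state_history[-1]["updated_on"]

print("after ping:      ", after_ping)
print("after checkpoint:", after_checkpoint)
```

If the assumption holds, the second call is what replaces the ping time, regardless of what the ping file contained.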
Please find the test protocol attached (Jupyter notebook and HTML). In the next few days, I will address the other points in your post.
Best,
Johannes