Jobs active for two days


ism...@gmail.com

Nov 9, 2022, 9:17:39 AM
to AtoM Users
Good afternoon. For two days, the pending jobs in AtoM have not finished.

I'm not sure whether they run as a queue, or whether newer jobs might run before older ones.

Another thing that happens to me: I changed the publication status of some fonds to published, checking the option to include descendants, but only the top-level description was published; the descendants were not.

Finally, I would like to know if there is any way to tell whether the jobs that appear to be running have actually finished and can be deleted.

Thank you as always for your work, greetings.

Dan Gillean

Nov 10, 2022, 8:35:03 AM
to ica-ato...@googlegroups.com
Hi Isabel, 

It sounds like one of the jobs has exhausted all available resources and stalled. There are two ways you can deal with this. 

Option 1 - kill a specific stalled job; keep the queue

If you want to keep all the other jobs in the queue that are waiting behind the stalled one, then you will need to use SQL to kill only the specific problem job. This involves accessing the MySQL command prompt, which in turn means knowing the credentials used during installation. The following link has information on how to find those credentials again if needed, as well as how to access the MySQL command prompt: 
From there, you can now use SQL to first look up the ID of the job that has stalled, and then use that ID to kill the specific job: 
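For reference, a minimal sketch of what those two steps might look like at the MySQL prompt, assuming AtoM's standard job status term IDs (183 = running, 184 = completed, 185 = error) - verify these against your own term table before running the update, and note that the ID in the second statement is a placeholder:

  -- List jobs still marked as running (status_id 183):
  SELECT id, name, status_id, completed_at FROM job WHERE status_id = 183;

  -- Mark the stalled job as errored so the worker skips it;
  -- replace 123456 with the actual ID returned by the query above:
  UPDATE job SET status_id = 185 WHERE id = 123456;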
You will likely want to reset the atom-worker fail counter, and then restart the worker after killing the stalled job - instructions for that are included at the end of the section linked above, as well as here: 
More explanation of what the fail counter is, and how to check the status of the atom-worker, is also included below in a separate section, for reference. 

At this point, the job scheduler *should* pick up other jobs that were in the queue and continue with them. Keep in mind that you may still run into issues - for example, if one of the queued jobs was to publish a series that has since been deleted. In that case, depending on what the issue is, you can either try restarting the job scheduler again or (as in the example I gave, where the job points to records that no longer exist) use the same process as above to kill the problem job. 

Option 2 - clear the entire queue

The second option is much easier, but will also clear ALL jobs - not just those unfinished jobs in the queue, but also the history of completed jobs shown in the user interface. This means that you would need to manually re-run whatever process initiated the previously queued jobs for them to run again - for example, if one of them was a job to update the publication status of a bunch of descendant records, then you'd likely need to go update the publication status of the parent series or fonds again. 

If you want to preserve the history of the previously completed jobs, AtoM does have a CSV export for job history in the user interface that you can use first - see: 
To clear all jobs: 

Run the following command from AtoM's root installation directory: 
  • php symfony jobs:clear
You may need to restart the job scheduler (and therefore, you may need to reset the fail counter as well). 
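Putting that together, a sketch of the full sequence, assuming the default installation path of /usr/share/nginx/atom (adjust for your environment):

  cd /usr/share/nginx/atom                  # AtoM's root installation directory
  php symfony jobs:clear                    # remove all queued and completed jobs
  sudo systemctl reset-failed atom-worker   # reset the fail counter
  sudo systemctl restart atom-worker        # restart the worker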

Fail counter and checking the worker status

You can always check the status of the atom-worker and make sure it's running properly with the following command: 
  • sudo systemctl status atom-worker
Restarting is similar: 
  • sudo systemctl restart atom-worker
If the worker isn't running after a restart, this probably means that the fail counter limit has been reached. The atom-worker will automatically try to restart and then repeat a job after a failure, but to prevent the worker from getting caught in an infinite loop when an issue can't be resolved this way, a limit is set in the configuration: after 3 attempts in 24 hours or less, the fail limit has been reached, and the fail counter needs to be reset to zero before restarting will work:
  • sudo systemctl reset-failed atom-worker
This command sets the internal fail counter back to zero, so restarting the atom-worker should work again after running this. 
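For reference, that limit of 3 attempts in 24 hours comes from the worker's systemd unit file. A sketch of the relevant directives, assuming a unit file along the lines of the one in our installation docs (your file may differ):

  # Excerpt from atom-worker.service (assumed layout)
  [Unit]
  Description=AtoM worker
  # Allow at most 3 start attempts within a 24-hour window before
  # systemd gives up; 'systemctl reset-failed' clears this counter.
  StartLimitIntervalSec=24h
  StartLimitBurst=3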

Hope this helps!

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory
he / him



ism...@gmail.com

Nov 16, 2022, 9:38:57 AM
to AtoM Users
Hi Dan, thank you very much for the reply. We located the oldest jobs that seemed to be finished and have been deleting them, but the situation remains the same: the rest do not finish.
In addition, we have also verified that there is a process on the server consuming a large share of the memory and hard disk: /usr/bin/php -d memory_limit=-1 -d error_reporting=E_ALL symfony jobs:worker
Could it be preventing worker jobs from running?
Thanks in advance and best regards.

Dan Gillean

Nov 16, 2022, 10:49:04 AM
to ica-ato...@googlegroups.com
Hi Isabel, 

Yes, when I say "exhaust available resources", memory is one of the resources I mean. It seems that one of the jobs used all available memory, which has caused it to stall. 

If you follow option 1 (use SQL) in my previous reply, then the first SQL query should tell you at a glance which job needs to be killed. When you run the first query, you will get a list of jobs as output, like so: 
  • SELECT id,name,status_id,completed_at FROM job;
+--------+--------------------------+-----------+---------------------+
|  id    |  name                    | status_id |  completed_at       |
+--------+--------------------------+-----------+---------------------+
| 149689 | arUpdateEsIoDocumentsJob |       184 | 2020-01-27 09:15:11 |
| 149690 | arUpdateEsIoDocumentsJob |       184 | 2020-01-27 09:18:28 |
| 155764 | arFileImportJob          |       183 | NULL                |
| 155800 | arUpdateEsIoDocumentsJob |       183 | NULL                |
| 155801 | arUpdateEsIoDocumentsJob |       183 | NULL                |
| 155802 | arObjectMoveJob          |       183 | NULL                |
| 155803 | arObjectMoveJob          |       183 | NULL                |
| 155804 | arUpdateEsIoDocumentsJob |       183 | NULL                |
| 155805 | arUpdateEsIoDocumentsJob |       183 | NULL                |
| 155808 | arUpdateEsIoDocumentsJob |       183 | NULL                |
+--------+--------------------------+-----------+---------------------+
1853 rows in set (0.00 sec)


In the above example, it's job 155764 that has stalled - the first job listed that has NULL for the completion time. Using SQL to kill this job, then restarting the atom-worker, usually resolves the issue and allows the queue to resume, as the memory is freed. 
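Concretely, a sketch of the kill step for this example (again assuming 185 is the error status term ID in your installation - verify against your term table first):

  UPDATE job SET status_id = 185 WHERE id = 155764;

After that, reset the fail counter and restart the atom-worker as described above.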

If that's what you've done and it's not working, then I would suggest you just use option 2: kill all jobs, reset the fail counter, restart the worker, and then manually relaunch any other jobs you had queued. 

Regards, 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory
he / him

ism...@gmail.com

Nov 18, 2022, 8:39:33 AM
to AtoM Users
Hi Dan, thanks for the reply. In the end I cleared the entire job queue, and the process that was filling the memory has finally disappeared.
I have launched a single job to publish a description with 6 PDF documents (they are quite large, about 500 MB each), and that process is currently taking up 48% of the memory out of a total of 8 GB of RAM. Is it normal for it to take up so much memory? All the best.

Dan Gillean

Nov 18, 2022, 9:15:01 AM
to ica-ato...@googlegroups.com
Hi Isabel, 

There are some known issues with memory management and "garbage collection" in the atom-worker, particularly around finding aid generation. See for example: 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory
he / him
