Importing large sets of files is breaking Atom's hierarchy

Roberto Greiner

Jan 29, 2025, 2:01:07 PM
to AtoM Users
Hi,

I'm having a problem importing large sets of files to AtoM.

I'm migrating a considerable number of PDF files (~8,500 documents, ~1.4
million pages) from our proprietary system to AtoM.

I've created a hierarchy with groups which will receive the files. After
creating the groups, I'm following the documentation in
https://www.accesstomemory.org/pt-br/docs/2.8/admin-manual/maintenance/cli-import-export/#digital-object-load-task.

So first I use a CSV to import the archival descriptions in the group
using the CSV template from
https://wiki.accesstomemory.org/Resources/CSV_templates. This works
fine, even with a few of the groups having a large number of
descriptions (the largest one has 2,830 descriptions).

The problem I get is when I go to the command line and use the 'php
symfony digitalobject:load' command to link the descriptions to their
respective files.

Every single time, the command reports that it finished without any
issue, but if I link a larger set of files, the entire hierarchy breaks.
In the web interface, if I click to open the group where I made the
import, the entire group stays greyed out and does not load. Sometimes,
parts of the hierarchy disappear. The number of files that triggers
this varies between 100 and 300. In a few cases I've managed to
link 200 files; in others it breaks somewhere between 100 and 200. The
import process itself does not give any error messages. Any ideas of what
could be happening?
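
Since the breakage seems to depend on how many files are linked in one run, one way to narrow it down is to split the CSV into fixed-size batches and load them one at a time, checking the hierarchy after each. A minimal sketch under stated assumptions: the function name split_csv and the output file naming are hypothetical, and it assumes one record per line (no quoted fields containing line breaks).

```shell
# Hypothetical helper: split a CSV into batches, repeating the header row
# at the top of every batch file.
# $1 = source CSV, $2 = data rows per batch, $3 = output file prefix
split_csv() {
  src="$1"; size="$2"; prefix="$3"
  awk -v size="$size" -v prefix="$prefix" '
    NR == 1 { header = $0; next }        # remember the header row
    (NR - 2) % size == 0 {               # first data row of a new batch
      if (out != "") close(out)
      out = sprintf("%s%03d.csv", prefix, ++n)
      print header > out
    }
    { print > out }                      # append the data row to the batch
  ' "$src"
}

# Usage (paths are illustrative):
#   split_csv /home/myuser/importar_objetos.csv 100 /home/myuser/lote_
```

Each resulting file can then be passed to 'php symfony digitalobject:load' in place of the full CSV.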


Additional information:
- When I load the CSV with 'php symfony digitalobject:load', I get
the following two errors for every file imported. Is that normal?
What am I doing wrong?
convert-im6.q16: missing required argument  @
error/convert.c/ConvertImageCommand/545.
identify-im6.q16: missing required argument  @
error/identify.c/IdentifyImageCommand/249.

The VM I'm running has the following setup:
Ubuntu 24.04.01 with MySQL 8.0.40, 6GB RAM, 7GB Swap, 21GB root disk,
1TB disk for AtoM mounted in /usr/share/nginx. All the files to be
linked are already in the same server in the folder
'/usr/share/nginx/share' to avoid problems with remote connections. Atom
is in the latest available version (2.8.2 - 193).

MySQL is working fine, with some adjustments recommended by MySQLTuner
(https://github.com/major/MySQLTuner-perl)

The exact command I'm using for import is the following (executed as
root from /usr/share/nginx/atom):
php symfony digitalobject:load --path="/usr/share/nginx/share/" \
   /home/myuser/importar_objetos.csv \
   > /root/registro_importacao_atom."$(date +%Y-%m-%d.%H-%M-%S)".txt \
   2> /root/erro_registro_importacao_atom."$(date +%Y-%m-%d.%H-%M-%S)".txt

I'm using this to make it easier to debug any errors, as all the
output is then available in two files in /root.
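
One detail of the command above: 'date' is evaluated separately for each redirect, so the two log files can end up with slightly different timestamps. A minimal sketch that computes the stamp once and reuses it; the function names log_stamp and run_load are made up for illustration and are not part of AtoM, while the command and log file names are the ones shown above.

```shell
# Build one timestamp so both log files share the same suffix.
log_stamp() {
  date +%Y-%m-%d.%H-%M-%S
}

# Hypothetical wrapper around the load command.
# $1 = CSV file, $2 = directory holding the files, $3 = log directory
run_load() {
  csv="$1"; objects="$2"; logdir="$3"
  stamp="$(log_stamp)"
  php symfony digitalobject:load --path="$objects" "$csv" \
    > "$logdir/registro_importacao_atom.$stamp.txt" \
    2> "$logdir/erro_registro_importacao_atom.$stamp.txt"
}

# Usage, from /usr/share/nginx/atom:
#   run_load /home/myuser/importar_objetos.csv /usr/share/nginx/share/ /root
```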

Also, after running the 'php symfony digitalobject:load' command, I run
'php symfony search:populate' to update the indexes before even looking
at the web pages. No errors are reported in this command.

Thank you,

Roberto

--
-----------------------------------------------------
Marcos Roberto Greiner

Optimists think we live in the best of all worlds
Pessimists fear that this is true
James Branch Cabell
-----------------------------------------------------

Matt Deschaine

Jan 29, 2025, 7:28:12 PM
to AtoM Users
Your first errors on object import are probably from ImageMagick, which AtoM uses to create thumbnails of the first page of the PDF on import. Check your installation has the other dependencies installed. https://www.accesstomemory.org/es/docs/2.8/admin-manual/installation/requirements/#other-dependencies

For your treeview issue, try adding rebuild nested set to your commands after you load.

One other possibility is that you're running Ubuntu 24.04 which I don't believe is officially supported. There has been some discussion on this forum about that here: https://groups.google.com/g/ica-atom-users/c/pEDQkCsDeV8

Roberto Greiner

Feb 3, 2025, 1:06:13 PM
to ica-ato...@googlegroups.com
On 29/01/2025 21:28, Matt Deschaine wrote:
> Your first errors on object import are probably from ImageMagick,
> which AtoM uses to create thumbnails of the first page of the PDF on
> import. Check your installation has the other dependencies installed.
> https://www.accesstomemory.org/es/docs/2.8/admin-manual/installation/requirements/#other-dependencies

I do have ImageMagick installed. Since those errors are probably not the
problem, I will focus on them later.

>
> For your treeview issue, try adding rebuild nested set to your
> commands after you load.
> https://www.accesstomemory.org/en/docs/2.8/admin-manual/maintenance/troubleshooting/#rebuilding-the-nested-set
I've tried that. Same result. After linking the first 100 descriptions
with the correct files, all was OK. After running the script to link
descriptions 100-200, the hierarchy crashed and keeps showing "Loading"
indefinitely. I ran the rebuild command 'php symfony
propel:build-nested-set'. It completed surprisingly fast (about a second).
I then reran the index rebuilding command ('php symfony
search:populate'), but that didn't help. I've also tried running the
command to purge the cache ('php symfony cc') and restarting all the
services involved (Nginx, PHP-FPM, Memcached and atom-worker), still with
the same result.
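
The recovery steps listed above can be collected into one function so they run the same way after every batch. This is only a sketch of the commands already mentioned, not an official procedure: the function name post_load and the DRY_RUN switch are invented here, the order of the three tasks is an assumption, and service restarts are left out because unit names differ between installs.

```shell
# Hypothetical helper: run the maintenance tasks mentioned in this thread.
# With DRY_RUN=1 it only prints the commands instead of running them.
# Must be run from the AtoM root directory.
post_load() {
  for task in propel:build-nested-set cc search:populate; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "php symfony $task"
    else
      php symfony "$task" || return 1   # stop at the first failing task
    fi
  done
}

# Usage:
#   DRY_RUN=1; post_load    # preview the commands
#   DRY_RUN=0; post_load    # actually run them
```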
>
> One other possibility is that you're running Ubuntu 24.04 which I
> don't believe is officially supported. There has been some discussion
> on this forum about that here:
> https://groups.google.com/g/ica-atom-users/c/pEDQkCsDeV8

I really hope this is not the cause of the problems. The official
platform is Ubuntu 20.04, which officially loses support in two months.
I'm not very keen on installing a new service on a platform that is
about to die. Did anyone else have problems because of this? Could it be
that at least Ubuntu 22.04 runs without this error?

tat...@gmail.com

Feb 10, 2025, 11:29:57 AM
to AtoM Users
Hi Roberto,

for your thumbnails problem, you can check this solution written by Dan Gillean:
https://groups.google.com/g/ica-atom-users/c/_MIUrJ0AotE/m/FxtDctRHCQAJ

I hope it helps!
Tati Canelhas

Roberto Greiner

Feb 10, 2025, 11:58:34 AM
to ica-ato...@googlegroups.com

No, that's not the problem. I'm getting some error messages that are probably related to ImageMagick, but they are not causing me any trouble. The problem is that using the CLI script to load multiple files is crashing the main hierarchy. Clearing the cache and rebuilding the hierarchy don't help either.

Tks,

Roberto


Johan Pieterse

Feb 10, 2025, 2:50:44 PM
to Roberto Greiner, ica-ato...@googlegroups.com
Hi Roberto

Did you rebuild the nested sets?

Johan 

Roberto Greiner

Feb 10, 2025, 3:12:27 PM
to ica-ato...@googlegroups.com

Yes,

I've tried that, unfortunately it didn't work.

Tks.

Aloysio Yamada

Apr 15, 2025, 5:42:42 AM
to AtoM Users
Hi,

I have the same problem.

My installation:
- Ubuntu 24.04
- and I used this link for complete install 

After configuration, I populated 1,500 information objects using csv:import.
After this I ran search:populate and all the data was fine on the web.

After importing the digital objects using digitalobject:load, I checked the web interface, and all thumbnails and files were fine.
I ran search:populate again and AtoM's hierarchy stopped working: it shows "loading" indefinitely.

I tried rebuilding the nested set, regen, setting the replicas to zero, and deleting all data in Elasticsearch and rebuilding it, and none of it worked.

Any ideas of what could be happening?

Aloysio Yamada

Apr 15, 2025, 8:27:59 AM
to AtoM Users
Hi everyone,

I resolved this problem by editing config/search.yml: I changed batch_mode to false and it worked.

Thanks
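
The batch_mode change described above could be scripted roughly as follows, keeping a backup of the original file. A minimal sketch: the function name set_batch_mode is hypothetical, and it assumes batch_mode appears on a line of its own in the YAML file, as in a typical search.yml.

```shell
# Hypothetical helper: set batch_mode in a search.yml file.
# The previous version of the file is kept next to it with a .bak suffix.
# $1 = path to search.yml, $2 = true or false
set_batch_mode() {
  file="$1"; value="$2"
  cp "$file" "$file.bak" || return 1
  # Keep the original indentation and key, replace only the value.
  sed "s/^\([[:space:]]*batch_mode:[[:space:]]*\).*/\1$value/" "$file.bak" > "$file"
}

# Usage, from the AtoM root directory:
#   set_batch_mode config/search.yml false
```

It may also be necessary to clear the cache ('php symfony cc') and rerun 'php symfony search:populate' afterwards; that is an assumption, not something confirmed in this thread.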

Roberto Greiner

Apr 15, 2025, 1:46:00 PM
to ica-ato...@googlegroups.com

I've tried setting this variable (batch_mode), and it worked only partially.

After setting it, I ran the command to load the objects into the AtoM nested set. This time, the nested tree did not crash, and when I clicked on the root of the branch with all the objects, it did actually show the thumbnails of the objects. The problem is that when I tried to open the branch itself, it wouldn't open: it only showed the opening animation, but never opened. I tried repopulating the search index, clearing the cache, rebuilding the nested set, and restarting nginx, php and memcached, but it still wouldn't open. I ended up restoring AtoM from a snapshot.

Any other ideas?

Tks,

Roberto

Aloysio Yamada

Apr 16, 2025, 10:31:10 AM
to AtoM Users
Hi Roberto,
The status here is GREEN.

Output of 'php symfony search:status':
Elasticsearch server information:
 - Version: 6.8.23
 - Host: localhost
 - Port: 9200
 - Index name: atom

Document indexing status:
 - Accession: 0/0
 - Actor: 2/2
 - Aip: 0/0
 - Function object: 6/6
 - Information object: 1660/1660
 - Repository: 1/1
 - Term: 339/339

This is my config in search.yml

all:
  batch_mode: false
  batch_size: 5
  server:
    host: localhost
    port: '9200'
  index:
    name: atom
    configuration:
      number_of_shards: 4
      number_of_replicas: 0
      index.mapping.total_fields.limit: 300000
      index.max_result_window: 1000000

I have only 1660 information objects, and my tree is very simple

[Attached screenshot: Captura de tela 2025-04-16 113024.png]

Aloysio Yamada

Apr 17, 2025, 8:03:37 AM
to ica-ato...@googlegroups.com
Roberto,

In my case I set the number of replicas to 0 (zero) because I use a single Elasticsearch node.



Aloysio Yamada


Roberto Greiner

Sep 12, 2025, 4:21:51 PM
to ica-ato...@googlegroups.com

After a lot of time I THINK I've managed to solve the problem with the hierarchy breaking. I have new physical servers, so I bumped up the CPU and memory settings for the VM. Java, which had only 1GB for both Xmx and Xms, now has 5GB. I increased disk space (both the root partition and the partition where nginx/AtoM lives) and, what I think really made the difference, increased the buffer sizes for MySQL. The usual recommendation for the InnoDB buffer size is about 75% of the size of your database. In my case the database was already at ~1.5GB and I had a meager 256MB for InnoDB. Now that RAM is no longer a problem I increased it to 2GB, and it SEEMS that this solved the problem. These changes also considerably improved AtoM's response time.
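
To compare the database size with the InnoDB buffer pool as described above, something like the following could be used. A rough sketch under stated assumptions: the helper names are invented, the mysql client is assumed to be able to log in without extra options, the schema name passed in (atom in the usage line) is an assumption, and the 75% figure is just the rule of thumb quoted above, not an official MySQL recommendation.

```shell
# Total data + index size of one schema, in MB.
# $1 = schema name
db_size_mb() {
  mysql -N -B -e "SELECT CEIL(SUM(data_length + index_length) / 1024 / 1024)
                  FROM information_schema.tables
                  WHERE table_schema = '$1';"
}

# Current InnoDB buffer pool size, in MB.
pool_size_mb() {
  mysql -N -B -e "SELECT @@innodb_buffer_pool_size DIV 1048576;"
}

# 75% of the given size in MB, using integer arithmetic.
suggest_pool_mb() {
  echo $(( $1 * 75 / 100 ))
}

# Usage:
#   echo "current pool: $(pool_size_mb) MB"
#   echo "suggested:    $(suggest_pool_mb "$(db_size_mb atom)") MB"
```

For a database of about 1.5GB (1536 MB), suggest_pool_mb returns 1152, so both the old 256MB setting and the new 2GB setting can be judged against that figure. The value itself is set through innodb_buffer_pool_size in the MySQL configuration.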

In any case, Monday I will continue loading more files and will report back again.

Roberto
