Importing large CSV - is there an advisable max size?


Cintya Takahaschi

Aug 9, 2021, 9:47:07 AM
to AtoM Users
Hi all,


I'm looking for an opinion, or advice, from developers or anyone who has been working with AtoM for some time...

I tried to import a CSV file with 72,212 lines into AtoM 2.6 using the csv:import command-line task. After a little more than 20 hours of processing (52,456 records, if I counted the progress dots correctly), I received this error:

[wrapped: SQLSTATE[HY000]: General error: 3636 Recursive query aborted after 1001 iterations. Try increasing @@cte_max_recursion_depth to a larger value.]

I found out that this is a MySQL-related issue.

Is it better to change my MySQL configuration (for example, using SET GLOBAL cte_max_recursion_depth = 5000) or to split my CSV into several files, each with a maximum number of lines? What would this maximum be?


Thanks in advance,

Cintya Takahaschi

Dan Gillean

Aug 9, 2021, 10:19:25 AM
to ICA-AtoM Users
Hi Cintya, 

It's really hard to give a precise number for an advisable CSV row count, because the processing time depends on a lot of factors: MySQL settings, yes, but also PHP execution limits, available memory and CPU, and even the data itself. Data with a lot of relations (where the import also needs to create or link terms, actors, and repositories) takes more time to process and needs to hold more in memory as it does so. We've also been trying to improve performance with each major release, so it also depends on which version of AtoM you're using (2.6 saw a lot of improvements over 2.5 and earlier, for example, but there's still more we hope to do in the future).

With this in mind, I would say that in general it's probably better to break up the CSV than to adjust MySQL, since you'll inevitably hit some other limitation. I'd suggest trying 5,000 to 10,000 rows per file and seeing how that goes. Just remember: if you're working with hierarchical data, be careful about where and how you split the CSV. Parent rows should always appear before any of their children, and ideally all children should be in the same CSV as their parent. If that isn't possible, you might want to experiment a bit with the command-line import options: the --source-name option lets you give the imports a common source name, which allows later imports to find legacyID values from earlier imports more easily.
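
For anyone who wants to script the split, here is a minimal Python sketch of the idea above. It is not part of AtoM. It assumes the CSV has a parentId column (the column name is a parameter, so adjust it to your template), and that each top-level row is immediately followed by its descendants. The function name, the chunk file names, and the 5,000-row default are illustrative choices only. A new file is started only at a top-level row, so a chunk can exceed max_rows when one hierarchy is larger than the limit.

```python
import csv
from pathlib import Path


def split_csv(src, out_dir, max_rows=5000, parent_col="parentId"):
    """Split src into chunk files of roughly max_rows data rows each.

    A new chunk starts only at a row whose parent_col is empty, so a
    top-level record and the descendants listed after it stay together.
    Returns the list of chunk paths in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    writer = None
    handle = None
    rows_in_chunk = 0
    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            is_top_level = not (row.get(parent_col) or "").strip()
            # Open the first chunk, or roll over once the limit is reached
            # and the current row begins a new hierarchy.
            if writer is None or (rows_in_chunk >= max_rows and is_top_level):
                if handle:
                    handle.close()
                path = out_dir / f"chunk_{len(paths) + 1:03d}.csv"
                handle = open(path, "w", newline="", encoding="utf-8")
                writer = csv.DictWriter(handle, fieldnames=reader.fieldnames)
                writer.writeheader()  # every chunk repeats the header row
                paths.append(path)
                rows_in_chunk = 0
            writer.writerow(row)
            rows_in_chunk += 1
    if handle:
        handle.close()
    return paths
```

Each chunk could then be passed to csv:import in order, using the same --source-name value for every file if hierarchies end up spanning more than one chunk.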

Hope that helps!

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory
he / him



Cintya Takahaschi

Aug 9, 2021, 2:00:00 PM
to ica-ato...@googlegroups.com
Thank you so much, Dan Gillean, your explanations and advice were very helpful!

I'll be sharing your thoughts with our team, and we'll do some testing with smaller CSVs and with the --source-name option, being careful with the positioning of parent and child rows.
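
One way to check the row positioning before an import is a small script along these lines. This is a sketch, not an AtoM tool: it assumes legacyId and parentId column names (both are parameters), the function name is made up for illustration, and it does not handle any other way of linking to a parent. It reports every row whose parent has not appeared earlier in the same file. Rows whose parent was loaded in an earlier import under a shared --source-name will also be listed, so the result needs to be read with that in mind.

```python
import csv


def find_orphans(src, id_col="legacyId", parent_col="parentId"):
    """Return (line_number, legacy_id, parent_id) for each row whose
    parent has not appeared earlier in the same file."""
    seen = set()
    problems = []
    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            legacy_id = (row.get(id_col) or "").strip()
            parent_id = (row.get(parent_col) or "").strip()
            # A non-empty parent that we have not seen yet is either
            # listed later in this file or lives in a different file.
            if parent_id and parent_id not in seen:
                problems.append((reader.line_num, legacy_id, parent_id))
            if legacy_id:
                seen.add(legacy_id)
    return problems
```

An empty list means every child row in the file comes after its parent.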

Best regards,

Cintya Takahaschi
