Uploading PDFs

Debra Schiff

Feb 15, 2024, 2:28:36 PM

to AtoM Users

Hello,

In the AtoM documentation, it says:

Currently, AtoM 2.x truncates PDF text after the first 65,535 bytes.

Does that mean that the entire PDF is not uploaded? Are the sizes of the uploads limited even if we don't specify a size limit?

Please help. Thanks!

Best regards,

Deb Schiff

Archivist and Special Collections Librarian

R. Barbara Gitenstein Library

The College of New Jersey

PO Box 7718

Ewing, NJ 08628-0718

Phone: 609-771-2284

pronouns: she/her/hers

Dan Gillean

Feb 15, 2024, 3:17:50 PM

to ica-ato...@googlegroups.com

Hi Debra,

Good news!

No, this does not mean that the PDF itself stops being uploaded. This number refers to how much of the text layer (e.g. from OCR or similar) gets added to AtoM's search index, so that search results can include hits from INSIDE a PDF (keep in mind that the accuracy of these results will depend on the quality of the OCR text layer).
Additionally, we have actually increased this limit in the 2.8 release. Previously, the database field used to store indexed text from a PDF used the TEXT type, which holds at most 65,535 bytes. In 2.8 this field uses the MEDIUMTEXT type instead, which holds up to 16,777,215 bytes (about 16 MB) - i.e. roughly 256 times more than before.
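
To make the size difference concrete, here is a minimal Python sketch of the MySQL TEXT-family byte limits and of what a byte-based cutoff does to a long text layer. The function names are hypothetical and this is not AtoM's actual indexing code; it only illustrates the arithmetic, assuming the text is stored as UTF-8.

```python
# Maximum storage, in bytes, for the MySQL TEXT-family column types.
MYSQL_TEXT_LIMITS = {
    "TINYTEXT": 2**8 - 1,     # 255
    "TEXT": 2**16 - 1,        # 65,535
    "MEDIUMTEXT": 2**24 - 1,  # 16,777,215
    "LONGTEXT": 2**32 - 1,    # 4,294,967,295
}


def fits_in_column(text: str, column_type: str) -> bool:
    """Return True if the UTF-8 encoding of `text` fits in the column type."""
    return len(text.encode("utf-8")) <= MYSQL_TEXT_LIMITS[column_type]


def truncate_to_limit(text: str, column_type: str) -> str:
    """Cut `text` to the column's byte limit without splitting a character.

    Hypothetical helper: it shows what a byte-based cutoff looks like,
    not how AtoM itself truncates PDF text.
    """
    limit = MYSQL_TEXT_LIMITS[column_type]
    encoded = text.encode("utf-8")[:limit]
    # errors="ignore" drops a trailing partial multi-byte sequence, if any.
    return encoded.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    ratio = MYSQL_TEXT_LIMITS["MEDIUMTEXT"] / MYSQL_TEXT_LIMITS["TEXT"]
    print(f"MEDIUMTEXT holds about {ratio:.0f}x as many bytes as TEXT")
```

Because the limits are counted in bytes rather than characters, text with many accented or non-Latin characters reaches the limit sooner than plain ASCII text of the same length.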

If you're curious, you can read more about different storage sizes for MySQL TEXT database field types - see for example:

https://chartio.com/resources/tutorials/understanding-strorage-sizes-for-mysql-text-data-types/

Now, there ARE some actual limits on the size of digital objects that can be uploaded in AtoM. Any of these that are set can be configured and changed. I have described those in our docs, as well as in several previous user forum posts (including where and how to change them). See for example:

https://www.accesstomemory.org/docs/latest/admin-manual/maintenance/troubleshooting/#why-can-t-i-upload-large-digital-objects

Some forum threads that sum it up too:

While those can all be changed, perhaps the biggest limit currently for large files is the web browser. Most web browsers have a built-in timeout limit of about 1 minute, so that long-running processes don't continue unchecked and end up consuming all available client resources and crashing your browser. In practice, this means that if you try to upload a very large PDF to a description in AtoM through the web-based user interface, the upload may time out - and that risks interrupting the process and leaving incomplete data (i.e. data corruption) in the database, potentially leading to serious problems later.

I would love to see us add support for asynchronous uploads that can hash and chunk large files in the background when users upload large content via the interface, but this will require significant time, analysis, and effort to develop, and is not currently slated for an upcoming release.

In the meantime, we have the ability to upload large files via the command line, or as part of a CSV upload. The forum links above include further links that describe both methods in greater detail. Use these if you need to add very large PDFs to AtoM - and if, somehow, the MEDIUMTEXT change in 2.8 is not enough to index all content in your large PDFs, a local developer could also use the related commit as a guide, and change the relevant database field from MEDIUMTEXT to LONGTEXT.
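
As a rough illustration of the CSV route, the Python sketch below builds a small CSV that pairs each PDF path with the description it should attach to. The column names (`filename`, `slug`), the file paths, and the slugs are assumptions made for illustration; check the AtoM documentation for the `php symfony digitalobject:load` task for the exact columns your version expects before using anything like this.

```python
import csv
import io


def build_load_csv(rows):
    """Return CSV text pairing each file path with a target description.

    `rows` is an iterable of (filename, slug) pairs. The header names are
    an assumption for this sketch, not a confirmed AtoM specification.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["filename", "slug"])
    for filename, slug in rows:
        writer.writerow([filename, slug])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical paths and slugs, for illustration only.
    example = [
        ("/data/uploads/annual-report-1950.pdf", "annual-report-1950"),
        ("/data/uploads/minutes-1951.pdf", "board-minutes-1951"),
    ]
    print(build_load_csv(example), end="")
```

The resulting text can be saved to a file and passed to the command-line task on the server, which avoids the browser timeout described above.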

Cheers,

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056

@accesstomemory

he / him


Dan Gillean

Feb 15, 2024, 3:29:19 PM

to ica-ato...@googlegroups.com

PS: Thanks again for reporting this. I have also filed this issue, so our Maintainers can update the 2.8 documentation (and hopefully also clarify it):

https://github.com/artefactual/atom-docs/issues/264

Cheers,

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056