Hi Germán,
Regarding database fields, OCR and character limits:
The following is from MySQL 5.6, but I don't believe it has changed in 5.7 - these are the size limits depending on the field type set in MySQL:
AtoM stores any text layer in an uploaded digital object (such as the OCR layer on a PDF) in the transcript field found in the property table. The transcript field is searchable in AtoM - in the advanced search menu, there is a filter that can limit searches to digital object text. This transcript field is set as a TEXT type in MySQL, meaning it has a limit of 65,535 bytes. Remember as well that AtoM uses UTF-8 character encoding, in which a character can take 1-4 bytes. So, unfortunately, how much text that translates to depends on the characters: the more multi-byte characters a document contains, the fewer characters fit. When the limit is surpassed, the transcript is simply truncated - the digital object will still be saved, but later pages in a very large PDF or text document may not in fact be searchable.
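To make the byte arithmetic concrete, here is a rough Python sketch (not AtoM code) that measures how many bytes a string takes in UTF-8 and clips it to the 65,535-byte TEXT limit. The function names are made up for illustration, and the clipping only approximates what happens to an oversized transcript - I have not checked exactly where MySQL or AtoM cuts the text.

```python
TEXT_LIMIT_BYTES = 65_535  # maximum size of a MySQL TEXT column, in bytes


def utf8_size(text: str) -> int:
    """Return the number of bytes the string occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def clip_to_limit(text: str, limit: int = TEXT_LIMIT_BYTES) -> str:
    """Clip text so that its UTF-8 encoding fits within `limit` bytes.

    Hypothetical helper: it only illustrates the effect of a byte limit.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= limit:
        return text
    # errors="ignore" drops a multi-byte character that was cut in half
    return encoded[:limit].decode("utf-8", errors="ignore")


# How many characters fit in a TEXT field, by character width
samples = {
    "ASCII letter (1 byte)": "a",
    "accented Latin letter (2 bytes)": "\u00e9",
    "CJK character (3 bytes)": "\u6587",
}
for label, char in samples.items():
    print(f"{label}: {TEXT_LIMIT_BYTES // utf8_size(char)} characters fit")
```

A transcript made up entirely of unaccented English text fits roughly three times as many characters as one made up entirely of CJK text, so a Spanish-language PDF with many accented characters will hit the limit somewhat sooner than its character count suggests.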
We haven't experimented with this, but I assume it would be possible to change this field type to MEDIUMTEXT or LONGTEXT. Keep in mind that you'll likely need to restart MySQL and PHP-FPM, and repopulate the search index after making such a change - and that it will cause the size of your search index to grow considerably if you are uploading large documents. For reference, you can find a copy of our database Entity Relationship Diagrams on the wiki, here:
When I have some time, I will try to generate an updated version for the final public 2.5 release, though many of the database changes included in 2.5 were already completed by November 2018, when the last ERD was uploaded to the wiki.
Also remember that your ability to get accurate search results against text found in a PDF or other text document will depend on the quality of the OCR, which may not accurately reflect what a human can read on a page! I have shown an example of this from our demo site in this thread:
Regarding the size of your digital objects stored:
You can check this in Admin > Settings - see:
Another way to check via the command line would be to check the total size of the uploads directory, where all digital objects (including derivatives, and repository logos and banners) are stored.
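As a rough sketch of that check, the Python snippet below walks a directory tree and adds up the file sizes. The helper names are my own, and the location of your uploads directory depends on your installation, so pass in whatever path applies to you.

```python
import os


def directory_size(root: str) -> int:
    """Return the total size in bytes of all regular files under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links so linked files are not counted
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def human_readable(num_bytes: int) -> str:
    """Format a byte count using binary units (KiB, MiB, ...)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"


# Example usage (the path is a placeholder - substitute your own):
# print(human_readable(directory_size("/path/to/atom/uploads")))
```

Because derivatives, repository logos, and banners live in the same directory, the total will be larger than the sum of your master digital objects alone.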
Note that there are a number of settings or configuration values that can limit digital object upload, that you should be aware of. I describe many of them here:
If your master digital objects are stored on a separate IIIF server, then I would recommend that you simply ensure that backups of this server are being made - to me it seems to defeat the purpose of using an IIIF server if you are also going to upload the masters directly to AtoM. AtoM simply uses the uploads directory to store digital objects - it is not a repository or a digital preservation platform. However, when making backups of your AtoM data, we do recommend that you back up the uploads directory, as well as the downloads directory if you are generating finding aids. If your IIIF server is backed up separately, then it's not as urgent to back up the uploads directory - your derivatives could always be generated again in the future using the regen-derivatives task listed here:
So long as the path to the master object that is stored in AtoM hasn't changed, and it is publicly accessible and points directly to the object (i.e. using a URL ending in the file extension), then this regeneration task will work with objects where the master is stored elsewhere as well. There is even an --only-externals option in the task, if you only want to regenerate remote digital objects uploaded via URL.
For reference, I've previously described how the uploads directory is organized in this forum thread:
Regarding code contributions to the public project:
I will ask one of our developers if they have suggestions for where to look regarding digital object management in the code that might affect your work.
In the meantime, some thoughts on sharing development work with the public AtoM project.
This thread was in response to a question about custom theme plugin development, but my response includes a large list of the development resources we have available:
If you are considering development that you wish to share back with the public project, please be sure to review this page:
However, I must make it clear that Artefactual cannot guarantee that all code shared with us will be accepted and merged into a public release. We have only recently started to receive large feature-based pull requests from the community, and so far in general, many of these do not follow our coding standards and development guidelines, are often developed against old versions or branches of the application, are presented to us as one huge commit rather than a series of atomic commits that can be easily reviewed and understood, and/or are submitted by users with no resources (e.g. time and budget) reserved to make changes based on our feedback and recommendations. This has unfortunately made it very difficult for us to accept these.
Merging publicly created features is unsponsored work for us, and we have in the past invested many hours of staff time into reviewing features and functionality and preparing feedback, only to never hear from the developers again. If we accept a feature and merge it into the public release, that means that we are taking on its maintenance going forward through successive versions. This also means we will need to ensure that future development works with the new functionality, that we can provide basic support via our user forum as well as support for our paid clients, that we will invest further time in preparing and then maintaining documentation, that the module can be translated so we can coordinate with our international community to translate the feature, and so forth. All of this represents a huge amount of effort and staff time on Artefactual's part, with no recompense from our community. Because of this, we are cautious in what features we feel able to merge into the public project. If you are curious to learn more about how Artefactual maintains AtoM currently, I encourage you to review the following wiki page:
For us to be able to merge large feature-based pull requests, it is therefore imperative that the code be in a state that our team can maintain - that is, it follows our standards and development patterns; it reuses existing functions, methods, and libraries whenever possible; it includes clarifying comments that help to explain the code for future developers - and all of the other aspects outlined in the Community Development Recommendations linked above. Any time we are offered code, we have to consider:
- Is this something we have heard many different community members ask for? Will there be uptake to make this worth maintaining?
- Will this benefit the entire AtoM community, or just a small part?
- Can the feature be turned off for those in our community who don't wish to use it? Will it change what a default new installation looks like?
- Have upgrades from the current version to a new version that includes this feature been considered in the development? Are there database migration schemas in place?
- How much time will this require for review, fixes/revisions, documentation, forum support?
- How much work will this be to maintain through future versions?
- Can Artefactual afford to undertake the work required without support?
And so forth.
Given our recent experiences, we are now requiring that, for large, feature-based pull requests to the public project, the original community development team include budget for Artefactual to undertake the following:
- Code review
- Merging and code conflict resolution where needed
- Testing
- Documentation
The original developer(s) should also budget time and resources to be able to respond to and implement feedback that we provide, or budget for Artefactual to make the necessary changes.
One alternative is to continue to maintain the feature yourself as a fork, so other users can access it and implement it as desired. If there is a lot of uptake, this will likely make it much easier for more community members to share costs associated with the work of merging it into the public release in the future, should that remain desirable. In such cases, we are happy to list the available features on our wiki (as with these features), and as we encounter interest, we will continue to encourage our community to help us sponsor the addition of these features to a public release.
Regards,