AtoM OCR search.


cha...@piql.com.br

Jul 4, 2018, 8:08:25 AM
to AtoM Users
Hi there.
 
I've been uploading PDFs that already contain an OCR text layer to AtoM, but a lot of the text in that layer returns no results when I search for it in AtoM.

Example:

If I search in AtoM for "MyName", it doesn't find anything, but if I open the file containing "MyName" and search it with Adobe Reader, it finds the string.

Is that normal or some kind of bug?

Thanks.
CC.

Dan Gillean

Jul 4, 2018, 1:50:15 PM
to ICA-AtoM Users
Hi Charles, 

I did a couple of quick tests in our public demo site, and could get results.

I don't know how Adobe Reader's search works, but since it's designed to work with text layers in PDFs, it may be optimized to handle OCR irregularities in ways that AtoM is not. I would suggest copying the target text and pasting it into a document or browser search box, so you can see how it is actually transcribed. It's possible there are minor variations (e.g. the OCR text layer actually says "M yNa me" rather than "MyName"), and Adobe's search algorithm could be "fuzzy" enough to still find a match where AtoM's is not.
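
The kind of mismatch described above can also be checked mechanically. The following is a minimal Python sketch, not part of AtoM: the function name `ocr_variants` is hypothetical, and it assumes you have already copied the PDF's text layer into a string. It looks for the search term while tolerating stray whitespace between characters, and returns each hit exactly as it appears in the text layer.

```python
import re


def ocr_variants(term: str, layer_text: str) -> list[str]:
    """Return every occurrence of `term` in `layer_text`, allowing
    arbitrary whitespace between the term's characters (case-insensitive)."""
    # Drop whitespace from the term and escape each remaining character.
    chars = [re.escape(c) for c in term if not c.isspace()]
    if not chars:
        return []
    # Allow optional whitespace between any two consecutive characters.
    pattern = r"\s*".join(chars)
    return [m.group(0) for m in re.finditer(pattern, layer_text, re.IGNORECASE)]


if __name__ == "__main__":
    # Hypothetical text layer with the spacing problem described above.
    print(ocr_variants("MyName", "Signed by M yNa me in 1950"))
    # prints ['M yNa me']
```

If the function returns something other than the term you typed, that returned string is what an exact search would need to match. This only covers spacing errors; substituted characters (such as "vilizution" for "vilization") would need a fuzzier comparison.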

If you are using AtoM 2.4, then you might also try using the dedicated digital object text search, available in the Advanced search panel, so you know you are targeting the correct field.



I'd suggest copying the text directly from the uploaded document in AtoM, and then pasting it into the advanced search field, so you know you are grabbing the text exactly as it is rendered in the text layer that AtoM has indexed. 

For example, in this document found in the public AtoM demo site, I copied part of the text that reads: civilization which abound

However, the OCR is bad, so the text layer when copied actually says:  c i vilizution ~l ch abound

Now, the ~ character is a reserved character in Elasticsearch (AtoM's search index), so I need to add a backslash before it to escape it; otherwise I will get an error when searching. So I actually need to search for:  c i vilizution \~l ch abound
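
Escaping by hand gets tedious when the copied text contains several reserved characters. The following Python sketch shows one way to prepare a string before pasting it into the search box. It is an illustration only: the function name `escape_query` is hypothetical, and it assumes the search box passes input through Elasticsearch's query string syntax, whose documented reserved characters are `+ - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ /`. Since `<` and `>` cannot be escaped in that syntax, the sketch removes them.

```python
import re

# Reserved characters and operators of the Elasticsearch query string syntax
# (excluding < and >, which cannot be escaped and are removed instead).
_RESERVED = re.compile(r'([+\-=!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_query(text: str) -> str:
    """Backslash-escape reserved characters so they are searched literally."""
    # < and > have no escaped form, so drop them entirely.
    text = text.replace("<", "").replace(">", "")
    # Put a single backslash in front of each reserved character or operator.
    return _RESERVED.sub(r"\\\1", text)


if __name__ == "__main__":
    print(escape_query("c i vilizution ~l ch abound"))
    # prints: c i vilizution \~l ch abound
```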

When I do so, and use the Digital object text filter, I get the result I want.

Some other possibilities to consider: 
  • Is the related description published? If it is in draft mode, are you sure you are logged in when searching? A public user will not find matching results for a draft description.
  • Are you searching in the same user interface language in which the related description was created? There are some known issues with multilingual content and culture fallback, so if (for example) your description was created in the Brazilian Portuguese interface but you are searching in the English user interface, this might cause issues.
  • There can also be issues when the default installation culture is different from the user interface language the description was created in.
  • Finally, I believe there is a character limit on this field in the database, so if your uploaded text document is very large, it's possible that some of the OCR text has been truncated.
Regards, 



Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory

--
You received this message because you are subscribed to the Google Groups "AtoM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ica-atom-users+unsubscribe@googlegroups.com.
To post to this group, send email to ica-atom-users@googlegroups.com.
Visit this group at https://groups.google.com/group/ica-atom-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/ica-atom-users/fa8a3bb0-0d0e-4b3c-9569-48eed4066d37%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Charles Carminati

Jul 4, 2018, 2:04:35 PM
to ica-ato...@googlegroups.com
Hi There and thanks for your response.

The problem is due to the character limit on the field (64 KB). Changing the field from text to mediumtext should solve the problem, but it doesn't.

Running sudo php symfony search:populate repopulates the search index with the OCR text, but it is still incomplete...
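
For reference, a MySQL TEXT column holds at most 65,535 bytes and a MEDIUMTEXT column at most 16,777,215 bytes, and both limits count bytes, not characters. The Python sketch below shows one way to check whether a document's extracted text would fit. The function names are hypothetical, and it assumes the text is stored as UTF-8, so accented characters take more than one byte each.

```python
MYSQL_TEXT_LIMIT = 65_535            # maximum bytes in a MySQL TEXT column
MYSQL_MEDIUMTEXT_LIMIT = 16_777_215  # maximum bytes in a MySQL MEDIUMTEXT column


def fits_in_column(text: str, limit: int = MYSQL_TEXT_LIMIT) -> bool:
    """Return True if the UTF-8 encoding of `text` fits within `limit` bytes."""
    return len(text.encode("utf-8")) <= limit


def bytes_over_limit(text: str, limit: int = MYSQL_TEXT_LIMIT) -> int:
    """Return how many bytes of `text` would be cut off, or 0 if it fits."""
    return max(0, len(text.encode("utf-8")) - limit)


if __name__ == "__main__":
    # 40,000 accented characters are 80,000 bytes in UTF-8: too big for TEXT.
    sample = "é" * 40_000
    print(fits_in_column(sample))                          # prints False
    print(fits_in_column(sample, MYSQL_MEDIUMTEXT_LIMIT))  # prints True
    print(bytes_over_limit(sample))                        # prints 14465
```

If the text does not fit, any search term that occurs only in the cut-off portion cannot match, regardless of how the index is rebuilt.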






--

Charles Carminati
cha...@piql.com.br 

+5548 984643090

Av. João Cabral de Melo Neto, 850- sala 1005 bl 3- Barra da Tijuca- Rio de Janeiro- RJ

CEP: 22.775-057

www.piql.com.br

Dan Gillean

Jul 7, 2018, 4:56:34 AM
to ICA-AtoM Users
Hi Charles, 

I would also suggest clearing the application cache, as well as restarting PHP-FPM and memcached, and perhaps testing in an incognito browser window (or at least making sure your browser cache is flushed). 

Otherwise, I'm not sure. I will ask our developers if they have further suggestions. 

Regards, 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory





Dan Gillean

Jul 7, 2018, 5:08:18 AM
to ICA-AtoM Users
Hi again Charles, 

Also, I realized that re-indexing alone might not have solved the issue, because the text is extracted at upload time. However, there is a command-line task you can try that will re-extract the digital object text layer.

I would clear caches, restart services, and re-index your site again after running this task. 

Cheers, 

Dan Gillean, MAS, MLIS
AtoM Program Manager
Artefactual Systems, Inc.
604-527-2056
@accesstomemory
