fulltext indexing PDF files in Google search


Jan Skůpa

Sep 24, 2021, 7:19:04 AM
to DSpace Community
Hi,
I found that most of the PDFs in our DSpace (5.3) are not full-text searchable via Google. The records are indexed, but phrases from inside the PDFs are not found. Is it possible that there is a bug in the settings somewhere? Should this work? Thanks!

Tim Donohue

Sep 24, 2021, 11:09:05 AM
to Jan Skůpa, DSpace Community
Hi Jan,

If the record is already being indexed by Google, then Google should be aware of the PDF as well, and there's not much DSpace can do to force Google to full-text index it. That said, it's worth noting there are two main types of PDFs, and only one of them is easily indexed:
  • PDFs created from digital files or OCRed images. These PDFs have embedded text and are more easily full-text indexed.
  • PDFs created from scanned files (without OCR). These are image-based PDFs with no embedded text, and they often cannot be full-text indexed unless the system that fetches the PDF can OCR it reliably and automatically.
So, if the PDFs you are talking about were created from scanned images, make sure to OCR them so that they are easier to index.
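
To tell the two types of PDF apart in bulk, something like the following rough sketch could help. It is not part of DSpace. It assumes the third-party pypdf library is installed for the extraction step, and the function names and the 20-character threshold are made up for illustration. A page that yields almost no extractable text is probably a scan without OCR.

```python
from typing import Iterable, List


def classify_pages(page_texts: Iterable[str], min_chars: int = 20) -> List[bool]:
    """Return one flag per page: True if the page yields at least
    min_chars non-whitespace characters of extracted text."""
    flags = []
    for text in page_texts:
        visible = "".join((text or "").split())  # drop all whitespace
        flags.append(len(visible) >= min_chars)
    return flags


def looks_image_only(page_texts: Iterable[str], min_chars: int = 20) -> bool:
    """True when no page yields meaningful text (likely a scan without OCR)."""
    return not any(classify_pages(page_texts, min_chars))


def extract_page_texts(path: str) -> List[str]:
    """Extract text page by page using pypdf (assumed to be installed)."""
    from pypdf import PdfReader  # lazy import: the helpers above work without it

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]
```

For a single file you would call `looks_image_only(extract_page_texts("some.pdf"))`, where the file name is a placeholder. The threshold is a heuristic, so spot-check a few results by hand before trusting it.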

DSpace provides some other hints/tips about Search Engine Optimization here which you may want to review for your repository: https://wiki.lyrasis.org/display/DSDOC5x/Search+Engine+Optimization
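
One thing that may be worth ruling out (an assumption on my part, not a diagnosis of your site) is that robots.txt blocks crawlers from the PDF download paths. Here is a minimal sketch using Python's standard urllib.robotparser. The rules and the URL below are hypothetical examples, not your repository's actual configuration, and the standard-library parser is not guaranteed to match Googlebot's rule handling in every detail.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Check whether the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: the first set blocks bitstream downloads, the second does not.
blocking_rules = """\
User-agent: *
Disallow: /bitstream
"""
open_rules = """\
User-agent: *
Disallow: /discover
"""

# Hypothetical PDF download URL on an example host.
pdf_url = "https://repository.example.org/bitstream/handle/123456789/1/thesis.pdf"

print(can_crawl(blocking_rules, pdf_url))
print(can_crawl(open_rules, pdf_url))
```

To test a real site, paste the contents of its robots.txt into the first argument and use the URL of one of the PDFs that Google is not finding.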

If you have other questions let us know on this list.

Tim



Jan Skůpa

Sep 29, 2021, 3:47:39 AM
to DSpace Community
Hi Tim,
thank you for your answer.
Nearly all of the PDFs (99% for sure) are born-digital files, so that isn't the problem. I already know the SEO page, but it didn't help...
