Pdf does not equal "text"! Pdf is a complex format in which, more often than not, the human-visible "text" is actually just one or more pictures rather than rendered glyphs:
https://en.m.wikipedia.org/wiki/Glyph
Your line IMPLIES that the pdf(s) you struggle with are generated by/via tesseract. Lacking information, this is what I assume, for now.
OCR is complex machinery, and the first order of business when diagnosing complex machinery is to reduce the scope of the error. For that, and for anyone to have a chance of assisting you, you need to check and reduce.
Check: no human needs actual machine text to "read" (that is: view on screen or on printed paper) pdf content. We look at pictures, and that is what "pdf readers" produce; the exception is specialized software for blind and otherwise visually impaired people.
You mention "search" as the problem area, and search DOES require machine text rather than mere pictures. So first you must find out whether the OCR process actually produces "text" and, if so, what that text actually IS. Pdf viewers do not show "text overlays" by default, so you need specialized tools to uncover the text inside the pdf or, much easier, you change the OCR output format.
For that it is strongly advised (I'd say mandatory) to adjust your OCR process so that it produces hOCR format, which is a kind of augmented HTML: you can open such a file in notepad and read the raw content. Some of us are okay with plain TEXT output because it is the simplest format, but it drops information that is available in hOCR and thus hides several types of problem. Hence my advice: find out how to produce hOCR directly from tesseract.
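A minimal sketch of such a run. The file names (page.png, page) are placeholders, the helper count_hocr_words is my own and not part of tesseract, and I assume your tesseract install ships the standard hocr config file:

```shell
#!/bin/sh
# Placeholder names: page.png is your page image, "page" the output base name.
input="page.png"
outbase="page"

# Count recognized words in an hOCR file; tesseract marks each word
# with class 'ocrx_word'. (Helper written for this example.)
count_hocr_words() {
    grep -o "ocrx_word" "$1" | wc -l | tr -d ' '
}

if command -v tesseract >/dev/null 2>&1 && [ -f "$input" ]; then
    # The trailing "hocr" names a config file; the result is page.hocr.
    tesseract "$input" "$outbase" hocr
    echo "words found: $(count_hocr_words "$outbase.hocr")"
else
    echo "tesseract or $input missing; nothing was run" >&2
fi
```

A word count of zero, or words that look nothing like the page, already tells you the trouble sits in the image or in the OCR step rather than in pdf generation.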
Reduce:
To enable anyone to assist, you must reduce (boil down) the issue to tesseract itself, in a structure and mini process that makes it potentially reproducible. Along the way you may find that your issue is not tesseract related but located elsewhere in your process/pipeline. Here we'll assume your issue is with tesseract or its immediate surroundings.
Required action
Here's what you need to do. Everyone has to, because there is a plethora of processes around, before, after and on top of tesseract out there, and those only make things easy as long as everything goes exactly as advertised. You, on the other hand, have an issue, so you will have to divide and conquer, i.e. reduce your problem area, or you will never discover where the problem originates. Reduce your (OCR) process to this and report:
>>>>>>>>>>> (Checklist)
- you use the tesseract CLI (aka "tesseract executable/binary with its command line interface"); this is not a python script or anything else "script"-ish. You execute tesseract directly in bash/cmd and note the precise command line (tesseract + argument set). Anyone else out there needs this command line too, to reproduce your issue and help diagnose & fix it.
- you feed tesseract a (page) image, preferably PNG format. If your original source is jpeg, use the jpeg.
- your tesseract command line is such that tesseract outputs hOCR format (my preference) or plain text. This already empowers you to diagnose the issue more deeply yourself, as you can easily check whether tesseract produces the desired/expected output or something else, which is useful to know while looking for the root cause.
In your particular case, with the minimal information handed over, three main sources of problems are to be expected, and reduction must be applied to discover which of these is yours:
1. errors in the pdf text embedding process (part of the OCR postprocess): failure to embed text in the pdf correctly and compatibly
2. failure to produce a page image that is ready for OCR by tesseract (the OCR preprocess); lots of issues are due to this
3. unexpected/faulty OCR results for the given input image (the OCR process itself: tesseract)
- for reporting, anyone will need your tesseract command line, the input image(s) used and the results you get (error + info console output; output text/file(s)), plus the tesseract version/build info, which can, for example, be obtained by running
tesseract -v
<<<<<<<<<(Checklist ends)
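The reporting part of the checklist can be collected in one go. This is only a sketch: the file names (page.png, page, report.txt) are placeholders and log_run is a helper I made up for the example, not a tesseract feature:

```shell
#!/bin/sh
# Placeholder names: page.png is your input image, "page" the output base name.
input="page.png"
outbase="page"
report="report.txt"

# Run a command, recording the exact command line, all console output
# and the exit status in the report. (Helper written for this example.)
log_run() {
    printf '$ %s\n' "$*" >> "$report"
    status=0
    "$@" >> "$report" 2>&1 || status=$?
    echo "(exit status: $status)" >> "$report"
}

: > "$report"                      # start with an empty report file
if command -v tesseract >/dev/null 2>&1; then
    log_run tesseract -v           # version/build info
    log_run tesseract "$input" "$outbase" hocr
else
    echo "tesseract not found on PATH" >> "$report"
fi
```

Post report.txt together with the input image and the resulting page.hocr; that is the set anyone needs to try to reproduce the issue.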
:-S To have a pdf indexed and searchable by Google, you need to publish the pdf online, and the Google index bot must then find and access it; that is a nontrivial process, so I wonder... Besides, once Google gets to your pdf, it may well run its own OCR process internally before indexing the content, which makes this a non-starter for diagnosing your own process/pipeline. At the very least, this is way off into the postprocessing pipeline and definitely not instantaneous for anyone; the timing of Google indexing is arbitrary.
This also indicates that you might want to seek additional, local technical support while diagnosing your issue.
As I stated near the beginning: these are pdf viewers, and they are happy to show you page scans or any other picture format/potpourri in your pdf, next to possible text glyphs. Pdf is a very complex format: you don't need machine text to show text, and "text overlays" are not shown on screen or in print.
Meanwhile, SEARCHING in a pdf requires TEXT (machine text) plus pdf search permissions (pdfs can be "secured" against search, copy-paste, etc. to complicate those pdf text search issues even further).
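Both conditions can be checked locally. The sketch below assumes the poppler command line tools (pdftotext, pdfinfo) are installed, which is my assumption and has nothing to do with tesseract; scan.pdf is a placeholder name and has_text is a helper written for the example:

```shell
#!/bin/sh
# Placeholder name: scan.pdf is the pdf that fails to search.
pdf="scan.pdf"

# Succeeds if its stdin holds at least one non-whitespace character.
# (Helper written for this example.)
has_text() {
    grep -q '[^[:space:]]'
}

if command -v pdftotext >/dev/null 2>&1 && [ -f "$pdf" ]; then
    # "-" sends the extracted text to stdout instead of a file.
    if pdftotext "$pdf" - | has_text; then
        echo "machine text found"
    else
        echo "no machine text extracted"
    fi
    # The "Encrypted:" line lists permissions such as copy:no.
    pdfinfo "$pdf" | grep '^Encrypted' || true
else
    echo "pdftotext or $pdf missing; nothing was run" >&2
fi
```

If no machine text comes out, there is nothing for a search to find, whatever the viewer shows on screen.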
Hence the advice to REDUCE your problem surface area; currently, also due to the minimal provided information, it is... without bounds.