The page in that PDF has its content defined in three objects, and one of them claims to be Flate-compressed. However, I believe the compression is broken.
If I open the PDF in Firefox (which uses pdf.js), I get console warnings that the zlib-compressed data has invalid check bits in the header:
You can see that some text still renders though. pdf.js seems to just skip over the broken stream and attempt to render the page anyway.
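For anyone curious what "invalid check bits" means: per RFC 1950, a zlib stream starts with two bytes (CMF and FLG), and the 16-bit value CMF*256 + FLG must be a multiple of 31. Here's a rough sketch of that check in Ruby. `zlib_header_valid?` is a hypothetical helper I made up for illustration; it isn't part of pdf-reader or pdf.js.

```ruby
# Hypothetical helper (not part of pdf-reader): checks the two-byte
# zlib header described in RFC 1950.
def zlib_header_valid?(data)
  return false if data.nil? || data.bytesize < 2

  cmf, flg = data.byteslice(0, 2).bytes
  compression_method = cmf & 0x0F

  # The compression method must be 8 (deflate), and the header read as a
  # big-endian 16-bit integer must be a multiple of 31.
  compression_method == 8 && (((cmf << 8) | flg) % 31).zero?
end
```

A stream that fails this check may still contain usable deflate data after the header, or it may be garbage; the check alone can't tell you which.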
I tried a similar approach in pdf-reader, and it works for your file. I made this code change:
diff --git a/lib/pdf/reader/filter/flate.rb b/lib/pdf/reader/filter/flate.rb
index 2489757..aefbc45 100644
--- a/lib/pdf/reader/filter/flate.rb
+++ b/lib/pdf/reader/filter/flate.rb
@@ -32,8 +34,9 @@ class PDF::Reader
Depredict.new(@options).filter(deflated)
rescue Exception => e
# Oops, there was a problem inflating the stream
- raise MalformedPDFError,
- "Error occured while inflating a compressed stream (#{e.class.to_s}: #{e.to_s})"
+ #raise MalformedPDFError,
+ # "Error occured while inflating a compressed stream (#{e.class.to_s}: #{e.to_s})"
+ return ""
end
end
end
... and the text is extracted:
$ ruby -Ilib bin/pdf_text sample\(1\).pdf
DocuSign Envelope ID: 45BFED27-0910-4248-8030-C853B0DE0248
「辻」と「辻」
I'm not sure if that's a general approach I want to commit though. It may not work as well on other PDFs.
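One way to keep the current behaviour as the default would be to make the lenient path opt-in. This is only a minimal sketch of the idea, not what pdf-reader does today: `inflate_leniently` and its `strict:` keyword are hypothetical names, and it rescues only `Zlib::Error` rather than every `Exception`.

```ruby
require "zlib"

# Hypothetical helper (not pdf-reader's API): inflate a Flate stream,
# optionally tolerating corrupt data by returning an empty string.
def inflate_leniently(data, strict: false)
  Zlib::Inflate.inflate(data)
rescue Zlib::Error => e
  # In strict mode, re-raise so callers still see the failure.
  raise if strict

  # Otherwise note the problem on stderr and treat the stream as empty.
  warn "skipping unreadable Flate stream (#{e.class}: #{e.message})"
  ""
end
```

With something like this, `inflate_leniently(bytes)` would return `""` for a broken stream, while `inflate_leniently(bytes, strict: true)` would still raise. Whether silently dropping a content stream is acceptable probably depends on the caller.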
James