The problem is as follows: We're trying to publish information
containing personal details as PDF. Publishing this information as PDF
is a requirement. However, due to the personal nature of the
information, we would like to disallow people from indexing and
searching this information.
Is there a way to make this PDF unsearchable? Editing robots.txt
works for well-behaved indexers, but we'd also like to keep
badly behaved indexers from easily indexing the document.
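For reference, the robots.txt part can be checked locally. The following is only a minimal sketch using Python's standard urllib.robotparser; the /reports/ path, the example.org URLs and the "ExampleBot" agent name are assumptions made up for illustration, not details from our setup. It shows what a well-behaved indexer would conclude, and nothing more: a crawler that ignores robots.txt is not affected by it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the /reports/ directory is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /reports/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a rule-abiding crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The PDF lives under the disallowed directory, the home page does not.
    for url in ("https://example.org/reports/list.pdf",
                "https://example.org/index.html"):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```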
We realise that any human-readable solution is probably susceptible to
being OCR'ed back into text after downloading. We're currently
considering whether or not this is acceptable.
One option we've come up with is converting the PDF-text to PDF-image.
One major drawback may be the increase in file size. The files are
probably not very large (several pages at most) so this could be ok.
Are there other options to do this? Or are we looking in the wrong
direction?
Regards,
Tsjok-Wing Man
Best Regards,
Paulo Soares
You could convert glyphs to outlines in the PDF producer application. Most
DTP apps can do this.
> One option we've come up with is converting the PDF-text to PDF-image.
> One major drawback may be the increase in file size. The files are
> probably not very large (several pages at most) so this could be ok.
The solution above shouldn't increase the size and will probably even
decrease it if you were embedding the fonts before.
If the outlines were not altered, they too would still be readable
from OCR applications. If the text paths were shortened to individual
words, then these paths modified from true horizontal to slightly
curved, or possibly angled a few degrees off true horizontal, this
would overcome OCR readability.
You might instead try changing the stroke of the text outlines so
that they are not crisp and clean (but still legible).
Either would be tedious, but defeating automated leeching is a
somewhat overwhelming task.
jbl
j> If the outlines were not altered, they too would still be readable
j> from OCR applications. If the text paths were shortened to individual
j> words, then these paths modified from true horizontal to slightly
j> curved, or possibly angled a few degrees off true horizontal, this
j> would overcome OCR readability.
OCR applications easily handle angled or curved text. They have been doing
this for over 5 years now.
With best regards,
Eugene Mayevski
Scan at 300 dpi, directly into Acrobat. Plain text stored with CCITT
Group 4 compression will not produce unmanageable file sizes. Avoid halftone
images or screens in the file, otherwise the files will get big.
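To put rough numbers on this, here is a small arithmetic sketch of the uncompressed size of a scanned page. The US Letter page size is an assumption (the thread does not say which paper size is used), and the function name is made up for the example. CCITT Group 4 only applies to 1-bit (black and white) data, and how much it shrinks a text page depends on the content, so no compressed size is claimed here.

```python
def raw_page_bytes(width_in: float, height_in: float, dpi: int,
                   bits_per_pixel: int = 1) -> int:
    """Uncompressed size of one scanned page, with rows padded to whole bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = (width_px * bits_per_pixel + 7) // 8
    return bytes_per_row * height_px


if __name__ == "__main__":
    # Assumed page size: US Letter, 8.5 x 11 inches, scanned at 300 dpi.
    bilevel = raw_page_bytes(8.5, 11, 300, bits_per_pixel=1)
    grey = raw_page_bytes(8.5, 11, 300, bits_per_pixel=8)
    print(bilevel)  # 1052700 bytes, about 1 MB before compression
    print(grey)     # 8415000 bytes, about 8.4 MB before compression
```

The eightfold gap between the two figures is why halftones and screens, which force greyscale storage, make the files grow.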
"TeeWee" <tsjok.w...@gmail.com> wrote in message
news:1133261946.7...@g14g2000cwa.googlegroups.com...
I am using the current version of Abbyy Finereader, which up to this
moment, I was under the impression was about the best OCR application
out there.
I can (have) create(ed) a PDF document with text that is clearly
visible and that is readable at a glance. All text on a flat
horizontal path.
Not one of the text characters can be accurately read by Abbyy
Finereader.
What OCR are you using? I might have to take a look at it.
Thanks
jbl
This is surprising. Can/would you make this un-ocr'able file
available somewhere for others to try?
--
Jukka....@iki.fi
* "... the fact that what you seem to be saying is stupid is no
evidence that it's not what you meant" -- D. Ullrich in comp.theory
>jbl <jbl...@spamhotmail.com> writes:
>>I can (have) create(ed) a PDF document with text that is clearly
>>visible and that is readable at a glance. All text on a flat
>>horizontal path.
>>
>>Not one of the text characters can be accurately read by Abbyy
>>Finereader.
>
>This is surprising. Can/would you make this un-ocr'able file
>available somewhere for others to try?
Gladly, but alas, I don't have access to a site that I can upload to.
Possibly someone knows of a site.
j> I can (have) create(ed) a PDF document with text that is clearly
j> visible and that is readable at a glance. All text on a flat
j> horizontal path.
Really strange - I used Finereader 5-6 years ago and by that time it was
capable of handling various texts. It would really be nice to have a try
with your file. Maybe you can create a small (tiny) PDF and attach it to the
message?
Rather than post a binary file here, I emailed it to the address in
your subject header.
>
> OCR's easily handle angled or curved texts. They've been doing this for over
> 5 years now.
>
Out of curiosity: suppose one were to switch font family, size,
slant, color, case, etc. at random, almost letter by letter.
Humans could still read it (with some difficulty). Would OCR
applications find their job much harder?
Font family, size, color, case would not matter at all.
Slant however would. You can slant the characters enough (quite a bit
more than "Italic", or even backward) so that OCR will not read them
but they will still be readable by the viewer.
Sort of tough to automate this process but it will work.
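As a rough illustration of the slant idea, a slant is a horizontal shear applied to the outline points of each character. The sketch below assumes the glyphs have already been converted to outlines and are available as lists of (x, y) points; the function name and data layout are made up for the example and do not come from any particular PDF tool.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def slant_outline(points: List[Point], angle_deg: float,
                  baseline_y: float = 0.0) -> List[Point]:
    """Shear outline points horizontally about the baseline.

    Positive angles lean the glyph to the right (like italics),
    negative angles lean it backward. Points on the baseline stay put.
    """
    k = math.tan(math.radians(angle_deg))
    return [(x + k * (y - baseline_y), y) for x, y in points]


if __name__ == "__main__":
    # A vertical stem 10 units tall, slanted backward by 35 degrees.
    stem = [(0.0, 0.0), (1.0, 0.0), (1.0, 10.0), (0.0, 10.0)]
    for point in slant_outline(stem, -35.0):
        print(point)
```

Which angle actually defeats a given OCR program would have to be found by trial; the 35 degrees here is only a placeholder.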
jbl
The irrelevance of font family surprises me a bit -- I would have
thought an OCR algorithm examining a sentence or paragraph would have
tried to converge down on (or adapt to) a most likely font (or at least,
consistent set of letter shapes) for that block of text.
So, how about throwing in some carefully designed "garbage" noise? --
lines extending out of the side of characters, spots around (or inside)
them, in ways that humans would filter out but OCR algorithms would have
to try to process? Plus, having every instance of each letter (the
letter "b" for example) be a different shape, so the algorithm has to
identify and track multiple different instances of "a", "b", "c", etc.?
Possible to throw sand in the OCR gears that way?
Keeping in mind the original poster's comments, and assuming that the
documents are supposed to end up somewhat "professional looking",
the object is to obfuscate the text enough to render OCR useless
while still being easily readable by the viewer.
Yes, it can be done.
It is hard to automate without creating a new font set
(remapping fonts would only prevent copy-paste).
Yes, there are many ways to do it.
Well, there are several free www hosting sites (a Google search by
"free www host" will find them), but perhaps you could just mail me
the file and I can make it available on my page (if it's not too huge).
jbl has kindly e-mailed me this file, and I have made it available
on his behalf at <http://www.cs.helsinki.fi/u/kohonen/hard-to-ocr.pdf>.
I tried it with Abbyy FineReader 8.0 (trial version, default settings)
and indeed none of the characters were recognized.
Now, it is true that the text is on a flat horizontal path, and is
readable at a glance, but with each letter almost 45 degree
left-rotated, I guess many people would not particularly _like_ to
read such text...
>jbl <jbl...@spamhotmail.com> writes:
>>I can (have) create(ed) a PDF document with text that is clearly
>>visible and that is readable at a glance. All text on a flat
>>horizontal path.
>>
>>Not one of the text characters can be accurately read by Abbyy
>>Finereader.
>
>jbl has kindly e-mailed me this file, and I have made it available
>on his behalf at <http://www.cs.helsinki.fi/u/kohonen/hard-to-ocr.pdf>.
>
>I tried it with Abbyy FineReader 8.0 (trial version, default settings)
>and indeed none of the characters were recognized.
>
>Now, it is true that the text is on a flat horizontal path, and is
>readable at a glance, but with each letter almost 45 degree
>left-rotated, I guess many people would not particularly _like_ to
>read such text...
At the risk of continuing this thread and making it tediously long:
The text in the sample mentioned is rotated 30 degrees, not 45. It was
quickly done and intended as an example only. I have tested it with
less rotation and it works as well. I don't know at what point Abbyy
will begin to recognize it (22 degrees? 18 degrees?); I didn't spend
that much time on it.
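For anyone who wants to experiment with other angles, the per-letter rotation can be sketched as follows. This is not how the sample file was made; it is only an assumed approach in which each glyph outline, given as a list of (x, y) points in a y-up coordinate system, is turned about the centre of its own bounding box so that the line of text itself stays on a flat horizontal path.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def rotate_glyph(points: List[Point], angle_deg: float) -> List[Point]:
    """Rotate one glyph counter-clockwise about its bounding-box centre."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    cx = (min(xs) + max(xs)) / 2.0
    cy = (min(ys) + max(ys)) / 2.0
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return [(cx + (x - cx) * c - (y - cy) * s,
             cy + (x - cx) * s + (y - cy) * c) for x, y in points]


if __name__ == "__main__":
    # Two hypothetical glyph boxes on the same baseline, each turned 30 degrees left.
    glyphs = [
        [(0.0, 0.0), (6.0, 0.0), (6.0, 10.0), (0.0, 10.0)],
        [(8.0, 0.0), (14.0, 0.0), (14.0, 10.0), (8.0, 10.0)],
    ]
    for glyph in glyphs:
        print(rotate_glyph(glyph, 30.0))
```

Because every glyph turns about its own centre, the centres stay where they were and only the letter shapes are tilted.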
This is only one of many ways to obfuscate the text so that it is not
recognized, and it is not necessarily intended to be "THE WAY". The object
was to show how to distort the text simply and quickly while keeping it
readable and not too distorted.
> Keeping in mind the original posters comments and assuming that the
> documents are supposed to end up somewhat "professional looking".
> The object is to obfuscate the text enough to render OCR useless,
> while still being easily readable by the viewer.
Actually, making it unindexable was the main issue; the OCR issue is
more of a side issue for us right now. A professional look is indeed a
requirement.
Thank you for your answers in any case.
Tsjok