Fwd: [litsupport] Recommendation for searching PDF's

Troy Howard

Jun 3, 2010, 11:44:42 PM

to op...@googlegroups.com

From: Caleb Hailey <caleb...@gmail.com>
Sent: Jun 3, 2010 7:54 AM

Good description of a product our industry needs badly (a component that creates searchable PDFs). DE hasn't been interested in doing this - should we make this an OpSED project? If we eventually got something good working, it would be a good lure to our site (and we could eventually charge for it).

Should we start an open source PDF OCR utility?

Thoughts?

- Caleb

---------- Forwarded message ----------
From: "deusnerd" <ddeu...@babc.com>
Date: Jun 3, 2010 7:31 AM
Subject: [litsupport] Recommendation for searching PDF's
To: <litsu...@yahoogroups.com>

We need software that will go through folders of natives, find PDFs, and OCR them if they are not already in searchable format. (Total PDF Converter has a product that will do this, I think, but I'm looking for something less expensive.)
Additionally, I'm looking for software that will process natives to TIF/PDF, brand Bates numbers, and create load files (we currently have eScanIt and it isn't working well).

Your input and recommendations are much appreciated.

Best,
David

Troy Howard

Jun 3, 2010, 11:46:06 PM

to op...@googlegroups.com

From: Kyle Jones <ky...@bucebuce.com>

Sent: Jun 3, 2010 8:31 AM

There seems to be a pretty good set of open source tools which could be combined for something like this. Well, okay, "good" might not be the right word... semi-working may be better. But something to piece together.

My only concern is that (I think?) the goal of OpSED is standards/reference implementations/knowledge base rather than specific application development. Admittedly, they overlap a little bit, but maybe something like this would do better as a separate project ("supported" by OpSED, but under a different banner?)

- Kyle

Troy Howard

Jun 3, 2010, 11:48:15 PM

to op...@googlegroups.com

From: Troy Howard <thow...@gmail.com>
Sent: Jun 3, 2010 2:37 PM

I understand Kyle's concerns. OpSED could be getting too large in
scope already.

In fact, most projects of this nature divorce the standards body from
the reference implementations completely, and definitely the
applications from those reference implementations.

That said, the currently specced project Open Tools for eDiscovery
(http://opsed.org/projects/open-tools) is meant to cover exactly what
Caleb's describing. It would be a place to catalog existing tools,
or, where there is no existing open tool, hold software projects to
create those tools.

I kind of like how Apache deals with this. They have lots of projects,
and subprojects. They eat their own dog food. They make libraries
which are used by other Apache projects, and built into applications.
It's nice to know that a single org is managing not only the concept,
but the implementation and application of those concepts.

I don't think we should shy away from building applications/tools for
the community, and hosting them on our site, but maybe we should
present a clearer distinction between the more conceptual/theoretical
standards/process definition projects and the more
practical/implementation oriented projects to prevent confusion on the
website.

Thanks,
Troy

Aline Bernstein

Jun 4, 2010, 12:13:49 AM

to Open Standards for eDiscovery (OpSED)

I would hope that a utility to analyze a collection of documents and
report on their types and searchability would be in scope for OpSED,
but that actually doing the OCR would be out of scope. I'd like to
know that any OpSED utility I use yields unambiguous results, and OCR
quality depends on the engine used.

Ideally, the tool would help me pack up the nonsearchable PDFs to a
zip archive, which I would upload to a vendor who can OCR them faster
and cheaper than I can, and then help me overlay the new searchable
PDFs over the original files.
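
The pack-and-overlay round trip described above could be sketched
roughly as follows. This is only an illustration: the function names,
the manifest.json file, and the assumption that the vendor returns a
zip with the same relative paths are all hypothetical, not part of any
existing OpSED tool.

```python
import json
import shutil
import zipfile
from pathlib import Path


def pack_for_ocr(root: Path, pdfs: list[Path], archive: Path) -> None:
    """Zip the given PDFs (paths kept relative to root) plus a manifest."""
    manifest = [p.relative_to(root).as_posix() for p in pdfs]
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for rel in manifest:
            zf.write(root / rel, arcname=rel)
        # Hypothetical convention: the manifest travels with the files.
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))


def overlay_from_vendor(root: Path, returned: Path, backup: Path) -> list[str]:
    """Overlay the vendor's searchable PDFs onto the original locations.

    Assumes the returned zip uses the same relative paths and manifest.
    Each original is moved into backup first, so the swap can be audited.
    Returns the relative paths that were overlaid.
    """
    done = []
    with zipfile.ZipFile(returned) as zf:
        manifest = json.loads(zf.read("manifest.json"))
        names = set(zf.namelist())
        for rel in manifest:
            if rel not in names:
                continue  # vendor did not return this file; keep original
            target = root / rel
            saved = backup / rel
            saved.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(target), str(saved))
            target.write_bytes(zf.read(rel))
            done.append(rel)
    return done
```

Keeping the originals in a backup folder, rather than overwriting them,
is a design choice made here so that the overlay step stays reversible.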

This would give me control over the process without tying up my
machine resources or putting me in the position of explaining OCR
results.

On Jun 3, 8:44 pm, Troy Howard <thowar...@gmail.com> wrote:
> From: Caleb Hailey <calebhai...@gmail.com>

Troy Howard

Jun 4, 2010, 1:31:41 AM

to op...@googlegroups.com

Aline,

I like your perspective on that. It gets to the core of what OpSED is
about - defensibility.

One of the key benefits that tools written by OpSED would provide
to the legal industry is that they would be open source software.
Because the source code to the program is exposed, its processes and
the decisions inherent in those processes are available for inspection,
and transparent to the end user. It's provable, auditable and
defensible in a way that closed-source software is not.

With regard to the ambiguity of results, one thing to keep in mind --
almost all software programs provide unambiguous results. Given the
same set of input they will always produce the same output. That
includes OCR software. Though engines may seem to vary a lot in output
quality (measured by fidelity to the intent of the original text),
each is at least consistent in that a given engine would always
produce the same output, given the same input.
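
That repeatability can be checked rather than taken on faith. A
minimal sketch, assuming an engine can be wrapped as a plain function
from image bytes to text (the is_repeatable helper and that wrapper
shape are hypothetical):

```python
import hashlib
from typing import Callable


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest, used as a fingerprint of one OCR result."""
    return hashlib.sha256(data).hexdigest()


def is_repeatable(engine: Callable[[bytes], str],
                  page_image: bytes,
                  runs: int = 3) -> bool:
    """True if every run of engine on the same bytes yields identical text."""
    digests = {
        sha256_hex(engine(page_image).encode("utf-8"))
        for _ in range(runs)
    }
    return len(digests) == 1
```

Logging the digest next to each processed page would also give a
simple audit trail for later comparison.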

That said, I don't think that we would want to write an entire OCR
engine. There are already a number of existing open source OCR engines
out there that could be utilized by our tools. I think an important
design choice in a tool like this, though, would be to leave the
choice of OCR engine up to the user.
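
One way to keep the engine swappable is a small registry of named
engines. This is a sketch under assumptions: the register_engine and
ocr_page names are made up for illustration, and a real adapter would
wrap whichever existing engine the user picks.

```python
from typing import Callable, Dict

# An engine is any callable that maps image bytes to recognized text.
OcrEngine = Callable[[bytes], str]

_ENGINES: Dict[str, OcrEngine] = {}


def register_engine(name: str, engine: OcrEngine) -> None:
    """Make an engine adapter available under a user-selectable name."""
    _ENGINES[name] = engine


def ocr_page(image: bytes, engine_name: str) -> str:
    """Run the engine the user chose; fail loudly on an unknown name."""
    try:
        engine = _ENGINES[engine_name]
    except KeyError:
        raise ValueError(f"unknown OCR engine: {engine_name!r}") from None
    return engine(image)
```

The rest of the tool only ever calls ocr_page, so adding or replacing
an engine does not touch the scanning or packaging code.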

The process you've outlined -- a simple tool that would scan a
collection of documents, determine which ones needed to be OCR-ed,
then package them up for processing, is a great idea. One of the
reasons it's a great idea is that it keeps the scope of functionality
limited to only one part of the problem, which ultimately allows
flexibility for the end user.
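
A rough sketch of that first scanning tool follows. The searchability
test is a deliberately crude heuristic and an assumption on my part: it
only looks for a "/Font" marker in the raw bytes, which can miss fonts
stored in compressed object streams, so a real tool would want a proper
PDF parser plugged in through the needs_ocr parameter.

```python
from pathlib import Path
from typing import Callable, Dict, List


def looks_image_only(pdf: Path) -> bool:
    """Crude heuristic, NOT a real PDF parse: no '/Font' marker in the
    raw bytes suggests the file has no text layer."""
    return b"/Font" not in pdf.read_bytes()


def scan(root: Path,
         needs_ocr: Callable[[Path], bool] = looks_image_only
         ) -> Dict[str, List[Path]]:
    """Walk root and sort every file into one of three buckets."""
    report: Dict[str, List[Path]] = {
        "searchable": [], "needs_ocr": [], "not_pdf": [],
    }
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Identify PDFs by file signature rather than by extension.
        with path.open("rb") as fh:
            is_pdf = fh.read(5) == b"%PDF-"
        if not is_pdf:
            report["not_pdf"].append(path)
        elif needs_ocr(path):
            report["needs_ocr"].append(path)
        else:
            report["searchable"].append(path)
    return report
```

The "needs_ocr" list is what would be handed to the packaging step.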

If we built a tool which could then complete the process by OCR-ing
the packaged items and outputting searchable PDFs, and built them in
such a way that they could run standalone, or be wired together in a
single workflow (using something like a shell script) then we'd have
the best of both worlds.

I'd even take that second stage tool and break it down further, and
make two tools -- one which OCR-ed and created paginated text files,
and one which generated the searchable PDFs from the "Image Only" PDFs
and the OCR text created by the previous tool.
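
The hand-off between those two tools could be as simple as one text
file per page. A sketch, assuming a hypothetical naming scheme of
<doc_id>_p0001.txt (the zero padding keeps a plain sort in page order
for up to 9999 pages):

```python
from pathlib import Path
from typing import List


def write_paginated_text(doc_id: str, pages: List[str],
                         out_dir: Path) -> List[Path]:
    """Write one UTF-8 text file per page: <doc_id>_p0001.txt, ..."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, text in enumerate(pages, start=1):
        path = out_dir / f"{doc_id}_p{number:04d}.txt"
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written


def read_paginated_text(doc_id: str, in_dir: Path) -> List[str]:
    """Read the pages back in page order for the PDF-building tool."""
    files = sorted(in_dir.glob(f"{doc_id}_p*.txt"))
    return [f.read_text(encoding="utf-8") for f in files]
```

Plain text files on disk also leave room for someone to open and
correct a page by hand before the searchable PDF is generated.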

That provides an opportunity for the user to omit unneeded steps or
insert additional steps such as a quality control phase, possible
manual fix-ups, etc.
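
To illustrate how steps could be omitted or inserted, here is a
minimal sketch that treats each step as a function over the list of
page texts. The stage names and the minimum-length quality check are
hypothetical examples, not a proposed standard.

```python
from typing import Callable, Iterable, List

# A stage takes the per-page text and returns a (possibly edited) copy.
Stage = Callable[[List[str]], List[str]]


def run_pipeline(pages: List[str], stages: Iterable[Stage]) -> List[str]:
    """Apply each stage in order; the caller decides which ones to use."""
    for stage in stages:
        pages = stage(pages)
    return pages


def normalize(pages: List[str]) -> List[str]:
    """Collapse runs of whitespace on every page."""
    return [" ".join(p.split()) for p in pages]


def qc_min_length(min_chars: int) -> Stage:
    """Build a QC stage that stops the run if any page looks too empty."""
    def check(pages: List[str]) -> List[str]:
        short = [i + 1 for i, p in enumerate(pages) if len(p) < min_chars]
        if short:
            raise ValueError(f"pages need manual review: {short}")
        return pages
    return check
```

Calling run_pipeline(pages, [normalize]) skips quality control
entirely, while run_pipeline(pages, [normalize, qc_min_length(20)])
inserts it, without either stage knowing about the other.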

Getting back to OpSED's philosophic underpinnings, one of the big
motivations behind this project is to empower legal professionals to
complete the process of law without being shackled and controlled by
vendors and service providers who are only interested in increased
profits. Using a commercial product or paying a service provider
should not be a legal professional's only choice when faced with the
need to handle electronic data. There's an ethical conflict there that
holds the legal process hostage to the whims of a fickle for-profit
industry.

One way to solve that problem is to create a set of free open source
software tools that are an *option* for legal professionals to use.
Then the choice to use commercial offerings is just that -- a choice.
Maybe OpSED tools aren't perfect, or "best-in-breed" solutions, but
they are, at minimum, open and transparent, and provide a reliable,
consistent set of functionality that doesn't need to be paid for.