I understand Kyle's concerns. OpSED could be getting too large in
scope already.
In fact, most projects of this nature divorce the standards body from
the reference implementations completely, and definitely the
applications from those reference implementations.
That said, the currently specced project Open Tools for eDiscovery
(http://opsed.org/projects/open-tools) is meant to cover exactly what
Caleb's describing. It would be a place to catalog existing tools,
or, where there is no existing open tool, hold software projects to
create those tools.
I kind of like how Apache deals with this. They have lots of projects,
and subprojects. They eat their own dog food. They make libraries
which are used by other Apache projects, and built into applications.
It's nice to know that a single org is managing not only the concept,
but the implementation and application of those concepts.
I don't think we should shy away from building applications/tools for
the community, and hosting them on our site, but maybe we should
present a clearer distinction between the more conceptual/theoretical
standards/process definition projects and the more
practical/implementation oriented projects to prevent confusion on the
website.
Thanks,
Troy
I like your perspective on that. It gets to the core of what OpSED is
about - defensibility.
One of the key benefits that tools written by OpSED would provide
to the legal industry is that they would be open source software.
Because the source code to the program is exposed, its processes and
the decisions inherent in those processes are available for inspection,
and transparent to the end user. It's provable, auditable and
defensible in a way that closed-source software is not.
With regard to the ambiguity of results, one thing to keep in mind --
almost all software programs provide unambiguous results. Given the
same set of input they will always produce the same output. That
includes OCR software. Though OCR engines may seem to vary a lot in
terms of output quality (measured by fidelity to the intent of the
original text), each one is at least consistent in that a given engine
will always produce the same output, given the same input.
That said, I don't think that we would want to write an entire OCR
engine. There are already a number of existing open source OCR engines
out there that could be utilized by our tools. I think an important
design choice in a tool like this, though, would be to leave the choice
of OCR engine up to the user.
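
To make that a bit more concrete, here is a rough sketch in Python of
what a swappable engine might look like. Everything in it is an
assumption on my part for illustration: the names (OcrEngine,
register_engine, run_ocr), the "bytes in, text out" signature, and the
stand-in fake_engine, which does no real OCR. A real adapter would wrap
one of the existing open source engines.

```python
from typing import Callable, Dict

# An "engine" here is any callable that takes image bytes and returns text.
# This signature is a hypothetical one chosen for the sketch.
OcrEngine = Callable[[bytes], str]

# Registry of engines the user can choose from, keyed by name.
_ENGINES: Dict[str, OcrEngine] = {}


def register_engine(name: str, engine: OcrEngine) -> None:
    """Make an engine selectable by name."""
    _ENGINES[name] = engine


def run_ocr(image: bytes, engine_name: str) -> str:
    """Run whichever engine the user selected; fail loudly on unknown names."""
    try:
        engine = _ENGINES[engine_name]
    except KeyError:
        raise ValueError(f"unknown OCR engine: {engine_name!r}") from None
    return engine(image)


def fake_engine(image: bytes) -> str:
    """Stand-in used only to show the wiring; it performs no real OCR."""
    return f"<{len(image)} bytes of image data>"


register_engine("fake", fake_engine)
```

The tool itself only ever calls run_ocr, so adding or swapping an engine
would mean registering one more adapter and nothing else.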
The process you've outlined -- a simple tool that would scan a
collection of documents, determine which ones needed to be OCR-ed,
then package them up for processing, is a great idea. One of the
reasons it's a great idea is that it keeps the scope of functionality
limited to only one part of the problem, which ultimately allows
flexibility for the end user.
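
As a rough illustration of how small that first tool could be, here is
a Python sketch. It rests on assumptions of mine, not anything we've
agreed on: image files are always treated as needing OCR, the check for
whether a PDF already has a text layer is passed in by the caller
(pdf_has_text is a hypothetical hook), and the "package" is just a JSON
list of paths.

```python
import json
from pathlib import Path
from typing import Callable, List

# Extensions treated as "always needs OCR" (an assumption for this sketch).
IMAGE_EXTENSIONS = {".tif", ".tiff", ".png", ".jpg", ".jpeg"}


def find_ocr_candidates(
    root: Path, pdf_has_text: Callable[[Path], bool]
) -> List[Path]:
    """Return the files under root that need OCR, in sorted order."""
    candidates = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        ext = path.suffix.lower()
        if ext in IMAGE_EXTENSIONS:
            candidates.append(path)
        elif ext == ".pdf" and not pdf_has_text(path):
            # Image-only PDF: no text layer, so it needs OCR.
            candidates.append(path)
    return candidates


def write_manifest(candidates: List[Path], manifest: Path) -> None:
    """Package the work list as a JSON manifest for a later OCR stage."""
    manifest.write_text(json.dumps([str(p) for p in candidates], indent=2))
```

The manifest is the only thing handed to the next stage, so the scanner
never needs to know which OCR engine, if any, runs afterwards.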
If we built a tool which could then complete the process by OCR-ing
the packaged items and outputting searchable PDFs, and built them in
such a way that they could run standalone, or be wired together in a
single workflow (using something like a shell script), then we'd have
the best of both worlds.
I'd even take that second stage tool and break it down further, and
make two tools -- one which OCR-ed and created paginated text files,
and one which generated the searchable PDFs from the "Image Only" PDFs
and the OCR text created by the previous tool.
That provides an opportunity for the user to omit unneeded steps or
insert additional steps such as a quality control phase, and possibly
manual fix-ups, etc.
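
Here is a rough Python sketch of that idea, with each tool as a stage
that can be left out or slotted in. The stage bodies are placeholders I
made up (no real OCR or PDF generation happens), and the dict used to
represent a document, along with its keys, is purely hypothetical.

```python
from typing import Any, Callable, Dict, Iterable

# A document is a plain dict here (an assumption made for the sketch).
Document = Dict[str, Any]
Stage = Callable[[Document], Document]


def ocr_stage(doc: Document) -> Document:
    """Placeholder for the tool that OCRs images into paginated text."""
    pages = [f"text of page {i + 1}" for i in range(doc["page_count"])]
    return {**doc, "pages": pages}


def qc_stage(doc: Document) -> Document:
    """Optional quality control step: flag documents with empty pages."""
    pages = doc.get("pages", [])
    return {**doc, "needs_review": any(not p.strip() for p in pages)}


def pdf_stage(doc: Document) -> Document:
    """Placeholder for the tool that merges image-only PDF and OCR text."""
    return {**doc, "searchable": "pages" in doc}


def run_pipeline(doc: Document, stages: Iterable[Stage]) -> Document:
    """Apply each stage in order; stages can be omitted or inserted freely."""
    for stage in stages:
        doc = stage(doc)
    return doc
```

Running with [ocr_stage, qc_stage, pdf_stage] includes the quality
control step, and [ocr_stage, pdf_stage] skips it. In practice each
stage would be a standalone program wired together by a shell script,
but the shape is the same.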
Getting back to OpSED's philosophic underpinnings, one of the big
motivations behind this project is to empower legal professionals to
complete the process of law without being shackled and controlled by
vendors and service providers who are only interested in increased
profits. Using a commercial product or paying a service provider
should not be a legal professional's only choice when faced with the
need to handle electronic data. There's an ethical conflict there that
holds the legal process hostage to the whims of a fickle for-profit
industry.
One way to solve that problem is to create a set of free open source
software tools that are an *option* for legal professionals to use.
Then the choice to use commercial offerings is just that -- a choice.
Maybe OpSED tools aren't perfect, or "best-in-breed" solutions, but
they are, at minimum, open and transparent, and provide a reliable,
consistent set of functionality that doesn't need to be paid for.
Well, that's my opinion anyhow... ;)
Thanks,
Troy