I'm glad you brought this up. Python is actually an ideal language for
this situation. From the perspective of learning Python or editing
existing Python code, it's widely considered one of the easiest
programming languages to learn, with one of the lowest barriers to
entry. It has a massive collection of existing tools in its standard
library, which often means you need only a few lines of code to do
fairly complicated tasks. It's powerful enough in its capabilities and
fast enough to scale to the tasks that eDiscovery requires, while
still being easy to use for a beginner.
You're correct that it runs on a runtime, and thus requires a separate
install to execute Python files. Mac OS X has Python installed by
default, as do most Linux distributions. Windows has an easy-to-install
distribution (click-next type installer), and installing it is a
one-time task. After that you can just double-click on .py files.
The download for Windows is here:
http://www.python.org/download/windows/
That said, you can also compile Python code into free-standing
executables, which removes the need for the separate runtime install.
There are compilation tools for the three major platforms, which
result in native executable code. Those are:
py2exe - makes a Windows exe
Freeze - makes a Unix executable
py2app - makes a Mac OS X executable
This topic is covered on the Python wiki at:
http://wiki.python.org/moin/PythonInstalledByDefault
For OpSED tooling, I think the best way to distribute the tools would
be to make the .py source code available as well as binary builds
(with installers) for each of the major platforms (Windows, Mac, and
Linux variants). The simplest way to use the tools would be to
download the binaries. When you're ready to start hacking the source
and customizing the tools, you can download the runtime and the source
code.
That said, we look to Python as a first choice because it's so well
suited to this purpose, but it is not the only option. Java and C++
are also widely used for this purpose, but both are more complicated
to compile and deploy.
In other situations, languages like CPL might be the best option
(automating something in Concordance), and of course we'd gladly host
and distribute those options.
In the end, the goal is to facilitate the sharing and reuse of
solutions, in whatever way is easiest for the end user.
Thanks,
Troy
On Mon, Jun 7, 2010 at 1:22 PM, Aline Bernstein
<aline.b...@gmail.com> wrote:
As an example, I created a couple of Python scripts and posted them up
at opsed.org for download.
Here are the links:
http://opsed.org/attachments/download/2/walk-files.py
http://opsed.org/attachments/download/3/show-mimetypes.py
To run the files, you can just double click them, or run from the DOS
command prompt like this:
C:\Python26\python.exe C:\opsed\walk-files.py
To view the source code, just open them in a text editor...
The first script, 'walk-files.py', simply loops recursively over every
subdirectory/file of a root directory and prints out the name to the
console. For the sake of showing the simplicity of the Python code, I
didn't include many comments.
It currently expects a directory called 'C:\Test' to exist, so if it
doesn't exist on your machine, it won't do anything... To change the
root directory, just edit the fourth line to point to a different
directory.
Example:
To scan 'C:\My\Folder\Path'...
rootdir = 'C:\Test'
would change to:
rootdir = 'C:\My\Folder\Path'
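For anyone who wants to see the shape of such a script without
downloading it, here is a minimal sketch of the same idea. It is not
the posted walk-files.py itself; the function name iter_paths is made
up for illustration. One caution about editing the path: in an
ordinary Python string a backslash can start an escape sequence, so
'C:\Test' happens to work but 'C:\temp' would contain a tab character.
Writing the path as a raw string, r'C:\temp', avoids that.

```python
import os


def iter_paths(rootdir):
    """Yield every directory and file path found under rootdir, recursively."""
    for dirpath, dirnames, filenames in os.walk(rootdir):
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)


if __name__ == "__main__":
    # Raw string (r'...') so backslashes are never read as escape sequences.
    rootdir = r'C:\Test'
    # If rootdir does not exist, os.walk yields nothing and no lines print.
    for path in iter_paths(rootdir):
        print(path)
```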
To show something a bit more complex, I made 'show-mimetypes.py', which
first asks the user to enter the root directory, then scans all the
files under it recursively, and prints out the filename and the mime
type to the console. It then prompts the user to 'Press any key to
exit'. For the sake of explaining all the details, I included a bunch
of comments.
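As a rough sketch of that flow (again, not the posted script; the
names iter_mimetypes and main are hypothetical, and Python 3 is
assumed), the standard library's mimetypes module can guess a type
from the file extension. Note that this version waits for Enter
rather than any key, since that is what input() does.

```python
import mimetypes
import os


def iter_mimetypes(rootdir):
    """Yield (path, mime_type) for every file under rootdir, recursively."""
    for dirpath, dirnames, filenames in os.walk(rootdir):
        for name in filenames:
            # guess_type() looks only at the file extension, not the contents.
            # The type is None when the extension is not recognized.
            mime_type, _encoding = mimetypes.guess_type(name)
            yield os.path.join(dirpath, name), mime_type


def main():
    rootdir = input("Enter the root directory: ")
    for path, mime_type in iter_mimetypes(rootdir):
        print(path, mime_type)
    # Keeps the console window open when the script is double-clicked.
    input("Press Enter to exit")


# Uncomment to run interactively (for example when double-clicking the file):
# main()
```

Because the guess is based on the extension alone, a renamed file will
be reported with the wrong type; checking the actual contents would
need a different approach.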
This should give you a rough idea of how to accomplish similar tasks.
As an exercise, why don't we put together a few simple tools like this
to see how things can be done? It'll be great to do this in the
context of eDiscovery, so let's choose something valuable to typical
post-processing tasks that a litigation support team might do.
The tools we discussed from requests on the litsupport list are a bit
more complex to accomplish, but maybe we can come up with something
equally useful but simpler as a learning exercise.
Thanks,
Troy
One thing that can mess a lot of people up is that Python treats tabs
and spaces differently. For example, in my text editor, when I hit
<tab> it will insert 4 spaces, not a <tab> character. In other
people's editors, it will insert a <tab> character.
If your text editor is inserting <tab> characters, and you tried to
edit a file that I made when I was using spaces, it would upset
Python. A <tab> and four spaces can look identical on screen, but
Python does not count them as the same amount of indentation. Python 2
counts a <tab> as advancing to the next multiple of eight columns, so
your lines would no longer line up with mine, and Python 3 refuses to
run a file that mixes the two inconsistently. The easy way to deal
with that is to configure the text editor to always use spaces instead
of tabs, or vice versa. Then the problem just goes away. I chose to
use spaces because most text editors will insert spaces instead of tab
characters, so it has the greatest chance of avoiding this problem,
but that's still just a chance.
It's a common gotcha with Python, but it's mostly due to the way that
various text editors work differently by default.
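If you suspect a file has picked up tab indentation, a few lines of
Python can point at the offending lines. This is only a sketch; the
function name find_tab_indented_lines is made up for illustration.

```python
def find_tab_indented_lines(source_text):
    """Return the 1-based numbers of lines whose leading whitespace contains a tab."""
    flagged = []
    for lineno, line in enumerate(source_text.splitlines(), start=1):
        # Slice off everything except the leading run of spaces and tabs.
        indent = line[:len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            flagged.append(lineno)
    return flagged


# The second line is indented with four spaces, the third with a tab.
sample = "if True:\n    x = 1\n\ty = 2\n"
print(find_tab_indented_lines(sample))  # prints [3]
```

To check a real file, read its contents into a string first and pass
that string to the function.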
Anyhow, a user/tester/discussion partner is just as useful as a
programmer to OpSED. Many of our projects have nothing to do with
writing code, and those are, in fact, our more important projects. The
tools project is just something that happened to come up. That
project's original focus was just to be a place to compile a list of
existing tools and how to use them, with the thought that we could
write tools when there wasn't already an existing free/open source one
available.
There are a lot of people out there who don't know our industry, but
do know Python, who could write the code for these kinds of tools. But
only people like yourself can describe what kinds of tools will be
most useful to you, and how you want them to work. Of course, it's
ideal if you can also write or modify your tools, because that gives
you more power as a user.
Thanks,
Troy