Distributed bulk_extractor?


Dewhirst, Rob

Oct 21, 2011, 11:19:05 AM
to aff-d...@googlegroups.com
I have an (almost) infinite supply of older single- and dual-core
servers, but no single spare system that is very powerful.

Other than splitting up evidence files and having separate systems
work on them, is there a way to distribute bulk_extractor processing
across multiple systems?

Simson Garfinkel

Oct 23, 2011, 12:29:02 AM
to aff-d...@googlegroups.com, Dewhirst, Rob
Hi, Rob. Thanks for the email.

At the present time there is no way to do what you describe. Besides splitting up bulk_extractor runs, you would also need to split up the data. Are all of your systems networked together? Do they have a high-speed switched bandwidth between them? How many systems do you actually have? 100? 1000? 10,000? Are they all running the same operating system?

There are two obvious approaches to follow here:

1 - Run Hadoop and HDFS on the cluster. Store the disk images in HDFS, and write a map/reduce job that runs bulk_extractor and combines the results.

2 - Run PBS/Torque to distribute the computation. The disk images will need to be stored on a high-performance file system, though, so you'll need to set up Lustre or GPFS.
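Either way, the work has to be divided by byte range before it can be farmed out. A minimal sketch of that division, assuming bulk_extractor's `-Y start-end` byte-range option (check your version's usage output) and hypothetical shared-storage paths:

```python
# Sketch: divide a disk image into n_nodes byte ranges and emit one
# bulk_extractor command per range. The -Y start-end flag and the paths
# below are assumptions -- check your bulk_extractor version's usage.
import os

def split_commands(image_path, n_nodes, outdir_base="/shared/be_out"):
    size = os.path.getsize(image_path)
    chunk = (size + n_nodes - 1) // n_nodes   # ceiling division
    cmds = []
    for i in range(n_nodes):
        start = i * chunk
        end = min(size, start + chunk)
        if start >= end:
            break                             # fewer ranges than nodes
        cmds.append(f"bulk_extractor -Y {start}-{end} "
                    f"-o {outdir_base}/part{i:04d} {image_path}")
    return cmds
```

Each emitted command could then become one PBS/Torque job or one map task; the per-part feature files would still need a combining step afterward.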

How much background do you have in high performance computing?


Benjamin Brink

Oct 23, 2011, 12:41:42 AM
to aff-d...@googlegroups.com
Using all-flash storage (a hard-disk emulator) would significantly
reduce processing time as well.

Simson Garfinkel

Oct 23, 2011, 12:52:19 AM
to aff-d...@googlegroups.com, Benjamin Brink
Benjamin,

Actually, I disagree. bulk_extractor doesn't seek much. In my timing tests SSDs do not make the program significantly faster.

Furthermore, I think that there is a big difference between what Rob says he wants to do and what he actually is able to do. If he really does have an infinite number of systems, then he can set up 1000 or more of them and actually get a significant performance improvement. But it's unlikely that an organization which cannot purchase a multi-core server has the ability to effectively manage a network with 100s or 1000s of nodes. A 24-core Mac Pro with 12GB of RAM costs less than $3000. (I know; I just bought one). If he can't afford to spend $3000 on a machine, then it's unlikely that he'll be able to buy the SSDs, or to manage the cluster that he seems to want to build.

Benjamin Brink

Oct 23, 2011, 1:25:49 AM
to aff-d...@googlegroups.com
Simson,

I see your point. I didn't mean to suggest disk access is the main issue.

Sometimes one doesn't realize the value of the assets one has
until they are considered in context.

For example, at first glance one might discount a low-end MacBook Air or a
remote VPS running entirely in RAM; neither has the presence of an
"(almost) infinite supply of older single and dual core servers". Yet a
"single spare system that is very powerful" might be "spared" after all,
once its relative value is weighed against setting up and managing a
cluster for what is otherwise a simple single run.

cheers,

Benjamin

Dewhirst, Rob

Oct 23, 2011, 10:21:19 AM
to aff-d...@googlegroups.com
I thought for sure I heard Simson say last year at the OSDF conference
that "RAID on SSDs" was one of the best ways to improve bulk_extractor
processing speed. Did I make that up, or is it the combination of RAID +
SSDs that makes the difference, versus just an SSD?

I may have confused this statement with someone talking about another
tool as well.

Dewhirst, Rob

Oct 23, 2011, 10:19:20 AM
to aff-d...@googlegroups.com
Our network is the size of a small city and managed quite well. It
is also not the case that we can't purchase systems. But this would be an
experiment, and our small security office doesn't have a budget for new
hardware for "experiments". You don't need to convince me how
short-sighted and dumb that is.

In my effort to be concise I may have been misleading. I wanted to
experiment with improving bulk_extractor processing time using my
currently available resources (spare equipment and some spare time),
without eating into my new-equipment budget. Forensic-processing
needs have to share resources with ALL of our other security projects.


Simson Garfinkel

Oct 27, 2011, 3:28:05 PM
to aff-d...@googlegroups.com
Currently, most people running bulk_extractor are CPU-bound, not I/O-bound. Therefore moving to RAID on SSDs does not help bulk_extractor. If I said otherwise, I was mistaken.

Simson Garfinkel

Oct 27, 2011, 3:26:52 PM
to aff-d...@googlegroups.com
Thanks for the clarification.

It would certainly be useful to have a peer-to-peer option in bulk_extractor that allowed pages to be processed by other nodes. You are welcome to implement this. I do not have the time to do so, alas.
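As a thought experiment, such a peer-to-peer option might look like a coordinator that owns a queue of (offset, length) work units, one per page, which worker nodes drain. A minimal in-process sketch; the 16 MiB page size and the whole hand-off scheme are assumptions, not anything bulk_extractor currently provides:

```python
# Thought-experiment sketch of page handoff: a coordinator owns a queue
# of (offset, length) work units, one per page; worker nodes drain it.
# The 16 MiB page size and the whole scheme are assumptions, not a
# feature bulk_extractor currently provides.
import queue

PAGE_SIZE = 16 * 1024 * 1024  # assumed per-page unit of work

def make_work_units(image_size, page_size=PAGE_SIZE):
    """Return a queue of (offset, length) pages covering the image."""
    q = queue.Queue()
    offset = 0
    while offset < image_size:
        q.put((offset, min(page_size, image_size - offset)))
        offset += page_size
    return q

def next_unit(q):
    """What a worker node would request from the coordinator."""
    try:
        return q.get_nowait()
    except queue.Empty:
        return None  # no pages left; the worker can exit
```

In a real peer-to-peer version the queue would sit behind a network protocol, and workers would ship their feature files back for merging.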
