Other than splitting up evidence files and having separate systems
work on them, is there a way to distribute bulk_extractor processing
across multiple systems?
At the present time there is no way to do what you describe. Besides splitting up the bulk_extractor runs, you would also need to split up the data. Are all of your systems networked together? Do they have high-speed switched bandwidth between them? How many systems do you actually have? 100? 1000? 10,000? Are they all running the same operating system?
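On the data-splitting point, here is a minimal Python sketch of carving a raw image into overlapping chunks, so that a feature straddling a chunk boundary is still fully contained in at least one chunk (bulk_extractor uses an overlapping-page scheme internally for the same reason). The chunk and overlap sizes below are illustrative assumptions, not bulk_extractor defaults:

    #!/usr/bin/env python3
    """Carve a raw disk image into overlapping chunks for distribution.

    Sketch only: chunk and overlap sizes are illustrative, not
    bulk_extractor defaults. The overlap ensures a feature that
    straddles a chunk boundary is fully contained in some chunk.
    """
    import os
    import sys

    CHUNK_SIZE = 1 * 1024**3    # 1 GiB per chunk (illustrative)
    OVERLAP    = 16 * 1024**2   # 16 MiB overlap (illustrative)

    def carve(image_path, out_dir):
        os.makedirs(out_dir, exist_ok=True)
        size = os.path.getsize(image_path)
        chunk_paths = []
        with open(image_path, "rb") as img:
            start, n = 0, 0
            while start < size:
                img.seek(start)
                data = img.read(CHUNK_SIZE + OVERLAP)
                path = os.path.join(out_dir, f"chunk{n:04d}.raw")
                with open(path, "wb") as out:
                    out.write(data)
                chunk_paths.append(path)
                start += CHUNK_SIZE  # advance by CHUNK_SIZE, not by bytes read
                n += 1
        return chunk_paths

    if __name__ == "__main__":
        for p in carve(sys.argv[1], sys.argv[2]):
            print(p)

Note that features found in an overlap region will be reported twice, and reported offsets will be chunk-relative, so the combining step has to deduplicate and rebase offsets against each chunk's starting position.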
There are two obvious approaches to follow here (a job-submission sketch for the second follows the list):
1 - Run Hadoop and HDFS on the cluster. Store the disk images in HDFS and write a map/reduce job that runs bulk_extractor and combines the results.
2 - Run PBS/Torque to distribute the computation. The disk images will need to be stored on a high-performance file system, though, so you'll need to set up Lustre or GPFS.
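For the PBS/Torque route, here is a hedged Python sketch of a driver that writes one job per chunk and pipes it to qsub. The job name, resource request, and walltime are placeholder site-specific assumptions; only the bulk_extractor -j/-o invocation at the end is the tool's actual command-line interface:

    #!/usr/bin/env python3
    """Submit one Torque/PBS job per image chunk.

    Sketch only: the resource request and walltime are site-specific
    assumptions; adjust them for your cluster.
    """
    import subprocess
    import sys
    from pathlib import Path

    JOB_TEMPLATE = """#!/bin/bash
    #PBS -N be_{name}
    #PBS -l nodes=1:ppn=8
    #PBS -l walltime=04:00:00
    bulk_extractor -j 8 -o {outdir} {chunk}
    """

    def submit(chunks_dir):
        for chunk in sorted(Path(chunks_dir).glob("chunk*.raw")):
            outdir = chunk.with_suffix(".be_out")
            script = JOB_TEMPLATE.format(name=chunk.stem,
                                         outdir=outdir, chunk=chunk)
            # qsub accepts the job script on stdin
            subprocess.run(["qsub"], input=script.encode(), check=True)

    if __name__ == "__main__":
        submit(sys.argv[1])

The chunks and the output directories both have to live on the shared file system (Lustre or GPFS, as above) so that every compute node sees the same paths.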
How much background do you have in high performance computing?
Actually, I disagree. bulk_extractor doesn't seek much. In my timing tests SSDs do not make the program significantly faster.
Furthermore, I think that there is a big difference between what Rob says he wants to do and what he is actually able to do. If he really does have an effectively infinite number of systems, then he can set up 1000 or more of them and get a significant performance improvement. But it's unlikely that an organization that cannot purchase a multi-core server has the ability to effectively manage a network with hundreds or thousands of nodes. A 24-core Mac Pro with 12GB of RAM costs less than $3000. (I know; I just bought one.) If he can't afford to spend $3000 on a machine, then it's unlikely that he'll be able to buy the SSDs, or to manage the cluster that he seems to want to build.
I see your point. I didn't mean to suggest disk access is the main issue.
Sometimes one doesn't realize the value of the assets one already has
available if they aren't considered in context.
For example, at first glance one might discount a low-end MacBook Air or a
remote VPS running fully in RAM; they lack the presence of an
"(almost) infinite supply of older single and dual core servers". Yet a
"single spare system that is very powerful" might be "spared" once its
relative value is weighed in the context of a simple single run versus
setting up and managing a cluster.
cheers,
Benjamin
I may have confused this statement with a comment someone made about
another tool.
In my effort to be concise, I may have been misleading. I wanted to
experiment with improving bulk_extractor processing time using my
currently available resources (spare equipment and some spare time)
and without eating into my new-equipment budget. Our forensic processing
needs have to share resources with ALL of our other security projects.
On Sat, Oct 22, 2011 at 11:52 PM, Simson Garfinkel <sim...@acm.org> wrote:
It would certainly be useful to have a peer-to-peer option in bulk_extractor that allowed pages to be processed by other nodes. You are welcome to implement this. I do not have the time to do so, alas.
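In the meantime, the page-distribution idea can be approximated outside the tool. Below is a hedged Python sketch of an external coordinator/worker pair built on multiprocessing.managers: the coordinator serves a shared queue of chunk paths, and each node pulls work from it and runs a stock bulk_extractor locally. The host, port, and authkey are placeholder values, and this is one possible shape for such a contribution, not an existing bulk_extractor feature:

    #!/usr/bin/env python3
    """Minimal cross-machine work queue for bulk_extractor chunk jobs.

    Sketch of an external wrapper, NOT an existing bulk_extractor
    feature. Host, port, and authkey are placeholder values.
    """
    import queue
    import subprocess
    import sys
    from multiprocessing.managers import BaseManager

    work = queue.Queue()

    class QueueManager(BaseManager):
        pass

    def serve(chunk_paths, port=50000):
        """Run on the coordinator: load chunk paths, then serve the queue."""
        for p in chunk_paths:
            work.put(p)
        QueueManager.register("get_queue", callable=lambda: work)
        mgr = QueueManager(address=("", port), authkey=b"demo-key")
        mgr.get_server().serve_forever()

    def worker(host, port=50000):
        """Run on each node: pull chunks and run bulk_extractor on them."""
        QueueManager.register("get_queue")
        mgr = QueueManager(address=(host, port), authkey=b"demo-key")
        mgr.connect()
        q = mgr.get_queue()
        while True:
            try:
                chunk = q.get(timeout=5)
            except queue.Empty:
                break
            subprocess.run(
                ["bulk_extractor", "-o", chunk + ".be_out", chunk],
                check=True)

    if __name__ == "__main__":
        if sys.argv[1] == "serve":
            serve(sys.argv[2:])
        else:
            worker(sys.argv[2])

A wrapper like this only load-balances whole chunks; a true peer-to-peer option inside bulk_extractor, as described above, could hand out individual pages and avoid the pre-splitting step entirely.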