Hi Eira,
I should probably add a page on the BitCurator wiki to explain exactly how these things work in bulk_extractor - a lot of people have been confused by this. For now:
The alert list facility in bulk_extractor will *only* alert you to those features that are identified by the scanners that are being run (things like emails, EXIF metadata, etc). So, for example, create an alert_list.txt file on the Desktop with two lines in it:
cartel
barter
Create a new directory on the Desktop called "foo", and put a new file called "bar.txt" in it. Edit "bar.txt" and put the following lines in it:
cartel
barter
no rebar
simple
anodyne
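If you'd rather script the setup than create the files by hand, here is a minimal sketch. It writes into a scratch directory from mktemp rather than the real Desktop (my assumption, so it can be re-run safely), and the variable name workdir is purely illustrative. It uses only the lines listed above.

```shell
# Scratch directory standing in for ~/Desktop (an assumption for safe re-runs).
workdir="$(mktemp -d)"

# Alert list: one term per line.
printf '%s\n' cartel barter > "$workdir/alert_list.txt"

# Directory to scan, containing a single text file.
mkdir -p "$workdir/foo"
printf '%s\n' cartel barter 'no rebar' simple anodyne > "$workdir/foo/bar.txt"

# Show what was written so it can be compared with the listing above.
cat "$workdir/foo/bar.txt"
```

To follow the rest of this message literally, replace "$workdir" with ~/Desktop.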
Now run bulk_extractor with the default scanners and a recursive scan of all files in the foo directory using the command:
bcadmin@ubuntu:~$ bulk_extractor -r ~/Desktop/alert_list.txt -R ~/Desktop/foo -o ~/Desktop/beresults
You won't get *any* results as alerts, because there's no scanner in bulk_extractor looking for the terms "cartel" or "barter" by default.
However, if you now *replace* the contents of the alert_list.txt file with:
remove the existing beresults directory, and rerun that previous command, you will get a hit for an email address, and in the new beresults output directory you will now see an ALERTS_found.txt file with the following line in it:
So how do you search for the *other* terms, the ones that no existing scanner will recognize as hits? You need to use the "find" facility, pointing bulk_extractor either at a single regular expression (the -f option) or at a file containing the relevant terms (the -F option; one term or regular expression per line, or using an existing feature file). Here's an example:
Create a file on the Desktop called rlist.txt, and put the following words in it:
bar
car
These terms will be treated as regular expressions; in this case, the "find" scanner will identify any instance of bar or car, even if it appears within another word.
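To get a feel for what "even if it appears within another word" means before running bulk_extractor, you can approximate the matching with plain grep. This is only a rough stand-in for the "find" scanner, not how bulk_extractor works internally. It assumes GNU grep (for -o with -b) and a scratch copy of the five lines listed earlier for bar.txt; workdir is an illustrative name.

```shell
# Recreate the sample text and the pattern list in a scratch directory.
workdir="$(mktemp -d)"
printf '%s\n' cartel barter 'no rebar' simple anodyne > "$workdir/bar.txt"
printf '%s\n' bar car > "$workdir/rlist.txt"

# -o prints each match on its own line, -b prefixes its byte offset,
# -E treats the patterns as extended regexes, -f reads them from a file.
matches="$(grep -obE -f "$workdir/rlist.txt" "$workdir/bar.txt")"
printf '%s\n' "$matches"
# With GNU grep and the five lines above this prints:
# 0:car
# 7:bar
# 19:bar
```

The terms match inside "cartel", "barter" and "rebar", which is the same substring behaviour the find scanner shows.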
Now, remove the existing beresults directory, and run the bulk_extractor command again, but this time add the regular expression list file (rlist.txt) in addition to the alert_list.txt, like this:
bcadmin@ubuntu:~$ bulk_extractor -r ~/Desktop/alert_list.txt -F ~/Desktop/rlist.txt -R ~/Desktop/foo -o ~/Desktop/beresults
Now you will again get an ALERTS_found.txt file in the output, with the same contents as before. However, you'll also see that the contents of "find.txt" now include the following:
# BANNER FILE NOT PROVIDED (-b option)
# BULK_EXTRACTOR-Version: 1.6.0-dev ($Rev: 10844 $)
# Feature-Recorder: find
# Filename: /home/bcadmin/Desktop/foo
# Feature-File-Version: 1.1
/home/bcadmin/Desktop/foo/bar.txt<U+10001C>-0 car cartel\x0Abarter\x0Ano re
/home/bcadmin/Desktop/foo/bar.txt<U+10001C>-7 bar cartel\x0Abarter\x0Ano rebar\x0Afoo
/home/bcadmin/Desktop/foo/bar.txt<U+10001C>-19 bar tel\x0Abarter\x0Ano rebar\x0Af...@grok.com
/home/bcadmin/Desktop/foo/bar.txt<U+10001C>-26 bar ter\x0Ano rebar\x0Af...@grok.com\x0Asimple
Sure enough, it found one instance of car and three instances of bar. Note that it also shows you the surrounding context, as with the other feature files, which is why the last part of each line includes all that other junk.
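If find.txt gets long, a quick tally per term can save some scrolling. The sketch below assumes the usual feature-file layout of tab-separated position, feature and context fields, with header lines starting with "#". It runs against a small stand-in file rather than real output: the paths are shortened and the context column is abbreviated to "...", so treat the data as illustrative only.

```shell
workdir="$(mktemp -d)"

# Stand-in feature file: '#' header lines, then
# position<TAB>feature<TAB>context (context abbreviated to "..." here).
printf '# Feature-Recorder: find\n' > "$workdir/find.txt"
printf '%s\t%s\t%s\n' \
    'bar.txt-0'  car '...' \
    'bar.txt-7'  bar '...' \
    'bar.txt-19' bar '...' \
    'bar.txt-26' bar '...' >> "$workdir/find.txt"

# Count hits per matched term (second field), skipping header lines.
counts="$(awk -F'\t' '!/^#/ && NF >= 2 { n[$2]++ } END { for (t in n) print t, n[t] }' \
    "$workdir/find.txt" | sort)"
printf '%s\n' "$counts"
# Prints:
# bar 3
# car 1
```

Pointing the awk command at the real find.txt in your beresults directory should work the same way, provided the fields are tab-separated as assumed.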
You can, of course, do this just as easily with the same file setup using BEViewer. I just didn't want to include a bunch of screenshots here.
Hope this helps.
Regards,
Kam