Bulk Extractor and alert lists to flag sensitive words

Eira Tansey

Oct 14, 2015, 11:58:04 AM
to BitCurator Users
Apologies if anyone already saw this message on the Digital Curation listserv (https://groups.google.com/d/msg/digital-curation/qj7uQJnfzPk/T1Nbx2RrBQAJ), but having received no responses there, I wanted to try over here next.

I have a group of records I'm working with that I know contain some sensitive files (e.g., job candidate evaluations). I want to identify these and segregate them from the rest of the accession. I am using Bulk Extractor (within BitCurator), and tried to create an alert list text file to flag sensitive words and phrases (see pages 27-28: http://digitalcorpora.org/downloads/bulk_extractor/BEUsersManual.pdf).

However, passing this alert list against a directory of files does not seem to result in any output, even though I know these words appear in at least some of the text documents. The built-in scanners are working fine and output things like phone numbers. Does anyone have advice? Alternative suggestions re: tools/methods for identifying documents with sensitive keywords would also be welcome.

Thanks,

Eira Tansey
Digital Archivist/Records Manager

Archives and Rare Books Library
University of Cincinnati Libraries
806 Blegen Library
2602 McMicken Circle
PO Box 210113
Cincinnati, OH 45221-0113

Direct Tel: 513-556-1958
Library Tel: 513-556-1959
Email:
eira....@uc.edu
Web:
www.libraries.uc.edu/libraries/arb/


Carol Kussmann

Oct 14, 2015, 1:09:05 PM
to bitcurat...@googlegroups.com
Not sure if this is helpful, just sharing my own experience...

I was having problems with bulk extractor within BitCurator not identifying known Social Security numbers because they didn't match the default scan (which, I think, requires "SSN" in front of the number). I had to use the command line (outside of BitCurator) and tell it to just look for the number pattern.

I don't know if there is another setting that would need to be changed to allow it to look for the list of terms. Looking quickly at the user guide, it talks about a different type of list on pp. 31-32 when using the command line.

Currently we use Identity Finder, a paid proprietary program, to look for sensitive information. We haven't tested this yet, but it looks like you can add your own words.
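
As another possible alternative for plain-text files, a simple keyword scan can be sketched in Python. This is only an illustration, not a replacement for a forensic tool: the function name find_keyword_hits is made up for this example, it reads files as text (so Word or PDF documents would need their text extracted first), and it matches keywords case-insensitively as literal strings.

```python
import os
import re


def find_keyword_hits(root, keywords):
    """Return {file path: sorted list of keywords found} for files under root.

    Hypothetical helper for illustration; it only understands plain text.
    """
    # Treat each keyword as a literal string, ignoring case.
    patterns = {kw: re.compile(re.escape(kw), re.IGNORECASE) for kw in keywords}
    hits = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", encoding="utf-8", errors="ignore") as fh:
                    text = fh.read()
            except OSError:
                continue  # unreadable file: skip it rather than stop the scan
            found = sorted(kw for kw, pat in patterns.items() if pat.search(text))
            if found:
                hits[path] = found
    return hits
```

Calling find_keyword_hits("/path/to/accession", ["evaluation", "candidate"]) would return only the files that contain at least one of the listed words, which could then be reviewed and segregated by hand.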

Best,
Carol 






--
You received this message because you are subscribed to the Google Groups "BitCurator Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bitcurator-use...@googlegroups.com.
To post to this group, send email to bitcurat...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bitcurator-users/68ff9032-a1a5-4fde-831d-29a8cc211b3a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--


Carol Kussmann
Digital Preservation Analyst
Digital Preservation and Repository Technologies | University of Minnesota Libraries
499 Wilson Library, 309 19th Avenue South, Minneapolis, MN 55455

Kam Woods

Oct 14, 2015, 3:12:57 PM
to bitcurat...@googlegroups.com
Hi Eira,

I should probably add a page on the BitCurator wiki to explain exactly how these things work in bulk_extractor - a lot of people have been confused by this. For now:

The alert list facility in bulk_extractor will *only* alert you to those features that are identified by the scanners that are being run (things like emails, exif metadata, etc). So, for example, create an alert_list.txt file on the Desktop with two lines in it:

cartel
barter

Create a new directory on the Desktop called "foo", and put a new file called "bar.txt" in it. Edit "bar.txt" and put the following lines in it:

cartel
barter
no rebar
foo...@grok.com
simple
anodyne

Now run bulk_extractor with the default scanners and a recursive scan of all files in the foo directory using the command:

bcadmin@ubuntu:~$ bulk_extractor -r ~/Desktop/alert_list.txt -R ~/Desktop/foo -o ~/Desktop/beresults

you won't get *any* results as alerts, because there's no scanner in bulk_extractor looking for the terms "cartel" or "barter" by default.

However, if you now *replace* the contents of the alert_list.txt file with:

foo...@grok.com

remove the existing beresults directory, and rerun that previous command, you will get a hit for an email address, and in the new beresults output directory you will now see an ALERTS_found.txt file with the following line in it:

/home/bcadmin/Desktop/foo/bar.txt<U+10001C>-23 foo...@grok.com
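
The behaviour described above can be pictured with a small toy model in Python. This is not bulk_extractor's code, just a sketch of the idea: a scanner first extracts features, and the alert list only filters what the scanner found. The email regex, the function names, and the address someone@example.org are all assumptions made for this illustration, and the sketch uses exact matching where the real alert list may be more flexible.

```python
import re

# Hypothetical stand-in for a single built-in scanner (email addresses only).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scan_features(text):
    """Return (offset, feature) pairs that the toy scanner recognizes."""
    return [(m.start(), m.group()) for m in EMAIL_RE.finditer(text)]


def alerts(text, alert_list):
    """Keep only scanner-found features that also appear in the alert list."""
    wanted = set(alert_list)
    return [(off, feat) for off, feat in scan_features(text) if feat in wanted]


# Example text with a placeholder address (not the one from the thread).
sample = "cartel\nbarter\nno rebar\nsomeone@example.org\nsimple\nanodyne\n"

word_alerts = alerts(sample, ["cartel", "barter"])      # no scanner emits plain words
email_alerts = alerts(sample, ["someone@example.org"])  # the email scanner does emit this
```

In this model the plain words never produce an alert, because no scanner reports them as features in the first place, while the email address does.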

So how do you search for the *other* terms, the ones that no existing scanner will recognize as hits? You need to use the "find" facility, pointing bulk_extractor either at a single regular expression or a file containing the relevant terms (one term or regular expression per line, or using an existing feature file). Here's an example:

Create a file on the desktop called rlist.txt, and put the following words in it:

bar
car

These terms will be treated as regular expressions; in this case, the "find" scanner will identify any instance of bar or car, even if it appears within another word.
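
To see why short terms also match inside longer words, here is a rough Python imitation of that search. It uses Python's re module rather than bulk_extractor's own regex engine, the function name find_hits is made up for this example, and the sample data contains only the plain-word lines of bar.txt (cartel, barter, no rebar, simple, anodyne).

```python
import re


def find_hits(data, patterns):
    """Return sorted (byte offset, matched text) pairs for each pattern."""
    hits = []
    for pat in patterns:
        # Search the raw bytes so that offsets are byte offsets into the file.
        for m in re.finditer(pat.encode(), data):
            hits.append((m.start(), m.group().decode()))
    return sorted(hits)


# Plain-word lines of bar.txt only.
sample = b"cartel\nbarter\nno rebar\nsimple\nanodyne\n"
hits = find_hits(sample, ["bar", "car"])
```

With this input, "car" is found inside "cartel" at offset 0, and "bar" is found inside "barter" at offset 7 and inside "rebar" at offset 19.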

Now, remove the existing beresults directory, and run the bulk_extractor command again, but this time add the regular expression list file (rlist.txt) in addition to the alert_list.txt, like this:

bcadmin@ubuntu:~$ bulk_extractor -r ~/Desktop/alert_list.txt -F ~/Desktop/rlist.txt -R ~/Desktop/foo -o ~/Desktop/beresults

Now you will again get an ALERTS_found.txt file in the output, with the same contents as before. However, you'll also see that the contents of "find.txt" now include the following:

# BANNER FILE NOT PROVIDED (-b option)
# BULK_EXTRACTOR-Version: 1.6.0-dev ($Rev: 10844 $)
# Feature-Recorder: find
# Filename: /home/bcadmin/Desktop/foo
# Feature-File-Version: 1.1
/home/bcadmin/Desktop/foo/bar.txt<U+10001C>-0   car     cartel\x0Abarter\x0Ano re
/home/bcadmin/Desktop/foo/bar.txt<U+10001C>-7   bar     cartel\x0Abarter\x0Ano rebar\x0Afoo
/home/bcadmin/Desktop/foo/bar.txt<U+10001C>-19  bar     tel\x0Abarter\x0Ano rebar\x0Af...@grok.com
/home/bcadmin/Desktop/foo/bar.txt<U+10001C>-26  bar     ter\x0Ano rebar\x0Af...@grok.com\x0Asimple

Sure enough, it found one instance of car, and three instances of bar. Note that it also shows you the context, as with the other feature files, which is why the last part of each line includes all that other junk.
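
If you want to post-process a feature file like this one, a minimal Python sketch might look like the following. It assumes the three columns (location, feature, context) are separated by tabs, and that in recursive mode the location is the file path, then the U+10001C separator character, then a dash and the byte offset, as in the output above. The function name parse_feature_line is made up for this example.

```python
SEP = "\U0010001c"  # separator between file path and offset in the lines above


def parse_feature_line(line):
    """Parse one feature-file line; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    parts = line.split("\t", 2)
    if len(parts) < 2:
        return None
    location, feature = parts[0], parts[1]
    context = parts[2] if len(parts) > 2 else ""
    # Split the location into path and offset (assumed "path<SEP>-offset").
    path, _, offset = location.rpartition(SEP)
    return {
        "path": path,
        "offset": int(offset.lstrip("-")),
        "feature": feature,
        "context": context,
    }


example = (
    "/home/bcadmin/Desktop/foo/bar.txt" + SEP + "-7"
    + "\t" + "bar"
    + "\t" + r"cartel\x0Abarter\x0Ano re"
)
record = parse_feature_line(example)
```

Grouping the parsed records by path would give a quick list of which files contain which terms.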

You can, of course, do this just as easily with the same file setup using BEViewer. I just didn't want to include a bunch of screenshots, here.

Hope this helps.

Regards,

Kam

Brian Dietz

Oct 17, 2015, 2:23:03 PM
to BitCurator Users
Hey Kam,

That explanation is super useful. Thanks. I'm now trying to run this in the BE GUI, using a list of regular expressions to find SSNs in a test document that includes a few SSN patterns. I'm not getting the output I was expecting in the find.txt file. When I do a regex search in my text file in a text editor, the strings are found. Here's what I have in my "rlist.txt":

\d{9}
\d{3}-?\d{2}-?\d{4}

I also included the string "foo" in my test document and my regex list, and that string does appear in the find.txt output file.

Do I need to escape these differently?

Brian

Kam Woods

Oct 18, 2015, 6:26:25 PM
to bitcurat...@googlegroups.com
Hi Brian,

Without seeing your exact test files I couldn't say exactly what's going on. A similar test appears to work for me:

In rlist.txt, I have the following lines:

\d{9}
\d{3}-?\d{2}-?\d{4}

As before, I make a "foo" directory on the Desktop with a "bar.txt" file in it, and in that file I have the following lines:

This is some sample 812fake-68-34fake26 text.
There might be 813fake42fake3594 some numbers in here worth paying attention to.
SfakeSfakeN: 777fake-21-32fake45

(obviously my text file doesn't include the "fake" insertions - those are just here so this message doesn't get flagged by email and message board filters)

Now, I can run the following:

bulk_extractor -F ~/Desktop/rlist.txt -R ~/Desktop/foo -o ~/Desktop/beresults

and the find.txt output looks like this:

# BANNER FILE NOT PROVIDED (-b option)
# BULK_EXTRACTOR-Version: 1.6.0-dev ($Rev: 10844 $)
# Feature-Recorder: find
# Filename: /home/bcadmin/Desktop/foo
# Feature-File-Version: 1.1
/home/bcadmin/Desktop/foo/bar.txt<U+10001C>-20  812fake-68-34fake26      is some sample 812fake-68-34fake26 text.\x0AThere mig
/home/bcadmin/Desktop/foo/bar.txt<U+10001C>-53  813fake42fake3594       \x0AThere might be 813fake42fake3594 some numbers in
/home/bcadmin/Desktop/foo/bar.txt<U+10001C>-116 777fake-21-32fake45     ention to.\x0ASfakeSfakeN: 777fake-21-32fake45

(again without the fake insertions).

So, I'm not sure what your problem might be. Do you have additional spaces? Line breaks? Those very simple regular expressions you're using won't accommodate those. If that's not the issue, you may need to play around a bit more with your sample data to narrow down the issue.
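
To illustrate the point about spaces and line breaks, here is a small check using Python's re module. Python's regex syntax may not behave identically to bulk_extractor's, so treat this as a sketch only. The second pattern is a hypothetical broadened variant that is not from this thread, and the sample strings are made-up placeholder numbers.

```python
import re

# The hyphen-optional pattern from rlist.txt.
STRICT = re.compile(r"\d{3}-?\d{2}-?\d{4}")
# Hypothetical variant that also allows one whitespace character as a separator.
LOOSE = re.compile(r"\d{3}[-\s]?\d{2}[-\s]?\d{4}")

samples = [
    "000-12-3456",     # hyphenated
    "000123456",       # no separators
    "000 12 3456",     # spaces instead of hyphens
    "000-12-\n3456",   # hyphen followed by a line break
]

strict_hits = [s for s in samples if STRICT.search(s)]
loose_hits = [s for s in samples if LOOSE.search(s)]
```

Here the original pattern matches only the first two samples, the broadened one also matches the space-separated sample, and neither matches the number that is split across a line break.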

Regards,

Kam

Eira Tansey

Oct 19, 2015, 11:07:20 AM
to BitCurator Users
Hi Kam,

I think this is consistent with the answer you gave to Brian, but I tried a few variations of this, and using this command:

bcadmin@ubuntu:~$ bulk_extractor -F ~/Desktop/rlist.txt -R ~/Desktop/foo -o ~/Desktop/beresults

seems to (somewhat) result in what I'm looking for. I just want to make sure that without using the alert_list (in other words, without use of -r ~/Desktop/alert_list.txt and just using the arguments found in rlist (keyword list)), that I'm not going to miss anything else.

I guess I am still a bit confused about the difference in intended purpose between the two lists -- is alert_list a very specific file that BE expects to find certain text patterns in (e.g. an email address or hyphenated set of numbers), as opposed to text string keywords?

Thanks,
Eira


On Wednesday, October 14, 2015 at 3:12:57 PM UTC-4, Kam Woods wrote: