Bulk Extractor and Identifying Social Security Numbers?


Carol Kussmann

Sep 11, 2014, 3:59:45 PM
to bitcurat...@googlegroups.com
I have been trying to use Bulk Extractor within BitCurator to identify Social Security Numbers. For a variety of reasons, I ended up running Bulk Extractor on a "Directory of Files" rather than a disk image. This directory of files was on a USB flash drive. The files are OCRed PDFs.

I ran Bulk Extractor and it produced the regular set of reports. It found URLs, email addresses, and phone numbers, but no Social Security Numbers. This group of 72 files contains at least 40 known Social Security Numbers.

Has anyone else had this experience, where SSNs are not identified even though there are known instances? Any suggestions?

Are there any other programs you use to identify SSNs specifically? (Using Spider I found the 40+ numbers.)

Thank you,
Carol 



--


Carol Kussmann
Digital Preservation Analyst
Digital Preservation and Repository Technologies | University of Minnesota Libraries
499 Wilson Library, 309 19th Avenue South, Minneapolis, MN 55455

Kam Woods

Sep 11, 2014, 4:29:44 PM
to bitcurat...@googlegroups.com
Hi Carol,

The default settings for bulk_extractor flag only those SSNs that are prefixed with "SSN" and have dashes between the three groups of digits (this is "mode 0"). There are two other modes available: "mode 1", which flags SSNs that are not necessarily prefixed by "SSN", and "mode 2", which flags SSNs that are neither necessarily prefixed by "SSN" nor necessarily have dashes.

You can pass bulk_extractor the "-S ssn_mode=#" option (where # is the number of the mode) on the command line to activate either of these modes. Note that, as with quite a few other useful features of bulk_extractor (which is primarily a command-line tool), this option is not available via the GUI.
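As a rough sketch of what such a command line might look like for a directory of files, the Python snippet below only assembles the argument list and prints it; it does not run anything. It assumes the mode is passed in the form `-S ssn_mode=N`, that `-o` names the output directory, and that `-R` tells bulk_extractor to recurse through a directory; please confirm all three against `bulk_extractor -h` for your installed version. The helper name and both paths are hypothetical.

```python
from typing import List


def build_bulk_extractor_cmd(input_dir: str, output_dir: str, ssn_mode: int) -> List[str]:
    """Assemble (but do not run) a bulk_extractor command line.

    Assumptions to verify with `bulk_extractor -h`:
      -S ssn_mode=N  selects the SSN recognition mode (0, 1, or 2)
      -o DIR         sets the output (report) directory
      -R DIR         recurses through a directory of files
    """
    if ssn_mode not in (0, 1, 2):
        raise ValueError("ssn_mode must be 0, 1, or 2")
    return [
        "bulk_extractor",
        "-S", f"ssn_mode={ssn_mode}",
        "-o", output_dir,
        "-R", input_dir,
    ]


# Hypothetical paths: a folder of OCRed PDFs on a USB drive, and a new report folder.
cmd = build_bulk_extractor_cmd("/media/usb/pdfs", "be_output_mode2", 2)
print(" ".join(cmd))
```

Once the printed command looks right, it could be pasted into a terminal or handed to `subprocess.run(cmd, check=True)`.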

Kam


--
You received this message because you are subscribed to the Google Groups "BitCurator Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bitcurator-use...@googlegroups.com.
To post to this group, send email to bitcurat...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bitcurator-users/CAALj18hAp%3Ddgcn5mAcMQRSOA763iLKiXmi82ad93gOTJxPLrVQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Tracy P.

Aug 1, 2017, 5:18:15 PM
to BitCurator Users
Hi All,

I realize this question is a few years old, but I thought I'd ask anyway in case others are running into the same thing: known Social Security Numbers are being omitted from my scans. I have a few questions. With the 0, 1, and 2 mode options, is it possible to apply every mode in a single search, or does the scanner need to be run three times, once per mode? Is it possible to edit the scanners that the GUI uses? Finally, is it possible to write custom scanners?

Thanks,
Tracy Popp

Matthew Disregardmatthew Farrell

Aug 2, 2017, 2:28:40 PM
to bitcurat...@googlegroups.com
Hey - 

I can't speak with authority on ssn-mode, but my understanding (based on this post in the bulk_extractor_users group) is:
0: requires the feature to be prefixed with "SSN" and to have hyphens.
1: does not require the "SSN" prefix, but does require hyphens.
2: requires neither hyphens nor the "SSN" prefix, but will also identify features from modes 0 and 1.
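To make the difference between the three modes concrete, here is a small Python sketch. The regular expressions are my own rough approximations of the behavior described above, not the patterns bulk_extractor actually uses, and the sample numbers are made up.

```python
import re

# Illustrative approximations only; bulk_extractor's real patterns differ.
SSN_PATTERNS = {
    0: re.compile(r"\bSSN:?\s*(\d{3}-\d{2}-\d{4})\b"),  # "SSN" prefix and hyphens
    1: re.compile(r"\b(\d{3}-\d{2}-\d{4})\b"),          # hyphens, no prefix needed
    2: re.compile(r"\b(\d{3}-?\d{2}-?\d{4})\b"),        # hyphens optional, no prefix needed
}


def find_ssns(text: str, mode: int) -> list:
    """Return the SSN-like strings that the given mode's pattern matches."""
    return [m.group(1) for m in SSN_PATTERNS[mode].finditer(text)]


# Made-up sample text with one prefixed, one unprefixed, and one undashed number.
sample = "SSN: 123-45-6789; other 987-65-4321; undashed 111223333"
for mode in (0, 1, 2):
    print(mode, find_ssns(sample, mode))
```

In this sketch each higher mode matches a superset of the previous one, which is consistent with a single run in mode 2 also picking up what modes 0 and 1 would find, at the cost of more false positives (any nine-digit run will match).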

In any event, you can set up your own custom regular expression search from the GUI or the CLI. On the CLI, you either use -F <file of regular expression(s)> or -f <single regular expression>. In the GUI, the options are found among the General Options part of the Run bulk_extractor window. Results from these options will appear in the feature file "find.txt."
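A minimal sketch of the `-F` route, again in Python and again only building the command rather than running it. It assumes the pattern file holds one regular expression per line and that bulk_extractor accepts this bracket-and-brace syntax; both are worth checking against the bulk_extractor documentation, since its regex dialect may not match Python's. The helper names, file name, patterns, and paths are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List

# Hypothetical patterns: a dashed SSN shape and a bare nine-digit run.
PATTERNS = [
    r"[0-9]{3}-[0-9]{2}-[0-9]{4}",
    r"[0-9]{9}",
]


def write_pattern_file(path: Path, patterns: List[str]) -> Path:
    """Write one regular expression per line (assumed format for -F)."""
    path.write_text("\n".join(patterns) + "\n", encoding="utf-8")
    return path


def build_find_cmd(pattern_file: Path, input_dir: str, output_dir: str) -> List[str]:
    """Assemble (but do not run) a bulk_extractor regex search.

    -F is the flag named in the post above; -o (output directory) and
    -R (recurse through a directory) are assumptions to verify with -h.
    """
    return ["bulk_extractor", "-F", str(pattern_file), "-o", output_dir, "-R", input_dir]


with tempfile.TemporaryDirectory() as tmp:
    pattern_file = write_pattern_file(Path(tmp) / "ssn_patterns.txt", PATTERNS)
    written = pattern_file.read_text(encoding="utf-8").splitlines()
    cmd = build_find_cmd(pattern_file, "/media/usb/pdfs", "be_output_find")

print(" ".join(cmd))
```

After a real run, matches from these patterns would be expected in find.txt in the output directory, as described above.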

I believe you can write plugins for bulk_extractor as well, but I have not explored this as an option.




Tracy Popp

Aug 7, 2017, 4:51:00 PM
to bitcurat...@googlegroups.com
Thanks, Farrell!

I've received a few good tips on using regular expressions and will be trying them out shortly.

Best,
Tracy
