CCN recognition in Bulk Extractor

Dan Noonan at The OSU

Jun 29, 2023, 9:14:09 AM
to BitCurator Users
Hi: I am having a problem understanding how Bulk Extractor (BE) recognizes credit card numbers (CCNs).

We have been testing BE as a means to identify PII including SSNs, CCNs and our BuckID (which is like a CCN, but without spaces or dashes).

I have created a test document with mock SSNs, CCNs and BuckIDs. The CCNs were created as a 16-character string as well as broken into sets of 4 by spaces or dashes.

We are using BE 1.6.0 in a Windows 10 environment. In the image below you can see which scanners we are running, including accts, which is what should catch the CCNs (as well as BuckIDs, since each is a 16-character numeric string). It doesn't, though; we don't get anything in ccn.txt.

We have set Use Settable Options = ssn_mode=1, which does capture the SSNs and provides data in pii.txt.

Additionally, I have created (copied) a regex to identify, at a minimum, the BuckID, and it seems to be working: Use Find Regex Text = \b\d{16}\b

BE-Example-20230629.PNG

Any thoughts on what we are doing wrong in trying to recognize CCNs? My concern is that if we are not doing something correctly, we will have a potential exposure in the future that we may have missed.

Thanks -- Dan

Daniel W. Noonan

Associate Professor/Digital Preservation Librarian

The Ohio State University | University Libraries

614.247.2425 Office

Matthew Farrell

Jun 29, 2023, 10:03:59 AM
to bitcurat...@googlegroups.com
Hi Dan -

I believe this is because the accounts scanner employs logic to filter out fake credit card numbers in order to reduce the number of false positives reported. I assume this behavior would also have the scanner ignore BuckIDs.

I confess this is not well documented in the tool’s documentation itself. I found discussions of CCN false positives on the bulk_extractor users Google group (which is more focused on development and testing), such as this thread (https://groups.google.com/g/bulk_extractor-users/c/FXvFluEi9Sg/m/NyuXnXN3SvsJ), which discusses some of the logic that goes into validating CCNs before reporting them.

I could be overlooking something too so if others have additional info or corrections, please chime in. 

Best,
farrell 

--
You received this message because you are subscribed to the Google Groups "BitCurator Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bitcurator-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bitcurator-users/f2f49119-8159-4848-92d2-e3ae7bde330dn%40googlegroups.com.

Lee, Cal

Jul 1, 2023, 10:37:04 AM
to bitcurat...@googlegroups.com

The accounts scanner in Bulk Extractor looks for strings of numbers of a given length but then takes further steps to reduce false positives.

 

In the case of credit card numbers, this includes determining whether the string of numbers conforms to the Luhn algorithm.

https://en.wikipedia.org/wiki/Luhn_algorithm

 

So if you feed it a bunch of 16-digit numbers, it will NOT report any that fail the additional Luhn test.  As many people on this list know, this still doesn't eliminate all false positives, but it can significantly reduce them.  You're still likely, for example, to find a lot of number sequences within the bowels of PDF files that weren't recorded to serve as CCNs but pass all the tests (could be used as valid CCNs).
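For anyone building test data, the Luhn check described above can be sketched in a few lines of Python. This is only an illustration of the checksum itself, not Bulk Extractor's code, and the function name luhn_valid is made up for this example. The accounts scanner may apply other tests on top of this one.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` satisfy the Luhn checksum.

    Illustrative helper only (hypothetical name); spaces and dashes are ignored.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0
```

The widely used test number 4111 1111 1111 1111 passes this check, while an arbitrary string such as 1234567890123456 does not, which is one plausible reason hand-typed mock numbers never reach ccn.txt.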

 

If you want to force Bulk Extractor to skip additional validation, you could instead use a regular expression to just look for numbers that conform to the right number of digits (no Luhn test), but you're likely to generate a very large number of false positives.
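As a rough sketch of the "digits only, no Luhn test" approach, the Python pattern below matches 16 digits written either as one run or as four groups of four with a consistent space or dash separator. The names CANDIDATE_16 and find_candidates are invented for this example, and it is an assumption that a similar pattern would work in Bulk Extractor's Find Regex Text field; in particular, whether its regex engine accepts the \1 backreference has not been checked here.

```python
import re

# 16 digits, optionally split into groups of 4 by the SAME separator
# (space or dash). Group 1 captures the separator, possibly empty.
CANDIDATE_16 = re.compile(r"\b\d{4}([ -]?)\d{4}\1\d{4}\1\d{4}\b")


def find_candidates(text: str) -> list[str]:
    """Return every 16-digit candidate in `text`, with no checksum validation."""
    return [match.group(0) for match in CANDIDATE_16.finditer(text)]
```

Because nothing is validated, every order number, timestamp, or identifier of the right length will be reported, so expect the hit list to need manual review.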

 

You'll similarly run into a lot of false positives if you tell Bulk Extractor to just report any 9-digit number as a social security number (SSN).  Unfortunately, since "randomization" in 2011 (https://www.ssa.gov/employer/randomization.html), there's no simple test for whether a number is a valid SSN (besides excluding those that start with the prohibited sequences 000, 666, and 900-999). 
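The prefix rule mentioned above is simple enough to show as a short Python sketch. It only encodes the three prohibited leading ranges named in this message (000, 666, and 900-999); the function name ssn_area_allowed is hypothetical, and this is not how Bulk Extractor itself is implemented.

```python
def ssn_area_allowed(ssn: str) -> bool:
    """Return True if a 9-digit SSN candidate does not start with 000, 666, or 900-999.

    Illustrative helper only; dashes and spaces are ignored.
    """
    digits = "".join(ch for ch in ssn if ch.isdigit())
    if len(digits) != 9:
        return False
    area = int(digits[:3])  # first three digits
    return area != 0 and area != 666 and area < 900
```

Passing this test says very little on its own: the large majority of random 9-digit numbers also pass, which is why the surrounding context matters so much.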

 

However, by analyzing a lot of real data from drives, Simson Garfinkel found that many real (not false positive) social security numbers are preceded by the letters "SSN."  For example, it used to be quite common for people to include the number of their resume or CV, but this was most often preceded by those three letters.  This is why in "strict" mode, the accounts scanner will only report cases in which the 9-digit number is followed by "SSN."  You could instead set it to "medium" (no "SSN" prefix required, but dashes are required) or "lenient" (no dashes required; allow any 9-digit number that matches SSN allocation range).  This can potentially catch some additional real cases of SSNs but at the cost of numerous false alarms.
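To make the trade-off between the three modes concrete, here is a loose Python approximation of each one as a regular expression. These patterns are assumptions written for illustration, not the scanner's real rules: "strict" is modeled as "SSN" appearing just before the number (up to three non-word characters in between), and the allocation-range check that "lenient" mentions is left out for brevity. The names SSN_PATTERNS and find_ssns are made up.

```python
import re

# Hypothetical stand-ins for the three strictness levels described above.
SSN_PATTERNS = {
    # "SSN" label immediately before a dashed number
    "strict": re.compile(r"SSN\W{0,3}(\d{3}-\d{2}-\d{4})\b"),
    # no label needed, but dashes are required
    "medium": re.compile(r"\b(\d{3}-\d{2}-\d{4})\b"),
    # dashes optional: any 9-digit run in 3-2-4 layout
    "lenient": re.compile(r"\b(\d{3}-?\d{2}-?\d{4})\b"),
}


def find_ssns(text: str, mode: str = "strict") -> list[str]:
    """Return SSN-like strings found in `text` under the given mode."""
    return [match.group(1) for match in SSN_PATTERNS[mode].finditer(text)]
```

Run against the same text, each looser mode returns everything the stricter one did plus more, which is exactly where the extra false alarms come from.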

 

According to OSU's documentation, the BuckID number is "randomly generated and changes every time you receive a new card" and can be either 16 or 19 digits.

https://buckid.osu.edu/faqs/details/28

 

This creates a situation similar to SSNs, but a bit more messy because there are two different numbers of digits to test for.  Similar to the modes for finding SSNs, you could try to determine if there are any other strings that tend to appear along with BuckID numbers.  For example, maybe they're usually preceded by "BuckID."  Or maybe they're usually broken into predictable groups of digits separated by dashes.  You could build these conditions into a regular expression and run that through Bulk Extractor along with the built-in scanners.
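One way to prototype such a context-aware pattern before putting it into Bulk Extractor is to try it in Python first. The sketch below assumes, purely for illustration, that BuckID numbers are usually preceded by the label "BuckID"; that assumption would need to be checked against real documents. BUCKID_WITH_LABEL and find_labeled_buckids are hypothetical names, and the pattern may need adjusting for Bulk Extractor's own regex syntax.

```python
import re

# Label "BuckID" (any case), up to three non-word characters, then exactly
# 19 or exactly 16 digits. The 19-digit alternative is tried first.
BUCKID_WITH_LABEL = re.compile(r"BuckID\W{0,3}(\d{19}|\d{16})\b", re.IGNORECASE)


def find_labeled_buckids(text: str) -> list[str]:
    """Return 16- or 19-digit numbers that directly follow a 'BuckID' label."""
    return [match.group(1) for match in BUCKID_WITH_LABEL.finditer(text)]
```

Compared with a bare \b\d{16}\b, this trades recall for precision: unlabeled BuckIDs are missed, but unrelated 16-digit strings are no longer reported.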

 

- Cal    

 


Lee, Cal

Jul 1, 2023, 10:42:40 AM
to bitcurat...@googlegroups.com

Sorry for a couple typos:

 

> For example, it used to be quite common for people to include the number of their resume or CV,

> but this was most often preceded by those three letters.  This is why in "strict" mode, the

> accounts scanner will only report cases in which the 9-digit number  is followed by "SSN."

 

Should instead have said:

 

For example, it used to be quite common for people to include the number on their resume or CV, but this was most often preceded by those three letters.  This is why in "strict" mode, the accounts scanner will only report cases in which the 9-digit number is preceded by "SSN."

 

- Cal

Dan Noonan at The OSU

Jul 5, 2023, 11:36:43 AM
to BitCurator Users
Hi Matt & Cal: Thanks so much for your timely and thoughtful/thorough responses; this is very helpful as we begin to implement this workflow.

Thanks - Dan