NYU Poly Student Working on Bulk_Extractor CCN False Positives Project Would Like Your Inputs

Molly Morgan

Aug 7, 2012, 9:39:33 PM

to bulk_extra...@googlegroups.com

Hi everyone! I am developing ideas for improving the CCN identification process that Bulk_Extractor currently uses. I have a large list of questions that I would be thrilled to get b_e users group input on. My questions fall into two main categories: (1) create an external post-processing step, or (2) make an internal change to the current code.

I will list the questions in this post but would be more than happy to list each as an individual post if that would help - just let me know. I am on a tight timeline and would appreciate any and all suggestions for tackling the design and implementation of a solution (or several solutions).

Here goes:

Question #1: Do you know of any research that demonstrates how many numbers returned from B_E CCN analyses are mathematically valid but still not real CCNs?

After studying B_E, I could not tell if there was a context-sensitive stop list being utilized (I am examining version 1.2).

Question #2: Do you know if there is already a context sensitive stop list for B_E CCN analysis?

Question #3: Do you think that a python post processor is a good approach?

Question #4: Would the fiwalk application perhaps be the best approach?

Question #5: Do you think that manipulating / increasing the analysis within the internal B_E application code relating to scan_accts classes and functions is the best approach?

Currently, B_E is validating CCNs by checking (after parsing with Flex) the following on the carved out numbers: (1) extract_digits_and_test, (2) prefix_test, (3) ccv1_test, (4) pattern_test, (5) histogram_test, (6) before_window test, and (7) after_window test.

If any of the above fails, B_E reports the failure and suppresses the CCN. If all pass, B_E reports the number as a validated CCN.
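The "every test must pass, report the first failure" flow described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the bulk_extractor code: the three checks, their names, and their thresholds are hypothetical stand-ins and do not reproduce prefix_test, pattern_test, or histogram_test.

```python
from typing import Callable, List, Optional, Tuple

# Each entry is (name, predicate). All three predicates below are
# hypothetical placeholders, NOT the rules bulk_extractor applies.
Check = Tuple[str, Callable[[str], bool]]


def has_plausible_prefix(digits: str) -> bool:
    # Placeholder rule: first digit is one commonly seen on payment cards.
    return digits[:1] in {"3", "4", "5", "6"}


def has_plausible_length(digits: str) -> bool:
    # Placeholder rule: payment card numbers are usually 13 to 19 digits.
    return 13 <= len(digits) <= 19


def is_not_one_repeated_digit(digits: str) -> bool:
    # Placeholder rule: reject strings made of a single repeated digit.
    return len(set(digits)) > 1


CHECKS: List[Check] = [
    ("prefix", has_plausible_prefix),
    ("length", has_plausible_length),
    ("repetition", is_not_one_repeated_digit),
]


def first_failure(digits: str, checks: List[Check] = CHECKS) -> Optional[str]:
    """Return the name of the first failing check, or None if all pass."""
    for name, check in checks:
        if not check(digits):
            return name
    return None
```

Because the loop stops at the first failure, the order of the checks affects how much work is done per candidate but not which candidates are accepted.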

Obviously, B_E is returning validation of many numbers that are ending up NOT being CCNs even after meeting the above 7 criteria.

Question #6: Can I assume that the code written for all of the tests is written correctly, including correct logic?

Question #7: Could numbers be getting validated that are not actually passing all tests?

Question #8: Is there more contextual data surrounding the carved numbers that could bring greater accuracy (currently only 4 characters before and after are assessed for context)?

Question #9: Could the order of the testing logic be improved - is it occurring in the OPTIMAL order?

Question #10: Do we know if the CCN false positive problem is largely a Windows problem (lots of Windows PEs and DLLs have GUIDs that are passing all Luhn, length, and BIN tests)?

Question #11: What about the name of the file that the number is found in - would this assist with accurate identification (i.e., help decide whether the number is a true CCN or not)?

Question #12: Could it be that all CCNs covered by PCI are simply not being addressed / captured by B_E code?

Question #13: Would it be useful to use the Mars Banks Base and / or the ISO/IEC 7812-1:2006 to refine the B_E search for CCNs by comparing the obtained list of CCNs from analysis to either or both of these CCN number databases post B_E processing - sort of like a plugin or post processor for B_E?
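A post-processing step of the kind Question #13 describes might look like the following minimal Python sketch. It assumes a feature file in which comment lines start with "#" and data lines are tab-separated as offset, feature, context. The prefix tuple is an illustrative placeholder only, standing in for a real issuer-prefix database, and the function names are hypothetical.

```python
import re
from typing import Iterable, Iterator, Sequence, Tuple

# Illustrative placeholder only: a handful of widely known issuer prefixes.
# A real post-processor would load these from an issuer database.
ILLUSTRATIVE_PREFIXES = ("4", "51", "52", "53", "54", "55", "34", "37")

Feature = Tuple[str, str, str]  # (offset, feature, context)


def parse_feature_lines(lines: Iterable[str]) -> Iterator[Feature]:
    """Yield (offset, feature, context) from tab-separated lines.

    Assumes comment lines start with '#'; blank lines are skipped.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # not a well-formed feature line
        context = parts[2] if len(parts) > 2 else ""
        yield parts[0], parts[1], context


def keep_known_prefixes(
    lines: Iterable[str],
    prefixes: Sequence[str] = ILLUSTRATIVE_PREFIXES,
) -> Iterator[Feature]:
    """Keep only features whose digits start with one of the prefixes."""
    wanted = tuple(prefixes)
    for offset, feature, context in parse_feature_lines(lines):
        digits = re.sub(r"\D", "", feature)  # drop spaces and dashes
        if digits.startswith(wanted):
            yield offset, feature, context
```

A filter like this can only remove candidates. It cannot recover numbers that were never reported, and a number with a valid issuer prefix can still be a false positive.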

Simson Garfinkel

Aug 7, 2012, 9:48:19 PM

to bulk_extra...@googlegroups.com

Hi, Molly. You have a lot of good questions here. Before I answer them, I'd recommend that you look at the 1.3 beta, and that you read the bulk_extractor paper. Also, I have a presentation on bulk_extractor that I am giving tomorrow. It's two hours long and goes through all of the feature files. I will post a copy later tomorrow.

Simson

Simson Garfinkel

Aug 7, 2012, 10:12:57 PM

to bulk_extra...@googlegroups.com

>
> Question #1: Do you know of any research that demonstrates how many numbers returned from B_E CCN analyses are mathematically valid but still not real CCNs?

No, but the answer would be on the order of 10^14.

> After studying B_E, I could not tell if there was a context-sensitive stop list being utilized (I am examining version 1.2).

They are in 1.3; I don't think that they work properly in 1.2.

> Question #2: Do you know if there is already a context sensitive stop list for B_E CCN analysis?

No, I do not know.

> Question #3: Do you think that a python post processor is a good approach?

No, I do not.

> Question #4: Would the fiwalk application perhaps be the best approach?

Best approach for what?

> Question #5: Do you think that manipulating / increasing the analysis within the internal B_E application code relating to scan_accts classes and functions is the best approach?

What do you mean best approach?

> Currently, B_E is validating CCNs by checking (after parsing with Flex) the following on the carved out numbers: (1) extract_digits_and_test, (2) prefix_test, (3) ccv1_test, (4) pattern_test, (5) histogram_test, (6) before_window test, and (7) after_window test.
> If any of the above fails, B_E reports the failure and suppresses the CCN. If all pass, B_E reports the number as a validated CCN.
> Obviously, B_E is returning validation of many numbers that are ending up NOT being CCNs even after meeting the above 7 criteria.

Why obviously?

> Question #6: Can I assume that the code written for all of the tests is written correctly, including correct logic?

No, you should not assume that. I believe that it is formally undecidable if code is written correctly, including correct logic.

> Question #7: Could numbers be getting validated that are not actually passing all tests?

Yes.

> Question #8: Is there more contextual data surrounding the carved numbers that could bring greater accuracy (currently only 4 characters before and after are assessed for context)?

Probably.

> Question #9: Could the order of the testing logic be improved - is it occurring in the OPTIMAL order?

It is usually impossible to prove that something is Optimal.

> Question #10: Do we know if the CCN false positive problem is largely a Windows problem (lots of Windows PEs and DLLs have GUIDs that are passing all Luhn, length, and BIN tests)?

I don't know what problem you are referring to.

> Question #11: What about the name of the file that the number is found in - would this assist with accurate identification (ie, improve accurate identification of the number as a true CCN or not)?

Huh? bulk_extractor doesn't know about files.

> Question #12: Could it be that all CCNs covered by PCI are simply not being addressed / captured by B_E code?

This question makes no sense.

> Question #13: Would it be useful to use the Mars Banks Base and / or the ISO/IEC 7812-1:2006 to refine the B_E search for CCNs by comparing the obtained list of CCNs from analysis to either or both of these CCN number databases post B_E processing - sort of like a plugin or post processor for B_E?

Sure. Is Mars Bank Base freely available?

Molly Morgan

Aug 8, 2012, 11:23:34 PM

to bulk_extra...@googlegroups.com

OK - thank you. I will certainly take a look at 1.3. I come from a working world where the older releases tend to be more stable, so I cracked into that one first...

Molly Morgan

Aug 8, 2012, 11:42:53 PM

to bulk_extra...@googlegroups.com

OK - thank you for the input. I would like to clarify the points where I have not communicated well...

In my digital forensics course at NYU Polytechnic, most of the students in the class discovered false positive results in their bulk_extractor runs performed on images of our own machines. This is why I stated that, obviously, we were getting numbers that are passing all validation tests and yet are still NOT really CCNs.

I was interested in this and the hows and whys of it. For my semester project I chose to investigate ways of minimizing the number of false positives. That is the "best approach" that I keep referring to below: the best approach to reducing the occurrence of false positive CCNs when using bulk_extractor.

I have tried to clarify several statements made earlier below:

>> Question #10: Do we know if the CCN false positive problem is largely a Windows problem (lots of Windows PEs and DLLs have GUIDs that are passing all Luhn, length, and BIN tests)?

> I don't know what problem you are referring to.

I am curious to know if the false positive rate is higher when running bulk_extractor on a Windows image file versus a Linux image file.

>> Question #11: What about the name of the file that the number is found in - would this assist with accurate identification (i.e., help decide whether the number is a true CCN or not)?

> Huh? bulk_extractor doesn't know about files.

I am interested in the use of fiwalk and identify_filenames.py to gain contextual information that might assist in determining the legitimacy of a returned CCN from bulk_extractor.

>> Question #12: Could it be that all CCNs covered by PCI are simply not being addressed / captured by B_E code?

> This question makes no sense.

What I am trying to express is this: after studying how credit card numbers are created and the logic behind that creation (and the PCI industry in general), I recognized that there is a vast array of CCN forms that may not be captured by the validation functions in the code that I have examined. Because the problem occurs when numbers pass the Luhn, length, and BIN tests but are still not CCNs, I was attempting to find a way to rule out more of those numbers by adding more logic. I will work on how to communicate this. I am still not doing a good job.

Regards - I appreciate it.

Jeff Phillips

Aug 9, 2012, 3:27:32 PM

to bulk_extra...@googlegroups.com

I took a look at the ccn code a few days ago and was curious as to why the "sense was reversed" in the ccn validation checks, i.e. the doubling starts from the last digit and works backwards over every other digit. Did I read this wrong, or is this because of something I am missing about how the blocks are passed to the validator code? Since not all ccns are the same length, starting from the last digit and working backwards would give an invalid result on odd versus even length ccns.

The other piece of logic I didn't get was the histogram and before/after window. Can anyone enlighten me?

Jeff

Simson Garfinkel

Aug 9, 2012, 10:20:14 PM

to bulk_extra...@googlegroups.com

On Aug 9, 2012, at 3:27 PM, Jeff Phillips <je...@jeffphillips.com> wrote:

> I took a look at the ccn code a few days ago and was curious as to why the "sense was reversed" in the ccn validation checks? I.e. the doubling was occurring starting from the last digit and working backwards on every other digit. Did I read this wrong or is this because of something I am missing as to how the blocks are passed to the validator code? Since not all ccn are the same length starting from the last digit and working backwards would have an invalid result on odd versus even length ccns.

The algorithm is correct.

http://en.wikipedia.org/wiki/Luhn_algorithm

Informal explanation

The formula verifies a number against its included check digit, which is usually appended to a partial account number to generate the full account number. This account number must pass the following test:

1. Counting from the check digit, which is the rightmost, and moving left, double the value of every second digit.
2. Sum the digits of the products (e.g., 10: 1 + 0 = 1, 14: 1 + 4 = 5) together with the undoubled digits from the original number.
3. If the total modulo 10 is equal to 0 (if the total ends in zero) then the number is valid according to the Luhn formula; else it is not valid.

Assume an example of an account number "7992739871" that will have a check digit added, making it of the form 7992739871x:

That means you count from right to the left.
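For anyone who wants to try the right-to-left rule, here is a minimal Python sketch of the check as the quoted explanation describes it. It is not the bulk_extractor implementation, and the function name luhn_valid is made up for this example.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn check.

    Non-digit characters (spaces, dashes) are ignored.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False

    total = 0
    # Position 0 is the check digit (rightmost); every second digit
    # moving left from it (positions 1, 3, 5, ...) is doubled.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

Because positions are counted from the check digit, the same loop works for odd and even lengths: the 11-digit 79927398713 and the 16-digit test number 4111 1111 1111 1111 both pass, while 79927398710 fails.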

> The other piece of logic I didn't get was the histogram and before/after window. Can anyone enlighten me?

It's based on an observation of common false positives.

Jeff Phillips

Aug 10, 2012, 2:48:23 AM

to bulk_extra...@googlegroups.com

Ha, totally missed that first sentence when I read through the algorithm, thanks.
