Bulk Reviewer custom regex file syntax


Erica Donnis (she/her)

Oct 28, 2025, 12:32:59 PM
to bitcurat...@googlegroups.com
Hello,

My colleagues and I are working on a custom regex file for Bulk Reviewer but are running into lots of false positives when the characters of a listed term appear in the middle of a word. Example: we would like the system to flag "VIN" for vehicle identification number and are using the regular expression \bvin\b, but it is also catching every instance where "vin" appears in the middle of a word such as "diving." Is there a specific regex library that Bulk Reviewer prefers?

In looking into the online documentation, it appears that bulk_extractor has a scanner that uses lightgrep. The lightgrep documentation states that lightgrep does not support boundary assertions, which we understand to mean \b wouldn't work. Is this what's causing the problem? If so, is there a way around it?

Thank you in advance,

Erica


Erica Donnis (she/her)
Congressional Papers Archivist
Associate Professor

Silver Special Collections Library
University of Vermont
Room B201 - Billings Library | 48 University Place
Burlington, VT 05405 
(802) 656-0410 | erica.donnis@uvm.edu
libraries.uvm.edu/specialcollections.

UVM’s Our Common Ground Values:
Respect | Integrity | Innovation | Openness | Justice | Responsibility

UVM is subject to the Vermont Public Records Act and communications to and from this email address, including attachments, are subject to disclosure unless exempted under the Act or otherwise applicable law.

Simson Garfinkel

Oct 28, 2025, 6:10:20 PM
to bitcurat...@googlegroups.com
Hi. If you want a VIN recognizer, that should be done with a new scanner that knows the VIN validation algorithm. 

If this is exceedingly important, I would be happy to work with you or your programmer to write it and get it into the code base.
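
For readers unfamiliar with the validation step mentioned above, here is a minimal Python sketch of the check-digit test used for North American VINs (position 9 of 17). It is an illustration only, not bulk_extractor scanner code (scanners are written in C++), and the function name `vin_check_digit_ok` is hypothetical. VINs issued outside North America do not necessarily carry a valid check digit, so a real scanner would need further rules.

```python
# Sketch of the VIN check-digit test (position 9), assuming North American VINs.
# Letter values follow the standard transliteration table; I, O, and Q are not allowed.
TRANSLIT = {
    **{str(d): d for d in range(10)},
    **dict(zip("ABCDEFGH", range(1, 9))),
    **dict(zip("JKLMN", range(1, 6))),
    "P": 7,
    "R": 9,
    **dict(zip("STUVWXYZ", range(2, 10))),
}
# One weight per position; position 9 (the check digit itself) has weight 0.
WEIGHTS = (8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2)


def vin_check_digit_ok(candidate: str) -> bool:
    """Return True if candidate is 17 valid characters with a matching check digit."""
    vin = candidate.upper()
    if len(vin) != 17 or any(ch not in TRANSLIT for ch in vin):
        return False
    total = sum(TRANSLIT[ch] * w for ch, w in zip(vin, WEIGHTS))
    remainder = total % 11
    expected = "X" if remainder == 10 else str(remainder)
    return vin[8] == expected
```

A check like this rejects ordinary words outright (for example "diving" fails the length test), which is why an algorithmic recognizer produces far fewer false positives than a keyword regex.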

Simson 



Kam Woods

Oct 29, 2025, 11:08:46 AM
to BitCurator Users
It's worth noting for clarity that in current (5.x) BitCurator releases, bulk_extractor is not compiled with lightgrep support. The Find scanner uses the RE2 engine in this build, which *does* support \b for word boundaries. 

For good results in feature searches where there's an algorithmic verification, Simson's suggestion is the best path forward. But you may still be wondering about the general answer to the question "are \b boundary markers working as intended in RE2 as used by bulk_extractor the way it is compiled in BitCurator?" The answer is that they are, but with some qualifications.

In RE2, \b (along with \d, \s, and \w) works for ASCII characters only (see https://github.com/google/re2/issues/344). If you have an ASCII doc that includes the items "term TERM interminable" and pass it to either bulk-reviewer or bulk_extractor directly in BitCurator along with the regex "\bterm\b", only the first two items will match. But if you perform a similar test with pretty much any PDF, you'll get a bunch of overmatching. So the utility of this simple approach depends on the type of source material you're processing.
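
To make the ASCII-only behavior concrete, here is a small Python sketch. Python's `re` module is only a stand-in for RE2 here: its `re.ASCII` flag restricts \b and \w to ASCII in a comparable way, and the case-insensitive flag is an assumption made so that "TERM" matches as in the example above. The helper name `find_terms` is hypothetical. The accented string shows one way an ASCII-only \b can overmatch; it is not offered as the explanation for the PDF results.

```python
import re


def find_terms(text: str, ascii_only: bool) -> list[str]:
    """Find whole-word 'term' matches, optionally with ASCII-only word boundaries."""
    flags = re.IGNORECASE | (re.ASCII if ascii_only else 0)
    return re.findall(r"\bterm\b", text, flags)


sample = "term TERM interminable"
accented = "d\u00e9term\u00e9"  # "term" surrounded by the non-ASCII letter e-acute

print(find_terms(sample, ascii_only=True))     # "interminable" is skipped
print(find_terms(accented, ascii_only=True))   # e-acute counts as a non-word character
print(find_terms(accented, ascii_only=False))  # Unicode-aware \b sees one long word
```

With ASCII-only boundaries, any non-ASCII letter next to the term looks like a word break, so the term is reported even though it sits inside a longer word.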

Hope this bit of additional info is useful.

Kam

James Truitt

Nov 3, 2025, 9:49:11 AM
to BitCurator Users
Hi Kam,

Are look-arounds something that bulk_extractor supports in BitCurator 5.x?

I've also been dealing with a lot of false positives, and the lack of support for negative look-arounds in the custom regex file has been a major obstacle to reducing them. For example, if we wanted to flag discussions of stock trading, but not get false positives from the common phrase "take stock of", we could hypothetically use a PCRE like /(?<!take )\bstocks?\b(?! of)/, but this wasn't something that lightgrep supported.

Best,
James Truitt

Kam Woods

Nov 3, 2025, 2:19:51 PM
to bitcurat...@googlegroups.com
Hi James,

The bulk_extractor build in BitCurator is not (currently) compiled with lightgrep support, so it defaults to the RE2 engine. RE2 does not support lookarounds because it guarantees running time linear in the size of the input pattern and text. That guarantee could not be made (in its current form - there's some new-ish research out there about this) if it accepted constructs that require backtracking. Lightgrep *also* does not support constructs that require backtracking, as it is similarly geared towards high-performance, single-pass scans of large inputs.

One possible option would be to run the find scan with your regexes as-is (with an appropriately-sized context window) and then filter the results from the report with something like pcre2grep (found in pcre2-utils in Ubuntu), which *does* support lookarounds, backreferences, and non-capturing groups. It might be a good idea for us to just include pcre2-utils in BitCurator either way, in the future, as an additional set of tool options.
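
As a rough sketch of that two-pass idea, the Python below post-filters feature lines with James's lookaround pattern. Python's `re` is swapped in for pcre2grep here purely for illustration (it supports lookaheads and fixed-width lookbehinds). The code assumes the usual tab-separated feature-file layout of offset, feature, and context, with comment lines starting with "#"; the function name `filter_features` and the sample lines are hypothetical.

```python
import re

# James's example pattern: "stock(s)" not preceded by "take " and not followed by " of".
STOCK = re.compile(r"(?<!take )\bstocks?\b(?! of)", re.IGNORECASE)


def filter_features(lines, pattern=STOCK):
    """Keep feature lines whose context column still matches the stricter pattern."""
    kept = []
    for line in lines:
        if line.startswith("#"):  # skip comment/header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: offset <TAB> feature <TAB> context; fall back to the last field.
        context = fields[2] if len(fields) >= 3 else fields[-1]
        if pattern.search(context):
            kept.append(line)
    return kept


sample_lines = [
    "# hypothetical feature file excerpt",
    "100\tstock\tWe should take stock of inventory",
    "200\tstocks\tbought stocks yesterday",
]
for hit in filter_features(sample_lines):
    print(hit)
```

The filter can only see what the context window captured, so the window needs to be wide enough to include the words the lookarounds test for.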

Others may have better suggestions. For a broadly scoped search like this where the context may be pretty varied and the result cannot be algorithmically verified (unlike the VIN example), I imagine there are better approaches I'm not thinking of right now.

Kam

