Proposal: Expose FilteredRE2 atoms in Python bindings

Akshay Joshi

Dec 15, 2025, 9:32:53 PM
to re2...@googlegroups.com
Hi there,

I'd like to propose a small change to the Python bindings: making `Filter.Compile()` return the extracted literal atoms instead of `None`.

The C++ `FilteredRE2::Compile()` extracts atoms: literal strings that must appear in a text for a pattern to match. These are useful for building secondary indexes (Bloom filters, n-gram indexes) that prefilter candidates before running expensive regex matches.
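As a rough illustration of that prefiltering idea, here is a minimal sketch in plain Python. It uses the standard `re` module rather than RE2, and the atoms are written by hand because the bindings do not expose them yet. `PATTERNS`, `build_index`, `candidates`, and `search` are hypothetical names, and the sketch assumes each pattern simply needs all of its listed atoms, which may be simpler than the condition RE2 actually tracks.

```python
import re
from collections import defaultdict

# Hypothetical hand-written atoms: lowercase literals that a text must
# contain before the corresponding pattern can possibly match.
PATTERNS = {
    0: (re.compile(r"hello\s+world", re.IGNORECASE), ["hello", "world"]),
    1: (re.compile(r"error:\s*\d+", re.IGNORECASE), ["error:"]),
}

def build_index(docs):
    """Map each atom to the set of document ids that contain it."""
    index = defaultdict(set)
    atoms = {a for _, required in PATTERNS.values() for a in required}
    for doc_id, text in enumerate(docs):
        lowered = text.lower()
        for atom in atoms:
            if atom in lowered:
                index[atom].add(doc_id)
    return index

def candidates(index, pattern_id, num_docs):
    """Documents containing every atom required by the given pattern."""
    _, required = PATTERNS[pattern_id]
    result = set(range(num_docs))
    for atom in required:
        result &= index.get(atom, set())
    return result

def search(docs, pattern_id):
    """Run the real regex only on documents that pass the atom prefilter."""
    index = build_index(docs)
    regex, _ = PATTERNS[pattern_id]
    return sorted(d for d in candidates(index, pattern_id, len(docs))
                  if regex.search(docs[d]))

DOCS = ["Hello   World!", "hello there", "ERROR: 42", "world of errors"]
```

In this sketch the regex runs only on documents that contain every required atom, so the cheap substring index does most of the rejection work.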

The Python `Filter` class already computes these atoms internally but doesn't expose them: the Python wrapper discards the bool value and implicitly returns None.

The current return value is always None, so `if f.Compile():` is always false. Returning a list would make this truthy when atoms are extracted, which would arguably be more useful than the current return value. The only breaking case would be explicit `if result is None:` checks, which seems unlikely given the undocumented None return.
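To make the compatibility argument concrete, here is a small sketch using stand-in classes. `CurrentFilter`, `ProposedFilter`, and `describe` are hypothetical and do not touch the real bindings; they only model the return values discussed above.

```python
class CurrentFilter:
    """Stand-in for today's behavior: Compile() returns None implicitly."""

    def Compile(self):
        pass  # atoms are computed internally, then dropped


class ProposedFilter:
    """Stand-in for the proposal: Compile() returns the extracted atoms."""

    def __init__(self, atoms):
        self._atoms = list(atoms)

    def Compile(self):
        return list(self._atoms)


def describe(result):
    """Classify a Compile() result the way calling code might."""
    if result is None:
        return "none"
    return "atoms" if result else "empty"
```

One detail worth noting: an empty list is falsy in Python, so under the proposal `if f.Compile():` would still be false when no atoms are extracted, while an explicit `is None` check would change behavior in every case.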

Questions:

1. Is this the right approach, or would a separate `GetAtoms()` method be preferred?
2. Any concerns about storing atoms as a member variable?
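For comparison, the separate-method alternative from question 1 might look roughly like the sketch below. `FilterWithGetAtoms` is a hypothetical stand-in, not the real binding, and its atom extraction is a placeholder that just lowercases each pattern; the real atoms would come from the C++ layer.

```python
class FilterWithGetAtoms:
    """Hypothetical shape: Compile() keeps returning None, atoms are stored."""

    def __init__(self):
        self._patterns = []
        self._atoms = None  # unset until Compile() runs

    def Add(self, pattern):
        self._patterns.append(pattern)
        return len(self._patterns) - 1

    def Compile(self):
        # Placeholder extraction, for illustration only.
        self._atoms = [p.lower() for p in self._patterns]

    def GetAtoms(self):
        if self._atoms is None:
            raise RuntimeError("Compile() has not been called")
        # Return a copy so callers cannot mutate the stored state.
        return list(self._atoms)
```

This shape keeps the existing return value intact, at the cost of holding the atoms in a member variable and having to decide what `GetAtoms()` does before `Compile()` has run.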

Happy to send a PR if this sounds reasonable.

Thanks,

Akshay

Daniel McClanahan

Dec 20, 2025, 10:51:34 AM
to re2-dev
I think this is likely to be a great idea. It would make the RE2 python bindings easier to slot into large-scale text search, since users could rely on the build process that already generates the python dist rather than having to go through that build themselves.

As prior art, I have exposed these atoms in my Rust interface to RE2: re2::filtered::FilteredRE2Builder#compile (junyer told me he liked my crate). In order to overcome the issue you describe (extracting atoms but not exposing them very clearly), I added a small C++ stub file which is compiled in the build.rs for re2-sys: FilteredRE2Wrapper::compile().

I don't want to derail this proposal: I think having atoms exposed in the python bindings would be a great addition in its own right! But I think you have identified a general rough edge with the current C++ API, in that it tends to use the same C++ class definitions both for building and for searching. This produces a general sense of uncertainty around what is initialized when and where (e.g. the re2::RE2::ok() method). I linked the above (where I take special care to separate my rust struct re2::filtered::FilteredRE2Builder from my other rust struct re2::filtered::Filter) because it also demonstrates how we can make our return values meaningful--much like your great observation here:

> Returning a list would make this truthy when atoms are extracted, which would arguably be more useful than the current return value.

Just to be clear, I am not at all a qualified RE2 reviewer. But I was planning to eventually upstream the C++ API changes I had made in the stub file for my rust crate, and I wanted to highlight the possibility of a more general API improvement in the longer term here. Outside of lifetimes, there is nothing I'm doing in the rust crate that can't be done in C++ (and C++ has far more powerful metaprogramming than rust).

This is how I expose internal references from the compiled Filter object (re2::filtered::Filter#get_atoms):
pub fn get_atoms<'a>(&'a self, atoms: &'a MatchedSetInfo) -> impl ExactSizeIterator<Item = StringView<'a>>

StringView, for example, has a direct analogue in C++ land. And re2::set::MatchedSetInfo is just an FFI wrapper around a C++ std::vector<int>. It would honestly be much cleaner to do this in the C++ API itself, instead of having to create auxiliary rust wrappers and stubs. I just don't really know how to propose a drastic API change like this, since I'm very far removed from the people who rely upon RE2's current interface and don't know how it would affect them.

My approach in the re2 rust crate involves quite a few auxiliary struct definitions to codify the different intermediate and final states of each return value. In doing so, I take some degree of inspiration from regex-automata, the crate Andrew Gallant/burntsushi publishes which exposes implementation details for the rust regex crate.

I actually think that exposing atoms to the end user like the RE2 C++ API enables (not the python API yet, as you note) is something more string search projects should be looking to do. I am hoping to work on text search for a doctoral degree, and I had been scheming with junyer last year about how to consider text search problems as a composition of specialized tools (e.g. atom searching with a fast SIMD literal finder), as opposed to a monolithic "regex engine" interface which obscures the underlying mechanics from the user (who often has specialized requirements!). My re2 rust crate does not try to achieve this grand idea, and mostly just enforces a hard boundary between builders and searchers (e.g. re2::RE2::compile() will return an error instead of having to check ok()).
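A rough Python sketch of that builder/searcher boundary, using the standard `re` module as a stand-in for RE2. `FilterBuilder` and `CompiledFilter` are hypothetical names and this is not how the rust crate or the existing bindings are implemented; the point is only that a failed compile raises an error, so a searcher object never exists in a half-initialized state.

```python
import re
from dataclasses import dataclass


class FilterBuilder:
    """Collects patterns; deliberately has no search methods."""

    def __init__(self):
        self._patterns = []

    def add(self, pattern):
        self._patterns.append(pattern)
        return self

    def compile(self):
        """Return a ready-to-use searcher, or raise on a bad pattern."""
        try:
            compiled = tuple(re.compile(p) for p in self._patterns)
        except re.error as exc:
            raise ValueError(f"invalid pattern: {exc}") from exc
        return CompiledFilter(compiled)


@dataclass(frozen=True)
class CompiledFilter:
    """Immutable searcher; it only exists if compilation succeeded."""

    regexes: tuple

    def first_match(self, text):
        """Index of the first pattern that matches, or -1 if none does."""
        for i, regex in enumerate(self.regexes):
            if regex.search(text):
                return i
        return -1
```

With this split there is no equivalent of an `ok()` check: calling code either holds a working `CompiledFilter` or has already seen the error.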

Anyway, great observation, great analysis, and I definitely would love to see atoms exposed in the python API--no need to block on refactoring the C++ API too much before doing that part. And I will finally reiterate that I'm not an RE2 maintainer, so please do not take the above too seriously unless you want to.

Thanks,
danny mcC