I think this is a great idea: it would make the RE2 python bindings easier to slot into large-scale text search, since users could rely on the build process used to generate the python dist rather than having to go through that themselves.
I don't want to derail this proposal: I think having atoms exposed in the python bindings would be a great addition in its own right! But I think you have identified a general rough edge with the current C++ API, in that it tends to use the same C++ class definitions both for building and for searching. This produces a general sense of uncertainty around what is initialized when and where (e.g. the `re2::RE2::ok()` method). I linked the above (where I take special care to separate my rust struct `re2::filtered::FilteredRE2Builder` from my other rust struct `re2::filtered::Filter`) because it also demonstrates how we can make our return values meaningful--much like your great observation here:
> Returning a list would make this truthy when atoms are extracted, which would arguably be more useful than the current return value.
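To make the builder/searcher split a bit more concrete, here is a minimal Rust sketch of the general shape. Every name in it (`FilterBuilder`, `Filter`, `CompileError`, `first_match`) is hypothetical and is not the actual API of RE2 or of the re2 rust crate. Patterns are treated as plain substrings and "atoms" are just the lowercased patterns, purely as a stand-in for real regex compilation and atom extraction.

```rust
// Hypothetical sketch only: these types do not exist in RE2 or the re2 crate.

/// Error returned when a pattern is rejected at build time.
#[derive(Debug, PartialEq)]
pub struct CompileError {
    pub pattern: String,
    pub reason: &'static str,
}

/// Build-only state: patterns can be added, but nothing can be searched yet.
#[derive(Default)]
pub struct FilterBuilder {
    patterns: Vec<String>,
}

impl FilterBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns the index of the added pattern, or an error.
    /// Stand-in validation: only empty patterns are rejected.
    pub fn add(&mut self, pattern: &str) -> Result<usize, CompileError> {
        if pattern.is_empty() {
            return Err(CompileError {
                pattern: pattern.to_string(),
                reason: "empty pattern",
            });
        }
        self.patterns.push(pattern.to_string());
        Ok(self.patterns.len() - 1)
    }

    /// Consumes the builder, so nothing can be added after compilation.
    /// The atoms are returned directly instead of through an out-parameter.
    pub fn compile(self) -> (Filter, Vec<String>) {
        // Stand-in for atom extraction: lowercase each pattern.
        let atoms: Vec<String> = self.patterns.iter().map(|p| p.to_lowercase()).collect();
        (Filter { patterns: self.patterns }, atoms)
    }
}

/// Search-only state: it can only be obtained from `FilterBuilder::compile`.
pub struct Filter {
    patterns: Vec<String>,
}

impl Filter {
    /// Index of the first pattern found in `text` (substring stand-in).
    pub fn first_match(&self, text: &str) -> Option<usize> {
        self.patterns.iter().position(|p| text.contains(p.as_str()))
    }
}
```

Because `compile` takes `self` by value, calling `add` afterwards is a compile-time error rather than a runtime state check, and the caller gets the atoms as an ordinary return value.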
Just to be clear, I am not at all a qualified RE2 reviewer. But I was planning to eventually upstream the C++ API changes I had made in the stub file for my rust crate, and I wanted to highlight the possibility of a more general API improvement in the longer term here. Outside of lifetimes, there is nothing I'm doing in the rust crate that can't be done in C++ (and C++ has far more powerful metaprogramming than rust). `StringView`, for example, has a direct analogue in C++ land. And `re2::set::MatchedSetInfo` is just an FFI wrapper around a C++ `std::vector<int>`. It would honestly be much cleaner to do this in the C++ API itself, instead of having to create auxiliary rust wrappers and stubs. I just don't really know how to propose a drastic API change like this, since I'm very far removed from the people who rely upon RE2's current interface and don't know how it would affect them.
My approach in the re2 rust crate involves quite a few auxiliary struct definitions to codify the different intermediate and final states of each return value. In doing so, I take some degree of inspiration from `regex-automata`, the crate Andrew Gallant/burntsushi publishes which exposes implementation details for the rust regex crate.
I actually think that exposing atoms to the end user, as the RE2 C++ API does (not the python API yet, as you note), is something more string search projects should be looking to do. I am hoping to work on text search for a doctoral degree, and I had been scheming with junyer last year about how to consider text search problems as a composition of specialized tools (e.g. atom searching with a fast SIMD literal finder), as opposed to a monolithic "regex engine" interface which obscures the underlying mechanics from the user (who often has specialized requirements!). My re2 rust crate does not try to achieve this grand idea, and mostly just enforces a hard boundary between builders and searchers (e.g. `re2::RE2::compile()` will return an error instead of having to check `ok()`).
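As a rough illustration of the "composition of specialized tools" idea, the sketch below splits a search into a literal atom pass followed by a verification pass that only runs on candidates whose required atoms were all found. This is an assumption-laden toy, not how RE2 or the re2 crate is implemented: `Candidate`, `matched_atoms` and `search` are hypothetical names, `str::contains` stands in for a fast SIMD literal finder, and the `verify` function pointer stands in for a full regex match.

```rust
// Hypothetical sketch: a two-stage search built from separate pieces.

/// One pattern, described by the atoms it requires and a verifier.
pub struct Candidate {
    /// Indices into the shared atom list; all must occur in the text.
    pub required_atoms: Vec<usize>,
    /// Stand-in for running the full regex on the text.
    pub verify: fn(&str) -> bool,
}

/// Stage 1: literal search. Returns indices of the atoms present in `text`.
/// `str::contains` is a placeholder for a specialized literal finder.
pub fn matched_atoms(atoms: &[&str], text: &str) -> Vec<usize> {
    atoms
        .iter()
        .enumerate()
        .filter(|(_, atom)| text.contains(**atom))
        .map(|(i, _)| i)
        .collect()
}

/// Stage 2: verify only the candidates that survive the atom filter.
/// Returns the indices of the candidates that match `text`.
pub fn search(candidates: &[Candidate], atoms: &[&str], text: &str) -> Vec<usize> {
    let hits = matched_atoms(atoms, text);
    candidates
        .iter()
        .enumerate()
        // Cheap rejection: skip candidates missing any required atom.
        .filter(|(_, c)| c.required_atoms.iter().all(|a| hits.contains(a)))
        // Expensive confirmation, run only on the survivors.
        .filter(|(_, c)| (c.verify)(text))
        .map(|(i, _)| i)
        .collect()
}
```

The point of keeping the two stages as separate functions is that a user with specialized requirements could swap out either one (say, a different literal finder) without touching the other.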
Anyway, great observation, great analysis, and I definitely would love to see atoms exposed in the python API--no need to block on refactoring the C++ API too much before doing that part. And I will finally reiterate that I'm not an RE2 maintainer, so please do not take the above too seriously unless you want to.
Thanks,
danny mcC