I'm looking at using re2j. My specific use case involves searching for a single pattern in a very large number of strings. Each string is typically 1 to 500 characters long, and is stored in UTF-8 form in a byte[]. There may be up to ~10,000,000 strings to search at once, totalling a few gigabytes, all held in RAM.
Is it possible to invoke re2j directly on UTF-8 data? I see some hints of this in the code, e.g. the UTF8Input class, but there doesn't seem to be any way to reach it through the public API.
My goal is to minimize runtime overhead, and memory allocation in particular. As a temporary hack, I've created an implementation of CharSequence that works directly from a byte[], but so far I've only implemented the easy case where there are no multi-byte characters. If re2j supports UTF-8 input directly, that would seem to be the best route.
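For illustration, a minimal sketch of such an ASCII-only wrapper might look like the following. The class name AsciiByteSequence and its constructor are hypothetical and not part of re2j; the sketch assumes every byte in the slice is below 0x80 and fails loudly otherwise, so it does not address the multi-byte case.

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical ASCII-only CharSequence view over a slice of a byte[].
 * Not part of re2j. No bytes are copied; each char is read on demand.
 */
final class AsciiByteSequence implements CharSequence {
    private final byte[] data;
    private final int offset;
    private final int length;

    AsciiByteSequence(byte[] data, int offset, int length) {
        // Overflow-safe bounds check on the requested slice.
        if (offset < 0 || length < 0 || offset > data.length - length) {
            throw new IndexOutOfBoundsException();
        }
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException();
        }
        byte b = data[offset + index];
        // A negative byte is >= 0x80, i.e. part of a multi-byte UTF-8 sequence.
        if (b < 0) {
            throw new IllegalStateException("non-ASCII byte at index " + index);
        }
        return (char) b;
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException();
        }
        // Shares the same backing array; only the window changes.
        return new AsciiByteSequence(data, offset + start, end - start);
    }

    @Override
    public String toString() {
        // Allocates a String; intended for debugging, not for the hot path.
        return new String(data, offset, length, StandardCharsets.US_ASCII);
    }
}
```

An instance can be handed to any matcher that accepts a CharSequence, for example `pattern.matcher(new AsciiByteSequence(buf, off, len))`. The wrapper object itself is still allocated once per string unless it is made mutable and reused.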
Steve