Crashes in pcre2_match_16 with binary data

Thomas Tempelmann

Jan 11, 2023, 5:17:04 AM

to pcre...@googlegroups.com

In 2020 I asked for help with searching for UTF-8 text in binary data (i.e. arbitrary files on a disk). I got some advice, and it all worked well.

Now I expanded my code to also search for UTF-16 text in the same data.

So I built the lib with both the 8-bit and 16-bit functions, and created separate pcre2_code, match_data, context, etc. objects using the _8 and _16 suffixes instead of the default macros.

And it all works fine in small test cases: it finds both UTF-8 and UTF-16 strings in files.

However, I now get crashes in the pcre2_match_16() function when searching large files, whereas I never get them in the _8 function. And it happens both with JIT and without.

I also use the same options for both versions, of course.

This is with PCRE2 v10.42.

I ruled out a mix-up between the _8 and _16 structs by only using the _16 code, and I don't use concurrent threads.

The crashes are a bit random, i.e. certain files crash often but not always.

But within 5 seconds of scanning random files on my disk, I always get a crash.

Since I use a built lib, I cannot easily look at the source code where it crashes.

I wonder if there are cmdline tools I can use for testing in order to rule out a mistake on my end. But it seems that pcre2grep does not support UTF-16 search, right? Or do I have to build the tool with special options first?

--
Thomas Tempelmann, http://apps.tempel.org/

Jeffrey Walton

Jan 11, 2023, 8:32:30 AM

to Thomas Tempelmann, pcre...@googlegroups.com

Rebuild your program and pcre2 with Asan via -fsanitize=address. You
can also use Valgrind, but you need to rebuild your program and pcre2
with -O1 (or -O0). In both cases you should trigger a memory error
based on your description of the problem.

If you are going to rebuild for testing purposes, you may as well use
Asan as it is a bit easier than Valgrind.

Jeff

Thomas Tempelmann

Jan 11, 2023, 10:04:37 AM

to PCRE2 discussion list

Jeff,

> Rebuild your program and pcre2 with Asan via -fsanitize=address.

For what purpose?

> You can also use Valgrind, but you need to rebuild your program and pcre2
> with -O1 (or -O0). In both cases you should trigger a memory error
> based on your description of the problem.

I already do that, so how does your suggestion make it better?

> If you are going to rebuild for testing purposes, you may as well use
> Asan as it is a bit easier than Valgrind.

I have no idea what Asan or Valgrind are.

I am using the lib on macOS, FWIW, building with clang and Xcode.

Thomas

Thomas Tempelmann

Jan 11, 2023, 10:06:54 AM

to PCRE2 discussion list

Clarification (something's wrong with comment indenting in googlegroups!?):

>> You can also use Valgrind, but you need to rebuild your program and pcre2
>> with -O1 (or -O0). In both cases you should trigger a memory error
>> based on your description of the problem.

> I already do that, so how does your suggestion make it better?

I meant: I already get memory errors, so what does your suggestion change about it?

Thomas

Thomas Tempelmann

Jan 11, 2023, 10:23:56 AM

to PCRE2 discussion list

Looks like the crash occurs inside `_pcre2_valid_utf_16`, which makes sense.

The crash is due to a bus error (address 0x00007f945e005000).

With these registers:

rax: 0x0000000000000000 rbx: 0x00007f945e597e68 rcx: 0x0000000000000000 rdx: 0x0000600002ae96c0
rdi: 0x00007f945d2cd720 rsi: 0x00000000002c9734 rbp: 0x000070000ed342d0 rsp: 0x000070000ed342d0
r8: 0x00000000002c9734 r9: 0x0000000000000000 r10: 0x00007f945e005000 r11: 0x0000000000000110
r12: 0x00000000012ca718 r13: 0x00000000012ca6e6 r14: 0x00007f945d2cd71e r15: 0x00007f945d2cd720
rip: 0x00000001022b294c rfl: 0x0000000000010202 cr2: 0x00007f945e005000

Could someone please clarify: The data length argument I give to `pcre2_match_16` is in byte units, not in unichar (double byte) units, correct?

Thomas

Thomas Tempelmann

Jan 11, 2023, 10:32:35 AM

to PCRE2 discussion list

> Could someone please clarify: The data length argument I give to `pcre2_match_16` is in byte units, not in unichar (double byte) units, correct?

Oh my! That seems to be it!

Sadly, this is not explained at all in the docs. I'll file a ticket about that.

Thomas

Giuseppe D'Angelo

Jan 11, 2023, 10:50:46 AM

to Thomas Tempelmann, PCRE2 discussion list

> Could someone please clarify: The data length argument I give to `pcre2_match_16` is in byte units, not in unichar (double byte) units, correct?

It's in "number of code units"; see https://www.pcre.org/current/doc/html/pcre2api.html#SEC15

For UTF-16, that's indeed "double bytes" (number of UTF-16 code units).
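
To make the unit distinction concrete, here is a minimal sketch in C, assuming the subject is a raw byte buffer whose size is known in bytes. The helper names `utf16_units_from_bytes` and `bytes_spanned_by_units` are made up for illustration and are not part of PCRE2; the only PCRE2 fact relied on is the one stated above, that the length passed to `pcre2_match_16` counts 16-bit code units.

```c
#include <stddef.h>

/* Hypothetical helper (not part of PCRE2): convert a buffer size in bytes
 * into the number of whole 16-bit code units it holds. A trailing odd
 * byte cannot form a code unit and is dropped. */
static size_t utf16_units_from_bytes(size_t byte_len)
{
    return byte_len / 2;
}

/* Hypothetical helper: how many bytes a 16-bit matcher may read when it
 * is told the subject is `length` code units long. */
static size_t bytes_spanned_by_units(size_t length)
{
    return length * 2;
}

/* Sketch of a call site, shown as a comment because it needs the PCRE2
 * headers and library (variable names are assumptions):
 *
 *   rc = pcre2_match_16(code, (PCRE2_SPTR16)buf,
 *                       utf16_units_from_bytes(buf_bytes),
 *                       0, 0, match_data, NULL);
 */
```

If the byte count of a 4096-byte buffer were passed as the length, the matcher would be allowed to walk 8192 bytes, i.e. well past the end of the buffer, which is consistent with the intermittent bus error reported earlier in the thread.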

HTH,

--

Giuseppe D'Angelo

Thomas Tempelmann

Jan 11, 2023, 11:11:04 AM

to PCRE2 discussion list

Thanks for clarifying, Giuseppe.

This probably also means that if I look for UTF-16 chars in random binary data, they won't be found unless they're 16-bit aligned, i.e. start at an even byte offset. So I'd better run two searches on the same data, one with the byte pointer and one with the byte pointer + 1. That also explains why it wouldn't find UTF-16BE on an LE system.

Sigh. So "find in binary data" is quite limited, and I think that should be pointed out more clearly in the docs.
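
The two-pass idea can be sketched as follows, assuming the data sits in a byte buffer and the needle is already a sequence of 16-bit code units. All names here (`load_utf16`, `find_units`, `find_utf16_any_alignment`) are hypothetical, and the naive `find_units` loop only stands in for the real `pcre2_match_16` call. Copying into a separate `uint16_t` array avoids casting an odd byte address to a 16-bit pointer, and assembling each unit from two bytes covers both LE and BE data regardless of the host's byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Decode bytes[start..] into 16-bit code units, assembling each unit
 * from two bytes so the result does not depend on host endianness.
 * Returns the number of code units written to `out`. */
static size_t load_utf16(const unsigned char *bytes, size_t byte_len,
                         size_t start, int big_endian, uint16_t *out)
{
    size_t n = 0;
    for (size_t i = start; i + 1 < byte_len; i += 2) {
        unsigned lo = big_endian ? bytes[i + 1] : bytes[i];
        unsigned hi = big_endian ? bytes[i] : bytes[i + 1];
        out[n++] = (uint16_t)(lo | (hi << 8));
    }
    return n;
}

/* Stand-in for the regex match: naive search for `needle` in `hay`.
 * Returns the code-unit index of the first hit, or -1. */
static long find_units(const uint16_t *hay, size_t hay_len,
                       const uint16_t *needle, size_t needle_len)
{
    if (needle_len == 0 || needle_len > hay_len) return -1;
    for (size_t i = 0; i + needle_len <= hay_len; i++) {
        size_t j = 0;
        while (j < needle_len && hay[i + j] == needle[j]) j++;
        if (j == needle_len) return (long)i;
    }
    return -1;
}

/* Search once from byte offset 0 and once from byte offset 1. Returns the
 * byte offset of the earliest hit in the original buffer, or -1 if
 * neither pass finds the needle (or allocation fails). */
static long find_utf16_any_alignment(const unsigned char *bytes, size_t byte_len,
                                     const uint16_t *needle, size_t needle_len,
                                     int big_endian)
{
    uint16_t *units = malloc((byte_len / 2 + 1) * sizeof *units);
    long best = -1;
    if (units == NULL) return -1;
    for (size_t start = 0; start < 2; start++) {
        size_t n = load_utf16(bytes, byte_len, start, big_endian, units);
        long hit = find_units(units, n, needle, needle_len);
        if (hit >= 0) {
            long byte_off = (long)start + 2 * hit;
            if (best < 0 || byte_off < best) best = byte_off;
        }
    }
    free(units);
    return best;
}
```

In a real scanner the copy could be done per chunk rather than per file, and a full search would call this (or the regex equivalent) for both byte orders, so four passes in total.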

Thomas

Thomas Tempelmann

Jan 11, 2023, 11:21:31 AM

to PCRE2 discussion list

Ticket: https://github.com/PCRE2Project/pcre2/issues/187

Jeffrey Walton

Jan 11, 2023, 12:15:24 PM

to Thomas Tempelmann, PCRE2 discussion list

My apologies for bothering you with the suggestions.
