Crashes in pcre2_match_16 with binary data


Thomas Tempelmann

Jan 11, 2023, 5:17:04 AM
to pcre...@googlegroups.com
In 2020 I had asked for help with searching for UTF-8 text in binary data (i.e. any files on a disk). I got some advice and all worked well.

Now I expanded my code to also search for UTF-16 text in the same data.

So I built the lib with both the 8-bit and 16-bit functions, and I create separate pcre2_code, match_data, context etc. objects using the _8 and _16 suffixes instead of the default macros.

And it all works fine in small test cases: it finds both UTF-8 and UTF-16 strings in files.

However, I now get crashes in the pcre2_match_16() function when searching large files, whereas I never get them in the _8 function. And it happens both with JIT and without.

I also use the same options for both versions, of course.

With PCRE2 v10.42.

I ruled out a mix-up between the _8 and _16 structs by only using the _16 code, and I don't use concurrent threads.

The crashes are a bit random, i.e. certain files crash often but not always.

But within 5 seconds of scanning random files on my disk, I always get a crash.

Since I use a built lib, I cannot easily look at the source code where it crashes.

I wonder if there are cmdline tools I can use for testing in order to rule out a mistake on my end. But it seems that pcre2grep does not support UTF-16 search, right? Or do I have to build the tool with special options first?

--
Thomas Tempelmann, http://apps.tempel.org/

Jeffrey Walton

Jan 11, 2023, 8:32:30 AM
to Thomas Tempelmann, pcre...@googlegroups.com
Rebuild your program and pcre2 with Asan via -fsanitize=address. You
can also use Valgrind, but you need to rebuild your program and pcre2
with -O1 (or -O0). In both cases you should trigger a memory error
based on your description of the problem.

If you are going to rebuild for testing purposes, you may as well use
Asan as it is a bit easier than Valgrind.

Jeff

Thomas Tempelmann

Jan 11, 2023, 10:04:37 AM
to PCRE2 discussion list
Jeff,

> Rebuild your program and pcre2 with Asan via -fsanitize=address.

For what purpose?

> You can also use Valgrind, but you need to rebuild your program and pcre2
> with -O1 (or -O0). In both cases you should trigger a memory error
> based on your description of the problem.

I already do that, so how does your suggestion make it better?

> If you are going to rebuild for testing purposes, you may as well use
> Asan as it is a bit easier than Valgrind.

I have no idea what Asan or Valgrind are.

I am using the lib on macOS, FWIW, building with clang and Xcode.

Thomas

Thomas Tempelmann

Jan 11, 2023, 10:06:54 AM
to PCRE2 discussion list
Clarification (something's wrong with comment indenting in googlegroups!?):

>> You can also use Valgrind, but you need to rebuild your program and pcre2
>> with -O1 (or -O0). In both cases you should trigger a memory error
>> based on your description of the problem.

> I already do that, so how does your suggestion make it better?

I meant: I already get memory errors, so what does your suggestion change about it?

Thomas 
 

Thomas Tempelmann

Jan 11, 2023, 10:23:56 AM
to PCRE2 discussion list
Looks like the crash occurs inside `_pcre2_valid_utf_16`, which makes sense.

The crash is because of a bus error (addr 0x00007f945e005000).

With these registers:

  rax: 0x0000000000000000  rbx: 0x00007f945e597e68  rcx: 0x0000000000000000  rdx: 0x0000600002ae96c0
  rdi: 0x00007f945d2cd720  rsi: 0x00000000002c9734  rbp: 0x000070000ed342d0  rsp: 0x000070000ed342d0
   r8: 0x00000000002c9734   r9: 0x0000000000000000  r10: 0x00007f945e005000  r11: 0x0000000000000110
  r12: 0x00000000012ca718  r13: 0x00000000012ca6e6  r14: 0x00007f945d2cd71e  r15: 0x00007f945d2cd720
  rip: 0x00000001022b294c  rfl: 0x0000000000010202  cr2: 0x00007f945e005000

Could someone please clarify: The data length argument I give to `pcre2_match_16` is in byte units, not in unichar (double byte) units, correct?

Thomas

Thomas Tempelmann

Jan 11, 2023, 10:32:35 AM
to PCRE2 discussion list
> Could someone please clarify: The data length argument I give to `pcre2_match_16` is in byte units, not in unichar (double byte) units, correct?

Oh my! That seems to be it!

Sadly, this is not explained at all in the docs. I'll file a ticket about that.

Thomas

Giuseppe D'Angelo

Jan 11, 2023, 10:50:46 AM
to Thomas Tempelmann, PCRE2 discussion list
> Could someone please clarify: The data length argument I give to `pcre2_match_16` is in byte units, not in unichar (double byte) units, correct?

For UTF-16, that's indeed "double bytes" (number of UTF-16 code units).
 
HTH,
--
Giuseppe D'Angelo
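
For readers hitting the same crash: a minimal sketch of the length conversion Giuseppe describes, assuming the raw data arrives as a byte buffer with a byte count. The helper name `utf16_code_units` is hypothetical and not part of PCRE2; the only PCRE2 call mentioned is `pcre2_match_16`, and it appears in a comment only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE2 function): number of whole 16-bit
   code units that fit in a buffer of byte_len bytes. A trailing odd
   byte cannot form a code unit, so it is dropped. */
size_t utf16_code_units(const void *buf, size_t byte_len)
{
    assert(buf != NULL || byte_len == 0);
    return byte_len / sizeof(uint16_t);
}

/* Intended use, shown as a comment so the sketch builds without PCRE2:
 *
 *   rc = pcre2_match_16(code, (PCRE2_SPTR16)buf,
 *                       utf16_code_units(buf, byte_len),  // not byte_len
 *                       0, options, match_data, NULL);
 */
```

Passing the byte count instead would tell the matcher the subject is twice as long as it really is, which fits the out-of-bounds read reported in `_pcre2_valid_utf_16` above.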

Thomas Tempelmann

Jan 11, 2023, 11:11:04 AM
to PCRE2 discussion list
Thanks for clarifying, Giuseppe.

This probably also means that if I look for UTF-16 chars in random binary data, they won't be found unless they're word-aligned. So I'd better run two searches on the same data, one with the byte-pointer, and one with the byte-pointer+1. That also explains why it wouldn't find UTF-16BE on an LE system.

Sigh. So the "find in binary data" feature is quite limited, and that should be more clearly pointed out in the docs, I think.

Thomas
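
The two-pass idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `utf16_window`, `utf16_window_at`, and `utf16_load_units` are hypothetical, and the sketch copies the data into a separate `uint16_t` buffer, which is a design choice of this example rather than something the thread prescribes. Copying avoids reading 16-bit values through an odd address and lets the same loop decode either byte order into host-order code units, which is the form the 16-bit library works on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one search pass over the raw bytes. */
typedef struct {
    size_t byte_offset;  /* where the pass starts inside the buffer (0 or 1) */
    size_t code_units;   /* whole 16-bit units available from that offset */
} utf16_window;

/* Compute the pass that starts byte_offset bytes into a byte_len-byte buffer. */
utf16_window utf16_window_at(size_t byte_len, size_t byte_offset)
{
    utf16_window w = { byte_offset, 0 };
    if (byte_offset <= byte_len)
        w.code_units = (byte_len - byte_offset) / 2;
    return w;
}

/* Decode n_units code units from src (any alignment) into dst in host
   order. big_endian selects how each byte pair in src is interpreted. */
void utf16_load_units(uint16_t *dst, const unsigned char *src,
                      size_t n_units, int big_endian)
{
    assert(n_units == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n_units; i++) {
        unsigned b0 = src[2 * i];
        unsigned b1 = src[2 * i + 1];
        dst[i] = big_endian ? (uint16_t)((b0 << 8) | b1)
                            : (uint16_t)((b1 << 8) | b0);
    }
}
```

Covering both alignments and both byte orders this way means up to four passes per buffer: offsets 0 and 1, each decoded as little-endian and as big-endian, with `code_units` used as the length passed to the matcher in every pass.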

Thomas Tempelmann

Jan 11, 2023, 11:21:31 AM
to PCRE2 discussion list

Jeffrey Walton

Jan 11, 2023, 12:15:24 PM
to Thomas Tempelmann, PCRE2 discussion list
My apologies for bothering you with the suggestions.