Using replace filter with limited sregex functionality

rvsw

Nov 2, 2014, 11:33:08 PM
to openre...@googlegroups.com
Hello agentzh
If it is known that replace_filter will only have strings i.e. no regex as argument, is it possible to use sregex in a way that it does parse / look for regex arguments. As I understand, the  assumption that an argument may be a regex causes sregex to do additional processing with a performance penalty. Is there a way (or at least a direction in which I can proceed to modify code), just string processing can be done.

Thanks

Yichun Zhang (agentzh)

Nov 3, 2014, 3:22:10 PM
to openresty-en
Hello!

On Sun, Nov 2, 2014 at 8:33 PM, rvsw wrote:
> If it is known that replace_filter will only have strings i.e. no regex as
> argument,

The replace_filter directive *does* support regex as its first argument:

https://github.com/openresty/replace-filter-nginx-module#replace_filter

Do you mean use of nginx variables in the regex argument here?

> is it possible to use sregex in a way that it does parse / look
> for regex arguments.

I cannot parse this question. Will you rephrase or elaborate?

> As I understand, the assumption that an argument may
> be a regex causes sregex to do additional processing with a performance
> penalty. Is there a way (or at least a direction in which I can proceed to
> modify code), just string processing can be done.
>

Assuming you mean using nginx variables in the regex argument, yes, it
is doable. But we'll need an LRU cache for the compiled regexes
because compiling regexes upon every request is very expensive :)
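
To make the LRU idea above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: Python's `re` module stands in for sregex, and the class name `CompiledRegexCache` and its `capacity` parameter are hypothetical, not part of ngx_replace_filter or sregex.

```python
import re
from collections import OrderedDict


class CompiledRegexCache:
    """Tiny LRU cache mapping pattern strings to compiled regex objects.

    Hypothetical helper for illustration; `re` stands in for sregex.
    """

    def __init__(self, capacity=128):
        self.capacity = capacity
        self._entries = OrderedDict()  # oldest entry first
        self.compilations = 0          # number of cache misses so far

    def get(self, pattern):
        compiled = self._entries.get(pattern)
        if compiled is not None:
            # Cache hit: mark the entry as most recently used.
            self._entries.move_to_end(pattern)
            return compiled
        # Cache miss: pay the compilation cost once and remember the result.
        compiled = re.compile(pattern)
        self.compilations += 1
        self._entries[pattern] = compiled
        if len(self._entries) > self.capacity:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return compiled
```

With such a cache, a pattern produced by expanding variables is compiled only the first time a given expanded string is seen, as long as it has not been evicted in the meantime.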

Regards,
-agentzh

rvsw

Nov 4, 2014, 12:50:55 PM
to openre...@googlegroups.com
Hello agentzh
Sorry, I probably did not communicate clearly.
Here are more details:
  1. Replace filter allows use of regex as the first argument as you mention
  2. However, if we *know* that the first argument will *always* be a string (e.g. I can add some sort of user interface or validation to make sure that only a string is added as the first argument), then can we
  3. get performance improvements from replace_filter+sregex by a) either configuring sregex library or b) modifying the code path to assume that the first argument will always be string.
My understanding is that the fact that processing regex will have a performance penalty. If regex is always replaced by the string, then perhaps we may not need to do processing specifically for regex.

Please let me know if the query is descriptive enough
Thank you for your help

Yichun Zhang (agentzh)

Nov 4, 2014, 1:10:41 PM
to openresty-en
Hello!

On Tue, Nov 4, 2014 at 9:50 AM, rvsw wrote:
> My understanding is that the fact that processing regex will have a
> performance penalty. If regex is always replaced by the string, then perhaps
> we may not need to do processing specifically for regex.
>

Okay, I finally see your point. You mean the pattern is a non-regex
literal string as in the standard ngx_sub module [1]?

Yes, we can get a (significant) performance boost by not using an
NFA-based algorithm.

If all the replace_filter directives used in a single location use
literal string patterns, then we could use the Aho-Corasick algorithm
[2] to do the matching instead of using sregex. (If we only have a
single such replace_filter directive in a location, then we could just
use the standard ngx_sub module.)

Another option is to build this AC optimization into the sregex engine itself.
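
For readers unfamiliar with the Aho-Corasick approach mentioned above, the following Python sketch shows the general idea of matching several literal patterns in one pass over the input. It is a textbook-style illustration, not code from sregex or ngx_replace_filter; the function names `build_automaton` and `find_all` are made up for this example.

```python
from collections import deque


def build_automaton(patterns):
    """Build a trie with failure links for a list of non-empty literals."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> fallback state on mismatch
    out = [[]]    # out[state] -> patterns that end at this state
    for pat in patterns:
        state = 0
        for ch in pat:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = nxt
            state = nxt
        out[state].append(pat)
    # Breadth-first pass: depth-1 states keep fail == 0, deeper states
    # inherit the longest proper suffix that is also a trie prefix.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence, in one pass."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches
```

For example, `find_all("ushers", build_automaton(["he", "she", "his", "hers"]))` returns `[(1, "she"), (2, "he"), (2, "hers")]`. Because the only state carried between characters is a single integer, the same loop could in principle be resumed across data chunks.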

Regards,
-agentzh

[1] http://nginx.org/en/docs/http/ngx_http_sub_module.html#sub_filter
[2] http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm

rvsw

Dec 16, 2014, 12:47:30 PM
to openre...@googlegroups.com

Hello agentzh
Is it possibly less effort to extend sub filter to do streaming as a first step? Essentially, if there is a match at the boundary of a buffer, then sub filter should wait for the next buffer. Since you have already implemented streaming extensively, I wonder if you can comment on this simplistic approach and possible pitfalls

Yichun Zhang (agentzh)

Dec 16, 2014, 5:10:28 PM
to openresty-en
Hello!

On Tue, Dec 16, 2014 at 9:47 AM, rvsw wrote:
> Is it possibly less effort to extend sub filter to do streaming as a first
> step? Essentially, if there is a match at the boundary of a buffer, then sub
> filter should wait for the next buffer. Since you have already implemented
> streaming extensively, I wonder if you can comment on this simplistic
> approach and possible pitfalls
>

ngx_replace_filter's model is similar to ngx_sub's. Actually I
originally took ngx_sub's code base as the basis of
ngx_replace_filter. In ngx_replace_filter, we still have to buffer
some data when running into ambiguity at the current data chunk's
boundary. That's why we offer this replace_filter_max_buffered_size
directive in ngx_replace_filter:

https://github.com/openresty/replace-filter-nginx-module#replace_filter_max_buffered_size

But it's worth mentioning that ngx_replace_filter never buffers any
more data than absolutely necessary. So a carefully written regex should
never use much memory even in extreme conditions.
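
The boundary buffering described above can be sketched for the simpler literal-string case as follows. This is a rough Python illustration of the idea only, not the actual ngx_sub or ngx_replace_filter code; the function name `stream_replace` is hypothetical. At the end of each chunk it holds back the longest trailing piece that could still turn into a match, and nothing more.

```python
def stream_replace(chunks, needle, replacement):
    """Replace a literal needle in a stream of string chunks.

    Only the longest chunk suffix that is a proper prefix of `needle`
    is held back between chunks; everything else is emitted at once.
    """
    if not needle:
        raise ValueError("needle must be non-empty")
    pending = ""  # ambiguous tail carried over from the previous chunk
    for chunk in chunks:
        data = pending + chunk
        out = []
        pos = 0
        while True:
            hit = data.find(needle, pos)
            if hit < 0:
                break
            out.append(data[pos:hit])
            out.append(replacement)
            pos = hit + len(needle)
        tail = data[pos:]
        # Find the longest suffix of the tail that might start a match.
        keep = 0
        for k in range(min(len(needle) - 1, len(tail)), 0, -1):
            if tail.endswith(needle[:k]):
                keep = k
                break
        out.append(tail[:len(tail) - keep])
        pending = tail[len(tail) - keep:]
        piece = "".join(out)
        if piece:
            yield piece
    if pending:
        # End of stream: the held-back data never became a match.
        yield pending
```

Feeding the chunks `["hello wo", "rld, wor", "ld!"]` with needle `"world"` and replacement `"there"` yields `"hello "`, `"there, "` and `"there!"`. In this literal case the held-back data can never exceed `len(needle) - 1` characters; with a regex the ambiguous region has no such fixed bound, which is presumably why a configurable limit is useful there.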

Regards,
-agentzh