Replacement for lack of negative lookbehind in regexp?


Victor Hooi

Jul 21, 2015, 2:24:47 AM
to golan...@googlegroups.com
In my particular case, I want to find quoted strings, where the field-name before the quoted string is not "$comment". 

So for example, say I have the string:

command foo.$cmd command: { mapreduce: "lorem ipsum", map: "lorem ipsum", reduce: "lorem ipsum", verbose: true, query: { $comment: "I WANT TO KEEP THIS STRING", _id.foo: { $in: [ "lorem", "ipsum", "dolor"]}}

I'm using the following to replace any quoted strings with "XXXX":

string_regex, _ := regexp.Compile(`"[^"]*"[,| }]`)

scanner := bufio.NewScanner(file)
for scanner.Scan() {
    message := strings.SplitN(scanner.Text(), "]", 2)[1]
    fmt.Println("Strings removed: " + string_regex.ReplaceAllString(message, "XXXX"))
}

string_regex in this case will find any double-quoted string that is followed by a comma, a space, or a closing curly brace (the "|" inside the character class is matched literally, not as alternation).

However, I want to leave any string that comes after $comment alone (e.g. "I WANT TO KEEP THIS STRING" in the above).

One option would be a negative lookbehind to ignore those strings, like so:

(?<!\$comment: )"...

However, I understand that Go's stdlib regexp supports neither lookahead nor lookbehind.
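As a quick check of that limitation, here is a minimal sketch showing that regexp.Compile returns an error for both lookaround forms, while an optional capture group for the field name compiles fine. The helper name compiles is hypothetical and only used for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's regexp package accepts the pattern.
// (Hypothetical helper name, for illustration only.)
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	patterns := []string{
		`(?<!\$comment: )"[^"]*"`, // negative lookbehind: rejected
		`(?!\$comment: )"[^"]*"`,  // negative lookahead: rejected
		`(\$comment: )?"[^"]*"`,   // optional capture group: accepted
	}
	for _, p := range patterns {
		fmt.Printf("%-28s compiles: %v\n", p, compiles(p))
	}
}
```

The error is reported when the pattern is compiled, not when it is matched, so an unsupported pattern cannot slip through silently.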


In the absence of having that in regexp, what would be the next best alternative in Go?

Tamás Gulácsi

Jul 21, 2015, 2:53:50 AM
to golan...@googlegroups.com
Just parse the whole thing (separate field names, values and other tokens), and then you know the field name before the value.

Victor Hooi

Jul 21, 2015, 4:47:34 AM
to golan...@googlegroups.com
There won't always be a field-name before a quoted string - in several cases you can have a square bracket before a quoted string:

$in: [ "lorem ipsum", "lorem ipsum"]
 
In the case where there is a field-name, and that field-name is $comment, then I want to avoid replacing that field.

However, based on your idea I can do something similar (colleague helped here as well), and have an optional group for $comment:

quoted_string_regex, _ := regexp.Compile(`(\$comment: )?"[^"]*"[,| }]`)

I can then check if the $comment group matched anything.

The major issue is - all of this will tell me if $comment is there or not - but I still need a way to actually do the replacement.

One idea I had was to use regexp.ReplaceAllStringFunc - and then in the function I pass in, I could check if the match contained $comment. The following appears to work:

fmt.Println("Everything" + quoted_string_regex.ReplaceAllStringFunc(message, func(s string) string {
    if strings.Contains(s, `$comment`) {
        return s
    } else {
        return `XXXX`
    }
}))

Does that seem like a reasonable way to do things?
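For reference, a self-contained sketch of this approach. It makes two changes to the snippet above, both assumptions rather than anything settled in the thread: it tests for the "$comment: " prefix instead of using strings.Contains (so a quoted value that merely mentions $comment is still redacted), and it re-appends the last byte of the match so the trailing delimiter is not swallowed by the replacement. The name redactQuoted is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// quotedString matches a double-quoted string followed by one delimiter,
// optionally preceded by the "$comment: " field name (pattern from the thread).
var quotedString = regexp.MustCompile(`(\$comment: )?"[^"]*"[,| }]`)

// redactQuoted replaces every quoted string with "XXXX" unless the match
// starts with "$comment: ". (Hypothetical helper name.)
func redactQuoted(message string) string {
	return quotedString.ReplaceAllStringFunc(message, func(s string) string {
		if strings.HasPrefix(s, "$comment: ") {
			return s // leave $comment values untouched
		}
		// The last byte of the match is the delimiter; keep it.
		return `"XXXX"` + s[len(s)-1:]
	})
}

func main() {
	msg := `query: { $comment: "KEEP", _id: { $in: [ "lorem", "ipsum" ] } }`
	fmt.Println(redactQuoted(msg))
}
```

Because the optional group sits at the very start of the pattern, a match that includes the field name always begins with it, which is why a prefix test is enough here.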

Ultimately, it seems like I'll need a better way of handling all of these replacements - Tamás, I think you were actually helping me in the other thread =). Doing a single parse to extract the positions of all the strings, field-names, IP addresses etc. seems good, but then as soon as I make one replacement, all the positions will change so that won't work.

Also, two questions:
  1. Apart from the immutability issue, are there any reasons for using ReplaceAllStringFunc, versus ReplaceAllFunc?
  2. I also tried using a named function, rather than an anonymous function:
func redact_if_not_comment(s string) string {
    if strings.Contains(s, `$comment`) {
        return s
    } else {
        return `XXXX`
    }
}
...
quoted_string_regex, _ := regexp.Compile(`(\$comment: )?"[^"]*"[,| }]`)
fmt.Println("Everything" + quoted_string_regex.ReplaceAllFunc(message, redact_if_not_comment(matched_string)))


However, I ended up getting a type error - I could have sworn I got the types right, but obviously I'm missing something here:

./redact.go:56: cannot use redact_if_not_comment(message) (type string) as type func(string) string in argument to quoted_string_regex.ReplaceAllStringFunc

Can anybody see what I missed?

Fredrik Ehnbom

Jul 21, 2015, 7:07:50 AM
to golan...@googlegroups.com
 
> fmt.Println("Everything" + quoted_string_regex.ReplaceAllFunc(message, redact_if_not_comment(matched_string)))
> However, I ended up getting a type error - I could have sworn I got the types right, but obviously I'm missing something here:
> ./redact.go:56: cannot use redact_if_not_comment(message) (type string) as type func(string) string in argument to quoted_string_regex.ReplaceAllStringFunc

You need to reference redact_if_not_comment, not call it, i.e. remove "(matched_string)".
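A small sketch of the corrected call, passing the function value itself. It also shows the []byte counterpart, since ReplaceAllFunc takes a []byte source and a func([]byte) []byte callback, whereas ReplaceAllStringFunc works on strings. The camel-case names and the sample message are assumptions made for this example.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
	"strings"
)

var quotedString = regexp.MustCompile(`(\$comment: )?"[^"]*"[,| }]`)

// redactIfNotComment has type func(string) string, which is what
// ReplaceAllStringFunc expects as its second argument.
func redactIfNotComment(s string) string {
	if strings.Contains(s, `$comment`) {
		return s
	}
	return `XXXX`
}

// redactIfNotCommentBytes is the []byte counterpart for ReplaceAllFunc.
func redactIfNotCommentBytes(b []byte) []byte {
	if bytes.Contains(b, []byte(`$comment`)) {
		return b
	}
	return []byte(`XXXX`)
}

func main() {
	message := `{ $comment: "keep me", name: "secret" }`
	// Pass the function itself, not the result of calling it.
	fmt.Println(quotedString.ReplaceAllStringFunc(message, redactIfNotComment))
	fmt.Println(string(quotedString.ReplaceAllFunc([]byte(message), redactIfNotCommentBytes)))
}
```

If the input already arrives as []byte (for example from scanner.Bytes()), the byte variant avoids a conversion to string and back; otherwise the string variant is simpler.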

Tamás Gulácsi

Jul 21, 2015, 9:47:52 AM
to golan...@googlegroups.com
The replacement problem suggests to me that you should just parse the line and do all of its replacements in one goroutine; for parallelization, use multiple goroutines for multiple lines.

Victor Hooi

Jul 21, 2015, 8:33:49 PM
to golang-nuts
The only thing is, the lines are log lines, and there is an ordering to them.

If I use multiple goroutines for multiple lines, rather than working on just one line at a time, doesn't that mean the lines could be reordered in the output?

Matt Harden

Jul 21, 2015, 9:10:36 PM
to Victor Hooi, golang-nuts
You can still output the lines in the right order. One way to synchronize this would be to use a Pipeline. You only really need one of the two sides of the pipeline; the Request or the Response side.

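var p textproto.Pipeline
for scanner.Scan() {
    line := scanner.Text()
    id := p.Next()
    go process(line, &p, id)
}

// p must be passed by pointer: a Pipeline contains a mutex and must not be copied.
func process(line string, p *textproto.Pipeline, id uint) {
    // process the line
    p.StartResponse(id) // blocks until every earlier id has ended its response
    // write the line
    p.EndResponse(id)
}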


Victor Hooi

Jul 22, 2015, 3:51:09 AM
to golang-nuts
Interesting, I hadn't heard of the actual Pipeline type. I'll need to read up on this.

The other thing I've realised is that "ReplaceAllStringFunc" doesn't support regex sub-groups/sub-matches (https://github.com/golang/go/issues/5690), nor is there an equivalent regexp function that does.

This is a tad annoying (particularly when combined with the lack of lookahead/lookbehind), as I was hoping to be able to capture both the $comment and the trailing comma or curly brace in sub-groups, modify the quoted string in the middle, then put it all back together:

quoted_string_regex, _ := regexp.Compile(`(\$comment: )?"([^"]*)"[,| }]`)

So for example:

"Hello I'm a comment", 

would become:

"ac85fd2b01946ab0a85111aecdbad81ebf021c34", 

with the SHA-1 hash being of *just* the text between the quotes, and the trailing comma preserved.

However, it seems like I'll need to apply the regex a second time inside the actual ReplaceAllStringFunc callback function, or use a bunch of strings.Contains/HasSuffix/Split calls.

Or is there another way of doing it in Go?
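One possibility, sketched here under assumptions: FindAllStringSubmatchIndex returns the byte offsets of every group for every match, so the output can be rebuilt in a single pass without re-matching. The third capture group around the trailing delimiter is an addition made for this sketch, and the name hashQuoted is hypothetical.

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"regexp"
	"strings"
)

// Pattern from the thread, with the trailing delimiter captured as a third
// group (an addition for this sketch).
var quotedString = regexp.MustCompile(`(\$comment: )?"([^"]*)"([,| }])`)

// hashQuoted replaces the text inside each quoted string with its SHA-1 hex
// digest, leaving $comment values and all delimiters untouched.
func hashQuoted(message string) string {
	var b strings.Builder
	last := 0
	for _, m := range quotedString.FindAllStringSubmatchIndex(message, -1) {
		// m[0]:m[1] whole match, m[2]:m[3] "$comment: " (-1 if absent),
		// m[4]:m[5] text between the quotes, m[6]:m[7] delimiter.
		b.WriteString(message[last:m[0]])
		if m[2] >= 0 {
			b.WriteString(message[m[0]:m[1]]) // $comment value: keep as-is
		} else {
			sum := sha1.Sum([]byte(message[m[4]:m[5]]))
			fmt.Fprintf(&b, `"%x"%s`, sum, message[m[6]:m[7]])
		}
		last = m[1]
	}
	b.WriteString(message[last:])
	return b.String()
}

func main() {
	fmt.Println(hashQuoted(`{ $comment: "keep", name: "Hello I'm a comment", }`))
}
```

Since all offsets refer to the original string, earlier replacements do not shift the positions of later matches.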

Tamás Gulácsi

Jul 22, 2015, 4:28:18 AM
to golang-nuts
Yes, parse the log line!
Is it like JSON? Start with the JSON decoder, or something less strict (rjson).

Victor Hooi

Jul 22, 2015, 7:24:27 AM
to golang-nuts
Hi Tamás,

I'd love to parse the logline! =).

Unfortunately, these loglines are from the MongoDB database, and whilst it looks like JSON, it is actually not. This is a known issue. There is an open feature request to get true JSON output (https://jira.mongodb.org/browse/SERVER-17357), but it is unlikely to get traction.

Even with the lines that look like JSON, there are several issues. For example, quotation marks are not escaped (https://jira.mongodb.org/browse/SERVER-16620). So this is a valid MongoDB logline:

query { name: "John", age: 15, message: "This is a message with a quotation " mark right in the middle" }

Likewise, newlines are not escaped, so you often get loglines spanning multiple lines, if a field contains newlines. And if a logline is greater than 10K characters, it will be truncated in the middle - irrespective of where you are - so the {} may not match up.

And there's also the issue that not everything is quoted (although I think rjson is OK with that), and there are special $ and _ characters.

I put up a gist of some sample loglines - these ones seem reasonably well behaved, but I'm still not certain if they can be parsed.

The rjson (http://godoc.org/launchpad.net/rjson or https://github.com/rogpeppe/rjson) looks interesting, however, I think it will still struggle:

victorhooi@thadeus ~/c/g/s/g/r/r/c/rjson> echo '{ mapreduce: "lorem", map: "function() {emit({id:this._id.fooId}, {version:this._id.vid, transId:this._id.transactionId})}", reduce: "function(key, vals) { var ver = vals[0]; for(var idx=1; idx<vals.length;idx++) { if (ver < vals[idx]) { ver = vals[idx]; }} return ver; }", verbose: true, out: { inline: 1 }, query: { $comment: "loremipsumdolor:44...@243c13n1.foo.com:kg._342Instance_1_dev.cached.tree-12:1572", _id.transactionId: { $in: [ "lorem", "ipsum", "dolor", "sic" ] }, _id.fooId: { $in: [ "lorem", "ipsum" ] }, lorem.ipsum.dolor: { $elemMatch: { loremType: "LOREM_IPSUM", lorem.ipsumDate: { $gt: "2014-02-24" } } } }, sort: {} }' | go run main.go
rjson: decode: invalid character '$' looking for beginning of object key string
exit status 1
victorhooi@thadeus ~/c/g/s/g/r/r/c/rjson> echo '{ $query: { $comment: "GetMaxLorem:18...@hz452c38n1.foo.com:pool-12-thread-4:3468", _id.id: "lorem ipsum" }, $orderby: { _id.version: -1 } }' | go run main.go
rjson: decode: invalid character '$' looking for beginning of object key string
exit status 1
victorhooi@thadeus ~/c/g/s/g/r/r/c/rjson> echo '{ query: { comment: "GetMaxLorem:18...@hz452c38n1.foo.com:pool-12-thread-4:3468", _id.id: "lorem ipsum" }, orderby: { _id.version: -1 } }' | go run main.go
rjson: decode: invalid character '_' looking for beginning of object key string
exit status 1
victorhooi@thadeus ~/c/g/s/g/r/r/c/rjson> echo '{ query: { comment: "GetMaxLorem:18...@hz452c38n1.foo.com:pool-12-thread-4:3468", id.id: "lorem ipsum" }, orderby: { id.version: -1 } }' | go run main.go
rjson: decode: invalid character '.' after object key
exit status 1

That is why I'm trying the current approach using regexes, to try and extract each of the keys/values and operate directly on them.

I'm still not sure of the best way to work around the lack of sub-groups/matches in ReplaceAllStringFunc, but it seems like applying the regex a second time inside the callback function may be the best workaround?

Regards,
Victor

PS: I should mention there is a project to try to properly parse MongoDB loglines - https://github.com/mongodb-js/log - it's in NodeJS, and uses PegJS. I don't know how much work it'd be to get this working in Go.

Tamás Gulácsi

Jul 22, 2015, 7:46:27 AM
to golang-nuts
I see.

Go with the simplest solution first, then benchmark it if it is slow.

You can try https://github.com/pointlander/peg if you have a peg grammar.

Egon

Jul 22, 2015, 9:23:57 AM
to golang-nuts
Just a helpful tip where to start:

>^.^<

+ Egon

roger peppe

Jul 23, 2015, 9:20:36 AM
to Victor Hooi, golang-nuts
I'm afraid there's no way that rjson can read mongodb's output
in general - rjson only accepts a very limited character set in its
keys and only in keys, not in values. Mongodb values can be
non-JSON values such as BinData(0,"Im51bGwi") and there's
no way that rjson will parse those (it's not designed to).

It would be really nice to have a MongoDB-styled-JSON parser
actually but you're right that log line truncation is a problem.
Basically the logs are not actually parsable AFAICS, so
you'll have to use good-most-of-the-time heuristics.

As for regexp replacement, it's not too hard to roll your
own for special applications like this.
For example: http://play.golang.org/p/1zAiGH1mfC

cheers,
rog.