False Positives Management


Pierre

Jun 18, 2025, 12:48:20 AM
to Wazuh | Mailing List
Hello everyone,

I am reaching out because I am facing issues with my application logs. I am currently ingesting access logs from different web servers and would like advice on how to manage false positives efficiently.

Here is an example pseudo-decoder that I am using to decode access logs (anonymized):
<!-- Create a new custom decoder called 'access' that will match the web access logs -->
<decoder name="access">

  <!-- Only consider the logs that have header format -->
  <prematch type="pcre2">$header_regex</prematch>
 
  <!-- Define a regular expression to match the access log definition -->
  <regex type="pcre2">$access_regex</regex>
 
  <!-- Define the variables names assigned to each matching group -->
  <order>$access_variables</order>
 
  <!-- Display the child decoder name in the dashboard; otherwise, the parent decoder name is used by default -->
  <use_own_name>true</use_own_name>
 
  <!-- Mark the decoder as web-log so that built-in rules can use it -->
  <type>web-log</type>

</decoder>

This works great and allows built-in and custom rules to be evaluated. However, due to the nature and variety of the applications that I am monitoring, I encounter too many false positives that I would like to filter out. Here are some illustrative examples using access URLs:

Unwanted SQL injection attempts: /api/endpoint/x/error?user=blah_blah&...&reason=some%20SQLi%20looking%20terms%20like%20where%20or%20union&param=...
Unwanted common web attack: /api/endpoint/y/auth?user=blah_blah&...&token=pwsh82jd82ndkja22938e9832hdhdcs&param=...
Unwanted command injection: /api/endpoint/z/work?user=blah_blah&...&command_id=%20cmd%20random_stuff%20%20&param=...

There are many cases where client names, random tokens, and operation IDs are indistinguishable from real attack attempts when using regex matching only. I end up with more false-positive noise than actual attack attempts in my dashboards. To implement filtering, I tried the following two options:

1. Create a level 0 'drop' rule

A child rule with level 0 that fires when the URL matches a regex identifying a known 'trusted' URL parameter:
<rule id="100000" level="0">
  <if_sid>x,y,z...</if_sid>
  <description>Common false positive in URL</description>
  <url type="pcre2">regex_1|</url>
  <url type="pcre2">regex_2|</url>
  <url type="pcre2">regex_3|</url>
  ...
</rule>

2. Create a query DSL filter in dashboards

A filter that hides alerts known to be false positives, using wildcard or regexp queries:
{
  "query": {
    "bool": {
      "should": [
        {
          "wildcard": {
            "data.url": "/api/endpoint/x\\?arg=blah_blah&ignore=*&argn=..."
          }
        },
        {
          "regexp": {
            "data.url": {
              "value": "/api/endpoint/y\\?arg=blah_blah&ignore=(trusted1|trusted2|...)&argn=..."
            }
          }
        },
        {
          "regexp": {
            "data.url": {
              "case_insensitive": true,
              "value": "/api/endpoint/z\\?arg=blah_blah&req=[0-9]+&ignore=(trusted1|trusted2|...)&argn=..."
            }
          }
        },
        ...
      ]
    }
  }
}

However, both approaches have significant drawbacks that I am dissatisfied with:

1. This kind of matching is inaccurate and might fail if parameters come in a different order
2. Matching such cases requires complex regular expressions
3. The previous examples assume that the log is safe if the 'ignore' parameter contains a trusted value. But what if argn is malicious? We would need to make sure that 'ignore' is trusted and that all other parameters are also safe, which is realistically not feasible using regex alone.
4. The more expressions you use, the slower it gets
5. Terrible to maintain

The previous examples only show 3 expressions, but in a practical application a lot more are required, leading to very poor performance especially when dealing with high volumes of data. In the first case, running dozens of false positive regexes for each attack log would kill analysisd performance, and in the second case, running all the filters would yield excessive loading times in dashboards.

Having access to a more data-centric model would help solve this problem by allowing more complex logic (processing separate parameters and handling URL encoding, rather than plain string matching). Given the real-time processing context of Wazuh, it makes sense that such processing is not supported.
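
To illustrate what such per-parameter processing could look like, here is a minimal Python sketch (Python is used only for brevity). It decodes the query string, then checks each parameter against its own small, fully anchored pattern, so parameter order and URL encoding no longer matter. The POLICY table, its patterns, and the function name are hypothetical and only mirror the anonymized examples above. This is not something Wazuh provides.

```python
import re
from urllib.parse import urlsplit, parse_qsl

# Hypothetical policy: endpoint path -> parameter name -> pattern that the
# decoded value must match in full to be considered benign.
POLICY = {
    "/api/endpoint/z/work": {
        "user": re.compile(r"[A-Za-z0-9_]+"),
        "command_id": re.compile(r"\s*cmd\s+\w+\s*"),
    },
}


def unexplained_parameters(url):
    """Return the decoded parameters that the policy does not explain.

    An empty result means every parameter matched its own pattern, so the
    alert is a candidate false positive. Anything left over still needs the
    normal detection rules.
    """
    parts = urlsplit(url)
    rules = POLICY.get(parts.path, {})
    leftovers = {}
    # parse_qsl percent-decodes names and values and keeps duplicates.
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        pattern = rules.get(name)
        if pattern is None or pattern.fullmatch(value) is None:
            leftovers[name] = value
    return leftovers
```

With this policy, a URL for /api/endpoint/z/work carrying only the user and command_id parameters yields an empty result whatever their order, while an extra parameter such as param=1%20union%20select is returned for further inspection. This addresses drawbacks 1 and 3, since an unknown or malformed parameter is never silently excused.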

After doing some research, I was wondering if using active response alongside a custom application would be a viable solution. Here is what I have in mind:

- Write a custom application with a locally running HTTP server on the manager (using a performant language like Go or C++)
- Configure an active response that would send a POST request to the local server with the alert payload for the desired rules
- Perform our custom false positive validation in our app.
- If there is a false positive, use the indexer api to update the entry with additional custom data to identify false positives
- On the dashboard and reports, simply filter out false positives based on whether this data is present or not
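
The receiving side of the steps above could be sketched as follows, here in Python with only the standard library for brevity. Everything named in it is an assumption: that the alert arrives as the raw JSON body of the POST, the KNOWN_BENIGN_URLS set, the false_positive field, and port 8099. The validation is a stand-in, and the indexer update is deliberately left out because it depends on the deployment.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for the real validation logic: exact URLs that are
# known to be benign. A real service would inspect the parameters instead.
KNOWN_BENIGN_URLS = {"/api/endpoint/x/status"}


def verdict_for(alert):
    """Return the extra data to attach to an alert (field name is hypothetical)."""
    data = alert.get("data")
    url = data.get("url", "") if isinstance(data, dict) else ""
    return {"false_positive": url in KNOWN_BENIGN_URLS}


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            alert = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "invalid JSON")
            return
        if not isinstance(alert, dict):
            self.send_error(400, "expected a JSON object")
            return
        body = json.dumps(verdict_for(alert)).encode()
        # The indexed document would be updated here (not shown).
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8099):
    """Bind to localhost only, since the service is meant to stay on the manager."""
    return HTTPServer(("127.0.0.1", port), AlertHandler)
```

Calling make_server().serve_forever() starts the listener. Keeping the verdict logic in a plain function such as verdict_for makes it testable without the HTTP layer.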

Implementing such a solution could also allow integrating AI-assisted evaluation later if relevant.

- Is there a better way to approach this problem that I might not be aware of?
- Is running a custom data processing app viable in your opinion? Does it go against Wazuh design/use cases/scope?

As I am still new to SIEM solutions, I would be interested in your advice on how to tackle such problems, or if I am going completely off track.

Looking forward to your answers 🙂

Kind Regards,

Pierre

Ayooluwa Paul Akindeko

Jun 18, 2025, 4:04:00 AM
to Wazuh | Mailing List
1. One of the first things you can do to reduce noise in your Wazuh alerts is to use the frequency and timeframe rule attributes. They group identical hits: the rule fires only when the pattern occurs n times inside a sliding window of t seconds.
A single pattern match is then not enough, but if the pattern happens, say, 5 times in 60 seconds, Wazuh escalates the event. Both are set as attributes on the opening tag of a rule that correlates earlier matches (for example through if_matched_sid):
  <rule id="..." level="..." frequency="5" timeframe="60">

You can refer to the rule documentation.
2. A better alternative to adding more regexes to your setup is to use CDB lists. Wazuh can check whether a field extracted during the decoding phase is in a CDB list (constant database). The main use case of this feature is to create a white/black list of users, file hashes, IP addresses, or domain names. Documentation for CDB lists in Wazuh.
3. An out-of-the-box option from Wazuh is to use sibling decoders. They let you split one big access-log regex into several small, purpose-built decoders that each capture just the fields you need. Read more about sibling decoders with a practical example of how you can use them here. Documentation for Sibling Decoders in Wazuh.

These are native approaches you can take without having to build any custom service.
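
For item 1, the sliding-window idea behind frequency and timeframe can be pictured with a small Python sketch. It is only a conceptual model under simplifying assumptions (one counter per pattern, timestamps in seconds arriving in order). The class and method names are made up, and Wazuh's exact counting may differ.

```python
from collections import deque


class FrequencyWindow:
    """Signals once `frequency` hits fall within the last `timeframe` seconds."""

    def __init__(self, frequency, timeframe):
        self.frequency = frequency
        self.timeframe = timeframe
        self.hits = deque()

    def observe(self, timestamp):
        """Record one hit and report whether the threshold is reached."""
        self.hits.append(timestamp)
        # Forget hits that have slid out of the window.
        while self.hits and timestamp - self.hits[0] > self.timeframe:
            self.hits.popleft()
        return len(self.hits) >= self.frequency
```

With FrequencyWindow(5, 60), five hits ten seconds apart trigger on the fifth one, whereas hits spaced twenty seconds apart never accumulate five inside the window.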

Pierre

Jun 19, 2025, 7:32:26 AM
to Wazuh | Mailing List
Hello again,

Thank you very much for your answer. I hadn't thought of using sibling decoders for this purpose; I will definitely give it a try.

I have another question, about a different custom decoder. I am working on ingesting DDoS logs from Radware DefensePro appliances, but I am running into issues caused by the pre-decoding phase. Here are three sample logs (as they arrive in Wazuh, taken from archives.log):

Jan 01 00:00:00 cyber-controller-server : [Device: x.x.x.x x.x.x.x] M_20000: An attack of type "Anti-Scanning" started. Detected by policy: My-Policy; Attack name: TCP IP Scan; Source IP: x.x.x.x; Destination IP: Multiple; Destination port: 0; Action: drop.
Jan 01 00:00:00 cyber-controller-server : [Device: x.x.x.x x.x.x.x] M_20000: 3 attacks of type "Anti-Scanning" started between 00:00:00 CEST and 00:00:00 CEST. Detected by policies: Policy-A, Policy-B; Attack name: TCP IP Scan; Source IPs: x.x.x.x, x.x.x.x; Destination IP: Multiple; Destination port: 0; Action: drop.
Jan 01 00:00:00 cyber-controller-server : [Device: x.x.x.x x.x.x.x] M_20000: An attack of type "DoS" started. Detected by policy: Policy-A; Attack name: My-Attack; Source IP: x.x.x.x; Destination IP: x.x.x.x; Destination port: 443; Action: drop.

I have attached the decoder I wrote to parse these logs. The issue is that the pre-decoding phase breaks the decoder in a way that I do not understand:

**Phase 1: Completed pre-decoding.
        full event: 'Jan 01 00:00:00 cyber-controller-server : [Device: x.x.x.x x.x.x.x] M_20000: An attack of type "Anti-Scanning" started. Detected by policy: My-Policy; Attack name: TCP IP Scan; Source IP: x.x.x.x; Destination IP: Multiple; Destination port: 0; Action: drop.'
        timestamp: 'Jan 01 00:00:00'
        hostname: 'cyber-controller-server'
        program_name: ''

**Phase 2: Completed decoding.
        No decoder matched.

**Phase 3: Completed filtering (rules).
        id: '1002'
        level: '2'
        description: 'Unknown problem somewhere in the system.'
        groups: '['syslog', 'errors']'
        firedtimes: '1'
        gpg13: '['4.3']'
        mail: 'False'
**Alert to be generated.

Here the pre-decoder extracts the timestamp and hostname correctly, but yields an empty 'program_name'. The odd thing is that no decoder is triggered, even with a dummy empty or single-letter prematch, which should match basically anything. Why is that? I suspect the pre-decoder consumes the entire log, which prevents anything from matching. In fact, when I add any character in front of the log to prevent pre-decoding (a '1', for example), my decoder works as expected:

**Phase 1: Completed pre-decoding.
        full event: '1 Jan 01 00:00:00 cyber-controller-server : [Device: x.x.x.x x.x.x.x] M_20000: 3 attacks of type "Anti-Scanning" started between 00:00:00 CEST and 00:00:00 CEST. Detected by policies: Policy-A, Policy-B; Attack name: TCP IP Scan; Source IPs: x.x.x.x, x.x.x.x; Destination IP: Multiple; Destination port: 0; Action: drop.'

**Phase 2: Completed decoding.
        name: 'ddos'
        action: 'drop'
        appliance_ip: 'x.x.x.x'
        appliance_name: 'M_20000'
        attack_count: '3'
        attack_name: 'TCP IP Scan'
        attack_type: 'Anti-Scanning'
        dstip: 'Multiple'
        dstport: '0'
        end_time: '00:00:00 CEST'
        policies: 'Policy-A, Policy-B'
        srcip: 'x.x.x.x, x.x.x.x'
        start_time: '00:00:00 CEST'

**Phase 3: Completed filtering (rules).
        id: '1002'
        level: '2'
        description: 'Unknown problem somewhere in the system.'
        groups: '['syslog', 'errors']'
        firedtimes: '1'
        gpg13: '['4.3']'
        mail: 'False'
**Alert to be generated.

I have come across this post, but in my case, I am sending the logs from an aggregation platform to the Wazuh syslog listener, and I would like to avoid adding custom configuration to my forwarding routes just to add a dummy prefix to the logs. Adding an empty program_name tag as mentioned here does not work either.

Is there a way to work around that? I think you should add a way to disable the pre-decoding phase and let users parse the header variables themselves. This has caused me trouble more than once.

(Sidenote: I am running Wazuh 4.11.2)

Thank you again for your help.

Best regards,

Pierre
0000-ddos.xml

Ayooluwa Paul Akindeko

Jun 23, 2025, 2:51:53 AM
to Wazuh | Mailing List
I'm sure using Sibling decoders will help.
For your second question, we recommend creating a new discussion thread. This helps other community users who might have a similar issue find the discussion more easily.
