[URGENT] wazuh-db orphan processes locking global.db -> zero FIM alerts for agent 000 (production manager) — need correct fix ASAP

Roshinni Gandhi
Jun 30, 2026, 5:31:34 AM
to Wazuh | Mailing List
Hi all,
 
Production single-node Wazuh manager is in a degraded state and I need to confirm the
correct fix before applying anything further. Posting here in parallel with a Slack
thread to try to get eyes on this faster — will cross-post any resolution back to both.
 
ENVIRONMENT
- Wazuh 4.14.1-rc2 (Manager + Indexer + Dashboard), single-node
- Ubuntu 20.04 (Kernel 5.4.0-208-generic)
- Manager has been running since November 2025 without a reboot
- DB path on this version: /var/ossec/queue/db/ (not /var/ossec/var/db/)
 
SYMPTOM
The manager's own agent (agent 000) has produced zero FIM alerts across the entire
available log retention window (31 daily archives, Feb 15 - Jun 26, 2026, ~4.5
months), so the true onset predates the available logs. All other agents are
generating FIM alerts normally. Non-FIM alerts for agent 000 (SSH, PAM, Office365)
also flow normally, which isolates the fault to the FIM/wazuh-db path specifically.
 
CONFIRMED ROOT CAUSE
KillMode=process in the wazuh-manager systemd unit means restarts only signal the
top-level wazuh-control process. wazuh-control itself manages child daemons with a
60-second per-daemon stop timeout; if wazuh-db doesn't exit in time, wazuh-control
removes its PID file and moves on, but the process survives as an orphan (PPID=1).
Each subsequent restart can launch a fresh wazuh-db alongside any surviving orphan(s),
and they contend for an exclusive lock on global.db.
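
To re-check for duplicate wazuh-db processes without eyeballing `ps` output, here is a minimal sketch that reads /proc directly. The helper names (`parse_stat`, `find_processes`) are mine, not anything Wazuh ships, and it assumes a Linux /proc layout. It only lists matches with their PPID; it does not decide which one is the orphan.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, ppid) parsed from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST closing parenthesis.
    left, right = stat_line.rsplit(")", 1)
    pid_text, comm = left.split("(", 1)
    fields = right.split()  # fields[0] is the state, fields[1] is the PPID
    return int(pid_text), comm, int(fields[1])


def find_processes(name, proc_root="/proc"):
    """List (pid, ppid) for every process whose comm equals `name`."""
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, ppid = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # process exited (or entry unreadable) mid-scan
        if comm == name:
            matches.append((pid, ppid))
    return sorted(matches)
```

Calling `find_processes("wazuh-db")` on the manager should return one tuple per running instance, so more than one entry is the condition described above.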
 
CURRENT LIVE STATE (verified this morning, multiple independent methods)
- 2 simultaneous wazuh-db processes:
  - PID 878507 — started Jun 29 17:44, ~18hrs uptime, holds the global.db WRITE lock
    (confirmed via both `lsof` showing FD 31uw, and `lslocks` showing POSIX WRITE on
    byte range 1073741824-1073741825)
  - PID 925488 — started today 10:43:33, logged its own startup independently
    (wazuh-db: INFO: Started (pid: 925488)) with NO corresponding wazuh-control or
    systemd log entry anywhere near that timestamp. We've ruled out a manual restart
    and ruled out our own diagnostic commands as the trigger. Cause unknown.
- global.db-journal exists, ~21K, confirmed static/frozen across repeated checks
  (not actively growing, which is consistent with an open, uncommitted transaction)
- wazuh-db.sock does not exist on this version. Instead there's a live socket at
  /var/ossec/queue/sockets/wdb-http.sock, held open by both PIDs, which DOES respond
  correctly to a well-formed HTTP request (GET / -> 404 Not Found). So wazuh-db is
  not fully hung: its HTTP listener is alive, and it appears to be stuck specifically
  on the transaction it holds against global.db.
- This morning's syscheck scan (07:07:08-09) completed cleanly per syscheckd's own
  INFO start/end log lines, with zero errors anywhere in the surrounding log window.
  fim.db's mtime matches the scan-end timestamp exactly. So the local scan is NOT
  timing out or failing -- it completes and writes locally. The precisely confirmed
  failure point is narrower: zero alerts for agent 000 reach alerts.json even after
  this clean, on-schedule, error-free scan. This points to a failure in the
  comparison/sync step between completed local scan and alert generation, which
  routes through global.db's registration/checksum state -- not a scan timeout.
- No backup directory exists at /var/ossec/backup/db/, so the documented Wazuh
  global.db restore-from-backup recovery procedure is not available to us as a
  fallback if a simple restart doesn't fully clear the stuck state.
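
For anyone reading the `lslocks` byte range above, this small sketch maps a range onto the lock regions SQLite's Unix file-locking scheme uses. The offsets come from SQLite's documented locking layout (pending byte at 0x40000000, reserved byte right after it, then a 510-byte shared range). The function name is mine, and it assumes the range reported by `lslocks` is inclusive at both ends.

```python
PENDING_BYTE = 0x40000000           # 1073741824
RESERVED_BYTE = PENDING_BYTE + 1    # 1073741825
SHARED_FIRST = PENDING_BYTE + 2     # 1073741826
SHARED_SIZE = 510


def sqlite_lock_regions(start, end):
    """Name the SQLite lock regions overlapped by the inclusive range [start, end]."""
    regions = []
    if start <= PENDING_BYTE <= end:
        regions.append("PENDING")
    if start <= RESERVED_BYTE <= end:
        regions.append("RESERVED")
    shared_last = SHARED_FIRST + SHARED_SIZE - 1
    if start <= shared_last and end >= SHARED_FIRST:
        regions.append("SHARED")
    return regions
```

`sqlite_lock_regions(1073741824, 1073741825)` returns `["PENDING", "RESERVED"]`. If that reading applies here, PID 878507 has begun a write transaction but its lock does not cover the shared range that a full EXCLUSIVE lock would also include. That is an interpretation of the offsets only, not something confirmed against wazuh-db's code.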
 
WHAT WE HAVE NOT DONE YET
- Have not killed either wazuh-db process
- Have not touched global.db or the journal file
- Have not changed KillMode
- No manager restart performed by us today
 
QUESTIONS — trying to get to a confirmed-correct fix, not just a plausible one:
 
1. Given no backup exists, is `kill -9` on both orphaned PIDs followed by a clean
   `systemctl restart wazuh-manager` sufficient for a fresh wazuh-db to roll back
   the open journal automatically via SQLite's normal crash recovery on next open?
   Or is there a safer/recommended manual step to clear the journal first?
 
2. On KillMode: we've seen the PR history (wazuh/wazuh#14816, wazuh-qa#3266) showing
   the deliberate move from the deprecated KillMode=none to KillMode=process, NOT to
   control-group or mixed, specifically to preserve wazuh-control's own graceful
   shutdown sequencing (avoiding the WPK-upgrade self-kill issue that motivated
   `none` in the first place). Given that, is control-group actually the right fix
   here, or does it risk reintroducing that original problem? Is `mixed` a better
   middle ground, or is the real fix to address why wazuh-db isn't honoring
   wazuh-control's 60s shutdown timeout in the first place?
 
3. Has anyone seen wazuh-db start with no corresponding wazuh-control/systemd log
   line in 4.14.x? Trying to determine if there's an internal respawn/watchdog
   mechanism that isn't logged, since that would mean orphan accumulation can
   happen organically, not just across manual restarts.
 
4. Is there a way to manually issue a wazuh-db "force sync" or "rebuild checksum"
   command via wdb-http.sock for a single agent (000) to test the
   sync-failure theory without restarting the whole manager?
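
On question 1, here is a rough way to watch generic SQLite behavior on a throwaway database before touching the real one. The sketch starts a child process that opens a write transaction and is then killed with SIGKILL, after which the parent reopens the file and reads the table. Everything here (table name, file names, `kill_mid_transaction`) is made up for the test, it assumes a POSIX host and SQLite's default rollback-journal mode, and it says nothing about how wazuh-db itself opens or recovers global.db. Do not point it at anything under /var/ossec.

```python
import os
import signal
import sqlite3
import subprocess
import sys
import tempfile

# Child: begin a write transaction, insert a row, never commit.
CHILD = r"""
import sqlite3, sys, time
con = sqlite3.connect(sys.argv[1], isolation_level=None)
con.execute("BEGIN IMMEDIATE")
con.execute("INSERT INTO t VALUES ('uncommitted')")
print("ready", flush=True)
time.sleep(60)
"""


def kill_mid_transaction():
    """Return (journal_seen, rows) after SIGKILLing a writer mid-transaction."""
    db = os.path.join(tempfile.mkdtemp(), "scratch.db")
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE t (v TEXT)")
    con.execute("INSERT INTO t VALUES ('committed')")
    con.commit()
    con.close()

    child = subprocess.Popen(
        [sys.executable, "-c", CHILD, db], stdout=subprocess.PIPE, text=True
    )
    child.stdout.readline()  # block until the child reports "ready"
    journal_seen = os.path.exists(db + "-journal")  # informational only
    child.send_signal(signal.SIGKILL)
    child.wait()

    # A fresh connection, as a restarted process would open.
    con = sqlite3.connect(db)
    rows = [r[0] for r in con.execute("SELECT v FROM t")]
    con.close()
    return journal_seen, rows
```

The thing to look at is whether `rows` contains only the committed row once the killed writer is gone. Whether the same holds for a ~21K journal that has sat frozen for hours under a second live process is exactly what I'd still like someone to confirm.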
 
This is a production box and we want to apply the right fix once rather than
iterate live. Any input appreciated, especially from anyone who has hit this exact
combination (orphaned wazuh-db + frozen global.db-journal + silent FIM-only failure
for the manager's own agent) in 4.14.x specifically.
 
Thanks in advance.