[URGENT] wazuh-db orphan processes locking global.db -> zero FIM alerts for agent 000 (production manager) — need correct fix ASAP

Roshinni Gandhi

Jun 30, 2026, 5:31:34 AM

to Wazuh | Mailing List

Hi all,

Production single-node Wazuh manager is in a degraded state and I need to confirm the
correct fix before applying anything further. Posting here in parallel with a Slack
thread to try to get eyes on this faster — will cross-post any resolution back to both.

ENVIRONMENT
- Wazuh 4.14.1-rc2 (Manager + Indexer + Dashboard), single-node
- Ubuntu 20.04 (Kernel 5.4.0-208-generic)
- Manager has been running since November 2025 without a reboot
- DB path on this version: /var/ossec/queue/db/ (not /var/ossec/var/db/)

SYMPTOM
No FIM alerts have been generated for the manager itself (agent 000) anywhere in the
available log retention window (31 daily archives, Feb 15 - Jun 26, 2026, ~4.5
months). True onset predates available logs. All other agents are generating FIM
alerts normally. Non-FIM alerts for agent 000 (SSH, PAM, Office365) flow normally,
isolating the fault to the FIM/wazuh-db path specifically.
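
For anyone who wants to reproduce the count on their own archives, here is a minimal sketch of the kind of per-agent tally we ran. It assumes one JSON object per line, and that FIM alerts carry either a top-level "syscheck" object or the "syscheck" rule group. The function name `count_fim_alerts` is ours, not part of Wazuh.

```python
import json
from collections import Counter


def count_fim_alerts(lines):
    """Count FIM (syscheck) alerts per agent ID from alerts.json-style lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            alert = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or partially written lines
        if not isinstance(alert, dict):
            continue
        groups = alert.get("rule", {}).get("groups", [])
        # Assumption: a FIM alert has a "syscheck" object or the "syscheck" group.
        if "syscheck" in alert or "syscheck" in groups:
            agent_id = alert.get("agent", {}).get("id", "unknown")
            counts[agent_id] += 1
    return counts
```

Any open text file handle can be passed in (for compressed archives, `gzip.open(path, "rt")` works); an agent that never appears simply reads back as 0 from the returned Counter.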

CONFIRMED ROOT CAUSE
KillMode=process in the wazuh-manager systemd unit means restarts only signal the
top-level wazuh-control process. wazuh-control itself manages child daemons with a
60-second per-daemon stop timeout; if wazuh-db doesn't exit in time, wazuh-control
removes its PID file and moves on, but the process survives as an orphan (PPID=1).
Each subsequent restart can launch a fresh wazuh-db alongside any surviving orphan(s),
and they contend for an exclusive lock on global.db.
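
A minimal sketch of how the duplicate-instance condition can be detected, assuming a procps-style `ps` that supports `-eo pid,ppid,etimes,comm`. The helper names are hypothetical. More than one entry in the result is the condition described above; the PPID is included only for inspection.

```python
import subprocess


def parse_instances(ps_output, name="wazuh-db"):
    """Parse `ps -eo pid,ppid,etimes,comm` output and keep processes called `name`."""
    instances = []
    for line in ps_output.splitlines()[1:]:  # first line is the column header
        fields = line.split(None, 3)
        if len(fields) != 4:
            continue
        pid, ppid, etimes, comm = fields
        if comm.strip() == name:
            instances.append(
                {"pid": int(pid), "ppid": int(ppid), "uptime_s": int(etimes)}
            )
    return instances


def running_instances(name="wazuh-db"):
    """Run ps on the local host and return the matching instances."""
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,etimes,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_instances(out, name)
```

Running `running_instances()` before and after a restart would show whether an older instance survived it.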

CURRENT LIVE STATE (verified this morning, multiple independent methods)
- 2 simultaneous wazuh-db processes:
  - PID 878507: started Jun 29 17:44, ~18hrs uptime, holds the global.db WRITE lock
    (confirmed via both `lsof` showing FD 31uw, and `lslocks` showing POSIX WRITE on
    byte range 1073741824-1073741825)
  - PID 925488: started today 10:43:33, logged its own startup independently
    (wazuh-db: INFO: Started (pid: 925488)) with NO corresponding wazuh-control or
    systemd log entry anywhere near that timestamp. We've ruled out a manual restart
    and ruled out our own diagnostic commands as the trigger. Cause unknown.
- global.db-journal exists, ~21K, confirmed static/frozen across repeated checks
(not actively growing — an open, uncommitted transaction)
- wazuh-db.sock does not exist on this version. Instead there's a live socket at
/var/ossec/queue/sockets/wdb-http.sock, owned by both PIDs, which DOES respond
correctly to a well-formed HTTP request (GET / -> 404 Not Found). So wazuh-db is
not fully hung -- its HTTP listener is alive -- it's specifically stuck on the
transaction it holds against global.db.
- This morning's syscheck scan (07:07:08-09) completed cleanly per syscheckd's own
INFO start/end log lines, with zero errors anywhere in the surrounding log window.
fim.db's mtime matches the scan-end timestamp exactly. So the local scan is NOT
timing out or failing -- it completes and writes locally. The precisely confirmed
failure point is narrower: zero alerts for agent 000 reach alerts.json even after
this clean, on-schedule, error-free scan. This points to a failure in the
comparison/sync step between completed local scan and alert generation, which
routes through global.db's registration/checksum state -- not a scan timeout.
- No backup directory exists at /var/ossec/backup/db/, so the documented Wazuh
global.db restore-from-backup recovery procedure is not available to us as a
fallback if a simple restart doesn't fully clear the stuck state.
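
On the `lslocks` byte range: the offsets line up with the locking bytes used by SQLite's Unix file-locking scheme (PENDING at 0x40000000, RESERVED one byte later, then a 510-byte SHARED range). Below is a small sketch that maps an inclusive byte range onto those regions. It assumes `lslocks` reports an inclusive END, and the function name is ours. Reading this as "writer holds RESERVED and PENDING but has not locked the SHARED range" is our interpretation, not something confirmed by Wazuh.

```python
# Locking byte offsets from SQLite's rollback-journal locking scheme.
PENDING_BYTE = 0x40000000          # 1073741824
RESERVED_BYTE = PENDING_BYTE + 1   # 1073741825
SHARED_FIRST = PENDING_BYTE + 2    # 1073741826
SHARED_SIZE = 510


def sqlite_lock_regions(start, end):
    """Name the SQLite locking regions overlapped by the inclusive range [start, end]."""
    regions = []
    if start <= PENDING_BYTE <= end:
        regions.append("PENDING")
    if start <= RESERVED_BYTE <= end:
        regions.append("RESERVED")
    shared_last = SHARED_FIRST + SHARED_SIZE - 1
    if start <= shared_last and end >= SHARED_FIRST:
        regions.append("SHARED")
    return regions
```

For the range we observed, `sqlite_lock_regions(1073741824, 1073741825)` returns `["PENDING", "RESERVED"]`, i.e. the write lock does not extend into the SHARED range.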

WHAT WE HAVE NOT DONE YET
- Have not killed either wazuh-db process
- Have not touched global.db or the journal file
- Have not changed KillMode
- No manager restart performed by us today

QUESTIONS — trying to get to a confirmed-correct fix, not just a plausible one:

1. Given no backup exists, is `kill -9` on both orphaned PIDs followed by a clean
`systemctl restart wazuh-manager` sufficient for a fresh wazuh-db to roll back
the open journal automatically via SQLite's normal crash recovery on next open?
Or is there a safer/recommended manual step to clear the journal first?

2. On KillMode: we've seen the PR history (wazuh/wazuh#14816, wazuh-qa#3266) showing
the deliberate move from the deprecated KillMode=none to KillMode=process, NOT to
control-group or mixed, specifically to preserve wazuh-control's own graceful
shutdown sequencing (avoiding the WPK-upgrade self-kill issue that motivated
`none` in the first place). Given that, is control-group actually the right fix
here, or does it risk reintroducing that original problem? Is `mixed` a better
middle ground, or is the real fix to address why wazuh-db isn't honoring
wazuh-control's 60s shutdown timeout in the first place?

3. Has anyone seen wazuh-db start with no corresponding wazuh-control/systemd log
line in 4.14.x? Trying to determine if there's an internal respawn/watchdog
mechanism that isn't logged, since that would mean orphan accumulation can
happen organically, not just across manual restarts.

4. Is there a way to manually issue a wazuh-db "force sync" or "rebuild checksum"
command via wdb-http.sock for a single agent (000) to test the
sync-failure theory without restarting the whole manager?
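
Related to question 1, one non-destructive check we are considering before touching anything live: copy global.db and its -journal file side by side into a scratch directory and open the copy with SQLite, which should treat the copied journal as a hot journal and roll it back on the copy only. A minimal sketch follows, with a hypothetical helper name. It assumes rollback-journal mode (not WAL) and that the copy is a usable snapshot, which is not guaranteed if a writer is active during the copy.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def integrity_check_on_copy(db_path):
    """Copy a SQLite DB (and its -journal, if any) to a temp dir and check the copy."""
    src = Path(db_path)
    with tempfile.TemporaryDirectory() as tmp:
        dst = Path(tmp) / src.name  # keep the base name so the journal still matches
        shutil.copy2(src, dst)
        journal = Path(str(src) + "-journal")
        if journal.exists():
            shutil.copy2(journal, Path(str(dst) + "-journal"))
        conn = sqlite3.connect(str(dst))
        try:
            # The first read of the copy triggers rollback of any hot journal.
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        finally:
            conn.close()
        return [row[0] for row in rows]
```

A result of `["ok"]` on the copy would suggest the same rollback should succeed on the real file after a restart, but it does not prove it; the original files are never opened for writing by this sketch.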

This is a production box and we want to apply the right fix once rather than
iterate live. Any input appreciated, especially from anyone who has hit this exact
combination (orphaned wazuh-db + frozen global.db-journal + silent FIM-only failure
for the manager's own agent) in 4.14.x specifically.

Thanks in advance.