Roshinni Gandhi
Jun 30, 2026, 5:31:34 AM
to Wazuh | Mailing List
Hi all,
Our production single-node Wazuh manager is in a degraded state, and I need to confirm
the correct fix before applying anything further. I'm posting here in parallel with a
Slack thread to get eyes on this faster, and will cross-post any resolution to both.
ENVIRONMENT
- Wazuh 4.14.1-rc2 (Manager + Indexer + Dashboard), single-node
- Ubuntu 20.04 (Kernel 5.4.0-208-generic)
- Manager has been running since November 2025 without a reboot
- DB path on this version: /var/ossec/queue/db/ (not /var/ossec/var/db/)
SYMPTOM
FIM alerts for the manager itself (agent 000) have produced zero results across the
entire available log retention window (31 daily archives, Feb 15 - Jun 26, 2026, ~4.5
months). True onset predates available logs. All other agents are generating FIM
alerts normally. Non-FIM alerts for agent 000 (SSH, PAM, Office365) flow normally,
isolating the fault to the FIM/wazuh-db path specifically.
CONFIRMED ROOT CAUSE
KillMode=process in the wazuh-manager systemd unit means restarts only signal the
top-level wazuh-control process. wazuh-control itself manages child daemons with a
60-second per-daemon stop timeout; if wazuh-db doesn't exit in time, wazuh-control
removes its PID file and moves on, but the process survives as an orphan (PPID=1).
Each subsequent restart can launch a fresh wazuh-db alongside any surviving orphan(s),
and they contend for an exclusive lock on global.db.
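For anyone who wants to check for the same condition, here is a minimal sketch of how surviving wazuh-db processes could be enumerated and compared against PID files. It is not Wazuh tooling. The function names are made up for illustration, and the PID-file directory and the `<name>-<pid>.pid` naming are assumptions to verify on your own install. It reads /proc directly, so it is Linux-only.

```python
import os

def parse_stat(stat_text):
    """Return (pid, comm, ppid) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' instead of splitting on whitespace.
    left, right = stat_text.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    fields = right.split()  # fields[0] is the state, fields[1] is the PPID
    return int(pid_str), comm, int(fields[1])

def running_pids(name, proc_root="/proc"):
    """List (pid, ppid) for every process whose comm equals `name`."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, ppid = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # process exited mid-scan, or the entry is unreadable
        if comm == name:
            found.append((pid, ppid))
    return sorted(found)

def pids_without_pidfile(pids, name, run_dir):
    """Return the PIDs that have no <name>-<pid>.pid file in run_dir.

    The file naming pattern is an assumption, not confirmed for 4.14.x.
    """
    return [p for p in pids
            if not os.path.exists(os.path.join(run_dir, f"{name}-{p}.pid"))]

if __name__ == "__main__":
    procs = running_pids("wazuh-db")
    print("wazuh-db processes (pid, ppid):", procs)
    # Assumed PID-file directory; adjust if your layout differs.
    print("no PID file:", pids_without_pidfile(
        [pid for pid, _ in procs], "wazuh-db", "/var/ossec/var/run"))
```

More than one entry in the first list, or any entry in the second, would match the orphan pattern described above. PPID=1 on its own may not be conclusive if the daemon normally detaches from its parent.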
CURRENT LIVE STATE (verified this morning, multiple independent methods)
- 2 simultaneous wazuh-db processes:
  - PID 878507: started Jun 29 17:44, ~18 hours uptime, holds the global.db WRITE lock
    (confirmed via both `lsof` showing FD 31uw, and `lslocks` showing POSIX WRITE on
    byte range 1073741824-1073741825)
  - PID 925488: started today 10:43:33, logged its own startup independently
    (wazuh-db: INFO: Started (pid: 925488)) with NO corresponding wazuh-control or
    systemd log entry anywhere near that timestamp. We've ruled out a manual restart
    and ruled out our own diagnostic commands as the trigger. Cause unknown.
- global.db-journal exists, ~21K, confirmed static/frozen across repeated checks
(not actively growing, i.e. an open, uncommitted transaction)
- wazuh-db.sock does not exist on this version. Instead there's a live socket at
/var/ossec/queue/sockets/wdb-http.sock, owned by both PIDs, which DOES respond
correctly to a well-formed HTTP request (GET / -> 404 Not Found). So wazuh-db is
not fully hung -- its HTTP listener is alive -- it's specifically stuck on the
transaction it holds against global.db.
- This morning's syscheck scan (07:07:08-09) completed cleanly per syscheckd's own
INFO start/end log lines, with zero errors anywhere in the surrounding log window.
fim.db's mtime matches the scan-end timestamp exactly. So the local scan is NOT
timing out or failing -- it completes and writes locally. The precisely confirmed
failure point is narrower: zero alerts for agent 000 reach alerts.json even after
this clean, on-schedule, error-free scan. This points to a failure in the
comparison/sync step between completed local scan and alert generation, which
routes through global.db's registration/checksum state -- not a scan timeout.
- No backup directory exists at /var/ossec/backup/db/, so the documented Wazuh
global.db restore-from-backup recovery procedure is not available to us as a
fallback if a simple restart doesn't fully clear the stuck state.
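To help interpret the `lslocks` byte range quoted above, here is a small sketch that maps a range onto the lock bytes used by SQLite's default unix VFS. It rests on two assumptions: that wazuh-db uses that stock locking layout for global.db, and that the end offset reported by `lslocks` is inclusive. The function name is made up for illustration.

```python
# Lock-byte layout of SQLite's default unix VFS (assumed to apply to global.db).
PENDING_BYTE = 0x40000000           # 1073741824
RESERVED_BYTE = PENDING_BYTE + 1    # 1073741825
SHARED_FIRST = PENDING_BYTE + 2     # start of the shared-lock range
SHARED_SIZE = 510                   # length of the shared-lock range

def describe_lock_range(start, end):
    """Name the SQLite lock bytes covered by the inclusive range [start, end]."""
    names = []
    if start <= PENDING_BYTE <= end:
        names.append("PENDING")
    if start <= RESERVED_BYTE <= end:
        names.append("RESERVED")
    shared_last = SHARED_FIRST + SHARED_SIZE - 1
    if start <= shared_last and end >= SHARED_FIRST:
        names.append("SHARED")
    return names

print(describe_lock_range(1073741824, 1073741825))  # ['PENDING', 'RESERVED']
```

If those assumptions hold, the range held by PID 878507 covers the pending and reserved bytes but not the shared range. That would be consistent with a writer that has opened a transaction and is waiting for readers to clear, not one that has reached a full exclusive lock. This is our reading only and is not confirmed.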
WHAT WE HAVE NOT DONE YET
- Have not killed either wazuh-db process
- Have not touched global.db or the journal file
- Have not changed KillMode
- No manager restart performed by us today
QUESTIONS — trying to get to a confirmed-correct fix, not just a plausible one:
1. Given no backup exists, is `kill -9` on both orphaned PIDs followed by a clean
`systemctl restart wazuh-manager` sufficient for a fresh wazuh-db to roll back
the open journal automatically via SQLite's normal crash recovery on next open?
Or is there a safer/recommended manual step to clear the journal first?
2. On KillMode: we've seen the PR history (wazuh/wazuh#14816, wazuh-qa#3266) showing
the deliberate move from the deprecated KillMode=none to KillMode=process, NOT to
control-group or mixed, specifically to preserve wazuh-control's own graceful
shutdown sequencing (avoiding the WPK-upgrade self-kill issue that motivated
`none` in the first place). Given that, is control-group actually the right fix
here, or does it risk reintroducing that original problem? Is `mixed` a better
middle ground, or is the real fix to address why wazuh-db isn't honoring
wazuh-control's 60s shutdown timeout in the first place?
3. Has anyone seen wazuh-db start with no corresponding wazuh-control/systemd log
line in 4.14.x? Trying to determine if there's an internal respawn/watchdog
mechanism that isn't logged, since that would mean orphan accumulation can
happen organically, not just across manual restarts.
4. Is there a way to manually issue a wazuh-db "force sync" or "rebuild checksum"
command via wdb-http.sock for a single agent (000) to test the
sync-failure theory without restarting the whole manager?
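On question 1, this is the stock SQLite behaviour we are hoping applies, sketched against a throwaway database and NOT against global.db. A child process opens a write transaction and dies without committing, leaving a journal behind; the parent then reopens the file. The file name, table, and function name are made up for illustration, and nothing here shows how wazuh-db itself opens or recovers its databases.

```python
import os
import sqlite3
import subprocess
import sys
import tempfile
import textwrap

# Child: open a write transaction, write enough to spill pages to disk,
# then exit abruptly without commit or close (roughly what kill -9 does).
CHILD = textwrap.dedent("""
    import os, sqlite3, sys
    conn = sqlite3.connect(sys.argv[1], isolation_level=None)
    conn.execute("PRAGMA cache_size = 10")
    conn.execute("BEGIN IMMEDIATE")
    conn.executemany("INSERT INTO t(v) VALUES (?)",
                     (("x" * 500,) for _ in range(5000)))
    os._exit(1)
""")

def crash_and_reopen(workdir):
    db = os.path.join(workdir, "throwaway.db")

    # One committed row that must survive the crash.
    conn = sqlite3.connect(db)
    conn.execute("CREATE TABLE t(id INTEGER PRIMARY KEY, v TEXT)")
    conn.execute("INSERT INTO t(v) VALUES ('committed')")
    conn.commit()
    conn.close()

    subprocess.run([sys.executable, "-c", CHILD, db])
    journal_left = os.path.exists(db + "-journal")

    # Reopening and reading is what triggers SQLite's journal rollback.
    conn = sqlite3.connect(db)
    rows = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    check = conn.execute("PRAGMA integrity_check").fetchone()[0]
    conn.close()
    return journal_left, rows, check

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(crash_and_reopen(d))
```

The caveat for our case is that this recovery can only run once no surviving process still holds a lock on the database, which is why the order of killing both PIDs before the restart matters to us.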
This is a production box and we want to apply the right fix once rather than
iterate live. Any input appreciated, especially from anyone who has hit this exact
combination (orphaned wazuh-db + frozen global.db-journal + silent FIM-only failure
for the manager's own agent) in 4.14.x specifically.
Thanks in advance.