Ops Agent not starting after adding additional log files to monitor


Frank Shimizu

Apr 27, 2022, 1:44:45 AM
to Google Stackdriver Discussion Forum
Hello,

We are using the Ops Agent (https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent) on our instances to monitor log files on disk and push them to Google (Stackdriver) Logging. The log files to monitor are configured in /etc/google-cloud-ops-agent/config.yaml. Each log file has its own receiver and its own pipeline. This has been working well so far.
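
For readers unfamiliar with this layout, here is a rough sketch of what "one receiver and one pipeline per log file" could look like. It is a small shell helper that prints a logging section in the Ops Agent's documented config.yaml structure. The receiver and pipeline IDs (log_1, pipeline_1, ...) and the example paths are hypothetical, not taken from the configuration discussed in this thread.

```shell
#!/bin/sh
# Emit an Ops Agent logging config with one "files" receiver and one
# pipeline per log path given as an argument.
# IDs such as log_1 / pipeline_1 are hypothetical placeholders.
gen_config() {
    echo "logging:"
    echo "  receivers:"
    i=0
    for path in "$@"; do
        i=$((i + 1))
        echo "    log_$i:"
        echo "      type: files"
        echo "      include_paths:"
        echo "        - $path"
    done
    echo "  service:"
    echo "    pipelines:"
    i=0
    for path in "$@"; do
        i=$((i + 1))
        echo "      pipeline_$i:"
        echo "        receivers: [log_$i]"
    done
}

# Example with two hypothetical log files:
gen_config /var/log/app/a.log /var/log/app/b.log
```

With roughly 90 files, a config of this shape contains roughly 90 receivers and 90 pipelines.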

Over time we have been adding more files. Since yesterday, after adding some additional log files, the Ops Agent does not seem to start any longer. The result is that no logs and metrics are pushed any more. If we revert to the older config.yaml with fewer files, the Ops Agent starts normally again. The old (working) config file contains roughly 60 log files to monitor, the new (non-working) file contains roughly 90.

We have checked the newly added log files for permission problems, very large sizes and so on, but could not find any likely cause for the problem. The status of the Ops Agent service looks like this when the issue is present:
● google-cloud-ops-agent.service - Google Cloud Ops Agent
   Loaded: loaded (/usr/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
   Active: active (exited) since Wed 2022-04-20 12:49:07 UTC; 20h ago
  Process: 28604 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
  Process: 28563 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -in /etc/google-cloud-ops-agent/config.yaml (code=exited, status=0/SUCCESS)
 Main PID: 28604 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/google-cloud-ops-agent.service


When the issue happens, there are no startup log entries from the Ops Agent in /var/log/google-cloud-ops-agent/subagents/logging-module.log. Normally the startup is logged there, but in our case, after a restart, the last entries we see are from the Ops Agent shutting down, and then nothing. So it does not appear to start up correctly, but there is no error message.

What could be preventing the Ops Agent from starting up without logs or errors? Is there any limitation, e.g. the number of files to monitor?

Best regards
Frank Shimizu

Igor Peshansky

Apr 27, 2022, 6:41:44 PM
to Frank Shimizu, Google Stackdriver Discussion Forum
Hi, Frank,

Generally, the number of configured log files should not affect the operation of the agent (especially not if it's under a hundred). We would need to investigate what's going on in your case, and request more information (starting with the exact config you are using and the status of all relevant services, as per the troubleshooting guide).

Are you in a position to open a Cloud support case? If so, that would be the best way to proceed, as it would allow us to exchange information outside of a public list, and provide appropriate tracking.
        Igor

--
© 2021 Google Inc. 1600 Amphitheatre Parkway, Mountain View, CA 94043
 
Email preferences: You received this email because you signed up for the Google Stackdriver Discussion Google Group (google-stackdr...@googlegroups.com) to participate in discussions with other members of the GoogleStackdriver community.
---
You received this message because you are subscribed to the Google Groups "Google Stackdriver Discussion Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-stackdriver-d...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-stackdriver-discussion/a6a13478-a96e-42d5-b3e1-7014873dbe74n%40googlegroups.com.

Kyle Benson

Apr 29, 2022, 10:02:48 AM
to Frank Shimizu, Google Stackdriver Discussion Forum
Hey Frank, thanks for reaching out on this. 

Before making any changes, can you provide the output of the following commands? It will help us confirm the errors.
sudo service google-cloud-ops-agent restart && sudo service google-cloud-ops-agent status
And:
sudo journalctl -xe | grep "google_cloud_ops_agent_engine"

As a next step, if you comment out the new items in your configuration, and start the agent, does it work normally?

Thanks,
Kyle


--

Kyle Benson  he / him / his
Product Manager, Cloud Ops
kylea...@google.com
856-383-3273


Kyle Benson

Apr 29, 2022, 10:03:07 AM
to Frank Shimizu, Google Stackdriver Discussion Forum
Hey Frank, just bumping this. Did you get it sorted out?

Thanks,
Kyle

Igor Peshansky

Apr 29, 2022, 10:07:55 AM
to Frank Shimizu, Google Stackdriver Discussion Forum
Hi, Frank. Just following up on this. Thanks for the report — we've managed to reproduce the problem, and opened an internal issue to track. If you do end up opening a support case, feel free to ping me privately with the case number, so I can link it to the issue. That way you'll be kept apprised on the status of the investigation.
        Igor

Frank Shimizu

May 2, 2022, 9:59:24 AM
to Google Stackdriver Discussion Forum
Hello Igor, hello Kyle,

Thank you for your replies, and please excuse my delayed response; it's been busy. We had in fact opened a support case too, after my post here. The case number was sent to you directly.

In case this can help others in the future:
In our case it turned out that the fluent-bit process was hitting the limit of 1024 open files. Notably, this limit is hit even though we have fewer than 100 log files configured, so it seems that fluent-bit needs many more file handles than the number of log files it is configured to track. This may not be obvious during troubleshooting.
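
A minimal sketch of how this condition could be checked on a Linux host, assuming /proc is available. The helper name fd_usage is hypothetical. It compares the number of open file descriptors of a process with its soft "Max open files" limit. The process name fluent-bit in the usage comment is an assumption about how the sub-agent shows up in the process list.

```shell
#!/bin/sh
# Print the number of open file descriptors and the soft "Max open files"
# limit for a given PID, using the Linux /proc filesystem.
fd_usage() {
    pid="$1"
    if [ -z "$pid" ] || [ ! -r "/proc/$pid/limits" ]; then
        echo "cannot read limits for pid '$pid'" >&2
        return 1
    fi
    # Each entry in /proc/<pid>/fd is one open descriptor.
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l | tr -d ' ')
    # Line format: "Max open files  <soft>  <hard>  files"
    limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "pid=$pid open_fds=$open soft_limit=$limit"
}

# Example (run as root so that /proc/<pid>/fd is readable;
# the process name is an assumption):
#   fd_usage "$(pgrep -o fluent-bit)"
```

If open_fds is close to soft_limit, the process is likely running into the limit described above.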

The solution is to configure systemd, which starts fluent-bit, to set a higher open file limit for the fluent-bit process.
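
One way this could be done is with a systemd drop-in file that sets LimitNOFILE for the unit that runs fluent-bit. The sketch below is not the exact change we made. The helper name write_nofile_dropin, the file name limits.conf, the unit name google-cloud-ops-agent-fluent-bit.service and the value 65536 are assumptions, so please verify the unit name on your system (for example with systemctl list-units) before using it.

```shell
#!/bin/sh
# Write a systemd drop-in that raises the open-file limit for one unit.
# Args: unit name, limit, base directory (defaults to /etc/systemd/system).
# Prints the path of the file that was written.
write_nofile_dropin() {
    unit="$1"
    limit="$2"
    base="${3:-/etc/systemd/system}"
    dir="$base/$unit.d"
    mkdir -p "$dir" || return 1
    printf '[Service]\nLimitNOFILE=%s\n' "$limit" > "$dir/limits.conf" || return 1
    echo "$dir/limits.conf"
}

# Example (run as root; unit name and value are assumptions):
#   write_nofile_dropin google-cloud-ops-agent-fluent-bit.service 65536
#   systemctl daemon-reload
#   systemctl restart google-cloud-ops-agent
```

A drop-in is used here rather than editing the packaged unit file, since a package upgrade may overwrite the unit file.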

Regards
Frank Shimizu

Igor Peshansky

May 2, 2022, 3:50:10 PM
to Frank Shimizu, Google Stackdriver Discussion Forum
Frank,

Thank you for the investigation and the findings. We should be able to update the systemd unit for the fluent-bit sub-agent appropriately — that way it'll be fixed in the ops agent going forward.
        Igor
