Wrong disk/percent_used metric for /dev/sda2 after upgrade to Ops Agent


Georgi Sotirov

Dec 9, 2021, 9:47:56 AM
to gce-discussion
Hello,

After upgrading some of our VMs to the Ops Agent, following the procedure for installing on individual VMs, we noticed our disk usage alerts based on the disk/percent_used metric firing. The reason appears to be that for /dev/sda2 the metric shows strange values, 10 or even 100 times higher than the actual usage (verified on the VM instances themselves). The same metric for /dev/sda1 (the boot partition) on the problematic VMs is fine and shows the real percentage of used disk space. I'm unable to find an explanation for this problem, and after troubleshooting the Ops Agent I found nothing abnormal. What could be the reason for this strange behavior? Any suggestions are welcome.
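
For anyone who wants to repeat the on-VM verification mentioned above, here is a minimal Python sketch. It assumes a Linux VM and computes the percentage roughly the way df derives its Use% column (used divided by used plus available). The function name `percent_used` is only illustrative, and the Ops Agent may compute its metric differently.

```python
import os


def percent_used(mount_point: str) -> float:
    """Return used space on mount_point as a percentage of used + available."""
    st = os.statvfs(mount_point)
    used = (st.f_blocks - st.f_bfree) * st.f_frsize  # bytes currently in use
    avail = st.f_bavail * st.f_frsize                # bytes available to unprivileged users
    if used + avail == 0:
        return 0.0  # pseudo file systems can report zero blocks
    return 100.0 * used / (used + avail)


if __name__ == "__main__":
    # "/" is just an example; pass the mount point backed by the partition in question.
    print(f"/: {percent_used('/'):.1f}% used")
```

A result computed this way stays between 0 and 100, so it gives a baseline to compare against the value shown in Cloud Monitoring.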


Regards,
--
Georgi

Digil (Google Cloud Platform Support)

Dec 9, 2021, 3:08:33 PM
to gce-discussion

As long as you are using the correct metric for the disk (prefixed with agent.googleapis.com/disk/), the Ops Agent should give you the correct result. From your description, it is not clear whether you have installed the agent on a compatible GCE VM running a supported operating system.

If your GCE VM has a supported operating system (as explained in the prerequisites document) and you are still facing the issue, I would recommend testing the scenario on a brand new VM with a supported operating system. This would confirm whether the issue is specific to the GCE VM (specifically at the OS level) or not. If the issue is reproducible on the brand new VM, then I would strongly recommend reporting it to the Google Cloud Monitoring team using the issue tracker, for an additional review of the specific metric and the Ops Agent.

To create a defect report for the Google Cloud Monitoring service, you may use this direct issue-tracker thread as well. When submitting an issue report, try to include all relevant details, including the reproduction steps (if any), error logs, the commands you are using, etc. This would help the Cloud Monitoring Engineering team pinpoint the issue more quickly and efficiently.

Georgi Sotirov

Dec 10, 2021, 2:39:17 AM
to gce-discussion
Well, I believe we're using the correct metric, whose fully qualified name is "agent.googleapis.com/disk/percent_used", but the Ops Agent is not giving the correct result, as seen from the example below. The actual disk usage for /dev/sda2 on the VM is about 45%, but on the chart we currently get 1514.3%, which is weird since the value for /dev/sda1 is correct, as it was before with the legacy agent. The same information appears in the OBSERVABILITY tab on the VM details page. This value is well above 100, although according to the documentation it should be "...between 0.0 and 100.0...".

[Image: disk_usage_sda2.png]
The alert itself has not been changed in months. The VM instance is CentOS 8, so it runs an OS that is compatible with and supported by the Ops Agent.
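
The documented range quoted above ("between 0.0 and 100.0") can be turned into a quick sanity check on sampled values. This is only a sketch: the helper name `out_of_range` is made up for illustration, the /dev/sda2 value is the one from the chart, and the /dev/sda1 value is a placeholder rather than a real reading.

```python
def out_of_range(samples, low=0.0, high=100.0):
    """Return the (label, value) pairs whose value falls outside [low, high]."""
    return [(label, value) for label, value in samples if not low <= value <= high]


samples = [
    ("/dev/sda1", 12.0),    # hypothetical in-range reading
    ("/dev/sda2", 1514.3),  # value shown on the chart in this thread
]
print(out_of_range(samples))
```

Any sample this returns cannot be a valid percentage, so the check separates a reporting problem from a disk that is genuinely filling up.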

I created a brand new CentOS 8 VM instance, named so that it is matched by the alert's filter, but the problem didn't reproduce. However, we'd like to have working alerts for our existing VM instances, so I will consider your suggestion and open an issue.


Regards,
--
Georgi

Digil (Google Cloud Platform Support)

Dec 10, 2021, 3:26:07 PM
to gce-discussion
So the issue is not reproducible on a brand new VM with a supported operating system, which is a good sign. This confirms that there isn't any defect within the product itself. At this point, I am not sure whether there is anything wrong with the configuration of the agent itself or with some other OS-level setup. The issue can be caused by anything, such as a faulty OS configuration (try updating the OS with the latest available packages/updates), the Ops Agent configuration (uninstalling the agent and then reinstalling it might help), etc.

If you want in-depth troubleshooting, I would highly recommend posting this concern on Stackoverflow.com as well (the gce-discussion group is not meant for troubleshooting one-to-one issues or support). When posting the concern there, try to include as many details of the issue as possible, including the error logs (if any). If you think that this issue is related to a particular GCP product that needs to be investigated further, feel free to use Google Cloud Platform's support channel. Or, as I mentioned in my last message, if you have enough documentation and reproduction steps that point to a defect in the Google Cloud Platform product, kindly use the Google Cloud issue tracker to report it as a bug/defect.

Georgi Sotirov

Dec 13, 2021, 2:04:08 AM
to gce-discussion
What I'm trying to understand is where the problem is. We haven't made any configuration changes to the Ops Agent or at the OS level; we just replaced the legacy agent. I tried reinstalling the Ops Agent and also installing older versions (2.6.x, 2.5.x, etc.). I'm not sure what OS configuration could affect the reported disk usage, and at the OS level I do not find any problem (e.g. disk usage is reported properly by the df utility). The OS (CentOS 8.5.2111) is also up to date. The issue is that a value which (according to the documentation) should range between 0.0 and 100.0 is actually reported as 10 or even 100 times higher, which triggers alerts. I find it quite strange that it is OK for one partition, but not for the other.

Georgi Sotirov

Dec 13, 2021, 3:45:06 AM
to gce-discussion
I had read the "Some of the metrics are missing or inconsistent" section before, but I thought this was known and handled properly. Apparently it is not, because I just removed version 2.7.0 of the Ops Agent, then installed version 1.0.10, and the problem was immediately resolved. The disk/percent_used metric for /dev/sda2 is now properly reported in the range between 0.0 and 100.0, as expected. I hope this new information helps to identify the issue.

Marc Pons Carrascos

Dec 16, 2021, 7:23:12 AM
to gce-discussion

As per your last message, I understand that with version 1.0.10 you don't have that issue anymore. According to the documentation [1], the Ops Agent is supported on CentOS 8. Could you please confirm which specific version of the Ops Agent gave you that problem?

Kind Regards

---------------------

[1]:https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent#linux_operating_systems

Georgi Sotirov

Dec 16, 2021, 8:14:00 AM
to gce-discussion
I had the problem with the latest version - 2.7.0, but I also tried 2.6.0 and 2.5.0, before switching to 1.0.10.

Georgi Sotirov

Dec 22, 2021, 1:16:00 AM
to gce-discussion
Good news. The reported issue 210008250 was investigated and the problem was found to be due to bind mounts. The fix is due to appear in the next Ops Agent version, 2.7.2, so I'll post a final update after it's released and the problem is confirmed fixed.
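
Since bind mounts turned out to be the trigger, a quick way to spot candidates is to look for block devices that appear at more than one mount point. The Python sketch below parses text in the /proc/self/mounts format. The thread does not describe how the agent miscounted, so this only lists candidates and does not reproduce the bug. The function name and the sample mount table (including /var/lib/example) are hypothetical.

```python
from collections import defaultdict


def devices_with_multiple_mounts(mounts_text: str) -> dict:
    """Map each /dev device to its mount points when it is mounted more than once."""
    mount_points = defaultdict(list)
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, file system type, options, dump, pass
        if len(fields) >= 2 and fields[0].startswith("/dev/"):
            mount_points[fields[0]].append(fields[1])
    return {dev: points for dev, points in mount_points.items() if len(points) > 1}


# Hypothetical sample; on a real VM, read the text from /proc/self/mounts instead.
SAMPLE = """\
/dev/sda1 /boot xfs rw,relatime 0 0
/dev/sda2 / xfs rw,relatime 0 0
/dev/sda2 /var/lib/example xfs rw,relatime 0 0
"""
print(devices_with_multiple_mounts(SAMPLE))
```

A device can also show up more than once for reasons other than bind mounts, so treat the output as a starting point for inspection rather than proof.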

Georgi Sotirov

Jan 10, 2022, 7:04:46 AM
to gce-discussion
To conclude the thread: The problem was fixed with version 2.8.0 of Ops Agent.
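
As a summary of the version findings in this thread, here is a small Python sketch that classifies an Ops Agent version string. It encodes only what was reported here: 2.5.0, 2.6.0 and 2.7.0 showed the problem, 1.0.10 did not, and 2.8.0 was confirmed fixed. Treating every version after 2.8.0 as fixed is an assumption, and all names in the code are illustrative.

```python
REPORTED_AFFECTED = {(2, 5, 0), (2, 6, 0), (2, 7, 0)}  # versions tried in this thread
REPORTED_UNAFFECTED = {(1, 0, 10)}                     # workaround version in this thread
FIXED_FROM = (2, 8, 0)                                 # release confirmed fixed


def parse_version(text: str) -> tuple:
    """Turn '2.7.0' into (2, 7, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def classify(version_text: str) -> str:
    """Return 'fixed', 'affected', 'unaffected' or 'unknown' for a version string."""
    version = parse_version(version_text)
    if version >= FIXED_FROM:
        return "fixed"  # assumption: later releases keep the fix
    if version in REPORTED_AFFECTED:
        return "affected"
    if version in REPORTED_UNAFFECTED:
        return "unaffected"
    return "unknown"  # not mentioned in the thread


for v in ("1.0.10", "2.7.0", "2.7.2", "2.8.0"):
    print(v, classify(v))
```

Versions that nobody tested here, such as 2.7.2, are reported as unknown rather than guessed.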