Libceph Connect Error 101

13 views

Skip to first unread message

Deandra Uleman

unread,

Aug 4, 2024, 11:09:33 PM8/4/24

to injebacchart

Afterceph server's ips change, kernel libceph client keeps connection retry to old monitor ips. could someone please has experience to resolve the problem, namely cleaning kernel errors up?kernel log errors resemble below:

Many of these problem cases are hard to summarize down to a short phrase that adequately describes the problem. Each problem will start with a bulleted list of symptoms. Keep in mind that all symptoms may not apply depending on the configuration of Rook. If the majority of the symptoms are seen there is a fair chance you are experiencing that problem.

If after trying the suggestions found on this page and the problem is not resolved, the Rook team is very happy to help you troubleshoot the issues in their Slack channel. Once you have registered for the Rook Slack, proceed to the #ceph channel to ask for assistance.

After you verify the basic health of the running pods, next you will want to run Ceph tools for status of the storage components. There are two ways to run the Ceph tools, either in the Rook toolbox or inside other Rook pods that are already running.

The first two status commands provide the overall cluster health. The normal state for cluster operations is HEALTH_OK, but will still function when the state is in a HEALTH_WARN state. If you are in a WARN state, then the cluster is in a condition that it may enter the HEALTH_ERROR state at which point all disk I/O operations are halted. If a HEALTH_WARN state is observed, then one should take action to prevent the cluster from halting when it enters the HEALTH_ERROR state.

There are many Ceph sub-commands to look at and manipulate Ceph objects, well beyond the scope this document. See the Ceph documentation for more details of gathering information about the health of the cluster. In addition, there are other helpful hints and some best practices located in the Advanced Configuration section. Of particular note, there are scripts for collecting logs and gathering OSD information there.

What is happening here is that the MON pods are restarting and one or more of the Ceph daemons are not getting configured with the proper cluster information. This is commonly the result of not specifying a value for dataDirHostPath in your Cluster CRD.

The dataDirHostPath setting specifies a path on the local host for the Ceph daemons to store configuration and data. Setting this to a path like /var/lib/rook, reapplying your Cluster CRD and restarting all the Ceph daemons (MON, MGR, OSD, RGW) should solve this problem. After the Ceph daemons have been restarted, it is advisable to restart the rook-tools pod.

When the operator is starting a cluster, the operator will start one mon at a time and check that they are healthy before continuing to bring up all three mons. If the first mon is not detected healthy, the operator will continue to check until it is healthy. If the first mon fails to start, a second and then a third mon may attempt to start. However, they will never form quorum and the orchestration will be blocked from proceeding.

Likely you will see an error similar to the following that the operator is timing out when connecting to the mon. The last command is ceph mon_status, followed by a timeout message five minutes later.

If you see the timeout in the operator log, verify if the mon pod is running (see the next section). If the mon pod is running, check the network connectivity between the operator pod and the mon pod. A common issue is that the CNI is not configured correctly.

If the mon pod is failing as in this example, you will need to look at the mon pod status or logs to determine the cause. If the pod is in a crash loop backoff state, you should see the reason by describing the pod.

This is a common problem reinitializing the Rook cluster when the local directory used for persistence has not been purged. This directory is the dataDirHostPath setting in the cluster CRD and is typically set to /var/lib/rook. To fix the issue you will need to delete all components of Rook and then delete the contents of /var/lib/rook (or the directory specified by dataDirHostPath) on each of the hosts in the cluster. Then when the cluster CRD is applied to start a new cluster, the rook-operator should start all the pods as expected.

To confirm if you have OSDs in your cluster, connect to the Rook Toolbox and run the ceph status command. You should see that you have at least one OSD up and in. The minimum number of OSDs required depends on the replicated.size setting in the pool created for the storage class. In a "test" cluster, only one OSD is required (see storageclass-test.yaml). In the production storage class example (storageclass.yaml), three OSDs would be required.

Lastly, if you have OSDs up and in, the next step is to confirm the operator is responding to the requests. Look in the Operator pod logs around the time when the PVC was created to confirm if the request is being raised. If the operator does not show requests to provision the block image, the operator may be stuck on some other operation. In this case, restart the operator pod to get things going again.

When an OSD starts, the device or directory will be configured for consumption. If there is an error with the configuration, the pod will crash and you will see the CrashLoopBackoff status for the pod. Look in the osd pod logs for an indication of the failure.

One common case for failure is that you have re-deployed a test cluster and some state may remain from a previous deployment. If your cluster is larger than a few nodes, you may get lucky enough that the monitors were able to start and form quorum. However, now the OSDs pods may fail to start due to the old state. Looking at the OSD pod logs you will see an error about the file already existing.

If the error is from the file that already exists, this is a common problem reinitializing the Rook cluster when the local directory used for persistence has not been purged. This directory is the dataDirHostPath setting in the cluster CRD and is typically set to /var/lib/rook. To fix the issue you will need to delete all components of Rook and then delete the contents of /var/lib/rook (or the directory specified by dataDirHostPath) on each of the hosts in the cluster. Then when the cluster CRD is applied to start a new cluster, the rook-operator should start all the pods as expected.

Second, if Rook determines that a device is not available (has existing partitions or a formatted filesystem), Rook will skip consuming the devices. If Rook is not starting OSDs on the devices you expect, Rook may have skipped it for this reason. To see if a device was skipped, view the OSD preparation log on the node where the device was skipped. Note that it is completely normal and expected for OSD prepare pod to be in the completed state. After the job is complete, Rook leaves the pod around in case the logs need to be investigated.

After the settings are updated or the devices are cleaned, trigger the operator to analyze the devices again by restarting the operator. Each time the operator starts, it will ensure all the desired devices are configured. The operator does automatically deploy OSDs in most scenarios, but an operator restart will cover any scenarios that the operator doesn't detect automatically.

When the reboot command is issued, network interfaces are terminated before disks are unmounted. This results in the node hanging as repeated attempts to unmount Ceph persistent volumes fail with the following error:

You can set a given log level and apply it to all the Ceph daemons at the same time. For this, make sure the toolbox pod is running, then determine the level you want (between 0 and 20). You can find the list of all subsystems and their default values in Ceph logging and debug official guide. Be careful when increasing the level as it will produce very verbose logs.

So for each daemon, dataDirHostPath is used to store logs, if logging is activated. Rook will bindmount dataDirHostPath for every pod. Let's say you want to enable logging for mon.a, but only for this daemon. Using the toolbox or from inside the operator run:

This will activate logging on the filesystem, you will be able to find logs in dataDirHostPath/$NAMESPACE/log, so typically this would mean /var/lib/rook/rook-ceph/log. You don't need to restart the pod, the effect will be immediate.

The meaning of this warning is written in the document. However, in many cases it is benign. For more information, please see the blog entry. Please refer to Configuring Pools if you want to know the proper pg_num of pools and change these values.

There is a critical flaw in OSD on LV-backed PVC. LVM metadata can be corrupted if both the host and OSD container modify it simultaneously. For example, the administrator might modify it on the host, while the OSD initialization process in a container could modify it too. In addition, if lvmetad is running, the possibility of occurrence gets higher. In this case, the change of LVM metadata in OSD container is not reflected to LVM metadata cache in host for a while.

You can know whether the above-mentioned tag exists with the command: sudo lvs -o lv_name,lv_tags. If the lv_tag field is empty in an LV corresponding to the OSD lv_tags, this OSD encountered the problem. In this case, please retire this OSD or replace with other new OSD before restarting.

This problem doesn't happen in newly created LV-backed PVCs because OSD container doesn't modify LVM metadata anymore. The existing lvm mode OSDs work continuously even thought upgrade your Rook. However, using the raw mode OSDs is recommended because of the above-mentioned problem. You can replace the existing OSDs with raw mode OSDs by retiring them and adding new OSDs one by one. See the documents Remove an OSD and Add an OSD on a PVC.

Unexpected partitions are created on host disks that are used by Ceph OSDs. This happens more often on SSDs than HDDs and usually only on disks that are 875GB or larger. Many tools like lsblk, blkid, udevadm, and parted will not show a partition table type for the partition. Newer versions of blkid are generally able to recognize the type as "atari".