Bug#1063338: dlm: cannot start dlm midcomms -97


Valentin Kleibel

Feb 6, 2024, 7:10:04 AM
Package: linux-image-amd64
Version: 6.1.76+1
Source: linux
Source-Version: 6.1.76+1
Severity: important
Control: notfound -1 6.6.15-2

Dear Maintainers,

We discovered a bug affecting dlm that prevents any TCP communication
by dlm when booted with Debian kernel 6.1.76-1.

Dlm startup works (corosync-cpgtool shows the dlm:controld group with
all expected nodes), but as soon as we try to add a lockspace, dmesg shows:
```
dlm: Using TCP for communications
dlm: cannot start dlm midcomms -97
```

It seems that commit "dlm: use kernel_connect() and kernel_bind()"
(e9cdebbe) was merged to 6.1.

Checking the code, it seems that the changed function
dlm_tcp_listen_bind() fails with error code 97 (EAFNOSUPPORT).
It is called via:

dlm/lockspace.c: threads_start() -> dlm_midcomms_start()
dlm/midcomms.c: dlm_midcomms_start() -> dlm_lowcomms_start()
dlm/lowcomms.c: dlm_lowcomms_start() -> dlm_listen_for_all() ->
dlm_proto_ops->listen_bind() = dlm_tcp_listen_bind()

The error code is returned all the way to threads_start(), where the
error message is emitted.

Booting with the unsigned kernel from testing (6.6.15-2), which also
contains this commit, works without issues.

I'm not sure what additional changes are required to get this working or
if rolling back this change is an option.

We'd be happy to test patches that might fix this issue.

Thanks for your help,
Valentin

Salvatore Bonaccorso

Feb 7, 2024, 5:50:05 AM
Hi Valentin, hi all

[This is about a regression reported in Debian for 6.1.76]
Thanks for your report. So we have a 6.1.76 specific regression for
the backport of e9cdebbe23f1 ("dlm: use kernel_connect() and
kernel_bind()").

Let's loop in the upstream regression list for tracking and people
involved in the subsystem to see if the issue can be identified. As
it is working for 6.6.15, which includes the commit backport as well, it
may well be that a prerequisite is missing.

# annotate regression with 6.1.y specific commit
#regzbot ^introduced e11dea8f503341507018b60906c4a9e7332f3663
#regzbot link: https://bugs.debian.org/1063338

Any ideas?

Regards,
Salvatore

Jordan Rife

Feb 7, 2024, 1:40:04 PM
Just a quick look comparing dlm_tcp_listen_bind between the latest 6.1
and 6.6 stable branches,
it looks like there is a mismatch here with the dlm_local_addr[0] parameter.

6.1
----

static int dlm_tcp_listen_bind(struct socket *sock)
{
	int addr_len;

	/* Bind to our port */
	make_sockaddr(dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
	return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
			   addr_len);
}

6.6
----
static int dlm_tcp_listen_bind(struct socket *sock)
{
	int addr_len;

	/* Bind to our port */
	make_sockaddr(&dlm_local_addr[0], dlm_config.ci_tcp_port, &addr_len);
	return kernel_bind(sock, (struct sockaddr *)&dlm_local_addr[0],
			   addr_len);
}

6.6 contains commit c51c9cd8 (fs: dlm: don't put dlm_local_addrs on heap) which
changed

static struct sockaddr_storage *dlm_local_addr[DLM_MAX_ADDR_COUNT];

to

static struct sockaddr_storage dlm_local_addr[DLM_MAX_ADDR_COUNT];

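It looks like the kernel_bind() call in 6.1 needs to be modified to match:
there, &dlm_local_addr[0] is the address of a pointer slot, not of the
sockaddr_storage itself.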


-Jordan

Alexander Aring

Feb 7, 2024, 4:40:05 PM
Hi,
makes sense. I tried to cherry-pick e9cdebbe23f1 ("dlm: use
kernel_connect() and kernel_bind()") on v6.1.67 as I don't see it
there. It failed and does not apply cleanly.

Are we talking here about a Debian kernel specific backport? If so,
maybe somebody forgot to modify those parts you mentioned.

- Alex

Jordan Rife

Feb 8, 2024, 12:50:04 PM
On Thu, Feb 8, 2024 at 3:37 AM Valentin Kleibel <vale...@vrvis.at> wrote:
>
> Hi Jordan, hi all
> We tried to apply commit c51c9cd8 (fs: dlm: don't put dlm_local_addrs on
> heap) to the Debian kernel 6.1.76 and came up with the attached patch.
> Besides the different offsets, there is a slight change in dlm_tcp_bind(),
> where in 6.1.76 kernel_bind() is used instead of sock->ops->bind() in
> the original commit.
>
> This patch solves the issue we experienced.
>
> Thanks for your help,
> Valentin

Good to hear that works for you! We should fix this in the 6.1 stable
kernel as well.

IMO it may be less risky and simpler to fix the backport of my patch
e9cdebbe23f1 ("dlm: use kernel_connect() and kernel_bind()") and just
switch (struct sockaddr *)&dlm_local_addr[0] to
(struct sockaddr *)dlm_local_addr[0] in the call to kernel_bind(), rather
than backporting c51c9cd8 (fs: dlm: don't put dlm_local_addrs on heap)
to 6.1.

I will have some time soon to fix the 6.1 backport, but it may make
sense just to revert in the meantime.

-Jordan

Jordan Rife

Feb 8, 2024, 4:30:05 PM
Hi Valentin,

Would you be able to confirm that the attached patch fixes your issue as well?

-Jordan
0001-dlm-Treat-dlm_local_addr-0-as-sockaddr_storage.patch

Valentin Kleibel

Feb 9, 2024, 6:10:04 AM
Hi

> Would you be able to confirm that the attached patch fixes your issue as well?

Yes it does.

@debian maintainers: is it possible to include this patch in the next
point release?

Thank you for your work,
Valentin

Jordan Rife

Feb 9, 2024, 11:40:04 AM
I sent this patch out to sta...@vger.kernel.org. Everyone should be
CC'd. Thanks for your help in confirming the fix works.

-Jordan

Salvatore Bonaccorso

Feb 19, 2024, 3:10:03 PM
Control: tags -1 + pending confirmed

Hi,

The fix for this issue landed in v6.1.78 and is pending for the next
upload.

Regards,
Salvatore

Debian Bug Tracking System

Feb 19, 2024, 3:10:04 PM
Processing control commands:

> tags -1 + pending confirmed
Bug #1063338 [src:linux] dlm: cannot start dlm midcomms -97
Added tag(s) confirmed and pending.

--
1063338: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1063338
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems