Re: fatal error cib permission denied

Marc Smith

Jun 17, 2020, 9:36:19 AM

to esos-...@googlegroups.com

Hi Thomas,

So starting 'pacemakerd' sounds like it fails. Can you try just
running 'pacemakerd' on the shell with something like this: pacemakerd
--verbose

And then report the output of that.

--Marc

On Wed, Jun 17, 2020 at 9:31 AM T. Sch. <tre...@web.de> wrote:
>
> Hi Marc and all others,
>
> I have been using ESOS for a while as a stand-alone storage server,
> which has worked very well for the last few years.
>
> Now I am learning how to set up an ESOS cluster config with a two-node cluster, but I am
> not able to even get pacemaker to start. (Other tests I learned from, e.g. with CentOS, were successful, so I am not entirely inexperienced with corosync/pacemaker cluster setups.)
>
> I tried to adapt some pieces from "http://marcitland.blogspot.com/2013/04/building-using-highly-available-esos.html" but had no success at all.
>
> I even tried multiple ESOS branches:
> 2.0.16 and master_3ab9c22_dgvszq
>
> So, I am nearly sure the mistake is mine, but I have no idea where else to look for it.
> (Maybe it is only an "ESOS peculiarity".)
>
> My config looks like this:
> #fresh install on two test nodes#
>
> #uname -n#
> sv0xx.domain.local (xx per host 11 and 12)
>
> #xtra_hosts#
> 192.168.0.31 sv011 sv011.domain.local
> 192.168.0.32 sv012 sv012.domain.local
> 10.35.6.21 corosync011
> 10.35.6.22 corosync012
>
> #/etc/corosync/corosync.conf#
> totem {
>     version: 2
>     crypto_cipher: none
>     crypto_hash: none
>     interface {
>         ringnumber: 0
>         bindnetaddr: 10.35.6.0
>         mcastaddr: 239.255.1.1
>         mcastport: 5405
>         ttl: 1
>     }
> }
> nodelist {
>     node {
>         ring0_addr: 10.35.6.21
>         nodeid: 1
>         name: corosync011
>     }
>     node {
>         ring0_addr: 10.35.6.22
>         nodeid: 2
>         name: corosync012
>     }
> }
> logging {
>     fileline: off
>     to_stderr: no
>     to_logfile: yes
>     logfile: /var/log/corosync.log
>     to_syslog: yes
>     syslog_facility: local2
>     debug: off
>     timestamp: off
>     logger_subsys {
>         subsys: QUORUM
>         debug: off
>     }
> }
> quorum {
>     provider: corosync_votequorum
>     expected_votes: 1
>     two_node: 1
> }
>
> #crm corosync status is fine (seems to be) on both#
> crm corosync status
> Printing ring status.
> Local node ID 1
> RING ID 0
> id = 10.35.6.21
> status = ring 0 active with no faults
> Quorum information
> ------------------
> Date: Mon Jun 15 18:21:34 2020
> Quorum provider: corosync_votequorum
> Nodes: 2
> Node ID: 1
> Ring ID: 1/40
> Quorate: Yes
>
> Votequorum information
> ----------------------
> Expected votes: 2
> Highest expected: 2
> Total votes: 2
> Quorum: 1
> Flags: 2Node Quorate WaitForAll
>
> Membership information
> ----------------------
> Nodeid Votes Name
> 1 1 10.35.6.21 (local)
> 2 1 10.35.6.22
>
> BUT 10 seconds after starting rc.pacemaker manually on both nodes,
> every node stops the pacemaker and corosync daemons.
>
> Things I noticed:
> ping -c 1 $(uname -n) resolves to 127.0.0.1 on both nodes
> daemon.log says: "Error in connection setup (/dev/shm...Broken pipe"
> local2.log says: FATAL: Cannot exec /usr/libexec/pacemaker/cib: Permission denied (13)
>
> Some Logs and commands are appended to this entry
> (so as not to impair readability too much).
> -corosync.log
> -daemon.log
> -local2.log
> -messages
> -pacemaker.log
>
> Please be so kind and show me a way that could solve my problem.
>
> Let me know if you need more info or log files to take a look ;-)
>
> (As I said: I am just learning ESOS clustering - and willing to give you a
> well formatted documentation if I'll have success with it.)
>
> Greets from Germany - Thomas
>
> --
> You received this message because you are subscribed to the Google Groups "esos-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to esos-users+...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/esos-users/c540351d-e860-42c6-a505-deedf87bece7n%40googlegroups.com.

T. Sch.

Jun 27, 2020, 3:40:34 AM

to esos-users

Dear Marc, thank you so much for looking at my problem.

I tried my best:

output on shell:

[root@schsv011 ~]# pacemakerd --verbose
wait(1846) = 0: Success (0)
wait(1847) = 0: Success (0)
wait(1848) = 0: Success (0)
wait(1849) = 0: Success (0)
wait(1850) = 0: Success (0)
wait(1851) = 0: Success (0)
wait(1847) = 0: Success (0)
wait(1848) = 0: Success (0)
wait(1849) = 0: Success (0)
wait(1850) = 0: Success (0)
wait(1851) = 0: Success (0)
wait(1847) = 0: Interrupted system call (4)
wait(1848) = 0: Interrupted system call (4)
wait(1850) = 0: Success (0)
wait(1851) = 0: Success (0)
wait(1847) = 0: Interrupted system call (4)
wait(1848) = 0: Interrupted system call (4)
wait(1851) = 0: Success (0)
wait(1847) = 0: Interrupted system call (4)
wait(1848) = 0: Interrupted system call (4)
wait(1847) = 0: Interrupted system call (4)

I have also attached the "Support Bundle" files and a few more.

stati.txt = corosync status before "pacemakerd --verbose"

esos_support... = you know it

Greets Thomas.

esos_support_pkg-1593100435.tgz

stati.txt

T. Sch.

Jul 2, 2020, 7:34:09 AM

to esos-users

Hi Marc,
is the information OK, or do you need more or different details?

Marc Smith wrote on Wednesday, June 17, 2020 at 15:36:19 UTC+2:

T. Sch.

Jul 8, 2020, 2:13:05 AM

to esos-users

output on shell:

#> pacemakerd --verbose
wait(1846) = 0: Success (0)
wait(1847) = 0: Success (0)
wait(1848) = 0: Success (0)
wait(1849) = 0: Success (0)
wait(1850) = 0: Success (0)
wait(1851) = 0: Success (0)
wait(1847) = 0: Success (0)
wait(1848) = 0: Success (0)
wait(1849) = 0: Success (0)
wait(1850) = 0: Success (0)
wait(1851) = 0: Success (0)
wait(1847) = 0: Interrupted system call (4)
wait(1848) = 0: Interrupted system call (4)
wait(1850) = 0: Success (0)
wait(1851) = 0: Success (0)
wait(1847) = 0: Interrupted system call (4)
wait(1848) = 0: Interrupted system call (4)
wait(1851) = 0: Success (0)
wait(1847) = 0: Interrupted system call (4)
wait(1848) = 0: Interrupted system call (4)
wait(1847) = 0: Interrupted system call (4)

Marc Smith wrote on Wednesday, June 17, 2020 at 15:36:19 UTC+2:

Marc Smith

Jul 8, 2020, 10:39:01 AM

to esos-...@googlegroups.com

Sorry for the delay! That doesn't necessarily look like 'pacemakerd'
is not running. What does the output of "crm_mon -1" and "crm
configure show" look like?

--Marc

T. Sch.

Jul 9, 2020, 6:06:07 AM

to esos-users

(Hint: at the moment nothing starts automatically; during testing I have to start all services on both hosts manually.)

here is the output:

[root@schsv011 ~]# /etc/rc.d/rc.corosync start
Starting corosync...

[root@schsv011 ~]# crm corosync status
Printing ring status.
Local node ID 1
RING ID 0
id = 10.35.6.21
status = ring 0 active with no faults
Quorum information
------------------
Date: Thu Jul 9 11:53:19 2020
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 1
Ring ID: 1/1144
Quorate: Yes

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 1
Flags: 2Node Quorate WaitForAll

Membership information
----------------------
Nodeid Votes Name
1 1 10.35.6.21 (local)
2 1 10.35.6.22

[root@schsv011 ~]# pacemakerd
wait(17011) = 0: Success (0)
wait(17012) = 0: Success (0)
wait(17013) = 0: Success (0)
wait(17014) = 0: Success (0)
wait(17015) = 0: Success (0)
wait(17016) = 0: Success (0)
wait(17012) = 0: Success (0)
wait(17013) = 0: Success (0)
wait(17014) = 0: Success (0)
wait(17015) = 0: Success (0)
wait(17016) = 0: Success (0)
wait(17012) = 0: Interrupted system call (4)
wait(17013) = 0: Interrupted system call (4)
wait(17015) = 0: Success (0)
wait(17016) = 0: Success (0)
wait(17012) = 0: Interrupted system call (4)
wait(17013) = 0: Interrupted system call (4)
wait(17016) = 0: Success (0)
wait(17012) = 0: Interrupted system call (4)
wait(17013) = 0: Interrupted system call (4)
wait(17012) = 0: Interrupted system call (4)

[root@schsv011 ~]# crm corosync status
Printing ring status.
Could not initialize corosync configuration API error 2
Cannot initialize CMAP service

So, pacemaker has "killed" corosync.

[root@schsv011 ~]# /etc/rc.d/rc.corosync start
Starting corosync...

[root@schsv011 ~]# pacemakerd ; sleep 3s ; crm_mon -1 ; crm
wait(17036) = 0: Success (0)
wait(17037) = 0: Success (0)
wait(17038) = 0: Success (0)
wait(17039) = 0: Success (0)
wait(17040) = 0: Success (0)
wait(17041) = 0: Success (0)
wait(17037) = 0: Success (0)
wait(17038) = 0: Success (0)
wait(17039) = 0: Success (0)
wait(17040) = 0: Success (0)
wait(17041) = 0: Success (0)
wait(17037) = 0: Interrupted system call (4)
wait(17038) = 0: Interrupted system call (4)
wait(17040) = 0: Success (0)
wait(17041) = 0: Success (0)
wait(17037) = 0: Interrupted system call (4)
wait(17038) = 0: Interrupted system call (4)
wait(17041) = 0: Success (0)
wait(17037) = 0: Interrupted system call (4)
wait(17038) = 0: Interrupted system call (4)
wait(17037) = 0: Interrupted system call (4)
Error: cluster is not available on this node
crm(live/schsv011.schneiderit.local)# quit
bye

[root@schsv011 ~]# pacemakerd ; sleep 3s ; crm_mon -1 ; crm configure show
cmap connection setup failed: CS_ERR_LIBRARY. Retrying in 1s
cmap connection setup failed: CS_ERR_LIBRARY. Retrying in 2s
cmap connection setup failed: CS_ERR_LIBRARY. Retrying in 3s
cmap connection setup failed: CS_ERR_LIBRARY. Retrying in 4s
cmap connection setup failed: CS_ERR_LIBRARY. Retrying in 5s
Could not connect to Cluster Configuration Database API, error 2
Error: cluster is not available on this node
ERROR: running cibadmin -Ql: Signon to CIB failed: Transport endpoint is not connected
Init failed, could not perform requested operations
ERROR: configure: Missing requirements

Alan Simpson

Jul 21, 2020, 5:23:37 AM

to esos-users

Hi,

I am seeing exactly the same problem, my local2 log shows :-

Jul 21 09:05:59 sana pacemakerd[1332]: warning: Quorum lost
Jul 21 09:05:59 sana pacemakerd[1335]: error: FATAL: Cannot exec /usr/libexec/pacemaker/cib: Permission denied (13)
Jul 21 09:05:59 sana pacemakerd[1332]: notice: Node sana state is now member
Jul 21 09:05:59 sana pacemakerd[1340]: error: FATAL: Cannot exec /usr/libexec/pacemaker/crmd: Permission denied (13)
Jul 21 09:05:59 sana pacemakerd[1339]: error: FATAL: Cannot exec /usr/libexec/pacemaker/pengine: Permission denied (13)
Jul 21 09:05:59 sana pacemakerd[1332]: warning: The cib process (1335) can no longer be respawned, shutting the cluster down.
Jul 21 09:05:59 sana pacemakerd[1332]: notice: Shutting down Pacemaker
Jul 21 09:05:59 sana pacemakerd[1332]: notice: Stopping crmd
Jul 21 09:05:59 sana pacemakerd[1332]: warning: The crmd process (1340) can no longer be respawned, shutting the cluster down.
Jul 21 09:05:59 sana pacemakerd[1332]: notice: Stopping pengine
Jul 21 09:05:59 sana pacemakerd[1338]: error: FATAL: Cannot exec /usr/libexec/pacemaker/attrd: Permission denied (13)
Jul 21 09:05:59 sana pacemakerd[1332]: warning: The pengine process (1339) can no longer be respawned, shutting the cluster down.
Jul 21 09:05:59 sana pacemakerd[1332]: notice: Stopping attrd
Jul 21 09:05:59 sana pacemakerd[1332]: warning: The attrd process (1338) can no longer be respawned, shutting the cluster down.
Jul 21 09:05:59 sana pacemakerd[1332]: notice: Stopping lrmd
Jul 21 09:05:59 sana pacemakerd[1332]: error: The lrmd process (1337) terminated with signal 15 (core=0)
Jul 21 09:05:59 sana pacemakerd[1332]: notice: Stopping stonith-ng
Jul 21 09:05:59 sana pacemakerd[1332]: error: The stonith-ng process (1336) terminated with signal 15 (core=0)
Jul 21 09:05:59 sana pacemakerd[1332]: notice: Shutdown complete
Jul 21 09:05:59 sana pacemakerd[1332]: notice: Attempting to inhibit respawning after fatal error

The permissions on those files are :-

-rwxr-xr-x 1 root root 236312 May 15 06:04 attrd
-rwxr-xr-x 1 root root 418208 May 15 06:04 cib
-rwxr-xr-x 1 root root 47712 May 15 06:04 cibmon
-rwxr-xr-x 1 root root 1540744 May 15 06:04 crmd
-rwxr-xr-x 1 root root 182912 May 15 06:04 lrmd
-rwxr-xr-x 1 root root 54392 May 15 06:04 lrmd_internal_ctl
-rwxr-xr-x 1 root root 72208 May 15 06:04 lrmd_test
-rwxr-xr-x 1 root root 52520 May 15 06:04 pengine
-rwxr-xr-x 1 root root 101848 May 15 06:04 stonith-test
-rwxr-xr-x 1 root root 506560 May 15 06:04 stonithd

Should they be owned by hacluster:haclient?

Thanks all.
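
The binaries listed above are already 0755, so the file modes alone do not explain the "Permission denied (13)". On Linux, exec also fails with that error when the calling user lacks search (execute) permission on any directory along the path. As far as I know, Pacemaker runs cib, crmd, pengine and attrd as the unprivileged hacluster user, which would fit the pattern in the log, but treat that as an assumption. A minimal sketch for checking every path component follows; `check_path_perms` is a hypothetical helper name, and it assumes a `stat` that supports `-c` (GNU coreutils or BusyBox).

```shell
#!/bin/sh
# Print the octal mode of every component of an absolute path, from the
# file up to "/", and flag directories that lack the execute (search)
# bit for "other" users. Assumes stat supports -c '%a'.
check_path_perms() {
    path=$1
    while :; do
        mode=$(stat -c '%a' "$path") || return 1
        note=""
        if [ -d "$path" ]; then
            # The last octal digit holds the "other" bits; an odd digit
            # means the search bit is set.
            case $mode in
                *[1357]) ;;
                *) note="  <-- no search permission for others" ;;
            esac
        fi
        printf '%4s %s%s\n' "$mode" "$path" "$note"
        # Stop at the root (or at "." for a relative path).
        if [ "$path" = "/" ] || [ "$path" = "." ]; then
            break
        fi
        path=$(dirname "$path")
    done
}
```

Running `check_path_perms /usr/libexec/pacemaker/cib` on an affected node should then show whether one of the parent directories is the component that blocks the exec.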

Marc Smith

Jul 21, 2020, 4:51:27 PM

to esos-...@googlegroups.com

No, the permissions on those files are okay... I looked at running
ESOS clusters and those permissions are the same.

What about the permissions on /var/lib/pacemaker (and directories
underneath that)? What does that look like on your system?

This is a fresh install of which ESOS version? You configured Corosync
(corosync.conf), then started Pacemaker (or attempted to) and you
can't configure it at all?

--Marc

T. Sch.

Jul 27, 2020, 8:31:45 AM

to esos-users

permissions on /var/lib/pacemaker:

[root@schsv011 lib]# pwd
/var/lib
[root@schsv011 lib]# ls pacemaker -alR
pacemaker:
total 0
drwx------ 6 hacluste haclient 120 Jan 1 2001 .
drwx------ 16 root root 320 Jan 1 2001 ..
drwx------ 2 hacluste haclient 40 Jan 1 2001 blackbox
drwx------ 2 hacluste haclient 160 Jun 10 13:23 cib
drwx------ 4 hacluste haclient 80 Jan 1 2001 cores
drwxr-x--- 2 hacluste haclient 40 Jan 1 2001 pengine

pacemaker/blackbox:
total 0
drwx------ 2 hacluste haclient 40 Jan 1 2001 .
drwx------ 6 hacluste haclient 120 Jan 1 2001 ..

pacemaker/cib:
total 16
drwx------ 2 hacluste haclient 160 Jun 10 13:23 .
drwx------ 6 hacluste haclient 120 Jan 1 2001 ..
-rw-r----- 1 root root 1 Jun 9 10:06 cib.last
-rw------- 1 root root 259 Jun 9 10:06 cib.xml
-rw------- 1 root root 33 Jun 10 11:54 cib.xml.sig
-rw-r--r-- 1 root root 33 Jun 10 13:23 cib.xml.sum
-rw-rw--w- 1 root root 0 Jun 9 20:26 shadow.1411
-rw-r----- 1 root root 0 Jun 9 20:26 shadow.1415

pacemaker/cores:
total 0
drwx------ 4 hacluste haclient 80 Jan 1 2001 .
drwx------ 6 hacluste haclient 120 Jan 1 2001 ..
drwx------ 2 root root 40 Jan 1 2001 hacluster
drwx------ 2 root root 40 Jan 1 2001 root

pacemaker/cores/hacluster:
total 0
drwx------ 2 root root 40 Jan 1 2001 .
drwx------ 4 hacluste haclient 80 Jan 1 2001 ..

pacemaker/cores/root:
total 0
drwx------ 2 root root 40 Jan 1 2001 .
drwx------ 4 hacluste haclient 80 Jan 1 2001 ..

pacemaker/pengine:
total 0
drwxr-x--- 2 hacluste haclient 40 Jan 1 2001 .
drwx------ 6 hacluste haclient 120 Jan 1 2001 ..

Yes, for me it was a fresh install of ESOS 2.0.16.

Yes, I configured Corosync, then started Pacemaker (or attempted to), and was not able to configure it at all.

Maybe it is similar to Alan's case?

Should I try version 2.1.0?

Greets

Thomas.

JR

Sep 3, 2020, 6:47:05 AM

to esos-users

Hi,

Same problem, fresh install of 2.1.3:

Sep 3 12:39:12 esos1 pacemakerd[1619]:   notice: Starting Pacemaker 1.1.20
Sep 3 12:39:12 esos1 pacemakerd[1619]:   notice: Quorum acquired
Sep 3 12:39:12 esos1 pacemakerd[1619]:   notice: Node esos1.panflex.admin state is now member
Sep 3 12:39:12 esos1 pacemakerd[1619]:   notice: Node esos2.panflex.admin state is now member
Sep 3 12:39:12 esos1 pacemakerd[1622]:    error: FATAL: Cannot exec /usr/libexec/pacemaker/cib: Permission denied (13)
Sep 3 12:39:12 esos1 pacemakerd[1619]: warning: The cib process (1622) can no longer be respawned, shutting the cluster down.

Jan
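
For comparing reports like the ones in this thread, it can help to reduce a log to just the binaries that pacemakerd could not exec. The following is a small sketch only: `failed_execs` is a hypothetical helper name, and the pattern is taken from the "FATAL: Cannot exec ...: Permission denied" lines quoted above, so it may need adjusting for other Pacemaker versions.

```shell
#!/bin/sh
# List the distinct binaries that pacemakerd reported it could not exec.
# Reads syslog-style lines on stdin; the message format is assumed to
# match the log excerpts quoted in this thread.
failed_execs() {
    sed -n 's/.*FATAL: Cannot exec \([^:]*\): Permission denied.*/\1/p' | sort -u
}
```

It reads from standard input, so it can be fed the local2 log file (wherever that lives on the node) or a pasted excerpt.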

On Monday, July 27, 2020 at 14:31:45 UTC+2, tre...@web.de wrote:

Marc Smith

Sep 30, 2020, 10:18:19 AM

to esos-...@googlegroups.com

Hi,

Sorry for the extremely late reply... I took a look, and it appears
the Buildbot host that builds these ESOS images is not using the
correct umask. This causes directories like "/usr/libexec/pacemaker/"
to be set with the wrong permissions (0700) where they should have
0755 for those binaries to run.

I'll take a look at the build host and get this resolved.

--Marc
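
As a rough illustration of how a umask can leave 0755 binaries inside a 0700 directory: `mkdir` honours the umask, while `install -m` sets the file mode explicitly. This is only one plausible mechanism, shown on throwaway temp paths; it is not taken from the actual ESOS build scripts, and `demo_umask` is a hypothetical name.

```shell
#!/bin/sh
# Illustration only: create a directory and an "installed" file under a
# given umask, then report the resulting modes. Uses a temporary tree,
# not the real ESOS build tree. Assumes stat supports -c '%a'.
demo_umask() {
    root=$(mktemp -d) || return 1
    (
        umask "$1"
        # mkdir applies the umask: 0777 & ~077 = 0700.
        mkdir -p "$root/libexec/pacemaker"
        # install -m sets the mode explicitly, so the umask is ignored.
        install -m 0755 /dev/null "$root/libexec/pacemaker/cib"
    )
    dir_mode=$(stat -c '%a' "$root/libexec/pacemaker")
    file_mode=$(stat -c '%a' "$root/libexec/pacemaker/cib")
    rm -rf "$root"
    echo "umask=$1 dir=$dir_mode file=$file_mode"
}
```

With this sketch, `demo_umask 077` reports a 700 directory holding a 755 file, while `demo_umask 022` reports 755 for both, which matches the 0700 versus 0755 difference described above.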

T. Sch.

Oct 1, 2020, 10:45:47 AM

to esos-users

So using 2.1.5 won't help, since it was built on the 28th of September?

Marc Smith

Oct 1, 2020, 2:05:27 PM

to esos-...@googlegroups.com

No. Should have this fixed soon though.

--Marc

Marc Smith

Oct 27, 2020, 11:26:01 PM

to esos-...@googlegroups.com

Hi,

Can you try the 3.0.0 build? This issue should be resolved there:
http://download.esos-project.com/packages/3.x.x/esos-3.0.0_z.zip

--Marc

T. Sch.

Oct 30, 2020, 11:19:30 AM

to esos-users

I am starting today with 3.0.0_z and will report the results as soon as I have finished all tests.

JR

Nov 2, 2020, 12:00:12 PM

to esos-users

Hi,

3.0.0_z

two nodes configured and online, thank you Marc.

JR

On Wednesday, October 28, 2020 at 4:26:01 UTC+1, Marc Smith wrote:
