perfctr via accessD fails for cores > 0

Harald Klimach

Mar 21, 2014, 5:30:54 PM
to likwid...@googlegroups.com
Hi,

I've installed LIKWID on our SandyBridge EP system running Arch Linux.
I set up the access daemon (likwid-accessD) to provide access to the MSR devices. likwid-perfctr now seems to work fine, but only for the first core:

likwid-perfctr -C 0 -g FLOPS_DP build/ateles

But trying to run this on other cores fails due to missing access rights:

$ likwid-perfctr -C 1 -g FLOPS_DP build/ateles
Failed to write data through daemon: daemon returned error 3 'failed to open device file' for cpu 1 reg 0xf4

It even fails with the same error when running as root.


I have:

$ ls -l /usr/bin/likwid-accessD 
-rwsr-xr-x 1 root root 15560 Nov 14 05:50 /usr/bin/likwid-accessD

and

$ getcap /usr/bin/likwid-accessD 
/usr/bin/likwid-accessD = cap_sys_rawio+ep

My Linux kernel is 3.13.6.

The MSR devices all seem to have the same rights:

$ ls -l /dev/cpu/0/msr
crw------- 1 root root 202, 0 Mar 13 09:38 /dev/cpu/0/msr

$ ls -l /dev/cpu/1/msr
crw------- 1 root root 202, 1 Mar 13 09:38 /dev/cpu/1/msr

What am I doing wrong?

Thanks,
Harald

Thomas Röhl

Mar 24, 2014, 7:20:22 AM
to likwid...@googlegroups.com
Hello Harald,

From my point of view your configuration looks fine. I tried likwid on our SandyBridge EP machine and got no error message like this.

But the message

Failed to write data through daemon: daemon returned error 3 'failed to open device file' for cpu 1 reg 0xf4
is not MSR-specific; it is returned by the access daemon when the Uncore PCI devices cannot be opened.

Please supply additional output so that we can get to the bottom of the problem.
To check the existence of the PCI-accessible counters: ls -la /proc/bus/pci/*
Additionally, please run both likwid-perfctr calls again with the -V verbosity switch. Please try "-c <cpuid>" and "-C <cpuid>" to check whether pinning is the problem. Does likwid-perfctr work as root with -M 0 on the command line?
Maybe your SandyBridge EP has a different topology than our test machine; please supply the likwid-topology output.


Greetings,
Thomas

Harald Klimach

Mar 24, 2014, 10:39:14 AM
to likwid...@googlegroups.com
Dear Thomas,

thanks for taking this up.

> Failed to write data through daemon: daemon returned error 3 'failed to open device file' for cpu 1 reg 0xf4
> is not MSR-specific; it is returned by the access daemon when the Uncore PCI devices cannot be opened.
>
> Please supply additional output so that we can get to the bottom of the problem.
> To check the existence of the PCI-accessible counters: ls -la /proc/bus/pci/*

OK, so I have:

$ lspci | grep "Performance counters"
3f:0e.1 Performance counters: Intel Corporation Xeon E5/Core i7 Processor Home Agent Performance Monitoring (rev 07)
3f:13.1 Performance counters: Intel Corporation Xeon E5/Core i7 Ring to PCI Express Performance Monitor (rev 07)
3f:13.4 Performance counters: Intel Corporation Xeon E5/Core i7 QuickPath Interconnect Agent Ring Registers (rev 07)
3f:13.5 Performance counters: Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 0 Performance Monitor (rev 07)
7f:0e.1 Performance counters: Intel Corporation Xeon E5/Core i7 Processor Home Agent Performance Monitoring (rev 07)
7f:13.1 Performance counters: Intel Corporation Xeon E5/Core i7 Ring to PCI Express Performance Monitor (rev 07)
7f:13.4 Performance counters: Intel Corporation Xeon E5/Core i7 QuickPath Interconnect Agent Ring Registers (rev 07)
7f:13.5 Performance counters: Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 0 Performance Monitor (rev 07)

and

$ ls -la /proc/bus/pci/3f/
total 0
dr-xr-xr-x  2 root root    0 Mar 24 15:02 .
dr-xr-xr-x 15 root root    0 Mar 24 15:02 ..
-rw-r--r--  1 root root  256 Mar 24 15:02 08.0
-rw-r--r--  1 root root  256 Mar 24 15:02 09.0
-rw-r--r--  1 root root  256 Mar 24 15:02 0a.0
-rw-r--r--  1 root root  256 Mar 24 15:02 0a.1
-rw-r--r--  1 root root  256 Mar 24 15:02 0a.2
-rw-r--r--  1 root root  256 Mar 24 15:02 0a.3
-rw-r--r--  1 root root  256 Mar 24 15:02 0b.0
-rw-r--r--  1 root root  256 Mar 24 15:02 0b.3
-rw-r--r--  1 root root  256 Mar 24 15:02 0c.0
-rw-r--r--  1 root root  256 Mar 24 15:02 0c.1
-rw-r--r--  1 root root  256 Mar 24 15:02 0c.2
-rw-r--r--  1 root root  256 Mar 24 15:02 0c.3
-rw-r--r--  1 root root  256 Mar 24 15:02 0c.6
-rw-r--r--  1 root root  256 Mar 24 15:02 0c.7
-rw-r--r--  1 root root  256 Mar 24 15:02 0d.0
-rw-r--r--  1 root root  256 Mar 24 15:02 0d.1
-rw-r--r--  1 root root  256 Mar 24 15:02 0d.2
-rw-r--r--  1 root root  256 Mar 24 15:02 0d.3
-rw-r--r--  1 root root  256 Mar 24 15:02 0d.6
-rw-r--r--  1 root root  256 Mar 24 15:02 0e.0
-rw-r--r--  1 root root  256 Mar 24 15:02 0e.1
-rw-r--r--  1 root root 4096 Mar 24 15:02 0f.0
-rw-r--r--  1 root root 4096 Mar 24 15:02 0f.1
-rw-r--r--  1 root root 4096 Mar 24 15:02 0f.2
-rw-r--r--  1 root root 4096 Mar 24 15:02 0f.3
-rw-r--r--  1 root root 4096 Mar 24 15:02 0f.4
-rw-r--r--  1 root root 4096 Mar 24 15:02 0f.5
-rw-r--r--  1 root root  256 Mar 24 15:02 0f.6
-rw-r--r--  1 root root 4096 Mar 24 15:02 10.0
-rw-r--r--  1 root root 4096 Mar 24 15:02 10.1
-rw-r--r--  1 root root 4096 Mar 24 15:02 10.2
-rw-r--r--  1 root root 4096 Mar 24 15:02 10.3
-rw-r--r--  1 root root 4096 Mar 24 15:02 10.4
-rw-r--r--  1 root root 4096 Mar 24 15:02 10.5
-rw-r--r--  1 root root 4096 Mar 24 15:02 10.6
-rw-r--r--  1 root root 4096 Mar 24 15:02 10.7
-rw-r--r--  1 root root  256 Mar 24 15:02 11.0
-rw-r--r--  1 root root  256 Mar 24 15:02 13.0
-rw-r--r--  1 root root  256 Mar 24 15:02 13.1
-rw-r--r--  1 root root  256 Mar 24 15:02 13.4
-rw-r--r--  1 root root  256 Mar 24 15:02 13.5
-rw-r--r--  1 root root  256 Mar 24 15:02 13.6

and a similar list for 7f.


> Additionally, please run both likwid-perfctr calls again with the -V verbosity switch. Please try "-c <cpuid>" and "-C <cpuid>" to check whether pinning is the problem. Does likwid-perfctr work as root with -M 0 on the command line?

$ likwid-perfctr -C0 -V -g FLOPS_DP build/ateles
CPU family:
CPU model: 45 
CPU stepping:
CPU features: SSE SSE2 SSE3 SSE4.1 SSE4.2 AES AVX  
-------------------------------------------------------------
PERFMON version:
PERFMON number of counters:
PERFMON width of counters: 48 
PERFMON number of fixed counters:
-------------------------------------------------------------
-------------------------------------------------------------
CPU type: Intel Core SandyBridge EP processor 
CPU clock: 2.00 GHz 
Measuring group FLOPS_DP
Found event INSTR_RETIRED_ANY : Event_id 0x00 Umask 0x00 CfgBits 0x00 Cmask 0x00 
Found event CPU_CLK_UNHALTED_CORE : Event_id 0x00 Umask 0x00 CfgBits 0x00 Cmask 0x00 
Found event CPU_CLK_UNHALTED_REF : Event_id 0x00 Umask 0x00 CfgBits 0x00 Cmask 0x00 
Found event FP_COMP_OPS_EXE_SSE_FP_PACKED_DOUBLE : Event_id 0x10 Umask 0x10 CfgBits 0x00 Cmask 0x00 
Found event FP_COMP_OPS_EXE_SSE_FP_SCALAR_DOUBLE : Event_id 0x10 Umask 0x80 CfgBits 0x00 Cmask 0x00 
[0] perfmon_setup_counter PMC: Write Register 0x186 , Flags: 0x411010 
[0] perfmon_setup_counter PMC: Write Register 0x187 , Flags: 0x418010 
-------------------------------------------------------------
Executing: build/ateles 
perfmon_start_counters: Write Register 0x38F , Flags: 0x700000003 
perfmon_start_counters: Write Register 0x391 , Flags: 0x10000 

-> Runs successfully.

$ likwid-perfctr -C1 -V -g FLOPS_DP build/ateles
Failed to write data through daemon: daemon returned error 3 'failed to open device file' for cpu 1 reg 0xf4

# likwid-perfctr -M0 -C1 -V -g FLOPS_DP build/ateles
ERROR
failed to open pci device: No such file or directory!

Using -c yields the same behavior (works for 0, but not for 1).

> Maybe your SandyBridge EP has a different topology than our test machine; please supply the likwid-topology output.
$ likwid-topology 
-------------------------------------------------------------
CPU type: Intel Core SandyBridge EP processor 
*************************************************************
Hardware Thread Topology
*************************************************************
Sockets:
Cores per socket:
Threads per core:
-------------------------------------------------------------
HWThread Thread Core Socket
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 0 2 0
5 0 2 1
6 0 3 0
7 0 3 1
8 0 4 0
9 0 4 1
10 0 5 0
11 0 5 1
12 0 6 0
13 0 6 1
14 0 7 0
15 0 7 1
16 1 0 0
17 1 0 1
18 1 1 0
19 1 1 1
20 1 2 0
21 1 2 1
22 1 3 0
23 1 3 1
24 1 4 0
25 1 4 1
26 1 5 0
27 1 5 1
28 1 6 0
29 1 6 1
30 1 7 0
31 1 7 1
-------------------------------------------------------------
Socket 0: ( 0 16 2 18 4 20 6 22 8 24 10 26 12 28 14 30 )
Socket 1: ( 1 17 3 19 5 21 7 23 9 25 11 27 13 29 15 31 )
-------------------------------------------------------------

*************************************************************
Cache Topology
*************************************************************
Level: 1
Size: 32 kB
Cache groups: ( 0 16 ) ( 2 18 ) ( 4 20 ) ( 6 22 ) ( 8 24 ) ( 10 26 ) ( 12 28 ) ( 14 30 ) ( 1 17 ) ( 3 19 ) ( 5 21 ) ( 7 23 ) ( 9 25 ) ( 11 27 ) ( 13 29 ) ( 15 31 )
-------------------------------------------------------------
Level: 2
Size: 256 kB
Cache groups: ( 0 16 ) ( 2 18 ) ( 4 20 ) ( 6 22 ) ( 8 24 ) ( 10 26 ) ( 12 28 ) ( 14 30 ) ( 1 17 ) ( 3 19 ) ( 5 21 ) ( 7 23 ) ( 9 25 ) ( 11 27 ) ( 13 29 ) ( 15 31 )
-------------------------------------------------------------
Level: 3
Size: 20 MB
Cache groups: ( 0 16 2 18 4 20 6 22 8 24 10 26 12 28 14 30 ) ( 1 17 3 19 5 21 7 23 9 25 11 27 13 29 15 31 )
-------------------------------------------------------------

*************************************************************
NUMA Topology
*************************************************************
NUMA domains: 2 
-------------------------------------------------------------
Domain 0:
Processors:  0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
Relative distance to nodes:  10 20
Memory: 175608 MB free of total 193415 MB
-------------------------------------------------------------
Domain 1:
Processors:  1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
Relative distance to nodes:  20 10
Memory: 187489 MB free of total 193534 MB
-------------------------------------------------------------


It looks like all cores on the first socket can be accessed, while those on the second socket (the odd-numbered ones) cannot be used.
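
This observation can be checked mechanically from the "Socket N: ( ... )" lines of the likwid-topology output. The following is only a small sketch that assumes exactly that line format; `socket_of_hwthreads` is a hypothetical helper name and not part of LIKWID.

```python
import re


def socket_of_hwthreads(topology_text):
    """Map each hardware thread ID to its socket, using 'Socket N: ( ... )' lines."""
    mapping = {}
    pattern = re.compile(r"^Socket (\d+):\s*\(([\d\s]*)\)", re.MULTILINE)
    for match in pattern.finditer(topology_text):
        socket_id = int(match.group(1))
        for hwthread in match.group(2).split():
            mapping[int(hwthread)] = socket_id
    return mapping


# The two socket lines as printed by likwid-topology on this machine.
sample = """\
Socket 0: ( 0 16 2 18 4 20 6 22 8 24 10 26 12 28 14 30 )
Socket 1: ( 1 17 3 19 5 21 7 23 9 25 11 27 13 29 15 31 )
"""

mapping = socket_of_hwthreads(sample)
# Collect the sockets that the odd-numbered hardware threads belong to.
odd_sockets = {mapping[thread] for thread in mapping if thread % 2 == 1}
print("odd-numbered hardware threads live on socket(s):", odd_sockets)
```

If the set contains only socket 1, the failing CPUs are exactly those of the second socket.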

Thanks,
Harald

P.S.:
Complete output of
$ ls -la /proc/bus/pci/*
-r--r--r-- 1 root root 0 Mar 24 15:17 /proc/bus/pci/devices

/proc/bus/pci/00:
total 0
dr-xr-xr-x  2 root root    0 Mar 24 15:17 .
dr-xr-xr-x 15 root root    0 Mar 24 15:17 ..
-rw-r--r--  1 root root 4096 Mar 24 15:17 00.0
-rw-r--r--  1 root root 4096 Mar 24 15:17 01.0
-rw-r--r--  1 root root 4096 Mar 24 15:17 01.1
-rw-r--r--  1 root root 4096 Mar 24 15:17 02.0
-rw-r--r--  1 root root 4096 Mar 24 15:17 02.2
-rw-r--r--  1 root root 4096 Mar 24 15:17 03.0
-rw-r--r--  1 root root 4096 Mar 24 15:17 03.2
-rw-r--r--  1 root root 4096 Mar 24 15:17 05.0
-rw-r--r--  1 root root 4096 Mar 24 15:17 05.2
-rw-r--r--  1 root root 4096 Mar 24 15:17 11.0
-rw-r--r--  1 root root  256 Mar 24 15:17 16.0
-rw-r--r--  1 root root  256 Mar 24 15:17 16.1
-rw-r--r--  1 root root  256 Mar 24 15:17 1a.0
-rw-r--r--  1 root root 4096 Mar 24 15:17 1c.0
-rw-r--r--  1 root root 4096 Mar 24 15:17 1c.7
-rw-r--r--  1 root root  256 Mar 24 15:17 1d.0
-rw-r--r--  1 root root  256 Mar 24 15:17 1e.0
-rw-r--r--  1 root root  256 Mar 24 15:17 1f.0
-rw-r--r--  1 root root  256 Mar 24 15:17 1f.2

/proc/bus/pci/01:
total 0
dr-xr-xr-x  2 root root    0 Mar 24 15:17 .
dr-xr-xr-x 15 root root    0 Mar 24 15:17 ..
-rw-r--r--  1 root root 4096 Mar 24 15:17 00.0
-rw-r--r--  1 root root 4096 Mar 24 15:17 00.1

/proc/bus/pci/02:
total 0
dr-xr-xr-x  2 root root    0 Mar 24 15:17 .
dr-xr-xr-x 15 root root    0 Mar 24 15:17 ..
-rw-r--r--  1 root root 4096 Mar 24 15:17 00.0
-rw-r--r--  1 root root 4096 Mar 24 15:17 00.1

/proc/bus/pci/03:
total 0
dr-xr-xr-x  2 root root    0 Mar 24 15:17 .
dr-xr-xr-x 15 root root    0 Mar 24 15:17 ..
-rw-r--r--  1 root root 4096 Mar 24 15:17 00.0

/proc/bus/pci/04:
total 0
dr-xr-xr-x  2 root root    0 Mar 24 15:17 .
dr-xr-xr-x 15 root root    0 Mar 24 15:17 ..
-rw-r--r--  1 root root 4096 Mar 24 15:17 00.0
-rw-r--r--  1 root root 4096 Mar 24 15:17 00.1

/proc/bus/pci/05:
total 0
dr-xr-xr-x  2 root root    0 Mar 24 15:17 .
dr-xr-xr-x 15 root root    0 Mar 24 15:17 ..
-rw-r--r--  1 root root 4096 Mar 24 15:17 00.0

/proc/bus/pci/09:
total 0
dr-xr-xr-x  2 root root    0 Mar 24 15:17 .
dr-xr-xr-x 15 root root    0 Mar 24 15:17 ..
-rw-r--r--  1 root root 4096 Mar 24 15:17 00.0

/proc/bus/pci/0a:
total 0
dr-xr-xr-x  2 root root    0 Mar 24 15:17 .
dr-xr-xr-x 15 root root    0 Mar 24 15:17 ..
-rw-r--r--  1 root root 4096 Mar 24 15:17 00.0
-rw-r--r--  1 root root 4096 Mar 24 15:17 01.0

/proc/bus/pci/0b:
total 0
dr-xr-xr-x  2 root root    0 Mar 24 15:17 .
dr-xr-xr-x 15 root root    0 Mar 24 15:17 ..
-rw-r--r--  1 root root 4096 Mar 24 15:17 00.0

/proc/bus/pci/0c:
total 0
dr-xr-xr-x  2 root root   0 Mar 24 15:17 .
dr-xr-xr-x 15 root root   0 Mar 24 15:17 ..
-rw-r--r--  1 root root 256 Mar 24 15:17 00.0

/proc/bus/pci/3f:
total 0
dr-xr-xr-x  2 root root    0 Mar 24 15:17 .
dr-xr-xr-x 15 root root    0 Mar 24 15:17 ..
-rw-r--r--  1 root root  256 Mar 24 15:17 08.0
-rw-r--r--  1 root root  256 Mar 24 15:17 09.0
-rw-r--r--  1 root root  256 Mar 24 15:17 0a.0
-rw-r--r--  1 root root  256 Mar 24 15:17 0a.1
-rw-r--r--  1 root root  256 Mar 24 15:17 0a.2
-rw-r--r--  1 root root  256 Mar 24 15:17 0a.3
-rw-r--r--  1 root root  256 Mar 24 15:17 0b.0
-rw-r--r--  1 root root  256 Mar 24 15:17 0b.3
-rw-r--r--  1 root root  256 Mar 24 15:17 0c.0
-rw-r--r--  1 root root  256 Mar 24 15:17 0c.1
-rw-r--r--  1 root root  256 Mar 24 15:17 0c.2
-rw-r--r--  1 root root  256 Mar 24 15:17 0c.3
-rw-r--r--  1 root root  256 Mar 24 15:17 0c.6
-rw-r--r--  1 root root  256 Mar 24 15:17 0c.7
-rw-r--r--  1 root root  256 Mar 24 15:17 0d.0
-rw-r--r--  1 root root  256 Mar 24 15:17 0d.1
-rw-r--r--  1 root root  256 Mar 24 15:17 0d.2
-rw-r--r--  1 root root  256 Mar 24 15:17 0d.3
-rw-r--r--  1 root root  256 Mar 24 15:17 0d.6
-rw-r--r--  1 root root  256 Mar 24 15:17 0e.0
-rw-r--r--  1 root root  256 Mar 24 15:17 0e.1
-rw-r--r--  1 root root 4096 Mar 24 15:17 0f.0
-rw-r--r--  1 root root 4096 Mar 24 15:17 0f.1
-rw-r--r--  1 root root 4096 Mar 24 15:17 0f.2
-rw-r--r--  1 root root 4096 Mar 24 15:17 0f.3
-rw-r--r--  1 root root 4096 Mar 24 15:17 0f.4
-rw-r--r--  1 root root 4096 Mar 24 15:17 0f.5
-rw-r--r--  1 root root  256 Mar 24 15:17 0f.6
-rw-r--r--  1 root root 4096 Mar 24 15:17 10.0
-rw-r--r--  1 root root 4096 Mar 24 15:17 10.1
-rw-r--r--  1 root root 4096 Mar 24 15:17 10.2
-rw-r--r--  1 root root 4096 Mar 24 15:17 10.3
-rw-r--r--  1 root root 4096 Mar 24 15:17 10.4
-rw-r--r--  1 root root 4096 Mar 24 15:17 10.5
-rw-r--r--  1 root root 4096 Mar 24 15:17 10.6
-rw-r--r--  1 root root 4096 Mar 24 15:17 10.7
-rw-r--r--  1 root root  256 Mar 24 15:17 11.0
-rw-r--r--  1 root root  256 Mar 24 15:17 13.0
-rw-r--r--  1 root root  256 Mar 24 15:17 13.1
-rw-r--r--  1 root root  256 Mar 24 15:17 13.4
-rw-r--r--  1 root root  256 Mar 24 15:17 13.5
-rw-r--r--  1 root root  256 Mar 24 15:17 13.6

/proc/bus/pci/40:
total 0
dr-xr-xr-x  2 root root    0 Mar 24 15:17 .
dr-xr-xr-x 15 root root    0 Mar 24 15:17 ..
-rw-r--r--  1 root root 4096 Mar 24 15:17 01.0
-rw-r--r--  1 root root 4096 Mar 24 15:17 02.0
-rw-r--r--  1 root root 4096 Mar 24 15:17 03.0
-rw-r--r--  1 root root 4096 Mar 24 15:17 03.2
-rw-r--r--  1 root root 4096 Mar 24 15:17 05.0
-rw-r--r--  1 root root 4096 Mar 24 15:17 05.2

/proc/bus/pci/7f:
total 0
dr-xr-xr-x  2 root root    0 Mar 24 15:17 .
dr-xr-xr-x 15 root root    0 Mar 24 15:17 ..
-rw-r--r--  1 root root  256 Mar 24 15:17 08.0
-rw-r--r--  1 root root  256 Mar 24 15:17 09.0
-rw-r--r--  1 root root  256 Mar 24 15:17 0a.0
-rw-r--r--  1 root root  256 Mar 24 15:17 0a.1
-rw-r--r--  1 root root  256 Mar 24 15:17 0a.2
-rw-r--r--  1 root root  256 Mar 24 15:17 0a.3
-rw-r--r--  1 root root  256 Mar 24 15:17 0b.0
-rw-r--r--  1 root root  256 Mar 24 15:17 0b.3
-rw-r--r--  1 root root  256 Mar 24 15:17 0c.0
-rw-r--r--  1 root root  256 Mar 24 15:17 0c.1
-rw-r--r--  1 root root  256 Mar 24 15:17 0c.2
-rw-r--r--  1 root root  256 Mar 24 15:17 0c.3
-rw-r--r--  1 root root  256 Mar 24 15:17 0c.6
-rw-r--r--  1 root root  256 Mar 24 15:17 0c.7
-rw-r--r--  1 root root  256 Mar 24 15:17 0d.0
-rw-r--r--  1 root root  256 Mar 24 15:17 0d.1
-rw-r--r--  1 root root  256 Mar 24 15:17 0d.2
-rw-r--r--  1 root root  256 Mar 24 15:17 0d.3
-rw-r--r--  1 root root  256 Mar 24 15:17 0d.6
-rw-r--r--  1 root root  256 Mar 24 15:17 0e.0
-rw-r--r--  1 root root  256 Mar 24 15:17 0e.1
-rw-r--r--  1 root root 4096 Mar 24 15:17 0f.0
-rw-r--r--  1 root root 4096 Mar 24 15:17 0f.1
-rw-r--r--  1 root root 4096 Mar 24 15:17 0f.2
-rw-r--r--  1 root root 4096 Mar 24 15:17 0f.3
-rw-r--r--  1 root root 4096 Mar 24 15:17 0f.4
-rw-r--r--  1 root root 4096 Mar 24 15:17 0f.5
-rw-r--r--  1 root root  256 Mar 24 15:17 0f.6
-rw-r--r--  1 root root 4096 Mar 24 15:17 10.0
-rw-r--r--  1 root root 4096 Mar 24 15:17 10.1
-rw-r--r--  1 root root 4096 Mar 24 15:17 10.2
-rw-r--r--  1 root root 4096 Mar 24 15:17 10.3
-rw-r--r--  1 root root 4096 Mar 24 15:17 10.4
-rw-r--r--  1 root root 4096 Mar 24 15:17 10.5
-rw-r--r--  1 root root 4096 Mar 24 15:17 10.6
-rw-r--r--  1 root root 4096 Mar 24 15:17 10.7
-rw-r--r--  1 root root  256 Mar 24 15:17 11.0
-rw-r--r--  1 root root  256 Mar 24 15:17 13.0
-rw-r--r--  1 root root  256 Mar 24 15:17 13.1
-rw-r--r--  1 root root  256 Mar 24 15:17 13.4
-rw-r--r--  1 root root  256 Mar 24 15:17 13.5
-rw-r--r--  1 root root  256 Mar 24 15:17 13.6

Thomas Röhl

Mar 31, 2014, 10:51:32 AM
to likwid...@googlegroups.com
Hi Harald,

sorry for the delay. Currently we have no clue what the problem on your machine is. We suspect that LIKWID tries to access a non-existent PCI device. Therefore, I attached a patch that logs the file path of the PCI device to the syslog in case of a failure. Can you please apply this patch (with -p1) and check whether the logged path exists when the failure occurs?

Greetings,
Thomas
syslog_pci_file.patch

Harald Klimach

Mar 31, 2014, 2:08:22 PM
to likwid...@googlegroups.com
Dear Thomas,


On Monday, March 31, 2014 at 16:51:32 UTC+2, Thomas Röhl wrote:
> Hi Harald,
>
> sorry for the delay. Currently we have no clue what the problem on your machine is. We suspect that LIKWID tries to access a non-existent PCI device. Therefore, I attached a patch that logs the file path of the PCI device to the syslog in case of a failure. Can you please apply this patch (with -p1) and check whether the logged path exists when the failure occurs?

Thanks a lot for following up on this. With the patched version I see the following in the syslog:

Mar 31 19:56:26 shu accessD[15864]: daemon started
Mar 31 19:56:27 shu accessD[15864]: daemon accepted client
Mar 31 19:56:27 shu accessD[15864]: Failed to open device file /proc/bus/pci/ff/10.0 on socket 1

As shown in my previous mail, there indeed is no /proc/bus/pci/ff on my system.
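
The check described above (pull the path out of the logged line and test whether it exists) can be sketched as follows. This assumes the message format produced by the patched daemon as shown in the log above; `failed_device` is a hypothetical helper name.

```python
import os
import re

# Matches the message written by the patched access daemon (format taken from the log above).
LOG_PATTERN = re.compile(r"Failed to open device file (\S+) on socket (\d+)")


def failed_device(log_line):
    """Return (path, socket) from a daemon log line, or None if the line does not match."""
    match = LOG_PATTERN.search(log_line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


line = ("Mar 31 19:56:27 shu accessD[15864]: "
        "Failed to open device file /proc/bus/pci/ff/10.0 on socket 1")

path, socket_id = failed_device(line)
# os.path.exists reports whether the logged device file is present on the current machine.
print(path, "socket", socket_id, "exists:", os.path.exists(path))
```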

Best,
Harald
 

Thomas Röhl

Apr 1, 2014, 12:48:20 PM
to likwid...@googlegroups.com
Hi Harald,


> Mar 31 19:56:27 shu accessD[15864]: Failed to open device file /proc/bus/pci/ff/10.0 on socket 1
>
> As shown in my previous mail, there indeed is no /proc/bus/pci/ff on my system.

Since we parse the /proc/bus/pci/devices file to find the Uncore counter devices, this looks strange. As a last check, can you please execute:

grep -i "80863c44" /proc/bus/pci/devices

This is the combined PCI vendor and device ID (vendor 8086, device 3c44) of the SandyBridge EP Uncore counter devices. According to the logs in the previous mails, the devices should be on buses 3f and 7f. If ff is not printed there, I have to check the lookup code again.

Greetings,
Thomas

Harald Klimach

Apr 1, 2014, 1:38:38 PM
to likwid...@googlegroups.com
Hi,
> grep -i "80863c44" /proc/bus/pci/devices

yields:
3f9d 80863c44 0               0               0               0               0               0               0               0               0               0               0               0               0               0               0 snbep_uncore
7f9d 80863c44 0               0               0               0               0               0               0               0               0               0               0               0               0               0               0 snbep_uncore
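
For reference, the first column of these lines can be decoded into the bus and device.function names used under /proc/bus/pci/. This is a minimal sketch, assuming the usual layout of that column (two hex digits for the bus, followed by two hex digits for the combined device/function byte); `find_pci_devices` is a name made up for this illustration.

```python
def find_pci_devices(devices_text, vendor_device_id):
    """Return (bus, 'dev.fn') pairs for all lines whose second column matches the ID."""
    wanted = vendor_device_id.lower()
    hits = []
    for line in devices_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[1].lower() != wanted:
            continue
        bus_devfn = int(fields[0], 16)
        bus = bus_devfn >> 8          # upper byte: bus number
        devfn = bus_devfn & 0xFF      # lower byte: device (5 bits) and function (3 bits)
        hits.append((f"{bus:02x}", f"{devfn >> 3:02x}.{devfn & 0x7:x}"))
    return hits


# The two matching lines from the grep output, with the zero columns shortened.
sample = (
    "3f9d\t80863c44\t0\t0\tsnbep_uncore\n"
    "7f9d\t80863c44\t0\t0\tsnbep_uncore\n"
)

for bus, dev_fn in find_pci_devices(sample, "80863c44"):
    print(f"/proc/bus/pci/{bus}/{dev_fn}")
```

On a live system the same function could be fed the contents of /proc/bus/pci/devices instead of the sample string.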

Thanks a lot,
Harald

Thomas Röhl

Apr 2, 2014, 8:01:06 AM
to likwid...@googlegroups.com
Hi Harald,

Thanks for the output. It seems that we have a bug in LIKWID's PCI code, but I cannot find it. I extracted the lookup function into a single file; please compile and run it. It searches for the vendor/device ID 80863c44 in the devices file and checks the access permissions of all performance-counter-related PCI devices on the bus. Maybe you can spot a bug in the code, since the problem occurs only on your machine. All of my Uncore-capable machines determine the right bus numbers with this code.

Greetings,
Thomas


pci_test.c

Harald Klimach

Apr 2, 2014, 9:37:26 AM
to likwid...@googlegroups.com
Dear Thomas,

I am so very sorry! It is all my fault. Somehow I managed to omit the version number of the LIKWID installation I am running,
which, as we have now figured out, is 3.0.
It turns out that there is a static socket-to-bus mapping in there:
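
As a purely hypothetical illustration of why a fixed socket-to-bus table fails on this machine, the sketch below compares such a table with a lookup from the devices file. It is not the LIKWID 3.0 source: the table values are assumptions chosen only to match the paths logged in this thread, and the lookup assumes that ascending bus numbers correspond to ascending socket numbers.

```python
# Hypothetical fixed table, NOT copied from LIKWID; values picked to match the logged paths.
STATIC_SOCKET_BUS = {0: "7f", 1: "ff"}

UNCORE_ID = "80863c44"  # vendor/device ID quoted earlier in the thread


def socket_bus_from_devices(devices_text, wanted_id=UNCORE_ID):
    """Build a socket -> bus mapping from the buses that carry the wanted device.

    Assumption: ascending bus numbers correspond to ascending socket numbers.
    """
    buses = set()
    for line in devices_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1].lower() == wanted_id:
            buses.add(fields[0][:2].lower())  # first two hex digits are the bus
    ordered = sorted(buses, key=lambda bus: int(bus, 16))
    return dict(enumerate(ordered))


# Lines in the /proc/bus/pci/devices format, with the middle columns shortened.
sample_devices = "3f9d\t80863c44\t0\tsnbep_uncore\n7f9d\t80863c44\t0\tsnbep_uncore\n"

looked_up = socket_bus_from_devices(sample_devices)
for socket_id, bus in looked_up.items():
    print(f"socket {socket_id}: static table says {STATIC_SOCKET_BUS[socket_id]}, "
          f"devices file says {bus}")
```

With these sample lines the static entry for socket 1 points at a bus that the devices file does not list, which is the kind of mismatch behind the failed open.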



I am deeply sorry that my blindness to the installed version caused so much hassle.
I thought I had the latest released version installed, but obviously I left this one lying around for too long.

My apologies and thanks for bearing with me!
Harald

Thomas Röhl

Apr 2, 2014, 9:49:57 AM
to likwid...@googlegroups.com
Dear Harald,

I'm glad that you found the problem and can start using LIKWID on your SandyBridge EP system now. No need to apologize, I'm happy that I have not missed a bug in this part of the code.

Have fun with LIKWID,
Thomas


Tito Cruz

Apr 4, 2014, 3:47:19 PM
to likwid...@googlegroups.com
Dear Harald,

I'm having a similar problem, which is still unsolved. I'm running kernel 3.2 and LIKWID 3.1.1.

I get stuck as soon as I try to use likwid-accessD. For instance:

# likwid-perfctr  -i -c0 -M 1
Failed to write data through daemon: daemon returned error 3 'failed to open device file' for cpu 0 reg 0xf4

# likwid-perfctr  -i -c0 -M 0

CPU family:    6
CPU model:    45
CPU stepping:    7
CPU features:    SSE SSE2 TM RDTSCP  

-------------------------------------------------------------
PERFMON version:    3
PERFMON number of counters:    4
PERFMON width of counters:    48
PERFMON number of fixed counters:    3
-------------------------------------------------------------

The daemon has the right permissions:
# ls -l /usr/local/bin/likwid-accessD
-rwsr-sr-x 1 root staff 20215 Apr  4 21:23 /usr/local/bin/likwid-accessD

and also the capabilities:

# getcap /usr/local/bin/likwid-accessD
/usr/local/bin/likwid-accessD = cap_sys_rawio+ep

# getcap /usr/local/bin/likwid-perfctr
/usr/local/bin/likwid-perfctr = cap_sys_rawio+ep

I've tried LIKWID 3.0.0 and 3.1.1, with the same result.

I applied the patch you proposed to Thomas in order to figure out why likwid-accessD is failing, and I got this:

Apr  4 21:43:30 machine accessD: daemon started
Apr  4 21:43:31 machine accessD: daemon accepted client
Apr  4 21:43:31 machine accessD: Failed to open device file /proc/bus/pci/7f/10.4 on socket 0
Apr  4 21:43:31 machine accessD: ERROR - [accessDaemon.c:672] zero read
Apr  4 21:43:31 machine accessD: daemon dropped client
Apr  4 21:43:31 machine accessD: daemon exiting

The device file /proc/bus/pci/7f/10.4 does not exist on my machine.

My topology is:

-------------------------------------------------------------
CPU type:    Intel Core SandyBridge EP processor
*************************************************************
Hardware Thread Topology
*************************************************************
Sockets:    2
Cores per socket:    6
Threads per core:    2
-------------------------------------------------------------
HWThread    Thread        Core        Socket
0        0        0        0
1        0        1        0
2        0        2        0
3        0        3        0
4        0        4        0
5        0        5        0
6        0        0        1
7        0        1        1
8        0        2        1
9        0        3        1
10        0        4        1
11        0        5        1
12        1        0        0
13        1        1        0
14        1        2        0
15        1        3        0
16        1        4        0
17        1        5        0
18        1        0        1
19        1        1        1
20        1        2        1
21        1        3        1
22        1        4        1
23        1        5        1
-------------------------------------------------------------
Socket 0: ( 0 12 1 13 2 14 3 15 4 16 5 17 )
Socket 1: ( 6 18 7 19 8 20 9 21 10 22 11 23 )

-------------------------------------------------------------

*************************************************************
Cache Topology
*************************************************************
Level:    1
Size:    32 kB
Cache groups:    ( 0 12 ) ( 1 13 ) ( 2 14 ) ( 3 15 ) ( 4 16 ) ( 5 17 ) ( 6 18 ) ( 7 19 ) ( 8 20 ) ( 9 21 ) ( 10 22 ) ( 11 23 )

-------------------------------------------------------------
Level:    2
Size:    256 kB
Cache groups:    ( 0 12 ) ( 1 13 ) ( 2 14 ) ( 3 15 ) ( 4 16 ) ( 5 17 ) ( 6 18 ) ( 7 19 ) ( 8 20 ) ( 9 21 ) ( 10 22 ) ( 11 23 )
-------------------------------------------------------------
Level:    3
Size:    15 MB
Cache groups:    ( 0 12 1 13 2 14 3 15 4 16 5 17 ) ( 6 18 7 19 8 20 9 21 10 22 11 23 )

-------------------------------------------------------------

*************************************************************
NUMA Topology
*************************************************************
NUMA domains: 2
-------------------------------------------------------------
Domain 0:
Processors:  0 1 2 3 4 5 12 13 14 15 16 17

Relative distance to nodes:  10 20
Memory: 15870.4 MB free of total 16350.1 MB
-------------------------------------------------------------
Domain 1:
Processors:  6 7 8 9 10 11 18 19 20 21 22 23

Relative distance to nodes:  20 10
Memory: 16071.3 MB free of total 16384 MB
-------------------------------------------------------------

I've been trying to make it work for hours with no luck. Do you have any idea of what is going on?
Thank you very much in advance.

Tito Cruz

Apr 4, 2014, 3:58:27 PM
to likwid...@googlegroups.com
Oops, I got lost in the conversation and ended up switching your names. This post was addressed to Thomas; sorry for the confusion :(

Thomas Röhl

Apr 7, 2014, 2:45:10 AM
to likwid...@googlegroups.com
Dear Tito,

can you please supply additional information about your system?

Output of
cat /proc/bus/pci/devices
ls -la /proc/bus/pci/7f

For better clarity on this mailing list, please attach the output as files to your post.

Greetings,
Thomas

Tito Cruz

Apr 7, 2014, 3:59:46 AM
to likwid...@googlegroups.com
Of course, Thomas,

Files are attached.
Thank you very much for helping me with this.

Best,
Tito Cruz
cat_proc_bus_pci_devices.txt
ls_proc_bus_pci_7f.txt

Thomas Röhl

Apr 7, 2014, 7:57:02 AM
to likwid...@googlegroups.com
Hi Tito,

as you stated, the PCI device 10.4 does not exist. This is quite strange, because you have the corresponding IMC devices 0, 1 and 3, but 2 is missing. I don't know the reason for this. The IMC devices correspond to your memory channels, and SandyBridge EP commonly has four of them. I will ask a colleague whether he has heard of such a case. Do you know whether your machine is somehow "stripped down", supports only 3 memory channels, or has some other property that could explain this?
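
A quick way to see which memory-channel device files are absent on a bus is sketched below. The mapping of file names to IMC channels is an assumption made for this illustration (only 10.0 and 10.4 appear in the logs of this thread), and `missing_channels` is a hypothetical helper; the demo runs against a throwaway directory rather than the real /proc/bus/pci.

```python
import os
import tempfile

# Assumption: these file names under /proc/bus/pci/<bus>/ stand for IMC channels 0..3.
# Only 10.0 and 10.4 are mentioned in the logs of this thread; the rest is illustrative.
ASSUMED_IMC_FILES = {0: "10.0", 1: "10.1", 2: "10.4", 3: "10.5"}


def missing_channels(pci_root, bus, channel_files=ASSUMED_IMC_FILES):
    """Return the channel numbers whose device file does not exist on the given bus."""
    return [
        channel
        for channel, name in sorted(channel_files.items())
        if not os.path.exists(os.path.join(pci_root, bus, name))
    ]


# Demo on a temporary directory that mimics a bus without the 10.4 device file.
with tempfile.TemporaryDirectory() as fake_root:
    os.makedirs(os.path.join(fake_root, "7f"))
    for name in ("10.0", "10.1", "10.5"):
        open(os.path.join(fake_root, "7f", name), "w").close()
    demo_missing = missing_channels(fake_root, "7f")

print("missing channels:", demo_missing)
```

On a real system the call would be `missing_channels("/proc/bus/pci", "7f")`.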

Just patching out memory channel 2 in one place will not help you; several files need to be patched. But if we find no other solution, this might be the only way to get LIKWID running on this machine.

Greetings,
Thomas

Tito Cruz

Apr 7, 2014, 8:50:29 AM
to likwid...@googlegroups.com
Dear Thomas,

Thank you again for your quick answer.

I just had a look at the technical specs of my Fujitsu Primergy TX200 S7 (attached).

On page 7 I read:
"Up to two Xeon 4C, 6C & 8C CPU`s (Socket-B2) with 1 serial QPI link ( Quick Path Interconnect ) and 3 memory channels per CPU"
This matches the memory diagram on page 4 of the same document.

I confess I am starting to get a little lost here. Does this mean that this server has an unusual memory architecture?

If the only solution is to patch out memory channel 2, can you please give me a general idea of how to do that? I will be happy to contribute by testing if needed.

Best,
Tito Cruz
cnfgTX200S7.pdf

Thomas Röhl

Apr 7, 2014, 10:02:40 AM
to likwid...@googlegroups.com
Hi Tito,

this system configuration for SandyBridge EP is new to me, but maybe it is not as rare as you and I think. I wrote a quick patch that removes memory channel 2 for SandyBridge. Please try it; if it does not work, I will have to take a deeper look at the code, which will take some time. PCI lookup code that automatically configures the available counters would be ideal, but since LIKWID uses structures that are initialized at compile time, it is hard to remove counters from the configuration at runtime.

Greetings
Thomas
no_imc2.patch

Tito Cruz

Apr 8, 2014, 6:41:06 AM
to likwid...@googlegroups.com
Dear Thomas,

Thank you very much for your patch. Unfortunately, I am afraid it didn't work. I tried to fix it myself, but it was too complicated for me. I'm sorry.

The error is still the same:

Failed to write data through daemon: daemon returned error 3 'failed to open device file' for cpu 0 reg 0xf4

If you are interested I could give you access to the machine.

Best,
Tito Cruz

Thomas Röhl

Apr 8, 2014, 8:30:01 AM
to likwid...@googlegroups.com
Dear Tito,

As I said, it was just a quick patch. Have you tried executing LIKWID as root with -M 0? Do you still get the failure? Maybe I forgot to remove parts from the access daemon in my patch.

Currently I don't have time to take a deeper look at your system, but when I have some free time, I will contact you to get access, if it is still needed then.

Greetings,
Thomas

Tito Cruz

Apr 8, 2014, 9:55:07 AM
to likwid...@googlegroups.com
Thank you again Thomas,

Yes, I still have problems when running as root with -M 0.
Anyway, thank you very much. Your efforts are highly appreciated. We will postpone the use of LIKWID on this machine for now.

Best,
Tito Cruz 

