Stratum SIGBUS error


Ariel Góes de Castro

Jul 27, 2023, 11:38:26 AM
to stratum-dev
Hello everyone,
I'm having trouble deploying Stratum as a Kubernetes container. This is the approach presented in the SD-Fabric v1.2.1-dev charts.

Environment info:
- Control VM (8 cores, 16GB RAM, Ubuntu 18.04), i.e., where the sd-fabric Helm charts are instantiated
- Leaf node (Wedge 100BF-32X, SONiC-Barefoot-2022-08-12)
- Calico v3.25.1
- Multus v4.0.2


The sdfabric-helm-charts/sdfabric/values.yaml file instructs me to set the taints and label the switch node as depicted below:

stratum.png

Then I labeled and tainted the switch node as suggested in the Stratum Kubernetes deployment guide:
1. kubectl label node <node-name> node-role.kubernetes.io=switch
2. kubectl taint node <node-name> node-role.kubernetes.io=switch:NoSchedule
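
A quick way to confirm that both commands took effect, sketched with standard kubectl queries. The `has_switch_taint` helper is hypothetical and introduced only for illustration; it scans `kubectl describe node` output on stdin for the taint from step 2.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if `kubectl describe node` output read from
# stdin lists the taint applied in step 2 (matched as a fixed string).
has_switch_taint() {
    grep -qF 'node-role.kubernetes.io=switch:NoSchedule'
}

# Usage (assumes kubectl is configured for the cluster; replace <node-name>):
#   kubectl get nodes -l node-role.kubernetes.io=switch
#   kubectl describe node <node-name> | has_switch_taint && echo "taint present"
```

If the first command does not list the switch, or the helper fails, the Stratum pod would presumably not be scheduled on the switch node at all.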

The problem is that once Stratum is instantiated, a SIGBUS error appears in the kubectl logs:
+ LOCAL_CHASSIS_CONFIG=/config/chassis_config.pb.txt
+ [ -f /config/chassis_config.pb.txt ]
+ /usr/bin/start-stratum.sh -enable_onlp=false -chassis_config_file=/config/chassis_config.pb.txt -max_log_size=0 -write_req_log_file= -read_req_log_file= -v=0 -minloglevel=0 -bf_switchd_background=false -colorlogtostderr=false -logtostderr=true -experimental_enable_p4runtime_translation
Mounting hugepages...
Skipping kernel module installation.
I20230724 18:11:57.530447     7 logging.cc:64] Stratum version: e22940edeadbee23956a903ed5580fa2248830df built at 2022-03-25T00:05:58+00:00 on host 7a47aab43a31 by user root.
I20230724 18:11:57.530922     7 bf_sde_wrapper.cc:1754] bf_sysfs_fname: /sys/class/bf/bf0/device/dev_add
Install dir: /usr (0x2396020)
bf_switchd: system services initialized
bf_switchd: loading conf_file /usr/share/stratum/tofino_skip_p4.conf...
bf_switchd: processing device configuration...
Configuration for dev_id 0
  Family        : Tofino
  pci_sysfs_str : /sys/devices/pci0000:00/0000:00:03.0/0000:05:00.0
  pci_domain    : 0
  pci_bus       : 5
  pci_fn        : 0
  pci_dev       : 0
  pci_int_mode  : 1
  sbus_master_fw: /usr/
  pcie_fw       : /usr/
  serdes_fw     : /usr/
  sds_fw_path   : /usr/share/tofino_sds_fw/avago/firmware
  microp_fw_path:
bf_switchd: processing P4 configuration...
P4 profile for dev_id 0
  p4_name: dummy
    libpd:
    libpdthrift:
    context:
    config:
  Agent[0]: /usr/lib/libpltfm_mgr.so
  diag:
  accton diag:
  non_default_port_ppgs: 0
  SAI default initialize: 1
bf_switchd: library /usr/lib/libpltfm_mgr.so loaded
bf_switchd: agent[0] initialized
Health monitor started
Operational mode set to ASIC
Initialized the device types using platforms infra API
ASIC detected at PCI /sys/class/bf/bf0/device
ASIC pci device id is 16
bf_switchd: drivers initialized
Skipping P4 program load for dev_id 0
*** Aborted at 1690222321 (unix time) try "date -d @1690222321" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGBUS (@0x7f49e8800000) received by PID 7 (TID 0x7f4a083b8880) from PID 18446744073315287040; stack trace: ***
    @     0x7f49f68f50e0 (unknown)
    @     0x7f4a0744e865 bf_sys_dma_pool_create
    @     0x7f49f82e77e7 (unknown)
    @     0x7f49f82e9d7e bf_switchd_lib_init
    @           0x43766e (unknown)
    @           0x41bf7b (unknown)
    @           0x41d114 (unknown)
    @     0x7f49f5ac42e1 __libc_start_main
    @           0x41bbea (unknown)
    @                0x0 (unknown)
Bus error (core dumped)



Has anyone ever experienced this problem?




Ariel Góes de Castro

Jul 27, 2023, 11:45:11 AM
to stratum-dev, Ariel Góes de Castro

If other information is needed, I am happy to provide it. For example:
1. I haven't tried it with ONL
2. I already tried with Ubuntu 18.04 and Ubuntu 20.04
3. I didn't set up a custom chassis
4. I already fixed a problem with the PLATFORM variable by setting it to the model I believe is the correct one, i.e., PLATFORM=x86-64-accton-wedge100bf-32qs-r0
5. Installation ran smoothly using just Docker and Ubuntu 20.04 (no Kubernetes). However, that makes it harder to connect Stratum to the rest of the cluster, because Stratum then becomes an external component and no longer follows the recommendation of the Deployment Guide.

Brian O'Connor

Jul 30, 2023, 11:11:10 PM
to stratum-dev, Ariel Góes de Castro
Hi Ariel,

Based on the stack trace, I'd guess that huge pages aren't set up:
https://github.com/stratum/stratum/blob/main/stratum/hal/bin/barefoot/README.run.md#huge-pages--dma-allocation-error

It should be a one-time setup after each fresh OS install.
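
A minimal sketch for checking the kernel's huge page state on the switch, assuming a standard Linux `/proc` layout. The helper names `hugepages_total` and `hugetlbfs_mounted` are hypothetical, not part of Stratum; the linked README remains the reference for the actual setup steps.

```shell
#!/bin/sh
# Print the HugePages_Total value from a meminfo-style file
# (defaults to /proc/meminfo).
hugepages_total() {
    awk '/^HugePages_Total:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Succeed if a hugetlbfs filesystem appears in a mounts-style file
# (defaults to /proc/mounts; the third field is the filesystem type).
hugetlbfs_mounted() {
    awk '$3 == "hugetlbfs" { found = 1 } END { exit !found }' "${1:-/proc/mounts}"
}
```

Running `hugepages_total` and `hugetlbfs_mounted && echo mounted` both on the host and inside the Stratum container shows whether any huge pages are reserved and whether a hugetlbfs mount is visible in each place.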

Let us know if that fixes it.

Brian

Ariel Góes de Castro

Aug 25, 2023, 3:45:00 PM
to stratum-dev, Brian O'Connor, Ariel Góes de Castro
I replicated the whole environment and double-checked the huge pages. I don't believe they are the problem: my output is the same as in the provided link. Is there anything else you believe could be causing this?

Ariel Góes de Castro

Oct 5, 2023, 1:55:05 PM
to stratum-dev, Ariel Góes de Castro, Brian O'Connor
Update: For some reason this error only occurs on the Wedge100BF-32X. It disappears when I run the same commands on a Wedge100BF-32QS.

Ariel Góes de Castro

Oct 30, 2023, 12:27:06 PM
to stratum-dev, Ariel Góes de Castro, Brian O'Connor
SOLVED:
I definitively solved the problem by installing the Linux headers matching the SONiC version recommended for the SD-Fabric deployment.

SONiC [supported by SD-Fabric] 
Screenshot from 2023-10-30 13-24-36.png
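
For anyone hitting the same error, a rough sketch of how to check whether headers for the running kernel are present. It assumes the usual Debian-style layout, where the headers package provides `/lib/modules/<release>/build`; the `kernel_headers_present` helper is hypothetical.

```shell
#!/bin/sh
# Succeed if a kernel build/headers tree exists for the given release.
# $1: kernel release (defaults to the running kernel, from `uname -r`)
# $2: modules root (defaults to /lib/modules)
kernel_headers_present() {
    release="${1:-$(uname -r)}"
    root="${2:-/lib/modules}"
    # -d follows symlinks, so a dangling `build` link counts as missing.
    [ -d "$root/$release/build" ]
}
```

On the switch, `kernel_headers_present || echo "no headers for $(uname -r)"` reports whether the headers for the running kernel are missing.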