Error in running Backprop benchmark


Tuan Ta

Sep 11, 2015, 12:53:56 AM
to gem5-gpu Developers List
Hi all, 

I'm a gem5-gpu newbie. I built the gem5-gpu binary with the x86 CPU model and got gem5.opt. The build process went fine.

Then I compiled the backprop benchmark provided in the Rodinia suite: I ran "make gem5-fusion" and got the gem5_fusion_backprop file.

The environment variables CUDAHOME, PATH, and NVIDIA_CUDA_SDK_LOCATION are all set as instructed in the gem5-gpu documentation.
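
For reference, my shell setup looks roughly like this (the install paths below are placeholders and will differ per machine):

export CUDAHOME=/usr/local/cuda                                  # placeholder: CUDA toolkit install directory
export NVIDIA_CUDA_SDK_LOCATION=$HOME/NVIDIA_GPU_Computing_SDK   # placeholder: NVIDIA CUDA SDK location
export PATH=$CUDAHOME/bin:$PATH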

When I run gem5_fusion_backprop on top of gem5-gpu by typing build/X86_VI_hammer_GPU/gem5.opt ../gem5-gpu/configs/se_fusion.py -c ../benchmarks/rodinia/backprop/gem5_fusion_backprop -o "16", I get the following error message:

fatal: syscall gettid (#186) unimplemented.
 @ tick 31087000
[unimplementedFunc:build/X86_VI_hammer_GPU/sim/syscall_emul.cc, line 91]

I suspected that the gem5_fusion_backprop binary might not be statically linked, so I ran ldd on ../benchmarks/rodinia/backprop/gem5_fusion_backprop and saw that the file is in fact statically linked.
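
For reference, the check looked roughly like this (ldd's exact wording can vary between systems):

$ ldd ../benchmarks/rodinia/backprop/gem5_fusion_backprop
        not a dynamic executable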

Could anyone help me solve this problem? Thank you very much!

Tuan Ta

Joel Hestness

Sep 11, 2015, 9:41:47 AM
to Tuan Ta, gem5-gpu Developers List
Hi Tuan,
  I believe this problem is due to something that occurs before the fatal error that you're seeing. Can you send over the full simulator output and the benchmark output (if you're redirecting it)?

  Thanks!
  Joel


--
  Joel Hestness
  PhD Candidate, Computer Architecture
  Dept. of Computer Science, University of Wisconsin - Madison
  http://pages.cs.wisc.edu/~hestness/

Tuan Ta

Sep 11, 2015, 10:26:34 AM
to gem5-gpu Developers List, taquang...@gmail.com
Hi Joel, 

Here is the full simulator output: 

gem5 Simulator System.  http://gem5.org
gem5 is copyrighted software; use the --copyright option for details.

gem5 compiled Sep  2 2015 16:43:02
gem5 started Sep 11 2015 09:24:27
gem5 executing on tqta
command line: ./build/X86_VI_hammer_GPU/gem5.opt ../gem5-gpu/configs/se_fusion.py -c ../benchmarks/rodinia/backprop/gem5_fusion_backprop -o 16

Warning: Only block size currently supported is 128B. Defaulting to 128.
Using template and command line options for gpgpusim.config
Global frequency set at 1000000000000 ticks per second
warn: system.ruby.network adopting orphan SimObject param 'int_links'
warn: system.ruby.network adopting orphan SimObject param 'ext_links'
warn: DRAM device capacity (8192 Mbytes) does not match the address range assigned (2048 Mbytes)


        *** GPGPU-Sim Simulator Version 3.2.2  [build 17315] ***


GPGPU-Sim PTX: simulation mode 0 (can change with PTX_SIM_MODE_FUNC environment variable:
               1=functional simulation only, 0=detailed performance simulator)
GPGPU-Sim: Configuration options:

-network_mode                           1 # Interconnection network mode
-inter_config_file   m5out/config_fermi_islip.icnt # Interconnection network config file
-gpgpu_ptx_use_cuobjdump                    0 # Use cuobjdump to extract ptx and sass from binaries
-gpgpu_experimental_lib_support                    0 # Try to extract code from cuda libraries [Broken because of unknown cudaGetExportTable]
-gpgpu_ptx_convert_to_ptxplus                    0 # Convert SASS (native ISA) to ptxplus and run ptxplus
-gpgpu_ptx_force_max_capability                   20 # Force maximum compute capability
-gpgpu_ptx_inst_debug_to_file                    0 # Dump executed instructions' debug information to file
-gpgpu_ptx_inst_debug_file       inst_debug.txt # Executed instructions' debug output file
-gpgpu_ptx_inst_debug_thread_uid                    1 # Thread UID for executed instructions' debug output
-gpgpu_simd_model                       1 # 1 = post-dominator
-gpgpu_shader_core_pipeline              1536:32 # shader core pipeline config, i.e., {<nthread>:<warpsize>}
-gpgpu_tex_cache:l1  4:128:24,L:R:m:N,F:128:4,128:2 # per-shader L1 texture cache  (READ-ONLY) config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>:<rf>}
-gpgpu_const_cache:l1 64:64:2,L:R:f:N,A:2:32,4 # per-shader L1 constant memory cache  (READ-ONLY) config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>} 
-gpgpu_cache:il1     8:128:4,L:R:f:N,A:2:32,4 # shader L1 instruction cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>} 
-gpgpu_cache:dl1                     none # per-shader L1 data cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | none}
-gpgpu_cache:dl1PrefL1                 none # per-shader L1 data cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | none}
-gpgpu_cache:dl1PreShared                 none # per-shader L1 data cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | none}
-gmem_skip_L1D                          0 # global memory access skip L1D cache (implements -Xptxas -dlcm=cg, default=no skip)
-gpgpu_perfect_mem                      0 # enable perfect memory mode (no cache miss)
-n_regfile_gating_group                    4 # group of lanes that should be read/written together)
-gpgpu_clock_gated_reg_file                    0 # enable clock gated reg file for power calculations
-gpgpu_clock_gated_lanes                    0 # enable clock gated lanes for power calculations
-gpgpu_shader_registers                32768 # Number of registers per shader core. Limits number of concurrent CTAs. (default 8192)
-gpgpu_shader_cta                       8 # Maximum number of concurrent CTAs in shader (default 8)
-gpgpu_n_clusters                      16 # number of processing clusters
-gpgpu_n_cores_per_cluster                    1 # number of simd cores per cluster
-gpgpu_n_cluster_ejection_buffer_size                    8 # number of packets in ejection buffer
-gpgpu_n_ldst_response_buffer_size                    2 # number of response packets in ld/st unit ejection buffer
-gpgpu_shmem_size                   16384 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_size                   49152 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_size_PrefL1                16384 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_size_PrefShared                16384 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_num_banks                   32 # Number of banks in the shared memory in each shader core (default 16)
-gpgpu_shmem_limited_broadcast                    0 # Limit shared memory to do one broadcast per cycle (default on)
-gpgpu_shmem_warp_parts                    1 # Number of portions a warp is divided into for shared memory bank conflict check 
-gpgpu_warpdistro_shader                   -1 # Specify which shader core to collect the warp size distribution from
-gpgpu_warp_issue_shader                    0 # Specify which shader core to collect the warp issue distribution from
-gpgpu_local_mem_map                    1 # Mapping from local memory space address to simulated GPU physical address space (default = enabled)
-gpgpu_num_reg_banks                   16 # Number of register banks (default = 8)
-gpgpu_reg_bank_use_warp_id                    0 # Use warp ID in mapping registers to banks (default = off)
-gpgpu_operand_collector_num_units_sp                    6 # number of collector units (default = 4)
-gpgpu_operand_collector_num_units_sfu                    8 # number of collector units (default = 4)
-gpgpu_operand_collector_num_units_mem                    2 # number of collector units (default = 2)
-gpgpu_operand_collector_num_units_gen                    0 # number of collector units (default = 0)
-gpgpu_operand_collector_num_in_ports_sp                    2 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_sfu                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_mem                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_gen                    0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_out_ports_sp                    2 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_sfu                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_mem                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_gen                    0 # number of collector unit in ports (default = 0)
-gpgpu_coalesce_arch                   13 # Coalescing arch (default = 13, anything else is off for now)
-gpgpu_num_sched_per_core                    2 # Number of warp schedulers per core
-gpgpu_max_insn_issue_per_warp                    1 # Max number of instructions that can be issued per warp in one cycle by scheduler
-gpgpu_simt_core_sim_order                    1 # Select the simulation order of cores in a cluster (0=Fix, 1=Round-Robin)
-gpgpu_pipeline_widths        2,1,1,2,1,1,2 # Pipeline widths ID_OC_SP,ID_OC_SFU,ID_OC_MEM,OC_EX_SP,OC_EX_SFU,OC_EX_MEM,EX_WB
-gpgpu_num_sp_units                     2 # Number of SP units (default=1)
-gpgpu_num_sfu_units                    1 # Number of SF units (default=1)
-gpgpu_num_mem_units                    1 # Number if ldst units (default=1) WARNING: not hooked up to anything
-gpgpu_scheduler                      gto # Scheduler configuration: < lrr | gto | two_level_active > If two_level_active:<num_active_warps>:<inner_prioritization>:<outer_prioritization>For complete list of prioritization values see shader.h enum scheduler_prioritization_typeDefault: gto
-gpgpu_dram_scheduler                    1 # 0 = fifo, 1 = FR-FCFS (defaul)
-gpgpu_dram_partition_queues              8:8:8:8 # i2$:$2d:d2$:$2i
-l2_ideal                               0 # Use a ideal L2 cache that always hit
-gpgpu_cache:dl2     64:256:8,L:B:m:W,A:32:4,4:0,32 # unified banked L2 data cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>}
-gpgpu_cache:dl2_texture_only                    0 # L2 cache used for texture only
-gpgpu_n_mem                            1 # number of memory modules (e.g. memory controllers) in gpu
-gpgpu_n_sub_partition_per_mchannel                    2 # number of memory subpartition in each memory module
-gpgpu_n_mem_per_ctrlr                    2 # number of memory chips per memory controller
-gpgpu_memlatency_stat                   14 # track and display latency statistics 0x2 enables MC, 0x4 enables queue logs
-gpgpu_frfcfs_dram_sched_queue_size                   16 # 0 = unlimited (default); # entries per chip
-gpgpu_dram_return_queue_size                  116 # 0 = unlimited (default); # entries per chip
-gpgpu_dram_buswidth                    4 # default = 4 bytes (8 bytes per cycle at DDR)
-gpgpu_dram_burst_length                    8 # Burst length of each DRAM request (default = 4 data bus cycle)
-dram_data_command_freq_ratio                    4 # Frequency ratio between DRAM data bus and command bus (default = 2 times, i.e. DDR)
-gpgpu_dram_timing_opt nbk=16:CCD=2:RRD=6:RCD=12:RAS=28:RP=12:RC=40: CL=12:WL=4:CDLR=5:WR=12:nbkgrp=4:CCDL=3:RTPL=2 # DRAM timing parameters = {nbk:tCCD:tRRD:tRCD:tRAS:tRP:tRC:CL:WL:tCDLR:tWR:nbkgrp:tCCDL:tRTPL}
-rop_latency                          120 # ROP queue latency (default 85)
-dram_latency                         100 # DRAM latency (default 30)
-gpgpu_mem_addr_mapping dramid@8;00000000.00000000.00000000.00000000.0000RRRR.RRRRRRRR.BBBCCCCB.CCSSSSSS # mapping memory address to dram model {dramid@<start bit>;<memory address map>}
-gpgpu_mem_addr_test                    0 # run sweep test to check address mapping for aliased address
-gpgpu_mem_address_mask                    1 # 0 = old addressing mask, 1 = new addressing mask, 2 = new add. mask + flipped bank sel and chip sel bits
-gpuwattch_xml_file         gpuwattch.xml # GPUWattch XML file
-power_simulation_enabled                    0 # Turn on power simulator (1=On, 0=Off)
-power_per_cycle_dump                    0 # Dump detailed power output each cycle
-power_trace_enabled                    0 # produce a file for the power trace (1=On, 0=Off)
-power_trace_zlevel                     6 # Compression level of the power trace output log (0=no comp, 9=highest)
-steady_power_levels_enabled                    0 # produce a file for the steady power levels (1=On, 0=Off)
-steady_state_definition                  8:4 # allowed deviation:number of samples
-gpgpu_max_cycle                        0 # terminates gpu simulation early (0 = no limit)
-gpgpu_max_insn                         0 # terminates gpu simulation early (0 = no limit)
-gpgpu_max_cta                          0 # terminates gpu simulation early (0 = no limit)
-gpgpu_runtime_stat                 50000 # display runtime statistics such as dram utilization {<freq>:<flag>}
-liveness_message_freq                    1 # Minimum number of seconds between simulation liveness messages (0 = always print)
-gpgpu_flush_l1_cache                    0 # Flush L1 cache at the end of each kernel call
-gpgpu_flush_l2_cache                    0 # Flush L2 cache at the end of each kernel call
-gpgpu_deadlock_detect                    1 # Stop the simulation at deadlock (1=on (default), 0=off)
-gpgpu_ptx_instruction_classification                    0 # if enabled will classify ptx instruction types per kernel (Max 255 kernels now)
-gpgpu_ptx_sim_mode                     0 # Select between Performance (default) or Functional simulation (1)
-gpgpu_clock_domains 700.0:1400.0:700.0:1848.0 # Clock Domain Frequencies in MhZ {<Core Clock>:<ICNT Clock>:<L2 Clock>:<DRAM Clock>}
-gpgpu_max_concurrent_kernel                    8 # maximum kernels that can run concurrently on GPU
-gpgpu_cflog_interval                    0 # Interval between each snapshot in control flow logger
-visualizer_enabled                     0 # Turn on visualizer output (1=On, 0=Off)
-visualizer_outputfile                 NULL # Specifies the output log file for visualizer
-visualizer_zlevel                      6 # Compression level of the visualizer output log (0=no comp, 9=highest)
-trace_enabled                          0 # Turn on traces
-trace_components                    none # comma seperated list of traces to enable. Complete list found in trace_streams.tup. Default none
-trace_sampling_core                    0 # The core which is printed using CORE_DPRINTF. Default 0
-trace_sampling_memory_partition                   -1 # The memory partition which is printed using MEMPART_DPRINTF. Default -1 (i.e. all)
-enable_ptx_file_line_stats                    0 # Turn on PTX source line statistic profiling. (1 = On)
-ptx_line_stats_filename gpgpu_inst_stats.txt # Output file for PTX source line statistics.
-save_embedded_ptx                      0 # saves ptx files embedded in binary as <n>.ptx
-keep                                   0 # keep intermediate files created by GPGPU-Sim when interfacing with external programs
-gpgpu_ptx_save_converted_ptxplus                    0 # Saved converted ptxplus to a file
-ptx_opcode_latency_int         4,13,4,5,145 # Opcode latencies for integers <ADD,MAX,MUL,MAD,DIV>Default 1,1,19,25,145
-ptx_opcode_latency_fp          4,13,4,5,39 # Opcode latencies for single precision floating points <ADD,MAX,MUL,MAD,DIV>Default 1,1,1,1,30
-ptx_opcode_latency_dp         8,19,8,8,330 # Opcode latencies for double precision floating points <ADD,MAX,MUL,MAD,DIV>Default 8,8,8,8,335
-ptx_opcode_initiation_int            1,2,2,1,8 # Opcode initiation intervals for integers <ADD,MAX,MUL,MAD,DIV>Default 1,1,4,4,32
-ptx_opcode_initiation_fp            1,2,1,1,4 # Opcode initiation intervals for single precision floating points <ADD,MAX,MUL,MAD,DIV>Default 1,1,1,1,5
-ptx_opcode_initiation_dp         8,16,8,8,130 # Opcode initiation intervals for double precision floating points <ADD,MAX,MUL,MAD,DIV>Default 8,8,8,8,130
DRAM Timing Options:
nbk                                    16 # number of banks
CCD                                     2 # column to column delay
RRD                                     6 # minimal delay between activation of rows in different banks
RCD                                    12 # row to column delay
RAS                                    28 # time needed to activate row
RP                                     12 # time needed to precharge (deactivate) row
RC                                     40 # row cycle time
CDLR                                    5 # switching from write to read (changes tWTR)
WR                                     12 # last data-in to row precharge
CL                                     12 # CAS latency
WL                                      4 # Write latency
nbkgrp                                  4 # number of bank groups
CCDL                                    3 # column to column delay between accesses to different bank groups
RTPL                                    2 # read to precharge delay between accesses to different bank groups
Total number of memory sub partition = 2
addr_dec_mask[CHIP]  = 0000000000000000 high:64 low:0
addr_dec_mask[BK]    = 000000000000e100 high:16 low:8
addr_dec_mask[ROW]   = 000000000fff0000 high:28 low:16
addr_dec_mask[COL]   = 0000000000001eff high:13 low:0
addr_dec_mask[BURST] = 000000000000003f high:6 low:0
sub_partition_id_mask = 0000000000000100
GPGPU-Sim uArch: clock freqs: 700000000.000000:1400000000.000000:700000000.000000:1848000000.000000
GPGPU-Sim uArch: clock periods: 0.00000000142857142857:0.00000000071428571429:0.00000000142857142857:0.00000000054112554113
*** Initializing Memory Statistics ***
GPGPU-Sim uArch: interconnect node map (shaderID+MemID to icntID)
GPGPU-Sim uArch: Memory nodes ID start from index: 16
GPGPU-Sim uArch:    0   1   2   3
GPGPU-Sim uArch:    4   5   6   7
GPGPU-Sim uArch:    8   9  10  11
GPGPU-Sim uArch:   12  13  14  15
GPGPU-Sim uArch:   16  17
GPGPU-Sim uArch: interconnect node reverse map (icntID to shaderID+MemID)
GPGPU-Sim uArch: Memory nodes start from ID: 16
GPGPU-Sim uArch:    0   1   2   3
GPGPU-Sim uArch:    4   5   6   7
GPGPU-Sim uArch:    8   9  10  11
GPGPU-Sim uArch:   12  13  14  15
GPGPU-Sim uArch:   16  17
GPGPU-Sim uArch: performance model initialization complete.
warn: Sockets disabled, not accepting gdb connections
**** REAL SIMULATION ****
info: Entering event queue @ 0.  Starting simulation...
warn: readlink may yield unexpected results if multiple binaries are used
info: Increasing stack size by one page.
info: Increasing stack size by one page.
info: Increasing stack size by one page.
info: Increasing stack size by one page.
../sysdeps/unix/sysv/linux/dl-origin.c:47: _dl_get_origin: Assertion `linkval[0] == '/'' failed.
warn: ignoring syscall rt_sigprocmask(1, ...)
      (further warnings will be suppressed)
fatal: syscall gettid (#186) unimplemented.
 @ tick 31087000
[unimplementedFunc:build/X86_VI_hammer_GPU/sim/syscall_emul.cc, line 91]
Memory Usage: 4495424 KBytes
Program aborted at cycle 31087000
Aborted (core dumped)

Thank you!

Tuan Ta

Joel Hestness

Sep 11, 2015, 10:42:24 AM
to Tuan Ta, gem5-gpu Developers List
Hi Tuan,
  Yes, you're running into the same readlink problem that many of us are experiencing right now. Here's a gem5 email list thread on the subject: https://www.mail-archive.com/gem5...@gem5.org/msg16660.html. There is another thread that hasn't been indexed by The Mail Archive yet (you can see it on the gem5-users email list if you're subscribed).

  One thing you can try is specifying the full path to your benchmark binary on the gem5 command line, something like the command below (I'm not yet sure if that will work). Note that, unfortunately, if you change the benchmark path, your simulation results are likely to change because of the way readlink will then read your binary's path.
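
  For instance (the leading directory below is a placeholder for wherever the benchmarks actually live on your machine):

build/X86_VI_hammer_GPU/gem5.opt ../gem5-gpu/configs/se_fusion.py -c /path/to/benchmarks/rodinia/backprop/gem5_fusion_backprop -o "16"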

  Keep me posted if the full path route works or not. Thanks!
  Joel

Tuan Ta

Sep 11, 2015, 1:23:42 PM
to gem5-gpu Developers List, taquang...@gmail.com
Hi Joel, 

Thank you for your response!

Using the binary's full path works, but how would changing from a relative to a full path change the simulation results? Would the benchmark's output or the simulator's output be incorrect?

Thanks, 

Tuan Ta

Joel Hestness

Sep 11, 2015, 2:10:05 PM
to Tuan Ta, gem5-gpu Developers List
Hi Tuan,

Using the binary's full path works,

  Cool. This is good to know. Thanks!

but how would changing from a relative to a full path change the simulation results? Would the benchmark's output or the simulator's output be incorrect?

  Actually, if you change the full path of the benchmark (e.g. rename a directory or move the binary on the host), your simulation results may end up changing. Basically, the new implementation of readlink reads file paths from the host system, so the simulated system will now process the host system's path to the binary. If the host path changes (especially path length), the amount of processing that the simulated system does to handle the path is likely to change even if you don't change the binary itself. This causes apparent non-determinism in the simulator.
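
  As a rough illustration of the host-side behavior involved (readlink on /proc/self/exe is the kind of lookup the simulated C library ends up making; the exact output will vary by system):

$ readlink /proc/self/exe      # prints the absolute host path of whatever binary is running
/usr/bin/readlink

  Because the simulated benchmark now gets back its real host path the same way, moving or renaming gem5_fusion_backprop changes the string the simulated code has to process.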

  This problem is a result of poor separation between host and simulated system, and I'm not a fan. Hopefully we'll be able to come up with a more robust solution.

  Joel