tiered memory usage


Anton Gavriliuk

Oct 1, 2022, 2:35:07 PM
to pmem
Hi all

I noticed that there is extra memory usage in a tiered memory setup.

The server has just been rebooted, so no workloads are running.

Current setup

[root@memverge anton]# free -h
              total        used        free      shared  buff/cache   available
Mem:          755Gi       3.5Gi       751Gi        13Mi       354Mi       748Gi
Swap:         4.0Gi          0B       4.0Gi
[root@memverge anton]#
[root@memverge anton]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
node 0 size: 386616 MB
node 0 free: 384469 MB
node 1 cpus: 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 1 size: 387063 MB
node 1 free: 385301 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
[root@memverge anton]#
[root@memverge anton]# daxctl list
[
  {
    "chardev":"dax1.0",
    "size":3183575302144,
    "target_node":3,
    "align":2097152,
    "mode":"devdax"
  },
  {
    "chardev":"dax0.0",
    "size":3183575302144,
    "target_node":2,
    "align":2097152,
    "mode":"devdax"
  }
]
[root@memverge anton]#

After the system-ram reconfiguration, memory usage is 90+ GiB:

[root@memverge anton]# daxctl reconfigure-device --mode=system-ram all
[
  {
    "chardev":"dax0.0",
    "size":3183575302144,
    "target_node":2,
    "align":2097152,
    "mode":"system-ram",
    "movable":true
  }
]
reconfigured 2 devices
[root@memverge anton]#

So initially the memory used was 3.5Gi, and now it is 95Gi:

[root@memverge anton]# free -h
              total        used        free      shared  buff/cache   available
Mem:          6.5Ti        95Gi       6.4Ti        13Mi       373Mi       6.4Ti
Swap:         4.0Gi          0B       4.0Gi

I can't yet identify what consumed the 90+Gi of memory.

Any ideas?

[root@memverge anton]# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
node 0 size: 386616 MB
node 0 free: 337113 MB
node 1 cpus: 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 1 size: 387063 MB
node 1 free: 337964 MB
node 2 cpus:
node 2 size: 3033088 MB
node 2 free: 3033088 MB
node 3 cpus:
node 3 size: 3033088 MB
node 3 free: 3033088 MB
node distances:
node   0   1   2   3
  0:  10  21  17  28
  1:  21  10  28  17
  2:  17  28  10  28
  3:  28  17  28  10
[root@memverge anton]#
[root@memverge anton]# daxctl list
[
  {
    "chardev":"dax1.0",
    "size":3183575302144,
    "target_node":3,
    "align":2097152,
    "mode":"system-ram",
    "movable":true
  },
  {
    "chardev":"dax0.0",
    "size":3183575302144,
    "target_node":2,
    "align":2097152,
    "mode":"system-ram",
    "movable":true
  }
]
[root@memverge anton]#
[root@memverge anton]# swapoff -a
[root@memverge anton]# echo 1 > /sys/kernel/mm/numa/demotion_enabled
[root@memverge anton]# echo 2 > /proc/sys/kernel/numa_balancing
[root@memverge anton]#
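The two knobs set above can be spot-checked after the fact (same paths as in the commands above; the exact value format varies by kernel version):

```shell
# demotion_enabled reads back "true"/"false" on recent kernels (1/0 on older ones)
cat /sys/kernel/mm/numa/demotion_enabled
# numa_balancing=2 enables NUMA balancing together with tiered-memory promotion
cat /proc/sys/kernel/numa_balancing
```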
[root@memverge anton]# ndctl list
[
  {
    "dev":"namespace1.0",
    "mode":"devdax",
    "map":"dev",
    "size":3183575302144,
    "uuid":"31cc288f-7a82-447f-a232-3e7106006b40",
    "chardev":"dax1.0",
    "align":2097152
  },
  {
    "dev":"namespace0.0",
    "mode":"devdax",
    "map":"dev",
    "size":3183575302144,
    "uuid":"356bc11e-2acd-4df0-b2f0-27123cb929de",
    "chardev":"dax0.0",
    "align":2097152
  }
]
[root@memverge anton]#

Anton

Dan Williams

Oct 1, 2022, 6:11:34 PM
to Anton Gavriliuk, pmem
Looks like 'struct page' overhead to me. You pay 64 bytes of metadata for every 4096-byte page, which is about 90 GB for your ~5.8 TB.
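That estimate checks out against the device sizes in the daxctl listing earlier in the thread; a quick shell sanity check (device sizes copied from that output):

```shell
# Two 3183575302144-byte devdax devices, 64 bytes of 'struct page'
# metadata per 4096-byte page:
total_bytes=$((2 * 3183575302144))
pages=$((total_bytes / 4096))
overhead_gib=$((pages * 64 / 1024 / 1024 / 1024))
echo "${overhead_gib} GiB"   # ~92 GiB, in line with the ~90 GiB jump in 'free'
```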


Anton Gavriliuk

Oct 2, 2022, 6:30:57 AM
to Dan Williams, pmem
Yes, but it's not entirely clear yet.

Let's set memory tiering aside for a moment. I switched the map of both devdax devices from "dev" to "mem", so the overhead should land in the regular DDR4 system memory.
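The "dev" -> "mem" switch referred to above can be sketched with ndctl's namespace reconfiguration interface (namespace names taken from the listings in this thread; check the flags against your ndctl version before running, since --reconfig with --force destroys the namespace contents):

```shell
# Re-create each namespace with the page map (the 'struct page' array)
# in system DRAM (--map=mem) instead of on the pmem device (--map=dev).
# WARNING: this is destructive to any data in the namespace.
ndctl create-namespace --reconfig=namespace0.0 --mode=devdax --map=mem --force
ndctl create-namespace --reconfig=namespace1.0 --mode=devdax --map=mem --force
```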

The output below is after a server reboot:

[root@memverge ~]#
[root@memverge ~]# ndctl list
[
  {
    "dev":"namespace1.0",
    "mode":"devdax",
    "map":"mem",
    "size":3234108276736,
    "uuid":"a79da3ee-2009-41e3-a95b-3f3fd0b490a8",
    "chardev":"dax1.0",
    "align":2097152
  },
  {
    "dev":"namespace0.0",
    "mode":"devdax",
    "map":"mem",
    "size":3234108276736,
    "uuid":"26b6d65f-cc10-4762-a36d-c6735761f191",
    "chardev":"dax0.0",
    "align":2097152
  }
]
[root@memverge ~]#
[root@memverge ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:          755Gi        27Gi       728Gi        13Mi       354Mi       725Gi
Swap:         4.0Gi          0B       4.0Gi
[root@memverge ~]#

So without memory tiering the overhead is only 27Gi, which is much smaller than the ~90Gi seen before:

[root@memverge ~]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
node 0 size: 386648 MB
node 0 free: 372632 MB
node 1 cpus: 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 1 size: 387031 MB
node 1 free: 373053 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
[root@memverge ~]#

And only after the system-ram setup does memory usage reach 90+Gi:

[root@memverge ~]# daxctl reconfigure-device --mode=system-ram all
[
  {
    "chardev":"dax0.0",
    "size":3234108276736,
    "target_node":2,
    "align":2097152,
    "mode":"system-ram",
    "movable":true
  }
]
reconfigured 2 devices
[root@memverge ~]#
[root@memverge ~]#
[root@memverge ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:          6.6Ti        97Gi       6.5Ti        13Mi       394Mi       6.5Ti
Swap:         4.0Gi          0B       4.0Gi
[root@memverge ~]#
[root@memverge ~]# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
node 0 size: 386648 MB
node 0 free: 336396 MB
node 1 cpus: 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 1 size: 387031 MB
node 1 free: 337028 MB
node 2 cpus:
node 2 size: 3082240 MB
node 2 free: 3082240 MB
node 3 cpus:
node 3 size: 3082240 MB
node 3 free: 3082240 MB
node distances:
node   0   1   2   3
  0:  10  21  17  28
  1:  21  10  28  17
  2:  17  28  10  28
  3:  28  17  28  10
[root@memverge ~]#

So it is still not clear how this works in detail.

Anton

On Sun, 2 Oct 2022 at 01:11, Dan Williams <dan.j.w...@gmail.com> wrote: