Are hugepages by default a good idea for compute instance types?


Chandler Wilkerson

Sep 18, 2023, 8:56:23 AM
to kubevirt-dev
Using a compute or memory instance type requires a node with huge pages support, which is not a default setting, so out of the box, compute and memory instance types are unusable.

When trying to create a VM with one of these instance types, it fails with an unhelpful error that points to scheduling, but does not mention huge pages as the reason.

I propose we add better messaging to call out the lack of huge pages when scheduling is prevented for this reason.

I would like to see a discussion here about whether to remove huge pages from the compute and memory instance types (by default) and find another way to introduce to admins the ability to require huge pages in an instance type.

Huge pages make sense across the board for VM handling nodes, but fine tuning the number of huge pages per compute node is a per-cluster exercise, and IMO fits better in a performance and tuning guide than a default.
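For reference, roughly what that per-node exercise looks like (a sketch assuming 2Mi pages and a plain kubelet; the exact mechanism differs per distribution, e.g. kernel args or a MachineConfig on OpenShift):

  # pre-allocate 1024 x 2Mi huge pages on the node (not persistent; use sysctl.d or kernel args for that)
  echo 1024 | sudo tee /proc/sys/vm/nr_hugepages

  # the kubelet only picks up pre-allocated huge pages at startup, so restart it
  sudo systemctl restart kubelet

  # the node should now advertise the resource the instance types request
  kubectl describe node <node> | grep hugepages-2Mi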

I have opened [1] to discuss.

1. https://github.com/kubevirt/common-instancetypes/issues/105

--
Chandler Wilkerson, RHCE
Sr. Software Engineer

Red Hat

Fabian Deutsch

Sep 18, 2023, 9:14:07 AM
to Chandler Wilkerson, Lee Yarwood, kubevirt-dev
Chandler, hi!

Adding @Lee Yarwood 

On Mon, Sep 18, 2023 at 2:56 PM Chandler Wilkerson <cwil...@redhat.com> wrote:
Using a compute or memory instance type requires a node with huge pages support, which is not a default setting, so out of the box, compute and memory instance types are unusable.

When trying to create a VM with one of these instance types, it fails with an unhelpful error that points to scheduling, but does not mention huge pages as the reason.

Please share the exact message - here and in [1].
 

I propose we add better messaging to call out the lack of huge pages when scheduling is prevented for this reason.

All errors need to be bubbled up to the user properly. If they are not, then this is a bug.
 

I would like to see a discussion here about whether to remove huge pages from the compute and memory instance types (by default) and find another way to introduce to admins the ability to require huge pages in an instance type.

Can you please specify which instanceTypes you are referring to?

To me, hugepages should be limited to memory intensive, compute exclusive, and network (coming up).
 

Huge pages make sense across the board for VM handling nodes, but fine tuning the number of huge pages per compute node is a per-cluster exercise, and IMO fits better in a performance and tuning guide than a default.

In general, when there are no specific needs, the U series is the right one to take.
The goal of instance types is to reduce the tuning needed from users to practically zero.
 

I have opened [1] to discuss

--
Chandler Wilkerson, RHCE
Sr. Software Engineer

Red Hat


Chandler Wilkerson

Sep 18, 2023, 11:04:57 AM
to Fabian Deutsch, Lee Yarwood, kubevirt-dev
On Mon, Sep 18, 2023 at 8:14 AM Fabian Deutsch <fdeu...@redhat.com> wrote:
Chandler, hi!

Adding @Lee Yarwood 

On Mon, Sep 18, 2023 at 2:56 PM Chandler Wilkerson <cwil...@redhat.com> wrote:
Using a compute or memory instance type requires a node with huge pages support, which is not a default setting, so out of the box, compute and memory instance types are unusable.

When trying to create a VM with one of these instance types, it fails with an unhelpful error that points to scheduling, but does not mention huge pages as the reason.

Please share the exact message - here and in [1].
 

Here's the describe Status: output:
 
Status:
  Active Pods:
    bd8dc179-cd57-4e75-a589-40c1c02e770a:  
  Conditions:
    Last Probe Time:       2023-09-18T14:42:43Z
    Last Transition Time:  2023-09-18T14:42:43Z
    Message:               Guest VM is not reported as running
    Reason:                GuestNotRunning
    Status:                False
    Type:                  Ready
    Last Probe Time:       <nil>
    Last Transition Time:  2023-09-18T14:42:43Z
    Message:               0/6 nodes are available: 1 node(s) were unschedulable, 2 Insufficient hugepages-2Mi, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 2 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling..
    Reason:                Unschedulable
    Status:                False
    Type:                  PodScheduled
  Guest OS Info:
  Phase:  Scheduling
  Phase Transition Timestamps:
    Phase:                        Pending
    Phase Transition Timestamp:   2023-09-18T14:42:43Z
    Phase:                        Scheduling
    Phase Transition Timestamp:   2023-09-18T14:42:43Z
  Qos Class:                      Burstable
  Runtime User:                   107
  Virtual Machine Revision Name:  revision-start-vm-b0215b66-8379-456d-a4c0-ee318b3d0307-2
Events:
  Type    Reason            Age   From                       Message
  ----    ------            ----  ----                       -------
  Normal  SuccessfulCreate  2m    virtualmachine-controller  Created virtual machine pod virt-launcher-imaginative-marmoset-b8mgv


I propose we add better messaging to call out the lack of huge pages when scheduling is prevented for this reason.

All errors need to be bubbled up to the user properly. If they are not, then this is a bug.

I recognize this may well be a deeper issue with the K8s Pod messaging too:
 
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  3m20s  default-scheduler  0/6 nodes are available: 1 node(s) were unschedulable, 2 Insufficient hugepages-2Mi, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 2 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling..
  Warning  FailedScheduling  2m14s  default-scheduler  0/6 nodes are available: 1 node(s) were unschedulable, 2 Insufficient hugepages-2Mi, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 2 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling..

The Pod's describe output does mention the huge pages under Limits and Requests, but nothing flags them as the requirement that is not being met:

    Limits:
      devices.kubevirt.io/kvm:        1
      devices.kubevirt.io/tun:        1
      devices.kubevirt.io/vhost-net:  1
      hugepages-2Mi:                  16Gi
    Requests:
      cpu:                            200m
      devices.kubevirt.io/kvm:        1
      devices.kubevirt.io/tun:        1
      devices.kubevirt.io/vhost-net:  1
      ephemeral-storage:              50M
      hugepages-2Mi:                  16Gi
      memory:                         295698433
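
As an aside, the gap is easy to see by comparing that request with what the nodes actually advertise, e.g.:

  kubectl describe nodes | grep -E '^Name:|hugepages-2Mi'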
 

I would like to see a discussion here about whether to remove huge pages from the compute and memory instance types (by default) and find another way to introduce to admins the ability to require huge pages in an instance type.

Can you please specify which instanceTypes you are referring to?

To me, hugepages should be limited to memory intensive, compute exclusive, and network (coming up).
 

I'm referring to the cx and m types specifically; both require 2Mi hugepages.

For that matter, the cx types require dedicated CPU placement, which requires cpumanager support, exposed in the virt-launcher Pod as a nodeSelector:

  nodeSelector:
    cpumanager: "true"
    kubevirt.io/schedulable: "true"
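
A quick way to check whether any workers actually carry those labels:

  kubectl get nodes -l cpumanager=true,kubevirt.io/schedulable=true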
 

Huge pages make sense across the board for VM handling nodes, but fine tuning the number of huge pages per compute node is a per-cluster exercise, and IMO fits better in a performance and tuning guide than a default.

In general, when there are no specific needs, the U series is the right one to take.
The goal of instance types is to reduce the tuning needed from users to practically zero.

I agree with that goal; would it help to add guidance to the instancetype.kubevirt.io/description annotation for the instance type itself?
Current for a CX is defined here [2]
In part:
       The exclusive resources are given to the compute threads of the
      VM. In order to ensure this, some additional cores (depending
      on the number of disks and NICs) will be requested to offload
      the IO threading from cores dedicated to the workload.
      In addition, in this series, the NUMA topology of the used
      cores is provided to the VM.
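
Something along these lines, purely as a sketch (assuming the v1beta1 API; the user-guide link is a placeholder, not an existing page):

  apiVersion: instancetype.kubevirt.io/v1beta1
  kind: VirtualMachineClusterInstancetype
  metadata:
    name: cx1.medium
    annotations:
      instancetype.kubevirt.io/description: |-
        The CX Series provides exclusive compute resources for compute
        intensive applications.
        Requires nodes with pre-allocated 2Mi huge pages and the kubelet
        CPU manager static policy; see <user-guide URL> for cluster setup.
  # spec: cpu/memory/hugepages left exactly as currently defined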

Perhaps a link to a page in the user-guide explaining how to adjust a cluster node to support the required features for each instance type that needs them?

2. https://github.com/kubevirt/common-instancetypes/blob/main/instancetypes/cx/1/cx1.yaml
 
--
Chandler Wilkerson, RHCE
Sr. Software Engineer

Red Hat


Lee Yarwood

Sep 18, 2023, 3:28:21 PM
to Chandler Wilkerson, Fabian Deutsch, kubevirt-dev
Thanks Chandler, Fabian, comments in-line below.

On Mon, 18 Sept 2023 at 16:05, Chandler Wilkerson <cwil...@redhat.com> wrote:
> On Mon, Sep 18, 2023 at 8:14 AM Fabian Deutsch <fdeu...@redhat.com> wrote:
>>
>> Chandler, hi!
>>
>> Adding @Lee Yarwood

:) thanks!

>> On Mon, Sep 18, 2023 at 2:56 PM Chandler Wilkerson <cwil...@redhat.com> wrote:
>>>
>>> Using a compute or memory instance type requires a node with huge pages support, which is not a default setting, so out of the box, compute and memory instance types are unusable.

I tend to agree for compute but I think it's fine for the memory
intensive class.

>>> When trying to create a VM with one of these instance types, it fails with an unhelpful error that points to scheduling, but does not mention huge pages as the reason.

Insufficient hugepages-2Mi is listed in the message below; I don't
think there's much more we could do here tbh, as we don't want to
wrap the VMI scheduling process with awareness of instance types
and preferences when the reason is already documented pretty well.

>> Please share the exact message - here and in [1].
>>
> Here's the describe Status: output:
>
> Status:
> Active Pods:
> bd8dc179-cd57-4e75-a589-40c1c02e770a:
> Conditions:
> Last Probe Time: 2023-09-18T14:42:43Z
> Last Transition Time: 2023-09-18T14:42:43Z
> Message: Guest VM is not reported as running
> Reason: GuestNotRunning
> Status: False
> Type: Ready
> Last Probe Time: <nil>
> Last Transition Time: 2023-09-18T14:42:43Z
> Message: 0/6 nodes are available: 1 node(s) were unschedulable, 2 Insufficient hugepages-2Mi, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 2 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling..

^ `2 Insufficient hugepages-2Mi`
I honestly think Insufficient hugepages-2Mi is pretty clear here.

>>> I would like to see a discussion here about whether to remove huge pages from the compute and memory instance types (by default) and find another way to introduce to admins the ability to require huge pages in an instance type.
>>
>>
>> Can you please specify which instanceTypes you are referring to?
>>
>> To me, hugepages should be limited to memory intensive, compute exclusive, and network (coming up).

What's the justification for compute exclusive again?

> I'm referring to the cx and m types specifically; both require 2Mi hugepages.
>
> For that matter, the cx types require dedicated CPU placement, which requires cpumanager support, exposed in the virt-launcher Pod as a nodeSelector:
>
> nodeSelector:
> cpumanager: "true"
> kubevirt.io/schedulable: "true"
>
>>> Huge pages make sense across the board for VM handling nodes, but fine tuning the number of huge pages per compute node is a per-cluster exercise, and IMO fits better in a performance and tuning guide than a default.
>>
>> In general, when there are no specific needs, the U series is the right one to take.
>> The goal of instance types is to reduce the tuning needed from users to practically zero.
>
> I agree with that goal; would it help to add guidance to the instancetype.kubevirt.io/description annotation for the instance type itself?
> Current for a CX is defined here [2]
> In part:
> The exclusive resources are given to the compute threads of the
> VM. In order to ensure this, some additional cores (depending
> on the number of disks and NICs) will be requested to offload
> the IO threading from cores dedicated to the workload.
> In addition, in this series, the NUMA topology of the used
> cores is provided to the VM.
>
> Perhaps a link to a page in user-guide explaining how to adjust a cluster node to support the required features for each instance type that requires them?

ACK to links to documentation; we also expose the requirements as
labels now, FWIW:

https://blog.yarwood.me.uk/2023/06/22/kubevirt_instancetype_update_5/#resource-labels

>>>
>>> I have opened [1] to discuss
>>>
>>> 1. https://github.com/kubevirt/common-instancetypes/issues/105

Thanks, I'll document my thoughts there as well.

> 2. https://github.com/kubevirt/common-instancetypes/blob/main/instancetypes/cx/1/cx1.yaml

Cheers,

Lee

Chandler Wilkerson

Sep 18, 2023, 6:04:28 PM
to Lee Yarwood, Fabian Deutsch, kubevirt-dev
Oops, I definitely scanned that too quickly and missed it. You're right, of course. Now if we could get that info into the VMI, we'd be good.

Fabian Deutsch

Sep 22, 2023, 8:10:04 AM
to Chandler Wilkerson, Lee Yarwood, kubevirt-dev
We have it:

status:
  conditions:
    - lastProbeTime: '2023-09-22T12:08:51Z'
      lastTransitionTime: '2023-09-22T12:08:51Z'
      message: Guest VM is not reported as running
      reason: GuestNotRunning
      status: 'False'
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: '2023-09-22T12:08:51Z'
      message: >-
        0/24 nodes are available: 1 node(s) had untolerated taint {dedicated:
        nfv-qe}, 1 node(s) were unschedulable, 2 node(s) had untolerated taint
        {dedicated: realtime}, 20 Insufficient hugepages-2Mi. preemption: 0/24
        nodes are available: 20 No preemption victims found for incoming pod, 4
        Preemption is not helpful for scheduling..
      reason: Unschedulable
      status: 'False'
      type: PodScheduled
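
For anyone who wants to pull that out programmatically, something like this should work against the VMI:

  kubectl get vmi <name> -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")].message}'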

Chandler Wilkerson

Sep 22, 2023, 10:32:48 AM
to Fabian Deutsch, Lee Yarwood, kubevirt-dev
I went back and double-checked, and I found the point where I got careless.
Thus far, I have been representing this as an issue with huge pages, but that's incorrect. As demonstrated, a lack of huge pages is effectively reported all the way back to the VM, and at all levels underneath.

My issue was originally with the compute profile, which requires both huge pages and the cpu manager node selector. When you don't have cpu-manager-capable nodes, nothing reports that explicitly, and (I am guessing here) the scheduler stops before it ever evaluates the huge pages requirement.

POD:
  Warning  FailedScheduling  5m    default-scheduler  0/6 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..

Both the VMI and the VM include the above message in their Status.conditions[type=Ready] object.
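
For completeness, if I understand the node labelling correctly, cpumanager=true only gets set on nodes where the kubelet runs the static CPU manager policy, which (distro-specific mechanics aside) boils down to something like this in the KubeletConfiguration:

  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  cpuManagerPolicy: static
  # the static policy also needs a non-zero CPU reservation, e.g.:
  kubeReserved:
    cpu: "500m"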

Chandler Wilkerson

Sep 22, 2023, 12:05:56 PM
to Fabian Deutsch, Lee Yarwood, kubevirt-dev
Now that I am going over the warning again, I see that 3 nodes did not match the Pod's node affinity/selector, which does point the admin at the missing cpumanager node attribute. The message is still truncated, and the later part might clarify more about the huge pages, but the first issue that needs fixing is exposed by the message after all.

Apologies for dragging this out, and thanks all for your patience!