How VMware's vSphere performs live migration on AI workloads

Did you know you can migrate running GPU workloads within your VMware private cloud without powering them down first?

Photo credit: Paul Mah. A Xeon-powered GPU server from MiTAC.

I read up on VMware Cloud Foundation 9.0 and gained insights into how enterprises might work with locally hosted AI in private clouds.

Live migration

For years, I was captivated by VMware's vMotion, which allowed a running virtual machine to be transferred from one physical server to another. No shutdown required.

The invention made vMotion synonymous with live migration. Today, live migration is a foundational capability in virtualisation that no platform can do without.

It offers various benefits:

  • Maintenance.
  • Load balancing.
  • Server upgrading.
  • Resource consolidation.

As AI processing increasingly makes its way into enterprise software, the ability to easily migrate AI workloads will become a must-have, too.

Migrating AI workloads

Here's how vMotion works.

  1. Pre-copy stage.
  2. VM is "stunned."
  3. Copy checkpoint data.
  4. Resume on new host.

Pre-copy transfers memory pages while the VM keeps running, iterating over pages the guest dirties along the way. When the remaining set is small enough, the VM is briefly paused ("stunned") and the final checkpoint data, the last dirty pages plus device state, is copied across before the VM resumes on the new host.
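The four steps above can be sketched as a toy simulation (a simplified model for illustration only, not VMware's actual implementation; page counts, dirty rate, and threshold are made-up numbers):

```python
def live_migrate(total_pages=1000, dirty_rate=0.05, stun_threshold=10, max_rounds=30):
    """Simulate vMotion-style iterative pre-copy (a toy model).

    Round 1 copies every page while the VM keeps running; each later
    round re-copies only the pages the guest dirtied in the meantime.
    Once the dirty set shrinks below the threshold, the VM is stunned
    and the remainder (the "checkpoint") is copied in one final pass.
    """
    dirty = total_pages      # round 1: every page must go across
    copied_live = 0
    for _ in range(max_rounds):
        copied_live += dirty                  # copy while the VM still runs
        dirty = int(dirty * dirty_rate)       # guest re-dirties some pages
        if dirty <= stun_threshold:
            break
    stun_copy = dirty        # stun: pause, copy final pages + device state
    return copied_live, stun_copy

copied, stunned = live_migrate()
print(f"copied live: {copied} pages, copied during stun: {stunned}")
```

The point of the structure is that almost all the copying happens while the VM runs; only the small tail is copied during the stun, which is what keeps the pause short.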

What's the challenge with GPUs, though? Simple: the sheer amount of memory.

  • An H100 GPU has up to 80GB of onboard memory.
  • Eight H100s mean 640GB to transfer.

That's a lot!
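To put that in perspective, here's the back-of-the-envelope transfer time for 640GB at a few link speeds (raw line rate, ignoring protocol overhead and treating 1GB as 8 gigabits, so real-world times would be longer):

```python
def transfer_seconds(gigabytes, link_gbps):
    """Time to move `gigabytes` of data over a `link_gbps` link,
    assuming the full line rate is usable (in practice it never is)."""
    bits = gigabytes * 8  # 1 GB = 8 gigabits (loose, decimal units)
    return bits / link_gbps

# 640GB of GPU memory from an 8x H100 server:
for gbps in (25, 100):
    print(f"{gbps:>3} Gbps -> {transfer_seconds(640, gbps):.0f} s")
```

Even at a full 100Gbps, that's nearly a minute of data movement for GPU memory alone, on top of the VM's own system RAM.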

To be clear, vMotion already supports AI workloads. Optimisations within VCF 9.0 dramatically shortened the stun time needed, making it less likely for a vMotion to time out and fail.

Tests conducted on a server with 100GbE networking showed a 3-4x reduction in migration time with VCF 9.0, peaking at 69Gbps of bandwidth. So there's headroom left for beefier GPUs in the future.

Optimised for enterprises

Like CPUs, GPUs can be carved up fractionally to increase hardware utilisation or right-size performance for each workload.

  • This is done by creating a vGPU profile, defined either via Nvidia's MIG (Multi-Instance GPU) or software-based time-slicing. Profiles are then assigned to virtual machines.
  • VCF 9 also offers the new ability to review GPU deployments across the entire cluster. This benefits capacity planning and management.
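A toy model of fractional GPU allocation makes the idea concrete. The profile names below mimic Nvidia's MIG naming convention (e.g. 1g.10gb), but the capacity bookkeeping is purely illustrative; it is not how vSphere or the Nvidia driver actually tracks assignments, and real MIG slices partition compute as well as memory:

```python
# Memory (GB) consumed by each vGPU profile on a hypothetical 80GB GPU.
PROFILES = {"1g.10gb": 10, "2g.20gb": 20, "3g.40gb": 40, "7g.80gb": 80}

class GPU:
    def __init__(self, memory_gb=80):
        self.memory_gb = memory_gb
        self.assigned = []  # list of (vm_name, profile) pairs

    def free_gb(self):
        return self.memory_gb - sum(PROFILES[p] for _, p in self.assigned)

    def assign(self, vm_name, profile):
        """Attach a vGPU profile to a VM if the GPU still has room."""
        if PROFILES[profile] > self.free_gb():
            return False  # would overcommit the physical GPU
        self.assigned.append((vm_name, profile))
        return True

gpu = GPU()
gpu.assign("llm-inference", "3g.40gb")   # True: 40GB of 80GB used
gpu.assign("batch-training", "3g.40gb")  # True: GPU now fully allocated
gpu.assign("dev-sandbox", "1g.10gb")     # False: no capacity left
```

The cluster-wide GPU review in VCF 9 is essentially this bookkeeping done across every host at once, which is what makes the capacity planning mentioned above practical.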

I expect enterprise apps with built-in AI capabilities to increase over time. Some apps will be GPU-intensive, others less so - but support for the unique characteristics of GPUs will be indispensable.

Are you doing local AI apps yet?