Background

Cloud is increasingly popular due to its low upfront cost and perceived reliability. I wrote an article a few weeks ago about why it can be a sensible choice, but one that needs to be considered carefully.

One of the main drawbacks of cloud (VPS / VDS / VM / cloud server / IaaS / whatever you want to call it – it’s generally the same thing) is CPU power. The physical servers that host virtual servers are easy to add disks and memory to, and since solid state disks became standard in cloud offerings, disk IO is no longer a primary concern either. However, the ratio of CPU resources to memory and disk can be rather limiting. Consider the following hypothetical but realistic host, based on anecdotal evidence and server suppliers’ hardware specifications:

2x Xeon E5-2620 v3 (12 cores, 24 threads in total)
256GB Memory
12 x 512GB SSD in RAID10 = 3TB usable space

Let’s package that into VMs by dividing it 64 ways. Each VM gets:

0.375 of one CPU thread!
4GB Memory
48GB Disk
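
To make the arithmetic explicit, here is the back-of-the-envelope calculation (the 64-way split is inferred from the numbers above, not a provider’s published figure):

threads, memory_gb, disk_gb = 24, 256, 3072  # the host described above
vms = 64                                     # inferred number of equal VMs

print(threads / vms)    # 0.375 CPU threads per VM
print(memory_gb / vms)  # 4.0 GB memory per VM
print(disk_gb / vms)    # 48.0 GB disk per VM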

As you can see, the disk and memory seem plausible but the CPU seems dismal. Of course, this will vary from provider to provider – and you won’t see “0.375 of 1 CPU thread” but more likely something like “shared CPU”.

It can get worse, however, as many providers will oversubscribe the resources on the physical servers – i.e., they will sell more resources than they have, knowing that, generally, the physical server will cope.

What does this mean for the customer? It means you need to think carefully about CPU: it is the most expensive aspect of cloud, the resource many common cloud applications need most, the hardest to scale, and often the hardest to get solid information on. It reinforces the argument in my earlier post that cloud is great for small deployments, but needs to be weighed very carefully against dedicated options for larger ones.

The technically interesting bit

A grand challenge in cloud is not just gathering generalised, historical performance data but knowing how your application will perform when you deploy it. There is no definitive solution to this challenge yet; however, there are ways of getting some idea of how an application is performing now. Utilities like top, which systems administrators are familiar with, provide some insight and have changed in recent years to accommodate the prevalence of shared resources.

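By way of illustration, the header area of top looks something like this (the figures here are made up for illustration):

top - 14:02:11 up 12 days,  3:44,  1 user,  load average: 0.41, 0.35, 0.30
Tasks: 143 total,   1 running, 142 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.3 us,  3.1 sy,  0.0 ni, 77.9 id,  1.2 wa,  0.0 hi,  0.4 si,  5.1 st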

Systems administrators will probably know the first 5 columns of the bottom row very well – they are the most commonly used and, generally, are very useful indicators of how a system is performing. They rely on what is called tick-based CPU accounting, where every CPU tick (the atomic unit in which the kernel accounts for CPU time) is categorised based on how the kernel decides it should be used, and a quick calculation over an interval turns the counts into percentages. The kernel uses a scheduler to manage the CPU resources efficiently when there are potentially far more runnable applications than there are CPU threads available, by switching them in and out.
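
You can reproduce these percentages yourself from /proc/stat, which exposes the cumulative tick counters that top samples. A minimal sketch in Python (Linux only; the field order follows proc(5)):

import time

# /proc/stat order: user nice system idle iowait irq softirq steal
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")

def cpu_ticks():
    # First line of /proc/stat: cumulative jiffies per category, all CPUs
    with open("/proc/stat") as f:
        parts = f.readline().split()
    # parts[0] is the literal "cpu"; keep the first eight counters
    return [int(x) for x in parts[1:9]]

def percentages(interval=1.0):
    before = cpu_ticks()
    time.sleep(interval)
    after = cpu_ticks()
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas) or 1
    return {name: 100.0 * d / total for name, d in zip(FIELDS, deltas)}

if __name__ == "__main__":
    for name, pct in percentages().items():
        print(f"{name}: {pct:.1f}%")

Run over a one-second interval, this should reproduce, to within rounding, the figures on top’s %Cpu(s) row.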

You may be interested to know that the unit of time between kernel timer ticks is, in technical parlance, called a jiffy, and that the default CPU scheduler in most modern Linux systems is the Completely Fair Scheduler (CFS). A well-known alternative, shipped in some patched kernels, is the Brain Fuck Scheduler, written by Con Kolivas, an anaesthetist turned kernel hacker, who liked to argue on forums.

Anyway, the very last column, “st”, stands for “steal” time. This accounts for CPU time that is “stolen” by the virtualisation software on the physical server and is therefore completely unavailable to the kernel on the virtual server. The virtual server isn’t really aware of why the time is unavailable, but in a virtualised system it’s safe to assume it’s probably the result of noisy neighbours and high contention.

Steal time is not easy to measure, as virtual servers depend on the virtualisation software on the hypervisor for the resources that would otherwise be physical – the obvious ones like storage devices and network devices, but also the simple act of keeping time, which is an integral part of the system. Thus, the number that you see under the “st” column depends not only on contention but also on the virtualisation technology. Literature suggests that IBM’s virtualisation technology accounts for CPU time extremely accurately, followed by Xen, followed by others that are less effective. In our Proxmox setup, no steal time was reported even under very high load and oversubscription.
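
To keep an eye on what your own VM reports, the steal counter in /proc/stat can be polled over an interval. Here is a minimal sketch (the 5% alert threshold is my own arbitrary choice, not an established rule):

import time

def steal_snapshot():
    # /proc/stat line: cpu user nice system idle iowait irq softirq steal ...
    with open("/proc/stat") as f:
        parts = f.readline().split()
    return int(parts[8]), sum(int(x) for x in parts[1:9])  # steal, total

def monitor(interval=5, threshold=5.0):
    prev_steal, prev_total = steal_snapshot()
    while True:
        time.sleep(interval)
        steal, total = steal_snapshot()
        pct = 100.0 * (steal - prev_steal) / max(total - prev_total, 1)
        if pct > threshold:
            print(f"warning: {pct:.1f}% steal over the last {interval}s")
        prev_steal, prev_total = steal, total

if __name__ == "__main__":
    monitor()

On a hypervisor that doesn’t account steal properly – as with the Proxmox setup above – this will simply report nothing, which is itself worth knowing.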
