Conventional HCI architectures increase troubleshooting complexity and can put critical operations at risk.
This post is the final installment in a series looking at the limitations of deploying conventional HCI architectures for enterprise IT.
This time, I look at one of the most often-claimed advantages of conventional HCI: that these architectures decrease complexity. The “simplicity” of HCI comes at a price:
- Increased troubleshooting complexity
- Greater operational risk
Don’t overlook these factors when you are making important infrastructure decisions that you’re going to have to live with for many years.
The tightly coupled architecture of HCI makes it more difficult to troubleshoot performance issues. Because everything is layered together on each node, it becomes almost impossible to isolate the source of a performance bottleneck.
If increasing a VM's memory and CPU resources doesn't solve the problem, you then have to assume the problem is IO.
- Where is the IO bottleneck? Is it in the host, network, or storage?
- Are too many data services (deduplication, erasure coding, replication, etc.) increasing metadata and affecting performance?
- Since guest VMs and the storage VM share the same resources, how do you isolate the problem?
- Does the problem result from IO to internal storage or storage on another node?
- If it's internal storage, can you throttle or migrate workloads? Will migrating workloads fix the performance problem or make it worse?
- If it’s storage on another node, is it a network bottleneck or is it the other node? Are multiple external nodes involved? In some cases, data for a single VM could be spread across many nodes.
As you can see, the process gets complicated quickly and grows with the size of your HCI cluster. Often, the only solution to the above scenario is to add another node.
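The triage questions above can be boiled down to a simple decision helper. This is a minimal sketch, not a real diagnostic tool: the latency fields and the threshold are hypothetical, and a real HCI cluster would need per-layer counters that conventional architectures make hard to collect in the first place.

```python
def triage(host_ms, network_ms, storage_ms, threshold_ms=5.0):
    """Return the likely bottleneck layer for a VM's IO latency.

    Arguments are per-layer latency contributions in milliseconds.
    Field names and the threshold are illustrative only.
    """
    layers = {"host": host_ms, "network": network_ms, "storage": storage_ms}
    worst = max(layers, key=layers.get)
    if layers[worst] < threshold_ms:
        return "no obvious IO bottleneck"
    return worst

# A VM whose latency is dominated by cross-node storage traffic:
print(triage(host_ms=0.4, network_ms=1.2, storage_ms=11.8))  # storage
```

The hard part in a conventional HCI cluster is not this logic, it is obtaining trustworthy per-layer numbers when guest VMs and the storage VM share the same host resources.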
Virtualization helped solve many traditional infrastructure issues such as hardware maintenance and patching. With external storage, you can easily move VMs to another host by migrating the compute and memory state using vMotion or Hyper-V live migration. With HCI architectures, storage is tightly coupled with compute, so there's a lot more to think about:
- Data evacuation before maintenance. Although most vendors allow maintenance without data evacuation, it is not a best practice because it introduces risk. When you evacuate, you’re spreading the entire load from the node—both compute and storage—across other nodes, increasing the potential for bottlenecks and noisy neighbor problems.
- Reduction in amount of storage. When a node goes offline for maintenance, a big chunk of your storage goes offline too, potentially leaving your cluster constrained.
- Reduction in amount of available flash. Especially in hybrid configurations, when a node goes offline that also means a big chunk of flash goes offline. Flash is highly important as a cache, so flash misses go up and performance goes down.
All this means that your IT team needs to be extra careful about scheduling maintenance windows, and in many cases you lose the freedom to perform maintenance on individual hosts because host maintenance has a much broader impact. Doing maintenance becomes risky, but we know the risks of skipping patching and other maintenance all too well.
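The storage impact of taking a node offline can be reasoned about with a quick headroom check. The sketch below uses invented numbers, not figures from any particular HCI product:

```python
def maintenance_headroom(node_capacities_tb, used_tb, offline_node):
    """Estimate storage utilization while one node is in maintenance.

    node_capacities_tb: per-node raw capacity in TB.
    offline_node: index of the node taken down.
    Returns utilization of the surviving nodes as a fraction.
    """
    remaining = sum(c for i, c in enumerate(node_capacities_tb)
                    if i != offline_node)
    return used_tb / remaining

# Four 20 TB nodes with 54 TB in use: taking one node down pushes
# the surviving nodes to 90% utilization -- a constrained cluster.
util = maintenance_headroom([20, 20, 20, 20], used_tb=54, offline_node=0)
print(f"{util:.0%}")  # 90%
```

With external storage, the same host maintenance changes only compute headroom; the storage pool is untouched.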
The HCI Snowball Effect
A single HCI failure can trigger much larger problems. When a host fails for any reason, it has the following effects:
- Reduces available resources for compute and storage.
- Reduces available flash in hybrid configurations, resulting in a double dip: flash for VMs from the failed node must be rewarmed on the surviving nodes, and the extra cache pressure evicts data belonging to existing VMs. VMs on both the failed and surviving nodes suffer.
- The process repeats when the failed node is reintroduced.
Failure of even a single component, such as a flash drive, can cause an entire node to collapse. The result is a far greater impact on operations than when storage is decoupled from the host.
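The flash "double dip" can be made concrete with a toy cache model. The figures and the even-spread assumption are invented purely for illustration:

```python
def surviving_cache_share(nodes, flash_per_node_tb, working_set_tb):
    """Fraction of the hot working set that fits in flash after one node fails.

    Assumes (purely for illustration) the working set is spread evenly and
    must rewarm into the surviving nodes' flash, evicting resident data.
    """
    surviving_flash = (nodes - 1) * flash_per_node_tb
    return min(1.0, surviving_flash / working_set_tb)

# Four nodes with 2 TB of flash each and an 8 TB hot working set:
# lose one node and only 75% of the working set still fits in cache,
# so misses rise for VMs on the surviving nodes too.
print(surviving_cache_share(nodes=4, flash_per_node_tb=2, working_set_tb=8))
```

In a decoupled architecture, a host failure leaves the storage tier's cache intact, so this compounding effect simply does not occur.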
A Best-of-Breed Architecture Is Lower Risk
Conventional HCI destroys the stateless nature of virtualization and increases risk. Performance problems are much easier to troubleshoot with a decoupled architecture, especially when the architecture has been built from the ground up to provide workload-granular analytics. For example, Tintri allows you to see the root cause of any latency issue across compute, network, and storage, giving you a comprehensive view of your infrastructure at the VM or container level.
Because storage and compute are physically and logically separated, none of the risks described above affect the Tintri enterprise cloud platform, making it a lower-risk option for large-scale enterprise infrastructure deployments.