This time we are pleased to present a guest blog from Chris Evans of Architecting.IT, an IT consultant with over 30 years of commercial experience. Over his career Chris has provided consultancy and advice to a wide range of customers and industry segments, including finance, utilities, and IT organisations.
The role of the storage administrator is changing as we move towards service-based infrastructure deployments and increased automation. AIOps, or Algorithmic IT Operations, provides a framework not only to offload the more mundane tasks of resource management, but also to address challenges that simply can’t be resolved by scaling human resources. This post looks at what AIOps represents and how vendors are meeting the requirements of their customers.
Background
AIOps is a term coined by Gartner in 2016. It describes three disciplines (Automation, Performance Management and Service Management) that make up a framework to augment the capabilities of infrastructure administration staff. We can imagine an implementation consisting of multiple layers.
- Layer 1 – Data Sources – automating the typical tasks performed by administrators requires configuration and usage data. This includes telemetry from systems (as discussed in a recent post on Wisdom of the Storage Crowd) and applications.
- Layer 2 – Real-time processing – telemetry data must be collected and processed in real time in order to gain immediate value.
- Layer 3 – Rules/Patterns – data needs to be analysed using rules and patterns already identified. Vendors are already developing algorithms that take petabytes of telemetry data and translate it into tools such as anomaly detection and fault diagnosis.
- Layer 4 – Domain algorithms – this includes site-specific knowledge to understand localised usage patterns and requirements.
- Layer 5 – Automation – the use of APIs and CLIs to drive customer-facing tasks such as provisioning and decommissioning. This also includes automating performance management, for example to rebalance workloads across available infrastructure.
Spanning all of these layers is the use of machine learning to observe and detect trends, anomalies and outliers in telemetry data that would be impractical or impossible for humans to calculate. We’ll come back to how ML/AI is helping to deliver more efficient management of data and storage.
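To make the layering concrete, here’s a minimal sketch of how a telemetry sample might flow from collection through rules and site-specific checks to an automated action. All of the metric names, thresholds and actions are invented for illustration, not taken from any vendor’s product.

```python
# A minimal, hypothetical sketch of the five layers as one pipeline.
# Metric names, thresholds and actions are invented for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sample:                       # Layer 1: a single telemetry data point
    device: str
    metric: str
    value: float

def rule_check(s: Sample) -> Optional[str]:
    """Layer 3: apply a rule/pattern already identified by the vendor."""
    if s.metric == "media_errors" and s.value > 0:
        return f"{s.device}: possible media failure"
    return None

SITE_LATENCY_MS = 5.0               # Layer 4: site-specific knowledge

def domain_check(s: Sample) -> Optional[str]:
    if s.metric == "latency_ms" and s.value > SITE_LATENCY_MS:
        return f"{s.device}: latency above local baseline"
    return None

def remediate(finding: str) -> None:
    """Layer 5: automation hook; a real system would call a
    provisioning or rebalancing API rather than print."""
    print("ACTION:", finding)

# Layer 2: process samples as they arrive (a static list stands in
# for a real-time telemetry stream here).
stream = [Sample("ssd-04", "media_errors", 2),
          Sample("array-1", "latency_ms", 8.7)]
for sample in stream:
    for check in (rule_check, domain_check):
        finding = check(sample)
        if finding:
            remediate(finding)
```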
The Human Factor
Why do we need to introduce tools like AIOps into storage management? As the amount of information created globally continues to increase exponentially, the data produced, and more importantly stored, in the enterprise grows just as fast. Data that was previously discarded, or never created at all, is now seen as having future value. The increased use of machine learning and AI by businesses draws in information from new, increasingly machine-generated sources. Businesses are now storing multiple petabytes of information and want to do something practical with it.
Agility
Business processes are driving greater demand for data storage capacity, but this is only one aspect of the challenges experienced by IT organisations. MTTR, or mean time to repair, is becoming ever more critical in keeping infrastructure availability approaching 100%. IT organisations typically want to identify and resolve emerging problems before they result in a hard failure.
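To see why MTTR matters so much, consider the standard approximation availability ≈ MTBF / (MTBF + MTTR). The numbers below are purely illustrative:

```python
# Illustrative only: how MTTR drives availability for a fixed MTBF.
mtbf_hours = 10_000          # assumed mean time between failures

for mttr_hours in (24, 4, 0.5):
    availability = mtbf_hours / (mtbf_hours + mttr_hours)
    print(f"MTTR {mttr_hours:>4} h -> availability {availability:.4%}")

# Cutting MTTR from a day to half an hour takes this example system
# from roughly 99.76% to roughly 99.995% availability.
```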
Reducing or managing hardware interventions has other positive aspects. IT organisations want to minimise the time engineers spend in the data centre replacing faulty equipment, because any data centre intervention is a risk. Engineers have been known to unplug the wrong piece of hardware when replacing a faulty component, or to knock equipment accidentally and cause unplanned outages or reboots.
Note: one of the original premises of storage area networks was to consolidate infrastructure into shared hardware that could be managed more efficiently.
Time to value from data analytics is becoming shorter as enterprises compete with each other. This means developers want access to storage in ever shorter cycles, preferably automated and on-demand. As resources are created, used and returned to the pool, we can expect increasingly fluid configurations that no storage administrator can hope to track effectively.
Layer 1 – Metrics
In order to implement efficient AIOps, systems need to expose metadata and metrics describing storage operations. These endpoints collect data from both the physical and logical aspects of storage systems. For example, data on individual HDD or SSD operations provides information on temperature, permanent and transient media failures, throughput, performance and device uptime. Collection extends to the storage chassis, recording statistics on front-end port activity, processor and memory load, server temperature and ambient room temperature.
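As a concrete example of a hardware endpoint, the sketch below pulls a few drive-level metrics using smartmontools’ JSON output. It assumes smartctl 7.0 or later and an accessible device path, and reads fields defensively since the JSON schema varies by device type:

```python
# Sketch: collect basic drive health metrics via smartmontools'
# JSON output (smartctl 7.0+). Fields vary by device, so everything
# is read defensively; the device path is illustrative.
import json
import subprocess

def drive_metrics(device: str) -> dict:
    out = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True, text=True, check=False,
    ).stdout
    data = json.loads(out)
    return {
        "device": device,
        "smart_passed": data.get("smart_status", {}).get("passed"),
        "temperature_c": data.get("temperature", {}).get("current"),
        "power_on_hours": data.get("power_on_time", {}).get("hours"),
    }

if __name__ == "__main__":
    print(drive_metrics("/dev/sda"))
```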
Data collection isn’t restricted to hardware. Storage software is highly complex, and many vendors have modularised their designs. Software endpoints can track internal application crashes, excessive memory use, bugs in hardware drivers and even the commands used to drive the software. The last point may seem an unusual metric to collect; however, it can show whether end users are making full use of the command functions available, or configuring the right set of best-practice options.
The third area of data collection extends beyond the storage platform itself. Vendors now routinely collect information from hypervisors and application hosts. This data can be used to identify configuration issues and other risks to normal operations. Storage systems that are application-aware and understand the structure of the data being stored will be able to provide additional value in ensuring that application performance is maintained as data volumes grow.
Layers 2, 3 & 4 – Processing in Real Time
All of this information is of little use if collation and analysis can’t be done in real time. Typically, we see a two-tier approach to analysis. First, vendors collate data into large central repositories or data warehouses that represent trillions of pieces of individual endpoint data across the entire customer install base.
This aggregated data provides enough information to perform statistical analysis of drive failures or configuration problems that may affect the entire customer base. Held as a long-term archive, it allows vendors to fix bugs in drive firmware or proactively replace failure-prone media. The same data source can also be used to validate the quality of storage operating system software.
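As a toy illustration of that fleet-level statistical analysis (the telemetry extract and its column names are invented), grouping drive records by model quickly surfaces failure-prone media:

```python
# Toy fleet analysis with an invented telemetry extract; column
# names are hypothetical. Requires pandas.
import pandas as pd

fleet = pd.DataFrame({
    "model":       ["X100", "X100", "X100", "Y200", "Y200", "Y200"],
    "drive_years": [4.0, 3.5, 5.0, 4.2, 3.8, 4.5],
    "failed":      [0, 1, 0, 1, 1, 0],
})

afr = fleet.groupby("model").agg(failures=("failed", "sum"),
                                 drive_years=("drive_years", "sum"))
afr["afr_pct"] = 100 * afr["failures"] / afr["drive_years"]
print(afr)

# A model whose annualised failure rate stands well above the fleet
# average becomes a candidate for a firmware fix or proactive replacement.
```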
Ultimately, this type of data collection benefits the vendor, as it helps improve system availability and reduces the number of support calls raised from the field. However, the customer sees benefits too. Bugs or other issues that might be introduced through code updates can be avoided or mitigated, and the administrator is given the information to make informed decisions, rather than running into problems other customers have already experienced.
Anomalies
The second benefit of collating large volumes of individual customer data is the ability to use machine learning and AI techniques that highlight anomalies or other unusual issues within a configuration. These scenarios could include identifying performance hotspots, unexpected growth in capacity or throughput, or configuration data issues within other components of the infrastructure, such as at the host or hypervisor layer.
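One simple flavour of anomaly detection, sketched below with synthetic data, is a rolling z-score that flags samples sitting far outside recent behaviour. Real products use far richer models, but the principle is the same:

```python
# Minimal anomaly detection: flag samples more than 3 standard
# deviations from a trailing window's mean. Data is synthetic.
import statistics

throughput = [210, 205, 215, 208, 212, 209, 530, 211, 207]  # MB/s
WINDOW, THRESHOLD = 5, 3.0

for i in range(WINDOW, len(throughput)):
    window = throughput[i - WINDOW:i]
    mean, stdev = statistics.mean(window), statistics.pstdev(window)
    if stdev and abs(throughput[i] - mean) / stdev > THRESHOLD:
        print(f"sample {i}: {throughput[i]} MB/s looks anomalous")
```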
Increasingly, vendors are offering capabilities to identify ransomware, to rebalance workloads across multiple hardware configurations and to advise on future upgrades or hardware replacement. This last option is particularly useful, as it allows administrators to build a model that picks the most efficient new hardware configuration for upgrades and replacements.
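For ransomware in particular, one commonly cited heuristic is that encrypted writes exhibit near-maximal Shannon entropy, unlike most user data. The sketch below illustrates the idea only; it is not any vendor’s actual detector, and the threshold is invented:

```python
# Illustrative ransomware heuristic: encrypted data approaches the
# maximum 8 bits/byte of Shannon entropy; typical documents do not.
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

text_like = b"quarterly sales report " * 200   # repetitive, low entropy
random_like = os.urandom(4096)                 # stands in for ciphertext

for label, payload in (("document", text_like), ("encrypted?", random_like)):
    h = shannon_entropy(payload)
    flag = "suspicious" if h > 7.5 else "ok"
    print(f"{label}: {h:.2f} bits/byte -> {flag}")
```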
ML/AI
Throughout this discussion we’ve mentioned the use of machine learning and artificial intelligence. Why is this becoming such an important feature of modern-day infrastructure management? In the storage world, administrators will recognise many “time sinks”: tasks that can easily consume hours or days of work yet rarely yield answers.
Good examples are identifying (and resolving) performance hotspots, balancing I/O activity across systems (front-end or back-end) and managing capacity growth across multiple storage platforms. Thankfully, modern storage solutions are designed to resolve many of these challenges automatically, freeing administrators to spend that time on work that adds value for their customers (the business).
Despite these advancements in design, anomalies still arise that would be difficult for humans to identify (ransomware is one good example). AI provides the capability to automatically analyse huge amounts of data and create trained models that then provide real-time analysis of active systems.
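The pattern is typically train once on historical telemetry, then score live samples as they arrive. Here’s a minimal sketch using scikit-learn’s IsolationForest (my choice purely for illustration; vendors build their own models) on synthetic latency and IOPS data:

```python
# Sketch of the train-once, score-in-real-time pattern using
# scikit-learn's IsolationForest. Features and data are synthetic.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(seed=0)
# Historical telemetry under normal operation: (latency_ms, iops).
history = np.column_stack([
    rng.normal(5, 1, 1_000),           # latency around 5 ms
    rng.normal(20_000, 2_000, 1_000),  # IOPS around 20k
])
model = IsolationForest(random_state=0).fit(history)  # offline training

# Live samples scored as they arrive; -1 flags an anomaly.
live = np.array([
    [5.2, 19_500],   # within normal behaviour
    [42.0, 3_000],   # latency spike with collapsing IOPS
])
print(model.predict(live))  # expected: [ 1 -1 ]
```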
New Tools
New management tools are required if we are to take advantage of the benefits of AIOps. Storage vendors have already started to transition management interfaces away from purely GUI-based systems and now offer CLIs and APIs. Command line interfaces allow commands to be integrated into scripting and automated build processes, while APIs provide a more advanced level of interaction, especially for extracting reporting or telemetry data.
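As an example of the API route, the sketch below provisions a volume over REST. The endpoint, payload fields and token are entirely hypothetical, but real arrays expose similar vendor-specific interfaces:

```python
# Hypothetical example of API-driven provisioning. The endpoint,
# payload fields and token are invented; real storage arrays expose
# similar but vendor-specific REST interfaces.
import requests

API = "https://array.example.com/api/v1"   # hypothetical base URL
HEADERS = {"Authorization": "Bearer <token>"}

def create_volume(name: str, size_gb: int) -> dict:
    resp = requests.post(
        f"{API}/volumes",
        json={"name": name, "size_gb": size_gb},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# Called from a build pipeline, this provisions storage on demand
# instead of raising a ticket for an administrator.
vol = create_volume("ci-scratch-01", 100)
print(vol)
```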
This doesn’t mean an end to graphical interfaces. In fact, the more astute storage vendors have moved towards using GUIs as dashboards: showing system status, displaying trends in growth and performance, and generally presenting an exception-based view of systems infrastructure.
Evaluating your Vendor
How should we choose between vendor AIOps solutions? Here are some pointers to follow when choosing products.
- Is my vendor collecting and actively using telemetry data?
- How are issues fed back to storage administrators (alerts, emails, dashboards)?
- How much information is collected from outside the storage platform?
- How is my data anonymised and protected?
I include the last point because many IT organisations will be concerned about the security of data stored in shared repositories. Storage vendors should be able to articulate exactly how data is being stored and managed, including processes around the destruction of non-essential data over time.
The Architect’s View
While automation can never totally replace the storage administrator, features such as those implemented with AIOps can improve the efficiency of storage teams and free up individuals for more valuable work, such as engaging more closely with the business on future requirements. The rate of data growth in the enterprise means businesses have to find ways to make individual team members more efficient. Without solutions like AIOps, they simply won’t be able to keep up with their peers, and risk falling behind in their ability to fully exploit data assets.