Gaining Insight into Storage Numbers for Your SQL Server Databases
Have you ever tried to troubleshoot a storage issue with SQL Server? If so, you have undoubtedly learned that there is not great visibility.
This is the second in a seven-part blog series on SQL Integrated Storage; the first part was an introduction to the product. From here we will go deeper into its individual facets and features. In this post, we are going to examine visibility.
SQL Server tracks file statistics at a very fine level through sys.dm_io_virtual_file_stats. These stats are great; however, they are cumulative since the last restart of the SQL Server instance. You don’t get visibility into what is occurring right now, or at a specific point in time two hours ago; plus, they cannot tell you anything outside of SQL Server. Windows can only give us data on a per-drive basis, not per file. If we only have one file per drive, then we have great information, but this is not usually the case. Hypervisors can only report at the virtual disk level, which for VMware is the .vmdk file. The storage system reports at the LUN or volume level. With this many sources of data it should be easy, but each level reports only on what is in its purview, and each has its own way of looking at it. To do any troubleshooting you need the DBA, storage admin, virtualization admin, and network admin to work together to figure things out, which is cumbersome and inefficient. SQL Integrated Storage changes all this.
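To make the cumulative-since-restart limitation concrete, here is a sketch of the kind of query a DBA typically runs against that DMV (column names per the SQL Server documentation; the join to sys.master_files simply resolves file paths):

```sql
-- Cumulative per-file I/O stats since the last instance restart.
-- Totals only -- no "right now" view, and nothing below the SQL Server layer.
SELECT
    DB_NAME(vfs.database_id)  AS database_name,
    mf.physical_name,
    vfs.num_of_reads,
    vfs.num_of_writes,
    vfs.io_stall_read_ms,
    vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
ORDER BY vfs.io_stall_read_ms + vfs.io_stall_write_ms DESC;
```

Useful, but every number is a running total, which is exactly why point-in-time visibility requires extra work.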
What if we could address this problem and have real-time and historical granular database visibility across the storage stack? Can storage actually “understand” databases? If you are using Intelligent Infrastructure, the answer is “yes”.
The Tintri VMstore is a purpose-built Intelligent Infrastructure appliance whose file system provides per-file metrics for virtual environments, initially for managing storage at the virtual machine level. We have extended these same concepts to databases, so now we can manage storage at the file level for SQL Server databases. This provides unprecedented visibility into the I/O usage of each database, even down to the file level.
Digging into how this works, we start with the big picture of all your databases and then drill down into each SQL Server instance, then each database, and ultimately each file. We examine all sorts of metrics, including the big three: IOPS, throughput, and latency, in both real-time and historical contexts. Plus, we can look at reads versus writes, as well as the block sizes of the reads and writes coming in. We can also look at space usage on disk and how well compression and dedupe are being utilized, so we understand when we will really run out of space rather than just guessing. There are more ways to look into this as well, and the historical information is incredibly helpful for determining growth trends or finding repeatable patterns when troubleshooting.
When you first open Tintri Global Center (TGC) you see a lot, but once you select a metric you want to look at, e.g., IOPS, throughput, or latency, you are presented with a chart showing usage for the past day. You can click any point on this chart to find the top contributors at that point in time. As illustrated in Figure 1, when I clicked a specific point in time, I could see which objects were making up a spike in performance and quickly spot the key drivers. I can also scan through the past 30 days and look at the data in many ways. When a ticket comes in saying the database is slow and you note the timestamp was 4 hours ago, you can go see what was going on with all databases 4 hours ago and then drill down.
Figure 1: TGC Top Contributors
All My Databases
If you click the Databases tab in TGC you see all the databases being managed by SQL Integrated Storage. It doesn’t matter which storage array they are on or which SQL Server they are attached to; you will have visibility into all databases. There are six predefined views to help you find the data you are looking for quickly, and over 70 total fields that can be displayed. Most of the time you’ll probably select one of the predefined views: Database Overview, Performance, Protection, QoS, Space, or Troubleshooting.
In the Database Overview view you can see all databases and quickly sort to see which are consuming the most resources or having the most latency issues. You can even drill into the latency to find out where it is coming from. Figure 2 shows all databases sorted by latency; you can see there are multiple SQL Servers registered and multiple storage appliances. This view covers all the SQL Server databases in the environment.
Figure 2: Latency by Database
When I hover over the latency (Figure 3), I can see a breakdown of where the latency is coming from: is it the VMstore or the network? We track when an I/O request leaves the SQL Server and when it is acknowledged, and that is how these numbers are derived. In the figure we see 0.8 ms coming from flash and the rest from the network. In my lab the network is not robust, and I can easily max it out. Microsoft’s current best-practice guidance is that data files (.mdf and .ndf) should have less than 10 ms of latency, and log files (.ldf) less than 5 ms. That guidance is measured from inside SQL Server, so the numbers we see in TGC may not account for latency added by Windows or SQL Server itself; we therefore need to be below these numbers to provide sufficient headroom. Excess latency shows up as slower transaction times, which in turn slow down your applications. Now we have a real-time view into overall latency, and we can see at what times the latency targets were not being met.
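For comparison, here is a sketch of how that same guidance is usually checked from inside SQL Server, computing average per-file latency from the cumulative DMV counters and pairing each file with its threshold (the 10 ms / 5 ms values are the guidance mentioned above; NULLIF guards against dividing by zero on idle files):

```sql
-- Average read/write latency per file, with the applicable
-- best-practice threshold (10 ms for data files, 5 ms for log files).
SELECT
    DB_NAME(vfs.database_id) AS database_name,
    mf.type_desc,                                         -- ROWS = data, LOG = log
    mf.physical_name,
    vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
    vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms,
    CASE WHEN mf.type_desc = 'LOG' THEN 5 ELSE 10 END    AS threshold_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id;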
Figure 3: Detailed Latency
Without Intelligent Infrastructure you cannot easily look at all the databases in the environment from a storage perspective; with SQL Integrated Storage you can. Spotting any databases that are currently having issues, or quickly finding which workloads are making up spikes in utilization, is as easy as looking at the dashboards. Next, we are going to drill down and look at a single SQL Server.
I can use the filter to look at only one SQL Server instance, or I can select the ‘SQL Servers’ tab and choose a single instance to see just the databases for that SQL Server. Now I am looking at the same information as before but focused on a single instance (see Figure 4). I can quickly spot which database is using most of the I/O, or sort by space and see which databases are using what space. All the same visibility as for the whole environment, but at the instance level.
Figure 4: Per Instance View
If you need more detailed information you can continue to drill down into each database. For example, you can look at a chart over different date ranges to review IOPS, throughput, latency, or many other metrics for a specific database. Notice the yellow and blue, which show reads and writes. From these I can note read and write behaviors, and the more I know about the database, the more I can correlate them to specific workloads. Any spikes are easy to spot and quick to drill into for more information. In Figure 5, we are looking at the ProdAW-DW2017 database, and I have looked back a few days to drill into a specific point in time. When I hover over that period, I can see the actual data for that specific window. I am only showing throughput and latency in Figure 5, but many other metrics are available. Thus, you can see what each database was doing at any point in time over the last month, in seconds. No more trying to figure out how LUN reports translate into specific workloads, or how to roll up a delta from sys.dm_io_virtual_file_stats. With Intelligent Infrastructure, it’s right at your fingertips, all the time. Get the team together and look at actionable data; instead of the long discussion and inevitable finger pointing over who is at fault, you can work on identifying the problem and its solution.
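The DMV delta technique mentioned above is worth seeing once to appreciate what the dashboards replace: because the counters are cumulative, "what is happening right now" requires two snapshots and a subtraction. A minimal sketch (the one-minute interval is an arbitrary choice for illustration):

```sql
-- Snapshot 1: capture cumulative counters into a temp table.
SELECT database_id, file_id, num_of_reads, num_of_writes, io_stall
INTO #snap1
FROM sys.dm_io_virtual_file_stats(NULL, NULL);

-- Wait out the sampling interval.
WAITFOR DELAY '00:01:00';

-- Snapshot 2: subtract to get activity during the interval only.
SELECT
    DB_NAME(s2.database_id)             AS database_name,
    s2.file_id,
    s2.num_of_reads  - s1.num_of_reads  AS reads_in_interval,
    s2.num_of_writes - s1.num_of_writes AS writes_in_interval,
    s2.io_stall      - s1.io_stall      AS stall_ms_in_interval
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS s2
JOIN #snap1 AS s1
  ON s1.database_id = s2.database_id
 AND s1.file_id     = s2.file_id;

DROP TABLE #snap1;
```

And even then, the result only covers the interval you happened to sample; it cannot reconstruct a spike from 4 hours ago the way the historical charts can.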
Figure 5: Per-Database Read and Write Breakdown
If per-database granularity is not enough, you can drill down and see the reads and writes, plus any other metric you want, for each individual database file. This can help you spot whether issues are related to log or data files.
There are many side benefits I have not covered; for example, you can export data quickly. You can run a report of the actual space being used by all databases, listing the storage location and the SQL Server each is attached to. There are many, many other ways you can slice and examine the information.
With Intelligent Infrastructure, the data is there, and we can use it to better understand our environment, and more importantly determine what proactive actions we need to take to avoid problems in the first place. With all of these data points we will be able to leverage AIOps and Machine Learning to bring about the next generation of Intelligent Infrastructure to further improve on the capabilities that are currently available with SQL Integrated Storage.
Troubleshooting storage performance issues is now much simpler; both the infrastructure staff and the DBAs can look at the same screen and see data that is relevant to the issue. Looking at real-time or historical data trends helps in understanding the impact of daily reporting and nightly database maintenance, and one-off queries can be measured in terms of their impact on your database storage performance.
With SQL Integrated Storage, we now have unprecedented views into what SQL Server is using in terms of storage, across many metrics. You start with the big picture to see what is being used overall, and you can drill down when you spot an issue. Tools that allow you to bridge the gap between the storage and database teams are always a win. With SQL Integrated Storage, Tintri is not just raising the bar but changing the game. Stop thinking about SQL Server and its relationship to storage the way you have in the past, and instead see the power of managing storage for your databases at the database level, with deep insight into what is occurring in each database. Imagine your productivity increases, or what other problems you could solve for your organization.
I invite you to join me next time for part 3 of this series when I’ll discuss Database Performance, with a focus on ensuring one bad query does not impact other databases. Keep an eye out or follow #Tintri on Twitter and get notified when new content is available.