SQL SERVER INDEX (Part 1): Index Seek vs Index Scan and Statistics
Firstly, I recommend watching this How to Think Like the Engine - YouTube by Brent Ozar on YouTube (all parts). It is incredibly helpful foundational material before you begin your SQL performance tuning journey.
For example, an ideal index for a query might not exist yet. In such cases, it is up to the DBA to create the index so that the optimizer can make use of it.
Similarly, inaccurate row count estimations on the execution plan can lead to suboptimal choices by the optimizer. As DBAs, we may need to update statistics or tune the code depending on the cause of the misestimations.
When we have accurate indexes and correct row estimations, the optimizer's choices are usually optimal, and we can stop our part of tuning the query.
In this post, I will go over:
- The differences between index scan and index seek operations.
- How the optimizer chooses between scans and seeks, with examples
To demonstrate, I will use the StackOverflow2010 database. Feel free to follow along with my examples. If you haven't, please checkout Tools I Used During Query Performance Troubleshooting to see how to setup to obtain execution plan and logical reads info.
Index Scan and Index Seek
Index seek
This symbol represents an Index Seek operation in SQL Server. Do you notice the symbol looks like a tree traversal that we studied in our data structure class?
For a large table (e.g. million rows), the tree depth might be between 4 to 6 levels. Thus, searching a single record with index seek requires only 4 to 6 logical reads.
To view the depth D of an index tree, under table>indexes, right click on the index, select Properties, and select Fragmentation.
This makes seeking for a small number of rows highly efficient, since a seek is a low I/O operation.
However, if the number of seeks are large, it would take T X D logical reads (where T is the number of times to seek, and D is the depth of the index tree). The higher the T is, the higher the number of logical reads is, thus THE MYTH "INDEX SEEK IS ALWAYS BETTER" IS WRONG.
Note that T is the number of times to seek, not necessarily the number of rows to search. For example, searching rows with CreationDate > '12/23/2010' using an index on the CreationDate column returns 4869 rows, but it requires only one seek to find the boundary row '12/23/2010' and then SCAN to get the rest matching rows.
Index scan
This symbol represents an Index Scan operation in SQL Server. As the symbol suggests, it is an operation that scans all pages at the leaf level of an index tree.
When querying larger amounts of data, an index scan is thus more efficient.
Let's use a simple calculation to prove that.
Consider the Posts table in the StackOverflow database, the index d has 800,000 pages and a depth of 4.
Consider the Posts table in the StackOverflow database, the index d has 800,000 pages and a depth of 4.
If perform 3 million seeks in order to read the data: 3,000,000 X 4 = 12,000,000 logical reads
If scan the entire index d: 800,000 logical reads
Obviously, in this case, an index scan is more efficient!
Index scan/seek examples and statistics
SQL Server stores table statistics to help generate optimal execution plans!!! (sadly, sometimes mission fails even with up-to-date statistics)
There are various materials out there to help you understand more about statistics, and how to create and update statistics.
These statistics inform SQL Server of row counts, which helps it decides between using seeks or scans for operation.
139 of 110 is "the actual number of rows for all executions" of "the estimated number of rows for all executions."
During the compilation, it estimated that the query needs to retrieve 110 posts, but the run time indicates 139 posts read. Minor estimation discrepancies like this are typically acceptable.
The row estimate is called cardinality estimate.
When scan is better than seek:
On the first execution, the SQL Server optimizer chooses to scan the Posts and Users tables and join them using a hash match operation.
Two reasons for me to justify that the execution plan is optimal:
- The cardinality estimates are accurate, suggesting SQL Server made informed choices.
- Since the entire table is being read, a scan is efficient.
(The numbers of logical reads are 800,856 on the Posts table and 7405 on the Users table.)
In contrast, for the second execution, I force the optimizer to use a seek operation on the Users table with a query hint. The plan thus loops through each Post one by one and seek its corresponding User record.
The Posts table has more than 3,000,000 posts, thus requires 3,000,000 seeks on the Users table to get the users info. This results in 11 million logical reads, which is highly inefficient.
I forced seek with a QUERY HINT in this case. However, it can be the optimizer's choice to seek if the cardinality estimate is underestimated for some reason (e.g. instead of estimating 3,000,000 users, it estimates 3,000 users).
When index seek is an optimal choice
For the same query used above, I added a where clause to filter out the number of posts. Only ~503 posts are queried this time.
The optimizer decided that since the number of users to match is small (~503 users), it performs a seek on the Users table for each post returned (~503 seeks). The number of logical reads on the Users table is only 84.
To allow an efficient access path to the Posts table, we could also create an index on PostTypeID in the Posts table.
CREATE INDEX ix_posttypeid
ON dbo.posts(posttypeid)
include (id, title, lasteditdate, score, owneruserid)
ON dbo.posts(posttypeid)
include (id, title, lasteditdate, score, owneruserid)
When both seek and scan are decent choices:
The two queries query posts that are either questions or answers.
The first execution performed two seeks to find the first post where PostTypeID=2 and the first post where PostTypeID=1.
Once it locates the two segments: PostTypeID=2 and PostTypeID=1, it SCANS all rows until it read all the matching rows.
Note: though there are more than 3,000,000 rows queried, we just need to perform the SEEK operation twice. Because the IX_PostTypeID index stores rows in the order of PostTypeID, the optimizer seeks to locate where PostTypeID = 1 and PostTypeID = 2 and then perform SCANS to get the rest of the matching rows.
In the second execution that we FORCE SCAN; the number of logical reads remains similar (1st execution is 28287 reads, 2nd execution is 28266).
The primary role of a seek is to narrow down the scanning range, making it particularly helpful when the searched window is small. In this example, since the searched window is 95% of the table, using seek DOESN'T improve the performance significantly.
Non-clustered index scan and clustered-index scan
For the same query above, I ask myself, now that scan is as good as seek, will adding the non-clustered index IX_PostTypeID still helpful?
When we create a non-clustered index, we must be aware that it will require maintenance (impacting storage and potentially slowing down
INSERT
, UPDATE
, and DELETE
operations).Thus, when adding new index, we should consider the loss and gain.
In this particular query, if I opt to use the clustered index instead, the number of logical reads is 800,856; compared to just 28,266 with a non-clustered index scan on
IX_PostTypeID
. Obviously, an improvement from 800,856 to 28,266 reads is significant. In any case, it is up to us to decide whether an improvement is necessary to create a new index, since it is important to balance between user experience objective and maintenance cost.
Cleanup
DROP INDEX IF EXISTS ix_posttypeid ON dbo.posts
Comments
Post a Comment