SQL SERVER INDEX (Part 1): Index Seek vs Index Scan and Statistics

Firstly, I recommend watching this How to Think Like the Engine - YouTube by Brent Ozar on YouTube (all parts). It is incredibly helpful foundational material before you begin your SQL performance tuning journey.


The SQL Server optimizer is responsible for determining which operations —such as scans or seeks — should be used to query the least number of 8KB pages. However, the optimizer sometimes needs a DBA's intervention.

For example, an ideal index for a query might not exist yet. In such cases, it is up to the DBA to create the index so that the optimizer can make use of it.

Similarly, inaccurate row count estimations on the execution plan can lead to suboptimal choices by the optimizer. As DBAs, we may need to update statistics or tune the code depending on the cause of the misestimations.

When we have accurate indexes and correct row estimations, the optimizer's choices are usually optimal, and we can stop our part of tuning the query.

In this post, I will go over:
- The differences between index scan and index seek operations.
- How the optimizer chooses between scans and seeks, with examples

To demonstrate, I will use the StackOverflow2010 database. Feel free to follow along with my examples. If you haven't, please checkout Tools I Used During Query Performance Troubleshooting to see how to setup to obtain execution plan and logical reads info.

Index Scan and Index Seek

Index seek

This symbol represents an Index Seek operation in SQL Server. Do you notice the symbol looks like a tree traversal that we studied in our data structure class?


Indeed, an index is a search tree and seeking a record means traversing the tree's levels.

For a large table (e.g. million rows), the tree depth might be between 4 to 6 levels. Thus, searching a single record with index seek requires only 4 to 6 logical reads. 

To view the depth D of an index tree, under table>indexes, right click on the index, select Properties, and select Fragmentation.


This makes seeking for a small number of rows highly efficient, since a seek is a low I/O operation.

However, if the number of seeks are large, it would take T X D logical reads (where T is the number of times to seek, and D is the depth of the index tree). The higher the T is, the higher the number of logical reads is, thus THE MYTH "INDEX SEEK IS ALWAYS BETTER" IS WRONG. 

Note that T is the number of times to seek, not necessarily the number of rows to search. For example, searching rows with CreationDate > '12/23/2010' using an index on the CreationDate column returns 4869 rows, but it requires only one seek to find the boundary row '12/23/2010' and then SCAN to get the rest matching rows.


Index scan

This symbol represents an Index Scan operation in SQL Server. As the symbol suggests, it is an operation that scans all pages at the leaf level of an index tree.


If an index has, say 800,000 pages, at the leaf level, a scan will involve 800,000 logical reads—a much larger I/O operation. 



When querying larger amounts of data, an index scan is thus more efficient. 

Let's use a simple calculation to prove that.

Consider the Posts table in the StackOverflow database, the index d has 800,000 pages and a depth of 4

If perform 3 million seeks in order to read the data: 3,000,000 X 4 = 12,000,000 logical reads

If scan the entire index d: 800,000 logical reads

Obviously, in this case, an index scan is more efficient!

Index scan/seek examples and statistics

SQL Server stores table statistics to help generate optimal execution plans!!! (sadly, sometimes mission fails even with up-to-date statistics) 

There are various materials out there to help you understand more about statistics, and how to create and update statistics.



These statistics inform SQL Server of row counts, which helps it decides between using seeks or scans for operation.



139 of 110 is "the actual number of rows for all executions" of "the estimated number of rows for all executions." 

During the compilation, it estimated that the query needs to retrieve 110 posts, but the run time indicates 139 posts read. Minor estimation discrepancies like this are typically acceptable.

The row estimate is called cardinality estimate.

When scan is better than seek:



The two queries aim to retrieve the Post and User info for all posts.

On the first execution, the SQL Server optimizer chooses to scan the Posts and Users tables and join them using a hash match operation.

Two reasons for me to justify that the execution plan is optimal:
- The cardinality estimates are accurate, suggesting SQL Server made informed choices.
- Since the entire table is being read, a scan is efficient.

(The numbers of logical reads are 800,856 on the Posts table and 7405 on the Users table.)

In contrast, for the second execution, I force the optimizer to use a seek operation on the Users table with a query hint. The plan thus loops through each Post one by one and seek its corresponding User record. 

The Posts table has more than 3,000,000 posts, thus requires 3,000,000 seeks on the Users table to get the users info. This results in 11 million logical reads, which is highly inefficient.

I forced seek with a QUERY HINT in this case. However, it can be the optimizer's choice to seek if the cardinality estimate is underestimated for some reason (e.g. instead of estimating 3,000,000 users, it estimates 3,000 users).


When index seek is an optimal choice



For the same query used above, I added a where clause to filter out the number of posts. Only ~503 posts are queried this time.

The optimizer decided that since the number of users to match is small (~503 users), it performs a seek on the Users table for each post returned (~503 seeks). The number of logical reads on the Users table is only 84.

To allow an efficient access path to the Posts table, we could also create an index on PostTypeID in the Posts table.

CREATE INDEX ix_posttypeid
  ON dbo.posts(posttypeid)
  include (id, title, lasteditdate, score, owneruserid) 


The optimizer now performed seeks on both tables. The number of logical reads on the Posts table is reduced from 800,856 to 1561.

When both seek and scan are decent choices:


The two queries query posts that are either questions or answers. 

The first execution performed two seeks to find the first post where PostTypeID=2 and the first post where PostTypeID=1

Once it locates the two segments: PostTypeID=2 and PostTypeID=1, it SCANS all rows until it read all the matching rows. 

Note: though there are more than 3,000,000 rows queried, we just need to perform the SEEK operation twice. Because the IX_PostTypeID index stores rows in the order of PostTypeID, the optimizer seeks to locate where PostTypeID = 1 and PostTypeID = 2 and then perform SCANS to get the rest of the matching rows. 

In the second execution that we FORCE SCAN; the number of logical reads remains similar (1st execution is 28287 reads, 2nd execution is 28266). 

The primary role of a seek is to narrow down the scanning range, making it particularly helpful when the searched window is small. In this example, since the searched window is 95% of the table, using seek DOESN'T improve the performance significantly.


Non-clustered index scan and clustered-index scan

For the same query above, I ask myself, now that scan is as good as seek, will adding the non-clustered index IX_PostTypeID still helpful?

When we create a non-clustered index, we must be aware that it will require maintenance (impacting storage and potentially slowing down INSERTUPDATE, and DELETE operations).

Thus, when adding new index, we should consider the loss and gain.

In this particular query, if I opt to use the clustered index instead, the number of logical reads is 800,856; compared to just 28,266 with a non-clustered index scan on IX_PostTypeID


The reason for a smaller number of logical reads during scanning a non-clustered index is that the non-clustered index is smaller than the clustered index. It doesn't contain all the table's columns.

Obviously, an improvement from 800,856 to 28,266 reads is significant. In any case, it is up to us to decide whether an improvement is necessary to create a new index, since it is important to balance between user experience objective and maintenance cost.

Cleanup

DROP INDEX IF EXISTS ix_posttypeid ON dbo.posts 

Comments

Popular posts from this blog

📐 SQL Server Performance: Troubleshooting Memory Grant Overestimation using SQL Diagnostic Manager and SQL Server execution plan

📐 SQL Server Performance: Solving Memory Grant Overestimation - the limitations of Memory Grant Feedback and the power of SQL Tuning