Query Optimization

Learning outcomes:

Each query has a time that it takes to run
Each query has a plan, this is the steps the database is using to figure out the results to give you for the query
Each query can have different amounts of time needed to run based on how it's put together
Some databases will come with a built in way to see the plan for your query, this will likely include things like operator attributes and types
Query execution plans in SQL

Query optimization is a way to make a query more efficient
Query optimization is really only useful for much more complex queries and large sets of data
When we look at performance, and simple queries, there isn't much to be done, they are what they are
You can change the queries slightly to improve time, but database tuning might not be enough
Some databases come with build in optimizers, however! do not trust them blindly! Make sure it actually works and does what it says on the tin

In the real world we have very large sets of data and very complex queries
These datasets are not 10,000 records, they can be millions of records and that will make every second or even tenth of a second count if we have to go through each record
Very complex queries have a lot of moving parts and can be put together in many ways to solve our problems
SQL query optimization — How to determine when and if it's needed

An index is a way to copy some columns of the data so that we can get the information faster instead of having to sort through the whole table
You can also think of it as a map to the table, or shortcuts to the data. Like a cheat code of a hidden level in Mario to get to the last world
Indexes usually have a way to refer back to the original data using a key or direct link
Think of a library index

If you look at your query plan and see a scan that goes over each record in sequence that might be a good place to have an index

Indexes can be created on any column in a database
Technically you can create indexes on multiple columns
There is also a concept of a partial index or filtered index, where you can include part of the rows as well

For example, a partial index could be used when you need to track a shipment that is in the warehouse, but not ordered or shipped or delivered so you can focus your queries more

Index creation time is dependent on the size of the table, this could be a few seconds or several
If your query is using the index, there is a direct relationship between how long it takes for the index to be created and how much time you can save using them
Indexes can take a lot of space
Moves a sequential search to a tree search

If you need a multi-column index, you should consider revising your schema instead
Multi-column indexes can increase the time it takes to add new rows to the database
If there isn't a lot of changes or writing happening to the database this might not be an issue
If we continue the library index card example, this would be the cards sorted by author, and once we got the author, it was then sorted by title

Different types of indexes may have both pros and cons
Different types of indexes will use different data structures internally
Mostly the databases will make this choice for you, you don't have to pick which one is used
Some examples of types include clustered index, hash index, bitmap index and filtered index
There are a lot more types, some are only available on some systems or in some databases (such as bitmap only available on Oracle)
What Are the Types of Indexes in a Relational Database?

Indexes will use data structures, the different databases will likely have different names for each data structure
Some popular ways to implement indexes are hashes and trees (self-balanced binary search trees and B+ trees specifically)
Most databases will create an index based on a primary key
Some databases might also create an index for foreign keys
There is a concept of a covering index, where what you need is actually in the index and you don't need to go find the record (The library card has the info, you don't need the book)
There are no standards for how indexes are created

The cost of your query and alternative plans for your query is a hard problem to solve
At this point in time, hardware isn't usually the issue, disk space is cheap, and adding more servers isn't always the answer

There can be clear metrics such as "users complained about load times less this month" or "Users complained A LOT about how slow their shopping was"
How long the query takes to run can be variable enough it may not be the best indicator of improvements
Consider other metrics to also look for such as CPU use, or I/O
Another metric you could consider is how understandable your query and plan is to future people

One thing to consider is the amount of time saved by what you're doing, is this a onetime cost to save a lot of time later, or is the a large cost upfront for a thing you don't do very often
Another thing to think about is "what is reasonable" Is this query actually possible to run in the requested time, or is this request not in the bounds of reality
Is the slowness of the query something outside of your control? Is there another option? Such as running reports off hours, or creating less reports because they are being sent to the cabinet of no return anyway?

Suggested Activities and Discussion Topics:

Would you like to see some more classes? Click here