SQL scaling

Learning outcomes:

Scaling is the ability to have your software able to work with large amounts of data gracefully
Scaling is a huge issue in industry
Vertical vs horizontal scaling

Keeping the database/software responsive, you have to keep the database usable at scale
Being as efficient as possible, you don't want to do more then you need, and because everything is so much more, even small delays can be problematic
Reliability, databases that go down or slow down frequently do not make people happy
Vocab note: Shards and servers tend to be used interchangeably

There are limits to how much power you can add to a server
The larger the single server is, the more expensive it will be, think of the power in a gaming laptop vs 10 Raspberry pi computers and the cost of each

Breaking the database up over multiple nodes is called sharding
The sharding key is how we can figure out how to distribute the data
One issue to contend with is data skew, where the database isn't broken up evenly
It's like a group project, everyone needs to pull their own weight, but it's easy to have some people (nodes) do more than others

Because SQL relies so heavily on keys for relationships and to keep the data together, we need to be careful with them
If we need to make any changes, everything must be changed, which is harder on multiple servers
The sharding key is a field that we use to decide how the database is broken up
If the tables weren't sharded the same way we can have inconsistencies and relationships between the tables can get lost and we get orphaned data

To fix this you have to check the data keeps its data integrity, making sure that the foreign key relationships are valid, this extra check takes extra time
Changes like adding or removing data, or changing data need to be done across servers as well, and making sure each server is changed

Cardinality - This is how we can pick the max number of pieces, we want larger number of pieces, so we need high cardinality

Frequency - This is how often the key value is in the data, if the value is used too often the shards won't be evenly distributed

Example: If you used LibraryID one shard might be overloaded if the library is massive so having multiple fields used as the key can help that such as using LibraryID and a BookID

Monotonically - Keys that increase at a predictable rate can affect node balancing

Example: If you check out books and use the date as the key, too many writes will be done on one shard and create a bottleneck

Query Patterns - A lot of queries are done routinely so it's worth thinking about if your shard key will cover your commonly used patterns

Example: If you need to check on which books are currently checked out of the library, that's a common query and may be helpful to consider as you shard the database

These have to be maintained through all the shards because otherwise you lose one of the guarantees of a SQL database
Keeping to ACID gets more complicated the more nodes we have or nodes we want to add

Having the data in multiple places means we have more challenges to overcome including communication and networking
Trying to make sure we keep ACID transactions across all the shards is hard, especially because we have to make sure all the data is changed and no hold outs are left
It's like trying to coordinate a room of people in one spot in one location vs spread across the world

More machines have to talk to each other
More machines are working on the query so there can be upsides depending on the query (more ovens don't cook your bread faster)
The original join was a simple thing locally, but now it's now a network issue because it's a distributed query
Communication between shards can be slow depending on how many shards there are, and where they are located

Each node or server should have equal work to do
They also need to have the same data state, all the data must be synchronized. Again, easier to do locally with vertical scaling, than shards
Making sure all the nodes have the correct data is difficult, you have to make sure the updates are done quickly but also reliably. You can't have old data in some places, new data in others

If a person updates their email it needs to be updated across the database cleanly
This is even tougher with real time consistency issues

Sharding the database can change the schema and that is harder on horizontally scaled databases
Coordinated changes get harder and harder the more nodes we have that we need to keep consistent
We have to worry about synchronization, data consistency, but also the downtime of the database, we need the database accessible and don't want it to go down at all
Successful scaling should also make sure the table relationships stay correct, and are checked

In query optimization we talked a lot about indexes
When we have multiple servers, each query might have to hit multiple tables in multiple places which adds communication time since it's not local anymore
Changing how you do queries to take into account sharding schemes can help
Making sure the data is load balanced can help to reduce bottlenecks
Caches can also be used to help speed up queries
Breaking complex queries into simpler ones can also help reduce query time

Suggested Activities and Discussion Topics:

Would you like to see some more classes? Click here