There are limits to how much power you can add to a server
The larger the single server is, the more expensive it will be, think of the power in a gaming laptop vs 10 Raspberry pi computers and the cost of each
Data consistency guarantees - harder at scale
Data consistency is much harder at horizontal scale
Data replication - making sure the data is the same in all shards
Data integrity - making sure the data is correct
How to break up the database
Breaking the database up over multiple nodes is called sharding
The sharding key is how we can figure out how to distribute the data
One issue to contend with is data skew, where the database isn't broken up evenly
It's like a group project, everyone needs to pull their own weight, but it's easy to have some people (nodes) do more than others
Because SQL relies so heavily on keys for relationships and to keep the data together, we need to be careful with them
If we need to make any changes, everything must be changed, which is harder on multiple servers
The sharding key is a field that we use to decide how the database is broken up
If the tables weren't sharded the same way we can have inconsistencies and relationships between the tables can get lost and we get orphaned data
To fix this you have to check the data keeps its data integrity, making sure that the foreign key relationships are valid, this extra check takes extra time
Changes like adding or removing data, or changing data need to be done across servers as well, and making sure each server is changed
Cardinality - This is how we can pick the max number of pieces, we want larger number of pieces, so we need high cardinality
Example: Use a BookID as the key to ensure there are lots of pieces possible
Frequency - This is how often the key value is in the data, if the value is used too often the shards won't be evenly distributed
Example: If you used LibraryID one shard might be overloaded if the library is massive so having multiple fields used as the key can help that such as using LibraryID and a BookID
Monotonically - Keys that increase at a predictable rate can affect node balancing
Example: If you check out books and use the date as the key, too many writes will be done on one shard and create a bottleneck
Query Patterns - A lot of queries are done routinely so it's worth thinking about if your shard key will cover your commonly used patterns
Example: If you need to check on which books are currently checked out of the library, that's a common query and may be helpful to consider as you shard the database
ACID Transactions
Atomicity - All or nothing
Consistency - rules are respected
Isolation - independent processing
Durability - transaction is permanent
One of the strengths of SQL and relational databases is the ACID transactions
These have to be maintained through all the shards because otherwise you lose one of the guarantees of a SQL database
Keeping to ACID gets more complicated the more nodes we have or nodes we want to add
Distributed queries while keeping ACID?
Having the data in multiple places means we have more challenges to overcome including communication and networking
Trying to make sure we keep ACID transactions across all the shards is hard, especially because we have to make sure all the data is changed and no hold outs are left
It's like trying to coordinate a room of people in one spot in one location vs spread across the world
Table Joins
Another fundamental thing SQL databases do are table joins in the queries
Once we add more nodes, the joins become harder to do
More machines have to talk to each other
More machines are working on the query so there can be upsides depending on the query (more ovens don't cook your bread faster)
The original join was a simple thing locally, but now it's now a network issue because it's a distributed query
Communication between shards can be slow depending on how many shards there are, and where they are located
Node Consistency
Each node or server should have equal work to do
They also need to have the same data state, all the data must be synchronized. Again, easier to do locally with vertical scaling, than shards
Making sure all the nodes have the correct data is difficult, you have to make sure the updates are done quickly but also reliably. You can't have old data in some places, new data in others
If a person updates their email it needs to be updated across the database cleanly
This is even tougher with real time consistency issues
Schema
Sharding the database can change the schema and that is harder on horizontally scaled databases
Coordinated changes get harder and harder the more nodes we have that we need to keep consistent
We have to worry about synchronization, data consistency, but also the downtime of the database, we need the database accessible and don't want it to go down at all
Successful scaling should also make sure the table relationships stay correct, and are checked
Query Optimization
In query optimization we talked a lot about indexes
When we have multiple servers, each query might have to hit multiple tables in multiple places which adds communication time since it's not local anymore
Changing how you do queries to take into account sharding schemes can help
Making sure the data is load balanced can help to reduce bottlenecks
Caches can also be used to help speed up queries
Breaking complex queries into simpler ones can also help reduce query time