SQL scaling

Learning outcomes:

  • Describe how to scale using SQL
  • List some SQL scaling limitations

Would you like to download my PowerPoint to follow along?

  • What is scaling
    • Scaling is the ability to have your software able to work with large amounts of data gracefully
    • Scaling is a huge issue in industry
    • Vertical vs horizontal scaling
      • SQL databases are usually scaled vertically
      • This is done by adding more power to the servers holding the database
      • There are inherent limits to this
    • If you can scale them horizontally, you have more options
    • Horizontal scaling can also be called "sharding" or "shards"
  • Important things to consider when scaling
    • Keeping the database/software responsive, you have to keep the database usable at scale
    • Being as efficient as possible, you don't want to do more then you need, and because everything is so much more, even small delays can be problematic
    • Reliability, databases that go down or slow down frequently do not make people happy
    • Vocab note: Shards and servers tend to be used interchangeably
  • Scaling Challenges
    • Vertical scaling challenges
      • There are limits to how much power you can add to a server
      • The larger the single server is, the more expensive it will be, think of the power in a gaming laptop vs 10 Raspberry pi computers and the cost of each
    • Data consistency guarantees - harder at scale
      • Data consistency is much harder at horizontal scale
      • Data replication - making sure the data is the same in all shards
      • Data integrity - making sure the data is correct
  • How to break up the database
    • Breaking the database up over multiple nodes is called sharding
    • The sharding key is how we can figure out how to distribute the data
    • One issue to contend with is data skew, where the database isn't broken up evenly
    • It's like a group project, everyone needs to pull their own weight, but it's easy to have some people (nodes) do more than others
  • Keys
    • Because SQL relies so heavily on keys for relationships and to keep the data together, we need to be careful with them
    • If we need to make any changes, everything must be changed, which is harder on multiple servers
    • The sharding key is a field that we use to decide how the database is broken up
    • If the tables weren't sharded the same way we can have inconsistencies and relationships between the tables can get lost and we get orphaned data
      • To fix this you have to check the data keeps its data integrity, making sure that the foreign key relationships are valid, this extra check takes extra time
      • Changes like adding or removing data, or changing data need to be done across servers as well, and making sure each server is changed
  • How to pick a shard key
    • Cardinality - This is how we can pick the max number of pieces, we want larger number of pieces, so we need high cardinality
      • Example: Use a BookID as the key to ensure there are lots of pieces possible
    • Frequency - This is how often the key value is in the data, if the value is used too often the shards won't be evenly distributed
      • Example: If you used LibraryID one shard might be overloaded if the library is massive so having multiple fields used as the key can help that such as using LibraryID and a BookID
    • Monotonically - Keys that increase at a predictable rate can affect node balancing
      • Example: If you check out books and use the date as the key, too many writes will be done on one shard and create a bottleneck
    • Query Patterns - A lot of queries are done routinely so it's worth thinking about if your shard key will cover your commonly used patterns
      • Example: If you need to check on which books are currently checked out of the library, that's a common query and may be helpful to consider as you shard the database
  • ACID Transactions
    • Atomicity - All or nothing
    • Consistency - rules are respected
    • Isolation - independent processing
    • Durability - transaction is permanent
    • One of the strengths of SQL and relational databases is the ACID transactions
      • These have to be maintained through all the shards because otherwise you lose one of the guarantees of a SQL database
      • Keeping to ACID gets more complicated the more nodes we have or nodes we want to add
  • Distributed queries while keeping ACID?
    • Having the data in multiple places means we have more challenges to overcome including communication and networking
    • Trying to make sure we keep ACID transactions across all the shards is hard, especially because we have to make sure all the data is changed and no hold outs are left
    • It's like trying to coordinate a room of people in one spot in one location vs spread across the world
  • Table Joins
    • Another fundamental thing SQL databases do are table joins in the queries
    • Once we add more nodes, the joins become harder to do
      • More machines have to talk to each other
      • More machines are working on the query so there can be upsides depending on the query (more ovens don't cook your bread faster)
      • The original join was a simple thing locally, but now it's now a network issue because it's a distributed query
      • Communication between shards can be slow depending on how many shards there are, and where they are located
  • Node Consistency
    • Each node or server should have equal work to do
    • They also need to have the same data state, all the data must be synchronized. Again, easier to do locally with vertical scaling, than shards
    • Making sure all the nodes have the correct data is difficult, you have to make sure the updates are done quickly but also reliably. You can't have old data in some places, new data in others
      • If a person updates their email it needs to be updated across the database cleanly
      • This is even tougher with real time consistency issues
  • Schema
    • Sharding the database can change the schema and that is harder on horizontally scaled databases
    • Coordinated changes get harder and harder the more nodes we have that we need to keep consistent
    • We have to worry about synchronization, data consistency, but also the downtime of the database, we need the database accessible and don't want it to go down at all
    • Successful scaling should also make sure the table relationships stay correct, and are checked
  • Query Optimization
    • In query optimization we talked a lot about indexes
    • When we have multiple servers, each query might have to hit multiple tables in multiple places which adds communication time since it's not local anymore
    • Changing how you do queries to take into account sharding schemes can help
    • Making sure the data is load balanced can help to reduce bottlenecks
    • Caches can also be used to help speed up queries
    • Breaking complex queries into simpler ones can also help reduce query time

Suggested Activities and Discussion Topics:

Would you like to see some more classes? Click here