NoSQL Scaling

Learning outcomes:

  • Describe how to scale using NoSQL
  • List challenges created by using NoSQL


  • SQL vs NoSQL scaling
    • SQL
      • Vertical scaling is easier; horizontal scaling takes a lot of extra work
      • Relational
      • Strict Schema
    • NoSQL
      • Horizontal scaling is easier to set up
      • Non-relational, so there aren't relationships to keep track of
      • Flexible schemas, and sometimes no schema at all
      • Designed to be used at scale
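As a rough illustration of the flexible-schema point above, the sketch below uses plain Python dictionaries to stand in for documents in a document store. The records, field names, and helper functions are hypothetical and not tied to any particular database.

```python
# Hypothetical documents in one collection: no shared schema is enforced.
users = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "Grace", "phones": ["555-0100"]},  # no email, extra field
    {"id": 3, "full_name": "Linus T."},                  # different key for the name
]


def display_name(doc):
    """Return a name for a document, tolerating the variations above."""
    return doc.get("name") or doc.get("full_name") or "<unknown>"


def emails(docs):
    """Collect email addresses, skipping documents that have none."""
    return [d["email"] for d in docs if "email" in d]


names = [display_name(u) for u in users]
print(names)
print(emails(users))
```

Storing the three differently shaped records takes no setup, but every reader has to know about the variations, which is the trade-off the challenges section comes back to.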
  • NoSQL architecture
    • Doesn't have to be ACID-compliant; there are many options for how transactions can work
    • High availability is assumed as part of the design
    • Horizontal scalability instead of vertical scalability
    • Dynamic provisioning of servers, so you have what you need and no more, and it's easy to add power when needed
    • Distributed data storage, with support for many different data types
    • Hybrid and multi-model databases are becoming common
      • This lets you combine multiple NoSQL types, such as a document store that also supports key-value access
  • NoSQL scaling challenges
    • Data isn't guaranteed to be consistent immediately; it tends to be eventually consistent instead. Corrupt data can also be an issue
    • Because there is no schema, storing the data isn't the challenge; getting anything useful out of it is
    • New applications and integrations can be hard to build if the data isn't organized
      • The burden of working out how the data is organized and what you can get out of it falls to the application instead of the database, so the cost is moved, not removed
    • Once companies are big enough, they need more from their data, employ data analysts, and care more about things like downtime
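To make the eventual-consistency bullet above concrete, here is a minimal sketch of two replicas that accept writes independently and converge later. It assumes a last-write-wins rule with caller-supplied timestamps, which is only one of several conflict-resolution strategies; the `Replica` class and `sync` function are illustrative names, not a real database API.

```python
class Replica:
    """A toy replica storing key -> (timestamp, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, ts):
        # Last-write-wins: keep the value with the newest timestamp.
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """Anti-entropy pass: each replica receives the other's entries."""
    for key, (ts, value) in list(a.data.items()):
        b.write(key, value, ts)
    for key, (ts, value) in list(b.data.items()):
        a.write(key, value, ts)


r1, r2 = Replica("r1"), Replica("r2")
r1.write("cart", ["book"], ts=1)
sync(r1, r2)
r1.write("cart", ["book", "pen"], ts=2)  # only r1 has seen this write so far

stale = r2.read("cart")  # r2 still returns the older value
sync(r1, r2)
fresh = r2.read("cart")  # after syncing, both replicas agree
print(stale, fresh)
```

Between the second write and the second `sync`, a client reading from `r2` sees old data. That window is what "eventual" means in practice.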
  • CAP Theorem
    • NoSQL databases are scaled horizontally, so they are distributed systems
    • The CAP theorem says that of Consistency, Availability, and Partition tolerance, a distributed database can guarantee at most two at once
    • Consistency means every read returns the most recent write or an error
    • Availability means every request to a non-failing node gets a response
    • Partition tolerance means the system keeps working even when messages between nodes are lost or delayed
      • Because network communication is considered lossy, giving up partition tolerance isn't a realistic option, so during a partition you have to give up either consistency or availability
      • An unscaled SQL database runs on a single node, so there is no inter-node communication to lose
    • "A Critique of the CAP Theorem"
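The choice between consistency and availability during a partition can be sketched with a toy two-node cluster. Everything here is an assumption made for illustration: the `Cluster` and `Node` classes, the `"CP"` and `"AP"` mode labels, and the simplification that a partition cuts the two nodes off from each other completely.

```python
class Node:
    def __init__(self):
        self.value = "v1"


class Cluster:
    """Two-node toy cluster; mode is 'CP' or 'AP' (labels for this sketch only)."""

    def __init__(self, mode):
        self.mode = mode
        self.nodes = [Node(), Node()]
        self.partitioned = False

    def write(self, node_index, value):
        if self.partitioned:
            if self.mode == "CP":
                # The other replica is unreachable, so refuse rather than diverge.
                raise RuntimeError("unavailable: cannot replicate during partition")
            self.nodes[node_index].value = value  # AP: accept the write locally
        else:
            for node in self.nodes:  # healthy network: replicate everywhere
                node.value = value

    def read(self, node_index):
        return self.nodes[node_index].value


cp = Cluster("CP")
cp.partitioned = True
try:
    cp.write(0, "v2")
    cp_result = "accepted"
except RuntimeError:
    cp_result = "rejected"

ap = Cluster("AP")
ap.partitioned = True
ap.write(0, "v2")
ap_reads = (ap.read(0), ap.read(1))
print(cp_result, ap_reads)
```

The CP cluster stays consistent by turning the write away, while the AP cluster keeps answering but lets its two nodes disagree until the partition heals.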
  • Sharding
    • NoSQL databases also use sharding to scale out
    • Breaking the database into pieces, or shards, is important for both SQL and NoSQL
    • NoSQL sharding tends to happen in the background, rather than being something you have to manage yourself as with SQL
    • NoSQL databases were designed to scale using sharding, so it's more inherent in how they are used
      • Shards are equal, so less load balancing is needed
      • Communication between shards is still important; a gossip protocol is often used
      • Data can be copied to other shards equally since the shards are equal
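One simple way to picture how keys get spread across equal shards is hash-based routing. The sketch below is an assumption-laden illustration rather than how any specific database does it: the shard names, the replica count of 2, and the "next shard in ring order" replica rule are all made up for the example.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # hypothetical shard names
REPLICAS = 2  # assumed number of copies kept of each key


def shard_index(key, n=len(SHARDS)):
    """Map a key to a shard number using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n


def placement(key):
    """Primary shard plus the next REPLICAS - 1 shards in ring order."""
    start = shard_index(key)
    return [SHARDS[(start + i) % len(SHARDS)] for i in range(REPLICAS)]


# Count how many of 10,000 sample keys land on each primary shard.
counts = {s: 0 for s in SHARDS}
for i in range(10_000):
    counts[placement(f"user:{i}")[0]] += 1
print(counts)
print(placement("user:42"))
```

Because the hash spreads keys roughly evenly, no shard is special and each carries a similar share of the load. Plain modulo routing like this moves most keys when the number of shards changes, which is why production systems tend to use consistent hashing or a similar scheme instead.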
  • Cloud options and outsourcing
    • Database-as-a-Service (DBaaS)
      • This is a third party you can buy your database from
      • Popular with teams that don't have in-house DB talent, and with small companies that don't want to invest yet
    • Hosted
      • A Virtual Machine (VM) or virtual image can be stored on cloud servers and used easily
      • Several companies make their money hosting VMs for people
  • Database-as-a-Service
    • A turnkey option: a third party takes care of your database needs
    • NoSQL vendors can get pricey fast. Consultants for the vendor can be a hidden cost, and they can cost several thousand a day
    • Because everything is done by the third party, there is little you have to manage yourself
    • Vendor lock-in is also an issue; switching vendors can be a very big problem
  • Hosted VM
    • Hosting on cloud providers like AWS, Azure, and Google Cloud is one of the most popular solutions for companies of all sizes because it can be a good balance of cost and required skill compared to DBaaS
    • Using a hosted VM means you need someone in-house to run the database and be on call for it
    • This lets you avoid the infrastructure and IT costs associated with running your own servers
    • Database maintenance is your responsibility, and this includes everything from general backups to data security
    • A middle ground between a hosted VM and DBaaS is a database that the cloud company hosts for you
      • This won't have 24/7 on-call service like a DBaaS might
      • Cheaper than a DBaaS

Suggested Activities and Discussion Topics:
