Big Data

Learning outcomes:

  • Describe some uses of AI/ML for databases
  • List some pros/cons for using AI/ML in databases

  • What is considered big data?
    • Big data is usually identified by the "3 Vs of Big Data"
    • Volume - Very literally how much data, lots of data from lots of sources
    • Velocity - How fast the data is generated and has to be handled, which is another hallmark of "big data"
    • Variety - The different kinds of data. Big data usually means a lot of different types of data, from structured to unstructured, such as IoT device metrics with locations, devices and readings
    • Veracity (how trustworthy the data is), variability (the meaning of the data can change) and value (is this actually useful to me) are also commonly talked about with big data
  • Why we might be interested in more data
    • Most companies like making money, and traditionally the more customers you have the more you can make. Finding new customers, new products, new ventures and new ideas is one way data is being used
    • The more data you have, the better you can tailor what you offer to the person. For example, if we recommend a big sportsball game to everyone in MA without knowing their tastes, we might spend a bunch on advertising but accidentally send them Yankees wins highlights, and then there will be riots in the streets. Again.
    • Real-time analytics needs a lot of data continuously coming in to work. The more data you have, the better your analytics can work, as long as it's good data
    • For example, an AI/ML model that can help find breast cancer was originally designed to identify whether a photo showed a croissant or a bear claw
  • Examples of Big Data right now
    • Most things using Artificial Intelligence and Machine Learning models rely on big data, since the models are trained on large data sets
    • Search engines use big data to give you the best results
    • Any large streaming platform uses big data to both hold the videos and see analytics so they can do things like make suggestions of what to watch next, produce next or buy next
    • Online advertisements use a surprising amount of data. Your phone isn't recording you, but it is doing things like seeing what other phones you spend time around, their histories and current interests, and the places you go
  • Data lakes
    • Data lakes - any and all data welcome
      • Example use cases include streaming media and suggestions for what to watch, investment houses watching the market to decide where to invest money, healthcare using past patient data to improve current patient outcomes
    • Data lakehouses are a newer concept, a cross between a data lake and a data warehouse. You can analyze unstructured data because the lakehouse adds structure on top of it. This involves more setup, and not everyone wants their structured and unstructured data mixing
    • Commonly used by companies to pull all the data from disparate groups into one place
  • Data warehouse
    • Data warehouses - tend to welcome relational data only
      • Commonly used for business analytics and data analysts
      • Data warehouses tend to hold a lot of historical data, so they can be used for data mining, data visualizations and other types of reports
    • Data marts are data warehouses for specific use cases and teams. Think of a smaller, more focused warehouse: boutique shopping instead of a big box store
    • Very commonly used, but starting to fall out of corporate fashion because so many people are moving to NoSQL
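To make the lake vs. warehouse difference a little more concrete, here is a minimal sketch in Python. It is only an illustration under some loud assumptions: the "lake" is just a list of raw records, the "warehouse" is an in-memory SQLite table from the standard library, and the table name, field names and sample records are all made up for this example rather than taken from any real product.

```python
import json
import sqlite3

# Hypothetical raw records: a "lake" accepts any shape of data as-is.
lake = [
    json.dumps({"device": "thermo-1", "reading": 21.5, "location": "Boston"}),
    json.dumps({"user": "sam", "watched": "cooking show"}),
    "free-form text log line: sensor rebooted",
]

# A "warehouse" expects rows that fit a predefined relational schema.
REQUIRED = ("device", "reading", "location")
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE readings (device TEXT NOT NULL, reading REAL NOT NULL, location TEXT NOT NULL)"
)

def load_into_warehouse(raw):
    """Insert a raw lake record only if it fits the readings schema."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return False  # not even structured data
    if not isinstance(record, dict) or any(key not in record for key in REQUIRED):
        return False  # structured, but the wrong shape for this table
    warehouse.execute(
        "INSERT INTO readings VALUES (?, ?, ?)",
        tuple(record[key] for key in REQUIRED),
    )
    return True

loaded = [load_into_warehouse(raw) for raw in lake]
row_count = warehouse.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(loaded, row_count)
```

All three records sit happily in the lake, but only the one that matches the schema makes it into the warehouse table. Real systems do this at a much larger scale and with far more tooling, but the trade-off is the same.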
  • What are AI and ML
    • Artificial Intelligence (AI)
      • This is a popular thing right now, with technologies like ChatGPT and other Large Language Models (LLMs)
      • AI can be anything where computers do things that would normally require human intelligence
      • Some common examples are things like facial recognition or other picture recognition, answering questions or even driving cars
    • Machine Learning (ML)
      • ML is a subset of AI where machines learn patterns from data and use them to imitate human behaviors
      • It can be seen as computers learning without being explicitly programmed. Instead, a model is trained on data and uses what it learned
  • Machine Learning categories
    • Supervised
      • Models are trained with labeled data sets
      • For example, if you're looking for pictures of cute animals, you would need a large data set of pictures, labeled by humans, that are of cute animals. Once the computer has seen enough of these it should be able to find cute animals on its own
      • Most popular option right now
    • Unsupervised
      • Models are trained with unlabeled data, and the computer looks for patterns and trends on its own
      • So instead of looking for cute animals, you'd have the computer look for patterns you didn't expect, like all photos of cute animals being a specific size, or sharing an eye shape or something
    • Reinforcement
      • Training through trial and error
      • This is how some people train self-driving cars or teach a computer to play games
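Here is a small, hedged sketch of the first two categories in plain Python. The "fluffiness" and "eye size" scores are invented numbers standing in for whatever a real system would pull out of a photo, and the nearest-centroid classifier and tiny k-means below are just simple stand-ins chosen for readability, not what a production model would use. Reinforcement learning is left out to keep the example short.

```python
from statistics import mean

# Made-up features: (fluffiness, eye_size) scores for animal photos.
labeled = [
    ((0.9, 0.8), "cute"), ((0.8, 0.9), "cute"),
    ((0.2, 0.1), "not cute"), ((0.1, 0.3), "not cute"),
]

def centroid(points):
    """Average each coordinate across a list of points."""
    return tuple(mean(axis) for axis in zip(*points))

def distance_sq(a, b):
    """Squared straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Supervised: learn one centroid per human-provided label.
def train_supervised(examples):
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(points) for label, points in by_label.items()}

def predict(model, features):
    """Pick the label whose centroid is closest to the new photo."""
    return min(model, key=lambda label: distance_sq(model[label], features))

model = train_supervised(labeled)
prediction = predict(model, (0.85, 0.7))

# Unsupervised: no labels; k-means groups the points by similarity alone.
def k_means(points, centers, rounds=10):
    groups = [[] for _ in centers]
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)),
                          key=lambda i: distance_sq(centers[i], point))
            groups[nearest].append(point)
        # Keep the old center if a group ends up empty.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return groups

unlabeled = [features for features, _ in labeled]
clusters = k_means(unlabeled, centers=[unlabeled[0], unlabeled[-1]])
print(prediction)
print(clusters)
```

The supervised half needs the human labels to work at all. The unsupervised half never sees them, and just reports that the photos fall into two groups. It is still up to a person to decide what those groups mean.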
  • Pros and Cons of AI/ML
    • Pros
      • Can be used to make choices faster
      • Some people see it as less error prone (not true)
      • Always available
      • Can cost less than other options
      • Good for repetitive work
      • Can be used for real time analysis
    • Cons
      • Because it's seen as less error prone people ignore the inherent bias in AI
      • Not able to make exceptions or choices that were not explicitly programmed or covered by its training
      • Not creative, can reformat other people's ideas, but doesn't come up with its own
      • Doesn't learn from experience unless you're actively training it
      • Ethical issues with implementations, bias, data persistence, and ownership of original ideas (AI art is a good example of that)
  • How AI/ML can be used in databases
    • AI can be used for some forms of analysis. Examples include finding patterns or relationships that aren't obvious to humans, such as hidden trends or potential correlations in the data
    • AI databases can store data as mathematical vectors instead of traditional data storage ideas
      • Mathematical vectors (often called embeddings) are a way to represent the data abstractly as lists of numbers. The vectors can be generated by ML
      • AI databases can scale vertically or horizontally
      • AI databases can also support natural language processing
      • Can be SQL or NoSQL style database
    • AI databases can also have predictive capabilities, applying ML to try to predict future trends
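The idea of storing data as vectors can be sketched in a few lines of Python. This is a toy under stated assumptions: the three-number vectors are made up by hand (real ones come out of an ML model and are much longer), the `vectors` dictionary stands in for the database, and `nearest` is a hypothetical helper that does a brute-force search. Real vector databases use indexes to avoid comparing against every item.

```python
import math

# Hypothetical 3-dimensional vectors; real ones are generated by an ML model
# and typically have hundreds or thousands of dimensions.
vectors = {
    "golden retriever puppy": [0.9, 0.1, 0.0],
    "tabby kitten": [0.8, 0.2, 0.1],
    "quarterly sales report": [0.0, 0.1, 0.9],
}

def cosine_similarity(a, b):
    """Closer to 1.0 means the two vectors point in a more similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, store, k=2):
    """Return the names of the k stored items most similar to the query."""
    ranked = sorted(
        store,
        key=lambda name: cosine_similarity(query, store[name]),
        reverse=True,
    )
    return ranked[:k]

# A query vector that, in this made-up space, sits near the animal photos.
results = nearest([0.9, 0.1, 0.05], vectors)
print(results)
```

Instead of matching exact values like a traditional query would, the search returns whatever is "closest" to the query, which is what makes this style of storage a good fit for things like natural language search.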
  • Some examples of AI/ML in use right now
  • Some resources for learning more about AI/ML

Suggested Activities and Discussion Topics: