Big Data

Learning outcomes:

  • Define how the term “Big Data” is usually used
  • List ways that big data differs from databases

Would you like to download my PowerPoint to follow along?

  • What is Big Data
    • Big Data refers to extraordinarily large collections of data
    • This data can be structured, unstructured or semi-structured
    • These data stores or datasets are so large they can't be handled by traditional means, let alone used in any reasonable way
    • Big data is usually data that is not only extremely large, but also growing very quickly
    • Big Data is used commonly for things like predictive modeling, AI, Machine Learning and other popular topics
  • Why is Big Data Important
    • Big data means you can (technically) make better choices because you have more data to work with; with more data points, better patterns can be found
    • Real time data collection and analytics can mean you can move and make choices faster
    • Having more data and real time data can mean you can automate more of your business to reduce costs
    • The more data you have, the closer you can tailor your business to your customers
  • The Vs of Big Data
    • Big Data has characteristics referred to as the "3 Vs of big data", first described in 2001 by analyst Doug Laney at META Group (a firm later acquired by Gartner)
    • Volume, High volume of data (Much data)
    • Velocity, Speed of data being generated, tends to be real time (So fast)
    • Variety, Many sources and formats of data (Such variety)
  • Some examples of Big Data being used
    • Advertisements for products and marketing campaigns
    • Routing software such as Google Maps or Waze that uses GPS and real-time traffic data to help plan your routes
    • Fraud detection for banks and credit cards
    • Predicting weather patterns, natural disasters, climate change and early warning systems
  • Why Big Data needs different tools/languages
    • The tools must be able to handle large volumes of data
    • They should be able to handle real-time visualizations, ideally interactive ones
    • Large amounts of data storage are needed, which means places to store the data, backups, and people to take care of the data, the backups, and the security of the system
  • What tools do we use with Big Data
    • Apache Hadoop - Stores and processes data, very popular, open source, uses distributed computing to process the data
    • Apache Spark - unified analytics engine, very popular, open source, uses cluster computing to process the data, lots of inbuilt options for working with your data such as ML, SQL, and even APIs for other languages like Python, R, and Java
    • Splunk - Data analytics tool that can handle big data, can work with dashboards and data visualizations, also incorporates AI
    • Tableau - Data visualization tool, very popular in companies, can create lots of different types of charts and has drag-and-drop properties so it's easier to learn, seen commonly as a dashboard
    • This is not an exhaustive list, just some examples and popular options
  • How Big Data is different than databases
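    • NoSQL databases are often designed to spread data across many machines, so they are commonly said to handle bigger data pools than traditional relational databases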
    • Big data can be stored in places besides databases, such as a data lake (raw data) rather than a data warehouse (processed data) or data mart (a data warehouse for a specific purpose)
    • Data warehouses (or marts) can be databases because the data has been processed; they need a schema design so the data can be easily worked with
    • Data lakes can hold anything from web logs and social media posts to sensor data collected in real time from IoT (Internet of Things) devices around the world
    • What's the Difference Between a Data Warehouse, Data Lake, and Data Mart? by AWS
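
As a rough illustration of the lake versus warehouse distinction above, here is a minimal Python sketch. The event records, field names, and the `purchases` table are all made up for this example, and an in-memory SQLite database stands in for a real data warehouse; actual big data systems would use very different storage.

```python
import json
import sqlite3

# "Data lake" style: raw records kept as they arrived.
# Each record can have a different shape (hypothetical sample data).
raw_events = [
    '{"customer": "ana", "action": "purchase", "amount": "20.00"}',
    '{"customer": "ben", "action": "page_view"}',
    '{"customer": "ana", "action": "purchase", "amount": "5.50", "coupon": "X1"}',
]

# "Data warehouse" style: a fixed schema that only accepts processed data.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE purchases (customer TEXT NOT NULL, amount REAL NOT NULL)"
)

# Processing step: parse each raw record, keep only purchases,
# and convert the amount from text to a number before loading it.
for line in raw_events:
    event = json.loads(line)
    if event.get("action") == "purchase":
        warehouse.execute(
            "INSERT INTO purchases (customer, amount) VALUES (?, ?)",
            (event["customer"], float(event["amount"])),
        )

# Once the data fits a schema, questions become simple queries.
totals = warehouse.execute(
    "SELECT customer, SUM(amount) FROM purchases GROUP BY customer"
).fetchall()
print(totals)
```

The raw list keeps everything, including the page view and the extra `coupon` field, while the table only holds what the schema was designed for.
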
  • Database Scalability
    • Database scalability is how gracefully you can work with a small amount of data and grow it into a large amount of data, for example going from the data of 100 customers to 1,000 customers to 100,000 customers
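    • An example of this is the choice between a sequential ID and a GUID (globally unique identifier) as a key: a GUID can be created without checking a shared counter, which helps keep the database accessible when multiple people are manipulating the data at once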
    • At large scale, tradeoffs need to be made to ensure the database works correctly. Traditional relational databases can basically fall over if you try to use them for big data without making those tradeoffs
    • NoSQL databases are one way to make some of these tradeoffs, for example giving up fixed schemas or strict consistency in exchange for easier scaling across many machines
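
The ID vs GUID point above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the `SequentialIds` class and the two "servers" are hypothetical stand-ins for database nodes that each hand out their own keys, and `uuid.uuid4()` from the standard library is used as the GUID generator.

```python
import uuid


class SequentialIds:
    """Auto-increment style IDs: a simple counter that starts at 1."""

    def __init__(self):
        self._next = 1

    def new_id(self):
        value = self._next
        self._next += 1
        return value


def new_guid():
    """GUID style IDs: random 128-bit values, no shared counter needed."""
    return str(uuid.uuid4())


# Two independent writers (hypothetical servers), each with its own counter.
server_a = SequentialIds()
server_b = SequentialIds()

# Both counters start at 1, so the first IDs collide unless the
# servers coordinate through one shared counter.
sequential_clash = server_a.new_id() == server_b.new_id()

# GUIDs can be created on each server without any coordination;
# a collision is possible in theory but vanishingly unlikely.
guid_a = new_guid()
guid_b = new_guid()
guid_clash = guid_a == guid_b

print("sequential IDs clash:", sequential_clash)
print("GUIDs clash:", guid_clash)
```

The tradeoff is that GUIDs are larger and not ordered, which is one of the costs a design accepts in exchange for letting many writers work at once.
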
  • Some problems with Big Data
    • Lack of talent/skills - People with the experience and skills needed to work with big data tools well are in short supply, so they can be hard to find and very expensive to hire
    • Scalability - Infrastructure weaknesses and tech debt will hit FAST if you aren't careful. Big data comes in fast and needs to be worked with fast, and not all systems and networks can handle it
    • Quality - Not all data is good data. We can collect data we shouldn't, or data that isn't helpful, and what we do collect can be hard to organize
    • Security/Compliance - The more data you have, the bigger the target for hackers and the more valuable you are to bad actors
  • Where Big Data might be going next
    • Streaming data in real time: the data is processed as a stream instead of in batches, so you can get more real time information and analytics
    • Artificial Intelligence (AI) and Machine Learning (ML) for more automated decision making and responses
    • Democratization of data, with people having more access to their own data: the ability to remove their data, download their data, and see what is "known" about them
    • More no-code and low-code solutions. If AI and ML can be used to help out, it might be possible to have tools that more people can use more easily, with less background knowledge required
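
To make the stream versus batch idea from the list above more concrete, here is a minimal Python sketch. The sensor readings are made-up numbers and both function names are hypothetical; real streaming systems work on unbounded data arriving over a network, but the core difference is the same.

```python
def batch_average(readings):
    """Batch style: wait for all readings, then compute the answer once."""
    return sum(readings) / len(readings)


def streaming_average(readings):
    """Streaming style: give an up-to-date average after every reading."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        # Only the running count and total are kept, not the full history.
        yield total / count


# Made-up temperature readings arriving one at a time.
sensor_readings = [20.0, 22.0, 21.0, 25.0]

running = list(streaming_average(sensor_readings))
final_batch = batch_average(sensor_readings)

print("streaming averages:", running)
print("batch average:", final_batch)
```

The streaming version has a usable answer after the first reading, while the batch version has nothing to report until the whole collection is available. Both agree once all the data has been seen.
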

Suggested Activities and Discussion Topics:

  • In groups or pairs, discuss the following questions
    • Think of a specific scenario where a large volume of data is generated (such as online bank transactions, Wikipedia edits, sports stats, or sensor data). Estimate the scale of this data in terms of size (gigabytes or terabytes? Larger?). What challenges might arise when dealing with such massive amounts of data?
    • Can you think of any recent controversies or ethical dilemmas related to big data usage for the topic you picked above?
  • Share an article related to a topic of your choice that generates a lot of data, and explain why you find it interesting and why you picked it. Some example topics might be online games or some variety of sports or gaming statistics
  • Complete this PDF
