Big Data

Learning outcomes:

Big data is usually identified by the "3 Vs of Big Data":
Volume - Quite literally how much data there is: lots of data from lots of sources
Velocity - How fast the data is generated; rapid, continuous production is another hallmark of "big data"
Variety - The different kinds of data. Big data spans many types, from structured to unstructured, such as IoT device metrics with locations, devices and readings
Veracity (how trustworthy the data is), variability (the meaning of the data can change) and value (is this actually useful to me) are also commonly talked about with big data

Most companies like making money, and traditionally the more customers you have the more you can make. Finding new customers, new products, new ventures and new ideas is one way data is being used
The more data you have, the better you can tailor what you have to the person. For example, if we recommend a big sportsball game to everyone in MA, we might spend a bunch on advertising but accidentally send them Yankees win highlights, and then there will be riots in the streets. Again.
Real-time analytics needs a lot of data continuously coming in to work. The more data you have, the better your analytics can work, as long as it's good data
For example, an AI/ML model that can find breast cancer was originally designed to identify whether a photo was a croissant or a bear claw

Anything using Artificial Intelligence and Machine Learning models is using big data
Search engines use big data to give you the best results
Any large streaming platform uses big data to both hold the videos and see analytics so they can do things like make suggestions of what to watch next, produce next or buy next
Online advertisements use a surprising amount of data. Your phone isn't recording you, but it is doing things like seeing which other phones you spend time around, their histories and current interests, and the places you go

Example use cases include streaming media and suggestions for what to watch, investment houses watching the market to decide where to invest money, and healthcare using past patient data to improve current patient outcomes

Data lakehouses are a newer concept: a cross between a data lake and a data warehouse. You can analyze unstructured data because the lakehouse automatically structures it. This involves more setup, and not everyone wants their structured and unstructured data mixing
Commonly used by companies to pull all the data from disparate groups into one place

Commonly used for business analytics and data analysts
Data warehouses tend to hold a lot of historical data, so they can be used for data mining, data visualizations and other types of reports
What is the Purpose of a Data Warehouse

Data marts are data warehouses for specific use cases and teams; think of a smaller, more focused warehouse. Boutique shopping instead of a big box store
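To make the warehouse versus mart idea a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, team names and numbers are all made up for illustration, and a real data mart is often its own database rather than a simple view, so treat this as a toy model only.

```python
import sqlite3

def build_warehouse():
    """Create an in-memory 'warehouse' holding sales from several teams (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE warehouse_sales (team TEXT, product TEXT, amount REAL)")
    rows = [
        ("marketing", "ad campaign", 1200.0),
        ("retail", "widget", 19.99),
        ("retail", "gadget", 49.50),
        ("wholesale", "widget", 1500.0),
    ]
    conn.executemany("INSERT INTO warehouse_sales VALUES (?, ?, ?)", rows)
    return conn

def create_retail_mart(conn):
    """Expose only the retail team's slice of the warehouse (hypothetical mart, modeled as a view)."""
    conn.execute(
        "CREATE VIEW retail_mart AS "
        "SELECT product, amount FROM warehouse_sales WHERE team = 'retail'"
    )

conn = build_warehouse()
create_retail_mart(conn)

# The retail team queries its small, focused mart instead of the whole warehouse
retail_rows = conn.execute(
    "SELECT product, amount FROM retail_mart ORDER BY product"
).fetchall()
print(retail_rows)
```

The retail team only ever sees its own rows, which is the "boutique" part of the analogy.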
Very commonly used, but starting to fall out of corporate fashion because so many people are moving to NoSQL

This is a popular thing right now, with technologies like ChatGPT and other Large Language Models (LLMs)
AI can be any case of computers doing things that people can do but that require intelligence from the person
Some common examples are things like facial recognition or other picture recognition, answering questions or even driving cars

ML is a subset of AI in which machines are meant to imitate human behaviors
It can be seen as computers learning without being explicitly programmed; instead, they use training data and computer models to learn behaviors

Models are trained with labeled data sets
For example, if you're looking for pictures of cute animals, you would need a large data set of pictures of cute animals, labeled by humans. Once the computer has seen enough of these, it should be able to find cute animals on its own
Most popular option right now
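The labeled-data idea above can be sketched in a few lines of plain Python. This is a toy nearest-centroid classifier, not how real image models work. The two features (a "fluffiness" score and an "eye size" score) and all the numbers are assumptions made up for the example.

```python
from collections import defaultdict
import math

def train(labeled_examples):
    """Average the feature vectors for each label, giving one centroid per label."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for (fluffiness, eye_size), label in labeled_examples:
        sums[label][0] += fluffiness
        sums[label][1] += eye_size
        counts[label] += 1
    return {
        label: (s[0] / counts[label], s[1] / counts[label])
        for label, s in sums.items()
    }

def predict(centroids, features):
    """Return the label whose centroid is closest to the new, unlabeled example."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))

# Hypothetical human-labeled training set: (fluffiness, eye_size) scores from 0 to 10
training_data = [
    ((9.0, 8.0), "cute"),
    ((8.0, 9.0), "cute"),
    ((7.0, 7.0), "cute"),
    ((2.0, 1.0), "not cute"),
    ((1.0, 3.0), "not cute"),
    ((3.0, 2.0), "not cute"),
]

model = train(training_data)
print(predict(model, (8.5, 7.5)))
print(predict(model, (1.5, 2.5)))
```

The humans did the labeling up front; after training, the model labels new examples on its own.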

This uses unlabeled data, where the computer is looking for patterns and trends
So instead of looking for cute animals, you'd have the computer look for patterns you didn't expect, such as all the photos of cute animals sharing a specific size or eye shape
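Here is a rough sketch of finding a pattern in unlabeled data, using a tiny two-group k-means loop in plain Python. The photo widths are invented, and real clustering would normally use a library and many more features, so this only shows the general shape of the idea.

```python
def kmeans_1d(values, iterations=10):
    """Split numbers into two groups with a tiny k-means (k=2) loop."""
    centers = [min(values), max(values)]  # deterministic starting guesses
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # assign each value to whichever center is nearer
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # move each center to the mean of its group (keep the old center if the group is empty)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical photo widths in pixels, with no labels attached
photo_widths = [100, 110, 105, 800, 790, 810]

centers, groups = kmeans_1d(photo_widths)
print(centers)
print(groups)
```

Nobody told the code there were "thumbnails" and "full-size photos"; it found the two size groups on its own.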

Training through trial and error
This is how some people will do self-driving cars or have the computer play games
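Trial and error can be sketched with a very small "two choices" game, sometimes called an epsilon-greedy bandit. This is far simpler than anything used for self-driving cars or game playing, and the reward chances, step count and seed below are assumptions picked for the example.

```python
import random

def learn_by_trial_and_error(reward_chances, steps=2000, epsilon=0.1, seed=0):
    """Estimate each action's average reward by repeatedly trying actions."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    estimates = [0.0] * len(reward_chances)
    pulls = [0] * len(reward_chances)
    for _ in range(steps):
        if rng.random() < epsilon:
            # explore: occasionally try a random action
            action = rng.randrange(len(reward_chances))
        else:
            # exploit: pick the action that has looked best so far
            action = max(range(len(estimates)), key=lambda a: estimates[a])
        # the environment hands back a reward of 1 or 0 (hidden from the learner)
        reward = 1.0 if rng.random() < reward_chances[action] else 0.0
        pulls[action] += 1
        # running average of the rewards seen for this action
        estimates[action] += (reward - estimates[action]) / pulls[action]
    return estimates, pulls

# Hypothetical game: action 0 pays off 20% of the time, action 1 pays off 80% of the time
estimates, pulls = learn_by_trial_and_error([0.2, 0.8])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_action, estimates, pulls)
```

The learner is never told which action is better. It only gets rewards back and gradually favors the action that has paid off more often.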

Because it's seen as less error prone, people ignore the inherent bias in AI
Not able to make exceptions or choices that were not explicitly programmed
Not creative, can reformat other people's ideas, but doesn't come up with its own
Doesn't learn from experience unless you're actively training it
Ethical issues with implementations, bias, data persistence, and ownership of original ideas (AI art is a good example of that)

AI can be used for some forms of analysis. Examples include finding patterns or relationships that aren't obvious to humans, such as hidden trends or potential correlations in the data
AI databases can store data as mathematical vectors instead of in traditional storage formats

Mathematical vectors are a way to represent the data abstractly; these vectors can be generated by ML
AI databases can scale vertically or horizontally
AI databases can also support natural language processing
Can be a SQL or NoSQL style database
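As a rough illustration of storing data as vectors, the sketch below keeps a few items in a plain Python dictionary and finds the one most similar to a query using cosine similarity. The item names and the three-number vectors are invented. In practice the vectors would come from an ML model and the lookup would be handled by the database itself.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two vectors by the angle between them (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(store, query):
    """Return the name of the stored item whose vector is most similar to the query."""
    return max(store, key=lambda name: cosine_similarity(store[name], query))

# Hypothetical vector store: each item is represented by a made-up 3-number vector
vector_store = {
    "kitten photo": [0.9, 0.1, 0.0],
    "puppy photo": [0.8, 0.2, 0.1],
    "tax form": [0.0, 0.1, 0.9],
}

print(nearest(vector_store, [1.0, 0.0, 0.0]))
print(nearest(vector_store, [0.0, 0.0, 1.0]))
```

Items with similar meaning end up with similar vectors, so "find things like this" becomes a distance calculation instead of an exact match.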

AI databases can also have predictive capabilities, so they can apply ML to try to predict future trends
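One very simple way to picture "predicting a future trend" is fitting a straight line to past values and extending it one step. The sketch below does that with ordinary least squares in plain Python. The monthly signup numbers are made up, and real predictive features would use much richer models than a straight line.

```python
def fit_line(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), and so on."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_next(ys):
    """Extend the fitted line one step past the last known value."""
    slope, intercept = fit_line(ys)
    return slope * len(ys) + intercept

# Hypothetical history: signups per month for four months
monthly_signups = [100, 120, 140, 160]

print(predict_next(monthly_signups))
```

With a perfectly steady history like this one the prediction just continues the pattern; messier data would give a best-fit guess instead.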

Suggested Activities and Discussion Topics:
