Big data is usually identified by the "3 Vs of Big Data":
Volume - Very literally how much data there is: lots of data from lots of sources
Velocity - How fast the data is generated; how quickly it's being produced is another hallmark of "big data"
Variety - The different kinds of data. Big data usually means a lot of different types of data, from structured to unstructured, such as IoT device metrics with locations, devices and readings
Veracity (how trustworthy the data is), variability (the meaning of the data can change) and value (is this actually useful to me) are also commonly talked about with big data
Why we might be interested in more data
Most companies like making money, and traditionally the more customers you have the more you can make. Finding new customers, new products, new ventures and new ideas is one way data is being used
The more data you have the better you can tailor what you have to the person. For example, if we recommend a big sportsball game to everyone in MA, we might spend a bunch on advertising but accidentally send them Yankees win highlights, and then there will be riots in the streets. Again.
Real-time analytics needs a lot of data continuously coming in to work. The more data you have the better your analytics can work, as long as it's good data
Most things using Artificial Intelligence and Machine Learning models rely on big data, since the models are trained on very large data sets
Search engines use big data to give you the best results
Any large streaming platform uses big data both to hold the videos and to run analytics, so they can do things like make suggestions of what to watch next, produce next or buy next
Online advertisements use a surprising amount of data. Your phone isn't recording you, but it is doing things like seeing what other phones you spend time around, their histories and current interests, and the places you go
Example use cases include streaming media and suggestions for what to watch, investment houses watching the market to decide where to invest money, healthcare using past patient data to improve current patient outcomes
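To make the "suggestions for what to watch next" idea a bit more concrete, here is a minimal sketch. It assumes a very simple approach (counting which shows get watched together) and uses made-up show titles and viewing histories. Real platforms use far more data and far more sophisticated models, so treat this as an illustration of the idea, not how any particular service does it.

```python
from collections import Counter
from itertools import combinations

# Hypothetical viewing histories: each set is the shows one user watched.
histories = [
    {"Baking Show", "Space Drama", "Cat Videos"},
    {"Baking Show", "Cat Videos"},
    {"Space Drama", "Robot Documentary"},
    {"Baking Show", "Cat Videos", "Robot Documentary"},
]

def build_co_watch_counts(histories):
    """Count how often each pair of shows appears in the same history."""
    counts = {}
    for shows in histories:
        for a, b in combinations(sorted(shows), 2):
            counts.setdefault(a, Counter())[b] += 1
            counts.setdefault(b, Counter())[a] += 1
    return counts

def suggest_next(show, counts, how_many=2):
    """Return the shows most often watched alongside `show`.

    Ties are broken alphabetically so the result is predictable.
    """
    watched_with = counts.get(show, Counter())
    ranked = sorted(watched_with.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:how_many]]

co_watch = build_co_watch_counts(histories)
print(suggest_next("Baking Show", co_watch))
# → ['Cat Videos', 'Robot Documentary']
```

With only four histories the suggestions are shaky. More histories give the counts more to work with, which is the "more data means better tailoring" point from earlier.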
Data lakehouses are a newer concept, a cross between a data lake and a data warehouse. You can analyze unstructured data because the lakehouse adds structure (such as schemas and metadata) on top of it. This involves more setup, and not everyone wants their structured and unstructured data mixing
Commonly used by companies to pull all the data from disparate groups into one place
Data marts are data warehouses but for specific use cases and teams. Think of a smaller, more focused warehouse: boutique shopping instead of a big box store
Very commonly used, but starting to fall out of corporate fashion because so many people are moving to NoSQL
ML is a subset of AI, where AI is supposed to be machines that imitate human behaviors
It can be seen as computers learning without being explicitly programmed; instead they are trained with data and computer models and then use the learned behaviors
Machine Learning categories
Supervised
Models are trained with labeled data sets
For example, if you're looking for pictures of cute animals, you would need a large data set of pictures of cute animals, labeled by humans. Once the computer has seen enough of these it should be able to find cute animals on its own
Most popular option right now
Unsupervised
This uses unlabeled data, where the computer looks for patterns and trends
So instead of looking for cute animals, you'd have the computer look for patterns you didn't expect, like all photos of cute animals being a specific size, or sharing an eye shape or something
Reinforcement
Training through trial and error
This is how some people will do self-driving cars or have the computer play games
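The difference between the first two categories can be sketched in a few lines. This is a toy example under loud assumptions: each photo is boiled down to two made-up numbers, the supervised part uses a nearest-neighbor lookup, and the unsupervised part uses a simple two-group clustering loop. These techniques were picked for illustration and are not what the notes above prescribe. Reinforcement learning is left out here because it needs an environment to try things in.

```python
import math

# Hypothetical features for each photo: (eye size, fluffiness). Made-up numbers.
labeled_photos = [
    ((1.0, 1.0), "cute"),
    ((1.2, 0.8), "cute"),
    ((5.0, 5.0), "not cute"),
    ((5.5, 4.5), "not cute"),
]

def predict(photo, labeled_examples):
    """Supervised: copy the human-provided label of the closest labeled photo."""
    _, label = min(labeled_examples, key=lambda ex: math.dist(photo, ex[0]))
    return label

def two_groups(photos, rounds=10):
    """Unsupervised: split unlabeled photos into two groups by similarity.

    No labels are used. The starting centers are simply the first and
    last photo, an arbitrary choice made to keep the sketch deterministic.
    """
    centers = [photos[0], photos[-1]]
    groups = [[], []]
    for _ in range(rounds):
        groups = [[], []]
        for p in photos:
            nearest = min((0, 1), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the average of its group (keep it if the group is empty).
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return groups

print(predict((1.1, 0.9), labeled_photos))  # → cute

unlabeled = [features for features, _ in labeled_photos]
print(two_groups(unlabeled))
# → [[(1.0, 1.0), (1.2, 0.8)], [(5.0, 5.0), (5.5, 4.5)]]
```

Note that the unsupervised version finds two groups but has no idea which one is "cute". A human still has to look at the groups and decide what the pattern means.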
Because it's seen as less error-prone, people ignore the inherent bias in AI
Not able to make exceptions or choices that were not explicitly programmed
Not creative, can reformat other people's ideas, but doesn't come up with its own
Doesn't learn from experience unless you're actively training it
Ethical issues with implementations, bias, data persistence, and ownership of original ideas (AI art is a good example of that)
How AI/ML can be used in databases
AI can be used for some forms of analysis, for example finding patterns or relationships that aren't obvious to humans, such as hidden trends or potential correlations in the data
AI databases can store data as mathematical vectors instead of using traditional storage formats
Mathematical vectors are a way to represent the data abstractly as lists of numbers; the vectors can be generated by ML
AI databases can scale vertically or horizontally
AI databases can also support natural language processing
Can be a SQL or NoSQL style database
AI databases can also have predictive capabilities, so they can apply ML to try to predict future trends
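For the point about storing data as mathematical vectors, a small sketch may help. It assumes each item has already been turned into a short vector (here the numbers are invented by hand, whereas a real system would get them from an ML model) and uses cosine similarity, one common way to compare vectors, to find the stored items closest to a query. The store is just a dictionary and all names are hypothetical. A real AI database adds indexing, persistence and scaling on top of this idea.

```python
import math

def cosine_similarity(a, b):
    """Return how closely two vectors point in the same direction (1.0 = same)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical, hand-written vectors standing in for ML-generated ones.
vector_store = {
    "kitten photo": (0.9, 0.1, 0.0),
    "puppy photo": (0.8, 0.2, 0.1),
    "tax form": (0.0, 0.1, 0.9),
}

def most_similar(query, store, how_many=2):
    """Return the names of the stored items whose vectors best match the query."""
    ranked = sorted(
        store,
        key=lambda name: cosine_similarity(query, store[name]),
        reverse=True,
    )
    return ranked[:how_many]

print(most_similar((1.0, 0.0, 0.0), vector_store))
# → ['kitten photo', 'puppy photo']
```

The query never mentions "kitten" or "puppy". The match comes purely from the numbers being close, which is why this style of storage works well for searching by meaning rather than by exact keyword.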