Data Integrity and Uses

Learning outcomes:

  • List what data integrity entails
  • List some examples of what might happen with data integrity failure
  • List some approaches to fixing a database without sufficient integrity
  • List ways to ensure you don't lose data integrity in a database

Would you like to download my PowerPoint to follow along?

  • Data integrity
    • Data integrity can mean the accuracy of the data: making sure that the data you have stored is correct.
    • Integrity can also mean the completeness of the data: making sure that if you are storing information, you are storing all of the information.
    • Data integrity can also refer to the method of data storage: making sure that wherever you keep the data, such as a hard drive, is working properly.
    • Data integrity can also cover guidelines for data retention, or how long you keep data.
  • Examples of Data integrity
    • Complete Data
      • Tolkien instead of J. R. R. Tolkien
      • $1000.30 in your account vs $900ish saved somewhere
      • Healthcare record says there are allergies but doesn't list them
      • You were awarded some amount of financial aid
    • Accurate Data
      • My dude Tolks instead of Tolkien
      • You're pretty sure you have $1000 in your account vs you KNOW you have $1034.56 in your account
      • Healthcare record says there is an allergy to pomegranate but it meant penicillin
      • You were awarded $20,000 in tuition vs $2,000 in tuition
  • Incorrect data vs badly organized data
    • Incorrect Data
      • Asked for a birthday and were told the colour blue
      • Wrong allergy written down in healthcare record
      • Wrong transcript was sent to transfer school
      • Incorrect amount was put on your credit card for your coffee
    • Badly Organized Data
      • Asked for a birthday and were told "Dec" when the full date was 12/2/2000
      • Data that doesn't follow the format that was asked for, such as sometimes the first name comes first and sometimes the last name comes first
      • Odd organizational methods, such as organizing your library database by word count or book weight
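The birthday example above can be sketched in a few lines of Python. This is only an illustration, assuming the standard library's datetime module; the helper name to_iso is made up for this example. It shows that the text 12/2/2000 turns into two different dates depending on which convention the reader assumes, which is why an agreed format matters.

```python
from datetime import datetime

raw = "12/2/2000"

# The same text yields two different dates depending on the assumed convention.
month_first = datetime.strptime(raw, "%m/%d/%Y").date()  # month/day/year
day_first = datetime.strptime(raw, "%d/%m/%Y").date()    # day/month/year


def to_iso(text, fmt):
    """Hypothetical helper: parse with an explicitly agreed format
    and return the unambiguous ISO 8601 form (YYYY-MM-DD)."""
    return datetime.strptime(text, fmt).date().isoformat()


print(month_first, day_first)
print(to_iso(raw, "%m/%d/%Y"))
```

Storing the ISO form, rather than whatever was typed, is one way to keep everyone reading the same date.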
  • Examples of why data is more complex than you think
    • Names
      • Firstname Lastname
      • Lastname Firstname
      • Multiple Names For First And Last
      • Short names (Wu, Li, Fan)
      • Long Names
      • Hyphenated names
      • Names that are a symbol
      • Names with non-letter characters
      • Names with characters not in the English alphabet
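To see why names are tricky, here is a small Python sketch. It is an illustration only: the pattern, the function names, and the sample names are all assumptions made up for this example, not rules from any real system. A strict "two words of 3 to 20 English letters" check rejects most of the name types in the list above, while a lenient approach only tidies the text and accepts anything non-empty.

```python
import re
import unicodedata

# Hypothetical strict rule: exactly two words of 3-20 English letters each.
NAIVE_PATTERN = re.compile(r"[A-Za-z]{3,20} [A-Za-z]{3,20}")


def naive_is_valid(name):
    """Strict check that many real names fail."""
    return NAIVE_PATTERN.fullmatch(name) is not None


def clean_name(name):
    """Normalize Unicode to NFC and collapse extra whitespace; change nothing else."""
    normalized = unicodedata.normalize("NFC", name)
    return " ".join(normalized.split())


def lenient_is_valid(name):
    """Accept any name that still has content after tidying."""
    return len(clean_name(name)) > 0


samples = [
    "J. R. R. Tolkien",    # multiple names and punctuation
    "Li Wu",               # short names
    "Anne-Marie O'Neill",  # hyphen and apostrophe
    "José Núñez",          # characters outside A-Z
    "Mary Smith",
]

rejected = [name for name in samples if not naive_is_valid(name)]
print(rejected)
```

The lenient version trades strictness for inclusiveness, so other checks (such as a person confirming their own name) would still be needed.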
    • Examples of how to collect data well
      • Data collection methods in business
      • Make sure you have a clear plan that everyone who is collecting data can see, including what you need and how it's being measured, written in clear and hard-to-misinterpret language
      • Decide if you are collecting qualitative data (open-ended responses such as "I feel good today") or quantitative data (numbers such as the year you were born)
      • Have a clear system! Include procedures and tests for the people doing the collecting BEFORE sending them out
    • How and why to clean data
      • Data cleaning means fixing data: removing duplicate data, correcting incorrect data, and fixing formatting
      • Data cleaning usually removes the data that doesn't belong (Important note! Keep a copy of the ORIGINAL data, just in case)
      • Quality and validity matter: make sure your data makes sense
      • We clean data so that when we work with it later, we get better results.
        • For example, we might see both "N/A" and "non applicable", but they mean the same thing, so we could group them together
        • Cleaning can include removing outliers, but be warned! An outlier isn't the same as incorrect data, and your definition of "outlier" might not match someone else's. Don't ruin your data for a goal or personal opinion
    • Examples of how data is cleaned
      • We might clean data by only looking at specific age ranges, or locations, as part of the process to make sure we're looking at the right group of people
      • We might clean data by removing duplicates, such as making sure everyone only did the survey once
      • We might clean data by fixing typos, naming conventions, or even just making sure the format is the same for all collected data
      • We might clean data by checking validity: does this data make sense? For example, we asked for a name and got a date, or we asked for a birthday and got a colour
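The cleaning steps above can be sketched as a short Python function. This is a minimal sketch under stated assumptions: the survey fields (id, age, answer), the age range of 18 to 65, and the list of "not applicable" spellings are all invented for this example. It keeps an untouched copy of the original data, drops repeat respondents, checks that the age makes sense, limits the age range, and groups the different spellings of N/A together.

```python
import copy

# Assumed spellings that all mean "not applicable" in this made-up survey.
NOT_APPLICABLE = {"n/a", "na", "non applicable", "not applicable"}


def clean(records, min_age=18, max_age=65):
    """Return a cleaned copy of the records without changing the input."""
    cleaned = []
    seen_ids = set()
    for row in records:
        # Remove duplicates: keep only the first response per person.
        if row["id"] in seen_ids:
            continue
        seen_ids.add(row["id"])

        # Validity and range: age must be a whole number inside the study's range.
        age = row["age"]
        if not isinstance(age, int) or not (min_age <= age <= max_age):
            continue

        # Formatting: group the different spellings of N/A together.
        answer = row["answer"].strip()
        if answer.lower() in NOT_APPLICABLE:
            answer = "N/A"

        cleaned.append({"id": row["id"], "age": age, "answer": answer})
    return cleaned


raw = [
    {"id": 1, "age": 34, "answer": "N/A"},
    {"id": 2, "age": 29, "answer": "non applicable"},
    {"id": 2, "age": 29, "answer": "non applicable"},  # same person twice
    {"id": 3, "age": "blue", "answer": "Yes"},         # asked for age, got a colour
    {"id": 4, "age": 71, "answer": "No"},              # outside the chosen age range
]

original = copy.deepcopy(raw)  # keep a copy of the ORIGINAL data, just in case
result = clean(raw)
print(result)
```

Note that the 71-year-old is dropped because of the chosen range, not because the value is wrong, which is exactly the kind of decision worth writing down.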
    • How AI is used to clean data
      • Artificial Intelligence is being used to clean data more and more
      • We can use programs to clean data, either the program we used to collect the data, or our own program written to look for specific things.
      • New advances in technology are having AI do this for us automatically! Not always correctly, but automatically! Examples include AWS SageMaker and AI validator for Google Sheets
    • Why data is important to every job
      • Good data can lead to better choices; bad data isn't helpful
      • All businesses run on data to some extent, even if it's just sales and goods.
      • Some businesses rely on data more heavily, such as social media and other tech-focused companies
      • Data can help make business more efficient, improve quality, and either make more money or help more people depending on the type of company
    • Examples of how data is used in jobs well
      • Predicting future sales and stocking of goods
      • Predicting market trends so that the right amount of goods can be produced or ordered on time
      • Healthcare uses data to see failures in the system and fix them to help save more people
      • Data can be used to see trends and either continue or correct them
    • Examples of how data is used in jobs poorly
      • Amazon warehouse efficiency targets creating safety dangers, and drivers peeing in bottles to make time
      • Laying off people because the data says they don't make money, except those were the people making and checking the goods produced, so the company can't make money in 6 months
      • Mars Climate Orbiter was lost because one piece of software reported data in US customary units while the rest of the system expected metric
      • The 2008 housing crash involved bad data saying pieces of the market were worth more than they actually were, collapsing pieces of the economy when the true values were found out
      • Unity got bad data for its audience guessing tool and lost $110 million on a bad bet

    Suggested Activities and Discussion Topics:

    • In small groups or pairs, consider the following questions:
      • How can data-driven decisions lead to better outcomes in various fields, such as healthcare or education?
      • What are some potential risks and challenges associated with the misuse or mishandling of data?
      • How does data impact privacy, ethics, and security concerns?
      • How do Artificial Intelligence and Machine Learning affect the topic of your choice?
      • Include at least 1 article on the topic of data in modern life. In your discussion posting you should link your article, give a quick summary (5 sentences or fewer), and include your opinion on the quality of the article using the CRAAP method as described HERE
      • Activity: Download this PDF and the sample data from here and follow the instructions from the PDF. To do the second half of this activity, you need the data you collected from this activity. More sources of data can also be found at the resources page for this course.
