Data Sets

What is Data?

Data is information with potential for meaning. It is often stored digitally. It can be structured or unstructured.

What is Data Integration?

Data is rarely located all in one place, so data from multiple sources usually has to be combined. This is data integration.

A lot of businesses derive value through data integration alone!

How to Combine?

  • Combine tables from multiple databases with a SQL JOIN
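
As a rough illustration of combining data with a JOIN, here is a minimal sketch using Python's built-in sqlite3 module. The two in-memory databases, the table names (customers, orders), and the sample rows are all made up for this example. A real integration would connect to existing databases instead.

```python
import sqlite3


def join_across_databases():
    """Join a table in one database with a table in a second, attached database."""
    # Primary database (hypothetical): holds customer records.
    conn = sqlite3.connect(":memory:")
    # Second database (hypothetical): holds orders, attached under the name "sales".
    conn.execute("ATTACH DATABASE ':memory:' AS sales")

    conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE sales.orders (customer_id INTEGER, item TEXT)")

    # Made-up sample rows, for illustration only.
    conn.executemany(
        "INSERT INTO main.customers VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )
    conn.executemany(
        "INSERT INTO sales.orders VALUES (?, ?)",
        [(1, "book"), (1, "pen"), (2, "lamp")],
    )

    # The JOIN matches each order to its customer through the shared id.
    rows = conn.execute(
        "SELECT c.name, o.item "
        "FROM main.customers AS c "
        "JOIN sales.orders AS o ON o.customer_id = c.id "
        "ORDER BY c.name, o.item"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for name, item in join_across_databases():
        print(name, item)
```

The key idea is the shared column (here a customer id): a JOIN only works when both sources have a field that identifies the same thing.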

Places to find data

  • Kaggle (Data Science Website)
  • UCI Machine Learning Repository
  • ISLR (data sets for An Introduction to Statistical Learning with Applications in R). Available in R through the ISLR package.

Big Data Sets

  • Twitter: Who follows who
  • Facebook: Who is friends with who
  • Amazon: Who buys what
  • Cellphones: Who calls who
  • Protein-Protein Interactions: 200 million different human protein interactions
  • Internet: 50 billion web pages (see [https://internet-map.net/])

Fault Detection / Reliability

  • NASA Turbofan Fault Detection: This data set comes from an engine failure study. It can be useful for testing the reliability of fault detection algorithms.
    • (Grab the data from my GitHub)[https://github.com/AveryData/DataSets/tree/master/NASA%20Turbofan%20Engine%20Fault%20Data]
    • (See the data on the NASA site)[https://data.nasa.gov/dataset/Turbofan-engine-degradation-simulation-data-set/vrks-gjie]
  • Tennessee Eastman Data
    • (Translated .mat files on my GitHub)[https://github.com/AveryData/DataSets/tree/master/Tennessee%20Eastman%20Data]
    • (Data in R)[https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6C3JR1]

Movies

Check out The Movie Database API

  • Crime City Data
  • NYC Data