Data Cleaning
Garbage In, Garbage Out
Data scientists spend a large share of their time cleaning data, and it is widely seen as the least exciting part of the job. Check out this Forbes article to read more about it.
Ways Data Can Be Dirty
Dates
Dates are some of the trickiest parts of data. The same day can be written in all sorts of formats that you have to deal with. Some look like:
- Jan 18, 2020
- January 18, 20
- 1/18/20
- 18/1/2020
- Many many more…
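One way to deal with mixed date formats is to try a list of known patterns until one fits. The sketch below uses Python's standard `datetime.strptime`. The `FORMATS` list and the `parse_date` helper are illustrative names, not part of any tool mentioned here, and the sketch assumes English month names and that month-first (US style) should win when a date is ambiguous.

```python
from datetime import date, datetime

# Candidate formats, tried in order. Assumption: month-first (US style)
# patterns are tried before the day-first pattern.
FORMATS = [
    "%b %d, %Y",  # Jan 18, 2020
    "%B %d, %Y",  # January 18, 2020
    "%B %d, %y",  # January 18, 20
    "%m/%d/%y",   # 1/18/20
    "%m/%d/%Y",   # 1/18/2020
    "%d/%m/%Y",   # 18/1/2020
]


def parse_date(text: str) -> date:
    """Return the first successful parse of text, or raise ValueError."""
    cleaned = text.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # this pattern did not fit, try the next one
    raise ValueError(f"Unrecognized date format: {text!r}")


for raw in ["Jan 18, 2020", "January 18, 20", "1/18/20", "18/1/2020"]:
    print(raw, "->", parse_date(raw).isoformat())
```

Note that a value like 1/2/2020 is ambiguous: this sketch would read it as January 2, so in real data you need to know which convention the source used.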
Manual Entry
A lot of data is entered by humans, so you can get misspellings, similar words, and blanks. A question could be something like, “How did you feel after watching the movie?” People could answer “Happy”, “Hapy”, “Content”, or “Joyful”, and all of these could be representative of “Happy”.
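As a rough illustration, free-text answers can be mapped to one label with a small lookup table plus fuzzy matching for misspellings. This sketch uses `difflib.get_close_matches` from the Python standard library. The `CANONICAL` table, the `normalize_answer` helper, and the 0.8 cutoff are assumptions chosen for this example, and deciding that “Content” and “Joyful” count as “Happy” is a judgment call you would make for your own data.

```python
from difflib import get_close_matches

# Hypothetical mapping from accepted answers to one canonical label.
CANONICAL = {
    "happy": "Happy",
    "content": "Happy",
    "joyful": "Happy",
}


def normalize_answer(raw, cutoff=0.8):
    """Return the canonical label for a free-text answer, or None."""
    key = raw.strip().lower()
    if not key:
        return None  # blank answer
    if key in CANONICAL:
        return CANONICAL[key]
    # Fall back to fuzzy matching to catch misspellings such as "hapy".
    matches = get_close_matches(key, list(CANONICAL), n=1, cutoff=cutoff)
    return CANONICAL[matches[0]] if matches else None


for answer in ["Happy", "Hapy", "Content", "Joyful", "", "Sad"]:
    print(repr(answer), "->", normalize_answer(answer))
```

Answers that match nothing come back as `None`, so you can review them by hand instead of guessing.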
Units of measure
The same quantity can be recorded in different units. Temperature is a common example: some values may be in Fahrenheit and others in Celsius.
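If each value comes with a unit label, one approach is to convert everything to a single unit before analysis. The sketch below assumes the unit is stored as "C", "F", or "K" next to each reading; the `to_celsius` helper and the sample readings are made up for illustration.

```python
def to_celsius(value: float, unit: str) -> float:
    """Convert a temperature reading to degrees Celsius."""
    unit = unit.strip().upper()
    if unit == "C":
        return float(value)
    if unit == "F":
        return (value - 32) * 5 / 9
    if unit == "K":
        return value - 273.15
    raise ValueError(f"Unknown temperature unit: {unit!r}")


# Hypothetical readings recorded in mixed units.
readings = [(72.5, "F"), (22.5, "C"), (295.65, "K")]
for value, unit in readings:
    print(value, unit, "->", round(to_celsius(value, unit), 2), "C")
```

Raising an error on an unknown unit is deliberate: silently passing the value through would hide exactly the kind of problem you are trying to clean up.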
Strings
Capitalization can make a difference. To a computer, “Salt Lake City” is not the same string as “salt lake city”, so matching them takes extra programming. Mixed cases such as “Salt lake city” can be even harder to catch.
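A common first step is to compare strings in a normalized form rather than as typed. This minimal sketch lowercases the text with `str.casefold` and collapses extra whitespace; the `normalize_text` helper name is an assumption for this example.

```python
def normalize_text(text: str) -> str:
    """Return a case-insensitive, whitespace-normalized form of text."""
    # split() with no arguments drops leading, trailing, and repeated
    # whitespace; casefold() is a more aggressive form of lower().
    return " ".join(text.split()).casefold()


cities = ["Salt Lake City", "salt lake city", "Salt lake  city "]
print({normalize_text(city) for city in cities})
```

All three spellings collapse to the same key, so they can be grouped or joined together while the original text is kept for display.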
Others
- Missing data
- Leading zeros
- Duplicates
- Inconsistent values
- Wrong data type
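Several of the problems in this list can be handled in one pass over the rows. The sketch below is only an illustration on made-up records: the `zip` and `age` columns, the `MISSING` placeholder set, and the assumption that ZIP codes are five digits long are all hypothetical. It restores leading zeros, treats blanks and placeholders as missing, converts text to numbers, and drops duplicates.

```python
# Hypothetical raw records; every field arrives as text.
raw_rows = [
    {"zip": "84101", "age": "34"},
    {"zip": "2134", "age": "41"},    # leading zero was dropped from "02134"
    {"zip": "84101", "age": "34"},   # exact duplicate of the first row
    {"zip": "84604", "age": ""},     # missing age
    {"zip": "84043", "age": "n/a"},  # placeholder that also means missing
]

# Assumed set of placeholder strings that should count as missing.
MISSING = {"", "n/a", "na", "null", "none"}


def clean_rows(rows):
    """Fix leading zeros, missing values, data types, and duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Assumption: ZIP codes are 5 digits, so pad with zeros on the left.
        zip_code = row["zip"].strip().zfill(5)
        age_text = row["age"].strip()
        # Store age as an int, using None for missing values.
        age = None if age_text.lower() in MISSING else int(age_text)
        key = (zip_code, age)
        if key in seen:
            continue  # skip rows already seen
        seen.add(key)
        cleaned.append({"zip": zip_code, "age": age})
    return cleaned


for row in clean_rows(raw_rows):
    print(row)
```

Keeping the ZIP code as text is intentional: storing it as a number is what drops the leading zero in the first place.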
Ways to Clean Data
OpenRefine
OpenRefine is free, open-source software (formerly Google Refine) that helps clean data sets. It mostly deals with string/text data that contains typos and near-identical values, and it helps you find and fix inconsistencies. It is very similar to JMP’s Recode platform.
Check out this video to understand what OpenRefine can do for you:
If you want to learn even more, look at the OpenRefine website.
Wrangler
Wrangler is an interactive tool from the Stanford Visualization Group. As you select and edit data in its interface, it suggests transformations that match what you are trying to do, so you can turn dirty data into clean data.
Watch a video demonstrating its power:
Wrangler Demo Video from Stanford Visualization Group on Vimeo.
If you want to learn more about Wrangler, their intro page has more information.