Articles by Randy Au
- Data Science Practice 101: Always Leave An Analysis Paper Trail
Randy argues that documenting analyses makes a data scientist's work traceable and reproducible. This matters most for ad-hoc analysis requests, where output easily ends up in a temporary directory with little context. Analysis deliverables can be packaged in several ways, including Excel files, CSV files, slide decks, dashboards, and shared documents, and each can be made free-standing and easily traceable.
- Data Cleaning IS Analysis, Not Grunt Work
Some people consider cleaning data to be menial work that’s somehow “beneath” the sexy “real” data science work. Randy calls BS on this. The act of cleaning data imposes values, judgments, and interpretations on the data so that downstream analysis algorithms can function and give results. That’s exactly the same as doing data analysis. Data cleaning is a spectrum of reusable data transformations on the path towards a full data analysis. Once we accept that framework, the steps we need to take to clean data flow more naturally. We want to allow our analysis code to run, control useless variance, eliminate bias, and document our choices for others to use, all in service of the array of potential analyses we want to run in the future.
- Learning SQL 201: Optimizing Queries, Regardless of Platform
Randy's article is all about speed: common strategies for making queries go faster.
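
To make the data-cleaning summary above more concrete, here is a minimal Python sketch of three of the goals it lists: letting analysis code run, controlling useless variance, and documenting decisions for others. It is an illustration only, not code from Randy's article; the sample records, the field names (`country`, `revenue`), and the helper functions are all hypothetical.

```python
from typing import Optional

# Hypothetical raw records; field names and values are invented for illustration.
raw_rows = [
    {"country": " US ", "revenue": "1,200.50"},
    {"country": "us", "revenue": "300"},
    {"country": "Canada", "revenue": "n/a"},
]


def clean_country(value: str) -> str:
    """Control useless variance: ' US ' and 'us' describe the same country."""
    return value.strip().upper()


def clean_revenue(value: str) -> Optional[float]:
    """Let downstream code run: parse numbers, map unparseable text to None."""
    try:
        return float(value.replace(",", ""))
    except ValueError:
        return None


def clean_rows(rows):
    """Return cleaned rows plus a log of every judgment call that was made."""
    cleaned, decisions = [], []
    for i, row in enumerate(rows):
        revenue = clean_revenue(row["revenue"])
        if revenue is None:
            # Document the interpretation so others can trace and question it.
            decisions.append(
                f"row {i}: revenue {row['revenue']!r} treated as missing"
            )
        cleaned.append(
            {"country": clean_country(row["country"]), "revenue": revenue}
        )
    return cleaned, decisions


cleaned_rows, decisions = clean_rows(raw_rows)
```

Each helper encodes a judgment (that case and whitespace carry no meaning, that "n/a" means missing), which is the sense in which cleaning is already analysis. Keeping the `decisions` log alongside the output is one possible way to leave the kind of paper trail the first summary describes.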