UCLA Stats 404 - Statistical Computing and Programming
Developed and taught this core graduate course to 80+ students in UCLA's Master of Applied Statistics program (2019-2021)
Course Description
Lecture, three hours; discussion, one hour. Limited to Master of Applied Statistics students. Fundamentals of statistical programming using Python and SQL. Python is currently state-of-the-art for the analysis of data, statistical computing, and software development in the industry. Performance of analysis of real datasets using Python...
Course Learning Objectives
The goal of this course is to prepare students for Data Scientist or Machine Learning Engineering roles in industry, by learning marketable skills and best practices for collaborating with technical and non-technical stakeholders, and developing statistical or Machine Learning software via Python, SQL, and Git.
By the end of the course, students should be able to:
Explain how their work contributes to the business;
Implement iterative model development;
Write production-ready code that runs on more than just their own computer;
Answer the business question end-to-end with ML/AI;
Gain experience in Python, SQL, and Git;
Prepare for interviews with an additional end-to-end project to showcase in their portfolio, which also serves as a template for data analysis in Python on take-home interview assignments.
Agenda (subject to change)
Week 1: Uncovering the business question and outlining a strategy to answer it, setting up a reproducible machine learning environment, introduction to Git
Weeks 2 and 3: Introduction to Python, pandas, and SQL
Python: expressions, control flow, functions, variable types, passing by reference, list comprehension, functional programming
pandas: reading in data, subsetting, EDA, split + apply + combine, pandas + databases
Weeks 4 and 5: Introduction to developing an ML proof-of-concept (POC)
Linear and logistic regression, elastic nets, PCA regression, hyperparameter tuning, deep learning, and custom loss functions
Week 6: Big Data in Python -- Improving POC by Understanding Computational Constraints
pandas and big data, Dask, PySpark + Spark SQL, embarrassingly parallel processes, AWS S3
Weeks 7 and 8: Introduction to Software Development to Productionalize POC
reproducibility, readability, robustness
testing suite, ML test, typing, model roll-out
Weeks 9 and 10: Final Project Presentations
In-class presentations
Week 11: Finals Week
Final project due
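To give a flavor of how several agenda topics fit together, here is a minimal sketch, not taken from the course materials, that combines split + apply + combine (Weeks 2 and 3) with type hints and a small test (Weeks 7 and 8). It uses only the Python standard library as a stand-in for pandas, and the names `group_means` and `test_group_means` are hypothetical, chosen for illustration only.

```python
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def group_means(rows: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Split rows by key, apply the mean to each group, and combine the results.

    Hypothetical helper for illustration; a pure-Python stand-in for a
    pandas-style group-by, not code from the course.
    """
    groups: Dict[str, List[float]] = {}
    for key, value in rows:
        # Split: collect the values that share a key.
        groups.setdefault(key, []).append(value)
    # Apply + combine: one mean per group, built with a dict comprehension.
    return {key: mean(values) for key, values in groups.items()}


def test_group_means() -> None:
    """A small test in the style a test suite might use."""
    rows = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
    assert group_means(rows) == {"a": 2.0, "b": 10.0}
    # An empty input has no groups, so the result is an empty dict.
    assert group_means([]) == {}


test_group_means()
```

The type hints document what the function expects and returns, and the test pins down its behavior on both a typical and an empty input, which is one simple way to make code more robust before it leaves a single developer's machine.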
Relevant Links
UCLA Course listing
Class pre-requisites and software installation instructions
Final project guidelines
Please note: When I'm not teaching the class, the content is under development and the links may not work.