UCLA Stats 404 - Statistical Computing and Programming

Developed and taught this core graduate course to 80+ students in UCLA's Master of Applied Statistics program (2019-2021).

Course Description

Lecture, three hours; discussion, one hour. Limited to Master of Applied Statistics students. Fundamentals of statistical programming using Python and SQL. Python is currently state-of-the-art for the analysis of data, statistical computing, and software development in the industry. Performance of analysis of real datasets using Python...

Course Learning Objectives

The goal of this course is to prepare students for Data Scientist or Machine Learning Engineer roles in industry by teaching marketable skills and best practices for collaborating with technical and non-technical stakeholders, and for developing statistical or machine learning software with Python, SQL, and Git.

By the end of the course, students should be able to:

Explain how their work contributes to the business;
Implement iterative model development;
Write production-ready code that runs on more than just their own computer;
Answer the business question end-to-end with ML/AI;
Work in Python, SQL, and Git;
Showcase an (additional) end-to-end project in their portfolio when preparing for interviews, which also serves as a template for performing data analysis in Python on any take-home interview assignment.

Agenda (subject to change)

Week 1: Uncovering the business question and outlining a strategy to answer it, setting up a reproducible machine learning environment, introduction to Git
Weeks 2 and 3: Introduction to Python, pandas, and SQL
- Python: expressions, control flow, functions, variable types, passing by reference, list comprehension, functional programming
- pandas: reading in data, subsetting, EDA, split + apply + combine, pandas + databases
Weeks 4 and 5: Introduction to developing an ML proof-of-concept (POC)
- Linear and logistic regression, elastic nets, PCA regression, hyperparameter tuning, deep learning, and custom loss functions
Week 6: Big Data in Python -- Improving POC by Understanding Computational Constraints
- pandas and big data, Dask, PySpark + Spark SQL, embarrassingly parallel processes, AWS S3
Weeks 7 and 8: Introduction to Software Development to Productionalize POC
- reproducibility, readability, robustness
- testing suite, ML test, typing, model roll-out
Weeks 9 and 10: Final Project Presentations
- In-class presentations
Week 11: Finals Week
- Final project due
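
To give a flavor of the Weeks 2 and 3 Python topics (passing by reference, list comprehension, functional programming), here is a minimal sketch. It is not taken from the course materials; the function names, variable names, and sample data are all hypothetical and chosen only for illustration.

```python
from functools import reduce


def append_in_place(values, item):
    """Mutate the caller's list: the function receives a reference to the same object."""
    values.append(item)


def append_copy(values, item):
    """Build and return a new list, leaving the caller's list untouched."""
    return values + [item]


# Hypothetical sample data.
scores = [3, 8, 5]

append_in_place(scores, 10)        # the change is visible to the caller
extended = append_copy(scores, 1)  # scores itself is not modified here

# List comprehension: keep the even scores and square them.
even_squares = [s ** 2 for s in scores if s % 2 == 0]

# Functional style: the same selection with filter/map, then a reduce to sum it.
even_squares_fn = list(map(lambda s: s ** 2, filter(lambda s: s % 2 == 0, scores)))
total = reduce(lambda acc, s: acc + s, even_squares_fn, 0)
```

The two helper functions differ only in whether they mutate their argument, which is usually the first surprise when moving to Python from a language with copy-on-assignment semantics.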

Relevant Links

UCLA Course listing
Syllabus
GitHub repository
Class pre-requisites and software installation instructions
Final project guidelines

Please note: When I'm not teaching the class, the content is under development and the links may not work.
