Course Abstract

Training duration: 90 min (Hands-on)

Supervised Learning is a course series that walks through all steps of the classical supervised machine learning pipeline. We use Python and packages like scikit-learn, pandas, numpy, and matplotlib. The course series focuses on topics like cross-validation and splitting strategies, evaluation metrics, supervised machine learning algorithms (like linear and logistic regression, support vector machines, and tree-based methods like random forest, gradient boosting, and XGBoost), and interpretability. You can complete the courses in sequence or pick individual courses based on your interest.

Part 2 of the course series covers how to prepare your data for training and evaluating a machine learning model. It addresses two steps: splitting and preprocessing your data. In my experience, beginner practitioners often make a mistake known as data leakage when splitting their dataset. Data leakage means that the model training process uses information that will not be available at prediction time. The unfortunate side effect is that the model seems to perform well during development but performs poorly once deployed. Two modules are dedicated to splitting, with the hope that participants will be well-equipped to avoid data leakage upon completing them.

The third module is on preprocessing. There are two driving concepts behind preprocessing: the feature matrix needs to be numerical (no strings or any other data types are allowed when using sklearn), and some machine learning models converge faster and perform better if all features are standardized.
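To make the data leakage described above concrete, here is a minimal sketch. It assumes a synthetic numerical feature matrix and uses scikit-learn's StandardScaler; the data and variable names are illustrative and are not taken from the course materials. The leaky version computes the mean and standard deviation on all rows, including test rows that would not be available at prediction time. The leakage-free version fits the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only (not from the course).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Leaky: the scaler sees the test rows when it computes the mean and std.
leaky_scaler = StandardScaler().fit(X)

# Leakage-free: fit on the training rows only, then apply the same
# fitted transformation to the test rows.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print("mean learned from all rows:     ", leaky_scaler.mean_)
print("mean learned from training rows:", scaler.mean_)
```

The two sets of learned means differ slightly. That difference is the information from the test set that the leaky version lets into training.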


Learning Objectives

  • Describe why data splitting is necessary in machine learning

  • Summarize the properties of IID data

  • List examples of non-IID datasets

  • Apply IID splitting techniques

  • Apply non-IID splitting techniques

  • Identify when a custom splitting strategy is necessary

  • Describe the two motivating concepts behind preprocessing

  • Apply various preprocessors to categorical and continuous features

  • Perform preprocessing with a sklearn pipeline and ColumnTransformer

Instructor Bio:

Andras Zsom, PhD

Andras Zsom is a Lead Data Scientist in the Center for Computation and Visualization and an Adjunct Lecturer in Data Science at Brown University, Providence, RI, USA. He works with high-level academic administrators to tackle predictive modeling problems and to promote data-driven decision making, collaborates with faculty members on data-intensive research projects, and teaches a mandatory course in the Data Science Master’s curriculum.

Course Outline

Module 1: Split IID data

  • Review why we split the data (hyperparameter tuning, the bias-variance trade-off, the generalization error)

  • The properties of Independent and Identically Distributed (IID) data

  • The basic approach to split the data into training, validation, and test sets

  • K-fold cross-validation

  • How to split imbalanced data in classification: the stratified split

  • The uncertainty introduced by data splitting and how to measure it
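The topics in this module can be sketched in a few lines of scikit-learn. The example below is a rough illustration, not the course's own notebook: the dataset is synthetic, the 60/20/20 ratio and the logistic regression model are assumptions, and `basic_split` is a hypothetical helper. It shows a stratified basic split, one way to gauge the uncertainty caused by splitting (repeating the split with several random states), and a stratified K-fold split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic, imbalanced binary classification data (illustrative only).
X, y = make_classification(
    n_samples=500, n_features=5, weights=[0.9, 0.1], random_state=0
)


def basic_split(X, y, random_state):
    """Hypothetical helper: stratified 60/20/20 train/validation/test split."""
    X_train, X_other, y_train, y_other = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=random_state
    )
    X_val, X_test, y_val, y_test = train_test_split(
        X_other, y_other, test_size=0.5, stratify=y_other,
        random_state=random_state,
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Repeat the split with different random states to see how much the
# test score varies because of the split alone.
test_scores = []
for random_state in range(5):
    X_train, X_val, X_test, y_train, y_val, y_test = basic_split(
        X, y, random_state
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    test_scores.append(model.score(X_test, y_test))
print(f"test accuracy: {np.mean(test_scores):.3f} +/- {np.std(test_scores):.3f}")

# Stratified K-fold: hold out a test set first, then rotate the
# validation fold over the remaining points.
X_other, X_test, y_other, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
fold_positive_rates = [
    y_other[val_idx].mean() for _, val_idx in skf.split(X_other, y_other)
]
print("fraction of positives per validation fold:", fold_positive_rates)
```

Because the folds are stratified, each validation fold keeps roughly the same fraction of the minority class as the full dataset.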

Module 2: Split non-IID data

  • Examples of non-IID datasets

  • Guiding questions to ask yourself when coming up with a splitting strategy

  • Split a dataset with group structure: GroupShuffleSplit and GroupKFold

  • How to work with time series data: TimeSeriesSplit

  • The limitations of sklearn: when you should consider writing your own custom splitting function
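As a rough sketch of the splitters named above, the example below uses a tiny synthetic dataset. The group labels stand for a hypothetical scenario (4 patients with 3 visits each), and the rows are assumed to be in chronological order for the time series part. GroupShuffleSplit and GroupKFold keep all points of a group on the same side of a split, and TimeSeriesSplit only validates on points that come after the training points.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit, TimeSeriesSplit

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))
y = rng.integers(0, 2, size=12)
groups = np.repeat(np.arange(4), 3)  # hypothetical: 4 patients, 3 visits each

# Hold out one whole group as the test set.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
other_idx, test_idx = next(gss.split(X, y, groups))
print("test groups:", np.unique(groups[test_idx]))

# Rotate the remaining groups through the validation fold.
gkf = GroupKFold(n_splits=3)
group_folds = []
for train_idx, val_idx in gkf.split(X[other_idx], y[other_idx], groups[other_idx]):
    train_groups = set(groups[other_idx][train_idx])
    val_groups = set(groups[other_idx][val_idx])
    group_folds.append((train_groups, val_groups))
    print("train groups:", sorted(train_groups), "validation groups:", sorted(val_groups))

# Time series: the training indices always precede the validation indices.
tss = TimeSeriesSplit(n_splits=3)
time_splits = []
for train_idx, val_idx in tss.split(X):
    time_splits.append((train_idx, val_idx))
    print("train:", train_idx, "validation:", val_idx)
```

No group appears in both the training and validation sets of any fold, which is the property that prevents leakage across related points.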

Module 3: Preprocess continuous and categorical features

  • Review the driving concepts behind preprocessing

  • Overview of sklearn transformers and methods

  • Apply the one hot encoder to categorical features

  • Apply the ordinal encoder to ordinal features

  • Standardize continuous features

  • Introduction to sklearn’s ColumnTransformer and pipelines
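The preprocessing steps in this module could be combined roughly as follows. This is a sketch under stated assumptions: the toy data frame, the column names (`city`, `education`, `age`), and the ordering of the education levels are all hypothetical. The one hot encoder handles the unordered categorical feature, the ordinal encoder handles the ordered one, and the standard scaler handles the continuous one. The whole preprocessor is fit on the training data only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical toy data for illustration only.
df_train = pd.DataFrame({
    "city": ["Providence", "Boston", "Providence", "Hartford"],
    "education": ["High school", "Bachelors", "Masters", "Bachelors"],
    "age": [23.0, 35.0, 41.0, 29.0],
})
df_test = pd.DataFrame({
    "city": ["Boston", "New Haven"],  # "New Haven" was not seen in training
    "education": ["Masters", "High school"],
    "age": [52.0, 31.0],
})

# Assumed ordering of the ordinal feature, from lowest to highest.
education_levels = [["High school", "Bachelors", "Masters"]]

preprocessor = ColumnTransformer(
    transformers=[
        # Unseen categories become an all-zero row instead of raising an error.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["city"]),
        ("ordinal", OrdinalEncoder(categories=education_levels), ["education"]),
        ("scale", StandardScaler(), ["age"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# A pipeline with a single preprocessing step; a model could be appended
# as a further step.
pipe = Pipeline(steps=[("preprocess", preprocessor)])

# Fit on the training data only, then apply the fitted steps to the test data.
X_train_prep = pipe.fit_transform(df_train)
X_test_prep = pipe.transform(df_test)

print(X_train_prep.shape, X_test_prep.shape)
```

The output is a fully numerical feature matrix: three one-hot columns for `city`, one integer-coded column for `education`, and one standardized column for `age`.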

Background knowledge

  • Python coding experience

  • Familiarity with pandas and numpy

  • Prior experience with scikit-learn and matplotlib is a plus but not required

Applicable Use-cases

  • The dataset can be expressed as a 2D feature matrix with the columns as features and the rows as data points

  • A continuous or categorical target variable exists and the dataset is IID (the points are independent and identically distributed)

  • Fraud detection, predicting whether patients have a certain illness, predicting the selling or rental price of properties, predicting customer satisfaction