Course Abstract

Training duration: 90 minutes

Datasets are almost never complete, and missing values can bias your analysis. Because of these biases, a supervised machine learning model can produce incorrect predictions. The goal of this course is to show why some of the most common approaches for dealing with missing values often introduce bias. I will describe methods and techniques that can help you arrive at an unbiased conclusion in the face of missing data.

DIFFICULTY LEVEL: INTERMEDIATE

Learning Objectives

  • Describe the three main types of missingness patterns

  • Evaluate simple approaches for handling missing values

  • Apply XGBoost to a dataset with missing values

  • Apply multivariate imputation

  • Apply the reduced-features model (also called the pattern submodel approach)

  • Decide which approach is best for your dataset

Instructor

Instructor Bio:

Andras Zsom is a Lead Data Scientist in the Center for Computation and Visualization group at Brown University, Providence, RI. He works with high-level academic administrators to tackle predictive modeling problems and collaborates with faculty members on data-intensive research projects. He was also the instructor of a data science course offered to the data science master's students at Brown.

Andras Zsom, PhD

Lead Data Scientist and Adjunct Lecturer in Data Science | Brown University, Center for Computation and Visualization

Course Outline

Module 1: Missing data patterns

- MCAR - Missing Completely At Random

- MAR - Missing At Random 

- MNAR - Missing Not At Random 
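The difference between the three patterns is easiest to see by simulating them. The sketch below is only an illustration and is not taken from the course material: the column names `age` and `income`, the thresholds, and the helper `simulate_missingness` are all hypothetical, and it uses only the Python standard library so that it runs without extra packages. Under MCAR the chance of a value going missing is the same for every row, under MAR it depends only on another, observed column, and under MNAR it depends on the value that goes missing.

```python
import random


def simulate_missingness(rows, mechanism, rate=0.3, seed=0):
    """Return a copy of rows in which some 'income' values are set to None.

    rows: list of dicts with numeric 'age' and 'income' keys (hypothetical columns).
    mechanism: "MCAR", "MAR", or "MNAR".
    """
    rng = random.Random(seed)
    result = []
    for row in rows:
        row = dict(row)  # copy so the input is not modified
        if mechanism == "MCAR":
            # Same probability for every row, unrelated to any value.
            p = rate
        elif mechanism == "MAR":
            # Probability depends only on an observed column (age).
            p = 2 * rate if row["age"] > 40 else 0.0
        elif mechanism == "MNAR":
            # Probability depends on the value that goes missing (income).
            p = 2 * rate if row["income"] > 50_000 else 0.0
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        if rng.random() < p:
            row["income"] = None
        result.append(row)
    return result


def observed_mean(rows):
    """Mean of the income values that are still observed."""
    values = [r["income"] for r in rows if r["income"] is not None]
    return sum(values) / len(values)


# Synthetic data, made up purely for this illustration.
data_rng = random.Random(42)
rows = [
    {"age": data_rng.randint(20, 65), "income": data_rng.gauss(50_000, 15_000)}
    for _ in range(1000)
]

for mechanism in ("MCAR", "MAR", "MNAR"):
    masked = simulate_missingness(rows, mechanism)
    n_missing = sum(r["income"] is None for r in masked)
    print(mechanism, n_missing, round(observed_mean(masked)))
```

In this toy setup, `age` and `income` are generated independently, so only the MNAR case systematically shifts the mean of the observed incomes. With real data, where columns are usually correlated, MAR can also bias simple summaries unless the observed columns are taken into account.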

Module 2: Apply the reduced-features model (also called the pattern submodel approach)

- Reduced-features model (or pattern submodel approach)

Module 3: How to determine the patterns? 

- A Python implementation
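As a rough idea of what determining the patterns can look like, here is a minimal sketch. It is not the course's implementation: the helper `missing_patterns`, the column names, and the toy rows are hypothetical. A pattern is taken to be the set of columns that are missing in a row, and rows that share a pattern are counted together. In the reduced-features approach, each pattern would then typically get its own model trained on the columns that are observed for it.

```python
from collections import Counter


def missing_patterns(rows, columns):
    """Count the distinct missingness patterns in rows.

    Each pattern is a tuple of booleans, one per column,
    with True where the value is missing (None or absent).
    """
    return Counter(
        tuple(row.get(col) is None for col in columns) for row in rows
    )


# Toy data, made up for this illustration.
columns = ["age", "income", "tenure"]
data = [
    {"age": 34, "income": 52_000, "tenure": 3},
    {"age": None, "income": 61_000, "tenure": 5},
    {"age": 45, "income": None, "tenure": None},
    {"age": None, "income": 48_000, "tenure": 1},
    {"age": 29, "income": 39_000, "tenure": 2},
]

patterns = missing_patterns(data, columns)
for pattern, count in patterns.most_common():
    # Columns a submodel for this pattern could use as features.
    observed = [col for col, is_missing in zip(columns, pattern) if not is_missing]
    print(f"{count} row(s) with observed columns {observed}")
```

With a pandas DataFrame, a similar count can usually be obtained from the boolean mask returned by `df.isnull()`; the plain-Python version is used here only to keep the example free of dependencies.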

Module 4: Decide which approach is best for your dataset 

- XGB models 

- Imputation 

- Reduced-features

Background knowledge

  • Experience with Python and scikit-learn

  • Knowledge of building a machine learning pipeline (e.g., cross validation, hyper-parameter tuning)

Real-world applications

  • Supervised learning can be used for customer churn modeling, which helps identify which customers of a business are likely to stop engaging with it and why.

  • Dynamic pricing for marketing campaigns for any goods or services relies on pricing data. Airlines and ride-share services have successfully implemented dynamic price optimization strategies using supervised learning.

  • Tackling missing data scenarios helps rectify and enhance modeling capabilities in a variety of business applications, including streaming, finance, and e-commerce.