Optimizing Customer Behavior Predictions: How to Select the Best Machine-Learning Model with Scikit-learn Pipelines

Using Classification Models for Growth Marketing Optimization

One thing that never ceases to amaze me about growth marketing is the predictability of human behavior at scale. PPC platforms like Google and Meta have become so confident in their ability to predict free will that, as a marketer, you can opt in to paying only for conversions.

For this post, we’ll skip the debate on whether or not we’re living in a simulation and instead focus on techniques that growth marketers can leverage in order to tap into the value that predictability in behavior provides.

Developing a deep understanding of your customers’ preferences has always been at the core of successful marketing strategies. Only recently, however, has it become fairly easy to convert that understanding into a mathematical model that accurately predicts customer behavior, like the likelihood that someone will convert into a paying customer.

Machine learning, particularly classification models, can be an extremely useful arrow in a growth marketer’s quiver when it comes to predicting and understanding audience behavior.

Selecting the right model for your use case, however, can be time-consuming and repetitive. For each model you want to assess, you’ll need to clean and transform your dataset, split out training data, and then pass that information to the model in the correct form.

Fortunately, scikit-learn’s Pipeline class can automate a lot of these steps.

Step 1: Define Your Objective and Create a Dataset

Before diving into data and models, clearly define your objective. Are you trying to predict which users will convert, segment users into different interest groups, or identify which users are at risk of churning?

Your objective will guide both the data you collect and the metrics you use to evaluate model performance, and every use case will need a different dataset.

Keep in mind, this pipeline is for supervised ML models, so in order to train them, we need to know the outcome we are predicting. For example, I’m using sessions from Google Analytics and want to predict whether or not the session resulted in the event ‘generate_lead’, so I have included that outcome in my dataset.

We won’t dive into the process of building a dataset here, but if you need a test dataset to give this a try, here is an ecommerce dataset that should do the trick.
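Whatever the source, the minimal shape is one row per observation with a binary target column. Here’s a rough sketch of what a session-level dataset might look like; every column besides session_duration and generate_lead is a hypothetical example feature:

import pandas as pd

# One row per session; 'channel' and 'device' are made-up example features
df = pd.DataFrame({
    'channel': ['organic', 'paid', 'direct'],
    'device': ['mobile', 'desktop', 'mobile'],
    'session_duration': ['00:02:15', '00:00:42', '00:10:03'],
    'generate_lead': [0, 0, 1],
})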

Step 2: Prepare Your Environment

Typically I pop open a JupyterLab session for most of my data analytics tasks; however, since we’re going to run a number of computationally intensive models, I’m running this out of a Colab notebook.

Colab is great because, at no cost, you can easily upgrade your runtime. For this, I’ve switched my runtime over to TPU, hardware specifically designed for machine-learning applications (note that scikit-learn itself runs on CPU, so the main benefit here is the beefier runtime). Here is how to set up your Colab notebook to do the same:

Open up Colab and head over to Runtime in the menu.

Select “Change runtime type” and, in the modal that appears, select “TPU”.

Note – changing the runtime will restart your notebook. If you uploaded your data before switching runtimes, you’ll have to upload it again.

Step 3: Load Required Packages

Colab should have all of these packages pre-installed, but if not, simply run a !pip install on whatever packages are missing:
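For example, imbalanced-learn (which provides the imblearn module) and xgboost are the two most likely to be absent, depending on your runtime:

!pip install imbalanced-learn xgboost

With those in place, the imports below should all succeed: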

import time
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MaxAbsScaler
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neural_network import MLPClassifier

Some of these models tend to throw ugly warnings, so you may want to suppress them by ignoring ConvergenceWarning:

from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

Step 4: Import Your Data and Perform Any Required Feature Engineering


df = pd.read_csv('/content/your_data_set.csv')

# Perform any feature engineering necessary

# For simplicity, I'm filling missing categorical values with 'Unknown'
for column in df.columns:
    if df[column].dtype == 'object':
        df[column] = df[column].fillna('Unknown')

# My data set has a duration column in the form HH:MM:SS. Here I convert that to seconds
def time_convert(x):
    h,m,s = map(int,x.split(':'))
    return (h*60+m)*60+s

df['session_duration'] = df['session_duration'].apply(time_convert)
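Since the pipeline we build later routes columns by dtype, it’s worth a quick sanity check that your columns came through as the types you expect before moving on:

# Confirm column dtypes and check for any remaining missing values
print(df.dtypes)
print(df.isna().sum())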

Step 5: Split Out Your Training and Test Data

For my use case, I want to predict whether or not a user session is going to result in a website lead event.

The X feature set will include all features except the target variable. If you have other variables that only occur when the target event also occurs, drop those as well; otherwise they will leak the outcome into the model.

y is simply your target variable.

We’ll now randomly split the data into a training and test dataset using the train_test_split() function.

  • test_size : the proportion of the data held out of training. We’ll use this held-out set to evaluate the model once it has been trained.
  • random_state : controls the shuffling applied to the data before the split. We’ll pass 0 to ensure we get reproducible results each time we run the code.
  • stratify : if not None, data is split in a stratified fashion. We’ll pass y to ensure the target classes appear in the same proportions in both the training and test sets.

X = df.drop(['generate_lead'], axis=1)
y = df['generate_lead']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=0,
                                                    stratify=y
                                                   )
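If you want to confirm the stratification worked, compare the class proportions across the two splits; they should be nearly identical:

# Class balance should match between the training and test sets
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))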

Step 6: Build the Pipeline Functions

First we’ll define our get_pipeline() function. It creates a preprocessor to format the data and then bundles that preprocessor with a model.

def get_pipeline(X, model):
    # Identify numeric and categorical columns by dtype
    numeric_columns = X.select_dtypes(exclude=['object']).columns.tolist()
    categorical_columns = X.select_dtypes(include=['object']).columns.tolist()

    # Impute missing numeric values with a constant (0 by default)
    numeric_pipeline = SimpleImputer(strategy='constant')
    # One-hot encode categorical data, ignoring categories unseen during training
    categorical_pipeline = OneHotEncoder(handle_unknown='ignore')

    preprocessor = ColumnTransformer(
        transformers=[
            ('numeric', numeric_pipeline, numeric_columns),
            ('categorical', categorical_pipeline, categorical_columns),
        ], remainder='passthrough'
    )

    # Bundle preprocessing, oversampling, scaling, feature selection and the model.
    # Using the imblearn pipeline ensures SMOTE is applied only during fitting,
    # never when scoring validation or test folds.
    bundled_pipeline = imbpipeline(steps=[
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state=1)),
        ('scaler', MaxAbsScaler()),
        ('feature_selection', SelectKBest(score_func=chi2, k=6)),  # chi2 requires non-negative features
        ('model', model)
    ])

    return bundled_pipeline
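Before looping over a dozen models, it can save time to smoke-test the pipeline end-to-end with a single cheap classifier. This is just a sanity check to catch preprocessing errors early, not part of the selection process:

# Fit one fast model to confirm the preprocessing steps run without errors
smoke_test = get_pipeline(X_train, DecisionTreeClassifier())
smoke_test.fit(X_train, y_train)
print(round(smoke_test.score(X_test, y_test), 3))  # mean accuracy on the test set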

Next, define the select_model() function, which lists the models we want to evaluate, runs each through the bundled pipeline with 10-fold cross-validation, and returns a DataFrame of ROC AUC scores.

def select_model(X, y):

    classifiers = {
        "DummyClassifier": DummyClassifier(strategy='most_frequent'),
        "XGBClassifier": XGBClassifier(verbosity=0, eval_metric='logloss', objective='binary:logistic'),
        "RandomForestClassifier": RandomForestClassifier(),
        "DecisionTreeClassifier": DecisionTreeClassifier(),
        "ExtraTreeClassifier": ExtraTreeClassifier(),
        "ExtraTreesClassifier": ExtraTreesClassifier(),
        "AdaBoostClassifier": AdaBoostClassifier(),
        "KNeighborsClassifier": KNeighborsClassifier(),
        "RidgeClassifier": RidgeClassifier(),
        "SGDClassifier": SGDClassifier(),
        "BaggingClassifier": BaggingClassifier(),
        "BernoulliNB": BernoulliNB(),
        "SVC": SVC(),
        "MLPClassifier": MLPClassifier(),
        "MLPClassifier (paper)": MLPClassifier(hidden_layer_sizes=(27, 50), max_iter=300, activation='relu', solver='adam', random_state=1)
    }

    rows = []  # List to collect each row

    for key in classifiers:

        start_time = time.time()

        # Pass the feature set and classifier through the get_pipeline() function
        pipeline = get_pipeline(X, classifiers[key])

        cv = cross_val_score(pipeline, X, y, cv=10, scoring='roc_auc')

        row = {'model': key,
               'run_time': round((time.time() - start_time) / 60, 2),  # minutes
               'roc_auc': cv.mean(),
        }
        print(row)
        rows.append(row)

    df_models = pd.DataFrame(rows)  # Create DataFrame from collected rows
    df_models = df_models.sort_values(by='roc_auc', ascending=False)
    return df_models

Step 7: Run the Pipeline

Now all you have to do is execute the code, sit back, and enjoy the show.

models = select_model(X_train, y_train)
models.head(20)

The select_model() function returns the ROC AUC score for each model. ROC AUC (Area Under the Receiver Operating Characteristic curve) summarizes the curve that plots the true positive rate against the false positive rate, and it is one of the most widely used metrics for evaluating classification models.

For this example, a good way to think about the ROC AUC score intuitively is as the probability that the model will assign a higher score to a randomly chosen session that results in a ‘generate_lead’ event than to a randomly chosen session that does not convert.

A rough rule of thumb is that the accuracy of tests with AUCs between 0.50 and 0.70 is low; between 0.70 and 0.90, the accuracy is moderate; and it is high for AUCs over 0.90.
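To make that ranking interpretation concrete, here’s a toy example using the roc_auc_score function we imported earlier (the numbers are made up for illustration):

# The score is the share of positive/negative pairs the model ranks correctly
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_score))  # 0.75: 3 of the 4 pairs are ranked correctly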

Since it can take a bit of time to get results for each model, I’ve written the code to print out results along the way. Once the pipeline run is complete, you’ll have a comprehensive overview of what model to use for whatever classification task you have in mind.
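Once you’ve picked a winner, refit its pipeline on the training data and confirm performance on the held-out test set we split off earlier. Here’s a minimal sketch, assuming XGBClassifier came out on top:

# Refit the winning model's pipeline and evaluate on the untouched test set
best_pipeline = get_pipeline(X_train, XGBClassifier(verbosity=0, eval_metric='logloss', objective='binary:logistic'))
best_pipeline.fit(X_train, y_train)

y_proba = best_pipeline.predict_proba(X_test)[:, 1]
print(classification_report(y_test, best_pipeline.predict(X_test)))
print('Test ROC AUC:', round(roc_auc_score(y_test, y_proba), 3))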
