Data Preparation with Pipeline and ColumnTransformer
One way to prepare data for machine learning algorithms is to implement the preprocessing and feature engineering steps one by one. A more convenient way is to create a class with fit and transform methods for each step, combine these classes into a set of pipelines (for instance, one pipeline for float-type columns and another for object-type columns), and finally apply each pipeline to the desired set of columns. We can accomplish this using the Pipeline and ColumnTransformer classes from the scikit-learn library.
○ Contents
- Libraries and Modules
- Data
- Feature-Target Split
- Custom Transformer for Dropping Constant Columns
- Feature Data Type
- Custom Transformer for Missing Data Imputation
- Custom Transformer for Ordinal Encoding and One-Hot Encoding
- Custom Transformer for Feature Scaling
- Pipelines
- ColumnTransformer
- Train-Test Split
- Fit and Transform
- Training, Inference, and Evaluation
○ Libraries and Modules
We begin by importing the numpy and pandas libraries.
import numpy as np
import pandas as pd
To split the data into a training set and a test set, we make use of the train_test_split function from the sklearn.model_selection module of the scikit-learn library.
from sklearn.model_selection import train_test_split
The next imports are required for building the custom transformer classes and for constructing the Pipeline and ColumnTransformer objects.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
We import the logistic regression classifier, which we shall train on the processed data.
from sklearn.linear_model import LogisticRegression
○ Data
We load the dataset using the read_csv function from pandas.
data = pd.read_csv('https://raw.githubusercontent.com/sugatagh/Patient-Survival-Prediction-using-Deep-Learning/main/Dataset/Dataset.csv')
Calling data.info() provides basic information on the dataset: it contains \(91713\) observations with a total of \(186\) columns. Specifically, it has \(170\) float-type columns, \(8\) integer-type columns, and \(8\) object-type columns.
The objective here is to predict the binary variable hospital_death based on the independent variables; thus, it is a binary classification problem.
○ Feature-Target Split
The following function splits a given target variable from the independent variables of a dataset.
def predictor_target_split(data, target):
    X = data.drop(target, axis = 1)
    y = data[target]
    return X, y
Here, we split the target variable hospital_death from the rest of the variables in the dataset.
X, y = predictor_target_split(data, 'hospital_death')
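As a quick sanity check, X should retain all columns except the target, and y should hold the target alone:
print(X.shape) # expected: (91713, 185)
print(y.shape) # expected: (91713,)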
○ Custom Transformer for Dropping Constant Columns
We build a custom transformer for dropping constant columns. The fit method identifies the constant columns of the input DataFrame X. If inplace = True, the transform method drops the constant columns from X itself and returns it; otherwise, it leaves X unaltered and returns a copy of X with the constant columns dropped.
class CustomDropConstant(BaseEstimator, TransformerMixin):
    def __init__(self, inplace = True):
        self.inplace = inplace
    def fit(self, X, y = None):
        # Record the columns of X with a single unique value
        self.constant = X.columns[X.nunique() == 1].tolist()
        return self
    def transform(self, X):
        if self.inplace:
            X.drop(self.constant, axis = 1, inplace = True)
            return X
        X_drop = X.copy(deep = True)
        X_drop.drop(self.constant, axis = 1, inplace = True)
        return X_drop
We create an object of CustomDropConstant with inplace = True. Then, we fit it on the feature DataFrame X and transform it; the two steps are combined through the fit_transform method.
dropconstant = CustomDropConstant(inplace = True)
dropconstant.fit_transform(X)
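To see the transformer in isolation, consider a small made-up DataFrame (not part of the patient dataset), where the constant column b is identified and dropped:
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [7, 7, 7], 'c': ['x', 'y', 'x']})
CustomDropConstant(inplace = False).fit_transform(toy) # returns a copy with only 'a' and 'c'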
○ Feature Data Type
We list the integer-type, float-type, and object-type features.
cols_features_int = list(X.select_dtypes(include = ['int64']).columns.values)
cols_features_flt = list(X.select_dtypes(include = ['float64']).columns.values)
cols_features_obj = list(X.select_dtypes(include = ['object']).columns.values)
○ Custom Transformer for Missing Data Imputation
The next custom transformer imputes the missing values present in the dataset. The method of imputation is determined by the parameter strategy.
- strategy = 'mean': imputes the missing values in a numerical column by its mean.
- strategy = 'median': imputes the missing values in a numerical column by its median.
- strategy = 'most_frequent': imputes the missing values in a categorical column by its mode.
- strategy = 'constant': imputes the missing values in a column by an input value fill_value, initialized at \(0\).
The fit method computes the values that will be used to impute the missing entries in the different columns: for \(n\) columns, it computes a vector of length \(n\), one component per column. The transform method uses this vector to return a copy of the input DataFrame with the missing values imputed.
class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, strategy = 'median', fill_value = 0):
        self.strategy = strategy
        self.fill_value = fill_value
    def fit(self, X, y = None):
        # Compute one imputation value per column of X
        if self.strategy == 'mean':
            self.scores = X.mean()
        elif self.strategy == 'median':
            self.scores = X.median()
        elif self.strategy == 'most_frequent':
            self.scores = X.mode().iloc[0]
        elif self.strategy == 'constant':
            self.scores = pd.Series(data = self.fill_value, index = X.columns)
        return self
    def transform(self, X):
        X_imputed = X.copy(deep = True)
        for col in X_imputed.columns:
            X_imputed[col] = X_imputed[col].fillna(self.scores[col])
        return X_imputed
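As a sketch of the imputer in action, on a made-up frame with one missing entry per column:
toy = pd.DataFrame({'u': [1.0, np.nan, 3.0], 'v': [10.0, 20.0, np.nan]})
CustomImputer(strategy = 'median').fit_transform(toy) # NaN in 'u' -> 2.0, NaN in 'v' -> 15.0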
○ Custom Transformer for Ordinal Encoding and One-Hot Encoding
The next custom transformer encodes the categorical variables in the dataset. The method of encoding is determined by the parameter strategy.
- strategy = 'ordinal': encodes the categorical variables with an ordinal encoder.
- strategy = 'onehot': encodes the categorical variables with a one-hot encoder.
When strategy = 'ordinal', the fit method computes a series whose keys are the columns of the input DataFrame and whose data are dictionaries mapping the unique values of each column to integers (starting from \(0\)). When strategy = 'onehot', it records the categories observed in each column during fitting, so that the training and test sets are expanded into the same set of indicator columns. The transform method returns a copy of the input DataFrame with the categorical variables encoded under the scheme dictated by strategy.
class CustomEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, strategy = 'ordinal'):
        self.strategy = strategy
    def fit(self, X, y = None):
        if self.strategy == 'ordinal':
            self.encoder = pd.Series(dtype = 'object')
            for col in X.columns:
                keys_ = X[col].unique().tolist()
                values_ = np.arange(len(keys_))
                self.encoder[col] = {keys_[i]: values_[i] for i in range(len(keys_))}
        elif self.strategy == 'onehot':
            # Record the categories seen during fitting, so that the training
            # and test sets produce the same indicator columns
            self.categories = {col: X[col].unique().tolist() for col in X.columns}
        return self
    def transform(self, X):
        X_encoded = X.copy(deep = True)
        if self.strategy == 'ordinal':
            for col in X_encoded.columns:
                X_encoded[col] = X_encoded[col].replace(self.encoder[col])
        elif self.strategy == 'onehot':
            for col in X_encoded.columns:
                for cat in self.categories[col]:
                    # Indicator column: 1 where the category matches, 0 otherwise
                    X_encoded[col + '_' + str(cat)] = (X_encoded[col] == cat).astype(int)
                X_encoded.drop(col, axis = 1, inplace = True)
        return X_encoded
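To compare the two schemes on a made-up column:
toy = pd.DataFrame({'color': ['red', 'blue', 'red']})
CustomEncoder(strategy = 'ordinal').fit_transform(toy) # red -> 0, blue -> 1
CustomEncoder(strategy = 'onehot').fit_transform(toy) # indicator columns color_red, color_blue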
○ Custom Transformer for Feature Scaling
The next custom transformer scales the feature variables in the dataset. The method of scaling is determined by the parameter strategy.
- strategy = 'standard': standardizes the columns using mean and standard deviation.
- strategy = 'minmax': normalizes the columns using minimum and maximum.
The fit method computes a vector of column means and a vector of column standard deviations when strategy = 'standard'; it computes a vector of column minimums and a vector of column ranges (maximum minus minimum) when strategy = 'minmax'. The transform method implements the scaling scheme specified by strategy.
class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self, strategy = 'standard'):
        self.strategy = strategy
    def fit(self, X, y = None):
        # Location and scale parameters, one pair per column
        if self.strategy == 'standard':
            self.loc = X.mean()
            self.scl = X.std()
        elif self.strategy == 'minmax':
            self.loc = X.min()
            self.scl = X.max() - X.min()
        return self
    def transform(self, X):
        X_scaled = X.copy(deep = True)
        # Scale only the numerical columns
        cols = list(X_scaled.select_dtypes(include = ['number']).columns.values)
        for col in cols:
            X_scaled[col] = (X_scaled[col] - self.loc[col]) / self.scl[col]
        return X_scaled
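A small check of the two scaling schemes on made-up numbers:
toy = pd.DataFrame({'w': [1.0, 2.0, 3.0]})
CustomScaler(strategy = 'standard').fit_transform(toy) # mean 2.0, sd 1.0 -> [-1.0, 0.0, 1.0]
CustomScaler(strategy = 'minmax').fit_transform(toy) # (w - min) / (max - min) -> [0.0, 0.5, 1.0]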
○ Pipelines
We begin by creating a Pipeline constructor for integer features.
pipeline_int = Pipeline([
("imputer", CustomImputer(strategy = 'median')),
("scaler", CustomScaler(strategy = 'standard'))
])
Next, we create a Pipeline constructor for float features.
pipeline_flt = Pipeline([
("imputer", CustomImputer(strategy = 'median')),
("scaler", CustomScaler(strategy = 'standard'))
])
Next, we create a Pipeline constructor for object features.
pipeline_obj = Pipeline([
("imputer", CustomImputer(strategy = 'most_frequent')),
("encoder", CustomEncoder(strategy = 'ordinal')),
("scaler", CustomScaler(strategy = 'minmax'))
])
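A Pipeline simply chains the fit and transform calls of its steps. For instance, applying pipeline_obj to a made-up object column imputes the mode, ordinal-encodes the result, and then min-max scales it:
toy = pd.DataFrame({'grade': ['A', 'B', np.nan, 'A']})
pipeline_obj.fit_transform(toy) # NaN -> 'A' (mode), then A -> 0 and B -> 1, then min-max scaling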
○ ColumnTransformer
We combine the pipelines and the corresponding lists of columns to create a ColumnTransformer constructor.
pipeline_full = ColumnTransformer([
("int", pipeline_int, cols_features_int),
("flt", pipeline_flt, cols_features_flt),
("obj", pipeline_obj, cols_features_obj)
])
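Note that, by default, ColumnTransformer drops any column not listed in its transformers (remainder = 'drop'). Here the three lists together cover every feature column, so the default is fine; otherwise, untouched columns can be carried through with remainder = 'passthrough', as in the hypothetical variant below (the name "num" is illustrative):
ColumnTransformer([("num", pipeline_flt, cols_features_flt)], remainder = 'passthrough')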
○ Train-Test Split
We split the dataset into a training set and a test set in the \(80:20\) ratio, stratifying the split using the target variable and shuffling the data before the split.
X_train, X_test, y_train, y_test = train_test_split(
X, y,
stratify = y,
test_size = 0.2,
shuffle = True
)
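The split is random, so the exact accuracy reported later will vary slightly from run to run. For reproducibility, one may additionally pass a fixed random_state (the value 42 below is arbitrary):
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify = y, test_size = 0.2, shuffle = True, random_state = 42
)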
○ Fit and Transform
We fit pipeline_full to the training features and transform both the training features and the test features. Note that the fit_transform method (which applies the fit and transform methods sequentially) and the transform method return arrays, which have to be converted back to DataFrames.
def CustomTransformer(X_train, X_test):
    # Fit the full pipeline on the training features only, then apply it to both sets
    X_train_out_arr = pipeline_full.fit_transform(X_train)
    X_test_out_arr = pipeline_full.transform(X_test)
    # The ColumnTransformer orders the output columns as in its transformer list
    cols = cols_features_int + cols_features_flt + cols_features_obj
    X_train_out = pd.DataFrame(data = X_train_out_arr, index = X_train.index, columns = cols)
    X_test_out = pd.DataFrame(data = X_test_out_arr, index = X_test.index, columns = cols)
    return X_train_out, X_test_out
X_train, X_test = CustomTransformer(X_train, X_test)
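A minimal sanity check on the processed frames: all missing values should have been imputed, and every column should now be numeric.
print(X_train.isna().sum().sum()) # expected: 0
print(X_train.select_dtypes(include = ['object']).shape[1]) # expected: 0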
○ Training, Inference, and Evaluation
We load a baseline logistic regression classifier and train it on the processed training data using the fit method. Then, we predict the test labels from the processed test features using the predict method. Finally, we evaluate the performance through the accuracy metric using the score method.
clf = LogisticRegression(max_iter = 500) # Model
clf.fit(X_train, y_train) # Training
y_pred = clf.predict(X_test) # Inference
print(f"Test accuracy: {clf.score(X_test, y_test)}") # Evaluation
# Output - Test accuracy: 0.9274382598266369
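Accuracy alone can be optimistic when the classes are imbalanced, as is typical for mortality data. As a complementary, threshold-independent metric, one may also report the ROC AUC (a sketch using the standard sklearn.metrics API):
from sklearn.metrics import roc_auc_score
y_prob = clf.predict_proba(X_test)[:, 1] # predicted probability of hospital_death = 1
print(f"Test ROC AUC: {roc_auc_score(y_test, y_prob)}")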