Data Preparation with Pipeline and ColumnTransformer
One way to prepare data for machine learning algorithms is to implement the preprocessing and feature engineering steps one by one. A more convenient way is to create a class with fit and transform methods for each step, combine these classes into a set of pipelines (for instance, one pipeline for float-type columns and another for object-type columns), and finally apply each pipeline to the desired set of columns. We can accomplish this using the Pipeline and ColumnTransformer classes from the scikit-learn library.
○ Contents
- Libraries and Modules
- Data
- Feature-Target Split
- Custom Transformer for Dropping Constant Columns
- Feature Data Type
- Custom Transformer for Missing Data Imputation
- Custom Transformer for Ordinal Encoding and One-Hot Encoding
- Custom Transformer for Feature Scaling
- Pipelines
- ColumnTransformer
- Train-Test Split
- Fit and Transform
- Training, Inference, and Evaluation
○ Libraries and Modules
We begin by importing the numpy and pandas libraries.
import numpy as np
import pandas as pd
To split the data into a training set and a test set, we make use of the train_test_split function from the sklearn.model_selection module of the scikit-learn library.
from sklearn.model_selection import train_test_split
The next imports are required for building the custom transformer classes and for constructing the Pipeline and ColumnTransformer objects.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
We import the logistic regression classifier, which we shall train on the processed data.
from sklearn.linear_model import LogisticRegression
○ Data
We load the dataset using the read_csv function from pandas.
data = pd.read_csv('https://raw.githubusercontent.com/sugatagh/Patient-Survival-Prediction-using-Deep-Learning/main/Dataset/Dataset.csv')
Calling data.info() provides basic information on the dataset: it contains \(91713\) observations with a total of \(186\) columns. Specifically, it has \(170\) float-type columns, \(8\) integer-type columns, and \(8\) object-type columns.
The objective here is to predict the binary variable hospital_death based on the independent variables; thus, it is a binary classification problem.
○ Feature-Target Split
The following function splits a given target variable from the independent variables of a dataset.
def predictor_target_split(data, target):
    X = data.drop(target, axis = 1)
    y = data[target]
    return X, y
Here, we split the target variable hospital_death from the rest of the variables in the dataset.
X, y = predictor_target_split(data, 'hospital_death')
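As a quick sanity check, X should retain all columns except the target, and y should hold the target alone:
print(X.shape) # expected: (91713, 185)
print(y.shape) # expected: (91713,)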
○ Custom Transformer for Dropping Constant Columns
We build a custom transformer for dropping constant columns. The fit method identifies the constant columns of the input DataFrame X. If inplace = True, the transform method drops the constant columns from X itself and returns it; otherwise, it leaves X unaltered and returns a copy of X with the constant columns dropped.
class CustomDropConstant(BaseEstimator, TransformerMixin):
    def __init__(self, inplace = True):
        self.inplace = inplace
    def fit(self, X, y = None):
        # Record the columns of X with a single unique value
        self.constant = X.columns[X.nunique() == 1].tolist()
        return self
    def transform(self, X):
        if self.inplace:
            X.drop(self.constant, axis = 1, inplace = True)
            return X
        X_drop = X.copy(deep = True)
        X_drop.drop(self.constant, axis = 1, inplace = True)
        return X_drop
We create an object of CustomDropConstant with inplace = True. Then, we fit it on the feature DataFrame X and transform it; the two steps are combined through the fit_transform method.
dropconstant = CustomDropConstant(inplace = True)
dropconstant.fit_transform(X)
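To see the transformer in isolation, consider a small made-up DataFrame (not part of the patient dataset), where the constant column b is identified and dropped:
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [7, 7, 7], 'c': ['x', 'y', 'x']})
CustomDropConstant(inplace = False).fit_transform(toy) # returns a copy with only 'a' and 'c'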
○ Feature Data Type
We list the integer-type, float-type, and object-type features.
cols_features_int = list(X.select_dtypes(include = ['int64']).columns.values)
cols_features_flt = list(X.select_dtypes(include = ['float64']).columns.values)
cols_features_obj = list(X.select_dtypes(include = ['object']).columns.values)
○ Custom Transformer for Missing Data Imputation
The next custom transformer imputes the missing values present in the dataset. The method of imputation is determined by the parameter strategy.
- strategy = 'mean': imputes the missing values in a numerical column by its mean.
- strategy = 'median': imputes the missing values in a numerical column by its median.
- strategy = 'most_frequent': imputes the missing values in a categorical column by its mode.
- strategy = 'constant': imputes the missing values in a column by an input value fill_value, initialized at \(0\).
The fit method computes the values that will be used to impute the missing entries in the different columns: for \(n\) columns, it computes a vector of length \(n\), one component per column. The transform method uses this vector to return a copy of the input DataFrame with the missing values imputed.
class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, strategy = 'median', fill_value = 0):
        self.strategy = strategy
        self.fill_value = fill_value
    def fit(self, X, y = None):
        # Compute one imputation value per column of X
        if self.strategy == 'mean':
            self.scores = X.mean()
        elif self.strategy == 'median':
            self.scores = X.median()
        elif self.strategy == 'most_frequent':
            self.scores = X.mode().iloc[0]
        elif self.strategy == 'constant':
            self.scores = pd.Series(data = self.fill_value, index = X.columns)
        return self
    def transform(self, X):
        X_imputed = X.copy(deep = True)
        for col in X_imputed.columns:
            X_imputed[col] = X_imputed[col].fillna(self.scores[col])
        return X_imputed
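As a sketch of the imputer in action, on a made-up frame with one missing entry per column:
toy = pd.DataFrame({'u': [1.0, np.nan, 3.0], 'v': [10.0, 20.0, np.nan]})
CustomImputer(strategy = 'median').fit_transform(toy) # NaN in 'u' -> 2.0, NaN in 'v' -> 15.0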
○ Custom Transformer for Ordinal Encoding and One-Hot Encoding
The next custom transformer encodes the categorical variables in the dataset. The method of encoding is determined by the parameter strategy.
- strategy = 'ordinal': encodes the categorical variables with an ordinal encoder.
- strategy = 'onehot': encodes the categorical variables with a one-hot encoder.
When strategy = 'ordinal', the fit method computes a series whose keys are the columns of the input DataFrame and whose data are dictionaries mapping the unique values of each column to integers (starting from \(0\)). When strategy = 'onehot', it records the categories observed in each column during fitting, so that the training and test sets are expanded into the same set of indicator columns. The transform method returns a copy of the input DataFrame with the categorical variables encoded under the scheme dictated by strategy.
class CustomEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, strategy = 'ordinal'):
        self.strategy = strategy
    def fit(self, X, y = None):
        if self.strategy == 'ordinal':
            self.encoder = pd.Series(dtype = 'object')
            for col in X.columns:
                keys_ = X[col].unique().tolist()
                values_ = np.arange(len(keys_))
                self.encoder[col] = {keys_[i]: values_[i] for i in range(len(keys_))}
        elif self.strategy == 'onehot':
            # Record the categories seen during fitting, so that the training
            # and test sets produce the same indicator columns
            self.categories = {col: X[col].unique().tolist() for col in X.columns}
        return self
    def transform(self, X):
        X_encoded = X.copy(deep = True)
        if self.strategy == 'ordinal':
            for col in X_encoded.columns:
                X_encoded[col] = X_encoded[col].replace(self.encoder[col])
        elif self.strategy == 'onehot':
            for col in X_encoded.columns:
                for cat in self.categories[col]:
                    # Indicator column: 1 where the category matches, 0 otherwise
                    X_encoded[col + '_' + str(cat)] = (X_encoded[col] == cat).astype(int)
                X_encoded.drop(col, axis = 1, inplace = True)
        return X_encoded
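To compare the two schemes on a made-up column:
toy = pd.DataFrame({'color': ['red', 'blue', 'red']})
CustomEncoder(strategy = 'ordinal').fit_transform(toy) # red -> 0, blue -> 1
CustomEncoder(strategy = 'onehot').fit_transform(toy) # indicator columns color_red, color_blue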
○ Custom Transformer for Feature Scaling
The next custom transformer scales the feature variables in the dataset. The method of scaling is determined by the parameter strategy.
- strategy = 'standard': standardizes the columns using mean and standard deviation.
- strategy = 'minmax': normalizes the columns using minimum and maximum.
The fit method computes a vector of column means and a vector of column standard deviations when strategy = 'standard'; it computes a vector of column minimums and a vector of column ranges (maximum minus minimum) when strategy = 'minmax'. The transform method implements the scaling scheme specified by strategy.
class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self, strategy = 'standard'):
        self.strategy = strategy
    def fit(self, X, y = None):
        # Location and scale parameters, one pair per column
        if self.strategy == 'standard':
            self.loc = X.mean()
            self.scl = X.std()
        elif self.strategy == 'minmax':
            self.loc = X.min()
            self.scl = X.max() - X.min()
        return self
    def transform(self, X):
        X_scaled = X.copy(deep = True)
        # Scale only the numerical columns
        cols = list(X_scaled.select_dtypes(include = ['number']).columns.values)
        for col in cols:
            X_scaled[col] = (X_scaled[col] - self.loc[col]) / self.scl[col]
        return X_scaled
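A small check of the two scaling schemes on made-up numbers:
toy = pd.DataFrame({'w': [1.0, 2.0, 3.0]})
CustomScaler(strategy = 'standard').fit_transform(toy) # mean 2.0, sd 1.0 -> [-1.0, 0.0, 1.0]
CustomScaler(strategy = 'minmax').fit_transform(toy) # (w - min) / (max - min) -> [0.0, 0.5, 1.0]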
○ Pipelines
We begin by creating a Pipeline constructor for integer features.
pipeline_int = Pipeline([
("imputer", CustomImputer(strategy = 'median')),
("scaler", CustomScaler(strategy = 'standard'))
])
Next, we create a Pipeline constructor for float features.
pipeline_flt = Pipeline([
("imputer", CustomImputer(strategy = 'median')),
("scaler", CustomScaler(strategy = 'standard'))
])
Next, we create a Pipeline constructor for object features.
pipeline_obj = Pipeline([
("imputer", CustomImputer(strategy = 'most_frequent')),
("encoder", CustomEncoder(strategy = 'ordinal')),
("scaler", CustomScaler(strategy = 'minmax'))
])
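A Pipeline simply chains the fit and transform calls of its steps. For instance, applying pipeline_obj to a made-up object column imputes the mode, ordinal-encodes the result, and then min-max scales it:
toy = pd.DataFrame({'grade': ['A', 'B', np.nan, 'A']})
pipeline_obj.fit_transform(toy) # NaN -> 'A' (mode), then A -> 0 and B -> 1, then min-max scaling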
○ ColumnTransformer
We combine the pipelines and the corresponding lists of columns to create a ColumnTransformer constructor.
pipeline_full = ColumnTransformer([
("int", pipeline_int, cols_features_int),
("flt", pipeline_flt, cols_features_flt),
("obj", pipeline_obj, cols_features_obj)
])
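Note that, by default, ColumnTransformer drops any column not listed in its transformers (remainder = 'drop'). Here the three lists together cover every feature column, so the default is fine; otherwise, untouched columns can be carried through with remainder = 'passthrough', as in the hypothetical variant below (the name "num" is illustrative):
ColumnTransformer([("num", pipeline_flt, cols_features_flt)], remainder = 'passthrough')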
○ Train-Test Split
We split the dataset into a training set and a test set in the \(80:20\) ratio, stratifying the split using the target variable and shuffling the data before the split.
X_train, X_test, y_train, y_test = train_test_split(
X, y,
stratify = y,
test_size = 0.2,
shuffle = True
)
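The split is random, so the exact accuracy reported later will vary slightly from run to run. For reproducibility, one may additionally pass a fixed random_state (the value 42 below is arbitrary):
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify = y, test_size = 0.2, shuffle = True, random_state = 42
)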
○ Fit and Transform
We fit pipeline_full to the training features and transform both the training features and the test features. Note that the fit_transform method (which applies the fit and transform methods sequentially) and the transform method return arrays, which have to be converted back to DataFrames.
def CustomTransformer(X_train, X_test):
    # Fit the full pipeline on the training features only, then apply it to both sets
    X_train_out_arr = pipeline_full.fit_transform(X_train)
    X_test_out_arr = pipeline_full.transform(X_test)
    # The ColumnTransformer orders the output columns as in its transformer list
    cols = cols_features_int + cols_features_flt + cols_features_obj
    X_train_out = pd.DataFrame(data = X_train_out_arr, index = X_train.index, columns = cols)
    X_test_out = pd.DataFrame(data = X_test_out_arr, index = X_test.index, columns = cols)
    return X_train_out, X_test_out
X_train, X_test = CustomTransformer(X_train, X_test)
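A minimal sanity check on the processed frames: all missing values should have been imputed, and every column should now be numeric.
print(X_train.isna().sum().sum()) # expected: 0
print(X_train.select_dtypes(include = ['object']).shape[1]) # expected: 0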
○ Training, Inference, and Evaluation
We load a baseline logistic regression classifier and train it on the processed training data using the fit method. Then, we predict the test labels from the processed test features using the predict method. Finally, we evaluate the performance through the accuracy metric using the score method.
clf = LogisticRegression(max_iter = 500) # Model
clf.fit(X_train, y_train) # Training
y_pred = clf.predict(X_test) # Inference
print(f"Test accuracy: {clf.score(X_test, y_test)}") # Evaluation
# Output - Test accuracy: 0.9274382598266369
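Accuracy alone can be optimistic when the classes are imbalanced, as is typical for mortality data. As a complementary, threshold-independent metric, one may also report the ROC AUC (a sketch using the standard sklearn.metrics API):
from sklearn.metrics import roc_auc_score
y_prob = clf.predict_proba(X_test)[:, 1] # predicted probability of hospital_death = 1
print(f"Test ROC AUC: {roc_auc_score(y_test, y_prob)}")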