Notes
Data Preparation with Pipeline and ColumnTransformer
One way to prepare data for machine learning algorithms is to implement the preprocessing and feature engineering steps one by one. A more convenient way is to create a class object for each step with fit and transform methods, combine these objects into a set of pipelines (for instance, one pipeline for float-type columns and another for object-type columns), and finally apply each pipeline to the desired set of columns. We can accomplish this using the Pipeline and ColumnTransformer classes from the scikit-learn library.
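As a rough sketch of what this could look like in code (the column names, toy data, and imputation choices below are illustrative assumptions, not taken from the note):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical frame with two float columns and one object column
df = pd.DataFrame({
    "age": [25.0, 32.0, None, 41.0],
    "income": [40000.0, 52000.0, 61000.0, None],
    "city": ["Ankara", "Izmir", None, "Ankara"],
})

# One pipeline for the float columns, another for the object columns
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# ColumnTransformer applies each pipeline to its own set of columns
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, ["age", "income"]),
    ("cat", categorical_pipeline, ["city"]),
])

X_prepared = preprocessor.fit_transform(df)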
Implementing Neural Network with Callbacks
A callback is an object that can perform certain actions at various stages of training; it is typically used to observe or adjust specific behaviors and is called periodically throughout the training procedure. In machine learning, callbacks can specify what happens before, during, or after a training epoch or a single batch. In this note, we implement the EarlyStopping, LearningRateScheduler, and ModelCheckpoint callbacks in the training of a neural network.
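A minimal sketch of how these three callbacks might be wired into a Keras training run; the model, the random data, the halving schedule, and the checkpoint file name are illustrative assumptions:

import numpy as np
import tensorflow as tf

# Made-up data standing in for a real dataset
X = np.random.rand(1000, 20).astype("float32")
y = np.random.randint(0, 2, size=(1000,))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

def schedule(epoch, lr):
    # Halve the learning rate every 10 epochs (an arbitrary example schedule)
    return lr * 0.5 if epoch > 0 and epoch % 10 == 0 else lr

callbacks = [
    # Stop when the validation loss has not improved for 5 epochs
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
    tf.keras.callbacks.LearningRateScheduler(schedule),
    # Keep only the weights with the best validation loss seen so far
    tf.keras.callbacks.ModelCheckpoint("best_model.keras", monitor="val_loss",
                                       save_best_only=True),
]

model.fit(X, y, validation_split=0.2, epochs=50, batch_size=32,
          callbacks=callbacks)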
A Simple Implementation of Support Vector Machine (SVM)
In this note, we implement the support vector machine algorithm to recognize handwritten digits. The MNIST dataset available in Keras is in tuple format. We fetch the training set and test set from this dataset, visualize some observations, and unroll the 2-D pixel values into 1-D arrays. Then, we train the SVM classifier with default hyperparameters on the training set, use the fitted model to predict on the test set, and evaluate the performance in terms of accuracy.
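One possible sketch of that workflow; the subsampling of the training set and the division by 255 are additions to keep the example fast and well scaled, not steps described in the note:

import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score
from tensorflow.keras.datasets import mnist

# Keras returns MNIST as ((x_train, y_train), (x_test, y_test)) tuples
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Subsample so the default SVC fits in a reasonable time (assumption, not from the note)
x_train, y_train = x_train[:10000], y_train[:10000]

# Unroll the 28x28 pixel grids into 784-dimensional vectors
x_train = x_train.reshape(len(x_train), -1) / 255.0
x_test = x_test.reshape(len(x_test), -1) / 255.0

clf = svm.SVC()              # default hyperparameters (RBF kernel, C=1.0)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print("accuracy:", accuracy_score(y_test, y_pred))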
Term Frequency-Inverse Document Frequency (TF-IDF)
Term frequency-inverse document frequency, or TF-IDF, is a numerical statistic used in information retrieval that aims to reflect how significant a word is to a document in a collection or corpus. It is frequently used as a weighting factor in text mining, user modeling, and information retrieval searches. The TF-IDF value increases in proportion to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which accounts for the fact that some words appear more frequently than others in general.
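As a rough illustration, the weighting can be computed with scikit-learn's TfidfVectorizer on a made-up three-document corpus; note that scikit-learn uses a smoothed IDF variant and normalizes each document vector by default:

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus of three short documents
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# tf-idf(t, d) = tf(t, d) * idf(t), where scikit-learn's smoothed
# idf(t) = ln((1 + n) / (1 + df(t))) + 1 for n documents
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# A word like "the" appears in most documents, so its weight is dampened
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))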
Implementing Logistic Regression from Scratch
Many advanced libraries, such as scikit-learn, make it possible to carry out training and inference with a few lines of code. While this is very convenient for day-to-day practice, it does not give insight into what really happens underneath when we run that code. In this note, we implement a logistic regression model from scratch, without using any advanced library, to understand how it works in the context of binary classification.
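One possible minimal version of such an implementation, using batch gradient descent on the log loss with made-up data; the learning rate and iteration count are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # toy, linearly separable labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.1

for _ in range(1000):
    p = sigmoid(X @ w + b)             # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)    # gradient of the log loss w.r.t. w
    grad_b = np.mean(p - y)            # gradient of the log loss w.r.t. b
    w -= lr * grad_w
    b -= lr * grad_b

preds = (sigmoid(X @ w + b) >= 0.5).astype(float)
print("training accuracy:", (preds == y).mean())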