Electron Energy Flux Prediction

April 30, 2022

In McGranaghan et al. (2021), the authors considered the problem of modeling electron particle precipitation from the magnetosphere to the ionosphere. They attempted to address the problem through a new particle precipitation database, using machine learning tools to extract useful information from it. The new database contains 51 satellite years of Defense Meteorological Satellite Program (DMSP) observations. Based on this, we aim to predict the continuous target variable electron total energy flux.

GitHub repository

○ Overview

In McGranaghan et al. (2021), the authors have considered the problem of modeling electron particle precipitation from the magnetosphere to the ionosphere. They attempted to address it through a new database, using machine learning tools to extract useful information from it.
Based on that database, we aim to predict electron total energy flux, which is a continuous variable.
A detailed exploratory data analysis on the dataset is carried out. In particular, we observe that there are clear groups among the feature variables. We investigate the group-wise correlation structure through averaging the pairwise correlation coefficients.
We use the insights obtained from EDA in the data preprocessing stages, which consists of feature extraction, data transformation, feature scaling, and principal component analysis.
We build a neural network and tune it to predict electron total energy flux.
We employ the root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (\(R^2\)) metrics to evaluate the models. Performance summary of the final model:

○ Introduction

Data
Project Objective
Evaluation Metric

Data

The dataset accompanies this paper published in the AGU Space Weather Journal and is used to produce new machine learning models of particle precipitation from the magnetosphere to the ionosphere. The authors have attempted to make these data ready to be used in artificial intelligence or machine learning explorations, following a community definition of AI-ready provided at this repository. The dataset contains \(1945789\) observations, each containing \(74\) variables.

Project Objective

The goal is to predict ELE_TOTAL_ENERGY_FLUX, which is a continuous variable, based on the features available in the dataset.

Evaluation Metric

We use the trained models to predict on the test dataset and evaluate the models in terms of root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (\(R^2\)).

○ Exploratory Data Analysis

Dataset Synopsis
Univariate Analysis
Multivariate Analysis

Dataset Synopsis

Number of observations: \(1945789\)
Number of columns: \(74\)
Number of integer columns: \(0\)
Number of float columns: \(73\)
Number of object columns: \(1\)
Number of duplicate observations: \(0\)
Constant columns: None
Number of columns with missing values: \(0\)
Memory Usage: \(1098.54\) MB

Univariate Analysis

We visualize the distribution of the continuous target variable ELE_TOTAL_ENERGY_FLUX through histogram and observe that it is extremely skewed in linear scale. So, we set log_scale = True in seaborn histplot and visualize the distribution on logarithmic scale.

Observations

The target variable has extremely left-skewed distribution, which suggests applying log-transformation \(y \mapsto \log{y}\) on it
Under logarithmic scale, the distribution appears to be bimodal

Next, we visualize the distributions of the numerical features.

Observations

Extremely skewed columns: ELE_AVG_ENERGY
Moderately skewed columns: ELE_TOTAL_ENERGY_FLUX_STD, ELE_AVG_ENERGY_STD, F107, AE, AU, F107_6hr, AE_6hr, AU_6hr, psw_6hr, F107_5hr, AE_5hr, AU_5hr, PC_5hr, psw_5hr, F107_3hr, AE_3hr, AU_3hr, psw_3hr, F107_1hr, AE_1hr, AU_1hr, psw_1hr, F107_45min, AE_45min, AU_45min, psw_45min, F107_30min, AE_30min, AU_30min, psw_30min, F107_15min, AE_15min, AU_15min, psw_15min, F107_10min, AE_10min, AU_10min, psw_10min, F107_5min, AE_5min, AU_5min, psw_5min
Extremely leptokurtic columns: PC_6hr, PC_5hr, PC_3hr, PC_1hr
Moderately skewed columns taking negative values: AL, psw, AL_6hr, AL_5hr, AL_3hr, AL_1hr, AL_45min, AL_30min, AL_15min, AL_10min, AL_5min
Columns with zero-inflated distribution: borovsky, newell, borovsky_45min, newell_45min, borovsky_30min, newell_30min, borovsky_15min, newell_15min, borovsky_10min, newell_10min, borovsky_5min, newell_5min

Multivariate Analysis

Barring the object type variable Datetimes, all the features are of the float datatype. We begin with investigating the pairwise correlation coefficients for the numerical feature variables.

We present a histogram of the pairwise correlation coefficient values to give an overall picture of the extention of the correlation among the numerical features in the dataset.

Observations

In total there are \(\binom{152}{2} = 11476\) pairs of numerical features
No significant correlation \((\left\vert r \right\vert < 0.1):\) \(6061\) pairs
Weak correlation \((0.1 \leq \left\vert r \right\vert < 0.3):\) \(3195\) pairs
Moderate correlation \((0.3 \leq \left\vert r \right\vert < 0.7):\) \(1653\) pairs
Strong correlation \((0.7 \leq \left\vert r \right\vert < 0.9):\) \(222\) pairs
Extremely strong correlation \((0.9 \leq \left\vert r \right\vert < 0.99):\) \(238\) pairs
Almost linear relationship \((0.99 \leq \left\vert r \right\vert \leq 1):\) \(107\) pairs

We observe that there are clear groups among the feature variables, as indicated by their names starting with certain common strings. For example, the features Bz, Bz_6hr, Bz_3hr, Bz_1hr, Bz_45min, Bz_30min, and Bz_10min appears to be related in some sense. We consider these different groups of features and investigate the group-wise correlation structure through averaging the pairwise correlation coefficients.

To elaborate, let us take a group \(G_1\) of the features \(X_1, X_2, \cdots, X_m\). We can now define the mean of within-group correlation coefficient as

\[\text{mean}(r_\text{within}) := \frac{1}{\binom{m}{2}} \sum_{i=1}^m \sum_{j=1;\,j \neq i}^m \text{cov}(X_i, X_j). \]

We define the variance of within-group correlation coefficient as

\[\text{var}(r_\text{within}) := \frac{1}{\binom{m}{2}} \sum_{i=1}^m \sum_{j=1;\,j \neq i}^m \big\{\text{cov}(X_i, X_j) - \text{mean}(r_\text{within})\big\}^2. \]

Now, let us take another group \(G_2\) of the features \(Y_1, Y_2, \cdots, Y_n\). Note that the group sizes of \(G_1\) and \(G_2\) do not necessarily have to be equal. We define the mean of between-group correlation coefficient as

\[\text{mean}(r_\text{between}) := \frac{1}{mn} \sum_{i=1}^m \sum_{j=1}^n \text{cov}(X_i, Y_j). \]

We define the variance of between-group correlation coefficient as

\[\text{var}(r_\text{between}) := \frac{1}{mn} \sum_{i=1}^m \sum_{j=1}^n \big\{\text{cov}(X_i, Y_j) - \text{mean}(r_\text{between})\big\}^2. \]

These measures prove to be useful in the group-wise analysis of the correlation structure among numerical features. We begin with forming different groups of features.

group_list = [col.split("_")[0] for col in data.columns if '_6hr' in col]
group_dict = {}
for item in group_list:
    group_dict[item] = [col for col in data.columns if item in col]
group_dict['AL'].remove('ELE_TOTAL_ENERGY_FLUX')

We illustrate the correlation structure among the groups through average pairwise correlation coefficients with a heatmap.

Note that the diagonal elements of the group-wise correlation matrix is not necessarily equal to \(1\), unlike the case of usual correlation matrices. This is because, in this scenario, the diagonal elements represent the average pairwise correlation coefficient of a group of features, instead of the correlation coefficient of a feature with itself.

To illustrate how this grouped heatmap works, we give an in-depth view of the top-left square, which shows the average correlation coefficient within the Bz-group.

Observations from the grouped heatmap

The diagonal squares show that all the groups have moderate to high average correlation among its features (taken as pairs). We shall take in-depth look into each of the groups.
The non-diagonal squares show that the average correlation between different groups varies from extremely negative to moderately positive. Some squares with strong (positive or negative) correlation, which we may want to look into, are given in first group \(\times\) second group format:
- vsw-group \(\times\) vx-group
- AE-group \(\times\) AL-group
- AE-group \(\times\) AU-group
- AL-group \(\times\) AU-group
- AE-group \(\times\) newell-group
- AE-group \(\times\) PC-group
- Bz-group \(\times\) newell-group
- AU-group \(\times\) newell-group
- AL-group \(\times\) PC-group
- AL-group \(\times\) newell-group
- SymH-group \(\times\) vsw-group
- SymH-group \(\times\) vx-group
- PC-group \(\times\) newell-group
- AU-group \(\times\) PC-group
- borovsky-group \(\times\) newell-group
- Bz-group \(\times\) borovsky-group
- AL-group \(\times\) SymH-group

The detailed visualizations for correlation structure within groups and between groups are available in this notebook.

○ Data Preprocessing

Feature Extraction from Datetimes
Log-transformation of the Target
Feature Scaling
Principle Component Analysis (PCA)

The preprocessing of data is carried out in four stages. We postpone the train-test split until the end of the second stage, as the first two stages have no scope of data leakage.

Feature Extraction from `Datetimes`

A typical value of the object type feature variable Datetimes looks like 1987-01-12 12:57:00. Clearly, it contains information on

Year
Month
Day
Hour
Minute
Second

We separate out these information as new features and subsequently drop the original Datetimes column from the dataset. A snapshot of the new features is given as follows:

Next, we give statistical description of the new features.

We drop the constant column second from the dataset.

Log-transformation of the Target

We observed in the exploratory data analysis that the target variable ELE_TOTAL_ENERGY_FLUX has an extremely left-skewed distribution. Thus, we transform the target by applying the map \(y \mapsto \log{y}\).

Train-Test Split. We split the dataset into training set and test set in \(80 : 20\) ratio. The training set contains \(1556631\) observations and the test set contains \(389158\) observations, each having \(77\) variables.

The distributions of the log-transformed target variable in the training set and test set are shown in the next diagram. The two distributions appear to be similar.

Feature Scaling

It may be natural for one feature variable to have a greater impact on classification procedure than another. But often this is generated artificially by the difference of range of values that the features take. The unit of measurement in which the features are measured can be one a possible reason for such occurrence.

This necessitates the practitioner to scale the features appropriately before feeding the data to machine learning algorithms. For this purpose, the min-max normalization transforms the features in the following way:

\[x \mapsto \frac{x - \min{\left(x\right)}}{\max{\left(x\right)} - \min{\left(x\right)}}\]

The next function implements this using the MinMaxScaler class from the scikit-learn library. To keep the transformation the same, we use the minimum and maximum values of the training columns to scale the corresponding columns in both the training set and the test set. Using the minimum and maximum values of the test columns for both datasets would have led to data leakage.

def scaler(X_train, X_test):
    scaling = MinMaxScaler().fit(X_train)
    X_train_scaled = scaling.transform(X_train)
    X_test_scaled = scaling.transform(X_test)
    return X_train_scaled, X_test_scaled

Principle Component Analysis (PCA)

We import the PCA class from the scikit-learn library and fit it on the training features. We set the argument n_components (the number of components to keep) to \(20\). The number of components kept and the ratio of variance explained by the retained components are summarized below.

We plot the cumulative explained variance ratio over the number of components.

A snapshot of the correlation structure of components and features is given as follows.

Next, we PCA-transform the training set features and give a snapshot of the transformed data.

Similarly, we PCA-transform the test set features and give a snapshot of the transformed data.

○ Baseline Neural Network

Model Creation
Model Compilation
Model Fitting
Model Evaluation

Model Creation

We add dense layers, with appropriate number of units and activation function, to a sequential model.

model = Sequential()
model.add(Dense(units = 128, input_dim = len(X_train.columns), activation = 'relu'))
model.add(Dense(units = 64, activation = 'relu'))
model.add(Dense(units = 32, activation = 'relu'))
model.add(Dense(units = 16, activation = 'relu'))
model.add(Dense(units = 4, activation = 'relu'))
model.add(Dense(units = 1, activation = 'linear'))

A summary of the model is given below.

Model Compilation

For model compilation, we employ the Adam optimizer.

adam_opt = Adam(learning_rate = 0.001, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-07, amsgrad = False)

An alternative choice is the stochastic gradient descent optimizer, which can be used for experimentation.

sgd_opt = SGD(learning_rate = 0.01, momentum = 0.9, nesterov = False)

We compile the model with mean squared error loss and Adam optimizer.

model.compile(loss = 'mean_squared_error', optimizer = adam_opt)

Model Fitting

Next, we fit the model for \(120\) epochs on the training set, with \(32\) samples in each batch of training. Specifically, we set aside the last \(20\%\) observations from the training set to constitute the validation set. The rest constitutes the effective training set. In each epoch, the model is trained over the effective training set and is evaluated on the validation set.

history = model.fit(
    X_train,
    y_train,
    epochs = 120,
    batch_size = 32,
    validation_split = 0.2
)

We present the model loss (mean squared error) values for both the effective training set and the validation set in the last fifteen epochs.

We plot the model loss for both the effective training set and the validation set against the epochs.

Model Evaluation

We use the fitted model to predict on the training set and the test set. The error distribution for prediction of ELE_TOTAL_ENERGY_FLUX_LOG in the test set is shown below.

We summarize the baseline model performance through the evaluation metrics RMSE, MAE, and \(R^2\).

○ Hyperparameter Tuning

Model Builder
Keras Tuner
Tuner Search
Tuned Model Fitting

In this project, we tune the number of units in the first dense layer and the learning rate for the optimizer using the KerasTuner library.

Model Builder

The next function builds a model, with specified set of values for the hyperparameters to be optimized.

def model_builder(hp):
    model = Sequential()
    model.add(Flatten(input_shape = (X_train.shape[1],)))

    hp_units = hp.Int('units', min_value = 128, max_value = 512, step = 128)
    model.add(Dense(units = hp_units, activation = 'relu'))
    model.add(Dense(units = 64, activation = 'relu'))
    model.add(Dense(units = 32, activation = 'relu'))
    model.add(Dense(units = 16, activation = 'relu'))
    model.add(Dense(units = 4, activation = 'relu'))
    model.add(Dense(units = 1, activation = 'linear'))

    hp_learning_rate = hp.Choice('learning_rate', values = [0.01, 0.001, 0.0001])
    model.compile(loss = 'mean_squared_error', optimizer = Adam(learning_rate = hp_learning_rate))

    return model

Keras Tuner

We now make the tuner using the Hyperband function, setting the objective as validation loss.

tuner = kt.Hyperband(model_builder, objective = 'val_loss', max_epochs = 120, factor = 2, directory = 'dir_1')

We summarize the search space for the tuner.

Tuner Search

We implement the tuner using the search function.

tuner.search(X_train, y_train, epochs = 120, validation_split = 0.2)

We retrieve the optimal values of the hyperparameters to be tuned using the get_best_hyperparameters function.

hp_tuned = tuner.get_best_hyperparameters()[0]

Next, we build the model with optimal hyperparameter values.

model_tuned = tuner.hypermodel.build(hp_tuned)

A summary of the tuned model is given below.

Tuned Model Fitting

We fit the model for \(120\) epochs. Specifically, we set aside \(20\%\) training samples as validation data through the argument validation_split. We train the model on the rest of the training observations (effective training set) and evaluate it on the validation data.

On the effective training set (first \(80\%\) observations from the original training set) and evaluate it on the validation set (last \(20\%\) samples from the original training set).

history_tuned = model_tuned.fit(X_train, y_train, epochs = 120, validation_split = 0.2)

We present the model loss for both the effective training set and the validation set in the last fifteen epochs.

We visualize the model loss for both the effective training set and the validation set against the epochs.

○ Prediction and Evaluation

We use the fitted tuned model to predict on the training set and the test set. The error distribution for prediction of ELE_TOTAL_ENERGY_FLUX_LOG (by the tuned model) in the test set is shown below.

We summarize the hyperparameter-tuned model performance below.

Comparing with the baseline model performance, we observe that the model obtained after tuning hyperparameters performs slightly better.

Electron Energy Flux Prediction

○ Contents

○ Overview

○ Introduction

Data

Project Objective

Evaluation Metric

○ Exploratory Data Analysis

Dataset Synopsis

Univariate Analysis

Multivariate Analysis

○ Data Preprocessing

Feature Extraction from `Datetimes`

Log-transformation of the Target

Feature Scaling

Principle Component Analysis (PCA)

○ Baseline Neural Network

Model Creation

Model Compilation

Model Fitting

Model Evaluation

○ Hyperparameter Tuning

Model Builder

Keras Tuner

Tuner Search

Tuned Model Fitting

○ Prediction and Evaluation

○ Acknowledgements

○ References

○ Contents

○ Overview

○ Introduction

Data

Project Objective

Evaluation Metric

○ Exploratory Data Analysis

Dataset Synopsis

Univariate Analysis

Multivariate Analysis

○ Data Preprocessing

Feature Extraction from Datetimes

Log-transformation of the Target

Feature Scaling

Principle Component Analysis (PCA)

○ Baseline Neural Network

Model Creation

Model Compilation

Model Fitting

Model Evaluation

○ Hyperparameter Tuning

Model Builder

Keras Tuner

Tuner Search

Tuned Model Fitting

○ Prediction and Evaluation

○ Acknowledgements

○ References

Feature Extraction from `Datetimes`