Predicting the DST current index from other variables

Downloading data

We will first download low-resolution data from OMNI, and split it into a training and a test set.

[1]:
from datetime import datetime
from aidapy import load_data
from sklearn.model_selection import train_test_split

# Set the start and end date as year, month, day
t0 = datetime(2000, 1, 1)
t1 = datetime(2015, 12, 31)

# Download the data
omnixr = load_data(mission='omni', start_time=t0, end_time=t1)

# Store data in pandas format
pd_data = omnixr['all1'].to_pandas()

# Split into training and test data
dtrain, dtest = train_test_split(pd_data, shuffle=False)

pd_data.describe()
Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2000.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2000.dat
100.0% 2883584 / 2881152


Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2001.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2001.dat
100.0% 2875392 / 2873280


Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2002.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2002.dat
100.0% 2875392 / 2873280


Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2003.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2003.dat
100.0% 2875392 / 2873280


Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2004.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2004.dat
100.0% 2883584 / 2881152


[1]:
products Bartels Rotation Number ID IMF Spacecraft ID SW Plasma Spacecraft points(IMF Average) points(Plasma Average) |B| Magnitude of Avg Field Vector Lat. Angle of Aver. Field Vector Long. Angle of Aver. Field Vector Bx GSE, GSM ... Proton Flux > 10MeV Proton Flux > 30MeV Proton Flux > 60MeV flag ap index f10.7 index PC(N) index AL index (Kyoto) AU index (Kyoto) Magnetosonic Mach No.
count 140233.000000 140233.000000 140105.000000 140233.000000 140105.000000 140233.000000 140233.000000 140233.000000 140233.000000 140233.000000 ... 113159.000000 113134.000000 113120.000000 140233.000000 140233.000000 140113.000000 140035.000000 140233.000000 140233.000000 135984.000000
mean 2379.926836 55.852353 56.569801 48.675333 30.049670 5.827601 5.174818 0.130479 201.635718 -0.004155 ... 7.994024 2.036895 0.759653 -0.807028 10.408599 116.238255 0.960367 -110.092596 68.026370 5.641059
std 62.473060 8.573346 8.166480 19.356001 9.397953 3.219993 3.069832 29.213007 100.877033 3.473102 ... 152.075039 48.944587 18.042820 0.394632 16.728629 43.382109 1.345067 143.618415 70.739359 1.158862
min 2272.000000 51.000000 45.000000 1.000000 1.000000 0.400000 0.100000 -89.200000 0.000000 -40.800000 ... 0.010000 0.010000 0.010000 -1.000000 0.000000 65.100000 -21.500000 -2452.000000 -260.000000 0.600000
25% 2326.000000 51.000000 52.000000 51.000000 17.000000 3.800000 3.200000 -18.500000 122.300000 -2.500000 ... 0.150000 0.080000 0.060000 -1.000000 3.000000 78.500000 0.100000 -148.000000 20.000000 4.900000
50% 2380.000000 51.000000 52.000000 59.000000 35.000000 5.100000 4.500000 -0.100000 181.100000 0.000000 ... 0.200000 0.110000 0.080000 -1.000000 6.000000 107.100000 0.600000 -49.000000 42.000000 5.700000
75% 2434.000000 51.000000 52.000000 60.000000 37.000000 7.000000 6.200000 18.600000 302.000000 2.400000 ... 0.250000 0.150000 0.110000 -1.000000 12.000000 141.200000 1.500000 -20.000000 91.000000 6.400000
max 2488.000000 71.000000 71.000000 94.000000 74.000000 62.000000 60.700000 89.800000 360.000000 34.800000 ... 9650.000000 3220.000000 1220.000000 0.000000 400.000000 325.100000 28.000000 22.000000 1226.000000 12.700000

8 rows × 52 columns

Let’s have a look at the data

Below, the DST index is shown. Note that there are quiet periods, and periods with more magnetic storms, indicated by large negative values.

[2]:
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters
%matplotlib notebook
register_matplotlib_converters()

plt.figure()
plt.title("DST index (training/test data)")
dtrain['DST Index'].plot(label='train')
dtest['DST Index'].plot(label='test')
plt.gcf().autofmt_xdate()
plt.legend();

The correlation between the DST Index and other variables

Let’s have a look at the linear correlation between the DST index and other variables. Correlations range from -1 (anti-correlation) to 1 (perfect correlation), and a value around zero indicates the variables are linearly independent. (However, there might be a complex non-linear correlation)

[3]:
# Generate a correlation matrix
corr_matrix = dtrain.corr()

# Get the correlation with the DST index
dst_corr = corr_matrix['DST Index']

# Display correlations sorted by absolute value
ix = dst_corr.abs().sort_values(ascending=False).index
sorted_corr = dst_corr.reindex(ix)

# Print the variables with the strongest correlations
print(sorted_corr[:10])
products
DST Index                        1.000000
ap index                        -0.629716
Kp                              -0.573182
AL index (Kyoto)                 0.548138
AE Index                        -0.542807
PC(N) index                     -0.514771
Plasma Flow Speed               -0.456372
AU index (Kyoto)                -0.420654
|B|                             -0.396887
Magnitude of Avg Field Vector   -0.386568
Name: DST Index, dtype: float64

Selecting features

We will now select features from which we will predict the DST Index (our target). There are a number of other indices that the DST correlates to, but to make this example somewhat challenging we will exclude them!

[4]:
# Get names of the variables
all_names = list(sorted_corr.index)

# Remove index variables
my_names = [name for name in all_names if 'ndex' not in name]

# Remove other variables
my_names.remove('Kp')

# Select variables with the strongest (absolute) correlations as features
n_features = 5
features = my_names[:n_features]
targets = ['DST Index']
all_vars = features + targets
print("The features are: ", features)
print("The targets are: ", targets)

# Remove rows with missing values. We do this after selecting features,
# otherwise we remove too many rows.
dtrain_dropna = dtrain[all_vars].dropna()
dtest_dropna = dtest[all_vars].dropna()

# Get the features and targets
X_train = dtrain_dropna[features].values
y_train = dtrain_dropna[['DST Index']].values

X_test = dtest_dropna[features].values
y_test = dtest_dropna[['DST Index']].values

t_test = dtest_dropna.index
t_train = dtrain_dropna.index
The features are:  ['Plasma Flow Speed', '|B|', 'Magnitude of Avg Field Vector', 'Na/Np', 'Bz GSM']
The targets are:  ['DST Index']

Regression models

In this example, we consider a linear regression model and an artificial neural network (ANN) regression model. These models are used in a pipeline, which also performs scaling of the input data.

[5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from aidapy.ml import mlp
from skorch import NeuralNetRegressor
import torch

# A list of models to use
models = []

# Append a dictionary with the model name and pipeline
models.append({
    'name': 'Linear Regression',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', LinearRegression())])
})

# RegressorMlp is a simple, fully-connected neural network,
# for which the layer sizes are defined below. The default
# activation function is ReLU.
mlp_model = NeuralNetRegressor(
    mlp.RegressorMlp,
    max_epochs=25,
    lr=0.001,
    batch_size=128,
    optimizer=torch.optim.Adam,
    module__layer_sizes=[X_train.shape[1], 64, 64, 64, 1]
)

models.append({
    'name': 'Multilayer perceptron',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', mlp_model)])
})

Training the models

[6]:
from sklearn.metrics import r2_score

for model in models:
    model['pipe'].fit(X_train, y_train)
    model['test_predict'] = model['pipe'].predict(X_test)
    model['train_predict'] = model['pipe'].predict(X_train)

for model in models:
    print("{:30} R2 score on test / train set:  {:8.3f} {:8.3f}".format(
        model['name'], r2_score(y_test, model['test_predict']),
        r2_score(y_train, model['train_predict'])))
  epoch    train_loss    valid_loss     dur
-------  ------------  ------------  ------
      1      265.4195      510.7300  4.9675
      2      199.7389      514.0299  6.8559
      3      196.6792      513.1639  5.4448
      4      194.9946      511.3364  8.0728
      5      193.8425      508.8832  3.3963
      6      192.9440      506.7368  6.7031
      7      192.1152      507.0628  4.3043
      8      191.5201      504.7900  4.5337
      9      190.9533      504.0500  5.6420
     10      190.5970      503.2600  3.0030
     11      190.2063      502.6676  3.0651
     12      189.9117      501.8307  7.2454
     13      189.5939      501.4661  7.0313
     14      189.3153      501.0051  4.6045
     15      189.0875      500.4397  8.1879
     16      188.8131      500.0692  5.8018
     17      188.5830      499.5452  8.0832
     18      188.3553      499.0485  5.2213
     19      188.0861      498.7523  4.9091
     20      187.9017      498.4360  4.6728
     21      187.7025      498.1764  4.8058
     22      187.4986      497.6907  3.5740
     23      187.3130      497.4540  4.4685
     24      187.1445      497.1794  4.2933
     25      186.9713      496.8183  7.3463
Linear Regression              R2 score on test / train set:     0.394    0.375
Multilayer perceptron          R2 score on test / train set:     0.415    0.404

Visualizing the model predictions

[7]:
plt.figure()
plt.title('Performance on test/train data')
plt.plot(t_test, y_test, label='Data test')
plt.plot(t_train, y_train, label='Data train')
for model in models:
    p = plt.plot(t_test, model['test_predict'], label=model['name'])
    # Plot another line with the same color
    plt.plot(t_train, model['train_predict'], color=p[0].get_color())
plt.legend()
plt.gcf().autofmt_xdate()
plt.grid(True)

Working with a history to improve predictions

Before, we only used the ‘instantaneous’ values of other variables to predict the DST Index. We can improve predictions by keeping track of a history of values, for which we use the AIDApy preprocessing module.

[8]:
from aidapy.ml import preprocess

histsize = 24                   # Number of past hours
forecast_time = 0               # Hours into the future

Xtr = dtrain[features].values
ytr = dtrain[targets].values
Xte = dtest[features].values
yte = dtest[targets].values

# Use the AIDApy preprocessing method for time series
X_train, y_train, mask_train = preprocess.time_series(
    Xtr, ytr, histsize, forecast_time)
X_test, y_test, mask_test = preprocess.time_series(
    Xte, yte, histsize, forecast_time)

t_train = dtrain.index[mask_train]
t_test = dtest.index[mask_test]

models_v2 = []

models_v2.append({
    'name': 'Linear Regression',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', LinearRegression())])
})

mlp_model = NeuralNetRegressor(
    mlp.RegressorMlp,
    max_epochs=20,
    lr=0.001,
    batch_size=128,
    optimizer=torch.optim.Adam,
    module__layer_sizes=[X_train.shape[1], 64, 64, 64, 1]
)

models_v2.append({
    'name': 'Multilayer perceptron',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', mlp_model)])
})

Training and evaluating the new models

[9]:
for model in models_v2:
    model['pipe'].fit(X_train, y_train)
    model['test_predict'] = model['pipe'].predict(X_test)
    model['train_predict'] = model['pipe'].predict(X_train)

for model in models_v2:
    print("{:30} R2 score on test / train set:  {:8.3f} {:8.3f}".format(
        model['name'], r2_score(y_test, model['test_predict']),
        r2_score(y_train, model['train_predict'])))
  epoch    train_loss    valid_loss     dur
-------  ------------  ------------  ------
      1      211.1822      320.4362  3.7218
      2      120.9668      272.9826  4.0578
      3      116.8801      265.6732  5.9941
      4      114.9084      249.5811  3.8256
      5      112.3243      244.7855  3.6040
      6      110.7551      240.7139  3.4175
      7      109.4082      237.4817  3.4066
      8      108.2381      234.2301  3.3303
      9      107.1030      231.2991  2.4646
     10      105.9993      228.7583  5.1124
     11      105.0800      226.6691  3.4773
     12      104.1000      224.7690  3.3862
     13      103.2819      222.9194  3.0130
     14      102.5243      221.2111  5.1869
     15      101.7451      219.7934  2.9591
     16      101.0786      218.5600  2.5240
     17      100.4546      216.9458  4.8371
     18       99.7634      215.4997  2.8842
     19       99.2128      214.3354  2.6664
     20       98.6472      213.3654  5.0065
Linear Regression              R2 score on test / train set:     0.660    0.634
Multilayer perceptron          R2 score on test / train set:     0.667    0.700

Looking at the results

The performance of both models seems to have significantly improved by using a history of variables.

[10]:
plt.figure()
plt.title('Performance on test/train data')
plt.plot(t_test, y_test, label='Data test')
plt.plot(t_train, y_train, label='Data train')
for model in models_v2:
    p = plt.plot(t_test, model['test_predict'], label=model['name'])
    # Plot another line with the same color
    plt.plot(t_train, model['train_predict'], color=p[0].get_color())
plt.legend()
plt.gcf().autofmt_xdate()
plt.grid(True)

Possible extensions

Above, features were selected based on their linear correlation to the DST index. Such features will work well for a linear model, but a neural network could potentially also use features that have a more complex correlation with the DST index. We could simply use all variables as the input for the neural network, but then we have to take care that there are no missing values (NaN’s).

[ ]: