Predicting the DST index from other variables¶
Downloading data¶
We will first download low-resolution data from OMNI and split it into a training and a test set. The split is done without shuffling, so that the training and test data cover separate time periods.
[1]:
from datetime import datetime
from aidapy import load_data
from sklearn.model_selection import train_test_split
# Set the start and end date as year, month, day
t0 = datetime(2000, 1, 1)
t1 = datetime(2015, 12, 31)
# Download the data
omnixr = load_data(mission='omni', start_time=t0, end_time=t1)
# Store data in pandas format
pd_data = omnixr['all1'].to_pandas()
# Split into training and test data
dtrain, dtest = train_test_split(pd_data, shuffle=False)
pd_data.describe()
Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2000.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2000.dat
100.0% 2883584 / 2881152
Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2001.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2001.dat
100.0% 2875392 / 2873280
Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2002.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2002.dat
100.0% 2875392 / 2873280
Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2003.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2003.dat
100.0% 2875392 / 2873280
Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2004.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2004.dat
100.0% 2883584 / 2881152
[1]:
products | Bartels Rotation Number | ID IMF Spacecraft | ID SW Plasma Spacecraft | points(IMF Average) | points(Plasma Average) | |B| | Magnitude of Avg Field Vector | Lat. Angle of Aver. Field Vector | Long. Angle of Aver. Field Vector | Bx GSE, GSM | ... | Proton Flux > 10MeV | Proton Flux > 30MeV | Proton Flux > 60MeV | flag | ap index | f10.7 index | PC(N) index | AL index (Kyoto) | AU index (Kyoto) | Magnetosonic Mach No. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 140233.000000 | 140233.000000 | 140105.000000 | 140233.000000 | 140105.000000 | 140233.000000 | 140233.000000 | 140233.000000 | 140233.000000 | 140233.000000 | ... | 113159.000000 | 113134.000000 | 113120.000000 | 140233.000000 | 140233.000000 | 140113.000000 | 140035.000000 | 140233.000000 | 140233.000000 | 135984.000000 |
mean | 2379.926836 | 55.852353 | 56.569801 | 48.675333 | 30.049670 | 5.827601 | 5.174818 | 0.130479 | 201.635718 | -0.004155 | ... | 7.994024 | 2.036895 | 0.759653 | -0.807028 | 10.408599 | 116.238255 | 0.960367 | -110.092596 | 68.026370 | 5.641059 |
std | 62.473060 | 8.573346 | 8.166480 | 19.356001 | 9.397953 | 3.219993 | 3.069832 | 29.213007 | 100.877033 | 3.473102 | ... | 152.075039 | 48.944587 | 18.042820 | 0.394632 | 16.728629 | 43.382109 | 1.345067 | 143.618415 | 70.739359 | 1.158862 |
min | 2272.000000 | 51.000000 | 45.000000 | 1.000000 | 1.000000 | 0.400000 | 0.100000 | -89.200000 | 0.000000 | -40.800000 | ... | 0.010000 | 0.010000 | 0.010000 | -1.000000 | 0.000000 | 65.100000 | -21.500000 | -2452.000000 | -260.000000 | 0.600000 |
25% | 2326.000000 | 51.000000 | 52.000000 | 51.000000 | 17.000000 | 3.800000 | 3.200000 | -18.500000 | 122.300000 | -2.500000 | ... | 0.150000 | 0.080000 | 0.060000 | -1.000000 | 3.000000 | 78.500000 | 0.100000 | -148.000000 | 20.000000 | 4.900000 |
50% | 2380.000000 | 51.000000 | 52.000000 | 59.000000 | 35.000000 | 5.100000 | 4.500000 | -0.100000 | 181.100000 | 0.000000 | ... | 0.200000 | 0.110000 | 0.080000 | -1.000000 | 6.000000 | 107.100000 | 0.600000 | -49.000000 | 42.000000 | 5.700000 |
75% | 2434.000000 | 51.000000 | 52.000000 | 60.000000 | 37.000000 | 7.000000 | 6.200000 | 18.600000 | 302.000000 | 2.400000 | ... | 0.250000 | 0.150000 | 0.110000 | -1.000000 | 12.000000 | 141.200000 | 1.500000 | -20.000000 | 91.000000 | 6.400000 |
max | 2488.000000 | 71.000000 | 71.000000 | 94.000000 | 74.000000 | 62.000000 | 60.700000 | 89.800000 | 360.000000 | 34.800000 | ... | 9650.000000 | 3220.000000 | 1220.000000 | 0.000000 | 400.000000 | 325.100000 | 28.000000 | 22.000000 | 1226.000000 | 12.700000 |
8 rows × 52 columns
Let’s have a look at the data¶
Below, the DST index is shown. Note that there are quiet periods, and periods with more magnetic storms, which show up as large negative values of the index.
[2]:
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters
%matplotlib notebook
register_matplotlib_converters()
plt.figure()
plt.title("DST index (training/test data)")
dtrain['DST Index'].plot(label='train')
dtest['DST Index'].plot(label='test')
plt.gcf().autofmt_xdate()
plt.legend();
The correlation between the DST Index and other variables¶
Let’s have a look at the linear correlation between the DST index and the other variables. Correlation coefficients range from -1 (perfect anti-correlation) to 1 (perfect correlation); a value near zero indicates that the variables are linearly uncorrelated, although they could still be related in a complex non-linear way.
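To see why a near-zero correlation does not rule out a relationship, here is a small illustration (not part of the original workflow): a variable that depends on another perfectly, but non-linearly, can still have a Pearson correlation close to zero.
import numpy as np
# Illustration: y is a deterministic function of x, but the
# relationship is non-linear (and symmetric around x = 0)
x = np.linspace(-1, 1, 1001)
y = x ** 2
# The Pearson correlation coefficient is nevertheless close to zero
r = np.corrcoef(x, y)[0, 1]
print("Pearson r = {:.3f}".format(r))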
[3]:
# Generate a correlation matrix
corr_matrix = dtrain.corr()
# Get the correlation with the DST index
dst_corr = corr_matrix['DST Index']
# Display correlations sorted by absolute value
ix = dst_corr.abs().sort_values(ascending=False).index
sorted_corr = dst_corr.reindex(ix)
# Print the variables with the strongest correlations
print(sorted_corr[:10])
products
DST Index 1.000000
ap index -0.629716
Kp -0.573182
AL index (Kyoto) 0.548138
AE Index -0.542807
PC(N) index -0.514771
Plasma Flow Speed -0.456372
AU index (Kyoto) -0.420654
|B| -0.396887
Magnitude of Avg Field Vector -0.386568
Name: DST Index, dtype: float64
Selecting features¶
We will now select the features from which we will predict the DST index (our target). A number of other geomagnetic indices correlate strongly with the DST index, but to make this example somewhat challenging we will exclude them!
[4]:
# Get the names of all variables, sorted by correlation strength
all_names = list(sorted_corr.index)
# Remove index variables ('ndex' matches both 'index' and 'Index')
my_names = [name for name in all_names if 'ndex' not in name]
# Kp is also a geomagnetic index, but its name does not contain 'ndex'
my_names.remove('Kp')
# Select variables with the strongest (absolute) correlations as features
n_features = 5
features = my_names[:n_features]
targets = ['DST Index']
all_vars = features + targets
print("The features are: ", features)
print("The targets are: ", targets)
# Remove rows with missing values. We do this after selecting features,
# otherwise we remove too many rows.
dtrain_dropna = dtrain[all_vars].dropna()
dtest_dropna = dtest[all_vars].dropna()
# Get the features and targets
X_train = dtrain_dropna[features].values
y_train = dtrain_dropna[['DST Index']].values
X_test = dtest_dropna[features].values
y_test = dtest_dropna[['DST Index']].values
t_test = dtest_dropna.index
t_train = dtrain_dropna.index
The features are: ['Plasma Flow Speed', '|B|', 'Magnitude of Avg Field Vector', 'Na/Np', 'Bz GSM']
The targets are: ['DST Index']
Regression models¶
In this example, we consider a linear regression model and an artificial neural network (ANN) regression model, in this case a multilayer perceptron. Each model is wrapped in a pipeline that also scales the input data.
[5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from aidapy.ml import mlp
from skorch import NeuralNetRegressor
import torch
# A list of models to use
models = []
# Append a dictionary with the model name and pipeline
models.append({
'name': 'Linear Regression',
'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
('model', LinearRegression())])
})
# RegressorMlp is a simple, fully-connected neural network,
# for which the layer sizes are defined below. The default
# activation function is ReLU.
mlp_model = NeuralNetRegressor(
mlp.RegressorMlp,
max_epochs=25,
lr=0.001,
batch_size=128,
optimizer=torch.optim.Adam,
module__layer_sizes=[X_train.shape[1], 64, 64, 64, 1]
)
models.append({
'name': 'Multilayer perceptron',
'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
('model', mlp_model)])
})
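For reference, a fully-connected regressor with the same layer_sizes interface could be sketched in plain PyTorch as below. This is a hypothetical stand-in: the actual mlp.RegressorMlp in AIDApy may be implemented differently.
import torch.nn as nn

class SimpleRegressorMlp(nn.Module):
    # Hypothetical sketch of a fully-connected regressor; the real
    # aidapy.ml.mlp.RegressorMlp may differ in its details.
    def __init__(self, layer_sizes):
        super().__init__()
        layers = []
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
            layers.append(nn.Linear(n_in, n_out))
            layers.append(nn.ReLU())
        layers.pop()  # no activation after the output layer
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)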
Training the models¶
[6]:
from sklearn.metrics import r2_score
for model in models:
model['pipe'].fit(X_train, y_train)
model['test_predict'] = model['pipe'].predict(X_test)
model['train_predict'] = model['pipe'].predict(X_train)
for model in models:
print("{:30} R2 score on test / train set: {:8.3f} {:8.3f}".format(
model['name'], r2_score(y_test, model['test_predict']),
r2_score(y_train, model['train_predict'])))
epoch train_loss valid_loss dur
------- ------------ ------------ ------
1 265.4195 510.7300 4.9675
2 199.7389 514.0299 6.8559
3 196.6792 513.1639 5.4448
4 194.9946 511.3364 8.0728
5 193.8425 508.8832 3.3963
6 192.9440 506.7368 6.7031
7 192.1152 507.0628 4.3043
8 191.5201 504.7900 4.5337
9 190.9533 504.0500 5.6420
10 190.5970 503.2600 3.0030
11 190.2063 502.6676 3.0651
12 189.9117 501.8307 7.2454
13 189.5939 501.4661 7.0313
14 189.3153 501.0051 4.6045
15 189.0875 500.4397 8.1879
16 188.8131 500.0692 5.8018
17 188.5830 499.5452 8.0832
18 188.3553 499.0485 5.2213
19 188.0861 498.7523 4.9091
20 187.9017 498.4360 4.6728
21 187.7025 498.1764 4.8058
22 187.4986 497.6907 3.5740
23 187.3130 497.4540 4.4685
24 187.1445 497.1794 4.2933
25 186.9713 496.8183 7.3463
Linear Regression R2 score on test / train set: 0.394 0.375
Multilayer perceptron R2 score on test / train set: 0.415 0.404
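The R2 score (coefficient of determination) compares the squared prediction error to that of a trivial model that always predicts the mean of the data: a score of 1 is a perfect fit, and a score of 0 is no better than predicting the mean. Computed by hand, it would look like this:
import numpy as np

def r2(y_true, y_pred):
    # R2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot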
Visualizing the model predictions¶
[7]:
plt.figure()
plt.title('Performance on test/train data')
plt.plot(t_test, y_test, label='Data test')
plt.plot(t_train, y_train, label='Data train')
for model in models:
p = plt.plot(t_test, model['test_predict'], label=model['name'])
# Plot the corresponding training-set prediction in the same color
plt.plot(t_train, model['train_predict'], color=p[0].get_color())
plt.legend()
plt.gcf().autofmt_xdate()
plt.grid(True)
Working with a history to improve predictions¶
Before, we only used the ‘instantaneous’ values of the other variables to predict the DST index. We can improve the predictions by also keeping track of a history of past values, for which we use the AIDApy preprocessing module.
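The time_series preprocessing used below conceptually turns each sample into a concatenation of histsize consecutive hourly feature vectors, with the target taken forecast_time hours after the end of that window. A rough sketch of the idea is given here; the actual preprocess.time_series may use different conventions, and it additionally returns a mask that excludes rows with missing values.
import numpy as np

def lagged_features(X, y, histsize, forecast_time):
    # Hypothetical sketch: stack histsize consecutive rows of X into one
    # feature vector, and take the target forecast_time steps after the
    # end of the window. (No handling of missing values here.)
    n = X.shape[0] - histsize - forecast_time + 1
    X_hist = np.stack([X[i:i + histsize].ravel() for i in range(n)])
    y_hist = y[histsize + forecast_time - 1:]
    return X_hist, y_hist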
[8]:
from aidapy.ml import preprocess
histsize = 24 # Number of past hours
forecast_time = 0 # Hours into the future
Xtr = dtrain[features].values
ytr = dtrain[targets].values
Xte = dtest[features].values
yte = dtest[targets].values
# Use the AIDApy preprocessing method for time series
X_train, y_train, mask_train = preprocess.time_series(
Xtr, ytr, histsize, forecast_time)
X_test, y_test, mask_test = preprocess.time_series(
Xte, yte, histsize, forecast_time)
t_train = dtrain.index[mask_train]
t_test = dtest.index[mask_test]
models_v2 = []
models_v2.append({
'name': 'Linear Regression',
'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
('model', LinearRegression())])
})
mlp_model = NeuralNetRegressor(
mlp.RegressorMlp,
max_epochs=20,
lr=0.001,
batch_size=128,
optimizer=torch.optim.Adam,
module__layer_sizes=[X_train.shape[1], 64, 64, 64, 1]
)
models_v2.append({
'name': 'Multilayer perceptron',
'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
('model', mlp_model)])
})
Training and evaluating the new models¶
[9]:
for model in models_v2:
model['pipe'].fit(X_train, y_train)
model['test_predict'] = model['pipe'].predict(X_test)
model['train_predict'] = model['pipe'].predict(X_train)
for model in models_v2:
print("{:30} R2 score on test / train set: {:8.3f} {:8.3f}".format(
model['name'], r2_score(y_test, model['test_predict']),
r2_score(y_train, model['train_predict'])))
epoch train_loss valid_loss dur
------- ------------ ------------ ------
1 211.1822 320.4362 3.7218
2 120.9668 272.9826 4.0578
3 116.8801 265.6732 5.9941
4 114.9084 249.5811 3.8256
5 112.3243 244.7855 3.6040
6 110.7551 240.7139 3.4175
7 109.4082 237.4817 3.4066
8 108.2381 234.2301 3.3303
9 107.1030 231.2991 2.4646
10 105.9993 228.7583 5.1124
11 105.0800 226.6691 3.4773
12 104.1000 224.7690 3.3862
13 103.2819 222.9194 3.0130
14 102.5243 221.2111 5.1869
15 101.7451 219.7934 2.9591
16 101.0786 218.5600 2.5240
17 100.4546 216.9458 4.8371
18 99.7634 215.4997 2.8842
19 99.2128 214.3354 2.6664
20 98.6472 213.3654 5.0065
Linear Regression R2 score on test / train set: 0.660 0.634
Multilayer perceptron R2 score on test / train set: 0.667 0.700
Looking at the results¶
Using a history of the input variables has significantly improved the performance of both models: the R2 score on the test set rises from about 0.4 to about 0.67.
[10]:
plt.figure()
plt.title('Performance on test/train data')
plt.plot(t_test, y_test, label='Data test')
plt.plot(t_train, y_train, label='Data train')
for model in models_v2:
p = plt.plot(t_test, model['test_predict'], label=model['name'])
# Plot the corresponding training-set prediction in the same color
plt.plot(t_train, model['train_predict'], color=p[0].get_color())
plt.legend()
plt.gcf().autofmt_xdate()
plt.grid(True)
Possible extensions¶
Above, features were selected based on their linear correlation with the DST index. Such features will work well for a linear model, but a neural network could potentially also exploit features that have a more complex, non-linear relationship with the DST index. We could simply use all variables as input for the neural network, but then we have to take care that there are no missing values (NaNs).
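A simple way to handle such missing values is to impute them before scaling, for example with scikit-learn's SimpleImputer. A minimal sketch, assuming median imputation is acceptable for these variables:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Replace each missing value by the median of its column, then scale;
# a model step can be appended as in the pipelines above.
preprocessing = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler())])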