DST forecasting from OMNI data¶

This example shows how to use AIDApy to perform the following tasks:

Download time series data from OMNI
Preprocess the data so that it can be used for machine learning
Train several models and evaluate their performance

Downloading data¶

We will first download low-resolution data from OMNI.

[1]:

from datetime import datetime
from aidapy import load_data

# Set the start and end date as year, month, day
t0 = datetime(2005, 1, 1)
t1 = datetime(2015, 12, 31)

# Download the data
omnixr = load_data(mission='omni', start_time=t0, end_time=t1)

# Store data in pandas format
pd_data = omnixr['all1'].to_pandas()

pd_data.describe()

[1]:

products	Bartels Rotation Number	ID IMF Spacecraft	ID SW Plasma Spacecraft	points(IMF Average)	points(Plasma Average)	\|B\|	Magnitude of Avg Field Vector	Lat. Angle of Aver. Field Vector	Long. Angle of Aver. Field Vector	Bx GSE, GSM	...	Proton Flux > 10MeV	Proton Flux > 30MeV	Proton Flux > 60MeV	flag	ap index	f10.7 index	PC(N) index	AL index (Kyoto)	AU index (Kyoto)	Magnetosonic Mach No.
count	96385.000000	96385.000000	96354.00000	96385.000000	96354.00000	96385.000000	96385.000000	96385.000000	96385.000000	96385.000000	...	91174.000000	91171.000000	91169.000000	96385.000000	96385.000000	96265.000000	96212.000000	96385.000000	96385.000000	95501.000000
mean	2413.760481	51.469160	52.41442	58.407595	34.32062	5.216666	4.620777	0.442976	200.817234	-0.003756	...	4.467667	0.895241	0.283256	-0.945936	8.207802	98.542824	0.876475	-93.396431	56.163034	5.769463
std	42.939382	3.027076	2.95985	7.630976	5.37908	2.785629	2.659993	29.518247	101.021925	3.079144	...	74.289989	15.688849	5.969648	0.226146	12.072234	29.679255	1.171071	128.300576	61.312826	1.119411
min	2339.000000	51.000000	51.00000	1.000000	1.00000	0.400000	0.100000	-89.200000	0.000000	-40.800000	...	0.050000	0.040000	0.030000	-1.000000	0.000000	65.100000	-6.900000	-2452.000000	-225.000000	0.600000
25%	2377.000000	51.000000	52.00000	58.000000	34.00000	3.400000	2.900000	-18.600000	121.100000	-2.200000	...	0.140000	0.080000	0.060000	-1.000000	3.000000	73.500000	0.100000	-120.000000	17.000000	5.100000
50%	2414.000000	51.000000	52.00000	60.000000	36.00000	4.600000	4.000000	0.100000	182.600000	0.000000	...	0.180000	0.100000	0.070000	-1.000000	5.000000	88.400000	0.600000	-38.000000	34.000000	5.800000
75%	2451.000000	51.000000	52.00000	61.000000	37.00000	6.200000	5.600000	19.200000	300.600000	2.200000	...	0.220000	0.120000	0.090000	-1.000000	9.000000	117.800000	1.300000	-17.000000	72.000000	6.500000
max	2488.000000	71.000000	71.00000	91.000000	52.00000	55.400000	54.500000	89.800000	360.000000	26.400000	...	4559.950000	1210.000000	956.000000	0.000000	300.000000	255.000000	18.400000	15.000000	873.000000	10.900000

8 rows × 52 columns

Split the data into different sets¶

For evaluating the performance of a machine learning approach, the input data has to be split in different sets:

A training set, on which the model is trained
Optionally, a validation set, to tune so-called hyperparameters of the model
A test set, which is only used at the end to evaluate the performance of the model

Such a separation is important to prevent overfitting. Since there is usually a correlation between consecutive observations in a time series, it is important not to shuffle the data before splitting it in different sets. Otherwise, the test set would not be truly independent from the training set.

[2]:

from sklearn.model_selection import train_test_split

# Split into training and test data
dtrain, dtest = train_test_split(pd_data, shuffle=False)

Let’s have a look at the data¶

Below, the DST index is shown. Note that there are quiet periods, and periods with more magnetic storms, indicated by large negative values.

[3]:

import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters
%matplotlib notebook
register_matplotlib_converters()

plt.figure()
plt.title("DST index (training/test data)")
dtrain['DST Index'].plot(label='train')
dtest['DST Index'].plot(label='test')
plt.gcf().autofmt_xdate()
plt.legend();

Selecting features¶

The features are here selected by hand (and they are not necessarily the best ones).

[4]:

features = ['|B|',
            'Magnitude of Avg Field Vector',
            'Proton Density',
            'Proton Temperature',
            'Alfven Mach Number',
            'Bz GSM',
            'Na/Np',
            'Plasma Flow Speed',
            'Plasma Beta',
            'Electric Field',
            'DST Index']
targets = ['DST Index']

Preprocessing data¶

The training and test data are preprocessed so that we can make predictions for the targets based on a history of the features. If the original features and targets are \(X_i\) and \(y_i\), then we define new features \(X_i' = [X_i, X_{i-1}, \dots, X_{i-n-1}]\), where \(n\) is the history size. We use these features to make predictions at some future time, so that the new targets are \(y_i' = y_{i+k}\), where \(k\) is the forecast time.

Another question is what to do with missing values. This is a common issue in working with space craft data – in particular if data from multiple instruments is combined. We have opted for a simple approach: remove all samples for which some data is missing, either in the features or the targets.

[5]:

from aidapy.ml import preprocess

histsize = 8                    # Number of past hours
forecast_time = 1               # Hours into the future

# Use the AIDApy preprocessing method for time series
X_train, y_train, mask_train = preprocess.time_series(
    dtrain[features].values, dtrain[targets].values, histsize, forecast_time)
X_test, y_test, mask_test = preprocess.time_series(
    dtest[features].values, dtest[targets].values, histsize, forecast_time)

t_train = dtrain.index[mask_train]
t_test = dtest.index[mask_test]

Regression models¶

For the first use case, the following regression models are used: * A fully-connected neural network implemented in PyTorch. Users can customize the number of layers, the number of neurons per layer, and the activation functions of the layers. * A linear regression model imported from scikit-learn

[6]:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from aidapy.ml import mlp
from skorch import NeuralNetRegressor
import torch

models = []

models.append({
    'name': 'Linear Regression',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', LinearRegression())])
})

# RegressorMlp is a simple, fully-connected neural network,
# for which the layer sizes are defined below. The default
# activation function is ReLU.
mlp_model = NeuralNetRegressor(
    mlp.RegressorMlp,
    max_epochs=25,
    lr=0.001,
    batch_size=128,
    optimizer=torch.optim.Adam,
    module__layer_sizes=[X_train.shape[1], 64, 64, 64, 1]
)

models.append({
    'name': 'Multilayer perceptron',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', mlp_model)])
})

[7]:

from sklearn.metrics import r2_score

for model in models:
    model['pipe'].fit(X_train, y_train)
    model['test_predict'] = model['pipe'].predict(X_test)
    model['train_predict'] = model['pipe'].predict(X_train)

for model in models:
    print("{:30} R2 score on test / train set:  {:8.3f} {:8.3f}".format(
        model['name'], r2_score(y_test, model['test_predict']),
        r2_score(y_train, model['train_predict'])))

  epoch    train_loss    valid_loss     dur
-------  ------------  ------------  ------
      1       60.7240       32.4020  2.9466
      2       17.2051       27.2859  1.8680
      3       13.9108       27.3186  4.9162
      4       11.7659       25.7481  3.9948
      5       11.0440       29.0479  2.6648
      6       10.3904       26.3400  2.7699
      7       10.1087       27.2759  1.9154
      8        9.7082       24.8305  2.6972
      9        9.3186       24.3196  2.6852
     10        9.0785       23.8498  3.3641
     11        8.9514       23.7893  2.6362
     12        8.8422       23.3065  4.0155
     13        8.6888       22.9633  2.9111
     14        8.4773       21.6068  2.7943
     15        8.4333       22.4387  2.9364
     16        8.2843       21.1695  3.0378
     17        8.1932       21.9871  3.1358
     18        8.0966       20.9862  3.1069
     19        8.1332       22.9881  3.7974
     20        7.9812       20.2496  4.5640
     21        7.9269       21.3206  4.9270
     22        7.9495       21.2599  3.3066
     23        7.7617       19.6344  5.3131
     24        7.6907       20.3049  1.9088
     25        7.6336       19.9777  2.4472
Linear Regression              R2 score on test / train set:     0.969    0.964
Multilayer perceptron          R2 score on test / train set:     0.952    0.948

Let’s compare the predictions with the actual data

[8]:

plt.figure()
plt.title('Predicting the change in DST Index')
plt.plot(t_test, y_test, label='Data test')
plt.plot(t_train, y_train, label='Data train')
for model in models:
    p = plt.plot(t_test, model['test_predict'], label=model['name'])
    # Plot another line with the same color
    plt.plot(t_train, model['train_predict'], color=p[0].get_color())
plt.legend()
plt.gcf().autofmt_xdate()
plt.grid(True)

This looks pretty good, but…¶

Most variables we consider in a time series have strong auto-correlation. This means that future values strongly resemble past values; examples are the temperature at a location on earth or the price of a stock on a stock market. When making predictions for a variable with strong auto-correlation, one has to be careful in assessing the predictive power of a model. Simply predicting the past state as the future state can look convincing (also numerically in most error norms), but gives little information.

Below, we therefore take the following approach: predict the change in the variable.

[9]:

from aidapy.ml import preprocess

histsize = 24                   # Number of past hours
forecast_time = 1               # Hours into the future

# Use the AIDApy preprocessing method for time series
# With the predict_change=True flag, the change in the DST index
# is the target variable
X_train, y_train, mask_train = preprocess.time_series(
    dtrain[features].values, dtrain[targets].values,
    histsize, forecast_time, predict_change=True)
X_test, y_test, mask_test = preprocess.time_series(
    dtest[features].values, dtest[targets].values,
    histsize, forecast_time, predict_change=True)

t_train = dtrain.index[mask_train]
t_test = dtest.index[mask_test]

models = []

models.append({
    'name': 'Linear Regression',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', LinearRegression())])
})

# RegressorMlp is a simple, fully-connected neural network,
# for which the layer sizes are defined below. The default
# activation function is ReLU.
mlp_model = NeuralNetRegressor(
    mlp.RegressorMlp,
    max_epochs=25,
    lr=0.001,
    batch_size=128,
    optimizer=torch.optim.Adam,
    module__layer_sizes=[X_train.shape[1], 64, 64, 64, 1]
)

models.append({
    'name': 'ANN',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', mlp_model)])
})

[10]:

from sklearn.metrics import r2_score

for model in models:
    model['pipe'].fit(X_train, y_train)
    model['test_predict'] = model['pipe'].predict(X_test)
    model['train_predict'] = model['pipe'].predict(X_train)

for model in models:
    print("{:30} R2 score on test / train set:  {:8.3f} {:8.3f}".format(
        model['name'], r2_score(y_test, model['test_predict']),
        r2_score(y_train, model['train_predict'])))

  epoch    train_loss    valid_loss     dur
-------  ------------  ------------  ------
      1        9.3040       13.4525  2.5473
      2        8.0748       13.2288  2.4043
      3        7.6921       12.3340  2.4899
      4        7.5276       12.2525  3.9910
      5        7.3974       12.1197  3.8056
      6        7.2771       11.9705  3.0214
      7        7.1804       11.9094  2.0155
      8        7.0822       11.9217  2.8191
      9        7.0014       11.8192  2.6564
     10        6.8758       11.8942  2.6622
     11        6.7928       11.8929  1.9705
     12        6.6663       11.8819  1.9809
     13        6.6267       12.3010  2.6679
     14        6.5041       12.2031  1.8921
     15        6.4059       12.2843  2.4354
     16        6.3420       12.9039  3.0471
     17        6.2806       12.1478  2.7479
     18        6.1105       13.0912  4.9462
     19        6.0671       12.2712  4.6252
     20        5.9791       12.7974  1.9119
     21        5.9633       12.8095  3.2434
     22        6.0265       13.1740  2.2317
     23        5.9151       13.1795  1.7708
     24        5.8232       13.1638  3.2358
     25        5.7188       12.9696  4.8095
Linear Regression              R2 score on test / train set:     0.389    0.411
ANN                            R2 score on test / train set:     0.287    0.457

[11]:

plt.figure()
plt.title('Predicting the change in DST Index')
plt.plot(t_test, y_test, label='Data test')
plt.plot(t_train, y_train, label='Data train')
for model in models:
    p = plt.plot(t_test, model['test_predict'], label=model['name'])
    # Plot another line with the same color
    plt.plot(t_train, model['train_predict'], color=p[0].get_color())
plt.legend()
plt.gcf().autofmt_xdate()
plt.grid(True)

[ ]: