Welcome to AIDApy’s documentation!

AIDApy


The Python package aidapy centralizes and simplifies access to:

  • Spacecraft data from heliospheric missions

  • Space physics simulations

  • Advanced statistical tools

  • Machine Learning, Deep Learning algorithms, and applications

The aidapy package includes four main sub-packages which are explained in detail in the Reference Manual section:

  • Mission tool

  • Event search tool

  • Velocity distribution function tool

  • Machine learning tool

The aidapy package is part of the AIDA (Artificial Intelligence Data Analysis) project in Heliophysics, funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 776262. It is distributed under the open-source MIT license.

Full documentation can be found here

Content

Getting Started

Motivations

A major advantage of our era is the ease of gathering and recording data. Data can be found literally everywhere, from the quantum scale to stars and galaxies. Great challenges arise from this fact, and the biggest of them all is how to manage all this information efficiently.

In particular for space data, technology has advanced so rapidly that every day petabytes of useful information are recorded by satellites, other sensors around the globe, and simulations. However, a large part of this information remains unused (or only partly used) because the proper tools have not yet been created. New software packages can lead the way to meeting that challenge and fully (or partly) automate the processes needed to transform the data into a format that scientists can understand and analyze.

This is the reason why we developed aidapy, a high-level Python package for the analysis of spacecraft data from heliospheric missions using modern techniques including data assimilation, machine learning (ML), artificial intelligence (AI), and advanced statistical models. The transition from proprietary software based on commercial languages to new open-source, community-developed packages in Python represents one of the main objectives of aidapy. Our vision is to combine already existing software tools in a clean and efficient way. In our minds, this is a crucial and necessary task for reducing the knowledge "gap" that often exists between different scientific fields. aidapy has been created to fulfil the needs of users with advanced programming skills as well as users with basic programming knowledge.

Finally, we aim to build a new dedicated database to support the ongoing trend of database standardization in space science. Key heliophysics problems are selected to produce a database (AIDAdb) of new high-level data products that include catalogs of features and events detected by ML and AI algorithms from existing numerical simulations and observations. New simulations are also performed to enrich this database. Moreover, many of the AI methods developed in AIDA are themselves higher-level data products, for instance in the form of trained models.

The transition to a big-data phase and the use of Python, the big-data language of choice, prepares the space community for the use of ML and AI. The latest AI developments, and especially the methodology of deep neural networks, are particularly suited to the identification of physical processes in space images and data. However, there is a barrier to overcome between the two very distinct communities of space scientists and experts in AI. aidapy aims to bridge this gap by linking the software developed in the AI community with heliospheric data.

We are developing aidapy to make it easier for a researcher in space science without a background in computer science or artificial intelligence to use some of the most advanced tools in this field, and to provide concrete examples of how AI can help analyse data from simulations and observations, discover physical processes hidden in the data, and make space weather forecasts.


Reference manual

Mission

Introduction

The Mission subpackage is based on routines adapted mainly from the HelioPy (https://heliopy.readthedocs.io/en/stable/) and SunPy (https://docs.sunpy.org/en/stable/) packages and further developed. It allows the user to select and download data from in situ, remote, and ground-based open-access databases. In its present version, the subpackage can download in situ data from NASA OMNI, ESA Cluster, and NASA MMS. The tool includes the capability to verify the version of the data to download, in order to always have the latest version of the files. The main outputs of the subpackage are xarray objects to be further used by the other subpackages of the AIDApy package.

The main features of the Mission subpackage are:

  • being able to handle various missions, such as the Cluster mission or the Magnetospheric Multiscale (MMS) mission. Additional missions can be added as needed without any major change to the architecture, either by contributing to HelioPy or by building our own downloader;

  • having precise control over the data. A query can specify a time range, probes, coordinates, etc. The module will check the availability of the data;

  • managing time-varying multi-dimensional distributions. This feature is of great importance to the mission module as it is not available in other Python packages;

  • ensuring the use of a proper data container providing not only the raw data but also the time range, metadata, etc.;

  • offering a uniform interface to the other AIDApy subpackages in order to perform advanced analysis on the mission data. For instance, an aidapy time-series object can be generated using data from the mission package, giving the final user access to statistical tools and data processing.

Inputs

The philosophy of AIDApy is to ease the user experience as much as possible. To this end, we decided to minimize both the number of arguments entered by the user and the knowledge required to obtain the desired data. Below is a list of the parameters the code needs from the user:

  1. The first parameter to enter is the start and the end of the requested time interval, either as Python datetime.datetime objects written as datetime(year, month, day, hour, minute, second), or as astropy.Time objects, which can handle multiple time formats such as GPS, ISOT, or FITS.

  2. The second parameter specifies the mission from which to download the data, as a string: 'mission' = '<mission1>'. Missions that can currently be selected are: OMNIWeb, MMS, Cluster.

  3. The third parameter is a list of strings containing the different probes of the multi-spacecraft mission considered: 'probes' = ['<probe1>', '<probe2>', ...]. Probes that can be selected are probe1, probe2, probe3, and probe4 for MMS and Cluster, and probe1 for OMNIWeb. Note that for a single-spacecraft mission, or if one wants data from all probes, the 'probes' parameter can be omitted or set to an empty string.

  4. The fourth parameter is a string specifying the coordinate system the user wants the data in, such as: 'coords' = '<coord_system>'. Note that, for the sake of clarity, data can be retrieved in only one coordinate system at a time. The only coordinate system currently available is Geocentric Solar Ecliptic (GSE).

  5. The last parameter to enter is a list of strings containing the data products the user wants to load (for instance the magnetic field vector, the ion density, etc.), written as: 'prod' = ['<product1>', '<product2>', ...]. As mentioned in the previous section, a catalog of all available products is embedded in the Mission subpackage and will be updated on the fly as more space missions and datasets are added. We note here that this catalog also includes Level 3 (L3) products that are not directly available through open-access databases but are processed within the AIDApy package. A preliminary catalog of the available products is given in the following table:

Data product    | Level | Available for missions  | Description
--------------- | ----- | ----------------------- | -----------
dc_mag          | L2    | OMNIWeb, MMS, Cluster   | 3-component (x, y, z) or 4-component (x, y, z, tot) magnetic field vector
i_dens          | L2    | OMNIWeb, MMS, Cluster   | Ion number density
e_dens          | L2    | OMNIWeb, MMS            | Electron number density
i_dist          | L2    | MMS, Cluster            | 3D ion distribution function
e_dist          | L2    | MMS                     | 3D electron distribution function
i_bulkv         | L2    | MMS, Cluster            | Ion bulk velocity vector
e_bulkv         | L2    | MMS, Cluster            | Electron bulk velocity vector
i_temppara      | L2    | MMS, Cluster            | Ion temperature parallel to dc_mag
i_tempperp      | L2    | MMS, Cluster            | Ion temperature perpendicular to dc_mag
i_temp          | L2    | Cluster                 | Total ion temperature
all             | L2    | OMNIWeb                 | All products available for OMNI data
sc_pos          | L2    | OMNIWeb, MMS, Cluster   | Spacecraft location
sc_att          | L2    | MMS                     | Spacecraft attitude
dc_elec         | L2    | MMS                     | DC electric field
i_omniflux      | L2    | MMS                     | Omnidirectional ion energy spectrum
i_energy        | L2    | MMS                     | Ion energy channels table
i_aspoc         | L2    | MMS                     | ASPOC instrument ion current
i_prestens      | L2    | MMS                     | Ion pressure tensor
i_temptens      | L2    | MMS                     | Ion temperature tensor
i_heatq         | L2    | MMS                     | Ion heat flux
j_curl          | L3    | MMS, Cluster            | Current density calculated from the Curlometer method
mag_elev_angle  | L3    | OMNIWeb, MMS, Cluster   | Magnetic elevation angle
i_beta          | L3    | MMS, Cluster            | Ion plasma beta
e_beta          | L3    | MMS, Cluster            | Electron plasma beta

Availability of data products

As stated in the introduction, AIDApy is designed to be upgraded on the fly according to needs, such that the availability of data products can vary depending on the considered space mission. In order to ease the burden of browsing through the above (preliminary) data product catalog to search for the keyword corresponding to the desired physical quantity, AIDApy provides the get_mission_info function alongside the high-level load_data function (see below). This function is designed to give information on the data products available for each space mission, but also to deliver the data product keywords corresponding to the queried physical quantity as well as other data product settings (e.g., available data rates or modes, probes, coordinates, etc.).

Below is a very simple code snippet showing how to get all available data products from a particular mission:

get_mission_info(mission='<mission>')

To get the AIDApy data product keywords corresponding to a physical quantity, simply provide the desired quantity as a string (e.g., 'magnetic field'):

get_mission_info(mission='<mission>', product='<physical quantity>')

All supplemental product parameters are accessible by setting the full_product_catalog parameter to True:

get_mission_info(mission='<mission>', product='<physical quantity>', full_product_catalog=True)
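
For example, assuming the mission keyword 'mms' (used in the examples later in this documentation) and using 'magnetic field' purely as an illustrative physical quantity, the calls could look as follows:

from aidapy import get_mission_info

# List every data product available for MMS
get_mission_info(mission='mms')

# Retrieve the AIDApy keyword(s) and settings matching the magnetic field
get_mission_info(mission='mms', product='magnetic field', full_product_catalog=True)
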
High level interface

The function load_data is used as a high-level interface with the mission subpackage. Below is a code snippet showing how to define the inputs for the load_data function using a Python dictionary:

# Define the time interval
start_time = datetime(<year>, <month>, <day>, <hour>, <minute>, <second>)
end_time = datetime(<year>, <month>, <day>, <hour>, <minute>, <second>)

# Define the settings as a Python dictionary
settings = {'prod': ['<prod1>', '<prod2>'], 'probes': ['<probe1>', '<probe2>'],
'coords': '<coord_system>'}

Once the parameters are set up, we generate the Mission downloader and download the data. This is done by calling the load_data function, which tells the Mission subpackage to create a specific downloader for each mission and time interval requested, and to download and load the desired data products for the specified probes and coordinate system:

data = load_data(mission='<mission>', start_time=start_time, end_time=end_time, **settings)

The data is then returned in the data object and can be printed using the print() Python command. The returned data is discussed in the following section.
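
As a concrete sketch, the complete sequence for five minutes of Cluster magnetic field data (the same request that is worked out in the Examples section) reads:

from datetime import datetime
from aidapy import load_data

# Time interval: 5 minutes on August 5, 2013
start_time = datetime(2013, 8, 5, 0, 0, 0)
end_time = datetime(2013, 8, 5, 0, 5, 0)

# Magnetic field from probes 1 and 2, in GSE coordinates
settings = {'prod': ['dc_mag'], 'probes': ['1', '2'], 'coords': 'gse'}

data = load_data(mission='cluster', start_time=start_time, end_time=end_time, **settings)
print(data)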

Outputs

The DataArray and Dataset objects from the Python package xarray (http://xarray.pydata.org/en/stable/) have been chosen as outputs for the Mission subpackage, as they have been especially designed to handle time-series of multidimensional data. It is basically an N-dimensional array with labeled coordinates and dimensions, which also supports metadata aware operations (see details in the Metadata subsection). It also provides many functions to easily manipulate multidimensional data, such as indexing, reshaping, resampling, etc.

The following code snippet is an example of what is printed when downloading a 1D array (e.g., time series of values) of size (2000) and a 2D array (e.g., time series of vector components) of size (1000,3) (the first dimension is the time and the second dimension is the vector components):

>>> print(data)
<xarray.Dataset>
Dimensions:                        (<1D_array_dimension1>: 2000, <2D_array_dimension1>: 1000, <2D_array_dimension2>: 3)
Coordinates:
  * <1D_array_dimension1>          (<time1>) datetime64[ns] 2013-08-05T00:00:00.000007 ... 2013-08-05T00:04:59.000987
  * <2D_array_dimension1>          (<time2>) datetime64[ns] 2013-08-05T00:00:00.000006 ... 2013-08-05T00:05:00
  * <2D_array_dimension2>          (<vector_components>) <U1 'x' ... 'z'
Data variables:
        <prod1>                        (<time1>) float32 86.812 ... 29.063
        <prod2>                        (<time2>, <vector_components>) float32 144.979 ... 5.028
Attributes:
        mission:  <mission>

The data retrieved from the Mission subpackage is an xarray.Dataset object, which is basically a labeled dictionary of xarray.DataArray objects. This Dataset contains two xarray.DataArray objects, which are in turn N-dimensional arrays with labeled coordinates and dimensions. The Dataset has a total of 3 Dimensions, which represent the two Coordinates (labels) of the 2D DataArray (<vector_components>, the xyz components of the vector, and time2, the timestamps) and the Coordinate (label) of the 1D DataArray (time1, the timestamps). The arrays are summarized in the Data variables section: each Data variable corresponds to one DataArray and represents a queried data product. The dimensions of each data product (array) are shown next to each data variable (here (2000) and (1000, 3)), as well as the data type and the first and last value of each data product (i.e., DataArray). Metadata is embedded in the objects' Attributes in the form of a basic Python dictionary. For instance, the mission is written in the Dataset Attributes, and the units and spacecraft location can be found in the Attributes of each DataArray object (e.g., units can be retrieved by typing data['prod1'].attrs['Units']).
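
As a minimal sketch of how these attributes and labels can be inspected, assuming the data Dataset from above and the dc_mag1 product name used in the Cluster example further down:

# Dataset-level metadata, e.g. the mission name
print(data.attrs['mission'])

# DataArray-level metadata, e.g. the units of a product
print(data['dc_mag1'].attrs['Units'])

# Labeled dimensions and coordinates of a product
print(data['dc_mag1'].dims)
print(data['dc_mag1'].coords)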

Plotting

The DataArray and Dataset objects provide a simplified plotting system (see http://xarray.pydata.org/en/stable/plotting.html) based on the popular matplotlib package (https://matplotlib.org). Thus, it is rather easy to plot the data that is returned by the Mission subpackage using either the xarray.Dataset.plot() or xarray.DataArray.plot() methods. For instance, the following code snippet shows how to line plot the 1D array from the above Dataset:

>>> data['<prod1>'].plot()

The following code snippet shows how to line plot the 2D array from the above Dataset:

>>> data['<prod2>'].plot.line(x='time2')
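
Because these plotting methods rely on matplotlib, the resulting figures can be customized and saved with standard matplotlib calls; a minimal sketch (the label and file name are only illustrative):

import matplotlib.pyplot as plt

data['<prod2>'].plot.line(x='time2')
plt.ylabel('<prod2>')                       # choose a label matching the product and its units
plt.savefig('prod2_overview.png', dpi=150)  # save the figure to disk
plt.show()
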
File handling

Because the core of the downloading process is based on the HelioPy package, we chose to borrow its file handling system.

  1. Configuration file:

    HelioPy comes with a sample source or configuration file (“heliopyrc”), that is located in ~/.heliopy/heliopyrc and can be customised by the user (the config parser will look in ~/.heliopy first). The default contents of the configuration file are:

  • The working directory is the parent directory in which all downloaded data will be stored. By default, it is set to download_dir = ~/heliopy/data. Note that this default value may be changed to ~/aidapy/data in the future. Inside this directory, the files are stored in a rather standardized folder tree with the following structure: ~/mission/probe/instrument/mode (if available, else the year). A short sketch for inspecting this configuration file programmatically is given after this list.

  • The user can also choose whether to convert all downloaded data to an HDF store, enabling much faster file reading after the initial load, but requiring the additional h5py and py-tables dependencies. By default, this value is set to False: use_hdf = False

  • In this file is also stored the user’s personal Cluster cookie that is required to download data from the Cluster Science Archive (CSA). This personal cookie can be retrieved by registering at the following address: http://www.cosmos.esa.int/web/csa/register-now . By default, the cookie value is not set: cluster_cookie = none

  2. File version control:

    One key point of the AIDA project objectives is to always deal with up-to-date data files, which evolve on the fly at the open-access sources due to regular reprocessing of the available datasets. Fortunately, the HelioPy package fulfills this requirement by getting the latest version of the data files at every execution. In other words, the program checks the latest available files from the open-access sources every time data is requested. If the available file is not already in the user's database, it is downloaded and processed. Thus, up-to-date data is returned to the user every time. This behavior will also be generalized to AIDApy in the future: every time the user analyses spacecraft data in AIDApy, the program will check whether the latest available files are in the user's database, and otherwise download and process them.
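
As an illustration only, the configuration file described above can be inspected with Python's standard configparser; this sketch assumes heliopyrc is a plain INI-style file (the exact section layout may differ):

import os
import configparser

# Default location of the HelioPy configuration file
config_path = os.path.expanduser('~/.heliopy/heliopyrc')

parser = configparser.ConfigParser()
parser.read(config_path)

# Options kept in the [DEFAULT] section (download_dir, use_hdf, cluster_cookie, ...)
print(dict(parser.defaults()))

# Options kept in any named sections
for section in parser.sections():
    print(section, dict(parser.items(section)))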

load_data function

Velocity distribution function

vdf extension to xarray

Machine Learning tool

Introduction

Examples

In this section, we provide simple and practical examples on how to use the different subpackages of the aidapy package.

Spacecraft data downloading

Introduction

Here we demonstrate how to download spacecraft data using the load_data function of AIDApy. Three missions are currently implemented: Omniweb, Cluster, and MMS.

[1]:
from datetime import datetime

#AIDApy Modules
from aidapy import load_data
import aidapy.aidaxr
Omniweb data

First, we set up the parameters of the data: the time interval and the name of the mission. For this example, we will download all available data from Omniweb between May 15, 2008 and May 16, 2008; all other parameters are left at their default values.

[3]:
start_time_omni = datetime(2008, 5, 15, 0, 0, 0)
end_time_omni = datetime(2008, 5, 16, 0, 0, 0)

Once all the parameters have been set up, the load_data function can be called. The data are given in an xarray.Dataset format, which is an implementation of labeled, multi-dimensional arrays. A specific extension (aidapy.aidaxr) has been developed to provide pre- and post-processing of the data (see the statistical example).

[4]:
xr_omni = load_data(mission='omni', start_time=start_time_omni, end_time=end_time_omni)
xr_omni_all = xr_omni['all1']
print(xr_omni_all)
[4]:
xr_omni_mag = xr_omni_all.sel(products=['Bx GSE, GSM', 'By GSM', 'Bz GSM'])
xr_omni_mag.plot.line(x='time1')
/home/rdupuis/miniconda3/envs/aidapy/lib/python3.7/site-packages/pandas/plotting/_matplotlib/converter.py:103: FutureWarning: Using an implicitly registered datetime converter for a matplotlib plotting method. The converter was registered by pandas on import. Future versions of pandas will require you to explicitly register matplotlib converters.

To register the converters:
        >>> from pandas.plotting import register_matplotlib_converters
        >>> register_matplotlib_converters()
  warnings.warn(msg, FutureWarning)
[4]:
[<matplotlib.lines.Line2D at 0x7ff6c028a750>,
 <matplotlib.lines.Line2D at 0x7ff6c0246050>,
 <matplotlib.lines.Line2D at 0x7ff6c0246310>]
_images/examples_01_missions_mission_test_6_2.png
Cluster data

Then, data can be downloaded from the Cluster mission. As this mission has several spacecraft, the index of the spacecraft, called probe, must also be provided. For this example, we will download the magnetic field during 5 minutes on August 5, 2013 for probes 1 and 2 in the GSE coordinate system.

[5]:
start_time_cluster = datetime(2013, 8, 5, 0, 0, 0)
end_time_cluster = datetime(2013, 8, 5, 0, 5, 0)

settings_cluster = {'prod': ['dc_mag'], 'probes': ['1', '2'], 'coords': 'gse'}
[6]:
xr_cluster = load_data('cluster', start_time_cluster, end_time_cluster, **settings_cluster)
print(xr_cluster)
<xarray.Dataset>
Dimensions:                        (B_vec_xyz_gse__C1_CP_FGM_SPIN: 3, B_vec_xyz_gse__C2_CP_FGM_SPIN: 3, time1: 71, time2: 72)
Coordinates:
  * B_vec_xyz_gse__C1_CP_FGM_SPIN  (B_vec_xyz_gse__C1_CP_FGM_SPIN) <U1 'x' ... 'z'
  * time1                          (time1) datetime64[ns] 2013-08-05T00:00:02.000119 ... 2013-08-05T00:04:58.000458
  * B_vec_xyz_gse__C2_CP_FGM_SPIN  (B_vec_xyz_gse__C2_CP_FGM_SPIN) <U1 'x' ... 'z'
  * time2                          (time2) datetime64[ns] 2013-08-05T00:00:02.000182 ... 2013-08-05T00:04:56.000512
Data variables:
    dc_mag1                        (time1, B_vec_xyz_gse__C1_CP_FGM_SPIN) float32 144.983 ... 4.068
    dc_mag2                        (time2, B_vec_xyz_gse__C2_CP_FGM_SPIN) float32 86.769 ... 29.271
Attributes:
    mission:        cluster
    load_settings:  {'prod': ['dc_mag'], 'probes': ['1', '2'], 'coords': 'gse'}
[7]:
xr_cluster['dc_mag1'].plot.line(x='time1')
[7]:
[<matplotlib.lines.Line2D at 0x7ff6c02b2f50>,
 <matplotlib.lines.Line2D at 0x7ff6c01291d0>,
 <matplotlib.lines.Line2D at 0x7ff6c012f190>]
_images/examples_01_missions_mission_test_10_1.png
[8]:
xr_cluster['dc_mag2'].plot.line(x='time2')
[8]:
[<matplotlib.lines.Line2D at 0x7ff6bf759810>,
 <matplotlib.lines.Line2D at 0x7ff6bf7be9d0>,
 <matplotlib.lines.Line2D at 0x7ff6bf7becd0>]
_images/examples_01_missions_mission_test_11_1.png
MMS data

Finally, data can be downloaded from the Magnetospheric Multiscale (MMS) mission. This mission also has several spacecraft. For this example, we will download the magnetic field during 1 minute on April 8, 2018 for probes 1 and 2 in the GSE coordinate system.

[9]:
start_time_mms = datetime(2018, 4, 8, 0, 0, 0)
end_time_mms = datetime(2018, 4, 8, 0, 1, 0)

settings_mms = {'prod': ['dc_mag'], 'probes': ['1', '2'], 'coords': 'gse'}
[10]:
xr_mms = load_data(mission='mms', start_time=start_time_mms, end_time=end_time_mms, **settings_mms)
/home/rdupuis/miniconda3/envs/aidapy/lib/python3.7/site-packages/HelioPy-0+unknown-py3.7.egg/heliopy/data/util.py:601: UserWarning: The CDF provided units ('hours') for key 'mms1_mec_mlt' are unknown
  warnings.warn(message)
Creating new directory /home/rdupuis/heliopy/data/mms/2/fgm/srvy

/home/rdupuis/miniconda3/envs/aidapy/lib/python3.7/site-packages/HelioPy-0+unknown-py3.7.egg/heliopy/data/util.py:601: UserWarning: The CDF provided units ('hours') for key 'mms2_mec_mlt' are unknown
  warnings.warn(message)
[11]:
print(xr_mms['dc_mag1'])
xr_mms['dc_mag1'].plot.line(x='time1')
<xarray.DataArray 'dc_mag1' (time1: 849, mms1_fgm_b_gse_srvy_l2: 4)>
array([[ 1.712529,  2.073415, -1.225534,  2.95529 ],
       [ 1.714843,  2.064923, -1.207659,  2.943303],
       [ 1.710802,  2.077349, -1.215393,  2.952863],
       ...,
       [ 2.291496,  1.610505, -1.082593,  3.00278 ],
       [ 2.281956,  1.615301, -1.089721,  3.000669],
       [ 2.277435,  1.605456, -1.105932,  2.99788 ]], dtype=float32)
Coordinates:
  * mms1_fgm_b_gse_srvy_l2  (mms1_fgm_b_gse_srvy_l2) <U3 'x' 'y' 'z' 'tot'
  * time1                   (time1) datetime64[ns] 2018-04-08T00:00:06.987911 ... 2018-04-08T00:00:59.988633
Attributes:
    Units:    {'mms1_fgm_b_gse_srvy_l2': Unit("nT")}
    pos_gse:  <xarray.DataArray 'mms1_mec_r_gse' (time: 3, mms1_mec_r_gse: 3)...
[11]:
[<matplotlib.lines.Line2D at 0x7ff6bf78ff10>,
 <matplotlib.lines.Line2D at 0x7ff6b65c3a90>,
 <matplotlib.lines.Line2D at 0x7ff6b65c3790>,
 <matplotlib.lines.Line2D at 0x7ff6b65c3610>]
_images/examples_01_missions_mission_test_15_2.png
[12]:
settings_mms_dist = {'prod': ['i_dist'], 'probes': ['1'], 'coords': 'gse'}
xr_mms_dist = load_data(mission='mms', start_time=start_time_mms, end_time=end_time_mms, **settings_mms_dist)
/home/rdupuis/miniconda3/envs/aidapy/lib/python3.7/site-packages/HelioPy-0+unknown-py3.7.egg/heliopy/data/util.py:601: UserWarning: The CDF provided units ('hours') for key 'mms1_mec_mlt' are unknown
  warnings.warn(message)
[14]:
print(xr_mms_dist['i_dist1'])
<xarray.DataArray 'i_dist1' (time1: 14, mms1_dis_energy_fast: 32, mms1_dis_theta_fast: 16, mms1_dis_phi_fast: 32)>
array([[[[0.000000e+00, ..., 0.000000e+00],
         ...,
         [0.000000e+00, ..., 0.000000e+00]],

        ...,

        [[0.000000e+00, ..., 0.000000e+00],
         ...,
         [1.977800e-20, ..., 0.000000e+00]]],


       ...,


       [[[8.998902e-21, ..., 0.000000e+00],
         ...,
         [0.000000e+00, ..., 0.000000e+00]],

        ...,

        [[9.016211e-21, ..., 0.000000e+00],
         ...,
         [0.000000e+00, ..., 0.000000e+00]]]], dtype=float32)
Coordinates:
  * mms1_dis_energy_fast  (mms1_dis_energy_fast) float32 2.16 3.91 ... 17800.0
  * mms1_dis_theta_fast   (mms1_dis_theta_fast) float32 5.625 16.875 ... 174.375
  * mms1_dis_phi_fast     (mms1_dis_phi_fast) float32 2.75 14.0 ... 340.25 351.5
  * time1                 (time1) datetime64[ns] 2018-04-08T00:00:00.120976 ... 2018-04-08T00:00:58.621428
Attributes:
    mms1_dis_dist_fast:  s^3/cm^6
    Units:               {'mms1_dis_dist_fast': Unit("s3 / cm6"), 'mms1_dis_e...
    pos_gse:             <xarray.DataArray 'mms1_mec_r_gse' (time: 3, mms1_me...
[ ]:

Search for scientific processes

Introduction

Here we demonstrate how to search for events of specific scientific processes using the Event Search subpackage included in AIDApy. First, we start with importing the modules required by Event Search:

[1]:
#Python modules
from datetime import datetime
import numpy as np

#AIDApy modules
from aidapy import event_search
Looking for specific EDR events

In particular, here we want to look for Electron Diffusion Regions (EDRs), which are tiny regions located at the very center of magnetic reconnection. To do so, we chose to use data from the MMS mission, which has been specifically designed to study the microphysics (down to electron scales) of magnetic reconnection. We also chose to look at data from 2017 July 20, during which a specific campaign was dedicated to the observation of reconnection in the magnetotail. The time interval is thus defined as:

[6]:
start_time = datetime(2017, 8, 20, 2, 0, 0)
end_time = datetime(2017, 8, 20, 2, 10, 0)

Then, we need to set up the settings for the specific scientific process we want to study, i.e., here the Electron Diffusion Region (EDR). The EDR is characterized by a very low magnetic field (< 5 nT), high-speed electron jets (> 4000 km/s), and high currents (current density > 70 nA/m²). We thus define these criteria as inputs to the Event Search tool, in a format compliant with the event_search function (see the Event Search section in the user guide). Additionally, we chose to download data from all probes on board the MMS mission, in the default GSE coordinate system. Finally, since the EDR size is of electron scale, we chose to use the high-resolution burst-mode data from MMS, and a time window of 60 s.

[7]:
settings = {"criteria": lambda dc_mag_tot, j_curl_tot, e_bulkv_x: (np.any(dc_mag_tot < 5)) &
            (np.any(j_curl_tot > 70)) & (np.any(np.abs(e_bulkv_x) > 4000)),
            "parameters": {"mission": "mms",
                           "process": "edr",
                           "probes": ['1'],
                           "mode": "brst",
                           "time_window": 120,
                           "time_step": 120}}

Once the settings on the data and parameters are defined, we call event_search function as follows:

[9]:
event_search(settings, start_time, end_time, plot=True)
C:\Users\breuillard\AppData\Local\Continuum\anaconda3\lib\site-packages\heliopy-0+unknown-py3.7.egg\heliopy\data\util.py:601: UserWarning: The CDF provided units ('0, 1') for key 'mms1_des_compressionloss_brst' are unknown
  warnings.warn(message)
C:\Users\breuillard\AppData\Local\Continuum\anaconda3\lib\site-packages\heliopy-0+unknown-py3.7.egg\heliopy\data\util.py:601: UserWarning: The CDF provided units ('0, 1') for key 'mms1_des_steptable_parity_brst' are unknown
  warnings.warn(message)
C:\Users\breuillard\AppData\Local\Continuum\anaconda3\lib\site-packages\heliopy-0+unknown-py3.7.egg\heliopy\data\util.py:601: UserWarning: The CDF provided units ('00-32') for key 'mms1_des_sector_despinp_brst' are unknown
  warnings.warn(message)
C:\Users\breuillard\AppData\Local\Continuum\anaconda3\lib\site-packages\heliopy-0+unknown-py3.7.egg\heliopy\data\util.py:601: UserWarning: The CDF provided units ('hours') for key 'mms1_mec_mlt' are unknown
  warnings.warn(message)
C:\Users\breuillard\AppData\Local\Continuum\anaconda3\lib\site-packages\heliopy-0+unknown-py3.7.egg\heliopy\data\util.py:601: UserWarning: The CDF provided units ('hours') for key 'mms1_mec_mlt' are unknown
  warnings.warn(message)
C:\Users\breuillard\AppData\Local\Continuum\anaconda3\lib\site-packages\heliopy-0+unknown-py3.7.egg\heliopy\data\util.py:601: UserWarning: The CDF provided units ('hours') for key 'mms1_mec_mlt' are unknown
  warnings.warn(message)
C:\Users\breuillard\AppData\Local\Continuum\anaconda3\lib\site-packages\heliopy-0+unknown-py3.7.egg\heliopy\data\util.py:601: UserWarning: The CDF provided units ('hours') for key 'mms2_mec_mlt' are unknown
  warnings.warn(message)
C:\Users\breuillard\AppData\Local\Continuum\anaconda3\lib\site-packages\heliopy-0+unknown-py3.7.egg\heliopy\data\util.py:601: UserWarning: The CDF provided units ('hours') for key 'mms3_mec_mlt' are unknown
  warnings.warn(message)
C:\Users\breuillard\AppData\Local\Continuum\anaconda3\lib\site-packages\heliopy-0+unknown-py3.7.egg\heliopy\data\util.py:601: UserWarning: The CDF provided units ('hours') for key 'mms4_mec_mlt' are unknown
  warnings.warn(message)
_images/examples_01_missions_event_search_mms_edr_7_1.png
[9]:
[0, 3999, 8000, 11999]

As seen above, the function gives as output an overview plot with the highlighted events for the user to eyeball the results, but if the to_file parameter is set to True, it also writes a simple text file with the selected events, as follows:

[10]:
#events_list.txt

'''
List of events found:

N°,       START TIME (UTC),              END TIME (UTC),                        S/C location (km)
1, 2017-08-10T12:18:01.728287000, 2017-08-10T12:19:01.208817000          -96743.35/16839.555/30637.818
''';
[ ]:

Overview of aidapy tools for VDF analysis


The present notebook illustrates the possibilities offered by aidapy for handling and analysing particle Velocity Distribution Functions (VDFs) observed by the Magnetospheric MultiScale (MMS) mission.

[1]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from datetime import datetime
from IPython.display import Image
import sys, os
import warnings
warnings.filterwarnings('ignore')
Regular aidapy setup

We first import the data loading utilities (aidapy and heliopy) and the xarray classes/accessors.

[2]:
from aidapy import load_data
import aidapy.aidaxr

We select global start and end times for the data to be loaded. A sub-selection is allowed later for finer analyses, but a time selection shorter than the duration of a single file results in no file being loaded at all.

[3]:
start_time = datetime(2019, 3, 8, 13, 54, 53)
end_time   = datetime(2019, 3, 8, 13, 57, 0)

The settings dictionary allows us to define the products we want to load. In this example, we will load the ion and electron VDFs via the keywords i_dist and e_dist, the (fluxgate) magnetic field (dc_mag), some spacecraft attitude products (sc_att), and the ion bulk velocity i_bulkv.

[4]:
settings = {'prod': ['i_dist', 'e_dist', 'dc_mag', 'sc_att', 'i_bulkv'],
            'probes': ['1'], 'coords': 'gse', 'mode': 'high_res', 'frame':'gse'}

The data are loaded with the following call of load_data:

[5]:
xr_mms = load_data(mission='mms', start_time=start_time, end_time=end_time, **settings)

The previous steps are shared by all applications using aidapy, though the precise products vary between the applications.

Probes positions

We can easily get a quick global view of the probes' positions for the chosen time interval using the SDC plots. We find that between 13:00:00 and 14:00:00, the probes were in the magnetosheath.

[6]:
url_sdc = 'https://lasp.colorado.edu/mms/sdc/public/data/sdc/mms_orbit_plots/'
url_sdc += 'mms_orbit_plot_{:4d}{:02d}{:02d}{:02d}0000.png'.format(start_time.year, start_time.month, start_time.day, start_time.hour)
Image(url=url_sdc, embed=True)
[6]:
_images/examples_01_missions_vdf_mms_overview_12_0.png
VDF-specific aidapy features

Next, we will explore the possibilities offered by the library for VDF interpolation and visualisation. aidapy visualisation tools only handle interpolated VDFs.

Setting up and running the interpolation

To set up the interpolation, we first select the species of interest, the frame in which we want to visualise the VDFs (available are the instrument frame instrument and the B-aligned frame B), the coordinate system/geometry of the interpolation grid (cartesian cart, spherical spher or cylindrical cyl), its maximum extent in terms of speed, its resolution, and the method used for the interpolation, to choose between nearest-neighbour (near), trilinear (lin) and tricubic (cub).

[7]:
species      = 'electron'
frame        = 'B'
grid_geom    = 'spher'
v_max        = 1.2e7
resolution   = 100
interp_schem = 'lin'

Additionally, one can select a time sub-interval, shorter than the overall interval defined before loading the data.

[8]:
start_time_sub = datetime(start_time.year, start_time.month, start_time.day, 13, 55, 0, 100000)
end_time_sub   = datetime(start_time.year, start_time.month, start_time.day, 13, 55, 1, 500000)

The interpolation method is now called with the defined parameters. Depending on the chosen resolution and interpolation scheme, the interpolation may take quite some time. The index of the distribution currently being interpolated is shown and updated every 10 VDFs.

[9]:
xr_mms = xr_mms.vdf.interpolate(start_time, end_time, start_time_sub, end_time_sub,
                        species=species, frame=frame, grid_geom=grid_geom,
                        v_max=v_max, resolution=resolution, interp_schem=interp_schem)


.____________________________________________________________
| mms_vdf.py, aidapy.
|
| Product(s):
|   - i_dist
|   - e_dist
|   - dc_mag
|   - sc_att
|   - i_bulkv
| Grid geometry:    spher
| Resolution:       100
| Interpolation:    lin
| Start time:       2019-03-08T13:55:00.115349000
| Stop time :       2019-03-08T13:55:01.525361000
| Ind. start-stop:  237-284
| Nb distributions: 47
|____________________________________________________________


280/4234 (current VDF/nb total VDFs)
Total runtime: 14 s.

The output of the method indicates that with the chosen sub-interval, 47 distributions were interpolated, with their corresponding indices in the original data file/arrays.

Visualisations

A few visualisations are offered by the aidaxr.graphical module, and are all illustrated below. The first one uses the spherical geometry of the grid to display the parallel and perpendicular 1D profiles of the VDF averaged over time, in this case 35 times 30 ms. The striking feature of these profiles is the arcs appearing at medium to high velocities. These are visual artifacts due to the linear interpolation, resulting in straight segments between two measurement points, which appear as arc segments on a logarithmic scale.

[10]:
xr_mms.vdf.plot('1d')
_images/examples_01_missions_vdf_mms_overview_20_0.png

The main new possibility offered by the library is the visualisation of the distribution using a scaled view and a normalised view, aiming at highlighting features of a VDF by getting rid of its high gradients, without using additional information (the VDF at some other time, or fitting procedures). These views can be displayed using the vdf_spher() method, within the \((v_{para}, v_{perp})\)-plane. The plt_contourf keyword (boolean) toggles the use of a filled contour representation implemented in matplotlib.pyplot.

[11]:
xr_mms.vdf.plot('2d', plt_contourf=True)
_images/examples_01_missions_vdf_mms_overview_22_0.png

A cylindrical average is often used to display the data in the \((v_{para}, v_{perp})\)-plane. For quickly checking that the VDF does not present obvious departures from gyrotropy, the vdf_gyro() method displays the equatorial cut of the VDF (left-most full circular plot) and four angular segments of the spherical grid, whose angular ranges are indicated in the left-most plot by the dashed lines. The first row displays the original interpolated VDF values, the second row presents the scaled values. We see in this representation that the characteristic double-branch feature seen in the scaled view is indeed gyrotropic.

[12]:
xr_mms.vdf.plot('3d_gyro')
_images/examples_01_missions_vdf_mms_overview_24_0.png

These first three methods operate a time average over the entire sub-selection (from start_time_sub to end_time_sub). vdf_spher_time retains the time dimension by splitting the energy/speed dimension into 5 intervals and averaging the scaled VDF over speed within each of these intervals. The result, namely five scaled pitch-angle distributions, is close to what would be obtained with classical pitch-angle distributions. We see in this example the 35 distributions along time, and check that the double-branch signature is consistently found through time.

[13]:
xr_mms.vdf.plot('3d_time', plt_contourf=False)
_images/examples_01_missions_vdf_mms_overview_26_0.png
Changing parameters and re-interpolating VDFs

We now show that, using the exact same loaded data, one can reset the parameters, change the species, the resolution, the interpolation scheme, etc., and re-interpolate the VDFs. Note that the time sub-interval could also be redefined here. In this example, however, we keep the same time selection in order to compare electrons and ions.

[14]:
grid_geom  = 'cart'
species    = 'ion'
resolution = 120
v_max      = 8.e5

The interpolation method is called in exactly the same way as previously. Because of the difference in measurement cadence, we only get 7 distributions in this case, with a corresponding gain in runtime.

[16]:
xr_mms = xr_mms.vdf.interpolate(start_time, end_time, start_time_sub, end_time_sub,
                        species=species, frame=frame, grid_geom=grid_geom,
                        v_max=v_max, resolution=resolution, interp_schem=interp_schem)


.____________________________________________________________
| mms_vdf.py, aidapy.
|
| Product(s):
|   - i_dist
|   - e_dist
|   - dc_mag
|   - sc_att
|   - i_bulkv
| Grid geometry:    cart
| Resolution:       120
| Interpolation:    lin
| Start time:       2019-03-08T13:55:00.235349000
| Stop time :       2019-03-08T13:55:01.585361000
| Ind. start-stop:  48-57
| Nb distributions: 9
|____________________________________________________________


50/847 (current VDF/nb total VDFs)
Total runtime: 4 s.

A cartesian coordinate system is much less handy when it comes to representation and selections. Only one plotting method is offered in this case, which provides cuts through the time-averaged VDF, with iso-contours.

[17]:
xr_mms.vdf.plot('3d')
_images/examples_01_missions_vdf_mms_overview_32_0.png
Manual handling of the interpolated VDFs

Until now, the user has only been calling methods and has not been manipulating the various products. We briefly illustrate here how to access the products, in order to code specific representations or implement further analyses. We also refer to the source scripts of each of the above plotting methods for more detailed use cases.

Let’s first check the content of the dataset xr_mms:

[18]:
xr_mms
[18]:
<xarray.Dataset>
Dimensions:                   (mms1_des_energy_brst: 32, mms1_des_phi_brst: 32, mms1_des_theta_brst: 16, mms1_dis_bulkv_gse_brst: 3, mms1_dis_energy_brst: 32, mms1_dis_phi_brst: 32, mms1_dis_theta_brst: 16, mms1_fgm_b_gse_brst_l2: 4, mms1_mec_quat_eci_to_gse: 4, phi: 120, speed: 120, theta: 120, time: 9, time1: 847, time2: 4234, time3: 16253, time4: 5, time5: 847, v: 3, v_para: 120, v_perp: 120, vx: 120, vy: 120, vz: 120)
Coordinates:
  * mms1_dis_energy_brst      (mms1_dis_energy_brst) float32 2.16 ... 17800.0
  * mms1_dis_phi_brst         (mms1_dis_phi_brst) float32 7.0 18.25 ... 355.75
  * mms1_dis_theta_brst       (mms1_dis_theta_brst) float32 5.625 ... 174.375
  * time1                     (time1) datetime64[ns] 2019-03-08T13:54:53.035267 ... 2019-03-08T13:56:59.936741
  * mms1_des_theta_brst       (mms1_des_theta_brst) float32 5.625 ... 174.375
  * mms1_des_phi_brst         (mms1_des_phi_brst) float32 5.375 ... 354.125
  * mms1_des_energy_brst      (mms1_des_energy_brst) float32 6.52 ... 27525.0
  * time2                     (time2) datetime64[ns] 2019-03-08T13:54:53.005267 ... 2019-03-08T13:56:59.996741
  * mms1_fgm_b_gse_brst_l2    (mms1_fgm_b_gse_brst_l2) <U3 'x' 'y' 'z' 'tot'
  * time3                     (time3) datetime64[ns] 2019-03-08T13:54:53.024982 ... 2019-03-08T13:56:59.995838
  * mms1_mec_quat_eci_to_gse  (mms1_mec_quat_eci_to_gse) <U3 'x' 'y' 'z' 'tot'
  * time4                     (time4) datetime64[ns] 2019-03-08T13:55:00 ... 2019-03-08T13:57:00
  * mms1_dis_bulkv_gse_brst   (mms1_dis_bulkv_gse_brst) <U1 'x' 'y' 'z'
  * time5                     (time5) datetime64[ns] 2019-03-08T13:54:53.035267 ... 2019-03-08T13:56:59.936741
Dimensions without coordinates: phi, speed, theta, time, v, v_para, v_perp, vx, vy, vz
Data variables:
    i_dist1                   (time1, mms1_dis_energy_brst, mms1_dis_theta_brst, mms1_dis_phi_brst) float32 0.0 ... 0.0
    e_dist1                   (time2, mms1_des_energy_brst, mms1_des_theta_brst, mms1_des_phi_brst) float32 1.2165267e-25 ... 0.0
    dc_mag1                   (time3, mms1_fgm_b_gse_brst_l2) float32 -0.29875872 ... 41.451748
    sc_att1                   (time4, mms1_mec_quat_eci_to_gse) float64 -0.2019 ... 0.9733
    i_bulkv1                  (time5, mms1_dis_bulkv_gse_brst) float32 4.4563413 ... -76.759415
    vdf_interp_time           (time, vx, vz) float32 1.2914518e-15 ... 2.6537375e-16
    vdf_interp                (vx, vy, vz) float64 3.526e-16 ... 2.228e-15
    grid_interp_cart          (v, vx, vy, vz) float32 -800000.0 ... 800000.0
    grid_interp_spher         (v, speed, theta, phi) float32 1385640.6 ... 0.7853982
    grid_interp_cyl           (v, v_perp, phi, v_para) float32 1131370.9 ... 800000.0
    time_interp               (time) datetime64[ns] 2019-03-08T13:55:00.235349 ... 2019-03-08T13:55:01.435361
    grid_geom                 <U4 'cart'
Attributes:
    mission:        mms
    load_settings:  {'prod': ['i_dist', 'e_dist', 'dc_mag', 'sc_att', 'i_bulk...

One can spot the variables added by the execution of the interpolate() method, among them vdf_interp, grid_interp_spher, grid_interp_cart and time_interp. The grid_* variables are three-dimensional grids giving the three coordinates of each of the VDF values. We can get the 1D arrays of the bin centers as shown here:

[19]:
vdf_time = xr_mms['vdf_interp_time'].values
grid_cart = xr_mms['grid_interp_cart'].values
centers_vx = grid_cart[0,:,0,0]
centers_vy = grid_cart[1,0,:,0]
centers_vz = grid_cart[2,0,0,:]

We check the content and dimensions of the vdf_interp_time variable:

[20]:
xr_mms['vdf_interp_time'].dims
[20]:
('time', 'vx', 'vz')

Therefore vdf_time is a time series of interpolated VDFs, along the dimensions \(v_x\) and \(v_z\). (The reason for the absence of \(v_y\) is a matter of memory saving and is obviously not ideal. It is "inherited" from the spherical product, where the two dimensions of speed and pitch angle are enough to describe the data well.) As an arbitrary example, we can plot the VDFs averaged over the perpendicular velocities \(v_x\) (different from the parallel profile shown previously, which only shows VDF values for velocities close to \(v_{perp}=0\)).

[21]:
%matplotlib notebook
fig, ax = plt.subplots(figsize=(9,9))

ind_mid = int(resolution/2.)

for d in vdf_time:
    ax.loglog(-1.*centers_vz[:ind_mid], np.nanmean(d, axis=0)[:ind_mid], '--', lw=1)
    ax.loglog(centers_vz[ind_mid:], np.nanmean(d, axis=0)[ind_mid:], lw=1)
ax.set_xlabel('v_para (m/s)')
ax.set_ylabel('VDF (#/m^6/s^3)')

plt.show()

Disturbance Storm Time analysis from OMNI data

This section contains notebooks for the analysis of the Disturbance Storm Time (DST) index using OMNI data. The DST index is widely used in the context of space weather and gives information about the strength of the ring current around Earth caused by solar protons and electrons. The ring current produces a magnetic field that directly opposes Earth's magnetic field, so a stronger ring current weakens the measured field. A negative DST value therefore means that Earth's magnetic field is weakened, which is particularly the case during solar storms.

The two notebooks download time series data from OMNI:

  • The First notebook trains several models to forecast the DST index and evaluates their performance.

  • The Second notebook analyses the influence of each variable on the DST index by computing correlations and selecting the best features.

DST forecasting from OMNI data

This example shows how to use AIDApy to perform the following tasks:

  • Download time series data from OMNI

  • Preprocess the data so that it can be used for machine learning

  • Train several models and evaluate their performance

Downloading data

We will first download low-resolution data from OMNI.

[15]:
from datetime import datetime
from aidapy import load_data


# Set the start and end date as year, month, day
t0 = datetime(2005, 1, 1)
t1 = datetime(2015, 12, 31)

# Download the data
omnixr = load_data(mission='omni', start_time=t0, end_time=t1)

# Store data in pandas format
pd_data = omnixr['all1'].to_pandas()

pd_data.describe()
Split the data into different sets

For evaluating the performance of a machine learning approach, the input data has to be split in different sets:

  • A training set, on which the model is trained

  • Optionally, a validation set, to tune so-called hyperparameters of the model

  • A test set, which is only used at the end to evaluate the performance of the model

Such a separation is important to prevent overfitting. Since there is usually a correlation between consecutive observations in a time series, it is important not to shuffle the data before splitting it in different sets. Otherwise, the test set would not be truly independent from the training set.

[2]:
from sklearn.model_selection import train_test_split

# Split into training and test data
dtrain, dtest = train_test_split(pd_data, shuffle=False)
Let’s have a look at the data

Below, the DST index is shown. Note that there are quiet periods, and periods with more magnetic storms, indicated by large negative values.

[3]:
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters
%matplotlib notebook
register_matplotlib_converters()

plt.figure()
plt.title("DST index (training/test data)")
dtrain['DST Index'].plot(label='train')
dtest['DST Index'].plot(label='test')
plt.gcf().autofmt_xdate()
plt.legend();
Selecting features

The features are here selected by hand (and they are not necessarily the best ones).

[4]:
features = ['|B|',
            'Magnitude of Avg Field Vector',
            'Proton Density',
            'Proton Temperature',
            'Alfven Mach Number',
            'Bz GSM',
            'Na/Np',
            'Plasma Flow Speed',
            'Plasma Beta',
            'Electric Field',
            'DST Index']
targets = ['DST Index']
Preprocessing data

The training and test data are preprocessed so that we can make predictions for the targets based on a history of the features. If the original features and targets are \(X_i\) and \(y_i\), then we define new features \(X_i' = [X_i, X_{i-1}, \dots, X_{i-n+1}]\), where \(n\) is the history size. We use these features to make predictions at some future time, so that the new targets are \(y_i' = y_{i+k}\), where \(k\) is the forecast time.

Another question is what to do with missing values. This is a common issue when working with spacecraft data, in particular if data from multiple instruments is combined. We have opted for a simple approach: remove all samples for which some data is missing, either in the features or the targets.
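
The windowing idea can be sketched with plain numpy; the helper below is only an illustration of the concept, not the aidapy implementation called in the next cell:

import numpy as np

def make_windows(X, y, histsize, forecast_time):
    # Stack `histsize` past rows of X into one feature vector and pair it
    # with the target `forecast_time` steps after the last row (illustration only).
    n_samples = len(X) - histsize - forecast_time + 1
    X_new = np.stack([X[i:i + histsize].ravel() for i in range(n_samples)])
    y_new = y[histsize + forecast_time - 1:]
    return X_new, y_new

# Tiny example: 2 features, history of 3 steps, forecast 1 step ahead
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_new, y_new = make_windows(X, y, histsize=3, forecast_time=1)
print(X_new.shape, y_new.shape)   # (7, 6) (7,)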

[5]:
from aidapy.ml import preprocess

histsize = 8                    # Number of past hours
forecast_time = 1               # Hours into the future

# Use the AIDApy preprocessing method for time series
X_train, y_train, mask_train = preprocess.time_series(
    dtrain[features].values, dtrain[targets].values, histsize, forecast_time)
X_test, y_test, mask_test = preprocess.time_series(
    dtest[features].values, dtest[targets].values, histsize, forecast_time)

t_train = dtrain.index[mask_train]
t_test = dtest.index[mask_test]
Regression models

For the first use case, the following regression models are used:

  • A fully-connected neural network implemented in PyTorch. Users can customize the number of layers, the number of neurons per layer, and the activation functions of the layers.

  • A linear regression model imported from scikit-learn.

[6]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from aidapy.ml import mlp
from skorch import NeuralNetRegressor
import torch

models = []

models.append({
    'name': 'Linear Regression',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', LinearRegression())])
})

# RegressorMlp is a simple, fully-connected neural network,
# for which the layer sizes are defined below. The default
# activation function is ReLU.
mlp_model = NeuralNetRegressor(
    mlp.RegressorMlp,
    max_epochs=25,
    lr=0.001,
    batch_size=128,
    optimizer=torch.optim.Adam,
    module__layer_sizes=[X_train.shape[1], 64, 64, 64, 1]
)

models.append({
    'name': 'Multilayer perceptron',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', mlp_model)])
})
[7]:
from sklearn.metrics import r2_score

for model in models:
    model['pipe'].fit(X_train, y_train)
    model['test_predict'] = model['pipe'].predict(X_test)
    model['train_predict'] = model['pipe'].predict(X_train)

for model in models:
    print("{:30} R2 score on test / train set:  {:8.3f} {:8.3f}".format(
        model['name'], r2_score(y_test, model['test_predict']),
        r2_score(y_train, model['train_predict'])))
  epoch    train_loss    valid_loss     dur
-------  ------------  ------------  ------
      1       60.7240       32.4020  2.9466
      2       17.2051       27.2859  1.8680
      3       13.9108       27.3186  4.9162
      4       11.7659       25.7481  3.9948
      5       11.0440       29.0479  2.6648
      6       10.3904       26.3400  2.7699
      7       10.1087       27.2759  1.9154
      8        9.7082       24.8305  2.6972
      9        9.3186       24.3196  2.6852
     10        9.0785       23.8498  3.3641
     11        8.9514       23.7893  2.6362
     12        8.8422       23.3065  4.0155
     13        8.6888       22.9633  2.9111
     14        8.4773       21.6068  2.7943
     15        8.4333       22.4387  2.9364
     16        8.2843       21.1695  3.0378
     17        8.1932       21.9871  3.1358
     18        8.0966       20.9862  3.1069
     19        8.1332       22.9881  3.7974
     20        7.9812       20.2496  4.5640
     21        7.9269       21.3206  4.9270
     22        7.9495       21.2599  3.3066
     23        7.7617       19.6344  5.3131
     24        7.6907       20.3049  1.9088
     25        7.6336       19.9777  2.4472
Linear Regression              R2 score on test / train set:     0.969    0.964
Multilayer perceptron          R2 score on test / train set:     0.952    0.948

Let’s compare the predictions with the actual data

[8]:
plt.figure()
plt.title('Predicting the change in DST Index')
plt.plot(t_test, y_test, label='Data test')
plt.plot(t_train, y_train, label='Data train')
for model in models:
    p = plt.plot(t_test, model['test_predict'], label=model['name'])
    # Plot another line with the same color
    plt.plot(t_train, model['train_predict'], color=p[0].get_color())
plt.legend()
plt.gcf().autofmt_xdate()
plt.grid(True)
This looks pretty good, but…

Most variables we consider in a time series have strong auto-correlation. This means that future values strongly resemble past values; examples are the temperature at a location on Earth or the price of a stock on a stock market. When making predictions for a variable with strong auto-correlation, one has to be careful in assessing the predictive power of a model. Simply predicting the past state as the future state can look convincing (also numerically in most error norms), but gives little information.

Below, we therefore take the following approach: predict the change in the variable.
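
The effect can be made concrete with a small synthetic example, independent of the OMNI data used above: a trivial persistence forecast ("the future equals the present") scores a high R2 on a strongly auto-correlated series, yet carries no information about the change:

import numpy as np
from sklearn.metrics import r2_score

# A strongly auto-correlated synthetic series (a random walk)
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=5000))

# Persistence forecast: the next value equals the current one
y_true = series[1:]
y_pred = series[:-1]
print("Persistence R2 on the value: ", r2_score(y_true, y_pred))    # close to 1

# The same forecast expressed as a change: persistence predicts "no change"
dy_true = np.diff(series)
dy_pred = np.zeros_like(dy_true)
print("Persistence R2 on the change:", r2_score(dy_true, dy_pred))  # about 0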

[9]:
from aidapy.ml import preprocess

histsize = 24                   # Number of past hours
forecast_time = 1               # Hours into the future

# Use the AIDApy preprocessing method for time series
# With the predict_change=True flag, the change in the DST index
# is the target variable
X_train, y_train, mask_train = preprocess.time_series(
    dtrain[features].values, dtrain[targets].values,
    histsize, forecast_time, predict_change=True)
X_test, y_test, mask_test = preprocess.time_series(
    dtest[features].values, dtest[targets].values,
    histsize, forecast_time, predict_change=True)

t_train = dtrain.index[mask_train]
t_test = dtest.index[mask_test]

models = []

models.append({
    'name': 'Linear Regression',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', LinearRegression())])
})

# RegressorMlp is a simple, fully-connected neural network,
# for which the layer sizes are defined below. The default
# activation function is ReLU.
mlp_model = NeuralNetRegressor(
    mlp.RegressorMlp,
    max_epochs=25,
    lr=0.001,
    batch_size=128,
    optimizer=torch.optim.Adam,
    module__layer_sizes=[X_train.shape[1], 64, 64, 64, 1]
)

models.append({
    'name': 'ANN',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', mlp_model)])
})
[10]:
from sklearn.metrics import r2_score

for model in models:
    model['pipe'].fit(X_train, y_train)
    model['test_predict'] = model['pipe'].predict(X_test)
    model['train_predict'] = model['pipe'].predict(X_train)

for model in models:
    print("{:30} R2 score on test / train set:  {:8.3f} {:8.3f}".format(
        model['name'], r2_score(y_test, model['test_predict']),
        r2_score(y_train, model['train_predict'])))
  epoch    train_loss    valid_loss     dur
-------  ------------  ------------  ------
      1        9.3040       13.4525  2.5473
      2        8.0748       13.2288  2.4043
      3        7.6921       12.3340  2.4899
      4        7.5276       12.2525  3.9910
      5        7.3974       12.1197  3.8056
      6        7.2771       11.9705  3.0214
      7        7.1804       11.9094  2.0155
      8        7.0822       11.9217  2.8191
      9        7.0014       11.8192  2.6564
     10        6.8758       11.8942  2.6622
     11        6.7928       11.8929  1.9705
     12        6.6663       11.8819  1.9809
     13        6.6267       12.3010  2.6679
     14        6.5041       12.2031  1.8921
     15        6.4059       12.2843  2.4354
     16        6.3420       12.9039  3.0471
     17        6.2806       12.1478  2.7479
     18        6.1105       13.0912  4.9462
     19        6.0671       12.2712  4.6252
     20        5.9791       12.7974  1.9119
     21        5.9633       12.8095  3.2434
     22        6.0265       13.1740  2.2317
     23        5.9151       13.1795  1.7708
     24        5.8232       13.1638  3.2358
     25        5.7188       12.9696  4.8095
Linear Regression              R2 score on test / train set:     0.389    0.411
ANN                            R2 score on test / train set:     0.287    0.457
[11]:
plt.figure()
plt.title('Predicting the change in DST Index')
plt.plot(t_test, y_test, label='Data test')
plt.plot(t_train, y_train, label='Data train')
for model in models:
    p = plt.plot(t_test, model['test_predict'], label=model['name'])
    # Plot another line with the same color
    plt.plot(t_train, model['train_predict'], color=p[0].get_color())
plt.legend()
plt.gcf().autofmt_xdate()
plt.grid(True)

Predicting the current DST index from other variables
Downloading data

We will first download low-resolution data from OMNI, and split it into a training and a test set.

[1]:
from datetime import datetime
from aidapy import load_data
from sklearn.model_selection import train_test_split

# Set the start and end date as year, month, day
t0 = datetime(2000, 1, 1)
t1 = datetime(2015, 12, 31)

# Download the data
omnixr = load_data(mission='omni', start_time=t0, end_time=t1)

# Store data in pandas format
pd_data = omnixr['all1'].to_pandas()

# Split into training and test data
dtrain, dtest = train_test_split(pd_data, shuffle=False)

pd_data.describe()
Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2000.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2000.dat
100.0% 2883584 / 2881152


Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2001.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2001.dat
100.0% 2875392 / 2873280


Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2002.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2002.dat
100.0% 2875392 / 2873280


Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2003.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2003.dat
100.0% 2875392 / 2873280


Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2004.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2004.dat
100.0% 2883584 / 2881152


[1]:
[Output of pd_data.describe(): an 8 × 52 summary table (count, mean, std, min, 25%, 50%, 75%, max) for each OMNI product, including the Bartels Rotation Number, IMF/plasma spacecraft IDs and averaging point counts, |B|, Magnitude of Avg Field Vector, field-angle components, Bx GSE/GSM, …, proton fluxes above 10/30/60 MeV, flag, ap index, f10.7 index, PC(N) index, AL/AU indices (Kyoto), and Magnetosonic Mach No. Most products have roughly 140 000 hourly samples.]

Let’s have a look at the data

Below, the DST index is shown. Note that there are quiet periods, and periods with more magnetic storms, indicated by large negative values.

[2]:
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters
%matplotlib notebook
register_matplotlib_converters()

plt.figure()
plt.title("DST index (training/test data)")
dtrain['DST Index'].plot(label='train')
dtest['DST Index'].plot(label='test')
plt.gcf().autofmt_xdate()
plt.legend();
The correlation between the DST Index and other variables

Let’s have a look at the linear correlation between the DST index and the other variables. Correlation coefficients range from -1 (perfect anti-correlation) to 1 (perfect correlation); a value around zero indicates that the variables are linearly uncorrelated (although a complex non-linear relationship may still exist).

[3]:
# Generate a correlation matrix
corr_matrix = dtrain.corr()

# Get the correlation with the DST index
dst_corr = corr_matrix['DST Index']

# Display correlations sorted by absolute value
ix = dst_corr.abs().sort_values(ascending=False).index
sorted_corr = dst_corr.reindex(ix)

# Print the variables with the strongest correlations
print(sorted_corr[:10])
products
DST Index                        1.000000
ap index                        -0.629716
Kp                              -0.573182
AL index (Kyoto)                 0.548138
AE Index                        -0.542807
PC(N) index                     -0.514771
Plasma Flow Speed               -0.456372
AU index (Kyoto)                -0.420654
|B|                             -0.396887
Magnitude of Avg Field Vector   -0.386568
Name: DST Index, dtype: float64
Selecting features

We will now select the features from which we will predict the DST index (our target). There are a number of other indices that the DST index correlates with, but to make this example somewhat challenging we will exclude them!

[4]:
# Get names of the variables
all_names = list(sorted_corr.index)

# Remove index variables
my_names = [name for name in all_names if 'ndex' not in name]

# Remove other variables
my_names.remove('Kp')

# Select variables with the strongest (absolute) correlations as features
n_features = 5
features = my_names[:n_features]
targets = ['DST Index']
all_vars = features + targets
print("The features are: ", features)
print("The targets are: ", targets)

# Remove rows with missing values. We do this after selecting features,
# otherwise we remove too many rows.
dtrain_dropna = dtrain[all_vars].dropna()
dtest_dropna = dtest[all_vars].dropna()

# Get the features and targets
X_train = dtrain_dropna[features].values
y_train = dtrain_dropna[['DST Index']].values

X_test = dtest_dropna[features].values
y_test = dtest_dropna[['DST Index']].values

t_test = dtest_dropna.index
t_train = dtrain_dropna.index
The features are:  ['Plasma Flow Speed', '|B|', 'Magnitude of Avg Field Vector', 'Na/Np', 'Bz GSM']
The targets are:  ['DST Index']
Regression models

In this example, we consider a linear regression model and an artificial neural network (ANN) regression model. These models are used in a pipeline, which also performs scaling of the input data.

[5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from aidapy.ml import mlp
from skorch import NeuralNetRegressor
import torch

# A list of models to use
models = []

# Append a dictionary with the model name and pipeline
models.append({
    'name': 'Linear Regression',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', LinearRegression())])
})

# RegressorMlp is a simple, fully-connected neural network,
# for which the layer sizes are defined below. The default
# activation function is ReLU.
mlp_model = NeuralNetRegressor(
    mlp.RegressorMlp,
    max_epochs=25,
    lr=0.001,
    batch_size=128,
    optimizer=torch.optim.Adam,
    module__layer_sizes=[X_train.shape[1], 64, 64, 64, 1]
)

models.append({
    'name': 'Multilayer perceptron',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', mlp_model)])
})
Training the models
[6]:
from sklearn.metrics import r2_score

for model in models:
    model['pipe'].fit(X_train, y_train)
    model['test_predict'] = model['pipe'].predict(X_test)
    model['train_predict'] = model['pipe'].predict(X_train)

for model in models:
    print("{:30} R2 score on test / train set:  {:8.3f} {:8.3f}".format(
        model['name'], r2_score(y_test, model['test_predict']),
        r2_score(y_train, model['train_predict'])))
  epoch    train_loss    valid_loss     dur
-------  ------------  ------------  ------
      1      265.4195      510.7300  4.9675
      2      199.7389      514.0299  6.8559
      3      196.6792      513.1639  5.4448
      4      194.9946      511.3364  8.0728
      5      193.8425      508.8832  3.3963
      6      192.9440      506.7368  6.7031
      7      192.1152      507.0628  4.3043
      8      191.5201      504.7900  4.5337
      9      190.9533      504.0500  5.6420
     10      190.5970      503.2600  3.0030
     11      190.2063      502.6676  3.0651
     12      189.9117      501.8307  7.2454
     13      189.5939      501.4661  7.0313
     14      189.3153      501.0051  4.6045
     15      189.0875      500.4397  8.1879
     16      188.8131      500.0692  5.8018
     17      188.5830      499.5452  8.0832
     18      188.3553      499.0485  5.2213
     19      188.0861      498.7523  4.9091
     20      187.9017      498.4360  4.6728
     21      187.7025      498.1764  4.8058
     22      187.4986      497.6907  3.5740
     23      187.3130      497.4540  4.4685
     24      187.1445      497.1794  4.2933
     25      186.9713      496.8183  7.3463
Linear Regression              R2 score on test / train set:     0.394    0.375
Multilayer perceptron          R2 score on test / train set:     0.415    0.404
Visualizing the model predictions
[7]:
plt.figure()
plt.title('Performance on test/train data')
plt.plot(t_test, y_test, label='Data test')
plt.plot(t_train, y_train, label='Data train')
for model in models:
    p = plt.plot(t_test, model['test_predict'], label=model['name'])
    # Plot another line with the same color
    plt.plot(t_train, model['train_predict'], color=p[0].get_color())
plt.legend()
plt.gcf().autofmt_xdate()
plt.grid(True)
Working with a history to improve predictions

So far, we have only used the ‘instantaneous’ values of other variables to predict the DST index. We can improve the predictions by keeping track of a history of values, for which we use the AIDApy preprocessing module.

[8]:
from aidapy.ml import preprocess

histsize = 24                   # Number of past hours
forecast_time = 0               # Hours into the future

Xtr = dtrain[features].values
ytr = dtrain[targets].values
Xte = dtest[features].values
yte = dtest[targets].values

# Use the AIDApy preprocessing method for time series
X_train, y_train, mask_train = preprocess.time_series(
    Xtr, ytr, histsize, forecast_time)
X_test, y_test, mask_test = preprocess.time_series(
    Xte, yte, histsize, forecast_time)

t_train = dtrain.index[mask_train]
t_test = dtest.index[mask_test]

models_v2 = []

models_v2.append({
    'name': 'Linear Regression',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', LinearRegression())])
})

mlp_model = NeuralNetRegressor(
    mlp.RegressorMlp,
    max_epochs=20,
    lr=0.001,
    batch_size=128,
    optimizer=torch.optim.Adam,
    module__layer_sizes=[X_train.shape[1], 64, 64, 64, 1]
)

models_v2.append({
    'name': 'Multilayer perceptron',
    'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
                            ('model', mlp_model)])
})
Training and evaluating the new models
[9]:
for model in models_v2:
    model['pipe'].fit(X_train, y_train)
    model['test_predict'] = model['pipe'].predict(X_test)
    model['train_predict'] = model['pipe'].predict(X_train)

for model in models_v2:
    print("{:30} R2 score on test / train set:  {:8.3f} {:8.3f}".format(
        model['name'], r2_score(y_test, model['test_predict']),
        r2_score(y_train, model['train_predict'])))
  epoch    train_loss    valid_loss     dur
-------  ------------  ------------  ------
      1      211.1822      320.4362  3.7218
      2      120.9668      272.9826  4.0578
      3      116.8801      265.6732  5.9941
      4      114.9084      249.5811  3.8256
      5      112.3243      244.7855  3.6040
      6      110.7551      240.7139  3.4175
      7      109.4082      237.4817  3.4066
      8      108.2381      234.2301  3.3303
      9      107.1030      231.2991  2.4646
     10      105.9993      228.7583  5.1124
     11      105.0800      226.6691  3.4773
     12      104.1000      224.7690  3.3862
     13      103.2819      222.9194  3.0130
     14      102.5243      221.2111  5.1869
     15      101.7451      219.7934  2.9591
     16      101.0786      218.5600  2.5240
     17      100.4546      216.9458  4.8371
     18       99.7634      215.4997  2.8842
     19       99.2128      214.3354  2.6664
     20       98.6472      213.3654  5.0065
Linear Regression              R2 score on test / train set:     0.660    0.634
Multilayer perceptron          R2 score on test / train set:     0.667    0.700
Looking at the results

Using a history of values seems to have significantly improved the performance of both models.

[10]:
plt.figure()
plt.title('Performance on test/train data')
plt.plot(t_test, y_test, label='Data test')
plt.plot(t_train, y_train, label='Data train')
for model in models_v2:
    p = plt.plot(t_test, model['test_predict'], label=model['name'])
    # Plot another line with the same color
    plt.plot(t_train, model['train_predict'], color=p[0].get_color())
plt.legend()
plt.gcf().autofmt_xdate()
plt.grid(True)
Possible extensions

Above, features were selected based on their linear correlation with the DST index. Such features work well for a linear model, but a neural network could potentially also exploit features that have a more complex, non-linear relationship with the DST index. We could simply use all variables as input for the neural network, but then we have to take care that there are no missing values (NaNs); one possible way to handle this is sketched below.
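One option (a sketch only, not part of the original notebook; pipe_all and d are hypothetical names, and we assume scikit-learn's SimpleImputer is an acceptable way to fill the gaps) is to place an imputation step in front of the scaler so that the model never sees NaNs:

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Hypothetical pipeline using all available variables as features:
# fill missing values with the column median before scaling and fitting.
pipe_all = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('preprocess', StandardScaler()),
    ('model', LinearRegression())
])

# The target column still needs its own NaN handling, e.g.:
# d = dtrain.dropna(subset=['DST Index'])
# pipe_all.fit(d.drop(columns=['DST Index']).values, d[['DST Index']].values)

Whether median imputation is appropriate depends on the variable; for data with long gaps, interpolation or simply dropping those periods may be preferable.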


Modules

aidapy

aidapy package
Subpackages
aidapy.aidafunc package
Submodules
aidapy.aidafunc.event_search module
aidapy.aidafunc.load_data module
aidapy.aidafunc.set_load_config module
Module contents
aidapy.aidaxr package
Submodules
aidapy.aidaxr.epsilon module
aidapy.aidaxr.graphical module
aidapy.aidaxr.hankel module
aidapy.aidaxr.process module
aidapy.aidaxr.statistics module
aidapy.aidaxr.tools module
aidapy.aidaxr.vdf module
aidapy.aidaxr.xevents module
Module contents
aidapy.data package
Subpackages
aidapy.data.mission package
Subpackages
aidapy.data.mission.mission_settings package
Submodules
aidapy.data.mission.mission_settings.settings module
Module contents
Submodules
aidapy.data.mission.base_mission module

This module serves as the base for all the mission data managers

Author : Romain Dupuis

class aidapy.data.mission.base_mission.BaseMission(t_start, t_end)[source]

Bases: object

Base format for the data downloaded from missions

Parameters
  • t_start (datetime object) – Start of the time interval over which data is considered

  • t_end (datetime object) – End of the time interval over which data is considered

static convert_time(time, format_out)[source]

Convert time to an accepted astropy time format

Parameters
  • time (datetime or astropy.time instance) – Time object to convert

  • format_out (str) – Output time format; must be a format accepted by astropy

download_data(data_types, probes, coords, mode, frame)[source]

Download data

get_mission_info()[source]

Provide information on the mission

Returns

info – Dictionary with information on the mission, containing the keys name, allowed_probes, and product_catalog

Return type

dict

static parse_units(xarray_obj)[source]

Convert unit at the dim level to unit at coord level

set_coords(coords)[source]

Set the coordinates

set_frame(frame)[source]

Set the frame only for spacecraft attitude

set_mode(mode)[source]

Set the instrument mode

abstract set_observation(obs_settings)[source]

Abstract method to set the different parameters of the mission

set_probe(probe)[source]

Set the current working probe

set_probes(probes)[source]

Set the probes

set_product_catalog()[source]

Set the product catalog
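To illustrate how this interface is meant to be extended, here is a rough sketch (MyMission, the structure of obs_settings, and the method bodies are hypothetical; only the constructor signature and the method names documented above are taken from this page): a new mission class subclasses BaseMission and implements the abstract set_observation method.

from datetime import datetime
from aidapy.data.mission.base_mission import BaseMission


class MyMission(BaseMission):
    """Hypothetical mission reader built on the BaseMission interface."""

    def set_observation(self, obs_settings):
        # Translate the user-supplied settings (assumed here to be a dict)
        # into the attributes expected by the base-class setters.
        self.set_probes(obs_settings.get('probes'))
        self.set_coords(obs_settings.get('coords'))
        self.set_mode(obs_settings.get('mode'))


mission = MyMission(datetime(2020, 1, 1), datetime(2020, 1, 2))
# mission.get_mission_info() would then report the name, allowed_probes
# and product_catalog entries described above.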

aidapy.data.mission.cluster module
aidapy.data.mission.mission module
aidapy.data.mission.mms module
aidapy.data.mission.modified_cluster module
aidapy.data.mission.omni module
Module contents
Module contents
aidapy.ml package
Subpackages
aidapy.ml.builders package
Submodules
aidapy.ml.builders.builders module
Module contents
aidapy.ml.callbacks package
Submodules
aidapy.ml.callbacks.callback module
aidapy.ml.callbacks.checkpoint module
aidapy.ml.callbacks.imgscallback module
aidapy.ml.callbacks.metricscallback module
aidapy.ml.callbacks.preprocesscallback module
aidapy.ml.callbacks.savecallback module
aidapy.ml.callbacks.unspr_checkpoint module
Module contents
aidapy.ml.data package
Subpackages
aidapy.ml.data.common package
Submodules
aidapy.ml.data.common.common module
aidapy.ml.data.common.dcnn_format module
aidapy.ml.data.common.dl_data_mixin module
aidapy.ml.data.common.lstm_format module
aidapy.ml.data.common.utils module
Module contents
Submodules
aidapy.ml.data.aidapy_omniweb module
aidapy.ml.data.ch_unsuper module
aidapy.ml.data.coronalholes module
aidapy.ml.data.data module
aidapy.ml.data.data_unsuper module
aidapy.ml.data.generic_db module
aidapy.ml.data.mixin_class module
Module contents
aidapy.ml.loss package
Submodules
aidapy.ml.loss.loss module
aidapy.ml.loss.loss_func module
Module contents
aidapy.ml.metrics package
Submodules
aidapy.ml.metrics.dtw module
aidapy.ml.metrics.dtw_m_e module
aidapy.ml.metrics.metric_func module
aidapy.ml.metrics.metrics module
Module contents
aidapy.ml.models package
Submodules
aidapy.ml.models.RegressorMlp module
aidapy.ml.models.dcnn module
aidapy.ml.models.kmeans module
aidapy.ml.models.lstm module
aidapy.ml.models.model module
aidapy.ml.models.model_unsuper module
aidapy.ml.models.multilayer_perceptron module
aidapy.ml.models.unet module
Module contents
aidapy.ml.optim package
Submodules
aidapy.ml.optim.optim module
Module contents
aidapy.ml.postprocess package
Module contents
aidapy.ml.utils package
Submodules
aidapy.ml.utils.config_parser module
aidapy.ml.utils.logger module
aidapy.ml.utils.prune module
aidapy.ml.utils.ts2plot module
aidapy.ml.utils.utils module
Module contents
aidapy.ml.visualization package
Submodules
aidapy.ml.visualization.dst_vis module
aidapy.ml.visualization.visualizer module
Module contents
Submodules
aidapy.ml.callback_hook module
aidapy.ml.cli module
aidapy.ml.engine module
aidapy.ml.engine_unspr module
aidapy.ml.factory module
aidapy.ml.hpo_optuna module
aidapy.ml.pruner module
Module contents
aidapy.tools package
Submodules
aidapy.tools.sitl_parsing module
aidapy.tools.vdf_plot module
aidapy.tools.vdf_utils module
Module contents
Module contents

Contributing to AIDApy

All contributions are welcome. For detailed information about the code style, please read the following instructions. All code must come with an adequate number of tests that ensure its reliability. Feel free to make pull requests.

Introduction

This guide defines the conventions for writing Python code for aidapy. The main ideas are:

  • ensuring a consistent code style

  • promoting good practices for testing

  • maintaining a good level of readability and maintainability

  • keeping it simple

Python version

Prefer Python >= 3.6 whenever possible, since major dependencies do not support older Python versions. There is no reason to use Python 2, as AIDApy does not rely on Python 2-only dependencies or legacy code.

Coding style

Stick as much as possible to PEP8 for general coding conventions and to PEP257 for docstring conventions. You can also have a look at common Python anti-patterns.

Main guidelines from PEP8

PEP8 coding conventions are:

  • Use 4 spaces per indentation level.

  • Limit all lines to a maximum of 100 characters.

  • Separate top-level function and class definitions with two blank lines.

  • Imports should be grouped in the following order, with a blank line between each group (see the example below):

    • Standard library imports.

    • Related third party imports.

    • Local application/library specific imports.
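For instance, a module header following this grouping could look like the sketch below (the imported packages are arbitrary examples):

# Standard library imports
import os
from datetime import datetime

# Related third party imports
import numpy as np

# Local application/library specific imports
from aidapy import load_data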

Use Linters

Linters are static code quality checkers. For instance, you can use the following tools to check conformity with common Python standards:

  • pylint is one of the oldest linters and tracks various problems such as good-practice violations, coding-standard violations, and programming issues. Pylint may be seen as slow, too verbose, and complex to configure properly. You can run a complete static analysis with the following command:

pylint aidapy --rcfile=setup.cfg --ignore-patterns='file_to_ignore1.py','file_to_ignore2.py'

All these linters can be installed with pip. Further details on their functionality can be found here or there. Many of these checks are also provided natively by your IDE (PyCharm, Spyder, Eclipse, etc.) or through plugins.

Documentation

Please follow the numpy docstring guidelines to document your code. All functions and classes have to be documented.

Documentation of all files must be written in-line so that it can be rendered with Sphinx. We strongly encourage you to add documentation examples that can be checked with doctest.
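As a minimal illustration (the function is invented for this example), a numpy-style docstring with a doctest could look like this:

def hours_between(t_start, t_end):
    """Return the number of whole hours between two datetimes.

    Parameters
    ----------
    t_start : datetime.datetime
        Start of the interval.
    t_end : datetime.datetime
        End of the interval.

    Returns
    -------
    int
        Number of whole hours between `t_start` and `t_end`.

    Examples
    --------
    >>> from datetime import datetime
    >>> hours_between(datetime(2020, 1, 1), datetime(2020, 1, 2))
    24
    """
    return int((t_end - t_start).total_seconds() // 3600)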

Testing

The library pytest is used to launch tests in the code. All tests can be launched using:

coverage run -m pytest

This command collects coverage information at the same time; the output consists of the test results and a coverage report.

Regarding the location of the test files, add a directory holding the unit tests of each individual module of the package:

<AIDApy_root>/aidapy/<mymodule>/tests

Each test must be minimal and should ensure non-regression and full coverage of the module's functionality. The files should be named as follows (a minimal sketch of such a file is given after the naming rule):

  • test_<name>.py where <name> is any name you think representative of the test.
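A minimal sketch of such a test file is shown below (hypothetical; the helper is defined in-line only to keep the example self-contained and would normally be imported from the module under test):

# e.g. <AIDApy_root>/aidapy/<mymodule>/tests/test_time_utils.py
from datetime import datetime


def hours_between(t_start, t_end):
    """Stand-in for a function imported from the module under test."""
    return int((t_end - t_start).total_seconds() // 3600)


def test_hours_between_whole_day():
    """A full day corresponds to exactly 24 hours."""
    assert hours_between(datetime(2020, 1, 1), datetime(2020, 1, 2)) == 24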

Integration tests, i.e. tests that involve two or more sub-modules of the aidapy package, are kept in the following directory:

<AIDApy_root>/tests

These tests are handled by the package integration team.

Using the git repository

You will always work in a branch different from master. The general rule is the following (source):

  • The production version of your code sits in a branch named master.

  • The development version of your code (ready to be tested and productionised) sits in a branch named develop.

  • When development on a new feature is started, a new branch is created off of the develop branch. This is a feature branch.

  • When development has been completed, the feature branch is merged back into the develop branch if the changes are to be taken into production, or left in its own branch if they are not.

  • When the code in the develop branch is finally ready to be released into production, a new release branch is created.

  • When the release branch is approved for release, it is merged into the master branch and develop branch to capture any fixes that may have been done directly on the release branch.

When you are working on your feature branch, commit often, every time a consistent modification is made. Please use meaningful commit messages: follow the standards presented in the guide How to write a git commit message.

Remember to always merge the latest version of the master branch into your branch before any commit. Please check that your code works and passes the standardization tool checks before requesting a merge to the master branch.
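As an illustration (branch names are examples; the final merge into master is done through a merge request, as described below), the workflow on a feature branch could look like this:

git checkout develop
git checkout -b feature/my-new-feature   # start a feature branch off develop
# ... commit often with meaningful messages ...
git fetch origin
git merge origin/master                  # bring in the latest master branch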

Once you are sure that your branch is in good condition to share with other developers and users, please make a merge request on GitLab.

Testing Standards

For the merge request (from the feature branch to the master branch) to be authorized, the feature branch must successfully pass the Continuous Integration checks. To achieve this, the pylint score must be above 8.0 / 10.0 and the test coverage above 80%.

To assess the pylint score, run:

pip install pylint
pylint aidapy --rcfile=setup.cfg

Finally, to assess the test coverage of your branch, run:

coverage run setup.py test

Help & reference

Help

Reference

About