Welcome to AIDApy’s documentation!¶
AIDApy¶
The Python package aidapy
centralizes and simplifies access to:
Spacecraft data from heliospheric missions
Space physics simulations
Advanced statistical tools
Machine Learning, Deep Learning algorithms, and applications
The aidapy
package includes four main sub-packages which are explained in detail in the Reference Manual section:
Mission tool
Event search tool
Velocity distribution function tool
Machine learning tool
The aidapy
package is part of the project AIDA (Artificial Intelligence Data Analysis) in Heliophysics funded from
the European Unions Horizon 2020 research and innovation programme under grant agreement No 776262.
It is distributed under the open-source MIT license.
Full documentation can be found here
Content¶
Getting Started¶
Motivations¶
A major advantage of our era is the easiness of gathering and recording data. Data can be found literary everywhere, from quantum scale to stars and galaxies. Great challenges are arising from this fact and the biggest of them all is how you can efficient manage all this information.
In particular for space data, technology has advanced so rapidly that every day, petabytes of useful information is recorded through satellites, other sensors around the globe, and simulations. However, a big part of this information is remain unused (or partly used) because the proper tools have not yet been created. New software packages can lead the way to win that challenge and fully (or partly) automate various processes that are necessary in order to transform them in a format that the scientists can understand and analyze.
This is the reason why we developed aidapy
which is a high level Python package for the analysis of spacecraft
data from heliospheric missions using modern techniques including data assimilation, machine learning (ML), artificial intelligence (AI), and advanced statistical models.
The transition from proprietary software based on commercial languages to the
new open-source community-developed packages in Python represent one of the main objective of aidapy
.
Our vision is to combine software tools that are already existing out there for different reasons, in a nice and efficient way.
In our minds, this is a crucial and necessary task for the reducing of the knowledge “gap” that exist in
several occasions between different scientific fields.
aidapy
has been created in order to fulfil the needs of a user with high programming skills as well as, for a user with
basic programming knowledge.
Finally, we aim to build a new specific database for helping the ongoing trend in database standardization in space. Key heliophysics problems are selected to produce a database (AIDAdb) of new high-level data products that include catalogs of features and events detected by ML and AI algorithms from existing numerical simulations and observations. New simulations are also performed to enrich this database. Moreover, many of the AI methods developed in AIDA represent themselves higher-level data products, for instance in the form of trained models.
The transition to a big-data phase and the use of the big-data language of choice Python prepares the space community to the use of ML and AI. The latest AI developments and especially the methodology of deep neural networks are particularly suited for the identification of physical processes in space images and data. However, there is a barrier to overcome between the two very distinct communities of space scientists and experts in AI. Aidapy aims at bridging this gap linking the software developed in the AI community with heliospheric data.
We are developing aidapy
to make it easier for a researcher
in space science who does not have a background in computer science or in artificial
intelligence to use some of the most advanced tools in this field and provide some concrete
examples on how AI can help analyse data from simulations and observations, how AI can help
discover physical processes hidden in the data and how it can make space weather forecasts.
~
Reference manual¶
Mission¶
Introduction¶
The Mission subpackage is based on routines mainly adopted from HelioPy (https://heliopy.readthedocs.io/en/stable/) and SunPy packages (https://docs.sunpy.org/en/stable/) and further developed. It allows to select and download data from in situ, remote and ground open-access databases. In its present version, the subpackage allows to download data from NASA OMNI, ESA Cluster and NASA MMS in situ data. The tool includes the capability to verify the version of the data to download, in order to always have the latest version of the files. The main output of the subpackage are xarray to be further used by other subpackages of the AIDApy package.
The main features of the Mission subpackage are:
being able to handle various missions, such as the Cluster mission or the Magnetospheric Multiscale Mission. Additional missions could be added according to needs without any major change to the architecture, either by contributing to HelioPy or building our own downloader;
having precise control of the data. The query can ask specific time range, probes, coordinates, etc. The module will check the availability of the data;
managing time-varying multi-dimensional distributions. This feature is of great importance to the mission module as it is not available for other Python packages;
ensuring the use of a proper data container providing raw data but also time range, metadata, etc;
offering a uniform interface to other AIDApy’s packages in order to perform advanced analysis on the mission data. For instance, an aidapy timeseries object can be generated using data from the mission package, given access to statistical tools and data processing to the final user.
Inputs¶
The philosophy of AIDApy is to ease as much as possible the user experience. To this end, we decided to minimize the number of arguments entered by the user and the knowledge required to obtain the desired data. Below is a list of the parameters the code needs from the user:
The first parameter to enter is the start and the end of the requested time interval as a Python datetime.datetime object written as: datetime(year, month, day, hour, minute, second) or an astropy.Time object allowing to handle multiple time format such as GPS, ISOT, or FITS.
The second parameter specifies the mission from which to download the data, as a string:
'mission' = '<mission1>'
. Missions that can be selected currently are: OMNIWeb, MMS, Cluster.The third parameter is a list of strings, containing the different probes of the multi spacecraft mission considered:
'probes' = ['<probe1>', '<probe2>', ...]
. Probes that can be selected are probe1, probe2, probe3 and probe4 for MMS and Cluster and probe1 for OMNIWeb. Note that for the single spacecraft mission or if one wants data from all probes, one can omit the'probes'
parameter or set it to an empty string.The fourth parameter is a string specifying the coordinate system the user wants the data in, such as:
'coords' = '<coord_system>'
. Note that for the sake of clarity, it has been decided that data can be retrieved in only one coordinate system at once. The coordinate system that is currently available is Geocentric Solar Ecliptic (GSE).The last parameter to enter is a list of strings containing the data products that the user wants to be loaded (for instance the magnetic field vector, the ion density, etc), written as:
'prod' = ['<product1>', '<product2>', ...]
. As mentioned in the previous section, a catalog of all available products is embedded in the Mission subpackage, which will be updated on the fly as more and more space missions and datasets will be added. We note here that this catalog also includes Level 3 (L3) products that are not directly available through open-access databases but are processed in the AIDApy package. A preliminary catalog of detailed available products can be found below in the following table:
Data product |
Level |
Available for missions |
Description |
---|---|---|---|
dc_mag |
L2 |
|
3-component (x, y, z) or 4-component (x, y, z, tot) vector of magnetic field |
i_dens |
L2 |
|
Ion number density |
e_dens |
L2 |
|
Electron number density |
i_dist |
L2 |
|
3D ion distribution function |
e_dist |
L2 |
|
3D electron distribution function |
i_bulkv |
L2 |
|
Ion bulk velocity vector |
e_bulkv |
L2 |
|
Electron bulk velocity vector |
i_temppara |
L2 |
|
Ion temperature parallel to dc_mag |
i_tempperp |
L2 |
|
Ion temperature perpendicular to dc_mag |
i_temp |
L2 |
|
Total ion temperature |
all |
L2 |
|
All products available for OMNI data |
sc_pos |
L2 |
|
Spacecraft location |
sc_att |
L2 |
|
Spacecraft attitude |
dc_elec |
L2 |
|
DC electric field |
i_omniflux |
L2 |
|
Omnidirectional ion energy spectrum |
i_energy |
L2 |
|
Ion energy channels table |
i_aspoc |
L2 |
|
ASPOC instrument ion current |
i_prestens |
L2 |
|
Ion pressure tensor |
i_temptens |
L2 |
|
Ion temperature tensor |
i_heatq |
L2 |
|
Ion heat flux |
j_curl |
L3 |
|
Current density calculated from the Curlometer method |
mag_elev_angle |
L3 |
|
Magnetic elevation angle |
i_beta |
L3 |
|
Ion plasma beta |
e_beta |
L3 |
|
Electron plasma beta |
Availability of data products¶
As stated in the introduction, AIDApy is designed to be upgraded on the
fly according to needs such that the availability of data products can vary
depending on the considered space mission. In order to ease the burden of
browsing through the above (preliminary) data products catalog to search
for the keyword corresponding to the desired physical quantity, AIDApy provides
the get_mission_info
function as a part of the high-level load_data
function (see below). This function is designed to give
information on the data products available for each space mission, but also to
deliver the data product keywords corresponding the queried physical quantity
as well as other data product settings (e.g., available data rates or modes,
probes, coordinates, etc).
Below is a very simple code snippet showing how to get all available data products from a particular mission:
get_mission_info(mission='<mission>')
To get the AIDApy data product keywords corresponding to a physical quantity,
simply provide the desired quantity as a string (e.g., 'magnetic field'
):
get_mission_info(mission='<mission>', product='<physical quantity>')
All supplemental product parameters are accessible by setting the
full_product_catalog
parameter to True
:
get_mission_info(mission='<mission>', product='<physical quantity>', full_product_catalog=True)
High level interface¶
The function load_data
is used as a high-level interface with the mission subpackage.
Below is a code snippet showing how to define the inputs for the load_data
function
using a Python dictionary:
# Define the time interval
start_time = datetime(<year>, <month>, <day>, <hour>, <minute>, <second>)
end_time = datetime(<year>, <month>, <day>, <hour>, <minute>, <second>)
# Define the settings as a Python dictionary
settings = {'prod': ['<prod1>', '<prod2>'], 'probes': ['<probe1>', '<probe2>'],
'coords': '<coord_system>'}
Once the parameters are set up, we generate the Mission downloader and
download the data. This is done by calling the load_data
function, that tells the
Mission subpackage to create a specific downloader for every mission and time
interval requested, and download and load
the desired data products for the specified probes and coordinate system:
data = load_data(mission='<mission>', start_time, end_time, **settings)
The data is then returned in the data object and can be printed using the print() Python command. The returned data is discussed in the following section.
Outputs¶
The DataArray and Dataset objects from the Python package xarray (http://xarray.pydata.org/en/stable/) have been chosen as outputs for the Mission subpackage, as they have been especially designed to handle time-series of multidimensional data. It is basically an N-dimensional array with labeled coordinates and dimensions, which also supports metadata aware operations (see details in the Metadata subsection). It also provides many functions to easily manipulate multidimensional data, such as indexing, reshaping, resampling, etc.
The following code snippet is an example of what is printed when downloading a 1D array (e.g., time series of values) of size (2000) and a 2D array (e.g., time series of vector components) of size (1000,3) (the first dimension is the time and the second dimension is the vector components):
>>> print(data)
<xarray.Dataset>
Dimensions: (<1D_array_dimension1>: 2000, <2D_array_dimension1>: 1000, <2D_array_dimension2>: 3)
Coordinates:
* <1D_array_dimension1> (<time1>) datetime64[ns] 2013-08-05T00:00:00.000007 ... 2013-08-05T00:04:59.000987
* <2D_array_dimension1> (<time2>) datetime64[ns] 2013-08-05T00:00:00.000006 ... 2013-08-05T00:05:00
* <2D_array_dimension2> (<vector_components>) <U1 'x' ... 'z'
Data variables:
<prod1> (<time1>) float32 86.812 ... 29.063
<prod2> (<time2>, <vector_components>) float32 144.979 ... 5.028
Attributes:
mission: <mission>
The data retrieved from the Mission subpackage is an xarray.Dataset
object,
which is basically a labeled dictionary of xarray.DataArray objects. This Dataset
contains two xarray.DataArray
objects, which are in their turn N-dimensional
arrays with labeled coordinates and dimensions.
This Dataset object has a total of 3 Dimensions
, which represent the two
Coordinates
(labels) of the 2D DataArray (<vector_components>
, the xyz components
of the vector and time1
, the timestamps) and the coordinate (label) of the 1D DataArray
(time2
, the timestamps).
The dimensions of the arrays are summarized in the Data variables
section: each
DataArray has one Data variable
that represent the queried data products.
The dimension of each data product (array) is shown in front of each data variable
(here (1000,3) and (2000)), as well as the data type and first and last value of
each data product (i.e., DataArray).
Metadata is embedded in the objects Attributes
in the form of a basic Python
dictionary. For instance, the mission is written in the Dataset Attributes
and the units and spacecraft location can be found in the Attributes
of
each DataArray object (e.g., units can be retrieved by typing data['prod1'].attrs['Units']
).
Plotting¶
The DataArray and Dataset objects provide a simplified plotting
system (see http://xarray.pydata.org/en/stable/plotting.html) based on
the popular matplotlib package (https://matplotlib.org). Thus, it is rather
easy to plot the data that is returned by the Mission subpackage using either the
xarray.Dataset.plot()
or xarray.DataArray.plot()
methods. For instance,
the following code snippet shows how to line plot the 1D array from the above Dataset:
>>> data['<prod1>'].plot()
The following code snippet shows how to line plot the 2D array from the above Dataset:
>>> data['<prod2>'].plot.line(x='time2')
File handling¶
Because the core of the downloading process is based on the HelioPy package, we chose to borrow its file handling system.
Configuration file:
HelioPy comes with a sample source or configuration file (“heliopyrc”), that is located in
~/.heliopy/heliopyrc
and can be customised by the user (the config parser will look in ~/.heliopy first). The default contents of the configuration file are:
The working directory is the parent directory in which all downloaded data will be stored. By default, it is set to:
download_dir = ~/heliopy/data
Note that this default value may be changed to~/aidapy/data
in the future. Inside this directory, the files are stored in a rather standardized folder tree with the following structure:~/mission/probe/instrument/mode
(if available, else the year).The user can also choose whether to convert all downloaded data to a hdf store, enabling much faster file reading after the initial load, but requiring the additional h5py and py-tables dependencies. By default, this value is set to False:
use_hdf = False
In this file is also stored the user’s personal Cluster cookie that is required to download data from the Cluster Science Archive (CSA). This personal cookie can be retrieved by registering at the following address: http://www.cosmos.esa.int/web/csa/register-now . By default, the cookie value is not set:
cluster_cookie = none
File version control:
One key point of the AIDA project objectives is to always deal with up-to-date data files, that evolve on the fly from open-access sources due to regular reprocessing of available datasets. Fortunately, the HelioPy package fulfills this requirement by getting the latest version of data files at every execution. In other words, the program checks the latest available files from the open-access sources every time data is requested. If the available file is not already in the user’s database, then it is downloaded and processed. Thus, the up-to-date data is returned to the user every time. This behavior will also be generalized to the AIDApy in the future. Every time the user will analyse spacecraft data in the AIDApy, the program will check if the latest available files are in the user database, otherwise it will download and process them.
load_data function¶
Event Search¶
Introduction¶
The Event Search subpackage is based on Matlab routines available at LPP and further developed, and on the Mission subpackage. It allows to search for events related to specific scientific processes (e.g. reconnection, turbulence, shocks, particle acceleration etc) in spacecraft open-access databases, and produce a list of the associated time intervals, by defining criteria to be fulfilled by some parameters (e.g., location of the spacecraft, magnitude of magnetic field and currents, etc). The output of are ASCII files with header information (date, spacecraft etc.) and time intervals, to be further used by other AIDApy subpackages e.g. Machine Learning tools. The Event Search tool also produce summary plots of the listed events, to allow the user to eyeball the automatic selection.
The main features of the Event Search subpackage are:
handling of various scientific processes (such as magnetic reconnection electron and ion diffusion regions and separatrices, plasma jet fronts, coherent structures in turbulence, particle acceleration sites at shocks, etc) in different regions of the heliosphere;
additional search parameters (e.g. density gradients, amplitude of magnetic and electric fluctuations, etc), and/or scientific processes can be added by the user according to his needs without any change to the program itself;
handling of various missions by using the Mission tool capabilities; precise control of the data by using the Mission tool capabilities. The query can ask for specific time ranges, spatial regions, probes, subprocesses, etc. The module will check the availability of the data and will ensure that the user processes the latest version of the data files;
basic plotting functionalities for the user (summary plots) to be able to eyeball the results
Inputs¶
The philosophy of this subpackage and more generally AIDApy is to ease as much as possible the user experience. To this end, we decided to minimize the number of arguments entered by the user and the knowledge required to obtain the desired data. However, for the code to be flexible and scalable, the user can also input a customized search and/or parameters. Below is a list of the parameters the code needs from the user:
The first input is the
start_time
and theend_time
of the requested interval as a Pythondatetime.datetime
object written as:datetime(year, month, day, hour, minute, second)
or an astropy. Time object allowing to handle multiple time format such as GPS, ISOT, or FITS.The second input is a Python dictionary of the
settings
which allow to customize the search. Note that for the sake of clarity, thesettings
are divided into two groups:- The
parameters
which include: The first parameter is a string specifying the mission from which to download the data:
'mission' = '<mission>'
.The second parameter is a list of strings, containing the different probes of the multi spacecraft mission considered:
'probes' = ['<probe1>', '<probe2>', ...]
. Note that for the single spacecraft mission or if one wants data from all probes, the string is set to “all” by default.The third parameter is a string specifying the coordinate system the user wants the data to be used in, such as:
'coords' = '<coord_system>'
. Note that for the sake of clarity, it has been decided that data can be retrieved in only one coordinate system at once. By default, the coordinate system is set to Geocentric Solar Ecliptic (GSE).The fourth parameter is a string specifying the data rate or instrument mode used to find the events
The fifth and sixth parameters are the length (in seconds) of the rolling time window and the leap step (also in seconds), respectively.
- The
- The
criteria
which the data should fulfill: The parameter to enter is a python
lambda
function containing the data products from the sample catalog of available data products on AIDApy shown in Mission subpackage, which can be found here. These data products are required to search for the specific process defined by the user. Each keyword should be accompanied with the criterion to be fulfilled (see code snippet below). For more complex searches, these operations can be applied on each dimension of 2D arrays such as vector components and on different probes. Note that the vector components should be referred as ‘x’, ‘y’ and ‘z’, and the magnitude as ‘tot’; the different probes should be referred as ‘1’, ‘2’, ‘3’, or ‘4’. For instance, to ensure that one spacecraft from a fleet is in the Earth’s magnetotail (\(X_{GSM} \leq -5 R_E\) and \(|Y_{GSM}| \leq 15 R_E\), \(R_E\) being the Earth radius and GSM standing for the Geocentric Solar Magnetic coordinate system) and its separation distance in the GSM XY plane with a second spacecraft is smaller than 100 km, the criterion can be defined as follows (the spacecraft location being referred as ‘sc_pos’ in AIDApy):
- The
lambda sc_pos_x1, sc_pos_y1, sc_pos_x2, sc_pos_y2: (sc_pos_x1 <= -5 * 6378) & (abs(sc_pos_y1) <= 15 * 6378) & (sqrt((sc_pos_x1 - sc_pos_x2)**2 + (sc_pos_y1 - sc_pos_y2)**2) < 100)
Note: We recommend to use Numpy mathematical functions (e.g., numpy.abs
, numpy.sqrt
), as they are usually more compatible with Numpy arrays used for arithmetic operations in the subpackage.
Below is a code snippet showing how to define the input parameters to the event_search function:
# Import Python modules
from aidapy import event_search
from datetime import datetime
import numpy
# Define desired time interval
start_time = datetime(<year>, <month>, <day>, <hour>, <minute>, <second>)
end_time = datetime(<year>, <month>, <day>, <hour>, <minute>, <second>)
# Define parameters relevant to the queried scientific process
settings = {"criteria": {lambda <var1>, <var2>: <var1> < <threshold1>, <var2> > <threshold2>}}
"parameters": {'mission': '<mission>',
'probes': ['<probe1>', '<probe2>', '<probe3>'],
'mode': '<mode>',
'time_window': '<value>',
'time_step': '<value>'},
# Search for events using the list_events subpackage
event_search(settings, start_time, end_time)
Note that the names of the parameters should be compliant with the AIDApy catalog of products (a sample is shown in the Mission subpackage description). Once the parameters are set up, the Event Search subpackage can be called to search for the desired events associated with the specific scientific process. The outputs of this subpackage are discussed in the following section.
Outputs¶
As stated above, the first output of the Event Search subpackage is a list of the found events. This list is written in a standard ASCII file, so that it can be easily loaded by other subpackages of the AIDApy package, such as Statistical, Machine Learning and Deep Learning tools. This file also contains additional header information such as the scientific process/sub-process of interest, the parameters used for the search, the spacecraft location, etc. The output file names also contains the scientific process and the run parameters. The output files are stored in a folder named after the date and time when the search has been performed.
An example of such file is shown below:
Output_list_events_<prod1>_<threshold1>_<prod2>_<threshol2>_timewindow_<value>_step_<value>.txt:
Scientific process: <optional>
Mission: <mission>
Probes: <probes>
Coordinates: <optional>
Criteria: <prod1> <var1> < <threshold1>, <prod2> <var2> < <threshold2>
List of events found:
Tint (UTC) S/C location (km)
2017-08-10T12:18:31.728287000/2017-08-10T12:19:41.208817000 -96743.35/16839.555/30637.818
Plotting¶
Here we show the basic plotting functionalities of the Event Search subpackage. A simple summary plot of all data loaded by the subpackage is displayed for the querid time interval, so that the user can eyeball the results. Only the data products on which criteria are applied are shown in different panels for the queried probes. The events found are highlighted in each panel.
The figure below shows the output plot obtained using the Event Search subpackage, for the use case of EDRs in the examples:
event_search function¶
Velocity distribution function¶
vdf extension to xarray¶
Machine Learning tool¶
Introduction¶
Examples¶
In this section, we provide simple and practical examples on how to use
the different subpackages of the aidapy
package.
Spacecraft data downloading¶
Introduction¶
Here we demonstrate how to download spacecraft data using the load_data function of AIDApy. Three missions are currently implemented: Omniweb, Cluster and MMS
[1]:
from datetime import datetime
#AIDApy Modules
from aidapy import load_data
import aidapy.aidaxr
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-9e026ada4137> in <module>
2
3 #AIDApy Modules
----> 4 from aidapy import load_data
5 import aidapy.aidaxr
ModuleNotFoundError: No module named 'aidapy'
Omniweb data¶
First, we set up the parameters of the data: time interval and name of the mission. For this example, we will download from Omniweb all available data between May 15, 2008 and May 16, 2008, thus all other parameters are let to default values.
[3]:
start_time_omni = datetime(2008, 5, 15, 0, 0, 0)
end_time_omni = datetime(2008, 5, 16, 0, 0, 0)
Once all the parameters have been set up, the load_data function can be called. The data are given in a xarray.DataSet format, which is an implementation of labeled and multi-dimensional arrays. A specific extension has been developed (aidapy.aidaxr) to provide specific pre and postprocessing of the data (see the statistical example).
[4]:
xr_omni = load_data(mission='omni', start_time=start_time_omni, end_time=end_time_omni)
xr_omni_all = xr_omni['all1']
print(xr_omni_all)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-4-c0a5c15c425d> in <module>
----> 1 xr_omni = load_data(mission='omni', start_time=start_time_omni, end_time=end_time_omni)
2 xr_omni_all = xr_omni['all1']
3 print(xr_omni_all)
NameError: name 'load_data' is not defined
[4]:
xr_omni_mag = xr_omni_all.sel(products=['Bx GSE, GSM', 'By GSM', 'Bz GSM'])
xr_omni_mag.plot.line(x='time1')
/home/rdupuis/miniconda3/envs/aidapy/lib/python3.7/site-packages/pandas/plotting/_matplotlib/converter.py:103: FutureWarning: Using an implicitly registered datetime converter for a matplotlib plotting method. The converter was registered by pandas on import. Future versions of pandas will require you to explicitly register matplotlib converters.
To register the converters:
>>> from pandas.plotting import register_matplotlib_converters
>>> register_matplotlib_converters()
warnings.warn(msg, FutureWarning)
[4]:
[<matplotlib.lines.Line2D at 0x7ff6c028a750>,
<matplotlib.lines.Line2D at 0x7ff6c0246050>,
<matplotlib.lines.Line2D at 0x7ff6c0246310>]
Cluster data¶
Then, data can be downloaded from cluster mission. As this mission has several spacecraft, the index of the spacecraft called probe must also be provided. For this example, we will download the magnetic field during 5 minutes on August 5, 2013 for the probe 1 and 2 in GSE coordinate system.
[5]:
start_time_cluster = datetime(2013, 8, 5, 0, 0, 0)
end_time_cluster = datetime(2013, 8, 5, 0, 5, 0)
settings_cluster = {'prod': ['dc_mag'], 'probes': ['1', '2'], 'coords': 'gse'}
[6]:
xr_cluster = load_data('cluster', start_time_cluster, end_time_cluster, **settings_cluster)
print(xr_cluster)
<xarray.Dataset>
Dimensions: (B_vec_xyz_gse__C1_CP_FGM_SPIN: 3, B_vec_xyz_gse__C2_CP_FGM_SPIN: 3, time1: 71, time2: 72)
Coordinates:
* B_vec_xyz_gse__C1_CP_FGM_SPIN (B_vec_xyz_gse__C1_CP_FGM_SPIN) <U1 'x' ... 'z'
* time1 (time1) datetime64[ns] 2013-08-05T00:00:02.000119 ... 2013-08-05T00:04:58.000458
* B_vec_xyz_gse__C2_CP_FGM_SPIN (B_vec_xyz_gse__C2_CP_FGM_SPIN) <U1 'x' ... 'z'
* time2 (time2) datetime64[ns] 2013-08-05T00:00:02.000182 ... 2013-08-05T00:04:56.000512
Data variables:
dc_mag1 (time1, B_vec_xyz_gse__C1_CP_FGM_SPIN) float32 144.983 ... 4.068
dc_mag2 (time2, B_vec_xyz_gse__C2_CP_FGM_SPIN) float32 86.769 ... 29.271
Attributes:
mission: cluster
load_settings: {'prod': ['dc_mag'], 'probes': ['1', '2'], 'coords': 'gse'}
[7]:
xr_cluster['dc_mag1'].plot.line(x='time1')
[7]:
[<matplotlib.lines.Line2D at 0x7ff6c02b2f50>,
<matplotlib.lines.Line2D at 0x7ff6c01291d0>,
<matplotlib.lines.Line2D at 0x7ff6c012f190>]
[8]:
xr_cluster['dc_mag2'].plot.line(x='time2')
[8]:
[<matplotlib.lines.Line2D at 0x7ff6bf759810>,
<matplotlib.lines.Line2D at 0x7ff6bf7be9d0>,
<matplotlib.lines.Line2D at 0x7ff6bf7becd0>]
MMS data¶
Finally, data can be downloaded from Magnetospheric Multiscale Mission (MMS) mission. This mission has also several spacecrafts. For this example, we will download the magnetic field during 1 minute on April 8, 2018 for the probe 1 and 2 in GSE coordinate system.
[9]:
start_time_mms = datetime(2018, 4, 8, 0, 0, 0)
end_time_mms = datetime(2018, 4, 8, 0, 1, 0)
settings_mms = {'prod': ['dc_mag'], 'probes': ['1', '2'], 'coords': 'gse'}
[10]:
xr_mms = load_data(mission='mms', start_time=start_time_mms, end_time=end_time_mms, **settings_mms)
/home/rdupuis/miniconda3/envs/aidapy/lib/python3.7/site-packages/HelioPy-0+unknown-py3.7.egg/heliopy/data/util.py:601: UserWarning: The CDF provided units ('hours') for key 'mms1_mec_mlt' are unknown
warnings.warn(message)
Creating new directory /home/rdupuis/heliopy/data/mms/2/fgm/srvy
/home/rdupuis/miniconda3/envs/aidapy/lib/python3.7/site-packages/HelioPy-0+unknown-py3.7.egg/heliopy/data/util.py:601: UserWarning: The CDF provided units ('hours') for key 'mms2_mec_mlt' are unknown
warnings.warn(message)
[11]:
print(xr_mms['dc_mag1'])
xr_mms['dc_mag1'].plot.line(x='time1')
<xarray.DataArray 'dc_mag1' (time1: 849, mms1_fgm_b_gse_srvy_l2: 4)>
array([[ 1.712529, 2.073415, -1.225534, 2.95529 ],
[ 1.714843, 2.064923, -1.207659, 2.943303],
[ 1.710802, 2.077349, -1.215393, 2.952863],
...,
[ 2.291496, 1.610505, -1.082593, 3.00278 ],
[ 2.281956, 1.615301, -1.089721, 3.000669],
[ 2.277435, 1.605456, -1.105932, 2.99788 ]], dtype=float32)
Coordinates:
* mms1_fgm_b_gse_srvy_l2 (mms1_fgm_b_gse_srvy_l2) <U3 'x' 'y' 'z' 'tot'
* time1 (time1) datetime64[ns] 2018-04-08T00:00:06.987911 ... 2018-04-08T00:00:59.988633
Attributes:
Units: {'mms1_fgm_b_gse_srvy_l2': Unit("nT")}
pos_gse: <xarray.DataArray 'mms1_mec_r_gse' (time: 3, mms1_mec_r_gse: 3)...
[11]:
[<matplotlib.lines.Line2D at 0x7ff6bf78ff10>,
<matplotlib.lines.Line2D at 0x7ff6b65c3a90>,
<matplotlib.lines.Line2D at 0x7ff6b65c3790>,
<matplotlib.lines.Line2D at 0x7ff6b65c3610>]
[12]:
settings_mms_dist = {'prod': ['i_dist'], 'probes': ['1'], 'coords': 'gse'}
xr_mms_dist = load_data(mission='mms', start_time=start_time_mms, end_time=end_time_mms, **settings_mms_dist)
/home/rdupuis/miniconda3/envs/aidapy/lib/python3.7/site-packages/HelioPy-0+unknown-py3.7.egg/heliopy/data/util.py:601: UserWarning: The CDF provided units ('hours') for key 'mms1_mec_mlt' are unknown
warnings.warn(message)
[14]:
print(xr_mms_dist['i_dist1'])
<xarray.DataArray 'i_dist1' (time1: 14, mms1_dis_energy_fast: 32, mms1_dis_theta_fast: 16, mms1_dis_phi_fast: 32)>
array([[[[0.000000e+00, ..., 0.000000e+00],
...,
[0.000000e+00, ..., 0.000000e+00]],
...,
[[0.000000e+00, ..., 0.000000e+00],
...,
[1.977800e-20, ..., 0.000000e+00]]],
...,
[[[8.998902e-21, ..., 0.000000e+00],
...,
[0.000000e+00, ..., 0.000000e+00]],
...,
[[9.016211e-21, ..., 0.000000e+00],
...,
[0.000000e+00, ..., 0.000000e+00]]]], dtype=float32)
Coordinates:
* mms1_dis_energy_fast (mms1_dis_energy_fast) float32 2.16 3.91 ... 17800.0
* mms1_dis_theta_fast (mms1_dis_theta_fast) float32 5.625 16.875 ... 174.375
* mms1_dis_phi_fast (mms1_dis_phi_fast) float32 2.75 14.0 ... 340.25 351.5
* time1 (time1) datetime64[ns] 2018-04-08T00:00:00.120976 ... 2018-04-08T00:00:58.621428
Attributes:
mms1_dis_dist_fast: s^3/cm^6
Units: {'mms1_dis_dist_fast': Unit("s3 / cm6"), 'mms1_dis_e...
pos_gse: <xarray.DataArray 'mms1_mec_r_gse' (time: 3, mms1_me...
[ ]:
Search for scientific processes¶
Introduction¶
Here we demonstrate how to search for events of specific scientific processes using the Event Search subpackage included in AIDApy. First, we start with importing the modules required by Event Search:
[1]:
#Python modules
from datetime import datetime
import numpy as np
#AIDApy modules
from aidapy import event_search
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-781f059ab734> in <module>
4
5 #AIDApy modules
----> 6 from aidapy import event_search
ModuleNotFoundError: No module named 'aidapy'
Looking for specific EDR events¶
In particular, here we want to look for Electron Diffusion Regions, which are tiny regions located at the very center of magnetic reconnection. To do so, we chose to use data from the MMS mission, which has been specifically designed to study the microphysics (down to electron scales) of magnetic reconnection. We also chose to look at data on 2017 July 20, during which a specific campaign was dedicated to the observation of reconnection in the magnetotail. The time interval is thus defined as:
[6]:
start_time = datetime(2017, 8, 20, 2, 0, 0)
end_time = datetime(2017, 8, 20, 2, 10, 0)
Then, we need to set up the settings for the specific scientific process we want to study; i. e., here Electron Diffusion Region (EDR). The EDR is characterized by a very low magnetic field (< 5 nT), high-speed electron jets (> 4000 km/s) and a high currents (current density > 70 nA/m²). Thus we define these criteria as inputs to the Event Search tool, in a format compliant with the event_search function (see the Event Search section in the user guide). Additionally, we chose to download data from all probes on board MMS mission, in the default GSE coordinate system. Finally, since the EDR region size is of electron scale, we chose to use the high-resolution burst-mode data from MMS, and a time window of 60 s.
[7]:
settings = {"criteria": lambda dc_mag_tot, j_curl_tot, e_bulkv_x: (np.any(dc_mag_tot < 5)) &
(np.any(j_curl_tot > 70)) & (np.any(np.abs(e_bulkv_x) > 4000)),
"parameters": {"mission": "mms",
"process": "edr",
"probes": ['1'],
"mode": "brst",
"time_window": 120,
"time_step": 120}}
Once the settings on the data and parameters are defined, we call event_search function as follows:
[9]:
event_search(settings, start_time, end_time, plot=True)
C:\Users\breuillard\AppData\Local\Continuum\anaconda3\lib\site-packages\heliopy-0+unknown-py3.7.egg\heliopy\data\util.py:601: UserWarning: The CDF provided units ('0, 1') for key 'mms1_des_compressionloss_brst' are unknown
warnings.warn(message)
C:\Users\breuillard\AppData\Local\Continuum\anaconda3\lib\site-packages\heliopy-0+unknown-py3.7.egg\heliopy\data\util.py:601: UserWarning: The CDF provided units ('0, 1') for key 'mms1_des_steptable_parity_brst' are unknown
warnings.warn(message)
C:\Users\breuillard\AppData\Local\Continuum\anaconda3\lib\site-packages\heliopy-0+unknown-py3.7.egg\heliopy\data\util.py:601: UserWarning: The CDF provided units ('00-32') for key 'mms1_des_sector_despinp_brst' are unknown
warnings.warn(message)
C:\Users\breuillard\AppData\Local\Continuum\anaconda3\lib\site-packages\heliopy-0+unknown-py3.7.egg\heliopy\data\util.py:601: UserWarning: The CDF provided units ('hours') for key 'mms1_mec_mlt' are unknown
warnings.warn(message)
C:\Users\breuillard\AppData\Local\Continuum\anaconda3\lib\site-packages\heliopy-0+unknown-py3.7.egg\heliopy\data\util.py:601: UserWarning: The CDF provided units ('hours') for key 'mms1_mec_mlt' are unknown
warnings.warn(message)
C:\Users\breuillard\AppData\Local\Continuum\anaconda3\lib\site-packages\heliopy-0+unknown-py3.7.egg\heliopy\data\util.py:601: UserWarning: The CDF provided units ('hours') for key 'mms1_mec_mlt' are unknown
warnings.warn(message)
C:\Users\breuillard\AppData\Local\Continuum\anaconda3\lib\site-packages\heliopy-0+unknown-py3.7.egg\heliopy\data\util.py:601: UserWarning: The CDF provided units ('hours') for key 'mms2_mec_mlt' are unknown
warnings.warn(message)
C:\Users\breuillard\AppData\Local\Continuum\anaconda3\lib\site-packages\heliopy-0+unknown-py3.7.egg\heliopy\data\util.py:601: UserWarning: The CDF provided units ('hours') for key 'mms3_mec_mlt' are unknown
warnings.warn(message)
C:\Users\breuillard\AppData\Local\Continuum\anaconda3\lib\site-packages\heliopy-0+unknown-py3.7.egg\heliopy\data\util.py:601: UserWarning: The CDF provided units ('hours') for key 'mms4_mec_mlt' are unknown
warnings.warn(message)
[9]:
[0, 3999, 8000, 11999]
As seen above, the function gives as ouput an overview plot with highlighted events for the user to eyeball the results, but if to_file
parameter set to True
, also writes a simple text file with the events selected, as follows:
[10]:
#events_list.txt
'''
List of events found:
N°, START TIME (UTC), END TIME (UTC), S/C location (km)
1, 2017-08-10T12:18:01.728287000, 2017-08-10T12:19:01.208817000 -96743.35/16839.555/30637.818
''';
[ ]:
Overview of adiapy tools for VDF analysis¶
___ The present Notebook illustrates the possibilities offered by aidapy for handling and analysing particle Velocity Distribution Functions (VDFs) observed by the Magnetospheric MultiScale (MMS) mission.
[1]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from datetime import datetime
from IPython.display import Image
import sys, os
import warnings
warnings.filterwarnings('ignore')
Regular aidapy setup¶
We first import the data loading utilities (aidapy and heliopy) and the xarray classes/accessors.
[2]:
from aidapy import load_data
import aidapy.aidaxr
We select global start and end times for the data to be loaded. A sub-selection is allowed later for finer analyses. But a time selection shorter than the duration of a single file results in no file being loaded at all.
[3]:
start_time = datetime(2019, 3, 8, 13, 54, 53)
end_time = datetime(2019, 3, 8, 13, 57, 0)
The settings dictionnary allows us to define the products we want to load. In this example, we will load the ion and electron VDFs via the keywords i_dist and e_dist, the (fluxgate) magnetic field (dc_mag), some spacecraft attitude products (sc_att) and the ion bulk velocity i_bulkv.
[4]:
settings = {'prod': ['i_dist', 'e_dist', 'dc_mag', 'sc_att', 'i_bulkv'],
'probes': ['1'], 'coords': 'gse', 'mode': 'high_res', 'frame':'gse'}
The data are loaded with the following call of load_data:
[5]:
xr_mms = load_data(mission='mms', start_time=start_time, end_time=end_time, **settings)
The previous steps are shared by all applications using aidapy, though the precise products vary between the applications.
Probes positions¶
We can easely get a quick global view of the probes’ positions for the chosen time interval using the SDC plots. We find that between 13:00:00 and 14:00:00, the probes were in the magnetosheath.
[6]:
url_sdc = 'https://lasp.colorado.edu/mms/sdc/public/data/sdc/mms_orbit_plots/'
url_sdc += 'mms_orbit_plot_{:4d}{:02d}{:02d}{:02d}0000.png'.format(start_time.year, start_time.month, start_time.day, start_time.hour)
Image(url=url_sdc, embed=True)
[6]:
VDF-specific aidapy features¶
Next, we will explore the possibilities offered by the librairy for VDFs interpolation and visualisations. aidapy visualisation tools only handle interpolated VDFs. ### Setting up and running the interpolation To set up the interpolation, we first select the species of interest, the frame of interest in which we want to visualise the VDFs (available are the instrument frame instrument and the B-aligned frame B), the coordinate system/geometry of the interpolation grid (cartesian cart, spherical spher or cylindrical cyl), its maximum extent in terms of speed and its resolution, and the method used for the interpolation, to chose between nearest-neighbour (near), trilinear (lin) and tricubic (cub).
[7]:
species = 'electron'
frame = 'B'
grid_geom = 'spher'
v_max = 1.2e7
resolution = 100
interp_schem = 'lin'
Additionaly, one can select a sub-interval of time, shorter than the overall interval defined before loading the data.
[8]:
start_time_sub = datetime(start_time.year, start_time.month, start_time.day, 13, 55, 0, 100000)
end_time_sub = datetime(start_time.year, start_time.month, start_time.day, 13, 55, 1, 500000)
The interpolation method is now called with the defined parameters. Depending on the chosen resolution and the chosen interpolating method, the interpolation may take quite some time. The index of the current distribution being interpolated is shown and updated every 10 VDF.
[9]:
xr_mms = xr_mms.vdf.interpolate(start_time, end_time, start_time_sub, end_time_sub,
species=species, frame=frame, grid_geom=grid_geom,
v_max=v_max, resolution=resolution, interp_schem=interp_schem)
.____________________________________________________________
| mms_vdf.py, aidapy.
|
| Product(s):
| - i_dist
| - e_dist
| - dc_mag
| - sc_att
| - i_bulkv
| Grid geometry: spher
| Resolution: 100
| Interpolation: lin
| Start time: 2019-03-08T13:55:00.115349000
| Stop time : 2019-03-08T13:55:01.525361000
| Ind. start-stop: 237-284
| Nb distributions: 47
|____________________________________________________________
280/4234 (current VDF/nb total VDFs)
Total runtime: 14 s.
The output of the method indicates that with the chosen sub-interval, 47 distributions were interpolated, with their corresponding indices in the original data file/arrays.
Visualisations¶
A few visualisations are offered by the aidaxr.graphical module, and are all illustrated below. The first one uses the spherical geometry of the grid to display the parallel and perpedicular 1D profiles of the VDF averaged over time, in this case 35 times 30 ms. The striking feature of these profiles are the arcs appearing for medium to high velocities. These are visual artifacts due to the linear interpolation, resulting in straight segments between two measurment points, and in arc segments in a logarithmic scale.
[10]:
xr_mms.vdf.plot('1d')
The main new possibility offered by the library is the visualisation of the distribution using a scaled view and a normalised view, aiming at highlighting features of a VDF by getting rid of its high gradients, without using additional information (VDF at some other time, or fitting procedures). These views can be displayed using the vdf_spher() method, within the \(v_{para}, v_{perp})\)-plane. the plt_contourf keyword (boolean) toggles the use of a filled contour representation implemented in matplotib.pyplot
[11]:
xr_mms.vdf.plot('2d', plt_contourf=True)
A cylindrical average is often used to display the data in the (\(v_{para}, v_{perp})\)-plane. For quickly checking that the VDF does not present obvious departure from gyrotropy, the following vdf_gyro() method displays the equatorial cut of the VDF (left-most full circular plot), and four angular segment of the spherical grid, their angular range displayed in the left-most plot by the dashed lines. The first row displays the original interpolated VDF values, the second row presents the scaled values. We see in this representation that the characteristic double branch feature seen in the scaled view is indeed gyrotropic.
[12]:
xr_mms.vdf.plot('3d_gyro')
These three first methods operate a time average over the entire sub-selection (from start_time_sub to end_time_sub). vdf_spher_time retain the time dimension, by splitting the energy/speed dimension in 5 intervals and averaging the scaled VDF over speed, over each of these intervals. The result, namely five scaled pitch-angle distributions, is close to what would be obtained by a classical pitch-angle distributions. We see in this example the 35 distributions along time, and check that the double branch signature is consistently found through time.
[13]:
xr_mms.vdf.plot('3d_time', plt_contourf=False)
Changing parameters and re-interpolating VDFs¶
We show now that using the exact same loaded data, one can reset the parameters, change the species, the resolution, the interpolation scheme, etc., and re-interpolate the VDFs. Note that the time sub-interval can be redefined here. In this example however, we keep the same time selection to compare electrons and ions.
[14]:
grid_geom = 'cart'
species = 'ion'
resolution = 120
v_max = 8.e5
The interpolating method is called exactly the same way as previously. Because of the difference of measurement cadence, we only get 7 distributions in this case, with a corresponding gain in time.
[16]:
xr_mms = xr_mms.vdf.interpolate(start_time, end_time, start_time_sub, end_time_sub,
species=species, frame=frame, grid_geom=grid_geom,
v_max=v_max, resolution=resolution, interp_schem=interp_schem)
.____________________________________________________________
| mms_vdf.py, aidapy.
|
| Product(s):
| - i_dist
| - e_dist
| - dc_mag
| - sc_att
| - i_bulkv
| Grid geometry: cart
| Resolution: 120
| Interpolation: lin
| Start time: 2019-03-08T13:55:00.235349000
| Stop time : 2019-03-08T13:55:01.585361000
| Ind. start-stop: 48-57
| Nb distributions: 9
|____________________________________________________________
50/847 (current VDF/nb total VDFs)
Total runtime: 4 s.
A cartesian coordinate system is much less handy when it comes to representation and selections. Only one plotting method is offered in this case, which provides cuts through the time average VDF, with iso-contours.
[17]:
xr_mms.vdf.plot('3d')
Manual handling of the interpolated VDFs¶
Until now, the user has only been calling methods and hasn’t been manipulating the various products. We breifly illustrate here how to access the products, in order to code specific representations, or implement some further analyses. We also refer to the source scripts of each of the above plotting methods for more detailled use cases.
Let’s first check the content of the dataset xr_mms:
[18]:
xr_mms
[18]:
<xarray.Dataset> Dimensions: (mms1_des_energy_brst: 32, mms1_des_phi_brst: 32, mms1_des_theta_brst: 16, mms1_dis_bulkv_gse_brst: 3, mms1_dis_energy_brst: 32, mms1_dis_phi_brst: 32, mms1_dis_theta_brst: 16, mms1_fgm_b_gse_brst_l2: 4, mms1_mec_quat_eci_to_gse: 4, phi: 120, speed: 120, theta: 120, time: 9, time1: 847, time2: 4234, time3: 16253, time4: 5, time5: 847, v: 3, v_para: 120, v_perp: 120, vx: 120, vy: 120, vz: 120) Coordinates: * mms1_dis_energy_brst (mms1_dis_energy_brst) float32 2.16 ... 17800.0 * mms1_dis_phi_brst (mms1_dis_phi_brst) float32 7.0 18.25 ... 355.75 * mms1_dis_theta_brst (mms1_dis_theta_brst) float32 5.625 ... 174.375 * time1 (time1) datetime64[ns] 2019-03-08T13:54:53.035267 ... 2019-03-08T13:56:59.936741 * mms1_des_theta_brst (mms1_des_theta_brst) float32 5.625 ... 174.375 * mms1_des_phi_brst (mms1_des_phi_brst) float32 5.375 ... 354.125 * mms1_des_energy_brst (mms1_des_energy_brst) float32 6.52 ... 27525.0 * time2 (time2) datetime64[ns] 2019-03-08T13:54:53.005267 ... 2019-03-08T13:56:59.996741 * mms1_fgm_b_gse_brst_l2 (mms1_fgm_b_gse_brst_l2) <U3 'x' 'y' 'z' 'tot' * time3 (time3) datetime64[ns] 2019-03-08T13:54:53.024982 ... 2019-03-08T13:56:59.995838 * mms1_mec_quat_eci_to_gse (mms1_mec_quat_eci_to_gse) <U3 'x' 'y' 'z' 'tot' * time4 (time4) datetime64[ns] 2019-03-08T13:55:00 ... 2019-03-08T13:57:00 * mms1_dis_bulkv_gse_brst (mms1_dis_bulkv_gse_brst) <U1 'x' 'y' 'z' * time5 (time5) datetime64[ns] 2019-03-08T13:54:53.035267 ... 2019-03-08T13:56:59.936741 Dimensions without coordinates: phi, speed, theta, time, v, v_para, v_perp, vx, vy, vz Data variables: i_dist1 (time1, mms1_dis_energy_brst, mms1_dis_theta_brst, mms1_dis_phi_brst) float32 0.0 ... 0.0 e_dist1 (time2, mms1_des_energy_brst, mms1_des_theta_brst, mms1_des_phi_brst) float32 1.2165267e-25 ... 0.0 dc_mag1 (time3, mms1_fgm_b_gse_brst_l2) float32 -0.29875872 ... 41.451748 sc_att1 (time4, mms1_mec_quat_eci_to_gse) float64 -0.2019 ... 0.9733 i_bulkv1 (time5, mms1_dis_bulkv_gse_brst) float32 4.4563413 ... -76.759415 vdf_interp_time (time, vx, vz) float32 1.2914518e-15 ... 2.6537375e-16 vdf_interp (vx, vy, vz) float64 3.526e-16 ... 2.228e-15 grid_interp_cart (v, vx, vy, vz) float32 -800000.0 ... 800000.0 grid_interp_spher (v, speed, theta, phi) float32 1385640.6 ... 0.7853982 grid_interp_cyl (v, v_perp, phi, v_para) float32 1131370.9 ... 800000.0 time_interp (time) datetime64[ns] 2019-03-08T13:55:00.235349 ... 2019-03-08T13:55:01.435361 grid_geom <U4 'cart' Attributes: mission: mms load_settings: {'prod': ['i_dist', 'e_dist', 'dc_mag', 'sc_att', 'i_bulk...
One can spot the four variables added by the execution of interpolate() method: distrib_interp, grid_interp_spher, grid_interp_cart and time_interp . grid_* variables are three dimensional grids giving three coordinates for each of the VDF values. We can get the 1D arrays of the centers of the bins as shown here:
[19]:
vdf_time = xr_mms['vdf_interp_time'].values
grid_cart = xr_mms['grid_interp_cart'].values
centers_vx = grid_cart[0,:,0,0]
centers_vy = grid_cart[1,0,:,0]
centers_vz = grid_cart[2,0,0,:]
We check the content and dimensions of the distrib_interp variable:
[20]:
xr_mms['vdf_interp_time'].dims
[20]:
('time', 'vx', 'vz')
Therefore vdf_time is a time series of interpolated VDFs, along the dimensions \(v_x\) and \(v_z\). [The reason for the absence of \(v_y\) is a matter of memory saving and is obviously not ideal. It is “herited” from the spherical product, where the two dimensions of speed and pitch angle are enough to well describe the data] As an arbitrary example, we can plot the VDFs averaged over the perpendicular velocities \(v_x\) (different from the parallel profile shown previously, which only shows VDF values for velocities close to \(v_{perp}=0\)).
[21]:
%matplotlib notebook
fig, ax = plt.subplots(figsize=(9,9))
ind_mid = int(resolution/2.)
for d in vdf_time:
ax.loglog(-1.*centers_vz[:ind_mid], np.nanmean(d, axis=0)[:ind_mid], '--', lw=1)
ax.loglog(centers_vz[ind_mid:], np.nanmean(d, axis=0)[ind_mid:], lw=1)
ax.set_xlabel('v_para (m/s)')
ax.set_ylabel('VDF (#/m^6/s^3)')
plt.show()
Disturbance Storm Time analysis from OMNI data¶
This section contains notebooks for the analysis of Disturbance Storm Time (DST) index using OMNI data. COPY/PASTE FROM WIKIPEDIA TO BE MODIFIED The DST index is widely used in the context of space weather and gives information about the strength of the ring current around Earth caused by solar protons and electrons. The ring current around Earth produces a magnetic field that is directly opposite Earth’s magnetic field, i.e. if the difference between solar electrons and protons gets higher, then Earth’s magnetic field becomes weaker. A negative Dst value means that Earth’s magnetic field is weakened. This is particularly the case during solar storms.
The two notebooks download time series data from OMNI:
The First notebook trains several models to forecast the DST index and evaluates their performance.
The Second notebook analyses the influence of each variable on the DST index by computing correlations and selecting the best features.
DST forecasting from OMNI data¶
This example shows how to use AIDApy to perform the following tasks:
Download time series data from OMNI
Preprocess the data so that it can be used for machine learning
Train several models and evaluate their performance
Downloading data¶
We will first download low-resolution data from OMNI.
[15]:
from datetime import datetime
from aidapy import load_data
# Set the start and end date as year, month, day
t0 = datetime(2005, 1, 1)
t1 = datetime(2015, 12, 31)
# Download the data
omnixr = load_data(mission='omni', start_time=t0, end_time=t1)
# Store data in pandas format
pd_data = omnixr['all1'].to_pandas()
pd_data.describe()
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-15-38f1038ecd2b> in <module>
1 from datetime import datetime
----> 2 from aidapy import load_data
3
4
5 # Set the start and end date as year, month, day
~\Desktop\Clone1\aidapy\aidapy\__init__.py in <module>
2
3 #aidapy init
----> 4 from aidapy.aidafunc.load_data import load_data
5 from aidapy.aidafunc.load_data import get_mission_info
6 from aidapy.aidafunc.event_search import event_search
~\Desktop\Clone1\aidapy\aidapy\aidafunc\load_data.py in <module>
9
10 import datetime
---> 11 import xarray as xr
12 from astropy.time import Time
13
ModuleNotFoundError: No module named 'xarray'
Split the data into different sets¶
For evaluating the performance of a machine learning approach, the input data has to be split in different sets:
A training set, on which the model is trained
Optionally, a validation set, to tune so-called hyperparameters of the model
A test set, which is only used at the end to evaluate the performance of the model
Such a separation is important to prevent overfitting. Since there is usually a correlation between consecutive observations in a time series, it is important not to shuffle the data before splitting it in different sets. Otherwise, the test set would not be truly independent from the training set.
[2]:
from sklearn.model_selection import train_test_split
# Split into training and test data
dtrain, dtest = train_test_split(pd_data, shuffle=False)
Let’s have a look at the data¶
Below, the DST index is shown. Note that there are quiet periods, and periods with more magnetic storms, indicated by large negative values.
[3]:
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters
%matplotlib notebook
register_matplotlib_converters()
plt.figure()
plt.title("DST index (training/test data)")
dtrain['DST Index'].plot(label='train')
dtest['DST Index'].plot(label='test')
plt.gcf().autofmt_xdate()
plt.legend();
Selecting features¶
The features are here selected by hand (and they are not necessarily the best ones).
[4]:
features = ['|B|',
'Magnitude of Avg Field Vector',
'Proton Density',
'Proton Temperature',
'Alfven Mach Number',
'Bz GSM',
'Na/Np',
'Plasma Flow Speed',
'Plasma Beta',
'Electric Field',
'DST Index']
targets = ['DST Index']
Preprocessing data¶
The training and test data are preprocessed so that we can make predictions for the targets based on a history of the features. If the original features and targets are \(X_i\) and \(y_i\), then we define new features \(X_i' = [X_i, X_{i-1}, \dots, X_{i-n-1}]\), where \(n\) is the history size. We use these features to make predictions at some future time, so that the new targets are \(y_i' = y_{i+k}\), where \(k\) is the forecast time.
Another question is what to do with missing values. This is a common issue in working with space craft data – in particular if data from multiple instruments is combined. We have opted for a simple approach: remove all samples for which some data is missing, either in the features or the targets.
[5]:
from aidapy.ml import preprocess
histsize = 8 # Number of past hours
forecast_time = 1 # Hours into the future
# Use the AIDApy preprocessing method for time series
X_train, y_train, mask_train = preprocess.time_series(
dtrain[features].values, dtrain[targets].values, histsize, forecast_time)
X_test, y_test, mask_test = preprocess.time_series(
dtest[features].values, dtest[targets].values, histsize, forecast_time)
t_train = dtrain.index[mask_train]
t_test = dtest.index[mask_test]
Regression models¶
For the first use case, the following regression models are used: * A fully-connected neural network implemented in PyTorch. Users can customize the number of layers, the number of neurons per layer, and the activation functions of the layers. * A linear regression model imported from scikit-learn
[6]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from aidapy.ml import mlp
from skorch import NeuralNetRegressor
import torch
models = []
models.append({
'name': 'Linear Regression',
'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
('model', LinearRegression())])
})
# RegressorMlp is a simple, fully-connected neural network,
# for which the layer sizes are defined below. The default
# activation function is ReLU.
mlp_model = NeuralNetRegressor(
mlp.RegressorMlp,
max_epochs=25,
lr=0.001,
batch_size=128,
optimizer=torch.optim.Adam,
module__layer_sizes=[X_train.shape[1], 64, 64, 64, 1]
)
models.append({
'name': 'Multilayer perceptron',
'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
('model', mlp_model)])
})
[7]:
from sklearn.metrics import r2_score
for model in models:
model['pipe'].fit(X_train, y_train)
model['test_predict'] = model['pipe'].predict(X_test)
model['train_predict'] = model['pipe'].predict(X_train)
for model in models:
print("{:30} R2 score on test / train set: {:8.3f} {:8.3f}".format(
model['name'], r2_score(y_test, model['test_predict']),
r2_score(y_train, model['train_predict'])))
epoch train_loss valid_loss dur
------- ------------ ------------ ------
1 60.7240 32.4020 2.9466
2 17.2051 27.2859 1.8680
3 13.9108 27.3186 4.9162
4 11.7659 25.7481 3.9948
5 11.0440 29.0479 2.6648
6 10.3904 26.3400 2.7699
7 10.1087 27.2759 1.9154
8 9.7082 24.8305 2.6972
9 9.3186 24.3196 2.6852
10 9.0785 23.8498 3.3641
11 8.9514 23.7893 2.6362
12 8.8422 23.3065 4.0155
13 8.6888 22.9633 2.9111
14 8.4773 21.6068 2.7943
15 8.4333 22.4387 2.9364
16 8.2843 21.1695 3.0378
17 8.1932 21.9871 3.1358
18 8.0966 20.9862 3.1069
19 8.1332 22.9881 3.7974
20 7.9812 20.2496 4.5640
21 7.9269 21.3206 4.9270
22 7.9495 21.2599 3.3066
23 7.7617 19.6344 5.3131
24 7.6907 20.3049 1.9088
25 7.6336 19.9777 2.4472
Linear Regression R2 score on test / train set: 0.969 0.964
Multilayer perceptron R2 score on test / train set: 0.952 0.948
Let’s compare the predictions with the actual data
[8]:
plt.figure()
plt.title('Predicting the change in DST Index')
plt.plot(t_test, y_test, label='Data test')
plt.plot(t_train, y_train, label='Data train')
for model in models:
p = plt.plot(t_test, model['test_predict'], label=model['name'])
# Plot another line with the same color
plt.plot(t_train, model['train_predict'], color=p[0].get_color())
plt.legend()
plt.gcf().autofmt_xdate()
plt.grid(True)
This looks pretty good, but…¶
Most variables we consider in a time series have strong auto-correlation. This means that future values strongly resemble past values; examples are the temperature at a location on earth or the price of a stock on a stock market. When making predictions for a variable with strong auto-correlation, one has to be careful in assessing the predictive power of a model. Simply predicting the past state as the future state can look convincing (also numerically in most error norms), but gives little information.
Below, we therefore take the following approach: predict the change in the variable.
[9]:
from aidapy.ml import preprocess
histsize = 24 # Number of past hours
forecast_time = 1 # Hours into the future
# Use the AIDApy preprocessing method for time series
# With the predict_change=True flag, the change in the DST index
# is the target variable
X_train, y_train, mask_train = preprocess.time_series(
dtrain[features].values, dtrain[targets].values,
histsize, forecast_time, predict_change=True)
X_test, y_test, mask_test = preprocess.time_series(
dtest[features].values, dtest[targets].values,
histsize, forecast_time, predict_change=True)
t_train = dtrain.index[mask_train]
t_test = dtest.index[mask_test]
models = []
models.append({
'name': 'Linear Regression',
'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
('model', LinearRegression())])
})
# RegressorMlp is a simple, fully-connected neural network,
# for which the layer sizes are defined below. The default
# activation function is ReLU.
mlp_model = NeuralNetRegressor(
mlp.RegressorMlp,
max_epochs=25,
lr=0.001,
batch_size=128,
optimizer=torch.optim.Adam,
module__layer_sizes=[X_train.shape[1], 64, 64, 64, 1]
)
models.append({
'name': 'ANN',
'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
('model', mlp_model)])
})
[10]:
from sklearn.metrics import r2_score
for model in models:
model['pipe'].fit(X_train, y_train)
model['test_predict'] = model['pipe'].predict(X_test)
model['train_predict'] = model['pipe'].predict(X_train)
for model in models:
print("{:30} R2 score on test / train set: {:8.3f} {:8.3f}".format(
model['name'], r2_score(y_test, model['test_predict']),
r2_score(y_train, model['train_predict'])))
epoch train_loss valid_loss dur
------- ------------ ------------ ------
1 9.3040 13.4525 2.5473
2 8.0748 13.2288 2.4043
3 7.6921 12.3340 2.4899
4 7.5276 12.2525 3.9910
5 7.3974 12.1197 3.8056
6 7.2771 11.9705 3.0214
7 7.1804 11.9094 2.0155
8 7.0822 11.9217 2.8191
9 7.0014 11.8192 2.6564
10 6.8758 11.8942 2.6622
11 6.7928 11.8929 1.9705
12 6.6663 11.8819 1.9809
13 6.6267 12.3010 2.6679
14 6.5041 12.2031 1.8921
15 6.4059 12.2843 2.4354
16 6.3420 12.9039 3.0471
17 6.2806 12.1478 2.7479
18 6.1105 13.0912 4.9462
19 6.0671 12.2712 4.6252
20 5.9791 12.7974 1.9119
21 5.9633 12.8095 3.2434
22 6.0265 13.1740 2.2317
23 5.9151 13.1795 1.7708
24 5.8232 13.1638 3.2358
25 5.7188 12.9696 4.8095
Linear Regression R2 score on test / train set: 0.389 0.411
ANN R2 score on test / train set: 0.287 0.457
[11]:
plt.figure()
plt.title('Predicting the change in DST Index')
plt.plot(t_test, y_test, label='Data test')
plt.plot(t_train, y_train, label='Data train')
for model in models:
p = plt.plot(t_test, model['test_predict'], label=model['name'])
# Plot another line with the same color
plt.plot(t_train, model['train_predict'], color=p[0].get_color())
plt.legend()
plt.gcf().autofmt_xdate()
plt.grid(True)
[ ]:
Predicting the DST current index from other variables¶
Downloading data¶
We will first download low-resolution data from OMNI, and split it into a training and a test set.
[1]:
from datetime import datetime
from aidapy import load_data
from sklearn.model_selection import train_test_split
# Set the start and end date as year, month, day
t0 = datetime(2000, 1, 1)
t1 = datetime(2015, 12, 31)
# Download the data
omnixr = load_data(mission='omni', start_time=t0, end_time=t1)
# Store data in pandas format
pd_data = omnixr['all1'].to_pandas()
# Split into training and test data
dtrain, dtest = train_test_split(pd_data, shuffle=False)
pd_data.describe()
Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2000.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2000.dat
100.0% 2883584 / 2881152
Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2001.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2001.dat
100.0% 2875392 / 2873280
Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2002.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2002.dat
100.0% 2875392 / 2873280
Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2003.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2003.dat
100.0% 2875392 / 2873280
Downloading https://cdaweb.gsfc.nasa.gov/pub/data/omni//low_res_omni/omni2_2004.dat to /users/cpa/romaind/heliopy/data/omni/omni2_2004.dat
100.0% 2883584 / 2881152
[1]:
products | Bartels Rotation Number | ID IMF Spacecraft | ID SW Plasma Spacecraft | points(IMF Average) | points(Plasma Average) | |B| | Magnitude of Avg Field Vector | Lat. Angle of Aver. Field Vector | Long. Angle of Aver. Field Vector | Bx GSE, GSM | ... | Proton Flux > 10MeV | Proton Flux > 30MeV | Proton Flux > 60MeV | flag | ap index | f10.7 index | PC(N) index | AL index (Kyoto) | AU index (Kyoto) | Magnetosonic Mach No. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 140233.000000 | 140233.000000 | 140105.000000 | 140233.000000 | 140105.000000 | 140233.000000 | 140233.000000 | 140233.000000 | 140233.000000 | 140233.000000 | ... | 113159.000000 | 113134.000000 | 113120.000000 | 140233.000000 | 140233.000000 | 140113.000000 | 140035.000000 | 140233.000000 | 140233.000000 | 135984.000000 |
mean | 2379.926836 | 55.852353 | 56.569801 | 48.675333 | 30.049670 | 5.827601 | 5.174818 | 0.130479 | 201.635718 | -0.004155 | ... | 7.994024 | 2.036895 | 0.759653 | -0.807028 | 10.408599 | 116.238255 | 0.960367 | -110.092596 | 68.026370 | 5.641059 |
std | 62.473060 | 8.573346 | 8.166480 | 19.356001 | 9.397953 | 3.219993 | 3.069832 | 29.213007 | 100.877033 | 3.473102 | ... | 152.075039 | 48.944587 | 18.042820 | 0.394632 | 16.728629 | 43.382109 | 1.345067 | 143.618415 | 70.739359 | 1.158862 |
min | 2272.000000 | 51.000000 | 45.000000 | 1.000000 | 1.000000 | 0.400000 | 0.100000 | -89.200000 | 0.000000 | -40.800000 | ... | 0.010000 | 0.010000 | 0.010000 | -1.000000 | 0.000000 | 65.100000 | -21.500000 | -2452.000000 | -260.000000 | 0.600000 |
25% | 2326.000000 | 51.000000 | 52.000000 | 51.000000 | 17.000000 | 3.800000 | 3.200000 | -18.500000 | 122.300000 | -2.500000 | ... | 0.150000 | 0.080000 | 0.060000 | -1.000000 | 3.000000 | 78.500000 | 0.100000 | -148.000000 | 20.000000 | 4.900000 |
50% | 2380.000000 | 51.000000 | 52.000000 | 59.000000 | 35.000000 | 5.100000 | 4.500000 | -0.100000 | 181.100000 | 0.000000 | ... | 0.200000 | 0.110000 | 0.080000 | -1.000000 | 6.000000 | 107.100000 | 0.600000 | -49.000000 | 42.000000 | 5.700000 |
75% | 2434.000000 | 51.000000 | 52.000000 | 60.000000 | 37.000000 | 7.000000 | 6.200000 | 18.600000 | 302.000000 | 2.400000 | ... | 0.250000 | 0.150000 | 0.110000 | -1.000000 | 12.000000 | 141.200000 | 1.500000 | -20.000000 | 91.000000 | 6.400000 |
max | 2488.000000 | 71.000000 | 71.000000 | 94.000000 | 74.000000 | 62.000000 | 60.700000 | 89.800000 | 360.000000 | 34.800000 | ... | 9650.000000 | 3220.000000 | 1220.000000 | 0.000000 | 400.000000 | 325.100000 | 28.000000 | 22.000000 | 1226.000000 | 12.700000 |
8 rows × 52 columns
Let’s have a look at the data¶
Below, the DST index is shown. Note that there are quiet periods, and periods with more magnetic storms, indicated by large negative values.
[2]:
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters
%matplotlib notebook
register_matplotlib_converters()
plt.figure()
plt.title("DST index (training/test data)")
dtrain['DST Index'].plot(label='train')
dtest['DST Index'].plot(label='test')
plt.gcf().autofmt_xdate()
plt.legend();
The correlation between the DST Index and other variables¶
Let’s have a look at the linear correlation between the DST index and other variables. Correlations range from -1 (anti-correlation) to 1 (perfect correlation), and a value around zero indicates the variables are linearly independent. (However, there might be a complex non-linear correlation)
[3]:
# Generate a correlation matrix
corr_matrix = dtrain.corr()
# Get the correlation with the DST index
dst_corr = corr_matrix['DST Index']
# Display correlations sorted by absolute value
ix = dst_corr.abs().sort_values(ascending=False).index
sorted_corr = dst_corr.reindex(ix)
# Print the variables with the strongest correlations
print(sorted_corr[:10])
products
DST Index 1.000000
ap index -0.629716
Kp -0.573182
AL index (Kyoto) 0.548138
AE Index -0.542807
PC(N) index -0.514771
Plasma Flow Speed -0.456372
AU index (Kyoto) -0.420654
|B| -0.396887
Magnitude of Avg Field Vector -0.386568
Name: DST Index, dtype: float64
Selecting features¶
We will now select features from which we will predict the DST Index (our target). There are a number of other indices that the DST correlates to, but to make this example somewhat challenging we will exclude them!
[4]:
# Get names of the variables
all_names = list(sorted_corr.index)
# Remove index variables
my_names = [name for name in all_names if 'ndex' not in name]
# Remove other variables
my_names.remove('Kp')
# Select variables with the strongest (absolute) correlations as features
n_features = 5
features = my_names[:n_features]
targets = ['DST Index']
all_vars = features + targets
print("The features are: ", features)
print("The targets are: ", targets)
# Remove rows with missing values. We do this after selecting features,
# otherwise we remove too many rows.
dtrain_dropna = dtrain[all_vars].dropna()
dtest_dropna = dtest[all_vars].dropna()
# Get the features and targets
X_train = dtrain_dropna[features].values
y_train = dtrain_dropna[['DST Index']].values
X_test = dtest_dropna[features].values
y_test = dtest_dropna[['DST Index']].values
t_test = dtest_dropna.index
t_train = dtrain_dropna.index
The features are: ['Plasma Flow Speed', '|B|', 'Magnitude of Avg Field Vector', 'Na/Np', 'Bz GSM']
The targets are: ['DST Index']
Regression models¶
In this example, we consider a linear regression model and an artificial neural network (ANN) regression model. These models are used in a pipeline, which also performs scaling of the input data.
[5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from aidapy.ml import mlp
from skorch import NeuralNetRegressor
import torch
# A list of models to use
models = []
# Append a dictionary with the model name and pipeline
models.append({
'name': 'Linear Regression',
'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
('model', LinearRegression())])
})
# RegressorMlp is a simple, fully-connected neural network,
# for which the layer sizes are defined below. The default
# activation function is ReLU.
mlp_model = NeuralNetRegressor(
mlp.RegressorMlp,
max_epochs=25,
lr=0.001,
batch_size=128,
optimizer=torch.optim.Adam,
module__layer_sizes=[X_train.shape[1], 64, 64, 64, 1]
)
models.append({
'name': 'Multilayer perceptron',
'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
('model', mlp_model)])
})
Training the models¶
[6]:
from sklearn.metrics import r2_score
for model in models:
model['pipe'].fit(X_train, y_train)
model['test_predict'] = model['pipe'].predict(X_test)
model['train_predict'] = model['pipe'].predict(X_train)
for model in models:
print("{:30} R2 score on test / train set: {:8.3f} {:8.3f}".format(
model['name'], r2_score(y_test, model['test_predict']),
r2_score(y_train, model['train_predict'])))
epoch train_loss valid_loss dur
------- ------------ ------------ ------
1 265.4195 510.7300 4.9675
2 199.7389 514.0299 6.8559
3 196.6792 513.1639 5.4448
4 194.9946 511.3364 8.0728
5 193.8425 508.8832 3.3963
6 192.9440 506.7368 6.7031
7 192.1152 507.0628 4.3043
8 191.5201 504.7900 4.5337
9 190.9533 504.0500 5.6420
10 190.5970 503.2600 3.0030
11 190.2063 502.6676 3.0651
12 189.9117 501.8307 7.2454
13 189.5939 501.4661 7.0313
14 189.3153 501.0051 4.6045
15 189.0875 500.4397 8.1879
16 188.8131 500.0692 5.8018
17 188.5830 499.5452 8.0832
18 188.3553 499.0485 5.2213
19 188.0861 498.7523 4.9091
20 187.9017 498.4360 4.6728
21 187.7025 498.1764 4.8058
22 187.4986 497.6907 3.5740
23 187.3130 497.4540 4.4685
24 187.1445 497.1794 4.2933
25 186.9713 496.8183 7.3463
Linear Regression R2 score on test / train set: 0.394 0.375
Multilayer perceptron R2 score on test / train set: 0.415 0.404
Visualizing the model predictions¶
[7]:
plt.figure()
plt.title('Performance on test/train data')
plt.plot(t_test, y_test, label='Data test')
plt.plot(t_train, y_train, label='Data train')
for model in models:
p = plt.plot(t_test, model['test_predict'], label=model['name'])
# Plot another line with the same color
plt.plot(t_train, model['train_predict'], color=p[0].get_color())
plt.legend()
plt.gcf().autofmt_xdate()
plt.grid(True)
Working with a history to improve predictions¶
Before, we only used the ‘instantaneous’ values of other variables to predict the DST Index. We can improve predictions by keeping track of a history of values, for which we use the AIDApy preprocessing module.
[8]:
from aidapy.ml import preprocess
histsize = 24 # Number of past hours
forecast_time = 0 # Hours into the future
Xtr = dtrain[features].values
ytr = dtrain[targets].values
Xte = dtest[features].values
yte = dtest[targets].values
# Use the AIDApy preprocessing method for time series
X_train, y_train, mask_train = preprocess.time_series(
Xtr, ytr, histsize, forecast_time)
X_test, y_test, mask_test = preprocess.time_series(
Xte, yte, histsize, forecast_time)
t_train = dtrain.index[mask_train]
t_test = dtest.index[mask_test]
models_v2 = []
models_v2.append({
'name': 'Linear Regression',
'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
('model', LinearRegression())])
})
mlp_model = NeuralNetRegressor(
mlp.RegressorMlp,
max_epochs=20,
lr=0.001,
batch_size=128,
optimizer=torch.optim.Adam,
module__layer_sizes=[X_train.shape[1], 64, 64, 64, 1]
)
models_v2.append({
'name': 'Multilayer perceptron',
'pipe': Pipeline(steps=[('preprocess', StandardScaler()),
('model', mlp_model)])
})
Training and evaluating the new models¶
[9]:
for model in models_v2:
model['pipe'].fit(X_train, y_train)
model['test_predict'] = model['pipe'].predict(X_test)
model['train_predict'] = model['pipe'].predict(X_train)
for model in models_v2:
print("{:30} R2 score on test / train set: {:8.3f} {:8.3f}".format(
model['name'], r2_score(y_test, model['test_predict']),
r2_score(y_train, model['train_predict'])))
epoch train_loss valid_loss dur
------- ------------ ------------ ------
1 211.1822 320.4362 3.7218
2 120.9668 272.9826 4.0578
3 116.8801 265.6732 5.9941
4 114.9084 249.5811 3.8256
5 112.3243 244.7855 3.6040
6 110.7551 240.7139 3.4175
7 109.4082 237.4817 3.4066
8 108.2381 234.2301 3.3303
9 107.1030 231.2991 2.4646
10 105.9993 228.7583 5.1124
11 105.0800 226.6691 3.4773
12 104.1000 224.7690 3.3862
13 103.2819 222.9194 3.0130
14 102.5243 221.2111 5.1869
15 101.7451 219.7934 2.9591
16 101.0786 218.5600 2.5240
17 100.4546 216.9458 4.8371
18 99.7634 215.4997 2.8842
19 99.2128 214.3354 2.6664
20 98.6472 213.3654 5.0065
Linear Regression R2 score on test / train set: 0.660 0.634
Multilayer perceptron R2 score on test / train set: 0.667 0.700
Looking at the results¶
The performance of both models seems to have significantly improved by using a history of variables.
[10]:
plt.figure()
plt.title('Performance on test/train data')
plt.plot(t_test, y_test, label='Data test')
plt.plot(t_train, y_train, label='Data train')
for model in models_v2:
p = plt.plot(t_test, model['test_predict'], label=model['name'])
# Plot another line with the same color
plt.plot(t_train, model['train_predict'], color=p[0].get_color())
plt.legend()
plt.gcf().autofmt_xdate()
plt.grid(True)
Possible extensions¶
Above, features were selected based on their linear correlation to the DST index. Such features will work well for a linear model, but a neural network could potentially also use features that have a more complex correlation with the DST index. We could simply use all variables as the input for the neural network, but then we have to take care that there are no missing values (NaN’s).
[ ]:
Modules¶
aidapy¶
aidapy package¶
Subpackages¶
This module serves as base for all the mission data manager
Author : Romain Dupuis
- class aidapy.data.mission.base_mission.BaseMission(t_start, t_end)[source]¶
Bases:
object
Base format for the data downloaded from missions
- Parameters
t_start (datetime object) – It gives the time at which we started to look at the data
t_end (datetime object) – It gives the time at which we finished looking at the data
- static convert_time(time, format_out)[source]¶
Convert time to an accepted astropy time format
- Parameters
time (datetime or astropy.time instance) – Time object we want to convert
format_out (str) – Indicates what is the format output. Must be a format accepted by astropy
- get_mission_info()[source]¶
Provide information on the mission
- Returns
info – Dictionary with information on the mission with the following keys - name: - allowed_probes: - product_catalog:
- Return type
Module contents¶
Contributing to AIDApy¶
All contributions are welcome. For detailed information about the code style please
read the following instructions. All the code must have a adequate number of tests
that assures it’s credibility. Feel free to make pull
requests.
Introduction¶
This guide defines the conventions for writing Python code for aidapy. The main ideas are:
ensuring a consistent code style
promote good practices for testing
maintaining a good level of readability and maintainability
to keep it simple
Python version¶
Prefer if possible Python>=3.6 since there are major dependencies that do not support other python versions. There is no reason motivating the use of Python 2 as AIDApy is not using dependencies or legacy code.
Coding style¶
Stick as much as possible to PEP8 for general guidelines in term of coding conventions and to PEP257 for typical docstring conventions. You can also have a look to Python anti-pattern.
Main guidelines from PEP8¶
PEP8 coding conventions are:
Use 4 spaces per indentation level.
Limit all lines to a maximum of 100 characters.
Separate top-level function and class definitions with two blank lines.
Imports should be grouped in the following order:
Standard library imports.
Related third party imports.
Local application/library specific imports.
A blank line between each group of imports.
Use Linters¶
Linters are tools for static code quality checker. For instance, you can use the following tools to test conformity with the common pythonic standards:
pylint is one of the oldest linters and tracks various problems such as good practice violation, coding standard violation, or programming issues. Pylint may be seen as slow, too verbose and complex to configure to get it working properly. You can run a complete static analysis with the following command:
pylint aidapy --rcfile=setup.cfg --ignore-patterns='file_to_ignore1.py','file_to_ignore2.py'
All these linters can be simply installed with pip. Further details on the functionnalities can be found here or there. Also, a lot of features can also be provided natively or by installing plugins with your IDE (PyCharm, Spyder, Eclipse, etc.).
Documentation¶
Please follow the numpy guidelines to comment your code. All the function and classes have to be commented.
Documentation of all the files must be done in-line using Sphinx. We strongly encourage you to setup a documentation using doctest.
Testing¶
The library pytest is used to launch tests in the code. All tests can be launched using:
coverage run -m pytest
This command gives coverage at the same time. The output consists in tests results and coverage report.
Regarding the location of the test files, add a directory where you will keep the unit tests for each individual modules of the package.
<AIDApy_root>/aidapy/<mymodule>/tests
Each test must be minimal and should ensure non-regression, and full functionality of the module. The files should be named:
test_<name>.py
where<name>
is any name you think representative of the test.
Integration tests will be kept in the following directory. Do not forget that integration tests are those that involve two or more sub-modules of the aidapy package.
<AIDApy_root>/tests
These tests are handled by the package integration team.
Using the git repository¶
You will always work in a branch different from master
. The general
rule is the following
(source):
The production version of your code sits in a branch named master.
The development version of your code (ready to be tested and productionised) sits in a branch named develop.
When development on a new feature is started, a new branch is created off of the develop branch. This is a feature branch.
When development has completed, the feature branch is either merged back into the develop branch if the changes are to be taken into production, or left in their respective branch if they are not.
When the code in the develop branch is finally ready to be released into production, a new release branch is created.
When the release branch is approved for release, it is merged into the master branch and develop branch to capture any fixes that may have been done directly on the release branch.
When you are working on your feature branch, commit often, every time that an consistent modification is made. Please use meaningful commit messages: follow the standards presented in the guide How to write a git commit message.
Remember to always merge the latest version of the master
branch to
your branch before any commit. Please check that your code works and
passes the standardization tool tests before requesting a merge to the
master
branch.
Once you are sure that your branch is in good conditions to share it with the developers and users, please make a merge request on gitlab.
Testing Standards¶
In order the merging request (of the feature branch to master branch ) to be authorized the feature branch must pass successfully from the Continuous Integration tools. To achieve this, the score of the pylint library must be over 8.0 / 10.0 and the test coverage above 80%.
For assessing the pylint score run the command:
pip install pylint
pylint aidapy --rcfile=setup.cfg
Finally for assessing the code test coverage of your branch run:
coverage run setup.py test