Model Formating Overview¶
This notebook was created to enable common formatting for the Early Release Science modeling initiatives. We will review:
Variable terminology
File naming schemes
Data formating
Physical unit archiving
This is namely for the booking of the following model types:
1D-3D climate (spatial- and/or altitude- dependent climate)
1D-3D chemistry (spatial- and/or altitude- dependent atmospheric composition)
1D-3D cloud (spatial- and/or altitude- dependent single scattering, asymmetries, cloud optical depth)
Spectroscopy (flux, transit depth as a function of wavelength)
However, it can be applied to other modeling products (e.g. mixing profiles).
Variable Terminology¶
All file names and meta data should conform to the following variable names. Note, that these will not apply to all models. This is just an initial list. Please shoot me a DM, or slack for additional parameters to add (natasha.e.batalha@nasa.gov, or ERS slack channel)
Planet parameters (planet_params
)¶
rp
: planet radiusmp
: planet masstint
: object internal temperatureheat_redis
: heat redistribution (only relevant for irradiated objects)p_reference
: reference pressure radiuspteff
: planetary effective temperaturemh
: metallicitycto
: carbon to oxygen ratiologkzz
: log of the kzz eddy diffusion
Stellar parameters (stellar_params
)¶
logg
: gravityfeh
: stellar metallicitysteff
: stellar effective temperaturers
: stellar radiusms
: stellar mass
Orbital parameters (orbit_params
)¶
sma
: semi-major axis
Cloud parameters (cld_params
)¶
opd
: extinction optical depthssa
: single scattering albedoasy
: asymmetry parameterfsed
: cloud sedimentation efficiency parameter
Model Gridding ( coords
)¶
pressure
: pressure gridwavelength
: wavelength gridwno
: wavenumber gridlat
: latitude gridlon
: longitude grid
Model Output ( data_vars
)¶
There are SO many different model outputs users will want to pass. For the purposes of ERS, we will focus on these, but feel free to send recommendations for more. Note that in your xarray file there will not be a separation between these categories. They will all be lumped into data_vars. However their coordinate systems will be different! The beauty of xarray!
Spectrum¶
transit_depth
: transmission spectrum reported as unitless depth (rp/rs)^2. This way it can be directly compared to data.fpfs_emission
: relative emission spectrum (unitless)fpfs_reflection
relative reflected light spectrum (unitless)flux_emission
: thermal emission in raw flux unitsalbedo
: albedo spectrum
Chemistry¶
case sensitive molecule names (e.g. Na, H2O, TiO) for each chemical abundance (either 1d or 3d). This means your mixing ratio profile for TiO would not be TIO. Or, for example the chemical profile for sodium would be “Na” not NA
Climate¶
temperature
: computed temperature profile either 1d or 3d
Cloud¶
opd
: extinction optical depthssa
: single scattering albedoasy
: asymmetry parameter
Retrieval output¶
Coming soon.
Specifying units¶
We should be able to convert all units to astropy.units
. For unitless parameters (e.g. single scattering albedo, optical depth) unitless designation should be provided. See example:
[1]:
import astropy.units as u
[2]:
#examples of valid units
u.Unit('cm') #Valid
#u.Unit('CM') #NOT valid
u.Unit("R_jup")#Valid
u.Unit("R_jupiter")#Valid
#u.Unit("R_Jupiter")#NOT Valid
[2]:
[3]:
unit = 'cm'
#doing it this away enables easy conversions. for example:
(1*u.Unit('R_jup')).to('cm')
[3]:
Data Types: Using xarray
¶
xarray: N-D labeled arrays and datasets in Python: From their website: “array introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays, which allows for a more intuitive, more concise, and less error-prone developer experience. The package includes a large and growing library of domain-agnostic functions for advanced analytics and visualization with these data structures.”
Xarray is your friend and will make it very easy for other folks to use your data. Let’s build some simple examples.
This is how we make sure we credit the appropriate people for their work, and understand the meta data surrounding the model runs.
author: Author or author list
contact email: point of contact
code used: (can be a dictionary. e.g. {‘chemistry’:’vulcan’, ‘cloud’:’virga’, ‘spectra’:’chimera’}
doi (str): made sure to include if you want your work referenced!
planet_params (json dict) : with dict keys defined in section 1
stellar_params (json dict): with dict keys defined in section 1
orbit_params (json dict) : with dict keys defined in section 1
cld_params (json dict): with dict keys defined in section 1
model_notes (str) : any additional modeling notes that you want the user to be aware of
Easy Example: 1D data: e.g. P-T profiles, chemistry¶
Here we will show an example with pressure
as the dependent variable. Spectra, which are on a wavelength or wavenumber grid, can also be stored similarly.
[4]:
import numpy as np
import xarray as xr
from astropy.utils.misc import JsonCustomEncoder
import json #we will use this to dump model parameters into an attribute
[5]:
#fake pressure temperature profile at 1000 K
pressure = np.logspace(-6,2,50)
temperature = np.logspace(-6,2,50)*0 + 1000
Practice convert to xarray
. In this case we are storing temperature
data, labeled with unit
“Kelvin” that is on a grid of pressure
with units of “bar”
[6]:
# put data into a dataset where each
ds = xr.Dataset(
data_vars=dict(
temperature=(["pressure"], temperature,{'units': 'Kelvin'})#, required
),
coords=dict(
pressure=(["pressure"], pressure,{'units': 'bar'})#required*
),
attrs=dict(author="NE Batalha",#required
contact="natasha.e.batalha@nasa.gov",#required
code="numpy", #required, in this case I used numpy to make my fake model.
doi="add your paper here",#optional if there is a citation to reference
planet_params=json.dumps({'rp':1*u.Unit('R_jup'), 'mp':1*u.Unit('M_jup')},
cls=JsonCustomEncoder), #optional in accordance with model runs
stellar_params=json.dumps({'rs':1*u.Unit('R_sun'), 'ms':1*u.Unit('M_sun')},
cls=JsonCustomEncoder) #optional in accordance with model runs
),
)
[7]:
#printing is easy
ds
[7]:
<xarray.Dataset> Size: 800B Dimensions: (pressure: 50) Coordinates: * pressure (pressure) float64 400B 1e-06 1.456e-06 ... 68.66 100.0 Data variables: temperature (pressure) float64 400B 1e+03 1e+03 1e+03 ... 1e+03 1e+03 1e+03 Attributes: author: NE Batalha contact: natasha.e.batalha@nasa.gov code: numpy doi: add your paper here planet_params: {"rp": {"value": 1.0, "unit": "jupiterRad"}, "mp": {"val... stellar_params: {"rs": {"value": 1.0, "unit": "solRad"}, "ms": {"value":...
[8]:
#plotting is easy
ds['temperature'].plot(y='pressure')
[8]:
[<matplotlib.lines.Line2D at 0x7fe946276290>]
Looping to add many variables to data_vars
¶
If you have a big table of many parameters, e.g. molecules you can pre-make data_vars
as a dictionary then add it like this:
[9]:
fake_chemistry = {i:np.random.randn(len(pressure)) for i in ['H2O','CH4','CO2','CO',"H2S"]}
data_vars=dict(
temperature=(["pressure"], temperature,{'units': 'Kelvin'})#, required
)
for i in fake_chemistry.keys():
data_vars[i] = (["pressure"], fake_chemistry[i],{'units': 'v/v'})#volume mixing ratio units
# put data into a dataset where each
ds = xr.Dataset(
data_vars=data_vars,
coords=dict(
pressure=(["pressure"], pressure,{'units': 'bar'})#required*
),
attrs=dict(author="NE Batalha",#required
contact="natasha.e.batalha@nasa.gov",#required
code=json.dumps({'climate':"numpy","chemistry":'numpy'}), #required
doi="",#optional if there is a citation to reference
planet_params=json.dumps({'rp':1*u.Unit('R_jup'), 'mp':1*u.Unit('M_jup')},
cls=JsonCustomEncoder), #optional in accordance with model runs
stellar_params=json.dumps({'rs':1*u.Unit('R_sun'), 'ms':1*u.Unit('M_sun')},
cls=JsonCustomEncoder) #optional in accordance with model runs #optional in accordance with model runs
),
)
2D data: e.g. cloud profiles with pressure vs wavenumber¶
[10]:
wno = np.linspace(10,10000,400)#fake wavenumber array
opd = np.zeros((len(pressure), len(wno))) + 1
ssa = np.zeros((len(pressure), len(wno))) + 0.9
asy = np.zeros((len(pressure), len(wno))) + 0.8
# put data into a dataset where each
ds = xr.Dataset(
#now data is a function of two dimensions
data_vars=dict(opd=(["pressure","wno"], opd,{'units': 'unitless per layer'}),
ssa=(["pressure","wno"], ssa,{'units': 'unitless'}),
asy=(["pressure","wno"], asy,{'units': 'unitless'}),
),
coords=dict(
pressure=(["pressure"], pressure,{'units': 'bar'}),#required
wno=(["wno"], wno,{'units': 'cm**(-1)'})#required
),
attrs=dict(author="NE Batalha",#required
contact="natasha.e.batalha@nasa.gov",#required
code='numpy', #required
doi="",#optional if there is a citation to reference
planet_params=json.dumps({'rp':1*u.Unit('R_jup'), 'mp':1*u.Unit('M_jup')},
cls=JsonCustomEncoder), #optional in accordance with model runs
stellar_params=json.dumps({'rs':1*u.Unit('R_sun'), 'ms':1*u.Unit('M_sun')},
cls=JsonCustomEncoder) #optional in accordance with model runs #optional in accordance with model runs
),
)
3D data: e.g. GCM pressure grid¶
xarray
is incredibly useful for GCM work. Here is an example of how picaso uses xarray and xesfm to do regridding and 3d calculations.
Here we’ll just show how to create a 3D gridded dataset.
[11]:
lat = np.linspace(-90,85,40)#fake latitude array
lon = np.linspace(-180,175,100)#fake latitude array
temperature_3d = np.random.randn(len(lat),len(lon),len(pressure))
# put data into a dataset where each
ds = xr.Dataset(
#now data is a function of two dimensions
data_vars=dict(temperature=(["lat","lon","pressure"],temperature_3d ,{'units': 'Kelvin'})
),
coords=dict(
pressure=(["pressure"], pressure,{'units': 'bar'}),#required
lat=(["lat"], lat,{'units': 'degree'}),#required
lon=(["lon"], lon,{'units': 'degree'}),#required
),
attrs=dict(author="NE Batalha",#required
contact="natasha.e.batalha@nasa.gov",#required
code='numpy', #required
doi="",#optional if there is a citation to reference
planet_params=json.dumps({'rp':1*u.Unit('R_jup'), 'mp':1*u.Unit('M_jup')},
cls=JsonCustomEncoder), #optional in accordance with model runs
stellar_params=json.dumps({'rs':1*u.Unit('R_sun'), 'ms':1*u.Unit('M_sun')},
cls=JsonCustomEncoder) #optional in accordance with model runs #optional in accordance with model runs
),
)
[12]:
#easy plotting
ds['temperature'].isel(pressure=10).plot()
[12]:
<matplotlib.collections.QuadMesh at 0x7fe946342550>
Storing xarray
data¶
Filenaming¶
We usually rely on a long filename to give us information about the model. If we properly use attrs
then filenaming does not matter. However, friendly filenames are always appreciated by people using your models. We suggest the following naming convention.
Given independent variables (x,y,z): tag_x{x}_y{y}_z{z}.nc
For example: jupiter_mh1_teff1000_tint100.nc
Using netcdf
¶
“The recommended way to store xarray data structures is netCDF, which is a binary file format for self-described datasets that originated in the geosciences. Xarray is based on the netCDF data model, so netCDF files on disk directly correspond to Dataset objects (more accurately, a group in a netCDF file directly corresponds to a Dataset object. See Groups for more.)” - Quoted from xarray website
[13]:
ds.to_netcdf("/data/picaso_dbs/fakeplanet_1000teq.nc")
Using pickle
¶
[14]:
import pickle as pk
pk.dump(ds, open("/data/picaso_dbs/fakeplanet_1000teq.pk",'wb'))
Reading/interpreting an xarray
file¶
First, make sure you have installed netCDF4 and h5netcdf :
pip install netCDF4
pip install h5netcdf
or if you prefer conda
conda install -c conda-forge netCDF4
[15]:
ds_sm = xr.open_dataset("profile_eq_planet_300_grav_4.5_mh_+2.0_CO_2.0_sm_0.0486_v_0.5_.nc")
Look at all the information we can glean from this
[16]:
ds_sm #39 data variables
[16]:
<xarray.Dataset> Size: 52kB Dimensions: (pressure: 91, wavelength: 1498) Coordinates: * pressure (pressure) float64 728B 1e-06 1.237e-06 ... 161.7 200.0 * wavelength (wavelength) float64 12kB 5.994 5.982 5.97 ... 0.3009 0.3003 Data variables: (12/39) temperature (pressure) float64 728B ... e- (pressure) float64 728B ... H2 (pressure) float64 728B ... H (pressure) float64 728B ... H+ (pressure) float64 728B ... H- (pressure) float64 728B ... ... ... OCS (pressure) float64 728B ... Li (pressure) float64 728B ... LiOH (pressure) float64 728B ... LiH (pressure) float64 728B ... LiCl (pressure) float64 728B ... transit_depth (wavelength) float64 12kB ... Attributes: author: Sagnick Mukherjee contact: samukher@ucsc.edu code: {"spectrum": "PICASO", "climate": "PICASO", "chemistry":... doi: Mukherjee et al. (Submitted) planet_params: {"rp": {"value": 1.279, "unit": "jupiterRad"}, "mp": {"v... stellar_params: {"rs": {"value": 0.932, "unit": "solRad"}, "logg": 4.389... orbit_params: {"sma": 0.0486} cld_params: {"opd": "None", "ssa": "None", "asy": "None", "p_cloud":...
[17]:
ds_sm['wavelength']#data operates very similarly to pandas, note we can see the unit of the coordinate system
[17]:
<xarray.DataArray 'wavelength' (wavelength: 1498)> Size: 12kB array([5.993991, 5.982015, 5.970063, ..., 0.301528, 0.300925, 0.300324]) Coordinates: * wavelength (wavelength) float64 12kB 5.994 5.982 5.97 ... 0.3009 0.3003 Attributes: units: micron
[18]:
ds_sm['wavelength'].values #same as pandas!
[18]:
array([5.99399092, 5.98201491, 5.97006283, ..., 0.30152773, 0.30092527,
0.30032402])
[19]:
ds_sm['temperature']
[19]:
<xarray.DataArray 'temperature' (pressure: 91)> Size: 728B [91 values with dtype=float64] Coordinates: * pressure (pressure) float64 728B 1e-06 1.237e-06 1.529e-06 ... 161.7 200.0 Attributes: units: Kelvin
How to get attributes from string dictionary
[20]:
json.loads(ds_sm.attrs['planet_params'])
[20]:
{'rp': {'value': 1.279, 'unit': 'jupiterRad'},
'mp': {'value': 0.28, 'unit': 'jupiterMass'},
'tint': 300.0,
'heat_redis': 0.5,
'mh': 100.0,
'cto': 0.916,
'rainout': 'None',
'p_quench': 'None',
'p_reference': {'value': 10.0, 'unit': 'bar'},
'logkzz': 'None'}
[21]:
json.loads(ds_sm.attrs['planet_params'])['rp'] #radius used
[21]:
{'value': 1.279, 'unit': 'jupiterRad'}
Checking your data is in compliance¶
TLDR: this function will check that your data can be properly interpretted
[22]:
def data_check(usr_xa):
"""This function will check that all the requirements have been met"""
#step 1: check that required attributes are present
assert 'author' in usr_xa.attrs ,'No author information in attrs'
assert 'contact' in usr_xa.attrs ,'No contact information in attrs'
assert 'code' in usr_xa.attrs , 'Code used was not specified in attrs'
#step 2: check that all coordinates have units
try:
for i in usr_xa.coords.keys(): test= usr_xa[i].units
except AttributeError:
print(f'Missing unit for {i} coords')
#step 2: check that all coordinates have units
try:
for i in usr_xa.data_vars.keys(): test=usr_xa[i].units
except AttributeError:
print(f'Missing unit for {i} data_var')
#step 3: check that some attrs is a proper dictionary
try :
for i in usr_xa.attrs:
#these need to be dictionaries to be interpretable
if i in ['planet_params','stellar_params','cld_params','orbit_params']:
json.loads(usr_xa.attrs[i])
except ValueError:
print(f"Was not able to read attr for {i}. This means that you did not properly define a dictionary with json and a dict."," For example: json.dumps({'mp':1,'rp':1})")
#step 4: hurray if you have made it to here this is great
#last thing is the least important -- to make sure that we agree on terminology
for i in usr_xa.attrs:
if i == 'planet_params':
for model_key in json.loads(usr_xa.attrs[i]).keys():
assert model_key in ['rp', 'mp', 'tint', 'heat_redis', 'p_reference','rainout','p_quench',
'pteff', 'mh' , 'cto' , 'logkzz'], f'Could not find {model_key} in listed planet_params attr. This might be because we havent added it yet! Check your terms and contact us if this is the case'
elif i == 'stellar_params':
for model_key in json.loads(usr_xa.attrs[i]).keys():
assert model_key in ['logg', 'feh', 'steff', 'rs', 'ms',
], f'Could not find {model_key} in listed stellar_params attr. This might be because we havent added it yet! Check your terms and contact us if this is the case'
elif i == 'orbit_params':
for model_key in json.loads(usr_xa.attrs[i]).keys():
assert model_key in ['sma',
], f'Could not find {model_key} in listed orbit_params attr. This might be because we havent added it yet! Check your terms and contact us if this is the case'
elif i == 'cld_params':
for model_key in json.loads(usr_xa.attrs[i]).keys():
assert model_key in ['opd','ssa','asy','fsed','p_cloud','haze_effec',
], f'Could not find {model_key} in listed cld_params attr. This might be because we havent added it yet! Check your terms and contact us if this is the case'
print('SUCCESS!')
[23]:
ds_sm = xr.open_dataset("profile_eq_planet_300_grav_4.5_mh_+2.0_CO_2.0_sm_0.0486_v_0.5_.nc")
data_check(ds_sm)
SUCCESS!
[ ]: