{ "cells": [ { "cell_type": "markdown", "id": "bdbb814c", "metadata": {}, "source": [ "# Model Formating Overview\n", "\n", "This notebook was created to enable common formatting for the Early Release Science modeling initiatives. We will review: \n", "\n", "1. Variable terminology\n", "2. File naming schemes \n", "3. Data formating \n", "4. Physical unit archiving \n", "\n", "This is namely for the booking of the following model types: \n", "\n", "1. 1D-3D climate (spatial- and/or altitude- dependent climate) \n", "2. 1D-3D chemistry (spatial- and/or altitude- dependent atmospheric composition) \n", "3. 1D-3D cloud (spatial- and/or altitude- dependent single scattering, asymmetries, cloud optical depth) \n", "4. Spectroscopy (flux, transit depth as a function of wavelength) \n", "\n", "However, it can be applied to other modeling products (e.g. mixing profiles).\n", "\n", "## Variable Terminology \n", "\n", "All file names and meta data should conform to the following variable names. Note, that these will not apply to all models. This is just an initial list. Please shoot me a DM, or slack for additional parameters to add (natasha.e.batalha@nasa.gov, or ERS slack channel)\n", "\n", "### Planet parameters (`planet_params`)\n", "\n", "1. `rp`: planet radius\n", "2. `mp`: planet mass \n", "3. `tint`: object internal temperature\n", "4. `heat_redis`: heat redistribution (only relevant for irradiated objects)\n", "5. `p_reference`: reference pressure radius \n", "6. `pteff`: planetary effective temperature \n", "7. `mh` : metallicity \n", "8. `cto` : carbon to oxygen ratio \n", "9. `logkzz` : log of the kzz eddy diffusion \n", "\n", "### Stellar parameters (`stellar_params`)\n", "1. `logg` : gravity \n", "2. `feh` : stellar metallicity \n", "3. `steff` : stellar effective temperature \n", "4. `rs` : stellar radius \n", "5. `ms` : stellar mass \n", "\n", "### Orbital parameters (`orbit_params`)\n", "1. `sma` : semi-major axis \n", "\n", "### Cloud parameters (`cld_params`)\n", "1. `opd` : extinction optical depth \n", "2. `ssa` : single scattering albedo\n", "3. `asy` : asymmetry parameter\n", "4. `fsed` : cloud sedimentation efficiency parameter\n", "\n", "### Model Gridding ( `coords`)\n", "1. `pressure`: pressure grid \n", "2. `wavelength`: wavelength grid \n", "3. `wno`: wavenumber grid \n", "4. `lat`: latitude grid \n", "5. `lon`: longitude grid\n", "\n", "### Model Output ( `data_vars`)\n", "\n", "There are SO many different model outputs users will want to pass. For the purposes of ERS, we will focus on these, but feel free to send recommendations for more. Note that in your xarray file there will not be a separation between these categories. They will all be lumped into data_vars. However their coordinate systems will be different! The beauty of xarray! \n", "\n", "#### Spectrum \n", "\n", "1. `transit_depth` : transmission spectrum reported as unitless depth (rp/rs)^2. This way it can be directly compared to data. \n", "2. `fpfs_emission` : relative emission spectrum (unitless)\n", "3. `fpfs_reflection` relative reflected light spectrum (unitless)\n", "4. `flux_emission` : thermal emission in raw flux units \n", "5. `albedo` : albedo spectrum \n", "\n", "#### Chemistry \n", "\n", "6. case sensitive molecule names (e.g. Na, H2O, TiO) for each chemical abundance (either 1d or 3d). This means your mixing ratio profile for TiO would not be TIO. Or, for example the chemical profile for sodium would be \"Na\" not NA\n", "\n", "#### Climate \n", "\n", "7. `temperature`: computed temperature profile either 1d or 3d\n", "\n", "#### Cloud \n", "\n", "8. `opd` : extinction optical depth \n", "9. `ssa` : single scattering albedo\n", "10. `asy` : asymmetry parameter\n", "\n", "#### Retrieval output \n", "\n", "11. Coming soon.\n", "\n", "# Specifying units\n", "\n", "We should be able to convert all units to `astropy.units`. For unitless parameters (e.g. single scattering albedo, optical depth) unitless designation should be provided. See example:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "99461aa9", "metadata": {}, "outputs": [], "source": [ "import astropy.units as u" ] }, { "cell_type": "code", "execution_count": null, "id": "463ec110", "metadata": {}, "outputs": [], "source": [ "#examples of valid units\n", "\n", "u.Unit('cm') #Valid \n", "#u.Unit('CM') #NOT valid\n", "u.Unit(\"R_jup\")#Valid\n", "u.Unit(\"R_jupiter\")#Valid\n", "#u.Unit(\"R_Jupiter\")#NOT Valid" ] }, { "cell_type": "code", "execution_count": null, "id": "731be94e", "metadata": {}, "outputs": [], "source": [ "unit = 'cm'\n", "#doing it this away enables easy conversions. for example: \n", "(1*u.Unit('R_jup')).to('cm')" ] }, { "cell_type": "markdown", "id": "97986b39", "metadata": {}, "source": [ "# Data Types: Using `xarray`\n", "\n", "[xarray: N-D labeled arrays and datasets in Python](https://docs.xarray.dev/en/stable/): From their website: \"array introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays, which allows for a more intuitive, more concise, and less error-prone developer experience. The package includes a large and growing library of domain-agnostic functions for advanced analytics and visualization with these data structures.\"\n", "\n", "Xarray is your friend and will make it very easy for other folks to use your data. Let's build some simple examples.\n" ] }, { "cell_type": "markdown", "id": "058e94e1", "metadata": {}, "source": [ "### `attrs` in `xarray`\n", "\n", "This is how we make sure we credit the appropriate people for their work, and understand the meta data surrounding the model runs. \n", "\n", "### Required `attrs`\n", "\n", "1. author: Author or author list \n", "2. contact email: point of contact \n", "3. code used: (can be a dictionary. e.g. {'chemistry':'vulcan', 'cloud':'virga', 'spectra':'chimera'}\n", "\n", "### Optional `attrs`\n", "1. doi (str): made sure to include if you want your work referenced! \n", "2. planet_params (json dict) : with dict keys defined in section 1 \n", "3. stellar_params (json dict): with dict keys defined in section 1\n", "4. orbit_params (json dict) : with dict keys defined in section 1\n", "5. cld_params (json dict): with dict keys defined in section 1 \n", "6. model_notes (str) : any additional modeling notes that you want the user to be aware of\n", "\n", "## Easy Example: 1D data: e.g. P-T profiles, chemistry\n", "\n", "Here we will show an example with `pressure` as the dependent variable. Spectra, which are on a wavelength or wavenumber grid, can also be stored similarly." ] }, { "cell_type": "code", "execution_count": null, "id": "521a3f38", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import xarray as xr\n", "from astropy.utils.misc import JsonCustomEncoder\n", "import json #we will use this to dump model parameters into an attribute" ] }, { "cell_type": "code", "execution_count": null, "id": "e4e66eb5", "metadata": {}, "outputs": [], "source": [ "#fake pressure temperature profile at 1000 K\n", "pressure = np.logspace(-6,2,50)\n", "temperature = np.logspace(-6,2,50)*0 + 1000" ] }, { "cell_type": "markdown", "id": "f9488b52", "metadata": {}, "source": [ "Practice convert to `xarray`. In this case we are storing `temperature` data, labeled with `unit` \"Kelvin\" that is on a grid of `pressure` with units of \"bar\"" ] }, { "cell_type": "code", "execution_count": null, "id": "6b145318", "metadata": {}, "outputs": [], "source": [ "# put data into a dataset where each\n", "ds = xr.Dataset(\n", " data_vars=dict(\n", " temperature=([\"pressure\"], temperature,{'units': 'Kelvin'})#, required\n", " ),\n", " coords=dict(\n", " pressure=([\"pressure\"], pressure,{'units': 'bar'})#required*\n", " ),\n", " attrs=dict(author=\"NE Batalha\",#required\n", " contact=\"natasha.e.batalha@nasa.gov\",#required\n", " code=\"numpy\", #required, in this case I used numpy to make my fake model. \n", " doi=\"add your paper here\",#optional if there is a citation to reference\n", " planet_params=json.dumps({'rp':1*u.Unit('R_jup'), 'mp':1*u.Unit('M_jup')},\n", " cls=JsonCustomEncoder), #optional in accordance with model runs\n", " stellar_params=json.dumps({'rs':1*u.Unit('R_sun'), 'ms':1*u.Unit('M_sun')},\n", " cls=JsonCustomEncoder) #optional in accordance with model runs\n", " ),\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "6f88bb68", "metadata": {}, "outputs": [], "source": [ "#printing is easy\n", "ds" ] }, { "cell_type": "code", "execution_count": null, "id": "f338cfd9", "metadata": {}, "outputs": [], "source": [ "#plotting is easy\n", "ds['temperature'].plot(y='pressure')" ] }, { "cell_type": "markdown", "id": "50e0f904", "metadata": {}, "source": [ "### Looping to add many variables to `data_vars`\n", "\n", "If you have a big table of many parameters, e.g. molecules you can pre-make `data_vars` as a dictionary then add it like this:" ] }, { "cell_type": "code", "execution_count": null, "id": "7e387ecb", "metadata": {}, "outputs": [], "source": [ "fake_chemistry = {i:np.random.randn(len(pressure)) for i in ['H2O','CH4','CO2','CO',\"H2S\"]}\n", "\n", "data_vars=dict(\n", " temperature=([\"pressure\"], temperature,{'units': 'Kelvin'})#, required\n", " )\n", "for i in fake_chemistry.keys(): \n", " data_vars[i] = ([\"pressure\"], fake_chemistry[i],{'units': 'v/v'})#volume mixing ratio units\n", " \n", "# put data into a dataset where each\n", "ds = xr.Dataset(\n", " data_vars=data_vars,\n", " coords=dict(\n", " pressure=([\"pressure\"], pressure,{'units': 'bar'})#required*\n", " ),\n", " attrs=dict(author=\"NE Batalha\",#required\n", " contact=\"natasha.e.batalha@nasa.gov\",#required\n", " code=json.dumps({'climate':\"numpy\",\"chemistry\":'numpy'}), #required\n", " doi=\"\",#optional if there is a citation to reference\n", " planet_params=json.dumps({'rp':1*u.Unit('R_jup'), 'mp':1*u.Unit('M_jup')},\n", " cls=JsonCustomEncoder), #optional in accordance with model runs\n", " stellar_params=json.dumps({'rs':1*u.Unit('R_sun'), 'ms':1*u.Unit('M_sun')},\n", " cls=JsonCustomEncoder) #optional in accordance with model runs #optional in accordance with model runs\n", " ),\n", ")" ] }, { "cell_type": "markdown", "id": "d21ece2b", "metadata": {}, "source": [ "## 2D data: e.g. cloud profiles with pressure vs wavenumber" ] }, { "cell_type": "code", "execution_count": null, "id": "d678e42d", "metadata": {}, "outputs": [], "source": [ "wno = np.linspace(10,10000,400)#fake wavenumber array\n", "opd = np.zeros((len(pressure), len(wno))) + 1 \n", "ssa = np.zeros((len(pressure), len(wno))) + 0.9\n", "asy = np.zeros((len(pressure), len(wno))) + 0.8\n", "\n", "# put data into a dataset where each\n", "ds = xr.Dataset(\n", " #now data is a function of two dimensions\n", " data_vars=dict(opd=([\"pressure\",\"wno\"], opd,{'units': 'unitless per layer'}),\n", " ssa=([\"pressure\",\"wno\"], ssa,{'units': 'unitless'}),\n", " asy=([\"pressure\",\"wno\"], asy,{'units': 'unitless'}),\n", " ),\n", " coords=dict(\n", " pressure=([\"pressure\"], pressure,{'units': 'bar'}),#required\n", " wno=([\"wno\"], wno,{'units': 'cm**(-1)'})#required\n", " ),\n", " attrs=dict(author=\"NE Batalha\",#required\n", " contact=\"natasha.e.batalha@nasa.gov\",#required\n", " code='numpy', #required\n", " doi=\"\",#optional if there is a citation to reference\n", " planet_params=json.dumps({'rp':1*u.Unit('R_jup'), 'mp':1*u.Unit('M_jup')},\n", " cls=JsonCustomEncoder), #optional in accordance with model runs\n", " stellar_params=json.dumps({'rs':1*u.Unit('R_sun'), 'ms':1*u.Unit('M_sun')},\n", " cls=JsonCustomEncoder) #optional in accordance with model runs #optional in accordance with model runs\n", " ),\n", ")" ] }, { "cell_type": "markdown", "id": "a615a313", "metadata": {}, "source": [ "## 3D data: e.g. GCM pressure grid\n", "\n", "`xarray` is incredibly useful for GCM work. [Here is an example of how picaso uses xarray and xesfm to do regridding and 3d calculations](https://natashabatalha.github.io/picaso/notebooks/9a_3DInputsWithPICASOandXarray.html). \n", "\n", "Here we'll just show how to create a 3D gridded dataset." ] }, { "cell_type": "code", "execution_count": null, "id": "90da7234", "metadata": {}, "outputs": [], "source": [ "lat = np.linspace(-90,85,40)#fake latitude array\n", "lon = np.linspace(-180,175,100)#fake latitude array\n", "temperature_3d = np.random.randn(len(lat),len(lon),len(pressure))\n", "\n", "# put data into a dataset where each\n", "ds = xr.Dataset(\n", " #now data is a function of two dimensions\n", " data_vars=dict(temperature=([\"lat\",\"lon\",\"pressure\"],temperature_3d ,{'units': 'Kelvin'})\n", " ),\n", " coords=dict(\n", " pressure=([\"pressure\"], pressure,{'units': 'bar'}),#required\n", " lat=([\"lat\"], lat,{'units': 'degree'}),#required\n", " lon=([\"lon\"], lon,{'units': 'degree'}),#required\n", " ),\n", " attrs=dict(author=\"NE Batalha\",#required\n", " contact=\"natasha.e.batalha@nasa.gov\",#required\n", " code='numpy', #required\n", " doi=\"\",#optional if there is a citation to reference\n", " planet_params=json.dumps({'rp':1*u.Unit('R_jup'), 'mp':1*u.Unit('M_jup')},\n", " cls=JsonCustomEncoder), #optional in accordance with model runs\n", " stellar_params=json.dumps({'rs':1*u.Unit('R_sun'), 'ms':1*u.Unit('M_sun')},\n", " cls=JsonCustomEncoder) #optional in accordance with model runs #optional in accordance with model runs\n", " ),\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "653e58d1", "metadata": {}, "outputs": [], "source": [ "#easy plotting \n", "ds['temperature'].isel(pressure=10).plot()" ] }, { "cell_type": "markdown", "id": "713e2d62", "metadata": {}, "source": [ "# Storing `xarray` data \n", "\n", "## Filenaming\n", "\n", "We usually rely on a long filename to give us information about the model. If we properly use `attrs` then filenaming does not matter. However, friendly filenames are always appreciated by people using your models. We suggest the following naming convention. \n", "\n", "Given independent variables (x,y,z): `tag_x{x}_y{y}_z{z}.nc`\n", "\n", "For example: `jupiter_mh1_teff1000_tint100.nc`\n", "\n", "## Using `netcdf`\n", "\n", "\"The recommended way to store xarray data structures is netCDF, which is a binary file format for self-described datasets that originated in the geosciences. Xarray is based on the netCDF data model, so netCDF files on disk directly correspond to Dataset objects (more accurately, a group in a netCDF file directly corresponds to a Dataset object. See Groups for more.)\" - [Quoted from xarray website](https://docs.xarray.dev/en/stable/user-guide/io.html)" ] }, { "cell_type": "code", "execution_count": null, "id": "7e7c4f4c", "metadata": {}, "outputs": [], "source": [ "ds.to_netcdf(\"/data/picaso_dbs/fakeplanet_1000teq.nc\")" ] }, { "cell_type": "markdown", "id": "e9be7de1", "metadata": {}, "source": [ "## Using `pickle`" ] }, { "cell_type": "code", "execution_count": null, "id": "fa90f5e5", "metadata": {}, "outputs": [], "source": [ "import pickle as pk\n", "pk.dump(ds, open(\"/data/picaso_dbs/fakeplanet_1000teq.pk\",'wb'))" ] }, { "cell_type": "markdown", "id": "84f33541", "metadata": {}, "source": [ "# Reading/interpreting an `xarray` file\n", "\n", "First, make sure you have installed [netCDF4](https://github.com/Unidata/netcdf4-python) and [h5netcdf](https://github.com/h5netcdf/h5netcdf) : \n", "\n", "```\n", "pip install netCDF4\n", "pip install h5netcdf\n", "```\n", "or if you prefer conda\n", "```\n", "conda install -c conda-forge netCDF4\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "8e70130e", "metadata": {}, "outputs": [], "source": [ "ds_sm = xr.open_dataset(\"profile_eq_planet_300_grav_4.5_mh_+2.0_CO_2.0_sm_0.0486_v_0.5_.nc\")" ] }, { "cell_type": "markdown", "id": "5ca10951", "metadata": {}, "source": [ "Look at all the information we can glean from this" ] }, { "cell_type": "code", "execution_count": null, "id": "2b741f12", "metadata": {}, "outputs": [], "source": [ "ds_sm #39 data variables" ] }, { "cell_type": "code", "execution_count": null, "id": "6338483b", "metadata": {}, "outputs": [], "source": [ "ds_sm['wavelength']#data operates very similarly to pandas, note we can see the unit of the coordinate system" ] }, { "cell_type": "code", "execution_count": null, "id": "fb0df346", "metadata": {}, "outputs": [], "source": [ "ds_sm['wavelength'].values #same as pandas!" ] }, { "cell_type": "code", "execution_count": null, "id": "8d764ef0", "metadata": {}, "outputs": [], "source": [ "ds_sm['temperature']" ] }, { "cell_type": "markdown", "id": "356c4f3f", "metadata": {}, "source": [ "How to get attributes from string dictionary" ] }, { "cell_type": "code", "execution_count": null, "id": "6d8fafa0", "metadata": {}, "outputs": [], "source": [ "json.loads(ds_sm.attrs['planet_params'])" ] }, { "cell_type": "code", "execution_count": null, "id": "57eec372", "metadata": {}, "outputs": [], "source": [ "json.loads(ds_sm.attrs['planet_params'])['rp'] #radius used" ] }, { "cell_type": "markdown", "id": "d90c2896", "metadata": {}, "source": [ "# Checking your data is in compliance\n", "\n", "TLDR: this function will check that your data can be properly interpretted " ] }, { "cell_type": "code", "execution_count": null, "id": "f1ec1ca6", "metadata": {}, "outputs": [], "source": [ "def data_check(usr_xa):\n", " \"\"\"This function will check that all the requirements have been met\"\"\"\n", " \n", " #step 1: check that required attributes are present \n", " assert 'author' in usr_xa.attrs ,'No author information in attrs'\n", " assert 'contact' in usr_xa.attrs ,'No contact information in attrs'\n", " assert 'code' in usr_xa.attrs , 'Code used was not specified in attrs'\n", " \n", " #step 2: check that all coordinates have units\n", " try: \n", " for i in usr_xa.coords.keys(): test= usr_xa[i].units\n", " except AttributeError: \n", " print(f'Missing unit for {i} coords')\n", "\n", " #step 2: check that all coordinates have units\n", " try: \n", " for i in usr_xa.data_vars.keys(): test=usr_xa[i].units\n", " except AttributeError: \n", " print(f'Missing unit for {i} data_var')\n", " \n", " #step 3: check that some attrs is a proper dictionary\n", " try : \n", " for i in usr_xa.attrs:\n", " #these need to be dictionaries to be interpretable\n", " if i in ['planet_params','stellar_params','cld_params','orbit_params']: \n", " json.loads(usr_xa.attrs[i])\n", " except ValueError: \n", " print(f\"Was not able to read attr for {i}. This means that you did not properly define a dictionary with json and a dict.\",\" For example: json.dumps({'mp':1,'rp':1})\")\n", " \n", " #step 4: hurray if you have made it to here this is great\n", " #last thing is the least important -- to make sure that we agree on terminology\n", " for i in usr_xa.attrs: \n", " if i == 'planet_params': \n", " for model_key in json.loads(usr_xa.attrs[i]).keys():\n", " assert model_key in ['rp', 'mp', 'tint', 'heat_redis', 'p_reference','rainout','p_quench',\n", " 'pteff', 'mh' , 'cto' , 'logkzz'], f'Could not find {model_key} in listed planet_params attr. This might be because we havent added it yet! Check your terms and contact us if this is the case'\n", " \n", " elif i == 'stellar_params': \n", " for model_key in json.loads(usr_xa.attrs[i]).keys():\n", " assert model_key in ['logg', 'feh', 'steff', 'rs', 'ms',\n", " ], f'Could not find {model_key} in listed stellar_params attr. This might be because we havent added it yet! Check your terms and contact us if this is the case'\n", " \n", " elif i == 'orbit_params': \n", " for model_key in json.loads(usr_xa.attrs[i]).keys():\n", " assert model_key in ['sma',\n", " ], f'Could not find {model_key} in listed orbit_params attr. This might be because we havent added it yet! Check your terms and contact us if this is the case'\n", " \n", " elif i == 'cld_params': \n", " for model_key in json.loads(usr_xa.attrs[i]).keys():\n", " assert model_key in ['opd','ssa','asy','fsed','p_cloud','haze_effec',\n", " ], f'Could not find {model_key} in listed cld_params attr. This might be because we havent added it yet! Check your terms and contact us if this is the case'\n", " \n", " print('SUCCESS!')\n" ] }, { "cell_type": "code", "execution_count": null, "id": "83c1cdde", "metadata": {}, "outputs": [], "source": [ "ds_sm = xr.open_dataset(\"profile_eq_planet_300_grav_4.5_mh_+2.0_CO_2.0_sm_0.0486_v_0.5_.nc\")\n", "data_check(ds_sm)" ] }, { "cell_type": "code", "execution_count": null, "id": "81668a4b", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }