Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ObsOperator: proposal for a generic observation operator diagnostic #2636

Open
mbertolacci opened this issue Dec 13, 2024 · 6 comments
Open
Labels
category: Discussion An extended discussion of a particular topic never stale Never label this issue as stale topic: Diagnostics Related to output diagnostic data

Comments

@mbertolacci
Copy link

Your name

Michael Bertolacci

Your affiliation

The University of Western Australia

Please provide a clear and concise description of your question or discussion topic.

For a flux inversion project I'm involved in I've been working on a generic observation operator diagnostic for GEOS-Chem which I've tentatively termed ObsOperator.

I'm curious about the views of the GEOS-Chem community on the value of such a diagnostic, and what features it ought to have beyond what is already implemented.

Per offline discussions with @aschuh and @yantosca, I am tagging the following people who may be interested:
@msulprizio @lizziel @kbowman77 @dbajones @ltmurray @drewpendergrass @laestrada @djxjacob @randallvmartin @djvaron @nicholasbalasus @eastjames

The code is in a fork of GEOS-Chem. Most of it is here:
https://github.com/mbertolacci/geos-chem/tree/feature/obsoperator-diagnostic/ObsOperator

The basic idea is to return a (potentially weighted) average of the GEOS-Chem species concentration field. This could be a vertical average, box average, time average, or any combination thereof.

This is sufficient to do what the existing ObsPack diagnostic does. Due to the vertical averaging, it can also implement part of the sampling for OCO-2 or TCCON. I give examples of these operators below in a draft specification.

The rest of this is a draft specification, as currently implemented by the code above.

ObsOperator diagnostic

Overview

This ObsOperator diagnostic for GEOS-Chem is a flexible way to define and run observation operators over the GEOS-Chem species concentration field. The user specifies "observation operators" in YAML files, which are then used to calculate a set of output values for each hour of the run.

A single observation operator is defined as an operation on the GEOS-Chem species concentration field that produces a single number. A YAML file or files is used to define the set of observation operators for a given run, with a format described later in this document.

Every observation operator can apply to one or more species, and is the composition of a time operator, a horizontal operator, and a vertical operator. Each operator either samples a single value, or computed a (potentially weighted) average over the a set of values (conceptually, however, a point value is a single-value average with a weight of 1). The resulting calculated value is thus equal to

output[s] = sum(
  time_weight[t] * sum(
    horizontal_weight[i, j] * sum(
      vertical_weight[k] * value[t, i, j, k, s]
      for k in vertical_indices
    )
    for j in horizontal_indices
  )
  for t in time_indices
)

where s is the species index. The vertical indices can be defined directly as levels, by altitude, or by pressure. The weights can be fixed, or can be pressure weighted (and thus dynamically updated).

Once all time points have been processed, the output values are written to a CSV file.

geoschem_config.yml entry

The entry in the geoschem_config.yml file is as follows:

extra_diagnostics:
  obsoperator:
    activate: true  # Whether to activate the obsoperator diagnostic
    verbose: true  # Whether to print verbose output
    input_file: obsoperator-YYYYMMDD.yml  # The YAML file(s) containing the observation operators
    output_file: OutputDir/obsoperator_output-YYYYMMDD.csv  # The output file name

The input_file and output_file are expanded to include the year, month, and day if they contain YYYY, MM, or DD, as per usual.

YAML file format examples

The YAML file contains a list of observation operators under the entries key as follows:

entries:
  - # first entry
  - # second entry

Examples of entries are below, and full specification of the format is given later.

ObsPack-style time averaging

The ObsPack diagnostic implements either point or time-averaged sampling at a specific location and altitude. For example, suppose the ObsPack entry is:

obspack_id =  obspack_co2_1_GLOBALVIEWplus_v6.1_2021-03-01~co2_smo_surface-flask_1_representative~19583684 
model_sample_window_start =  2014-12-31 23:39:00 
model_sample_window_end =  2015-01-01 00:39:00 
longitude =  -170.5644 
latitude =  -14.2474 
altitude =  60.3 

The corresponding YAML entry could be:

- id: obspack_co2_1_GLOBALVIEWplus_v6.1_2021-03-01~co2_smo_surface-flask_1_representative~19583684
  species: CO2
  time_operator:
    type: range
    unit: time
    start: [20141231, 2339]
    end: [20150101, 0039]
    weights: normalized
  horizontal_operator:
    type: point
    unit: degrees
    longitude: -170.5644
    latitude: -14.2474
  vertical_operator:
    type: point
    unit: altitude
    value: 60.3

OCO-2 retrieval vertical averaging

For OCO-2, the observation operator for a retrieval at a specific time and location has the following form:

value = xco2_apriori + h^T (vco2_model - vco2_apriori)
      = xco2_apriori - h^T vco2_apriori + h^T vco2_model

where xco2_apriori is a given scalar, h is a vector of weights (a combination of the averaging kernel and the pressure weights), vco2_model is a vector of concentrations from GEOS-Chem at certain pressure levels, and vco2_apriori is the apriori vector of concentrations.

The ObsOperator diagnostic can compute h^T vco2_model; the user would then add the constant offset xco2_apriori - h^T vco2_apriori to this to get the final sampled value.

A corresponding YAML entry could be:

- id: oco2_20190701005646
  species: CO2
  time_operator:
    type: point
    unit: time
    time: [20190701, 52]
  horizontal_operator:
    type: point
    unit: degrees
    longitude: -166.4299927
    latitude: -38.4300003
  vertical_operator:
    type: exact
    unit: pressure
    values:
    - 0.102817
    - 54.1142132
    - 108.2284265
    - 162.3426321
    - 216.456853
    - 270.5710739
    - 324.6852641
    - 378.799485
    - 432.9137059
    - 487.0279268
    - 541.1421477
    - 595.256338
    - 649.3705283
    - 703.4847798
    - 757.5989701
    - 811.7132216
    - 865.8274119
    - 919.9416021
    - 974.0558537
    - 1028.1700439
    weights:
    - 6319.1598432
    - 19984.860155
    - 29375.7039415
    - 36636.2610202
    - 42524.5607233
    - 46564.1217474
    - 49287.5914229
    - 51220.5129134
    - 52574.6420179
    - 53401.2996122
    - 53860.9510212
    - 54059.2251766
    - 53992.2235359
    - 53769.4608382
    - 53338.4316021
    - 52930.0816228
    - 52541.1533714
    - 51655.6313083
    - 52050.4418275
    - 26359.1997691

(Here the weights have been multiplied by 1e6 to convert the output concentration from v/v to ppm.)

YAML file format specification

The YAML file contains a list of observation operators under the entries key as follows:

entries:
  - # first entry
  - # second entry

An individual observation operator is defined as a dictionary with the following keys:

id: myentryid
species:
  - species1
  - species2
  ...
time_operator:
  # The time operator definition
  ...
horizontal_operator:
  # The horizontal operator definition
  ...
vertical_operator:
  # The vertical operator definition
  ...

If only one species is specified, it can be given as a single string instead of a list.

Each of the operators is described below

Time operator

The time operator can apply to a single point in time or a range of times.

A point in time is defined as follows:

time_operator:
  type: point
  unit: time  # or time_index
  time: [YYYYMMDD, HHMM]  # or a single integer time index

where YYYYMMDD is the year, month, and day, and HHMM is the hour and minute, both as integers.

A range of times is defined as follows:

time_operator:
  type: range
  unit: time  # or time_index
  start: [YYYYMMDD, HHMM]  # or a single integer time index
  end: [YYYYMMDD, HHMM]  # or a single integer time index
  weights: normalized  # or 'equal'

The weight can be either normalized or equal. Normalized are the default, and mean that the weights are normalized to sum to 1 (thus we are averaging the values over the time range). Equal means that the weights are equal to one for all time points (thus we are summing the values over the time range).

Horizontal operator

The horizontal operator can apply to a single point or a box.

A point is defined as follows:

horizontal_operator:
  type: point
  unit: degrees  # or grid_index
  longitude: 90.0  # or a single integer grid index
  latitude: 10.0  # or a single integer grid index

Degrees are degrees north for latitude (-90 to 90), and degrees east for longitude (-180 to 180). The grid index depends on the choice of horizontal grid.

A box is defined as follows:

horizontal_operator:
  type: box
  unit: degrees  # or grid_index
  longitude_start: 90.0  # or a single integer grid index
  longitude_end: 90.0  # or a single integer grid index
  latitude_start: 10.0  # or a single integer grid index
  latitude_end: 10.0  # or a single integer grid index
  weights: 'normalized_area'  # or 'area', 'normalized', 'equal'

The weight can be either normalized area, area, normalized, or equal. Normalized means that the weights sum up to 1. Area means that the weights are proportional to area (or equal for weights = "area" itself). Equal means that the weights are equal to one for all grid points.

Vertical operator

The vertical operator can apply to a single point, a range of points, or a specific set of points.

A point vertical operator is defined as follows:

vertical_operator:
  type: point
  unit: pressure  # or altitude, pressure_level
  value: 1000  # or a single integer pressure level

Pressure is in hPa. Altitude is in meters. Pressure level is the integer pressure level index, which depends on the choice of vertical grid.

A range of points is defined as follows:

vertical_operator:
  type: range
  unit: pressure  # or altitude, pressure_level
  start: 0  # or a single integer pressure level
  end: 1000  # or a single integer pressure level
  weights: normalized_pressure  # or pressure, normalized, equal

As usual, normalized means that the weights sum up to 1. Pressure means that the weights are proportional to the pressure difference across the sampled vertical levels. Equal means that the weights are equal to one for all vertical levels.

The final vertical operator is a set of points, and is defined as follows:

vertical_operator:
  type: exact
  unit: pressure  # or altitude, pressure_level
  values: [0, 1000]  # or a list of integer pressure levels
  weights: [0.5, 0.5]  # or a list of weights

Here, the weight must be specified for each value.

Output format

The output is a CSV file with the following columns:

  • id: The id of the observation operator
  • species: The species name of the observation operator
  • value: The final accumulated value of the observation operator for the given species
@mbertolacci mbertolacci added the category: Question Further information is requested label Dec 13, 2024
@yantosca yantosca added the topic: Diagnostics Related to output diagnostic data label Dec 17, 2024
Copy link

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the issue from closing this issue.

@github-actions github-actions bot added the stale No recent activity on this issue label Jan 17, 2025
@yantosca yantosca added never stale Never label this issue as stale category: Discussion An extended discussion of a particular topic and removed stale No recent activity on this issue category: Question Further information is requested labels Jan 17, 2025
@lizziel
Copy link
Contributor

lizziel commented Jan 17, 2025

@mbertolacci, you may have better luck getting a response by reaching out to GEOS-Chem working groups or specific individuals by email.

@laestrada
Copy link
Contributor

Hi @mbertolacci, apologies for the late reply. This feature would be very helpful for easily adding new instruments into our methane inversion work (IMI and others). @hannahnesser and @pennelise have been working on a flexible observation operator package for postprocessing GEOS-Chem data. They may have some thoughts on implementation.

@hannahnesser
Copy link
Contributor

Hi @mbertolacci, Yes, @pennelise and I have a complete, working version of this for satellite observations. You can find the code and documentation here. I've implemented it in my (highly modified) branch of the IMI for my OCO work, though it hasn't been generally incorporated for the IMI. All that would need to be done for this to be done is (1) for someone to write a TROPOMI parser and add it to the parsers.py file and (2) for the IMI itself to be updated to use the operator. I've been meaning to do (1) for ages. If someone can volunteer to do (2), I can prioritize (1).

Also, GOOPy doesn't yet have an ObsPack operator. This would also be very easy to implement and again I can prioritize it if someone offers to implement it within the IMI. (You can see our list of things we need to add, including an ObsPack capability, to GOOPy under the issues tab.)

@drewpendergrass
Copy link

I'll just add a quick note that CHEEREIO supports a variety of observation operators across TCCON, TROPOMI, OMI, ObsPack, IASI, and CrIS. The GC adjoint has its own suite of observation operators, as does the JEDI initiative for a universal data assimilation platform. Little details in implementation on the DA side have so far prevented things from being compatible (i.e. the adjoint and JEDI don't output GC mapped to observations or vice-versa, but rather related metrics) so unfortunately there is tons of duplicated effort in this space.

More directly to your idea, I have plans to release a version of CHEEREIO without assimilation that allows users to run GC and output whatever observational diagnostics using existing CHEEREIO operators, but this would have more overhead than the operator you propose @mbertolacci because CHEEREIO involves a lot of bonus code that would be superfluous in this case. I'll also note, following @laestrada , that the diagnostic you propose would be very useful in the ease of creating future operators. We've already greatly benefited from the satellite diagnostic for CrIS work.

@mbertolacci
Copy link
Author

mbertolacci commented Jan 20, 2025

Thanks everyone for popping up with some comments!

Hi @laestrada and @hannahnesser, GOOpy looks like it has very similar goals! The difference, if I understand GOOpy correctly, is that ObsOperator runs inline with GEOS-Chem rather than post-processing SpeciesConc or LevelEdge files.

We needed that because it's infeasible in many cases for us to save model output at sufficiently fine resolutions. I have code (written in R) similar to GOOpy for satellite and obspack-style observations but felt it would be good to unify them in a single format.

(Edited to add) The other difference (and this differs from ObsPack and other existing satellite operators in GEOS-Chem) is there's no consideration of parsing the observation input files in order to get the sampling scheme - that is a left as a preprocessing step for the user. One could provide scripts to do it for common formats.

I think that's both a plus and a minus - it simplifies the Fortran code considerably, it gives a user flexibility in how to exactly implement the sampling, and (like GOOpy) it lets multiple input types be sampled into a common output format. On the other hand it's more work for the user.

Hi @drewpendergrass, that's good to know about CHEEREIO, it sounds like there's some overlap. ObsOperator of course has no consideration of adjoint code at present.

If anyone wants to try the code I'm happy to assist with that.

A few notes on where this code is at:

  1. I plan to switch to NetCDF as the output format. I used CSV just because it was faster to get it implemented.
  2. I also plan to slightly tweak the input format to take a list of "fields" rather than just "species". This would leave room for a future extension to let the ObsOperator apply to anything the History diagnostic can output, even if it's not implemented in the short term.
  3. Longer term, I'd dearly like to allow the input YAML files to be gzipped. I think that would require linking GEOS-Chem to zlib though, have a C binding, etc, etc. That's why it's not there yet.

I think after 1. and 2. the module should have a reasonably stable input and output file format.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: Discussion An extended discussion of a particular topic never stale Never label this issue as stale topic: Diagnostics Related to output diagnostic data
Projects
None yet
Development

No branches or pull requests

6 participants