Acquire ECCC Hourly Data #43

Open
franTarkenton opened this issue Nov 1, 2023 · 1 comment

Comments

@franTarkenton
Member

franTarkenton commented Nov 1, 2023

Create a script that will pull the following information on an hourly basis.

Source of data:
https://hpfx.collab.science.gc.ca/20231101/WXO-DD/observations/swob-ml/20231101/
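Since the script runs hourly and the file names use UTC, the directory URL has to be rebuilt from the current UTC date. A minimal sketch, assuming the URL pattern above holds for other dates (the same YYYYMMDD value in both positions); the function name `swob_day_url` is made up for illustration:

```python
from datetime import datetime, timezone

BASE_URL = "https://hpfx.collab.science.gc.ca"


def swob_day_url(when):
    """Build the swob-ml directory URL for the UTC date of `when`.

    `when` must be a timezone-aware datetime; it is converted to UTC
    before the date is formatted.
    """
    if when.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    day = when.astimezone(timezone.utc).strftime("%Y%m%d")
    # Assumption: the date appears twice in the path, as in the URL above.
    return f"{BASE_URL}/{day}/WXO-DD/observations/swob-ml/{day}/"


if __name__ == "__main__":
    print(swob_day_url(datetime.now(timezone.utc)))
```

Converting to UTC first matters in the evening Pacific time, when the local date and the UTC date differ.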

Data Acquisition

  • Need to decide which weather stations to keep and which to drop
    • Start with a bounding box (in future, create a buffered BC polygon that we can query)
    • For each weather station, grab the lat/long and determine whether it is in our area of interest.
    • If so, proceed to processing
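The bounding-box check described above could look something like the following. This is only a sketch: the `BBox` type, the `stations_in_area` helper, the station IDs in the example, and the extent values in `BC_BBOX` are all assumptions for illustration, and the extent is a rough guess rather than an agreed area of interest.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float

    def contains(self, lat, lon):
        """True if the point lies inside the box (edges included)."""
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


# Hypothetical rough extent around BC; replace with the real area of interest.
BC_BBOX = BBox(min_lat=48.0, min_lon=-140.0, max_lat=60.5, max_lon=-113.5)


def stations_in_area(stations, bbox):
    """Keep the IDs of stations whose coordinates fall inside `bbox`.

    `stations` is an iterable of (station_id, lat, lon) tuples.
    """
    return [sid for sid, lat, lon in stations if bbox.contains(lat, lon)]


if __name__ == "__main__":
    sample = [("STN_A", 49.28, -123.12), ("STN_B", 43.65, -79.38)]
    print(stations_in_area(sample, BC_BBOX))
```

Swapping the box for a buffered polygon later would only change `contains`, so the rest of the filter can stay as is.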

Processing

  • Use the climate_obs spreadsheet to get the station list

  • Pull down the station data for the current hour (note: hours in the file names are in UTC)

  • Extract from the individual xml files the following properties:

    • pcpn_amt_pst1hr
    • avg_air_temp_pst1hr
  • If a new day is detected, create a new file; otherwise pull the existing file from object store, update it, and push it back (make sure we are not creating new versions)

  • Create two input files, one for temperature and one for precipitation:

    • PC.csv
    • TA.csv
  • format of the files / columns:

    • date
    • climate stations (listed along the x axis like the PC.csv ASP data)
    • actual data (either precipitation or temperature, depending on which file is being created)
  • The script would run hourly, when the data is available

  • It would pull the data down, update it, and then repost it (make sure we are not creating a new version in object storage when the file is updated)

  • Need to set up a sync process that ensures the data in object store also exists on the on-prem server.

    • on prem file path: Z:\MPOML\HOURLY (sewer)
    • object store path: RFC_DATA/ECC_HOURLY/
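The update step for PC.csv / TA.csv (a date column followed by one column per station) can be sketched as below. It works on CSV text only, so pulling the file from object store and pushing it back without creating a new version are left out. The function name `upsert_hour`, the timestamp format, and the station IDs are assumptions for illustration.

```python
import csv
import io


def upsert_hour(csv_text, stamp, readings):
    """Return CSV text with the row for `stamp` added or replaced.

    csv_text: existing file contents ("" when a new day starts).
    stamp:    value for the "date" column, e.g. "2023-11-01 01:00"
              (assumed format; ISO-like stamps sort chronologically).
    readings: {station_id: value} for that hour.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    # Keep existing station columns in order, append any new stations.
    existing = [c for c in (reader.fieldnames or []) if c != "date"]
    stations = existing + sorted(set(readings) - set(existing))

    by_date = {row["date"]: row for row in rows}
    merged = dict(by_date.get(stamp, {"date": stamp}))
    merged.update({sid: str(value) for sid, value in readings.items()})
    by_date[stamp] = merged

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["date"] + stations,
                            restval="", lineterminator="\n")
    writer.writeheader()
    for key in sorted(by_date):
        writer.writerow(by_date[key])
    return out.getvalue()


if __name__ == "__main__":
    text = upsert_hour("", "2023-11-01 00:00", {"STN_A": 0.2})
    text = upsert_hour(text, "2023-11-01 01:00", {"STN_A": 0.0, "STN_B": 1.4})
    print(text)
```

Re-running the same hour replaces that row instead of appending a duplicate, which keeps the hourly job safe to retry.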

Secondary:

  • Listen to the message queue for the specific data we want and trigger the GitHub Action
@KYSIEMENS
Collaborator

The script is mostly complete. Hourly XML files for the stations in the station list are downloaded, processed, and loaded into a dataframe, which is then saved as a parquet file in object store. Daily temperature and precipitation are generated and saved to object store. The 'air_temp' variable is used instead of 'avg_air_temp_pst1hr' because the latter was missing for many stations.
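For reference, pulling named values out of one station file might look like the sketch below. It is not the actual script. It assumes each observation is stored as an `element` tag with `name` and `value` attributes, which should be checked against a real file, and the `SAMPLE` document is a made-up, simplified stand-in. The fallback (prefer 'avg_air_temp_pst1hr', use 'air_temp' when it is missing) is one possible variation on the substitution described above.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified stand-in for a station file (structure is an assumption).
SAMPLE = """<obs xmlns="http://example.invalid/swob">
  <elements>
    <element name="air_temp" uom="degC" value="4.6"/>
    <element name="pcpn_amt_pst1hr" uom="mm" value="0.2"/>
  </elements>
</obs>"""


def extract_values(xml_text, names):
    """Return {name: float or None} for each requested element name."""
    found = {}
    for node in ET.fromstring(xml_text).iter():
        # Drop any XML namespace: "{ns}element" -> "element".
        if node.tag.rsplit("}", 1)[-1] != "element":
            continue
        name = node.get("name")
        if name in names and name not in found:
            try:
                found[name] = float(node.get("value"))
            except (TypeError, ValueError):
                found[name] = None  # missing or non-numeric value
    return {name: found.get(name) for name in names}


def hourly_temperature(values):
    """Prefer avg_air_temp_pst1hr; fall back to air_temp when it is absent."""
    avg = values.get("avg_air_temp_pst1hr")
    return avg if avg is not None else values.get("air_temp")


if __name__ == "__main__":
    wanted = ["pcpn_amt_pst1hr", "avg_air_temp_pst1hr", "air_temp"]
    values = extract_values(SAMPLE, wanted)
    print(values, hourly_temperature(values))
```

Matching on the local tag name avoids hard-coding a schema namespace, which may differ between file versions.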

To do:

  • Sync the TA.csv/PC.csv files to the on-prem server
  • Review and clean up the code, and look for additional efficiencies to cut down run time
  • Consider downloading additional variables or stations that could be useful in the future (e.g. implement the weather station bounding box strategy described above)

@franTarkenton franTarkenton moved this to In Progress in RFC Backlog Dec 5, 2023
@franTarkenton franTarkenton moved this from In Progress to Sprint Backlog in RFC Backlog Dec 5, 2023
Labels
None yet
Projects
Status: Sprint Backlog
Development

No branches or pull requests

2 participants