diff --git a/book/chapters/Predictions.ipynb b/book/chapters/Predictions.ipynb new file mode 100644 index 0000000..a877468 --- /dev/null +++ b/book/chapters/Predictions.ipynb @@ -0,0 +1,645 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "TWAdqXXvf1M7" + }, + "source": [ + "# Model Prediction/Evaluation" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M0bqKyYqf5DT" + }, + "source": [ + "We have arrived at the final chapter, where we see the end product of the model we created. As we venture into this critical phase, the model_predict script emerges, guiding the way toward understanding and anticipating the future of snow water equivalent (SWE) through the ExtraTree model. This chapter delves into the intricacies of this script, unraveling the processes that transform raw, unprocessed data into precise predictions that illuminate the path forward." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sFDziC_Df-lP" + }, + "source": [ + "**Preparing for Prediction:**\n", + "This begins with loading and pre-processing the data." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EpdJKnj7gTr8" + }, + "source": [ + "Loading Data: The script starts by ingesting data from a CSV file, bringing into the fold the vast array of variables the model expects." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "id": "LkzaZTCtgJ5R" + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "def load_data(file_path):\n", + " \"\"\"\n", + " Load data from a CSV file.\n", + " Args: file_path (str): Path to the CSV file containing the data.\n", + " Returns: pd.DataFrame: A pandas DataFrame containing the loaded data.\n", + " \"\"\"\n", + " return pd.read_csv(file_path)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zUwlJCu6ge2k" + }, + "source": [ + "Pre-processing: Next, the data undergoes a transformation. Dates are converted, irrelevant columns are discarded, and the data is reshaped to match the model's expectations."
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "id": "wXQ-4JBdgjdG" + }, + "outputs": [], + "source": [ + "def preprocess_data(data):\n", + " \"\"\"\n", + " Preprocess the input data for model prediction.\n", + " Args: data (pd.DataFrame): Input data in the form of a pandas DataFrame.\n", + " Returns: pd.DataFrame: Preprocessed data ready for prediction.\n", + " \"\"\"\n", + " data['date'] = pd.to_datetime(data['date'])\n", + " data.replace('--', pd.NA, inplace=True)\n", + " # rename raw column names to the feature names used during training\n", + " data.rename(columns={'Latitude': 'lat', 'Longitude': 'lon',\n", + " 'vpd': 'mean_vapor_pressure_deficit',\n", + " 'vs': 'wind_speed', 'pr': 'precipitation_amount',\n", + " 'etr': 'potential_evapotranspiration', 'tmmn': 'air_temperature_tmmn',\n", + " 'tmmx': 'air_temperature_tmmx', 'rmin': 'relative_humidity_rmin',\n", + " 'rmax': 'relative_humidity_rmax', 'cumulative_AMSR_SWE': 'cumulative_SWE',\n", + " 'cumulative_AMSR_Flag': 'cumulative_Flag', 'cumulative_tmmn':'cumulative_air_temperature_tmmn',\n", + " 'cumulative_etr': 'cumulative_potential_evapotranspiration', 'cumulative_vpd': 'cumulative_mean_vapor_pressure_deficit',\n", + " 'cumulative_rmax': 'cumulative_relative_humidity_rmax', 'cumulative_rmin': 'cumulative_relative_humidity_rmin',\n", + " 'cumulative_pr': 'cumulative_precipitation_amount', 'cumulative_tmmx': 'cumulative_air_temperature_tmmx',\n", + " 'cumulative_vs': 'cumulative_wind_speed', 'AMSR_SWE': 'SWE', 'AMSR_Flag': 'Flag', }, inplace=True)\n", + " print(data.head())\n", + " print(data.columns)\n", + " # selected_columns is the training feature list defined in the shared configuration\n", + " selected_columns.remove(\"swe_value\")\n", + " desired_order = selected_columns + ['lat', 'lon',]\n", + " # reorder the columns to match the order the model saw during training\n", + " data = data[desired_order]\n", + " print(\"reorganized columns: \", data.columns)\n", + " return data\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qZg6If3Lhecf" + }, + "source": [ + "**Loading Model**: The script retrieves the ExtraTree model and starts the process of making predictions." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "id": "d7EsL7aVhltv" + }, + "outputs": [], + "source": [ + "def load_model(model_path):\n", + " \"\"\"\n", + " Load a machine learning model from a file.\n", + "\n", + " Args:\n", + " model_path (str): Path to the saved model file.\n", + "\n", + " Returns:\n", + " model: The loaded machine learning model.\n", + " \"\"\"\n", + " return joblib.load(model_path)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NRBPyBVxhu3r" + }, + "source": [ + "**predict_swe:** Before prediction can commence, predict_swe undertakes the crucial task of preparing the input data."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Uv3nAmw-h3EO" + }, + "outputs": [], + "source": [ + "def predict_swe(model, data):\n", + " \"\"\"\n", + " Predict snow water equivalent (SWE) using a machine learning model.\n", + " Args: model: The machine learning model for prediction.\n", + " data (pd.DataFrame): Input data for prediction.\n", + " Returns: pd.DataFrame: Dataframe with predicted SWE values.\n", + " \"\"\"\n", + " # -999 is the designated placeholder for missing values\n", + " data = data.fillna(-999)\n", + " # geographic identifiers are not model features, so drop them before predicting\n", + " input_data = data.drop([\"lat\", \"lon\"], axis=1)\n", + " predictions = model.predict(input_data)\n", + " data['predicted_swe'] = predictions\n", + " return data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PYeFXD4xiJpi" + }, + "source": [ + "It fills missing values with a designated placeholder (-999), a common practice to ensure machine learning algorithms can process the data without encountering errors due to missing values. This step reflects a balance between data integrity and computational requirements, enabling the model to make predictions even in the absence of complete information.\n", + "\n", + "At the core of predict_swe is the model's predict() method invocation. This step is where the machine learning model, trained on historical data, applies its learned patterns to the new, unseen data. The decision to drop geographical identifiers (lat, lon) before prediction underscores a focus on the environmental and temporal factors influencing SWE, aligning the model's inputs with its training regime.\n", + "\n", + "The function concludes by appending the model's predictions back to the original dataset as a new column, predicted_swe. This enrichment transforms the dataset from a static snapshot of past and present conditions into a dynamic forecast of future snow water equivalents. This step is critical for stakeholders relying on accurate SWE predictions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Zmbl6QxXiVIE" + }, + "source": [ + "**Merge data:** merge_data meticulously combines the predicted SWE values with the original dataset. It employs conditional logic to adjust predictions based on specific criteria, such as nullifying predictions in the absence of key environmental data. This approach underscores a commitment to precision, ensuring that the predictions reflect a nuanced understanding of the environmental context."
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "id": "aa_il7YriZlE" + }, + "outputs": [], + "source": [ + "def merge_data(original_data, predicted_data):\n", + " \"\"\"\n", + " Merge predicted SWE data with the original data.\n", + " Args: original_data (pd.DataFrame): Original input data.\n", + " predicted_data (pd.DataFrame): Dataframe with predicted SWE values.\n", + " Returns: pd.DataFrame: Merged dataframe.\n", + " \"\"\"\n", + " if \"date\" not in predicted_data:\n", + " # test_start_date is a global run setting defined elsewhere in the workflow\n", + " predicted_data[\"date\"] = test_start_date\n", + " new_data_extracted = predicted_data[[\"date\", \"lat\", \"lon\", \"predicted_swe\"]]\n", + " print(\"original_data.columns: \", original_data.columns)\n", + " print(\"new_data_extracted.columns: \", new_data_extracted.columns)\n", + " print(\"new prediction statistics: \", new_data_extracted[\"predicted_swe\"].describe())\n", + " merged_df = original_data.merge(new_data_extracted, on=['date', 'lat', 'lon'], how='left')\n", + " # zero out predictions where snow-cover flags or missing temperature data\n", + " # indicate that no meaningful SWE estimate can be made\n", + " merged_df.loc[merged_df['fsca'] == 237, 'predicted_swe'] = 0\n", + " merged_df.loc[merged_df['fsca'] == 239, 'predicted_swe'] = 0\n", + " merged_df.loc[merged_df['cumulative_fsca'] == 0, 'predicted_swe'] = 0\n", + " merged_df.loc[merged_df['air_temperature_tmmx'].isnull(), 'predicted_swe'] = 0\n", + " return merged_df" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ER1Ej6OPioDb" + }, + "source": [ + "**This function's technical execution**\n", + "\n", + "Merging the datasets on date, latitude, and longitude ensures that each predicted SWE value is accurately aligned with its corresponding geographical and temporal marker, preserving the integrity and utility of the predictions. This process highlights both the technical sophistication of the SnowCast project and its dedication to delivering reliable and actionable insights." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PmUzvNEYi9rl" + }, + "source": [ + "**Predict Function**\n", + "\n", + "The predict function stands as the conductor, orchestrating the entire predictive process from start to finish. It starts by loading the pre-trained model, preserving and leveraging the accumulated knowledge encapsulated within the model's parameters."
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "id": "ny9tjabpjJ4q" + }, + "outputs": [], + "source": [ + "def predict():\n", + " \"\"\"\n", + " Main function for predicting snow water equivalent (SWE).\n", + " Returns: None\n", + " \"\"\"\n", + " # output grid dimensions of the study area (kept for reference; not used below)\n", + " height = 666\n", + " width = 694\n", + " model_path = f'{homedir}/Documents/GitHub/SnowCast/model/wormhole_ETHole_latest.joblib'\n", + " print(f\"Using model: {model_path}\")\n", + "\n", + " new_data_path = f'{work_dir}/testing_all_ready_{test_start_date}.csv'\n", + " latest_output_path = f'{work_dir}/test_data_predicted_latest_{test_start_date}.csv'\n", + " output_path = f'{work_dir}/test_data_predicted_{generate_random_string(5)}.csv'\n", + "\n", + " if os.path.exists(output_path):\n", + " os.remove(output_path)\n", + " print(f\"File '{output_path}' has been removed.\")\n", + "\n", + " model = load_model(model_path)\n", + " new_data = load_data(new_data_path)\n", + "\n", + " preprocessed_data = preprocess_data(new_data)\n", + " if len(new_data) < len(preprocessed_data):\n", + " raise ValueError(\"Preprocessing unexpectedly increased the number of rows\")\n", + "\n", + " predicted_data = predict_swe(model, preprocessed_data)\n", + " print(\"how many predicted? \", len(predicted_data))\n", + "\n", + " if \"date\" not in preprocessed_data:\n", + " preprocessed_data[\"date\"] = test_start_date\n", + " predicted_data = merge_data(preprocessed_data, predicted_data)\n", + "\n", + " predicted_data.to_csv(output_path, index=False)\n", + " print(\"Prediction successfully done \", output_path)\n", + "\n", + " shutil.copy(output_path, latest_output_path)\n", + " print(f\"Copied to {latest_output_path}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d5DBPYh8jhDx" + }, + "source": [ + "Following model loading, the function navigates the data landscape, loading new data for prediction and preprocessing it to align with the model's requirements. This step is critical, as it transforms raw data into a format that the model can interpret, ensuring the accuracy and relevance of the predictions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "slezgjgmjY6-" + }, + "source": [ + "**Delivering the Prediction**\n", + "\n", + "In its final act, the predict function executes predict_swe, merges the predictions with the original data, and saves the enriched dataset. The choice of a dynamically generated filename for saving predictions reflects operational requirements, ensuring that each prediction cycle is uniquely identifiable; a sketch of such a filename helper follows.\n", + "\n", + "![](../img/Pred_Delivery.png)" + ] + },
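+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The generate_random_string helper used by predict is defined elsewhere in the SnowCast codebase; a minimal sketch of what such a helper could look like, assuming it simply draws random alphanumeric characters, is:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import random\n", + "import string\n", + "\n", + "def generate_random_string(length):\n", + " \"\"\"Return a random alphanumeric tag, e.g. to make each prediction file unique.\"\"\"\n", + " characters = string.ascii_letters + string.digits\n", + " return ''.join(random.choice(characters) for _ in range(length))\n", + "\n", + "# Example output: test_data_predicted_a1B2c.csv\n", + "print(f\"test_data_predicted_{generate_random_string(5)}.csv\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7kYpJsr3jWJx" + }, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u61bKY8mj0eE" + }, + "source": [ + "# Results" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oRHveVzRj91l" + }, + "source": [ + "This is the whole process of how the predictions are converted into images."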
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tHsM6EIRlQdQ" + }, + "source": [ + "**Convert result to image:**\n", + "\n", + "**convert csvs to images simple:** This is the function that takes the raw prediction data and converts it into geographical images.\n", + "\n", + "**Data Loading:** This begins by ingesting the CSV containing SWE predictions, ensuring every data point is primed for visualization.\n", + "\n", + "**Custom Colormap Creation:** It employs a custom colormap, crafted to represent various ranges of SWE, providing an intuitive visual understanding of snow coverage.\n", + "\n", + "**Geospatial Plotting:** This utilizes the geographical coordinates within the data to accurately place each prediction on the map, ensuring a realistic representation of SWE distribution.\n", + "\n", + "**Merge data:** The merge_data function combines the predicted SWE values with their corresponding geographical markers.\n", + "\n", + "**Conditional Adjustments:** Conditional adjustment refines the predicted values based on specific criteria, ensuring the visual representation aligns with realistic expectations of SWE.\n", + "\n", + "**Spatial Accuracy:** This aligns predictions with their exact geographical locations, ensuring that the visual output is as informative as it is accurate." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7QOdrzg6oLNh" + }, + "source": [ + "**Custom Colormap:** A list named colors defines the color scheme for the colormap, using RGB tuples for each color. These colors represent different levels of SWE, from low to high (light gray to dark red).\n", + "\n", + "**Geographical Boundaries:** lon_min, lon_max, lat_min, and lat_max define the geographical area of interest by specifying the minimum and maximum longitudes and latitudes. This setting focuses the visualization and analysis on the Western United States."
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "qMeBEobUoJcW" + }, + "outputs": [], + "source": [ + "import matplotlib.colors as mcolors\n", + "\n", + "colors = [\n", + " (0.8627, 0.8627, 0.8627), # #DCDCDC - 0 - 1\n", + " (0.8627, 1.0000, 1.0000), # #DCFFFF - 1 - 2\n", + " (0.6000, 1.0000, 1.0000), # #99FFFF - 2 - 4\n", + " (0.5569, 0.8235, 1.0000), # #8ED2FF - 4 - 6\n", + " (0.4509, 0.6196, 0.8745), # #739EDF - 6 - 8\n", + " (0.4157, 0.4706, 1.0000), # #6A78FF - 8 - 10\n", + " (0.4235, 0.2784, 1.0000), # #6C47FF - 10 - 12\n", + " (0.5529, 0.0980, 1.0000), # #8D19FF - 12 - 14\n", + " (0.7333, 0.0000, 0.9176), # #BB00EA - 14 - 16\n", + " (0.8392, 0.0000, 0.7490), # #D600BF - 16 - 18\n", + " (0.7569, 0.0039, 0.4549), # #C10074 - 18 - 20\n", + " (0.6784, 0.0000, 0.1961), # #AD0032 - 20 - 30\n", + " (0.5020, 0.0000, 0.0000) # #800000 - > 30\n", + "]\n", + "cmap_name = 'custom_snow_colormap'\n", + "custom_cmap = mcolors.ListedColormap(colors)\n", + "\n", + "lon_min, lon_max = -125, -100\n", + "lat_min, lat_max = 25, 49.5\n", + "\n", + "# Define value ranges for color mapping\n", + "fixed_value_ranges = [1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 30]\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-gFem_1DlsPd" + }, + "source": [ + "**Convert csv to geotiff:** This function converts the prediction CSV into a geographically referenced raster map (GeoTIFF).\n", + "\n", + "**Rasterization:** It transforms the CSV data into a raster format, suitable for creating detailed geospatial maps.\n", + "\n", + "**Resolution and Coverage:** This carefully defines the resolution and geographical extent of the output map, ensuring that it captures the full scope of the predictions.\n", + "\n", + "**Geospatial Alignment:** It utilizes the rasterio and geopandas libraries to ensure that each pixel in the output map accurately represents the predicted SWE values at specific geographical coordinates. A simplified sketch of this conversion is shown below."
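+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following is a minimal sketch of such a CSV-to-GeoTIFF conversion, assuming a regular lat/lon grid, a predicted_swe column, and an illustrative 0.036-degree resolution; the actual SnowCast function may differ in naming, resolution, and CRS handling:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import rasterio\n", + "from rasterio.transform import from_origin\n", + "\n", + "def csv_to_geotiff(csv_path, tif_path, resolution=0.036):\n", + " data = pd.read_csv(csv_path)\n", + " lons = np.sort(data['lon'].unique())\n", + " lats = np.sort(data['lat'].unique())[::-1] # north-to-south row order\n", + " # pivot the point table into a 2D grid of SWE values\n", + " grid = (data.pivot_table(index='lat', columns='lon', values='predicted_swe')\n", + " .reindex(index=lats, columns=lons)\n", + " .to_numpy())\n", + " # anchor the raster at the north-west corner of the grid\n", + " transform = from_origin(lons.min(), lats.max(), resolution, resolution)\n", + " with rasterio.open(\n", + " tif_path, 'w', driver='GTiff',\n", + " height=grid.shape[0], width=grid.shape[1],\n", + " count=1, dtype='float32', crs='EPSG:4326',\n", + " transform=transform, nodata=-999.0,\n", + " ) as dst:\n", + " dst.write(np.nan_to_num(grid, nan=-999.0).astype('float32'), 1)" + ] + },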
+ { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "id": "7h9706LXVba2" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "from matplotlib.patches import Patch\n", + "from mpl_toolkits.basemap import Basemap\n", + "from datetime import datetime, timedelta\n", + "\n", + "# create_color_maps_with_value_range, lat_lon_to_map_coordinates, day_index,\n", + "# test_start_date, and homedir are defined in earlier cells / SnowCast utilities\n", + "\n", + "def convert_csvs_to_images():\n", + " \"\"\"\n", + " Convert CSV data to images with color-coded SWE predictions.\n", + "\n", + " Returns:\n", + " None\n", + " \"\"\"\n", + " global fixed_value_ranges\n", + " data = pd.read_csv(f\"{homedir}/gridmet_test_run/test_data_predicted_n97KJ.csv\")\n", + " print(\"statistic of predicted_swe: \", data['predicted_swe'].describe())\n", + " data['predicted_swe'].fillna(0, inplace=True)\n", + "\n", + " for column in data.columns:\n", + " column_data = data[column]\n", + " print(column_data.describe())\n", + "\n", + " # Create a figure with a white background\n", + " fig = plt.figure(facecolor='white')\n", + "\n", + " m = Basemap(llcrnrlon=lon_min, llcrnrlat=lat_min, urcrnrlon=lon_max, urcrnrlat=lat_max,\n", + " projection='merc', resolution='i')\n", + "\n", + " x, y = m(data['lon'].values, data['lat'].values)\n", + "\n", + " color_mapping, value_ranges = create_color_maps_with_value_range(data[\"predicted_swe\"], fixed_value_ranges)\n", + "\n", + " # Plot the data using the custom colormap\n", + " plt.scatter(x, y, c=color_mapping, cmap=custom_cmap, s=30, edgecolors='none', alpha=0.7)\n", + "\n", + " # Draw coastlines and other map features\n", + " m.drawcoastlines()\n", + " m.drawcountries()\n", + " m.drawstates()\n", + "\n", + " # Recover the calendar date from the day index (days since 1900-01-01)\n", + " reference_date = datetime(1900, 1, 1)\n", + " day_value = day_index\n", + "\n", + " result_date = reference_date + timedelta(days=day_value)\n", + " today = result_date.strftime(\"%Y-%m-%d\")\n", + " timestamp_string = result_date.strftime(\"%Y-%m-%d\")\n", + "\n", + " # Add a title\n", + " plt.title(f'Predicted SWE in the Western US - {today}', pad=20)\n", + "\n", + " # Add labels for latitude and longitude on x and y axes with smaller font size\n", + " plt.xlabel('Longitude', fontsize=6)\n", + " plt.ylabel('Latitude', fontsize=6)\n", + "\n", + " # Add longitude values to the x-axis and adjust font size\n", + " x_ticks_labels = np.arange(lon_min, lon_max + 5, 5)\n", + " x_tick_labels_str = [f\"{lon:.1f}°W\" if lon < 0 else f\"{lon:.1f}°E\" for lon in x_ticks_labels]\n", + " plt.xticks(*m(x_ticks_labels, [lat_min] * len(x_ticks_labels)), fontsize=6)\n", + " plt.gca().set_xticklabels(x_tick_labels_str)\n", + "\n", + " # Add latitude values to the y-axis and adjust font size\n", + " y_ticks_labels = np.arange(lat_min, lat_max + 5, 5)\n", + " y_tick_labels_str = [f\"{lat:.1f}°N\" if lat >= 0 else f\"{abs(lat):.1f}°S\" for lat in y_ticks_labels]\n", + " plt.yticks(*m([lon_min] * len(y_ticks_labels), y_ticks_labels), fontsize=6)\n", + " plt.gca().set_yticklabels(y_tick_labels_str)\n", + "\n", + " # Convert map coordinates to latitude and longitude for y-axis labels\n", + " y_tick_positions = np.linspace(lat_min, lat_max, len(y_ticks_labels))\n", + " y_tick_positions_map_x, y_tick_positions_map_y = lat_lon_to_map_coordinates([lon_min] * len(y_ticks_labels), y_tick_positions, m)\n", + " y_tick_positions_lat, _ = m(y_tick_positions_map_x, y_tick_positions_map_y, inverse=True)\n", + " y_tick_positions_lat_str = [f\"{lat:.1f}°N\" if lat >= 0 else f\"{abs(lat):.1f}°S\" for lat in y_tick_positions_lat]\n", + " plt.yticks(y_tick_positions_map_y, y_tick_positions_lat_str, fontsize=6)\n", + "\n", + " # Create custom legend elements using the same colormap\n", + " legend_elements = [Patch(color=colors[i], label=f\"{value_ranges[i]} - {value_ranges[i+1]-1}\" if i < len(value_ranges) - 1 else f\"> {value_ranges[-1]}\") for i in range(len(value_ranges))]\n",
+ "\n", + " # Create the legend outside the map\n", + " legend = plt.legend(handles=legend_elements, loc='upper left', title='Legend', fontsize=8)\n", + " legend.set_bbox_to_anchor((1.01, 1))\n", + "\n", + " plt.text(0.98, 0.02, 'Copyright © SWE Wormhole Team',\n", + " horizontalalignment='right', verticalalignment='bottom',\n", + " transform=plt.gcf().transFigure, fontsize=6, color='black')\n", + "\n", + " # Set the aspect ratio to 'equal' to keep the plot at the center\n", + " plt.gca().set_aspect('equal', adjustable='box')\n", + "\n", + " # Adjust the margins to create more white space and accommodate the legend\n", + " plt.subplots_adjust(bottom=0.15, right=0.80)\n", + " # Save the plot to a file\n", + " new_plot_path = f'{homedir}/gridmet_test_run/predicted_swe-{test_start_date}.png'\n", + " print(f\"The new plot is saved to {new_plot_path}\")\n", + " plt.savefig(new_plot_path)\n", + " # plt.show() # Uncomment to display the plot directly instead of saving it\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fMErFn7ul35G" + }, + "source": [ + "**Deploy images to website:**\n", + "This is the process that deploys the visual insights to the public website.\n", + "\n", + "**copy files to right folder** --\n", + "\n", + "At the heart of our deployment strategy lies the copy_files_to_right_folder function, bridging computational outputs with public access. This function transfers the visual and data outputs of SnowCast from the secure confines of its computational environment to a publicly accessible web directory.\n", + "\n", + "Here's how it achieves this pivotal role:\n", + "\n", + "* Folder Synchronization: Utilizing distutils.dir_util.copy_tree, it ensures that all visual comparisons and predictions are mirrored from the SnowCast workspace to the web server's plotting directory, maintaining up-to-date access for users worldwide.\n", + "* Selective Deployment: Through meticulous directory traversal, it distinguishes between .png visualizations and .tif geospatial files, ensuring each file type is deployed to its rightful place for optimal public utility. A sketch of this routine follows." + ] + },
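+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Below is a minimal sketch of such a deployment routine, assuming hypothetical workspace and web-directory paths; the real SnowCast function may organize its folders differently:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import shutil\n", + "from distutils.dir_util import copy_tree # used by SnowCast; shutil.copytree is the modern alternative\n", + "\n", + "def copy_files_to_right_folder(workspace_dir, web_plot_dir, web_geotiff_dir):\n", + " # Mirror the whole plotting workspace into the web server's plotting directory\n", + " copy_tree(workspace_dir, web_plot_dir)\n", + " # Walk the workspace and route each file type to its public folder\n", + " for root, _, files in os.walk(workspace_dir):\n", + " for name in files:\n", + " source = os.path.join(root, name)\n", + " if name.endswith('.png'):\n", + " shutil.copy(source, os.path.join(web_plot_dir, name))\n", + " elif name.endswith('.tif'):\n", + " shutil.copy(source, os.path.join(web_geotiff_dir, name))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gwC4XJxHkClU" + }, + "source": [ + "**create mapserver map config: Crafts interactive Maps**\n", + "\n", + "The magic of SnowCast is not just in its predictions but in how these predictions are presented. 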
The create_mapserver_map_config function crafts a MapServer configuration for each GeoTIFF prediction file, transforming static data into interactive, exploratory maps.\n", + "\n", + "\n", + "* **Dynamic Configuration:** By generating a .map file for each prediction, this function lays the groundwork for interactive map services, allowing users to explore SWE predictions across different regions and times.\n", + "* **Intuitive Visualization:** The custom MapServer configuration leverages the power of geographical information systems (GIS) to provide an intuitive, visual representation of complex SWE data, making it accessible to experts and laypeople alike.\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JHTfd_L9UkBt" + }, + "source": [ + "![](../img/SWE_Map.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6V64pCx4mT1b" + }, + "source": [ + "**refresh available date list: Refreshing the Forecast**\n", + "\n", + "The refresh_available_date_list function ensures that the SnowCast portal remains current, reflecting the latest predictions and analyses. By dynamically updating the available date list with new predictions, it guarantees that users have access to the most recent insights.\n", + "\n", + "\n", + "* Data Frame Dynamics: This function creates a pandas DataFrame to catalog the available predictions, linking each date with its corresponding visualization and data file, thereby ensuring the portal's content is both comprehensive and current.\n", + "* Seamless Integration: The updated date list is saved as a CSV file, seamlessly integrating with the web portal's infrastructure to refresh the interactive calendar, guiding users to the latest SWE predictions.\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CUMa9t2iUkBt" + }, + "source": [ + "![](../img/Refresh_pred.png)" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/book/chapters/predictions.ipynb b/book/chapters/predictions.ipynb deleted file mode 100644 index 61c8550..0000000 --- a/book/chapters/predictions.ipynb +++ /dev/null @@ -1,54 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "source": [ - "# Predictions" - ], - "metadata": { - "collapsed": false - }, - "id": "3d78e29fc7f760ea" - }, - { - "cell_type": "markdown", - "source": [ - "Generating SWE forecasts using the developed models\n", - "Interpretation and analysis of prediction results\n" - ], - "metadata": { - "collapsed": false - }, - "id": "7cc3fefc60b95426" - }, - { - "cell_type": "markdown", - "source": [], - "metadata": { - "collapsed": false - }, - "id": "7e0341a2341ed5d7" - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 2 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython2", - "version": "2.7.6" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} 
diff --git a/book/chapters/validation.ipynb b/book/chapters/validation.ipynb index 6951218..9600778 100644 --- a/book/chapters/validation.ipynb +++ b/book/chapters/validation.ipynb @@ -1,47 +1,320 @@ { - "cells": [ - { - "cell_type": "markdown", - "source": [ - "# Model Testing and Validation\n", - "\n", - "Techniques for validating SWE prediction models\n", - "Importance of model validation in forecasting\n" - ], - "metadata": { - "collapsed": false - }, - "id": "b35a5fc3f09e5e77" + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "jrTh5-Sgr1Od" + }, + "source": [ + "# Model Testing and Evaluation\n", + "\n", + "The goal of predicting Snow Water Equivalent exemplifies the integration of machine learning with environmental science. This chapter delves into the testing and evaluation stages of the project." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IFvX25Khr2dW" + }, + "source": [ + "To begin with, it is essential to grasp the function of the BaseHole class. This class represents the complete lifecycle of the project, guiding it from initial development through to its final deployment." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rkO0EdSFr4q1" + }, + "source": [ + "The BaseHole class is a meticulously crafted blueprint for constructing models capable of predicting SWE. It offers a structured approach to handling data, training models, and making predictions with precision." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Y-PZlyQKr48X" + }, + "source": [ + "Let's briefly discuss what is happening in this BaseHole class:\n", + "\n", + "* Preprocessing: The model begins with preprocessing, a critical phase where raw data is transformed into a refined form suitable for training. The BaseHole class adeptly navigates this phase, loading data, cleaning it, and splitting it into training and testing sets. This preparatory step ensures that the models are fed data that is both digestible and informative, setting the stage for accurate predictions.\n", + "* Training: This is the center of learning. With the data primed, the BaseHole class moves on to the training phase, where the class uses its classifier to learn from the training data. Through this process, the model uncovers patterns and insights hidden within the data, equipping itself with the knowledge needed to predict SWE with confidence. A skeleton of this class is sketched below.\n" + ] + },
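+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following is a simplified skeleton of the BaseHole lifecycle, assuming a scikit-learn-style regressor and illustrative method signatures; the real SnowCast class adds data loading, cleaning, and feature handling:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import joblib\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "class BaseHole:\n", + " def __init__(self, classifier):\n", + " self.classifier = classifier\n", + "\n", + " def preprocessing(self, features, labels):\n", + " # Split the cleaned data into training and testing sets\n", + " self.train_x, self.test_x, self.train_y, self.test_y = train_test_split(\n", + " features, labels, test_size=0.2, random_state=42)\n", + "\n", + " def train(self):\n", + " # Fit the underlying model on the training split\n", + " self.classifier.fit(self.train_x, self.train_y)\n", + "\n", + " def test(self):\n", + " # Predict on the held-out split (examined in detail below)\n", + " self.test_y_results = self.classifier.predict(self.test_x)\n", + " return self.test_y_results\n", + "\n", + " def save(self, path):\n", + " # Persist the trained model for the prediction chapter\n", + " joblib.dump(self.classifier, path)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2B9qmiO7r5N9" + }, + "source": [ + "Now comes the part that is one of the main focuses of this chapter: **Testing**\n", + "\n", + "Within the extensive array of functionalities provided by the BaseHole class, the testing process is akin to a rigorous examination.\n", + "Unveiling the test function:\n", + "\n", + "So, what is a test?\n", + "\n", + "The test function operates on a simple yet profound principle: it utilizes the model to predict outcomes based on the test dataset. 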
By invoking the classifier's prediction method, the BaseHole class applies the trained model to the test data to forecast SWE values with precision.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "sxGkvT-ysDR2" + }, + "outputs": [], + "source": [ + "def test(self):\n", + " '''\n", + " Tests the machine learning model on the testing data.\n", + " Returns: numpy.ndarray: The predicted results on the testing data.\n", + " '''\n", + " self.test_y_results = self.classifier.predict(self.test_x)\n", + " return self.test_y_results" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qpzGEyDQsFaZ" + }, + "source": [ + "**The Mechanics of Testing**\n", + "\n", + "At its core, the test function embodies the essence of machine learning validation. It executes the trained model's prediction method on the test_x dataset—a collection of features that the model has not encountered during its training phase. The function then returns the predicted SWE values, encapsulated within test_y_results, offering a glimpse into the model's predictive accuracy and reliability." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-0nRqBFQsICP" + }, + "source": [ + "# Validation/Evaluation" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "toh-cEggsKZ0" + }, + "source": [ + "So now we have made a model, trained the model, and made predictions on a test dataset, but how do we evaluate all of this? For this, we use multiple evaluation metrics. A model needs to go through a rigorous validation process that assesses its effectiveness and accuracy. Evaluation is a testament to the project's commitment to precision, ensuring that the predictions made are not only reliable but also meaningful." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "unwqjmuhsMsc" + }, + "source": [ + "For this project we have a comprehensive suite of metrics -- Mean Absolute Error (MAE), Mean Squared Error (MSE), R-squared (R2), and Root Mean Squared Error (RMSE). Each metric offers a unique lens through which the model's performance can be scrutinized, from the average error per prediction (MAE) to the proportion of variance explained (R2)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dpbM36bgsRCB" + }, + "source": [ + "**Insights**\n", + "\n", + "Upon invoking the evaluation method, the class starts a detailed analysis of the model's predictions. By comparing these predictions against actual values from the test dataset, the method illuminates the model's strengths and areas for improvement.\n", + "\n", + "The output—a dictionary of metrics—serves as a beacon, guiding further refinement and optimization of the model." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-Jwo8aLhsTEp" + }, + "source": [ + "**The Testament of Metrics**\n", + "\n", + "* MAE: This metric provides an average of the absolute errors between predicted and actual values, offering a straightforward measure of prediction accuracy.\n", + "* MSE: By squaring the errors before averaging, MSE penalizes larger errors more heavily, providing insight into the variance of the model's predictions.\n", + "* R2: The R2 score reveals how well the model's predictions conform to the actual data, serving as a gauge of the model's explanatory power.\n", + "* RMSE: As the square root of MSE, RMSE offers a measure of error in the same units as the predicted value, making it intuitively interpretable."
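+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For reference, these four metrics can be written out explicitly, where $y_i$ is the observed SWE value, $\\hat{y}_i$ the predicted value, $\\bar{y}$ the mean of the observed values, and $n$ the number of test samples:\n", + "\n", + "$$\\mathrm{MAE} = \\frac{1}{n}\\sum_{i=1}^{n} \\lvert y_i - \\hat{y}_i \\rvert \\qquad \\mathrm{MSE} = \\frac{1}{n}\\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2$$\n", + "\n", + "$$R^2 = 1 - \\frac{\\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2}{\\sum_{i=1}^{n} (y_i - \\bar{y})^2} \\qquad \\mathrm{RMSE} = \\sqrt{\\mathrm{MSE}}$$"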
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PdYjCmmbsVzW" + }, + "source": [ + "# The Evaluation Process" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IlRUU0lzsW4r" + }, + "source": [ + "Upon invocation, the evaluate method undertakes the task of computing these metrics, using the predictions generated by the trained model (self.test_y_results) and comparing them against the actual values (self.test_y) from the test dataset. This comparison is the crux of the evaluation, offering a window into the model's predictive capabilities." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5OIcffx3sZAz" + }, + "outputs": [], + "source": [ + "import math\n", + "from sklearn import metrics\n", + "\n", + "def evaluate(self):\n", + " \"\"\"\n", + " Evaluate the trained model on the testing data.\n", + " Returns: dict: MAE, MSE, R2, and RMSE scores for the testing set.\n", + " \"\"\"\n", + " y_predicted = self.test_y_results # predictions produced by test()\n", + " y_test = self.test_y # actual SWE values from the test split\n", + "\n", + " mae = metrics.mean_absolute_error(y_test, y_predicted)\n", + " mse = metrics.mean_squared_error(y_test, y_predicted)\n", + " r2 = metrics.r2_score(y_test, y_predicted)\n", + " rmse = math.sqrt(mse)\n", + "\n", + " print(\"The {} model performance for testing set\".format(type(self).__name__))\n", + " print(\"--------------------------------------\")\n", + " print('MAE is {}'.format(mae))\n", + " print('MSE is {}'.format(mse))\n", + " print('R2 score is {}'.format(r2))\n", + " print('RMSE is {}'.format(rmse))\n", + "\n", + " return {'MAE': mae, 'MSE': mse, 'R2': r2, 'RMSE': rmse}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RjgBbKVVsbKF" + }, + "source": [ + "**Computing the Metrics:** Leveraging the metrics module from scikit-learn, the function calculates MAE, MSE, R2, and RMSE. Each of these calculations provides a different lens through which to view the model's performance, from average error rates (MAE, RMSE) to the model's explanatory power (R2) and the variance of its predictions (MSE).\n", + "\n", + "**Interpreting the Results:** The function not only computes these metrics but also prints them out, offering immediate insight into the model's efficacy. 
This step is vital for iterative model improvement, allowing data scientists to diagnose and address specific areas where the model may fall short.\n", + "\n", + "**Returning the Metrics:** Finally, the function encapsulates these metrics in a dictionary and returns it. This encapsulation allows for the metrics to be easily accessed, shared, and utilized in further analyses or reports, facilitating a deeper understanding of the model's impact and areas for enhancement.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s3zrQxxUsmOK" + }, + "source": [ + "![](../img/Validation.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C1pxcIYgsvKD" + }, + "source": [ + "All the necessary functions are called in the model_train_validate process." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "D2hKsntLszCu" + }, + "outputs": [], + "source": [ + "def main():\n", + " print(\"Train Models\")\n", + " # Choose the machine learning models to train (e.g., RandomForestHole, XGBoostHole, ETHole)\n", + " worm_holes = [ETHole()]\n", + " for hole in worm_holes:\n", + " # Perform preprocessing for the selected model\n", + " hole.preprocessing()\n", + " print(hole.train_x.shape)\n", + " print(hole.train_y.shape)\n", + " # Train the machine learning model\n", + " hole.train()\n", + " # Test the trained model\n", + " hole.test()\n", + " # Evaluate the model's performance\n", + " hole.evaluate()\n", + " # Save the trained model\n", + " hole.save()\n", + " print(\"Finished training and validating all the models.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JJDaq-uls1AR" + }, + "source": [ + "In conclusion, testing and validation form the bedrock of predictive excellence in the SnowCast project. They are not merely steps in the machine learning workflow but are the very processes that ensure the models we build are not just algorithms but are reliable interpreters of the natural world."
+ ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + } }, - { - "cell_type": "code", - "outputs": [], - "source": [], - "metadata": { - "collapsed": false - }, - "id": "898cb48fdc59bf41" - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 2 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython2", - "version": "2.7.6" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/book/img/Pred_Delivery.png b/book/img/Pred_Delivery.png new file mode 100644 index 0000000..a6efef3 Binary files /dev/null and b/book/img/Pred_Delivery.png differ diff --git a/book/img/Refresh_pred.png b/book/img/Refresh_pred.png new file mode 100644 index 0000000..216c48c Binary files /dev/null and b/book/img/Refresh_pred.png differ diff --git a/book/img/SWE_Map.png b/book/img/SWE_Map.png new file mode 100644 index 0000000..ae6e130 Binary files /dev/null and b/book/img/SWE_Map.png differ diff --git a/book/img/Validation.png b/book/img/Validation.png new file mode 100644 index 0000000..da5099b Binary files /dev/null and b/book/img/Validation.png differ