This project, and the technical knowledge and skills used for the analysis, are in large part due to the guidance and direction of Khaled Karmen and the instructional team of the UC Berkeley Data Analytics & Visualization Cohort.
Data Sources:
The ultimate goal of this analysis is to determine the underlying factors behind the recent phenomenon dubbed “The Great Resignation.” Our team settled on this topic after our first meeting: it is intriguing, well suited to the project, and reflects the overall sentiment each of us has felt as we progressed through our careers. Workplace cultures and the willingness to submit to old-school working conditions appear to have shifted in recent years, and our goal is to find out why.
We believe that younger generations are placing more emphasis on work/life balance, compensation transparency, remote work, and company culture. Our goal for this project is to find, process, and analyze the available data to see what has caused this shift in attitudes toward working conditions and employment. Specifically, we set out to answer the following questions:
- What are the contributing factors to individuals resigning from their jobs?
- Is there a correlation between age (retirement, etc.), job satisfaction, rate of pay, and resignation?
- What will future U.S. economic trends look like?
- What industries will be most impacted in the future by resignations?
Data Cleaning/ETL: Python, Pandas, Jupyter Notebook, CSV files
Database: SQL, PostgreSQL
Machine Learning: Python, Skforecast
Visualization and Presentation: Tableau, Matplotlib, Google Slides
One of the biggest challenges we faced in week one was gathering usable data that supported the project. Before starting, we were optimistic that there would be an abundance of data to choose from that would support our initial thoughts on the subject. We turned out to be wrong and realized that this part of the project would take more time than we had projected. Thankfully, we discovered these datasets containing (describe).
- Reduced the dataset to include only the selected features used for time series forecasting (a condensed sketch of the cleaning steps follows this list).
- Counted the number of null values in the raw dataset and dropped them.
- Deleted the GeoFips and LineCode columns, as they are references to another directory on the BEA website.
- Dropped rows where the Year columns contained "(D)" values, working on a duplicate DataFrame so as not to interfere with the original.
- Removed "E" values from the Year columns, replacing them in the DataFrame with empty values.
- Dropped rows where the Year columns contained 0 values.
- Converted the Year columns to integers for machine learning modeling.
- Created a dictionary to hold the description names, as well as a new dictionary with the additional string "- " to reformat the column, so that descriptions representing the total sum of a particular sector or industry, which would skew the data and model, could be filtered out.
- Created a for loop in which any row whose Description column contained the added partial string "- " had its value set to 0 so it could be filtered out.
- Used Python's zip() function for parallel iteration to build a new DataFrame without the sector/industry total descriptions.
- Created a dictionary to hold the description names, as well as a new dictionary with the additional string "- " to reformat the column, so that the data could be filtered down to only the six industries used in the model.
- Created a for loop in which any row whose Description column contained the added partial string "- " had its value set to 0 so it could be filtered out.
- Used Python's zip() function for parallel iteration to build a new DataFrame containing the six desired industries for forecasting.
- Split the GeoName column into Metropolitan and Nonmetropolitan portions.
- Reordered the columns in the new DataFrame.
- Label encoded the GeoName column, with 0 for the Metropolitan portion and 1 for the Nonmetropolitan portion.
- Created a new DataFrame with the pandas iterrows() method to generate an iterator over the DataFrame, in this case generating a Year column for time series forecasting.
- Converted the new Year column to pandas datetime objects.
- Label encoded the State column.
- Label encoded the Description column.
- Reindexed the DataFrame so that the Year column becomes the index.
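The sketch below condenses these cleaning steps into one pass. It is not the project's exact notebook code: the file name, the six industry names, and the "value"/"Portion" column names are placeholders, and it uses melt() to build the long-format Year column rather than the iterrows() approach described above.

```python
import pandas as pd

# Load the raw BEA export (hypothetical file name)
raw_df = pd.read_csv("bea_raw.csv")

# Count the null values, then drop them on a copy so the original stays intact
print(raw_df.isnull().sum())
clean_df = raw_df.dropna().copy()

# Drop the columns that only reference another directory on the BEA website
clean_df = clean_df.drop(columns=["GeoFips", "LineCode"])

# Remove suppressed "(D)", estimated "(E)", and zero entries from the year
# columns, then cast the remaining values to integers for modeling
year_cols = [c for c in clean_df.columns if str(c).isdigit()]
for col in year_cols:
    clean_df = clean_df[~clean_df[col].astype(str).isin(["(D)", "(E)", "0"])]
    clean_df[col] = clean_df[col].astype(float).astype(int)

# Keep only the six industries chosen for forecasting (placeholder names here),
# which also drops the sector/industry totals that would skew the model
industries = ["Industry 1", "Industry 2", "Industry 3",
              "Industry 4", "Industry 5", "Industry 6"]
clean_df = clean_df[clean_df["Description"].isin(industries)]

# Split GeoName into a state name and a Metropolitan (0) / Nonmetropolitan (1)
# portion, then label encode the State and Description columns
clean_df["State"] = clean_df["GeoName"].str.split(" (", regex=False).str[0]
clean_df["Portion"] = clean_df["GeoName"].str.contains("Nonmetropolitan").astype(int)
clean_df["State"] = clean_df["State"].astype("category").cat.codes
clean_df["Description"] = clean_df["Description"].astype("category").cat.codes

# Reshape from wide (one column per year) to long (a single Year column) so the
# data can be indexed by Year for time series forecasting
model_df = clean_df.melt(
    id_vars=["State", "Description", "Portion"],
    value_vars=year_cols,
    var_name="Year",
    value_name="value",
)
model_df["Year"] = model_df["Year"].astype(int)
```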
During the implementation and exploratory stage of our machine learning work, we initially planned to use regression models, but it soon became apparent that the question we hoped to answer was a time series problem. A time series is a succession of chronologically ordered data points spaced at equal or unequal intervals. Forecasting consists of predicting the future values of a time series, either by modeling the series solely on its past behavior or by using other external variables. For this reason, our group shifted its focus to time series forecasting models, in particular to using scikit-learn regression models to perform forecasting on time series.
After further research and consideration, our group decided to use Skforecast, a Python library that eases the use of scikit-learn regressors as multi-step forecasters. It also works with any regressor compatible with the scikit-learn API (pipelines, CatBoost, LightGBM, XGBoost, Ranger, etc.). We decided to use recursive autoregressive forecasting with the ForecasterAutoreg class, which requires a time series index with a set frequency.
One major drawback we discovered when preprocessing the data from the Bureau of Economic Analysis for modeling was that no time series or datetime data was associated with the states or industries. Because the Year index had duplicate values after the original DataFrame was transformed, in order to use Skforecast the frequency of the time series index had to be set in a datetime format, in our case 'AS' (year start), with unique index values.
To bypass the “Cannot reindex from duplicate axis” error, we:
- Created a duplicate DataFrame so as not to interfere with the original.
- Incorporated a time tuple.
- Implemented iterrows() to create a new column called “New_Year”, formatted as YYYY-MM-DD 00:00:00. A millisecond was added for each row so that every timestamp is unique, allowing the column to be reindexed and the data to be split into train and test sets with steps equal to 4 (see the sketch after this list).
- The steps were set to 4 years so that the last 4 years could serve as a test set to evaluate the predictive capacity of the model.
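A minimal sketch of this workaround, assuming a long-format DataFrame named model_df with an integer Year column (as in the cleaning sketch above); the one-millisecond offset per row exists purely to make the index unique:

```python
import pandas as pd

# Work on a copy so the original DataFrame is untouched
ts_df = model_df.copy()

# Build a unique datetime index: start each timestamp at the beginning of the
# year ('AS') and add one millisecond per row so duplicate years become unique
new_year = []
for i, (_, row) in enumerate(ts_df.iterrows()):
    new_year.append(
        pd.Timestamp(year=int(row["Year"]), month=1, day=1)
        + pd.Timedelta(milliseconds=i)
    )
ts_df["New_Year"] = new_year
ts_df = ts_df.set_index("New_Year").sort_index()

# Hold out the last 4 steps (years) as the test set
steps = 4
data_train = ts_df[:-steps]
data_test = ts_df[-steps:]
```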
With the ForecasterAutoreg class, a model was created and trained from a RandomForestRegressor with a time window of 6 lags. Once the model was trained, the test data was predicted 4 years into the future for our chosen industries.
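A sketch of this step, assuming the older Skforecast API that exposes ForecasterAutoreg (more recent releases have reorganized the module layout) and the train/test split from the previous sketch; the "value" target column is an assumed name:

```python
from sklearn.ensemble import RandomForestRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg

# Recursive autoregressive forecaster: a Random Forest regressor with a
# time window of 6 lags
forecaster = ForecasterAutoreg(
    regressor=RandomForestRegressor(random_state=123),
    lags=6,
)

# Train on the target series, then predict 4 steps (years) into the future
forecaster.fit(y=data_train["value"])
predictions = forecaster.predict(steps=4)
```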
In the Prediction error section, the error the model makes on the test set is quantified. The metric used is the mean squared error (MSE), one of the most practical and common loss functions: it is calculated by taking the difference between the model's predictions and the ground truth, squaring it, and averaging it across the whole dataset we used for our project on the Great Resignation. MSE is advantageous for ensuring that a trained model has no outlier predictions with huge errors, since the squaring puts larger weight on those errors. The flip side is that if the model makes even a single bad prediction, the squaring magnifies that error.
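As a concrete reference, the calculation reduces to the snippet below, shown with scikit-learn's mean_squared_error and the variable names assumed in the earlier sketches:

```python
from sklearn.metrics import mean_squared_error

# MSE: average of the squared differences between predictions and ground truth
error_mse = mean_squared_error(
    y_true=data_test["value"],
    y_pred=predictions,
)
print(f"Test error (mse): {error_mse}")
```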
- Because the original dataset collected from the Bureau of Economic Analysis had no native time series data, our error was significantly magnified, with a test MSE of 3545547978.331675.
In the Hyperparameter tuning section, the trained ForecasterAutoreg uses a 6-lag time window and a Random Forest model with default hyperparameters. Since there is no particular reason why these values are the most suitable for our forecasting purposes, we used the Skforecast library's grid_search_forecaster function to identify the best combination of lags and hyperparameters (a sketch of this step follows the bullets below). This function compares the results obtained with each model configuration; in our run it compared 12 models and was fit 244 times.
- Because the index had been manipulated to create the timestamps, it still did not have a frequency and was overwritten with a RangeIndex of step 1. All lag configurations in the results_grid had a high MSE.
- The best results were obtained using a time window of 20 lags and a Random Forest setup of {'max_depth': 3, 'n_estimators': 100}.
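A sketch of the tuning step under the same older Skforecast API. The lag and hyperparameter values that appear above are included in the grids, but the exact grids, the initial_train_size, and the "value" column are assumptions, not the project's recorded configuration:

```python
from skforecast.model_selection import grid_search_forecaster

# Candidate lag windows and Random Forest hyperparameters to compare
lags_grid = [6, 20]
param_grid = {
    "n_estimators": [100, 500],
    "max_depth": [3, 5, 10],
}

results_grid = grid_search_forecaster(
    forecaster=forecaster,
    y=data_train["value"],
    param_grid=param_grid,
    lags_grid=lags_grid,
    steps=4,
    metric="mean_squared_error",
    initial_train_size=int(len(data_train) * 0.5),
    return_best=True,   # refit the forecaster with the best configuration found
    verbose=False,
)
```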
For our final model, a ForecasterAutoreg is trained with the optimal configuration found by validation. It is important to note that this step is not necessary if return_best = True is specified in the grid_search_forecaster function.
- The optimal combination of hyperparameters significantly increased our test error, to Test error (mse): 5931530045.247393, because there is insufficient historical time series data spanning decades.
- From the modeling of the six chosen industries, and given future inputs and features whose key indicators are time series data along with categorical features and scores such as retention levels and years of employment, Skforecast could be used to train, test, and predict for these industries on a yearly as well as a monthly basis.
We ultimately decided that PGAdmin would be a suitable tool for storing our data in a database for this project. Since we would collectively be using fewer than 100,000 rows of data, there was no reason to use Amazon Web Services or any other big data service, for both monetary and practical reasons.
Pictured below is the schema used to establish our tables with primary and foreign keys. Initially, we stored everything as static data just to ensure that we were capable of storing and querying data successfully. This schema, along with manual imports, allowed us to store the static data first. We then focused on making the database synchronous with our Jupyter notebooks and machine learning models.
We were successfully able to connect our machine learning model to our database via Jupyter Notebook using the code pictured below:
In the screenshot above, we import the necessary modules to create our database engine and connect it to the notebook we are using. The to_sql portion of our code in the fourth block establishes the table name, references the engine we created, and replaces the table with new or altered data each time it is run (if applicable). This allows our database to be updated properly should we ever find new data in the future and want to add to our models and database.
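The screenshot itself is not reproduced here; the sketch below shows the general shape of such connection code, assuming a local PostgreSQL database and entirely illustrative credentials, database, and table names, with clean_df standing in for the cleaned DataFrame:

```python
from sqlalchemy import create_engine

# Connection string for a local PostgreSQL database (illustrative credentials)
engine = create_engine(
    "postgresql://postgres:password@localhost:5432/great_resignation"
)

# Write the cleaned DataFrame to a table, replacing it each time this is run
clean_df.to_sql("industry_data", con=engine, if_exists="replace", index=False)
```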
The following blocks demonstrate how we are able to pull data from our Postgres database into a Jupyter notebook and create a pandas DataFrame. There may be other ways to do this, but this was a method we found to be successful after multiple attempts, so we went with it because it works within the timeframe we need.
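A matching sketch of the read path, again with illustrative connection and table names:

```python
import pandas as pd
from sqlalchemy import create_engine

# Same illustrative connection string as above
engine = create_engine(
    "postgresql://postgres:password@localhost:5432/great_resignation"
)

# Pull the table back out of PostgreSQL into a pandas DataFrame
query_df = pd.read_sql("SELECT * FROM industry_data", con=engine)
print(query_df.head())
```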
Ultimately, our database is not very complex, consisting of two main tables that we used for our machine learning models. However, should we ever wish to add to our project, we already have the framework established to do so. Pictured below is the query we used to showcase the most affected industries in descending order:
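That query is only shown as a screenshot in the original report; the sketch below gives a rough sense of what such a ranking query might look like when run through pandas, with entirely hypothetical table and column names:

```python
import pandas as pd
from sqlalchemy import create_engine

# Same illustrative connection string as above
engine = create_engine(
    "postgresql://postgres:password@localhost:5432/great_resignation"
)

# Hypothetical query: rank industries by their totals, from most to least affected
most_affected = pd.read_sql(
    """
    SELECT description, SUM(value) AS total
    FROM industry_data
    GROUP BY description
    ORDER BY total DESC;
    """,
    con=engine,
)
print(most_affected)
```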
As noted above, one of the biggest challenges we faced in week one was gathering usable data that supported the project. We had been optimistic that there would be an abundance of data to choose from that would support our initial thoughts on the subject; we turned out to be wrong and realized that this part of the project would take more time than we had projected. We were confident that once we got the data we needed, the ETL process and the steps thereafter would come along nicely. We also encountered some challenges with the machine learning portion, as noted above.
In conclusion, individuals experiencing attrition are likely to voluntarily leave their current roles. We found that the greatest contributing factors were spending more time away from home due to working overtime and traveling frequently, which contributed to a poor work-life balance. The Great Resignation has in turn led to the Great Upgrade: employees moved into different roles that offered better wages, work-life balance, company culture, and location.
We set out to find a correlation between rates of resignation and factors such as length of commute, age, education level, and marital status. We encountered a few obstacles along the way, including a lack of appropriate data and some challenges with the machine learning portion. In the future, we would research the topic more thoroughly beforehand, avoid making assumptions, and source data that has a time series associated with it.