Analytic code for the paper: Predicting the longevity of resources shared in scientific publications
- Preliminary exploration: Begin with a baseline model using Logistic Regression to understand the initial performance on the dataset.
- Feature selection: Apply Lasso regression for feature selection, which identifies the most impactful features by enforcing sparsity.
- Classic ML model exploration: Test Elastic Net regression, which combines L1 (Lasso) and L2 (Ridge) regularization, and compare its performance to the baseline.
- Experiment with the Tobit model: Use the built-in Tobit model in R to handle censored data and explore how it fits the dataset.
- Data preprocessing: Rebuild the dataset using the scripts provided in the data folder, ensuring that it is formatted correctly for subsequent experiments.
- Feature importance analysis (Random Forest): Run a Random Forest classifier to evaluate feature importance and compare the results with those obtained from the regression models.
- Feature importance analysis (Lasso regression): Re-run Lasso regression to calibrate feature importance based on the penalized regression method.
- Performance evaluation (Elastic Net): Evaluate the Elastic Net regression by computing its R-squared value, which measures how well the model explains the variance in the dataset.
- Performance evaluation (Tobit model with Elastic Net enhancement): Introduce Elastic Net regularization into the Tobit model, combining censoring handling with regularization (a minimal sketch of this combination follows the list).
- Hyperparameter tuning of the Tobit model: Perform hyperparameter tuning on the Tobit model to optimize its performance.
- In-depth exploration of the Tobit model: Conduct a detailed exploration of the Tobit model to understand its behavior and limitations on the dataset.
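The Tobit-plus-Elastic-Net step is the least standard part of this pipeline, so a sketch may help. The repository's actual implementation is in R (see the script listing below); what follows is an illustrative Python version only, assuming alive URLs are right-censored observations of longevity, with hypothetical data and names throughout. Because the L1 term is non-smooth, a quasi-Newton solver is only adequate for a sketch; a production implementation would use a proximal method.

```python
import numpy as np
from scipy import optimize, stats

def tobit_enet_nll(params, X, y, censored, lam=0.1, alpha=0.5):
    """Negative log-likelihood of a Tobit model with an Elastic Net penalty.

    censored[i] is True when y[i] is right-censored, e.g. a URL that was
    still alive at observation time, so we only know longevity > y[i].
    """
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)            # optimize log(sigma) so sigma > 0
    z = (y - X @ beta) / sigma
    ll_dead = stats.norm.logpdf(z[~censored]) - np.log(sigma)  # density term
    ll_alive = stats.norm.logsf(z[censored])                   # P(Y > y) term
    penalty = lam * (alpha * np.abs(beta).sum()
                     + 0.5 * (1 - alpha) * (beta ** 2).sum())
    return -(ll_dead.sum() + ll_alive.sum()) + penalty

# Hypothetical data: the top 30% of longevities are treated as still alive,
# so only their censoring time is observed.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -0.5, 0.0, 0.0, 2.0]) + rng.normal(size=200)
cutoff = np.quantile(y, 0.7)
censored = y > cutoff
y = np.minimum(y, cutoff)

x0 = np.zeros(X.shape[1] + 1)            # coefficients plus log(sigma)
res = optimize.minimize(tobit_enet_nll, x0, args=(X, y, censored),
                        method="L-BFGS-B")
print(res.x[:-1])                        # penalized coefficient estimates
```

Setting lam=0 recovers the plain Tobit fit, which makes it easy to compare the regularized and unregularized variants on the same data.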
This section is a guide to navigating the project's components and gives an overview of its code organization. It explains the purpose of the scripts in each sub-folder, covering data migration, modeling techniques (Lasso, Logistic, and Tobit regression), and the analyses conducted for this study.
Scripts utilized during the preliminary exploration phase.
- Scripts for migrating data from MongoDB to Spark (a minimal migration sketch follows this list).
- Modeling scripts for Lasso, Logistic, and Elastic Net regression.
- Modeling scripts for Tobit censored regression. The script in the archived/tobit/spark folder is a Spark implementation of Tobit regression, but it is currently non-functional; please do not spend time on it.
- R implementation of Tobit censored regression.
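The migration scripts themselves live in the archived folder; as a reference point, here is a minimal PySpark sketch of pulling a MongoDB collection into Spark, assuming the MongoDB Spark Connector 10.x is on the classpath and using placeholder URI, database, and collection values.

```python
from pyspark.sql import SparkSession

# Minimal sketch: read one MongoDB collection into a Spark DataFrame and
# persist it as Parquet. The URI, database, and collection are placeholders;
# launch with the connector on the classpath, e.g.
#   spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.12:10.3.0 ...
spark = (SparkSession.builder
         .appName("mongo-to-spark-migration")
         .config("spark.mongodb.read.connection.uri", "mongodb://localhost:27017")
         .getOrCreate())

urls = (spark.read.format("mongodb")
        .option("database", "longevity")    # hypothetical database name
        .option("collection", "urls")       # hypothetical collection name
        .load())

urls.write.mode("overwrite").parquet("data/urls.parquet")
```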
Scripts utilized in the paper.
- iconference_followup_study/data/datacheck_data.ipynb: To assess the ratio of alive to dead URLs.
- iconference_followup_study/data/data_proportion_inspection.ipynb: To analyze the proportion of alive to dead URLs in a spreadsheet format.
- iconference_followup_study/data/data_truncated.ipynb: To generate a dataset containing only truncated records (selecting for dead URLs).
- iconference_followup_study/data/data_untruncated.ipynb: To generate a dataset that includes both truncated and untruncated records (selecting for alive and dead URLs). The categorical ordinal feature charset is encoded with a frequency indexer, and the script standardizes the entire dataset (see the encoding sketch after this list).
- iconference_followup_study/data/data_untruncated_charset_viz.ipynb: In addition to the previous dataset, this script selects the raw URLs along with other variables to analyze the polarity of URL longevity.
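Several notebooks above encode charset with a frequency indexer and then standardize the data. A minimal pandas/scikit-learn sketch of that preprocessing (whether the notebooks index by raw frequency or by frequency rank is an implementation detail; this sketch uses relative frequencies, and all column names are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical frame: one row per URL, with the categorical charset feature.
df = pd.DataFrame({
    "charset": ["utf-8", "utf-8", "iso-8859-1", "utf-8", "gb2312"],
    "path_depth": [1, 3, 2, 5, 2],
})

# Frequency indexing: replace each charset value with its relative frequency.
freq = df["charset"].value_counts(normalize=True)
df["charset"] = df["charset"].map(freq)

# Standardize the now fully numeric dataset to zero mean and unit variance.
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
print(scaled)
```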
- iconference_followup_study/lasso/lasso_truncated.ipynb: To build a Lasso model using only truncated data (dead URLs).
- iconference_followup_study/lasso/lasso_untruncated.ipynb: To build a Lasso model using both truncated and untruncated data (all URLs). The categorical ordinal feature charset is encoded with a frequency indexer, and the script standardizes the entire dataset.
- iconference_followup_study/lasso/lasso_untruncated_cleaned.ipynb: To build a Lasso model using both truncated and untruncated data (all URLs). The categorical ordinal feature charset is encoded with a frequency indexer, and the script standardizes the entire dataset.
- iconference_followup_study/elastic_net/elastic_net_untruncated_cleaned.ipynb: To build an Elastic Net model using both truncated and untruncated data (all URLs). The categorical ordinal feature charset is encoded with a frequency indexer, and the script standardizes the entire dataset.
- iconference_followup_study/elastic_net/lib: The dependency library for the Elastic Net analysis.
- iconference_followup_study/elastic_net/elastic_net_truncated.ipynb: To build an Elastic Net model using only truncated data (dead URLs).
- iconference_followup_study/elastic_net/elastic_net_untruncated.ipynb: To build an Elastic Net model using both truncated and untruncated data (all URLs). The categorical ordinal feature charset is encoded with a frequency indexer. Because the script uses the Plotly package for data visualization, its output can be quite large.
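For orientation, a compact scikit-learn sketch of the Lasso feature selection and the Elastic Net R-squared evaluation these notebooks perform, on hypothetical data (the notebooks' actual tooling may differ):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the URL-longevity regression task.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Lasso for feature selection: features with nonzero coefficients survive.
lasso = LassoCV(cv=5).fit(X_tr, y_tr)
print("selected features:", np.flatnonzero(lasso.coef_))

# Elastic Net (L1 + L2) with a cross-validated mixing ratio, scored by R^2.
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X_tr, y_tr)
print("held-out R^2:", r2_score(y_te, enet.predict(X_te)))
```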
- iconference_followup_study/random_forest/best_lambda_scrutinize.ipynb: To fine-tune the hyperparameters of the Random Forest model.
- iconference_followup_study/random_forest/best_lambda_summary.ipynb: To report the performance of the Random Forest model with its best hyperparameters.
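A minimal sketch of the hyperparameter search and best-model report pattern these two notebooks implement, using scikit-learn's GridSearchCV on hypothetical data (the notebooks' actual grids and tooling may differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in for the alive/dead URL classification task.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=5, scoring="f1",
)
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("cross-validated F1:", search.best_score_)
# Feature importances from the refit best model, for the importance analysis.
print(search.best_estimator_.feature_importances_)
```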
- iconference_followup_study/tobit/eda: Scripts for exploratory data analysis (EDA).
- iconference_followup_study/tobit/grid_search: Scripts for searching for optimal hyperparameters (a minimal grid-search sketch follows).
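The grid-search scripts sweep the Tobit hyperparameters; a minimal sketch of that pattern, reusing the hypothetical tobit_enet_nll, X, y, and censored from the Tobit sketch earlier in this document, and scoring each (lambda, alpha) pair by the unpenalized negative log-likelihood on a held-out slice:

```python
from itertools import product

import numpy as np
from scipy import optimize

# Hypothetical hyperparameter sweep for the penalized Tobit sketch above.
tr, va = slice(0, 150), slice(150, 200)       # simple train/validation split
best = None
for lam, alpha in product([0.01, 0.1, 1.0], [0.25, 0.5, 0.75]):
    x0 = np.zeros(X.shape[1] + 1)
    fit = optimize.minimize(tobit_enet_nll, x0,
                            args=(X[tr], y[tr], censored[tr], lam, alpha),
                            method="L-BFGS-B")
    score = tobit_enet_nll(fit.x, X[va], y[va], censored[va], lam=0.0)
    if best is None or score < best[0]:
        best = (score, lam, alpha)
print("best (lambda, alpha):", best[1:])
```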