Commit
updated written report
rewrote recommendation system part
candiswu committed Dec 14, 2024
1 parent 49a3274 commit bc43a37
Showing 2 changed files with 3 additions and 2 deletions.
Binary file modified written_report.pdf
Binary file not shown.
5 changes: 3 additions & 2 deletions written_report.qmd
@@ -27,9 +27,10 @@ The first step in building the recommendation system is to preprocess both the d

## Cosine similarity and the Recommendation System

Our recommendation system utilizes a content-based filtering approach to match students (club members) with tailored internship opportunities that align with their unique profiles and interests. The system leverages text preprocessing and TF-IDF vectorization to transform unstructured textual data---like student skills, location preferences, and internship requirements---into numerical representations. Using cosine similarity, the system quantifies how closely a student's profile matches the attributes of a given internship, such as job qualifications, skills, and location. A heuristic score is calculated by combining these similarity measures, where components like location, experience, and skills are weighted equally to produce an overall "match score." To enhance and automate this recommendation process, the labeled heuristic scores serve as the target variable for training a Random Forest Regressor---an ensemble machine learning model that uses multiple decision trees to predict scores for unseen student-internship pairs. Input features, such as `student_id` and `internship_id`, are one-hot encoded to create a format suitable for model training. The trained model generates predictions that rank internships by their relevance to a specific student's profile. This approach focuses on content-based filtering by using features directly related to the content of both student and internship data, ensuring personalized recommendations based solely on the attributes of the user and items.
At first, we attempted to build our recommender system with a machine learning model trained on a heuristic score, which we calculated by combining cosine similarity measures between specific job features such as location, experience, and skills. To calculate the heuristic, we weighted these components on a scale that we thought was reasonable to produce an overall "match score." The labeled heuristic scores then served as the target variable for training a Random Forest Regressor, an ensemble machine learning model that uses multiple decision trees, to predict scores for unseen student-internship pairs. Input features, such as `student_id` and `internship_id`, were one-hot encoded to create a format suitable for model training. The trained model then generated predictions that ranked internships by their relevance to a specific student's profile. Although the model ran, its accuracy was not as reliable as we expected. Because we are still refining the machine learning model, we decided to implement a recommender system that uses only cosine similarity.
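The heuristic "match score" described above can be sketched as a weighted average of per-feature similarities. This is only an illustration: the function name `heuristic_score`, the feature keys, and the weight values are hypothetical, since the report does not state the scale that was actually used.

```python
def heuristic_score(similarities, weights):
    """Combine per-feature cosine similarities into one weighted match score.

    Both arguments are dicts keyed by feature name. The result is a weighted
    average, so it stays in [0, 1] when every similarity is in [0, 1].
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(weights[k] * similarities[k] for k in weights) / total_weight


# Hypothetical weights and similarities for one student-internship pair.
example_weights = {"location": 1.0, "experience": 1.0, "skills": 2.0}
example_sims = {"location": 0.5, "experience": 0.25, "skills": 1.0}
example_score = heuristic_score(example_sims, example_weights)
```

Dividing by the total weight keeps scores comparable if the weights are later retuned.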
Our final model implements content-based filtering to match students (club members) with relevant internship opportunities based on their individual profiles. The system uses text preprocessing and TF-IDF vectorization to transform unstructured textual data (such as student skills, location preferences, and internship requirements) into numerical representations. Using cosine similarity, the system quantifies how closely a student's profile matches the attributes of a given internship, such as job qualifications, skills, and location. We then use the similarity score between each student and internship to make our recommendations. The final output is a ranked table of internship recommendations for a specific student. Each row shows an internship title alongside its similarity score, which quantifies how well the internship matches the student's skills, experiences, and preferences. Higher scores indicate stronger matches. Displaying this list makes the model's recommendations easy to inspect. For instance, the top 10 recommended internships for `student_id = 113` are shown below.

![](images/Screen%20Shot%202024-12-09%20at%201.58.41%20PM.png)
![](images/recommendation_results.png)
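The ranking step can be sketched from scratch as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the helper names (`tokenize`, `tfidf_vectors`, `cosine`, `recommend`), the smoothed IDF formula, and the sample profile and internship texts are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant) so no term gets a zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
            for tokens in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(student_profile, internships, top_n=10):
    """Rank internships (dict title -> description) for one student profile."""
    titles = list(internships)
    vectors = tfidf_vectors([student_profile] + [internships[t] for t in titles])
    student_vec, job_vecs = vectors[0], vectors[1:]
    scored = [(title, cosine(student_vec, vec))
              for title, vec in zip(titles, job_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up sample data, for illustration only.
sample_student = "python sql data analysis remote"
sample_internships = {
    "Data Analyst Intern": "python sql data analysis dashboards",
    "Marketing Intern": "social media content writing campaigns",
    "Software Engineering Intern": "python java backend remote",
}
ranked = recommend(sample_student, sample_internships, top_n=2)
```

Here `ranked` is a list of `(title, score)` pairs sorted from best to worst match, which corresponds to the rows of the table above. An internship that shares no terms with the student profile scores 0.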

# Conclusion
