Crawler Overhaul #114

Open · dipamsen wants to merge 15 commits into main

Conversation

@dipamsen (Contributor) commented Feb 8, 2025

Fixes #91
Fixes #92

Description

  • Add semester to crawled library papers
  • Change parsing (see the sketch after this list):
    • Parse the semester and year from the URL
    • Parse the course code from the first 7 characters of the file name
    • Try to get the course name from courses.json; if not found, use the rest of the file name as the course name
    • If the course code cannot be extracted, or the course name is shorter than 5 characters, add the paper to the database as unapproved
  • Fix documentation for the PostgreSQL database
  • Add Rust script to import library files into the DB
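
A minimal Go sketch of these parsing rules. The function name, the courses map, and the course-code pattern (two letters followed by five digits) are illustrative assumptions, not the crawler's actual API:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Assumed course-code shape (two letters + five digits); the crawler's
// actual regex may differ.
var codeRe = regexp.MustCompile(`^[A-Z]{2}\d{5}$`)

// parseCourse extracts a course code and name from a library file name,
// falling back to courses.json (passed in here as a map) for the name.
func parseCourse(fileName string, courses map[string]string) (code, name string) {
	parts := strings.Split(strings.TrimSuffix(fileName, ".pdf"), "_")
	if len(parts[0]) == 7 && codeRe.MatchString(parts[0]) {
		code = parts[0]
		parts = parts[1:]
	}
	if known, ok := courses[code]; ok {
		return code, known
	}
	name = strings.Join(parts, " ")
	if code == "" || len(name) < 5 {
		// Shorter than 5 characters is unlikely to be a real course name;
		// an empty name makes the paper go in as unapproved.
		name = ""
	}
	return code, name
}

func main() {
	courses := map[string]string{"CS10001": "Programming and Data Structures"}
	code, name := parseCourse("CS10001_Prog_ES_2024.pdf", courses)
	fmt.Println(code, name) // CS10001 Programming and Data Structures
}
```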

vercel bot commented Feb 8, 2025

iqps: ✅ Ready (preview updated Mar 2, 2025, 4:52pm UTC)

Commit: frontend: do not show delete button for library papers
@dipamsen (Contributor, Author) commented Feb 13, 2025

Crawler - go run crawler.go

  1. Crawls the library for all papers for a single year.
  2. For each paper found, extracts its course code, name, year, exam, and semester (see the first message for details).
  3. The generated file name is the original URL path on PEQP, with / replaced by _ (a sketch follows this list).
  4. Downloads all papers into a folder qp/.
  5. Creates a compressed tarball qp.tar.gz, which includes qp.json (paper metadata) and qps/*.pdf (the papers).
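
A sketch of the file-name scheme from step 3; the URL is hypothetical and only illustrates the path-to-name mapping:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

func main() {
	// Hypothetical PEQP paper URL; the real host and path layout may differ.
	u, err := url.Parse("http://peqp.example/2024/End-Spring/CS/CS10001_Programming.pdf")
	if err != nil {
		panic(err)
	}
	// Drop the leading slash, then flatten the path into a file name.
	name := strings.ReplaceAll(strings.TrimPrefix(u.Path, "/"), "/", "_")
	fmt.Println(name) // 2024_End-Spring_CS_CS10001_Programming.pdf
}
```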

Rust script - cargo run --bin import-papers

  1. Expects qp.tar.gz to be present in the backend directory.
  2. Reads qp.json. For each paper:
    • Checks whether a paper with the same course, semester, year, and exam already exists. If so (the decision tree is sketched after this list):
      • If it is a library paper and their file hashes match, skips the paper (duplicate).
      • If it is a library paper and their file hashes don't match, marks it as unapproved (this indicates potentially wrong metadata).
      • If it is a user-uploaded paper, marks this paper as unapproved (the admin will be able to replace the papers).
    • Inserts the paper into the DB, with file name {ID}_{fn}.pdf, where {fn} is the file name generated by the crawler (based on the URL of the paper).
    • Moves the file to the storage location.
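
A sketch of the decision tree from step 2, written in Go for consistency with the crawler examples (the actual script is the Rust import-papers binary); the Paper type and field names are illustrative stand-ins. Like the PR, it effectively considers only the first similar paper (see the "multiple similar papers" thread below):

```go
package main

import "fmt"

// Paper is an illustrative stand-in for the script's internal type.
type Paper struct {
	FromLibrary bool
	FileHash    string
}

// approveStatus decides how an incoming library paper is inserted, given
// papers already in the DB with the same course, semester, year, and exam.
func approveStatus(incoming Paper, existing []Paper) (approved, skip bool) {
	for _, p := range existing {
		if p.FromLibrary && p.FileHash == incoming.FileHash {
			return false, true // same library file already imported: skip as duplicate
		}
		// A library paper with a different hash (possibly wrong metadata),
		// or a clash with a user upload: insert as unapproved for admin review.
		return false, false
	}
	return true, false // no clash: insert as approved
}

func main() {
	incoming := Paper{FromLibrary: true, FileHash: "abc123"}
	existing := []Paper{{FromLibrary: true, FileHash: "abc123"}}
	approved, skip := approveStatus(incoming, existing)
	fmt.Println(approved, skip) // false true: exact duplicate, skipped
}
```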

@dipamsen changed the title from "Improve Crawler" to "Crawler Overhaul" on Feb 21, 2025
.await
.expect("Failed to connect to database");

let mut errored = false;
Member: Can this be done without a mutable variable?

@dipamsen (Author): Not sure (easily); maybe we could print a success message for each paper instead of globally?


if qp.approve_status {
    if let Some(similar) = similar_papers.iter().next() {
        // todo: what if there are multiple similar papers?
Member: Has this been resolved?

@dipamsen (Author): Not sure what to do if there are multiple.

let new_path = env_vars.paths.get_path_from_slug(&file_link_slug);

if let Err(e) = fs::copy(file_path, new_path) {
    eprintln!("Failed to copy file: {}", e);
Member: Can this be logged to a logfile instead? Each restore can create a new logfile.


- if len(name_split[0]) == 7 {
+ if len(name_split[0]) == 7 && re.MatchString(name_split[0]) {
Member: Some old course codes had a length of less than 7, I believe.

@dipamsen (Author): Well, the regex checks for 7-char codes; we'll have to change that too.
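
Purely as a hypothetical illustration of that change (no specific pattern is settled in this thread), the regex could be relaxed alongside the length check, e.g.:

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical relaxation: two letters followed by 3 to 5 digits, so
// older, shorter course codes also match (the len(...) == 7 check would
// need a matching range check).
var codeRe = regexp.MustCompile(`^[A-Z]{2}\d{3,5}$`)

func main() {
	fmt.Println(codeRe.MatchString("CS101"))   // true (5-char code)
	fmt.Println(codeRe.MatchString("CS10001")) // true (7-char code)
}
```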

}
}
name = strings.Join(name_split, " ")
if len(name) < 5 { // assuming course name is at least 5 characters long
Member: Is this required if the next check is done anyway?

@dipamsen (Author): Unsure what "next check" refers to. This check (len(name) < 5) is there because, if the file name on PEQP contains a name shorter than 5 characters, it very likely isn't the course name. If the course also doesn't exist in courses.json, we would end up with the same, likely wrong, name for the course in the DB. Here, setting it to an empty string forces the paper to be added as unapproved.

Yeah, it's a bit hacky; not sure how to make this clearer.

}

let (mut tx, id) = database.insert_new_library_qp(&qp).await?;
let file_name = format!("{}_{}", id, qp.filename);
Member: What is qp.filename usually? Maybe it would be better to use the format used for the uploaded papers instead?

@dipamsen (Author): qp.filename would be

2024_End-Spring_[Department]_[CourseCode]_[CourseName]_ES_2024.pdf

basically just the URL path on PEQP.

Development

Successfully merging this pull request may close these issues:

  • CLI flags for easier crawler handling
  • Make the crawler concurrent