Crawler Overhaul #114

Open · dipamsen wants to merge 15 commits into main

Conversation

@dipamsen (Contributor) commented Feb 8, 2025

Fixes #91
Fixes #92

Description

  • Add semester to crawled library papers
  • Change parsing (see the sketch after this list):
    • Parse the semester and year from the URL
    • Parse the course code from the first 7 characters of the file name
    • Try to get the course name from courses.json; if not found, use the rest of the file name as the course name
    • If the course code cannot be extracted, or the course name is shorter than 5 characters, add the paper to the database as unapproved
  • Fix documentation for the PostgreSQL database
  • Add Rust script to import library files into the DB
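
A minimal Go sketch of these parsing rules. The function name, the courses map, and the course-code pattern (two letters followed by five digits) are illustrative assumptions, not the crawler's actual API:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Assumed course-code shape (two letters + five digits); the crawler's
// actual regex may differ.
var codeRe = regexp.MustCompile(`^[A-Z]{2}\d{5}$`)

// parseCourse extracts a course code and name from a library file name,
// falling back to courses.json (passed in here as a map) for the name.
func parseCourse(fileName string, courses map[string]string) (code, name string) {
	parts := strings.Split(strings.TrimSuffix(fileName, ".pdf"), "_")
	if len(parts[0]) == 7 && codeRe.MatchString(parts[0]) {
		code = parts[0]
		parts = parts[1:]
	}
	if known, ok := courses[code]; ok {
		return code, known
	}
	name = strings.Join(parts, " ")
	if code == "" || len(name) < 5 {
		// Shorter than 5 characters is unlikely to be a real course name;
		// an empty name makes the paper go in as unapproved.
		name = ""
	}
	return code, name
}

func main() {
	courses := map[string]string{"CS10001": "Programming and Data Structures"}
	code, name := parseCourse("CS10001_Prog_ES_2024.pdf", courses)
	fmt.Println(code, name) // CS10001 Programming and Data Structures
}
```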

vercel bot commented Feb 8, 2025

iqps: ✅ Ready (preview updated Mar 2, 2025, 4:52pm UTC)

Commit: frontend: do not show delete button for library papers
@dipamsen (Contributor, Author) commented Feb 13, 2025

Crawler - go run crawler.go

  1. Crawls the library for all papers for a single year.
  2. For each paper found, extracts its course code, name, year, exam, and semester (see the first message for details).
  3. The generated file name is the original URL path on PEQP, with / replaced by _ (a sketch follows this list).
  4. Downloads all papers into a folder qp/.
  5. Creates a compressed tarball qp.tar.gz, which includes qp.json (paper metadata) and qps/*.pdf (the papers).
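
A sketch of the file-name scheme from step 3; the URL is hypothetical and only illustrates the path-to-name mapping:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

func main() {
	// Hypothetical PEQP paper URL; the real host and path layout may differ.
	u, err := url.Parse("http://peqp.example/2024/End-Spring/CS/CS10001_Programming.pdf")
	if err != nil {
		panic(err)
	}
	// Drop the leading slash, then flatten the path into a file name.
	name := strings.ReplaceAll(strings.TrimPrefix(u.Path, "/"), "/", "_")
	fmt.Println(name) // 2024_End-Spring_CS_CS10001_Programming.pdf
}
```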

Rust script - cargo run --bin import-papers

  1. Expects qp.tar.gz to be present in the backend directory.
  2. Reads qp.json. For each paper:
    • Checks whether a paper with the same course, semester, year, and exam already exists. If so (the decision tree is sketched after this list):
      • If it is a library paper and their file hashes match, skips the paper (duplicate).
      • If it is a library paper and their file hashes don't match, marks it as unapproved (this indicates potentially wrong metadata).
      • If it is a user-uploaded paper, marks this paper as unapproved (the admin will be able to replace the papers).
    • Inserts the paper into the DB, with file name {ID}_{fn}.pdf, where {fn} is the file name generated by the crawler (based on the URL of the paper).
    • Moves the file to the storage location.
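
A sketch of the decision tree from step 2, written in Go for consistency with the crawler examples (the actual script is the Rust import-papers binary); the Paper type and field names are illustrative stand-ins. Like the PR, it effectively considers only the first similar paper (see the "multiple similar papers" thread below):

```go
package main

import "fmt"

// Paper is an illustrative stand-in for the script's internal type.
type Paper struct {
	FromLibrary bool
	FileHash    string
}

// approveStatus decides how an incoming library paper is inserted, given
// papers already in the DB with the same course, semester, year, and exam.
func approveStatus(incoming Paper, existing []Paper) (approved, skip bool) {
	for _, p := range existing {
		if p.FromLibrary && p.FileHash == incoming.FileHash {
			return false, true // same library file already imported: skip as duplicate
		}
		// A library paper with a different hash (possibly wrong metadata),
		// or a clash with a user upload: insert as unapproved for admin review.
		return false, false
	}
	return true, false // no clash: insert as approved
}

func main() {
	incoming := Paper{FromLibrary: true, FileHash: "abc123"}
	existing := []Paper{{FromLibrary: true, FileHash: "abc123"}}
	approved, skip := approveStatus(incoming, existing)
	fmt.Println(approved, skip) // false true: exact duplicate, skipped
}
```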

@dipamsen changed the title from "Improve Crawler" to "Crawler Overhaul" on Feb 21, 2025
.await
.expect("Failed to connect to database");

let mut errored = false;
Member: Can this be done without a mutable variable?

@dipamsen (Author): Not sure (easily); maybe we could print a success message for each paper instead of globally?


if qp.approve_status {
    if let Some(similar) = similar_papers.iter().next() {
        // todo: what if there are multiple similar papers?
Member: Has this been resolved?

@dipamsen (Author): Not sure what to do if there are multiple.

let new_path = env_vars.paths.get_path_from_slug(&file_link_slug);

if let Err(e) = fs::copy(file_path, new_path) {
    eprintln!("Failed to copy file: {}", e);
Member: Can this be logged to a logfile instead? Each restore can create a new logfile.


- if len(name_split[0]) == 7 {
+ if len(name_split[0]) == 7 && re.MatchString(name_split[0]) {
Member: Some old course codes had a length of less than 7, I believe.

@dipamsen (Author): Well, the regex checks for 7-char codes; we'll have to change that too.
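
Purely as a hypothetical illustration of that change (no specific pattern is settled in this thread), the regex could be relaxed alongside the length check, e.g.:

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical relaxation: two letters followed by 3 to 5 digits, so
// older, shorter course codes also match (the len(...) == 7 check would
// need a matching range check).
var codeRe = regexp.MustCompile(`^[A-Z]{2}\d{3,5}$`)

func main() {
	fmt.Println(codeRe.MatchString("CS101"))   // true (5-char code)
	fmt.Println(codeRe.MatchString("CS10001")) // true (7-char code)
}
```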

}
}
name = strings.Join(name_split, " ")
if len(name) < 5 { // assuming course name is at least 5 characters long
Member: Is this required if the next check is done anyway?

@dipamsen (Author): Unsure what "next check" refers to. This check (len(name) < 5) is there because, if the file name on PEQP contains a name shorter than 5 characters, it very likely isn't the course name. If the course also doesn't exist in courses.json, we would end up with the same, likely wrong, name for the course in the DB. Here, setting it to an empty string forces the paper to be added as unapproved.

Yeah, it's a bit hacky; not sure how to make this clearer.

}

let (mut tx, id) = database.insert_new_library_qp(&qp).await?;
let file_name = format!("{}_{}", id, qp.filename);
Member: What is qp.filename usually? Maybe it would be better to use the format used for the uploaded papers instead?

@dipamsen (Author): qp.filename would be

2024_End-Spring_[Department]_[CourseCode]_[CourseName]_ES_2024.pdf

basically just the URL path on PEQP.

Development

Successfully merging this pull request may close these issues:

  • CLI flags for easier crawler handling
  • Make the crawler concurrent