Crawler

"definitely necessary to bring out emacs and modify that perl script"

What's this?

This Perl script is a simple "polite" web crawler designed to traverse the web from a specified starting page. On each page it visits, it extracts image URLs and their associated names and records them, together with the page title and the timestamp of the scrape, in a CSV file named scraped_data.csv. It then uses a PageRank-like method to find all pages linked from the current page and adds them to a queue for future visits. Additionally, it logs the URLs of visited and queued sites, along with some additional metadata, to visited_sites.csv and queued_sites.csv respectively.
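
For orientation, the per-page step described above roughly corresponds to the following sketch. This is not the project's actual code: the HTML::TokeParser usage (from the HTML-Parser distribution), the CSV column order, and the seed URL are assumptions.

    # Sketch of one scrape step: fetch a page, collect <img> URLs and names,
    # and append rows to scraped_data.csv (column order is an assumption).
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::TokeParser;      # assumed parser; the real script may use another
    use Text::CSV;
    use POSIX qw(strftime);

    my $url = 'https://example.com/';    # assumed seed page
    my $ua  = LWP::UserAgent->new(agent => 'polite-crawler/0.1');
    my $res = $ua->get($url);
    die 'Fetch failed: ' . $res->status_line unless $res->is_success;

    my $html  = $res->decoded_content;
    my $title = '';
    my $p     = HTML::TokeParser->new(\$html);
    $title = $p->get_trimmed_text('/title') if $p->get_tag('title');

    my $csv = Text::CSV->new({ binary => 1, eol => "\n" });
    open my $fh, '>>', 'scraped_data.csv' or die "scraped_data.csv: $!";

    # Re-scan the page for <img> tags and record one row per image.
    $p = HTML::TokeParser->new(\$html);
    while (my $tag = $p->get_tag('img')) {
        my $src = $tag->[1]{src};
        next unless defined $src;
        my $name = $tag->[1]{alt} || $src;
        $csv->print($fh, [$title, $name, $src,
                          strftime('%Y-%m-%d %H:%M:%S', localtime)]);
    }
    close $fh;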

Tip

The script can be limited to visiting a maximum number of URLs by changing the value of the $limit variable. It also maintains a blacklist of keywords to avoid queuing certain types of URLs.
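
For illustration, a limit and keyword blacklist of this kind are usually wired up along the lines of the sketch below; only $limit is named above, so the blacklist variable and the helper are assumptions.

    # Assumed names: @blacklist and should_queue() are illustrative only.
    my $limit     = 100;                               # stop after this many visited URLs
    my @blacklist = qw(login logout signup .pdf .zip); # assumed keyword list

    sub should_queue {
        my ($url) = @_;
        for my $word (@blacklist) {
            return 0 if index($url, $word) >= 0;       # skip blacklisted URLs
        }
        return 1;
    }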

Warning

Currently, the script does not implement any form of prioritization of URLs.

Architecture

(Architecture diagram: WebCrawlerArchitecture.svg)

  1. Scheduler: The Scheduler.pl script spawns multiple worker threads, each of which runs the run subroutine from Runner.pl, so that several pages can be downloaded in parallel (a minimal sketch of this pattern follows the list).

  2. Queue: The SharedData.pm package provides a shared queue of URLs to be downloaded (@SharedData::shared_queue). Each worker thread adds its seed page to the queue and then calls the run subroutine, which removes URLs from the queue as it processes them.

  3. Multi-threaded runner: The Runner.pl script contains the crawler's main loop, which fetches URLs from the queue, downloads them, and enqueues any new URLs found on each page.

  4. Storage: The Connector.pl script persists the scraped data to a remote MongoDB collection.

Note

It is up to the run subroutine (3.) to write the downloaded pages to the temporary data directory before they are persistently stored (4.).
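
As a rough illustration of the scheduler/queue/runner split described above, the sketch below uses the core threads and Thread::Queue modules in place of @SharedData::shared_queue; the worker count, seed URL, and subroutine body are assumptions rather than the project's actual code.

    # Sketch: a shared queue of URLs consumed by several worker threads.
    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $queue = Thread::Queue->new('https://example.com/');   # assumed seed page

    sub run {
        # dequeue_nb returns undef once the queue is empty, so a worker
        # stops when it finds no more work (the real script also honors $limit).
        while (defined(my $url = $queue->dequeue_nb())) {
            # download $url, write it to the temporary data directory,
            # then enqueue newly discovered links:
            # $queue->enqueue(@new_urls);
        }
    }

    my @workers = map { threads->create(\&run) } 1 .. 4;      # assumed 4 workers
    $_->join for @workers;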

Config

  1. Install the necessary dependencies:

    cpan Text::CSV
    cpan LWP::Protocol::https
    cpan MongoDB::MongoClient
    cpan Config::Simple

Note

You might have to run those commands as a user with admin privileges (e.g., sudo cpan Text::CSV).

  2. Run the script:

    perl downloader.pl
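
For reference, the storage step (Connector.pl) can be pictured as the sketch below, which persists one scraped record with the MongoDB and Config::Simple modules installed above; the config file name, key, database, and collection names are assumptions.

    # Sketch: read the connection string from a config file and insert a record.
    use strict;
    use warnings;
    use MongoDB;
    use Config::Simple;

    my $cfg  = Config::Simple->new('crawler.cfg') or die Config::Simple->error();
    my $host = $cfg->param('mongodb.uri') // 'mongodb://localhost:27017';

    my $client     = MongoDB::MongoClient->new(host => $host);
    my $collection = $client->ns('crawler.scraped_data');   # assumed db.collection

    $collection->insert_one({
        page_title => 'Example Domain',
        image_url  => 'https://example.com/logo.png',
        scraped_at => time(),
    });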

© Carlo Bortolan

Carlo Bortolan  ·  GitHub carlobortolan  ·  contact via [email protected]