"definitely necessary to bring out emacs and modify that perl script"
This Perl script is a simple "polite" web crawler designed to traverse and gather data from a specified starting webpage.
It identifies image URLs and their associated names from each page it visits and records this information, along with the page title and the timestamp of the scraping event, into a CSV file named scraped_data.csv.
The script then uses a PageRank-like method to find all pages linked from the current page and adds them to a queue for future visits.
Additionally, it logs the URLs of visited and queued sites, along with some additional metadata, to visited_sites.csv and queued_sites.csv respectively.
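The per-page record described above can be sketched as follows. This is a minimal illustration using only core modules, not the script's actual code: the helper names (extract_images, extract_title, csv_line, image_rows), the column order, and the regex-based HTML matching are all assumptions. The dependency list suggests the real script relies on Text::CSV instead of a hand-rolled quoting routine.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: collect { url, name } pairs from <img> tags.
# A regex is enough for a sketch; real-world HTML calls for a proper parser.
sub extract_images {
    my ($html) = @_;
    my @images;
    while ($html =~ /<img\b([^>]*)>/gi) {
        my $attrs = $1;
        my ($src) = $attrs =~ /\bsrc\s*=\s*["']([^"']*)["']/i;
        my ($alt) = $attrs =~ /\balt\s*=\s*["']([^"']*)["']/i;
        next unless defined $src;            # skip images without a source
        push @images, { url => $src, name => $alt // '' };
    }
    return @images;
}

# Hypothetical helper: return the trimmed <title> text, or '' if absent.
sub extract_title {
    my ($html) = @_;
    my ($title) = $html =~ m{<title[^>]*>\s*(.*?)\s*</title>}is;
    return $title // '';
}

# Quote every field and double embedded quotes (basic CSV escaping).
sub csv_line {
    return join(',', map { my $f = $_; $f =~ s/"/""/g; qq("$f") } @_);
}

# One CSV line per image: image URL, image name, page title, UTC timestamp.
# The column order is an assumption.
sub image_rows {
    my ($html) = @_;
    my $title = extract_title($html);
    my $stamp = strftime('%Y-%m-%dT%H:%M:%SZ', gmtime());
    return map { csv_line($_->{url}, $_->{name}, $title, $stamp) }
           extract_images($html);
}

my $html = '<html><head><title>Demo</title></head>'
         . '<body><img src="/logo.png" alt="Logo"></body></html>';
print "$_\n" for image_rows($html);
```

In a real run the lines would be appended to scraped_data.csv rather than printed.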
Tip
The script can be limited to visiting a maximum number of URLs by changing the value of the $limit variable. It also maintains a blacklist of keywords to avoid queuing certain types of URLs.
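As a rough sketch of how such a limit and blacklist might be applied: the keyword list, the value of $limit, and the helper names is_blacklisted and limit_reached below are hypothetical, and the case-insensitive substring match is an assumption about how keywords are compared.

```perl
use strict;
use warnings;

my $limit     = 100;                     # assumed maximum number of URLs to visit
my @blacklist = qw(login logout .pdf);   # hypothetical keywords

# True if the URL contains any blacklisted keyword (case-insensitive).
sub is_blacklisted {
    my ($url, $keywords) = @_;
    my $lower = lc $url;
    for my $keyword (@$keywords) {
        return 1 if index($lower, lc $keyword) >= 0;
    }
    return 0;
}

# True once the number of visited URLs has reached the limit.
sub limit_reached {
    my ($visited_count, $max) = @_;
    return $visited_count >= $max ? 1 : 0;
}

my @candidates = (
    'https://example.com/about',
    'https://example.com/login?next=/',
    'https://example.com/file.PDF',
);

# Only URLs that pass the blacklist are queued for a later visit.
my @queue = grep { !is_blacklisted($_, \@blacklist) } @candidates;
print "queued: @queue\n";
print "stop crawling\n" if limit_reached(scalar(@queue), $limit);
```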
Warning
Currently, the script does not implement any form of prioritization of URLs.
1. Scheduler: The Scheduler.pl script creates multiple worker threads, each of which runs the run subroutine from Runner.pl, so that web pages are downloaded in parallel.
2. Queue: The SharedData.pm package provides a shared queue (@SharedData::shared_queue) of URLs to be downloaded. Each worker thread adds its seed page to the queue and then runs the run subroutine, which presumably removes URLs from the queue as it processes them.
3. Multi-threaded runner: The Runner.pl script contains the main loop of the crawler, which fetches URLs from the queue, downloads them, and enqueues new URLs found on each page.
4. Storage: The Connector.pl script persists the scraped data to a remote MongoDB collection.
Note
It is up to the run subroutine (3.) to write the downloaded pages to the temporary data directory before they are stored persistently (4.).
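The interplay of scheduler, shared queue, and runner can be illustrated with a small self-contained sketch. It is not the code from Scheduler.pl, SharedData.pm, or Runner.pl: the variable and subroutine names other than run, the hard-coded link graph that stands in for real downloads, and the termination logic are all assumptions. It requires a Perl built with thread support.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# Illustrative shared state; @shared_queue plays the role of
# @SharedData::shared_queue.
my @shared_queue :shared;
my %seen         :shared;       # URLs that were ever queued
my @visited      :shared;       # URLs that were processed
my $active       :shared = 0;   # workers currently processing a URL

# Hypothetical link graph replacing real page downloads.
my %links = (
    'https://example.com/'  => ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a' => ['https://example.com/b', 'https://example.com/c'],
    'https://example.com/b' => ['https://example.com/'],
    'https://example.com/c' => [],
);

# Add unseen URLs to the queue and wake up waiting workers.
sub enqueue {
    my @urls = @_;
    lock(@shared_queue);
    for my $url (@urls) {
        next if $seen{$url}++;
        push @shared_queue, $url;
    }
    cond_broadcast(@shared_queue);
}

# Return the next URL, or undef once the queue is empty and no worker is busy.
sub next_url {
    lock(@shared_queue);
    while (1) {
        if (@shared_queue) {
            $active++;
            return shift @shared_queue;
        }
        return if $active == 0;
        cond_wait(@shared_queue);   # another worker may still add URLs
    }
}

# Main loop: take a URL, "download" it, enqueue the links found on it.
sub run {
    while (defined(my $url = next_url())) {
        my @found = @{ $links{$url} || [] };
        enqueue(@found);
        lock(@shared_queue);
        push @visited, $url;
        $active--;
        cond_broadcast(@shared_queue);
    }
    return;
}

# Each worker adds its seed page to the queue and then enters the main loop.
sub worker {
    my ($seed) = @_;
    enqueue($seed);
    run();
    return;
}

my @seeds   = ('https://example.com/', 'https://example.com/c');
my @workers = map { threads->create(\&worker, $_) } @seeds;
$_->join for @workers;
print "visited: ", join(', ', sort @visited), "\n";
```

The $active counter is one way to let idle workers distinguish "queue temporarily empty" from "crawl finished"; without it, a worker could exit while another is about to enqueue new links.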
- Install the necessary dependencies:
cpan Text::CSV
cpan LWP::Protocol::https
cpan MongoDB::MongoClient
cpan Config::Simple
Note
You might have to run those commands as a user with admin privileges (e.g., sudo cpan Text::CSV)
- Run the script:
perl downloader.pl
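Before running the crawler, it can help to confirm that the modules listed in the install step are actually loadable. The following is a small optional sketch; the missing_modules helper is hypothetical and not part of the project.

```perl
use strict;
use warnings;

# Return the names of the modules that cannot be loaded.
sub missing_modules {
    my @missing;
    for my $module (@_) {
        (my $file = "$module.pm") =~ s{::}{/}g;   # Text::CSV -> Text/CSV.pm
        push @missing, $module unless eval { require $file; 1 };
    }
    return @missing;
}

my @required = qw(Text::CSV LWP::Protocol::https MongoDB::MongoClient Config::Simple);
my @missing  = missing_modules(@required);
print @missing ? "Missing: @missing\n" : "All dependencies found\n";
```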
© Carlo Bortolan
Carlo Bortolan · GitHub carlobortolan