Filter Certain URLs from Crawl #165
-
Hi, thanks for the awesome code you have put together. I wanted to know if there is a way to filter out certain URLs from crawling. I am crawling an e-commerce store and it has these ?mycart pages that don't need to be scanned. Because of these pages the crawl takes a long time. At other times, I also don't want to crawl /blog pages. Is there any way of filtering out URLs based on words present in them? Thanks
-
Thanks a lot @VictorAus. I think this would be an important feature and worth implementing, but the details of how it works need to be sorted out. I'll definitely give it some thought. Here are some possible solutions that might help:

- In the upcoming release, there will be an option to include or exclude URLs by query parameter or regex pattern (see the v0.13.0 parameters listed further down in this thread).
- Currently you can use the …
- You might also use list mode if you know the URLs beforehand (for example using …); a short sketch follows below.

These might be some solutions that help in various situations, I hope. Feel free to share ideas on how filtering would work, or how you would like it to behave.
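To illustrate the list-mode suggestion: in advertools, passing `follow_links=False` to `crawl` fetches only the URLs you supply instead of spidering the whole site. The URLs and file name below are placeholders for illustration, not part of the original reply:

```python
import advertools as adv

# List mode: crawl only the URLs supplied, without following any discovered links.
url_list = [
    "https://example.com/product-1",
    "https://example.com/product-2",
    "https://example.com/category/shoes",
]

adv.crawl(
    url_list=url_list,
    output_file="list_mode_crawl.jl",  # crawl output is saved as JSON lines (.jl)
    follow_links=False,                # list mode: fetch only the given URLs
)
```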
-
Thanks for taking the time to reply.
-
For newcomers:

The `advertools.crawl` function has the following parameters, starting at v0.13.0, to help control which links to follow (or not):

- `exclude_url_params`
- `include_url_params`
- `include_url_regex`
- `exclude_url_regex`

Check the function documentation for the full details.
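As a rough sketch of how these parameters could address the original question (the site URL, output file name, and parameter values below are assumptions for illustration):

```python
import advertools as adv

# Spider the site, but skip cart URLs (?mycart=...) and anything under /blog.
adv.crawl(
    url_list="https://example.com/",   # start URL
    output_file="shop_crawl.jl",       # output is saved as JSON lines (.jl)
    follow_links=True,                 # spider mode: discover and follow links
    exclude_url_params=["mycart"],     # don't follow URLs containing the mycart parameter
    exclude_url_regex="/blog",         # don't follow URLs matching this regex
)
```

The `include_*` counterparts work the other way around, for example `include_url_regex` to crawl only a sub-section of the site such as product pages.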