Filter Certain URLs from Crawl #165
-
Hi, thanks for the awesome code you have put together. I wanted to know if there is a way to filter out certain URLs from crawling. I am crawling an e-commerce store and it has these ?mycart pages that don't need to be scanned. Because of these pages the crawl takes a long time. At other times, I also don't want to crawl /blog pages. Is there any way of filtering out URLs based on words present in them? Thanks
-
Thanks a lot @VictorAus. I think this would be an important feature and worth implementing, but the details of how it works need to be sorted out. I'll definitely give it some thought. Here are some possible solutions that might help:

- In the upcoming release, there will be an option to include or exclude URLs by query parameter or regex pattern (see the v0.13.0 parameters listed further down in this thread).
- Currently you can use the …
- You might also use list mode if you know the URLs beforehand (for example using …); a short sketch follows below.

These might be some solutions that help in various situations, I hope. Feel free to share ideas on how filtering would work, or how you would like it to behave.
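To illustrate the list-mode suggestion: in advertools, passing `follow_links=False` to `crawl` fetches only the URLs you supply instead of spidering the whole site. The URLs and file name below are placeholders for illustration, not part of the original reply:

```python
import advertools as adv

# List mode: crawl only the URLs supplied, without following any discovered links.
url_list = [
    "https://example.com/product-1",
    "https://example.com/product-2",
    "https://example.com/category/shoes",
]

adv.crawl(
    url_list=url_list,
    output_file="list_mode_crawl.jl",  # crawl output is saved as JSON lines (.jl)
    follow_links=False,                # list mode: fetch only the given URLs
)
```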
-
Thanks for taking the time to reply.
-
For newcomers:

The `advertools.crawl` function has the following parameters, starting at v0.13.0, to help control which links to follow (or not):

- `exclude_url_params`
- `include_url_params`
- `include_url_regex`
- `exclude_url_regex`

Check the function documentation for the full details.
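As a rough sketch of how these parameters could address the original question (the site URL, output file name, and parameter values below are assumptions for illustration):

```python
import advertools as adv

# Spider the site, but skip cart URLs (?mycart=...) and anything under /blog.
adv.crawl(
    url_list="https://example.com/",   # start URL
    output_file="shop_crawl.jl",       # output is saved as JSON lines (.jl)
    follow_links=True,                 # spider mode: discover and follow links
    exclude_url_params=["mycart"],     # don't follow URLs containing the mycart parameter
    exclude_url_regex="/blog",         # don't follow URLs matching this regex
)
```

The `include_*` counterparts work the other way around, for example `include_url_regex` to crawl only a sub-section of the site such as product pages.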