snapshot functionality for a full site at a given time? #15

Open
DOSull opened this issue May 3, 2021 · 0 comments
DOSull commented May 3, 2021

Hi, thanks for an interesting and useful project which has helped me make a start on reconstructing a site that would be really useful for a research project. I'm new to scrapy so it's been an interesting way to start learning about that.

I've been trying to make a snapshot of the whole site (or as much of it as the Wayback Machine holds) at a particular time, following the instruction here to set the from and to timestamps to the same value. However, when I do this, I get only a very incomplete snapshot of the site. If I widen the from and to range I get many more pages (but also a lot of snapshots I'm not interested in!).

I've looked at the logic in the filter_snapshots function and it all makes sense: essentially, it keeps each snapshot before time_range[0] in a holding variable, initial_snapshot, and if the filtered_snapshots list is still empty when time_range[1] is reached, then initial_snapshot goes into filtered_snapshots as the only snapshot.
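To make my reading of that logic concrete, here is a minimal sketch of how I understand it. This is not the project's actual code: the function name filter_snapshots_sketch, the 'datetime' key, and the assumption that snapshots arrive sorted oldest first are all mine, for illustration only.

```python
from datetime import datetime


def filter_snapshots_sketch(snapshots, time_range):
    """Hypothetical re-statement of the filtering logic described above.

    snapshots: list of dicts with a 'datetime' key, sorted oldest first
               (an assumption, not the project's real data structure).
    time_range: (start, end) tuple of datetime objects.
    """
    start, end = time_range
    filtered_snapshots = []
    initial_snapshot = None  # most recent snapshot seen before the range starts

    for snapshot in snapshots:
        timestamp = snapshot["datetime"]
        if timestamp < start:
            # Too early: remember it as a possible fallback.
            initial_snapshot = snapshot
        elif timestamp <= end:
            # Inside the requested range: keep it.
            filtered_snapshots.append(snapshot)
        else:
            # Past the end of the range: nothing later can qualify.
            break

    # Nothing fell inside the range, so fall back to the last earlier snapshot.
    if not filtered_snapshots and initial_snapshot is not None:
        filtered_snapshots.append(initial_snapshot)
    return filtered_snapshots


# Made-up example data: one page captured three times.
page_snapshots = [
    {"datetime": datetime(2019, 1, 1)},
    {"datetime": datetime(2020, 6, 1)},
    {"datetime": datetime(2021, 3, 1)},
]
point_in_time = (datetime(2021, 1, 1), datetime(2021, 1, 1))
result = filter_snapshots_sketch(page_snapshots, point_in_time)
```

If that reading is right, a page whose first capture is later than the requested time would yield no snapshot at all when from and to are equal, which might explain some of the gaps, though I haven't confirmed that this is what happens.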

Have you seen any problems like this before? Possibly related: even if I expand the time range, some pages don't get picked up, and I have to re-run with a more specific URL to retrieve some subfolders of the site. The behaviour is consistent between runs, so I don't think it's timing out or anything; it's just not crawling to those pages for some reason. I've tried setting DEPTH_LIMIT in __main__.py, and when I run from the command line the setting is echoed back to me, but that doesn't seem to make any difference.
