Capturing two URLs are not being properly read by Webrecorder Player? #27

Open
hanoii opened this issue Jan 18, 2019 · 12 comments

@hanoii

hanoii commented Jan 18, 2019

I successfully (I think) captured a page and generated a WARC file using Electron's debugger API (https://electronjs.org/docs/api/debugger).

I tried a simple site: www.drupal.org

If I capture only the first load, it works nicely: Webrecorder Player shows it perfectly.

However, if I navigate to "Developers" and then store both the homepage and that page in the WARC file, Webrecorder Player does not seem to replay them, even though I can see the data for both in the WARC file.

I guess something is missing from the WARC file, or I am missing something. Any ideas?

Other than that, I am super happy to see this working. It might even be worth contributing this WARC generator to this package.

@hanoii changed the title from "Browsing away from an URL" to "Capturing two URLs are not being properly read by Webrecorder Player?" on Jan 18, 2019
@N0taN3rd
Owner

N0taN3rd commented Jan 18, 2019

node-warc welcomes all contributions!

My guess is that you are not writing a single warc info record, using writeWebrecorderBookmarksInfoRecord, that contains all the URLs of the pages you wish to be viewable via WR player.

To fix that, you can wait to append that record until the very end, after capturing all the pages, or view them using pywb, which has no such restriction.
Ultimately, WR player and WR itself use pywb as the replay system.
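
For illustration, here is a minimal sketch of collecting every visited page URL during a session so that one record listing all of them can be written at the very end. `PageTracker` and its methods are hypothetical names, not part of node-warc, and the exact arguments that writeWebrecorderBookmarksInfoRecord expects are not confirmed here.

```javascript
// Hypothetical helper (not part of node-warc): remembers every page URL
// visited during a capture session so that a single info record listing
// all of them can be written once, after the last page has been captured.
class PageTracker {
  constructor () {
    this.pages = new Set() // a Set drops repeat visits to the same URL
  }

  // Call this from whatever navigation hook the application provides.
  visited (url) {
    this.pages.add(url)
  }

  // Unique URLs, in first-visit order.
  list () {
    return Array.from(this.pages)
  }
}

const tracker = new PageTracker()
tracker.visited('https://www.drupal.org/')
tracker.visited('https://www.drupal.org/developers')
tracker.visited('https://www.drupal.org/') // revisit, ignored

// At the end of the session, tracker.list() is what would be handed to the
// writer (for example to writeWebrecorderBookmarksInfoRecord).
console.log(tracker.list())
```

Deferring the record until the end means the list is complete no matter how many pages were visited in between.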

@hanoii
Author

hanoii commented Jan 18, 2019

Hmm, oddly I also tried pywb, but it didn't display anything. I will look into it. I am basically just capturing everything from a webview tag: I navigated to just one URL and then stored all packages in a WARC file, almost the same way as with the remoteChromeGenerator.

@N0taN3rd
Owner

Have you tried using puppeteer rather than Electron?
I have found that a full browser, either Chrome or Chromium (brought in via puppeteer), controlled via puppeteer or chrome-remote-interface produces better results and is easier to use.

@N0taN3rd
Owner

N0taN3rd commented Jan 18, 2019

Ultimately the best advice I can give without seeing how you are doing the capturing (either src or minimal working example) is to treat each page as a standalone WARC that is either appended to a single WARC or written to its own WARC with concatenation done afterwards.
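
As a rough sketch of the "write each page to its own WARC, concatenate afterwards" option: a WARC file is simply a sequence of records, so for uncompressed files a byte-level concatenation is enough. `concatWarcs` and the file names in the usage comment are hypothetical; only Node's built-in `fs` module is used.

```javascript
const fs = require('fs')

// Hypothetical helper: joins per-page WARC files into one file, in the
// order given. Assumes uncompressed WARCs, where appending the bytes of
// one file to another yields a valid sequence of records.
function concatWarcs (partPaths, outPath) {
  fs.writeFileSync(outPath, '') // start from an empty output file
  for (const part of partPaths) {
    fs.appendFileSync(outPath, fs.readFileSync(part))
  }
  return outPath
}

// Usage (paths are placeholders):
// concatWarcs(['home.warc', 'developers.warc'], 'combined.warc')
```

This reads each part fully into memory, which is fine for a proof of concept; large captures would be better served by streams.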

@hanoii
Author

hanoii commented Jan 18, 2019

If you can, here's my source:

https://pastebin.com/61bBUiyg

I'll eventually wrap this better; for now it is a PoC.

I took your RemoteChromeWARCGenerator and RemoteChromeRequestCapturer and swapped the network interface for Electron's Debugger, which gives me access to the same events. So it should be basically the same.

The WARC file is written as per your Chrome example on the project's page.

I only tried puppeteer for a quick test. I might do a better one next week, but I would have expected this to work.

@N0taN3rd
Owner

Did the electron request capturer and writer not work for you?

@hanoii
Author

hanoii commented Jan 18, 2019

😱 I didn't see them or know they were there! Sorry. From a quick look at the code, it looks like I ended up doing something very similar.

I will try them anyway to see if I get the same WARC.

I am not capturing maybeNetworkMessage though.

This is the warc file I got: warc.zip

It should have both https://www.drupal.org/ and https://www.drupal.org/developers

I can see both of them in the WARC file.

Will try yours anyway and see what I do. Thanks, might get back properly next week.
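
A quick way to double-check which URLs actually ended up in an uncompressed WARC is to scan it for `WARC-Target-URI` header lines. The sketch below is only a rough check: `listTargetUris` is a hypothetical helper, the sample text is made up, and a matching line inside a response body would also be counted.

```javascript
// Hypothetical helper: returns the distinct WARC-Target-URI header values
// found in uncompressed WARC text. It scans line by line rather than
// parsing records, so treat it as a quick sanity check only.
function listTargetUris (warcText) {
  const uris = new Set()
  const re = /^WARC-Target-URI:[ \t]*(\S+)/gim
  let match
  while ((match = re.exec(warcText)) !== null) {
    uris.add(match[1])
  }
  return Array.from(uris)
}

// Made-up, heavily truncated sample standing in for a real file, which
// could be loaded with fs.readFileSync('myWARC.warc', 'latin1').
const sample = [
  'WARC/1.0',
  'WARC-Type: response',
  'WARC-Target-URI: https://www.drupal.org/',
  '',
  'WARC/1.0',
  'WARC-Type: response',
  'WARC-Target-URI: https://www.drupal.org/developers',
  ''
].join('\r\n')

console.log(listTargetUris(sample))
```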

@N0taN3rd
Owner

N0taN3rd commented Jan 18, 2019

maybeNetworkMessage is a utility function that saves you from having to add an additional message listener to the debugger 😄
As for the source code you shared, I cannot tell from it when you are writing to the WARC, and judging from the discussion here, the timing of that write is likely the reason for your issues.

@hanoii
Author

hanoii commented Jan 18, 2019

I am doing that manually on a context menu, so basically I just wait a reasonable while and trigger it:

    const menuItem2 = new MenuItem({
      label: 'Warc it yo!',
      click: (menuItem, browserWindow, event) => {
        const warcGen = new DebuggerWARCGenerator()
        console.log(cap)
        warcGen.generateWARC(cap, debug, {
          warcOpts: {
            warcPath: 'myWARC.warc'
          },
          winfo: {
            description: 'I created a warc!',
            isPartOf: 'My awesome electron1 collection'
          }
        })
      }
    })

@hanoii
Author

hanoii commented Jan 21, 2019

> Did the electron request capturer and writer not work for you?

I just tried this and got the exact same behavior. Maybe I am missing something about the WARC file that is currently beyond me, but I will probably get to it soon. This was mainly to make sure this is a workable solution, which it definitely is.

If there is anything you want to suggest as a follow-up, or anything I can do to help debug this and get to the root of it, let me know: I'd rather have this small thing working.

@N0taN3rd
Owner

N0taN3rd commented Jan 22, 2019

You are not adding the pages array, and the WARC is not being written in appending mode.

See the electron generator docs for more details.

Correcting those issues should help you get your desired results ☺️
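
As a sketch of what those two corrections might look like in the options passed to generateWARC: the key names `pages` and `appending` are assumptions inferred from this comment and should be confirmed against the electron generator docs. `buildGenOpts` is a hypothetical helper, and the `winfo` values are copied from the snippet earlier in the thread.

```javascript
// Hypothetical helper: builds the options object for generateWARC.
// ASSUMPTION: `pages` and `warcOpts.appending` are the option names the
// generator reads; check the electron generator docs before relying on them.
function buildGenOpts (warcPath, pages) {
  return {
    warcOpts: {
      warcPath,
      appending: true // add to an existing WARC instead of overwriting it
    },
    winfo: {
      description: 'I created a warc!',
      isPartOf: 'My awesome electron1 collection'
    },
    pages // every URL that should be listed in the player
  }
}

const genOpts = buildGenOpts('myWARC.warc', [
  'https://www.drupal.org/',
  'https://www.drupal.org/developers'
])

// In the context-menu handler this would replace the inline object:
// warcGen.generateWARC(cap, debug, genOpts)
console.log(genOpts)
```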

@N0taN3rd
Owner

See also https://github.com/N0taN3rd/Squidwarc/blob/next/lib/crawler/chrome.js for an example of warc generation
