-
-
Notifications
You must be signed in to change notification settings - Fork 85
New script to detect importScripts(), service worker events and service worker properties #204
Conversation
custom_metrics/pwa.js
Outdated
return response_bodies.filter(har => { | ||
return regexPattern.test(har.response_body); | ||
}).map(har => { | ||
return [har.url, Array.from(har.response_body.matchAll(regexPattern)).map(m => m[0])]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in some cases you might actually want m[1]
to extract the capturing group rather than the whole match. For example, outputting only message
rather than addEventListener("message"
.
For the original Workbox use case, m[0]
was sufficient, so you may need to add some additional functionality to this function to handle either case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this change by adding an additional parameter (extractMatchingGroupOnly
) to the getInfoForPattern()
function. As you mentioned, in some cases, we want m[0]
and in others m[1]
.
custom_metrics/pwa.js
Outdated
const swEventListenersPattern = /addEventListener\(\s*[\'"](install|activate|fetch|push|notificationclick|notificationclose|sync|canmakepayment|paymentrequest|message|messageerror|periodicsync|backgroundfetchsuccess|backgroundfetchfailure|backgroundfetchabort|backgroundfetchclick)[\'"]/g; | ||
const swEventListenersInfo = getInfoForPattern(swEventListenersPattern); | ||
|
||
const swPropertiesPattern = /\.on(install|activate|fetch|push|notificationclick|notificationclose|sync|canmakepayment|paymentrequest|message|messageerror|periodicsync|backgroundfetchsuccess|backgroundfetchfailure|backgroundfetchabort|backgroundfetchclick)\s*=/g; | ||
const swPropertiesInfo = getInfoForPattern(swPropertiesPattern); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to distinguish between addEventListener
and on
usage in your analysis? If not, consider combining the regexes or merging the results into a single output property.
Also FYI we have the new event-names.js custom metric that will extract any event name after addEventListener
. Probably still good to have a shortlist of SW-related event names to check for here but just wanted to make you aware in case it's useful down the road.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We decided not to combine fields, since they are easier to merge during queries, but harder to separate, if, for some reason we need to use the original separation. In this case, for example, to answer the question: "How many sites use event listeners vs. properties".
Please, refer to the Granularity level headline in this comment for more details.
Hi @rviscomi! Thank you for taking a look at this PR. I'll be running some tests, to decide what's the best course of action for both suggestions, according to what the data returns and how we visualize it. One important question: I've noticed that the original script takes into account all the URLs ( For Workbox-related things or Correct me if I'm wrong, but I believe one way of mitigating this, would be to use Now, if based on the data we receive from the June crawl, we are able to determine that these false positives are relatively low (for example, by comparing how many tests have an empty I believe 2019 and 2020 were taking the whole Thoughts are welcome! cc // @tunetheweb |
Also, this comment from @tunetheweb is very interesting:
For the Twitter test, that we couldn't detect the service worker with our script, WPT shows the service worker related requests in blue in the waterfall. Maybe we can reuse that logic WPT uses to detect service worker activity inside our script to be sure of which tests things from a SW? Cheers. |
WPT highlights any request that belongs to a different document from the main page in blue. It will show SW but also iFrames. "documentURL" in the request data. |
Thanks for confirming @pmeenan! In that case, we won't be able to use that to isolate SW-only traffic. |
That's right. It would provide fewer false positives but we'd lose out on the Twitter edge case.
Reevaluating after the June crawl sounds like a good plan. We can also compare to the ServiceWorkerControlledPage use counter. |
@demianrenzulli and I looked at this for 2020 stats and it was pretty damn close actually. Web Almanac counted 53,366 sites for mobile and 42,521 sites for desktop. Blink Usage for Aug 2020 had 55,019 and 49,305. And they presumably will be counting more than the home page and all sites not just CrUX ones? So that’s VERY close in the grand scheme of things and gives real confidence the 2020 methodology (which is what we are currently proposing for 2021 but just moving the counting to be done during the crawl) is representative. |
Actually where does the |
The feature usage comes from the crawl. |
Ah then the closeness is more understandable 😔 I guess the difference would be sites like Twitter where we can’t see it registering a service worker even though it does use one. |
One other consideration for minimizing false positives (eg onmessage outside of a SW) is to test for patterns that would disqualify the script from being a service worker. For example, if we find |
Thanks for sharing your thoughts again, @rviscomi! It's really helping us understand what can be done to make our analysis more precise. So far, I have found a potentially strong usage of features outside service worker in the following cases:
All of them can be accessed both from pages and service workers, and their usage might be quite frequent in both contexts. So, we either (1) exclude them from our analysis to avoid false positives, or (2) find a way of mitigating this issue by focusing only in For (1): For the three features, I thought it might be interesting to know how many service workers use the Regarding (2): If we decide to implement a technique to filter service worker only activity, this comment:
Sounds like a very good idea. In general, workers don't have access to the Window and therefore the Document object either and have limited access to browser APIs. Since Window is implicit, I believe that any call to I'm wondering how this could be implemented? Is it possible to create an exception in pwa.js, or is it something we can do after getting the data? |
BTW: cc @jeffposnick to validate this:
|
That's correct. |
}); | ||
function getInfoForPattern(regexPattern) { | ||
return response_bodies.filter(har => { | ||
return regexPattern.test(har.response_body); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per the discussion, I'd suggest adding a condition here to filter out responses that contain non-SW code. For example, have a top-level const to define a pattern that disqualifies the response from being considered a SW and test for that here:
return regexPattern.test(har.response_body); | |
return !disqualifyingSWPattern.test(har.response_body) && regexPattern.test(har.response_body); |
Where disqualifyingSWPattern
could be defined as something like:
const disqualifyingSWPattern = /\bdocument\.\w+/g;
(not tested)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, refer to the Filtering "service worker only" activity of this comment, where I explained why this strategy didn't work.
Instead of that, I ended up adding an experimental field serviceWorkerHeuristic
, which combines different conditions for features that should only appear in service workers. This is explained in this comment.
Hi folks, I just updated this PR, by adding two additional fields: Output: Before starting to work on @rviscomi code to filter SW only files, I wanted to go back to on of his previous comments:
I'm not familiar on how this information will be consumed and if it's better to combine some of the fields that this script returns into one. I'd like to hear @tunetheweb thoughts on this, before changing the current structure. Maybe we can discuss in tomorrow's sync. |
Hi team, As discussed in our sync with @tunetheweb and @OBTo earlier this week, I've been working in some updates to the script to get it closer to the final version, so we can hopefully use it for July's crawl. @rviscomi (and others) I would appreciate your thoughts: New fieldsYou'll notice that I have included more PWA features to test, like Output formatI added an option to Granularity levelAs discussed, we would prefer to return the fields in the more granular way, instead of combining them, so I tried to group them according to the feature they belong to. With that say: Filtering "service worker only" activityI took @rviscomi's suggestion and tried the suggested technique: const disqualifyingSWPattern = /\bdocument\.\w+/g; While I was initially happy with the preliminary results, I found some cases like Spotify's service worker, which code contains calls to For that reason, I decided to exclude two features that are not SW-specific and that are very commonly used in pages: Something else that we can do after analyzing June's crawl would be creating a condition like: "If it doesn't have a Test casesI tested 20 sites that have service workers and the output seems to be working well. I was hoping we can use that as test cases every time we introduce changes. @tunetheweb: you can take a look at any test, for example, this one, to see how the JSON looks like. Sorry for the length of this comment! Too many decisions have been made that I wanted to share. |
Hey @demianrenzulli is this expected in your results: Given that you are searching for very specific |
Oh and btw I just found out yesterday that if you use URLs like this: |
Hi @tunetheweb, thank you for taking a look at the tests and share your feedback! The output you found belongs to an https://storage.googleapis.com/workbox-cdn/releases/4.3.1/workbox-sw.js This is the relevant portion of the code: loadModule(t) {
const e = this.i(t);
try {
importScripts(e), this.o = !0
} catch (s) {
throw console.error(`Unable to import module '${t}' from '${e}'.`), s
}
} As you can see, the regular expression has detected the pattern well, but since the call to I didn't want to force the the parameter sent to Maybe at the time of creating the raking of most used libraries we could filter out these results that don't add much? |
Ah I loaded the wrong script so couldn't see it! My bad. Is it useful to know what they import? Or just that they use i.e. instead of this: "importScriptsInfo": {
"https: //www.tiktok.com/sw.js": [
"'https://storage.googleapis.com/workbox-cdn/releases/4.3.1/workbox-sw.js'"
],
"https: //storage.googleapis.com/workbox-cdn/releases/4.3.1/workbox-sw.js": [
"e"
]
} Should we have this?: "importScriptsInfo": [
"https: //www.tiktok.com/sw.js",
"https: //storage.googleapis.com/workbox-cdn/releases/4.3.1/workbox-sw.js"
] Or maybe each just this?: "importScriptsInfo": true |
Thanks again @tunetheweb! If we do this: "importScriptsInfo": [
"https: //www.tiktok.com/sw.js",
"https: //storage.googleapis.com/workbox-cdn/releases/4.3.1/workbox-sw.js"
] We know that there are two files that contain Same goes for Please, let me know if I understood this correctly. |
Ah you're correct yes. We want to track the most used libraries, not just the use of import scripts. Then maybe check it includes at least one Or we just leave as is and filter out in the results? WDYT? |
I really like this idea: "Then maybe check it includes at least one .?", but what if a site has a service worker with only an While the variable wouldn't be useful for the rankings, not knowing that site has an That's the first row in last year's table. Am I right? |
I say leave as is then and can clean it up in Sheets. I also had a look at your test cases and looks pretty thorough! My only concern is the number of "Empty serviceWorkers field with non empty SW methods" cases you have. Is that because you specifically looked for them? Or is our logic so limited that it's picking up so few of them? Anything we can do to improve it that you've seen from these sites? Though guess we do have Blink Usage and Lighthouse to fall back on for those tests. |
I have to admit that I am a bit concerned too about the potentially large number of false negatives that we have seen in the tests for the If we find that the number of false positives is still considerable, something that we can do is to create a condition like: "If a site has an empty A priori, none of these fields should be populated for sites that don't have it, since the methods they check for shouldn't appear outside of service workers. But maybe it's good to have the data to be 100% sure. Please, let me know what you think! |
Here is a test for web.dev, which doesn't have a service worker, and therefore, has all the SW-related fields mentioned before, empty. |
Hi folks, a final update on my side: I just committed an experimental field If you take a look at the SW Heuristics column of the test case sheet, you'll see that this value is I added a new tab in the sheet, Non-PWA sites, with URLs for sites that don't register a service worker, and you'll see that in all cases SW Heuristics is This might help us get a more accurate idea of how many sites actually have service workers, but since the number of tests I performed is relatively small (30 approximately), I can't be sure what the result will be in the order of thousands when the actual crawl takes place. I hope this helps. |
@rviscomi as discussed with @tunetheweb I’ve resolved all the comments and don’t plan on making any further changes here if you can review again when you have a moment? I did my best to cover as many test cases as possible. @OBTo as we discussed, this seems to be the only way to test, but if you have any thoughts on how to test custom metrics that would be great. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
Hi folks,
As discussed in #203 I decided to create a PR for this script (even when is not optimal), since, after analyzing some tests with @tunetheweb, having it merged for Monday's crawl could help us get an idea of the number of false negatives in
serviceWorkerInitiatedURLs
.We have added three new fields to the response:
importScriptsInfo
,swEventListenersInfo
,wPropertiesInfo
, which use Rick's script for Workbox detection and combines it with the regular expressions used last year to detect SW events and properties (sw_events.sql), plus a new one to detect calls toimportScripts()
.Tests
serviceWorkers
field is non-empty and it also has values for the new properties.serviceWorkers
is empty, but we can seeswEventListenersInfo
andwPropertiesInfo
populated.It would be interesting to see how many cases like Twitter we have, to see if we are under-counting many sites that use service workers.
From what we could check with @tunetheweb this number represent a very small percentage.
It would be great if @rviscomi or @jeffposnick can take a look at some point.
cc // @OBTo