Iterate on prototype of social media censorship alert system #114
There is a prototype of this implemented in the form of an Observable notebook. It works by retrieving a configurable window of measurements for a specified country and running some statistical methods, implemented as ClickHouse queries, to identify blocking events. The query that is run is the following:

```sql
SELECT *
FROM (
SELECT *,
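    -- Encode each row as +1 (blocked), -1 (ok), or NULL (undetermined)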
multiIf(
is_blocked = 1,
1,
is_ok = 1,
-1,
NULL
) as blocked_status,
    -- Detect status changes: abs(latest non-NULL status - previous non-NULL status)
    -- equals 2 only when the series flips between blocked (+1) and ok (-1)
abs(
last_value(blocked_status) IGNORE NULLS OVER (
PARTITION BY probe_cc, probe_asn, domain, test_name, category
ORDER BY ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) -
last_value(blocked_status) IGNORE NULLS OVER (
PARTITION BY probe_cc, probe_asn, domain, test_name, category
ORDER BY ts
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
) = 2 as is_event
FROM (
WITH (rolling_sum >= ${rolling_sum_min}) OR (rolling_sum_zscore > ${zscore_threshold}) as is_data_enough
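    -- A series has enough data when its rolling measurement count is high enough,
    -- or when the count spikes relative to its recent history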
SELECT *,
--exponentialTimeDecayedAvg(${rolling_sum_days})(rolling_sum, toUnixTimestamp(ts)/86400) OVER
--(PARTITION BY probe_cc, probe_asn, domain, test_name, category ORDER BY ts
--ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rolling_sum_ewm,
avg(rolling_sum) OVER
(PARTITION BY probe_cc, probe_asn, domain, test_name, category ORDER BY ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rolling_sum_ewm,
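      -- varPop computes the population variance over the same window
      -- (note: despite the _std alias below, this is a variance, not a standard deviation)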
varPop(rolling_sum) OVER
(
PARTITION BY probe_cc, probe_asn, domain, test_name, category
ORDER BY ts
RANGE BETWEEN ${rolling_sum_days} * 86400 PRECEDING AND CURRENT ROW
) AS rolling_sum_std,
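      -- Deviation of the rolling sum from its running average;
      -- when the variance is zero, fall back to the raw deviation to avoid dividing by zero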
abs(
IF(
rolling_sum_std = 0, (rolling_sum - rolling_sum_ewm) / 1,
(rolling_sum - rolling_sum_ewm) / rolling_sum_std
)) as rolling_sum_zscore,
is_data_enough AND (
ok_long_ewm >= ${long_threshold}
AND ok_short_ewm >= ${short_threshold}
) as is_ok,
is_data_enough AND (
blocked_long_ewm >= ${long_threshold}
AND blocked_short_ewm >= ${short_threshold}
AND NOT is_ok -- TODO: investigate this
) as is_blocked
FROM (
WITH
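      -- Pick the (category, failure, blocked, down, ok) counters matching the current
      -- outcome_category; DNS outcomes are split by whether the probe used its ISP's resolver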
multiIf(
outcome_category = 'dns_isp' AND is_isp_resolver,
(outcome_category, dns_failure, dns_blocked_sum, dns_down_sum, dns_ok_sum),
outcome_category = 'dns_other' AND NOT is_isp_resolver,
(outcome_category, dns_failure, dns_blocked_sum, dns_down_sum, dns_ok_sum),
outcome_category = 'tcp',
(outcome_category, tcp_failure, tcp_blocked_sum, tcp_down_sum, tcp_ok_sum),
outcome_category = 'tls',
(outcome_category, tls_failure, tls_blocked_sum, tls_down_sum, tls_ok_sum),
('ignore', '', 0, 0, 0)
) as outcomes
SELECT
ts,
probe_cc, probe_asn, domain, test_name,
probe_analysis,
count,
blocked/count as blocked_norm,
ok/count as ok_norm,
exponentialTimeDecayedAvg(${halflife_days_long})(blocked/count, toUnixTimestamp(ts)/86400.0) OVER
(
PARTITION BY probe_cc, probe_asn, domain, test_name, category ORDER BY ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS blocked_long_ewm,
exponentialTimeDecayedAvg(${halflife_days_short})(blocked/count, toUnixTimestamp(ts)/86400.0) OVER
(
PARTITION BY probe_cc, probe_asn, domain, test_name, category ORDER BY ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS blocked_short_ewm,
exponentialTimeDecayedAvg(${halflife_days_long})(ok/count, toUnixTimestamp(ts)/86400.0) OVER
(PARTITION BY probe_cc, probe_asn, domain, test_name, category ORDER BY ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS ok_long_ewm,
exponentialTimeDecayedAvg(${halflife_days_short})(ok/count, toUnixTimestamp(ts)/86400.0) OVER
(PARTITION BY probe_cc, probe_asn, domain, test_name, category ORDER BY ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS ok_short_ewm,
sum(count) OVER (
PARTITION BY probe_cc, probe_asn, domain, test_name, category
ORDER BY ts
RANGE BETWEEN ${rolling_sum_days} * 86400 PRECEDING AND CURRENT ROW
) as rolling_sum,
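      -- ClickHouse tuple elements are accessed 1-indexed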
outcomes.1 as category,
outcomes.2 as top_failure,
outcomes.3 as blocked,
outcomes.4 as down,
outcomes.5 as ok
FROM (
WITH
multiIf(
top_dns_failure IN ('android_dns_cache_no_data', 'dns_nxdomain_error'),
'nxdomain',
coalesce(top_dns_failure, 'ok')
) as dns_failure,
coalesce(top_tcp_failure, 'ok') as tcp_failure,
coalesce(top_tls_failure, 'ok') as tls_failure
SELECT
IF(resolver_asn = probe_asn, 1, 0) as is_isp_resolver,
toStartOfHour(measurement_start_time) as ts,
probe_cc, probe_asn, domain, test_name,
COUNT() as count,
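      -- arrayJoin fans each aggregated row out into one row per outcome category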
arrayJoin(['dns_isp', 'dns_other', 'tcp', 'tls']) as outcome_category,
dns_failure,
tcp_failure,
tls_failure,
anyHeavy(top_probe_analysis) as probe_analysis,
sum(dns_blocked) as dns_blocked_sum,
sum(
-- This is a workaround for: https://github.com/ooni/data/pull/110
IF(dns_blocked = 1.0 AND dns_down = 1.0, 0.0, dns_down)
) as dns_down_sum,
sum(dns_ok) as dns_ok_sum,
sum(tcp_blocked) as tcp_blocked_sum,
sum(tcp_down) as tcp_down_sum,
sum(tcp_ok) as tcp_ok_sum,
sum(tls_blocked) as tls_blocked_sum,
sum(tls_down) as tls_down_sum,
sum(tls_ok) as tls_ok_sum
FROM analysis_web_measurement
WHERE
measurement_start_time > '${start_date.toISOString().slice(0, 10)}'
AND measurement_start_time < '${end_date.toISOString().slice(0, 10)}'
AND test_name = 'web_connectivity'
AND probe_cc = '${probe_cc}'
GROUP BY probe_cc, probe_asn, domain, test_name, ts, dns_failure, tcp_failure, tls_failure, is_isp_resolver
)
WHERE (blocked + down + ok) > 0
)
)
) WHERE is_event = 1
```

As can be seen in the query, there are quite a few critical parameters that need to be fine-tuned to strike the right tradeoff between not alerting too often and still capturing all the events we need. Here are some screenshots of how it works for detecting a recent case of blocking:

I haven't applied any limitation to restrict it to social media only. That's because I hope that, given the richness of the metrics generated by the new pipeline, we may be able to produce useful results even for non-social-media targets. In any case, that is a decision we can make later: we could expose only social media while keeping full alerting for internal consumption until we feel confident in the results.

Another thing to keep in mind is that the kinds of events generated may change depending on the time window you select, because of how the statistical window functions behave in this non-streamed implementation: all the statistical metrics are recomputed within the selected range every time you change the filters. This will be different in the production version, where we will generate them incrementally on an hourly basis, so the history will date back to whenever we started running it. The implication is that for countries without many measurements you should pick a very large window for events to be detected successfully, while for countries with many measurements you should not, because otherwise computing the metrics is likely to take a very long time.
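For reference, here is a minimal sketch of how these query parameters might be bound in the notebook. The parameter names come from the query above; the values shown are illustrative assumptions, not the tuned defaults:

```js
// Illustrative values only -- the actual tuned values live in the notebook.
const params = {
  probe_cc: "RU",          // country under analysis (example value)
  start_date: new Date("2024-01-01"),
  end_date: new Date("2024-03-01"),
  rolling_sum_min: 10,     // minimum rolling measurement count for is_data_enough
  zscore_threshold: 2,     // alternative volume check via the rolling-sum z-score
  rolling_sum_days: 7,     // lookback window (days) for the rolling measurement count
  halflife_days_long: 7,   // half-life of the slow EWM of the blocked/ok ratios
  halflife_days_short: 1,  // half-life of the fast EWM of the blocked/ok ratios
};

// The query string interpolates these via JS template literals, e.g.:
// `... AND probe_cc = '${params.probe_cc}' ...`
```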
The next steps to get this running in production are:
The second point is probably the trickiest one, but there is quite a bit of information available on the different tradeoffs of incremental variance and mean calculations:
What remains to be decided is what tradeoff between precision, numerical stability, and the amount of information from previous windows we would like to store.
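To illustrate the kind of incremental computation involved, here is a minimal sketch of Welford's online algorithm for mean and variance, one standard point in that tradeoff space: it is numerically stable and needs only O(1) state per series, at the cost of not supporting windowed expiry of old values without extra bookkeeping.

```js
// Welford's online algorithm: numerically stable incremental mean/variance
// with O(1) state per series (count, mean, M2).
class RunningStats {
  constructor() {
    this.count = 0;
    this.mean = 0;
    this.m2 = 0; // running sum of squared deviations from the current mean
  }
  update(x) {
    this.count += 1;
    const delta = x - this.mean;
    this.mean += delta / this.count;
    this.m2 += delta * (x - this.mean);
  }
  variance() {
    // Population variance, matching varPop in the query above
    return this.count > 0 ? this.m2 / this.count : 0;
  }
}

// Usage: feed hourly rolling_sum values as they are produced.
const stats = new RunningStats();
[3, 5, 4, 8].forEach((v) => stats.update(v));
console.log(stats.mean, stats.variance()); // 5, 3.5
```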
The second iteration is complete. I am closing this issue.
It's useful to improve upon the initial prototype (ooni/backend#651) of the alert system to assist in the validation of the new measurement analysis.
xref: #21