
Iterate on prototype of social media censorship alert system #114

Closed
hellais opened this issue Jan 23, 2025 · 3 comments

hellais commented Jan 23, 2025

It would be useful to improve upon the initial prototype (ooni/backend#651) of the alert system, to assist in validating the new measurement analysis.

xref: #21

@hellais hellais self-assigned this Jan 23, 2025
@hellais hellais added this to Roadmap Jan 23, 2025
@hellais hellais moved this to Sprint Backlog in Roadmap Jan 23, 2025
@hellais hellais added the priority/medium Normal priority issue label Jan 23, 2025

hellais commented Jan 24, 2025

There is a prototype of this implemented in the form of an Observable notebook.

It works by retrieving a configurable window of measurements for a specified country and running some statistical methods, implemented as ClickHouse queries, to identify blocking events.

The query that is run is the following:

SELECT * 
FROM (
SELECT *,
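    -- Outermost layer: map is_blocked/is_ok onto a signed status (1 = blocked, -1 = ok, NULL = unknown)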
    multiIf(
      is_blocked = 1,
      1,
      is_ok = 1,
      -1,
      NULL
  ) as blocked_status,
  
  -- Then detect changes using lag
  abs(
      last_value(blocked_status) IGNORE NULLS OVER (
          PARTITION BY probe_cc, probe_asn, domain, test_name, category
          ORDER BY ts
          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
      ) - 
      last_value(blocked_status) IGNORE NULLS OVER (
              PARTITION BY probe_cc, probe_asn, domain, test_name, category
              ORDER BY ts
              ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
          )
  ) = 2 as is_event
FROM (
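    -- Middle layer: rolling-sum statistics (mean, variance, z-score) and the per-row ok/blocked flags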
    WITH (rolling_sum >= ${rolling_sum_min}) OR (rolling_sum_zscore > ${zscore_threshold}) as is_data_enough
    SELECT *,  
    --exponentialTimeDecayedAvg(${rolling_sum_days})(rolling_sum, toUnixTimestamp(ts)/86400) OVER 
      --(PARTITION BY probe_cc, probe_asn, domain, test_name, category ORDER BY ts
      --ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rolling_sum_ewm,
  
    avg(rolling_sum) OVER 
      (PARTITION BY probe_cc, probe_asn, domain, test_name, category ORDER BY ts
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rolling_sum_ewm,

    varPop(rolling_sum) OVER 
        (
            PARTITION BY probe_cc, probe_asn, domain, test_name, category
            ORDER BY ts
            RANGE BETWEEN ${rolling_sum_days} * 86400 PRECEDING AND CURRENT ROW
        ) AS rolling_sum_std,

      abs(
        IF(
          rolling_sum_std = 0, (rolling_sum - rolling_sum_ewm) / 1,
          (rolling_sum - rolling_sum_ewm) / rolling_sum_std
        )) as rolling_sum_zscore,

      is_data_enough AND (
          ok_long_ewm >= ${long_threshold}
          AND ok_short_ewm >= ${short_threshold}
      ) as is_ok,
      
      is_data_enough AND (
          blocked_long_ewm >= ${long_threshold}
          AND blocked_short_ewm >= ${short_threshold}
          AND NOT is_ok -- TODO: investigate this
      ) as is_blocked

    FROM (
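        -- Per-category layer: pick the relevant outcome tuple, normalize blocked/ok rates,
        -- compute long/short EWMs and the rolling measurement count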
        WITH
        multiIf(
            outcome_category = 'dns_isp' AND is_isp_resolver,
            (outcome_category, dns_failure, dns_blocked_sum, dns_down_sum, dns_ok_sum),
        
            outcome_category = 'dns_other' AND NOT is_isp_resolver,
            (outcome_category, dns_failure, dns_blocked_sum, dns_down_sum, dns_ok_sum),
        
            outcome_category = 'tcp',
            (outcome_category, tcp_failure, tcp_blocked_sum, tcp_down_sum, tcp_ok_sum),
        
            outcome_category = 'tls',
            (outcome_category, tls_failure, tls_blocked_sum, tls_down_sum, tls_ok_sum),
        
            ('ignore', '', 0, 0, 0)
        ) as outcomes
            
        SELECT 
        ts,
        probe_cc, probe_asn, domain, test_name,
        probe_analysis,
        count,

        blocked/count as blocked_norm,
        ok/count as ok_norm,

        exponentialTimeDecayedAvg(${halflife_days_long})(blocked/count, toUnixTimestamp(ts)/86400.0) OVER 
            (
            PARTITION BY probe_cc, probe_asn, domain, test_name, category ORDER BY ts
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
            ) AS blocked_long_ewm,
        exponentialTimeDecayedAvg(${halflife_days_short})(blocked/count, toUnixTimestamp(ts)/86400.0) OVER 
            (
            PARTITION BY probe_cc, probe_asn, domain, test_name, category ORDER BY ts
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
            ) AS blocked_short_ewm,
    
        exponentialTimeDecayedAvg(${halflife_days_long})(ok/count, toUnixTimestamp(ts)/86400.0) OVER 
            (PARTITION BY probe_cc, probe_asn, domain, test_name, category ORDER BY ts
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS ok_long_ewm,
        exponentialTimeDecayedAvg(${halflife_days_short})(ok/count, toUnixTimestamp(ts)/86400.0) OVER 
            (PARTITION BY probe_cc, probe_asn, domain, test_name, category ORDER BY ts
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS ok_short_ewm,
    
        sum(count) OVER (
            PARTITION BY probe_cc, probe_asn, domain, test_name, category
            ORDER BY ts
            RANGE BETWEEN ${rolling_sum_days} * 86400 PRECEDING AND CURRENT ROW
        ) as rolling_sum,
        
        outcomes.1 as category,
        outcomes.2 as top_failure,
        outcomes.3 as blocked,
        outcomes.4 as down,
        outcomes.5 as ok
        
        FROM (
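            -- Innermost layer: hourly aggregation of web_connectivity rows from analysis_web_measurement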
            WITH
            multiIf(
                top_dns_failure IN ('android_dns_cache_no_data', 'dns_nxdomain_error'),
                'nxdomain',
                coalesce(top_dns_failure, 'ok')
            ) as dns_failure,
            coalesce(top_tcp_failure, 'ok') as tcp_failure,
            coalesce(top_tls_failure, 'ok') as tls_failure
            
            SELECT
                IF(resolver_asn = probe_asn, 1, 0) as is_isp_resolver,
                toStartOfHour(measurement_start_time) as ts,
                probe_cc, probe_asn, domain, test_name,
                COUNT() as count,
                
                arrayJoin(['dns_isp', 'dns_other', 'tcp', 'tls']) as outcome_category,
        
                dns_failure,
                tcp_failure,
                tls_failure,
    
                anyHeavy(top_probe_analysis) as probe_analysis,
        
                sum(dns_blocked) as dns_blocked_sum,
                sum(
                    -- This is a workaround for: https://github.com/ooni/data/pull/110
                    IF(dns_blocked = 1.0 AND dns_down = 1.0, 0.0, dns_down)
                ) as dns_down_sum,
                sum(dns_ok) as dns_ok_sum,
        
                sum(tcp_blocked) as tcp_blocked_sum,
                sum(tcp_down) as tcp_down_sum,
                sum(tcp_ok) as tcp_ok_sum,
                
                sum(tls_blocked) as tls_blocked_sum,
                sum(tls_down) as tls_down_sum,
                sum(tls_ok) as tls_ok_sum
            
            FROM analysis_web_measurement
            WHERE 
            measurement_start_time > '${start_date.toISOString().slice(0, 10)}'
            AND measurement_start_time < '${end_date.toISOString().slice(0, 10)}'
            AND test_name = 'web_connectivity'
            AND probe_cc = '${probe_cc}'
            GROUP BY probe_cc, probe_asn, domain, test_name, ts, dns_failure, tcp_failure, tls_failure, is_isp_resolver
        )
        WHERE (blocked + down + ok) > 0
    )
)
) WHERE is_event = 1
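
To make the event detection in the outermost SELECT easier to follow, here is a minimal sketch of the same transition logic outside of SQL (an illustration only, not part of the notebook): blocked_status is +1 when a row is classified as blocked, -1 when it is ok and NULL otherwise, and an event is flagged whenever the last known status flips sign, i.e. the absolute difference between the current and the previous known status is 2.

type Status = 1 | -1 | null; // 1 = blocked, -1 = ok, null = no determination

// For one time-ordered partition (probe_cc, probe_asn, domain, test_name, category),
// flag the rows where the last known status flips between ok and blocked.
function detectEvents(statuses: Status[]): boolean[] {
  let prev: Status = null; // last_value(blocked_status) IGNORE NULLS up to the previous row
  return statuses.map((s) => {
    const curr = s ?? prev; // last_value(blocked_status) IGNORE NULLS including this row
    const isEvent = prev !== null && curr !== null && Math.abs(curr - prev) === 2;
    prev = curr; // carry the last known status forward
    return isEvent;
  });
}

// detectEvents([null, -1, -1, null, 1, 1, -1])
// -> [false, false, false, false, true, false, true]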

As can be seen from the query, there are quite a few critical parameters that need to be fine-tuned in order to achieve the right tradeoff between not alerting too often and still capturing all the events we need.
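
For reference, a hypothetical set of bindings for those query parameters (the names are taken from the query above, but the values here are only illustrative placeholders, not the ones used in the notebook):

const params = {
  probe_cc: "XX",            // country under analysis (placeholder)
  start_date: new Date("2025-01-01"),
  end_date: new Date("2025-01-24"),
  halflife_days_long: 7,     // half-life (days) of the long EWM of blocked/ok rates
  halflife_days_short: 1,    // half-life (days) of the short EWM
  long_threshold: 0.7,       // minimum long EWM required to call a row ok or blocked
  short_threshold: 0.7,      // minimum short EWM required to call a row ok or blocked
  rolling_sum_days: 7,       // length (days) of the rolling measurement-count window
  rolling_sum_min: 30,       // minimum rolling count for is_data_enough
  zscore_threshold: 2,       // alternative is_data_enough criterion on the count z-score
};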

Here are some screenshots of how it works for detecting a recent case of blocking:

[two screenshots showing the detection of a recent blocking event]

I haven't applied any kind of limitation to restrict it only to social media. That's because my hope is that, given the richness of the metrics generated by the new pipeline, we may be able to make it produce useful results even for non-social-media targets. In any case, that's a decision we can make later: perhaps we only expose social media publicly, while keeping full alerting for internal consumption until we feel confident in the results.

Another thing to keep in mind is that the events generated may change depending on the time window you select, because of how the windowed statistical functions work in this non-streamed implementation. Basically, all the statistical metrics are recomputed over the selected range every time you change the filters. This is going to be different in the production version, where we will generate them incrementally on an hourly basis, meaning the history will date back to whenever we started running it.

The implication of this is that for countries without a lot of measurements you should pick a very large window for it to detect events successfully, while for countries with a lot of measurements you should not, because otherwise it's likely to take a very long time to compute the metrics.


hellais commented Jan 24, 2025

The next steps to get this running in production are:

The second point is probably the trickiest one, but there is quite a bit of information on the different tradeoffs involved in incremental variance and mean calculations.

What remains to be decided is what kind of tradeoff we would like to make in terms of precision vs. numerical stability vs. the amount of information from previous windows we store.
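
For reference, one standard option in that design space is Welford's online algorithm, which keeps only a count, a running mean and a running sum of squared deviations per series, trading a small amount of per-series state for good numerical stability. A minimal sketch (illustration only, not what is implemented here):

// Welford's online algorithm: incrementally updates mean and (population) variance
// while storing only three numbers of state per series.
class RunningStats {
  private n = 0;     // number of observations seen
  private mean = 0;  // running mean
  private m2 = 0;    // running sum of squared deviations from the mean

  update(x: number): void {
    this.n += 1;
    const delta = x - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (x - this.mean);
  }

  variance(): number {
    // population variance, matching varPop() in the query above
    return this.n > 0 ? this.m2 / this.n : 0;
  }
}

The main limitation for our use case is that this covers an ever-growing window rather than a fixed rolling one, so a production version would still need to decide how much state from previous windows to keep around.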

@hellais hellais moved this from Sprint Backlog to Review/QA in Roadmap Jan 24, 2025
@hellais hellais changed the title Initial prototype of social media censorship alert system Iterate on prototype of social media censorship alert system Jan 24, 2025
@hellais hellais removed this from Roadmap Jan 29, 2025

hellais commented Jan 29, 2025

Second iteration is completed. I am closing this issue.

@hellais hellais closed this as completed Jan 29, 2025