
Iterate on prototype of social media censorship alert system #114

Closed
hellais opened this issue Jan 23, 2025 · 3 comments

hellais commented Jan 23, 2025

It would be useful to improve upon the initial prototype (ooni/backend#651) of the alert system, to assist in validating the new measurement analysis.

xref: #21

@hellais hellais self-assigned this Jan 23, 2025
@hellais hellais added this to Roadmap Jan 23, 2025
@hellais hellais moved this to Sprint Backlog in Roadmap Jan 23, 2025
@hellais hellais added the priority/medium Normal priority issue label Jan 23, 2025

hellais commented Jan 24, 2025

There is a prototype of this implemented in the form of an Observable notebook.

It works by retrieving a configurable window of measurements for a specified country and running some statistical methods, implemented as ClickHouse queries, to identify blocking events.

The query that is run is the following:

SELECT * 
FROM (
SELECT *,
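    -- Outermost layer: map is_blocked/is_ok onto a signed status (1 = blocked, -1 = ok, NULL = unknown)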
    multiIf(
      is_blocked = 1,
      1,
      is_ok = 1,
      -1,
      NULL
  ) as blocked_status,
  
  -- Then detect changes using lag
  abs(
      last_value(blocked_status) IGNORE NULLS OVER (
          PARTITION BY probe_cc, probe_asn, domain, test_name, category
          ORDER BY ts
          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
      ) - 
      last_value(blocked_status) IGNORE NULLS OVER (
              PARTITION BY probe_cc, probe_asn, domain, test_name, category
              ORDER BY ts
              ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
          )
  ) = 2 as is_event
FROM (
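    -- Middle layer: rolling-sum statistics (mean, variance, z-score) and the per-row ok/blocked flags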
    WITH (rolling_sum >= ${rolling_sum_min}) OR (rolling_sum_zscore > ${zscore_threshold}) as is_data_enough
    SELECT *,  
    --exponentialTimeDecayedAvg(${rolling_sum_days})(rolling_sum, toUnixTimestamp(ts)/86400) OVER 
      --(PARTITION BY probe_cc, probe_asn, domain, test_name, category ORDER BY ts
      --ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rolling_sum_ewm,
  
    avg(rolling_sum) OVER 
      (PARTITION BY probe_cc, probe_asn, domain, test_name, category ORDER BY ts
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rolling_sum_ewm,

    varPop(rolling_sum) OVER 
        (
            PARTITION BY probe_cc, probe_asn, domain, test_name, category
            ORDER BY ts
            RANGE BETWEEN ${rolling_sum_days} * 86400 PRECEDING AND CURRENT ROW
        ) AS rolling_sum_std,

      abs(
        IF(
          rolling_sum_std = 0, (rolling_sum - rolling_sum_ewm) / 1,
          (rolling_sum - rolling_sum_ewm) / rolling_sum_std
        )) as rolling_sum_zscore,

      is_data_enough AND (
          ok_long_ewm >= ${long_threshold}
          AND ok_short_ewm >= ${short_threshold}
      ) as is_ok,
      
      is_data_enough AND (
          blocked_long_ewm >= ${long_threshold}
          AND blocked_short_ewm >= ${short_threshold}
          AND NOT is_ok -- TODO: investigate this
      ) as is_blocked

    FROM (
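        -- Per-category layer: pick the relevant outcome tuple, normalize blocked/ok rates,
        -- compute long/short EWMs and the rolling measurement count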
        WITH
        multiIf(
            outcome_category = 'dns_isp' AND is_isp_resolver,
            (outcome_category, dns_failure, dns_blocked_sum, dns_down_sum, dns_ok_sum),
        
            outcome_category = 'dns_other' AND NOT is_isp_resolver,
            (outcome_category, dns_failure, dns_blocked_sum, dns_down_sum, dns_ok_sum),
        
            outcome_category = 'tcp',
            (outcome_category, tcp_failure, tcp_blocked_sum, tcp_down_sum, tcp_ok_sum),
        
            outcome_category = 'tls',
            (outcome_category, tls_failure, tls_blocked_sum, tls_down_sum, tls_ok_sum),
        
            ('ignore', '', 0, 0, 0)
        ) as outcomes
            
        SELECT 
        ts,
        probe_cc, probe_asn, domain, test_name,
        probe_analysis,
        count,

        blocked/count as blocked_norm,
        ok/count as ok_norm,

        exponentialTimeDecayedAvg(${halflife_days_long})(blocked/count, toUnixTimestamp(ts)/86400.0) OVER 
            (
            PARTITION BY probe_cc, probe_asn, domain, test_name, category ORDER BY ts
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
            ) AS blocked_long_ewm,
        exponentialTimeDecayedAvg(${halflife_days_short})(blocked/count, toUnixTimestamp(ts)/86400.0) OVER 
            (
            PARTITION BY probe_cc, probe_asn, domain, test_name, category ORDER BY ts
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
            ) AS blocked_short_ewm,
    
        exponentialTimeDecayedAvg(${halflife_days_long})(ok/count, toUnixTimestamp(ts)/86400.0) OVER 
            (PARTITION BY probe_cc, probe_asn, domain, test_name, category ORDER BY ts
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS ok_long_ewm,
        exponentialTimeDecayedAvg(${halflife_days_short})(ok/count, toUnixTimestamp(ts)/86400.0) OVER 
            (PARTITION BY probe_cc, probe_asn, domain, test_name, category ORDER BY ts
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS ok_short_ewm,
    
        sum(count) OVER (
            PARTITION BY probe_cc, probe_asn, domain, test_name, category
            ORDER BY ts
            RANGE BETWEEN ${rolling_sum_days} * 86400 PRECEDING AND CURRENT ROW
        ) as rolling_sum,
        
        outcomes.1 as category,
        outcomes.2 as top_failure,
        outcomes.3 as blocked,
        outcomes.4 as down,
        outcomes.5 as ok
        
        FROM (
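            -- Innermost layer: hourly aggregation of web_connectivity rows from analysis_web_measurement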
            WITH
            multiIf(
                top_dns_failure IN ('android_dns_cache_no_data', 'dns_nxdomain_error'),
                'nxdomain',
                coalesce(top_dns_failure, 'ok')
            ) as dns_failure,
            coalesce(top_tcp_failure, 'ok') as tcp_failure,
            coalesce(top_tls_failure, 'ok') as tls_failure
            
            SELECT
                IF(resolver_asn = probe_asn, 1, 0) as is_isp_resolver,
                toStartOfHour(measurement_start_time) as ts,
                probe_cc, probe_asn, domain, test_name,
                COUNT() as count,
                
                arrayJoin(['dns_isp', 'dns_other', 'tcp', 'tls']) as outcome_category,
        
                dns_failure,
                tcp_failure,
                tls_failure,
    
                anyHeavy(top_probe_analysis) as probe_analysis,
        
                sum(dns_blocked) as dns_blocked_sum,
                sum(
                    -- This is a workaround for: https://github.com/ooni/data/pull/110
                    IF(dns_blocked = 1.0 AND dns_down = 1.0, 0.0, dns_down)
                ) as dns_down_sum,
                sum(dns_ok) as dns_ok_sum,
        
                sum(tcp_blocked) as tcp_blocked_sum,
                sum(tcp_down) as tcp_down_sum,
                sum(tcp_ok) as tcp_ok_sum,
                
                sum(tls_blocked) as tls_blocked_sum,
                sum(tls_down) as tls_down_sum,
                sum(tls_ok) as tls_ok_sum
            
            FROM analysis_web_measurement
            WHERE 
            measurement_start_time > '${start_date.toISOString().slice(0, 10)}'
            AND measurement_start_time < '${end_date.toISOString().slice(0, 10)}'
            AND test_name = 'web_connectivity'
            AND probe_cc = '${probe_cc}'
            GROUP BY probe_cc, probe_asn, domain, test_name, ts, dns_failure, tcp_failure, tls_failure, is_isp_resolver
        )
        WHERE (blocked + down + ok) > 0
    )
)
) WHERE is_event = 1
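
To make the event detection in the outermost SELECT easier to follow, here is a minimal sketch of the same transition logic outside of SQL (an illustration only, not part of the notebook): blocked_status is +1 when a row is classified as blocked, -1 when it is ok and NULL otherwise, and an event is flagged whenever the last known status flips sign, i.e. the absolute difference between the current and the previous known status is 2.

type Status = 1 | -1 | null; // 1 = blocked, -1 = ok, null = no determination

// For one time-ordered partition (probe_cc, probe_asn, domain, test_name, category),
// flag the rows where the last known status flips between ok and blocked.
function detectEvents(statuses: Status[]): boolean[] {
  let prev: Status = null; // last_value(blocked_status) IGNORE NULLS up to the previous row
  return statuses.map((s) => {
    const curr = s ?? prev; // last_value(blocked_status) IGNORE NULLS including this row
    const isEvent = prev !== null && curr !== null && Math.abs(curr - prev) === 2;
    prev = curr; // carry the last known status forward
    return isEvent;
  });
}

// detectEvents([null, -1, -1, null, 1, 1, -1])
// -> [false, false, false, false, true, false, true]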

As can be seen from the query, there are quite a few critical parameters that need to be fine-tuned in order to achieve the right tradeoff between not alerting too often and still capturing all the events we need.
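
For reference, a hypothetical set of bindings for those query parameters (the names are taken from the query above, but the values here are only illustrative placeholders, not the ones used in the notebook):

const params = {
  probe_cc: "XX",            // country under analysis (placeholder)
  start_date: new Date("2025-01-01"),
  end_date: new Date("2025-01-24"),
  halflife_days_long: 7,     // half-life (days) of the long EWM of blocked/ok rates
  halflife_days_short: 1,    // half-life (days) of the short EWM
  long_threshold: 0.7,       // minimum long EWM required to call a row ok or blocked
  short_threshold: 0.7,      // minimum short EWM required to call a row ok or blocked
  rolling_sum_days: 7,       // length (days) of the rolling measurement-count window
  rolling_sum_min: 30,       // minimum rolling count for is_data_enough
  zscore_threshold: 2,       // alternative is_data_enough criterion on the count z-score
};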

Here are some screenshots of how it works for detecting a recent case of blocking:

[two screenshots showing the detection of a recent blocking event]

I haven't applied any kind of limitation to restrict it only to social media. That's because my hope is that, given the richness of the metrics generated by the new pipeline, we may be able to make it produce useful results even for non-social-media targets. In any case, that's a decision we can make later: perhaps we only expose social media publicly, while keeping full alerting for internal consumption until we feel confident in the results.

Another thing to keep in mind is that the events generated may change depending on the time window you select, because of how the windowed statistical functions work in this non-streamed implementation. Basically, all the statistical metrics are recomputed over the selected range every time you change the filters. This is going to be different in the production version, where we will generate them incrementally on an hourly basis, meaning the history will date back to whenever we started running it.

The implication of this is that for countries without a lot of measurements you should pick a very large window for it to detect events successfully, while for countries with a lot of measurements you should not, because otherwise it's likely to take a very long time to compute the metrics.


hellais commented Jan 24, 2025

The next steps to get this running in production are:

The second point is probably the trickiest one, but there is quite a bit of information on the different tradeoffs involved in incremental variance and mean calculations.

What remains to be decided is what kind of tradeoff we would like to make in terms of precision vs. numerical stability vs. the amount of information from previous windows we store.
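
For reference, one standard option in that design space is Welford's online algorithm, which keeps only a count, a running mean and a running sum of squared deviations per series, trading a small amount of per-series state for good numerical stability. A minimal sketch (illustration only, not what is implemented here):

// Welford's online algorithm: incrementally updates mean and (population) variance
// while storing only three numbers of state per series.
class RunningStats {
  private n = 0;     // number of observations seen
  private mean = 0;  // running mean
  private m2 = 0;    // running sum of squared deviations from the mean

  update(x: number): void {
    this.n += 1;
    const delta = x - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (x - this.mean);
  }

  variance(): number {
    // population variance, matching varPop() in the query above
    return this.n > 0 ? this.m2 / this.n : 0;
  }
}

The main limitation for our use case is that this covers an ever-growing window rather than a fixed rolling one, so a production version would still need to decide how much state from previous windows to keep around.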

@hellais hellais moved this from Sprint Backlog to Review/QA in Roadmap Jan 24, 2025
@hellais hellais changed the title Initial prototype of social media censorship alert system Iterate on prototype of social media censorship alert system Jan 24, 2025
@hellais hellais removed this from Roadmap Jan 29, 2025

hellais commented Jan 29, 2025

Second iteration is completed. I am closing this issue.

@hellais hellais closed this as completed Jan 29, 2025