Rate limit devices per user on short window #35613

gherceg · 2025-01-13T22:03:42Z

Product Description

A 406 HTTP response is returned when this limit is reached, which allows us to send a custom, non-translatable message. However, given that the mobile worker will be connected to the internet, I'm optimistic they will be able to translate it or send it to someone who can make better use of the error message if needed. A 429 response would also have fit here and is translatable, but its message is too vague in my opinion. With a 406, we can communicate to the user that it is usage for this specific user that is too high.

Based on the Android code, I'm pretty sure this will be displayed similarly on mobile, but I will confirm with a physical device. Update: Confirmed this looks good on a mobile device.

Technical Summary

https://dimagi.atlassian.net/browse/SAAS-16355

Initial PR: #35515

Note for reviewers
By default, the device rate limiter is not enabled, but the code is in place to enable it via update-config. Review this code as if it were being enabled immediately, but note that there will be time after this is deployed to collect metrics in Datadog and determine how the current limits fit current usage.

Summary

The changes in this PR track usage for each user action that leads to updating user metadata, which are:

restores
form submissions
heartbeats

If the combined usage on these endpoints reaches 10 unique device IDs within a fixed one-minute window, any further requests for that user within that minute from a new device ID are rejected. The exceptions are requests where the device ID is None or blank, or where the device ID is from Web Apps.
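The rule described above can be sketched as a small fixed-window limiter. This is a minimal illustration and not the PR's implementation: the commits in this PR mention Redis, while this sketch keeps state in a plain dict, and the class name, the `WEB_APPS_DEVICE_ID` placeholder, and the injectable `clock` are assumptions made for the example. Only the `rate_limit_device(domain, user_id, device_id)` call shape is taken from the review discussion.

```python
import time
from collections import defaultdict

DEVICE_LIMIT_PER_USER = 10  # unique device IDs allowed per user per window
WINDOW_SECONDS = 60         # fixed one-minute window
WEB_APPS_DEVICE_ID = "web-apps"  # hypothetical placeholder for an exempt Web Apps device ID


class InMemoryDeviceRateLimiter:
    """Tracks unique device IDs per (domain, user) in fixed time windows.

    Old windows are never pruned here; a real implementation would rely on
    expiring keys in a shared store rather than process memory.
    """

    def __init__(self, limit=DEVICE_LIMIT_PER_USER, clock=time.time):
        self.limit = limit
        self.clock = clock  # injectable so tests can control time
        self._devices = defaultdict(set)

    def _key(self, domain, user_id):
        # All requests in the same minute map to the same window number.
        window = int(self.clock() // WINDOW_SECONDS)
        return (domain, user_id, window)

    def rate_limit_device(self, domain, user_id, device_id):
        """Return True if the request should be rejected."""
        # Missing, blank, and Web Apps device IDs are exempt.
        if device_id is None or not device_id.strip():
            return False
        if device_id == WEB_APPS_DEVICE_ID:
            return False
        devices = self._devices[self._key(domain, user_id)]
        if device_id in devices:
            return False  # already counted in this window
        if len(devices) >= self.limit:
            return True   # a new device beyond the limit is rejected
        devices.add(device_id)
        return False
```

Because the window is fixed rather than sliding, a device rejected late in one minute is accepted again as soon as the next minute starts, which matches the "try again in a minute" guidance in the error message.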

Understanding past usage

The easiest way to get a rough sense of current usage at a larger scale is to look at the SyncLogSQL table and see how many different devices are used by a single user in different windows of time. This does not capture all activity (form submissions, for example), but it gives a good sense of restore activity for a specific user. Over the last month I've periodically looked at one-day windows across different days and weeks and have seen that these numbers are relatively consistent, with the exception of December 6th.

The smaller the time window, the less useful this data is, but it still gives a sense of the difference. The Datadog metrics included in this PR will give a much better sense of usage at the one-minute window level.

1 day window
December 6th: [2781, 349, 346, 328, 54, 46, ...] # the top 4 device counts are from the problematic domain
January 15th: [420, 224, 42, 41, 40, ...]

Random 1 hour window
December 6th: [776, 52, 36, 24, 5, ...]
January 17th: [17, 7, 5, 4, 3, ...]

Random 1 minute window
December 6th: [34, 3, 1, ...]
January 17th: [2, 1, ...]

Again, I don't think it is worth diving too deep into these numbers as the metric will be far more useful and accurate.
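For anyone wanting to reproduce this kind of per-window breakdown, here is a rough sketch of the counting step. It is not the query that produced the numbers above: `top_device_counts` and the `(user_id, device_id, timestamp)` record shape are hypothetical stand-ins for rows pulled from a sync log table.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def top_device_counts(records, start, window, top_n=5):
    """Return the largest per-user distinct-device counts in [start, start + window).

    `records` is an iterable of (user_id, device_id, timestamp) tuples,
    a hypothetical stand-in for sync log rows.
    """
    end = start + window
    devices_by_user = defaultdict(set)
    for user_id, device_id, timestamp in records:
        # Skip rows without a device ID and rows outside the window.
        if device_id and start <= timestamp < end:
            devices_by_user[user_id].add(device_id)
    counts = sorted((len(devices) for devices in devices_by_user.values()), reverse=True)
    return counts[:top_n]
```

Calling it with `window=timedelta(days=1)`, `timedelta(hours=1)`, or `timedelta(minutes=1)` gives lists in the same shape as the ones shown above, sorted from the heaviest user down.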

Feature Flag

Safety Assurance

Safety story

Based on Clayton's comment, here are the considerations outlined for this change.

How will users become aware of this change?

Assuming that this limit is currently being reached or exceeded (we won't know until metrics are collected), here is how users will become aware.

Mobile users will attempt an action like a sync or form submission and receive an error message explaining, in English, that current usage for this user is too high and that they should try again in a minute. Based on numbers pulled for devices used per user in a day (which hardly ever exceed 100 devices), in most cases where this limit is hit, trying again in a minute should succeed. If, however, thousands of devices attempt to take an action at roughly the same time, it will be a painful process for every device to restore or submit successfully. We suspect this scenario is limited to large events like trainings.

How are other legitimate use cases impacted?

Besides simultaneous use of one user across different physical devices, other scenarios that can lead to multiple device IDs for one user are:

Uninstalling and reinstalling the CommCare app
It seems nearly impossible to uninstall and reinstall CommCare 9 times in one minute, let alone log in as your user and trigger a restore or submission each time, so this scenario should not be a concern.
Clearing user data
Clearing user data is easier to do frequently, and could lead to frequent device ID changes, but again, a user would have to do this 9 times in one minute. That seems well past the point of debugging an issue.
Multiple CommCare app installations
Running with multiple installations of CommCare seems to be the most plausible for actually sending off 10 requests from 10 different installations for one user, but just because it is plausible doesn't mean it makes sense. I'm not too familiar with the use cases for different installations, but I imagine it is testing behavior between different mobile app versions or something along those lines, which should be limited to a few different device IDs at most.

Does this impact our offering?

I would say this does not impact our offering, but it does make it more painful to use the product in a way we did not intend (a significant number of devices for one user). This is a temporary rate limit for what we, SaaS, consider to be a reasonable number of devices per user in a one-minute window.

Automated test coverage

Created tests for the DeviceRateLimiter class (corehq.apps.users.tests.test_device_rate_limiter.TestDeviceRateLimiter)

QA Plan

Rollback instructions

This PR can be reverted after deploy with no further considerations

Labels & Review

Risk label is set correctly
The set of people pinged as reviewers is appropriate for the level of risk of the change


gherceg · 2025-01-20T22:19:31Z

Moving the remaining TODOs to the JIRA ticket:

Create list of projects/users at risk of hitting this limit

Create and link confluence page detailing this limit

The first one will be easier to do based on the datadog metric, and I can work on the confluence page while this is in review.

snopoke

LGTM

corehq/apps/users/device_rate_limiter.py

corehq/apps/users/tests/test_device_rate_limiter.py

millerdev · 2025-01-21T21:42:19Z

corehq/apps/receiverwrapper/views.py

+        # let normal response handle invalid xml
+        pass
+    else:
+        device_id = form_json.get('meta', {}).get('deviceID')


Can this be retrieved directly from the instance with instance.metadata.deviceID? Possibly it should be made forgiving of missing attributes with getattr()? I'm concerned about the overhead of adding convert_xform_to_json() here.

@gherceg pointed out in an offline discussion that instance is a byte string here, not a form object as it is in SubmissionPost later on.

It would be nice to pass form_json on from here to anything else that subsequently needs parsed form JSON to avoid having to re-parse in those places.

Attempted in d3e70de
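The idea in this thread (parse the submission once, read the device ID defensively, and hand the parsed JSON on so later code does not re-parse) could look roughly like the sketch below. The names `parse_form`, `get_device_id`, and `process_submission` are hypothetical, and `parse_form` is a toy stand-in built on the standard library, not convert_xform_to_json() or the change made in d3e70de.

```python
import xml.etree.ElementTree as ET


def _local_name(tag):
    # Drop an XML namespace prefix such as "{http://...}meta".
    return tag.rsplit("}", 1)[-1]


def parse_form(instance: bytes) -> dict:
    """Toy stand-in for the real XML-to-JSON conversion: extracts only <meta>."""
    root = ET.fromstring(instance)
    meta = {}
    for element in root.iter():
        if _local_name(element.tag) == "meta":
            meta = {_local_name(child.tag): child.text for child in element}
            break
    return {"meta": meta}


def get_device_id(form_json):
    """Read meta.deviceID, tolerating a missing or malformed meta section."""
    meta = form_json.get("meta") if isinstance(form_json, dict) else None
    return meta.get("deviceID") if isinstance(meta, dict) else None


def process_submission(instance: bytes, instance_json=None):
    """Reuse already-parsed JSON when provided; parse only if it is None."""
    if instance_json is None:
        instance_json = parse_form(instance)
    return get_device_id(instance_json)
```

Checking `instance_json is None` rather than truthiness matters here: an empty but already-parsed dict is still a valid result and should not trigger a second parse.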

corehq/apps/users/device_rate_limiter.py

Reproduced issue this caused with test:

    device_rate_limiter.rate_limit_device(self.domain, 'user-id', 'existing-device-id')
    device_rate_limiter.rate_limit_device(self.domain, 'user-id', 'existing-device-id')
    device_rate_limiter.rate_limit_device(self.domain, 'user-id', 'new-device-id')
    self.assertFalse(device_rate_limiter.rate_limit_device(self.domain, 'user-id', 'existing-device-id'))

which failed since the second rate_limit_device call removed 'existing-device-id', and the next call with 'new-device-id' effectively takes its place in the allowed device list.


gherceg · 2025-01-23T19:08:11Z

I'm not proud of d3e70de, but I wanted to make as minimal a change as I could to reduce risk.


corehq/form_processor/parsers/form.py

corehq/apps/users/device_rate_limiter.py

gherceg force-pushed the gh/rate-limiting/devices-per-user branch 6 times, most recently from 3e22f49 to 6f6f461 January 17, 2025 22:49

gherceg added 5 commits January 20, 2025 17:04

Create limiter to track device usage for each user

569a15a

Return 406 response when device is rate limited

c8e4f92

Added datadog metrics for rate limiter

74ac239

Add setting to enable/disable device rate limiter

6d270bb

Fix failing test using mock

bc9d976

TestAuditLoggingForFormSubmission::test_api_user_api_endpoint failed due to attempting to pass a Mock object to convert_xform_to_json

gherceg force-pushed the gh/rate-limiting/devices-per-user branch from 6f6f461 to bc9d976 January 20, 2025 22:12

gherceg marked this pull request as ready for review January 20, 2025 22:51

gherceg requested review from esoergel and snopoke as code owners January 20, 2025 22:51

snopoke approved these changes Jan 21, 2025

gherceg requested review from dannyroberts, millerdev and ctsims January 21, 2025 14:09

millerdev reviewed Jan 21, 2025

gherceg added 2 commits January 21, 2025 17:32

Use pipelines to reduce calls to redis

c90a5bd

millerdev approved these changes Jan 22, 2025

corehq/apps/users/device_rate_limiter.py (resolved)

gherceg added 2 commits January 22, 2025 17:53

Modify bucket boundaries

4014ac8

Pass json converted instance into SubmissionPost

d3e70de

This avoids unnecessarily converting the instance xml to json multiple times

dimagimon added the "Risk: High" label (change affects files that have been flagged as high risk) Jan 23, 2025

Add comment explaining rationale for 2 minute timeout

4c8c734

gherceg mentioned this pull request Jan 24, 2025

Enforce limit on number of devices per user #35515

Closed

3 tasks

gherceg added 4 commits January 24, 2025 17:41

Fix bug tracking wrong user id

04e2ce7

Only use login as user if available

b5c3aeb

Fix metric to report device count

0812b53

A histogram was the wrong choice for this metric since we report each individual usage as it happens, not the total number at the end of a window.

Fix bug introduced with previous change

881c35d

Don't code while you're sick

millerdev reviewed Jan 28, 2025

corehq/form_processor/parsers/form.py (outdated, resolved)

corehq/apps/users/device_rate_limiter.py (outdated, resolved)

corehq/apps/users/device_rate_limiter.py (resolved)

gherceg added 2 commits January 28, 2025 17:01

Address minor feedback

963cb42

Convert xform to json if instance_json IS None

0aa7482

millerdev approved these changes Jan 29, 2025

gherceg merged commit 2414923 into master Jan 29, 2025
14 checks passed

gherceg deleted the gh/rate-limiting/devices-per-user branch January 29, 2025 17:39
