Add new module API flag to bypass command validation #1357

Merged 2 commits into valkey-io:unstable on Feb 20, 2025

Conversation

sungming2
Contributor

@sungming2 sungming2 commented Nov 26, 2024

Issue #1175

Problem

Performance degradation occurs due to the sanity check (lookupCommandByCString) in the VM_Replicate function, which converts the const char* command name into an sds string, looks it up in the command dictionary, and frees it again on every call. This check is mainly useful for debugging, but it is unnecessary for trusted modules. The user seeks a way to bypass this check for performance gains.
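
For illustration, a typical module-side call that goes through this per-call lookup might look like the following (the command and argument values are hypothetical):

    /* Inside a module command handler: ask the server to propagate a command
     * verbatim to replicas/AOF. Before this change, every such call converts
     * "SET" to an sds and looks it up in the command table purely as a sanity
     * check. */
    ValkeyModule_Replicate(ctx, "SET", "cc", "mykey", "myvalue");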

Solution

Introduce a new SKIP_COMMAND_VALIDATION option that allows individual modules to opt out of command validation.
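
A rough sketch of the intended usage, assuming the option constant the API eventually settled on (VALKEYMODULE_OPTIONS_SKIP_COMMAND_VALIDATION) and a hypothetical module name; a trusted module would opt in once at load time:

    #include "valkeymodule.h"

    int ValkeyModule_OnLoad(ValkeyModuleCtx *ctx, ValkeyModuleString **argv, int argc) {
        VALKEYMODULE_NOT_USED(argv);
        VALKEYMODULE_NOT_USED(argc);
        if (ValkeyModule_Init(ctx, "mymodule", 1, VALKEYMODULE_APIVER_1) == VALKEYMODULE_ERR)
            return VALKEYMODULE_ERR;
        /* Trusted module: skip the command-name sanity check performed by
         * ValkeyModule_Replicate for all calls made by this module. */
        ValkeyModule_SetModuleOptions(ctx, VALKEYMODULE_OPTIONS_SKIP_COMMAND_VALIDATION);
        return VALKEYMODULE_OK;
    }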

Test

Unit test

@sungming2 sungming2 force-pushed the module-skip-validation branch from 18763c9 to c54ba74 on November 26, 2024 07:54
@sungming2 sungming2 force-pushed the module-skip-validation branch from c54ba74 to 24e6573 on November 26, 2024 08:02

codecov bot commented Nov 26, 2024

Codecov Report

Attention: Patch coverage is 0% with 10 lines in your changes missing coverage. Please review.

Project coverage is 70.95%. Comparing base (f85c933) to head (7099531).
Report is 4 commits behind head on unstable.

Files with missing lines Patch % Lines
src/module.c 0.00% 10 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1357      +/-   ##
============================================
- Coverage     71.13%   70.95%   -0.19%     
============================================
  Files           123      123              
  Lines         65531    65533       +2     
============================================
- Hits          46616    46496     -120     
- Misses        18915    19037     +122     
Files with missing lines Coverage Δ
src/module.c 9.61% <0.00%> (-0.01%) ⬇️

... and 11 files with indirect coverage changes

@hpatro hpatro requested review from hpatro and hwware November 26, 2024 17:18
@hpatro hpatro added the needs-doc-pr This change needs to update a documentation page. Remove label once doc PR is open. label Nov 26, 2024
@sungming2
Contributor Author

Opened a PR for the documentation of this: valkey-io/valkey-doc#192

@hwware
Member

hwware commented Nov 27, 2024

This PR comes from a performance issue with the current VM_Replicate. Could you please provide results showing how much performance improvement this PR brings?

@sungming2
Contributor Author

sungming2 commented Nov 28, 2024

I don't see a notable performance improvement in multiple tests (1. benchmark, 2. CPU utilization, 3. CPU profiling with perf), so I wonder if it is worth introducing this new API. @hwware @hpatro

1. Benchmark

With validation

====== eval return redis.call("propagate-test.validation") 0 ======                                                     
  100000 requests completed in 0.71 seconds
  50 parallel clients
  74 bytes payload
  keep alive: 1
  host configuration "save": 3600 1 300 100 60 10000
  host configuration "appendonly": no
  multi-thread: no

Latency by percentile distribution:
0.000% <= 0.111 milliseconds (cumulative count 1)
50.000% <= 0.295 milliseconds (cumulative count 54440)
75.000% <= 0.335 milliseconds (cumulative count 75261)
87.500% <= 0.383 milliseconds (cumulative count 88850)
93.750% <= 0.407 milliseconds (cumulative count 94702)
96.875% <= 0.423 milliseconds (cumulative count 96981)
98.438% <= 0.479 milliseconds (cumulative count 98634)
99.219% <= 0.503 milliseconds (cumulative count 99282)
99.609% <= 0.599 milliseconds (cumulative count 99611)
99.805% <= 0.791 milliseconds (cumulative count 99811)
99.902% <= 0.999 milliseconds (cumulative count 99903)
99.951% <= 1.367 milliseconds (cumulative count 99952)
99.976% <= 1.615 milliseconds (cumulative count 99976)
99.988% <= 1.679 milliseconds (cumulative count 99989)
99.994% <= 1.687 milliseconds (cumulative count 99995)
99.997% <= 1.799 milliseconds (cumulative count 99997)
99.998% <= 1.831 milliseconds (cumulative count 99999)
99.999% <= 1.895 milliseconds (cumulative count 100000)
100.000% <= 1.895 milliseconds (cumulative count 100000)

Cumulative distribution of latencies:
0.000% <= 0.103 milliseconds (cumulative count 0)
3.697% <= 0.207 milliseconds (cumulative count 2697)
60.307% <= 0.303 milliseconds (cumulative count 60307)
94.702% <= 0.407 milliseconds (cumulative count 94702)
99.282% <= 0.503 milliseconds (cumulative count 99282)
99.620% <= 0.607 milliseconds (cumulative count 99620)
99.725% <= 0.703 milliseconds (cumulative count 99725)
99.826% <= 0.807 milliseconds (cumulative count 99826)
99.878% <= 0.903 milliseconds (cumulative count 99878)
99.904% <= 1.007 milliseconds (cumulative count 99904)
99.919% <= 1.103 milliseconds (cumulative count 99919)
99.928% <= 1.207 milliseconds (cumulative count 99928)
99.939% <= 1.303 milliseconds (cumulative count 99939)
99.962% <= 1.407 milliseconds (cumulative count 99962)
99.968% <= 1.503 milliseconds (cumulative count 99968)
99.975% <= 1.607 milliseconds (cumulative count 99975)
99.995% <= 1.703 milliseconds (cumulative count 99995)
99.998% <= 1.807 milliseconds (cumulative count 99998)
100.000% <= 1.903 milliseconds (cumulative count 100000)

Summary:
  throughput summary: 141043.72 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.305     0.104     0.295     0.415     0.495     1.895

Bypass validation

====== eval return redis.call("propagate-test.novalidation") 0 ======                                                     
  100000 requests completed in 0.69 seconds
  50 parallel clients
  76 bytes payload
  keep alive: 1
  host configuration "save": 3600 1 300 100 60 10000
  host configuration "appendonly": no
  multi-thread: no

Latency by percentile distribution:
0.000% <= 0.119 milliseconds (cumulative count 1)
50.000% <= 0.279 milliseconds (cumulative count 52272)
75.000% <= 0.327 milliseconds (cumulative count 76134)
87.500% <= 0.367 milliseconds (cumulative count 87689)
93.750% <= 0.399 milliseconds (cumulative count 95285)
96.875% <= 0.423 milliseconds (cumulative count 97098)
98.438% <= 0.471 milliseconds (cumulative count 98588)
99.219% <= 0.543 milliseconds (cumulative count 99244)
99.609% <= 0.759 milliseconds (cumulative count 99619)
99.805% <= 0.927 milliseconds (cumulative count 99810)
99.902% <= 1.087 milliseconds (cumulative count 99903)
99.951% <= 1.271 milliseconds (cumulative count 99952)
99.976% <= 1.527 milliseconds (cumulative count 99977)
99.988% <= 1.575 milliseconds (cumulative count 99989)
99.994% <= 1.623 milliseconds (cumulative count 99994)
99.997% <= 1.663 milliseconds (cumulative count 99997)
99.998% <= 1.687 milliseconds (cumulative count 99999)
99.999% <= 1.719 milliseconds (cumulative count 100000)
100.000% <= 1.719 milliseconds (cumulative count 100000)

Cumulative distribution of latencies:
0.000% <= 0.103 milliseconds (cumulative count 0)
4.562% <= 0.207 milliseconds (cumulative count 4562)
67.223% <= 0.303 milliseconds (cumulative count 67223)
96.219% <= 0.407 milliseconds (cumulative count 96219)
99.021% <= 0.503 milliseconds (cumulative count 99021)
99.438% <= 0.607 milliseconds (cumulative count 99438)
99.562% <= 0.703 milliseconds (cumulative count 99562)
99.678% <= 0.807 milliseconds (cumulative count 99678)
99.787% <= 0.903 milliseconds (cumulative count 99787)
99.858% <= 1.007 milliseconds (cumulative count 99858)
99.910% <= 1.103 milliseconds (cumulative count 99910)
99.944% <= 1.207 milliseconds (cumulative count 99944)
99.954% <= 1.303 milliseconds (cumulative count 99954)
99.956% <= 1.407 milliseconds (cumulative count 99956)
99.973% <= 1.503 milliseconds (cumulative count 99973)
99.993% <= 1.607 milliseconds (cumulative count 99993)
99.999% <= 1.703 milliseconds (cumulative count 99999)
100.000% <= 1.807 milliseconds (cumulative count 100000)

Summary:
  throughput summary: 144300.14 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.294     0.112     0.279     0.399     0.503     1.719

2. CPU utilization

  • while continuously calling VM_Replicate/ReplicateWithFlag with the benchmark tool (benchmark -n -l)

With validation

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                    
  33349 sungming  20   0  144672  25056   5888 R  54.0   0.2  12:51.11 valkey-server    
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                  
   1598 sungming  20   0  144672  25084   5700 R  93.3   0.2   0:30.32 valkey-server  

Bypass validation

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                    
  33349 sungming  20   0  144672  25064   5888 S  52.8   0.2  13:29.67 valkey-server      
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                  
   1598 sungming  20   0  144672  25116   5700 R  92.7   0.2   1:06.09 valkey-server

3. Performance profiling with perf

With validation

-    8.10%     0.10%  valkey-server    valkey-server       [.] VM_Replicate                                                    ▒
   - 8.00% VM_Replicate                                                                                                        ▒
      - 7.65% moduleReplicate                                                                                                  ▒
         - 3.08% lookupCommandBySdsLogic                                                                                       ▒
            - 1.21% dictFetchValue                                                                                             ▒
                 0.88% dictFind                                                                                                ▒
            - 1.10% sdssplitlen.constprop.1                                                                                    ▒
                 0.56% _sdsnewlen                                                                                              ▒
         - 1.91% moduleCreateArgvFromUserFormat                                                                                ▒
              1.12% createEmbeddedStringObject                                                                                 ▒
           0.85% _sdsnewlen                                                                                                    ▒
           0.60% alsoPropagate                    

Bypass validation

-    3.84%     0.14%  valkey-server    valkey-server       [.] VM_ReplicateWithFlag                                            ▒
   - 3.70% VM_ReplicateWithFlag                                                                                                ▒
      - 3.47% moduleReplicate                                                                                                  ▒
         - 2.04% moduleCreateArgvFromUserFormat                                                                                ▒
              1.14% createEmbeddedStringObject                                                                                 ▒
              0.64% valkey_realloc                                                                                             ▒
           0.59% alsoPropagate                                                    

@hwware hwware added the major-decision-pending Major decision pending by TSC team label Dec 3, 2024
@hwware
Member

hwware commented Dec 3, 2024

@valkey-io/core-team Do you think this is a proper solution?

@sungming2
Contributor Author

sungming2 commented Dec 3, 2024

I added another CPU utilization test that pushes the CPU to around 90%.

@sungming2 sungming2 force-pushed the module-skip-validation branch 4 times, most recently from fd2cbc0 to 1f41ddd on January 15, 2025 02:04
@sungming2
Contributor Author

New performance test showing a performance increase:
#1175 (comment)

@enjoy-binbin
Member

With the new module API, if the command does not exist, the server will crash, right?

Are we considering adding flags to the client to bypass some checks? For example, a trusted client flag?

@zuiderkwast
Contributor

With the new module API, if the command does not exist, the server will crash, right?

@enjoy-binbin I may be wrong, but I believe the primary can replicate anything. The replica receiving it will try to execute it and will fail with an error, which is normally just ignored. But there is a config that makes the replica crash:

# The propagation error behavior controls how the server will behave when it is
# unable to handle a command being processed in the replication stream from a primary
# or processed while reading from an AOF file. Errors that occur during propagation
# are unexpected, and can cause data inconsistency. 
#
# If an application wants to ensure there is no data divergence, this configuration
# should be set to 'panic' instead. The value can also be set to 'panic-on-replicas'
# to only panic when a replica encounters an error on the replication stream. One of
# these two panic values will become the default value in the future once there are
# sufficient safety mechanisms in place to prevent false positive crashes.
#
# propagation-error-behavior ignore
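
As an illustration (not part of this PR), a deployment that prefers to fail fast on such divergence could set the stricter value in its config file, for example:

    # valkey.conf: panic only on replicas when a replication-stream error occurs
    propagation-error-behavior panic-on-replicas

With the default 'ignore', a bad command replicated by a module simply produces an error on the replica and is skipped.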

Are we considering adding flags to the client to bypass some checks? For example, a trusted client flag?

What do you mean? The module can send things to the replication stream even when no actual command was sent by any real client. A module is always trusted. A module can do all kinds of dangerous things that can crash the server.

@enjoy-binbin
Member

Right, sorry about the confusing wording; I got the context mixed up (I was reading other discussions at the same time). Anyway, what I mean is: can we add flags to the client, or something like ctx flags, to bypass some internal checks? I'm not sure about it though, since I'm not that experienced with modules, and I'm OK with the new module API if needed.

@zuiderkwast
Contributor

@valkey-io/core-team can we settle on the ReplicateWithoutValidation API? No flags added now. TSC please vote 👍 or 👎 .

@sungming2 sungming2 force-pushed the module-skip-validation branch from 1f41ddd to 67dbe62 on February 10, 2025 21:51
@hpatro
Collaborator

hpatro commented Feb 10, 2025

@valkey-io/core-team can we settle on the ReplicateWithoutValidation API? No flags added now. TSC please vote 👍 or 👎 .

@valkey-io/core-team nudge!

@hwware
Member

hwware commented Feb 11, 2025

@valkey-io/core-team can we settle on the ReplicateWithoutValidation API? No flags added now. TSC please vote 👍 or 👎 .

@valkey-io/core-team nudge!

Add it to the next meeting agenda.

zuiderkwast
zuiderkwast previously approved these changes Feb 12, 2025
Contributor

@zuiderkwast zuiderkwast left a comment


Implementation LGTM.

Member

@madolson madolson left a comment


Looks good to me as well, just minor suggestions.

@sungming2 sungming2 marked this pull request as draft February 17, 2025 22:55
@sungming2 sungming2 changed the title Add new replicate module API to bypass command validation Add new module API flag to bypass command validation Feb 18, 2025
@sungming2 sungming2 force-pushed the module-skip-validation branch from 6c68898 to 08bd13b on February 18, 2025 08:32
@zuiderkwast zuiderkwast self-requested a review February 18, 2025 09:23
@zuiderkwast zuiderkwast dismissed their stale review February 18, 2025 09:23

Previous approval no longer relevant

Contributor

@zuiderkwast zuiderkwast left a comment


When we discussed ValkeyModule_SetModuleOptions, it seems we forgot that there is no getter for these options.

Signed-off-by: Seungmin Lee <[email protected]>
@sungming2 sungming2 marked this pull request as ready for review February 18, 2025 22:44
Member

@madolson madolson left a comment


Nice! @valkey-io/core-team please double check the new updates. The surface area is much smaller, just a new option VALKEYMODULE_OPTIONS_SKIP_COMMAND_VALIDATION.

@hpatro
Collaborator

hpatro commented Feb 19, 2025

I guess it doesn't need to be a major decision anymore.

@zuiderkwast
Contributor

@madolson We made the decision about this new option in the last core team meeting, right? I think we can consider it approved.

@madolson
Member

@madolson We made the decision about this new option in the last core team meeting, right? I think we can consider it approved.

We've had apparent consensus in the past, but we didn't make it explicit. In the worst case, I would like folks to get the notification, and they can choose to ignore it.

@madolson madolson merged commit 54c4bbc into valkey-io:unstable Feb 20, 2025
51 checks passed
@madolson madolson added release-notes This issue should get a line item in the release notes major-decision-approved Major decision approved by TSC team and removed major-decision-pending Major decision pending by TSC team labels Feb 20, 2025
@sungming2 sungming2 deleted the module-skip-validation branch February 20, 2025 22:31
Projects
Status: To be backported