Strange weighted round robin behaviour with special configuration of weights #44

Open
kalindudc opened this issue May 11, 2022 · 8 comments

Comments

@kalindudc

Description

tl;dr: Using a test script, we can observe a momentary dip in traffic down to zero caused by the weighted round robin algorithm.

Setup

I am attempting to reproduce this behaviour using a test Lua script to investigate the behaviour of the weighted round robin (WRR) algorithm. The ingress_nginx balancer uses round robin (RR) as its default load balancing algorithm, and that RR is the default implementation provided by openresty.

Weights Configuration

-- config 1
local NODES = {}
NODES["10.0.0.1"] = 100
NODES["10.0.0.2"] = 100
NODES["10.0.0.3"] = 25

With this, the next step is to use the WRR algorithm to pick a node over a number of cycles and investigate the distribution of picks. We then repeat this with the second weight configuration to observe its distribution.

-- config 2
local NODES = {}
NODES["10.0.0.1"] = 100
NODES["10.0.0.2"] = 100
NODES["10.0.0.3"] = 66

Observations

For this test I am using a Lua script to call the WRR algorithm directly for 300 cycles. The distribution of picks for the first weight configuration is steady with little variation, which is expected; the overall distribution of traffic follows the weight configuration.

image

For the next weight configuration, we can observe that although the overall distribution of traffic adheres to the relative weights, the distribution over time is not steady. Node 10.0.0.3 is not picked by WRR for some time, and then the algorithm picks each node as if they were all equal in weight. We can also see that this pattern repeats: after some time, node 10.0.0.3 is again not picked by WRR for a stretch.

image

We can observe similar results regardless of the number of cycles. The fault in this implementation is more apparent when the number of cycles is less than the number of cycles required for the algorithm to pick node 10.0.0.3 at all.

image

image

Openresty WRR implementation

The openresty implementation is an accurate implementation of WRR. However, it is only accurate over a large number of requests distributed to the weighted nodes; it is not well suited to real-time traffic with varying weights. The algorithm is an interleaved WRR implementation that uses the greatest common divisor (GCD) of the weights to determine how likely a node is to be picked based on its weight.

The algorithm starts with the current weight threshold (cw) at the maximum weight and resumes iteration from the node that was last picked (last_id); only nodes whose weight is >= cw qualify for the next pick.

-- cycle and only pick a node, where node.weights >= cw
local last_id, cw, weight = self.last_id, self.cw

The last_id is randomized on initialization, so the first pick starts iterating from a random node.

On each pick the algorithm iterates through the set of nodes until one qualifies. After every full pass over all nodes, cw is decreased by the GCD, which increases each node's chance of being picked by (GCD / MAX_WEIGHT) * 100%.

cw = cw - self.gcd
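
For reference, here is a simplified sketch of the selection loop (paraphrasing the upstream roundrobin.lua; the single-node shortcut and other details are omitted):

local function find(self)
    local nodes = self.nodes
    local last_id, cw, weight = self.last_id, self.cw

    while true do
        while true do
            last_id, weight = next(nodes, last_id)
            if not last_id then
                break            -- finished one full pass over the nodes
            end

            if weight >= cw then -- node qualifies at the current threshold
                self.cw = cw
                self.last_id = last_id
                return last_id
            end
        end

        -- after a full pass, lower the threshold by the GCD and,
        -- once it reaches zero or below, reset it to the maximum weight
        cw = cw - self.gcd
        if cw <= 0 then
            cw = self.max_weight
        end
    end
end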

For example, for the first configuration of nodes we have GCD = 25 and MAX_WEIGHT = 100, so we pick each node in the following order.

Number of cycles: 20
Pick: 10.0.0.1, Weight: 100, Pick threshold: 75%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 75%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 50%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 50%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 25%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 25%
Pick: 10.0.0.3, Weight: 25, Pick threshold: 25%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 100%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 100%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 75%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 75%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 50%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 50%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 25%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 25%
Pick: 10.0.0.3, Weight: 25, Pick threshold: 25%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 100%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 100%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 75%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 75%

Distribution...
10.0.0.1: 45%
10.0.0.2: 45%
10.0.0.3: 10%

Although node 10.0.0.3 only has a weight of 25 compared to 100 for the other two nodes, the algorithm quickly increases its chance of being picked (by 25% every complete pass). However, for the second configuration of nodes we have GCD = 2 and MAX_WEIGHT = 100, which only allows an increase of 2% per complete pass over the nodes. This results in the following pattern, where node 10.0.0.3 is not picked for a long stretch.

Number of cycles: 34
Pick: 10.0.0.2, Weight: 100, Pick threshold: 98%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 98%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 96%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 96%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 94%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 94%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 92%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 92%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 90%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 90%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 88%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 88%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 86%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 86%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 84%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 84%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 82%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 82%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 80%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 80%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 78%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 78%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 76%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 76%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 74%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 74%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 72%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 72%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 70%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 70%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 68%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 68%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 66%
Pick: 10.0.0.3, Weight: 66, Pick threshold: 66%

Distribution...
10.0.0.1: 47.058823529412%
10.0.0.2: 50%
10.0.0.3: 2.9411764705882%
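
This count matches a quick back-of-the-envelope estimate (the exact number shifts by one or two depending on the randomized starting node):

-- Rough arithmetic, not library code: the threshold must fall from
-- MAX_WEIGHT (100) to 66 in steps of GCD (2) before 10.0.0.3 qualifies,
-- and each full pass picks the two weight-100 nodes once.
local max_weight, low_weight, gcd = 100, 66, 2
local passes = (max_weight - low_weight) / gcd   -- 17 full passes
print(passes * 2)                                -- ~34 picks until 10.0.0.3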

After the initial pick of node 10.0.0.3, we see a roughly uniform distribution across all nodes. After a while, when the pick threshold drops to <= 0%, it resets back to MAX_WEIGHT and the pattern repeats.

Number of cycles: 200
Pick: 10.0.0.1, Weight: 100, Pick threshold: 98%

...

Pick: 10.0.0.2, Weight: 100, Pick threshold: 70%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 68%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 68%
Pick: 10.0.0.3, Weight: 66, Pick threshold: 66%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 66%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 66%
Pick: 10.0.0.3, Weight: 66, Pick threshold: 64%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 64%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 64%
Pick: 10.0.0.3, Weight: 66, Pick threshold: 62%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 62%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 62%
Pick: 10.0.0.3, Weight: 66, Pick threshold: 60%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 60%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 60%
Pick: 10.0.0.3, Weight: 66, Pick threshold: 58%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 58%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 58%
Pick: 10.0.0.3, Weight: 66, Pick threshold: 56%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 56%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 56%

...

Pick: 10.0.0.2, Weight: 100, Pick threshold: 6%
Pick: 10.0.0.3, Weight: 66, Pick threshold: 4%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 4%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 4%
Pick: 10.0.0.3, Weight: 66, Pick threshold: 2%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 2%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 2%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 100%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 100%
Pick: 10.0.0.1, Weight: 100, Pick threshold: 98%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 98%

...

Pick: 10.0.0.1, Weight: 100, Pick threshold: 68%
Pick: 10.0.0.2, Weight: 100, Pick threshold: 68%
Pick: 10.0.0.3, Weight: 66, Pick threshold: 66%

Distribution...
10.0.0.1: 39%
10.0.0.2: 38.5%
10.0.0.3: 22.5%

When running this algorithm for a large number of cycles the overall weighted distribution is correct, but it still leads to an incorrect distribution of picks over smaller intervals.

Possible Solution

I am suggesting a slight modification to this algorithm to avoid this scenario. Instead of the GCD, we can use the smallest weight to step the threshold and better spread out the picked nodes.

-- Replace get_gcd(nodes) https://github.com/openresty/lua-resty-balancer/blob/a7a8b625c6d79d203702709983b736137be2a9bd/lib/resty/roundrobin.lua#L28
local function get_lowest_weight(nodes)
  local first_id, max_weight = next(nodes)
  if not first_id then
    return error("empty nodes")
  end

  local only_key = first_id
  local lowest = max_weight
  for id, weight in next, nodes, first_id do
    only_key = nil
    lowest = weight < lowest and weight or lowest
    max_weight = weight > max_weight and weight or max_weight
  end

  return only_key, max(lowest, 1), max_weight
end

-- ================================

-- Replace self.gcd with self.lowest
function _M.new(_, nodes)
  ...
  local only_key, lowest, max_weight = get_lowest_weight(newnodes)

  local self = {
      ...
      lowest = lowest,
      ...
  }
  return setmetatable(self, mt)
end

Now, instead of resetting the pick threshold back to max_weight as soon as it becomes <= 0%, we allow one guaranteed pick at a threshold of 0. This avoids the cases where a node can be skipped entirely when using the lowest weight to increase the pick chance.

local function find(self)
  ...

  while true do
    while true do
      ...
    end

    -- New logic
    if cw == 0 then
      cw = self.max_weight
    else
      cw = cw - self.lowest
      if cw < 0 then
          cw = 0
      end
    end
  end
end

With this solution I have run the same test for 20, 300, and 1000 cycles using the second weight configuration.

20 Cycles

image

Distribution...
10.0.0.1: 40%
10.0.0.2: 35%
10.0.0.3: 25%

300 Cycles

image

Distribution...
10.0.0.1: 37.666666666667%
10.0.0.2: 37.333333333333%
10.0.0.3: 25%

1000 Cycles

image

Distribution...
10.0.0.1: 37.5%
10.0.0.2: 37.5%
10.0.0.3: 25%

We can see that the overall distribution of picks still follows the respective weights, and we no longer have the pattern where node 10.0.0.3 is not picked for long periods of time. Even at small cycle counts (20) the distribution of picks is correct.

Some Limitations

Even with the above solution there are still configurations where this issue persists. For example, the following configuration will still produce a similar distribution of picks even with the modified algorithm.

local NODES = {}
NODES["10.0.0.1"] = 100
NODES["10.0.0.2"] = 100
NODES["10.0.0.3"] = 66
NODES["10.0.0.4"] = 1

We could avoid this by introducing an offset to self.lowest or by enforcing a minimum step. However, this will distort the relative weight distribution at small numbers of cycles.
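
For illustration, one such clamp might look like the following sketch; MIN_STEP is a hypothetical tuning constant, not something that exists in the library:

-- Hypothetical clamp on the decrement step. A larger step lets low-weight
-- nodes qualify sooner, but skews the weight ratio over short windows.
local max, min = math.max, math.min

local MIN_STEP = 10

local function get_step(lowest, max_weight)
    -- never step by less than MIN_STEP, and never by more than max_weight
    return min(max(lowest, MIN_STEP), max_weight)
end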

Conclusion

I do not have a perfect solution to this problem, but it is affecting real world traffic that relies on this algorithm to accurately distribute requests across a set of nodes. Something to keep in mind: nginx's implementation of the WRR algorithm is vastly different from the one implemented here, and on first inspection it does not appear to suffer from this same limitation. It is possible that algorithm could be adopted here. I will include a real world test with production data as a comment below in this issue.

Test script and setup

Setup

$ git clone git@github.com:openresty/lua-resty-balancer.git
$ touch lib/test.lua
Script
local rr = require("resty.roundrobin")

local BALANCE_CALLS = 1000

local function dump(o)
  if type(o) == 'table' then
    local s = '{ '
    for k,v in pairs(o) do
        if type(k) ~= 'number' then k = '"'..k..'"' end
        s = s .. '['..k..'] = ' .. dump(v) .. ','
    end
    return s .. '} '
  else
    return tostring(o)
  end
end

print("Setup Nodes ...")

local NODES = {}
NODES["10.0.0.1"] = 100
NODES["10.0.0.2"] = 100
NODES["10.0.0.3"] = 66

print(dump(NODES))

local NODE_COUNTS = {}
NODE_COUNTS["10.0.0.1"] = 0
NODE_COUNTS["10.0.0.2"] = 0
NODE_COUNTS["10.0.0.3"] = 0

print ("Setup roundrobin ...")
local rr_instance = rr:new(NODES)

print("Number of cycles: ", BALANCE_CALLS)

local out = io.open('data.csv', 'w')
for i = 1, BALANCE_CALLS do
  if (rr.DEBUG) then
    print(" ")
  end
  local node = rr_instance:find()
  print("Pick: ", node, ", Weight: ", NODES[node], ", Pick threshold: ", (rr_instance.cw / rr_instance.max_weight) * 100, "%")
  NODE_COUNTS[node] = NODE_COUNTS[node] + 1
  out:write(i .. "," .. NODE_COUNTS["10.0.0.1"] .. "," .. NODE_COUNTS["10.0.0.2"] .. "," .. NODE_COUNTS["10.0.0.3"] .. "\n")
end
out:close()

print("\nDistribution...")
print("10.0.0.1: ", NODE_COUNTS["10.0.0.1"] / BALANCE_CALLS * 100, "%")
print("10.0.0.2: ", NODE_COUNTS["10.0.0.2"] / BALANCE_CALLS * 100, "%")
print("10.0.0.3: ", NODE_COUNTS["10.0.0.3"] / BALANCE_CALLS * 100, "%")
Modified `roundrobin.lua`
local pairs = pairs
local next = next
local tonumber = tonumber
local setmetatable = setmetatable
local math_random = math.random
local error = error
local max = math.max

local utils = require "resty.balancer.utils"

local copy = utils.copy
local nkeys = utils.nkeys
local new_tab = utils.new_tab

local _M = {}
local mt = { __index = _M }


local function get_lowest_weight(nodes)
    local first_id, max_weight = next(nodes)
    if not first_id then
        return error("empty nodes")
    end

    local only_key = first_id
    local lowest = max_weight
    for id, weight in next, nodes, first_id do
        only_key = nil
        lowest = weight < lowest and weight or lowest
        max_weight = weight > max_weight and weight or max_weight
    end

    return only_key, max(lowest, 1), max_weight
end

local function get_random_node_id(nodes)
    local count = nkeys(nodes)

    local id = nil
    local random_index = math_random(count)

    for _ = 1, random_index do
        id = next(nodes, id)
    end

    return id
end


function _M.new(_, nodes)
    local newnodes = copy(nodes)
    local only_key, lowest, max_weight = get_lowest_weight(newnodes)
    local last_id = get_random_node_id(nodes)

    local self = {
        nodes = newnodes,  -- it's safer to copy one
        only_key = only_key,
        max_weight = max_weight,
        lowest = lowest,
        cw = max_weight,
        last_id = last_id,
    }
    return setmetatable(self, mt)
end


function _M.reinit(self, nodes)
    local newnodes = copy(nodes)
    self.only_key, self.lowest, self.max_weight = get_lowest_weight(newnodes)

    self.nodes = newnodes
    self.last_id = get_random_node_id(nodes)
    self.cw = self.max_weight
end


local function _delete(self, id)
    local nodes = self.nodes

    nodes[id] = nil

    self.only_key, self.lowest, self.max_weight = get_lowest_weight(nodes)

    if id == self.last_id then
        self.last_id = nil
    end

    if self.cw > self.max_weight then
        self.cw = self.max_weight
    end
end
_M.delete = _delete


local function _decr(self, id, weight)
    local weight = tonumber(weight) or 1
    local nodes = self.nodes

    local old_weight = nodes[id]
    if not old_weight then
        return
    end

    if old_weight <= weight then
        return _delete(self, id)
    end

    nodes[id] = old_weight - weight

    self.only_key, self.lowest, self.max_weight = get_lowest_weight(nodes)

    if self.cw > self.max_weight then
        self.cw = self.max_weight
    end
end
_M.decr = _decr


local function _incr(self, id, weight)
    local weight = tonumber(weight) or 1
    local nodes = self.nodes

    nodes[id] = (nodes[id] or 0) + weight

    self.only_key, self.lowest, self.max_weight = get_lowest_weight(nodes)
end
_M.incr = _incr



function _M.set(self, id, new_weight)
    local new_weight = tonumber(new_weight) or 0
    local old_weight = self.nodes[id] or 0

    if old_weight == new_weight then
        return
    end

    if old_weight < new_weight then
        return _incr(self, id, new_weight - old_weight)
    end

    return _decr(self, id, old_weight - new_weight)
end


local function find(self)
    local only_key = self.only_key
    if only_key then
        return only_key
    end

    local nodes = self.nodes
    local last_id, cw, weight = self.last_id, self.cw

    while true do
        while true do
            last_id, weight = next(nodes, last_id)
            if not last_id then
                break
            end

            if weight >= cw then
                self.cw = cw
                self.last_id = last_id
                return last_id
            end
        end

        if cw == 0 then
            cw = self.max_weight
        else
            cw = cw - self.lowest
            if cw < 0 then
                cw = 0
            end
        end
    end
end
_M.find = find
_M.next = find


return _M

Execute with

$ resty lib/test.lua

Data used in graphs: https://docs.google.com/spreadsheets/d/1ba570tbELUbG_N-Q-5vq15EyOP0x6wCd6X2Pu1ZAPrQ/edit?usp=sharing

@kalindudc (Author)

Real world experiment

tl;dr: Using a test application, we can observe a momentary dip in traffic down to zero before it recovers and shifts to the desired state.

Setup

I am attempting to reproduce this behaviour using a test application in the staging environment. The test application consists of 3 origins on separate clusters, with the initial weights set similar to what was originally observed.

image

With this, the next step is to generate load over a period of time, shift traffic, and observe the results.

Observations

For this test I am using fortio to generate 20k queries per second with 10 parallel threads. Initial traffic to the origin is steady with little variation, which is expected. Under this constant load, traffic shifts as expected when going from weight 25 -> 66; however, just before traffic settles at the desired state, we can observe a momentary dip in traffic down to zero followed by an immediate recovery. During this anomaly the other two origins saw a corresponding increase in traffic, accounting for the loss on this origin.

image

Looking at the Splunk log distribution for this test app, we can clearly see the dip in traffic and the corresponding increase for the other two origins. One thing to note: this dip in traffic did not cause any requests to fail; all requests were routed successfully with a status of 200.

image

I conducted this experiment 2 more times and observed the same behaviour, where traffic would momentarily dip to zero before recovering immediately.

image

As an additional measure, I decided to continue shifting traffic to this origin by setting its weight from 66 -> 100. This shift did not cause a sudden dip in traffic.

image

@lilien1010

hi @kalindudc, this is a great post explaining the potential issue in lua-resty-balancer. I agree with your idea on RR. And IIRC, the Nginx weighted round robin algorithm works like this:
1. Poll all nodes and calculate the sum of the effectiveWeight of all nodes in the current state as totalWeight.
2. Update the currentWeight of each node: currentWeight = currentWeight + effectiveWeight.
3. Select the node with the largest currentWeight of all nodes as the selected node.
4. Update currentWeight again for the selected node: currentWeight = currentWeight - totalWeight.
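
A rough Lua sketch of those four steps (illustrative only; this is not nginx's actual code and not a drop-in replacement for resty.roundrobin):

-- Smooth weighted round robin per the steps above. Failure handling via
-- effectiveWeight adjustments is omitted; the configured weight is used.
local function swrr_pick(nodes)
    -- nodes: { [id] = { weight = <number>, current = <number> } }
    local total, best_id, best = 0, nil, nil
    for id, node in pairs(nodes) do
        node.current = node.current + node.weight      -- step 2
        total = total + node.weight                    -- step 1
        if not best or node.current > best.current then
            best_id, best = id, node                   -- step 3
        end
    end
    best.current = best.current - total                -- step 4
    return best_id
end

-- With weights 100/100/66 the picks interleave, so the 66-weight node is
-- never starved for a long stretch.
local nodes = {
    ["10.0.0.1"] = { weight = 100, current = 0 },
    ["10.0.0.2"] = { weight = 100, current = 0 },
    ["10.0.0.3"] = { weight = 66,  current = 0 },
}
for _ = 1, 10 do print(swrr_pick(nodes)) end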

@doujiang24 (Member)

@kalindudc Nice catch, let's follow nginx's implementation. Patches welcome.

@jizhuozhi (Contributor)

I recently discovered that if any two weights are relatively prime, the GCD optimization fails, for example $n + 1 \equiv 1 \pmod n$ and $n \equiv 0 \pmod 1$, and we get many rounds of pointless traversals, which wastes CPU.

@jizhuozhi (Contributor)

I recently discovered that if any two elements are relatively prime, then GCD's optimization will fail, for example n+1≡1(modn) and n≡0(mod1), and we'll have many rounds of pointless traversals, which wastes CPU computing power

For example, with 1 instance of weight 100 and 99 instances of weight 1, each round wastes 99 iterations before touching the last instance.

@doujiang24 (Member)

@jizhuozhi For now, the round-robin pick is an O(n) operation, where n is the number of nodes.
If you really care about the CPU cost, we can precompute the order, trading more memory for less CPU.
e.g. https://www.infoq.cn/article/sebuh0k6ji*ytfqzcihb
Which is O(1).
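
For illustration, a very rough sketch of the precompute idea (this is not Tengine's actual VNSWRR code):

-- Expand the weights into a slot table once; each pick is then a single
-- array index, O(1) per request, at the cost of O(sum of weights) memory.
local NODES = { ["10.0.0.1"] = 100, ["10.0.0.2"] = 100, ["10.0.0.3"] = 66 }

local slots = {}
for id, weight in pairs(NODES) do
    for _ = 1, weight do
        slots[#slots + 1] = id
    end
end
-- A real implementation would interleave the slots (e.g. by filling them
-- with a smooth WRR pass) rather than appending contiguous runs.

local idx = 0
local function pick()
    idx = idx % #slots + 1
    return slots[idx]
end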

@doujiang24 (Member)

Also, O(n) is usually acceptable. You'd better profile it before optimizing.

@jizhuozhi (Contributor)

jizhuozhi commented Oct 7, 2022

@jizhuozhi For now, the round-robin is an O(n) operation, and n is the number of nodes. If you really care about the CPU performance, we can precompute the order, more memory for less CPU. eg. https://www.infoq.cn/article/sebuh0k6ji*ytfqzcihb Which is O(1).

Hello, @doujiang24. Thanks for your reply. I will follow this blog to learn how Tengine implements VNSWRR.

Before learning about VNSWRR, I preferred using the EDF algorithm to implement weighted load balancing. With a binary heap implementation, the space complexity is O(n), peek is O(1), and the cost of adjusting the heap is bounded by O(log(n)); however, its minimal positive period is $\Pi(weights)$, not $\Sigma(weights)$ as it is for SWRR.

Without changing the existing WRR implementation, I will provide a separate EDF implementation for selection.
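
For reference, a minimal sketch of the EDF idea (illustrative only; a real implementation would keep the nodes in a binary heap so the earliest deadline is peeked in O(1) and adjusted in O(log n)):

-- Earliest-deadline-first weighted pick: each pick pushes a node's deadline
-- forward by 1/weight, so heavier nodes come around more often and picks
-- interleave in proportion to the weights.
local function edf_pick(nodes)
    -- nodes: { [id] = { weight = <number>, deadline = <number> } }
    local best_id, best
    for id, node in pairs(nodes) do
        if not best or node.deadline < best.deadline then
            best_id, best = id, node
        end
    end
    best.deadline = best.deadline + 1 / best.weight
    return best_id
end

local nodes = {
    ["10.0.0.1"] = { weight = 100, deadline = 1 / 100 },
    ["10.0.0.2"] = { weight = 100, deadline = 1 / 100 },
    ["10.0.0.3"] = { weight = 66,  deadline = 1 / 66 },
}
for _ = 1, 10 do print(edf_pick(nodes)) end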
