Concurrent fetch of azure metricdefinitions and batchApi usage #41790

Draft: wants to merge 21 commits into base: main


MichaelKatsoulis
Contributor

@MichaelKatsoulis MichaelKatsoulis commented Nov 26, 2024

The changes affect the Azure monitor metricset and the metricsets built on it. The affected metricsets are:

  • monitor
  • container_registry
  • container_instance
  • container_service
  • compute_vm
  • compute_vm_scaleset
  • database_account

A new boolean configuration parameter, enable_batch_api, is introduced.
If set to false (the default), nothing changes in the way metrics are collected for these metricsets.

If set to true:

  • The metric definitions of resources are collected asynchronously, and the results are written to a channel.
  • The channel is read until the number of definitions collected reaches 50 (the batch API limit).
  • The metric definitions are then grouped based on the criteria below, and the Azure batch API is used to retrieve
    metrics of multiple resources with one API call.

Grouping criteria:

  • Namespace
  • SubscriptionID
  • Location
  • Names
  • TimeGrain
  • Dimensions
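
The flow described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the PR's code: the `metricDefinition` type and the `groupKey`, `groupByCriteria`, and `collectAndBatch` functions are hypothetical names, the producers only simulate the metric definitions call, and sorting names and dimensions before building the key is an assumption. The limit of 50 comes from the description.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// batchLimit is the batch API limit quoted in the PR description.
const batchLimit = 50

// metricDefinition is a hypothetical, simplified stand-in for what is
// collected per resource; the real type in the PR may differ.
type metricDefinition struct {
	ResourceID     string
	Namespace      string
	SubscriptionID string
	Location       string
	Names          []string
	TimeGrain      string
	Dimensions     []string
}

// groupKey builds a key from the grouping criteria listed in the description.
// Sorting names and dimensions is an assumption made for this sketch.
func groupKey(d metricDefinition) string {
	names := append([]string(nil), d.Names...)
	dims := append([]string(nil), d.Dimensions...)
	sort.Strings(names)
	sort.Strings(dims)
	return strings.Join([]string{
		d.Namespace, d.SubscriptionID, d.Location,
		strings.Join(names, ","), d.TimeGrain, strings.Join(dims, ","),
	}, "|")
}

// groupByCriteria splits definitions into groups that share the same key.
func groupByCriteria(defs []metricDefinition) map[string][]metricDefinition {
	groups := make(map[string][]metricDefinition)
	for _, d := range defs {
		k := groupKey(d)
		groups[k] = append(groups[k], d)
	}
	return groups
}

// collectAndBatch reads definitions from the channel. Every time batchLimit
// definitions have been collected (or the channel is closed), it groups them
// and calls query once per group.
func collectAndBatch(in <-chan metricDefinition, query func(batch []metricDefinition)) {
	pending := make([]metricDefinition, 0, batchLimit)
	flush := func() {
		for _, group := range groupByCriteria(pending) {
			query(group)
		}
		pending = pending[:0]
	}
	for d := range in {
		pending = append(pending, d)
		if len(pending) == batchLimit {
			flush()
		}
	}
	if len(pending) > 0 {
		flush()
	}
}

func main() {
	defs := make(chan metricDefinition)
	var wg sync.WaitGroup
	for i := 0; i < 120; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// A real implementation would fetch the metric definitions here.
			defs <- metricDefinition{
				ResourceID: fmt.Sprintf("vault-%03d", i),
				Namespace:  "Microsoft.KeyVault/vaults",
				Names:      []string{"Availability"},
				TimeGrain:  "PT1M",
			}
		}(i)
	}
	// Close the channel once every producer has written its result.
	go func() { wg.Wait(); close(defs) }()

	collectAndBatch(defs, func(batch []metricDefinition) {
		fmt.Printf("one batch call covering %d resources\n", len(batch))
	})
}
```

In this sketch, a group never exceeds 50 resources because grouping happens only on the 50 definitions collected so far; a different design could instead accumulate per group and flush each group when it fills up.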

Proposed commit message

  • WHAT: Introduce the enable_batch_api parameter for concurrent fetching of Azure metric definitions and collection of metric values using the batch API
  • WHY: Helps mitigate scalability problems

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Use cases

Screenshots

Logs

@MichaelKatsoulis MichaelKatsoulis requested review from a team as code owners November 26, 2024 12:08
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 26, 2024
@botelastic

botelastic bot commented Nov 26, 2024

This pull request doesn't have a Team:<team> label.

@MichaelKatsoulis MichaelKatsoulis marked this pull request as draft November 26, 2024 12:08
Contributor

mergify bot commented Nov 26, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b concurrent-fetch-of-azure-metricdefinitions upstream/concurrent-fetch-of-azure-metricdefinitions
git merge upstream/main
git push upstream concurrent-fetch-of-azure-metricdefinitions

Contributor

mergify bot commented Nov 26, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @MichaelKatsoulis? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

Contributor

mergify bot commented Nov 26, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Nov 26, 2024
@zmoog
Contributor

zmoog commented Jan 10, 2025

Microsoft.DocumentDb/databaseAccounts (1 resource)

resource type: Microsoft.DocumentDb/databaseAccounts
resource count: 1 resource
versions tested:

  • 8.17.1 (branch 8.17)
  • 9.0.0 (branch MichaelKatsoulis:concurrent-fetch-of-azure-metricdefinitions)

Activity:

  • I created one "Azure Cosmos DB for NoSQL", with Provisioned throughput (default settings)
  • I set up the standard Metricbeat database account module
# x-pack/metricbeat/modules.d/azure.yml
- module: azure
  metricsets:
  - database_account
  enabled: true
  period: 300s
  client_id: '${AZURE_CLIENT_ID:""}'
  client_secret: '${AZURE_CLIENT_SECRET:""}'
  tenant_id: '${AZURE_TENANT_ID:""}'
  subscription_id: '${AZURE_SUBSCRIPTION_ID:""}'
  refresh_list_interval: 600s
  • 8.17.1 and 9.0.0 are creating the same metrics (cardinality and values).

UPDATE: I didn't build the right version, I'm re-testing 9.0.0

8.17.1

[Screenshot: CleanShot 2025-01-10 at 13 16 51@2x]

9.0.0

  • Data collected regularly: yes

Issues

(1) Timegrain for azure.database_account.create_account.count is empty

[Screenshot: CleanShot 2025-01-10 at 15 49 18@2x]

In version 8.17.1, the timegrain for this field is PT5M.

(2) The azure.database_account.service_availability.avg (timegrain PT1H) is missing

Version 9.0.0 always collects 7 documents with PT5M, while version 8.17.1 collects 7 documents with PT5M plus 1 document with PT1H during the first iteration and again every 60 minutes.

Is 9.0.0 missing the PT1H document on the first iteration? Waiting for the next iteration to double-check.

After 75 mins, no azure.database_account.service_availability.avg field with PT1H.

[Screenshot: CleanShot 2025-01-10 at 16 30 53@2x]

UPDATE: tested by @MichaelKatsoulis

I managed to collect the azure.database_account.service_availability.avg field with PT1H using the PR code. The problem is that the API request asks for values of the ServiceAvailability and ReplicationLatency metrics with the Average aggregation. When values for both metrics are requested, service_availability.avg is always nil. If we remove ReplicationLatency and request values only for ServiceAvailability, service_availability.avg is returned correctly. I still do not know the reason for that.

@zmoog
Contributor

zmoog commented Jan 10, 2025

UPDATE: I built the wrong version, I'm re-testing 9.0.0 with Microsoft.DocumentDb/databaseAccounts (1 resource) and I'll update the previous comment.

My apologies for the noise.

@zmoog
Contributor

zmoog commented Jan 10, 2025

Microsoft.KeyVault/vaults (10 resources)

resource type: Microsoft.KeyVault/vaults
resource count: 10 resources
versions tested:

  • 8.17.1 (branch 8.17)
  • 9.0.0 (branch MichaelKatsoulis:concurrent-fetch-of-azure-metricdefinitions)

Activity:

  • I set up a custom Metricbeat config using the Azure Monitor metricset to target the key vaults
- module: azure  
  metricsets:  
    - monitor  
  enabled: true  
  period: 60s  
  client_id: '${AZURE_CLIENT_ID:""}'
  client_secret: '${AZURE_CLIENT_SECRET:""}'
  tenant_id: '${AZURE_TENANT_ID:""}'
  subscription_id: '${AZURE_SUBSCRIPTION_ID:""}'
  refresh_list_interval: 600s  
  resources:  
  - resource_query: "resourceType eq 'Microsoft.KeyVault/vaults'"  
    resource_group:  
    - "mbranca-az-scalability-kv-r10"    
    metrics:  
      - dimensions:  
          - name: ActivityType  
            value: '*'  
          - name: ActivityName  
            value: '*'  
          - name: StatusCode  
            value: '*'  
          - name: StatusCodeClass  
            value: '*'  
        ignore_unsupported: true  
        name:  
          - ServiceApiLatency  
          - Availability  
          - ServiceApiResult  
        namespace: Microsoft.KeyVault/vaults  
        timegrain: PT1M  
      - dimensions:  
          - name: ActivityType  
            value: '*'  
          - name: ActivityName  
            value: '*'  
        ignore_unsupported: true  
        name:  
          - ServiceApiHit  
        namespace: Microsoft.KeyVault/vaults  
        timegrain: PT1M  
      - dimensions:  
          - name: ActivityType  
            value: '*'  
          - name: ActivityName  
            value: '*'  
          - name: TransactionType  
            value: '*'  
        ignore_unsupported: true  
        name:  
          - SaturationShoebox  
        namespace: Microsoft.KeyVault/vaults  
        timegrain: PT1M

Notes:

When the key vaults are unused (like in this resource group), they only generate a subset of metrics:

  • Availability
  • API Hits
  • API Results.

8.17.1

In progress.

I can see the three metrics (Availability, API Hits, API Results), grouped in two documents. So 2 documents x 10 resources = 20 documents per iteration:

[Screenshot: CleanShot 2025-01-10 at 16 35 28@2x]

9.0.0

In progress.

First iterations are okay. I get the same number of documents (20) as 8.17.1 and same values.

[Screenshot: CleanShot 2025-01-10 at 16 48 50@2x]

Still checking, but this case looks good.

@zmoog
Contributor

zmoog commented Jan 10, 2025

@MichaelKatsoulis, I found a couple of issues related to timegrain in the Microsoft.DocumentDb/databaseAccounts (1 resource) test.

@zmoog
Contributor

zmoog commented Jan 10, 2025

Microsoft.ContainerRegistry/registries (1 resource)

resource type: Microsoft.ContainerRegistry/registries
resource count: 1 resource
versions tested:

  • 8.17.1 (branch 8.17)
  • 9.0.0 (branch MichaelKatsoulis:concurrent-fetch-of-azure-metricdefinitions)

Activity:

  • I set up a Metricbeat config using the container_registry metricset to target the container registry
- module: azure
  metricsets:
  - container_registry
  enabled: true
  period: 300s
  client_id: '${AZURE_CLIENT_ID:""}'
  client_secret: '${AZURE_CLIENT_SECRET:""}'
  tenant_id: '${AZURE_TENANT_ID:""}'
  subscription_id: '${AZURE_SUBSCRIPTION_ID:""}'
  refresh_list_interval: 600s

Since we had an issue with PT1H metrics, I tried another metricset with this timegrain.

8.17.1

After one iteration, 8.17.1 collected:

  • 1 document with PT5M every 5 minutes
  • 1 document with PT1H every 60 minutes

9.0.0

After one iteration, 9.0.0 collected:

  • 1 document with PT5M every 5 minutes
  • 1 document with PT1H every 60 minutes

Conclusion

✅ With the recent code changes 8.17.1 and 9.0.0 yield the same outcome.

[Screenshot: CleanShot 2025-01-15 at 13 23 47@2x]

Metrics docs

@zmoog
Contributor

zmoog commented Jan 15, 2025

@MichaelKatsoulis, I also re-ran the container registry test case, and 8.17.1 and 9.0.0 now yield the same outcome ✅

x-pack/metricbeat/module/azure/azure.go (resolved)
x-pack/metricbeat/module/azure/client_utils.go (outdated, resolved)
}

var monitorMetricsets = []string{"monitor", "container_registry", "container_instance", "container_service", "compute_vm", "compute_vm_scaleset", "database_account"}
Contributor

We are not including storage because it isn't working with the Batch API yet, right?

Contributor Author

I decided not to include it for now. First of all, it is not a monitor metricset, but the main reason is that, while testing, the documents returned with the batch API had slightly different values from those returned by the standard API.
It is related to the documents' dimensions, but it will require extra investigation. It can be included later.

Contributor

Correctness first.

return nil, fmt.Errorf("error initializing the monitor client: module azure - %s metricset: %w", metricsetName, err)
var monitorClient *Client
var monitorBatchClient *BatchClient
if containsString(monitorMetricsets, metricsetName) && config.EnableBatchApi {
Contributor

@zmoog zmoog Jan 23, 2025

We use the batch API components only if:

  1. the metricset is supported
  2. batch api is enabled

right?

Contributor Author

Exactly: if the metricset name is one of the supported ones (the monitor metricset and the ones that use monitor under the hood) and the parameter is enabled.

Contributor

Can we use a name that conveys this idea, like supportedMonitorMetricsets, for clarity?
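
The two conditions discussed in this thread could be captured in a small helper like the one below. This is a hedged sketch, not code from the PR: `useBatchAPI` is a hypothetical function, `containsString` is a local stand-in for the helper of the same name in the snippet under review, and the variable uses the name suggested above. The list itself is copied from the diff.

```go
package main

import "fmt"

// supportedMonitorMetricsets lists the metricsets that can use the batch API
// (name as suggested in the review; values copied from the diff).
var supportedMonitorMetricsets = []string{
	"monitor", "container_registry", "container_instance", "container_service",
	"compute_vm", "compute_vm_scaleset", "database_account",
}

// containsString reports whether list contains s (local stand-in).
func containsString(list []string, s string) bool {
	for _, item := range list {
		if item == s {
			return true
		}
	}
	return false
}

// useBatchAPI is a hypothetical helper: the batch components are used only
// when the metricset is supported AND the configuration flag is enabled.
func useBatchAPI(metricsetName string, enableBatchAPI bool) bool {
	return enableBatchAPI && containsString(supportedMonitorMetricsets, metricsetName)
}

func main() {
	for _, name := range []string{"monitor", "storage"} {
		fmt.Printf("%s with flag on: %v\n", name, useBatchAPI(name, true))
	}
	fmt.Printf("monitor with flag off: %v\n", useBatchAPI("monitor", false))
}
```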

}

// mapToEvents maps the metric values to events and reports them to Elasticsearch.
func (client *BatchClient) MapToEvents(metrics []Metric, reporter mb.ReporterV2) error {
Contributor

Besides InitResources(), GroupAndStoreMetrics(), and GetMetricsInBatch(), there are other functions, like MapToEvents() and AddVmToResource(), that seem to contain the same code as their non-batch counterparts.

Contributor Author

I updated the code to use a baseClient that implements all the common methods.
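
For readers unfamiliar with the pattern, sharing methods through an embedded base type in Go could look roughly like this. It is a simplified sketch, not the PR's implementation: the fields, the simplified `AddResource` and `MapToEvents` signatures, and `FetchMode` are all hypothetical and only illustrate how `Client` and `BatchClient` might reuse one implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// baseClient holds the state and behaviour shared by both clients
// (hypothetical layout for illustration).
type baseClient struct {
	resources []string
}

// AddResource is shared: both clients track resources the same way.
func (b *baseClient) AddResource(id string) {
	b.resources = append(b.resources, id)
}

// MapToEvents is shared: it turns metric values into printable "events".
func (b *baseClient) MapToEvents(values map[string]float64) []string {
	events := make([]string, 0, len(values))
	for name, v := range values {
		events = append(events, fmt.Sprintf("%s=%g", name, v))
	}
	sort.Strings(events) // deterministic order for the example
	return events
}

// Client and BatchClient embed baseClient, so the shared methods are
// promoted and only the fetching strategy needs its own implementation.
type Client struct{ baseClient }
type BatchClient struct{ baseClient }

func (c *Client) FetchMode() string      { return "one call per resource" }
func (c *BatchClient) FetchMode() string { return "one call per batch" }

func main() {
	c, b := &Client{}, &BatchClient{}
	c.AddResource("vault-001")
	b.AddResource("vault-001")
	fmt.Println(c.FetchMode(), c.MapToEvents(map[string]float64{"availability": 99.5}))
	fmt.Println(b.FetchMode(), b.MapToEvents(map[string]float64{"availability": 99.5}))
}
```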


if filter != "" {
metricsFilter = &filter
top = int32(10)
Contributor

I don't remember why we picked 10 here.

Contributor Author

That is a mistake. I will set it to nil, as in GetMetricValues.

Contributor

Wait, I remember we had a conversation with MSFT people around this param:

Azure/azure-sdk-for-go#22757

Contributor

@zmoog zmoog left a comment

The PR introduces two significant changes:

  • async fetch of metric definitions
  • batch API for metric values

The enable_batch_api configuration flag enables both async+batch changes for all metricsets but the storage metricset.

I still have to run performance tests on more than 100 key vaults, but I see the benefits of async+batch: IIRC, the main version couldn’t collect metrics for 100+ resources.

If the next performance tests show that, with the async+batch changes, we can collect metrics for 200, 300, 500, or more resources, we should consider shipping it by the 8.18 FF to improve scalability.

However, there are a few limits to the current PR implementation:

  • Code duplication (2aea321)
  • Duality [1] in the batch/non-batch versions of internal components
  • The storage metricset does not support async+batch

If we cannot address the current limits before the 8.18 FF, we should only consider shipping it in 8.18.0 if the performance increase is significant.

Footnotes

  1. UPDATE: A note on the "duality" in the limits. I feel that having two implementations (batch & non-batch) as a series of internal components increases complexity and maintenance costs.

@zmoog
Contributor

zmoog commented Jan 24, 2025

I'm starting test sessions on resource groups with 200, 400, and 800 key vaults.

@zmoog
Contributor

zmoog commented Jan 24, 2025

Microsoft.KeyVault/vaults (200 resources)

resource type: Microsoft.KeyVault/vaults
resource count: 200 resources
versions tested:

  • 9.0.0 (branch MichaelKatsoulis:concurrent-fetch-of-azure-metricdefinitions, commit 54d4c03)

Activity:

I set up a custom Metricbeat config using the Azure Monitor metricset to target the key vaults

- module: azure  
  metricsets:  
    - monitor  
  enabled: true  
  period: 60s  
  client_id: '${AZURE_CLIENT_ID:""}'
  client_secret: '${AZURE_CLIENT_SECRET:""}'
  tenant_id: '${AZURE_TENANT_ID:""}'
  subscription_id: '${AZURE_SUBSCRIPTION_ID:""}'
  refresh_list_interval: 600s  
  resources:  
  - resource_query: "resourceType eq 'Microsoft.KeyVault/vaults'"  
    resource_group:  
    - "mbranca-az-scalability-kv-r200"
    metrics:  
      - dimensions:  
          - name: ActivityType  
            value: '*'  
          - name: ActivityName  
            value: '*'  
          - name: StatusCode  
            value: '*'  
          - name: StatusCodeClass  
            value: '*'  
        ignore_unsupported: true  
        name:  
          - ServiceApiLatency  
          - Availability  
          - ServiceApiResult  
        namespace: Microsoft.KeyVault/vaults  
        timegrain: PT1M  
      - dimensions:  
          - name: ActivityType  
            value: '*'  
          - name: ActivityName  
            value: '*'  
        ignore_unsupported: true  
        name:  
          - ServiceApiHit  
        namespace: Microsoft.KeyVault/vaults  
        timegrain: PT1M  
      - dimensions:  
          - name: ActivityType  
            value: '*'  
          - name: ActivityName  
            value: '*'  
          - name: TransactionType  
            value: '*'  
        ignore_unsupported: true  
        name:  
          - SaturationShoebox  
        namespace: Microsoft.KeyVault/vaults  
        timegrain: PT1M

To have a few metrics from each key vault, I'm running the following script:

for i in $(seq -f "%03g" 1 200); do
    az keyvault show --resource-group mbranca-az-scalability-kv-r200 --name "mbrancar200s$i"
done

9.0.0

In progress.

| Metric | Value | Description |
| --- | --- | --- |
| Ramp up time | 22m | Time between metric value from first resource and all resources |
| Gap (min) | 4m | Minimum time between collections |
| Gap (max) | 7m | Maximum time between collections |
| Gap (avg) | ~5m | Average time between collections |

[Screenshot: CleanShot 2025-01-24 at 15 59 48@2x]

@zmoog
Contributor

zmoog commented Jan 24, 2025

Microsoft.KeyVault/vaults (400 resources)

resource type: Microsoft.KeyVault/vaults
resource count: 400 resources
versions tested:

  • 9.0.0 (branch MichaelKatsoulis:concurrent-fetch-of-azure-metricdefinitions, commit 54d4c03)

Activity:

I set up a custom Metricbeat config using the Azure Monitor metricset to target the key vaults

- module: azure  
  metricsets:  
    - monitor  
  enabled: true  
  period: 60s  
  client_id: '${AZURE_CLIENT_ID:""}'
  client_secret: '${AZURE_CLIENT_SECRET:""}'
  tenant_id: '${AZURE_TENANT_ID:""}'
  subscription_id: '${AZURE_SUBSCRIPTION_ID:""}'
  refresh_list_interval: 600s  
  resources:  
  - resource_query: "resourceType eq 'Microsoft.KeyVault/vaults'"  
    resource_group:  
    - "mbranca-az-scalability-kv-r400"
    metrics:  
      - dimensions:  
          - name: ActivityType  
            value: '*'  
          - name: ActivityName  
            value: '*'  
          - name: StatusCode  
            value: '*'  
          - name: StatusCodeClass  
            value: '*'  
        ignore_unsupported: true  
        name:  
          - ServiceApiLatency  
          - Availability  
          - ServiceApiResult  
        namespace: Microsoft.KeyVault/vaults  
        timegrain: PT1M  
      - dimensions:  
          - name: ActivityType  
            value: '*'  
          - name: ActivityName  
            value: '*'  
        ignore_unsupported: true  
        name:  
          - ServiceApiHit  
        namespace: Microsoft.KeyVault/vaults  
        timegrain: PT1M  
      - dimensions:  
          - name: ActivityType  
            value: '*'  
          - name: ActivityName  
            value: '*'  
          - name: TransactionType  
            value: '*'  
        ignore_unsupported: true  
        name:  
          - SaturationShoebox  
        namespace: Microsoft.KeyVault/vaults  
        timegrain: PT1M

To have a few metrics from each key vault, I'm running the following script:

for j in $(seq 1 500); do
    for i in $(seq -f "%03g" 1 400); do
        az keyvault show --resource-group mbranca-az-scalability-kv-r400 --name "mbrancar400s$i"
        echo "Iteration: $j, resource: mbrancar400s$i"
    done
done

9.0.0

In progress.

| Metric | Value | Description |
| --- | --- | --- |
| Ramp up time | 0 | Time between metric value from first resource and all resources |
| Gap (min) | 11m | Minimum time between collections |
| Gap (max) | 12m | Maximum time between collections |
| Gap (avg) | ~11m | Average time between collections |

[Screenshot: CleanShot 2025-01-24 at 17 26 41@2x]

@zmoog
Contributor

zmoog commented Jan 24, 2025

Microsoft.KeyVault/vaults (800 resources)

resource type: Microsoft.KeyVault/vaults
resource count: 800 resources
versions tested:

  • 9.0.0 (branch MichaelKatsoulis:concurrent-fetch-of-azure-metricdefinitions, commit 54d4c03)

Activity:

I set up a custom Metricbeat config using the Azure Monitor metricset to target the key vaults

- module: azure  
  metricsets:  
    - monitor  
  enabled: true  
  period: 60s  
  client_id: '${AZURE_CLIENT_ID:""}'
  client_secret: '${AZURE_CLIENT_SECRET:""}'
  tenant_id: '${AZURE_TENANT_ID:""}'
  subscription_id: '${AZURE_SUBSCRIPTION_ID:""}'
  refresh_list_interval: 600s  
  resources:  
  - resource_query: "resourceType eq 'Microsoft.KeyVault/vaults'"  
    resource_group:  
    - "mbranca-az-scalability-kv-r800"
    metrics:  
      - dimensions:  
          - name: ActivityType  
            value: '*'  
          - name: ActivityName  
            value: '*'  
          - name: StatusCode  
            value: '*'  
          - name: StatusCodeClass  
            value: '*'  
        ignore_unsupported: true  
        name:  
          - ServiceApiLatency  
          - Availability  
          - ServiceApiResult  
        namespace: Microsoft.KeyVault/vaults  
        timegrain: PT1M  
      - dimensions:  
          - name: ActivityType  
            value: '*'  
          - name: ActivityName  
            value: '*'  
        ignore_unsupported: true  
        name:  
          - ServiceApiHit  
        namespace: Microsoft.KeyVault/vaults  
        timegrain: PT1M  
      - dimensions:  
          - name: ActivityType  
            value: '*'  
          - name: ActivityName  
            value: '*'  
          - name: TransactionType  
            value: '*'  
        ignore_unsupported: true  
        name:  
          - SaturationShoebox  
        namespace: Microsoft.KeyVault/vaults  
        timegrain: PT1M

To have a few metrics from each key vault, I'm running the following script:

for j in $(seq 1 500); do
    for i in $(seq -f "%03g" 1 800); do
        az keyvault show --resource-group mbranca-az-scalability-kv-r800 --name "mbrancar800s$i"
        echo "Iteration: $j, resource: mbrancar800s$i"
    done
done

9.0.0

In progress.

| Metric | Value | Description |
| --- | --- | --- |
| Ramp up time | | Time between metric value from first resource and all resources |
| Gap (min) | 23m | Minimum time between collections |
| Gap (max) | 23m | Maximum time between collections |
| Gap (avg) | 23m | Average time between collections |

[Screenshot: CleanShot 2025-01-24 at 18 28 46@2x]

@zmoog
Contributor

zmoog commented Jan 24, 2025

Recap after running a batch of tests with 200, 400, and 800 resources and a collection period of 60s.

| Resources | Average gap | Time per resource (gap / resources) |
| --- | --- | --- |
| 200 | 5m | 1.5s |
| 400 | 11m | 1.65s |
| 800 786 | 23m | 1.75s |
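
The "time per resource" column is simply the average gap divided by the number of resources. A small sketch of that arithmetic follows; `secondsPerResource` is a hypothetical helper and the inputs are the rounded gaps from the recap. With these rounded values, 23m over 800 resources comes to about 1.725s, while 23m over 786 resources comes to about 1.756s, so the 1.75s figure may have been computed against 786; that reading is an assumption.

```go
package main

import "fmt"

// secondsPerResource divides the average gap (in minutes) by the number of
// resources and returns the result in seconds.
func secondsPerResource(gapMinutes float64, resources int) float64 {
	return gapMinutes * 60 / float64(resources)
}

func main() {
	rows := []struct {
		resources  int
		gapMinutes float64
	}{
		{200, 5},
		{400, 11},
		{800, 23},
		{786, 23},
	}
	for _, r := range rows {
		fmt.Printf("%d resources, %gm gap: %.3fs per resource\n",
			r.resources, r.gapMinutes, secondsPerResource(r.gapMinutes, r.resources))
	}
}
```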

Labels
backport-8.x Automated backport to the 8.x branch with mergify needs_team Indicates that the issue/PR needs a Team:* label