-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Concurrent fetch of azure metricdefinitions and batchApi usage #41790
base: main
Are you sure you want to change the base?
Concurrent fetch of azure metricdefinitions and batchApi usage #41790
Conversation
This pull request doesn't have a |
This pull request is now in conflicts. Could you fix it? 🙏
|
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
|
|
Microsoft.DocumentDb/databaseAccounts (1 resource)resource type:
Activity:
# x-pack/metricbeat/modules.d/azure.yml
- module: azure
metricsets:
- database_account
enabled: true
period: 300s
client_id: '${AZURE_CLIENT_ID:""}'
client_secret: '${AZURE_CLIENT_SECRET:""}'
tenant_id: '${AZURE_TENANT_ID:""}'
subscription_id: '${AZURE_SUBSCRIPTION_ID:""}'
refresh_list_interval: 600s
UPDATE: I didn't build the right version, I'm re-testing 9.0.0 8.17.19.0.0
Issues(1) Timegrain for azure.database_account.create_account.count is emptyIn version 8.17.1, the timegrain for this field is PT5M. (2) The azure.database_account.service_availability.avg (timegrain PT1H) is missingVersion 9.0.0 always collects 7 documents with PT5M, while version 8.17.1 collect 7 documents PT5M + 1 document PT1H during the first iteration and again every 60 mins. Is 9.0.0 missing the PT1H document on the first iteration? Waiting for the next iteration to double-check. After 75 mins, no UPDATE: tested by @MichaelKatsoulis I managed to collect |
UPDATE: I built the wrong version, I'm re-testing 9.0.0 with Microsoft.DocumentDb/databaseAccounts (1 resource) and I'll update the previous comment. My apologies for the noise. |
Microsoft.KeyVault/vaults (10 resources)resource type:
Activity:
- module: azure
metricsets:
- monitor
enabled: true
period: 60s
client_id: '${AZURE_CLIENT_ID:""}'
client_secret: '${AZURE_CLIENT_SECRET:""}'
tenant_id: '${AZURE_TENANT_ID:""}'
subscription_id: '${AZURE_SUBSCRIPTION_ID:""}'
refresh_list_interval: 600s
resources:
- resource_query: "resourceType eq 'Microsoft.KeyVault/vaults'"
resource_group:
- "mbranca-az-scalability-kv-r10"
metrics:
- dimensions:
- name: ActivityType
value: '*'
- name: ActivityName
value: '*'
- name: StatusCode
value: '*'
- name: StatusCodeClass
value: '*'
ignore_unsupported: true
name:
- ServiceApiLatency
- Availability
- ServiceApiResult
namespace: Microsoft.KeyVault/vaults
timegrain: PT1M
- dimensions:
- name: ActivityType
value: '*'
- name: ActivityName
value: '*'
ignore_unsupported: true
name:
- ServiceApiHit
namespace: Microsoft.KeyVault/vaults
timegrain: PT1M
- dimensions:
- name: ActivityType
value: '*'
- name: ActivityName
value: '*'
- name: TransactionType
value: '*'
ignore_unsupported: true
name:
- SaturationShoebox
namespace: Microsoft.KeyVault/vaults
timegrain: PT1M Notes: When the key vaults are unused (like in this resource group), they only generates a subset of metrics:
8.17.1In progress. I can see the three metrics (Availability, API Hits, API Results), grouped in two documents. So 2 documents x 10 resources = 20 documents per iteration: 9.0.0In progress. First iterations are okay. I get the same number of documents (20) as 8.17.1 and same values. Still checking, but this case looks good. |
@MichaelKatsoulis, I found a couple of issues relate to timegrain in the Microsoft.DocumentDb/databaseAccounts (1 resource) test. |
Microsoft.ContainerRegistry/registries (1 resource)resource type:
Activity:
- module: azure
metricsets:
- container_registry
enabled: true
period: 300s
client_id: '${AZURE_CLIENT_ID:""}'
client_secret: '${AZURE_CLIENT_SECRET:""}'
tenant_id: '${AZURE_TENANT_ID:""}'
subscription_id: '${AZURE_SUBSCRIPTION_ID:""}'
refresh_list_interval: 600s Since we had issue with PT1H metrics, I tried another metricset with this timegrain. 8.17.1After one iteration, 8.17.1 collected:
9.0.0After one iteration, 8.17.1 collected:
Conclusion✅ With the recent code changes 8.17.1 and 9.0.0 yield the same outcome. Metrics docs |
@MichaelKatsoulis, I also re-run the container registry test case and 8.17.1 and 9.0.0 now yield the same outcome ✅ |
} | ||
|
||
var monitorMetricsets = []string{"monitor", "container_registry", "container_instance", "container_service", "compute_vm", "compute_vm_scaleset", "database_account"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are not including storage
because we it isn't working with the Batch API yet, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to not include it for now. First of all it is not monitor
metricset but the main reason is that while testing the documents returned with the batch API are of slightly different value to the standard api.
It is related to the documents dimensions but it will require extra investigation. Can be included later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correctness first.
return nil, fmt.Errorf("error initializing the monitor client: module azure - %s metricset: %w", metricsetName, err) | ||
var monitorClient *Client | ||
var monitorBatchClient *BatchClient | ||
if containsString(monitorMetricsets, metricsetName) && config.EnableBatchApi { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use the batch API components only if:
- the metricset is supported
- batch api is enabled
right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exactly. if the metricset name is part of the supported (monitor metricset and the ones that user monitor under the hood) and the parameter is enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use a name that conveys this idea, like supportedMonitorMetricsets
, for clarity?
} | ||
|
||
// mapToEvents maps the metric values to events and reports them to Elasticsearch. | ||
func (client *BatchClient) MapToEvents(metrics []Metric, reporter mb.ReporterV2) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides InitResources()
, GroupAndStoreMetrics()
, and GetMetricsInBatch()
, there are other functions likes MapToEvents()
, AddVmToResource()
that seems to contain the same code as the non-batch counterpart.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the code by using a baseClient which implements all the common methods
|
||
if filter != "" { | ||
metricsFilter = &filter | ||
top = int32(10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember why we picked 10 here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a mistake. I will set it to nil like in GetMetricValues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, I remember we had a conversation with MSFT people around this param:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR introduces two significant changes:
- async fetch of metric definitions
- batch API for metric values
The enable_batch_api
configuration flag enables both async+batch changes for all metricsets but the storage
metricset.
I still have to run performance tests on more than 100 key values, but I see the benefits of async+batch: IIRC, the main
version couldn’t collect metrics for 100+ resources.
If the next performance tests show that with the async+batch changes, we can collect metrics for 200, 300, 500, or more resources, we should consider shipping it the 8.18 FF to improve the scalability.
However, there are a few limits to the current PR implementation:
Code duplication2aea321- Duality 1 in the batch/non-batch versions of internal components
- The
storage
metricset does not support async+batch
If we cannot address the current limits before the 8.18 FF, we should only consider shipping it in 8.18.0 if the performance increase is significant.
Footnotes
-
UPDATE: A note on the "duality" in the limits. I feel that having two implementations (batch & non-batch) as a series of internal components increases complexity and maintenance costs. ↩
I'm starting test sessions on resource groups with 200, 400, and 800 key vaults. |
Microsoft.KeyVault/vaults (200 resources)resource type:
Activity: I set up a custom Metricbeat config using the Azure Monitor metricset to target the key vaults - module: azure
metricsets:
- monitor
enabled: true
period: 60s
client_id: '${AZURE_CLIENT_ID:""}'
client_secret: '${AZURE_CLIENT_SECRET:""}'
tenant_id: '${AZURE_TENANT_ID:""}'
subscription_id: '${AZURE_SUBSCRIPTION_ID:""}'
refresh_list_interval: 600s
resources:
- resource_query: "resourceType eq 'Microsoft.KeyVault/vaults'"
resource_group:
- "mbranca-az-scalability-kv-r200"
metrics:
- dimensions:
- name: ActivityType
value: '*'
- name: ActivityName
value: '*'
- name: StatusCode
value: '*'
- name: StatusCodeClass
value: '*'
ignore_unsupported: true
name:
- ServiceApiLatency
- Availability
- ServiceApiResult
namespace: Microsoft.KeyVault/vaults
timegrain: PT1M
- dimensions:
- name: ActivityType
value: '*'
- name: ActivityName
value: '*'
ignore_unsupported: true
name:
- ServiceApiHit
namespace: Microsoft.KeyVault/vaults
timegrain: PT1M
- dimensions:
- name: ActivityType
value: '*'
- name: ActivityName
value: '*'
- name: TransactionType
value: '*'
ignore_unsupported: true
name:
- SaturationShoebox
namespace: Microsoft.KeyVault/vaults
timegrain: PT1M To have a few metrics from each key value, I'm running the following script: for i in $(seq -f "%03g" 1 200); do
az keyvault show --resource-group mbranca-az-scalability-kv-r200 --name "mbrancar200s$i"
done 9.0.0In progress.
|
Microsoft.KeyVault/vaults (400 resources)resource type:
Activity: I set up a custom Metricbeat config using the Azure Monitor metricset to target the key vaults - module: azure
metricsets:
- monitor
enabled: true
period: 60s
client_id: '${AZURE_CLIENT_ID:""}'
client_secret: '${AZURE_CLIENT_SECRET:""}'
tenant_id: '${AZURE_TENANT_ID:""}'
subscription_id: '${AZURE_SUBSCRIPTION_ID:""}'
refresh_list_interval: 600s
resources:
- resource_query: "resourceType eq 'Microsoft.KeyVault/vaults'"
resource_group:
- "mbranca-az-scalability-kv-r400"
metrics:
- dimensions:
- name: ActivityType
value: '*'
- name: ActivityName
value: '*'
- name: StatusCode
value: '*'
- name: StatusCodeClass
value: '*'
ignore_unsupported: true
name:
- ServiceApiLatency
- Availability
- ServiceApiResult
namespace: Microsoft.KeyVault/vaults
timegrain: PT1M
- dimensions:
- name: ActivityType
value: '*'
- name: ActivityName
value: '*'
ignore_unsupported: true
name:
- ServiceApiHit
namespace: Microsoft.KeyVault/vaults
timegrain: PT1M
- dimensions:
- name: ActivityType
value: '*'
- name: ActivityName
value: '*'
- name: TransactionType
value: '*'
ignore_unsupported: true
name:
- SaturationShoebox
namespace: Microsoft.KeyVault/vaults
timegrain: PT1M To have a few metrics from each key value, I'm running the following script: for j in $(seq 1 500); do
for i in $(seq -f "%03g" 1 400); do
az keyvault show --resource-group mbranca-az-scalability-kv-r400 --name "mbrancar400s$i"
echo "Iteration: $j, resource: mbrancar400s$i"
done
done 9.0.0In progress.
|
Microsoft.KeyVault/vaults (800 resources)resource type:
Activity: I set up a custom Metricbeat config using the Azure Monitor metricset to target the key vaults - module: azure
metricsets:
- monitor
enabled: true
period: 60s
client_id: '${AZURE_CLIENT_ID:""}'
client_secret: '${AZURE_CLIENT_SECRET:""}'
tenant_id: '${AZURE_TENANT_ID:""}'
subscription_id: '${AZURE_SUBSCRIPTION_ID:""}'
refresh_list_interval: 600s
resources:
- resource_query: "resourceType eq 'Microsoft.KeyVault/vaults'"
resource_group:
- "mbranca-az-scalability-kv-r800"
metrics:
- dimensions:
- name: ActivityType
value: '*'
- name: ActivityName
value: '*'
- name: StatusCode
value: '*'
- name: StatusCodeClass
value: '*'
ignore_unsupported: true
name:
- ServiceApiLatency
- Availability
- ServiceApiResult
namespace: Microsoft.KeyVault/vaults
timegrain: PT1M
- dimensions:
- name: ActivityType
value: '*'
- name: ActivityName
value: '*'
ignore_unsupported: true
name:
- ServiceApiHit
namespace: Microsoft.KeyVault/vaults
timegrain: PT1M
- dimensions:
- name: ActivityType
value: '*'
- name: ActivityName
value: '*'
- name: TransactionType
value: '*'
ignore_unsupported: true
name:
- SaturationShoebox
namespace: Microsoft.KeyVault/vaults
timegrain: PT1M To have a few metrics from each key value, I'm running the following script: for j in $(seq 1 500); do
for i in $(seq -f "%03g" 1 800); do
az keyvault show --resource-group mbranca-az-scalability-kv-r800 --name "mbrancar800s$i"
echo "Iteration: $j, resource: mbrancar800s$i"
done
done 9.0.0In progress.
|
Recap after running a batch of test with 200, 400, and 800 resources with a collection period of 60s.
|
The changes affect azure monitor and relevant metricsets. The list of metricsets affected are:
monitor
container_registry
container_instance
container_service
compute_vm
compute_vm_scaleset
database_account
A new configuration parameter is introduced
enable_batch_api
of type boolean.If set to
false
(default) nothing changes in the way the metrics are collected for these metricsets.If set to
true
:metrics of multiple resources with one api call.
Proposed commit message
Checklist
CHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.Disruptive User Impact
Author's Checklist
How to test this PR locally
Related issues
Use cases
Screenshots
Logs