
[AWS] [Billing] Duplicated data when having multiple tags #8942

Closed · Tracked by #8905
gpop63 opened this issue Jan 22, 2024 · 11 comments
gpop63 (Contributor) commented Jan 22, 2024

When using an AWS billing configuration that groups by a combination of tags and dimensions, such as SERVICE and multiple tags (for example, team, project, aws:createdBy), the same cost data ends up duplicated several times. This is due to a limitation of the GetCostAndUsage API, which only allows grouping by two group definitions at once.

In beats, we pair each tag with each dimension and initiate a request. The total number of Cost and Usage requests equals the number of tags multiplied by the number of dimensions.
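
As an illustration of the request fan-out described above, here is a minimal Go sketch (not the actual beats code; the dimension and tag values are example values taken from this thread):

package main

import "fmt"

func main() {
	// Example grouping configuration (assumed values from this thread).
	dimensions := []string{"AZ", "SERVICE"}
	tags := []string{"team", "project", "aws:createdBy"}

	// GetCostAndUsage accepts at most two GroupBy definitions per request,
	// so each (dimension, tag) pair gets its own request.
	requests := 0
	for _, d := range dimensions {
		for _, t := range tags {
			fmt.Printf("GetCostAndUsage GroupBy=[DIMENSION:%s, TAG:%s]\n", d, t)
			requests++
		}
	}

	// 2 dimensions x 3 tags = 6 requests, and each one reports the full
	// cost again, which is where the duplicated totals come from.
	fmt.Println("total requests:", requests)
}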

Possible solutions:

  • Allow users to add a filter
    • When grouping by multiple tags and dimensions, we make several GetCostAndUsage requests. We would need a way to know which filter to use for which request.
  • I'm exploring if the new Data Export feature could replace the Cost and Usage API to solve these issues (GCP billing works in a similar way).

@agithomas @lalit-satapathy

agithomas (Contributor) commented

@kaiyan-sheng, you developed the current AWS billing integration. What do you think of the new approach mentioned in the description?

gpop63 (Contributor, Author) commented Feb 9, 2024

Athena Exploration

Amazon Athena is a query service frequently used for log analysis and big data analytics. It can analyze logs from various AWS services such as CloudTrail and CloudFront, as well as application logs.

Prerequisites

  • Billing report through Data Exports feature (if querying billing data)
  • Athena setup (database and table)
  • S3 bucket located in the same region as Athena to store query results
    • Query results can be reused for a certain period of time to avoid costs

Implementation Capabilities

Athena Input Integration

As suggested by @agithomas, this could serve as an input package. It would allow running any SQL query against a table and selecting which fields to include in the ES documents. Users would be able to query any of their data from an S3 bucket as long as it's in one of the supported formats.
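
As a rough illustration of what such an input could do under the hood, here is a minimal sketch using the Athena client from the AWS SDK for Go v2. The query, database, and output bucket are placeholders taken from the example config below; this is not an existing beats implementation:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/athena"
	"github.com/aws/aws-sdk-go-v2/service/athena/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := athena.NewFromConfig(cfg)

	// Submit the user-provided SQL query (placeholder query and names).
	start, err := client.StartQueryExecution(ctx, &athena.StartQueryExecutionInput{
		QueryString:           aws.String("SELECT product_servicecode, SUM(line_item_unblended_cost) AS cost FROM db1.t1 GROUP BY product_servicecode"),
		QueryExecutionContext: &types.QueryExecutionContext{Database: aws.String("db1")},
		ResultConfiguration:   &types.ResultConfiguration{OutputLocation: aws.String("s3://example/")},
	})
	if err != nil {
		log.Fatal(err)
	}

	// Poll until the query finishes; a real input would handle timeouts
	// and retries more carefully.
	for {
		out, err := client.GetQueryExecution(ctx, &athena.GetQueryExecutionInput{
			QueryExecutionId: start.QueryExecutionId,
		})
		if err != nil {
			log.Fatal(err)
		}
		state := out.QueryExecution.Status.State
		if state == types.QueryExecutionStateSucceeded {
			break
		}
		if state == types.QueryExecutionStateFailed || state == types.QueryExecutionStateCancelled {
			log.Fatalf("query ended in state %s", state)
		}
		time.Sleep(2 * time.Second)
	}

	// Read the result rows; the first row holds the column headers, and
	// each following row would become one ES document.
	res, err := client.GetQueryResults(ctx, &athena.GetQueryResultsInput{
		QueryExecutionId: start.QueryExecutionId,
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, row := range res.ResultSet.Rows {
		for _, col := range row.Data {
			fmt.Print(aws.ToString(col.VarCharValue), "\t")
		}
		fmt.Println()
	}
}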

AWS Billing Integration

While the SQL query and fields remain customizable, the default config would prioritize key columns from the billing data report. Requesting additional fields would not require beats changes.

Example of billing config

- module: aws
  period: 1m
  access_key_id: <REDACTED>
  secret_access_key: <REDACTED>
  regions:
    - eu-west-1
  metricsets:
    - billingv2
  athena_config:
    table: t1
    database: db1
    query_results_s3_bucket: s3://example/
    sql_query: |
      SELECT
          CAST(SUM(t1.line_item_unblended_cost) AS DECIMAL(10, 2)) AS UnblendedCost,
          product_servicecode as ProductServiceCode,
          identity_time_interval as IdentityTimeInterval,
          resource_tags as ResourceTags
      FROM
          db1.t1
      GROUP BY
          product_servicecode,
          identity_time_interval,
          resource_tags
      HAVING
          CAST(SUM(t1.line_item_unblended_cost) AS DECIMAL(10, 2)) > 0.00;
    columns:
      - name: UnblendedCost
      - name: ProductServiceCode
        unique: true
      - name: IdentityTimeInterval
        unique: true
      - name: ResourceTags
        unique: true
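
One assumption behind the columns section above is that the fields marked unique: true jointly identify a row, so they could be hashed into a deterministic document ID to keep re-ingestion of the same report idempotent. A small illustrative Go sketch of that idea (hypothetical helper, not an existing implementation; the row values are made up):

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// docID builds a deterministic ID from the values of the columns marked
// unique: true, so re-ingesting the same report row would update the same
// Elasticsearch document instead of creating a duplicate.
func docID(row map[string]string, uniqueColumns []string) string {
	h := sha256.New()
	for _, name := range uniqueColumns {
		h.Write([]byte(name))
		h.Write([]byte{0}) // separator to avoid ambiguous concatenations
		h.Write([]byte(row[name]))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Columns marked unique in the example config above.
	uniqueColumns := []string{"ProductServiceCode", "IdentityTimeInterval", "ResourceTags"}
	row := map[string]string{
		"ProductServiceCode":   "AmazonEC2",
		"IdentityTimeInterval": "2024-02-01T00:00:00Z/2024-02-01T01:00:00Z",
		"ResourceTags":         `{"team":"obs","project":"billing"}`,
	}
	fmt.Println(docID(row, uniqueColumns))
}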


Benefits and Drawbacks

Benefits

  • No need for beats changes when requesting additional fields
  • Applicable to use cases beyond billing data
  • Provides more granular access to AWS billing data, allowing for the selection of any desired fields from reports
  • Can query data in various formats
  • Data exports work with multiple accounts
  • The tag issue we have in the current integration would not be present

Drawbacks

  • Athena usage costs $5.00 per TB of data scanned
  • Setup of Athena, S3 bucket, and data exports report is necessary
  • Users would need SQL knowledge to customize the query
  • Data export reports in AWS can be refreshed multiple times throughout the day, and AWS may perform updates that affect these exports at any time — there's a risk of having outdated data. This issue isn't unique to data exports but also applies to the current implementation that utilizes the GetCostAndUsage API.

@agithomas @tommyers-elastic

tommyers-elastic (Contributor) commented

thanks for the detail @gpop63. i'm sceptical of the athena solution primarily because of the effort required to set it up.

could a short-term solution to this issue be to document the limitations of the current integration wrt tags, and make it very clear that incorrect (inflated) cost data will be reported if multiple tags/dimensions are present?

in terms of other things we could do that continue to utilize the existing cost API input, could we change how we do the groupings so that we get accurate cost data, perhaps with a limit on the number of supported dimensions? that way we would never report incorrect data, even if it means customers cannot query it in such a granular way.

agithomas (Contributor) commented

> thanks for the detail @gpop63. i'm sceptical of the athena solution primarily because of the effort required to set it up.

Would providing an AWS CloudFormation template or Terraform configuration, and including them as part of the README, simplify the setup process?

agithomas (Contributor) commented

Should we revisit the default configurations? Keep only the AZ and SERVICE?

In a large AWS setup, having aws:createdBy led to a large number of documents.


I think, apart from the README, we can consider adding a hint in the configuration to limit the number of dimensions to 2.

m-adams commented Feb 14, 2024

Can we come at this from a customer-zero perspective?
The people who will get the most value from this will be large orgs wanting to do some form of FinOps on the data.
To do that usefully, you need to pull data at a granular level and then let people analyse that data in the stack.
We need a solution that, at minimum, is useful internally, as our tagging scheme is not exactly that complex. The basic version being described, although easier to set up, doesn't seem to actually be useful when using user-defined cost allocation tags, which seems to be the direction people are pushed toward to track their costs.
Maybe there could be a basic and an advanced option if we need to maintain something that is very easy to set up.

tommyers-elastic (Contributor) commented

@vinaychandrasekhar @SubhrataK it sounds like we need some research on the best way forward here before we implement anything new.

@gpop63 @agithomas at a minimum right now we should document the issue and/or remove the ability to configure the existing integration in a way that causes incorrect billing data to be generated.

cc @lalit-satapathy

vinaychandrasekhar commented

@SubhrataK @lalit-satapathy - are we tracking this research effort? Do you need any input from me?

lalit-satapathy (Collaborator) commented

@gpop63,

Please update the issue with the final summary of the research work done so far and close it, as we are updating the docs here: #9290.

It would be nice to have a proposed architecture for future discussion.

m-adams commented Mar 21, 2024

If we close this issue, can we open a new one for making multi-tag analysis work, please?

gpop63 (Contributor, Author) commented Apr 1, 2024

Leaving the steps here as a reference in case it is decided to proceed with the implementation.

This should cover the steps required both in AWS and in the Agent (Metricbeat).

AWS:

  • Credentials
  • Standard Data Export
    • S3 bucket to store reports
    • Can be in CSV or Parquet format
  • Athena database & table from the data export report
    • This is easily done by creating a table from the S3 bucket data source (an example DDL sketch follows this list)
    • S3 bucket where query results will be stored (they can be reused)
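
For reference, the "create a table from S3 bucket data source" step can also be expressed as Athena DDL and submitted with StartQueryExecution as in the earlier sketch. The schema below is a reduced placeholder covering only the columns used in the example query; a real Data Exports table has many more columns, and resource_tags may be a map type depending on the export settings:

package main

// createTableDDL is an assumed example of the DDL behind the console's
// "create table from S3 data source" wizard. Database, table, and bucket
// names are placeholders.
const createTableDDL = `
CREATE EXTERNAL TABLE IF NOT EXISTS db1.t1 (
  line_item_unblended_cost double,
  product_servicecode      string,
  identity_time_interval   string,
  resource_tags            string
)
STORED AS PARQUET
LOCATION 's3://example-data-exports/reports/'
`

func main() {
	// Submit createTableDDL via athena.StartQueryExecution, exactly as in
	// the query sketch earlier in this thread (omitted here for brevity).
	_ = createTableDDL
}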

Agent (Metricbeat):

  • Use AWS credentials as usual
  • Configure the Athena-related settings:
    • Table, database, and S3 bucket for query results
    • Customize the SQL query and columns (if needed; otherwise the defaults can be used)
Metricbeat config example

- module: aws
  period: 1m
  access_key_id: <REDACTED>
  secret_access_key: <REDACTED>
  regions:
    - eu-west-1
  metricsets:
    - billingv2
  athena_config:
    table: t1
    database: db1
    query_results_s3_bucket: s3://example/
    sql_query: |
      SELECT
          CAST(SUM(t1.line_item_unblended_cost) AS DECIMAL(10, 2)) AS UnblendedCost,
          product_servicecode as ProductServiceCode,
          identity_time_interval as IdentityTimeInterval,
          resource_tags as ResourceTags
      FROM
          db1.t1
      GROUP BY
          product_servicecode,
          identity_time_interval,
          resource_tags
      HAVING
          CAST(SUM(t1.line_item_unblended_cost) AS DECIMAL(10, 2)) > 0.00;
    columns:
      - name: UnblendedCost
      - name: ProductServiceCode
        unique: true
      - name: IdentityTimeInterval
        unique: true
      - name: ResourceTags
        unique: true


For now, we have created PR #9290 to document the limitation of the API.

gpop63 closed this as completed on Apr 1, 2024