[batch] Add Support for AWS Batch #612

Open
cschwartz1020 opened this issue Dec 11, 2024 · 7 comments
Labels
feature-request New feature

Comments

@cschwartz1020

Feature scope

AWS Batch

Describe your suggested feature

Feature request is for an AWS Batch Monitoring construct

@cschwartz1020 added the feature-request New feature label Dec 11, 2024
@echeung-amzn
Member

Do you have particular alarms and dashboard widgets that you think would make sense for Batch users?

@echeung-amzn changed the title [AWS Batch] Add Support for AWS Batch [batch] Add Support for AWS Batch Dec 13, 2024
@cschwartz1020
Author

Do you have particular alarms and dashboard widgets that you think would make sense for Batch users?

The most basic requirement would be widgets that show the number of Batch jobs in any given status (SUBMITTED, PENDING, RUNNABLE, STARTING, RUNNING, SUCCEEDED, FAILED) for a given Job Queue or Job Definition.

However, I do understand this would likely be a large effort given that these metrics are currently not even sent to CloudWatch (i.e. there's no Batch CW namespace: no native metrics or CW integration). I have seen this solved before via EventBridge rules that route Batch Job State Change event detail types to an SNS Topic target; from there you can track the "NumberOfMessagesPublished" metric in the AWS/SNS namespace. This is somewhat of a heuristic, though, as it tells you how many jobs entered a given state during a period rather than how many jobs are in a given state. Regardless, it would be nice to have a construct that takes care of all this heavy lifting for you via .monitorBatchJob(..). It would also be nice to add a dimension of EC2 Instance Type, so you can see how workloads are spread across the instances configured on the Batch ComputeEnvironment.
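To illustrate, a rough CDK sketch of that EventBridge-to-SNS wiring for a single status could look like this (the queue ARN and construct names are placeholders, not anything this repo provides today):

```ts
import { Duration } from 'aws-cdk-lib';
import { Metric } from 'aws-cdk-lib/aws-cloudwatch';
import { Rule } from 'aws-cdk-lib/aws-events';
import { SnsTopic } from 'aws-cdk-lib/aws-events-targets';
import { Topic } from 'aws-cdk-lib/aws-sns';
import { Construct } from 'constructs';

// Placeholder job queue ARN -- replace with your own.
const jobQueueArn = 'arn:aws:batch:us-east-1:123456789012:job-queue/my-queue';

export class BatchJobStateMonitoring extends Construct {
  public readonly failedJobsMetric: Metric;

  constructor(scope: Construct, id: string) {
    super(scope, id);

    const topic = new Topic(this, 'BatchFailedJobsTopic');

    // Route FAILED state changes for one job queue to the SNS topic.
    new Rule(this, 'BatchFailedJobsRule', {
      eventPattern: {
        source: ['aws.batch'],
        detailType: ['Batch Job State Change'],
        detail: {
          status: ['FAILED'],
          jobQueue: [jobQueueArn],
        },
      },
      targets: [new SnsTopic(topic)],
    });

    // Proxy metric: how many jobs *entered* FAILED during the period,
    // counted as messages published to the topic.
    this.failedJobsMetric = new Metric({
      namespace: 'AWS/SNS',
      metricName: 'NumberOfMessagesPublished',
      dimensionsMap: { TopicName: topic.topicName },
      statistic: 'Sum',
      period: Duration.minutes(5),
    });
  }
}
```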

Beyond that, it would be nice to have basic CPU/GPU (mem/util) metric widgets from the nodes on the underlying ECS/EKS cluster powering the Batch ComputeEnvironment.

@straygar
straygar commented Jan 24, 2025

I've built something like this within my team, but unfortunately it's not clear to me how to contribute something that's backed by a custom Lambda to this repo, as everything seems to rely on AWS exposing the metrics.

You basically have two ways of getting metrics for AWS Batch:

  • consuming the events Batch publishes (basically just the Batch Job State Change event, which excludes SUBMITTED)
    • to get SUBMITTED I added a separate EventBridge rule that listens for successful CloudTrail SubmitJob API calls (sketched below)
    • lightweight and real-time, but as you say, you can't see aggregations like "total number of jobs in state X at this time"
  • running a scheduled Lambda that calls ListJobs for each queue and publishes some stats
    • the most frequent an EventBridge schedule can trigger is once per minute, so the resolution is limited
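Roughly, the SUBMITTED rule I mentioned looks like this (a CDK sketch; the rule ID and Lambda target are illustrative, and it assumes CloudTrail is recording Batch management events):

```ts
import { Rule } from 'aws-cdk-lib/aws-events';
import { LambdaFunction } from 'aws-cdk-lib/aws-events-targets';
import { IFunction } from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

// Capture successful SubmitJob API calls recorded by CloudTrail,
// since the "Batch Job State Change" event does not cover SUBMITTED.
export function addSubmittedJobsRule(scope: Construct, handler: IFunction): Rule {
  return new Rule(scope, 'BatchSubmitJobRule', {
    eventPattern: {
      source: ['aws.batch'],
      detailType: ['AWS API Call via CloudTrail'],
      detail: {
        eventSource: ['batch.amazonaws.com'],
        eventName: ['SubmitJob'],
        // Successful calls carry no errorCode in the CloudTrail event.
        errorCode: [{ exists: false }],
      },
    },
    targets: [new LambdaFunction(handler)],
  });
}
```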

I also ran into some weirdness that would raise some eyebrows if I were to try to contribute this. For example, there is no way to limit the scope of ListJobs to a particular queue: Batch requires you to grant access to all job queues (resource: *) just to list the jobs in one queue.
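In CDK terms, the grant ends up looking like this (a minimal sketch):

```ts
import { PolicyStatement } from 'aws-cdk-lib/aws-iam';

// ListJobs does not support resource-level permissions, so even a Lambda
// that only polls a single queue ends up needing a wildcard resource.
const listJobsPolicy = new PolicyStatement({
  actions: ['batch:ListJobs'],
  resources: ['*'],
});
```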

For the event-based approach, I'm creating a Lambda and publishing metrics to a custom namespace, just to make them easier to discover in CloudWatch. You can avoid that too, and I reckon you don't need the SNS topic either: you can alarm on the number of times a rule was triggered instead.
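Something like this, alarming on EventBridge's own TriggeredRules metric (a sketch; the alarm threshold and construct ID are illustrative):

```ts
import { Duration } from 'aws-cdk-lib';
import { Alarm, ComparisonOperator, Metric } from 'aws-cdk-lib/aws-cloudwatch';
import { Rule } from 'aws-cdk-lib/aws-events';
import { Construct } from 'constructs';

// Alarm on how often a rule matched, instead of routing through SNS
// and counting published messages.
export function alarmOnRuleTriggers(scope: Construct, rule: Rule): Alarm {
  return new Alarm(scope, 'BatchFailedJobsAlarm', {
    metric: new Metric({
      namespace: 'AWS/Events',
      metricName: 'TriggeredRules',
      dimensionsMap: { RuleName: rule.ruleName },
      statistic: 'Sum',
      period: Duration.minutes(5),
    }),
    threshold: 1,
    evaluationPeriods: 1,
    comparisonOperator: ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
  });
}
```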

WRT the resource utilization widgets, you can use Container Insights on the Batch cluster (although it needs to be enabled manually or via a custom resource: aws/aws-cdk#21698).
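Once Container Insights is on, the cluster-level widgets can be built from the ECS/ContainerInsights namespace; a rough sketch (the cluster name is a placeholder for whatever Batch creates for the compute environment):

```ts
import { Duration } from 'aws-cdk-lib';
import { GraphWidget, Metric } from 'aws-cdk-lib/aws-cloudwatch';

// Placeholder: the ECS cluster that Batch manages for the compute environment.
const clusterName = 'AWSBatch-my-compute-env-1234';

const cpuWidget = new GraphWidget({
  title: 'Batch compute environment CPU',
  left: [
    new Metric({
      namespace: 'ECS/ContainerInsights',
      metricName: 'CpuUtilized',
      dimensionsMap: { ClusterName: clusterName },
      statistic: 'Average',
      period: Duration.minutes(5),
    }),
    new Metric({
      namespace: 'ECS/ContainerInsights',
      metricName: 'CpuReserved',
      dimensionsMap: { ClusterName: clusterName },
      statistic: 'Average',
      period: Duration.minutes(5),
    }),
  ],
});
```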

@echeung-amzn
Member

However, I do understand this would likely be a large effort given that these metrics are currently not even sent to CloudWatch

I'd encourage you to reach out to your TAM/support contacts so that they can capture this customer request as a datapoint for the Batch team, to help prioritize it.

unfortunately it's not clear to me how to contribute something that's backed by a custom Lambda to this repo

There's some very basic stuff in this folder that ultimately gets used elsewhere in the repo, but it's far from a robust setup. SecretsManagerMetricsPublisher is a somewhat similar idea that runs hourly to emit some custom metrics.

@straygar

straygar commented Jan 28, 2025

@echeung-amzn Thanks for the pointer, Eugene. To be consistent, I'd need to adapt my solution a bit, but no worries. Currently my Lambda:

  • Is written in Python
    • I can rewrite this to .js
  • Writes metrics using EMF, so we can do analysis on the metadata fields in CW Logs Insights
    • I guess we'd expect this module to call CW Metrics directly? (see the sketch after this list)
    • I recall the API having a pretty strict TPS limit, but it seems it's now 500 requests per second and each request can take up to 1,000 metrics, which is probably more than enough (though some customers run tens of thousands of jobs in parallel, at which point we might run into issues)
  • Has a dependency on AWS Lambda Powertools for the handy EMF abstraction
    • won't be needed if we call the CW API directly
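For reference, direct publishing would be something like this with the v3 SDK (the namespace and batching here are illustrative, not an existing module in this repo):

```ts
import {
  CloudWatchClient,
  MetricDatum,
  PutMetricDataCommand,
} from '@aws-sdk/client-cloudwatch';

const cloudwatch = new CloudWatchClient({});

// PutMetricData accepts up to 1,000 datums per call, so chunk the payload.
export async function publishJobStatusMetrics(datums: MetricDatum[]): Promise<void> {
  const BATCH_SIZE = 1000;
  for (let i = 0; i < datums.length; i += BATCH_SIZE) {
    await cloudwatch.send(
      new PutMetricDataCommand({
        Namespace: 'Custom/Batch',
        MetricData: datums.slice(i, i + BATCH_SIZE),
      }),
    );
  }
}
```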

Does that sound about right?

@echeung-amzn
Member

Writes metrics using EMF [...] I guess we'd expect this module to call CW Metrics directly?

I don't feel strongly about this; it'd be more a question of cost/benefit. As you mention later, with the current setup it's at least simpler to just call AWS SDK APIs.

Has a dependency on AWS Lambda Powertools for the handy EMF abstraction

That's definitely a downside of the current repo setup since the handler code is just super basic with no build process involved.

@straygar

OK, I'll avoid EMF and Powertools. (For reference: you can just attach the official Powertools layer, no builds involved, but I agree that having zero non-Lambda-runtime deps would be best for this repo.)
