[AVM Module Issue]: AuthorizationFailed #157

Closed
1 task done
Raphael-kainos opened this issue Dec 9, 2024 · 22 comments
Labels
Language: Terraform 🌐, Needs: External Changes ⚒️, Type: Bug 🐛

Comments

@Raphael-kainos

Raphael-kainos commented Dec 9, 2024

Check for previous/existing GitHub issues

  • I have checked for previous/existing GitHub issues

Issue Type?

Bug

(Optional) Module Version

0.10.0

(Optional) Correlation Id

No response

Description

The module terraform-azurerm-avm-ptn-alz does not deploy successfully when using Azure DevOps agents. The deployment results in an AuthorizationFailed error indicating the service principal does not have authorization to perform action 'Microsoft.Management/managementGroups/read'. The process often gets stuck at level 0 or 1 and does not progress further.

I have attempted the following steps to mitigate the issue, but none resolved the problem:

  • Utilised a self-hosted agent with the required permissions instead.

  • Granted the service principal the Owner role, plus the Azure Landing Zones Management Group Contributor and Reader roles, at the tenant root level.

Image

  • Adjusted settings for timeouts, delays, and retries to account for propagation and eventual consistency issues.

This is exclusive to ADO. When using GitHub or deploying locally, it deploys with no errors. Please note that in all scenarios I used the same service principal with the same permissions, and ADO is the only method that failed.

Despite these efforts, the issue persists, suggesting that the module may not be fully compatible with Azure DevOps agent workflows. Please advise on whether additional configuration or updates to the module are required to resolve this issue.

Image

Raphael-kainos added the Language: Terraform 🌐 and Needs: Triage 🔍 labels on Dec 9, 2024
microsoft-github-policy-service bot added the Type: Bug 🐛 and Status: Response Overdue 🚩 labels on Dec 9, 2024
@paul-e-martin

I am also seeing the exact same error.

@Raphael-kainos
Author

I am still experiencing this issue. Can we get an update, please?

@matt-FFFFFF
Member

Hi,

We successfully use ADO to deploy this module in the alz-terraform-accelerator, therefore I do not believe that there is any specific issue with ADO agents.

Adding @jaredfholgate who has done more testing than I here.

This issue can occur due to the time it takes for permissions to reconcile. There have been some changes made to azapi to address this, but they are not released yet.

In testing with a locally built provider from the main branch, I have noticed that these issues do not occur, but until a provider version is released we are a little stuck.

#RR

microsoft-github-policy-service bot added the Needs: Author Feedback 👂 label on Dec 18, 2024
matt-FFFFFF removed the Needs: Author Feedback 👂, Needs: Triage 🔍, and Status: Response Overdue 🚩 labels on Dec 18, 2024
matt-FFFFFF self-assigned this on Dec 18, 2024
@jaredfholgate
Member

I'll run some testing when I get a chance and see if I can replicate. It looks like this is being run via an Accelerator deployment, given the role definition shown.

@Raphael-kainos
Author

No, this isn't being run via the Accelerator deployment. I just reused the role definition from the module deployment. I've also tried assigning it the Owner, Contributor, and User Access Administrator roles at the root tenant level, but I still encounter the same error.

@jaredfholgate
Member

I'm trying it with the accelerator now. I'll share the new module with you assuming everything works and it may help to isolate the problem.

@matt-FFFFFF
Member

FYI this was the PR that improves behaviour for resources at MG scope:

Azure/terraform-provider-azapi#681

@jaredfholgate
Member

Hi @Raphael-kainos. It is not clear whether a second plan / apply resolves this for you? Are you saying it never works, not even after a retry?

@jaredfholgate
Member

jaredfholgate commented Dec 19, 2024

So far I have been unable to replicate this specific issue.

To reproduce, I used the preview version of the Azure Verified Modules starter module for the accelerator.

The following is not yet GA, but may help you to resolve your problem. It is expected to be GA at end of January.

If you want to try the same, you can find the details here:

My command to deploy was: Deploy-Accelerator -inputs "C:\acc-test\config\inputs-azure-devops.yaml", "C:\acc-test\config\hub-and-spoke-vnet.tfvars" -output "C:\acc-test\output"

I used the full multi-region config to test everything, but you could probably use the management only one to replicate your issue: https://github.com/Azure/alz-terraform-accelerator/blob/main/templates/platform_landing_zone/examples/management-only/management.tfvars

@jaredfholgate
Member

Given what Matt said about the known issue with retry, this will hopefully have a solution soon. However, please confirm whether a second plan / apply solves the problem for you? If not, does using the accelerator code I shared solve it for you?

If you still get the issue after that, then there must be an environment specific problem that would require further investigation.

jaredfholgate added the Needs: Author Feedback 👂 label on Dec 19, 2024
@paul-e-martin

In my case, if I run a second plan/apply, it fails. This is because the management groups actually get created but, I guess, are not recorded in the state file, so Terraform wants to create them again.

If I delete the management groups that had the issue, there is no guarantee that the plan/apply will succeed. I have had instances where it does apply, and others where we get the same AuthorizationFailed error.

When I check the Azure portal after a failure, the permissions inherited from the parent management group do appear to take time to apply.

Not sure if any of that helps.
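
The propagation delay described above can be illustrated with a small polling loop. This is only a sketch under stated assumptions: `retry_until_ok` is a hypothetical helper, not part of the module or the provider, and the commented `az` call assumes the Azure CLI is installed and logged in as the pipeline identity, with `my-mg` as a placeholder management group name. It shows how to wait until a read succeeds; it does not address the state problem mentioned above.

```shell
# Hypothetical helper (not part of the module or any provider):
# retry_until_ok MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or MAX_ATTEMPTS is reached.
retry_until_ok() {
  max_attempts="$1"
  delay_seconds="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $attempt of $max_attempts failed" >&2
    # Only sleep when another attempt will follow.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay_seconds"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Example usage (placeholder name; requires a logged-in Azure CLI):
# retry_until_ok 30 60 az account management-group show --name my-mg --output none
```

With 30 attempts and a 60 second delay, the loop would wait for roughly half an hour at most before giving up.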

microsoft-github-policy-service bot added the Needs: Attention 👋 label and removed the Needs: Author Feedback 👂 label on Dec 19, 2024
@Raphael-kainos
Author

Yes @jaredfholgate, this is exactly what happens when I try a second plan/apply. I'm going to try the accelerator code and get back to you as soon as possible. Thanks.

@Raphael-kainos
Author

Raphael-kainos commented Dec 23, 2024

Hi @jaredfholgate I tried the accelerator code and encountered the same error. This suggests it might be an environment issue, but I’m not sure what the cause could be. I’m not using a self-hosted agent at the moment—could that be the issue?

Since I’m working with the exact same codebase as you, it’s a bit puzzling. Any pointers for investigation would be greatly appreciated.

Image

@Raphael-kainos
Author

I can only assume my issue is related to the bug addressed in Azure/terraform-provider-azapi#681, so I will have to wait for the next azapi release. It is just odd because you do not have this problem.

Image

@matt-FFFFFF
Member

Based on what you've both said I think this will be fixed with 2.2

matt-FFFFFF added the Needs: External Changes ⚒️ label and removed the Needs: Attention 👋 label on Dec 26, 2024
@Raphael-kainos
Author

The recent update to azapi doesn’t seem to address the issue I’m experiencing with Azure DevOps deployments. I’m still encountering the same error, which primarily appears to be related to RBAC propagation, specifically within Azure DevOps.

I’ve tested this in two different tenants, one of which has significantly fewer role assignments, but the problem persists. I’m currently stuck on why this is happening. When I check the portal after receiving the error, I notice also that the RBAC permissions for the managed identity assigned to Azure DevOps take a significant amount of time to propagate. This delay affects not only nested management groups but occasionally even the top-level management group.

@paul-e-martin Are you still experiencing the same issue? Also, @jaredfholgate, I’m using the accelerator configuration you provided, both with the deployment management group only and with all modules. Can you think of any differences in your setup that might help investigate this issue, considering you’ve mentioned you’re currently using Azure DevOps for deployments?

@paul-e-martin

@Raphael-kainos will be testing again on Monday, will update here with the results.

What authentication are you using? My deployment is with a managed identity and the workload identity federation. Will do a test using a service principal, to see if the propagation of RBAC is different.

@Raphael-kainos
Author

Ok @paul-e-martin, that would be very helpful. I’ve primarily been using managed identities and workload identity federation, as this is what is utilised in the ALZ accelerator. However, I also tried using a service principal but encountered the same issues.

@paul-e-martin

paul-e-martin commented Jan 4, 2025 via email

@jaredfholgate
Member

Our tests run in multiple regions. The Azure DevOps org is homed in uksouth. The last end to end test I ran was uksouth.

We do often see issues with role assignments being slow, but they are eventually consistent. I've never seen a permanent failure. We have increased the retry timeouts in the accelerator too.

With regard to the issue being specific to Azure DevOps, the only thing I can think of is that GitHub is refreshing its access token, but Azure DevOps is not. The Terraform provider has some logic built in for GitHub to automatically get and use the ID token, using the request URL behind the scenes.

I am currently in discussion with someone working to implement this for the azurerm backend for Azure DevOps. That same logic could be applied to the 3 providers if it works. It's possible that using ARM_OIDC_REQUEST_TOKEN and ARM_OIDC_REQUEST_URL could resolve this. I will take a look into this as time allows as they have recently been exposed in Azure DevOps.

@jaredfholgate
Member

jaredfholgate commented Jan 7, 2025

Further to this, my colleague is working on a proper solution in the azurerm provider and the azurerm backend in Terraform Core that will support token refresh for OIDC. You can see his latest update here: hashicorp/terraform#34322 (comment)

azapi already has support for token refresh. See the second option here, which uses ARM_OIDC_REQUEST_TOKEN and ARM_OIDC_AZURE_SERVICE_CONNECTION_ID: https://registry.terraform.io/providers/Azure/azapi/latest/docs/guides/service_principal_oidc#configuring-the-service-principal-in-terraform
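
As a rough sketch of how those two variables might be populated in an Azure DevOps pipeline step, assuming the pipeline maps its access token into `SYSTEM_ACCESSTOKEN` and passes the service connection ID in a variable: `configure_azapi_oidc` and `SERVICE_CONNECTION_ID` are hypothetical names used only for illustration, and setting `ARM_USE_OIDC` is an assumption that should be checked against the linked guide.

```shell
# Sketch only: map pipeline-provided values onto the variables named above.
# Assumptions: SYSTEM_ACCESSTOKEN is exposed to the step by the pipeline, and
# SERVICE_CONNECTION_ID is a hypothetical variable holding the service connection ID.
configure_azapi_oidc() {
  if [ -z "${SYSTEM_ACCESSTOKEN:-}" ] || [ -z "${SERVICE_CONNECTION_ID:-}" ]; then
    echo "SYSTEM_ACCESSTOKEN and SERVICE_CONNECTION_ID must both be set" >&2
    return 1
  fi
  # Assumed switch to enable OIDC authentication; verify against the guide.
  export ARM_USE_OIDC=true
  export ARM_OIDC_REQUEST_TOKEN="$SYSTEM_ACCESSTOKEN"
  export ARM_OIDC_AZURE_SERVICE_CONNECTION_ID="$SERVICE_CONNECTION_ID"
}

# Example usage inside a pipeline script step:
# configure_azapi_oidc && terraform plan
```

Failing early when either input is missing avoids a later, less obvious authentication error from the provider.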

Once they are all released, I'll be able to update the Accelerator pipelines to support token refresh.

jaredfholgate added a commit to Azure/alz-terraform-accelerator that referenced this issue Jan 11, 2025
## Overview/Summary

Increase timeouts to help with ADO eventual consistency issue

## This PR fixes/adds/changes/removes

1. Azure/terraform-azurerm-avm-ptn-alz#157
2. Azure/ALZ-PowerShell-Module#269

### Breaking Changes

None

## As part of this Pull Request I have

- [x] Checked for duplicate [Pull
Requests](https://github.com/Azure/alz-terraform-accelerator/pulls)
- [x] Associated it with relevant
[issues](https://github.com/Azure/alz-terraform-accelerator/issues), for
tracking and closure.
- [x] Ensured my code/branch is up-to-date with the latest changes in
the `main`
[branch](https://github.com/Azure/alz-terraform-accelerator/tree/main)
- [x] Performed testing and provided evidence.
- [x] Updated relevant and associated documentation.
@jaredfholgate
Member

jaredfholgate commented Jan 11, 2025

Hi @Raphael-kainos and @paul-e-martin

I was eventually able to reproduce this issue. I changed my root management group and it started happening.

As such I have been able to find a way to resolve it. You can see the PRs I have linked here to fix it in the Accelerator. Eventually, I think the token refresh in the provider may also help.

In the short term, setting the environment variable AZAPI_RETRY_GET_AFTER_PUT_MAX_TIME solves the issue. I found the permissions were consistent between 10 and 15 minutes after creation of the management group, so you could probably set it to 20m. I have set it to 60m to be on the safe side given there is no impact of having a longer timeout for this use case.
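
A minimal sketch of that short-term workaround, assuming the variable is exported in the same shell session that later runs Terraform. The `60m` and `20m` values come from the comment above; `set_azapi_retry_window` is a hypothetical helper, and its format check is a simplification rather than the provider's own parsing.

```shell
# Hypothetical helper: export the azapi retry window described above.
# Rejects values containing anything other than digits and the unit
# letters s, m, h, and values that start with a unit letter.
set_azapi_retry_window() {
  window="${1:-60m}"
  case "$window" in
    *[!0-9smh]* | [smh]*)
      echo "invalid duration: $window" >&2
      return 1
      ;;
  esac
  export AZAPI_RETRY_GET_AFTER_PUT_MAX_TIME="$window"
}

set_azapi_retry_window 60m
echo "AZAPI_RETRY_GET_AFTER_PUT_MAX_TIME=$AZAPI_RETRY_GET_AFTER_PUT_MAX_TIME"
# terraform apply would follow here in the same shell session
```

In a pipeline, the same value could instead be set as an environment variable on the step that runs Terraform.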

Given I believe this solves the originally raised issue, I am going to close this issue now. Please re-open it if you find that it does not solve it for you though.

I continue to work with PG on implementing the token refresh for Azure DevOps OIDC and that will hopefully come in the next few months and potentially help to reduce the time it takes, but I can't be 100% sure it will.

CC: @matt-FFFFFF
