To keep .NET development moving, build or test problems that cannot be resolved immediately need a mechanism to quarantine them, so that the many developers who contribute to the product are not blocked.
Quarantine is not the first resort, but it is a tool for keeping the product building successfully.
In general if a build or test fails, the steps should be as follows.
- If the source PR/change can be identified, it should be backed out to restore correct behavior; the new code can then be reinstated, with the correction, in a future PR
- If the problem can be definitively fixed quickly, the fix should be made as soon as possible
- If the broken component can be isolated and removed from PR and CI builds, it should be quarantined
- If none of the above is possible, it should be priority 0 to tackle the situation and get the build unblocked as soon as possible
The third step, quarantine, is the focus of this proposal.
We consider something "broken" and in need of remediation if it has failed 3 of the last 10 builds in the CI pipeline. The CI pipeline should be passing 100% of the time, so 3 failures indicate that something needs to be done to unblock PRs.
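The threshold above can be sketched as a small check. This is only an illustration: the function name `needs_remediation`, the constants, and the representation of build results as a list of booleans are assumptions, not part of any existing tooling. It also assumes "3 of the last 10" means at least 3.

```python
from typing import Sequence

FAILURE_THRESHOLD = 3  # failures that mark a component as "broken"
WINDOW_SIZE = 10       # number of most recent CI builds considered


def needs_remediation(results: Sequence[bool],
                      threshold: int = FAILURE_THRESHOLD,
                      window: int = WINDOW_SIZE) -> bool:
    """Return True if at least `threshold` of the last `window` builds failed.

    `results` holds one entry per CI build, oldest first (hypothetical
    format); True means the build passed.
    """
    recent = list(results)[-window:]
    failures = sum(1 for passed in recent if not passed)
    return failures >= threshold
```

For example, `needs_remediation([True] * 7 + [False] * 3)` reports that remediation is needed, while two failures in the same window do not.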
The quarantine option is meant for issues that are believed to be short-term disruptions. If a fix cannot be determined immediately, within 5 minutes of the failure, quarantine needs to be enacted to unblock PR workflows. Permanent unreliability is a different problem and is not addressed by this procedure.
Neither PR builds nor the primary CI pipelines (e.g. the 'runtime' pipeline) will include the quarantined component. A separate pipeline will run on the same cadence as the CI pipeline to execute quarantined components, so that it can be determined when it is appropriate to unquarantine each affected component.
Every quarantined item should have an owner who is aware of it, and a tracking issue in the most appropriate repository assigned to that owner. The primary purpose of this ownership is to ensure that the quarantined item is being addressed and tracked for reintroduction into the mainline builds.
The smallest unit possible should be quarantined, to minimize the coverage gap in PR builds. From smallest to largest, the units are:
- A single test in a single configuration
- A single test in all configurations
- A test assembly
- A build "job"
- An entire pipeline
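
A minimal sketch of the "smallest unit" rule follows. The `QuarantineScope` enum and `smallest_scope` helper are hypothetical names introduced for illustration; the ordering assumes the list above runs from smallest to largest.

```python
from enum import IntEnum
from typing import Iterable


class QuarantineScope(IntEnum):
    """Quarantine units from the list above, ordered smallest to largest."""
    TEST_SINGLE_CONFIGURATION = 1
    TEST_ALL_CONFIGURATIONS = 2
    TEST_ASSEMBLY = 3
    BUILD_JOB = 4
    PIPELINE = 5


def smallest_scope(candidates: Iterable[QuarantineScope]) -> QuarantineScope:
    """Pick the smallest scope among those that would isolate the failure."""
    candidates = list(candidates)
    if not candidates:
        # Nothing isolates the failure, so quarantine does not apply.
        raise ValueError("no candidate scope isolates the failure")
    return min(candidates)
```

Given candidates `BUILD_JOB` and `TEST_ASSEMBLY`, the helper selects `TEST_ASSEMBLY`, which leaves the rest of the job running in PR builds.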
TBD (dotnet#6661)
TBD (dotnet#6661)
TBD (dotnet#6662)
TBD (dotnet#6663)
TBD (dotnet#6663)
Once a fix has been introduced and the component has passed for a month, it can be reintroduced into the mainline build by reverting the change that quarantined it.
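
The reintroduction criterion can be sketched as follows. This is an illustration under stated assumptions: "a month" is taken as 30 days, the run history is a list of `(run_date, passed)` pairs from the quarantine pipeline, and `can_unquarantine` is a hypothetical helper rather than existing tooling.

```python
from datetime import date, timedelta
from typing import Iterable, Tuple

STABILITY_PERIOD = timedelta(days=30)  # assumption: "a month" taken as 30 days


def can_unquarantine(fix_date: date,
                     runs: Iterable[Tuple[date, bool]],
                     today: date) -> bool:
    """Return True if the stability period has elapsed since the fix and
    every quarantine-pipeline run on or after the fix date passed.
    """
    if today - fix_date < STABILITY_PERIOD:
        return False
    runs_since_fix = [passed for run_date, passed in runs if run_date >= fix_date]
    # Require at least one run so an idle pipeline does not count as passing.
    return bool(runs_since_fix) and all(runs_since_fix)
```

A single failure after the fix date keeps the component quarantined; in practice the clock would presumably restart from the next fix.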