Terraform sparse checkout module #19277

Open
rverma-nikiai opened this issue Nov 4, 2018 · 23 comments

Comments

@rverma-nikiai

rverma-nikiai commented Nov 4, 2018

Current Terraform Version

v0.11.7

Use-cases

Support sparse checkout of a Terraform module and let the user specify the clone depth. While Terraform suggests one module per repo, some orgs prefer to manage multiple related modules together. This also gives a faster feedback cycle, keeps related pull requests in one repo, and so on.

Attempted Solutions

Couldn't find anything relevant.

Proposal

Possibly we can evolve source, keeping backward compatibility, so that this still works:

module "dynamo-auto" {
     source = "git::https://github.com/cloudposse/terraform-aws-dynamodb-autoscaler.git?ref=master"
}

and this is also accepted:

module "dynamo-auto" {
     source = {
       repo  = "git::https://github.com/cloudposse/terraform-aws-dynamodb-autoscaler.git?ref=master"
       depth = 1
       path  = "/modules/dynamo"
     }
}

which would allow a sparse checkout of /modules/dynamo as the relevant Terraform module.

@apparentlymart
Contributor

Hi @rverma-nikiai! Thanks for sharing this use-case.

The git module source actually already supports a syntax for selecting a sub-path from a repository, like this:

module "dynamo-auto" {
     source = "git::https://github.com/cloudposse/terraform-aws-dynamodb-autoscaler.git//modules/dynamo?ref=master"
}

The extra //... portion of the path at the end is interpreted as a subdirectory within the repository.

That then just leaves the request for shallow cloning. The git handling is all done by a component which parses only the source string, so additional git-related settings must be packed inside that pseudo-query-string argument at the end, which means a hypothetical new option might look something like this:

module "dynamo-auto" {
     source = "git::https://github.com/cloudposse/terraform-aws-dynamodb-autoscaler.git//modules/dynamo?ref=master&depth=1"
}

However, I think Terraform's module installer does a full clone by default just because when it was written the "shallow clone" functionality was relatively new and limited, and we wanted to be sure of proper behavior on subsequent commands such as upgrading the module, which requires running operations like git fetch.

I think we should investigate whether the improved shallow clone behavior added in Git 1.9 (now several years old) is featureful enough that we could enable shallow cloning by default in a future release, since the module installer's goal is always to install just the single version you requested, rather than to create a fully-fledged development environment for that repository. Before making that decision, we'll need to prototype it to make sure the upgrading behavior is well-behaved after a shallow, single-branch clone.

We are in the early stages of planning some other changes to how Terraform manages configuration dependencies for a future release, so I'm going to label this one to remind us to consider this use-case as part of that work.

@rverma-nikiai
Author

@apparentlymart, the Terraform module source does support a subdirectory, like

 source = "git::https://github.com/cloudposse/terraform-aws-dynamodb-autoscaler.git//modules/dynamo?ref=master&depth=1"

but it still clones the whole repo and then references the path, which is useful locally. That still defeats the purpose of sparse checkout, which provides various benefits.

Just some thoughts:
Consider three module definitions in main.tf

module "dynamo-auto" {
     source = "git::https://github.com/cloudposse/terraform-aws-autoscaler.git//modules/dynamo?ref=master"
}
module "rds-auto" {
     source = "git::https://github.com/cloudposse/terraform-aws-autoscaler.git//modules/rds?ref=master"
}
module "es-auto" {
     source = "git::https://github.com/cloudposse/terraform-aws-autoscaler.git//modules/es?ref=master"
}

Currently this causes 3 complete clones of terraform-aws-autoscaler.git, so we have at least 2 redundant copies of each module on disk, and git clone is called 3 times as well.

Possibly in the init step we could prebuild the sparse-checkout info, resulting in 1 clone containing just those 3 module directories.
One major flaw I can see is that if we misspell any module path, sparse checkout will ignore it without warning.

Anyways, shallow cloning would be a huge improvement standalone.

@apparentlymart
Contributor

Hi again @rverma-nikiai! Thanks for the additional context.

It looks like you are interested in several slightly different (but related) problems here:

  1. Terraform clones exactly the same git repository multiple times over the network, which is slow.
  2. Terraform clones the entire history of the repository, even though it only uses the latest commit on the given branch.
  3. Terraform clones the entire source tree in the repository, even though only a sub-path is requested.
  4. The same repository is stored on local disk multiple times.

The first of these has already been addressed in master and will be included in the forthcoming v0.12.0 release: Terraform will now detect that all of these are coming from the same repository and only run git clone once.

Point 2 I think we can solve after we do some testing to make sure that --depth 1 doesn't have any unwanted consequences for the update step. This is what my previous comment was about.

Point 3 here is intentional because in a multi-module repository the different modules will often refer to one another with references like source = "../es" and so we need to have the entire repository on disk to resolve references like that.
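
To illustrate point 3, here is a minimal sketch of a module that refers to a sibling module in the same repository. The file path and the module name are assumptions made for illustration, not taken from any repository mentioned in this thread.

```hcl
# modules/dynamo/main.tf (hypothetical file in a multi-module repository)

# A sibling module referenced by relative path. Terraform resolves "../es"
# relative to the directory of the calling module, so modules/es has to be
# present on disk next to modules/dynamo.
module "es" {
  source = "../es"
}
```

If only modules/dynamo were checked out, the "../es" reference would point at a directory that does not exist.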

Point 4 is another one we can address eventually. For v0.12.0 we've switched to a directory naming scheme that reflects the module names in source code so that error messages (which now contain source location references) are more easily understandable. The new mechanism I mentioned for point 1 doesn't yet address this, since we wanted to keep things relatively simple for the first pass, but that mechanism could also potentially use additional techniques to share the files on disk between multiple copies of the same source. We intend to investigate that further in a later release.

In order to keep things focused, let's say that this issue is about the second point, since I think that's the one that is in most need of some further study/prototyping. I expect we will also make a separate issue for the 4th point at a later date, once we've got some experience with this new download optimization fix in v0.12 and can potentially address any other concerns related to it at the same time.

Point 1 here was originally discussed in #11435, which is now closed due to the fix being ready for release.

@rifelpet

@apparentlymart Now that hashicorp/go-getter#140 has been merged, any chance we can get terraform's vendoring updated to add support for shallow clones? I'm happy to open a PR including a docs update, I just need to know which target branch would be most appropriate at the moment.

@rifelpet

rifelpet commented Mar 15, 2019

It looks like #20411 updated the go-getter version to include the shallow clone functionality. It is in 0.12 beta1. I'm looking forward to using this in 0.12, thanks! We can probably close this issue out.

@apparentlymart
Contributor

Since that other PR wasn't intentionally updating go-getter to address this issue, it therefore didn't update Terraform's module sources documentation to mention this new option. We'll need to do that at least before considering this done.

I'd also still like to investigate whether that option is necessary at all or if we can just make that behavior the default. Since we're not cloning the repository for development it seems unnecessary to produce a fully-functioning work tree by default, and in the rare case where someone does want to work directly with the cloned repository in .terraform/modules it only takes a couple git commands to fetch the full history if needed.

@rifelpet

Ah you're right I forgot about documentation, and I would fully support using depth=1 by default. I can't think of any reasonable situations where a user would need the full history in .terraform/modules.

@KaGeN101

KaGeN101 commented Aug 2, 2019

Surely the source can be cached and put in a lookup during execution. If the lookup contains the same URL key, just use the cached copy; if not, fetch it and add it to the cache. This surely can't be hard to do, versus downloading an identical copy from the internet over and over each time :/

@adrian-gierakowski

@apparentlymart

Is there a GitHub issue tracking the following point?

2. Terraform clones the entire history of the repository, even though it only uses the latest commit on the given branch.

I'm using the terraform-google-modules/gcloud/google module which ends up downloading hundreds of megabytes of history since the github repo contains gcloud binaries and grows in size significantly with every version bump.

@KaGeN101

@apparentlymart

Is there a GitHub issue tracking the following point?

  2. Terraform clones the entire history of the repository, even though it only uses the latest commit on the given branch.

I'm using the terraform-google-modules/gcloud/google module which ends up downloading hundreds of megabytes of history since the github repo contains gcloud binaries and grows in size significantly with every version bump.

That is the way Git works: it is distributed, so it copies all history locally. There is no way around this the first time.

@milpog

milpog commented May 10, 2020

It would be really nice to have shallow clone as default option for cloning Terraform modules. @apparentlymart can you tell whether there are any plans to implement it anytime soon?

@apparentlymart
Contributor

Nobody on the Terraform team at HashiCorp is currently working on this, because our attentions are currently elsewhere.

As I mentioned before, the main trick here is making sure that shallow clone won't break the ability for terraform init -upgrade to roll forward to a newer commit when a shallow tree is already present on disk. I don't know yet how that will behave, and I think understanding that behavior is the main blocker for deciding whether we can make this change. If someone is motivated to work on this, I'd suggest the following approach to get into a state where it's possible to test and experiment:

  • Create a local branch of go-getter, which is the library that implements the Git fetching in Terraform.

  • In your Terraform work tree, temporarily edit go.mod to include a replace directive referring to your local go-getter tree, so your local Terraform builds will see the go-getter changes you're making locally:

    replace github.com/hashicorp/go-getter => ../go-getter
    
  • Change the logic in GitGetter to enable shallow cloning unconditionally. (I don't have exact details on this step, because I've not looked closely at the logic in there yet.)

  • Build Terraform against the locally-modified go-getter and experiment with terraform init and terraform init -upgrade to make sure they are both still working as expected.

If the above is fruitful and it seems like making shallow clone the default would work, I expect the final change to go-getter would need to make it conditional via a flag field in the GitGetter type so that Terraform can enable it without forcing that behavior on other go-getter callers. We can then change Terraform's own instantiation of that getter to set the new flag, making that behavior always be activated for Terraform's module installer.

If someone is interested in working on this but needs some more guidance, please let me know what specific questions you have and I can try to answer them as best I can with what I know already.

@hayorov
Contributor

hayorov commented Jul 14, 2020

Since #11435 is closed, I'd like to go slightly off-topic here and share a small pre-Terraform utility that optimizes init (module download) for git modules: https://github.com/hayorov/terraform-init-booster

@sebglon

sebglon commented Mar 15, 2021

Can we have some news?

@sebglon

sebglon commented Mar 15, 2021

It seems that we have a beginning fix here: #10703

@Vasi-Shche

  3. Terraform clones the entire source tree in the repository, even though only a sub-path is requested.

Point 3 here is intentional because in a multi-module repository the different modules will often refer to one another with references like source = "../es" and so we need to have the entire repository on disk to resolve references like that.

About Point 3. With 100 modules per repo, it will replicate it 100 times. Why not, for example, use links instead of copying 100 modules 100 times?

@aberres

aberres commented Jun 16, 2023

I wanted to reference modules living in a mono repo. But boy, this is no fun.

@apparentlymart
Contributor

If all of your source code is in the same repository then you can use relative paths (starting with either ../ or ./) to refer to other modules in the same repository. If you do that then Terraform will use the parent module's copy of the source code to handle the downstream modules, and if you do this exclusively then terraform init will not need to download any additional module source code at all.
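
A minimal sketch of that arrangement, assuming a hypothetical layout in which the root module sits at the top of the repository and the shared modules live under ./modules:

```hcl
# main.tf at the repository root (hypothetical layout and module names)

# Local paths must start with ./ or ../ so that Terraform treats them as
# directories in the same source tree rather than as remote sources.
module "dynamo" {
  source = "./modules/dynamo"
}

module "es" {
  source = "./modules/es"
}
```

Both modules are read straight from the working tree, so they always reflect whatever revision of the repository is currently checked out.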

@aberres

aberres commented Jun 16, 2023

@apparentlymart But then we lose the possibility to version things, don't we? Or is there a trick?

@apparentlymart
Contributor

Indeed, when I hear "monorepo" I tend to understand that to mean "big bag of everything all versioned together as a single unit", so I was assuming that independent module versioning is not a requirement. If you do want to have separate versions for each of your modules then indeed it'd be best to use multiple smaller repositories to represent that, so that the boundaries between the packages are clear.

@aberres

aberres commented Jun 19, 2023

Indeed, when I hear "monorepo" I tend to understand that to mean "big bag of everything all versioned together as a single unit"

You are absolutely right. Nevertheless, I was hoping to leverage versioning to do things like promoting things from testing, to staging, to prod.
Creating (TF specific) tags works even in the monorepo case.
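
A rough sketch of that promotion idea, where each environment's root module pins the same module at a different tag. The repository URL, directory layout, and tag names are all hypothetical placeholders.

```hcl
# environments/staging/main.tf (hypothetical file)
# Staging tracks the newer tag that is still being validated.
module "network" {
  source = "git::https://example.com/myorg/monorepo.git//terraform/modules/network?ref=tf-network-1.3.0"
}

# environments/prod/main.tf (hypothetical, separate file and root module)
# Prod stays on the previously promoted tag until staging looks good.
module "network" {
  source = "git::https://example.com/myorg/monorepo.git//terraform/modules/network?ref=tf-network-1.2.0"
}
```

Promoting a change then amounts to bumping the ref in the next environment's file.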

If you do want to have separate versions for each of your modules then indeed it'd be best to use multiple smaller repositories to represent that, so that the boundaries between the packages are clear.

Yeah, I guess we should simply move the TF code out and call it a day 👍

@fishpen0

fishpen0 commented Jun 19, 2023

You can use git tags to pull off versioning inside a monorepo. So for each module in our modules folder we autogenerate a semver tag that looks like this: foomod-1.1.1 from a version file we keep collocated in the module folder. Then if we want to use that module we reference it like so from our same repo:

module "myfoomod" {
  source = "[email protected]:myorg/monorepo.git//modules/foomod?ref=foomod-1.1.1"
}

It is not ideal though because if you do this a lot in a monorepo you get hundreds of local copies of the modules cloning the repo you already are working inside of over and over again. Which is kind of what this issue is opened to fix.

The better approach is to just not use a monorepo, but that's simply not allowed in the organization we are in. Compliance, security, and strange engineering leadership can all create bad and confusing rules, and we have to work around them.

@mschfh
Contributor

mschfh commented Jan 5, 2025

Gave this a try to improve Registry module fetching, but the current go-getter implementation only supports git clone --depth, which does not work with specific commits (example response below).

2025-01-05T11:36:56.780Z [TRACE] getmodules: fetching "git::https://github.com/cloudposse/terraform-aws-ecs-web-app?ref=9ae7e799cf1779f7d32b8076641c02146d65ef5d" to ".terraform/modules/testmodule"

Related: hashicorp/go-getter#510
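
For context, a sketch of how the depth option interacts with ref in a module source. The tag name "v1.2.3" is a hypothetical placeholder; the commit ID is the one from the trace above. The assumption, matching the limitation described in this comment, is that with depth set the ref has to be a branch or tag name rather than a raw commit ID.

```hcl
# Shallow clone of a named tag ("v1.2.3" is a hypothetical tag name).
module "shallow_by_tag" {
  source = "git::https://github.com/cloudposse/terraform-aws-ecs-web-app?ref=v1.2.3&depth=1"
}

# Pinning to a raw commit ID: depth is left out here, so the installer
# falls back to a regular (full) clone before checking out the commit.
module "pinned_by_commit" {
  source = "git::https://github.com/cloudposse/terraform-aws-ecs-web-app?ref=9ae7e799cf1779f7d32b8076641c02146d65ef5d"
}
```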
