Managing Terraform resources with remote and versioned modules
2025-08-26
While this is probably not the first story of its kind, I want to share some notes on managing Terraform resources. If you are starting a brand new infrastructure, I can only suggest researching GitOps and Terragrunt, and taking a look at Terraform best practices. HCP Terraform also provides Stacks, and the OpenTofu community is interested in this concept as well.
Before going any further: I do not consider this method a bullet-proof solution to most of the common problems, so treat it carefully and be critical. As usual, think of the good of your organisation first and be sure to evaluate as many scenarios as possible when taking a certain path.
Each organisation defines standards and styles for managing its own infrastructure and resources. Poor planning around scalable or DRY code will certainly cause serious problems as the organisation grows: outages, rollback complications, hard-to-track issues, inconsistencies, drift and difficult maintenance. All of this will increase technical debt to an unsustainable level. IaC is powerful but, like any other software, it requires a good strategy to deliver the desired results. Roadmaps, versioning, release cadence and testing are all fundamental ingredients for securely and efficiently delivering changes to production environments in a responsible way. These issues become even more problematic when there is a high number of contributors who also generate a high number of short-lived pull requests.
Modules
Define and use modules for better management.
The usage of modules can be taken for granted in most cases, as organisations already implement the concept of modules in their configuration, but the implementation differs case by case. The official Terraform documentation on modules lists the problems modules can solve when used correctly.
When working with modules it is important to find a good balance in how atomic your approach will be. Creating a module for a very small set of resources can become overwhelming, confusing, dispersive and complicated; at the same time, designing modules which do multiple things, even if similar, is not great. For example, using a module for creating S3 buckets can be a good practice, as you probably want to standardise the security aspects and the naming convention across your organisation and keep it DRY. On the other hand, I do not see the utility of a module for uploading files to this S3 bucket, or even worse, one module for each application or micro-service in your stack! The Terraform community provides a good amount of well-maintained modules which can be used without re-inventing them.
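As a sketch of the kind of standardisation such a bucket module can enforce, a minimal hypothetical example follows. The naming convention, the `myorg-` prefix and the variable names are all assumptions for illustration, not a recommendation:

```hcl
# modules/s3-bucket/main.tf — hypothetical example module
variable "bucket_name" {
  description = "Base bucket name, combined with the org naming convention"
  type        = string
}

variable "env" {
  description = "Environment name"
  type        = string
}

resource "aws_s3_bucket" "this" {
  # enforce one org-wide naming convention in a single place
  bucket = "myorg-${var.env}-${var.bucket_name}"
}

resource "aws_s3_bucket_public_access_block" "this" {
  # standardise the security posture for every bucket created via the module
  bucket                  = aws_s3_bucket.this.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

Callers then only pass `bucket_name` and `env`, and every bucket in the organisation gets the same naming and security baseline.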
The root module
The root module can be thought of as the caller of these modules. It provides the values to the modules, which go ahead and create the resources from what is provided. The root module can be scoped to the creation of the whole infrastructure or to a specific big chunk of it (for example: terraform-backend).
The root module typically includes multiple, different stacks and refers to them in the various ways Terraform supports. For example, the root module will include terraform-backend, terraform-frontend, terraform-database, terraform-data, terraform-finance, and can refer to them with versions. These modules can in turn point to other modules, for example the AWS ec2-module. Be careful with nesting, though: avoid going too deep if possible.
Let’s take this scenario as an example; the root module will live in its own repository:
terraform-infrastructure
.
├── environments
│ ├── prod1
│ │ └── variables.auto.tfvars.json
│ ├── prod2
│ │ └── variables.auto.tfvars.json
│ ├── prod3
│ │ └── variables.auto.tfvars.json
│ ├── qa1
│ │ └── variables.auto.tfvars.json
│ ├── qa2
│ │ └── variables.auto.tfvars.json
│ └── stage1
│ └── variables.auto.tfvars.json
├── data.tf
├── main.tf
├── modules.tf
├── provider.tf
└── variables.tf
inside modules.tf:
# Note: plain Terraform requires a module's "source" to be a literal string,
# so interpolating ${var.backend_version} as below needs a pre-processing or
# templating step, or a wrapper such as Terragrunt.
module "scheduler" {
  count  = var.scheduler_enabled ? 1 : 0
  source = "github.com/org/terraform-backend//scheduler?ref=${var.backend_version}"
  region = var.aws_region
  env    = var.env
}

module "api" {
  count  = var.api_enabled ? 1 : 0
  source = "github.com/org/terraform-backend//api?ref=${var.backend_version}"
  region = var.aws_region
  env    = var.env
}

module "frontend-js" {
  count  = var.frontend_js_enabled ? 1 : 0
  source = "github.com/org/terraform-frontend?ref=${var.frontend_version}"
  region = var.aws_region
  env    = var.env
}
[...]
or it’s possible to think of a wider concept like the following:
module "backend" {
  count  = var.backend_enabled ? 1 : 0
  source = "github.com/org/terraform-backend?ref=${var.backend_version}"
  region = var.aws_region
  env    = var.env
}

module "frontend-js" {
  count  = var.frontend_js_enabled ? 1 : 0
  source = "github.com/org/terraform-frontend?ref=${var.frontend_version}"
  region = var.aws_region
  env    = var.env
}
[...]
variables.tf:
variable "backend_version" {
  description = "Code version of terraform-backend"
  default     = "1.0.0"
}

variable "frontend_version" {
  description = "Code version of terraform-frontend"
  default     = "1.0.1"
}

variable "aws_region" {
  description = "AWS Region"
  default     = "us-east-1"
}

variable "env" {
  description = "Environment name"
}
[...]
./environments/prod2/variables.auto.tfvars.json:
{
  "env": "prod2",
  "aws_region": "us-west-2",
  "scheduler_enabled": true,
  "api_enabled": true
}
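One way to wire these environment files into a plan (a sketch, not necessarily the author's pipeline): note that `*.auto.tfvars.json` files are only auto-loaded from the working directory, so they are either copied in or passed explicitly:

```shell
# hypothetical CI step for the prod2 environment
terraform init

# either copy the environment file so it is auto-loaded...
cp environments/prod2/variables.auto.tfvars.json .
terraform plan -out=prod2.tfplan

# ...or pass it explicitly with -var-file
terraform plan -var-file=environments/prod2/variables.auto.tfvars.json -out=prod2.tfplan
```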
The root module can then become a collection of versions, conditional resources and values, making it a compact source of truth for easily answering “what went out where” type of questions.
The called module
A module can even be as simple as:
.
├── README.md
├── main.tf
├── variables.tf
variables.tf should not include default values. The whole module should make use of assert {}, validation {}, precondition {}, postcondition {} and custom error messages on resources and variables as much as possible. Example for terraform-engine-loadbalancers:
main.tf:
resource "aws_lb" "application" {
  name               = local.name
  internal           = var.internal_only
  load_balancer_type = var.elb_type
  [...]

  lifecycle {
    precondition {
      condition     = length(local.name) <= 32
      error_message = "The combined length of app_name-env must not exceed 32 characters."
    }
  }
}

resource "aws_lb_target_group" "app" {
  name        = local.name
  protocol    = var.tg_protocol
  target_type = var.target_type
  [...]
}

variables.tf:
variable "app_name" {
  description = "Application name"
  type        = string
}

variable "env" {
  description = "Environment name"
  type        = string
}

variable "internal_only" {
  description = "Whether the ELB is internal only (not publicly accessible)"
  type        = bool
}

locals {
  # Load balancer names are limited to 32 characters; checked by the
  # precondition on aws_lb.application above.
  name = "${var.app_name}-${var.env}"
}
[...]
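The assert {} blocks mentioned above fit naturally into a check block (Terraform >= 1.5). A hypothetical example for the same load balancer module:

```hcl
# hypothetical check block — evaluated after plan/apply; a failed assertion
# is reported as a warning rather than blocking the run
check "alb_is_provisioned" {
  assert {
    condition     = aws_lb.application.dns_name != ""
    error_message = "The application load balancer did not receive a DNS name."
  }
}
```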
These modules can be maintained specifically by platform or infrastructure engineers. Some of these modules can become complex and, in my opinion, very delicate; this is why they are split from the "other modules".
When dealing with common cloud providers and resources, consider using publicly available modules coming from trusted sources. Usually these modules are under more scrutiny.
Nesting, problems and inconsistencies
As previously mentioned, while versioning and modularisation help with deployments, rollbacks, roadmaps, a consistent commit history and more, they also imply constant follow-up on the progress between the currently deployed release and the next one. Say production environments get weekly releases: it is vital to pick the correct versions to include by reviewing the differences in their specific git repositories and making sure all the components are in sync. GitHub already includes a "Compare to" button between releases which should help with this analysis.
Let’s say a new terraform-engine-loadbalancers version now requires a new variable/value. terraform-infrastructure must include this new variable in its new version as well. As long as Terraform complains before proceeding any further, that is easily fixed; however, there is a chance of including module versions which are either too new or too old, with unexpected changes which may silently reach higher environments. Keep good track of the progress across the modules and follow strict guidelines when choosing the next releases, especially when nesting modules (discussed above).
Infrastructure is not application
Draw a clear line between infrastructure and application. Keep the two realities as separate as possible.
There are organisations out there which include their applications' remote configurations in the main stack of the infrastructure. I strongly disagree with such an approach.
- The infrastructure is meant to provision the resources the application will live in. The application must not have any idea where it is running, or whether it is externally reachable or not. Infrastructure IaC will provide host instances, EKS clusters, load balancers, DNS entries, and none of this matters to the application itself. The same applies the other way around: the infrastructure must not care if the application has new configuration changes or a different log level. Whether an application fetches secret values from AWS Secrets Manager or configurations from SSM Parameter Store should not affect infrastructure. For example, aws_ssm_parameter resources holding application configuration must live in their own repository and follow their own release cadence and strategies (even by replicating what was shown above with modules), possibly staying more in sync with the application itself than with the infrastructure. Missing or misconfigured configuration values must not affect infrastructure modules, and in my opinion these should be managed directly by the specific product team.
- Configurations change often and infrastructure must keep the pace: Applications change often; they usually include new parameters or change the value of existing ones. If these are part of your infrastructure plan, then the platform or infrastructure team will have to deal with such resources, and depending on the size of the stack and the number of applications, they will have no idea what they are dealing with. Platform engineers should not have to deal with application cache sizes, maximum numbers of threads or application behaviours in general.
- Configurations add up and will slow you down: In many cases configurations also amount to a large number of resources. I am aware of scenarios where each environment has 1500+ configuration resources between SSM parameters, secret parameters and feature flags. This slows down infrastructure plans and causes frustration to platform and infrastructure engineers.
- Hot fixes and regression: Let's say a new configuration parameter turns out to be undesired. On a versioned infrastructure which was architected to also include these parameters, this implies the creation of a new "hot fix" release to address whatever issue was introduced. A brand new hotfix release just for changing a string, or even a single character!
- Bad mix: The application/infrastructure mix is just poison. Developers and platform engineers will conflict almost immediately and the whole organisation will suffer the consequences.
  - Developers will wait for the infrastructure to be up to date with what their code expects; delays in deploying infrastructure will also impact application deployments.
  - Configuration changes happen too quickly: someone prepares the infrastructure, the infrastructure gets applied, a container restarts and automatically pulls a new value while the code is not yet ready for it.
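Sticking to the SSM example: a hypothetical application-owned repository could manage its own parameters, versioned and released on the product team's cadence rather than the platform's. All names below (`checkout`, the parameter paths and values) are invented for illustration:

```hcl
# app-checkout-config/main.tf — hypothetical application config repository,
# owned and released by the product team, not the platform team
variable "env" {
  description = "Environment name"
  type        = string
}

resource "aws_ssm_parameter" "log_level" {
  name  = "/checkout/${var.env}/log_level"
  type  = "String"
  value = "info"
}

resource "aws_ssm_parameter" "cache_size_mb" {
  name  = "/checkout/${var.env}/cache_size_mb"
  type  = "String"
  value = "256"
}
```

A bad value here breaks one application's configuration, not the infrastructure release.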
Standards, culture, governance
Introduce governance. Define standards and culture in your repositories. Do your best to get these observed.
For consistency and readability define a style and standards when contributing to repositories.
Encourage:
- Provide a pull request template and make sure it is filled with as much detailed information as possible.
- Define a merge window.
- Decide whether it will be a maintainer or the contributor to merge.
- Express preference between merge styles.
- Provide commit messages convention.
- Squash commits.
- Use labels in your PRs.
- Create tags, releases and branches using the adopted style.
- Make sure CONTRIBUTING.md, MAINTAINERS.md and README.md exist and are up to date.
Discourage:
- Commit messages like: fix typo, add stuff, something changed. This is just disrespectful and unprofessional on repositories with many contributors.
- Titles in pull request like: “TICKET-XYZ” without even a brief description.
- Pull request bodies where the provided template has been removed or tampered with.
- Testing procedures like: “works locally”
The above are only some ideas for maintaining git repositories for modules. The message here is to define a culture when maintaining products, especially when these will be read by other members or teams. This must not slow down development: contributors should not be held back by such practices; instead, they should facilitate development. Do not overdo this with unnecessary practices.
Continuous development must never stop. Members will keep developing non-stop, and it will be on the maintainers to keep things clean and the repository healthy: merge PRs, cut releases and discuss them.
Release cadence
Follow a regular release cadence.
It is desirable that infrastructure gets a weekly or fortnightly release cadence; I believe it should not go beyond that. It is also good to deploy infrastructure changes in the same time window on a specific day of the week. For example, deploy infrastructure changes every Monday at 11:00 PM local market time, and make sure to unleash your automated tests!
Prepare a release call with other platform team members and project managers, say on a Wednesday or Thursday, to go through the release's changelog.
Review the changes in GitHub by comparing the proposed release with the current release. Run the plans with this release; review the plan's output and discuss additions, deletions, changes, etc. Apply only in stage environment(s).
Create the relevant tickets providing the next infrastructure version and the current infrastructure version, for easy rollback.
Store the production plans somewhere, say in S3, and only apply the plan at deployment time. On this specific point: think of ways to avoid drift, manual changes and provider changes (for example, pin provider versions) between plan creation and apply time. Worst case: re-plan, review, and if the changes are the same as the previous plan, apply. Also look at the plan with:
terraform plan -no-color | less
Make sure production plans do not auto apply, but prompt for confirmation.
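A sketch of this flow, assuming a hypothetical S3 bucket named `my-org-tf-plans` and the AWS CLI:

```shell
# at release time: create and store the plan
terraform plan -out=prod.tfplan
aws s3 cp prod.tfplan s3://my-org-tf-plans/prod/1.4.0/prod.tfplan

# at deployment time: fetch the stored plan and apply exactly that plan;
# Terraform refuses to apply a saved plan if the state has changed since
# it was created, which is the replan-and-review "worst case" above
aws s3 cp s3://my-org-tf-plans/prod/1.4.0/prod.tfplan .
terraform apply prod.tfplan
```

Keep in mind that applying a saved plan file skips the interactive confirmation, so the review gate has to happen before the apply step runs.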
Testing
Make sure testing routines are in place. Contributors will need to demonstrate tests are successful, usually by linking a green apply in lower environments directly in the body of a pull request; automated tests should also be available so people will not have to manually do this; consistent monitoring should be available in non-production environments. The code itself should be robust, include assert {}, validation {}, precondition {}, postcondition {} and custom error messages whenever possible. QA teams should also test production release candidates in stage environments.
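Native Terraform tests (>= 1.6) can automate part of this. A hypothetical test file for the load balancer module above, exercising the 32-character limit (variable values are invented for illustration):

```hcl
# tests/name_length.tftest.hcl — hypothetical, requires Terraform >= 1.6
run "name_fits_limit" {
  command = plan

  variables {
    app_name = "checkout"
    env      = "stage1"
  }

  assert {
    condition     = length(aws_lb.application.name) <= 32
    error_message = "app_name-env must fit within the 32 character load balancer name limit."
  }
}
```

Running `terraform test` in CI gives contributors an automated green light instead of a manually linked apply.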
Stage environment
The stage environment here is particularly helpful. One stage environment for all production environments can be enough, as long as it is complete and scaled to the bare minimum. Once production releases are created, deploy them to the stage environment first and make sure to redeploy/restart all your applications there. Make sure the apps come back up fine, monitoring is clear and QA gives the green light. If something is not right, roll back, freeze the release and address the problem.
Conclusions
The perfect method does not exist, but it is always possible to improve the scalability and reliability of the infrastructure. Split your infrastructure into modules and empower people to manage and maintain them. Do not slow down development, but make sure delivery is safe. Chunking a problem into smaller pieces helps. Be as consistent and disciplined as possible by defining standards and good practices. Coordinate with other teams through weekly calls and include as much testing as possible.