Managing Terraform resources with remote and versioned modules

2025-08-26

While this is probably not the first story of its kind, I want to share some notes about managing Terraform resources. If you are starting a brand new infrastructure, I can only suggest researching GitOps and Terragrunt, and taking a look at Terraform best practices. HCP Terraform also provides stacks, and the OpenTofu community is interested in this concept as well.

Before going any further: I do not consider this method a bullet-proof solution to most of the common problems, so treat it carefully and be critical. As usual, think of the good of your organisation first and be sure to evaluate as many scenarios as possible before committing to a certain path.

Each organisation defines its own standards and styles for managing its infrastructure and resources. Poor planning around scalable or DRY code will cause serious problems as the organisation grows: outages, rollback complications, hard-to-track issues, inconsistencies, drift and difficult maintenance. All of this pushes technical debt to an unsustainable level. IaC is powerful but, like any other software, it requires a good strategy to deliver the desired results. Roadmaps, versioning, release cadence and testing are all fundamental ingredients for delivering changes to production environments securely, efficiently and responsibly. These issues become even more problematic when a high number of contributors generates a high number of short-lived pull requests.

Modules

Define and use modules for better management.

The usage of modules can be taken for granted in most cases, as organisations already implement the concept in their configuration, but the implementation differs case by case. The official Terraform documentation on modules lists the problems modules can solve when used correctly.

When working with modules it is important to find a good balance in how atomic your approach will be. Creating a module for a very small set of resources can become overwhelming, confusing, dispersive and complicated; at the same time, designing modules which do multiple things, even if similar, is not great either. For example, a module for creating S3 buckets can be good practice, as you probably want to standardise all the security aspects and the naming convention across your organisation and keep things DRY. On the other hand, I do not see the utility of a module for uploading files to this S3 bucket, or even worse one module for each application in your stack or micro-service! The Terraform community provides a good amount of well maintained modules which can be used without the need to re-invent them.
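
As an illustration, a thin wrapper around the community S3 bucket module could enforce the naming convention and security defaults in one place. This is only a sketch: the wrapper name and var.org_prefix are assumptions, while the inputs shown are those of the public terraform-aws-modules/s3-bucket module.

module "app_bucket" {
    source  = "terraform-aws-modules/s3-bucket/aws"
    version = "~> 4.0"

    # naming convention standardised across the organisation (var.org_prefix is illustrative)
    bucket = "${var.org_prefix}-${var.env}-app-assets"

    # security defaults enforced for every bucket created through this wrapper
    block_public_acls       = true
    block_public_policy     = true
    ignore_public_acls      = true
    restrict_public_buckets = true
}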

Right now I feel comfortable with what I arbitrarily refer to as root modules, primitive modules and stack modules (optional). The way I see “modules” is simply git repositories with adaptive and scalable code; code which is written only once, stored in one place and then referenced by others.

The root module

The root module can be referred to as the caller of these modules. It provides the values to the modules, which then go ahead and create the resources accordingly. Using the root module as the centre of an infrastructure is acceptable in my opinion, especially when working with multiple macro-stacks, versions and release strategies. The root module can be scoped to a single stack (for example: terraform-backend), or it can include multiple stacks and reference them as modules too! For example, it is possible to design a root module specific to the backend which refers to child modules (or primitive modules) such as load balancers, Route53 entries and so on. This is probably the simplest and cleanest way and keeps a 1:1 ratio between the caller and the primitive module; I prefer this approach in most cases.

Alternatively, the root module can include multiple, different stacks (referred to as stack modules) and reference them by version. For example, the root module can include terraform-backend, terraform-frontend, terraform-database, terraform-data and terraform-finance, each pinned to a version. These modules then point to the primitive modules, creating what’s known as nesting. Let’s take this scenario as an example; the root module lives in its own repository:

  • terraform-root (or simply terraform-infrastructure)
.
├── environments
│   ├── prod1
│   │   └── variables.auto.tfvars.json
│   ├── prod2
│   │   └── variables.auto.tfvars.json
│   ├── prod3
│   │   └── variables.auto.tfvars.json
│   ├── qa1
│   │   └── variables.auto.tfvars.json
│   ├── qa2
│   │   └── variables.auto.tfvars.json
│   └── stage1
│       └── variables.auto.tfvars.json
├── data.tf
├── main.tf
├── modules.tf
├── provider.tf
└── variables.tf

modules.tf:

module "scheduler" {
    count  = var.scheduler_enabled ? 1 : 0
    source = "github.com/org/terraform-backend//scheduler?ref=${var.backend_version}"
    region = var.aws_region
    env    = var.env
}

module "api" {
    count  = var.api_enabled ? 1 : 0
    source = "github.com/org/terraform-backend//api?ref=${var.backend_version}"
    region = var.aws_region
    env    = var.env
}

module "frontend-js" {
    count  = var.frontend_js_enabled ? 1 : 0
    source = "github.com/org/terraform-frontend//js?ref=${var.frontend_version}"
    region = var.aws_region
    env    = var.env
}

[...]

or it is possible to structure it as follows:

module "scheduler" {
    count  = var.scheduler_enabled ? 1 : 0
    source = "github.com/org/terraform-backend?ref=${var.backend_version}"
    region = var.aws_region
    env    = var.env
}

module "frontend-js" {
    count  = var.frontend_js_enabled ? 1 : 0
    source = "github.com/org/terraform-frontend?ref=${var.frontend_version}"
    region = var.aws_region
    env    = var.env
}

[...]

variables.tf:

variable "backend_version" {
    description = "Code version of terraform-backend"
    default     = "1.0.0"
}

variable "frontend_version" {
    description = "Code version of API"
    default     = "1.0.1"
}

variable "aws_region" {
    description = "AWS Region"
    default     = "us-east-1"
}

variable "env" {
    description = "Environment name"
}
[...]

./environments/prod2/variables.auto.tfvars.json:

{
    "env": "prod1",
    "aws_region": "us-west-2",
    "scheduler_enabled": true,
    "api_enabled": true
}

The root module can go down to individual apps from both frontend and backend by their respective versions, or it can reference the whole repository as a module without referring to individual applications. The root module then becomes a collection of versions, conditional resources and values, making it a compact source of truth for easily answering “what went out where and when” type of questions. This is, again, not the simplest way of managing multiple stacks, but it allows more control over resources and is more centralised.
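
For example, a QA environment could pin a release candidate of the backend while production environments stay on the current release; the file path and version numbers below are purely illustrative:

./environments/qa1/variables.auto.tfvars.json:

{
    "env": "qa1",
    "aws_region": "us-east-1",
    "backend_version": "1.1.0-rc1",
    "frontend_version": "1.0.1",
    "scheduler_enabled": true,
    "api_enabled": true
}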

When the root module is also the stack module, things will probably look simpler; versioning can still be applied to the module itself, although the submodules will then be versioned and released together with the module that calls them.

Primitive module

In an AWS environment the primitive module is the one which actually creates the relevant resources: an EKS cluster, EC2 instances, CloudFront distributions, S3 buckets, VPCs and subnets, CloudWatch monitors/alarms, RDS instances and so on. These modules should not call other custom modules and should stay as primitive as possible; use official or popular community modules where possible. Think of them as separate repositories as follows:

  • terraform-engine-kubernetes
  • terraform-engine-computing
  • terraform-engine-loadbalancers
  • terraform-engine-cdn
  • terraform-engine-network
  • terraform-engine-observability
.
├── README.md
├── main.tf
└── variables.tf

variables.tf should not include default values. The whole module should make use of assert {}, validation {}, precondition {}, postcondition {} and custom error messages on resources and variables as much as possible. Example for terraform-engine-loadbalancers:

main.tf:

resource "aws_lb" "application" {
  name                       = var.name
  internal                   = var.internal_only
  load_balancer_type         = var.elb_type
  [...]
}

resource "aws_lb_target_group" "app" {
  name                       = var.name
  protocol                   = var.tg_protocol
  target_type                = var.target_type
  [...]
}

variables.tf:

variable "app_name" {
  description = "Application name"
  type        = string
}

locals {
    name = "${var.app_name}-${var.env}"
}

variable "name" {
    description = "Name must not exceed 32 characters"
    validation {
        condition     = length(local.name) <= 32
        error_message = "The combined length of app_name-env must not exceed 32 characters."
  }
}

variable "internal_only" {
    dedscription = "Whether the ELB is publicly accessible"
}

[...]
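
Resource-level preconditions and postconditions complement variable validation. A sketch of how the same aws_lb resource could be extended; the conditions shown are only examples, not requirements of the module:

resource "aws_lb" "application" {
  name               = local.name
  internal           = var.internal_only
  load_balancer_type = var.elb_type
  [...]

  lifecycle {
    # fail at plan time if an unsupported load balancer type slips through
    precondition {
      condition     = contains(["application", "network"], var.elb_type)
      error_message = "elb_type must be either \"application\" or \"network\"."
    }

    # fail after apply if the created load balancer has no DNS name
    postcondition {
      condition     = self.dns_name != ""
      error_message = "The load balancer was created without a DNS name."
    }
  }
}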

These modules can be maintained specifically by platform engineers or infrastructure engineers. Some of these modules can become complex and, in my opinion, also very delicate; this is why they are split from the “other modules”.

Stack module (optional)

The optional stack module creates the platforms or stacks by referencing the primitive modules. While I see this concept as strategic for bigger stacks, I also believe it can add unneeded complexity as it acts as an intermediary. It can become particularly helpful during the gradual process of moving out of a monolithic repository which stores most, if not all, of your IaC, while still keeping that central repository.

Stick to a simpler, standard “root (stack) <-> primitive” design where possible.

Think of them as separate repositories as follows:

  • terraform-backend
  • terraform-frontend
  • terraform-finance
  • terraform-clientservice
  • terraform-observability

For example terraform-backend consists of an API server and a scheduler:

.
├── api
│   ├── ec2.tf
│   ├── README.md
│   └── variables.tf
├── scheduler
│   ├── configs
│   ├── policies
│   ├── data.tf
│   ├── iam.tf
│   ├── README.md
│   ├── s3.tf
│   ├── elb.tf
│   └── variables.tf
├── modules.tf
├── README.md
└── variables.tf

./modules.tf:

module "apiserver" {
    source = "./api"
    region = var.region
    [...]
}

module "scheduler" {
    source = "./scheduler"
    region = var.region
    [...]
}

./scheduler/elb.tf:

module "scheduler_elb" {
    source  = "github.com/org/terraform-engine-loadbalancers?ref=${var.loadbalancer_version}"
    name    = var.app_name
    env     = var.env
    lb_type = var.lb_type
    [...]
}

./scheduler/s3.tf:

resource "aws_s3_object" "config" {
  for_each               = fileset("${path.module}/configs/","*.json")
  bucket                 = "${var.environment}-bucket"
  key                    = "scheduler/${each.key}"
  source                 = "${path.module}/configs/${each.key}"
  tags = {
    environment = var.environment
    service     = var.app_name
  }
}

./scheduler/variables.tf:

variable "loadbalancer_version" {
    default = "2.3.1"
}

variable "env" {
    description = "Environment name (must be one of prod1, prod2, prod3, qa1, qa2, stage1)"
    type        = string
    
    validation {
        condition     = contains(["prod1", "prod2", "prod3", "qa1", "qa2", "stage1"], var.env)
        error_message = "The env variable must be one of: prod1, prod2, prod3, qa1, qa2, stage1."
    }
}

Same as above: the whole module should make use of assert {}, validation {}, precondition {}, postcondition {} and custom error messages on resources and variables as much as possible, although default values are allowed when specifying the versions of the primitive modules.

These can be maintained by platform engineers, infrastructure engineers or developers specifically working on this part of the infrastructure. In my opinion, these modules are not as delicate as the primitive modules: if a bug is introduced here, it should only affect a specific application or portion of the infrastructure.

Organisations with structures like this will, at some point, realise that most contributions target these modules rather than the primitive modules, whose development instead tends to slow down once they reach a certain maturity.

New applications, changes to static files or content, new security group rules, changes to custom policies, new CloudWatch monitors/alarms, new automations and new tags on resources can all be committed to these stack modules while leaving the primitive modules alone. Work on the primitive modules impacts all the modules referencing them, increasing the risk of incidents; this is also why I suggest setting no default values on the engine (primitive) modules.

Another small advantage is the possibility to gently promote a new version of the terraform-engine-loadbalancers module to specific apps and/or environments rather than rolling it out everywhere at the same time, which allows extra prudence. On this last point, proceeding this way can lead to forgotten inconsistencies across environments, so do it carefully, only when really necessary, and re-align as soon as possible.

Nesting, problems and inconsistencies

As previously mentioned, while versioning and modularisation help with deployments, rollbacks, roadmaps, a consistent commit history and more, they also imply a constant follow-up on the gap between the currently deployed release and the next one. Say production environments get weekly releases: it is vital to pick the correct versions to include/deploy by reviewing the differences in their specific git repositories and making sure all the components are in sync. GitHub already includes a “Compare to” button between releases which should help with this analysis.

Let’s say a new terraform-engine-loadbalancers (primitive module) version now requires a new variable/value: the stack modules referencing it must include this new variable in their next version as well. As long as Terraform complains before proceeding any further that is relatively fine; still, there is a chance of including either too-new or too-old module versions with unexpected changes which may silently reach production. Keep good track of the progress across the modules and follow strict guidelines when choosing the next releases, especially when nesting modules.
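
Concretely, the stack module would bump the ref and pass the new input in the same change; idle_timeout here is a hypothetical example of such a new variable:

module "scheduler_elb" {
    # bump var.loadbalancer_version (e.g. 2.3.1 -> 2.4.0) in the same pull request
    source       = "github.com/org/terraform-engine-loadbalancers?ref=${var.loadbalancer_version}"
    name         = var.app_name
    env          = var.env
    lb_type      = var.lb_type
    idle_timeout = var.idle_timeout # hypothetical new input required by the newer module version
    [...]
}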

Nesting modules, with its recommendations and potential problems, is also covered by the official Terraform documentation. Avoid nesting too deeply if possible, especially when there are not many stacks or contributors. Benefit from staying lightweight.

Infrastructure is not application

Create a clear boundary between infrastructure and application. Keep the two realities apart as much as possible.

The infrastructure is only meant to provision the resources the application will live in. The application must not have any idea of where it is running or whether it is externally reachable or not. Infrastructure provides host instances, EKS clusters, load balancers, DNS entries, and none of this matters to the application itself.

The same applies the other way around: the infrastructure must not care if the application has new configuration changes or a different log level. If an application fetches secret values from AWS Secrets Manager or configurations from SSM Parameter Store, this should not affect the infrastructure. aws_ssm_parameter resources specific to application configurations must live in their own repository and follow their own release cadence and strategies, even replicating what was shown above with modules. Missing or misconfigured configuration values must not affect infrastructure modules.

There are some undesired consequences when application configurations are part of the infrastructure (stack/root modules), especially when configurations are stored remotely and consumed at application start-up, since this requires creating remote resources which you probably also want to manage with Terraform.

  • Configurations change often and infrastructure must keep the pace: applications change often; they regularly introduce new parameters or change the value of existing ones. If these are part of your infrastructure plan, the platform or infrastructure team will have to deal with such resources when they are not supposed to. Platform engineers should not have to deal with application cache sizes, maximum numbers of threads or application behaviour in general.

  • Configurations are usually numerous and will slow you down: in many cases configurations add up to a large number of resources. I am aware of scenarios where each environment has 1500+ configuration resources between SSM parameters, secrets and feature flags. This slows down infrastructure plans and causes frustration to platform and infrastructure engineers.

  • Hot fixes and regressions: say a new configuration parameter turns out to be undesired; on a versioned infrastructure (which these parameters would be part of), this implies creating a new “hot fix” release to address it, as you probably do not want to roll back the other infrastructure changes for just a bad string. Once the problem is figured out, the fix will either be included in a new release along with other changes or be the only change in the next release. I strongly discourage this when mixed with infrastructure, while it can be acceptable when the whole module is specific to parameters and configurations.

  • Bad mix: the application/infrastructure mix is just poison. Developers and platform engineers will conflict almost immediately and the whole organisation will suffer the consequences.

    • Developers will wait for the infrastructure to be up to date with what their code expects; delays in deploying infrastructure will also impact application deployments.
    • Configuration changes happen too quickly: perhaps someone prepares the infrastructure ahead of time, the infrastructure gets applied, a container restarts and automatically pulls the new value while the code is not yet ready for it.

Replicate the same setup described above for configurations if they are stored remotely; a collection of AWS SSM parameters will in any case be simpler to manage than other infrastructure resources. When planning the release of a new version of the application, or of a stack of applications, also include the deployment of configuration changes in the same time window. If something goes wrong, simply roll back the collection of parameters together with the code.
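
A minimal sketch of such a configuration-only repository, assuming a var.parameters map and the same env/app_name variables used earlier:

variable "parameters" {
  description = "Application configuration values, managed in their own repository and release cadence"
  type        = map(string)
}

resource "aws_ssm_parameter" "app_config" {
  for_each = var.parameters

  name  = "/${var.env}/${var.app_name}/${each.key}"
  type  = "String"
  value = each.value

  tags = {
    environment = var.env
    service     = var.app_name
  }
}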

Standards and culture

Define standards and culture in your repositories and do your best to get them observed.

For consistency and readability, define a style and standards for contributing to repositories.

Encourage:

  • Provide a pull request template and make sure it is filled with as much detailed information as possible.
  • Define a pull request window.
  • Decide whether a maintainer or the contributor does the merging.
  • Express preference between merge or rebase.
  • Provide commit messages convention.
  • Squash commits.
  • Use labels.
  • Create tags and releases using the same style.
  • Make sure CONTRIBUTING.md, MAINTAINERS.md, README.md exist and are up to date.

Discourage:

  • Commit messages like: fix typo, add stuff.
  • Titles in pull request like: “TICKET-XYZ” without even a brief description.
  • Pull request bodies where the template has been stripped or manipulated.
  • Testing procedures like: “works locally”.

The above are only some ideas for maintaining git repositories for modules. The message here is to define a culture when maintaining products, especially when they will be read by other members or teams. This must not slow contributors down; instead it should facilitate development. Do not overdo it with unnecessary practices.

Continuous development must never stop. Members will keep developing non-stop and it will be up to the maintainers to keep things clean and the repository healthy, merge PRs, cut releases and discuss them.

Release cadence

Follow a regular release cadence.

It is desirable for infrastructure to get a weekly or fortnightly release cadence; I believe it should not stretch further than that. It is also good to deploy infrastructure changes in the same time window on a specific day of the week. For example, deploy infrastructure changes every Monday at 11:00 PM local market time.

Prepare a release call with the other platform team members and project managers, say on a Wednesday or Thursday, to go through what is being submitted for the release.

Review the changes in GitHub by comparing the proposed release with the current release. Run the plans with this release; review the plan output and discuss additions, deletions, changes, etc. Apply only in the stage environment(s).

Create the relevant tickets providing the next infrastructure version and the current infrastructure version, for easy rollback. Get all these tickets fully approved by the relevant teams.

Store the production plans somewhere, say in S3, and only apply the plan at deployment time. On this specific point: think of ways to avoid drift, manual changes and provider changes (for example, stick to fixed provider versions) between the creation of the plan and apply time. Worst case: re-plan, review and, if the changes are the same as in the previous plan, apply.

terraform plan -no-color -out=terraform.tfplan > terraform_plan.txt
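
To limit provider changes between plan time and apply time, provider versions can be pinned exactly in the root module; a minimal sketch, with illustrative version numbers:

terraform {
  required_version = "~> 1.9"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      # exact pin so the apply uses the same provider build the plan was created with
      version = "= 5.60.0"
    }
  }
}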

Make sure production plans do not auto apply, but prompt for confirmation.

Testing

Make sure testing routines are in place. Contributors will need to demonstrate that tests were successful, usually by linking a green apply in a lower environment directly in the body of a pull request. Automated tests should also be available so people do not have to do this manually, and consistent monitoring should be available in non-production environments (perhaps without real alerting!). The code itself should be robust and include assert {}, validation {}, precondition {}, postcondition {} and custom error messages whenever possible. QA teams should also test production release candidates in stage environments.
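
Automated tests can live next to the modules themselves; a minimal sketch using the native terraform test framework (Terraform 1.6+/OpenTofu), exercising the app_name length validation from the earlier load balancer example:

# tests/name_length.tftest.hcl
run "name_length_is_enforced" {
  command = plan

  variables {
    app_name = "a-very-long-application-name-that-overflows"
    env      = "stage1"
    # [...] set the remaining required inputs here
  }

  # the plan is expected to fail on the app_name validation block
  expect_failures = [
    var.app_name,
  ]
}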

Stage environment

The stage environment is particularly helpful here. One stage environment for all production environments can be enough, as long as it is complete and scaled to the bare minimum. Once production releases are created, deploy them to the stage environment first and make sure to redeploy/restart all your applications there. Make sure the apps come back up fine, monitoring is clear and QA gives a green light. If something is not right, roll back, freeze the release and address the problem.

Conclusions

The perfect method does not exist, but it is always possible to improve the scalability and reliability of the infrastructure. Split your infrastructure into modules and empower people to manage and maintain them. Do not slow down development, but make sure delivery is safe. Splitting a problem into smaller ones helps with the overall result. Be as consistent and disciplined as possible by defining standards and good practices. Coordinate with other teams through weekly calls and include as much testing as possible.