Containers and exit codes again

2026-02-07

Like any other process, a container exits with an exit code too. I feel the importance of these exit codes is usually underestimated or even ignored, yet they can provide valuable information on the behaviour of your application.

When looking at exit codes it is impossible not to mention signals, as the two subjects go hand in hand. Before proceeding any further I strongly encourage you to look at Appendix E. Exit codes with special meanings from tldp.org, signal(7), bash(1) - exit status and bash(1) - signals.
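
As a quick refresher, a process killed by a signal normally exits with 128 plus the signal number. A minimal shell sketch (sleep is just a stand-in for your application):

sleep 300 &
kill -TERM $!   # SIGTERM is signal 15
wait $!
echo $?         # prints 143 (128 + 15)

sleep 300 &
kill -KILL $!   # SIGKILL is signal 9
wait $!
echo $?         # prints 137 (128 + 9)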

Once you have read the content in those links you will immediately understand the importance of these codes. So, why do these codes matter that much in containers? I will try to answer this question from the standpoint of AWS ECS on EC2, using the AL2023 ECS-optimized AMI. The same principles generally apply to plain Docker, Kubernetes, ECS, EKS.

AL2023 uses the ECS Agent to manage containers on an instance. When a container gets deployed to ECS, it is the ECS Agent that starts it with the specifics described in the task definition and manages its lifecycle. Containers get terminated at some point, and it is the ECS Agent that delivers the shutdown signal, also known as SIGTERM. If all is good, a SIGTERM is more than enough: the container wraps up all its stuff and retires; this is commonly known as a graceful shutdown, and the exit code is typically 0 or 143. By default, the ECS Agent will not wait longer than 30s before forcefully shutting down a container. When a container does not listen to the signal, or takes longer than 30s to finish what it is doing, the ECS Agent sends a SIGKILL, which results in exit code 137. This is a bad sign.

  1. If your application is in the middle of something and gets killed, the results can be unexpected. Say it is in the middle of a database transaction: that can cause locks, slowness and other unpredictable behaviours. Another example is a missed HTTP POST request at the end of a session, which can leave that customer's session in an inconsistent state. When writing to files, it can leave them corrupted or with truncated content, and so on.
  2. Real problems are masked. If your application is genuinely failing at something, you will never know by looking at the exit code alone. The SIGKILL will likely mask a more specific exit code, for example a SIGABRT (134) or any other interesting signal which suggests your application encountered a bug.
  3. If you own an instance with AL2023, you might see systemd-coredump starting up… and eventually doing nothing. Perhaps your application also hits a bug while it is being stopped, but the agent kills the process while systemd-coredump is still trying to create a dump. A waste of time and resources.
  4. The application wastes time shutting down and restarting. New deployments can be slower, and scaling activities have to wait for those extra seconds.

Generally, the good exit codes are 0 and 143. All the other codes usually mean something went bad, and indicate the kind of bad. There could be several reasons why your application gets forcefully terminated.
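
A quick way to check how a stopped container exited is to inspect its state (the container ID here is made up):

# docker inspect -f '{{.State.ExitCode}}' ab12cd34ef56
143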

Another golden rule is to make sure your application is started as PID 1 in your container:

# docker exec -it ab12cd34ef56 '/usr/bin/ps' '-A'
    PID TTY          TIME CMD
      1 pts/0    04:34:38 dotnet

Your Dockerfile’s fault

The Dockerfile describes how your image is built and how your container gets started. The startup behaviour is usually expressed with the CMD [] and ENTRYPOINT [] instructions. They are not mutually exclusive and can be used together (in fact, they usually should be).

Let’s say you want to start your .NET application:

ENTRYPOINT [ "dotnet", "myApp.dll" ]

In this case you will see dotnet as PID 1 (the init process). When PID 1 is not your application (and not even tini, aka docker-init, discussed below), your application will likely struggle with signals and will probably shut down dirty.
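
If you also need default arguments, CMD combines naturally with that ENTRYPOINT. A sketch (the argument is made up):

# ENTRYPOINT defines the executable, CMD the default (overridable) arguments
ENTRYPOINT [ "dotnet", "myApp.dll" ]
CMD [ "--urls", "http://0.0.0.0:8080" ]

Anything passed after the image name on docker run replaces CMD, while ENTRYPOINT stays in place.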

Let’s say you want to start a script which will perform extra steps before starting your application:

ENTRYPOINT [ "/usr/local/bin/startup.sh" ]

Do not forget to terminate the startup script with the magic word exec when finally calling your application.

#!/bin/bash
set -e

# use a local config file if present, otherwise fetch one from S3
if [[ -f "/$(whoami)/.myApp/config.cfg" ]]; then
    CONF="/$(whoami)/.myApp/config.cfg"
else
    aws s3 cp s3://my-s3-bucket/config.cfg /opt/config.cfg
    CONF="/opt/config.cfg"
fi

# exec replaces the shell: myApp becomes PID 1 and receives signals directly
exec /opt/bin/myApp --config="$CONF" "$@"

The exec built-in command “replaces the shell without creating a new process”, which will eventually make your application PID 1. "$@" allows additional, external arguments to be passed to the application.
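
To make the "$@" part concrete, those extra arguments can come from CMD (or from the docker run command line). A sketch with a made-up flag:

ENTRYPOINT [ "/usr/local/bin/startup.sh" ]
# these defaults become "$@" inside startup.sh and are handed to the exec'd application;
# starting the container with different arguments replaces them
CMD [ "--log-level=info" ]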

Entrypoint scripts which pipe (|) the final invocation will break the concept: every part of a pipeline runs in a child process, so the shell stays PID 1 and your application never receives the shutdown signal, which will likely be followed by a SIGKILL. Redirect operators (<, >, >>) combined with exec do keep your application as PID 1, but they send its output away from the container's stdout/stderr, hiding it from docker logs:

myApp | grep "hello"            # the shell stays PID 1; myApp never sees SIGTERM
exec myApp | grep -v "Debug"    # exec runs in a pipeline subshell; the shell still stays PID 1
exec myApp > file.txt           # PID 1 is fine here, but the output bypasses the container logs

Use the exec form in your Dockerfile whenever possible. It can be syntactically more demanding, but it is usually the most compatible form.
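
The distinction matters because the shell form wraps your command in /bin/sh -c, so the shell, not your application, ends up as PID 1:

# shell form: /bin/sh -c "dotnet myApp.dll" is PID 1, dotnet is a child that never sees SIGTERM
ENTRYPOINT dotnet myApp.dll

# exec form: dotnet itself is PID 1 and receives signals directly
ENTRYPOINT [ "dotnet", "myApp.dll" ]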

init… tini

One of those projects I love is tini. It is just an init for containers: tini performs signal forwarding and zombie reaping in your container.

It has been part of the Docker Engine since version 1.13, exposed through the --init flag.

AWS ECS offers the same facility in task definitions through initProcessEnabled.
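
A sketch of both options (image and container names are made up); in a task definition, initProcessEnabled sits under linuxParameters in the container definition:

# plain Docker: start tini (docker-init) as PID 1 in front of your command
docker run --init myorg/myapp:latest

"containerDefinitions": [
  {
    "name": "myApp",
    "linuxParameters": {
      "initProcessEnabled": true
    }
  }
]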

tini can't do miracles though. If your application can't handle signals or the container is not started correctly, you will not benefit from it.

Increase the timeout

Perhaps your application is just slow to shut down in some circumstances, and sometimes simply waiting for it to finish is the wisest choice. In an AWS ECS task definition it is possible to increase stopTimeout to a value higher than the default 30s.

On Fargate the maximum is 2 minutes (120 seconds).
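
stopTimeout is defined per container, in seconds. A sketch (the container name is made up):

"containerDefinitions": [
  {
    "name": "myApp",
    "stopTimeout": 90
  }
]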

Your Application’s fault

Perhaps your application is written in a way that just can't handle signals. While this may sound uncommon, it happens. Refer to the manual of your language or runtime to learn how signals are handled, as this differs between implementations and languages.
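
What "handling signals" looks like varies by language; as an illustration in shell terms, a minimal bash sketch that catches SIGTERM and forwards it to a supervised process could look like this (paths are the made-up ones from the earlier script):

#!/bin/bash

# start the application in the background and remember its PID
/opt/bin/myApp "$@" &
child=$!

# forward SIGTERM/SIGINT to the application instead of dying with it
trap 'kill -TERM "$child" 2>/dev/null' TERM INT

# wait returns early (status > 128) when the trap fires, so keep waiting
# until the application has really exited, then propagate its exit code
wait "$child"
status=$?
while kill -0 "$child" 2>/dev/null; do
    wait "$child"
    status=$?
done

exit "$status"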

Extra

It's probably worth mentioning that, when a correctly started application in a Docker container fails due to a bug, the crash is likely to be intercepted by systemd-coredump, which will log a summary of the events related to your process to the systemd-journald service, and possibly provide a stack trace which can be useful for debugging. AL2023 is based on Fedora and uses systemd, so systemd-coredump is installed and runs by default.
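
Assuming you are on the instance where the crash happened, the related journal entries can be filtered by syslog identifier, for example:

$ sudo journalctl -t systemd-coredump --since "1 hour ago"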

Let's say your container exits with a genuine SIGTRAP (133), SIGABRT (134) or SIGSEGV (139): there is a good chance you will understand what happened by looking at the journal and the stack trace, beyond your usual logs, telemetry and metrics. Use coredumpctl to know more about the unexpected exit of your container. If your core file is truncated or does not exist, make sure ulimit -c, whether set on the host or in the task definition ("ulimits": []), is properly configured. The core dump file can be found in /var/lib/systemd/coredump/, ready to be analysed.

$ sudo coredumpctl
TIME                         PID UID GID SIG     COREFILE EXE                       SIZE
Sun 2026-02-08 12:39:41 UTC 5705 333 333 SIGTRAP present  /path/to/executable      59.2M
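
From there, coredumpctl can show the details of a single crash, or open the core file straight in a debugger (the PID is the one from the listing above; gdb has to be installed on the instance):

$ sudo coredumpctl info 5705
$ sudo coredumpctl debug 5705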