AWS ECS Exec のタイムアウト対策

新年あけまして地味なことから始めましょう。

ECS Exec におけるタイムアウト対策の雑多なメモで、AWS ECS Exec を使ってみたTips | 外道父の匠の補足みたいな内容です。

問題点

引っかかった問題はシンプルで、こんな風にごく普通に execute_command を実行したときに、20分強経過するとプロセスが強制終了されるというものです。

ecs = boto3.client('ecs', region_name="ap-northeast-1")

cluster     = 'test'
task_arn    = 'arn:aws:ecs:ap-northeast-1:0123456789:task/test/0123456789abcdef'
container   = None
interactive = True

command = "/bin/sh -c 'echo ABC; /bin/sleep 1800'"

params = {
    "cluster"    : cluster,
    "task"       : task_arn,
    "interactive": interactive,
    "command"    : command,
}
if container: params["container"] = container

res = ecs.execute_command(**params)
print(res)

ecs = boto3.client('ecs', region_name="ap-northeast-1")

cluster = 'test'

task_arn = 'arn:aws:ecs:ap-northeast-1:0123456789:task/test/0123456789abcdef'

container = None

interactive = True

command = "/bin/sh -c 'echo ABC; /bin/sleep 1800'"

params = {

"cluster" : cluster,

"task" : task_arn,

"interactive": interactive,

"command" : command,

}

if container: params["container"] = container

res = ecs.execute_command(**params)

print(res)

実行時、コンテナ内の関連プロセスはこういう感じ。

/managed-agents/execute-command/amazon-ssm-agent
/managed-agents/execute-command/ssm-agent-worker
/managed-agents/execute-command/ssm-session-worker ecs-execute-command-123456789 abcdef2345678901
/bin/sh -c echo ABC; /bin/sleep 1800
/bin/sleep 1800

/managed-agents/execute-command/amazon-ssm-agent

/managed-agents/execute-command/ssm-agent-worker

/managed-agents/execute-command/ssm-session-worker ecs-execute-command-123456789 abcdef2345678901

/bin/sh -c echo ABC; /bin/sleep 1800

/bin/sleep 1800

で、本来1800秒経ってから終わってほしいプロセスが、1210～1250秒程度経過したら落とされてしまいます。当時から認識してたけど、見て見ぬ振りしてたヤツ;-(

これについて、どういう関連リソースがあるのか、どこにログがあるのか、といった調査用情報をメモっていき、最終的な(仮)解決策を示しておきます。

AWS Systems Manager ＞ Session Manager の設定

最も最初に気になるのは AWS Systems Manager（旧SSM）の Session Manager の設定で、ECSのドキュメントに書いてある Idle Timeout です。

ECS Execセッションのデフォルトのアイドルタイムアウト値は 20 分です。詳細については、「AWS Systems Managerユーザーガイド。」のアイドルセッションのタイムアウト値を指定するを参照してください。

デバッグ用にAmazon ECS Exec を使用 – Amazon Elastic Container Service

また、なぜか英語のドキュメントにしかありませんが、セッションの最大有効時間という設定もあります。

Specify maximum session duration – AWS Systems Manager

これらは変更APIが無いっぽいので管理画面で変更するのですが、結論を言うと

Idle session timeout : Default = 20 minutes → 1 minute

Maximum session duration : Default = Disable → Enable 1 minute

のように変更しても、ECS Exec には一切影響できませんでした。逆に 30 minutes など増やしても同様。

他の用途もある設定なので、実はECS Execには適用されないとかかもですが、他にそれらしい設定を見つけることができなかったので、この問題を正々堂々解決することは難しくなりました。（…が、この設定が適用されないのもオカシイので、何か見落としているかもしれない）

強制終了を示すログ

ECS Exec + SSM に関する情報はほとんどないので、なんとかログなど探していくわけですが、まず Exec に関する設定はここにあります。

１つはこれですが、情報不足なのと、解決のための編集の余地がなさそうだったので見るだけです。
/managed-agents/execute-command/configuration/amazon-ssm-agent.json

{
        "Mgs": {
                "Region": "",
                "Endpoint": "",
                "StopTimeoutMillis": 20000,
                "SessionWorkersLimit": 10
        },
        "Agent": {
                "Region": "",
                "OrchestrationRootDir": "",
                "ContainerMode": true
        }
}

{

"Mgs": {

"Region": "",

"Endpoint": "",

"StopTimeoutMillis": 20000,

"SessionWorkersLimit": 10

"Agent": {

"Region": "",

"OrchestrationRootDir": "",

"ContainerMode": true

}

もう１つはこれで、ログファイルが書いてあるので、これを追います。
/managed-agents/execute-command/configuration/seelog.xml

<seelog type="adaptive" mininterval="2000000" maxinterval="100000000" critmsgcount="500" minlevel="info">
        <exceptions>
                <exception filepattern="test*" minlevel="error"/>
        </exceptions>
        <outputs formatid="fmtinfo">
                <console formatid="fmtinfo"/>
                <rollingfile type="size" filename="/var/log/amazon/ssm/amazon-ssm-agent.log" maxsize="40000000" maxrolls="1"/>
                <filter levels="error,critical" formatid="fmterror">
                        <rollingfile type="size" filename="/var/log/amazon/ssm/errors.log" maxsize="10000000" maxrolls="1"/>
                </filter>
        </outputs>
        <formats>
                <format id="fmterror" format="%Date %Time %LEVEL [%FuncShort @ %File.%Line] %Msg%n"/>
                <format id="fmtdebug" format="%Date %Time %LEVEL [%FuncShort @ %File.%Line] %Msg%n"/>
                <format id="fmtinfo" format="%Date %Time %LEVEL %Msg%n"/>
        </formats>
</seelog>

</exceptions>

</filter>

</outputs>

</formats>

</seelog>

/var/log/amazon/ssm/errors.log には、コマンドがぶった切られた様子がありますが、詳細がわかるものはありません。

[DataBackend] [pluginName=InteractiveCommands] unable to wait pty: signal: killed
[DataBackend] [pluginName=InteractiveCommands] pty process: 2989 exited unsuccessfully, error message: signal: terminated

1 2	[DataBackend] [pluginName=InteractiveCommands] unable to wait pty: signal: killed [DataBackend] [pluginName=InteractiveCommands] pty process: 2989 exited unsuccessfully, error message: signal: terminated

/var/log/amazon/ssm/amazon-ssm-agent.log には、始まりから終わりまで、結構詳細な内容が記録されます。このうち、20分経過で打ち切られた場合のログはこんな感じ。

[DataBackend] [pluginName=InteractiveCommands] Stopping pty
[pluginName=InteractiveCommands] The session was cancelled
[pluginName=InteractiveCommands] Reach the retry limit 5 for receive messages. Error: websocket: close 1006 (abnormal closure): unexpected EOF

[DataBackend] [pluginName=InteractiveCommands] unable to wait pty: signal: killed
[DataBackend] [pluginName=InteractiveCommands] Stopping pty

[DataBackend] [pluginName=InteractiveCommands] Closing datachannel with channel Id ecs-execute-command-0123456789abc
[DataBackend] [pluginName=InteractiveCommands] Closing websocket channel connection to: wss://ssmmessages.ap-northeast-1.amazonaws.com/v1/data-channel/ecs-execute-command-0123456789abc?role=publish_subscribe

[DataBackend] [pluginName=InteractiveCommands] Stopping pty

[pluginName=InteractiveCommands] The session was cancelled

[pluginName=InteractiveCommands] Reach the retry limit 5 for receive messages. Error: websocket: close 1006 (abnormal closure): unexpected EOF

[DataBackend] [pluginName=InteractiveCommands] unable to wait pty: signal: killed

[DataBackend] [pluginName=InteractiveCommands] Stopping pty

[DataBackend] [pluginName=InteractiveCommands] Closing datachannel with channel Id ecs-execute-command-0123456789abc

[DataBackend] [pluginName=InteractiveCommands] Closing websocket channel connection to: wss://ssmmessages.ap-northeast-1.amazonaws.com/v1/data-channel/ecs-execute-command-0123456789abc?role=publish_subscribe

前後の内容的に、wss://ssmmessages.ap-northeast-1.amazonaws.com との websocket 通信が無効になったっぽいことがわかります。

コマンドが打ち切られるのが大体1210～1250秒くらいなので、厳密ではないもののSSMのセッション切れデフォルト時間20分と関連していると見るのが妥当。

関係ないけどメモっとく関連リソース

ECS Exec + SSM において、セッション情報を見ると関連しているリソースが２つあります。

SSMドキュメント

セッション履歴に、セッションごとに適用された「ドキュメント」というものが記録されます。

長いので省略しますが、設定JSONになっており、編集の余地はなさそうです。

AWS Systems Manager > ドキュメント > AmazonECS-ExecuteInteractiveCommand

STS + ServiceRole

セッションの所有者がこれなので

arn:aws:sts::1234567890:assumed-role/AWSServiceRoleForECS/ecs-execute-command

この周りも調べてみたけど、サービスロールに認証の有効期限はないので、関係なさげ。

コマンドの出力と結果

コンテナのファイルや

/var/lib/amazon/ssm/***-***/session/orchestration/ecs-execute-command-***/InteractiveCommands/ecs-execute-command-***.log

その同内容が CloudWatchLogs に残されるのですが、例えばこういうコマンドだったとすると

/bin/sh -c 'echo ABC; /bin/sleep 5'

1	/bin/sh -c 'echo ABC; /bin/sleep 5'

こういうログになります。

Script started on 2022-01-01 14:57:00+00:00 [<not executed on terminal>]
ABC
 
Script done on 2022-01-01 14:57:05+00:00 [COMMAND_EXIT_CODE="0"]

Script started on 2022-01-01 14:57:00+00:00 [<not executed on terminal>]

ABC

Script done on 2022-01-01 14:57:05+00:00 [COMMAND_EXIT_CODE="0"]

これを、打ち切られるように実行時間を調整しても、やはり同じログになります。

sleep は成功すると 0 を返し、途中でやめると 130 を返すので、常に COMMAND_EXIT_CODE=”0″ というのもおかしい話です。ということは、誰かが強制的に”0″にしているか、そもそもこの返り値はコマンドそのものの返り値ではない、ということになります。

コマンド・プロセスの確認

そこで、プロセスの構成を確認してみます。関連箇所のみ載せています。

#  ps -ef
UID   PID  PPID  C STIME TTY       TIME CMD
root    1     0  0 07:03 ?     00:00:00 /dev/init -- /usr/bin/tail -f /dev/null
root    9     1  0 07:03 ?     00:00:00 /usr/bin/tail -f /dev/null
root   10     0  0 07:03 ?     00:00:01 /managed-agents/execute-command/amazon-ssm-agent
root   21    10  0 07:03 ?     00:00:01 /managed-agents/execute-command/ssm-agent-worker
root 2638    21  0 12:07 ?     00:00:00 /managed-agents/execute-command/ssm-session-worker ecs-execute-command-***
root 2646  2638  0 12:07 pts/1 00:00:00 /bin/sh -c /bin/sleep 1320
root 2647  2646  0 12:07 pts/1 00:00:00 /bin/sleep 1320

# ps -ef

UID PID PPID C STIME TTY TIME CMD

root 1 0 0 07:03 ? 00:00:00 /dev/init -- /usr/bin/tail -f /dev/null

root 9 1 0 07:03 ? 00:00:00 /usr/bin/tail -f /dev/null

root 10 0 0 07:03 ? 00:00:01 /managed-agents/execute-command/amazon-ssm-agent

root 21 10 0 07:03 ? 00:00:01 /managed-agents/execute-command/ssm-agent-worker

root 2638 21 0 12:07 ? 00:00:00 /managed-agents/execute-command/ssm-session-worker ecs-execute-command-***

root 2646 2638 0 12:07 pts/1 00:00:00 /bin/sh -c /bin/sleep 1320

root 2647 2646 0 12:07 pts/1 00:00:00 /bin/sleep 1320

コマンドは ssm-session-worker が実行している構成になっていることがわかります。なので、打ち切りでも返り値が常に 0 になるのは、こいつの仕業が濃厚ですが、だからといってどうなるものでもありません。

ボツ案 SSM Session resume

ECS の API としては、execute_command 以上にやれることは無いので、SSM の方を探すと１つだけ光明が見つかります。

...
res = ecs.execute_command(**params)
session_id = res['session']['sessionId']
 
ssm = boto3.client('ssm', region_name="ap-northeast-1")
res = ssm.resume_session(
    SessionId = session_id
)

...

res = ecs.execute_command(**params)

session_id = res['session']['sessionId']

ssm = boto3.client('ssm', region_name="ap-northeast-1")

res = ssm.resume_session(

SessionId = session_id

)

ssm.resume_session でセッション期限が延長されねーかなーって試してみました……

が、ダメ……っ！
ログにも何も起きなく、無駄骨っ！！

解決案 nohup

前にも nohup については触れていますが、nohup をつけると SSM Session の支配から逃れることがわかりました。

SSMセッション管理画面では、通常、コマンドが終わるか 20分強経過するまで実行中セッションとなるのですが、nohup 付きコマンドを実行すると、セッションとしては即座に終了することがわかります。

しかし、終了したセッションが並ぶセッション履歴では、「終了中」ステータスでぐるぐる継続し、20分強経過後にようやく完全に終了する特殊な挙動となります。おそらく終了シグナルを待っているのでしょう。

また、コマンドの出力結果が残らなくなります。即終了したセッションとしてのログはこんな感じで

Script started on 2022-01-01 14:57:00+00:00 [<not executed on terminal>]
nohup: ignoring input and appending output to 'nohup.out'
 
Script done on 2022-01-01 14:57:00+00:00 [COMMAND_EXIT_CODE="0"]

Script started on 2022-01-01 14:57:00+00:00 [<not executed on terminal>]

nohup: ignoring input and appending output to 'nohup.out'

Script done on 2022-01-01 14:57:00+00:00 [COMMAND_EXIT_CODE="0"]

echo ABC の結果は通常ではない nohup.out に残ります。では、そのファイルはどこにあるかというと、

# cat /nohup.out
ABC

1 2	# cat /nohup.out ABC

なんとルートトップ。こんな場所に残すシステムも珍しい……。複数回実行すると、同ファイルに追記されるので、なんかこの部分だけオザナリな作りになっているという印象です。

で、nohup によってSSM支配から逃れたコマンドはというと、ssm-session-worker の監視下にあるものの、ちゃんと 20分以上経過しても継続して実行されます。20分経過時のログはこう

WARN [ssm-session-worker] [ecs-execute-command-***] [DataBackend] [pluginName=InteractiveCommands] Reach the retry limit 5 for receive messages. Error: websocket: close 1006 (abnormal closure): unexpected EOF
ERROR [ssm-session-worker] [ecs-execute-command-***] [DataBackend] [pluginName=InteractiveCommands] CreateDataChannel failed with no output or error: createDataChannel request failed: unexpected response from the service <BadRequest>
  <ClientErrorMessage>Session is already terminated</ClientErrorMessage>
</BadRequest>

WARN [ssm-session-worker] [ecs-execute-command-***] [DataBackend] [pluginName=InteractiveCommands] Reach the retry limit 5 for receive messages. Error: websocket: close 1006 (abnormal closure): unexpected EOF

ERROR [ssm-session-worker] [ecs-execute-command-***] [DataBackend] [pluginName=InteractiveCommands] CreateDataChannel failed with no output or error: createDataChannel request failed: unexpected response from the service <BadRequest>

<ClientErrorMessage>Session is already terminated</ClientErrorMessage>

</BadRequest>

SSM Session が0秒で即切れているにも関わらず、20分でこういうログが出るということは、Session Manager の設定に関わらず 20分固定でチェックしているだろうことが推測されます。

20分より長いコマンドが終了後は、ログ的にも何事もなかったようにキレイに終了します。

なので、nohup のデメリットは、コマンドの出力内容が /nohup.out に残ってしまう……しかないような？Session Manager の設定もどうせ効かないので、管理下から外れても実質関係ない、というのが現状っぽいです。

ということで

長いコマンドも、動作確認は短い時間でやるとしたら、実際の長い実行での出力結果はさほど重要ではなく、このデメリットはどうでもよくなるかもですし、

もし出力内容が必要なら、実行前に /nohup.out を削除（移動）して、コマンド実行完了後に /nohup.out を拾ってくるなりS3にアップロードしたり、といった救出作戦で問題なさげです。

あと微妙に気になるところで、ecs.execute_command の返り値の中に

{
    'SessionId': 'string',
    'TokenValue': 'string',
    'StreamUrl': 'string'
}

{

'SessionId': 'string',

'TokenValue': 'string',

'StreamUrl': 'string'

}

こういうのがあって、StreamUrl で何かできるかもだけど、ドキュメントに一切ないし、そこまで突っ込むものでもないって感想。

AWSにしては珍しくドキュメントが嘘ついてたのか、バグの部類か、俺がまだ何か見落としているのか、はわからないですが、そんな調査をしましたよってまとめでした。

今年もよろしくお願い申し上げますん:-)