[id00032] Distributed Training in Amazon SageMaker
https://pages.awscloud.com/rs/112-TZM-766/images/3_20230511-aiml_AWS-FM-Model.pdf (good for getting an overview)
Distributed training
Distributed Training Solutions
Data parallelism
Model parallelism
Pipeline Execution Schedule (Pipelining): The pipeline execution schedule determines the order in which computations (micro-batches) are made and data is processed across devices during model training. Pipelining is a technique to achieve true parallelization in model parallelism and overcome the performance loss due to sequential computation by having the GPUs compute simultaneously on different data samples. To learn more, see Pipeline Execution Schedule.
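To make the idea concrete, here is a toy sketch of a forward-only pipeline schedule. It is not the schedule the SageMaker library actually uses; the function name and the assumption that each stage takes exactly one time step per micro-batch are mine. It only shows why splitting a batch into micro-batches lets the GPUs work at the same time.

```python
def simple_pipeline_schedule(num_stages, num_microbatches):
    """Forward-only toy schedule (hypothetical helper, not a SageMaker API).

    Stage s can start micro-batch m only after stage s-1 has finished it,
    so stage s works on micro-batch m at time step t = s + m.
    Returns one row per time step; each row lists the micro-batch index
    handled by each stage, or None when that stage is idle.
    """
    num_steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(num_steps):
        row = []
        for s in range(num_stages):
            m = t - s
            row.append(m if 0 <= m < num_microbatches else None)
        schedule.append(row)
    return schedule


schedule = simple_pipeline_schedule(num_stages=2, num_microbatches=4)
for t, row in enumerate(schedule):
    print(f"step {t}: {row}")
```

With 2 stages and 4 micro-batches, both stages are busy in the middle steps and idle only at the start and end of the pipeline. Without micro-batches, the second GPU would have to wait for the whole batch to pass through the first one.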
How to choose a solution
The general rule is that if your model has less than 1 billion parameters and can fit into GPU memory, SageMaker data parallel library or SageMaker training compiler can be sufficient for you. If you have larger language or computer vision models, our suggestion is to train it with the sharded data parallelism technique combined with activation checkpointing and activation offloading in the SageMaker model parallel library first, before other techniques such as tensor parallelism or pipeline parallelism. Sharded data parallel has helped many customers including Mobileye to train large models in an affordable manner on SageMaker.
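As a memo for myself, the rule of thumb above can be written as a small helper. The function name and the returned strings are hypothetical; this only restates the guidance quoted above and does not call any SageMaker API.

```python
ONE_BILLION = 1_000_000_000


def suggest_strategy(num_params, fits_in_gpu_memory):
    """Restate the rule of thumb above (hypothetical helper, not an AWS API)."""
    if num_params < ONE_BILLION and fits_in_gpu_memory:
        # Small model that fits on a single GPU: plain data parallelism is enough.
        return "SageMaker data parallel library or SageMaker training compiler"
    # Larger models: try sharded data parallelism first, before
    # tensor parallelism or pipeline parallelism.
    return (
        "sharded data parallelism + activation checkpointing "
        "+ activation offloading (SageMaker model parallel library)"
    )


print(suggest_strategy(300_000_000, fits_in_gpu_memory=True))
print(suggest_strategy(20_000_000_000, fits_in_gpu_memory=False))
```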
SageMaker's Data Parallelism Library
You can use the SageMaker distributed data parallel library together with managed Spot training.
Supported Frameworks
Modify a PyTorch Training Script
To use the SageMaker distributed data parallel library, the only thing you need to do is to import the SageMaker distributed data parallel library’s PyTorch client (smdistributed.dataparallel.torch.torch_smddp).
I made the following code modifications as the original code was causing some errors.
%%writefile scripts/mnist.py
import os
import torch
import smdistributed.dataparallel.torch.torch_smddp
# Workaround for the errors mentioned above: use the pure-Python protobuf implementation.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
from torch.nn import functional as F
...
ddp = DDPStrategy(
    cluster_environment=env,
    process_group_backend="smddp",
    accelerator="gpu"
)
Points
import pytorch_lightning as pl
# this module is required for SageMaker DDP.
import smdistributed.dataparallel.torch.torch_smddp
...
# Configure some settings for distributed training.
env.world_size = lambda: int(os.environ["WORLD_SIZE"])
env.global_rank = lambda: int(os.environ["RANK"])
...
# set distributed training strategy
from pytorch_lightning.strategies import DDPStrategy
ddp = DDPStrategy(
    cluster_environment=env,
    process_group_backend="smddp",
    accelerator="gpu"
)
trainer = pl.Trainer(
    devices=num_gpus,
    num_nodes=num_nodes,
    max_epochs=10,
    strategy=ddp
)
...
trainer.fit(model, datamodule=MNISTDataModule(batch_size=32))
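The script still has to be launched as a SageMaker training job with the data parallel library enabled. Below is a minimal sketch of how that might look with the SageMaker Python SDK. The role ARN, instance type, instance count, and framework version are placeholders I picked for illustration, so check them against your own account and the versions the library supports. The `launch` function needs the `sagemaker` package and AWS credentials and is not called here.

```python
def build_estimator_kwargs(role, instance_count=2, instance_type="ml.p4d.24xlarge"):
    """Collect arguments for sagemaker.pytorch.PyTorch (values are assumptions)."""
    return {
        "entry_point": "mnist.py",      # the script written by %%writefile above
        "source_dir": "scripts",
        "role": role,                    # placeholder IAM role ARN
        "framework_version": "1.13.1",  # assumed PyTorch version
        "py_version": "py39",
        "instance_count": instance_count,
        "instance_type": instance_type,
        # This is the switch that turns on the SageMaker data parallel library.
        "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
    }


def launch(role):
    """Start the training job (requires the sagemaker SDK and AWS credentials)."""
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(**build_estimator_kwargs(role))
    estimator.fit()
    return estimator


kwargs = build_estimator_kwargs("arn:aws:iam::111122223333:role/ExampleRole")
print(kwargs["distribution"])
```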
Refs
https://www.amazon.science/latest-news/the-science-of-amazon-sagemakers-distributed-training-engines
SageMaker's Model Parallelism Library
To save GPU memory, the library supports activation checkpointing to avoid storing internal activations in the GPU memory for user-specified modules during the forward pass. The library recomputes these activations during the backward pass. In addition, the activation offloading feature offloads the stored activations to CPU memory and fetches back to GPU during the backward pass to further reduce activation memory footprint. For more information about how to use these features, see Activation Checkpointing and Activation Offloading.
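To convince myself of how activation checkpointing trades compute for memory, I wrote a toy version in plain Python. It is only a sketch of the idea on a chain of scalar `tanh` layers, not the SageMaker implementation, and all function names are made up. Activation offloading (moving stored activations to CPU memory) is not modeled.

```python
import math


def layer(w, x):
    return math.tanh(w * x)


def layer_grad(w, x_out):
    # d tanh(w * x) / dx, written in terms of the layer output.
    return w * (1.0 - x_out * x_out)


def grad_full(weights, x0):
    """Store every activation in the forward pass, then backpropagate."""
    acts = [x0]
    for w in weights:
        acts.append(layer(w, acts[-1]))
    g = 1.0
    for i in reversed(range(len(weights))):
        g *= layer_grad(weights[i], acts[i + 1])
    return g, len(acts)  # gradient wrt x0, number of stored activations


def grad_checkpointed(weights, x0, segment):
    """Store only each segment's input; recompute the rest during backward."""
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = layer(w, x)
    g = 1.0
    peak = 0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, len(weights))
        acts = [checkpoints[start]]  # recompute this segment's activations
        for i in range(start, end):
            acts.append(layer(weights[i], acts[-1]))
        # acts[0] is the checkpoint itself, so do not count it twice.
        peak = max(peak, len(checkpoints) + len(acts) - 1)
        for i in reversed(range(start, end)):
            g *= layer_grad(weights[i], acts[i - start + 1])
    return g, peak


weights = [0.5 + 0.1 * i for i in range(16)]
g_full, stored_full = grad_full(weights, 0.3)
g_ckpt, stored_ckpt = grad_checkpointed(weights, 0.3, segment=4)
print(stored_full, stored_ckpt)
```

Both functions return the same gradient, but the checkpointed version keeps far fewer values alive at once, at the cost of running each segment's forward pass a second time.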
Supported Frameworks
Techniques
Sharded data parallel also leverages other techniques in MiCS such as Hierarchical Communication and 2-hop Gradient Synchronization. For more information, check out Near-linear scaling of gigantic-model training on AWS or MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud.
Sharded Data Parallelism
Sharded data parallelism is a memory-saving distributed training technique that splits the state of a model (model parameters, gradients, and optimizer states) across GPUs in a data parallel group.
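A back-of-the-envelope estimate shows why this helps. The sketch below assumes mixed-precision training with Adam, which is commonly counted as about 16 bytes of model state per parameter (2 for fp16 parameters, 2 for fp16 gradients, 12 for fp32 master weights and the two Adam moments). That figure is my assumption, not something stated in the SageMaker docs, and activations are not included.

```python
def model_state_gib_per_gpu(num_params, sharding_degree=1, bytes_per_param=16):
    """Rough per-GPU memory for parameters, gradients, and optimizer states.

    bytes_per_param=16 assumes mixed-precision Adam (an assumption, see above).
    sharding_degree is the number of GPUs the model state is split across.
    """
    total_bytes = num_params * bytes_per_param
    return total_bytes / sharding_degree / 2**30


params = 2**30  # about 1.07 billion parameters, chosen to make the numbers round
print(model_state_gib_per_gpu(params))                     # no sharding
print(model_state_gib_per_gpu(params, sharding_degree=8))  # sharded over 8 GPUs
```

Without sharding, every GPU in the data parallel group holds the full model state. With a sharding degree of 8, each GPU holds roughly one eighth of it.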
import smdistributed.modelparallel.torch as smp

# Any computation defined inside the smp.step-decorated function is executed in a distributed manner.
@smp.step
def train_step(model, optimizer, input_ids, attention_mask, args):
    loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)["loss"]
    # With SMP, call backward on the model rather than on the loss tensor.
    model.backward(loss)
    return loss

@smp.step
def test_step(model, input_ids, attention_mask):
    loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)["loss"]
    return loss
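- smp.init(): Initializes SMP. Applies settings such as enabling DDP (distributed data parallel), mixed-precision computation, and activation offloading.
- smp.DistributedModel(): Wraps the model to make it distributed.
- smp.set_activation_checkpointing(): Configures activation checkpointing.
- smp.DistributedOptimizer(): Wraps the optimizer to make it distributed.
- smp.save_checkpoint()/smp.resume_from_checkpoint(): Saves and restores checkpoints.
- smp.model_creation(): Settings applied when the distributed parameters are initialized.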
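As far as I understand, most of the settings that smp.init() applies come from the `distribution` argument of the SageMaker estimator. The sketch below builds such a dictionary for sharded data parallelism. The degree of 8 and `processes_per_host=8` are assumptions for an 8-GPU instance, the helper name is mine, and mixed-precision settings are left out, so treat this as a starting point rather than a verified configuration.

```python
def build_smp_distribution(sharded_degree, gpus_per_host, num_hosts=1):
    """Build a `distribution` dict for the estimator (hypothetical helper)."""
    total_gpus = gpus_per_host * num_hosts
    if total_gpus % sharded_degree != 0:
        raise ValueError("sharded_degree must evenly divide the total GPU count")
    smp_options = {
        "enabled": True,
        "parameters": {
            "ddp": True,                                     # enable DDP
            "sharded_data_parallel_degree": sharded_degree,  # turns on sharded data parallelism
            "offload_activations": True,                     # activation offloading
        },
    }
    # The model parallel library is launched through MPI.
    mpi_options = {"enabled": True, "processes_per_host": gpus_per_host}
    return {"smdistributed": {"modelparallel": smp_options}, "mpi": mpi_options}


distribution = build_smp_distribution(sharded_degree=8, gpus_per_host=8)
print(distribution)
```

The resulting dictionary would be passed as `distribution=...` when creating the PyTorch estimator.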
Pipeline parallelism
Tensor parallelism
Modify a PyTorch Training Script
PyTorch FSDP
DeepSpeed ZeRO-3
Comparing
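Characteristics of and differences between ZeRO-3, PyTorch FSDP, and SageMaker Model Parallel.
ZeRO-3:
A memory optimization technique. Reduces memory usage during distributed GPU training.
Makes it possible to train large models.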
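PyTorch FSDP:
PyTorch's built-in Fully Sharded Data Parallel training. Shards parameters, gradients, and optimizer states across data parallel workers.
Splits a large model and trains it in parallel, which allows training to scale out.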
SageMaker Model Parallel:
A feature of Amazon SageMaker. Distributes the layers of a large model across devices.
Targets distributed training of large models.
I have not yet verified whether these techniques can be used together.
HyperPod