
[id00032] Distributed Training in Amazon SageMaker


https://pages.awscloud.com/rs/112-TZM-766/images/3_20230511-aiml_AWS-FM-Model.pdf  (good for getting an overview)

Distributed training

Distributed Training Solutions

  • Data parallelism

  • Model parallelism

  • Pipeline Execution Schedule (Pipelining): The pipeline execution schedule determines the order in which computations (micro-batches) are made and data is processed across devices during model training. Pipelining is a technique to achieve true parallelization in model parallelism and overcome the performance loss due to sequential computation by having the GPUs compute simultaneously on different data samples. To learn more, see Pipeline Execution Schedule.
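As a rough sketch of how pipelining is configured with the SageMaker model parallel library (parameter names and values below are illustrative and may differ between library versions), the number of micro-batches and the pipeline execution schedule are set through the PyTorch estimator's distribution argument:

from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        "pipeline_parallel_degree": 2,   # number of model partitions (pipeline stages)
        "microbatches": 4,               # each mini-batch is split into 4 micro-batches
        "pipeline": "interleaved",       # pipeline execution schedule
        "ddp": True,                     # combine model parallelism with data parallelism
    },
}

estimator = PyTorch(
    entry_point="train.py",              # placeholder script name
    role="<your-sagemaker-execution-role>",
    instance_type="ml.p4d.24xlarge",
    instance_count=2,
    framework_version="1.13.1",
    py_version="py39",
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)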

How to choose a solution

The general rule is that if your model has less than 1 billion parameters and can fit into GPU memory, SageMaker data parallel library or SageMaker training compiler can be sufficient for you. If you have larger language or computer vision models, our suggestion is to train it with the sharded data parallelism technique combined with activation checkpointing and activation offloading in the SageMaker model parallel library first, before other techniques such as tensor parallelism or pipeline parallelism. Sharded data parallel has helped many customers including Mobileye to train large models in an affordable manner on SageMaker.
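As a minimal sketch of this rule (degrees and options are illustrative), the two branches map to different distribution settings of the SageMaker PyTorch estimator:

# Model fits into GPU memory: enable the SageMaker data parallel library.
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}

# Larger models (roughly > 1 billion parameters): enable the model parallel library
# with sharded data parallelism (details in the sections below).
distribution = {
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {"sharded_data_parallel_degree": 8},  # illustrative sharding degree
        }
    },
    "mpi": {"enabled": True, "processes_per_host": 8},
}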

SageMaker's Data Parallelism Library

Modify a PyTorch Training Script

To use the SageMaker distributed data parallel library, you essentially only need to import its PyTorch client (smdistributed.dataparallel.torch.torch_smddp) and specify smddp as the process group backend.

I made the following code modifications as the original code was causing some errors.

%%writefile scripts/mnist.py

import os
import torch

import smdistributed.dataparallel.torch.torch_smddp  # registers the smddp backend with PyTorch

# workaround
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

from torch.nn import functional as F
...
    ddp = DDPStrategy(
        cluster_environment=env, 
        process_group_backend="smddp", 
        accelerator="gpu"
    )

Points

import pytorch_lightning as pl

# this module is required for SageMaker DDP.
import smdistributed.dataparallel.torch.torch_smddp
...
# Configure the Lightning cluster environment for distributed training.
from pytorch_lightning.plugins.environments import LightningEnvironment

env = LightningEnvironment()
env.world_size = lambda: int(os.environ["WORLD_SIZE"])
env.global_rank = lambda: int(os.environ["RANK"])
...
# set distributed training strategy
from pytorch_lightning.strategies import DDPStrategy

ddp = DDPStrategy(
  cluster_environment=env, 
  process_group_backend="smddp", 
  accelerator="gpu"
)

# num_gpus and num_nodes are assumed to be derived from the SageMaker environment
# (e.g., SM_NUM_GPUS and SM_HOSTS) in the elided part of the script.
trainer = pl.Trainer(
  devices=num_gpus, 
  num_nodes=num_nodes,
  max_epochs=10,
  strategy=ddp
)
...
trainer.fit(model, datamodule=MNISTDataModule(batch_size=32))
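The script above can then be submitted as a SageMaker training job with the data parallel library enabled via the estimator's distribution argument. A minimal sketch (role, instance type, and framework versions are placeholders; smddp requires multi-GPU instance types such as ml.p3.16xlarge or ml.p4d.24xlarge):

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="mnist.py",
    source_dir="scripts",
    role="<your-sagemaker-execution-role>",
    instance_type="ml.p3.16xlarge",
    instance_count=2,
    framework_version="1.12.1",
    py_version="py38",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit()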


SageMaker's Model Parallelism Library

To save GPU memory, the library supports activation checkpointing to avoid storing internal activations in the GPU memory for user-specified modules during the forward pass. The library recomputes these activations during the backward pass. In addition, the activation offloading feature offloads the stored activations to CPU memory and fetches back to GPU during the backward pass to further reduce activation memory footprint. For more information about how to use these features, see Activation Checkpointing and Activation Offloading.
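A minimal sketch of enabling these two features (the module path passed to set_activation_checkpointing is hypothetical, and parameter names may differ between library versions): activation checkpointing is set per module inside the training script, while activation offloading is switched on through the library's configuration parameters.

import smdistributed.modelparallel.torch as smp

# In the training script: wrap the model, then mark modules whose activations
# should be recomputed during the backward pass instead of being stored.
model = smp.DistributedModel(model)
smp.set_activation_checkpointing(model.module.transformer.layers, strategy="each")  # hypothetical module path

# In the estimator configuration: offload checkpointed activations to CPU memory
# and fetch them back to the GPU ahead of the backward pass.
smp_parameters = {
    "offload_activations": True,
    "activation_loading_horizon": 4,
}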


Sharded Data Parallelism

Sharded data parallelism is a memory-saving distributed training technique that splits the state of a model (model parameters, gradients, and optimizer states) across GPUs in a data parallel group. 

Reducing the batch size per GPU is sometimes not possible with sharded data parallelism alone when a single batch is already large and cannot be reduced further. In such cases, using sharded data parallelism in combination with tensor parallelism helps reduce the global batch size.
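As a hedged sketch of this combination (degrees are illustrative), both parallelism degrees are set in the model parallel library's configuration parameters:

smp_options = {
    "enabled": True,
    "parameters": {
        "sharded_data_parallel_degree": 4,  # shard model state across groups of 4 GPUs
        "tensor_parallel_degree": 2,        # additionally split individual layers across 2 GPUs
        "ddp": True,
    },
}

In the training script, the forward and backward passes are wrapped in smp.step-decorated functions so the library can distribute the computation: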
# Any computation defined inside the smp.step-decorated function is executed in a distributed manner.
@smp.step
def train_step(model, optimizer, input_ids, attention_mask, args):
    loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)["loss"]
    model.backward(loss)

    return loss

@smp.step
def test_step(model, input_ids, attention_mask):
    loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)["loss"]
    
    return loss

- smp.init(): Initializes SMP; configures settings such as enabling DDP (distributed data parallelism), mixed-precision computation, and activation offloading
- smp.DistributedModel(): Wraps the model for distributed execution
- smp.set_activation_checkpointing(): Configures activation checkpointing
- smp.DistributedOptimizer(): Wraps the optimizer for distributed execution
- smp.save_checkpoint()/smp.resume_from_checkpoint(): Saves and restores checkpoints
- smp.model_creation(): Configures distributed parameter initialization when the model is created
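A minimal skeleton showing where these calls typically sit in a training script (MyModel, dataloader, and args are hypothetical, and argument lists are abbreviated):

import torch
import smdistributed.modelparallel.torch as smp

smp.init()                                   # initialize the library from the job configuration
torch.cuda.set_device(smp.local_rank())      # bind this process to its GPU

model = smp.DistributedModel(MyModel())      # wrap the (hypothetical) model for distributed execution
optimizer = smp.DistributedOptimizer(torch.optim.Adam(model.parameters(), lr=1e-4))

for batch in dataloader:
    optimizer.zero_grad()
    loss_mb = train_step(model, optimizer, batch["input_ids"], batch["attention_mask"], args)
    loss = loss_mb.reduce_mean()             # average the per-micro-batch losses
    optimizer.step()

smp.save_checkpoint(path="/opt/ml/checkpoints", tag="latest", partial=True,
                    model=model, optimizer=optimizer)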

Pipeline parallelism

Tensor parallelism

Modify a PyTorch Training Script

PyTorch FSDP

DeepSpeed ZeRO-3

Comparing

Features of and differences between ZeRO-3, PyTorch FSDP, and SageMaker Model Parallel.


ZeRO-3:

  • A memory-optimization technique that reduces memory usage during distributed GPU training.

  • Makes it possible to train very large models.

PyTorch FSDP:

  • Fully Sharded Data Parallel: shards model parameters, gradients, and optimizer states across data-parallel workers (an approach similar to ZeRO-3).

  • Splits large models for parallel training, enabling scale-out.

SageMaker Model Parallel:

  • A feature of Amazon SageMaker that distributes the layers of large models across devices.

  • Supports both training and inference.

I have not yet verified whether these techniques can be used together.
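For reference, a minimal PyTorch FSDP sketch (plain PyTorch, independent of the SageMaker libraries; MyModel is a hypothetical model class) of wrapping a model so that its parameters, gradients, and optimizer state are sharded across workers:

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")                  # one process per GPU is assumed
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = FSDP(MyModel().cuda())                   # shard parameters, gradients, and optimizer state
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)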

HyperPod

