Abstract
In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. We sustain $15.1$ PetaFLOPs across the entire application with $76$% scaling efficiency when compared to a strong single GPU baseline that sustains $39$ TeraFLOPs, which is $30$% of peak FLOPs. We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows.
1. Introduction
As transformer models become larger, they exceed the memory limits of modern processors and require additional memory management techniques. Several approaches to model parallelism overcome this limit by partitioning the model such that the weights and their associated optimizer state do not need to reside concurrently on a single processor. However, these approaches require rewriting the model and rely on custom compilers and frameworks that are still under development. In this work, we implement a simple and efficient model parallel approach using intra-layer model parallelism.
We show that the existing BERT architecture results in model degradation as the size increases. We overcome this challenge by rearranging the layer normalization and residual connection in the transformer layers and show that with this change, results for the downstream tasks on development sets improve monotonically as the model size increases.
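A minimal PyTorch sketch of this rearrangement (our own illustration, not the paper's code; the hidden size, head count, and module layout are assumed):

```python
import torch
import torch.nn as nn

class PreLNTransformerLayer(nn.Module):
    """Pre-LN variant: layer normalization is applied to the *input* of each
    sub-block, leaving the residual path untouched. The original BERT layer
    instead normalizes *after* the residual addition (post-LN)."""
    def __init__(self, hidden_size=1024, num_heads=16):
        super().__init__()
        self.ln_attn = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.ln_mlp = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x):
        # Pre-LN: x + Attn(LN(x)), rather than post-LN's LN(x + Attn(x)).
        a = self.ln_attn(x)
        h = x + self.attn(a, a, a, need_weights=False)[0]
        # Pre-LN: h + MLP(LN(h)), rather than LN(h + MLP(h)).
        return h + self.mlp(self.ln_mlp(h))
```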
2. Background and Challenges
2.1. Neural Language Model Pretraining
The state of the art has advanced from transferring just word embedding tables to transferring entire multi-billion parameter language models. This progression of methods has created the need for hardware, systems techniques, and frameworks that can operate efficiently at scale and satisfy increasing computational demands.
2.2. Transformer Language Models and Multi-Head Attention
Due to the transformer's superior accuracy and compute efficiency, recent works use only the encoder or the decoder depending on their needs. In this work, we explore both a decoder architecture, GPT-2, and an encoder architecture, BERT. We also use 1). GeLU nonlinearities and 2). layer normalization applied to the input of the multi-head attention and feed-forward layers.
2.3. Data and Model Parallelism in Deep Learning
There are two central paradigms for scaling out deep neural network training to numerous hardware accelerators: 1). data parallelism where a training minibatch is split across multiple workers, and 2). model parallelism in which the memory usage and computation of a model is distributed across multiple workers.
Our approach is to utilize model parallelism to split the model across multiple accelerators. This not only alleviates memory pressure, but also increases the amount of parallelism independently of the microbatch size. Within model parallelism, there are two further paradigms: 1). layer-wise pipeline parallelism, and 2). more general distributed tensor computation.
In pipeline model parallelism, groups of operations are performed on one device before the outputs are passed to the next device in the pipeline, where a different group of operations is performed. This approach suffers from consistency issues and pipeline bubbles, and requires additional logic to handle the efficient pipelining of these communication and computation operations.
Distributed tensor computation is an orthogonal and more general approach that partitions a tensor operation across multiple devices to accelerate computation or increase model size. We exploit the parallelism inherent in computing the transformer's attention heads to parallelize our transformer model. Our approach is simple, does not require a new compiler or code re-writing, and can be fully implemented by inserting a few simple primitives.
3. Model Parallel Transformers
A transformer layer consists of a self attention block followed by a two-layer, multi-layer perceptron (MLP).
3.1. MLP block
The first part of the block is a GEMM (General Matrix to Matrix Multiplication) followed by a GeLU nonlinearity:
$$
Y = \text{GeLU}(XA)
$$
We split $A$ along its columns. This partitioning allows the GeLU nonlinearity to be independently applied to the output of each partitioned GEMM:
$$
A = \begin{bmatrix} A_1 & A_2 \end{bmatrix}
\qquad
[Y_1, Y_2] = [\text{GeLU}(XA_1), \text{GeLU}(XA_2)]
$$
Hence, we partition the first GEMM in this column-parallel fashion and split the second GEMM along its rows so that it takes the output of the GeLU layer directly, without requiring any communication:
$$
B = \begin{bmatrix} B_1 \\ B_2 \end{bmatrix}
\qquad
Z_1 = Y_1B_1, \qquad Z_2 = Y_2B_2, \qquad Z = Z_1 + Z_2
$$
This approach splits both GEMMs in the MLP block across GPUs and requires only a single all-reduce operation in the forward pass (the $g$ operator) and a single all-reduce in the backward pass (the $f$ operator).
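As an illustration, here is a minimal PyTorch sketch of the two partitioned GEMMs (the class names, the bias handling, and an already-initialized torch.distributed process group are our assumptions, not the paper's Megatron-LM implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    """First MLP GEMM: A is split along its columns, so each rank holds A_i
    and computes Y_i = GeLU(X A_i) with no communication in the forward pass."""
    def __init__(self, in_features, out_features, world_size):
        super().__init__()
        assert out_features % world_size == 0
        self.local = nn.Linear(in_features, out_features // world_size)

    def forward(self, x):
        return F.gelu(self.local(x))

class RowParallelLinear(nn.Module):
    """Second MLP GEMM: B is split along its rows, so each rank computes a
    partial result Z_i = Y_i B_i; one all-reduce sums the partials into Z."""
    def __init__(self, in_features, out_features, world_size):
        super().__init__()
        assert in_features % world_size == 0
        self.local = nn.Linear(in_features // world_size, out_features, bias=False)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, y_local):
        z_partial = self.local(y_local)
        dist.all_reduce(z_partial)       # the single forward-pass all-reduce (g)
        return z_partial + self.bias     # add the bias once, after the reduction
```

In practice the communication is expressed through autograd-aware primitives so that the backward pass is handled automatically; a sketch of those primitives follows Section 3.2 below.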
3.2. Self-Attention block
In the self-attention block, we exploit the inherent parallelism in the multi-head attention operation. We partition the GEMMs associated with the key, query, and value projections in a column-parallel fashion such that the matrix multiply corresponding to each attention head is done locally on one GPU.
The subsequent GEMM is parallelized along its rows and takes the output of the parallel attention layer directly, without requiring communication between the GPUs.
This approach for both the MLP and self attention layer fuses groups of two GEMMs, eliminates a synchronization point in between, and results in better scaling.
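The two conjugate communication primitives can be written as autograd functions; the following is a minimal sketch under the assumption of an initialized torch.distributed process group (the class names are ours, for illustration):

```python
import torch
import torch.distributed as dist

class CopyToModelParallel(torch.autograd.Function):
    """The f operator: identity in the forward pass, all-reduce of the
    gradient in the backward pass. Placed at the input of a parallel block."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        grad = grad_output.clone()
        dist.all_reduce(grad)
        return grad

class ReduceFromModelParallel(torch.autograd.Function):
    """The g operator: all-reduce in the forward pass, identity in the
    backward pass. Placed at the output of a parallel block."""
    @staticmethod
    def forward(ctx, x):
        out = x.clone()
        dist.all_reduce(out)
        return out

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

# Usage sketch:
# y = ReduceFromModelParallel.apply(row_parallel(col_parallel(
#         CopyToModelParallel.apply(x))))
```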
3.3. Embedding layer
A transformer language model has an output embedding with dimension hidden-size ($H$) times vocabulary size ($v$). Since the vocabulary size is on the order of tens of thousands of tokens for modern language models, it is beneficial to parallelize the output embedding GEMM. However, the output embedding layer shares weights with the input embedding, which requires modifications to both.
3.3.1. Input embedding layer
We parallelize the input embedding weight matrix $E_{H \times v}$ along the vocabulary dimension $E = [E_1, E_2]$ (column-wise). Since each partition now only contains a portion of the embedding table, an all-reduce is required after the input embedding.
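A sketch of one way this can look in PyTorch (the masking scheme and an initialized process group are our assumptions; gradient-aware communication wrappers are omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.distributed as dist

class VocabParallelEmbedding(nn.Module):
    """Each rank stores a contiguous slice of the vocabulary rows of E.
    Tokens outside the local slice contribute zeros, and one all-reduce
    sums the partial lookups into the full embedding output."""
    def __init__(self, vocab_size, hidden_size, rank, world_size):
        super().__init__()
        per_rank = vocab_size // world_size            # assume divisibility
        self.start, self.end = rank * per_rank, (rank + 1) * per_rank
        self.weight = nn.Parameter(torch.randn(per_rank, hidden_size) * 0.02)

    def forward(self, token_ids):
        outside = (token_ids < self.start) | (token_ids >= self.end)
        local_ids = (token_ids - self.start).clamp(0, self.weight.size(0) - 1)
        out = self.weight[local_ids]                   # [batch, seq, hidden]
        out = out.masked_fill(outside.unsqueeze(-1), 0.0)
        dist.all_reduce(out)                           # sum partial embeddings
        return out
```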
3.3.2. Output embedding layer
For the output embedding, one approach is to 1). perform the parallel GEMM $[Y_1, Y_2] = [XE_1, XE_2]$ to obtain the logits, 2). add an all-gather $Y = \text{all-gather}([Y_1, Y_2])$, and 3). send the result to the cross entropy loss function. However, this all-gather communicates $b \times s \times v$ elements (where $b$ is the batch size and $s$ is the sequence length), which is expensive because the vocabulary is large. Instead, we can reduce the communication size by fusing the output of the parallel GEMM $[Y_1, Y_2]$ with the cross entropy loss, which reduces the communicated dimension to $b \times s$.
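The fused, vocabulary-parallel loss can be sketched as follows (our illustration, forward pass only; padding/label masking and autograd-aware communication are omitted, and an initialized process group is assumed):

```python
import torch
import torch.distributed as dist

def vocab_parallel_cross_entropy(local_logits, targets, vocab_start, vocab_end):
    """local_logits: [b, s, v/p] slice of the logits on this rank.
    Only [b, s]-shaped tensors are ever communicated, never [b, s, v]."""
    # 1. Numerically stable max over the global vocabulary.
    max_logit = local_logits.max(dim=-1).values                   # [b, s]
    dist.all_reduce(max_logit, op=dist.ReduceOp.MAX)
    shifted = local_logits - max_logit.unsqueeze(-1)

    # 2. Global softmax denominator.
    sum_exp = shifted.exp().sum(dim=-1)                           # [b, s]
    dist.all_reduce(sum_exp)

    # 3. Shifted logit of the target token; zero on ranks that do not own it.
    owned = (targets >= vocab_start) & (targets < vocab_end)
    local_targets = (targets - vocab_start).clamp(0, local_logits.size(-1) - 1)
    target_logit = shifted.gather(-1, local_targets.unsqueeze(-1)).squeeze(-1)
    target_logit = torch.where(owned, target_logit, torch.zeros_like(target_logit))
    dist.all_reduce(target_logit)

    # 4. Per-token cross entropy: log(sum_exp) - shifted target logit.
    return (sum_exp.log() - target_logit).mean()
```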
Much of our model parallel approach can be characterized as techniques aimed at reducing communication and keeping the GPUs compute bound. Since all values are either local to or duplicated on a GPU, there is no need for communicating updated parameter values in this formulation.
4. Setup
We explain our configurations for BERT and GPT-2 in the following subsections and refer to the original papers for more details.
4.1. Training Dataset
We create an aggregate dataset consisting of Wikipedia, CC-Stories, RealNews, and OpenWebText. To avoid training set leakage into our downstream tasks, we remove Wikipedia articles that appear in the WikiText103 test set and remove extraneous newlines from CC-Stories. For BERT models, we additionally include BooksCorpus. We filter out all documents with fewer than $128$ tokens from the aggregated dataset, and we use locality-sensitive hashing (LSH) to deduplicate content with a Jaccard similarity greater than $0.7$ (see the sketch below).
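A sketch of the deduplication step; the paper only states that LSH with a Jaccard threshold of $0.7$ was used, so the datasketch library, the whitespace shingling, and the keep-first policy below are our assumptions:

```python
from datasketch import MinHash, MinHashLSH  # third-party library (assumed)

def signature(text, num_perm=128):
    """MinHash signature over whitespace tokens (shingling choice is ours)."""
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

def deduplicate(docs, threshold=0.7, num_perm=128):
    """Keep a document only if no previously kept document has an estimated
    Jaccard similarity above the threshold."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for doc_id, text in docs:
        sig = signature(text, num_perm)
        if lsh.query(sig):                 # a near-duplicate was already kept
            continue
        lsh.insert(doc_id, sig)
        kept.append((doc_id, text))
    return kept
```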
4.2. Training Optimization and Hyperparameters
To train our models efficiently, we utilize mixed precision training with dynamic loss scaling to take advantage of the V100's Tensor Cores. We start by initializing our weights $W$ with a simple normal distribution $W \sim \mathcal{N}(0, 0.02)$. We then scale the weights immediately before residual layers by $\frac{1}{\sqrt{2N}}$, where $N$ is the number of transformer layers. For our optimizer, we utilize Adam with weight decay $\lambda = 0.01$. Additionally, we use global gradient norm clipping of $1.0$. Lastly, we utilize activation checkpointing after every transformer layer.
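A sketch of the initialization scheme (which submodules count as the layers "immediately before residual connections", and their names, are assumptions on our part):

```python
import math
import torch
import torch.nn as nn

def init_weights(model, num_layers, sigma=0.02):
    """Draw every linear/embedding weight from N(0, sigma^2), then rescale
    the output projections that feed a residual connection by 1/sqrt(2N)."""
    for name, module in model.named_modules():
        if isinstance(module, (nn.Linear, nn.Embedding)):
            nn.init.normal_(module.weight, mean=0.0, std=sigma)
            if isinstance(module, nn.Linear) and module.bias is not None:
                nn.init.zeros_(module.bias)
            # Hypothetical names for the attention/MLP output projections.
            if name.endswith(("attn_out_proj", "mlp_out_proj")):
                with torch.no_grad():
                    module.weight.mul_(1.0 / math.sqrt(2.0 * num_layers))
```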
For GPT-2 models, we use sequences of $1024$ subword units at a batch size of $512$ for $300$k iterations. Our learning rate of $1.5 \times 10^{-4}$ uses a warmup period of $3$k iterations before following a single-cycle cosine decay over the remaining $297$k iterations, stopping the decay at a minimum learning rate of $1 \times 10^{-5}$.
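The schedule can be expressed as a simple function of the iteration number; the linear shape of the warmup is our assumption, since the paper only specifies its length:

```python
import math

def gpt2_learning_rate(step, max_lr=1.5e-4, min_lr=1e-5, warmup=3_000, total=300_000):
    """Warmup for the first 3k iterations, then a single-cycle cosine decay
    over the remaining 297k iterations, floored at the minimum learning rate."""
    if step < warmup:
        return max_lr * step / warmup                        # assumed linear warmup
    progress = min((step - warmup) / (total - warmup), 1.0)  # 0 -> 1 over the decay
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine
```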
For BERT models, we largely follow the training process described in (Lan et al., 2019). We use the original BERT dictionary with a vocab size of $30522$. In addition, we use sentence order prediction and whole word n-gram masking. For all cases, we set the batch size to $1024$ and use a learning rate of $1.0 \times 10^{-4}$ warmed up over $10000$ iterations and decayed linearly over $2$ million iterations. Other training parameters are kept the same as in (Devlin et al., 2018).
5. Experiments
5.1. Scaling Analysis
To test the scalability of our implementation, we consider GPT-2 models with four sets of parameters detailed in Table 1. The configuration with $1.2$ billion parameters fits on a single GPU, whereas the $8.3$ billion parameter model requires $8$-way model parallelism ($8$ GPUs). We study both model and model+data parallel scaling. For model parallel scaling, a fixed batch size of $8$ is used across all configurations. For the model+data parallel cases, we fix the global batch size to $512$.
5.1.1. Model and data parallelism
Throughout this section, we showcase weak scaling with respect to the model parameters for both model parallel and model+data parallel cases. The baseline for all scaling numbers is the first configuration ($1.2$ billion parameters), which achieves $39$ TeraFLOPS during the overall training process. The results show scaling values for both model and model+data parallelism. The $8.3$ billion parameter case with $8$-way model parallelism achieves $77$% of linear scaling, and the largest configuration with model+data parallelism achieves $74$% of linear scaling.
Strong scaling is defined as how the solution time varies with the number of processors for a fixed total problem size.
Weak scaling is defined as how the solution time varies with the number of processors for a fixed problem size per processor.
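As a rough consistency check (our own arithmetic, assuming the largest model+data parallel configuration runs on $512$ GPUs):
$$
512 \times 39\ \text{TFLOPS} \approx 20\ \text{PFLOPS},
\qquad
0.76 \times 20\ \text{PFLOPS} \approx 15.2\ \text{PFLOPS},
$$
which lines up with the $15.1$ PetaFLOPs at $76$% scaling efficiency reported in the abstract for the full application.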
5.2. Language Modeling Results Using GPT-2
To demonstrate that large language models can further advance the state of the art, we train GPT-2 models with four sets of parameters detailed in Table 2. We train and evaluate the models as described in Section 4. The results show that as the model size increases, the validation perplexity decreases, reaching $9.27$ for the $8.3\mathbf B$ model. We also observe that increasing the model size leads to lower perplexity on WikiText103 and higher cloze accuracy on LAMBADA.
5.3. Bi-directional Transformer Results Using BERT
We empirically demonstrate that rearranging the order of the layer normalization and the residual connections is critical to enabling the scaling of BERT-style models beyond BERT-Large. We consider three different cases as detailed in Table 4. On a $3$% held-out set, the $336\mathbf M$, $1.3\mathbf B$, and $3.9\mathbf B$ models achieve validation set perplexities of $1.58$, $1.30$, and $1.16$, respectively. The results also show that 1). as the model size increases, the downstream task performance improves in all cases, 2). our $3.9\mathbf B$ model establishes state-of-the-art results on the development set compared to other BERT-based models, and 3). our $3.9\mathbf B$ model achieves both single-model and ensemble SOTA results on the RACE test set.
6. Conclusion and Future Work
In this work, we successfully surpassed the limitations posed by traditional single-GPU-per-model training by implementing model parallelism with only a few modifications to the existing PyTorch transformer implementations.
We also showed that for BERT-like models, careful attention to the placement of layer normalization is critical to achieving increased accuracy as the model size grows.