[Paper Review] PaLM: Scaling Language Modeling with Pathways

2023. 9. 28. 16:02ㆍPaper Review

1. Introduction

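This work continues the trend of scaling up model size to improve language modeling: a 540B-parameter Transformer (PaLM) is trained on 780B high-quality tokens. Training used Pathways, a new ML system that trains very large neural networks efficiently across thousands of accelerators. The resulting PaLM achieves SOTA few-shot performance on hundreds of tasks (natural language, code, mathematical reasoning).

The key points of the work are as follows.

1. Efficient scaling - With Pathways, a 540B-parameter language model was trained very efficiently on 6144 TPU v4 chips. The efficiency reached (46.2% model FLOPs utilization, 57.8% hardware FLOPs utilization) had not been seen before for models of PaLM's scale. (Pathways is a system for efficiently training a single neural network across thousands of TPUs.)

2. Continued improvements from continued scaling - In Section 6, PaLM is evaluated on hundreds of tasks (natural language, code, mathematical reasoning) and achieves SOTA on most of them, typically by a large margin. This means the gains from scaling have not yet plateaued: at the scales studied, scaling up still improves performance.

3. Breakthrough capabilities - In Section 6, evaluation on many difficult BIG-bench language understanding and generation tasks shows large performance gains. In addition, when model scaling is combined with chain-of-thought prompting, few-shot PaLM matches or outperforms fine-tuned SOTA models on many reasoning tasks.

4. Discontinuous improvements - To understand scaling behavior (how the model's capabilities change with scale), three model sizes (8B, 62B, 540B) were evaluated. On certain tasks (about 25% of BIG-bench), the improvement was discontinuous: the gain from 62B to 540B was far larger than the gain from 8B to 62B.

5. Multilingual understanding - This work conducts a more thorough multilingual evaluation (translation, summarization, QA). Even though only a small share of the training data is non-English, the few-shot results 1) narrow the gap with fine-tuned SOTA models on non-English summarization tasks and 2) surpass prior SOTA on translation tasks. How increasing the proportion of multilingual data affects English and non-English tasks still needs investigation.

6. Bias and toxicity - We evaluated distributional bias and toxicity from several angles. 1) Gender and occupation bias: accuracy on the Winogender coreference task improves as model size grows. (PaLM 540B sets SOTA in the 1-shot and few-shot settings.) 2) Co-occurrence analysis of text generated from race/religion/gender prompts reveals stereotypes that the model may falsely affirm; the level of bias was similar across all model sizes. 3) Finally, toxicity analysis of prompt continuations shows slightly higher overall toxicity for the 62B and 540B models than for the 8B model. However, the toxicity of model-generated continuations correlates strongly with the toxicity of the prompt, whereas human-written continuations show no such strong correlation. This suggests that model-generated text is influenced by the prompt more than human-written text is.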

In this work, we continue the scaling line of language modeling improvements and train a 540 billion parameter, densely activated, autoregressive Transformer on 780 billion tokens of high-quality text. This was achieved through the use of Pathways, a new ML system which enables highly efficient training of very large neural networks across thousands of accelerator chips.
This new model, called Pathways Language Model (PaLM), achieves state-of-the-art few-shot results across hundreds of natural language, code, and mathematical reasoning tasks.

The key takeaways from this work are as follows:

Efficient scaling - We demonstrate the first large-scale use of Pathways - a new ML system which enables training a single model across thousands or tens of thousands of accelerator chips in a highly efficient manner. With Pathways, we trained a 540B parameter language model on 6144 TPU v4 chips at efficiency levels that could not be reached before for models of this scale.
In Section 4, we describe how we were able to scale pipeline-free training of PaLM 540B to 6144 chips across two TPU v4 Pods while achieving very high efficiency of 46.2% in model FLOPs utilization and 57.8% in hardware FLOPs utilization. (that could not be reached before for models of this scale)

Continued improvements from scaling - In Section 6, we evaluate PaLM across hundreds of natural language, code, and mathematical reasoning tasks, and achieve state-of-the-art results on the vast majority of these benchmarks, typically by significant margins. This critically demonstrates scaling improvements from large LMs have neither plateaued nor reached their saturation point.

Breakthrough capabilities - We demonstrate breakthrough capabilities in language understanding and generation across a number of difficult tasks. (in BIG-bench which are extremely difficult even for humans to correctly answer.)
In this work, we demonstrate that when model scaling is combined with chain-of-thought prompting, simple few-shot evaluation can outperform or match the finetuned state of the art on a wide array of reasoning tasks. (which require multi-step mathematical or commonsense reasoning to produce the correct answer)

Discontinuous improvements - To better understand the scaling behavior, we present results at three different parameter scales: 8B, 62B, and 540B.
For certain tasks, we observe discontinuous improvements, where scaling from 62B to 540B results in a drastic jump in accuracy compared to scaling from 8B to 62B. (in 25% of the BIG-bench tasks)
This suggests that new capabilities of large LMs can emerge when the model achieves sufficient scale, and that these capabilities continue to emerge beyond previously studied scales.

Multilingual understanding - In this work, we conduct a more thorough evaluation of multilingual benchmarks including machine translation, summarization, and question answering in a wide variety of languages. Even with a relatively small proportion of non-English data ($\sim$22%) in the training corpus, few-shot evaluation results from the 540B model are able to bridge the gap with prior finetuned state of the art in non-English summarization tasks and outperform prior state of the art in translation tasks. Further work is necessary to understand the impact of increasing the proportion of multilingual data on the English and multilingual tasks.

Bias and toxicity - We also evaluated model performance for distributional bias and toxicity, which resulted in several insights. First, for gender and occupation bias, we find that accuracy on the Winogender coreference task improves with model scale, and PaLM 540B sets a new state-of-the-art result in 1-shot and few-shot settings. Secondly, co-occurrence analysis performed on race/religion/gender prompt continuation demonstrates the potential for the model to falsely affirm stereotypes. This behavior was consistent across model scales. Finally, toxicity analysis on prompt continuation tasks demonstrates a slightly higher overall toxicity level for the 62B and 540B model compared to the 8B model. However, the toxicity of the model-generated continuation correlates highly with the toxicity of the prompting text, whereas human-generated continuations do not have a strong toxicity correlation. This suggests that the model is more heavily influenced by the style of the prompt than would be the case for human-generated text.

2. Model Architecture

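PaLM adopts the standard Transformer architecture (decoder-only) with the following changes.

1. SwiGLU Activations - SwiGLU is used as the MLP intermediate activation, because it has been shown to perform better than activations such as ReLU, GeLU, and Swish.

2. Parallel Layers - Each Transformer block uses the following parallel formulation.

$y = x + \text{MLP}(\text{Layer Norm}(x)) + \text{Attention}(\text{Layer Norm}(x))$

(For reference, the standard serial formulation is $y = x + \text{MLP}(\text{Layer Norm}(x + \text{Attention}(\text{Layer Norm}(x))))$.)

The parallel formulation makes training roughly 15% faster at large scale, because the input matrix multiplications of the MLP and Attention can be fused.

In ablations, quality degraded slightly at 8B but not at 62B, so we extrapolated that there would be no degradation at 540B.

3. Multi-Query Attention - The standard Transformer has k heads of size h, so the input vector is linearly projected into query, key, and value tensors of shape [k, h]. (That is, heads do not share keys and values.)

PaLM instead shares the key and value projections across heads, so the key and value projections have shape [1, h].

(The query is still projected to shape [k, h].)

This does not affect model quality or training speed, but it significantly reduces inference cost (time and memory) during autoregressive decoding.

4. RoPE Embedding - RoPE embeddings are used instead of absolute or relative position embeddings, because they perform better on long sequences.

5. Shared Input-Output Embeddings - The input and output embedding matrices are shared.

6. No Biases - No bias terms are used in any dense kernel or layer norm, because this improves training stability for large models.

7. Vocabulary - To support many languages, a SentencePiece vocabulary of 256k tokens is used. The vocabulary is lossless and reversible: whitespace is fully preserved, and out-of-vocabulary Unicode characters are split into UTF-8 bytes, with one vocabulary token per byte.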

PaLM uses a standard Transformer model architecture in a decoder-only setup, with the following modifications:
SwiGLU Activation - We use SwiGLU activations (Swish(xW) $\cdot$ xV) for the MLP intermediate activations because they have been shown to significantly increase quality compared to standard ReLU, GeLU, or Swish activations.

Parallel Layers - We use a "parallel" formulation in each Transformer block.
Specifically, the standard formulation can be written as:
$$y = x + \text{MLP}(\text{Layer Norm}(x + \text{Attention}(\text{Layer Norm}(x))))$$
Whereas the parallel formulation can be written as:
$$y = x + \text{MLP}(\text{Layer Norm}(x)) + \text{Attention}(\text{Layer Norm}(x))$$
The parallel formulation results in roughly 15% faster training speed at large scales, since the MLP and Attention input matrix multiplications can be fused. Ablation experiments showed a small quality degradation at 8B scale but no quality degradation at 62B scale, so we extrapolated that the effect of parallel layers should be quality neutral at the 540B scale.
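The two formulations can be compared side by side in a minimal sketch. The `toy_attention` and `toy_mlp` functions below are placeholder stand-ins invented for illustration (they are not real attention or MLP layers); only the way the sub-layers are wired together follows the equations above.

```python
import math

def layer_norm(x, eps=1e-6):
    # Normalize a vector to zero mean and unit variance (no learned scale or bias).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_attention(x):
    # Hypothetical stand-in for the attention sub-layer.
    return [0.5 * v for v in x]

def toy_mlp(x):
    # Hypothetical stand-in for the MLP sub-layer.
    return [math.tanh(v) for v in x]

def add(*vectors):
    # Element-wise sum of any number of equal-length vectors.
    return [sum(vals) for vals in zip(*vectors)]

def serial_block(x):
    # y = x + MLP(LN(x + Attention(LN(x)))): the MLP waits for the attention output.
    return add(x, toy_mlp(layer_norm(add(x, toy_attention(layer_norm(x))))))

def parallel_block(x):
    # y = x + MLP(LN(x)) + Attention(LN(x)): both branches read the same normalized input.
    normed = layer_norm(x)
    return add(x, toy_mlp(normed), toy_attention(normed))

x = [1.0, -2.0, 0.5, 3.0]
print(serial_block(x))
print(parallel_block(x))
```

In the parallel form, the layer norm is computed once and both branches consume the same input, which is what allows their input projections to be fused into one larger matrix multiplication.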

Multi-Query Attention - The standard Transformer formulation uses k attention heads, where the input vector for each timestep is linearly projected into "query", "key", and "value" tensors of shape [k, h], where h is the attention head size. Here the key/value projections are shared for each head, i.e. "key" and "value" are projected to [1, h], but "query" is still projected to shape [k, h]. We have found that this has a neutral effect on model quality and training speed, but results in a significant cost savings at autoregressive decoding time.
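The decoding-time saving comes from the size of the key/value cache, which a rough count makes visible. The layer count, head count, and head size below are assumptions taken to resemble the 540B configuration; they are not stated in this review, and the function name is hypothetical.

```python
def kv_cache_elements(num_layers, seq_len, num_kv_heads, head_dim):
    # One key tensor and one value tensor are cached per layer and per position.
    return 2 * num_layers * seq_len * num_kv_heads * head_dim

# Assumed, illustrative model shape (not given in the text above).
layers, heads, head_dim, seq_len = 118, 48, 256, 2048

multi_head = kv_cache_elements(layers, seq_len, heads, head_dim)   # keys/values of shape [k, h]
multi_query = kv_cache_elements(layers, seq_len, 1, head_dim)      # keys/values of shape [1, h]

print(multi_head // multi_query)  # prints 48, the number of heads k
```

Because only the keys and values are shared, the cache shrinks by a factor of k while the number of query heads stays the same.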

RoPE Embedding - We use RoPE embeddings rather than absolute or relative position embeddings, since RoPE embeddings have been shown to have better performance on long sequence lengths.

Shared Input-Output Embeddings - We share the input and output embedding matrices, which is done frequently in past work.

No Biases - No biases were used in any of the dense kernels or layer norms. We found this to result in increased training stability for large models.

Vocabulary - We use a SentencePiece vocabulary with 256k tokens, which was chosen to support the large number of languages in the training corpus without excess tokenization. The vocabulary is completely lossless and reversible, which means that whitespace is completely preserved in the vocabulary (especially important for code) and out-of-vocabulary Unicode characters are split into UTF-8 bytes, with a vocabulary token for each byte.
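The byte fallback for out-of-vocabulary characters can be illustrated with a toy character-level tokenizer. This is only a sketch of the idea: the `<0xNN>` token names and both helper functions are assumptions made for illustration, and a real SentencePiece model works on subword pieces rather than single characters.

```python
def byte_fallback(text, vocab):
    # Characters in the vocabulary map to themselves; anything else is
    # split into its UTF-8 bytes, one token per byte.
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens

def detokenize(tokens):
    # Reverse the mapping: byte tokens are collected back into raw bytes.
    out = bytearray()
    for tok in tokens:
        if len(tok) == 6 and tok.startswith("<0x") and tok.endswith(">"):
            out.append(int(tok[3:5], 16))
        else:
            out.extend(tok.encode("utf-8"))
    return out.decode("utf-8")

vocab = set("ab \n")           # toy vocabulary; whitespace is kept as-is
text = "a b\n€"                # the euro sign is out of vocabulary
tokens = byte_fallback(text, vocab)
print(tokens)
print(detokenize(tokens) == text)
```

The round trip recovers the input exactly, including the space and the newline, which is the sense in which such a vocabulary is lossless and reversible.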

2.1. Model Scale Hyperparameters

This work compares three model scales: 540B parameters, 62B parameters, and 8B parameters.

The three models were trained identically (apart from batch size), using the same data and vocabulary.

In this work, we compare three different model scales: 540B parameters, 62B parameters, and 8B parameters.
The three models were trained identically (except batch size) using the same data and vocabulary.

3. Training Dataset

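The PaLM pretraining dataset consists of 780B high-quality tokens covering a wide range of natural language use cases.

The dataset is based on the datasets used to train LaMDA and GLaM.

All three models are trained on exactly one epoch of the data, with mixing proportions chosen so that no data is repeated in any subcomponent.

Section 5 looks at data contamination, that is, it analyzes the overlap between the training data and the evaluation data.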

The PaLM pretraining dataset consists of a high-quality corpus of 780 billion tokens that represent a wide range of natural language use cases.
This dataset is based on the datasets used to train LaMDA and GLaM. We train all three models on exactly one epoch of the data and choose the mixing proportions to avoid repeating data in any subcomponent.
In addition to natural language data, the pretraining dataset also contains code.
We check for data contamination in Section 5 and present an overlap analysis between our training dataset and the evaluation data.

4. Training Infrastructure

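PaLM was trained on two TPU v4 Pods connected over a data center network (DCN), using a combination of model and data parallelism.

This setup makes it possible to use 6144 TPU chips efficiently without any pipeline parallelism.

Each TPU v4 Pod (3072 TPU v4 chips) holds a full copy of the model parameters, partitioned across its chips using 12-way model parallelism and 256-way fully sharded data parallelism (FSDP, similar to ZeRO-DP).

In the forward pass, 1) the weights are all-gathered over the data-parallel axis, and 2) one fully sharded activation tensor is saved per layer. (The copies created by model parallelism are removed before the activation tensor is sharded and saved, as in ZeRO-R.)

In the backward pass, the activations that were not saved are recomputed (rematerialized), because at larger batch sizes this gives higher training throughput than the alternative recomputation choices.

PaLM 540B uses the client-server architecture of Pathways to get two-way data parallelism at the pod level. A single Python client drives four steps. 1) It dispatches half of the batch to each pod. 2) Each pod computes gradients with a forward and backward pass (using within-pod data and model parallelism). 3) Each pod transfers its gradients to the other pod. 4) Each pod accumulates the local and remote gradients and applies the parameter update in parallel, so that both pods hold bitwise-identical parameters for the next step.

Figure 2 of the paper shows how the Pathways system executes pod-level two-way data parallelism. The Python client constructs a sharded dataflow program that launches JAX/XLA work on remote servers, each of which comprises a TPU pod. The program has 1) a component A for the forward and backward computation, 2) a transfer subgraph for cross-pod gradient transfer, and 3) a component B for the optimizer update. The Pathways program runs A on each pod, transfers the gradients to the other pod, and finally runs B on each pod. The Pathways system has several features that let program execution scale to thousands of TPUs. First, asynchronous gang-scheduling masks the latency of dispatching JAX/XLA work from the Python client to the remote servers. Second, the sharded-dataflow execution model amortizes the cost of managing data transfers.

The interesting part of pod-level two-way data parallelism is achieving high training throughput for cross-pod gradient transfers at the scale of 6144 TPU v4 chips attached to a total of 1536 hosts across the two pods. The cross-pod gradient transfer only requires 1:1 transfers between corresponding hosts on the two pods, because each core (TPU) needs remote gradients only for its own shard of the parameters. (The hosts of the two pods are connected via the DCN.)

No core sends gradients until it has finished computing them. This makes the workload bursty: all hosts send their gradients at almost the same time. (At each step, each host transfers roughly 1.3 GB of gradients, which adds up to an aggregate burst of about 81 Tbps across all hosts.)

This burstiness was mitigated by careful design of the Pathways networking stack to optimize DCN utilization. For example, to reduce congestion, the gradient transfer data is broken into smaller chunks and routed as multiple smaller flows over a diverse set of DCN links. With these optimizations, training throughput reached about 1.95x that of a single pod.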

PaLM is trained over two TPU v4 Pods connected over data center network (DCN) using a combination of model and data parallelism.
This system, the largest TPU configuration described to date, allowed us to efficiently scale training to 6144 chips without needing to use any pipeline parallelism.

Each TPU v4 Pod contains a full copy of the model parameters, with each tensor partitioned over 3072 chips using 12-way model parallelism and 256-way fully sharded data parallelism. During the forward pass, weights are all-gathered over the data parallel axis, and one fully sharded activation tensor is saved from each layer. During the backward pass, the rest of the activations are rematerialized, because this results in higher training throughput at larger batch sizes, compared to alternative recomputation choices.

PaLM 540B utilizes the client-server architecture of Pathways to achieve two-way data parallelism at the pod level. Here, a single Python client dispatches half of the training batch to each pod, and each pod executes the forward and backward computation to compute gradients in parallel using standard within-pod data and model parallelism. The pods then transfer the gradients to the remote pod, and finally, each pod accumulates the local and remote gradients and applies parameter updates in parallel to obtain bitwise-identical parameters for the next timestep.
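The four steps (dispatch half a batch, compute gradients, exchange, accumulate and update) can be mimicked on a toy one-parameter model. This is a sketch of the data flow only, under the assumption of a scalar model `w * x` with squared-error loss; it involves no real pods, Pathways, or JAX, and the function names are hypothetical.

```python
def grad_sum(w, shard):
    # Sum over one pod's half-batch of d/dw (w * x - y)^2.
    return sum(2.0 * (w * x - y) * x for x, y in shard)

def two_pod_step(w, batch, lr=0.01):
    # 1) The client dispatches half of the batch to each pod.
    half = len(batch) // 2
    shard0, shard1 = batch[:half], batch[half:]
    # 2) Each pod computes gradients on its own half.
    g0 = grad_sum(w, shard0)
    g1 = grad_sum(w, shard1)
    # 3) The pods exchange gradients; 4) each adds local + remote.
    total_on_pod0 = g0 + g1
    total_on_pod1 = g1 + g0
    # Both pods apply the same update to their own copy of the parameter.
    w_pod0 = w - lr * total_on_pod0 / len(batch)
    w_pod1 = w - lr * total_on_pod1 / len(batch)
    return w_pod0, w_pod1

batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w0, w1 = two_pod_step(0.0, batch)
print(w0, w1, w0 == w1)
```

Each pod adds one local and one remote value, and floating-point addition of two numbers gives the same result in either order, so the two copies of the parameter stay identical after the update.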

Figure 2 shows how the Pathways system executes the two-way pod-level data parallelism. A single Python client constructs a sharded dataflow program that launches JAX/XLA work on remote servers that each comprise a TPU pod. The program contains a component A for within-pod forward+backward computation, a transfer subgraph for cross-pod gradient transfer, and a component B for optimizer update. The Pathways program executes component A on each pod, then transfers the output gradients to the other pod, and finally, executes component B on each pod. The Pathways system design has several features that allow it to scale program executions to thousands of accelerator chips - first, it masks the latency to dispatch JAX/XLA work from the single Python client to the remote servers via asynchronous gang-scheduling at per-pod schedulers, and second, it amortizes the cost of managing data transfers via a sharded-dataflow execution model.

An interesting aspect of two-way pod-level data parallelism is the challenge of achieving high training throughput for cross-pod gradient transfers at the scale of 6144 TPU v4 chips attached to a total of 1536 hosts across two pods. Note that the cross-pod gradient transfer only requires 1:1 transfer between the corresponding hosts on the two pods because each core only needs remote gradients for its model-sharded parameters. Further, the hosts between the two pods are connected via the Google datacenter network. Since the transfer does not start until each core finishes computing the gradients, this results in a very bursty workload where all hosts transfer their gradients at the same time over the data-center-network links.
The bursty properties of this workload introduce challenges that we address through a careful design of the Pathways networking stack to enable optimal DCN link utilization. For example, to mitigate the effects of congestion, the data for gradient transfers is broken down into smaller chunks and routed via multiple smaller flows over a diverse set of DCN links. With these optimizations, we achieve a training throughput of about 1.95x relative to the throughput on a single pod during training.

4.1. Training Efficiency

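Hardware FLOPs utilization (HFU) is the ratio of the FLOPs observed on a device to its theoretical peak FLOPs. This metric has several issues.

1. The ultimate goal of a training system is to maximize training throughput (tokens per second), not hardware FLOPs. In other words, throughput and FLOPs are related but not the same thing.

For example, rematerialization trades memory usage for extra computation. From the HFU point of view, using it to increase the operation count looks like a gain, but it does not necessarily increase training throughput.

2. The measured FLOPs value depends on the methodology used to count it.

We introduce a new efficiency metric, model FLOPs utilization (MFU), that is implementation-independent and allows a cleaner comparison of system efficiency.

MFU is the ratio of the observed training throughput to the theoretical maximum throughput (the throughput of a system operating at peak FLOPs).

(Note that the theoretical maximum throughput counts only the operations required for the forward and backward passes, not rematerialization.)

MFU allows the training efficiency of different systems to be compared fairly.

This is because the numerator is simply the observed throughput, and the denominator is the theoretical maximum throughput, which depends only on the model architecture and the published peak FLOPs of the hardware.

(My reading: the observed throughput is affected not only by the model architecture and hardware FLOPs but also by constraints (e.g., latency) and by engineering techniques (e.g., parallelism).)

The table (Table 3 in the paper) reports the MFU of PaLM 540B alongside earlier LLMs. MFU is useful for comparing models and systems across different settings (parameter count, architecture, model quality).

The analytically computed HFU of PaLM 540B, which includes rematerialization FLOPs, is 57.8%. The accelerator utilization is this high because of the parallelism strategy described in Section 4 and several other factors (XLA TPU compiler optimizations, parallel layers, and so on).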

Hardware FLOPs utilization (HFU) reflects an estimate of the ratio of FLOPs observed on a given device to its theoretical peak FLOPs.
But the ultimate goal of a training system is to achieve a high throughput in tokens per second, not to use as many hardware FLOPs as possible.
Measuring observed hardware FLOPs is dependent on the methodology used to count or track them.

We propose a new metric for efficiency that is implementation-independent and permits a cleaner comparison of system efficiency, called model FLOPs utilization (MFU). This is the ratio of the observed throughput (tokens-per-second) relative to the theoretical maximum throughput of a system operating at peak FLOPs. Crucially, the "theoretical maximum" throughput only accounts for the required operations to compute the forward+backward passes, and not rematerialization. MFU therefore allows fair comparisons between training runs on different systems, as the numerator is simply the observed tokens-per-second, and the denominator is only dependent on the model architecture and published maximum FLOPs for a given system.
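The definition translates into a short calculation. The sketch below assumes the common approximation of about 6 FLOPs per parameter per token for the forward and backward passes, plus an attention term; the throughput (238.3K tokens per second), the per-chip peak (275 TFLOP/s), and the layer/head shape are illustrative figures assumed here and are not given in this review. The function names are hypothetical.

```python
def model_flops_per_token(n_params, n_layers=0, n_heads=0, head_dim=0, seq_len=0):
    # Assumed approximation: 6 * N for the dense matmuls (forward + backward),
    # plus 12 * L * H * Q * T for self-attention. No rematerialization is counted.
    return 6 * n_params + 12 * n_layers * n_heads * head_dim * seq_len

def mfu(tokens_per_sec, flops_per_token, n_chips, peak_flops_per_chip):
    # Denominator: throughput of a system running at peak FLOPs.
    max_tokens_per_sec = n_chips * peak_flops_per_chip / flops_per_token
    return tokens_per_sec / max_tokens_per_sec

# Illustrative inputs (assumptions, except the 6144 chips and 540B parameters).
observed_tokens_per_sec = 238.3e3
peak_flops_per_chip = 275e12

matmul_only = mfu(observed_tokens_per_sec,
                  model_flops_per_token(540e9),
                  6144, peak_flops_per_chip)
with_attention = mfu(observed_tokens_per_sec,
                     model_flops_per_token(540e9, 118, 48, 256, 2048),
                     6144, peak_flops_per_chip)

print(f"MFU without attention FLOPs: {matmul_only:.1%}")
print(f"MFU with attention FLOPs:    {with_attention:.1%}")
```

Since rematerialization never enters the numerator or the denominator, a system that recomputes more activations does not get a higher MFU for doing so.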

We present the model FLOPs utilization of PaLM 540B model and contextualize with prior large models in Table 3. MFU is useful to compare models and systems in the context of different model parameter counts, architectures and model quality.
Our analytically computed hardware FLOPs utilization, which includes rematerialization FLOPs, is 57.8%. PaLM achieves higher accelerator utilization because of its parallelism strategy and several other factors, including XLA TPU compiler optimizations, and the use of "parallel layers".

5. Training Setup

1. Weight initialization - 1) Kernel weights and embeddings are initialized as $W \sim \mathcal N(0, \frac{1}{\sqrt{n_{in}}})$ and $E \sim \mathcal N(0, 1)$, respectively. 2) Because the input and output embeddings are shared, the pre-softmax output logits are scaled by $\frac{1}{\sqrt n}$, where $n$ is the embedding size.

2. Optimizer - The Adafactor optimizer is used without factorization, that is, without decomposing $V_t$ into $R_t$ and $C_t$. This is effectively equivalent to Adam with "parameter scaling". (Parameter scaling means scaling the lr by RMS(X), where X is the parameter matrix and RMS is the root-mean-square.) Because the weights are initialized proportionally to $\frac{1}{\sqrt n}$, the lr automatically decreases as the model gets larger. (GPT-3, in contrast, scaled it down manually: $6.0 \times 10^{-4}$ for GPT-3 Small → $0.6 \times 10^{-4}$ for GPT-3 175B.) The advantage of parameter scaling is that parameter matrices operating at different scales do not all have their lr reduced at the same rate.

3. Optimization hyperparameters - 1) An lr of $10^{-2}$ is used for the first 10k steps and then decayed at a rate of $1/\sqrt k$, where $k$ is the step number. $\beta_1$ and $\beta_2$ are set to $0.9$ and $1.0 - k^{-0.8}$, respectively. For LLMs, $\beta_2 = 1.0 - k^{-0.8}$ trains more stably than the standard $\beta_2 = 0.99$, because the second moments of rare tokens can be poorly estimated over shorter windows. (Rare tokens get few updates, so the second-moment estimate should decay slowly and make as much use of past gradients as possible.)

2) Global norm gradient clipping with a threshold of 1.0 is used. 3) A dynamic weight decay of $\lambda = lr^{2.0}$ is used, where $lr$ is the current learning rate. (Weight decay penalizes large weights to prevent overfitting; examples are L1 regularization ($\lambda \sum_{j=1}^n |\theta_j|$) and L2 regularization ($\lambda \sum_{j=1}^n |\theta_j|^2$).)

Thinking about $\beta_2$

1) If the second-moment estimator decays quickly, that is, if $\beta_2$ is small, non-convergence issues arise.

2) If the second-moment estimator decays slowly, that is, if $\beta_2$ is large, training becomes unstable. Let us think about why.

Slow decay means the estimate reflects gradients from the distant past more than recent ones.

If the model is changing quickly, this makes the estimated second moment inaccurate, which can lead to updates that are smaller than desired or, worse, larger than desired.

So how can this be handled?

$V_t$ is designed to estimate the squared gradient $G_t^2$, so $V_t$ and $G_t^2$ should be close. That is, $G_t^2/V_t$ should be close to 1.

Therefore, if $RMS(U_t)$ with $U_t = G_t/\sqrt{V_t}$ is far from 1, the second-moment estimator is not working well.

Adafactor handles this with update clipping: $\left( \frac{U_t}{\max(1, RMS(U_t)/d)}, \quad U_t = \frac{G_t}{\sqrt{V_t}} \right)$.

That is, when $G_t^2$ and $V_t$ are far apart and $RMS(U_t)$ exceeds the threshold $d$, the update is scaled down so that its RMS is at most $d$.

Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
Reddi et al. discuss non-convergence issues when using a fast decay of the second-moment estimator (low $\beta_2$). We observe the same issues in our experiments. On the other hand, slow decay (high $\beta_2$) causes training instability when we turn off the step size warmup.
A slow decay rate means that our second-moment estimator is based on gradients farther in the past. If the model is evolving rapidly, this could cause the estimates to have high error, leading to smaller-than-desired or (worse) larger-than-desired updates.
If Adam is functioning as intended, for each individual parameter $x$, the value $\hat {v_{xt}}$, should be close to $(g_{xt})^2$, since this is precisely what $\hat {v_{xt}}$ is designed to measure. Thus, the ratio $(g_{xt})^2/\hat {v_{xt}}$ should be close to 1, as should the mean of many such values. So for a large weight matrix $X$, a value of $RMS(U_t)$ which is far from 1 is a sign that the second-moment estimator is not doing its job well.
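Update clipping, which acts on exactly this $RMS(U_t)$ signal, can be sketched in a few lines. The flat Python lists, the `eps` guard, and the default threshold `d=1.0` are assumptions made for illustration; this is not the optimizer implementation used for PaLM.

```python
import math

def rms(values):
    # Root-mean-square of a list of numbers.
    return math.sqrt(sum(v * v for v in values) / len(values))

def clipped_update(grads, second_moments, d=1.0, eps=1e-30):
    # U_t = G_t / sqrt(V_t), element-wise (eps avoids division by zero).
    u = [g / math.sqrt(v + eps) for g, v in zip(grads, second_moments)]
    # Scale the whole update down only when RMS(U_t) exceeds the threshold d.
    scale = max(1.0, rms(u) / d)
    return [x / scale for x in u]

# Second moments match the squared gradients: the update is left untouched.
print(clipped_update([1.0, -1.0], [1.0, 1.0]))

# Stale, too-small second moments: RMS(U_t) is about 3.5, so the update is scaled down.
print(rms(clipped_update([3.0, 4.0], [1.0, 1.0])))
```

The clipping never enlarges an update; it only caps the RMS at `d` when the second-moment estimate is lagging behind the current gradients.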

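4. Loss function - 1) The standard language modeling loss (the average log probability of all tokens, without label smoothing) is used. 2) In addition, an auxiliary loss $z_{loss} = 10^{-4} \cdot \log^2 Z$ is used. It encourages the softmax normalizer $\log(Z)$, with $Z = e^{z_1} + e^{z_2} + \cdots + e^{z_n}$, to stay close to 0, which improves training stability.

5. Sequence length - All models use a sequence length of 2048. Training examples are concatenated and then split into sequences of exactly 2048 tokens, with an [eod] token separating examples. As a result, there are no [pad] tokens, though an example may be split in the middle.

6. Batch size - For all models, the batch size is increased during training. For PaLM 540B, the batch size is 512 until step 50k, 1024 until step 115k, and 2048 until training ends at step 255k. The smaller models follow similar schedules.

There are two reasons. First, smaller batch sizes are more sample-efficient (better loss for the same number of tokens seen) early in training, while larger batch sizes are beneficial late in training because they give better gradient estimates. Second, larger batch sizes increase TPU efficiency.

7. Bitwise determinism - The model can be reproduced bit for bit from any checkpoint. This rests on two things: (1) a bitwise-deterministic modeling framework, and (2) a data pipeline that writes the shuffled data in a random-access format, so that a given batch can be retrieved from the step number alone.

8. Dropout - Dropout is not used during pre-training, but it is used in most fine-tuning runs.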

1. Weight initialization - The kernel weights are initialized with "fan-in variance scaling", i.e., $W \sim \mathcal N(0, 1/\sqrt{n_{in}})$, where $n_{in}$ is the input dimension of the kernel. The input embeddings are initialized to $E \sim \mathcal N(0, 1)$, since layer normalization is not applied to the embeddings. Because the input and output embedding layers are shared, we scale the pre-softmax output logits by $1/\sqrt n$, where $n$ is the embedding size.

2. Optimizer - The model was trained with the Adafactor optimizer, without factorization. This is effectively equivalent to Adam with "parameter scaling", which scales the learning rate by the root-mean-square of the parameter matrix. Because the weight initialization is proportional to $1/\sqrt n$, the effect of this is similar to the manual scaling down of Adam learning rate as in GPT-3. However, parameter scaling has the benefit that parameter matrices which operate at different scales do not have their learning rate scaled down at the same rate.

3. Optimization hyperparameters - We use an Adafactor learning rate of $10^{-2}$ for the first 10000 steps, which is then decayed at a rate of $1/\sqrt k$, where $k$ is the step number. We train with momentum of $\beta_1 = 0.9$. The second-order moment interpolation value is computed as $\beta_2 = 1.0 - k^{-0.8}$, where $k$ is the step number. We have found this to be more stable than the standard $\beta_2 = 0.99$ when training large language models, because rare embedding tokens can have poorly estimated second moments over shorter windows. We use global norm gradient clipping with a value of 1.0 for all models. We use a dynamic weight decay of $lr^{2.0}$ during training, where $lr$ is the current learning rate.
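These schedules are simple functions of the step number. The sketch below is one reading of the description, assuming that "decayed at a rate of $1/\sqrt k$" means $lr = 1/\sqrt{\max(k, 10^4)}$, which equals $10^{-2}$ for the first 10000 steps; the function names are hypothetical.

```python
import math

def learning_rate(step, constant_steps=10_000):
    # Assumed form: 1 / sqrt(max(k, 10000)), i.e. 1e-2 until step 10000,
    # then inverse-square-root decay.
    return 1.0 / math.sqrt(max(step, constant_steps))

def beta2(step):
    # Second-moment interpolation value; step numbers start at 1.
    return 1.0 - step ** -0.8

def weight_decay(step):
    # Dynamic weight decay lr^2.0, tied to the current learning rate.
    return learning_rate(step) ** 2

for k in (1, 10_000, 40_000, 250_000):
    print(k, learning_rate(k), beta2(k), weight_decay(k))
```

Under this reading, $\beta_2$ starts at 0 and approaches 1 as training proceeds, so the second-moment estimate averages over a longer and longer window of past gradients.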

4. Loss function - The model is trained with the standard language modeling loss function, which is the average log probability of all tokens without label smoothing. We additionally use an auxiliary loss of $z_{loss} = 10^{-4} \cdot \log^2 Z$ to encourage the softmax normalizer $\log(Z)$ to be close to 0, which we found increases the stability of training.
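The auxiliary term can be computed directly from the logits of one position. This is a minimal sketch with hypothetical function names; it uses the usual max-subtraction trick so that $\log Z$ stays finite for large logits.

```python
import math

def log_z(logits):
    # log(sum_i exp(z_i)), computed stably by subtracting the maximum logit.
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def z_loss(logits, coeff=1e-4):
    # Auxiliary loss: coeff * (log Z)^2, which is zero exactly when log Z = 0.
    return coeff * log_z(logits) ** 2

logits = [2.0, 0.5, -1.0]
shifted = [z - log_z(logits) for z in logits]  # same softmax, but log Z is about 0

print(z_loss(logits))
print(z_loss(shifted))
```

Shifting every logit by the same constant leaves the softmax probabilities unchanged, so the penalty only constrains the overall offset of the logits, not the predicted distribution.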

5. Sequence length - A sequence length of 2048 was used for all models. Input examples are concatenated together and then split into sequences of exactly 2048 tokens, so that there are no padding tokens, but examples may be split in the middle.
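The packing step can be sketched as follows. The end-of-document id, the tiny sequence length used in the example, and the choice to drop the trailing partial chunk are assumptions made for illustration.

```python
def pack_examples(examples, seq_len, eod_id):
    # Concatenate all examples into one token stream, marking each boundary.
    stream = []
    for tokens in examples:
        stream.extend(tokens)
        stream.append(eod_id)
    # Cut the stream into full sequences; no padding is ever added.
    # Assumption: a trailing chunk shorter than seq_len is dropped.
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

examples = [[1, 2, 3], [4, 5], [6]]
print(pack_examples(examples, seq_len=4, eod_id=0))
# prints [[1, 2, 3, 0], [4, 5, 0, 6]]
```

The second sequence shows both effects described above: it holds parts of two examples, and the last example is cut off at the sequence boundary.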

6. Batch size - For all models, we increase the batch size during training. For the largest model, we use batch size 512 (1M tokens) until step 50k, then double it to 1024 (2M tokens) until step 115k, and finally double it again to 2048 (4M tokens) until training is complete at step 255k. The smaller models followed similar schedules.
The reason for using such batch schedule is twofold: (1) smaller batch sizes are more sample efficient (i.e., better loss as a function of tokens seen) earlier in training, while larger batch sizes are beneficial later in training due to better gradient estimates. (2) larger batch sizes result in larger matrix multiplication dimensions, which increases TPU efficiency.
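The schedule for the largest model can be written as a lookup, which also gives a quick check on the token budget. Whether each boundary step belongs to the earlier or the later phase is an assumption here, and the names are hypothetical.

```python
SEQ_LEN = 2048

# (end_step, batch_size) for the largest model; boundaries treated as exclusive.
PHASES = [(50_000, 512), (115_000, 1024), (255_000, 2048)]

def batch_size_at(step):
    for end_step, batch_size in PHASES:
        if step < end_step:
            return batch_size
    return PHASES[-1][1]

def total_tokens():
    # Steps in each phase times sequences per batch times tokens per sequence.
    total, start = 0, 0
    for end_step, batch_size in PHASES:
        total += (end_step - start) * batch_size * SEQ_LEN
        start = end_step
    return total

print(batch_size_at(0) * SEQ_LEN)  # 1048576 tokens per batch, i.e. about 1M
print(total_tokens())              # 775946240000, i.e. about 776B tokens
```

The total of roughly 776B tokens is consistent with the single pass over the 780B-token dataset described earlier.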

7. Bitwise determinism - The model is fully bitwise reproducible from any checkpoint.
This is achieved in two ways: (1) a bitwise-deterministic modeling framework provided by JAX+XLA+T5X, and (2) a deterministic dataset pipeline where the shuffled data is written out in a random-access format so the contents of a given training batch is only a function of the step number.

8. Dropout - The model was trained without dropout, although dropout of 0.1 is used for finetuning in most cases.

5.1. Training Instability

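Despite gradient clipping, the loss spiked roughly 20 times while training PaLM 540B.

The spikes occurred at irregular intervals and were not observed with the smaller models.

A simple and effective mitigation is as follows. 1) Restart training from a checkpoint roughly 100 steps before the spike started.

2) Skip the roughly 200-500 data batches seen before and during the spike.

To investigate, the batches around a spike were used to train from a different, earlier checkpoint. No spike was observed. This suggests that spikes arise from the combination of specific data batches with a particular parameter state.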

For the largest model, we observed spikes in the loss roughly 20 times during training, despite the fact that gradient clipping was enabled. These spikes occurred at highly irregular intervals, sometimes happening late into training, and were not observed when training the smaller models.

We found that a simple strategy effectively mitigates the issue: We re-started training from a checkpoint roughly 100 steps before the spike started, and skipped roughly 200-500 data batches, which cover the batches that were seen before and during the spike.

We ran several ablation experiments where we took the batches of data that were surrounding the spike, and then trained on those same data batches starting from a different, earlier checkpoint. In these cases, we did not see a spike. This implies that spikes only occur due to the combination of specific data batches with a particular model parameter state.
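The restart-and-skip recipe can be written down as a small planning helper. This is a hypothetical sketch: it assumes that batches are indexed by step number (as the deterministic data pipeline implies) and picks 300 skipped batches as an arbitrary value inside the 200-500 range.

```python
def recovery_plan(spike_start_step, rewind_steps=100, skipped_batches=300):
    # Restart from a checkpoint roughly `rewind_steps` before the spike began.
    restart_step = max(0, spike_start_step - rewind_steps)
    # Skip the batches that were seen before and during the spike.
    skipped = range(restart_step, restart_step + skipped_batches)
    return restart_step, skipped

restart_step, skipped = recovery_plan(spike_start_step=120_000)
print(restart_step)                           # 119900
print(skipped.start, skipped.stop)            # 119900 120200
print(120_000 in skipped)                     # True: the spike batch itself is skipped
```

After the restart, the model meets different batches at a slightly different parameter state, which is consistent with the observation that a spike needs both a specific batch and a specific state.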

6. Evaluation

6.1. English NLP tasks

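PaLM is evaluated on 29 English benchmarks (general knowledge QA, language modeling, Winograd-style tasks, commonsense reasoning, reading comprehension, SuperGLUE, NLI).

The table (Table 4 in the paper) compares PaLM 540B with the prior SOTA results of other LLMs. Fine-tuned models are not included.

1) PaLM 540B outperforms prior SOTA on 24 of the 29 tasks in the 1-shot setting and on 28 of the 29 tasks in the few-shot setting. Interestingly, on some reading comprehension and NLI tasks, PaLM improves on prior SOTA by more than 10 points in the few-shot setting.

2) PaLM outperforms a similarly sized model (Megatron-Turing NLG 530B) on all benchmarks. This suggests that, beyond model size, the pretraining dataset, the training strategy, and the number of training tokens matter a great deal.

Another table (Table 5) reports average scores for NLU and NLG tasks. PaLM 540B improves the average score in both categories by more than 5 points.

Interestingly, PaLM 62B outperforms GPT-3 175B in both categories.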

We evaluate the PaLM model on the same set of 29 English benchmarks.

Table 4 includes the results for the PaLM 540B and the prior state-of-the-art (SOTA) results from other large language models.
Any model that uses finetuning or multi-task adaptation is not included in the table.

PaLM 540B outperforms prior SOTA on 24 of the 29 tasks in the 1-shot setting and 28 of the 29 tasks in the few-shot setting. Interestingly, PaLM 540B outperforms prior SOTA by more than 10 points in the few-shot setting on some of the Reading Comprehension and NLI tasks.
PaLM 540B outperforms a similar sized model (Megatron-Turing NLG 530B) on all benchmarks. This indicates that the pretraining dataset, training strategy, and the number of tokens observed during training play a significant role in achieving these results.

Table 5 lists the average scores for the Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks. PaLM 540B improves on the average score in both categories by more than 5 points.
Interestingly, the PaLM 62B outperforms the GPT-3 175B in both categories.

6.1.1. Massive Multitask Language Understanding

MMLU covers 57 tasks on topics such as mathematics, history, and law.

PaLM 540B improves the average MMLU score by about 2 points and outperforms Chinchilla in every category except Other.

This multiple choice question answering benchmark covers 57 different tasks spanning topics such as mathematics, history and law. PaLM 540B improves the average score of MMLU benchmark by $\sim$ 2 points. PaLM 540B outperforms the Chinchilla model on all the categories except the category of Other tasks.

6.1.2. Finetuning

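Fine-tuning experiments were run for PaLM on the SuperGLUE benchmark (lr = $5 \times 10^{-5}$, optimizer = Adafactor, batch size = 32). PaLM typically converges in fewer than 15K fine-tuning steps.

PaLM was fine-tuned on a task-proportionate mixture of the SuperGLUE tasks and compared with SOTA models such as T5-11B and ST-MoE-32B. It came close to SOTA. The top-performing models on SuperGLUE are encoder-decoder models trained with the span corruption objective (such as T5), and encoder-decoder models generally outperform autoregressive decoder-only models when fine-tuned on classification tasks at equal cost. Still, these results suggest that scale can help narrow the gap caused by architecture. Note that the gap between few-shot and fine-tuned results remains significant.

In conclusion, PaLM reaches close-to-SOTA performance through scale and, at the same time, is the best among decoder-only autoregressive language models.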

We conduct finetuning experiments for the PaLM model on the SuperGLUE benchmark. PaLM is finetuned with $5 \times 10^{-5}$ learning rate using the Adafactor optimizer, with a batch size of 32. PaLM converges typically in less than 15K steps of finetuning.

Table 7 reports the validation results of finetuning on a task-proportionate mixture of SuperGLUE tasks. On SuperGLUE, we compare with state-of-the-art models such as T5-11B and ST-MoE-32B and show that PaLM obtains competitive close-to-SOTA performance. It is worth noting that both top performing models on SuperGLUE are encoder-decoder models that are trained using the span corruption objective. It has been shown that such an architecture will generally outperform autoregressive decoder-only models on classification task finetuning, when cost is equalized. These results demonstrate that scale can help bridge the gap. Table 8 also demonstrates that there is still a significant gap between few-shot and finetuned results. We show that PaLM is competitive with state-of-the-art while outperforming the best decoder-only autoregressive language model on the leaderboard by a wide margin.

6.2. BIG-bench

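BIG-bench contains more than 150 language modeling tasks, covering logical reasoning, translation, QA, mathematics, and more.

This section measures the few-shot performance of the PaLM model family on BIG-bench.

The left plot of the first figure (Figure 3-left) shows the average scores of the PaLM models alongside those of earlier LLMs (GPT-3, Gopher, Chinchilla) for comparison. Here, the average score is the average over the 58 BIG-bench tasks that the models have in common.

The results are as follows. 1). PaLM's average score is well above those of all the earlier LLMs. 2). 5-shot PaLM 540B scores higher than the human average. 3). As the second figure (Figure 4) shows, 5-shot PaLM 540B beats the prior SOTA on 44 of the 58 tasks. 4). In the right plot of the first figure, the relationship between model size and performance across the PaLM family looks log-linear. This suggests that scaling the model further is likely to keep improving performance.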

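Next, we look at six BIG-bench tasks on which the PaLM models showed interesting performance characteristics.

1. goal_step_wikihow: The task is to reason about goal-step relationships between events, that is, to infer the order of events needed to reach a goal.

2. logical_args: The task is to predict the correct logical inference from a passage, that is, to work out what the passage is implying.

3. english_proverbs: The task is to pick the proverb that best fits a passage.

4. logical_sequence: The task is to put "things" (months, actions, numbers, letters, etc.) into their logical order.

5. navigate: The task is to follow a set of simple navigation instructions and infer the final position.

6. mathematical_induction: Given a mathematical induction rule, the task is to use it to infer whether a stated proposition holds.

Several interesting points emerge from the results on these six tasks.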

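1). On goal_step_wikihow and logical_args, PaLM 540B scored close to the best human score.

2). On english_proverbs and logical_sequence, the improvement was discontinuous: the gain from 62B → 540B was much larger than the gain from 8B → 62B. This is particularly interesting because it suggests that certain capabilities emerge only once a certain model size is reached.

Across the 150 tasks, 25% improved by more than 10% over the projected value (a discontinuity greater than +10%), and 15% improved by more than 20% over it. This suggests that discontinuous improvement from scale is a common phenomenon on challenging few-shot language tasks.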
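As a rough illustration of how such a discontinuity could be measured, the sketch below fits a log-linear line through the 8B and 62B scores, projects it to 540B, and subtracts the projection from the observed 540B score. The function name `discontinuity` and the example scores are hypothetical, and the paper's exact procedure may differ from this reading.

```python
import math

def discontinuity(score_8b, score_62b, score_540b):
    """Observed 540B score minus the log-linear projection from 8B -> 62B.

    Assumption: "log-linear" means the score is linear in log(parameter count).
    """
    # Parameter counts in billions; only their ratios matter for the fit.
    x8, x62, x540 = math.log(8), math.log(62), math.log(540)
    slope = (score_62b - score_8b) / (x62 - x8)
    projected = score_62b + slope * (x540 - x62)
    return score_540b - projected

# Hypothetical task scores in percent (not taken from the paper).
smooth_task = discontinuity(20.0, 40.0, 61.0)   # follows the projection closely
jumpy_task = discontinuity(20.0, 25.0, 70.0)    # far above the projection
```

A task like `jumpy_task` would count toward the "discontinuity greater than +20%" group, while `smooth_task` would not.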

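3). However, scale does not help every task. On navigate and mathematical_induction, PaLM 540B scored only slightly higher than PaLM 62B, and both remained far below the best human score.

The next figure (Figure 6) shows the distribution, across tasks, of PaLM 540B's improvement over the average human score. PaLM 540B beats the human average in aggregate, but on about 35% of the individual tasks the human average is still higher.

The last figure (Figure 7) shows the results of evaluating the PaLM models on BIG-bench Lite, a curated subset of 24 BIG-bench tasks meant as a lightweight evaluation target. On some tasks PaLM came close to the best human score, but on others a large gap remains.

Note that the authors took several steps to establish the validity of these results, in particular to rule out the possibility that the gains came from memorizing BIG-bench.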

BIG-bench includes over 150 tasks that cover a variety of language modeling tasks including logical reasoning, translation, question answering, mathematics, and others. In this section we present and analyze few-shot evaluation results of the PaLM model family on BIG-bench.

Figure 3-left shows the results of evaluating the PaLM family of models on BIG-bench, compared against previously published results. The three models were evaluated on only 58 tasks in common, so this figure presents results only on these 58 tasks.

We see that PaLM significantly outperforms GPT-3, Gopher, and Chinchilla, and 5-shot PaLM 540B achieves a higher score than the average score of the humans asked to solve the same tasks. PaLM 540B 5-shot outperforms the prior SOTA on 44 out of the 58 common tasks, with per-task results shown in Figure 4.
The performance of PaLM models as a function of scale appears to follow log-linear behavior, indicating that further scaling up is likely to result in increased performance.

Next, we highlight a handful of tasks from BIG-bench where PaLM had particularly interesting performance characteristics.

goal_step_wikihow - The goal is to reason about goal-step relationships between events.
Input: In order to "clean silver", which step should be done first? (a) dry the silver (b) handwash the silver
Answer: (b) handwash the silver

logical_args - The goal is to predict the correct logical inference from a passage.
Input: Students told the substitute teacher they were learning trigonometry. The substitute told them that instead of teaching them useless facts about triangles, he would instead teach them how to work with probabilities. What is he implying? (a) He believes that mathematics does not need to be useful to be interesting. (b) He thinks understanding probabilities is more useful than trigonometry. (c) He believes that probability theory is a useless subject.
Answer: (b) He thinks understanding probabilities is more useful than trigonometry.

english_proverbs - The goal is to guess which proverb best describes a text passage.
Input: Vanessa spent lots of years helping out on weekends at the local center for homeless aid. Recently, when she lost her job, the center was ready to offer her a new job right away. Which of the following proverbs best apply to this situation? (a) Curses, like chickens, come home to roost. (b) Where there is smoke there is fire (c) As you sow, so you shall reap.
Answer: (c) As you sow, so you shall reap.

logical_sequence - The goal is to order a set of "things" (months, actions, numbers, letters, etc.) into their logical ordering.
Input: Which of the following lists is correctly ordered chronologically? (a) drink water, feel thirsty, seal water bottle, open water bottle (b) feel thirsty, open water bottle, drink water, seal water bottle (c) seal water bottle, open water bottle, drink water, feel thirsty
Answer: (b) feel thirsty, open water bottle, drink water, seal water bottle

navigate - The goal is to follow a set of simple navigational instructions, and figure out where you would end up.
Input: If you follow these instructions, do you return to the starting point? Always face forward. Take 6 steps left. Take 7 steps forward. Take 8 steps left. Take 7 steps left. Take 6 steps forward. Take 1 step forward. Take 4 steps forward.
Answer: No

mathematical_induction - The goal is to perform logical inference on mathematical induction rules.
Input: It is known that adding 2 to any odd integer creates another odd integer. 2 is an odd integer. Therefore, 6 is an odd integer. Is this a correct induction argument?
Answer: Yes

We can see in Figure 5 that performance on goal_step_wikihow and logical_args follows a log-linear scaling curve, with the PaLM 540B model achieving accuracy close to the best human performance. Performance on english_proverbs and logical_sequence is also extremely strong, but it follows a discontinuous improvement curve - the improvement from 62B → 540B is much larger than from 8B → 62B. Such tasks are of particular interest here, because such scaling curves imply that certain capabilities of the model only emerge once a certain scale is reached.
Over all 150 tasks, 25% of tasks had a discontinuity greater than +10%, and 15% of tasks had a discontinuity greater than +20%. This demonstrates that discontinuous improvements from scale are a common phenomenon on challenging few-shot language tasks.
However, scale does not benefit all tasks. For navigate and mathematical_induction, PaLM 540B only modestly outperforms PaLM 62B, and both are still far from the best human performance score.

In Figure 6 we show the distribution of improvement over tasks, when comparing PaLM 540B to the average performance score of human evaluations. We can see that although PaLM 540B outperforms the average human performance on aggregate, the average human performance is still higher than PaLM 540B on 35% of the individual tasks.

Finally, Figure 7 presents the detailed evaluation results on BIG-bench Lite, a curated subset of 24 BIG-bench tasks that serves as a lightweight evaluation target. While some of the BIG-bench Lite tasks are solved or close to being solved, others are still far from being solved in comparison to the best performance score from human evaluation.

We took several steps to establish the validity of these results, and in particular to rule out the possibility that the models achieved them by memorizing the BIG-bench data.

6.3. Reasoning

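Language models can perform a wide range of tasks, but they still struggle with tasks that require multi-step reasoning. This work measures reasoning performance on two categories of benchmarks.

1. Arithmetic reasoning: Mostly grade-school-level math problems written in natural language, which require multi-step logical inference. The crux of these tasks is not the math itself but turning a problem expressed in natural language into equations. These problems are evaluated in both calculator form (an external calculator is used) and direct inference form (the model does the math itself).

2. Commonsense reasoning: QA tasks that require world knowledge and multi-step logical inference built on it.

Several recent papers greatly improved reasoning performance by having the model spell out (generate) its reasoning steps before producing the final answer. We call this technique chain-of-thought prompting. In the few-shot setting, the reasoning steps (chains of thought) are written out by hand in the exemplars, and the model then generates its own reasoning for the test examples. Only the final answer is used for evaluation.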
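The few-shot chain-of-thought setup described above can be sketched as follows. This is only an illustration: the `Q:`/`A:` layout, the "The answer is" marker, the helper names, and the exemplar text are all assumptions made for this example, not the prompt format used in the paper.

```python
import re

def build_cot_prompt(exemplars, question):
    """Join (question, reasoning, answer) exemplars, then append the test question."""
    parts = []
    for q, reasoning, answer in exemplars:
        parts.append(f"Q: {q}\nA: {reasoning} The answer is {answer}.")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

def extract_final_answer(generation):
    """Return the text after the last 'The answer is', or None if it is absent."""
    matches = re.findall(r"The answer is\s*([^\n]+?)\.?\s*(?:\n|$)", generation)
    return matches[-1].strip() if matches else None

# Hypothetical hand-written exemplar with its reasoning steps.
exemplars = [
    ("Tom has 3 apples and buys 2 more. How many apples does he have?",
     "Tom starts with 3 apples. 3 + 2 = 5.",
     "5"),
]
prompt = build_cot_prompt(exemplars, "A box holds 4 pens. How many pens are in 3 boxes?")

# A made-up model output; only the extracted final answer would be scored.
answer = extract_final_answer("Each box holds 4 pens. 4 * 3 = 12. The answer is 12.")
```

Scoring only `answer` mirrors the rule that the generated reasoning is not itself evaluated.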

While language models have been shown to perform a wide range of tasks, it is commonly accepted that language models still struggle to perform tasks that require multi-step reasoning. The two broad categories of reasoning benchmarks evaluated in this work are as follows:
Arithmetic reasoning - These tasks often involve grade-school level natural language math problems which require multi-step logical inference.
In this work, we evaluated both the calculator form and direct inference form, where the model itself performs the math.
Commonsense reasoning - These tasks are question answering tasks which require strong world knowledge, but are not simply factual question answering. Rather, they require chaining multiple logical inferences about the world.

Several recent papers have shown that large language models can achieve significant accuracy improvements by generating intermediate reasoning steps before generating the final answer. In this work, we refer to this technique as chain-of-thought prompting. In the few-shot setting, these intermediate reasoning steps are manually written for the few-shot exemplars, and the model will then generate its own chain of thought for the test examples. Only the final answer is used for evaluation.

6.3.1. Results

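To state the conclusion first: model scale and CoT prompting alone are enough to reach SOTA performance on arithmetic and commonsense reasoning tasks.

Arithmetic and commonsense reasoning were measured with PaLM+CoT (PaLM combined with 8-shot chain-of-thought prompting). The arithmetic reasoning datasets were GSM8K, SVAMP, MAWPS, and AQuA, and the commonsense reasoning datasets were CommonsenseQA and StrategyQA.

1). On GSM8K, PaLM 540B+CoT+calculator reached 58%, beating the prior SOTA (55%) by 3 points. It also far outperformed PaLM 540B w/o CoT (17%) and PaLM 62B+CoT (33%).

An analysis of the problems that PaLM 62B+CoT got wrong found errors such as misunderstanding the meaning of the problem or skipping one step of the reasoning. Scaling up to 540B fixed a large share of these errors.

2). Across the seven reasoning datasets, PaLM 540B+CoT beat the SOTA on 4 tasks and came close to SOTA on the rest.

Without both techniques, PaLM would have reached SOTA on only one task (SVAMP), which shows that both CoT and model scale help on reasoning tasks. Note that n-gram overlap analysis confirmed there was no data contamination between the training data and the reasoning evaluation sets.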

In this work, we demonstrate a striking result: model scale and chain-of-thought (CoT) prompting are alone enough to achieve SOTA accuracy across a variety of arithmetic and commonsense reasoning tasks.

Using chain-of-thought prompting, we evaluate PaLM's performance on the arithmetic datasets GSM8K, SVAMP, MAWPS, and AQuA, as well as the commonsense reasoning datasets CommonsenseQA and StrategyQA.

Using 8-shot chain-of-thought prompting in combination with an external calculator, PaLM 540B achieves a performance of 58%, which outperforms the prior SOTA of 55% from Cobbe et al. This also significantly outperforms both PaLM 540B w/o chain-of-thought and PaLM 62B+chain-of-thought. We analyzed the errors for problems that the PaLM 62B model got wrong and found they typically fall into the following categories - semantic understanding, one-step missing, and other errors. Scaling up to the 540B model size fixed a large amount of these errors.

Across the 7 reasoning datasets, 8-shot prediction with PaLM 540B+chain-of-thought achieved SOTA accuracy on 4 tasks (GSM8K, MAWPS, SVAMP, StrategyQA), and close to SOTA on the remaining 3 tasks (ASDiv, AQuA, CommonsenseQA).

We can also see that both chain-of-thought and model scaling help significantly on all tasks, as PaLM would have only achieved SOTA on a single task (SVAMP) without both techniques. Note that we verified through n-gram overlap analysis that there was no data contamination between our model training corpus and the reasoning evaluation sets.

6.4. Code Tasks

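PaLM performs very well on a variety of coding tasks.

1. Text-to-code: The goal is to write code that meets a given description, evaluated on three datasets (HumanEval, MBPP, GSM8K-Python). 1). HumanEval and MBPP give a short description and a few input-output examples and ask for a short Python program (usually a single function). 2). GSM8K-Python gives a math problem in natural language and asks for a Python program that returns the correct answer.

2. Code-to-code: 1). TransCoder is the task of translating C++ programs to Python. 2). DeepFix code repair is the task of fixing C programs that fail to compile.

Results are reported with the pass@k metric: for each problem, k samples are drawn (generated) from the model, and the problem counts as solved if any sample solves it. For k=1, greedy decoding is used; for k>1, nucleus sampling with p=0.95 and temperature=0.8 is used.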
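The pass@k definition above translates almost directly into code. The sketch below assumes each problem already has a list of booleans recording whether each drawn sample passed its tests; the function name and the toy results are hypothetical.

```python
def pass_at_k(results_per_problem, k):
    """Fraction of problems for which at least one of the first k samples passes."""
    solved = 0
    for results in results_per_problem:
        if len(results) < k:
            raise ValueError("each problem needs at least k samples")
        if any(results[:k]):
            solved += 1
    return solved / len(results_per_problem)

# Toy data: 4 problems, 3 samples each (True means the sample passed the tests).
results = [
    [False, True, False],
    [False, False, False],
    [True, True, True],
    [False, False, True],
]
score_k1 = pass_at_k(results, 1)
score_k3 = pass_at_k(results, 3)
```

Because a problem only needs one passing sample, the score can never decrease as k grows, which is why pass@100 is always at least as high as pass@1.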

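PaLM is compared with other language models (LaMDA, Codex). 1). LaMDA 137B was not trained on code from GitHub, but about 12.5% of its pretraining data consists of code web docs (code Q&A sites and tutorials). 2). Many details of the Codex 12B model are not public, so the causes of performance differences cannot be pinned down. The comparison is still useful for gauging how hard the tasks are.

Datasets: The pretraining data contains 39B code tokens. Because most of the evaluation datasets test Python programming skills, an additional 5.8B tokens of Python code (ExtraPythonData) were collected from GitHub repositories. ExtraPythonData does not overlap with the code used in pretraining.

PaLM 540B: 1). Although LaMDA was not trained on GitHub code, it scores above zero on every task, which means code web docs are useful for coding tasks. 2). Even so, PaLM outperforms LaMDA on every task, and on HumanEval it is comparable to Codex 12B. PaLM is the first LLM to aim for SOTA performance on both natural language and code tasks in a single model, and it does achieve very strong results on both. 3). More surprisingly, PaLM reaches comparable few-shot performance despite training on 50 times less Python code than Codex 12B. The likely reasons are (1) transfer from other programming languages and from natural language, and (2) larger models being more sample-efficient.

PaLM-Coder: All PaLM models are further finetuned on additional code data and compared with the existing models. The finetuning has two stages. 1). First, finetune for 6.5B tokens on a mixture of 60% Python code, 30% code in other languages, and 10% natural language. 2). Then finetune for an additional 1.9B tokens of Python code.

After finetuning, the 540B model improved further on most tasks, for example 1). +12% on HumanEval pass@100 and 2). +5% on MBPP pass@80. Each increase in model scale also improved performance on all datasets.

The one exception is GSM8K-Python pass@1, where performance dropped: PaLM-Coder 540B scored 57.5, while PaLM 540B scored 58.1.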

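DeepFix Code Repair: On the code repair task, PaLM-Coder 540B reached an impressive compile rate of 82.1%. For this task the amount of code edited matters as well as accuracy, because the goal is to fix the error with as small an edit as possible.

To analyze edit size, the success rate was broken down under several definitions of a "small" edit. Interestingly, PaLM made the smallest edits. PaLM-Coder had the highest success rate when only edits with a small normalized edit distance are counted, and Davinci Codex had the highest success rate when only edits with few changed lines are counted.

Inspecting the generated code showed that PaLM-Coder is more likely than Codex to make minor stylistic changes.

In addition, the authors looked at how PaLM-Coder's predictions change when the coding style of the prompt changes.

Specifically, in the prompt examples, two variable declarations on the same line were split onto separate lines, and PaLM-Coder was run again. The modified prompt is intended to discourage combined declarations. As a result, all metrics improved.

Inspecting the generated code showed that PaLM-Coder was, as intended, much less likely to combine variable declarations. This extra experiment shows that a change to the coding style of the prompt can improve results.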

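When LM-based systems are used in software development, there is a risk that the generated code is incorrect or introduces subtle bugs.

Current mitigations for this risk are imperfect, so developers should review suggested code before adding it to a program.

Moreover, generated code working correctly on a small number of test cases does not guarantee that it is correct. It may still fail on other cases.

Most benchmarks evaluate with only a few test cases, so they may overestimate performance. More thorough testing of functional correctness is needed.

Functional correctness is only one aspect of code quality. LM-generated code must also be readable, fast, robust, and secure.

DeepFix exposes one problem with PaLM-Coder. Its fixes made the programs compile, but the programs were not necessarily secure because they rely on assumptions about the format and size of the input. (DeepFix was built from exercises in a C programming course, where such assumptions were allowed.) Such code is not suitable in a general setting, since it only works under those specific conditions. Generating code that is both readable and robust therefore remains an open problem.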

In this section, we show that the PaLM model achieves outstanding results on a variety of coding tasks:
1. Text-to-code. We consider three tasks where the goal is to write code given a natural language description. In the HumanEval and MBPP datasets, the model is given an English-language description of a few sentences and a small number of input-output examples, and the goal is to generate a short Python program, usually a single function.
The goal of the GSM8K-Python task is to produce a Python program that returns a correct solution.
2. Code-to-code. TransCoder is a task involving translation of C++ programs to Python. We also evaluate on the DeepFix code repair task. Starting from broken student-written C programs that fail to compile, the objective is to modify the programs so that they compile successfully.

We report results using the pass@k metric, in which for each problem in the test set, k samples of source code are drawn from the model, and a problem is counted as solved if any sample solves it.
When reporting pass@1, we use greedy decoding. For k > 1, we use nucleus sampling with p=0.95 and temperature 0.8.
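To make the two decoding settings concrete, here is a minimal sketch of greedy selection and of nucleus (top-p) sampling with temperature over a single vector of next-token logits. It is a standard-library illustration only; the function names and toy logits are assumptions, and a real decoder would apply this step repeatedly over a model's vocabulary.

```python
import math
import random

def greedy(logits):
    """Pick the index of the highest logit (used for pass@1)."""
    return max(range(len(logits)), key=lambda i: logits[i])

def nucleus_sample(logits, p=0.95, temperature=0.8, rng=random):
    """Sample from the smallest set of tokens whose probability mass reaches p."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most probable tokens until their cumulative mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break

    # random.choices renormalizes the weights over the kept set.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

# Toy logits: token 0 dominates, so the nucleus contains only token 0.
peaked = [5.0, 1.0, -10.0]
samples = [nucleus_sample(peaked) for _ in range(20)]
```

A temperature below 1 sharpens the distribution before truncation, so low-probability tokens are even less likely to enter the nucleus.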

We compare the PaLM model to several different language models for code. First, we compare to the LaMDA 137B parameter model. Although LaMDA was not trained on code from GitHub, about 12.5% of the LaMDA pretraining mixture was from web sites related to code such as Q&A sites and tutorials, which we call "code web docs".
Second, we compare to the early Codex model 12B.
Unfortunately, there are many things that are not publicly known about the Davinci Codex model.
Although these uncertainties make it impossible to understand the causes of differences in performance, this comparison is still useful to understand the intrinsic difficulty of the tasks that we consider.

Datasets. There were 39B code tokens in the pre-training dataset. Because most of our evaluation datasets test Python programming skills, we collected an additional dataset specifically of Python code. This dataset, which we call ExtraPythonData, contains 5.8B tokens from GitHub repositories that were not used during pre-training.

PaLM 540B. The LaMDA model has nonzero performance across all tasks, even though it was not trained on GitHub code. This shows that the code web documents used in the LaMDA training are informative for these tasks.
Even so, the PaLM model shows better performance across all tasks than LaMDA, and on HumanEval is comparable to Codex 12B.
To our knowledge, this is the first large language model that aspires to state-of-the-art performance across natural language and code tasks in a single model, and indeed PaLM achieves the best published performance in both.
The PaLM model achieves comparable performance in few-shot evaluations to previously-published results from 50 times less Python code. We interpret this as a combination of (a) transfer from other programming languages and from natural language data and (b) a particularly striking instance of the observation from Kaplan et al. that larger models can be more sample efficient than smaller models.

PaLM-Coder. Now we evaluate the effect of further finetuning only on code.
We finetune the 8B, 62B and 540B PaLM models in two stages: (a) first finetuning for 6.5B tokens on a mixture of 60% Python code from ExtraPythonData, 30% code across languages and 10% natural language, and (b) finetuning for an additional 1.9B tokens on more Python code from ExtraPythonData.
The performance of PaLM-Coder 540B increases even further.
In the case of the 540B model, this gives a +12% absolute improvement on HumanEval pass@100 and +5% absolute improvement on MBPP pass@80 compared to the non-finetuned model. Each increase in scale gives an improvement in performance across all datasets.
For the GSM8K-Python dataset, we observe that PaLM-Coder 540B achieves a pass@1 score of 57.5 for the 8-shot prompt, while the PaLM 540B model obtained a pass@1 score of 58.1.

DeepFix Code Repair. The PaLM-Coder 540B model demonstrates impressive performance on the DeepFix code repair task, reaching a compile rate of 82.1%, compared to 71.7% achieved by prior work.
For code repair, it is important to assess the amount of code changed by the model - ideally we only want to modify a small portion of the broken code.
We break down the results using various metrics for defining "small" edits.
Interestingly, PaLM produces the smallest edits, while PaLM-Coder has the highest success rate when only considering edits with small normalized edit distances, and Davinci Codex has the highest success rate when only considering edits with few lines changed. In other words, PaLM-Coder tends to change fewer characters spread across more lines compared to Codex.
We observe this behavior qualitatively in the predictions, where PaLM-Coder is more likely than Codex to make minor stylistic changes.

We use this opportunity to observe how PaLM-Coder's predictions change in response to a changed prompt. We run PaLM-Coder again using the same two prompt examples, except that one occurrence of two variable declarations on the same line was split into two lines in both the broken and fixed code, with the intention of discouraging PaLM-Coder from combining variable declarations. Using the new prompts, all of the metrics improved.
Qualitatively, we observe that PaLM-Coder is much less likely to combine variable declarations when using the new prompts, as desired. Although this is just one anecdotal experiment, it is encouraging that a small targeted change to the prompt examples led to improved predictions in the expected way.

Discussion. When deploying LM-based systems within software development, a key risk is that the generated code could be incorrect, or introduce subtle bugs.
Current mitigations for this risk are imperfect. Developers should review suggested code before adding it to a program, but they may not always find subtle bugs in suggested code. Additionally, code suggestions can also be confirmed by a test suite, but it is not always safe to infer from a small number of test cases that solutions are functionally correct. Indeed, even on the benchmarks we consider here, we follow previous work in measuring functional correctness based on a small number of tests, but this can overestimate the performance of the methods. More thorough tests of functional correctness would be desirable.

Functional correctness is only one aspect of source code quality; LM-generated suggestions must also be readable, robust, fast, and secure. DeepFix illustrates one issue with PaLM-Coder's current predictions: the "fixed" programs compile but are not necessarily secure because they rely on assumptions about the format and size of the input. The DeepFix dataset is drawn from student-written submissions to exercises in a C programming course, where students are allowed to make such assumptions. Such suggestions are likely to be undesirable in a more general setting. Generally, helping developers understand and have confidence in the code that has been suggested is an open problem.

7. Memorization

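Neural networks can memorize their training data.

PaLM has an enormous capacity, so it may memorize training data even after seeing it only once. In addition, web corpora contain near-duplicate text, so the same passage can be seen several times during training and memorized.

This section analyzes how much of its training data PaLM has memorized, as follows. 1). Randomly select 100-token sequences from the training data. 2). Prompt the model with the first half of each sequence and generate 50 tokens. 3). Check whether the generated tokens match the second half of the sequence.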
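The three steps above amount to a simple exact-match loop. In the sketch below, `generate` is a hypothetical callable standing in for greedy decoding with the model (it takes a prompt and a number of tokens and returns that many token ids); the toy "model" only exists so the example runs on its own.

```python
def memorization_rate(spans, generate, prompt_len=50):
    """Fraction of token spans whose continuation is reproduced exactly."""
    hits = 0
    for span in spans:
        prompt, target = span[:prompt_len], span[prompt_len:]
        continuation = generate(prompt, len(target))
        if list(continuation) == list(target):
            hits += 1
    return hits / len(spans)

# Toy stand-in for a model: it has "memorized" exactly one 100-token span.
memorized = {tuple(range(50)): list(range(50, 100))}

def toy_generate(prompt, num_tokens):
    # Hypothetical greedy decoder: returns the stored continuation or zeros.
    return memorized.get(tuple(prompt), [0] * num_tokens)[:num_tokens]

spans = [list(range(100)), list(range(100, 200))]
rate = memorization_rate(spans, toy_generate)
```

Because the check is an exact match on all 50 tokens, a continuation that differs in a single token counts as not memorized.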

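The results are as follows. 1). The 8B model reproduced 1.6% of the sequences, and the 540B model reproduced 2.4%. 2). Sequences seen once had a memorization rate of 0.75%, whereas sequences seen more than 500 times had a rate above 40%. Such high duplication counts exist because the training data was de-duplicated only at the document level, while memorization is measured on 100-token spans. 3). Breaking down the memorization rate by dataset showed that it depends on the amount of exact duplication, near duplication, and templating.

These results lead to several conclusions. 1). Larger models have higher memorization rates, following a log-linear trend. 2). Some memorization of common templates and boilerplate is expected. However, the memorization rate on training data is higher than on held-out data, which indicates that the model does genuinely memorize some of the data. 3). Text seen many times is memorized far more readily than text seen once.

Most memorized content is formulaic or boilerplate text, but some stories, news articles, and facts were also memorized.

Note that the amount of training text that can be extracted does not by itself tell us whether memorization is a problem.

Whether memorization is a problem depends on the nature of the data (e.g., harmful data) and the downstream task (e.g., story generation). Care is therefore always needed when choosing downstream tasks for an LLM.

One way to prevent memorized text from being emitted at generation time is to build a bloom filter over the training data and block any sequence that appears verbatim in the training data from being generated.

This approach is computationally efficient but memory-intensive. Memory use can be reduced by building the filter only over the data that would be problematic if memorized. Even so, it is not a complete solution: it blocks text that exactly matches the training data, but it cannot fully block text that is merely similar.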
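A minimal sketch of the bloom-filter idea follows, assuming the filter is built over word n-grams of the training text and a generation is rejected when any of its n-grams is found in the filter. The class, the helper names, the n-gram length, and the filter size are all illustrative choices, not details from the paper.

```python
import hashlib

class BloomFilter:
    """Tiny bloom filter: no false negatives, a small chance of false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def reproduces_training_text(bloom, generated_tokens, n):
    """True if any n-gram of the generation is (probably) in the training data."""
    return any(g in bloom for g in ngrams(generated_tokens, n))

# Build the filter over a toy "training set" using 4-grams.
training_tokens = "the quick brown fox jumps over the lazy dog".split()
bloom = BloomFilter()
for gram in ngrams(training_tokens, 4):
    bloom.add(gram)

copied = reproduces_training_text(bloom, "a quick brown fox jumps high".split(), 4)
novel = reproduces_training_text(bloom, "an entirely different sentence appears here".split(), 4)
```

Restricting which n-grams are added (for example, only those from sensitive sources) shrinks the filter, which is the memory trade-off described above; paraphrased text would still pass the check.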

For this reason, the best mitigation is to decide carefully when it is appropriate to use a language model for sequence generation, that is, to set clear criteria for its use.

Our models also have an extremely large capacity, so it is plausible that even a single pass could memorize a significant portion of the training data. Additionally, due to the presence of near-duplicate text in web-derived corpora, some passages are seen multiple times during training.

In this section, we analyze the extent to which PaLM models have memorized the training data. To evaluate this, we randomly selected 100-token sequences from the training examples, and prompted the model with the first 50 tokens from the span. We run greedy decoding and measure how often the model produced a 50-token continuation that exactly matches the training example.

We can see that the 8B model was able to exactly reproduce the 50 token continuation for 1.6% of the data, while the 540B model was able to reproduce the continuation for 2.4% of the data.
We can see that examples seen exactly once in the training have a memorization rate of 0.75% for our largest model, while examples seen more than 500 times have a memorization rate of over 40%. Note that the reason why there are any examples with such a high duplication rate is that our training data is only de-duplicated on full documents, and here we evaluate memorization on 100-token spans.
After analyzing the memorized examples, we conclude that the biggest differentiator is the amount of exact duplication, near duplication, or templating of the example in the training.

From these results, we can reach the following conclusions about memorization:
- Larger models have a higher rate of memorization than smaller models.
- Some amount of "memorization" is expected, as the model will produce exact match continuations for common templates and boilerplate. However, the memorization rate on training data is significantly higher than on heldout data, which indicates that the model does genuinely memorize some portion of the data.
- Examples that are only seen once are much less likely to be memorized than examples that are seen many times.

We found that most instances of memorization were of formulaic text that is unlikely to trigger concern.
However, we also observed memorization of stories, news articles, and facts.
However, simply measuring the amount of training set text that can be extracted does not tell us anything about whether this memorization is problematic or not.

Whether memorization is problematic depends on the properties of the dataset and target application. Thus, care should always be taken in choosing downstream applications for a large language model. One computationally efficient, though memory-intensive, approach to prevent generation-time memorization would be to implement a bloom filter over the training data, and restrict sequences that occur verbatim in the training dataset from ever being generated. Since data from some sources may be more problematic if memorized than other sources, such an approach could be made more memory-efficient by only building the bloom filter over parts of the data. While this approach would remove exactly memorized content, approximately memorized content could still be produced.
Ultimately, the best mitigation strategy is to make careful and deliberate choices of when it is appropriate to use large language models for long-sequence generation.

8. Dataset Contamination

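Previous work judged contamination simply by looking for high-order n-grams (e.g., 13-grams) that occur in both the evaluation examples and the training data.

However, many benchmarks are built by taking a context from the web and then writing new QA pairs about it. The context therefore holds most of the information needed to answer.

Since the context is provided at evaluation time, the evaluation is fair whether or not the model saw that context during training.

Instead of relying on n-gram overlap alone, this work computes statistics for 29 NLP benchmarks and manually inspects a large number of examples to find the datasets with a high proportion of contaminated examples.

As a result, the 29 benchmarks fall into four categories.

1). Wholesale contamination: a significant portion of the dataset itself appears on the web. 2). Constructed from web: the questions and answers were extracted from the web. 3). Context on web: QA datasets whose context comes from the web but whose questions do not. 4). No significant overlap: datasets without significant overlap with the training data.

10 of the 29 fall into the first two categories, but only some of their examples were actually found in the training data. This is because the web contains trillions of tokens, whereas the training data contains about 200B words.

Each dataset was therefore split as follows: an example is contaminated if at least 70% of the 8-grams in its prompt, question, or target appear in the training data, and clean otherwise.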
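The 70% rule can be sketched as below. This is one reading of the criterion, in which an example is flagged when any single field (prompt, question, or target) reaches the threshold; the field names, helper functions, and toy data are hypothetical, and whitespace tokens stand in for whatever tokenization the authors used.

```python
def ngram_set(tokens, n=8):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_fraction(tokens, training_ngrams, n=8):
    """Share of the text's n-grams that were seen in the training data."""
    grams = ngram_set(tokens, n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in training_ngrams) / len(grams)

def is_contaminated(example, training_ngrams, n=8, threshold=0.7):
    """Flag the example if any field has at least `threshold` n-gram overlap."""
    return any(overlap_fraction(tokens, training_ngrams, n) >= threshold
               for tokens in example.values())

def split_dataset(examples, training_ngrams):
    clean, contaminated = [], []
    for ex in examples:
        (contaminated if is_contaminated(ex, training_ngrams) else clean).append(ex)
    return clean, contaminated

# Toy training corpus of 20 tokens and two evaluation examples.
training_tokens = [f"w{i}" for i in range(20)]
training_ngrams = set(ngram_set(training_tokens))

seen_example = {"question": [f"w{i}" for i in range(10)], "target": ["yes"]}
unseen_example = {"question": [f"x{i}" for i in range(10)], "target": ["no"]}
clean, contaminated = split_dataset([seen_example, unseen_example], training_ngrams)
```

Accuracy can then be computed separately on `clean` and on the full set to see whether removing contaminated examples changes the result.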

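Performance was then measured on the clean subset to see how contamination affects the results. A positive accuracy delta on the clean subset is evidence that contamination does not inflate results, and a negative delta is evidence of the opposite. Before measuring, the expectation was that 540B, with its greater memorization capacity, would show larger negative deltas than 8B. No such trend appeared.

The two models showed similar patterns of deltas.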

These previous works simply looked at the occurrences of high-order n-grams between the full evaluation example text and the training data, and considered any example with overlap to be "contaminated". However, many benchmarks are constructed by taking some context from the open web, and then asking annotators to generate a novel question/answer about that context. Because the context is provided at the evaluation time for these tasks, even if the model had previously trained on the context for the language modeling objectives, this does not give it an unfair advantage at evaluation time.

Here, instead of simply looking for high-order n-gram overlaps, we computed statistics for our 29 primary English NLP benchmark tasks and manually examined a large number of examples for each, to determine which had a high proportion of contaminated examples.

We can roughly divide the 29 benchmark tasks into four categories:
- Wholesale contamination - Datasets where a significant portion of the dataset itself appears in the open web.
- Constructed from web - Datasets where the question+answers were automatically extracted from the open web, so many evaluation examples are likely to be in our training.
- Context on web - Question answering datasets where the context is taken from the web, but the questions were not.
- No significant overlap - Datasets which did not have a significant overlap with our training data

We found that 10 out of the 29 sets fall into the first two categories. Of these sets, only a portion of the evaluation examples were actually found in our training.
So, we were able to split each dataset into a "contaminated" and "clean" subset based on whether at least 70% of the 8-grams in the question, prompt, or target were seen at least once in our training data.
We can see that an equal number of sets have a positive vs. negative accuracy delta on the clean subset, which would imply that data contamination does not cause meaningful inflation of our reported results. Note that positive deltas act as evidence against data contamination inflating results, while negative deltas act as potential evidence for results being inflated by data contamination.
If memorization were driving the results, we would expect the 540B model to have larger negative deltas on the clean subset than the 8B model, since the 540B model has much greater memorization capacity.
However, both 8B and 540B have an approximately similar number of negative deltas between the clean and full validation sets.

9. Exploring Explanations

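Section 6 showed that CoT prompting improves performance on multi-step reasoning tasks.

Generating explanations in the CoT format is useful for more than accuracy. 1). It lets us check whether the answer was reached through sound reasoning. 2). The explanation helps a user judge whether to trust the answer. 3). Sometimes the explanation itself is the desired output.

The goal of this section is to showcase PaLM's ability to explain its answers logically with CoT prompting.

Two tasks are used for this. Each task consists of two exemplars, which demonstrate the desired output style, and a set of examples to evaluate.

Although the paper's authors wrote and selected both the exemplars and the examples, they argue that the tasks still give a fair picture of PaLM's ability, for four reasons. 1). Only two exemplars are used per task; they convey only the output style and are unrelated to the content of the examples. 2). All outputs come from greedy decoding, the model's single best prediction. Put differently, the model is solving each problem at full ability. 3). The chance of producing a sound explanation through luck or simple statistical correlation is very low. 4). The authors wrote the examples themselves, so contamination or memorization is very unlikely.

The examples show that, even without a thorough quantitative analysis, the model has a deep understanding of language.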

we empirically demonstrated how chain-of-thought prompting can drastically improve prediction accuracy in multi-step reasoning tasks.
such explanatory generations can be useful for reasons beyond accuracy improvements. First it is of great scientific interest to know whether the model is producing the correct answer for the "right reasons", rather than just making surface-level statistical inferences. Second, the explanation can potentially be shown to an end-user of the system to increase or decrease their confidence in a given prediction. Third, in many situations, the explanation itself is the desired output.

The goal of this section is to showcase the capabilities of PaLM with regards to explanatory language generation with chain-of-thought prompting.
To do this, we present model outputs on two preliminary tasks, which we call "Explaining a Joke" and "Logical Inference". For each task, we wrote 2-shot exemplars demonstrating the desired style of the output, and number of examples to prompt the model with. Although both the exemplars and evaluated examples were written and selected by the authors, we believe that these results still demonstrate groundbreaking proficiency in language understanding by PaLM. This is due to several key factors relating to how this analysis was performed: 1. All predictions are generated with the same 2-shot exemplars, which relate only in style and not content to the examples we are evaluating. Additionally, the full exemplar prompts were written before any examples were evaluated, and were never modified based on the examination of the model output. 2. All outputs are from greedy decoding This is critical, because it means that each output is the model's canonical 1-best prediction. 3. Because the purpose of these tasks is to induce the model to generate thorough natural language explanations, the chances that greedy decoding will produce a fully correct explanation through simple statistical correlations or "lucky guesses" is vanishingly low. 3. Because the prompts were written by the authors, this mitigates the possibility of direct data contamination and memorization.

While we acknowledge that these results do not equate to a thorough quantitative analysis, we will say that we believe this demonstrates a truly remarkable level of deep language understanding.

10. Representational Bias Analysis

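Pre-trained language models carry the biases present in their training data.

This section analyzes PaLM for distributional bias across social groups and for toxicity in generated text. The analysis helps identify the model's potential risks.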

Pre-trained language models have been demonstrated to contain and amplify biases in underlying data.
In this section, we analyze PaLM for distributional biases related to social groups, and for toxicity in open-ended language generation.
This analysis helps outline some of the potential risks of the model.

10.1. Distributional bias in social groups

10.1.1. Gender and occupation bias

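Because English pronouns carry semantic gender, unintended bias between gender and antecedent nouns may affect coreference resolution performance.

The degree of bias between gender and nouns (here, occupations) is evaluated with the Winogender benchmark.

Multiple-choice scoring is the commonly used method: compute the probability of the model producing each option, then compare the highest-probability option with the correct answer.

This method overstates performance, because an example can be scored correct even when the absolute probability of producing the correct answer is low.

Generative scoring is stricter. It uses case-insensitive exact string matching between the answer and the model output, where the output is truncated at punctuation or a newline.

The evaluation shows that (1) accuracy increases with model size, and (2) PaLM 540B achieves SOTA in the 1-shot and few-shot settings. It still falls short of human accuracy, however.

As in prior work, Winogender is additionally split into stereotypical and gotcha (counter-stereotypical) subsets for evaluation.

The performance gap between the stereotypical and gotcha subsets is a strong indicator of how much the model relies on statistical shortcuts.

This additional evaluation shows that (1) accuracy is higher on stereotypical examples than on gotcha examples, and (2) accuracy is lowest on gotcha examples for the female gender.

The performance difference may stem from differences in pronoun frequency by gender in the training set (male: 30%, female: 14%). However, no clear relationship was found between accuracy and the frequency rank of occupations.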

In English, pronouns are marked with semantic gender, and coreference resolution performance may be impacted by unintended bias between gender and antecedent nouns. We evaluate PaLM for one aspect of this bias using the Winogender benchmark.

Multiple choice scoring is commonly used for Winogender and involves scoring each potential answer with the probability that the model will produce that answer.
Importantly, this scoring method can result in examples being scored correct, even if the absolute probability of the model producing the correct answer is quite low.
We find that computing accuracy from this scoring method overstates model performance.
For generative scoring, we use case-insensitive exact string matching. The model output is truncated at punctuation or newlines.
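
The difference between the two scoring methods can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the function names, the punctuation set used for truncation, and the probabilities are all assumptions made up for the example.

```python
import re


def multiple_choice_correct(option_probs, answer):
    """Multiple-choice scoring: pick the highest-probability option
    and compare it with the correct answer."""
    predicted = max(option_probs, key=option_probs.get)
    return predicted == answer


def generative_correct(generation, answer):
    """Generative scoring: truncate the output at the first punctuation
    mark or newline, then do a case-insensitive exact match."""
    # The punctuation set below is an assumption for this sketch.
    truncated = re.split(r"[.,;:!?\n]", generation, maxsplit=1)[0]
    return truncated.strip().lower() == answer.strip().lower()


# Hypothetical probabilities: both options are unlikely in absolute terms,
# yet multiple-choice scoring still counts the example as correct.
option_probs = {"the nurse": 0.04, "the patient": 0.01}
mc = multiple_choice_correct(option_probs, "the nurse")

# Generative scoring only accepts an output that matches the answer exactly.
gen_hit = generative_correct("The nurse.\nThe patient", "the nurse")
gen_miss = generative_correct("she was the nurse", "the nurse")
```

Here `mc` is true even though the model assigns the answer only 4% probability, which is the overstatement described above, while `gen_miss` shows that a loosely correct generation fails the stricter match.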

We find that accuracy improves with model scale, and PaLM 540B set a new state-of-the-art in 1-shot and few-shot settings.
As in prior work, we additionally report disaggregated accuracy, which splits Winogender into stereotypical or gotcha subsets.
The gap between the stereotypical and gotcha subsets is a strong measure of how much a model is relying on statistical shortcuts.
We find that accuracy is higher on stereotypical examples than on gotcha examples, and that accuracy is lowest on gotcha examples for female gender.
Differences in performance may be related to differences in the frequency of English pronouns in the training set, but we see no clear relationship between accuracy and the rank of occupations.
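
As a rough sketch of how disaggregated accuracy and the stereotypical/gotcha gap could be computed, assuming each evaluated example is stored as a `(subset, gender, is_correct)` record. The record layout and the toy results are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def disaggregated_accuracy(records):
    """Return accuracy for every (subset, gender) pair."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for subset, gender, is_correct in records:
        totals[(subset, gender)] += 1
        hits[(subset, gender)] += int(is_correct)
    return {key: hits[key] / totals[key] for key in totals}


def subset_accuracy(records, subset):
    """Return overall accuracy on one subset."""
    rows = [ok for s, _, ok in records if s == subset]
    return sum(rows) / len(rows)


# Toy results, invented for illustration only.
records = [
    ("stereotypical", "male", True),
    ("stereotypical", "female", True),
    ("stereotypical", "male", True),
    ("stereotypical", "female", False),
    ("gotcha", "male", True),
    ("gotcha", "female", False),
    ("gotcha", "male", False),
    ("gotcha", "female", False),
]

per_group = disaggregated_accuracy(records)
# A larger gap suggests heavier reliance on statistical shortcuts.
gap = subset_accuracy(records, "stereotypical") - subset_accuracy(records, "gotcha")
```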

10.1.2. Toxicity and bias

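This section analyzes the words that commonly co-occur in continuations of prompts such as "{term} was very ...", where term refers to a specific group defined by gender, religion, or racial and ethnic identity. For each prompt, 800 outputs are generated with top-k sampling (k=40, temperature=1.0).

To exclude as many descriptive words that do not refer to the group as possible, only adjectives in the first sentence are counted, rather than adjectives in the whole generated text.

The co-occurrence analysis shows the following. (1) Islam is associated with highly charged terms (e.g., terrorist, violent, radical); other decoder-only models show a similar anti-Muslim bias. (2) Given a racial identity term, other racial terms appear frequently; for example, given Asian, terms such as Black and White often appear. (3) For racial identity terms, small changes in the prompt cause large changes in the ranking of co-occurring words. (4) Surprisingly, Indian was a top co-occurring term with White, because many continuations depict colonial dynamics between white people and American Indians.

Note that some identity terms are ambiguous. (1) Indian can mean either American Indians or people from India. (2) Black and White are used more often to refer to things other than race. (3) White is generally not used as a sociodemographic term for white people.

For these reasons, White in particular differs from other racial and ethnic terms and cannot be analyzed precisely.

From the perspective of model size, the co-occurring words of the 62B and 540B models are very similar: on average, 70% of the top-10 words are the same. This suggests that the results depend more on the training data than on model size.

Beyond identifying which words appear frequently, co-occurrence analysis points to the need to understand how those words appear in relation to other words.

The approach taken here is to analyze the toxicity of text generated about specific groups.

If toxicity is taken as a proxy for undesirable descriptions and associations, the model's completions can be seen to falsely affirm stereotypes of Muslims (e.g., as terrorists or violent extremists).

To gauge how much each group is subject to such stereotypes, toxicity is measured with the Perspective API in addition to co-occurrence counts.

As a result, for Islam and Judaism, continuations of the prompt "All {practitioners} are" had a higher overall probability of being toxic.

These bias and toxicity evaluations are not comprehensive, but they provide important insight into the risks of downstream tasks.

Keep in mind, however, that the evaluation method used here, a template-based approach (e.g., "{term} was very ..."), is brittle.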

We analyze commonly co-occurring words in the model continuations when given prompts like "{term} was very ..." where the substituted term references either gender, religion, or racial and ethnic identity. For each prompt, 800 outputs are generated using top-k sampling (k=40) with a temperature of 1.0.
To reduce the noise resulting from counting descriptive words that are not referring to the identity group, we compute the counts of the adjectives present only in the first full sentence rather than in the whole continuation generated using 128 decoding steps.
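
The first-sentence counting step can be sketched as follows. This is only an illustration under simplifying assumptions: the adjective list is a toy lexicon standing in for a real part-of-speech tagger, the sentence boundary is taken to be the first `.`, `!`, or `?`, and the continuations are invented.

```python
import re
from collections import Counter

# Toy adjective lexicon (an assumption); a real analysis would use a POS tagger.
ADJECTIVES = {"kind", "quiet", "tall", "angry", "happy"}


def first_sentence(text):
    """Return the text up to and including the first sentence-ending mark."""
    match = re.search(r"[.!?]", text)
    return text if match is None else text[: match.end()]


def adjective_counts(continuations):
    """Count adjectives that occur in the first sentence of each continuation."""
    counts = Counter()
    for text in continuations:
        words = re.findall(r"[a-z]+", first_sentence(text).lower())
        counts.update(w for w in words if w in ADJECTIVES)
    return counts


# Invented continuations of a prompt such as "{term} was very ...".
continuations = [
    "kind and quiet. Later the angry crowd left.",
    "tall. He was also happy.",
    "kind to everyone",
]
counts = adjective_counts(continuations)
```

Adjectives such as "angry" and "happy" appear only after the first sentence, so they are not counted, which is the noise reduction the passage describes.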

Islam is associated with highly charged terms and stereotypes such as terrorist, violent and radical, similar to other evaluations of decoder-only models revealing anti-Muslim bias.
Racial identity co-occurrence analyses reveal several insights. First, racial identity terms often co-occurred with each other, raising questions about the relationship. Second, small changes in prompt language led to drastic changes in the outcome.
Surprisingly, Indian was a top co-occurring term with White.
Many of the continuations depict colonial dynamics between white people and American Indians.

It is important to note when examining our results that identity terms are not disambiguated.

We also find that the 62B and 540B models lead to very similar co-occurrence counts. On average, 70% of the top-10 words are the same for the 62B and 540B models across race, religion and gender dimensions. Therefore, we posit that the underlying training data has more influence on the results than the size.
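
One way to compute such a top-k overlap is sketched below. The passage does not spell out the exact metric, so defining overlap as the size of the intersection of the two top-k sets divided by k is an assumption, and the word counts are invented (with k=5 to keep the example short).

```python
from collections import Counter


def top_k_overlap(counts_a, counts_b, k=10):
    """Fraction of the top-k words shared by two co-occurrence count tables."""
    top_a = {word for word, _ in Counter(counts_a).most_common(k)}
    top_b = {word for word, _ in Counter(counts_b).most_common(k)}
    return len(top_a & top_b) / k


# Hypothetical co-occurrence counts for a smaller and a larger model.
small_model = {"good": 9, "nice": 8, "old": 7, "young": 6, "busy": 5, "rare": 1}
large_model = {"good": 9, "nice": 7, "old": 6, "tired": 5, "proud": 4, "busy": 1}

overlap = top_k_overlap(small_model, large_model, k=5)
```

In this toy case three of the five top words are shared, giving an overlap of 0.6; averaging the same quantity over all identity terms would yield a figure comparable to the 70% quoted above.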

The co-occurrence analyses point to the importance and value of using a complementary approach to investigate, not just which terms appear, but also how they appear in relation to other terms. One approach, taken here, is to analyze the toxicity of model completions using prompt templates with identity terms.
Taking toxicity as a proxy for undesirable descriptions or associations, we can begin to see the potential for model completions to falsely affirm stereotypes of Muslims as terrorists, extremists, and violent.

So, in addition to computing co-occurrence counts, we use the Perspective API to classify the toxicity of continuations.
Figure 22 reports the distribution of the toxicity probability across model responses, broken down by different religious groups. While we observe a high variance in the toxicity probabilities, we note that Islam and Judaism have a higher overall probability of producing toxic responses in continuation to the prompt "All {practitioners} are".

While the bias and toxicity evaluations we conducted are not comprehensive across all language model applications, they provide important insights into potential downstream risks. We emphasize that even though our analysis highlights the biases, the variance of results across prompt framing underscores that template-based approaches are quite brittle to small changes in prompt language.

10.2. Toxicity in open-ended generation

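The RealToxicityPrompts dataset and the Perspective API are used to study the distribution of toxicity in generated text.

10,000 prompts are randomly sampled from RealToxicityPrompts, and 25 different continuations are generated for each prompt, using top-k sampling with at most 128 tokens per continuation.

Only the toxicity of the first sentence of each continuation is reported, for two reasons. (1) The human baseline consists of a single sentence, and it is nearly impossible to estimate how the discourse would develop after it, so it is fair to measure the model on a single sentence as well. (2) The Perspective API toxicity score tends to increase with the number of tokens, because more tokens give the model more opportunity to generate toxic content. To check the second point, the toxicity of the first sentence was compared with that of the full continuation: given toxic or non-toxic prompts, the probability that at least one of the 25 continuations is toxic was measured.

The measurements show that (1) the model's TPC (toxicity probability of the continuation) rises consistently with the TPP (toxicity probability of the prompt), whereas the human TPC does not. This means the model is more strongly influenced by the style of the prompt than humans are. (2) The model's TPC is lower than both the TPP and the human TPC.

PaLM's toxicity was measured as lower than that reported for other models, but this is because the measurement was restricted to the first sentence. It should not be taken to mean that PaLM is less prone to generating toxic content than other models.

A direct comparison with other models is difficult with this setup, because (1) the prompts were randomly sampled, and (2) the continuation length affects the results (longer continuations give the model more opportunity to produce toxic content, which changes the measured toxicity).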

We leverage the RealToxicityPrompts dataset which consists of sentence-level prompts and continuations. We use the Perspective API to assign a toxicity probability to the continuation. We then study the distribution of toxicity probability in model continuations given various likelihoods that the prompt was toxic.

For a set of 10K randomly sampled prompts, we generate 25 continuations for each prompt, with up to 128 decoding steps per continuation using top-k sampling (k=40) with a temperature of 1.0.

We restrict ourselves to reporting the toxicity metrics of the first complete sentence continuation. The reasons for this are twofold: (1) the human baseline consists of a single sentence continuation, and it is therefore impossible to extrapolate how one's discourse would evolve beyond a single sentence, and (2) the toxicity score assigned by the Perspective API tends to increase with the number of tokens generated, given that the model has more opportunity to generate toxic content and that the results are not normalized to the text length.

We find that the TPC increases with the TPP, while consistently lower than the prompt toxicity and the human baseline.
We observe that the model TPC is more consistent with the TPP than the human TPC. This indicates that the model is strongly influenced by the prompt-style and is likely to generate continuations with a similar level of toxicity as the prompt.
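
The relationship between TPP and TPC can be examined by binning prompts by their toxicity probability and averaging the continuation toxicity in each bin. The sketch below assumes four equal-width bins and uses invented `(TPP, TPC)` pairs; the paper's actual binning and data are not reproduced here.

```python
def mean_tpc_by_tpp_bin(pairs, n_bins=4):
    """Average continuation toxicity (TPC) within equal-width bins of
    prompt toxicity (TPP). Both values are probabilities in [0, 1]."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for tpp, tpc in pairs:
        # A TPP of exactly 1.0 falls into the last bin.
        idx = min(int(tpp * n_bins), n_bins - 1)
        sums[idx] += tpc
        counts[idx] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Hypothetical (TPP, TPC) pairs, invented for illustration.
pairs = [(0.1, 0.05), (0.2, 0.15), (0.3, 0.2), (0.6, 0.4), (0.9, 0.7), (1.0, 0.5)]
bin_means = mean_tpc_by_tpp_bin(pairs)
```

With these toy values the bin means rise from bin to bin while staying below the prompt toxicity of each bin, which mirrors the qualitative pattern described for the model.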

The TPC is generally lower than previously reported in other similar studies; however, this is due to restricting the toxicity measurement to the first full sentence rather than indicative of a model with a lower propensity to generate toxic content. A side-by-side comparison with previous work is difficult given that (1) the random 10K sampled prompts are different, and (2) the continuation length affects the reported results.
We report the probability of generating at least one toxic comment given both toxic and non-toxic prompts for first sentence and 128 decoding steps.
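
The "at least one toxic continuation" metric can be sketched as below. The 0.5 threshold for calling a continuation toxic is an assumption made for this example, and the scores are invented (three continuations per prompt instead of 25, to keep it short).

```python
def prob_at_least_one_toxic(scores_per_prompt, threshold=0.5):
    """Fraction of prompts for which at least one continuation has a
    toxicity score at or above the threshold."""
    flagged = [
        any(score >= threshold for score in scores)
        for scores in scores_per_prompt
    ]
    return sum(flagged) / len(flagged)


# Hypothetical toxicity scores for four prompts.
# First-sentence scores versus scores for the full continuation.
first_sentence_scores = [
    [0.1, 0.2, 0.05],
    [0.6, 0.1, 0.3],
    [0.2, 0.4, 0.1],
    [0.3, 0.1, 0.2],
]
full_continuation_scores = [
    [0.2, 0.3, 0.1],
    [0.7, 0.2, 0.4],
    [0.55, 0.4, 0.2],
    [0.3, 0.2, 0.2],
]

p_first = prob_at_least_one_toxic(first_sentence_scores)
p_full = prob_at_least_one_toxic(full_continuation_scores)
```

In this toy data the full continuations are flagged more often than the first sentences, which is the length effect that makes comparisons across studies difficult.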

10.3. Limitations

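A major limitation of the fairness analyses in this section is that they are carried out only on English.

Fairness evaluations developed in the Western world may not transfer readily to other geo-cultural contexts, because societal disparities may manifest along entirely different axes. Potential biases may therefore exist beyond what can currently be measured.

In addition, despite the growing body of work on bias in English, three caveats remain: (1) fairness benchmarks lack standardization, (2) there is little understanding of which harms the different bias measures relate to, and (3) identities are not covered in fluid, comprehensive ways.

The fairness evaluations in this section are limited in the same ways, and there may be potential risks beyond what can be measured.

Winogender and co-occurrence analysis were used to evaluate the degree of unintended bias.

However, such benchmarks may only be proxies for the types of bias that can arise in tasks such as translation, code generation, reasoning, and QA.

Additionally, the downstream task, the training pipeline, and application-level protections all affect the bias of the resulting system.

Even when the pre-trained model is evaluated for fairness and toxicity along certain axes, the impact of its biases varies with how the model is used downstream.

It is also unclear whether evaluations of the pre-trained model carry over to evaluations of the model after it is fine-tuned on a downstream task.

Appropriate measures should therefore be taken to assess fairness gaps in the downstream application before deployment.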

It is important that bias benchmarks be developed and utilized for other languages and socio-cultural contexts.

Fairness evaluations and benchmarks developed in and for the Western world may not be readily portable to the other geo-cultural contexts where societal disparities may manifest along an entirely different set of axes. We thus note that potential biases exist beyond what we are currently capable of measuring.

Further, it is important to note that despite a growing body of work investigating biases in English language technologies, there is a lack of standardization of fairness benchmarks, an understanding of what harms different bias measures in NLP relate to, and coverage of identities in fluid, comprehensive ways.
As such, our fairness evaluations in this section are also limited by the same concerns and there are potential risks beyond what can be measured.
We expand upon previous efforts to evaluate unintended biases and our evaluations are limited to popular tasks such as pronoun resolution and co-occurrence analysis.

Additionally, bias can pervade a system depending on the specific downstream application, its specific training pipeline, and application-level protections.
While we evaluate the pre-trained model here for fairness and toxicity along certain axes, it is possible that these biases can have varied downstream impact depending on how the model is used. It is also unclear if evaluations done on the pre-trained language model affect the downstream task evaluations after the model is finetuned. Therefore, we recommend appropriate measures be taken to assess the fairness gaps in the application context before deployment.

13. Open Questions in Scaling

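Four main factors improve the few-shot performance of LLMs: (1) model size, (2) number of training tokens, (3) data quality, and (4) techniques that increase model capacity without increasing compute cost (MoE, sparse Transformers).

This work did not closely examine the trade-off between model size and number of training tokens. It therefore leaves open whether PaLM 540B is trained on an appropriate number of tokens for its size and, more broadly, which combination of model size and token count would give the best performance for the compute used to train PaLM 540B.

Chinchilla moderately outperforms PaLM's interpolated scaling curve on BIG-bench and slightly underperforms it on the 9 English NLP tasks, whereas Gopher underperforms both curves. This suggests that Gopher was under-trained for its size. (Chinchilla and Gopher used the same dataset.) These results cannot tell us whether PaLM 540B is under-trained, however, because Chinchilla and PaLM used different datasets.

Moreover, determining experimentally whether PaLM is compute-optimal is very difficult, for the following reasons.

(1) A precise answer requires training many models, as in the Chinchilla study, which is very expensive. (2) Training a smaller model raises practical problems. With fewer TPU chips, wall-clock training time increases, since the total FLOP count is the same. With the same number of TPU chips, the batch size has to grow to keep the chips efficiently utilized, which may reduce sample efficiency.

(3) With the PaLM dataset, data begins to repeat once training goes beyond 780B tokens. How much training on repeated data affects performance is still unclear. In additional experiments with PaLM 62B, repeated data gave little benefit, whereas training on new data improved performance.

Future work should investigate the trade-offs between these factors to find LMs that generalize better across a wide range of tasks.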

In our introductory section, we describe the four main axes which have led to significant quality improvements of large LMs for few-shot learning. These can be summarized as: (1) model depth and width, (2) number of tokens trained, (3) training corpus quality, (4) increased model capacity without increased compute (i.e., sparse models).

We did not perform ablation studies to tease out the effects of (1) vs (2).
In other words, a critical open question is: "How would a 62B parameter model trained for 7T tokens compare to our 540B parameter model trained for 780B tokens?"
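
The two configurations in this question have roughly the same training cost. The sketch below checks this with the common rule of thumb that training compute is about 6 × parameters × tokens; that approximation is an assumption brought in for illustration and is not stated in the passage above.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute with the 6 * N * D rule of thumb
    (an assumption; it ignores attention and other overheads)."""
    return 6 * n_params * n_tokens


palm_540b = training_flops(540e9, 780e9)  # 540B parameters, 780B tokens
alt_62b = training_flops(62e9, 7e12)      # 62B parameters, 7T tokens

# Ratio of the smaller model's compute to PaLM 540B's compute.
ratio = alt_62b / palm_540b
```

Under this approximation both runs cost about 2.5e24 FLOPs, with the 62B configuration only about 3% more expensive, so the question compares two ways of spending essentially the same compute budget.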

Chinchilla moderately outperforms PaLM's interpolated scaling curve on BIG-bench, and slightly underperforms the scaling curve on the 9 English NLP tasks. Gopher significantly underperforms both scaling curves.
These results certainly support the conclusion drawn in Hoffmann et al. that Gopher was under-trained for a model of its size; however, we cannot use these results to extrapolate an answer to our critical scaling question ("How would a model of size X trained on Y tokens compare to PaLM 540B?"). There are several reasons why this is a difficult question to answer: (1) In order to draw strong conclusions, such experiments would have to be performed at scale, which has a large computational cost. (2) If the smaller model were trained using fewer TPU chips than the larger model, this would proportionally increase the wall-clock time of training, since the total training FLOP count is the same. If it were trained using the same number of TPU chips, it would be very difficult to maintain TPU compute efficiency without a drastic increase in batch size. The batch size of PaLM 540B is already 4M tokens, and it is unclear if even larger batch sizes would maintain sample efficiency. (3) For the corpus mixing proportions chosen for PaLM, data begins to repeat in some of our subcorpora after 780B tokens, which is why we chose that as the endpoint of training. It is unclear how the "value" of repeated data compares to unseen data for large-scale language model training. In ablations in Appendix F, we did not see benefit from repeated data sources after 780B tokens on PaLM 62B, but showcase performance improvements from training longer on a refreshed dataset.

In future work, we plan to investigate the trade-off between different factors that lead to more capable LMs which generalize well across a number of tasks.

14. Conclusion

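This work builds PaLM to push the boundaries of few-shot language understanding and generation. PaLM has 540B parameters and was trained on 780B tokens of high-quality, diverse text. It achieves very high scores on several benchmarks (English NLP tasks, BIG-bench) and also shows breakthrough performance on multi-step logical reasoning tasks. The reasoning results come not from model scale alone but from the combination of scale and chain-of-thought prompting.

Several conclusions follow. (1) Having scaled up to 540B, the improvement in few-shot language understanding from model scale has not yet plateaued. (2) Discontinuous improvements were observed on a number of benchmarks, suggesting that certain capabilities emerge only in sufficiently large language models. (3) Chain-of-thought prompting not only improves performance but also lets users judge whether the model reached its answer through sound reasoning and whether the answer can be trusted. In addition, chain-of-thought prompting lets the model's text generation ability improve performance on classification tasks.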

In this work, we continue to push the boundaries of few-shot language understanding and generation by training PaLM, a 540B parameter dense Transformer language model trained on 780B tokens of high-quality, diverse text.

Another critical takeaway from this work is the breakthrough performance on reasoning tasks, which require multi-step logical inference.
The results on reasoning tasks are not achieved through model scale alone, but by a combination of scale and chain-of-thought prompting.

From these results, we can draw a number of conclusions. First, the results presented here suggest that the improvements from scale for few-shot language understanding have not yet plateaued.
On a number of benchmarks, improvements are actually discontinuous, meaning that the improvements from 8B to 62B are very modest, but then jump immensely when scaling to 540B.
Certain capabilities of language models only emerge when trained at sufficient scale.

Second, the breakthrough performance on reasoning tasks has critical implications.
A model being able to generate natural language to explain its predictions is beneficial to the end user of a system, in order to better understand why a model made a certain prediction.
Prompting the model to generate explicit inference chains can drastically increase the quality of the predictions themselves.
In other words, the model's generation capabilities can be immensely beneficial even for tasks that are modeled as categorical prediction or regression, which typically do not require significant language generation.
