All Posts (104)
[Paper Review] Auto-Encoding Variational Bayes
1. Introduction How can we perform efficient approximate inference and learning with directed probabilistic models whose continuous latent variables and/or parameters have intractable posterior distributions? The goal of the paper is to estimate the intractable $p_{Z|X}(z|x)$ (where $z$ is continuous). Author's own take: "In the AEVB algorithm we make inference and learning especially efficient by using the SGVB estimator to o..
2022.07.20
[Paper Review] ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
1. Extended Introduction DP (Data Parallelism) runs out of memory for models with more than 1.4B parameters on the current generation of GPUs with 32GB memory. MP (Model Parallelism) requires model refactoring and has significant communication overhead. To overcome these limitations, we first analyze the full spectrum of memory consumption of the existing systems on model training and classify it into..
2022.06.27
[PyTorch] Writing Distributed Applications with PyTorch
Introduction The distributed package included in PyTorch (i.e., torch.distributed) enables researchers and practitioners to easily parallelize their computations across processes and clusters of machines. To do so, it leverages message-passing semantics, allowing each process to communicate data to any of the other processes. Setup In order to get started, we need the ability to run multiple proces..
2022.06.10
[Paper Review] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Abstract In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. We sustain $15.1$ PetaFLOPs across the entire application with $76\%$ scaling efficiency when compared to a strong single GPU baseline that sustains $39$ TeraFLOPs, wh..
2022.06.07
[Paper Review] LaMDA: Language Models for Dialog Applications
Abstract We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of 1) safety and 2) factual grounding. We also explore 3) the use of LaMDA in the domains of education and content recommendation to investigate its potential and shortcomings. 1. Introduction According t..
2022.05.24
3. Making new Layers and Models via subclassing
import tensorflow as tf import numpy as np from tensorflow import keras The Layer class: the combination of state (weights) and some computation. A layer encapsulates both a state (the layer's "weights") and a transformation from inputs to outputs (a "call", the layer's forward pass). class CustomLinear1(keras.layers.Layer): def __init__(self, d_in, d_out): super().__init__() w_init = tf.random_n..
2022.03.12