3. Introduction to gradients and automatic differentiation

2022. 3. 12. 13:48ㆍTool/TensorFlow

In this guide, you will explore ways to compute gradients with TensorFlow, especially in eager execution.

import numpy as np

import matplotlib.pyplot as plt
import tensorflow as tf

Computing gradients

To differentiate automatically, TensorFlow needs to remember what operations happen in what order during the forward pass. Then, during the backward pass, TensorFlow traverses this list of operations in reverse order to compute gradients.
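The record-then-replay idea can be illustrated with a toy tape in plain Python. This is only a sketch of the general technique, not TensorFlow's implementation: the `Node` and `ToyTape` classes and their methods are hypothetical names invented for this example.

```python
import math


class Node:
    """A scalar value that the toy tape can track (hypothetical helper)."""

    def __init__(self, value):
        self.value = value


class ToyTape:
    """Records each operation with its local derivatives, in execution order."""

    def __init__(self):
        self.records = []  # list of (output, [(input, d_output/d_input), ...])

    def mul(self, a, b):
        out = Node(a.value * b.value)
        # d(a*b)/da = b and d(a*b)/db = a
        self.records.append((out, [(a, b.value), (b, a.value)]))
        return out

    def sin(self, a):
        out = Node(math.sin(a.value))
        # d(sin(a))/da = cos(a)
        self.records.append((out, [(a, math.cos(a.value))]))
        return out

    def gradient(self, target, source):
        grads = {id(target): 1.0}
        # Backward pass: walk the recorded operations in reverse order
        # and accumulate upstream * local derivative (the chain rule).
        for out, inputs in reversed(self.records):
            upstream = grads.get(id(out), 0.0)
            for node, local in inputs:
                grads[id(node)] = grads.get(id(node), 0.0) + upstream * local
        # Returns None when the source never took part in the computation.
        return grads.get(id(source))


tape = ToyTape()
x = Node(2.0)
y = tape.mul(x, x)  # forward pass: y = x * x
z = tape.sin(y)     # forward pass: z = sin(y)

dz_dx = tape.gradient(z, x)  # analytically: cos(x**2) * 2 * x
print('dz/dx: ', dz_dx)
```

Here the forward pass appends two records, and the backward pass visits them last to first, which is the same order of work the paragraph above describes.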

Gradient tapes

TensorFlow provides the tf.GradientTape API for automatic differentiation; that is, computing the gradient of a computation with respect to some inputs, usually tf.Variables. TensorFlow "records" relevant operations executed inside the context of a tf.GradientTape onto a "tape". TensorFlow then uses that tape to compute the gradients of a "recorded" computation.

w = tf.Variable(tf.random.normal((3, 2)), name='w')
b = tf.Variable(tf.zeros(2, dtype=tf.float32), name='b')
x = tf.constant([[1.0, 2.0, 3.0]])

with tf.GradientTape(persistent=True) as tape:
    y = x @ w + b
    loss = tf.reduce_mean(y**2)

Once you've recorded some operations, use GradientTape.gradient(target, sources) to calculate the gradient of some target (often a loss) relative to some source (often the model's variables). The tape is flexible about how sources are passed and will accept any nested combination of lists or dictionaries and return the gradient structured the same way:

# 1. list
[dl_dw, dl_db] = tape.gradient(loss, [w, b])
print("dl/dw: ", dl_dw)
print("dl/db: ", dl_db)
print()
# 2. dictionary
my_vars = {
    'w': w,
    'b': b
}

grad = tape.gradient(loss, my_vars)
print("dl/dw: ", grad['w'])
print("dl/db: ", grad['b'])

dl/dw:  tf.Tensor(
[[2.5721242 1.7501   ]
 [5.1442485 3.5002   ]
 [7.7163725 5.2503   ]], shape=(3, 2), dtype=float32)
dl/db:  tf.Tensor([2.5721242 1.7501   ], shape=(2,), dtype=float32)

dl/dw:  tf.Tensor(
[[2.5721242 1.7501   ]
 [5.1442485 3.5002   ]
 [7.7163725 5.2503   ]], shape=(3, 2), dtype=float32)
dl/db:  tf.Tensor([2.5721242 1.7501   ], shape=(2,), dtype=float32)
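As a sanity check, the tape's result can be compared with a gradient derived by hand. For this example, loss = mean(y**2) over the two entries of y, so dl/dy = y, dl/dw = transpose(x) @ y, and dl/db = y summed over the batch axis. The sketch below assumes the same setup as above; the `manual_*` names are introduced only for this illustration.

```python
import tensorflow as tf

w = tf.Variable(tf.random.normal((3, 2)), name='w')
b = tf.Variable(tf.zeros(2, dtype=tf.float32), name='b')
x = tf.constant([[1.0, 2.0, 3.0]])

with tf.GradientTape() as tape:
    y = x @ w + b
    loss = tf.reduce_mean(y**2)

dl_dw, dl_db = tape.gradient(loss, [w, b])

# Hand-derived gradients (hypothetical helper names).
manual_dy = y                                  # 2 * y / 2, since y has two entries
manual_dw = tf.transpose(x) @ manual_dy        # shape (3, 2), same as w
manual_db = tf.reduce_sum(manual_dy, axis=0)   # shape (2,), same as b

# Each gradient has the shape of its source.
print(dl_dw.shape, w.shape)
print('max |dl/dw - manual|: ', tf.reduce_max(tf.abs(dl_dw - manual_dw)).numpy())
print('max |dl/db - manual|: ', tf.reduce_max(tf.abs(dl_db - manual_db)).numpy())
```

This also explains the pattern in the printed output above: each row of dl/dw is dl/db scaled by the matching entry of x (1, 2, and 3).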

Gradients with respect to a model

It's common to collect tf.Variables into a tf.Module or one of its subclasses (layers.Layer, keras.Model) for checkpointing and exporting.

In most cases, you will want to calculate gradients with respect to a model's trainable variables. Since all subclasses of tf.Module aggregate their variables in the Module.trainable_variables property, you can calculate these gradients in a few lines of code:

layer = tf.keras.layers.Dense(2, activation='relu')
x = tf.constant([[1.0, 2.0, 3.0]])

with tf.GradientTape() as tape:
    y = layer(x)
    loss = tf.reduce_mean(y**2)

grad = tape.gradient(loss, layer.trainable_variables)

for var, g in zip(layer.trainable_variables, grad):
    print(f"{var.name}, shape: {g.shape}")

dense/kernel:0, shape: (3, 2)
dense/bias:0, shape: (2,)
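The same pattern should work for a hand-written tf.Module subclass. The `Affine` class below is a hypothetical module made up for this sketch; it also holds a non-trainable counter to show that only trainable variables are collected by trainable_variables.

```python
import tensorflow as tf


class Affine(tf.Module):
    """Hypothetical module computing x @ w + b, with a non-trainable call counter."""

    def __init__(self, in_features, out_features, name=None):
        super().__init__(name=name)
        self.w = tf.Variable(tf.random.normal((in_features, out_features)), name='w')
        self.b = tf.Variable(tf.zeros(out_features), name='b')
        # Not trainable, so it is excluded from trainable_variables.
        self.calls = tf.Variable(0, trainable=False, name='calls')

    def __call__(self, x):
        self.calls.assign_add(1)
        return x @ self.w + self.b


model = Affine(3, 2)
x = tf.constant([[1.0, 2.0, 3.0]])

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(model(x)**2)

grads = tape.gradient(loss, model.trainable_variables)

for var, g in zip(model.trainable_variables, grads):
    print(f"{var.name}, shape: {g.shape}")
```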

Controlling what the tape watches

The default behavior is to record all operations after accessing a trainable tf.Variable. The reasons for this are:

The tape needs to know which operations to record in the forward pass to calculate the gradients in the backward pass.
The tape holds references to intermediate outputs, so you don't want to record unnecessary operations.
The most common use case involves calculating the gradient of a loss with respect to all of a model's trainable variables.

tf.GradientTape provides hooks that give the user control over what is or is not watched. To record gradients with respect to a tf.Tensor, you need to call GradientTape.watch(x).

Conversely, to disable the default behavior of watching all tf.Variables, set watch_accessed_variables=False when creating the gradient tape. This calculation uses two variables, but only connects the gradient for one of the variables.

x0 = tf.Variable(0.0)
x1 = tf.Variable(10.0)

with tf.GradientTape(watch_accessed_variables=False) as tape:
    tape.watch(x1)
    y0 = tf.math.sin(x0)
    y1 = tf.nn.softplus(x1)
    y = y0 + y1
    ys = tf.reduce_sum(y)

grad = tape.gradient(ys, {'x0': x0, 'x1': x1})

print('dy/dx0: ', grad['x0'])
print('dy/dx1: ', grad['x1'].numpy())

dy/dx0:  None
dy/dx1:  0.9999546
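The opposite case, a tf.Tensor that is not watched by default, can be sketched in the same way. In this minimal example the names `unwatched` and `watched` are chosen only for illustration.

```python
import tensorflow as tf

x = tf.constant(3.0)

# A plain tensor is not watched automatically, so nothing is recorded.
with tf.GradientTape() as tape:
    y = x * x
unwatched = tape.gradient(y, x)

# After an explicit tape.watch(x), operations on x are recorded.
with tf.GradientTape() as tape:
    tape.watch(x)
    y = x * x
watched = tape.gradient(y, x)

print('without watch: ', unwatched)
print('with watch: ', watched.numpy())  # dy/dx = 2 * x
```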

Intermediate results

You can also request gradients of the output with respect to intermediate values computed inside the tf.GradientTape context.

By default, the resources held by a GradientTape are released as soon as the GradientTape.gradient method is called. To compute multiple gradients over the same computation, create the tape with persistent=True. This allows multiple calls to the gradient method, as resources are released when the tape object is garbage collected:

x = tf.constant([1, 3.0])
with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    y = x * x
    z = y * y

print('dz/dy: ', tape.gradient(z, y).numpy())
print('dy/dx: ', tape.gradient(y, x).numpy())

dz/dy:  [ 2. 18.]
dy/dx:  [2. 6.]

del tape
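For contrast, a tape created without persistent=True can serve only one gradient call. The sketch below catches the RuntimeError that TensorFlow raises on the second call; the flag `second_call_failed` is a name introduced for this example.

```python
import tensorflow as tf

x = tf.constant(3.0)

with tf.GradientTape() as tape:  # non-persistent (the default)
    tape.watch(x)
    y = x * x
    z = y * y

first = tape.gradient(z, x)  # 4 * x**3; the tape's resources are released here

try:
    tape.gradient(y, x)  # second call on the same non-persistent tape
    second_call_failed = False
except RuntimeError as e:
    second_call_failed = True
    print(f'{type(e).__name__}: {e}')

print('dz/dx: ', first.numpy())
```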

Notes on performance

There is a tiny overhead associated with doing operations inside a gradient tape context, so you should use the tape context only around the areas where it is required.
Gradient tapes use memory to store intermediate results, including inputs and outputs, for use during the backward pass.

For efficiency, some ops (like ReLU) don't need to keep their intermediate results and they are pruned during the forward pass. However, if you use persistent=True on your tape, nothing is discarded and your peak memory usage will be higher.

Gradients of non-scalar targets

If the target(s) are not scalar, the gradient of the sum is calculated. This makes it simple to take the gradient of the sum of a collection of losses, or the gradient of the sum of an element-wise loss calculation.

x = tf.Variable(2.0)
with tf.GradientTape(persistent=True) as tape:
    y0 = x**2
    y1 = tf.exp(1 / x)
    y2 = y0 + y1

print('dy0/dx: ', tape.gradient(y0, x).numpy())
print('dy1/dx: ', tape.gradient(y1, x).numpy())
print('d(y1 + y0)/dx: ', tape.gradient({'y0': y0, 'y1': y1}, x).numpy())
print('dy2/dx: ', tape.gradient(y2, x).numpy())

dy0/dx:  4.0
dy1/dx:  -0.4121803
d(y1 + y0)/dx:  3.5878196
dy2/dx:  3.5878196
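The same rule applies to a single non-scalar target. In the sketch below (variable names chosen for illustration), taking the gradient of a vector y gives the same result as taking the gradient of tf.reduce_sum(y). Because each y[i] depends only on x[i], the result is also the element-wise derivative.

```python
import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])

with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    y = x**2                   # non-scalar target, shape (3,)
    total = tf.reduce_sum(y)   # explicit scalar sum of the same target

grad_of_vector = tape.gradient(y, x)
grad_of_sum = tape.gradient(total, x)
del tape

print('gradient of y: ', grad_of_vector.numpy())     # 2 * x
print('gradient of sum(y): ', grad_of_sum.numpy())   # also 2 * x
```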

Control flow

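Because a gradient tape records operations as they are executed, Python control flow such as if and while statements is handled naturally. In the example below, each branch of the if uses a different variable, and the gradient only connects to the variable that was used: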

x = tf.constant(1.0)

v0 = tf.Variable(2.0)
v1 = tf.Variable(2.0)

with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    if x > 0.0:
        result = v0
    else:
        result = v1**2

dv0, dv1 = tape.gradient(result, [v0, v1])

print('dresult/dv0: ', dv0)
print('dresult/dv1: ', dv1)

dresult/dv0:  tf.Tensor(1.0, shape=(), dtype=float32)
dresult/dv1:  None
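Loops behave the same way: the tape simply records each operation the loop executes. A minimal sketch, assuming a plain Python for loop in eager execution:

```python
import tensorflow as tf

x = tf.Variable(2.0)

with tf.GradientTape() as tape:
    y = x
    for _ in range(3):
        y = y * x  # three multiplications are recorded, so y = x**4

dy_dx = tape.gradient(y, x)  # 4 * x**3

print('y: ', y.numpy())
print('dy/dx: ', dy_dx.numpy())
```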

Getting a gradient of `None`

When a target is not connected to a source, you will get a gradient of None.

x = tf.Variable(2.)
y = tf.Variable(3.)

with tf.GradientTape() as tape:
    z = y * y

print('dz/dx: ', tape.gradient(z, x))

dz/dx:  None

1. Replaced a variable with a tensor

The tape will automatically watch a tf.Variable but not a tf.Tensor. One common error is to inadvertently replace a tf.Variable with a tf.Tensor, instead of using Variable.assign to update the tf.Variable:

x = tf.Variable(2.)

for epoch in range(2):
    with tf.GradientTape() as tape:
        y = x + 1

    print(type(x).__name__, ":", tape.gradient(y, x))
    x = x + 1  # Bug: rebinds x to a tf.Tensor; use x.assign_add(1) instead

ResourceVariable : tf.Tensor(1.0, shape=(), dtype=float32)
EagerTensor : None

2. Did calculations outside of TensorFlow

The tape can't record the gradient path if the calculation exits TensorFlow:

x = tf.Variable([[1., 2.],
                 [3., 4.]], dtype=tf.float32)

with tf.GradientTape() as tape:
    x2 = x**2

    y = np.mean(x2, axis=0)

    y = tf.reduce_mean(y, axis=0)

print('dy/dx: ', tape.gradient(y, x))

dy/dx:  None
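Keeping the whole calculation in TensorFlow restores the gradient path. The sketch below replaces np.mean with tf.reduce_mean and leaves everything else unchanged. Since y is then the mean of all four squared entries, the expected gradient is 2 * x / 4 = x / 2.

```python
import tensorflow as tf

x = tf.Variable([[1., 2.],
                 [3., 4.]], dtype=tf.float32)

with tf.GradientTape() as tape:
    x2 = x**2
    # tf.reduce_mean instead of np.mean, so the tape can record the op.
    y = tf.reduce_mean(x2, axis=0)
    y = tf.reduce_mean(y, axis=0)

dy_dx = tape.gradient(y, x)
print('dy/dx: ', dy_dx)
```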

3. Took gradients through an integer or string

Integers and strings are not differentiable. If a calculation path uses these data types there will be no gradient:

x = tf.constant(10)

with tf.GradientTape() as g:
    g.watch(x)
    y = x * x

print('dy/dx: ', g.gradient(y, x))

WARNING:tensorflow:The dtype of the watched tensor must be floating (e.g. tf.float32), got tf.int32
WARNING:tensorflow:The dtype of the target tensor must be floating (e.g. tf.float32) when calling GradientTape.gradient, got tf.int32
WARNING:tensorflow:The dtype of the source tensor must be floating (e.g. tf.float32) when calling GradientTape.gradient, got tf.int32
dy/dx:  None
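One way to avoid this, sketched below, is to convert the value to a floating dtype before the tape watches it. The name `x_int` is introduced here for illustration. Note that the gradient is then taken with respect to the float tensor, not the original integer.

```python
import tensorflow as tf

x_int = tf.constant(10)
x = tf.cast(x_int, tf.float32)  # float copy that the tape can differentiate

with tf.GradientTape() as g:
    g.watch(x)
    y = x * x

dy_dx = g.gradient(y, x)  # dy/dx = 2 * x
print('dy/dx: ', dy_dx)
```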

4. Took gradients through a stateful object

State stops gradients. When you read from a stateful object (such as a tf.Variable), the tape can only observe the current state, not the history that led to it:

x0 = tf.Variable(3.)
x1 = tf.Variable(0.)

with tf.GradientTape() as tape:
    x1.assign_add(x0)
    y = x1**2

print('y: ', y)
print('dy/dx0: ', tape.gradient(y, x0))

with tf.GradientTape() as tape:
    x2 = x1 + x0
    y = x2**2

print('dy/dx0: ', tape.gradient(y, x0))

y:  tf.Tensor(9.0, shape=(), dtype=float32)
dy/dx0:  None
dy/dx0:  tf.Tensor(12.0, shape=(), dtype=float32)

No gradient registered

Some tf.Operations are registered as being non-differentiable and will return None. Others have no gradient registered.

If you attempt to take a gradient through a float op that has no gradient registered, the tape will throw an error instead of silently returning None.

x0 = tf.Variable([[[0.5, 0.0, 0.0]]])
x1 = tf.Variable(0.1)

with tf.GradientTape() as tape:
    y = tf.image.adjust_contrast(x0, x1)

try:
    print(tape.gradient(y, [x0, x1]))
    assert False
except LookupError as e:
    print(f'{type(e).__name__}: {e}')

LookupError: gradient registry has no entry for: AdjustContrastv2

Zero instead of None

In some cases it would be convenient to get 0 instead of None for unconnected gradients. You can decide what to return when you have unconnected gradients using the unconnected_gradients argument:

x = tf.Variable([2., 2.])
y = tf.Variable(3.)

with tf.GradientTape() as tape:
    z = y**2

print('dz/dx: ', tape.gradient(z, x, unconnected_gradients=tf.UnconnectedGradients.ZERO))

dz/dx:  tf.Tensor([0. 0.], shape=(2,), dtype=float32)
