[Paper Review] LaMDA: Language Models for Dialog Applications

2022. 5. 24. 10:04 · Paper Review

Abstract

We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of 1). safety and 2). factual grounding. We also explore the use of LaMDA in 3). the domains of education and content recommendations to investigate its potential and shortcomings.

1. Introduction

Prior work shows that dialog models benefit from the Transformer architecture and are well suited to model scaling. We therefore train LaMDA, a family of Transformer-based neural language models designed for dialog, with sizes ranging from $2\mathbf{B}$ to $137\mathbf{B}$ parameters. LaMDA uses a single model to perform multiple tasks: it generates candidate responses, which are then filtered for safety, grounded on an external knowledge source, and re-ranked to find the highest-quality response.
We observe that: 1). model scaling alone improves quality, but its improvements on safety and groundedness lag far behind human performance, and 2). combining scaling and fine-tuning improves LaMDA significantly on all metrics. Finally, 3). we find that pre-training-only models can already maintain role consistency well, but fine-tuned models are more helpful.

3. LaMDA pre-training

LaMDA is a decoder-only Transformer language model, pre-trained to predict the next token in a text corpus. Unlike previous dialog models trained on dialog data alone, LaMDA is pre-trained on a dataset created from public dialog data and other public web documents.
The pre-training dataset consists of $2.97\mathbf{B}$ documents, $1.12\mathbf{B}$ dialogs, and $13.39\mathbf{B}$ dialog utterances, for a total of $1.56\mathbf{T}$ words. We used the SentencePiece library to tokenize the dataset into $2.81\mathbf{T}$ byte pair encoding (BPE) tokens, with a vocabulary of $32\mathbf{K}$ tokens.
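As a concrete illustration of the tokenization step, here is a minimal sketch of building a 32K BPE vocabulary with the SentencePiece library mentioned above; the corpus path and training options are assumptions, not the paper's actual configuration.

```python
# Minimal sketch: training a 32K BPE vocabulary with SentencePiece (illustrative settings).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="pretraining_corpus.txt",   # hypothetical path to the pre-training text
    model_prefix="lamda_bpe",         # hypothetical output name
    vocab_size=32000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="lamda_bpe.model")
print(sp.encode("LaMDA is a dialog model.", out_type=str))  # list of BPE pieces
```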
We call the model before any fine-tuning "PT", for Pretrained.

4. Metrics

4-1. Foundation metrics: Quality, Safety and Groundedness

Sensibleness, Specificity, Interestingness (SSI):

  1. Sensibleness: whether a response makes sense in context and does not contradict anything that was said earlier.
  2. Specificity: whether a response is specific to the given context.
  3. Interestingness: whether a response "catches someone's attention" or "arouses their curiosity", or is unexpected, witty, or insightful.

Safety: whether a response violates any of the safety objectives, derived from Google's AI Principles.
Groundedness: the percentage of responses containing claims about the external world that can be supported by authoritative external sources, as a share of all responses containing claims about the external world.
Informativeness: the percentage of responses that carry information about the external world that can be supported by known sources, as a share of all responses.
Citation accuracy: the percentage of model responses that cite the URLs of their sources, as a share of all responses with explicit claims about the external world, excluding claims about well-known facts.
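For clarity, here is a minimal sketch of how these three percentages could be computed from crowdworker labels; the field names are hypothetical and not from the paper.

```python
# Hedged sketch: computing groundedness, informativeness, and citation accuracy
# from per-response crowdworker labels (hypothetical field names).
from dataclasses import dataclass
from typing import List

@dataclass
class LabeledResponse:
    makes_external_claim: bool   # does the response claim something about the external world?
    claim_is_supported: bool     # is that claim supported by an authoritative source?
    is_well_known_fact: bool     # well-known facts are excluded from citation accuracy
    cites_source_url: bool       # does the response cite the URL of its source?

def groundedness(responses: List[LabeledResponse]) -> float:
    claims = [r for r in responses if r.makes_external_claim]
    return sum(r.claim_is_supported for r in claims) / len(claims)

def informativeness(responses: List[LabeledResponse]) -> float:
    supported = [r for r in responses if r.makes_external_claim and r.claim_is_supported]
    return len(supported) / len(responses)

def citation_accuracy(responses: List[LabeledResponse]) -> float:
    citable = [r for r in responses
               if r.makes_external_claim and not r.is_well_known_fact]
    return sum(r.cites_source_url for r in citable) / len(citable)
```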

4-2. Role-specific metrics: Helpfulness and Role consistency

Helpfulness: whether a response contains correct information and is helpful to the user.
Role consistency: whether a response is consistent with the agent's role.

5. LaMDA fine-tuning and evaluation data

Quality (Sensibleness, Specificity, Interestingness)

  1. Crowdworkers interact with a LaMDA instance about any topic.
  2. For each response, other crowdworkers rate whether the response is sensible, specific, and interesting (SSI).

Safety

  1. Crowdworkers interact with a LaMDA instance in three different ways: naturally, sensitively, and adversarially.
  2. For each response, other crowdworkers rate whether the response violates any of the safety objectives.

Groundedness

  1. Crowdworkers interact with the model, steering the conversation towards information-seeking interactions.
  2. Crowdworkers check whether the information in each turn makes any claims about the external world.
  3. For each response, crowdworkers record the search queries they would use to investigate those claims.
  4. Crowdworkers edit the model's response to incorporate brief search results from an external knowledge-retrieval system.

Estimating these metrics for human-generated responses:
We ask crowdworkers to respond to randomly selected samples of the evaluation datasets. The crowdworkers are explicitly instructed to reply in a safe, sensible, specific, interesting, grounded, and informative manner. They are also explicitly asked to use any external tools necessary to generate these responses.

6. LaMDA fine-tuning

We create LaMDA by applying several fine-tunings to the pre-trained model (PT).

6-1. Discriminative and generative fine-tuning for Quality (SSI) and Safety

One of the fine-tunings is a mix of 1). generative tasks that generate responses given contexts, and 2). discriminative tasks that evaluate the quality and safety of a response in context. This results in a single model that can function as both a generator and a discriminator.

Fine-tuning examples

  1. Generative fine-tuning examples are expressed as "<context><sentinel><response>", with losses applied only for the response portion.
  2. Discriminative fine-tuning examples are expressed as "<context><sentinel><response><attribute-name><rating>", with losses applied for the rating following the attribute name only.
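As a sketch of these two formats, the snippet below assembles training strings in the style of the paper's examples (e.g. "What's up? RESPONSE not much. SENSIBLE 1"); the exact sentinel and attribute tokens should be treated as assumptions.

```python
# Hedged sketch of the two fine-tuning string formats; loss masking is noted in comments.

def generative_example(context: str, response: str) -> str:
    # "<context><sentinel><response>": loss is applied only to the response tokens.
    return f"{context} RESPONSE {response}"

def discriminative_example(context: str, response: str, attribute: str, rating: int) -> str:
    # "<context><sentinel><response><attribute-name><rating>":
    # loss is applied only to the rating token after the attribute name.
    return f"{context} RESPONSE {response} {attribute} {rating}"

print(generative_example("What's up?", "not much."))
# -> What's up? RESPONSE not much.
print(discriminative_example("What's up?", "not much.", "SENSIBLE", 1))
# -> What's up? RESPONSE not much. SENSIBLE 1
```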

Fine-tuning strategies

  1. Predict the SSI and safety ratings of the generated candidate responses: P("<desired-rating>"|"<context><sentinel><response><attribute-name>").
  2. Generate the response in a given context: P("<response>"|"<context><sentinel>"). This task is trained on $800\mathbf{K}$ dialog turns, obtained by filtering the $2.5\mathbf{M}$-turn pre-training dialog data with the LaMDA SSI and safety discriminators.
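At inference time, the same discriminators can be reused to filter and re-rank generated candidates. The sketch below assumes a hypothetical helper p_rating(context, response, attribute) that returns P("<desired-rating>"|"<context><sentinel><response><attribute-name>"); the safety threshold value is an assumption, while the 3:1:1 weighting of sensibleness over specificity and interestingness follows the paper.

```python
# Hedged sketch: filter candidates by predicted safety, then re-rank by SSI scores.

SAFETY_THRESHOLD = 0.9  # assumed value; the paper only states that low-safety candidates are dropped

def pick_response(context, candidates, p_rating):
    # Drop candidates whose predicted safety falls below the threshold.
    safe = [c for c in candidates if p_rating(context, c, "SAFE") >= SAFETY_THRESHOLD]
    if not safe:
        return None
    # Rank survivors with sensibleness weighted three times higher than the other scores.
    def quality(c):
        return (3 * p_rating(context, c, "SENSIBLE")
                + p_rating(context, c, "SPECIFIC")
                + p_rating(context, c, "INTERESTING"))
    return max(safe, key=quality)
```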

6-2. Fine-tuning to learn to call an external information retrieval system

Language models such as LaMDA tend to generate outputs that seem plausible, but contradict facts established by known external sources. So, we present our approach to fine-tuning by learning to consult a set of external knowledge resources and tools.

The toolset (TS)

We create a toolset (TS) that includes an information retrieval system, a calculator, and a translator. TS takes a single string as input and outputs a list of one or more strings.
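A minimal sketch of that interface is below; the individual tools are stand-ins passed in as callables, since the paper's calculator, translator, and retrieval system are external services.

```python
# Hedged sketch of the toolset (TS) interface: one string in, a list of strings out.
from typing import Callable, List

Tool = Callable[[str], List[str]]

def make_toolset(calculator: Tool, translator: Tool, retrieval: Tool) -> Tool:
    def toolset(query: str) -> List[str]:
        results: List[str] = []
        # Each tool returns an empty list if it cannot handle the input;
        # the outputs are simply concatenated in a fixed order.
        for tool in (calculator, translator, retrieval):
            results.extend(tool(query))
        return results
    return toolset
```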

Dialog collection

We collect $40\mathbf{K}$ annotated dialog turns, as described in the previous section.
We also collect $9\mathbf{K}$ dialog turns in which LaMDA's generated candidates are labeled 'correct' or 'incorrect', to be used as input data for the ranking task (discriminative data). We also collect human-human dialogs focused on information-seeking interactions, and evaluate whether their statements can be supported by known authoritative sources.

Fine-tuning

LaMDA is fine-tuned on two tasks:

  1. Takes the multiturn dialog context to date and the response generated by the base model, then generates the query that should be sent to the toolset: context + base $\rightarrow$ "TS, query".
  2. Takes the snippet returned by a tool and the dialog statement, then predicts the grounded version: context + base + query + snippet $\rightarrow$ response.
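A hedged sketch of how these two tasks could be chained at inference time is shown below; the "TS, ..." / "User, ..." output convention follows the paper, while research_model, toolset, and the step cap are illustrative names and assumptions.

```python
# Hedged sketch of the inference-time loop: keep querying the toolset until the
# model decides to reply to the user, then return the grounded response.

MAX_STEPS = 4  # assumed cap on toolset round-trips

def grounded_response(context: str, base_response: str, research_model, toolset) -> str:
    state = f"{context} {base_response}"
    for _ in range(MAX_STEPS):
        action = research_model(state)          # e.g. "TS, how old is Rafael Nadal?"
        recipient, text = action.split(", ", 1)
        if recipient == "User":
            return text                         # grounded reply, possibly citing a source URL
        snippets = toolset(text)                # consult retrieval system / calculator / translator
        state = f"{state} TS, {text} {' '.join(snippets)}"
    return base_response  # fall back to the ungrounded draft if no reply was produced
```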

Recall that the 'Research' phase is one specialized task from a set that belongs to a single multi-tasking model (e.g., the 'Base' dialog response generation task, safety tasks, and quality tasks).

(Figure in the paper: how LaMDA handles groundedness through interactions with an external information retrieval system.)

7. Results on foundation metrics

In this section, we first summarize the datasets and methods used and then summarize results.

 

Metrics, datasets, and evaluation:

  1. Quality. Dataset: $6.4\mathbf{K}$ dialogs with binary labels for sensibleness, specificity, and interestingness. Evaluation: crowdworkers label each response, given the context, for sensibleness, specificity, and interestingness on a common benchmark dataset of 1477 dialog turns.
  2. Safety. Dataset: $8\mathbf{K}$ dialogs with binary labels for each of the safety objectives. Evaluation: crowdworkers label each response, given the context, using the safety objectives for 1458 dialog turns that cover provocative user turns.
  3. Groundedness. Dataset: $4\mathbf{K}$ dialogs in which crowdworkers write queries to an information retrieval system and modify model responses, plus $1\mathbf{K}$ dialogs with binary labels on whether generated queries or response modifications were correctly executed. Evaluation: crowdworkers evaluate 784 responses, given their contexts, for informativeness and groundedness.

 

Leveraging these datasets, we perform two levels of fine-tuning:

  1. FT quality-safety: fine-tune the pre-trained model (PT) to train discriminators that predict quality and safety labels. PT is also fine-tuned to generate in-context responses from a clean sample of pre-training dialog data filtered using LaMDA discriminators.
  2. FT groundedness (LaMDA): fine-tune FT quality-safety to generate calls to an external information retrieval system to provide attributed responses. The model is also fine-tuned to jointly predict the quality and the type (i.e., calling a certain tool or replying to the user) of the next action.

Scaling up alone improves the pre-trained model's quality (sensibleness, specificity, and interestingness) and groundedness (groundedness and informativeness) metrics, but it does not improve safety much. Fine-tuning with crowdworker-annotated data, however, turns out to be an effective method for improving all metrics. Groundedness further improves from FT quality-safety to FT groundedness (LaMDA).

8. Domain grounding

We observe that LaMDA can perform domain-appropriate roles through pre-conditioning, also known as domain grounding. Here we explore such domain grounding in two areas: 1). LaMDA playing the role of a famous object such as Mount Everest for the purpose of education, and 2). LaMDA playing the role of a music recommendation agent.

To adapt LaMDA and PT to each role, we precondition them on a few turns of role-specific dialogs, and we use the same pre-conditioning for LaMDA and PT.
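A minimal sketch of this pre-conditioning, assuming a stand-in generate function and a paraphrased Mount Everest greeting, is shown below: the role-specific turns simply act as a fixed prefix on every dialog context.

```python
# Hedged sketch of domain grounding by pre-conditioning (role prompt as a fixed prefix).

EVEREST_PRECONDITIONING = [
    "LaMDA: Hi, I'm Mount Everest. What would you like to know about me?",  # paraphrased role turn
]

def roleplay_response(generate, dialog_history):
    # The same prefix is used for both LaMDA and PT, so only the model differs.
    context = "\n".join(EVEREST_PRECONDITIONING + dialog_history)
    return generate(context)

# e.g. roleplay_response(generate, ["User: Why do people climb you?"])
```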

Results show that LaMDA performs significantly better than PT in helpfulness, while both LaMDA and PT instances score fairly well on role consistency.

9. Discussion and limitations

Collecting fine-tuning datasets brings the benefits of learning from nuanced human judgements, and we expect better results with higher-quality labels. Future work will therefore focus on ways to improve label quality: (1) selecting crowdworkers that mirror the system's target users, and (2) examining disagreements between crowdworkers due to social and cultural norms and values.

We have shown that fine-tuning can improve safety metrics on average by defining safety objectives for our safety fine-tuning. Future work will also need to focus on how fine-tuning can cope with the long tail of inappropriate responses that LaMDA and other language models can generate.


9-1. Examining bias

Our safety objectives aim to reduce the number of responses biased against specific subgroups of people, but such biases can be hard to detect since they manifest in a wide variety of subtle ways. Another limitation of our safety approach is that it may still propagate some representational harms present in the training datasets, even if the individual examples do not violate any of the safety objectives.

Known approaches to mitigate undesirable statistical biases in generative language models include attempts to filter pre-training data, train separate filtering models, create control codes to condition generation, and fine-tune models, as demonstrated in this paper.

9-2. Adversarial data collection

We use adversarial-intent conversations to improve the breadth of labeled data for fine-tuning. A limitation of our approach is that most participants are able to find commonly occurring problems, but not rarer ones. Given the long-tail nature of threats associated with generative models, future efforts should further incentivize novelty and the detection of errors that could be rare or unseen but could have potentially severe consequences, especially in evolving societal contexts.

9-3. Safety as a concept and a metric

(1) Our rating aggregates fine-grained ratings on a diverse set of safety objectives into a single value, so it leaves little room for weighting objectives differently. (2) Our rating scales are coarse and may not measure the full extent to which a response is unsafe or undesirable. (3) The safety objectives attempt to capture widely shared values across social groups, but in reality these objectives cannot be treated as universal because of cultural differences.

9-4. Appropriateness as a concept and a metric

While safety and quality should be considered a minimum threshold for appropriate responses, additional considerations are necessary to support a positive user experience. Politeness and agreeability objectives have distinct sociolinguistic characteristics and should be measured separately from safety characteristics. A challenge to meeting this need is that social appropriateness is not universal. It is highly contextual and must be assessed in relation to relevant social and cultural contexts, so no set of specific appropriateness constraints can apply universally to generative language models.

9-5. Cultural responsiveness

Various traits that we measure for our safety objectives depend heavily on socio-cultural contexts. Any meaningful measure of safety for these objectives should take into account the societal context where the system will be used, employing a "participatory finetuning" approach that brings relevant communities into the human-centered data collection and curation processes.

9-6. Impersonation and anthropomorphization

Humans may interact with systems without knowing that they are artificial, or anthropomorphize the system by ascribing some form of personality to it. Both of these situations present the risk that deliberate misuse of these tools might deceive or manipulate people, inadvertently or with malicious intent. Furthermore, adversaries could potentially attempt to tarnish another person's reputation, leverage their status, or sow misinformation by using this technology to impersonate specific individuals' conversational style.

10. Future work

We intend to expand and revise the dimensions captured by our safety objectives and significantly increase the volume of labeled training data that we collect to train our discriminators. We will need to continue to look carefully at crowdworker recruitment, training, and performance evaluation, as well as calibrate for cross-cultural differences in values and opinions.

Another potential area of exploration is to study how different applications may warrant distinct levels of safety, quality, and groundedness based on the risk/benefit tradeoffs of these individual applications.

Achieving broad consensus on the nuances of what constitutes safety and groundedness is going to remain a fundamental long-term challenge in the field of open-ended dialog systems.

11. Conclusion

This paper studies the importance of scale, annotated data for model fine-tuning, and the use of information retrieval as a tool in dialog modeling. 1). We find that crowd-annotated data is an effective tool for driving significant additional gains. 2). We also find that calling external APIs offers a path towards significantly improving groundedness. We pre-condition the models on a small number of turns of application-specific dialogs to quickly adapt LaMDA to these applications. 3). We find that models can adapt to their expected context, with more than four out of five responses staying consistent with their assigned role. LaMDA is a step closer to practical and safe open-ended dialog systems, which can in turn unlock a wide range of useful applications.