2021-arxiv P-tuning v1 GPT Understands, Too

Motivation

Manual discrete prompts have several drawbacks, most notably unstable performance (e.g., changing a single word in the prompt can cause a substantial drop). To address this problem, we propose a novel method that employs trainable continuous prompt embeddings in concatenation with discrete prompts.

The instability of discrete prompts can be seen as follows:

  1. Changing a single word in the prompt can have a large influence on the prediction.

Note: when the LM is fine-tuned, the instability can be alleviated, but the performance gap remains sizeable, especially in the few-shot setting, and recent prompt-search algorithms do not cope well with it.

Method

  1. First, the workflow of discrete prompts is as follows:

     - Step 1: Let $M$ be a pretrained language model with a hidden size of $h$ and a vocabulary size of $|V|$. Let $\{(x_i, y_i)\}_i$ be a labeled dataset for an NLU task, where $x_{0:n} = \{x_0, x_1, \ldots, x_n\}$ is an input consisting of a sequence of discrete tokens, and $y \in \mathcal{Y}$ is a label. Our goal is to estimate the conditional probability for classification, $f_M(x) = \hat{p}(y|x)$, with the parameters of $M$ either fine-tuned or frozen.

     - Step 2: Let $[D_i]$ be a discrete prompt token. Each prompt can be described as a template $T = \{[D_{0:i}], x, [D_{(i+1):j}], y, [D_{(j+1):k}]\}$.
     - Step 3: With this template, the labeled data can be organized into a sequence of text tokens including $x$ and $y$, and the task can be reformulated as filling in the blanks of the input text.

E.g., for the task of predicting a country’s capital (LAMA-TREx P36), a prompt could be “The capital of [INPUT] is [LABEL].” With a piece of labeled data “(Britain, London)”, the reformulated text would be “The capital of Britain is [MASK].”, where “[MASK]” should predict the given label “London”.
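
To make the cloze reformulation concrete, here is a minimal sketch (not from the paper) that fills the template above and lets a masked LM predict the blank, using HuggingFace `transformers`; the `bert-base-cased` checkpoint is an assumed choice.

```python
# Minimal sketch: cloze-style prediction with a discrete prompt
# (assumes HuggingFace transformers and the bert-base-cased checkpoint).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased").eval()

# Template "The capital of [INPUT] is [LABEL]." with INPUT = "Britain";
# the [LABEL] slot becomes the [MASK] token to be filled in.
text = f"The capital of Britain is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                     # (1, seq_len, |V|)

# Locate the [MASK] position and take the most likely vocabulary entry.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
pred_id = logits[0, mask_pos].argmax(-1)
print(tokenizer.decode(pred_id))                        # ideally "London"
```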

  2. The pipeline proposed in the paper, and its comparison with discrete prompts, is as follows:

Let $[P_i]$ be the $i$-th continuous prompt embedding. The prompt template for P-tuning is as follows:

$$T = \{[P_{0:i}], x, [P_{(i+1):j}], y, [P_{(j+1):k}]\}$$

P-tuning leverages an extra embedding function $f: [P_i] \rightarrow h_i$ to map the template to:

$$\{h_0, \ldots, h_i, e(x), h_{i+1}, \ldots, h_j, e(y), h_{j+1}, \ldots, h_k\}$$

Finally, we update the embeddings $\{P_i\}_{i=1}^{k}$ to optimize a task loss function.

Note: for the choice of $[P_i]$, they can be taken from unused tokens in the pretraining vocabulary, e.g., [unused1] ~ [unused99] in the BERT vocabulary.
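
As a minimal sketch of how the continuous prompts enter the model, the PyTorch module below builds the input embedding sequence by concatenating trainable prompt vectors with the frozen word embeddings $e(x)$. The class name `PTuningInput`, the prompt count, and the prepend-only placement are illustrative assumptions; the paper interleaves the prompts around $x$ and $y$ according to the template.

```python
# Illustrative sketch of the P-tuning input construction (not the paper's code).
import torch
import torch.nn as nn

class PTuningInput(nn.Module):
    def __init__(self, backbone_embeddings: nn.Embedding, n_prompt: int = 8):
        super().__init__()
        hidden = backbone_embeddings.embedding_dim
        self.word_emb = backbone_embeddings             # e(.), kept frozen
        self.word_emb.weight.requires_grad = False
        # Trainable continuous prompts [P_0], ..., [P_{n_prompt-1}]
        self.prompt_emb = nn.Parameter(torch.randn(n_prompt, hidden) * 0.02)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Simplified placement: all prompt vectors are prepended before e(x);
        # the actual template can interleave them around x and y.
        e_x = self.word_emb(input_ids)                               # (B, L, h)
        h_p = self.prompt_emb.unsqueeze(0).expand(input_ids.size(0), -1, -1)
        return torch.cat([h_p, e_x], dim=1)                          # (B, n_prompt + L, h)
```

The resulting embedding sequence would then be passed to the backbone (e.g., via `inputs_embeds` in HuggingFace models), so that only the prompt parameters receive gradients from the task loss while the backbone can stay frozen.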

  3. Prompt Encoder Network

There are three types of encoder for the continuous embeddings: LSTM, MLP, and EMB, where EMB means an identity map. The ablation results are as follows:

Based on the comparison above, LSTM and MLP generally work well on these tasks, while EMB can substantially under-perform the other two on some tasks.
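
The sketch below illustrates the three encoder variants. Per the paper, the LSTM variant is a bidirectional LSTM followed by a ReLU-activated two-layer MLP; the layer sizes and the class name here are illustrative assumptions.

```python
# Sketch of the three prompt-encoder variants compared in the ablation.
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    def __init__(self, n_prompt: int, hidden: int, kind: str = "lstm"):
        super().__init__()
        self.kind = kind
        # Raw trainable prompt vectors, reparameterized by the chosen encoder.
        self.raw = nn.Parameter(torch.randn(n_prompt, hidden) * 0.02)
        if kind == "lstm":
            # Bi-LSTM + two-layer ReLU MLP head, as described in the paper.
            self.lstm = nn.LSTM(hidden, hidden // 2, num_layers=2,
                                bidirectional=True, batch_first=True)
            self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden))
        elif kind == "mlp":
            self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden))
        # kind == "emb": identity map, the raw vectors are used directly.

    def forward(self) -> torch.Tensor:
        # Returns the (n_prompt, hidden) continuous prompt embeddings h_i.
        if self.kind == "lstm":
            out, _ = self.lstm(self.raw.unsqueeze(0))    # (1, n_prompt, hidden)
            return self.head(out.squeeze(0))
        if self.kind == "mlp":
            return self.mlp(self.raw)
        return self.raw                                  # EMB: identity
```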

Results

More evaluation results can be found in the paper.

