Text Summarization in Practice

1. Introduction to the Text Summarization Task

  • Text summarization is an important natural language processing task: given a piece of text, extract its key information and produce a concise summary.
  • By the number of input documents, summarization can be divided into single-document and multi-document summarization.
  • By the language of the input and output, it can be divided into monolingual, cross-lingual, and multilingual summarization.
  • This walkthrough focuses on single-document, monolingual summarization.
  • Evaluation metric:
    • ROUGE: Recall-Oriented Understudy for Gisting Evaluation, used mainly to measure how similar an automatically generated summary is to a human-written reference. The ROUGE family includes several variants such as ROUGE-N, ROUGE-L, and ROUGE-W.

      • ROUGE-1, ROUGE-2, ROUGE-L: the n-gram overlap between the generated summary and the reference summary, where n = 1, 2, and L stand for unigrams, bigrams, and the longest common subsequence (LCS), respectively.
      • In practice they are reported as F1 scores computed from 1-gram overlap, 2-gram overlap, and the LCS.
      • Formulas (recall-oriented form, matching the worked example below):
        • ROUGE-N = Σ_{gramₙ ∈ Reference} min(Count_candidate(gramₙ), Count_reference(gramₙ)) / Σ_{gramₙ ∈ Reference} Count_reference(gramₙ)
        • ROUGE-L is built on the LCS: R = LCS(X, Y) / |X|, P = LCS(X, Y) / |Y|, and the reported score is the F-measure F = (1 + β²)·R·P / (R + β²·P).
    • Example:

      • Candidate summary Y (typically generated by the model):

      the cat was found under the bed

      • Reference summary X (typically human-annotated):

      the cat was under the bed
      • The 1-grams and 2-grams of the candidate and of the reference are:

        | #     | 1-gram (Y) | 1-gram (X, reference) | 2-gram (Y)  | 2-gram (X, reference) |
        |-------|------------|-----------------------|-------------|-----------------------|
        | 1     | the        | the                   | the cat     | the cat               |
        | 2     | cat        | cat                   | cat was     | cat was               |
        | 3     | was        | was                   | was found   | was under             |
        | 4     | found      | under                 | found under | under the             |
        | 5     | under      | the                   | under the   | the bed               |
        | 6     | the        | bed                   | the bed     |                       |
        | 7     | bed        |                       |             |                       |
        | count | 7          | 6                     | 6           | 5                     |

      • ROUGE-1(X, Y) = 6/6 = 1.0: the numerator is the number of 1-grams that appear in both the candidate and the reference; the denominator is the number of 1-grams in the reference. (The denominator could also be the candidate's 1-gram count; between precision and recall we usually care more about recall here, which is also what the ROUGE-N formula above computes.)
      • ROUGE-2(X, Y) = 4/5 = 0.8 (the overlapping bigrams are "the cat", "cat was", "under the", "the bed"). A runnable sketch of this computation appears right after this list.
  • Sequence-to-sequence model (T5): used in the hands-on walkthrough in Section 2.
  • Prefix language model (GLM): used in the hands-on walkthrough in Section 3.
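
To make the ROUGE-1/ROUGE-2 numbers above easy to verify, here is a minimal recall-only sketch in plain Python (standard library only); it reproduces the 1.0 and 0.8 values for the example sentences:

from collections import Counter

def rouge_n_recall(candidate: str, reference: str, n: int) -> float:
    # Count n-grams in both texts, clip the candidate counts by the reference counts,
    # and divide by the total number of reference n-grams (recall-oriented ROUGE-N).
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items() if gram in ref)
    return overlap / sum(ref.values())

Y = "the cat was found under the bed"  # candidate (model output)
X = "the cat was under the bed"        # reference (human-written)
print(rouge_n_recall(Y, X, 1))  # 1.0
print(rouge_n_recall(Y, X, 2))  # 0.8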

2. Hands-On Code (Based on T5)

Model architecture (T5ForConditionalGeneration):

@add_start_docstrings("""T5 Model with a `language modeling` head on top.""", T5_START_DOCSTRING)
class T5ForConditionalGeneration(T5PreTrainedModel, GenerationMixin):
    _keys_to_ignore_on_load_unexpected = [
        "decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight",
    ]
    _tied_weights_keys = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight", "lm_head.weight"]

    def __init__(self, config: T5Config):
        super().__init__(config)
        self.model_dim = config.d_model

        self.shared = nn.Embedding(config.vocab_size, config.d_model)

        encoder_config = copy.deepcopy(config)
        encoder_config.is_decoder = False
        encoder_config.use_cache = False
        encoder_config.is_encoder_decoder = False
        self.encoder = T5Stack(encoder_config, self.shared)

        decoder_config = copy.deepcopy(config)
        decoder_config.is_decoder = True
        decoder_config.is_encoder_decoder = False
        decoder_config.num_layers = config.num_decoder_layers
        self.decoder = T5Stack(decoder_config, self.shared)

        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

        # Model parallel
        self.model_parallel = False
        self.device_map = None

    @add_start_docstrings(PARALLELIZE_DOCSTRING)
    def parallelize(self, device_map=None):
        warnings.warn(
            "`T5ForConditionalGeneration.parallelize` is deprecated and will be removed in v5 of Transformers, you"
            " should load your model with `device_map='balanced'` in the call to `from_pretrained`. You can also"
            " provide your own `device_map` but it needs to be a dictionary module_name to device, so for instance"
            " {'encoder.block.0': 0, 'encoder.block.1': 1, ...}",
            FutureWarning,
        )
        self.device_map = (
            get_device_map(len(self.encoder.block), range(torch.cuda.device_count()))
            if device_map is None
            else device_map
        )
        assert_device_map(self.device_map, len(self.encoder.block))
        self.encoder.parallelize(self.device_map)
        self.decoder.parallelize(self.device_map)
        self.lm_head = self.lm_head.to(self.decoder.first_device)
        self.model_parallel = True

    @add_start_docstrings(DEPARALLELIZE_DOCSTRING)
    def deparallelize(self):
        warnings.warn(
            "Like `parallelize`, `deparallelize` is deprecated and will be removed in v5 of Transformers.",
            FutureWarning,
        )
        self.encoder.deparallelize()
        self.decoder.deparallelize()
        self.encoder = self.encoder.to("cpu")
        self.decoder = self.decoder.to("cpu")
        self.lm_head = self.lm_head.to("cpu")
        self.model_parallel = False
        self.device_map = None
        torch.cuda.empty_cache()

    def get_input_embeddings(self):
        return self.shared

    def set_input_embeddings(self, new_embeddings):
        self.shared = new_embeddings
        self.encoder.set_input_embeddings(new_embeddings)
        self.decoder.set_input_embeddings(new_embeddings)

    def _tie_weights(self):
        if self.config.tie_word_embeddings:
            self._tie_or_clone_weights(self.encoder.embed_tokens, self.shared)
            self._tie_or_clone_weights(self.decoder.embed_tokens, self.shared)

    def set_output_embeddings(self, new_embeddings):
        self.lm_head = new_embeddings

    def get_output_embeddings(self):
        return self.lm_head

    def get_encoder(self):
        return self.encoder

    def get_decoder(self):
        return self.decoder

    @add_start_docstrings_to_model_forward(T5_INPUTS_DOCSTRING)
    @replace_return_docstrings(output_type=Seq2SeqLMOutput, config_class=_CONFIG_FOR_DOC)
    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        decoder_input_ids: Optional[torch.LongTensor] = None,
        decoder_attention_mask: Optional[torch.BoolTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        decoder_head_mask: Optional[torch.FloatTensor] = None,
        cross_attn_head_mask: Optional[torch.Tensor] = None,
        encoder_outputs: Optional[Tuple[Tuple[torch.Tensor]]] = None,
        past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        decoder_inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        cache_position: Optional[torch.LongTensor] = None,
    ) -> Union[Tuple[torch.FloatTensor], Seq2SeqLMOutput]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[-100, 0, ...,
            config.vocab_size - 1]`. All labels set to `-100` are ignored (masked), the loss is only computed for
            labels in `[0, ..., config.vocab_size]`

        Returns:

        Examples:

        ```python
        >>> from transformers import AutoTokenizer, T5ForConditionalGeneration

        >>> tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
        >>> model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-small")

        >>> # training
        >>> input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
        >>> labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids
        >>> outputs = model(input_ids=input_ids, labels=labels)
        >>> loss = outputs.loss
        >>> logits = outputs.logits

        >>> # inference
        >>> input_ids = tokenizer(
        ...     "summarize: studies have shown that owning a dog is good for you", return_tensors="pt"
        ... ).input_ids  # Batch size 1
        >>> outputs = model.generate(input_ids)
        >>> print(tokenizer.decode(outputs[0], skip_special_tokens=True))
        >>> # studies have shown that owning a dog is good for you.
        ```"""
        use_cache = use_cache if use_cache is not None else self.config.use_cache
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # FutureWarning: head_mask was separated into two input args - head_mask, decoder_head_mask
        if head_mask is not None and decoder_head_mask is None:
            if self.config.num_layers == self.config.num_decoder_layers:
                warnings.warn(__HEAD_MASK_WARNING_MSG, FutureWarning)
                decoder_head_mask = head_mask

        # Encode if needed (training, first prediction pass)
        if encoder_outputs is None:
            # Convert encoder inputs in embeddings if needed
            encoder_outputs = self.encoder(
                input_ids=input_ids,
                attention_mask=attention_mask,
                inputs_embeds=inputs_embeds,
                head_mask=head_mask,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
            )
        elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):
            encoder_outputs = BaseModelOutput(
                last_hidden_state=encoder_outputs[0],
                hidden_states=encoder_outputs[1] if len(encoder_outputs) > 1 else None,
                attentions=encoder_outputs[2] if len(encoder_outputs) > 2 else None,
            )

        hidden_states = encoder_outputs[0]

        if self.model_parallel:
            torch.cuda.set_device(self.decoder.first_device)

        if labels is not None and decoder_input_ids is None and decoder_inputs_embeds is None:
            # get decoder inputs from shifting lm labels to the right
            decoder_input_ids = self._shift_right(labels)

        # Set device for model parallelism
        if self.model_parallel:
            torch.cuda.set_device(self.decoder.first_device)
            hidden_states = hidden_states.to(self.decoder.first_device)
            if decoder_input_ids is not None:
                decoder_input_ids = decoder_input_ids.to(self.decoder.first_device)
            if attention_mask is not None:
                attention_mask = attention_mask.to(self.decoder.first_device)
            if decoder_attention_mask is not None:
                decoder_attention_mask = decoder_attention_mask.to(self.decoder.first_device)

        # Decode
        decoder_outputs = self.decoder(
            input_ids=decoder_input_ids,
            attention_mask=decoder_attention_mask,
            inputs_embeds=decoder_inputs_embeds,
            past_key_values=past_key_values,
            encoder_hidden_states=hidden_states,
            encoder_attention_mask=attention_mask,
            head_mask=decoder_head_mask,
            cross_attn_head_mask=cross_attn_head_mask,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
            cache_position=cache_position,
        )

        sequence_output = decoder_outputs[0]

        # Set device for model parallelism
        if self.model_parallel:
            torch.cuda.set_device(self.encoder.first_device)
            self.lm_head = self.lm_head.to(self.encoder.first_device)
            sequence_output = sequence_output.to(self.lm_head.weight.device)

        if self.config.tie_word_embeddings:
            # Rescale output before projecting on vocab
            # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/transformer.py#L586
            sequence_output = sequence_output * (self.model_dim**-0.5)

        lm_logits = self.lm_head(sequence_output)

        loss = None
        if labels is not None:
            loss_fct = CrossEntropyLoss(ignore_index=-100)
            # move labels to correct device to enable PP
            labels = labels.to(lm_logits.device)
            loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), labels.view(-1))
            # TODO(thom): Add z_loss https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L666

        if not return_dict:
            output = (lm_logits,) + decoder_outputs[1:] + encoder_outputs
            return ((loss,) + output) if loss is not None else output

        return Seq2SeqLMOutput(
            loss=loss,
            logits=lm_logits,
            past_key_values=decoder_outputs.past_key_values,
            decoder_hidden_states=decoder_outputs.hidden_states,
            decoder_attentions=decoder_outputs.attentions,
            cross_attentions=decoder_outputs.cross_attentions,
            encoder_last_hidden_state=encoder_outputs.last_hidden_state,
            encoder_hidden_states=encoder_outputs.hidden_states,
            encoder_attentions=encoder_outputs.attentions,
        )

    def prepare_decoder_input_ids_from_labels(self, labels: torch.Tensor):
        return self._shift_right(labels)

    def _reorder_cache(self, past_key_values, beam_idx):
        # if decoder past is not included in output
        # speedy decoding is disabled and no need to reorder
        if past_key_values is None:
            logger.warning("You might want to consider setting `use_cache=True` to speed up decoding")
            return past_key_values

        reordered_decoder_past = ()
        for layer_past_states in past_key_values:
            # get the correct batch idx from layer past batch dim
            # batch dim of `past` is at 2nd position
            reordered_layer_past_states = ()
            for layer_past_state in layer_past_states:
                # need to set correct `past` for each of the four key / value states
                reordered_layer_past_states = reordered_layer_past_states + (
                    layer_past_state.index_select(0, beam_idx.to(layer_past_state.device)),
                )

            if reordered_layer_past_states[0].shape != layer_past_states[0].shape:
                raise ValueError(
                    f"reordered_layer_past_states[0] shape {reordered_layer_past_states[0].shape} and layer_past_states[0] shape {layer_past_states[0].shape} mismatched"
                )
            if len(reordered_layer_past_states) != len(layer_past_states):
                raise ValueError(
                    f"length of reordered_layer_past_states {len(reordered_layer_past_states)} and length of layer_past_states {len(layer_past_states)} mismatched"
                )

            reordered_decoder_past = reordered_decoder_past + (reordered_layer_past_states,)
        return reordered_decoder_past
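
Two details of the forward pass above are easy to miss: decoder_input_ids are derived from labels via _shift_right, and label positions equal to -100 are excluded from the cross-entropy loss. A minimal sketch, reusing the google-t5/t5-small checkpoint from the docstring example (the exact token ids depend on the tokenizer):

from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-small")

labels = tokenizer("owning a dog is good for you", return_tensors="pt").input_ids
# forward() builds decoder inputs by shifting the labels one position to the right
# and prepending the decoder start token (the pad token, id 0, for T5).
decoder_input_ids = model.prepare_decoder_input_ids_from_labels(labels)
print(labels[0][:4])
print(decoder_input_ids[0][:4])

# Any label set to -100 is ignored by CrossEntropyLoss(ignore_index=-100) in the loss above.
masked_labels = labels.clone()
masked_labels[:, -1] = -100  # e.g. mask out the final position
enc = tokenizer("summarize: studies have shown that owning a dog is good for you", return_tensors="pt")
print(model(input_ids=enc.input_ids, labels=masked_labels).loss)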

2.1 Import the required packages

import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments

2.2 Load the dataset

ds = Dataset.load_from_disk("/kaggle/working/nlpcc-2017/")
ds = ds.train_test_split(100, seed=42)

The dataset after splitting:

DatasetDict({
    train: Dataset({
        features: ['title', 'content'],
        num_rows: 4900
    })
    test: Dataset({
        features: ['title', 'content'],
        num_rows: 100
    })
})

Format of a record in ds:

{'title': '安岳县发生砍人事件致4人受伤,其中1人重伤,警方已控制嫌疑人;事件因居民不顾单向限行强行通过而起。',
 'content': '中新网资阳6月17日电(吴平华)17日上午八时许,四川省资阳市安岳县龙台镇加油站旁发生砍人事件,在打斗中四人被砍伤,导致部分当地群众堵路围观。目前,当地政府和警方已介入,截至记者发稿时,重伤者仍在医院抢救。据安岳县龙台镇龙副镇长介绍,近段时间319道路维修,维护秩序的人员便采取了单向通行的管理,但送学生的当地居民杨某因赶时间,强行通过,双方当即发生争执和纠纷。“我们赶到现场时,只见地上还有不少血迹,几名伤者已被送往医院进行治疗,其中一人受重伤,腹部等处受伤,目前正在医院救治。”龙副镇长说,参与打人者也急忙逃跑,现场只抓住一个人。当地群众对此极为不满,要求依法严惩凶手。事发后,龙台镇政府及时前往现场并安抚受伤居民,当地警方控制了嫌疑人。目前,此案正在进一步调查处理中。'}

2.3 Preprocess the dataset

tokenizer = AutoTokenizer.from_pretrained("Langboat/mengzi-t5-base")

def process_func(examples):
    # Prepend the summarization prompt to each article
    content = ["摘要生成:\n" + e for e in examples["content"]]
    inputs = tokenizer(content, max_length=384, truncation=True)
    labels = tokenizer(text_target=examples["title"], max_length=64, truncation=True)
    inputs["labels"] = labels["input_ids"]  # attach the label ids to the model inputs
    return inputs

tokenized_ds = ds.map(process_func, batched=True)

Record 0 after preprocessing:

tokenizer.decode(tokenized_ds["train"][0]["input_ids"])
'摘要生成: 中新网资阳6月17日电(吴平华)17日上午八时许,四川省资阳市安岳县龙台镇加油站旁发生砍人事件,在打斗中四人被砍伤,导致部分当地群众堵路围观。目前,当地政府和警方已介入,截至记者发稿时,重伤者仍在医院抢救。据安岳县龙台镇龙副镇长介绍,近段时间319道路维修,维护秩序的人员便采取了单向通行的管理,但送学生的当地居民杨某因赶时间,强行通过,双方当即发生争执和纠纷。“我们赶到现场时,只见地上还有不少血迹,几名伤者已被送往医院进行治疗,其中一人受重伤,腹部等处受伤,目前正在医院救治。”龙副镇长说,参与打人者也急忙逃跑,现场只抓住一个人。当地群众对此极为不满,要求依法严惩凶手。事发后,龙台镇政府及时前往现场并安抚受伤居民,当地警方控制了嫌疑人。目前,此案正在进一步调查处理中。</s>'

tokenizer.decode(tokenized_ds["train"][0]["labels"])
'安岳县发生砍人事件致4人受伤,其中1人重伤,警方已控制嫌疑人;事件因居民不顾单向限行强行通过而起。</s>'

2.4 Create the model

model = AutoModelForSeq2SeqLM.from_pretrained("Langboat/mengzi-t5-base")

2.5 Define the evaluation function

import numpy as np
from rouge_chinese import Rouge

rouge = Rouge()

def compute_metric(evalPred):
    predictions, labels = evalPred
    decode_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decode_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # rouge_chinese expects whitespace-separated tokens, so split into characters manually
    decode_preds = [" ".join(p) for p in decode_preds]
    decode_labels = [" ".join(l) for l in decode_labels]
    scores = rouge.get_scores(decode_preds, decode_labels, avg=True)
    return {
        "rouge-1": scores["rouge-1"]["f"],
        "rouge-2": scores["rouge-2"]["f"],
        "rouge-l": scores["rouge-l"]["f"],
    }
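
The `labels != -100` replacement above is needed because DataCollatorForSeq2Seq pads label sequences with -100 (so padding is ignored by the loss), and those padded labels are what the Trainer hands to compute_metric when predict_with_generate is enabled. A quick way to see the padding, assuming the tokenizer, model, and tokenized_ds from the previous steps are in scope:

from transformers import DataCollatorForSeq2Seq

collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
features = [
    {k: tokenized_ds["train"][i][k] for k in ("input_ids", "attention_mask", "labels")}
    for i in range(2)
]
batch = collator(features)
print(batch["labels"].shape)
print(batch["labels"][:, -5:])  # shorter label sequences end in -100, not tokenizer.pad_token_id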

2.6 Configure the training arguments

args = Seq2SeqTrainingArguments(
    output_dir="/kaggle/working/summary",
    logging_dir="/kaggle/working/logs",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=8,
    logging_steps=8,
    eval_strategy="epoch",
    save_strategy="epoch",
    metric_for_best_model="rouge-l",
    predict_with_generate=True,
    report_to="none",
    disable_tqdm=False
)

2.7 Create the trainer

trainer = Seq2SeqTrainer(
    args=args,
    model=model,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    compute_metrics=compute_metric,
    processing_class=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer)
)

2.8 Train the model

trainer.train()

2.9 Inference

from transformers import pipeline

pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device=0)

pipe("摘要生成:\n" + ds["test"][-1]["content"], max_length=64, do_sample=True)
# [{'generated_text': '习近平今日正与日本首相安倍晋三握手,并互致问候;日本称中方领导人与安倍分居印尼总统佐科侧。'}]

ds["test"][-1]["title"]  # reference summary for comparison
# '日媒称中日首脑在万隆会议会场握手并致问候,两人合影时分别站在印尼总统两侧'
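
For a quick sanity check, the rouge object from 2.5 can score this single prediction against the reference title (character-split, as in compute_metric):

pred = pipe("摘要生成:\n" + ds["test"][-1]["content"], max_length=64, do_sample=True)[0]["generated_text"]
ref = ds["test"][-1]["title"]
print(rouge.get_scores(" ".join(pred), " ".join(ref))[0]["rouge-l"])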

3. Hands-On Code (Based on GLM)

Model architecture (GlmForCausalLM):

class GlmForCausalLM(GlmPreTrainedModel, GenerationMixin):
    _tied_weights_keys = ["lm_head.weight"]
    _tp_plan = {"lm_head": "colwise_rep"}
    _pp_plan = {"lm_head": (["hidden_states"], ["logits"])}

    def __init__(self, config):
        super().__init__(config)
        self.model = GlmModel(config)
        self.vocab_size = config.vocab_size
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    def get_output_embeddings(self):
        return self.lm_head

    def set_output_embeddings(self, new_embeddings):
        self.lm_head = new_embeddings

    def set_decoder(self, decoder):
        self.model = decoder

    def get_decoder(self):
        return self.model

    @deprecate_kwarg("num_logits_to_keep", version="4.50", new_name="logits_to_keep")
    @add_start_docstrings_to_model_forward(GLM_INPUTS_DOCSTRING)
    @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        cache_position: Optional[torch.LongTensor] = None,
        logits_to_keep: Union[int, torch.Tensor] = 0,
        **kwargs: Unpack[KwargsForCausalLM],
    ) -> Union[Tuple, CausalLMOutputWithPast]:
        r"""
        Args:
            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
                config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

            logits_to_keep (`int` or `torch.Tensor`, *optional*):
                If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all
                `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
                token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
                If a `torch.Tensor`, must be 1D corresponding to the indices to keep in the sequence length dimension.
                This is useful when using packed tensor format (single dimension for batch and sequence length).

        Returns:

        Example:

        ```python
        >>> from transformers import AutoTokenizer, GlmForCausalLM

        >>> model = GlmForCausalLM.from_pretrained("meta-glm/Glm-2-7b-hf")
        >>> tokenizer = AutoTokenizer.from_pretrained("meta-glm/Glm-2-7b-hf")

        >>> prompt = "Hey, are you conscious? Can you talk to me?"
        >>> inputs = tokenizer(prompt, return_tensors="pt")

        >>> # Generate
        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
        ```"""
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
            cache_position=cache_position,
            **kwargs,
        )

        hidden_states = outputs[0]
        # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
        slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
        logits = self.lm_head(hidden_states[:, slice_indices, :])

        loss = None
        if labels is not None:
            loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return CausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

Some questions about the model

  1. Which model class is actually used when GLM performs a Seq2Seq task?
    For GLM checkpoints such as THUDM/glm-large-chinese, the modeling code ships with the checkpoint (trust_remote_code=True), and loading it for Seq2Seq returns the conditional-generation class defined by that remote code; architecturally it is a single-stack (decoder-only) model like the GlmForCausalLM class shown above:

    from transformers import AutoModelForSeq2SeqLM
    model = AutoModelForSeq2SeqLM.from_pretrained("THUDM/glm-large-chinese", trust_remote_code=True)
    # returns the GLM conditional-generation class implemented in the checkpoint's remote code

    Although it is used for a Seq2Seq task, it actually completes the task through a pseudo-Seq2Seq, autoregressive structure (similar in spirit to T5, but not the same).

  2. Why is it called a causal LM and yet can still do Seq2Seq?
    Because:

    1. GLM is a decoder-only architecture (similar to GPT)
    • It has no separate Encoder and Decoder
    • Every task is reformulated as an autoregressive generation task (causal language modeling)
    2. It simulates Seq2Seq with "blank infilling + generation"
    • For summarization, the prompt looks like:

    文章:……
    摘要:

    • Once the model sees 摘要: it starts generating the continuation
    • In essence it autoregressively predicts [output_token_1, output_token_2, ...] rather than using standard encoder-decoder cross-attention
  3. Going a bit deeper: how does it "pretend" to be Seq2Seq?
    GLM uses a multi-segment prefix structure to mark the logical boundary between input and output:

    [Instruction] Summarization task
    [Input] This is an article about GLM……
    [Output] (the model starts generating here)

    This format tells the model clearly which part is the input and which part is the output to be predicted. (A small sketch after the table below shows what this looks like with the actual tokenizer.)

  4. Summary comparison

| Model type | True Encoder-Decoder? | Can do Seq2Seq? | Class |
|------------|-----------------------|-----------------|-------|
| T5 / BART | ✅ yes | ✅ yes | AutoModelForSeq2SeqLM |
| GPT | ❌ no (decoder-only) | ❌ (continuation only) | GPT2LMHeadModel |
| GLM | ❌ no (decoder-only) | ✅ (simulated via prompts) | GlmForCausalLM (or the checkpoint's own remote-code class) |
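
To see this pseudo-Seq2Seq setup concretely, the sketch below builds a generation input with the glm-large-chinese tokenizer, using the same calls as sections 3.3 and 3.9 (the extra fields it returns come from the checkpoint's remote code, so treat the printed keys as informational):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-large-chinese", trust_remote_code=True)

# The task is expressed as "prompt + article + [MASK]"; the model fills the mask autoregressively.
text = "摘要生成: \n" + "这是一篇关于GLM的文章……" + tokenizer.mask_token
inputs = tokenizer(text, return_tensors="pt")
# build_inputs_for_generation reserves max_gen_length positions for the span to be generated.
inputs = tokenizer.build_inputs_for_generation(inputs, max_gen_length=64)
print(inputs.keys())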

3.1 Import the required packages

import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments

3.2 Load the dataset

ds = Dataset.load_from_disk("/kaggle/working/nlpcc-2017/")
ds = ds.train_test_split(100, seed=42)

The dataset after splitting:

DatasetDict({
    train: Dataset({
        features: ['title', 'content'],
        num_rows: 4900
    })
    test: Dataset({
        features: ['title', 'content'],
        num_rows: 100
    })
})

Format of a record in ds:

{'title': '安岳县发生砍人事件致4人受伤,其中1人重伤,警方已控制嫌疑人;事件因居民不顾单向限行强行通过而起。',
 'content': '中新网资阳6月17日电(吴平华)17日上午八时许,四川省资阳市安岳县龙台镇加油站旁发生砍人事件,在打斗中四人被砍伤,导致部分当地群众堵路围观。目前,当地政府和警方已介入,截至记者发稿时,重伤者仍在医院抢救。据安岳县龙台镇龙副镇长介绍,近段时间319道路维修,维护秩序的人员便采取了单向通行的管理,但送学生的当地居民杨某因赶时间,强行通过,双方当即发生争执和纠纷。“我们赶到现场时,只见地上还有不少血迹,几名伤者已被送往医院进行治疗,其中一人受重伤,腹部等处受伤,目前正在医院救治。”龙副镇长说,参与打人者也急忙逃跑,现场只抓住一个人。当地群众对此极为不满,要求依法严惩凶手。事发后,龙台镇政府及时前往现场并安抚受伤居民,当地警方控制了嫌疑人。目前,此案正在进一步调查处理中。'}

3.3 Preprocess the dataset

# With newer versions of Transformers, loading this tokenizer raises an error and the checkpoint's remote code needs a small patch:
# file: ~/.cache\huggingface\modules\transformers_modules\THUDM\glm-large-chinese\230f54e413fab4bc8f29bd3508aab301d757ef3e\tokenization_glm.py
# move the super().__init__(**kwargs) call at line 231 down to line 235, i.e. to the line right after self.sp_model.Load(vocab_file)
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-large-chinese", trust_remote_code=True)


def process_func(examples):
    # Append the [MASK] token after the prompt + article; GLM fills the blank with the summary
    contents = ["摘要生成: \n" + e + tokenizer.mask_token for e in examples["content"]]
    inputs = tokenizer(contents, max_length=384, truncation=True, padding="max_length", return_tensors="pt")
    inputs = tokenizer.build_inputs_for_generation(inputs, targets=examples['title'], padding=True, max_gen_length=64)
    return inputs

tokenized_ds = ds.map(process_func, batched=True, remove_columns=ds["train"].column_names)

3.4 Create the model

model = AutoModelForSeq2SeqLM.from_pretrained("THUDM/glm-large-chinese", trust_remote_code=True)

3.5 Define the evaluation function

No custom evaluation function is needed for this run: the trainer below is configured without an eval_dataset or compute_metrics, so training only tracks the loss. ROUGE on the generated summaries can still be computed after training (see the sketch at the end of 3.9).

3.6 Configure the training arguments

args = Seq2SeqTrainingArguments(
    output_dir="./summary_glm",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=8,
    logging_steps=8,
    num_train_epochs=1
)

3.7 Create the trainer

trainer = Seq2SeqTrainer(
    args=args,
    model=model,
    train_dataset=tokenized_ds["train"],
    tokenizer=tokenizer,
)

3.8 Train the model

trainer.train()

3.9 Inference

input_text = ds["test"][-1]["content"]
inputs = tokenizer("摘要生成: \n" + input_text + tokenizer.mask_token, return_tensors="pt")
inputs = tokenizer.build_inputs_for_generation(inputs, max_gen_length=64)
inputs = inputs.to("cuda")
output = model.generate(**inputs, max_new_tokens=64, eos_token_id=tokenizer.eop_token_id, do_sample=True)
tokenizer.decode(output[0].tolist())

import torch

model = model.eval()

def predict_test():
    predict = []
    with torch.inference_mode():
        for d in ds["test"]:
            inputs = tokenizer("摘要生成: \n" + d["content"] + tokenizer.mask_token, return_tensors="pt")
            inputs = tokenizer.build_inputs_for_generation(inputs, max_gen_length=64)
            inputs = inputs.to("cuda")
            output = model.generate(**inputs, max_new_tokens=64, eos_token_id=tokenizer.eop_token_id, do_sample=True)
            predict.append(tokenizer.decode(output[0].tolist()).split("<|startofpiece|>")[1].replace("<|endofpiece|>", "").strip())
            print("curID:", len(predict))
    return predict


result = predict_test()
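
Since no metric is computed during training (see 3.5), ROUGE can be computed on the collected predictions afterwards with rouge_chinese, mirroring the metric code in section 2.5:

from rouge_chinese import Rouge

rouge = Rouge()
hyps = [" ".join(p) for p in result]               # character-split predictions
refs = [" ".join(d["title"]) for d in ds["test"]]  # character-split reference titles
print(rouge.get_scores(hyps, refs, avg=True))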

4. Model Comparison

4.1 Definitions

| Term | Definition |
|------|------------|
| Prefix language model (Prefix LM) | A model that predicts the next token using only the left-hand context, i.e. it can only generate from left to right. GPT is the typical example. |
| Seq2Seq model | Consists of an Encoder and a Decoder: the Encoder reads the whole input, and the Decoder can attend to the entire input while generating the output. Typical examples are T5, BART, and the original Transformer. |

4.2 Architecture comparison

| Aspect | Prefix LM (GPT) | Seq2Seq (T5/BART) |
|--------|-----------------|-------------------|
| Architecture | A single Transformer decoder | Separate Encoder and Decoder |
| Attention | Autoregressive (each token only sees earlier tokens) | The decoder sees the previously generated tokens plus the encoder outputs (the source text) |
| Input format | Input and output are concatenated into one sequence and generated left to right | Input and output are two separate sequences |
| Can the output see the input? | ✅ through the concatenated prefix (no dynamic alignment) | ✅ by attending directly to the encoder outputs |
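
The masking difference in the table can be illustrated with two tiny tensors (illustrative only; real implementations add batch and head dimensions):

import torch

# Prefix/causal LM: position i may only attend to positions <= i (lower-triangular mask).
causal_mask = torch.tril(torch.ones(5, 5))
print(causal_mask)

# Seq2Seq cross-attention: every decoder position may attend to every encoder position.
cross_mask = torch.ones(4, 6)  # (target_len, source_len)
print(cross_mask)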

4.3 Usage comparison

| Aspect | Prefix LM (GPT, ChatGLM) | Seq2Seq (T5, BART, GLM pseudo-Seq2Seq) |
|--------|--------------------------|----------------------------------------|
| Input | input + task description + output start marker | the source text |
| Output | the target text is generated as a continuation of the prompt | the target text is generated by the decoder |
| Example | 文章:… 摘要: → generate | encoder(input) → decoder(output) |

4.4 Capabilities and suitable tasks

| Capability / task | Prefix LM | Seq2Seq |
|-------------------|-----------|---------|
| Text generation | ✅ very strong | ✅ strong |
| Question answering | ✅ works (via prompts) | ✅ a more natural structural fit |
| Translation | ✅ possible (via prompts) | ✅ more natural |
| Summarization | ✅ supported (prompt-controlled) | ✅ supported |
| Information extraction | ⚠️ possible, but needs tailored prompts | ✅ clearly suitable |
| Multi-turn dialogue | ✅ a natural fit | ⚠️ needs structural adjustments |

4.5 How does GLM combine the two?

| Aspect | Description |
|--------|-------------|
| Training | Multi-task training that mixes "prefix + blank infilling + Seq2Seq", combining the strengths of both |
| Inference | GPT-like: generation is driven by a prefix prompt |
| Advantage | Keeps GPT-style generation ability while handling more structured task formats |

4.6 Summary

Prefix language model: like GPT, it generates from the preceding context only; well suited to continuation, dialogue, and prompt-driven tasks.
Seq2Seq model: input and output are clearly separated; well suited to translation, summarization, question answering, and other structured tasks.
GLM: trained with a Seq2Seq-style blank-infilling objective but runs inference along the prefix path, combining the two.
