Peking University Researchers Release a Dataset, Evaluation Metric, and Baseline Models for Document-Level Text Simplification
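Recently, the research group of Professor Wan Xiaojun at Peking University proposed a new natural language generation task, document-level text simplification, and released a new dataset, evaluation metric, and baseline models. The paper has been accepted by EMNLP 2021, a top conference in natural language processing (https://arxiv.org/abs/2110.05071).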
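Text simplification is a valuable technique, yet current research is limited to sentence-level simplification. In this project, the researchers define and study the new task of document-level text simplification, which aims to simplify articles consisting of multiple sentences. Based on Wikipedia dumps, they first build a large-scale dataset named D-Wikipedia, then analyze it and evaluate it manually to show that it is reliable. Next, they propose a new automatic evaluation metric, D-SARI, which is better suited to the document-level task. Finally, they select several representative models as baselines for the task, evaluate them both automatically and manually, analyze the results, and point out the strengths and weaknesses of the baselines.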
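Text simplification is a valuable technique in natural language generation (NLG) that merits in-depth study. The task is to rewrite a text into a form that is easier to understand while keeping its main meaning unchanged. Text simplification can benefit non-native speakers, non-expert readers, and children.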
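Within text simplification, sentence-level simplification has already been studied extensively, and researchers have proposed many sentence simplification datasets, such as WikiLarge, WikiSmall, and Newsela. However, complex real-world applications often require simplification at the document level rather than the sentence level. For example, if you want to simplify an article for children to read, simplifying each sentence in isolation is very inefficient. Studying document-level text simplification may therefore be more meaningful than studying sentence-level simplification alone. In this work, the researchers extend sentence-level text simplification to the document level.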
One might ask how document-level text simplification differs from text summarization. The following example compares the two:
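As the table shows, text summarization does not involve rewriting the text in simpler language, although both tasks may delete sentences from the original article that are unimportant or unrelated to its main idea.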
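Some earlier work in text simplification has paid attention, to varying degrees, to document-level information. For example, Alva-Manchego et al. focused on cross-sentence transformations in text simplification, analyzed them, and concluded that document-level simplification cannot be achieved merely by selecting part of the content and then simplifying individual sentences. Zhong et al. used discourse-level factors to predict whether a sentence should be deleted, with good results. Nevertheless, the task of document-level text simplification still lacked a clear definition. The researchers therefore define document-level text simplification in the problem formulation section of the paper and define six types of document-level simplification operations. In the appendix, they also give a precise definition and an example for each operation. Under this definition, document-level simplification may lose information, but it must not lose important information. To improve readability, information that has little to do with the main idea should be deleted.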
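Having clearly defined the problem, the researchers built a dataset named D-Wikipedia from the English Wikipedia and the Simple English Wikipedia. It contains more than 140,000 article pairs and can be downloaded for free from GitHub. The researchers also created a second dataset from the Newsela corpus, but each simplification level in it has fewer than one thousand articles, so it is used only to build four additional test sets, one per simplification level. Anyone who intends to use this second dataset must apply for access separately.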
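The researchers computed statistics on the D-Wikipedia dataset. They hired workers on Amazon's crowdsourcing platform to annotate the simplification operations in the articles. The proportions of D-Wikipedia articles that contain each of the six simplification operations are shown in the table below.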
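The researchers further analyzed word-level differences between the original and simplified articles, and computed the odds ratio of conjunctions and cue words in sentence splitting operations, as shown in the table below: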
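The odds ratio used in this kind of analysis can be illustrated with a small sketch. It assumes the usual 2x2 contingency-table definition, comparing how often a word occurs in sentences produced by splitting with how often it occurs in all other sentences. The function name, the argument names, and the optional smoothing term are hypothetical and are not taken from the paper.

```python
def odds_ratio(word_in_split, split_total, word_in_other, other_total, smoothing=0.5):
    """Odds ratio of a word appearing in split sentences vs. other sentences.

    A value above 1 suggests the word is associated with sentence splitting.
    `smoothing` is added to every cell to avoid division by zero; this is an
    illustrative choice, not something stated in the article.
    """
    a = word_in_split + smoothing                 # split sentences containing the word
    b = split_total - word_in_split + smoothing   # split sentences without the word
    c = word_in_other + smoothing                 # other sentences containing the word
    d = other_total - word_in_other + smoothing   # other sentences without the word
    return (a / b) / (c / d)


# Hypothetical counts: the word occurs in 30 of 100 split sentences
# and in 10 of 100 other sentences.
ratio = odds_ratio(30, 100, 10, 100, smoothing=0)
```

With these made-up counts the ratio is (30/70) / (10/90), so the word would be read as a cue for splitting.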
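The researchers found that existing sentence-level human evaluation criteria may also be unsuitable for the new task. They explain the reasons in detail and propose a new criterion named O-simplicity. O-simplicity measures whether the simplified article is simpler than the original on the condition that quality is preserved: the simplified article should also read smoothly and retain the main meaning of the original. As a measure of the degree of simplification, O-simplicity is more meaningful and comprehensive than the original simplicity criterion or a simple average of the simplicity, meaning, and grammar scores. The correlation between article length and human ratings is shown in the table below: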
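The results show that human judges tend to give short articles high Simplicity-phrase and Simplicity-structure scores, whereas the O-simplicity score correlates only weakly with article length.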
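A correlation between article length and human ratings could be computed along the following lines. This is a minimal sketch that assumes a rank correlation such as Spearman's; the article does not say which coefficient the researchers used, and the lengths and scores below are invented for illustration.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: output lengths in tokens and human scores on a 1-5 scale.
lengths = [35, 60, 120, 240, 400]
simplicity_scores = [4.5, 4.0, 3.5, 2.5, 2.0]
rho = spearman(lengths, simplicity_scores)
```

In this toy data the scores fall steadily as length grows, so the coefficient is strongly negative, which mirrors the tendency described above for the phrase-level and structure-level simplicity scores.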
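At present, the automatic evaluation metric most commonly used for sentence-level simplification is SARI. Using concrete examples, the researchers illustrate the problems that arise when SARI is applied directly to document-level text simplification, and on the basis of SARI they propose the D-SARI metric for the document-level task. D-SARI retains SARI's proven strategy of computing separate add, keep, and delete scores; its key change is to apply penalties based on differences in text length. The idea comes from BLEU: a candidate text should be neither too long nor too short. In sentence-level simplification, the simplified sentence differs little in length from the original sentence, whereas in document-level simplification the opposite holds. An original article may be very long, while a simplified output might contain only a single sentence. Such an output is simple enough, but it is not a good simplification of the original article. A reasonable position is that the length of the simplified article should be close to that of the reference article. Statistical analysis shows that, among several metrics, D-SARI correlates most strongly with human ratings, as shown in the table below: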
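D-SARI has the highest correlation with O-simplicity, exceeding both BLEU and SARI. On Simplicity-phrase and Simplicity-structure, D-SARI also correlates with human ratings more strongly than SARI does. FKGL has the highest correlation on these two criteria, but it is unrelated to O-simplicity. In addition, BLEU correlates very weakly with the Meaning and Grammar criteria, possibly because simplification involves a large number of sentence splitting operations; this is consistent with the conclusion of Sulem et al.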
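The idea of a length-based penalty borrowed from BLEU can be illustrated with a short sketch. The first function is BLEU's standard brevity penalty; the second is a hypothetical two-sided variant that penalizes outputs that are either too short or too long relative to the reference. Neither function is the D-SARI formula, which is defined in the paper; the names and the symmetric form are assumptions made here only to show the general mechanism.

```python
import math


def brevity_penalty(output_len, reference_len):
    """BLEU's brevity penalty: 1 if the output is longer than the reference,
    otherwise exp(1 - reference_len / output_len)."""
    if output_len <= 0:
        return 0.0
    if output_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / output_len)


def two_sided_length_penalty(output_len, reference_len):
    """Illustrative penalty (not from the paper) that equals 1 only when the
    output and the reference have the same length."""
    if output_len <= 0 or reference_len <= 0:
        return 0.0
    shorter, longer = sorted((output_len, reference_len))
    return math.exp(1 - longer / shorter)


# A one-sentence output (20 tokens) against a 100-token reference is
# penalized heavily, even though it may be very simple.
short_output_penalty = two_sided_length_penalty(20, 100)
```

Multiplying a component score by such a factor is one way to keep a trivially short output from scoring well, which is the failure case described above.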
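The researchers chose the Transformer, SUC, BertSumextabs, and BART models as baselines for the new task. Experiments were run on the D-Wikipedia test set and the Newsela Simp-4 test set. The results on the D-Wikipedia test set are shown in the table below: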
The BertSumextabs model achieved the best D-SARI score. The BART model achieved the best SARI and BLEU scores. The Transformer model achieved the best FKGL score.
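For readers unfamiliar with FKGL (Flesch-Kincaid Grade Level), the sketch below shows the standard formula, for which a lower value indicates text that is easier to read. The function takes pre-computed counts as arguments; how words, sentences, and syllables are counted is left open, and the example counts are hypothetical.

```python
def fkgl(num_words, num_sentences, num_syllables):
    """Flesch-Kincaid Grade Level from raw counts (lower means easier to read)."""
    if num_words <= 0 or num_sentences <= 0:
        raise ValueError("num_words and num_sentences must be positive")
    words_per_sentence = num_words / num_sentences
    syllables_per_word = num_syllables / num_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical counts: 100 words in 5 sentences with 150 syllables.
grade = fkgl(100, 5, 150)
```

Because the formula depends only on sentence length and syllable counts, a model can obtain a good FKGL value without producing an accurate or fluent simplification.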
The results on the Newsela Simp-4 test set are shown in the table below:
The BART model achieved the best D-SARI score, although the scores are clearly lower than those obtained on D-Wikipedia.
The researchers also carried out a human evaluation using the new criteria. The results are shown in the table below:
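Overall, the BART and BertSumextabs models perform better than the other two models, especially on O-simplicity. Directly applying the sentence simplification model SUC did not produce good results, which indicates that document-level simplification differs substantially from sentence-level simplification.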
Original article: atal bihari vajpayee ( ; 25 december 1924 – 16 august 2018 ) was an indian statesman who served three terms as the prime minister of india ,first for a term of 13 days in 1996 , then for a period of 13 months from 1998 to 1999 , followed by a full term from 1999 to 2004 . a member of the bharatiya janata party ( bjp ) , he was the first indian prime minister not of the indian national congress to serve a full term in office . he was also noted as a poet and a writer . he was a member of the indian parliament for over five decades , having been elected ten times to the lok sabha , the lower house , and twice to the rajya sabha , the upper house . he served as the member of parliament for lucknow , retiring from active politics in 2009 due to health concerns . he was among the founding members of the bharatiya jana sangh ( bjs ) , of which he was president from 1968 to 1972 . the bjs merged with several other parties to form the janata party , which won the 1977 general election . in march 1977 , vajpayee became the minister of external affairs in the cabinet of prime minister morarji desai . he resigned in 1979 , and the janata alliance collapsed soon after . former members of the bjs formed the bjp in 1980 , with vajpayee its first president . during his tenure as prime minister , india carried out the pokhran - ii nuclear tests in 1998 . vajpayee sought to improve diplomatic relations with pakistan , travelling to lahore by bus to meet with prime minister nawaz sharif . after the 1999 kargil war with pakistan , he sought to restore relations through engagement with president pervez musharraf , inviting him to india for a summit at agra . the administration of narendra modi declared in 2014 that vajpayee ’s birthday , 25 december , would be marked as good governance day . in 2015 , he was conferred india ’s highest civilian honour , the bharat ratna , by the president of india , pranab mukherjee . he died on 16 august 2018 of age - related illness .
Reference article: atal bihari vajpayee ( 25 december 1924 – 16 august 2018 ) was an indian politician . he served as the 10th prime minister of india for six years from 1998 through 2004 . he was also serving as prime minister from may 1996 through june 1996 . he founded the bharatiya janata party . vajpayee was born on 25 december 1924 in gwalior , united provinces , british india ( now madhya pradesh , india ) . he studied at victoria college ( now laxmibai college ) and at dav college , kanpur . vajpayee was never married and had no children . vajpayee , who had dementia and diabetes , died at a new delhi hospital from multiple organ failure on 16 august 2018 , aged 93 .
Transformer model: amar prasad ( 25 december 1924 – 16 august 2018 ) was an indian politician . he was a member of the indian national congress party . he served as prime minister of india from 1999 to 2004 . he was also the first prime minister of india . he was born in <unk>, <unk> died on 16 august 2018 at the age of 93 .
SUC model: atal bihari vajpayee ( ; 25 december 1924 – 16 august 2018 ) was an indian statesman who served three terms as the prime minister of india , first for a term of 13 days in 1996 , then for a period of 13 months from 1998 to 1999 , followed by a full term from 1999 to 2004 . a member of the bharatiya janata party ( bjp ) , he was the first indian prime minister not of the indian national congress to be elected into the full term in the first indian prime minister . he was the first indian prime minister of the indian national congress and was called the bharatiya janata party ( bjp ) . he was a member of the indian parliament for over five decades and has been elected ten times to the lok sabha , the lower house , and twice to the rajya sabha , the upper house . he was elected ten times to the lok sabha , the lower house , and twice to the rajya sabha , the upper house . he served as the member of parliament for lucknow , retiring from active politics in 2009 due to health concerns . he was among the founding members of the bharatiya jana sangh ( bjs ) , of which he was president from 1968 to 1972 . the bjs merged with several other parties to form the janata party , which won the 1977 general election . it to make the janata party , he won the 1977 general election . in march 1977 , vajpayee became the minister of external affairs in the cabinet of prime minister morarji desai . he became the minister of external affairs in the cabinet of prime minister morarji desai . he left in 1979 and left the alliance collapsed soon after . in 1980 ’s first president.during members of prime minister , india carried out the bjp nuclear tests as prime minister . in 1998 he became prime minister , india , and created the pokhran-ii nuclear tests . vajpayee sought to improve diplomatic relations with pakistan , travelling to lahore by bus to meet with prime minister nawaz sharif .
after the 1999 kargil war with pakistan , he sought to restore relations through engagement with president pervez musharraf , inviting him to india for a summit at agra.the administration of narendra modi declared in 2014 that vajpayee ’s birthday , 25 december , would be marked as good governance day . in 2015 , he was conferred india ’s highest civilian honour , the bharat ratna , by the president of india , pranab mukherjee . he died on 16 august 2018 of age-related illness .
BertSumextabs model: atal bihari vajpayee ( ; 25 december 1924 – 16 august 2018 ) was an indian politician . he served three terms as the prime minister of india , first for a term of 13 days in 1996 , then for a period of 13 months from 1998 to 1999 , followed by a full term from 1999 to 2004 . a member of the bharatiya janata party ( bjp ) , he was the first indian prime minister not of the indian national congress .
BART model: atal bihari vajpayee ( 25 december 1924 – 16 august 2018 ) was an indian politician . he was the first indian prime minister not of the indian national congress to serve a full term in office.he was elected as a member of the indian parliament for over five decades , having been elected ten times to the lok sabha , the lower house , and twice to the rajya sabha , the upper house . he served as the member of parliament for lucknow , retiring from active politics in 2009 due to health concerns.he was among the founding members of the bharatiya jana sangh ( bjs ) , of which he was president from 1968 to 1972.in march 1977 , vajpayee became the minister of external affairs in the cabinet of prime minister morarji desai . he resigned in 1979 , and the janata alliance collapsed soon after . former members of the bjs formed the bjp in 1980 , with vajpayee its first president.during his term as prime minister , india carried out the pokhran-ii nuclear tests in 1998 . vajpayee sought to improve diplomatic relations with pakistan , travelling to lahore by bus to meet prime minister nawaz sharif . after the 1999 kargil war with pakistan , he sought to restore relations through engagement with president pervez musharraf , inviting him to india for a summit at agra. vajpayee died on 16 august 2018 in lucknow , aged 93 .
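Analysis: In this example, the outputs of the SUC and BART models are too long and retain many complex sentences from the input article. The output of the Transformer model contains many factual errors and is hard to read. The BertSumextabs model simplified a relatively unimportant sentence of the original article, and the simplification is not reasonable: because it deletes the key information "to serve a full term in office", the meaning of the sentence may be changed. In addition, the BertSumextabs model does not retain the information in the original article about the person's death.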
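The example above also shows that the output of BertSumextabs, the best performing of the four models, still retains some complex vocabulary and sentence structures compared with the reference article, and that the models' ability to select important information needs further improvement. Moreover, the results of the SUC model are far below those of all the other models, which shows that document-level simplification cannot be solved by concatenating the outputs of sentence simplification into a simplified article. New models designed specifically for document-level simplification can be expected in the future, and they should greatly advance this field.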
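To advance research on document-level text simplification, the researchers built a large-scale, high-quality dataset named D-Wikipedia and proposed a new automatic evaluation metric, D-SARI. They also selected several representative models as baselines for the task. The results indicate that the D-Wikipedia dataset is of high quality and that D-SARI is more reliable than SARI.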
Sun Renliang (Peking University)
Editors: Li Piji, Yang Muyun