
Optimizing Dense Entity Recognition: JD Product Named Entity Recognition as a Case Study

Source: https://zhuanlan.zhihu.com/p/547067960



1. Introduction

Dense entity recognition has long been a difficult application scenario: it is prone to boundary errors and missed entities, so the algorithm must locate entity positions precisely. The JD product recognition scenario is extremely dense: entities sit right next to each other, there are as many as 52 entity types, and the text mixes Chinese and English.
(Figure: data example)


2. Data Processing and Model Selection

Since the organizers used the old-fashioned BIO tagging scheme, it is easiest to convert it to span-based annotation for readability. The organizers provided thirty to forty thousand labeled samples and one million unlabeled ones; for the unlabeled data, my first thoughts were continued pre-training and semi-supervised learning.

{"text": "笔筒创意时尚小清新学生儿童摆件简约多功能收纳笔座办公文具盒子男女欧式复古桌面上可爱摆设彩色塑料格调 黑白羽毛(2个装)||文本中有59个字||实体类别数一共52类", "label": [[0, 1, "4", "笔筒"], [2, 3, "14", "创意"], [4, 5, "14", "时尚"], [6, 8, "14", "小清新"], [9, 10, "8", "学生"], [11, 12, "8", "儿童"], [13, 14, "4", "摆件"], [15, 16, "14", "简约"], [17, 19, "11", "多功能"], [20, 21, "11", "收纳"], [22, 23, "4", "笔座"], [24, 25, "5", "办公"], [26, 29, "4", "文具盒子"], [30, 31, "8", "男女"], [32, 33, "14", "欧式"], [34, 35, "14", "复古"], [36, 38, "7", "桌面上"], [39, 40, "14", "可爱"], [41, 42, "4", "摆设"], [43, 44, "16", "彩色"], [45, 46, "12", "塑料"], [47, 48, "14", "格调"], [50, 53, "10", "黑白羽毛"], [55, 57, "18", "2个装"]]}
{"text": "名片盒+亚克力A4三折页多层资料桌面展示架宣传架彩页架书报刊架 A6三层资料架||文本中有39个字||实体类别数一共52类", "label": [[0, 2, "40", "名片盒"], [4, 6, "12", "亚克力"], [7, 8, "18", "A4"], [9, 11, "13", "三折页"], [12, 13, "13", "多层"], [14, 15, "9", "资料"], [16, 17, "7", "桌面"], [18, 20, "4", "展示架"], [21, 23, "4", "宣传架"], [27, 30, "4", "书报刊架"], [32, 33, "18", "A6"], [34, 35, "13", "三层"], [36, 38, "4", "资料架"]]}
{"text": "优易思 卡片U盘名片U盘定制2.0 投标U盘logo礼品商务优盘创意个性8G16G32G 批量定制 水瓶座 1G||文本中有56个字||实体类别数一共52类", "label": [[0, 2, "1", "优易思"], [4, 5, "13", "卡片"], [6, 7, "4", "U盘"], [8, 9, "13", "名片"], [10, 11, "4", "U盘"], [12, 13, "29", "定制"], [14, 16, "18", "2.0"], [18, 19, "5", "投标"], [20, 21, "4", "U盘"], [22, 27, "40", "logo礼品"], [28, 29, "14", "商务"], [30, 31, "4", "优盘"], [32, 33, "14", "创意"], [34, 35, "14", "个性"], [36, 43, "18", "8G16G32G"], [45, 48, "29", "批量定制"], [50, 52, "10", "水瓶座"], [54, 55, "18", "1G"]]}
{"text": "取暖器家用节能速热电火箱暖脚器烤火炉电火桶冬天 豪华双人款70x32cm||文本中有36个字||实体类别数一共52类", "label": [[0, 2, "4", "取暖器"], [3, 4, "7", "家用"], [5, 6, "11", "节能"], [7, 8, "11", "速热"], [9, 11, "4", "电火箱"], [12, 14, "4", "暖脚器"], [15, 17, "4", "烤火炉"], [18, 20, "4", "电火桶"], [21, 22, "6", "冬天"], [24, 25, "14", "豪华"], [26, 28, "13", "双人款"], [29, 35, "18", "70x32cm"]]}
{"text": "贝迪低温标签,冰柜标签,实验室标签,瓶盖签,冻存架,冻存盒,提血库,离心管B-6421BP-IP300定制标签||文本中有55个字||实体类别数一共52类", "label": [[0, 1, "1", "贝迪"], [2, 3, "11", "低温"], [4, 5, "4", "标签"], [7, 8, "40", "冰柜"], [9, 10, "4", "标签"], [12, 14, "7", "实验室"], [15, 16, "4", "标签"], [18, 20, "4", "瓶盖签"], [22, 24, "40", "冻存架"], [26, 28, "40", "冻存盒"], [30, 32, "7", "提血库"], [34, 36, "9", "离心管"], [43, 50, "39", "BP-IP300"], [51, 52, "13", "定制"], [53, 54, "4", "标签"]]}
{"text": "全透明吃鸡神器刺激战场手游手柄按键式辅助神器新款安卓苹果x指绝地求生手机外设和平精英食鸡套装 S3透明【银色//一对装】+手柄||文本中有63个字||实体类别数一共52类", "label": [[0, 2, "16", "全透明"], [3, 6, "4", "吃鸡神器"], [7, 10, "5", "刺激战场"], [11, 12, "5", "手游"], [13, 14, "4", "手柄"], [15, 17, "13", "按键式"], [18, 21, "4", "辅助神器"], [22, 23, "14", "新款"], [24, 25, "47", "安卓"], [26, 28, "38", "苹果x"], [30, 33, "5", "绝地求生"], [34, 35, "40", "手机"], [38, 41, "5", "和平精英"], [49, 50, "16", "透明"], [52, 53, "16", "银色"], [56, 58, "18", "一对装"], [61, 62, "4", "手柄"]]}
{"text": "Qisheng奇声功放机家用专业家庭影院大功率蓝牙hifi数字5.1放大器 2.0升级版||文本中有44个字||实体类别数一共52类", "label": [[0, 6, "10", "Qisheng"], [7, 8, "10", "奇声"], [9, 11, "4", "功放机"], [12, 13, "7", "家用"], [16, 19, "7", "家庭影院"], [20, 22, "11", "大功率"], [23, 24, "11", "蓝牙"], [25, 28, "11", "hifi"], [29, 30, "10", "数字"], [31, 33, "18", "5.1"], [34, 36, "4", "放大器"], [38, 43, "13", "2.0升级版"]]}
{"text": "朗捷(longe)新党章a5党员学习笔记本党会笔记本党员本子会议记录本开会本定制单位logo a5党员本-贴心款||文本中有56个字||实体类别数一共52类", "label": [[0, 1, "1", "朗捷"], [3, 7, "1", "longe"], [9, 11, "10", "新党章"], [12, 13, "18", "a5"], [14, 15, "8", "党员"], [16, 17, "5", "学习"], [18, 20, "4", "笔记本"], [21, 22, "5", "党会"], [23, 25, "4", "笔记本"], [26, 27, "8", "党员"], [28, 29, "4", "本子"], [30, 31, "5", "会议"], [32, 34, "4", "记录本"], [35, 37, "4", "开会本"], [38, 39, "29", "定制"], [40, 41, "9", "单位"], [42, 45, "10", "logo"], [47, 48, "18", "a5"], [49, 51, "4", "党员本"], [53, 55, "14", "贴心款"]]}
{"text": "线康4K2.0高清hdmi线10/15/20/25/30米5延长电脑电视投影仪连接线 1.4版本 HDMI线 25米||文本中有58个字||实体类别数一共52类", "label": [[0, 1, "4", "线康"], [7, 8, "11", "高清"], [9, 13, "4", "hdmi线"], [14, 28, "18", "10/15/20/25/30米"], [32, 33, "40", "电脑"], [34, 35, "40", "电视"], [36, 38, "40", "投影仪"], [39, 41, "4", "连接线"], [49, 53, "4", "HDMI线"], [55, 57, "18", "25米"]]}
{"text": "夏新 S 9不入耳蓝牙耳机单耳无线迷你超小耳塞挂耳式运动开车骨传导概念超长待机苹果安卓通用可接听电话 高贵红色 【送收纳盒+三件套】 官方标配||文本中有71个字||实体类别数一共52类", "label": [[0, 1, "1", "夏新"], [3, 5, "3", "S 9"], [6, 8, "13", "不入耳"], [9, 12, "4", "蓝牙耳机"], [13, 14, "13", "单耳"], [15, 16, "11", "无线"], [17, 18, "14", "迷你"], [19, 20, "13", "超小"], [21, 22, "13", "耳塞"], [23, 25, "13", "挂耳式"], [26, 27, "5", "运动"], [28, 29, "5", "开车"], [30, 34, "11", "骨传导概念"], [35, 38, "11", "超长待机"], [39, 40, "40", "苹果"], [41, 42, "40", "安卓"], [43, 44, "11", "通用"], [45, 49, "11", "可接听电话"], [51, 52, "14", "高贵"], [53, 54, "16", "红色"], [58, 60, "22", "收纳盒"], [62, 64, "36", "三件套"], [69, 70, "18", "标配"]]}

# Data processing: convert the organizers' BIO-tagged train.txt into
# span-annotated JSON, then split off 10% as a dev set.
import math
import jsonlines
import random
from ark_nlp.factory.utils.conlleval import get_entity_bio
random.seed(1234)

# Target span format per entity:
# (start_idx, end_idx, type, text[start_idx: end_idx + 1])
datalist = []
with open('./train.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
    lines.append('\n')  # sentinel so the last sentence is flushed

    text = []
    labels = []
    label_set = set()

    for line in lines:
        if line == '\n':  # a blank line ends one sentence
            text = ''.join(text)
            entity_labels = []
            for _type, _start_idx, _end_idx in get_entity_bio(labels, id2label=None):
                entity_labels.append((_start_idx, _end_idx, _type, text[_start_idx: _end_idx + 1]))

            if text != '':
                datalist.append({
                    'text': text,
                    'label': entity_labels
                })

            # reset even when the sentence was empty, otherwise the next
            # iteration would call .append() on a string
            text = []
            labels = []

        elif line == '  O\n':  # a space character tagged O
            text.append(' ')
            labels.append('O')
        else:
            line = line.strip('\n').split()
            if len(line) == 1:  # the token itself is a space
                term = ' '
                label = line[0]
            else:
                term, label = line
            text.append(term)
            label_set.add(label.split('-')[-1])
            labels.append(label)

num = math.floor(len(datalist) * 0.1)
start = 0
end = num
w1 = jsonlines.open('./gpointertf/data/train.json', 'w')
w2 = jsonlines.open('./gpointertf/data/dev.json', 'w')

dev = datalist[start:end]
train = datalist[0:start] + datalist[end:]
print("training samples:", len(train))
print("dev samples:", len(dev))

# append the text-length descriptor after the raw text; it sits after all
# entities, so the span offsets stay valid
for k1 in train:
    k1['text'] = k1['text'] + '||' + '文本中有' + str(len(k1['text'])) + '个字'
    w1.write(k1)
for k2 in dev:
    k2['text'] = k2['text'] + '||' + '文本中有' + str(len(k2['text'])) + '个字'
    w2.write(k2)
w1.close()
w2.close()

Model choice: GlobalPointer

Reason: although there are no overlapping entities here, with enough data a pointer network tends to outperform a CRF, especially on data with long entities.

Su Jianlin's GlobalPointer article:
GlobalPointer: handling nested and flat NER in a unified way - 科学空间|Scientific Spaces
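The core idea behind GlobalPointer can be sketched in a few lines: for each entity type, every token gets a query and a key vector, and the inner product of query i with key j scores the span [i, j] as an entity of that type; any span scoring above a threshold is decoded as an entity. The sketch below is a minimal NumPy illustration under assumed shapes (real implementations add rotary position embeddings and train the projections with a multi-label loss; `Wq`, `Wk`, and the shapes are hypothetical).

```python
import numpy as np

def global_pointer_scores(hidden, Wq, Wk):
    """hidden: (seq_len, d); Wq/Wk: (num_types, d, head_dim).
    Returns scores of shape (num_types, seq_len, seq_len), where
    scores[t, i, j] rates span [i, j] as an entity of type t."""
    q = np.einsum('nd,tdh->tnh', hidden, Wq)   # per-type queries
    k = np.einsum('nd,tdh->tnh', hidden, Wk)   # per-type keys
    scores = np.einsum('tih,tjh->tij', q, k)   # pairwise span scores
    # Mask the lower triangle: a span's start must not exceed its end.
    n = hidden.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool))
    return np.where(mask[None], scores, -np.inf)

def decode_spans(scores, threshold=0.0):
    """Return (type, start, end) for every span scoring above threshold."""
    return [tuple(idx) for idx in np.argwhere(scores > threshold)]
```

Because every (type, start, end) triple is scored jointly, nested and flat entities fall out of the same decoding step, which is exactly why the method handles both cases uniformly.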


3. Optimizing Entity Recognition Performance

Competition models these days are all much the same, so it is hard to beat others on model architecture alone. The same holds for industrial applications: you need to optimize for the specific scenario.

  • Incorporating text-length information, MRC-style

MRC-style NER writes a separate description sentence for each entity type to improve per-type extraction, but it has a fatal drawback: each type must be extracted in its own pass. That is acceptable with few types, but with many types the approach is simply impractical in real scenarios. So I looked for a single generic piece of information that could form one description sentence, for example appending "文本中有64个字" ("the text has 64 characters") after the text: it does not disturb the entity offsets, yet injects the text-length information.

Rationale: the number of entities is roughly positively correlated with text length, all the more so for dense entities. The descriptor lets the BERT model learn the relation between entity count and text length, improving recall.

Summary: I have used this trick on several NER datasets and it reliably gains a few tenths of a point, making it a fairly useful improvement.
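The trick above reduces to a tiny helper (the function name is mine, not from the original code): append the length descriptor after the text, so every existing span annotation remains valid unchanged.

```python
def add_length_hint(text: str) -> str:
    """Append the MRC-style descriptor encoding the text length."""
    return f"{text}||文本中有{len(text)}个字"

def labels_still_valid(text, labels):
    """Check that each (start, end, type, entity) span still matches
    after the hint is appended; offsets are untouched because the
    descriptor sits strictly after the original text."""
    hinted = add_length_hint(text)
    return all(hinted[s:e + 1] == ent for s, e, _t, ent in labels)
```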

  • Multi-model pseudo-labeling (semi-supervised learning)

Knowledge distillation teaches the model through soft labels; multi-model pseudo-labeling instead supplements it with hard labels, which helps noticeably when labeled data is scarce. The idea is simple: several models predict on the unlabeled data, and the predictions are added to the base model's training data, again gaining a few tenths of a point.
(Figure: multi-model pseudo-labeling)
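The agreement step can be sketched as follows. This is my own minimal reading of the pipeline, not the author's exact code: each trained model (`predict_fns` are hypothetical callables) emits (start, end, type) spans for an unlabeled sentence, and only spans on which enough models agree are kept as hard pseudo-labels.

```python
from collections import Counter

def pseudo_label(text, predict_fns, min_votes=2):
    """Keep only spans predicted by at least min_votes models,
    returning a hard-labeled sample ready to merge into training."""
    votes = Counter()
    for predict in predict_fns:
        votes.update(set(predict(text)))  # one vote per model per span
    spans = [span for span, n in votes.items() if n >= min_votes]
    return {'text': text, 'label': sorted(spans)}
```

Raising `min_votes` trades recall for precision of the pseudo-labels; with noisy models, requiring full agreement keeps the added hard labels clean.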

  • Continued pre-training

Pre-trained models, whether BERT, RoBERTa, or NEZHA, are basically trained on well-formed general text, which may differ from vertical-domain data like JD's; continued pre-training can narrow that domain gap. Pre-training from scratch is very costly, demands serious hardware, and will quite possibly still not match the original open-source release. Continued pre-training is far cheaper: a single 2080 Ti is enough to get started.
归来仍是少年: Exploring continued pre-training of NEZHA (based on bert4keras)

The link above is code I wrote earlier for pre-training a NEZHA model with the bert4keras framework; feel free to use it as a reference. It usually also gains a few tenths of a point.
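The objective of continued pre-training is the same masked-language-model loss as the original model. A framework-free sketch of BERT-style dynamic masking (15% of tokens; of those, 80% become [MASK], 10% a random token, 10% unchanged; token ids, the mask id, and the vocab size are placeholder assumptions):

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, rng, mask_prob=0.15):
    """Return (inputs, labels): labels hold the original id at masked
    positions and -100 (ignore index) elsewhere, as in BERT's MLM loss."""
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels.append(tok)                    # predict the original token
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id               # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the token unchanged
        else:
            labels.append(-100)                   # ignored by the loss
    return inputs, labels
```

Feeding domain text (here, JD product titles) through this objective for a few epochs is all "continued pre-training" means; the model weights are initialized from the open-source checkpoint rather than from scratch.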

  • Adversarial perturbation

Adversarial perturbation is standard practice by now and needs little introduction; interested readers can find plenty of blog posts about it.


4. Conclusion

A basic task like this with no real difficulty inevitably gets very competitive, and there are plenty of strong players. I mainly wanted to test some off-the-cuff ideas, and injecting text information did work quite well for named entity recognition. Since starting a job I no longer have the energy for competitions; I scraped into the semifinals and then stopped, as grinding over tenths or hundredths of a point afterward is not very meaningful. My focus was mostly on data-level thinking. During the competition I also hit upon an idea for entity disambiguation, which I may introduce another time.


Original article: https://www.6aiq.com/article/1677516664814
Copyright is shared by the author and AIQ. Reposting is welcome, but without the author's consent this notice must be retained and displayed prominently on the article page.