[Information Extraction & LLM] Controlling Structured LLM Output, with a Resume Information Extraction Example


Preface

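When using an LLM for information extraction, it is important to make the model's output controllable and stable (for example, consistently well-formed JSON), because downstream code depends on the extracted data. Common approaches are: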

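Fine-tuning: fine-tune the model so that it emits a stable output format (JSON, etc.)
Few-shot prompting: include a few examples in the prompt and ask the model to produce output in the same format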

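Even so, output can still be unstable in practice. The usual remedy is to validate the output, for example checking that the JSON parses when JSON was requested, and to re-send the request when validation fails, repeating until the output has the required structure.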
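
The validate-and-retry loop described above might look roughly like the sketch below. `call_llm` is a hypothetical stand-in for whatever function sends a prompt to your model and returns its raw text; `required_keys` and `max_retries` are likewise illustrative names, not part of any library.

```python
import json
from typing import Callable, Optional


def generate_json_with_retry(
    call_llm: Callable[[str], str],  # hypothetical: prompt in, raw model text out
    prompt: str,
    required_keys: tuple = (),
    max_retries: int = 3,
) -> Optional[dict]:
    """Call the model until its output parses as a JSON object, or give up."""
    for _ in range(max_retries):
        raw = call_llm(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: ask again
        # Accept only a JSON object that contains every required key.
        if isinstance(parsed, dict) and all(k in parsed for k in required_keys):
            return parsed
    return None  # every attempt failed validation


# Usage with a fake model that returns truncated JSON once, then valid JSON.
_replies = iter(['{"fullName": "张三"', '{"fullName": "张三"}'])
result = generate_json_with_retry(lambda p: next(_replies), "some prompt", ("fullName",))
print(result)
```

Each retry costs a full model call, which is the main motivation for constraining the output during generation instead.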

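Below I introduce outlines, a tool for controlling model output, and show how to use it with a small demo from the recruiting domain: extracting information from a resume into structured JSON.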

I. Hands-on

Straight to the demo code and usage.

Installation:

pip install outlines

1. Resume text (input)

张三
男 | 年龄：25岁 | 籍贯：北京 | 共产党员 | 18794434244
求职意向：算法工程师 | 期望城市：北京
个人优势
擅长领域：深度学习、CV 、图像识别、语义分割、目标检测、自动驾驶感知算法、JavaWeb开发
专业技能：熟悉 Python 、PyTorch 框架、熟悉 Java 、MySQL、SpringMVC、SpringBoot、Office
教育经历
北京工业大学 硕士 软件工程 2012-2015
担任职务：党支部书记；主修课程：深度学习、图像识别、语义分割、遥感图像处理、软件工程、时空大数据
河南工业大学 本科 计算机科学与技术 2008-2012
担任职务：班长、党支部书记；主修课程：Java、数据库、操作系统、计算机网络、数据结构
实习经历
大模型自然语言处理科技（北京）有限公司 算法工程师 2023.12-2024.03
● 负责数据采集、清洗并标注2D、3D驾驶数据，确保数据质量和多样性
● 负责自动驾驶感知模块的算法开发与优化，利用行车影像数据进行模型迭代优化
项目经历
图像识别 总负责人 2022.09-至今
● 设计了一种高性能深度学习网络 MT-AENet ，用于遥感图像中建筑物提取、建筑垃圾分割、道路提取等任务，并开发出建筑物
总览可视化系统，实现对城市建筑物变化动态监测
信息管理系统（SSM框架） 项目设计师 2024.02-2024.03
负责高校党务信息管理系统的整体架构设计，确保系统功能模块化、高效稳定
技术栈：Java、MySQL、MyBatis、HTML、CSS、JavaScript、Vue.js、AJAX、Spring MVC、Maven、Git
● 使用 MySQL 数据库设计，MyBatis 框架实现数据持久层的开发，提高 JDBC 开发效率
● 使用 HTML、CSS 和 JavaScript 技术， 结合 Element 组件库，快速构建响应式前端网页界面

2. Schema definition

This step defines the entity types to extract from the resume, expressed as a JSON Schema that outlines can accept.

{
    "title": "Resume",
    "type": "object",
    "properties": {
        "fullName": {
            "title": "全名",
            "maxLength": 50,
            "type": "string"
        },
        "contact": {
            "title": "联系方式",
            "type": "object",
            "properties": {
                "phone": {
                    "title": "电话号码",
                    "type": "string"
                }
            },
            "required": ["phone"]
        },
        "education": {
            "title": "教育背景",
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "degree": {
                        "title": "学位",
                        "type": "string"
                    },
                    "institution": {
                        "title": "学校",
                        "type": "string"
                    },
                    "fieldOfStudy": {
                        "title": "专业",
                        "type": "string"
                    },
                    "graduationYear": {
                        "title": "毕业年份",
                        "type": "integer"
                    }
                },
                "required": ["degree", "institution", "fieldOfStudy", "graduationYear"]
            }
        },
        "experience": {
            "title": "工作经验",
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "jobTitle": {
                        "title": "职位",
                        "type": "string"
                    },
                    "company": {
                        "title": "公司",
                        "type": "string"
                    },
                    "duration": {
                        "title": "任职时间",
                        "type": "string"
                    },
                    "responsibilities": {
                        "title": "职责",
                        "type": "array",
                        "items": {
                            "type": "string"
                        }
                    }
                },
                "required": ["jobTitle", "company", "duration", "responsibilities"]
            }
        },
        "skills": {
            "title": "技能",
            "type": "array",
            "items": {
                "type": "string"
            }
        }
    },
    "required": ["fullName", "contact", "education", "experience", "skills"]
}
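
To make the `type`, `required`, `items` and `maxLength` keywords used above concrete, here is a minimal checker for that small subset of JSON Schema. It is a sketch for illustration only: the trimmed schema and the `conforms` helper are assumptions of this example, and a full validator library would be the better choice in real code.

```python
import json

# A trimmed copy of the resume schema above (only two fields kept for brevity).
schema = json.loads('''{
    "type": "object",
    "properties": {
        "fullName": {"type": "string", "maxLength": 50},
        "skills": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["fullName", "skills"]
}''')

_PY_TYPES = {"object": dict, "array": list, "string": str, "integer": int}


def conforms(value, node) -> bool:
    """Check value against the small subset of JSON Schema used in this article."""
    expected = _PY_TYPES.get(node.get("type"))
    if expected is not None and not isinstance(value, expected):
        return False
    if isinstance(value, bool) and node.get("type") == "integer":
        return False  # bool is a subclass of int in Python, but not a JSON integer
    if isinstance(value, str) and len(value) > node.get("maxLength", len(value)):
        return False
    if isinstance(value, dict):
        if any(key not in value for key in node.get("required", [])):
            return False
        props = node.get("properties", {})
        return all(conforms(v, props[k]) for k, v in value.items() if k in props)
    if isinstance(value, list) and "items" in node:
        return all(conforms(item, node["items"]) for item in value)
    return True


print(conforms({"fullName": "张三", "skills": ["Python"]}, schema))  # True
print(conforms({"fullName": "张三"}, schema))                        # False
```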

3. Complete LLM information-extraction code

import outlines

schema = '''{
    "title": "Resume",
    "type": "object",
    "properties": {
        "fullName": {
            "title": "全名",
            "maxLength": 50,
            "type": "string"
        },
        "contact": {
            "title": "联系方式",
            "type": "object",
            "properties": {
                "phone": {
                    "title": "电话号码",
                    "type": "string"
                }
            },
            "required": ["phone"]
        },
        "education": {
            "title": "教育背景",
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "degree": {
                        "title": "学位",
                        "type": "string"
                    },
                    "institution": {
                        "title": "学校",
                        "type": "string"
                    },
                    "fieldOfStudy": {
                        "title": "专业",
                        "type": "string"
                    },
                    "graduationYear": {
                        "title": "毕业年份",
                        "type": "integer"
                    }
                },
                "required": ["degree", "institution", "fieldOfStudy", "graduationYear"]
            }
        },
        "experience": {
            "title": "工作经验",
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "jobTitle": {
                        "title": "职位",
                        "type": "string"
                    },
                    "company": {
                        "title": "公司",
                        "type": "string"
                    },
                    "duration": {
                        "title": "任职时间",
                        "type": "string"
                    },
                    "responsibilities": {
                        "title": "职责",
                        "type": "array",
                        "items": {
                            "type": "string"
                        }
                    }
                },
                "required": ["jobTitle", "company", "duration", "responsibilities"]
            }
        },
        "skills": {
            "title": "技能",
            "type": "array",
            "items": {
                "type": "string"
            }
        }
    },
    "required": ["fullName", "contact", "education", "experience", "skills"]
}'''

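# Load a local Hugging Face model; replace the placeholder with your model's path
model = outlines.models.transformers(
    "path/to/your/model")
# Build a generator whose output is constrained to the schema
generator = outlines.generate.json(model, schema)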

resume_text = '''张三
男 | 年龄：25岁 | 籍贯：北京 | 共产党员 | 18794434244
求职意向：算法工程师 | 期望城市：北京
个人优势
擅长领域：深度学习、CV 、图像识别、语义分割、目标检测、自动驾驶感知算法、JavaWeb开发
专业技能：熟悉 Python 、PyTorch 框架、熟悉 Java 、MySQL、SpringMVC、SpringBoot、Office
教育经历
北京工业大学 硕士 软件工程 2012-2015
担任职务：党支部书记；主修课程：深度学习、图像识别、语义分割、遥感图像处理、软件工程、时空大数据
河南工业大学 本科 计算机科学与技术 2008-2012
担任职务：班长、党支部书记；主修课程：Java、数据库、操作系统、计算机网络、数据结构
实习经历
大模型自然语言处理科技（北京）有限公司 算法工程师 2023.12-2024.03
● 负责数据采集、清洗并标注2D、3D驾驶数据，确保数据质量和多样性
● 负责自动驾驶感知模块的算法开发与优化，利用行车影像数据进行模型迭代优化
项目经历
图像识别 总负责人 2022.09-至今
● 设计了一种高性能深度学习网络 MT-AENet ，用于遥感图像中建筑物提取、建筑垃圾分割、道路提取等任务，并开发出建筑物
总览可视化系统，实现对城市建筑物变化动态监测
信息管理系统（SSM框架） 项目设计师 2024.02-2024.03
负责高校党务信息管理系统的整体架构设计，确保系统功能模块化、高效稳定
技术栈：Java、MySQL、MyBatis、HTML、CSS、JavaScript、Vue.js、AJAX、Spring MVC、Maven、Git
● 使用 MySQL 数据库设计，MyBatis 框架实现数据持久层的开发，提高 JDBC 开发效率
● 使用 HTML、CSS 和 JavaScript 技术， 结合 Element 组件库，快速构建响应式前端网页界面
'''
# The raw resume text is passed directly as the prompt
resume = generator(resume_text)

print(repr(resume))
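
The demo above passes the raw resume as the prompt. One variation worth trying is to wrap the resume in an explicit instruction first. The sketch below is an assumption of this example, not something the original demo does: `build_extraction_prompt` and its wording are hypothetical, and whether it improves extraction quality depends on the model.

```python
def build_extraction_prompt(resume_text: str) -> str:
    """Wrap the raw resume in an explicit extraction instruction."""
    return (
        "Extract the candidate's information from the resume below "
        "and answer with a single JSON object.\n\n"
        f"Resume:\n{resume_text.strip()}\n\n"
        "JSON:\n"
    )


prompt = build_extraction_prompt("张三\n求职意向：算法工程师\n")
print(prompt)
# The result would then be passed to the generator defined above:
# generator(prompt)
```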

4. Structured JSON output

{
  "fullName": "张三",
  "contact": {
    "phone": "18794434244"
  },
  "education": [
    {
      "degree": "硕士",
      "institution": "北京工业大学",
      "fieldOfStudy": "软件工程",
      "graduationYear": 2015
    },
    {
      "degree": "学士",
      "institution": "河南工业大学",
      "fieldOfStudy": "计算机科学与技术",
      "graduationYear": 2012
    }
  ],
  "experience": [
    {
      "jobTitle": "算法工程师",
      "company": "大模型自然语言处理科技（北京）有限公司",
      "duration": "2023.12-2024.03",
      "responsibilities": [
        "设计了一种高性能深度学习网络 MT-AENet ，用于遥感图像中建筑物提取、建筑垃圾分割、道路提取等任务，并开发出建筑物",
        "负责数据采集、清洗并标注2D、3D驾驶数据，确保数据质量和多样性",
        "负责自动驾驶感知模块的算法开发与优化，利用行车影像数据进行模型迭代优化"
      ]
    },
    {
      "jobTitle": "图像识别",
      "company": "图像识别",
      "duration": "2022.09-至今",
      "responsibilities": [
        "设计了一种高性能深度学习网络 MT-AENet ，用于遥感图像中建筑物提取、建筑垃圾分割、道路提取等任务，并开发出建筑物",
        "实现对城市建筑物变化动态监测的建筑物",
        "传感器数据采集停车场停车车位判断算法优化"
      ]
    },
    {
      "jobTitle": "信息管理系统（SSM框架）",
      "company": "信息管理系统（SSM框架）",
      "duration": "-non",
      "responsibilities": [
        "负责高校党务信息管理系统的整体架构设计，确保系统功能模块化、高效稳定",
        "使用 MySQL 数据库设计，MyBatis 框架实现数据持久层的开发，提高 JDBC 开发效率",
        "使用 HTML、CSS 和 JavaScript 技术， 结合 Element 组件库，快速构建响应式前端网页界面"
      ]
    }
  ],
  "skills": [
    "Python",
    "PyTorch",
    "Java",
    "MySQL",
    "SpringMVC",
    "SpringBoot",
    "Office"
  ]
}
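
Constrained decoding guarantees the shape of the output, not its content: in the result above, the duration "-non" and the responsibility "传感器数据采集停车场停车车位判断算法优化" do not appear anywhere in the resume. A simple post-check, sketched here with a hypothetical helper named `ungrounded_strings`, is to flag every extracted string that is not a verbatim substring of the input. This is only a heuristic, since legitimately normalized values (such as "学士" for "本科") are flagged as well.

```python
def ungrounded_strings(value, source_text: str) -> list:
    """Collect every string in a nested JSON-like value that is not a substring of the source."""
    if isinstance(value, str):
        return [] if value in source_text else [value]
    if isinstance(value, dict):
        value = list(value.values())
    if isinstance(value, list):
        return [s for item in value for s in ungrounded_strings(item, source_text)]
    return []  # numbers, booleans and null are not checked


# A shortened excerpt of the resume and of the model output above.
source = "图像识别 总负责人 2022.09-至今\n● 负责自动驾驶感知模块的算法开发与优化"
extracted = {
    "jobTitle": "图像识别",
    "duration": "-non",
    "responsibilities": [
        "负责自动驾驶感知模块的算法开发与优化",
        "传感器数据采集停车场停车车位判断算法优化",
    ],
}
print(ungrounded_strings(extracted, source))
# ['-non', '传感器数据采集停车场停车车位判断算法优化']
```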

II. Other format-control examples

Multiple choices

import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")

prompt = """You are a sentiment-labelling assistant.
Is the following review positive or negative?

Review: This restaurant is just awesome!
"""

generator = outlines.generate.choice(model, ["Positive", "Negative"])
answer = generator(prompt)

Type constraint

import outlines

model = outlines.models.transformers("WizardLM/WizardMath-7B-V1.1")

prompt = "<s>result of 9 + 9 = 18</s><s>result of 1 + 2 = "
answer = outlines.generate.format(model, int)(prompt)
print(answer)
# 3

prompt = "sqrt(2)="
generator = outlines.generate.format(model, float)
answer = generator(prompt, max_tokens=10)
print(answer)
# 1.41421356

Efficient regex-structured generation

import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")

prompt = "What is the IP address of the Google DNS servers? "

generator = outlines.generate.text(model)
unstructured = generator(prompt, max_tokens=30)

generator = outlines.generate.regex(
    model,
    r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
)
structured = generator(prompt, max_tokens=30)

print(unstructured)
# What is the IP address of the Google DNS servers?
#
# Passive DNS servers are at DNS servers that are private.
# In other words, both IP servers are private. The database
# does not contain Chelsea Manning

print(structured)
# What is the IP address of the Google DNS servers?
# 2.2.6.1
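
The IPv4 pattern can be checked on its own with Python's `re` module, with the backslashes in `\d` and `\.` written out. The candidate strings below are made up for this sketch.

```python
import re

# Four dot-separated octets, each in the range 0-255.
IPV4 = re.compile(r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)")

for candidate in ["2.2.6.1", "8.8.8.8", "256.1.1.1", "1.2.3"]:
    print(candidate, bool(IPV4.fullmatch(candidate)))
# 2.2.6.1 True
# 8.8.8.8 True
# 256.1.1.1 False
# 1.2.3 False
```

Note that the structured answer "2.2.6.1" is a well-formed address but not the address of Google's public DNS (8.8.8.8): the regex constrains the format of the answer, not its correctness.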

Efficient JSON generation following a Pydantic model

from enum import Enum
from pydantic import BaseModel, constr

import outlines
import torch


class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"


class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"


class Character(BaseModel):
    name: constr(max_length=10)
    age: int
    armor: Armor
    weapon: Weapon
    strength: int


model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")

# Construct structured sequence generator
generator = outlines.generate.json(model, Character)

# Draw a sample
rng = torch.Generator(device="cuda")
rng.manual_seed(789001)

character = generator("Give me a character description", rng=rng)

print(repr(character))
# Character(name='Anderson', age=28, armor=<Armor.chainmail: 'chainmail'>, weapon=<Weapon.sword: 'sword'>, strength=8)

character = generator("Give me an interesting character description", rng=rng)

print(repr(character))
# Character(name='Vivian Thr', age=44, armor=<Armor.plate: 'plate'>, weapon=<Weapon.crossbow: 'crossbow'>, strength=125)

Efficient JSON generation following a JSON Schema

import outlines

schema = '''{
    "title": "Character",
    "type": "object",
    "properties": {
        "name": {
            "title": "Name",
            "maxLength": 10,
            "type": "string"
        },
        "age": {
            "title": "Age",
            "type": "integer"
        },
        "armor": {"$ref": "#/definitions/Armor"},
        "weapon": {"$ref": "#/definitions/Weapon"},
        "strength": {
            "title": "Strength",
            "type": "integer"
        }
    },
    "required": ["name", "age", "armor", "weapon", "strength"],
    "definitions": {
        "Armor": {
            "title": "Armor",
            "description": "An enumeration.",
            "enum": ["leather", "chainmail", "plate"],
            "type": "string"
        },
        "Weapon": {
            "title": "Weapon",
            "description": "An enumeration.",
            "enum": ["sword", "axe", "mace", "spear", "bow", "crossbow"],
            "type": "string"
        }
    }
}'''

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
generator = outlines.generate.json(model, schema)
character = generator("Give me a character description")

Using context-free grammars to guide generation

import outlines

arithmetic_grammar = """
    ?start: expression

    ?expression: term (("+" | "-") term)*

    ?term: factor (("*" | "/") factor)*

    ?factor: NUMBER
           | "-" factor
           | "(" expression ")"

    %import common.NUMBER
"""

model = outlines.models.transformers("WizardLM/WizardMath-7B-V1.1")
generator = outlines.generate.cfg(model, arithmetic_grammar)
sequence = generator("Alice had 4 apples and Bob ate 2. Write an expression for Alice's apples:")

print(sequence)
# (8-2)
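
Because the grammar only admits numbers, parentheses and the four arithmetic operators, the generated string can be evaluated afterwards. The sketch below is one way to do that without calling `eval`, using the standard `ast` module; the `evaluate` helper is an assumption of this example and is not part of outlines.

```python
import ast
import operator

_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}


def evaluate(expression: str):
    """Evaluate +, -, *, / and unary minus over numeric literals, nothing else."""
    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported syntax in expression")
    return walk(ast.parse(expression, mode="eval").body)


print(evaluate("(8-2)"))  # 6
```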

Open functions

import outlines


def add(a: int, b: int):
    return a + b

model = outlines.models.transformers("WizardLM/WizardMath-7B-V1.1")
generator = outlines.generate.json(model, add)
result = generator("Return json with two integers named a and b respectively. a is odd and b even.")

print(add(**result))
# 3
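
The idea behind passing a function to `outlines.generate.json` is that the function's annotated parameters describe the JSON object to generate. The sketch below illustrates that idea with the standard `inspect` module. It is not outlines' implementation: `signature_to_schema` is a hypothetical helper, and it handles only a few primitive annotations.

```python
import inspect

_JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def signature_to_schema(func) -> dict:
    """Build a flat JSON Schema from a function's annotated parameters."""
    properties = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        properties[name] = {"type": _JSON_TYPES[param.annotation]}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default value, so the argument is mandatory
    return {"type": "object", "properties": properties, "required": required}


def add(a: int, b: int):
    return a + b


print(signature_to_schema(add))
# {'type': 'object', 'properties': {'a': {'type': 'integer'}, 'b': {'type': 'integer'}}, 'required': ['a', 'b']}
```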

Prompting

import outlines

examples = [
    ("The food was disgusting", "Negative"),
    ("We had a fantastic night", "Positive"),
    ("Recommended", "Positive"),
    ("The waiter was rude", "Negative")
]

@outlines.prompt
def labelling(to_label, examples):
    """You are a sentiment-labelling assistant.

    {% for example in examples %}
    {{ example[0] }} // {{ example[1] }}
    {% endfor %}
    {{ to_label }} //
    """

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
prompt = labelling("Just awesome", examples)
answer = outlines.generate.text(model)(prompt, max_tokens=100)

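Summary

This article introduced outlines, a tool for controlling the structure of LLM output, and used a resume information-extraction demo to show that it produces JSON conforming to the schema. The constraint governs format only: the sample output in section 4 is valid against the schema but still contains extraction errors, such as the duration "-non". The article also collected code for several other kinds of format control.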

References

https://github.com/outlines-dev/outlines

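Related earlier posts

A first taste of prompt design: one-shot fine-tuning of chatglm-6b for information extraction

A trick for controlling LLM generation: a custom LogitsProcessor in practice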

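Originally published on the WeChat official account 大模型自然语言处理: [Information Extraction & LLM] Controlling Structured LLM Output, with a Resume Information Extraction Example

Copyright notice: posted by admin on July 10, 2024 at 9:09 PM.
When reposting, please credit: [Information Extraction & LLM] Controlling Structured LLM Output, with a Resume Information Extraction Example | CTF导航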
