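Preface
When using a large language model (LLM) for information extraction, it is important to keep the output controllable and stable (for example, always emitting well-formed JSON), because later development work depends on the extracted data. Common approaches include:
- Fine-tuning: fine-tune the model so that it outputs results in a stable format (JSON, etc.).
- Few-shot prompting: put a few examples in the prompt and ask the model to produce output in the same format.
Even so, the output can still be unstable in practice. A common remedy is to validate the output: when JSON is requested, check that the returned JSON is valid. When validation fails, the request is usually sent to the model again, several times if necessary, until a well-structured result comes back.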
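The validate-and-retry approach described above can be sketched as follows. This is a minimal illustration, not part of outlines: `call_llm` is a hypothetical callable that takes a prompt and returns raw text, and `fake_llm` is a stand-in used only so the example runs without a real model.

```python
import json
from typing import Callable, Optional


def generate_json_with_retry(
    call_llm: Callable[[str], str],  # hypothetical: prompt in, raw model text out
    prompt: str,
    max_retries: int = 3,
) -> Optional[dict]:
    """Call the model until its output parses as a JSON object, or give up."""
    for _ in range(max_retries):
        raw = call_llm(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: send the request again
        if isinstance(parsed, dict):
            return parsed
    return None  # every attempt failed validation


# Stand-in for a real model: two malformed replies, then a valid one.
_replies = iter(['{"fullName": "张三"', "not json", '{"fullName": "张三"}'])


def fake_llm(prompt: str) -> str:
    return next(_replies)


result = generate_json_with_retry(fake_llm, "Extract the name as JSON.")
print(result)
```

Each retry costs another full model call, which is the overhead that constrained generation tries to avoid.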
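Below we introduce outlines, a tool for controlling model output, and walk through its usage with a small demo from the recruitment domain: extracting information from a resume and structuring it as JSON.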
I. Practice
Let's go straight to the demo code and usage.
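Installation (the examples below use the outlines.models.transformers and outlines.generate.* interface; later outlines releases may expose a different API, in which case install a version that still provides these calls):
pip install outlines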
1. Resume text (input)
张三
男 | 年龄:25岁 | 籍贯:北京 | 共产党员 | 18794434244
求职意向:算法工程师 | 期望城市:北京
个人优势
擅长领域:深度学习、CV 、图像识别、语义分割、目标检测、自动驾驶感知算法、JavaWeb开发
专业技能:熟悉 Python 、PyTorch 框架、熟悉 Java 、MySQL、SpringMVC、SpringBoot、Office
教育经历
北京工业大学 硕士 软件工程 2012-2015
担任职务:党支部书记;主修课程:深度学习、图像识别、语义分割、遥感图像处理、软件工程、时空大数据
河南工业大学 本科 计算机科学与技术 2008-2012
担任职务:班长、党支部书记;主修课程:Java、数据库、操作系统、计算机网络、数据结构
实习经历
大模型自然语言处理科技(北京)有限公司 算法工程师 2023.12-2024.03
● 负责数据采集、清洗并标注2D、3D驾驶数据,确保数据质量和多样性
● 负责自动驾驶感知模块的算法开发与优化,利用行车影像数据进行模型迭代优化
项目经历
图像识别 总负责人 2022.09-至今
● 设计了一种高性能深度学习网络 MT-AENet ,用于遥感图像中建筑物提取、建筑垃圾分割、道路提取等任务,并开发出建筑物
总览可视化系统,实现对城市建筑物变化动态监测
信息管理系统(SSM框架) 项目设计师 2024.02-2024.03
负责高校党务信息管理系统的整体架构设计,确保系统功能模块化、高效稳定
技术栈:Java、MySQL、MyBatis、HTML、CSS、JavaScript、Vue.js、AJAX、Spring MVC、Maven、Git
● 使用 MySQL 数据库设计,MyBatis 框架实现数据持久层的开发,提高 JDBC 开发效率
● 使用 HTML、CSS 和 JavaScript 技术, 结合 Element 组件库,快速构建响应式前端网页界面
2. Schema definition
This step defines the entity types to extract from the resume, expressed as a JSON Schema that outlines can accept.
{
"title": "Resume",
"type": "object",
"properties": {
"fullName": {
"title": "全名",
"maxLength": 50,
"type": "string"
},
"contact": {
"title": "联系方式",
"type": "object",
"properties": {
"phone": {
"title": "电话号码",
"type": "string"
}
},
"required": ["phone"]
},
"education": {
"title": "教育背景",
"type": "array",
"items": {
"type": "object",
"properties": {
"degree": {
"title": "学位",
"type": "string"
},
"institution": {
"title": "学校",
"type": "string"
},
"fieldOfStudy": {
"title": "专业",
"type": "string"
},
"graduationYear": {
"title": "毕业年份",
"type": "integer"
}
},
"required": ["degree", "institution", "fieldOfStudy", "graduationYear"]
}
},
"experience": {
"title": "工作经验",
"type": "array",
"items": {
"type": "object",
"properties": {
"jobTitle": {
"title": "职位",
"type": "string"
},
"company": {
"title": "公司",
"type": "string"
},
"duration": {
"title": "任职时间",
"type": "string"
},
"responsibilities": {
"title": "职责",
"type": "array",
"items": {
"type": "string"
}
}
},
"required": ["jobTitle", "company", "duration", "responsibilities"]
}
},
"skills": {
"title": "技能",
"type": "array",
"items": {
"type": "string"
}
}
},
"required": ["fullName", "contact", "education", "experience", "skills"]
}
3. Complete code for LLM information extraction
import outlines
schema = '''{
"title": "Resume",
"type": "object",
"properties": {
"fullName": {
"title": "全名",
"maxLength": 50,
"type": "string"
},
"contact": {
"title": "联系方式",
"type": "object",
"properties": {
"phone": {
"title": "电话号码",
"type": "string"
}
},
"required": ["phone"]
},
"education": {
"title": "教育背景",
"type": "array",
"items": {
"type": "object",
"properties": {
"degree": {
"title": "学位",
"type": "string"
},
"institution": {
"title": "学校",
"type": "string"
},
"fieldOfStudy": {
"title": "专业",
"type": "string"
},
"graduationYear": {
"title": "毕业年份",
"type": "integer"
}
},
"required": ["degree", "institution", "fieldOfStudy", "graduationYear"]
}
},
"experience": {
"title": "工作经验",
"type": "array",
"items": {
"type": "object",
"properties": {
"jobTitle": {
"title": "职位",
"type": "string"
},
"company": {
"title": "公司",
"type": "string"
},
"duration": {
"title": "任职时间",
"type": "string"
},
"responsibilities": {
"title": "职责",
"type": "array",
"items": {
"type": "string"
}
}
},
"required": ["jobTitle", "company", "duration", "responsibilities"]
}
},
"skills": {
"title": "技能",
"type": "array",
"items": {
"type": "string"
}
}
},
"required": ["fullName", "contact", "education", "experience", "skills"]
}'''
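# Load a local Hugging Face model; replace the placeholder with your own model path.
model = outlines.models.transformers("path/to/your/model")
# Build a generator whose output is constrained to the JSON schema above.
generator = outlines.generate.json(model, schema)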
resume_text = '''张三
男 | 年龄:25岁 | 籍贯:北京 | 共产党员 | 18794434244
求职意向:算法工程师 | 期望城市:北京
个人优势
擅长领域:深度学习、CV 、图像识别、语义分割、目标检测、自动驾驶感知算法、JavaWeb开发
专业技能:熟悉 Python 、PyTorch 框架、熟悉 Java 、MySQL、SpringMVC、SpringBoot、Office
教育经历
北京工业大学 硕士 软件工程 2012-2015
担任职务:党支部书记;主修课程:深度学习、图像识别、语义分割、遥感图像处理、软件工程、时空大数据
河南工业大学 本科 计算机科学与技术 2008-2012
担任职务:班长、党支部书记;主修课程:Java、数据库、操作系统、计算机网络、数据结构
实习经历
大模型自然语言处理科技(北京)有限公司 算法工程师 2023.12-2024.03
● 负责数据采集、清洗并标注2D、3D驾驶数据,确保数据质量和多样性
● 负责自动驾驶感知模块的算法开发与优化,利用行车影像数据进行模型迭代优化
项目经历
图像识别 总负责人 2022.09-至今
● 设计了一种高性能深度学习网络 MT-AENet ,用于遥感图像中建筑物提取、建筑垃圾分割、道路提取等任务,并开发出建筑物
总览可视化系统,实现对城市建筑物变化动态监测
信息管理系统(SSM框架) 项目设计师 2024.02-2024.03
负责高校党务信息管理系统的整体架构设计,确保系统功能模块化、高效稳定
技术栈:Java、MySQL、MyBatis、HTML、CSS、JavaScript、Vue.js、AJAX、Spring MVC、Maven、Git
● 使用 MySQL 数据库设计,MyBatis 框架实现数据持久层的开发,提高 JDBC 开发效率
● 使用 HTML、CSS 和 JavaScript 技术, 结合 Element 组件库,快速构建响应式前端网页界面
'''
# Run constrained generation on the resume text and print the extracted result.
result = generator(resume_text)
print(repr(result))
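To store or inspect the extracted result as pretty-printed JSON, the standard `json` module is enough. The sketch below assumes the generator returns a plain dict when given a string schema; `extracted` is a hypothetical fragment used in place of a real model result.

```python
import json

# Hypothetical fragment of an extraction result (a plain dict).
extracted = {"fullName": "张三", "contact": {"phone": "18794434244"}}

# ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes;
# indent=2 pretty-prints nested objects.
pretty = json.dumps(extracted, ensure_ascii=False, indent=2)
print(pretty)
```

Without `ensure_ascii=False`, `json.dumps` escapes every non-ASCII character, which makes Chinese resume fields hard to read in logs and files.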
4. Structured JSON output
{
"fullName": "张三",
"contact": {
"phone": "18794434244"
},
"education": [
{
"degree": "硕士",
"institution": "北京工业大学",
"fieldOfStudy": "软件工程",
"graduationYear": 2015
},
{
"degree": "学士",
"institution": "河南工业大学",
"fieldOfStudy": "计算机科学与技术",
"graduationYear": 2012
}
],
"experience": [
{
"jobTitle": "算法工程师",
"company": "大模型自然语言处理科技(北京)有限公司",
"duration": "2023.12-2024.03",
"responsibilities": [
"设计了一种高性能深度学习网络 MT-AENet ,用于遥感图像中建筑物提取、建筑垃圾分割、道路提取等任务,并开发出建筑物",
"负责数据采集、清洗并标注2D、3D驾驶数据,确保数据质量和多样性",
"负责自动驾驶感知模块的算法开发与优化,利用行车影像数据进行模型迭代优化"
]
},
{
"jobTitle": "图像识别",
"company": "图像识别",
"duration": "2022.09-至今",
"responsibilities": [
"设计了一种高性能深度学习网络 MT-AENet ,用于遥感图像中建筑物提取、建筑垃圾分割、道路提取等任务,并开发出建筑物",
"实现对城市建筑物变化动态监测的建筑物",
"传感器数据采集停车场停车车位判断算法优化"
]
},
{
"jobTitle": "信息管理系统(SSM框架)",
"company": "信息管理系统(SSM框架)",
"duration": "-non",
"responsibilities": [
"负责高校党务信息管理系统的整体架构设计,确保系统功能模块化、高效稳定",
"使用 MySQL 数据库设计,MyBatis 框架实现数据持久层的开发,提高 JDBC 开发效率",
"使用 HTML、CSS 和 JavaScript 技术, 结合 Element 组件库,快速构建响应式前端网页界面"
]
}
],
"skills": [
"Python",
"PyTorch",
"Java",
"MySQL",
"SpringMVC",
"SpringBoot",
"Office"
]
}
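A result like the one above can still be checked against the schema after generation. The following is a minimal sketch of such a check, written with the standard library only; `find_violations` and `mini_schema` are hypothetical names, and only a small subset of JSON Schema (type, required, properties, items, maxLength) is handled. A full validator, such as the `jsonschema` package, would be the usual choice in a real project.

```python
import json

# JSON Schema type name -> Python type. bool is rejected for "integer" below.
_TYPES = {"object": dict, "array": list, "string": str, "integer": int}


def find_violations(value, schema, path="$"):
    """Return a list of human-readable violations (an empty list means OK)."""
    errors = []
    expected = schema.get("type")
    if expected in _TYPES:
        ok = isinstance(value, _TYPES[expected])
        if expected == "integer" and isinstance(value, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}, got {type(value).__name__}"]
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}: missing required field '{key}'")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors += find_violations(value[key], sub, f"{path}.{key}")
    elif isinstance(value, list) and "items" in schema:
        for i, item in enumerate(value):
            errors += find_violations(item, schema["items"], f"{path}[{i}]")
    elif isinstance(value, str) and "maxLength" in schema:
        if len(value) > schema["maxLength"]:
            errors.append(f"{path}: longer than {schema['maxLength']} characters")
    return errors


# A trimmed-down version of the resume schema, for illustration only.
mini_schema = json.loads("""{
  "type": "object",
  "properties": {
    "fullName": {"type": "string", "maxLength": 50},
    "education": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {"graduationYear": {"type": "integer"}},
        "required": ["graduationYear"]
      }
    }
  },
  "required": ["fullName", "education"]
}""")

good = {"fullName": "张三", "education": [{"graduationYear": 2015}]}
bad = {"education": [{"graduationYear": "2015"}, {}]}

print(find_violations(good, mini_schema))
for problem in find_violations(bad, mini_schema):
    print(problem)
```

A structural check of this kind only confirms the shape of the data. It cannot catch content errors, such as the duration value "-non" in the output above, which is a valid string but not a meaningful date range.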
II. Other format-control examples
Multiple choices
import outlines
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
prompt = """You are a sentiment-labelling assistant.
Is the following review positive or negative?
Review: This restaurant is just awesome!
"""
generator = outlines.generate.choice(model, ["Positive", "Negative"])
answer = generator(prompt)
Type constraint
import outlines
model = outlines.models.transformers("WizardLM/WizardMath-7B-V1.1")
prompt = "<s>result of 9 + 9 = 18</s><s>result of 1 + 2 = "
answer = outlines.generate.format(model, int)(prompt)
print(answer)
# 3
prompt = "sqrt(2)="
generator = outlines.generate.format(model, float)
answer = generator(prompt, max_tokens=10)
print(answer)
# 1.41421356
Efficient regex-structured generation
import outlines
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
prompt = "What is the IP address of the Google DNS servers? "
generator = outlines.generate.text(model)
unstructured = generator(prompt, max_tokens=30)
generator = outlines.generate.regex(
    model,
    r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
)
structured = generator(prompt, max_tokens=30)
print(unstructured)
# What is the IP address of the Google DNS servers?
#
# Passive DNS servers are at DNS servers that are private.
# In other words, both IP servers are private. The database
# does not contain Chelsea Manning
print(structured)
# What is the IP address of the Google DNS servers?
# 2.2.6.1
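The IPv4 pattern can be checked on its own with Python's `re` module before handing it to a generator. This small sketch uses the pattern with the digit class (`\d`) and the literal dot (`\.`) written out; the candidate strings are made up for illustration.

```python
import re

# IPv4 pattern: three "octet." groups followed by a final octet (0-255 each).
IPV4 = re.compile(
    r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)"
)

candidates = ["8.8.8.8", "2.2.6.1", "192.168.0.1", "256.1.1.1", "1.2.3"]
matches = {c: bool(IPV4.fullmatch(c)) for c in candidates}
for candidate, ok in matches.items():
    print(candidate, ok)
```

Note that the structured answer above, 2.2.6.1, is a well-formed address but not an actual Google DNS server (those are 8.8.8.8 and 8.8.4.4): the regex constraint guarantees the format of the output, not its factual correctness.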
Efficient JSON generation following a Pydantic model
from enum import Enum
from pydantic import BaseModel, constr
import outlines
import torch
class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"

class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"

class Character(BaseModel):
    name: constr(max_length=10)
    age: int
    armor: Armor
    weapon: Weapon
    strength: int

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
# Construct structured sequence generator
generator = outlines.generate.json(model, Character)
# Draw a sample
rng = torch.Generator(device="cuda")
rng.manual_seed(789001)
character = generator("Give me a character description", rng=rng)
print(repr(character))
# Character(name='Anderson', age=28, armor=<Armor.chainmail: 'chainmail'>, weapon=<Weapon.sword: 'sword'>, strength=8)
character = generator("Give me an interesting character description", rng=rng)
print(repr(character))
# Character(name='Vivian Thr', age=44, armor=<Armor.plate: 'plate'>, weapon=<Weapon.crossbow: 'crossbow'>, strength=125)
Efficient JSON generation following a JSON Schema
import outlines
schema = '''{
"title": "Character",
"type": "object",
"properties": {
"name": {
"title": "Name",
"maxLength": 10,
"type": "string"
},
"age": {
"title": "Age",
"type": "integer"
},
"armor": {"$ref": "#/definitions/Armor"},
"weapon": {"$ref": "#/definitions/Weapon"},
"strength": {
"title": "Strength",
"type": "integer"
}
},
"required": ["name", "age", "armor", "weapon", "strength"],
"definitions": {
"Armor": {
"title": "Armor",
"description": "An enumeration.",
"enum": ["leather", "chainmail", "plate"],
"type": "string"
},
"Weapon": {
"title": "Weapon",
"description": "An enumeration.",
"enum": ["sword", "axe", "mace", "spear", "bow", "crossbow"],
"type": "string"
}
}
}'''
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
generator = outlines.generate.json(model, schema)
character = generator("Give me a character description")
Using context-free grammars to guide generation
import outlines
arithmetic_grammar = """
?start: expression
?expression: term (("+" | "-") term)*
?term: factor (("*" | "/") factor)*
?factor: NUMBER
| "-" factor
| "(" expression ")"
%import common.NUMBER
"""
model = outlines.models.transformers("WizardLM/WizardMath-7B-V1.1")
generator = outlines.generate.cfg(model, arithmetic_grammar)
sequence = generator("Alice had 4 apples and Bob ate 2. Write an expression for Alice's apples:")
print(sequence)
# (8-2)
Open functions
import outlines
def add(a: int, b: int):
    return a + b
model = outlines.models.transformers("WizardLM/WizardMath-7B-V1.1")
generator = outlines.generate.json(model, add)
result = generator("Return json with two integers named a and b respectively. a is odd and b even.")
print(add(**result))
# 3
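The idea behind passing a function to the JSON generator is that a typed signature already carries enough information to describe an object with named, typed fields. The sketch below illustrates that idea with the standard `inspect` module. It is not how outlines implements it internally; `signature_to_schema` is a hypothetical helper that handles only a few basic annotation types.

```python
import inspect

# Python annotation -> JSON Schema type name (a deliberately small mapping).
_JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def signature_to_schema(func):
    """Build a minimal object schema from a function's parameters."""
    properties = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        json_type = _JSON_TYPES.get(param.annotation)
        # Unknown or missing annotations become an unconstrained field.
        properties[name] = {"type": json_type} if json_type else {}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default value, so the field is mandatory
    return {
        "title": func.__name__,
        "type": "object",
        "properties": properties,
        "required": required,
    }


def add(a: int, b: int):
    return a + b


add_schema = signature_to_schema(add)
print(add_schema)
```

A dict generated under such a schema has exactly the keys the function expects, which is why the result can be unpacked with `add(**result)`.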
Prompting
import outlines
examples = [
("The food was disgusting", "Negative"),
("We had a fantastic night", "Positive"),
("Recommended", "Positive"),
("The waiter was rude", "Negative")
]
@outlines.prompt
def labelling(to_label, examples):
    """You are a sentiment-labelling assistant.
    {% for example in examples %}
    {{ example[0] }} // {{ example[1] }}
    {% endfor %}
    {{ to_label }} //
    """
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
prompt = labelling("Just awesome", examples)
answer = outlines.generate.text(model)(prompt, max_tokens=100)
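For readers who want to see what such a few-shot prompt looks like once the template is filled in, here is a plain-Python sketch that builds a similar prompt without any templating library. `build_labelling_prompt` is a hypothetical helper, and the exact whitespace produced by the Jinja-style template above may differ from this version.

```python
examples = [
    ("The food was disgusting", "Negative"),
    ("We had a fantastic night", "Positive"),
    ("Recommended", "Positive"),
    ("The waiter was rude", "Negative"),
]


def build_labelling_prompt(to_label, examples):
    """Render the instruction, one 'text // label' line per example, then the query."""
    lines = ["You are a sentiment-labelling assistant.", ""]
    lines += [f"{text} // {label}" for text, label in examples]
    lines.append(f"{to_label} //")  # left open for the model to complete
    return "\n".join(lines)


few_shot_prompt = build_labelling_prompt("Just awesome", examples)
print(few_shot_prompt)
```

The prompt ends with an unfinished "text //" line, so the most natural continuation for the model is a label in the same style as the examples.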
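Summary
This article introduced outlines, a tool for controlling the structure of LLM output, and used a resume information-extraction demo to show that it works in practice. It also recorded code for several other kinds of format control.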
References
https://github.com/outlines-dev/outlines
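Originally published on the WeChat official account 大模型自然语言处理, under the title "[Information Extraction & LLM] Techniques for Controlling Structured LLM Output, with a Resume Information-Extraction Practice".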