Entities Extraction Agent

A flexible, efficient entity extraction framework built on large language models (LLMs).

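🌟 Features

🤖 LLM-driven: builds on LangChain and modern LLMs (GPT, Claude, etc.)
🎯 Zero-shot / few-shot learning: supports both zero-shot and few-shot extraction strategies
🔗 Relationship extraction: extracts relationships between entities for building knowledge graphs
📄 Document parsing: integrates MinerU to parse PDF documents and extract from them
🔧 Highly configurable: flexible configuration system with support for multiple LLM providers
📊 Built-in evaluation: precision, recall, F1 score, and other metrics
🔄 End-to-end workflow: covers preprocessing, entity extraction, postprocessing, and evaluation
📦 Ready to use: built on mature open-source frameworks to keep hand-written code to a minimum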

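🏗️ Architecture

Core Ideas

LLM as the core extractor: relies directly on the LLM's language understanding to identify and extract entities
Zero-shot / few-shot learning: guides the LLM with carefully designed prompts, reducing the need for annotated data
Configurable and extensible: supports different LLMs, entity types, and extraction strategies
Iteration and optimization: provides evaluation tooling for analyzing and improving LLM performance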

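Component Layout

entities-extraction-agent/
├── models/                # Data models (Entity, Relationship, ExtractionResult, etc.)
├── core/                  # Core functionality
│   ├── extractor.py      # Entity and relationship extractor
│   └── config.py         # Configuration management
├── preprocessing/         # Preprocessing (text cleaning, chunking, etc.)
├── postprocessing/        # Postprocessing (deduplication, filtering, merging, etc.)
├── document_processing/   # Document processing (MinerU integration, PDF parsing)
├── pipeline/              # Document extraction pipeline (document parsing + entity extraction)
├── evaluation/            # Evaluation metrics (precision, recall, F1, etc.)
└── examples/              # Usage examples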
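
To make the role of models/ more concrete, here is a minimal sketch of the three data shapes the usage examples in this README rely on. It uses standard-library dataclasses for illustration only; the package itself is described as Pydantic-based, and its real models may carry more fields than the ones assumed here (text, type, source, target, entities, relationships).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A single extracted mention, e.g. Entity("杭州", "Location")."""
    text: str
    type: str


@dataclass(frozen=True)
class Relationship:
    """A directed, typed edge between two entity mentions."""
    source: str
    type: str
    target: str


@dataclass
class ExtractionResult:
    """Everything extracted from one input text."""
    text: str
    entities: list[Entity] = field(default_factory=list)
    relationships: list[Relationship] = field(default_factory=list)


# Illustrative data only; not produced by a real extraction run.
result = ExtractionResult(
    text="张三在阿里巴巴工作，公司总部位于杭州。",
    entities=[Entity("张三", "Person"), Entity("阿里巴巴", "Organization")],
    relationships=[Relationship("张三", "works_for", "阿里巴巴")],
)
```

Freezing Entity and Relationship makes them hashable, which is convenient for set-based deduplication and evaluation.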

📦 Installation

# Clone the repository
git clone https://github.com/deeplooplabs/entities-extraction-agent.git
cd entities-extraction-agent

# Install dependencies
pip install -e .

# Or install with development dependencies
pip install -e ".[dev]"

🚀 Quick Start

Basic Usage

from entities_extraction_agent import EntityExtractor, EntityType
from entities_extraction_agent.core.config import LLMConfig, ExtractionConfig

# 1. Define the entity types to extract
entity_types = [
    EntityType(
        name="Person",
        description="Names of people, including names and job titles",
        examples=["张三", "李四", "王经理"]  # Optional: used for few-shot learning
    ),
    EntityType(
        name="Organization",
        description="Names of organizations and institutions",
        examples=["阿里巴巴", "清华大学"]
    ),
    EntityType(
        name="Location",
        description="Geographic locations",
        examples=["北京", "上海"]
    ),
]

# 2. Configure the LLM
llm_config = LLMConfig(
    provider="openai",
    model_name="gpt-3.5-turbo",
    temperature=0.0,
    api_key="your-api-key"  # Or set the OPENAI_API_KEY environment variable
)

# 3. Configure the extraction strategy
extraction_config = ExtractionConfig(
    strategy="few-shot",  # Or "zero-shot"
    max_retries=3
)

# 4. Create the extractor
extractor = EntityExtractor(
    entity_types=entity_types,
    llm_config=llm_config,
    extraction_config=extraction_config
)

# 5. Extract entities
# Sample text: "Jack Ma founded Alibaba Group, headquartered in Hangzhou."
text = "马云创立了阿里巴巴集团，总部位于杭州。"
result = extractor.extract(text)

# 6. Inspect the results
for entity in result.entities:
    print(f"{entity.text} ({entity.type})")

Full Workflow Example

from entities_extraction_agent import EntityExtractor, EntityType
from entities_extraction_agent.preprocessing import TextPreprocessor
from entities_extraction_agent.postprocessing import EntityPostprocessor

# entity_types is the list defined in Basic Usage.
# raw_text is any unprocessed input; this sample contains HTML and extra whitespace.
raw_text = "<p>马云创立了阿里巴巴集团，  总部位于杭州。</p>"

# Preprocessing
preprocessor = TextPreprocessor(
    remove_html=True,
    normalize_whitespace=True
)

# Extraction
extractor = EntityExtractor(entity_types=entity_types)
cleaned_text = preprocessor.preprocess(raw_text)
result = extractor.extract(cleaned_text)

# Postprocessing
postprocessor = EntityPostprocessor(
    deduplicate=True,
    normalize=True
)
final_result = postprocessor.postprocess(result)

# Inspect per-type counts
counts = postprocessor.get_entity_counts(final_result)
print(counts)  # Counter({'Person': 3, 'Organization': 2, ...})
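
The postprocessing step above deduplicates entities and reports per-type counts. As a rough illustration of what that involves, the sketch below works on plain (text, type) tuples. The helper names deduplicate and count_by_type are hypothetical, and the normalization rule (strip whitespace, compare case-insensitively) is an assumption, not the behavior of EntityPostprocessor.

```python
from collections import Counter


def deduplicate(entities):
    """Drop repeated (text, type) pairs, keeping first-seen order."""
    seen = set()
    unique = []
    for text, entity_type in entities:
        # Assumed normalization: trim whitespace, ignore case.
        key = (text.strip().lower(), entity_type)
        if key not in seen:
            seen.add(key)
            unique.append((text.strip(), entity_type))
    return unique


def count_by_type(entities):
    """Count how many entities were found for each type."""
    return Counter(entity_type for _, entity_type in entities)


raw = [
    ("马云", "Person"),
    ("阿里巴巴集团", "Organization"),
    ("马云 ", "Person"),  # duplicate with trailing whitespace
    ("杭州", "Location"),
]
unique = deduplicate(raw)
counts = count_by_type(unique)
print(unique)
print(counts)
```

Keying on both text and type keeps mentions that share a surface form but differ in type, such as a company named after its founder.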

Evaluation Example

from entities_extraction_agent.evaluation import EntityEvaluator
from entities_extraction_agent import Entity, ExtractionResult

# text and extractor are the ones defined in Basic Usage
predicted_result = extractor.extract(text)

# Prepare the ground-truth annotations
ground_truth = ExtractionResult(
    text=text,
    entities=[
        Entity(text="马云", type="Person"),
        Entity(text="阿里巴巴集团", type="Organization"),
        Entity(text="杭州", type="Location"),
    ]
)

# Evaluate
evaluator = EntityEvaluator()
metrics = evaluator.evaluate(predicted_result, ground_truth)

print(f"Precision: {metrics.precision:.3f}")
print(f"Recall: {metrics.recall:.3f}")
print(f"F1 Score: {metrics.f1_score:.3f}")

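Relationship Extraction

The framework can also extract relationships between entities, which is useful for building knowledge graphs:

from entities_extraction_agent import EntityExtractor, EntityType, RelationshipType

# Define entity types
entity_types = [
    EntityType(name="Person", description="Names of people"),
    EntityType(name="Organization", description="Names of organizations"),
    EntityType(name="Location", description="Geographic locations"),
]

# Define relationship types
relationship_types = [
    RelationshipType(
        name="works_for",
        description="Employment: a person and the organization they work for",
        entity_types={"source": "Person", "target": "Organization"}
    ),
    RelationshipType(
        name="located_in",
        description="Location: an organization and the place where it is based",
        entity_types={"source": "Organization", "target": "Location"}
    ),
]

# Create the extractor (extracts entities and relationships together)
extractor = EntityExtractor(
    entity_types=entity_types,
    relationship_types=relationship_types,
)

# Sample text: "Zhang San works at Alibaba; the company is headquartered in Hangzhou."
text = "张三在阿里巴巴工作，公司总部位于杭州。"
result = extractor.extract(text)

# Inspect the extracted entities
for entity in result.entities:
    print(f"Entity: {entity.text} ({entity.type})")

# Inspect the extracted relationships
for rel in result.relationships:
    print(f"Relationship: {rel.source} -[{rel.type}]-> {rel.target}")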
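
Once relationships are available as (source, type, target) triples, turning them into a small knowledge graph is mostly bookkeeping. The following is a minimal sketch under that assumption. build_graph and neighbors are hypothetical helpers, not part of the package, and the triples are hand-written rather than real extractor output.

```python
from collections import defaultdict


def build_graph(triples):
    """Build an adjacency list: node -> list of (relationship type, target)."""
    graph = defaultdict(list)
    for source, rel_type, target in triples:
        graph[source].append((rel_type, target))
    return dict(graph)


def neighbors(graph, node, rel_type=None):
    """Return targets reachable from node, optionally filtered by relationship type."""
    return [
        target
        for edge_type, target in graph.get(node, [])
        if rel_type is None or edge_type == rel_type
    ]


# Hand-written triples matching the sample sentence above.
triples = [
    ("张三", "works_for", "阿里巴巴"),
    ("阿里巴巴", "located_in", "杭州"),
]
graph = build_graph(triples)

# Two-hop question: in which city is 张三's employer located?
employer = neighbors(graph, "张三", "works_for")[0]
city = neighbors(graph, employer, "located_in")[0]
print(f"{employer} -> {city}")
```

An adjacency list is enough for small graphs; for larger ones a dedicated graph library or graph database would be the more usual choice.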

Document Processing Pipeline

Use DocumentExtractionPipeline to process PDF documents; it handles document parsing and entity extraction in one step:

from entities_extraction_agent import DocumentExtractionPipeline, EntityType, RelationshipType

# Define the entity and relationship types to extract
entity_types = [
    EntityType(name="Person", description="Names of people"),
    EntityType(name="Organization", description="Names of organizations"),
]

relationship_types = [
    RelationshipType(name="works_for", description="Employment relationship"),
]

# Create the pipeline
pipeline = DocumentExtractionPipeline(
    entity_types=entity_types,
    relationship_types=relationship_types,
)

# Process a document from a URL
result = pipeline.from_url("https://example.com/document.pdf")

# Or process a local file
# result = pipeline.from_file("./local_document.pdf")

# Or process plain text
# result = pipeline.from_text("Document text...")

# Inspect the results
print(f"Document length: {len(result.document_text)} characters")
print(f"Entities extracted: {len(result.entities)}")
print(f"Relationships extracted: {len(result.relationships)}")

for entity in result.entities:
    print(f"  - [{entity.type}] {entity.text}")

for rel in result.relationships:
    print(f"  - [{rel.type}] {rel.source} -> {rel.target}")
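
Parsed documents are often longer than an LLM's context window, and the component layout lists chunking as a preprocessing step. This README does not describe how the pipeline chunks text, so the sketch below shows one common approach (fixed-size character windows with overlap, followed by an order-preserving merge) purely as an assumption. chunk_text and merge_entities are hypothetical names.

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def merge_entities(per_chunk_entities):
    """Concatenate per-chunk entity lists, dropping exact duplicates in order."""
    merged = []
    for entities in per_chunk_entities:
        merged.extend(entities)
    return list(dict.fromkeys(merged))


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
merged = merge_entities([
    [("马云", "Person")],
    [("马云", "Person"), ("杭州", "Location")],
])
print(chunks)
print(merged)
```

The overlap gives an entity that straddles a chunk boundary a chance to appear whole in at least one window, at the cost of duplicates that the merge step then removes.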

Install MinerU (used for PDF parsing):

# Install MinerU and all of its dependencies
pip install 'mineru[all]'

# Or use the magic-pdf package
pip install 'magic-pdf[full]'

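Web UI

The framework includes a Streamlit-based web UI that visualizes entities and the relationship graph:

# Install the UI dependencies
pip install -e ".[ui]"

# Launch the web UI
streamlit run entities_extraction_agent/ui/app.py

UI features:

Enter a document URL or upload a PDF file
Configure entity types and relationship types
Choose the LLM model and its parameters
View the extracted entities and relationships as tables
Explore the relationships between entities in a knowledge graph visualization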

🔧 Configuration

LLM Configuration

The LLM can be configured through environment variables or in code:

# .env file
LLM_PROVIDER=openai
LLM_MODEL_NAME=gpt-3.5-turbo
LLM_TEMPERATURE=0.0
LLM_API_KEY=your-api-key
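
For readers curious how such variables might be turned into a typed settings object, here is a minimal standard-library sketch. LLMSettings and load_llm_settings are hypothetical names, the default values are assumptions taken from the sample .env above, and the fallback to OPENAI_API_KEY is a guess based on the Basic Usage comment. The package's own loader in core/config.py may behave differently.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMSettings:
    provider: str
    model_name: str
    temperature: float
    api_key: Optional[str]


def load_llm_settings(env=None):
    """Read LLM_* variables from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return LLMSettings(
        provider=env.get("LLM_PROVIDER", "openai"),
        model_name=env.get("LLM_MODEL_NAME", "gpt-3.5-turbo"),
        # Environment values are strings, so convert explicitly.
        temperature=float(env.get("LLM_TEMPERATURE", "0.0")),
        # Assumed fallback: use OPENAI_API_KEY when LLM_API_KEY is unset.
        api_key=env.get("LLM_API_KEY") or env.get("OPENAI_API_KEY"),
    )


settings = load_llm_settings({
    "LLM_PROVIDER": "openai",
    "LLM_MODEL_NAME": "gpt-3.5-turbo",
    "LLM_TEMPERATURE": "0.2",
    "LLM_API_KEY": "your-api-key",
})
print(settings.model_name, settings.temperature)
```

Passing the mapping as a parameter keeps the loader easy to test without touching the real process environment.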

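Extraction Strategies

zero-shot: extracts using only the entity type descriptions; no examples are required
few-shot: guides the LLM with the examples you provide, which usually yields better results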
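
The practical difference between the two strategies is what goes into the prompt. The sketch below illustrates that idea with a hypothetical build_prompt helper that takes entity types as plain dictionaries. The prompt wording and output format are assumptions for illustration and are not the prompts the framework actually sends.

```python
def build_prompt(text, entity_types, strategy="zero-shot"):
    """Assemble an extraction prompt; few-shot adds each type's examples."""
    if strategy not in ("zero-shot", "few-shot"):
        raise ValueError(f"unknown strategy: {strategy}")
    lines = ["Extract all entities of the following types from the text."]
    for entity_type in entity_types:
        line = f"- {entity_type['name']}: {entity_type['description']}"
        examples = entity_type.get("examples")
        if strategy == "few-shot" and examples:
            line += " (examples: " + ", ".join(examples) + ")"
        lines.append(line)
    lines.append("")
    lines.append(f"Text: {text}")
    lines.append('Return a JSON list of objects with keys "text" and "type".')
    return "\n".join(lines)


entity_types = [
    {"name": "Person", "description": "Names of people", "examples": ["张三", "李四"]},
    {"name": "Location", "description": "Geographic locations", "examples": ["北京"]},
]
text = "马云创立了阿里巴巴集团，总部位于杭州。"

zero_shot_prompt = build_prompt(text, entity_types, strategy="zero-shot")
few_shot_prompt = build_prompt(text, entity_types, strategy="few-shot")
print(few_shot_prompt)
```

Zero-shot prompts are shorter and cheaper per call, while few-shot prompts spend extra tokens on examples in exchange for clearer guidance about what each type covers.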

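📊 Evaluation Metrics

The framework provides the following evaluation metrics:

Precision: the fraction of extracted entities that are correct
Recall: the fraction of all correct entities that were extracted
F1 Score: the harmonic mean of precision and recall
Per-type statistics: evaluation broken down by entity type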
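
As a worked illustration of these definitions, the sketch below computes the three scores over sets of (text, type) pairs. It assumes exact matching on both text and type; whether EntityEvaluator matches exactly or more leniently is not stated here, and evaluate is a hypothetical stand-in rather than the package's method.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, f1) for two collections of (text, type) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [
    ("马云", "Person"),
    ("阿里巴巴集团", "Organization"),
    ("杭州", "Location"),
]
predicted = [
    ("马云", "Person"),            # correct
    ("阿里巴巴", "Organization"),   # wrong span under exact matching
    ("杭州", "Location"),          # correct
    ("集团", "Organization"),       # spurious
]

precision, recall, f1 = evaluate(predicted, gold)
print(f"Precision: {precision:.3f}")  # 2 of 4 predictions correct -> 0.500
print(f"Recall: {recall:.3f}")        # 2 of 3 gold entities found -> 0.667
print(f"F1 Score: {f1:.3f}")          # 2 * 0.5 * 0.667 / 1.167 -> 0.571
```

Per-type scores follow the same pattern: filter both collections to one entity type and call the function again.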

🛠️ Development

# Install development dependencies
pip install -e ".[dev]"

# Run the tests
pytest

# Format the code
black entities_extraction_agent/

# Lint the code
ruff check entities_extraction_agent/

📚 Examples

See the entities_extraction_agent/examples/ directory for more examples:

basic_example.py: basic usage
evaluation_example.py: evaluation and strategy comparison
document_pipeline_example.py: document processing pipeline (PDF parsing plus entity and relationship extraction)

Run the examples:

# Set the API key
export OPENAI_API_KEY="your-api-key"

# Run the basic example
python -m entities_extraction_agent.examples.basic_example

# Run the evaluation example
python -m entities_extraction_agent.examples.evaluation_example

# Run the document pipeline example
python -m entities_extraction_agent.examples.document_pipeline_example

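🤝 Tech Stack

LangChain: framework for building LLM applications
Pydantic: data validation and settings management
OpenAI API: LLM service (extensible to other providers)
MinerU: PDF document parsing tool

📄 License

MIT License

🙏 Acknowledgements

This framework builds on the following open-source projects:

LangChain: a powerful framework for building LLM applications
Pydantic: an elegant data validation library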
