PDFDuplicateRemover

Removes redundant pages from discrete mathematics lecture slides to improve readability.

Notice

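The Chapter 8 (group theory) slides have some problems: the following pages are missing.

Logical page number	Actual page number in the original PPT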
14	63
20	102
24	124
43	243
44	252
84	529
91	584
98	627
117	764

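Usage guide: PDF duplicate page removal script

Overview

The script uses OCR (optical character recognition) to read the page number printed in the bottom-right corner of each PDF page. When several pages carry the same page number, it keeps only the last one and writes the result to a new PDF file.

Requirements

Python 3.8 or later.
The following Python libraries: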
- PyPDF2
- pdf2image
- pytesseract

Installation

Install the Python libraries
Install the required libraries with pip:
```
pip install PyPDF2 pdf2image pytesseract
```
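Install Tesseract-OCR
- Windows: download the installer from the Tesseract OCR page and run it.
  After installation, add the installation directory (for example C:\Program Files\Tesseract-OCR, which contains tesseract.exe) to the system PATH environment variable.
- Linux: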
```
sudo apt-get install tesseract-ocr
```
- macOS:
```
brew install tesseract
```
Install Poppler
- Windows: download and install Poppler for Windows.
  Add the path of its bin folder to the system PATH environment variable.
- Linux:
```
sudo apt-get install poppler-utils
```
- macOS:
```
brew install poppler
```
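Once the installation steps are done, a quick check that both command-line tools are reachable can save some debugging. The following is a minimal sketch, not part of remover.py: the helper name `find_ocr_tools` is hypothetical, and it assumes pdf2image relies on Poppler's `pdftoppm` executable being on PATH.

```python
import shutil


def find_ocr_tools():
    """Return a mapping of executable name to its resolved path, or None if not on PATH."""
    # "tesseract" is the Tesseract CLI used by pytesseract;
    # "pdftoppm" ships with Poppler and is used by pdf2image.
    return {name: shutil.which(name) for name in ("tesseract", "pdftoppm")}


if __name__ == "__main__":
    for name, path in find_ocr_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If either tool is reported as not found, revisit the PATH settings described above before running the script.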

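Usage

Place the script file (remover.py) in the project directory.
Make sure the input PDF is in the same directory as the script, or specify the full path to the PDF in the script.
Run the script: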
```
python remover.py
```

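Parameters

Input PDF file:
- Set the input_pdf variable to the name of the input PDF file.
  Example: input_pdf = "example.pdf"
Output PDF file:
- The script writes the deduplicated PDF to a new file named after the input file with the prefix "output" added.
OCR detection area:
- The ocr_area variable defines the rectangle that contains the page number, in the form (left, top, right, bottom).
- Adjust this area to match the layout of the actual PDF pages.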
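
When adjusting the OCR area, it may help to check that the rectangle fits inside the rendered page and to rescale it if the render resolution changes. The sketch below assumes the box is given in pixel coordinates of the rendered page image; the helper names `scale_box` and `box_fits` are hypothetical, and the DPI that remover.py actually uses is not stated in this README.

```python
def scale_box(box, from_dpi, to_dpi):
    """Rescale a (left, top, right, bottom) pixel box from one render DPI to another."""
    factor = to_dpi / from_dpi
    return tuple(round(value * factor) for value in box)


def box_fits(box, page_size):
    """Return True if the box is non-empty and lies inside a (width, height) page image."""
    left, top, right, bottom = box
    width, height = page_size
    return 0 <= left < right <= width and 0 <= top < bottom <= height


if __name__ == "__main__":
    ocr_area = (908, 731, 1008, 756)
    # Assumed page image size of 1024 x 768 pixels, for illustration only.
    print(box_fits(ocr_area, (1024, 768)))
    # The same region if the pages were rendered at twice the resolution.
    print(scale_box(ocr_area, from_dpi=200, to_dpi=400))
```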

Sample configuration

input_pdf = "example.pdf"
output_pdf = "output_example.pdf"
ocr_area = (908, 731, 1008, 756)  # adjust to match the actual PDF

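How it works

Convert the PDF pages to images:
pdf2image renders each page of the PDF as an image.
Extract the page number:
OCR runs on the configured area of each page image and reads the page label (for example 1/145).
Deduplicate:
Among pages with the same page number, only the last occurrence is kept.
Write the new PDF file:
The remaining unique pages are saved to a new PDF file.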
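
The page-number parsing and the keep-last rule can be sketched in plain Python as follows. This is an illustration rather than the code in remover.py: the function names are hypothetical, the label is assumed to look like "1/145", and keeping pages whose label could not be read is an assumption made here so that no content is silently dropped.

```python
import re

# Matches labels such as "1/145" or "12 / 145"; the first group is the page number.
PAGE_LABEL = re.compile(r"(\d+)\s*/\s*(\d+)")


def parse_page_number(ocr_text):
    """Return the page number found in OCR text, or None if no label is recognized."""
    match = PAGE_LABEL.search(ocr_text)
    return int(match.group(1)) if match else None


def keep_last_occurrence(labels):
    """Return sorted page indices, keeping only the last page for each page number."""
    last_index = {}
    for index, label in enumerate(labels):
        if label is not None:
            last_index[label] = index  # later pages overwrite earlier ones
    # Assumption: pages with an unreadable label are kept rather than dropped.
    unreadable = [index for index, label in enumerate(labels) if label is None]
    return sorted(list(last_index.values()) + unreadable)


if __name__ == "__main__":
    texts = ["1/145", "1/145", "2/145", "", "2/145", "3/145"]
    labels = [parse_page_number(text) for text in texts]
    print(keep_last_occurrence(labels))
```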

Output example

Given the input file example.pdf, the script produces the deduplicated file output_example.pdf.
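
One way to derive the output name from the input name is sketched below. The helper `default_output_path` is hypothetical, and the "output_" prefix is inferred from the example file names above rather than taken from remover.py.

```python
from pathlib import Path


def default_output_path(input_pdf, prefix="output_"):
    """Return a path next to the input file with the prefix added to the file name."""
    source = Path(input_pdf)
    return source.with_name(prefix + source.name)


if __name__ == "__main__":
    print(default_output_path("example.pdf"))
```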

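Developer notes

The script is built on PyPDF2, pdf2image, and pytesseract. It is intended for PDF documents that contain pages with repeated page numbers, and it replaces the manual work of removing them.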
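
For orientation, the sketch below shows how the three libraries could fit together. It is not the code in remover.py: the function names, the 200 DPI default, and the Tesseract option "--psm 7" (treat the crop as a single text line) are assumptions, and the PyPDF2 names used here (PdfReader, PdfWriter, add_page) are those of PyPDF2 2.x and later. The list of page indices to keep is passed in by the caller.

```python
def ocr_page_labels(input_pdf, ocr_area, dpi=200):
    """Render every page and return the raw OCR text found inside ocr_area."""
    # Imported inside the function so this module loads even without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    labels = []
    for image in convert_from_path(input_pdf, dpi=dpi):
        crop = image.crop(ocr_area)  # (left, top, right, bottom) in pixels
        text = pytesseract.image_to_string(crop, config="--psm 7")
        labels.append(text.strip())
    return labels


def write_pages(input_pdf, output_pdf, keep):
    """Copy the pages whose zero-based indices are listed in keep into a new PDF."""
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(input_pdf)
    writer = PdfWriter()
    for index in keep:
        writer.add_page(reader.pages[index])
    with open(output_pdf, "wb") as handle:
        writer.write(handle)
```

Rendering every page at once keeps the sketch short but holds all page images in memory, which may be a concern for slide decks with several hundred pages.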
