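Based on the project's teaching documents and source code examples, created a comprehensive crawler development Skill that guides large language models in writing crawler code across a range of scenarios.

Main contents:
- Interactive workflow: requirements analysis → technical assessment → information gathering → code generation
- Four core code templates: API crawling, HTML page crawling, browser automation, high-concurrency batch crawling
- Engineered project structure: configuration management, model definitions, storage abstraction, exception handling
- Detailed coverage of 12 key techniques:
  - Converting cURL commands to Python code
  - Request header spoofing and UA rotation
  - Rate control (token bucket algorithm)
  - Proxy IP usage
  - Cookie management and persistence
  - Data storage (factory pattern: CSV/JSON/DB)
  - Pydantic data models
  - Data cleaning and normalization
  - Retries and exception handling (tenacity)
  - Login authentication (cookie injection / QR-code login)
  - CAPTCHA handling (OCR / slider trajectory simulation)
  - Playwright anti-detection (stealth.js injection)
- Scenario decision tree
- Prompt templates for guiding the user

Co-authored-by: L TANG <[email protected]>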
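The rate control item above names the token bucket algorithm. A minimal sketch of that idea follows; it is not the Skill's actual implementation. The class name `TokenBucket`, its parameters, and the injectable `clock` argument are all assumptions made for illustration.

```python
import time


class TokenBucket:
    """Token-bucket limiter: refills `rate` tokens per second, up to `capacity`.

    Hypothetical illustration; names and parameters are not taken from the Skill.
    """

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full, so an initial burst is allowed
        self.clock = clock      # injectable so the limiter can be tested without sleeping
        self.last = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last refill, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self, n: float = 1.0) -> bool:
        """Take n tokens if available; otherwise return False without blocking."""
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

    def wait_time(self, n: float = 1.0) -> float:
        """Seconds until n tokens would be available (0.0 if available now)."""
        self._refill()
        return max(0.0, (n - self.tokens) / self.rate)
```

A crawler loop could call `time.sleep(bucket.wait_time())` before each request and then `bucket.try_acquire()`. The capacity bounds the size of a burst, while the rate bounds the long-run request frequency.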
Following the anthropics/skills conventions, refactor the single-file web-crawler-skill.md into a complete Skill folder structure:
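.cursor/skills/web-crawler/
├── SKILL.md # Skill definition (YAML frontmatter + workflow + decision tree)
├── scripts/
│   ├── curl_to_config.py # Parses a cURL command into Python configuration
│   ├── generate_crawler.py # Generates crawler code from a configuration
│   └── run_crawler.py # General-purpose crawler runner (API/HTML dual mode)
├── templates/
│   ├── api_crawler.py # API data crawling template
│   ├── html_crawler.py # HTML page parsing template
│   └── browser_crawler.py # Playwright browser automation template
└── references/
    ├── headers_reference.md # Complete guide to request header spoofing
    └── anti_detection_reference.md # Reference on anti-crawling countermeasures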
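Key features:
- All scripts have been tested in practice and can be invoked directly
- run_crawler.py supports both API and HTML modes; a single command crawls and writes CSV output
- curl_to_config.py automatically parses a cURL command into structured configuration and Python code
- Three complete templates cover the API, HTML, and Browser crawling scenarios
- SKILL.md contains the interactive workflow and the scenario decision tree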
Co-authored-by: L TANG <[email protected]>
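The commit message above describes curl_to_config.py as parsing a cURL command into structured configuration. The sketch below shows one way this kind of parsing could work. It is not the script from this pull request: the function name `parse_curl` and the shape of the returned dictionary are assumptions. It handles only a few common flags (`-X`, `-H`, `-d` and their long forms) and skips any other flag as if it took no argument, so flags that do take one, such as `-b`, would need their own handling.

```python
import shlex


def parse_curl(command: str) -> dict:
    """Parse a simple cURL command into method, url, headers, and data.

    Hypothetical helper for illustration; not the PR's curl_to_config.py.
    """
    # Join shell line continuations, then split the way a POSIX shell would.
    tokens = shlex.split(command.replace("\\\n", " "))
    if not tokens or tokens[0] != "curl":
        raise ValueError("expected a command starting with 'curl'")

    config = {"method": None, "url": None, "headers": {}, "data": None}
    i = 1
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("-X", "--request"):
            config["method"] = tokens[i + 1].upper()
            i += 2
        elif tok in ("-H", "--header"):
            # Split on the first colon only, so values such as URLs stay intact.
            name, _, value = tokens[i + 1].partition(":")
            config["headers"][name.strip()] = value.strip()
            i += 2
        elif tok in ("-d", "--data", "--data-raw", "--data-binary"):
            config["data"] = tokens[i + 1]
            i += 2
        elif tok.startswith("-"):
            i += 1  # unrecognized flag: skipped, assumed to take no argument
        else:
            config["url"] = tok
            i += 1

    # Mirror curl's default: a request with a body is a POST, otherwise a GET.
    if config["method"] is None:
        config["method"] = "POST" if config["data"] is not None else "GET"
    return config
```

The resulting dictionary could be passed to an HTTP client, for example as the `method`, `url`, `headers`, and `data` arguments of a request call.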
Add a comprehensive web crawler skill document to guide the model in interactive web scraping code generation.