Telegram Channel Crawler

A Python tool to fetch, archive, and analyze messages from Telegram channels and groups using Telethon. Includes trust scoring with OpenRank and database integration.

Features

  • Simple - Login with phone number, no session strings needed
  • Flexible - Configure via config.toml file
  • Async - Built with async/await for efficient message fetching
  • Rate limiting - Respects Telegram API limits
  • Parallel processing - Crawls multiple channels concurrently
  • Channel exclusion - Skip unwanted channels (logs, bots, etc.)
  • Checkpoints - Automatically saves progress and allows resuming if interrupted
  • JSON export - Saves messages with full metadata
  • Trust scoring - Calculate trust scores using the OpenRank algorithm
  • Database integration - Import data to PostgreSQL
  • AI summarization - Generate summaries using OpenAI
  • Photo management - Download and upload user profile photos to S3

Quick Start

1. Get API Credentials

  1. Visit https://my.telegram.org
  2. Login with your phone number
  3. Go to "API Development Tools"
  4. Create a new application (any name/description)
  5. Copy your api_id and api_hash

2. Install Dependencies

pip install -r requirements.txt

3. Setup Environment

Create a .env file with your credentials:

# Required for Telegram
TELEGRAM_APP_ID=12345678
TELEGRAM_APP_HASH=abcdef1234567890abcdef1234567890
TELEGRAM_PHONE=+1234567890

# Optional: For database imports
DATABASE_URL=postgresql://user:pass@localhost:5432/dbname

# Optional: For S3 photo uploads
S3USERNAME=your_aws_access_key_id
S3CREDENTIAL=your_aws_secret_access_key

# Optional: For AI summarization
OPENAI_API_KEY=sk-...

Note: TELEGRAM_PHONE is optional - you'll be prompted for it if not set

4. Configure Channels

First, list all your accessible channels:

python list_channels.py

This will show all channels with their IDs. Then edit config.toml to set which channels to crawl:

[group_chats]
include = [
    1234567890,     # Group chat ID (from list_channels.py)
]

[channels]
include = [
    -1001234567890,     # Channel ID (from list_channels.py)
]

5. Run the Crawler

python read_messages.py

On first run, you'll be prompted to:

  • Enter your phone number (if not in .env)
  • Enter the verification code Telegram sends you
  • Enter your 2FA password (if enabled)

A session file will be created so you don't need to login again on subsequent runs.

Configuration

Edit config.toml to customize the crawler:

[crawler]
time_window_days = 365            # How many days back to fetch
max_messages_per_channel = 40000  # Message limit per channel
parallel_requests = 1             # Concurrent channels to process
batch_size = 500                  # Number of messages to fetch per batch
rate_limiting_delay = 0.5         # Delay between requests (seconds)
checkpoint_interval = 2000        # Save checkpoint every N messages (0 to disable)
fetch_replies = true              # Fetch replies/comments to channel posts
max_reply_depth = 4               # Maximum depth for nested replies (0-5 recommended)

[group_chats]
include = [1234567890]            # Group chat IDs to crawl

[channels]
include = [-1001234567890]        # Channel IDs to crawl

[output]
pretty_print = true               # Format JSON nicely
indent_spaces = 2                 # JSON indentation

[trust]
mention_points = 50               # Points for direct mentions
reply_points = 40                 # Points for replies
reaction_points = 30              # Points for reactions
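
The [trust] weights above drive edge generation in generate_trust.py. As a rough illustration of how reply and reaction events could be turned into weighted edges, here is a minimal sketch; the function and field names are illustrative, not the project's actual code (mentions are omitted, since they require username-to-ID resolution):

```python
# Sketch: turn one message into weighted trust edges (i, j, v).
# Point values mirror the [trust] section of config.toml; the edge
# format matches the i,j,v CSVs written to trust/. All names here
# are illustrative assumptions, not the actual implementation.
REPLY_POINTS = 40
REACTION_POINTS = 30

def edges_from_message(msg, author_of):
    """Yield (i, j, v) trust edges implied by a single message.

    msg       -- simplified message dict as in raw/*_messages.json
    author_of -- maps a message ID to its author's user ID
    """
    src = msg["from_id"]
    # Replying to a message signals trust toward its author.
    parent = msg.get("reply_to_msg_id")
    if parent is not None and parent in author_of:
        yield (src, author_of[parent], REPLY_POINTS)
    # Each reaction is trust from the reactor to the message author.
    for r in msg.get("reactions", []):
        yield (r["user_id"], src, REACTION_POINTS)

msg = {"id": 2, "from_id": 111, "reply_to_msg_id": 1,
       "reactions": [{"user_id": 222, "emoji": "👍"}]}
edges = list(edges_from_message(msg, {1: 333}))
# edges == [(111, 333, 40), (222, 111, 30)]
```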

Files

Main Scripts

| Script | Description |
| --- | --- |
| read_messages.py | Main crawler script - fetches messages from Telegram |
| list_channels.py | List all accessible channels/groups with their IDs |
| list_admins.py | List admins/moderators for configured channels (saves to CSV) |
| generate_trust.py | Calculate trust scores from messages |
| process_scores.py | Aggregate and normalize trust scores |
| process_seed.py | Process seed CSV files with tier-based weighting |
| generate_json.py | Generate JSON files for the UI from seed/output data |

Database Scripts

| Script | Description |
| --- | --- |
| import_metadata_to_db.py | Import messages, reactions, users, and channels to PostgreSQL |
| import_scores_to_db.py | Import seeds and scores to PostgreSQL |

Photo Management

| Script | Description |
| --- | --- |
| download_photos.py | Download user profile photos to raw/photos/ |
| upload_photos.py | Upload photos to S3 (s3://openrank-files/telegram) |

AI Features

| Script | Description |
| --- | --- |
| summarize_posts.py | Generate AI summaries of posts using OpenAI |

Channel-Specific Scripts

Located in channel/ directory for channel-specific processing:

| Script | Description |
| --- | --- |
| channel/read_channel_messages.py | Channel-specific message fetching |
| channel/generate_channel_trust.py | Channel-specific trust generation |
| channel/generate_channel_json.py | Channel-specific JSON generation |

Pipeline

| Script | Description |
| --- | --- |
| run_pipeline.sh | Run the complete pipeline: trust → OpenRank → scores → JSON |

Configuration Files

| File | Description |
| --- | --- |
| config.toml | Main configuration file |
| .env | Environment variables (credentials) |
| requirements.txt | Python dependencies |

Directory Structure

trank/
├── channel/          # Channel-specific scripts
├── output/           # Processed output files
├── raw/              # Raw data from Telegram
│   ├── checkpoints/  # Checkpoint files for resuming
│   └── photos/       # Downloaded user profile photos
├── schemas/          # PostgreSQL schema files
├── scores/           # Computed OpenRank scores
├── seed/             # Seed values for trust computation
├── trust/            # Trust edge files
├── ui/               # JSON files for UI consumption
└── tmp/              # Temporary files

Common Commands

# List all your channels
python list_channels.py

# Run the crawler
python read_messages.py

# List admins/moderators (saves to CSV)
python list_admins.py

# Calculate trust scores
python generate_trust.py

# Process and normalize scores
python process_scores.py

# Generate UI JSON files
python generate_json.py

# Run complete pipeline
./run_pipeline.sh

# Run pipeline for channels (not group chats)
./run_pipeline.sh --channel

# Download user profile photos
python download_photos.py
python download_photos.py --skip-existing
python download_photos.py --verbose

# Upload photos to S3
python upload_photos.py
python upload_photos.py --dry-run
python upload_photos.py --force

# Import to database
python import_metadata_to_db.py
python import_metadata_to_db.py --channel 123456
python import_metadata_to_db.py --dry-run

python import_scores_to_db.py
python import_scores_to_db.py --channel 123456

# Generate AI summaries
python summarize_posts.py

Checkpoints

The crawler automatically saves checkpoints during message fetching to prevent data loss if interrupted. Checkpoints are saved to the raw/checkpoints/ directory.

How it works:

  • Checkpoint files are created every N messages (configurable via checkpoint_interval in config.toml)
  • Default is every 2000 messages
  • If the script is interrupted, it will detect the checkpoint on the next run and ask whether you want to resume
  • Checkpoints are automatically deleted after successful completion
  • Set checkpoint_interval = 0 to disable checkpoints
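
The key to a safe checkpoint is that an interrupted write must never leave a torn file. A minimal sketch of how such save/resume logic could look (file layout and field names here are assumptions for illustration, not the crawler's actual code):

```python
import json
import os
import tempfile

def save_checkpoint(path, channel_id, messages):
    """Write a checkpoint atomically: dump to a temp file in the
    same directory, then rename over the target so a crash mid-write
    can never leave a half-written checkpoint."""
    state = {"channel_id": channel_id, "messages": messages}
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename

def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```

The crawler would call save_checkpoint every checkpoint_interval messages and delete the file after a successful run.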

Resuming from checkpoint:

python read_messages.py
# If a checkpoint is found, you'll see:
# 📂 Found checkpoint for channel -1001234567890 with 500 messages
#    Last saved: 2025-01-15T10:30:45+00:00
#    Resume from checkpoint? (y/n):

Output Format

Messages

Messages are saved in the raw/ directory:

  • Format: raw/[channel_id]_messages.json
  • One file per channel
  • Contains simplified message data (ID, date, user ID, text, reactions, replies)

Example message:

{
  "id": 9099,
  "date": "2025-11-13T01:49:52+00:00",
  "from_id": 526750941,
  "message": "@lazovicff @dharmikumbhani",
  "reply_to_msg_id": 9098,
  "reactions": [
    {"user_id": 526750941, "emoji": "👍"},
    {"user_id": 123456789, "emoji": "👍"}
  ],
  "replies_count": 3,
  "replies_data": [...]
}

User Information

  • Format: raw/[channel_id]_user_ids.csv
  • Columns: user_id,username,first_name,last_name
  • Some users may not have usernames (this is normal on Telegram)

Admin Lists

  • Format: raw/[channel_id]_admins.csv
  • Columns: user_id,username,first_name,last_name
  • Generated by running python list_admins.py

Trust Scores

  • trust/[channel_id].csv - Raw trust edges with user IDs (i,j,v format)
  • scores/[channel_id].csv - OpenRank computed scores
  • output/[channel_id].csv - Processed scores with display names or user IDs
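
To make the i,j,v format concrete: process_scores.py aggregates the incoming trust for each user and normalizes the result. A hedged sketch of that aggregation step (the actual script also maps user IDs to display names, which is omitted here):

```python
from collections import defaultdict

def aggregate_and_normalize(edges):
    """Sum incoming trust per user (column j of each i,j,v row),
    then scale so all scores sum to 1."""
    incoming = defaultdict(float)
    for _i, j, v in edges:
        incoming[j] += v
    total = sum(incoming.values()) or 1.0
    return {user: v / total for user, v in incoming.items()}

# edges as read from trust/[channel_id].csv
scores = aggregate_and_normalize([(1, 2, 40.0), (3, 2, 30.0), (2, 1, 30.0)])
# user 2 received 70 of 100 total points -> scores[2] == 0.7
```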

Trust Score Workflow

The complete trust scoring workflow:

  1. Fetch messages: python read_messages.py

    • Saves messages to raw/[channel_id]_messages.json
    • Saves user info to raw/[channel_id]_user_ids.csv
  2. Generate trust edges: python generate_trust.py

    • Reads messages and calculates trust based on reactions, replies, and mentions
    • Saves trust edges to trust/[channel_id].csv (format: i,j,v)
  3. Compute OpenRank scores: Uses external openrank CLI tool

    openrank compute-local-et trust/[channel_id].csv seed/[channel_id].csv \
        --out-path=scores/[channel_id].csv --alpha=0.25 --delta=0.000001
  4. Process scores: python process_scores.py

    • Aggregates incoming trust for each user
    • Converts user IDs to display names
    • Normalizes scores
    • Saves to output/[channel_id].csv
  5. Generate JSON: python generate_json.py

    • Creates UI-ready JSON files in ui/ directory

Or run everything at once:

./run_pipeline.sh           # For group chats
./run_pipeline.sh --channel # For channels

Database Integration

Schema Files

Located in schemas/ directory:

  • messages.sql - Message storage
  • reactions.sql - Reaction data
  • users.sql - User information
  • channels.sql - Channel metadata
  • runs.sql - Processing run tracking
  • seeds.sql - Seed values
  • scores.sql - Computed scores
  • summaries.sql - AI-generated summaries

Importing Data

# Set DATABASE_URL in .env first
export DATABASE_URL=postgresql://user:pass@localhost:5432/dbname

# Import all metadata (messages, reactions, users, channels)
python import_metadata_to_db.py

# Import specific channel
python import_metadata_to_db.py --channel 123456

# Preview without inserting
python import_metadata_to_db.py --dry-run

# Import seeds and scores
python import_scores_to_db.py
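
The import scripts map the JSON records onto the tables defined in schemas/. The sketch below shows the general shape of such a mapping; the table and column names are assumptions for illustration, not the project's actual schema:

```python
# Sketch of mapping crawled messages to INSERT parameters.
# Table/column names are hypothetical, not the real schema.
def message_rows(channel_id, messages):
    """Map simplified message dicts to parameter tuples for INSERT."""
    return [(channel_id, m["id"], m["from_id"],
             m.get("message", ""), m["date"]) for m in messages]

INSERT_SQL = (
    "INSERT INTO messages (channel_id, message_id, from_id, text, date) "
    "VALUES (%s, %s, %s, %s, %s) ON CONFLICT DO NOTHING"
)

# With DATABASE_URL set, the rows could then be inserted via, e.g.:
#   import os, psycopg2
#   with psycopg2.connect(os.environ["DATABASE_URL"]) as conn:
#       conn.cursor().executemany(INSERT_SQL, message_rows(cid, msgs))
rows = message_rows(-1001234567890,
                    [{"id": 1, "from_id": 2, "message": "hi",
                      "date": "2025-01-01T00:00:00+00:00"}])
```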

Troubleshooting

"Missing Telegram credentials" → Make sure .env has TELEGRAM_APP_ID and TELEGRAM_APP_HASH

"Channel is not a valid ID" → Only numeric IDs are accepted, run python list_channels.py to get IDs

"Could not find the input entity" → Make sure the channel ID is correct (from list_channels.py)

"A wait of X seconds is required" → You're rate limited. Increase rate_limiting_delay in config.toml

Script keeps getting interrupted → Enable checkpoints in config.toml with checkpoint_interval = 2000 to save progress periodically

Want to restart from scratch (ignore checkpoint) → When prompted to resume, type 'n' or manually delete checkpoint files in raw/checkpoints/

Import errors → Install dependencies: pip install -r requirements.txt

Authorization failed → Make sure you enter the correct phone number and verification code

"Collected info for 0 unique users" for channel posts → This is normal for channels (not groups). Set fetch_replies = true in config.toml to fetch comments/replies where user interactions happen.

Database connection failed → Check that DATABASE_URL is set correctly in .env

S3 upload failed → Check that S3USERNAME and S3CREDENTIAL are set in .env

Session Files

The crawler creates a telegram_session.session file to remember your login.

  • This file is automatically created on first login
  • Don't commit this file to git (it's in .gitignore)
  • Delete it if you want to login with a different account

License

ISC