This project implements a distributed MongoDB cluster with sharding and replication to manage large-scale public transport data.
It simulates a real-world transportation system handling:
- Millions of tickets
- Thousands of passengers and buses
- Multi-city data distribution
The system is designed to ensure:
- Scalability
- High availability
- Fault tolerance
The system is based on a MongoDB Sharded Cluster:
- 3 Shards (data distribution)
- Replica Sets (fault tolerance)
- Config Servers (metadata)
- Mongos Router (query routing)
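This topology can be stood up with Docker Compose. A trimmed, hypothetical fragment (service names, image tag, and ports are illustrative assumptions, not the project's actual configuration):

```yaml
# Sketch of a sharded-cluster topology; one node per role shown for brevity.
services:
  configsvr1:
    image: mongo:7
    command: mongod --configsvr --replSet cfgrs --port 27019
  shard1a:
    image: mongo:7
    command: mongod --shardsvr --replSet shard1rs --port 27018
  mongos:
    image: mongo:7
    command: mongos --configdb cfgrs/configsvr1:27019 --port 27017
    depends_on: [configsvr1, shard1a]
```

After the containers start, each replica set still needs `rs.initiate()` and each shard must be registered from mongos with `sh.addShard()`.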
Query flow:
1. Client sends a query to mongos
2. Mongos routes the query to the relevant shard(s)
3. Each shard processes its portion locally
4. Results are merged and returned to the client
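The steps above can be sketched in plain Python: a toy "mongos" that keeps the chunk map (shard-key range → shard) and contacts only the shards that can hold matching documents. All shard names, ranges, and documents here are made-up illustrations, not the project's data.

```python
# Toy mongos: targeted vs scatter-gather queries (all data hypothetical).
SHARDS = {
    "shard1": [{"ville": "Alger", "prix": 30}, {"ville": "Blida", "prix": 25}],
    "shard2": [{"ville": "Oran", "prix": 40}],
    "shard3": [{"ville": "Tunis", "prix": 35}],
}
CHUNKS = {"shard1": ("A", "H"), "shard2": ("H", "P"), "shard3": ("P", "~")}

def find(query):
    """Route on the shard key when present, else scatter-gather to every shard."""
    if "ville" in query:                       # targeted: only the owning shard
        targets = [s for s, (lo, hi) in CHUNKS.items() if lo <= query["ville"] < hi]
    else:                                      # scatter-gather: all shards
        targets = list(SHARDS)
    results = []                               # mongos merges the partial results
    for shard in targets:
        results += [d for d in SHARDS[shard]
                    if all(d.get(k) == v for k, v in query.items())]
    return results

print(find({"ville": "Oran"}))   # only shard2 is contacted
print(len(find({})))             # no shard key -> all 3 shards contacted
```

A query that includes the shard key touches one shard; a query without it fans out to all shards, which is why filtering by city stays fast as the cluster grows.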
| Collection | Record Count |
|---|---|
| Tickets | 1,000,000 |
| Passengers | 100,000 |
| Buses | 960 |
| Lines | 320 |
| Stations | 5,686 |
This folder contains sample datasets used for demonstration purposes.
- `sample/` → small data samples for testing
- `schema/` → data structure definitions
Full datasets (millions of records) are not included due to size constraints.
- Shard key: `ville`
- Strategy: range-based sharding

The `ville` field was chosen for its:
- High cardinality
- Frequent use in city-based filters
- Uniform value distribution
- Stability (the field is rarely updated)
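With range-based sharding, the shard-key space is cut into contiguous chunks by split points. A minimal sketch of that idea (the split points and shard names are invented for illustration; in a real cluster the balancer chooses them):

```python
import bisect

# Hypothetical split points on the `ville` key, giving three chunks:
# [MinKey, "H"), ["H", "P"), ["P", MaxKey)
SPLIT_POINTS = ["H", "P"]
SHARDS = ["shard1", "shard2", "shard3"]

def chunk_index(ville: str) -> int:
    """Index of the chunk whose key range contains this `ville` value."""
    return bisect.bisect_right(SPLIT_POINTS, ville)

def owning_shard(ville: str) -> str:
    return SHARDS[chunk_index(ville)]

for city in ["Alger", "Oran", "Tunis"]:
    print(city, "->", owning_shard(city))
```

Because city names spread across the alphabet, chunks (and therefore load) end up distributed across all three shards.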
Key features:
- Distributed storage using MongoDB
- Horizontal scaling via sharding
- Replication for high availability
- Fault tolerance testing
- Analytical queries (aggregation)
Fault-tolerance test results:
✔ Reading continues even if a node fails
✔ Data remains accessible via secondary nodes
✔ Automatic failover supported
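The behaviour behind those checks can be sketched as a toy replica set: when the primary is down, a read falls back to the next available secondary, similar in spirit to a `primaryPreferred` read preference. Node names and the sample document are illustrative only.

```python
# Toy replica set: reads fall back to a secondary when the primary is down.
class Node:
    def __init__(self, name):
        self.name, self.up = name, True
        self.data = {"ticket:1": {"ville": "Alger"}}

    def read(self, key):
        if not self.up:
            raise ConnectionError(self.name + " is down")
        return self.data[key]

def read_with_failover(nodes, key):
    """Try each replica in turn until one answers."""
    for node in nodes:
        try:
            return node.read(key), node.name
        except ConnectionError:
            continue
    raise RuntimeError("no replica available")

primary, sec1, sec2 = Node("primary"), Node("secondary1"), Node("secondary2")
primary.up = False                      # simulate primary failure
doc, served_by = read_with_failover([primary, sec1, sec2], "ticket:1")
print(served_by)                        # a secondary serves the read
```

In the real cluster the replica set also elects a new primary automatically; this sketch only shows the read-path fallback.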
Example analytical queries:
- Number of tickets per city
- Most used transport line
- Time-based filtering of tickets
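These queries map naturally to aggregation pipelines. A sketch of what they could look like as PyMongo-style pipeline definitions, with a tiny in-memory check of the `$group`/`$sum` semantics (field names such as `ville`, `ligne`, and `date` are assumptions about the schema, not confirmed by the project):

```python
from collections import Counter
from datetime import datetime

# Pipelines as they would be passed to collection.aggregate(...).
tickets_per_city = [
    {"$group": {"_id": "$ville", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
most_used_line = [
    {"$group": {"_id": "$ligne", "uses": {"$sum": 1}}},
    {"$sort": {"uses": -1}},
    {"$limit": 1},
]
time_filtered = [
    {"$match": {"date": {"$gte": datetime(2024, 1, 1),
                         "$lt": datetime(2025, 1, 1)}}},
    {"$count": "tickets_in_2024"},
]

# In-memory stand-in showing what $group + $sum would produce on sample data.
sample = [{"ville": "Alger", "ligne": "L1"},
          {"ville": "Alger", "ligne": "L2"},
          {"ville": "Oran",  "ligne": "L1"}]
counts = Counter(t["ville"] for t in sample)
print(counts)   # Counter({'Alger': 2, 'Oran': 1})
```

On a sharded collection, mongos pushes the `$group` stage down to each shard and merges the partial counts, so these aggregations scale with the number of shards.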
Technology stack:
- MongoDB (Sharding + Replication)
- Docker & Docker Compose
- PyMongo (optional integration)
- Python (for frontend visualization)
For more details about the project setup and implementation:
- Setup guide: see `docs/setup-guide.md`
- Full command list: see `scripts/commands.txt`
These resources provide step-by-step instructions and cluster configuration details.