Batch-extract FASTA subsequences from NCBI by genomic coordinates with optional flanking regions.
Ever had a list of hundreds or thousands of genomic coordinates and needed to pull the actual sequences from NCBI? Doing it manually through the web interface takes ages. Copying one accession at a time, navigating to the right region, downloading, repeating… it's painful.
NCBI-SeqExtract takes that pain away. Just give it a CSV with your accessions and coordinates, and it will:
- 📥 Batch-download full contig/scaffold sequences from NCBI Nucleotide - no manual clicking
- ✂️ Extract the exact subsequences at your specified coordinates
- 📏 Add flanking regions (e.g. ±500 bp upstream/downstream) if you need surrounding context
- 🔄 Reverse-complement minus-strand annotations automatically
- 📝 Export everything into a clean, ready-to-use multi-FASTA file
- ⚠️ Report any skipped or failed rows so nothing is silently lost
Whether you're extracting endogenous viral elements (EVEs), transposable elements, or any other annotated genomic regions, this tool handles it at scale so you don't have to.
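As a rough illustration of the extract and reverse-complement steps, the sketch below slices a 1-based, inclusive interval out of a contig string and reverse-complements it for minus-strand rows. The function name `extract_region` is hypothetical and not taken from the script, which may well use Biopython's `Seq.reverse_complement()` instead.

```python
# Complement table for unambiguous DNA bases plus N (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_region(contig: str, start: int, end: int, strand: str = "+") -> str:
    """Return contig[start..end] (1-based, inclusive), reverse-complemented for '-'."""
    if start < 1 or end > len(contig) or start > end:
        raise ValueError(f"invalid coordinates: {start}-{end}")
    subseq = contig[start - 1:end]  # 1-based inclusive -> 0-based half-open slice
    if strand == "-":
        subseq = subseq.translate(_COMPLEMENT)[::-1]
    return subseq


contig = "AACCGGTTAC"
print(extract_region(contig, 3, 6))       # CCGG
print(extract_region(contig, 1, 4, "-"))  # GGTT
```

IUPAC ambiguity codes other than `N` are left uncomplemented in this sketch; a production version would need the full table.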
- Python 3.7+
- Required packages:
  `pip install biopython pandas tqdm`

Your CSV must contain the three required columns below; the optional columns fall back to their defaults when absent:
| Column | Required | Description |
|---|---|---|
| `seqname` | ✅ Yes | NCBI accession / contig ID |
| `start` | ✅ Yes | Start position (1-based) |
| `end` | ✅ Yes | End position (1-based, inclusive) |
| `Sense` | ❌ No | Strand: `+` or `-` (defaults to `+`) |
| `Family` | ❌ No | Family annotation (defaults to `NA`) |
| `Genus` | ❌ No | Genus annotation (defaults to `NA`) |
| `Species` | ❌ No | Species annotation (defaults to `NA`) |
Minimal example (required columns only):

    seqname,start,end
    NC_000001.11,100000,105000
    NC_000002.12,250000,253000

Full example (with optional columns):

    seqname,start,end,Sense,Family,Genus,Species
    NC_000001.11,100000,105000,+,Retroviridae,Gammaretrovirus,MLV
    NC_000002.12,250000,253000,-,Parvoviridae,Dependoparvovirus,AAV

Tip: The `seqname` column can contain pipe-separated IDs (e.g. `NC_000001.1|extra_info`); only the part before the first `|` is used as the accession.
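To make the column rules concrete, here is a minimal sketch of how a row might be normalised: the accession is taken from the text before the first `|`, and the documented defaults are applied to the optional columns. The helper name `normalise_row` is hypothetical, and treating an empty cell the same as a missing column is an assumption; the script itself uses pandas and may behave differently.

```python
import csv
import io

REQUIRED = ("seqname", "start", "end")
OPTIONAL_DEFAULTS = {"Sense": "+", "Family": "NA", "Genus": "NA", "Species": "NA"}


def normalise_row(row: dict) -> dict:
    """Apply the documented defaults and keep only the accession before the first '|'."""
    missing = [col for col in REQUIRED if not row.get(col)]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    # Assumption: an empty optional cell is treated like an absent column.
    clean = {col: (row.get(col) or default) for col, default in OPTIONAL_DEFAULTS.items()}
    clean["accession"] = row["seqname"].split("|", 1)[0].strip()
    clean["start"] = int(row["start"])
    clean["end"] = int(row["end"])
    return clean


csv_text = (
    "seqname,start,end,Sense\n"
    "NC_000001.11|extra_info,100000,105000,-\n"
    "NC_000002.12,250000,253000,\n"
)
rows = [normalise_row(r) for r in csv.DictReader(io.StringIO(csv_text))]
for r in rows:
    print(r["accession"], r["start"], r["end"], r["Sense"], r["Family"])
```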
Open Fasta_Seq_Extraction.py and edit the CONFIG section at the top:
    Entrez.email = "your.email@example.com"  # Required by NCBI (use your own address)
    RATE_LIMIT = 0.4                         # Seconds between API calls
    BATCH_SIZE = 10                          # Accessions per request
    FLANK = 500                              # Flanking bp (0 = exact region only)
    CSV_FILE = "your_input.csv"
    OUT_FASTA = "output_sequences.fasta"
    SKIPPED_REPORT = "skipped_rows.csv"

Then run the script:

    python Fasta_Seq_Extraction.py

The script writes two files:

| File | Description |
|---|---|
| `output_sequences.fasta` | Extracted FASTA sequences |
| `skipped_rows.csv` | Rows that failed, with reasons (`missing_contig`, `invalid_coordinates`) |
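The `BATCH_SIZE` and `RATE_LIMIT` settings suggest a fetch loop along the following lines. This is only a sketch of the batching and pacing logic: `chunked`, `fetch_all`, and the `fetch_batch` callback are hypothetical names, the de-duplication step is an assumption, and the real script presumably talks to NCBI through Biopython's `Entrez` module rather than a callback.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for i in range(0, len(items), size):
        yield items[i:i + size]


def fetch_all(accessions: Iterable[str],
              fetch_batch: Callable[[List[str]], Dict[str, str]],
              batch_size: int = 10,
              rate_limit: float = 0.4) -> Dict[str, str]:
    """Fetch each unique accession once, pausing `rate_limit` seconds between requests."""
    unique = list(dict.fromkeys(accessions))  # de-duplicate, keep first-seen order
    contigs: Dict[str, str] = {}
    for n, batch in enumerate(chunked(unique, batch_size)):
        if n:  # no need to wait before the first request
            time.sleep(rate_limit)
        contigs.update(fetch_batch(batch))
    return contigs


# Stand-in fetcher for demonstration only: records the batches it receives.
seen_batches = []


def fake_fetch(batch):
    seen_batches.append(list(batch))
    return {acc: "ACGT" for acc in batch}


accessions = [f"ACC{i}" for i in range(25)] + ["ACC0"]  # one duplicate
contigs = fetch_all(accessions, fake_fetch, batch_size=10, rate_limit=0.0)
print(len(contigs), [len(b) for b in seen_batches])  # 25 [10, 10, 5]
```

Fetching each contig once and reusing it for every row that references it keeps the number of requests proportional to the number of distinct accessions rather than the number of CSV rows.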
Control how much upstream/downstream sequence is included with the FLANK parameter:
| `FLANK` value | Behaviour |
|---|---|
| `0` | Extract the exact region (start → end) |
| `500` | Add 500 bp upstream and downstream |
| `5000` | Add 5000 bp upstream and downstream |
- Flanking is automatically clamped to contig boundaries (never goes out of range).
- When flanking is active, the FASTA header shows the actual extracted coordinates and a `flank_N` tag.
Example header (FLANK = 500):
>NC_000001.11:99501-105500|flank_500|Retroviridae|Gammaretrovirus|MLV
Example header (FLANK = 0):
>NC_000001.11:100000-105000|Retroviridae|Gammaretrovirus|MLV
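The clamping rule described above can be sketched as a small helper that widens a 1-based, inclusive interval by `FLANK` bases and trims it to the contig. The name `flanked_window` is hypothetical, and symmetric widening is an assumption: the script's own arithmetic may differ by a position, so treat the header examples above as the reference for its actual output.

```python
from typing import Tuple


def flanked_window(start: int, end: int, contig_len: int, flank: int = 0) -> Tuple[int, int]:
    """Widen [start, end] (1-based, inclusive) by `flank` and clamp to [1, contig_len]."""
    if flank < 0:
        raise ValueError("flank must be non-negative")
    if start < 1 or end > contig_len or start > end:
        raise ValueError(f"invalid coordinates: {start}-{end}")
    return max(1, start - flank), min(contig_len, end + flank)


# A 1000 bp contig: the upstream flank is cut off at position 1.
print(flanked_window(200, 400, 1000, flank=500))  # (1, 900)
# FLANK = 0 leaves the interval unchanged.
print(flanked_window(200, 400, 1000, flank=0))    # (200, 400)
```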
After each run, a summary is printed:
=================================
Total CSV rows : 150
Sequences written : 142
Rows skipped : 8
Skip reasons breakdown:
missing_contig 5
invalid_coordinates 3
Detailed skipped rows saved to: skipped_rows.csv
=================================
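If you want to reproduce the skip-reason breakdown yourself, a tally over the report file is enough. The sketch below assumes the report has a column named `reason`; that column name and the sample layout are guesses, so check the header of your own `skipped_rows.csv` first.

```python
import csv
import io
from collections import Counter


def summarise_skips(handle, reason_column="reason"):
    """Count skipped rows per reason; `reason_column` is an assumed column name."""
    return Counter(row[reason_column] for row in csv.DictReader(handle))


# Inline sample standing in for skipped_rows.csv (layout assumed).
sample = io.StringIO(
    "seqname,start,end,reason\n"
    "XX_000001.1,10,50,missing_contig\n"
    "NC_000001.11,500,100,invalid_coordinates\n"
    "XX_000002.1,1,20,missing_contig\n"
)
for reason, count in summarise_skips(sample).most_common():
    print(f"{reason:<22}{count}")
```

For a real report, pass an open file instead: `with open("skipped_rows.csv", newline="") as fh: summarise_skips(fh)`.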
| Issue | Solution |
|---|---|
| Missing columns in CSV | Make sure your CSV has seqname, start and end columns |
| Timeout / connection errors | Increase socket.setdefaulttimeout() or reduce BATCH_SIZE |
| Too many skipped rows | Check accessions are valid NCBI IDs; verify coordinates |
| NCBI rate-limit errors (429) | Increase RATE_LIMIT (e.g. to 1.0) |
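For the timeout case, the standard-library call looks like this. The value of 120 seconds is an arbitrary example, not a setting taken from the script; the call needs to run before any network request is made.

```python
import socket

# Applies to every socket created afterwards, including those opened by HTTP clients.
socket.setdefaulttimeout(120)
print(socket.getdefaulttimeout())  # 120.0
```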
Contributions, issues and feature requests are welcome! Feel free to open an issue or pull request.