🧬 NCBI-SeqExtract

Batch-extract FASTA subsequences from NCBI by genomic coordinates with optional flanking regions.

📖 About

Ever had a list of hundreds or thousands of genomic coordinates and needed to pull the actual sequences from NCBI? Doing it manually through the web interface takes ages. Copying one accession at a time, navigating to the right region, downloading, repeating… it's painful.

NCBI-SeqExtract takes that pain away. Just give it a CSV with your accessions and coordinates, and it will:

📥 Batch-download full contig/scaffold sequences from NCBI Nucleotide - no manual clicking
✂️ Extract the exact subsequences at your specified coordinates
📏 Add flanking regions (e.g. ±500 bp upstream/downstream) if you need surrounding context
🔄 Reverse-complement minus-strand annotations automatically
📝 Export everything into a clean, ready-to-use multi-FASTA file
⚠️ Report any skipped or failed rows so nothing is silently lost

Whether you're extracting endogenous viral elements (EVEs), transposable elements, or any other annotated genomic regions, this tool handles it at scale so you don't have to.
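To make the extraction and strand handling concrete, here is a minimal sketch of the idea, not the script's actual code. It assumes plain DNA strings and hypothetical helper names (`reverse_complement`, `extract_region`); the real script may well rely on Biopython for this step.

```python
# Illustrative sketch only; helper names are hypothetical, not taken from the script.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_region(contig: str, start: int, end: int, strand: str = "+") -> str:
    """Slice a 1-based, inclusive region and flip it for minus-strand rows."""
    # Convert 1-based inclusive coordinates to a 0-based half-open Python slice.
    region = contig[start - 1:end]
    return reverse_complement(region) if strand == "-" else region


contig = "AACCGGTTAC"
print(extract_region(contig, 2, 5))        # ACCG
print(extract_region(contig, 2, 5, "-"))   # CGGT
```

The key detail is the `start - 1` in the slice: the CSV coordinates are 1-based and inclusive, while Python slices are 0-based and exclude the end index.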

⚙️ Requirements

Python 3.7+
Required packages:

pip install biopython pandas tqdm

📂 Input CSV Format

Your CSV must contain the three required columns below; the remaining columns are optional and fall back to the listed defaults:

Column	Required	Description
`seqname`	✅ Yes	NCBI accession / contig ID
`start`	✅ Yes	Start position (1-based)
`end`	✅ Yes	End position (1-based, inclusive)
`Sense`	❌ No	Strand: `+` or `-` (defaults to `+`)
`Family`	❌ No	Family annotation (defaults to `NA`)
`Genus`	❌ No	Genus annotation (defaults to `NA`)
`Species`	❌ No	Species annotation (defaults to `NA`)

Minimal CSV Example

seqname,start,end
NC_000001.11,100000,105000
NC_000002.12,250000,253000

Full CSV Example

seqname,start,end,Sense,Family,Genus,Species
NC_000001.11,100000,105000,+,Retroviridae,Gammaretrovirus,MLV
NC_000002.12,250000,253000,-,Parvoviridae,Dependoparvovirus,AAV

Tip: The seqname column can contain pipe-separated IDs (e.g. NC_000001.1|extra_info); only the part before the first | is used as the accession.
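The column rules above can be sketched roughly as follows. This is an assumption-laden illustration rather than the script's own parser: it uses the standard-library `csv` module instead of pandas so it runs without extra dependencies, and `load_rows`, `REQUIRED` and `OPTIONAL_DEFAULTS` are hypothetical names.

```python
import csv
import io

REQUIRED = ("seqname", "start", "end")
# Optional columns and the defaults listed in the column table.
OPTIONAL_DEFAULTS = {"Sense": "+", "Family": "NA", "Genus": "NA", "Species": "NA"}


def load_rows(handle):
    """Yield one normalised dict per CSV row; raise if a required column is absent."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing columns in CSV: {', '.join(missing)}")
    for row in reader:
        record = {
            # Only the text before the first "|" is treated as the accession.
            "accession": row["seqname"].split("|", 1)[0].strip(),
            "start": int(row["start"]),
            "end": int(row["end"]),
        }
        # Missing or empty optional cells fall back to their defaults.
        for column, default in OPTIONAL_DEFAULTS.items():
            value = (row.get(column) or "").strip()
            record[column] = value or default
        yield record


sample = io.StringIO(
    "seqname,start,end,Sense\n"
    "NC_000001.11|extra_info,100000,105000,-\n"
    "NC_000002.12,250000,253000,\n"
)
rows = list(load_rows(sample))
print(rows[0]["accession"], rows[0]["Sense"], rows[1]["Sense"], rows[1]["Family"])
```

Non-numeric coordinates would raise a `ValueError` in this sketch; handling them as skipped rows is left out for brevity.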

🚀 Usage

1. Configure the script

Open Fasta_Seq_Extraction.py and edit the CONFIG section at the top:

Entrez.email = "your.email@example.com"   # Required by NCBI; replace with your own address
RATE_LIMIT   = 0.4                        # Seconds between API calls
BATCH_SIZE   = 10                         # Accessions per request
FLANK        = 500                        # Flanking bp (0 = exact region only)

CSV_FILE       = "your_input.csv"
OUT_FASTA      = "output_sequences.fasta"
SKIPPED_REPORT = "skipped_rows.csv"
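For orientation, this is roughly how `BATCH_SIZE` and `RATE_LIMIT` could interact during the download step. It is a sketch under assumptions, not the script's code: `chunked`, `fetch_batch` and `fetch_all` are hypothetical names, and the sketch assumes the sequences are fetched as FASTA from the `nuccore` database via `Bio.Entrez.efetch`.

```python
import time

RATE_LIMIT = 0.4   # seconds to wait between requests
BATCH_SIZE = 10    # accessions per request


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def fetch_batch(accessions, email):
    """Fetch one batch as FASTA from NCBI Nucleotide (needs Biopython and network)."""
    from Bio import Entrez, SeqIO  # imported lazily so the helpers above work offline

    Entrez.email = email
    handle = Entrez.efetch(db="nuccore", id=",".join(accessions),
                           rettype="fasta", retmode="text")
    try:
        return {record.id: str(record.seq) for record in SeqIO.parse(handle, "fasta")}
    finally:
        handle.close()


def fetch_all(accessions, email, fetch=fetch_batch, pause=RATE_LIMIT):
    """Download every unique accession, pausing between batches."""
    contigs = {}
    unique = sorted(set(accessions))  # each contig is only downloaded once
    for batch in chunked(unique, BATCH_SIZE):
        contigs.update(fetch(batch, email))
        time.sleep(pause)
    return contigs
```

A larger `BATCH_SIZE` means fewer requests but bigger responses; a larger `RATE_LIMIT` slows the run but is gentler on NCBI's servers.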

2. Run

python Fasta_Seq_Extraction.py

3. Output

File	Description
`output_sequences.fasta`	Extracted FASTA sequences
`skipped_rows.csv`	Rows that failed with reasons (`missing_contig`, `invalid_coordinates`)

🧩 Flanking Regions

Control how much upstream/downstream sequence is included with the FLANK parameter:

`FLANK` value	Behaviour
`0`	Extract the exact region (start → end)
`500`	Add 500 bp upstream and downstream
`5000`	Add 5000 bp upstream and downstream

Flanking is automatically clamped to contig boundaries (never goes out of range).
When flanking is active, the FASTA header shows the actual extracted coordinates and a flank_N tag.

Example header (FLANK = 500):

>NC_000001.11:99500-105500|flank_500|Retroviridae|Gammaretrovirus|MLV

Example header (FLANK = 0):

>NC_000001.11:100000-105000|Retroviridae|Gammaretrovirus|MLV
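The clamping and header rules can be sketched like this. It assumes the flank is simply subtracted from `start` and added to `end` before clamping; the script's exact arithmetic may differ, and `flanked_window` and `build_header` are hypothetical names.

```python
def flanked_window(start, end, flank, contig_length):
    """Widen a 1-based inclusive region by `flank` bp, clamped to the contig."""
    return max(1, start - flank), min(contig_length, end + flank)


def build_header(accession, start, end, flank, annotations):
    """Format a FASTA header; the flank_N tag is only added when flank > 0."""
    parts = [f"{accession}:{start}-{end}"]
    if flank > 0:
        parts.append(f"flank_{flank}")
    parts.extend(annotations)
    return ">" + "|".join(parts)


# A region near the start of a hypothetical 1,000 bp contig:
# the upstream flank cannot go below position 1, so it is clamped.
new_start, new_end = flanked_window(start=200, end=400, flank=500, contig_length=1000)
print(build_header("NC_000001.11", new_start, new_end, 500,
                   ["Retroviridae", "Gammaretrovirus", "MLV"]))
# >NC_000001.11:1-900|flank_500|Retroviridae|Gammaretrovirus|MLV
```

Because of clamping, the extracted window can be shorter than the requested region plus twice the flank, which is why the header reports the actual coordinates.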

📊 Output Summary

After each run, a summary is printed:

=================================
Total CSV rows      : 150
Sequences written   : 142
Rows skipped        : 8

Skip reasons breakdown:
missing_contig          5
invalid_coordinates     3

Detailed skipped rows saved to: skipped_rows.csv
=================================
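As a rough illustration of how rows might end up in each skip category, the sketch below classifies rows and tallies the reasons. The validity rule (1 <= start <= end <= contig length) is an assumption, as are the `skip_reason` helper and the sample data; the script may apply different checks.

```python
import csv
import io
from collections import Counter


def skip_reason(row, contigs):
    """Return a skip reason for a row, or None if it can be extracted."""
    contig = contigs.get(row["accession"])
    if contig is None:
        return "missing_contig"
    # Assumed validity rule: 1 <= start <= end <= contig length.
    if not (1 <= row["start"] <= row["end"] <= len(contig)):
        return "invalid_coordinates"
    return None


contigs = {"ACC1": "ACGT" * 25}  # one hypothetical 100 bp contig
rows = [
    {"accession": "ACC1", "start": 10, "end": 20},
    {"accession": "ACC1", "start": 50, "end": 40},   # end before start
    {"accession": "ACC2", "start": 1, "end": 5},     # contig was never downloaded
]

skipped = []
for row in rows:
    reason = skip_reason(row, contigs)
    if reason:
        skipped.append({**row, "reason": reason})

report = io.StringIO()  # stands in for skipped_rows.csv
writer = csv.DictWriter(report, fieldnames=["accession", "start", "end", "reason"])
writer.writeheader()
writer.writerows(skipped)

print(f"Total CSV rows      : {len(rows)}")
print(f"Sequences written   : {len(rows) - len(skipped)}")
print(f"Rows skipped        : {len(skipped)}")
for reason, count in Counter(r["reason"] for r in skipped).most_common():
    print(f"{reason:<24}{count}")
```

Keeping the original row alongside its reason makes the report easy to fix and feed back in as a new input CSV.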

🛠️ Troubleshooting

Issue	Solution
`Missing columns in CSV`	Make sure your CSV has `seqname`, `start` and `end` columns
Timeout / connection errors	Increase the timeout passed to `socket.setdefaulttimeout()` or reduce `BATCH_SIZE`
Too many skipped rows	Check accessions are valid NCBI IDs; verify coordinates
NCBI rate-limit errors (429)	Increase `RATE_LIMIT` (e.g. to `1.0`)
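For the timeout fix, the change is a single standard-library call near the top of the script. The 120-second value below is an arbitrary example, not the script's default.

```python
import socket

# Apply a 120-second timeout to sockets created from here on, which covers
# the HTTP connections Bio.Entrez opens when no explicit timeout is given.
socket.setdefaulttimeout(120)
print(socket.getdefaulttimeout())
```

If timeouts persist with a generous value, a smaller `BATCH_SIZE` keeps each response short enough to finish in time.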

🙋 Contributing

Contributions, issues and feature requests are welcome! Feel free to open an issue or pull request.
