HarshApurva/NCBI-SeqExtract

🧬 NCBI-SeqExtract

Batch-extract FASTA subsequences from NCBI by genomic coordinates with optional flanking regions.



📖 About

Ever had a list of hundreds or thousands of genomic coordinates and needed to pull the actual sequences from NCBI? Doing it manually through the web interface takes ages. Copying one accession at a time, navigating to the right region, downloading, repeating… it's painful.

NCBI-SeqExtract takes that pain away. Just give it a CSV with your accessions and coordinates, and it will:

  • 📥 Batch-download full contig/scaffold sequences from NCBI Nucleotide, with no manual clicking
  • ✂️ Extract the exact subsequences at your specified coordinates
  • 📏 Add flanking regions (e.g. ±500 bp upstream/downstream) if you need surrounding context
  • 🔄 Reverse-complement minus-strand annotations automatically
  • 📝 Export everything into a clean, ready-to-use multi-FASTA file
  • ⚠️ Report any skipped or failed rows so nothing is silently lost

Whether you're extracting endogenous viral elements (EVEs), transposable elements, or any other annotated genomic regions, this tool handles it at scale so you don't have to.


⚙️ Requirements

  • Python 3.7+
  • Required packages:
pip install biopython pandas tqdm

📂 Input CSV Format

Your CSV must contain the three required columns below; the remaining columns are optional:

| Column | Required | Description |
| --- | --- | --- |
| seqname | ✅ Yes | NCBI accession / contig ID |
| start | ✅ Yes | Start position (1-based) |
| end | ✅ Yes | End position (1-based, inclusive) |
| Sense | ❌ No | Strand: + or - (defaults to +) |
| Family | ❌ No | Family annotation (defaults to NA) |
| Genus | ❌ No | Genus annotation (defaults to NA) |
| Species | ❌ No | Species annotation (defaults to NA) |

Minimal CSV Example

seqname,start,end
NC_000001.11,100000,105000
NC_000002.12,250000,253000

Full CSV Example

seqname,start,end,Sense,Family,Genus,Species
NC_000001.11,100000,105000,+,Retroviridae,Gammaretrovirus,MLV
NC_000002.12,250000,253000,-,Parvoviridae,Dependoparvovirus,AAV

Tip: The seqname column can contain pipe-separated IDs (e.g. NC_000001.1|extra_info); only the part before the first | is used as the accession.
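The column rules and the tip above can be sketched as a small row reader. This is an illustration only, not the script's own code: the helper name `read_rows` and the `accession` key are hypothetical, and it uses the standard-library `csv` module so it runs without extra packages (the script itself lists pandas as a requirement).

```python
import csv
import io

REQUIRED = ("seqname", "start", "end")
# Optional columns and the defaults listed in the table above.
DEFAULTS = {"Sense": "+", "Family": "NA", "Genus": "NA", "Species": "NA"}


def read_rows(handle):
    """Yield one normalized dict per CSV row (hypothetical helper)."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing columns in CSV: {', '.join(missing)}")
    for row in reader:
        record = {
            # Keep only the part before the first "|" as the accession.
            "accession": row["seqname"].split("|", 1)[0].strip(),
            "start": int(row["start"]),
            "end": int(row["end"]),
        }
        # Absent or empty optional columns fall back to their defaults.
        for column, default in DEFAULTS.items():
            value = (row.get(column) or "").strip()
            record[column] = value or default
        yield record


example = io.StringIO(
    "seqname,start,end,Sense\n"
    "NC_000001.11|extra_info,100000,105000,-\n"
    "NC_000002.12,250000,253000,\n"
)
rows = list(read_rows(example))
print(rows[0]["accession"], rows[0]["Sense"], rows[1]["Sense"], rows[1]["Family"])
# prints: NC_000001.11 - + NA
```

In this sketch a non-numeric start or end raises a `ValueError`; a real run would more likely record such a row as skipped.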


🚀 Usage

1. Configure the script

Open Fasta_Seq_Extraction.py and edit the CONFIG section at the top:

Entrez.email = "you@example.com"          # Required by NCBI; use your own address
RATE_LIMIT   = 0.4                        # Seconds between API calls
BATCH_SIZE   = 10                         # Accessions per request
FLANK        = 500                        # Flanking bp (0 = exact region only)

CSV_FILE       = "your_input.csv"
OUT_FASTA      = "output_sequences.fasta"
SKIPPED_REPORT = "skipped_rows.csv"
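To show how RATE_LIMIT and BATCH_SIZE might interact, here is a minimal sketch of a batched download loop. It assumes the script fetches FASTA records through Biopython's `Entrez.efetch`; the helper names `batches` and `fetch_contigs` are hypothetical, and the actual script may be structured differently.

```python
import time

RATE_LIMIT = 0.4   # seconds between API calls (same meaning as in CONFIG)
BATCH_SIZE = 10    # accessions per request


def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def fetch_contigs(accessions, email):
    """Return {record id: sequence string} for the given accessions.

    Hypothetical helper; needs Biopython and network access.
    """
    # Imported here so that `batches` can be tried without Biopython installed.
    from Bio import Entrez, SeqIO

    Entrez.email = email
    contigs = {}
    # Deduplicate so each contig is downloaded once, however many rows use it.
    for batch in batches(sorted(set(accessions)), BATCH_SIZE):
        handle = Entrez.efetch(db="nucleotide", id=",".join(batch),
                               rettype="fasta", retmode="text")
        try:
            for record in SeqIO.parse(handle, "fasta"):
                contigs[record.id] = str(record.seq)
        finally:
            handle.close()
        time.sleep(RATE_LIMIT)  # pause before the next request
    return contigs


print(batches(["A", "B", "C", "D", "E"], 2))
# prints: [['A', 'B'], ['C', 'D'], ['E']]
```

With these values, 150 distinct accessions would mean 15 requests, each followed by a 0.4 s pause.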

2. Run

python Fasta_Seq_Extraction.py

3. Output

| File | Description |
| --- | --- |
| output_sequences.fasta | Extracted FASTA sequences |
| skipped_rows.csv | Rows that failed, with reasons (missing_contig, invalid_coordinates) |

🧩 Flanking Regions

Control how much upstream/downstream sequence is included with the FLANK parameter:

| FLANK value | Behaviour |
| --- | --- |
| 0 | Extract the exact region (start → end) |
| 500 | Add 500 bp upstream and downstream |
| 5000 | Add 5000 bp upstream and downstream |
  • Flanking is automatically clamped to contig boundaries (never goes out of range).
  • When flanking is active, the FASTA header shows the actual extracted coordinates and a flank_N tag.
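The clamping and strand handling described above could look roughly like the sketch below. It assumes 1-based inclusive coordinates as in the input table; the function names are hypothetical, the validity check is an assumption, and the script's own arithmetic may differ in detail.

```python
# Complement table for plain DNA; other characters pass through unchanged.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def flanked_window(start, end, flank, contig_length):
    """Widen a 1-based inclusive interval by `flank`, clamped to the contig."""
    return max(1, start - flank), min(contig_length, end + flank)


def extract(contig, start, end, strand="+", flank=0):
    """Return (sequence, actual_start, actual_end) for one annotation."""
    if not (1 <= start <= end <= len(contig)):
        raise ValueError("invalid_coordinates")
    lo, hi = flanked_window(start, end, flank, len(contig))
    seq = contig[lo - 1:hi]                      # 1-based inclusive -> Python slice
    if strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]    # reverse complement
    return seq, lo, hi


contig = "AAACCCGGGTTT"
print(extract(contig, 4, 6))                 # ('CCC', 4, 6)
print(extract(contig, 4, 6, flank=2))        # ('AACCCGG', 2, 8)
print(extract(contig, 4, 6, "-", flank=5))   # ('AACCCGGGTTT', 1, 11)
```

In the last call the upstream flank is cut short at position 1 because the contig starts there. For sequences with IUPAC ambiguity codes, Biopython's `Seq.reverse_complement()` is the safer choice than this small translation table.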

Example header (FLANK = 500):

>NC_000001.11:99501-105500|flank_500|Retroviridae|Gammaretrovirus|MLV

Example header (FLANK = 0):

>NC_000001.11:100000-105000|Retroviridae|Gammaretrovirus|MLV
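A header in this layout can be assembled as in the sketch below. The function name `fasta_header` is hypothetical, and how the script renders missing annotations (shown here as NA) is an assumption based on the defaults in the input table.

```python
def fasta_header(accession, start, end, flank,
                 family="NA", genus="NA", species="NA"):
    """Build a header in the layout of the examples above (hypothetical helper).

    `start` and `end` are the coordinates actually extracted, after flanking.
    """
    fields = [f"{accession}:{start}-{end}"]
    if flank > 0:
        fields.append(f"flank_{flank}")   # tag only appears when flanking is on
    fields += [family, genus, species]
    return ">" + "|".join(fields)


print(fasta_header("NC_000001.11", 100000, 105000, 0,
                   "Retroviridae", "Gammaretrovirus", "MLV"))
# prints: >NC_000001.11:100000-105000|Retroviridae|Gammaretrovirus|MLV
```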

📊 Output Summary

After each run, a summary is printed:

=================================
Total CSV rows      : 150
Sequences written   : 142
Rows skipped        : 8

Skip reasons breakdown:
missing_contig          5
invalid_coordinates     3

Detailed skipped rows saved to: skipped_rows.csv
=================================
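A summary like the one above can be produced by tagging each failed row with a reason and counting the tags. The sketch below uses toy data; the exact conditions the script applies for missing_contig and invalid_coordinates are assumptions, and `skip_reason` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def skip_reason(row, contigs):
    """Return a skip reason for `row`, or None if it can be extracted."""
    contig = contigs.get(row["accession"])
    if contig is None:
        return "missing_contig"
    if not (1 <= row["start"] <= row["end"] <= len(contig)):
        return "invalid_coordinates"
    return None


contigs = {"X1": "ACGTACGTAC"}                     # toy stand-in for downloaded sequences
rows = [
    {"accession": "X1", "start": 2, "end": 5},
    {"accession": "X1", "start": 8, "end": 3},     # start > end
    {"accession": "X2", "start": 1, "end": 4},     # contig was never downloaded
]

skipped = []
for row in rows:
    reason = skip_reason(row, contigs)
    if reason:
        skipped.append({**row, "reason": reason})

report = io.StringIO()                             # stands in for skipped_rows.csv
writer = csv.DictWriter(report, fieldnames=["accession", "start", "end", "reason"])
writer.writeheader()
writer.writerows(skipped)

counts = Counter(item["reason"] for item in skipped)
print(f"Total CSV rows      : {len(rows)}")
print(f"Sequences written   : {len(rows) - len(skipped)}")
print(f"Rows skipped        : {len(skipped)}")
for reason, n in counts.most_common():
    print(f"{reason:<24}{n}")
```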

🛠️ Troubleshooting

| Issue | Solution |
| --- | --- |
| Missing columns in CSV | Make sure your CSV has seqname, start and end columns |
| Timeout / connection errors | Increase the value passed to socket.setdefaulttimeout() or reduce BATCH_SIZE |
| Too many skipped rows | Check that accessions are valid NCBI IDs and verify the coordinates |
| NCBI rate-limit errors (429) | Increase RATE_LIMIT (e.g. to 1.0) |
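For timeouts and 429 responses, one common complement to raising RATE_LIMIT is to retry a failed request with an increasing delay. The sketch below is not part of the script as described. `with_retries` is a hypothetical helper, and it assumes network failures surface as `urllib.error.URLError` (of which `HTTPError`, which carries the 429 status, is a subclass) or `socket.timeout`.

```python
import socket
import time
import urllib.error

socket.setdefaulttimeout(60)   # seconds; raise this if large contigs time out


def with_retries(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run `call()`; on a network error wait base_delay * 2**n, then retry."""
    for attempt in range(attempts):
        try:
            return call()
        except (urllib.error.URLError, socket.timeout):
            if attempt == attempts - 1:
                raise                          # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)


# Demonstration with a stand-in request that fails twice, then succeeds.
calls = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise urllib.error.URLError("temporary failure")
    return "ok"


delays = []
result = with_retries(flaky, sleep=delays.append)   # record delays instead of sleeping
print(result, delays)
# prints: ok [1.0, 2.0]
```

In practice `call` would wrap a single batch request, for example a lambda around one `Entrez.efetch` call.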

🙋 Contributing

Contributions, issues and feature requests are welcome! Feel free to open an issue or pull request.
