Commonly used databases in bioinformatics
To download reference data, several sources are available:
- General biological databases: Ensembl, NCBI, and UCSC
- Organism-specific biological databases: WormBase, FlyBase, TAIR, etc. (often updated more frequently, so they may be more comprehensive)
- Reference data collections: Illumina's iGenomes, a single location for genome reference data from Ensembl, UCSC, and NCBI
Ensembl identifiers
When using Ensembl, note that it uses the following format for biological identifiers:
ENSG###########: Ensembl Gene ID
ENST###########: Ensembl Transcript ID
ENSP###########: Ensembl Peptide ID
ENSE###########: Ensembl Exon ID
For non-human species, a three-letter species code is inserted after ENS:
ENSMUSG###: MUS (Mus musculus) for mouse
ENSDARG###: DAR (Danio rerio) for zebrafish
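The identifier layout above can be checked mechanically. The following is a minimal sketch, assuming the 11-digit numeric part shown in the patterns above and only the four feature letters listed (G, T, P, E); the function name `parse_ensembl_id` is hypothetical, and version suffixes such as `.2` are not handled.

```python
import re

# Feature-type letters taken from the list above.
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "peptide", "E": "exon"}

# ENS, an optional three-letter species code, one feature letter, then 11 digits.
# The 11-digit length is an assumption based on the ENSG########### pattern.
ENSEMBL_ID = re.compile(r"^ENS([A-Z]{3})?([GTPE])(\d{11})$")


def parse_ensembl_id(identifier):
    """Split an Ensembl ID into species code, feature type, and number."""
    match = ENSEMBL_ID.match(identifier)
    if match is None:
        raise ValueError(f"not a recognised Ensembl ID: {identifier!r}")
    species, feature, number = match.groups()
    return {
        "species": species or "human",  # no species code means human
        "feature": FEATURE_TYPES[feature],
        "number": int(number),
    }


if __name__ == "__main__":
    print(parse_ensembl_id("ENSMUSG00000017167"))
```

A lookup table from species code to species name (MUS, DAR, and so on) could be layered on top, but only the two codes listed above come from this document.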
Human genome assemblies
GRCh38 (aka hg38)
- Many rare/private alleles replaced
- www.ensembl.org
- Most up-to-date and supported
GRCh37 (aka hg19)
- Some large gaps
- grch37.ensembl.org
- Limited data and software updates
- Still the preferred genome of the clinical community
NCBI36 (aka hg18)
- Many gaps
- ncbi36.ensembl.org
- No longer updated
https://www.ncbi.nlm.nih.gov/home/download/
GCA for GenBank assemblies
GCF for RefSeq assemblies
GCA (or GCF) is followed by an underscore, 9 digits, and a version number after a period. For example, GRCh38.p11 is GCA_000001405.26.
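The accession format described above is regular enough to parse with a short pattern. This is a sketch only; the function name `parse_assembly_accession` and the returned field names are hypothetical.

```python
import re

# GCA or GCF, an underscore, exactly 9 digits, a period, and a version number.
ASSEMBLY_ACCESSION = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")


def parse_assembly_accession(accession):
    """Return the source database, numeric part, and version of an accession."""
    match = ASSEMBLY_ACCESSION.match(accession)
    if match is None:
        raise ValueError(f"not a GCA/GCF accession: {accession!r}")
    prefix, digits, version = match.groups()
    return {
        "source": "GenBank" if prefix == "GCA" else "RefSeq",
        "number": digits,           # kept as text to preserve leading zeros
        "version": int(version),
    }


if __name__ == "__main__":
    print(parse_assembly_accession("GCA_000001405.26"))
```

Keeping the 9-digit part as a string makes it easy to spot that a GCA and a GCF accession refer to the same assembly series.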
$ wget https://ftp.ncbi.nlm.nih.gov/bioproject/summary.txt
$ cat summary.txt | awk -F "\t" '{print $3}' | sed -E 's/[0-9]+//g' | sort | uniq -c
292 PRJDA
12933 PRJDB
497 PRJEA
57137 PRJEB
658381 PRJNA
      1 Project Accession
PRJNA identifies BioProjects registered at NCBI.
PRJEB identifies projects registered at the European Nucleotide Archive (ENA), which is hosted by the European Bioinformatics Institute (EBI).
PRJEA is an older ENA project prefix.
PRJDB identifies projects registered at the DNA Data Bank of Japan (DDBJ).
PRJDA is an older DDBJ project prefix. DDBJ also runs the DDBJ Sequence Read Archive (DRA), the Japanese member of the international SRA collaboration.
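The prefix tally produced by the awk/sed pipeline can also be computed in Python. The sketch below assumes the same layout as summary.txt (tab-separated, accession in the third column); the function name `count_accession_prefixes` is hypothetical, and the demo uses made-up rows rather than the real file.

```python
import re
from collections import Counter


def count_accession_prefixes(lines, column=2):
    """Count the values in one tab-separated column after stripping digits."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Like awk, treat a missing column as an empty string.
        value = fields[column] if len(fields) > column else ""
        counts[re.sub(r"[0-9]+", "", value)] += 1
    return counts


if __name__ == "__main__":
    # Made-up rows for illustration; the first mimics the header line.
    demo = [
        "Organism\tTaxID\tProject Accession\n",
        "a\t1\tPRJNA123\n",
        "b\t2\tPRJNA9\n",
        "c\t3\tPRJEB42\n",
    ]
    for prefix, n in sorted(count_accession_prefixes(demo).items()):
        print(n, prefix)
```

To run it on the downloaded file, pass an open file handle, for example `count_accession_prefixes(open("summary.txt", encoding="utf-8", errors="replace"))`.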
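# nr.gz: non-redundant protein database, built from Swiss-Prot, PIR, PRF, PDB, and RefSeq
# nt.gz: nucleotide database, built from GenBank, EMBL, and DDBJ; only partially non-redundant
# The BLAST+ package ships a script, update_blastdb.pl, that makes it easy to download BLAST databases locally
$ perl update_blastdb.pl --showall # list the databases available for download
$ nohup perl update_blastdb.pl --decompress nt &> update.log & # download in the background; interrupted downloads can be resumed
Entrez Gene is NCBI's database of gene information. Its records come from manual curation and from automated integration of data from NCBI's Reference Sequence project (RefSeq) and collaborating databases. Entrez identifies each gene with a unique integer ID (PMID: 17148475). It has since been merged into NCBI Gene.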
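Reference Sequence (RefSeq) database
The project started in 1999 with 3,446 human transcript and protein sequences. Today, NCBI's RefSeq project provides sequence records for the genomes, transcripts, and proteins of viruses, microbes, organelles, and eukaryotes. RefSeq FTP release 61, published in September 2013, contained more than 41 million sequence records from more than 29,000 organisms.
RefSeq prefixes: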
- Genomic
AC_ : Complete genomic molecule, usually alternate assembly
NC_ : Complete genomic molecule, usually reference assembly
NG_ : Incomplete genomic region
NT_ : Contig or scaffold, clone-based or WGS
NW_ : Contig or scaffold, primarily WGS
NZ_ : Complete genomes and unfinished WGS data
- Transcript
NM_ : Protein-coding transcripts (usually curated)
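# Nucleotide sequences of mature eukaryotic mRNAs; NM_ marks manually annotated sequences (usually supported by experimental data), XM_ marks sequences predicted by a computational model (not experimentally verified)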
NR_ : Non-protein-coding transcripts
# Non-coding RNA (lncRNA, rRNA, etc.); NR_ is the manually reviewed version, XR_ the computationally predicted version
XM_ : Predicted model protein-coding transcript
XR_ : Predicted model non-protein-coding transcript
- Protein
AP_ : Annotated on AC_ alternate assembly
NP_ : Associated with an NM_ or NC_ accession # NP_ is the manually reviewed protein for an NM_ transcript
XP_ : Predicted model, associated with an XM_ accession # predicted protein, corresponding to a predicted XM_ transcript
YP_ : Annotated on genomic molecules without an instantiated transcript record
WP_ : Non-redundant across multiple strains and species
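# One NM_ sequence usually corresponds to one NP_ protein sequence (but through alternative splicing a single gene can produce several different mRNAs and protein isoforms, and therefore several NM_/NP_ accessions).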
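The prefix list above maps naturally onto a lookup table. The sketch below copies the prefixes and descriptions from that list; the function name `classify_refseq` and the `predicted` flag (true for the X-prefixed model accessions) are illustrative additions.

```python
# Prefix -> (molecule type, description), copied from the list above.
REFSEQ_PREFIXES = {
    "AC_": ("genomic", "complete genomic molecule, usually alternate assembly"),
    "NC_": ("genomic", "complete genomic molecule, usually reference assembly"),
    "NG_": ("genomic", "incomplete genomic region"),
    "NT_": ("genomic", "contig or scaffold, clone-based or WGS"),
    "NW_": ("genomic", "contig or scaffold, primarily WGS"),
    "NZ_": ("genomic", "complete genomes and unfinished WGS data"),
    "NM_": ("transcript", "protein-coding transcript (usually curated)"),
    "NR_": ("transcript", "non-protein-coding transcript"),
    "XM_": ("transcript", "predicted model protein-coding transcript"),
    "XR_": ("transcript", "predicted model non-protein-coding transcript"),
    "AP_": ("protein", "annotated on AC_ alternate assembly"),
    "NP_": ("protein", "associated with an NM_ or NC_ accession"),
    "XP_": ("protein", "predicted model, associated with an XM_ accession"),
    "YP_": ("protein", "annotated on genomic molecules without a transcript record"),
    "WP_": ("protein", "non-redundant across multiple strains and species"),
}


def classify_refseq(accession):
    """Look up the molecule type and description for a RefSeq accession."""
    prefix = accession[:3]
    if prefix not in REFSEQ_PREFIXES:
        raise ValueError(f"unknown RefSeq prefix in {accession!r}")
    molecule, description = REFSEQ_PREFIXES[prefix]
    return {
        "prefix": prefix,
        "molecule": molecule,
        "description": description,
        "predicted": prefix in {"XM_", "XR_", "XP_"},  # computational models
    }


if __name__ == "__main__":
    print(classify_refseq("NM_000546.6"))
```

A table keeps the mapping in one place, so adding or correcting a prefix does not require touching the lookup logic.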
University of California, Santa Cruz (UCSC) genome database: https://genome.ucsc.edu/goldenpath/help/ftp.html
China National GeneBank (CNGB)
$ wget -c -nH -np -r -R "index.html*" --cut-dirs 4 ftp://ftp.cngb.org/pub/CNSA/data1/CNP0000405/CNS0064392/
- hg19
$ wget ftp://[email protected]/bundle/hg19/*.gz # download everything at once
$ wget ftp://[email protected]/bundle/hg19/1000G_omni2.5.hg19.sites.vcf.gz
$ wget ftp://[email protected]/bundle/hg19/1000G_omni2.5.hg19.sites.vcf.idx.gz
$ wget ftp://[email protected]/bundle/hg19/1000G_phase1.indels.hg19.sites.vcf.gz
$ wget ftp://[email protected]/bundle/hg19/1000G_phase1.indels.hg19.sites.vcf.idx.gz
$ wget ftp://[email protected]/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz
$ wget ftp://[email protected]/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.idx.gz
$ wget ftp://[email protected]/bundle/hg19/CEUTrio.HiSeq.WGS.b37.bestPractices.hg19.vcf.gz
$ wget ftp://[email protected]/bundle/hg19/CEUTrio.HiSeq.WGS.b37.bestPractices.hg19.vcf.idx.gz
$ wget ftp://[email protected]/bundle/hg19/dbsnp_138.hg19.excluding_sites_after_129.vcf.gz
$ wget ftp://[email protected]/bundle/hg19/dbsnp_138.hg19.excluding_sites_after_129.vcf.idx.gz
$ wget ftp://[email protected]/bundle/hg19/dbsnp_138.hg19.vcf.gz
$ wget ftp://[email protected]/bundle/hg19/dbsnp_138.hg19.vcf.idx.gz
$ wget ftp://[email protected]/bundle/hg19/hapmap_3.3_hg19_pop_stratified_af.vcf.gz
$ wget ftp://[email protected]/bundle/hg19/hapmap_3.3.hg19.sites.vcf.gz
$ wget ftp://[email protected]/bundle/hg19/hapmap_3.3.hg19.sites.vcf.idx.gz
$ wget ftp://[email protected]/bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
$ wget ftp://[email protected]/bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.idx.gz
$ wget ftp://[email protected]/bundle/hg19/NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.hg19.sites.vcf.gz
$ wget ftp://[email protected]/bundle/hg19/NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.hg19.sites.vcf.idx.gz
$ wget ftp://[email protected]/bundle/hg19/NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.hg19.vcf.gz
$ wget ftp://[email protected]/bundle/hg19/NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.hg19.vcf.idx.gz
$ wget ftp://[email protected]/bundle/hg19/NA12878.knowledgebase.snapshot.20131119.hg19.vcf.gz
$ wget ftp://[email protected]/bundle/hg19/NA12878.knowledgebase.snapshot.20131119.hg19.vcf.idx.gz
$ wget ftp://[email protected]/bundle/hg19/ucsc.hg19.dict.gz
$ wget ftp://[email protected]/bundle/hg19/ucsc.hg19.fasta.fai.gz
$ wget ftp://[email protected]/bundle/hg19/ucsc.hg19.fasta.gz
- hg38
$ wget ftp://[email protected]/bundle/hg38/*.gz # download everything
$ wget ftp://[email protected]/bundle/hg38/1000G_omni2.5.hg38.vcf.gz
$ wget ftp://[email protected]/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz
$ wget ftp://[email protected]/bundle/hg38/Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz
$ wget ftp://[email protected]/bundle/hg38/dbsnp_138.hg38.vcf.gz
$ wget ftp://[email protected]/bundle/hg38/dbsnp_144.hg38.vcf.gz
$ wget ftp://[email protected]/bundle/hg38/dbsnp_146.hg38.vcf.gz
$ wget ftp://[email protected]/bundle/hg38/hapmap_3.3_grch38_pop_stratified_af.vcf.gz
$ wget ftp://[email protected]/bundle/hg38/hapmap_3.3.hg38.vcf.gz
$ wget ftp://[email protected]/bundle/hg38/Homo_sapiens_assembly38.fasta.gz
$ wget ftp://[email protected]/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
T2T genome, currently the most recent human assembly
- URL: https://github.com/marbl/CHM13
- CONTENTS: chr1-22(CHM13),chrX(CHM13),chrY(NA24385),chrM(CHM13)
- CITATION : Nurk, S., Koren, S., Rhie, A., Rautiainen, M., Bzikadze, A. V., Mikheenko, A., ... & Phillippy, A. M. (2022). The complete sequence of a human genome. Science, 376(6588), 44-53.
Delayed release

NCBI RefSeq assembly: GCF_000001405.26 (replaced)
Submitted GenBank assembly: GCA_000001405.15 (replaced)
Taxon: Homo sapiens (human)
Synonym: hg38
Assembly type: haploid with alt loci
Submitter: Genome Reference Consortium
Date: Dec 17, 2013

NCBI RefSeq assembly: GCF_000001405.13 (replaced)
Submitted GenBank assembly: GCA_000001405.1 (replaced)
Taxon: Homo sapiens (human)
Synonym: hg19
Assembly type: haploid with alt loci
Submitter: Genome Reference Consortium
Date: Feb 27, 2009
$ wget https://ilmn-dragen-giab-samples.s3.amazonaws.com/FASTA/GRCh37.fa
The human genome reference that combines GRCh37 with the decoy sequences is called hs37d5.
$ wget https://ilmn-dragen-giab-samples.s3.amazonaws.com/FASTA/hs37d5.fa
The Broad Institute built a reference genome based on GRCh37, known as b37 (Homo_sapiens_assembly19.fasta, MD5sum: 886ba1559393f75872c1cf459eb57f2d).
humanG1Kv37 (human_g1k_v37.fasta, MD5sum: 0ce84c872fc0072a885926823dcd0338) is identical to b37, except that it lacks the decoy sequence for human herpesvirus 4 type 1 (named NC_007605). This reference comes from the 1000 Genomes Project.
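Because these references differ only subtly, it can help to confirm which one a local FASTA file actually is by comparing its MD5 checksum with the values quoted above. This is a minimal sketch; the function names are hypothetical, and the checksums apply to the uncompressed files named in the text.

```python
import hashlib

# Checksums quoted in the text above (uncompressed FASTA files).
KNOWN_MD5 = {
    "Homo_sapiens_assembly19.fasta": "886ba1559393f75872c1cf459eb57f2d",
    "human_g1k_v37.fasta": "0ce84c872fc0072a885926823dcd0338",
}


def md5_of_stream(handle, chunk_size=1 << 20):
    """Hash a binary file object in chunks so large genomes fit in memory."""
    digest = hashlib.md5()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def identify_reference(path):
    """Return the known reference name whose MD5 matches the file, or None."""
    with open(path, "rb") as handle:
        checksum = md5_of_stream(handle)
    for name, expected in KNOWN_MD5.items():
        if checksum == expected:
            return name
    return None
```

For example, `identify_reference("human_g1k_v37.fasta")` would return the matching name if the local file is intact, and `None` if it is truncated or a different build.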
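The Human Pangenome Reference Consortium (HPRC) is a project funded by the US National Institutes of Health (NIH) that brings together scientists and bioethicists to create a human pangenome reference and resource representing global human genomic variation. While building the pangenome, the HPRC is developing improved genome assembly techniques and a next-generation ecosystem of tools for comprehensive pangenome analysis. Data are available through the project's Download link.
- Performed by: New York Genome Center (NYGC)
- Number of samples: 3202
- Depth: 30X
1KGP3 project information: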
- The GRCh37 reference genome used in this analysis
- Files listing the samples used in the work (.ped and panel)
- VCF files containing the variants detected and additional genotype VCF files listing genotypes for each individual at each variant location (provided per chromosome due to file size)
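1000 Genomes Phase 3 (1KGP3)
HapMap 3 is the third phase of the International HapMap Project. This phase increased the number of DNA samples from 270 in phases I and II to 1,301 samples from a variety of populations. This is draft release 3.
Populations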
- ASW – African ancestry in Southwest USA
- CEU – Utah residents with Northern and Western European ancestry from the CEPH collection
- CHB – Han Chinese in Beijing, China
- CHD – Chinese in Metropolitan Denver, Colorado
- GIH – Gujarati Indians in Houston, Texas
- JPT – Japanese in Tokyo, Japan
- LWK – Luhya in Webuye, Kenya
- MXL – Mexican ancestry in Los Angeles, California
- MKK – Maasai in Kinyawa, Kenya
- TSI – Toscani in Italia
- YRI – Yoruba in Ibadan, Nigeria
NCBI RefSeq assembly: GCF_000001635.20 (replaced)
Submitted GenBank assembly: GCA_000001635.2 (replaced)
Taxon: Mus musculus (house mouse)
Strain: C57BL/6J
Synonym: mm10
Assembly type: haploid with alt loci
Submitter: Genome Reference Consortium
Date: Jan 9, 2012

NCBI RefSeq assembly: GCF_000001635.27
Submitted GenBank assembly: GCA_000001635.9
Taxon: Mus musculus (house mouse)
Strain: C57BL/6J
Synonym: mm39
Assembly type: haploid
Submitter: Genome Reference Consortium
Date: Jun 24, 2020
ftp://ftp.ncbi.nlm.nih.gov/geo/
$ wget --recursive --no-parent -nd ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE50nnn/GSE50499/suppl/
$ wget -r -np -nd -R "index.html*" ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE50nnn/GSE50499/suppl/
# There are four hierarchical levels of SRA entities and their accessions:
STUDY with accessions in the form of SRP, ERP, or DRP
SAMPLE with accessions in the form of SRS, ERS, or DRS
EXPERIMENT with accessions in the form of SRX, ERX, or DRX
RUN with accessions in the form of SRR, ERR, or DRR
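The four levels and their prefixes follow one pattern: a first letter, then R, then a level letter. The sketch below decodes an accession accordingly. The level letters come from the list above; reading the first letter as the submitting archive (S for NCBI, E for EBI, D for DDBJ) is not stated above and is added here as background. The function name `classify_sra_accession` is hypothetical.

```python
import re

# First letter: archive that issued the accession (background, not from the list).
ARCHIVES = {"S": "NCBI", "E": "EBI", "D": "DDBJ"}
# Third letter: entity level, as in the list above.
LEVELS = {"P": "study", "S": "sample", "X": "experiment", "R": "run"}

SRA_ACCESSION = re.compile(r"^([SED])R([PSXR])(\d+)$")


def classify_sra_accession(accession):
    """Return the issuing archive and hierarchy level of an SRA accession."""
    match = SRA_ACCESSION.match(accession)
    if match is None:
        raise ValueError(f"not an SRA accession: {accession!r}")
    archive, level, _digits = match.groups()
    return {"archive": ARCHIVES[archive], "level": LEVELS[level]}


if __name__ == "__main__":
    for acc in ("SRP173272", "SRR1013512"):
        print(acc, classify_sra_accession(acc))
```

A check like this is handy before calling a download tool, since passing a study or experiment accession where a run is expected is an easy mistake.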
## Given an SRR accession and an internet connection, the fastq-dump command downloads that run in FASTQ format
$ fastq-dump SRR1013512
Converting SRA file names to the sample names used in the project, rename.py (SRP173272):
import os
import pandas as pd

# Run metadata table with one row per run
df = pd.read_csv("SraRunTable.txt")
# Remove spaces from the time label so it can be used in a file name
df['Time'] = df.time.apply(lambda x: x.replace(" ", ""))
df['New'] = df.treatment + "_" + df.Time
# Number the replicates within each treatment/time group
for name, group in df.groupby(by='New'):
    group = group.reset_index(drop=True)
    group['replicate'] = group.index + 1
    group['ID'] = group.New + "_" + group.replicate.astype(str)
    for index, row in group.iterrows():
        old_dir = row['Experiment']
        new_dir = row['ID']
        old_base = row['Run']
        # SRX directory to treatment ID
        os.rename(old_dir, new_dir)
        # basename rename
        os.rename(f'{new_dir}/{old_base}_1.fastq.gz', f'{new_dir}/{new_dir}_1.fastq.gz')
rsid (reference SNP)
A dbSNP Reference SNP (rs or RefSNP) number is a locus accession for a variant type assigned by dbSNP.
download path: https://ftp.ncbi.nih.gov/snp/latest_release/VCF
web search: https://www.ncbi.nlm.nih.gov/snp/rs328
Long non-coding RNA: LNCipedia
The LNCipedia database annotates the human genome; the current version contains 127,802 transcripts and 56,946 genes.
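The gene and transcript totals can be recomputed from the attribute column (column 9) of the GTF file. The following sketch assumes standard GTF attributes of the form `key "value";`; the function name `count_unique_ids` is hypothetical, and the demo rows are made up rather than taken from LNCipedia.

```python
import re

# Matches attribute pairs such as: gene_id "XYZ";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def count_unique_ids(lines):
    """Return (number of distinct gene_id values, distinct transcript_id values)."""
    genes, transcripts = set(), set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attributes = dict(ATTRIBUTE.findall(fields[8]))
        if "gene_id" in attributes:
            genes.add(attributes["gene_id"])
        if "transcript_id" in attributes:
            transcripts.add(attributes["transcript_id"])
    return len(genes), len(transcripts)


if __name__ == "__main__":
    # Made-up records for illustration only.
    demo = [
        '# comment\n',
        'chr1\tdemo\texon\t1\t10\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n',
        'chr1\tdemo\texon\t20\t30\t.\t+\t.\tgene_id "g1"; transcript_id "t2";\n',
    ]
    print(count_unique_ids(demo))
```

Reading attributes by key rather than by position avoids miscounting if a file lists `transcript_id` before `gene_id`.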
$ less -S lncipedia_5_2_hg38.gtf | cut -f 9 | grep -v "^#" | awk -F ";" '{print $1}' | grep gene_id | awk -F " " '{print $2}' | sort | uniq | wc
56946 56946 704558
$ less -S lncipedia_5_2_hg38.gtf
$ less -S lncipedia_5_2_hg38.gtf | cut -f 9 | grep -v "^#" | awk -F ";" '{print $2}' | grep transcript_id | awk -F " " '{print $2}' | sort | uniq | wc
127802 127802 1800082
Broad Institute single-cell sequencing data: 645 studies covering 40,522,660 cells are currently included (single cell).
