code/genbank-fan at master · fandemonium/code

Linking_Organism_to_hitID.py
Linking_Organism_to_hitID.pypy3
README
README~
gbk_to_fa.py
gbk_to_fa.pypy3
get_accessionNumber_list_from_geneGBK.py
get_accessionNumber_list_from_geneGBK.pypy3
get_accessionNumber_list_from_geneGBK.py~
get_full_genome_gbk.py
get_full_genome_gbk.pypy3
get_full_genome_gbk.py~
get_full_genome_gbk_from_unique_asscession.py
get_full_genome_gbk_from_unique_asscession.pypy3
get_fungi_18S.py
get_fungi_18S.pypy3
get_gene_gbk_from_genbank.py
get_gene_gbk_from_genbank.pypy3
get_gene_gbk_from_genbank.py~
get_list_of_fungi_genome.py
get_list_of_fungi_genome.pypy3
nwk_tree_parser.py
nwk_tree_parser.pypy3
nwk_tree_parser.py~
parse-genbank2.py~
parse-genbank2_16S.py
parse-genbank2_16S.pypy3
unique_dbsource.py
unique_dbsource.pypy3

README

To construct a phylogenetic tree based on 16S genes from the contig-to-CAZY abundance data:

1. Find the most recent contig-to-CAZY abundance data. The file should be filtered (for each contig, at least one of the soil aggregate fraction samples must have an abundance of at least 5 occurrences) and merged (with CAZY family abundance, taxonomy, and hit IDs).

2. Check your contig file (the input file) and determine the column number for "hit ID". Modify line 9 of get_gene_gbk_from_genbank.py accordingly (this should not normally be needed, because the script counts columns backwards from the end of the line).

3. Run get_gene_gbk_from_genbank.py. This pulls the gene information from GenBank for each "hit ID" and writes one gbk file per hit ID, with the hit ID as the file name. To avoid confusing these gbk files with the ones generated later, put them into a separate folder before proceeding.

4. From these hit_id.gbk files, create a list of hit IDs, organism names, and genome accession numbers using "get_accessionNumber_list_from_geneGBK.py". This step requires a shell for loop over the gbk files. Double-check that no file is missing from the generated list. Some gbk files behave unexpectedly (every gene should have an accession number, even though not every record is a genome).

5. While all of the hit IDs in this list of hit IDs, organism names, and accession numbers are unique, the accession numbers may repeat. One can use the entire list to get the genome gbk files (the file for a repeated accession number will simply be overwritten). However, this step is very time-consuming, and Biopython often runs into errors while retrieving the gbk files, so it is worth creating a separate list of unique accession numbers first (unique_dbsource.py).

6. Then, for each gbk file, pull the genome for its accession number/dbsource using "get_full_genome_gbk.py". (Not all genomes were sequenced fully, so some will not contain a 16S gene. This step also creates an individual folder for each accession number/dbsource.)

7. Use "parse-genbank2_16S.py" to pull the 16S information out of the genome gbk files. (This gives you the organism name for each gene. Some genomes will have multiple 16S genes and some will have none. Each dbsource number gets its own 16S file.)

8. To consolidate the 16S sequences into one file, run:
for i in *.16s.fa; do head -n 2 "$i"; done > all_16S_OnePerOrganism.fa
(This picks the first 16S gene from each file, assuming each record is a header line followed by a single sequence line.)

9. If you are interested in the fungal population, run "get_list_of_fungi_genome.py" on the full genome gbk files to get a list of genome accession numbers with taxonomy domains.

10. Use MEGA to construct a maximum likelihood tree with bootstrap values (permutation = 999). To use the full organism names in MEGA trees, replace the spaces with "_". Then export the tree file in nwk format.

11. If iTOL will be used to visualize the tree, note that iTOL recognizes the alternate organism names in parentheses as a separate branch, so delete the parentheses in the leaf nodes.
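
The deduplication of accession numbers described above can be sketched in a few lines of Python. This is only an illustration and not the contents of unique_dbsource.py: the function name unique_accessions, the tab-separated layout with the accession number in the last column, and the sample values are all assumptions.

```python
def unique_accessions(lines, column=-1, sep="\t"):
    """Return accession numbers in first-seen order, without repeats.

    Assumes one record per line, with fields separated by `sep` and the
    accession number in position `column` (hypothetical layout).
    """
    seen = set()
    unique = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        accession = line.split(sep)[column].strip()
        if accession and accession not in seen:
            seen.add(accession)
            unique.append(accession)
    return unique


# Made-up placeholder records: hit ID, organism name, accession number.
example = [
    "hit_1\tOrganism one\tACC_0001\n",
    "hit_2\tOrganism one\tACC_0001\n",
    "hit_3\tOrganism two\tACC_0002\n",
]
print(unique_accessions(example))  # ['ACC_0001', 'ACC_0002']
```

Keeping the first-seen order makes it easy to resume a long download after an error, since the unique list can be compared against the folders already created.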
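
The head -n 2 consolidation only works when every sequence sits on a single line. As a hedged alternative, the sketch below picks the first record of each FASTA file even when the sequence is wrapped over several lines. The function names first_fasta_record and consolidate are hypothetical, and the default file pattern and output name are simply copied from the shell command in the steps above.

```python
import glob


def first_fasta_record(lines):
    """Return (header, sequence) for the first FASTA record, or None."""
    header = None
    sequence = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here, so stop reading
            header = line
        elif header is not None:
            sequence.append(line)
    if header is None:
        return None
    return header, "".join(sequence)


def consolidate(pattern="*.16s.fa", out_path="all_16S_OnePerOrganism.fa"):
    """Write the first record of every file matching `pattern` to `out_path`."""
    with open(out_path, "w") as out:
        for path in sorted(glob.glob(pattern)):
            with open(path) as handle:
                record = first_fasta_record(handle)
            if record is not None:  # files without any 16S record are skipped
                out.write(record[0] + "\n" + record[1] + "\n")
```

Calling consolidate() in the folder that holds the 16S files would produce one header line and one sequence line per organism, the same layout the shell loop generates.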
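
The name cleanup needed for MEGA and iTOL can also be scripted. The sketch below is an assumption about one way to do it, not the logic of nwk_tree_parser.py: clean_label is a hypothetical helper that is meant to be applied to organism names (for example FASTA headers) before they go into the tree, because parentheses inside an exported nwk file are hard to tell apart from the parentheses that define the tree structure.

```python
import re


def clean_label(name, keep_alternate=True):
    """Make an organism name safe to use as a tree leaf label.

    Spaces become underscores. Parentheses are removed; with
    keep_alternate=False the text inside them is dropped as well.
    """
    if keep_alternate:
        name = name.replace("(", " ").replace(")", " ")
    else:
        name = re.sub(r"\([^()]*\)", " ", name)
    # Collapse any run of whitespace into a single underscore.
    return "_".join(name.split())


# Placeholder name, not a real organism.
print(clean_label("Genus species (Alternate name)"))
print(clean_label("Genus species (Alternate name)", keep_alternate=False))
```

Whether to keep the alternate name as part of the label or drop it is a judgment call; the keep_alternate flag exists only to show both options.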
