This project provides a pipeline for lncRNA identification and analysis, from construction of a coding RNA (CodingRNA) database through lncRNA screening to functional annotation. The pipeline integrates multiple bioinformatics tools to cover the full process, starting from raw data processing.
The project consists of four main modules:
- Module 001: Basic data processing environment for raw data QC, alignment, and assembly
- Module 002: LncRNA identification environment focusing on non-coding RNA prediction and screening
- Module 003: Functional analysis environment for expression analysis and functional annotation
- Module 004: Snakemake one-click version of the basic data processing environment (raw data QC, alignment, and assembly)
- This pipeline involves multiple programming languages including Python, R, and Perl
- Integrates over 30 professional bioinformatics software tools
- Due to complex dependencies, one-click installation and usage is not possible
- Recommended to install and run each module step by step
- Verify output results before proceeding to the next step
- CodingRNA database construction
- Raw data quality control and preprocessing
- Reference genome-based transcript assembly
- Strict screening process for long non-coding RNA
- Transcript functional annotation and analysis
- Support for multi-sample parallel processing
- Integration of multiple professional bioinformatics tools
- Linux/Unix operating system
- Python ≥ 3.7
- R ≥ 4.0
- Conda package manager
Major bioinformatics tools:
- HISAT2 (transcriptome alignment)
- StringTie (transcript assembly)
- TACO (transcript integration)
- FastQC (sequencing data QC)
- Samtools (SAM/BAM file processing)
- MultiQC (QC report integration)
- Bowtie2 (alignment to the rRNA database for rRNA removal)
- GFFread (GFF/GTF file conversion)
- Gffcompare (transcript assembly evaluation)
- Fastp (raw data QC)
- TrimGalore (raw data QC)
- seqkit (sequence processing)
- transeq (sequence translation)
- pfam_scan.pl (Pfam database alignment)
- cmscan (Rfam database alignment)
- Diamond (NR database alignment)
- Snakemake (workflow management)
- PLEK (lncRNA identification)
- CNCI (lncRNA identification)
- CPC2 (lncRNA identification)
- featureCounts (expression quantification)
- DESeq2 (differential expression analysis)
- WGCNA (weighted gene co-expression network analysis)
- clusterProfiler (functional enrichment analysis)
- gseGO (GO gene set enrichment analysis, a clusterProfiler function)
- gseKEGG (KEGG gene set enrichment analysis, a clusterProfiler function)
- enrichplot (enrichment result visualization)
- ggplot2 (result visualization)
- Create three independent conda environments:
conda create -n lncrna_001 python=3.7
conda activate lncrna_001
conda install -c bioconda hisat2 stringtie taco fastqc samtools multiqc bowtie2 gffread gffcompare fastp trim-galore seqkit emboss
conda create -n lncrna_002 python=3.7
conda activate lncrna_002
conda install -c bioconda plek cnci cpc2
conda install -c bioconda diamond pfam_scan hmmer infernal
conda create -n lncrna_003 r=4.0
conda activate lncrna_003
conda install -c bioconda bioconductor-deseq2 bioconductor-clusterprofiler
conda install -c conda-forge -c bioconda r-wgcna # WGCNA is a CRAN package, so its conda name is r-wgcna
conda install -c conda-forge r-ggplot2
conda install -c bioconda subread # provides featureCounts
- Create working directories:
# Create main directory
mkdir -p ~/rice_lncrna_pipeline
cd ~/rice_lncrna_pipeline
# Create raw data directory
mkdir -p raw_data/{sra,fastq}
# Create QC-related directories
mkdir -p qc/{fastqc,multiqc}
# Create alignment and assembly directories
mkdir -p alignment/{hisat2_index,hisat2_output,stringtie_output}
mkdir -p assembly/{taco_output,merged_transcripts}
# Create lncRNA identification directory
mkdir -p lncrna/{plek,cnci,cpc2,merged_results}
# Create functional analysis directory
mkdir -p analysis/{expression,deseq2,wgcna,enrichment}
# Create results and log directories
mkdir -p results
mkdir -p logs
- Download reference genome:
cd ~/rice_lncrna_pipeline
# Create the reference directory (it is not created in the step above)
mkdir -p reference
wget http://rice.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/all.dir/Osativa_204_v7.0.fa.gz
gunzip Osativa_204_v7.0.fa.gz
mv Osativa_204_v7.0.fa reference/
# Build the HISAT2 index
cd alignment/hisat2_index
hisat2-build ../../reference/Osativa_204_v7.0.fa rice7_index
- Data collection and download
- Data quality control
- rRNA sequence removal
- Reference genome alignment
- Transcript assembly and merging
- Comparison with reference genome annotation
- Final merge with the original genome annotation and database generation
- Collect published rice RNA-seq data
- Download raw data using SRA toolkit
- Convert to FASTQ format
- Quality assessment using FastQC
- QC report integration using MultiQC
- Low-quality sequence and adapter removal using Fastp
- Bowtie2 alignment to rRNA database
- Non-rRNA sequence extraction
- Quality reassessment
- Alignment using HISAT2
- SAM/BAM file conversion and sorting using SAMtools
- Generate alignment statistics report
- Transcript assembly using StringTie
- Merge GTF files from multiple samples
- Generate non-redundant transcript set
- Compare annotation files using GFFcompare
- Extract known coding genes
- Construct coding RNA database
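The per-sample steps listed above (trimming, rRNA removal, genome alignment, sorting, assembly) could be chained roughly as in the following dry-run sketch. It only builds and prints command lines and executes nothing. It assumes paired-end reads; the sample name, the `rrna_index` and `annotation` paths, and the file naming under `qc/` are hypothetical. The flags shown are standard options of each tool, but the parameters used by the actual modules may differ.

```python
import shlex


def build_sample_commands(sample, threads=4,
                          rrna_index="reference/rrna_index",        # hypothetical Bowtie2 rRNA index
                          genome_index="alignment/hisat2_index/rice7_index",
                          annotation="reference/annotation.gtf"):   # hypothetical reference GTF path
    """Build (but do not run) the command lines for one paired-end sample."""
    t = str(threads)
    raw1 = f"raw_data/fastq/{sample}_1.fastq.gz"
    raw2 = f"raw_data/fastq/{sample}_2.fastq.gz"
    trim1 = f"qc/{sample}_1.trimmed.fastq.gz"
    trim2 = f"qc/{sample}_2.trimmed.fastq.gz"
    # Bowtie2 replaces "%" with 1 or 2 when writing the mates that did not
    # align concordantly to the rRNA index, i.e. the reads we keep.
    clean_pattern = f"qc/{sample}_%.clean.fastq.gz"
    clean1 = clean_pattern.replace("%", "1")
    clean2 = clean_pattern.replace("%", "2")
    sam = f"alignment/hisat2_output/{sample}.sam"
    bam = f"alignment/hisat2_output/{sample}.sorted.bam"
    gtf = f"alignment/stringtie_output/{sample}.gtf"
    return [
        ["fastp", "-i", raw1, "-I", raw2, "-o", trim1, "-O", trim2, "-w", t],
        ["bowtie2", "-p", t, "-x", rrna_index, "-1", trim1, "-2", trim2,
         "--un-conc-gz", clean_pattern, "-S", "/dev/null"],
        # --dta reports alignments tailored for transcript assemblers such as StringTie
        ["hisat2", "-p", t, "--dta", "-x", genome_index,
         "-1", clean1, "-2", clean2, "-S", sam],
        ["samtools", "sort", "-@", t, "-o", bam, sam],
        ["stringtie", bam, "-G", annotation, "-p", t, "-o", gtf],
    ]


def render(commands):
    """Format the commands as shell-safe lines for inspection or logging."""
    return "\n".join(" ".join(shlex.quote(arg) for arg in cmd) for cmd in commands)


print(render(build_sample_commands("sampleA")))
```

Printing the commands first makes it easy to verify paths and parameters for every sample before anything is run, in line with the advice to check each step before proceeding.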
- Extract transcripts with class codes i, u, x, o, and p
- Length filtering (>200 nt)
- Exon number filtering
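The three filters above can be sketched in a few lines of Python. This is a minimal illustration, assuming the input is a gffcompare-annotated GTF in which each `transcript` line carries a `class_code` attribute. The exon threshold (`min_exons=2`) is a hypothetical default, since the cutoff is not stated here, and the function and variable names are illustrative only.

```python
import re
from collections import defaultdict

CANDIDATE_CODES = {"i", "u", "x", "o", "p"}


def parse_attributes(field):
    """Turn the GTF attribute column into a dict, e.g. {'transcript_id': 'T1'}."""
    return dict(re.findall(r'(\S+) "([^"]*)"', field))


def filter_candidates(gtf_lines, min_length=200, min_exons=2):
    """Return sorted transcript IDs passing the class-code, length and exon filters."""
    class_code = {}
    exon_length = defaultdict(int)
    exon_count = defaultdict(int)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = parse_attributes(cols[8])
        tid = attrs.get("transcript_id")
        if tid is None:
            continue
        if cols[2] == "transcript":
            class_code[tid] = attrs.get("class_code")
        elif cols[2] == "exon":
            # GTF coordinates are 1-based and inclusive
            exon_length[tid] += int(cols[4]) - int(cols[3]) + 1
            exon_count[tid] += 1
    return sorted(
        tid for tid, code in class_code.items()
        if code in CANDIDATE_CODES
        and exon_length[tid] > min_length
        and exon_count[tid] >= min_exons
    )
```

Transcript length is taken as the summed exon length (the spliced length), not the genomic span, so long introns do not let a short transcript pass the 200 nt filter.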
- CPC2 analysis
- CNCI analysis
- PLEK analysis
- Rfam/Pfam/NR database alignment
- Integration of all software identification results
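One straightforward way to integrate the tool results is to keep only transcripts that every coding-potential tool calls non-coding and that have no hit in the Rfam, Pfam, or NR searches. The sketch below assumes each tool's output has already been reduced to a set of transcript IDs; the parsing of the individual output files is not shown, and the function names are hypothetical. Whether the pipeline uses a strict intersection or a majority vote is not stated here, so treat the intersection as an assumption.

```python
def consensus_noncoding(*tool_calls):
    """Return the IDs that every tool (e.g. CPC2, CNCI, PLEK) called non-coding."""
    if not tool_calls:
        return set()
    return set.intersection(*(set(calls) for calls in tool_calls))


def remove_database_hits(candidates, *hit_sets):
    """Drop candidates with a hit in any database search (e.g. Pfam, Rfam, NR)."""
    hits = set().union(*hit_sets)
    return set(candidates) - hits


# Toy example with made-up transcript IDs
cpc2 = {"T1", "T2", "T3"}
cnci = {"T2", "T3", "T4"}
plek = {"T2", "T3"}
pfam_hits = {"T3"}

candidates = consensus_noncoding(cpc2, cnci, plek)
print(sorted(remove_database_hits(candidates, pfam_hits)))
```

A strict intersection favors specificity: a transcript flagged as coding by any single tool is discarded, which matches the "strict screening" goal at the cost of some sensitivity.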
- Long non-coding RNA identification
- Classification (antisense, intronic, intergenic, etc.)
- Integration with known lncRNA database information
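The classification step can be driven directly by the gffcompare class codes used during candidate extraction. The mapping below is a sketch: the class-code meanings follow the gffcompare documentation, but the category labels (in particular for `o` and `p`) are illustrative naming choices, not necessarily the labels this pipeline reports.

```python
from collections import Counter

# gffcompare class code -> lncRNA category (labels are illustrative)
CLASS_CODE_TO_TYPE = {
    "x": "antisense",          # exonic overlap on the opposite strand
    "i": "intronic",           # fully contained within a reference intron
    "u": "intergenic",         # no overlap with any reference transcript
    "o": "sense-overlapping",  # other same-strand overlap with reference exons
    "p": "polymerase run-on",  # close to a reference gene, without overlap
}


def classify_lncrnas(class_codes):
    """Map {transcript_id: class_code} to {transcript_id: category}."""
    return {
        tid: CLASS_CODE_TO_TYPE.get(code, "unclassified")
        for tid, code in class_codes.items()
    }


def summarize(categories):
    """Count how many transcripts fall into each category."""
    return Counter(categories.values())


example = {"T1": "u", "T2": "x", "T3": "u", "T4": "i"}
print(summarize(classify_lncrnas(example)))
```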
- Calculate expression levels in each sample
- Expression normalization
- Sample correlation analysis
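As an illustration of normalization and sample correlation, the sketch below converts raw counts to TPM and computes a Pearson correlation between two samples. The normalization method is not specified in this document, so TPM is an assumption here; the inputs correspond to the per-gene counts and lengths that featureCounts reports, and the numbers are made up.

```python
import math


def tpm(counts, lengths):
    """Transcripts per million: length-normalize counts, then scale to sum to 1e6."""
    rates = [count / length for count, length in zip(counts, lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [rate / total * 1e6 for rate in rates]


def pearson(x, y):
    """Pearson correlation coefficient between two equally long value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


lengths = [1000, 2000, 500]
sample_a = tpm([100, 400, 50], lengths)
sample_b = tpm([120, 380, 70], lengths)

# Correlate on log2(TPM + 1) so a few highly expressed genes do not dominate
log_a = [math.log2(v + 1) for v in sample_a]
log_b = [math.log2(v + 1) for v in sample_b]
print(round(pearson(log_a, log_b), 3))
```

Note that DESeq2 expects raw counts as input and applies its own normalization, so values such as TPM are for exploration and plotting, not for the differential expression step.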
- DESeq2 differential expression analysis
- Differential gene screening
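Differential gene screening typically means filtering the DESeq2 results table by adjusted p-value and fold change. The sketch below assumes the results were exported to CSV with the standard DESeq2 columns `log2FoldChange` and `padj`, plus a `gene_id` column (a hypothetical name for the exported row names). The cutoffs `padj < 0.05` and `|log2FC| >= 1` are common defaults, not values taken from this pipeline.

```python
import csv
import io


def screen_de_genes(rows, padj_cutoff=0.05, lfc_cutoff=1.0, id_column="gene_id"):
    """Split significant genes into up- and down-regulated ID lists."""
    up, down = [], []
    for row in rows:
        try:
            padj = float(row["padj"])
            lfc = float(row["log2FoldChange"])
        except (KeyError, TypeError, ValueError):
            continue  # skips rows with missing values, e.g. "NA" written by R
        if padj < padj_cutoff and abs(lfc) >= lfc_cutoff:
            (up if lfc > 0 else down).append(row[id_column])
    return up, down


# Made-up results table for illustration
example = """gene_id,baseMean,log2FoldChange,padj
G1,150.2,2.4,0.001
G2,80.5,-1.8,0.01
G3,300.0,0.3,0.0005
G4,12.1,3.0,NA
"""

up, down = screen_de_genes(csv.DictReader(io.StringIO(example)))
print("up:", up, "down:", down)
```

For a real file, pass `csv.DictReader(open(path, newline=""))` instead of the in-memory example.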
- WGCNA co-expression network construction
- Module identification and analysis
- Key gene screening
- GO functional enrichment analysis
- KEGG pathway enrichment analysis
- Functional annotation visualization
- Functional enrichment analysis
- Result visualization
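To make the enrichment step less of a black box, the following sketch computes the hypergeometric p-value that underlies over-representation analysis of a GO term or KEGG pathway. It illustrates the statistic only and is not the pipeline's implementation; the parameter names are illustrative, and `math.comb` requires Python 3.8 or later. The `gseGO` and `gseKEGG` functions listed among the tools use ranked-list gene set enrichment (GSEA) instead, which is a different test.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: annotated background genes
    K: background genes carrying the term
    n: genes in the query list (e.g. differential genes)
    k: query genes carrying the term
    """
    upper = min(n, K)
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total


# Made-up numbers: 40 of 200 query genes carry a term found in 500 of 20000 genes
print(enrichment_p_value(k=40, n=200, K=500, N=20000))
```

In practice this p-value is computed for every term and then corrected for multiple testing, which is the kind of adjustment reported alongside enrichment results.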
- Ensure all dependencies are correctly installed
- Check software version compatibility
- Properly allocate computational resources
- Regularly backup important data
- Record key parameter settings
- Save analysis log files
- Pay attention to biological replication
- Focus on data quality control results
- Validate reliability of key findings
- 2024-02-17: Initial Version V1.0 Release
- Upcoming: Continuous pipeline optimization
- Upcoming: Addition of new analysis tools
This repository contains various datasets and resources related to the identification and functional analysis of long noncoding RNAs (lncRNAs) in rice (Oryza sativa). All data here are based on Oryza sativa Japonica (commonly known as Japonica rice). If you plan to use other rice varieties (e.g., Indica rice), please carefully cross-check the data for compatibility.
Below is a list of files in this repository and their descriptions:
- Contains known rice lncRNAs from multiple databases. This file is crucial for the identification and analysis of rice lncRNAs.
- This file contains the ribosomal RNA (rRNA) sequences for Oryza sativa. These sequences are used as references in RNA-seq analyses to filter out rRNA from the data.
- Contains the GTF (Gene Transfer Format) file of the MSU V7 rice genome annotation. This is the reference genome annotation file for use in RNA-seq analysis and transcript annotation.
- Contains the MSU V7 rice genome sequence in FASTA format. This file includes all genomic sequences that are used for alignment during transcriptome analysis.
- This file contains symbol IDs and Gene Ontology (GO) annotations for rice genes. It helps in the functional annotation of genes and the analysis of biological functions using GO terms.
- A zip file containing the org.db package for rice. This package is used for gene enrichment analysis (GO, KEGG, etc.) and provides key rice gene information in the appropriate format for enrichment tools.
- This file contains detailed information about rice gene GO and KEGG annotations, as well as symbol IDs and trait information. It is essential for understanding gene function and performing further analysis such as pathway enrichment.
- Contains information about hormones in rice. It includes data relevant to plant hormone signaling pathways and their effects on rice physiology.
- This file contains NLR (nucleotide-binding leucine-rich repeat) gene sequences in rice. NLR genes play critical roles in immunity and are important for studying rice disease resistance.
- Contains protein sequences for rice. This file is important for downstream analysis, including protein structure prediction and functional studies.
- Contains CDS (coding sequences) for rice. This file includes the protein-coding nucleotide sequences of genes, used for functional annotation and gene expression analysis.
- An R script for calculating qPCR results. This script is useful for analyzing quantitative PCR data and conducting gene expression studies.
- A Python script for calculating qPCR results. This Python-based tool provides an alternative to the R script for analyzing qPCR data.
- Contains a list of transcription factor (TF) IDs for rice. This file is important for understanding the regulation of gene expression in rice and is used in transcription factor-based analyses.
Data Processing:
- Use the rice genome annotations and lncRNA databases (found in s001, s003, s004) for transcript assembly and RNA-seq analysis.
- The ribosomal RNA file (s002) is used for filtering out rRNA from RNA-seq data before lncRNA identification.
Gene Enrichment Analysis:
- For functional annotation and enrichment analysis, refer to the GO, KEGG, and symbol ID annotations provided in s005 and s007.
- For enrichment analysis, use the s006 package, which supports GO and KEGG enrichment tools.
Hormone Pathway Analysis:
- Use s008 for the analysis of plant hormones and their pathways, which are critical for studying rice immunity and other physiological traits.
NLR and Transcription Factor Studies:
- For disease resistance and regulatory studies, refer to the NLR gene sequences in s009 and the transcription factors in s014.
qPCR Analysis:
- Analyze qPCR data using the provided R (s012) or Python (s013) scripts.
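For orientation, a common way to compute relative expression from qPCR data is the 2^-ΔΔCt method, sketched below. Whether the provided R and Python scripts use this method is not stated here, so this is an illustration of the general approach rather than a description of those scripts; the function name and the Ct values are made up.

```python
from statistics import mean


def relative_expression(target_treated, ref_treated, target_control, ref_control):
    """Fold change of a target gene by the 2^-ddCt method.

    Each argument is a list of Ct values from technical or biological replicates.
    """
    # dCt: target gene normalized to the reference (housekeeping) gene
    d_ct_treated = mean(target_treated) - mean(ref_treated)
    d_ct_control = mean(target_control) - mean(ref_control)
    # ddCt: treated condition relative to control
    dd_ct = d_ct_treated - d_ct_control
    return 2.0 ** (-dd_ct)


fold = relative_expression(
    target_treated=[22.1, 21.9, 22.0],
    ref_treated=[18.0, 18.1, 17.9],
    target_control=[24.0, 24.1, 23.9],
    ref_control=[18.0, 17.9, 18.1],
)
print(round(fold, 2))
```

The method assumes roughly equal, near-100% amplification efficiency for the target and reference primers; if that does not hold, an efficiency-corrected calculation is more appropriate.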
If you use this pipeline, please cite the following paper:
Shan, X., Xia, S., Peng, L., Tang, C., Tao, S., Baig, A., & Zhao, H. (2025). Identification of Rice LncRNAs and Their Roles in the Rice Blast Resistance Network Using Transcriptome and Translatome. Plants (Under Review). Doi: 10.20944/preprints202502.1634.v1
- Issue Feedback: [GitHub Issues]
- Email Contact: [[email protected]]
