Bioinformatics
20230510_Expression quantification and differential expression analysis of circular RNA (circRNA)
Song Wei
May 10, 2023 15:16
circRNA:
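Circular RNAs (circRNAs) are a class of non-coding RNA molecules with a covalently closed loop structure. Because their ends are joined head to tail into a ring, they are highly stable in vivo. With the development of high-throughput sequencing in recent years, large numbers of circRNAs have been discovered, and studies indicate that they play important roles in many biological processes, such as cell proliferation, apoptosis, differentiation, and gene regulation. Their regulatory roles in disease are also drawing increasing attention, particularly in cancer, neurodegenerative diseases, and cardiovascular diseases, where they show considerable potential for diagnosis and therapy. Research on circRNA should therefore help uncover new mechanisms of biological activity and identify novel biomarkers and therapeutic targets.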
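Characteristics and biological functions of circRNA:
- Circular structure: unlike other RNAs, circRNA has a closed circular structure, formed mainly by splicing events within a gene (back-splicing).
- Stable expression: compared with linear RNA, circRNA is more stable because it is resistant to degradation by RNA exonucleases. circRNAs persist for a long time in the cell and can exert their biological functions over extended periods.
- Cell specificity: circRNA expression levels and profiles differ across tissues and cell types, so circRNA expression can reflect cell identity and differentiation state.
- miRNA sponge function: circRNA can act as a miRNA "sponge", binding miRNAs and thereby modulating their function. By pairing with complementary sequences in a miRNA, a circRNA can keep the miRNA from binding its target mRNA, which increases the stability and translation of that mRNA.
- Regulation of RNA-binding proteins: circRNA can also serve as a binding target for RNA-binding proteins and, by interacting with them, modulate their function.
- Regulation of gene expression: circRNA can take part in regulating gene expression, including modulating transcription factor activity and epigenetic modifications. circRNA can also serve as a substrate for nucleases and proteins, thereby influencing RNA degradation and translation.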
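The CIRIquant software:
CIRIquant is a circRNA quantification tool that also supports differential expression analysis; it can be used to compare circRNA expression levels between two or more samples. Its basic workflow consists of the following steps:
- Read and clean the sequencing data: load the reads from FASTQ files, then run quality control and remove low-quality sequences.
- Detect circRNA: use the CIRI algorithm to detect circRNAs in the sequencing data and determine their positions and boundaries on the genome.
- Quantify circRNA: use CIRIquant to quantify the circRNAs and obtain the expression level of each circRNA in each sample.
- Differential expression analysis: compare circRNA expression levels between samples and test for differential expression. Common tools such as DESeq2 and edgeR can be used for the statistical analysis and visualization.
Preparing the CIRIquant configuration files: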
# Prepare the reference genome and annotation files:
mkdir gencode_version_genome
cd gencode_version_genome
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M31/gencode.vM31.annotation.gtf.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M31/GRCm39.primary_assembly.genome.fa.gz
gunzip *.gz
# Keep only the first word of each FASTA header as the sequence ID
cut -d " " -f 1 GRCm39.primary_assembly.genome.fa > GRCm39.primary_assembly.genome-simplifiedID.fa
# Keep only the sequences whose ID contains "chr"
grep "chr" GRCm39.primary_assembly.genome-simplifiedID.fa | cut -d ">" -f 2 | seqkit grep -f - GRCm39.primary_assembly.genome-simplifiedID.fa > GRCm39.primary_assembly.genome-simplifiedID-remove-contamination.fa
# Drop the comment header lines from the GTF
grep -v "^#" gencode.vM31.annotation.gtf > gencode.vM31.annotation-filtered.gtf
# Note: in the GTF file, "gene_id" must be the first attribute in column 9; otherwise, download the reference FASTA and GTF annotation from another source that satisfies this requirement.
root@huajin-XPS-8930:/home/newdisk_1/zhl_20230201/running_result# head gencode_version_genome/gencode.vM31.annotation-filtered.gtf
chr1 HAVANA gene 3143476 3144545 . + . gene_id "ENSMUSG00000102693.2"; gene_type "TEC"; gene_name "4933401J01Rik"; level 2; mgi_id "MGI:1918292"; havana_gene "OTTMUSG00000049935.1";
chr1 HAVANA transcript 3143476 3144545 . + . gene_id "ENSMUSG00000102693.2"; transcript_id "ENSMUST00000193812.2"; gene_type "TEC"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_name "4933401J01Rik-201"; level 2; transcript_support_level "NA"; mgi_id "MGI:1918292"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";
chr1 HAVANA exon 3143476 3144545 . + . gene_id "ENSMUSG00000102693.2"; transcript_id "ENSMUST00000193812.2"; gene_type "TEC"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_name "4933401J01Rik-201"; exon_number 1; exon_id "ENSMUSE00001343744.2"; level 2; transcript_support_level "NA"; mgi_id "MGI:1918292"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";
chr1 ENSEMBL gene 3172239 3172348 . + . gene_id "ENSMUSG00000064842.3"; gene_type "snRNA"; gene_name "Gm26206"; level 3; mgi_id "MGI:5455983";
chr1 ENSEMBL transcript 3172239 3172348 . + . gene_id "ENSMUSG00000064842.3"; transcript_id "ENSMUST00000082908.3"; gene_type "snRNA"; gene_name "Gm26206"; transcript_type "snRNA"; transcript_name "Gm26206-201"; level 3; transcript_support_level "NA"; mgi_id "MGI:5455983"; tag "basic"; tag "Ensembl_canonical";
chr1 ENSEMBL exon 3172239 3172348 . + . gene_id "ENSMUSG00000064842.3"; transcript_id "ENSMUST00000082908.3"; gene_type "snRNA"; gene_name "Gm26206"; transcript_type "snRNA"; transcript_name "Gm26206-201"; exon_number 1; exon_id "ENSMUSE00000522066.2"; level 3; transcript_support_level "NA"; mgi_id "MGI:5455983"; tag "basic"; tag "Ensembl_canonical";
chr1 HAVANA gene 3276124 3741721 . - . gene_id "ENSMUSG00000051951.6"; gene_type "protein_coding"; gene_name "Xkr4"; level 2; mgi_id "MGI:3528744"; havana_gene "OTTMUSG00000026353.2";
chr1 HAVANA transcript 3276124 3286567 . - . gene_id "ENSMUSG00000051951.6"; transcript_id "ENSMUST00000162897.2"; gene_type "protein_coding"; gene_name "Xkr4"; transcript_type "protein_coding_CDS_not_defined"; transcript_name "Xkr4-203"; level 2; transcript_support_level "1"; mgi_id "MGI:3528744"; havana_gene "OTTMUSG00000026353.2"; havana_transcript "OTTMUST00000086625.1";
chr1 HAVANA exon 3283832 3286567 . - . gene_id "ENSMUSG00000051951.6"; transcript_id "ENSMUST00000162897.2"; gene_type "protein_coding"; gene_name "Xkr4"; transcript_type "protein_coding_CDS_not_defined"; transcript_name "Xkr4-203"; exon_number 1; exon_id "ENSMUSE00000858910.2"; level 2; transcript_support_level "1"; mgi_id "MGI:3528744"; havana_gene "OTTMUSG00000026353.2"; havana_transcript "OTTMUST00000086625.1";
chr1 HAVANA exon 3276124 3277540 . - . gene_id "ENSMUSG00000051951.6"; transcript_id "ENSMUST00000162897.2"; gene_type "protein_coding"; gene_name "Xkr4"; transcript_type "protein_coding_CDS_not_defined"; transcript_name "Xkr4-203"; exon_number 2; exon_id "ENSMUSE00000866652.2"; level 2; transcript_support_level "1"; mgi_id "MGI:3528744"; havana_gene "OTTMUSG00000026353.2"; havana_transcript "OTTMUST00000086625.1";
# Build the bwa index
mkdir bwa_index
cd bwa_index
nohup bwa index ../GRCm39.primary_assembly.genome-simplifiedID-remove-contamination.fa -p genome &
cd ../
# Build the hisat2 index
mkdir hisat_index
cd hisat_index
nohup hisat2-build -p 60 ../GRCm39.primary_assembly.genome-simplifiedID-remove-contamination.fa genome &
cd ../../
# Prepare the config.yaml configuration file
root@huajin-XPS-8930:/home/newdisk_1/zhl_20230201/running_result# cat config.yaml
name: genome
tools:
  bwa: /bin/bwa
  hisat2: /bin/hisat2
  stringtie: /bin/stringtie
  samtools: /bin/samtools
reference:
  fasta: /data/gencode_version_genome/GRCm39.primary_assembly.genome-simplifiedID-remove-contamination.fa
  gtf: /data/gencode_version_genome/gencode.vM31.annotation-filtered.gtf
  bwa_index: /data/gencode_version_genome/bwa_index/genome
  hisat_index: /data/gencode_version_genome/hisat_index/genome
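The note above requires `gene_id` to be the first attribute in column 9 of the GTF. The following is a minimal sketch of how that could be checked before running CIRIquant. The function name `count_bad_gene_id` and the two demo records are hypothetical, and the sketch assumes a tab-separated GTF.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count GTF records whose 9th column does not start with gene_id.
# Assumes a tab-separated GTF; lines starting with '#' are skipped.
count_bad_gene_id() {
    awk -F '\t' '!/^#/ && $9 !~ /^gene_id / { bad++ } END { print bad + 0 }' "$1"
}

# Demo on a tiny made-up GTF (not real annotation data):
# the second record puts transcript_id first, so one bad record is reported.
demo_gtf=$(mktemp)
printf 'chr1\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_name "A";\n' > "$demo_gtf"
printf 'chr1\tHAVANA\texon\t1\t100\t.\t+\t.\ttranscript_id "T1"; gene_id "G1";\n' >> "$demo_gtf"
count_bad_gene_id "$demo_gtf"
rm -f "$demo_gtf"
```

Run against the filtered annotation, for example `count_bad_gene_id gencode_version_genome/gencode.vM31.annotation-filtered.gtf`, a result of 0 would suggest the file meets the requirement.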
Detecting circRNA in the three treatment samples and the three control samples:
# Three treatment samples
docker run -v /var/run/docker.sock:/var/run/docker.sock -v `pwd`:/data -w /data bioinformatician/ciriquant_v0.2.0 CIRIquant -t 60 -1 LPS1_1.clean-ok.fq -2 LPS1_2.clean-ok.fq --config config.yaml -o LPS1_result -p LPS1_sample
docker run -v /var/run/docker.sock:/var/run/docker.sock -v `pwd`:/data -w /data bioinformatician/ciriquant_v0.2.0 CIRIquant -t 60 -1 LPS2_1.clean-ok.fq -2 LPS2_2.clean-ok.fq --config config.yaml -o LPS2_result -p LPS2_sample
docker run -v /var/run/docker.sock:/var/run/docker.sock -v `pwd`:/data -w /data bioinformatician/ciriquant_v0.2.0 CIRIquant -t 60 -1 LPS3_1.clean-ok.fq -2 LPS3_2.clean-ok.fq --config config.yaml -o LPS3_result -p LPS3_sample
# Three control samples
docker run -v /var/run/docker.sock:/var/run/docker.sock -v `pwd`:/data -w /data bioinformatician/ciriquant_v0.2.0 CIRIquant -t 60 -1 Con1_1.clean-ok.fq -2 Con1_2.clean-ok.fq --config config.yaml -o Con1_result -p Con1_sample
docker run -v /var/run/docker.sock:/var/run/docker.sock -v `pwd`:/data -w /data bioinformatician/ciriquant_v0.2.0 CIRIquant -t 60 -1 Con2_1.clean-ok.fq -2 Con2_2.clean-ok.fq --config config.yaml -o Con2_result -p Con2_sample
docker run -v /var/run/docker.sock:/var/run/docker.sock -v `pwd`:/data -w /data bioinformatician/ciriquant_v0.2.0 CIRIquant -t 60 -1 Con3_1.clean-ok.fq -2 Con3_2.clean-ok.fq --config config.yaml -o Con3_result -p Con3_sample
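The six invocations above differ only in the sample name, so they could also be generated in a loop. The sketch below only prints the commands as a dry run and executes nothing. The helper name `ciriquant_cmd` is hypothetical, the flags and image name are copied from the commands above, and `"$PWD"` stands in for the backtick `pwd`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the CIRIquant docker command for one sample.
# Flags and image name are copied from the commands above; nothing is executed here.
ciriquant_cmd() {
    local s="$1"
    printf 'docker run -v /var/run/docker.sock:/var/run/docker.sock -v "$PWD":/data -w /data bioinformatician/ciriquant_v0.2.0 CIRIquant -t 60 -1 %s_1.clean-ok.fq -2 %s_2.clean-ok.fq --config config.yaml -o %s_result -p %s_sample\n' "$s" "$s" "$s" "$s"
}

# Dry run: print one command per sample.
for sample in LPS1 LPS2 LPS3 Con1 Con2 Con3; do
    ciriquant_cmd "$sample"
done
```

After checking the printed commands, the output can be piped to `bash` to run the samples one after another.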
Quantifying the expression of the detected circRNAs:
docker run -v /var/run/docker.sock:/var/run/docker.sock -v `pwd`:/data -w /data barryd237/ciriquant_v1.0.1 prep_CIRIquant -i sample1.list --lib library_info.csv --circ circRNA_info.csv --bsj circRNA_bsj.csv --ratio circRNA_ratio.csv
# sample1.list is the input file, with the following content:
CON1 ./Con1_result/Con1_sample.gtf C 1
CON2 ./Con2_result/Con2_sample.gtf C 2
CON3 ./Con3_result/Con3_sample.gtf C 3
LPS1 ./LPS1_result/LPS1_sample.gtf T 1
LPS2 ./LPS2_result/LPS2_sample.gtf T 2
LPS3 ./LPS3_result/LPS3_sample.gtf T 3
# library_info.csv, circRNA_info.csv, circRNA_bsj.csv and circRNA_ratio.csv are the output files
python2 prepDE.py -i sample2.list -g gene_count_matrix.csv -t transcript_count_matrix.csv
# Content of the input file sample2.list:
CON1 ./Con1_result/gene/Con1_sample_out.gtf
CON2 ./Con2_result/gene/Con2_sample_out.gtf
CON3 ./Con3_result/gene/Con3_sample_out-filtered.gtf
LPS1 ./LPS1_result/gene/LPS1_sample_out.gtf
LPS2 ./LPS2_result/gene/LPS2_sample_out-filtered.gtf
LPS3 ./LPS3_result/gene/LPS3_sample_out.gtf
# gene_count_matrix.csv and transcript_count_matrix.csv are the output files
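The sample list passed to `prep_CIRIquant` follows a regular pattern (name, path to the per-sample GTF, group C or T, subject number), so it could be generated instead of typed by hand. A minimal sketch, assuming the directory layout from the listing above; the helper name `make_sample_row` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one sample-list row for prep_CIRIquant.
# Columns follow the listing above: upper-case name, GTF path, group (C/T), subject number.
make_sample_row() {
    local sample="$1" group="$2" subject="$3"
    local upper
    upper=$(printf '%s' "$sample" | tr '[:lower:]' '[:upper:]')
    printf '%s ./%s_result/%s_sample.gtf %s %s\n' "$upper" "$sample" "$sample" "$group" "$subject"
}

# Controls first, then treatment samples, matching the order used above.
for i in 1 2 3; do make_sample_row "Con$i" C "$i"; done
for i in 1 2 3; do make_sample_row "LPS$i" T "$i"; done
```

Redirecting the output to `sample1.list` would reproduce the six rows shown above.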
Running differential expression analysis on the quantified circRNAs to identify biologically relevant circRNAs:
docker run -it -v /var/run/docker.sock:/var/run/docker.sock -v `pwd`:/data -w /data barryd237/ciriquant_v1.0.1 /bin/bash
bash Anaconda3-5.2.0-Linux-x86_64.sh
source ~/.bashrc
conda config --add channels r
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/
root@14d6b6a0be1c:/data# conda install -c bioconda r-base=3.6.1 -y
R
# The following lines are entered at the R prompt
install.packages("BiocManager")
library(BiocManager)
BiocManager::install("edgeR")
install.packages("optparse")
# Leave R and the container, then commit the container as a new image
exit
root@huajin-XPS-8930:/home/newdisk_1/zhl_20230201/running_result/combined_Result# docker commit 29f55a252e69 barryd237/ciriquant_v1.0.1-revised
sha256:51d9a5eb756899e93c48a09dfce3a3156c763dce0973568acf1a54ea45c0bea2
docker run -it -v /var/run/docker.sock:/var/run/docker.sock -v `pwd`:/data -w /data barryd237/ciriquant_v1.0.1 /bin/bash
bash Miniconda2-latest-Linux-x86_64.sh
source ~/.bashrc
conda install r-base=4.2.0
# The following lines are entered at the R prompt
install.packages("BiocManager")
library(BiocManager)
BiocManager::install("edgeR")
install.packages("optparse")
install.packages("statmod")
(base) root@a14b1c0fdd04:/data# exit
docker commit a14b1c0fdd04 barryd237/ciriquant_v1.0.1-revised-with-R
docker run -it -v /var/run/docker.sock:/var/run/docker.sock -v `pwd`:/data -w /data barryd237/ciriquant_v1.0.1-revised-with-R /bin/bash
CIRI_DE_replicate --lib library_info.csv --bsj circRNA_bsj.csv --gene gene_count_matrix.csv --out identified_differential_expressed_genes.csv
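The resulting identified_differential_expressed_genes.csv is the final output file. Rows with a value of 1 in the DE column are circRNAs that are differentially expressed between the treatment and control groups: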
(base) huajin@huajin-XPS-8930:/home/newdisk_1/zhl_20230201/running_result/combined_Result$ column -t tmp.txt
circRNA_ID                logFC  logCPM  LR     PValue       DE  FDR
chr11:4477836|4481305     6.62   -0.55   18.59  1.62E-05     1   0.0019
chr7:127684106|127685244  6.73   -0.61   18.35  1.84E-05     1   0.0119
chr9:41979093|41983642    6.62   -0.61   18.31  1.88E-05     1   0.0019
chr9:59204809|59210834    4.43   0.55    17.67  2.63E-05     0   0.0755
chr4:133354511|133356535  6.40   -0.70   16.87  3.99E-05     0   0.0905
chr10:59435750|59441999   5.04   -0.01   16.55  4.73E-05     0   0.0905
chr19:6391933|6393206     4.82   -0.34   15.12  0.000101062  0   0.1359
chr2:5057799|5059527      6.23   -0.93   15.07  0.000103343  0   0.1359
chr3:10051571|10051921    6.41   -0.84   15.02  0.000106453  0   0.1359
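To pull out only the circRNAs flagged as differentially expressed, the table can be filtered on the DE column. A minimal sketch, assuming the whitespace-separated layout of the table above (one header row, then data rows with the circRNA ID in field 1 and DE in field 6). The helper name `list_de_circrnas` is hypothetical, and the CSV itself would need a comma field separator instead.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the IDs of circRNAs whose DE column equals 1.
# Assumes a whitespace-separated table: header on line 1, ID in field 1, DE in field 6.
list_de_circrnas() {
    awk 'NR > 1 && $6 == 1 { print $1 }' "$1"
}

# Demo using two rows copied from the table above.
demo=$(mktemp)
cat > "$demo" <<'EOF'
logFC logCPM LR PValue DE FDR
chr11:4477836|4481305 6.62 -0.55 18.59 1.62E-05 1 0.0019
chr9:59204809|59210834 4.43 0.55 17.67 2.63E-05 0 0.0755
EOF
list_de_circrnas "$demo"
rm -f "$demo"
```

On the demo input this prints only `chr11:4477836|4481305`, the single row with DE equal to 1.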
Tags:
bioinfo