生物信息学

20230426_使用Python机器学习算法对生物序列进行细胞器基因组预测

Song Wei 2023年4月26日 23:57

251

Blast输出结果用于训练机器学习模型：提升特征提取、预测准确性和泛化能力

Blast是一种广泛使用的生物信息学工具，可用于比对DNA或蛋白质序列。通过将Blast的输出结果用于机器学习模型训练，可以获得以下优势：

（1）增强特征提取能力：Blast输出结果包含许多有用的信息，例如序列相似度、序列长度等等。这些信息可以被提取出来，并作为机器学习模型的输入特征，从而增强模型的特征提取能力。

（2）提高模型预测准确性：Blast可以用于比对已知的蛋白质或DNA序列，从而识别出新的序列与已知序列的相似性。通过将Blast的输出结果用于机器学习模型训练，可以提高模型在识别新的序列时的预测准确性。

（3）降低模型训练难度：机器学习模型通常需要大量的数据用于训练。使用Blast的输出结果可以获得大量的有标签数据，从而降低了模型训练的难度。

（4）增加数据多样性：Blast可以用于比对多个数据库中的序列，从而获得多样的比对结果。将这些比对结果用于机器学习模型训练可以增加数据的多样性，从而提高模型的泛化能力。总之，使用Blast的输出结果来训练机器学习模型可以提高模型的特征提取能力、预测准确性、训练效率和泛化能力，是一种非常有用的方法。

Python机器学习算法对生物序列进行细胞器基因组预测：

本文介绍了一种利用机器学习算法从序列数据中预测生物细胞器（线粒体或叶绿体）的方法。首先需要导入系统库（如os、sys、subprocess）和数据处理及机器学习库（如BioPython、pandas、sklearn等），然后从命令行参数中读取线粒体序列、叶绿体序列和待预测序列的文件路径，并将其复制到当前目录。接下来使用subprocess库调用makeblastdb命令，为线粒体和叶绿体序列分别构建BLAST数据库。定义blast_query函数进行BLAST比对，定义predict_organelle函数通过blast_query函数计算线粒体和叶绿体的得分，然后根据得分和分类器预测序列属于线粒体还是叶绿体。接着定义load_model_and_predict函数从文件中加载预训练的模型，并调用predict_organelle函数进行预测。

在准备训练数据时，遍历线粒体和叶绿体序列文件，将每个序列与线粒体和叶绿体数据库进行BLAST比对，计算得分，然后将得分和对应的标签（0表示线粒体，1表示叶绿体）存储在data列表中。在完成BLAST比对后，删除临时文件“temp_query.fasta”。

将data列表转换为pandas DataFrame格式，然后使用train_test_split函数将数据集划分为训练集和测试集。本文中使用了四种常见的分类算法（随机森林、决策树、K近邻和支持向量机），对于每种算法，我们首先使用训练集对其进行训练，然后使用测试集对其进行评估，计算混淆矩阵、分类报告和准确率等评估指标。最后，我们将选择准确率最高的模型作为最佳模型，并将其保存到文件中以便后续使用。

使用最佳模型对待预测序列进行预测，输出预测结果（线粒体或叶绿体），以及线粒体和叶绿体的概率。通过这种方法，我们可以利用机器学习技术对生物序列进行线粒体和叶绿体的预测，从而辅助相关研究。此外，我们可以尝试其他机器学习算法，或者进一步优化现有算法的参数，以提高预测准确性。

代码实现过程：

import os
import sys
import subprocess
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import pickle
os.system('cp  %s mito_sequences.fasta '%(sys.argv[1]))
os.system('cp  %s chloro_sequences.fasta '%(sys.argv[2]))
os.system('cp  %s example.fasta  '%(sys.argv[3]))
#os.system('rm -rf mito_sequences.fasta chloro_sequences.fasta example.fasta')
# 准备blast数据库
subprocess.run(["makeblastdb", "-in", "mito_sequences.fasta", "-dbtype", "nucl", "-out", "mito_db"])
subprocess.run(["makeblastdb", "-in", "chloro_sequences.fasta", "-dbtype", "nucl", "-out", "chloro_db"])


# 定义blast比对函数
def blast_query(query_seq, db):
    with open("temp_query.fasta", "w") as temp_query:
        SeqIO.write(query_seq, temp_query, "fasta")
    blastn = NcbiblastnCommandline(query="temp_query.fasta", db=db, outfmt="10 sseqid bitscore", max_target_seqs=1)
    stdout, stderr = blastn()
    if not stdout.strip():
        return None, None
    sseqid, bitscore = stdout.strip().split('\n')[0].split(',', 1)
    return sseqid, float(bitscore)
def predict_organelle(query_seq, clf):
    mito_id, mito_score = blast_query(query_seq=query_seq, db="mito_db")
    chloro_id, chloro_score = blast_query(query_seq=query_seq, db="chloro_db")
    if mito_score is not None and chloro_score is not None:
        X = pd.DataFrame([[mito_score, chloro_score]], columns=["mito_score", "chloro_score"])
        proba = clf.predict_proba(X)
        organelle = "mitochondria" if proba[0, 0] > proba[0, 1] else "chloroplast"
        return organelle, proba[0, 0], proba[0, 1]
    return None, None, None


def load_model_and_predict(query_seq, model_path):
    with open(model_path, "rb") as f:
        clf = pickle.load(f)
    return predict_organelle(query_seq, clf)

# 准备数据
data = []
for seq_file in ["mito_sequences.fasta", "chloro_sequences.fasta"]:
    seqs = list(SeqIO.parse(seq_file, "fasta"))
    for seq in seqs:
        mito_id, mito_score = blast_query(query_seq=seq, db="mito_db")
        chloro_id, chloro_score = blast_query(query_seq=seq, db="chloro_db")
        if mito_score is not None and chloro_score is not None:
            label = 0 if seq_file == "mito_sequences.fasta" else 1
            data.append([mito_score, chloro_score, label])
# 删除临时文件
os.remove("temp_query.fasta")
# 构建机器学习模型
df = pd.DataFrame(data, columns=["mito_score", "chloro_score", "label"])
X = df[["mito_score", "chloro_score"]]
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
"""
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
# 保存模型
with open("trained_model.pkl", "wb") as f:
    pickle.dump(clf, f)
# 评估模型
y_pred = clf.predict(X_test)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
# 判断序列概率
proba = clf.predict_proba(X)
df["mito_probability"] = proba[:, 0]
df["chloro_probability"] = proba[:, 1]
print(df)
"""
# 添加其他的机器学习算法
classifiers = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine": SVC(random_state=42, probability=True)
}


# 训练和评估不同的机器学习算法
best_clf_name = None
best_clf = None
best_accuracy = 0
for clf_name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{clf_name}:")
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))
    print("Accuracy Score:", accuracy)
    print("\n" + "=" * 40 + "\n")
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_clf_name = clf_name
        best_clf = clf
# 保存最佳模型
with open("best_trained_model.pkl", "wb") as f:
    pickle.dump(best_clf, f)
print("=================")
print(f"Best Classifier: {best_clf_name} with accuracy {best_accuracy:.4f}")
print("=================")
# 预测给定序列属于线粒体还是叶绿体
# 这里我们用一个样例序列进行测试，实际应用中可以替换为您需要预测的序列
#example_seq = SeqIO.read("example.fasta", "fasta")
#organelle, mito_prob, chloro_prob = predict_organelle(query_seq=example_seq, clf=clf)
# 加载模型并预测给定序列
for seq in SeqIO.parse("example.fasta", "fasta"):
    organelle, mito_prob, chloro_prob = load_model_and_predict(query_seq=seq, model_path="best_trained_model.pkl")
    if organelle is not None:
        print("\nPrediction for the example sequence:")
        print(f"Organelle: {organelle}")
        print(f"Mitochondria probability: {mito_prob:.4f}")
        print(f"Chloroplast probability: {chloro_prob:.4f}")
    else :
        print(f"\nSequence ID: %s "%(seq.id) )
        print("Prediction failed for this sequence.")
    print("==========================")
os.system('rm -rf mito_sequences.fasta chloro_sequences.fasta example.fasta')

代码运行结果：

(base) root@dell-server:/home/dell/organelle_result/# python test4-1.py 1.fasta 2.fasta 3.fasta
Building a new DB, current time: 04/26/2023 18:33:41
New DB name:   /home/dell/organelle_result/test2/test2/test2/mito_db
New DB title:  mito_sequences.fasta
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /home/dell/organelle_result/test2/test2/test2/mito_db
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 4599 sequences in 0.155678 seconds.

Building a new DB, current time: 04/26/2023 18:33:41
New DB name:   /home/dell/organelle_result/test2/test2/test2/chloro_db
New DB title:  chloro_sequences.fasta
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /home/dell/organelle_result/test2/test2/test2/chloro_db
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 3940 sequences in 0.135202 seconds.
Random Forest:
Confusion Matrix:
 [[10  0]
 [ 0 39]]
Classification Report:
               precision    recall  f1-score   support
           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00        39
    accuracy                           1.00        49
   macro avg       1.00      1.00      1.00        49
weighted avg       1.00      1.00      1.00        49
Accuracy Score: 1.0
========================================
Decision Tree:
Confusion Matrix:
 [[10  0]
 [ 0 39]]
Classification Report:
               precision    recall  f1-score   support
           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00        39
    accuracy                           1.00        49
   macro avg       1.00      1.00      1.00        49
weighted avg       1.00      1.00      1.00        49
Accuracy Score: 1.0
========================================
K-Nearest Neighbors:
Confusion Matrix:
 [[10  0]
 [ 0 39]]
Classification Report:
               precision    recall  f1-score   support
           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00        39
    accuracy                           1.00        49
   macro avg       1.00      1.00      1.00        49
weighted avg       1.00      1.00      1.00        49
Accuracy Score: 1.0
========================================
Support Vector Machine:
Confusion Matrix:
 [[ 9  1]
 [ 0 39]]
Classification Report:
               precision    recall  f1-score   support
           0       1.00      0.90      0.95        10
           1       0.97      1.00      0.99        39
    accuracy                           0.98        49
   macro avg       0.99      0.95      0.97        49
weighted avg       0.98      0.98      0.98        49
Accuracy Score: 0.9795918367346939
========================================
=================
Best Classifier: Random Forest with accuracy 1.0000
=================
Prediction for the example sequence:
Organelle: mitochondria
Mitochondria probability: 1.0000
Chloroplast probability: 0.0000
==========================
Prediction for the example sequence:
Organelle: mitochondria
Mitochondria probability: 0.6700
Chloroplast probability: 0.3300
==========================
Prediction for the example sequence:
Organelle: mitochondria
Mitochondria probability: 0.6700
Chloroplast probability: 0.3300
==========================
Prediction for the example sequence:
Organelle: mitochondria
Mitochondria probability: 0.9300
Chloroplast probability: 0.0700
==========================
Prediction for the example sequence:
Organelle: mitochondria
Mitochondria probability: 0.6700
Chloroplast probability: 0.3300
==========================
Prediction for the example sequence:
Organelle: mitochondria
Mitochondria probability: 0.6700
Chloroplast probability: 0.3300
==========================

标签： bioinfo

20230429_利用ggmagnify包优化R中ggplot2图形的局部可视化效果

北京天气

晴

0℃

文章分类

标签云集

python bioinfo linux dairly rstudio bioinformatics

20230426_使用Python机器学习算法对生物序列进行细胞器基因组预测

20230429_利用ggmagnify包优化R中ggplot2图形的局部可视化效果

20230425_BLAST软件升级：新增可精确定位错配碱基位置功能

文章分类

最新文章

推荐阅读

标签云集

网站浏览