Background

The CRISPR-Cas adaptive immune system is one of the most widespread immune strategies that prokaryotes deploy against invading bacteriophages and plasmids [1, 2]. To counteract different CRISPR-Cas systems, bacteriophages have evolved anti-CRISPR proteins (Acrs), which were first discovered in Pseudomonas aeruginosa phages in 2013 [3]. Since then, a growing number of Acrs have been shown to inactivate multiple CRISPR-Cas subtypes [3,4,5,6,7].

Several methods have been proposed to identify Acrs, including “guilt-by-association” studies [6, 8], self-targeting CRISPR arrays [6, 7], and metagenomic DNA screening [9, 10]. These methods assume that new Acrs resemble previously characterized ones, yet most known Acrs share little sequence similarity with each other. Traditional screening methods based on homology search are therefore unreliable and demand substantial prior knowledge of Acrs. For instance, the “guilt-by-association” method searches for homologs of helix-turn-helix (HTH)-containing proteins that are typically encoded downstream of Acrs [11]; its performance becomes unstable when the known Acrs share low similarity with the queried protein. A computational approach that requires less prior knowledge of known Acrs would therefore provide a new perspective on Acr identification. Machine learning algorithms with appropriate features could reveal potential mechanisms of Acrs and identify new Acrs without such prior knowledge.

Recently, several machine learning methods have been presented for predicting Acrs, and a number of Acr-related web servers are available, such as Anti-CRISPRdb [12], AcrHub [13], AcrDB [14], CRISPRminer2 [15], AcRanker [14, 16], AcrFinder [17], AcrCatalog [18] and PaCRISPR [48].

RPSSM

The original PSSM profile (L × 20) can be reduced to an L × 10 matrix by merging some of its columns. The RPSSM feature is then obtained by exploring the local sequence information of this L × 10 reduced PSSM [49, 50]:

$$\text{re-PSSM}=({P}_{1}, {P}_{2},{P}_{3}, \cdots , {P}_{10})$$

and

$${P}_{1}=\frac{{p}_{F}+{p}_{Y}+{p}_{W}}{3}, {P}_{2}=\frac{{p}_{M}+{p}_{L}}{2}, {P}_{3}=\frac{{p}_{I}+{p}_{V}}{2}, {P}_{4}=\frac{{p}_{A}+{p}_{T}+{p}_{S}}{3}$$
$${P}_{5}=\frac{{p}_{N}+{p}_{H}}{2}, {P}_{6}=\frac{{p}_{Q}+{p}_{E}+{p}_{D}}{3}, {P}_{7}=\frac{{p}_{R}+{p}_{K}}{2}, {P}_{8}={p}_{C}, {P}_{9}={p}_{G}, {P}_{10}={p}_{P}$$

where \(p_{A} ,p_{R} , \ldots ,p_{V}\) represent the 20 columns in the original PSSM profile corresponding to the 20 amino acids. The re-PSSM is further transformed into a 10-dimensional vector:

$${E}_{j}=\frac{1}{L}\sum_{i=1}^{L}{({p}_{i,j}-{\overline{p} }_{j})}^{2}$$

and

$$\overline{{p }_{j}}=\frac{1}{L}\sum_{i=1}^{L}{p}_{i,j}, (j=1, 2, \cdots , 10; i=1, 2, \cdots , L)$$

Additionally, the re-PSSM can be further transformed into a 10 × 10 matrix to capture the local sequence-order information by this formula:

$${E}_{j, t}=\frac{1}{L-1}\sum_{i=1}^{L-1}\frac{{({p}_{i, j}-{p}_{i+1,t})}^{2}}{2}, (j,t=1, 2, 3,\cdots , 10)$$

where \({p}_{i,j}\) represents the element at the ith row and jth column of the re-PSSM. Finally, a 110-dimensional RPSSM feature is obtained by combining \({E}_{j,t}\) and \({E}_{j}\):

$$RPSSM=[{E}_{\mathrm{1,1}},{E}_{\mathrm{1,2}},\cdots ,{E}_{\mathrm{10,10}},{E}_{1},\cdots ,{E}_{10}]$$
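
A minimal NumPy sketch of this computation is given below; it assumes the input PSSM columns follow the standard PSI-BLAST amino-acid order (ARNDCQEGHILKMFPSTWYV), which is an assumption not stated in the text above.

```python
import numpy as np

# Assumed column order of the input PSSM profile (one column per amino acid).
AA_ORDER = "ARNDCQEGHILKMFPSTWYV"

# Groups that merge the 20 PSSM columns into 10 reduced columns,
# following the equations for P1 ... P10 above.
GROUPS = [("F", "Y", "W"), ("M", "L"), ("I", "V"), ("A", "T", "S"),
          ("N", "H"), ("Q", "E", "D"), ("R", "K"), ("C",), ("G",), ("P",)]

def rpssm_features(pssm: np.ndarray) -> np.ndarray:
    """Compute the 110-dimensional RPSSM feature from an L x 20 PSSM."""
    L = pssm.shape[0]
    idx = {aa: i for i, aa in enumerate(AA_ORDER)}
    # Step 1: reduce L x 20 -> L x 10 by averaging the grouped columns.
    re_pssm = np.stack(
        [pssm[:, [idx[a] for a in g]].mean(axis=1) for g in GROUPS], axis=1)
    # Step 2: the 10 global terms E_j (variance of each reduced column).
    e_j = ((re_pssm - re_pssm.mean(axis=0)) ** 2).mean(axis=0)
    # Step 3: the 10 x 10 local terms E_{j,t} from consecutive residues.
    diff = re_pssm[:-1, :, None] - re_pssm[1:, None, :]   # shape (L-1, 10, 10)
    e_jt = (diff ** 2 / 2.0).sum(axis=0) / (L - 1)
    # Concatenate into the final 110-dimensional vector.
    return np.concatenate([e_jt.ravel(), e_j])
```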

Pretrained SSA embedding

The pretrained SSA embedding model is obtained by combining a pre-trained language model with the soft sequence alignment (SSA) [51]. First, an embedding matrix \(R^{L \times 121}\) is produced by stacked BiLSTM encoders for each sequence, where L is the protein sequence length [52]. The pretrained SSA embedding model is then trained and optimized by SSA, which can be described by the following formulas. For convenience, suppose two embedding matrices \(P_1 \in R^{L1 \times 121}\) and \(P_2 \in R^{L2 \times 121}\) for two different protein sequences with lengths L1 and L2, respectively:

$${P}_{1}=[{x}_{1},{x}_{2},\cdots ,{x}_{L1}], {P}_{2}=[{y}_{1},{y}_{2},\cdots ,{y}_{L2}]$$

where \(x_i\) and \(y_j\) are 121-dimensional vectors.

The following formula represents the similarity of P1 and P2:

$$\widehat{p}=-\frac{1}{A}\sum_{i=1}^{L1}\sum_{j=1}^{L2}{\alpha }_{ij}\Vert {x}_{i}-{{y}_{j}\Vert }_{1}$$

and

$$A=\sum_{i=1}^{L1}\sum_{j=1}^{L2}{\alpha }_{ij}, { \alpha }_{ij}={\delta }_{ij}+{\varepsilon }_{ij}-{\delta }_{ij}{\varepsilon }_{ij}$$

with

$${\delta }_{ij}=\frac{exp(-\Vert {x}_{i}-{{y}_{j}\Vert }_{1})}{\sum_{k=1}^{L2}exp(-\Vert {x}_{i}-{{y}_{k}\Vert }_{1})}, {\varepsilon }_{ij}=\frac{exp(-\Vert {x}_{i}-{{y}_{j}\Vert }_{1})}{\sum_{k=1}^{L1}exp(-\Vert {x}_{k}-{{y}_{j}\Vert }_{1})}$$

The SSA embedding model converts each protein sequence into an embedding matrix \(R^{L \times 121}\); an average pooling operation over the sequence dimension then yields a 121-dimensional feature vector.
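
A minimal NumPy sketch of the SSA similarity and the average-pooled feature is shown below; `P1` and `P2` are assumed to be precomputed L1 × 121 and L2 × 121 embedding matrices.

```python
import numpy as np

def ssa_similarity(P1: np.ndarray, P2: np.ndarray) -> float:
    """Soft sequence alignment similarity between two embedding matrices."""
    # Pairwise L1 distances d[i, j] = ||x_i - y_j||_1, shape (L1, L2).
    d = np.abs(P1[:, None, :] - P2[None, :, :]).sum(axis=-1)
    # Softmax over the second sequence (delta) and over the first sequence (epsilon).
    delta = np.exp(-d) / np.exp(-d).sum(axis=1, keepdims=True)
    eps = np.exp(-d) / np.exp(-d).sum(axis=0, keepdims=True)
    alpha = delta + eps - delta * eps
    return float(-(alpha * d).sum() / alpha.sum())

def ssa_feature(P: np.ndarray) -> np.ndarray:
    """Average pooling of an L x 121 embedding into a 121-dimensional feature."""
    return P.mean(axis=0)
```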

Feature selection

Original features are represented by high-dimensional vectors or matrices, which can cause severe problems for machine learning algorithms, such as overfitting, time-consuming training and high demands on computing resources. Identifying the most informative features therefore plays a vital role in improving performance. As one of the most popular feature selection algorithms, maximum relevance minimum redundancy (mRMR) was proposed by Peng et al. [53] and has achieved robust performance in many studies [54,55,56]. In this study, mRMR was used to identify the most important features and improve the generalization ability of the model.
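
A simplified greedy selection in the spirit of mRMR is sketched below; it is not the original implementation. It uses mutual information for the relevance term and, as a common simplification, absolute Pearson correlation for the redundancy term, and it assumes `X` (feature matrix) and `y` (binary labels) are already available.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mrmr_select(X: np.ndarray, y: np.ndarray, n_selected: int) -> list:
    """Greedy mRMR-style selection: maximize relevance, minimize redundancy."""
    relevance = mutual_info_classif(X, y, random_state=0)   # feature-label relevance
    corr = np.abs(np.corrcoef(X, rowvar=False))             # feature-feature redundancy proxy
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_selected:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        redundancy = corr[np.ix_(candidates, selected)].mean(axis=1)
        scores = relevance[candidates] - redundancy
        selected.append(candidates[int(np.argmax(scores))])
    return selected
```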

Machine learning algorithm

In this study, we focused on traditional machine learning classification methods, including support vector machine, k-nearest neighbor, multi-layer perceptron, logistic regression, random forest, extreme gradient boosting, light gradient boosting machine and CatBoost, together with ensemble methods that integrate these eight classifiers through a hard voting strategy or stacking classifiers. More information is given in the following subsections.

Support vector machine

Support vector machine (SVM) was first proposed by Vapnik et al. [57] and has been successfully applied to binary classification problems in bioinformatics [25, 58, 59]. Two parameters, cost (C) and gamma (γ), affect the performance of an SVM model with the RBF kernel. In this study, we used a grid search strategy to optimize C and γ over the space \(\{2^{-6}, 2^{-5}, \ldots , 2^{5}, 2^{6}\}\). Finally, an SVM classifier with the optimal values of C and γ was constructed.
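
A possible scikit-learn sketch of this grid search is shown below; `X` and `y` are assumed to denote the selected feature matrix and class labels.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Grid search over C and gamma in {2^-6, ..., 2^6} with an RBF kernel,
# evaluated by fivefold cross-validation.
param_grid = {"C": [2.0 ** k for k in range(-6, 7)],
              "gamma": [2.0 ** k for k in range(-6, 7)]}
svm_search = GridSearchCV(SVC(kernel="rbf", probability=True), param_grid, cv=5)
svm_search.fit(X, y)
best_svm = svm_search.best_estimator_
```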

K-nearest neighbor

K-nearest neighbor (KNN) is a fundamental classifier that has been applied to predicting protein function [60], extracting protein–protein information [61], and predicting eukaryotic protein subcellular localization [62]. The performance of KNN is directly affected by the parameter k. In this study, a grid search within the space \(\left\{ {1,2, \ldots ,\max \left\{ {\sqrt {FeaNum} ,\frac{FeaNum}{2}} \right\}} \right\}\) was applied to optimize the parameter k during model training, where FeaNum is the number of features used in modelling.
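
A corresponding sketch of the k search with scikit-learn, again assuming `X` and `y` are the feature matrix and labels, might be:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Search k from 1 up to max(sqrt(FeaNum), FeaNum / 2), where FeaNum = X.shape[1].
fea_num = X.shape[1]
k_max = int(max(np.sqrt(fea_num), fea_num / 2))
knn_search = GridSearchCV(KNeighborsClassifier(),
                          {"n_neighbors": list(range(1, k_max + 1))}, cv=5)
knn_search.fit(X, y)
best_knn = knn_search.best_estimator_
```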

Multi-layer perceptron

Multi-layer perceptron (MLP) is a type of artificial neural network (ANN) [63, 64]. MLP has been applied in many bioinformatics studies, such as the prediction of protein structure classes [65], protein tertiary structure [66], and DNA–protein binding sites [67]. In this study, an MLP classifier with two hidden layers was trained; the first and second hidden layers have 64 and 32 nodes, respectively, and the maximum number of training iterations was set to 1000.
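
With scikit-learn, such an MLP could be configured as in the following sketch (`X` and `y` are assumed as before):

```python
from sklearn.neural_network import MLPClassifier

# Two hidden layers with 64 and 32 nodes and at most 1000 training iterations.
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0)
mlp.fit(X, y)
```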

Logistic regression

Logistic regression (LR) is widely used to predict the probability of an event occurring [59, 68] and can be represented by the following formula:

$$p(y)=\frac{1}{1+{e}^{-({\beta }_{0}+{\beta }_{1}\chi )}}$$

where p(y) is the expected probability of the dependent variable \(y\), and β0 and β1 are constants.
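
For illustration, the formula can be evaluated directly; the values of β0, β1 and x in the sketch below are hypothetical.

```python
import numpy as np

def logistic_probability(x: float, beta0: float, beta1: float) -> float:
    """p(y) = 1 / (1 + exp(-(beta0 + beta1 * x)))."""
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

# Example: with beta0 = 0 and beta1 = 1, x = 0 gives p(y) = 0.5.
print(logistic_probability(0.0, 0.0, 1.0))
```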

Random forest

The random forest (RF) classifier was proposed by Breiman [69] and has been used in the prediction of type IV secreted effector proteins [70] and protein structural classes [59]. To find the optimal number of trees M and number of features mtry, we used a grid search to optimize \(M\) and \(mtry\) within the spaces \(\{1, 2,\cdots ,\mathrm{max}\left\{\sqrt{FeaNum},\frac{FeaNum}{2}\right\}\}\) and {1, 6, 11, 16}, respectively, where FeaNum is the number of features adopted during modeling.
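
A possible sketch of this grid search with scikit-learn, assuming `X` and `y` as before and at least 16 selected features, is:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

fea_num = X.shape[1]
m_max = int(max(np.sqrt(fea_num), fea_num / 2))
param_grid = {"n_estimators": list(range(1, m_max + 1)),   # number of trees M
              "max_features": [1, 6, 11, 16]}              # mtry (must not exceed fea_num)
rf_search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
rf_search.fit(X, y)
best_rf = rf_search.best_estimator_
```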

XGBoost

Extreme gradient boosting (XGBoost) is a scalable end-to-end tree boosting system [71] that has been widely used as a fast and highly effective machine learning method [72, 73]. Eitzinger et al. implemented AcRanker using XGBoost to identify Acrs [14, 16]. In this study, the default parameters were adopted for the XGBoost model, except for a learning rate of 0.1.
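
A minimal sketch of this configuration with the xgboost Python package (`X` and `y` assumed as before):

```python
from xgboost import XGBClassifier

# Default parameters except for a learning rate of 0.1.
xgb = XGBClassifier(learning_rate=0.1)
xgb.fit(X, y)
```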

LightGBM

Light gradient boosting machine (LightGBM) shows excellent performance when the feature dimension is high and the data size is large [21]. LightGBM has been used to identify miRNA targets [74] and to predict protein–protein interactions [75] and blood–brain barrier penetration [76]. This study used the LightGBM Python package with default parameters in all experiments.
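
For example, a default LightGBM model could be evaluated as follows (`X` and `y` assumed as before):

```python
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

# LightGBM with default parameters, evaluated by fivefold cross-validation.
lgbm = LGBMClassifier()
auc_scores = cross_val_score(lgbm, X, y, cv=5, scoring="roc_auc")
```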

CatBoost

CatBoost achieves state-of-the-art results because it successfully handles categorical features and calculates leaf values with a new scheme that helps reduce overfitting [23]. CatBoost has been applied in various tasks, including modeling the relationship between molecular structure and biological activity [77] and identifying pyroptosis-related molecular subtypes of lung adenocarcinoma [78]. In this study, the parameters of CatBoost were set to their default values.
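
A minimal sketch with the catboost package, using default parameters (`verbose=0` only suppresses the training log; `X` and `y` assumed as before):

```python
from catboost import CatBoostClassifier

# CatBoost with default parameters.
cat = CatBoostClassifier(verbose=0)
cat.fit(X, y)
```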

Ensemble learning method

This study constructed three ensemble models to obtain more robust and reliable classifiers for predicting new Acr proteins. They integrate the above eight classifiers (SVM, KNN, MLP, LR, RF, XGBoost, LightGBM, and CatBoost) either through a hard voting rule (Ens-vote) or through two stacking classifiers that use logistic regression (Sta-LR) or a gradient boosting classifier (Sta-GBC) as the meta-learner [79].
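
A possible scikit-learn sketch of the three ensembles, assuming `base_models` is a list of (name, estimator) pairs for the eight classifiers above (e.g. [("svm", best_svm), ..., ("cat", cat)]), is:

```python
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Hard-voting ensemble (Ens-vote).
ens_vote = VotingClassifier(estimators=base_models, voting="hard")

# Stacking ensembles with different meta-learners (Sta-LR and Sta-GBC).
sta_lr = StackingClassifier(estimators=base_models,
                            final_estimator=LogisticRegression())
sta_gbc = StackingClassifier(estimators=base_models,
                             final_estimator=GradientBoostingClassifier())
```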

Performance assessment

Fairly evaluating the predictive performance of classification methods is an essential subject in machine learning. In this study, we used six measurements, namely sensitivity (SN), specificity (SP), accuracy (ACC), precision (PRE), F1-score, and the Matthews correlation coefficient (MCC) [80], which are defined as:

$$SN=\frac{TP}{TP+FN}$$
$$SP=\frac{TN}{TN+FP}$$
$$PRE=\frac{TP}{TP+FP}$$
$$ACC=\frac{TP+TN}{TP+FP+TN+FN}$$
$$F1\text{-}score=\frac{2TP}{2TP+FP+FN}$$
$$\mathrm{MCC}=\frac{TP\times TN-FP\times FN}{\sqrt{\left(TP+FN\right)\times \left(TN+FP\right)\times \left(TP+FP\right)\times \left(TN+FN\right)}}$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively. In addition, the area under the receiver operating characteristic (ROC) curve (AUC) was used to assess performance; the ROC curve plots the true positive rate against the false positive rate. All methods were evaluated by fivefold cross-validation.
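
A small helper that computes these measurements from predicted labels and scores, assuming binary labels encoded as 0/1, might look like:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    """Compute SN, SP, PRE, ACC, F1, MCC and AUC for a binary classifier."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    pre = tp / (tp + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    auc = roc_auc_score(y_true, y_score)
    return {"SN": sn, "SP": sp, "PRE": pre, "ACC": acc,
            "F1": f1, "MCC": mcc, "AUC": auc}
```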