Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus.

Background Coronavirus can cross the species barrier and infect humans with a severe respiratory syndrome. SARS-CoV-2 with potential origin of bat is still circulating in China. In this study, a prediction model is proposed to evaluate the infection risk of non-human-origin coronavirus for early warning. Methods The spike protein sequences of 2666 coronaviruses were collected from 2019 Novel Coronavirus Resource (2019nCoVR) Database of China National Genomics Data Center on Jan 29, 2020. A total of 507 human-origin viruses were regarded as positive samples, whereas 2159 non-human-origin viruses were regarded as negative. To capture the key information of the spike protein, three feature encoding algorithms (amino acid composition, AAC; parallel correlation-based pseudo-amino-acid composition, PC-PseAAC and G-gap dipeptide composition, GGAP) were used to train 41 random forest models. The optimal feature with the best performance was identified by the multidimensional scaling method, which was used to explore the pattern of human coronavirus. Results The 10-fold cross-validation results showed that well performance was achieved with the use of the GGAP (g = 3) feature. The predictive model achieved the maximum ACC of 98.18% coupled with the Matthews correlation coefficient (MCC) of 0.9638. Seven clusters for human coronaviruses (229E, NL63, OC43, HKU1, MERS-CoV, SARS-CoV, and SARS-CoV-2) were found. The cluster for SARS-CoV-2 was very close to that for SARS-CoV, which suggests that both of viruses have the same human receptor (angiotensin converting enzyme II). The big gap in the distance curve suggests that the origin of SARS-CoV-2 is not clear and further surveillance in the field should be made continuously. The smooth distance curve for SARS-CoV suggests that its close relatives still exist in nature and public health is challenged as usual. Conclusions The optimal feature (GGAP, g = 3) performed well in terms of predicting infection risk and could be used to explore the evolutionary dynamic in a simple, fast and large-scale manner. The study may be beneficial for the surveillance of the genome mutation of coronavirus in the field.

As considerable coronaviruses have been isolated from bats and other animals, it is believed that there is a viral gene reservoir in wild animals [7]. Coronavirus can directly cross the species barrier and infect humans with high fatality [8]. As the antigen is novel for a human host, public health is being seriously challenged. The infection risk of coronavirus in animals should be analyzed and a prediction model should be constructed for early warning. For this purpose, machine-learning methods appear to be ideal tools [9,10]. The spike protein on the surface of the viral particle plays key roles in the binding of the cell receptor and membrane fusion [3,11], by which the host range is firmly determined [8]. In this study, we screened the features of the spike protein using three encoding algorithms and predicted the cross-species infection of coronaviruses with the random forest method. Moreover, the optimal feature (G-gap dipeptide composition, GGAP, g = 3) was used to explore the dynamic of evolution in a simple, fast and massive manner.

Dataset
The protein sequences of 2666 coronaviruses were collected from 2019 Novel Coronavirus Resource (2019nCoVR) Database of China National Genomics Data Center (NGDC, https://bigd.big.ac.cn/ncov) on Jan 29, 2020 [12]. These strains had full length genomes and were isolated between 1941 and 2020, and included SARS-CoV-2 strains. The information related to these strains was summarized in Additional file 1. The 507 human-origin coronaviruses were regarded as positive samples, whereas the 2159 non-humanorigin coronaviruses were regarded as negative.

Feature encoding algorithms
To capture the key information of the spike protein, we used three encoding algorithms from multiple perspectives, that is compositional information, position-related information and physicochemical properties ( Table 1).
The optimal feature with the best performance was shown by the multidimensional scaling method in R (MDS, https://cran.r-project.org/web/packages/MASS/ index.html). The details of the feature encoding algorithms used to encode the spike protein into feature vectors are listed below.

Amino acid composition
Amino acid composition (AAC) is a simple but commonly used feature descriptor for sequence analysis and model construction. For a total of 20 amino acid types, the AAC descriptor calculates the frequency of each type of amino acid. For example, if the amino acid type i occurs n i times in the protein sequence, then the frequency of i is denoted by f(i) = n i /L, where L is the protein length. For a given strain, we yielded a 20-dimensional feature vector by computing the frequencies of 20 different amino acids.

Parallel correlation-based pseudo-amino-acid composition
Parallel correlation-based pseudo-amino-acid composition (PC-PseAAC) measures the parallel correlation between any two amino acids in a protein sequence [13]. For a given strain P, the PC-PseAAC feature vector is represented by where u is an integer; fv u (1 ≤ u ≤ 20) represents the normalized appearance frequency of the 20 amino acids in the spike protein of P; λ represents the highest tier of the correlation along P; and θj (j = 1, 2, ..., λ) is the correlation function that measures the j-tier sequence-order correlation between all the j-th most contiguous residues along P. θj is calculated using the following formula: where Hm (Pi) (m = 1,2,3,4,5) represents the polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge corresponding to the i-th amino acid Pi in the protein sequence P, respectively [14]. If I + j > L, then I + j equals I + j -L.

G-gap dipeptide composition
The G-gap dipeptide composition (GGAP) achieves the dipeptide composition coupled with local order information of any two interval residues within the spike sequence. It is formulated as where fv g i is the occurrence frequency of the i-th (i = 1,2, ...,400) G-gap dipeptide, which is computed as where O g i represents the occurrence number of the i-th G-gap dipeptide in the spike protein. The dimension of the GGAP feature vector is 20 × 20 = 400.

Machine learning
The framework for the overall prediction is shown in Fig. 1. Two main steps are included: feature representation and machine learning. First, feature representations from three feature descriptors are achieved using the algorithm as described above. Second, the random forest (RF) method is used to train and test the prediction models.
As robust and well performance in the field of machine learning, the RF has been widely used to model biological data. In this study, the RF algorithm is used to construct models and make predictions for the crossspecies transmission of coronavirus. The RF behaves like an ensemble algorithm and proposes a set of decision trees, which are grown by a subset of features. The RF repeats the computing process many times and then makes a final prediction on each sample. The final prediction can simply be the mean of each prediction with bootstrapping algorithm. In this study, the RF algorithm in the R environment was used [15]. All the experiments in the study were conducted under R 3.5.0 with default parameters (tree number = 500). To reduce the bias of unbalanced sample number, the positive samples were increased fourfold by the direct duplication of their protein sequences. The 10-fold cross validation method was used to evaluate the predictive performance. Platt scaling was used to transform the output of the RF model into a probability over two classes and evaluated the infection risk of coronaviruses.

Performance evaluation metrics
Four commonly used metrics for model performance evaluation, that is, sensitivity (SN), specificity (SP), accuracy (ACC) and Matthews correlation coefficient (MCC), were used in the study. The details are listed as follows: where TP indicates true positive, which is the number of correctly predicted true strains with the phenotype of cross-species transmission; TN represents true negative, which is the number of correctly predicted true strains without the phenotype of cross-species transmission; FP represents false positive, which is the number of strains without the phenotype of cross-species transmission predicted to be strains with the phenotype of cross-species transmission; and FN represents false negative, which is the number of strains with the phenotype of cross-species transmission predicted to be strains without the phenotype of cross-species transmission. The SE and SP metrics measure the predictive ability of the model for positive and negative cases, respectively. The other two measures, ACC and MCC, are used to evaluate the overall performance of the model. Regarding all the metrics above, the higher their scores, the better performance of the model have.
In this study, we also used the receiver operating characteristic curve (ROC) to evaluate the overall performance of a binary classifier system [16]. It is generated by plotting the true positive rate (TPR) against the false positive rate (FPR) under different classification thresholds. TPR is also known as sensitivity, as described in the above section, whereas FPR can be calculated as specificity.

Screening of the optimal feature
As described in the section Feature encoding algorithms, we used three feature encoding algorithms from multiple perspectives, that is, compositional information and position-related information, in addition to physicochemical properties. A total of 41 features were used to train the prediction models as shown in Table 1. The performances of the protein features were different and the prediction results for the features with the best performance for each type are shown in Table 2. As shown in Table 2 and Fig. 2a, the predictive model achieved the maximum ACC of 98.18% coupled with the MCC of 0.9638 when the feature GGAP (g = 3) was selected. The performance varied from 96.15 to 98.18% for ACC and from 0.9243 to 0.9638 for MCC. This indicated that the feature GGAP with parameter 3 had the optimal representation ability to distinguish coronaviruses with different phenotypes of cross-species transmission. For the receiver ROC shown in Fig. 2b, the feature GGAP (g = 3) also performed better than the other features (PC-PseAAC or AAC). The optimal GGAP feature representation could be explored to monitor the evolutionary dynamics of coronavirus.

Patterns of human coronavirus
As shown in Table 2 and Fig. 2, the GGAP (g = 3) had the best performance and is proposed to monitor the evolutionary dynamics of coronavirus. The features of the 507 human samples in our dataset were used to show the patterns with the multidimensional scaling method. Seven clusters for 229E (α-CoV), NL63 (α-CoV), OC43 (β-CoV), HKU1 (β-CoV), MERS-CoV (β-CoV), SARS-CoV (β-CoV), and SARS-CoV-2 (β-CoV) were formed obviously (Fig. 3). The clusters for 229E and NL63 were closed and located in the upper right of the figure. The cluster for SARS-CoV-2 was very close to that for SARS-CoV, which suggests that both viruses have the same human receptor (angiotensin converting enzyme II, ACE2). The two clusters for MERS and OC43 were far away from SARS-CoV and SARS-CoV-2.

Evolutionary dynamics of SARS-CoV and SARS-CoV-2
The optimal GGAP feature performed well in terms of predicting infection risk and was used to explore the dynamic of evolution in a simple, fast and massive manner. Based on the GGAP (g = 3) feature, we computed the Euclidean distance of SARS-CoV-2 and SARS-CoV from other coronaviruses in the dataset to explore the evolution dynamic, separately. As shown in Fig. 4a, the distance curve between SARS-CoV-2 and other coronaviruses had two gaps. The   Fig. 4b, the distance curve between SARS-CoV and other coronaviruses also had a gap of value 0.03, which is similar to that of SARS-CoV-2. The two gaps at 0.03 suggest that coronaviruses close to SARS-CoV-2 s or SARS-CoVs form a separate group. We further checked the coronaviruses close to SARS-CoV-2 and SARS-CoV (< 0.03) and found that these close relatives were the same. The results were similar to those from the MDS method and confirmed that SARS-CoV-2 s and SARS-CoVs have the same origin. Moreover, the big gap at 0.02 suggests that the origin of SARS-CoV-2 s is not clear and further surveillance in the field should be made continuously. The smooth curve for SARS-CoVs shows that its close relatives still exist in nature and public health is challenged as usual.

Implementation of the prediction tool
We used the Python language to establish an easy-to-use tool that implements our predictor, which is freely accessible via https://github.com/kouzheng/CovPred-FL and can run in a simple, fast and massive manner. For the convenience of researchers, we provide guidelines on how to use the tool to obtain the desired results: (1) Users need to prepare the query sequences in the FASTA format. Examples of FASTA formatted sequences can be found in the directory mentioned previously.
(2) Users need to input the name of the query file

Discussion
At present, SARS-CoV-2 is still circulating in China and the epidemic causes widespread social concern in the world [17,18]. As considerable coronaviruses have been isolated from bats and other animals, it is believed that there is a viral gene reservoir in wild animals [7]. Coronavirus can directly cross the species barrier and infect humans with a severe syndrome [8]. As an antigen that is novel for a human host, public health is being challenged seriously. With the use of the viral spike protein, in this study, the infection risk of non-human-origin coronavirus was analyzed and a prediction model was constructed for early warning to prevent disease. The spike protein on the surface of the viral particle plays key roles in the binding of the cell receptor and membrane fusion [3,11], by which the host range is firmly determined [8]. In the study, we choose the spike protein as a candidate target to predict the cross-species infection of coronaviruses using the RF method. For the spike protein of coronavirus, the sequence lengths were different and sequence identities were very low between remote relatives, which caused the problem of alignment and challenged the algorithms used to model biology data. For analysis and modeling in a simple, fast and massive manner, we used three different feature encoding algorithms from multiple perspectives, such as compositional information and, position-related information, in addition to physicochemical properties. The computation of protein features did not require multiple sequence alignment and reduced the computational complexity.
A total of 41 features were used to train the prediction models. The best predictive model achieved the maximum ACC of 98.18% coupled with the MCC of 0.9638 when the feature GGAP (g = 3) was selected, which indicated that the feature GGAP with parameter 3 had the optimal representation ability to distinguish coronaviruses with different phenotypes of cross-species transmission. As shown in Table 2, the number of false positives was 59. The reason for the false positives may be the sporadic infection of coronavirus that originated from an animal or a conflicting description of the ability of human receptor binding. With the improvement of annotation in the database, the false rate could be reduced [19].
The MDS results were similar to those from traditional evolution analysis [1,3,5], which confirmed that the screening of the GGAP (g = 3) feature was reasonable for the prediction of cross-species transmission. Moreover, we computed the Euclidean distance of SARS-CoV-2 and SARS-CoV from other coronaviruses in the dataset to explore the evolution dynamic. The big gap of 0.02 suggests that the origin of SARS-CoV-2 is not clear and further surveillance in the field should be made continuously. As considerable work on molecular epidemiology in the field has been conducted recently, more than 2000 genome sequences of coronavirus isolated from animals have been identified. In addition to various bat species, other animals should be suspected as direct hosts for SARS-CoV-2. According to the smooth curve for SARS-CoVs, the fact should be noted that its close relatives still exist in nature and public health is challenged as usual.