Using amino acid features to identify the pathogenicity of influenza B virus

Abstract

Background

Influenza B virus can cause epidemics with high pathogenicity, so it poses a serious threat to public health. A feature representation algorithm is proposed in this paper to identify the pathogenicity phenotype of influenza B virus.

Methods

The dataset included all 11 influenza virus proteins encoded in eight genome segments of 1724 strains. Two types of features were hierarchically used to build the prediction model. Amino acid features were directly delivered from 67 feature descriptors and input into the random forest classifier to output informative features about the class label and probabilistic prediction. The sequential forward search strategy was used to optimize the informative features. The final features for each strain had low dimensions and included knowledge from different perspectives, which were used to build the machine learning model for pathogenicity identification.

Results

The 40 signature positions were achieved by entropy screening. Mutations at position 135 of the hemagglutinin protein had the highest entropy value (1.06). After the informative features were directly generated from the 67 random forest models, the dimensions for class and probabilistic features were optimized as 4 and 3, respectively. The optimal class features had a maximum accuracy of 94.2% and a maximum Matthews correlation coefficient of 88.4%, while the optimal probabilistic features had a maximum accuracy of 94.1% and a maximum Matthews correlation coefficient of 88.2%. The optimized features outperformed the original informative features and amino acid features from individual descriptors. The sequential forward search strategy had better performance than the classical ensemble method.

Conclusions

The optimized informative features had the best performance and were used to build a predictive model so as to identify the phenotype of influenza B virus with high pathogenicity and provide early risk warning for disease control.

Background

Influenza B virus (IBV) belongs to the Orthomyxoviridae family, and its genome is composed of eight negative-strand RNA of different lengths [1, 2]. As a pathogen that can cause human respiratory diseases, IBV was first isolated from clinical samples in 1940 [3]. According to the antigen characteristics of the hemagglutinin protein, two lineages of IBV were reported: Victoria-like virus and Yamagata-like virus [4]. IBV can cause local outbreaks or seasonal epidemics with a high mortality rate in children and adolescents, so it poses a serious threat to public health [5,6,7,8,9,10].

There are at least 11 viral proteins encoded in the genome of IBV: polymerase basic protein 2 (PB2), polymerase basic protein 1 (PB1), polymerase acid protein (PA), hemagglutinin (HA), nucleoprotein (NP), neuraminidase (NA), glycoprotein (NB), matrix protein (M), matrix protein 2 (BM2), nonstructural protein 1 (NS1), and nuclear export protein (NEP) [11]. The pathogenicity of influenza viruses to mammals is determined by amino acid mutation. For example, mutations in PB2 increase the virulence of influenza A virus isolated from avian species and swine [12, 13]. Screening for key amino acid mutations is crucial for understanding the pathogenicity of IBV and can be used to evaluate its virulence and even predict pandemic risk. Although several mutations are related to viral pathogenicity, comprehensive screening has not been achieved [14,15,16,17]. Systematic identification of amino acid mutations is expected to become possible as genome data for IBV increase [18,19,20,21,22].

The pathogenicity of any influenza virus is an important indicator for pandemic risk. Computational tools in the field of machine learning have been used to identify phenotype of biological data [23, 24]. Machine learning techniques gain knowledge from viral protein sequences and represent viruses by optimal features [25]. A model with good performance evaluates the pathogenicity of IBV and predicts the ability of transmission. With the increase of genome data in the public database, machine learning methods are ideal tools for phenotype identification of IBVs [26].

To capture the key information of mutant amino acids of viral proteins, different feature encoding algorithms from multiple perspectives are considered in this paper, such as compositional information, position-specific information, and physicochemical properties. The amino acid composition (AAC) is a simple feature descriptor for sequence analysis [27]. Parallel correlation-based pseudo-amino-acid composition (PC-PseAAC) measures the parallel correlation of any two amino acids in the signature positions [28]. The standard amino acid alphabet is classified and grouped based on five physicochemical properties: polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge [29]. Orthogonal one-hot encoding and overlapping properties can also be used to describe amino acids [30]. Different types of information for amino acid features can be used to construct a machine learning model with good performance.

In this paper, we propose a feature representation algorithm to identify the pathogenicity of IBV. Informative features about the class label or probabilistic prediction were learned from 67 random forest (RF) classifiers. A final predictor was proposed with the use of optimized informative features and performed impressively. Thus, we posit that the proposed method is a powerful tool for pathogenicity identification of IBVs at a large scale, which can aid in warning about transmission risk as well as benefit public health.

Methods

Data set

To describe the transmission dynamic of IBV, surveillance data from 1997 to 2020 were collected from the United States Centers for Disease Control and Prevention (https://www.cdc.gov/flu/weekly/fluviewinteractive.htm). Because of the impact of COVID-19, sparse data from the 2020 to 2021 and 2021 to 2022 influenza seasons were omitted. Regarding pathogenicity, the percentage of IBV in all positive samples of influenza virus per season was calculated. As the number of positive tests changes every year, the positive test rate was used to reflect the pathogenicity.

To construct a machine learning model, protein data of IBVs isolated from the US were downloaded from the GISAID public database (http://platform.gisaid.org/epi3/frontend) [31, 32]. To reduce the redundancy of sequence similarity and cover the integrity of the viral genome, the raw data were processed before modeling [18]. The clustering algorithm was used to reduce the redundancy of viral sequences. Only strains with the full length of viral proteins were considered. Ambiguous amino acid residues were checked and edited carefully. Strains with low-quality sequencing were also removed. The final dataset included all 11 influenza virus proteins (PB2, PB1, PA, HA, NP, NA, NB, M1, BM2, NS1, and NEP) of 1724 strains (see Additional file 1).

Signature amino acid position

Viral proteins have important biological functions and play key roles during infection and transmission. The total length of the 11 viral proteins was 4708 amino acids. Although fast mutation rates have been observed, most amino acid residues in the 11 viral proteins were conserved. Signature positions were screened to reduce the computing complexity. Entropies in each position of the 11 viral proteins were calculated and measured with \({E}_{i}= -\sum_{j=1}^{21}{P}_{i,j}\mathrm{log}\left({P}_{i,j}\right)\), where \({P}_{i,j}\) is the frequency of amino acid \(j\) at position \(i\). Deletion or insertion was also considered. High values reflect frequent mutations in any given position [33].
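The entropy screening described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the sequences are already aligned, uses '-' as the gap symbol, takes the logarithm as the natural log, and selects positions whose entropy is strictly greater than the threshold. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

def position_entropy(column):
    """Shannon entropy (natural log, an assumption) of one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log(n / total) for n in counts.values())

def signature_positions(aligned_seqs, threshold):
    """Return 0-based column indices whose entropy exceeds the threshold."""
    length = len(aligned_seqs[0])
    entropies = [
        position_entropy([seq[i] for seq in aligned_seqs]) for i in range(length)
    ]
    return [i for i, e in enumerate(entropies) if e > threshold]

# Toy alignment (hypothetical sequences, not taken from the dataset).
toy = ["MKAT", "MKGT", "MR-T", "MKAT"]
print(signature_positions(toy, threshold=0.65))
```

In the toy alignment, only the third column mixes three symbols (A, G, and a gap), so it is the only column that passes the threshold.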

Amino acid composition

To identify the pathogenicity of IBV using a machine learning method, the amino acids in signature positions must be encoded as input features. Six different encoding algorithms from multiple perspectives, including compositional information, position-specific information, and physicochemical properties, were used in this paper. The AAC is a simple descriptor for the viral protein sequence of IBV [27]. The AAC method calculates the frequency of each amino acid in the signature positions. The gap (deletion or insertion) was also considered. A 21-dimensional feature vector was used to represent each strain.
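As a rough sketch of the AAC descriptor, the snippet below counts the 20 amino acids plus a gap symbol over the residues at the signature positions of one strain. The alphabet ordering, the '-' gap symbol, and the example string are assumptions made for illustration.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus '-' for a gap (assumed order)

def aac(signature_residues):
    """Return the 21-dimensional frequency vector for one strain."""
    n = len(signature_residues)
    return [signature_residues.count(a) / n for a in ALPHABET]

# Hypothetical residues at eight signature positions.
vec = aac("AKT-AKTA")
print(len(vec), vec[0], vec[-1])
```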

PC-PseAAC

The PC-PseAAC is an updated AAC that calculates the parallel correlation of any two amino acids in a protein or peptide sequence [28]. For each strain used in this paper, the PC-PseAAC feature vector is measured as

$$PC-PseAAC={\left[{fv}_{1},\dots ,{fv}_{21},{fv}_{21+1},\dots ,{fv}_{21+\uplambda }\right]}^{T},$$

where

$${fv}_{u}=\left\{\begin{array}{c}\frac{{f}_{u}}{{\sum }_{i=1}^{21}{f}_{i}+w{\sum }_{j=1}^{\uplambda }{\theta }_{j}}, 1\le u\le 21\\ \frac{w{\theta }_{u-21}}{{\sum }_{i=1}^{21}{f}_{i}+w{\sum }_{j=1}^{\uplambda }{\theta }_{j}}, 21+1\le u\le 21+\lambda \end{array}\right..$$

Here, \(u\) is an integer index that ranges from 1 to \(21+\uplambda\); \({fv}_{u}\) \((1\le u\le 21)\) represents the normalized occurrence frequency of the 20 amino acids and a gap for each strain; \(w\) is the weight factor; λ represents the highest tier of the correlation along signature positions; \({\theta }_{j}\) \((j=1,2,\dots ,\uplambda )\) is the correlation function that measures the \(j\)-tier sequence-order correlation between all the \(j\)-th most contiguous residues along signature positions [18].

G-gap dipeptide composition

The G-gap dipeptide composition (GGAP) measures the dipeptide composition coupled with local order information of any two interval residues within protein sequences. GGAP is represented as

$$GGAP\left(g\right)=\left({fv}_{1}^{g},{fv}_{2}^{g},\dots ,{fv}_{441}^{g}\right),$$

where \({fv}_{i}^{g}\) is the frequency of the \(i\)-th (\(i\)= 1,2, …, 441) g-gap dipeptide in signature positions [18]. The dimension of the GGAP feature vector is 21 × 21 = 441. Deletion or insertion was also computed.
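The GGAP descriptor can be illustrated with the sketch below. It assumes the common convention that a g-gap dipeptide is a pair of residues separated by g intervening positions (so g = 0 gives adjacent pairs); the paper does not spell this convention out, and the alphabet order and gap symbol are likewise assumptions.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus '-' for a gap (assumed order)
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # 21 x 21 = 441 ordered pairs

def ggap(residues, g):
    """Frequencies of ordered residue pairs separated by g intervening positions."""
    counts = dict.fromkeys(PAIRS, 0)
    n_pairs = len(residues) - g - 1
    for i in range(n_pairs):
        counts[residues[i] + residues[i + g + 1]] += 1
    if n_pairs <= 0:
        return [0.0] * len(PAIRS)
    return [counts[p] / n_pairs for p in PAIRS]

# Hypothetical residues; with g = 1 the pairs are (A, A) and (K, K).
vec = ggap("AKAK", g=1)
print(len(vec), vec[PAIRS.index("AA")], vec[PAIRS.index("KK")])
```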

Twenty-bit features

In addition to methods based on the frequency of the amino acid, features about position-specific information and physicochemical properties were also used. The standard 20 amino acids were grouped according to the five physicochemical properties: polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge [29]. For each physicochemical property, the 20 amino acids were clustered into three groups, and deletion/insertion was regarded as the fourth group [18]. A total of 20 groups for each alphabet in the signature positions were achieved. Each residue was encoded as a 20-bit vector comprising 0/1 elements, where the position of the bit was set to 1 if the residue belonged to the corresponding group, and 0 otherwise. The signature positions in this paper were screened with the method of entropy. The top k residues with the highest values of entropy were selected, and the dimension of the feature vector was 20 × k [18].

Twenty-one-bit features

For position-specific information of signature positions, each alphabet was encoded into a 21-bit 0/1 vector as in one-hot encoding (for example, Ala by 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 and deletion/insertion by 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1). Therefore, the top k residues were encoded with a 21 × k dimensional feature vector [18].
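A minimal sketch of the 21-bit encoding follows. Only the positions of Ala (first bit) and the gap (last bit) are given in the text; the ordering of the remaining amino acids is an assumption, as is the '-' gap symbol.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # Ala first, gap last; the rest is an assumed order

def bit21(residues):
    """Concatenate one 21-bit one-hot vector per residue (21 x k values in total)."""
    vec = []
    for r in residues:
        bits = [0] * len(ALPHABET)
        bits[ALPHABET.index(r)] = 1
        vec.extend(bits)
    return vec

# Two hypothetical signature residues: Ala followed by a gap.
encoded = bit21("A-")
print(len(encoded), encoded[0], encoded[-1])
```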

Overlapping property features

Each amino acid was classified into 10 groups based on overlapping physicochemical properties [30]. The 10 physicochemical properties and their corresponding amino acid groups were as follows: Aromatic = {F, Y, W, H}, Negative = {D, E}, Positive = {K, H, R}, Polar = {N, Q, S, D, E, C, T, K, R, H, Y, W}, Hydrophobic = {A, G, C, T, I, V, L, K, H, F, Y, W, M}, Aliphatic = {I, V, L}, Tiny = {A, S, G, C}, Charged = {K, H, R, D, E}, Small = {P, N, D, T, C, A, G, S, V}, and Proline = {P}. Deletion/insertion was regarded as the 11th group. The alphabet in the signature positions was then encoded by an 11-dimensional 0/1 vector. The position of the vector was set to 1 if the residue belonged to the physicochemical property and 0 otherwise. In this paper, the top k residues were encoded with an 11 × k feature vector [18].
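The overlapping property encoding can be sketched as below. The group memberships follow the list in the text, except that the Proline group is taken here to contain only P, which is an assumption based on the usual form of this classification. The group order in the vector and the '-' gap symbol are also assumptions.

```python
# (property name, member residues); the 11th group is the gap symbol.
GROUPS = [
    ("Aromatic", "FYWH"),
    ("Negative", "DE"),
    ("Positive", "KHR"),
    ("Polar", "NQSDECTKRHYW"),
    ("Hydrophobic", "AGCTIVLKHFYWM"),
    ("Aliphatic", "IVL"),
    ("Tiny", "ASGC"),
    ("Charged", "KHRDE"),
    ("Small", "PNDTCAGSV"),
    ("Proline", "P"),  # assumed membership
    ("Gap", "-"),
]

def olp(residues):
    """Concatenate one 11-bit membership vector per residue (11 x k values in total)."""
    vec = []
    for r in residues:
        vec.extend(1 if r in members else 0 for _, members in GROUPS)
    return vec

print(olp("H"))  # histidine belongs to several overlapping groups
```

Unlike one-hot encoding, a single residue can set several bits, which is what lets this descriptor express shared physicochemical properties.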

RF predictor

The RF algorithm was used to output the informative features about the class label and probabilistic prediction [18]. R 3.5.0 (Lucent Technologies, Jasmine Mountain, USA) was used to perform the RF algorithm, and the tree number was set to 500 by default [34].

Framework for pathogenicity identification

The framework for pathogenicity identification of IBV is shown in Fig. 1. Two types of features were hierarchically used to represent IBV: amino acid features and informative features [27]. Amino acid features were directly delivered from 67 feature descriptors and were input into the RF predictors. The informative features about the class label and probabilistic prediction were then generated and further optimized. The optimal subset of informative features to represent each strain had low dimensions and included knowledge from different perspectives, which were expected to improve the performance of the identification model.

Fig. 1

Flowchart of pathogenicity identification of IBV. The 40 signature positions based on entropy were first screened after data were downloaded and cleaned. Six encoding methods of amino acids with changeable parameters were used to extract features. Then, 67 descriptors were proposed, and two types of informative outputs from the RF method were obtained to be further optimized with the mRMR algorithm and the SFS strategy. Each strain was finally represented by two optimized informative features with the low dimension ‘class’ and ‘prob.’ These optimal subsets were used to construct predictive models

The six amino acid encoding algorithms were AAC, PC-PseAAC, GGAP, 20-Bit features (BIT20), 21-bit features (BIT21), and overlapping property features (OLP). The variate k is the common parameter for BIT20, BIT21, and OLP, and controls the dimension of amino acid features. k varied from 4 to 40 by a step size of 4. The maximum was set to 40 because there were 40 signature positions. The 67 feature descriptors under different parameters were produced (Table 1). The class and probabilistic features were then provided by each RF model. The class feature is the predicted class label. The positive samples were marked as 1, and the negative samples were marked as 0. The probabilistic feature is the probability of the positive label. For each type of informative feature, the 67 values were concatenated into a new vector. Each strain was then represented by two informative features.
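To make the role of the parameter k concrete, the sketch below enumerates the feature dimensions of the three k-dependent descriptors. The per-residue bit counts (20, 21, and 11) come from the descriptor definitions above; the dictionary layout and variable names are illustrative only.

```python
K_VALUES = list(range(4, 41, 4))  # k = 4, 8, ..., 40
BITS_PER_RESIDUE = {"BIT20": 20, "BIT21": 21, "OLP": 11}

# Dimension of each (descriptor, k) combination.
dims = {
    (name, k): bits * k
    for name, bits in BITS_PER_RESIDUE.items()
    for k in K_VALUES
}

print(len(K_VALUES), dims[("OLP", 28)], dims[("BIT21", 32)])
```

These three families account for 30 of the 67 descriptors; the remainder come from AAC, PC-PseAAC, and GGAP under their own parameters.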

Table 1 Summary of feature descriptor and feature number

In this paper, two 67-dimensional features were further optimized to reduce computational complexity and increase performance. The minimum-redundancy maximum-relevancy (mRMR) algorithm was used to rank informative features by importance scores [35]. Moreover, the sequential forward search (SFS) strategy was used to increase the informative features from the ranked list one by one. The subset with the best performance was considered to have the optimal features and was proposed to construct the final model for pathogenicity identification [27].
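The SFS step can be sketched as follows, assuming the features have already been ranked (for example by mRMR, which is not implemented here). The `evaluate` callable is a hypothetical stand-in for training a model on the candidate subset and returning a cross-validated score; the toy scores are invented for illustration.

```python
def sequential_forward_search(ranked_features, evaluate):
    """Add ranked features one by one and keep the prefix with the best score."""
    best_score, best_subset = float("-inf"), []
    subset = []
    for feature in ranked_features:
        subset.append(feature)
        score = evaluate(subset)
        if score > best_score:  # strict comparison keeps the smaller subset on ties
            best_score, best_subset = score, list(subset)
    return best_subset, best_score

# Hypothetical scores keyed by subset size, standing in for cross-validated ACC.
toy_scores = {1: 0.90, 2: 0.93, 3: 0.92, 4: 0.93}

best_subset, best_score = sequential_forward_search(
    ["f_a", "f_b", "f_c", "f_d"], lambda subset: toy_scores[len(subset)]
)
print(best_subset, best_score)
```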

Performance evaluation

Four popular metrics for performance evaluation, Sensitivity (SN), Specificity (SP), Accuracy (ACC), and Matthews correlation coefficient (MCC), were used as follows:

$$SN=\frac{TP}{TP+FN}\times 100\%$$
$$SP=\frac{TN}{TN+FP}\times 100\%$$
$$ACC=\frac{TP+TN}{TP+TN+FP+FN}\times 100\%$$
$$MCC=\frac{TP\times TN-FP\times FN}{\sqrt{\left(TP+FN\right) \left(TP+FP\right) \left(TN+FN\right) \left(TN+FP\right)}}\times 100\%$$

where TP is the number of high-pathogenicity strains correctly predicted as high pathogenicity; TN is the number of low-pathogenicity strains correctly predicted as low pathogenicity; FP is the number of low-pathogenicity strains incorrectly predicted as high pathogenicity; and FN is the number of high-pathogenicity strains incorrectly predicted as low pathogenicity.

The receiver operating characteristic (ROC) curve was also used to evaluate the overall performance [36]. The curve is generated by plotting the true positive rate (TPR) against the false positive rate (FPR) under different classification thresholds. The area under the ROC curve (AUC) was used to evaluate the predictive performance. A larger AUC value suggests that the model achieves a better performance [26].
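The AUC can be computed without drawing the curve by using its rank interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this equivalence; the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities for two positive and two negative strains.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
print(auc)
```

The pairwise form is quadratic in the number of samples, which is acceptable for a dataset of this size; a sort-based version would be preferable for much larger data.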

Results

Pathogenicity of IBV

To summarize the transmission dynamics of IBV, US surveillance data from 1997 to 2020 were collected. The percentage of IBV in all positive samples of human influenza virus was calculated for each influenza season. The positive rates for the 2000–2001, 2002–2003, and 2019–2020 seasons were more than 35% (Fig. 2). IBVs isolated from these three seasons with high positive rates were regarded as positive samples, while those from the other 20 seasons had low pathogenicity and were regarded as negative samples. The final dataset for model construction was composed of 1724 strains. Two groups were classified: (1) 865 viruses (positive sample; high pathogenicity; 2000–2001, 2002–2003, 2019–2020 seasons) and (2) 859 viruses (negative sample; low pathogenicity; other 20 seasons). The information related to these strains is summarized in Additional file 1.

Fig. 2

Proportion of IBV in all positive samples per influenza season. The x-axis represents the seasons from 1997 to 2020. The y-axis represents the positive proportion for IBV. The ratio of 35% is shown by the dotted blue line

Signature position

The value 0.65 was set as the threshold for entropy screening, and 40 signature positions were obtained, as shown in Table 2. Each strain was represented by 40 amino acids for subsequent machine learning (Fig. 3). The HA and NA proteins contained the most selected amino acid residues (14 for both), which suggested that HA and NA are the most important factors for human pathogenicity. HA is mainly involved in receptor binding, membrane fusion, and antigen recognition. Signature positions 115–231 are located in or near the region of receptor binding and the antigenic determinant group. The mutations at position 135 had the highest entropy value of 1.06 (Table 2). As shown in Fig. 3, the deletion at HA161–161 should be noted because amino acid deletion can strongly affect protein function. NA influences the release of viral particles from the cell surface. The mutations in positions 120–392 are closely related to the enzyme activity of viral neuraminidase. NB is a viral protein with a short length and is related to virus replication. The role of the two mutations at positions 21 and 99 should be further verified to understand the mechanism of pathogenicity. Although most signature positions shown in Fig. 3 were located in HA, NA, and NB proteins, the remaining eight mutations located in PB1, PA, NS1, or NEP proteins require additional attention during surveillance.

Table 2 Amino acid set for pathogenicity identification

Fig. 3

Signature positions in the 11 viral proteins. A Profile of 40 signature positions from positive samples of IBV. B Profile of 40 signature positions from negative samples of IBV. The x-axis represents the signature position in viral proteins. The y-axis represents the entropy value

Optimal features with low dimension

After the informative features were generated from the 67 RF predictors, the importance scores for each feature were calculated by the mRMR algorithm. The SFS strategy was used to add the ranked features one by one. The subset with the best performance was considered to have the optimal features and was proposed to construct the final model for pathogenicity identification (Fig. 4). For the class features, a maximum ACC of 94.2% was achieved, coupled with the maximum MCC of 88.4%. The best performance was achieved when the top four features were selected, which suggests that the top four class features have the optimal representation of IBV. For the probabilistic features, the top three features produced the best model performance, with an ACC of 94.1% and MCC of 88.2%, which suggests that the top three probabilistic features have the optimal representation of IBV.

Fig. 4

Optimization of informative features. A The SFS curves for the ACC of ‘class’ and ‘prob’ features. B The SFS curves for the MCC of ‘class’ and ‘prob’ features. The x-axis represents the incremental numbers of informative features. The y-axis represents the metric for the ACC and MCC. The ACC is marked in blue, while the MCC is marked in yellow

Performance of the informative features

Two types of information features, the class label and probabilistic prediction, were received from the 67 RF predictors. As shown in Table 3, the features for class information slightly outperformed the features for probabilistic information. In terms of ACC and MCC, the performances based on class information were 94.0% and 87.9%, while those based on probabilistic information were 93.9% and 87.7%. The performance based on the optimal probabilistic features increased from 93.9 to 94.1% for ACC and from 87.7 to 88.2% for MCC. The performance based on optimal class features increased from 94.0 to 94.2% for ACC and from 87.9 to 88.4% for MCC. The performances of the optimal features were better than those of the original features.

Table 3 Performance of the informative features

Comparison of informative features and amino acid features

In this paper, amino acid features were encoded from individual descriptors and input into the RF predictor to generate the informative features. To explore the power of the optimal subset of informative features, we compared the performance of the optimized informative features and the corresponding amino acid features. As shown in Table 4, there were differences in the performances of the optimal class feature and the amino acid features. The maximum ACC of 94.2% and maximum MCC of 88.4% were obtained from the optimal class feature, which were approximately 0.2–3% and 0.3–6% greater than those from amino acid features. It was notable that only four features were used for the optimal class feature, whereas OLP (k = 28) used 308 features, PC-PseAAC (λ = 6) used 27 features, GGAP (k = 5) used 441 features, and BIT20 (k = 12) used 240 features. The number for the optimal class feature was obviously lower than that for amino acid features.

Table 4 Performance of the optimal class features

As shown in Table 5, there were also differences in the performances of the optimal probabilistic feature and corresponding amino acid features. The maximum ACC of 94.1% and maximum MCC of 88.2% were obtained from the optimal probabilistic feature, which were approximately 0.3–3% and 0.6–6% greater than those of amino acid features. It was also notable that only three features were used for the optimal probabilistic feature, whereas BIT21 (k = 32) used 672 features, BIT20 (k = 4) used 80 features, AAC used 20 features, and BIT21 (k = 4) used 84 features. The number for the optimal probabilistic feature was obviously lower than that for amino acid features.

Table 5 Performance of the optimal probabilistic features

Comparison of SFS and ensemble strategy

The SFS strategy was used to search for the optimal subset of informative features. To show the advantage of the SFS strategy, we compared the performances from the optimized informative features with those from two ensemble learning strategies (majority voting and probability averaging). The majority voting strategy takes the majority of the class labels from the 67 RF models. The probability averaging strategy averages probabilistic values from the 67 RF models to perform classification. As shown in Table 6, the ACC for the SFS was approximately 0.7% greater than that for the majority voting strategy, while the MCC for the SFS was approximately 1.4% greater. The ACC for the SFS strategy was approximately 1% greater than that for the ensemble strategies, while the MCC for the SFS was approximately 2% greater. Both optimal features achieved better performance than the two ensemble methods.
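The two baseline ensemble strategies can be sketched as follows for a single strain. The 0.5 decision threshold for probability averaging is an assumption, since the text does not state it, and the example outputs are hypothetical.

```python
def majority_vote(class_labels):
    """Return 1 if more than half of the models predict the positive class."""
    return 1 if sum(class_labels) * 2 > len(class_labels) else 0

def probability_average(probabilities, threshold=0.5):
    """Return 1 if the mean positive-class probability reaches the threshold."""
    return 1 if sum(probabilities) / len(probabilities) >= threshold else 0

# Hypothetical outputs of five models for one strain.
labels = [1, 0, 1, 1, 0]
probs = [0.55, 0.10, 0.60, 0.52, 0.05]
print(majority_vote(labels), probability_average(probs))
```

The example also shows how the two strategies can disagree: three of five models vote positive, but the mean probability is well below 0.5 because the negative votes are far more confident. With 67 models, an odd number, majority voting cannot tie.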

Table 6 Performance of the SFS strategy

Comparison of four classical classifiers

As mentioned above, the optimal features for class and probabilistic information had good performance. To decide how to use the two types of optimal features to identify the pathogenicity of IBV, we compared the performances of RF, support vector machine (SVM), Naïve Bayes (NB), and K-nearest neighbor (KNN). All machine learning methods were evaluated with tenfold cross-validation. When the optimal class features were used, the RF method had better predictive performance than the NB and SVM methods and the same performance as the KNN method (Fig. 5A). The RF method obtained an ACC of 94.2% and MCC of 88.4%, which were approximately 1% and 1.4% greater than those of the NB method. The AUC for the RF method (0.95) was the same as those of the three other classifiers (Fig. 5C). When the optimal probabilistic feature was used, the RF method obtained an ACC of 94.0% and MCC of 88.1%, which were approximately 0.6% and 1.2% greater than those of the NB method (Fig. 5B). The AUC for the RF method (0.96) was the same as that for the NB method and was better than those for the SVM and KNN methods (Fig. 5D). According to the performances of the four classical classifiers, the RF method was selected to treat the optimized informative features and construct the model for pathogenicity identification of IBV.

Fig. 5

Comparison of four traditional classifiers. A Performances of the optimal ‘class’ features. B Performances of the optimal ‘prob’ features. C ROC curves of the optimal ‘class’ features. D ROC curves of the optimal ‘prob’ features

Software implementation

An easy-to-use software tool, freely accessible via https://github.com/kouzheng/BIVPred-FL, was designed. The desired results can be obtained by the following steps: (1) Prepare the ‘FASTA’ file of amino acid sequences for IBV. Examples of formatted sequences can be found in the software directory. (2) Input the name of the query file, select the type of feature information, and set the confidence parameter as required. The predicted label ‘P’ represents the phenotype of high pathogenicity, while ‘N’ means low pathogenicity. Amino acid features from the 67 individual descriptors are also provided to facilitate further analysis.

Discussion

In this study, we presented a method for pathogenicity identification of IBV to benefit public health [37]. The 40 signature positions were first achieved to represent each strain. After two types of informative features were generated from the 67 RF predictors, the mRMR feature ranking algorithm was used to select the optimal subset of informative features. The optimized informative features outperformed the original informative features and amino acid features from individual descriptors. The SFS strategy had better performance than two classical ensemble methods. Finally, the RF method was selected to treat the optimized informative features and construct the machine learning model to predict the phenotypes of IBV.

To reduce computing complexity, each strain was represented by 40 amino acids to fulfill further machine learning [22]. The HA and NA proteins contained the most selected amino acid residues (14 for both), which suggests that HA and NA are the most important factors for pathogenicity among humans. The role of two mutations at positions 21 and 99 should be further verified to understand the mechanism of pathogenicity. Although most signature positions are located in HA, NA, or NB proteins [15, 16, 38], eight mutations located in PB1, PA, NS1, or NEP proteins need extra attention during surveillance [14, 15, 17]. All signature positions were screened based on genome data of IBVs at a large scale, which will benefit the study of the pathogenicity mechanism [39].

Two types of informative features were generated by the RF predictors in this paper. Redundant and irrelevant features were filtered to improve the ability of IBV representation. Good performance was achieved with the use of four class features and three probabilistic features. The optimal subset with low dimensions reduced the complexity of computation. The optimal features about class information were obtained from four individual descriptors: OLP (k = 28), PC-PseAAC (λ = 6), GGAP (k = 5), and BIT20 (k = 12). The optimal features about probabilistic information were obtained from three individual descriptors: BIT21 (k = 32), BIT20 (k = 4), and AAC. The discrimination from different perspectives will benefit the accuracy and interpretability of pathogenicity [40].

Although IBV has not caused a pandemic, the risk of pathogenicity for a pandemic should also be considered [41]. IBV poses a serious threat to susceptible groups, such as children and adolescents, and can cause serious clinical complications. The monitoring of transmission and further research of pathogenicity mechanism should be increased. The method in this paper is a powerful tool for pathogenicity identification of IBVs at a large scale and can facilitate further study in the field of virology.

Although features from signature positions were used to construct the model, whole genomes and full-length proteins should be considered to increase the performance of the prediction model [26]. A mathematical algorithm should be designed for complex data of various models to identify pathogenicity [18, 42]. However, applying the algorithm to multimodal data will be a challenge. The main limitation of this study was that only amino acids in signature positions were encoded to build the prediction model, and the whole genome with clinical image data was not involved. Although the pathogenicity risk may be predicted in view of the pathogen, comprehensive judgment should be exercised to minimize pandemic risk [43].

Conclusions

In this study, we presented a predictor for pathogenicity identification of IBV. The 40 signature positions were screened to represent each strain. Two types of informative features were generated from 67 RF models, and the mRMR algorithm was used to select the optimal subset. Based on the SFS strategy, the dimension of features about class information was optimized to four, with a maximum ACC of 94.2% and maximum MCC of 88.4%, and the dimension of features about probabilistic information was optimized to three, with a maximum ACC of 94.1% and maximum MCC of 88.2%. The optimal features outperformed the original informative features and amino acid features from individual descriptors. The SFS strategy had better performance than the two classical ensemble methods. The RF method was selected to predict the pathogenicity when optimal features were used as input. We believe that the method in this paper can serve as a powerful tool for pathogenicity identification of IBV and benefit public health.

Availability of data and materials

After the registration for any application (https://www.gisaid.org/registration/register/), the public sequences of influenza viruses used in this paper can be downloaded from the GISAID EpiFlu database (http://platform.gisaid.org/epi3/frontend) under the database access agreement (https://platform.epicov.org/epi3/frontend#5aa0ce) and with acknowledgment of GISAID data contributors (https://www.gisaid.org/help/publish-with-data-from-gisaid/). We used the Python programming language to create an easy-to-use tool that implements our predictor and handles massive data, which is freely accessible via https://github.com/kouzheng/BIVPred-FL.

Abbreviations

IBV: Influenza B virus

PB2: Polymerase basic protein 2

PB1: Polymerase basic protein 1

PA: Polymerase acidic protein

HA: Hemagglutinin

NP: Nucleoprotein

NA: Neuraminidase

NB: Glycoprotein NB

M: Matrix protein

BM2: Matrix protein 2

NS1: Nonstructural protein 1

NEP: Nuclear export protein

AAC: Amino acid composition

PC-PseAAC: Parallel correlation-based pseudo-amino-acid composition

GGAP: G-gap dipeptide composition

BIT20: Twenty-bit feature

BIT21: Twenty-one-bit feature

OLP: Overlapping property feature

mRMR: Minimum-redundancy maximum-relevancy

SFS: Sequential forward search

SE: Sensitivity

SP: Specificity

ACC: Accuracy

MCC: Matthews correlation coefficient

TP: True positive

TN: True negative

FP: False positive

FN: False negative

TPR: True positive rate

FPR: False positive rate

RF: Random forest

SVM: Support vector machine

NB: Naïve Bayes

KNN: K-nearest neighbor

ROC: Receiver operating characteristic curve

AUC: Area under the receiver operating characteristic curve

References

  1. Langat P, Raghwani J, Dudas G, Bowden T, Edwards S, Gall A, et al. Genome-wide evolutionary dynamics of influenza B viruses on a global scale. PLoS Pathog. 2017;13(12):e1006749.

  2. Osterhaus A, Rimmelzwaan G, Martina B, Bestebroer T, Fouchier R. Influenza B virus in seals. Science. 2000;288(5468):1051–3.

  3. Francis T. A new type of virus from epidemic influenza. Science. 1940;92(2392):405–8.

  4. Glezen P, Schmier J, Kuehn C, Ryan K, Oxford J. The burden of influenza B: a structured literature review. Am J Public Health. 2013;103(3):e43–51.

  5. Zhao B, Qin S, Teng Z, Chen J, Yu X, Gao Y, et al. Epidemiological study of influenza B in Shanghai during the 2009–2014 seasons: implications for influenza vaccination strategy. Clin Microbiol Infect. 2015;21(7):694–700.

  6. El Moussi A, Pozo F, Ben Hadj Kacem M, Ledesma J, Cuevas M, Casas I, et al. Virological surveillance of influenza viruses during the 2008–2009, 2009–2010 and 2010–2011 seasons in Tunisia. PLoS One. 2013;8(9):e74064.

  7. Tewawong N, Suwannakarn K, Prachayangprecha S, Korkong S, Vichiwattana P, Vongpunsawad S, et al. Molecular epidemiology and phylogenetic analyses of influenza B virus in Thailand during 2010 to 2014. PLoS One. 2015;10(1):e0116302.

  8. Harvala H, Smith D, Salvatierra K, Gunson R, von Wissmann B, Reynolds A, et al. Burden of influenza B virus infections in Scotland in 2012/13 and epidemiological investigations between 2000 and 2012. Euro Surveill. 2014;19(37):20903.

  9. Sam I, Su Y, Chan Y, Nor’E S, Hassan A, Jafar F, et al. Evolution of influenza B virus in Kuala Lumpur, Malaysia, between 1995 and 2008. J Virol. 2015;89(18):9689–92.

  10. Feng L, Shay D, Jiang Y, Zhou H, Chen X, Zheng Y, et al. Influenza-associated mortality in temperate and subtropical Chinese cities, 2003–2008. Bull World Health Organ. 2012;90(4):279–88.

  11. Lamb R, Choppin P. The gene structure and replication of influenza virus. Annu Rev Biochem. 1983;52(1):467–506.

  12. Nilsson B, Te Velthuis A, Fodor E. Role of the PB2 627 domain in influenza A virus polymerase function. J Virol. 2017. https://doi.org/10.1128/JVI.02467-16.

  13. Zhu W, Li L, Yan Z, Gan T, Li L, Chen R, et al. Dual E627K and D701N mutations in the PB2 protein of A (H7N9) influenza virus increased its virulence in mammalian models. Sci Rep. 2015;5(1):14170–81.

  14. Lugovtsev V, Vodeiko G, Levandowski R. Mutational pattern of influenza B viruses adapted to high growth replication in embryonated eggs. Virus Res. 2005;109(2):149–57.

  15. Lugovtsev V, Vodeiko G, Strupczewski C, Ye Z, Levandowski R. Generation of the influenza B viruses with improved growth phenotype by substitution of specific amino acids of hemagglutinin. Virology. 2007;365(2):315–23.

  16. Fujisaki S, Takashita E, Yokoyama M, Taniwaki T, Xu H, Kishida N, et al. A single E105K mutation far from the active site of influenza B virus neuraminidase contributes to reduced susceptibility to multiple neuraminidase-inhibitor drugs. Biochem Biophys Res Commun. 2012;429(1–2):51–6.

  17. Bae J, Lee I, Kim J, Park S, Yoo K, Park M, et al. A single amino acid in the polymerase acidic protein determines the pathogenicity of influenza B viruses. J Virol. 2018. https://doi.org/10.1128/JVI.00259-18.

  18. Kou Z, Li J, Fan X, Kosari S, Qiang X. Predicting cross-species infection of swine influenza virus with representation learning of amino acid features. Comput Math Methods Med. 2021;2021:6985008.

  19. Qiang X, Kou Z, Fang G, Wang Y. Scoring amino acid mutations to predict avian-to-human transmission of avian influenza viruses. Molecules. 2018;23(7):1584.

  20. Qiang X, Kou Z. Scoring amino acid mutation to predict pandemic risk of avian influenza virus. BMC Bioinform. 2019;20(S8):288.

  21. Borkenhagen L, Allen M, Runstadler J. Influenza virus genotype to phenotype predictions through machine learning: a systematic review. Emerg Microbes Infect. 2021;10(1):1896–907.

  22. Suttie A, Deng Y, Greenhill A, Dussart P, Horwood P, Karlsson E. Inventory of molecular markers affecting biological characteristics of avian influenza A viruses. Virus Genes. 2019;55(6):739–68.

  23. Han H. Derivative component analysis for mass spectral serum proteomic profiles. BMC Med Genomics. 2014;7(S1):S5.

  24. Tang Z, Yin Z, Wang L, Cui J, Yang J, Wang R. Solving 0–1 integer programming problem based on DNA strand displacement reaction network. ACS Synth Biol. 2021;10(9):2318–30.

  25. Qiang X, Xu P, Fang G, Liu W, Kou Z. Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus. Infect Dis Poverty. 2020;9:33.

  26. Kou Z, Huang Y, Shen A, Kosari S, Liu X, Qiang X. Prediction of pandemic risk for animal-origin coronavirus using a deep learning method. Infect Dis Poverty. 2021;10:128.

  27. Qiang X, Zhou C, Ye X, Du P, Su R, Wei L. CPPred-FL: a sequence-based predictor for large-scale identification of cell-penetrating peptides by feature representation learning. Brief Bioinform. 2020;21(1):11–23.

  28. Liu B, Liu F, Wang X, Chen J, Fang L, Chou K. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 2015;43(W1):W65–71.

  29. Atchley W, Zhao J, Fernandes A, Drüke T. Solving the protein sequence metric problem. Proc Natl Acad Sci USA. 2005;102(18):6395–400.

  30. Dou Y, Yao B, Zhang C. PhosphoSVM: prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector machine. Amino Acids. 2014;46(6):1459–69.

  31. Elbe S, Buckland-Merrett G. Data, disease and diplomacy: GISAID’s innovative contribution to global health. Glob Chall. 2017;1(1):33–46.

  32. Shu Y, McCauley J. GISAID: global initiative on sharing all influenza data from vision to reality. Euro Surveill. 2017;22(13):30494.

  33. Hall T. BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucl Acids Symp Ser. 1999;41:95–8.

  34. Liaw A, Wiener M. Classification and regression by random forest. R News. 2002;2:18–22.

  35. Ding C, Peng H. Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol. 2005;3(2):185–205.

  36. Sing T, Sander O, Beerenwinkel N, Lengauer T. ROCR: visualizing classifier performance in R. Bioinformatics. 2005;21(20):3940–1.

  37. Yang J, Jit M, Leung K, Zheng Y, Feng L, Wang L, et al. The economic burden of influenza-associated outpatient visits and hospitalizations in China: a retrospective survey. Infect Dis Poverty. 2015;4:44.

  38. Katinger D, Romanova J, Ferko B, Fekete H, Egorov A. Effect of a single mutation in neuraminidase on the properties of influenza B virus isolates. Arch Virol. 2004;149(1):173–81.

  39. Hatta M, Kawaoka Y. The NB protein of influenza B virus is not necessary for virus replication in vitro. J Virol. 2003;77(10):6050–4.

  40. Han H, Liu X. The challenges of explainable AI in biomedical data science. BMC Bioinform. 2022;22:443.

  41. Spoto S, Valeriani E, Locorriere L, Anguissola G, Pantano A, Terracciani F, et al. Influenza B virus infection complicated by life-threatening pericarditis: a unique case-report and literature review. BMC Infect Dis. 2019;19(1):40–5.

  42. Liu C. A note on domination number in maximal outerplanar graphs. Discret Appl Math. 2021;293:90–4.

  43. Koutsakos M, Nguyen T, Barclay W, Kedzierska K. Knowns and unknowns of influenza B viruses. Future Microbiol. 2016;11(1):119–35.

Acknowledgements

We would like to acknowledge the originating and submitting laboratories of the viral sequences from the influenza public database [31, 32].

Funding

This work was supported by the National Natural Science Foundation of China (61972109, 62172114) and the Natural Science Foundation of Guangdong Province of China (2022A1515011468).

Author information

Contributions

ZK and XQ designed the framework of analysis. ZK and XF performed all computational work. ZK, JL and ZS implemented the code and software. ZK and XQ wrote the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Zheng Kou or Xiaoli Qiang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Supplementary Information

Additional file 1. The dataset for influenza B virus.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Kou, Z., Fan, X., Li, J. et al. Using amino acid features to identify the pathogenicity of influenza B virus. Infect Dis Poverty 11, 50 (2022). https://doi.org/10.1186/s40249-022-00974-0

  • DOI: https://doi.org/10.1186/s40249-022-00974-0

Keywords

  • Influenza B virus
  • Pathogenicity
  • Amino acid feature
  • Machine learning