Exploring the role of meteorological factors in predicting incident pulmonary tuberculosis: A time-series study in eastern China

Background: Many studies have compared the performance of time-series models in 19 predicting pulmonary tuberculosis (PTB). Few studies regarding the role of 20 meteorological factors in predicting PTB are available. This study aims to explore 21 whether incorporating meteorological factors can improve the performance of time 22 series models in predicting pulmonary tuberculosis (PTB).


Background
Tuberculosis (TB) is a chronic communicable disease that severely threatens human health, ranking among the top ten causes of death worldwide.The World Health Organization (WHO) estimated that approximately 10.0 million people fell ill with TB around the world in 2018.Meanwhile, there were an estimated 1.2 million TB deaths among HIV-negative people and 251,000 TB deaths among HIV-positive people.In order to curb the TB epidemic, the WHO set a goal of reducing the morbidity and mortality of TB by 90% and 95%, respectively, between 2015 and 2035 [1].Accurately predicting the trend of the epidemic can help foresee the possible peaks and provide a reference for the prevention and control of TB.
A time series is formed by recording the development process of a random event in the order of time.Time series analysis plays a vital role in predicting trends by identifying the law of health-related events changing with time.The autoregressive integrated moving average (ARIMA) model is the most classic time series analysis model and has been widely applied to predict various infectious diseases like hepatitis B [2], hemorrhagic fever with renal syndrome [3], coronavirus disease 2019 [4] and hand, foot and mouth disease [5].The ARIMA with exogenous variables (ARIMAX) exhibits superior prediction performance by adding other event-related factors as input variables [6,7].Another commonly used time series analysis model is the artificial neural network (ANN), which is designed to simulate the way the human brain analyzes and processes information.It has been applied to construct time series models to forecast human diseases [8][9][10].The recurrent neural network (RNN) is a specific ANN with the ability to transfer information across time steps, as it can remember the previous information and apply it to the current output calculation.The ability to model temporal dependencies makes it particularly appropriate to analyze time series, which consists of a sequence of points that are not independent [11,12].Time series analysis has been used to predict TB morbidity or mortality, but most were conducted in one city or one region and based on one or two models without incorporating meteorological factors in the model [13][14][15][16].Our previous studies have observed that the incidence of TB exhibited seasonal fluctuations, indicating a potential relationship with meteorological factors [17].Thus, in the current study, we performed a time series analysis in the three cities of Jiangsu Province, China, and applied three types of models, including ARIMA, ARIMAX, and RNN, to explore whether the inclusion of meteorological factors can improve the performance of modeling.

Study area
Jiangsu Province is located on the eastern coast of China, with an area of 107,200 square kilometers.It governs 13 cities and has a permanent population of 80.7 million at the end of 2019.We randomly selected one city from northern, central, and southern Jiangsu, respectively, and finally included Xuzhou, Nantong, and Wuxi as the study sites.

Data collection
All newly diagnosed TB cases in China are registered in an online surveillance system operated by the Center for Disease Control and Prevention.We extracted the monthly reported number of pulmonary TB (PTB) cases in the study sites between 2005 and 2018.We also collected meteorological factors in the same area at the same time from the China Meteorological Data Network (http://www.nmic.cn/).These meteorological factors included monthly average temperature (MAT), monthly average atmospheric pressure (MAP), monthly average wind speed (MAS), monthly average relative humidity (MAH), monthly precipitation (MP), and monthly sunshine time (MST).

Construction of the ARIMA model
As described in our previous studies [17], we constructed a seasonal ARIMA model, which is expressed as ARIMA (p, d, q)(P, D, Q) s .The p, d, and q represent the autoregressive order, the number of ordinary differences, and the moving average order, respectively.The P, D, and Q represent the seasonal autoregressive order, the number of seasonal differences, and the seasonal moving average order, respectively.
The s represents the length of a periodic pattern (s =12 in this study).The number of PTB cases predicted at time t (   ) was determined by the formula:   = θ  ()Θ  (  )  Φ  (  )  ()(1−)  (1−  )  , where θ  () is the operator of moving average, Θ  (  ) is the operator of seasonal moving average,   () is the operator of autoregressive, Φ  (  ) is the operator of seasonal autoregressive, (1 − )  is the component of ordinary difference, (1 −   )  is the component of seasonal difference,   is the white noise and   is the predicted variable [18,19].Based on the monthly number of PTB, we constructed the ARIMA model for each city.First, we applied the ordinary difference and seasonal difference to make the series stationary.Second, by referring to the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots of the stationary series, we initially identified the values of the parameters (p, q, P and Q) to establish alternative ARIMA models.Third, we determined the optimal ARIMA model according to three criteria: (a) the smaller the normalized Bayesian information criterion (BIC) value of the model was, the better the model would be; (b) the residual series of the model was proved to be a white noise by the Ljung-Box test; (c) the parameter estimation showed that the parameters of the model were significant [17].Finally, we selected the optimal ARIMA model to predict PTB cases in 2018.

Construction of the ARIMAX model
The ARIMAX model adds exogenous variables based on ARIMA model, which can be described by the formula:   = θ  ()Θ  (  )  Φ  (  )  ()(1−)  (1−  )  + , where  represents the external regressor, which can be univariate or multivariate.The other parameters are consistent with the ARIMA model [18].Based on the monthly number of PTB cases and six meteorological factors, we constructed the ARIMAX model for each city.First, we constructed the optimal ARIMA model for each meteorological factor, obtained residual series of optimal ARIMA models, and ensured that they were all white noise.Second, we used the cross-correlation function (CCF) to analyze the residual series of PTB cases and meteorological factors to evaluate the correlation between them at different lag times.Third, we tried different combinations of significant meteorological factors as external variables into the optimal ARIMA model to construct alternative ARIMAX models.Finally, we determined the optimal ARIMAX model according to three criteria: (a) the normalized BIC value was smaller than the optimal one; (b) the residual series of the model was proved to be a white noise by the Ljung-Box test; (c) the performance of the model in predicting PTB cases in 2018.

Construction of the RNN model
The ANN usually consists of the input layer, hidden layer, and output layer.The layers of traditional ANN are fully connected, but the neurons in each layer are not connected.The difference of RNN is that it adds the connection between the neurons in the hidden layer (Figure 1A). Figure 1B shows the unfolding diagram of the forward propagation of the RNN [20], where   represents the input at time ,  represents the hidden state at time  ,  = ( * −1 +  *   ) ,  represents the weight of the input,  represents the weight of the input at the moment,   represents the output at time  ,   = ( *  ) , and  represents the weight of the output.Therefore, the input of the hidden layer of the RNN includes not only the output of the input layer but also the previous output of the hidden layer, making the model have memory function.We divided data into the training set, testing set, and predicting set.We trained each RNN model for three times and compared their performance on the testing set to determine the optimal RNN model.For each RNN model, we set the learning rate as 0.05, 0.1, and 0.2, and the dimensions of the hidden layer as 3, 5, and 10, respectively, and identified the appropriate training epochs through the epoch-error plot.By comparing the performance of the model on the testing set, we determined the most suitable parameters for each RNN model.First, we normalized the original data to convert all values to intervals [0, 1], using the formula: , where X is the original value,   is the maximum value of original data,   is the minimum value of original data, and X′ is the normalized value after conversion.Second, we used the number of PTB cases in the previous one, two, three, six, and twelve months as the input of the training set, respectively and the number of PTB cases in the current month as the output of the training set to construct five different RNN models (RNN1-RNN5), which did not incorporate meteorological factors.We compared the performance of five RNN models on the testing set and selected the best one to incorporate meteorological factors into it.Third, we used the Spearman rank correlation test to evaluate the correlation between PTB cases in the current month and meteorological factors one, two, and three months ago, respectively.Fourth, we incorporated the significant meteorological factors into the best model of RNN1-RNN5 to construct another four RNN models (RNN6-RNN9).Finally, we compared the performance of nine RNN models on the testing set to determine the optimal one and applied it to predict PTB cases in 2018.

Evaluating the performance of three models
The diagnostic indicators, including mean absolute percentage error (MAPE) and root mean square error (RMSE), were used to evaluate the performance of the three , where   is the actual value at time ,  ̂ is the output value of the model at time  and  is the number of samples.

Statistical software
We used SPSS 25.0 (Nanjing Medical University, Nanjing, China) to construct the ARIMA and ARIMAX models, and the package of "rnn" in R 3.6.3 (https://www.r-project.org/) to construct the RNN model.The significant level was set at 0.05.

Results
The ARIMA model The monthly number of PTB cases showed a long-term downward trend and seasonal fluctuations, with the peak in March to April, and the trough in December to January (in Xuzhou) or January to February (in Nantong and Wuxi) (Figure 2).Therefore, we applied one ordinary difference and one seasonal difference to make the series stationary (d = D =1).Then, we initially identified the parameters of the ARIMA model (p, q, P, and Q) to construct alternative models for each city according to the ACF and PACF plots of the stationary series (Figure S1, A1-A3, and B1-B3).We determined the optimal ARIMA model as ARIMA (1,1,1)(0,1,1) 12 for Xuzhou, and ARIMA (0,1,1)(0,1,1) 12 for Nantong and Wuxi, since they had the smallest normalized BIC, residual series were proved to be white noise, and parameters were all significant (P <0.05) (Table S1, C1-C3, and D1-D3 of Figure S1).PTB cases in 2018 were predicted by the optimal ARIMA model and listed in Table 1.

The RNN model
We compared the MAPE of each RNN model with different parameters on the testing set to identify the appropriate parameters.The RNN5 model had the smallest MAPE on the testing set in each city (Table 3).The number of PTB cases in the current  S2).Then, we constructed the RNN6-RNN9 models by incorporating significant meteorological factors into the RNN5 model.The detailed composition of the nine RNN models is listed in Table S3.We determined the optimal RNN model as RNN8 for Xuzhou, and RNN7 for Nantong and Wuxi, since they had the smallest MAPE on the testing set after three times training (Table 3).3).PTB cases in 2018 were predicted by the optimal RNN model and listed in Table 1.

Evaluating the performance of three models
As shown in

Discussion
In this study, we explored the role of meteorological factors in predicting PTB in three cities of China by constructing the ARIMA, ARIMAX, and RNN models.The prediction ability of the model has been improved by adding meteorological factors.
The ARIMAX model (ARIMA with meteorological factors) showed the best performance.To our knowledge, this is the first time-series study to construct different models in different cities to explore the role of meteorological factors in predicting PTB.
Although the notification rate of TB has declined with an annual rate of 3% between 2005 and 2017 [13], China still had about 866,000 new cases notified in 2018, second only to India [1].Accurately forecasting the future trend of TB epidemic can help policymakers to implement effective interventions and distribute healthcare resources appropriately.Previous studies have explored models, such as ARIMA [13,16], X12-ARIMA [16], and ARIMA-generalized regression neural network (GRNN), in predicting TB [13].However, few models considered seasonal variation characteristics, socioeconomic levels, and meteorological factors [15,[21][22][23].Therefore, we divided the study areas into three regions according to the geographical location and economic level and then compared the performance of different models with or without adding meteorological factors in predicting PTB in the Chinese population.
The ARIMA model, also known as the Box-Jenkins model [24], can analyze various types of time series data and is a commonly used model in time series analysis [2][3][4][5].
Different from the ARIMA model, which is a univariate time series model, the ARIMAX model can deal with multivariate time series data.It adds other variables related to the target series as input variables to improve the prediction accuracy.A time-series study in Guangzhou of China showed that the ARIMA model with the imported cases and the minimum temperature as input variables was superior to a single ARIMA model in forecasting dengue transmission [18].Another time-series study in Abidjan of Cote d'Ivoire also indicated that including rainfall as an input variable can increase the accuracy of the ARIMA model in predicting influenza [25].
However, when we incorporated two or more meteorological factors into the ARIMA model, its prediction performance did not continuously increase, which may be attributed to the high collinearity between these meteorological factors.
Considering that both ARIMA and ARIMAX were linear regression models, we applied the RNN model, which has a strong nonlinear fitting ability.It can recognize the relationship between variables without any restrictions and has a memory function.
This memory function makes the RNN model not only takes the current data as input but also applies its long-term experience as input [12].When constructing the RNN model, some parameters need to be determined artificially.Besides, since the initial weights and thresholds are random when training the RNN model, even for the same training set, the output of the model on the testing set will not be precisely the same.
Therefore, we trained each RNN model with different parameters and compositions three times and compared their performance on the testing set to determine the optimal RNN model.Finally, we found that the prediction performance of the RNN model was improved after incorporating meteorological factors.
The possible link between PTB and meteorological factors may be attributed to the following reasons.First, the temperature can affect indoor and outdoor activities of TB patients and susceptible people.For example, in hot summer and cold winter, people tend to stay indoors, which will increase the probability of mycobacterium tuberculosis (M.tb) transmission [26].Second, high wind speed can dilute the concentration of environmental M.tb, thereby reducing the risk of infection.Airflow is usually formed from high-pressure areas to low-pressure areas, so the correlation between PTB and atmospheric pressure may be related to wind speed, but further exploration is needed [27].Third, high relative humidity and abundant precipitation can provide an appropriate living environment for M.tb [27,28].Continuous exposure to dry air may decrease the production of protective mucus on the respiratory tract surface, thereby weakening its resistance to the pathogen [29].Fourth, the large amount of ultraviolet light provided by long-term sunshine not only restricts the growth of M.tb but also promotes the synthesis of vitamin D, which can protect people from TB to some extent [27].
The association between PTB and meteorological factors varied across regions [27], which may be partially attributed to socioeconomic differences or analytic methods.
TB is a poverty-related infectious disease [1].Different economic levels may lead to an uneven distribution of socioeconomic factors that affect the risk of TB, such as food and nutrition security, living conditions, community environment, and medical resources [23,30].The inconsistency between analytical methods may be due to their different requirements for the data.The Spearman rank correlation test has no special requirements for the distribution of variables and has a wide range of applications.
However, if there is a long-term trend in both time series, it will lead to a biased correlation.The cross-correlation analysis can evaluate the correlation between time series at different lag times without the influence of long-term trends.Besides, the exposure-response relationship between TB and meteorological factors might be nonlinear.For example, as mentioned earlier, TB may benefit from extremely high or extremely low temperatures and relative humidity.Both the Spearman rank correlation test and the cross-correlation analysis can perform linear correlation analyses between time series but have limitations in solving nonlinear relationships.
Our study has several limitations.First, the ARIMA, ARIMAX, and RNN models are all short-term prediction models; continuous data collection to update models is

Conclusions
The prediction performance of both ARIMA and RNN model was improved after incorporating meteorological factors, and the ARIMAX model (ARIMA with meteorological factors) had the best performance, indicating a potential link between PTB and meteorological factors.Taking meteorological factors into consideration may increase the accuracy of time series models in predicting the trend of PTB.

Supplementary Tables
month in Xuzhou has positively correlated with MAS one month ago (P <0.01), MAS two months ago (P <0.01) and MAS three months ago (P <0.01), and negatively correlated with MST two months ago (P <0.01), MAT three months ago (P <0.01), MP three months ago (P <0.05) and MST three months ago (P <0.01).The number of PTB cases in the current month in Nantong has negatively correlated with MAS one month ago (P <0.05), MAH one month ago (P <0.05), MAS two months ago (P <0.01), MAH two months ago (P <0.01), MAS three months ago (P <0.01) and MAH three months ago (P <0.05).The number of PTB cases in the current month in Wuxi has positively correlated with MAT one month ago (P <0.01), MAS one month ago (P <0.01), MST one month ago (P <0.05), MAS two months ago (P <0.01) and MAS three months ago (P <0.01), and negatively correlated with MAP one month ago (P <0.01), MAH one month ago (P <0.05), MAT three months ago (P <0.05) and MAH three months ago (P <0.05) (Table

Figure
FigureS3showed the epoch-error plots of optimal RNN models after three times of essential for maintaining their prediction performance.Second, we incorporated all combinations of significant meteorological factors into the ARIMA model to construct the ARIMAX model, but we only incorporated four combinations of meteorological factors into the RNN model.Although the performance of the RNN model in this study is inferior to the ARIMAX model, its prediction performance needs further exploration.Third, we just qualitatively evaluated the linear correlation between PTB and meteorological factors based on monthly data.Considering that this relationship may be nonlinear and has lag time, we intend to apply the distributed lag nonlinear model to quantitatively evaluate it based on weekly or daily data in the future study.

Figure 2 .
Figure 2. The monthly pulmonary tuberculosis cases in three cities between 2005

Figure 3 .
Figure 3.The cross-correlation function plots of residual series of pulmonary

α
: the MAPE of the model on the testing set after the first training.β : the MAPE of the model on the testing set after the second training.γ : the MAPE of the model on the testing set after the third training.

Figure 2 .
Figure 2. The monthly pulmonary tuberculosis cases in three cities between 2005 and 2017.

Figure 3 .
Figure 3.The cross-correlation function plots of residual series of pulmonary tuberculosis and

α:
the number of PTB cases in the current month in Xuzhou.β: the number of PTB cases in the current month in Nantong.γ : the number of PTB cases in the current month in Wuxi.* : P <0.05.# : P <0.01.

Figure S2 .
Figure S2.The time series plots of the six meteorological factors in the three cities between 2005

Figure S3 .
Figure S3.The epoch-error plots of the optimal RNN models of the three cities after three times of

Figures Figure 1
Figures

Figure 2 The
Figure 2

Table 4 ,
the ARIMAX model is slightly superior to ARIMA and RNN in Xuzhou, the ARIMAX model is significantly superior to ARIMA and RNN in Nantong, and the ARIMAX model is slightly superior to ARIMA and significantly superior to RNN in Wuxi.Generally speaking, the ARIMAX model showed the best performance.

Table 1 .
The monthly number of pulmonary tuberculosis cases in the three cities in 2018 predicted by the ARIMA, ARIMAX, and RNN models.

Table 2 .
The alternative ARIMAX models of three cities.
:The MAPE of the model in predicting the monthly number of PTB cases in 2018.

Table 3 .
The alternative RNN models of three cities.

Table 4 .
Evaluating the performance of ARIMA, ARIMAX, and RNN models in predicting the monthly number of pulmonary tuberculosis cases in three cities in 2018.
MAPE: mean absolute percentage error; RMSE: root mean square error.

Table S1 .
The alternative ARIMA models of the three cities.

Table S2 .
The spearman rank correlation coefficients between the monthly number of PTB cases and meteorological factors in the three cities.