Journal of Climate Research

Application of Machine Learning Algorithms to Daily Precipitation Forecasting Using Surface and Upper-Air Observational Data, Case Study: The City of Mashhad

Article type: Research article

Authors
M.Sc., Department of Industrial Engineering, Faculty of Engineering, Ferdowsi University of Mashhad, Razavi Khorasan
10.22034/jcr.2025.541424.1710
Abstract
This study examined the performance of machine learning models in forecasting one-day-ahead precipitation at the Mashhad station using daily data for the period 2000 to 2023. Three groups (scenarios) of data were considered: (1) surface data, (2) upper-air data, and (3) the combination of both datasets. Before modeling, variables with high collinearity were identified with the Variance Inflation Factor (VIF) and removed to reduce redundancy among the input features. In the combined scenario, after the data were merged, Sequential Forward Floating Selection (SFFS) was used to select six effective predictors: four upper-air variables (v700, Rel_Hum300, Spe_Hum500, and v500) and two surface variables (nm and Umax). The model evaluation identified CatBoost as the best algorithm in every scenario studied. In the first scenario (surface data), CatBoost outperformed the other algorithms with an MAE of 0.841, an RMSE of 2.309, an R² of 0.171, and an Adjusted R² of 0.168. In the second scenario (upper-air data), CatBoost again led, with an MAE of 0.772, an RMSE of 2.294, an R² of 0.182, and an Adjusted R² of 0.175. Finally, in the third scenario (combined surface and upper-air data), CatBoost with the six selected features achieved the best results: R² = 0.190, Adjusted R² = 0.188, and RMSE = 2.283. The findings show that combining surface and upper-air data with well-chosen features not only increases forecast accuracy but also strikes a suitable balance between model complexity and predictive power. In this scenario, CatBoost predicted light and moderate rainfall with low error and followed the overall pattern of rainfall fluctuations well.
Keywords

Article Title (English)

Using Machine Learning for Daily Precipitation Forecasting Using Surface and Upper Air Data: A Case Study in Mashhad

Authors (English)

Amirhossein Babaeian
Mahdi Rostamzadeh
Mostafa Fazeli
Department of Industrial Engineering, Ferdowsi University of Mashhad
Abstract (English)

Introduction

Accurate 1-day (next-day) rainfall forecasts underpin water resources operations, smart agriculture, and early warning of hydro-meteorological hazards. Yet short-lead prediction of daily precipitation remains difficult because rainfall emerges from multiscale, nonlinear interactions that are only partly captured by single-source datasets. Machine learning (ML) can learn such relationships directly from data, but the relative value of surface observations, upper air information, and their combination has not been systematically assessed for Mashhad, Iran. This study addresses that gap by benchmarking several ensemble ML algorithms across three data scenarios and by applying feature selection to balance predictive skill and model simplicity.

Data and Study Area

We used two data sources for the Mashhad synoptic station during 2000–2023: (i) surface observations (maximum, minimum, and mean temperature; maximum, minimum, and mean relative humidity; wind speed; mean sea level pressure; sunshine hours; and daily rainfall), and (ii) ERA5 upper air reanalysis at pressure levels of 700, 500, and 300 hPa, including geopotential height, temperature, specific humidity, relative humidity, horizontal wind components (u, v) and vorticity. All predictors were used with a one-day lag to forecast next-day precipitation. The dataset was split into training (2000–2017) and testing (2018–2023) periods to enable out-of-sample evaluation.
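The lag and split described above can be sketched as follows. This is a minimal illustration on random stand-in data, not the authors' code. The column names (`tmax`, `umax`, `rain`) and both helper functions are hypothetical, and the sketch assumes a complete daily record with no missing days.

```python
import numpy as np
import pandas as pd

def make_next_day_dataset(daily: pd.DataFrame, target: str = "rain"):
    """Pair each day's rainfall with the predictors observed one day earlier."""
    X = daily.shift(1).add_suffix("_lag1")   # predictors from day t-1
    y = daily[target]                        # rainfall on day t
    keep = X.notna().all(axis=1)             # the first day has no lagged predictors
    return X[keep], y[keep]

def chronological_split(X, y, train_end="2017-12-31"):
    """Train on target dates up to train_end, test on everything after."""
    train = X.index <= pd.Timestamp(train_end)
    return X[train], y[train], X[~train], y[~train]

# Random stand-in for the station record (assumption: not real observations).
rng = np.random.default_rng(0)
dates = pd.date_range("2000-01-01", "2023-12-31", freq="D")
daily = pd.DataFrame(
    {
        "tmax": rng.normal(22, 9, len(dates)),
        "umax": rng.uniform(20, 100, len(dates)),
        "rain": rng.gamma(0.1, 5.0, len(dates)),
    },
    index=dates,
)

X, y = make_next_day_dataset(daily)
X_train, y_train, X_test, y_test = chronological_split(X, y)
print(len(X_train), len(X_test))
```

Indexing every row by its target date keeps the split boundary clean: no test-period rainfall value ends up as a training target.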

Methodology

We designed three scenarios: S (surface only), U (upper air only), and S&U (combined). In scenarios S and U, each dataset was provided independently to five ensemble learning algorithms: Random Forest, AdaBoost, XGBoost, CatBoost, and LightGBM. Before model fitting, the Variance Inflation Factor (VIF) was computed to diagnose multicollinearity among predictors, and variables whose VIF exceeded the acceptable threshold were excluded to reduce redundancy among the inputs and improve model stability. In the combined scenario, surface and upper air variables were merged into a unified feature matrix. To curb dimensionality, remove redundancy, and avoid overfitting, we applied Sequential Forward Floating Selection (SFFS), using five-fold cross-validated R² as the selection criterion. The six features retained by SFFS were v700 and v500 (meridional wind at 700/500 hPa), Spe_Hum500 (specific humidity at 500 hPa), Rel_Hum300 (relative humidity at 300 hPa), and two surface indicators (Umax and nm, representing near-surface wind and sunshine hours). Models were evaluated on the test set using Mean Absolute Error (MAE), Root Mean Square Error (RMSE), R², and Adjusted R².
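The two screening steps, VIF filtering followed by SFFS, can be sketched as below. This is an illustration under stated assumptions rather than the study's implementation. The VIF cutoff of 10 is a common rule of thumb (the text does not state the threshold used), `LinearRegression` stands in for whichever estimator drove the selection, and the data and helper names are made up. The VIF of each column is read from the diagonal of the inverse correlation matrix.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def vif(X):
    """VIF per column, read off the diagonal of the inverse correlation matrix."""
    return np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))

def drop_high_vif(X, names, threshold=10.0):
    """Drop the worst column one at a time until every VIF is <= threshold."""
    X, names = X.copy(), list(names)
    while X.shape[1] > 1:
        scores = vif(X)
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names

def cv_r2(X, y, cols, estimator, cv=5):
    """Mean cross-validated R^2 of the estimator on the given columns."""
    return cross_val_score(estimator, X[:, cols], y, cv=cv, scoring="r2").mean()

def sffs(X, y, k, estimator, cv=5):
    """Sequential Forward Floating Selection of k columns by CV R^2."""
    selected, best = [], {}   # best[size] = best score reached at that size
    while len(selected) < k:
        # Forward step: add the single feature that raises the score most.
        rest = [j for j in range(X.shape[1]) if j not in selected]
        score, add = max((cv_r2(X, y, selected + [j], estimator, cv), j) for j in rest)
        selected.append(add)
        best[len(selected)] = max(score, best.get(len(selected), -np.inf))
        # Floating step: drop a feature while that beats the best smaller subset.
        while len(selected) > 2:
            score, drop = max(
                (cv_r2(X, y, [c for c in selected if c != j], estimator, cv), j)
                for j in selected
            )
            if score <= best[len(selected) - 1]:
                break
            selected.remove(drop)
            best[len(selected)] = score
    return selected

# Made-up data: x6 is a near copy of x0; only x0, x2, and x5 drive the target.
rng = np.random.default_rng(0)
n = 400
base = rng.normal(size=(n, 6))
dup = base[:, [0]] + 0.01 * rng.normal(size=(n, 1))
X = np.hstack([base, dup])
names = [f"x{i}" for i in range(7)]
y = 3 * base[:, 0] + 2 * base[:, 2] - 2 * base[:, 5] + rng.normal(scale=0.5, size=n)

X_kept, kept_names = drop_high_vif(X, names)
picked = sffs(X_kept, y, k=3, estimator=LinearRegression())
chosen = [kept_names[j] for j in picked]
print(kept_names, chosen)
```

The floating step is what separates SFFS from plain forward selection: a feature added early can be dropped later if a smaller subset without it scores higher than any subset of that size seen so far.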

Results and Discussion

In the surface-only scenario (S), CatBoost performed best on the test data in terms of explained variance and squared error, with an R² of 0.171, an Adjusted R² of 0.168, and the lowest RMSE (2.309). AdaBoost, however, achieved the lowest MAE (0.767), making it the best model by mean absolute error even though its R² and Adjusted R² were lower, and its RMSE higher, than CatBoost's. In the upper-air-only scenario (U), CatBoost was again the top-performing model, with an R² of 0.182, an Adjusted R² of 0.175, and the lowest RMSE (2.294). Notably, all algorithms performed better on every metric in scenario U than in scenario S. These results highlight the importance of upper air dynamics for reducing error and enhancing explanatory power. In the combined scenario (S&U), which integrates surface and upper-air data, CatBoost achieved the highest R² (0.190) and Adjusted R² (0.188) and the lowest RMSE (2.283) on the test data. It also had the second-lowest MAE (0.795), just behind AdaBoost, making it the best-performing model overall. This suggests that, among the configurations tested, combining surface and upper-air data with CatBoost provides the best balance of accuracy and model simplicity for predicting next-day rainfall.
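The four scores reported above can be computed as in the sketch below, which also uses a made-up toy series to show how MAE and RMSE can rank two forecasts differently. It is an illustration only and makes no claim about why the models in this study ranked as they did. The Adjusted R² check assumes the test set holds one sample per calendar day of 2018 to 2023 (2191 days) and six predictors; the sample count is an assumption, not a figure from the text.

```python
import numpy as np

def mae(y, pred):
    return float(np.mean(np.abs(y - pred)))

def rmse(y, pred):
    return float(np.sqrt(np.mean((y - pred) ** 2)))

def r2(y, pred):
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def adjusted_r2(r2_value, n_samples, n_features):
    """Penalize R^2 for the number of predictors used."""
    return 1.0 - (1.0 - r2_value) * (n_samples - 1) / (n_samples - n_features - 1)

# Toy series (made up): nine dry days and one 20 mm event.
y = np.array([0.0] * 9 + [20.0])
always_dry = np.zeros(10)                  # exact on dry days, misses the event
smoothed = np.array([1.5] * 9 + [12.0])    # slightly wet every day, closer on the event

for name, pred in [("always_dry", always_dry), ("smoothed", smoothed)]:
    print(name, round(mae(y, pred), 3), round(rmse(y, pred), 3), round(r2(y, pred), 3))

# Assumed sample size: 2191 calendar days in 2018-2023, with 6 selected features.
adj = adjusted_r2(0.190, n_samples=2191, n_features=6)
print(round(adj, 3))   # about 0.188 under these assumptions
```

In the toy series the always-dry forecast has the lower MAE while the smoothed forecast has the lower RMSE, because RMSE penalizes the single large miss much more heavily.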

Time series comparisons demonstrate that the selected CatBoost model accurately reproduces the sequence and magnitude of many light to moderate rainfall days, closely tracking day-to-day fluctuations. However, like most data-driven approaches trained on imbalanced samples dominated by zero or low rainfall, the model tends to underestimate peaks during very heavy events. Two factors likely contribute to this: (i) sample imbalance, which gives the rare extremes little weight in the training loss, and (ii) the one-day lag design, which limits access to multi-day precursors (e.g., moisture build-up and synoptic persistence). Despite these limitations, the combined data approach delivers stable performance with a favorable accuracy–complexity trade-off and demonstrates the utility of integrating thermodynamic and dynamic information from different atmospheric layers.
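To make limitation (ii) concrete, the sketch below builds the kind of multi-day predictors that a one-day-lag design cannot see. It is a hypothetical illustration: the feature names, the window lengths, and the 0.1 mm wet-day threshold are assumptions, not choices taken from the study.

```python
import pandas as pd

def add_temporal_context(rain: pd.Series, wet_threshold: float = 0.1) -> pd.DataFrame:
    """Build multi-day rainfall features using only days before the forecast day."""
    past = rain.shift(1)                      # yesterday's rainfall
    feats = pd.DataFrame(index=rain.index)
    feats["rain_lag1"] = past
    feats["rain_lag2"] = rain.shift(2)
    feats["rain_lag3"] = rain.shift(3)
    feats["rain_sum_3d"] = past.rolling(3).sum()     # total over the previous 3 days
    feats["rain_mean_7d"] = past.rolling(7).mean()   # mean over the previous 7 days
    # Dry-spell counter: consecutive dry days ending yesterday.
    # A missing value (the first row) is not counted as dry.
    dry = past < wet_threshold
    spell_id = (~dry).cumsum()                # a new group starts after each non-dry day
    feats["dry_spell"] = dry.astype(int).groupby(spell_id).cumsum()
    return feats

# Made-up eight-day rainfall series in mm.
rain = pd.Series(
    [0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 12.0, 0.0],
    index=pd.date_range("2023-01-01", periods=8, freq="D"),
)
feats = add_temporal_context(rain)
print(feats)
```

Every feature is shifted by at least one day, so nothing from the forecast day itself leaks into its own predictors.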

Conclusion and Implications

The experiments confirm three key takeaways. First, upper air reanalysis fields provide distinct, complementary information to surface observations for next-day rainfall forecasting in Mashhad. Second, fusing surface and upper air predictors and then pruning with SFFS yields a compact feature set that preserves—or even improves—skill while enhancing parsimony, as reflected in Adjusted R². Third, tree-based gradient boosting methods, particularly CatBoost in the combined scenario, offer a practical balance between performance and simplicity for operational use.

Future work should target heavy rainfall underestimation by (a) enriching temporal context (multi-day lags, moving averages, recent cumulative rainfall, dry/wet spell counters), (b) incorporating spatial context from neighboring stations and regional reanalysis tiles, (c) adopting two-stage pipelines (occurrence classification followed by conditional amount regression) to mitigate zero inflation, and (d) testing sequence models (e.g., LSTM/GRU) or hybrid ML–NWP ensembles. Such extensions could improve extreme-event fidelity without sacrificing interpretability or operational feasibility.
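The two-stage idea in item (c) can be sketched as follows. This is one possible arrangement under stated assumptions, not a method evaluated in the study. The `TwoStageRainModel` class is hypothetical, scikit-learn gradient boosting stands in for the boosting libraries named earlier, the 0.1 mm wet-day threshold is assumed, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

class TwoStageRainModel:
    """Stage 1 predicts whether it rains; stage 2 predicts how much, given rain."""

    def __init__(self, wet_threshold=0.1):
        self.wet_threshold = wet_threshold
        self.occurrence = GradientBoostingClassifier(random_state=0)
        self.amount = GradientBoostingRegressor(random_state=0)

    def fit(self, X, y):
        wet = y >= self.wet_threshold
        self.occurrence.fit(X, wet)            # trained on all days
        self.amount.fit(X[wet], y[wet])        # trained on wet days only
        return self

    def predict(self, X):
        p_wet = self.occurrence.predict_proba(X)[:, 1]           # P(wet day)
        amount = np.clip(self.amount.predict(X), 0.0, None)      # amount given wet
        return p_wet * amount                                    # expected rainfall

# Synthetic zero-inflated data (assumption: for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
wet_day = X[:, 0] + 0.3 * rng.normal(size=600) > 0.8
y = np.where(wet_day, np.exp(1.0 + 0.5 * X[:, 1]), 0.0)

model = TwoStageRainModel().fit(X[:450], y[:450])
pred = model.predict(X[450:])
print(pred[:5])
```

Because the amount regressor is fitted on wet days only, the many zero-rainfall days no longer dominate its loss, which is the zero-inflation problem the two-stage design is meant to ease.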

Keywords (English)

Daily precipitation forecast
Machine learning
Land surface data
Upper air data
Mashhad