AN APPROACH TO QUALITY ANALYSIS, GAP FILLING AND HOMOGENEITY OF MONTHLY RAINFALL SERIES

The aim of this work was to propose a method for checking the consistency of climatic series of monthly rainfall using a supervised and an unsupervised approach. The methodology was applied to rainfall series (1961-2010) from weather stations located in the State of Rio de Janeiro (RJ) and near its borders with the states of São Paulo, Minas Gerais and Espírito Santo. The data were submitted to quality analysis (physical and climatic limits and space-time tendency) and gap filling, based on simple linear regression analysis associated with the prediction band (p < 0.05 or 0.01), in addition to the Z-score (3, 4 or 5). Next, homogeneity analysis was applied to the continuous series, using the method of cumulative residuals. The coefficients of determination (r²) between the assessed series and the reference series were greater than 0.70 for gap filling, for both the supervised and unsupervised approaches. In the analysis of data homogeneity, the supervised and unsupervised approaches were effective in selecting homogeneous series: five of the nine final stations were homogeneous (p > 0.9). In the other series, the homogeneity break points were identified and the simple linear regression method was applied for their homogenization. The proposed method was effective in making the rainfall series consistent and allows the use of these data in climate studies.


INTRODUCTION
Precipitation is one of the most important variables in climate studies, with emphasis on climate variability and change, climatic classification and hydrological modeling, in addition to its importance in the planning, design and management of irrigation and drainage systems and water resources in general (ABREU et al., 2021; JAVARI, 2016). Its occurrence is directly related to water supply, energy generation, agricultural production, human health and wellbeing, and natural disasters (droughts, floods, inundation, landslides and mass displacement), among other activities (BRUBACHER et al., 2020).
Climatic studies require long-term, continuous and good-quality weather observations to better understand the characteristics of the study region. These observations often come from automatic or conventional surface weather or climatological stations. Among the challenges related to rainfall monitoring, the construction and maintenance of a consistent and homogeneous historical database stands out (AY, 2020; SANTOS et al., 2012). In Brazil, rainfall series are often restricted, discontinuous, of poor quality and not homogeneous in space-time (TOSTES et al., 2017).
Data consistency analysis has to be carried out so that results based on these series do not lead to contradictory and erroneous conclusions. It involves quality analysis, gap filling (when gaps exist) and testing of the homogeneity of the series (ANA, 2012; ANIMASHAUN et al., 2020). Gap filling is important because several methods are difficult or impossible to apply to non-continuous series, notably hydrological models, drought indices, and studies of climate variability, trends, occurrence probability and return time (LIMA et al., 2021; LYRA et al., 2006, 2017). In addition, this step can be restricted or have poor accuracy and precision due to physical limitations, such as those caused by complex terrain and/or proximity to the coastal environment (BRUBACHER et al., 2020). The analysis of data homogeneity after gap filling is rarely performed, which leaves uncertainties about the rainfall data series (CARVALHO; RUIZ, 2016).
In the temporal and spatial scale, the gap filling methods can overlap with each other in terms of accuracy, which reinforces the lack of consensus among authors on the appropriate methods for each situation. Thus, comparative studies are justified in order to define the best methods and strategies for each variable and set of conditions (BRUBACHER et al., 2020). Among the methods used to fill gaps in rainfall data, we can mention linear regression (KITE, 1988; PRECINOTO et al., 2012), regional weighting and regression weighting (OLIVEIRA et al., 2010). However, the regression method stands out for showing better performance in filling rainfall gaps (LO PRESTI et al., 2010; OLIVEIRA et al., 2010), being applied in several studies (BERTONI; TUCCI, 2013; CAMERA et al., 2014; DE OLIVEIRA-JÚNIOR et al., 2012; DI PIAZZA et al., 2011; MELLO; NEWMAN et al., 2015). Homogeneity tests allow identifying ruptures in the trend of the series and abrupt changes in the mean and variance of the distribution (SANTOS et al., 2016). Among the methods, double mass, regional vector and cumulative residuals stand out (ALLEN et al., 1998; KITE, 1988).
Further, the preliminary analysis of hydroclimatic data is extremely time-consuming and demands attention from the observer at each stage. The advent of software and programming languages makes it possible to create algorithms that facilitate activities such as quality analysis, gap filling and homogeneity testing. Therefore, studies that assess supervised and unsupervised approaches to rainfall data are important. Among the methodologies for gap filling and homogeneity of hydrological series, linear regression and cumulative residuals, respectively, are methods that facilitate an unsupervised approach, in addition to being efficient in their respective objectives (ALMEIDA et al., 2021; DE OLIVEIRA-JÚNIOR et al., 2012; OLIVEIRA et al., 2010).
The aim of this work was to i) propose a method for analyzing the quality of rainfall climatic series in a supervised or unsupervised approach and ii) evaluate the consistency (data quality, gap filling and homogeneity) of historical series of monthly rainfall, in the period from 1961 to 2010, for the State of Rio de Janeiro (RJ) and locations in the States of São Paulo, Minas Gerais and Espírito Santo near the border with the State of Rio de Janeiro, southeastern Brazil.

Study area and precipitation series
The study region is the state of Rio de Janeiro, southeastern Brazil (between latitudes 20°45'54" and 23°21'57" S and longitudes 40°57'59" and 44°53'18" W, with altitudes between 0 and 2792 m) (Figure 1). The monthly precipitation series were obtained from the Meteorological Database for Teaching and Research (BDMEP), maintained by the National Institute of Meteorology (INMET) (Table 1). The series were initially separated into two groups: one containing the principal weather stations and a second group denominated secondary stations. The principal stations were selected based on: i) location, at a distance of less than 100 km from the border of the state of Rio de Janeiro or within the state, ii) observation period greater than thirty-five years and iii) periods of data interruption of less than five consecutive years (10% of the size of the series) (ANA, 2012).
The purpose of the secondary stations was to assist in filling in the months without data in coincident periods and in identifying possible errors. For that, they had to meet the following criteria (ANA, 2012): i) distance between stations of less than 100 km, preferably 50 km, ii) data period longer than twenty years and iii) information on periods absent from the principal stations. Seventeen INMET weather stations were selected, of which only ten met the criteria for principal stations; eighteen ANA stations were selected as secondary stations.
The reference series was built using the arithmetic mean of the rainfall of at least three and at most five other stations (principal and secondary) located around each principal station. Stations at a distance of less than 50 kilometers were chosen first, up to a maximum limit of 100 kilometers (Figure 1), with a coefficient of determination (r²) greater than 0.50 between the series to be gap-filled and each of the stations within the search radius. The averages of rainfall observations from the selected stations were considered representative of the climate trend of the study area.
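The selection rule for the reference series can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of `None` to mark a missing month, and the choice to rank eligible neighbours simply by distance (nearest first) are assumptions made for illustration.

```python
def r_squared(x, y):
    """Coefficient of determination between two series, using only the
    months in which both values are present (None marks a gap)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        return 0.0
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy ** 2 / (sxx * syy)


def build_reference(target, neighbours, min_r2=0.50, max_km=100.0,
                    min_stations=3, max_stations=5):
    """neighbours is a list of (distance_km, series) tuples.

    Returns the month-by-month arithmetic mean of the selected stations,
    or None when fewer than min_stations satisfy the criteria."""
    eligible = [(d, s) for d, s in neighbours
                if d <= max_km and r_squared(target, s) > min_r2]
    eligible.sort(key=lambda item: item[0])  # assumption: nearest first
    chosen = [s for _, s in eligible[:max_stations]]
    if len(chosen) < min_stations:
        return None
    reference = []
    for month in zip(*chosen):
        values = [v for v in month if v is not None]
        reference.append(sum(values) / len(values) if values else None)
    return reference
```

Sorting by distance means that stations within 50 km are used before those between 50 and 100 km, which approximates the preference described in the text.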

Quality analysis -supervised and unsupervised method
After the establishment of reference series, for the analysis of data quality of the principal series, two approaches were proposed: a supervised and an unsupervised. The supervised approach was carried out on a step-by-step basis, with criteria to support the observer's decision-making on data quality, while the unsupervised approach was carried out using an algorithm developed in Excel ® software that performed the quality analysis and gap filling without the interference of the observer.
The objective of the unsupervised approach is to verify the effectiveness of this algorithm as the preliminary analysis of climate data is extremely time-consuming and slow. However, it is an effective step towards obtaining quality and continuous climate data series.
For both data quality approaches (supervised and unsupervised), the physical limits of monthly rainfall were first identified, that is, the range of values that could physically occur for the evaluated phenomenon, taken here as monthly rainfall between 0 and 1000 mm (BRITO et al., 2017). Values beyond this range were considered physically inconsistent and removed from the series. The climatic limits were tested using the Z-score (Equation 1), which gives the relative position of an event, that is, how many standard deviations it lies from the mean, and thus the probability of occurrence of an event of such magnitude.
The unsupervised approach sought to identify spurious values in the series using limits of three, four and five times the Z-value (Equation 1), combined with a simple linear regression prediction band with p-values of 0.05 and 0.01. This procedure was performed for each of the principal stations. Data were considered spurious when identified simultaneously by both methods (greater than the Z limit and beyond the prediction band); data identified by only one of the steps were denominated suspicious.

$$Z = \frac{x - \bar{x}}{sd} \quad (1)$$

Where: x represents the observed value; $\bar{x}$ represents the mean of the observed values and sd represents the standard deviation of the series.
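As an illustration of the unsupervised screening, the sketch below combines the physical limits, the Z-score limit and the regression prediction band. It is a simplified reading of the procedure, not the Excel algorithm used in the study. Several details are assumptions: the function name `classify`, the label strings, a reference series without gaps, the Z-score computed over the whole series, and a normal quantile used as a large-sample approximation to the Student's t quantile of the prediction band.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def classify(station, reference, z_limit=3.0, p=0.05,
             phys_min=0.0, phys_max=1000.0):
    """Label each monthly value as 'gap', 'physical', 'spurious',
    'suspicious' or 'ok'. None marks a missing month in station."""
    labels = []
    for y in station:
        if y is None:
            labels.append("gap")
        elif y < phys_min or y > phys_max:
            labels.append("physical")  # outside the physical limits
        else:
            labels.append("ok")

    idx = [i for i, lab in enumerate(labels) if lab == "ok"]
    ys = [station[i] for i in idx]
    xs = [reference[i] for i in idx]
    n = len(idx)
    y_bar, x_bar, sd = mean(ys), mean(xs), stdev(ys)

    # Simple linear regression of the station (Y) on the reference (X).
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    s_res = sqrt(sum((y - b0 - b1 * x) ** 2
                     for x, y in zip(xs, ys)) / (n - 2))
    # Assumption: normal quantile instead of Student's t (large n).
    q = NormalDist().inv_cdf(1 - p / 2)

    for i, x, y in zip(idx, xs, ys):
        beyond_z = abs(y - y_bar) / sd > z_limit
        half_width = q * s_res * sqrt(1 + 1 / n + (x - x_bar) ** 2 / sxx)
        beyond_band = abs(y - (b0 + b1 * x)) > half_width
        if beyond_z and beyond_band:
            labels[i] = "spurious"
        elif beyond_z or beyond_band:
            labels[i] = "suspicious"
    return labels
```

In the supervised approach, values labelled here as suspicious or spurious would be reviewed by the observer rather than removed automatically.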

Gap filling
Gap filling was performed with the simple linear regression (SLR) method. For this, the station series and the reference series had to meet two criteria: the slope coefficient (β1) of the simple linear regression (Y = β0 + β1X) had to be statistically significant (β1 ≠ 0) and within the range 0.7 < β1 < 1.3, and the coefficient of determination (r²) had to be greater than 0.70 (ALLEN et al., 1998; KITE, 1988; OLIVEIRA et al., 2010).
The SLR method consists of estimating the missing data in a series from the linear relationship between the series to be filled and the series of another station without gaps (reference series). In the linear regression (Yi = β0 + β1Xi), the estimated variable (Y) is the monthly precipitation series of the station under analysis, the predictor variable (X) is the reference series of that station (without gaps), β0 is the intercept (mm), β1 is the slope coefficient and the subscript i denotes the i-th observation (DOS SANTOS et al., 2018).
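The gap-filling step can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the reference series is taken to be complete, the significance test of the slope is omitted, and negative estimates are clipped to zero (a guard added here, not described in the text).

```python
def fit_slr(x, y):
    """Least-squares fit of y = b0 + b1 * x using the months in which
    both series have data. Returns (b0, b1, r2)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1, sxy ** 2 / (sxx * syy)


def fill_gaps(station, reference, slope_range=(0.7, 1.3), min_r2=0.70):
    """Replace None values in station by the regression estimate.

    Raises ValueError when the slope or r2 criteria are not met."""
    b0, b1, r2 = fit_slr(reference, station)
    low, high = slope_range
    if not (low < b1 < high) or r2 <= min_r2:
        raise ValueError(f"criteria not met: slope={b1:.2f}, r2={r2:.2f}")
    # Assumption: rainfall estimates are clipped at zero.
    return [max(0.0, b0 + b1 * x) if y is None else y
            for x, y in zip(reference, station)]
```

Observed values are never altered; only months marked as gaps receive the regression estimate.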

Homogeneity test
To analyze data homogeneity, the method of cumulative residuals was applied between the reference series (constructed from the arithmetic mean of the neighbouring stations) and the station series after gap filling with the SLR method. The method consists of plotting the cumulative residuals of the SLR and, on the same graph, an ellipse defined by the coefficients α, β and θ, where α is the total number of data divided by two, β is obtained through Eq. 2 and θ takes eight equally spaced values from 0° to 360°.

$$\beta = \frac{n}{\sqrt{n-1}}\, z_p\, S_y \quad (2)$$

Where: n represents the total number of data, $z_p$ is the standard normal variate for the significance level p used (90%) and $S_y$ is given by Eq. 3:

$$S_y = \sigma_y \sqrt{1 - r^2} \quad (3)$$

Where: $\sigma_y$ is the standard deviation of the constructed series and r is the Pearson correlation coefficient.
When the cumulative residuals fall beyond the range defined by the ellipse, the series is considered non-homogeneous (KITE, 1988; ALLEN et al., 1998). Values outside this range mark the rupture in the data series, which can also be identified graphically. A non-homogeneous series is divided into two subsets: before and after the rupture point.
The homogeneity is corrected using a correction factor (Δ), calculated as the difference between the regression estimates generated from the two subsets of the non-homogeneous series (before and after the rupture point). The value Δ is added to or subtracted from the original non-homogeneous series from the rupture point onward. The procedure is applied repeatedly, checking the homogeneity of the corrected series each time, until the data are homogeneous (ALLEN et al., 1998).
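The ellipse test described in this section can be sketched in Python as below. The sketch assumes the formulation of the cumulative residuals method given in Allen et al. (1998), with semi-axes α = n/2 and β = n/√(n-1) · z_p · σ_y√(1 - r²), and it takes z_p as the one-sided standard normal quantile for p = 0.90. Both points, along with the function name and the small numerical tolerance, are assumptions and should be checked against the cited references before use.

```python
from math import sqrt
from statistics import NormalDist, mean


def cumulative_residuals_test(reference, station, p=0.90):
    """Return the 1-based positions whose cumulative residual lies
    outside the ellipse; an empty list suggests a homogeneous series.
    Both series must be complete (gap-filled) and of equal length."""
    n = len(station)
    mx, my = mean(reference), mean(station)
    sxx = sum((x - mx) ** 2 for x in reference)
    syy = sum((y - my) ** 2 for y in station)
    sxy = sum((x - mx) * (y - my) for x, y in zip(reference, station))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    r2 = sxy ** 2 / (sxx * syy)

    sigma_y = sqrt(syy / (n - 1))
    s_y = sigma_y * sqrt(max(0.0, 1.0 - r2))     # residual std. deviation
    alpha = n / 2                                 # horizontal semi-axis
    z_p = NormalDist().inv_cdf(p)                 # assumption: one-sided
    beta = n / sqrt(n - 1) * z_p * s_y            # vertical semi-axis

    outside, cumulative = [], 0.0
    for i, (x, y) in enumerate(zip(reference, station), start=1):
        cumulative += y - (b0 + b1 * x)
        # Height of the ellipse, centred at (n/2, 0), at position i.
        bound = beta * sqrt(max(0.0, 1.0 - ((i - alpha) / alpha) ** 2))
        if abs(cumulative) > bound + 1e-9:        # tolerance for rounding
            outside.append(i)
    return outside
```

The positions returned by the function indicate where the cumulative residual curve leaves the ellipse, which helps to locate the rupture point used to split the series into two subsets.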

Analysis of the performance of the supervised and unsupervised methods
The analysis of the performance of the supervised and unsupervised methods was verified based on the ability to identify monthly rainfall series with gaps and inhomogeneities and correct them by means of gap filling based on linear regression and homogeneity through the method of cumulative residuals.

RESULTS AND DISCUSSION
All rainfall time series used in this work had some kind of gap in the records (Table 2). Seventeen INMET stations with records spanning the analyzed period were initially selected.
Seven stations were excluded in the initial stage of selection due to a low coefficient of determination, a distance greater than 100 km from at least three other weather stations and from the geopolitical border of the state of Rio de Janeiro, or a percentage of gaps greater than two-thirds of the series (Table 3).
For these stations, a large number of gaps were observed, particularly in winter (June, July, August) and summer (December, January, February) and from 1984 to 1991.
Regarding the angular coefficient, all ten preselected stations had values within the range determined before and after the analysis. Before the quality analysis, the Taubaté station had the lowest slope (β 1 = 0.71) and the highest (β 1 = 1.21) was found in Coronel Pacheco station.
During the quality analysis, spurious values were identified by each method. The supervised and unsupervised methods of data consistency analysis differed in the amount of data flagged (Table 4).
Regardless of the method (supervised or unsupervised), the maximum amount of data removed from the raw historical series was 5%. With the unsupervised approach, the stricter the parameters (Z = 3 and p < 0.05), the more data were identified as suspicious and spurious and therefore removed from the series. The particularity of the supervised method in relation to the unsupervised method was that some series presented mismatched or suspicious data according to the prediction bands and Z-scores, but these data could be considered representative of the location where they were collected, based on the observer's assessment, and thus remained in the series. The unsupervised method was often not effective in identifying some values, such as a rainfall value of zero in months of high rainfall.

SANTOS, J. C. et al.
Vitória  18  20  24  28  16  18  18  20  20  18  22  18
Coronel Pacheco  20  20  20  20  24  22  22  24  24  24  18  22
Viçosa  22  18  18  18  18  18  20  20  18  16  18  20
Barbacena  10  14  12  10  10  10
Source: Elaborated by the authors
Without the interference of the observer, some local peculiarities tend to be ignored, which leads to the exclusion of data that could be representative, particularly for studies of extreme events. The loss of this type of information is relevant, especially in countries like Brazil where the number of rainfall stations per unit area tends to be small (ABREU et al., 2021). For this work, after removing the suspicious observations, the r² for the stations increased and ranged from 0.63 (Campos do Jordão) to 0.87 (Barbacena), but only nine of the ten stations met the criterion of r² ≥ 0.7 (Table 5). The r² values found in these nine stations express satisfactory data precision; the exception was the Campos do Jordão station, which presented values below expectations for the selection performed in an unsupervised way. The stations of Barbacena and Viçosa were the only ones in which the r² values were greater in the unsupervised mode than in the supervised (manual) mode. The station with the greatest discrepancy between the quality assessment methods was the Taubaté station (Figure 2). Values beyond the prediction range were individually evaluated and represent months with rainfall above or below the monthly climatological average but, because of their climatic importance, they were not removed from the series.
After this process of quality analysis and gap filling, the series were subjected to a homogeneity analysis using the accumulated residue method. Because in 77.78% of the assessed stations the highest r² value identified was given by the supervised method, it was decided to evaluate the homogeneity and trend rupture only in the series that met the criteria of the supervised method.
Of the nine selected weather stations, only four (Campos dos Goytacazes, Resende, Taubaté and Viçosa) (Figure 3) showed a rupture in the trend of the series, which represented non-homogeneous data. The other series were homogeneous at a significance level of 90%. It was possible to observe through the graphs the rupture stations (BRITO et al., 2017). For the other regions, the data set ranged from 5 to 10% of failures.
In general, the unsupervised method was efficient in the selection of series, whose climatic behavior in the region does not show large variations throughout the year and where there is little availability of nearby weather stations (in the limit of less than 100 km), as there are several factors that may influence this result. The supervised method tends to be more rigorous, as data filtering is done individually and weighted by the observer.
When working with extreme rainfall, the unsupervised method is not recommended because values that exceed the Z-score limit and fall outside the prediction range are automatically removed from the series, and these values often represent extreme events in a particular region. Begert et al. (2005) identified an increase in the inhomogeneity of rainfall series in the nineteenth century, a change that the authors attributed to adjustments in seasonal data, global warming effects and the introduction of automatic measurement equipment.
The series used in this study, between 1961 and 2010, comprise a period in which increasing trends in monthly rainfall were detected in the northern region of the state of Rio de Janeiro (SALVIANO et al., 2016), while annual rainfall showed a decreasing tendency in regions such as the Baixada Fluminense and Sul Fluminense between 1951 and 2001 (DERECZYNSKI et al., 2013). This fact may have contributed to the number of series classified as non-homogeneous and excluded, since trends in specific locations cause deviations in the cumulative residuals graphs in relation to the average of neighboring locations without a trend. In addition, rainfall in the state of Rio de Janeiro has high space-time variability, being influenced by weather systems and orography (ALVARENGA, 2012; CARDOSO; DIAS, 2004; RODRIGUES; WOLLINGS, 2017; SOARES; DIAS, 1986; SOBRAL et al., 2019), in addition to anomalies related to climate variability such as the El Niño-Southern Oscillation (ENSO), Atlantic Ocean surface temperature, the Pacific Decadal Oscillation and the South Atlantic Convergence Zone, among others (BARRETO, 2009; DA ROCHA et al., 2014; DE OLIVEIRA-JÚNIOR et al., 2018; GRIMM, 2003; MINUZZI et al., 2007; MOLION, 2003; PRADO et al., 2007; DERECZYNSKI, 2014; STRECK et al., 2009). This spatial variability also influences the behavior of individual series and may cause deviations in the cumulative residuals graph. Non-homogeneous and trending rainfall series should not be used for frequency analysis or modeling due to the possibility of biased inferences. Therefore, methods that help to select consistent series are extremely important.

Figure 3. Accumulated residue method for data homogeneity analysis for Campos dos Goytacazes station before and after applying the SLR method. a) before homogeneity b) after the homogeneity process
Source: Elaborated by the authors

CONCLUSIONS
• Based on the criteria established to obtain a satisfactory representation of the temporal series, of the 17 main stations evaluated, only nine stations meet the minimum criteria necessary to carry out all the analyses, which emphasizes the need for creation, maintenance and investment in a database of consistent climate data for the State.
• According to the statistical indices, the methods of quality analysis and gap filling applied to the evaluated rainfall data are adequately precise. The supervised method proves to be more efficient in identifying spurious values and more rigorous in the selection of suspicious data. However, this form of assessment requires the observer to have the knowledge needed to decide on the removal of data from the series. The unsupervised configuration that is most similar to the supervised method uses as parameters a Z-score of 3 and a significance level for the prediction band equal to 95%.
• Most stations are characterized as statistically homogeneous after the application of the methods. Homogeneous series can be used for climate studies and the method is effective in filling the gaps arising from the absence of climate data.

DECLARATION OF INTERESTS
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.