Researchers typically study air pollution dynamics using Air Quality Models (AQMs), which require a very large amount of time to develop and are specific to a relatively short period of time for a particular geographic region. Monitoring networks that record hourly, ground-level air quality and meteorological parameters produce voluminous amounts of data, and statistical methods applied to these data allow ozone pollution to be investigated over extended study periods. In this study, we use cluster analysis and sequence analysis to isolate the role of meteorology in ozone formation, and we explore the ability of stochastic Hidden Markov Models (HMMs) to predict ozone levels in Houston, TX. The adverse health effects of tropospheric ozone pose a substantial risk for many segments of the population. Short-term forecasts are therefore needed so that evasive action can be taken on days conducive to ozone formation.
Cluster analysis is an unsupervised statistical technique that identifies recurring patterns among a set of observations, while sequence analysis is an approach to data reduction based on sequence similarity, measured by a similarity index. Cluster analysis is best suited to determining the hourly meteorological states. For ozone, however, the regulated measure of air quality is the maximum daily ozone value. In addition, Houston has strong mesoscale influences, and many meteorological regimes may be present throughout the day. It is therefore necessary to look at the sequence of regimes to describe each day as a whole.
Thus, cluster analysis is performed at two time scales: at the hourly scale, to determine the prevailing meteorological regimes, and at the daily scale, together with sequence analysis, to reveal classes that determine the dependency of ozone on meteorology in Houston, TX. The first stage of clustering is carried out on the hourly surface wind observations with an agglomerative k-means algorithm. This algorithm uses a distance metric to measure the dissimilarity among observations and so obtain the surface wind patterns. The second stage of clustering is carried out on the ozone exceedance and non-exceedance days at the daily scale and uses a similarity index to measure the dissimilarity among the symbol sequences of observations.
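The first clustering stage can be illustrated with a minimal sketch. It assumes plain Lloyd's k-means with Euclidean distance, not the agglomerative variant named above, and it runs on synthetic (u, v) wind vectors rather than study data. The function name `kmeans`, the seed, and the two toy regimes are hypothetical.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means with Euclidean distance on (u, v) wind vectors."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct observations
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each observation joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_labels == labels and new_centroids == centroids:
            break  # assignments and centroids have stopped changing
        labels, centroids = new_labels, new_centroids
    return labels, centroids

# Synthetic hourly wind vectors in m/s, for illustration only.
rng = random.Random(1)
southerly = [(rng.gauss(0.0, 0.3), rng.gauss(3.0, 0.3)) for _ in range(50)]
northerly = [(rng.gauss(0.0, 0.3), rng.gauss(-3.0, 0.3)) for _ in range(50)]
labels, centroids = kmeans(southerly + northerly, k=2)
```

Each returned label plays the role of an hourly meteorological regime. In practice the number of clusters and the initialisation would need to be chosen and checked against the data.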
Our study focuses on the period 1 April through 31 October of the years 2004 and 2005. Measurements are available from two separate networks of ground-level monitoring stations operated by the Texas Commission on Environmental Quality (TCEQ). All data are reported hourly, though missing measurements are a common problem in the analysis of environmental measurements over such an extended observation period. The first network, the Continuous Air Monitoring Stations (CAMS), monitors ozone and NOx concentration levels. From these air quality data, the 8-hr daily maximum ozone levels can readily be calculated for the episodic days. A second, meteorological monitoring network records hourly, ground-level wind speed, wind direction, and temperature data. The north-south and east-west components of the hourly averaged surface winds are calculated for the surface stations. The k-means clustering algorithm is applied to these continuous wind field measurements to determine the meteorological regimes affecting regional air quality.
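The two derived quantities in this paragraph can be sketched as follows. The sketch assumes the usual meteorological convention that wind direction is the direction the wind blows from, in degrees clockwise from north. The 8-hr calculation is a simplification that uses only windows lying within one day. The treatment of missing hours (the `min_valid` threshold) and both function names are assumptions for illustration, not the study's procedure.

```python
import math

def wind_components(speed, direction_deg):
    """Return (u, v): the east-west and north-south wind components.

    Assumes direction is where the wind blows FROM, measured in degrees
    clockwise from north (meteorological convention).
    """
    theta = math.radians(direction_deg)
    u = -speed * math.sin(theta)  # positive u blows toward the east
    v = -speed * math.cos(theta)  # positive v blows toward the north
    return u, v

def daily_max_8hr(hourly_ozone, min_valid=6):
    """Maximum 8-hour running mean over one day's hourly ozone values.

    None marks a missing hour. A window with fewer than `min_valid` valid
    hours is skipped; this threshold is an assumption of the sketch.
    """
    best = None
    for start in range(len(hourly_ozone) - 7):
        window = [x for x in hourly_ozone[start:start + 8] if x is not None]
        if len(window) >= min_valid:
            mean = sum(window) / len(window)
            best = mean if best is None else max(best, mean)
    return best

# A 5 m/s wind from the south has almost no east-west component.
u, v = wind_components(5.0, 180.0)

# Toy day of hourly ozone values (ppb), for illustration only.
toy_day = [30.0] * 10 + [80.0] * 8 + [40.0] * 6
peak = daily_max_8hr(toy_day)
```

Working with (u, v) components rather than speed and direction avoids the discontinuity at 0/360 degrees, which matters when distances between wind observations are computed.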
To interpret the clusters obtained, we calculate the prevailing wind conditions for each cluster from the wind measurements at each monitoring station and plot them in geospatial coordinates. Frequency-of-occurrence plots, which show the number of times observations are assigned to a cluster in each hour of the day, are constructed for each cluster. Cluster-averaged time series of wind speed, wind direction, temperature, and ozone precursor compositions are analyzed for the episode days and the days preceding them to observe the variation in diurnal pattern at each site. To determine the dependence of ozone on meteorology, an index-based sequence analysis approach is used to find the recurring meteorological patterns among ozone exceedance and non-exceedance days. The predominant wind directions corresponding to the cluster labels obtained from the cluster analysis of hourly wind measurements are first analyzed by sequence analysis and then clustered.
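A minimal sketch of the two ideas in this paragraph follows. The exact similarity index used in the study is not specified here, so the sketch assumes a simple matching index, the fraction of hours on which two daily label sequences agree. The function names and the toy label sequences (N, S, E for predominant wind directions) are hypothetical.

```python
from collections import Counter

def hourly_frequency(day_sequences, cluster):
    """Count, for each hour of the day, how often `cluster` was assigned."""
    counts = Counter()
    for seq in day_sequences:
        for hour, label in enumerate(seq):
            if label == cluster:
                counts[hour] += 1
    n_hours = max(len(seq) for seq in day_sequences)
    return [counts[h] for h in range(n_hours)]

def similarity_index(seq_a, seq_b):
    """Fraction of hours on which two equal-length label sequences agree.

    This simple matching index is an assumption of the sketch, not
    necessarily the index used in the study.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)

# Toy days described by hourly wind-cluster labels, for illustration only.
days = ["NNSS", "NSSS", "NNNS"]
freq_n = hourly_frequency(days, "N")
sim = similarity_index("NNSSEE", "NNSSSE")
```

A dissimilarity for the second clustering stage could then be taken as one minus the similarity, again as an assumption of this sketch.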
Based on the states (meteorological regimes and ozone classes) identified by the cluster analysis and sequence analysis, an ozone forecasting model is developed using HMMs. HMMs capture both the deterministic transitions between states and the random factors involved. A set of HMMs is estimated by training on the data corresponding to each meteorological cluster. Once the HMMs are developed, a label is assigned to an hour in the future by choosing the class represented by the model with the maximum probability. HMMs are also developed for the clusters of wind indices that capture the distinct scenarios for ozone exceedances and non-exceedances. To predict the ozone class of a day, the hours of the day, labeled according to the wind-cluster HMMs, are input to the ozone HMMs, and the ozone class is the one with the maximum probability.
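The maximum-probability rule can be sketched for discrete HMMs as follows. The sketch assumes each class already has a trained model and scores an observed label sequence with the scaled forward algorithm. Training (for example by Baum-Welch) is not shown, and every parameter value, model name, and function name below is hypothetical.

```python
import math

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM.

    start[i]    : probability of starting in hidden state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][o]  : probability that state i emits symbol o
    """
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step, then weight by the emission probability.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        log_p += math.log(scale)
    return log_p

def classify(obs, models):
    """Return the name of the model with the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))

# Hypothetical two-state models over two wind-cluster symbols (0 and 1).
models = {
    "exceedance": ([0.5, 0.5],
                   [[0.9, 0.1], [0.1, 0.9]],
                   [[0.9, 0.1], [0.2, 0.8]]),
    "non_exceedance": ([0.5, 0.5],
                       [[0.5, 0.5], [0.5, 0.5]],
                       [[0.2, 0.8], [0.1, 0.9]]),
}
label_a = classify([0, 0, 0, 0, 1, 1], models)
label_b = classify([1, 1, 1, 1, 1, 1], models)
```

Working in log space with per-step rescaling keeps the computation stable for sequences as long as a full day of hourly labels.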
The two-stage clustering approach, applied to a large data set, emphasizes the effect of the coastal Gulf breeze on ozone formation. The wind fields can be divided into distinguishable clusters based on the hourly wind measurements. Clustering the sequence indices of wind directions obtained from the hourly wind clustering, with the similarity index as the measure of distance, gave clusters that capture distinct scenarios for ozone exceedances and non-exceedances. The clusters from sequence analysis also captured the diurnal cycle of the wind. Comparison of the results for 2005 and 2004 indicates that the same major meteorological patterns appear in different years. Ozone forecast results based on the HMMs are promising and reveal several unique aspects of Houston ozone dynamics.