In this study we consider multivariate data obtained from continuously operating chemical processes that exhibit two important features. The first is the inherently autocorrelated nature of chemical process measurements, which may represent events occurring at multiple time scales. The second is a cyclic component present in at least one of the process variables. While not present in every continuous chemical process, such a confounding cycle may be caused by a periodic disturbance, as is the case for the pilot plant considered in this study. The cycle is known to exist in the historical data, and represents undesirable variability onto which the meaningful process variability is superimposed.
Traditional clustering algorithms employ simple statistical models and are therefore largely unsuited to chemical process data. They generally partition observations based on their magnitudes, so that each cluster is represented by the mean level of its constituent observations. Such models cannot account for the time series nature of chemical process data, and traditional clustering algorithms are consequently of little use for chemical process monitoring and control applications.
We propose a novel clustering algorithm which uses the Principal Components Analysis (PCA) statistical model to partition chemical process data into groups corresponding to various operating regimes and/or faults. Our clustering algorithm combines the PCA model with a temporal windowing scheme which is used to control the temporal properties of any detected patterns. Before the clustering algorithm can be applied, a moving window defined by length L and spacing between windows D is used to divide the original time series into N equally spaced subsets of equal length. These N windows of data are then taken as the set of objects which are partitioned by the clustering algorithm.
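The windowing step described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: it assumes that D is the offset between the start indices of successive windows, and the function name `make_windows`, the random data, and the values L = 20 and D = 20 are hypothetical.

```python
import numpy as np

def make_windows(X, L, D):
    """Split an (n_samples, n_vars) series into windows of length L.

    Assumption: D is the offset between the start indices of successive
    windows, so D >= L gives non-overlapping windows.
    Returns an (N, L, n_vars) array and the N start indices.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if L > n:
        raise ValueError("window length L exceeds the series length")
    starts = np.arange(0, n - L + 1, D)
    windows = np.stack([X[s:s + L] for s in starts])
    return windows, starts

# Placeholder data with illustrative sizes (400 samples, 9 variables).
X = np.random.default_rng(0).normal(size=(400, 9))
windows, starts = make_windows(X, L=20, D=20)
print(windows.shape)  # (20, 20, 9)
```

Each of the N slices along the first axis is one object to be clustered.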
The nonhierarchical clustering algorithm, termed the “k-PCA Models” algorithm by analogy to the traditional k-means algorithm, is initialized by choosing k, the number of clusters, and k randomly selected, non-overlapping windows to seed the clusters with one window each. An iterative scheme is then used to partition the windows. On each iteration, a prototype PCA model is estimated from the member observations of each cluster, then each window of data is projected into each prototype PCA model and the error is quantified using the scalar total sum of squared error. Each window is then reassigned to the best fitting cluster, and the procedure is repeated until no further reassignments are indicated and the algorithm terminates. Intuitively, the k-PCA Models algorithm identifies k PCA models which can be used to account for the dominant modes of variability in the data set.
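A minimal sketch of the iteration described above is given below, under several assumptions that the text does not specify: each PCA model is mean-centered, every model retains the same number of components (`n_components`), and a cluster that loses all of its members keeps its previous model. The function names and the `max_iter` safeguard are hypothetical. For simplicity the seeds are drawn as distinct windows, which are non-overlapping only when D is at least L.

```python
import numpy as np

def fit_pca(data, n_components):
    """Fit a PCA model to the rows of `data`; return (mean, loadings)."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components].T

def window_sse(window, model):
    """Total squared residual after projecting `window` onto `model`."""
    mean, P = model
    centered = window - mean
    residual = centered - centered @ P @ P.T
    return float(np.sum(residual ** 2))

def k_pca_models(windows, k, n_components, max_iter=100, seed=0):
    """Partition an (N, L, n_vars) array of windows into k clusters."""
    rng = np.random.default_rng(seed)
    N, _, n_vars = windows.shape
    # Seed each cluster with one randomly chosen, distinct window.
    seeds = rng.choice(N, size=k, replace=False)
    models = [fit_pca(windows[s], n_components) for s in seeds]
    labels = np.full(N, -1)
    for _ in range(max_iter):
        # Error of every window under every prototype model.
        errors = np.array([[window_sse(w, m) for m in models]
                           for w in windows])
        new_labels = errors.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no reassignments: converged
        labels = new_labels
        # Re-estimate each prototype from the pooled member observations.
        for j in range(k):
            members = windows[labels == j]
            if len(members):  # assumption: keep old model if cluster is empty
                models[j] = fit_pca(members.reshape(-1, n_vars), n_components)
    return labels, models
```

As with k-means, the result of such a scheme depends on the random seeds, so in practice it would typically be run from several initializations.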
The window length L affects the time scale of the patterns detected by the clustering algorithm. Detected events tend to persist for at least L samples, because L is the minimum temporal averaging length applied when the PCA models are estimated. The window length can also be used to prevent the cyclic component in the historical data from introducing a periodic bias into the cluster labels. With such a bias, the labels become associated with the phase of the confounding cycle, tracking its peaks and troughs instead of the evolution of the true process regimes. Setting L equal to an integer multiple of the cycle period T ensures that each window of data, and therefore each cluster, contains an equal number of measurements from each phase of the cycle, which prevents periodic biases from contaminating the cluster labels. Such a window length precludes the identification of short events that persist for less than one cycle period, so this cluster analysis identifies only the low frequency process states, free of periodic bias.
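The effect of the window length on a pure cycle can be illustrated with a small numerical check. The sketch below uses a hypothetical sinusoid with period T = 12 samples and looks only at window means, which is a simplification of the PCA models used in the study.

```python
import numpy as np

T = 12                                  # hypothetical cycle period, in samples
t = np.arange(10 * T)
cycle = np.sin(2 * np.pi * t / T)       # idealized confounding cycle

def window_means(x, L):
    """Mean of every contiguous window of length L (unit spacing)."""
    return np.convolve(x, np.ones(L) / L, mode="valid")

full = window_means(cycle, 2 * T)       # L is an integer multiple of T
short = window_means(cycle, T // 3)     # L is shorter than one period

# Essentially zero: every phase is equally represented in each window.
print(np.abs(full).max())
# Clearly nonzero: the window mean follows the phase of the cycle.
print(np.abs(short).max())
```

With L a multiple of T, the cycle contributes the same amount to every window regardless of where the window starts, so it cannot separate one window from another.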
Using a shorter window length, it is possible to perform a separate cluster analysis that isolates the high frequency events present in the data set. Necessarily, however, this high frequency solution will contain periodic biases: any event lasting more than one cycle (i.e., the low frequency events identified in the previous cluster analysis) will now appear as a repeating progression of states that reflects the cycle phase. Thus, the high frequency solution indicates the true high frequency events, which include any transition points between two low frequency events, as well as periodically biased artifacts indicating the presence of low frequency events.
By cross-checking the low frequency and high frequency cluster solutions against each other, the true set of process states can readily be determined. The low frequency events are first obtained from the low frequency cluster analysis. The high frequency events are then determined from the high frequency cluster solution, after eliminating any periodically biased labels that coincide with the times of occurrence of the now known low frequency events. An intuitive, graphical method is presented to combine the two independent cluster analyses and arrive at a final solution summarizing process states at all time scales.
The proposed clustering algorithm is applied to historical data obtained from the operation of a pilot plant reactor. Nine variables are sampled every 5 minutes, for a total of 400 observations. The process is affected by a periodic disturbance in the cooling utility that feeds the reactor jacket, which affects the reactor outputs and creates a cyclic component in several key process variables. In addition to this confounding cycle, the data set contains events at both low and high frequencies. The low frequency events include multiple steady states as well as a slow process transition, while the high frequency events include several brief disturbances that are rapidly attenuated.
The clustering algorithm identifies 5 distinct modes of operation. One cluster is identified as the desired regime of operation, and fault diagnosis using variable contribution analysis is performed to isolate the nature of the remaining, faulty regimes. A PCA model is formed from most (but not all) of the samples in the normal operating regime, and the entire data set of 400 observations is then projected into this PCA model. The contribution of each variable to the model error (the Q-statistic) is computed at each sampling instant. For each variable, the contributions are compared with those observed during the modeled period; relatively large contributions indicate variables that do not conform to normal operation. Unmodeled data within the desired operating regime exhibit small contributions for all variables, validating the performance of the process model. Larger contributions for certain variables in the faulty regimes indicate the specific sets of variables affected by the faults, allowing inference as to their root causes.
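The contribution analysis described above might be sketched as follows. This is a sketch under stated assumptions rather than the procedure used in the study: the data are autoscaled with the reference-period mean and standard deviation, the contribution of a variable is taken to be its squared residual (a common convention), and the function name and `n_components` parameter are hypothetical.

```python
import numpy as np

def q_contributions(X_ref, X_all, n_components):
    """Per-variable contributions to the Q-statistic.

    X_ref : samples from normal operation used to build the PCA model.
    X_all : all samples to be projected into that model.
    Returns (contrib, q): contrib[i, j] is the squared residual of
    variable j at sample i, and q[i] is the sum over variables.
    """
    X_ref = np.asarray(X_ref, dtype=float)
    X_all = np.asarray(X_all, dtype=float)
    # Assumption: autoscale with reference-period statistics.
    mean = X_ref.mean(axis=0)
    std = X_ref.std(axis=0, ddof=1)
    Z_ref = (X_ref - mean) / std
    # Loadings are the leading right singular vectors of the scaled data.
    _, _, vt = np.linalg.svd(Z_ref, full_matrices=False)
    P = vt[:n_components].T
    Z = (X_all - mean) / std
    E = Z - Z @ P @ P.T          # residual not explained by the model
    contrib = E ** 2
    return contrib, contrib.sum(axis=1)
```

Averaging `contrib` over the samples of a faulty cluster and comparing the result with the average over the modeled period gives one way to rank the variables most affected by that fault.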
In summary, the cluster analysis is shown to be an effective tool for process monitoring and fault diagnosis.