
Data Mining: Another Tool To Increase Productivity In Manufacturing?

Data Mining Concepts

The major driving factor behind the creation of Data Mining is the enormous quantity of data being captured. Many Internet-based businesses handle in excess of one billion transactions a day and thus possess databases containing gigabytes or even terabytes of data. These vast mountains of data contain potentially valuable information. Data Mining seeks out the useful patterns and valuable information from the surrounding noise of irrelevant values.

Data Mining, as mentioned previously, is the process of seeking meaningful relationships within a data set. The relationships and patterns found using Data Mining must be fresh and original. As Hand et al. state, "There is little point in regurgitating well-established relationships (unless, the exercise is aimed at 'hypothesis' confirmation, in which one was seeking to determine whether established pattern also exists in a new data set) or necessary relationships (that, for example, all pregnant patients are female)."[4]

The boundaries of Data Mining as part of knowledge discovery are not precise. Knowledge Discovery in Databases involves several stages: selecting the target data, processing the data, transforming the data where necessary and applying algorithms to discover patterns. Some argue that data transformation is an intrinsic aspect of Data Mining, as without first pre-processing the data the analyst will not be able to ask meaningful questions, and the interpretation of the extracted patterns would be impossible.

Data Mining techniques fall under two simple headings: 'supervised or directed' and 'unsupervised or undirected' learning. Directed learning techniques require the analyst to specify a target field or particular variable of interest. The directed algorithm then sifts through the data set, establishing relationships and structures between the chosen target and the independent variables.

In the undirected approach, the analyst does not set an objective for the algorithm. No variable or target field is specified; therefore, the subject of the question is not defined. The associations between data are not restricted to their dependence on a target. Indeed, this allows the algorithm to discover relationships and structures in the data independently of any prior implicit knowledge of the user.
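The contrast between the two approaches can be sketched in a few lines of Python. This is a hypothetical toy illustration, not an algorithm from the text: the directed routine searches for a threshold that best predicts an analyst-chosen target field, while the undirected routine (a simple 1-D two-means clustering) is given no target at all and finds structure on its own.

```python
# Toy sketch contrasting directed and undirected learning.
# All field names and data values here are hypothetical illustrations.

def directed_threshold(rows, target):
    """Directed: find the feature-value threshold that best separates
    the analyst-chosen target field (here a 0/1 label)."""
    best = None
    for t in sorted(r["x"] for r in rows):
        # predict 1 when x >= t, score by accuracy against the target
        acc = sum((r["x"] >= t) == bool(r[target]) for r in rows) / len(rows)
        if best is None or acc > best[1]:
            best = (t, acc)
    return best  # (threshold, accuracy)

def undirected_two_means(values, iters=10):
    """Undirected: no target is named; split the values into two
    clusters with a minimal 1-D k-means."""
    c1, c2 = min(values), max(values)
    for _ in range(iters):
        g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
        c1 = sum(g1) / len(g1) if g1 else c1
        c2 = sum(g2) / len(g2) if g2 else c2
    return sorted([c1, c2])

rows = [{"x": 1.0, "fail": 0}, {"x": 1.2, "fail": 0},
        {"x": 3.8, "fail": 1}, {"x": 4.1, "fail": 1}]
print(directed_threshold(rows, "fail"))
print(undirected_two_means([r["x"] for r in rows]))
```

Note that the undirected routine recovers the same two groups without ever being told which field mattered, which is precisely the point made above.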

Data Mining incorporates six basic activities:

  • Classification.
  • Estimation.
  • Prediction.
  • Affinity Grouping or Association rules.
  • Clustering.
  • Description and Visualisation.

The first three - classification, estimation and prediction - are examples of directed mining. These work by using the available data to build a model based on the target field or chosen variable. The remainder - affinity grouping, clustering, and description and visualisation - are undirected techniques whose goal is to establish relationships between the available variables.
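Affinity grouping is perhaps the least self-explanatory of the six activities, so a minimal sketch may help. The transactions, item names and thresholds below are invented for illustration: for each pair of items co-occurring in a set of transactions, we compute the standard support and confidence measures and keep the rules that clear both thresholds.

```python
# Toy affinity grouping (association rules) over hypothetical transactions:
# each rule "lhs -> rhs" is scored by support (how often both items occur
# together) and confidence (how often rhs occurs given lhs).
from itertools import combinations

transactions = [
    {"sensor_A", "sensor_B", "alarm"},
    {"sensor_A", "alarm"},
    {"sensor_B"},
    {"sensor_A", "sensor_B", "alarm"},
]

def rules(transactions, min_support=0.5, min_confidence=0.6):
    n = len(transactions)
    items = set().union(*transactions)
    found = []
    for a, b in combinations(sorted(items), 2):
        both = sum(1 for t in transactions if a in t and b in t) / n
        for lhs, rhs in ((a, b), (b, a)):
            lhs_support = sum(1 for t in transactions if lhs in t) / n
            conf = both / lhs_support if lhs_support else 0.0
            if both >= min_support and conf >= min_confidence:
                found.append((lhs, rhs, both, conf))
    return found

for lhs, rhs, sup, conf in rules(transactions):
    print(f"{lhs} -> {rhs}: support={sup:.2f}, confidence={conf:.2f}")
```

On this toy data, the rule "sensor_A -> alarm" holds with full confidence, the kind of undirected finding an analyst would then take to a domain expert for evaluation.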

There are also other Data Mining techniques, such as time-series and outlier analysis, which are of importance. I feel the methods discussed will cover the predominant DM techniques, from which others may be derived; e.g. time-series analysis is an advanced association method. Time-series algorithms attempt to discover what the objects in a set of time series have in common by using comparison methods. A problem commonly overlooked is that of an algorithm determining what distinguishes one time series from another drawn from the same data source. Both attempt to identify associations and patterns, though in the latter case those patterns must be distinctive and 'disassociated'.
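The comparison methods mentioned above rest on some distance measure between series. As a hedged sketch, with entirely invented readings, a plain Euclidean distance is enough to show the idea: the two series that track each other are "associated", while the drifting one is the distinctive, 'disassociated' series.

```python
# Hypothetical sketch: pairwise comparison of time series, the basic
# operation behind time-series association methods. Data is invented.
import math

def euclidean(a, b):
    """Distance between two equal-length time series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

line_1 = [10.0, 10.2, 10.1, 10.4]   # e.g. temperature readings, machine 1
line_2 = [10.1, 10.3, 10.0, 10.5]   # machine 2: behaves like machine 1
line_3 = [10.0, 12.9, 15.8, 18.7]   # machine 3: drifting upward

# The closest pair is treated as associated; the distant series is the
# distinctive, 'disassociated' one.
print(euclidean(line_1, line_2) < euclidean(line_1, line_3))
```

Real systems use more robust measures (correlation, warping-tolerant distances), but the structure of the comparison is the same.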

Figure 1. KDD and Data Mining should combine cognitive psychology with AI, database technology and statistical techniques to produce an insightful model.

If, whilst creating 'learning', unsupervised, algorithms, we observe the domain expert's cognitive processes and take them into account, we may be able to increase the overall usefulness of the relationships discovered. The implicit knowledge and perceptions of the domain experts ultimately determine the novelty, usefulness and acceptance of the Data Mining project's findings. Most Data Mining systems produce a single set of associations and make no effort to include this expertise in the knowledge base. The experts, who are expected to gain insight by applying their knowledge, must evaluate these possibly interrelated concepts. Study of cognitive techniques, and of the methods by which knowledge is assimilated, could lead to better designed systems [4]. Hearst states that during an examination of the popular texts on the KDD subject, none dedicated space to methods for ensuring the knowledge extracted is useful, novel and understandable: "While some KDD papers cover these topics, most contain unfounded assumptions about 'comprehensibility' or 'interestingness'"[16].

As advances in AI allow for a greater understanding of the learning process, combined with the prevalence of 'vertical' applications, we should see an increase in the ability of Data Mining algorithms to discover 'interesting' and novel patterns. Problems will occur when, as some Data Mining projects have highlighted, the domain experts' empirical observations are incorrect. Examples can be found in the later sections.

This paper will not spend much time discussing data processing issues such as cleansing, validation, transformation and variable definition. Instead, it will focus on the basic principles of the techniques employed to identify data sources and extract relationships from the resultant data set.

Where's The Data?

After settling on a task and the methods you are to employ, you can begin to gather data. The initial data will assist in the induction process of building the model. In some cases, this is a straightforward task, with huge quantities of data already being collected. In other domains, this can be the largest challenge of the process.

The availability and quality of the data are dependent on the instrumentation collecting the information. In some manufacturing environments, a plethora of different measuring equipment will be used, with varying levels of accuracy. In odd cases, as in the example of Evans and Fisher [17], the analysts had to rely on the technicians running the equipment to manually record values periodically. Most domains will fall somewhere between these two extremes, with the domain experts initially assisting in the classification, or creation, of the training data. Langley insists that "In the ideal situation, the expert systems can be tied directly into the flow of data from the operating system's instruments."[18] With modern machinery making a greater quantity of running data available to control systems, I would expect a substantial increase in the amount of engineering data captured.

Data Warehouse Overview

The Data Warehouse is a technology, pioneered by Inmon et al., as a means to store analysis data sets without sacrificing the transaction speed of the production applications. The data is stored separately from the operational systems, as Data Warehouses are primarily concerned with historical data, which in many legacy systems has been archived as it became inactive. Traditionally, reports would be run from these data silos so as to minimise the impact on the operational systems. The Data Warehouse has formalised the collection, storage and representation of this data. The data is moved from what Fayyad terms 'Data Tombs' [11] into a format that allows the business to exploit this potentially valuable information.

Inmon defines a Data Warehouse as an "integrated, subject-oriented, time-variant, non-volatile database that provides support for decision making."[7] This definition provides us with a clear view of how it relates to Decision Support Systems (DSS) and, by quickly analysing its components, an understanding of the creation of a Data Warehouse.

Integrated: consolidated and consistent data. A Data Warehouse must collect operational data from a multitude of sources, which in its own right is an arduous task. The Data Warehouse must ensure that the data gathered is in a consistent state; names, unit attributes, domain limits and characteristics are uniformly applied. The business metrics must be described in the same way throughout the entire enterprise. The Data Warehouse will therefore show a consistent image of the various sources from which the original data was collected.
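The consolidation step described above can be sketched as follows. The source feeds, field names and conversion are hypothetical: two plants report the same volume metric under different naming conventions and units, and the warehouse load normalises both into one schema measured in litres.

```python
# Hedged sketch of the integration step: two hypothetical source feeds
# are normalised to one consistent schema (lowercase keys, litres).

plant_a = [{"ProductID": "P1", "Volume_gal": 2.0}]   # reports US gallons
plant_b = [{"product_id": "P1", "volume_l": 5.0}]    # reports litres

GAL_TO_L = 3.78541  # US gallons to litres

def consolidate(a_rows, b_rows):
    out = []
    for r in a_rows:
        # rename the fields and convert the unit before loading
        out.append({"product_id": r["ProductID"],
                    "volume_l": round(r["Volume_gal"] * GAL_TO_L, 3)})
    for r in b_rows:
        out.append({"product_id": r["product_id"],
                    "volume_l": r["volume_l"]})
    return out

print(consolidate(plant_a, plant_b))
```

Production ETL tooling does this at far greater scale, but the essence is the same: one agreed schema and unit for every metric, applied uniformly at load time.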

Subject-oriented: this topic-oriented view of the data is markedly different to that of the operational databases. Operational systems tend to hold a large number of function-focused data sets: product information, accounts, etc. The transactional aspect of these systems is often not regarded as important in the business decision-making process; to this end, much of the variable data stored within a Data Warehouse is summarised. A Data Warehouse would contain, for example, the quantity and monetary value of sales by region, thus allowing the user to produce quick comparisons.
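The sales-by-region example mentioned above amounts to a simple roll-up. The field names and figures below are illustrative only: transaction rows are summarised into the quantity and monetary value per region that the warehouse would actually store.

```python
# Sketch of the summarised, subject-oriented view: hypothetical
# transaction rows rolled up to quantity and value by region.
from collections import defaultdict

sales = [
    {"region": "North", "qty": 10, "value": 150.0},
    {"region": "North", "qty": 5,  "value": 80.0},
    {"region": "South", "qty": 7,  "value": 100.0},
]

summary = defaultdict(lambda: {"qty": 0, "value": 0.0})
for row in sales:
    summary[row["region"]]["qty"] += row["qty"]
    summary[row["region"]]["value"] += row["value"]

print(dict(summary))
```

In a real warehouse the equivalent operation is typically a GROUP BY during the load, with the individual transactions left behind in the operational system.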

Time-variant: this aspect is important as data in a Data Warehouse "constitutes a snapshot of the company history as measured by its variables...within a time framework."[8] This is a stark contrast to the operational systems, which are concerned with reporting the current value at that moment in time. As the contents of data sources are periodically extracted and used to populate the Data Warehouse, a new time slice will be added to the view and all time-dependent aggregate functions, such as monthly or annual sales, will be recalculated. Non-volatile data is essential as the Data Warehouse represents a point in time and therefore should never change. As Microsoft literature states, "after the data has been moved to the Data Warehouse successfully, it typically does not change unless the data was incorrect in the first place."[9] To this end, the Data Warehouse should give a consistent view of the company's history, whereas the operational systems give a representation of the present.
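The time-variant and non-volatile properties together can be sketched as an append-only store of dated snapshots. The periods and sales figures below are invented: each load adds a new slice, earlier slices are never updated, and time-dependent aggregates are simply recomputed over the slices.

```python
# Sketch of time-variance and non-volatility: each load appends a dated
# snapshot; history is never rewritten, and time-dependent aggregates
# are recomputed over the slices. All figures are hypothetical.

warehouse = []  # list of (period, sales) snapshots, append-only

def load_slice(period, sales):
    """Append a new time slice; existing slices are never modified."""
    warehouse.append((period, sales))

def annual_total(year):
    """A time-dependent aggregate, recomputed from the stored slices."""
    return sum(s for p, s in warehouse if p.startswith(year))

load_slice("2003-01", 120.0)
load_slice("2003-02", 95.0)
load_slice("2004-01", 130.0)

print(annual_total("2003"))
```

Because the slices are immutable, re-running the same query against the same history always returns the same answer, which is exactly the consistent historical view the definition demands.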


