Building probalistic models from databases

Author: Miller, John W.

Year: 1993

Degree: Dissertation (Ph.D.)

Advisor: Goodman, Rodney M.

Committee Members: Goodman, Rodney M.; Franklin, Joel N.

Option: Electrical Engineering

DOI: 10.7907/mm98-d904

Abstract

The problem of creating a probabilistic model using a database is analyzed in this thesis. Two independent results in probabilistic modeling are presented. The first result is a method for creating a model which produces accurate probability estimates. The model is a Gibbs probability distribution representation of the database. This model is created by a new transformation relating the joint probabilities of attributes in the database to Gibbs potentials. The theory of this transformation is presented together with a specific algorithm for efficiently collecting and using the Gibbs potentials. A hash table scheme is used to collect the important potentials without iterative error minimization or repeated searches through a database. The Gibbs modeling scheme allows flexible control of the tradeoffs involving modeling error and sampling error as well as the tradeoffs involved in using the resources of computation time and memory. The performance of the probabilistic modeling algorithm is tested and analyzed. Used as a classifier with a variety of datasets, the Gibbs modeling algorithm was found to equal or surpass the classification results of other models such as neural networks trained with backwards error propagation and the nearest neighbor classification algorithm.

The second independent result is the analysis of systems that use error minimization to estimate probabilities. Minimization to a probability has been a known property of the squared error and cross entropy objective functions. Here the necessary and sufficient conditions for minimization to a probability are developed. It is found that the squared error and cross entropy functions are two of the simplist functions from a family of objective functions which minimize to a probability. If the system is incapable of producing the outputs consistent with the probability estimates, it is shown the minimum error is achieved when the system produces outputs closest to the correct probability estimate outputs. The measure of closeness is described here in terms of the objective function.

Files