Saturday, August 4, 2012

The Ideal Learner


Given a machine learning problem, i.e. some training data O of type T, the ideal model M is found as follows:

  • First, a prior P over all (computable) distributions over T is found from past experience and prior knowledge of the problem (which must be independent of O). The prior may be represented as a function that, given a distribution over T, returns its weight. We shall write P(d) to denote the application of this function to a distribution d.
  • Then, the posterior S is derived from O and P by reweighting the distributions over T according to their likelihood given the data. So S(d), where the function S is defined similarly to P, is equal to P(d)*L(d|O) / n, where n is the normalization factor (equal to the sum over all d of P(d)*L(d|O)). A small sketch of this step follows the list.
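
Here is a minimal sketch of this prior-to-posterior step in Python, assuming the space of candidate distributions has been cut down to a small finite family of Gaussians. Everything concrete here (the candidate family, the uniform prior, the use of scipy) is an illustrative assumption, not something prescribed by the setup above.

import math
from scipy.stats import norm

# Training data O, of type T = float.
O = [0.9, 1.1, 1.3, 0.8]

# A finite family of candidate distributions d over T (Gaussians with
# different means), standing in for "all computable distributions over T".
candidates = [norm(loc=mu, scale=1.0) for mu in (-1.0, 0.0, 1.0, 2.0)]

# The prior P: one weight per candidate, chosen independently of O.
prior = [1.0 / len(candidates)] * len(candidates)

def log_likelihood(d, data):
    # L(d|O) in log space: product of the densities of the observations under d.
    return sum(d.logpdf(x) for x in data)

# The posterior S(d) = P(d) * L(d|O) / n, computed in log space for stability.
unnormalized = [math.log(p) + log_likelihood(d, O) for p, d in zip(prior, candidates)]
m = max(unnormalized)
weights = [math.exp(u - m) for u in unnormalized]
n = sum(weights)                      # the normalization factor
S = [w / n for w in weights]          # posterior weights over the candidates

print(S)  # most of the weight lands on the mu = 1.0 candidate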

Given a partial test case o (possibly trivial), M posteriorizes each distribution in S according to o, i.e. conditions it on o. Then a weighted sum of all these conditioned distributions is taken, according to the weights in S: a completion o' receives probability equal to the sum over all d of S(d)*d(o'|o). This gives the ideal distribution for o given the knowledge available.

For example, given a regression problem with a training set <x_i, y_i>, where the x_i belong to R^n and the y_i to R, one would first find P over all distributions over R^(n+1). Then one would find S, of the same type as P. Given a test case x, one would restrict all the distributions in S to the points whose first n coordinates are equal to x, then take the weighted sum of these to find the ideal distribution over y. One could further take the mode of the ideal distribution to get a concrete y.
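The following Python sketch works through this example for n = 1, under the simplifying assumption that each candidate joint distribution over R^2 is represented by its conditional p(y|x) (here, linear models y = w*x plus Gaussian noise of fixed scale). The slope grid, the noise scale, and the uniform prior are illustrative choices, not part of the recipe above.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Training set <x_i, y_i> with n = 1, drawn from roughly y = 2x.
xs = rng.uniform(-1.0, 1.0, size=20)
ys = 2.0 * xs + rng.normal(scale=0.3, size=20)

# Candidate distributions over R^2, indexed by a slope w: p(y | x) = N(w*x, 0.3^2).
slopes = np.linspace(-3.0, 3.0, 61)
prior = np.full(len(slopes), 1.0 / len(slopes))   # uniform prior P

# Posterior weights S(w), proportional to P(w) * product_i p(y_i | x_i, w).
log_post = np.log(prior) + np.array(
    [norm.logpdf(ys, loc=w * xs, scale=0.3).sum() for w in slopes]
)
post = np.exp(log_post - log_post.max())
post /= post.sum()

# Test case x: restrict each candidate to the points whose first coordinate
# equals x (i.e. form p(y | x, w)), then take the posterior-weighted sum.
x_test = 0.5
y_grid = np.linspace(-5.0, 5.0, 1001)
ideal = sum(s * norm.pdf(y_grid, loc=w * x_test, scale=0.3)
            for s, w in zip(post, slopes))

# A concrete prediction: the mode of the ideal distribution.
y_hat = y_grid[np.argmax(ideal)]
print(y_hat)  # close to 2 * 0.5 = 1.0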

Of course, finding M is usually impossible, and even approximating it is usually impractical. In practice, the prior is chosen largely for convenience. For example, choosing a particular family of SVMs, together with the procedure by which they are computed from training data, implicitly defines a prior over all distributions over T (one that assigns probability zero to most of them).
