In supervised learning, a label is normally provided along with the data point which may be n-dimensional. In other words, an observation in training data will have n features and there can be N such observations with labels provided by an expert or an oracle.
A number of simplyifying assumptions are made concerning the training data, namely, that all the samples that we are going to see would be statistically independent belonging to the same probability distribution but identically distributed.
We are all familiar with the notion of a function which maps a certain input to certain output. In a probabilistic setting, when we don’t know the function, we do function approximation which is like finding out the probability distribution P(Y|X). If X is a k valued single discrete random variable, we’ll have k different P(Y|X) to look up for depending upon the value which X takes. X can be a vector of discrete random variables as well, where there are n such X_i’s where each X_i may take up to k different values. X in this scenario is also termed feature vector. Formulaically , P(Y|X) = P(Y|X_1,X_2,X_3,…,X_n).
How to evaluate P(Y|X)?
Baye’s rule comes in very handy and we can express P(Y|X) as P(X|Y)P(Y)/P(X). Now we arrive at an interesting point, we have to make another assumption concerning P(Y) as we haven’t seen all possible data points, nor we can label all possible data points. Assume a prior, can be a beta prior, or a dirchlet prior or any other suited for the problem domain.
P(X|Y) is estimated empirically. How? By counting the number of times the values X assumes dividing total number of times Y takes on a particular value.
P(X) can be calculated using the law of total probability.
With basics out of the way, we come to parameter estimation. In parameter estimation, we are concerned with estimating the parameters of a distribution given a certain set of data points.
As we discussed estimating P(Y|X), Y can be theta (distribution’s parameter); whereas X can be D. The problem of parameter estimation thus becomes:
1) finding either the best theta that maximises the data likelihood, or in other words makes the data most proabable:
theta = argmax_theta P(D|theta)
2) Or, finding the most probable theta given the data D and prior P(theta):
theta = argmax_theta P(theta|D)
Note: Since the current wordpress theme doesn’t support Latex, please visit http://azfar.quora.com for crisp mathematical expressions.