Everything begins with the same old story: there is a random vector named $\underline{x}$ (the underline denotes a vector), which follows a normal distribution with uncorrelated components. Assume $\sigma$ is known, that is:
$\underline{x}\sim N(\underline{\theta},\sigma^{2}I)$, $\underline{x} \in \mathbf{R}^{p}$
So the maximum likelihood estimate of $\underline{\theta}$ from a single observation is simply:
$\underline{\widehat{\theta}}_{MLE} = \underline{x}$
The MLE is the best estimator in the sense of the Gauss-Markov theorem: among all linear unbiased estimators of $\underline{\theta}$, $\underline{\widehat{\theta}}_{MLE}$ has the smallest variance.
For any estimator, the squared-error loss is defined as:
$L(\underline{\theta},\underline{\widehat{\theta}}) = E||\underline{\theta} - \underline{\widehat{\theta}}||^{2}$
Writing the bias of the estimator as
$\underline{bias} = E(\underline{\widehat{\theta}})-\underline{\theta}$
and inserting $E(\underline{\widehat{\theta}})$ into the loss:
$L(\underline{\theta},\underline{\widehat{\theta}}) = E||\underline{\widehat{\theta}}-E(\underline{\widehat{\theta}})+\underline{bias}||^2$
$=E||\underline{\widehat{\theta}}-E(\underline{\widehat{\theta}})||^2 + ||\underline{bias}||^2 + 2E[\underline{bias}^{T}(\underline{\widehat{\theta}}-E(\underline{\widehat{\theta}}))]$
The cross term vanishes because $E[\underline{\widehat{\theta}}-E(\underline{\widehat{\theta}})] = \underline{0}$, so
$L(\underline{\theta},\underline{\widehat{\theta}}) = Var(\underline{\widehat{\theta}}) + ||\underline{bias}(\underline{\widehat{\theta}})||^{2}$
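As a quick numerical check of this decomposition, here is a small Monte Carlo sketch. The estimator is a made-up biased one (it shrinks a single observation by a factor of 0.8), chosen only to make the bias term nonzero:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma = 2.0, 1.0
n_trials = 200_000

# A deliberately biased estimator of theta: shrink the single
# observation x ~ N(theta, sigma^2) toward zero by a factor 0.8.
x = rng.normal(theta, sigma, size=n_trials)
theta_hat = 0.8 * x

loss = np.mean((theta_hat - theta) ** 2)     # E||theta_hat - theta||^2
var = np.var(theta_hat)                      # Var(theta_hat)
bias_sq = (np.mean(theta_hat) - theta) ** 2  # ||bias||^2

# loss and var + bias_sq agree: the decomposition is an identity
# on the sample as well as in expectation.
print(loss, var + bias_sq)
```

Here the true loss is $0.8^{2}\sigma^{2} + (0.2\,\theta)^{2} = 0.64 + 0.16 = 0.8$, and the simulated `loss` lands close to that.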
So the loss decomposes into bias and variance. For the MLE the bias is 0, so the loss comes entirely from variance. Could we lose a little on bias, gain more on variance, and make $L$ smaller than $L_{MLE}$?
The main idea of the James-Stein shrinkage estimator is that when the dimension of the variable
$p$ is bigger than 2, introducing a little bias can gain more on variance, making the total loss smaller than $L_{MLE}$. That is,
$\underline{\widehat{\theta}} = \bar{x}\,\underline{1}+c(\underline{x}-\bar{x}\,\underline{1})$
in which $\bar{x}$ is the grand average of the components of $\underline{x}$, $\underline{1}$ is the all-ones vector, and
$c$ is a shrinkage factor. The essential step in Stein's method is the "shrinking" of all the individual averages toward the grand average. The actual value of $c$ is determined by the collection of all the observed averages, even though there is no correlation between the elements of $\underline{x}$. So you can estimate totally unrelated quantities together, such as baseball batting averages and the rate of imported cars, and get a better result than estimating them separately. Both theoretically and practically, this has been proved to give a better estimate than the MLE. In the example given in "Stein's paradox in statistics" (B. Efron and C. Morris, 1977, Scientific American), when a player's batting average is estimated together with the other players' (each based on 45 at-bats), $c$ is about 0.2, i.e., each average keeps only about 20% of its deviation from the grand average.
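To make the shrinkage concrete, here is a small sketch with made-up batting averages and the $c \approx 0.2$ from the Efron-Morris example (the real data, and the formula that determines $c$ from the observations, are in the referenced paper):

```python
import numpy as np

# Hypothetical early-season batting averages for five players (made-up numbers).
x = np.array([0.400, 0.378, 0.356, 0.333, 0.311])
grand = x.mean()  # the grand average, the common target of the shrinkage

c = 0.2  # shrinkage factor, as in the Efron-Morris baseball example
theta_hat = grand + c * (x - grand)  # pull every average toward the grand average

print(grand)      # all estimates are pulled toward this value
print(theta_hat)  # each keeps only 20% of its deviation from the grand average
```

Note that the extreme averages move the most: the shrinkage borrows strength across players precisely because individual averages are noisy.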
A closely related form of the James-Stein estimator, which shrinks toward the origin instead of the grand average, is
$\underline{\widehat{\theta}} = (1-\frac{(p-2)\sigma^2}{||\underline{x}||^2})\underline{x}$
When $p>2$, one can prove that $L_{JS}<L_{MLE}$ for every $\underline{\theta}$.
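A minimal simulation sketch of this claim, assuming an arbitrarily chosen fixed $\underline{\theta}$ and $\sigma = 1$:

```python
import numpy as np

rng = np.random.default_rng(1)
p, sigma = 10, 1.0
theta = rng.normal(size=p)  # some fixed true mean vector (arbitrary choice)
n_trials = 20_000

# One observation x ~ N(theta, sigma^2 I) per trial.
x = rng.normal(theta, sigma, size=(n_trials, p))

# MLE: theta_hat = x.  James-Stein: shrink x toward the origin.
shrink = 1.0 - (p - 2) * sigma**2 / np.sum(x**2, axis=1, keepdims=True)
js = shrink * x

risk_mle = np.mean(np.sum((x - theta) ** 2, axis=1))  # should be close to p
risk_js = np.mean(np.sum((js - theta) ** 2, axis=1))

print(risk_mle, risk_js)  # risk_js comes out strictly smaller
```

The gap between the two risks is largest when $||\underline{\theta}||$ is small relative to $\sigma\sqrt{p}$, but the James-Stein risk stays below the MLE risk for any $\underline{\theta}$ when $p>2$.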
Next: shrinkage in regression
To be continued....
Reference
W. James and C. Stein. Estimation with Quadratic Loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1961.
B. Efron and C. Morris. Stein's Paradox in Statistics. Scientific American, 1977.
PS: Writing in English is so hard.....:)