
[Math for Fun] Statistics in ML

ML skills

Different clustering with Toy dataset

Found this post from sklearn with an intuitive illustration of different clustering approaches.
You can play with the math behind each of them.
{Ref}
Demo graph
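A minimal sketch of the idea with two of the approaches on a toy dataset (KMeans and DBSCAN are chosen here as examples; the sklearn post compares many more):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two half-moon clusters: convex-cluster methods and density-based methods behave differently
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

labels_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)  # splits the moons in half
labels_dbscan = DBSCAN(eps=0.3).fit_predict(X)                                  # follows the density of each moon
```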

PINN

Physics-informed neural network.
A 1-D example is the axial displacement of a column under an axial load applied in a given pattern.
The strain can be expressed as a partial differential equation:

A PDE needs to be solved:

and the boundary conditions:

  • Independent variable: x (input)
  • Dependent variable: u (output)

Derive u(x) for all x in the range [0, 1].

The exact solution is given by

Our f is $$f = \frac{\mathrm{d}^2u}{\mathrm{d}x^2}+\frac{2(a+a^3b(b-x+1))}{{(a^2b^2+1)}^2}$$

The neural network regresses the target data while also fitting the equation: the PDE residual f and the boundary conditions are added to the loss (see the sketch after the list below).

Advantages:

  • requires less data
  • well suited to complex mathematical equations
  • no need to solve the PDE explicitly

Drawbacks:

  • requires many iterations to converge
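A minimal sketch of the idea in PyTorch, using a simpler stand-in equation \frac{\mathrm{d}^2u}{\mathrm{d}x^2} + \pi^2\sin(\pi x) = 0 with u(0)=u(1)=0 instead of the column problem above (the network architecture and hyperparameters are illustrative assumptions):

```python
import torch
import torch.nn as nn

# A small fully connected network approximating u(x)
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(5000):
    x = torch.rand(64, 1, requires_grad=True)   # collocation points in [0, 1]
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]

    # PDE residual f = d2u/dx2 + pi^2 * sin(pi * x); no labeled u data is needed inside the domain
    residual = d2u + torch.pi ** 2 * torch.sin(torch.pi * x)

    # Boundary conditions u(0) = u(1) = 0 enter the loss as well
    x_boundary = torch.tensor([[0.0], [1.0]])
    loss = residual.pow(2).mean() + net(x_boundary).pow(2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```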

XGBoost

XGBoost is a parallel computation method based on tree models.
Development:

Its predecessor is GBDT (Gradient Boosting Decision Tree).

A concrete implementation is available in Python's xgboost library.
Its main improvement is software and hardware acceleration of the pruning algorithm.
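A minimal usage sketch of the xgboost library through its scikit-learn API (the dataset and hyperparameters are illustrative assumptions):

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gradient-boosted trees: each new tree fits the gradient of the loss of the current ensemble
model = xgb.XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```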

Lao Zhao says these are currently the mainstream methods; they deserve attention and study.

PCA (Principal Component Analysis)

by StatQuest

For a 4D example:

  1. Centering: compute the mean point and move the centroid of the entire sample group to the origin.
  2. Solve for the eigenvectors: PC1 is the direction that maximizes the variation of the projected points, i.e., the sum of squared distances of the projections to the origin, SS(distance) (equivalently, it minimizes the perpendicular distances to the fitted line).
    1. SS(distance) = eigenvalue for PC1
    2. sqrt(eigenvalue for PC1) = singular value for PC1
  3. SVD (Singular Value Decomposition): rotate the data so that the principal components align with the coordinate axes.
  4. Dimension reduction / scree plot: keep PC1 and PC2 if they account for most of the variation.
    1. Compute the variation of PC1 = SS(distance for PC1)/(n-1)
    2. Trivial PCs can be ignored.
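The same steps, sketched with sklearn on an assumed random 4-D toy dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))       # a toy 4-D dataset

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)             # centering, SVD, and projection onto PC1/PC2 in one call

print(pca.explained_variance_)          # SS(distance)/(n-1) for each kept PC
print(pca.explained_variance_ratio_)    # scree-plot values used to spot trivial PCs
print(pca.singular_values_)             # sqrt of SS(distance), i.e., sqrt((n-1) * explained_variance_)
```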

Stats & Maths

t-test

especially the independent two-sample t-test
{Wiki-ref}
{scipy-ref}
{t-test explanation}
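A quick sketch with scipy (random toy samples assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=30)
b = rng.normal(loc=0.5, scale=1.0, size=30)

# Independent two-sample t-test; equal_var=False gives Welch's t-test
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(t_stat, p_value)
```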

Distribution.log_prob() & nn.GaussianNLLLoss

Negative Log

Because of the logarithm's simplicity and efficiency in converting multiplication into summation, the likelihood of independent samples, a product \prod_{i}{p(x_i)}, becomes the sum \sum_{i}{\log{p(x_i)}}.

By taking the negative log, maximizing the likelihood turns into minimizing a loss, which matches the convention of optimizers.

Back to the differences between the two implementations.

Parameter definition differences

All of the summaries below are drawn from the official docs {GNLLss} {Distribution.log_prob}.
First, in log_prob, self.loc is the mean and self.scale is the standard deviation (stddev):

But in GaussianNLLLoss, the input parameters are input, target, and var, where input is the mean value \mu, var is the variance s^2, and target is the observation. Both input and var are produced by the network.

You may notice that stddev has denominator N while the variance has n-1. My view is that stddev is a probability parameter, hence N is the total sample count of the probability distribution; variance, however, is a statistical parameter, so n is the number of observations, and n-1 reflects that the \sum{(x-\bar{x})^2} operation in the numerator has one less degree of freedom than \sum{x^2}.

Other interesting strategies for optimizing computational efficiency include:

  • neglecting the constant term \log{(\sqrt{2\pi})}
  • reduction by mean instead of sum
  • clamping with a safety value eps of 1e-6 to avoid division-by-zero errors

These approaches can be verified by setting the GaussianNLLLoss as follows:
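A sketch of the check (toy tensors assumed): with full=True the constant term is kept and reduction='none' keeps per-sample values, so the loss matches the negative log_prob.

```python
import torch
from torch.distributions import Normal

mu = torch.tensor([0.3, -1.2])     # predicted mean
var = torch.tensor([0.5, 2.0])     # predicted variance
obs = torch.tensor([0.1, -0.8])    # observations

loss_fn = torch.nn.GaussianNLLLoss(full=True, eps=1e-6, reduction='none')
nll_loss = loss_fn(mu, obs, var)

# Negative log-probability from the Normal distribution (scale = sqrt(var))
nll_dist = -Normal(loc=mu, scale=var.sqrt()).log_prob(obs)

print(torch.allclose(nll_loss, nll_dist))  # True (up to the eps clamp on var)
```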

The docs mention that in some cases the var array can have one fewer dimension than the mean, due to the homoscedasticity assumption, which is explained below.

homoscedasticity

The takeaway is that homoscedastic means the variance is regarded as the same for all samples along a certain dimension. The antonym is heteroscedastic.
{Ref}
Homoscedastic
Heteroscedastic

SNR

ddof: delta degrees of freedom (the divisor used is N - ddof)

In statistics, the SNR is the mean divided by the standard deviation:

In audio processing, decibels are preferred.
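A small numpy sketch of both conventions (the signal and noise are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 4 * np.pi, 1000))
noise = 0.1 * rng.standard_normal(1000)
x = signal + noise

# Statistics-style SNR: mean over standard deviation (ddof sets the divisor to N - ddof)
snr_stat = x.mean() / x.std(ddof=0)

# Audio-style SNR in decibels: ratio of signal power to noise power
snr_db = 10 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))
print(snr_stat, snr_db)
```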

Central Limit Theory

No matter what distribution the samples are drawn from, their sample means follow (approximately) a normal distribution. (The Cauchy distribution does not obey this rule, since it has no finite mean or variance.)

Its advantage is that the distribution of the mean is independent of the original data distribution.
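A quick simulation sketch (an exponential source distribution assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
# 10,000 samples of size 50 from a clearly non-normal (exponential) distribution
sample_means = rng.exponential(scale=2.0, size=(10_000, 50)).mean(axis=1)

# The sample means cluster around the true mean (2.0) and look approximately normal
print(sample_means.mean(), sample_means.std())
```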

R2 (coefficient of determination)

Also based on sklearn. Most importantly, a validation standard is proposed: the coefficient of determination. In sklearn, it is called r2_score.

$$R^2=1-\frac{\sum_{i}{(y_i-\hat{y}_i)^2}}{\sum_{i}{(y_i-\bar{y})^2}}=1-\frac{MSE}{Var(y)}$$
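A quick check of the formula against sklearn (toy numbers assumed):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

print(r2_score(y_true, y_pred))

# The same value computed by hand from the definition above
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(1 - ss_res / ss_tot)
```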

A more comprehensive explanation is

You may wonder why SST = SSR + SSE

For more information, please refer to the link below:
https://blog.csdn.net/u012841922/article/details/78691825

Origin of exp (Euler's number)

It originated from the problem of maximal interest in investment.

Then, in the case that the basic interest rate is 1/n, we simply scale the rate by x, i.e., replace 1/n with x/n:
$$
\lim_{n \rightarrow \infty}{(1+x/n)^n}=\lim_{n \rightarrow \infty}{[(1+x/n)^{n/x}]^x}=e^{x}=\exp(x)
$$
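A numerical sanity check of the limit (x = 1.5 assumed):

```python
import numpy as np

x = 1.5
for n in (10, 1_000, 100_000):
    print(n, (1 + x / n) ** n)   # approaches exp(x) as n grows
print(np.exp(x))
```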

Proof of unbiased estimation of STDV

Why is there n-1 in the denominator of the estimate of the STDV (standard deviation)?

The intuitively correct STDV is a biased estimate, because it does not account for the error between the average of each sample and the real mean value of the entire distribution.

Let us set

TSS (Total Sum Square-error)
RSS (Real Sum Square-error)
ESS (Error Sum Square-error)

It can easily be proved that:

At the same time,

Taking out the denominator n, we get

It assumes the x_i are randomly distributed, so the cross terms over all i \ne j samples are negligible higher-order small quantities.

To wrap up:

Thus, the unbiased estimate using n-1 as the denominator is preferred when the sample size is small (say, < 30) or likely to be biased from the true distribution.
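A small simulation in the spirit of the code-validation reference below (sample size and counts are assumptions): averaging many small-sample variance estimates shows that dividing by n underestimates the true variance, while dividing by n-1 does not.

```python
import numpy as np

rng = np.random.default_rng(0)
true_std = 2.0
samples = rng.normal(loc=0.0, scale=true_std, size=(100_000, 5))   # many samples of size 5

var_biased = samples.var(axis=1, ddof=0).mean()     # divides by n
var_unbiased = samples.var(axis=1, ddof=1).mean()   # divides by n - 1

print(var_biased, var_unbiased, true_std ** 2)      # the ddof=1 estimate is close to 4.0
```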

Ref:

  1. Math deduction:
    http://www.360doc.com/content/18/0924/22/48898194_789378974.shtml
  2. Code validation:
    https://stats.stackexchange.com/questions/249688/why-are-we-using-a-biased-and-misleading-standard-deviation-formula-for-sigma

This might be the start of Complex Functions.
Great thanks to 3Blue1Brown's video for the inspiration.

When it comes to exp([imaginary numbers]),
it is more appropriate to consider exp as a polynomial with infinitely many factorial terms, as shown below, rather than as repeated multiplication of the operand by itself, as in the exp([real number]) case.

$$e^{x} = \sum_{n=0}^{\infty}{\frac{x^{n}}{n!}} = 1 + x + \frac{x^{2}}{2!} + \frac{x^{3}}{3!} + \cdots$$

This is defined by Taylor's series expansion.
In my view, these series expansions are centered at x=0 (Maclaurin series).
However, we won't put too much attention on it here.
For further info, you may go to this link on Zhihu.

exp to Euler's formula
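Substituting x = i\theta into the series and grouping the real and imaginary terms gives Euler's formula (a brief sketch of the standard derivation):

$$e^{i\theta} = \sum_{n=0}^{\infty}{\frac{(i\theta)^{n}}{n!}} = \left(1 - \frac{\theta^{2}}{2!} + \frac{\theta^{4}}{4!} - \cdots\right) + i\left(\theta - \frac{\theta^{3}}{3!} + \frac{\theta^{5}}{5!} - \cdots\right) = \cos{\theta} + i\sin{\theta}$$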

The proof of exp(a)*exp(b) = exp(a+b)

This is the homework from the 3Blue1Brown video.

The
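A sketch of the proof using the series definition and the binomial theorem (the rearrangement of terms is justified by absolute convergence):

$$e^{a}e^{b} = \sum_{m=0}^{\infty}{\frac{a^{m}}{m!}} \sum_{k=0}^{\infty}{\frac{b^{k}}{k!}} = \sum_{n=0}^{\infty}{\sum_{k=0}^{n}{\frac{a^{n-k}}{(n-k)!}\frac{b^{k}}{k!}}} = \sum_{n=0}^{\infty}{\frac{1}{n!}\sum_{k=0}^{n}{\binom{n}{k}a^{n-k}b^{k}}} = \sum_{n=0}^{\infty}{\frac{(a+b)^{n}}{n!}} = e^{a+b}$$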

QR factorization

A is a random matrix, R is an upper triangular matrix, and Q is an orthogonal matrix, usually built by the Gram-Schmidt method (finding a set of orthogonal vectors from any 3 linearly independent vectors).
This method takes advantage of the orthogonality of the Q matrix to numerically solve for the upper triangular factor R of A. R takes less storage space and its inverse is easier to find.
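A numpy sketch (a random 3x3 matrix assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))

Q, R = np.linalg.qr(A)   # A = Q @ R, with Q orthogonal and R upper triangular

print(np.allclose(A, Q @ R))             # True
print(np.allclose(Q.T @ Q, np.eye(3)))   # True: Q is orthogonal, so its inverse is Q.T

# Solving A x = b then reduces to the triangular system R x = Q.T @ b
b = rng.standard_normal(3)
x = np.linalg.solve(R, Q.T @ b)
print(np.allclose(A @ x, b))             # True
```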

Gaussian Process with Bayes rule

$$ P(w|y, x) = \frac{p(y, x, w)}{p(y, x)} = \frac{p(y|x,w)p(x)p(w)}{p(y|x)p(x)}=\frac{p(y|x,w)p(w)}{p(y|x)}$$
Here the joint is factorized as p(y, x, w) = p(y|x, w)p(x)p(w), which assumes the prior over the weights w is independent of the inputs x.

Stats and ML

Logistic Function

Consider a random animal species on an island; the population W is a function of time t, i.e., W(t).
In the short term, with abundant resources, the birth rate is a constant \beta. Mathematically, \frac{dW(t)}{dt} = \beta W(t)
In the long term, the birth rate will gradually drop to zero. Mathematically,
\frac{dW(t)}{dt} = (\beta - \phi) W(t)
Since the drop \phi will itself be a function of the population, there is
\frac{dW(t)}{dt} = (\omega - W(t)) \widetilde{\beta} W(t)
Notice that now \beta = \omega\widetilde{\beta}, where \omega has the meaning of the maximal population.
Now, a mathematical trick is to define the normalized ratio P(t)=W(t)/\omega.
Then \frac{dP(t)}{dt} = \frac{1}{\omega} \frac{dW(t)}{dt} = \frac{1}{\omega} (\omega - W(t)) \widetilde{\beta} W(t) = \omega\widetilde{\beta} P(t)(1-P(t))
Tidy it up to \frac{1}{P(t)(1-P(t))} dP(t) = \beta dt, and integrate both sides:
\log{\frac{P(t)}{1-P(t)}} = \beta t + \alpha
P(t) = \frac{\exp(\alpha + \beta t)}{1+\exp(\alpha + \beta t)}
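A numerical check that the closed form solves the ODE (alpha and beta are assumed toy values):

```python
import numpy as np
from scipy.integrate import solve_ivp

alpha, beta = -4.0, 1.5
logistic = lambda t: np.exp(alpha + beta * t) / (1 + np.exp(alpha + beta * t))

# Integrate dP/dt = beta * P * (1 - P) numerically and compare with the closed form
t_eval = np.linspace(0, 8, 50)
sol = solve_ivp(lambda t, p: beta * p * (1 - p), (0, 8), [logistic(0)], t_eval=t_eval)

print(np.allclose(sol.y[0], logistic(t_eval), atol=1e-3))  # True
```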

Get normal distribution from max entropy assumption

To be more mathematical, we paraphrase the task as below.

$$\widehat{p(x)} = \underset{p(x)}{\arg\max}{\left[-\int{p(x)\log{p(x)}\,dx}\right]} \quad s.t.\ \int{p(x)dx}=1,\ \int{xp(x)dx}=\mu,\ \int{(x-\mu)^{2}p(x)dx}=\sigma^{2}$$

The following part is not complete; I am too annoyed to redo it, since the original file was lost without auto-save.

Lagrange multiplier:
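A sketch of the Lagrangian, reconstructed from the three constraints above (the multipliers follow the \lambda_1, \lambda_2, \lambda_3 naming used below):

$$L = -\int{p(x)\log{p(x)}\,dx} + \lambda_{1}\left(\int{p(x)dx}-1\right) + \lambda_{2}\left(\int{xp(x)dx}-\mu\right) + \lambda_{3}\left(\int{(x-\mu)^{2}p(x)dx}-\sigma^{2}\right)$$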

The partial derivative \frac{\partial{L}}{\partial{p(x)}} is set to zero:
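A sketch of the resulting condition, consistent with the steps below:

$$\frac{\partial{L}}{\partial{p(x)}} = -\log{p(x)} - 1 + \lambda_{1} + \lambda_{2}x + \lambda_{3}(x-\mu)^{2} = 0 \;\Rightarrow\; p(x) = e^{\lambda_{1} - 1 + \lambda_{2}x + \lambda_{3}(x-\mu)^{2}}$$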

p(x) is even (symmetric about \mu), so \lambda_2=0.

Let e^{\lambda_1 - 1} = c, and \lambda = -\lambda_3

By polar-coordinate integration, \int{e^{-\frac{x^2}{2}}dx}=\sqrt{2\pi}

Then \lambda=\frac{1}{2\sigma^2}

Therefore, p(x)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}

Precision Recall F-score

Ref: {Link}

For a confusion matrix

Actual \ Predicted    Negative           Positive
Negative              True Negative      False Positive
Positive              False Negative     True Positive

Here, 'True' and 'False' describe whether the prediction matches the actual label.

The F-score is useful when seeking a balance between precision and recall, or when working with an imbalanced dataset.
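A short sklearn sketch (toy labels assumed):

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]

print(confusion_matrix(y_true, y_pred))   # rows: actual, columns: predicted
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')
print(precision, recall, f1)
```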

kernel function in SVM

Support vector machine
Kernel functions are applied to the input data, somewhat like convolution layers, for feature extraction.
A kernel function is an implicit form of a mapping function.
In practice, we define the kernel function rather than the mapping function, because computing the kernel function is simpler than computing the mapping function.
Ref: 《统计学习方法》- by 李航, from 清华大学
It also includes an example with an ellipse that may help.
I personally conclude that a kernel function maps the inner product of the input vectors into a higher-dimensional space, wrapping the input up like a kernel, so that the low-dimensional (usually 1-D) parameters (w and b) can be optimized in the non-linear case just as in the linear case.
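A minimal sklearn sketch of the kernel trick on a ring-shaped toy dataset (the data and hyperparameters are assumptions):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)   # not linearly separable in 2-D

# The RBF kernel implicitly maps the data to a higher-dimensional space,
# so a hyperplane there becomes a non-linear boundary in the original 2-D space.
clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)
print(clf.score(X, y))
```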
