Machine Learning Algorithms

Why does the perceptron converge when the dataset is linearly separable?

In the Perceptron algorithm, why do we use y - \hat{y} instead of \hat{y} - y?

Upcoming (p.85: Why does the MSE (Mean Squared Error) increase with each epoch when the learning rate is large?)

Why is ADALINE training called full-batch gradient descent?

The relationship between the logit function and Z = WX + b

How do we get the logistic sigmoid function from the logit function?

Why do we use 0.5 as the prediction threshold in the ADALINE algorithm?

Upcoming (p.87: Why does the MSE not become 0 even if all samples are perfectly classified?)

The difference between row-wise and column-wise preprocessing

An introduction to Lasso/Ridge regularization