📝 Added some notes on machine learning

tomit4 2026-04-16 18:35:03 -07:00
parent b102fb1fa5
commit 03e0c915ca
2 changed files with 184 additions and 0 deletions

@ -0,0 +1,81 @@
# Linear Regression
In linear regression, given features and labels (X, Y), where Y is real-valued,
we try to learn a function f(x) to predict Y given x. Figure 2 outlines this
function:
$$ \hat{y} = w_0 + w_1x_1 + w_2x_2 + \dots + w_mx_m = \mathbf{w}^T\mathbf{X} $$
_Figure 2: Learning Function_
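where $\mathbf{X} = x_1\text{, } \dots \text{, } x_m$ are the feature values and
$\mathbf{w} = w_0 \text{, } \dots \text{, } w_m$ can be seen as weights (for the
product $\mathbf{w}^T\mathbf{X}$ to include the intercept $w_0$, $\mathbf{X}$ is
extended with a constant $x_0 = 1$).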
The weights determine how the corresponding feature affects the predicted value.
Thus, our task is to find the appropriate values of **w**.
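As a rough illustration of the learning function in Figure 2, the sketch below
evaluates it with NumPy. The helper name `predict` and the example numbers are
made up for this note; the intercept $w_0$ is stored as the first entry of `w`.

```py
import numpy as np

def predict(w, X):
    """Return w_0 + w_1*x_1 + ... + w_m*x_m for every row of X.

    w has m + 1 entries (w[0] is the intercept); X has shape (n_samples, m).
    """
    w = np.asarray(w, dtype=float)
    X = np.asarray(X, dtype=float)
    return w[0] + X @ w[1:]

# Hypothetical weights and two samples with two features each
w = np.array([1.0, 2.0, -0.5])
X = np.array([[3.0, 4.0],
              [0.0, 2.0]])
y_hat = predict(w, X)
print(y_hat)  # 1 + 2*3 - 0.5*4 = 5 and 1 + 2*0 - 0.5*2 = 0
```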
**Cost function:** The cost function helps us find the best possible values for
**w**. For the cost function, we use the Mean Squared Error (MSE), shown in
Figure 3.
$$ MSE(\mathbf{w}) = \frac{1}{m}\sum_{i=1}^{m}{\left(\hat{y_i} - y_i\right)^2} $$
_Figure 3: MSE_
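The MSE in Figure 3 can be computed in a couple of lines. This is only a sketch:
the function name `mse` and the sample values are assumptions for illustration.

```py
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between predictions and labels."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

# Hypothetical labels and predictions: the errors are -0.5, 0.5 and 0
y_true = [3.0, -0.5, 2.0]
y_pred = [2.5, 0.0, 2.0]
error = mse(y_true, y_pred)
print(error)  # (0.25 + 0.25 + 0) / 3
```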
Here $m$ is the number of training examples (not the number of features, as in
Figure 2), and $\hat{y_i}$ is the prediction for the $i$-th example. Using this
MSE function, we update the values of **w** so that the MSE settles at its
minimum. The method of iteratively updating **w** to minimize the cost function
is called gradient descent: we initialize the values of **w** and then update
them step by step to reduce the cost. For some models the cost function is
non-convex, so gradient descent can settle at a local minimum, but for linear
regression the MSE is always convex. To update **w**, we compute the gradient of
the cost function, that is, its partial derivatives with respect to each $w_i$.
Figure 4 outlines this 'update rule'.
- Initialize $w_i$
- Repeat until convergence
$\{w_i := w_i - \alpha \times \frac{\partial MSE(\mathbf{w})}{\partial w_i}\}$
The parameter $\alpha$ is called the learning rate.
_Figure 4: Update Rule_
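The update rule in Figure 4 might be sketched as follows. The partial
derivatives of the MSE work out to $\frac{2}{m}\sum_i(\hat{y_i} - y_i)x_{ij}$,
which is the matrix expression used below. The function name, the synthetic
data, the learning rate, and the fixed iteration count (used here in place of a
real convergence check) are all assumptions for illustration.

```py
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=2000):
    """Minimize the MSE of a linear model with batch gradient descent."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples = X.shape[0]
    # Prepend a column of ones so that w[0] acts as the intercept w_0
    X_b = np.hstack([np.ones((n_samples, 1)), X])
    w = np.zeros(X_b.shape[1])          # initialize w_i
    for _ in range(n_iters):            # fixed budget instead of a convergence test
        residual = X_b @ w - y          # y_hat - y for every sample
        grad = (2.0 / n_samples) * (X_b.T @ residual)
        w -= alpha * grad               # w_i := w_i - alpha * dMSE/dw_i
    return w

# Synthetic, noise-free data generated from known weights (4, 3, -2)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = 4.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1]
w = gradient_descent(X, y)
print(w)  # should be close to the generating weights
```

If $\alpha$ is chosen too large, the updates overshoot and the cost grows instead
of shrinking, so in practice the learning rate usually needs some tuning.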
**Code:** To perform linear regression, we are going to use a Python module
called scikit-learn. In the following example, we will use the California
Housing Data Set. The data set is derived from the 1990 U.S. census and contains
one row per census block group in California.
There are 8 attributes for each **X**. Examples of these attributes include:
- MedInc Median income in the block group
- HouseAge Median house age in the block group
- AveRooms Average number of rooms per household
- Population Block group population
The target value **Y** is:
- MedHouseVal - Median house value in the block group, in units of 100,000 US
  dollars
Next, we split the data into training and testing sets. We train the model with
80% of the samples and test with the remaining 20%. Finally, we will evaluate
our model using MSE.
```py
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load the features (X) and the median house values (Y)
housing = fetch_california_housing()
X = housing['data']
Y = housing['target']

# Hold out 20% of the samples for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)

# Fit the model on the training set and predict on the test set
lr = LinearRegression()
lr.fit(X_train, Y_train)
Y_pred = lr.predict(X_test)

mse = mean_squared_error(Y_test, Y_pred)
print('Mean squared error for test set:', mse)
```

@ -0,0 +1,103 @@
# Classification
Logistic regression is used in classification problems. For example, an email
can be classified as belonging to one of two classes: 'spam' and 'not spam'.
Given features and labels (**x**, **Y**), where **Y** can take only discrete
values (we can also say that the target variable is categorical), we try to
learn a function f(x) to predict Y given x. Figure 5 outlines this function.
$$ \hat{y} = w_0 + w_1x_1 + w_2x_2 + \dots + w_mx_m = \mathbf{w}^T\mathbf{X} $$
where $\mathbf{X} = x_1 \text{, } \dots \text{, } x_m$ are the feature values
and $\mathbf{w} = w_0 \text{, } \dots \text{, } w_m$ can be seen as weights.
_Figure 5: Learning Function_
As in linear regression, the weights determine how the corresponding feature
affects the predicted value. Thus, our task is to find the appropriate values of
**w**.
In this binary classification problem, the final prediction must be one of two
values (either 0 or 1). To achieve this, we apply the sigmoid or logistic
function (Figure 6) to our learning function. The sigmoid function has the
domain of all real numbers and returns values strictly between 0 and 1. Unlike
in linear regression, the output is therefore interpreted as a probability, and
a class label is obtained by thresholding it (commonly at 0.5).
$$ \text{Sigmoid function: } \sigma(x) = \frac{1}{1 + e^{-x}} $$
$$ \text{Sigmoid applied to learning function: } \sigma(\hat{y}) = \sigma\left(\mathbf{w}^T\mathbf{X}\right) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{X}}} $$
$$ \text{Probability for } \mathbf{X} \text{ to belong in the positive class: } Pr\left(c_{+}\mid \mathbf{X}\right) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{X}}} $$
$$ \text{Probability for } \mathbf{X} \text{ to belong in the negative class: } Pr\left(c_{-}\mid \mathbf{X}\right) = 1 - Pr\left(c_{+}\mid \mathbf{X}\right) $$
_Figure 6: Sigmoid Function_
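The formulas in Figure 6 could be sketched like this. The helper names
(`sigmoid`, `predict_proba`, `predict_class`), the 0.5 threshold, and the example
weights are assumptions made for this note.

```py
import numpy as np

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, X):
    """Probability of the positive class; w[0] is the intercept w_0."""
    X = np.asarray(X, dtype=float)
    return sigmoid(w[0] + X @ w[1:])

def predict_class(w, X, threshold=0.5):
    """Turn the probability into a 0/1 label by thresholding."""
    return (predict_proba(w, X) >= threshold).astype(int)

# Hypothetical weights and three one-feature samples: w^T X = -1, 0 and 3
w = np.array([-1.0, 2.0])
X = np.array([[0.0], [0.5], [2.0]])
p_pos = predict_proba(w, X)   # Pr(c+ | X)
p_neg = 1.0 - p_pos           # Pr(c- | X)
labels = predict_class(w, X)
print(p_pos, p_neg, labels)
```

A score of exactly 0 gives a probability of 0.5, which is why 0.5 is the usual
decision threshold.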
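**Cost function**: Figure 7 outlines the cost function that is used in logistic
regression. It is derived from maximum likelihood estimation (it is the negative
log-likelihood, also known as log loss or cross-entropy). Here $m$ is the number
of training examples and $\hat{y}_i$ is the predicted probability of the
positive class for the $i$-th example.
$$ J(\mathbf{w}) = \frac{1}{m}\sum_{i=1}^{m}{-\left[y_i\log \hat{y}_i + \left(1 - y_i\right)\log\left(1 - \hat{y}_i\right)\right]} $$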
_Figure 7: Cost Function in Logistic Regression_
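A minimal sketch of this cost, written in the usual cross-entropy form with a
logarithm in both terms. The function name `log_loss`, the clipping constant
`eps` (added to avoid taking the log of 0), and the sample probabilities are
assumptions for illustration.

```py
import numpy as np

def log_loss(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of 0/1 labels under predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs for probabilities of exactly 0 or 1
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))

y_true = [1, 0, 1]
confident = log_loss(y_true, [0.9, 0.1, 0.8])  # mostly right, so a small cost
wrong = log_loss(y_true, [0.1, 0.9, 0.2])      # mostly wrong, so a large cost
print(confident, wrong)
```

The cost grows quickly when the model assigns a high probability to the wrong
class, which is what pushes the weights toward well-calibrated predictions.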
Using this cost function, we are going to update the values of **w**, such that
the J(w) value settles at the minimum. To obtain the values of **w**, we perform
the gradient descent algorithm. Figure 8 outlines the update rule of **w** in
logistic regression.
- Initialize $w_i$
- Repeat until convergence
$\{w_i := w_i - \alpha \cdot \frac{\partial J(\mathbf{w})}{\partial w_i}\}$
The parameter $\alpha$ is called the learning rate.
_Figure 8: Update Rule_
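The update rule in Figure 8 might look like the sketch below. For the
cross-entropy cost, the partial derivatives work out to
$\frac{1}{m}\sum_i(\hat{y}_i - y_i)x_{ij}$, which is what the matrix expression
computes. The function names, the synthetic two-cluster data, the learning rate,
and the fixed iteration count are assumptions for illustration; this is not how
scikit-learn fits the model internally.

```py
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, n_iters=1000):
    """Minimize the cross-entropy cost J(w) with batch gradient descent."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples = X.shape[0]
    # Prepend a column of ones so that w[0] acts as the intercept w_0
    X_b = np.hstack([np.ones((n_samples, 1)), X])
    w = np.zeros(X_b.shape[1])              # initialize w_i
    for _ in range(n_iters):                # fixed budget instead of a convergence test
        p = sigmoid(X_b @ w)                # predicted probabilities
        grad = X_b.T @ (p - y) / n_samples  # dJ/dw
        w -= alpha * grad                   # w_i := w_i - alpha * dJ/dw_i
    return w

# Synthetic data: two Gaussian clusters, one per class
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])
w = fit_logistic(X, y)
accuracy = np.mean((sigmoid(w[0] + X @ w[1:]) >= 0.5) == y)
print(w, accuracy)
```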
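**Code:** To perform logistic regression, we again use the scikit-learn module.
In the following example, we will use the Breast Cancer Wisconsin (Diagnostic)
Data Set. Ten measurements are taken for each cell nucleus, and the mean,
standard error, and worst value of each are recorded, giving 30 attributes for
every **X**. The measurements include: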
- radius (mean of distances from the center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
The **Y** classes are:
- WDBC-Malignant
- WDBC-Benign
Next, we split the data into training and testing sets. We train the model with
80% of the samples and test with the remaining 20%. Finally, we will evaluate
our model using precision and recall metrics. Intuitively, precision is the
ability of the classifier not to label as positive a sample that is negative,
and recall is the ability of the classifier to find all the positive samples.
Note that `precision_score` and `recall_score` treat the label 1 as the positive
class by default, which in scikit-learn's copy of this data set is the benign
class.
```py
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Load the features (X) and the class labels (Y)
data = load_breast_cancer()
X = data['data']
Y = data['target']

# Hold out 20% of the samples for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)

# The features are unscaled, so the default iteration limit (100) is typically
# too low for the solver to converge; raise it
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)

print('Recall:', recall_score(Y_test, Y_pred))
print('Precision:', precision_score(Y_test, Y_pred))
```
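To make the two metrics printed above more concrete, here is a sketch that
computes them from counts of true positives (TP), false positives (FP), and
false negatives (FN): precision is TP / (TP + FP) and recall is TP / (TP + FN).
The helper name `precision_recall` and the toy labels are assumptions for
illustration.

```py
import numpy as np

def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the given positive label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    # Guard against division by zero when nothing is predicted or labeled positive
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# Toy labels: 3 true positives, 1 false positive, 2 false negatives
y_true = [1, 1, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 3 / 4 and 3 / 5
```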
The disadvantage of the gradient descent algorithm described above is that every
iteration computes the gradient over all m training examples. If the training
set is very large, this is slow and memory inefficient, and it might crash if
the training set doesn't fit in memory. The Stochastic Gradient Descent
algorithm may be helpful in this case, as it uses a single example or a small
random sample (a mini-batch) of the training set to update the weights in each
iteration instead of the entire training set. This makes each iteration much
cheaper, at the cost of noisier updates.
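A mini-batch variant of stochastic gradient descent for logistic regression
might be sketched as follows. The function name, batch size, number of epochs,
learning rate, and synthetic data are assumptions for illustration. For
simplicity the whole array is kept in memory here; a real out-of-core setup
would read each batch from disk instead.

```py
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, alpha=0.1, n_epochs=20, batch_size=16, seed=0):
    """Fit logistic regression weights using mini-batch SGD."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples = X.shape[0]
    # Prepend a column of ones so that w[0] acts as the intercept w_0
    X_b = np.hstack([np.ones((n_samples, 1)), X])
    w = np.zeros(X_b.shape[1])
    for _ in range(n_epochs):
        order = rng.permutation(n_samples)   # reshuffle the samples every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            p = sigmoid(X_b[idx] @ w)
            # Gradient estimated from the current mini-batch only
            grad = X_b[idx].T @ (p - y[idx]) / len(idx)
            w -= alpha * grad
    return w

# Synthetic data: two Gaussian clusters, one per class
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])
w = sgd_logistic(X, y)
accuracy = np.mean((sigmoid(w[0] + X @ w[1:]) >= 0.5) == y)
print(w, accuracy)
```

Each update touches only `batch_size` rows, so the cost per step does not depend
on the size of the training set.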