From 03e0c915ca904b42ddffd34b27da81d1c14324ca Mon Sep 17 00:00:00 2001
From: tomit4
Date: Thu, 16 Apr 2026 18:35:03 -0700
Subject: [PATCH] :memo: Added some notes on machine learning

---
 .../machine_learning/linear_regression.md     |  81 ++++++++++++++
 .../machine_learning/logistic_regression.md   | 103 ++++++++++++++++++
 2 files changed, 184 insertions(+)
 create mode 100644 math_notes/machine_learning/linear_regression.md
 create mode 100644 math_notes/machine_learning/logistic_regression.md

diff --git a/math_notes/machine_learning/linear_regression.md b/math_notes/machine_learning/linear_regression.md
new file mode 100644
index 00000000..131cc8fa
--- /dev/null
+++ b/math_notes/machine_learning/linear_regression.md
@@ -0,0 +1,81 @@
+# Linear Regression
+
+In linear regression, given features and labels (X, Y), where Y is real-valued,
+we try to learn a function f(x) to predict Y given x. Figure 2 outlines this
+function:
+
+$$ \hat{y} = w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n = \mathbf{w}^T\mathbf{X} $$
+
+_Figure 2: Learning Function_
+
+where $\mathbf{X} = x_1\text{, } \dots \text{, } x_n$ are the feature values
+(with $x_0 = 1$ prepended so that $w_0$ acts as the intercept) and
+$\mathbf{w} = w_0 \text{, } \dots \text{, } w_n$ can be seen as weights.
+
+Each weight determines how its corresponding feature affects the predicted value.
+Thus, our task is to find the appropriate values of **w**.
+
+**Cost function:** The cost function helps us to figure out the best possible
+values for **w**. For the cost function, we use the Mean Squared Error (MSE)
+over the $m$ training examples, Figure 3.
+
+$$ MSE(\mathbf{w}) = \frac{1}{m}\sum_{i=1}^{m}{\left(\hat{y}_i - y_i\right)^2} $$
+
+_Figure 3: MSE_
+
+We update **w** so that the MSE settles at its minimum. This method of updating
+**w** to minimize the cost function (MSE) is called gradient descent: we
+initialize the values of **w** and then update them iteratively to reduce the cost.
+In general a cost function can be non-convex, in which case gradient descent may
+settle at a local minimum, but for linear regression the MSE is always convex.
+To update **w**, we take the gradient of the cost function, that is, its partial
+derivatives with respect to each $w_i$. Figure 4 outlines this 'update rule'.
+
+- Initialize $w_i$
+- Repeat until convergence
+  $\{w_i := w_i - \alpha \times \frac{\partial MSE(\mathbf{w})}{\partial w_i}\}$
+  The parameter $\alpha$ is called the learning rate.
+
+_Figure 4: Update Rule_
+
+**Code:** To perform linear regression, we use a Python module called
+scikit-learn. In the following example, we use the California Housing data set,
+which contains information about housing values in California census block
+groups (from the 1990 U.S. census).
+
+There are 8 attributes for each **X**. Examples of these attributes include:
+
+- MedInc: median income in the block group
+- HouseAge: median house age in the block group
+- AveRooms: average number of rooms per household
+- Population: block group population
+
+The target value **Y** is:
+
+- MedHouseVal: median house value in the block group, in units of $100,000
+
+Next, we split the data into training and testing sets. We train the model with
+80% of the samples and test with the remaining 20%. Finally, we evaluate the
+model using the MSE on the test set.
+
+```py
+from sklearn.datasets import fetch_california_housing
+from sklearn.linear_model import LinearRegression
+from sklearn.metrics import mean_squared_error
+from sklearn.model_selection import train_test_split
+
+# Load the features (X) and the median house values (Y).
+housing = fetch_california_housing()
+X = housing['data']
+Y = housing['target']
+
+X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)
+
+lr = LinearRegression()
+lr.fit(X_train, Y_train)
+Y_pred = lr.predict(X_test)
+
+mse = mean_squared_error(Y_test, Y_pred)
+
+print('Mean squared error for test set:', mse)
+```
diff --git a/math_notes/machine_learning/logistic_regression.md b/math_notes/machine_learning/logistic_regression.md
new file mode 100644
index 00000000..c679f853
--- /dev/null
+++ b/math_notes/machine_learning/logistic_regression.md
@@ -0,0 +1,103 @@
+# Classification
+
+Logistic regression is used in classification problems. For example, an email
+can be classified as belonging to one of two classes: 'spam' and 'not spam'.
+Given features and labels (**X**, **Y**), where **Y** can take only discrete
+values (we can also say that the target variable is categorical), we try to
+learn a function f(x) to predict Y given x. Figure 5 outlines this function.
+
+$$ \hat{y} = w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n = \mathbf{w}^T\mathbf{X} $$
+
+where $\mathbf{X} = x_1 \text{, } \dots \text{, } x_n$ are the feature values
+(with $x_0 = 1$ prepended so that $w_0$ acts as the intercept) and
+$\mathbf{w} = w_0 \text{, } \dots \text{, } w_n$ can be seen as weights.
+
+_Figure 5: Learning Function_
+
+As in linear regression, each weight determines how its feature affects the
+predicted value, so our task is to find the appropriate values of **w**.
+
+In this binary classification problem, the prediction must end up as one of two
+values (either 0 or 1). To achieve this, we apply the sigmoid, or logistic,
+function (Figure 6) to our function. The sigmoid function has the domain of all
+real numbers and returns a value strictly between 0 and 1.
+Unlike linear regression, using the sigmoid function we transform the output
+into a probability, which is thresholded (typically at 0.5) to pick the class.
+
+$$ \text{Sigmoid function: } \sigma(x) = \frac{1}{1 + e^{-x}} $$
+
+$$ \text{Sigmoid applied to learning function: } \sigma(\hat{y}) = \sigma\left(\mathbf{w}^T\mathbf{X}\right) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{X}}} $$
+
+$$ \text{Probability for } \mathbf{X} \text{ to belong to the positive class: } Pr\left(c_{+}\mid \mathbf{X}\right) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{X}}} $$
+
+$$ \text{Probability for } \mathbf{X} \text{ to belong to the negative class: } Pr\left(c_{-}\mid \mathbf{X}\right) = 1 - Pr\left(c_{+}\mid \mathbf{X}\right) $$
+
+_Figure 6: Sigmoid Function_
+
+**Cost function:** Figure 7 outlines the cost function used in logistic
+regression, the negative log-likelihood (from Maximum Likelihood estimation).
+
+$$ J(\mathbf{w}) = \frac{1}{m}\sum_{i=1}^{m}{-\left[y_i\log \sigma(\hat{y}_i) + \left(1 - y_i\right)\log\left(1 - \sigma(\hat{y}_i)\right)\right]} $$
+
+_Figure 7: Cost Function in Logistic Regression_
+
+As before, we use gradient descent to update the values of **w** so that
+$J(\mathbf{w})$ settles at its minimum. Figure 8 outlines the update rule for
+**w** in logistic regression.
+
+- Initialize $w_i$
+- Repeat until convergence
+  $\{w_i := w_i - \alpha \cdot \frac{\partial J(\mathbf{w})}{\partial w_i}\}$
+  The parameter $\alpha$ is called the learning rate.
+
+_Figure 8: Update Rule_
+
+**Code:** To perform logistic regression we again use scikit-learn, here with
+the Breast Cancer Wisconsin (Diagnostic) data set. Each **X** has 30 attributes:
+the mean, standard error, and worst value of 10 measurements, including:
+
+- radius (mean of distances from the center to points on the perimeter)
+- texture (standard deviation of gray-scale values)
+- perimeter
+- area
+- smoothness (local variation in radius lengths)
+
+The **Y** classes are:
+
+- WDBC-Malignant (label 0 in scikit-learn)
+- WDBC-Benign (label 1 in scikit-learn)
+
+Next, we split the data into training and testing sets.
+We train the model with 80% of the samples and test with the remaining 20%.
+Finally, we evaluate the model using precision and recall. Precision is the
+ability of the classifier not to label a negative sample as positive, and
+recall is the ability of the classifier to find all the positive samples.
+
+```py
+from sklearn.datasets import load_breast_cancer
+from sklearn.linear_model import LogisticRegression
+from sklearn.metrics import precision_score, recall_score
+from sklearn.model_selection import train_test_split
+
+# Labels: 0 = malignant, 1 = benign (so the scores below treat benign as positive).
+data = load_breast_cancer()
+X = data['data']
+Y = data['target']
+
+X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)
+# max_iter raised from the default (100), which is often too low on unscaled features.
+clf = LogisticRegression(max_iter=10000)
+clf.fit(X_train, Y_train)
+Y_pred = clf.predict(X_test)
+
+print('Recall:', recall_score(Y_test, Y_pred))
+print('Precision:', precision_score(Y_test, Y_pred))
+```
+
+The disadvantage of (batch) gradient descent is that each iteration computes m
+gradient terms, one for each of the m training examples. If the training set is
+very large, the algorithm becomes memory inefficient and might crash if the
+training set doesn't fit in memory. The Stochastic Gradient Descent algorithm
+may help in this case, as each iteration updates the weights using a single
+sample (or a small batch) of the training set instead of the entire training
+set. This makes each iteration much cheaper and often speeds up training.
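
The difference between the batch update rule (Figure 8) and a stochastic one can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the solver scikit-learn uses: the helper names (`predict`, `cost`, `batch_gd`, `sgd`), the synthetic data, and the learning rates are all hypothetical choices made for the example.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(w, x):
    """Probability of the positive class for one example x (x[0] is 1)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))


def cost(w, X, y):
    """Average negative log-likelihood J(w) over all m examples."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict(w, xi), eps), 1 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(y)


def batch_gd(X, y, alpha=0.5, iterations=100):
    """Batch gradient descent: each update sums over all m examples."""
    w = [0.0] * len(X[0])
    m = len(y)
    for _ in range(iterations):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj / m
        w = [wj - alpha * gj for wj, gj in zip(w, grad)]
    return w


def sgd(X, y, alpha=0.1, epochs=10, seed=0):
    """Stochastic gradient descent: each update uses a single example."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(y)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = predict(w, X[i]) - y[i]
            w = [wj - alpha * err * xj for wj, xj in zip(w, X[i])]
    return w


# Synthetic, illustrative data: x_0 = 1 (bias input) plus two random features.
data_rng = random.Random(42)
X = [[1.0, data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
y = [1.0 if x[1] + x[2] > 0 else 0.0 for x in X]

w_start = [0.0, 0.0, 0.0]
w_batch = batch_gd(X, y)
w_sgd = sgd(X, y)
print("initial cost: ", cost(w_start, X, y))
print("batch GD cost:", cost(w_batch, X, y))
print("SGD cost:     ", cost(w_sgd, X, y))
```

Both variants should finish with a cost below the starting value of log 2 (about 0.693). The stochastic version needs only one example per update, which is why it remains usable when the full training set does not fit in memory.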