From 03e0c915ca904b42ddffd34b27da81d1c14324ca Mon Sep 17 00:00:00 2001
From: tomit4 <mosssap@gmail.com>
Date: Thu, 16 Apr 2026 18:35:03 -0700
Subject: [PATCH] :memo: Added some notes on machine learning

---
 .../machine_learning/linear_regression.md     |  81 ++++++++++++++
 .../machine_learning/logistic_regression.md   | 103 ++++++++++++++++++
 2 files changed, 184 insertions(+)
 create mode 100644 math_notes/machine_learning/linear_regression.md
 create mode 100644 math_notes/machine_learning/logistic_regression.md

diff --git a/math_notes/machine_learning/linear_regression.md b/math_notes/machine_learning/linear_regression.md
new file mode 100644
index 00000000..131cc8fa
--- /dev/null
+++ b/math_notes/machine_learning/linear_regression.md
@@ -0,0 +1,81 @@
+# Linear Regression
+
+In linear regression, given features and labels (X, Y), where Y is real-valued,
+we try to learn a function f(x) to predict Y given x. Figure 2 outlines this
+function:
+
+$$ \hat{y} = w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n = \mathbf{w}^T\mathbf{X} $$
+
+_Figure 2: Learning Function_
+
+where $\mathbf{X} = (1, x_1, \dots, x_n)$ are the feature values, with a leading
+1 paired with the bias $w_0$, and $\mathbf{w} = (w_0, \dots, w_n)$ are the weights.
+
+The weights determine how the corresponding feature affects the predicted value.
+Thus, our task is to find the appropriate values of **w**.
+
+**Cost function:** The cost function helps us find the best possible values for
+**w**. We use the Mean Squared Error (MSE) over the $m$ training examples
+(Figure 3).
+
+$$ MSE(\mathbf{w}) = \frac{1}{m}\sum_{i=1}^{m}{\left(\hat{y}_i - y_i\right)^2} $$
+
+_Figure 3: MSE_
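+
+As a small worked example with made-up numbers (not taken from any data set),
+suppose $m = 3$, the predictions are $\hat{y} = (2.5, 0.0, 2.0)$ and the labels
+are $y = (3.0, -0.5, 2.0)$. The MSE is then:
+
+$$ MSE = \frac{1}{3}\left[(2.5 - 3.0)^2 + (0.0 - (-0.5))^2 + (2.0 - 2.0)^2\right] = \frac{0.25 + 0.25 + 0}{3} \approx 0.167 $$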
+
+Using the MSE, we update the values of **w** so that the cost settles at its
+minimum. The method we use to update **w** is called gradient descent: we
+initialize **w** and then update it iteratively, each time stepping in the
+direction that reduces the cost. In general a cost function can be non-convex,
+in which case gradient descent may settle at a local minimum, but for linear
+regression with MSE the cost is always convex. To update **w**, we compute the
+gradient of the cost function, that is, its partial derivatives with respect to
+each weight. Figure 4 outlines this 'update rule'.
+
+- Initialize $w_i$
+- Repeat until convergence
+  $\{w_i := w_i - \alpha \times \frac{\partial MSE(\mathbf{w})}{\partial w_i}\}$
+  Parameter $\alpha$ is called the learning rate.
+
+_Figure 4: Update Rule_
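+
+The partial derivative in the update rule can be written out explicitly. As a
+sketch, write $x_{ij}$ for feature $j$ of training example $i$ (notation
+introduced here, with $x_{i0} = 1$ for the bias term), so that
+$\hat{y}_i = \sum_j w_j x_{ij}$. The weight index is written $j$ to keep it
+apart from the example index $i$. Applying the chain rule to the MSE gives:
+
+$$ \frac{\partial MSE(\mathbf{w})}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}{2\left(\hat{y}_i - y_i\right)\frac{\partial \hat{y}_i}{\partial w_j}} = \frac{2}{m}\sum_{i=1}^{m}{\left(\hat{y}_i - y_i\right)x_{ij}} $$
+
+Each step therefore moves $w_j$ against the average of the prediction errors,
+weighted by feature $j$.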
+
+**Code:** To perform linear regression, we use the Python module scikit-learn.
+In the following example, we use the California Housing data set, which
+describes housing in California census block groups (derived from the 1990
+U.S. census).
+
+There are 8 attributes for each **X**. Examples of these attributes include:
+
+- MedInc: median income in the block group
+- HouseAge: median house age in the block group
+- AveRooms: average number of rooms per household
+- Population: block group population
+
+The target value **Y** is:
+
+- MedHouseVal: median house value, in hundreds of thousands of dollars
+
+Next, we split the data into training and testing sets. We train the model with
+80% of the samples and test with the remaining 20%. Finally, we will evaluate
+our model using MSE.
+
+```py
+from sklearn.datasets import fetch_california_housing
+from sklearn.linear_model import LinearRegression
+from sklearn.metrics import mean_squared_error
+from sklearn.model_selection import train_test_split
+
+# Load the features (8 columns) and the target (median house value).
+housing = fetch_california_housing()
+X = housing['data']
+Y = housing['target']
+
+# Hold out 20% of the samples for testing; fix the seed for reproducibility.
+X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)
+
+# Fit ordinary least squares on the training set and predict on the test set.
+lr = LinearRegression()
+lr.fit(X_train, Y_train)
+Y_pred = lr.predict(X_test)
+
+mse = mean_squared_error(Y_test, Y_pred)
+
+print('Mean squared error for test set:', mse)
+```
diff --git a/math_notes/machine_learning/logistic_regression.md b/math_notes/machine_learning/logistic_regression.md
new file mode 100644
index 00000000..c679f853
--- /dev/null
+++ b/math_notes/machine_learning/logistic_regression.md
@@ -0,0 +1,103 @@
+# Classification
+
+Logistic regression is used in classification problems. For example, an email
+can be classified as belonging to one of two classes: 'spam' and 'not spam'.
+Given features and labels (**X**, **Y**), where **Y** takes only discrete
+values (that is, the target variable is categorical), we try to learn a
+function f(x) to predict Y given x. Figure 5 outlines this function.
+
+$$ \hat{y} = w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n = \mathbf{w}^T\mathbf{X} $$
+
+where $\mathbf{X} = (1, x_1, \dots, x_n)$ are the feature values, with a leading
+1 paired with the bias $w_0$, and $\mathbf{w} = (w_0, \dots, w_n)$ are the weights.
+
+_Figure 5: Learning Function_
+
+As in linear regression, the weights determine how the corresponding feature
+affects the predicted value, thus our task is to find the appropriate values of
+**w**.
+
+In this binary classification problem, the prediction must be one of two values
+(either 0 or 1). To achieve this, we apply the sigmoid, or logistic, function to
+our learning function (Figure 6). The sigmoid's domain is all real numbers and
+its output lies strictly between 0 and 1, so unlike linear regression the output
+can be read as a probability; thresholding it (commonly at 0.5) gives the class.
+
+$$ \text{Sigmoid function: } \sigma(x) = \frac{1}{1 + e^{-x}} $$
+
+$$ \text{Sigmoid applied to learning function: } \sigma(\hat{y}) = \sigma\left(\mathbf{w}^T\mathbf{X}\right) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{X}}} $$
+
+$$ \text{Probability for } \mathbf{X} \text{ to belong in the positive class: } Pr\left(c_{+}\mid \mathbf{X}\right) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{X}}} $$
+
+$$ \text{Probability for } \mathbf{X} \text{ to belong in the negative class: } Pr\left(c_{-}\mid \mathbf{X}\right) = 1 - Pr\left(c_{+}\mid \mathbf{X}\right) $$
+
+_Figure 6: Sigmoid Function_
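+
+A few values, rounded to three decimals, illustrate how the sigmoid squashes
+its input:
+
+$$ \sigma(0) = \frac{1}{1 + e^{0}} = 0.5 \qquad \sigma(2) = \frac{1}{1 + e^{-2}} \approx 0.881 \qquad \sigma(-2) = \frac{1}{1 + e^{2}} \approx 0.119 $$
+
+Note that $\sigma(-x) = 1 - \sigma(x)$, which mirrors the relationship between
+the two class probabilities above.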
+
+**Cost function**: Figure 7 outlines the cost function used in logistic
+regression: the average negative log-likelihood, so minimizing it is maximum
+likelihood estimation.
+
+$$ J(\mathbf{w}) = \frac{1}{m}\sum_{i=1}^{m}{-\left[y_i\log \sigma(\hat{y}_i) + \left(1 - y_i\right)\log\left(1 - \sigma(\hat{y}_i)\right)\right]} $$
+
+_Figure 7: Cost Function in Logistic Regression_
+
+Using this cost function, we are going to update the values of **w**, such that
+the J(w) value settles at the minimum. To obtain the values of **w**, we perform
+the gradient descent algorithm. Figure 8 outlines the update rule of **w** in
+logistic regression.
+
+- Initialize $w_i$
+- Repeat until convergence
+  $\{w_i := w_i - \alpha \cdot \frac{\partial J(\mathbf{w})}{\partial w_i}\}$
+  Parameter $\alpha$ is called the learning rate.
+
+_Figure 8: Update Rule_
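+
+The partial derivative in this update rule has a compact form. As a sketch,
+assume the per-example cost is the cross-entropy term
+$-\left[y_i\log p_i + (1 - y_i)\log(1 - p_i)\right]$ with
+$p_i = \sigma(\hat{y}_i)$, and write $x_{ij}$ for feature $j$ of training
+example $i$ (notation introduced here, with $x_{i0} = 1$). Using
+$\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$ and the chain rule:
+
+$$ \frac{\partial J(\mathbf{w})}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}{-\left[y_i\left(1 - p_i\right) - \left(1 - y_i\right)p_i\right]x_{ij}} = \frac{1}{m}\sum_{i=1}^{m}{\left(\sigma(\hat{y}_i) - y_i\right)x_{ij}} $$
+
+Up to a constant factor, this has the same shape as the linear regression
+gradient, with the predicted probability taking the place of the linear output.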
+
+**Code:** To perform logistic regression, we again use the scikit-learn module.
+In the following example, we use the Breast Cancer Wisconsin (Diagnostic) Data
+Set. There are 30 features for every **X**: the mean, standard error, and worst
+value of 10 attributes, including:
+
+- radius (mean of distances from the center to points on the perimeter)
+- texture (standard deviation of gray-scale values)
+- perimeter
+- area
+- smoothness (local variation in radius lengths)
+
+The **Y** classes are:
+
+- WDBC-Malignant
+- WDBC-Benign
+
+Next, we split the data into training and testing sets. We train the model with
+80% of the samples and test with the remaining 20%. Finally, we evaluate our
+model using precision and recall. Intuitively, precision is the ability of the
+classifier not to label a negative sample as positive, and recall is its ability
+to find all the positive samples. By default, scikit-learn's `precision_score`
+and `recall_score` treat label 1 (benign in this data set) as the positive class.
+
+```py
+from sklearn.datasets import load_breast_cancer
+from sklearn.linear_model import LogisticRegression
+from sklearn.metrics import precision_score, recall_score
+from sklearn.model_selection import train_test_split
+
+# Load the 30 features and the binary target (0 = malignant, 1 = benign).
+data = load_breast_cancer()
+X = data['data']
+Y = data['target']
+
+# Hold out 20% of the samples for testing; fix the seed for reproducibility.
+X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=5)
+
+# The features are unscaled, so the default max_iter=100 may not converge;
+# raise the iteration limit to avoid a ConvergenceWarning.
+clf = LogisticRegression(max_iter=10000)
+clf.fit(X_train, Y_train)
+Y_pred = clf.predict(X_test)
+
+print('Recall:', recall_score(Y_test, Y_pred))
+print('Precision:', precision_score(Y_test, Y_pred))
+```
+
+The disadvantage of the gradient descent algorithm described above is that each
+iteration computes the gradient over all m training examples. If the training
+set is very large, this is slow and memory inefficient, and it might crash if
+the training set doesn't fit in memory. The Stochastic Gradient Descent
+algorithm may help in this case: in each iteration it uses a single randomly
+chosen example (or a small batch) instead of the entire training set to update
+the weights. This makes each iteration much cheaper and training typically faster.
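+
+As a sketch of the stochastic variant for logistic regression, assume the
+cross-entropy cost and write $x_{ij}$ for feature $j$ of training example $i$
+(notation introduced here). One stochastic step picks an index $i$ at random
+and updates every weight $w_j$ using that example alone:
+
+$$ w_j := w_j - \alpha \left(\sigma(\hat{y}_i) - y_i\right)x_{ij} $$
+
+Compared with the full update, the average over all $m$ examples is replaced by
+a single term, so the cost of one step no longer grows with $m$.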