SGD

GD

Gradient descent

Backtracking line search

In gradient descent, backtracking line search is an inexact line search method that automatically determines a step size $α$ (also written $η$) ensuring a "sufficient decrease" in the objective function at each iteration.

The Core Mechanism

Backtracking chooses the step size by starting with a relatively large value and repeatedly shrinking it until a sufficient-decrease condition is met.

Initialization: The algorithm typically starts with an initial step size $α = 1$.
The Armijo Condition: At each trial, it checks whether the new function value satisfies $f (x + α p) \leq f (x) + c α \nabla f (x)^{T} p$, where $p$ is the descent direction (for gradient descent, $p = -\nabla f (x)$), $α$ is the step size (learning rate), and $c \in (0, 1)$ is the Armijo constant. Because $\nabla f (x)^{T} p < 0$ for a descent direction, the condition requires the decrease in $f$ to be at least proportional to the step size and to the slope along $p$, which is what "sufficient decrease" means.
Contraction: If the condition is not met (the step is too large and may overshoot the minimum), the step size is multiplied by a contraction factor $τ \in (0, 1)$: $α \leftarrow τ α$.
Iteration: This process repeats until the Armijo condition is satisfied, at which point the current $α$ is accepted as the step size for that iteration.
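
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names, the defaults `c=1e-4` and `tau=0.5`, the `max_shrinks` safety cap, and the quadratic test problem are all assumptions chosen for the example.

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, alpha0=1.0, c=1e-4, tau=0.5, max_shrinks=50):
    """Return a step size satisfying the Armijo condition along p = -grad f(x)."""
    g = grad_f(x)
    p = -g                  # steepest-descent direction
    fx = f(x)
    slope = g @ p           # directional derivative; negative for a descent direction
    alpha = alpha0          # Initialization
    for _ in range(max_shrinks):
        # Armijo condition: f(x + alpha p) <= f(x) + c * alpha * grad f(x)^T p
        if f(x + alpha * p) <= fx + c * alpha * slope:
            break
        alpha *= tau        # Contraction
    return alpha

def gradient_descent(f, grad_f, x0, n_iters=100, tol=1e-8):
    """Gradient descent where every step size comes from backtracking."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        alpha = backtracking_line_search(f, grad_f, x)
        x = x - alpha * g
    return x

# Hypothetical test problem: f(x) = 0.5 * x^T A x, minimized at the origin.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x_min = gradient_descent(f, grad, x0=[3.0, -2.0], n_iters=200)
print(x_min)
```

A small `c` makes the test easy to pass, so most of the work is done by the contraction factor; a `tau` closer to 1 tries more candidate step sizes per iteration at the cost of extra function evaluations.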

SGD

Stochastic gradient descent

Stochastic gradient descent (often shortened as SGD) is a stochastic approximation of the gradient descent method for minimizing an objective function that is written as a sum of differentiable functions.
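
In symbols, one common way to write this setup is shown below. The notation ($w$ for the parameters, $f_i$ for the per-sample terms, $i_k$ for the index sampled at step $k$, and the $\frac{1}{n}$ scaling) is introduced here for illustration and is not fixed by the text above.

$$
f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w), \qquad w_{k+1} = w_k - \eta \, \nabla f_{i_k}(w_k)
$$

If $i_k$ is drawn uniformly from $\{1, \dots, n\}$, then $\mathbb{E}\left[\nabla f_{i_k}(w)\right] = \nabla f(w)$, so each update follows the full gradient in expectation while touching only one term of the sum.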

MSE

Below is the mathematical derivation of the gradient of the mean squared error (MSE) loss for a linear regression model:

1. Loss Function

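The loss function implemented by the code is the mean squared error (MSE). For $N$ samples, the loss $\mathcal{L}(\Theta)$ is the mean of the squared differences between the predictions and the true values: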

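$$
\mathcal{L}(\Theta) = \frac{1}{N} \| X\Theta - Y \|_2^2 = \frac{1}{N} \sum_{i=1}^N (x^{(i)T}\Theta - y^{(i)})^2
$$

Where:
- $X$ (`X_batch` in the code) is the input data matrix.
- $\Theta$ (`theta`) is the parameter vector.
- $Y$ (`y_batch`) is the vector of true labels.

2. Gradient Derivation

To update the parameters with gradient descent, we differentiate the loss with respect to $\Theta$. Using the chain rule of matrix calculus, the computation goes as follows:

1. **Differentiate the squared term**: Let the residual vector be $e = X\Theta - Y$. We need the derivative of $\|e\|_2^2 = e^Te$:

$$
\frac{\partial}{\partial \Theta} (e^Te) = 2e^T \frac{\partial e}{\partial \Theta}
$$

2. **Compute the inner derivative**: The derivative of the residual $e$ with respect to $\Theta$ is $X$, so the expression above equals $2e^{T}X$, a row vector. In the denominator layout of matrix calculus the gradient is written as the transposed column vector $2X^{T}e$, which is why $X^{T}$ appears.
3. **Substitute back and add the coefficient**: Combining these results and multiplying by the leading factor $\frac{1}{N}$:

$$
\nabla_{\Theta} \mathcal{L}(\Theta) = \frac{1}{N} \cdot 2 \cdot X^{T} (X\Theta - Y) = \frac{2}{N} X^{T} (X\Theta - Y)
$$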
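
One way to sanity-check the result $\frac{2}{N} X^{T}(X\Theta - Y)$ is to compare it with a central finite-difference approximation of the loss. The sketch below assumes small random data; the shapes, the seed, and the step `eps` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 3
X = rng.normal(size=(N, d))      # hypothetical data matrix
Y = rng.normal(size=N)           # hypothetical labels
theta = rng.normal(size=d)       # point at which the gradient is checked

def loss(theta):
    # MSE loss: (1/N) * ||X theta - Y||^2
    return np.sum((X @ theta - Y) ** 2) / N

# Closed-form gradient from the derivation
analytic = (2 / N) * X.T @ (X @ theta - Y)

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros(d)
for j in range(d):
    step = np.zeros(d)
    step[j] = eps
    numeric[j] = (loss(theta + step) - loss(theta - step)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

Because the loss is quadratic in $\Theta$, central differences carry no truncation error here, so any remaining gap between the two vectors is floating-point rounding.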

3. Mapping to Code

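import numpy as np

def mse_loss(theta, x, y):
    # Mean squared error over all N samples: (1/N) * ||x @ theta - y||^2
    N = len(y)
    predictions = x @ theta
    return (1/N) * np.sum((predictions - y)**2)

def grad_f(theta, X_batch, y_batch):
    # Gradient of the MSE loss evaluated on one batch
    N_b = len(y_batch)
    predictions = X_batch @ theta
    # Gradient: (2/N_b) * X^T @ (X @ theta - y)
    return (2/N_b) * X_batch.T @ (predictions - y_batch)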

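The implementation maps directly onto the matrix formula above:

2/N_b: the coefficient $\frac{2}{N}$ in the formula, with $N$ equal to the batch size.
X_batch.T: $X^{T}$ in the formula (the transpose of $X$).
predictions - y_batch: the residual term $(X Θ - Y)$ in the formula, where predictions is $X Θ$.
@: matrix multiplication.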

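Summary: Using matrix multiplication, this code computes in a single pass the sum of the gradient contributions of all samples in the current batch and divides by the number of samples to average them. This is the core computational step of **stochastic gradient descent (SGD) or mini-batch gradient descent (Mini-batch GD)**.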
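
To show where this gradient fits in a training loop, here is a minimal mini-batch SGD sketch. `grad_f` is repeated so the block runs on its own; `minibatch_sgd`, its hyperparameters (`lr`, `batch_size`, `n_epochs`, `seed`), and the noiseless synthetic data are assumptions made for the example, not part of the notes above.

```python
import numpy as np

def grad_f(theta, X_batch, y_batch):
    # MSE gradient on one batch: (2/N_b) * X^T @ (X @ theta - y)
    N_b = len(y_batch)
    return (2 / N_b) * X_batch.T @ (X_batch @ theta - y_batch)

def minibatch_sgd(X, y, lr=0.05, batch_size=16, n_epochs=50, seed=0):
    """Plain mini-batch SGD with a fixed learning rate (illustrative only)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    N = len(y)
    for _ in range(n_epochs):
        order = rng.permutation(N)                  # reshuffle every epoch
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]   # indices of the current batch
            theta -= lr * grad_f(theta, X[idx], y[idx])
    return theta

# Hypothetical noiseless data generated from a known parameter vector
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.5, -2.0, 0.5])
y = X @ true_theta

theta_hat = minibatch_sgd(X, y)
print(theta_hat)
```

Setting `batch_size=1` gives classic one-sample SGD, and setting it to `len(y)` recovers full-batch gradient descent. With noisy labels, a fixed learning rate would typically leave the iterates fluctuating around the least-squares solution rather than settling exactly on it.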