GD
Gradient Descent
Backtracking line search
In gradient descent, backtracking line search is an "inexact line search" method used to automatically determine a step size $t$ at each iteration.
The Core Mechanism
Backtracking automatically chooses a step size by starting with a relatively large value and iteratively shrinking it until a sufficient-decrease condition is met.
- **Initialization:** The algorithm typically starts with a relatively large initial step size, e.g. $t = 1$.
- **The Armijo Condition:** At each step, it checks whether the new function value satisfies the Armijo condition
  $$f(x + t\,\Delta x) \le f(x) + \alpha\, t\, \nabla f(x)^{\top} \Delta x,$$
  which for gradient descent ($\Delta x = -\nabla f(x)$) reads $f(x - t\,\nabla f(x)) \le f(x) - \alpha\, t\, \lVert \nabla f(x) \rVert_2^2$. This condition ensures that the reduction in the function value is proportional to the step size and the steepness of the gradient, guaranteeing "sufficient decrease." Here $\Delta x$ is the descent direction, $t$ is the step size (learning rate), and $\alpha \in (0, 1)$ is the Armijo constant.
- **Contraction:** If the condition is not met (meaning the step is too large and potentially overshooting the minimum), the step size is reduced by multiplying it by a contraction factor $\beta$ (where $0 < \beta < 1$), i.e. $t \leftarrow \beta t$, e.g. with $\beta = 0.5$.
- **Iteration:** This process repeats until the Armijo condition is satisfied, at which point the current $t$ is accepted as the step size for that iteration; see the sketch after this list.
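The procedure translates almost line-for-line into code. Below is a minimal NumPy sketch; the function name, the defaults $t_0 = 1$, $\alpha = 0.3$, $\beta = 0.5$, and the quadratic test objective are illustrative assumptions, not prescribed values.

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, t0=1.0, alpha=0.3, beta=0.5):
    """Shrink the step size t until the Armijo condition holds.

    f:      objective, f(x) -> float
    grad_f: gradient, grad_f(x) -> ndarray
    t0, alpha, beta are illustrative defaults, not prescribed values.
    """
    t = t0
    g = grad_f(x)
    g_norm_sq = g @ g  # ||grad f(x)||^2
    # Armijo: f(x - t*g) <= f(x) - alpha * t * ||g||^2
    while f(x - t * g) > f(x) - alpha * t * g_norm_sq:
        t *= beta  # contraction step: step was too large, shrink it
    return t

# Example: one gradient-descent step on a simple quadratic
f = lambda x: 0.5 * x @ x
grad = lambda x: x
x = np.array([3.0, -4.0])
t = backtracking_line_search(f, grad, x)
x_next = x - t * grad(x)
```

With $\beta = 0.5$, each rejected trial halves the step, so the inner loop typically terminates after only a handful of contractions.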
SGD
Stochastic Gradient Descent
Stochastic gradient descent (often abbreviated SGD) is a stochastic approximation of the gradient descent method for minimizing an objective function that is written as a sum of differentiable functions, $f(\theta) = \sum_{i=1}^{n} f_i(\theta)$.
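Each update then uses the gradient of a single randomly chosen summand $f_i$ in place of the full gradient. A minimal sketch; the function name `sgd`, the learning rate, the epoch count, and the toy objective are illustrative assumptions:

```python
import numpy as np

def sgd(grad_fi, theta, n, lr=0.01, epochs=10, seed=0):
    """Minimize f(theta) = sum_i f_i(theta), one random summand per update."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(n):                # shuffle each pass over the data
            theta = theta - lr * grad_fi(theta, i)  # single-sample gradient step
    return theta

# Toy usage: f_i(theta) = 0.5 * (theta - a_i)^2, so the minimizer is mean(a)
a = np.array([1.0, 2.0, 3.0, 4.0])
theta_star = sgd(lambda th, i: th - a[i], theta=0.0, n=len(a), lr=0.1, epochs=100)
# theta_star ends up close to a.mean() == 2.5
```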
MSE
The following is the mathematical derivation of the gradient of the mean squared error (MSE) loss function for a linear regression model:
1. Loss Function Definition
The loss function corresponding to the code below is the mean squared error (MSE). For $N$ samples with feature matrix $X \in \mathbb{R}^{N \times d}$, parameter vector $\theta \in \mathbb{R}^{d}$, and targets $y \in \mathbb{R}^{N}$:
$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( x_i^{\top}\theta - y_i \right)^2 = \frac{1}{N} \lVert X\theta - y \rVert_2^2$$
2. Gradient Derivation
Differentiating $L(\theta) = \frac{1}{N} \lVert X\theta - y \rVert_2^2$ by the chain rule:
- **Compute the inner derivative:** the derivative of the residual $X\theta - y$ with respect to $\theta$ is $X$; in the denominator layout of matrix calculus, this is usually written as $\frac{\partial (X\theta - y)}{\partial \theta} = X^{\top}$.
- **Substitute back and attach the coefficient:** combining this with the derivative of the square and the leading coefficient $\frac{1}{N}$ gives
$$\nabla_\theta L(\theta) = \frac{2}{N} X^{\top} (X\theta - y)$$
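As a quick sanity check (not part of the derivation above), the closed-form gradient can be compared against central finite differences on random data; the problem sizes, step, and tolerance below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(42)
N, d = 50, 3
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
theta = rng.normal(size=d)

loss = lambda th: (1 / N) * np.sum((X @ th - y) ** 2)
analytic = (2 / N) * X.T @ (X @ theta - y)  # the derived formula

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros(d)
for j in range(d):
    e = np.zeros(d)
    e[j] = eps
    numeric[j] = (loss(theta + e) - loss(theta - e)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```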
3. Mapping the Formula to Code
```python
import numpy as np

def mse_loss(theta, x, y):
    """Mean squared error: (1/N) * ||x @ theta - y||^2."""
    N = len(y)
    predictions = x @ theta
    return (1 / N) * np.sum((predictions - y) ** 2)

def grad_f(theta, X_batch, y_batch):
    """Gradient of the MSE on a mini-batch: (2/N_b) * X^T (X theta - y)."""
    N_b = len(y_batch)
    predictions = X_batch @ theta
    return (2 / N_b) * X_batch.T @ (predictions - y_batch)
```
The implementation corresponds directly to the matrix formula above:
- `2/N_b`: the coefficient $\frac{2}{N}$ in the formula.
- `X_batch.T`: $X^{\top}$ (the transpose of $X$) in the formula.
- `predictions - y_batch`: the residual term $X\theta - y$, where `predictions` is $X\theta$.
- `@`: matrix multiplication.
**Summary:** via a single matrix multiplication, this code computes the total gradient contribution of every sample in the current batch, then divides by the batch size to average it. This is exactly the core computation step of **stochastic gradient descent (SGD)** and **mini-batch gradient descent**.
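To make that concrete, here is one way to drive the `mse_loss` and `grad_f` functions defined above in a mini-batch loop; the synthetic data, learning rate, batch size, and epoch count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))
true_theta = rng.normal(size=d)
y = X @ true_theta + 0.1 * rng.normal(size=N)  # synthetic linear data with noise

theta = np.zeros(d)
lr, batch_size = 0.05, 32  # illustrative hyperparameters

for epoch in range(20):
    idx = rng.permutation(N)                      # reshuffle each epoch
    for start in range(0, N, batch_size):
        batch = idx[start:start + batch_size]
        theta -= lr * grad_f(theta, X[batch], y[batch])  # mini-batch step

print(mse_loss(theta, X, y))  # should approach the noise floor (~0.01)
```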