Model and Cost Function

1 Model Representation

To establish notation for future use, we’ll use x^(i) to denote the “input” variables (living area in this example), also called input features, and y^(i) to denote the “output” or target variable that we are trying to predict (price).

A pair (x^(i), y^(i)) is called a training example.
The dataset that we’ll be using to learn, a list of m training examples (x^(i), y^(i)); i = 1, …, m, is called a training set.
The superscript “(i)” in the notation is simply an index into the training set and has nothing to do with exponentiation.

We use X to denote the space of input values and Y to denote the space of output values. In this example, X = Y = ℝ.

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y.
For historical reasons, this function h is called a hypothesis. Pictorially, the training set is fed to a learning algorithm, which outputs h; h then maps an input x to a predicted y.

Regression problem: when the target variable that we’re trying to predict is continuous, such as in our housing example.
Classification problem: when y can take on only a small number of discrete values (for example if, given the living area, we wanted to predict whether a dwelling is a house or an apartment).
In short, this section introduced the notation for the dataset and the hypothesis h, the function obtained from training that takes an input x and returns a predicted y. It ends by introducing the linear regression equation.

2 Cost Function

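The cost function is a function that measures how closely the predicted values match the actual values.
We can measure the accuracy of our hypothesis function by using a cost function.
This takes an average difference (actually a fancier version of an average) of all the results of the hypothesis with inputs from the x’s and the actual outputs, the y’s.

First, be clear about the difference between the hypothesis function and the cost function.
When the hypothesis is linear, i.e. the linear regression equation hθ(x) = θ0 + θ1x, it is determined by two parameters: θ0 and θ1.

Our task is to choose values for these two parameters that minimize the cost function.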

$$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}_i-y_i\right)^2=\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x_i)-y_i\right)^2$$

To break it apart, it is (1/2)x̄, where x̄ is the mean of the squares of hθ(xi)−yi, the differences between the predicted values and the actual values.
This function is otherwise called the squared error function, or mean squared error.
The mean is halved (1/2) as a convenience for the computation of gradient descent, as the derivative of the square term cancels out the 1/2 factor.
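
To make the formula concrete, here is a minimal Python sketch of the squared-error cost for a linear hypothesis. The function names `hypothesis` and `compute_cost` and the toy data are illustrative assumptions, not part of the course material.

```python
def hypothesis(theta0, theta1, x):
    """Linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def compute_cost(theta0, theta1, xs, ys):
    """Squared-error cost J(theta0, theta1) = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    squared_errors = [
        (hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)
    ]
    return sum(squared_errors) / (2 * m)


# Made-up toy data that lies exactly on the line y = 1 + x.
xs = [1.0, 2.0, 4.0]
ys = [2.0, 3.0, 5.0]

cost_perfect = compute_cost(1.0, 1.0, xs, ys)  # line passes through every point
cost_offset = compute_cost(0.0, 1.0, xs, ys)   # every prediction is off by 1
print(cost_perfect, cost_offset)
```

With these assumed values the first call gives a cost of 0, and the second gives 3/(2·3) = 0.5, since each of the three squared errors equals 1.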
The following image summarizes what the cost function does:

3 Cost Function (I)

If we try to think of it in visual terms, our training data set is scattered on the x-y plane.
We are trying to make a straight line (defined by hθ(x)) which passes through these scattered data points.

Our objective is to get the best possible line. The best possible line is the one for which the average squared vertical distance of the scattered points from the line is smallest.

Ideally, the line should pass through all the points of our training data set. In such a case, the value of J(θ0,θ1) will be 0.

The following example shows the ideal situation where we have a cost function of 0.

When θ1=1, we get a slope of 1 which goes through every single data point in our model.
Conversely, when θ1=0.5, we see the vertical distance from our fit to the data points increase.

This increases our cost function to 0.58. Plotting the cost for several other values of θ1 yields the following graph:
Thus as a goal, we should try to minimize the cost function. In this case, θ1=1 is our global minimum.
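
The value 0.58 can be checked by hand. The worked arithmetic below assumes θ0 = 0 and three training points (1,1), (2,2), (3,3); these points are an assumption consistent with a slope-1 line fitting perfectly, since the original figure is not reproduced here.

```latex
J(0.5) = \frac{1}{2\cdot 3}\left[(0.5-1)^2 + (1-2)^2 + (1.5-3)^2\right]
       = \frac{0.25 + 1 + 2.25}{6}
       = \frac{3.5}{6} \approx 0.58,
\qquad
J(1) = \frac{1}{2\cdot 3}\left[0^2 + 0^2 + 0^2\right] = 0
```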

4 Cost Function (II)

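A contour plot is a graph that contains many contour lines. A contour line of a two-variable function has a constant value at all points on the same line.
Taking any color and going along the “circle”, one gets the same value of the cost function.
The circled x shows the value of the cost function for the graph on the left when θ0 = 800 and θ1 = -0.15.
Taking another h(x) and plotting its contour plot, one gets the following graphs.

For example, the three red points found on the green line above have the same value of J(θ0, θ1) and, as a result, lie along the same line.
When θ0 = 360 and θ1 = 0, the value of J(θ0, θ1) in the contour plot moves closer to the center, reducing the cost function error.
Giving our hypothesis function a slightly positive slope now results in a better fit to the data.

The graph above minimizes the cost function as much as possible; consequently, θ1 and θ0 come out at around 0.12 and 250 respectively.
Plotting those values on the right-hand graph puts our point at the center of the innermost “circle”.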

5 Gradient Descent

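We now have a hypothesis function and a way of measuring how well it fits the data.
Next we need to estimate the parameters of the hypothesis function.
That is where gradient descent comes in.
Imagine that we graph our hypothesis function based on its fields θ0 and θ1 (in fact, we are graphing the cost function as a function of the parameter estimates).
We are not graphing x and y themselves, but the parameter range of our hypothesis function and the cost that results from selecting a particular set of parameters.
We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis.
The points on our graph are the values of the cost function, using our hypothesis with those specific θ parameters.

We will know that we have succeeded when our cost function is at the very bottom of the pits in the graph, i.e. when its value is at a minimum.
The red arrows show the minimum points in the graph.
The way we do this is by taking the derivative (the tangent line to a function) of our cost function.
The slope of the tangent is the derivative at that point, and it gives us a direction to move in.
We step down the cost function in the direction of steepest descent.

The size of each step is determined by the parameter α, which is called the learning rate.
For example, the distance between each “star” in the graph above represents a step whose size is determined by α.
A smaller α results in a smaller step, and a larger α results in a larger step.
The direction in which the step is taken is determined by the partial derivatives of J(θ0, θ1). Depending on where one starts on the graph, one can end up at different points.
Two different starting points can end up in two different places.
The gradient descent algorithm is: repeat until convergence: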

$$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$$

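where j = 0, 1 represents the feature index number.
At each iteration, the parameters θ0, θ1, …, θn should be updated simultaneously.
Updating one parameter before computing the update for another within the same iteration yields a wrong implementation.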
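
The simultaneous update can be sketched in Python as follows. The helper `gradient_step` and the example cost J = θ0² + θ0θ1 + θ1² are hypothetical, chosen only because each partial derivative of that cost depends on both parameters, which makes the update order matter.

```python
def gradient_step(theta0, theta1, alpha, grad0, grad1):
    """One simultaneous gradient descent update.

    grad0 and grad1 are callables returning the partial derivatives of J
    with respect to theta0 and theta1 at the point (theta0, theta1).
    """
    # Evaluate both partial derivatives at the OLD parameter values first...
    temp0 = theta0 - alpha * grad0(theta0, theta1)
    temp1 = theta1 - alpha * grad1(theta0, theta1)
    # ...and only then hand back the new values together.
    return temp0, temp1


# Partial derivatives of the hypothetical cost J = t0^2 + t0*t1 + t1^2.
def dJ_dtheta0(t0, t1):
    return 2 * t0 + t1


def dJ_dtheta1(t0, t1):
    return t0 + 2 * t1


new_theta0, new_theta1 = gradient_step(1.0, 1.0, 0.1, dJ_dtheta0, dJ_dtheta1)
print(new_theta0, new_theta1)
```

Starting from (1, 1) with α = 0.1, both gradients equal 3, so both parameters move to about 0.7. Had θ0 been overwritten before computing the θ1 update, the second gradient would have been evaluated at (0.7, 1) and θ1 would have landed at about 0.73 instead.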

6 Gradient Descent: Key Points

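In this article we explored the scenario where we use a single parameter θ1 and plot its cost function to carry out gradient descent.
The formula for a single parameter is: repeat until convergence:

$$\theta_1 := \theta_1 - \alpha\frac{d}{d\theta_1}J(\theta_1)$$

Regardless of the sign of the slope d/dθ1 J(θ1), θ1 eventually converges to its minimum value.
The figure shows that when the slope is negative the value of θ1 increases, and when it is positive the value of θ1 decreases.
Figure panels: positive slope; negative slope.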

Choose Learning Rate α

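In addition, we should adjust the parameter α so that the gradient descent algorithm converges in a reasonable time.
Failing to converge, or taking too long to reach the minimum, means that our step size is wrong.

If J(θ) is decreasing but only slowly, the learning rate α should be increased: each step is too short, so it takes longer to reach the optimum, i.e. convergence is slow.
If α is too large, each step jumps past the optimum, the iterates move farther and farther from it, and J(θ) keeps increasing.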
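
The effect of α can be seen on a toy problem. The sketch below assumes the one-parameter cost J(θ) = θ², whose derivative is 2θ; the function name `run_gradient_descent`, the starting point, and the three α values are arbitrary illustrative choices.

```python
def run_gradient_descent(alpha, theta=1.0, steps=20):
    """Minimize J(theta) = theta^2 (derivative 2*theta); return the cost history."""
    costs = [theta ** 2]
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
        costs.append(theta ** 2)
    return costs


slow = run_gradient_descent(alpha=0.001)     # steps too short: cost barely moves
good = run_gradient_descent(alpha=0.1)       # cost falls quickly toward 0
diverging = run_gradient_descent(alpha=1.1)  # every step overshoots the minimum

for name, costs in (("slow", slow), ("good", good), ("diverging", diverging)):
    print(name, costs[0], costs[-1])
```

For this cost each update multiplies θ by (1 − 2α), so the iterates shrink only when that factor has magnitude below 1; with α = 1.1 the factor is −1.2 and the cost grows at every step.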

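How does gradient descent converge with a fixed step size α?

The intuition behind the convergence is that d/dθ1 J(θ1) approaches 0 as we approach the bottom of our convex function, so the steps shrink automatically even though α is fixed.
At the minimum, the derivative is always 0.

Thus we get:

$$\theta_1 := \theta_1 - \alpha \cdot 0$$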

7 Gradient Descent for Linear Regression

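Recall what we have covered so far, namely:

The gradient descent algorithm
The linear regression model
- Linear hypothesis
- Squared-error cost function

What we need to do is apply the gradient descent algorithm to the squared-error cost function of the linear regression model.
The key is the derivative term.

When applied specifically to linear regression, a new form of the gradient descent equations can be derived.
We can substitute our actual cost function and our actual hypothesis function and modify the equations to: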

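repeat until convergence: {

$$\begin{aligned}
\theta_0 &:= \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x_i)-y_i\right) \\
\theta_1 &:= \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(\left(h_\theta(x_i)-y_i\right)x_i\right)
\end{aligned}$$

}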

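where m is the size of the training set, θ0 is a constant that changes simultaneously with θ1, and xi, yi are the values of the given training set (data).
Note that we have separated the two cases for θj into separate partial-derivative equations for θ0 and θ1.

For θ1, we multiply by xi at the end because of the derivative.
The following is a derivation of ∂/∂θj J(θ) for a single example: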
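
A sketch of that derivation, assuming the single-example cost J(θ) = ½(hθ(x) − y)² with hθ(x) = θ0 + θ1x, and the convention x0 = 1 (the convention is an assumption introduced here so that both cases can be written in one line):

```latex
\begin{aligned}
\frac{\partial}{\partial\theta_j} J(\theta)
  &= \frac{\partial}{\partial\theta_j}\,\frac{1}{2}\bigl(h_\theta(x) - y\bigr)^2 \\
  &= 2\cdot\frac{1}{2}\bigl(h_\theta(x) - y\bigr)\cdot
     \frac{\partial}{\partial\theta_j}\bigl(h_\theta(x) - y\bigr) \\
  &= \bigl(h_\theta(x) - y\bigr)\cdot
     \frac{\partial}{\partial\theta_j}\bigl(\theta_0 + \theta_1 x - y\bigr) \\
  &= \bigl(h_\theta(x) - y\bigr)\,x_j, \qquad x_0 = 1,\; x_1 = x
\end{aligned}
```

For j = 0 the last factor is 1, and for j = 1 it is x, which matches the two update equations; averaging over the m training examples gives the 1/m sums.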

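The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent equations, our hypothesis becomes more and more accurate.
So this is simply gradient descent on the original cost function J.
This method looks at every example in the entire training set on every step, and is called batch gradient descent.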
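
Putting the update equations together, batch gradient descent for univariate linear regression can be sketched as below. The function name `batch_gradient_descent`, the zero initial guess, the fixed iteration count used in place of a convergence test, and the toy data are all assumptions made for illustration.

```python
def batch_gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent on the squared-error cost."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # initial guess
    for _ in range(iterations):
        # Prediction errors h(x_i) - y_i over the WHOLE training set (hence "batch").
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both new values are computed from the old ones.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Made-up data generated from y = 1 + 2x, so the minimum is at theta0 = 1, theta1 = 2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
fitted_theta0, fitted_theta1 = batch_gradient_descent(xs, ys)
print(fitted_theta0, fitted_theta1)
```

On this data the parameters approach 1 and 2. A fixed iteration count keeps the sketch short; a practical version would more likely stop once the change in J falls below a tolerance.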

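Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global optimum and no other local optima; thus gradient descent always converges to the global minimum (assuming the learning rate α is not too large).
Indeed, J is a convex quadratic function. Below is an example of gradient descent as it is run to minimize a quadratic function.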

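The ellipses shown above are the contours of a quadratic function.
Also shown is the trajectory taken by gradient descent, which was initialized at (48,30).
The x’s in the figure (joined by straight lines) mark the successive values of θ that gradient descent went through as it converged to its minimum.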

Andrew Ng Machine Learning Coursera Notes (2): Univariate Linear Regression