Linear Regression
Regression problems pop up whenever we want to predict a numerical value. Common examples include predicting prices (of homes, stocks, etc.), predicting the length of stay (for patients in the hospital), forecasting demand (for retail sales), among numerous others. Not every prediction problem is one of classical regression. Later on, we will introduce classification problems, where the goal is to predict membership among a set of categories.
As a running example, suppose that we wish to estimate the prices of houses (in dollars) based on their area (in square feet) and age (in years). To develop a model for predicting house prices, we need to get our hands on data, including the sales price, area, and age for each home. In the terminology of machine learning, the dataset is called a training dataset or training set, and each row (containing the data corresponding to one sale) is called an example (or data point, instance, sample). The thing we are trying to predict (price) is called a label (or target). The variables (age and area) upon which the predictions are based are called features (or covariates).
using Pkg;
Pkg.activate("../../d2lai")
using d2lai
using Plots
Basics
Linear regression is both the simplest and most popular among the standard tools for tackling regression problems. Dating back to the dawn of the 19th century [1], [2], linear regression flows from a few simple assumptions. First, we assume that the relationship between features $\mathbf{x}$ and target $y$ is approximately linear, i.e., that the conditional mean $E[Y \mid X = \mathbf{x}]$ can be expressed as a weighted sum of the features. This setup allows that the target value may still deviate from its expected value on account of observation noise. Next, we can impose the assumption that any such noise is well behaved, following a Gaussian distribution.
Model
At the heart of every solution is a model that describes how features can be transformed into an estimate of the target. The assumption of linearity means that the expected value of the target (price) can be expressed as a weighted sum of the features (area and age):
$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b.$$
:eqlabel:eq_price-area
Here, $w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are called weights and $b$ is called a bias (or offset or intercept). The weights determine the influence of each feature on our prediction, while the bias determines the value of the estimate when all features are zero. Strictly speaking, :eqref:eq_price-area is an affine transformation of input features, which is characterized by a linear transformation of features via a weighted sum, combined with a translation via the added bias. Given a dataset, our goal is to choose the weights $\mathbf{w}$ and the bias $b$ that, on average, make our model's predictions fit the true prices observed in the data as closely as possible.
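To make the affine form concrete, here is a minimal Julia sketch of the model in :eqref:eq_price-area; the weight and bias values below are made up purely for illustration, not fitted parameters.
w_area, w_age, b = 100.0, -500.0, 50_000.0   # made-up parameters, illustration only
price(area, age) = w_area * area + w_age * age + b
price(1500, 10)   # price estimate for a 1500-square-foot, 10-year-old house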
In disciplines where it is common to focus on datasets with just a few features, explicitly expressing models long-form, as in :eqref:eq_price-area, is common. In machine learning, we usually work with high-dimensional datasets, where it is more convenient to employ compact linear algebra notation. When our inputs consist of $d$ features, we can assign each an index (between $1$ and $d$) and express our prediction $\hat{y}$ (in general the "hat" symbol denotes an estimate) as

$$\hat{y} = w_1 x_1 + \cdots + w_d x_d + b.$$
Collecting all features into a vector $\mathbf{x} \in \mathbb{R}^d$ and all weights into a vector $\mathbf{w} \in \mathbb{R}^d$, we can express our model compactly via the dot product between $\mathbf{w}$ and $\mathbf{x}$:

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b.$$
:eqlabel:eq_linreg-y
In :eqref:eq_linreg-y, the vector $\mathbf{x}$ corresponds to the features of a single example. We will often find it convenient to refer to the features of our entire dataset of $n$ examples via the design matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$. Here, $\mathbf{X}$ contains one row for every example and one column for every feature. For a collection of features $\mathbf{X}$, the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ can be expressed via the matrix–vector product:

$$\hat{\mathbf{y}} = \mathbf{X} \mathbf{w} + b,$$
:eqlabel:eq_linreg-y-vec
where broadcasting (:numref:subsec_broadcasting) is applied during the summation. Given features of a training dataset $\mathbf{X}$ and corresponding (known) labels $\mathbf{y}$, the goal of linear regression is to find the weight vector $\mathbf{w}$ and the bias term $b$ such that, given features of a new data example sampled from the same distribution as $\mathbf{X}$, the new example's label will (in expectation) be predicted with the smallest error.
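In Julia, this matrix–vector form is a one-liner. Below is a minimal sketch on randomly generated stand-in data (the particular values of X, w, and b are arbitrary):
X = randn(5, 2)    # five examples, two features
w = [2.0, -0.5]    # illustrative weight vector
b = 1.0            # illustrative scalar bias
ŷ = X * w .+ b    # broadcasting adds the scalar bias to every prediction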
Even if we believe that the best model for predicting $y$ given $\mathbf{x}$ is linear, we would not expect to find a real-world dataset of $n$ examples where $y^{(i)}$ exactly equals $\mathbf{w}^\top \mathbf{x}^{(i)} + b$ for all $1 \leq i \leq n$. For example, whatever instruments we use to observe the features $\mathbf{X}$ and labels $\mathbf{y}$ might suffer a small amount of measurement error. Thus, even when we are confident that the underlying relationship is linear, we will incorporate a noise term to account for such errors.
Before we can go about searching for the best parameters (or model parameters) $\mathbf{w}$ and $b$, we will need two more things: (i) a measure of the quality of some given model; and (ii) a procedure for updating the model to improve its quality.
Loss Function
Naturally, fitting our model to the data requires that we agree on some measure of fitness (or, equivalently, of unfitness). Loss functions quantify the distance between the real and predicted values of the target. The loss will usually be a nonnegative number where smaller values are better and perfect predictions incur a loss of 0. For regression problems, the most common loss function is the squared error. When our prediction for an example $i$ is $\hat{y}^{(i)}$ and the corresponding true label is $y^{(i)}$, the squared error is given by:

$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2.$$
:eqlabel:eq_mse
The constant $\frac{1}{2}$ makes no real difference but proves to be notationally convenient, since it cancels out when we take the derivative of the loss. Because the training dataset is given to us, and thus is out of our control, the empirical error is only a function of the model parameters. In :numref:fig_fit_linreg, we visualize the fit of a linear regression model in a problem with one-dimensional inputs.
Fitting a linear regression model to one-dimensional data.
Note that large differences between estimates $\hat{y}^{(i)}$ and targets $y^{(i)}$ lead to even larger contributions to the loss, due to its quadratic form (this quadraticity can be a double-edged sword: while it encourages the model to avoid large errors, it can also lead to excessive sensitivity to anomalous data). To measure the quality of a model on the entire dataset of $n$ examples, we simply average (or equivalently, sum) the losses on the training set:

$$L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} l^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$

When training the model, we seek parameters ($\mathbf{w}^*, b^*$) that minimize the total loss across all training examples:

$$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\ L(\mathbf{w}, b).$$
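As a quick sanity check of these definitions, here is a minimal Julia sketch of the average squared loss; the helper name squared_loss is our own choice rather than part of any library.
# Average squared loss over a dataset, keeping the conventional 1/2 factor
squared_loss(ŷ, y) = sum((ŷ .- y) .^ 2) / (2 * length(y))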
Analytic Solution
Unlike most of the models that we will cover, linear regression presents us with a surprisingly easy optimization problem. In particular, we can find the optimal parameters (as assessed on the training data) analytically by applying a simple formula as follows. First, we can subsume the bias $b$ into the parameter $\mathbf{w}$ by appending a column of all ones to the design matrix. Then our prediction problem is to minimize $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$. As long as the design matrix $\mathbf{X}$ has full rank (no feature is linearly dependent on the others), there is just one critical point on the loss surface and it corresponds to the minimum of the loss over the entire domain.
Taking the derivative of the loss with respect to $\mathbf{w}$ and setting it equal to zero yields:

$$\partial_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 = 2 \mathbf{X}^\top (\mathbf{X} \mathbf{w} - \mathbf{y}) = 0 \quad \textrm{and hence} \quad \mathbf{X}^\top \mathbf{y} = \mathbf{X}^\top \mathbf{X} \mathbf{w}.$$

Solving for $\mathbf{w}$ provides us with the optimal solution for the optimization problem:

$$\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}.$$

This solution will only be unique when the matrix $\mathbf{X}^\top \mathbf{X}$ is invertible, i.e., when the columns of the design matrix are linearly independent [3].
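The normal equations translate directly into Julia. The sketch below uses synthetic data; the generating weights (2.0, -3.4) and bias (4.2) are arbitrary choices for illustration.
X = randn(100, 2)                                 # 100 examples, 2 features
y = X * [2.0, -3.4] .+ 4.2 .+ 0.01 .* randn(100)  # synthetic noisy labels
X1 = hcat(X, ones(100))                           # ones column absorbs the bias
ŵ = (X1' * X1) \ (X1' * y)                       # solve X'X w = X'y
# Equivalently, X1 \ y performs a numerically stabler least-squares solve.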
While simple problems like linear regression may admit analytic solutions, you should not get used to such good fortune. Although analytic solutions allow for nice mathematical analysis, the requirement of an analytic solution is so restrictive that it would exclude almost all exciting aspects of deep learning.
Minibatch Stochastic Gradient Descent
Fortunately, even in cases where we cannot solve the models analytically, we can still often train models effectively in practice. Moreover, for many tasks, those hard-to-optimize models turn out to be so much better that figuring out how to train them ends up being well worth the trouble.
The key technique for optimizing nearly every deep learning model, and which we will call upon throughout this book, consists of iteratively reducing the error by updating the parameters in the direction that incrementally lowers the loss function. This algorithm is called gradient descent.
The most naive application of gradient descent consists of taking the derivative of the loss function, which is an average of the losses computed on every single example in the dataset. In practice, this can be extremely slow: we must pass over the entire dataset before making a single update, even though the update steps might be very powerful [4]. Even worse, if there is a lot of redundancy in the training data, the benefit of a full update is limited.
The other extreme is to consider only a single example at a time and to take update steps based on one observation at a time. The resulting algorithm, stochastic gradient descent (SGD), can be an effective strategy [5], even for large datasets. Unfortunately, SGD has drawbacks, both computational and statistical. One problem arises from the fact that processors are a lot faster at multiplying and adding numbers than they are at moving data from main memory to processor cache. It is up to an order of magnitude more efficient to perform a matrix–vector multiplication than a corresponding number of vector–vector operations. This means that it can take a lot longer to process one sample at a time compared to a full batch. A second problem is that some layers, such as batch normalization (to be described in :numref:sec_batch_norm), only work well when we have access to more than one observation at a time.
The solution to both problems is to pick an intermediate strategy: rather than taking a full batch or only a single sample at a time, we take a minibatch of observations [6]. The specific choice of the size of the said minibatch depends on many factors, such as the amount of memory, the number of accelerators, the choice of layers, and the total dataset size. Despite all that, a number between 32 and 256, preferably a multiple of a large power of $2$, is a good start. This leads us to minibatch stochastic gradient descent.
In its most basic form, in each iteration $t$, we first randomly sample a minibatch $\mathcal{B}_t$ consisting of a fixed number $|\mathcal{B}|$ of training examples. We then compute the derivative (gradient) of the average loss on the minibatch with respect to the model parameters. Finally, we multiply the gradient by a predetermined small positive value $\eta$, called the learning rate, and subtract the resulting term from the current parameter values. We can express the update as follows:

$$(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \partial_{(\mathbf{w}, b)} l^{(i)}(\mathbf{w}, b).$$
In summary, minibatch SGD proceeds as follows: (i) initialize the values of the model parameters, typically at random; (ii) iteratively sample random minibatches from the data, updating the parameters in the direction of the negative gradient. For quadratic losses and affine transformations, this has a closed-form expansion:
$$\begin{aligned}
\mathbf{w} & \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right), \\
b & \leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \partial_b l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right).
\end{aligned}$$
:eqlabel:eq_linreg_batch_update
Since we pick a minibatch $\mathcal{B}$, we need to normalize by its size $|\mathcal{B}|$. Frequently, minibatch size and learning rate are user-defined. Such tunable parameters that are not updated in the training loop are called hyperparameters. They can be tuned automatically by a number of techniques, such as Bayesian optimization [7]. In the end, the quality of the solution is typically assessed on a separate validation dataset (or validation set).
After training for some predetermined number of iterations (or until some other stopping criterion is met), we record the estimated model parameters, denoted $\hat{\mathbf{w}}, \hat{b}$. Note that even if our function is truly linear and noiseless, these parameters will not be the exact minimizers of the loss, nor even deterministic. Although the algorithm converges slowly towards the minimizers, it typically does not find them exactly in a finite number of steps. Moreover, the minibatches used for updating the parameters are chosen at random, which breaks determinism.
Linear regression happens to be a learning problem with a global minimum (whenever $\mathbf{X}$ is full rank, or equivalently, whenever $\mathbf{X}^\top \mathbf{X}$ is invertible). However, the loss surfaces for deep networks contain many saddle points and minima. Fortunately, we typically do not care about finding an exact set of parameters but merely any set of parameters that leads to accurate predictions (and thus low loss). In practice, deep learning practitioners seldom struggle to find parameters that minimize the loss on training sets [8], [9]. The more formidable task is to find parameters that lead to accurate predictions on previously unseen data, a challenge called generalization. We return to these topics throughout the book.
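To make the procedure concrete, here is a minimal Julia sketch of minibatch SGD for linear regression on the synthetic data generated above; the function name sgd_linreg and the default hyperparameter values are our own illustrative choices.
using Random

function sgd_linreg(X, y; lr = 0.03, batchsize = 10, epochs = 30)
    n, d = size(X)
    w, b = zeros(d), 0.0
    for _ in 1:epochs
        # Sample minibatches by partitioning a random permutation of the indices
        for idx in Iterators.partition(randperm(n), batchsize)
            Xb, yb = X[idx, :], y[idx]
            err = Xb * w .+ b .- yb                 # prediction error on the batch
            w -= lr .* (Xb' * err) ./ length(idx)   # gradient step for the weights
            b -= lr * sum(err) / length(idx)        # gradient step for the bias
        end
    end
    return w, b
end
Run on the data from the analytic-solution sketch, sgd_linreg(X, y) should drive the parameters toward the generating values [2.0, -3.4] and 4.2, though, as noted above, not exactly to them.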
Predictions
Given the model $\hat{\mathbf{w}}^\top \mathbf{x} + \hat{b}$, we can now make predictions for a new example, e.g., predicting the sales price of a previously unseen house given its area $x_1$ and age $x_2$. Deep learning practitioners have taken to calling the prediction phase inference, but this is a bit of a misnomer: inference refers broadly to any conclusion reached on the basis of evidence, including both the values of the parameters and the likely label for an unseen instance. In this book, we will use prediction whenever possible.
Vectorization for Speed
When training our models, we typically want to process whole minibatches of examples simultaneously. Doing this efficiently requires that we vectorize the calculations and leverage fast linear algebra libraries rather than writing costly for-loops.
To see why this matters so much, let's consider two methods for adding vectors. To start, we instantiate two 100,000-dimensional vectors containing all 1s. In the first method, we loop over the vectors with a for-loop. In the second, we rely on a single broadcast call to +.
n = 100000
a = ones(n)
b = ones(n);

Now we can benchmark the workloads. First, we add them, one coordinate at a time, using a for-loop.
c = zeros(n)
@time begin
    for i in 1:n
        c[i] = a[i] + b[i]
    end
end

0.026434 seconds (598.98 k allocations: 10.666 MiB, 39.13% gc time)

Alternatively, we rely on the overloaded + operator to compute the elementwise sum.
@time c .= a .+ b;

0.000103 seconds (1 allocation: 32 bytes)

The second method is dramatically faster than the first. Vectorizing code often yields order-of-magnitude speedups. Moreover, we push more of the mathematics to the library so we do not have to write as many calculations ourselves, reducing the potential for errors and increasing portability of the code.
The Normal Distribution and Squared Loss
So far we have given a fairly functional motivation of the squared loss objective: the optimal parameters return the conditional expectation $E[Y \mid X]$ whenever the underlying pattern is truly linear, and the loss assigns large penalties for outliers. We can also provide a more formal motivation for the squared loss objective by making probabilistic assumptions about the distribution of noise.
Linear regression was invented at the turn of the 19th century. While it has long been debated whether Gauss or Legendre first thought up the idea, it was Gauss who also discovered the normal distribution (also called the Gaussian). It turns out that the normal distribution and linear regression with squared loss share a deeper connection than common parentage.
To begin, recall that a normal distribution with mean $\mu$ and variance $\sigma^2$ (standard deviation $\sigma$) is given as

$$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (x - \mu)^2\right).$$
Below we define a function to compute the normal distribution.
function normal(x, mu, sigma)
    p = 1 / sqrt(2 * pi * sigma^2)
    return p * exp(-(x - mu)^2 / (2 * sigma^2))
end

normal (generic function with 1 method)

x = -7:0.01:7
params = [(0, 1), (0, 2), (3, 1)]
plt = plot()
map(params) do (mu, sigma)
    plot!(x, normal.(x, Ref(mu), Ref(sigma)), label = "mean $mu sigma $sigma")
end
plt

Note that changing the mean corresponds to a shift along the $x$-axis, and increasing the variance spreads the distribution out, lowering its peak.
One way to motivate linear regression with squared loss is to assume that observations arise from noisy measurements, where the noise $\epsilon$ follows the normal distribution $\mathcal{N}(0, \sigma^2)$:

$$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon \quad \textrm{where } \epsilon \sim \mathcal{N}(0, \sigma^2).$$
Thus, we can now write out the likelihood of seeing a particular $y$ for a given $\mathbf{x}$ via

$$P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (y - \mathbf{w}^\top \mathbf{x} - b)^2\right).$$

As such, the likelihood factorizes. According to the principle of maximum likelihood, the best values of parameters $\mathbf{w}$ and $b$ are those that maximize the likelihood of the entire dataset:

$$P(\mathbf{y} \mid \mathbf{X}) = \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}).$$

The equality follows since all pairs $(\mathbf{x}^{(i)}, y^{(i)})$ were drawn independently of each other. Estimators chosen according to the principle of maximum likelihood are called maximum likelihood estimators. While maximizing the product of many exponential functions might look difficult, we can simplify things significantly, without changing the objective, by maximizing the logarithm of the likelihood instead. For historical reasons, optimizations are more often expressed as minimization rather than maximization. So, without changing anything, we can minimize the negative log-likelihood, which we can express as follows:

$$-\log P(\mathbf{y} \mid \mathbf{X}) = \sum_{i=1}^{n} \frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2.$$
If we assume that $\sigma$ is fixed, we can ignore the first term, because it does not depend on $\mathbf{w}$ or $b$. The second term is identical to the squared error loss introduced earlier, except for the multiplicative constant $\frac{1}{\sigma^2}$. Fortunately, the solution does not depend on $\sigma$ either. It follows that minimizing the mean squared error is equivalent to maximum likelihood estimation of a linear model under the assumption of additive Gaussian noise.
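As a sanity check of this equivalence, here is a minimal Julia sketch of the negative log-likelihood (the helper name nll is ours); up to the additive constant and the $\frac{1}{\sigma^2}$ factor, it matches the averaged squared loss from earlier.
# Negative log-likelihood under y = w'x + b + ε with ε ~ N(0, σ²)
nll(w, b, X, y; σ = 1.0) =
    length(y) * 0.5 * log(2π * σ^2) + sum((y .- X * w .- b) .^ 2) / (2σ^2)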
Linear Regression as a Neural Network
While linear models are not sufficiently rich to express the many complicated networks that we will introduce in this book, (artificial) neural networks are rich enough to subsume linear models as networks in which every feature is represented by an input neuron, all of which are connected directly to the output.
Figure depicts linear regression as a neural network. The diagram highlights the connectivity pattern, such as how each input is connected to the output, but not the specific values taken by the weights or biases.
Linear regression is a single-layer neural network.
The inputs are $x_1, \ldots, x_d$. We refer to $d$ as the number of inputs or the feature dimensionality in the input layer. The output of the network is $o_1$. Because we are just trying to predict a single numerical value, we have only one output neuron. Note that the input values are all given; there is just a single computed neuron. In summary, we can think of linear regression as a single-layer fully connected neural network. We will encounter networks with far more layers in later chapters.
Biology
Because linear regression predates computational neuroscience, it might seem anachronistic to describe linear regression in terms of neural networks. Nonetheless, they were a natural place to start when the cyberneticists and neurophysiologists Warren McCulloch and Walter Pitts began to develop models of artificial neurons. Consider the cartoonish picture of a biological neuron in Figure, consisting of dendrites (input terminals), the nucleus (CPU), the axon (output wire), and the axon terminals (output terminals), enabling connections to other neurons via synapses.
The real neuron (source: "Anatomy and Physiology" by the US National Cancer Institute's Surveillance, Epidemiology and End Results (SEER) Program).
Information $x_i$ arriving from other neurons (or environmental sensors) is received in the dendrites. In particular, that information is weighted by synaptic weights $w_i$, determining the effect of the inputs, e.g., activation or inhibition via the product $x_i w_i$. The weighted inputs arriving from multiple sources are aggregated in the nucleus as a weighted sum $y = \sum_i x_i w_i + b$, possibly subject to some nonlinear postprocessing via a function $\sigma(y)$. This information is then sent via the axon to the axon terminals, where it reaches its destination (e.g., an actuator such as a muscle) or is fed into another neuron via its dendrites.
Certainly, the high-level idea that many such units could be combined, provided they have the correct connectivity and learning algorithm, to produce far more interesting and complex behavior than any one neuron alone could express arises from our study of real biological neural systems. At the same time, most research in deep learning today draws inspiration from a much wider source. We invoke Russell and Norvig [10] who pointed out that although airplanes might have been inspired by birds, ornithology has not been the primary driver of aeronautics innovation for some centuries. Likewise, inspiration in deep learning these days comes in equal or greater measure from mathematics, linguistics, psychology, statistics, computer science, and many other fields.
Summary
In this section, we introduced traditional linear regression, where the parameters of a linear function are chosen to minimize squared loss on the training set. We also motivated this choice of objective both via some practical considerations and through an interpretation of linear regression as maximum likelihood estimation under an assumption of linearity and Gaussian noise. After discussing both computational considerations and connections to statistics, we showed how such linear models could be expressed as simple neural networks where the inputs are directly wired to the output(s). While we will soon move past linear models altogether, they are sufficient to introduce most of the components that all of our models require: parametric forms, differentiable objectives, optimization via minibatch stochastic gradient descent, and ultimately, evaluation on previously unseen data.
Exercises
1. Assume that we have some data $x_1, \ldots, x_n \in \mathbb{R}$. Our goal is to find a constant $b$ such that $\sum_i (x_i - b)^2$ is minimized.
    1. Find an analytic solution for the optimal value of $b$.
    2. How does this problem and its solution relate to the normal distribution?
    3. What if we change the loss from $\sum_i (x_i - b)^2$ to $\sum_i |x_i - b|$? Can you find the optimal solution for $b$?
2. Prove that the affine functions that can be expressed by $\mathbf{x}^\top \mathbf{w} + b$ are equivalent to linear functions on $(\mathbf{x}, 1)$.
3. Assume that you want to find quadratic functions of $\mathbf{x}$, i.e., $f(\mathbf{x}) = b + \sum_i w_i x_i + \sum_{j \leq i} w_{ij} x_i x_j$. How would you formulate this in a deep network?
4. Recall that one of the conditions for the linear regression problem to be solvable was that the design matrix $\mathbf{X}^\top \mathbf{X}$ has full rank.
    1. What happens if this is not the case?
    2. How could you fix it? What happens if you add a small amount of coordinate-wise independent Gaussian noise to all entries of $\mathbf{X}$?
    3. What is the expected value of the design matrix $\mathbf{X}^\top \mathbf{X}$ in this case?
    4. What happens with stochastic gradient descent when $\mathbf{X}^\top \mathbf{X}$ does not have full rank?
5. Assume that the noise model governing the additive noise $\epsilon$ is the exponential distribution, that is, $p(\epsilon) = \frac{1}{2} \exp(-|\epsilon|)$.
    1. Write out the negative log-likelihood of the data under the model, $-\log P(\mathbf{y} \mid \mathbf{X})$.
    2. Can you find a closed-form solution?
    3. Suggest a minibatch stochastic gradient descent algorithm to solve this problem. What could possibly go wrong (hint: what happens near the stationary point as we keep on updating the parameters)? Can you fix this?
6. Assume that we want to design a neural network with two layers by composing two linear layers, that is, the output of the first layer becomes the input of the second layer. Why would such a naive composition not work?
7. What happens if you want to use regression for realistic price estimation of houses or stock prices?
    1. Show that the additive Gaussian noise assumption is not appropriate. Hint: can we have negative prices? What about fluctuations?
    2. Why would regression to the logarithm of the price be much better, i.e., $y = \log(\textrm{price})$?
    3. What do you need to worry about when dealing with penny stocks, i.e., stocks with very low prices? Hint: can you trade at all possible prices? Why is this a bigger problem for cheap stocks? For more information review the celebrated Black–Scholes model for option pricing [11].
8. Suppose we want to use regression to estimate the number of apples sold in a grocery store.
    1. What are the problems with a Gaussian additive noise model? Hint: you are selling apples, not oil.
    2. The Poisson distribution captures distributions over counts. It is given by $p(k \mid \lambda) = \lambda^k e^{-\lambda} / k!$. Here $\lambda$ is the rate function and $k$ is the number of events you see. Prove that $\lambda$ is the expected value of counts $k$.
    3. Design a loss function associated with the Poisson distribution.
    4. Design a loss function for estimating $\log \lambda$ instead.