.. _sec_linear_regression:

Linear Regression
=================

*Regression* refers to a set of methods for modeling the relationship between one or more independent variables and a dependent variable. In the natural sciences and social sciences, the purpose of regression is most often to *characterize* the relationship between the inputs and outputs. Machine learning, on the other hand, is most often concerned with *prediction*.

Regression problems pop up whenever we want to predict a numerical value. Common examples include predicting prices (of homes, stocks, etc.), predicting length of stay (for patients in the hospital), demand forecasting (for retail sales), among countless others. Not every prediction problem is a classic regression problem. In subsequent sections, we will introduce classification problems, where the goal is to predict membership among a set of categories.

Basic Elements of Linear Regression
-----------------------------------

*Linear regression* may be both the simplest and most popular among the standard tools for regression. Dating back to the dawn of the 19th century, linear regression flows from a few simple assumptions. First, we assume that the relationship between the independent variables :math:`\mathbf{x}` and the dependent variable :math:`y` is linear, i.e., that :math:`y` can be expressed as a weighted sum of the elements in :math:`\mathbf{x}`, given some noise on the observations. Second, we assume that any noise is well-behaved (following a Gaussian distribution).

To motivate the approach, let us start with a running example. Suppose that we wish to estimate the prices of houses (in dollars) based on their area (in square feet) and age (in years). To actually develop a model for predicting house prices, we would need to get our hands on a dataset consisting of sales for which we know the sale price, area, and age for each home.
In the terminology of machine learning, the dataset is called a *training dataset* or *training set*, and each row (here the data corresponding to one sale) is called an *example* (or *data point*, *data instance*, *sample*). The thing we are trying to predict (price) is called a *label* (or *target*). The independent variables (age and area) upon which the predictions are based are called *features* (or *covariates*).

Typically, we will use :math:`n` to denote the number of examples in our dataset. We index the data examples by :math:`i`, denoting each input as :math:`\mathbf{x}^{(i)} = [x_1^{(i)}, x_2^{(i)}]^\top` and the corresponding label as :math:`y^{(i)}`.

.. _subsec_linear_model:

Linear Model
~~~~~~~~~~~~

The linearity assumption just says that the target (price) can be expressed as a weighted sum of the features (area and age):

.. math::
   :label: eq_price-area

   \mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b.

In :eq:`eq_price-area`, :math:`w_{\mathrm{area}}` and :math:`w_{\mathrm{age}}` are called *weights*, and :math:`b` is called a *bias* (also called an *offset* or *intercept*). The weights determine the influence of each feature on our prediction and the bias just says what value the predicted price should take when all of the features take value 0. Even if we will never see any homes with zero area, or that are precisely zero years old, we still need the bias or else we will limit the expressivity of our model. Strictly speaking, :eq:`eq_price-area` is an *affine transformation* of input features, which is characterized by a *linear transformation* of features via weighted sum, combined with a *translation* via the added bias.

Given a dataset, our goal is to choose the weights :math:`\mathbf{w}` and the bias :math:`b` such that on average, the predictions made according to our model best fit the true prices observed in the data.
Models whose output prediction is determined by the affine transformation of input features are *linear models*, where the affine transformation is specified by the chosen weights and bias.

In disciplines where it is common to focus on datasets with just a few features, explicitly expressing models long-form like this is common. In machine learning, we usually work with high-dimensional datasets, so it is more convenient to employ linear algebra notation. When our inputs consist of :math:`d` features, we express our prediction :math:`\hat{y}` (in general the "hat" symbol denotes estimates) as

.. math::

   \hat{y} = w_1 x_1 + \cdots + w_d x_d + b.

Collecting all features into a vector :math:`\mathbf{x} \in \mathbb{R}^d` and all weights into a vector :math:`\mathbf{w} \in \mathbb{R}^d`, we can express our model compactly using a dot product:

.. math::
   :label: eq_linreg-y

   \hat{y} = \mathbf{w}^\top \mathbf{x} + b.

In :eq:`eq_linreg-y`, the vector :math:`\mathbf{x}` corresponds to features of a single data example. We will often find it convenient to refer to features of our entire dataset of :math:`n` examples via the *design matrix* :math:`\mathbf{X} \in \mathbb{R}^{n \times d}`. Here, :math:`\mathbf{X}` contains one row for every example and one column for every feature.

For a collection of features :math:`\mathbf{X}`, the predictions :math:`\hat{\mathbf{y}} \in \mathbb{R}^n` can be expressed via the matrix-vector product:

.. math::

   {\hat{\mathbf{y}}} = \mathbf{X} \mathbf{w} + b,

where broadcasting (see :numref:`subsec_broadcasting`) is applied during the summation. Given features of a training dataset :math:`\mathbf{X}` and corresponding (known) labels :math:`\mathbf{y}`, the goal of linear regression is to find the weight vector :math:`\mathbf{w}` and the bias term :math:`b` such that, given features of a new data example sampled from the same distribution as :math:`\mathbf{X}`, the new example's label will (in expectation) be predicted with the lowest error.
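
As a minimal sketch of the matrix-vector form above, the following NumPy snippet computes predictions for a small design matrix. The data, weights, and bias are made-up values chosen only for illustration, and ``predict`` is a hypothetical helper name, not something defined elsewhere in this chapter.

```python
import numpy as np

# Hypothetical toy data: 3 houses, features are (area in sq ft, age in years).
X = np.array([[1500.0, 10.0],
              [2000.0, 5.0],
              [850.0, 40.0]])    # design matrix of shape (n, d) = (3, 2)
w = np.array([200.0, -1000.0])   # made-up weights (dollars per sq ft, dollars per year)
b = 50000.0                      # made-up bias (dollars)


def predict(X, w, b):
    """Return y_hat = X w + b; the scalar b is broadcast over all n rows."""
    return X @ w + b


y_hat = predict(X, w, b)

# The same predictions, computed one example at a time with a dot product.
y_hat_loop = np.array([w @ x + b for x in X])

print(y_hat)
```

Both computations agree; the first handles all :math:`n` examples in a single matrix-vector product, with the scalar bias added to every entry by broadcasting.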

Even if we believe that the best model for predicting :math:`y` given :math:`\mathbf{x}` is linear, we would not expect to find a real-world dataset of :math:`n` examples where :math:`y^{(i)}` exactly equals :math:`\mathbf{w}^\top \mathbf{x}^{(i)}+b` for all :math:`1 \leq i \leq n`. For example, whatever instruments we use to observe the features :math:`\mathbf{X}` and labels :math:`\mathbf{y}` might suffer from a small amount of measurement error. Thus, even when we are confident that the underlying relationship is linear, we will incorporate a noise term to account for such errors.

Before we can go about searching for the best *parameters* (or *model parameters*) :math:`\mathbf{w}` and :math:`b`, we will need two more things: (i) a quality measure for some given model; and (ii) a procedure for updating the model to improve its quality.

Loss Function
~~~~~~~~~~~~~

Before we start thinking about how to *fit* data with our model, we need to determine a measure of *fitness*. The *loss function* quantifies the distance between the *real* and *predicted* value of the target. The loss will usually be a non-negative number where smaller values are better and perfect predictions incur a loss of 0. The most popular loss function in regression problems is the squared error. When our prediction for an example :math:`i` is :math:`\hat{y}^{(i)}` and the corresponding true label is :math:`y^{(i)}`, the squared error is given by:

.. math::
   :label: eq_mse

   l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2.

The constant :math:`\frac{1}{2}` makes no real difference but will prove notationally convenient, canceling out when we take the derivative of the loss. Since the training dataset is given to us, and thus out of our control, the empirical error is only a function of the model parameters. To make things more concrete, consider the example below where we plot a regression problem for a one-dimensional case as shown in :numref:`fig_fit_linreg`.

.. _fig_fit_linreg:

.. figure:: ../img/fit-linreg.svg

   Fit data with a linear model.

Note that large differences between estimates :math:`\hat{y}^{(i)}` and observations :math:`y^{(i)}` lead to even larger contributions to the loss, due to the quadratic dependence. To measure the quality of a model on the entire dataset of :math:`n` examples, we simply average (or equivalently, sum) the losses on the training set.

.. math::

   L(\mathbf{w}, b) =\frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) =\frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.

When training the model, we want to find parameters (:math:`\mathbf{w}^*, b^*`) that minimize the total loss across all training examples:

.. math::

   \mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\ L(\mathbf{w}, b).

Analytic Solution
~~~~~~~~~~~~~~~~~

Linear regression happens to be an unusually simple optimization problem. Unlike most other models that we will encounter in this book, linear regression can be solved analytically by applying a simple formula. To start, we can subsume the bias :math:`b` into the parameter :math:`\mathbf{w}` by appending a column to the design matrix consisting of all ones. Then our prediction problem is to minimize :math:`\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2`. Provided that :math:`\mathbf{X}^\top \mathbf{X}` is invertible, there is just one critical point on the loss surface and it corresponds to the minimum of the loss over the entire domain. Taking the derivative of the loss with respect to :math:`\mathbf{w}` and setting it equal to zero yields the analytic (closed-form) solution:

.. math::

   \mathbf{w}^* = (\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf{y}.

While simple problems like linear regression may admit analytic solutions, you should not get used to such good fortune. Although analytic solutions allow for nice mathematical analysis, the requirement of an analytic solution is so restrictive that it would exclude all of deep learning.
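
The closed-form solution can be checked numerically. The sketch below is an illustration under stated assumptions: the data are synthetic, the "true" parameters ``w_true`` and ``b_true`` are arbitrary, and ``mean_squared_loss`` is a hypothetical helper implementing the averaged squared error with the factor :math:`\frac{1}{2}`. Instead of forming the inverse explicitly, it solves the equivalent linear system :math:`\mathbf{X}^\top \mathbf{X} \mathbf{w} = \mathbf{X}^\top \mathbf{y}` with ``np.linalg.solve``, which is a common numerical substitute for the formula as written.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = X w_true + b_true + noise.
n, d = 200, 2
w_true = np.array([2.0, -3.4])
b_true = 4.2
X = rng.normal(size=(n, d))
y = X @ w_true + b_true + 0.01 * rng.normal(size=n)


def mean_squared_loss(X, y, w, b):
    """Average of 0.5 * (y_hat - y)**2 over all examples."""
    residual = X @ w + b - y
    return 0.5 * np.mean(residual ** 2)


# Absorb the bias into the weights by appending a column of ones.
X_aug = np.hstack([X, np.ones((n, 1))])

# Solve (X^T X) theta = X^T y instead of computing the inverse explicitly.
theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
w_star, b_star = theta[:-1], theta[-1]

print("estimated w:", w_star, "estimated b:", b_star)
print("training loss:", mean_squared_loss(X, y, w_star, b_star))
```

With little noise, the recovered parameters should lie close to the values used to generate the data, and no other choice of parameters achieves a lower training loss.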

Minibatch Stochastic Gradient Descent
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Even in cases where we cannot solve the models analytically, it turns out that we can still train models effectively in practice. Moreover, for many tasks, those difficult-to-optimize models turn out to be so much better that figuring out how to train them ends up being well worth the trouble.

The key technique for optimizing nearly any deep learning model, and which we will call upon throughout this book, consists of iteratively reducing the error by updating the parameters in the direction that incrementally lowers the loss function. This algorithm is called *gradient descent*.

The most naive application of gradient descent consists of taking the derivative of the loss function, which is an average of the losses computed on every single example in the dataset. In practice, this can be extremely slow: we must pass over the entire dataset before making a single update. Thus, we will often settle for sampling a random minibatch of examples every time we need to compute the update, a variant called *minibatch stochastic gradient descent*.

In each iteration, we first randomly sample a minibatch :math:`\mathcal{B}` consisting of a fixed number of training examples. We then compute the derivative (gradient) of the average loss on the minibatch with regard to the model parameters. Finally, we multiply the gradient by a predetermined positive value :math:`\eta` and subtract the resulting term from the current parameter values.

We can express the update mathematically as follows (:math:`\partial` denotes the partial derivative):

.. math::

   (\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b).

To summarize, the steps of the algorithm are as follows: (i) we initialize the values of the model parameters, typically at random; (ii) we iteratively sample random minibatches from the data, updating the parameters in the direction of the negative gradient. For quadratic losses and affine transformations, we can write this out explicitly as follows:

.. math::
   :label: eq_linreg_batch_update

   \begin{aligned}
   \mathbf{w} &\leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right),\\
   b &\leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right).
   \end{aligned}

Note that :math:`\mathbf{w}` and :math:`\mathbf{x}` are vectors in :eq:`eq_linreg_batch_update`. Here, the more elegant vector notation makes the math much more readable than expressing things in terms of coefficients, say :math:`w_1, w_2, \ldots, w_d`. The set cardinality :math:`|\mathcal{B}|` represents the number of examples in each minibatch (the *batch size*) and :math:`\eta` denotes the *learning rate*. We emphasize that the values of the batch size and learning rate are manually pre-specified and not typically learned through model training. These parameters that are tunable but not updated in the training loop are called *hyperparameters*. *Hyperparameter tuning* is the process by which hyperparameters are chosen, and typically requires that we adjust them based on the results of the training loop as assessed on a separate *validation dataset* (or *validation set*).
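
The update in :eq:`eq_linreg_batch_update` can be sketched in a few lines of NumPy. This is an illustration rather than a reference implementation: the data are synthetic, ``sgd_linreg`` is a hypothetical function name, and the hyperparameter values (``lr``, ``batch_size``, ``num_epochs``) are arbitrary choices that happen to work on this toy problem. Minibatches are drawn here by reshuffling the dataset once per pass and slicing it, which is one common way to sample them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = X w_true + b_true + noise.
n, d = 1000, 2
w_true = np.array([2.0, -3.4])
b_true = 4.2
X = rng.normal(size=(n, d))
y = X @ w_true + b_true + 0.01 * rng.normal(size=n)


def sgd_linreg(X, y, lr=0.03, batch_size=10, num_epochs=5, seed=0):
    """Minibatch SGD for linear regression with squared loss."""
    local_rng = np.random.default_rng(seed)
    n, d = X.shape
    # Step (i): initialize the parameters, here with small random weights.
    w = local_rng.normal(scale=0.01, size=d)
    b = 0.0
    for _ in range(num_epochs):
        # Step (ii): visit the data in random minibatches.
        perm = local_rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            X_b, y_b = X[idx], y[idx]
            err = X_b @ w + b - y_b            # w^T x + b - y for each example
            w = w - lr * (X_b.T @ err) / len(idx)
            b = b - lr * err.sum() / len(idx)
    return w, b


w_hat, b_hat = sgd_linreg(X, y)
print("estimated w:", w_hat, "estimated b:", b_hat)
```

The learning rate and batch size are passed in as arguments and never modified inside the loop, which mirrors their role as hyperparameters.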

After training for some predetermined number of iterations (or until some other stopping criteria are met), we record the estimated model parameters, denoted :math:`\hat{\mathbf{w}}, \hat{b}`. Note that even if our function is truly linear and noiseless, these parameters will not be the exact minimizers of the loss because, although the algorithm converges slowly towards the minimizers, it cannot reach them exactly in a finite number of steps.

Linear regression happens to be a learning problem where there is only one minimum over the entire domain. However, for more complicated models, like deep networks, the loss surfaces contain many minima. Fortunately, for reasons that are not yet fully understood, deep learning practitioners seldom struggle to find parameters that minimize the loss *on training sets*. The more formidable task is to find parameters that will achieve low loss on data that we have not seen before, a challenge called *generalization*. We return to these topics throughout the book.

Making Predictions with the Learned Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Given the learned linear regression model :math:`\hat{\mathbf{w}}^\top \mathbf{x} + \hat{b}`, we can now estimate the price of a new house (not contained in the training data) given its area :math:`x_1` and age :math:`x_2`. Estimating targets given features is commonly called *prediction* or *inference*.

We will try to stick with *prediction* because calling this step *inference*, despite emerging as standard jargon in deep learning, is somewhat of a misnomer. In statistics, *inference* more often denotes estimating parameters based on a dataset. This misuse of terminology is a common source of confusion when deep learning practitioners talk to statisticians.

Vectorization for Speed
-----------------------

When training our models, we typically want to process whole minibatches of examples simultaneously.
Doing this efficiently requires that we vectorize the calculations and leverage fast linear algebra libraries rather than writing costly for-loops in Python.

.. raw:: html