Module 9: Linear Regression with SciPy, Scikit-Learn and statsmodels

Linear regression is a ubiquitous tool in statistics for understanding data. Data often consists of several features, for example weight loss per week, food intake in calories, and rigorous exercise measured in hours per week. The goal of the analysis is to understand the relationship between these features numerically in order to deepen our understanding, make predictions, and act on the data. In our example, we want to give actionable advice to a person who wants to control their weight on how to moderate food intake, increase weekly exercise, and balance the two.

Most data analysis needs to assume a causal model. In our example, we assume that food intake and weekly exercise determine weight loss. The first two are the independent or explanatory features, and weight loss is the dependent feature. In general, there are more than two determining features. Sometimes they have little influence; sometimes they cannot be easily measured, such as metabolism in the weight-loss case. We often model these extraneous explanatory features as random changes in the dependent feature.

The simplest model is one where the explanatory features have no influence at all and the only factor is the random one. Such a model has no value for understanding or prediction. The next simplest type of relationship between numerical data is a linear relationship, where the dependent feature $y$ can be written as a linear function of the explanatory features $x_1, \dots, x_n$, i.e. as
$$y = a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b.$$
The task at hand is to extract the values of the coefficients $a_1, \dots, a_n$ and the intercept $b$ from the data. The simplest way is to fit the line to the data. We have seen curve-fitting in Module XXX. Curve-fitting in general and linear regression in particular depend on the definition of goodness of fit. If we have a multi-dimensional prediction function $f$ and a number of data points $(\mathbf{x}_i, y_i)$, $i = 1, \dots, m$, we determine the residuals $r_i = y_i - f(\mathbf{x}_i)$, which are the differences between the dependent feature values $y_i$ and the values predicted by $f$.
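To make the notions of coefficients, intercept, and residuals concrete, here is a minimal sketch in Python. It does not use the module's data: the numbers, the names `true_a`, `true_b`, `a_hat`, `b_hat`, and the noise level are all assumptions made up for illustration, and NumPy's `numpy.linalg.lstsq` is only one of several possible ways to fit the linear function.

```python
import numpy as np

# Hypothetical, made-up data (not the module's table): two explanatory
# features per observation and one dependent value.
rng = np.random.default_rng(0)
m = 50
X = rng.uniform(0.0, 10.0, size=(m, 2))       # columns: x_1, x_2
true_a = np.array([0.5, -1.2])                # assumed coefficients
true_b = 3.0                                  # assumed intercept
y = X @ true_a + true_b + rng.normal(0.0, 0.1, size=m)  # small random factor

# Append a column of ones so that the intercept b is estimated
# together with the coefficients a_1 and a_2.
A = np.column_stack([X, np.ones(m)])
params, *_ = np.linalg.lstsq(A, y, rcond=None)
a_hat, b_hat = params[:2], params[2]


def f(x):
    """Fitted linear prediction function f(x) = a_1*x_1 + a_2*x_2 + b."""
    return x @ a_hat + b_hat


# Residuals: observed dependent values minus the values predicted by f.
residuals = y - f(X)

print("coefficients:", a_hat)
print("intercept:", b_hat)
print("largest absolute residual:", np.abs(residuals).max())
```

Because the synthetic data were generated from a linear formula plus a small random term, the estimated coefficients and intercept should land close to the assumed ones, and the residuals should be of the same order as the random term.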
The common metric for the fit between the prediction function $f$ and the data is the sum of the squares of the residuals $r_i$,
$$\sum_{i} r_i^2,$$
or, what amounts to the same, its square root
$$\sqrt{\sum_{i} r_i^2}.$$
This is not the only possibility, of course. Another measure of fit in use is the sum of the absolute values of the residuals, $\sum_{i} |r_i|$, but mathematics knows of related metrics such as $\left(\sum_{i} |r_i|^p\right)^{1/p}$ for $p \ge 1$.

[Table 1: Fictional weight-loss data in dependence on calorie cuts]

[Figure: Graphical representation of the data in Figure 1 together with a linear regression line]

We illustrate our example with artificially created numbers, given in Table 1. We used a linear but fictitious weight-loss formula based on daily calories cut, strenuous exercise per week, metabolism, and a small random factor. Linear regression of the data will allow us to reconstruct the formula to a great extent. If we look at the scatter