LINEAR REGRESSION
- Introduction
- very often when 2 (or more) variables are observed, a relationship
between them can be visualized
- predictions from existing and historical data are often required in
economics and the physical sciences
- regression analysis is used to help formulate these predictions and
relationships
- linear regression is a special kind of regression analysis in which 2
variables are studied and a straight-line relationship is assumed
- linear regression is important because
- there exist many relationships that are of this form
- it provides close approximations to complicated relationships
which would otherwise be difficult to describe
- the 2 variables are divided into (i) independent variable and (ii)
dependent variable
- Dependent Variable is the variable that we want to forecast
- Independent Variable is the variable that we use to make the forecast
- e.g. Time vs. GNP (time is independent, GNP is dependent)
- scatter diagrams are used to present the relationship between the 2
variables graphically
- usually the independent variable is drawn on the horizontal axis (X)
and the dependent variable on vertical axis (Y)
- the regression line is also called the regression line of Y on X
- Assumptions
- there is a linear relationship as determined (observed) from the
scatter diagram
- the dependent values (Y) are independent of each other, i.e. if we
obtain a large value of Y on the first observation, the second and
subsequent observations will not necessarily also be large. In simple
terms, there should be no auto-correlation
- for each value of X the corresponding Y values are normally
distributed
- the standard deviations of the Y values for each value of X are the
same, i.e. homoscedasticity
- Process
- observe and note what is happening in a systematic way
- form some kind of theory about the observed facts
- draw a scatter diagram to visualize relationship
- generate the relationship by mathematical formula
- make use of the mathematical formula to predict
- Method of Least Squares
- from a scatter diagram, there is virtually no limit as to the number
of lines that can be drawn to make a linear relationship between the 2
variables
- the objective is to create a BEST FIT line to the data concerned
- the criterion used is called the method of least squares, i.e. the
sum of squares of the vertical deviations from the points to the line
is a minimum (vertical deviations because the dependent variable is
drawn on the vertical axis)
- the linear relationship between the dependent variable (Y) and the
independent variable (X) can be written as Y = a + bX, where a and b
are parameters describing the vertical intercept and the slope of the
regression line respectively
- Calculating a and b
- using the method of least squares, the parameters are calculated
from the observed data by
b = (nΣXY - ΣXΣY) / (nΣX² - (ΣX)²)
a = Ȳ - bX̄ = (ΣY - bΣX) / n
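As a minimal sketch (Python, not part of the original notes), a and b can be computed directly from the summary sums; the values below are taken from the worked example later in these notes:

```python
# Least-squares estimates of slope (b) and intercept (a)
# from summary statistics of the Accounting (X) vs.
# Statistics (Y) marks in the worked example.
n = 12
sum_x, sum_y = 687.0, 741.0
sum_x2, sum_xy = 45591.0, 48407.0

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n  # a = Ybar - b * Xbar

print(round(a, 2), round(b, 3))
```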
- Correlation
- when the value of one variable is related to the value of another,
they are said to be correlated
- there are 3 types of correlation: (i) perfectly correlated; (ii)
partially correlated; (iii) uncorrelated
- Coefficient of Correlation (r) measures such a relationship and is
given by
r = (nΣXY - ΣXΣY) / sqrt[(nΣX² - (ΣX)²)(nΣY² - (ΣY)²)]
- the value of r ranges from -1 (perfectly correlated in the negative
direction) to +1 (perfectly correlated in the positive direction)
- when r = 0, the 2 variables are not correlated
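A sketch of the calculation of r in Python, again using the summary sums from the worked example below:

```python
from math import sqrt

# Pearson coefficient of correlation from summary statistics
# (the Accounting/Statistics data of the worked example).
n = 12
sum_x, sum_y = 687.0, 741.0
sum_x2, sum_y2, sum_xy = 45591.0, 52525.0, 48407.0

num = n * sum_xy - sum_x * sum_y
den = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = num / den  # lies between -1 and +1
print(round(r, 4))
```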
- Coefficient of Determination
- the square of the coefficient of correlation (r²); it gives the
proportion of the variation in Y that is explained by the regression
line
- Standard Error of Estimate (SEE)
- a measure of the variability of the regression line, i.e. the
dispersion around the regression line
- it tells how much variation there is in the dependent variable
between the observed values and the values predicted by the regression
line, and is given by
SEE = sqrt[(ΣY² - aΣY - bΣXY) / (n - 2)]
- this SEE allows us to generate confidence intervals about the
regression line, as we did in the estimation of means
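A sketch of the SEE calculation (Python, using the shortcut form of the formula and the sums from the worked example below):

```python
from math import sqrt

# Standard error of estimate via the shortcut formula
# SEE = sqrt((sum_y2 - a*sum_y - b*sum_xy) / (n - 2)),
# with the Accounting/Statistics data of the worked example.
n = 12
sum_x, sum_y = 687.0, 741.0
sum_x2, sum_y2, sum_xy = 45591.0, 52525.0, 48407.0

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n
see = sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))
print(round(see, 2))
```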
- Confidence interval for the regression line (estimating the
expected value)
- estimating the mean value of Y for a given value of X is a very
important practical problem
- e.g. if a corporation's profit Y is linearly related to its
advertising expenditures X, the corporation may want to estimate the
mean profit for a given expenditure X
- this is given by the formula
Ŷ ± t·SEE·sqrt[1/n + (X - X̄)² / (ΣX² - (ΣX)²/n)]
where Ŷ = a + bX and t is taken at n - 2 degrees of freedom for the
t-distribution
- Confidence interval for individual prediction
- since an individual value of Y varies about its mean, the above
formula must be widened for an individual prediction and is given by
Ŷ ± t·SEE·sqrt[1 + 1/n + (X - X̄)² / (ΣX² - (ΣX)²/n)]
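A sketch of both intervals in Python, using the sums from the worked example below. The choices X = 70 and t = 2.228 (the two-sided 95% t value at n - 2 = 10 degrees of freedom) are illustrative assumptions, not part of the original notes:

```python
from math import sqrt

# 95% confidence interval for the mean of Y at a given X, and the
# wider 95% prediction interval for an individual Y at the same X.
# Data: the Accounting/Statistics marks of the worked example.
n = 12
sum_x, sum_y = 687.0, 741.0
sum_x2, sum_y2, sum_xy = 45591.0, 52525.0, 48407.0

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n
see = sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))

x0, t = 70.0, 2.228            # predict at X = 70 (illustrative)
x_bar = sum_x / n
sxx = sum_x2 - sum_x ** 2 / n  # corrected sum of squares of X
y_hat = a + b * x0

half_mean = t * see * sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
half_pred = t * see * sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
print(round(y_hat, 2), round(half_mean, 2), round(half_pred, 2))
```

Note that the prediction interval is always wider than the interval for the mean, because of the extra 1 under the square root.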

An Example
| No.  | Accounting X | Statistics Y |       X² |       Y² |       XY |
| 1    |        74.00 |        81.00 |  5476.00 |  6561.00 |  5994.00 |
| 2    |        93.00 |        86.00 |  8649.00 |  7396.00 |  7998.00 |
| 3    |        55.00 |        67.00 |  3025.00 |  4489.00 |  3685.00 |
| 4    |        41.00 |        35.00 |  1681.00 |  1225.00 |  1435.00 |
| 5    |        23.00 |        30.00 |   529.00 |   900.00 |   690.00 |
| 6    |        92.00 |       100.00 |  8464.00 | 10000.00 |  9200.00 |
| 7    |        64.00 |        55.00 |  4096.00 |  3025.00 |  3520.00 |
| 8    |        40.00 |        52.00 |  1600.00 |  2704.00 |  2080.00 |
| 9    |        71.00 |        76.00 |  5041.00 |  5776.00 |  5396.00 |
| 10   |        33.00 |        24.00 |  1089.00 |   576.00 |   792.00 |
| 11   |        30.00 |        48.00 |   900.00 |  2304.00 |  1440.00 |
| 12   |        71.00 |        87.00 |  5041.00 |  7569.00 |  6177.00 |
| Sum  |       687.00 |       741.00 | 45591.00 | 52525.00 | 48407.00 |
| Mean |        57.25 |        61.75 |  3799.25 |  4377.08 |  4033.92 |
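The summary rows of the table can be recomputed from the raw marks; a short Python sketch (not part of the original notes):

```python
# Recompute the Sum row of the example table from the raw marks.
x = [74, 93, 55, 41, 23, 92, 64, 40, 71, 33, 30, 71]  # Accounting
y = [81, 86, 67, 35, 30, 100, 55, 52, 76, 24, 48, 87]  # Statistics

sum_x = sum(x)
sum_y = sum(y)
sum_x2 = sum(xi * xi for xi in x)
sum_y2 = sum(yi * yi for yi in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)
```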

Figure 1: Scatter Diagram of Raw Data

Figure 2: Scatter Diagram and Regression Line

Interpretation/Conclusion
There is a linear relation between the results of Accounting and
Statistics, as shown in the scatter diagram in Figure 1. A linear
regression analysis was done using the least-squares method. The
resultant regression line is Y = 7.02 + 0.956X, in which X represents
the results of Accounting and Y those of Statistics. Figure 2 shows the
regression line. In this example, the choice of dependent and
independent variables is arbitrary: it can equally be said that the
results of Statistics are correlated with those of Accounting, or vice
versa.
The Coefficient of Correlation (r) is 0.9194. This indicates that the
two variables are strongly and positively correlated (Y increases as X
increases).
The Coefficient of Determination (r²) is 0.8453. Nearly 85% of the
variation in Y is explained by the regression line.
SM 18Oct95