Regression: An Introduction to Econometrics

January 20, 2020 | Author: Bartholomew Stephens
Overview

The goal of econometric work is to help us move from the qualitative analysis of the theoretical work favored in textbooks to the quantitative world in which policy makers operate. The focus of this work is the quantification of relationships. For example, in microeconomics one of the central concepts was demand, one half of the supply-demand model that economists use to explain prices, whether the price of a stock, the exchange rate, wages, or the price of bananas. One of the fundamental rules of economics is the downward-sloping demand curve: an increase in price will result in lower demand. Knowing this, would you be in a position to decide on a pricing strategy for your product? For example, armed with the knowledge that demand is negatively related to price, do you have enough information to decide whether a price increase or decrease will raise sales revenue? You may recall from an intro economics course that the answer depends upon the elasticity of demand, a measure of how responsive demand is to price changes.

But how do we get the elasticity of demand? In your earlier work you were simply given the number and asked how it would influence your choices; here you will be asked to figure out what the elasticity is. It is here things become interesting, where we must move from the deterministic world of algebra and calculus to the probabilistic world of statistics. To make this move, a working knowledge of econometrics, a fancy name for applied statistics, is extremely valuable. As you will see, this is not a place for the faint of heart. There are a number of valuable techniques you will be exposed to in econometrics. You will work hard on setting up the 'right experiment' for your study, collecting the data, and specifying the equation. This, however, is only the beginning.
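The revenue question above can be sketched in a few lines of code. The prices, quantities, and elasticity values below are invented for illustration; only the approximation %ΔQ ≈ elasticity × %ΔP is taken from the elasticity discussion.

```python
# Illustrative only: hypothetical prices, quantities, and elasticities.
def revenue_change(price, quantity, elasticity, pct_price_change):
    """Approximate revenue after a price change, given a demand elasticity."""
    new_price = price * (1 + pct_price_change)
    # %change in Q is roughly elasticity * %change in P
    # (elasticity is negative for a downward-sloping demand curve)
    new_quantity = quantity * (1 + elasticity * pct_price_change)
    return new_price * new_quantity

base_revenue = 10.0 * 100.0  # revenue at price 10, quantity 100

inelastic = revenue_change(10.0, 100.0, -0.5, 0.10)  # 10% price increase
elastic = revenue_change(10.0, 100.0, -2.0, 0.10)

print(inelastic > base_revenue)  # True: price hike raises revenue when demand is inelastic
print(elastic < base_revenue)    # True: price hike lowers revenue when demand is elastic
```

This is why knowing the sign of the relationship is not enough: the size of the elasticity determines the direction of the revenue effect.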
There will never be a magic button that produces 'truth' at the end of some regression, the favorite econometric technique for estimating relationships. You can also be assured you will not get it quite right the first time. There is, however, something to be learned from your 'mistakes'. To the trained eye, the summary statistics produced by any regression package paint a vivid, if somewhat blurred, picture of the problems with the model as specified. These problems must be dealt with because they can produce biases in the results that reduce the reliability of the regression and increase the chance we will not end up with an understanding of the true relationship. With existing software packages anyone can produce regression results, so one needs to be aware of the limitations of the analysis when evaluating them.

In this overview of econometrics we will begin with a discussion of Specification. What equation will we estimate? Does demand depend upon price alone, or does income also matter? Is demand linearly or nonlinearly related to price? These are the types of questions discussed in that section. We will then shift to Interpretation, a discussion of how to interpret the results of our regression. What if we find demand is negatively related to price? Should we believe the result? And what about the times when demand turns out to be positively related to price? How could we explain this result, and do we actually have proof that demand curves should be positively sloped? This will be followed by a discussion of the assumptions of the Classical Linear Model, all of the things that must go right if we are to have complete confidence in our results. And for those instances where we have some reason to believe there is a problem, we have a discussion of the Limitations of the Classical Linear Model, where the potential problems as well as solutions are discussed.
When you have completed this section, you should be well aware that the estimation of 'economic relationships' has both an art and a science component. Given the technology available to people today,


anyone can run regressions with a few magic buttons. Computer programs exist that allow us to estimate regressions, perform diagnostics to evaluate the model, and correct any problems encountered. Do not, however, be misled into thinking your empirical work will be easy. As you will find with your own work, there is a long road of painful, time-consuming work ahead of anyone who embarks on an empirical project, and there are many places where you can take a wrong turn. This section was designed to offer you some guidance as you make the journey, to help you know in advance the obstacles you are likely to encounter and the best way of dealing with them.

There is a second reason for spending time studying regression analysis and conducting your own empirical project. Scientific advances are no guarantee we are more likely to uncover the 'truth' we are searching for. The world is in many respects the same as it was when Darrell Huff was prompted to write his wonderful little book, How to Lie With Statistics. In the hands of an unscrupulous researcher, modern econometric software increases the chances someone can find the results they want; the complexities of the statistical analysis simply make it harder to find the biases in the study. Your time spent here will increase the chances of recognizing those biases.

For an on-line overview of regression analysis you might want to check out the DAU and Stockburger sites. You should also check out the worksheet Regression, the output from an Excel regression. On the sheet simple, the data for years, inflation rate, unemployment rate, and interest rate appear in cells A3 - D50. Once the data set is complete, you select Data Analysis in the Tools menu and then Regression, which will bring up a dialogue box. At this time you highlight the data set for the input box. The Y variable is the variable you want to explain, in this case the interest rate.
The X variable is the explainer, in this case the inflation rate. We are going to use regression to see the extent to which the inflation rate explains interest rates. You then specify the top left cell of the space where you want the output to appear. For an interpretation of the results, you should check out the Interpretation page. In these results you find the coefficient of inflation to be .68: every time the inflation rate rises by one percentage point, interest rates rise by nearly .7 percentage points. The t-statistic is 7.22, which indicates you should believe in this relationship, and the R2 tells you the model explains about one half of the variation in interest rates.

Mechanics

Once you have decided on estimating a relationship using regression analysis, you need to decide upon the appropriate software package. There are some very useful packages designed primarily for regression-type analysis that you may want to explore if you are doing high-powered regression work or using the software in other courses. Here, however, we will stick with Excel, which allows you to run some simple regression analyses.

The first step is creation of the data set, an example of which can be found on the simple tab of the Regression spreadsheet example. On the simple tab we will be looking at a bivariate regression - a regression with only one right-side variable. The estimated equation will be of the form Y = a + bX + e, where Y is the variable being explained (dependent) and X is the variable doing the explaining (independent). To estimate the regression you select Data Analysis from the Tools menu and within this select Regression. You will get a dialogue box into which you need to input the relevant data.

In the simple example we will be trying to identify the impact inflation has on interest rates. Because the causality runs from inflation to interest rates, the interest rate will be the dependent variable and the inflation rate will be the independent variable. You input the dependent variable in the Input Y Range: by highlighting the interest rate column (C3:C50). You then input the independent variable in the Input X Range: by highlighting the inflation rate column (B3:B50). Because I did not use labels, the labels box is left unchecked. I then specify that the output should have its top left corner in cell F2. After checking off the options you get all of the information on the simple tab. Below is the regression output that appears.
While all of this output gives you important information about the relationship, at this time your attention should be directed to just a few of the features that are highlighted in red. The first is the adjusted R Square. This tells the reader that of all the year-to-year variation in the interest rate, about 52% can be explained by movements in the independent variable (the inflation rate). The second thing to look for is the coefficients. In this example, the regression analysis suggests the best equation for these data would be:

Interest rate = 2.44 + .68*Inflation rate

What we are most interested in is the coefficient of the inflation rate, which in this example is .68. This means every time the inflation rate rises by one percentage point (from 4 to 5 percent), the interest rate rises by .68 percentage points. The final piece of valuable information is the t-stat, which tells us how much to "believe" in the coefficients. A t-stat is associated with each coefficient, so you can test the "believability" of all of them. Fortunately there is a convenient rule of thumb: if the absolute value of the t-stat is greater than 2, you believe the coefficient is not zero, zero being what we would expect if there were no relationship. In this example the t-stats for both the intercept and the coefficient of the inflation rate are greater than two, so you can assume the interest rate is affected by the inflation rate.

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.72901
R Square              0.531456
Adjusted R Square     0.52127
Standard Error        1.990115
Observations          48

ANOVA
              df    SS         MS         F         Significance F
Regression     1    206.6476   206.6476   52.1764   4.22E-09
Residual      46    182.1856   3.960557
Total         47    388.8332

              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept     2.443228       0.481394         5.075321   6.82E-06   1.474233    3.412222
X Variable 1  0.680573       0.094219         7.223323   4.22E-09   0.490921    0.870226
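For readers who prefer code to spreadsheets, the quantities in a summary output like this (coefficients, standard errors, t-stats, R Square, adjusted R Square) can be reproduced with a short script. This is a sketch using synthetic data, since the original worksheet is not available; the numbers it prints will not match the Excel output above, but the formulas are the same ones any regression package applies.

```python
import numpy as np

# Synthetic stand-in for the inflation/interest data (the real sheet is not available).
rng = np.random.default_rng(0)
inflation = rng.uniform(1.0, 10.0, 48)                     # hypothetical X
interest = 2.4 + 0.7 * inflation + rng.normal(0, 2.0, 48)  # hypothetical Y

X = np.column_stack([np.ones_like(inflation), inflation])  # intercept column + regressor
coeffs, _, _, _ = np.linalg.lstsq(X, interest, rcond=None) # OLS estimates b0, b1

resid = interest - X @ coeffs
n, k = X.shape
s2 = resid @ resid / (n - k)                               # residual variance (MS Residual)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))         # standard errors of coefficients
t_stats = coeffs / se                                      # the "believability" statistics
r2 = 1 - (resid @ resid) / np.sum((interest - interest.mean()) ** 2)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)

print(coeffs, t_stats, r2, adj_r2)
```

With the true slope set to 0.7 and moderate noise, the slope t-stat comes out well above 2, mirroring the rule of thumb in the text.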

Specification

Before becoming involved with the more sophisticated statistical questions regarding regression analysis, it is useful to briefly discuss some of the preliminary issues one must deal with in any empirical project. First, it is important to note the difference between causality and correlation. The statistical analyses used to explore the nature of the causality never actually allow us to prove it; it is impossible to separate causality from correlation. All we can reasonably hope to do is find statistical correlations that do not disprove our hypotheses.


Given this limitation, once the decision has been made to undertake an empirical project, the principal investigator must make a number of important choices. A schematic outline of the process is presented below. The project starts with the choice of the theoretical relationship one wants to study, the hypotheses one wants to test. Before proceeding with the development of the specific model, it is appropriate to review the scholarly literature. There is little advantage to be gained by reinventing the wheel, and in reviewing these articles you might find information that would help in the other stages of your empirical analysis. A good place to start your search of the literature would be the Journal of Economic Literature.

After the review of the literature has been completed and the outlines of the model settled on, there is the need to identify the data necessary to estimate your model. What are you going to use as the dependent variable? For example, consider an empirical project designed to identify the link between investment spending and interest rates. There is a need to specify which interest rate we expect to affect investment decisions: is it the rate on 3-month government securities or the rate on 30-year bonds? A decision must also be made concerning the choice of the independent variables. The choice of the regressors is based on the underlying economic theory: the variables should be selected because there is reason to believe they are causally related to Y. If, for example, your goal was to estimate a demand equation for a certain product, then based on your knowledge of microeconomic theory you would need to identify at the very least the appropriate data to capture the influence of income, population, and the price of related goods. In each of these instances, you will be making choices that will significantly affect the findings of your study.
Furthermore, for every variable selected, you should have an a priori expectation for the estimated parameters. Based on our understanding of economic theory, for example, the coefficient of price in the demand equation should be negative and the coefficient of income should be positive.

A good example of the importance of the proper specification of an independent variable is the treatment of demographic factors in a demand equation. The normal choice would often be the population, but there may be instances where this is an inappropriate choice. Consider the demand for motorcycles. Is it the growth in the population or the growth in the population of young people that matters? To the extent the primary market for motorcycles is younger people, the use of total population as an independent variable could cause problems. This would be the case if there were a divergence between the two growth rates, a phenomenon of the 1970s. Similarly, a model of housing demand would most certainly include some measure of population as an independent variable. Is it the number of people or the number of separate households that is the primary determinant of demand? The choice you make will have a significant impact on the results, since in the 1980s the growth rates of the two differed substantially.

The choice of dependent and independent variables involves a number of other crucial decisions. For time-series analysis, care has to be taken to avoid mixing seasonally adjusted and unadjusted data. This is not a problem when dealing with annual data, but it is a potential problem with quarterly or monthly data. It is also often relevant to adjust data for population. For example, in a demand equation for a specific product, it might be personal income per capita rather than personal income that is the appropriate independent variable. One also has choices with regard to the form of the variables.
Let us assume we believe the unemployment rate has an influence on demand. Is it best captured by the level of the unemployment rate, which would be used as an indicator of 'ability to pay', or would it be better measured by the change in the unemployment rate, which would capture the 'expectations' effect of a change in the direction of the economy? When estimating a saving equation, should the dependent variable be aggregate savings (S), the average savings rate (S/Y), or the year-to-year change in the savings rate (Δ(S/Y))? Most likely, the answers to these questions will be, at least in part, determined by the empirical work.

One must also be very careful to adjust the data for the influence of inflation. I will always recall my undergraduate students who reported that the 1970s was a period of high growth because GNP grew more rapidly during this period than in the 1960s and 1980s. This is certainly not the case. The 1970s figures


were primarily a reflection of higher inflation rates, and any econometric model should account for these substantial differences. Returning to the product demand example, the model should certainly be specified in terms of real, or inflation-adjusted, income. Similarly, when we examine the relationship between investment spending and interest rates, it is the real interest rate we would expect to use as an independent variable.

There is also the problem of dealing with phenomena that cannot be easily or adequately quantified. In a model of the inflation-unemployment trade-off, there is reason to believe there was a significant difference between the 1960s and 1970s. Another situation would be a model of wage determination in which we attempt to identify the relationship between average earnings (W) and the number of years of education (E). In the wage study there would be a need to capture the gender effect because of the sharply different profiles for males and females; in fact, questions such as this are at the center of many of the discrimination cases that reach the courtroom. Similarly, in any study of retail toy sales based on quarterly data, it would be important to take explicit account of the fact that sales are typically higher in the fourth quarter.

Each of these problems can be solved with the use of dummy variables. A dummy variable is a 0-1 variable that can best be viewed as an on-off switch. The left-hand diagram describes the situation where we would want to add an intercept dummy, a variable that has a value of 0 for each year in the 1960s and a value of 1 for each year in the 1970s. The estimated equation would be:

i = b0 + b1*u + b2*D

The diagram indicates a situation where the coefficient of D would be positive; the intercept is shifted upwards in the 1970s. The equations for the two time periods would be:

i = b0 + b1*u (1960s)
i = (b0 + b2) + b1*u (1970s)

A somewhat different situation is depicted in the second diagram. Here it is not the intercept but the slope that seems to vary. For this example consider the situation where the gender dummy (D) would have a value of 0 for each observation of a woman's wage and a value of 1 for each man's wage. We could use this dummy variable to test the hypothesis that the education-earnings profile for women is flatter than it is for men, that the extra earnings men receive for an extra year of education are greater than the gains for women. The equation would be:

Ei = b0 + b1*Educ + b2*D*Educ

In this case, evidence of a steeper slope for males would be found in a positive coefficient b2. The slope for females would be b1, while the slope for males would be b1 + b2. The equations for the two groups would be:

E = b0 + (b1+b2)*Educ (males)
E = b0 + b1*Educ (females)

Finally, in the retail sales equation, in which we attempt to identify the link between sales (S) and income (Y), it would be appropriate to specify three dummy variables. The first would have a value of 1 in the first quarter and 0 otherwise, the second would have a value of 1 in the second quarter, and the third would have a value of 1 in the third quarter.

S = b0 + b1*D1 + b2*D2 + b3*D3 + b4*Y

In this case, evidence of seasonal patterns in retail sales would be found in the coefficients of the dummy variables. The equations for the four quarters would be:

S = (b0 + b1) + b4*Y (Q1)
S = (b0 + b2) + b4*Y (Q2)
S = (b0 + b3) + b4*Y (Q3)
S = b0 + b4*Y (Q4)
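The seasonal-dummy setup can be sketched in code. The sales and income series below are invented, constructed with a fourth-quarter spike so that the estimated dummy coefficients come out negative, exactly as the text describes for the Q4-heavy toy market.

```python
import numpy as np

# Hypothetical data: 5 years of quarterly observations.
quarters = np.tile([1, 2, 3, 4], 5)
income = np.linspace(100.0, 150.0, 20)        # invented income series (Y)

# Q4 is the omitted base quarter; its effect is absorbed by the intercept b0.
D1 = (quarters == 1).astype(float)
D2 = (quarters == 2).astype(float)
D3 = (quarters == 3).astype(float)

# Invented sales with a Q4 spike of 20, so Q1-Q3 sit 20 below the Q4 line.
sales = 50.0 + 0.5 * income + np.where(quarters == 4, 20.0, 0.0)

# Estimate S = b0 + b1*D1 + b2*D2 + b3*D3 + b4*Y by least squares.
X = np.column_stack([np.ones(20), D1, D2, D3, income])
b, _, _, _ = np.linalg.lstsq(X, sales, rcond=None)
print(np.round(b, 2))  # b1, b2, b3 come out near -20: Q1-Q3 sales sit below Q4
```

Note the design choice the text implies: with an intercept in the equation, only three dummies are used for four quarters; including a fourth would make the regressors perfectly collinear.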

If sales were highest in the fourth quarter, then the coefficients of all of the dummy variables would be negative.

Having decided on the appropriate independent variables, the next issue involves the choice of time-series or cross-section analysis. Returning to the interest rate problem, one possibility would be a study of investment spending and interest rates for the year 1991 for a sample of 35 countries. A second approach could focus on the behavior of these two phenomena in the U.S. over the past 30 years. Each approach has its strengths and weaknesses and its econometric peculiarities. I suspect, however, the majority of the work you are likely to do will be time-series analysis. When you work with time-series data you must decide on both the time period and the frequency of the data (daily, weekly, monthly, quarterly, annually).

We now have the variables and we have the data. The final decision to be made is the choice of the estimation procedure. There are many possibilities open to the researcher interested in quantifying a specific relationship. At this time I intend only to discuss linear regression, equations that are linear in their parameters. Furthermore, I do not intend the discussion of regression analysis that follows to be a replacement for statistics and econometrics texts. The emphasis here will be a brief overview of the process one goes through in arriving at a finished product. We will begin at the beginning with the single-equation, bivariate linear regression model. The simplest form of the model is:

Yi = B0 + B1Xi + ei,  i = 1...n

where:
• Yi = ith observation on the dependent variable
• Xi = ith observation on the independent variable
• ei = ith observation on the error term
• B0, B1 = the parameters to be estimated
• n = number of observations

As is often the case, a picture can save a good deal of explaining. The data collected on variables Y and X are presented in a scatter diagram below. Linear regression analysis identifies the equation for the straight line that best captures the 'flavor' of the scatter. More specifically, the regression procedure specifies the values of the parameters B0 and B1 so that we have a specific equation, which allows us to calculate the 'average' value of Y [AVG(Y)] given the value of X. What remains unexplained by the equation is captured in the error term. In the diagram below, the actual value of Y for the ath observation is Ya, while the model estimates AVG(Ya) as the value for Y. The difference between the two is the error term.


[Figure: Bi-variate Regression: The Graphics]

We can never expect a perfect fit with our model: there will always be some minor influences on Y omitted from the specification, human behavior will always contain an element of randomness or unpredictability, the variables may not be measured correctly, and the model may not be truly linear. We do, however, hope these problems are minor, and when they do surface we can modify our analysis in a number of ways to help correct them. In any event, as we will see later, the standard linear regression model is designed to choose the values of the parameters in such a way as to minimize the errors. For example, in the diagram below it is obvious that the equation Y = B2 + B3X does not adequately reflect the data and that its error terms would on average be larger. Stated somewhat differently, the second equation does a much poorer job of representing the data.

[Figure: Alternative Regression Equations]
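The 'minimize the errors' idea can be made concrete with the closed-form bivariate estimates, b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and b0 = ȳ - b1·x̄. The data points and the alternative line below are invented for illustration; the point is only that any line other than the least-squares line leaves a larger sum of squared errors.

```python
# Invented scatter data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

def ols_bivariate(x, y):
    """Closed-form OLS estimates for Y = b0 + b1*X."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    return b0, b1

def sse(x, y, b0, b1):
    """Sum of squared errors for a candidate line."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

b0, b1 = ols_bivariate(xs, ys)
# An arbitrary alternative line, like the poorer line in the diagram:
print(sse(xs, ys, b0, b1) < sse(xs, ys, 0.5, 1.5))  # True: OLS line has smaller SSE
```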

If this were the end of the story, it would be a short one. The fact is there are few, if any, instances where the bi-variate model is appropriate because there are few cases where the value of a dependent variable is influenced by only one independent variable. It is more likely the dependent variable (Y) will be influenced by a number of independent variables. In this case the linear regression model can be written as:

Yi = B0 + B1X1i + B2X2i + ... + BKXKi + ei,  i = 1...n

where:
• Yi = ith observation on the dependent variable
• Xji = ith observation on the jth independent variable
• ei = ith observation on the error term
• B0...BK = the parameters to be estimated
• K = the number of independent variables
• n = number of observations

It is also true there are many times when the linear model depicted above does not adequately reflect the data. One possible alternative specification is the exponential form:

Y = e^(a1) * X1^(b1) * X2^(b2) * e^(e)


If you believed this was the appropriate model, you would employ a logarithmic transformation, which makes the equation linear in its parameters:

lnY = a1 + b1*lnX1 + b2*lnX2 + e

In the case of the exponential model, the sign and size of the estimated coefficients have a significant impact on the 'picture' of the relationship. The graph below shows the relationship between Y and X1 for different values of the parameter b1. When b1 > 1 we have the familiar parabola, and when b1
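The transformation can be checked numerically: generate data from the exponential form (here with no error term, for clarity), regress the logs, and the original parameters come back. The parameter values and the X grids below are invented for illustration.

```python
import numpy as np

# Hypothetical "true" parameters of the exponential model Y = e^a1 * X1^b1 * X2^b2.
a1, b1, b2 = 0.5, 1.2, -0.7

# A small factorial grid of positive X values (logs require positive data).
X1 = np.array([1.0, 2.0, 3.0] * 3)
X2 = np.repeat([1.0, 2.0, 4.0], 3)
Y = np.exp(a1) * X1**b1 * X2**b2          # exact exponential relationship, no noise

# OLS on the transformed model lnY = a1 + b1*lnX1 + b2*lnX2.
A = np.column_stack([np.ones(9), np.log(X1), np.log(X2)])
est, _, _, _ = np.linalg.lstsq(A, np.log(Y), rcond=None)
print(np.round(est, 3))  # recovers [0.5, 1.2, -0.7]
```

A useful side effect of the log-log form, tying back to the opening discussion: b1 and b2 are elasticities, so a one percent change in X1 changes Y by about b1 percent.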