Regression Analysis (Simple)

With regression we are trying to be more reflective of the population than the mean of Y (the dependent variable) alone, which would otherwise be our best estimate of a predicted value from a set of given values. We are analyzing the relationship between variables.

The statements:

"The more a candidate spends in a campaign, the more votes they will get"

and

"Cabeiri is taller than Arzoo"

are different in that the first implies a causal or functional relationship, and the second does not. One of the activities of researchers is to examine hypothesized functional relationships. Therein lies the rub of regression.

The dependent variable is denoted Y, the independent variable X. The variables will never be perfectly related, so there is always an error term.

Variation in the dependent variable can be thought of as having two parts: explained variation, which is accounted for by the independent variable, and unexplained variation, which is not accounted for by the independent variable (this is the error term). That is, part of the change in a variable is due to another variable that we hypothesize, and part is due to other factors outside our hypotheses. The relationship could also be a random, spurious one, and it is our role to determine whether this is the case.

Linear Regression:

We are concerned with whether the pattern of the relationship between the values of two variables can be described as a straight line, which is the simplest and most commonly used form.
Remember from geometry class that a line is described by the formula:

Y = a + bX

(in geometry we said Y = mx + b, where m was the slope and b was the Y-intercept)

where Y is the dependent variable, measured in units of the dependent variable; X is the independent variable, measured in units of the independent variable; and a and b are constants defining the nature of the relationship between X and Y.

The a, or Y-intercept (aka Yint), is the value of Y when X = 0.

The b is the slope of the line. It is known as the regression coefficient and is the change in Y associated with a one-unit change in X. The greater the slope or regression coefficient, the more influence the independent variable has on the dependent variable, and the more change in Y is associated with a change in X.

The regression coefficient is typically more important than the intercept from a policy researcher's perspective, as we are usually interested in the effect of one variable on another.

Coming back to the equation, we also have a term to capture the error in our estimating equation, denoted ε or e. Also known as the residual, it reflects the unexplained variation in Y, and its magnitude reflects the goodness of fit of the regression line. The smaller the error, the closer the points are to our line.

So our general equation describing a line is:

Y = a + bX + e

Remember, b is the regression coefficient and is interpreted as the change in Y associated with a one-unit change in X.

Example of interpretation of a regression equation:

Say we are interested in the relationship between family food consumption and family income. We calculate a regression equation, in which consumption is denoted C and income I, both measured in dollars, of:

C = 1375 + .064 I

What is the intercept? 1375.

What does it mean? That a family with no income has food consumption of $1,375.

What is the regression coefficient? How is it interpreted?
For every dollar increase in family income there is a .064 dollar increase in food consumption.

Note that we generally would have hypothesized a relationship and the dependent/independent variables beforehand. The relationship of I to C could have been reversed. The direction (sign) could have been the opposite. This would likely reflect a prior theory we may have had.

The goal of regression is to draw a line through our data that best represents or describes the relationship between the two variables. Essentially we are trying to do better than just taking the mean observation. Simple regression is a procedure to find specific values for the slope and the intercept.

If the line we draw to describe the data is upward sloping, the data suggest a positive relationship.
If the line is downward sloping, the data suggest a negative relationship.
If the line is horizontal, the data suggest no relationship.

In drawing our line, we want to minimize the distance between the points and our line. In the normal case we plot the dependent variable on the vertical (Y) axis and the independent variable on the horizontal (X) axis. Distance is then measured vertically from an observed point to our estimated line.

Since we cannot draw a line that minimizes the distance between every point and the line at the same time, we need a way to average the distances to get a best-fitting line. In the most common form of regression analysis, the technique is to find the sum of the squared values of the vertical distances:

(draw a scatterplot and demonstrate these things on it)

$\sum (Y_i - \hat{Y}_i)^2$

That form of regression is called Ordinary Least Squares (OLS), or Least Squares, and it has two key properties:

1. The sum of all (actual minus expected) values equals zero.
2. The sum of all (actual minus expected) values squared is the minimum value possible.

In equation form:

1. $\sum (Y_i - \hat{Y}_i) = 0$
2. $\sum (Y_i - \hat{Y}_i)^2$ = minimum

Hypothesized Regression Equation/Model and the Estimating Equation

When we follow the steps in regression (coming up shortly) we come up with two forms of our regression line or model. The first is a hypothesized model (following the general format of the steps to research design).

From a previous example, on Effort and Performance in 520, we had this:

Ex.: Some have hypothesized that there is a cause/effect relationship in this class:

CAUSE → EFFECT
Efforti → Performancei
Independent → Dependent

This relationship is expressed in an equation form that uses a CONSTANT and a PARAMETER:

Gradei = 1.0 + .0002(hours)

where 1.0 is the constant and .0002 is the parameter.

Constant: measured in units of the dependent variable (performance: grade points).
Parameter: measured in units of both variables, like gp/h or mph.

In a more general expression of this, we might suggest this as our hypothesized model:

Gi = β0 + β1 Ei + εi

Gi: grade of the ith person, the dependent variable, in the same units as a (known).
β0: hypothesized constant, in units of the DV (unknown).
β1: regression slope coefficient, in units of both the DV and the IV; E is the IV, effort.
εi: error term, where εi = $Y_i - \hat{Y}_i$ (= actual minus expected).

Estimating Equation, where the parameters (the betas) are determined (by computer):

$\hat{Y}_i$ = b0 + b1 Xi

or:

Gi = 1.0 + .0002 Ei

The formula for b:

$b = \frac{\sum (Y - \bar{Y})(X - \bar{X})}{\sum (X - \bar{X})^2}$

And a:

$a = \bar{Y} - b\bar{X}$

Cross Section versus Time Series

Fixed time versus measurement over time. The example above is fixed time, a snapshot in time. To denote a time series analysis, the subscript changes to t.

OLS cannot do pooled cross-sectional and time-series data.

Simple vs Complex or Multiple Regression

Simple linear regression has only one independent variable:

Yi = β0 + β1 Xi + εi

Multiple linear regression has multiple independent variables:

Yi = β0 + β1 X1i + β2 X2i + β3 X3i + εi

Here "linear" means linear in the parameters (the βs are to the power of one), but not necessarily in the variables.
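As a rough sketch of how the formulas for b and a can be worked by hand, the following Python computes both from a small data set and then checks the first least-squares property (residuals summing to zero). The data values and the function name `ols_fit` are assumptions made purely for illustration; they do not come from the class example.

```python
def ols_fit(x, y):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator of b: sum of (X - Xbar)(Y - Ybar)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator of b: sum of (X - Xbar)^2
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation, so the slope is undefined")
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data, invented for this sketch only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

a, b = ols_fit(x, y)
# Residual = actual minus expected, for each observation.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print(f"a = {a:.4f}, b = {b:.4f}")
print(f"sum of residuals = {sum(residuals):.10f}")
```

The sum of residuals prints as zero up to floating-point rounding, which is the first OLS property in action. In practice a spreadsheet or statistics package does this arithmetic, but the calculation is the same.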
REGRESSION'S 11 STEPS TO ULTIMATE HAPPINESS

1. Clearly define problem
2. Conceptualize problem (define appropriate variables, identify plausible reasons for change in dependent variable)
3. Operationalize
4. Hypothesize regression model
5. Collect data
6. Check for multicollinearity (multiple regression only)
7. Estimate OLS equation (computer)
8. Do statistical tests
   For the equation: sum of squares
   For the coefficients
9. Interpret coefficients
10. Check OLS assumptions
11. Conclusions, limitations

Exercise: top of 226, W&C, in class, by pairs, on computer.

Step 1) Define the problem; clearly define the question.
Are expenditures per pupil related to the average performance of pupils on a standardized exam?

Step 2) Conceptualize Problem
What are our variables? What might contribute to performance on a standardized exam? How do we speculate the relationship might work?

Step 3) Operationalize
How would we measure this stuff? Expenditure in dollars per student; performance in points on a standardized exam.

Step 4) Hypothesize Regression Model
Yi = β0 + β1 Xi + εi
Scorei = β0 + β1 Expenditurei + εi

Step 5) Collect data
Thank you Welch and Comer; see table 225/226 on expenditure/scores.

Step 6) Check for Multicollinearity
(Done! Well, not done, but only needed for multiple regression.)

Step 7) Estimate OLS equation
(Can be done with the Data Analysis tool in Excel, but we'll do a simple form in the class exercise so people understand the deconstructed version of the black box that is Excel.)

Step 8) Do Statistical Tests
8a. Goodness of fit

Simple Regression II

Review/summary of the objectives of regression:

To determine whether a relationship exists between two variables.
To describe the nature of the relationship, should one exist, in the form of a mathematical equation.
To assess the degree of accuracy of description or prediction achieved by the regression equation.
In multiple regression, to assess the relative importance of the various predictor variables in their contribution to variation of the dependent variable.
Assumptions of Linear Regression:

The relationship is approximately linear (approximates a straight line in a scatter plot of Y against X).
For each value of X there is a probability distribution of independent values of Y, and from each of these Y distributions one or more values is sampled at random.
The means of the Y distributions fall on the regression line.

Thus any individual observation can vary from the line, and this variation is captured by the error term, ε.

Left off at Step 8, Statistical tests.

8a) Overall Goodness of Fit Test

Total sum of squares = sum of squares due to regression + sum of squares about regression:

$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$

TSS = SSDue + SSAbout (SSAbout is aka error)

R2, or the coefficient of determination, is defined as the percent of variation in Y about its mean that is explained by the linear influence of the variation of X. Mathematically it is described by:

R2 = SSD/TSS

and will range between 0 and 1. Closer to zero is a poorer model; closer to one is a better model.

Example: say you had a regression model for which you calculated SSD/TSS as 463.7/502.5 = .92; that is, the model explains 92% of the variation about the mean.

8b) Statistical significance of regression coefficients

We need to ask ourselves: statistically speaking, is β1 significantly different from zero? (We generally do not test the constant.)

Ho: β1 = 0
Ha: β1 ≠ 0

$t = \frac{b_1 - \beta_1}{\text{std. err. } b_1}$

where d.f. = n-k-1 and k is the number of independent variables.

Say you've got a b1 of -.459 and a std. err. of b1 of .047:

tcalc = (-.459 - 0)/.047 = -9.77 (standard errors, or calculated t)
tcrit(alpha/2, n-k-1) = tcrit(.025, 8) = 2.306

Since the absolute value of tcalc (9.77) is greater than tcrit (2.306), we reject Ho.

9) Interpret Regression Coefficients

Change in Y associated with a one-unit change in X.

Specific language for the definition of b1 for time series and cross-sectional studies:

Cross Sectional: If A is one unit higher on the independent variable than B, then A will be b1 units of Y greater or less than B.
Example: If shopping center A has 1 square foot more space than another shopping center, B, it will generate .003 more trips than B.

Time Series: When the independent variable increases by one unit, the dependent variable changes by b1 units of Y.

Since we are less confident about point estimates, we give a confidence interval for our regression coefficient. The formula is:

b1 ± s.e.(b1) × t(alpha/2, n-k-1)

At the 95% confidence level, or alpha = .05:

-.459 ± .047(2.306) = -.459 ± .108 = a range of -.57 to -.35

Pr[-.57 ≤ β1 ≤ -.35] = .95

or, we are 95% sure that the range will include β1.

Step 10) Four OLS Assumptions and How to Test Them

Normality: the error term is distributed normally around a mean of zero. If it is not normal, it calls into question β1.

Homoskedasticity: assumes equal variance of the error term for every level of the independent variable (typically a problem with cross-sectional data).

Non-Auto Regressive: an error term εi, associated with one observation, is not associated with the error term of the next observation (typically a problem with time series data). You should not be able to see trends, or guess the next error term.

Random effects: observations on the independent variable X are 1) randomly selected, and 2) independent of all other independent variables (for multiple regression).

What to do: plot ε against Xi and use that plot to check the first three assumptions.

Interpretation of the error:

If a model predicted X and the actual value was X-5, the model overpredicted the value by 5 units.

If ε is positive (that is, Yi - Ŷi > 0), the model is underestimating; if ε is negative, the model is overestimating. ε indicates the success (or lack thereof) of your OLS analysis.
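The worked numbers in these notes (SSD/TSS = 463.7/502.5, b1 = -.459, std. err. = .047, tcrit = 2.306) can be re-checked with a few lines of Python. This is only a sketch: the function names are assumptions of this example, and the critical t is taken from the notes rather than looked up by the code.

```python
def r_squared(ss_due, tss):
    """Coefficient of determination: explained share of total variation."""
    return ss_due / tss


def t_statistic(b1, se_b1, beta1_null=0.0):
    """Calculated t for Ho: beta1 = beta1_null."""
    return (b1 - beta1_null) / se_b1


def confidence_interval(b1, se_b1, t_crit):
    """Interval b1 +/- se_b1 * t_crit, returned as (low, high)."""
    margin = se_b1 * t_crit
    return b1 - margin, b1 + margin


# Values taken from the worked examples in the notes.
r2 = r_squared(463.7, 502.5)
t_calc = t_statistic(-0.459, 0.047)
t_crit = 2.306  # t table value for alpha/2 = .025 and 8 d.f., as given in the notes
low, high = confidence_interval(-0.459, 0.047, t_crit)

# Two-tailed test: reject Ho when |t_calc| exceeds t_crit.
reject_h0 = abs(t_calc) > t_crit

print(f"R2 = {r2:.2f}")
print(f"t_calc = {t_calc:.2f}, reject Ho: {reject_h0}")
print(f"95% interval for beta1: {low:.3f} to {high:.3f}")
```

Carrying three decimals, the interval runs from about -0.567 to about -0.351, and the calculated t is far beyond the critical value, so the coefficient is significantly different from zero at the .05 level.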