Prediction with a Log-Transformed Dependent Variable

This discussion continues with the analysis of voting patterns in the state of Florida for the United States 2000 presidential election. A linear regression equation that relates Buchanan's votes to Bush's votes in the counties of Florida may feature heteroskedastic errors arising from an error variance that is positively related to population size.

A possible test for heteroskedasticity is to estimate a separate regression equation for the counties that rank in the top 25% for the total number of votes and for the counties with total number of votes less than the value for the upper quartile. A test for equal error variance in the two groups can then be implemented. This test procedure is known as the Goldfeld-Quandt test. Note that the test requires the choice of a breakpoint for splitting the two groups of observations. The SHAZAM commands below use the
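The SHAZAM output is listed at the end of this discussion. For the simple linear regression equation of the Buchanan-Bush
vote relationship the Goldfeld-Quandt test statistic is calculated as
the ratio of the estimated error variance for the "small-medium"
counties to the estimated error variance for the "large" counties.
The SHAZAM output reports a test statistic value of 0.0674 (with 47 and 15 degrees of freedom) and a p-value of 0.000. The hypothesis of equal error variance in the two groups is therefore rejected, which gives evidence of heteroskedasticity in the linear model.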
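The variance ratio described here can be checked by hand from the group sums of squared errors and degrees of freedom printed in the SHAZAM listing (SSE1, SSE2, DF1, DF2). The Python sketch below is only an illustration of that arithmetic; the function name `goldfeld_quandt` is hypothetical and is not part of SHAZAM.

```python
def goldfeld_quandt(sse_small, df_small, sse_large, df_large):
    """Ratio of the two estimated error variances (SSE / degrees of freedom).

    The numerator is the "small-medium" group and the denominator is the
    "large" group, matching the ordering described in the text.
    """
    var_small = sse_small / df_small
    var_large = sse_large / df_large
    return var_small / var_large

# Values copied from the SEQUENTIAL CHOW AND GOLDFELD-QUANDT TESTS line
# for the linear model: SSE1, DF1, SSE2, DF2.
gq_linear = goldfeld_quandt(0.10622e6, 47, 0.50292e6, 15)
print(round(gq_linear, 4))
```

A ratio far below one says the estimated error variance for the large counties is much bigger than for the small-medium counties, which is the pattern expected if the error variance grows with population size.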
The SHAZAM commands then transform the vote data to logarithms
and estimate a linear regression equation with the log-transformed
data (excluding Palm Beach county).
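For readers without SHAZAM, the log-log estimation step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the SHAZAM implementation: the function name `fit_loglog` is hypothetical, and the toy data are made up (an exact power law, y = 2·x^0.5) purely so that the result is easy to verify. They are not the Florida vote counts.

```python
import math

def fit_loglog(x, y):
    """Least-squares fit of log(y) on log(x) with an intercept.

    Returns (intercept, slope). In a log-log model the slope is the
    elasticity of y with respect to x.
    """
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mean_lx = sum(lx) / n
    mean_ly = sum(ly) / n
    sxx = sum((a - mean_lx) ** 2 for a in lx)
    sxy = sum((a - mean_lx) * (b - mean_ly) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_ly - slope * mean_lx
    return intercept, slope

# Made-up toy data following y = 2 * x**0.5 exactly (NOT the vote data).
toy_x = [1.0, 4.0, 9.0, 16.0]
toy_y = [2.0, 4.0, 6.0, 8.0]
intercept, slope = fit_loglog(toy_x, toy_y)
print(intercept, slope)
```

With the real data, the observations for the 66 counties other than Palm Beach would be passed in place of the toy lists.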
For this model, the Goldfeld-Quandt test statistic is reported
as 1.432 with a p-value of 0.227. The p-value exceeds the usual significance levels and therefore the hypothesis of homoskedastic errors is not rejected for the log-linear model.

The results from the log-linear regression can be used to predict the log of the Buchanan vote for Palm Beach county. For a meaningful interpretation of the results, the log prediction must then be converted to a prediction for the number of votes. A useful discussion about the calculation of point predictions and prediction intervals when the dependent variable is log-transformed is Nelson [1973, pp. 161-165].

Denote the observations on the dependent variable as:

$$z_t = \log(y_t)$$

Based on a linear regression model, suppose the point prediction for an observation $z_0$ is $\hat{z}_0$ and the estimated standard error of the prediction is $se(e_0)$. A prediction interval estimate for $z_0$ is:

$$\hat{z}_0 \pm t_c \, se(e_0)$$

where $t_c$ is an appropriate t-distribution critical value. Assuming the errors for the log-linear regression equation are normally distributed, the antilog point prediction for $y_0 = \exp(z_0)$ is:

$$\hat{y}_0 = \exp\left(\hat{z}_0 + se(e_0)^2 / 2\right)$$

and a prediction interval estimate for $y_0$ is:

$$\left[\, \exp\left(\hat{z}_0 - t_c \, se(e_0)\right), \; \exp\left(\hat{z}_0 + t_c \, se(e_0)\right) \,\right]$$

Note that the interval estimate for the antilog prediction is not symmetric about the point prediction estimate.

The above results were applied to obtain a prediction of
the number of votes for Buchanan in Palm Beach county based on
a log-linear regression that relates Buchanan's vote to Bush's vote in the
other 66 Florida counties.
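The conversion from the log prediction to a vote count can be sketched in Python using the rounded values printed in the SHAZAM listing: the log prediction 6.3928, its standard error 0.431, the critical value 2.6553 and the observed log vote 8.1336. The function name `antilog_prediction` is hypothetical. Because the inputs are rounded, the results agree with the SHAZAM figures only approximately.

```python
import math

def antilog_prediction(z_hat, se, t_crit):
    """Convert a log-scale prediction into (lower, point, upper) on the original scale.

    The point prediction adds se**2 / 2 before exponentiating, which assumes
    normally distributed errors in the log-linear equation. The interval
    limits are the antilogs of the log-scale interval limits.
    """
    point = math.exp(z_hat + se * se / 2.0)
    lower = math.exp(z_hat - t_crit * se)
    upper = math.exp(z_hat + t_crit * se)
    return lower, point, upper

# Rounded values taken from the SHAZAM listing (Palm Beach, observation 67).
z_hat, se, t_crit = 6.3928, 0.431, 2.6553
lower, point, upper = antilog_prediction(z_hat, se, t_crit)
print(lower, point, upper)

# Observed Buchanan vote in Palm Beach, recovered from its log value.
observed = math.exp(8.1336)
print(observed > upper)
```

The upper limit lies much further from the point prediction than the lower limit does, which is the asymmetry noted for the antilog interval.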
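The calculations show that the predicted number of
votes for Buchanan in Palm Beach county is 656, with a 99% prediction interval of [190, 1875]. The interval estimate obtained from the log-linear model is wider
than the interval estimate calculated from the simple linear regression
equation.
However, the results still show that the Buchanan vote count of 3407 in Palm Beach county (an observed log value of 8.1336) lies well above the upper limit of the prediction interval.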
References

Charles R. Nelson, Applied Time Series Analysis for Managerial Forecasting, 1973, Holden-Day.
SHAZAM output

|_SAMPLE 1 67
|_READ (PRES2000.txt) GORE BUSH BUCHANAN NADER OTHER / SKIPLINE=1
UNIT 88 IS NOW ASSIGNED TO: PRES2000.txt
   5 VARIABLES AND   67 OBSERVATIONS STARTING AT OBS   1
|_* Exclude Palm Beach County
|_SAMPLE 1 66
|_GENR TOTAL=GORE+BUSH+BUCHANAN+NADER+OTHER
|_SORT TOTAL GORE BUSH BUCHANAN NADER OTHER
DATA HAS BEEN SORTED BY VARIABLE TOTAL
|_GEN1 N75=INT(66*.75)
|_* Estimate a linear relationship between Buchanan/Bush votes
|_OLS BUCHANAN BUSH

OLS ESTIMATION
   66 OBSERVATIONS   DEPENDENT VARIABLE= BUCHANAN
...NOTE..SAMPLE RANGE SET TO: 1, 66

R-SQUARE = 0.7511   R-SQUARE ADJUSTED = 0.7472
VARIANCE OF THE ESTIMATE-SIGMA**2 = 12880.
STANDARD ERROR OF THE ESTIMATE-SIGMA = 113.49
SUM OF SQUARED ERRORS-SSE= 0.82430E+06
MEAN OF DEPENDENT VARIABLE = 213.00
LOG OF THE LIKELIHOOD FUNCTION = -404.927

VARIABLE   ESTIMATED    STANDARD    T-RATIO           PARTIAL  STANDARDIZED  ELASTICITY
  NAME     COEFFICIENT  ERROR       64 DF    P-VALUE  CORR.    COEFFICIENT   AT MEANS
BUSH       0.34962E-02  0.2516E-03  13.90    0.000    0.867    0.8666        0.6857
CONSTANT   66.940       17.48       3.829    0.000    0.432    0.0000        0.3143

|_* Test for heteroskedasticity
|_DIAGNOS / HET CHOWONE=N75

DEPENDENT VARIABLE = BUCHANAN   66 OBSERVATIONS
REGRESSION COEFFICIENTS
   0.349623785167E-02   66.9403199359

HETEROSKEDASTICITY TESTS
                                 CHI-SQUARE       D.F.   P-VALUE
                                 TEST STATISTIC
E**2 ON YHAT:                     31.055           1     0.00000
E**2 ON YHAT**2:                  43.351           1     0.00000
E**2 ON LOG(YHAT**2):             17.817           1     0.00002
E**2 ON LAG(E**2) ARCH TEST:       0.899           1     0.34314
LOG(E**2) ON X (HARVEY) TEST:     19.757           1     0.00001
ABS(E) ON X (GLEJSER) TEST:       56.839           1     0.00000
E**2 ON X TEST:
        KOENKER(R2):              31.055           1     0.00000
        B-P-G (SSR) :            133.086           1     0.00000
E**2 ON X X**2 (WHITE) TEST:
        KOENKER(R2):              46.740           2     0.00000
        B-P-G (SSR) :            200.303           2     0.00000

SEQUENTIAL CHOW AND GOLDFELD-QUANDT TESTS
N1   N2   SSE1         SSE2         CHOW     PVALUE   G-Q         DF1   DF2   PVALUE
49   17   0.10622E+06  0.50292E+06  10.950   0.000    0.6740E-01  47    15    0.000
CHOW TEST - F DISTRIBUTION WITH DF1= 2 AND DF2= 62

|_* Transform the data to logarithms and estimate a log-log model.
|_SAMPLE 1 67
|_DIM YHAT 67 SE 67
|_GENR LBUCHNN=LOG(BUCHANAN)
|_GENR LBUSH=LOG(BUSH)
|_SAMPLE 1 66
|_OLS LBUCHNN LBUSH / LOGLOG

OLS ESTIMATION
   66 OBSERVATIONS   DEPENDENT VARIABLE= LBUCHNN
...NOTE..SAMPLE RANGE SET TO: 1, 66

R-SQUARE = 0.8652   R-SQUARE ADJUSTED = 0.8631
VARIANCE OF THE ESTIMATE-SIGMA**2 = 0.17662
STANDARD ERROR OF THE ESTIMATE-SIGMA = 0.42026
SUM OF SQUARED ERRORS-SSE= 11.304
MEAN OF DEPENDENT VARIABLE = 4.7965
LOG OF THE LIKELIHOOD FUNCTION(IF DEPVAR LOG) = -351.987

VARIABLE   ESTIMATED    STANDARD    T-RATIO           PARTIAL  STANDARDIZED  ELASTICITY
  NAME     COEFFICIENT  ERROR       64 DF    P-VALUE  CORR.    COEFFICIENT   AT MEANS
LBUSH      0.72960      0.3599E-01  20.27    0.000    0.930    0.9302        0.7296
CONSTANT   -2.3166      0.3547      -6.531   0.000    -0.632   0.0000        -2.3166

|_* Test for heteroskedasticity
|_DIAGNOS / HET CHOWONE=N75
REQUIRED MEMORY IS PAR= 13 CURRENT PAR= 2000

DEPENDENT VARIABLE = LBUCHNN   66 OBSERVATIONS
REGRESSION COEFFICIENTS
   0.729595970989   -2.31656542773

HETEROSKEDASTICITY TESTS
                                 CHI-SQUARE       D.F.   P-VALUE
                                 TEST STATISTIC
E**2 ON YHAT:                      2.495           1     0.11421
E**2 ON YHAT**2:                   2.025           1     0.15470
E**2 ON LOG(YHAT**2):              3.071           1     0.07971
E**2 ON LAG(E**2) ARCH TEST:       0.747           1     0.38727
LOG(E**2) ON X (HARVEY) TEST:      0.000           1     0.98734
ABS(E) ON X (GLEJSER) TEST:        1.126           1     0.28873
E**2 ON X TEST:
        KOENKER(R2):               2.495           1     0.11421
        B-P-G (SSR) :              2.402           1     0.12117
E**2 ON X X**2 (WHITE) TEST:
        KOENKER(R2):               4.872           2     0.08750
        B-P-G (SSR) :              4.691           2     0.09580

SEQUENTIAL CHOW AND GOLDFELD-QUANDT TESTS
N1   N2   SSE1     SSE2     CHOW     PVALUE   G-Q     DF1   DF2   PVALUE
49   17   8.8998   1.9840   1.1956   0.309    1.432   47    15    0.227
CHOW TEST - F DISTRIBUTION WITH DF1= 2 AND DF2= 62

|_* Log prediction for Buchanan in Palm Beach.
|_FC / LIST BEG=67 END=67 PREDICT=YHAT FCSE=SE

DEPENDENT VARIABLE = LBUCHNN   1 OBSERVATIONS
REGRESSION COEFFICIENTS
   0.729595970989   -2.31656542773

OBS.   OBSERVED   PREDICTED   CALCULATED   STD. ERROR
NO.    VALUE      VALUE       RESIDUAL
67     8.1336     6.3928      1.7408       0.431        I   *

SUM OF ABSOLUTE ERRORS= 1.7408
R-SQUARE BETWEEN OBSERVED AND PREDICTED = 0.0000
R-SQUARE BETWEEN ANTILOGS OBSERVED AND PREDICTED = 0.0000
MEAN ERROR = 1.7408
SUM-SQUARED ERRORS = 3.0305
MEAN SQUARE ERROR = 3.0305
MEAN ABSOLUTE ERROR= 1.7408
ROOT MEAN SQUARE ERROR = 1.7408
MEAN SQUARED PERCENTAGE ERROR= 458.09
THEIL INEQUALITY COEFFICIENT U = 0.000
DECOMPOSITION
   PROPORTION DUE TO BIAS = 1.0000
   PROPORTION DUE TO VARIANCE = 0.0000
   PROPORTION DUE TO COVARIANCE = 0.0000
DECOMPOSITION
   PROPORTION DUE TO BIAS = 1.0000
   PROPORTION DUE TO REGRESSION = 0.0000
   PROPORTION DUE TO DISTURBANCE = 0.0000

|_* Calculate a 99% prediction interval
|_* Obtain the critical value.
|_GEN1 DF=$N-$K
..NOTE..CURRENT VALUE OF $N = 66.000
..NOTE..CURRENT VALUE OF $K = 2.0000
|_SAMPLE 1 1
|_GEN1 ALPHA=0.01/2
|_DISTRIB ALPHA / TYPE=T DF=DF INVERSE CRITICAL=TC
T DISTRIBUTION DF= 64.000
VARIANCE= 1.0323   H= 1.0000

          PROBABILITY   CRITICAL VALUE   PDF
ALPHA
ROW 1     0.50000E-02   2.6553           0.13308E-01

|_SAMPLE 67 67
|_* Estimate a confidence interval for the log prediction
|_GENR YUP=YHAT+TC*SE
|_GENR YLOW=YHAT-TC*SE
|_* Estimate a confidence interval for the anti-log prediction
|_GENR YUP=EXP(YUP)
|_GENR YLOW=EXP(YLOW)
|_* Estimate the anti-log point prediction
|_GENR YHAT=EXP(YHAT+SE*SE/2)
|_* Print the results.
|_PRINT YLOW YHAT YUP
   YLOW       YHAT       YUP
   190.4035   655.5699   1875.011
|_* The prediction interval is not symmetric about the point prediction.
|_STOP