Prediction with a Log-Transformed Dependent Variable


This discussion continues with the analysis of voting patterns in the state of Florida for the United States 2000 presidential election. A linear regression equation that relates Buchanan's votes to Bush's votes in the counties of Florida may feature heteroskedastic errors arising from an error variance that is positively related to population size.
A possible test for heteroskedasticity is to estimate a separate regression equation for the counties that rank in the top 25% for the total number of votes and for the counties with total number of votes less than the value for the upper quartile. A test for equal error variance in the two groups can then be implemented. This test procedure is known as the Goldfeld-Quandt test. Note that the test requires the choice of a breakpoint for splitting the two groups of observations.
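The procedure described above can be sketched in Python as a cross-check of the logic. This is a minimal illustration, not the SHAZAM implementation: the function names `ols_sse` and `goldfeld_quandt` are hypothetical, the regression has a single regressor plus a constant, and the eight data points are made up purely to exercise the code.

```python
def ols_sse(x, y):
    """Sum of squared residuals from regressing y on a constant and x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


def goldfeld_quandt(size, x, y, n1, k=2):
    """Sort by `size`, split after the first n1 observations, and return
    (statistic, df1, df2) where statistic = s1^2 / s2^2."""
    order = sorted(range(len(size)), key=lambda i: size[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    df1 = n1 - k                 # residual degrees of freedom, first group
    df2 = len(xs) - n1 - k       # residual degrees of freedom, second group
    s1_sq = ols_sse(xs[:n1], ys[:n1]) / df1
    s2_sq = ols_sse(xs[n1:], ys[n1:]) / df2
    return s1_sq / s2_sq, df1, df2


# Made-up data (NOT the Florida votes): the second group has residuals
# twice as large as the first, so the variance ratio is 1/4.
size = [5, 1, 7, 3, 8, 2, 6, 4]
x = size
y = [5, 1, 7, 2, 11, 3, 9, 4]
stat, df1, df2 = goldfeld_quandt(size, x, y, n1=4)
print(stat, df1, df2)
```

A statistic well below one, as in this toy example, points to a larger error variance in the second (larger-size) group. The choice of `n1` plays the role of the breakpoint mentioned above.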
The SHAZAM commands below use the SORT command to order the county observations (excluding Palm Beach county) by the total number of votes. In the ordered data set, the first observation is the county with the smallest number of total votes. The final observation (observation 67), Palm Beach county, is the target county for the prediction exercise. The DIAGNOS command that follows the OLS estimation command is used to compute a Goldfeld-Quandt test for heteroskedasticity. The number of observations in the first group is specified with the CHOWONE= option; here it is set to N75 = INT(66*.75) = 49, which leaves the 17 counties with the largest vote totals in the second group.

SAMPLE 1 67
READ (PRES2000.txt) GORE BUSH BUCHANAN NADER OTHER / SKIPLINE=1
* Exclude Palm Beach County
SAMPLE 1 66
GENR TOTAL=GORE+BUSH+BUCHANAN+NADER+OTHER
SORT TOTAL GORE BUSH BUCHANAN NADER OTHER
GEN1 N75=INT(66*.75)
* Estimate a linear relationship between Buchanan/Bush votes
OLS BUCHANAN BUSH
* Test for heteroskedasticity
DIAGNOS / HET CHOWONE=N75
* Transform the data to logarithms and estimate a log-log model.
SAMPLE 1 67
DIM YHAT 67 SE 67
GENR LBUCHNN=LOG(BUCHANAN)
GENR LBUSH=LOG(BUSH)
SAMPLE 1 66
OLS LBUCHNN LBUSH / LOGLOG
* Test for heteroskedasticity
DIAGNOS / HET CHOWONE=N75
* Log prediction for Buchanan in Palm Beach.
FC / LIST BEG=67 END=67 PREDICT=YHAT FCSE=SE
* Calculate a 99% prediction interval
* Obtain the critical value.
GEN1 DF=$N-$K
SAMPLE 1 1
GEN1 ALPHA=0.01/2
DISTRIB ALPHA / TYPE=T DF=DF INVERSE CRITICAL=TC
SAMPLE 67 67
* Estimate a confidence interval for the log prediction
GENR YUP=YHAT+TC*SE
GENR YLOW=YHAT-TC*SE
* Estimate a confidence interval for the anti-log prediction
GENR YUP=EXP(YUP)
GENR YLOW=EXP(YLOW)
* Estimate the anti-log point prediction
GENR YHAT=EXP(YHAT+SE*SE/2)
* Print the results.
PRINT YLOW YHAT YUP
* The prediction interval is not symmetric about the point prediction.
STOP

The SHAZAM output is reproduced at the end of this page under the heading "SHAZAM output".
For the simple linear regression equation of the Buchanan-Bush vote relationship the Goldfeld-Quandt test statistic is calculated as the ratio of the estimated error variance for the "small-medium" counties to the estimated error variance for the "large" counties. The SHAZAM output reports a test statistic value 0.0674. For a test of the null hypothesis of equal error variance against the alternative hypothesis of larger error variance for the "large" counties the p-value is less than 0.0005. This gives strong evidence for the presence of heteroskedastic errors.
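The reported statistic can be reproduced by hand from the group sums of squared errors printed in the SHAZAM output (SSE1 = 0.10622E+06 with 49 - 2 = 47 degrees of freedom, SSE2 = 0.50292E+06 with 17 - 2 = 15 degrees of freedom). The short Python check below is only an arithmetic cross-check; the helper name `gq_statistic` is hypothetical.

```python
def gq_statistic(sse1, df1, sse2, df2):
    """Ratio of the two estimated error variances, s1^2 / s2^2."""
    return (sse1 / df1) / (sse2 / df2)


# Values copied from the SHAZAM output for the linear model.
gq_linear = gq_statistic(0.10622e6, 47, 0.50292e6, 15)
print(round(gq_linear, 4))  # 0.0674
```

The estimated error variance for the 17 largest counties is therefore roughly 15 times that of the other 49 counties.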
The above SHAZAM commands transform the vote data to logarithms and estimate a linear regression equation with the log-transformed data (excluding Palm Beach county). For this model, the Goldfeld-Quandt test statistic is reported as 1.432. Since the test value is greater than one, the estimated error variance for the first group exceeds the estimated error variance for the second group. A rejection rule for the null hypothesis of equal error variance is obtained by comparing the test statistic with values from an F distribution with (47,15) degrees of freedom. On the SHAZAM output, the p-value for the test is calculated as:
P(F(47,15) > 1.432) = 0.227
The p-value exceeds the usual significance levels, so the null hypothesis of equal error variance is not rejected. The test gives no evidence of heteroskedastic errors in the log-linear model.
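As a rough cross-check of the figures for the log-linear model, the sketch below recomputes the statistic from the group sums of squared errors in the SHAZAM output (8.8998 and 1.9840) and approximates the upper-tail F probability by numerical integration of the F density. The function names are hypothetical and the Simpson-rule integration is a substitute for whatever method SHAZAM uses internally.

```python
import math


def f_pdf(x, d1, d2):
    """Density of the F(d1, d2) distribution (assumes d1 > 2, so f(0) = 0)."""
    if x <= 0.0:
        return 0.0
    log_beta = (math.lgamma(d1 / 2) + math.lgamma(d2 / 2)
                - math.lgamma((d1 + d2) / 2))
    log_pdf = (0.5 * d1 * math.log(d1 / d2)
               + (0.5 * d1 - 1.0) * math.log(x)
               - 0.5 * (d1 + d2) * math.log1p(d1 * x / d2)
               - log_beta)
    return math.exp(log_pdf)


def f_upper_tail(x, d1, d2, steps=2000):
    """P(F(d1, d2) > x) using composite Simpson integration on [0, x].
    `steps` must be even."""
    h = x / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return 1.0 - total * h / 3.0


# Group sums of squared errors and degrees of freedom from the SHAZAM output.
gq_log = (8.8998 / 47) / (1.9840 / 15)
p_value = f_upper_tail(gq_log, 47, 15)
print(round(gq_log, 3), round(p_value, 3))
```

The result should agree with the reported statistic of 1.432 and p-value of 0.227 up to rounding of the printed inputs.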
The results from the log-linear regression can be used to predict the log of the Buchanan vote for Palm Beach county. For a meaningful interpretation of the results, the log prediction must then be converted to a prediction for the number of votes. A useful discussion of the calculation of point predictions and prediction intervals when the dependent variable is log-transformed is given in Nelson [1973, pp. 161-165].
Denote the observations on the dependent variable as:
z_t = log(y_t)
Based on a linear regression model, suppose the point prediction for an observation z₀ is:
ẑ₀
and the estimated standard error of the prediction is:
se(e₀)
A prediction interval estimate for z₀ is:
ẑ₀ ± t_c se(e₀)
where t_c is an appropriate t-distribution critical value.
Assuming the errors for the log-linear regression equation are normally distributed, the antilog point prediction for y₀ = exp(z₀) is:
ŷ₀ = exp(ẑ₀ + se(e₀)²/2)
and a prediction interval estimate for y₀ is:
[exp(ẑ₀ − t_c se(e₀)) , exp(ẑ₀ + t_c se(e₀))]
Note that the interval estimate for the anti-log prediction is not symmetric about the point prediction estimate.
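The conversion from the log scale can be sketched in Python, mirroring the GENR statements in the SHAZAM commands (YHAT=EXP(YHAT+SE*SE/2), YLOW=EXP(YLOW), YUP=EXP(YUP)). The term se²/2 reflects the fact that the mean of a lognormal variable exp(z) with z ~ N(μ, σ²) is exp(μ + σ²/2). The function name `antilog_prediction` is hypothetical, and the inputs are the rounded values printed in the SHAZAM output (log prediction 6.3928, standard error 0.431, critical value 2.6553), so the results match the SHAZAM figures only approximately.

```python
import math


def antilog_prediction(z_hat, se, t_c):
    """Return (lower, point, upper) on the original scale from a log-scale
    prediction z_hat with prediction standard error se."""
    point = math.exp(z_hat + se ** 2 / 2.0)   # lognormal mean correction
    lower = math.exp(z_hat - t_c * se)
    upper = math.exp(z_hat + t_c * se)
    return lower, point, upper


# Rounded inputs copied from the SHAZAM output for Palm Beach county.
lower, point, upper = antilog_prediction(6.3928, 0.431, 2.6553)
print(lower, point, upper)

# The interval is not symmetric: the upper arm is much longer.
print(point - lower, upper - point)
```

Because the exponential function is convex, an interval that is symmetric on the log scale becomes right-skewed after the transformation.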
The above results were applied to obtain a prediction of the number of votes for Buchanan in Palm Beach county based on a log-linear regression that relates Buchanan's vote to Bush's vote in the other 66 Florida counties. The calculations show that the predicted number of votes for Buchanan in Palm Beach county is 656 and the 99% prediction interval estimate is:
[190 , 1875]
The interval estimate obtained from the log-linear model is wider than the interval estimate calculated from the simple linear regression equation. However, the results still show that the Buchanan vote count of 3407 in Palm Beach county may be labelled as an outlier.
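The outlier statement can also be checked on the log scale, where the prediction interval is symmetric. The sketch below recovers the t critical value for 64 degrees of freedom by numerical inversion and compares it with the standardized prediction error for Palm Beach county. The helper names are hypothetical, the integration and bisection are stand-ins for SHAZAM's DISTRIB command, and the inputs 6.3928 and 0.431 are the rounded values from the SHAZAM output.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - 0.5 * (df + 1) * math.log1p(x * x / df))


def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using composite Simpson integration on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3.0


def t_critical(alpha, df):
    """Value c with P(T > c) = alpha, found by bisection on [0, 50]."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if t_upper_tail(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


t_c = t_critical(0.01 / 2, 64)          # 99% two-sided interval, 66 - 2 df
z_observed = math.log(3407)             # observed Buchanan vote, Palm Beach
standardized = (z_observed - 6.3928) / 0.431
print(round(t_c, 4), round(standardized, 2))
```

The observed log vote lies about four prediction standard errors above the log prediction, well beyond the 99% critical value of about 2.66.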
References

Charles R. Nelson, Applied Time Series Analysis for Managerial Forecasting, 1973, Holden-Day.

SHAZAM output

|_SAMPLE 1 67
|_READ (PRES2000.txt) GORE BUSH BUCHANAN NADER OTHER / SKIPLINE=1
UNIT 88 IS NOW ASSIGNED TO: PRES2000.txt
5 VARIABLES AND 67 OBSERVATIONS STARTING AT OBS 1
|_* Exclude Palm Beach County
|_SAMPLE 1 66
|_GENR TOTAL=GORE+BUSH+BUCHANAN+NADER+OTHER
|_SORT TOTAL GORE BUSH BUCHANAN NADER OTHER
DATA HAS BEEN SORTED BY VARIABLE TOTAL
|_GEN1 N75=INT(66*.75)
|_* Estimate a linear relationship between Buchanan/Bush votes
|_OLS BUCHANAN BUSH

OLS ESTIMATION
66 OBSERVATIONS     DEPENDENT VARIABLE= BUCHANAN
...NOTE..SAMPLE RANGE SET TO: 1, 66

R-SQUARE = 0.7511     R-SQUARE ADJUSTED = 0.7472
VARIANCE OF THE ESTIMATE-SIGMA**2 = 12880.
STANDARD ERROR OF THE ESTIMATE-SIGMA = 113.49
SUM OF SQUARED ERRORS-SSE= 0.82430E+06
MEAN OF DEPENDENT VARIABLE = 213.00
LOG OF THE LIKELIHOOD FUNCTION = -404.927

VARIABLE   ESTIMATED    STANDARD    T-RATIO          PARTIAL  STANDARDIZED  ELASTICITY
  NAME    COEFFICIENT    ERROR       64 DF   P-VALUE  CORR.   COEFFICIENT    AT MEANS
BUSH      0.34962E-02  0.2516E-03    13.90    0.000   0.867     0.8666        0.6857
CONSTANT   66.940       17.48        3.829    0.000   0.432     0.0000        0.3143

|_* Test for heteroskedasticity
|_DIAGNOS / HET CHOWONE=N75
DEPENDENT VARIABLE = BUCHANAN     66 OBSERVATIONS
REGRESSION COEFFICIENTS
0.349623785167E-02   66.9403199359

HETEROSKEDASTICITY TESTS
                                 CHI-SQUARE     D.F.   P-VALUE
                               TEST STATISTIC
E**2 ON YHAT:                        31.055       1    0.00000
E**2 ON YHAT**2:                     43.351       1    0.00000
E**2 ON LOG(YHAT**2):                17.817       1    0.00002
E**2 ON LAG(E**2) ARCH TEST:          0.899       1    0.34314
LOG(E**2) ON X (HARVEY) TEST:        19.757       1    0.00001
ABS(E) ON X (GLEJSER) TEST:          56.839       1    0.00000
E**2 ON X TEST:
            KOENKER(R2):             31.055       1    0.00000
            B-P-G (SSR) :           133.086       1    0.00000
E**2 ON X X**2 (WHITE) TEST:
            KOENKER(R2):             46.740       2    0.00000
            B-P-G (SSR) :           200.303       2    0.00000

SEQUENTIAL CHOW AND GOLDFELD-QUANDT TESTS
 N1   N2   SSE1          SSE2          CHOW     PVALUE   G-Q          DF1  DF2  PVALUE
 49   17   0.10622E+06   0.50292E+06   10.950   0.000    0.6740E-01    47   15  0.000
CHOW TEST - F DISTRIBUTION WITH DF1= 2 AND DF2= 62

|_* Transform the data to logarithms and estimate a log-log model.
|_SAMPLE 1 67
|_DIM YHAT 67 SE 67
|_GENR LBUCHNN=LOG(BUCHANAN)
|_GENR LBUSH=LOG(BUSH)
|_SAMPLE 1 66
|_OLS LBUCHNN LBUSH / LOGLOG

OLS ESTIMATION
66 OBSERVATIONS     DEPENDENT VARIABLE= LBUCHNN
...NOTE..SAMPLE RANGE SET TO: 1, 66

R-SQUARE = 0.8652     R-SQUARE ADJUSTED = 0.8631
VARIANCE OF THE ESTIMATE-SIGMA**2 = 0.17662
STANDARD ERROR OF THE ESTIMATE-SIGMA = 0.42026
SUM OF SQUARED ERRORS-SSE= 11.304
MEAN OF DEPENDENT VARIABLE = 4.7965
LOG OF THE LIKELIHOOD FUNCTION(IF DEPVAR LOG) = -351.987

VARIABLE   ESTIMATED    STANDARD    T-RATIO          PARTIAL  STANDARDIZED  ELASTICITY
  NAME    COEFFICIENT    ERROR       64 DF   P-VALUE  CORR.   COEFFICIENT    AT MEANS
LBUSH      0.72960     0.3599E-01    20.27    0.000   0.930     0.9302        0.7296
CONSTANT   -2.3166     0.3547       -6.531    0.000  -0.632     0.0000       -2.3166

|_* Test for heteroskedasticity
|_DIAGNOS / HET CHOWONE=N75
REQUIRED MEMORY IS PAR= 13 CURRENT PAR= 2000
DEPENDENT VARIABLE = LBUCHNN     66 OBSERVATIONS
REGRESSION COEFFICIENTS
0.729595970989   -2.31656542773

HETEROSKEDASTICITY TESTS
                                 CHI-SQUARE     D.F.   P-VALUE
                               TEST STATISTIC
E**2 ON YHAT:                         2.495       1    0.11421
E**2 ON YHAT**2:                      2.025       1    0.15470
E**2 ON LOG(YHAT**2):                 3.071       1    0.07971
E**2 ON LAG(E**2) ARCH TEST:          0.747       1    0.38727
LOG(E**2) ON X (HARVEY) TEST:         0.000       1    0.98734
ABS(E) ON X (GLEJSER) TEST:           1.126       1    0.28873
E**2 ON X TEST:
            KOENKER(R2):              2.495       1    0.11421
            B-P-G (SSR) :             2.402       1    0.12117
E**2 ON X X**2 (WHITE) TEST:
            KOENKER(R2):              4.872       2    0.08750
            B-P-G (SSR) :             4.691       2    0.09580

SEQUENTIAL CHOW AND GOLDFELD-QUANDT TESTS
 N1   N2   SSE1     SSE2     CHOW     PVALUE   G-Q     DF1  DF2  PVALUE
 49   17   8.8998   1.9840   1.1956   0.309    1.432    47   15  0.227
CHOW TEST - F DISTRIBUTION WITH DF1= 2 AND DF2= 62

|_* Log prediction for Buchanan in Palm Beach.
|_FC / LIST BEG=67 END=67 PREDICT=YHAT FCSE=SE
DEPENDENT VARIABLE = LBUCHNN     1 OBSERVATIONS
REGRESSION COEFFICIENTS
0.729595970989   -2.31656542773

 OBS.   OBSERVED   PREDICTED   CALCULATED   STD. ERROR
 NO.     VALUE       VALUE      RESIDUAL
  67     8.1336      6.3928      1.7408       0.431      I *

SUM OF ABSOLUTE ERRORS= 1.7408
R-SQUARE BETWEEN OBSERVED AND PREDICTED = 0.0000
R-SQUARE BETWEEN ANTILOGS OBSERVED AND PREDICTED = 0.0000
MEAN ERROR = 1.7408
SUM-SQUARED ERRORS = 3.0305
MEAN SQUARE ERROR = 3.0305
MEAN ABSOLUTE ERROR= 1.7408
ROOT MEAN SQUARE ERROR = 1.7408
MEAN SQUARED PERCENTAGE ERROR= 458.09
THEIL INEQUALITY COEFFICIENT U = 0.000
DECOMPOSITION
  PROPORTION DUE TO BIAS = 1.0000
  PROPORTION DUE TO VARIANCE = 0.0000
  PROPORTION DUE TO COVARIANCE = 0.0000
DECOMPOSITION
  PROPORTION DUE TO BIAS = 1.0000
  PROPORTION DUE TO REGRESSION = 0.0000
  PROPORTION DUE TO DISTURBANCE = 0.0000

|_* Calculate a 99% prediction interval
|_* Obtain the critical value.
|_GEN1 DF=$N-$K
..NOTE..CURRENT VALUE OF $N = 66.000
..NOTE..CURRENT VALUE OF $K = 2.0000
|_SAMPLE 1 1
|_GEN1 ALPHA=0.01/2
|_DISTRIB ALPHA / TYPE=T DF=DF INVERSE CRITICAL=TC
T DISTRIBUTION DF= 64.000
VARIANCE= 1.0323     H= 1.0000

           PROBABILITY   CRITICAL VALUE   PDF
ALPHA
ROW 1      0.50000E-02   2.6553           0.13308E-01

|_SAMPLE 67 67
|_* Estimate a confidence interval for the log prediction
|_GENR YUP=YHAT+TC*SE
|_GENR YLOW=YHAT-TC*SE
|_* Estimate a confidence interval for the anti-log prediction
|_GENR YUP=EXP(YUP)
|_GENR YLOW=EXP(YLOW)
|_* Estimate the anti-log point prediction
|_GENR YHAT=EXP(YHAT+SE*SE/2)
|_* Print the results.
|_PRINT YLOW YHAT YUP
   YLOW        YHAT        YUP
   190.4035    655.5699    1875.011
|_* The prediction interval is not symmetric about the point prediction.
|_STOP
