Regression & Linear Modeling

Regression & Linear Modeling: Best Practices and Modern Methods

Jason W. Osborne Clemson University

FOR INFORMATION:

SAGE Publications, Inc.
2455 Teller Road
Thousand Oaks, California 91320
E-mail: [email protected]

SAGE Publications Ltd.
1 Oliver’s Yard
55 City Road
London EC1Y 1SP
United Kingdom

SAGE Publications India Pvt. Ltd.
B 1/I 1 Mohan Cooperative Industrial Area
Mathura Road, New Delhi 110 044
India

SAGE Publications Asia-Pacific Pte. Ltd.
3 Church Street
#10-04 Samsung Hub
Singapore 049483

Copyright © 2017 by SAGE Publications, Inc.

All rights reserved. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher.

All trademarks depicted within this book, including trademarks appearing as part of a screenshot, figure, or other image are included solely for the purpose of illustration and are the property of their respective holders. The use of the trademarks in no way indicates any relationship with, or endorsement by, the holders of said trademarks. SPSS is a registered trademark of International Business Machines Corporation.

Printed in the United States of America

Library of Congress Cataloging-in-Publication Data

Names: Osborne, Jason W., author.
Title: Regression & linear modeling : best practices and modern methods / Jason W. Osborne, Clemson University.
Other titles: Regression and linear modeling
Description: Los Angeles : SAGE, [2017] | Includes bibliographical references and index.
Identifiers: LCCN 2015042929 | ISBN 978-1-5063-0276-8 (hardcover : alk. paper)
Subjects: LCSH: Regression analysis. | Linear models (Statistics)
Classification: LCC QA278.2 .O833 2017 | DDC 519.5/36—dc23
LC record available at http://lccn.loc.gov/2015042929

This book is printed on acid-free paper.

Acquisitions Editor: Leah Fargotstein
eLearning Editor: Katie Ancheta
Editorial Assistant: Yvonne McDuffee
Production Editor: Kelly DeRosa
Copy Editor: Christina West
Typesetter: C&M Digitals (P) Ltd
Proofreader: Scott Oney
Indexer: Will Ragsdale
Cover Designer: Candice Harman
Marketing Manager: Susannah Goldes

16 17 18 19 20 10 9 8 7 6 5 4 3 2 1

BRIEF CONTENTS

Preface
Acknowledgments
About the Author
Chapter 1. A Nerdly Manifesto
Chapter 2. Basic Estimation and Assumptions
Chapter 3. Simple Linear Models With Continuous Dependent Variables: Simple Regression Analyses
Chapter 4. Simple Linear Models With Continuous Dependent Variables: Simple ANOVA Analyses
Chapter 5. Simple Linear Models With Categorical Dependent Variables: Binary Logistic Regression
Chapter 6. Simple Linear Models With Polytomous Categorical Dependent Variables: Multinomial and Ordinal Logistic Regression
Chapter 7. Simple Curvilinear Models
Chapter 8. Multiple Independent Variables
Chapter 9. Interactions Between Independent Variables: Simple Moderation
Chapter 10. Curvilinear Interactions Between Independent Variables
Chapter 11. Poisson Models: Low-Frequency Count Data as Dependent Variables
Chapter 12. Log-Linear Models: General Linear Models When All of Your Variables Are Unordered Categorical
Chapter 13. A Brief Introduction to Hierarchical Linear Modeling
Chapter 14. Missing Data in Linear Modeling
Chapter 15. Trustworthy Science: Improving Statistical Reporting
Chapter 16. Reliable Measurement Matters
Chapter 17. Prediction in the Generalized Linear Model
Chapter 18. Modeling in Large, Complex Samples: The Importance of Using Appropriate Weights and Design Effect Compensation
Appendix A. A Brief User’s Guide to z-Scores
Author Index
Subject Index

DETAILED CONTENTS

Preface
Acknowledgments
About the Author

Chapter 1. A Nerdly Manifesto
  The Variables Lead the Way
  Ordinality
  Equal Intervals
  True Zero Point
  Different Classifications of Measurement
  Ratio Measurement
  Interval Measurement
  Ordinal Measurement
  Nominal Measurement
  It’s All About Relationships!
  A Brief Review of Basic Algebra and Linear Equations
  The GLM in One Paragraph
  A Brief Consideration of Prediction Versus Explanation in Linear Modeling
  A Brief Primer on Null Hypothesis Statistical Testing
  A Trivial and Silly Example of Hypothesis Testing
  A Tale of Two Errors
  What Conclusions Can We Draw Based on NHST Results?
  So What Does Failure to Reject the Null Hypothesis Mean?
  Moving Beyond NHST
  Other Pieces of Information Necessary to Draw Proper Conclusions
  The Importance of Replication and Generalizability
  Where We Go From Here
  Enrichment
  References

Chapter 2. Basic Estimation and Assumptions
  Estimation and the GLM
  What Is OLS Estimation?
  ML Estimation—A Gentle but Deeper Look
  Assumptions for OLS and ML Estimation
  Model
  Variables
  Residuals and Distributions
  Simple Univariate Data Cleaning and Data Transformations
  Data Screening
  Missing Data
  Transformation of Data
  University Size and Faculty Salary in the United States
  What If We Cannot Meet the Assumptions?
  Where We Go From Here
  Enrichment
  References

Chapter 3. Simple Linear Models With Continuous Dependent Variables: Simple Regression Analyses
  Advance Organizer
  It’s All About Relationships!
  Basics of the Pearson Product-Moment Correlation Coefficient
  Calculating r
  Effect Sizes and r
  A Real Data Example
  The Basics of Simple Regression
  Basic Calculations for Simple Regression
  Standardized Versus Unstandardized Regression Coefficients
  Hypothesis Testing in Simple Regression
  A Real Data Example
  The Assumption That the Model Is Correctly Specified
  Assumptions About the Variables
  Assumptions About Residuals
  Summary of Results
  Does Centering or z-Scoring Make a Difference?
  Some Simple Multivariate Data Cleaning
  What Is a Bivariate Outlier?
  Standardized Residuals
  Studentized Residuals
  Global Measures of Influence: DfFit or Cook’s Distance (Cook’s D)
  Specific Measures of Influence: DfBetas
  Summary
  Enrichment
  References

Chapter 4. Simple Linear Models With Continuous Dependent Variables: Simple ANOVA Analyses
  Advance Organizer
  It’s All About Relationships! (Part 2)
  Analyzing These Data via t-Test
  Analyzing These Data via ANOVA
  ANOVA Within an OLS Regression Framework
  When Your IV Has More Than Two Groups: Dummy Coding Your Unordered Polytomous Variable
  Define the Reference Group
  Set Up the Dummy-Coded Variables
  Evaluating the Effects of the Categorical Variable in the Regression Model
  Smoking and Diabetes Analyzed via ANOVA
  Smoking and Diabetes Analyzed via Regression
  What If the Dummy Variables Are Coded Differently?
  Unweighted Effects Coding
  Weighted Effects Coding
  Common Alternatives to Dummy or Effects Coding
  Simple Contrasts
  Difference (Reverse Helmert) Contrasts
  Helmert Contrasts
  Repeated Contrasts
  Summary
  Enrichment
  References

Chapter 5. Simple Linear Models With Categorical Dependent Variables: Binary Logistic Regression
  Advance Organizer
  It’s All About Relationships! (Part 3)
  Why Is Logistic Regression Necessary?
  The Linear Probability Model
  How Logistic Regression Solves This Issue: The Logit Link Function
  A Brief Digression Into Probabilities, Conditional Probabilities, and Odds
  Simple Logistic Regression Using Statistical Software
  Indicators of Overall Model Fit
  What Is a −2 Log Likelihood?
  The Logistic Regression Equation
  Interpreting the Constant
  What If You Want CIs for the Constant?
  Summary So Far
  Logistic Regression With a Continuous IV
  Some Best Practices When Using a Continuous Variable in Logistic Regression
  Testing Assumptions and Data Cleaning in Logistic Regression
  Deviance Residuals
  DfBetas
  Hosmer and Lemeshow Test for Model Fit
  How Should We Interpret Odds Ratios That Are Less Than 1.0?
  Summary
  Enrichment
  Appendix 5A: A Brief Primer on Probit Regression
  What Is a Probit?
  The Probit Link
  A Real-Data Example of Probit Regression
  Why Are There Two Different Procedures If They Produce the Same Results?
  Some Nice Features of Probit
  Assumptions of Probit Regression
  Summary and Conclusion
  References

Chapter 6. Simple Linear Models With Polytomous Categorical Dependent Variables: Multinomial and Ordinal Logistic Regression
  Advance Organizer
  Understanding Marijuana Use
  Dummy-Coded DVs and Our Hypotheses to Be Tested
  Basics and Calculations
  Multinomial Logistic Regression (Unordered) With Statistical Software
  Multinomial Logistic Regression With a Continuous Predictor
  Multinomial Logistic Regression as a Series of Binary Logistic Regressions
  Data Cleaning and Multinomial Logistic Regression
  Testing Whether Groups Can Be Combined
  Ordered Logit (Proportional Odds) Model
  Assumptions of the Ordinal Logistic Model
  Interpreting the Results of the Ordinal Regression
  Interpreting the Intercepts/Thresholds
  Interpreting the Parameter Estimates
  Data Cleaning and More Advanced Models in Ordinal Logistic Regression
  The Measured Variable Is Continuous, Why Not Just Use OLS Regression for This Type of Analysis?
  A Brief Note on Log-Linear Analyses
  Summary and Conclusions
  Enrichment
  References

Chapter 7. Simple Curvilinear Models
  Advance Organizer
  Zeno’s Paradox, a Nerdy Science Joke, and Inherent Curvilinearity in the Universe . . .
  A Brief Review of Simple Algebra
  Hypotheses to Be Tested
  Illegitimate Causes of Curvilinearity
  Model Misspecification: Omission of Important Variables
  Poor Data Cleaning
  Detection of Nonlinear Effects
  Theory
  Ad Hoc Testing
  Box-Tidwell Transformations
  Basic Principles of Curvilinear Regression
  Occam’s Razor
  Ordered Entry of Variables
  Each Effect Is One Part of the Entire Effect
  Centering
  Curvilinear OLS Regression Example: Size of the University and Faculty Salary
  Data Cleaning
  Interpreting Curvilinear Effects Effectively
  Reality Testing This Effect
  Summary of Curvilinear Effects in OLS Regression
  Curvilinear Logistic Regression Example: Diabetes and Age
  Curvilinear Effects in Multinomial Logistic Regression
  Replication Becomes Important
  More Fun With Curves: Estimating Minima and Maxima as Well as Slope at Any Point on the Curve
  Summary
  Enrichment
  References

Chapter 8. Multiple Independent Variables
  Advance Organizer
  The Basics of Multiple Predictors
  What Are the Implications of This Act?
  Hypotheses to Be Tested in Multiple Regression
  Assumptions of Multiple Regression and Data Cleaning
  Predicting Student Achievement From Real Data
  Where Is the Missing Variance?
  Testing Assumptions and Data Cleaning in the NELS88 Data
  What Does the Intercept Mean When There Are Multiple IVs?
  Methods of Entering Variables
  User-Controlled Methods of Entry
  Hierarchical Entry
  Blockwise Entry
  Software-Controlled Entry
  Using Multiple Regression for Theory Testing
  What Is the Meaning of This Intercept?
  Logistic Regression With Multiple IVs
  Assessing the Overall Logistic Regression Model: Why There Is No R² for Logistic Regression
  Summary and Conclusions
  Enrichment
  References

Chapter 9. Interactions Between Independent Variables: Simple Moderation
  Advance Organizer
  What Is an Interaction?
  Procedural and Conceptual Issues in Testing for Interactions Between Continuous Variables
  Procedural and Conceptual Issues in Testing for Interactions Containing Categorical Variables
  Hypotheses to Be Tested in Multiple Regression With Interactions Present
  An OLS Regression Example: Predicting Student Achievement From Real Data
  Interpreting the Results From a Significant Interaction
  Graphing Interaction Effects
  Staying Out of Trouble on the X Axis
  Staying Out of Trouble on the Y Axis
  Procedural Issues With Graphing
  An Interaction Between a Continuous and a Categorical Variable in OLS Regression
  Interactions With Logistic Regression
  Example Summary of Interaction Analysis
  Interactions and Multinomial Logistic Regression
  Data Cleaning
  Calculation of Overall Model Statistics
  Example Summary of Findings
  Can These Effects Replicate?
  Post Hoc Probing of Interactions
  Regions of Significance
  Using Statistical Software to Produce Simple Slopes Analyses
  Summary
  Enrichment
  References

Chapter 10. Curvilinear Interactions Between Independent Variables
  Advance Organizer
  What Is a Curvilinear Interaction?
  A Quadratic Interaction Between X and Z
  A Cubic Interaction Between X and Z
  A Real-Data Example and Exploration of Procedural Details
  Step 1. Create the Terms Prior to Analysis
  Step 2. Build Your Equation Slowly
  Step 3. Clean the Data Thoughtfully to Ensure You Are Not Missing an Interesting Effect
  Step 4. After Influential Cases Are Removed, Perform the Analysis Again
  Step 5. Provide Your Audience With a Graphical Representation of These Complex Results
  Step 6. Summarize the Results Coherently Using the Graphs as Guides
  Summary
  Curvilinear Interactions Between Continuous and Categorical Variables
  Summary
  Curvilinear Interactions With Categorical DVs (Multinomial Logistic)
  Curvilinear Interaction Effects in Ordinal Regression
  Summary
  Chapter Summary
  Enrichment
  References

Chapter 11. Poisson Models: Low-Frequency Count Data as Dependent Variables
  Advance Organizer
  The Basics and Assumptions of Poisson Regression
  Curvilinearity in Poisson Models
  The Nature of the Variables
  Issues With Zeros
  Issues With Variance
  Why Can’t We Just Analyze Count Data via OLS, Multinomial, or Ordinal Regression?
  Multinomial or Ordinal Regression
  Hypotheses Tested in Poisson Regression
  Model Fit
  Poisson Regression With Real Data
  Interactions in Poisson Regression
  Data Cleaning in Poisson Regression
  Refining the Model by Eliminating Excess (Inappropriate) Zeros
  A Refined Analysis With Excess Zeros Removed
  Curvilinear Effects in Poisson Regression
  Dealing With Overdispersion or Underdispersion
  Effects of Adjusting the Scale Parameter
  Negative Binomial Model
  Summary and Conclusions
  Enrichment
  References

Chapter 12. Log-Linear Models: General Linear Models When All of Your Variables Are Unordered Categorical
  Advance Organizer
  The Basics of Log-Linear Analysis
  What Is Different About Log-Linear Analysis?
  Hypotheses Being Tested
  Individual Parameter Estimates
  Assumptions of Log-Linear Models
  A Slightly More Complex Log-Linear Model
  Can We Replicate These Results in Logistic Regression?
  Data Cleaning in Log-Linear Models
  Summary and Conclusions
  Enrichment
  References

Chapter 13. A Brief Introduction to Hierarchical Linear Modeling
  Advance Organizer
  Why HLM Models Are Necessary
  What Is a Hierarchical Data Structure?
  Why Is Hierarchical or Nested Data an Issue?
  The Problem of Independence of Observations
  The Problem of How to Deal With Multilevel Data
  How Do Hierarchical Models Work? A Brief Primer
  Generalizing the Basic HLM Model
  Example 1. Modeling a Continuous DV in HLM
  Example 2. Modeling Binary Outcomes in HLM
  Residuals in HLM
  Results of DROPOUT Analysis in HLM
  Cross-Level Interactions in HLM Logistic Regression
  So What Would Have Happened If These Data Had Been Analyzed via Simple Logistic Regression Without Accounting for the Nested Data Structure?
  Summary and Conclusions
  Enrichment
  References

Chapter 14. Missing Data in Linear Modeling
  Advance Organizer
  Not All Missing Data Are the Same
  Utility of Legitimately Missing Data for Data Checking
  Categories of Missingness: Why Do We Care If Data Are MCAR or Not?
  How Do You Know If Your Data Are MCAR, MAR, or MNAR?
  What Do We Do With Randomly Missing Data?
  Data MCAR
  Mean Substitution
  Strong and Weak Regression Imputation
  Multiple Imputation (Bayesian)
  Summary
  Data MNAR
  Example 1. Nonrandom Missingness Reverses the Effect
  Example 2. Nonrandom Missingness Dramatically Inflates the Effect
  Summary
  How Missingness Can Be an Interesting Variable in and of Itself
  Summing Up: Benefits of Appropriately Handling Missing Data
  Enrichment
  References

Chapter 15. Trustworthy Science: Improving Statistical Reporting
  Advance Organizer
  What Is Power, and Why Is It Important?
  Correctly Rejecting a Null Hypothesis
  Informing Null Results
  Is Power an Ethical Issue?
  Power in Linear Models
  OLS Regression With Multiple Predictors
  Binary Logistic Regression
  Summary of Points Thus Far
  Who Cares as Long as p < .05? Volatility in Linear Models
  Small Samples Versus Large Samples
  A Brief Introduction to Bootstrap Resampling
  Principle 1. Results From Larger Samples Will Be Less Volatile Than Results From Smaller Samples
  Principle 2. Effect Sizes Should Not Affect the Replicability of the Results
  Principle 3. Complex Effects Are Less Likely to Replicate Than Simple Effects, Particularly in Smaller Samples
  Summary and Conclusions
  Enrichment
  References

Chapter 16. Reliable Measurement Matters
  Advance Organizer
  A More Modern View of Reliability
  What Is Cronbach’s Alpha (and What Is It Not)?
  Alpha and the Kuder-Richardson Coefficient of Equivalence
  The Correct Interpretation of Alpha
  What Alpha Is Not
  Factors That Influence Alpha
  Length of the Scale
  Average Inter-Item Correlation
  Reverse-Coded Items (Negative Item-Total Correlations)
  Random Responding or Response Sets
  Multidimensionality
  Outliers
  Other Assumptions of Alpha
  What Is “Good Enough” for Alpha?
  Reliability and Simple Correlation or Regression
  Reliability and Multiple IVs
  Reliability and Interactions in Multiple Regression
  Protecting Against Overcorrecting During Disattenuation
  Other (Better) Solutions to the Issue of Measurement Error
  Does Reliability Influence Other Analyses, Such as Analysis of Variance?
  Reliability in Logistic Models
  But Other Authors Have Argued That Poor Reliability Isn’t That Important. Who Is Right?
  Sample Size and the Precision/Stability of Alpha-Empirical CIs
  Summary and Conclusions
  References

Chapter 17. Prediction in the Generalized Linear Model
  Advance Organizer
  Prediction Versus Explanation
  How Is a Prediction Equation Created?
  Methods for Entering Variables Into the Equation
  Shrinkage and Evaluating the Quality of Prediction Equations
  Cross-Validation
  Double Cross-Validation
  An Example Using Real Data
  Double Cross-Validation
  So How Much Shrinkage Is Too Much Shrinkage?
  The Final Step
  How Does Sample Size Affect the Shrinkage and Stability of a Prediction Equation?
  Improving on Prediction Models
  Calculating a Predicted Score, and CIs Around That Score
  Prediction (Prognostication) in Logistic Regression (and Other) Models
  Overall Performance
  Concordance or Discrimination
  Estimated Shrinkage in Logistic Models
  Other Proposed Methods of Estimating Shrinkage
  An Example of External Validation of a Prognostic Equation Using Real Data
  External Validation of a Prediction Equation
  Overall Performance (Brier Score)
  Estimated Shrinkage
  Concordance and Discrimination
  Using Bootstrap Analysis to Estimate a More Robust Prognostic Equation
  General Bootstrap Methodology for Internal Validation of a Prognostic Model
  Internal Validation
  Summary
  References

Chapter 18. Modeling in Large, Complex Samples: The Importance of Using Appropriate Weights and Design Effect Compensation
  Advance Organizer
  What Types of Studies Use Complex Sampling?
  Why Does Complex Sampling Matter?
  What Are Best Practices in Accounting for Complex Sampling?
  Does It Really Make a Difference in the Results?
  Conditions Used
  Unweighted
  Weighted
  Scaled Weights
  Appropriately Modeled
  Comparison of Unweighted Versus Weighted Analyses
  Large Effect in Ordinary Least Squares Regression
  Modest Effect in Binary Logistic Regression
  Null Effect in Analysis of Variance
  Null Effect in Ordinary Least Squares Regression
  Summary
  Enrichment
  References

Appendix A. A Brief User’s Guide to z-Scores
  The Normal (Gaussian) Distribution
  Why Is the Normal Distribution Such a Big Deal?

Author Index
Subject Index