Acta Univ. Agric. Silvic. Mendelianae Brun. 2014, 62(6), 1527-1534 | DOI: 10.11118/actaun201462061527

Missing Categorical Data Imputation and Individual Observation Level Imputation

Pavel Zimmermann1, Petr Mazouch1, Klára Hulíková Tesárková2
1 Department of Statistics and Probability, Faculty of Informatics and Statistics, University of Economics, nám. W. Churchilla 4, 130 67 Prague 3, Czech Republic
2 Department of Demography and Geodemography, Faculty of Science, Charles University in Prague, Albertov 6, 128 00 Prague 2, Czech Republic

Traditional imputation techniques for missing data focus on predicting the missing value from other observed values. For continuous data, imputation is often based on regression models. For categorical data, the usual techniques are classification methods that set the missing value to the 'most likely' category. This, however, overrepresents the categories that are observed more often and can therefore bias the results of many tasks, especially when dominant categories are present. We present an original imputation methodology that yields the most likely structure (distribution) of the missing data conditional on the observed values. The methodology assumes that the categorical variable containing the missing values has a multinomial distribution. The parameters of this distribution are then estimated using multinomial logistic regression. An illustrative example, the reconstruction of missing values of the highest education level attained by persons in a population, is described.
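The contrast between the two imputation strategies can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a group of 10 persons with missing education level who share the same observed covariates, and it uses made-up category probabilities standing in for fitted values from a multinomial logistic regression. The function names are hypothetical. The "most likely structure" is taken here to be a mode of the multinomial distribution of category counts, found by a greedy allocation that is exact because the log-probability is a separable concave function of the counts.

```python
def impute_by_classification(probs, n_missing):
    """Assign every missing value to the single most probable category."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    counts = [0] * len(probs)
    counts[best] = n_missing
    return counts


def impute_most_likely_structure(probs, n_missing):
    """Return category counts forming a mode of Multinomial(n_missing, probs).

    Moving from count k to k + 1 in category i multiplies the multinomial
    probability by a factor proportional to probs[i] / (k + 1), so each
    unit goes to the category where this ratio is currently largest.
    """
    counts = [0] * len(probs)
    for _ in range(n_missing):
        nxt = max(range(len(probs)), key=lambda i: probs[i] / (counts[i] + 1))
        counts[nxt] += 1
    return counts


# Illustrative (assumed) probabilities for three education levels,
# e.g. primary, secondary, tertiary; not taken from the article.
probs = [0.65, 0.25, 0.10]
n_missing = 10

classified = impute_by_classification(probs, n_missing)
structure = impute_most_likely_structure(probs, n_missing)

print("classification:", classified)
print("most likely structure:", structure)
```

Under these assumed probabilities the classification approach places all 10 imputed values in the dominant category, whereas the structure-based allocation spreads them across categories roughly in proportion to the estimated probabilities.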

Keywords: missing data, categorical data, multinomial regression
Grants and funding:

The article was written with the support provided by the Grant Agency of the Czech Republic to the project No. P404/12/0883 "Generační úmrtnostní tabulky České republiky: data, biometrické funkce a trendy".

Published: January 17, 2015

Zimmermann, P., Mazouch, P., & Hulíková Tesárková, K. (2014). Missing Categorical Data Imputation and Individual Observation Level Imputation. Acta Universitatis Agriculturae et Silviculturae Mendelianae Brunensis, 62(6), 1527-1534. doi: 10.11118/actaun201462061527

This is an open access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0), which permits non-commercial use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.