Asian Ethnic Group Classification Model Using Data Mining

Yoon Geon Kim; Ji Hyun Lee; Sohee Cho; Moon Young Kim; Soong Deok Lee; Eun Ho Ha; Jae Joon Ahn

doi:10.7580/kjlm.2017.41.2.32

Original Article

Korean J Leg Med 2017;41(2):32-40.

Published online: 19 January 2017

Yoon Geon Kim¹, Ji Hyun Lee², Sohee Cho³, Moon Young Kim³, Soong Deok Lee²,³, Eun Ho Ha⁴, Jae Joon Ahn⁴

¹Department of Applied Statistics, Yonsei University, Seoul, Korea

²Department of Forensic Medicine, Seoul National University College of Medicine, Seoul, Korea

³Institute of Forensic Science, Seoul National University College of Medicine, Seoul, Korea

⁴Department of Information and Statistics, Yonsei University, Wonju, Korea

Correspondence to Jae Joon Ahn, Department of Information and Statistics, Yonsei University, 1 Yeonsedae-gil, Heungeop-myeon, Wonju 26493, Korea. Tel: +82-33-760-2766, Fax: +82-33-760-2211, E-mail: ahn2615@yonsei.ac.kr

Received 1 May 2017; Revised 8 May 2017; Accepted 22 May 2017

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

In addition to identifying genetic differences between target populations, it is important to determine how those differences bear on each population. The number of cases that call for this approach has increased in recent years, so a range of statistical methods must be considered. In this study, genetic data from Southeast and Southwest Asian populations were collected, and several statistical approaches were evaluated on the Y-chromosome short tandem repeat data. To develop a more accurate and practical classification model, we applied gradient boosting and ensemble techniques. In distinguishing Southeast from Southwest Asian populations, these classification models performed better overall than the decision trees and regression models used in the past. In conclusion, this study suggests that additional statistical approaches, such as data mining techniques, could provide more useful interpretations for forensic analyses. These trials are expected to form the basis for further studies that extend from the target regions to the entire continent of Asia and that use additional genes, such as mitochondrial genes.

Keywords: Y-chromosomal short tandem repeats, Statistical models, Decision trees, Data mining, Ensemble model

REFERENCES

1. Butler JM. Advanced topics in forensic DNA typing: methodology. San Diego, CA: Academic Press; 2011.

2. Enoch MA, Shen PH, Xu K, et al. Using ancestry-informative markers to define populations and detect population stratification. J Psychopharmacol. 2006;20(4 Suppl):19–26.

3. Li JZ, Absher DM, Tang H, et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science. 2008;319:1100–4.

4. Rosenberg NA, Pritchard JK, Weber JL, et al. Genetic structure of human populations. Science. 2002;298:2381–5.

5. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–59.

6. Quinlan JR. Induction of decision trees. Mach Learn. 1986;1:81–106.

7. Opitz D, Maclin R. Popular ensemble methods: an empirical study. J Artif Intell Res. 1999;11:169–98.

8. Rokach L. Ensemble-based classifiers. Artif Intell Rev. 2010;33:1–39.

9. Quinlan JR. Bagging, boosting, and C4.5. In: AAAI/IAAI '96 Proceedings of the Thirteenth National Conference on Artificial Intelligence; 1996 Aug 4-8; Portland, OR, USA. Vol. 1. Palo Alto, CA: AAAI Press; 1996. p. 725–30.

10. Breiman L. Bagging predictors. Mach Learn. 1996;24:123–40.

11. Schapire RE. The strength of weak learnability. Mach Learn. 1990;5:197–227.

12. Freund Y, Schapire RE. A short introduction to boosting. J Jpn Soc Artif Intell. 1999;14:771–80.

13. Friedman JH. Stochastic gradient boosting. Comput Stat Data Anal. 2002;38:367–78.

14. Wang R, Lee N, Wei Y. A case study: improve classification of rare events with SAS Enterprise Miner. In: Proceedings of the SAS Global Forum 2015 Conference. Cary, NC: SAS Institute Inc.; 2015.

15. Rahman MM, Davis DN. Addressing the class imbalance problem in medical datasets. Int J Mach Learn Comput. 2013;3:224–8.

16. Purps J, Siegert S, Willuweit S, et al. A global analysis of Y-chromosomal haplotype diversity for 23 STR loci. Forensic Sci Int Genet. 2014;12:12–23.

Fig. 1.

Classification analysis process.

Fig. 2.

Examples of decision rules.

Fig. 3.

Bagging procedure.

Fig. 4.

Boosting procedure.

Fig. 5.

Under sampling.

Fig. 6.

Progress of ethnicity classification model analysis.

Fig. 7.

Gradient boosting and decision tree (chi-square) ensemble model separation rule tree.

Table 1.

Details of populations analyzed

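Population	Sample size	Data source
Vietnam	46	Seoul National University
Nepal	69	Seoul National University
India	23	Seoul National University
Vietnam	45	Purps et al. [16]
Philippines	798	Purps et al. [16]
Singapore	104	Purps et al. [16]
India	298	Purps et al. [16]
Total	1,383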
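
As a rough check on Table 1, the per-country counts can be pooled into the two target classes defined in Table 2. The sketch below is illustrative only: the country-to-class mapping (`TARGET_OF`) and the helper `class_totals` are assumptions made here, not something the table states.

```python
# Per-country sample sizes copied from Table 1.
SAMPLES = [
    ("Vietnam", 46), ("Nepal", 69), ("India", 23),      # Seoul National University
    ("Vietnam", 45), ("Philippines", 798),              # Purps et al. [16]
    ("Singapore", 104), ("India", 298),                 # Purps et al. [16]
]

# Assumption: this mapping is inferred for illustration; Table 1 does not state it.
TARGET_OF = {
    "Vietnam": 0, "Philippines": 0, "Singapore": 0,     # 0: Southeast Asian
    "Nepal": 1, "India": 1,                             # 1: Southwest Asian
}

def class_totals(samples, target_of):
    """Sum the sample sizes of all countries assigned to each target class."""
    totals = {0: 0, 1: 0}
    for country, n in samples:
        totals[target_of[country]] += n
    return totals

totals = class_totals(SAMPLES, TARGET_OF)
grand_total = sum(totals.values())
shares = {k: round(100 * v / grand_total, 1) for k, v in totals.items()}

print(totals)       # {0: 993, 1: 390}
print(grand_total)  # 1383
print(shares)       # {0: 71.8, 1: 28.2}
```

Under this assumed mapping the class shares come out close to the 72.0:28.0 target rate reported in Table 3, although that table is based on 1,345 records, not 1,383.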

Table 2.

Composition of data

No.	Variable	Definition
1	Sample Info	National information
2	DYS576	Gene
3	DYS389I	Gene
4	DYS448	Gene
5	DYS389II	Gene
6	DYS19	Gene
7	DYS391	Gene
8	DYS481	Gene
9	DYS549	Gene
10	DYS533	Gene
11	DYS438	Gene
12	DYS437	Gene
13	DYS570	Gene
14	DYS635	Gene
15	DYS390	Gene
16	DYS439	Gene
17	DYS392	Gene
18	DYS643	Gene
19	DYS393	Gene
20	DYS458	Gene
21	DYS456	Gene
22	YGATAH4	Gene
23	TARGET	0: Southeast Asian; 1: Southwest Asian

Table 3.

The results of data splitting and under sampling

Category	Dataset	Count	Target rate
Raw data	Y_STR_Raw	1,345	72.0:28.0
Data partition	Train dataset	846	72.0:28.0
Data partition	Validate dataset	364	72.0:28.0
Data partition	Test dataset	135	72.0:28.0
Under sampling	Train dataset (under sampling)	470	50.0:50.0
Under sampling	Validate dataset (under sampling)	204	50.0:50.0
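
The under-sampling rows of Table 3 can be reproduced in outline with random under-sampling, which keeps every record of the rarer class and draws an equally sized random sample from the more frequent class. This is a minimal sketch, not the authors' procedure: the function `undersample`, the seed, and the synthetic label vectors are hypothetical, and the minority counts (235 and 102) are assumed to be half of the under-sampled sizes in the table.

```python
import random

def undersample(labels, seed=0):
    """Return sorted indices of a class-balanced subset of `labels`.

    Every record of the rarest class is kept; each larger class is
    reduced to the same size by sampling without replacement.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        kept.extend(idx if len(idx) == n_min else rng.sample(idx, n_min))
    return sorted(kept)

# Hypothetical label vectors sized like the Train and Validate rows of Table 3.
# Assumption: the minority class has 235 and 102 records, respectively.
train_labels = [0] * 611 + [1] * 235   # 846 records
valid_labels = [0] * 262 + [1] * 102   # 364 records

train_idx = undersample(train_labels)
valid_idx = undersample(valid_labels)
print(len(train_idx), len(valid_idx))  # 470 204
```

Table 3 lists no under-sampled test dataset, so in this sketch the test records would be left at their original target rate.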

Table 4.

Result of classification model

No.	Model	Resampling	Misclassification rate (Train)	Misclassification rate (Validate)	Misclassification rate (Test)	ROC index (Train)	ROC index (Validate)	ROC index (Test)
1	GB and DT (Chi-square) Ensemble	Bagging	0.038	0.068	0.044	0.996	0.975	0.992
2	GB and DT (Chi-square) Ensemble	Boosting	0.044	0.063	0.037	0.995	0.973	0.990
3	DT (Entropy) and DT (Entropy) Ensemble	Bagging and boosting	0.046	0.073	0.037	0.995	0.978	0.992
4	GB and DT (Gini) Ensemble	Boosting	0.046	0.078	0.037	0.995	0.968	0.992
5	DT (Gini) and DT (Gini) Ensemble	Bagging and boosting	0.055	0.092	0.037	0.993	0.969	0.994
6	DT (Chi-square) and DT (Chi-square) Ensemble	Bagging and boosting	0.057	0.083	0.037	0.993	0.977	0.992
7	GB and DT (Entropy) Ensemble	–	0.063	0.087	0.044	0.980	0.966	0.985
8	GB and DT (Gini) Ensemble	–	0.063	0.087	0.044	0.980	0.966	0.985
9	GB	–	0.065	0.063	0.044	0.981	0.966	0.984
10	GB and DT (Chi-square) Ensemble	–	0.067	0.087	0.052	0.980	0.966	0.984
11	GB and DT (Entropy) Ensemble	Bagging	0.069	0.083	0.037	0.981	0.972	0.987
12	DT (Gini)	–	0.069	0.083	0.052	0.950	0.955	0.969
13	DT (Entropy)	–	0.069	0.083	0.052	0.950	0.955	0.969
14	GB and DT (Gini) Ensemble	Bagging	0.071	0.078	0.037	0.980	0.971	0.988
15	GB and DT (Chi-square) Ensemble	Bagging	0.071	0.087	0.037	0.980	0.971	0.988
16	DT (Chi-square)	–	0.076	0.083	0.059	0.948	0.954	0.967
17	DT (Chi-square)	Bagging	0.080	0.073	0.067	0.963	0.970	0.989
18	DT (Gini)	Bagging	0.080	0.073	0.067	0.963	0.970	0.989
19	DT (Entropy)	Bagging	0.084	0.083	0.059	0.973	0.966	0.983
20	DT (Gini)	Boosting	0.137	0.248	0.163	1.000	0.962	0.993
21	DT (Chi-square)	Boosting	0.149	0.150	0.126	1.000	0.973	0.993
22	DT (Entropy)	Boosting	0.179	0.238	0.185	1.000	0.977	0.991

ROC, receiver operating characteristic; GB, gradient boosting; DT (Chi-square), decision tree model using chi-square statistics; DT (Entropy), decision tree model using the entropy criterion; DT (Gini), decision tree model using the Gini index.
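
The two measures reported in Table 4 can be illustrated on a toy example. The sketch below assumes that the ROC index is the area under the ROC curve and that an ensemble combines its member models by averaging their predicted scores; neither assumption is confirmed here, and all function names, scores, and the 0.5 cutoff are hypothetical.

```python
def misclassification_rate(y_true, y_pred):
    """Share of records whose predicted class differs from the true class."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_index(y_true, scores):
    """Area under the ROC curve, computed by pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive record receives the higher score; ties count as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_ensemble(*score_lists):
    """Average the scores of several models record by record (assumed rule)."""
    return [sum(s) / len(s) for s in zip(*score_lists)]

def classify(scores, cutoff=0.5):
    """Turn scores into 0/1 class predictions with a fixed cutoff."""
    return [1 if s >= cutoff else 0 for s in scores]

# Toy data: four records with hypothetical scores from two models.
y_true = [0, 0, 1, 1]
gb_scores = [0.10, 0.40, 0.35, 0.80]   # stands in for gradient boosting
dt_scores = [0.20, 0.30, 0.60, 0.90]   # stands in for a decision tree
ens_scores = average_ensemble(gb_scores, dt_scores)

print(roc_index(y_true, gb_scores))                           # 0.75
print(roc_index(y_true, ens_scores))                          # 1.0
print(misclassification_rate(y_true, classify(gb_scores)))    # 0.25
print(misclassification_rate(y_true, classify(ens_scores)))   # 0.25
```

In this toy case the averaged scores rank every positive record above every negative one, so the ROC index improves while the misclassification rate at the fixed cutoff stays the same. This shows why the two columns of Table 4 need not move together.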

Table 5.

Ensemble model variable importance

Variable	Count of split rules	Variable importance (Train)	Variable importance (Validate)
DYS392	1	1.000	1.000
DYS390	2	0.680	0.649
DYS448	1	0.256	0.193
DYS643	1	0.219	0.130
DYS438	1	0.193	0.279
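
One way to read Table 5 is to compare how the variables rank on the train and validate datasets. The following sketch only reorders the values printed in the table; the helper `ranking` is a hypothetical name introduced for illustration.

```python
# (variable, importance on Train, importance on Validate), copied from Table 5.
IMPORTANCE = [
    ("DYS392", 1.000, 1.000),
    ("DYS390", 0.680, 0.649),
    ("DYS448", 0.256, 0.193),
    ("DYS643", 0.219, 0.130),
    ("DYS438", 0.193, 0.279),
]

def ranking(rows, column):
    """Variable names ordered from most to least important in `column`."""
    ordered = sorted(rows, key=lambda row: row[column], reverse=True)
    return [row[0] for row in ordered]

train_rank = ranking(IMPORTANCE, 1)
valid_rank = ranking(IMPORTANCE, 2)

print(train_rank)  # ['DYS392', 'DYS390', 'DYS448', 'DYS643', 'DYS438']
print(valid_rank)  # ['DYS392', 'DYS390', 'DYS438', 'DYS448', 'DYS643']
```

DYS392 and DYS390 lead on both datasets, whereas DYS438 moves from fifth place on the train dataset to third place on the validate dataset.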
