Korean J Leg Med, v.43(3)

Jeong, Lee, Lee, Lee, Park, Kim, and Lee: Classification of Common Relationships Based on Short Tandem Repeat Profiles Using Data Mining

Abstract

We reviewed past studies on the identification of familial relationships using 22 short tandem repeat markers. The review showed that high discrimination power and relatively accurate cut-off values can be obtained for parent-child and full sibling relationships. For uncle-nephew and cousin pairs, however, the likelihood ratio (LR) method has low discrimination power. We therefore compared the LR ranking method with data mining techniques that can be applied to identify familial relationships (logistic regression, linear discriminant analysis, diagonal linear discriminant analysis, diagonal quadratic discriminant analysis, K-nearest neighbor, classification and regression trees, support vector machines, random forest [RF], and penalized multivariate analysis), and we provide a guideline for choosing the most appropriate model in a given situation. RF was more accurate than the other methods, with an accuracy of 99.99% for parent-child, 99.44% for full siblings, 90.34% for uncle-nephew, and 79.69% for first cousins.

Figures and Tables

Table 1

TNSA for common relationships

TNSA, total number of shared alleles; SD, standard deviation.
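
As an illustration of the TNSA statistic named in this table, the following minimal sketch counts shared alleles locus by locus and sums them over a profile. The function names, the three-locus profiles, and the treatment of homozygotes (a multiset intersection, so two identical homozygotes share 2 alleles) are assumptions made for this example; the study's exact counting convention is not stated here.

```python
from collections import Counter


def shared_alleles(genotype_a, genotype_b):
    """Count alleles shared at one locus as a multiset intersection (0, 1, or 2)."""
    overlap = Counter(genotype_a) & Counter(genotype_b)
    return sum(overlap.values())


def tnsa(profile_a, profile_b):
    """Sum the shared-allele counts over the loci typed in both profiles."""
    common_loci = profile_a.keys() & profile_b.keys()
    return sum(
        shared_alleles(profile_a[locus], profile_b[locus]) for locus in common_loci
    )


# Hypothetical three-locus profiles (illustrative values, not data from the study).
person_1 = {"D3S1358": (15, 16), "vWA": (17, 17), "FGA": (22, 24)}
person_2 = {"D3S1358": (15, 17), "vWA": (17, 17), "FGA": (20, 21)}

print(tnsa(person_1, person_2))
```

With 22 markers, the statistic computed this way ranges from 0 to 44.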

Table 2

Classification of common relationships according to TNSA

TNSA, total number of shared alleles; NF, number of family relationships (true/total); NU, number of unrelated (true/total).

Table 3

LR for common relationships

LR, likelihood ratio; SD, standard deviation; Min, minimum; Max, maximum.
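
For readers unfamiliar with the LR, the sketch below shows one standard textbook way to compute a pairwise kinship LR against the alternative "unrelated", using identity-by-descent (IBD) coefficients. It assumes independent loci, Hardy-Weinberg proportions, no mutation, and no population-structure correction; the function names and the allele frequencies in the usage note are hypothetical, and this is not necessarily the exact procedure used in the study.

```python
import math

# IBD coefficients (k0, k1, k2): probabilities that a pair shares
# 0, 1, or 2 alleles identical by descent at a locus.
IBD_COEFFICIENTS = {
    "parent-child": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "uncle-nephew": (0.5, 0.5, 0.0),
    "first cousins": (0.75, 0.25, 0.0),
}


def genotype_prob(genotype, freq):
    """Hardy-Weinberg probability of an unordered genotype."""
    a, b = genotype
    return freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]


def prob_given_shared(genotype, shared, freq):
    """P(genotype) when one allele is a copy of `shared` and the other is random."""
    c, d = genotype
    if c == d:
        return freq[c] if shared == c else 0.0
    return (freq[d] if shared == c else 0.0) + (freq[c] if shared == d else 0.0)


def locus_lr(g1, g2, freq, relationship):
    """Single-locus LR: claimed relationship versus unrelated."""
    k0, k1, k2 = IBD_COEFFICIENTS[relationship]
    p2 = genotype_prob(g2, freq)
    # One allele IBD: each allele of g1 is the transmitted copy with probability 1/2.
    ratio_ibd1 = 0.5 * sum(prob_given_shared(g2, s, freq) for s in g1) / p2
    # Two alleles IBD: the genotypes must be identical.
    ratio_ibd2 = 1.0 / p2 if sorted(g1) == sorted(g2) else 0.0
    return k0 + k1 * ratio_ibd1 + k2 * ratio_ibd2


def log10_lr(profile_1, profile_2, freqs, relationship):
    """Sum of log10 single-locus LRs over the loci typed in both profiles."""
    total = 0.0
    for locus in profile_1.keys() & profile_2.keys():
        lr = locus_lr(profile_1[locus], profile_2[locus], freqs[locus], relationship)
        if lr == 0.0:
            return -math.inf  # relationship excluded under the no-mutation assumption
        total += math.log10(lr)
    return total
```

For example, with hypothetical frequencies `{"A": 0.1, "B": 0.2, "C": 0.7}`, the genotypes `("A", "B")` and `("A", "C")` give a parent-child LR of 1/(4 × 0.1) = 2.5 at that locus. The IBD coefficients also suggest why more distant relationships are harder to separate from unrelated pairs: as k0 approaches 1, the per-locus LR is pulled toward 1.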

Table 4

Classification of common relationships according to Log LR

LR, likelihood ratio; NF, number of family relationships (true/total); NU, number of unrelated (true/total).
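
The NF and NU columns (true/total) can be read as counts of correct calls under a cut-off rule. The sketch below assumes a pair is called related when its log LR is at or above the cut-off; the function name, the cut-off of 0, and the log LR values are hypothetical and chosen only to show the bookkeeping.

```python
def count_correct(related_log_lrs, unrelated_log_lrs, cutoff):
    """Return (true, total) for related pairs (NF) and for unrelated pairs (NU)."""
    # A related pair is classified correctly when its log LR reaches the cut-off.
    nf_true = sum(value >= cutoff for value in related_log_lrs)
    # An unrelated pair is classified correctly when its log LR stays below it.
    nu_true = sum(value < cutoff for value in unrelated_log_lrs)
    return (nf_true, len(related_log_lrs)), (nu_true, len(unrelated_log_lrs))


# Hypothetical log10 LR values (not data from the study).
related = [2.1, 0.4, -0.3, 1.7]
unrelated = [-1.2, -0.5, 0.2, -2.0]

nf, nu = count_correct(related, unrelated, cutoff=0.0)
print(f"NF {nf[0]}/{nf[1]}, NU {nu[0]}/{nu[1]}")
```

Raising the cut-off trades correct calls among related pairs for correct calls among unrelated pairs, which is why the choice of cut-off matters most for the relationships with overlapping LR distributions.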

Table 5

Logistic regression of TNSA and Log LR for common relationships

TNSA, total number of shared alleles; LR, likelihood ratio; NF, number of family relationships (true/total); NU, number of unrelated (true/total).

Table 6

Classification of common relationships according to various classification methods

Sen, sensitivity; Spe, specificity; Acc, accuracy; LDA, linear discriminant analysis; DLDA, diagonal linear discriminant analysis; DQDA, diagonal quadratic discriminant analysis; KNN, K-nearest neighbor; CART, classification and regression trees; SVM, support vector machines; RF, random forest; PMA, penalized multivariate analysis.
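
The three performance measures abbreviated above follow from a two-by-two confusion table. The sketch below assumes that related pairs are the positive class and unrelated pairs the negative class; the function name and the counts are hypothetical.

```python
def classification_metrics(true_pos, false_neg, true_neg, false_pos):
    """Return sensitivity, specificity, and accuracy from confusion-table counts."""
    sensitivity = true_pos / (true_pos + false_neg)  # related pairs called related
    specificity = true_neg / (true_neg + false_pos)  # unrelated pairs called unrelated
    total = true_pos + false_neg + true_neg + false_pos
    accuracy = (true_pos + true_neg) / total
    return sensitivity, specificity, accuracy


# Hypothetical counts: 100 related and 100 unrelated test pairs.
sen, spe, acc = classification_metrics(true_pos=90, false_neg=10, true_neg=95, false_pos=5)
print(f"Sen {sen:.3f}, Spe {spe:.3f}, Acc {acc:.3f}")
```

Because accuracy averages over both classes, it depends on the ratio of related to unrelated pairs in the test set, so sensitivity and specificity are worth reading alongside it.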

Table 7

Summary classification of common relationships according to various classification methods

Sen, sensitivity; Spe, specificity; Acc, accuracy; TNSA, total number of shared alleles; LR, likelihood ratio; RF, random forest.

Acknowledgments

This research was supported by the KU Future Research Grant (KU-FRG, K1720021, 2017) and by the research project for practical use and advancement of forensic DNA analysis of the Supreme Prosecutors' Office, Republic of Korea (1333-304-260, 2014).

Notes

Conflicts of Interest: No potential conflict of interest relevant to this article was reported.

ORCID iDs

Su Jin Jeong
https://orcid.org/0000-0001-6754-8925

Jae Won Lee
https://orcid.org/0000-0002-3718-2704
