Searching for Appropriate Statistical Parameters for Validation of Mitochondrial DNA Database

Chong Min Choung; Ji Hyun Lee; Sohee Cho; Soong Deok Lee

doi:10.7580/kjlm.2014.38.2.59

Original Article

Korean J Leg Med 2014;38(2):59-65.

Published online: 17 January 2014

Chong Min Choung¹, Ji Hyun Lee², Sohee Cho², Soong Deok Lee²,³

¹Forensic DNA Division, National Forensic Service, Wonju-si, Gangwon, Korea

²Department of Forensic Medicine, Seoul, Korea

³Institute of Forensic Science, Seoul National University College of Medicine, Seoul, Korea

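This research was supported by the Bio & Medical Technology Development Program of the National Research Foundation of Korea, funded by the Korean government (Ministry of Science, ICT and Future Planning) in 2013 (No. 2013-057192). Corresponding author: Soong Deok Lee, Department of Forensic Medicine, Seoul National University College of Medicine, 103 Daehak-ro, Yeongeon-dong, Jongno-gu, Seoul 110-799, Korea. Tel: +82-2-740-8359, Fax: +82-2-764-8340, E-mail: sdlee@snu.ac.kr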

Received 29 April 2014 Revised 19 May 2014 Accepted 20 May 2014

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Studies on mitochondrial DNA (mtDNA) have increased rapidly in recent years. Conventional parameters, such as the diversity index and pairwise comparison, are used to interpret and validate autosomal DNA data; however, their use for validating mitochondrial DNA databases (mtDNA DBs) needs to be verified because mtDNA has a different transmission pattern. This study was conducted to verify the use of these conventional parameters and to test the “coverage concept” as a new parameter. Although the mtDNA DB is not very large, it is necessary to check how the parameters change with DB size. To do this, we artificially rearranged a Korean DB into several smaller sub-DBs of various sizes. The results show that the diversity in nucleotide variations and in haplotypes changed little as the DB size increased, whereas the “coverage” changed considerably, increasing from 0.113 in a DB of 100 people to 0.260 in a DB of 653 people. Additionally, using the “coverage concept”, we predicted how the total number of haplotypes would change with the sub-DB size and compared the predictions with the final observed result. In conclusion, “coverage” can be used, in addition to conventional statistical parameters, to check the usability of an mtDNA DB. Finally, we attempted to predict the total number of mtDNA haplotypes in Korea using the “saturation concept”.

Keywords: mtDNA DB, Statistical parameter, Coverage, Phylogeny, Saturation curve

REFERENCES

1. Torroni A, Achilli A, Macaulay V, et al. Harvesting the fruit of the human mtDNA tree. Trends Genet. 2006; 22:339–45.

2. Torroni A, Schurr TG, Cabell MF, et al. Asian affinities and continental radiation of the four founding Native American mtDNAs. Am J Hum Genet. 1993; 53:563–90.

3. Egeland T, Bøvelstad HM, Storvik GO, et al. Inferring the most likely geographical origin of mtDNA sequence profiles. Ann Hum Genet. 2004; 68:461–71.

4. Chao A, Lee SM. Estimating the number of classes via sample coverage. JASA. 1992; 87:210–7.

5. Huang SP, Weir BS. Estimating the total number of alleles using a sample coverage method. Genetics. 2001; 159:1365–73.

6. Egeland T, Salas A. Estimating haplotype frequency and coverage of databases. PLoS One. 2008; 3:e3998.

7. Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989; 123:585–95.

8. Pereira L, Cunha C, Amorim A. Predicting sampling saturation of mtDNA haplotypes: an application to an enlarged Portuguese database. Int J Legal Med. 2004; 118:132–6.

9. Pfeiffer H, Brinkmann B, Hühne J, et al. Expanding the forensic German mitochondrial DNA control region database: genetic diversity as a function of sample size and microgeography. Int J Legal Med. 1999; 112:291–8.

10. Haas PJ, König C. A bi-level Bernoulli scheme for database sampling. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data. ACM. 2004. 275–86.

11. Mao CX. Predicting the conditional probability of discovering a new class. JASA. 2004; 99:1108–18.

12. Nei M, Li WH. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc Natl Acad Sci USA. 1979; 76:5269–73.

13. Bunge J, Fitzpatrick M. Estimating the number of species: a review. JASA. 1993; 88:364–73.

Fig. 1.

Saturation curves for expanded sample sizes. (a) Expanded up to 10,000 people. (b) Expanded up to 100,000 people. When the number of haplotypes expected to be observed was examined as the group size increased to 10,000 and to 100,000, the final expected number of haplotypes exceeded 4,500. The shaded portion of the graph indicates the upper and lower limits of the confidence interval.

Fig. 2.

Simulated saturation curve fitted to the number of observed haplotypes. The graph was obtained with CurveExpert Professional version 1.6.5. The fit converged to a tolerance of 1e-006 in 5 iterations. No weighting was used.

Table 1.

Primers Used to Sequence the D-loop of mtDNA

Primer	Sequence (5′→3′)
F15971	TTA ACT CCA CCA TTA GCA CC
F16291	AGG ATA CCA ACA AAC CTA C
F049	CTC ACG GGA GCT CTC CAT GC
R16410	GAG GAT GGT GGT CAA GGG AC
R16493	GAA GTA GGA ACC AGA TGT CGG
R159	ATA TTG AAC GTA GGT GCG AT
R408	CTG TTA AAA GTG CAT ACC GCC
R921	CTT TAC GCC GGC TTC ATA TG

Table 2.

Variation in Sequence Diversity and Number of Sequence Changes at Each Sample Size^∗

	HV Ⅰ		HV Ⅱ		HV Ⅰ+HV Ⅱ		D-loop
	n^‡	H^†	n	H	n	H	n	H
N^§ = 100	82	0.9419	46	0.8685	128	0.9509	154	0.963
N = 200	96	0.0944	47	0.8750	143	0.9522	172	0.9627
N = 300	113	0.9438	60	0.8737	173	0.9521	210	0.9630
N = 400	125	0.9465	71	0.8693	196	0.9523	237	0.9632
N = 500	129	0.9465	72	0.8730	201	0.9523	240	0.9629
N = 600	135	0.9460	75	0.8718	210	0.9525	258	0.9632
N = 653	139	0.9454	84	0.8763	223	0.9533	266	0.9636

^∗ : Data were grouped by randomly assigned label numbers to make smaller DBs from the original DB of 653 people.

^† : Sequence diversity, calculated as 1 − Σx². ¹⁰⁾

^‡ : Number of nucleotide positions that differ from the rCRS.

^§ : DB size.

Table 3.

Number of Observed Haplotypes and Comparison of Statistical Parameters^∗

No. of Observed Haplotypes
	Total	H^†	C^‡	CI^§	f1^‖	f2^‖	f3^‖	f4^‖	f5^‖	f6^‖	f7^‖	f8^‖
N = 100	93	0.9894	0.113	(0.065~0.159)	86	6	1	0	0	0	0	0
N = 200	183	0.9939	0.123	(0.083~0.151)	171	11	0	0	0	0	1	0
N = 300	269	0.9954	0.168	(0.130~0.195)	246	18	4	0	0	0	1	0
N = 400	351	0.9970	0.207	(0.177~0.233)	313	30	5	2	0	0	0	1
N = 500	433	0.9978	0.223	(0.196~0.247)	382	36	11	2	0	1	0	1
N = 600	509	0.9983	0.252	(0.227~0.275)	443	47	13	4	0	0	0	2
N = 653	546	0.9979	0.260	(0.239~0.281)	477	48	14	3	2	0	0	2

^∗ : Data were not selected by size but simply grouped by serial number.

^† : Haplotype diversity, calculated as 1 − Σx². ¹⁰⁾

^‡ : Calculated coverage.

^§ : 95% confidence interval.

^‖ : Number of observed haplotypes.

f1 is the number of unique haplotypes (observed in one individual only), f2 is the number of haplotypes shared by two individuals, f3 is the number of haplotypes shared by three individuals, and so on.
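
To illustrate how the frequency counts f1 to f8 feed into the summary statistics, the following Python sketch computes haplotype diversity as 1 − Σx² and a simple Good-Turing-style coverage estimate, 1 − f1/N, from the N = 653 row. The function names are hypothetical, x is assumed to be the relative frequency of each haplotype, and the Good-Turing form is an assumption chosen for simplicity. It is not necessarily the estimator behind columns H and C, so the output need not reproduce the tabulated values exactly.

```python
def sample_size(freq_counts):
    """Number of individuals: each of the f_k haplotypes is carried by k people."""
    return sum(k * f for k, f in freq_counts.items())


def haplotype_diversity(freq_counts):
    """1 - sum(x^2), where x = k / N for a haplotype observed k times."""
    n = sample_size(freq_counts)
    return 1.0 - sum(f * (k / n) ** 2 for k, f in freq_counts.items())


def good_turing_coverage(freq_counts):
    """Assumed simple estimator: 1 - f1 / N (share of the sample that is not a singleton)."""
    n = sample_size(freq_counts)
    return 1.0 - freq_counts.get(1, 0) / n


# f1..f8 for the N = 653 row of the table above
row_653 = {1: 477, 2: 48, 3: 14, 4: 3, 5: 2, 6: 0, 7: 0, 8: 2}

print(sample_size(row_653))            # individuals implied by the counts
print(sum(row_653.values()))           # distinct haplotypes (Total column)
print(round(haplotype_diversity(row_653), 4))
print(round(good_turing_coverage(row_653), 3))
```

The counts are internally consistent for this row: they imply 653 individuals and 546 distinct haplotypes, matching the N and Total columns.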

Table 4.

Comparison of Coverage between Simulated and Non-selected DBs, Showing No Significant Selection Effect

	Sim-1^∗	Sim-2^∗	Non-selected^†
N = 100	0.050	0.020	0.113
N = 200	0.132	0.130	0.118
N = 300	0.138	0.145	0.166
N = 400	0.180	0.200	0.205
N = 500	0.219	0.247	0.222
N = 600	0.253	0.272	0.252
N = 653	-	-	0.26

^∗ : Data selected randomly by shuffling.

^† : Data selected simply by serial number.
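
The two ways of building sub-DBs compared in this table can be sketched as follows. This is a minimal illustration under stated assumptions: the haplotype labels in `toy_db` are hypothetical placeholders rather than real data, the function names are invented for the example, and coverage is again approximated by the simple 1 − f1/N form rather than the estimator used for the table.

```python
import random
from collections import Counter


def frequency_counts(haplotypes):
    """Map k -> number of haplotypes observed exactly k times (f_k)."""
    return Counter(Counter(haplotypes).values())


def coverage_from_labels(haplotypes):
    """Assumed simple estimator: 1 - f1 / N."""
    f = frequency_counts(haplotypes)
    return 1.0 - f.get(1, 0) / len(haplotypes)


def serial_sub_db(db, size):
    """Non-selected sub-DB: the first `size` records in serial order."""
    return db[:size]


def shuffled_sub_db(db, size, seed):
    """Simulated sub-DB: `size` records drawn without replacement after shuffling."""
    return random.Random(seed).sample(db, size)


# Hypothetical toy DB of haplotype labels (not real data)
toy_db = ["H1", "H2", "H1", "H3", "H4", "H2", "H5", "H1", "H6", "H7"]

for size in (4, 8):
    serial = coverage_from_labels(serial_sub_db(toy_db, size))
    shuffled = coverage_from_labels(shuffled_sub_db(toy_db, size, seed=1))
    print(size, serial, shuffled)
```

Fixing the seed makes the shuffled draw repeatable, so a simulated column can be regenerated and compared against the serial-order column.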

Table 5.

Comparison of Observed and Estimated Numbers of Unique Haplotypes Using the Mao Equation

Estimated value
	Total	M (t = 1)	M (t = 0.5)	M (t = 0.25)
N = 100	93
N = 200	183	174
N = 300	269		265.76
N = 400	351	344
N = 500	433			427.45
N = 600	509	502	500.5
N = 653	546
N = 800		636
N = 1,000		786
N = 1,200		953
P value			0.97

The estimated values were calculated from the equation Δ(t) = f₁t − f₂t² + f₃t³ − ….

This formula is valid for t ≤ 1, where t = 1 corresponds to doubling the sample size. The column M (t = 1) gives the expected number of haplotypes when the sample size is doubled, M (t = 0.5) when it is increased 1.5-fold, and M (t = 0.25) when it is increased 1.25-fold. The chi-square test gave a p value of 0.97 (α = 0.05). For N = 100 to N = 653, Total is the number of observed haplotypes at each DB size. For N ≥ 800, only estimated values are presented. Bold type indicates values estimated using the Mao equation.
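
As a worked check of the series, the sketch below evaluates Δ(t) from the f1 to f8 counts listed in Table 3 and adds it to the observed total. The function names are hypothetical, and the interpretation that each estimate equals the observed total plus Δ(t) is an assumption, although it does reproduce several table entries.

```python
def predicted_new_haplotypes(freq_counts, t):
    """Delta(t) = f1*t - f2*t^2 + f3*t^3 - ..., used here only for 0 <= t <= 1."""
    if not 0 <= t <= 1:
        raise ValueError("the series is only used for 0 <= t <= 1")
    return sum((-1) ** (k + 1) * f * t ** k for k, f in freq_counts.items())


def predicted_total(observed_total, freq_counts, t):
    """Observed haplotypes plus the new ones expected after enlarging the DB by a fraction t."""
    return observed_total + predicted_new_haplotypes(freq_counts, t)


# Non-zero f_k values taken from the rows of Table 3
n100 = {1: 86, 2: 6, 3: 1}
n200 = {1: 171, 2: 11, 7: 1}
n400 = {1: 313, 2: 30, 3: 5, 4: 2, 8: 1}

print(round(predicted_total(93, n100, 1.0), 2))    # N = 100 -> 200: 174.0
print(round(predicted_total(183, n200, 0.5), 2))   # N = 200 -> 300: 265.76
print(round(predicted_total(351, n400, 0.25), 2))  # N = 400 -> 500: 427.45
print(round(predicted_total(351, n400, 1.0), 2))   # N = 400 -> 800: 636.0
```

Each prediction starts from a smaller sub-DB (for example, the N = 400 counts with t = 0.25 predict the N = 500 entry), which is why the estimates can be compared with the totals observed at the larger sizes.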
