Up a LevelClick the arrow to go up a level to the DNA & Family Traits page.
Follow the above link or click the graphic below to visit the Homepage.

HomepageExtended Haplotype Estimation of Clade TMRCAs
Confirmation by Computer Simulation


Ken Nordtvedt & Jim Cullen
Homepage




Note: This article is also available in PDF format. Click here to download from my MediaFire account. The file name is 'HaploEst.pdf'. Find the words 'Click here to start download'. MediaFire is funded by advertising so there will be a popup.


We have examined by computer simulations how well extended y-haplotypes can estimate the effective TMRCAs of clade sample populations. The summary graph below indicates that using extended haplotypes for the G* estimation should yield results with 95 percent statistical confidence interval of plus or minus 20 percent. The optimal estimator employing multiple marker haplotypes requires novel down-weighting of the faster mutating markers due to their 'variance of variance" depending non-linearly on mutation rate. Text discussion of the foundations for the estimator's form of construction follows the graph and table of study results.


Distributions of 10,000 Clade Age Estimates by Computer Simulations


 
Results and Parameters of Simulated G* for Clades of 3 Ages
 
GenerationsRunsNumNumber of
Markers
Simulations' Average G*
in units of AvgG*
Simulations' Variance of G*
in units of VarG*
1021000064531.0053850.9902053
2041000064531.0062931.004125
3061000064531.0063550.9833726


The time back to most recent common ancestor (TMRCA) is estimated using the self- variances of the many STR markers of extended haplotypes from a present-day sample population of the clade. The estimator is

G* = Sum i [ w(i) SVar(i) ] / Sum i [ w(i) m(i) ]


with m(i) being the ith marker mutation rate, SVar(i) being that marker’s self- variance, and w(i) being the marker weight factors. The indicated sum is over all the markers of the haplotypes. In the Appendix below it was found that the marker weight factors which minimize the confidence interval (distributional width) for these G* estimates are

w(i) = 1 / ( 1 + m(i) G*** )


with the parameter G*** given by properties of the sample population’s tree back to its TMRCA.

G*** = Sum over c of [ 4 f{c}^2 (1-f{c})^2 G{c} ] / Sum over c of [ f{c}^2 (1-f{c})^2 ]


The indicated sum here is over all contributors in the sample population’s tree. f(c) is the fraction of the sample population descending from contributor c, and G(c) is that contributor’s generational distance from the TMRCA.

The down weighting of the fast mutating markers by the w(i) factors is due to the variances of the self-variances being analytically found to grow non-linearly with marker mutation rate (see Appendix below).

Computer simulations were done to see how much better the G* predictions are by using the optimal weighting versus the traditional flat weighting. The results are presented in the table below. Trees of two different ages, and ending in sample populations of 64 haplotypes of either 9, 24, or 53 of the FTDNA markers were considered. The computer randomly inserted mutations of all the markers into the trees according to the standard rules, and 10,000 independent estimates were made for the tree age parameter G* in each case. The average of the age estimates, the variance of the age estimates, and the "two-sigma" confidence interval --- all normalized to the actual tree age parameter, are given in the table. While the optimal weight factors produce age estimates slightly more confined toward the true values, extended haplotypes substantially improve the quality of the estimates. These shown uncertainties for age estimates are purely those due to the probabilistic nature of the mutation process and will be present if the underlying model perfectly represents the behavior of the STRs in nature; any model uncertainties such as marker mutation rate values will produce additional uncertainties in the age estimates. In summary, here is how to read that final yellow highlighted entry in the table: A tree of 408 generations in depth and resulting in 64 final haplotype samples was peppered with random mutations for the 53 FTDNA markers of the haplotypes. This was done independently 10,000 times and an age estimate obtained from the selfvariances of the 64 haplotypes each time. About 95 percent (two-sigma) of the 10,000 age estimates for G* were within 18.2 percent of the true G* value when using the flat weighting for the 53 markers.


Statistical Confidence Intervals for Predicting Tree Age Parameter G*


Computer simulations using extended haplotypes of many STRs with different mutation rates have been performed to see how closely the analytical expression

AvgG* = G – Sum c [ f(c)^2 ]


will be estimated from the markers’ simulated variances which go into the above estimator. The shape of the actual distribution of G* estimates obtained from the thousands of simulations is also produced, and the analytical expression for its confidence interval also confirmed by these simulations

VarG* = Sum c [ f(c)^2 {1-f(c)}^2 ] / Sum i [ w(i) m(i) ]


Various sets of the FTDNA markers --- the standard 12, 37, and 67 marker collections (less excluded markers)--- are compared. Individual marker mutation rates as estimated by John Chandler are used. Markers DYS385, DYS389ii, DYS459, DYS464, CDY, and DYS425 are excluded because of complications in their variance determinations.

The graph above shows a comparison of the G* estimator’s distribution of outputs when using the 9 marker, 24 marker, and 53 marker haplotypes. A sample population of 64 was used, and a tree TMRCA of 204 generations (about 6000 years) assumed. The G* values were normalized to their analytically predicted AvgG*.

Note how the distributions from thousands of simulations grow narrower and more centered around the AvgG* analytical value as extended haplotypes with more markers are employed. This is a consequence of the central limit theorem as it affects probabilistic phenomena. The actual distributions of variances for each individual marker are quite asymmetric about their average values, as a graph in the Appendix illustrates.

95 percent statistical confidence intervals for clade TMRCA’s estimates seem to be about plus or minus 20 percent when extended haplotypes are used.





Appendix: Basics of Clade Variances
Computer Simulation Confirmations of the Analytics


The following tables show results from a computer simulation of the variance of a single Y-STR marker’s values in a sample of 128 individuals who descend from their MRCA. The model uses the pruned phylogenetic tree converging back hundreds of generations to the MRCA, and allowing for the mutation of the Y-STR markers at some rate seen in our haplotypes. The tree pruning removes non-contributing branches --- both branch lines that have gone extinct and those leading to non-sampled males. The sample simulations in the tables below were run with the given parameters defined as follows:

  • Count = 40 : Generations between each tree doubling of branch lines.
  • MRate = 0.003 : Individual marker mutation rate.
  • CLT = 10 : Number of runs combined to illustrate the Central Limit Theorem in action.
  • Runs = 50000 : Number of computer simulations of tree's variance.
  • Num = 128 = 2^7 : Sample population after 7 doublings.
  • AvgVar = 1.006464 : Average variance of 50,000 runs in units of mG (G = 280)
  • VarVar = 1.0027 : Average square deviation of variance from AvgVar
        in units of Sum over c of [ m f{c}^2 (1 + 4mG{c}) ]


Y-STR Marker Variance Simulation:
 
Count = 40   MRate = 0.003   CLT = 10   Runs = 50,000   Num = 128

AvgVar = 1.006464   VarVar = 1.0027
 
Y-STR Variance
bin levels
in units of mG
Run counts of
Y-STR Variance
falling in the bin
run counts of
average variance
of markers taken 10 at a time
0.0500
0.15150
0.252970
0.3515050
0.4530340
0.5549890
0.65555966
0.755618362
0.854703981
0.9545881242
1.0537511044
1.153152692
1.252336336
1.352069168
1.45166864
1.55127623
1.6591814
1.758777
1.856790
1.955181


Y-STR Marker Variance Simulation graphed output.


The next table is a summary of the parameters and results of some of the other simulations that have been run during the course of our analysis of Y-STR variance.

CountGenerationsMRateCLTRunsNumAvgVarVarVarAvgSVarVarSVar
201400.00310250001280.99968160.96177751.0048750.9836805
402800.00310250001281.0071861.0057251.0051170.9844979
604200.00310250001281.001730.98198971.0021060.9712738
1208400.00310500001281.0062891.02852111.0060550.995638
201400.00110500001281.0025070.99774571.0020680.9906449
402800.00110500001281.0105631.0391301.0096801.023305
604200.00110500001281.0070341.0059001.0066431.000448
1208400.00110750001281.0078031.013561.0071231.001978
201400.000310750001280.99855650.99678151.0008831.012432
402800.000310750001281.0003071.0040771.0009930.9856617
604200.000310750001281.0086451.0146921.0064841.000791
1208400.0003101000001281.0044721.0236291.0039691.013676


A sample population of haplotypes which descended from a most recent common ancestor (MRCA) have for any of their STR markers two variance parameters --- average squared repeat value difference from the founding MRCA’s repeat value, Var, and average squared repeat value difference from the sample population’s average, SVar:

Graph of asymmetric distribution of variances


With r(i) being marker’s repeat value in ith haplotype, rf is the MRCA’s founding repeat value, <r> is the sample population’s average repeat value, and N is the number of haplotypes in the sample population. There are different advantages and disadvantages to using either form of variance.

The main purpose of our computer simulations was to confirm that recently found analytic expressions for the average variance and the ‘variance of the variance’ for these two types of variance are valid, and that’s what we found. The analytic formulas for the interesting parameters of the distribution of the two types of variance are:

Var = m G and SVar = m [ G – Sum over tree contributors c of [ f(c)^2 ] and VarVar = Sum over tree contributors c of [m  f(c)^2 {1 + 4 m G(c) } ] and VarSVar = Sum over tree contributors c of [ m f(c)^2{1-f(c)}^2 {1 + 4 m G(c)} ]


With a tree contributor being each male in the tree ancestral to a fraction f(c) of the sample population, and with G(c) being the generations between the MRCA and the contributor c. The simulation outputs for the four summary parameters of the total distributions were normalized to these theoretical expressions. A surprise result from these analytic products is that the ‘variance of the variances’ do not scale just linearly with marker mutation rate, contrary to common lore.

The ‘variance of the variance’ indicates how much a typical variance outcome will deviate from its expected value. If one took a tree leading from MRCA to the sample population, and let the marker mutations get sprinkled into that tree according to its probabilistic rules of occurence, and did this over and over, one gets a distribution of variances with average value as predicted above but with individual variance values asymmetrically distributed about this mean. The distribution peaks substantially below the expected or average value, and a long tail above the average value balances the distribution about its mean.

The spread about the mean is characterised by the average squared deviation of variance from its mean --- call this ‘variance of the variance’. It is given by:

VarVar = Sum over runs k of [ (Var(k) – AvgVar)^2 ] / K


With Var(k) being the marker variance in the k'th simulation run, and K is the total number of runs.

While the confirmation simulations were done for selected trees, the analytic expressions for AvgVar, AvgSVar, VarVar, and VarSVar are valid for arbitrarily structured trees.

The ‘variance of the variance’ for fast markers grows more rapidly than continuation of the linear-in-m dependence of traditional lore would suggest. This means that fast markers should be down-weighted in the combination used to make an estimate for G

The modified expression for the G estimator using all the markers becomes:

G = Sum over i of [ V{i} / (1 + m{i} G**) ] divided by Sum over i of [ m{i} / (1 + m{i} G**) ]


i is summed over the markers of your haplotypes, m{i} is the mutation rate for i'th marker, and V{i} is the variance found for the i'th marker. The new G** parameter of the tree is given by:

G** = Sum over c of [ 4 f{c}^2 G{c} ] / Sum over c of [ f{c}^2 ]


The analogous parameter needed for self-variance weight factors is:

G*** = Sum over c of [ 4 f{c}^2 (1-f{c})^2 G{c} ] / Sum over c of [ f{c}^2 (1-f{c})^2 ]


G** and G*** can be decent fractions of G if a very small sample population is used.

As a by-product of the computer simulations, we clustered several runs through the tree together to look at the distribution for the sum of independent marker variances. The idea is to visualize the central limit theorem in action, pushing the very asymmetric distributions for single markers into a Gaussian-like distribution for the sum of variances. The gaussianization goes slower than one might have thought, giving another reason for using as many markers as possible in G estimates.



Use your Back Button or click here to go to the DNA & Family Traits page