Legofit
infers population history from nucleotide site patterns.
boot.c File Reference

Functions for a moving blocks bootstrap.

#include "boot.h"
#include "misc.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <assert.h>
#include <gsl/gsl_rng.h>

Data Structures

struct  Boot
 Contains all the data involved in a moving blocks bootstrap.
 

Functions

double interpolate (double p, double *v, long len)
 Interpolate in order to approximate the value v[p*(len-1)].
 
long LInt_div_round (long num, long denom)
 Divide num by denom and round the result to the nearest integer.
 
long Boot_multiplicity (const Boot *self, long snpndx, long rep)
 How many copies of the SNP with index snpndx are present in a given bootstrap replicate (rep)?
 
Boot * Boot_new (int nchr, long nsnpvec[nchr], long nrep, int npat, long blocksize, gsl_rng *rng)
 Constructor for class Boot.
 
void Boot_sanityCheck (const Boot *self, const char *file, int line)
 
void Boot_add (Boot *self, int chr, long snpndx, int pat, double z)
 Add one site pattern contribution to a Boot structure.
 
void Boot_free (Boot *self)
 Destructor.
 
void Boot_aggregate (Boot *self, int rep, int npat, double count[npat])
 Add to an array the site pattern counts from a bootstrap replicate.
 
void confidenceBounds (double *lowBnd, double *highBnd, double confidence, long len, double v[len])
 Calculate confidence bounds from a vector of values representing samples drawn from the sampling distribution of some estimator.
 
void Boot_print (const Boot *self, FILE *ofp)
 Print a Boot object.
 

Detailed Description

Functions for a moving blocks bootstrap.

Author
Alan R. Rogers

This bootstrap treats all nucleotides as a single array, ignoring the distinction between chromosomes. Thus, blocks may span chromosomes. This should not cause problems with high-quality genomes, in which chromosomes are known. When a block spans two chromosomes, sites on opposite sides of the boundary are less correlated than sites within a single chromosome, but this doesn't violate any assumption.

I'm more worried about genomes that consist of many small contigs. With such genomes, nucleotides in different contigs may be tightly linked, and these linked contigs may end up in different blocks. This violates the assumption (of the moving-blocks bootstrap) that observations in different blocks are only weakly correlated. I'm not sure what effect this will have, but I fear it will make the confidence intervals too narrow.
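As a rough illustration of the single-array view described above, the following C sketch maps a (chromosome, SNP) pair to a global index, draws block starts over the whole array, and counts how many sampled blocks cover a given SNP. It is not taken from boot.c: the function names (sketch_global_index, sketch_draw_starts, sketch_multiplicity) are hypothetical, the choice of start range is an assumption, and rand() stands in for the gsl_rng generator that Boot_new accepts.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: position of a SNP in the concatenated array.
 * nsnpvec[c] is the number of SNPs on chromosome c. */
long sketch_global_index(int chr, long snpndx, const long *nsnpvec) {
    long offset = 0;
    for (int c = 0; c < chr; ++c)
        offset += nsnpvec[c];        /* skip all earlier chromosomes */
    return offset + snpndx;
}

/* Hypothetical helper: draw nblock block starts from [0, nsnp - blocksize],
 * so that every block fits inside the array (an assumption of this sketch).
 * rand() with modulo is slightly biased and has a limited range; it is used
 * here only to keep the example free of external dependencies. */
void sketch_draw_starts(long *start, long nblock, long nsnp, long blocksize) {
    assert(blocksize > 0 && blocksize <= nsnp);
    long nstart = nsnp - blocksize + 1;   /* number of legal start positions */
    for (long i = 0; i < nblock; ++i)
        start[i] = rand() % nstart;
}

/* Hypothetical helper: number of sampled blocks that cover the SNP at
 * global index snpndx. */
long sketch_multiplicity(const long *start, long nblock, long blocksize,
                         long snpndx) {
    long n = 0;
    for (long i = 0; i < nblock; ++i)
        if (snpndx >= start[i] && snpndx < start[i] + blocksize)
            ++n;
    return n;
}
```

Because block starts are drawn over the whole array, a block that begins near the end of one chromosome simply continues into the next, which is the behavior discussed above.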

Function Documentation

◆ Boot_add()

void Boot_add ( Boot *  self,
int  chr,
long  snpndx,
int  pat,
double  z 
)

Add one site pattern contribution to a Boot structure.

Parameters
[in,out]  self  The Boot structure to modify.
[in]  chr  The index of the chromosome to modify.
[in]  snpndx  The index of the current SNP.
[in]  pat  The index of the current site pattern.
[in]  z  The contribution of the SNP to the site pattern.

References Boot_multiplicity(), Boot::count, Boot::cum, and Boot::nrep.

◆ Boot_aggregate()

void Boot_aggregate ( Boot *  self,
int  rep,
int  npat,
double  count[npat] 
)

Add to an array the site pattern counts from a bootstrap replicate.

Parameters
[in]  self  Points to a Boot object.
[in]  rep  The index of the bootstrap replicate.
[in]  npat  The number of site patterns.
[out]  count  An array of doubles. The function will add to count[i] the contribution of site pattern i in bootstrap replicate rep.

References Boot::npat.

◆ confidenceBounds()

void confidenceBounds ( double *  lowBnd,
double *  highBnd,
double  confidence,
long  len,
double  v[len] 
)

Calculate confidence bounds from a vector of values representing samples drawn from the sampling distribution of some estimator.

To calculate the lower bound (*lowBnd), the function takes the total probability mass in the two tails (1 - confidence) and divides it into two equal parts to find p, the probability mass in each tail. It then estimates a value L such that a fraction p of the data values are less than or equal to L. To find this value, the function sorts the data values and interpolates linearly between adjacent values in the sorted list.

The upper bound (*highBnd) is calculated in an analogous fashion.

Parameters
[out]  lowBnd, highBnd  Calculated results will be written into these memory locations.
[in]  confidence  Fraction of the sampling distribution that lies inside the confidence bounds.
[in]  len  The number of values in v.
[in]  v  The vector of values.
Side Effects: Sorts the vector v.

◆ interpolate()

double interpolate ( double  p,
double *  v,
long  len 
)

Interpolate in order to approximate the value v[p*(len-1)].

Return NaN if len==0.