Legofit
infers population history from nucleotide site patterns.
|
Generate site patterns by coalescent simulation.
usage: legosim [options] input_file_name where options may include: -b or --branch_length print branch lengths rather than probabilities -i <x> or --nItr <x> number of iterations in simulation -1 or --singletons Use singleton site patterns -d <x> or --deterministic <x> Deterministic algorithm, ignoring states with Pr <= x --network Print summary of population network and exit -U <x> Mutations per generation per haploid genome. -h or --help print this message --version print version and exit Options -i, --nItr, and -U cannot be used with --deterministic. Option -U cannot be used with -b or --branch_length.
Here, "input_file" should be in .lgo format, which describes the history of population size, subdivision, and gene flow. The output looks like this:
####################################### # legosim: site pattern probabilities # # version 2.0.1-12-gf241700-dirty # ####################################### # Program was compiled: Nov 21 2020 10:15:39 # Program was run: Sat Nov 21 10:17:57 2020 # cmd: ./legosim -1 -d 0 input.lgo # input file : input.lgo # algorithm : deterministic # ignoring probs <= : 0 # including singleton site patterns. # SitePat Prob x 0.1560242402 y 0.1481456533 n 0.4023561523 x:y 0.2854101359 x:n 0.0000926157 y:n 0.0079712026
Here, the "SitePat" column labels site patterns. For example, site pattern xy (denoted by "x:y" in this output) refers to nocleotide sites at which the derived allele is present in single haploid samples from X and Y but not in samples from other populations. The probability of this site pattern, under the model of history specified in file input.lgo
, is given in the "Prob" column. When the -d
or --deterministic
options are used, these probabilities are calculated by a deterministic algorithm.
In the example above, I used -d 0
, which tells legosim to ignore only those states with probability 0. Thus, the answers will be extremely accurate. When neither -d
nor --deterministic
are used, the probabilities are estimated by coalescent simulation.
The deterministic algorithm sums across all possible histories of the samples defined in the .lgo file. The number of histories increases rapidly with sample size and with the number of migration events. For larger models, it is not feasible to use -d 0
. However, one can still use the deterministic algorithm to obtain an approximate answer, using an argument such as -d 1e-6
. This tells legosim to use the deterministic algorithm while ignoring states whose probability is less than or equal to 1e-6. Nonzero arguments to -d
will introduce bias. To evaluate the magnitude of this bias, compare the results to those obtained using the stochastic algorithm or (where feasible) the deterministic one with -d 0
.
When the deterministic algorithm is used, legosim prints a line of output that looks like this:
# Summed probability of ignored improbable events: 3.515617e-04
This "summed probability" can exceed 1, because it sums across several probability distributions. Nonetheless, when it is small, that is an indication that we have not ignored much of any of these distributions.
To simulate site pattern counts across an entire genome, use the -U
option, whose argument give the expected number of mutations per generation per haploid genome. This argument should equal , where is the mutation rate per nucleotide site per generation, and is the number of nucleotide sites sequenced per haploid genome, including monomorphic sites but excluding those that fail quality control. For example, adding -U 18
to the command above led to the following output:
####################################### # legosim: site pattern probabilities # # version 1.89 # ####################################### # Program was compiled: Oct 22 2020 22:30:28 # Program was run: Fri Oct 23 12:35:53 2020 # cmd: legosim -1 -i 10000 -U 18 input.lgo # input file : input.lgo # algorithm : stochastic # nreps : 10000 # mutations per haploid genome: 18.000000 # including singleton site patterns. # SitePat Count x 4 y 1 n 8 x:y 5 x:n 0 y:n 0
Now, the 2nd column is labeled "Count" rather than "Prob" and gives the simulated count of each site pattern across the genome as a whole. It is calculated by sampling from a Poisson distribution with mean , where and are as described above, and is the average branch length. This Poisson model treats nucleotide sites as independent, ignoring linkage disequilibrium. The counts it provides are correct in expectation, but their variances in repeated runs of the program are too small.