Legofit
infers population history from nucleotide site patterns.
legosim

Generate site patterns by coalescent simulation.

legosim: coalescent simulations within a network of populations

usage: legosim [options] input_file_name
   where options may include:
   -b or --branch_length
      print branch lengths rather than probabilities
   -i <x> or --nItr <x>
      number of iterations in simulation
   -1 or --singletons
      Use singleton site patterns
   -d <x> or --deterministic <x>
      Deterministic algorithm, ignoring states with Pr <= x
   --network
      Print summary of population network and exit
   -U <x>
      Mutations per generation per haploid genome.
   -h or --help
      print this message
   --version
      print version and exit
Options -i, --nItr, and -U cannot be used with --deterministic.
Option -U cannot be used with -b or --branch_length.

Here, "input_file" should be in .lgo format, which describes the history of population size, subdivision, and gene flow. The output looks like this:

#######################################
# legosim: site pattern probabilities #
#   version 2.0.1-12-gf241700-dirty   #
#######################################

# Program was compiled: Nov 21 2020 10:15:39
# Program was run: Sat Nov 21 10:17:57 2020

# cmd: ./legosim -1 -d 0 input.lgo
# input file                  : input.lgo
# algorithm                   : deterministic
# ignoring probs <=           : 0
# including singleton site patterns.
#       SitePat            Prob
              x    0.1560242402
              y    0.1481456533
              n    0.4023561523
            x:y    0.2854101359
            x:n    0.0000926157
            y:n    0.0079712026

Here, the "SitePat" column labels site patterns. For example, site pattern xy (denoted by "x:y" in this output) refers to nocleotide sites at which the derived allele is present in single haploid samples from X and Y but not in samples from other populations. The probability of this site pattern, under the model of history specified in file input.lgo, is given in the "Prob" column. When the -d or --deterministic options are used, these probabilities are calculated by a deterministic algorithm.

In the example above, I used -d 0, which tells legosim to ignore only those states with probability 0. Thus, the answers will be extremely accurate. When neither -d nor --deterministic are used, the probabilities are estimated by coalescent simulation.

The deterministic algorithm sums across all possible histories of the samples defined in the .lgo file. The number of histories increases rapidly with sample size and with the number of migration events. For larger models, it is not feasible to use -d 0. However, one can still use the deterministic algorithm to obtain an approximate answer, using an argument such as -d 1e-6. This tells legosim to use the deterministic algorithm while ignoring states whose probability is less than or equal to 1e-6. Nonzero arguments to -d will introduce bias. To evaluate the magnitude of this bias, compare the results to those obtained using the stochastic algorithm or (where feasible) the deterministic one with -d 0.

When the deterministic algorithm is used, legosim prints a line of output that looks like this:

# Summed probability of ignored improbable events: 3.515617e-04

This "summed probability" can exceed 1, because it sums across several probability distributions. Nonetheless, when it is small, that is an indication that we have not ignored much of any of these distributions.

To simulate site pattern counts across an entire genome, use the -U option, whose argument give the expected number of mutations per generation per haploid genome. This argument should equal $\mu L$, where $\mu$ is the mutation rate per nucleotide site per generation, and $L$ is the number of nucleotide sites sequenced per haploid genome, including monomorphic sites but excluding those that fail quality control. For example, adding -U 18 to the command above led to the following output:

#######################################
# legosim: site pattern probabilities #
#            version 1.89             #
#######################################

# Program was compiled: Oct 22 2020 22:30:28
# Program was run: Fri Oct 23 12:35:53 2020

# cmd: legosim -1 -i 10000 -U 18 input.lgo
# input file                  : input.lgo
# algorithm                   : stochastic
# nreps                       : 10000
# mutations per haploid genome: 18.000000
# including singleton site patterns.
#       SitePat           Count
              x               4
              y               1
              n               8
            x:y               5
            x:n               0
            y:n               0

Now, the 2nd column is labeled "Count" rather than "Prob" and gives the simulated count of each site pattern across the genome as a whole. It is calculated by sampling from a Poisson distribution with mean $\mu L b$, where $\mu$ and $L$ are as described above, and $b$ is the average branch length. This Poisson model treats nucleotide sites as independent, ignoring linkage disequilibrium. The counts it provides are correct in expectation, but their variances in repeated runs of the program are too small.