infers population history from nucleotide site patterns.

Estimate parameters describing population sizes, the times of separations and of episodes of gene flow, and levels of gene flow.

legofit: estimate population history from site pattern data

# legofit: estimate population history #
#    version 2.0.2-5-g1571a5c-dirty    #

# Program was compiled: Nov 24 2020 18:22:41
# Program was run: Tue Nov 24 18:22:55 2020

# cmd: ./legofit -h
usage: legofit [options] input.lgo sitepat.txt
   where file input.lgo describes population history,
   and file sitepat.txt contains site pattern frequencies.
Options may include:
   -1 or --singletons
      Use singleton site patterns
   -c or --clic
      write output files needed by clic
   -d <x> or --deterministic <x>
      Deterministic algorithm, ignoring states with Pr <= x
   -e or --epsilon
      add a small value to each site pattern prob to prevent numerical 
   -F <x> or --scaleFactor <x>
      set DE scale factor
   -h or --help
      print this message
   -p <x> or --ptsPerDim <x>
      number of DE points per free var
   -s <x> or --strategy <x>
      set DE strategy
   -S <g>@<r> or --stage <g>@<r>
      add stage with <g> generations and <r> simulation reps. The "@<r>"
      portion can be omitted if -d or --deterministic are used.
   --stateIn <filename>
      read initial state from new-style file. Option may be repeated.
   --stateOut <filename>
      write final state to file
   -T <x> or --tol <x>
      termination criterion
   -t <x> or --threads <x>
      number of threads (default is auto)
   -v or --verbose
      verbose output
      Print version and exit
   -x <x> or --crossover <x>
      set DE crossover probability

Two arguments are required:

Legofit estimates the values of all the parameters that are declared as "free" in the .lgo file. It does this by minimizing a "cost function", which measures the difference between observed and expected values. Currently, the cost function is the KL divergence. Minimizing this is equivalent to maximizing a composite likelihood function based on the multinomial distribution.

Site pattern frequencies are estimated either by computer simulation (the stochastic algorithm) or by a newer deterministic algorithm. The stochastic algorithm is the default. The -d or --deterministic arguments invoke the deterministic algorithm.

The deterministic algorithm sums across all possible histories of the samples defined in the .lgo file. The number of histories increases rapidly with sample size and with the number of migration events. For larger models, it is not feasible to use -d 0. However, one can still use the deterministic algorithm to obtain an approximate answer, using an argument such as -d 1e-6. This tells legosim to use the deterministic algorithm while ignoring states whose probability is less than or equal to 1e-6. Nonzero arguments to -d will introduce bias, and it is not yet clear that such arguments are useful. At present, I recommend against using -d with a nonzero argument. If the model is so complex that -d 0 is not feasible, then omit this argument to use the default stochastic algorithm.

Optimization is done using the "differential evolution" (DE) algorithm. The DE algorithm maintains a swarm of points, each at a different set of parameter values. The objective function is evaluated at these points in a multithreaded job queue, so the program runs fastest on a machine with lots of cores. You can set the number of threads using the -t argument. By default, the stochastic algorithm uses 3/4 as many threads as there are processors on the machine, and the deterministic algorithm uses 1/2 as many threads as there are processors. If the deterministic algorithm is using too much memory, reduce the number of threads.

The DE algorithm can be tuned via command line arguments -F, -x, -s, and -p. Details regarding these choices can be found in "Differential evolution: a practical approach to global optimization", by Price, Storn, and Lampinen. We've had good results with -s 2, -F 0.3, and -x 0.8, so these became the defaults as of version 1.15.

If you are using the stochastic algorithm, it is a good idea to use several -S arguments to create a "simulation schedule". During the first few hundred generations of the DE algorithm, the swarm of points adapts to the objective function. During this initial phase, high accuracy is not required, so it can speed things up to use low resolution in function evaluations. This is the purpose of the -S argument, which adds a "stage" to the simulation schedule. For example, -S 1000@10000 adds a stage of 1000 DE generations, in each of which 10000 simulation replicates are used to evaluate each objective function. The -S argument can be given several times to set up a simulation schedule with several stages. The algorithm is allowed to converge only during the final stage.

With the deterministic algorithm, there is no point in using more than one -S argument, and you don't need to specify the number of iterations. The argument -S 5000 would allow legofit to use as many as 5000 generations of differential evolution.

By default, the initial swarm of points consists of one point representing the parameter values in the .lgo file, plus other points scattered randomly throughout the feasible region of parameter space. The total number of points defaults to 10 times the number of free parameters. To change this number, see the --ptsPerDim option.

The initial swarm of points can also be specified using the --stateIn option. This reads a file specifying the initial state of the swarm of points maintained by DE. The format of this file is as described below for the --stateOut option. The number of free parameters in each --stateIn file should be as specified in the .lgo file. The number of points in the files may differ.

The --stateIn option may be given more than once, each time with a different input file. When more than one file is given, Legofit constructs the initial swarm of points by combining points from all input files.

The option --stateOut is used to define an output file for the final state of the optimizer. This output file begins with a row giving the number of points and the number of free parameters. After that, there is a row for each point in the swarm of points maintained by diffev.c. In each row, the first entry is the value of the cost function at that point. The remaining entries give the free parameter values in the same order in which they are printed by legofit.

The format of the state file changed in early July, 2018. Before that date, the state file did not include names of parameters. Parameter values in the state file had to be arranged in the same order as the free parameters in the .lgo file. If the .lgo and state files referred to different parameters or to the same parameters in a different order, parameters would get the wrong values. No error was detected unless this misassignment resulted in a tree that was not feasible–for example, one in which a segment of the population tree was older than its parent in the tree.

The new state file format includes the names of parameters, and the input routine compares these against the names of free parameters in the .lgo file. The two lists must have the same parameters in the same order. Otherwise, legofit aborts with an error message. Old- and new-format state files can both be input using –stateIn arguments and can be intermingled in a single legofit run.

The -1 option tells legofit to use singleton site patterns–patterns in which the derived allele is present in only a single sample. This is a bad idea with low-coverage sequence data, because singletons are likely to be sequencing errors. We use this option routinely now with high-coverage data. To use it, you need to generate a data set that contains singletons, by using the -1 option of \ref tabpat "tabpat", sitepat, scrmpat, or simpat.

2017-08-26: The convergence criterion has been changed. In the previous code, differential evolution (DE) iterations stopped when neither the best objective function value nor the spread of these values had changed in a fixed number of iterations. That criterion was used in our PNAS paper, "Early history of Neanderthals and Denisovans".

I began to notice convergence problems with models larger than those used in the August 2017 PNAS paper. All bootstrap replicates would report convergence, but some yielded wild parameter estimates. In these outliers, the spread of objective function values was also very large, indicating that the algorithm had not really converged. So I implemented a new convergence criterion, based on the spread of objective function values. The iterations terminate when this spread falls to a pre-determined value.

This new convergence criterion works best with the KL (Kullback-Leibler) cost function. Minimizing KL is the same as maximizing log composite likelihood (lnL). But KL equals zero when the fit is perfect. For this reason, it isn't necessary to scale the tolerance value to lnL.

In each DE generation, the algorithm evaluates the cost function of each point in the swarm. The cost function measures the difference between observed and expected site pattern frequencies. The difference between the maximum and minimum cost is called the "spread". The algorithm stops when "spread" falls to the tolerance value, "ytol". This tolerance can be set on the command line with "-T" or "--tol" arguments. Convergence is checked only in the last stage, as set by "-S" or "--stage".

If the algorithm fails to converge, there are several options. First, you can use "-S" or "--stage" to increase the maximum number of DE generations. Second, you can relax the tolerance. By default, this is 1e-4. It is reported in the legofit output. To double this value, use "-T 2e-4" or "--tol 2e-4".

Legofit handles three types of signal: SIGINT, SIGTERM, and SIGUSR1. If legofit is running in the foreground, the first of these signals can be generated by typing Ctrl-C. Otherwise (under linux or osx), use killall -SIGINT legofit, killall -SIGTERM legofit, or killall -SIGUSR1 legofit. In response to SIGINT or SIGTERM, Legofit will wait until the end of the current generation of the diffev algorithm and then exit gracefully, printing all the usual output. In response to SIGUSR1, Legofit will wait until the end of the current DE generation and then write to stderr a summary of the state of the optimizer.