Legofit
infers population history from nucleotide site patterns.

Tabulate site pattern frequencies from data produced by ms2sim or msprime.

simpat: tabulates site patterns

Simpat reads data generated by ms2sim or msprime, tabulates counts of nucleotide site patterns, and writes the result to standard output.

Usage

Usage: simpat [options] [inputfile]
  Reads from stdin by default. Options may include:
   -V or --version
      Print version and exit
   -h or --help
      Print this message

Example

simpat parses a file generated using ms2sim or msprime. The first few lines of input should look like this:

npops = 3
pop sampsize
x 2
y 1
z 1
0 0 1 0 0
0 0 1 0 0
0 1 0 0 0

Line 1 gives the number of populations. These need not correspond to the populations of ms or msprime. For example, ms might define samples from the same population at different points in time, and you might want to treat these as distinct samples within simpat.

Line 2 is a header for the 3 lines that follow.

In lines 3-5, the 1st column is a label for a population, and the 2nd column specifies the number of haploid samples from that population. The number of lines in this section of the input should equal npops, as specified on line 1. The order in which these samples are listed should correspond to the order in which they occur in the lines of data that follow.
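The header layout described above can be read with a few lines of Python. This is a minimal sketch for illustration only, not simpat's own parser; the function name `parse_header` is hypothetical, and it assumes the header lines have already been read into a list of strings.

```python
def parse_header(lines):
    """Parse the population header; return (labels, sizes, lines_consumed)."""
    # Line 1 has the form "npops = 3".
    key, _, value = lines[0].partition("=")
    if key.strip() != "npops":
        raise ValueError("line 1 should look like 'npops = <integer>'")
    npops = int(value)

    # Line 2 is the column header ("pop sampsize"); it carries no data.
    labels, sizes = [], []
    for line in lines[2:2 + npops]:
        label, size = line.split()
        labels.append(label)
        sizes.append(int(size))

    # The number of population lines should equal npops.
    if len(labels) != npops:
        raise ValueError("fewer population lines than npops")
    return labels, sizes, 2 + npops


header = ["npops = 3", "pop sampsize", "x 2", "y 1", "z 1"]
labels, sizes, consumed = parse_header(header)
print(labels, sizes, consumed)
```

For the example input, this yields the labels x, y, z with sample sizes 2, 1, 1, and reports that the header occupies the first 5 lines.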

All remaining lines have the same format. Each provides data for a single nucleotide site, and each consists of the same number of fields, separated by whitespace. The 1st field labels the current chromosome (or replicate). Each remaining field is a haploid genotype, with 0 representing the ancestral allele and 1 the derived allele. These are given in the order specified in the previous section of the input (lines 3-5 in the example above). In our example, columns 2-3 refer to population x, column 4 to y, and column 5 to z.
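The mapping from columns to populations can be sketched as follows. This is an illustrative Python fragment, not code from simpat; the function name `split_site_line` is hypothetical, and the labels and sample sizes are those of the example header.

```python
def split_site_line(line, labels, sizes):
    """Split one data line into (chromosome_label, {population: genotypes})."""
    fields = line.split()
    expected = 1 + sum(sizes)  # one label field plus one field per haploid sample
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")

    chrom = fields[0]
    genotypes = [int(g) for g in fields[1:]]

    # Hand out genotype columns to populations in header order.
    by_pop, start = {}, 0
    for label, size in zip(labels, sizes):
        by_pop[label] = genotypes[start:start + size]
        start += size
    return chrom, by_pop


chrom, by_pop = split_site_line("0 0 1 0 0", ["x", "y", "z"], [2, 1, 1])
print(chrom, by_pop)
```

Here the first data line of the example is assigned to chromosome "0", with genotypes 0 and 1 for the two samples from x, and 0 for the single samples from y and z.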

In the output, site pattern "x:y" refers to the pattern in which the derived allele is present on the haploid samples from "x" and "y" but not on those from other populations. Here is the output of a run with 4 populations, x, y, n, and d:

# simpat version 1.67
# Including singleton site patterns.
# Number of site patterns: 14
# Nucleotide sites: 7298790
# Sites used: 7298790
#       SitePat             E[count]
              x      1401902.0000000
              y      1239084.0000000
              n       589582.0000000
              d       598305.0000000
            x:y      1074236.0000000
            x:n        21136.0000000
            x:d        21693.0000000
            y:n        38330.0000000
            y:d        31022.0000000
            n:d      1408276.0000000
          x:y:n        45407.0000000
          x:y:d        44519.0000000
          x:n:d       324614.0000000
          y:n:d       460684.0000000

The left column lists the site patterns that occur in the data. The right column gives the expected count of each site pattern. These are not necessarily integers, because they represent averages over all possible subsamples consisting of a single haploid genome from each population. They are integers here only because this simulation modeled a single haploid sample from each population.
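One way to read "averages over all possible subsamples" is this: at each site, draw one haploid genome at random from each population, and add the probability of each site pattern to that pattern's count. The Python sketch below follows that reading. It is an assumption about the calculation, not simpat's source code, and the function name `expected_counts` is hypothetical. It also assumes that patterns in which no population or every population carries the derived allele are excluded, which is consistent with the 14 patterns listed for 4 populations.

```python
from itertools import combinations


def expected_counts(labels, sites):
    """Sum, over sites, the probability of each site pattern when one
    haploid genome is drawn at random from each population.

    `sites` is a list of dicts mapping population label -> list of 0/1 genotypes.
    """
    n = len(labels)
    # All non-empty, proper subsets of the populations, smallest first.
    patterns = [c for k in range(1, n) for c in combinations(labels, k)]
    counts = {":".join(p): 0.0 for p in patterns}

    for site in sites:
        # Derived-allele frequency within each population's sample.
        freq = {lab: sum(site[lab]) / len(site[lab]) for lab in labels}
        for p in patterns:
            prob = 1.0
            for lab in labels:
                # Derived in populations that belong to the pattern, ancestral elsewhere.
                prob *= freq[lab] if lab in p else 1.0 - freq[lab]
            counts[":".join(p)] += prob
    return counts


# The three example sites: two samples from x, one each from y and z.
sites = [
    {"x": [0, 1], "y": [0], "z": [0]},
    {"x": [0, 1], "y": [0], "z": [0]},
    {"x": [1, 0], "y": [0], "z": [0]},
]
counts = expected_counts(["x", "y", "z"], sites)
print(counts)
```

In this sketch each example site contributes 0.5 to pattern "x", because only one of the two samples from x carries the derived allele, so the count for "x" is 1.5 rather than an integer. With a single haploid sample per population, every frequency is 0 or 1 and the counts are whole numbers.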