Legofit
infers population history from nucleotide site patterns.
sitepat

Tabulate site pattern frequencies from .raf files.

Sitepat: tabulates site patterns

Sitepat reads data in .raf format and tabulates counts of nucleotide site patterns, writing the result to standard output. Optionally, it also calculates a moving-blocks bootstrap, writing each bootstrap replicate into a separate file.

Usage

Usage: sitepat [options] <x>=<in_1> <y>=<in_2> ... outgroup=<in_K>
   where <x> and <y> are arbitrary labels, and <in_i> are input
   files in raf format. Labels may not include the character ":".
   Final label must be "outgroup". Writes to standard output.

   If input file name ends with .gz, input is decompressed using
   gunzip. Maximum number of input files: 32 plus outgroup. Minimum is
   2 plus outgroup.

Options may include:
   -f <name> or --bootfile <name>
      Bootstrap output file basename. Def: sitepat.boot.
   -r <x> or --bootreps <x>
      # of bootstrap replicates. Def: 0
   -b <x> or --blocksize <x>
      # of SNPs per block in moving-blocks bootstrap. Def: 0.
   -1 or --singletons
      Use singleton site patterns
   -m or --logMismatch
      Log REF mismatches to sitepat.log
   -A or --logAA
      Log sites with uncallable ancestral allele
   --version
      Print version and exit
   -h or --help
      Print this message

Reference genomes are often useful as outgroups, but they must first be aligned to the target genome. These alignments are typically distributed in axt format. To convert from axt format to raf format, see axt2raf.

Example

Before running sitepat, use raf to convert the input data into raf format. To save space, you may want to compress the ".raf" files using the external utility "gzip". The names of these compressed files should end in ".gz", and sitepat will use an external utility (gunzip) to decompress them. If gunzip has not been installed, sitepat cannot read compressed input files.

When making bootstraps (using the -r or --bootreps options), sitepat must read the data files twice. If the files are compressed, they must be decompressed twice. It may be faster to use uncompressed input files when making bootstraps.

Let us assume you have generated raf files, and that directory ~/raf contains a separate raf file for each population. We want to compare 4 populations, whose .raf files are yri.raf, ceu.raf, altai.raf, and denisova.raf. Ancestral alleles are to be called using chimp.raf as an outgroup. The following command will do this, putting the results into obs.txt.

sitepat x=raf/yri.raf \
       y=raf/ceu.raf \
       n=raf/altai.raf \
       d=raf/denisova.raf \
       outgroup=raf/chimp.raf > obs.txt

Here, "x", "y", "n", and "d" are labels that will be used to identify site patterns in the output. For example, site pattern "x:y" refers to the pattern in which the derived allele is present haploid samples from "x" and "y" but not on those from other populations. The order of the command-line arguments determines the order in which labels are sorted on output. Given the command line above, we would get a site pattern labeled "x:y:d" rather than, say, "y:x:d".

The output looks like this:

# Population labels:
#    x = /home/rogers/raf/yri.raf
#    y = /home/rogers/raf/ceu.raf
#    n = /home/rogers/raf/altai.raf
#    d = /home/rogers/raf/denisova.raf
# Excluding singleton site patterns.
# Number of site patterns: 10
# Tabulated 12327755 SNPs
#       SitePat             E[count]
            x:y       340952.4592501
            x:n        46874.1307236
            x:d        46034.4670204
            y:n        55137.4236715
            y:d        43535.5248078
            n:d       231953.3372578
          x:y:n        91646.1277991
          x:y:d        88476.9619569
          x:n:d        96676.3877423
          y:n:d       100311.4411513

The left column lists the site patterns that occur in the data. The right column gives the expected count of each site pattern. These are not integers, because they represent averages over all possible subsamples consisting of a single haploid genome from each population.

In the raf files used as input, chromosomes should appear in lexical order. (See this link for advice about lexical sorting.) Within each chromosome, nucleotides should appear in numerical order. There should be no duplicate (chromosome, position) pairs. Otherwise, the program aborts with an error.

To generate a bootstrap, use the --bootreps option:

sitepat --bootreps 50 \
       x=raf/yri.raf \
       y=raf/ceu.raf \
       n=raf/altai.raf \
       d=raf/denisova.raf \
       outgroup=raf/chimp.raf > obs.txt

This will generate not only the primary output file, obs.txt, but also 50 additional files, each representing a single bootstrap replicate. The primary output file now has a bootstrap confidence interval:

# Population labels:
#    x = /home/rogers/raf/yri.raf
#    y = /home/rogers/raf/ceu.raf
#    n = /home/rogers/raf/altai.raf
#    d = /home/rogers/raf/denisova.raf
# Excluding singleton site patterns.
# Number of site patterns: 10
# Tabulated 12327755 SNPs
# bootstrap output file = sitepat.boot
# confidence level = 95%
#       SitePat             E[count]            low           high
            x:y       340952.4592501 338825.6604586 342406.6670816
            x:n        46874.1307236  46361.5798377  47438.1857029
            x:d        46034.4670204  45605.6588012  46631.6434277
            y:n        55137.4236715  54650.0763578  55783.7051253
            y:d        43535.5248078  43110.5119922  44234.0919024
            n:d       231953.3372578 229495.3741057 234173.6878092
          x:y:n        91646.1277991  90494.0219749  92873.4443706
          x:y:d        88476.9619569  87137.1867967  89585.8431419
          x:n:d        96676.3877423  95935.5184294  97417.6241185
          y:n:d       100311.4411513  99292.9839140 101163.3457462

Here, low and high are the limits of a 95% confidence interval. The bootstrap output files look like sitepat.boot000, sitepat.boot001, and so on.