Tabulate site pattern frequencies from .daf files.

Tabpat: tabulates site patterns

Tabpat reads data in .daf format and tabulates counts of nucleotide site patterns, writing the result to standard output. Optionally, it also calculates a moving-blocks bootstrap, writing each bootstrap replicate into a separate file.

Usage

Usage: tabpat [options] <x>=<in1> <y>=<in2> ...
   where <x> and <y> are arbitrary labels, and <in1> and <in2> are input
   files in daf format. Writes to standard output. Labels may not include
   the character ":". Maximum number of input files: 32.

Options may include:
   -f <name> or --bootfile <name>
      Bootstrap output file basename. Def: tabpat.boot.
   -r <x> or --bootreps <x>
      # of bootstrap replicates. Def: 0
   -b <x> or --blocksize <x>
      # of SNPs per block in moving-blocks bootstrap. Def: 0.
   -1 or --singletons
      Use singleton site patterns
   -F or --logFixed
      log fixed sites to tabpat.log
   -a or --logAll
      log all sites to tabpat.log
   -h or --help
      Print this message
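
The constraints on the <label>=<file> arguments can be illustrated with a short Python sketch. It is not tabpat's own argument parser: the function name parse_inputs and the error messages are hypothetical, and only the two rules stated in the usage message (no ":" in labels, at most 32 input files) are taken from the documentation.

```python
MAX_INPUTS = 32  # limit stated in the usage message


def parse_inputs(args):
    """Split 'label=path' arguments into an ordered list of (label, path)."""
    if len(args) > MAX_INPUTS:
        raise ValueError(f"too many input files: {len(args)} > {MAX_INPUTS}")
    pairs = []
    for arg in args:
        # Split on the first "=" only, so paths may themselves contain "=".
        label, sep, path = arg.partition("=")
        if not sep or not label or not path:
            raise ValueError(f"expected <label>=<file>, got {arg!r}")
        if ":" in label:
            raise ValueError(f"label {label!r} may not include ':'")
        pairs.append((label, path))
    return pairs


pairs = parse_inputs(["x=yri.daf", "y=ceu.daf"])
print(pairs)
```

The list preserves command-line order, which matters because that order determines how labels are arranged within each site pattern.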

Example

Before running tabpat, use daf to convert the input data into daf format. Let us assume you have done this, and that directory ~/daf contains a separate daf file for each population. We want to compare 4 populations, whose .daf files are yri.daf, ceu.daf, altai.daf, and denisova.daf. The following command will do this, putting the results into obs.txt.

tabpat x=~/daf/yri.daf \
       y=~/daf/ceu.daf \
       n=~/daf/altai.daf \
       d=~/daf/denisova.daf > obs.txt

Here, "x", "y", "n", and "d" are labels that will be used to identify site patterns in the output. For example, site pattern "x:y" refers to the pattern in which the derived allele is present in the haploid samples from "x" and "y" but not in those from the other populations. The order of the command-line arguments determines the order in which labels are sorted on output. Given the command line above, we would get a site pattern labeled "x:y:d" rather than, say, "y:x:d".
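
The labeling rule can be sketched in Python. This is an illustration rather than tabpat's code: the function name site_patterns is hypothetical, and the choice to list smaller patterns first and to leave out the pattern containing every population is inferred from the sample output in this document, not from the program source.

```python
from itertools import combinations


def site_patterns(labels, singletons=False):
    """List site-pattern labels, keeping labels in command-line order."""
    smallest = 1 if singletons else 2
    patterns = []
    # Sizes stop at len(labels) - 1: the sample output lists no pattern
    # in which the derived allele is present in every population.
    for size in range(smallest, len(labels)):
        for combo in combinations(labels, size):
            patterns.append(":".join(combo))
    return patterns


patterns = site_patterns(["x", "y", "n", "d"])
print(len(patterns))
print(patterns)
```

Because combinations() preserves the order of its input, a label that comes earlier on the command line always comes earlier within a pattern, so "x:y:d" can occur but "y:x:d" cannot.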

The output looks like this:

# Population labels:
#    x = /home/rogers/daf/yri.daf
#    y = /home/rogers/daf/ceu.daf
#    n = /home/rogers/daf/altai.daf
#    d = /home/rogers/daf/denisova.daf
# Excluding singleton site patterns.
# Number of site patterns: 10
# Tabulated 12327755 SNPs
#       SitePat             E[count]
            x:y       340952.4592501
            x:n        46874.1307236
            x:d        46034.4670204
            y:n        55137.4236715
            y:d        43535.5248078
            n:d       231953.3372578
          x:y:n        91646.1277991
          x:y:d        88476.9619569
          x:n:d        96676.3877423
          y:n:d       100311.4411513

The left column lists the site patterns that occur in the data. The right column gives the expected count of each site pattern. These are not integers, because they represent averages over all possible subsamples consisting of a single haploid genome from each population.
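
One way to read "averages over all possible subsamples" is sketched below in Python. It assumes that each site is summarized by the derived allele frequency in each population's sample, so that a single haploid genome drawn from a population carries the derived allele with that probability. The function names and the toy data are hypothetical, and this is not tabpat's implementation.

```python
from itertools import combinations
from math import prod


def pattern_contributions(daf, singletons=False):
    """Probability of each site pattern at one site.

    daf maps each population label to its derived allele frequency.
    """
    labels = list(daf)
    smallest = 1 if singletons else 2
    out = {}
    for size in range(smallest, len(labels)):
        for combo in combinations(labels, size):
            # Derived allele drawn from populations in the pattern,
            # ancestral allele drawn from all the others.
            out[":".join(combo)] = prod(
                daf[lab] if lab in combo else 1.0 - daf[lab]
                for lab in labels
            )
    return out


def expected_counts(sites, singletons=False):
    """Sum the per-site pattern probabilities over all sites."""
    totals = {}
    for daf in sites:
        for pattern, p in pattern_contributions(daf, singletons).items():
            totals[pattern] = totals.get(pattern, 0.0) + p
    return totals


# Two made-up sites, for illustration only.
sites = [
    {"x": 0.5, "y": 1.0, "n": 0.0, "d": 0.0},
    {"x": 0.0, "y": 0.0, "n": 1.0, "d": 1.0},
]
counts = expected_counts(sites)
print(counts["x:y"], counts["n:d"])
```

In the first toy site only half of the "x" sample carries the derived allele, so that site contributes 0.5 rather than 1 to pattern "x:y". This is why the tabulated counts need not be integers.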

In the daf files used as input, chromosomes should appear in lexical order. Within each chromosome, nucleotides should appear in numerical order. There should be no duplicate (chromosome, position) pairs. If any of these conditions is violated, the program aborts with an error.
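
The ordering requirement can be checked before running tabpat. The Python sketch below works on (chromosome, position) pairs that have already been read from a file, because the layout of a daf file is not described here. The function name check_sorted is hypothetical, and the sketch assumes that "lexical order" means plain string comparison.

```python
def check_sorted(sites):
    """Raise ValueError unless (chromosome, position) pairs strictly increase.

    Chromosome names are compared as strings and positions as integers.
    """
    prev = None
    for chrom, pos in sites:
        cur = (chrom, pos)
        if prev is not None:
            if cur == prev:
                raise ValueError(f"duplicate site {chrom}:{pos}")
            if cur < prev:
                raise ValueError(
                    f"site {chrom}:{pos} is out of order after "
                    f"{prev[0]}:{prev[1]}"
                )
        prev = cur


# Lexical order puts chromosome "10" before chromosome "2".
check_sorted([("1", 5), ("1", 9), ("10", 3), ("2", 1)])
print("ok")
```

Note that under string comparison chromosome "10" sorts before chromosome "2", which differs from numerical order.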

To generate a bootstrap, use the --bootreps option:

tabpat --bootreps 50 \
       x=~/daf/yri.daf \
       y=~/daf/ceu.daf \
       n=~/daf/altai.daf \
       d=~/daf/denisova.daf > obs.txt

This will generate not only the primary output file, obs.txt, but also 50 additional files, each representing a single bootstrap replicate. The primary output file now has a bootstrap confidence interval:

# Population labels:
#    x = /home/rogers/daf/yri.daf
#    y = /home/rogers/daf/ceu.daf
#    n = /home/rogers/daf/altai.daf
#    d = /home/rogers/daf/denisova.daf
# Excluding singleton site patterns.
# Number of site patterns: 10
# Tabulated 12327755 SNPs
# bootstrap output file = tabpat.boot
# confidence level = 95%
#       SitePat             E[count]            low           high
            x:y       340952.4592501 338825.6604586 342406.6670816
            x:n        46874.1307236  46361.5798377  47438.1857029
            x:d        46034.4670204  45605.6588012  46631.6434277
            y:n        55137.4236715  54650.0763578  55783.7051253
            y:d        43535.5248078  43110.5119922  44234.0919024
            n:d       231953.3372578 229495.3741057 234173.6878092
          x:y:n        91646.1277991  90494.0219749  92873.4443706
          x:y:d        88476.9619569  87137.1867967  89585.8431419
          x:n:d        96676.3877423  95935.5184294  97417.6241185
          y:n:d       100311.4411513  99292.9839140 101163.3457462

Here, low and high are the limits of a 95% confidence interval. The bootstrap output files look like tabpat.boot000, tabpat.boot001, and so on.
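
For readers unfamiliar with the moving-blocks bootstrap, the Python sketch below shows the general idea: each replicate is assembled from randomly chosen, possibly overlapping blocks of consecutive SNPs. It is not tabpat's implementation. The function names, the per-SNP input values, the number of blocks drawn per replicate, and the percentile method used for the interval are all assumptions, and the three-digit file suffix is inferred only from the two file names shown above.

```python
import random


def moving_blocks_replicate(values, blocksize, rng):
    """One replicate: the sum over randomly chosen blocks of SNPs."""
    n = len(values)
    nblocks = -(-n // blocksize)  # ceiling of n / blocksize
    total = 0.0
    for _ in range(nblocks):
        # A block may start at any SNP that leaves room for a full block.
        start = rng.randrange(n - blocksize + 1)
        total += sum(values[start:start + blocksize])
    return total


def percentile_interval(replicates, level=0.95):
    """Return (low, high) percentile limits of the replicate values."""
    ordered = sorted(replicates)
    tail = (1.0 - level) / 2.0
    last = len(ordered) - 1
    return ordered[round(tail * last)], ordered[round((1.0 - tail) * last)]


rng = random.Random(1)
# Made-up data: each entry is one SNP's contribution to a site pattern.
values = [1.0] * 100
reps = [moving_blocks_replicate(values, 10, rng) for _ in range(50)]
low, high = percentile_interval(reps)

# Replicate file names following the pattern shown in the text.
names = [f"tabpat.boot{i:03d}" for i in range(50)]
print(low, high, names[0], names[-1])
```

Resampling blocks rather than individual SNPs keeps nearby, correlated SNPs together within each replicate.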

Copyright: Copyright (c) 2016, Alan R. Rogers <rogers@anthro.utah.edu>. This file is released under the Internet Systems Consortium License, which can be found in file "LICENSE".