Legofit
infers population history from nucleotide site patterns.
|
Tabulate site pattern frequencies from .daf files.
Tabpat reads data in .daf format and tabulates counts of nucleotide site patterns, writing the result to standard output. Optionally, it also calculates a moving-blocks bootstrap, writing each bootstrap replicate into a separate file.
Usage: tabpat [options] <x>=<in1> <y>=<in2> ... where <x> and <y> are arbitrary labels, and <in1> and <in2> are input files in daf format. Writes to standard output. Labels may not include the character ":". Maximum number of input files: 32. Options may include: -f <name> or --bootfile <name> Bootstrap output file basename. Def: tabpat.boot. -r <x> or --bootreps <x> # of bootstrap replicates. Def: 0 -b <x> or --blocksize <x> # of SNPs per block in moving-blocks bootstrap. Def: 0. -1 or --singletons Use singleton site patterns -F or --logFixed log fixed sites to tabpat.log -a or --logAll log all sites to tabpat.log -h or --help Print this message
Before running tabpat
, use daf to convert the input data into daf format. Let us assume you have done this, and that directory ~/daf contains a separate daf file for each population. We want to compare 4 populations, whose .daf files are yri.daf
, ceu.daf
, altai.daf
, and denisova.daf
. The following command will do this, putting the results into obs.txt
.
tabpat x=~/daf/yri.daf \ y=~/daf/ceu.daf \ n=~/daf/altai.daf \ d=~/daf/denisova.daf > obs.txt
Here, "x", "y", "n", and "d" are labels that will be used to identify site patterns in the output. For example, site pattern "x:y" refers to the pattern in which the derived allele is present haploid samples from "x" and "y" but not on those from other populations. The order of the command-line arguments determines the order in which labels are sorted on output. Given the command line above, we would get a site pattern labeled "x:y:d" rather than, say, "y:x:d".
The output looks like this:
# Population labels: # x = /home/rogers/daf/yri.daf # y = /home/rogers/daf/ceu.daf # n = /home/rogers/daf/altai.daf # d = /home/rogers/daf/denisova.daf # Excluding singleton site patterns. # Number of site patterns: 10 # Tabulated 12327755 SNPs # SitePat E[count] x:y 340952.4592501 x:n 46874.1307236 x:d 46034.4670204 y:n 55137.4236715 y:d 43535.5248078 n:d 231953.3372578 x:y:n 91646.1277991 x:y:d 88476.9619569 x:n:d 96676.3877423 y:n:d 100311.4411513
The left column lists the site patterns that occur in the data. The right column gives the expected count of each site pattern. These are not integers, because they represent averages over all possible subsamples consisting of a single haploid genome from each population.
In the daf files used as input, chromosomes should appear in lexical order. Within each chromosome, nucleotides should appear in numerical order. There should be no duplicate (chromosome, position) pairs. Otherwise, the program aborts with an error.
To generate a bootstrap, use the --bootreps
option:
tabpat --bootreps 50 \ x=~/daf/yri.daf \ y=~/daf/ceu.daf \ n=~/daf/altai.daf \ d=~/daf/denisova.daf > obs.txt
This will generate not only the primary output file, obs.txt
, but also 50 additional files, each representing a single bootstrap replicate. The primary output file now has a bootstrap confidence interval:
# Population labels: # x = /home/rogers/daf/yri.daf # y = /home/rogers/daf/ceu.daf # n = /home/rogers/daf/altai.daf # d = /home/rogers/daf/denisova.daf # Excluding singleton site patterns. # Number of site patterns: 10 # Tabulated 12327755 SNPs # bootstrap output file = tabpat.boot # confidence level = 95% # SitePat E[count] low high x:y 340952.4592501 338825.6604586 342406.6670816 x:n 46874.1307236 46361.5798377 47438.1857029 x:d 46034.4670204 45605.6588012 46631.6434277 y:n 55137.4236715 54650.0763578 55783.7051253 y:d 43535.5248078 43110.5119922 44234.0919024 n:d 231953.3372578 229495.3741057 234173.6878092 x:y:n 91646.1277991 90494.0219749 92873.4443706 x:y:d 88476.9619569 87137.1867967 89585.8431419 x:n:d 96676.3877423 95935.5184294 97417.6241185 y:n:d 100311.4411513 99292.9839140 101163.3457462
Here, low
and high
are the limits of a 95% confidence interval. The bootstrap output files look like tabpat.boot000
, tabpat.boot001
, and so on.