Legofit
infers population history from nucleotide site patterns.
|
Tabulate site pattern frequencies from .raf files.
Sitepat reads data in .raf format and tabulates counts of nucleotide site patterns, writing the result to standard output. Optionally, it also calculates a moving-blocks bootstrap, writing each bootstrap replicate into a separate file.
Usage: sitepat [options] <x>=<in_1> <y>=<in_2> ... outgroup=<in_K> where <x> and <y> are arbitrary labels, and <in_i> are input files in raf format. Labels may not include the character ":". Final label must be "outgroup". Writes to standard output. If input file name ends with .gz, input is decompressed using gunzip. Maximum number of input files: 32 plus outgroup. Minimum is 2 plus outgroup. Options may include: -f <name> or --bootfile <name> Bootstrap output file basename. Def: sitepat.boot. -r <x> or --bootreps <x> # of bootstrap replicates. Def: 0 -b <x> or --blocksize <x> # of SNPs per block in moving-blocks bootstrap. Def: 0. -1 or --singletons Use singleton site patterns -m or --logMismatch Log REF mismatches to sitepat.log -A or --logAA Log sites with uncallable ancestral allele --version Print version and exit -h or --help Print this message
Reference genomes are often useful as outgroups, but they must first be aligned to the target genome. These alignments are typically distributed in axt format. To convert from axt format to raf format, see axt2raf.
Before running sitepat
, use raf to convert the input data into raf format. To save space, you may want to compress the ".raf" files using the external utility "gzip". The names of these compressed files should end in ".gz", and sitepat
will use an external utility (gunzip
) to decompress them. If gunzip
has not been installed, sitepat
cannot read compressed input files.
When making bootstraps (using the -r
or --bootreps
options), sitepat must read the data files twice. If the files are compressed, they must be decompressed twice. It may be faster to use uncompressed input files when making bootstraps.
Let us assume you have generated raf files, and that directory ~/raf contains a separate raf file for each population. We want to compare 4 populations, whose .raf files are yri.raf
, ceu.raf
, altai.raf
, and denisova.raf
. Ancestral alleles are to be called using chimp.raf
as an outgroup. The following command will do this, putting the results into obs.txt
.
sitepat x=raf/yri.raf \ y=raf/ceu.raf \ n=raf/altai.raf \ d=raf/denisova.raf \ outgroup=raf/chimp.raf > obs.txt
Here, "x", "y", "n", and "d" are labels that will be used to identify site patterns in the output. For example, site pattern "x:y" refers to the pattern in which the derived allele is present haploid samples from "x" and "y" but not on those from other populations. The order of the command-line arguments determines the order in which labels are sorted on output. Given the command line above, we would get a site pattern labeled "x:y:d" rather than, say, "y:x:d".
The output looks like this:
# Population labels: # x = /home/rogers/raf/yri.raf # y = /home/rogers/raf/ceu.raf # n = /home/rogers/raf/altai.raf # d = /home/rogers/raf/denisova.raf # Excluding singleton site patterns. # Number of site patterns: 10 # Tabulated 12327755 SNPs # SitePat E[count] x:y 340952.4592501 x:n 46874.1307236 x:d 46034.4670204 y:n 55137.4236715 y:d 43535.5248078 n:d 231953.3372578 x:y:n 91646.1277991 x:y:d 88476.9619569 x:n:d 96676.3877423 y:n:d 100311.4411513
The left column lists the site patterns that occur in the data. The right column gives the expected count of each site pattern. These are not integers, because they represent averages over all possible subsamples consisting of a single haploid genome from each population.
In the raf files used as input, chromosomes should appear in lexical order. (See this link for advice about lexical sorting.) Within each chromosome, nucleotides should appear in numerical order. There should be no duplicate (chromosome, position) pairs. Otherwise, the program aborts with an error.
To generate a bootstrap, use the --bootreps
option:
sitepat --bootreps 50 \ x=raf/yri.raf \ y=raf/ceu.raf \ n=raf/altai.raf \ d=raf/denisova.raf \ outgroup=raf/chimp.raf > obs.txt
This will generate not only the primary output file, obs.txt
, but also 50 additional files, each representing a single bootstrap replicate. The primary output file now has a bootstrap confidence interval:
# Population labels: # x = /home/rogers/raf/yri.raf # y = /home/rogers/raf/ceu.raf # n = /home/rogers/raf/altai.raf # d = /home/rogers/raf/denisova.raf # Excluding singleton site patterns. # Number of site patterns: 10 # Tabulated 12327755 SNPs # bootstrap output file = sitepat.boot # confidence level = 95% # SitePat E[count] low high x:y 340952.4592501 338825.6604586 342406.6670816 x:n 46874.1307236 46361.5798377 47438.1857029 x:d 46034.4670204 45605.6588012 46631.6434277 y:n 55137.4236715 54650.0763578 55783.7051253 y:d 43535.5248078 43110.5119922 44234.0919024 n:d 231953.3372578 229495.3741057 234173.6878092 x:y:n 91646.1277991 90494.0219749 92873.4443706 x:y:d 88476.9619569 87137.1867967 89585.8431419 x:n:d 96676.3877423 95935.5184294 97417.6241185 y:n:d 100311.4411513 99292.9839140 101163.3457462
Here, low
and high
are the limits of a 95% confidence interval. The bootstrap output files look like sitepat.boot000
, sitepat.boot001
, and so on.