Legofit
infers population history from nucleotide site patterns.

Calculate reference allele frequency, raf.

Input file should consist of tab-separated columns:

  1. chromosome
  2. position
  3. reference allele
  4. alternate alleles
  5. genotype in format "0/1" or "0|1", where 0 represents a copy of the reference allele and 1 a copy of the alternate allele.
  6. etc., with one additional column for each further genotype.

Input in this format can be generated from a VCF file as follows:

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' fname.vcf.gz

Output is in 5 columns, separated by tabs:

  1. chromosome
  2. position of the nucleotide
  3. ref, the reference allele
  4. alt, the alternate allele
  5. raf, reference allele frequency
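
The relationship between the genotype columns of the input and the raf column of the output can be sketched in Python as follows. This is an illustration only, not the code that raf itself runs. The function name `raf_line` is hypothetical, and two details are assumptions that the text above does not confirm: missing allele calls (".") are skipped, and the frequency is printed with six significant digits.

```python
import re


def raf_line(line):
    """Convert one tab-separated input line into one output line.

    Hypothetical helper that illustrates the column layout only.
    Returns None if the site has no usable genotype calls.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, alt = fields[:4]
    n_ref = 0      # copies of the reference allele
    n_total = 0    # all non-missing allele copies
    for gt in fields[4:]:
        # Genotypes look like "0/1" (unphased) or "0|1" (phased).
        for allele in re.split(r"[/|]", gt):
            if allele == ".":
                continue  # assumption: missing calls are skipped
            n_total += 1
            if allele == "0":
                n_ref += 1
    if n_total == 0:
        return None
    raf = n_ref / n_total
    # Assumption: six significant digits in the output.
    return f"{chrom}\t{pos}\t{ref}\t{alt}\t{raf:.6g}"
```

For example, a site with genotypes 0/1, 0|0, and 1/1 carries three reference copies out of six, so its reference allele frequency is 0.5.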

The input files should include all sites at which derived alleles are present in any of the populations under study. For example, consider an analysis involving modern humans and Neanderthals. The modern human data must include all sites at which Neanderthals carry derived alleles, even if these sites do not vary among modern humans. To accomplish this, it is best to use whole-genome data for all populations.

The input should not contain duplicate nucleotide sites, the chromosomes should be sorted in lexical order, and within each chromosome, the nucleotides should be in numerical order. Otherwise, raf will abort with an error.
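
The ordering requirement can be checked before running raf with a sketch like the one below. The function name `check_order` is hypothetical, and the code assumes that "lexical order" means plain character-by-character string comparison. Under that comparison, chromosome "10" sorts before chromosome "2", which differs from the numerical order found in many VCF files.

```python
def check_order(sites):
    """Raise ValueError if sites are duplicated or out of order.

    `sites` is an iterable of (chrom, pos) pairs, where chrom is a
    string and pos is an int. Hypothetical helper, not part of raf.
    """
    prev = None
    for chrom, pos in sites:
        if prev is not None:
            prev_chrom, prev_pos = prev
            # Assumption: chromosomes are compared as plain strings.
            if chrom < prev_chrom:
                raise ValueError(
                    f"chromosome {chrom} follows {prev_chrom}: "
                    "not in lexical order")
            if chrom == prev_chrom:
                if pos == prev_pos:
                    raise ValueError(f"duplicate site {chrom}:{pos}")
                if pos < prev_pos:
                    raise ValueError(
                        f"position {pos} follows {prev_pos} "
                        f"on chromosome {chrom}")
        prev = (chrom, pos)
```

A sequence such as ("1", 5), ("1", 9), ("10", 1), ("2", 3) passes this check, whereas placing chromosome "2" before chromosome "10" raises an error.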

Sites are rejected unless they have a single ref allele. Missing values are allowed for the alt allele. At the end of the job, a summary of rejected sites is written to stderr.
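
The pattern of filtering sites and reporting a summary at the end can be sketched as follows. This is not raf's own code. The function name `filter_sites` and the wording of the summary are hypothetical, and the rejection rule used here (the ref field must be exactly one character long) is only an assumed reading of "a single ref". A missing alt allele is taken to be ".", which is how bcftools query prints a missing ALT field.

```python
import sys
from collections import Counter


def filter_sites(rows, log=sys.stderr):
    """Yield the rows that pass; tally the rest and report at the end.

    Each row is a (chrom, pos, ref, alt) tuple. Hypothetical helper;
    the rejection rule is an assumption, not raf's documented logic.
    """
    rejected = Counter()
    for row in rows:
        chrom, pos, ref, alt = row
        if len(ref) != 1:
            # Assumed meaning of "a single ref": one nucleotide.
            rejected["ref is not a single nucleotide"] += 1
            continue
        # A missing alt allele (".") does not cause rejection.
        yield row
    # The summary is written only after all rows have been read.
    for reason, count in rejected.items():
        print(f"rejected {count} sites: {reason}", file=log)
```

Because the function is a generator, the summary appears only once the caller has consumed every row, which mirrors a report written at the end of the job.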