Legofit
infers population history from nucleotide site patterns.

Calculate derived allele frequency, daf.

Input file should consist of tab-separated columns:

  1. chromosome
  2. position
  3. reference allele
  4. alternate alleles
  5. ancestral allele
  6. genotype in format "0/1" or "0|1", where 0 represents a copy of the reference allele and 1 a copy of the alternate allele.
  7. and so on, with one additional column for each genotype.

This input can be generated from a vcf file that includes annotations for ancestral alleles. If the ancestral allele is stored in the INFO field "AA", the input for daf can be generated with bcftools as follows:

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/AA[\t%GT]\n' fname.vcf.gz

Output is in 5 columns, separated by whitespace:

  1. chromosome
  2. position of the nucleotide
  3. aa, the ancestral allele
  4. da, the derived allele
  5. daf, derived allele frequency
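
To make the relationship between the input and output columns concrete, here is a minimal Python sketch of the calculation for a single input line. It is not the daf source code. The function name daf_from_line is hypothetical, and the sketch assumes that missing genotype calls (".") and any allele code other than 0 or 1 are simply skipped, which the text above does not specify.

```python
def daf_from_line(line):
    """Return (chrom, pos, aa, da, daf) for one tab-separated input line.

    Returns None if the ancestral allele matches neither ref nor alt, or if
    the line has no usable genotype calls (both are assumptions of this sketch).
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, alt, aa = fields[:5]
    genotypes = fields[5:]
    ref, alt, aa = ref.upper(), alt.upper(), aa.upper()

    # Decide which genotype code (0 = ref, 1 = alt) marks the derived allele.
    if aa == ref:
        derived_code, da = "1", alt
    elif aa == alt:
        derived_code, da = "0", ref
    else:
        return None

    n_derived = 0
    n_total = 0
    for gt in genotypes:
        # Treat phased ("0|1") and unphased ("0/1") genotypes alike.
        for allele in gt.replace("|", "/").split("/"):
            if allele in ("0", "1"):
                n_total += 1
                if allele == derived_code:
                    n_derived += 1
    if n_total == 0:
        return None
    return chrom, int(pos), aa, da, n_derived / n_total
```

For example, daf_from_line("1\t100\tA\tG\tA\t0/1\t1|1") counts three derived copies among four and reports a frequency of 0.75. If the ancestral allele were G instead, the reference allele A would be the derived one and the frequency would be 0.25.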

The input files should include all sites at which derived alleles are present in any of the populations under study. For example, consider an analysis involving modern humans and Neanderthals. The modern human data must include all sites at which Neanderthals carry derived alleles, even if these sites do not vary among modern humans. To accomplish this, it is best to use whole-genome data for all populations.

The input should not contain duplicate nucleotide sites, the chromosomes should be sorted in lexical order, and within each chromosome, the nucleotides should be in numerical order. Otherwise, daf will abort with an error.
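
The ordering rules above can be checked before running daf. The following Python sketch is one way to do so. The function name check_sorted is hypothetical, and the sketch assumes that "lexical order" means plain string comparison of chromosome names, so that "10" sorts before "2".

```python
def check_sorted(sites):
    """Raise ValueError at the first (chrom, pos) pair that breaks the rules.

    sites is an iterable of (chromosome name, integer position) pairs in
    file order.
    """
    prev = None
    for chrom, pos in sites:
        if prev is not None:
            prev_chrom, prev_pos = prev
            if chrom < prev_chrom:
                # String comparison: chromosome names must not decrease.
                raise ValueError(f"chromosomes out of order: {prev_chrom} then {chrom}")
            if chrom == prev_chrom:
                if pos == prev_pos:
                    raise ValueError(f"duplicate site: {chrom}:{pos}")
                if pos < prev_pos:
                    raise ValueError(f"positions out of order: {chrom}:{prev_pos} then {pos}")
        prev = (chrom, pos)
```

Because the comparison is on strings, a file ordered 1, 2, ..., 10 fails this check, whereas 1, 10, 2 passes.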

Sites are rejected unless they have a single ref or ancestral allele. Missing values are allowed for the alt allele. At the end of the job a summary of rejected sites is written to stderr.
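
As an illustration of this filtering, the Python sketch below classifies sites and prints a summary of rejections. It is a guess at the general shape, not a description of what daf does internally. The names reject_reason and summarize, the wording of the messages, and the rule that ref and ancestral alleles must each be one of A, C, G or T are all assumptions. Handling of sites with several alternate alleles is not covered.

```python
import sys
from collections import Counter

NUCLEOTIDES = {"A", "C", "G", "T"}

def reject_reason(ref, alt, aa):
    """Return a short reason string if the site is rejected, else None."""
    if ref.upper() not in NUCLEOTIDES:
        return "ref is not a single nucleotide"
    if aa.upper() not in NUCLEOTIDES:
        return "ancestral is not a single nucleotide"
    # alt is deliberately not checked: a missing value (".") is allowed.
    return None

def summarize(sites, out=sys.stderr):
    """Split sites into kept and rejected, then print rejection counts."""
    counts = Counter()
    kept = []
    for ref, alt, aa in sites:
        reason = reject_reason(ref, alt, aa)
        if reason is None:
            kept.append((ref, alt, aa))
        else:
            counts[reason] += 1
    # Emit the summary once, after all sites have been examined.
    for reason, n in sorted(counts.items()):
        print(f"rejected {n} sites: {reason}", file=out)
    return kept, counts
```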

Note
Ryan Bohlender 10-6-2017 Assuming the file is sorted by position within each chromosome, duplicate sites are always adjacent, so they can be compared line by line. If the current line has a qual value strictly greater than that of the previous line, the current line is kept; otherwise the previous line is kept. To handle runs of more than two duplicates, the retained line is written only when a new position is read, or at EOF.
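
The duplicate-removal rule in this note can be sketched in Python as follows. The function name dedupe_by_qual and the (chrom, pos, qual, line) record layout are hypothetical, since the note does not say how the qual value is stored. The sketch compares each line with the best line retained so far for the same site, which is how it reads the note's handling of multiple duplicates.

```python
def dedupe_by_qual(records):
    """Yield one record per site, keeping the highest qual among duplicates.

    records is an iterable of (chrom, pos, qual, line) tuples, sorted by
    position within each chromosome so that duplicates are adjacent.
    """
    best = None
    for rec in records:
        if best is not None and rec[:2] == best[:2]:
            # Same site: replace only on a strictly greater qual, so ties
            # keep the earlier line.
            if rec[2] > best[2]:
                best = rec
        else:
            # New position: the previous site is complete, so emit it.
            if best is not None:
                yield best
            best = rec
    # End of input: emit the last pending site.
    if best is not None:
        yield best
```

Because output is deferred until the position changes, the function holds only one record in memory at a time, however long the run of duplicates.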