statpop v.0.81beta (20111124)
Sebastian E. Ramos-Onsins, Luca Ferretti and Giacomo Marmorini

Variability Analyses of multiple populations: Calculation and estimation of statistics and neutrality tests.

The application analyses variability in multiple populations using fasta or ms-format files (Hudson, Bioinformatics 2002). The program has multiple options: missing values can be allowed, and IUPAC codes for data from diploid individuals can also be processed. Fst comparisons and permutation tests can be performed among all populations. Optimal tests of neutrality are calculated, but this requires the GSL libraries when compiling the code. The application can be pipelined with ms, or with another simulator that produces the same output format, and calculates the statistics for each iteration. Several output formats are available, from an extended file to a single-line output. This program can be used for ABC analyses pipelined with ms.

Analyses are performed for each population and for the positions (e.g., silent) annotated in a GFF file (optional):

   Calculation of Segregating Sites
   Classification of variants in exclusive, fixed, shared and ancestral
   Estimation of Nucleotide Variability (Watterson, Tajima, Fu and Li, Fay and Wu, Zeng, Achaz) and divergence
   Neutrality Tests (Tajima's D, Fu and Li's D and F, Fay and Wu's H, Zeng's E, Achaz Y, Ramos-Onsins and Rozas' R2)
   Optimal tests given an alternative frequency Spectrum (To_ii, To_00, To_i0, To_Qc, To_Qw, To_Lc)
   Fst (Hudson et al, 1992) and permutation test, for all populations together and for pairwise comparisons
   Gst' (Nei, 1987) and permutation test, for all populations together and for pairwise comparisons

TO COMPILE:

   gcc *.c -lm -o statpop -Wall -O3

OR (IN CASE OF USING OPTIMAL TESTS)

   gcc *.c -lgsl -lgslcblas -lm -o statpop -Wall -DinGSL=1 -O3

Note that you need to install the GSL libraries in the latter case.
http://www.gnu.org/s/gsl/

TO USE:

There are two ways to use the program:

1. Run the application directly: the program will ask you the questions needed to analyse an observed data file.

2. RECOMMENDED: Command line. Include the options that are necessary for the run, and redirect the output to a file ( > file_output.txt ).

Options in command line:

   statpop
      [0:fasta/nbrf, 1:ms]
      [input_file]
      [output: 0:extended,
               1:single line,
               2:single line freq spectrum,
               3:single line joint freq distrib
               4:single line pairwise distribution
               5:single line frequency variant per line
               6:Covariance matrix of single line frequency variant per line under SNM
               7:Covariance matrix of single line frequency variant per line]
      [number of lineages per sequence (1/2)]
      [include unknown positions (1/0)]
      [if ms format: include mask_filename or type -1 (all positions included)]
         (mask_file: 1st row with 'length' weights, next sample rows x lengths: missing 0, ok 1)
      [if fasta/nbrf format: [#permutation] [seed]]
      [if ms format: [ratio_sv] [seed] [length] [niter]]
      [outgroup (1/0)] (if outgroup, the outgroup must be the last population)
      [#_pops] [#samples_pop1] ... [#samples_popN]
      [alternative spectrum to calculate Optimal Tests (1/0)]
      [Alternative Spectrum File (optional): alternative_spectrum for each population (except outg) (average absolute values)
         header plus fr(0,1) fr(0,2) ... fr(0,n-1) theta(0)/nt,fr(1,1) fr(1,2) ... fr(1,n-1) theta(1)/nt...]
      [null spectrum to calculate Advanced Optimal Tests (1/0)]
      [Null Spectrum File (optional): null_spectrum for each population (except outg) (average absolute values)
         header plus fr(0,1) fr(0,2) ... fr(0,n-1) theta(0)/nt,fr(1,1) fr(1,2) ... fr(1,n-1) theta(1)/nt...]
      [GFF_file (optional and only with fasta/nbrf data)]
      [coding,noncoding,synonymous,nonsynonymous,silent,others (whatever is annotated) (if GFF_file defined)]
      [Genetic_Code: Nuclear_Universal,mtDNA_Drosophila,mtDNA_Mammals,Other (if GFF_file defined)]
      [if Other, introduce the code for the 64 triplets in the order UUU, UUC, UUA, UUG ..etc (optional)]

EXAMPLES:

Run statpop (here with the macosx executable) using a FASTA file:

   ./statpop_macosx 0 ../examples/MC1R_PigsOutg_aligned.fas 0 1 0 1000 123456 1 3 48 46 1 0 0 > ../outputs/MC1R_PigsOutg_Total.txt

   ./statpop_macosx 0 ../examples/MC1R_PigsOutg_aligned.fas 0 1 0 1000 123456 1 3 48 46 1 0 0 ../examples/MC1R.gff nonsynonymous Nuclear_Universal > ../outputs/MC1R_PigsOutg_NSyn.txt

Run statpop and calculate Optimal tests:
You must give a file with the expected frequency spectrum for the alternative model (always chosen beforehand, not a posteriori).

   ./statpop_macosx 0 ../examples/MC1R_PigsOutg_aligned.fas 0 1 0 1000 123456 1 3 48 46 1 1 ../examples/MC1R_H1frq.txt 0 ../examples/MC1R.gff nonsynonymous Nuclear_Universal > ../outputs/MC1R_PigsOutg_NSyn_Opttest.txt

Run statpop in a pipeline with ms:
(this requires the ms coalescent simulator (Hudson, Bioinformatics 2002) to be installed on your computer)
In this case, we simulate the nonsynonymous positions of the entire MC1R gene by filtering only the positions included in the mask file.

   ./ms 95 100 -t 10 -I 3 48 46 1 0 -ej 1.0 2 1 -ej 2.0 3 1 | ./statpop_macosx 1 "" 1 1 0 ../outputs/MC1R_PigsOutg_aligned_npop3_nsam95_nonsynonymous_mask.txt 0.75 34645 965 100 1 3 48 46 1 0 0 > ../outputs/MC1R_PigsOutg_NSyn_msrun.txt

Run statpop using an ms-format file. In the second command we filter some interesting columns with "cut":

   ./statpop_macosx 1 ../examples/MC1R_ms_simulation.txt 1 1 0 ../outputs/MC1R_PigsOutg_aligned_npop3_nsam95_nonsynonymous_mask.txt 0.75 34645 965 100 1 3 48 46 1 1 ../examples/MC1R_H1frq.txt 0 > ../outputs/MC1R_PigsOutg_NSyn_msfile.txt

   ./statpop_macosx 1 ../examples/MC1R_ms_simulation.txt 1 1 0 ../outputs/MC1R_PigsOutg_aligned_npop3_nsam95_nonsynonymous_mask.txt 0.75 34645 965 100 1 3 48 46 1 1 ../examples/MC1R_H1frq.txt 0 | cut -f 37,38,53,54,59,60,75,76,79,80,83,84,87,88,93,94,99,100,103,104,109,110,115,116,119,120,121,122,129,130,131,132,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160 > ../outputs/MC1R_PigsOutg_NSyn_cut_msfile.txt
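
SCRIPTING THE COMMAND LINE (illustrative sketch):

Because every option is positional, a misplaced value silently shifts the meaning of all the values after it. The following Python sketch assembles the argument list for fasta/nbrf input in the order given under "Options in command line". It is not part of statpop: the function name statpop_fasta_args and its parameter names are hypothetical, and the default values are simply those used in the FASTA examples above. It also assumes that a null spectrum file, when given, follows its 1/0 flag in the same way as the alternative spectrum file does in the examples.

```python
import shlex


def statpop_fasta_args(input_file, pop_sizes, output=0, lineages=1,
                       include_unknown=0, permutations=1000, seed=123456,
                       outgroup=1, alt_spectrum=None, null_spectrum=None,
                       gff=None):
    """Return the statpop arguments for fasta/nbrf input as a list of strings.

    Hypothetical helper, not shipped with statpop. The argument order
    follows the option list and the FASTA examples in this README.
    """
    # Format flag 0 selects fasta/nbrf; the number of populations is
    # derived from pop_sizes (the outgroup, if any, must come last).
    args = [0, input_file, output, lineages, include_unknown,
            permutations, seed, outgroup, len(pop_sizes), *pop_sizes]

    # Alternative and null spectrum: flag 1 followed by the file name,
    # or flag 0 alone (assumption for the null spectrum, see lead-in).
    for spectrum_file in (alt_spectrum, null_spectrum):
        args += [1, spectrum_file] if spectrum_file else [0]

    # Optional GFF block: (gff_file, site_class, genetic_code).
    if gff:
        args += list(gff)

    return [str(a) for a in args]


if __name__ == "__main__":
    argv = statpop_fasta_args("../examples/MC1R_PigsOutg_aligned.fas",
                              [48, 46, 1])
    # Print a shell-ready command; the executable name is the one
    # used in the examples of this README.
    print("./statpop_macosx " + shlex.join(argv))
```

Called with the MC1R alignment and the population sizes 48, 46 and 1, the helper reproduces the arguments of the first FASTA example. Passing alt_spectrum="../examples/MC1R_H1frq.txt" and gff=("../examples/MC1R.gff", "nonsynonymous", "Nuclear_Universal") reproduces those of the Optimal tests example.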