// Time-stamp: <2006-03-29 18:37:37 zaykind> // written by Dmitri Zaykin, zaykin@statgen.ncsu.edu The mld program performs a shuffling version of the exact conditional tests for different combinations of allelic and genotypic disequilibrium on haploid and diploid data, or their combination [3],[4]. It operates on multiple loci and alleles. At each shuffling step a measure of association, such as the inverse of conditional probability is calculated under the hypothesis of independence for a specific set of genes defined by user. When ran with "?" or "help" specified in the command line, the program prints out the list of words (parameters) that it understands. One of such parameters is "-generate" (tells the program to generate sample input files). Running it as ./mld.exe -generate (or something like "c:\mld.exe -generate" under MS Windows) will write out two sample files into the current directory, "1data.txt" and "1param.txt" "1data.txt" is the file with multilocus data. Rows are individuals, columns are loci. If data are diploid, there are 2 colums per each locus. Missing alleles are denoted by '?', otherwise allele names are arbitrary strings of letters and digits. The following small example shows 3 individuals scored at 4 diploid loci (otherwise, it could be 3 individuals scored at 8 haploid loci; ploidy and other parameters are specified in "param##.dat"). ? Y0 35 8 Y6 Y6 gf3 3gf gf4 gf8 ? Ygf 30 Ygf ? ? YY gfY Y7 Y9 6 gfY 4 Y8 Now, after sample files are generated, mld.x -paramfile=1param.txt will run the analysis. The data file "1data.txt" is listed as the input file in "1param.txt", so that the data will be read from it. Below is how "1param.txt" looks like. Note that square brackets play the role of comments: everything inside them is ignored: [comments are placed in square brackets [and can be nested too, but then space should separate adjacent comments, like here --> ] ] -inp=1data.txt [input data file name] -out=1outp.txt [output file name] -scr=100 [frequency of screen output, 0 = don't, -1 = no header in output] -runs=20000 [number of permutations] -perm=bbbb [permute methods for consequent loci 'b'=break, 'p'=preserve] -stat=Het [statistic for ordering samples (X2, Pr, Het, LR)] -onetail [one-tailed test for heterozygotes deficit; only affects -stat=Het] [-lambda=1 [pow diverg family parameter] ] -every [every combination of loci or all at once] -upto=4 [up to this loci combination (ceiling for "-every")] -ploidy=2 [ -join=1+2+3+4_7+8 [join 1,2,3,4 and 7,8 columns (not loci) of data.] ] The order of commands in the file is not important. When the program is ran without any arguments, it searches for the file with parameters called "params" in the current directory. http://statgen.ncsu.edu/zaykin/mld/ contains two other sample data sets: (1) 4g15asim.txt, 4g15asim.par (data set simulated under the drift) (2) mytilus.txt, mytilus.par (mussel data from the White Sea) Example of how these data might be run: c:\diseq\mld.exe -paramfile=4g15asim.par c:\diseq\mld.exe -paramfile=mytilus.par Now I'll comment on some parameters usage. There are 2 parameters that are required: -ploidy=x and -inp=x; the program will complain if they're omitted. Some parameters have default values: If -runs=x (number of shufflings) is not present, it is set to 3200. This value (B=3200) provides 99% confidence of being within 5% of the p-value based on the complete enumeration. It is obtained as follows: 0.01 >= 2.575 * Sqrt[0.05*(1-0.05) / B] ==> B >= 3149.55 If high precision is wished for "non-significant" p-values as well, (> 0.05), then B becomes about 17000: 0.01 >= 2.575 * Sqrt[0.5*(1-0.5) / B] ==> B >= 16576.6 -perm= parameter should be a string of 'b'-s and 'p'-s at the right side, like -perm=bpbpbbb. 'b' in position 'i' means "break original pairs of alleles at locus 'i' apart"; 'p' at position 'j' means "preserve pairs of alleles at 'j'-s locus". Thus, -perm=bpb means that alleles at 2nd locus are kept in their original pairs, but they are permuted in loci 1, 3. In terms of the hypothesis testing: the null hypothesis here is that the 3-locus genotype can be described as a product of allele frequencies at the loci 1 and 3 times the genotype frequency at the 2nd locus. There are more explanations in [3]. Also, the 1st equation at the top of the p.128 of Bruce Weir's book [2] corresponds to -perm=pp, and the second equation corresponds to -perm=bb. If no -perm is specified among the parameters, 'b'-s are assumed for all loci. During the shuffling, genotype arrays are compared to the original array, and the proportion of "more extreme arrays" is regarded as a p-value. Such comparison can be based on different statistics. While conditional probability as a criterion for comparisons seem to work well, I provide a possibility for a choice. -stat= can be one of X2, LR, Het, Pr. X2 is chi square, LR is the likelihood ratio, Het is heterozygosity excess (H_obs - H_exp) / H_exp, averaged over loci (chapter 4 of [2]), Pr is the conditional probability. If -stat is omitted, -stat=Pr is assumed, unless -lambda is specified instead. -lambda is a parameter for the power divergence family. This family of statistics is indexed by a single real valued parameter [-Infinity -- Infinity]. The family includes chi-square (-lambda=1), likelihood ratio (-lambda=0), Freeman-Tukey (-lambda=-0.5), Gokhale-Kullback (-lambda=-1), Neyman modified chi-squared (-lambda=-2) statistics. Cressie and Read recommend 2/3 for the parameter, that is -lambda=0.66666. -ploidy: use either -ploidy=2 or -ploidy=1. However, it is also possible to have mixed ploidy data: to do so, specify -ploidy=2 and for those loci which are haploid, add an extra column of a dummy allele, and set 'p' in -perm for those loci as well (so the dummy allele does not get shuffled with the real ones). For example, if there are 3 loci, 1st and 3rd are diploid, and 2nd is haploid: a b c1 f g a a c2 g g b b c1 f f the data file should be modified this way: a b c1 x f g a a c2 x g g b b c1 x f f An extra column of 'x', (or some other name) is added. Then -perm parameter could look like this: -perm=bpb [ important that the second letter should be 'p' ] Note that this also provides a way to do "case-control" type of tests: just create a dummy haploid locus holding the group type and put 'p' in -perm in the corresponding place. -infer parameter: if there is missing data, the allele state is randomly assigned with probabilities of the observed frequencies. Otherwise a particular individual is removed from consideration if it has missing alleles at the currently analysed set of loci. -join: Allows to join columns of data. The syntax of this command is better to explain with examples: * haploid loci; treat 2nd and 4th loci as a single unit: -join=2+4 * diploid loci; treat 2nd and 4th loci as a single unit: -join=2+3+4+5 (there are 2 columns per locus in diploid data) * haploid loci, treat 1 with 5 as a single unit, and 3, 4, 6 as a single unit: -join=1+5_3+4+6 Notes on compilation. --------------------- The C++ compiler should support "The Standard Template Library" (STL). sig.cpp might potentially need modification. This file is for catching Control-C under UNIX/Linux. Under Windows Esc key is used instead. It seems that under Irix static void StOnInterrupt(int) should be replaced with static void StOnInterrupt() I haven't tried to see how it works elsewhere; syntax for signal() function differs among operating systems. How to compile: * On UNIX-like systems: % gzip -dc mld.tar.gz | tar -xvf - % cd Mld % make The default compiler (GNU g++), files and dependencies are listed in the makefile. * Under Ms Windows: I have compiled the program under Win95 with Borland C++ 5.0. All *.cpp files should be listed in the project. The executable target is one of: "EasyWin 16bit large model" or "Console 32 bit". Class Library checkbox should be checked on. References: ----------- [1] Cressie, N. and T.R.C. Read. 1984. Multinomial goodness of fit tests. J.Roy.Stat.Soc. B 46:440-464. [2] Weir, B.S. 1996. Genetic Data Analysis II. Sinauer Associates, Sunderland, MA. [3] Zaykin, D., L. Zhivotovsky and B.S. Weir. 1995. Exact Tests for Association Between Alleles at Arbitrary Numbers of Loci. Genetica 96:169-178. [4] Basten, C.J. and M.A. Asmussen, 1997. The exact test for cytonuclear disequilibria. Genetics (in press).