// Time-stamp: <2006-03-29 18:37:37 zaykind>
// written by Dmitri Zaykin, zaykin@statgen.ncsu.edu

The mld program performs a shuffling version of the exact conditional
tests for different combinations of allelic and genotypic
disequilibrium on haploid and diploid data, or their combination
[3],[4]. It operates on multiple loci and alleles. At each shuffling
step a measure of association, such as the inverse of conditional
probability is calculated under the hypothesis of independence for a
specific set of genes defined by user.

When ran with "?" or "help" specified in the command line, the program
prints out the list of words (parameters) that it understands. One of
such parameters is "-generate" (tells the program to generate sample
input files). Running it as

	./mld.exe -generate

(or something like "c:\mld.exe -generate" under MS Windows) 

will write out two sample files into the current directory,
"1data.txt" and "1param.txt"

"1data.txt" is the file with multilocus data. Rows are individuals,
columns are loci. If data are diploid, there are 2 colums per each
locus. Missing alleles are denoted by '?', otherwise allele names are
arbitrary strings of letters and digits. 

The following small example shows 3 individuals scored at 4 diploid
loci (otherwise, it could be 3 individuals scored at 8 haploid
loci; ploidy and other parameters are specified in "param##.dat").

      ?      Y0      35       8      Y6      Y6     gf3    3gf
    gf4     gf8       ?     Ygf      30     Ygf       ?       ?
     YY     gfY      Y7      Y9       6     gfY       4      Y8


Now, after sample files are generated,

        mld.x -paramfile=1param.txt

will run the analysis. The data file "1data.txt" is listed as the
input file in "1param.txt", so that the data will be read from
it. Below is how "1param.txt" looks like. Note that square brackets
play the role of comments: everything inside them is ignored:

  [comments are placed in square brackets [and can be nested too,
  but then space should separate adjacent comments, like here --> ] ]

  -inp=1data.txt [input data file name]
  -out=1outp.txt [output file name]
  -scr=100 [frequency of screen output, 0 = don't, -1 = no header in output]
  -runs=20000 [number of permutations]
  -perm=bbbb [permute methods for consequent loci 'b'=break, 'p'=preserve]
  -stat=Het [statistic for ordering samples (X2, Pr, Het, LR)]
  -onetail [one-tailed test for heterozygotes deficit; only affects -stat=Het]
  [-lambda=1   [pow diverg family parameter] ]
  -every [every combination of loci or all at once]
  -upto=4 [up to this loci combination (ceiling for "-every")]
  -ploidy=2
  [ -join=1+2+3+4_7+8  [join 1,2,3,4 and 7,8 columns (not loci) of data.] ]

The order of commands in the file is not important. When the program
is ran without any arguments, it searches for the file with parameters
called "params" in the current directory.

http://statgen.ncsu.edu/zaykin/mld/ contains two other sample data sets:

   (1) 4g15asim.txt, 4g15asim.par   (data set simulated under the drift)
   (2) mytilus.txt, mytilus.par   (mussel data from the White Sea)

Example of how these data might be run:

	c:\diseq\mld.exe -paramfile=4g15asim.par
	c:\diseq\mld.exe -paramfile=mytilus.par

Now I'll comment on some parameters usage. There are 2 parameters that
are required: -ploidy=x and -inp=x; the program will complain if
they're omitted. 

Some parameters have default values: If -runs=x (number of shufflings)
is not present, it is set to 3200. This value (B=3200) provides 99%
confidence of being within 5% of the p-value based on the complete
enumeration. It is obtained as follows:

   0.01 >= 2.575 * Sqrt[0.05*(1-0.05) / B] ==> B >= 3149.55

If high precision is wished for "non-significant" p-values as well, 
(> 0.05), then B becomes about 17000:

   0.01 >= 2.575 * Sqrt[0.5*(1-0.5) / B] ==> B >= 16576.6


-perm= parameter should be a string of 'b'-s and 'p'-s at the right
side, like -perm=bpbpbbb. 

'b' in position 'i' means "break original pairs of alleles at locus
'i' apart";

'p' at position 'j' means "preserve pairs of alleles at 'j'-s locus".

Thus, -perm=bpb means that alleles at 2nd locus are kept in their
original pairs, but they are permuted in loci 1, 3. In terms of the
hypothesis testing: the null hypothesis here is that the 3-locus
genotype can be described as a product of allele frequencies at the
loci 1 and 3 times the genotype frequency at the 2nd locus. There are
more explanations in [3].  Also, the 1st equation at the top of the
p.128 of Bruce Weir's book [2] corresponds to -perm=pp, and the second
equation corresponds to -perm=bb. If no -perm is specified among the
parameters, 'b'-s are assumed for all loci.

During the shuffling, genotype arrays are compared to the original
array, and the proportion of "more extreme arrays" is regarded as a
p-value. Such comparison can be based on different statistics. While
conditional probability as a criterion for comparisons seem to work
well, I provide a possibility for a choice. -stat= can be one of X2,
LR, Het, Pr. 

X2 is chi square, 

LR is the likelihood ratio, 

Het is heterozygosity excess (H_obs - H_exp) / H_exp, averaged over loci
(chapter 4 of [2]),

Pr is the conditional probability.

If -stat is omitted, -stat=Pr is assumed, unless -lambda is specified
instead.

-lambda is a parameter for the power divergence family. This family of
statistics is indexed by a single real valued parameter 
[-Infinity -- Infinity]. The family includes 
	chi-square (-lambda=1), 
	likelihood ratio (-lambda=0), 
	Freeman-Tukey (-lambda=-0.5), 
	Gokhale-Kullback (-lambda=-1), 
	Neyman modified chi-squared  (-lambda=-2) 
statistics. 

Cressie and Read recommend 2/3 for the parameter, that is
-lambda=0.66666.

-ploidy: use either -ploidy=2 or -ploidy=1. 

However, it is also possible to have mixed ploidy data: to do so,
specify -ploidy=2 and for those loci which are haploid, add an extra
column of a dummy allele, and set 'p' in -perm for those loci as well
(so the dummy allele does not get shuffled with the real ones). For
example, if there are 3 loci, 1st and 3rd are diploid, and 2nd is
haploid:

a b    c1    f g
a a    c2    g g
b b    c1    f f

the data file should be modified this way:

a b    c1 x   f g
a a    c2 x   g g
b b    c1 x   f f

An extra column of 'x', (or some other name) is added. Then -perm
parameter could look like this:

-perm=bpb  [ important that the second letter should be 'p' ]

Note that this also provides a way to do "case-control" type of tests:
just create a dummy haploid locus  holding the group type and put 'p'
in -perm in the corresponding place.

-infer parameter: if there is missing data, the allele state is
randomly assigned with probabilities of the observed
frequencies. Otherwise a particular individual is removed from
consideration if it has missing alleles at the currently analysed set
of loci.

-join: Allows to join columns of data. The syntax of this
command is better to explain with examples:
   * haploid loci; treat 2nd and 4th loci as a single 
unit: -join=2+4
   * diploid loci; treat 2nd and 4th loci as a single 
unit: -join=2+3+4+5 (there are 2 columns per locus in 
diploid data)
   * haploid loci, treat 1 with 5 as a single unit, 
and 3, 4, 6 as a single unit: -join=1+5_3+4+6

   Notes on compilation. 
   ---------------------

The C++ compiler should support "The Standard Template Library"
(STL). 

sig.cpp might potentially need modification. This file is for
catching Control-C under UNIX/Linux. Under Windows Esc key is
used instead. It seems that under Irix

   static void StOnInterrupt(int)

should be replaced with

   static void StOnInterrupt()

I haven't tried to see how it works elsewhere; syntax for 
signal() function differs among operating systems.


How to compile:

* On UNIX-like systems:

	% gzip -dc mld.tar.gz | tar -xvf -
	% cd Mld
	% make

The default compiler (GNU g++), files and dependencies are listed in
the makefile.


* Under Ms Windows:

I have compiled the program under Win95 with Borland C++ 5.0. All 
*.cpp files should be listed in the project. The executable target
is one of: "EasyWin 16bit large model" or "Console 32 bit". Class
Library checkbox should be checked on.


   References:
   -----------

[1] Cressie, N. and T.R.C. Read. 1984. Multinomial goodness of fit
tests. J.Roy.Stat.Soc. B 46:440-464.

[2] Weir, B.S. 1996. Genetic Data Analysis II. Sinauer Associates,
Sunderland, MA.

[3] Zaykin, D., L. Zhivotovsky and B.S. Weir. 1995. Exact Tests for
Association Between Alleles at Arbitrary Numbers of Loci. Genetica
96:169-178.

[4] Basten, C.J. and M.A. Asmussen, 1997. The exact test for
cytonuclear disequilibria. Genetics (in press).