// Time-stamp: <2002-07-09 02:19:49 zaykin>
// written by Dmitri Zaykin, zaykin@statgen.ncsu.edu

Programs here and at ftp://statgen.ncsu.edu/pub/zaykin/htr/
implement the Haplotype Trend Regression (HTR) method from

Zaykin DV, Westfall PH, Young SS, Karnoub MC, Wagner MJ, Ehm
MG. 2002. Testing association of statistically inferred haplotypes
with discrete and continuous traits in samples of unrelated
individuals. Human Heredity 53:79-91.

This software is also referenced in Xu C-F, Lewis K, Cantone CL, Khan
P, Donnelly C, White N, Crocker N, Boyd PR, Zaykin DV, Purvis IJ.
2002. The effectiveness of computational methods in haplotype
prediction. Hum Genet 110:148-156.

The programs in the htr.zip archive (or the files in the "win"
subdirectory) are Cygwin ports of my UNIX code to MS Windows. The files
in "src" are the original UNIX source.

Run "emgi.exe" (fixed number of markers) or "bash emgi.sh" (sliding
window script) for brief usage help.

There are two ways to run the programs: (1) the "fixed" set of markers
mode and (2) the "sliding window" mode.

(1) emgi.exe. A possible command line using the "dat.txt" file:

  emgi.exe 11 .001 x dat.txt 10000 1234 > out.txt

11 -- # of random EM restarts

.001 -- EM convergence precision

x -- missing data label. If this is a character having special meaning
to the system, it should be quoted, e.g. '?'.

dat.txt -- data file. Each individual is represented by one row. The
first column is the phenotypic value, which may be continuous or
binary. Genotypes follow, two columns per marker. Allele names are
arbitrary characters or strings; they are recoded to integers in the
output unless they are integers already.

10000 -- # of shufflings for the empirical p-value. Use 0 to compute
the asymptotic test.

1234 -- random seed

out.txt -- output file
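
The layout and arguments described above can be sketched as follows.
This is only an illustration: the file name example_dat.txt, the data
values, and the shell variable names are made up, and
whitespace-separated columns are an assumption (the delimiter is not
stated above). The last command prints the emgi.exe command line
rather than running it.

```shell
# Made-up data for two markers: phenotype first, then two allele columns
# per marker, so each row has 1 + 2*2 = 5 fields. 'x' marks missing alleles.
cat > example_dat.txt <<'EOF'
1.25 A G 1 2
0.80 A A 2 2
2.10 G G 1 1
1.05 A G x x
EOF

# Layout check: every row needs an odd field count (1 + 2 * markers),
# equal to the field count of the first row.
awk 'NR == 1 { n = NF }
     NF != n || NF % 2 == 0 { printf "line %d: %d fields\n", NR, NF; bad = 1 }
     END { exit bad + 0 }' example_dat.txt && echo "layout OK"

# The six positional arguments, named here for readability only.
restarts=11               # number of random EM restarts
precision=.001            # EM convergence precision
missing='x'               # missing data label; quote it if special, e.g. '?'
datafile=example_dat.txt  # data file written above
shuffles=10000            # shufflings for the empirical p-value; 0 = asymptotic
seed=1234                 # random seed

# Dry run: print the command instead of executing it. To run it for real,
# drop "echo" and append "> out.txt" to capture the output.
echo emgi.exe "$restarts" "$precision" "$missing" "$datafile" "$shuffles" "$seed"
```

Keeping the missing data label in a quoted variable avoids surprises
when the label is a shell wildcard such as '?'.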

(2) The second mode is the sliding window. The program is a shell
script, emgi.sh, which is a wrapper around the emgi.exe binary. Example
using the provided data and header files, with a window size of 3
markers and missing data coded by 'x':

  bash emgi.sh wnddat.txt wndhdr.txt 3 x > wndout.txt

The script is set up to compute asymptotic p-values, for speed. To
compute shuffled ones, replace num_runs=0 with, say, num_runs=10000
in the script.
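
If you would rather not edit the script by hand, the substitution can
be done with sed. The sketch below assumes the assignment sits at the
start of a line, exactly as num_runs=0. The helper name set_num_runs
and the stand-in file are hypothetical; the stand-in only lets the
example run on its own and is not the real emgi.sh.

```shell
# set_num_runs FILE N: print FILE with the num_runs assignment changed to N.
# Assumes the assignment starts at the beginning of a line.
set_num_runs() {
    sed "s/^num_runs=[0-9][0-9]*/num_runs=$2/" "$1"
}

# Stand-in for the real script, only so this sketch is self-contained.
printf 'num_runs=0\necho "runs: $num_runs"\n' > standin_emgi.sh

# Write a modified copy and show the changed line.
set_num_runs standin_emgi.sh 10000 > standin_emgi_shuffled.sh
grep 'num_runs=' standin_emgi_shuffled.sh
```

With the real script this would be something like
"set_num_runs emgi.sh 10000 > emgi_shuffled.sh", leaving the original
emgi.sh untouched. Keep the copy in the same directory as emgi.sh.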


**** Random notes ****

1) On occasion the program produces the following user-unfriendly
message:

  "Numerical Recipes run-time error...
  a or b too big, or MAXIT too small in betacf
  ...now exiting to system..."

This error arises in the p-value computation for the multiple
regression F statistic and indicates low variance of the estimated
haplotypes (or of the phenotype) in the complete-data subset. My
program checks for this, and the p-value should be set to 1 in the
output (you can't get a significant result in such cases). Across
"usual" data sets, only a minor proportion of would-be non-significant
results generates this message.