// Time-stamp: <2002-07-09 02:19:49 zaykin>
// written by Dmitri Zaykin, zaykin@statgen.ncsu.edu

Programs here and at ftp://statgen.ncsu.edu/pub/zaykin/htr/
implement the Haplotype Trend Regression (HTR) method from

  Zaykin DV, Westfall PH, Young SS, Karnoub MC, Wagner MJ, Ehm MG. 2002.
  Testing association of statistically inferred haplotypes with discrete
  and continuous traits in samples of unrelated individuals.
  Human Heredity 53:79-91.

This software is also referenced in

  Xu C-F, Lewis K, Cantone CL, Khan P, Donnelly C, White N, Crocker N,
  Boyd PR, Zaykin DV, Purvis IJ. 2002.
  The effectiveness of computational methods in haplotype prediction.
  Hum Genet 110:148-156.

The programs in the htr.zip archive (or the files in the "win"
subdirectory) are Cygwin ports of my UNIX code to MS Windows.
The files in "src" are the original UNIX source.

Try "emgi.exe" (using a fixed number of markers) or "bash emgi.sh"
(sliding window script) for a short help on usage.

There are two ways to run the programs: (1) the "fixed" set of markers
mode and (2) the "sliding window" mode.

(1) emgi.exe. A possible command line using the "dat.txt" file:

  emgi.exe 11 .001 x dat.txt 10000 1234 > out.txt

  11      -- # of random EM restarts
  .001    -- EM convergence precision
  x       -- missing data label. If this is a character that has a
             special meaning to the system, it should be quoted,
             e.g. '?'.
  dat.txt -- data file. The first column is the phenotypic value
             (continuous or binary). The genotypes follow, two columns
             per marker. Each individual is represented by one row.
             Allele names are arbitrary characters or strings; they
             are recoded to integers in the output unless they are
             integers originally.
  10000   -- # of shufflings for the empirical p-value. Use 0 to
             compute the asymptotic test.
  1234    -- random seed
  out.txt -- output file

(2) The second mode is the sliding window. The program is a shell
script, emgi.sh, which is a wrapper around the emgi.exe binary.
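
As a rough illustration of the data layout from (1), and of what a
window of markers means in terms of columns, here is a minimal Python
sketch. It assumes whitespace-separated columns, a numeric phenotype,
windows that advance one marker at a time, and a sliding window data
file laid out like dat.txt. None of this is confirmed by the programs
themselves, and all function names are made up for the illustration.

```python
def marker_columns(first_marker, window_size):
    """Return 0-based column indices: phenotype plus one window of markers.

    Column 0 is the phenotype; marker k (0-based) is assumed to occupy
    columns 1 + 2k and 2 + 2k (two allele columns per marker).
    """
    start = 1 + 2 * first_marker
    stop = start + 2 * window_size
    return [0] + list(range(start, stop))


def sliding_windows(n_markers, window_size):
    """Yield the column indices of each window (ASSUMED step: one marker)."""
    for first in range(n_markers - window_size + 1):
        yield marker_columns(first, window_size)


def split_row(line, missing="x"):
    """Split one row into (phenotype, [(allele, allele), ...]).

    Alleles equal to the missing data label are returned as None.
    Whitespace-separated fields are an ASSUMPTION of this sketch.
    """
    fields = line.split()
    phenotype = float(fields[0])
    alleles = [None if a == missing else a for a in fields[1:]]
    if len(alleles) % 2 != 0:
        raise ValueError("expected two allele columns per marker")
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return phenotype, genotypes


if __name__ == "__main__":
    # One individual, four markers, third marker missing.
    print(split_row("1.7 A G 2 2 x x C T"))
    # Column sets for windows of 3 markers over 4 markers.
    for cols in sliding_windows(4, 3):
        print(cols)
```

With four markers and a window of 3, this gives two windows: markers
1-3 and markers 2-4, each paired with the phenotype column.
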
Example using the provided data and header files, with a window size
of 3 markers and missing data coded by 'x':

  bash emgi.sh wnddat.txt wndhdr.txt 3 x > wndout.txt

The script is set up to compute asymptotic p-values, for speed.
To compute shuffled ones, replace num_runs=0 with, say,
num_runs=10000 in the script.

**** Random notes ****

1) On occasion the program produces the following user-unfriendly
message:

  "Numerical Recipes run-time error...
  a or b too big, or MAXIT too small in betacf
  ...now exiting to system..."

This error is related to the p-value computation for the multiple
regression F statistic and indicates low variance of the estimated
haplotypes (or of the phenotype) in the complete data subset.
My program checks for this, and the p-value should be set to 1 in the
output (you can't get a significant result in such cases).
Across "usual" data sets, only a minor proportion of would-be
non-significant results generates this message.
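
For readers unfamiliar with shuffled (empirical) p-values, such as
those requested by the 10000 argument of emgi.exe or by num_runs in
emgi.sh, the general idea can be sketched in Python as follows. This
is not the code used by emgi.exe: the statistic here is a plain
squared correlation with a single predictor, standing in for the
multiple regression F statistic, and the function names and the
"hits / shuffles" convention are assumptions of the sketch.

```python
import random


def squared_correlation(y, x):
    """Squared Pearson correlation; returns 0.0 if either side has no variance."""
    n = len(y)
    mean_y = sum(y) / n
    mean_x = sum(x) / n
    sxy = sum((a - mean_y) * (b - mean_x) for a, b in zip(y, x))
    syy = sum((a - mean_y) ** 2 for a in y)
    sxx = sum((b - mean_x) ** 2 for b in x)
    if syy == 0 or sxx == 0:
        return 0.0
    return sxy * sxy / (syy * sxx)


def empirical_p_value(phenotypes, predictor, n_shuffles, seed,
                      statistic=squared_correlation):
    """Fraction of phenotype shufflings whose statistic is >= the observed one."""
    if n_shuffles <= 0:
        raise ValueError("n_shuffles must be positive for an empirical p-value")
    rng = random.Random(seed)          # fixed seed makes the run repeatable
    observed = statistic(phenotypes, predictor)
    shuffled = list(phenotypes)        # shuffle a copy, keep the input intact
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)          # break any phenotype/predictor link
        if statistic(shuffled, predictor) >= observed:
            hits += 1
    return hits / n_shuffles


if __name__ == "__main__":
    x = [float(i) for i in range(10)]
    y = [2.0 * v + 1.0 for v in x]     # phenotype perfectly tied to predictor
    print(empirical_p_value(y, x, n_shuffles=1000, seed=1234))
```

In this sketch a phenotype with no variance yields a statistic of 0
for every shuffling, so the empirical p-value comes out as 1.
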