# Time-stamp: <2006-07-26 19:45:29 zaykind> [written by Dmitri Zaykin]

This directory contains programs to accompany

Dmitri V. Zaykin, Zhaoling Meng, Margaret G. Ehm. 2006. Contrasting
Linkage-Disequilibrium Patterns between Cases and Controls as a Novel
Association-Mapping Method. Am. J. Hum. Genet., 78:737-746.

(1) The scripts are written in R, which can be installed from
http://www.r-project.org/

(2) If you don't have the following packages installed, type at the R
command prompt:

   install.packages("mice")
   install.packages("ellipse")

(3) The data format is as in "testdat-recoded.txt": the phenotype is
in column 1 (0: controls; 1: cases). The remaining columns are SNP
genotypes, one column per SNP. Genotypes should be coded as -1, 0, 1
(AA -> -1; Aa -> 0; aa -> 1; missing value -> NA). It doesn't matter
which of the two alleles is labeled A or a. Missing phenotype values
(in column 1) are not allowed.
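
The coding rules in (3) can be checked before running the analysis. The
following is a minimal sketch, not part of the distributed scripts: the
function name check_ld_input and the toy data frame are hypothetical,
and the data are assumed to be already loaded into a data frame whose
first column is the phenotype.

```r
# Hypothetical helper (not in the distributed scripts): returns a
# character vector describing violations of the format in (3).
check_ld_input <- function(dat) {
  pheno <- dat[[1]]
  geno <- as.matrix(dat[, -1, drop = FALSE])
  problems <- character(0)
  if (anyNA(pheno)) {
    problems <- c(problems, "missing phenotype values in column 1")
  }
  if (!all(pheno[!is.na(pheno)] %in% c(0, 1))) {
    problems <- c(problems, "phenotype values other than 0 and 1")
  }
  if (!all(geno[!is.na(geno)] %in% c(-1, 0, 1))) {
    problems <- c(problems, "genotype codes other than -1, 0, 1, NA")
  }
  problems
}

# Toy data for illustration only (assumed layout: phenotype, then SNPs).
toy <- data.frame(pheno = c(0, 1, 1, 0),
                  snp1  = c(-1, 0, NA, 1),
                  snp2  = c(1, 1, 0, -1))
problems <- check_ld_input(toy)  # empty when the layout is valid
```

An empty result means that no violations were found; each element
otherwise names one kind of problem.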

(4) Getting p-values for the correlation- and Delta-prime-based
statistics:

   source("LD-contrast.r")
   LDcontrast("testdat-recoded.txt", "corr", 1000)
   LDcontrast("testdat-recoded.txt", "dprime", 1000)

(5) Plotting LD based on correlation and Delta-prime:

   source("LD-contrast-Plot.r")
   LDplot("testdat-recoded.txt", "corr")
   LDplot("testdat-recoded.txt", "dprime")

Notes:
------

(N1) The speed of the Dprime-based analysis can be greatly improved if
you have a GNU C++ compiler. On a Linux system, the source can be
compiled simply as

   R CMD SHLIB DprKK.cpp

This only needs to be done once, to create the shared library
"DprKK.so", which must reside in the same directory as the rest of the
scripts. Step (4) above is then modified by sourcing the file
"LD-contrast-C.r" instead of "LD-contrast.r":

   source("LD-contrast-C.r")
   LDcontrast("testdat-recoded.txt", "corr", 1000)
   LDcontrast("testdat-recoded.txt", "dprime", 1000)
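
If the same session should work whether or not the shared library has
been built, the choice of script can be made at run time. This is a
sketch only: pick_ld_script is a hypothetical helper, and it assumes
that the compiled library carries the platform's default extension
(".so" on Linux).

```r
# Hypothetical helper (not in the distributed scripts): choose the
# C-accelerated script only when the compiled library is present.
pick_ld_script <- function(dir = ".") {
  lib <- file.path(dir, paste0("DprKK", .Platform$dynlib.ext))
  if (file.exists(lib)) "LD-contrast-C.r" else "LD-contrast.r"
}

# Usage, from the directory that holds the scripts:
#   source(pick_ld_script())
#   LDcontrast("testdat-recoded.txt", "dprime", 1000)
```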

(N2) Missing data are handled by multiple imputation via polytomous
logistic regression, as implemented in the package MICE, referenced in
the paper. Good performance of polytomous logistic regression was
reported in [1]. The presence of missing values considerably slows the
calculations. The algorithm implemented here is as follows. First, the
mean value of the statistic is computed over multiple imputations of
the original data set (the number of imputations is hardcoded as
MxImp). Then a single imputation is used for each phenotype
permutation during the p-value computation.
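
The two-stage scheme in (N2) can be sketched as follows. This is an
illustration under stated assumptions, not the code in LD-contrast.r:
impute_once is a crude stand-in that fills each NA by sampling from the
observed genotypes of the same SNP (the scripts use MICE instead),
toy_stat is a stand-in contrast of case and control correlation
matrices rather than the Z2 statistic, and the names perm_pvalue,
n_perm and mx_imp, as well as the "+1" in the p-value, are choices made
for this example.

```r
set.seed(1)

# Stand-in imputation: sample each missing genotype from the observed
# values in the same column (NOT the polytomous regression of MICE).
impute_once <- function(geno) {
  for (j in seq_len(ncol(geno))) {
    miss <- is.na(geno[, j])
    if (any(miss)) {
      obs <- geno[!miss, j]
      geno[miss, j] <- obs[sample.int(length(obs), sum(miss), replace = TRUE)]
    }
  }
  geno
}

# Stand-in statistic: squared differences between case and control
# SNP correlation matrices, summed over the upper triangle.
toy_stat <- function(pheno, geno) {
  r_case <- cor(geno[pheno == 1, , drop = FALSE])
  r_ctrl <- cor(geno[pheno == 0, , drop = FALSE])
  sum((r_case - r_ctrl)[upper.tri(r_case)]^2)
}

perm_pvalue <- function(pheno, geno, n_perm = 200, mx_imp = 5) {
  # Stage 1: mean statistic over mx_imp imputations of the original data.
  observed <- mean(replicate(mx_imp, toy_stat(pheno, impute_once(geno))))
  # Stage 2: one fresh imputation per phenotype permutation.
  permuted <- replicate(n_perm, toy_stat(sample(pheno), impute_once(geno)))
  (1 + sum(permuted >= observed)) / (1 + n_perm)
}

# Simulated toy data: 60 controls, 60 cases, 4 SNPs, 20 missing genotypes.
n <- 120
pheno <- rep(c(0, 1), each = n / 2)
geno <- matrix(sample(c(-1, 0, 1), n * 4, replace = TRUE), nrow = n)
geno[sample.int(length(geno), 20)] <- NA
p <- perm_pvalue(pheno, geno)
```

Using one imputation per permutation keeps the cost of the permutation
loop close to that of the complete-data case, which is the trade-off
described above.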

(N3) Only the Z2 statistic is currently fully implemented. For the
principal component-based analysis (Z1 statistic), there is code in
More/ldtstk.r; however, this script has so far been tested only
without missing data.

(N4) SNPs can be recoded to -1,0,1 with the help of a Perl script,
"More/recode_to_-1_0_1.pl", as

   recode_to_-1_0_1.pl x testdat.txt > testdat-recoded.txt

where 'x' denotes the original missing-value code, which is replaced
by NA. A limitation of the script is that alleles are assumed to be
coded as 1 or 2 only (see "More/testdat.txt").
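
For readers who prefer to recode within R, the mapping itself is simple
when alleles are coded 1 or 2: the sum of the two alleles minus 3 gives
-1, 0 or 1. The sketch below is not a port of the Perl script and does
not reproduce the layout of "More/testdat.txt"; recode_pair is a
hypothetical function that assumes the two alleles of each genotype are
available as two separate vectors.

```r
# Hypothetical helper (not in the distributed scripts): recode a pair of
# alleles, each coded "1" or "2", to -1/0/1; `missing` marks missing calls.
recode_pair <- function(a1, a2, missing = "x") {
  a1 <- as.character(a1)
  a2 <- as.character(a2)
  is_missing <- a1 == missing | a2 == missing
  a1[is_missing] <- NA
  a2[is_missing] <- NA
  # 1+1-3 = -1, 1+2-3 = 0, 2+2-3 = 1
  as.integer(a1) + as.integer(a2) - 3L
}

recoded <- recode_pair(c("1", "1", "2", "x"), c("1", "2", "2", "2"))
# recoded is -1, 0, 1, NA
```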


References:
-----------

1. OW Souverein, AH Zwinderman, MWT Tanck. 2006. Multiple Imputation of
Missing Genotype Data for Unrelated Individuals. Annals of Human
Genetics (in press).