GAPIT使用简单介绍

GAPIT是由康奈尔大学的Edward Buckler实验室和Zhiwu Zhang实验室共同开发的，分析方法包括CMLM,gBLUP,Enriched CMLM及SUPER,目前GAPIT现已更新到version2。

选择GAPIT的几个原因：

依托R语言为平台计算，适合大数据处理
操作简单，友善的用户交互界面

安装

GAPIT的使用涉及6个R包，下面进行安装

source("http://www.bioconductor.org/biocLite.R")
biocLite("multtest")
install.packages("gplots")
install.packages("LDheatmap")
install.packages("genetics")
install.packages('EMMREML')
install.packages("scatterplot3d")

安装GAPIT packages

1 2	source("http://zzlab.net/GAPIT/gapit_functions.txt") source("http://zzlab.net/GAPIT/emma.txt")

快速使用

#step 1 设定工作目录，用来保存结果
setwd('/Users/lilibei/Desktop/myGAPIT_result')

#step 2 读取数据（表行数据和基因型数据，此处可使用GAPIT自带的数据集进行测试）
myY <- read.table("myphenotypic.txt", head = TRUE)
myG <- read.table("mygenotype.hmp." , head = FALSE)

#step 3 运行GAPIT（默认PCA为3）
myGAPIT<-GAPIT(Y=myY, G=myG, PCA.total = 3)

数据格式

表型数据
第一列为材料名称，第二列开始为性状名称，缺失值使用NA或者NaN

1 2	#读取命令 myY <- read.table("mdp_traits.txt", head = TRUE)

基因型（以Hapmap格式为例）
一共有11列信息，包括标记名称，染色体，染色体位置等，有几个地方需要特别注意一下，否则读取数据的时候会报错
第一行和第二行中rs与assembly后面不能含有#符号
从12列开始是材料的名称，一定要与表型数据中的名称对应好，否则会报错
染色体名称一定要是数字，不能含有英文字母，在多倍体物种中会存在亚族染色体的名称，如A01,C01这样的染色体名称是错误的，但是 scffold是可以的
染色体要按照顺序排列

1 2	# 查看染色是否按照顺序排列 cut -f 3 haplotype \| uniq> Desktop/chromosome_names.txt

删除第一行的’#’以及将’–’替换为’NN’

#!/usr/bin/python
#coding:utf-8
import re

result = open('haplotype-sort-del#.hmp','w')
with open('haplotypep','r') as hamp:
    for line in hamp:
        p = re.compile(r'#') #delete '#'
        line = p.sub('',line)
        p = re.compile(r'--') #'--' replaced as 'NN'
        line = p.sub('NN',line)
        print >> result,line,
result.close()

若此时染色体排序无误可以将基因型和表行数据导入R进行分析，若染色体名称含有字符串，可构建字典进行替换，运行如下代码

#!/usr/bin/python
#coding:utf-8
import re

def haplotypeChromoSub(haplotype,chromesome_dict,haplotypeSub):
    result = open(haplotypeSub,'a')

    #sort the chromesome_dict 's keys
    keys = chromesome_dict.keys()
    keys.sort()

    #print the head of haplotype
    with open(haplotype) as fd:
        for line in fd:
            if line.startswith('rs#') or line.startswith('rs\t'):
                print >> result,line,
            break

     #sub the chromesome names       
    for chromosome in keys:
        with open(haplotype) as fd:
            for line in fd:
                if line.startswith('rs#') or line.startswith('rs\t'):
                    next
                line_list = line.split()
                if line_list[2] == chromosome:
                    line_list[2] = chromesome_dict[chromosome]
                    line = '\t'.join(line_list)
                    print >> result,line

    #print the scaffolds
    with open(haplotype) as fd:
        for line in fd:
            if re.search(r'scaffold',line):
                print >> result,line,
    result.close()

if __name__ == '__main__':
    chromesome_dict = {'A01':'1','A02':'2','A03':'3','A04':'4','A05':'5','C01':'6','C02':'7','C03':'8','C04':'9','C05':'10'}
    haplotype = '/home/lilibei/database/hapmap_new_except_ref'
    haplotypeSub = 'haphaplotypeSub.hap'
    haplotypeChromoSub(haplotype,chromesome_dict,haplotypeSub)

1 2	#读取命令 myG <- read.table("mdp_genotype_test.hmp.txt", head = FALSE)

Kinship
kinship(亲缘关系矩阵)在GAPIT中称为KI，第一列为材料名称，不存在表头名称，格式如下图

1 2	#读取命令 myKI <- read.table("KSN.txt", head = FALSE)

协变量

协变量的一般包括Q值与P值，在GAPIT中称为CV。即GWAS分析中的群体机构和主成分的协变量值。第一列对应材料的名称，第二列开始为协变量值。具体格式参看下图

按照文件名读取数据

通常重测序后的基因型数据量会很大，使用R读取会非常消耗内存。可以将大的基因型数据按照染色体进行拆分然后再读取。

将每个染色体的文件命名为mdp_genotype_chr1.hmp.txt的格式，前缀的名字为mdp_genotype_chr后缀的名字为hmp.txt使用的参数分别为file.G和file.Ext.G

#Step 1: Set data directory and import files
myY <- read.table("mdp_traits.txt", head = TRUE)

#Step 2: Run GAPIT myGAPIT <- GAPIT( Y=myY ,
PCA.total=3, file.G="mdp_genotype_chr", file.Ext.G="hmp.txt", file.from=1,
file.to=10, file.path="C:\\myGAPIT\\" 
)

运算流程

基本流程
kinship计算使用VanRaden method
聚类算法默认为average
group kinship type为mean

#Step 1: Set data directory and import files
myY <- read.table("mdp_traits.txt", head = TRUE)
myG <- read.table("mdp_genotype_test.hmp.txt", head = FALSE)
#Step 2: Run GAPIT myGAPIT <- GAPIT( Y=myY ,
G=myG,
PCA.total=3 )

增加参数

#Step 1: Set data directory and import files
myY <- read.table("mdp_traits.txt", head = TRUE)
myG <- read.table("mdp_genotype_test.hmp.txt", head = FALSE)
#Step 2: Run GAPIT
myGAPIT <- GAPIT(
Y=myY ,
G=myG,
PCA.total=3,
kinship.cluster=c("average", "complete", "ward"), kinship.group=c("Mean", "Max"), group.from=200,
group.to=1000000, group.by=10
)

导入Kinship Matrix and Covariates和Q值

#Step 1: Set data directory and import files
myY <- read.table("mdp_traits.txt", head = TRUE)
myG <- read.table("mdp_genotype_test.hmp.txt", head = FALSE)
myKI <- read.table("KSN.txt", head = FALSE)
myCV <- read.table("Copy of Q_First_Three_Principal_Components.txt", head = TRUE)
#Step 2: Run GAPIT myGAPIT <- GAPIT( Y=myY ,
G=myG,
KI=myKI, CV=myCV )

基因组预测

这个分析用不到基因型数据，所以使用SNP.test=FALSE这个参数

#Step 1: Set data directory and import files
myY <- read.table("mdp_traits.txt", head = TRUE) myKI <- read.table("KSN.txt", head = FALSE)
#Step 2: Run GAPIT myGAPIT <- GAPIT( Y=myY ,
KI=myKI, PCA.total=3, SNP.test=FALSE
)

多基因型数据导入分析

#Step 1: Set data directory and import files
myY <- read.table("mdp_traits.txt", head = TRUE)
#Step 2: Run GAPIT myGAPIT <- GAPIT( Y=myY ,
PCA.total=3, file.G="mdp_genotype_chr", file.Ext.G="hmp.txt", file.from=1,
file.to=10, file.path="C:\\myGAPIT\\" )

小样本分析
大样本计算PCs和kinship很耗费时间，SNP .fraction这个参数可以减少计算时间

#Step 1: Set data directory and import files
myY <- read.table("mdp_traits.txt", head = TRUE)
#Step 2: Run GAPIT
myGAPIT <- GAPIT(
Y=myY ,
PCA.total=3, file.GD="mdp_numeric", file.GM="mdp_SNP_information", file.Ext.GD="txt", file.Ext.GM="txt",
file.from=1, file.to=3,
SNP .fraction=0.6 )

节省内存

GAPIT默认的file.fragment参数为512，可以降低这个参数来节约内存的使用，提高计算效率。

#Step 1: Set data directory and import files
myY <- read.table("mdp_traits.txt", head = TRUE)
#Step 2: Run GAPIT
myGAPIT <- GAPIT(
Y=myY ,
PCA.total=3, file.GD="mdp_numeric", file.GM="mdp_SNP_information", file.Ext.GD="txt", file.Ext.GM="txt",
file.from=1, file.to=3,
SNP .fraction=0.6, file.fragment = 128 )

Model selection

myY <- read.table("mdp_traits.txt", head = TRUE)
myG <- read.table("mdp_genotype_test.hmp.txt", head = FALSE)
#Step 2: Run GAPIT myGAPIT <- GAPIT( Y=myY ,
G=myG,
PCA.total=3, Model.selection = TRUE )

模型选择

#Step 1: Set data directory and import files
myCV <- read.table("Copy of Q_First_Three_Principal_Components.txt", head = TRUE) myY <- read.table("mdp_traits.txt", head = TRUE)
myG <- read.table("mdp_genotype_test.hmp.txt" , head = FALSE)
#Step 2: Run GAPIT
myGAPIT_SUPER <- GAPIT(
Y=myY[,c(1,2)],
G=myG,
#KI=myKI,
CV=myCV ,
#PCA.total=3,
sangwich.top="MLM", #options are GLM,MLM,CMLM, FaST and SUPER sangwich.bottom="SUPER", #options are GLM,MLM,CMLM, FaST and SUPER LD=0.1,
)

格式转化->Convert HapMap format to numerical

1
2
3

myG <- read.table("mdp_genotype_test.hmp.txt", head = FALSE) myGAPIT <- GAPIT(G=myG, output.numerical=TRUE)
myGD= myGAPIT$GD
myGM= myGAPIT$GM

最后

这里只是对GAPIT的使用做了简单的介绍，具体可以查看官网说明书，还有些功能也不是很明白，欢迎共同探讨。