1. Missing rates (GENO > 0.05)
    Here we apply a stringent criterion: SNPs with a missing-genotype rate above 0.05 (GENO > 0.05) are removed. With 89 samples, 0.05*89 = 4.45, so a SNP that is missing in 5 or more samples will be removed from the dataset.
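The arithmetic above can be checked with a short Python sketch. The function names and the use of `None` to mark a missing genotype call are assumptions made for illustration; they are not part of PLINK.

```python
import math

def min_missing_to_fail(n_samples, geno=0.05):
    """Smallest number of missing calls whose rate exceeds the `geno` threshold."""
    return math.floor(geno * n_samples) + 1

def missing_rate(calls):
    """Fraction of missing calls; None marks a missing genotype (assumed encoding)."""
    return sum(c is None for c in calls) / len(calls)

def passes_geno(calls, geno=0.05):
    """A SNP is kept when its missing rate does not exceed `geno`."""
    return missing_rate(calls) <= geno

# With 89 samples: 4 missing calls (4/89, about 0.045) pass,
# 5 missing calls (5/89, about 0.056) fail.
threshold_count = min_missing_to_fail(89)
```

With 89 samples, `threshold_count` is the number of missing calls at which a SNP starts to be dropped.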

  2. Minor allele frequency (MAF < 0.03; if there are many SNPs, MAF < 0.05 can be used)
    MAF is the Minor Allele Frequency. It can be used to exclude SNPs which are not informative because they show little variation in the sample set being analyzed. For instance, if a SNP shows variation in only 1 of the 89 individuals, it is not useful statistically and should be removed.
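As a rough illustration of why such a SNP is dropped, the sketch below computes MAF from genotype counts. The function name and the count-based input are assumptions for this example, not a PLINK interface.

```python
def minor_allele_frequency(n_hom_a1, n_het, n_hom_a2):
    """MAF from genotype counts at a biallelic SNP (each sample carries 2 alleles)."""
    n_alleles = 2 * (n_hom_a1 + n_het + n_hom_a2)
    if n_alleles == 0:
        raise ValueError("no genotyped samples")
    freq_a1 = (2 * n_hom_a1 + n_het) / n_alleles
    # The minor allele is whichever of the two alleles is rarer.
    return min(freq_a1, 1 - freq_a1)

# One heterozygous carrier among 89 individuals: 1 minor allele out of 178.
maf = minor_allele_frequency(0, 1, 88)
```

Here `maf` is 1/178, roughly 0.0056, which is well below both the 0.03 and the 0.05 cutoffs.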

  3. Removing SNPs out of Hardy-Weinberg equilibrium (p-value < 10^-6 to 10^-4)
    Population genetic theory suggests that under ‘normal’ conditions, there is a predictable relationship between allele frequencies and genotype frequencies. When the observed genotype distribution differs from what one would expect based on the allele frequencies, one potential explanation is genotyping error. Natural selection is another. For this reason, in a case-control study we typically check for deviation from Hardy-Weinberg equilibrium (HWE) in the controls only. For a quantitative trait, PLINK just uses everyone. PLINK computes a p-value for deviation from HWE for each SNP. A low p-value indicates that a SNP is out of HWE, and SNPs with p-values below the chosen threshold are removed.
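To make the relationship between allele and genotype frequencies concrete, here is a minimal sketch of a chi-square goodness-of-fit test against HWE expectations. It is a simplified stand-in: PLINK itself reports an exact test, so its p-values will differ, especially at low genotype counts. The function name is hypothetical.

```python
import math

def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for deviation from HWE at a biallelic SNP."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p                         # frequency of allele B
    # Expected genotype counts under HWE: p^2, 2pq, q^2.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

p_in_hwe = hwe_chi2_pvalue(25, 50, 25)   # counts match the p^2 : 2pq : q^2 proportions
p_out_hwe = hwe_chi2_pvalue(50, 0, 50)   # no heterozygotes at all
```

The first call uses counts that match HWE exactly, and the second uses counts with no heterozygotes, which fall far below a 0.0001 cutoff.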

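  4. If you have a VCF file, you can first convert it to PLINK input format with vcftools, which writes .ped and .map files, and then use those files as input for filtering: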

vcftools --vcf my.vcf --plink --out plink
plink --noweb --file plink --geno 0.05 --maf 0.05 --hwe 0.0001 --make-bed --out QC
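If the thresholds need to be varied, the two commands can be assembled from a small Python wrapper. This is only a sketch: the function names and default values are assumptions for illustration, it uses only the flags shown above, and `run_qc` requires `vcftools` and `plink` to be on the PATH.

```python
import shutil
import subprocess

def build_qc_commands(vcf, prefix="plink", out="QC",
                      geno=0.05, maf=0.05, hwe=0.0001):
    """Return the vcftools conversion and PLINK filtering commands as argument lists."""
    convert = ["vcftools", "--vcf", vcf, "--plink", "--out", prefix]
    qc = ["plink", "--noweb", "--file", prefix,
          "--geno", str(geno), "--maf", str(maf), "--hwe", str(hwe),
          "--make-bed", "--out", out]
    return [convert, qc]

def run_qc(vcf, **kwargs):
    """Run both steps in order, stopping at the first failure."""
    for cmd in build_qc_commands(vcf, **kwargs):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)

# Building the commands does not execute anything.
commands = build_qc_commands("my.vcf")
```

Calling `run_qc("my.vcf", maf=0.03)` would apply the stricter MAF cutoff mentioned in item 2.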


Note: this article is reposted from http://blog.sina.com.cn/s/blog_83f77c940102w2eg.html