Converting PED into VCF

Updated 2015 August 25th: as suggested by Tim, I checked out PLINK 1.9 and found it much simpler to convert PED into VCF. I updated the post with instructions for performing the conversion using PLINK 1.9.

Being late to the game of analysing genomic variants, I only recently discovered that IGV is capable of visualising VCF files; this is great if your variants are in the VCF. However, I have some PLINK files (a PED and MAP file), which I believe are not supported by IGV. After searching on the web, I came across a Python script written by Brad Chapman that seems to be able to convert PED files into VCF files. Since the script has several dependencies, this short post simply documents how and where these are downloaded.

Firstly download the script.

wget https://raw.githubusercontent.com/chapmanb/bcbb/master/nextgen/scripts/plink_to_vcf.py

#make it executable
chmod 755 plink_to_vcf.py

#run the script
plink_to_vcf.py 
Incorrect arguments
Convert Plink ped/map files into VCF format using plink and Plink/SEQ.

Requires:

plink: http://pngu.mgh.harvard.edu/~purcell/plink/
PLINK/SEQ: http://atgu.mgh.harvard.edu/plinkseq/
bx-python: https://bitbucket.org/james_taylor/bx-python/wiki/Home

You also need the genome reference file in 2bit format:
http://genome.ucsc.edu/FAQ/FAQformat.html#format7
using faToTwoBit:
http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/

Usage:
  plink_to_vcf.py <ped file> <map file> <UCSC reference file in 2bit format)

As you can see above, there are several dependencies that this script requires. The bx-python project is a python library and associated set of scripts to allow for rapid implementation of genome scale analyses. It's easy enough to install:

easy_install https://bitbucket.org/james_taylor/bx-python/get/tip.tar.bz2

Next we need PLINK and PLINK/SEQ.

wget http://pngu.mgh.harvard.edu/~purcell/plink/dist/plink-1.07-mac-intel.zip
unzip plink-1.07-mac-intel.zip
wget http://psychgen.u.hpc.mssm.edu/plinkseq_downloads/plinkseq-mac-latest.zip
unzip plinkseq-mac-latest.zip
cp plink-1.07-mac-intel/plink ~/bin
cp plink-1.07-mac-intel/gPLINK.jar ~/bin
cp plinkseq-0.10/* ~/bin

#my PATH
echo $PATH
.:/Users/dtang/bin:/Library/Internet Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/usr/texbin:/usr/texbin

I copy the executables to my personal bin directory, which is in my $PATH. Lastly we need the 2bit files for the reference genomes that we're working with.

#if you're using hg18
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/hg18.2bit

#if you're using hg19
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit

The PLINK folder has a test.ped and test.map file that we can use to test the conversion script.

cat plink-1.07-mac-intel/test.ped 
1 1 0 0 1  1  A A  G T
2 1 0 0 1  1  A C  T G
3 1 0 0 1  1  C C  G G
4 1 0 0 1  2  A C  T T
5 1 0 0 1  2  C C  G T
6 1 0 0 1  2  C C  T T

cat plink-1.07-mac-intel/test.map
1 snp1 0 1
1 snp2 0 2

# the SNPs are on positions 1 and 2 on chromosome 1,
# so I could have specified any 2bit file
plink_to_vcf.py plink-1.07-mac-intel/test.ped plink-1.07-mac-intel/test.map ~/data/ucsc/hg18.2bit

cat test.vcf 
##fileformat=VCFv4.1
##source=pseq
##FILTER=<ID=PASS,Description="Passed variant FILTERs">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  1       2       3       4       5       6
chr1    1       snp1    A       C       .       PASS    .       GT      0/0     0/1     1/1     0/1     1/1     1/1
chr1    2       snp2    G       T       .       PASS    .       GT      0/1     0/1     0/0     1/1     0/1     1/1

If you compare the genotypes in the PED file to the ones in the VCF file, you will see that they correspond to each other.

Using PLINK 1.9

Firstly, download a suitable binary depending on your operating system.

wget https://www.cog-genomics.org/static/bin/plink150805/plink_linux_x86_64.zip
unzip plink_linux_x86_64.zip

The zip file comes with a simple test file.

cat toy.ped 
1 1000000000 0 0 1 1 0 0 1 1
1 1000000001 0 0 1 2 1 1 1 2

cat toy.map 
1       rs0     0       1000
1       rs10    0       1001

We can perform the file conversion using the --recode function.

plink --file toy --recode vcf --out toy_converted
PLINK v1.90b3v 64-bit (15 Jul 2015)        https://www.cog-genomics.org/plink2
(C) 2005-2015 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to toy_converted.log.
Options in effect:
  --file toy
  --out toy_converted
  --recode vcf

1001226 MB RAM detected; reserving 500613 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (2 variants, 2 people).
--file: toy_converted-temporary.bed + toy_converted-temporary.bim +
toy_converted-temporary.fam written.
2 variants loaded from .bim file.
2 people (2 males, 0 females) loaded from .fam.
2 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.75.
2 variants and 2 people pass filters and QC.
Among remaining phenotypes, 1 is a case and 1 is a control.
--recode vcf to toy_converted.vcf ... done.
Warning: At least one VCF allele code violates the official specification;
other tools may not accept the file.  (Valid codes must either start with a
'<', only contain characters in {A,C,G,T,N,a,c,g,t,n}, or represent a
breakend.)

cat toy_converted.vcf
##fileformat=VCFv4.2
##fileDate=20150825
##source=PLINKv1.90
##contig=<ID=1,length=1002>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  1_1000000000    1_1000000001
1       1000    rs0     1       .       .       .       PR      GT      ./.     0/0
1       1001    rs10    1       2       .       .       PR      GT      0/0     0/1

Comparing Python script vs. PLINK 1.9

Here's the resulting VCF file using the test data from PLINK 1.07 and the Python script:

cat test.vcf 
##fileformat=VCFv4.1
##source=pseq
##FILTER=<ID=PASS,Description="Passed variant FILTERs">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  1       2       3       4       5       6
chr1    1       snp1    A       C       .       PASS    .       GT      0/0     0/1     1/1     0/1     1/1     1/1
chr1    2       snp2    G       T       .       PASS    .       GT      0/1     0/1     0/0     1/1     0/1     1/1

Let's convert the same test data using PLINK 1.9:

plink --file test --recode vcf --out test_converted
PLINK v1.90b3v 64-bit (15 Jul 2015)        https://www.cog-genomics.org/plink2
(C) 2005-2015 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to test_converted.log.
Options in effect:
  --file test
  --out test_converted
  --recode vcf

1001226 MB RAM detected; reserving 500613 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (2 variants, 6 people).
--file: test_converted-temporary.bed + test_converted-temporary.bim +
test_converted-temporary.fam written.
2 variants loaded from .bim file.
6 people (6 males, 0 females) loaded from .fam.
6 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 6 founders and 0 nonfounders present.
Calculating allele frequencies... done.
2 variants and 6 people pass filters and QC.
Among remaining phenotypes, 3 are cases and 3 are controls.
--recode vcf to test_converted.vcf ... done.

cat test_converted.vcf
##fileformat=VCFv4.2
##fileDate=20150825
##source=PLINKv1.90
##contig=<ID=1,length=3>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  1_1     2_1     3_1     4_1     5_1     6_1
1       1       snp1    C       A       .       .       PR      GT      1/1     0/1     0/0     0/1     0/0     0/0
1       2       snp2    T       G       .       .       PR      GT      0/1     0/1     1/1     0/0     0/1     0/0

The reference and alternate alleles are switched around but otherwise the resulting VCF files are the same.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
6 comments Add yours
  1. Hi Dave! Thanks for a great blog.

    I highly recommend using the newly updated plink 1.9 https://www.cog-genomics.org/plink2/ for this convertion job. It reads and outputs both vcf and plink format and is extremely fast and memory efficient.

    It will however not conserve phase information in the VCF, but the authors will include this in their final plink 2 release.

    Best,
    Tim Knutsen
    PhD student genomics
    Norway.

  2. For future refernece.

    Avoid renaming the samples with
    `plink –file test –recode vcf –out test_converted –const-fid`

    Plink always sets the minor allele as the ALT allele. This is dangerous if you are working with phased data, but it’s OK in most cases. Plink adds a warning in the header.
    This will be updated in the future plink 2.0 future release.
    See http://blogs.biomedcentral.com/gigablog/2015/02/26/gwas-reloaded-extended-qa-plink1-9-author-chris-chang/
    Tim

      1. I actually have missinformed you above Dave. Sorry.

        To get the IDs right when going from PED to VCF, the correct line is :
        plink –file test –recode vcf-iid –out test_converted

        The other line is when going from VCF to plink-format.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.