VCF to PED

One of the classic bioinformatics problems is converting files from one format into another. In this post, I go through the process of creating a PED and MAP file from a VCF file. All the files described in this post are available in this GitHub repository.

Firstly download the vcf_to_ped_convert.pl script from the 1000 Genomes Project's FTP site.

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/browser/vcf_to_ped_converter/version_1.1/vcf_to_ped_convert.pl
chmod 755 vcf_to_ped_convert.pl

For this post I have a simple VCF file (ex2.vcf.gz) that is bgzipped. This VCF file is missing a lot of metadata but for the purposes of this post, it doesn't matter because the required information is there.

gunzip -c ex2.vcf.gz 
##fileformat=VCFv4.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003 NA00004 NA00005 NA00006 NA00007 NA00008 NA00009 NA00010
1       10000   .       C       T       99      PASS    DP=14   GT      0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0     0/0
2       20000   .       G       A       99      PASS    DP=14   GT      0/1     0/1     0/1     0/1     0/1     0/0     0/0     0/0     0/0     0/0
2       25000   .       G       A       99      PASS    DP=14   GT      0/1     0/1     0/1     0/1     0/1     0/0     0/0     0/0     0/0     0/0
3       30000   .       T       A       99      PASS    DP=11   GT      0/0     0/0     0/0     0/0     0/0     0/1     0/1     0/1     0/1     0/1
4       40000   .       A       G       99      PASS    DP=10   GT      1/1     0/0     1/1     0/0     1/1     0/0     1/1     0/0     1/1     0/0
5       50000   .       T       G       99      PASS    DP=13   GT      0/0     1/1     0/0     1/1     0/0     1/1     0/0     1/1     0/0     1/1
6       60000   .       T       C       99      PASS    DP=13   GT      0/0     0/0     0/0     0/0     0/0     0/1     0/1     0/1     0/1     0/1
7       70000   .       C       G       99      PASS    DP=9    GT      1/1     1/1     1/1     1/1     1/1     1/1     1/1     1/1     1/1     1/1
8       80000   .       C       G       99      PASS    DP=9    GT      1/1     1/1     1/1     1/1     1/1     0/0     0/0     0/0     0/0     0/0

# index with tabix
tabix -p vcf ex2.vcf.gz

The vcf_to_ped_convert.pl script requires a panel file, which is a four columned file as seen below. I have added a phenotype column to the panel file, so that I can add phenotypic information to the PED file after it is created.

cat ex2.panel 
sample  pop     super_pop       gender  phenotype
NA00001 AUS     AUS     male    1
NA00002 AUS     AUS     female  1
NA00003 AUS     AUS     male    1
NA00004 AUS     AUS     female  1
NA00005 AUS     AUS     male    1
NA00006 AUS     AUS     female  2
NA00007 AUS     AUS     male    2
NA00008 AUS     AUS     female  2
NA00009 AUS     AUS     male    2
NA00010 AUS     AUS     female  2

The vcf_to_ped_convert.pl script requires four parameters. The VCF file (-vcf), the panel file (-sample_panel_file), the region of interest (-region), and the population (-population). You can specify regions as ref:start-end or simply ref. If your chromosomes are listed as "chr1" instead of "1", the region needs to be "chr1". I prefer seeing nucleotides instead of numbers, so I also use -base_format letter.

I run the script twice, once for region 1 and once for region 2. There is no output for region 1 because all the individuals are homozygous for the reference base.

# nothing because everyone is homozygous to the reference
perl vcf_to_ped_convert.pl -vcf ex2.vcf.gz -sample_panel_file ex2.panel -region 1 -population AUS

perl vcf_to_ped_convert.pl -vcf ex2.vcf.gz -sample_panel_file ex2.panel -region 2 -population AUS -base_format letter

cat 2.ped
AUS_1   NA00003 0       0       0       0       G A     G A
AUS_2   NA00006 0       0       0       0       G G     G G
AUS_3   NA00010 0       0       0       0       G G     G G
AUS_4   NA00008 0       0       0       0       G G     G G
AUS_5   NA00009 0       0       0       0       G G     G G
AUS_6   NA00001 0       0       0       0       G A     G A
AUS_7   NA00007 0       0       0       0       G G     G G
AUS_8   NA00004 0       0       0       0       G A     G A
AUS_9   NA00002 0       0       0       0       G A     G A
AUS_10  NA00005 0       0       0       0       G A     G A

The first six columns of a PED file are:

  1. Family ID
  2. Individual ID
  3. Paternal ID
  4. Maternal ID
  5. Sex (1=male; 2=female; other=unknown)
  6. Phenotype

Notice that 2.ped has zeros for paternal and maternal ID, sex, and phenotype despite having that information in the panel file. If you look in the vcf_to_ped_convert.pl script, the print_ped() subroutine simply prints zeros for those columns.

print $FILE join("\t", $pedigree, $individual, 0, 0, 0, 0,);

I have written a Perl script to make use of the additional phenotype column I added in the panel file to create a new PED file.

cat add_phenotype.pl 
#!/usr/bin/env perl

use strict;
use warnings;

my $usage = "Usage: <infile.ped> <infile.panel>\n";
my $ped = shift or die $usage;
my $panel = shift or die $usage;

my %phenotype = ();
open(IN, '<', $panel) || die "Could not open $panel: $!\n";
while(<IN>){
   chomp;
   next if /^sample/;
   my ($sample, $pop, $super_pop, $gender, $phenotype) = split(/\t/);
   $phenotype{$sample} = $phenotype;
}
close(IN);

open(IN, '<', $ped) || die "Could not open $ped: $!\n";
while(<IN>){
   chomp;
   my @line = split(/\t/);
   my $indi = $line[1];
   if (exists $phenotype{$indi}){
      $line[5] = $phenotype{$indi};
   } else {
      die "Missing $indi in $panel\n";
   }
   print join("\t", @line), "\n";
}
close(IN);

exit(0);

__END__

Now to run the script.

# you can use http://gnu.wiki/man1/sponge.1.php
# instead of creating a temp file
add_phenotype.pl 2.ped ex2.panel > blah
mv -f blah 2.ped

cat 2.ped 
AUS_1   NA00003 0       0       0       1       G A     G A
AUS_2   NA00006 0       0       0       2       G G     G G
AUS_3   NA00010 0       0       0       2       G G     G G
AUS_4   NA00008 0       0       0       2       G G     G G
AUS_5   NA00009 0       0       0       2       G G     G G
AUS_6   NA00001 0       0       0       1       G A     G A
AUS_7   NA00007 0       0       0       2       G G     G G
AUS_8   NA00004 0       0       0       1       G A     G A
AUS_9   NA00002 0       0       0       1       G A     G A
AUS_10  NA00005 0       0       0       1       G A     G A

The vcf_to_ped_convert.pl script also creates an info file.

cat 2.info
2:20000 20000
2:25000 25000

The info file contains a generic ID (if your variant in the VCF file doesn't contain an ID) and location of the ID. We can use this information to create a MAP file. By default, each line of the MAP file describes a single marker and must contain exactly 4 columns:

  1. chromosome (1-22, X, Y or 0 if unplaced)
  2. rs# or snp identifier
  3. Genetic distance (morgans)
  4. Base-pair position (bp units)

I have written another Perl script to create a MAP file from the INFO file. This script assumes that I can get the reference chromosome from the INFO file and that reference chromosomes start with a digit. You'll need to change it if you want to identify chromosomes that start with a 'chr' and to identify sex and mitochondrial chromosomes.

#!/usr/bin/env perl

use strict;
use warnings;

my $usage = "Usage: <infile.info>\n";
my $infile = shift or die $usage;

my $chr = '';

if ($infile =~ /^(\d+)\_*.*\.info$/){
   $chr = $1;
} else {
   die "Could not work out chromosome from $infile\n";
}

open(IN, '<', $infile) || die "Could not open $infile: $!\n";
while(<IN>){
   chomp;
   my ($id, $bp) = split(/\t/);
   print join("\t", $chr, $id, 0, $bp), "\n";
}
close(IN);

exit(0);

__END__

Running the script.

info_to_map.pl 2.info > 2.map

cat 2.map 
2       2:20000 0       20000
2       2:25000 0       25000

Finally, we can create a BED file from the PED and MAP file and perform a simple association test for each variant using PLINK.

# PED to BED
plink --file 2 --make-bed --out 2

# simple association test
plink --bfile 2 --model --out data --allow-no-sex

cat data.model 
 CHR       SNP   A1   A2     TEST            AFF          UNAFF        CHISQ   DF            P
   2   2:20000    A    G     GENO          0/0/5          0/5/0           NA   NA           NA
   2   2:20000    A    G    TREND           0/10            5/5           10    1     0.001565
   2   2:20000    A    G  ALLELIC           0/10            5/5        6.667    1     0.009823
   2   2:20000    A    G      DOM            0/5            5/0           NA   NA           NA
   2   2:20000    A    G      REC            0/5            0/5           NA   NA           NA
   2   2:25000    A    G     GENO          0/0/5          0/5/0           NA   NA           NA
   2   2:25000    A    G    TREND           0/10            5/5           10    1     0.001565
   2   2:25000    A    G  ALLELIC           0/10            5/5        6.667    1     0.009823
   2   2:25000    A    G      DOM            0/5            5/0           NA   NA           NA
   2   2:25000    A    G      REC            0/5            0/5           NA   NA           NA

Update on blog

I'll try to make an effort to write more posts, even if they are short, like this VCF to PED post. I have been preparing a gigantic post for over six months and I don't think I will ever finish it, so I'll just split it into several smaller posts.

Print Friendly, PDF & Email



Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.