One of the classic bioinformatics problems is converting files from one format into another. In this post, I go through the process of creating a PED and MAP file from a VCF file. All the files described in this post are available in this GitHub repository.
Firstly download the vcf_to_ped_convert.pl script from the 1000 Genomes Project's FTP site.
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/browser/vcf_to_ped_converter/version_1.1/vcf_to_ped_convert.pl chmod 755 vcf_to_ped_convert.pl
For this post I have a simple VCF file (ex2.vcf.gz) that is bgzipped. This VCF file is missing a lot of metadata but for the purposes of this post, it doesn't matter because the required information is there.
gunzip -c ex2.vcf.gz ##fileformat=VCFv4.0 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003 NA00004 NA00005 NA00006 NA00007 NA00008 NA00009 NA00010 1 10000 . C T 99 PASS DP=14 GT 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 2 20000 . G A 99 PASS DP=14 GT 0/1 0/1 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 2 25000 . G A 99 PASS DP=14 GT 0/1 0/1 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 3 30000 . T A 99 PASS DP=11 GT 0/0 0/0 0/0 0/0 0/0 0/1 0/1 0/1 0/1 0/1 4 40000 . A G 99 PASS DP=10 GT 1/1 0/0 1/1 0/0 1/1 0/0 1/1 0/0 1/1 0/0 5 50000 . T G 99 PASS DP=13 GT 0/0 1/1 0/0 1/1 0/0 1/1 0/0 1/1 0/0 1/1 6 60000 . T C 99 PASS DP=13 GT 0/0 0/0 0/0 0/0 0/0 0/1 0/1 0/1 0/1 0/1 7 70000 . C G 99 PASS DP=9 GT 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 1/1 8 80000 . C G 99 PASS DP=9 GT 1/1 1/1 1/1 1/1 1/1 0/0 0/0 0/0 0/0 0/0 # index with tabix tabix -p vcf ex2.vcf.gz
The vcf_to_ped_convert.pl script requires a panel file, which is a four columned file as seen below. I have added a phenotype column to the panel file, so that I can add phenotypic information to the PED file after it is created.
cat ex2.panel sample pop super_pop gender phenotype NA00001 AUS AUS male 1 NA00002 AUS AUS female 1 NA00003 AUS AUS male 1 NA00004 AUS AUS female 1 NA00005 AUS AUS male 1 NA00006 AUS AUS female 2 NA00007 AUS AUS male 2 NA00008 AUS AUS female 2 NA00009 AUS AUS male 2 NA00010 AUS AUS female 2
The vcf_to_ped_convert.pl script requires four parameters. The VCF file (-vcf), the panel file (-sample_panel_file), the region of interest (-region), and the population (-population). You can specify regions as ref:start-end or simply ref. If your chromosomes are listed as "chr1" instead of "1", the region needs to be "chr1". I prefer seeing nucleotides instead of numbers, so I also use -base_format letter.
I run the script twice, once for region 1 and once for region 2. There is no output for region 1 because all the individuals are homozygous for the reference base.
# nothing because everyone is homozygous to the reference perl vcf_to_ped_convert.pl -vcf ex2.vcf.gz -sample_panel_file ex2.panel -region 1 -population AUS perl vcf_to_ped_convert.pl -vcf ex2.vcf.gz -sample_panel_file ex2.panel -region 2 -population AUS -base_format letter cat 2.ped AUS_1 NA00003 0 0 0 0 G A G A AUS_2 NA00006 0 0 0 0 G G G G AUS_3 NA00010 0 0 0 0 G G G G AUS_4 NA00008 0 0 0 0 G G G G AUS_5 NA00009 0 0 0 0 G G G G AUS_6 NA00001 0 0 0 0 G A G A AUS_7 NA00007 0 0 0 0 G G G G AUS_8 NA00004 0 0 0 0 G A G A AUS_9 NA00002 0 0 0 0 G A G A AUS_10 NA00005 0 0 0 0 G A G A
The first six columns of a PED file are:
- Family ID
- Individual ID
- Paternal ID
- Maternal ID
- Sex (1=male; 2=female; other=unknown)
- Phenotype
Notice that 2.ped has zeros for paternal and maternal ID, sex, and phenotype despite having that information in the panel file. If you look in the vcf_to_ped_convert.pl script, the print_ped() subroutine simply prints zeros for those columns.
print $FILE join("\t", $pedigree, $individual, 0, 0, 0, 0,);
I have written a Perl script to make use of the additional phenotype column I added in the panel file to create a new PED file.
cat add_phenotype.pl #!/usr/bin/env perl use strict; use warnings; my $usage = "Usage: <infile.ped> <infile.panel>\n"; my $ped = shift or die $usage; my $panel = shift or die $usage; my %phenotype = (); open(IN, '<', $panel) || die "Could not open $panel: $!\n"; while(<IN>){ chomp; next if /^sample/; my ($sample, $pop, $super_pop, $gender, $phenotype) = split(/\t/); $phenotype{$sample} = $phenotype; } close(IN); open(IN, '<', $ped) || die "Could not open $ped: $!\n"; while(<IN>){ chomp; my @line = split(/\t/); my $indi = $line[1]; if (exists $phenotype{$indi}){ $line[5] = $phenotype{$indi}; } else { die "Missing $indi in $panel\n"; } print join("\t", @line), "\n"; } close(IN); exit(0); __END__
Now to run the script.
# you can use http://gnu.wiki/man1/sponge.1.php # instead of creating a temp file add_phenotype.pl 2.ped ex2.panel > blah mv -f blah 2.ped cat 2.ped AUS_1 NA00003 0 0 0 1 G A G A AUS_2 NA00006 0 0 0 2 G G G G AUS_3 NA00010 0 0 0 2 G G G G AUS_4 NA00008 0 0 0 2 G G G G AUS_5 NA00009 0 0 0 2 G G G G AUS_6 NA00001 0 0 0 1 G A G A AUS_7 NA00007 0 0 0 2 G G G G AUS_8 NA00004 0 0 0 1 G A G A AUS_9 NA00002 0 0 0 1 G A G A AUS_10 NA00005 0 0 0 1 G A G A
The vcf_to_ped_convert.pl script also creates an info file.
cat 2.info 2:20000 20000 2:25000 25000
The info file contains a generic ID (if your variant in the VCF file doesn't contain an ID) and location of the ID. We can use this information to create a MAP file. By default, each line of the MAP file describes a single marker and must contain exactly 4 columns:
- chromosome (1-22, X, Y or 0 if unplaced)
- rs# or snp identifier
- Genetic distance (morgans)
- Base-pair position (bp units)
I have written another Perl script to create a MAP file from the INFO file. This script assumes that I can get the reference chromosome from the INFO file and that reference chromosomes start with a digit. You'll need to change it if you want to identify chromosomes that start with a 'chr' and to identify sex and mitochondrial chromosomes.
#!/usr/bin/env perl use strict; use warnings; my $usage = "Usage: <infile.info>\n"; my $infile = shift or die $usage; my $chr = ''; if ($infile =~ /^(\d+)\_*.*\.info$/){ $chr = $1; } else { die "Could not work out chromosome from $infile\n"; } open(IN, '<', $infile) || die "Could not open $infile: $!\n"; while(<IN>){ chomp; my ($id, $bp) = split(/\t/); print join("\t", $chr, $id, 0, $bp), "\n"; } close(IN); exit(0); __END__
Running the script.
info_to_map.pl 2.info > 2.map cat 2.map 2 2:20000 0 20000 2 2:25000 0 25000
Finally, we can create a BED file from the PED and MAP file and perform a simple association test for each variant using PLINK.
# PED to BED plink --file 2 --make-bed --out 2 # simple association test plink --bfile 2 --model --out data --allow-no-sex cat data.model CHR SNP A1 A2 TEST AFF UNAFF CHISQ DF P 2 2:20000 A G GENO 0/0/5 0/5/0 NA NA NA 2 2:20000 A G TREND 0/10 5/5 10 1 0.001565 2 2:20000 A G ALLELIC 0/10 5/5 6.667 1 0.009823 2 2:20000 A G DOM 0/5 5/0 NA NA NA 2 2:20000 A G REC 0/5 0/5 NA NA NA 2 2:25000 A G GENO 0/0/5 0/5/0 NA NA NA 2 2:25000 A G TREND 0/10 5/5 10 1 0.001565 2 2:25000 A G ALLELIC 0/10 5/5 6.667 1 0.009823 2 2:25000 A G DOM 0/5 5/0 NA NA NA 2 2:25000 A G REC 0/5 0/5 NA NA NA
Update on blog
I'll try to make an effort to write more posts, even if they are short, like this VCF to PED post. I have been preparing a gigantic post for over six months and I don't think I will ever finish it, so I'll just split it into several smaller posts.

This work is licensed under a Creative Commons
Attribution 4.0 International License.