Grepping a list with a list

The grep command-line utility is a commonly used tool for searching plain text files for lines that match a pattern. For example, you could search a gene or SNP ID in a BED/GFF/GTF file to find out its coordinates. In this post, I will demonstrate how you can search for a list of things in another list of things. One use case is that you could subset a BED file with a list of IDs.

I will use the ubuntu:18.04 Docker image for this post for reproducibility. We will need the parallel tool, which can be installed using apt. If you are working on a machine without root (admin) privileges, you can install parallel using Conda. If you are not familiar with Docker and/or Conda, I have previously prepared some introductory material.
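
For the no-root case, GNU Parallel is available from the conda-forge channel; a minimal sketch, assuming Conda is already set up:

conda install -c conda-forge parallel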

To get started, start a Docker container and install parallel.

docker run --rm -it ubuntu:18.04 /bin/bash

apt update
apt install -y parallel

parallel --version
GNU parallel 20161222
Copyright (C) 2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using programs that use GNU Parallel to process data for publication
please cite as described in 'parallel --citation'.

You can run parallel --citation to see the full citation details; after agreeing at its prompt, the notice above is silenced in future runs.

parallel --citation

We’ll generate a small file containing integers from 0 to 9999.

for (( i=0; i<10000; i++ )); do echo $i; done > small.txt

head -3 small.txt
0
1
2

tail -3 small.txt
9997
9998
9999
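
As an aside, the seq utility generates the same file with less typing:

seq 0 9999 > small.txt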

Say we wanted to find out if 9999 exists in small.txt.

grep 9999 small.txt
9999 # line that matched

However, if we search for 999, grep returns every line containing 999 as a substring.

grep 999 small.txt
999
1999
2999
3999
4999
5999
6999
7999
8999
9990
9991
9992
9993
9994
9995
9996
9997
9998
9999

If we only want to match 999 exactly, we can use the -w parameter, which selects only those lines containing matches that form whole words.

grep -w 999 small.txt
999

Two other useful parameters I often use are -B (before) and -A (after), which return the lines before and after each line that matched your pattern.

grep -w -B 2 -A 2 999 small.txt
997
998
999
1000
1001

Now to the main topic: grepping a list with a list. We will create a search list by shuffling small.txt (using the shuf tool) and taking the first 10 elements.

cat small.txt | shuf | head > list.txt

head -3 list.txt
4760
2829
3969
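
head outputs the first 10 lines by default, so our search list contains 10 items; we will use this count as a check shortly.

wc -l list.txt
10 list.txt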

We will use the -f parameter to specify a file containing all our patterns, one per line.

grep -w -f list.txt small.txt
2749
2829
3721
3969
4760
5624
6165
6311
7207
7601
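
Going back to the use case mentioned in the introduction, the same approach can subset a BED file with a list of feature IDs. A minimal sketch, using hypothetical file names my.bed and ids.txt (one ID per line):

grep -w -F -f ids.txt my.bed > subset.bed

One caveat: -w matches an ID anywhere on the line, so an ID that happens to occur in another column would also match; if that is a concern, a column-aware tool such as awk is safer.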

If we are only interested in whether our list of patterns exists, we can use the -c parameter to count the matching lines. If the number returned is the same as the length of your search list, then all items were found. This assumes that the file you are searching against contains no duplicates.

grep -w -c -f list.txt small.txt
10
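
If the count comes back lower than expected, you can list the patterns that were not found by comparing the sorted files. A minimal sketch using comm, which requires both inputs to be sorted:

comm -23 <(sort list.txt) <(sort small.txt)

The -23 flags suppress the lines unique to small.txt and the lines common to both files, leaving only the lines in list.txt that had no match.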

However, searching for a list of patterns in another list slows down as the lists become larger. We will illustrate this with a file of 100 million lines and a search list of 1 million items.

for (( i=0; i<100000000; i++ )); do echo $i; done > big.txt

cat big.txt | shuf | head -1000000 > big_list.txt

time grep -c -w -f big_list.txt big.txt
1000000

real    0m42.133s
user    0m41.088s
sys     0m0.946s

42 seconds is not bad at all, but if you have extra processors available, you can speed up the search by splitting the file you are searching against using the split tool. We will split big.txt into 8 equally sized files using split with the -l parameter, providing the number of lines we want in each split file (12,500,000).

echo $(( 100000000 / 8 ))
12500000

time split -l 12500000 big.txt big.

real    0m3.445s
user    0m0.536s
sys     0m1.237s
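
As a sanity check: split names its outputs with two-letter suffixes by default, so we should now have eight files, big.aa through big.ah, each containing 12,500,000 lines.

wc -l big.aa
12500000 big.aa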

Now we will use parallel to run eight grep commands, one on each of the split files, and use Perl to add up the counts. (If you are not familiar with parallel, I have also written about it.)

time parallel --verbose grep -c -w -f big_list.txt {} ::: big.a* | perl -nle '$n += $_; END { print $n }'
grep -c -w -f big_list.txt big.aa
grep -c -w -f big_list.txt big.ab
grep -c -w -f big_list.txt big.ac
grep -c -w -f big_list.txt big.ad
grep -c -w -f big_list.txt big.ae
grep -c -w -f big_list.txt big.af
grep -c -w -f big_list.txt big.ag
grep -c -w -f big_list.txt big.ah
1000000

real    0m8.029s
user    0m58.168s
sys     0m2.233s

The file splitting and parallel grep took around 11 seconds in total, a nice improvement over the 42 seconds. (Note that the combined CPU time, reported as user, is still around 58 seconds; we are trading extra cores for shorter wall-clock time.)

Summary

The motivation behind this post was that I wanted to check whether a list of dbSNP Reference SNP numbers existed in a very large VCF file, and the search was taking too long. Using parallel in the manner demonstrated in this post made it possible to perform the search in a reasonable amount of time. In addition, I found that using the grep -F parameter helped decrease the search time; this parameter makes grep interpret each pattern as a fixed string. What this means (to my understanding) is that the regular expression engine is no longer used because each pattern is treated as a fixed string instead, which helps speed things up.
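
For reference, adding -F to the big.txt search above is just one extra flag; I have not included timings here since they vary by machine:

time grep -c -w -F -f big_list.txt big.txt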

I wrote a Bash script, split_search.sh (shown below; it is also available as a GitHub Gist), that performs this parallel grep. The script expects parallel to be installed and on your PATH; it also uses various Unix command-line tools that should be present by default on most Linux distributions, unless you are using a stripped-down version of Linux.


#!/usr/bin/env bash

set -euo pipefail

usage() {
   >&2 echo "Usage: $0 [ -l search_list ] [ -f file_to_grep ] [ -n split_num ] [ -p num_threads ]"
   exit 1
}

num_param=4
required_param=$(( num_param * 2 + 1 ))

while getopts ":l:f:n:p:" options; do
   case "${options}" in
      l)
         list=${OPTARG}
         ;;
      f)
         file=${OPTARG}
         ;;
      n)
         num=${OPTARG}
         regex='^[1-9][0-9]*$'
         if [[ ! ${num} =~ ${regex} ]]; then
            usage
         fi
         ;;
      p)
         num_threads=${OPTARG}
         regex='^[1-9][0-9]*$'
         if [[ ! ${num_threads} =~ ${regex} ]]; then
            usage
         fi
         ;;
      :)
         >&2 echo "Error: -${OPTARG} requires an argument."
         exit 1
         ;;
      *)
         usage
         ;;
   esac
done

# all four options (and their arguments) are required
if [[ ${OPTIND} -ne ${required_param} ]]; then
   usage
fi

# check that the input files exist
for check in "${list}" "${file}"; do
   if [[ ! -e ${check} ]]; then
      >&2 echo "${check} does not exist"
      exit 1
   fi
done

# generate the two-letter suffixes used by split (aa, ab, ..., zz)
prefixes=({a..z}{a..z})

# check that the requested number of splits is supported
num_prefix=${#prefixes[@]}
if [[ ${num} -gt ${num_prefix} ]]; then
   >&2 echo "Please enter a number less than ${num_prefix}"
   exit 1
fi

# get the basename of the file to grep, without its extension
base=$(basename -- "${file}")
base="${base%.*}"

# calculate the number of lines per split, rounding up so no lines are left over
total=$(wc -l < "${file}")
lines=$(( (total + num - 1) / num ))

# split the file to search
split -l "${lines}" "${file}" "${base}."

# temporary file containing the commands to run, named with a timestamp
cmd_txt=$(date +%Y%m%d%H%M%N)

# generate the grep commands
#
# -w to prevent partial matches
# -F to interpret each pattern as a fixed string rather than a regular expression
# -c to count the matching lines
#
for (( n = 0; n < num; n++ )); do
   echo "grep -w -c -F -f ${list} ${base}.${prefixes[${n}]}" >> "${cmd_txt}"
done

# run the greps in parallel and sum the counts
parallel -j "${num_threads}" < "${cmd_txt}" | perl -nle '$s += $_; END { print $s }'

# clean up
rm "${cmd_txt}"
for (( n = 0; n < num; n++ )); do
   rm "${base}.${prefixes[${n}]}"
done

exit 0
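
To reproduce the big.txt example with the script, the invocation looks like this (eight splits and eight threads, matching the run above):

./split_search.sh -l big_list.txt -f big.txt -n 8 -p 8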


This work is licensed under a Creative Commons Attribution 4.0 International License.