The grep command-line utility is a commonly used tool for searching plain text files for lines that match a pattern. For example, you could search a gene or SNP ID in a BED/GFF/GTF file to find out its coordinates. In this post, I will demonstrate how you can search for a list of things in another list of things. One use case is that you could subset a BED file with a list of IDs.
I will use the Ubuntu:18:04 Docker image for this post for reproducibility. We will need the parallel tool, which can be installed using apt. If you are working on a machine without root (admin) privileges, you can install parallel using Conda. If you are not familiar with Docker and/or Conda, I have previously prepared some introductory material.
To get started, start a Docker container and install parallel.
docker run --rm -it ubuntu:18.04 /bin/bash apt update apt install -y parallel parallel --version GNU parallel 20161222 Copyright (C) 2007,2008,2009,2010,2011,2012,2013,2014,2015,2016 Ole Tange and Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. GNU parallel comes with no warranty. Web site: http://www.gnu.org/software/parallel When using programs that use GNU Parallel to process data for publication please cite as described in 'parallel --citation'. parallel --citation
We'll generate a small file containing integers from 0 to 9999.
for (( i=0; i<10000; i++ )); do echo $i; done > small.txt head -3 small.txt 0 1 2 tail -3 small.txt 9997 9998 9999
Say we wanted to find out if 9999 exists in small.txt.
grep 9999 small.txt 9999 # line that matched
However if we searched for 999, grep will return all lines matching the 999 pattern.
grep 999 small.txt 999 1999 2999 3999 4999 5999 6999 7999 8999 9990 9991 9992 9993 9994 9995 9996 9997 9998 9999
If we only want to match 999 entirely, use the -w parameter; this selects only those lines containing matches that form whole words.
grep -w 999 small.txt 999
Two other useful parameters I often use are -B (before) and -A (after), which returns lines before and after lines that matched your pattern.
grep -w -B 2 -A 2 999 small.txt 997 998 999 1000 1001
Now to the main topic: grepping a list with a list. We will create a search list by shuffling small.txt (using the shuf tool) and taking the first 10 elements.
cat small.txt | shuf | head > list.txt head -3 list.txt 4760 2829 3969
We will use the -f parameter to specify the file containing all our patterns.
grep -w -f list.txt small.txt 2749 2829 3721 3969 4760 5624 6165 6311 7207 7601
If we are only interested to see if our list of patterns exist, we can use the -c parameter. If the number returned is the same as the length of your search list, then all items were found. This is assuming that the file you are searching against is unique.
grep -w -c -f list.txt small.txt 10
However, searching a list of patterns in another list will slow down when your lists become larger. We will illustrate this with a list of 100 million elements and a search list of 1 million items.
for (( i=0; i<100000000; i++ )); do echo $i; done > big.txt cat big.txt | shuf | head -1000000 > big_list.txt time grep -c -w -f big_list.txt big.txt 1000000 real 0m42.133s user 0m41.088s sys 0m0.946s
42 seconds is not bad at all but if you have extra processors available, you can speed up the search by splitting the file you are searching against using the split tool. We will split big.txt into 8 equally sized files by using split with the -l parameter and providing the number of lines we want in each split file (12,500,000).
echo $(( 100000000 / 8 )) 12500000 time split -l 12500000 big.txt big. real 0m3.445s user 0m0.536s sys 0m1.237s
Now we will use parallel to run 8 parallel grep commands on each of the split files and use Perl to add up the counts. (If you are not familiar with parallel, I have also written about it.)
time parallel --verbose grep -c -w -f big_list.txt {} ::: big.a* | perl -nle '$n += $_; END { print $n }'
grep -c -w -f big_list.txt big.aa
grep -c -w -f big_list.txt big.ab
grep -c -w -f big_list.txt big.ac
grep -c -w -f big_list.txt big.ad
grep -c -w -f big_list.txt big.ae
grep -c -w -f big_list.txt big.af
grep -c -w -f big_list.txt big.ag
grep -c -w -f big_list.txt big.ah
1000000
real 0m8.029s
user 0m58.168s
sys 0m2.233s
The file splitting and parallel grep took around 11 seconds, an improvement over the 42 seconds.
Summary
The motivation behind this post was that I wanted to check whether a list of dbSNP Reference SNP number exists in a very large VCF file and it was taking too long. Using parallel in the manner demonstrated in this post made it possible to perform the search in a reasonable amount of time. In addition, I found that using the grep -F parameter helped decrease the search time; this parameter makes grep interpret each pattern as a list of fixed strings. What this means (to my understanding) is that the regular expression engine is no longer used because the pattern is treated as a fixed string instead and this helped speed things up.
I wrote a Bash script (see below but you should check out the GitHub Gist if you don't want an emoji in your code) that can perform grep in parallel. The script expects that parallel is installed and is in your path; it also uses various Unix command line tools, that should be installed by default on most Linux operating systems unless you are using some stripped down version of Linux.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| #!/usr/bin/env bash | |
| set -euo pipefail | |
| usage() { | |
| >&2 echo "Usage: $0 [ -l search_list ] [ -f file_to_grep ] [ -n split_num ] [ -p num_threads ]" | |
| exit 1 | |
| } | |
| num_param=4 | |
| required_param=$(bc -l<<<${num_param}*2+1) | |
| while getopts ":l:f:n:p:" options; do | |
| case "${options}" in | |
| l) | |
| list=${OPTARG} | |
| ;; | |
| f) | |
| file=${OPTARG} | |
| ;; | |
| n) | |
| num=${OPTARG} | |
| regex='^[1-9][0-9]*$' | |
| if [[ ! ${num} =~ ${regex} ]]; then | |
| usage | |
| fi | |
| ;; | |
| p) | |
| num_threads=${OPTARG} | |
| regex='^[1-9][0-9]*$' | |
| if [[ ! ${num_threads} =~ ${regex} ]]; then | |
| usage | |
| fi | |
| ;; | |
| 🙂 | |
| echo "Error: -${OPTARG} requires an argument." | |
| exit 1 | |
| ;; | |
| *) | |
| usage ;; | |
| esac | |
| done | |
| if [[ ${OPTIND} -ne ${required_param} ]]; then | |
| usage | |
| fi | |
| # check if input files exist | |
| for check in ${list} ${file}; do | |
| if [[ ! -e ${check} ]]; then | |
| >&2 echo ${check} does not exist | |
| exit 1 | |
| fi | |
| done | |
| # generate prefixes | |
| prefixes=({a..z}{a..z}) | |
| # check to see requested number of splits is larger than supported | |
| num_prefix=${#prefixes[@]} | |
| if [[ $num -gt ${num_prefix} ]]; then | |
| >&2 echo Please enter number less than ${num_prefix} | |
| exit 1 | |
| fi | |
| # get basename | |
| base=$(basename -- ${file}) | |
| base="${base%.*}" | |
| # calculate number of lines per split | |
| total=$(cat ${file} | wc -l) | |
| div=$(bc -l<<<${total}/${num}+1) | |
| lines=$(printf %.0f ${div}) | |
| # split file to search | |
| split -l ${lines} ${file} ${base}. | |
| # file containing commands to run | |
| cmd_txt=$(date +%Y%M%d%H%M%N) | |
| # generate commands | |
| # | |
| # -w to prevent partial matches | |
| # -F Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. | |
| # -c count | |
| # | |
| for ((n = 0; n < ${num}; n++)); do | |
| echo "grep -w -c -F -f ${list} ${base}.${prefixes[${n}]}" >> ${cmd_txt} | |
| done | |
| parallel -j ${num_threads} < ${cmd_txt} | perl -nle '$s += $_; END { print $s }' | |
| # clean up | |
| rm ${cmd_txt} | |
| for ((n = 0; n < ${num}; n++)); do | |
| rm ${base}.${prefixes[${n}]} | |
| done | |
| exit 0 |

This work is licensed under a Creative Commons
Attribution 4.0 International License.
Hello,
very useful example, could you please tell me what is this “{} :::”?
Many thanks
Hi Roberto,
that’s the syntax for GNU parallel; please see https://davetang.org/muse/2013/11/18/using-gnu-parallel/ for more information.
Cheers,
Dave
Thanks! I was just looking at that post and I saw
#a basic example of running 3 echos in parallel
parallel echo ::: A B C
many thanks!