# Grepping a list with a list

The grep command-line utility is a commonly used tool for searching plain text files for lines that match a pattern. For example, you could search for a gene or SNP ID in a BED/GFF/GTF file to find its coordinates. In this post, I will demonstrate how you can search for a list of things in another list of things. One use case is subsetting a BED file with a list of IDs.
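As a minimal sketch of that use case, here is how subsetting a BED file with an ID list might look (the file names, coordinates, and gene IDs below are made up for illustration):

```shell
# a tiny fake BED file: chrom, start, end, name
printf 'chr1\t100\t200\tgeneA\nchr1\t300\t400\tgeneB\nchr2\t50\t150\tgeneC\n' > regions.bed

# the IDs we want to keep
printf 'geneA\ngeneC\n' > ids.txt

# -w matches whole words only; -F treats each pattern as a fixed string
grep -w -F -f ids.txt regions.bed
# chr1    100     200     geneA
# chr2    50      150     geneC
```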

I will use the ubuntu:18.04 Docker image for this post for reproducibility. We will need the parallel tool, which can be installed using apt. If you are working on a machine without root (admin) privileges, you can install parallel using Conda. If you are not familiar with Docker and/or Conda, I have previously prepared some introductory material.

To get started, start a Docker container and install parallel.

```shell
docker run --rm -it ubuntu:18.04 /bin/bash

apt update
apt install -y parallel

parallel --version
GNU parallel 20161222
Ole Tange and Free Software Foundation, Inc.
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.

Web site: http://www.gnu.org/software/parallel

When using programs that use GNU Parallel to process data for publication
please cite as described in 'parallel --citation'.

parallel --citation
```

We'll generate a small file containing integers from 0 to 9999.

```shell
for (( i=0; i<10000; i++ )); do echo $i; done > small.txt

head -3 small.txt
0
1
2

tail -3 small.txt
9997
9998
9999
```

Say we wanted to find out whether 9999 exists in small.txt:

```shell
grep 9999 small.txt
9999 # line that matched
```

However, if we searched for 999, grep would return all lines matching the 999 pattern:

```shell
grep 999 small.txt
999
1999
2999
3999
4999
5999
6999
7999
8999
9990
9991
9992
9993
9994
9995
9996
9997
9998
9999
```

If we only want to match 999 entirely, use the -w parameter; this selects only those lines containing matches that form whole words:

```shell
grep -w 999 small.txt
999
```

Two other useful parameters I often use are -B (before) and -A (after), which return lines before and after the lines that matched your pattern:

```shell
grep -w -B 2 -A 2 999 small.txt
997
998
999
1000
1001
```

Now to the main topic: grepping a list with a list. We will create a search list by shuffling small.txt (using the shuf tool) and taking the first 10 elements:

```shell
cat small.txt | shuf | head > list.txt

head -3 list.txt
4760
2829
3969
```

We will use the -f parameter to specify the file containing all our patterns:

```shell
grep -w -f list.txt small.txt
2749
2829
3721
3969
4760
5624
6165
6311
7207
7601
```

If we only want to check whether our list of patterns exists, we can use the -c parameter. If the number returned equals the length of your search list, then all items were found. This assumes that the file you are searching against contains no duplicates.

```shell
grep -w -c -f list.txt small.txt
10
```

However, searching a list of patterns in another list slows down as your lists become larger. We will illustrate this with a file of 100 million lines and a search list of 1 million items:

```shell
for (( i=0; i<100000000; i++ )); do echo $i; done > big.txt

cat big.txt | shuf | head -1000000 > big_list.txt

time grep -c -w -f big_list.txt big.txt
1000000

real    0m42.133s
user    0m41.088s
sys     0m0.946s
```


42 seconds is not bad at all, but if you have extra processors available, you can speed up the search by splitting the file you are searching against using the split tool. We will split big.txt into 8 equally sized files using split's -l parameter, providing the number of lines we want in each split file (12,500,000).

```shell
echo $(( 100000000 / 8 ))
12500000

time split -l 12500000 big.txt big.

real    0m3.445s
user    0m0.536s
sys     0m1.237s
```

Now we will use parallel to run 8 grep commands in parallel, one per split file, and use Perl to add up the counts. (If you are not familiar with parallel, I have also written about it.)

```shell
time parallel --verbose grep -c -w -f big_list.txt {} ::: big.a* | perl -nle '$n += $_; END { print $n }'
grep -c -w -f big_list.txt big.aa
grep -c -w -f big_list.txt big.ab
grep -c -w -f big_list.txt big.ac
grep -c -w -f big_list.txt big.ad
grep -c -w -f big_list.txt big.ae
grep -c -w -f big_list.txt big.af
grep -c -w -f big_list.txt big.ag
grep -c -w -f big_list.txt big.ah
1000000

real    0m8.029s
user    0m58.168s
sys     0m2.233s
```


The file splitting and parallel grep took around 11 seconds, an improvement over the 42 seconds.
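One thing worth convincing yourself of is that splitting doesn't lose or duplicate any lines. Here is a small-scale sanity check (the file names here are made up for the demo):

```shell
# a toy file of 100 lines
seq 0 99 > demo.txt

# split into chunks of 25 lines: demo.aa demo.ab demo.ac demo.ad
split -l 25 demo.txt demo.

# the chunks should add up to the original line count
cat demo.a* | wc -l
# 100
```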

## Summary

The motivation behind this post was that I wanted to check whether a list of dbSNP Reference SNP (rs) numbers exists in a very large VCF file, and the search was taking too long. Using parallel in the manner demonstrated in this post made it possible to perform the search in a reasonable amount of time. In addition, I found that the grep -F parameter helped decrease the search time; this parameter makes grep interpret each pattern as a fixed string rather than a regular expression. To my understanding, this means the regular expression engine is no longer used, which speeds things up.
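Besides speed, -F also changes what matches: without it, regex metacharacters in your patterns are interpreted. A small sketch (the IDs and file names below are hypothetical) where one pattern contains a regex dot:

```shell
# two patterns; "rs1." contains a dot, which is a regex metacharacter
printf 'rs10\nrs1.\n' > snps.txt

# three IDs to search against
printf 'rs10\nrs11\nrs12\n' > vcf_ids.txt

grep -c -f snps.txt vcf_ids.txt
# 3: the dot in "rs1." matches any character, so rs10, rs11 and rs12 all match

grep -c -F -f snps.txt vcf_ids.txt
# 1: only the literal rs10 matches
```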

I wrote a Bash script, split_search.sh (see below; it is also available as a GitHub Gist), that performs grep in parallel. The script expects that parallel is installed and in your path; it also uses various Unix command-line tools that should be installed by default on most Linux operating systems, unless you are using some stripped-down version of Linux.


```shell
#!/usr/bin/env bash

set -euo pipefail

usage() {
    >&2 echo "Usage: $0 [ -l search_list ] [ -f file_to_grep ] [ -n split_num ] [ -p num_threads ]"
    exit 1
}

num_param=4
required_param=$(bc -l <<< "${num_param}*2+1")

while getopts ":l:f:n:p:" options; do
    case "${options}" in
        l)
            list=${OPTARG}
            ;;
        f)
            file=${OPTARG}
            ;;
        n)
            num=${OPTARG}
            regex='^[1-9][0-9]*$'
            if [[ ! ${num} =~ ${regex} ]]; then
                usage
            fi
            ;;
        p)
            num_threads=${OPTARG}
            regex='^[1-9][0-9]*$'
            if [[ ! ${num_threads} =~ ${regex} ]]; then
                usage
            fi
            ;;
        :)
            echo "Error: -${OPTARG} requires an argument."
            exit 1
            ;;
        *)
            usage
            ;;
    esac
done

if [[ ${OPTIND} -ne ${required_param} ]]; then
    usage
fi

# check if input files exist
for check in ${list} ${file}; do
    if [[ ! -e ${check} ]]; then
        >&2 echo ${check} does not exist
        exit 1
    fi
done

# generate prefixes matching split's default two-letter suffixes
prefixes=({a..z}{a..z})

# check that the requested number of splits is not larger than supported
num_prefix=${#prefixes[@]}
if [[ ${num} -gt ${num_prefix} ]]; then
    >&2 echo Please enter number less than ${num_prefix}
    exit 1
fi

# get basename
base=$(basename -- ${file})
base="${base%.*}"

# calculate number of lines per split
total=$(cat ${file} | wc -l)
div=$(bc -l <<< "${total}/${num}+1")
lines=$(printf %.0f ${div})

# split the file to search
split -l ${lines} ${file} ${base}.

# timestamped file containing the commands to run
cmd_txt=$(date +%Y%m%d%H%M%N)

# generate commands
#
# -w to prevent partial matches
# -F to interpret each pattern as a fixed string, not a regex
# -c to count matching lines
#
for (( n = 0; n < ${num}; n++ )); do
    echo "grep -w -c -F -f ${list} ${base}.${prefixes[${n}]}" >> ${cmd_txt}
done

parallel -j ${num_threads} < ${cmd_txt} | perl -nle '$s += $_; END { print $s }'

# clean up
rm ${cmd_txt}
for (( n = 0; n < ${num}; n++ )); do
    rm ${base}.${prefixes[${n}]}
done

exit 0
```
