Last updated: 2023/03/07
I'm not really a fan of PowerPoint, but it's ubiquitous in research, so I have to work with it. Sometimes I need to find a slide amongst a pile of PowerPoint files, and I waste a lot of time opening and closing files. I wondered whether I could grep PowerPoint files and, sure enough, it is possible. (For those unfamiliar with grep or "grepping", it's a command-line utility for searching plain text files.)
A PowerPoint file is a binary file (a .pptx is a ZIP archive made up mostly of XML files), so you can't grep it directly. What you can do is extract the files from a PowerPoint file and run grep on those. I used R Markdown and knitr to create an example PowerPoint file that has ten slides. Each slide contains its slide number and whether that number is odd or even.
Let's download this file and unzip it.
wget https://github.com/davetang/muse/raw/main/code/example_powerpoint.pptx
unzip example_powerpoint.pptx
tree --charset ascii -L 2 .
# .
# |-- [Content_Types].xml
# |-- docProps
# |   |-- app.xml
# |   |-- core.xml
# |   `-- custom.xml
# |-- example_powerpoint.pptx
# |-- ppt
# |   |-- presentation.xml
# |   |-- presProps.xml
# |   |-- _rels
# |   |-- slideLayouts
# |   |-- slideMasters
# |   |-- slides
# |   |-- tableStyles.xml
# |   |-- theme
# |   `-- viewProps.xml
# `-- _rels
#
# 8 directories, 9 files
In the directory ppt/slides are XML files containing information on each slide, one file per slide. These are the files that we run grep on to find our missing slide. Say we are looking for a slide with the keyword "Seven" on it; we can run the following command. (The -l option prints only the name of each input file that contains a match.)
grep -l Seven ppt/slides/*.xml
# ppt/slides/slide7.xml
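If you would rather not leave the extracted files lying around, the same search can be run against the archive itself. This is only a sketch: it assumes Info-ZIP's unzip, where -Z1 lists member names and -p writes a member to standard output, and the function name slides_matching is made up for this example.

```shell
# Hypothetical helper (not part of search_pptx.sh): print the slide XML
# members of a .pptx that contain a pattern, without extracting to disk.
# Assumes Info-ZIP unzip: -Z1 lists member names, -p prints a member to stdout.
slides_matching() {
   pptx=$1
   pattern=$2
   unzip -Z1 "$pptx" 'ppt/slides/*.xml' | while IFS= read -r slide; do
      # grep reads the whole stream; only its exit status is used
      if unzip -p "$pptx" "$slide" < /dev/null | grep -- "$pattern" > /dev/null; then
         echo "$slide"
      fi
   done
}
```

Calling slides_matching example_powerpoint.pptx Seven should report the same slide as the grep command above. Slides are listed in the order they are stored in the archive, which is not necessarily slide order.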
I wrote a simple Bash script called search_pptx.sh that makes it easier to grep PowerPoint files. You simply provide the path to the PowerPoint file and a single search term, and the script will indicate whether the search term was found and, if so, on which slides.
./search_pptx.sh
# Usage: ./search_pptx.sh
# [ -i | --ignore-case ]
# [ -v | --verbose ]
# [ -t dir | --tmp dir ]
# <in.pptx> <search_term>
./search_pptx.sh example_powerpoint.pptx Seven
# Seven was found in example_powerpoint.pptx on slide7
./search_pptx.sh example_powerpoint.pptx blah
# blah was not found in example_powerpoint.pptx
./search_pptx.sh example_powerpoint.pptx Even
# Even was found in example_powerpoint.pptx on slide10
# Even was found in example_powerpoint.pptx on slide2
# Even was found in example_powerpoint.pptx on slide4
# Even was found in example_powerpoint.pptx on slide6
# Even was found in example_powerpoint.pptx on slide8
The script has options to ignore case, to print more verbose output, and to change the default temporary directory (which is used for unzipping files). The ignore-case option is probably the most useful; just add -i. Slide 7 contains "Seven", so "Even" also matches slide 7 when we ignore case.
./search_pptx.sh -i example_powerpoint.pptx Even
# Even was found in example_powerpoint.pptx on slide10
# Even was found in example_powerpoint.pptx on slide2
# Even was found in example_powerpoint.pptx on slide4
# Even was found in example_powerpoint.pptx on slide6
# Even was found in example_powerpoint.pptx on slide7
# Even was found in example_powerpoint.pptx on slide8
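Because grep matches substrings, a case-insensitive search for "Even" also hits "Seven". If that is not what you want, grep's -w option restricts matches to whole words. The following is a sketch to run against the unzipped ppt/slides directory from earlier; it is not a feature of search_pptx.sh, and the function name is hypothetical.

```shell
# Hypothetical helper: list slide files containing a term as a whole word,
# ignoring case. -w requires the match to be delimited by non-word
# characters, so "even" no longer matches inside "Seven".
slides_with_word() {
   term=$1
   dir=${2:-ppt/slides}
   grep -liw -- "$term" "$dir"/*.xml
}
```

The surrounding XML tags count as word boundaries, since < and > are not word characters, so text such as <a:t>Even</a:t> still matches.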
Use search_pptx.sh in a for loop or with GNU Parallel, and you can quickly search through many PowerPoint files.
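As a rough illustration of the loop, the function below runs the script on every .pptx file in a directory. The SEARCH_PPTX variable and the function name are conveniences invented here; the argument order (file, then search term) follows the usage message shown earlier.

```shell
# Hypothetical wrapper: run search_pptx.sh on every .pptx file in a directory.
# SEARCH_PPTX (an assumption of this sketch) points at the script.
search_all_pptx() {
   term=$1
   dir=${2:-.}
   script=${SEARCH_PPTX:-./search_pptx.sh}
   for f in "$dir"/*.pptx; do
      # Skip the literal pattern when no .pptx files are present
      [ -e "$f" ] || continue
      "$script" "$f" "$term"
   done
}
```

With GNU Parallel, the equivalent would be something along the lines of: parallel ./search_pptx.sh {} Seven ::: *.pptx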
However, as you can imagine, this way of grepping PowerPoint files is only useful when you have a distinguishing keyword. One could extend this to include a proper XML parser, to search for consecutive words, e.g. "Single cells" (feature added), and to search the other files that are extracted from a PowerPoint file.
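One reason a plain grep can miss a phrase is that PowerPoint may split a line of text into several runs, each stored in its own <a:t> element. As a crude stand-in for a proper XML parser, the sketch below pulls the contents of the <a:t> elements out of one slide file and joins them with single spaces, so that the result can be searched for consecutive words. It is only an approximation. It assumes that no <a:t> element spans several lines or carries attributes, it leaves XML entities such as &amp;amp; undecoded, and a word that is split across two runs will end up with a space in the middle. The function name is hypothetical.

```shell
# Hypothetical helper: print the visible text of one slide XML file on a
# single line. Assumes the text lives in <a:t>...</a:t> elements.
slide_text() {
   grep -o '<a:t>[^<]*</a:t>' "$1" |
      sed -e 's/<[^>]*>//g' |
      tr '\n' ' ' |
      tr -s ' '
   echo
}
```

A phrase search on a single slide could then look like: slide_text ppt/slides/slide7.xml | grep -i 'single cells'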
This work is licensed under a Creative Commons Attribution 4.0 International License.
Thanks for writing this, Dave. I had a load of .pptx files with spaces in the filenames, so I got this revision working that handles them:
#!/usr/bin/env bash
# https://davetang.org/muse/2023/03/01/grepping-powerpoint-files/
# Now supports spaces in filenames, grep-style syntax, and color-coded output for matches
set -euo pipefail
ignorecase=0
verbose=0
tmp=/tmp
usage(){
   >&2 cat << EOF
Usage: $0
   [ -i | --ignore-case ]
   [ -v | --verbose ]
   [ -t dir | --tmp dir ]
   <search_term> <in.pptx> [<in.pptx> ...]
EOF
   exit 1
}
check_depend (){
   tool=$1
   if [[ ! -x $(command -v "${tool}") ]]; then
      >&2 echo "Could not find ${tool}"
      exit 1
   fi
}
now(){
   date '+%Y/%m/%d %H:%M:%S'
}
# Under "set -e" a failed assignment would exit before any check of $?,
# so test getopt directly
if ! args=$(getopt -a -o ivht: --long ignore-case,verbose,help,tmp: -- "$@"); then
   usage
fi
eval set -- "${args}"
while :
do
   case $1 in
      -i | --ignore-case) ignorecase=1 ; shift ;;
      -v | --verbose) verbose=1 ; shift ;;
      -h | --help) usage ; shift ;;
      -t | --tmp) tmp=$2 ; shift 2 ;;
      # -- means the end of the arguments; drop this, and break out of the while loop
      --) shift; break ;;
      *) >&2 echo "Unsupported option: $1"
         usage ;;
   esac
done
if [[ $# -lt 2 ]]; then
   usage
fi
word=$1
# Keep the file names in an array so that names with spaces stay intact
pptx=("${@:2}")
dependencies=(unzip)
for tool in "${dependencies[@]}"; do
   check_depend "${tool}"
done
old_pwd=$PWD
for ppfile in "${pptx[@]}"; do
   cd "${old_pwd}"
   # Restart the timer for each file
   SECONDS=0
   [[ ${verbose} -gt 0 ]] && >&2 printf "[ %s ] Searching %s\n" "$(now)" "$(basename "${ppfile}")"
   rand=$$$RANDOM
   tmpdir=${tmp}/${rand}
   if [[ -d ${tmpdir} ]]; then
      >&2 echo "${tmpdir} already exists!"
      exit 1
   else
      mkdir "${tmpdir}"
   fi
   cp "${ppfile}" "${tmpdir}"
   cd "${tmpdir}" && unzip -q "$(basename "${ppfile}")"
   if [[ ! -d ppt/slides/ ]]; then
      >&2 echo No slides found
      exit 1
   fi
   if [[ ${ignorecase} -eq 1 ]]; then
      search=$(grep -li "${word}" ppt/slides/*.xml || true)
   else
      search=$(grep -l "${word}" ppt/slides/*.xml || true)
   fi
   if [[ -z ${search} ]]; then
      echo "${word} not found in ${ppfile}"
   else
      RED='\033[0;31m'
      NC='\033[0m' # No Color
      # Slide paths contain no whitespace, so word splitting is safe here
      for s in ${search}; do
         slide=$(basename "${s}" .xml)
         echo -e "${RED}${word} was found in ${ppfile} on ${slide}${NC}"
      done
   fi
   # Leave the temporary directory before removing it
   cd "${old_pwd}" && rm -rf "${tmpdir}"
   duration=$SECONDS
   [[ ${verbose} -gt 0 ]] && >&2 printf "[ %s ] Completed.\n" "$(now)"
   [[ ${verbose} -gt 0 ]] && >&2 echo -e "$((duration / 60)) minutes and $((duration % 60)) seconds elapsed.\n"
done
exit 0
Thanks for sharing!