# Grepping PowerPoint files

Last updated: 2023/03/07

I'm not really a fan of PowerPoint, but it's ubiquitous in research, so I have to work with PowerPoint files. Sometimes I need to find one slide among a pile of PowerPoint files, and I waste a lot of time opening and closing them. I wondered whether I could grep PowerPoint files and, sure enough, it is possible. (For those unfamiliar with grep or "grepping", grep is a command-line utility for searching plain text files.)

A PowerPoint (.pptx) file is a ZIP archive rather than plain text, so you can't grep it directly. What you can do is extract the files inside it and run grep on those. I used R Markdown and knitr to create an example PowerPoint file with ten slides. Each slide states its slide number and whether that number is odd or even.

wget https://github.com/davetang/muse/raw/main/code/example_powerpoint.pptx
unzip example_powerpoint.pptx

tree --charset ascii -L 2 .
# .
# |-- [Content_Types].xml
# |-- docProps
# |   |-- app.xml
# |   |-- core.xml
# |   `-- custom.xml
# |-- example_powerpoint.pptx
# |-- ppt
# |   |-- presentation.xml
# |   |-- presProps.xml
# |   |-- _rels
# |   |-- slideLayouts
# |   |-- slideMasters
# |   |-- slides
# |   |-- tableStyles.xml
# |   |-- theme
# |   `-- viewProps.xml
# `-- _rels
#
# 8 directories, 9 files

The directory ppt/slides contains one XML file per slide. These are the files we run grep on to find our missing slide. Say we are looking for a slide with the keyword "Seven" on it; we can run the command below. (The -l option prints only the name of each input file that contains a match.)

grep -l Seven ppt/slides/*.xml
# ppt/slides/slide7.xml
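
With more than nine slides, the glob order puts slide10.xml before slide2.xml, which makes longer listings harder to read. The sketch below wraps the same grep call and prints the matching slide numbers in numeric order. It assumes the extracted files follow the slideN.xml naming shown above; `grep_slides` is a hypothetical helper name and is not part of any existing script.

```shell
# grep_slides TERM [DIR]: print the numbers of the slides whose XML contains
# TERM, sorted numerically. Hypothetical helper; assumes files in DIR are
# named slideN.xml, as in an unzipped .pptx file.
grep_slides() {
  local term="$1"
  local dir="${2:-ppt/slides}"
  # -l prints only the names of matching files; "--" ends option parsing so
  # a TERM that starts with "-" is not read as an option.
  grep -l -- "$term" "$dir"/slide*.xml 2>/dev/null |
    sed -E 's|.*/slide([0-9]+)\.xml$|\1|' |   # keep only the slide number
    sort -n                                    # 2 before 10
}
```

For example, `grep_slides Seven` run from the directory where the file was unzipped would search ppt/slides by default.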

I wrote a simple Bash script called search_pptx.sh that makes it easier to grep PowerPoint files. You simply provide the path to the PowerPoint file and a single search term, and the script reports the slides on which the search term was found.

./search_pptx.sh
# Usage: ./search_pptx.sh
#    [ -i     | --ignore-case ]
#    [ -v     | --verbose ]
#    [ -t dir | --tmp dir ]
#    <in.pptx> <search_term>

./search_pptx.sh example_powerpoint.pptx Seven
# Seven was found in example_powerpoint.pptx on slide7

./search_pptx.sh example_powerpoint.pptx blah

./search_pptx.sh example_powerpoint.pptx Even
# Even was found in example_powerpoint.pptx on slide10
# Even was found in example_powerpoint.pptx on slide2
# Even was found in example_powerpoint.pptx on slide4
# Even was found in example_powerpoint.pptx on slide6
# Even was found in example_powerpoint.pptx on slide8

The script has options to ignore case, to produce more verbose output, and to change the default temporary directory (which is used to unzip files). The ignore-case option is probably the most useful; just add -i. Slide 7 contains "Seven", which includes "even", so "Even" also matches slide 7 when we ignore case.

./search_pptx.sh -i example_powerpoint.pptx Even
# Even was found in example_powerpoint.pptx on slide10
# Even was found in example_powerpoint.pptx on slide2
# Even was found in example_powerpoint.pptx on slide4
# Even was found in example_powerpoint.pptx on slide6
# Even was found in example_powerpoint.pptx on slide7
# Even was found in example_powerpoint.pptx on slide8
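
If the substring match on "Seven" is unwanted, plain grep can combine -i with -w, which only accepts matches that form a whole word. The sketch below runs grep directly on the extracted slide files, as in the earlier example. It is not an option of search_pptx.sh as far as the usage text above shows, and `grep_slides_word` is a hypothetical helper name. The -w option is available in GNU and BSD grep.

```shell
# grep_slides_word TERM [DIR]: list the slide files that contain TERM as a
# whole word, ignoring case. Hypothetical helper built on plain grep.
grep_slides_word() {
  local term="$1"
  local dir="${2:-ppt/slides}"
  # -i ignore case, -l print matching file names only,
  # -w require the match to be bounded by non-word characters
  grep -ilw -- "$term" "$dir"/slide*.xml
}
```

Because the text in the slide XML is surrounded by tags, "Even" between `>` and `<` still counts as a whole word, while the "even" inside "Seven" does not.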

Use search_pptx.sh in a `for` loop or with GNU Parallel, and you can quickly search through many PowerPoint files.
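
A minimal sketch of the loop approach follows. It assumes the script is called as `<in.pptx> <search_term>`, per its usage text; the function name `search_all` and the `SEARCH_PPTX` variable (which defaults to ./search_pptx.sh) are hypothetical and only there to make the example self-contained.

```shell
# search_all TERM [DIR]: run the search script on every .pptx file in DIR.
# SEARCH_PPTX is a hypothetical override for the script location.
search_all() {
  local term="$1"
  local dir="${2:-.}"
  local script="${SEARCH_PPTX:-./search_pptx.sh}"
  local pptx
  for pptx in "$dir"/*.pptx; do
    # When no file matches, the loop sees the literal pattern; skip it.
    [ -e "$pptx" ] || continue
    # Quoting keeps file names with spaces intact; stdin is detached so the
    # script cannot consume input meant for the caller.
    "$script" "$pptx" "$term" < /dev/null
  done
}
```

With GNU Parallel, the equivalent one-liner would be along the lines of `parallel ./search_pptx.sh {} Seven ::: *.pptx`, where `{}` is replaced by each file name in turn.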

However, as you can imagine, this way of grepping PowerPoint files is only useful when you have a distinguishing keyword. The script could be extended to use a proper XML parser, to search for consecutive words such as "Single cells" (this feature has since been added), and to search the other files that are extracted from a PowerPoint file.
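
One reason a search for consecutive words can miss is that the visible text in a slide's XML sits in `<a:t>` elements, and a phrase may be split across several of them. The rough sketch below pulls the text out of those elements and joins it onto one line so that a phrase can be grepped. It is a stand-in for a proper XML parser, not a replacement: it assumes plain `<a:t>` tags without attributes, does not decode entities such as `&amp;`, and `slide_text` is a hypothetical helper name.

```shell
# slide_text FILE: print the text of one slide XML file on a single line.
# Hypothetical helper; a regex-based approximation, not an XML parser.
slide_text() {
  # grep -o prints each <a:t>...</a:t> element on its own line, sed strips
  # the tags, and paste joins the pieces with single spaces.
  # Caveat: a word split across two runs ends up with a space inside it.
  grep -o '<a:t>[^<]*</a:t>' "$1" |
    sed -e 's|<a:t>||' -e 's|</a:t>||' |
    paste -sd ' ' -
}
```

A phrase search on one slide could then look like `slide_text ppt/slides/slide3.xml | grep -i 'single cells'`.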