Last updated: 2023/03/07
I'm not really a fan of PowerPoint, but it's ubiquitous in research, so I have to work with PowerPoint files. Sometimes I need to find a slide among a pile of PowerPoint files and waste a lot of time opening and closing files. I wondered whether I could grep PowerPoint files and, sure enough, it is possible. (For those unfamiliar with grep or "grepping", it's a command-line utility for searching plain text files.)
A PowerPoint (.pptx) file is a ZIP archive, which is binary, so you can't grep it directly. What you can do is extract the files inside it and run grep on those. I used R Markdown and knitr to create an example PowerPoint file that has ten slides. Each slide states its slide number and whether that number is odd or even.
Let's download this file and unzip it.
    wget https://github.com/davetang/muse/raw/main/code/example_powerpoint.pptx
    unzip example_powerpoint.pptx
    tree --charset ascii -L 2 .
    # .
    # |-- [Content_Types].xml
    # |-- docProps
    # |   |-- app.xml
    # |   |-- core.xml
    # |   `-- custom.xml
    # |-- example_powerpoint.pptx
    # |-- ppt
    # |   |-- presentation.xml
    # |   |-- presProps.xml
    # |   |-- _rels
    # |   |-- slideLayouts
    # |   |-- slideMasters
    # |   |-- slides
    # |   |-- tableStyles.xml
    # |   |-- theme
    # |   `-- viewProps.xml
    # `-- _rels
    #
    # 8 directories, 9 files
The directory ppt/slides contains one XML file per slide. These are the files that we run grep on to find our missing slide. Say we are looking for a slide with the keyword "Seven" on it; we can run the following command. (The -l option prints the name of each input file in which a match is found.)
    grep -l Seven ppt/slides/*.xml
    # ppt/slides/slide7.xml
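One thing to watch for once a deck has ten or more slides: the shell expands ppt/slides/*.xml in lexical order, so slide10.xml is listed before slide2.xml. Below is a minimal sketch of listing matches in slide order. It assumes a sort that supports version sort (-V, as in GNU coreutils), and the function name find_slides is made up for illustration.

```bash
# find_slides TERM [DIR]: list the slide XML files under DIR/ppt/slides that
# contain TERM, in slide order (slide2 before slide10).
# DIR defaults to the current directory, i.e. where the .pptx was unzipped.
# Hypothetical helper; assumes sort supports -V (GNU coreutils does).
find_slides() {
  local term="$1" dir="${2:-.}"
  # -l prints only the names of matching files; -- protects search terms
  # that start with a dash; sort -V compares embedded numbers numerically.
  grep -l -- "$term" "$dir"/ppt/slides/*.xml | sort -V
}

# Example, run from the directory where the file was unzipped:
#   find_slides Seven
```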
I wrote a simple Bash script called search_pptx.sh that makes it easier to grep PowerPoint files. You simply provide the path to the PowerPoint file and a single search term, and the script will indicate whether the search term was found or not.
    ./search_pptx.sh
    # Usage: ./search_pptx.sh
    #        [ -i | --ignore-case ]
    #        [ -v | --verbose ]
    #        [ -t dir | --tmp dir ]
    #        <in.pptx> <search_term>

    ./search_pptx.sh example_powerpoint.pptx Seven
    # Seven was found in example_powerpoint.pptx on slide7

    ./search_pptx.sh example_powerpoint.pptx blah
    # blah was not found in example_powerpoint.pptx

    ./search_pptx.sh example_powerpoint.pptx Even
    # Even was found in example_powerpoint.pptx on slide10
    # Even was found in example_powerpoint.pptx on slide2
    # Even was found in example_powerpoint.pptx on slide4
    # Even was found in example_powerpoint.pptx on slide6
    # Even was found in example_powerpoint.pptx on slide8
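For readers curious how a search like this can be wired together, here is a rough sketch of the core idea: unzip the slide XML into a scratch directory, grep each file, and report the hits. It is not the actual search_pptx.sh. The function name search_pptx_sketch is hypothetical, the options are left out, and it assumes Info-ZIP unzip and mktemp are installed.

```bash
# search_pptx_sketch FILE TERM: report which slides of FILE contain TERM.
# Hypothetical stand-in for illustration; not the real search_pptx.sh.
search_pptx_sketch() {
  local pptx="$1" term="$2" tmpdir slide found=0
  tmpdir=$(mktemp -d) || return 1
  # Extract only the slide XML files, quietly (-q), into the scratch directory.
  unzip -q "$pptx" 'ppt/slides/*.xml' -d "$tmpdir" || { rm -rf "$tmpdir"; return 1; }
  for slide in "$tmpdir"/ppt/slides/*.xml; do
    # -q: no output, exit status only; --: TERM may start with a dash.
    if grep -q -- "$term" "$slide"; then
      echo "$term was found in $pptx on $(basename "$slide" .xml)"
      found=1
    fi
  done
  [ "$found" -eq 1 ] || echo "$term was not found in $pptx"
  # Clean up the extracted files.
  rm -rf "$tmpdir"
}

# Example:
#   search_pptx_sketch example_powerpoint.pptx Seven
```

Extracting into a fresh temporary directory keeps the working directory clean and lets several searches run side by side without overwriting each other's files.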
search_pptx.sh has options to ignore case, to produce more verbose output, and to change the default temporary directory (which is used to unzip files). The ignore-case option is probably the most useful; just add -i. Slide 7 contains "Seven", so "Even" matches when we ignore case.
    ./search_pptx.sh -i example_powerpoint.pptx Even
    # Even was found in example_powerpoint.pptx on slide10
    # Even was found in example_powerpoint.pptx on slide2
    # Even was found in example_powerpoint.pptx on slide4
    # Even was found in example_powerpoint.pptx on slide6
    # Even was found in example_powerpoint.pptx on slide7
    # Even was found in example_powerpoint.pptx on slide8
Run search_pptx.sh in a for loop or with GNU Parallel, and you can quickly search through many PowerPoint files.
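A minimal sketch of the loop, assuming search_pptx.sh sits in the current directory and the PowerPoint files are all in one directory. The function name search_many and the SEARCH_PPTX override are invented for this example.

```bash
# search_many TERM [DIR]: run the search script on every .pptx file in DIR.
# SEARCH_PPTX is a hypothetical override for the path to the script.
search_many() {
  local term="$1" dir="${2:-.}" f
  for f in "$dir"/*.pptx; do
    # An unmatched glob stays literal; skip it when DIR has no .pptx files.
    [ -e "$f" ] || continue
    "${SEARCH_PPTX:-./search_pptx.sh}" -i "$f" "$term"
  done
}

# Example:
#   search_many Seven ~/talks
#
# Roughly equivalent with GNU Parallel (one job per file):
#   parallel ./search_pptx.sh -i {} Seven ::: *.pptx
```

The loop handles one file at a time, while GNU Parallel runs several searches at once, which helps when there are many large files.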
However, as you can imagine, this way of grepping PowerPoint files is only useful when you have a distinguishing keyword. One could extend this to use a proper XML parser, to search for consecutive words, e.g. "Single cells" (feature added), and to search the other files that are extracted from a PowerPoint file.
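Consecutive words are harder than they look. In slide XML the visible text sits inside `<a:t>` elements, and a phrase can be split across several of them, so a plain grep for "Single cells" on the raw XML may miss it. The sketch below shows one way a phrase search could work, not necessarily how search_pptx.sh does it. It is regex-based rather than a proper XML parser, relies on grep -o (available in GNU and BSD grep), and the function names slide_text and has_phrase are hypothetical.

```bash
# slide_text FILE: print the visible text of one slide XML file on one line.
# Regex-based, so XML entities such as &amp; are left escaped.
slide_text() {
  # Pull out each <a:t>...</a:t> run, strip the tags, then join the runs.
  grep -o '<a:t>[^<]*</a:t>' "$1" |
    sed -e 's/<a:t>//' -e 's/<\/a:t>//' |
    tr -d '\n'
  echo
}

# has_phrase FILE PHRASE: succeed if PHRASE occurs in the joined slide text.
# -F treats PHRASE as a fixed string rather than a regular expression.
# Runs are joined with no separator, so text from neighbouring paragraphs
# is glued together as well.
has_phrase() {
  slide_text "$1" | grep -q -F -- "$2"
}

# Example:
#   has_phrase ppt/slides/slide1.xml "Single cells" && echo found
```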
This work is licensed under a Creative Commons
Attribution 4.0 International License.