Grepping PowerPoint files

Last updated: 2023/03/07

I'm not really a fan of PowerPoint, but it's ubiquitous in research, so I have to work with PowerPoint files. Sometimes I need to find a slide in a pile of PowerPoint files and waste a lot of time opening and closing them. I wondered whether I could grep PowerPoint files and, sure enough, it is possible. (For those unfamiliar with grep or "grepping", grep is a command-line utility for searching plain text files.)

A PowerPoint (.pptx) file is a compressed zip archive rather than plain text, so you can't grep it directly. What you can do is extract the files inside a PowerPoint file and run grep on those. I used R Markdown and knitr to create an example PowerPoint file that has ten slides. Each slide contains information on the slide number and whether this number is odd or even.

Let's download this file and unzip it.

wget https://github.com/davetang/muse/raw/main/code/example_powerpoint.pptx
unzip example_powerpoint.pptx

tree --charset ascii -L 2 .
# .
# |-- [Content_Types].xml
# |-- docProps
# |   |-- app.xml
# |   |-- core.xml
# |   `-- custom.xml
# |-- example_powerpoint.pptx
# |-- ppt
# |   |-- presentation.xml
# |   |-- presProps.xml
# |   |-- _rels
# |   |-- slideLayouts
# |   |-- slideMasters
# |   |-- slides
# |   |-- tableStyles.xml
# |   |-- theme
# |   `-- viewProps.xml
# `-- _rels
# 
# 8 directories, 9 files

The directory ppt/slides contains one XML file per slide. These are the files we grep to find our missing slide. Say we are looking for a slide with the keyword "Seven" on it; we can run the following command. (The -l option prints only the name of each input file that contains a match.)

grep -l Seven ppt/slides/*.xml
# ppt/slides/slide7.xml
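
One caveat with grepping the raw XML is that a keyword can also match markup (tag or attribute names) rather than slide text. Here is a minimal sketch of pulling out only the visible text first. It assumes the text sits in <a:t>...</a:t> elements without attributes, which is how slide text is normally stored, and slide_text is a hypothetical helper name, not part of any script mentioned here.

```shell
#!/usr/bin/env bash
# Print the visible text runs of one slide XML file, one run per line.
# Assumption: text is stored in <a:t>...</a:t> elements with no attributes.
slide_text() {
    local xml=$1
    # -o prints only the matched parts; sed then strips the surrounding tags.
    grep -o '<a:t>[^<]*</a:t>' "${xml}" | sed -e 's/<a:t>//' -e 's/<\/a:t>//'
}
```

For example, slide_text ppt/slides/slide7.xml | grep -i seven searches only the slide text. This is not a real XML parser: entities such as &amp; are left encoded.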

I wrote a simple Bash script called search_pptx.sh that makes it easier to grep PowerPoint files. You provide the path to the PowerPoint file and a single search term, and the script reports which slides, if any, contain the term.

./search_pptx.sh
# Usage: ./search_pptx.sh
#    [ -i     | --ignore-case ]
#    [ -v     | --verbose ]
#    [ -t dir | --tmp dir ]
#    <in.pptx> <search_term>

./search_pptx.sh example_powerpoint.pptx Seven
# Seven was found in example_powerpoint.pptx on slide7

./search_pptx.sh example_powerpoint.pptx blah
# blah was not found in example_powerpoint.pptx

./search_pptx.sh example_powerpoint.pptx Even
# Even was found in example_powerpoint.pptx on slide10
# Even was found in example_powerpoint.pptx on slide2
# Even was found in example_powerpoint.pptx on slide4
# Even was found in example_powerpoint.pptx on slide6
# Even was found in example_powerpoint.pptx on slide8
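
The matches come back in lexicographic order, which is why slide10 is listed before slide2. If you are grepping the extracted files yourself, here is a small sketch of natural ordering. It assumes your sort supports -V (version sort), as GNU coreutils does, and matching_slides is a hypothetical helper name rather than a feature of search_pptx.sh.

```shell
#!/usr/bin/env bash
# List slide files containing a term in natural order (slide2 before slide10).
# Assumption: sort supports -V (version sort), as GNU coreutils does.
matching_slides() {
    local term=$1 dir=${2:-ppt/slides}
    grep -l -- "${term}" "${dir}"/*.xml | sort -V
}
```

Run it from the directory where the file was unzipped, e.g. matching_slides Even.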

The script has options to ignore case, to provide more verbose output, and to change the default temporary directory (which is used to unzip files). The ignore-case option is probably the most useful; just add -i. Slide 7 contains "Seven", so "Even" also matches slide 7 when we ignore case.

./search_pptx.sh -i example_powerpoint.pptx Even
# Even was found in example_powerpoint.pptx on slide10
# Even was found in example_powerpoint.pptx on slide2
# Even was found in example_powerpoint.pptx on slide4
# Even was found in example_powerpoint.pptx on slide6
# Even was found in example_powerpoint.pptx on slide7
# Even was found in example_powerpoint.pptx on slide8
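
If the extra hit on slide7 is unwanted, grep's -w option restricts matches to whole words. Below is a sketch that runs grep directly on the extracted slides. It is not an option of search_pptx.sh as far as I know, and whole_word_slides is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Case-insensitive search that matches the term only as a whole word,
# so "Even" no longer matches inside "Seven".
whole_word_slides() {
    local term=$1 dir=${2:-ppt/slides}
    # || true keeps a "no match" result from being treated as an error.
    grep -liw -- "${term}" "${dir}"/*.xml || true
}
```

Because the text in the XML is surrounded by > and <, which are not word characters, a word at the start or end of a text run still counts as a whole word.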

Use search_pptx.sh in a for loop or with GNU Parallel, and you can quickly search through many PowerPoint files.
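
As a rough sketch of the loop approach, the function below runs the script on every .pptx file under a directory. search_all and the SEARCH_PPTX variable are hypothetical names introduced for illustration, and the argument order follows the usage shown earlier (file first, then search term).

```shell
#!/usr/bin/env bash
# Run the search script on every .pptx file under a directory.
# SEARCH_PPTX is a hypothetical override for the script location.
search_all() {
    local term=$1 dir=${2:-.}
    # -print0 with read -d '' keeps filenames containing spaces intact.
    find "${dir}" -type f -name '*.pptx' -print0 |
        while IFS= read -r -d '' pptx; do
            # stdin is redirected so the script cannot consume the file list;
            # || true lets the loop continue if one file fails.
            "${SEARCH_PPTX:-./search_pptx.sh}" "${pptx}" "${term}" < /dev/null || true
        done
}
```

For example, search_all Seven ~/presentations searches every PowerPoint file below that directory, one file at a time.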

However, as you can imagine, this way of grepping PowerPoint files is only useful when you have a distinguishing keyword. One could extend this by using a proper XML parser, by searching for consecutive words, e.g. "Single cells" (feature added), and by searching the other files that are extracted from a PowerPoint file.
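
As an illustration of searching the other extracted files, here is a sketch that also looks in ppt/notesSlides, where PowerPoint normally stores speaker notes. The example file above has no notes, so that directory is an assumption about your own files, and search_parts is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Search slide XML and, if present, speaker-notes XML for a term.
# Assumption: notes live in ppt/notesSlides; a missing directory is skipped.
search_parts() {
    local term=$1 root=${2:-.} dir
    for dir in ppt/slides ppt/notesSlides; do
        [[ -d "${root}/${dir}" ]] || continue
        # Errors from an empty directory are hidden; no match is not an error.
        grep -l -- "${term}" "${root}/${dir}"/*.xml 2> /dev/null || true
    done
}
```

Run from the unzipped directory, search_parts Seven prints the matching slide files first, followed by any matching notes files.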

This work is licensed under a Creative Commons Attribution 4.0 International License.
Comments
  1. Thanks for writing this, Dave. I had a load of .pptx files with spaces in the filenames, so I got this revision working that handles them:

    #!/usr/bin/env bash
    #https://davetang.org/muse/2023/03/01/grepping-powerpoint-files/
    #Now supports spaces in filenames, grep-style syntax, and color-coded output for matches

    set -euo pipefail

    ignorecase=0
    verbose=0
    tmp=/tmp

    usage(){
        >&2 cat << EOF
    Usage: $0
       [ -i     | --ignore-case ]
       [ -v     | --verbose ]
       [ -t dir | --tmp dir ]
       <search_term> <in.pptx> [<in.pptx> ...]
    EOF
        exit 1
    }

    check_depend (){
        local tool=$1
        # command -v succeeds only if the tool can be found
        if ! command -v "${tool}" > /dev/null 2>&1; then
            >&2 echo "Could not find ${tool}"
            exit 1
        fi
    }

    now(){
        date '+%Y/%m/%d %H:%M:%S'
    }

    # Long options require GNU getopt (util-linux).
    # Testing the assignment directly lets us print usage under set -e.
    if ! args=$(getopt -a -o ivht: --long ignore-case,verbose,help,tmp: -- "$@"); then
        usage
    fi

    eval set -- "${args}"
    while :
    do
        case $1 in
            -i | --ignore-case) ignorecase=1 ; shift ;;
            -v | --verbose)     verbose=1 ; shift ;;
            -h | --help)        usage ;;
            -t | --tmp)         tmp=$2 ; shift 2 ;;
            # -- means the end of the arguments; drop this, and break out of the while loop
            --) shift; break ;;
            *) >&2 echo "Unsupported option: $1"
               usage ;;
        esac
    done

    if [[ $# -lt 2 ]]; then
        usage
    fi

    word=$1
    # Keep the input files in an array so names with spaces stay intact
    files=("${@:2}")

    dependencies=(unzip)
    for tool in "${dependencies[@]}"; do
        check_depend "${tool}"
    done

    SECONDS=0
    [[ ${verbose} -gt 0 ]] && >&2 printf '[ %s ] Searching %s\n' "$(now)" "${files[*]}"

    IFS=$'\n\t'

    old_pwd=$PWD

    for ppfile in "${files[@]}"; do
        cd "${old_pwd}"

        rand=$$$RANDOM
        tmpdir=${tmp}/${rand}

        if [[ -d ${tmpdir} ]]; then
            >&2 echo "${tmpdir} already exists!"
            exit 1
        else
            mkdir "${tmpdir}"
        fi

        cp "${ppfile}" "${tmpdir}"
        cd "${tmpdir}" && unzip -q "$(basename "${ppfile}")"

        if [[ ! -d ppt/slides/ ]]; then
            >&2 echo "No slides found in ${ppfile}"
            exit 1
        fi

        # || true stops set -e from aborting when grep finds nothing
        if [[ ${ignorecase} -eq 1 ]]; then
            search=$(grep -li -- "${word}" ppt/slides/*.xml || true)
        else
            search=$(grep -l -- "${word}" ppt/slides/*.xml || true)
        fi

        if [[ -z ${search} ]]; then
            echo "${word} not found in ${ppfile}"
        else
            RED='\033[0;31m'
            NC='\033[0m' # No Color
            for s in ${search}; do
                slide=$(basename "${s}" .xml)
                echo -e "${RED}${word} was found in ${ppfile} on ${slide}${NC}"
            done
        fi

        # Leave the temporary directory before deleting it
        cd "${old_pwd}" && rm -rf "${tmpdir}"

        # Measure elapsed time after the search, not before it
        duration=$SECONDS
        [[ ${verbose} -gt 0 ]] && >&2 printf '[ %s ] Completed.\n' "$(now)"
        [[ ${verbose} -gt 0 ]] && >&2 echo -e "$((duration / 60)) minutes and $((duration % 60)) seconds elapsed.\n"

    done

    exit 0
