Grab Annotations from a PDF with PyPDF2

If you've noticed a lot of PDF content around here lately, that's because I work with PDFs a lot! Above all, my slide decks are all PDFs, and in the last year or so I've started using speaker notes in my presentations. Yes, this means that if you saw me speak in the first ten years of my speaking career, that was without speaker notes.

There are some situations where I don't have access to my speaker notes. Usually this is for a good reason, such as when I have mirrored my displays so I can demo or play a video without fiddling with my display settings in the middle of a talk. Sometimes, it's because something bad happened and I'm presenting from someone else's machine, or from a laptop that's completely off stage, and I only have the comfort monitor. For those situations I use a printed set of backup speaker notes, so I thought I'd share the script that creates these.

First, a complete aside

If you'd like to be ready to support a conference speaker who has a tech failure, read the post about presenting from PDF and install the tools on your laptop. A bunch of my colleagues past and present have done this and I've hugely appreciated that support! Almost every presentation tool can export to PDF, so if the borrowed laptop has presentation tools you at least get the view that has the timer and the next slide.

Script to drag titles and annotations out of a PDF

Speaker notes are usually applied to PDF slides as annotations that sit outside the visible space (somewhere off the top left of the page). You can use this approach for other PDF annotations too.
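If you want to try this on other kinds of annotation, it helps to know which subtypes a page actually carries. Here is a minimal sketch that tallies them. The helper names resolve and annotation_subtypes are made up for illustration, and it assumes a page object that behaves like a dictionary with an optional /Annots list, as PyPDF2 pages do.

```python
from collections import Counter


def resolve(obj):
    """Follow an indirect reference if the object offers a way to do so.

    Newer releases spell this get_object, older PyPDF2 spells it getObject;
    plain Python objects have neither and are returned unchanged.
    """
    for name in ("get_object", "getObject"):
        method = getattr(obj, name, None)
        if callable(method):
            return method()
    return obj


def annotation_subtypes(page):
    """Count the annotations on one page, grouped by their /Subtype."""
    counts = Counter()
    if "/Annots" not in page:
        # no annotations on this page
        return counts
    for annot in resolve(page["/Annots"]):
        annot = resolve(annot)
        subtype = resolve(annot["/Subtype"]) if "/Subtype" in annot else "unknown"
        counts[str(subtype)] += 1
    return counts
```

Calling annotation_subtypes(page) on each page gives a quick picture of what is there, for example how many /Text notes sit alongside /Link annotations.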

With the following Python code in notes.py:

import sys
import PyPDF2

# PdfFileReader, getNumPages, getPage, extractText and getObject are the
# method names from PyPDF2 releases before 3.0

try:
    src = sys.argv[1]
except IndexError:
    # no file name given on the command line
    src = r'/path/to/my/file.pdf'

# put the role into the rst file
print('.. role:: slide-title')
print('')

input1 = PyPDF2.PdfFileReader(open(src, "rb"))
nPages = input1.getNumPages()

for i in range(nPages):
    # get the data from this PDF page (first line of text, plus annotations)
    page = input1.getPage(i)
    text = page.extractText()

    # a page with no extractable text gets an empty title
    lines = text.splitlines()
    print(':slide-title:`' + (lines[0] if lines else '') + '`')
    print('')

    try:
        for annot in page['/Annots']:
            # Other subtypes, such as /Link, cause errors
            subtype = annot.getObject()['/Subtype']
            if subtype == "/Text":
                print(annot.getObject()['/Contents'])
                print('')
    except KeyError:
        # there are no annotations on this page
        pass

    print('')

To use this, run python notes.py [pdf file name].

The script will output rst content (that I then use with rst2pdf) with the first line/title of each slide and the annotations associated with it. Even if there are no annotations, the title is still added, which can be useful for keeping track of which slides are coming up when you don't have any other support information.
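The script above uses the older PyPDF2 method names. For anyone on the library's successor, pypdf, here is a hedged sketch of the same idea, assuming pypdf 3.x or later with PdfReader, reader.pages, extract_text() and get_object(). The function names (first_line, text_notes, rst_lines, main) are illustrative and not part of the original script.

```python
import sys


def first_line(text):
    """Return the first line of a page's text, or '' when there is none."""
    lines = (text or "").splitlines()
    return lines[0] if lines else ""


def text_notes(page):
    """Collect the /Contents of every /Text annotation on a page."""
    notes = []
    if "/Annots" not in page:
        return notes
    for ref in page["/Annots"]:
        annot = ref.get_object()
        # skip other subtypes, such as /Link
        if annot.get("/Subtype") == "/Text" and "/Contents" in annot:
            notes.append(str(annot["/Contents"]))
    return notes


def rst_lines(pages):
    """Yield the rst output line by line for an iterable of pages."""
    yield ".. role:: slide-title"
    yield ""
    for page in pages:
        yield ":slide-title:`" + first_line(page.extract_text()) + "`"
        yield ""
        for note in text_notes(page):
            yield note
            yield ""
        yield ""


def main(path):
    # assumption: pypdf 3.x or later is installed
    from pypdf import PdfReader

    reader = PdfReader(path)
    print("\n".join(rst_lines(reader.pages)))


# only run when a PDF file name was given on the command line
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Splitting the work into small functions keeps the PDF reading in one place, so the title and annotation handling can be tried out on stand-in page objects without a real file.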

It's a simple script but I found it handy (and will probably find it here another day and use it for something else, which is exactly the point of a blog). I hope it's useful to you too!


Also published on Medium.
