Getting a full PDF from a DRM-encumbered online textbook

I recently started a calculus course that uses an online textbook. Buying this textbook online was mandatory, not for the content, but to get an electronic access code for homework assignments. While I had the option of additionally buying a physical copy of the book, I don’t like the idea of textbook publishers trying to squeeze the used-book market with scummy tactics like this. On top of this, unless I pay extra, I will lose access to this book at some point in the future. That is unacceptable to me. So… I’m going to crack it.

(Yes, I probably could have just torrented a PDF copy. But that’s no fun!)

The DRM on this textbook is pretty intense. Of course, there isn’t a “download PDF” option. There is a printing option, but it’s limited to 10 pages at a time, and prints the pages out with a large watermark in the center, along with licensing info (my name, number, and a “do not scan, copy, duplicate, distribute, or exercise any freedom with the material” notice) in the margins. Fun!

First, we need to download all the pages. Due to the 10-page print limit, this is going to take forever… right? Nope! A little Clojure and java.awt.Robot have our mouse pointer whizzing around the screen by itself.

(ns scraper.core
    (:import (java.awt Robot)
             (java.awt.event KeyEvent InputEvent)))

(defn char-to-key [c]
    (KeyEvent/getExtendedKeyCodeForChar (int c)))

(defn type-key [r c]
    (.keyPress r c)
    (.delay r 100)
    (.keyRelease r c))

(defn type-string [r s]
    ;; Type each character, holding Shift for uppercase letters.
    (doseq [c (seq s)]
        (let [upcase? (Character/isUpperCase c)]
            (when upcase? (.keyPress r KeyEvent/VK_SHIFT))
            (.delay r 100)
            (type-key r (char-to-key c))
            (when upcase? (.keyRelease r KeyEvent/VK_SHIFT))
            (.delay r 100)))
    (.waitForIdle r))

(defn mouse-down [r] (.mousePress r InputEvent/BUTTON1_DOWN_MASK))
(defn mouse-up [r] (.mouseRelease r InputEvent/BUTTON1_DOWN_MASK))

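(defn click-mouse [r f]
    ;; Optionally move the pointer with f, then press and release the left button.
    (when-not (nil? f)
        (f r))
    (doto r
        (.delay 100)
        (mouse-down)
        (.delay 100)
        (mouse-up)
        (.delay 100)))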

(defn to-next-page [r] (.mouseMove r 727 752))

(defn to-print-button [r] (.mouseMove r 758 154))
(defn to-page-range [r] (.mouseMove r 630 406))
(defn to-page-range-box [r] (.mouseMove r 713 401))

(defn type-range [r l u]
    (type-string r (format "%s-%s" (str l) (str u))))

(defn to-modal-print [r] (.mouseMove r 624 527))
(defn to-save-button [r] (.mouseMove r 285 166))

(defn rename-file [r new-name]
    (doto r
        (.delay 1500)
        (type-key KeyEvent/VK_BACK_SPACE)
        (type-key KeyEvent/VK_BACK_SPACE) ; just to be sure :^)
        (.delay 1500)
        (type-string new-name)
        (.delay 1500)))

(defn to-modal-save [r] (.mouseMove r 1008 734))

(defn open-print-menu [r]
    (doto r
        (click-mouse to-print-button)
        (.delay 3000)
        (click-mouse to-modal-print)
        (.delay 1500)))

(defn save-file-as [r new-name extra-wait]
    (Thread/sleep extra-wait)
    (doto r
        (click-mouse to-save-button)
        (.delay 1500)
        (rename-file new-name)
        (.delay 1500)
        (click-mouse to-modal-save)))

(defn print-pages-single [r start end prefix]
    (doseq [n (range start end)]
        (doto r
            (.delay 1500)
            (save-file-as (format "%s%04d" prefix n) 1000)
            (.delay 1500)
            (click-mouse to-next-page)
            (.delay 4500))))

(defn print-pages-range [r start end prefix]
    (doseq [[range-start range-end] (map (juxt first last) (partition 10 10 [] (range start end)))]
        (let [range-name (format "%s%d-%s%d" prefix range-start prefix range-end)
              range-file-name (format "%s%04d_%s%04d" prefix range-start prefix range-end)]
            (doto r
                (click-mouse to-print-button)
                (.delay 3000)
                (click-mouse to-page-range)
                (.delay 1500)
                (click-mouse to-page-range-box)
                (.delay 1500)
                (type-string range-name)
                (.delay 1500)
                (click-mouse to-modal-print)
                (.delay 3000)
                (save-file-as range-file-name 10000)
                (.delay 4500)))))

(defn print-pages [r page-ranges]
    ;; Each entry is [start end prefix single?].
    (doseq [[start end prefix single?] page-ranges]
        ((if single? print-pages-single print-pages-range) r start end prefix)))

(let [r (Robot.)]
    (println "Starting in 1 second...")
    (doto r
        (.delay 1000) ; the pause promised by the message above
        (print-pages-single 0 1 "0000_cover")
        (print-pages-single 1 30 "0000_prologue")
        (print-pages-range 1 1171 "")
        (print-pages-range 1 147 "A")
        (print-pages-range 1 11 "R")))

My Clojure was pretty rusty, so the code is far from pretty, and I got around timing problems by adding more sleeps… but with some trial and error, it worked pretty well. Several coffee/tea/tinder breaks later, punctuated by restarting the scraper wherever it had broken for some reason, all the pages were living on my hard drive. Nice! Er, except the ones that didn’t get captured due to timing issues. A bit of Python magic found which pages weren’t grabbed correctly, though, and I was able to rerun the scraper on just those ranges to clean up the remnants. Overall, the process took around a day, which, while not ideal, wasn’t too bad. I experimented with taking regional screenshots to actually detect whether UI elements were ready instead of just guessing, but realistically, if I were doing this to more books and wanted it to be robust, I would look into cracking the .swf itself.
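The Python in question isn’t reproduced here, but a minimal sketch of that kind of check might look like the following. It assumes each range was saved as `<prefix><start>_<prefix><end>.pdf` (mirroring `range-file-name` in the scraper, with a `.pdf` extension added by the save dialog) and that a failed capture shows up as a missing or suspiciously small file. The helper names and the `min_bytes` threshold are made up for illustration.

```python
import os

def chunk_ranges(start, end, size=10):
    """Yield (first, last) page numbers for each chunk of `size` pages in [start, end)."""
    for first in range(start, end, size):
        yield first, min(first + size, end) - 1

def expected_names(start, end, prefix=""):
    """File names the scraper would have typed, e.g. 'A0001_A0010.pdf' (assumed layout)."""
    return [f"{prefix}{a:04d}_{prefix}{b:04d}.pdf" for a, b in chunk_ranges(start, end)]

def find_missing(directory, names, min_bytes=1024):
    """Return the names that are absent from `directory` or smaller than `min_bytes`."""
    missing = []
    for name in names:
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or os.path.getsize(path) < min_bytes:
            missing.append(name)
    return missing
```

Calling `find_missing("pdfs", expected_names(1, 1171))` would then list the ranges worth feeding back into the scraper.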

Now, we need to get a page into image form, so we can play around with it in GIMP. Once we get a process worked out, we can automate it with ImageMagick and process all 1500-odd pages of the book. Getting this image is easy: pick a page and run:

convert -density 300 -quality 100 $PDF -crop 1846x2306+208+322 out.png

to turn it into a high-quality PNG file.

Luckily page R11 was totally white, so converting it to a PNG yielded a clean, isolated copy of the watermark.

Dealing with the margins will be easy, since we can just crop them out, so let’s focus on the watermark. I made everything but the watermark itself transparent in GIMP. After that, removing it from the original image is as simple as overlaying the cleaned-up watermark on the page and setting the watermark’s layer mode to divide.
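A rough model shows why divide mode undoes the watermark. Assume the mark was applied by multiplying each page pixel by the watermark pixel (scaled to 0..1); dividing by the same watermark then recovers the original value, give or take rounding. The sketch below works on single 8-bit grayscale values. The multiplicative model is an assumption about how the publisher composited the mark, and `apply_watermark` and `divide` are illustrative names only.

```python
def apply_watermark(page, mark):
    """Assumed model: multiply blend of two 8-bit values (0..255)."""
    return page * mark // 255

def divide(marked, mark):
    """Divide blend: marked / mark, rescaled to 0..255 and clamped."""
    if mark == 0:
        return 255  # avoid division by zero; treat as fully white
    return min(255, round(marked * 255 / mark))

# White paper under a light-gray watermark pixel comes back white,
# and mid-gray content comes back to within rounding error.
for page in (255, 128, 0):
    marked = apply_watermark(page, 200)
    print(page, marked, divide(marked, 200))
```

This would also be consistent with a slightly misaligned watermark leaving thin gray lines: along the edges, pixels get divided by the wrong watermark value.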

Now we need to repeat our earlier PDF->PNG conversion for all the files. This wasn’t much harder with a dash of GNU parallel (an incredibly handy tool):

parallel convert -density 300 -quality 100 {} '~/book-imgs/{/.}-%d.png' ::: pdfs/*.pdf
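For reference, `{/.}` in GNU parallel expands to the input file’s basename with its extension removed, and ImageMagick replaces `%d` in an output name with the zero-based page (scene) index. A small Python sketch of the resulting names, where `png_names` is a hypothetical helper and the page count is supplied by hand:

```python
import os

def png_names(pdf_path, page_count, out_dir="~/book-imgs"):
    """List the PNG names the command above would produce for one PDF."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]  # GNU parallel's {/.}
    return [f"{out_dir}/{stem}-{i}.png" for i in range(page_count)]  # %d -> 0, 1, ...
```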

Now, the fun part – automating the image-munging process! With the watermark image in watermark.png and a test page image named test.png, we can easily replicate our GIMP process in ImageMagick:

convert test.png watermark.png -compose Divide_Src -composite out.png

Ahh, shell is a wonderful thing.

And we have a nice, clean page… which I’d love to show you, but, copyrights. So just imagine a pristine (well, there are a few artifacts) textbook page, with no ugly watermarks. Ahhhh.

Now, let’s do this 1500 times! Time to reach for parallel again:

parallel convert {} -background white -fill white -draw "rectangle 160,163 1799,215" -draw "rectangle 160,2725 2342,2834" -flatten watermark.png -compose Divide_Src -composite "/home/jon/book-imgs-proc/{/}" ::: ~/book-imgs/*.png

(I could have used mogrify and done it in-place, but I wanted to keep a backup. The first parallel command took quite a while to run.)

This command seems a bit confusing, so I’ll break it into its constituent pieces.

The first part is -background white ... -flatten, which fills the transparent edges with white. I wanted the images to have an 8.5×11 ratio (because I’m a silly American), and it turns out they already were in that ratio – if I included the transparent part. No cropping required!

The next part is -fill white -draw "rectangle 160,163 1799,215" -draw "rectangle 160,2725 2342,2834". Since we aren’t cropping out the licensing text, I’m instead simply covering it up with some filled-white rectangles. The coordinates took a bit of tweaking, but it worked out pretty well.

Finally, after we -flatten and all the operations have happened on the source image, we can load the watermark image and divide by it as before: watermark.png -compose Divide_Src -composite "/home/jon/book-imgs-proc/{/}". To make everything work, I had to manually move the watermark up in GIMP to get it to align. Not sure why, but after I did this everything processed basically perfectly (it’s still not exactly aligned, so there are some very thin gray lines, but it’s good enough for me).
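If the shell quoting in a command like this gets unwieldy, one alternative is to build the argument list in Python and hand it to `subprocess.run`, so that each `-draw` primitive stays a single argument. This is only a sketch of that alternative, not what was actually run here: `build_args` and `process` are hypothetical helpers, ImageMagick’s `convert` is assumed to be on the `PATH`, and the rectangle coordinates and option order are copied from the command above.

```python
import os
import subprocess

# Licensing-text regions to paint white, copied from the command above.
RECTANGLES = ["rectangle 160,163 1799,215", "rectangle 160,2725 2342,2834"]

def build_args(src, out_dir, watermark="watermark.png"):
    """Return the convert argument list for one page image."""
    args = ["convert", src, "-background", "white", "-fill", "white"]
    for rect in RECTANGLES:
        args += ["-draw", rect]  # one argv element per drawing primitive
    args += ["-flatten", watermark, "-compose", "Divide_Src", "-composite",
             os.path.join(out_dir, os.path.basename(src))]
    return args

def process(src, out_dir):
    """Run convert on one page; raises CalledProcessError if convert fails."""
    subprocess.run(build_args(src, out_dir), check=True)
```

Because the list is passed straight to the program without going through a shell, the spaces inside each rectangle description need no extra quoting.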

Now, we can use convert again (ImageMagick is so useful) to join the PNG images into a single PDF:

convert ~/book-imgs-proc/*.png book.pdf

Well, actually, convert likes to use… lots of RAM, so what I really ran was:

nice -n 19 convert -limit area 1GiB -limit memory 1GiB -limit map 1GiB ~/book-imgs-proc/*.png book.pdf

…and went out with a friend, then came back and read The C Programming Language for a bit, then browsed Hacker News for a bit… it ended up running for almost 3 hours but it chewed through all the pages eventually. The resulting PDF was more than 700 MiB.
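A back-of-the-envelope estimate suggests why convert wants so much memory when handed every page at once. The numbers below are assumptions rather than measurements: letter-size pages at 300 DPI (2550×3300 pixels), roughly 1500 pages, and 8 bytes per pixel, which is what an ImageMagick build storing four 16-bit channels per pixel would use.

```python
def uncompressed_bytes(width, height, pages, bytes_per_pixel=8):
    """Size of the decoded pixel data if every page is held at once."""
    return width * height * pages * bytes_per_pixel

total = uncompressed_bytes(2550, 3300, 1500)  # assumed dimensions and page count
print(f"{total / 2**30:.1f} GiB")
```

Under those assumptions the decoded pixels come to somewhere around 94 GiB, far beyond the 1 GiB limits, so ImageMagick has to spill its pixel cache to disk. That would be consistent with the multi-hour run time.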

Now we can use basically any PDF OCR tool to make the text searchable. If I had the motivation I could probably scrape the original text from the book to get it perfect, but I don’t care that much, so OCR it is.

I already had Ruby and the Tesseract OCR engine installed, so I just grabbed the one-script pdfocr tool from its GitHub repo. It needed one extra command installed for some reason, and…

./pdfocr -t -i book.pdf -o book-ocr.pdf

…another hour or so of waiting later and my PDF was done! Basically 100% searchable and cleanly formatted. Compared with buying a “lifetime of edition” code, I saved at least $120, so I’m pretty happy with this project overall.
