Getting a full PDF from a DRM-encumbered online textbook

I recently started a calculus course that uses an online textbook. Buying this textbook online was mandatory, not for the content, but to get an electronic access code for homework assignments. While I had the option of additionally buying a physical copy of the book, I don’t like the idea of textbook publishers trying to squeeze the used books market with scummy tactics like this. On top of this, unless I pay extra, I will lose access to this book at some point in the future. That is unacceptable to me. So… I’m going to crack it.

(Yes, I probably could have just torrented a PDF copy. But that’s no fun!)

The DRM on this textbook is pretty intense. Of course, there isn’t a “download PDF” option. There is a printing option, but it’s limited to 10 pages at a time, and prints the pages out with a large watermark in the center, along with licensing info (my name, number, and a “do not scan, copy, duplicate, distribute, or exercise any freedom with the material” notice) in the margins. Fun!

First, we need to download all the pages. Due to the download limit, this is going to take forever… right? Nope! A little Clojure and java.awt.Robot has our mouse pointer whizzing around the screen by itself.

(ns scraper.core)

(import '(java.awt Robot)
        '(java.awt.event KeyEvent InputEvent))

(use '[clojure.string :only (join)])

(defn char-to-key [c]
    (KeyEvent/getExtendedKeyCodeForChar (int c)))

(defn type-key [r c]
    (.keyPress r c)
    (.delay r 100)
    (.keyRelease r c))

(defn type-string [r s]
    (doseq [c (seq s)]
        (let [upcase? (Character/isUpperCase c)]
              (if upcase? (.keyPress r KeyEvent/VK_SHIFT))
              (.delay r 100)
              (type-key r (char-to-key c))
              (if upcase? (.keyRelease r KeyEvent/VK_SHIFT))
              (.delay r 100)))
    (.waitForIdle r))

(defn mouse-down [r] (.mousePress r InputEvent/BUTTON1_DOWN_MASK))
(defn mouse-up [r] (.mouseRelease r InputEvent/BUTTON1_DOWN_MASK))

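(defn click-mouse [r f]
    ;; Optionally move the pointer first (f is one of the to-* helpers), then click.
    (when-not (nil? f)
        (f r))
    (doto r
        (.delay 100)
        (mouse-down)
        (.delay 100)
        (mouse-up)))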

(defn to-next-page [r] (.mouseMove r 727 752))

(defn to-print-button [r] (.mouseMove r 758 154))
(defn to-page-range [r] (.mouseMove r 630 406))
(defn to-page-range-box [r] (.mouseMove r 713 401))

(defn type-range [r l u]
    (type-string r (format "%s-%s" (str l) (str u))))

(defn to-modal-print [r] (.mouseMove r 624 527))
(defn to-save-button [r] (.mouseMove r 285 166))

(defn rename-file [r new-name]
    (doto r
        (.delay 1500)
        (type-key KeyEvent/VK_BACK_SPACE)
        (type-key KeyEvent/VK_BACK_SPACE) ; just to be sure :^)
        (.delay 1500)
        (type-string new-name)
        (.delay 1500)))

(defn to-modal-save [r] (.mouseMove r 1008 734))

(defn open-print-menu [r]
    (doto r
        (click-mouse to-print-button)
        (.delay 3000)
        (click-mouse to-modal-print)
        (.delay 1500)))

(defn save-file-as [r new-name extra-wait]
    (Thread/sleep extra-wait)
    (doto r
        (click-mouse to-save-button)
        (.delay 1500)
        (rename-file new-name)
        (.delay 1500)
        (click-mouse to-modal-save)))

(defn print-pages-single [r start end prefix]
    (doseq [n (range start end)]
        (doto r
            (.delay 1500)
            (save-file-as (format "%s%04d" prefix n) 1000)
            (.delay 1500)
            (click-mouse to-next-page)
            (.delay 4500))))

(defn print-pages-range [r start end prefix]
    (doseq [[range-start range-end] (map (juxt first last) (partition 10 10 [] (range start end)))]
        (let [range-name (format "%s%d-%s%d" prefix range-start prefix range-end)
              range-file-name (format "%s%04d_%s%04d" prefix range-start prefix range-end)]
            (doto r
                (click-mouse to-print-button)
                (.delay 3000)
                (click-mouse to-page-range)
                (.delay 1500)
                (click-mouse to-page-range-box)
                (.delay 1500)
                (type-string range-name)
                (.delay 1500)
                (click-mouse to-modal-print)
                (.delay 3000)
                (save-file-as range-file-name 10000)
                (.delay 4500)))))

(defn print-pages [r page-ranges]
    ;; Each entry is [start end prefix single?]; pick the matching printer.
    (doseq [s page-ranges]
        (let [[start end prefix single?] s]
            ((if single?
                print-pages-single
                print-pages-range) r start end prefix))))

(let [r (Robot.)]
    (println "Starting in 1 second...")
    (doto r
        (.delay 1000) ; assumed: the one-second pause the message promises
        (print-pages-single 0 1 "0000_cover")
        (print-pages-single 1 30 "0000_prologue")
        (print-pages-range 1 1171 "")
        (print-pages-range 1 147 "A")
        (print-pages-range 1 11 "R")))
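
The batching inside print-pages-range is the least obvious part of the scraper, so here is the same logic as a small Python sketch. The function names are hypothetical and only for illustration; the behaviour follows the Clojure above, where the end of each range is exclusive and the last batch may be shorter than 10 pages.

```python
def page_batches(start, end, size=10):
    """Yield (first, last) page numbers for consecutive batches.

    Mirrors (partition 10 10 [] (range start end)) followed by
    (juxt first last): `end` is exclusive and the final batch may be short.
    """
    for first in range(start, end, size):
        last = min(first + size, end) - 1
        yield first, last


def range_label(first, last, prefix=""):
    # Same shape as the Clojure format string "%s%d-%s%d".
    return f"{prefix}{first}-{prefix}{last}"


if __name__ == "__main__":
    for first, last in page_batches(1, 25):
        print(range_label(first, last, "A"))
```

Each label is what gets typed into the print dialog's page-range box, so every trip through the dialog stays within the 10-page limit.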

My Clojure was pretty rusty, so the code is far from pretty, and I got around timing problems by adding more sleeps… but with some trial and error, it worked pretty well. Several coffee/tea/tinder breaks later (and several restarts of the scraper wherever it had broken for some reason), all the pages were living on my hard drive. Nice! Er, except the ones that didn’t get captured due to timing issues. A bit of Python magic found which pages weren’t grabbed correctly, though, and I was able to rerun the scraper on just those ranges to clean up the remnants. Overall, the process took around a day, which, while not ideal, wasn’t too bad. I experimented with taking regional screenshots to detect whether UI elements were actually ready instead of just guessing, but realistically, if I were doing this to more books and wanted it to be robust, I would look into cracking the .swf itself.
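
The Python check isn't reproduced in this post. A minimal sketch of one way to do it, assuming the captures are saved as <stem>.pdf using the file-name pattern from print-pages-range, and treating a missing or empty file as a failed capture (the function names and the empty-file rule are assumptions, not the original script):

```python
from pathlib import Path


def expected_stems(start, end, prefix="", size=10):
    """File stems the range scraper should have produced (end exclusive)."""
    stems = []
    for first in range(start, end, size):
        last = min(first + size, end) - 1
        # Same shape as the Clojure format string "%s%04d_%s%04d".
        stems.append(f"{prefix}{first:04d}_{prefix}{last:04d}")
    return stems


def missing_stems(expected, present):
    """Expected stems with no matching file, in original order."""
    present = set(present)
    return [stem for stem in expected if stem not in present]


def stems_on_disk(directory):
    # Assumption: captures are saved as <stem>.pdf; empty files count as failed.
    return [p.stem for p in Path(directory).glob("*.pdf") if p.stat().st_size > 0]
```

Something like `missing_stems(expected_stems(1, 1171), stems_on_disk("pdfs"))` would then list the ranges worth feeding back into the scraper.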

Now, we need to get a page into image form, so we can play around with it in GIMP. Once we get a process worked out, we can automate it with ImageMagick and process all 1500-odd pages of the book. Getting this image is easy: pick a page and run:

convert -density 300 -quality 100 $PDF -crop 1846x2306+208+322 out.png

to turn it into a high-quality PNG file.

Luckily page R11 was totally white, so converting it to a PNG yielded a clean, isolated copy of the watermark.

Dealing with the margins will be easy (we can just crop them out), so let’s focus on the watermark. I made everything but the watermark itself transparent in GIMP, and after that removing it from the original image is as simple as overlaying the cleaned-up version on the page and setting the watermark’s layer mode to divide.
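
Why would divide work? Here is a tiny per-pixel sketch in Python. It assumes the watermark was blended onto the page multiplicatively, which this post never verifies; the pixel values are made up for illustration.

```python
def multiply(page, mark):
    """Multiply blend: how a translucent gray watermark darkens a page."""
    return [p * m for p, m in zip(page, mark)]


def divide(image, mark, eps=1e-6):
    """Divide blend: image / mark, clamped to the valid [0, 1] range."""
    return [min(1.0, i / max(m, eps)) for i, m in zip(image, mark)]


# Gray levels in [0, 1]: 1.0 is white paper, 0.0 is black ink.
page = [1.0, 1.0, 0.0, 0.5]       # hypothetical clean pixels
mark = [1.0, 0.8, 0.8, 0.8]       # 1.0 = no watermark, 0.8 = light gray mark

stamped = multiply(page, mark)     # what the printed PDF would contain
restored = divide(stamped, mark)   # dividing by the mark undoes the multiply

assert all(abs(r - p) < 1e-9 for r, p in zip(restored, page))

# A blank (all-white) page multiplied by the mark is just the mark itself,
# which is why a totally white page gives an isolated copy of the watermark.
assert multiply([1.0] * 4, mark) == mark
```

Under the same assumption, anywhere the watermark layer is pure white the division is by 1.0 and the page is left untouched.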

Now we need to repeat our earlier PDF->PNG conversion for all the files. This wasn’t much harder with a dash of GNU parallel (an incredibly handy tool):

parallel convert -density 300 -quality 100 {} '~/book-imgs/{/.}-%d.png' ::: pdfs/*.pdf
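
The output pattern packs two substitutions together: {/.} is GNU parallel's replacement string for the input's basename with the extension removed, and %d is filled in by ImageMagick with the page number, counting from 0. A small Python sketch of the resulting names (the function and the sample file name are hypothetical):

```python
from pathlib import PurePosixPath


def png_names(pdf_path, page_count, out_dir="~/book-imgs"):
    """Names convert would write for one multi-page PDF.

    {/.} in GNU parallel is the input's basename minus its extension;
    %d in an ImageMagick output name is the page (scene) number, from 0.
    """
    stem = PurePosixPath(pdf_path).stem
    return [f"{out_dir}/{stem}-{n}.png" for n in range(page_count)]


if __name__ == "__main__":
    for name in png_names("pdfs/0001_0010.pdf", 3):
        print(name)
```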

Now, the fun part – automating the image-munging process! With the watermark image in watermark.png and a test page image named test.png, we can easily replicate our GIMP process in ImageMagick:

convert test.png watermark.png -compose Divide_Src -composite out.png

Ahh, shell is a wonderful thing.

And we have a nice, clean page… which I’d love to show you, but, copyrights. So just imagine a pristine (well, there are a few artifacts) textbook page, with no ugly watermarks. Ahhhh.

Now, let’s do this 1500 times! Time to reach for parallel again:

parallel convert {} -background white -fill white -draw "rectangle 160,163 1799,215" -draw "rectangle 160,2725 2342,2834" -flatten watermark.png -compose Divide_Src -composite "/home/jon/book-imgs-proc/{/}" ::: ~/book-imgs/*.png

(I could have used mogrify and done it in-place, but I wanted to keep a backup. The first parallel command took quite a while to run.)

This command seems a bit confusing, so I’ll break it into its constituent pieces.

The first part is -background white ... -flatten, which fills the transparent edges with white. I wanted the images to have an 8.5×11 ratio (because I’m a silly American), and it turns out they already were in that ratio – if I included the transparent part. No cropping required!

The next part is -fill white -draw "rectangle 160,163 1799,215" -draw "rectangle 160,2725 2342,2834". Since we aren’t cropping out the licensing text, I’m instead simply covering it up with some filled-white rectangles. The coordinates took a bit of tweaking, but it worked out pretty well.

Finally, after we -flatten and all the operations have happened on the source image, we can load the watermark image and divide by it as before: watermark.png -compose Divide_Src -composite "/home/jon/book-imgs-proc/{/}". To make everything work, I had to manually move the watermark up in GIMP to get it to align. Not sure why, but after I did this everything processed basically perfectly (it’s still not exactly aligned, so there are some very thin gray lines, but it’s good enough for me).
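
Since the order of the arguments matters here, the same command can be laid out as a Python argument list with one comment per stage. This is only a sketch for reading the command: the flags and coordinates are copied from the parallel invocation above, while the function name and the test.png/out.png file names are placeholders.

```python
import shlex


def clean_page_command(src, dst, watermark="watermark.png"):
    """argv for cleaning one page, in the same order as the command above."""
    return [
        "convert", src,
        # Background colour that -flatten will use for the transparent edges.
        "-background", "white",
        # Paint white boxes over the licensing text in the margins.
        "-fill", "white",
        "-draw", "rectangle 160,163 1799,215",
        "-draw", "rectangle 160,2725 2342,2834",
        # Merge everything so far into a single opaque page image.
        "-flatten",
        # Only now load the watermark and composite it in divide mode.
        watermark, "-compose", "Divide_Src", "-composite",
        dst,
    ]


if __name__ == "__main__":
    # Placeholder file names; print the command rather than running it.
    print(shlex.join(clean_page_command("test.png", "out.png")))
```

Keeping the watermark after -flatten means the white rectangles and background fill only ever touch the page, never the watermark layer.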

Now, we can use convert again (ImageMagick is so useful) to join the PNG images into a single PDF:

convert ~/book-imgs-proc/*.png book.pdf

Well, actually, convert likes to use… lots of RAM, so what I really ran was:

nice -n 19 convert -limit area 1GiB -limit memory 1GiB -limit map 1GiB ~/book-imgs-proc/*.png book.pdf

…and went out with a friend, then came back and read The C Programming Language for a bit, then browsed Hacker News for a bit… it ended up running for almost 3 hours but it chewed through all the pages eventually. The resulting PDF was more than 700 MiB.

Now we can use basically any PDF OCR tool to make the text searchable. If I had the motivation I could probably scrape the original text from the book to get it perfect, but I don’t care that much, so OCR it is.

I already have Ruby and the Tesseract OCR engine installed, so I just grabbed the one-script pdfocr tool from its GitHub repo. It needed one extra command installed for some reason, and then…

./pdfocr -t -i book.pdf -o book-ocr.pdf

…another hour or so of waiting later, my PDF was done! Basically 100% searchable and cleanly formatted. Compared to buying a “lifetime of edition” code, I saved at least $120, so I’m pretty happy with this project overall.


Ideaome – Mind-maps meet flow-charts, but social


Add interactive documentation to your JavaScript apps with Intro.js

Add easy-to-absorb, interactive user documentation to your JavaScript
apps with Intro.js. Learn from a sample tour implementation how to demonstrate
your application’s features the modern way from within the app’s


How to perform as a DJ on Ubuntu Linux with Mixxx

Linux and professional multimedia tools don’t exactly go together, and while we can use some great and very capable audio workstations like Ardour, there aren’t many audio mixers that DJs can use for their performances. If however you are a Linux user and you don’t want to resort to other operating systems every time that you need to play some music, here are your choices.


HP to ‘sunset’ Helion Public Cloud in 2016


HP is going to shut down its cloud business, and this time it means it. Seriously. The company’s executive, Bill Hilf, wrote a post on the HP blog, where he announced that the Helion service will be put out of its misery next year.

“We will sunset our HP Helion Public Cloud offering on January 31, 2016”, he writes. Instead, the company will focus on turning its hardware gear into the building blocks for private enterprise clouds.

“As we have before, we will help our customers design, build and run the best cloud environments suited to their needs — based on their workloads and their business and industry requirements”, he adds.

“To support this new model, we will continue to aggressively grow our partner ecosystem and integrate different public cloud environments. To enable this flexibility, we are helping customers build cloud-portable applications based on HP Helion OpenStack® and the HP Helion Development Platform”.

According to the blog post, Hewlett-Packard will expand its support for Amazon AWS and Microsoft Azure — two clouds that have run the IT titan out of town.

“We also support our PaaS customers wherever they want to run our Cloud Foundry platform — in their own private clouds, in our managed cloud, or in a large-scale public cloud such as AWS or Azure”.

“All of these are key elements in helping our customers transform into a hybrid, multi-cloud IT world”, Hilf boldly states.

“We will continue to innovate and grow in our areas of strength, we will continue to help our partners and to help develop the broader open cloud ecosystem, and we will continue to listen to our customers to understand how we can help them with their entire end-to-end IT strategies”.

Published under license from, a Net Communities Ltd Publication. All rights reserved.

Photo Credit: Saikom/Shutterstock


10 Linux GUI tools for sysadmins

Has administering Linux via the command line confounded you? Here are 10 GUI tools that might make your life as a Linux administrator much easier.


Ubuntu 15.10 ‘Wily Werewolf’ Released

LichtSpektren writes: Ubuntu 15.10 “Wily Werewolf” is now released and available, along with its alternative desktop flavors (MATE, Xfce, LXDE, GNOME, KDE, Kylin). This release features Linux 4.2, GCC 5, Python 3.5, and LibreOffice 5. The default version is still using the X.Org display server and Unity7; Mark Shuttleworth has said that Mir and Unity8 won’t arrive until Ubuntu 16.04 “Xenial Xerus.” Not much has changed beyond package updates, other than replacing the invisible overlay scrollbars in Nautilus with the GNOME 3 scrollbars.

Phoronix brings us the only bit of drama regarding this release: Jonathan Riddell, long time overseer of Kubuntu, has resigned with claims that Canonical has “defrauded donors and broke the copyright licenses.”
Another reader adds a link to a Q & A session with Riddell.


Read more of this story at Slashdot.


Deep into Drupal, Cisco starts to give back to open source community

 NetworkWorld: Cisco’s Jamal Haider acknowledged during a presentation this week that his team that works on the company’s open source-based customer support portal hasn’t given much back to the wider Drupal community yet

