At the beginning of this year, we quietly expanded TimesMachine, our virtual microfilm reader, to include every issue of The New York Times published between 1981 and 2002. Prior to this expansion, TimesMachine contained every issue published between 1851 and 1980, which consisted of over 11 million articles spread out over approximately 2.5 million pages. The new expansion adds an additional 8,035 complete issues containing 1.4 million articles over 1.6 million pages.
Creating and expanding TimesMachine presented us with several interesting technical challenges, and in this post we’ll describe how we tackled two. First, we’ll discuss the fundamental challenge with TimesMachine: efficiently providing a user with a scan of an entire day’s newspaper without requiring the download of hundreds of megabytes of data. Then, we’ll discuss a fascinating string matching problem we had to solve in order to include articles published after 1980 in TimesMachine.
The Archive, Pre-TimesMachine
Before TimesMachine was launched in 2014, articles from the archive were searchable and available to subscribers only as PDF documents. While the archive was accessible, two major problems in implementation remained: context and user experience.
Isolating an article from the surrounding content removes the context in which it was published. A modern reader might discover that on July 20, 1969, a man named John Fairfax became the first man to row across the Atlantic Ocean. However, a reader absorbed in The New York Times that morning might have been considerably more impressed by the front page news that Apollo 11, whose crew contained Neil Armstrong, had just swung into orbit around the moon in preparation for the first moon landing. Knowing where that John Fairfax article was published in the paper (bottom left of the front page) as well as what else was going on that day is much more interesting and valuable to a historian than an article on its own without the context of other news of the day.
We wanted to present the archive in all its glory as it was meant to be consumed on the day it was printed — one issue at a time. Our goal was to create a fluid viewing experience, not to force users to slowly download high resolution images. Here’s how we did that.
Our digitized print archive is big, containing petabytes of high-resolution page scans. Even for a single issue, the storage requirements are appreciable. The May 22, 1927 issue announcing the success of Charles Lindbergh’s pioneering trans-Atlantic flight consists of 226 pages which require nearly 200 megabytes of storage. When we built TimesMachine, we knew that there was no way we could expect users to sit through multi-hundred-megabyte downloads in order to browse a single issue. We needed a way to load just the parts of an issue that a user is looking at. We found an answer from a somewhat unexpected quarter and now, when you load that 200 megabyte Lindbergh issue in your browser, the initial page load requires the transmission of just a couple of megabytes.
We achieve this by using mapping software to display each issue. Like the pages of a scanned newspaper, a digital map is just a really big image. The technique most often used to display digital maps (and the same technique we employ for TimesMachine) is image tiling. With image tiling, a large image is broken down into a myriad of small square images, or “tiles,” computed at a variety of zoom levels.
Clever software then runs in the browser and loads only those tiles that correspond to the region of the image the user wants to see. Numerous open source software libraries have been created to make and display such tiles (we used GDAL for tile generation and leaflet.js for display). All we had to do was adapt these libraries to show you a newspaper. To do this we created a processing pipeline called The TimesMachine Publisher. Here’s how it works.
Whenever a user requests a day’s paper in TimesMachine, the client- side software downloads the JSON object describing the paper’s contents and requests only those tiles necessary to display the portion of the paper that fits into the user’s viewport. Additional data is loaded only when the user pans or zooms. Using this approach, TimesMachine delivers any day’s newspaper to the client quickly and efficiently.
We encountered a fascinating issue in our attempt to expand the number of issues in TimesMachine. Initially, TimesMachine contained only those articles published between 1851 and 1980. The exclusion of data from after 1980 stems from an interesting historic quirk of our archive. Starting around 1981, The Times began keeping an archive of the complete digital text of every article published in print. In order to expand TimesMachine beyond 1980 and include links to the full text, we needed to know how our scanned print archive and our digital text archive aligned. Here is how we figured this out.
The first step was to run optical character recognition (OCR) on articles in the scanned print archive to transcribe the text as cleanly as possible. We used tesseract-ocr for this.
Here’s an example of some nicely-OCRed text:
After doing this for every article in a single day’s issue, we ended up with a bucket of scanned print articles OCRed with tesseract, and a bucket of articles from the full text archive. We then had to figure out which articles matched up between these two buckets, which was an interesting process.
Because an OCRed article is seldom an exact match for its full text counterpart, we could not align articles by simply testing for string equality. Instead, we used fuzzy string matching. Our approach was applied one issue at a time and relied on a technique known as “shingling.” Using shingling, we transformed the text of articles in both datasets into a list of tokens, and then turned the list of tokens into a list of n-token sequences called “shingles.”
We’ll illustrate with this quote by Abraham Lincoln:
This is our full text. We tokenize it by splitting it into a list of words separated by spaces. The string “secret” is considered a token in the full text.
Now we convert the list of tokens into a list of “shingles,” which are sequences of tokens. If we use a shingle size of 4, we end up with the following: 5 lists of tokens. (As you can see, the contents of the lists overlap like shingles on a roof.)
When we generate the list of shingles for every article in the full text digital archive, we get something that looks like this:
It is a reasonable hypothesis that sequences of words from an OCRed article will overlap a fair amount with sequences of words in that same article in the full text archive. We want a list of articles that contain each shingle so we can narrow down our options. Iterating through the above list, we can transform our data into the following hash table:
Now that we have a mapping of all the shingles appearing in a given issue to all the full text articles from that issue containing each shingle, we repeat the first part of the process with the OCRed text, getting a list of shingles for each article.
Let’s say OCRed article_A consists of shingle_2 and shingle_5. We can use the table above to generate a list of article candidates that might be a “match” with article_A. By looking up shingle_2 and shingle_5 in the table, we conclude that article_1, article_2, article_2 and article_5 are all potential matches for article_A.
This greatly reduces the problem space. Now, instead of having to compare every OCRed article in an issue to every full text article in an issue, which could involve tens of thousands of computationally expensive comparisons, we need only compare a short list. This ends up reducing the number of comparisons by several orders of magnitude.
To quantify the difference between the OCRed data and the full text articles, we used the Python difflib library. It gave us a nice, clean result:
From this particular example, it is clear that OCRed article_A is most likely the same article as full text article_1.
Using this process, we could match approximately 80 percent of the articles. The remaining 20 percent did not have clear enough distinctions in scores, which required us to be a little more clever. In a perfect world, the relationship between our two buckets of articles would have been one-to-one, but in this world, it was actually many-to-many. Some full text articles were represented as multiple regions in the scanned archive, and some single regions in the scanned archive corresponded to multiple items in the full text archive. We reconciled the disparity by splitting the data into paragraphs and carrying out a similar process to the one described above, on the paragraph level.
We ended up with a near-perfect, many-to-many matching of zones to the full text archive which is wonderfully searchable. You can check it out by exploring the entire Times archive at timesmachine.nytimes.com.
Original URL: http://feedproxy.google.com/~r/feedsapi/BwPx/~3/Pgb5otZFMXE/