Build a user-facing OpenWhisk application with Bluemix and Node.js

Learn how to use OpenWhisk
to write user-facing applications. This tutorial covers the basics of
OpenWhisk through a simple application that you can build and expand to integrate
with a Cloudant database. The sample application uses a small Node.js stub to
allow it to be user-facing.

Original URL:

Original article

Canal Brings Fine-Grained Policy to DC/OS and Apache Mesos via CNI

When we launched Canal at CoreOS Fest last month, we indicated that this was the start of something bigger than just the constituent projects (Calico and Flannel). Today, at MesosCon, we unveiled another big step forward for Canal – support for Apache Mesos, bringing fine-grained network policy to the Mesos and DC/OS communities.

Didn’t Calico already support Apache Mesos and DC/OS?

Well, yes we did! (See demos here and here, for example). In fact, Calico was the first networking solution to implement the “IP per container” model. But we did so through the net-modules interface, which was specific to that platform. That means that the work we did on Canal, to combine Calico and Flannel via the Container Network Interface (CNI) that is used by Kubernetes, would not translate to Apache Mesos and DC/OS.

That is why we were excited to work with the team that has taken the CNI specification and implemented it for Apache Mesos. A huge shout-out here to the development team behind this leap forward, including Avinash Sridharan, Jie Yu and Qian Zhang, who are speaking today at MesosCon about their work.

Why should I care about this?

First, an ever-growing community is now embracing CNI as the de facto standard interface for cloud-native networking, and all the goodness of Canal (dynamic, fine-grained policy with the broadest choice of connectivity options) is now available to an even wider group of users. The consensus and momentum behind a simple, secure model for cloud-native networking is growing, and we’re proud to be part of this movement!

Second, it means that, as a Mesos or DC/OS user, you now have access to the powerful network policy capabilities developed in the Kubernetes community (see the Kubernetes Networking SIG blog post on this topic). Simply apply labels to tasks when launching them via Marathon, define the policies that apply to those labels (in the same way as you would with the new Kubernetes API – but passing them into Calico rather than Kubernetes), and sit back and watch Calico automagically maintain dynamic, distributed firewalls around every task. It’s that simple!

Awesome. When can I get it?

Support for CNI (and hence for Canal and Calico via CNI) will be generally available along with the upcoming Apache Mesos 1.0 release.

As always, there is still work to do – including integration with workloads running under Docker Engine (today just the Unified Containerizer is supported) and the Marathon UI for seamless operation in DC/OS – but we are excited at this milestone and hope you will be too. We look forward to hearing your thoughts over on Slack (get an invite if you’re not already a member) – we have a dedicated #Mesos channel for discussions of all things Apache Mesos and DC/OS related.

What about Docker?

If you’re wondering about CNI support for Docker… check out this proof of concept we worked on with our friends over at Weaveworks!… but for those of you who are using our libnetwork integration – don’t worry, Calico will continue to support that. While we’re excited about the momentum behind CNI, we don’t believe there will ever be just one way of doing things – and we’re committed to our mission of bringing simple, secure networking to all the major cloud platforms.

The truth about deep learning

Come on people, let’s get our shit together about deep learning. I’ve been studying and writing about DL for close to two years now, and it still amazes me how much misinformation surrounds this relatively complex learning algorithm.

This post is not about how Deep Learning is or is not over-hyped, as that is a well documented debate. Rather, it’s a jumping off point for a (hopefully) fresh, concise understanding of deep learning and its implications. This discussion/rant is somewhat off the cuff, but the whole point was to encourage those of us in the machine learning community to think clearly about deep learning. Let’s be bold and try to make some claims based on actual science about whether or not this technology will produce artificial intelligence. After all, aren’t we supposed to be the leaders in this field and the few that understand its intricacies and implications? With all of the news on artificial intelligence breakthroughs and non-industry commentators making rash conclusions about how deep learning will change the world, don’t we owe it to the world to at least have our shit together? It feels like most of us are just sitting around waiting for others to figure that out for us.

The Problem

Even the most academic among us mistakenly merge two very different schools of thought in our discussions on deep learning:

  1. The benefits of neural networks over other learning algorithms.
  2. The benefits of a “deep” neural network architecture over a “shallow” architecture.

Much of the debate going on is surprisingly still concerned with the first point instead of the second. Let’s be clear: the inspiration for, benefits of, and detriments of neural networks are all well documented in the literature. Why are we still talking about this like the discussion is new? Nothing is more frustrating when discussing deep learning than someone explaining their views on why deep neural networks are “modeled after how the human brain works” (much less true than the name suggests) and thus are “the key to unlocking true artificial intelligence”. This is an obvious straw man, since this discussion is essentially the same one that was produced when plain old neural networks were introduced.

The idea I’d like for you to take away here is that we are not asking the right question to get the answer we want. If we want to know how one can contextualize deep neural networks in the ever-increasing artificially intelligent world, we must answer the following question: what does increasing computing power and adding layers to a neural network actually allow us to do better than a normal neural network? Answering this could yield a truly fruitful discussion on deep learning.

The Answer (?)

Here is my personal answer to the second question: deep neural networks are more useful than traditional neural networks for two reasons:

  1. The automatic encoding of features which previously had to be hand engineered.
  2. The exploitation of structurally/spatially associated features.

At the risk of sounding bold, that’s it — if you believe there is another benefit which is not somehow encompassed by these two traits, please let me know. These are the only two that I have come across in all my time working with deep learning.
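As a minimal sketch of the second point, assume a one-dimensional signal and a hand-picked two-weight filter (both hypothetical, not taken from any real network). The pure-Python convolution below shows how one small set of shared weights responds to the same local pattern wherever it occurs in the input, which is the kind of spatial structure a plain fully connected layer would have to relearn at every position.

```python
def conv1d(signal, kernel):
    """Slide `kernel` over `signal` (no padding, no kernel flip)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# Hypothetical edge detector: responds to a step from low to high.
edge_kernel = [-1, 1]

early_step = [0, 1, 1, 1, 1, 1]  # step near the start
late_step = [0, 0, 0, 0, 1, 1]   # the same step, shifted to the right

early_response = conv1d(early_step, edge_kernel)
late_response = conv1d(late_step, edge_kernel)

# The same two weights find the step in both signals; only the
# position of the peak response moves.
print(early_response.index(max(early_response)))
print(late_response.index(max(late_response)))
```

A real convolutional layer learns its filters from data rather than having them written by hand, which ties back to the first point about automatic feature encoding.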

If this were true, what would we expect to see in the academic landscape? We might expect that deep neural networks would be useful in situations where the data has some spatial qualities that can be exploited, such as image data, audio data, natural language processing, etc. Although we might say there are many areas that could benefit from that spatial exploitation, we would certainly not find that this algorithm was a magical cure for any data that you throw at it. The words deep learning will not magically cure cancer (unless you find some way to spatially exploit data associated with cancer, as has been done with the human genome), and there is no threat that it will start thinking and become sentient. We might see self-driving cars that assist in simply keeping the car between the lines, but not one which can decide whether to protect its own driver or the pedestrian walking in the street. Hell, even those that actually read the papers on AlphaGo will realize that deep learning was simply a tool used by traditional AI algorithms. Lastly, we might find, once again, that the golden mean is generally spot on, and that deep learning is not the answer to all machine learning problems, but also not completely baseless.

Since I am feeling especially bold, I will make another prediction: deep learning will not produce the universal algorithm. There is simply not enough there to create such a complex system. However, deep learning is an extremely useful tool. Where will it be most useful in AI? I predict it will be as a sensory learning system (vision, audio, etc) that exploits spatial characteristics in data that otherwise go unaccounted for, and which, as in AlphaGo, must be used as an input by a truly artificially intelligent system.

Stop studying deep learning thinking it will lay all other algorithms to waste no matter the scenario. Stop throwing deep learning at every dataset you see. Start experimenting with these technologies outside of the “hello world” examples in the packages you use — you will quickly learn what they are actually useful for. Most of all, let’s stop viewing deep learning as the “almost there!!!” universal algorithm and start viewing it for what it truly is: a tool that is useful in assisting a computer’s ability to perceive.

CoreOS Launches Torus, a New Open Source Distributed Storage System

CoreOS on Wednesday launched Torus, an open source project that provides storage primitives designed for cloud-native apps and can be deployed like a containerized app via Kubernetes. With Torus, startups and enterprises get access to the same kind of technologies that web-scale companies such as Google already use internally. NetworkWorld reports: Torus is deployed by Kubernetes, side by side with the apps to which it provides storage, and it uses Kubernetes’s Flexvolume plugin to allow dynamic mounting of volumes for nodes in the cluster. This allows, for example, PostgreSQL to run atop Torus storage volumes. Torus also demonstrates how CoreOS is working on what happens around containers, not only what happens inside them. A key part of Torus is etcd, a distributed key/value store used by CoreOS to automatically keep configuration data consistent across all machines in a cluster. In Torus, etcd is used to store and replicate metadata for all the files and objects stored in the pool.

Unlimited Copying Versus Legal Publishing

In John Willinsky’s Scholarly Publishing Has Its Napster Moment, it’s clear that unlimited “napster”-like copying was a challenge to academic publishing, and notably to some of the large academic publishing houses that dominate legal publishing.

The situations are similar, and worldwide legal publishing seems just as concentrated, as noted by Gary Rodrigues. It’s not, however, clear if the risks are the same in the legal-publishing world, or if they apply to (law-)books.

The Common Bits

Legal publishing starts out very similarly to academic publishing, with an author who is paid for the work he does, but not for publishing it. In academic publishing, a researcher writes a paper to describe his work, while in legal publishing, a lawyer is paid to write an argument and a judge a decision. In both, a significant business grew up in printing and distributing the articles and decisions in journals which were then sold to libraries.

Academic and law libraries struggle to be able to pay for the journals they need, often taxing their patrons to augment their funding. The problem is horrid in academia, but a bit less so in a profit-making business like the law. A law office can make library costs part of their overhead, and some of the cost can legitimately be billed to the customer.

What’s not similar in legal publishing is the degree of indexing, cross-referencing and just plain human judgment provided by the publishers along with the decision.

What’s different: Value Added

In a headnote alone, two extra services are provided: a summary of any important decisions or rules within the case and a syllabus of all the points decided. Most publishers with an on-line service also provide links for each of the other cases cited in the body of the case, and significant on-line services to build on those linkages.

When one goes from articles and judgments to books, there’s even more work done by the publisher. The indexing, tables of cases and other supporting matter are not charged for: they’re just part of the deal. The massive effort to write Gold’s[1] or a Halsbury’s had to come out of the book sales.

All of this is at risk, especially when you’re publishing articles or decisions that are of moderate size. I can download a paper at my convenience, read it, search it for keywords and print it on a perfectly ordinary office printer to refer to later.


The saving grace for a publisher is that the marketplace for Canadian case-law is a small and unusual one: it consists of people who are preferentially law-abiding(!)

The law is also a business with reasonable margins and the ability to bill customers for useful third-party services, such as “forty minutes researching on quicklaw”, legitimately spent to save multiple hours of manual effort.

This speaks to articles, reports and decisions in the law, but what about books?

Books or, Given Lemons, Make Lemonade

My publisher, O’Reilly, faced a similar risk with technical books such as Using Samba. They had to invest a significant sum of money in pre-production work and illustrations for everything they printed, even if it didn’t sell. They didn’t have to pay the author for a failure, but everything else cost money, effort, printing time and warehouse space. To them, free copying looked like the end of their business.

My editor, Andy Oram, however, looked at the risk of copying and turned it on its head. He negotiated a deal where every copy of the Samba program included a copy of Using Samba, so if you wanted a copy, you would get one as part of the normal free download.

There were no limitations on distribution or personal printing, and the license reserved only commercial printing rights to the publisher. The net result is that the book was widely used as a reference on Samba, at no apparent benefit to O’Reilly.

What surprised the rest of us was that the on-line readers bought the physical book in great numbers. We went from the third-best-selling book on the subject to the first in a matter of weeks, and the book was one of O’Reilly’s best sellers for the year.

Readers buy books. To be precise, on-line readers buy printed books. They value their convenient form, they make notes in the margins and they lend them to friends. They preferentially buy books that are available on-line, partly because they know they’re not buying a “pig in a poke”, and additionally because the on-line copies are searchable, and in effect serve as a superior index into the printed copies.

Andy knew that, and sold a lot of copies of Using Samba. O’Reilly subsequently found ways to provide a search service for their books as well as free samples, all as part of their on-line offering, Safari. Other publishers noticed that, and have found variations that work for both fiction and non-fiction.

The “Unfair Advantage” of Physical Books

Part of the reason that people will buy books is that they’re of a convenient physical size. Samba is 388 pages, and when printed on book-weight paper, is just under an inch thick. Gold’s at 1600 pages is only a bit less than two and a half inches thick, courtesy of being printed on amazing super-thin paper.

Printing Samba on ordinary office-printer paper is a total waste of time and materials: you get an impractical wodge. The galleys were more than four inches thick, on thinner-than-everyday paper. Your own copy of Gold’s from the office printer would be unusable by anyone smaller than a professional weight-lifter.

The Fair Advantage of Topicality

Gold’s is an annual. Its subject matter changes quietly each year, and much of the work that went into writing headnotes and articles about important cases contributes to making it possible to update it each year. It makes good sense to subscribe to it, and get a new copy each year.

Taken together, the unusual marketplace, the weight of an ill-printed book and the value of regular updates give the traditional print publishers a worthwhile product even if the contents can be copied easily and without cost.

However, the printed-book advantage only lasts until someone sets up a bootleg print-on-demand service. After that, the dynamics of publishing change.


At the moment, we have a dynamic equilibrium, and our traditional large-scale publishers continue to support Canadian legal publishing. Dynamic equilibria change, though, when circumstances or technology change.

At some point, a publisher who is losing their shirt on academic articles could stop doing any articles and reports. Another publisher might stop checking the copyrights on single-copy print-on-demand jobs.

Until we have our own Andy Oram and turn the ease of copying to the advantage of our publishers, we as their customers could be harmed by a failure in our very centralized, very traditional publishing structures.

In my view, it’s time to innovate. Print-on-demand looseleafs, anyone?


[1] Alan D. Gold, The Practitioner’s Criminal Code. Markham, ON (LexisNexis)
A copy of the 2010 edition has pride of place on my bookshelf, right beside Donald Knuth’s The Art of Computer Programming.

Introducing DeepText: Facebook’s text understanding engine

Text is a prevalent form of communication on Facebook. Understanding the various ways text is used on Facebook can help us improve people’s experiences with our products, whether we’re surfacing more of the content that people want to see or filtering out undesirable content like spam.

With this goal in mind, we built DeepText, a deep learning-based text understanding engine that can understand with near-human accuracy the textual content of several thousand posts per second, spanning more than 20 languages.

DeepText leverages several deep neural network architectures, including convolutional and recurrent neural nets, and can perform word-level and character-level based learning. We use FBLearner Flow and Torch for model training. Trained models are served with a click of a button through the FBLearner Predictor platform, which provides a scalable and reliable model distribution infrastructure. Facebook engineers can easily build new DeepText models through the self-serve architecture that DeepText provides.

Why deep learning

Text understanding includes multiple tasks, such as general classification to determine what a post is about — basketball, for example — and recognition of entities, like the names of players, stats from a game, and other meaningful information. But to get closer to how humans understand text, we need to teach the computer to understand things like slang and word-sense disambiguation. As an example, if someone says, “I like blackberry,” does that mean the fruit or the device?

Text understanding on Facebook requires solving tricky scaling and language challenges where traditional NLP techniques are not effective. Using deep learning, we are able to understand text better across multiple languages and use labeled data much more efficiently than traditional NLP techniques. DeepText has built on and extended ideas in deep learning that were originally developed in papers by Ronan Collobert and Yann LeCun from Facebook AI Research.

Understanding more languages faster

The community on Facebook is truly global, so it’s important for DeepText to understand as many languages as possible. Traditional NLP techniques require extensive preprocessing logic built on intricate engineering and language knowledge. There are also variations within each language, as people use slang and different spellings to communicate the same idea. Using deep learning, we can reduce the reliance on language-dependent knowledge, as the system can learn from text with no or little preprocessing. This helps us span multiple languages quickly, with minimal engineering effort.

Deeper understanding

In traditional NLP approaches, words are converted into a format that a computer algorithm can learn. The word “brother” might be assigned an integer ID such as 4598, while the word “bro” becomes another integer, like 986665. This representation requires each word to be seen with exact spellings in the training data to be understood.

With deep learning, we can instead use “word embeddings,” a mathematical concept that preserves the semantic relationship among words. So, when calculated properly, we can see that the word embeddings of “brother” and “bro” are close in space. This type of representation allows us to capture the deeper semantic meaning of words.

Using word embeddings, we can also understand the same semantics across multiple languages, despite differences in the surface form. As an example, for English and Spanish, “happy birthday” and “feliz cumpleaños” should be very close to each other in the common embedding space. By mapping words and phrases into a common embedding space, DeepText is capable of building models that are language-agnostic.
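The contrast between integer IDs and embeddings can be sketched in a few lines of Python. The vectors below are hand-picked, hypothetical three-dimensional values chosen only for illustration; real embeddings are learned from data and have many more dimensions. Nothing here reflects how DeepText itself is implemented.

```python
import math

# Integer IDs as in the traditional representation: the numbers
# carry no notion of similarity between "brother" and "bro".
word_ids = {"brother": 4598, "bro": 986665}

# Hypothetical embeddings in one shared space (illustrative values only).
embeddings = {
    "brother": [0.90, 0.10, 0.00],
    "bro": [0.85, 0.15, 0.05],
    "happy birthday": [0.00, 0.80, 0.60],
    "feliz cumpleaños": [0.05, 0.75, 0.65],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest(term):
    """Return the other term whose embedding is closest to `term`."""
    others = (t for t in embeddings if t != term)
    return max(
        others,
        key=lambda t: cosine_similarity(embeddings[term], embeddings[t]),
    )

print(nearest("bro"))
print(nearest("feliz cumpleaños"))
```

With IDs, a model has to see "bro" itself in training to know anything about it. With embeddings, whatever was learned about a nearby vector carries over, including across languages when both share the same space.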

Labeled data scarcity

Written language, despite the variations mentioned above, has a lot of structure that can be extracted from unlabeled text using unsupervised learning and captured in embeddings. Deep learning offers a good framework to leverage these embeddings and refine them further using small labeled data sets. This is a significant advantage over traditional methods, which often require large amounts of human-labeled data that are inefficient to generate and difficult to adapt to new tasks. In many cases, this combination of unsupervised learning and supervised learning significantly improves performance, as it compensates for the scarcity of labeled data sets.

Exploring DeepText on Facebook

DeepText is already being tested on some Facebook experiences. In the case of Messenger, for example, it is beginning to give us a better understanding of when someone might want to go somewhere. DeepText is used for intent detection and entity extraction to help realize that a person is not looking for a taxi when he or she says something like, “I just came out of the taxi,” as opposed to “I need a ride.”

We’re also beginning to use high-accuracy, multi-language DeepText models to help people find the right tools for their purpose. For example, someone could write a post that says, “I would like to sell my old bike for $200, anyone interested?” DeepText would be able to detect that the post is about selling something, extract the meaningful information such as the object being sold and its price, and prompt the seller to use existing tools that make these transactions easier through Facebook.

DeepText has the potential to further improve Facebook experiences by understanding posts better to extract intent, sentiment, and entities (e.g., people, places, events), using mixed content signals like text and images, and automating the removal of objectionable content like spam. Many celebrities and public figures use Facebook to start conversations with the public. These conversations often draw hundreds or even thousands of comments. Finding the most relevant comments in multiple languages while maintaining comment quality is currently a challenge. One additional challenge that DeepText may be able to address is surfacing the most relevant or high-quality comments.

Next steps

We are continuing to advance DeepText technology and its applications in collaboration with the Facebook AI Research group. Here are some examples.

Better understanding people’s interests

Part of personalizing people’s experiences on Facebook is recommending content that is relevant to their interests. In order to do this, we must be able to map any given text to a particular topic, which requires massive amounts of labeled data.

While such data sets are hard to produce manually, we are testing the ability to generate large data sets with semi-supervised labels using public Facebook pages. It’s reasonable to assume that the posts on these pages will represent a dedicated topic — for example, posts on the Steelers page will contain text about the Steelers football team. Using this content, we train a general interest classifier we call PageSpace, which uses DeepText as its underlying technology. In turn, this could further improve the text understanding system across other Facebook experiences.
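The labeling idea can be illustrated with a small Python sketch. It assumes each post is tagged with the page it came from and that a page-to-topic mapping already exists; the page names, topics, and the `weak_labels` helper are all hypothetical and are not a description of Facebook's pipeline.

```python
# Hypothetical page -> topic mapping; all names are illustrative.
page_topics = {"Steelers": "football", "NASA": "space"}

posts = [
    {"page": "Steelers", "text": "Great win on Sunday!"},
    {"page": "NASA", "text": "New images from the rover."},
    {"page": "Unknown Page", "text": "Hello world"},
]

def weak_labels(posts, page_topics):
    """Label each post with its page's topic; skip pages with no known topic."""
    return [
        (post["text"], page_topics[post["page"]])
        for post in posts
        if post["page"] in page_topics
    ]

training_examples = weak_labels(posts, page_topics)
print(training_examples)
```

The labels are noisy, since not every post on a page is about the page's topic, but they cost nothing to produce, which is the trade-off the paragraph above describes.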

Joint understanding of textual and visual content

Often people post images or videos and also describe them using some related text. In many of those cases, understanding intent requires understanding both textual and visual content together. As an example, a friend may post a photo of his or her new baby with the text “Day 25.” The combination of the image and text makes it clear that the intent here is to share family news. We are working with Facebook’s visual content understanding teams to build new deep learning architectures that learn intent jointly from textual and visual inputs.

New deep neural network architectures

We continue to develop and investigate new deep neural network architectures. Bidirectional recurrent neural nets (BRNNs) show promising results, as they aim to capture both contextual dependencies between words through recurrence and position-invariant semantics through convolution. We have observed that BRNNs achieve lower error rates than regular convolutional or recurrent neural nets for classification; in some cases the error rates are as low as 20 percent.

While applying deep learning techniques to text understanding will continue to enhance Facebook products and experiences, the reverse is also true. The unstructured data on Facebook presents a unique opportunity for text understanding systems to learn automatically from language as it is naturally used by people across multiple languages, which will further advance the state of the art in natural language processing.

Monitoring Cassandra at Scale

At Yelp we leverage
Cassandra to
fulfill a diverse workload that seems to combine every consistency and
availability tradeoff imaginable. It is a fantastically versatile datastore,
and a great complement for our developers to our MySQL and Elasticsearch
offerings. However, our infrastructure is not done until it ships and is
monitored. When we started deploying Cassandra we immediately started looking
for ways to properly monitor the datastore so that we could alert developers
and operators of issues with their clusters before cluster issues became
site issues. Distributed datastores like Cassandra are built to deal with
failure, but our monitoring solution had to be robust enough to differentiate
between routine failure and potentially catastrophic failure.

Monitoring Challenges

We access Cassandra through our standard NoSQL proxy layer
which gives us great application level performance and availability metrics for
free via our uwsgi metrics
framework. This immediately gives us the capability to alert developers when
their Cassandra based application is slow or unavailable. Couple these pageable
events with reporting the
out of the box
JMX metrics on query timing to customers and the monitoring story is starting
to look pretty good for consumers of Cassandra.

The main difficulty we faced was monitoring the state of the entire database
for operators of Cassandra. Some of the
metrics exposed
over JMX are useful to operators. There are a lot of resources
online to learn which JMX metrics are most relevant to operators, so I won’t
cover them here. Unfortunately, most of the advanced cluster state monitoring
is built into nodetool,
which is useful for an operator tending to a specific cluster but does not
scale well to multiple clusters with a distributed DevOps ownership model where
teams are responsible for their own clusters. For example, how would one
robustly integrate the output of nodetool with
Nagios or Sensu?
OpsCenter is
closer to what we need, especially if you pay for the enterprise edition, but
the reality is that this option is expensive, does not monitor ring health in
the way we want and does not (yet) support 2.2 clusters.

We need to be able to determine a datastore is going to fail before it
fails. Good datastores have warning signs, but it’s a matter of identifying
them. In our experience, the JMX metrics monitoring technique works well when
you’re having a performance or availability problem isolated to one or two
nodes. The technique falls flat, however, when trying to differentiate between
an innocuous single node failure impacting a keyspace with a replication factor
of five and a potentially critical single node failure impacting a keyspace
with a replication factor of two.
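The difference between those two failures comes down to simple arithmetic. As a rough sketch, assume both keyspaces are read and written at quorum, where a quorum is a strict majority of the replicas, floor(RF / 2) + 1. The helper names below are illustrative.

```python
def quorum(replication_factor):
    """Replicas that must respond for a quorum: a strict majority."""
    return replication_factor // 2 + 1

def tolerable_failures(replication_factor):
    """Replicas of a token range that can be down while quorum still succeeds."""
    return replication_factor - quorum(replication_factor)

# Print replication factor, quorum size, and how many replicas may be lost.
for rf in (2, 3, 5):
    print(rf, quorum(rf), tolerable_failures(rf))
```

Under that assumption a keyspace with a replication factor of five can lose two replicas of a range and keep serving quorum operations, while a keyspace with a replication factor of two cannot lose any.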

Finding Cassandra’s Warning Signs

Cassandra uses a ring topology to store data. This topology divides the
database into contiguous ranges and assigns each of the ranges to a set of
nodes, called replicas. Consumers query the datastore with an associated
consistency level
which indicates to Cassandra how many of these replicas must participate when
answering a query. For example, a keyspace might have a replication factor of
3, which means that every piece of data is replicated to three nodes. When a
query is issued at LOCAL_QUORUM, we need to contact at least ⅔ of the
replicas. If a single host in the cluster is down, the cluster can still
satisfy operations, but if two nodes fail some ranges of the ring will become
unavailable.

Figure 1 is a basic visualization of how data is mapped onto a single Cassandra
ring with virtual nodes. The figure elides details of datacenter and rack
awareness, and does not illustrate the typically much higher number of tokens
than nodes. However, it is sufficient to explain our monitoring approach.

Figure 1: A Healthy Ring

In this case we have four nodes, each with three “virtual” nodes (a.k.a.
vnodes) and the keyspace has a replication factor of three. For the sake of
explanation, we assume that we only have twelve ranges of data and data
replicates to the three closest virtual nodes in a clockwise fashion. For
example, if a key falls in token range 9, it is stored on Node A, Node B
and Node C. When all physical hosts are healthy, all token ranges have all
three replicas available, indicated by the threes on the inside of the ring.
When a single node fails, say Node A, we lose a replica of nine
token ranges because any token range that would replicate to Node A is
impacted. For example, a key that would map to token range 8 would typically
replicate to Node D, Node A and Node B but it cannot replicate to
Node A because Node A is down. This is illustrated in Figure 2.

Figure 2: Single Node Failure

At this point we can still execute operations at LOCAL_QUORUM because ⅔ of
replicas are still available for all token ranges, but if we were to lose
another node, say Node C, we would lose a second replica of six token ranges
as shown in Figure 3.

Figure 3: Two Node Failures

This means that any key which exists on those six token ranges is unavailable
at LOCAL_QUORUM, while any keys not in those ranges are still available.
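The counts in this walkthrough can be reproduced with a toy model of the ring. The sketch below assumes the vnodes alternate A, B, C, D around the ring, which is consistent with the two examples given in the text (range 9 on A, B, C and range 8 on D, A, B) but may not match the exact layout drawn in the figures. All names are illustrative.

```python
NODES = ["A", "B", "C", "D"]
RANGES = range(1, 13)  # twelve token ranges, as in the walkthrough
REPLICATION_FACTOR = 3

def replicas(token_range):
    """Nodes holding `token_range`: its owner plus the next vnodes clockwise.

    Assumes vnodes alternate A, B, C, D around the ring.
    """
    start = (token_range - 1) % len(NODES)
    return [NODES[(start + k) % len(NODES)] for k in range(REPLICATION_FACTOR)]

def live_replica_counts(down):
    """Map each token range to how many of its replicas are still up."""
    return {r: sum(node not in down for node in replicas(r)) for r in RANGES}

one_down = live_replica_counts({"A"})
two_down = live_replica_counts({"A", "C"})

needed_for_quorum = REPLICATION_FACTOR // 2 + 1
ranges_missing_a_replica = [r for r, n in one_down.items() if n < REPLICATION_FACTOR]
ranges_below_quorum = [r for r, n in two_down.items() if n < needed_for_quorum]

print(len(ranges_missing_a_replica))  # ranges affected by losing Node A
print(len(ranges_below_quorum))       # ranges unavailable at quorum after losing A and C
```

In this model, losing Node A touches nine ranges and losing Node C as well drops six ranges below quorum, matching the numbers in the text.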

This understanding allows us to check if a cluster is unavailable for a
particular consistency level of operations by inspecting the ring and verifying
that all ranges have enough nodes in the “UP” state. It is important to note
that client side metrics are not sufficient to tell if a single additional node
failure will prevent queries from completing, because the client operations are
binary: they either succeed or not. We can tell they are failing, but can’t
see the warning signs before failure.

Monitoring Cassandra’s Warning Signs

Under the hood, nodetool uses a JMX interface to retrieve information like
ring topology, so with some sleuthing in the nodetool and Cassandra source
code we can find the mbeans that expose this information.

In order to programmatically access these mbeans we install
jolokia on all of our Cassandra clusters. An HTTP
interface to Cassandra’s mbeans is extremely powerful for allowing quick
iteration on automation and monitoring. For instance, our monitoring script can
be as simple as (pseudocode):
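
A minimal sketch of such a check, in Python rather than pseudocode. It assumes the ring has already been fetched (for example over jolokia) into a plain mapping from token range to the UP/DOWN state of each replica. The function names and the input format are hypothetical, and the fetching step is left out.

```python
def required_replicas(consistency_level, replication_factor):
    """Live replicas a token range needs for operations at this level to succeed."""
    if consistency_level == "LOCAL_ONE":
        return 1
    if consistency_level == "LOCAL_QUORUM":
        return replication_factor // 2 + 1
    if consistency_level == "ALL":
        return replication_factor
    raise ValueError("unsupported consistency level: %s" % consistency_level)


def unavailable_ranges(ring, consistency_level, replication_factor):
    """Token ranges with too few replicas in the UP state.

    `ring` maps a token range to the UP (True) / DOWN (False) state of each
    of its replicas; how it is read from the mbeans is not shown here.
    """
    needed = required_replicas(consistency_level, replication_factor)
    return [r for r, states in ring.items() if sum(states) < needed]


# Hypothetical ring: range 1 is fully replicated, range 2 has lost two replicas.
ring = {1: [True, True, True], 2: [True, False, False]}
print(unavailable_ranges(ring, "LOCAL_ONE", 3))     # []
print(unavailable_ranges(ring, "LOCAL_QUORUM", 3))  # [2]
```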


Often our applications using Cassandra read and write at LOCAL_ONE, which
means that we can get robust monitoring by deploying the above check twice: the
first monitoring LOCAL_ONE that pages operators, and the second monitoring
LOCAL_QUORUM that cuts a ticket on operators. When running with a replication
factor of three, this allows us to lose one node without alerting anyone,
ticket after losing two (imminently unavailable), and page upon losing all
replicas (unavailable).

This approach is very flexible because we can find the highest consistency
level of any operation against a given cluster and then tailor our monitoring
to check the cluster at the appropriate consistency level. For example, if the
application does reads and writes at quorum we would ticket after LOCAL_ALL
and page on LOCAL_QUORUM. At Yelp, any cluster owner can control these
alerting thresholds individually.
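
The "one level higher" rule can be sketched as a small decision function. This is an illustration and not our production check: the names are hypothetical, the strictest level is spelled ALL here, and the input is the smallest number of live replicas found on any token range.

```python
LEVELS = ["LOCAL_ONE", "LOCAL_QUORUM", "ALL"]  # ordered from weakest to strictest


def replicas_needed(level, replication_factor):
    return {
        "LOCAL_ONE": 1,
        "LOCAL_QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }[level]


def alert_for(min_live_replicas, query_level, replication_factor=3):
    """Return 'page', 'ticket' or 'ok' for the worst-off token range."""
    if min_live_replicas < replicas_needed(query_level, replication_factor):
        return "page"  # queries at this level are already failing
    next_index = min(LEVELS.index(query_level) + 1, len(LEVELS) - 1)
    if min_live_replicas < replicas_needed(LEVELS[next_index], replication_factor):
        return "ticket"  # one more failure would make the cluster unavailable
    return "ok"


# Application querying at LOCAL_ONE with a replication factor of three:
for live in (3, 2, 1, 0):
    print(live, alert_for(live, "LOCAL_ONE"))
```

With three replicas this prints ok, ok, ticket and page as nodes are lost, which matches the thresholds described above.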

A Working Example

The real solution has to be slightly more complicated because of
cross-datacenter replication and flexible consistency levels. To demonstrate
that this approach really works, we can set up a three-node Cassandra cluster with two
keyspaces, one with a replication factor of one (blog_1) and one with a
replication factor of three (blog_3). In this configuration each node has the
default 256 vnodes.
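
For this demo cluster the expected counts can be worked out by hand. The sketch below assumes a single datacenter, 256 token ranges owned by each node, and replicas placed on distinct nodes; the helper name is illustrative.

```python
NODES = 3
VNODES_PER_NODE = 256
TOTAL_RANGES = NODES * VNODES_PER_NODE  # 768 token ranges in the ring


def expected_unavailable(replication_factor, nodes_down, replicas_needed):
    """Token ranges with fewer than `replicas_needed` live replicas."""
    if replication_factor == 1:
        # blog_1: every range lives on exactly one node, so each down node
        # takes its own 256 ranges with it.
        return nodes_down * VNODES_PER_NODE
    if replication_factor == NODES:
        # blog_3: every node holds a replica of every range.
        live = NODES - nodes_down
        return TOTAL_RANGES if live < replicas_needed else 0
    raise NotImplementedError("only the two demo keyspaces are covered")


for down in range(NODES + 1):
    print(down,
          expected_unavailable(1, down, 1),   # blog_1 at LOCAL_ONE
          expected_unavailable(3, down, 1),   # blog_3 at LOCAL_ONE
          expected_unavailable(3, down, 2))   # blog_3 at LOCAL_QUORUM
```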


We like to write our monitoring scripts in Python because we can integrate
seamlessly with our pysensu-yelp
library for emitting alerts to Sensu, but for this
demo I’ve created a simplified
monitoring script
that inspects the ring and exits with a status code that conforms to the
Nagios Plugin API.
As we remove nodes we can use this script to see how we gradually lose the
ability to operate at certain consistency levels. We can also check that the
number of under-replicated ranges matches our understanding of vnodes and replication.
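
For reference, the Nagios Plugin API communicates the result through the exit status: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. A minimal sketch of that last step follows; the function name and message wording are hypothetical and this is not the simplified monitoring script itself.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def nagios_status(under_replicated_ranges, consistency_level):
    """Translate a count of under-replicated ranges into (exit code, message)."""
    if under_replicated_ranges is None:
        # The ring could not be inspected at all.
        return UNKNOWN, "UNKNOWN: could not read ring state"
    if under_replicated_ranges > 0:
        return CRITICAL, "CRITICAL: %d ranges unavailable at %s" % (
            under_replicated_ranges, consistency_level)
    return OK, "OK: all ranges available at %s" % consistency_level


code, message = nagios_status(256, "LOCAL_ONE")
print(code, message)
# A real check would finish with sys.exit(code) so the scheduler sees the status.
```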


Now it’s just a matter of tuning the monitoring to look for one level of
consistency higher than what we actually query at, and we have achieved robust
monitoring.

An important thing to understand is that this script probably won’t “just work”
in your infrastructure, especially if you are not using jolokia, but it may be
useful as a template for writing your own robust monitoring.

Take it to 11, What’s Next?

Our SREs can sleep better at night knowing that our Cassandra clusters will
give us warning before they bring down the website, but at Yelp we always ask,
“what’s next?”

Once we can reliably monitor ring health, we can use this capability to
further automate our Cassandra clusters. For example, we’ve already used
this monitoring strategy to enable robust rolling restarts that ensure ring
health at every step of the restart. A project we’re currently working on is
combining this information with
autoscaling events
to be able to intelligently react to hardware failure in an automated fashion.
Another logical next step is to automatically deduce the consistency levels to
monitor by hooking into our Apollo proxy layer and updating our monitoring from
the live stream of queries against keyspaces. This way if we change the queries
to a different consistency level, the monitoring follows along.

Furthermore, if this approach proves useful long term, it is fairly easy to
integrate it into nodetool directly, e.g.
nodetool health .


Original URL:

Original article

Samba Server installation on Ubuntu 16.04

This guide explains the installation and configuration of a Samba server on Ubuntu 16.04 with anonymous and secured Samba shares. Samba is an Open Source/Free Software suite that provides seamless file and print services to SMB/CIFS clients. Samba is freely available, unlike other SMB/CIFS implementations, and allows for interoperability between Linux/Unix servers and Windows-based clients.

Original URL:

Original article

Bayesian Deep Learning

Neural Networks in PyMC3 estimated with Variational Inference

(c) 2016 by Thomas Wiecki

Current trends in Machine Learning

There are currently three big trends in machine learning: Probabilistic Programming, Deep Learning and “Big Data”. Inside of PP, a lot of innovation is in making things scale using Variational Inference. In this blog post, I will show how to use Variational Inference in PyMC3 to fit a simple Bayesian Neural Network. I will also discuss how bridging Probabilistic Programming and Deep Learning can open up very interesting avenues to explore in future research.

Probabilistic Programming at scale

Probabilistic Programming allows very flexible creation of custom probabilistic models and is mainly concerned with insight and learning from your data. The approach is inherently Bayesian so we can specify priors to inform and constrain our models and get uncertainty estimation in the form of a posterior distribution. Using MCMC sampling algorithms we can draw samples from this posterior to very flexibly estimate these models. PyMC3 and Stan are the current state-of-the-art tools to construct and estimate these models. One major drawback of sampling, however, is that it’s often very slow, especially for high-dimensional models. That’s why more recently, variational inference algorithms have been developed that are almost as flexible as MCMC but much faster. Instead of drawing samples from the posterior, these algorithms fit a distribution (e.g. normal) to the posterior, turning a sampling problem into an optimization problem. ADVI (Automatic Differentiation Variational Inference) is implemented in PyMC3 and Stan, as well as a new package called Edward which is mainly concerned with Variational Inference.

Unfortunately, when it comes to traditional ML problems like classification or (non-linear) regression, Probabilistic Programming often plays second fiddle (in terms of accuracy and scalability) to more algorithmic approaches like ensemble learning (e.g. random forests or gradient boosted regression trees).

Deep Learning

Now in its third renaissance, deep learning has been making headlines repeatedly by dominating almost any object recognition benchmark, kicking ass at Atari games, and beating the world-champion Lee Sedol at Go. From a statistical point of view, Neural Networks are extremely good non-linear function approximators and representation learners. While mostly known for classification, they have been extended to unsupervised learning with AutoEncoders and in all sorts of other interesting ways (e.g. Recurrent Networks, or MDNs to estimate multimodal distributions). Why do they work so well? No one really knows as the statistical properties are still not fully understood.

A large part of the innovation in deep learning is the ability to train these extremely complex models. This rests on several pillars:

  • Speed: leveraging the GPU allowed for much faster processing.
  • Software: frameworks like Theano and TensorFlow allow flexible creation of abstract models that can then be optimized and compiled to CPU or GPU.
  • Learning algorithms: training on sub-sets of the data — stochastic gradient descent — allows us to train these models on massive amounts of data. Techniques like drop-out avoid overfitting.
  • Architectural: A lot of innovation comes from changing the input layers, like for convolutional neural nets, or the output layers, like for MDNs.

Bridging Deep Learning and Probabilistic Programming

On one hand we have Probabilistic Programming, which allows us to build rather small and focused models in a very principled and well-understood way to gain insight into our data; on the other hand we have deep learning, which uses many heuristics to train huge and highly complex models that are amazing at prediction. Recent innovations in variational inference allow probabilistic programming to scale model complexity as well as data size. We are thus at the cusp of being able to combine these two approaches to hopefully unlock new innovations in Machine Learning. For more motivation, see also Dustin Tran’s recent blog post.

While this would allow Probabilistic Programming to be applied to a much wider set of interesting problems, I believe this bridging also holds great promise for innovations in Deep Learning. Some ideas are:

  • Uncertainty in predictions: As we will see below, the Bayesian Neural Network informs us about the uncertainty in its predictions. I think uncertainty is an underappreciated concept in Machine Learning as it’s clearly important for real-world applications. But it could also be useful in training. For example, we could train the model specifically on samples it is most uncertain about.
  • Uncertainty in representations: We also get uncertainty estimates of our weights which could inform us about the stability of the learned representations of the network.
  • Regularization with priors: Weights are often L2-regularized to avoid overfitting; this very naturally becomes a Gaussian prior for the weight coefficients. We could, however, imagine all kinds of other priors, like spike-and-slab to enforce sparsity (this would be more like using the L1-norm).
  • Transfer learning with informed priors: If we wanted to train a network on a new object recognition data set, we could bootstrap the learning by placing informed priors centered around weights retrieved from other pre-trained networks, like GoogLeNet.
  • Hierarchical Neural Networks: A very powerful approach in Probabilistic Programming is hierarchical modeling that allows pooling of things that were learned on sub-groups to the overall population (see my tutorial on Hierarchical Linear Regression in PyMC3). Applied to Neural Networks, in hierarchical data sets, we could train individual neural nets to specialize on sub-groups while still being informed about representations of the overall population. For example, imagine a network trained to classify car models from pictures of cars. We could train a hierarchical neural network where a sub-neural network is trained to tell apart models from only a single manufacturer. The intuition is that all cars from a certain manufacturer share certain similarities, so it would make sense to train individual networks that specialize on brands. However, due to the individual networks being connected at a higher layer, they would still share information with the other specialized sub-networks about features that are useful to all brands. Interestingly, different layers of the network could be informed by various levels of the hierarchy: for example, early layers that extract visual lines could be identical in all sub-networks while the higher-order representations would be different. The hierarchical model would learn all that from the data.
  • Other hybrid architectures: We can more freely build all kinds of neural networks. For example, Bayesian non-parametrics could be used to flexibly adjust the size and shape of the hidden layers to optimally scale the network architecture to the problem at hand during training. Currently, this requires costly hyper-parameter optimization and a lot of tribal knowledge.
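
To make the "regularization with priors" point concrete: up to a constant that does not depend on the weights, the negative log density of a zero-mean Gaussian prior with standard deviation sd is an L2 penalty with strength 1 / (2 * sd**2). A small NumPy check of that identity, with made-up weight vectors and illustrative helper names:

```python
import numpy as np


def gaussian_log_prior(weights, sd=1.0):
    """Log density of independent Normal(0, sd) priors on each weight."""
    return np.sum(-0.5 * (weights / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi)))


def l2_penalty(weights, lam):
    return lam * np.sum(weights ** 2)


rng = np.random.RandomState(0)
w_a, w_b = rng.randn(5), rng.randn(5)  # two arbitrary weight vectors
sd = 2.0
lam = 1.0 / (2 * sd ** 2)

# The constant term cancels when two weight vectors are compared, so the
# difference in negative log prior equals the difference in L2 penalty.
diff_prior = -(gaussian_log_prior(w_a, sd) - gaussian_log_prior(w_b, sd))
diff_l2 = l2_penalty(w_a, lam) - l2_penalty(w_b, lam)
print(np.isclose(diff_prior, diff_l2))
```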

Bayesian Neural Networks in PyMC3

Generating data

First, let’s generate some toy data: a simple binary classification problem that’s not linearly separable.

%matplotlib inline
import pymc3 as pm
import theano.tensor as T
import theano
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.preprocessing import scale
# train_test_split lives in sklearn.model_selection since scikit-learn 0.18
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
X, Y = make_moons(noise=0.2, random_state=0, n_samples=1000)
X = scale(X)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.5)
fig, ax = plt.subplots()
ax.scatter(X[Y==0, 0], X[Y==0, 1], label='Class 0')
ax.scatter(X[Y==1, 0], X[Y==1, 1], color='r', label='Class 1')
sns.despine(); ax.legend()
ax.set(xlabel='X', ylabel='Y', title='Toy binary classification data set');

Model specification

A neural network is quite simple. The basic unit is a perceptron which is nothing more than logistic regression. We use many of these in parallel and then stack them up to get hidden layers. Here we will use 2 hidden layers with 5 neurons each which is sufficient for such a simple problem.

# Trick: Turn inputs and outputs into shared variables. 
# It's still the same thing, but we can later change the values of the shared variable 
# (to switch in the test-data later) and pymc3 will just use the new data. 
# Kind-of like a pointer we can redirect.
# For more info, see:
ann_input = theano.shared(X_train)
ann_output = theano.shared(Y_train)

n_hidden = 5

# Initialize random weights between each layer
init_1 = np.random.randn(X.shape[1], n_hidden)
init_2 = np.random.randn(n_hidden, n_hidden)
init_out = np.random.randn(n_hidden)
with pm.Model() as neural_network:
    # Weights from input to hidden layer
    weights_in_1 = pm.Normal('w_in_1', 0, sd=1,
                             shape=(X.shape[1], n_hidden),
                             testval=init_1)
    # Weights from 1st to 2nd layer
    weights_1_2 = pm.Normal('w_1_2', 0, sd=1,
                            shape=(n_hidden, n_hidden),
                            testval=init_2)
    # Weights from hidden layer to output
    weights_2_out = pm.Normal('w_2_out', 0, sd=1,
                              shape=(n_hidden,),
                              testval=init_out)
    # Build neural-network using tanh activation function
    act_1 = T.tanh(T.dot(ann_input, weights_in_1))
    act_2 = T.tanh(T.dot(act_1, weights_1_2))
    act_out = T.nnet.sigmoid(T.dot(act_2, weights_2_out))
    # Binary classification -> Bernoulli likelihood
    out = pm.Bernoulli('out', act_out, observed=ann_output)

That’s not so bad. The Normal priors help regularize the weights. Usually we would add a constant b to the inputs but I omitted it here to keep the code cleaner.
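
To see what the network computes for one particular set of weights, here is the same forward pass in plain NumPy. It is only a sketch for illustration: the weights are random draws rather than anything inferred, and the function name is made up.

```python
import numpy as np


def forward(X, w_in_1, w_1_2, w_2_out):
    """Two tanh hidden layers followed by a sigmoid output, without bias terms."""
    act_1 = np.tanh(X.dot(w_in_1))
    act_2 = np.tanh(act_1.dot(w_1_2))
    return 1.0 / (1.0 + np.exp(-act_2.dot(w_2_out)))


rng = np.random.RandomState(0)
n_hidden = 5
X_demo = rng.randn(4, 2)  # four made-up 2-D inputs
p = forward(X_demo,
            rng.randn(2, n_hidden),
            rng.randn(n_hidden, n_hidden),
            rng.randn(n_hidden))
print(p.shape)  # (4,): one class-1 probability per input row
```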

Variational Inference: Scaling model complexity

We could now just run an MCMC sampler like NUTS, which works pretty well in this case, but as I already mentioned, this will become very slow as we scale our model up to deeper architectures with more layers.

Instead, we will use the brand-new ADVI variational inference algorithm which was recently added to PyMC3. This is much faster and will scale better. Note that this is a mean-field approximation, so we ignore correlations in the posterior.

with neural_network:
    # Run ADVI which returns posterior means, standard deviations, and the evidence lower bound (ELBO)
    v_params = pm.variational.advi(n=50000)
Iteration 0 [0%]: ELBO = -368.86
Iteration 5000 [10%]: ELBO = -185.65
Iteration 10000 [20%]: ELBO = -197.23
Iteration 15000 [30%]: ELBO = -203.2
Iteration 20000 [40%]: ELBO = -192.46
Iteration 25000 [50%]: ELBO = -198.8
Iteration 30000 [60%]: ELBO = -183.39
Iteration 35000 [70%]: ELBO = -185.04
Iteration 40000 [80%]: ELBO = -187.56
Iteration 45000 [90%]: ELBO = -192.32
Finished [100%]: ELBO = -225.56
CPU times: user 36.3 s, sys: 60 ms, total: 36.4 s
Wall time: 37.2 s

< 40 seconds on my older laptop. That's pretty good considering that NUTS is having a really hard time. Further below we make this even faster. To make it really fly, we probably want to run the Neural Network on the GPU.

As samples are more convenient to work with, we can very quickly draw samples from the variational posterior using sample_vp() (this is just sampling from Normal distributions, so not at all the same as MCMC):

with neural_network:
    trace = pm.variational.sample_vp(v_params, draws=5000)
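
To illustrate what "just sampling from Normal distributions" means under the mean-field approximation, here is a NumPy sketch. It is not the implementation of sample_vp(); the function name and the means and standard deviations are made up, standing in for the values ADVI would return.

```python
import numpy as np


def sample_mean_field(means, stds, draws, seed=0):
    """Draw independent Normal samples for every parameter, one row per draw."""
    rng = np.random.RandomState(seed)
    return {name: means[name] + stds[name] * rng.randn(draws, *np.shape(means[name]))
            for name in means}


# Made-up variational parameters for the five output weights.
means = {"w_2_out": np.zeros(5)}
stds = {"w_2_out": 0.1 * np.ones(5)}

demo_trace = sample_mean_field(means, stds, draws=5000)
print(demo_trace["w_2_out"].shape)  # (5000, 5)
```

Because every weight is drawn independently, any correlation between weights in the true posterior is lost, which is the price of the mean-field approximation.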

Plotting the objective function (ELBO) we can see that the optimization slowly improves the fit over time.


Now that we have trained our model, let’s predict on the hold-out set using a posterior predictive check (PPC). We use sample_ppc() to generate new data (in this case class predictions) from the posterior (sampled from the variational estimation).

# Replace shared variables with testing set
ann_input.set_value(X_test)
ann_output.set_value(Y_test)

# Create posterior predictive samples
ppc = pm.sample_ppc(trace, model=neural_network, samples=500)

# Use probability of > 0.5 to assume prediction of class 1
pred = ppc['out'].mean(axis=0) > 0.5
fig, ax = plt.subplots()
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
ax.set(title='Predicted labels in testing set', xlabel='X', ylabel='Y');
print('Accuracy = {}%'.format((Y_test == pred).mean() * 100))
Accuracy = 94.19999999999999%

Hey, our neural network did all right!

Let’s look at what the classifier has learned

For this, we evaluate the class probability predictions on a grid over the whole input space.

grid = np.mgrid[-3:3:100j,-3:3:100j]
grid_2d = grid.reshape(2, -1).T
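dummy_out = np.ones(grid.shape[1], dtype=np.int8)
# Evaluate the network on the grid instead of the test set
ann_input.set_value(grid_2d)
ann_output.set_value(dummy_out)

# Create posterior predictive samples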
ppc = pm.sample_ppc(trace, model=neural_network, samples=500)
cmap = sns.diverging_palette(250, 12, s=85, l=25, as_cmap=True)
fig, ax = plt.subplots(figsize=(10, 6))
contour = ax.contourf(*grid, ppc['out'].mean(axis=0).reshape(100, 100), cmap=cmap)
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
cbar = plt.colorbar(contour, ax=ax)
_ = ax.set(xlim=(-3, 3), ylim=(-3, 3), xlabel='X', ylabel='Y');
cbar.ax.set_ylabel('Posterior predictive mean probability of class label = 0');
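
The grid handling above is compact, so here is the same reshaping on a small 5 x 5 grid. This is a standalone NumPy sketch with made-up values standing in for the predictions.

```python
import numpy as np

small_grid = np.mgrid[-3:3:5j, -3:3:5j]      # 5 points per axis instead of 100
small_grid_2d = small_grid.reshape(2, -1).T  # one (x, y) row per grid point
print(small_grid.shape)     # (2, 5, 5)
print(small_grid_2d.shape)  # (25, 2)

# Any per-point value computed on the flat rows can be reshaped back onto
# the grid for contour plotting, which is what reshape(100, 100) does above.
values = small_grid_2d[:, 0] + 10 * small_grid_2d[:, 1]
print(np.allclose(values.reshape(5, 5), small_grid[0] + 10 * small_grid[1]))
```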

Uncertainty in predicted value

So far, everything I showed we could have done with a non-Bayesian Neural Network. The mean of the posterior predictive for each class-label should be identical to maximum likelihood predicted values. However, we can also look at the standard deviation of the posterior predictive to get a sense for the uncertainty in our predictions. Here is what that looks like:

cmap = sns.cubehelix_palette(light=1, as_cmap=True)
fig, ax = plt.subplots(figsize=(10, 6))
contour = ax.contourf(*grid, ppc['out'].std(axis=0).reshape(100, 100), cmap=cmap)
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
cbar = plt.colorbar(contour, ax=ax)
_ = ax.set(xlim=(-3, 3), ylim=(-3, 3), xlabel='X', ylabel='Y');
cbar.ax.set_ylabel('Uncertainty (posterior predictive standard deviation)');

We can see that very close to the decision boundary, our uncertainty as to which label to predict is highest. You can imagine that associating predictions with uncertainty is a critical property for many applications like health care. To further maximize accuracy, we might want to train the model primarily on samples from that high-uncertainty region.
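
One way to see why the boundary lights up: for 0/1 draws, the standard deviation is fully determined by the mean, std = sqrt(mean * (1 - mean)), which peaks when the mean is 0.5. A small NumPy sketch with simulated draws follows; the three probabilities are made up and stand in for three grid points.

```python
import numpy as np

rng = np.random.RandomState(0)
p_class_1 = np.array([0.05, 0.5, 0.95])  # far from, on, and far from the boundary

# 500 simulated 0/1 predictions per point, shaped like ppc['out'] (samples x points)
sim_draws = rng.binomial(1, p_class_1, size=(500, 3))

mean = sim_draws.mean(axis=0)
std = sim_draws.std(axis=0)
print(std)
print(np.allclose(std, np.sqrt(mean * (1 - mean))))
print(std.argmax())  # index of the point nearest the boundary
```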

Mini-batch ADVI: Scaling data size

So far, we have trained our model on all data at once. Obviously this won’t scale to something like ImageNet. Moreover, training on mini-batches of data (stochastic gradient descent) can help avoid local minima and can lead to faster convergence.

Fortunately, ADVI can be run on mini-batches as well. It just requires some setting up:

# Set back to original data to retrain
ann_input.set_value(X_train)
ann_output.set_value(Y_train)

# Tensors and RV that will be using mini-batches
minibatch_tensors = [ann_input, ann_output]
minibatch_RVs = [out]

# Generator that returns mini-batches in each iteration
def create_minibatch(data):
    rng = np.random.RandomState(0)
    while True:
        # Return random data samples of set size 50 each iteration
        ixs = rng.randint(len(data), size=50)
        yield data[ixs]

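# One generator per entry in minibatch_tensors, in the same order
minibatches = [
    create_minibatch(X_train),
    create_minibatch(Y_train),
]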

total_size = len(Y_train)

While the above might look a bit daunting, I really like the design. Especially the fact that you define a generator allows for great flexibility. In principle, we could just pull from a database there and not have to keep all the data in RAM.
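
As a rough sketch of the database idea, the generator below fetches each mini-batch with a fresh query, so only one batch is ever held in memory. Everything here is hypothetical: the table name, the column names, and the choice of an in-memory SQLite database as a stand-in for a real data store.

```python
import sqlite3

import numpy as np


def minibatches_from_db(conn, batch_size=50):
    """Yield (X, Y) mini-batches forever, fetching each batch with a fresh query."""
    query = "SELECT x0, x1, y FROM samples ORDER BY RANDOM() LIMIT ?"
    while True:
        rows = np.array(conn.execute(query, (batch_size,)).fetchall())
        yield rows[:, :2], rows[:, 2].astype(int)


# Stand-in database with 1000 made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (x0 REAL, x1 REAL, y INTEGER)")
rng = np.random.RandomState(0)
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [(float(a), float(b), int(a > b)) for a, b in rng.randn(1000, 2)],
)

X_batch, Y_batch = next(minibatches_from_db(conn))
print(X_batch.shape, Y_batch.shape)  # (50, 2) (50,)
```

Unlike create_minibatch() above, this yields inputs and labels together from the same rows, so the two cannot get out of step.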

Let’s pass those to advi_minibatch():

with neural_network:
    # Run advi_minibatch
    v_params = pm.variational.advi_minibatch(
        n=50000, minibatch_tensors=minibatch_tensors, 
        minibatch_RVs=minibatch_RVs, minibatches=minibatches, 
        total_size=total_size, learning_rate=1e-2, epsilon=1.0
    )
Iteration 0 [0%]: ELBO = -311.63
Iteration 5000 [10%]: ELBO = -162.34
Iteration 10000 [20%]: ELBO = -70.49
Iteration 15000 [30%]: ELBO = -153.64
Iteration 20000 [40%]: ELBO = -164.07
Iteration 25000 [50%]: ELBO = -135.05
Iteration 30000 [60%]: ELBO = -240.99
Iteration 35000 [70%]: ELBO = -111.71
Iteration 40000 [80%]: ELBO = -87.55
Iteration 45000 [90%]: ELBO = -97.5
Finished [100%]: ELBO = -75.31
CPU times: user 17.4 s, sys: 56 ms, total: 17.5 s
Wall time: 17.5 s

with neural_network:    
    trace = pm.variational.sample_vp(v_params, draws=5000)

As you can see, mini-batch ADVI’s running time is much lower. It also seems to converge faster.

For fun, we can also look at the trace. The point is that we also get uncertainty of our Neural Network weights.


Hopefully this blog post demonstrated a very powerful new inference algorithm available in PyMC3: ADVI. I also think bridging the gap between Probabilistic Programming and Deep Learning can open up many new avenues for innovation in this space, as discussed above. Specifically, a hierarchical neural network sounds pretty bad-ass. These are really exciting times.

Next steps

Theano, which is used by PyMC3 as its computational backend, was mainly developed for estimating neural networks and there are great libraries like Lasagne that build on top of Theano to make construction of the most common neural network architectures easy. Ideally, we wouldn’t have to build the models by hand as I did above, but use the convenient syntax of Lasagne to construct the architecture, define our priors, and run ADVI.

While we haven’t successfully run PyMC3 on the GPU yet, it should be fairly straightforward (this is what Theano does after all) and further reduce the running time significantly. If you know some Theano, this would be a great area for contributions!

You might also argue that the above network isn’t really deep, but note that we could easily extend it to have more layers, including convolutional ones to train on more challenging data sets.

I also presented some of this work at PyData London.

Finally, you can download this NB here. Leave a comment below, and follow me on twitter.


Taku Yoshioka did a lot of work on ADVI in PyMC3, including the mini-batch implementation as well as the sampling from the variational posterior. I’d also like to thank the Stan guys (specifically Alp Kucukelbir and Daniel Lee) for deriving ADVI and teaching us about it. Thanks also to Chris Fonnesbeck, Andrew Campbell, Taku Yoshioka, and Peadar Coyle for useful comments on an earlier draft.

Original URL:

Original article
