Summarizing Data in SQL

About Matt: Matt DeLand is Co-Founder and Data Scientist at Wagon.
His team is building a collaborative SQL editor for analysts and engineers. He studied algebraic geometry at Columbia University,
taught at the University of Michigan, and now enjoys applied machine learning— his mom is very proud!


How quickly can you understand data from your database? Excel croaks at ten thousands rows, R is difficult to run on a cluster,
and your eye certainly can’t summarize large tables. SQL to the rescue!

Summary statistics are the fastest way to learning about your dataset. Are there outliers? How does the distribution look?
What are the relationships hiding inside the rows and columns? You’ll always want to ask these questions when faced with a new dataset.
SQL (an uninspired acronym for Structured Query Language) is the native language for database computation. Many summary methods are
available as built in SQL functions in modern databases and more complex measures can be implemented directly in SQL as well.

In this post, we’ll walk through 4 ways to summarize your dataset in SQL so you can run it directly on your database.

A summary of summaries

  1. The Basics
  2. Percentiles
  3. Histogram
  4. Joint Distributions

Here’s a relatable example: suppose we work in a banana stand’s
back office and we analyze banana data.

The Basics

With just a few SQL commands, we’ll be able to calculate the basics: count, distinct, sum, average, variance, … If we have a table of banana transactions, let’s calculate the total number of customers, unique customers, number of bananas sold, as well as total and average revenue per day:

This should be a familiar SQL pattern (and if not, come to the next free Wagon SQL class!). With just one query, we can calculate important aggregations over very large datasets. If we dress it up with a few where statements or join with customer lookup tables, we can quickly and effectively slice and dice our data. Unfortunately, there are some tricky questions that can’t be answered with the regular SQL functions alone.

Computing Percentiles

How long do people wait for their tasty banana orders? Using basic SQL we can compute average wait time, but if the distribution is skewed away from normal (as many internet-driven (and banana?) distributions often are), this may not give us a complete picture of how long most people are waiting. In addition to computing the average, we might (and should) ask, what are the 25th, 50th, 75th percentiles of wait-time, and how does that number vary day to day?

Many databases (including Postgres 9.4, Redshift, SQL Server) have built in percentile functions. Here’s an example using the function percentile_cont which is a window function that computes the percentile of wait-time, split (pun intended!) by day:

The structure of the percentile_cont is similar to other window functions: we specify how to order the data, how to group it – and the database does the rest. If we wanted to add more dimensions to our query (e.g. time of day), we’d add them to the partition and group by clause.
If our database doesn’t support percentile_cont (sorry MySQL, Postgres < 9.4), the query is more complicated, but fear not, still possible! The challenge is to order the rows by increasing wait-time (per date of course) and then pick out the middle value (for median). In MySQL, we can use local variables to keep track of the order, and in Postgres, we can use the row-number function. Here’s the Postgres version:


It’s often useful to have a rough idea of the distribution of the data in a table or query result. Generating a histogram is a great way to understand that distribution. What’s the distribution of revenue per purchase at the banana stand? I mean, how much could a banana cost? We might (naively) write a query like:

It’s likely this query will return far too many rows to eyeball the revenue distribution (1 row per distinct price). Instead, we’d like to group revenue values together into buckets so that we can understand the shape of the data. We might be interested in the number of banana purchases which generated between $0 to $5, $5 to $10, etc. There’s no “correct” number of buckets to use, it’s a choice we can make and experiment with to get a better understanding of the distribution. Probably 1000 buckets is too many, but 2 is too few. A reasonable rule of thumb is to use somewhere around 20 buckets. Our query will specify the width of the bucket, rather than the total number. The larger the width, the fewer buckets we’ll end up with.

This is a simple, but tricky query that will generate a histogram for us. It rounds each revenue data point down to the nearest multiple of 5 and then groups by that rounded value. It has one failing in that if we have no data in a bucket (e.g. no purchases of 55 to 60 dollars), then that row will not appear in the results. We can fix that with a more complex query, but let’s skip it for now. In order to choose our bucket size, it helps to first calculate the min/max values so we know how many buckets we would end up with. If we want the buckets to have slightly nicer labels, we can format the labels with a query like:

Joint Distributions

Comparing two numerical measures is very important as we try to summarize our data. Though the math becomes more complicated, SQL can still handle it. For example, we might be nervous that if people have to wait too long, they end up spending less money when they get to the front of the line. To get a rough idea of the relationship between wait-time and revenue, we might write a query very similar to the histogram one:

These queries give us some insight into the join distribution (plug: Wagon makes charts like these super easy to generate). We might also be interested in more statistical measurements like covariance or correlation. Most of the popular SQL implementations have these statistics built in, so for example in Postgres/Redshift we can write:

Postgres has all kinds of cool aggregate functions, including linear regressions – right in our SQL query!


At Wagon, we’re very interested in how people write queries, understand their results, and share them with their team. Our SQL tool computes statistics on all your query results (including histograms!) to help you understand your result-set without having to do additional work. If you’re interested, we’d love to hear from you, give us a shout! And remember, there’s always sum_money in the banana_stand.

Original URL:  

Original article

3 Ways That The Blockchain Will Change The Real Estate Market

shutterstock_116263732 Experts have suggested a number of niche industries that will be made more secure by the untamperable data record provided by blockchain technology — including international art dealing, pharmaceuticals and international trade of high-value goods — but to date, very little attention has been given to the potential effects on the real estate market. Read More

Original URL:  

Original article

What Have We Learned from This Open Source Project?

Here is the collected wisdom that we have gained from running the
Taskwarrior project for nine years. It has been rewarding,
enjoyable, and sometimes frustrating. We learned a lot about
users and Open Source expectations.

Start an open source project if you want to learn all you can about
software design, development, planning, testing, documenting, and
delivery; enjoy technical challenges, administrative challenges,
compromise, and will be satisfied hoping that someone out there is
benefitting from your work.

Do not start an open source project if you need praise, warmth and
love from your fellow human beings.

If you could draw a boundary between that which is already
supported, and that which is not, you would find that all
the activity, discussion and drama occurs at that boundary.

Feature requests only nibble at the periphery.
Bold changes originate elsewhere.

People will get excited about something a project doesn’t yet
support. Deliver it, and they will get excited about the next

If a feature works well, you’ll never hear about it again.

There is a fine line between “richly-featured” and “bloated”.
There may not be a line at all.

If you demo two features, and talk about twenty more, users still
only know about the two. Visual demonstrations have far greater impact.

Every change will ruin someone’s day. They will be sure to tell
you about it.

The same change will improve someone’s day. You will not hear of this.

People will disguise feature requests as bugs, which means either
they consider difference of opinion a defect, or believe that
calling it a flaw will force implementation, but hopefully they
just forgot to set the issue type to ‘enhancement’.

Some people find it very difficult to articulate what they want.
It’s worth being patient and finding out what they need.

What you keep out of a project is just as important as what you
allow in to a project.

Many new users will submit feature requests, just to show that they
are knowledgeable and clever. They don’t really want that feature,
it’s a form of positive feedback.

Beware of suggestions from users who have used your software for
only a day or so. Be equally aware of suggestions from users who
have used your software for a long, long time.

People will threaten to not use open source software because it lacks
a feature, thereby mistaking themselves for paying customers.

Many believe that if a change is small, it deserves to be in the
project, regardless of whether it makes sense for it to be there.

Users will go to the effort of seeking you out online, to directly
ask you a question that is answered two clicks from the front page
of a website.

A looping, animated GIF will be watched over and over, scrutinized
and understood. A paragraph of text will be ignored.

Man pages are too densely crammed with information, and too lengthy,
for most modern humans to ingest.

“What have you tried so far?” is the best question to identify
time wasters.

People will pick a fight with you about all your incidental choices.
Your issue tracker, your branching strategy, your version numbers,
the text editor you use, and so on.

You can choose the most permissive software license, and people will
still argue with you about your choice.

SEO consultants are not very good at searching the web, and learning
that you operate an open source, non-profit project. It says a lot.

Original URL:  

Original article

GitHub-backup: backs up everything GitHub knows about a repository or a user

github-backup is a simple tool you run in a git repository you cloned from
GitHub. It backs up everything GitHub publishes about the repository,
including branches, tags, other forks, issues, comments, wikis, milestones,
pull requests, watchers, and stars.


cabal install github-backup --bindir=$HOME/bin

(Cabal is bundled with the Haskell Platform.)


Run github-backup with no parameters, inside a git repository cloned
from GitHub to back up that repository.

Or, if you have a GitHub account, run github-backup username
to clone and back up your account’s repositories, as well
as the repositories you’re watching and have starred.

Why backup GitHub repositories

There are a couple of reasons to want to back this stuff up:

  • In case something happens to GitHub. More generally because
    keeping your data in the cloud and relying on the cloud to
    back it up is foolish.

  • In case someone takes down a repository that you were interested in.
    If you run github-backup with your username, it will back up all
    the repositories you have watched and starred.

  • So you can keep working on your repository while on a plane, or
    on a remote beach or mountaintop. Just like Linus intended.

What to expect

Each time you run github-backup, it will find any new forks on GitHub. It
will add remotes to your repository for the forks, using names like
github_torvalds_subsurface. It will fetch from every fork.

It downloads metadata from each fork. This is stored
into a branch named “github”. Each fork gets a directory in there,
like torvalds_subsurface. Inside the directory there will be some
files, like torvalds_subsurface/watchers. There may be further
directories, like for comments: torvalds_subsurface/comments/1.

You can follow the commits to the github branch to see what information
changed on GitHub over time.

The format of the files in the github branch is currently Haskell
serialized data types. This is plain text, and readable, if you squint.


github-backup is repository-focused. It does not try to back up other
information from GitHub. In particular, social network stuff, like
users who are following you, is not backed up.

github-backup does not log into GitHub, so it cannot backup private

Notes added to commits and lines of code don’t get backed up yet.
There is only recently API support for this.

The labels that can be added to issues and milestones are not backed up.
Neither are the hooks. They could be, but don’t seem important
enough for the extra work involved. Yell if you need them.

github-backup re-downloads all issues, comments, and so on
each time it’s run. This may be slow if your repo has a lot of them,
or even if it just has a lot of forks.

Bear in mind that this uses the GitHub API; don’t run it every 5 minutes.
GitHub rate limits the
API to some small number of requests per hour when used without
authentication. To avoid this limit, you can set GITHUB_USER and
GITHUB_PASSWORD in the environment and it will log in when making
(most) API requests.

Anyway, github-backup does do an incremental backup, picking up where it
left off, so will complete the backup eventually even if it’s rate limited.


github-backup was written by Joey Hess

It is made possible thanks to:

Original URL:  

Original article

Proudly powered by WordPress | Theme: Baskerville 2 by Anders Noren.

Up ↑

%d bloggers like this: