By Nathan Willis
March 11, 2015
Since opening its doors in 2008, GitHub has grown to become the largest
active project-hosting service for open-source software. But it has
also attracted a fair share of criticism for some of its
implementation choices—with one of the leading complaints being
that it takes a lax approach to software licensing. That, in turn,
leads to a glut of repositories bearing little or no licensing
details. The company recently announced a new tool to help combat the
license-confusion issue: a site-wide API for querying and reporting
license information. Whether that API is up to the task, however,
remains to be seen.
None of the above
By way of background, GitHub does not require users to
choose a license when setting up a new project. An existing project
can also be forked into a new repository with one click, but nothing
subsequently prevents the new repository’s owner from changing or
removing the upstream license information (if it exists).
From a legal standpoint, of course, the fork inherits its
license from upstream automatically (unless the upstream project is
public domain or under some other less-common license). But from a
practical standpoint, this provenance is difficult to
trace. Throw in other GitHub users submitting pull requests for
patches that have no license information, and one has a recipe for
confusion.
The bigger problem, however, is that the majority of GitHub repositories
carry no license information at all, because the users who own them
have not chosen to add such information. In 2013, GitHub introduced
its first tool designed to combat that issue, launching ChooseALicense.com, a web site
that explains the features and differences of popular FOSS licenses.
ChooseALicense.com allows GitHub users to select a license, and the GitHub
new-project-configuration page has a license selector, but using it is
not obligatory. In fact, the ChooseALicense.com home page includes
a “no license” link as its last option.
That link, incidentally, attempts to explain the downside of selecting
no license: most notably, doing so strongly discourages other
developers (both FOSS and proprietary) from using or redistributing
the code in any fashion, for fear of getting entangled in a copyright
problem. But the page also points out that the GitHub terms
of service dictate that other users have the right to view and
fork any GitHub repository.
A new interface
One could probably quibble endlessly over the details of
ChooseALicense.com and its wording. The upshot, though, is that it
did not have a serious impact on the license-confusion problem. A
March 9 post
on the GitHub blog presented some startling statistics: that less than 20%
of GitHub repositories have a license, and that the percentage is declining.
The introduction of the license-selection tool in 2013 produced a
spike in licensed repositories, followed by a downward trend that
continues to the present. The post also included some statistics on license
popularity; the three licenses featured most prominently on the
license-chooser site (MIT, Apache, and GPLv2) are, unsurprisingly, the
most often selected.
This data set, however, is far from complete; as the post
explains, the team only logged licenses that were found in a file
named LICENSE, and only matched that file’s contents against
a short set of known licenses. Nevertheless, GitHub did evidently
determine that the problem was real enough to warrant a new attempt at
a solution.
The team’s answer is a new site-wide API called, fittingly, the Licenses API.
It is currently in preview, which means that interested developers
must supply a special HTTP header with any requests in order to access it.
But the API is, at least currently, a frustratingly limited one.
It offers just three functions:
- GET /licenses returns a JSON-formatted list of all of the
licenses tracked by the site.
- GET /licenses/licensename returns the license text and
associated metadata for licensename.
- GET /repos/username/reponame returns any licensing
information for username’s reponame repository (along
with other repository information).
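
The three calls listed above, along with the preview header, can be sketched in Python. This is only an illustration under stated assumptions, not GitHub's own client code: the helper names are hypothetical, and the media type stored in PREVIEW_ACCEPT is an assumed value for the preview header that should be checked against GitHub's API documentation before use.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"

# Assumption: the media type used for the special preview header.
# Verify against GitHub's documentation; it may differ or be retired.
PREVIEW_ACCEPT = "application/vnd.github.drax-preview+json"


def build_request(path, accept=PREVIEW_ACCEPT):
    """Build (but do not send) a GET request carrying the preview header."""
    return urllib.request.Request(
        API_ROOT + path,
        headers={"Accept": accept, "User-Agent": "license-api-sketch"},
        method="GET",
    )


def licenses_request():
    """GET /licenses: the list of licenses tracked by the site."""
    return build_request("/licenses")


def license_request(license_name):
    """GET /licenses/licensename: text and metadata for one license."""
    return build_request("/licenses/" + urllib.parse.quote(license_name, safe=""))


def repo_request(username, reponame):
    """GET /repos/username/reponame: repository data, including licensing."""
    user = urllib.parse.quote(username, safe="")
    repo = urllib.parse.quote(reponame, safe="")
    return build_request("/repos/{}/{}".format(user, repo))


def fetch_json(request):
    """Send a prepared request and decode the JSON body (needs network access)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping request construction separate from fetch_json() means the URLs and headers can be inspected without touching the network; a caller would pass, for example, repo_request("someuser", "somerepo") to fetch_json() to retrieve the data.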
Arguably the biggest limitation is that, as was the case with the statistics
gathered for the blog post, the license of a repository is determined
only by examining the contents of a LICENSE file. On the
plus side, the license information returned by the API conforms to the
Software Package Data Exchange (SPDX) specification, which should make
it easy to integrate with SPDX-compatible tools.
To be sure, determining and counting licenses is not a simple
matter—as many in the community know. In 2013, for example, a
pair of presentations at the Free Software Legal and Licensing
Workshop explored several strategies for
tabulating statistics on FOSS license usage. Both presentations ended
with caveats about the difficulty of the problem—whatever
methodology is used to approach it.
Nevertheless, the GitHub Licenses API does appear to be strangely
naive in its approach. For example, it is well-established that a
significant number of projects place their license in a file named
COPYING, rather than LICENSE, because that has long
been the convention used by the GNU project. Even scanning for that
filename (or other obvious candidates, like GPL.txt) would
enhance the quality of the data available significantly. Far better
would be allowing the repository owner to designate what file contains
the license.
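
As a rough sketch of what a broader scan might look like, the Python function below checks a repository's top-level file names against a small candidate list and also honors an owner-designated file. It is not how GitHub detects licenses; the function name, the candidate list, and the "designated" parameter are all hypothetical.

```python
import os

# Hypothetical candidate stems, taken from the file names mentioned in the
# text (LICENSE, COPYING, GPL.txt); a real scanner would need a longer list.
CANDIDATE_STEMS = ("license", "copying", "gpl")


def license_file_candidates(filenames, designated=None):
    """Return the top-level file names that plausibly hold a license.

    If the owner has designated a file, only that file is considered.
    Otherwise, names are matched case-insensitively, ignoring extensions.
    """
    if designated is not None:
        return [name for name in filenames if name == designated]
    matches = []
    for name in filenames:
        stem, _extension = os.path.splitext(name)
        if stem.lower() in CANDIDATE_STEMS:
            matches.append(name)
    return matches
```

Finding a candidate file is only the first step; its contents would still have to be matched against known license texts.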
Furthermore, the Licenses API could be used to accumulate more
meaningful statistics, such as which forks include different license
information than their corresponding upstream repository, but there is
no indication yet that GitHub intends to pursue such a survey. It may
fall on volunteers in the community to undertake that sort of
work. There are, after all, multiple source-code auditing tools that are
compatible with SPDX and can be used to audit license information and
compliance. Regrettably, the GitHub Licenses API does not look like it will
lighten that workload significantly, since the information it returns
is so restricted in scope.
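
A community survey of forks whose license differs from upstream could start from something like the Python sketch below. It operates on already-decoded JSON for a fork and its upstream repository, as returned by GET /repos/username/reponame. The field names "license" and "key" are assumptions about the shape of the response, and the function names are hypothetical.

```python
def license_key(repo):
    """Return the repository's license identifier, or None if none is reported.

    Assumes the decoded response carries a "license" object with a "key"
    field; both names are assumptions, not confirmed by the article.
    """
    license_info = repo.get("license")
    if not license_info:
        return None
    return license_info.get("key")


def fork_license_differs(fork, upstream):
    """True when a fork reports a different license than its upstream."""
    return license_key(fork) != license_key(upstream)


def count_relicensed_forks(pairs):
    """Count the (fork, upstream) pairs whose reported licenses differ."""
    return sum(1 for fork, upstream in pairs if fork_license_differs(fork, upstream))
```

Because the API reports only what it finds in a LICENSE file, a mismatch flagged this way may reflect a renamed or missing file rather than an actual license change.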
Power to choose
GitHub is right to be concerned about the paucity of license
information in the repositories hosted at its site. But both the
2013 license chooser and the new Licenses API seem to
stem from an assumption on GitHub’s part that the reason so many
repositories lack licenses is that license selection is either
confusing or difficult to find information on. Neither effort strikes
at the heart of the problem: that GitHub makes license selection
optional and, thus, makes licensing an afterthought.
SourceForge has long required new projects to select a license while
performing the initial project setup. Later, when Google Code
supplanted SourceForge as the hosting service of choice, it, too,
required the user to select a license during the first step. So too
do Launchpad.net, GNU Savannah, and BerliOS. FedoraHosted and Debian’s
Alioth both involve manually requesting access to create a new
project, a process that, presumably, involves discussing whether or
not the project will be released under a license compatible with that distribution.
It is hard to escape the fact that only GitHub and its direct
competitors (like Gitorious and GitLab) fail to raise the licensing
question during project setup, and equally hard to avoid the
conclusion that this is why they are littered with so many
non-licensed and mis-licensed repositories. An API for querying
licenses may be a positive step, but it is not
likely to resolve the problem, since it side-steps the underlying
issue.
Hopefully, the current form of the Licenses API is merely the
beginning, and GitHub will proceed to develop it into a truly useful
tool. There is certainly a need for one, and being the most active
project-hosting provider means that GitHub is best positioned to do
something about it.