Why Apache Arrow is the future of open-source columnar in-memory analytics


Performance gets redefined when data lives in memory. Apache Arrow is emerging as the de facto standard for columnar in-memory analytics, and engineers from across the top-level Apache projects are contributing to it. In the coming years we can expect the major big data platforms to adopt Apache Arrow as their columnar in-memory layer.
What can we expect from an in-memory system like Apache Arrow:

  • Columnar: Over the past few years, big data has been all about columnar. This was primarily inspired by the creation and adoption of Apache Parquet and other columnar data storage technologies.
  • In-memory: SAP HANA was among the first to accelerate analytical workloads with its in-memory engine; in the open-source world, Apache Spark followed by speeding up workloads through holding data in memory.
  • Complex data and dynamic schemas: Solving business problems is much easier when data is represented through hierarchical and nested data structures. This was the primary reason for the adoption of JSON and document-based databases.

At this point, few systems support even two of the above concepts, and many support only one. That is where Apache Arrow comes in: it supports all three of them seamlessly.

Arrow is designed to support complex data and dynamic schemas, and in terms of performance it is built entirely around in-memory, columnar storage.

Without Arrow

The bottleneck in a typical system appears when data moves across machines; serialization is an overhead in many cases. Arrow improves the performance of data movement within a cluster by avoiding serialization and deserialization altogether. Another important aspect of Arrow is that when two systems both use Arrow as their in-memory format, they can exchange data cheaply: for example, Kudu could send Arrow data to Impala for analytics, since both are Arrow-enabled, without any costly deserialization on receipt. Inter-process communication with Arrow mostly happens through shared memory, TCP/IP, and RDMA. Arrow also supports a wide variety of data types, covering both SQL and JSON types such as Int, BigInt, Decimal, VarChar, Map, Struct, and Array.
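
As a rough illustration of that zero-copy interchange idea, here is a minimal sketch using the pyarrow Python bindings (assuming pyarrow is installed; the exact API surface has evolved across releases). The bytes written to the IPC stream are simply Arrow's in-memory columnar layout, so an Arrow-enabled reader can consume them without a separate deserialization step.

import pyarrow as pa

# Build a small columnar record batch in memory.
batch = pa.record_batch(
    [pa.array([1, 2, 3], type=pa.int64()), pa.array(["a", "b", "c"])],
    names=["id", "label"],
)

# Write it to an Arrow IPC stream in a buffer; shared memory or a
# socket would work the same way.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
buf = sink.getvalue()

# Another Arrow-enabled process can read the stream directly: the
# bytes are already laid out in Arrow's columnar format.
reader = pa.ipc.open_stream(buf)
table = reader.read_all()
print(table.column("id"))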

Nobody wants to wait for answers from their data: the faster they get an answer, the faster they can ask the next question or solve their business problem. CPUs keep getting faster and more sophisticated in design, so the key challenge in any system is keeping CPU utilization near 100% and using the CPU efficiently. When data is in a columnar structure, it is much easier to apply SIMD instructions to it.

SIMD is short for Single Instruction/Multiple Data: a single instruction processes multiple data elements at once. In contrast, the conventional sequential approach of using one instruction per data element is called scalar processing. In some cases, such as when using AVX instructions, these optimizations can increase performance by two orders of magnitude.
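
To make the contrast concrete, here is a small illustrative sketch in Python (assuming NumPy is available; NumPy is not part of Arrow, it simply makes the scalar-versus-vectorized difference easy to see). The explicit loop handles one element at a time, while the single sum() call runs over a contiguous column in compiled code that can take advantage of SIMD instructions.

import numpy as np

values = np.random.rand(1_000_000)  # one contiguous, column-like array

# Scalar-style processing: one element handled per iteration.
total_scalar = 0.0
for v in values:
    total_scalar += v

# Vectorized processing: one call over the whole column; the compiled
# loop underneath can use SIMD instructions on the contiguous data.
total_vectorized = values.sum()

assert np.isclose(total_scalar, total_vectorized)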


Arrow is designed to maximize cache locality, pipelining, and SIMD instructions. Cache locality, pipelining, and super-word operations frequently provide 10-100x faster execution. Since many analytical workloads are CPU-bound, these benefits translate into dramatic end-user performance gains: faster answers and higher levels of user concurrency.


GitLab Runner 1.1 with Autoscaling

Mar 29, 2016

Over the last year, GitLab Runner has become a significant part of the GitLab
family. We are happy to announce that GitLab Runner 1.1 is released today, a
release that brings major improvements over its predecessor. There is one
feature, though, that we are especially excited about, and it is the cornerstone of this release.

Without further ado, we present you GitLab Runner 1.1 and its brand-new, shiny
feature: Autoscaling!

About GitLab Runner

GitLab has built-in continuous integration to allow you to run a
number of tasks as you prepare to deploy your software. Typical tasks
might be to build a software package or to run tests as specified in a
YAML file. These tasks need to be run by something, and in GitLab this something
is called a Runner: an application that processes builds.

GitLab Runner 1.1 is the biggest release yet. Autoscaling provides the ability
to utilize resources in a more elastic and dynamic way. Along with autoscaling
come some other significant features as well. Among them are support for a
distributed cache server and user-requested features such as passing artifacts
between stages and the ability to specify archive names.

Let’s explore these features one by one.

The Challenge of Scaling

Other continuous integration platforms offer similar functionality.
For example, Runners are called “Agents” in Atlassian’s Bamboo (which integrates
with Bitbucket). Some services, like Bamboo, charge individually for using these
virtual machines, and if you need a number of them it can get quite expensive,
quite quickly. If you have only one available Agent or Runner, you could be
slowing down your work.

We don’t charge anything for connecting Runners in GitLab; it’s all built-in.
However, the challenge up until now has been the scaling of these Runners. With
GitLab, Runners can be specified per project, but this means you have to create
a Runner per project, and that doesn’t scale well.

An alternative up until now was to create a number of shared Runners which
can be used across your entire GitLab instance.

However, what happens when you need more Runners than are available?
You end up having to queue your tasks, and that will eventually slow things down.

Hence the need for autoscaling.

Autoscaling increases developer happiness

We decided to build autoscaling with the help of Docker Machine.
Docker Machine allows you to provision and manage multiple remote Docker hosts
and supports a vast number of virtualization and cloud providers;
autoscaling currently works only with Docker Machine.

Because the Runners will autoscale, your infrastructure contains only as
many build instances as necessary at any time. If you configure the Runner to
only use autoscale, the system on which the Runner is installed acts as a
bastion for all the machines it creates.

Autoscaling allows you to increase developer happiness. Everyone hates to wait
for their builds to be picked up, just because all Runners are currently in use.

The autoscaling feature promotes heavy parallelization of your tests, something
that is made easy by defining multiple jobs in your .gitlab-ci.yml file.

While cutting down the waiting time to a minimum makes your developers happy,
this is not the only benefit of autoscaling. In the long run, autoscaling
reduces your infrastructure costs:

  • autoscaling follows your team’s work hours,
  • you pay only for what you use, even when scaling to hundreds of machines,
  • you can use larger machines for the same cost, thus having your builds run
    faster,
  • the machines are self-managed, everything is handled by docker-machine, making
    your Administrators happy too,
  • your responsibility is to only manage GitLab and one GitLab Runner instance.

Below, you can see a real-life example of the Runner’s autoscale feature, tested
on GitLab.com for the GitLab Community Edition project:

Real life example of autoscaling

Each machine on the chart is an independent cloud instance, running build jobs
inside Docker containers. Our builds are run on DigitalOcean 4GB machines, with
CoreOS and the latest Docker Engine installed.

DigitalOcean proved to be one of the best choices for us, mostly because of
the fast spin-up time (around 50 seconds) and their very fast SSDs, which are
ideal for testing purposes. Currently, our GitLab Runner processes up to 160
builds at a time.

If you are eager to test this yourself, read more on configuring the new
autoscaling feature.

Distributed cache server

In GitLab Runner 0.7.0 we introduced support for caching. This release brings
improvements to this feature too, which is especially useful with autoscaling.

GitLab Runner 1.1 allows you to use an external server to store all your caches.
The server needs to expose an S3-compatible API; while you can use, for
example, Amazon S3, there are also a number of other servers that you can run
on-premises just for the purpose of caching.

Read more about the distributed cache server and learn how to set
up and configure your own.

Artifacts improvements

We listen closely to our community to extend GitLab Runner with user requests.
One of the often-requested features was related to passing artifacts between
builds.

GitLab offers some awesome capabilities to define multiple jobs and group
them in different stages. The jobs are always independent, and can be run on
different Runners.

Up until now, you had to use an external method if you wanted to pass the files
from one job to another. With GitLab Runner 1.1 this happens automatically.
Going one step further, GitLab 8.6 allows you to fine-tune what should be
passed. This is now handled by defining dependencies:

build:osx:
  stage: build
  artifacts: ...

test:osx:
  stage: test
  dependencies:
  - build:osx

The second most-requested feature was the ability to change the name of the
uploaded artifacts archive. With GitLab Runner 1.1, this is now possible with
this simple syntax:

build_linux:
  artifacts:
    name: "build_linux_$CI_BUILD_REF_NAME"

Read more about naming artifacts.

Improved documentation

With GitLab Runner 1.1 we’ve also released improved documentation, explaining
all executors and commands. The documentation also describes what features are
supported by different configurations.

Read the revised documentation.

Using Runner on OSX

We also upgraded GitLab Runner 1.1 to be compatible with El Capitan and Xcode 7.3.
You should read the revised installation guide for OSX
and the FAQ section describing the needed preparation steps.

Changelog

So far we described the biggest features, but these are not all the changes
introduced with GitLab Runner 1.1. We know that even the smallest change can
make a difference in your workflow, so here is the complete list:

- Use Go 1.5
- Add docker-machine based autoscaling for docker executor
- Add support for external cache server
- Add support for `sh`, allowing to run builds on images without the `bash`
- Add support for passing the artifacts between stages
- Add `docker-pull-policy`, it removes the `docker-image-ttl`
- Add `docker-network-mode`
- Add `git` to gitlab-runner:alpine
- Add support for `CapAdd`, `CapDrop` and `Devices` by docker executor
- Add support for passing the name of artifacts archive (`artifacts:name`)
- Refactor: The build trace is now implemented by `network` module
- Refactor: Remove CGO dependency on Windows
- Fix: Create alternative aliases for docker services (uses `-`)
- Fix: VirtualBox port race condition
- Fix: Create cache for all builds, including tags
- Fix: Make the shell executor more verbose when the process cannot be started
- Fix: Pass gitlab-ci.yml variables to build container created by docker executor
- Fix: Don't restore cache if not defined in gitlab-ci.yml
- Fix: always use `json-file` when starting docker containers

You can see why we think version 1.1 is fantastic.
Go get it, test it and share your feedback with us!
You can meet with the CI team in our upcoming webcast.

Live webcast: GitLab CI

Sign up for our webcast on April 14th, which includes an overview and tutorial
about using GitLab CI. Meet people from the GitLab CI team and get your questions
answered live!

  • Date: Thursday, April 14, 2016
  • Time: 5pm (17:00) UTC; 12pm EST; 9am PST
  • Register here

Can’t make it? Register anyway, and we’ll send you a link to watch it later!





From side project to 250M daily requests

What is IPinfo and what can developers do with your API?

IPInfo is an IP details API. It has geolocation details so you know the city, region, country, and often the postal code or area code for an IP. If you’re customizing content on your website, you can show different features to different people based on the country or city.

Another detail is the organization. If you look up your IP and you’re home, it might say it is a Comcast or AT&T IP address. It also returns the hostname for the IP. Also, we have a totally free plan: you can curl ipinfo.io without any IP address and it will give you your own IP details, or you can append an address such as /8.8.8.8 to look up a specific IP, as in the sketch below.
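
Here is a minimal sketch of that lookup from Python, using the requests library (an assumption; any HTTP client would do). The exact fields depend on the plan, but the free response typically includes keys such as ip, hostname, city, region, country, loc, and org.

import requests

# Look up details for a specific IP; drop the address from the URL to
# get details for your own IP instead.
resp = requests.get("https://ipinfo.io/8.8.8.8/json", timeout=5)
resp.raise_for_status()
info = resp.json()

print(info.get("country"))  # e.g. "US"
print(info.get("org"))      # the organization that owns the address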

We have some optional add-ons as well, like a carrier field or a hosting provider field (e.g. to detect an Amazon AWS or Rackspace IP address). We also have some rudimentary proxy detection that will let you know if an IP address is a known proxy.

Who are a few sample customers?

Tesla uses it on their website for the dealership finder. Through the API, they can automatically detect that I live in Mountain View and show that the closest dealership is in Palo Alto based on my IP address.

We’ve got lots of different ad networks that use us to customize their offers and content based on location. In particular, mobile ad networks will show different offers based on the country you’re in.

There are quite a few brand names. TripAdvisor and Xerox use it to customize parts of their sites. Brooklyn Library uses it and I’m not sure what for 🙂

You have 250 million daily requests. Walk me through how you got initial users and what the best growth channels have been.

Initially, I put out a dead simple webpage with a bootstrap theme. All of the data came from an existing geo IP database and it just showed your IP location on a map. You can still see this on the homepage.

Pretty soon, I saw a question on StackOverflow asking if there are good APIs to see where your IP is located. I figured I already have all the data, so it was easy to build a simple API. I honestly built it in a couple of hours and then answered the question.

Within a couple months, I got an email notification from Linode that said your CPU usage is off the charts. That’s strange — I hosted a bunch of sites on the same server so I didn’t know what’s going on here. I logged in, checked the access logs, and there were millions of API requests per day. It really started taking off on its own thanks to StackOverflow.

It’s just been inbound? No outbound sales?

Yeah, absolutely. There was nothing to the API beyond a GET request for basic IP info, so I looked into improving it. There were a few people doing 10 million requests a day and a bunch of people doing around a million.

I decided to try some paid plans using access tokens and made it free for 1,000 requests a day. I figured most small side projects would need less than this daily and be able to use it for free. After adding access tokens and rate limiting, I added four plans: $10, $50, $100, and $200 per month.
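
As a purely hypothetical illustration of the kind of per-token daily quota described here (this is not IPinfo's actual implementation; the names and the in-memory store are invented for the sketch):

from collections import defaultdict
from datetime import date

FREE_DAILY_LIMIT = 1000  # the free tier mentioned above; paid tokens would get more

# token -> (day, request count); an in-memory stand-in for a real store.
usage = defaultdict(lambda: (date.today(), 0))

def allow_request(token: str, limit: int = FREE_DAILY_LIMIT) -> bool:
    """Return True if this token still has quota left for today."""
    day, count = usage[token]
    if day != date.today():      # new day: reset the counter
        day, count = date.today(), 0
    if count >= limit:
        return False             # over quota; the API would answer 429
    usage[token] = (day, count + 1)
    return True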

One of my first paying customers was Tesla, within a week or so of rolling out the paid plans. It’s continued to grow and I’m seeing more and more enterprise customers interested in directly downloading our data, instead of accessing it through the API.

Ben’s Stack Overflow profile

We’ve used no paid advertising or other outreach. It’s all been totally inbound other than writing a bunch of answers on Stack Overflow related to how to find the country of a visitor on my webpage, how to get someone’s IP with javascript, and anything else relevant to my API. It got to the point where I could reach critical mass with people who had read the different posts and they’d link to the site in their own answers.

Could you tell me more about your stack?

There have been a few variations.

Initially, I had the Linode server. Soon after getting the CPU usage warning I added a couple of DigitalOcean servers, and used Amazon Route 53 for DNS to route to one of the servers using round robin. It worked reasonably well for a while, but adding a new server required DNS updates, which take a while to propagate. If a server has any problems, it’ll continue to get traffic because of the delay.

Soon after, I moved everything to AWS, with the servers behind elastic load balancers so I could quickly switch servers in and out without any downtime. AWS scaling groups also helped automate this to some extent.

I set up servers in 3 regions (US east coast, US west coast, and Frankfurt), and then made use of Route 53’s latency-based routing to route to the lowest latency server, which helps to keep the API’s latency super low wherever you are in the world. I’ll also be adding servers in Singapore soon, to cut latency even further in Asia.

This setup worked well, but deploys were a huge pain. I’d need to spin up fresh new servers, deploy the latest code there, add the new servers to the load balancer, and then decommission the old ones. It was all scripted, but it still involved running a bunch of different scripts, and checking that everything worked as expected.

AWS does have CodeDeploy to solve this problem, but it’s not yet available outside of some core regions, which meant I couldn’t use it.

That’s why I switched to Elastic Beanstalk, which is basically a managed version of that AWS setup. It creates almost exactly the same server setup as I had before, but deploying is now a case of running a simple Elastic Beanstalk command, and it handles everything for me.

One thing that has been consistent throughout is that each server can independently answer every API request, there’s no shared database or anything. Everything that’s needed for a request is kept in memory, so when a new server spins up, the requests can come in straight away. It’s super quick and well over 90% of our 250 million daily requests are handled in less than 10 milliseconds.
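
A rough sketch of what answering every request from memory can look like for IP data (a hypothetical illustration, not IPinfo's actual code): the ranges are sorted once at startup, and each lookup is a binary search with no database round-trip.

import bisect
import ipaddress

# Hypothetical in-memory table of (range start, range end, details),
# loaded once when the server process starts.
RANGES = sorted(
    [
        (int(ipaddress.ip_address("8.8.8.0")),
         int(ipaddress.ip_address("8.8.8.255")),
         {"org": "ExampleNet", "country": "US"}),
        (int(ipaddress.ip_address("1.2.3.0")),
         int(ipaddress.ip_address("1.2.3.255")),
         {"org": "OtherNet", "country": "AU"}),
    ],
    key=lambda r: r[0],
)
STARTS = [start for start, _, _ in RANGES]

def lookup(ip: str):
    """Binary-search the in-memory ranges; return details or None."""
    value = int(ipaddress.ip_address(ip))
    i = bisect.bisect_right(STARTS, value) - 1
    if i >= 0 and RANGES[i][0] <= value <= RANGES[i][1]:
        return RANGES[i][2]
    return None

print(lookup("8.8.8.8"))  # {'org': 'ExampleNet', 'country': 'US'}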

It sounds like you’ve had really good uptime. What sort of monitoring do you have to track that?

Primarily the AWS logs, and also Pingdom just to be safe. AWS has great metrics. There are load balancer reports that show your IP latency. They also show a summary of your requests broken down by 2xx, 3xx, etc. Assuming that Amazon is doing a decent job of keeping the load balancer up, you can see how many requests from the load balancer to my backend failed.

Also, I import our load balancer logs into Redshift each day and generate a bunch of reports on that. I mostly try to drill into requests that failed. The main thing I’m worried about is not shipping buggy data.

Do you have continuous integration tests that run before you deploy?

The site gets re-deployed every day with fresh data, and we have a bunch of scripts that pull in the raw data, process it, check that everything updated properly, and then do the deploy to the 3 different server regions that we’re currently in.
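
As an illustration of the kind of pre-deploy check described here (hypothetical; the file names and threshold are invented, not the actual scripts), a sketch that refuses to ship a freshly generated dataset if it is empty or has shrunk suspiciously compared with the current one:

import sys

def sanity_check(new_path: str, current_path: str, max_shrink: float = 0.5) -> None:
    """Abort the deploy if the newly generated dataset looks wrong."""
    with open(new_path) as f:
        new_rows = sum(1 for _ in f)
    with open(current_path) as f:
        old_rows = sum(1 for _ in f)

    if new_rows == 0:
        sys.exit("refusing to deploy: new dataset is empty")
    if old_rows and new_rows < old_rows * max_shrink:
        sys.exit(f"refusing to deploy: row count fell from {old_rows} to {new_rows}")

if __name__ == "__main__":
    # Hypothetical file names for the freshly built and currently deployed data.
    sanity_check("org_db_new.jsonl", "org_db_current.jsonl")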

Do you have any war stories of when the service went down?

Haha, all of the issues so far have been because of me. As I mentioned, we have checks to make sure the database that we generated isn’t corrupt when we deploy. Those checks have evolved over time to catch mistakes that happened before.

For example, one time a required input file was missing and the script generated an empty organization database; we deployed that and then got a bunch of emails…

Over time, the integration tests have become much more comprehensive!

What are some use cases for others to build on top of the API?

Content customization. An obvious example: any e-commerce site like Amazon has different stores for different countries. If you know a German visitor is looking at books, you can redirect them to the .de site and show German-language options.
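
A minimal sketch of that kind of country-based customization in Python (the storefront mapping is hypothetical; the ipinfo call is the same one shown earlier, using the requests library):

import requests

# Hypothetical mapping from country code to a localized storefront.
STORES = {"DE": "https://example.de", "FR": "https://example.fr"}
DEFAULT_STORE = "https://example.com"

def storefront_for(ip: str) -> str:
    """Pick a storefront URL based on the visitor's country."""
    info = requests.get(f"https://ipinfo.io/{ip}/json", timeout=5).json()
    return STORES.get(info.get("country"), DEFAULT_STORE)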

Network customization. There are some very useful ad ideas (like being able to target T-Mobile and AT&T users differently) and I’m interested to see what else people could do with the data. For example, if a user is on a slow mobile network versus Wi-Fi, maybe you would serve low-resolution images or not show ads because they don’t convert as well.

Location mashups. If you have location-based data, you can mix it with the ipinfo API. For example, I see a lot of people integrating with weather databases.



Changes to npm’s unpublish policy

One of Node.js’ core strengths is the community’s trust in npm’s registry. As it’s grown, the registry has filled with packages that are more and more interconnected.

A byproduct of being so interdependent is that a single actor can wreak significant havoc across the ecosystem. If a publisher unpublishes a package that others depend upon, this breaks every downstream project that depends upon it, possibly thousands of projects.

Last Tuesday’s events revealed that this danger isn’t just hypothetical, and it’s one for which we already should have been prepared. It’s our mission to help the community succeed, and by failing to protect the community, we didn’t uphold that mission.

We’re sorry.

This week, we’ve seen a lot of discussion about why unpublish exists at all. Similar discussions happen within npm, Inc. There are important and legitimate reasons for the feature, so we have no intention of removing it, but now we’re significantly changing how unpublish behaves and the policies that surround it.

These changes, which incorporate helpful feedback from a lot of community members, are intended to ensure that events like Tuesday’s don’t happen again.

Our new policy

Going forward, if you try to unpublish a given package@version:

  • If the version is less than 24 hours old, you can unpublish it. The
    package will be completely removed from the registry. No new
    packages can be published using the same name and version.

  • If the version is older than 24 hours, then the unpublish will fail, with a message to contact support@npmjs.com.

  • If you contact support, they will check to see if removing that version of your package would break any other installs. If so, we will not remove it. You’ll either have to transfer ownership of the package or reach out to the owners of dependent packages to change their dependency.

  • If every version of a package is removed, it will be replaced with a security placeholder package, so that the formerly used name will not be susceptible to malicious squatting.

  • If another member of the community wishes to publish a package with the same name as a security placeholder, they’ll need to contact support@npmjs.com.  npm will determine whether to grant this request. (Generally, we will.)
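
Taken together, the rules above amount to a fairly simple decision flow. Here is a hypothetical sketch of that flow in Python, purely as an illustration of the policy (it is not npm's actual implementation, and the function and its inputs are invented for the sketch):

from datetime import datetime, timedelta, timezone

SUPPORT = "contact support@npmjs.com"

def unpublish_decision(published_at: datetime, breaks_other_installs: bool) -> str:
    """Rough model of the policy for unpublishing one package@version."""
    age = datetime.now(timezone.utc) - published_at
    if age < timedelta(hours=24):
        # Removed outright; the same name and version can never be reused.
        return "unpublish allowed"
    if breaks_other_installs:
        # Support will not remove versions that other installs depend on.
        return f"refused: {SUPPORT}; transfer ownership or ask dependents to update"
    return f"handled by support: {SUPPORT}"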

Examples

This can be a bit difficult to understand in the abstract. Let’s walk
through some examples.

  1. Brenna is a maintainer of a popular package named “supertools”. Supertools has 3 published versions: 0.0.1, 0.3.0, and 0.3.1. Many packages depend on all the versions of supertools, and, across all versions, supertools gets around 2 million downloads a month.

    Brenna does a huge refactor and publishes 1.0.0. An hour later, she realizes that there is a huge vulnerability in the project and needs to unpublish. Version 1.0.0 is less than 24 hours old. Brenna is able to unpublish version 1.0.0.

    Embarrassed, Brenna wants to unpublish the whole package. However, because the other versions of supertools are older than 24 hours, Brenna has to contact support@npmjs.com to continue unpublishing. After discussing the matter, Brenna opts instead to transfer ownership of the package to Sarah.

  2. Supreet is the maintainer of a package named “fab-framework-plugin”, which has 2 published versions: 0.0.1 and 1.0.0. fab-framework-plugin gets around 5,000 downloads monthly across both versions, but most packages depend on it via ^1.0.0.

    Supreet realizes that there are several serious bugs in 1.0.0 and would like to completely unpublish the version. He attempts to unpublish and is prompted to talk to support@npmjs.com because the 1.0.0 version of his package is older than 24 hours. Instead, Supreet publishes a new version with bug fixes, 1.0.1.

    Because all dependents are satisfied by 1.0.1, support agrees to grant Supreet’s request to delete 1.0.0.

  3. Tef works for Super Private Company, which has several private
      packages it uses to implement static analysis on Node.js packages.

    Working late one night, Tef accidentally publicly publishes a private package called “@super-private-company/secrets”. Immediately noting his mistake, Tef unpublishes secrets. Because secrets was only up for a few minutes — well within the 24-hour window for unrestricted unpublishes — Tef is able to successfully unpublish.

    Because Tef is a responsible developer aware of security best-practices, Tef realizes that the contents of secrets have been effectively disclosed, and spends the rest of the evening resetting passwords and apologizing to his coworkers.

  4. Charlotte is the maintainer of a package called “superfoo”. superfoo is a framework on which no packages depend. However, the consultancy Cool Kids Club has been using it to develop their applications for years. These applications are private, and not published to the registry, so they don’t count as packages that depend on superfoo.

    Charlotte burns out on open source and decides to unpublish all of their packages, including superfoo. Even though there are no published dependents on superfoo, superfoo is older than 24 hours, and therefore Charlotte must contact support@npmjs.com to unpublish it.

    After Charlotte contacts support, insisting on the removal of superfoo, npm deprecates superfoo with a message that it is no longer supported. Whenever it is installed, a notice is displayed to the installer.

    Cool Kids Club sees this notice and republishes superfoo as “coolfoo”. Cool Kids Club software now depends on “coolfoo” and therefore does not break.

Changes to come

This policy is a first step towards balancing the rights of individual publishers with npm’s responsibility to maintain the social cohesion of the open source community.

The policy still relies on human beings making human decisions with their human brains. It’s a fairly clear policy, but there is “meat in the machine”, and that means it will eventually reach scaling problems as our community continues to grow.

In the future, we may extend this policy (including both the human and automated portions) to take into account such metrics as download activity, dependency checking, and other measures of how essential a package is to the community.

Moving forward

In balancing individual and community needs, we’re extremely cognizant that developers feel a sense of ownership over their code. Being able to remove it is a part of that ownership.

However, npm exists to facilitate a productive community. That means
we must balance individual ownership with collective benefit.

That tension is at the very core of open source. No package ecosystem
can survive without the ability to share and distribute code. That’s
why, when you publish a package to the registry, you agree to our
Terms of Service. The key lines are:

Your Content belongs to you. You decide whether and how to license it. But at a minimum, you license npm to provide Your Content to users of npm Services when you share Your Content. That special license allows npm to copy, publish, and analyze Your Content, and to share its analyses with others. npm may run computer code in Your Content to analyze it, but npm’s special license alone does not give npm the right to run code for its functionality in npm products or services.

When Your Content is removed from the Website or the Public Registry, whether by you or npm, npm’s special license ends when the last copy disappears from npm’s backups, caches, and other systems. Other licenses, such as open source licenses, may continue after Your
Content is removed. Those licenses may give others, or npm itself, the right to share Your Content with npm Services again.

These lines are the result of a clarification that we asked our lawyer to make for the purposes of making this policy as understandable as possible. You can see that in this PR.

We don’t try to hide our policies; in fact, we encourage you to review their full list of changes and updates, linked from every policy page.

We acknowledge that there are cases where you are justified in wanting to remove your code, and also that removing packages can cause harm to other users. That’s exactly why we are working so hard on this issue.

This new policy is just the first of many steps we’ll be taking. We’ll be depending on you to help us consider edge cases, make tough
choices, and continue building a robust ecosystem where we can all
build amazing things.

You probably have questions about this policy change, and maybe you have a perspective you’d like to share, too.

We appreciate your feedback, even when we can’t respond to all of it. Your participation in this ecosystem is the core of its greatness. Please keep commenting and contributing: you are an important part of this community!

Please post comments and questions here. We’ve moved to a Github issue for improved moderation.



Building highly available applications using Kubernetes’ new multi-zone clusters

Editor’s note: this is the third post in a series of in-depth posts on what’s new in Kubernetes 1.2



Introduction 

One of the most frequently-requested features for Kubernetes is the ability to run applications across multiple zones. And with good reason — developers need to deploy applications across multiple domains, to improve availability in the event of a single-zone outage.

Kubernetes 1.2, released two weeks ago, adds support for running a single cluster across multiple failure zones (GCP calls them simply “zones,” Amazon calls them “availability zones,” here we’ll refer to them as “zones”). This is the first step in a broader effort to allow federating multiple Kubernetes clusters together (sometimes referred to by the affectionate nickname “Ubernetes”). This initial version (referred to as “Ubernetes Lite”) offers improved application availability by spreading applications across multiple zones within a single cloud provider.

Multi-zone clusters are deliberately simple, and by design, very easy to use — no Kubernetes API changes were required, and no application changes either. You simply deploy your existing Kubernetes application into a new-style multi-zone cluster, and your application automatically becomes resilient to zone failures.

Now into some details . . .  

Ubernetes Lite works by leveraging the Kubernetes platform’s extensibility through labels. Today, when nodes are started, labels are added to every node in the system. With Ubernetes Lite, the system has been extended to also add information about the zone it’s being run in. With that, the scheduler can make intelligent decisions about placing application instances.

Specifically, the scheduler already spreads pods to minimize the impact of any single node failure. With Ubernetes Lite, via SelectorSpreadPriority, the scheduler will make a best-effort placement to spread across zones as well. We should note that if the zones in your cluster are heterogeneous (e.g., different numbers of nodes or different types of nodes), you may not be able to achieve even spreading of your pods across zones. If desired, you can use homogeneous zones (same number and types of nodes) to reduce the probability of unequal spreading.

This improved labeling also applies to storage. When persistent volumes are created, the PersistentVolumeLabel admission controller automatically adds zone labels to them. The scheduler (via the VolumeZonePredicate predicate) will then ensure that pods that claim a given volume are only placed into the same zone as that volume, as volumes cannot be attached across zones.

Walkthrough 

We’re now going to walk through setting up and using a multi-zone cluster on both Google Compute Engine (GCE) and Amazon EC2 using the default kube-up script that ships with Kubernetes. Though we highlight GCE and EC2, this functionality is available in any Kubernetes 1.2 cluster, including Google Container Engine (GKE).

Bringing up your cluster 

Creating a multi-zone deployment for Kubernetes is the same as for a single-zone cluster, but you’ll need to pass an environment variable (MULTIZONE) to tell the cluster to manage multiple zones. We’ll start by creating a multi-zone-aware cluster on GCE and/or EC2.

GCE:

curl -sS https://get.k8s.io | MULTIZONE=1 KUBERNETES_PROVIDER=gce \
KUBE_GCE_ZONE=us-central1-a NUM_NODES=3 bash

EC2:

curl -sS https://get.k8s.io | MULTIZONE=1 KUBERNETES_PROVIDER=aws \
KUBE_AWS_ZONE=us-west-2a NUM_NODES=3 bash

At the end of this command, you will have brought up a cluster that is ready to manage nodes running in multiple zones. You’ll also have brought up NUM_NODES nodes and the cluster’s control plane (i.e., the Kubernetes master), all in the zone specified by KUBE_{GCE,AWS}_ZONE. In a future iteration of Ubernetes Lite, we’ll support an HA control plane, where the master components are replicated across zones. Until then, the master will become unavailable if the zone where it is running fails. However, containers that are running in all zones will continue to run and be restarted by the Kubelet if they fail, so the application itself will tolerate such a zone failure.

Nodes are labeled 

To see the additional metadata added to the node, simply view all the labels for your cluster (the example here is on GCE):

$ kubectl get nodes --show-labels

NAME                     STATUS                     AGE       LABELS
kubernetes-master        Ready,SchedulingDisabled   6m        
beta.kubernetes.io/instance-type=n1-standard-1,failure-domain.beta.kubernetes.
io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kub
ernetes.io/hostname=kubernetes-master
kubernetes-minion-87j9   Ready                      6m        
beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.
io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kub
ernetes.io/hostname=kubernetes-minion-87j9
kubernetes-minion-9vlv   Ready                      6m        
beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.
io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kub
ernetes.io/hostname=kubernetes-minion-9vlv
kubernetes-minion-a12q   Ready                      6m        
beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.
io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kub
ernetes.io/hostname=kubernetes-minion-a12q

The scheduler will use the labels attached to each of the nodes (failure-domain.beta.kubernetes.io/region for the region, and failure-domain.beta.kubernetes.io/zone for the zone) in its scheduling decisions.

Add more nodes in a second zone 

Let’s add another set of nodes to the existing cluster, but running in a different zone (us-central1-b for GCE, us-west-2b for EC2). We run kube-up again; by specifying KUBE_USE_EXISTING_MASTER=true, kube-up will not create a new master but will reuse the one that was previously created.
GCE:

KUBE_USE_EXISTING_MASTER=true MULTIZONE=1 KUBERNETES_PROVIDER=gce \
KUBE_GCE_ZONE=us-central1-b NUM_NODES=3 kubernetes/cluster/kube-up.sh

On EC2, we also need to specify the network CIDR for the additional subnet, along with the master internal IP address:

KUBE_USE_EXISTING_MASTER=true MULTIZONE=1 KUBERNETES_PROVIDER=aws \
KUBE_AWS_ZONE=us-west-2b NUM_NODES=3 KUBE_SUBNET_CIDR=172.20.1.0/24 \
MASTER_INTERNAL_IP=172.20.0.9 kubernetes/cluster/kube-up.sh

View the nodes again; 3 more nodes will have been launched and labelled (the example here is on GCE):

$ kubectl get nodes --show-labels

NAME                     STATUS                     AGE       LABELS
kubernetes-master        Ready,SchedulingDisabled   16m       
beta.kubernetes.io/instance-type=n1-standard-1,failure-domain.beta.kubernetes.
io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kub
ernetes.io/hostname=kubernetes-master
kubernetes-minion-281d   Ready                      2m        
beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.
io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-b,kub
ernetes.io/hostname=kubernetes-minion-281d
kubernetes-minion-87j9   Ready                      16m       
beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.
io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kub
ernetes.io/hostname=kubernetes-minion-87j9
kubernetes-minion-9vlv   Ready                      16m       
beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.
io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kub
ernetes.io/hostname=kubernetes-minion-9vlv
kubernetes-minion-a12q   Ready                      17m       
beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.
io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kub
ernetes.io/hostname=kubernetes-minion-a12q
kubernetes-minion-pp2f   Ready                      2m        
beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.
io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-b,kub
ernetes.io/hostname=kubernetes-minion-pp2f
kubernetes-minion-wf8i   Ready                      2m        
beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.
io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-b,kub
ernetes.io/hostname=kubernetes-minion-wf8i

Let’s add one more zone:
GCE:

KUBE_USE_EXISTING_MASTER=true MULTIZONE=1 KUBERNETES_PROVIDER=gce \
KUBE_GCE_ZONE=us-central1-f NUM_NODES=3 kubernetes/cluster/kube-up.sh

EC2:

KUBE_USE_EXISTING_MASTER=true MULTIZONE=1 KUBERNETES_PROVIDER=aws \
KUBE_AWS_ZONE=us-west-2c NUM_NODES=3 KUBE_SUBNET_CIDR=172.20.2.0/24 \
MASTER_INTERNAL_IP=172.20.0.9 kubernetes/cluster/kube-up.sh

Verify that you now have nodes in 3 zones:

kubectl get nodes --show-labels

Highly available apps, here we come.

Deploying a multi-zone application 

Create the guestbook-go example, which includes a ReplicationController of size 3, running a simple web app. Download all the files from here, and execute the following command (the command assumes you downloaded them to a directory named “guestbook-go”):

kubectl create -f guestbook-go/

You’re done! Your application is now spread across all 3 zones. Prove it to yourself with the following commands:

$  kubectl describe pod -l app=guestbook | grep Node
Node:       kubernetes-minion-9vlv/10.240.0.5
Node:       kubernetes-minion-281d/10.240.0.8
Node:       kubernetes-minion-olsh/10.240.0.11

$ kubectl get node kubernetes-minion-9vlv kubernetes-minion-281d \
  kubernetes-minion-olsh --show-labels
NAME                     STATUS    AGE       LABELS
kubernetes-minion-9vlv   Ready     34m       
beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.
io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kub
ernetes.io/hostname=kubernetes-minion-9vlv
kubernetes-minion-281d   Ready     20m       
beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.
io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-b,kub
ernetes.io/hostname=kubernetes-minion-281d
kubernetes-minion-olsh   Ready     3m        
beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.
io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-f,kub
ernetes.io/hostname=kubernetes-minion-olsh

Further, load-balancers automatically span all zones in a cluster; the guestbook-go example includes an example load-balanced service:

$ kubectl describe service guestbook | grep LoadBalancer.Ingress
LoadBalancer Ingress:   130.211.126.21

ip=130.211.126.21

$ curl -s http://${ip}:3000/env | grep HOSTNAME
  "HOSTNAME": "guestbook-44sep",

$ (for i in `seq 20`; do curl -s http://${ip}:3000/env | grep HOSTNAME; done) \
  | sort | uniq
  "HOSTNAME": "guestbook-44sep",
  "HOSTNAME": "guestbook-hum5n",
  "HOSTNAME": "guestbook-ppm40",

The load balancer correctly targets all the pods, even though they’re in multiple zones.


Shutting down the cluster 

When you’re done, clean up:

GCE:

KUBERNETES_PROVIDER=gce KUBE_USE_EXISTING_MASTER=true \
KUBE_GCE_ZONE=us-central1-f kubernetes/cluster/kube-down.sh
KUBERNETES_PROVIDER=gce KUBE_USE_EXISTING_MASTER=true \
KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-down.sh
KUBERNETES_PROVIDER=gce KUBE_GCE_ZONE=us-central1-a \
kubernetes/cluster/kube-down.sh

EC2:

KUBERNETES_PROVIDER=aws KUBE_USE_EXISTING_MASTER=true KUBE_AWS_ZONE=us-west-2c \
kubernetes/cluster/kube-down.sh
KUBERNETES_PROVIDER=aws KUBE_USE_EXISTING_MASTER=true KUBE_AWS_ZONE=us-west-2b \
kubernetes/cluster/kube-down.sh
KUBERNETES_PROVIDER=aws KUBE_AWS_ZONE=us-west-2a \
kubernetes/cluster/kube-down.sh

Conclusion 

A core philosophy for Kubernetes is to abstract away the complexity of running highly available, distributed applications. As you can see here, other than a small amount of work at cluster spin-up time, all the complexity of launching application instances across multiple failure domains requires no additional work by application developers, as it should be. And we’re just getting started!

Please join our community and help us build the future of Kubernetes! There are many ways to participate. If you’re particularly interested in scalability, you’ll be interested in:

And of course, for more information about the project in general, go to www.kubernetes.io

— Quinton Hoole, Staff Software Engineer, Google, and Justin Santa Barbara



Saved Replies

Replying with the same response to Issues and Pull requests over and over can be tedious. Saved replies allow you to create a response to Issues and Pull requests and reuse it multiple times. Saved replies are available on all repositories starting today. This can save you a ton of time typing and posting the replies you use most frequently.

To get started, go to your personal settings and click “Saved replies”. Here, you can add custom replies based on the types of responses you use most frequently. You can edit and update these anytime.

Add a saved reply

You can access your Saved replies when composing or replying to an Issue or Pull request.

Insert a saved reply

Check out the documentation for additional information on the feature.



Sony’s Ultra 4K Streaming Service Launching On April 4; Titles Priced At $30

Janko Roettgers reports for Variety: Sony is launching its 4K movie streaming service, called Ultra, next month. Consumers will be able to buy movies from the service, and stream them to supported Sony 4K TV sets, starting April 4. The new service will offer 4K HDR movies to stream, including extras that have previously been available only on physical discs. Ultra ties into UltraViolet, the cloud locker service backed by Sony. Consumers will be able to upgrade SD and HD quality movies from their UltraViolet cloud locker for $12 to $15, respectively.


