tpryan

Makefile – Get Compute Engine Instance IP

I have a test that requires me to spin up a VM with a service running on port 80. I need to test whether that service is working, but the machine itself is irrelevant, so I’m not bothering to give it a hostname. I still need to test output on port 80, so I need the external IP address of the machine and nothing else.

Enter the dynamic data trick from my first Makefile post – again.

So I grab the value and dump it to a variable:

TESTIP = $(shell gcloud compute instances describe $() --format='value[terminator=""](networkInterfaces[0].accessConfigs[0].natIP)')

Nothing more to it.
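
One thing that can bite here: if the lookup ever comes back empty, the test fails in confusing ways. A minimal sketch of a guard, assuming a hypothetical helper named looks_like_ipv4 (not part of gcloud) and a documentation-range address as sample input:

```shell
# Hypothetical helper: succeeds only when its argument has the shape of a
# dotted-quad IPv4 address. It checks the shape, not that each octet is <= 255.
looks_like_ipv4() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Sample value standing in for $(TESTIP); 203.0.113.10 is a documentation address.
ip="203.0.113.10"
if looks_like_ipv4 "$ip"; then
  echo "would test http://$ip:80/"
else
  echo "no usable IP, skipping test" >&2
fi
```

In a Makefile recipe the same check would need its dollar signs doubled.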

Whack-a-pod

Earlier this week, I released an open source version of “Whack a Pod,” a demo that we at Google Cloud have been using at Google Cloud Next, Google I/O and at various regional events. For those that haven’t seen the demo, it turns a Kubernetes cluster into a Whack-a-mole game, where Kubernetes Pods are the moles, and you are trying to take down enough pod/moles to disrupt the service those pod/moles are serving up.

The versions we have used at our events include a physical whack-a-mole machine that was hooked up to our game, so that you could actually physically kill Kubernetes pods by swinging a hammer at them. The work on the physical rig was done by Sparks and is not included in this repo. But, you can run this version anywhere, with minimal hardware requirements – a screen and an interface, touchscreen or mouse based.

Why Build it?

I wanted an easy and fun way to explain Kubernetes. I wanted something that could be hooked up to a real thing, and allow you to touch, as it were, something in the cloud. I wanted to get across the idea that Kubernetes is resilient.

I also wanted to make a series of Carnival based games, all with the same look and feel as the carnival version of whack a pod. They were shot down for being too whimsical. But that’s another story.

How does it work?

The entire application consists of three separate applications that are all hosted on the same Kubernetes cluster. We also create three services with which to expose the applications.

API

This is the application that is represented by the moles. We launch a deployment that creates a replica set with instructions to keep 12 pods running at all times.

http://[api-service]/api/color

This is the basic service that the pods are keeping up. It is tremendously simple – when polled it returns a random hexadecimal color value.

http://[api-service]/api/color-complete

This is a slightly tweaked version of the color API above for use with the advanced interface. In addition to the color, it returns the unique Kubernetes-generated name of the pod that generated the random color.

Admin

This is a set of commands that allow our front-end to issue commands against the Kubernetes cluster without need for credentialing. It is basically a proxy to the Kubernetes API with a restricted set of actions possible.

http://[admin-service]/api/k8s/createdeploy

Creates a deployment for running the pods that serve up the API application. Used in all interfaces when you start up.

http://[admin-service]/api/k8s/deleteallpods

Deletes all pods for the API deployment.

http://[admin-service]/api/k8s/deletedeploy

Deletes the deployment for the API application. Used in all interfaces when you finish with the deployment and clean up.

http://[admin-service]/api/k8s/deletepod

Deletes a single pod. Used in all interfaces when you whack a pod.

http://[admin-service]/api/k8s/drain

Cordons a node to prevent any pods from being scheduled on it, then kills all the API pods that are running on the node. Used in the advanced interface.

http://[admin-service]/api/k8s/getnodes

Gets information about the nodes of the Kubernetes cluster. Used in the advanced interface.

http://[admin-service]/api/k8s/getpods

Gets information about all of the pods running the API service. Used in all of the interfaces to populate the list of pod/moles.

http://[admin-service]/api/k8s/uncordon

Resets a node so that it can start accepting newly scheduled pods. Used in the advanced interface.

Game

The game consists of a few separate HTML/JS/CSS apps joined together. All of them work in the same general way.

The demo starts with no pods running. Each version will prompt you to deploy the pods. Deploying creates a replica set with 12 pods running.

The UI regularly polls the api/color service. If it gets a result, the service is up; if it doesn’t, the service is down. An indicator towards the top gives the player feedback on service status.
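
The up/down rule is simple enough to sketch in shell. This is not the game's actual front-end code, just an illustration; service_status is a hypothetical name, the sample color value is made up, and the curl line in the comment reuses the [api-service] placeholder from earlier.

```shell
# Hypothetical helper: any non-empty response body counts as "up",
# an empty one (timeout, refused connection, no pods) counts as "down".
service_status() {
  if [ -n "$1" ]; then
    echo "up"
  else
    echo "down"
  fi
}

# Real use would feed it a poll result, for example:
#   body=$(curl --silent --max-time 2 "http://[api-service]/api/color")
service_status "#a1b2c3"   # sample response body
service_status ""          # sample failed poll
```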

The pods are displayed in a grid, and their status is indicated by color differences (or mole position differences). The statuses are: starting, running, terminated. Starting and terminated pods cannot be “whacked.” When you whack a running pod, the UI calls admin/api/k8s/deletepod/. The pods remain for a while after they have been terminated, and are replaced by pods in the starting state.

http://[game-service]/

This is the basic game. It has a fun carnival theme, and is designed to be more a fun, distracting game than a real lesson about Kubernetes.

http://[game-service]/next.html

This is the basic game but without the carnival theme. It’s more in line with the branding for the Google Cloud Next events. We have used this version on a touchscreen at a few of our regional Cloud Next events.

It also has a panel that displays an abridged version of the JSON response from Kubernetes commands. Why abridged? Because most of the single responses from Kubernetes are over 100 lines of formatted JSON. The app shows the salient details.

http://[game-service]/advanced.html

When showing the demo at various events, we found ourselves wanting another view of the information for when conversations started to go deeper. The advanced view is a response to that. It gets rid of the time element, and instead displays the pods as they populate the cluster nodes, and not in a fixed grid. We show the service responses directly. We also show off which pod is actually responding to the service request. The interface includes the ability to drain a cluster node to simulate the node going down. Killing an actual node takes much longer, so this seemed like a reasonable way to simulate node death.

Choices

Why 3 services?

I originally wrote it as 1 app. When you killed all of those pods, and disrupted the service, you killed the UI too. This combined with inconsistent caching behavior caused really odd issues when you ran the demo. It made much more sense to split them up so as to not kill the game while you were playing the game.

Why PHP?

For starters I do outreach for Google Cloud to the PHP community. But also I started writing this as a quick and dirty prototype. When I want something done quickly, I write it in PHP. It’s very productive for me. I used the Docker image for App Engine flexible environment’s PHP runtime to just get the thing to work.

When the time came to tighten up everything, I considered rewriting all of the APIs in Go instead. But one of the side effects of choosing PHP and the GAE flexible runtime was that there is a little overhead to starting the service, as opposed to writing a lean Go app that only does one thing. That overhead is only a couple of seconds, but having it allows the demo to illustrate the full lifecycle of a pod.

Why No Public Demo?

I haven’t figured out a way to convert this into a multi-tenant demo. It’s one front-end client to one Kubernetes cluster, so multiple players would interfere with each other. Right now it involves trying to take down a service directly tied to a specific IP. I’m sure there is a way to rewrite it, I just haven’t had the time.

What did I learn from it?

Kubectl is awesome

In a lot of cases, I was just trying to recreate a kubectl command to tie to the front-end. Specific examples are kubectl delete deployment, and kubectl drain. In both cases, these are actually doing a lot of work, but hiding it behind one command. In the case of kubectl delete deployment, the command is deleting the deployment, then the replica set, then each pod and making it all one thing. If you just delete the deployment via the API, all of the children remain – and if you aren’t expecting it, you’ll be confused.

The fact that kubectl can be called with a flag to reveal those underlying calls is very much appreciated. The flag is “-v=8” in case you need it.

Kubernetes is a weeble not a fortress

I think this was one of the more surprising things I learned. I didn’t go out of my way to make the system super resilient or super brittle, but under most conditions, if you are able to directly delete all of your Pods, you can cause major outages for your services. However, those outages are seldom very long. In fact, at Google I/O we tracked how long people kept the service down, and the most we saw was 50% downtime. And this was two very motivated people hitting moles as soon as they appeared. Most of the time we saw downtime of less than 30%. Again, keep in mind that the point of the game mechanics is to cause as much downtime as possible.

A dead Pod is only mostly dead

At a few events we ran into issues where none of the visible pods were in a “running” state, but the service was still up. I thought it was due to UI weirdness. It turns out that running the Kubernetes delete pod command marks the pod as “terminating” and then allows for graceful termination. So if requests get routed to those pods while they are still able to respond, they will answer, even though they are marked as terminating. I would have known this already if I had bothered to read the documentation a little more carefully.

Whack-a-mole machines rest when the moles are down

When the moles are up, the machine is doing work; when they are down, the machine is resting. When Sparks first hooked up the physical whack-a-mole machine to the Kubernetes cluster (figuratively – one is in our data center, the other at the event) the moles were up all the time, because, well, that’s what Kubernetes does. Long story short, our first whack-a-mole machine burnt itself out, because when 21st century cloud technology collides with 1970’s arcade technology, the cloud wins.

Conclusion

This was a really fun project. It helped me learn a lot about Kubernetes, while at the same time enabling a great educational experience about Kubernetes. Plus I got people to build a whack-a-mole machine. It was pretty awesome.

Get the source on GitHub.

Makefile – Start, Stop or Delete 20 VMs at once

Last time I created 20 virtual machines at once. Now I want to stop those machines, or start them back up, or delete them. Basically, I want to do bulk operations on all of the machines that I am using in this scenario.

If you look at the create 20 VMs post, I gave each one of them a similar name, based on the pattern “load-xxx” where load is the operation I am using them for and xxx is a three digit sequential id with 0s prefixed. (This makes them order correctly in our UI.)

Because I know their names, I can count them up and not have to explicitly tell these operations how many machines I want to operate on. To do that, I create a make variable that contains the count of all VMs prefixed by “load.”

COUNT = $$(( $(shell gcloud compute instances list | grep 'load-' | wc -l | xargs) ))
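
A sketch of the counting step on its own, using canned names instead of live gcloud output so it runs anywhere. The helper name count_load_instances is made up for illustration, and unlike the grep above it anchors the match to the start of the line:

```shell
# Count lines that start with "load-". grep -c exits non-zero when the
# count is 0, so "|| true" keeps that from looking like a failure.
count_load_instances() {
  grep -c '^load-' || true
}

# Canned instance names standing in for the gcloud listing.
printf 'load-001\nload-002\nweb-frontend\nload-003\n' | count_load_instances
```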

Once I have that, I can perform batch operations very simply.

To stop 20 running VMs:

# The [[ ]] and (( )) syntax below needs bash as the recipe shell.
stop:
	@echo "Initiate Stop loop"
	@i=1 ; \
	while [[ $$i -le $(COUNT) ]] ; \
		do server=`printf "load-%03d" $$i` ; \
		($(call stop-node,$$server) &) ; \
		((i = i + 1)) ; \
	done

define stop-node
   echo "Stop Compute Engine Instance - " $(1) ; \
   (gcloud compute instances stop $(1) )
endef

Just to explain: like the previous post, we loop from 1 to COUNT, creating a variable that contains the name of our server, and calling a function that runs the gcloud compute instances stop command. Why is this a separate function? Because I usually do more than just stop the VM.

I also wrap the call in parentheses and append the & to allow multiple calls to execute in parallel.
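
One variation worth knowing: the (command &) form fires and forgets, so the recipe can return before the VMs have actually stopped. If you want to block until everything finishes, background the calls directly and follow the loop with wait. A sketch in plain shell, with stop_node as a stand-in that only prints rather than calling gcloud:

```shell
# Stand-in for "gcloud compute instances stop NAME"; it only prints.
stop_node() {
  echo "stopped $1"
}

stop_all() {
  for server in "$@"; do
    stop_node "$server" &   # background each call so they run in parallel
  done
  wait                      # block until every background job has finished
}

# Background jobs can finish in any order, so sort for stable output.
stop_all load-001 load-002 load-003 | sort
```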

To start them back up:

start:
	@echo "Initiate Start loop"
	@i=1 ; \
	while [[ $$i -le $(COUNT) ]] ; \
		do server=`printf "load-%03d" $$i` ; \
		($(call start-node,$$server) &) ; \
		((i = i + 1)) ; \
	done

define start-node
   echo "Start Compute Engine Instance - " $(1) ; \
   (gcloud compute instances start $(1))
endef

To delete them all:

delete:
	@echo "Initiate Delete loop"
	@i=1 ; \
	while [[ $$i -le $(COUNT) ]] ; \
		do server=`printf "load-%03d" $$i` ; \
		($(call delete-node,$$server) &) ; \
		((i = i + 1)) ; \
	done

define delete-node
   echo "DELETE Compute Engine Instance - " $(1) ; \
   (gcloud compute instances delete $(1) --delete-disks "all" --quiet )
endef

And in this case, I do a little bit more here in delete. I make sure all of the disks are deleted, and I set the request to quiet. Why? Because I don’t want to confirm this 20 times, silly.

In any case, doing batch operations on my set of VMs is as easy as:

make start
make stop
make delete

There you have it, fleets of VMs responding in concert to your requests. As it should be.

Makefile – Launch 20 Compute Engine virtual machines at once.

We’re going to try something a lot more complex in make now. I’m going to dynamically create 20 Compute Engine virtual machines that are absolutely the same. This requires quite a bit more complexity, so we’ll break it down step by step.

Let’s start with the gcloud command to create an instance.

define create-node
   echo "Create Compute Engine Instance - " $(1) ; \
   (gcloud compute instances create $(1) --machine-type "f1-micro" ; \
   gcloud compute ssh $(1) --command "sudo apt-get update" ; \
   say "$(1) is done.")
endef

I encapsulated this into a Makefile function. Why? Well, as I have it here, it is pretty simple, with just apt-get update added, but I usually do more than just create the node and install software. I often set environmental information or start services, etc. So by putting all of the instance-specific instructions in a function, I make it just slightly easier to grok.

Let’s go through this part step by step.

Define a function with the define keyword, and end it with the endef keyword
It appears that functions must be one line, so use ; to organize multiple calls into one function
Wrap all of the real work in parentheses. Why? It turns it into one operation, so that each step of the function doesn’t block parallel execution of other operations in the makefile.
Capture the first argument – $(1) – passed into this function – we’ll use it as the name of the instance
Create a machine using gcloud compute instances create. Note setting the machine type. If you are creating a lot of instances, make sure you don’t run afoul of quota or spend.
SSH into machine and run apt-get update.
Tell us this machine is ready.

Okay, that handles the instance creation, but now we have to loop through and create a variable amount of machines. I said 20, but I often spin up anywhere from 10 to 150 using this method.

# The [[ ]] and (( )) syntax below needs bash as the recipe shell.
create:
	@echo "Initiate Create loop"
	@i=1 ; \
	while [[ $$i -le $(count) ]] ; \
		do server=`printf "load-%03d" $$i` ; \
		($(call create-node,$$server) &) ; \
		((i++)) ; \
	done

Again, step by step:

Use @ so that the commands aren’t echoed to the output.
Set up a while loop with an iterator, i, that will run as long as i is less than or equal to the explicitly passed variable named count
Use ; to make the command one logical line.
Use printf to create a variable named server to name the instances. In this case each instance is named “load-xxx” where xxx is a sequential id number for the node that always has three digits. This makes it easier to go back later and do more group operations on the entire set of machines.
Call the function using the syntax $(call function_name, value_to_pass)
Wrap call in parentheses and append a &. This shoves the call to the background so you can create 20, or 100, or 150 of these in parallel instead of sequentially.
We then increment the counter.
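
The naming step is easy to try on its own. A small sketch in plain shell, where instance_names is a hypothetical helper and no VMs are touched:

```shell
# Print N sequential instance names, zero-padded to three digits.
instance_names() {
  i=1
  while [ "$i" -le "$1" ]; do
    printf 'load-%03d\n' "$i"
    i=$((i + 1))
  done
}

instance_names 3
```

Running it before a real create is a cheap way to confirm the names will sort the way you expect.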

Finally we call the whole thing with:

make create count=20

Pretty straightforward. I frequently use this technique to launch a fleet of VMs to send large amounts of load at App Engine. Next I’ll tell you how to delete them all.

Don’t forget the count=N, or the call will bail.
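
If you would rather get a clear message than a cryptic failure, a guard along these lines could run before the loop. This is only a sketch: require_count is a made-up name, and inside a Makefile recipe the dollar signs would need doubling.

```shell
# Hypothetical guard: fail with a usage hint unless the argument is a
# non-empty string of digits.
require_count() {
  case "$1" in
    ''|*[!0-9]*)
      echo "usage: make create count=N" >&2
      return 1
      ;;
  esac
}

if require_count "20"; then
  echo "count ok"
fi
```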

Makefile – Tell me when you are done

I have been doing a large number of tasks lately that involve executing long-running processes from a Makefile — maybe somewhere in the neighborhood of 40 seconds to 5 minutes. They’re just long enough that I get bored and go off and do something else, but short and urgent enough that I would really want to do something (usually manually test) right after the process is done. I need to not drift off into procrastination world. I need to be alerted when my process completes.

I have taken to adding a ‘say’ command to my Makefiles. If you aren’t familiar, on OS X ‘say’ will have the computer speak out whatever you have written using the Text-to-Speech system setup in the OS. So that way, when I am looking in a browser window being distracted, a disembodied voice can startle me out of my reverie and I can jump right back in as soon as possible.

I usually do something like this:

clean :
	gcloud container clusters delete $(CLUSTER) -q
	gsutil -m rm -r gs://artifacts.$(PROJECT).appspot.com
	gcloud compute --project $(PROJECT) addresses delete "ip1"
	gcloud compute --project $(PROJECT) addresses delete "ip2" 
	gcloud compute --project $(PROJECT) addresses delete "ip3"
	
burnitdown: clean
	say "Project $(PROJECT) has been burnt to the ground."

As always, your command names and mileage may vary. And turn down your volume. Or don’t.
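
Since say is a macOS tool, a Makefile shared with people on other systems could fall back to plain text. A sketch, with notify as a hypothetical wrapper:

```shell
# Hypothetical wrapper: always print the message, and also speak it
# when a "say" command is available on this machine.
notify() {
  echo "$1"
  if command -v say >/dev/null 2>&1; then
    say "$1"
  fi
}

notify "Project demo has been burnt to the ground."
```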

Makefile – Clean App Engine flexible environment

One of the more interesting quirks of App Engine flexible environment is that App Engine launches Compute Engine virtual machines that you can’t spin down directly. The way to spin down App Engine flex is to delete all versions of the app. This will close down all of the VMs, and shut down your App Engine app.

You can do it manually through the web interface, you can do it manually by listing versions in gcloud then deleting them, or you can have a Makefile do it for you.

First I use the trick I wrote about capturing dynamic data from gcloud. Then I pipe that to a Makefile command that will delete the versions.

VERSIONLIST = $(shell gcloud app versions list --format='value[terminator=" "](version.id)')
 
clean: 
	gcloud app versions stop $(VERSIONLIST) -q

Note that I add -q to the command because I don’t want to be prompted; I just want them gone.
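
One edge case: if there are no versions, VERSIONLIST is empty and the command runs with no version arguments. A sketch of a guard, written as a dry run that prints the command instead of executing it (stop_versions is a hypothetical name):

```shell
stop_versions() {
  if [ "$#" -eq 0 ]; then
    echo "no versions to stop"
    return 0
  fi
  # Dry run: print the command. Drop the leading "echo" to run it for real.
  echo gcloud app versions stop "$@" -q
}

stop_versions 20170405t215321 20170405t222022
stop_versions
```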

Makefile – Delete Forwarding Rules

I have a demo where I build a Kubernetes cluster on Container Engine to run a LAMP app. In the demo, I script out a complete build process from an empty project to the full running app. Testing this requires a clean up that takes me all the way back to an empty project with no cluster.

There is one thing I do not tear down – static IP addresses. I don’t tear these down because they are locked to host names in Google Domains, and I use those IPs in my Kubernetes setup to make sure that my cluster app is available at a nice URL and not just a randomly assigned IP.

But I have been running into a problem with this. Sometimes the static IPs hold on to Forwarding Rules that are autogenerated with crazy randomized names by Container Engine. It appears to happen only when I do a full clean. I suspect that I am deleting the cluster before it has a chance to issue the command to delete the forwarding rules itself.

In any case, I got tired of dealing with this manually, so I made a Makefile solution. First I get the dynamic list of crazy random forwarding rule names using the Makefile technique I outlined earlier. Then I pass that list to a gcloud command:

RULELIST = $(shell gcloud compute forwarding-rules list --format='value[terminator=" "](name)')
 
rules.teardown:
	-gcloud compute forwarding-rules delete $(RULELIST) --region $(REGION) -q

Note that I had to make sure I passed a region, otherwise the command would have prompted me to enter it manually.

Makefile – Get dynamic values from gcloud

Most of the time when I create something in my environment on Google Cloud Platform, I give it a specific name. For example, I create servers and call them “ThingIWillReferenceLaterWhenIDeleteYou” or more boringly, “Server1.”

Having set names, as I alluded to, makes it easier to clean up after yourself. But there are some cases when you cannot name things when they are created. So it would be nice to get a list of these names. For example, App Engine flexible environment versions for cleaning up after a test.

You can get a list of them with this command:

gcloud app versions list

Which yields this:

SERVICE  VERSION          TRAFFIC_SPLIT  LAST_DEPLOYED              SERVING_STATUS
default  20170405t215321  0.00           2017-04-05T21:53:39-07:00  STOPPED
default  20170405t222022  0.00           2017-04-05T22:20:40-07:00  STOPPED
default  20170405t223620  0.00           2017-04-05T22:36:38-07:00  STOPPED
default  20170405t230438  0.00           2017-04-05T23:04:59-07:00  STOPPED
default  20170405t235759  0.00           2017-04-05T23:58:27-07:00  STOPPED
default  20170407t102935  0.00           2017-04-07T10:29:55-07:00  STOPPED
default  20170407t110623  1.00           2017-04-07T11:06:45-07:00  STOPPED

Now normally I would have to add extra code to my Makefile to rip out the version names.
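
For comparison, that extra code might look something like this sketch, which pulls the VERSION column out of the table with awk. versions_from_table is a made-up name, and the input is a trimmed copy of the listing above rather than live output:

```shell
# Skip the header row, then join the second column with single spaces.
versions_from_table() {
  awk 'NR > 1 { printf "%s%s", sep, $2; sep = " " } END { print "" }'
}

printf '%s\n' \
  'SERVICE  VERSION          TRAFFIC_SPLIT' \
  'default  20170405t215321  0.00' \
  'default  20170405t222022  0.00' |
  versions_from_table
```

It works, but it depends on the column order of the table staying put.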

But gcloud actually has a robust formatting tool. So instead of running the command above I can run:

gcloud app versions list --format='json'

And get the JSON representation, which looks like this:

[
  {
    "environment": {
      "FLEX": null,
      "MANAGED_VMS": {
        "FLEX": null,
        "MANAGED_VMS": null,
        "STANDARD": {
          "FLEX": null,
          "MANAGED_VMS": null,
          "STANDARD": null,
          "name": "STANDARD",
          "value": 1
        },
        "name": "MANAGED_VMS",
        "value": 2
      },
      "STANDARD": {
        "FLEX": null,
        "MANAGED_VMS": {
          "FLEX": null,
          "MANAGED_VMS": null,
          "STANDARD": null,
          "name": "MANAGED_VMS",
          "value": 2
        },
        "STANDARD": null,
        "name": "STANDARD",
        "value": 1
      },
      "name": "FLEX",
      "value": 3
    },
    "id": "20170405t215321",
    "last_deployed_time": {
      "datetime": "2017-04-05 21:53:39-07:00",
      "day": 5,
      "hour": 21,
      "microsecond": 0,
      "minute": 53,
      "month": 4,
      "second": 39,
      "year": 2017
    },
    "project": "redacted",
    "service": "default",
    "traffic_split": 0.0,
    "version": {
      "automaticScaling": {
        "coolDownPeriod": "120s",
        "cpuUtilization": {
          "targetUtilization": 0.5
        },
        "maxTotalInstances": 20,
        "minTotalInstances": 2
      },
      "betaSettings": {
        "cloud_sql_instances": "redacted:us-central1:redacted",
        "has_docker_image": "true",
        "module_yaml_path": "app.yaml",
        "no_appserver_affinity": "true",
        "use_deployment_manager": "true"
      },
      "createTime": "2017-04-06T04:53:39Z",
      "createdBy": "tpryan@google.com",
      "env": "flexible",
      "id": "20170405t215321",
      "name": "apps/redacted/services/default/versions/20170405t215321",
      "runtime": "php",
      "servingStatus": "STOPPED",
      "threadsafe": true,
      "versionUrl": "https://20170405t215321-dot-redacted.appspot.com"
    }
  },
...

Using a JSON parser might make this easier, but there is an even easier way:

gcloud app versions list --format='value[terminator=" "](version.id)'

Which yields:

20170405t215321 20170405t222022 20170405t223620 20170405t230438 20170405t235759 20170407t102935 20170407t110623

What will this do?

It will list just the value of version.id and it will separate each record it returns with a " ", not a line break. This allows me to drop this generated list into any command that takes multiple names and run them. The gcloud CLI takes multiple arguments in this way.

So to make this applicable to Makefiles I have to do one more thing – take this data and put it in a variable.

VERSIONLIST = $(shell gcloud app versions list --format='value[terminator=" "](version.id)')

Here we are, ready to use this variable in other Make commands. This works for most of the other places in GCP where you see random values spitting out, like IP forwarding rules, and GKE nodes, to name two.

To learn more about how to filter and format your gcloud commands, check out the Google Cloud Platform Blog.

Makefile – quick series

Before joining Google Cloud, I wasn’t programming as much as I used to. So when I joined up, the last build system I had used with any regularity was ANT. Upon getting back in the routine of programming, I started down that path again, and immediately stopped. I did not want to deal with ANT and XML when I started back up. I also wasn’t doing anything even tangentially related to Java. So no, I’m not using ANT. I stopped doing single build files altogether and settled for folders of bash scripts.

This was… unsustainable.

Enter Mark Mandel and his constant exhortations to use Makefiles. Eventually I listened to him, and now instead of folders of scripts, or line after line of XML, I have giant Makefiles.

Make is awesome. And I know it is for more than just pushing files around, but that’s what I use it for. And I love it.

I’m running a short series on a number of productivity tips and tricks I’ve learned. Many will be about Google Cloud. Some will not. I hope these help someone else learn to love Makefiles.

How Kubernetes Updates Work on Container Engine

I often get asked when I talk about Container Engine (GKE):

How are upgrades to Kubernetes handled?

Masters

As we spell out in the documentation, upgrades to Kubernetes masters on GKE are handled by us. They get rolled out automatically. However, you can speed that up if you would like to upgrade before the automatic update happens. You can do it via the command line:

gcloud container clusters upgrade CLUSTER_NAME --master

You can also do it via the web interface as illustrated below.

GKE notifies you that upgrades are available.

You can then upgrade the master, if the automatic upgrade hasn’t happened yet.

Once there, you’ll see that the master upgrade is a one way trip.

Nodes

Updating nodes is a different story. Node upgrades can be a little more disruptive, and therefore you should control when they happen.

What do I mean by “disruptive?”

GKE will take down each node of your cluster killing the resident pods. If your pods are managed via a Replication Controller or part of a Replica Set deployment, then they will be rescheduled on other nodes of the cluster, and you shouldn’t see a disruption of the services those pods serve. However if you are running a Pet Set deployment, using a single Replica to serve a stateful service or manually creating your own pods, then you will see a disruption. Basically, if you are being completely “containery” then no problem. If you are trying to run a Pet as a containerized service you can see some downtime if you do not intervene manually to prevent that downtime. You can use a manually configured backup or other type of replica to make that happen. You can also take advantage of node pools to help make that happen. But even if you don’t intervene, as long as anything you need to be persistent is hosted on a persistent disk, you will be fine after the upgrade.

You can perform a node update via the command line:

gcloud container clusters upgrade CLUSTER_NAME [--cluster-version=X.Y.Z]

Or you can use the web interface.

Again, you get the “Upgrade Available” prompt.

You have a bunch of options. (We recommend you stay within 2 minor revs of your master.)

A couple things to consider:

As stated in the caption above, we recommend you stay within 2 minor revs of your master. These recommendations come from the Kubernetes project, and are not unique to GKE.
Additionally, you should not upgrade the nodes to a version higher than the master. The web UI specifically prevents this. Again, this comes from Kubernetes.
Nodes don’t automatically update. But the masters eventually do. It’s possible that the masters could automatically update to a version more than 2 minor revs beyond the nodes. This can cause compatibility issues. So we recommend timely upgrades of your nodes. Minor revs come out about once every 3 months. Therefore you are looking at this every 6 months or so.
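
The two version rules above can be checked mechanically. A sketch, assuming versions look like MAJOR.MINOR.PATCH and share the same major number; skew_ok is a hypothetical helper, not a gcloud feature, and the version numbers are examples only:

```shell
# Extract the minor number from a MAJOR.MINOR.PATCH string.
minor_of() {
  printf '%s\n' "$1" | cut -d. -f2
}

# Succeeds when the node version is not newer than the master and is at
# most two minor versions behind it. Arguments: master version, node version.
skew_ok() {
  skew=$(( $(minor_of "$1") - $(minor_of "$2") ))
  [ "$skew" -ge 0 ] && [ "$skew" -le 2 ]
}

for nodes in 1.6.0 1.4.8 1.3.10; do
  if skew_ok "1.6.0" "$nodes"; then
    echo "master 1.6.0, nodes $nodes: ok"
  else
    echo "master 1.6.0, nodes $nodes: outside the recommended range"
  fi
done
```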

As you can see, it’s pretty straightforward. There are a couple of things to watch out for, so please read the documentation.