<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
 
 <title>Another Blog</title>
 <link href="https://kscherer.github.io//atom.xml" rel="self"/>
 <link href="https://kscherer.github.io/"/>
 <updated>2023-02-11T16:47:28+00:00</updated>
 <id>https://kscherer.github.io/</id>
 <author>
   <name>Konrad Scherer</name>
   <email>kmscherer@gmail.com</email>
 </author>

 
 <entry>
   <title>Book Review: How the Word Is Passed by Clint Smith</title>
   <link href="https://kscherer.github.io//reviews/2022/08/03/book-review-how-the-word-is-passed-by-clint-smith"/>
   <updated>2022-08-03T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//reviews/2022/08/03/book-review-how-the-word-is-passed-by-clint-smith</id>
   <content type="html">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;This was a difficult but important book to read. The legacy of slavery and
racism on our society is both overt and subtle. Much of it is ignored and
glossed over but some people are trying to integrate a more historically
accurate interpretation into our institutions and lives.&lt;/p&gt;

&lt;h3 id=&quot;thomas-jefferson&quot;&gt;Thomas Jefferson&lt;/h3&gt;

&lt;p&gt;The first part of the book takes place at the farm/plantation of the
founding father Thomas Jefferson. The reality is that slavery was a big part
of all the founding fathers’ lives and businesses, but Jefferson’s behavior
was especially egregious. He split up families, tortured slaves and even
had six (!) children with a slave who was his wife’s half sister! The
estate is trying very hard to integrate this lesser-told story into the
traditional story, and it is a difficult task. How can the misery and
suffering of the slaves be reconciled with the fact that their labor allowed
him to do his incredibly important work?&lt;/p&gt;

&lt;h3 id=&quot;whitney-plantation&quot;&gt;Whitney Plantation&lt;/h3&gt;

&lt;p&gt;This plantation is an independent project of a wealthy retired businessman
to preserve the stories of the slaves that sustained it. It is incredible
work that I am glad is being done. The reality of our history needs to be
preserved and integrated, it cannot be ignored and buried. I hope to be able
to visit this important contribution.&lt;/p&gt;

&lt;h3 id=&quot;modern-day-slavery-in-angola&quot;&gt;Modern Day Slavery in Angola&lt;/h3&gt;

&lt;p&gt;Angola is a maximum security prison in Louisiana that is filled with black
men performing forced labour for cents an hour. It is modern day slavery
and it highlights the corruption of the justice system as an overt means of
continuing slavery. All talk of reconciliation is empty as long as this
practice continues. This section was so infuriating.&lt;/p&gt;

&lt;h3 id=&quot;juneteenth-and-manhattan&quot;&gt;Juneteenth and Manhattan&lt;/h3&gt;

&lt;p&gt;The actual events of Juneteenth were fascinating to me. The end of slavery
in the South after the Civil War was a messy, chaotic process. I was not aware
that plans to give freed slaves plots of land were rejected at the last
moment. How different things might be today if that had happened.&lt;/p&gt;

&lt;p&gt;The legacy of slavery in Manhattan is also something not talked about
often. The economics of slavery were embedded throughout all of the states,
and New York was no exception. Just because it was on the “winning” side
doesn’t mean that it doesn’t bear any responsibility.&lt;/p&gt;

&lt;h3 id=&quot;blackford-cemetery&quot;&gt;Blandford Cemetery&lt;/h3&gt;

&lt;p&gt;The Blandford cemetery is a Confederate cemetery and a focal point for
Confederate culture. Is it possible to honor the soldiers without getting
tangled in their implicit support of slavery? Is it possible to celebrate a
culture when that culture justified a war to maintain slavery? It is a
similar situation to postwar Germany, where everyone has to grapple with the
fact that normal people enabled the slaughter of millions of people. What I
experienced is that Germans have reframed the teaching of this history in
terms of “Never Again”, i.e. we must teach this to ensure that it can never
happen again. Unfortunately I don’t see this kind of hard work being done in
the US.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;I don’t think we will be able to address the legacy of slavery and racism
without a clear acknowledgment and acceptance of the past. I really
appreciated the way this book presented the historical blind spots of US
history. Highly recommended.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Book Review: 1491 and 1493 by Charles Mann</title>
   <link href="https://kscherer.github.io//reviews/2022/08/02/book-review-1491-and-1493"/>
   <updated>2022-08-02T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//reviews/2022/08/02/book-review-1491-and-1493</id>
   <content type="html">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;These books changed the way I see and think about our modern world. They
cover what we know today about the American continent before Christopher
Columbus arrived in the “New World” and the many ways the entire world was
changed forever afterwards. Mostly for my own benefit I will list a bunch of
the things I learned and perhaps this will encourage you to read these books
as well.&lt;/p&gt;

&lt;h3 id=&quot;impact-of-pathogens&quot;&gt;Impact of Pathogens&lt;/h3&gt;

&lt;p&gt;I had heard before about the impact of various pathogens brought over by the
Europeans, but I didn’t appreciate the scale. The exact population of the
Americas is impossible to know, but there is lots of evidence for as high as
100 million people. These people had to deal with multiple waves of
influenza, typhoid, smallpox, yellow fever, malaria, etc., and over 50 years
these diseases probably killed over 95% of the population.&lt;/p&gt;

&lt;p&gt;The consequences of this are mind boggling. The Indians (there doesn’t seem
to be a better word for the people that lived in the Americas before
Columbus) had done controlled burns of the forests for thousands of years, and
this stopped. Without the burning, the ecosystem completely changed. Huge
forests grew, herds of bison formed, and the regrowth may even have been
responsible for the mini ice age in the 1600s because the forests captured so
much CO2.&lt;/p&gt;

&lt;p&gt;All attempted European settlements failed until the Indians were wiped
out. There are so many stories of how ill prepared the settlers were for
life in America. It wasn’t until they weren’t in direct competition with the
Indians that settlement had a chance of succeeding. Also, the weakened Indian
tribes often formed alliances with the settlers in their own
conflicts. These alliances never worked out for the Indians, as the
Europeans, once established, would systematically wipe them out.&lt;/p&gt;

&lt;h3 id=&quot;impact-of-joining-two-separate-ecosystems&quot;&gt;Impact of joining two separate ecosystems&lt;/h3&gt;

&lt;p&gt;The Americas contained ecosystems completely isolated from the rest of the
world. As the world became a single global ecosystem there were many winners
and losers. Much of North America didn’t have earthworms, which is one of the
reasons the controlled burns were so critical. Now earthworms are everywhere,
and the way they break down biological material has profound consequences for
the native plants.&lt;/p&gt;

&lt;p&gt;The Americas contained three plants that changed the world: tomatoes, corn
and potatoes. It is hard to imagine cuisine today without tomatoes and to
think that something as “universal” as spaghetti and tomato sauce is a very
recent phenomenon. Corn is the backbone of our industrial agriculture,
feeding cattle and being converted to corn syrup and various other food
additives. The potato alone has been credited with allowing the human
population to grow by billions. It is impossible to imagine our society
without these plants, not to mention chocolate or tobacco.&lt;/p&gt;

&lt;h3 id=&quot;american-agriculture&quot;&gt;American agriculture&lt;/h3&gt;

&lt;p&gt;The early settlers often remarked on how healthy the Indians looked. Turns
out the Indians ate a diet of corn, beans and squash that was nutritionally
superior to the European diet of wheat and meat. The Indians didn’t have
pack animals or large domesticated animals, which means all farming was done
by hand and all messages had to be carried on foot. The Europeans mistook the
lack of visible farms as a sign that the land was “unused”. But without oxen
to pull plows, giant farms aren’t feasible, and the Indians had very different
forms of farming and hunting. Even in the Amazon, there are strange super
fertile regions that contain millions of pottery shards mixed in with the
soil. We still don’t understand how this works or how it was possible.&lt;/p&gt;

&lt;h3 id=&quot;silver-from-the-andes&quot;&gt;Silver from the Andes&lt;/h3&gt;

&lt;p&gt;After the Incas were defeated and enslaved, the Spanish found a silver mine
in the Andes that had been mined by the Incas. The ore was very pure and
plentiful and soon the silver was moving around the world. It was supposed
to go directly back to Spain to fund various wars. Some enterprising Spanish
sailors realized they could ship the silver across the Pacific and trade with
the Chinese for silks and spices, then bring the goods back across the
Pacific, through Mexico and across the Atlantic, and make a fortune. Almost
two thirds of the silver went to China, and even the one third that reached
Spain was enough to trigger massive inflation in both countries. This
inflation was the proximate trigger of various regime changes.&lt;/p&gt;

&lt;h3 id=&quot;rubber&quot;&gt;Rubber&lt;/h3&gt;

&lt;p&gt;The industrial revolution requires three things: steel, oil and rubber. I
underappreciated the critical role that rubber plays in all the gaskets,
seals and tires that are part of modern machines. The rubber/latex tree is
still a core part of our economy and now grows all over the world. Rubber
tree farms cause all kinds of ecological disasters and the rubber boom of
the late 1800s also caused massive economic upheaval. It is another example
of a natural product for which we haven’t been able to make an economical
substitute.&lt;/p&gt;

&lt;h3 id=&quot;slavery&quot;&gt;Slavery&lt;/h3&gt;

&lt;p&gt;I was not aware of the link between malaria and slavery. Malaria was so
deadly to Europeans and Indians that settlement in the Americas was almost a
death sentence. It was the greater natural immunity of Africans to malaria
that drove much of the Atlantic slave trade. Many African slaves were in fact
prisoners of wars between African tribes, and so many slaves had military
training. Unsurprisingly, many escaped their slavery and formed “free”
communities. One incredible example is a state in modern day Brazil that
survived for almost 90 years. It was strategically placed on a cliff side
with access to water, etc. The populace was trained and was able to resist
many attacks. It is such an amazing story that I hope it becomes a movie
some day.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;This is just a fraction of the incredible history of the Americas that I am
so grateful to have been able to learn about. Highly recommended.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>The Kubernetes batch job memory challenge</title>
   <link href="https://kscherer.github.io//2021/08/03/the-kubernetes-batch-job-memory-challenge"/>
   <updated>2021-08-03T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//2021/08/03/the-kubernetes-batch-job-memory-challenge</id>
   <content type="html">&lt;h3 id=&quot;batch-jobs-in-kubernetes&quot;&gt;Batch Jobs in Kubernetes&lt;/h3&gt;

&lt;p&gt;The design of Kubernetes has its origins at Google as a platform for
microservices. It does support batch jobs but they do not have the
same level of support as microservices. The only difference between a
Pod and a Job is that a Job does not restart automatically.&lt;/p&gt;
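
&lt;p&gt;As a minimal sketch (the image and command are placeholders), a Job
wraps a Pod template and runs it to completion:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: batch/v1
kind: Job
metadata:
  name: compile-job          # illustrative name
spec:
  backoffLimit: 0            # do not retry a failed build
  template:
    spec:
      restartPolicy: Never   # the Pod is not restarted when it exits
      containers:
      - name: build
        image: ubuntu:20.04            # placeholder build image
        command: [&quot;make&quot;, &quot;-j8&quot;]       # placeholder build command
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;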

&lt;h4 id=&quot;pod-abstraction&quot;&gt;Pod Abstraction&lt;/h4&gt;

&lt;p&gt;The Pod abstraction assumes that the CPU and memory usage of the
processes running inside a Pod are predictable and fairly
constant. When a Pod’s memory usage is larger than its allocation, the
assumption is that there is a bug or memory leak and the process
should be killed.&lt;/p&gt;

&lt;h4 id=&quot;best-effort-resource-allocation&quot;&gt;Best Effort resource allocation&lt;/h4&gt;

&lt;p&gt;For Pods the default resource allocation is “Best Effort”. When no CPU
or memory requests or limits are defined, the Pod is considered low priority
and is free to consume whatever resources are available. During my initial explorations with
K8s I created 10 Best Effort Jobs that compiled software. K8s started
all the Pods on a single Node. The compile jobs quickly used up all
available memory on the system and K8s started killing Jobs at random.&lt;/p&gt;
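
&lt;p&gt;One way to confirm the QoS class a Pod ended up with (the pod name is a
placeholder):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; kubectl get pod build-pod -o jsonpath='{.status.qosClass}'
BestEffort
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;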

&lt;p&gt;This behavior makes sense when the Pods are stateless, part of
Deployments, and can easily be moved to other nodes. Of course the
compile jobs can just be restarted, but it is wasteful to throw away
the work in progress. In this case “Best Effort” doesn’t really seem
appropriate for running batch jobs.&lt;/p&gt;

&lt;h4 id=&quot;podinterantiaffinity&quot;&gt;Pod AntiAffinity&lt;/h4&gt;

&lt;p&gt;Using Pod AntiAffinity with the node hostname and/or labels, the K8s
scheduler will spread the Pods out over a set of Nodes. This way, if
the nodes have enough resources, “Best Effort” pods have the best
chance of not consuming too many resources on any one node. But it isn’t
a guarantee, and it is difficult to predict if a Job/Pod will finish.&lt;/p&gt;
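
&lt;p&gt;A sketch of the kind of anti-affinity stanza meant here, assuming the
build Pods share an illustrative app: yocto-build label:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: yocto-build                  # illustrative label
        topologyKey: kubernetes.io/hostname   # one build Pod per node
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;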

&lt;h4 id=&quot;burst-and-guaranteed-pods&quot;&gt;Burstable and Guaranteed Pods&lt;/h4&gt;

&lt;p&gt;If “Best Effort” isn’t appropriate, then the next step is to give each
Job a CPU and memory limit. The question then becomes: what should those
limits be? A Burstable Pod has a lower request and a higher max limit; a
Guaranteed Pod has its request equal to its max limit. A Burstable Pod
therefore allows a form of resource over commitment. If the processes in the
Pod ever try to allocate more memory than the resource limit, the
allocation will fail and K8s will kill the Pod.&lt;/p&gt;
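
&lt;p&gt;A sketch of a Burstable container spec (the numbers are illustrative);
making the requests equal to the limits would yield a Guaranteed Pod
instead:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;resources:
  requests:
    cpu: &quot;4&quot;          # what the scheduler reserves
    memory: &quot;8Gi&quot;
  limits:
    cpu: &quot;8&quot;          # throttled above this
    memory: &quot;16Gi&quot;    # OOM-killed above this
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;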

&lt;h4 id=&quot;compressible-and-uncompressible-resources&quot;&gt;Compressible and Uncompressible resources&lt;/h4&gt;

&lt;p&gt;The default Pod resources CPU and Memory have different behavior when
the resource is exhausted. CPU over commitment is handled by the
kernel scheduler and the scheduler will allocate CPU time based on the
scheduler policy. A Pod that attempts to use more CPU time than
allocated will be throttled. CPU is considered a compressible
resource.&lt;/p&gt;

&lt;p&gt;Memory is different because if there isn’t any memory available,
attempted allocations will fail. Memory is considered an uncompressible
resource. There isn’t a way to throttle process memory usage. The only
option is swap memory which can allow allocations to proceed but it
comes with problems like thrashing. With containers it gets even more
complicated because by default the kernel does not account for swap
memory in the memory resource usage.&lt;/p&gt;

&lt;h4 id=&quot;setting-memory-limits&quot;&gt;Setting memory limits&lt;/h4&gt;

&lt;p&gt;So the only way to prevent a Pod from being killed when it uses too
much memory is to set the memory allocation high enough. This is where
the predictability of microservices makes figuring out the max memory
limit easier.&lt;/p&gt;

&lt;h4 id=&quot;finding-a-memory-limit-for-yocto-builds&quot;&gt;Finding a memory limit for Yocto builds&lt;/h4&gt;

&lt;p&gt;But what about large and unpredictable Yocto builds? Yocto provides
infinite configuration options and also supports two forms of build
parallelization: parallel jobs and parallel packages. These options speed up
the build but make the package build order non
deterministic. The shared state (sstate) cache can be used to speed up
builds, but it makes predicting memory usage even more difficult because it
is impossible to know ahead of time which packages will be rebuilt.&lt;/p&gt;

&lt;h3 id=&quot;swap&quot;&gt;Swap?&lt;/h3&gt;

&lt;p&gt;What about using swap to add memory temporarily to a Pod that has used
all of its allocation? K8s does not support swap and requires that
swap be disabled. As far as I can tell the main issue is around how to
account for the swap memory. Swap cannot simply be added to the memory of
the machine because it is much slower. The current kube tools do not
track swap usage, and the kernel does not enable swap memory accounting
by default for performance reasons. Swap can increase performance by
moving unused memory pages to disk and giving applications more
RAM. However, a Pod using swap could thrash the entire node, causing
failures in Pods that are not using too many resources.&lt;/p&gt;

&lt;p&gt;There is work underway to add swap alpha support to K8s 1.22 but it
has many limitations and may be restricted to “Best Effort”
Pods. Exactly how swap will be managed and accounted for are still
open questions. At some point swap may be a part of the solution but
it isn’t feasible now.&lt;/p&gt;

&lt;h3 id=&quot;workarounds&quot;&gt;Workarounds&lt;/h3&gt;

&lt;p&gt;Without a way to make memory compressible there are only workarounds
and tradeoffs.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Reducing parallelism for large packages like tensorflow or chromium
does keep memory usage down, at the cost of slowing down the
build. When a build fails due to a resource allocation failure, we
record the packages being compiled from the process list and
use that to identify potential problem packages.&lt;/li&gt;
  &lt;li&gt;Tracking the memory usage of a Pod with a monitoring system like
Prometheus lets us follow changes in memory usage over time. It
also gives us a better picture of how the memory usage of a Yocto
build changes over the course of the build. Ideally the processes
causing peak memory usage can be identified and reduced to keep
memory usage more stable, allowing for better utilization of the
node’s resources. The Prometheus alerting system can also be used to
warn when builds cross memory usage boundaries like 90%, which may
give us time to deploy workarounds before builds fail.&lt;/li&gt;
  &lt;li&gt;A build that failed due to a failed memory allocation could
potentially be restarted with lower parallelization and/or changed
memory limits. This is tricky because our builds use local disks
with HostPath, and the build Pod would need to be rescheduled to the
node with the in-progress build files. K8s 1.21 has added improved
support for local volumes. Local volume management in K8s deserves a
post of its own.&lt;/li&gt;
  &lt;li&gt;The make tool has an option --load-average which tells make to stop
spawning new jobs if the load average is above a specific
value. This doesn’t work for Pods because load is measured system wide
and reflects CPU, but the concept of feedback into make or ninja is
interesting. Since a process can use the /proc filesystem to monitor
the memory usage of its container, it might be possible to have these
tools reduce the number of spawned jobs based on current container
memory usage; see the sketch after this list.&lt;/li&gt;
&lt;/ul&gt;
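
&lt;p&gt;A sketch of the load-average option and of reading current container
memory usage (the numbers are illustrative and the cgroup v1 path is
kernel dependent):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# stop spawning new make jobs when the load average exceeds 8
&amp;gt; make -j 16 --load-average 8
# current memory usage of this container (cgroup v1)
&amp;gt; cat /sys/fs/cgroup/memory/memory.usage_in_bytes
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;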

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;It is the combination of the following:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;memory being uncompressible&lt;/li&gt;
  &lt;li&gt;K8s/container memory limits being hard limits without second chances&lt;/li&gt;
  &lt;li&gt;Yocto builds having unpredictable memory consumption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;that makes this a very difficult problem. The only “proper” solution
would be to make the builds have more predictable memory usage. This
would require a feedback mechanism to make/ninja/bitbake to adjust the
number of running processes based on the current container memory
usage.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Golang app development using Skaffold</title>
   <link href="https://kscherer.github.io//2021/04/20/golang-app-development-using-skaffold"/>
   <updated>2021-04-20T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//2021/04/20/golang-app-development-using-skaffold</id>
   <content type="html">&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Developing an application to be distributed as a K8s service is a
complicated undertaking. Besides learning the application language and
solving the application problem, there are all the K8s workflows that
need to be automated. This is my attempt to navigate the insane K8s
ecosystem of tools as I try to make a decent development and
production workflow.&lt;/p&gt;

&lt;h1 id=&quot;development-workflow&quot;&gt;Development workflow&lt;/h1&gt;

&lt;p&gt;The local development workflow needs to have a fast feedback
loop. For a K8s application, that means at minimum a container build
and deployment.&lt;/p&gt;

&lt;h1 id=&quot;ubuntu-setup-of-go-115&quot;&gt;Ubuntu setup of go 1.15&lt;/h1&gt;

&lt;p&gt;The latest go at this time is 1.16.3, but for a sample app the
distro-supplied 1.15 is fine.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;sudo apt install golang-1.15
cd $HOME/bin &amp;amp;&amp;amp; ln -s /usr/lib/go-1.15/bin/go
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I have $HOME/bin in my $PATH, which makes it easy to manage the
installation of single-binary tools. Technically, with buildpacks I
don’t even need to install the go toolchain, but I want to explore
things like debugging of a running go application.&lt;/p&gt;

&lt;h1 id=&quot;buildpacks&quot;&gt;Buildpacks&lt;/h1&gt;

&lt;p&gt;I am not a big fan of Dockerfiles, especially the multi-stage
Dockerfiles that are the right way to separate the build and runtime
containers. Due to the single-binary structure of Go applications they
can have a tiny runtime image. So I decided to investigate using
buildpacks&lt;a href=&quot;https://buildpacks.io&quot;&gt;4&lt;/a&gt;, which look like a much better alternative for application
development. Buildpacks even support new features like reproducible builds
and image rebasing.&lt;/p&gt;

&lt;p&gt;Install the pack tool.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cd $HOME/bin
curl -LO https://github.com/buildpacks/pack/releases/download/v0.18.1/pack-v0.18.1-linux.tgz
tar xzf pack-v0.18.1-linux.tgz
chmod +x pack
rm -f pack-v0.18.1-linux.tgz
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Start with the golang buildpacks sample app&lt;a href=&quot;https://github.com/paketo-buildpacks/samples/tree/main/go/mod&quot;&gt;1&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Since this is a golang app, the default builder can be tiny:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pack config default-builder paketobuildpacks/builder:tiny
cd $APP
pack build mod-sample --buildpack gcr.io/paketo-buildpacks/go
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With buildpacks running locally, the workflow is: edit and save, run
pack to build, then run docker and test. It takes a few seconds and
multiple steps.&lt;/p&gt;

&lt;h1 id=&quot;skaffold-and-minikube&quot;&gt;Skaffold and Minikube&lt;/h1&gt;

&lt;p&gt;This app will run in K8s and will depend on K8s features, so it will
need to run inside K8s. Enter Minikube&lt;a href=&quot;https://minikube.sigs.k8s.io&quot;&gt;2&lt;/a&gt; for a local K8s setup and
Skaffold&lt;a href=&quot;https://skaffold.dev&quot;&gt;3&lt;/a&gt; to orchestrate the development workflow.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cd $HOME/bin
curl -Lo minikube https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
chmod +x minikube
minikube start
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This starts up a full K8s instance locally using the docker
driver. The initial download was ~1GB so it takes a while.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cd $HOME/bin
curl -Lo skaffold https://storage.googleapis.com/skaffold/releases/latest/skaffold-linux-amd64
chmod +x skaffold
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now switch to the golang buildpack sample with skaffold&lt;a href=&quot;https://github.com/GoogleContainerTools/skaffold/tree/master/examples/buildpacks&quot;&gt;5&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;term 1&amp;gt; skaffold dev
&amp;lt;term 2&amp;gt; minikube tunnel
&amp;lt;term 3&amp;gt; kubectl get svc # to get IP
&amp;lt;term 3&amp;gt; curl -s http://&amp;lt;external IP&amp;gt;:8080
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Minikube has its own docker daemon, so the buildpacks used and the images
built are located inside minikube and not in the host docker&lt;a href=&quot;https://minikube.sigs.k8s.io/docs/handbook/pushing/&quot;&gt;6&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;term 4&amp;gt; eval $(minikube docker-env)
&amp;lt;term 4&amp;gt; docker images
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This makes deploying the image very fast because it isn’t copied.&lt;/p&gt;

&lt;h1 id=&quot;development-workflow-1&quot;&gt;Development workflow&lt;/h1&gt;

&lt;p&gt;The skaffold sample is set up to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gcr.io/buildpacks/builder:v1&lt;/code&gt; and
it also works with the builder &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;paketobuildpacks/builder:tiny&lt;/code&gt;. The
Google buildpacks&lt;a href=&quot;https://github.com/GoogleCloudPlatform/buildpacks&quot;&gt;7&lt;/a&gt; support “file sync”, which copies changed files
directly to the image. This means changes are available in seconds
which is great for development.&lt;/p&gt;
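
&lt;p&gt;A sketch of what the relevant part of skaffold.yaml might look like; the
apiVersion and exact schema should be checked against the Skaffold docs:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: skaffold/v2beta13
kind: Config
build:
  artifacts:
  - image: mod-sample
    buildpacks:
      builder: gcr.io/buildpacks/builder:v1
    sync:
      auto: true   # copy changed files directly into the running image
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;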

&lt;h1 id=&quot;next-steps&quot;&gt;Next steps&lt;/h1&gt;

&lt;p&gt;My application will be a multi-cluster app that exchanges K8s resource
data. The first step is to query the resource utilization of the K8s
cluster using client-go.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Update: Git Server option bigFileThreshold</title>
   <link href="https://kscherer.github.io//git/2021/02/11/update-git-server-option-bigfilethreshold"/>
   <updated>2021-02-11T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//git/2021/02/11/update-git-server-option-bigfilethreshold</id>
   <content type="html">&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Many years ago (2014) I set up my git servers with the git option
core.bigFileThreshold=100k. This reduced memory usage dramatically
because git stopped trying to delta-compress already compressed files. I have used
this option for many years without apparent problems until one of my
colleagues alerted me that cloning an internal mirror of the Linux
kernel from my git server was transferring over 9GB of data! Cloning
the same repo from kernel.org transferred only approx 1.5GB.&lt;/p&gt;

&lt;h1 id=&quot;so-many-repack-options&quot;&gt;So many repack options&lt;/h1&gt;

&lt;p&gt;When I looked at the bare repo everything seemed normal. The repo had
been repacked properly less than a month ago thanks to
grokmirror. There was a single pack file with a bitmap, but that
pack file was 9.1GB! I tried all the standard repack commands:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; git repack -A -d -l -b
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and when that didn’t help:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; git repack -A -d -l -b -F -f
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;But nothing changed. Then my colleague reported that rebuilding with
the above options did work on his machine and reduced the git repo
size. This meant that there must be a local setting on the server
that was causing the problem. I looked at the local ~/.gitconfig and
saw the bigFileThreshold option I had set so long ago. So I did a
quick experiment with:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; git -c core.bigFileThreshold=512m repack -A -d -F -f
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and it did indeed reduce the bare git repo from 9.1GB to 1.9GB! It
seems that there are ~200K objects in the Linux kernel repo’s history that
are over 100k, and when they are not delta-compressed the size of the
repository grows a lot!&lt;/p&gt;

&lt;p&gt;Curious how large the files in the kernel repo can get, I did a
checkout of the mainline kernel and looked for files over 100KB.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; find . -path '*/.git/*' -prune -o -type f -size +100k -print | wc -l
914
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Four of these files are even over 10MB!&lt;/p&gt;

&lt;h1 id=&quot;solution&quot;&gt;Solution&lt;/h1&gt;

&lt;p&gt;Once the problem has been clearly identified the solution is usually
simple. In this case the gitolite config for all the kernel repos sets
the core.bigFileThreshold to its default value of 512m. This way all
the other repos can still use the smaller bigFileThreshold setting.&lt;/p&gt;
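
&lt;p&gt;A sketch of the gitolite.conf stanza (the repo pattern is illustrative,
and gitolite has to whitelist the key via GIT_CONFIG_KEYS in .gitolite.rc):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;repo kernel/..*
    config core.bigFileThreshold = 512m
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;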

&lt;p&gt;There is also a way to tell git not to delta compress files with
certain extensions. I created a global git attributes file
/etc/gitattributes with the following content:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;*.bz2 binary -delta
*.gz binary -delta
*.xz binary -delta
*.tgz binary -delta
*.zip binary -delta
*.lz binary -delta
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This covers all the compressed files in our repos and had the same
effect, so I reverted the bigFileThreshold option to the default of
512m.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Developer productivity</title>
   <link href="https://kscherer.github.io//2020/10/12/developer-productivity"/>
   <updated>2020-10-12T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//2020/10/12/developer-productivity</id>
   <content type="html">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;During a recent job interview I was asked “Do you think you are a 10x
developer?”. The concept of a “10x” developer and developer
productivity is something I have thought a lot about. Fundamentally
the hard part is figuring out what to measure. I don’t have any good
answers but here is how I think about it today.&lt;/p&gt;

&lt;h2 id=&quot;how-to-measure-productivity&quot;&gt;How to measure productivity?&lt;/h2&gt;

&lt;p&gt;Since programming is a fairly creative activity it will always be
difficult to find a measure that cannot be gamed.&lt;/p&gt;

&lt;p&gt;A simple but flawed measure is something like “lines of code” or
“features completed” or “bugs fixed”. These measurements are flawed
because they are only loosely linked to the things users of the code
actually care about. In University I met someone who allegedly
completed a 5 hour coding interview in 1.5 hours with code that passed
all the unit tests. If true, this is impressive and a testament to
that particular developer’s skills. I doubt I would ever be able to
match such a feat.&lt;/p&gt;

&lt;p&gt;Just as a person has many personality facets, a developer can work on
different facets of productivity. I like the word facets because each
is unique while still contributing to the whole.&lt;/p&gt;

&lt;h2 id=&quot;cost-of-programming-errors&quot;&gt;Cost of programming errors&lt;/h2&gt;

&lt;p&gt;An important skill is the ability to produce “error-free” code. I
think computer programming is unique in that a single bug can cost
millions of dollars to fix. Even perfectly correct code can require
rewriting when the requirements or execution environment
changes. Examples of insanely expensive bugs include OpenSSL
HeartBleed, Intel Meltdown and more. These bugs cause the users damage
and also generate rework for the entire industry.&lt;/p&gt;

&lt;p&gt;Programming is a continuous tradeoff between getting the code working
for a specific use-case and making it robust enough to handle multiple
use-cases. Figuring out how much it will cost to develop a feature is hard
enough, and the risk of an expensive bug is rarely factored in. There
isn’t an easy way to measure the cost of expensive bugs. The cost to
fix bugs is also hard to measure and rarely accounted for as an
engineering cost.&lt;/p&gt;

&lt;p&gt;Developing the skill of writing code that doesn’t result in expensive
bugs often requires:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Using tools like static analyzers, linters, enabling all compiler
warnings, fuzzers, code quality scanners, etc. Catching errors early
is often the best return on investment. However, each tool takes
time to learn and integrate. Each run takes time and a high rate of
false positives can result in lost productivity.&lt;/li&gt;
  &lt;li&gt;Developing and maintaining a set of runtime tests. Code developed at
the same time as tests tends to be better designed because it works
best when dependencies are minimized. Code with a good test suite
can be refactored more easily. On the other hand, runtime testing of
a large software base requires significant infrastructure in order
to minimize false positives and maintain a good feedback loop.&lt;/li&gt;
  &lt;li&gt;Careful software reuse. Sometimes using an existing code base is the
right thing to do. For example, almost none of the developers that
thought they could write an encryption library have succeeded. Each
dependency on a third party becomes a liability and has to be
managed carefully. Ideally, it is an open source library and you can
become part of its community and keep up with the upgrades and
security fixes. In the worst case scenario, you end up maintaining a
fork of the software or have to apply horrible workarounds.&lt;/li&gt;
  &lt;li&gt;Creating operationally simple software. Even bug free software can
be a pain to upgrade or keep operational in a high availability
configuration. Software has many different user interfaces and one
is how the software is installed, configured, upgraded and
maintained. I wasn’t exposed to this facet of software until I had
to maintain a cluster of 100+ machines. I have found that whether a
service can reload its configuration without a restart is a good
indication of whether the operator interface has been taken
seriously. Reloading configuration at runtime requires a good
software design and test suite. When there are bugs it is too easy
for the developers to just deprecate the feature and force
restarts. But being able to reload a configuration without impact on
running sessions is an operationally valuable feature.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the wrong environment an inexperienced developer can introduce
programming errors that will cost more than their
contributions. Everyone likes to talk about “10x” programmers, but I
think we should also talk about “negative productivity” programmers
and what can be done to reduce the cost of these errors by catching
and preventing them earlier.&lt;/p&gt;

&lt;h2 id=&quot;cost-of-fixing-bugs&quot;&gt;Cost of fixing bugs&lt;/h2&gt;

&lt;p&gt;Debugging is a specific developer skill. It is difficult to teach and
hard to explain the instincts of a good debugger to an inexperienced
developer. Being able to make an intermittent bug easily reproducible
or use gdb to track down some memory corruption are critical skills at
the right time. I also saw a talk by a Google engineer who was
investigating a 99th percentile latency outlier and found a Linux
kernel scheduler bug that saved Google millions of dollars a year. As
systems become more complex, the bugs also become harder to fix. I
wish there were better ways to capture and train debugging expertise.&lt;/p&gt;

&lt;h2 id=&quot;individual-productivity-versus-team-productivity&quot;&gt;Individual productivity versus team productivity&lt;/h2&gt;

&lt;p&gt;One of the amazing properties of software is leverage, where a single
tool can make a large group of developers more productive. The goal of
every manager should also be to make their team more productive. The
goal of almost every software product is to make their customers more
productive. Being able to find and address productivity bottlenecks in
a team is another developer skill. Developing this skill often
requires:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Understanding of the workflow of the different members of the team&lt;/li&gt;
  &lt;li&gt;Use of automation tools to transition manual work to the computer&lt;/li&gt;
  &lt;li&gt;Creating tools with a compelling user interface for the team&lt;/li&gt;
  &lt;li&gt;Talking with upstream and downstream teams to find ways to make
interactions smoother and more automated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This assumes that the team works well together. A toxic team member
can reduce the productivity of an entire team. Language, timezone and
cultural differences can also hinder productivity.&lt;/p&gt;

&lt;h2 id=&quot;choosing-the-right-work&quot;&gt;Choosing the “right” work&lt;/h2&gt;

&lt;p&gt;Even the most perfect code is useless if it doesn’t solve the right
problems. Keeping development aligned with business needs can
contribute to team productivity by eliminating rework. Some of the
skills required to do this well are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Interacting with customers directly and understanding what their
problems are and why they are looking to you to solve them&lt;/li&gt;
  &lt;li&gt;Communicating technical concepts to non-technical people in an
effective way&lt;/li&gt;
  &lt;li&gt;Communicating non-technical requirements to technical people in an
effective way&lt;/li&gt;
  &lt;li&gt;Potentially developing expertise in the customer domain to
understand their domain specific language and problem context&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;It is impossible to be excellent in all these skills. The most
important thing is to constantly find ways to improve individual and team
productivity. I suspect this isn’t the answer an interviewer is
expecting. I need to come up with a shorter answer.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Using AWS Session Manager to connect to machines in a private subnet</title>
   <link href="https://kscherer.github.io//aws/2019/11/07/using-aws-session-manager-to-connect-to-machines-in-a-private-subnet"/>
   <updated>2019-11-07T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//aws/2019/11/07/using-aws-session-manager-to-connect-to-machines-in-a-private-subnet</id>
   <content type="html">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;We are experimenting with AWS as many people are. One of the first
hurdles is connecting over SSH to the EC2 instances that have been
created. The “standard” mechanism is to set up a Bastion host that has
a restrictive “Security Group” (also known as Firewall). This Bastion
host is accessible from the Internet and once the user has logged into
this host they can then access other instances in the VPC.&lt;/p&gt;

&lt;p&gt;The Bastion host has a few limitations:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;It is exposed to the Internet: A Security Group can restrict access
to specific IPs and only open port 22. This is reasonably secure,
but an exploit in the SSH server is always a
possibility.&lt;/li&gt;
  &lt;li&gt;SSH key management: The AWS console allows for the creation of SSH
keypairs that can be automatically installed on the instance, which
is great. If you have multiple people accessing the Bastion
instance, then either everyone will have to use the same keypair
(which is bad) or there needs to be some other mechanism for managing
the authorized_keys file on the Bastion instance. Ideally this is
automated using a tool like Puppet Bolt or Ansible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of my weekly newsletters pointed me to &lt;a href=&quot;https://github.com/xen0l/aws-gate&quot;&gt;aws-gate&lt;/a&gt; which mentioned
the possibility of logging into an instance using SSH without the need
for a Bastion host. This post documents my experience getting it
working.&lt;/p&gt;

&lt;h2 id=&quot;local-requirements&quot;&gt;Local Requirements&lt;/h2&gt;

&lt;p&gt;On the local machine the AWS CLI must be installed. I use a python
virtualenv to keep the python environment separate and avoid requiring
root access.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; python3 -m venv awscli
&amp;gt; cd awscli
&amp;gt; bin/pip3 install awscli
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Unfortunately it turns out the Session Manager functionality requires
a special plugin which is only distributed as a deb package.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; curl &quot;https://s3.amazonaws.com/session-manager-downloads/plugin/latest/ubuntu_64bit/session-manager-plugin.deb&quot; -o &quot;session-manager-plugin.deb&quot;
&amp;gt; sudo apt install ./session-manager-plugin.deb
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The AWS CLI requires an access key. Go to the AWS console -&amp;gt; “My Security
Credentials” and create a new Access key (or use existing
credentials).&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; ~/awscli/bin/aws configure
AWS Access Key ID [None]: accesskey
AWS Secret Access Key [None]: secretkey
Default region name [None]: us-west-2
Default output format [None]:
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Also in the AWS EC2 console, create a new KeyPair and download the
.pem file locally. I put the file in ~/.ssh and gave it 0600
permissions. Now add the following to your .ssh/config file:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# SSH over Session Manager
host i-* mi-*
ProxyCommand sh -c &quot;~/awscli/bin/aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'&quot;
IdentityFile ~/.ssh/&amp;lt;keypair name&amp;gt;.pem
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;aws-iam-setup&quot;&gt;AWS IAM Setup&lt;/h2&gt;

&lt;p&gt;By default an EC2 instance will not be manageable by the Systems
Manager. Go to AWS Console -&amp;gt; IAM -&amp;gt; Roles to update the roles.&lt;/p&gt;

&lt;p&gt;I already had a default EC2 instance role and I had to add
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AmazonSSMManagedInstanceCore&lt;/code&gt; permissions to the instance role.&lt;/p&gt;
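
&lt;p&gt;The same policy attachment can be scripted with the CLI (the role name
is a placeholder):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; ~/awscli/bin/aws iam attach-role-policy \
    --role-name &amp;lt;instance role&amp;gt; \
    --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;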

&lt;h2 id=&quot;launching-the-instance&quot;&gt;Launching the Instance&lt;/h2&gt;

&lt;p&gt;According to the docs, the official Ubuntu 18.04 server AMI has the SSM
agent integrated, and I relied on this. Finding the right AMI is really
frustrating because there aren’t proper organization names attached to
AMIs. The simplest way is to go to the &lt;a href=&quot;https://cloud-images.ubuntu.com/locator/ec2/&quot;&gt;Ubuntu AMI finder&lt;/a&gt;, search for
‘18.04 us-west-2 ebs’ and select the most recent AMI.&lt;/p&gt;

&lt;p&gt;In the launch options:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;choose the correct VPC with a private subnet&lt;/li&gt;
  &lt;li&gt;choose the ‘IAM Role’ with the correct permissions&lt;/li&gt;
  &lt;li&gt;choose a “Security Group” with port 22 open to you&lt;/li&gt;
  &lt;li&gt;select the Keypair that was downloaded earlier and set up in your
.ssh/config file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Launch the instance and wait a while. Go to the AWS Console -&amp;gt; Systems
Manager -&amp;gt; Inventory to see that the instance is running and the SSM
agent is working properly.&lt;/p&gt;

&lt;h2 id=&quot;connecting-over-ssh&quot;&gt;Connecting over SSH&lt;/h2&gt;

&lt;p&gt;If everything is set up correctly, grab the instance name and do the
login:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; ssh ubuntu@i-014633b619400dfff
Welcome to Ubuntu 18.04.3 LTS (GNU/Linux 4.15.0-1052-aws x86_64)
&amp;lt;snip&amp;gt;
ubuntu@ip-10-0-1-193:~$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;SSH access without a Bastion host is possible!&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>ZFS Disk replacement on Dell R730</title>
   <link href="https://kscherer.github.io//linux/2019/11/01/zfs-disk-replacement-on-dell-r730"/>
   <updated>2019-11-01T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//linux/2019/11/01/zfs-disk-replacement-on-dell-r730</id>
   <content type="html">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;I manage a bunch of Dell servers and I use OpenManage and
check_openmanage to monitor for hardware failures. Recently one
machine started showing the following error:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Logical Drive '/dev/sdh' [RAID-0, 3,725.50 GB] is Ready
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Unfortunately “Drive is Ready” isn’t a helpful error message. So I log
into the machine and check the disk:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; omreport storage vdisk controller=0 vdisk=7
Virtual Disk 7 on Controller PERC H730P Mini (Embedded)

Controller PERC H730P Mini (Embedded)
ID                                : 7
Status                            : Critical
Name                              : Virtual Disk 7
State                             : Ready
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The RAID controller log shows a more helpful message:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Bad block medium error is detected at block 0x190018718 on Virtual Disk 7 on Integrated RAID Controller 1.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;From experience I know that I could just clear the bad blocks, but the
drive is dying and more will come. Luckily Dell will replace drives
with uncorrectable errors and I received a replacement drive quickly.&lt;/p&gt;

&lt;h2 id=&quot;cleanly-removing-the-drive&quot;&gt;Cleanly removing the drive&lt;/h2&gt;

&lt;p&gt;I know the drive is /dev/sdh, but I created the ZFS pool using drive
paths. Searching &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dev/disk/by-path/&lt;/code&gt; gave me the correct drive.&lt;/p&gt;
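
&lt;p&gt;A quick way to do that search (output trimmed):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; ls -l /dev/disk/by-path/ | grep sdh
... pci-0000:03:00.0-scsi-0:2:7:0 -&amp;gt; ../../sdh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;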

&lt;p&gt;First step is to mark the drive as offline.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; zpool offline pool 'pci-0000:03:00.0-scsi-0:2:7:0'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To make sure I replaced the correct drive I also forced it to blink:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; omconfig storage vdisk controller=0 vdisk=7 action=blink
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Next came the manual step of actually replacing the drive.&lt;/p&gt;

&lt;h2 id=&quot;activating-the-new-drive&quot;&gt;Activating the new drive&lt;/h2&gt;

&lt;p&gt;After inserting the new disk I was able to determine the physical disk
number and recreate the RAID-0 virtual disk.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; omconfig storage controller action=discardpreservedcache controller=0 force=enabled
&amp;gt; omconfig storage controller controller=0 action=createvdisk raid=r0 size=max pdisk=0:1:6
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I use single drive RAID0 because I prefer that ZFS use the disks in
raidz2 mode rather than using RAID6 on the controller.&lt;/p&gt;

&lt;p&gt;Then I did a quick verify that the new virtual disk is using the same PCI
device and drive letter, and added it back into the ZFS pool.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; omreport storage vdisk controller=0 vdisk=7
Virtual Disk 7 on Controller PERC H730P Mini (Embedded)

Controller PERC H730P Mini (Embedded)
ID                                : 7
Status                            : Ok
Name                              : Virtual Disk7
State                             : Ready
Device Name                       : /dev/sdh
&amp;gt; parted -s /dev/sdh mklabel gpt
&amp;gt; zpool replace pool 'pci-0000:03:00.0-scsi-0:2:7:0'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;ZFS will add the new drive and resilver the data.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; zpool status
pool: pool
state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It would be slightly easier if the disk replacement were handled by the
RAID controller, but the rebuild would take much longer. So far ZFS on
Linux has worked very well for me and I will continue to rely on it.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Building a multi node build cluster</title>
   <link href="https://kscherer.github.io//build/2019/10/18/building-a-multi-node-build-cluster"/>
   <updated>2019-10-18T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//build/2019/10/18/building-a-multi-node-build-cluster</id>
   <content type="html">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;I now have built, deployed and managed three internal build systems
that handle thousands (yes thousands) of Yocto builds daily. Each
build system has its tradeoffs and requirements. The latest one I call
&lt;a href=&quot;https://github.com/WindRiver-OpenSourceLabs/ci-scripts&quot;&gt;Wrigel&lt;/a&gt; was specifically designed to be usable outside of WindRiver and
is available on our WindRiver-OpenSourceLabs GitHub repo and on the
Docker Hub. Recently there has been a lot of internal discussion about
build systems and the current state of various open source projects
and I will use this post to clarify my thinking.&lt;/p&gt;

&lt;h2 id=&quot;wrigel-design-constraints&quot;&gt;Wrigel Design Constraints&lt;/h2&gt;

&lt;p&gt;The primary use case of Wrigel was to make it easy for a team inside
or outside WindRiver to join 3-5 “spare” computers into a build
cluster. For this I used a combination of Docker, Docker Swarm and
Jenkins.&lt;/p&gt;

&lt;p&gt;Docker makes it really easy to distribute preconfigured
Jenkins and build container images. Thanks to the generous support of
Docker Cloud all the container images required for Wrigel are built
and distributed on Docker Hub.&lt;/p&gt;

&lt;p&gt;Docker Swarm makes it really easy to join 3-5 (Docker claims up to
thousands) systems together into a cluster. The best part is that
Docker Compose supports using the same yaml file to run services on a
single machine or distributed over a swarm. This has been ideal for
developing and testing the setup.&lt;/p&gt;
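
&lt;p&gt;For example, the same compose file can drive both modes (the stack name
is illustrative):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# single machine
&amp;gt; docker-compose up -d
# distributed over a swarm, from the same yaml file
&amp;gt; docker stack deploy -c docker-compose.yml wrigel
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;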

&lt;p&gt;Jenkins is an incredible piece of software with an amazing community
that is used everywhere and has plugins for almost any
functionality. I rely heavily on the Pipeline plugin which provides a
sandboxed scripted pipeline DSL. This DSL supports both single and
multi-node workflows. I have abused the groovy language support to do
some very complicated workflows.&lt;/p&gt;

&lt;p&gt;I have a system that works and looks to be scalable. Of course the
system has limitations. It is these limitations and the current
landscape of alternatives that I have been investigating.&lt;/p&gt;

&lt;h2 id=&quot;wrigel-limitations&quot;&gt;Wrigel Limitations&lt;/h2&gt;

&lt;p&gt;Jenkins is a great tool, but the Pipeline plugin is very specific to
Jenkins. There isn’t a single other tool that can run the Jenkins
Pipeline DSL. To be fair, every build tool from CircleCI to Azure
Pipelines and Tekton has its own syntax and lock-in. There are
many kinds of lock-in and not all are bad. One of the perennial
challenges with all build systems has been reproducing the build
environment outside of the build system. Failures due to some special
build system state tend to make developers really unhappy, so I wanted
to explore what running a pipeline outside of a build system would
look like. I acknowledge the paradox of building a system to run
pipelines that also supports running pipelines outside of the system.&lt;/p&gt;

&lt;p&gt;The other limitation is security. The constant stream of CVE reports
and fixes for Jenkins and its plugins is surprising. I am very
impressed with Cloudbees and the community with the way they are
taking these problems seriously. Cloudbees has made significant
progress improving the default Jenkins security settings. This is no
small feat considering Jenkins has a very old codebase. On the
downside my own attempts to secure the default setup have been broken
by Jenkins upgrades three times in the last year. While I understand
the churn I am reluctant to ship Jenkins as part of a potential
commercial product because each CVE would impose additional non
business value work on our team.&lt;/p&gt;

&lt;h2 id=&quot;docker-and-the-root-access-problem&quot;&gt;Docker and the root access problem&lt;/h2&gt;

&lt;p&gt;Docker is an amazing tool and has completely transformed the way I
work. One major problem is that giving a build script access to run
Docker is equivalent to giving root on the machine. Since most build
clusters are internal systems running mostly trusted code it isn’t a
huge problem, but I have always been interested in
alternatives. Recently Podman and rootless Docker have announced
support for user namespaces. I was able to do a Yocto build using
Podman and user namespaces with the 4.18 kernel so huge progress has
been made. I would prefer that the build system required as little
root access as possible, so I will continue to investigate using
rootless Podman and/or Docker.&lt;/p&gt;

&lt;h2 id=&quot;breaking-down-the-problem&quot;&gt;Breaking down the problem&lt;/h2&gt;

&lt;p&gt;At its core, Jenkins is a cluster manager and a batch job
scheduler. It is also a plugin manager, but that isn’t directly
relevant to this discussion. For a long time Jenkins was probably the
most common open source cluster manager. It is only recently, with the rise
of datacenter scale computers, that more sophisticated cluster managers
have become available. In 2019 the major open source cluster managers
are Kubernetes, Nomad, Mesos + Marathon and Docker Swarm. Where
Jenkins is designed around batch jobs with an expected end time, newer
cluster managers are designed around the needs of a long lived
service. These managers have support for batch jobs, but it isn’t the
primary abstraction. They also have many features that Jenkins does
not:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Each job specifies its resource requirements. Jenkins only supports
label selectors for choosing hosts&lt;/li&gt;
  &lt;li&gt;The jobs are packed to maximize utilization of the systems. Jenkins
by default packs jobs onto a single machine and prefers to reuse
workareas.&lt;/li&gt;
  &lt;li&gt;Each manager supports high availability configurations in the open
source version whereas the HA feature for Jenkins is an Enterprise
only feature&lt;/li&gt;
  &lt;li&gt;Jobs can specify complex affinities and constraints on where the
jobs can run.&lt;/li&gt;
  &lt;li&gt;Each manager has integration with various container runtimes,
storage and network plugins. Jenkins has integration with Docker but
generally doesn’t manage storage or network settings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So by comparison Jenkins looks like a very limited scheduler, but it
does have pipeline support, which none of the other projects do. So I
started exploring projects that add pipeline support to these
schedulers. I found many very new projects like Argo and Tekton for
Kubernetes. There are plugins for Jenkins that allow it to use
Kubernetes, Nomad or Mesos, but they can’t really take advantage of
all the features.&lt;/p&gt;

&lt;h2 id=&quot;cluster-manager-comparison&quot;&gt;Cluster manager comparison&lt;/h2&gt;

&lt;p&gt;Now I will compare the cluster managers on the features which I feel
are most relevant to a build cluster setup:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;How easy is the setup and maintenance?&lt;/li&gt;
  &lt;li&gt;How complicated is the HA setup?&lt;/li&gt;
  &lt;li&gt;Can it be run across multiple datacenters, i.e. Federated?&lt;/li&gt;
  &lt;li&gt;Community and Industry support?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Docker Swarm:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Very easy setup&lt;/li&gt;
  &lt;li&gt;Automatic cert creation and rotation&lt;/li&gt;
  &lt;li&gt;transparent overlay network setup&lt;/li&gt;
  &lt;li&gt;HA easy to setup&lt;/li&gt;
  &lt;li&gt;no WAN support&lt;/li&gt;
  &lt;li&gt;Docker Inc. is focused on Kubernetes and future of Swarm is uncertain&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nomad:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Install is a simple binary&lt;/li&gt;
  &lt;li&gt;integration with Consul for HA&lt;/li&gt;
  &lt;li&gt;encrypted communications&lt;/li&gt;
  &lt;li&gt;no network setup&lt;/li&gt;
  &lt;li&gt;plugins for job executors including Docker&lt;/li&gt;
  &lt;li&gt;WAN setup supported by Consul&lt;/li&gt;
  &lt;li&gt;Support for Service, Batch and System jobs&lt;/li&gt;
  &lt;li&gt;Runs at large scale&lt;/li&gt;
  &lt;li&gt;Well supported by Hashicorp and community&lt;/li&gt;
  &lt;li&gt;job configuration in JSON or HCL (see the sketch below)&lt;/li&gt;
&lt;/ul&gt;
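
&lt;p&gt;Since Nomad is the alternative I am most drawn to, here is a minimal
sketch of what a batch job might look like in HCL. All names and
values are illustrative, not from an actual Wrigel setup:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;job &quot;yocto-build&quot; {
  datacenters = [&quot;dc1&quot;]
  type        = &quot;batch&quot;

  group &quot;build&quot; {
    task &quot;bitbake&quot; {
      driver = &quot;docker&quot;
      config {
        image   = &quot;ubuntu:18.04&quot;
        command = &quot;/bin/sh&quot;
        args    = [&quot;-c&quot;, &quot;bitbake core-image-minimal&quot;]
      }
      # each job declares its resource requirements
      resources {
        cpu    = 4000 # MHz
        memory = 8192 # MB
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;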

&lt;p&gt;Mesos + Marathon:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Support for Docker and custom containerizer&lt;/li&gt;
  &lt;li&gt;No network setup by default&lt;/li&gt;
  &lt;li&gt;Runs at large scale at Twitter&lt;/li&gt;
  &lt;li&gt;Commercial support available&lt;/li&gt;
  &lt;li&gt;Complicated installation and setup&lt;/li&gt;
  &lt;li&gt;HA requires zookeeper setup&lt;/li&gt;
  &lt;li&gt;no federation or WAN support&lt;/li&gt;
  &lt;li&gt;Small community&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kubernetes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Very popular with lots of managed options&lt;/li&gt;
  &lt;li&gt;Runs at large scale at many companies&lt;/li&gt;
  &lt;li&gt;Supports build extensions like Tekton and Argo&lt;/li&gt;
  &lt;li&gt;Federation support&lt;/li&gt;
  &lt;li&gt;Lots of support options and great community&lt;/li&gt;
  &lt;li&gt;Complicated setup and configuration&lt;/li&gt;
  &lt;li&gt;Requires setup and management of etcd&lt;/li&gt;
  &lt;li&gt;Requires setup and rotation of certs&lt;/li&gt;
  &lt;li&gt;Requires network overlay setup using one of 10+ network plugins like Flannel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my experience with Wrigel, Docker Swarm has worked well. It is only
its uncertain future that has encouraged me to look at Nomad.&lt;/p&gt;

&lt;h2 id=&quot;running-pipelines-outside-jenkins&quot;&gt;Running Pipelines outside Jenkins&lt;/h2&gt;

&lt;p&gt;Many years ago I saw a reference to a small tool on Github called
&lt;a href=&quot;https://github.com/walter-cd/walter&quot;&gt;Walter&lt;/a&gt;. The idea is to have a small Go tool that can execute a
sequence of tasks as specified in a yaml file. It can execute steps
serially or in parallel. Each stage can have an unlimited number of
tasks and some cleanup tasks. Initially it supported only two stages
so I modified it to support unlimited stages. This tool can only
handle a single node pipeline, but that covers a lot of use cases. Now
the logic for building the pipeline is in the code that generates the
yaml file and not inside a Jenkinsfile. Ideally a developer could
download the yaml file and the walter binary and recreate the entire
build sequence on a local development machine. The temptation is to
have the yaml file call shell scripts, but by placing the full
commands in the yaml file with proper escaping, each command can be
cut and pasted out of the yaml and run in a terminal.&lt;/p&gt;
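
&lt;p&gt;To give a feel for the format, here is a hypothetical pipeline in the
spirit of Walter’s yaml. The exact keys vary by Walter version, so this
is illustrative rather than a verified schema:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;build:
  tasks:
    - name: fetch sources
      command: git clone https://example.com/project.git
    - name: checks
      parallel:
        - name: lint
          command: make lint
        - name: unit tests
          command: make test
deploy:
  tasks:
    - name: publish
      command: make publish
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;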

&lt;h2 id=&quot;workflow-support&quot;&gt;Workflow Support&lt;/h2&gt;

&lt;p&gt;It turns out that Jenkins Pipelines are an implementation of a much
larger concept called Workflow. Scientific computing has been building
multi-node cluster workflow engines for a long time. There is a list
of &lt;a href=&quot;https://github.com/meirwah/awesome-workflow-engines&quot;&gt;awesome workflow engines&lt;/a&gt; on Github. I find the concept of
directed acyclic graphs of workflow steps as mentioned by &lt;a href=&quot;https://airflow.apache.org/&quot;&gt;Apache
Airflow&lt;/a&gt; very interesting because it matches my mental model of
some of our larger build jobs.&lt;/p&gt;

&lt;p&gt;With a package like &lt;a href=&quot;https://github.com/spotify/luigi&quot;&gt;Luigi&lt;/a&gt;, the workflow can be encoded as a graph
of tasks and executed on a scheduler using “contribs”, which are
interfaces to services outside of Luigi. There are &lt;a href=&quot;https://luigi.readthedocs.io/en/stable/api/luigi.contrib.html&quot;&gt;contribs&lt;/a&gt; for
Kubernetes, AWS, ElasticSearch and more.&lt;/p&gt;
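
&lt;p&gt;Luigi tasks are Python classes with requires(), run() and output()
methods; once a module defines the task graph, the whole workflow can
be launched from the command line. A hypothetical invocation, with
made-up module and task names:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; luigi --module build_workflow FullBuild --workers 4 --local-scheduler
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;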

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;With a single node pipeline written in yaml and executed by walter and
a multi node workflow built in Luigi, the build logic would be
independent of the cluster manager and scheduler. A developer could
run the workflows on a machine not managed by a cluster manager. The
build steps could be fairly easily executed on a cluster managed by
Jenkins, Nomad or Kubernetes. Combined with rootless containers the
final solution would be much more secure than current solutions.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Getting started with ElasticSearch</title>
   <link href="https://kscherer.github.io//elasticsearch/2018/05/10/getting-started-with-elasticsearch"/>
   <updated>2018-05-10T00:00:00+00:00</updated>
   <id>hhttps://kscherer.github.io//elasticsearch/2018/05/10/getting-started-with-elasticsearch</id>
   <content type="html">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;I manage an internal build system that creates a simple text file of
key-value pairs containing build statistics. These statistics are then
processed using a fairly gnarly shell script. When I first saw this
years ago I thought it looked like the perfect candidate for
ElasticSearch, and I finally had time to look into it.&lt;/p&gt;

&lt;h2 id=&quot;elasticsearch-and-kibana&quot;&gt;ElasticSearch and Kibana&lt;/h2&gt;

&lt;p&gt;ES is a text database and search engine which is useful, but it also
has a neat frontend called Kibana which can be used to query and
visualize the data. Since I manage the system, there was no need to
set up Logstash to preprocess the data since I could just convert it to
json myself.&lt;/p&gt;

&lt;h2 id=&quot;official-docker-images&quot;&gt;Official Docker Images&lt;/h2&gt;

&lt;p&gt;The documentation for ElasticSearch covers installation using Docker,
but there is one gotcha. The webpage that lists all the available
docker images at https://www.docker.elastic.co/ only lists the images
that contain the starter X-Pack with a 30 day trial license. I ended
up using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;docker.elastic.co/elasticsearch/elasticsearch-oss&lt;/code&gt; image
which contains only open source content. The same goes for the Kibana image.&lt;/p&gt;

&lt;h2 id=&quot;docker-compose&quot;&gt;Docker-compose&lt;/h2&gt;

&lt;p&gt;I wanted to run ES and Kibana on the same server, but if you do that
using two separate docker run commands, the auto configuration of
Kibana doesn’t work. I also wanted a local volume to hold the
data, so I created a simple docker-compose file:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;---
version: '2'
services:
  kibana:
    image: docker.elastic.co/kibana/kibana-oss:6.2.4
    environment:
      SERVER_NAME: $HOSTNAME
      ELASTICSEARCH_URL: http://elasticsearch:9200
    ports:
      - 5601:5601
    networks:
      - esnet

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch-oss:6.2.4
    container_name: elasticsearch
    environment:
      discovery.type: single-node
      bootstrap.memory_lock: &quot;true&quot;
      ES_JAVA_OPTS: &quot;-Xms512m -Xmx512m&quot;
    ulimits:
      memlock:
        soft: -1
        hard: -1
    ports:
      - 9200:9200
    volumes:
      - esdata1:/usr/share/elasticsearch/data
    networks:
      - esnet

volumes:
  esdata1:
    driver: local

networks:
  esnet:
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now I have ES and Kibana running on ports 5601 and 9200.&lt;/p&gt;

&lt;h2 id=&quot;json-output-from-bash&quot;&gt;JSON output from Bash&lt;/h2&gt;

&lt;p&gt;I have a large collection of files in the form:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;key1: value1
key2: value2
&amp;lt;etc&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Converting this to JSON should be simple, but there were a few
surprises:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;JSON requires double quotes, so wrapping the key and value in
quotes requires either echo with escaped &quot; characters or printf, which
I found cleaner.&lt;/li&gt;
  &lt;li&gt;JSON requires that the last element does not have a trailing
comma. I abused bash control characters by using backspace to erase
the last comma.&lt;/li&gt;
  &lt;li&gt;ElasticSearch would fail to parse the JSON if values contained
backslash escapes like \( or \). I used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tr&lt;/code&gt; to delete all backslashes
from the JSON.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The final code looks like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;convert_to_json()
{
    # read each &quot;key: value&quot; line into a flat array of keys and values
    local ARR=();
    local FILE=$1
    while read -r LINE
    do
        ARR+=( &quot;${LINE%%:*}&quot; &quot;${LINE##*: }&quot; )
    done &amp;lt; &quot;$FILE&quot;

    local LEN=${#ARR[@]}
    echo &quot;{&quot;
    for (( i=0; i&amp;lt;LEN; i+=2 ))
    do
        printf '  &quot;%s&quot;: &quot;%s&quot;,' &quot;${ARR[i]}&quot; &quot;${ARR[i+1]}&quot;
    done
    # backspace over the trailing comma before closing the object
    printf &quot;\b \n}\n&quot;
}

for FILE in &quot;$@&quot;; do
    # 'local' is only valid inside a function, so use a plain variable here
    JSON=$(convert_to_json &quot;$FILE&quot; | tr -d '\\' )
    # unquoted on purpose: word splitting collapses the JSON onto one line
    echo $JSON
done
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;elasticsearch-type-mapping&quot;&gt;ElasticSearch type mapping&lt;/h2&gt;

&lt;p&gt;The data was being imported into ES properly, but when I tried to
search and visualize it I found it really hard. Every field had been
imported as a text and keyword type, which meant that the date and
number fields could not be visualized as expected.&lt;/p&gt;

&lt;p&gt;The solution was to create a mapping which assigns types to each field
in the document. If the numbers had not been sent as strings, ES would
have converted them automatically, but I had dates in epoch seconds
which is indistinguishable from a large number. Date parsing is its
own challenge and ES supports many different date formats. In my
specific case, epoch_second was the only date format required.&lt;/p&gt;

&lt;p&gt;I took the default mapping and added the type information to each
field. I tried to apply this mapping to the existing document, but ES
does not allow the mapping of a field to be changed because that would
change the interpretation of the data. The solution is to create a new
index and reindex the old index to the new one with types. This worked
and I was now able to visualize the data much more easily.&lt;/p&gt;
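
&lt;p&gt;As a sketch, creating a typed index and reindexing into it looks
roughly like this in ES 6.x. The index, type and field names here are
made up:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; curl -XPUT http://localhost:9200/buildstats-v2 \
    -H 'Content-Type: application/json' -d '{
  &quot;mappings&quot;: {
    &quot;_doc&quot;: {
      &quot;properties&quot;: {
        &quot;build_date&quot;:     { &quot;type&quot;: &quot;date&quot;, &quot;format&quot;: &quot;epoch_second&quot; },
        &quot;build_duration&quot;: { &quot;type&quot;: &quot;integer&quot; }
      }
    }
  }
}'
&amp;gt; curl -XPOST http://localhost:9200/_reindex \
    -H 'Content-Type: application/json' -d '{
  &quot;source&quot;: { &quot;index&quot;: &quot;buildstats&quot; },
  &quot;dest&quot;:   { &quot;index&quot;: &quot;buildstats-v2&quot; }
}'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;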

&lt;h2 id=&quot;curator&quot;&gt;Curator&lt;/h2&gt;

&lt;p&gt;I now had one index and it was growing quickly. I remembered from
previous research that LogStash uses indexes with a date suffix. This
allows data to be cleaned up regularly and also allows a new mapping
to be applied to new indexes. Creating and deleting indexes is handled
by the Curator tool.&lt;/p&gt;

&lt;p&gt;I created two scripts: one for deleting indexes that are 45 days old
and another for creating tomorrow’s index with the specified
mapping. Running these daily from cron will automate the creation and
cleanup. The last piece is to have the JSON sent to the index that
matches the day.&lt;/p&gt;
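
&lt;p&gt;A sketch of the delete side as a Curator action file; the index
prefix and timestring are illustrative:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;actions:
  1:
    action: delete_indices
    options:
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: buildstats-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 45
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;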

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Every new piece of software has its learning curve. So far the curve
has been quite reasonable for ElasticSearch. I look forward to working
with it more.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Thoughts on Exercise</title>
   <link href="https://kscherer.github.io//exercise/2017/05/23/thoughts-on-exercise"/>
   <updated>2017-05-23T00:00:00+00:00</updated>
   <id>hhttps://kscherer.github.io//exercise/2017/05/23/thoughts-on-exercise</id>
   <content type="html">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Three years ago, I read &lt;a href=&quot;http://www.drmcguff.com/&quot;&gt;‘Body by Science’&lt;/a&gt; and it changed the way I
think about exercise completely. Unfortunately I wasn’t able to find a
gym in Ottawa that used this type of training. I then asked Google for
results on ‘HIT bodyweight’ and found &lt;a href=&quot;http://baye.com/&quot;&gt;Drew Baye&lt;/a&gt; and &lt;a href=&quot;http://baye.com/store/project-kratos/&quot;&gt;Project
Kratos&lt;/a&gt;. After reading most of Drew Baye’s blog and watching any videos
I could find with him, I purchased Project Kratos and started my
experiment.&lt;/p&gt;

&lt;p&gt;Doing a workout once a week at home was perfect for my life situation
at the time: two young children and full time jobs for my wife and
me. Despite the infrequency I made progress surprisingly quickly. The
squat, heel raise and back extension exercises almost tripled in under
six months. But I quickly plateaued on exercises like the pushup,
chinup and crunch. I tried many different things: negatives, forced
reps, more rest, less rest, split routines, etc. but nothing allowed
me to break through the plateau.&lt;/p&gt;

&lt;p&gt;I started experimenting with different programs like &lt;a href=&quot;https://gmb.io/&quot;&gt;GMB&lt;/a&gt;
and &lt;a href=&quot;https://www.gymnasticbodies.com/&quot;&gt;GB&lt;/a&gt;, but I lacked the mobility to do even their entry
movements. I also read &lt;a href=&quot;https://www.mobilitywod.com/the-supple-leopard/&quot;&gt;‘Supple Leopard’&lt;/a&gt; by Kelly Starrett
and &lt;a href=&quot;http://www.therollmodel.com/&quot;&gt;‘Roll Model’&lt;/a&gt; by Jill Miller and realized that mobility was
probably my limiting factor. It took a while before I was able to
integrate my reading about mobility into my mental model of
exercise. Here is the current model that I used to set up my latest
exercise routine and goals.&lt;/p&gt;

&lt;p&gt;Each movement has three components: mobility, skill and
strength. These three components are related in non-linear ways. Even
if they cannot be separated, it can still be useful to think about the
effect of each component on a movement:&lt;/p&gt;

&lt;h2 id=&quot;mobility&quot;&gt;Mobility&lt;/h2&gt;

&lt;p&gt;Do the joints, connective tissue and muscles that participate in the
movement have the range of motion required? There are &lt;em&gt;three&lt;/em&gt; answers
to this question: yes, no and almost.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;No means the joints do not have the required range of motion. For
example I cannot do a single leg squat because I lack the ankle range
of motion necessary.&lt;/li&gt;
  &lt;li&gt;Yes means the range of motion is sufficient for the movement.&lt;/li&gt;
  &lt;li&gt;Almost is the trickiest condition because it can look like the
movement can be done, but it is not optimal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples of “almost” include missing shoulder range of motion so that
the lats are not properly engaged for chinups or pullups. When only
the arm muscles are used, the number of reps will always be limited
and risk of shoulder injury is higher.&lt;/p&gt;

&lt;p&gt;Another example is doing a squat without full ankle or hip range of
motion. The squat is possible but it may lead to a rounded back and
that can lead to joint wear and injury.&lt;/p&gt;

&lt;p&gt;There are many examples of world class athletes that are able to
excel even with mobility limitations, but they often pay the price for
it later. If mobility restrictions are not addressed, injuries and
plateaus will keep happening. More complicated movements often require
more range of motion but developing range of motion takes much longer
than building skill or strength.&lt;/p&gt;

&lt;p&gt;Mobility is more than stretching. Increasing the range of motion
requires convincing the nervous system that extending further is
safe. This requires that the muscle be strong enough, and lots of time
spent relaxed at the end of the range of motion. The fascia are also
involved. Sometimes there can be adhesion of fascia layers that
prevents proper movement. Many times I have had injuries and pains
resolve after using a lacrosse ball or ART, where force is applied to
a trigger point.&lt;/p&gt;

&lt;h2 id=&quot;skill&quot;&gt;Skill&lt;/h2&gt;

&lt;p&gt;Skill is the neurological coordination required to do a movement
efficiently. Some movements require little practice, some full body
movements require lots of practice to coordinate all the muscles and
parts of the body properly. This is why practicing a movement without
going to failure can still result in extra repetitions or apparent
strength gains.&lt;/p&gt;

&lt;p&gt;Full muscle contraction is another skill that is an important part of
HIT. Learning to contract a muscle or a set of muscles under intense
discomfort takes practice.&lt;/p&gt;

&lt;h2 id=&quot;strength&quot;&gt;Strength&lt;/h2&gt;

&lt;p&gt;How best to stimulate the body to produce the desired adaptation
response of greater power output and/or muscle size? This is the
component that HIT has focused on. The slow movement to momentary
muscular failure protocol works well, but there are limitations. If a
mobility or skill component is lacking, this will prevent a trainee
from achieving proper muscular failure and stimulating an adaptation
response.&lt;/p&gt;

&lt;h2 id=&quot;training-for-all-components&quot;&gt;Training for all components&lt;/h2&gt;

&lt;p&gt;The best way to train for strength is HIT: short, infrequent
movement to momentary muscular failure.&lt;/p&gt;

&lt;p&gt;Skill training is best done when the muscles are rested as many skills
require strength to perform properly. Skill training therefore works
best with many repetitions using a lighter load with careful focus on
form.&lt;/p&gt;

&lt;p&gt;Mobility training is very different from skill or strength
training. It takes lots of time and is specific to each individual
body. It works best when done daily and integrated into other daily
activities. I have had to find creative ways to combine everyday
activities with stretching or working on fascia. For instance, I will
read in a straddle stretch and meditate in a squat position.&lt;/p&gt;

&lt;h2 id=&quot;programming&quot;&gt;Programming&lt;/h2&gt;

&lt;p&gt;How best to combine all this information into a weekly program? That
depends on the goals and time available of course.&lt;/p&gt;

&lt;p&gt;If the goal is maximum ROI for minimum time investment, HIT strength
training has the best returns. Movements that do not require skill or
mobility components will have the best return, and this is why many HIT
gyms use machines. Bodyweight HIT works, but many of the movements have
a mobility and skill component and each individual will experience
different limitations.&lt;/p&gt;

&lt;p&gt;The next step is to add mobility and skill practice. Unfortunately
both these require significantly more time investment. Choose a
specific skill and do daily mobility and skill repetitions.&lt;/p&gt;

&lt;p&gt;For example, I chose L-Sit and Crow Pose as skills and I do daily
shoulder and wrist mobility followed by those skills on days when I do
not do HIT strength. The mobility work takes 15-20 minutes. The skill
work only takes a few minutes. I try to fit some more skill work in
during the day to maximize the repetitions.&lt;/p&gt;

&lt;p&gt;I have carved out a time for this every day. I will keep going while
there is progress and then switch to something else when I stop
progressing.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Thanks for reading this far. The key to staying motivated is progress
and the key to progress is focusing on very specific goals.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Hashicorp Vault based PKI</title>
   <link href="https://kscherer.github.io//vault/2017/05/09/hashicorp-vault-based-pki"/>
   <updated>2017-05-09T00:00:00+00:00</updated>
   <id>hhttps://kscherer.github.io//vault/2017/05/09/hashicorp-vault-based-pki</id>
   <content type="html">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;One of the trends I have noticed is that open source tools encrypt
network connections by default. Some tools like Puppet even make it
impossible to disable TLS encryption and provide tooling to build an
internal Certificate Authority. Docker requires overrides for any
registry that does not have a verified TLS cert. Many tools also
generate self signed certs which Firefox and Chrome always require
manual overrides to use.&lt;/p&gt;

&lt;p&gt;The solution is to have an internal Certificate Authority with the
root CA as part of the trusted store. This internal CA can then be
used to generate certs which will be trusted. But there are always
complications. Many programs do not use the OS trusted store and
require extra configuration to add trusted certs. For example Java
applications require several steps to generate a new trusted store
file and configuration to make that available to the
application. Docker has a special directory to place trusted certs for
registries.&lt;/p&gt;

&lt;h2 id=&quot;options&quot;&gt;Options&lt;/h2&gt;

&lt;p&gt;There are many CA solutions available: &lt;a href=&quot;https://pki.openca.org/&quot;&gt;OpenCA&lt;/a&gt;, &lt;a href=&quot;https://github.com/square/certstrap&quot;&gt;CertStrap&lt;/a&gt;,
&lt;a href=&quot;https://github.com/cloudflare/cfssl&quot;&gt;CFSSL&lt;/a&gt;, &lt;a href=&quot;https://github.com/Netflix/lemur&quot;&gt;Lemur&lt;/a&gt; and many others. As I looked through all these
programs a couple things kept bugging me. Creating certs is easy,
revocation is where it gets really messy. The critical question is how
to handle revocation in a sensible way. How can the system recover
from a root CA compromise? Once I started reading about CRLs, OCSP
and cert stapling, I got really discouraged. That is why I was
intrigued by &lt;a href=&quot;https://www.vaultproject.io/&quot;&gt;Hashicorp Vault&lt;/a&gt; and its PKI backend.&lt;/p&gt;

&lt;h2 id=&quot;vault&quot;&gt;Vault&lt;/h2&gt;

&lt;p&gt;Vault is a tool for managing secrets of all kinds, including tokens,
passwords and private TLS keys. It is quite complex and the CLI is non
obvious. It supports backends for Authentication, Secret Storage
and Auditing. It has a comprehensive access control language and a
generic wrapper concept that makes it possible to pass secrets without
revealing secrets to the middle man.&lt;/p&gt;

&lt;p&gt;Vault solves the revocation and CA compromise problem by making it
unnecessary. It provides a secure audited out of band channel for
distributing secrets like certs which enables very short lived certs
and secure automated reissuing of certs.&lt;/p&gt;

&lt;h2 id=&quot;vault-pki&quot;&gt;Vault PKI&lt;/h2&gt;

&lt;p&gt;That is the theory, so I decided to try it in practice by creating a
CA and some certs.&lt;/p&gt;

&lt;p&gt;1) Start Vault server, initialize, unseal and authenticate as root:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; vault server -config config.hcl
&amp;gt; export VAULT_ADDR='http://127.0.0.1:8200'
&amp;gt; vault init -key-shares=1 -key-threshold=1
Unseal Key 1: LbOw129fyB3OAzZvxq9RMQefNH8fFm7twS3wlg5Zv2o=
Initial Root Token: d9e9d69b-5d49-e753-3ef2-e6b36c0fb45a
&amp;gt; vault unseal LbOw129fyB3OAzZvxq9RMQefNH8fFm7twS3wlg5Zv2o=
&amp;gt; vault auth
Token (will be hidden): d9e9d69b-5d49-e753-3ef2-e6b36c0fb45a
Successfully authenticated! You are now logged in.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Of course this is for development only. A production deployment would
use more shares and a higher threshold. The unseal keys should be
encrypted using gpg. Note that the root token can be changed.&lt;/p&gt;
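
&lt;p&gt;For example, a production-style init might look like this. The key
file names are illustrative; the -pgp-keys flag encrypts each unseal
key share to an administrator’s public gpg key:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; vault init -key-shares=5 -key-threshold=3 \
    -pgp-keys=&quot;alice.asc,bob.asc,carol.asc,dave.asc,erin.asc&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;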

&lt;p&gt;2) Create self signed Cert with 10 year expiration&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; vault mount -path=wrlinux -description=&quot;WRLinux Root CA&quot; -max-lease-ttl=87600h pki
Successfully mounted 'pki' at 'wrlinux'!
&amp;gt; vault write wrlinux/root/generate/internal common_name=&quot;WRlinux Root CA&quot; \
    ttl=87600h key_bits=4096 exclude_cn_from_sans=true
certificate     -----BEGIN CERTIFICATE-----
MIIFBDCCAuygAwIBAgIUMt8NYFtqaYk8Q1OUfdOWuPjXI0IwDQYJKoZIhvcNAQEL
...
serial_number   32:df:0d:60:5b:6a:69:89:3c:43:53:94:7d:d3:96:b8:f8:d7:23:42
&amp;gt; curl -s http://localhost:8200/v1/wrlinux/ca/pem | openssl x509 -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            32:df:0d:60:5b:6a:69:89:3c:43:53:94:7d:d3:96:b8:f8:d7:23:42
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that this is the only time the private cert is exposed.&lt;/p&gt;

&lt;p&gt;3) Keep root CA offline and create second vault for intermediate CA&lt;/p&gt;

&lt;p&gt;Create CSR for Intermediate CA&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; vault mount -path=lpd -description=&quot;LPD Intermediate CA&quot; -max-lease-ttl=26280h pki
&amp;gt; vault write lpd/intermediate/generate/internal common_name=&quot;LPD Intermediate CA&quot; \
ttl=26280h key_bits=4096 exclude_cn_from_sans=true
csr     -----BEGIN CERTIFICATE REQUEST-----
MIIEYzCCAksCAQAwHjEcMBoGA1UEAxMTTFBEIEludGVybWVkaWF0ZSBDQTCCAiIw
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;4) Sign CSR and import Certificate&lt;/p&gt;

&lt;p&gt;Note: Intermediate private key never leaves Vault&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; vault write wrlinux/root/sign-intermediate csr=@lpd.csr \
common_name=&quot;LPD Intermediate CA&quot; ttl=8760h
Key             Value
---             -----
certificate     -----BEGIN CERTIFICATE-----
MIIFSzCCAzOgAwIBAgIUAY8RmTDEzwbkUQ0smevPPIPXOkYwDQYJKoZIhvcNAQEL
...
-----END CERTIFICATE-----
expiration      1523021374
issuing_ca      -----BEGIN CERTIFICATE-----
MIIFBDCCAuygAwIBAgIUMt8NYFtqaYk8Q1OUfdOWuPjXI0IwDQYJKoZIhvcNAQEL
...
-----END CERTIFICATE-----
serial_number   01:8f:11:99:30:c4:cf:06:e4:51:0d:2c:99:eb:cf:3c:83:d7:3a:46
&amp;gt; vault write lpd/intermediate/set-signed certificate=@lpd.crt
Success! Data written to: lpd/intermediate/set-signed
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;5) Create Role and generate Certificate&lt;/p&gt;

&lt;p&gt;Vault uses roles to set up cert creation rules.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; vault write lpd/roles/hosts key_bits=2048 \
max_ttl=8760h allowed_domains=wrs.com allow_subdomains=true \
organization='Wind River' ou=WRLinux
Success! Data written to: lpd/roles/hosts
&amp;gt; vault write lpd/issue/hosts common_name=&quot;yow-kscherer-l1.wrs.com&quot; \
ttl=720h
private_key             -----BEGIN RSA PRIVATE KEY-----
MIIEpAIBAAKCAQEAvxHQzyEjc13djntQfCo1ncpwU18a8c8iI4OdaOSQV72zbHf2
...
-----END RSA PRIVATE KEY-----
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;6) Final Steps&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Import root CA cert into trusted store&lt;/li&gt;
  &lt;li&gt;Create Policy to limit role access to cert creation (see the sketch below)&lt;/li&gt;
  &lt;li&gt;Use program like vault-pki-client to automate cert regeneration&lt;/li&gt;
  &lt;li&gt;Audit that certs are only created at expected times&lt;/li&gt;
  &lt;li&gt;Automate cert regeneration&lt;/li&gt;
&lt;/ul&gt;
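
&lt;p&gt;As a minimal sketch of that policy step, a Vault policy that only
allows issuing certs from the hosts role might look like this. The
path and capabilities are illustrative, not a drop-in file:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# hypothetical policy: allow only cert issuance from the hosts role
path &quot;lpd/issue/hosts&quot; {
  capabilities = [&quot;create&quot;, &quot;update&quot;]
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;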

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Once this is set up, Heartbleed is a non-event! As well as a PKI, I
can use Vault to manage other kinds of secrets.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Docker Multi Host Networking</title>
   <link href="https://kscherer.github.io//docker/2017/05/02/docker-multi-host-networking"/>
   <updated>2017-05-02T00:00:00+00:00</updated>
   <id>hhttps://kscherer.github.io//docker/2017/05/02/docker-multi-host-networking</id>
   <content type="html">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;I recently did a presentation at work covering the basics of getting
docker containers on different hosts to talk to one another. I was
motivated by wanting to understand all the strange networking options
available and why Kubernetes chose the one network per pod
model as the default.&lt;/p&gt;

&lt;p&gt;Docker networking breaks many of the current assumptions about
networking. A modern server can easily run 100+ containers and a
datacenter rack can hold 80+ servers. If the networking model is one
IP per container, that implies 100+ IPs per machine and 1000s per
rack. Ephemeral containers with short lifespans mean that the
network has to react quickly.&lt;/p&gt;

&lt;p&gt;Of course there are competing container networking standards: CNM
(libnetwork from Docker) and CNI (CoreOS and Kubernetes). Beyond the
supported network models in Docker there is also a docker network
plugin ecosystem with various vendors providing special integration
with their gear.&lt;/p&gt;

&lt;h2 id=&quot;bridge-mode&quot;&gt;Bridge Mode&lt;/h2&gt;

&lt;p&gt;Let’s start simple with the default bridge mode. Docker creates a
linux bridge and a veth per container. By default containers can access
the external network but the external network cannot access the
containers. This is the safe default. To allow external access to a
container, host ports are forwarded to container ports. IPTables rules
can be used to prevent inter-container communication. This
functionality works with older kernels.&lt;/p&gt;

&lt;h3 id=&quot;bridge-mode-example&quot;&gt;Bridge mode example&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; docker run --detach --publish 1234:1234 ubuntu:16.04 sleep infinity

# docker0 is the bridge, veth is connected to the docker0 bridge
&amp;gt; ip addr
2: eth0: &amp;lt;BROADCAST,MULTICAST,UP,LOWER_UP&amp;gt; mtu 1500
    inet 128.224.56.107/24 brd 128.224.56.255 scope global eth0
3: docker0: &amp;lt;BROADCAST,MULTICAST,UP,LOWER_UP&amp;gt; mtu 1500
    inet 172.17.0.1/16 scope global docker0
8: vethea44ea7@if7: &amp;lt;BROADCAST,MULTICAST,UP,LOWER_UP&amp;gt; mtu 1500 master docker0 state UP

# iptables rules for packet forwarding
&amp;gt; iptables -L
Chain FORWARD (policy DROP)
target     prot opt source               destination
DOCKER     all  --  anywhere             anywhere

Chain DOCKER (1 references)
target     prot opt source               destination
ACCEPT     tcp  --  anywhere             172.17.0.2           tcp dpt:1234

# docker-proxy program forwards traffic from host port 1234 to container port 1234
&amp;gt; pgrep -af proxy
30676 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 1234 \
    -container-ip 172.17.0.2 -container-port 1234
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;bridge-mode-limitations&quot;&gt;Bridge Mode Limitations&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;The container IP is hidden and cannot be used for service
discovery&lt;/li&gt;
  &lt;li&gt;Host ports become a limiting resource&lt;/li&gt;
  &lt;li&gt;Service discovery must have host ip and port&lt;/li&gt;
  &lt;li&gt;Port forwarding has a performance cost&lt;/li&gt;
  &lt;li&gt;Does not scale well&lt;/li&gt;
  &lt;li&gt;Application must support non standard port numbers&lt;/li&gt;
  &lt;li&gt;Large scale solutions involve load balancers + service discovery&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;overlay&quot;&gt;Overlay&lt;/h2&gt;

&lt;p&gt;The overlay network feature uses VXLAN to create a private network. It
is part of Docker swarm mode. Each group of containers (Pod) has a
dedicated network which is the Kubernetes network model. It does not
require any underlay network modification, i.e. the network that the
hosts are using. The docker swarm integration is very well done and
many of the details are nicely abstracted away.&lt;/p&gt;

&lt;p&gt;Benefits include:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Applications can use standard ports&lt;/li&gt;
  &lt;li&gt;Simplified service discovery can use DNS&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;overlay-example---create-swarm&quot;&gt;Overlay Example - Create Swarm&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;manager&amp;gt; docker swarm init --advertise-addr 128.224.56.106
Swarm initialized: current node (tqxsn8ytpdq8ntd4sswl6qxjo) is now a manager.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To add a worker to this swarm, run the following command:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;docker swarm join \
--token SWMTKN-1-0f49cat8w4xm29qndjza1u294i2 128.224.56.106:2377

worker1&amp;gt; docker swarm join ...
This node joined a swarm as a worker.

manager&amp;gt; docker node ls
ID                           HOSTNAME         STATUS  AVAILABILITY  MANAGER STATUS
qshqkznzaty8ggbyiodzb9jy9    worker2          Ready   Active
r0fj6inhs1tsin07mdsxoaiam    worker1          Ready   Active
tqxsn8ytpdq8ntd4sswl6qxjo *  manager          Ready   Active        Leader
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;swarm-network-and-load-balancing&quot;&gt;Swarm Network and Load Balancing&lt;/h3&gt;

&lt;p&gt;I found this great example from the Nginx example repository:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;images/docker-swarm-load-balancing.png&quot; alt=&quot;Docker Swarm Load Balancing&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Docker Swarm has built in DNS, scheduling and load balancing! In the
following example, A is Service1 and B is Service2, which is not
externally accessible.&lt;/p&gt;

&lt;h3 id=&quot;overlay-example---create-service&quot;&gt;Overlay Example - Create Service&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; docker network create --driver overlay demo_net
&amp;gt; docker service create --name service1 --replicas=3 \
        --network demo_net -p 8111:80 service1
&amp;gt; docker service create --name service2 --replicas=3 --network demo_net service2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Containers spread across three machines&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; docker service ps service1
5fl2xpzbka28  service1.1  service1  worker1  Running
u3f4bd8q3p6d  service1.2  service1  worker2  Running
i85jdtgtinxr  service1.3  service1  manager  Running
&amp;gt; docker service ps service2
b5bzfdqw10y2  service2.1  service2  worker1  Running
k39m6utcq56o  service2.2  service2  worker2  Running
uaftc3ax0k17  service2.3  service2  manager  Running
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;overlay-example---load-balancing&quot;&gt;Overlay Example - Load Balancing&lt;/h3&gt;

&lt;p&gt;Service 1 contacts Service 2 using internal DNS.
Swarm uses Round Robin DNS lookup by default.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;manager&amp;gt; curl -s http://worker1:8111/service1.php | grep address
service1 address: 10.255.0.9
service2 address: 10.0.0.5
manager&amp;gt; curl -s http://worker2:8111/service1.php | grep address
service1 address: 10.255.0.8
service2 address: 10.0.0.4
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;overlay-limitations&quot;&gt;Overlay Limitations&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;VXLAN MTU and UDP complications&lt;/li&gt;
  &lt;li&gt;VXLAN adds latency (10-20%) and reduces throughput (50-75%)&lt;/li&gt;
  &lt;li&gt;Debugging VXLAN problems difficult&lt;/li&gt;
  &lt;li&gt;Docker swarm hides all the setup and routing complexity&lt;/li&gt;
  &lt;li&gt;Some network vendors provide VXLAN integration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;macvlan&quot;&gt;Macvlan&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Linux Networking driver feature&lt;/li&gt;
  &lt;li&gt;Low performance overhead&lt;/li&gt;
  &lt;li&gt;MAC and IP per container, similar to VM&lt;/li&gt;
  &lt;li&gt;MacVlan does not use VLANs!&lt;/li&gt;
  &lt;li&gt;Recently moved from docker experimental&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;macvlan-example&quot;&gt;Macvlan Example&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;host1&amp;gt; docker network create --driver macvlan --subnet 128.224.56.0/24 \
    --gateway 128.224.56.1 -o parent=eth0 mv1
host2&amp;gt; docker network create --driver macvlan --subnet 128.224.56.0/24 \
    --gateway 128.224.56.1 -o parent=eth0 mv1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Choose unused IPs&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;host1&amp;gt; docker run -it --rm --net=mv1 --ip=128.224.56.119 alpine /bin/sh
host2&amp;gt; docker run -it --rm --net=mv1 --ip=128.224.56.120 alpine /bin/sh
/ # ping 128.224.56.119
PING 128.224.56.119 (128.224.56.119): 56 data bytes
64 bytes from 128.224.56.119: seq=0 ttl=64 time=0.782 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Imagine a /16 subnet where each host has a /24 for container IPs.&lt;/p&gt;
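
&lt;p&gt;Docker can carve per-host ranges out of a shared subnet with
--ip-range. A hypothetical two-host setup, with example addresses:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;host1&amp;gt; docker network create --driver macvlan --subnet 10.0.0.0/16 \
    --ip-range 10.0.1.0/24 --gateway 10.0.0.1 -o parent=eth0 mv16
host2&amp;gt; docker network create --driver macvlan --subnet 10.0.0.0/16 \
    --ip-range 10.0.2.0/24 --gateway 10.0.0.1 -o parent=eth0 mv16
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;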

&lt;h3 id=&quot;macvlan-limitations&quot;&gt;Macvlan Limitations&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Subnet and gateway must match host network&lt;/li&gt;
  &lt;li&gt;Requires new kernels: 4.2+&lt;/li&gt;
  &lt;li&gt;Requires IPAM and network cooperation&lt;/li&gt;
  &lt;li&gt;Isolation requires VLANs and/or firewalls&lt;/li&gt;
  &lt;li&gt;Limited to one broadcast domain&lt;/li&gt;
  &lt;li&gt;Too many MACs can overflow NIC buffer&lt;/li&gt;
  &lt;li&gt;Docker can allocate IPs in a given range&lt;/li&gt;
  &lt;li&gt;IPVLan L2 mode very similar&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;ipvlan-l3-mode&quot;&gt;IPVlan L3 Mode&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Linux Networking driver feature&lt;/li&gt;
  &lt;li&gt;Low performance overhead&lt;/li&gt;
  &lt;li&gt;Multicast and broadcast traffic silently dropped&lt;/li&gt;
  &lt;li&gt;Mimics Internet architecture of aggregated L3 domains&lt;/li&gt;
  &lt;li&gt;Scales well due to no broadcast domain&lt;/li&gt;
  &lt;li&gt;Docker experimental as of 1.13&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;ipvlan-example&quot;&gt;IPVlan Example&lt;/h3&gt;

&lt;p&gt;Create network - requires dockerd run with --experimental&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;host1&amp;gt; docker network create --driver ipvlan --subnet 192.168.120.0/24 \
    -o parent=eth0 -o ipvlan_mode=l3 iv1
host2&amp;gt; docker network create --driver ipvlan --subnet 192.168.121.0/24 \
    -o parent=eth0 -o ipvlan_mode=l3 iv1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Setup routes: host1=128.224.56.106, host2=128.224.56.107&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;host1&amp;gt; ip route add 192.168.121.0/24 via 128.224.56.107
host2&amp;gt; ip route add 192.168.120.0/24 via 128.224.56.106
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Create containers&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;host1&amp;gt; docker run -it --rm --net=iv1 --ip=192.168.120.10 alpine /bin/sh
host2&amp;gt; docker run -it --rm --net=iv1 --ip=192.168.121.10 alpine /bin/sh
/ # ping 192.168.120.10
PING 192.168.120.10 (192.168.120.10): 56 data bytes
64 bytes from 192.168.120.10: seq=0 ttl=64 time=0.408 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;ipvlan-limitations&quot;&gt;IPVLan Limitations&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Currently experimental&lt;/li&gt;
  &lt;li&gt;Requires new kernels: 4.2+&lt;/li&gt;
  &lt;li&gt;Isolation requires VLANs and/or iptables&lt;/li&gt;
  &lt;li&gt;Manage routes using BGP with Calico, Cumulus, etc.&lt;/li&gt;
  &lt;li&gt;Container networking becomes a routing problem, which is a well
understood problem&lt;/li&gt;
  &lt;li&gt;Policies using BPF on veth and Cilium&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Docker Multi-Host Networking is complicated!&lt;/li&gt;
  &lt;li&gt;Performance and Scale dictate solution&lt;/li&gt;
  &lt;li&gt;Balance between simplifying applications and infrastructure&lt;/li&gt;
&lt;/ul&gt;

</content>
 </entry>
 
 <entry>
   <title>Book Review: Sapiens</title>
   <link href="https://kscherer.github.io//review/2017/05/01/book-review-sapiens"/>
   <updated>2017-05-01T00:00:00+00:00</updated>
   <id>hhttps://kscherer.github.io//review/2017/05/01/book-review-sapiens</id>
   <content type="html">&lt;p&gt;“Sapiens: A Brief History of Humankind” by Yuval Noah Harari&lt;/p&gt;

&lt;p&gt;It is hard to do such a dense and well written book justice in a short
blog post. I really enjoyed the content and writing style.&lt;/p&gt;

&lt;p&gt;Much of the content overlaps with books like “Guns, Germs and Steel” and
“The Third Chimpanzee” by Jared Diamond. The mass extinctions and
genocides directly attributed to our ancestors are covered. The book
is very careful to draw clear boundaries around the limits of our
historical knowledge.&lt;/p&gt;

&lt;p&gt;The first concept that really got me thinking was Culture as shared
myth or fiction. Getting large groups of humans to live together
requires mechanisms to limit anti-social behaviour, but violence and
surveillance do not scale well. Shared fictions like the hierarchy of
royalty over common people can be much more effective at regulating
behavior. The clearest example from the book is the concept of a
corporation. It exists only because people accept that it exists. It
does not exist because a few people scribbled on some paper, although
the ritual can be important. A corporation is technically just a group
of people. What binds them together is an imagined construct of
hierarchy, rules, values and an identity which is accepted as real by
potentially millions of people.&lt;/p&gt;

&lt;p&gt;One of my favorite lines of the book:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Yet it is an iron rule of history that every imagined hierarchy
disavows its fictional origins and claims to be natural and
inevitable.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Every culture from the Greeks to modern democracy to Communist Russia
made the same claim of being natural and inevitable. The book even
takes on imagined hierarchies like race and gender and dismantles
the arguments of their proponents. Money is another convenient shared fiction that many
people claim as inevitable. It also makes the excellent point that our
current society places rich above poor and this is no more natural
than placing men above women or whites above blacks. It makes the
current discussions of wealth inequality even more urgent.&lt;/p&gt;

&lt;p&gt;There is so much thought provoking material in this book I cannot
cover it all. The last section talks about the future of our species:
changing our genetics, becoming cyborgs and creating an intelligence
more capable than our own. Each of these paths has mind boggling
possibilities. The final line of the book sums it up very well:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Since we might soon be able to engineer our desires too, the real
question facing us is not &quot;What do we want to become?&quot; but &quot;What
do we want to want?&quot; Those who are not spooked by this question
probably haven't given it enough thought.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It is a book that changed the way I see and think about the
world. That is highest praise for a book that I can think of.&lt;/p&gt;

&lt;p&gt;Rating: Highly recommended&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Book Review: Ego is the Enemy</title>
   <link href="https://kscherer.github.io//review/2017/04/20/book-review-ego-is-the-enemy"/>
   <updated>2017-04-20T00:00:00+00:00</updated>
   <id>hhttps://kscherer.github.io//review/2017/04/20/book-review-ego-is-the-enemy</id>
   <content type="html">&lt;p&gt;“Ego is the Enemy” by Ryan Holiday&lt;/p&gt;

&lt;p&gt;The central message of this book is not new. Ego has always a been a
double edged sword. It motivates and energizes, but it also undermines
us in many ways. Fundamentally all worthwhile progress involves more
than one person and ego undermines human relationships. This book was
a fantastic reminder of all the ways ego can undermine our
relationships with other people and progress on our goals. The most
enjoyable part was all the examples of famous and less famous people
and how they succeeded by controlling ego or failed due to their ego.&lt;/p&gt;

&lt;p&gt;The single line that resonated with me the most was “We choose to
be or to do”. We either choose to expend our energy projecting an
image of who we want to be or expend our energy doing the work. Do the
work because it is important, not because we expect to be rewarded or
acknowledged.&lt;/p&gt;

&lt;p&gt;Thoughts provoked by this book:&lt;/p&gt;

&lt;p&gt;The meaning of work is often a matter of perspective. A piece of code
can be both “just a hack” and a valuable contribution to the world of
open source software at the same time. Still, I don’t want
to make a small contribution seem more important than it really is.&lt;/p&gt;

&lt;p&gt;Sometimes the work feels meaningful and sometimes I have to remind
myself to change my perspective. Sometimes I think a different job
would be more meaningful, but that ignores all the drudgery that is
part of any job. I feel most motivated when I feel part of something
much bigger than myself. For me it has always been the mythical
“community” of open source software. Time to buckle down and do the
work.&lt;/p&gt;

&lt;p&gt;Rating: Recommended&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>PXE install on UEFI using Foreman and GRUB2</title>
   <link href="https://kscherer.github.io//linux/2017/03/20/pxe-install-on-uefi-using-foreman-and-grub2"/>
   <updated>2017-03-20T00:00:00+00:00</updated>
   <id>hhttps://kscherer.github.io//linux/2017/03/20/pxe-install-on-uefi-using-foreman-and-grub2</id>
   <content type="html">&lt;p&gt;Most of the bare metal hardware that I manage now supports or defaults
to UEFI. Many have the option to use “Legacy BIOS” mode, but the main
feature I require from UEFI is support for boot volumes larger
than 2TB. I prefer one single RAID0 volume for all the builders for
operational simplicity.&lt;/p&gt;

&lt;h3 id=&quot;foreman&quot;&gt;Foreman&lt;/h3&gt;

&lt;p&gt;My preferred solution for installing the base OS on the hardware
is &lt;a href=&quot;https://theforeman.org/&quot;&gt;Foreman&lt;/a&gt;. It makes automated installs very simple and
reproducible but has only recently supported UEFI and PXE. I will
describe my previous attempts to get this working and how I was able
to get it working with Foreman 1.14.2.&lt;/p&gt;

&lt;h3 id=&quot;pxelinux-and-uefi&quot;&gt;Pxelinux and UEFI&lt;/h3&gt;

&lt;p&gt;Pxelinux is part of the &lt;a href=&quot;http://www.syslinux.org/wiki/index.php?title=The_Syslinux_Project&quot;&gt;syslinux&lt;/a&gt; project and provides many
different types of bootloaders. Pxelinux depends on a custom ROM
inside the network card to run DHCP and download kernel+initrd using
TFTP. It also has support for displaying interactive menus to the
user.&lt;/p&gt;

&lt;p&gt;UEFI contains all this functionality, but its designers unfortunately
did not preserve backwards compatibility. All boot time
programs like grub2 and pxelinux required significant rework. I was
able to use the syslinux git tree and compile a working EFI version of
pxelinux that was able to boot the 14.04 Ubuntu installer. But there
were limitations:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Foreman only supported non-efi Pxelinux and I had to manually swap
the binaries on the TFTP server&lt;/li&gt;
  &lt;li&gt;The menu system didn’t work so I could not use the Foreman feature
of leaving the system to boot PXE by default and booting the local
hard drive if rebuild was not enabled for that host in Foreman.&lt;/li&gt;
  &lt;li&gt;I could not get this pxelinux to work with 16.04 installer. The
initrd would be downloaded and would hang and trigger a system reset.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;uefi-and-grub2&quot;&gt;UEFI and GRUB2&lt;/h3&gt;

&lt;p&gt;Foreman 1.13 added support for GRUB2 and UEFI, but my initial attempts
failed. When I changed the boot template from PXELinux to PXEGRUB2 the
update of the DHCP server would fail. The DHCP entry was added
properly to the DHCP server using the Foreman Proxy, but it would
cause a traceback on the server and prevent the Host change from being
saved. This bug was fixed in 1.14 and I was finally able to get this
working. There was one more bug in the PXEGRUB2 boot template
involving an assumption about Profiles. I opened an issue and have
submitted a PR to the community templates for this.&lt;/p&gt;

&lt;p&gt;Foreman 1.14.2 was also missing the Preseed default PXEGrub2 template,
but one had already been submitted to the community templates repo, so
I had to manually add &lt;a href=&quot;https://github.com/theforeman/community-templates/blob/develop/provisioning_templates/PXEGrub2/preseed_default_pxegrub2.erb&quot;&gt;this template&lt;/a&gt; to my provisioning templates.&lt;/p&gt;

&lt;h3 id=&quot;tftp-preparation&quot;&gt;TFTP preparation&lt;/h3&gt;

&lt;p&gt;Foreman adds a DHCP record which contains the following:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;server.filename = &quot;grub2/grubx64.efi&quot;;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;First step was to find the proper grub2 binary. Fortunately the Ubuntu
wiki had a helpful post covering &lt;a href=&quot;https://wiki.ubuntu.com/UEFI/PXE-netboot-install&quot;&gt;UEFI PXE netboot&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was able to find the xenial grubnetx64.efi &lt;a href=&quot;http://archive.ubuntu.com/ubuntu/dists/xenial/main/uefi/grub2-amd64/current/grubnetx64.efi.signed&quot;&gt;here&lt;/a&gt;. But it turns
out the Debian/Ubuntu grub2 is missing a few useful features that have
been added to the Fedora grub2. The Ubuntu/vanilla grub2 only looks
for grub/grub.conf whereas the Fedora grub2 has patches to search the
grub2 directory and search for grub.cfg-[mac address] which is a
convention that Foreman expects. Since Foreman is a project mostly run
by RedHat employees, this makes sense. The Fedora prebuilt grub2
bootloader is &lt;a href=&quot;https://download-ib01.fedoraproject.org/pub/fedora/linux/releases/25/Server/x86_64/os/EFI/BOOT/grubx64.efi&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There is a PR which adds a default grub/grub.cfg and uses the grub2
regexp feature to search for $prefix/grub.cfg-[mac address]. This
means that Foreman will support vanilla Grub2 soon.&lt;/p&gt;

&lt;p&gt;Foreman will also place the correct kernel and initrd into the boot
directory. It will not replace an older kernel, so sometimes a newer
kernel and initrd need to be downloaded from &lt;a href=&quot;http://archive.ubuntu.com/ubuntu/dists/xenial/main/installer-amd64/current/images/netboot/ubuntu-installer/amd64/&quot;&gt;here&lt;/a&gt; and manually
added to the boot directory.&lt;/p&gt;
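
&lt;p&gt;For example, something like the following, where the TFTP root path
is illustrative and the URL is the netboot directory linked above:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; cd /var/lib/tftpboot/boot
&amp;gt; wget http://archive.ubuntu.com/ubuntu/dists/xenial/main/installer-amd64/current/images/netboot/ubuntu-installer/amd64/linux
&amp;gt; wget http://archive.ubuntu.com/ubuntu/dists/xenial/main/installer-amd64/current/images/netboot/ubuntu-installer/amd64/initrd.gz
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;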

&lt;h3 id=&quot;how-does-it-work&quot;&gt;How does it work?&lt;/h3&gt;

&lt;p&gt;Here is how this works:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Put the host in build mode. This sets up the grub2/grub.cfg-[mac address] file with the automated build setup. It also adds a DHCP
entry specifying to download the &quot;grub2/grubx64.efi&quot; file.&lt;/li&gt;
  &lt;li&gt;Start PXE boot and UEFI retrieves IP, filename and next-server/TFTP
from DHCP server&lt;/li&gt;
  &lt;li&gt;UEFI downloads grub2/grubx64.efi from TFTP&lt;/li&gt;
  &lt;li&gt;GRUB2 looks for grub2/grub.cfg-[mac address]&lt;/li&gt;
  &lt;li&gt;The Grub2 template contains the automated install configuration
generated by Foreman (a sketch follows this list)&lt;/li&gt;
  &lt;li&gt;GRUB2 downloads kernel and initrd and boots the kernel and starts
the installer&lt;/li&gt;
  &lt;li&gt;After the install is complete, the boot template is changed back
to chainload the local disk&lt;/li&gt;
&lt;/ol&gt;
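
&lt;p&gt;To make step 5 concrete, here is a hypothetical sketch of the
rendered file. The file name, kernel paths and preseed URL are all
illustrative:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# grub2/grub.cfg-01-aa-bb-cc-dd-ee-ff (MAC address suffix is an example)
set default=0
set timeout=10
menuentry 'Ubuntu 16.04 automated install' {
  linux boot/ubuntu-xenial-amd64-linux auto=true priority=critical \
      url=http://foreman.example.com/unattended/provision
  initrd boot/ubuntu-xenial-amd64-initrd.gz
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;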

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;The deficiencies of the previous process have been addressed. GRUB2
can boot the 16.04 kernels and even the hwe kernels and installer if I
want to. The menus and boot to local disk are working.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Book Review: Rebirth</title>
   <link href="https://kscherer.github.io//review/2017/02/18/book-review-rebirth"/>
   <updated>2017-02-18T00:00:00+00:00</updated>
   <id>hhttps://kscherer.github.io//review/2017/02/18/book-review-rebirth</id>
   <content type="html">&lt;p&gt;This book is a fictional/auto biographical account of one mans journey
on the Camino pilgrimage trail in Spain. I really enjoyed it. The
characters are quirky and very human with baggage and beautiful
experiences. The dialogue is a little too perfect, but it made me
consider doing a long walk like this.&lt;/p&gt;

&lt;p&gt;Some of my favorite quotes from the book:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Questions that help guide the way, like the yellow arrows on the
Camino
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Too often I hear about guiding values and statements, but I really
like the idea of guiding questions. Tim Ferris and his podcast guests
often talk about questions that guided their decisions.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&quot;If I loved myself, what would I do?&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I find this a tough question because it feels selfish. Finding a
balance between selfishness and selflessness never ends. I wish there
was a single answer, but I know that isn’t possible.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&quot;Don't ask why, ask 'Now what'? People have made it through horrific
times not by focusing on why but moving on and asking 'Now What?'&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Trying to understand is important, but sometimes the energy is better
spent on getting ready for the future.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&quot;It is not the wound that makes you special, it is the light that
shines through it&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A great reminder that the hardships of life define you as much as the
successes. I have always marvelled at artists that were able to
transform immense pain into incredible music and art.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&quot;Perfect is no unnecessary pain. I wish you a perfect Camino.&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Unfortunately sometimes pain is necessary. Pain is such a
multifaceted concept and hard to talk about. Maybe I will find someone
who can do it more eloquently than I can.&lt;/p&gt;

&lt;p&gt;Rating: Recommended&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Puppet Infrastructure Overhaul</title>
   <link href="https://kscherer.github.io//puppet/2016/12/22/puppet-infrastructure-overhaul"/>
   <updated>2016-12-22T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//puppet/2016/12/22/puppet-infrastructure-overhaul</id>
   <content type="html">&lt;p&gt;I have been planning to upgrade my infrastructure to Puppet 4 but
other priorities have delayed it. I was finally able to find a way to
start the upgrade work. There are many new pieces of technology
available which I hope will make things work even better than before.&lt;/p&gt;

&lt;h3 id=&quot;puppet-4&quot;&gt;Puppet 4&lt;/h3&gt;

&lt;p&gt;Since Puppet 3 reaches End Of Life at the end of 2016, this upgrade is
probably the most urgent. I am looking forward to being able to use
the improved Puppet language and r10k. The Puppet Server is supposed
to be much faster, and the AIO packages should be easier to install and
support.&lt;/p&gt;

&lt;h3 id=&quot;mcollective-choria&quot;&gt;MCollective Choria&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://www.devco.net/&quot;&gt;R.I.Pienaar&lt;/a&gt; has been busy and built a new mcollective deployment package
called &lt;a href=&quot;http://choria.io/&quot;&gt;Choria&lt;/a&gt;. It has puppet modules which automatically enable
SSL everywhere, an audit plugin, a packager for plugins, and it uses
NATS instead of ActiveMQ. My federated cluster with three ActiveMQ
servers has been stable, but it was a pain to set up and
upgrade. It is also managed using a custom puppet module which I do
not want to maintain. I am also hoping to be able to use NATS as a
message bus for some application orchestration.&lt;/p&gt;

&lt;h3 id=&quot;gitolite&quot;&gt;Gitolite&lt;/h3&gt;

&lt;p&gt;I maintain a large internal network of git servers. The base
configuration is very open: anyone with a valid ssh login using NIS
can create or push to repositories, and every repository is available for
unauthenticated read-only access. We have a few post-receive hooks to
limit who can push to which repositories, but our developers respect
our gatekeeper model and do not push to repositories they aren’t
supposed to. The open access model has allowed people to do emergency
fixes when necessary. But there have occasionally been requests for
some sort of access control, and I have also considered locking down
the repository with the Puppet modules because it is so critical to
the business, so I decided to experiment with &lt;a href=&quot;http://gitolite.com/&quot;&gt;Gitolite&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;r10k&quot;&gt;R10K&lt;/h3&gt;

&lt;p&gt;I have been using &lt;a href=&quot;http://librarian-puppet.com/&quot;&gt;librarian-puppet&lt;/a&gt; with a custom git
synchronization program which relies on the ActiveMQ network. I have 3
puppet masters, and the post-receive hook uses STOMP to broadcast
changes. The &lt;a href=&quot;https://github.com/kscherer/git-stomp-hooks&quot;&gt;git-stomp-hook&lt;/a&gt; receives the broadcast and calls
librarian-puppet as appropriate. This has worked well except when
the ActiveMQ network was having problems. So I was happy to notice
that the &lt;a href=&quot;https://github.com/voxpupuli/puppet-r10k&quot;&gt;puppet-r10k&lt;/a&gt; module contains a webhook program that can
be used to trigger r10k deploys on the puppet masters. Since r10k was
integrated into &lt;a href=&quot;https://puppet.com/product&quot;&gt;Puppet Enterprise&lt;/a&gt;, I decided to move away from
librarian-puppet. R10k actually works very similarly to the solution I
had cobbled together; it just ignores module dependencies. This is
both a blessing and a curse, but because Puppet does not support
conditional dependencies, it may be better long term to manage
dependencies manually.&lt;/p&gt;
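
&lt;p&gt;For illustration, r10k drives everything from a Puppetfile in the
control repo; a minimal sketch (module names and versions are examples,
not my actual list):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Puppetfile sketch: r10k installs exactly what is listed, with no
# dependency resolution
forge 'https://forgeapi.puppetlabs.com'

mod 'puppetlabs/stdlib', '4.13.1'
mod 'puppet/r10k', '4.2.0'

# modules can also come from internal git mirrors
mod 'profiles',
  :git =&amp;gt; 'git@gitserver:puppet/profiles.git',
  :ref =&amp;gt; 'production'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;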

&lt;h3 id=&quot;bootstrapping-a-puppet-server&quot;&gt;Bootstrapping a Puppet Server&lt;/h3&gt;

&lt;p&gt;I manage my Puppet 3 server using Puppet, and the bootstrap process is
tricky: given a machine with just the puppet agent, how do you get the
Puppet Server + Hiera + R10K and my control repo installed in a
reproducible way? I started with the Puppetlabs &lt;a href=&quot;https://github.com/puppetlabs/control-repo.git&quot;&gt;control-repo&lt;/a&gt;
skeleton, which gave me the basics but no bootstrap. I looked through
a lot of repos and finally found the &lt;a href=&quot;https://github.com/puppetinabox/controlrepo&quot;&gt;puppetinabox control-repo&lt;/a&gt;
by &lt;a href=&quot;https://rnelson0.com/&quot;&gt;rnelson0&lt;/a&gt;. This repo uses a script to install the bootstrap
modules locally and puppet apply with some simple puppet manifests to
do the bootstrap. I decided to use this approach as well.&lt;/p&gt;

&lt;h3 id=&quot;a-puppet-module-to-manage-puppet&quot;&gt;A Puppet module to manage Puppet&lt;/h3&gt;

&lt;p&gt;The next step was to choose a module to manage the Puppet server. I
reviewed many, but most had crazy dependencies or didn’t support the
way I wanted to configure my systems. I ended up using
the &lt;a href=&quot;https://github.com/theforeman/puppet-puppet&quot;&gt;puppet-puppet&lt;/a&gt; module maintained by the &lt;a href=&quot;https://theforeman.org/&quot;&gt;Foreman&lt;/a&gt;
team. It is a big module, but it supports:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Puppet agent run using cron&lt;/li&gt;
  &lt;li&gt;Puppet server setup on Ubuntu 16.04&lt;/li&gt;
  &lt;li&gt;Compatible with Puppetdb and r10k&lt;/li&gt;
  &lt;li&gt;Foreman integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I had added Foreman integration to my Puppet 3 module, so having it
built in was interesting to me.&lt;/p&gt;

&lt;h3 id=&quot;scripting-the-bootstrap&quot;&gt;Scripting the bootstrap&lt;/h3&gt;

&lt;p&gt;The bootstrap script does the following (a condensed sketch follows the list):&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Make sure git is installed&lt;/li&gt;
  &lt;li&gt;Clone all the required modules into a bootstrap directory. I make
internal git mirrors of all the puppet modules I use.&lt;/li&gt;
  &lt;li&gt;Run puppet apply using 3 manifests to install puppet server, hiera
and r10k.&lt;/li&gt;
  &lt;li&gt;Run r10k deploy to generate the local production environment.&lt;/li&gt;
&lt;/ol&gt;
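
&lt;p&gt;A condensed sketch of the script (paths and module names are
illustrative, not my exact setup):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#!/bin/bash
# bootstrap.sh sketch
set -e

# 1. make sure git is installed
command -v git &amp;gt;/dev/null || sudo apt-get install -y git

# 2. clone the required modules from internal mirrors
for mod in puppet hiera r10k; do
    git clone &quot;git://gitmirror/puppet-${mod}.git&quot; &quot;bootstrap/${mod}&quot;
done

# 3. apply the bootstrap manifests with the local modules
for manifest in puppetserver.pp hiera.pp r10k.pp; do
    sudo puppet apply --modulepath=bootstrap &quot;manifests/${manifest}&quot;
done

# 4. generate the local production environment
sudo r10k deploy environment production -pv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;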

&lt;p&gt;I now had a server setup and could start creating roles and profiles
to manage the server.&lt;/p&gt;

&lt;h3 id=&quot;r10k-webhook&quot;&gt;R10K Webhook&lt;/h3&gt;

&lt;p&gt;Redundancy in infrastructure is good, and having two ways to synchronize
the environments on the masters is also a good idea. I could use
mcollective, but that hasn’t been set up yet. The &lt;a href=&quot;https://github.com/voxpupuli/puppet-r10k&quot;&gt;puppet-r10k&lt;/a&gt;
module comes with a webhook. This webhook is a small ruby sinatra
application that listens for http connections and triggers r10k
commands as appropriate. It supports GitHub, GitLab, Bitbucket,
etc., but I don’t need those. Since I am using a local gitolite server,
I created a git post-receive hook that calls curl with the updated
branch:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;curl -d &quot;{ \&quot;ref\&quot;: \&quot;$REFNAME\&quot; }&quot; -H &quot;Accept: application/json&quot; \
 &quot;https://puppet:puppet@$HOST:8088/payload&quot; -k -q
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;By default the webhook and r10k run as root, which is something I try
to avoid. I was able to change the user for the webhook to puppet,
chown all the r10k cache and environment dirs to the puppet user, and
everything works. It also uses the SSL certs signed by the Puppet
CA to encrypt the communication.&lt;/p&gt;

&lt;h3 id=&quot;bash-post-receive-hook-and-subshells&quot;&gt;Bash post receive hook and subshells&lt;/h3&gt;

&lt;p&gt;The only problem with this approach is that the user must wait, when
running git push, for the script to complete. I was able to run the
curl command in a subshell and have the post-receive script exit
quickly.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;( trigger_webhook &quot;$refname&quot; &amp;lt;hostname&amp;gt; ) &amp;amp;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This code will run to completion even after the parent shell has
exited. The logs of the synchronization are stored on the puppet
master. They could be stored on the git server as well, but that isn’t
necessary.&lt;/p&gt;
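
&lt;p&gt;Putting the two pieces together, the post-receive hook ends up along
these lines (a sketch; the host name and log path are placeholders):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#!/bin/bash
# post-receive sketch: fast push, asynchronous r10k deploy

trigger_webhook() {
    local refname=$1 host=$2
    curl -d &quot;{ \&quot;ref\&quot;: \&quot;${refname}\&quot; }&quot; -H &quot;Accept: application/json&quot; \
         &quot;https://puppet:puppet@${host}:8088/payload&quot; -k -q \
         &amp;gt;&amp;gt; /var/log/r10k-webhook.log 2&amp;gt;&amp;amp;1
}

# git feeds &quot;oldrev newrev refname&quot; on stdin for each updated ref
while read -r oldrev newrev refname; do
    # detach so 'git push' returns immediately
    ( trigger_webhook &quot;$refname&quot; puppetmaster.example.com ) &amp;amp;
done
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;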

&lt;h3 id=&quot;next-steps&quot;&gt;Next steps&lt;/h3&gt;

&lt;p&gt;Install MCollective Choria and start porting the base configuration
with ntp, ssh keys, package management, etc. to Puppet 4.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Book Review: 'The War of Art' by Steven Pressfield</title>
   <link href="https://kscherer.github.io//review/2016/10/24/book-review-the-war-of-art-by-steven-pressfield"/>
   <updated>2016-10-24T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//review/2016/10/24/book-review-the-war-of-art-by-steven-pressfield</id>
   <content type="html">&lt;p&gt;I found the format strange for a book because it felt like a
collection of short blog posts. The writing was very repetitive and
with all the other books I have read by authors like Seth Godin, the
message didn’t feel new or motivating. I kept reading hoping for some
insight or inspiring wording but I finished the book feeling
disappointed. Almost every day the daily blog post by Seth Godin
resonates with me and/or inspires me, but this book did not.&lt;/p&gt;

&lt;p&gt;Rating: Not recommended&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Book Review: 'The Time Paradox' by Philip Zimbardo and John Boyd</title>
   <link href="https://kscherer.github.io//review/2016/10/24/book-review-the-time-paradox-by-philip-zimbardo-and-john-boyd"/>
   <updated>2016-10-24T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//review/2016/10/24/book-review-the-time-paradox-by-philip-zimbardo-and-john-boyd</id>
   <content type="html">&lt;p&gt;“The Time Paradox” was another highly rated book by Deric Sivers. The
central message of this book also falls into that category of “obvious
now except it wasn’t before”. The premise is very ambitious because it
attempts to universally categorize people and their behaviors across
six dimensions. The usual categories of race, religion, age, gender,
education, etc. are flawed, but this book identifies time as the
universal experience of all humans. Every human has a unique
perspective on the past, present and future we all inhabit. The book
breaks the time perspectives into past positive, past negative,
present-hedonistic, present-fatalistic, future and transcendental future.&lt;/p&gt;

&lt;p&gt;Past Positive: Views past events in a positive way. Finds the good in
the way things happened.&lt;/p&gt;

&lt;p&gt;Past Negative: Views past events in a negative way. Finds the bad in
the way things happened.&lt;/p&gt;

&lt;p&gt;Present Hedonistic: In the moment and focused on the pleasures of the
now.&lt;/p&gt;

&lt;p&gt;Present Fatalistic: In the moment but discounting or ignoring the
risks of present actions.&lt;/p&gt;

&lt;p&gt;Future: Planning for things that will happen to you later than
now. Delaying gratification and waiting for the larger reward that
will come later.&lt;/p&gt;

&lt;p&gt;Transcendental Future: Planning for things that will happen after the
life of that person has ended. This can be spiritual concepts like the
afterlife, or non-spiritual concepts like the 10,000 year “Long Now”
Foundation or the Native American tribes considering the seventh
generation.&lt;/p&gt;

&lt;p&gt;Once these dimensions are defined, the book presents a fictional
dialog between six people, each of whom represents one time
dimension. The conversation is a little too scripted, but I recognized
people I know and their behaviors in each of the characters. Of course
no real person lives in only one time dimension. We all shift through
the different dimensions to different degrees at different times in
our lives. For example, children are very present oriented, while our
society rewards and encourages a future orientation.&lt;/p&gt;

&lt;p&gt;The authors then provide the full test that they created to determine
time orientation in their studies. I ended up skipping this section
because I was convinced I already knew that I was too future
oriented. I intend to go back and take the full test and see if there
are any surprises.&lt;/p&gt;

&lt;p&gt;The last part of the book explores the dimensions and suggests ways to
deal with and minimize the less desirable orientations like
past-negative and present-fatalistic. It also suggests ways to balance
the positive dimensions like past-positive, present and future with
specific suggestions to help future oriented people live in the moment
and help everyone find ways to reframe negative past events in
positive ways.&lt;/p&gt;

&lt;p&gt;The tricky one is the transcendental future orientation. Taking a
perspective that extends past one’s own life can be very noble, but it
can also lead to a state where choosing to be a suicide bomber is a
rational option.&lt;/p&gt;

&lt;p&gt;As stated before, I recognize that I am too future oriented and have
been exploring ways to focus on the present moment more. I am
experimenting with meditation and am trying to find more time for
activities that encourage present awareness: massage, dancing,
music, exercise, cooking and just being silly. I also recognize that
my photography hobby is a great way to encourage a past positive
orientation. The goal is to find a balance because we need all three
dimensions to be happy and feel fulfilled.&lt;/p&gt;

&lt;p&gt;Rating: Highly recommended&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Book Review: 'So Good They Cannot Ignore You' by Cal Newport</title>
   <link href="https://kscherer.github.io//review/2016/10/24/book-review-so-good-they-cannot-ignore-you-by-cal-newport"/>
   <updated>2016-10-24T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//review/2016/10/24/book-review-so-good-they-cannot-ignore-you-by-cal-newport</id>
   <content type="html">&lt;p&gt;I read regularly thanks to our local Public Library. Recently the Tim
Ferris podcast has expanded my reading list with lots of interesting
books. Luckily the local library has most of these books and hold
waiting lists tend to space the reading out well.&lt;/p&gt;

&lt;p&gt;One of the first Tim Ferris podcasts I listened to was with Derek
Sivers, and he mentioned that he maintains a list of book reviews with
ratings on his website. I immediately went to the website, looked
through the highest rated books and set up holds at the library.&lt;/p&gt;

&lt;p&gt;The first book I read was “So good they cannot ignore you” by Cal
Newport. This was a short read and the simplicity of its message
resonated with me. The basic message is that our society rewards
people with rare and valuable skills, not people with passion. Many
sources of career advice talk about “following passion”, but passion
without skills is not sufficient. If passion drives the building of
valuable skills then it is helpful. The media often portrays having
passion as the most important requirement to getting a great job but
that is harmful because it often leads to confusion and inflated
expectations. Cal calls these valuable skills “career capital” and I
really like that perspective. Career capital is skills and experience
that can be exchanged for career opportunities.&lt;/p&gt;

&lt;p&gt;The best books motivate you to make changes in your life. This book
helped me reflect on the career capital that would help me advance my
career. For a software developer that would be submitting patches to
open source projects and writing technical blog posts. I have decided
to commit 10% of my work time to doing this. I haven’t been able to
implement it fully, but I am making more of an effort. I have reported
bugs and submitted some small patches. Ideally this will grow to more
and larger patches.&lt;/p&gt;

&lt;p&gt;Rating: Highly recommended.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Book Review: 'Money: Master the Game' by Tony Robbins</title>
   <link href="https://kscherer.github.io//review/2016/10/24/book-review-money-master-the-game-by-tony-robbins"/>
   <updated>2016-10-24T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//review/2016/10/24/book-review-money-master-the-game-by-tony-robbins</id>
   <content type="html">&lt;p&gt;After listening to a Tim Ferris podcast with Tony Robbins, I decided
to read his latest book “Money: Master the Game”. I was skeptical of
Tony Robbins and his style of motivation speaking. The book has a very
informal speaking style with lots of bold text where you feel Tony is
waving his hands frantically. But the content of the book is superb if
a little long winded. I have read a few books about finance and
investing, but this was the first to take a full life time look at
saving, investing and retiring.&lt;/p&gt;

&lt;p&gt;The first part is about saving regularly, starting early and avoiding
excessive fees which are usually hidden. I feel my family is doing
well here, but making saving automatic is a good reminder that saving
is against our basic nature. Money finds ways to get spent.&lt;/p&gt;

&lt;p&gt;The next part made the case that we probably require less money
than we think to sustain the retirement lifestyle we think we
want. There were three levels of lifestyle, and each had a worksheet
that required estimates of how much we currently spend. I should have
filled out these worksheets, but I did not because I have this fantasy
of managing our finances with something like hledger, which will
provide these answers. When I imagine my retirement, it isn’t filled
with high expense activities like travelling the world on a yacht or
having a private jet. Imagining retirement is something I want to do
more of with my family. It will make this kind of planning easier.&lt;/p&gt;

&lt;p&gt;The part about investing had some real gems. I was aware of the
importance of asset allocation and having investments that are not
correlated, but the Ray Dalio “all weather” portfolio was
fascinating. It was the first time I had heard of a portfolio that had
so little downside risk with such substantial upside. I always assumed
that any investment with high return required accepting extra
risk. This is an investment strategy that outperformed the “market” or
S&amp;amp;P 500 over decades with almost no loss in capital (maximum loss was
less than 4%). The “secret” is a large allocation of long term bonds
with a small allocation in gold and commodities. The logic is that the
economy has four seasons, which are the combinations of growth and
inflation. Having assets that do well in each “season” has finally
shown me what proper diversification looks like. Since stocks and
bonds are correlated, the classic advice of stock and bond
diversification is problematic. Right now the world economy is in a
period of low inflation and low growth. When it switches (and it will)
to higher inflation and “negative” growth (what a silly term), the
standard advice for asset allocation will cause big problems.&lt;/p&gt;

&lt;p&gt;For me the action item from this has been to look very carefully at
the current asset allocation of my portfolio. Right now I am following
a very contrarian style, and my largest holdings are real return bonds,
short US equities and long commodities like gold and energy. But this
is very short term focused, and hopefully a longer focus will expose me
to less risk and volatility.&lt;/p&gt;

&lt;p&gt;The next part focused on what Tony referred to as the “back of investment
mountain”. I have spent a lot of time thinking about how to invest,
but not what to do with that investment. I assumed I would retire at
some point and spend the rest of my retirement managing my pile of
investments. The book again showed me options that I was not aware
of. Apparently there are “hybrid” annuities that provide payments for
life while growing with the equity market but with full capital
preservation! Frankly it sounds too good to be true, but I have made a
note to investigate this further. The possibility of “getting out of
the game” by having an income without having to worry about it is very
appealing. I am skeptical because I don’t understand why an insurance
company would take on this much long term risk. At the least, I think the
premium must be very high to offset the risk, but Tony insists this
plan is now available to all US citizens, and I intend to see if I can
find something similar in Canada.&lt;/p&gt;

&lt;p&gt;There is a section with interviews of some of the greatest investors
ever, like Charles Schwab, Ray Dalio, Warren Buffett, etc. This part was
nice, but did not contain helpful specific advice that wasn’t
mentioned in other parts of the book. There was one brief mention of
technical trading, which I found strange because it actually goes
against most of the advice in the book. Technical trading assumes that
past stock price behavior can be used to predict the future stock
price. It is true that some people have become very rich that way, but
to me it is too much like gambling, without any acknowledgment that the
stock represents a company or group of companies with assets and
revenue and people. On the other hand, technical trading increases
volatility, which can be useful to contrarian investors like myself.&lt;/p&gt;

&lt;p&gt;The last chapter is about the power of giving. I am very motivated to
give my time and energy to my family and friends, but the giving of
money is complicated. All things being equal, I would like any
donations to do the most good possible. Even defining what I mean by
good is difficult: less suffering? more education? more opportunity?
more equality? less disease? Maybe whatever makes me feel the happiest
is the simplest approach, and I need to accept that it will probably
not be the most efficient. If my money isn’t making me happy, why even
bother working so hard to accumulate it in the first place? My action
item is to manage our finances better and look for ways to give more
money in ways that will make me and my family happier.&lt;/p&gt;

&lt;p&gt;I learned a lot from this book. It has also changed my opinion of Tony
Robbins. The book was a real gift to me and I am planning my future
differently because of it.&lt;/p&gt;

&lt;p&gt;Rating: Highly Recommended&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Running mesos agents over an unreliable network</title>
   <link href="https://kscherer.github.io//mesos/2016/07/12/running-mesos-agents-over-unreliable-network"/>
   <updated>2016-07-12T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//mesos/2016/07/12/running-mesos-agents-over-unreliable-network</id>
   <content type="html">&lt;p&gt;I have mesos agents located in three datacenters with a usually
reliable WAN connection. Occasionally though all the running tasks in
a DC get killed and it gets traced back to a WAN connection
interruption.&lt;/p&gt;

&lt;p&gt;This hasn’t been a big problem until recently, when a failover link
had high enough latency that the agents would disconnect and kill all
running tasks approximately every half hour for about 12 hours. I tried to
figure out which configuration options needed to be tweaked for the
master and agents to wait longer before killing tasks, and this is what
I came up with.&lt;/p&gt;

&lt;p&gt;Current setup:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Three DCs: DC1 (central), DC2 and DC3&lt;/li&gt;
  &lt;li&gt;Mesos 0.27.2 with a custom python scheduler&lt;/li&gt;
  &lt;li&gt;3 node Zookeeper 3.4.5 cluster in DC1 with 3 HA mesos masters&lt;/li&gt;
  &lt;li&gt;Zookeeper observer nodes in DC2 and DC3&lt;/li&gt;
  &lt;li&gt;Agents in DC2 connect to the Zookeeper observer in DC2&lt;/li&gt;
&lt;/ul&gt;
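
&lt;p&gt;For reference, the observers are declared in zoo.cfg roughly like this
(hostnames are illustrative):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# zoo.cfg sketch: 3 node ensemble in DC1 plus observers in DC2/DC3
server.1=zk1-dc1:2888:3888
server.2=zk2-dc1:2888:3888
server.3=zk3-dc1:2888:3888
server.4=zk-dc2:2888:3888:observer
server.5=zk-dc3:2888:3888:observer

# and on the observer nodes themselves:
peerType=observer
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;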

&lt;p&gt;From my research there are several timeouts that are at play here:&lt;/p&gt;

&lt;p&gt;1) Zookeeper &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ticktime&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;synclimit&lt;/code&gt;. Unfortunately the zookeeper
&lt;a href=&quot;https://issues.apache.org/jira/browse/ZOOKEEPER-1607&quot;&gt;read-only observer feature&lt;/a&gt; is not available yet, so when the
observer loses its connection it drops the connections to the agents. There
isn’t an agent &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zk_session_timeout&lt;/code&gt; configuration option, but it looks
like the agent force-expires the zk session after 10 sec (the master
default). If zk reconnects in less than 10 sec the session still
expires, but the master is detected and everything works.&lt;/p&gt;

&lt;p&gt;2) Mesos master &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;agent_ping_timeout&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_agent_ping_timeout&lt;/code&gt;. The
master shuts down the agent after this timeout (75 sec by
default). This causes the agent to restart and kill all running tasks.&lt;/p&gt;

&lt;p&gt;3) &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;agent_reregister_timeout&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_agent_reregister_timeout&lt;/code&gt;. If there
was a master failover during a WAN outage, then this timeout may be
triggered. But the default is 10 min, so that shouldn’t be a problem.&lt;/p&gt;

&lt;p&gt;Here are my conclusions for my setup. Please let me know if I missed anything.&lt;/p&gt;

&lt;p&gt;1) Since the ZK observers in DC2 and DC3 do not affect the main ZK cluster
when disconnected, changing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ticktime&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;synclimit&lt;/code&gt; is not necessary.&lt;/p&gt;

&lt;p&gt;2) Increase &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_agent_ping_timeout&lt;/code&gt; on the masters so that
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(agent_ping_timeout * max_agent_ping_timeout)&lt;/code&gt; is longer than most
WAN outages. In my case most outages are less than 10 mins, so I am
trying &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_agent_ping_timeout&lt;/code&gt; = 40. This means I do not need to
increase the reregister timeout. Unfortunately &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_agent_ping_timeout&lt;/code&gt; is
a global configuration, and I cannot set this value differently for
agents in the different DCs.&lt;/p&gt;
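
&lt;p&gt;Concretely, with the default 15 second ping interval that works out to
15 * 40 = 600 seconds, i.e. 10 minutes, before a disconnected agent is
shut down. A sketch of the master invocation (flag names as used in this
post; check &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mesos-master --help&lt;/code&gt; for the exact spelling in your version):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mesos-master --zk=zk://zk1-dc1:2181/mesos --quorum=2 \
             --agent_ping_timeout=15secs \
             --max_agent_ping_timeouts=40
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;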

</content>
 </entry>
 
 <entry>
   <title>Dell FX2 and Intel X710 nics</title>
   <link href="https://kscherer.github.io//linux/2016/06/21/dell_fx2_and_intel_x710"/>
   <updated>2016-06-21T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//linux/2016/06/21/dell_fx2_and_intel_x710</id>
   <content type="html">&lt;p&gt;What follows is an attempt to document a 6 month long debugging
odyssey. This is easily the strangest computer behavior I have ever
debugged or tried to understand.&lt;/p&gt;

&lt;h3 id=&quot;background&quot;&gt;Background&lt;/h3&gt;

&lt;p&gt;I manage a cluster of bare metal servers used for coverage build
testing of Wind River Linux. The collection of git repos alone is over
15GB, and the resulting IO traffic is high enough that using a public
cloud is not cost effective. The current sweet spot for price to
build performance to rack space is a chassis that squeezes 4 blade
servers into 2U. We have a bunch of the Dell C6220 series
servers, and then I decided to try the Dell FX2 chassis. The selling
points for me were the full Dell iDRAC and the network IO aggregator
system. The IO module theoretically would allow me to network 4 X
10GbE per system (160 GbE total) to a redundant switch pair providing
80 GbE uplink capability. We already had a good experience with the
M-8024K module for the M1000e chassis.&lt;/p&gt;

&lt;h3 id=&quot;hardware-setup&quot;&gt;Hardware setup&lt;/h3&gt;

&lt;p&gt;The first chassis was installed and networked. The IO modules were
set up in the same way as the M-8024K, which is as a VLAN access port. The
network was configured as VLAN 105, but with the access port this
detail is hidden from the systems. The main reason is that the RedHat
and Debian installers do not support VLAN configuration of the network
devices, so all my machines have this configuration, which allows me to
use Foreman for automated PXE installs.&lt;/p&gt;

&lt;h3 id=&quot;using-newest-ubuntu-installer&quot;&gt;Using newest Ubuntu installer&lt;/h3&gt;

&lt;p&gt;Things were finally ready for me in January 2016. I started the PXE
install of Ubuntu 14.04. This failed because the kernel drivers for
the X710 nic were only integrated in Linux 4.2 and the Ubuntu
installer with the 3.13 kernel could not detect the X710 nic.&lt;/p&gt;

&lt;p&gt;Luckily Ubuntu rebuilds the 14.04 installer image with the 15.10
kernel. I switched to the newest version of the installer, and it
was able to detect the X710 nic.&lt;/p&gt;

&lt;h3 id=&quot;the-first-hiccup&quot;&gt;The first hiccup&lt;/h3&gt;

&lt;p&gt;This time the kernel and initrd were downloaded, the initial DHCP
succeeded but then DNS lookup to download the preseed failed. This was
strange but not unheard of. It had happened a long time ago but I
hadn’t seen it in years. I was quick to blame our Microsoft DNS
servers and replaced all DNS names in the preseed with IP
addresses. This allowed the installation to complete and the machine
booted Ubuntu as usual. Then things started to get really
strange. DHCP on boot would occasionally fail and then I noticed that
DNS would occasionally time out and then succeed right
afterwards. This made using programs like Puppet impossible.&lt;/p&gt;

&lt;p&gt;I completed the install of the other 3 servers and noticed that
occasionally DHCP would fail during the install process. This was
mystifying to me because the PXE boot process uses the same DHCP setup
to download the kernel and initrd, and I never saw it fail.&lt;/p&gt;

&lt;p&gt;I checked the DHCP server, and the server logs showed that it
received the request and sent the offer back to the host, but
that offer was never received. Running ethtool did not show any
dropped or corrupted packets reported by the nic.&lt;/p&gt;

&lt;p&gt;After the installation of the remaining 3 servers in the chassis was
complete, I opened a support case with Dell.&lt;/p&gt;

&lt;p&gt;Initial setup - Approx two weeks:&lt;/p&gt;

&lt;p&gt;TOR access port config, IOA VLAN 1 untagged, Hosts untagged = problem&lt;/p&gt;

&lt;h3 id=&quot;round-one---tor-switch-config&quot;&gt;Round one - TOR switch config&lt;/h3&gt;

&lt;p&gt;The configuration of the TOR Cisco switch connecting to the IOA was
the subject of the first round of debugging. The IOA comes by default
in a “no touch” configuration, and it made sense to verify the
setup of the TOR switch. It took a few weeks to get together all the
people involved: myself, on site IT, IT networking, Dell tech support
and a Dell networking specialist. After many hours, the TOR switch was
changed from an access port to a VLAN 105 tagged port. This resulted
in all traffic being dropped until the IOA was changed to make the
105 VLAN untagged. But the DNS/DHCP problem persisted.&lt;/p&gt;

&lt;p&gt;Round One - Approx one month&lt;/p&gt;

&lt;p&gt;TOR VLAN 105, IOA VLAN 105 untagged, Hosts untagged = problem&lt;/p&gt;

&lt;h3 id=&quot;round-two---internal-reproducer&quot;&gt;Round Two - Internal reproducer&lt;/h3&gt;

&lt;p&gt;Moving up levels of Dell support always takes time. While waiting for
networking support to become available, I started experimenting. I
wanted to see if I could reproduce the problem without involving the
TOR switch, so I set up dnsmasq on blade #1 as a dns caching proxy. I
then added a fake host entry to /etc/hosts so I could be sure that
dnsmasq was being queried, and started running nslookup queries on
blade #2. To my surprise, I was able to reproduce the problem even
with the network traffic completely internal to the FX2 chassis.&lt;/p&gt;
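
&lt;p&gt;The reproducer boils down to something like this (a sketch; names and
addresses are made up):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# blade #1: dnsmasq as a caching proxy, with a fake entry in /etc/hosts
echo &quot;10.0.0.99 fakehost.example.com&quot; | sudo tee -a /etc/hosts
sudo dnsmasq --no-daemon --log-queries

# blade #2: hammer the proxy and watch for timeouts
while true; do
    nslookup fakehost.example.com blade1 || echo &quot;FAIL $(date)&quot;
    sleep 1
done
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;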

&lt;p&gt;Round Two - Approx one month&lt;/p&gt;

&lt;p&gt;IOA VLAN 1 untagged, Hosts untagged = problem&lt;/p&gt;

&lt;h3 id=&quot;round-three---a-solution&quot;&gt;Round Three - A solution?&lt;/h3&gt;

&lt;p&gt;Then I decided to investigate whether VLAN tagging at the Linux host level
would change things. I PXE booted blade #3 with the IOA configured as
VLAN 105 untagged, and when DHCP failed, I switched the IOA to VLAN 1
untagged and used the secondary install console to change the network
config from em1 to em1.105. I was able to complete the install and boot the
machine.&lt;/p&gt;

&lt;p&gt;Amazingly the DHCP/DNS problems went away! It took some time to fix my
Puppet configuration to work with the VLAN tagging and get everything
working. I was also able to demonstrate, using blades #1 and #2, that the
problem was present with VLAN 1 untagged and hosts untagged, and not
present with VLAN 1 untagged and the linux hosts configured for VLAN 105.&lt;/p&gt;
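
&lt;p&gt;For reference, the host side of the workaround is just a VLAN
subinterface in /etc/network/interfaces (a sketch; needs the vlan
package installed):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# tag VLAN 105 on the host instead of relying on the IOA
auto em1.105
iface em1.105 inet dhcp
    vlan-raw-device em1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;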

&lt;p&gt;Round Three - Approx one month&lt;/p&gt;

&lt;p&gt;IOA VLAN 1 untagged, Hosts tagged 105 = no problem!&lt;/p&gt;

&lt;h3 id=&quot;round-four---debugging-the-ioa&quot;&gt;Round Four - Debugging the IOA&lt;/h3&gt;

&lt;p&gt;Now the focus turned exclusively to the IOA. With the help of Dell
network support, we disabled the outbound ports of the IOA and ran
tcpdump on the hosts. We were able to see packets being sent from the
DNS “server” and not being received by the client about 25% of the
time. About 5-10% of the time the initial DNS query would not even
make it to the DNS server.&lt;/p&gt;

&lt;p&gt;It was around this time that a second FX2 chassis with identical
hardware arrived, but with a newer IOA firmware version. Full of hope,
I did a PXE install on a blade, only to hit the exact same problem.&lt;/p&gt;

&lt;p&gt;The Dell networking support team attempted to reproduce the problem in
their internal lab, but even with an FX2 chassis and an Ubuntu 14.04 install
on an FC630 with the X710 nic, they were unable to reproduce the
problem. To ensure the systems were configured identically, we went
through the entire BIOS setup line by line to compare. I even tried
installs using UEFI and “Legacy BIOS” modes with no change in behavior.&lt;/p&gt;

&lt;p&gt;I then got a crash course in F10 network configuration. It took a
while to find the proper command line incantations, but we set up
counters on the various ports to count incoming and outgoing
packets. We set up fixed ARP entries and tried to reduce the network
traffic as much as possible. Unfortunately the outgoing port counters
did not work, but from the incoming counters it looked like the IOA
was not seeing the packets come in on the interface.&lt;/p&gt;

&lt;p&gt;Round Four - Approx two months&lt;/p&gt;

&lt;p&gt;IOA functioning as designed.&lt;/p&gt;

&lt;p&gt;Bonus: I learned how to use the Dell iDRAC virtual media feature to
transfer files to and from a system without network access.&lt;/p&gt;

&lt;h3 id=&quot;round-five---debugging-the-x710-nic&quot;&gt;Round Five - Debugging the X710 nic&lt;/h3&gt;

&lt;p&gt;Now another Dell Linux support tech was brought in, and he confirmed
that the Linux config was correct. We then tried a firmware upgrade
for the X710 nic. This involved a failed upgrade attempt using an ISO
upgrade package (which only works in Legacy BIOS mode), a DRAC upgrade with
HTML5 support, and finally using the iDRAC upgrade functionality to
upgrade the NIC firmware.&lt;/p&gt;

&lt;p&gt;Unfortunately the firmware upgrade made things even worse!! DHCP
worked but I could not ping inside the chassis. To make things even
more bizarre, ARP would occasionally work but ping would not!&lt;/p&gt;

&lt;p&gt;At this point, we decided to replace the Intel X710 nic with the
Broadcom BCM57840 nic with a similar feature set to see how/if the
problem changed.&lt;/p&gt;

&lt;p&gt;Round Five - Approx one month&lt;/p&gt;

&lt;p&gt;Several failed firmware upgrades and violations of the laws of
networking.&lt;/p&gt;

&lt;h3 id=&quot;round-six---something-goes-right-for-a-change&quot;&gt;Round Six - Something goes right for a change&lt;/h3&gt;

&lt;p&gt;A technician swapped out the Intel nics for Broadcom nics. I redid
another PXE install (luckily it is completely automated) and
everything worked as expected! No DHCP/DNS errors or any hint of
strange behavior.&lt;/p&gt;

&lt;p&gt;We finally had a solution and the remaining Intel X710 nics were
swapped out over a few weeks.&lt;/p&gt;

&lt;p&gt;Final setup:&lt;/p&gt;

&lt;p&gt;TOR 105 access port, IOA VLAN 1 untagged default config, Linux host untagged.&lt;/p&gt;

&lt;h3 id=&quot;recap&quot;&gt;Recap&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Bug not reproducible by support&lt;/li&gt;
  &lt;li&gt;Intermittent dropping of UDP packets with no connection to non-Dell
hardware&lt;/li&gt;
  &lt;li&gt;Enabling VLAN tagging on the host “solved” the problem&lt;/li&gt;
  &lt;li&gt;Incorrect hardware counters&lt;/li&gt;
  &lt;li&gt;Firmware upgrades make things worse&lt;/li&gt;
  &lt;li&gt;Debugging requires coordination of at least 3 teams&lt;/li&gt;
  &lt;li&gt;Root cause never determined&lt;/li&gt;
  &lt;li&gt;Everyone involved agreed it was one of the strangest problems they
have ever debugged&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;number-of-people-involved&quot;&gt;Number of people involved&lt;/h3&gt;

&lt;p&gt;At Wind River: myself, IT and IT networking&lt;/p&gt;

&lt;p&gt;At Dell: 2 tech support, 2 networking support, 1 Linux support, 2
managers&lt;/p&gt;

&lt;p&gt;Total time consumed: approx 2-3 man months over 6 months of calendar
time.&lt;/p&gt;

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;The reality is that Dell shipped us something broken. The open
question is whether testing could have found this problem before the
hardware shipped. Dell was unable to reproduce the problem internally
and without knowing the root cause of the problem, I can only
speculate.&lt;/p&gt;

&lt;p&gt;Ideally I would like to know the cause of the problem, know that it
was fixed and that no one else will have to suffer through this. But
that would be the fairy tale ending and life doesn’t work that
way. The case is considered closed and I will get back to all the
tasks I had to put on hold for this.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Python Packaging with make and pex</title>
   <link href="https://kscherer.github.io//python/2016/06/03/python-packaging-with-make-and-pex"/>
   <updated>2016-06-03T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//python/2016/06/03/python-packaging-with-make-and-pex</id>
   <content type="html">&lt;p&gt;As it often happens in the life of a professional programmer, a small
python script had grown into a large script and needed to be split
apart and properly packaged. Most of my experience with python had
been with small scripts. I had tried before to understand the python
packaging ecosystem but always got confused by the combinations of
tools and formats.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Python development tools like virtualenv and pip&lt;/li&gt;
  &lt;li&gt;Code distributed in eggs and/or wheels&lt;/li&gt;
  &lt;li&gt;Packages installed using easy_install and/or pip&lt;/li&gt;
  &lt;li&gt;Python packaging tools like setuptools and distutils&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There seemed to be at least two different tools that did almost the
same thing, but neither had good documentation. I did find some decent
blog posts like &lt;a href=&quot;http://jeffknupp.com/blog/2013/08/16/open-sourcing-a-python-project-the-right-way/&quot;&gt;Open Sourcing a Python Project the Right Way&lt;/a&gt; but
there were still workflow steps that I needed to figure out. In the
past I was able to avoid figuring it out, but this time was different
because my “small” script had grown to over 1000 lines of python and
there was no way to avoid it.&lt;/p&gt;

&lt;p&gt;I had an informal set of requirements:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;No root access should be required. Python supports local
installation and virtualenv&lt;/li&gt;
  &lt;li&gt;Bootstrap a development environment quickly&lt;/li&gt;
  &lt;li&gt;The development setup should be self contained and not affect any
other part of the machine&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;My research took me all over the web, but one of the most important
pieces of inspiration was this small post on
&lt;a href=&quot;http://blog.bottlepy.org/2012/07/16/virtualenv-and-makefiles.html&quot;&gt;Virtualenv and Makefiles&lt;/a&gt;. I was also inspired by &lt;a href=&quot;https://pex.readthedocs.io/en/stable/&quot;&gt;Pex&lt;/a&gt; which
provided a way to bundle all the python pieces together into a single
self extracting package.&lt;/p&gt;

&lt;p&gt;It took a few days but I was able to combine make, mkvirtualenv, pip
and pex to implement a nice workflow. The &lt;a href=&quot;https://github.com/kscherer/wraxl-scheduler/blob/master/Makefile&quot;&gt;Makefile&lt;/a&gt; will (a condensed sketch follows the list):&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Install pip into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$HOME/.local/bin/pip&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Use local pip to install &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;virtualenv&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;virtualenvwrapper&lt;/code&gt; into
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$HOME/.local/bin&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Create a per project virtualenv for the project and install all the
development dependencies like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pylint&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;flake8&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pex&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Check if required development packages are installed. Some python
packages have C extensions and require a compiler and header
files.&lt;/li&gt;
  &lt;li&gt;Runs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;python setup.py develop&lt;/code&gt; which installs the package
dependencies like yaml and redis. This step also adds the package
to the virtualenv and can be used if development is spread across
multiple git repositories.&lt;/li&gt;
  &lt;li&gt;Uses &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;python setup.py bdist_pex&lt;/code&gt; to build the pex file&lt;/li&gt;
&lt;/ol&gt;
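
&lt;p&gt;A condensed sketch of the structure (names are illustrative; recipe
lines must be indented with tabs):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Makefile sketch
VENV := $(HOME)/.virtualenvs/myproject

$(VENV):
	pip install --user virtualenv virtualenvwrapper
	virtualenv $(VENV)
	$(VENV)/bin/pip install pylint flake8 pex

develop: $(VENV)
	$(VENV)/bin/python setup.py develop

myproject.pex: $(VENV) $(wildcard myproject/*.py)
	$(VENV)/bin/python setup.py bdist_pex

clean:
	rm -rf build dist *.egg-info
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;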

&lt;p&gt;Other nice touches:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The source py files are dependencies of the pex target, so editing
a file causes the pex file to be rebuilt. Pattern support in Make
simplifies this step&lt;/li&gt;
  &lt;li&gt;Has &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make help&lt;/code&gt; which reads comments embedded in the Makefile to
generate nice help output&lt;/li&gt;
  &lt;li&gt;Has &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make clean&lt;/code&gt; for easy cleanup&lt;/li&gt;
  &lt;li&gt;Each make step loads the proper virtualenv, so the developer does
not even have to activate the virtualenv manually.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Some annoyances:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Pex does not pick up local python file changes unless I delete the
egg file in the pex build dir.&lt;/li&gt;
  &lt;li&gt;To keep timestamps in order, sometimes it is necessary to touch
certain files.&lt;/li&gt;
  &lt;li&gt;I had to create a .check file to prevent the system package
checking from running every build&lt;/li&gt;
  &lt;li&gt;Dependent on Pypi being available, though pip does cache downloads
locally&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The last step was to integrate the pex file into a docker image. If
the package does not contain dependencies on system libraries, the
Alpine Linux Python docker images can be used as a base. Unfortunately
the python mesos.native packages I am using have dependencies on
libraries like libsasl, so I could not use Alpine Linux. But I was
able to use the base Ubuntu image and only needed to install a few
libraries, which made the image much smaller than before.&lt;/p&gt;

&lt;p&gt;I noticed that the pex file is unpacked into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PEX_ROOT&lt;/code&gt;, which is under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$HOME&lt;/code&gt;
by default. The last tweak I made was to ensure that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PEX_ROOT&lt;/code&gt; was a
docker volume to avoid the overhead of writing to the union
filesystem. This isn’t strictly necessary, but I try to work as if the
docker image is effectively read-only.&lt;/p&gt;
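
&lt;p&gt;The resulting Dockerfile is short; a sketch (image, package and file
names are illustrative):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;FROM ubuntu:14.04

# only the shared libraries the native bindings need
RUN apt-get update &amp;amp;&amp;amp; \
    apt-get install -y libsasl2-2 libcurl3 &amp;amp;&amp;amp; \
    apt-get clean &amp;amp;&amp;amp; rm -rf /var/lib/apt/lists/*

COPY dist/myproject.pex /usr/local/bin/myproject

# unpack the pex into a volume instead of the union filesystem
ENV PEX_ROOT /pex
VOLUME /pex

CMD [&quot;/usr/local/bin/myproject&quot;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;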

&lt;p&gt;I have already reused this Makefile structure for other python
projects. I was pleasantly surprised when a colleague of mine was able
to clone the project and rebuild the docker image without any
intervention.&lt;/p&gt;

&lt;p&gt;I am now able to focus on refactoring and developing the project. The
packaging part is solved in a clean way that can easily be shared with
others.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Docker Daemon and Systemd</title>
   <link href="https://kscherer.github.io//docker/2016/02/29/docker-and-systemd"/>
   <updated>2016-02-29T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//docker/2016/02/29/docker-and-systemd</id>
   <content type="html">&lt;p&gt;I recently read an article on &lt;a href=&quot;http://lwn.net/&quot;&gt;LWN&lt;/a&gt; about &lt;a href=&quot;http://lwn.net/Articles/676831/&quot;&gt;Systemd vs Docker&lt;/a&gt;
and I was disappointed. As far as I am concerned, this is preventing
one of the worst design flaws in Docker from being addressed. Docker
CEO Solomon Hykes also thinks this should be resolved, though
&lt;a href=&quot;https://github.com/docker/docker/issues/2658&quot;&gt;Issue #2658&lt;/a&gt; has remained open since Nov 2013.&lt;/p&gt;

&lt;p&gt;The current Docker design sets up all containers as children of the
Docker daemon process. The consequence of this is that upgrading the
daemon requires stopping/killing all the containers. Other
operations, like changing the daemon command line, also require stopping all
the containers. I have to be extra careful with my Puppet
configuration because any change to the config files will restart the
docker daemon. To prevent inadvertent restarts, I had to remove the
normal configuration-to-service dependency which normally restarts the
daemon when the configuration changes.&lt;/p&gt;
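
&lt;p&gt;In Puppet terms that means managing the file without the usual notify;
a sketch (resource names are illustrative):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# manage the config but never bounce the daemon automatically
file { '/etc/default/docker':
  ensure  =&amp;gt; file,
  content =&amp;gt; template('docker/default.erb'),
  # deliberately omitted: notify =&amp;gt; Service['docker'],
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;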

&lt;p&gt;From an operational perspective this is a pain. It represents another
in a long line of software that requires significant operational
resources to deploy properly. If the operator is lucky, the
containerized application can be managed with load balancers or DNS
rotation. If the service cannot work this way or the Ops team cannot
build the required infrastructure, then upgrades mean downtime. With
VMs it is possible to move the application to another machine, but
CRIU isn’t ready yet. These “solutions” require large amounts of
operational effort. I built a rolling upgrade system around Ansible to
handle docker upgrades.&lt;/p&gt;

&lt;p&gt;My experience with &lt;a href=&quot;http://mesos.apache.org/&quot;&gt;Mesos&lt;/a&gt; has been very different. The Mesos team has a
&lt;a href=&quot;http://mesos.apache.org/documentation/latest/upgrades/&quot;&gt;supported upgrade path&lt;/a&gt; with lots of testing. I have upgraded at least
5 releases of Mesos without issues or any downtime.&lt;/p&gt;

&lt;p&gt;What does this have to do with systemd? In order to support seamless
upgrades of the docker daemon, the ownership of the container
processes will have to be shared with some other process. This could
be another daemon, but the init system is an obvious choice. If the
docker daemon co-operated with another daemon or systemd by sharing
ownership of the processes, then a nice upgrade path could be
developed.&lt;/p&gt;

&lt;p&gt;The Docker team is working on containerd and has stated that RunC
would be integrated, and this may be where better integration with an init
system becomes possible. I realize this is selfish, but for me all
these squabbles are just distracting developers from addressing one of
my major pain points with using Docker.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Benchmarking docker storage backends</title>
   <link href="https://kscherer.github.io//docker/2015/07/09/benchmarking-docker-storage-backends"/>
   <updated>2015-07-09T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//docker/2015/07/09/benchmarking-docker-storage-backends</id>
   <content type="html">&lt;p&gt;I am using docker simulate building Wind River Linux (which is based
on OE-Core and Poky) on different hosts. The actual build is done on a
bind mount outside of the container so I did not expect the storage
backend to affect performance, but it did.&lt;/p&gt;

&lt;p&gt;See &lt;a href=&quot;https://github.com/docker/docker/issues/2891&quot;&gt;Docker Issue #2891&lt;/a&gt; for full history.&lt;/p&gt;

&lt;h3 id=&quot;setup&quot;&gt;Setup&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;docker 1.7&lt;/li&gt;
  &lt;li&gt;Ubuntu 14.04.2&lt;/li&gt;
  &lt;li&gt;Vivid kernel 3.19.0-21-generic&lt;/li&gt;
  &lt;li&gt;Dual 6C Xeon with 64GB RAM and 100GB root SSD and dual 3TB RAID0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using the following Dockerfile:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;FROM ubuntu:14.04.2

MAINTAINER Konrad Scherer &amp;lt;Konrad.Scherer@windriver.com&amp;gt;

RUN useradd --home-dir /home/wrlbuild --uid 1000 --gid 100 --shell /bin/bash wrlbuild &amp;amp;&amp;amp; \
    echo &quot;wrlbuild ALL=(ALL) NOPASSWD: ALL&quot; &amp;gt;&amp;gt; /etc/sudoers

RUN dpkg --add-architecture i386 &amp;amp;&amp;amp; \
    apt-get update &amp;amp;&amp;amp; \
    DEBIAN_FRONTEND=noninteractive apt-get -qy install --no-install-recommends \
    libc6:i386 libc6-dev-i386 libncurses5:i386 texi2html chrpath \
    diffstat subversion libgl1-mesa-dev libglu1-mesa-dev libsdl1.2-dev \
    texinfo gawk gcc gcc-multilib help2man g++ git-core python-gtk2 bash \
    diffutils xz-utils make file screen sudo wget time patch &amp;amp;&amp;amp; \
    apt-get clean &amp;amp;&amp;amp; \
    rm -rf /var/lib/apt/lists/* &amp;amp;&amp;amp; \
    rm -rf /usr/share/man &amp;amp;&amp;amp; \
    rm -rf /usr/share/doc &amp;amp;&amp;amp; \
    rm -rf /usr/share/grub2 &amp;amp;&amp;amp; \
    rm -rf /usr/share/texmf/fonts &amp;amp;&amp;amp; \
    rm -rf /usr/share/texmf/doc

USER wrlbuild

CMD [&quot;/bin/bash&quot;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Building poky (fido release, core-image-minimal) on an ext4 bind mount
with the docker image using different storage backends.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cd &amp;lt;buildarea&amp;gt;
mkdir downloads
git clone --branch fido git://git.yoctoproject.org/poky
source poky/oe-init-build-env mybuild
ln -s ../downloads .
sed -i 's/#MACHINE ?= &quot;qemux86-64&quot;/MACHINE ?= &quot;qemux86-64&quot;/' conf/local.conf
bitbake -c fetchall core-image-minimal
time bitbake core-image-minimal
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;results&quot;&gt;Results&lt;/h3&gt;

&lt;p&gt;Bare-metal:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;real    33m5.260s
user    289m41.356s
sys     27m23.488s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Aufs:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;real    40m24.416s
user    258m48.932s
sys     56m29.284s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Devicemapper with official binary in loopback mode:
This requires &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--storage-opt dm.override_udev_sync_check=true&lt;/code&gt;&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;real    35m24.415s
user    289m10.660s
sys     34m21.168s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Devicemapper with my own compiled dynamic binary:
This still requires &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--storage-opt dm.override_udev_sync_check=true&lt;/code&gt;
even though docker info states udev sync is supported.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;real    34m18.387s
user    294m1.720s
sys     31m43.764s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Overlayfs:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;real    33m46.890s
user    293m40.084s
sys     35m31.480s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;Aufs still has a measurable performance overhead even when the IO is
done on a bind mount outside of the aufs filesystem. Devicemapper and
overlayfs do not add overhead in this specific scenario. I did have
problems with devicemapper on Ubuntu 14.04 and the 3.13 kernel, but
since I upgraded to the 3.16 kernel I have not had any problems with
devicemapper errors. The only problems I have had were related to the
udev sync detection, a new requirement with Docker 1.7.&lt;/p&gt;

&lt;p&gt;My options are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Ignore the udev sync requirement with a flag&lt;/li&gt;
  &lt;li&gt;Compile and distribute my own dynamically linked version of docker
and hope that docker will provide an official version on Ubuntu&lt;/li&gt;
  &lt;li&gt;Switch to Overlayfs&lt;/li&gt;
&lt;/ul&gt;
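
&lt;p&gt;For the first option, the flag can be made permanent on Ubuntu by
adding it to the init script defaults. A minimal sketch, assuming the
stock /etc/default/docker file:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# /etc/default/docker
DOCKER_OPTS=&quot;--storage-driver=devicemapper --storage-opt dm.override_udev_sync_check=true&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;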

&lt;p&gt;There are reports of problems with Overlayfs when using rpm inside a
container. I will do some more testing with Overlayfs, but it seems my
best option now is to move all my Ubuntu builders to Overlayfs.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>XFS and RAID setup</title>
   <link href="https://kscherer.github.io//2015/06/26/xfs-and-raid-setup"/>
   <updated>2015-06-26T00:00:00+00:00</updated>
   <id>hhttps://kscherer.github.io//2015/06/26/xfs-and-raid-setup</id>
   <content type="html">&lt;h2 id=&quot;choosing-xfs&quot;&gt;Choosing XFS&lt;/h2&gt;

&lt;p&gt;I manage a cluster of builder machines and all the builders use the
ext4 filesystem. To load the machines effectively, the builds are
heavily parallelized, and a RAID0 striped setup keeps IO from becoming
a bottleneck. When RedHat 7 was released with xfs as the default
filesystem, I realized xfs would be a viable alternative to ext4:
RedHat wouldn’t have made that change if xfs weren’t a fast and solid
filesystem. I recently got some new hardware and started an
experiment.&lt;/p&gt;

&lt;h2 id=&quot;default-raid-settings&quot;&gt;Default RAID settings&lt;/h2&gt;

&lt;p&gt;The system has 6 4TB disks from which I created 2 RAID0 virtual disks
of 3 disks each, for a total of 12TB per virtual disk. The machine has a
battery backed RAID controller and each virtual disk had as its default
settings: stripe size of 64KB, write back, adaptive read ahead, disk
cache enabled and a few more.&lt;/p&gt;

&lt;h2 id=&quot;creating-the-xfs-drives&quot;&gt;Creating the xfs drives&lt;/h2&gt;

&lt;p&gt;Once the machine was provisioned, I started reading about xfs
filesystem creation options and mount options. There were several
points of confusion:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Some web pages referred to a crc option which validates
metadata. This sounds like a good idea, but is not available with
the xfsprogs version on Ubuntu 14.04&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I didn’t realize at first that the inode64 option is a mount option
and not a filesystem creation option&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since the disks are using hardware RAID, which is not generally
detectable by the mkfs program, the geometry needs to be specified when
creating the filesystem.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;parted -s /dev/sdb mklabel gpt
parted -s /dev/sdb mkpart build1 xfs 1M 100%
mkfs.xfs -d su=64k,sw=3 /dev/sdb1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;These commands create the partition and tell xfs the stripe unit
(su=64k, matching the controller stripe size) and the stripe width
(sw=3, one stripe unit per disk in the RAID0 set).&lt;/p&gt;

&lt;h2 id=&quot;xfs-mount-options&quot;&gt;XFS mount options&lt;/h2&gt;

&lt;p&gt;It was clear that inode64 was useful because the disks are large
and the option lets metadata be spread out over the whole drive. The
interesting option was the barrier entry. There is an entry in the &lt;a href=&quot;http://xfs.org/index.php/XFS_FAQ#Q._Should_barriers_be_enabled_with_storage_which_has_a_persistent_write_cache.3F&quot;&gt;XFS Wiki FAQ&lt;/a&gt;
about this situation. If the storage is battery backed, then the
barrier is not necessary. Ideally the disk write cache is also
disabled to prevent data loss if power to the machine is lost. So
I went back to the RAID controller settings, disabled the disk cache
on all the drives, and then added &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nobarrier,inode64,defaults&lt;/code&gt; to the
mount options for the drives.&lt;/p&gt;
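
&lt;p&gt;For reference, a minimal sketch of the resulting /etc/fstab entry (the
mount point is illustrative):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# battery backed RAID controller, so barriers are disabled
/dev/sdb1  /build1  xfs  nobarrier,inode64,defaults  0  0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;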

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The experiment has started. The first build on the machine was very
fast, but the contribution of the filesystem is hard to determine. If
there are any interesting developments I will post updates.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Adventures with Git and server packfile bitmaps</title>
   <link href="https://kscherer.github.io//git/2015/05/15/git-and-bitmaps"/>
   <updated>2015-05-15T00:00:00+00:00</updated>
   <id>hhttps://kscherer.github.io//git/2015/05/15/git-and-bitmaps</id>
   <content type="html">&lt;p&gt;In git 2.0, a new feature called bitmaps was added. The git
&lt;a href=&quot;https://git.kernel.org/cgit/git/git.git/tree/Documentation/RelNotes/2.0.0.txt&quot;&gt;Changelog&lt;/a&gt; describes it as follows:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;The bitmap-index feature from JGit has been ported, which should
significantly improve performance when serving objects from a
repository that uses it.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;One of my colleagues told me that he had experimented with it and
noticed some impressive speedups, which I was able to reproduce. On the
local GigE network a linux kernel clone went from approx 3 minutes to
1.5 minutes, a speedup of almost 50%!&lt;/p&gt;

&lt;p&gt;The instructions seemed very simple. Just log into the git server and
run:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;git repack -A -b
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;on every bare repo. The first hurdle was upgrading to a newer version
of git. Our git servers are running CentOS 5, CentOS 6 and Ubuntu
14.04. The EPEL version of git is 1.8 and 14.04 ships with 1.9.1.&lt;/p&gt;

&lt;p&gt;For Ubuntu 14.04 the solution was to use the LaunchPad
&lt;a href=&quot;https://launchpad.net/~git-core/+archive/ubuntu/ppa&quot;&gt;Git Stable PPA&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But for CentOS, it was a little trickier. Since I hate distributing
binaries directly I decided to backport the latest Fedora git
srpm. Getting it to build required a few hacks with bash completion
and installing a few dependencies, but it took less than 30 minutes to
get both CentOS 5 and 6 rpms.&lt;/p&gt;

&lt;p&gt;The upgrade of git on the servers went very smoothly: because they
use xinetd to run the git-daemon, the very next connection to the
server after the upgrade started using the newly installed git 2.3.5
binary.&lt;/p&gt;

&lt;p&gt;There were of course a few hiccups. An internal tool that used git
request-pull was relying on one of the “heuristics” (see the changelog)
that were removed.&lt;/p&gt;

&lt;p&gt;The next step was to repack all the bare repos on the server. So I
wrote a script to run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;git repack -A -b&lt;/code&gt; and left it to run
overnight. Recovering from this the next few days would require me to
become very familiar with the git man pages.&lt;/p&gt;

&lt;p&gt;The first problem was that the git server ran out of disk space. It
turns out I needed to add the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-d&lt;/code&gt; flag in order to delete the previous pack
files. I had effectively doubled the disk space requirements of every
repo!&lt;/p&gt;

&lt;p&gt;It also turns out that the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-A&lt;/code&gt; flag leaves packfiles that contain dangling
objects. So I reran my script with:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;git gc --aggressive
git repack -a -d -b
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This helped a lot but repos that were using alternates were still
taking a lot more space than before because repack was making one big
packfile of all the objects and effectively ignoring the alternates
file. This is documented in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;git clone&lt;/code&gt; man page.&lt;/p&gt;

&lt;p&gt;So I went to all the repos with alternates and ran:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;git repack -a -d -b -l
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-l&lt;/code&gt; flag only repacks objects that are not available in the
alternates. With some extra cleanup, this resulted in even less disk
space usage than before. Unfortunately this does mean that a repo with
alternates cannot have a bitmap.&lt;/p&gt;

&lt;p&gt;On one server many repos still did not contain the bitmap file. After
much experimentation I finally figured out that the pack.packSizeLimit
option had been set to 500M on that server only. This meant that repos
larger than 500M would have multiple pack files, and since the bitmap
requires a single pack file, no bitmap was created. The lack of a
warning extended the debugging time considerably.&lt;/p&gt;
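
&lt;p&gt;Checking for and removing the limit is straightforward. A minimal
sketch, run in the bare repo (add --global if the limit was set in the
server-wide config):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# show the current limit, if any
git config --get pack.packSizeLimit
# remove it so repack can produce a single pack plus bitmap
git config --unset pack.packSizeLimit
git repack -a -d -b
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;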

&lt;p&gt;Finally one of my servers had an old mirror of the upstream Linux
kernel repo and even after &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;git gc --aggressive&lt;/code&gt; the repo was 1.5GB,
which is over 500MB larger than a new clone. So I started
experimenting with the other repack flags, including &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-F&lt;/code&gt;. The result
was that the repo ballooned to over 4GB and I couldn’t find a way to
reduce the size. Even cloning the repo to another machine resulted in
a 1.5GB transfer. In the end, I ended up doing a fresh clone and
swapping the objects/pack directories.&lt;/p&gt;

&lt;p&gt;I was able to reproduce the behavior with a fresh clone as well:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;git clone --bare git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
cd linux-stable
git repack -a -d -F
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In summary:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;To create bitmaps without increasing disk space usage:&lt;/p&gt;

    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; git repack -a -d -b -l
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;p&gt;I was not able to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;git repack -F&lt;/code&gt; in a way that did not
quadruple the size of the Linux kernel repo. It even caused clones
of the repo to be larger.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Git should have a warning if bitmaps are requested but cannot be
created due to packSizeLimit restrictions. I plan to file a bug or
make a patch.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

</content>
 </entry>
 
 <entry>
   <title>Docker backend performance update</title>
   <link href="https://kscherer.github.io//2015/03/03/docker-backend-performance-update"/>
   <updated>2015-03-03T00:00:00+00:00</updated>
   <id>hhttps://kscherer.github.io//2015/03/03/docker-backend-performance-update</id>
   <content type="html">&lt;p&gt;A long time ago I filed Docker issue &lt;a href=&quot;https://github.com/docker/docker/issues/2891&quot;&gt;2891&lt;/a&gt; regarding the
performance of the aufs backend vs devicemapper.&lt;/p&gt;

&lt;p&gt;Quick summary is that the aufs backend was approx 30% slower even
though the build was being done in a bind mount outside of the
container.&lt;/p&gt;

&lt;p&gt;I finally got around to checking again using Docker 1.5 on Ubuntu
14.04 with the 3.16 utopic LTS enablement kernel.&lt;/p&gt;

&lt;p&gt;The current stable poky release is dizzy:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cd &amp;lt;buildarea&amp;gt;
mkdir downloads
chmod 777 downloads
git clone --branch dizzy git://git.yoctoproject.org/poky
source poky/oe-init-build-env mybuild
ln -s ../downloads .
bitbake -c fetchall core-image-minimal
time bitbake core-image-minimal
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There is no longer any need to set the package and job parallelism in
local.conf because bitbake now chooses reasonable defaults.&lt;/p&gt;
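
&lt;p&gt;For reference, these are the kinds of settings that previously had to
be tuned by hand in conf/local.conf (the values are illustrative):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# number of bitbake tasks to run in parallel
BB_NUMBER_THREADS = &quot;8&quot;
# passed to make for each compile task
PARALLEL_MAKE = &quot;-j 8&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;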

&lt;p&gt;Bare Metal:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;real    29m59.190s
user    278m0.988s
sys     59m47.379s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Devicemapper:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;real    32m21.074s
user    281m53.994s
sys     68m45.554s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;AUFS:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;real    37m14.612s
user    259m19.226s
sys     85m50.269s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I only ran each build once so this is not an authoritative
benchmark. It shows that there is a performance overhead of approx 20%
when using the aufs backend even if the IO is done on a bind mount.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Git Server option bigFileThreshold</title>
   <link href="https://kscherer.github.io//git/2014/09/26/git-server-option-bigfilethreshold"/>
   <updated>2014-09-26T00:00:00+00:00</updated>
   <id>hhttps://kscherer.github.io//git/2014/09/26/git-server-option-bigfilethreshold</id>
   <content type="html">&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;/h1&gt;

&lt;p&gt;I manage the git infrastructure for the Linux group at Wind River: the
main git server and 5 regional mirrors which are kept in sync using
grokmirror. I plan to do a post about our grokmirror setup. The main
git server holds over 500GB of bare git repos and over 600 of those
are mirrored. Many repos are not mirrored. Some repos are internal,
some are mirrors of external upstream repos and some are mirrors of
upstream repos with internal branches. The git server runs CentOS 5.10
and git 1.8.2 from EPEL.&lt;/p&gt;

&lt;h1 id=&quot;the-toolchain-binary-repos&quot;&gt;The toolchain binary repos&lt;/h1&gt;

&lt;p&gt;One of the largest repos contains the source for the toolchain &lt;em&gt;and&lt;/em&gt;
all the binaries. Since the toolchain takes a long time to build, it
was decided that Wind River Linux should ship pre-compiled binaries
for the toolchain. There is also an option which allows our customers
to rebuild the toolchain if they have a reason to.&lt;/p&gt;

&lt;p&gt;The bare toolchain repo size varies between 1 and 3GB depending on
supported architectures. Many of the files in the repo were tarballs
around 250MB in size.&lt;/p&gt;

&lt;h1 id=&quot;why-is-the-git-server-down-again&quot;&gt;Why is the git server down again?&lt;/h1&gt;

&lt;p&gt;When a new toolchain is ready for integration, it is uploaded to the
main git server and mirrored. Then the main tree is switched to enable
the new version of the toolchain and all the coverage builders start
to download the new version. Suddenly the git servers would become
unresponsive and would thrash under memory pressure until they would
be inevitably rebooted. Sometimes I would have to disable the coverage
builders and stage their activation to prevent a thundering herd from
knocking the git server over again.&lt;/p&gt;

&lt;h1 id=&quot;why-does-cloning-a-repo-require-so-much-memory&quot;&gt;Why does cloning a repo require so much memory?&lt;/h1&gt;

&lt;p&gt;I finally decided to investigate this and found a reproducer
quickly. Cloning a 2.9GB bare repo would consume over 7GB of RAM
before the clone was complete. The graph of used memory was
spectacular. I started reading the git config man page and asking
Google various questions.&lt;/p&gt;

&lt;p&gt;I tried setting the binary attributes on various file types, but
nothing changed. See man gitattributes for more information. The
defaults seem to be fine.&lt;/p&gt;

&lt;p&gt;I tried various git config options like core.packedGitWindowSize and
core.packedGitLimit and core.compression as recommended in many blog
posts. But the memory spike was still the same.&lt;/p&gt;

&lt;h1 id=&quot;corebigfilethreshold&quot;&gt;core.bigFileThreshold&lt;/h1&gt;

&lt;p&gt;From the git config man page:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Files larger than this size are stored deflated, without attempting delta compression.
Storing large files without delta compression avoids excessive memory usage, at the slight
expense of increased disk usage.

Default is 512 MiB on all platforms. This should be reasonable for most projects as source
code and other text files can still be delta compressed, but larger binary media files
won’t be.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The 512MB number is key. The reason the git server was using so much
memory is that it was attempting delta compression on the binary
tarballs. This didn’t make the files any smaller, because they were
already compressed, but it required a lot of memory. I tried one command:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;git config --global --add core.bigFileThreshold 1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And suddenly (no git daemon restart necessary), the clone took a
fraction of the time and the memory spike was gone. The only downside
was that the repo required more disk space: about 4.5GB. I then tried:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;git config --global --add core.bigFileThreshold 100k
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This resulted in approx 10% more disk space (3.3GB) and no memory spike
when cloning.&lt;/p&gt;
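
&lt;p&gt;Note that the on-disk packs only pick up the new threshold when they
are rewritten. A minimal sketch of applying it to an existing repo:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# rewrite the packs so blobs over 100k are stored without delta compression
git -c core.bigFileThreshold=100k repack -a -d
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;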

&lt;p&gt;This setting seems very reasonable to me. The chance of having a text
file larger than 100KB is very low and the only downside is slightly
higher disk usage. Git is already very efficient in this regard.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;UPDATE&lt;/em&gt; This setting can cause disk space issues on linux kernel
repos. See &lt;a href=&quot;/git/2021/02/11/update-git-server-option-bigfilethreshold.html&quot;&gt;update here&lt;/a&gt;&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Replacing disks before they fail</title>
   <link href="https://kscherer.github.io//linux/2013/11/08/replacing-disks-before-they-fail"/>
   <updated>2013-11-08T00:00:00+00:00</updated>
   <id>hhttps://kscherer.github.io//linux/2013/11/08/replacing-disks-before-they-fail</id>
   <content type="html">&lt;h1 id=&quot;hardware-setup&quot;&gt;Hardware setup&lt;/h1&gt;

&lt;p&gt;I am managing an R710 Dell server with 6 2TB disks. The RAID
controller does not support JBOD mode, so I had to create 6 RAID0
virtual disks with one disk per group. The disks are then passed
through to Linux as /dev/sda to /dev/sdf. I am running 6 xen vms and
each vm gets a dedicated disk. The vms are coverage builders and not
mission critical so there is no point in added redundancy. I have a
nice Cobbler/Foreman setup that makes provisioning very quick.&lt;/p&gt;

&lt;h2 id=&quot;openmanage-and-check_openmanage&quot;&gt;OpenManage and check_openmanage&lt;/h2&gt;

&lt;p&gt;I am running the Dell OpenManage software on the system. In fact I am
running it on all my hardware. I am using the &lt;a href=&quot;https://github.com/camptocamp/puppet-dell&quot;&gt;puppet/dell&lt;/a&gt; module
graciously shared on Github. The OpenManage package does many things
including CLI query access to all the hardware.&lt;/p&gt;

&lt;p&gt;Then I stumbled across &lt;a href=&quot;http://folk.uio.no/trondham/software/check_openmanage.html&quot;&gt;check_openmanage&lt;/a&gt;, a Nagios check
that queries all the hardware and notifies Nagios if there are any
problems. I had already used the Puppet integration with Nagios to
set up a bunch of checks for ntp, disk and some other services. To make
things even easier, check_openmanage is in EPEL and Debian. It did not
take much time to add this check to the existing checks.&lt;/p&gt;

&lt;h2 id=&quot;predicted-failure&quot;&gt;Predicted Failure&lt;/h2&gt;

&lt;p&gt;Once everything was set up, I started getting warnings about many
things I was not aware of, like out-of-date firmware and hard drives
predicted to fail. The output of check_openmanage
looks like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;WARNING: Physical Disk 1:0:4 [Seagate ST32000444SS, 2.0TB] on ctrl 0 is Online, Failure Predicted
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A reasonably painless call to Dell and a replacement disk is shipped.&lt;/p&gt;

&lt;h2 id=&quot;disk-replacement&quot;&gt;Disk replacement&lt;/h2&gt;

&lt;p&gt;When a disk fails it has a really nice blinking yellow light. To make
things clean, I wanted to shut down and delete the correct vm before
changing the disk. How do I figure out which vm to shut down?&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; omreport storage pdisk controller=0 pdisk=1:0:4
Physical Disk 1:0:4 on Controller PERC 6/i Integrated (Embedded)
Controller PERC 6/i Integrated (Embedded)
ID                              : 1:0:4
Status                          : Non-Critical
Name                            : Physical Disk 1:0:4
State                           : Online
Failure Predicted               : Yes

&amp;gt; omreport storage pdisk controller=0 vdisk=5
List of Physical Disks belonging to Virtual Disk 5
Controller PERC 6/i Integrated (Embedded)
ID                              : 1:0:4
Status                          : Non-Critical
Name                            : Physical Disk 1:0:4
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Okay, I found the correct physical disk and the associated virtual disk.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; omreport storage vdisk controller=0 vdisk=5
Virtual Disk 5 on Controller PERC 6/i Integrated (Embedded)
ID                            : 5
Status                        : Ok
Name                          : Virtual Disk 5
State                         : Ready
Device Name                   : /dev/sdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now I know that this physical disk maps to the device /dev/sdf, so I
initiated a shutdown of the vm that uses that disk.&lt;/p&gt;
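
&lt;p&gt;With one disk per virtual disk, the same lookup can be scripted across
all six virtual disks using the Device Name field shown above. A minimal
sketch:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# print the vdisk to Linux device mapping for all six virtual disks
for v in 0 1 2 3 4 5; do
    echo &quot;vdisk $v:&quot;
    omreport storage vdisk controller=0 vdisk=$v | grep 'Device Name'
done
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;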

&lt;p&gt;The disk with predicted failure has a flashing amber light which makes
it easy to figure out which one to swap.&lt;/p&gt;

&lt;p&gt;Once the swap is complete, run the following command to recreate the vdisk.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;omconfig storage controller controller=0 action=createvdisk raid=r0 size=max pdisk=1:0:4
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And /dev/sdf is available once again.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>OpenStack Grizzly deployment using puppet modules</title>
   <link href="https://kscherer.github.io//linux%20openstack/2013/08/30/openstack-grizzly-deployment-using-puppet-modules"/>
   <updated>2013-08-30T00:00:00+00:00</updated>
   <id>hhttps://kscherer.github.io//linux%20openstack/2013/08/30/openstack-grizzly-deployment-using-puppet-modules</id>
   <content type="html">&lt;h2 id=&quot;openstack-grizzly-3-node-cluster-installation&quot;&gt;Openstack Grizzly 3 node cluster installation&lt;/h2&gt;

&lt;p&gt;There is a lot of infrastructure that I leveraged to do this
installation:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Local ubuntu mirror&lt;/li&gt;
  &lt;li&gt;Debian Preseed files to automate installation&lt;/li&gt;
  &lt;li&gt;Dell iDRAC and faking netboot using virtual CDROM&lt;/li&gt;
  &lt;li&gt;Puppet master with git branch to environment mapping&lt;/li&gt;
  &lt;li&gt;Git subtrees to integrate OpenStack puppet modules&lt;/li&gt;
  &lt;li&gt;An example hiera data file to handle configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;local-ubuntu-mirror&quot;&gt;Local Ubuntu mirror&lt;/h2&gt;

&lt;p&gt;Having a local mirror makes installations much simpler because
packages download very quickly. The ideal setup uses netboot because
the mirror already contains the kernel and initrd and packages needed
to do the installation. I used:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ubuntu/dists/precise/main/installer-amd64/current/images/netboot/ubuntu-installer/amd64/linux
ubuntu/dists/precise/main/installer-amd64/current/images/netboot/ubuntu-installer/amd64/initrd.gz
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To create the mirror I used the &lt;a href=&quot;https://launchpad.net/ubumirror&quot;&gt;ubumirror&lt;/a&gt; scripts provided by
Canonical.&lt;/p&gt;

&lt;h2 id=&quot;debian-preseed&quot;&gt;Debian Preseed&lt;/h2&gt;

&lt;p&gt;I already have some experience using debian preseed files to automate
installation of Ubuntu and Debian. The documentation is spread out all
over the Internet. Most of the preseed just sets the local mirror
and the network setup. The OpenStack-related options were the disk layout
and adding the Ubuntu Cloud Archive.&lt;/p&gt;

&lt;h3 id=&quot;openstack-compute-node-disk-layout&quot;&gt;Openstack Compute Node disk layout&lt;/h3&gt;

&lt;p&gt;The machines I am using were purchased before I even knew OpenStack
existed. They were used for Wind River Linux coverage builds and the
simplest configuration uses 2 900GB SAS drives in RAID0. The builds
require a lot of disk space, and builds on SSD and in memory provided
only a small speedup relative to the increase in cost.&lt;/p&gt;

&lt;p&gt;My idea was to use LVM and allow cinder to use the remaining space to
create volumes for the vms. Here are the relevant preseed options to
handle the disk layout.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;d-i partman-auto/method string lvm
d-i partman-auto/purge_lvm_from_device  boolean true
d-i partman-auto-lvm/new_vg_name string cinder-volumes
d-i partman-auto-lvm/guided_size string 500GB
d-i partman-auto/choose_recipe select atomic
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There are 3 kinds of storage in OpenStack: instance/ephemeral, block and
object.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Object storage is handled by swift and not part of this
installation.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Block storage is done by default using iscsi and LVM
logical volumes. Cinder looks for a LVM volume group called
cinder-volumes and creates logical volumes there.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Instance/Ephemeral storage by default goes into /var on the root
filesystem. This is why I made the root filesystem 500GB. But this
does not allow live migration because the root filesystem is not
shared. If the vm was booted using block storage then the iscsi
driver can handle the migration of vms. Another option is to mount
/var on a shared nfs drive.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;ubuntu-cloud-archive&quot;&gt;Ubuntu Cloud Archive&lt;/h3&gt;

&lt;p&gt;I added the cloud and puppetlabs apt repos in the preseed to prevent
older versions of packages being installed.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;d-i apt-setup/local0/repository string \
    http://apt.puppetlabs.com/ precise main dependencies
d-i apt-setup/local0/comment string Puppetlabs
d-i apt-setup/local0/key string http://apt.puppetlabs.com/pubkey.gpg

d-i apt-setup/local1/repository string \
    http://ubuntu-cloud.archive.canonical.com/ubuntu precise-updates/grizzly main
d-i apt-setup/local1/comment string Ubuntu Cloud Archive
d-i apt-setup/local1/key string \
    http://ubuntu-cloud.archive.canonical.com/ubuntu/dists/precise-updates/grizzly/Release.gpg

tasksel tasksel/first multiselect ubuntu-server
d-i pkgsel/include string openssh-server ntp ruby libopenssl-ruby \
    vim-nox mcollective rubygems git puppet mcollective facter \
    ruby-stomp puppetlabs-release ubuntu-cloud-keyring
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;dell-idrac-and-faking-netboot-using-virtual-cdrom&quot;&gt;Dell iDRAC and faking netboot using virtual CDROM&lt;/h2&gt;

&lt;p&gt;Unfortunately I do not have DHCP, PXE and TFTP in this subnet to do
netboot provisioning. I am working on this with our IT department. So
for now I have to fake it.&lt;/p&gt;

&lt;p&gt;I grab the mini.iso from the Ubuntu mirror:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ubuntu/dists/precise/main/installer-amd64/current/images/netboot/mini.iso
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This contains the netboot kernel and initrd. I can then log into the
Dell iDRAC and start the remote console for the server. Using Virtual
Media redirection, I connect the mini.iso and boot the server. Press
F11 to get the boot menu and select Virtual CDROM.&lt;/p&gt;

&lt;p&gt;But using this directly means I have to type everything into a tiny
console window. So I modified the isolinux.cfg to change the kernel
params to load the preseed automatically.&lt;/p&gt;

&lt;p&gt;Mount mini.iso locally and copy the contents to the hard drive:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;sudo mount -o loop mini.iso /mnt/ubuntu/
cp -r /mnt/ubuntu/ .
chmod -R +w ubuntu
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here are the contents of the isolinux.cfg after editing:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;default preseed
prompt 0
timeout 0

label preseed
    kernel linux
    append vga=788 initrd=initrd.gz locale=en_US auto \
        url=&amp;lt;server&amp;gt;/my.preseed priority=critical interface=eth0 \
        console-setup/ask_detect=false console-setup/layout=us --
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then make a new iso:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mkisofs -o ubuntu-precise.iso -b isolinux.bin -c boot.cat \
    -no-emul-boot -boot-load-size 4 -boot-info-table -R -J -v -T ubuntu/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then the process is almost completely automated, except that the
server cannot download the preseed until the networking is
configured. This info can be added to the kernel params, but then I
would have to edit each iso for each server. With RedHat kickstarts I
was able to add a script that mapped MAC address to IP and completely
automate this. But with preseeds I need to manually enter the network
info. The proper solution is a provisioner like Cobbler or Foreman.&lt;/p&gt;

&lt;h2 id=&quot;puppet-master-with-git-branch-to-environment-mapping&quot;&gt;Puppet master with git branch to environment mapping&lt;/h2&gt;

&lt;p&gt;I have setup my puppet masters based on the &lt;a href=&quot;https://puppetlabs.com/blog/git-workflow-and-puppet-environments/&quot;&gt;post&lt;/a&gt; by Puppetlabs:&lt;/p&gt;

&lt;p&gt;I like this setup a lot. All development happens on my desktop and I
have a consistent version controlled collection of all modules
available to my systems. I am using it to give some colleagues who are
learning puppet a nice environment that won’t mess up my systems.&lt;/p&gt;

&lt;p&gt;But I have some custom in-house modules and I want to put the
OpenStack puppet modules in the same git branch beside them. The
existing tools like puppet module, puppet librarian, etc. do not
work for this use case. I want to be able to use git for these external
repos and be able to easily share any patches I make with
upstream. Enter git subtree.&lt;/p&gt;

&lt;h2 id=&quot;git-subtrees-to-integrate-openstack-puppet-modules&quot;&gt;Git subtrees to integrate OpenStack puppet modules&lt;/h2&gt;

&lt;p&gt;Git subtree is part of the git package contrib files. Enabling it on
my system was simple:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cd ~/bin
cp /usr/share/doc/git/contrib/subtree/git-subtree.sh .
chmod +x git-subtree.sh
mv git-subtree.sh git-subtree
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now I can go to my modules directory and add in the OpenStack puppet
modules&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;for arg in cinder glance horizon keystone nova; do \
    git subtree add --prefix=modules/$arg \
      --squash https://github.com/stackforge/puppet-$arg stable/grizzly;\
done
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
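
&lt;p&gt;Keeping the subtrees up to date and sharing local patches works the
same way. A minimal sketch for one module (the fork URL and branch names
are illustrative):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# pull upstream changes into the subtree
git subtree pull --prefix=modules/nova --squash \
    https://github.com/stackforge/puppet-nova stable/grizzly
# split local commits back out to share with upstream
git subtree push --prefix=modules/nova \
    git@github.com:myfork/puppet-nova.git my-fixes
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;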

&lt;p&gt;There are some more supporting modules like inifile, rabbitmq, apt,
vcs, etc. Look in openstack/Puppetfile for the full list.&lt;/p&gt;

&lt;p&gt;Next was to enable the modules on my machines. First the hiera data
needs to be added for the network config. I was inspired by Chris Hodge’s
&lt;a href=&quot;http://www.youtube.com/watch?v=owpi1WF9dws&quot;&gt;video&lt;/a&gt; and &lt;a href=&quot;https://gist.github.com/ody/5718115&quot;&gt;hiera data&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The gist has some minor issues. I posted a &lt;a href=&quot;https://gist.github.com/kscherer/6383077&quot;&gt;revised version&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The last piece is to enable the modules on the nodes:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;node 'controller' {
    include openstack::repo
    include openstack::controller
    include openstack::auth_file
    class { 'rabbitmq::repo::apt':
        before =&amp;gt; Class['rabbitmq::server']
    }
}
node 'compute' {
    include openstack::repo
    include openstack::compute
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Most of this infrastructure already existed or I had already done in
the past. I was able to reimage 3 machines and have a working grizzly
installation in about 3 hours.&lt;/p&gt;

&lt;p&gt;Many thanks to all people who have contributed to Debian, Ubuntu, Puppet and
the OpenStack puppet modules.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Starting Openstack deployment</title>
   <link href="https://kscherer.github.io//2013/04/22/starting-openstack-deployment"/>
   <updated>2013-04-22T00:00:00+00:00</updated>
   <id>hhttps://kscherer.github.io//2013/04/22/starting-openstack-deployment</id>
   <content type="html">&lt;h1 id=&quot;starting-with-openstack&quot;&gt;Starting with Openstack&lt;/h1&gt;

&lt;p&gt;I have some experience with Xen, but no experience with any software
that controls a hypervisor. Wind River has several customers
interested in using oVirt and Openstack with Wind River Linux. Another
team is looking at oVirt, but no one had taken up the Openstack
investigation. I have experience with Puppet and Puppetlabs has some
official Openstack modules, so that seemed a good place to start.&lt;/p&gt;

&lt;h1 id=&quot;fedora-18&quot;&gt;Fedora 18&lt;/h1&gt;

&lt;p&gt;I re-purposed a coverage builder and installed Fedora 18. I had read
about Openstack and Fedora and thought that would be a good place to
start. Then Redhat announced the &lt;a href=&quot;https://github.com/redhat-openstack/packstack&quot;&gt;Packstack&lt;/a&gt; and &lt;a href=&quot;http://openstack.redhat.com/Main_Page&quot;&gt;RDO&lt;/a&gt; project
and I decided to give it a try.&lt;/p&gt;

&lt;p&gt;The initial install failed due to selinux being disabled and conflicts
with NIS (our NIS deployment contains users with uids that conflict
with the ones in the rpms). When I finally got the packstack installer
to complete after a clean install, openstack refused to recognize the
admin user. So I did a reinstall using CentOS 6.4 and everything
worked without issue.&lt;/p&gt;

&lt;h1 id=&quot;openstack-and-images&quot;&gt;Openstack and images&lt;/h1&gt;

&lt;p&gt;My experience with virtual machines has always been boot and install
onto some empty, usually virtual, disk. Openstack was my first
interaction with images. The docs recommend a base F18 image. The
first attempt to download it using the Horizon interface seemed to
hang: 30 minutes after that download had started, I had already fetched
the image with wget on my local machine.&lt;/p&gt;

&lt;h1 id=&quot;openstack-and-lvm&quot;&gt;Openstack and LVM&lt;/h1&gt;

&lt;p&gt;My initial install of the host OS created a large LVM partition called
cinder-volumes for the Openstack Block storage service. Unfortunately,
the packstack installer renamed the volume group. I had to:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Stop the cinder-volume service&lt;/li&gt;
  &lt;li&gt;Delete the packstack created volume group and physical volume&lt;/li&gt;
  &lt;li&gt;Rename the local LVM volume group&lt;/li&gt;
  &lt;li&gt;Restart the cinder-volume service.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;openstack-and-volumes&quot;&gt;Openstack and volumes&lt;/h1&gt;

&lt;p&gt;I went into the Volume section of the Horizon Web UI and created a
Volume. Running lvs on the host shows that a volume was created in the
correct place. I launched the instance as tiny and noticed that it did
not require a volume, which I found strange. I had setup a security
group to allow ssh and inject my public key. I then associated a
floating ip and was able to log into the vm! This was a happy moment.&lt;/p&gt;

&lt;p&gt;After some poking around, a disk space check revealed that the VM had
10 GB disk space. This confused me because I had not associated it
with a volume. So I repeated the process but setup the VM to boot off
the volume I created earlier. This time the boot failed due to missing
boot image.&lt;/p&gt;

&lt;h1 id=&quot;some-ec2-history&quot;&gt;Some EC2 history&lt;/h1&gt;

&lt;p&gt;I did more research and found this &lt;a href=&quot;http://alestic.com/2012/01/ec2-ebs-boot-recommended&quot;&gt;article&lt;/a&gt;. It explains some of the
history of virtual machine infrastructure. When EC2 was first
launched, the VMs had no persistent storage. Customers had to use some
sort of web service like S3 to persist information. This kind of image
is called instance-store; Openstack refers to it as Ephemeral storage.&lt;/p&gt;

&lt;p&gt;Then Amazon introduced EBS to provide persistent storage. It could be
attached to an instance-store image as a another block device. In
Openstack this is handled by Cinder as block level storage.&lt;/p&gt;

&lt;p&gt;Then came the ability to boot from EBS volumes. This matches my
internal model of a virtual machine as persistent like a physical
machine. By default the volumes are empty, so the next step is
populating the volume with the proper bits. I have experience with
Cobbler to use kickstart and others to install new systems, but I was
curious if the image could be “transferred” to the volume.&lt;/p&gt;

&lt;p&gt;The Horizon Web UI was not helpful. Some more research revealed the
following:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cinder create --image-id &amp;lt;image-id&amp;gt; --display-name mybootable-vol 10
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This runs qemu-img convert and writes the raw image to the new cinder
volume. This volume can be booted directly, but the Web UI still
requires an image name which is ignored.&lt;/p&gt;

&lt;h1 id=&quot;summary&quot;&gt;Summary&lt;/h1&gt;

&lt;p&gt;Types of Openstack VMs:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Ephemeral storage only. The default size of 0 means use the image
disk size.&lt;/li&gt;
  &lt;li&gt;Ephemeral + block storage. The VM must format the volume if blank
and mount it. A volume can only be attached to one VM.&lt;/li&gt;
  &lt;li&gt;Block storage only. The Web UI does not support image to volume
conversion but cinder does.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Next on my list are the NetApp cinder driver and installation on Ubuntu
12.04 Server using the official puppet modules.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Lessons learned running e2croncheck</title>
   <link href="https://kscherer.github.io//linux/2013/03/19/lessons-learned-running-e2croncheck"/>
   <updated>2013-03-19T00:00:00+00:00</updated>
   <id>hhttps://kscherer.github.io//linux/2013/03/19/lessons-learned-running-e2croncheck</id>
   <content type="html">&lt;p&gt;Filesystems (ext4, xfs, zfs, etc) are one of those things whose
failure nobody really wants to think about. The difference between a
hard disk failure and complete filesystem corruption is largely
academic. However a filesystem has many failure modes and the scariest
is silent corruption that goes undetected for a long time. Worst case
scenario is that backups are rendered useless.&lt;/p&gt;

&lt;p&gt;The long time solution to detecting and correcting minor filesystem
issues is fsck. The tool has several limitations:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;The check can only be run while the filesystem is offline.&lt;/li&gt;
  &lt;li&gt;The check is serial per filesystem. It can be parallelized across
multiple filesystems.&lt;/li&gt;
  &lt;li&gt;As the amount of data on the filesystem grows, the time to complete
the check grows as well.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What seems to be standard practice is the following:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Install and configure system with defaults&lt;/li&gt;
  &lt;li&gt;Leave system running as long as possible&lt;/li&gt;
  &lt;li&gt;When the machine hangs at a critical moment, reboot the machine&lt;/li&gt;
  &lt;li&gt;Wait for hours until admin logs into console and fsck check is
manually killed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This has several obvious drawbacks:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;fsck almost never gets a full run, especially if the system uses
hibernation and/or S3 sleep&lt;/li&gt;
  &lt;li&gt;The downtime always happens at the worst possible time&lt;/li&gt;
  &lt;li&gt;No one knows how long an fsck is actually going to take&lt;/li&gt;
  &lt;li&gt;The fsck may not be necessary, but the disk/machine needs to be
offline anyways&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Online fsck seems to be impossible, because the state of the filesystem
can change in ways that make the check wrong.&lt;/p&gt;

&lt;p&gt;Databases have a similar problem: how to do a backup while the system
is in operation. The solution there is to use filesystem
snapshots. This is how I stumbled upon e2croncheck. The original from
Theodore Ts’o is &lt;a href=&quot;http://ftp.sunet.se/pub/Linux/kernels/people/tytso/e2croncheck&quot;&gt;here&lt;/a&gt;. I found a revised version on GitHub by
&lt;a href=&quot;https://github.com/ion1/e2croncheck/blob/master/e2croncheck&quot;&gt;Ion&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The script creates a read write snapshot of the filesystem. LVM uses a
copy on write snapshot volume to track changes to the original
filesystem. The script then runs e2fsck on the snapshot, which will
report whether there is actual corruption on the filesystem that needs
to be repaired offline.&lt;/p&gt;
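
&lt;p&gt;The core of the technique fits in a few lines. A minimal sketch,
assuming a volume group named vg with a logical volume named data and
enough free space for the snapshot:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# create a snapshot volume to hold writes made during the check
lvcreate -s -L 500G -c 64k -n data-snap /dev/vg/data
# read-only forced check of the snapshot at idle IO priority
nice ionice -c3 e2fsck -fn /dev/vg/data-snap
# drop the snapshot when done
lvremove -f /dev/vg/data-snap
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;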

&lt;p&gt;This seems like a better solution than the standard practice of
ignoring the problem, so I set up my next servers in the following way:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Six physical disks in hardware RAID 5&lt;/li&gt;
  &lt;li&gt;Two virtual disks: 500GB system and 8.6TB data&lt;/li&gt;
  &lt;li&gt;System uses ext4&lt;/li&gt;
  &lt;li&gt;Data uses lvm with one lvm physical volume and one lvm volume group&lt;/li&gt;
  &lt;li&gt;Single logical volume at 8TB with 500GB unused space for snapshot&lt;/li&gt;
  &lt;li&gt;Cronjob to run e2croncheck weekly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LVM snapshots are not without their problems. The big one is
performance. There is overhead to the copy on write mechanism, but thanks
to the Internet I found some benchmarks comparing performance by
&lt;a href=&quot;http://www.nikhef.nl/~dennisvd/lvmcrap.html&quot;&gt;chunksize&lt;/a&gt;. The default chunksize is 4kB and increasing the
chunksize to 64kB increases performance by 10x!&lt;/p&gt;

&lt;p&gt;I also added ionice with e2fsck set to idle priority. So far the
changes mean that the background check does not interfere with
programs that are running.&lt;/p&gt;

&lt;p&gt;The final version of the script is located &lt;a href=&quot;https://github.com/kscherer/puppet-modules/blob/production/modules/e2croncheck/files/e2croncheck&quot;&gt;here&lt;/a&gt; inside a puppet
class to install the file and cron job.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>When root cannot delete a file</title>
   <link href="https://kscherer.github.io//2012/10/20/operation-not-permitted"/>
   <updated>2012-10-20T00:00:00+00:00</updated>
   <id>hhttps://kscherer.github.io//2012/10/20/operation-not-permitted</id>
   <content type="html">&lt;p&gt;Operation not permitted&lt;/p&gt;

&lt;p&gt;It started when dpkg could not upgrade the util-linux package because the
file /usr/bin/delpart could not be symlinked. So I tried to delete the file.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;sudo rm /usr/bin/delpart
rm: cannot remove `/usr/bin/delpart': Operation not permitted
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;All operations on the file failed. I tried mv, fsck, reboot into
rescue, etc.&lt;/p&gt;

&lt;p&gt;So I googled “linux ext4 Operation not permitted”. This did not help
much, but I noticed a link about ext2 extended attributes. I have
never used extended attributes, so I did a quick read of the man pages
for lsattr and chattr.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cd /usr/bin
sudo lsattr delpart
---D-a-----tT-- delpart
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That was a strange set of attributes. So I compared to another random file.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;sudo lsattr zip
-------------e- zip
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Once the problem is found, the solution is straightforward&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;sudo chattr +e -DatT delpart
sudo lsattr delpart
-------------e- delpart
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The question now is how the file got into this state. I can only
speculate that an fsck run “repaired” this corrupted file into this
strange but consistent state. I wonder if there are other surprises
waiting for me on this disk.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Ruby adventure</title>
   <link href="https://kscherer.github.io//2012/09/25/ruby-adventure"/>
   <updated>2012-09-25T00:00:00+00:00</updated>
   <id>hhttps://kscherer.github.io//2012/09/25/ruby-adventure</id>
   <content type="html">&lt;p&gt;Puppet is a Ruby project and many of the tools that work with Puppet
are also Ruby tools. For example, RSpec and Vagrant. To get access to
these tools, the “normal” path would be to use the Ubuntu package
manager, apt-get. But the Ruby world has its own packaging system,
rubygems. The first thing I tried and used was:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;gem install --user-install
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;which installs gems under the local user’s home directory. Less use of
sudo and root access is a good thing. The only downside is adjusting
the PATH variable to find the installed rubygems.&lt;/p&gt;
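
&lt;p&gt;A minimal sketch of the PATH adjustment, assuming bash (rubygems can
report the user gem directory itself):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# add the user gem bin directory to PATH, e.g. in ~/.bashrc
export PATH=&quot;$(ruby -e 'print Gem.user_dir')/bin:$PATH&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;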

&lt;p&gt;The next trick is using RVM to install multiple ruby versions into the
user account. This allows another level of containerization. Installation
is simple.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;curl -L https://get.rvm.io | bash -s stable
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The only annoyance is that the script modifies my bashrc and
bash_profile. I erased those edits, sourced the rvm initialization
file and installed a recent ruby.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;source ~/.rvm/scripts/rvm
rvm install 1.9.3
rvm use 1.9.3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Next install the gem packages for testing inside the rvm&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;gem install --no-ri --no-rdoc puppet rspec-puppet puppetlabs_spec_helper
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The next step was to use rspec-puppet to run my puppet class unit tests,
but after much debugging, it turned out the move to Puppet 3.0.0 broke
rspec-puppet, so I downgraded:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;gem install puppet -v 2.7.19
gem uninstall puppet -v 3.0.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now my rspec unit tests pass, but unfortunately just as slowly as
before. Looks like Ruby 1.9.3 didn’t speed things up much.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Dell R720 Server install</title>
   <link href="https://kscherer.github.io//2012/09/12/dellr720-remote-install"/>
   <updated>2012-09-12T00:00:00+00:00</updated>
   <id>hhttps://kscherer.github.io//2012/09/12/dellr720-remote-install</id>
   <content type="html">&lt;p&gt;Got a brand new Dell R720 server to install recently. The task sounded
simple enough: Install CentOS 6.3 x86_64 on the machine as quickly as
possible. The configuration of the server included 6 2TB drives which
will be used to store the code that will be shipped to the customer,
so RAID0 is not a good choice. A good place to start would be the RAID
configuration.&lt;/p&gt;

&lt;p&gt;The server comes with iDRAC7, which allows me to connect to the server
even though it is physically located over 3000km away. On Linux, the
iDRAC6 version of the VNC viewer did not work with arrow keys and a
crazy hack was necessary, described in detail at
&lt;a href=&quot;https://github.com/pjr/keycode-idrac&quot;&gt;pjr/keycode-idrac&lt;/a&gt;. This was fixed with
iDRAC7. Progress!&lt;/p&gt;

&lt;p&gt;Out of the box, Dell grouped the 6 disks into a RAID5 disk group, but
on top of that were 5 2TB and 1 80GB virtual disks on this disk
group. Further research showed that the BIOS cannot boot partitions
larger than 2TB and that many older file systems cannot handle disks
larger than 2TB. But newer technologies like
&lt;a href=&quot;http://en.wikipedia.org/wiki/GUID_Partition_Table&quot;&gt;GPT&lt;/a&gt; and
&lt;a href=&quot;http://en.wikipedia.org/wiki/Ext4&quot;&gt;ext4&lt;/a&gt; can theoretically
handle this. Let’s give it a whirl.&lt;/p&gt;

&lt;p&gt;In the RAID controller BIOS, the 6 virtual disks are deleted and one
massive 9TB disk is created. This disk will need to be booted using
UEFI.&lt;/p&gt;

&lt;p&gt;Next step, go into the system setup. The BIOS now uses the latest
&lt;a href=&quot;http://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface&quot;&gt;UEFI&lt;/a&gt;
and the arrow keys that worked in the text mode BIOS no longer
work. However, after pressing the Num Lock key, the UEFI options can be
navigated using the keyboard. This is very annoying, but workable.&lt;/p&gt;

&lt;p&gt;Under Boot Settings the boot system is switched to UEFI. To install
CentOS from CD, the virtual media option on the iDRAC uses a special
USB device to attach the CentOS 6.3 install iso as a CD on the
server. After waiting almost 5 minutes for the server to reboot, the
UEFI boot from Virtual CD fails. Oh well, back to the old BIOS booting.&lt;/p&gt;

&lt;p&gt;This means a smaller boot disk will be necessary. Back into the RAID
controller BIOS, delete the single disk and recreate one 500GB disk
and a larger 8.5TB disk. This time, the Virtual CD is found and the
install proceeds as usual.&lt;/p&gt;

&lt;p&gt;But upon reboot, the system hangs and does not boot! I used the
CentOS netinstall CD as a rescue disk to find out that the kickstart
had decided to install onto the large data drive, which the BIOS cannot
boot! To tell kickstart to ignore the large data drive it was necessary
to add:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ignoredisk --drives=/dev/disk/by-path/pci-0000:03:00.0-scsi-0:2:1*
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;to the partition information in the kickstart and reinstall. This time
the install is successful and the server reboots into CentOS 6.3 and
puppet does the base configuration.&lt;/p&gt;

&lt;p&gt;Next step is to prepare the large data drive to hold the data. One of
the irritations of using ext and many other filesystems is that fsck
is only possible when the disk is offline. For servers, this means
that fsck never gets run. Occasionally the server is rebooted and an
fsck is started. But this is usually the worst possible time to do
it. The result is that fsck is completely disabled and fingers are
crossed. One potential solution is e2croncheck.&lt;/p&gt;

&lt;p&gt;It uses lvm read only snapshots to run fsck on a disk without making
it necessary to take the disk offline. There are a couple caveats of
course:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;There is a performance impact of running fsck on a disk&lt;/li&gt;
  &lt;li&gt;The disk must obviously be on an lvm partition&lt;/li&gt;
  &lt;li&gt;There must be free space in the volume group to hold any writes that
are done while the snapshot is active&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this and alignment issues in mind, the following commands were
used to create the lvm and ext4 partition:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;parted -s /dev/sdb mklabel gpt
parted -s /dev/sdb mkpart data ext2 1M 100%
pvcreate --dataalignment=1M -M2 /dev/sdb1
vgcreate vg /dev/sdb1
lvcreate -L 8T -n git vg
mkfs.ext4 -m 0 -E stride=16,stripe_width=80 /dev/mapper/vg-git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The RAID controller uses a 64KB stripe unit, so I use a partition offset
and a data alignment of 1MB to ensure all blocks line up on
boundaries. To help ext4 work within the RAID effectively,
stride = 64KB stripe unit / 4KB block = 16, and
stripe_width = stride * (6 disks - 1 parity = 5 data disks) = 80.&lt;/p&gt;
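
&lt;p&gt;The values can be verified after the fact; a quick check with
tune2fs:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# the superblock records the stride and stripe width
tune2fs -l /dev/mapper/vg-git | grep -i 'stride\|stripe'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;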

&lt;p&gt;The server is now finally ready to be used.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Download CBC Radio stream</title>
   <link href="https://kscherer.github.io//linux/2012/04/20/download-unofficial-cbc-podcast"/>
   <updated>2012-04-20T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//linux/2012/04/20/download-unofficial-cbc-podcast</id>
   <content type="html">&lt;p&gt;I listen to CBC Radio a lot. I often find the quality of the show
&lt;a href=&quot;http://www.cbc.ca/ideas&quot;&gt;Ideas&lt;/a&gt; superb. Recently there was a show
called “All in the Family” which introduced me to the
&lt;a href=&quot;http://www.cdc.gov/ace/index.htm&quot;&gt;ACE&lt;/a&gt; (Adverse Childhood
Experiences) study. The results of this study are worthy of another
blog post, but this post is about something technical. Sorry.&lt;/p&gt;

&lt;p&gt;I wanted to download this show as a podcast so I could share it. I
went to the Ideas website; the show was not available as a podcast, but
there was a link to listen to the current show. That link brings up a
Flash audio player which plays the show.&lt;/p&gt;

&lt;p&gt;At this point I knew that since the audio was being played on my
computer, I could capture it. First I looked in the page source for an
obvious link, but the code was so obfuscated I gave up quickly.&lt;/p&gt;

&lt;p&gt;Next I considered recording the audio while it was playing, but I did
not want to tie up the computer for an hour.&lt;/p&gt;

&lt;p&gt;After a few Google searches I stumbled across some posts that
mentioned using UrlSnooper to figure out the location of a
stream. UrlSnooper is a Windows program, but there was a mention of a
Linux program called &lt;a href=&quot;http://ngrep.sourceforge.net/&quot;&gt;ngrep&lt;/a&gt;. What a
great tool! I ran the following:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;sudo aptitude install ngrep
# match the word 'get' (case-insensitive) in TCP traffic to port 80,
# printing each line of the payload as it goes by
sudo ngrep -W byline -qilw 'get' tcp dst port 80
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then I used FireFox to open the audio stream and saw the following in
a long stream of output in the console where I ran ngrep:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;T xxx.xxx.xxx.xxx:35064 -&amp;gt; 64.208.5.41:80
GET /maven_legacy/thumbnails/ideas_20111213_27203_uploaded.mp3 HTTP/1.1.
Host: thumbnails.cbc.ca.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An mp3 on thumbnails.cbc.ca?? I tried:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wget http://thumbnails.cbc.ca//maven_legacy/thumbnails/ideas_20111213_27203_uploaded.mp3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Done! I had the mp3 of the Ideas show. I love Linux.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Rebuild MD3000 RAID0</title>
   <link href="https://kscherer.github.io//md3000/2012/04/13/rebuild-md3000-raid0"/>
   <updated>2012-04-13T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//md3000/2012/04/13/rebuild-md3000-raid0</id>
   <content type="html">&lt;p&gt;Some background. We do lots of coverage builds of the Wind River Linux
products and we have a blade cluster attached to various SAN devices
which hold the temporary build data. The builds are very CPU and disk
intensive and push the limits of the SAN devices. The default
configuration for an MD3000i SAN is one large RAID5 group, but this
results in one unused RAID controller and unnecessary redundancy for
our case of temporary build files.&lt;/p&gt;

&lt;p&gt;So I reconfigured the MD3000i to have 2 RAID0 disk groups, one for
each controller. This keeps both RAID controllers busy. Within each
RAID group I made a virtual disk for each host, i.e. 2 disks per host
in total. Each host then spreads its builds evenly across its two
disks. Each builder runs 4 simultaneous builds, 2 on each disk.&lt;/p&gt;

&lt;p&gt;Because the MD3000i has redundant controllers, the multipath driver
is necessary. I also found that I needed to create an alias for the wwid
of each iSCSI disk to avoid naming problems when /dev/mapper/mpath0
unpredictably became mpath3.&lt;/p&gt;
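
&lt;p&gt;A minimal sketch of such an alias stanza in /etc/multipath.conf; the
wwid below is illustrative, not one of the real ones:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;multipaths {
    multipath {
        wwid  36001c230d080d00fd4e5f6a7b8c9d0e1   # illustrative wwid
        alias ba2
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;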

&lt;p&gt;The only problem with RAID0 is that when a physical disk dies, the
whole disk group dies with it. Since the build data is temporary,
nothing of value is lost, but I thought I would capture the rebuild
process here for future reference.&lt;/p&gt;

&lt;p&gt;First, log into the Dell storage manager.&lt;/p&gt;

&lt;p&gt;Go to Modify &amp;gt; Delete Disk Groups and delete the failed RAID0
virtual disks and their group.&lt;/p&gt;

&lt;p&gt;Now create another RAID0 disk group and the virtual disks. Make sure
the names of the virtual disks and the host mappings are the same as
before. Make sure the preferred owner for the two disks used by the
same host is different, so both controllers share the load.&lt;/p&gt;

&lt;p&gt;Log into the machine. Unmount the failed drive and remove it from
multipath. I use the device ba2 (buildarea2) in these examples.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;puppet agent --disable
umount /ba2
multipath -f ba2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Run multipath -d (dry run) to get the wwid of the new disk.&lt;/p&gt;

&lt;p&gt;Edit /etc/multipath.conf to replace the now-invalid wwid with the
new one. Also change the wwid in the puppet class for this host; I am
using extdata in puppet to manage the contents of the
/etc/multipath.conf file.&lt;/p&gt;

&lt;p&gt;Run multipath -d to verify that the multipath alias is working, then
run multipath (no arguments) to actually create the device.&lt;/p&gt;
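
&lt;p&gt;Roughly, the sequence; the -ll listing at the end is just a
convenient way to confirm the alias and paths:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;multipath -d    # dry run: the new disk should show up under its alias
multipath       # create the device maps for real
multipath -ll   # confirm /dev/mapper/ba2 exists with healthy paths
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;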

&lt;p&gt;Now create the partitions. Ensure that the partition is RAID stripe
aligned. Use gpt because of the large partition sizes (and it is the
newer standard).&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;parted /dev/mapper/ba2 mklabel gpt
parted -s /dev/mapper/ba2 mkpart ba2 ext2 1M 100%
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
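
&lt;p&gt;parted can verify the alignment directly; a quick sanity check along
these lines, for partition 1 on the ba2 device:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;parted /dev/mapper/ba2 align-check opt 1   # should report: 1 aligned
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;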

&lt;p&gt;Format the drive. On CentOS 5 I have no choice but to use ext3. Use
a stride of 32, which corresponds to the 4K block size x 32 = 128K RAID
stripe size.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mkfs.ext3 -m 0 -E stride=32 /dev/mapper/ba2p1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For some reason the format blazes through blocks 0 to 9000 and then
slows to a crawl.&lt;/p&gt;

&lt;p&gt;Final step: turn off the periodic fsck checks so reboots don’t
hang. I don’t like it, but keeping the builders offline for hours to
run fsck is not acceptable. It is easier just to reformat the drive
regularly. Again, the data on these drives is temporary and easily
replaced.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;tune2fs -i 0 /dev/mapper/ba2p1   # -i 0 disables the time-based check interval
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
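
&lt;p&gt;Note that -i 0 only disables the time-based interval; the
mount-count trigger is separate. Disabling it as well is not part of
the procedure above, but would be a one-liner:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;tune2fs -c 0 /dev/mapper/ba2p1   # disable the mount-count-based check
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;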

&lt;p&gt;With the disk recreated, I can run puppet to rebuild all
infrastructure necessary for the coverage builders.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;puppet agent --enable
puppet agent --test
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;All done. It takes a couple of hours. If lots of disks fail, then
this takes too long. I never did a test with RAID5 in this
configuration to see if the performance is acceptable. The builds are
very sensitive to I/O bandwidth and are running well with RAID0, so I
may not have time to run a comparison.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Wisdom from David TT</title>
   <link href="https://kscherer.github.io//hobbies/2012/03/27/wisdom-from-david-tt"/>
   <updated>2012-03-27T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//hobbies/2012/03/27/wisdom-from-david-tt</id>
   <content type="html">&lt;p&gt;I play violin with the
&lt;a href=&quot;http://ottawachamberorchestra.com/&quot;&gt;Ottawa Chamber Orchestra&lt;/a&gt; which
is an amateur orchestra. Our conductor is David Theis-Thompson who is
also a professional violinist/violist with the NAC orchestra. He is
one of the best conductors I have ever been lucky enough to play
with. I hope to share some of his wisdom here:&lt;/p&gt;

&lt;p&gt;“Remember that in Edvard Grieg’s music, there are always trolls”&lt;/p&gt;

&lt;p&gt;“You have to play it just like Brahms, even if it is Schumann”&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Toi+Moi - Gregoire and Star Akademie</title>
   <link href="https://kscherer.github.io//culture/2012/03/27/toimoi"/>
   <updated>2012-03-27T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//culture/2012/03/27/toimoi</id>
   <content type="html">&lt;p&gt;My daughter attends a French Catholic school and recently the
administration has chosen
&lt;a href=&quot;https://www.youtube.com/watch?v=W5zMPmu-EJw&quot;&gt;Toi+Moi&lt;/a&gt; as a theme
song. My daughter has been singing it almost non-stop. It is wonderful
to see her so excited and I am enjoying her version of the song. One
evening she wanted to hear the song at home, but we do not have a
TV, so I went to Google and the first result was the Wikipedia
entry for the song. From there I found a link to the official video by
&lt;a href=&quot;https://www.youtube.com/watch?v=kOru9ITtVIg&quot;&gt;Gregoire&lt;/a&gt;. The song is
simple and catchy and I was finally able to understand some of the
lyrics that had been lost through the school PA system.&lt;/p&gt;

&lt;p&gt;The contrast between these two videos could not be more striking. The
original video has ordinary happy people and the song has dynamics and
a nice understated piano part. The album was funded by 347 people
using the music equivalent of Kickstarter, and 40 of them were invited
to appear in the video. The video invites everyone to join the
dance and the people are genuinely silly and happy.&lt;/p&gt;

&lt;p&gt;The Star Academie video is a pure celebrity-making machine. Lots of
makeup, glamorous clothing, carefully staged emotion, heavy bass,
zero nuance and predictable choreography. I know that many people like
it, but as a parent I much prefer my child to watch the original
version. The story of a bunch of strangers donating money to
help someone realize their dream is preferable to the unreality of a
competition manufactured by an entertainment corporation.&lt;/p&gt;

&lt;p&gt;I do not want to diminish the talents of the performers on Star
Academie. They are chasing their dreams as well. It is unfortunate
that those dreams are being exploited by the producers. This has been
the model for a long time. But the success of Gregoire shows that a
new option is now possible.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Introduction</title>
   <link href="https://kscherer.github.io//2012/03/23/intro"/>
   <updated>2012-03-23T00:00:00+00:00</updated>
   <id>https://kscherer.github.io//2012/03/23/intro</id>
   <content type="html">&lt;p&gt;This is my first post using Jekyll. Most of the blog aesthetics was
copied from &lt;a href=&quot;http://julianyap.com/&quot;&gt;Julian Yap&lt;/a&gt;. Thank you Julian!&lt;/p&gt;

&lt;p&gt;I am preparing to upload a multi-part series on setting up a Xen
cluster with over 30 different flavours of Linux.&lt;/p&gt;
</content>
 </entry>
 
 
</feed>
