Jeff McAffer

Packaging license text: It's the right thing to do

Fri, 05 Apr 2019 11:11:11 +0000

Packages are wonderful! They group together discrete bundles of function in “fun-sized” chunks that anyone can reuse. Unfortunately, many package ecosystems don’t do a great job of including the text of the package’s license in the package itself. By omitting this key piece of info, they make it extremely difficult for their community to comply with their license terms.

Here’s a look at some data from ClearlyDefined:

Ecosystem	% with license text
Pod	76
Crate	59
Gem	56
npm	55
PyPI	30

This is preliminary data and there may be some anomalies in the measurements but given there are ~350K package versions in the dataset, the general point remains the same — license and attribution material in packages is pretty sparse.

This is an issue because even the simplest open source license (e.g., MIT) has terms that require consumers to do something. The most common is including the package’s copyright(s) and license text when distributing the package. See my post on open source compliance for more detail.

The role of package management

Package managers are fantastic at making it super simple for direct and indirect consumers to get packages. This enables package distributors (other open source projects, product teams) to simply get the required packages, bundle them up with their own code, and ship. If these packages all include the required attribution items, it’s relatively easy for them to generate an explicit NOTICE file since all the raw materials are available in the packages themselves. Nice and easy.
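To make the "nice and easy" case concrete, here is a minimal sketch of assembling a NOTICE file when each package already carries its license text and copyright lines. The `Component` shape and the `build_notice` function are hypothetical names made up for illustration; they are not part of any package manager or of ClearlyDefined.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Attribution data assumed to have been read from a package (hypothetical shape)."""
    name: str
    version: str
    license_text: str
    copyrights: list[str] = field(default_factory=list)


def build_notice(components: list[Component]) -> str:
    """Concatenate one attribution section per component, sorted by name and version."""
    sections = []
    for c in sorted(components, key=lambda c: (c.name, c.version)):
        lines = [f"{c.name} {c.version}", "-" * 40]
        lines.extend(c.copyrights)           # copyright statements, if any were found
        lines.append("")
        lines.append(c.license_text.strip())  # full license text from the package
        sections.append("\n".join(lines))
    return "\n\n".join(sections) + "\n"
```

The point of the sketch is that everything it needs is local to the packages; there is no lookup step.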

If the packages don’t include the license and copyright text, then what’s the consuming team to do? They have a set of, perhaps 1000, packages but no attribution material. Assuming they know the license of the package (presumably they do; otherwise they can’t be sure of their right to use the package in the first place), teams can go to the Open Source Initiative or some other source and get a copy of the canonical license text. Unfortunately, this does not address the attribution requirements of many licenses (e.g., the MIT license requires the reproduction of the “above copyright notice”, presumably for whatever source files the license was originally on). Sigh.

Teams can rifle through the files in the package looking for copyright statements. But many packages are “built” meaning that they do not include the original source so may not have any (or at least not all) copyright info.
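For a sense of what that rifling looks like, here is a rough sketch that scrapes copyright statements out of file contents. The regular expression is an assumption about how such statements are usually written (the word "copyright", an optional "(c)" or "©", then a four-digit year); real scanners are far more thorough.

```python
import re

# Assumed pattern: "copyright", optional "(c)" or "©", then a 4-digit year.
COPYRIGHT = re.compile(r"copyright\s+(?:\(c\)|©)?\s*\d{4}.*", re.IGNORECASE)


def find_copyrights(text):
    """Return the distinct copyright statements found in a file's text, sorted."""
    return sorted({match.group(0).strip() for match in COPYRIGHT.finditer(text)})
```

On a minified or transpiled package this will often return nothing at all, which is exactly the problem.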

If the consumer is hip to the latest advances in this space and knows about ClearlyDefined (a crowd-sourced effort to discover, curate and disseminate exactly this kind of info), there’s a good chance they can get the required info. That’s cool; they can even use ClearlyDefined’s Notice file sharing mechanism (Share a workspace list of components as a notice file) to create the required attributions to include in their product. But where did that info come from? The package still did not have the license text in it.

Most typically it either came from ClearlyDefined weaseling around to find the precise source commit that matches the package version and then looking in there for the license text, or from some sort of human curation, or as outlined earlier, falling back to the canonical license text and scraping files for copyrights. Not much fun.

What to do?

Fortunately, the solution is relatively simple. Including the license text in the package is often as easy as adding a line to the package build/publish script to identify an additional file (e.g., LICENSE) to be included in the packaged output. Package publishers can do this today.

Even better would be updating the package publishing tools themselves to check packages and ensure they have a license file and nudge or remind users where the file is missing or cannot be identified. To that last point, it would be better still if the package metadata allowed publishers (or the package manager) to explicitly identify the license file(s) in the package. This would remove all ambiguity.
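As a sketch of the kind of check a publishing tool could run, the following looks for a license file among the paths in a package archive. The list of file names is an illustrative assumption, not an exhaustive or official set, and `find_license_files` and `check_tarball` are hypothetical helpers rather than anything an existing tool provides.

```python
import re
import tarfile

# File names commonly used for license text (illustrative, non-exhaustive assumption).
LICENSE_NAME = re.compile(r"^(licen[sc]e|copying|notice)([._-].*)?$", re.IGNORECASE)


def find_license_files(paths):
    """Return the paths whose file name looks like a license file."""
    return [p for p in paths if LICENSE_NAME.match(p.rsplit("/", 1)[-1])]


def check_tarball(archive_path):
    """List candidate license files inside a .tar.gz package archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return find_license_files(archive.getnames())
```

A publish step could warn when the returned list is empty. Explicit metadata naming the license file(s) would make even this guesswork unnecessary.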

Some package managers are already there; Debian does a great job. Others are moving in this direction. npm warns users if their package does not identify their license using an SPDX license identifier. (Actually, it can be a license expression, but that’s for another post.) That’s great, and knowing the terms helps, but you still need the information demanded in the license, for example, the license text.

The NuGet team recently updated their nuspec metadata to allow identification of either an SPDX license expression or a license file. SPDX expressions are super useful but so, as we’ve seen, is the license text; this should not be an “or”. I’ve been chatting with the team to see how we can encourage more people to include the license text as well as spec a license expression. Let’s see what happens.

Wrap up

I’ve not done exhaustive research on which package managers promote the inclusion of license files but can say from the data we are seeing in ClearlyDefined, there is still room to improve.

If you do publish packages, please take a minute to check your packaging scripts and ensure they include the license file (you do have one, right?!) in the list of packaged files. It’s really easy and really simplifies the lives of consumers who are trying to comply with your license. Who knows, perhaps you are consuming some packages and would benefit from them including license text…

If you run or have influence over the package publishing tools or workflow for an ecosystem, foundation, project set, … please look at how you can nudge your users to include this important info in their packages (or just do it for them).

Moving to GitHub

Mon, 25 Mar 2019 11:11:11 +0000

After almost eight years at Microsoft, in a variety of teams in the developer space, I’ve decided to move on.

I’ve been incredibly lucky in my time at the company. I joined at a time when, while there was some open source happening, open source was still a very dirty ten letter word at the company. This was a bit of a leap for me having come from my “all open source, all the time” days in the Eclipse world. Not sure what I was really thinking but well, it happened.

The past

My first roles were deep in the architecture of Visual Studio and TFS/VSTS/Azure DevOps, trying to tease the code apart and modularize the development process. Turns out that componentizing a monolithic code base, some of which is 20+ years old, is a Hard Thing™. Some successes were had and over the years, things have dramatically improved. I learned a ton.

I moved on to spend time creating what is now the Azure DevOps Artifacts service, a facility which at its core is a massive, deduplicated file store. Petabytes of build drops, internal and cached NuGet, NPM, Maven, … packages, symbols and more. Garbage collected, scaled out, cached, mirrored, … the scale was just mind-boggling and the impact potential huge. What a fun set of technological challenges and a great team. I truly enjoyed that work and it’s been great watching it flourish.

All up, my favorite time at Microsoft has been the last few years running the Open Source Programs Office (OSPO). Shepherding and helping the company through one of its most profound transformations ever — from being the epitome of a proprietary software shop to being at the forefront of open source — has been a privilege, a thrill and a challenge.

Open source at Microsoft is now a normal thing. Tens of thousands of developers in the company are using open source in literally millions of places throughout our offerings. Open source management is an integral part of our engineering system, flagship offerings from k8s on Azure to the Edge browser in Windows are fundamentally based on open source, and we routinely release core technologies such as DotNet, VS Code, TypeScript, and the Calculator (don’t laugh, that’s actually a huge deal in many ways you’d never imagine). All unthinkable a few short years ago.

So when people ask, “Do they mean it?” (with reference to “Microsoft ❤️ open source”), there’s a long history to overcome but yeah, this is real, people.

Culture changes

All the above is amazing and gratifying. Even more so has been experiencing the culture change. Microsoft has gone from a siloed, “gun pointing” organization, to a much leaner, customer-focused, engineering-driven, “get shit done” organization. To be sure, there are still holdouts in the company (like soldiers marooned on remote islands not knowing the war is over) but the gig’s up and collaboration and open source won. They’ll find out.

This culture change has been super obvious to me in the Open Source Programs Office work. Our transformation to open has been a true team effort. I’m very proud of the OSPO team’s role in this change. They are a superb set of individuals and a fun team who never once let a day go by without some laughs or at least an Auth* problem. Beyond that, we built an amazing network of partners across Legal, Security, Marketing, and the product teams, all of whom have played an integral part in this success.

The Legal team deserves special mention here. Legal folks are sometimes maligned as roadblocks to open source advancement in a company. Frankly, that’s just an excuse for the engineering/business teams not taking enough time to listen to them, understand the issues, and make their case. Hands down, week in, week out, our interactions with the open source legal folks at Microsoft have been the best, most positive and productive interactions ever. Ever. Corporate open source folks, your legal team can be your best ally. Know them, listen, engage. They will help you past all manner of issues.

Orgs actually do matter

For a while there, the virtual team nature and success of our structure made me believe that our actual org structure didn’t matter so much. While that’s true at a certain level, the actual org structure can really help. Many OSPOs can be found in the engineering org. That’s good; it’s where they should be and is where Microsoft’s has always been. Oftentimes, however, the engineering org doesn’t really know what to do with the OSPO org-wise and it ends up being a wart on the side of a loosely aligned team whose exec gets open source and supports the initiative. That’s OK too. But it can be so much more.

We found that out when recently the Microsoft OSPO was reorg’d into a newly formed “One Engineering System” (1ES) team responsible for, well, the engineering systems used broadly across the company. This was awesome. In the 1ES team we found a family, a set of people earnestly looking to make the development process at Microsoft more modern, faster, more efficient. Open source and OSPO were a natural fit. Now, more than any time in the past, the OSPO team is surrounded by like-minded teams on whom they can lean for tooling, data, and initiatives. Progress is accelerating.

The change

With all that great stuff happening, why change?

There’s never a good time for disruptive change like this, but a couple of things have aligned to make this the right time.

First, Microsoft is well and truly on its way to being broadly “proficient” in its open source engagement. This means that the open source program can start shifting to drive “fluency” and “mastery”. That’s an exciting prospect and something that we’ve been looking forward to for more than a year. But then something else came up.

GitHub. There has been a substantive change in product and engineering energy we see coming from GitHub. They are releasing faster, making real changes that improve the user experience, and are paying real attention to how organizations do open source. I find that super attractive.

So, in the end I’m not going far. Starting in April, I’ll be at GitHub supporting an effort to enable Open Source @ Scale for everyone on GitHub. This is a huge opportunity to marry the learnings from my time at Microsoft’s OSPO and interactions with the great folks in the TODO group and in countless companies and governments over the years, with the energy and efforts afoot at GitHub.

To say the least, I’m stoked. (Yes, I said “stoked” and will endure exhaustive teasing from my kids, but that’s really the only single term I can come up with that captures the nuances of how I feel about this change.) The potential and opportunity of helping all organizations evolve their open source practice and the reality of making it happen at GitHub is too much to pass up.

For those who have been following some of my latest interests, you will know about ClearlyDefined. That project has been a passion of mine and, I believe, is critical to open source in the industry. I remain committed to the project and will continue to be involved as will the team at Microsoft. My hope is that this move actually accelerates interest and adoption.

One more change is that after, uhhh, many years of being some flavor of “dev”, I’m officially becoming a Product Manager. I’ll be reporting to Shanku Niyogi, SVP of Product, the then-Microsoft manager who brought me over to help start Microsoft’s OSPO in the first place and has now prompted me to move to GitHub. He gets it and is awesome to work with.

Wrap up

I can’t say (‘cause I don’t know) what this all means in practice — I haven’t started working there yet. I can say that I’m looking forward to working even more deeply with companies, governments and other organizations who are driving their open source mission. I want to learn from you and enable your goals. Feel free to let me know how we can do that.

Open source license compliance distilled

Mon, 25 Feb 2019 11:11:11 +0000

Licenses are a fundamental part of open source but the challenge of finding and complying with the licenses can be daunting and mind-numbing. Here I summarize the practical license compliance landscape and set the scene for follow-up posts on how we can do better — to everyone’s benefit.

Generally speaking there are only three obligations that come with open source licenses:

Attribution — Most licenses require that you attribute the copyright holders of the projects you use in your software. Typically, this is done in a NOTICE file included with your system. For many licenses, this is the only obligation.
Component Source disclosure — Weak Copyleft licenses require you to make available the source of the components you use or modify such that the end user of your system can rebuild and replace the open source in your system.
Full-Program Source disclosure — Strong copyleft licenses require you to also release the code for your entire program as similarly licensed open source in certain scenarios.

There can be other details, but the vast majority of situations involve a subset of these three obligations. It’s worthwhile to note that pretty much all of the obligations only kick in if you actually distribute the open source to others — using the code internally, even to run a customer-facing service, does not trigger the terms. Network or server licenses close that gap and have terms that trigger for any customer-facing use.
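One way to picture the three obligations and the distribution trigger is as a small lookup. The table below is a coarse, illustrative classification of three SPDX identifiers that I am assuming for this sketch; it is not legal advice, it is far from complete, and it deliberately ignores network/server licenses.

```python
from enum import Enum


class Obligation(Enum):
    ATTRIBUTION = "attribution"
    COMPONENT_SOURCE = "component source disclosure"
    FULL_PROGRAM_SOURCE = "full-program source disclosure"


# Coarse, illustrative classification (an assumption for this sketch only).
LICENSE_OBLIGATIONS = {
    "MIT": {Obligation.ATTRIBUTION},
    "LGPL-2.1-only": {Obligation.ATTRIBUTION, Obligation.COMPONENT_SOURCE},
    "GPL-3.0-only": {
        Obligation.ATTRIBUTION,
        Obligation.COMPONENT_SOURCE,
        Obligation.FULL_PROGRAM_SOURCE,
    },
}


def obligations_for(licenses, distributed):
    """Union of obligations for the given licenses; none if nothing is distributed."""
    if not distributed:
        return set()
    result = set()
    for spdx_id in licenses:
        # A KeyError here flags a license this sketch has not classified.
        result |= LICENSE_OBLIGATIONS[spdx_id]
    return result
```

In practice the "certain scenarios" for strong copyleft depend on how the code is combined, which is where this kind of table stops being enough.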

The landscape has gotten a bit more interesting lately with the advent of the Server-Side Public License (SSPL) and various Commons Clauses springing up. I’m going to set those aside here as they are not (currently) considered open source licenses so fall under whatever commercial procurement and compliance process you may have.

Compliance challenges

On the surface, compliance seems pretty straightforward — track what you’re using, get the licenses, read the terms, follow the (relatively few) instructions. In practice, it’s unfortunately not that easy.

What open source are you using?

Try this experiment: find a dev near you and ask them what open source is in their system. Some will, perhaps with a bit of digging, come up with a complete list. Others will mention only the top-level components. Still others will just shrug. Today with package managers like npm, and containers, it’s trivial for a dev to bring in thousands of open source components without even knowing it.

Some package managers make it easy to get an inventory of the open source in use by providing a lock file that lists all the packages or having a command line tool option to dump the inventory. Others are no help at all. What’s more, versions change all the time and since licenses and copyright also change between versions, it matters what version you’re using.
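As an example of the lock file route, here is a sketch that pulls a name/version inventory out of an npm lock file. It assumes the `packages` map used by lockfileVersion 2 and 3, where keys are install paths such as `node_modules/a/node_modules/b`; older lock files use a different layout that this does not handle. `inventory_from_lock` is a hypothetical helper name.

```python
import json


def inventory_from_lock(lock_text):
    """Return sorted (name, version) pairs from an npm lock file's "packages" map."""
    lock = json.loads(lock_text)
    found = set()
    for path, entry in lock.get("packages", {}).items():
        # Skip the root project ("") and anything not installed under node_modules.
        if "node_modules/" not in path or "version" not in entry:
            continue
        # The package name is whatever follows the last "node_modules/" segment.
        name = path.rsplit("node_modules/", 1)[-1]
        found.add((name, entry["version"]))
    return sorted(found)
```

Note that the inventory is keyed by version as well as name, since the same package can show up at several versions in one tree.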

And of course, if you’re using source directly, you need to manually track the origin of the source using one of several vendoring approaches.

What’s the license?

Given an inventory of open source at play, you need to figure out the related licenses. Many (many) repos on GitHub, packages in Maven etc. do not actually state a license. Sometimes they have one in the repo but forgot to put it in the package. Other times the project team just didn’t know or see the point.

Basically, if there is no license then you have no rights to use the code. Either use something else or contact the project team and ask them to put a license on the code. Fortunately, it turns out that folks are generally quite keen to have their code used and so are happy for the pull request that adds a license.

Assuming you can find a license, the so-called envelope license problem may still be an issue. This is where the envelope (top-level of a repo, package metadata, …) identifies a license but the actual code has additional/different licenses. Frankly, this is just sloppiness on the project team’s part — they’re not really respecting the licenses of the code they’re using and are not enabling you to do so. Unfortunately, this is ultimately your problem as a downstream user — you are on the hook to comply because you are distributing the component.
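The envelope problem amounts to a comparison between what the metadata declares and what the files say. Here is a minimal sketch, assuming some scanner has already produced a mapping from file path to detected SPDX identifier (or `None` where nothing was detected); `envelope_mismatches` is a hypothetical name.

```python
def envelope_mismatches(declared, file_licenses):
    """Return the files whose detected license differs from the declared one.

    file_licenses maps a file path to a detected SPDX identifier, or None
    when no license was detected in that file.
    """
    return {
        path: detected
        for path, detected in file_licenses.items()
        if detected is not None and detected != declared
    }
```

An empty result does not prove the envelope is right, only that the scanner found nothing that contradicts it.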

The other major topic is snippets. Devs are lazy (in a good way) and happily copy and paste code from other projects or, famously, from Stack Overflow (there is even a book on that!). The challenge here is that those same devs often neglect to include the license of the original code or even cite where it came from. Snippets are pernicious buggers that are notoriously hard to find.

Understanding the terms

OK, so you know what you’re using and have found the licenses that apply, understanding the terms is the next step. Fortunately, there are numerous resources available to help there. tl;drLegal has good classifications of the licenses and obligations. GitHub helps you with similar info right on the repo itself. Just click the license indicator at the top right of the repo view (there is one, right? The project has a license, right?) to get license details as shown below. You want the Conditions column in this context.

There are some cases where some interpretation of the license and the usage scenario is required — in particular around copyleft disclosure requirements for different licenses and architectures. You’ll probably want some legal advice on those.

Who are the copyright holders?

With all the license terms sorted, you likely need to attribute the various copyright holders. Here we see open source projects all over the map as to if and how copyright holders are captured. Many licenses say something to the effect of “you must attribute the above copyright holders”. There might be one copyright holder for the entire project, or several per file. Some projects track people in a NOTICE, CONTRIBUTORS, or AUTHORS file. A note here, just ‘cause someone wrote the code does not mean they hold the copyright. I work for Microsoft and they get the copyright on everything I do at work in exchange for paying me. I’m fine with that but you need to attribute Microsoft not me.

Where’s the source?

This one is the bane of my existence. As mentioned in the Source disclosure discussion above, some licenses require you to make buildable versions of the open source code available. Obviously if you’re using source and building it yourself, you have the source needed to do the disclosure. Users of packages however generally do not have that luxury; they are just getting binaries via a package manager. Somehow you have to go from knowing you’re using Foo 1.0 to knowing the corresponding commit in some repo. Yes, the commit matters.

Copyrights and licenses change over time and, well, you’re supposed to disclose the source and attribute copyright holders for the component you’re actually using, not some random version. That’s harder. If you’re lucky the project team put a Git tag that matches the version number of the package on the repo, or squirreled away the commit hash somewhere. Some publishing workflows like the one in npm make this really easy. Others, well, don’t.
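The lucky case, where a Git tag matches the package version, can be sketched as below. The tag spellings in `candidate_tags` are assumptions about common conventions, not a rule any ecosystem enforces, and both function names are hypothetical.

```python
def candidate_tags(name, version):
    """Tag spellings projects commonly use for a release (assumed conventions)."""
    return [version, f"v{version}", f"{name}@{version}", f"{name}-{version}"]


def match_tag(name, version, repo_tags):
    """Return the first repo tag matching a candidate spelling, or None."""
    available = set(repo_tags)
    for tag in candidate_tags(name, version):
        if tag in available:
            return tag
    return None
```

When this returns `None`, you are back to hunting for a commit hash in the package metadata or in the project’s history.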

And just for the record, most packages are not source; they have been compiled or modified in some way prior to packaging. For example, an npm package looks like it might “be source” but these days many are transpiled from TypeScript, ES* with Babel, CoffeeScript, … Further, minifying and obfuscating is common and can strip copyrights and licenses, and create code that is itself not effectively modifiable. So, yes you may be able to read it but hey, I used to read 6502 machine code; that does not make it source or suitable for license compliance.

Knowing the source for what you’re using is not just a compliance topic. Source is of course needed if you want to build the open source you use, run additional IP or security scans, or have the code stashed away in the event that the repo disappears and a fix is needed.

Mono repos — one repo that produces many, sometimes hundreds of, packages — are an interesting challenge. You need to know not only the repo, but also where in the repo the source for a package lives. Sometimes this is explicitly called out, other times it’s implicit in a build configuration.

We spend a lot of time trying to find the matching source for the packages we use.

Wrap up

Well, that’s about it for the basics of open source license compliance. Not a great picture when laid out like that — thousands of components, hunting for information, license interpretation, … All up you need to use judgement, assess your risk tolerances, and understand the norms and intents of the communities you engage. Sometimes it can be challenging but, in the bigger picture, it’s typically a small price to pay for the great function you’re getting.

Fortunately, there are some efforts to make much of this easier. In the next few posts on this topic I’ll outline some open source work aimed at helping here.

Disclaimers

I am not a lawyer, I am not your lawyer, I have never played a lawyer on TV, and you can’t use my lawyer. That is, get your own legal advice.

Open source license compliance gone amok?

Sun, 10 Feb 2019 11:11:11 +0000

Open source is awesome! So many great projects out there. Individuals, teams, and companies pouring their passion into code that folks can use and evolve. Licenses are a fundamental part of open source but the reality of finding and complying with the licenses can be daunting and mind-numbing. As an industry we have not come to terms with this challenge and prefer either denial or throwing money at the problem. I think there is a better way.

The landscape today

Open source license compliance is not a very sexy topic and I’m certainly not here to convince you otherwise. It’s full of mind-numbing accounting, pro forma actions, and cat herding. At the same time, it’s a key part of the open source ecosystem.

Licenses set out the ground rules for using a piece of content — code, images, doc, … They give you the right to use and distribute the content, and list the associated obligations. The license used in a community sets the core tone of how they operate. Fundamentally licenses make open source work.

Somewhere along the way however, licenses and license compliance built up a mythology and a stench of FUD. With open source hitting the early and late majorities in various domains, license compliance has become a mainstream procurement topic. For traditional organizations — governments, regulated industries, manufacturing — this FUD is taking on a whole new status and driving organizations to take and demand heroic actions.

At the same time, the scale of open source production and consumption has exploded. As I write, there are about 50M public repos on GitHub with about a third of them created in the last year. With package managers like NPM it’s trivial to pull in a thousand packages with a simple command. At Microsoft we currently track millions of integrations of hundreds of thousands of open source components across thousands of products. That’s a lot and we are not alone.

Microsoft open source use scale

Given the complexities around compliance, it’s not surprising that a whole industry has sprung up around figuring out which open source you have, tracking it, getting data about it, and providing workflows to clear and approve open source use. And it’s a big industry with many players both commercial and open source, and big dollars — Black Duck was acquired in December 2017 for half a billion dollars. Companies regularly spend millions of dollars a year on compliance tools, data, and processing.

This is not healthy. It’s driven by fear.

It’s a whole bunch of friction that wastes time and resources that could otherwise go into working on the projects themselves (e.g., improving security) or creating new ones. It inhibits adoption of projects and limits deeper engagement.

Can we do better?

For sure folks need to honor the terms of the licenses — IMHO that’s a given — but somehow, we as a community and an industry need to make that easier, normalized and largely automated. We need to do better here.

Fortunately, times are changing and there is a raft of open source efforts in this space as well as some changing community norms and perspectives.

I don’t have all the answers but do have some ideas. This is the first in a series of posts where I’ll share those thoughts and hopefully hear from you on topics such as:

Distill open source license compliance (coming soon…)
Look at the industry around compliance (coming soon…)
Show how open source projects themselves play a role (coming soon…)
Ultimately sketch an open source solution to this open source problem (coming soon…)

What do you think? How much effort are you putting into compliance? As an open source project team how do you think about compliance in your community? Do you care?

Open source engagement in organizations

Thu, 07 Feb 2019 11:11:11 +0000

Companies, governments, and other organizations big and small are working with open source to achieve their goals. Teams range from barely considering it to betting their whole business on open source. Putting some structure on this spectrum has helped me think about and evolve Microsoft’s open source program. I’d love to hear if you find it useful, how, or why not.

What’s in a name?

When I first started thinking/talking about this, I used the term maturity with people. That unfortunately implies a judgement and it turns out that people don’t want to think of themselves (or be thought of) as immature. Most of the terms that came up had that same characteristic: sophistication, evolution, depth, … The antonyms are off-putting. Fact is, there is no one right answer here; there is nothing inherently better or worse about being at a particular point on the spectrum.

That led me to engagement. We really are just talking about what kind, what role, how much, … your engagement with open source plays in your organization. Naming is hard. I’d love to hear alternatives.

Why bother?

There has been a lot of churn in the community lately with new licensing models, mergers/acquisitions, and many, many new players. Reflecting on how and why organizations engage in open source work helps folks on the inside plot a course, make plans, achieve goals, and helps folks on the outside of an org understand what the org is doing and what they might do in the future.

Remember, this is not about the quality of anyone’s open source code, or community, or the technology at all. Rather it’s about how an organization thinks about and “does” open source — what role does it play in their operations?

The model

The model below captures some discrete states of open source engagement. Most teams are simultaneously at multiple engagement stages. That’s fine. Being proficient is more than adequate for some, while others seek to become masters. You may also want to mix and match characteristics from different stages. Go for it. Make it your own. The key is that your level of engagement matches your organizational goals, whatever they might be.

Microsoft itself is all over the map in how different teams participate in open source. This model helps us understand that and intentionally make changes. Or not.

A great way to use the model is to take each characteristic and test how it feels in the following phrases:

“My team’s approach to open source is _______”
“The value of open source to my team is _______”

Often these will cluster and give you an overall sense of your engagement. Characteristics where you self-assess lower than others, or where you hope/need to be, represent areas for additional effort.

Denial

Denial – Somehow open source does not apply to your domain or is the wrong approach.
Prevention – Put up technical, legal/process, or regulatory barriers to considering open source.
Countering – Attack open source with FUD

Tolerant

Limited – Use grudgingly allowed in pockets
Experimental – Some early adopters deeply engaged in isolated areas. Some releasing.
Ad hoc – Localized processes and policies. Wild west to locked down. Inconsistent outcomes.
Fearful – Limit risks. Sequester teams. Tightly scope engagements.
Not rewarded – No career incentives. Even disincentives for undertaking “risky” behavior.

Hype

Silver bullet – Open source is going to transform the company!
Marketing – All the cool kids are doing it. We want to be cool too.
Recruit/retain – Emphasis on high profile, high volume “open source” hires
Incoherent – Engagement not coordinated, localized or no policies/processes

Proficient

Systematic – Central policies and processes around legal and security topics
Tooled – Tooling in place to track and guide open source engagement
Broad – All teams are free to engage and understand the “rules of the road”
Engaging – Work with communities, contribute fixes/features
Efficient – Seen as a valuable tool for “time to market”

Fluent

Value – The business understands the value that use/release does and does not bring
Fundamental – Bet on using or releasing open source for core capability
Rationalized – Policies and processes are continually reassessed and automated
Open – Technical and process discussions default to open
Healthy – Engagement health is integral to engineering/business reporting
Rewarded – Open source engagement explicitly recognized as a valued activity

Mastery

Integral – Open source is integral to business model from the beginning
Liberating – Business understands its true value-add and builds on that with open source
Disruption – Open source as a means of disrupting and quantum innovation
Shared control – Foundations, community initiatives, joint projects, “coopetition”
Proactive – Engage broadly to improve quality, security, function as a goal in itself

Microsoft and the model

It’s no surprise that you can plot Microsoft’s course in open source quite closely in this model. Not so many years ago we were in denial about the movement if not actively countering it. We then transitioned to concurrently being tolerant and driving the hype. We did a few experiments here and there, and went all in on some projects just to see what would stick.

Today I think we are getting pretty proficient as a company. We have solid policies and processes with automation throughout our engineering systems tracking millions of uses of open source. Some 20,000 Microsoft folks have some work activity on GitHub and overall the system is pretty efficient.

Of course, our engagement is not uniform. Some teams are still catching up and others have sprinted ahead to fluency and a sprinkling of mastery. This is a never ending process of revising business and technology strategies, evolving thinking, working with communities. It’ll be very interesting to see where we are in a year.

Wrap up

I’d love to hear what you think of this model and if/how it relates to your organization or view of the world. I certainly don’t claim it is universal or the final word. It’s a stake in the ground that we have found useful. How about you?

Resuscitating the blog

Sat, 26 Jan 2019 11:11:11 +0000

I made a resolution (not really for the new year, just all on its own) to start sharing more of what’s going on in the world around me. We’ll see if I keep it up. For now, I’ve updated the blog’s look a bit and have a few topics lined up in my head. In the event that someone else is trying similar GH Pages machinations, I’ve captured some useful info here. Bear with me (or do something else useful) if this is all old hat to you.

Basic GH Pages update

I set this up three years ago and since then, a few things have changed. First step was to get the basic infrastructure up to date.

Markdown formatters

GitHub now supports only kramdown for Markdown and Rouge for syntax highlighting. Luckily I was already using those. I did notice however that the configuration has a new format more like this:

kramdown:
  input: GFM
  syntax_highlighter: rouge
  syntax_highlighter_opts:
    css_class: "highlight"

Not sure if the old style would work (I last generated the site a while ago). I updated it anyway. Also had to mess a bit with the styling. Not sure how it was before but with the current highlighting I was getting a horizontal scroll area even for narrow text and the color was off. A tweak to the overflow property solved that.

.highlight {
    overflow: auto;
    ...

https

Mid-last year GitHub introduced https for GH Pages on custom domains. This is an awesome and welcome addition. I set about updating the configuration. My DNS is on GoDaddy (not sure why any more, will likely move) and they do not support ANAME or ALIAS. So, as per the GitHub doc, I updated the DNS A records to point to the new addresses. In the end it all worked out but unfortunately several steps in the doc are of the flavor of:

Add all these entries in your DNS server.
Before changing your DNS server, do this other thing.

I definitely did things out of order. Not sure if it actually mattered or I was just too impatient or what. I had a whole writeup of what I tried etc but deleted it because it literally just started working now. Moral of the story? Patience.

Comments

I was using Disqus originally. Was never really happy with the look (very cluttered), perf issues, and having yet another system to manage. After poking around a bit, I found a summary by darekkay outlining various commenting approaches for static sites. In the end I settled on utterances, a GitHub integration that uses an issue per post to capture the comments. There is a relatively simple web UI that integrates into your site and looks basically the same as GitHub (see below). That works out well given the general tone/audience of the blog and my familiarity with managing GitHub.

Bit o’ stylin’

Full width header

I generally like full width headers etc so did some tweaks to make the banner full width but with the rest still constrained to at most 1100px wide. Easy enough. Err, well, my prior banners were all 130px high PNGs. I wanted the new banner layout to be 200px high. I also wanted the banner image to scale properly for different window sizes. That is, I want the height to stay the same and the image to scale up/down proportionally.

Turns out that background-size: cover is your friend there. Just create a full width <div> that is the height you want (200 in my case), set the background to be the image, and set background-size: cover. I made it much harder than it had to be in the process of figuring that one out. cover just stretches (proportionally) or clips as needed to, well, cover the entire background of the element.
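To make the stretch-or-clip behavior concrete, here is a minimal sketch of the arithmetic behind cover. The function name and the sample image dimensions are made up for illustration; only the 1100px and 200px figures come from the layout described above.

```python
def cover_fit(image_w, image_h, box_w, box_h):
    """Return (scaled_w, scaled_h, clipped_x, clipped_y) for a cover-style fit."""
    # cover uses the larger of the two ratios so that neither dimension
    # of the scaled image falls short of the box.
    scale = max(box_w / image_w, box_h / image_h)
    scaled_w = image_w * scale
    scaled_h = image_h * scale
    # Whatever extends past the box is clipped.
    return scaled_w, scaled_h, scaled_w - box_w, scaled_h - box_h


# Hypothetical 2000x400 panorama shown in a 1100x200 banner.
print(cover_fit(2000, 400, 1100, 200))
```

With those numbers the width ratio wins, so the image is scaled to fill the width and a little of its height is clipped.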

Pop that header text

Perhaps something changed between when I first did this setup and now. Looking at the old version, my name and quippy saying up there at the top right was not at all standing out over the image. Well that was not going to do. I had been using a text-shadow with white to help the black text pop but the effect was not strong enough. I found a bunch of different approaches to blurring the background etc. but many seemed complex or overly dramatic. In the end I settled on a <div> with a gradient background like this:

background: linear-gradient(
  to right,
  rgba(255, 255, 255, 0),
  rgba(255, 255, 255, 0.6) 40%,
  rgba(255, 255, 255, 0.7) 70%,
  rgba(255, 255, 255, 0.8)
);

Key bits there are the color (white), the changing alpha (opacity) value, and the %s. Going from left to right we start fully transparent (0 on the alpha channel) then drive up to 0.6 alpha at 40% of the way across, then a bit more alpha (opacity) at 70% across, and finally end up at 0.8 alpha. I spent quite a bit of time messing around with different increments and alphas to get it smooth. I think it worked out reasonably. Open to suggestions.
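If you want to see what alpha you get at a given point across the header, a rough sketch like the following can help while tuning. It assumes simple linear interpolation between stops, which should be a fair model here since every stop is the same white and only the alpha changes. The names STOPS and alpha_at are mine, not anything from CSS.

```python
# Stops from the gradient above as (position %, alpha). The first and last
# stops have no explicit position, so they sit at 0% and 100%.
STOPS = [(0, 0.0), (40, 0.6), (70, 0.7), (100, 0.8)]


def alpha_at(percent, stops=STOPS):
    """Linearly interpolate the alpha at a horizontal position (0-100)."""
    if percent <= stops[0][0]:
        return stops[0][1]
    for (p0, a0), (p1, a1) in zip(stops, stops[1:]):
        if percent <= p1:
            t = (percent - p0) / (p1 - p0)
            return a0 + (a1 - a0) * t
    return stops[-1][1]


for pos in (0, 20, 40, 55, 70, 100):
    print(pos, round(alpha_at(pos), 2))
```

Most of the change happens in the first 40% of the width, which is why the text on the right sits on a fairly even wash.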

Random banners

In the original blog I had a random banner, sort of. Using some Jekyll magic detailed in a previous post, the build of the site would randomly pick an image from a table of images and statically bind that into the site. It would stay that way until the site was built again. Sorta boring. I wanted more randomness (who doesn’t?). Time for a bit of JavaScript.

  <script>
    window.onload = () => {
      var image = Math.floor(Math.random() * 7)
      var header = document.querySelector('.site-header')
      header.style.background = `transparent url(/images/header/${image}.jpg) no-repeat center bottom`
      header.style.backgroundSize = 'cover'
    }
  </script>

The background for an element is a style thing, not an element thing. That means you can easily just get the element and bash the style. The value of a style prop is just the string that you would normally have written in the CSS file. So I just renamed all of my images to be numbered and generated a random URL for the background. You can also see the aforementioned cover for the background-size.

One interesting thing came up while doing this. It appears you cannot set some of the background related style props in CSS (well SASS) and some in JavaScript. When I put the background-size setting in the CSS, it appears to have been overwritten by the JavaScript code. Perhaps I was doing something wrong? The above worked so I didn’t spend a long time fretting over it. Some day perhaps.

New images

The original images used in the banners were mostly panoramas from sailing around the British Virgin Islands with the family some years ago. Great memories. I’ve been doing some other things lately so thought I’d update the pics, what with all that newfound height and variable width in the banner… Can you guess what I’ve been up to?

Favicon

Oddly the old blog did not have a favicon. The old old site did. Anyway, for this revamped blog I decided to use a helmet as a nod to my racing. I poked around for a bit looking for a helmet icon that I could color and ideally scale etc. Turns out that the state of licensing in the image world is very unclear (something I’ll blog about later). Since I lacked clarity on the terms under which I could use the images I found, I ended up drawing my own.

Took me all of 5 minutes using INKSCAPE. You can see the icon on this page’s tab and the SVG is in the GitHub repo.

Aside: I quite like INKSCAPE (they seem to capitalize so I’ll follow suit). I’ve used it to do all sorts of stuff including laying out dozens and dozens of hexagons in a pattern for vinyl cutting and application to the race car. It certainly has its quirks but I actually prefer it to Illustrator for the things that I do.

Wrap up

That’s all I’m going to do on the blog infrastructure for now. Next will be some technical content I think. Let me know if there is anything you want to hear about related to Microsoft and open source, Pro3 racing or the Alfa Romeo Giulia Quadrifoglio.

Diving into GitHub Data

Tue, 01 Dec 2015 11:11:11 +0000

GitHub is a veritable treasure trove of information about all kinds of things related to open source, various technologies, software engineering practices, programming patterns, social interactions, and so much more. I’m really excited to be diving into this space. The first thing we are doing is sponsoring GHTorrent to run on Azure and integrate with Azure Data Lake.

In my daily interaction with folks inside and outside Microsoft engaged in open source, it is very common to come across teams who would like to get a better understanding of what is going on in GitHub. Some of the scenarios we see are:

Project health – understand characteristics of successful projects, tell how our projects are doing, and determine if someone else’s project is solid enough to depend on or worthy of investment
Azure security – with the recent key compromises, Azure is keen to scan all of GitHub to find keys and warn owners
Developer behavior – What is a Python developer? Do most/many devs work across multiple techs, if so, which? Is something trending?
Product insight – Gain insight into product/platform/SDK uptake by monitoring projects using the technology of interest
SDK usage – Analyze SDK usage for correctness, change impact, …
Product implementation – for example, Bing Developer Assistant harvests quality code from GitHub as a source of its coding suggestions
IP management – IP scanning and license discovery etc

What do you want to know about what’s happening in GitHub? Add your thoughts to the comments below.

GitHub data sources

GitHub provides several sources of data about the projects and teams it hosts.

Website – The site for each repo has a number of charts and stats that show who is contributing, the watchers, forks, … This is consumable by humans but not so much by machines. Some folks have been screen scraping these sites to get info. Not great but in some cases, this is the only source.
API – There is a pretty rich set of APIs available on GitHub. With these you can access just about every aspect of public (and your private) repos. GitHub rate-limits these APIs to 5000 requests/hour in an effort to keep the system usable. As such, there is only so much you can do with API calls.
Events – https://api.github.com/events is a feed of all events happening in all public repos. See an example event below. As you can imagine, there are a lot of events (>50K/hour when I sampled recently) so this data changes very quickly. To get it all you have to read it often. Reading it once a second (3600 times an hour) “should” get you everything and still stay under the rate limit. That does not leave much headroom to work with the events.
Webhooks – You can ask GitHub to call you back when various events occur on a repo or an org. These webhooks deliver an event payload to a designated URL as the events occur. You can only hook repos and orgs to which you have appropriate permissions.
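To put numbers on the events feed squeeze, here is a back-of-the-envelope sketch using only the figures quoted above (5000 requests/hour, more than 50K events/hour, one poll per second). The function name is made up and the event rate is just the lower bound from my sampling.

```python
RATE_LIMIT_PER_HOUR = 5000   # API requests allowed per hour
EVENTS_PER_HOUR = 50_000     # lower bound from the sample (">50K/hour")


def polling_budget(poll_interval_seconds=1):
    """Return (polls per hour, requests left over, events to absorb per poll)."""
    polls_per_hour = 3600 // poll_interval_seconds
    headroom = RATE_LIMIT_PER_HOUR - polls_per_hour
    events_per_poll = EVENTS_PER_HOUR / polls_per_hour
    return polls_per_hour, headroom, events_per_poll


polls, headroom, per_poll = polling_budget()
print(f"{polls} polls/hour, {headroom} requests to spare, "
      f"at least {per_poll:.1f} events per poll")
```

The leftover requests are all you have for following up on any of those events through the API, which is the headroom problem in a nutshell.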

All of these approaches have drawbacks: incomplete data, potential for missed data, high volume, API rate limiting, limited permissions, … What we really need is a shared service that gathers all this data and puts it in a queryable form. Fortunately some fine folks in the community have been hard at work collecting up some of this info and making it available.

{
  "id": "3389952959",
  "type": "PushEvent",
  "actor": {
    "id": 99999999,
    "login": "jeffmcaffer",
    "gravatar_id": "",
    "url": "https://api.github.com/users/jeffmcaffer",
    "avatar_url": "https://avatars.githubusercontent.com/u/99999999?"
  },
  "repo": {
    "id": 102030546,
    "name": "microsoft/typescript",
    "url": "https://api.github.com/repos/microsoft/typescript"
  },
  "payload": {
    "push_id": 883546547,
    "size": 1,
    "distinct_size": 1,
    "ref": "refs/heads/master",
    "head": "aaf5025dafa123a90ee94b0e9d48e6d22bbf8663",
    "before": "4b57fe306b51e7de39e5a67d0ffc4d5e8a904513",
    "commits": [
      {
        "sha": "5f0f9aa98566c87e1a496f5248c4d136f96731d5",
        "author": {
          "email": "[email protected]",
          "name": "Jeff McAffer"
        },
        "message": "fix a simple problem",
        "distinct": true,
        "url": "https://api.github.com/repos/microsoft/typescript/commits/5f0f9aa98566c87e1a496f5248c4d136f96731d5"
      }
    ]
  },
  "public": true,
  "created_at": "2015-11-30T07:39:19Z"
}

GitHub Archive

GitHub Archive is essentially a database of all the events on the https://api.github.com/events feed. They make available hourly archives of some 20 different event types. You can download these, structure the data into a local database and query your heart out. You can also access the data through Google BigQuery. There is a lot you can do with this data.

Like GitHub itself, the BigQuery service is rate-limited such that you can process 1TB of data per month with BigQuery. This is great for investigation and modest sized datasets but we have thousands of repos to track. We certainly can get the raw archives and put them in our own database but…

The data itself is somewhat limited. The events captured are just that, the events. For example, the push event shown above identifies the user, the repo and the set of commits being pushed. The actual data is not included. This high-level information is fine for some flavors of insight but many of the scenarios outlined above need more detail.
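As a rough illustration of what you can and cannot get from an event, the sketch below pulls the user, repo and commit links out of a trimmed copy of the sample push event. summarize_push is a hypothetical helper, not part of any GitHub or GitHub Archive tooling.

```python
import json

# Trimmed copy of the sample PushEvent; only the fields used below are kept.
SAMPLE_EVENT = json.loads("""
{
  "type": "PushEvent",
  "actor": {"login": "jeffmcaffer"},
  "repo": {"name": "microsoft/typescript"},
  "payload": {
    "ref": "refs/heads/master",
    "commits": [
      {
        "sha": "5f0f9aa98566c87e1a496f5248c4d136f96731d5",
        "message": "fix a simple problem",
        "url": "https://api.github.com/repos/microsoft/typescript/commits/5f0f9aa98566c87e1a496f5248c4d136f96731d5"
      }
    ]
  }
}
""")


def summarize_push(event):
    """Hypothetical helper: report who pushed which commits, and where."""
    if event.get("type") != "PushEvent":
        return None
    commits = event["payload"]["commits"]
    return {
        "user": event["actor"]["login"],
        "repo": event["repo"]["name"],
        "ref": event["payload"]["ref"],
        # The event only links to the commits. The changed files and diffs
        # need a follow-up API request per commit URL.
        "commit_urls": [commit["url"] for commit in commits],
    }


print(summarize_push(SAMPLE_EVENT))
```

Everything past the commit URLs costs additional API calls, which is where the rate limit starts to bite.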

GHTorrent

GHTorrent fills this gap by getting all the events and then walking the links in the events to transitively gather all of the data related to the event. The events themselves are stashed in a MySQL database and the event payloads, the transitive data, are stored in a series of MongoDB tables related to the different types of data.

Like GitHub Archive, the GHTorrent data is available either to use online (rate-limited) or to download in bulk and run yourself. Unfortunately, the dataset is really pretty big (~7TB) so downloading is a challenge and querying in the online system can be slow.

Path forward

In a bid to address these concerns and get the desired insights, the Open Source Programs Office (my team) and the Tools for Software Engineering (TSE) team at Microsoft are joining forces with Georgios Gousios from GHTorrent to host the GHTorrent infrastructure on Azure and pump the data into the new Azure Data Lake service. This has a number of interesting benefits:

Stable home for GHTorrent as it is today
Data in a scalable store, Data Lake available as an HDFS store
Access to the data using modern big data tools like Hadoop, U-SQL, Spark, HBase, Storm, …
Ability to include private repo data

There are still a lot of details being worked out but the first steps have been taken and we should have more to say in the new year.

mcaffer.com moved to Jekyll

Mon, 23 Nov 2015 11:11:11 +0000

Like so many others, I have moved my blog from WordPress on my hosting provider Dreamhost to be produced via Jekyll and hosted using GitHub Pages. I made the switch for simplicity and to mess around with current technologies. In this post I outline the steps I took and setup I use to create this blog itself. Over time I’ll likely update the blog to capture how each element of the blog works.

Why Jekyll

As mentioned above, there are a few reasons.

Simplicity – Jekyll isn’t necessarily simpler. You have to set it up on the machine you are using to author posts (if you want to preview) and while not hard, there is some stuff to do. It is operationally simpler, however, in that once the site is rendered, there is no compute, nothing to get hacked (happening to my WP sites regularly), nothing to update.
Open – WP is definitely open but it is internally more complex and harder to mess about with. Perhaps it’s just a learning curve…
Convention – All the other open source project content I happen to deal with is in markdown and rendered more or less with the Jekyll engine so this is good practice.

Initial setup

Setting this all up was reasonably simple.

Install Ruby. I’m on Windows (yeah yeah) so had to install Ruby first. The instructions are pretty easy. I used the 32 bit stable version. No real value in 64 bit and running Jekyll is not rocket science.
Get Jekyll. This is pretty simple. If you are going to be using stock GitHub Pages to deliver your site, check out the instructions on how to get the exact GitHub setup on your machine.
Clone your site’s GitHub repo. Assuming you are going to host your site using GitHub Pages, clone the master (for orgs) or gh-pages (for projects) branch for the target repo onto your machine.
Generate a site. Run jekyll new my-site-repo where “my-site-repo” is the name of your git repo and the command is run from the parent folder. Then run jekyll serve from the my-site-repo folder to check it out at http://localhost:4000.
Publish to GitHub. Do a git commit and push to get the pages on GitHub.

Voila! The blog is up and running. No real content but that’s up to you anyway. Now for the interesting customizations.

Custom domain

Having your blog at yourname.github.io or some such is not that fun. Luckily you can set up a custom domain and have it pointed to the published GitHub Pages. This can be done for an “apex” domain like mcaffer.com or for a sub-domain like blog.mcaffer.com. I did the apex domain. GitHub has instructions. They are a little circuitous but basically net out to three steps.

Register your domain name. I already had mcaffer.com so this was a no-op. Hit up GoDaddy or some other registrar to get your domain.
Create a CNAME file. This tells GitHub where the site will show up in URL space. The file must be named in all CAPS and be at the root of your repo. It must have just one line and that line is just the name of the (sub)domain where you are putting your blog. My file just has mcaffer.com on a line by itself. Note that you use a file called CNAME regardless of whether you are doing an apex domain or sub-domain.
Set up a DNS entry. You have to tell “the internet” where your domain is located. The steps will vary by domain registrar or DNS provider. Check out the handy instructions on GitHub.

Comments

I want to engage with folks so enabling comments on the posts is key. The easiest thing is to set up Disqus. The following steps got it done for me:

Register with Disqus. Just follow the instructions at Disqus to get your account set up.
Configure my blog. For Jekyll integration, use the “Universal” code. Again, the instructions are pretty simple. I added a disqus.html file in the _includes folder and then included that in the post.html layout (I only wanted comments on my posts).
You have to set up values for this.page.identifier and this.page.url. I used page.permalink for these.

Note that Disqus has changed their integration and you can no longer style as much as you could before. Nor can you eliminate some of the bits n bobs like the “Subscribe” and “Privacy” links along the bottom. Let me know if you find a way.

Syntax highlighting

Jekyll can do syntax highlighting with either Rouge or Pygments. Rouge is built into Jekyll so I started with that. Pygments requires Python and I figured fewer moving parts is better. When I first pushed the repo to GitHub I got build warnings saying that Rouge was not supported and they were using Pygments instead. I just tried it again and am not getting any errors and the highlighting in this post (below) appears to be working so it seems Rouge is built in and supported by GitHub. Winner!

Tag Cloud

I really like tag clouds and had one on my old blog. You can see mine in action down the right-hand side of this post.

There are several Jekyll plugins for generating tag clouds but as far as I could see, none are supported on GitHub. There are a number of non-plugin options like this one on CodingTips. Unfortunately, they all seem to use some linear font sizing algorithm. That’s less than optimal when your tags only have a few hits and when you have tags with lots of mentions. Early in your blog’s life there will be relatively few tag mentions so the fonts will be really small. Similarly, later in your blog’s life, the tags will have lots of mentions and the fonts will get really big.

My approach is to make 10 font size buckets and then scale the fonts according to the dynamic range of the tag mentions. That is, if the difference between the most mentioned and least mentioned tag is <= 10, a tag’s font size is simply allocated based on its position in the range. If the range is larger than 10 then the range is “compressed” and fonts allocated according to the compressed values.

The snippet below is tagcloud.html, which you can include wherever you want the tag cloud to show up. It first computes the minimum and maximum mentions of tags across all the posts and then diffs those to get the range. From there, the cloud is formed by creating the links and setting their font size based on where each tag’s mention count fits in the range (scaled or unscaled appropriately). No plugins and only a little bit of gorpy Liquid code.

<div class="tag-cloud">

{% assign min = 100000 %}
{% assign max = 0 %}
{% for tag in site.tags %}
  {% if tag[1].size > max  %}
    {% assign max = tag[1].size %}
  {% endif %}
  {% if tag[1].size < min  %}
    {% assign min = tag[1].size %}
  {% endif %}
{% endfor %}
{% assign range = max | minus: min | plus: 1 %}

{% assign sortedtags = site.tags | sort %}
{% for tag in sortedtags %}
  {% assign count = tag[1].size %}
  {% if range > 10 %}
    {% assign font = count | minus: min | times: 10 | divided_by: range | plus: 1 %}
  {% else %}
    {% assign font = 10 | minus: range | divided_by: 2 | plus: count %}
  {% endif %}
  <a href="/tags.html#{{ tag | first }}" style="font-size: {{ font | times: 3 }}pt" >{{ tag | first }}</a>
{% endfor %}

</div>
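If Liquid is hard to read, the same bucket arithmetic can be sketched in Python. This is a transliteration of the snippet above for illustration only, not something the site runs, and the function names are mine. It relies on Liquid's divided_by doing integer division when both operands are integers, which // mirrors here.

```python
def tag_font_buckets(tag_counts):
    """Map {tag: mention count} to {tag: bucket} using the Liquid logic above."""
    lo = min(tag_counts.values())
    hi = max(tag_counts.values())
    spread = hi - lo + 1  # called `range` in the Liquid snippet
    buckets = {}
    for tag in sorted(tag_counts):
        count = tag_counts[tag]
        if spread > 10:
            # Wide range: compress the counts into buckets 1..10.
            buckets[tag] = (count - lo) * 10 // spread + 1
        else:
            # Narrow range: offset the raw count toward the middle sizes.
            buckets[tag] = (10 - spread) // 2 + count
    return buckets


def font_size_pt(bucket):
    """The template multiplies the bucket by 3 to get a point size."""
    return bucket * 3


young_blog = {"jekyll": 1, "github": 2, "opensource": 3}
for tag, bucket in tag_font_buckets(young_blog).items():
    print(tag, font_size_pt(bucket), "pt")
```

Note that the narrow-range branch adds the raw count rather than the count's offset from the minimum, exactly as the Liquid does.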

“Random” banner images

The site has a banner image that is typically a panorama shot that I took somewhere along the way. The image shown should change over time. This could be done with some JavaScript on the page that varied the image link but, in the interest of keeping it simple, the image selection should be part of the site itself. Liquid does not have a random filter/function but it does have time. In particular, it has site.time. With this and the date and modulo filters, we’re off to the races.

First, use Jekyll’s data file facility to create an indexable list of banner images. Create a _data folder and add a simple banners.csv that lists the banner image files like so.

file
header-anegada1.jpg
header-anegada2.jpg
header-canadaday.jpg
header-loblolly.jpg
header-saba.jpg

Then update the header.html include file to pick a banner at publishing time. In the snippet below we take the hour of the build time (site.time formatted with %H) and modulo it by the number of entries in the banner data file. Then use that index to find the banner file to use. Finally, that file is used as the background for the page header.

{% assign index = site.time | date: "%H" | modulo: site.data.banners.size %}
{% assign banner = site.data.banners[index]["file"] %}

<a href="{{ site.url }}" class="site-header-link">
  <div class="site-header" style="background:transparent url(/images/header/{{ banner }}) no-repeat center bottom; "> 
  </div>
</a>
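The selection logic is easy to sanity check outside of Jekyll. Here is a small Python sketch of the same hour-modulo pick, using the file names from banners.csv. pick_banner is a made-up name and the build time passed in is just an example.

```python
from datetime import datetime

# Same entries as the banners.csv data file above.
BANNERS = [
    "header-anegada1.jpg",
    "header-anegada2.jpg",
    "header-canadaday.jpg",
    "header-loblolly.jpg",
    "header-saba.jpg",
]


def pick_banner(build_time, banners=BANNERS):
    """Pick a banner from the hour (0-23) of the build time, like date: "%H"."""
    index = build_time.hour % len(banners)
    return banners[index]


# Example: a site built at 11:11 picks index 11 % 5.
print(pick_banner(datetime(2015, 11, 23, 11, 11, 11)))
```

With 24 hours spread over five banners, the first four files come up for five hours of the day each and the last for four, so the choice is not quite even, but it is good enough for a banner.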

Wrap up

That’s it for now. Going forward I’d like to add a post archive by date and a few affordances to make it easy for you to tweet the things you find interesting. Mostly the site will stay simple. Let me know in the comments if you have any suggestions.
