Goal Trees for Software Engineering Teams

In this article, I’ll talk about a modeling technique called Goal Trees and how I’ve found it useful for leading software engineering teams.

Back in grad school, I developed JSDSI, an open source Java implementation of Carl Ellison’s Simple Public Key Infrastructure (SPKI) and Ron Rivest & Butler Lampson’s Simple Distributed Security Infrastructure (SDSI). This work taught me about Java development, public key infrastructure, and computer system security. I read a lot about security at that time, including Bruce Schneier’s book Secrets & Lies, which helped me understand the similarities between computer system security and real-world physical security. I particularly liked Schneier’s threat modeling technique, Attack Trees.

Attack Trees are a kind of And-or tree in which each node is a goal, and that node’s children are either alternative ways to achieve that goal (an “or-node”) or are all required to achieve the goal (an “and-node”). In an attack tree, the root node is the goal for a system’s attacker, and the subtrees explore possible ways for the attacker to succeed. Attack trees can be annotated with the likelihood and costs of various subgoals to help defenders anticipate the most likely attacks and make them more costly for the attacker.

Here’s an example from Schneier’s 1999 article:

In this attack tree, the attacker’s goal is “Open Safe”, and each child “Pick Lock”, “Learn Combo”, etc. is an alternative way to achieve that goal. “Eavesdrop” is the only and-node in the tree, with “Listen to Conversation” and “Get Target to State Combo” as the required subgoals. Schneier has annotated this attack tree with the cost of achieving each goal: the cost of an and-node is the sum of its children’s costs; the cost of an or-node is the cheapest of its children. “Cut Open Safe” is the cheapest way for the attacker to achieve the top-level goal. A defender can use this analysis to make attacks more costly, for example, by expanding “Cut Open Safe” into an and-node that also requires “Gain Access to Safe” and making that more difficult with additional physical security.

In researching this article, I learned that in 1962, 37 years before Schneier published his article on Attack Trees, Bell Laboratories had developed a similar technique called Fault tree analysis. Ishikawa diagrams (pictured below) were also popularized in the 1960s. These techniques fall under the general category of Root cause analysis, which includes popular quality control techniques like the Five whys.

In software engineering teams, we use related techniques in several areas:

  • Recursive goal decomposition: OKRs (Objectives and Key Results) were introduced by Andy Grove at Intel in the 1970s and popularized in his 1983 book, High Output Management. OKRs have been standard practice at Google since 1999. In a large company like Google, it is common for sub-organizations to align their Objectives with one or more of their parent org’s Objectives or Key Results, forming a goal tree, with lower-level goals contributing to higher-level ones.
  • Exploration of alternatives: When designing software to address a product or business requirement, we explore alternative implementations that vary in cost, complexity, and capability. In a system connected by abstractions or interfaces, we can explore alternative implementations of each interface. In these cases, each interface or requirement is an or-node in our goal tree.
  • Root cause analysis: Following an outage or system failure, production teams perform a root cause analysis to determine the factors that contributed to the failure. Similar to defending against attacks, we identify ways to make failures less likely by systematically addressing various paths from root causes to the failures.

In my role as an engineering leader, I’ve found these techniques most useful for addressing questions like “should we pursue this idea?” and “why are we doing this project?” These questions come up often for teams building software that doesn’t generate revenue directly, such as those building infrastructure or developer tools and languages. It’s important to have a clear understanding of how these efforts connect to a company’s business goals, such as attracting developers to your platform or streamlining usage of revenue-generating products and services.

(Cue “Is this good for the company?” reference from Office Space.)

Let’s explore this with an example. Suppose a team member comes to you with an idea, like “we should provide a tool that enables automatic migration from older to newer versions of a package”. Such a tool is useful when the new version is not backwards compatible with the old version, that is, client code must change in order to use the new version. Such changes are usually onerous and error prone, so a tool would make this much faster and easier for the user. Sounds great, right?

But in an organization with limited resources, pursuing this project means that less is available for other projects. We need to consider carefully how this project aligns with our organizational goals, what it would take to make the idea truly useful, and whether there are better ways to achieve our goals. Here are three steps:

  1. Climb the goal tree. Use successive “whys” to connect ideas up to organizational goals.
  2. Complete the solution. Use and-nodes to flesh out complete solutions and estimate their total costs.
  3. Explore alternatives. Use or-nodes to explore alternative ways to achieve the organizational goals.

Climb the goal tree. Use successive “whys” to connect ideas up to organizational goals.

Let’s ask “why” starting from our idea of “provide a tool that enables automatic migration from older to newer versions of a package”:

  • Why should we provide an automatic migration tool? To enable users to upgrade to newer versions of packages quickly and easily.
  • Why should users upgrade to newer versions of packages? To enable them to be more productive with those packages and gain access to improvements, like new features and security fixes.
  • There’s a split here into multiple branches:
    • Why should we make users more productive? To increase their use of our platforms and services.
    • Why is it important that users gain access to new features? So that they engage with new platform offerings provided through those features.
    • Why is it important that users get security fixes? So that users trust the software they build using our packages.

Visualizing these relationships, we connect our AutomaticMigrationTool idea up to the EasyUpgrades goal and to each organizational goal that this serves.

When planning, I’ll use this exercise to connect our project ideas up to the OKRs for my parent organizations as a way of ensuring we’re working on the right things. Even if a project idea doesn’t align with current organizational goals, I can use this analysis to articulate the opportunities the idea creates for the organization.

Sometimes it can be challenging to get team members to think about why and whether we should pursue an idea; instead they are preoccupied with the details of how we’ll implement the idea. In such cases I’ve developed a technique called “Assume success! … then what?” The idea is to remove any doubt that the team is capable of delivering on the idea: I have complete confidence in you, and I will support you if we decide to pursue this! But let’s discuss the impact we hope to achieve by doing this and what it will take to achieve that impact.

Complete the solution. Use and-nodes to flesh out complete solutions and estimate their total costs.

The original idea was “provide a tool that enables automatic migration from older to newer versions of a package”. This is the core technology of an offering, but not a complete solution. In the previous step, we connected this idea to the goal “Enable users to upgrade to newer versions of packages quickly and easily.” If we treat this goal as an and-node, what is needed to actually achieve this goal?

  • We need to support migration for each new package version that’s of interest to our business
    • We need to identify pairs of old & new package versions
    • We need to provide tools to enable migration between each such pair
    • We need to repeat this process for each new package version that requires migration
  • We should also streamline usage of these tools through IDE integration
  • We need to do all the usual best practices: testing, documentation, etc.
  • We need to make users aware of these tools through blog posts, talks, notifications

In our picture, we replace the AutomaticMigrationTool node with the and-node AutomaticMigrationSolution, with children for each component of the complete solution.

Surprise! Our goal tree is in fact a goal graph: each goal may connect up to multiple higher-level goals and decompose into multiple subgoals. This is particularly common for projects “further down the stack”, like infrastructure and tools.

We can repeat and-node expansion for each higher-level goal our idea contributes to; for example, providing automatic upgrades for security fixes may require special consideration. This exercise helps us understand the real costs of achieving these goals beyond the core idea.

Explore alternatives. Use or-nodes to explore alternative ways to achieve the organizational goals.

Once we’ve done the and-node expansions, we can consider or-nodes. For each of the goals we’re examining, what are the alternative ways to achieve them? For the goal “Enable users to upgrade to newer versions of packages quickly and easily,” there’s a clear alternative, which is to ensure new package versions are backwards compatible with old ones. Then upgrades require no changes to client code. We visualize this as an or-node under EasyUpgrades:

However, this BackwardsCompatibleUpgrades alternative is not free; let’s flesh out the complete solution with an and-node:

  • We need to ensure new versions of packages are backwards compatible with old ones by detecting when unreleased new versions are incompatible and blocking their release.
    • Detect when a new package version would break the build of client code.
    • Detect when a new package version would violate the expectations of client code (typically by breaking tests or benchmarks for client code).
  • We need to provide ways to achieve our business goals while maintaining backwards compatibility, for example, by maintaining support for old features while recommending new ones.

These costs add up, and we must weigh them against the alternative of making breaking changes and providing migration support. For example, Go’s compatibility policy guarantees backwards compatibility except when critical fixes, such as for security issues, require a breaking change. In such cases it might make sense to provide tooling to ease the migration to the new version.

Let’s visualize all this as a complete solution:

Another complexity is that we can only ensure backwards compatibility for packages that we ourselves control. We have no control over whether third-party package authors make breaking changes. We have set a standard for how Go modules (package collections) use semantic version numbers to differentiate compatible and incompatible updates, but we have no way to enforce this. Instead, we can provide tools to help package authors avoid making breaking changes (as we do for ourselves) and to help package consumers identify which packages follow our compatibility guidelines. I’ll explore this further in a future article.

We can apply or-nodes at higher levels of the goal tree: for example, there are many ways to foster user trust beyond making security upgrades easier, such as improving transparency and communication with users. Higher levels of leadership explore these alternatives to decide which major programs to invest in.

Goal Graphs and Cycles. As we saw in the examples above, a subgoal may serve multiple higher-level goals, and each higher-level goal may have several subgoals. This means our goal tree isn’t actually a tree at all, but a graph, with nodes representing goals and edges connecting them. We can annotate the edges in this graph with likelihoods and costs, as we saw with Attack Trees.

It’s tempting to believe that Goal Graphs are acyclic, since completing subgoals should lead to achieving the higher-level goals. But in reality goals are part of a dynamic system, and the state of one goal affects others—even itself! Making progress towards one goal may accelerate or impede progress towards other goals.

The simplest example of this is product adoption. It’s common for product adoption to be a goal for an organization, and many dream of the “hockey stick” of exponential growth. The key to such rapid growth is network effects: a product becomes more attractive as more people use it. This is clearly true for social networks, but it’s also true for developer tools and languages. More adoption leads to a richer community and ecosystem of documentation, trainings, packages, tools, meetups, conferences, forums, and more … all of which leads to more adoption. This is a positive feedback loop.

We can visualize Adoption as a flow between two stocks, NonAdopters and Adopters, with the Ecosystem enabling that flow:

Adoption continues growing until the product saturates its market or some other factor limits it. Goals of accelerating product adoption typically involve removing a limiting factor or a negative feedback loop. At this point we’ve moved from goal analysis to system dynamics. If you find this interesting, I recommend Thinking in Systems by Donella Meadows.

In this article, we’ve explored ways of thinking about project ideas, organizational goals, and the exploration of alternatives. In future articles I’ll discuss ideas around improving stability and security in open source package ecosystems and how we define SLOs (service level objectives) for developer tools and languages.

Helping teens learn to lead with DoSomething.org

In this article, I’ll talk about my service as a board member for DoSomething.org, a nonprofit focused on empowering teens to take action on the issues that matter to them.

I joined DoSomething’s board in January 2022 at the invitation of its CEO, DeNora Getachew. My wife, Mandi, and I met DeNora through her work at a previous nonprofit, Generation Citizen, which teaches middle and high schoolers civics and how to effect change by identifying and engaging with their representatives and other community authorities. We appreciated the structure of the program and how it taught young people to take action on problems they cared about, so when DeNora became CEO of DoSomething and invited us to get involved, we were eager to participate.

This was the first time I’d ever served on a board, and I had no idea what it entailed. Fortunately there are plenty of tenured board members to help newbies like me learn the ropes. I’m learning a lot.

DoSomething.org has a fascinating history. Founded in 1993, it’s one of the oldest “dot orgs” that’s still running. DoSomething.org provides a web platform that helps connect teens to service projects that help them make positive change in their community. Over its 30-year history, DoSomething has engaged over 8 million young people and awarded $1.8 million in scholarships. It previously ran award shows on VH1 featuring famous musicians and celebrities, so there’s a pop culture element to the way the organization creates campaigns.

While DoSomething had great success with the “millennial” generation of teens, today’s teens are different, and so is their online environment. Now, there are dozens of online platforms that help people find ways to get involved and do good. Teens spend most of their time on social media and video platforms, not on the web. And the issues they care about are big: democracy, the economy, equity, mental health, climate change. As a result of these changes in their target demographic and the technical environment, DoSomething is changing its approach to its mission.

DoSomething’s vision is to transform its online platform to support the growth of individuals as active citizens. Rather than the “transaction” of finding a project, logging participation, and getting service hours, the new platform will focus on helping teens define their civic intent, connect with others who share their goals, and grow as activists and leaders. This is a much richer journey than the previous platform provided, oriented to help the new generation of teens (including my own children) better understand themselves and their collective power.

My role on the board is to leverage my technical expertise and industry connections to support DoSomething and its mission. For example, I serve on the advisory board for the design of their new online platform, I’ve organized fundraising within Google, and recently I answered questions about generative AI in a discussion with young DoSomething members. I’m enjoying the opportunity to apply what I know to a domain focused on empowering the next generation of leaders.

One of my favorite things about DoSomething is its name. Facebook launched twenty years ago and fostered a generation of people who “like”, reshare, and comment to express their support for issues or their frustration with the lack of progress. This produced a lot of engagement, but little real change. People have learned to express themselves online, but not always in a way that’s productive or achieves positive goals. The current generation of teens is different: they are much more aware of the limitations and risks of social media, and they care about making real change happen. They want to do something about the issues they care about. DoSomething aims to empower these young people by helping them find others who share their civic intent and by helping them grow into leaders of change. I’m proud to be part of this organization and support its mission.

If you’re interested in learning more, check out DoSomething.org and their latest strategic plan. As a nonprofit, DoSomething welcomes donations; be sure to check whether your company offers to match your gift.

Go, Python, Rust, and production AI applications

In this article, I’ll talk about Go, Python, and Rust, and each language’s role in building AI-powered applications.

Python was the first programming language I ever loved, and Go was the second. But let’s start at the beginning …

I discovered the power of programming in my high school linear algebra class. We were learning the Simplex method for linear programming and were expected to memorize the algorithm for an exam. The steps were entirely mechanical but tedious and error prone. I had noticed that our TI-81 calculators supported programming, so I asked the teacher whether I could use it for the algorithm. She agreed. It took me hours to write the program on the calculator, but with it I completed the 45-minute exam in 5 minutes and walked out like a boss.

Pascal was the first language I learned formally, then Scheme, C, and C++. As a grad student, I learned Java and became enamored with its safety, memory management, and ability to dynamically load bytecode. As a teaching assistant for 6.170, MIT’s software engineering course, I used Java’s dynamic class loading to automatically test and grade students’ programming assignments.

Then I found Python, and it was something truly special. Python was so simple, so light on the page, so readable—it felt like programming in natural language. I started using Python everywhere I could. I even used Python to answer programming questions during my Google interviews, so much so that the interviewers asked that I write a header file for a double-ended queue to prove that I actually knew C++.

Once I joined Google in 2004, I was back to writing C++ most of the time. It was fine, but not joyful. I missed the simplicity and lightness of Python, but I understood the need for efficiency demanded by “Google scale”. But then, in 2010, I found Go, and I felt like someone had finally given me the best of both worlds: a simple, light, joyful language that also provided great performance, reliability, and scale. I was thrilled.

I was not alone. While Go had been designed as an alternative to C++ for building networked services, much of Go’s early adoption came from dynamic language users, particularly from Python and Ruby. Jeremy Manson’s post describes the challenges Google had running Python in production and how Go provided a way forward. The same thing happened outside Google, with many Python and Ruby developers adopting Go to achieve greater reliability and efficiency while retaining a fast inner loop.

Meanwhile, Python was going through a renaissance. While Python’s use for web backends was declining, its use for data science was booming. Scientists had discovered what I did: a light, simple language that made it easy to iterate and ran efficiently enough, especially when Python could delegate heavy numerical computations to C libraries. After scientific computing and “big data” came ML development, and now we have LLM application development. Python is the default language for all of these, and its ecosystem is rich and enormous.

But there’s a problem. Python is amazing for working with data, developing models, and prototyping applications, but it doesn’t scale well. The features of static languages—type checking, compilation to machine code, and more—are what enable scaling to large programs, large development teams, and large systems. And production systems require their own ecosystem of libraries and tools to enable developers and operators to work effectively at scale.

We’ve been talking with people who are running AI-powered applications in production, and we keep hearing the same thing: these applications are written in Python, but these organizations don’t want to support Python in production. While Python may become better in production over time, the AI revolution is happening right now, and people want an alternative. I believe that alternative is Go. LLM-powered applications are primarily about orchestration: calling into one or more models, in sequence or in parallel, and synthesizing the results—and doing this at scale, with strong support for production operators. Go excels at this.

There are several other languages that might make a claim to suitability here, but I’d like to focus on one: Rust. Rust is an amazing language. I won’t say I love it (yet), but I respect it. Rust provides uncompromising safety and uncompromising speed, even in the context of concurrency. The cost is up front programmer effort: Rust demands the programmer provide enough information for the compiler to understand object lifetimes and sharing, whereas “managed languages” like Go figure this out for you. Go and Rust team members explored this trade-off together in the 2021 article “Rust vs. Go: Why They’re Better Together”.

Imagine Python, Go, and Rust on a spectrum of languages: as we move from Python to Go then Rust, we get increasing safety and efficiency. As we move in the opposite direction—from Rust to Go then Python—we get increasing ease and accessibility. Furthermore, each language has rich but different ecosystems of libraries and tools that make them suitable for different kinds of applications.

Python, Go, and Rust each have their strengths, and each language has a role to play in building AI-powered systems. Python is fantastic for the iterative, exploratory process of developing AI models and prototyping applications. Go is great for bringing these applications to production at scale. Rust is superb when speed is paramount, such as in serving AI models.

You’ll notice I didn’t say anything about which language is the most “productive”. That’s because your productivity with a language or tool is a function of the tool’s suitability to task. You’ll be more “productive” driving a screw with a screwdriver than a hammer. Pick the right tool for the job.

So if we believe in “prototype in Python, productionize in Go”, how do we make this easy? We are actively researching this area. The #1 thing users tell us is that they need Go equivalents to the Python libraries used to build AI applications. Python has an enormous ecosystem (pandas, NumPy, PyTorch, TensorFlow, JAX, notebook integrations, and much more)—but we don’t need to provide all of this in Go. We are focused on the subset of libraries that are needed for building production AI applications, starting with LangChainGo. We’re eager to hear from the community what else you need.

Our goal is for Go to complement Python to the benefit of both communities. Let’s build bridges between our ecosystems and communities to make Go and Python “better together”. Together, we will enable developers to create a new generation of production-grade AI-powered applications.

Go 2022-2024 and beyond: Let’s talk about AI

In this article, I’ll talk about where we are with Go today and what’s coming next, specifically in the context of how generative AI is changing software development. (Discussion hosted on LinkedIn)

The last few years have been about maturing the Go platform for mainstream users. We added generics, addressing the top language feature request since Go 1.0. We added feature flags for backwards compatibility, which enable major systems like Kubernetes to extend their support windows. We added automatic updates for forward compatibility, which enable us to fix longstanding language issues like the range variable capture “gotcha”. And we greatly improved the software supply chain security of the Go project itself.

We’ve made major improvements to Go’s IDE support in VS Code and in gopls, the Go language server. These now scale to much larger code bases and support a variety of static analyses. We’re improving support for refactoring, and we recently added transparent toolchain telemetry, which will enable us to make data-driven improvements to the developer experience. Please opt-in to telemetry with “gotelemetry on” to help us make Go better for you.

We also improved Go’s support for production systems. We added structured logging to the standard library and improved support for HTTP routing. We enhanced code coverage to support integration tests. We added vulnerability management, a critical requirement for securing enterprise systems; and we made vulnerability triage much more efficient by using static analysis and the Go vulnerability database to automatically dismiss false positives (40% of reports). Finally, we launched profile-guided optimization (PGO), which has delivered great efficiency gains and sets us up to deliver much more.

On the business front, Go continues its strong growth as the language of choice for scalable cloud applications. The cloud market is growing at a compound annual growth rate (CAGR) of over 15%, so the future is very bright for the Go ecosystem.

So what’s next? Surprising no one: AI

AI—specifically, generative AI using large language models (LLMs)—seems to be all anyone talks about in tech nowadays.

The reaction from many programmers has been skepticism: sure, these models generate text that sounds correct, code that seems right, images that look nice, but on closer inspection it’s full of errors (and extra fingers). How can we possibly build anything trustworthy on top of something so unreliable?

But then I recall that the Internet is built on unreliable networks, that early Google was built on commodity hardware, that successful organizations are a collection of fallible people. I think about all the techniques we’ve developed to build reliable systems out of unreliable components. There’s always an efficiency cost to doing this (see The Tail at Scale), but it’s possible. And as the underlying components become more reliable (for example, as AI models improve), the whole system can become more efficient. AI presents a new kind of unreliability for us to understand and engineer around, but I’m optimistic that we’ll figure it out over time.

Programmers mainly engage with AI on two fronts: AI assistance (using AI ourselves to be more productive) and AI applications (building software that uses AI to serve our users better). We are investigating both these areas for Go. I’ll write about AI assistance in this article and explore AI applications (and Go’s relationship to Python) in future articles.

Before we dive in, I’ll speak to one concern: Many people have predicted that AI will make programming languages obsolete in favor of natural language. I disagree, because programming languages enable humans to specify what they want precisely, collaborate with other programmers, and debug and maintain software over time. AI will enable some use of natural language for these tasks, and AI has already proven useful atop low-code and no-code systems; but I believe programming languages will remain relevant to software development and operation for many years to come. (I am, of course, biased towards believing this. As Upton Sinclair said, “It is difficult to get a man to understand something when his salary depends on his not understanding it”. Check out Cassie Kozyrkov’s post for another viewpoint.)

AI developer assistance most often takes the form of code completion in the IDE. AI can also help generate code (for example, from a natural language description); explain code (roughly the reverse of generation); translate code between different programming languages; generate documentation, examples, or tests; explain and fix compilation errors, runtime errors, or test failures; answer general knowledge, onboarding, and “how to” questions; and more. AI developer assistance is available from a variety of providers using a variety of models, and some providers make it possible to train custom models on an organization’s private code to enable domain-specific responses.

AI-generated code often has errors, but so does human-generated code. Programmers have developed a variety of ways to validate whether code does what we intended: static checks (like type checking), dynamic checks (like thread sanitizers), model checking, unit tests, integration tests, fuzz tests, runtime monitoring, and more. We can apply validators to check the code generated by AI, and we can also use them to validate the code used to train AI models. An open question is how the errors that AI makes differ from those humans make, and whether we need new kinds of validators to catch those errors.

There’s an interesting connection between AI training data and software supply chain security (S3C): many of the code qualities we want from our training data are also the qualities we want from our dependencies. There may be an opportunity to align the work being done by organizations like the Open Source Security Foundation (OpenSSF) with the needs for training AI models on high-quality open source code.

I believe most programmers will use AI assistance, so we have prioritized making AI assistance great for Go developers. We are investigating:

  • How do we improve the quality of Go code generated by AI models? Can we differentiate between “good code” and “bad code” so that models can learn the difference? Is there value in synthesizing Go code for additional training data? Can we automatically fix errors in the training data with refactoring tools, then train models to make the same fixes? (Tools to identify good code and fix bad code are useful to programmers on their own, and they are also useful for AI and S3C.)
  • If models train on existing open source code, how do they learn to generate code that uses newly introduced language features and libraries? Can we “modernize” training data with refactoring tools, so that the models learn to use the latest idioms? Similar questions apply to training models that must produce code that has specific safety, security, or compliance properties.
  • How do we evaluate whether an AI model generates good Go code? Evaluation criteria are critical to enabling models to improve over time. What are the prompts and responses in this evaluation set? Should such evaluations be open source benchmarks, so we can compare the performance of different models?
  • How should IDEs prompt models to generate good Go code? What needs to be included in the prompt? Do IDEs need to understand Go workspace layout in order to provide the right context in the prompt? Do they need to fetch dependency code via RAG and include that in the prompt?

Today, each AI assistance provider has to address these issues independently for each programming language they want to support. We’re looking at this from the language provider’s point of view, trying to understand how we can scale high-quality AI assistance across many models and providers. All of these questions apply to other programming languages as well, not just Go. I would be happy to see coordination between programming language projects on addressing these issues, so that we make AI assistance better for everyone.

Special thanks to Hana Kim for reviewing a draft of this article and providing great suggestions. Hana leads work on the VS Code Go plugin—the most widely used Go IDE—and is investigating how we can make AI assistance great for Go.

Go 2019-2022: Becoming a Cloud team

In this article, I’ll talk about how we aligned Go with Google Cloud while preserving the core values that make Go great for everyone.

When Go joined Google’s Cloud org in 2019, I experienced the greatest culture shock of my career, as did many others on the Go team. Cloud was a newly formed product area under a newly hired leader, Thomas Kurian. TK (as he’s commonly known) worked quickly to define Cloud’s culture around a relentless focus on customer success and generating revenue through sales and customer spend on the platform. The Go team had come from the Google organization formerly known as Technical Infrastructure (TI), where our definition of success had centered on growing Go adoption and making Go developers happy. We had no idea how to connect our work to Cloud’s business success. But our leadership expressed strong confidence in Go, calling us “part of Cloud’s DNA”.

Fortunately, the Go team was placed alongside other developer tools teams under Mary Smiley, who had a background in leading research teams within product organizations. Mary did an incredible job leading our teams through this change. And the Go team itself includes several people with backgrounds working in product teams at startups and in enterprises. Mary and these others helped me to understand what success in a product organization meant.

They also helped me understand the value of cross-functional teams. We were joined by fantastic user experience researchers (UXRs), Todd Kulesza and Alice Merrick, who helped us understand Go and GCP developers through surveys, interviews, and user studies. We had talented developer relations engineers, developer advocates, and open source program managers like Carmen Andoh who engaged the Go community and brought their voices into our discussions. Our PMs, Steve Francia and Cameron Balahan, organized our cross-functional efforts and served as our liaisons to other teams across Google.

While our team was still responsible for growing Go overall, we now also had to connect Go’s success to Cloud’s success. We had to do this in a way that preserved what made Go great: its simplicity, reliability, and efficiency, as well as its status as a free, open source language that was easily portable to any platform. We determined that the best way to drive Cloud success was to work with GCP product teams to ensure they supported Go well. We worked simultaneously to make Go the best language for building cloud workloads while partnering to make GCP the best place to develop and run Go workloads.

We worked closely with the teams that develop the Cloud SDK, serverless runtimes, vulnerability scanning, managed Kubernetes instances, AI assistance, and more. Many of these products are themselves implemented in Go, so we engaged on two fronts: improving the product teams’ own use of Go and enhancing their offerings for Go developers. Improvements to Go’s efficiency, security, and developer productivity benefited both Cloud’s teams and our customers using Go. We worked with Cloud sales teams to educate them about Go and help them sell GCP to “Go shops”. And we kept talking to our users, working to identify their most pressing needs so that we could address them in Go and, when appropriate, in partnership with GCP.

We also aligned Go with opportunities presented by changes in the industry. For example, the Codecov and SolarWinds attacks alerted the industry to the need to “secure their software supply chains” (S3C). In May 2021, the Biden administration issued an Executive Order defining new requirements on how federal systems would need to secure their software supply chains. With Go, we had the opportunity to meet certain S3C requirements—like authenticated dependencies, SBOM generation, and vulnerability scanning—directly in our core toolchain. We did this in a way that’s open and compatible with any platform, and we partnered with GCP to ensure their S3C solutions could support Go well. We also partnered with Google’s efforts to improve the security of all open source software and worked to make it simple to follow S3C best practices with Go. Check out Russ Cox’s recent ACM SCORED talk for details.

We’ve worked to balance the need to keep Go free, open, and portable with the need to support Google’s business goals through GCP. The results are strong: Go users are very happy overall, and Go users on GCP are happier than those using other languages. Go usage on GCP is high and growing well. Go and GCP are succeeding together, because Go helps GCP developers build their cloud applications quickly, scale them gracefully, and integrate them easily with cloud services.

Nothing in our approach prevents other platforms from providing great Go support, and several do. Competition makes the ecosystem better for everyone. Overall, we believe Go demonstrates a successful model for corporate sponsorship of a major open source project: Go is free and portable, and Google provides steadfast support for the Go project while also investing in Google product integrations that attract Go developers and their workloads.

Go 2016-2019: My transition to management

Over the next few articles, I’ll talk about my experience becoming the engineering manager of the Go team and leading the team through several important transitions: from a focus on internal Google users to external Cloud users; from an engineering team of ~20 to a cross-functional team of ~50; and from an engineering-led strategy to one informed by product needs, business concerns, and UX research.

2016 was a complicated year. In March, I started managing the full Go team. In April, we moved our family to a new neighborhood in Brooklyn. In November, Donald Trump was elected president of the United States. But my 2016 started with the loss of my little sister, Shama, to suicide. Shortly thereafter, my wife Mandi and I started seeing a therapist together to process our grief and the challenges we were navigating together. Today, I am a much better husband, father, friend, colleague, and manager than I would have been without that therapy.

I share these personal details to highlight that each of us are living and working in a broader context of love and loss and hope and frustration. While most of my professional growth prior to 2016 was technical, most of my growth since then has been in better understanding people, both as individuals and as collectives, and how we can succeed together.

I never wanted to be a manager. I love coding, designing systems, iterating on APIs, coming up with the right names for things, and building useful tools that “just work”. But in 2013, the Go team was low on “management bandwidth”, so I accepted the role of Tech Lead Manager (TLM) of the 6-person team developing Google’s internal integrations for Go. At Google, TLM is a demanding role in which you continue acting as Tech Lead (TL) for the team while also serving as their Engineering Manager (EM). Google generally encourages people to specialize in either the individual contributor or manager tracks, but TLM can be useful as a transitional role to determine the right fit.

By 2016, Russ Cox and I were already working closely together to set goals for the entire Go team. So when our EM at the time decided to move on, I was offered their role. Overnight I went from 6 direct reports in the US to 20 direct reports distributed across North America, Australia, and Switzerland. By November, that number would rise to 25. It would take another 3 years to bring my direct reports back down to 6, though by then my org had grown to over 40 engineers plus a dozen cross-functional partners.

Systems thinking helped me transition from designing computer systems to leading organizations. I’ve always loved systems, particularly those we see in the world around us. As a student, I focused my research on distributed systems. At Google, my technical work was building storage and data processing systems in C++, then creating the tools and libraries to enable developers to build such systems in Go. I loved it all. (I recommend Thinking in Systems by Donella Meadows.)

As a manager, I realized that many of the challenges we designed for in distributed systems also occurred in developer organizations, and even the solutions were somewhat similar. Project planning resembled programming in that we had to decompose large problems into smaller ones that could then be pipelined and parallelized. Unlike programs, projects “execute” only once, and the “execution environment” (the team!) changes dynamically (not unlike a multitenant system with component failures). We could reduce the risk that a project would be stalled by an individual absence by distributing the work over more people (eliminating a “single point of failure”), although this adds coordination overhead (as in replicated systems). Replicating knowledge across people pays off in future projects, provided we assign work to take advantage of that knowledge (locality-aware scheduling).

But there’s a key difference between computer systems and organizations: we’re talking about people, not computers. People have career ambitions, financial needs, and disagreements. People are intelligent and take action to make their work better not only for themselves but for those around them. People cannot and should not be “programmed”—instead, the “system” should establish the common values and practices that enable people to work well together. This is more like “programming through agents”, which is far less deterministic than traditional programming, but much more resilient, so long as the system provides sufficient slack for adaptation.

My next realization was that the system of our team operated within the much larger system of Google, the company. As Google adapts to change, so must each organization within the company. In 2013, Google entered the public cloud business with GCP, and by 2016, Go had clear momentum as the language for open source cloud infrastructure projects and cloud native applications. Meanwhile, Go’s growth within Google was slowing. It was time for a change.

When I became Go’s engineering manager, Russ Cox continued to serve as Go’s tech lead, and Steve Francia joined as Go’s product manager. Together we prepared a pitch to Google’s leadership in 2017 to pivot Go from growing internal adoption to growing usage on cloud. The pitch included data we had on Go’s adoption in several companies (based on conversations and public articles) as well as Go usage data on GCP (based on API calls).

The pitch went over well, and the Go team was funded to grow substantially in 2018. When Thomas Kurian joined in 2019, Google moved Cloud to its own organizational unit, and Go moved into Cloud. This began Go’s next chapter as part of a product-driven business focused on serving paying customers, generating revenue, and achieving profitability. For a free, open source programming language, this was … interesting.

Go 2012-2016: Early growth

In this article, I’ll talk about how I got involved with the Go programming language, my first several years working on the Go team, and the factors that contributed to Go’s early growth from 2012-2016.

I first encountered Go in 2010, when Rob Pike came to the Google New York office to give a tutorial on the language. At the time I was working on globally distributed storage and indexing systems. In that work I had created a pair of C++ classes to read and write replicated bigtables in a way that ensured read-after-write consistency, even when some replicas were unavailable. Each of my C++ classes was 700 lines of tricky multithreaded code. The code worked, but it was very hard to understand and maintain.

Rob Pike’s Go tutorial introduced me to a new way of thinking about concurrency, communicating sequential processes (CSP), and Go provided an elegant syntax and implementation of CSP. I immediately tried porting my C++ classes to Go and was amazed to find that I could reduce each down to 100 lines, a 7x reduction in code size! Not only that, but because the code was expressed using higher-level concurrency primitives and no longer had to do explicit memory management, it was much easier to consolidate common code and extend the routines with new logic.

I was smitten. I immediately started trying to write more of my Google code in Go, but I quickly discovered that Go’s support for Google’s internal libraries and protocols was nascent. Thankfully, the culture inside Google supported my spending some time contributing to improving these libraries, and in doing so I got to know several members of the Go team. Near the end of 2011, Russ Cox invited me to join the Go team to work full time on making Go great for Googlers.

I joined the Go team in January 2012. At the time, the team was focused on shipping Go 1.0, a milestone release that would guarantee backwards compatibility and encourage many more companies (including Google) to adopt Go for real production workloads. I personally had very little to do with Go 1.0—my focus was on Go inside Google.

Unlike most of the Go team, I had spent several years developing, operating, and maintaining production systems inside Google, mainly written in C++. I was very familiar with the “production contact surface” that jobs needed to expose to be “operator friendly”: communication protocols, command line flags, logs, diagnostics, status pages, exported metrics, crash reporting, traces, and so on. I was also familiar with the tools Google developers used to get their jobs done. This knowledge gave me confidence in identifying the requirements that Go would need to satisfy to work well inside Google, but I also wanted to consult the experts: Google’s Site Reliability Engineers (SREs).

I called up some of the SRE friends I’d made during my pager-carrying years and asked them to help us identify what Go would need to do to satisfy a “Production Readiness Review”, which was the process at the time for determining whether a system was mature enough for SRE support. The SREs were delighted with the request: this was the first time they had been invited to review a new language well before it would be deployed in production. Google’s SREs helped us get the requirements right, and in particular they identified the need to reduce Go’s garbage collection pause latency dramatically from where it was in Go 1.0. These requirements greatly influenced the evolution of Go’s low-latency concurrent GC.

Many people across Google and the Go team contributed to making Go a great language for Google’s internal use, and Google’s use of Go grew rapidly from 2012 to 2016. However, we noticed that adoption was uneven: organizations that were well served by C++ or Java tended to stick with those languages, while organizations that were dissatisfied with their current language or were writing mostly greenfield code were eager to adopt Go. In 2013, Google’s SRE organization decided to adopt Go for all their own new programs, replacing Python as their language of choice.

Another group that eagerly adopted Go was Cloud. Go was booming outside Google as the language of open source cloud infrastructure, starting with Docker, Vitess, and Kubernetes and expanding to the majority of Cloud Native Computing Foundation (CNCF) projects. Google created Kubernetes, and Google Cloud wrote many of their internal systems in Go. Soon, Go began growing as a language for cloud applications, first in startups, then later in mainstream enterprises.

Reflecting on this period, there were several factors that contributed to Go’s early growth:

  1. Greenfield code. Most professional programming is maintenance of existing code. Paradigm shifts, like cloud development, create the opportunity to write new code in new languages, largely independent of existing programs and libraries.
  2. Suitability to task. Go was designed for building production systems at Google. This includes not just the language and runtime but also the libraries: the Go standard library includes everything you need to build production-grade HTTP and REST servers, and Go gRPC enables building high-performance RPC services. Go provides simple cross compilation and easy deployment with static binaries. Go arrived just in time to meet widespread demand to build new cloud infrastructure and applications, and it provided a simpler programming platform than the incumbents.
  3. Distributed systems. The separation of applications into processes connected over a network created the opportunity to write different parts of the system in different languages, which made it easier to adopt Go incrementally. Go became popular not only for services and microservices, but also for command line tools and web backends.
  4. Open source. Go was open sourced in 2009, and several elements of its design fostered open source collaboration: a standard code format, a standard tool for builds and testing, and a decentralized package system. Go was well timed to grow with GitHub and serve as the language for open source cloud infrastructure and DevOps.
  5. The Go team. Go has always had an amazing team at Google, and Google has provided steadfast support for Go since 2007. Many of the early Go team members were well known and widely respected by the developer community, which certainly helped get people interested in Go.
  6. The Go community. Go’s community has been incredible in advocating for the language and building Go’s rich ecosystem. Contributors have extended Go to many new platforms, operating systems, and application domains.

Many of these factors may apply to the introduction of any new programming language or to the expansion of an existing language into a new application domain. An additional factor is interoperability, which is not a strength for Go but has been critical to the adoption of Kotlin (for Java), Swift (for Objective C), and C++ (for C).

My first four years on the Go team were when I made most of my coding contributions to Go and wrote many of the articles listed on ajmani.net/go. These cover topics like concurrency, package structure, gRPC, and context, a data type I designed to support cancellation and request-scoped data in distributed systems.

In a future article, I’ll talk about Go from 2016 till now. These are the years that I’ve been managing the Go team, during which I’ve learned a lot about running an open source team within a large company.

S

Please visit this post on LinkedIn to comment and discuss.

Introducing myself

I’m considering doing some public writing, and I’m curious what people are interested to hear from me. In this post, I’ll share a little about myself and what I think I can write well about.

Professionally: I’ve worked at Google for nearly 20 years, the majority of that time on the team that develops the Go programming language, a popular language for developing “production software systems”, that is, software systems that run at large scale, reliably and efficiently. Prior to my work on Go, I developed production systems in C++ at Google. My academic background is in computer science; I earned my Master’s and PhD in Computer Science in Barbara Liskov’s Programming Methodology Group at MIT.

Personally: I live in Brooklyn with my wife and our three children. I was born and raised in Houston, Texas; my parents immigrated from India to Houston in 1970, both as doctors. My father is still practicing medicine at age 81, and for a long time I assumed I would follow in my parents’ footsteps and become a doctor, and I studied accordingly.

I’m interested in writing about what I’ve learned in this journey so far and where I see things going next. I enjoy taking a “systems view” of problems and possibilities; part of what has enabled me to transition from software engineering to leading teams is the common “systems problems” across production systems and human organizations. I read a lot about psychology, sociology, and economics and enjoy understanding how those human systems interact with technology. In the immediate future, AI raises lots of interesting questions to explore in this intersection.

In the past, I’ve written in some detail about specific aspects of Go and spoken on concurrency patterns and modeling real-world systems. You can find links to those articles and talks at ajmani.net/go.

That’s it for now. If you have any specific questions or ideas you’d like me to explore, I’m open to suggestions!
