What is platform engineering. I think. Maybe...?

Oct 13, 2022

Or, “is platform engineering just another excuse to use the infinity loop SmartArt in a PowerPoint?”

Preface: I usually have a whole passel of random stuff in the newsletter, it’s my waste book, by design. These past couple of weeks I’ve gotten obsessed with figuring out what the deal with platform engineering is, I mean, beyond the obvious. Here’s my write-up so far. I don’t know, it’s a draft of an idea to see if it feels right. That is my disclaimer! I’ll send out the usual type of newsletter this weekend, or tomorrow. Sometime soon.

What is platform engineering?

I am going to start with this definition of platform engineering. It is three parts, and, like, I don’t know: who knows if it’s right? It is a proposal, what I can figure out so far. You and I will figure it out.

What is platform engineering? Platform engineering is:

Building the layers of stuff between kubernetes and the code that application developers write. You are making kubernetes usable by developers by filling in “the gaps” that it has. This means the runtime environment, how software is packaged, configured and deployed, and the ALM/SDLC tools developers use (this third part makes the definition way big, but stick with me - I mean, how else are we going to jimmy in Backstage?). This means automating and standardizing most everything so that developers have a self-service relationship with all that infrastructure. This is a platform. (Whatever you do, don’t call it a PaaS! Haven’t you been keeping up?)
Running and maintaining that platform over time. The platform engineer works with whoever is below them (IaaS, on-premises infrastructure, etc.) to do capacity management, control costs, apply security patches, and react to other “oh poop!” problems…all the usual stuff from forever in IT. The platform engineering team also updates the platform itself. This is that “who installs the installer?” bit.
Product managing the platform instead of delivering a service. Developers are your customers. Figure out how to make their day-to-day work better by removing developer toil and waste, and removing the need for them to think about things other than moving pixels on the screen. “Cognitive Load,” amiright! This means something that’s easy to overlook: you need a product manager who does product management. (Oh right: also, product marketing to get developers to use the platform, see below.)

There are many more definitions and discussions, especially in the past several months. Here are some: Paul Delory at Gartner (free to read!), Charity Majors, and further round-up from Matt Campbell.

Here is the one from PlatformEngineering.com itself, from July 2022:

Platform engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era. Platform engineers provide an integrated product most often referred to as an “Internal Developer Platform” covering the operational necessities of the entire lifecycle of an application.

It feels like things have expanded since then. However, before I go on, it’s worth noting that this definition would be good enough on its own to be helpful.

There is an adjacent school of old wisdom called “platform as a product,” and there are also still our old pals DevOps and SRE. We’ll get to all of that.

But, you have to keep to three things, and I am trying.

Why is platform engineering important?

As you know, Bob, software is eating the world. Nine out of ten management consultants recommend getting better at software to improve your business. Also, the DevOps reports, both original flavor and new rainbow taste. Praise be the PDFs. Amen.

If there are no application developers involved, you do not need to think about platform engineering. I mean, as with DevOps, you could “platform engineer” your Office 365 install or your desktop management stuff…but…yeah. Moving on.

Which is to say, platform engineering only exists to support organizations writing and running their own software - “Developers” as we say in shorthand.

And, look, I know. I KNOW! You’re like, “but wait, what is new here?” Just stick with me. We’ll get to both how this is different than DevOps and how kubernetes discarded and rebooted about five years of progress solving all these problems soon.

What does a platform engineer do?

Let’s start by looking at what’s been going on with platforms so far. A platform is the thing that platform engineers do. As application developers are to applications, platform engineers are to platforms.

What is a platform? This is pretty good (via Paula), from 2018:

A digital[!] platform is a foundation of self-service APIs, tools, services, knowledge and support which are arranged as a compelling internal product. Autonomous delivery teams can make use of the platform to deliver product features at a higher pace, with reduced co-ordination.

This is an outcomes based definition. It tells you what the aspirational, the “to be” state is. It does not tell you what is in the platform, nor how to get there, nor the activities around it. THIS IS FINE! It’s only two sentences, and there’s the whole rest of the article and more in the series to explain more.

Recall my first proposed part of platform engineering: it’s all that stuff on-top of kubernetes to make kubernetes more usable for developers. Plus, probably, kubernetes itself.

Things only 2000’s ITIL kids will understand

The platform engineering team owns the platform. They must first get one and build it. (Don’t get too focused on “build” meaning that you don’t spend money on vendors and public cloud, “buying” a platform or components of your platform. That’s a whole other thing.) This means they need to figure out who uses it, their needs, and all that. Then the platform engineers need to figure out the components to put in there. Then they need to figure out how to run it all. Then the platform engineers need to figure out the ways to get applications deployed and running. Then you figure out how to keep the platform and all its parts up to date and maintain the platform. It’d be super cool if you can put new versions of the applications on it too. And, of course, bonus points if you can figure out how to keep it running and fix problems the applications have. Probably at the start you’ll need to figure out supporting whatever a “microservice” is and “service mesh.” Eventually you’ll get to the really nifty stuff like progressive delivery, A/B testing, cloud sovereignty, scaling up and down, and all the great cloud stuff you’re BURSTING to have.

Also, there’s support. You know, like, instead of tickets you use Slack.

I have left things off. For example, building out all the middleware and services - databases, event thingies, data analysis tools and services, integrations with payment networks and ERP back-ends for, like, inventory and shipping products, etc. The notion is that your application developers shouldn’t be concerned with setting up, running, and maintaining any of that: they’ll just use that stuff so they can write their apps, move pixels on the screen, not containers in the cloud.

The Pink Elephant in the Room

These activities also mean taking on a lot of meetings and coordinating with other teams: networking, storage, and infrastructure admins; enterprise architects who are setting company policy; security and compliance people; third party services providers (like, for SMS, payments, etc.); the ERP and mainframe teams; and so forth. I don’t think you actually need to do that work, you need to work with people who do it for the platform.

And, also, tracking how much all of this costs and making sure you don’t waste money. You also have the important part of that: paying for it and working on making the case each year in the annual budget cycle to get enough cash for the next 12 months to pay for it all.

Otherwise, the developers will have to walk into that corporate slaughterhouse instead of, you know, focusing on writing applications. MOVE THE PIXELS.

So far, this is pretty standard stuff. This has been going on since forever, at least the 90s when applications were delivered over the Internet (I don’t know, also Minitel and probably some BBC thing, or whatever - moving on.)

It would be unhelpful to say that the difference is that platform engineering has a lot more automation and focus on self-service for developers. But that would be accurate, with some nuance that comes from answering the next question.

How is platform engineering different than DevOps?

Now we are at the fun parts!

This should all sound like DevOps. And you should be thinking, how is platform engineering different than DevOps?

DevOps stopped talking about how to achieve all the DevOps dreams with technology. This was fine, and by design. Once CALMS came about, DevOps shifted its helping hands to improving people’s lives, not just how to use cool, new tools.

Here is that shift in two charts:

DevOps is now all about “culture” and it is doing great work there. It cares about people, which is a tragically rare thing in IT…or any corporate setting, really.

But, this means that people are left asking, “yes, but how do we keep the servers running?”

As these things usually go, by seeming coincidence the SRE book came out and seemed to offer a lot of the “how,” especially when it came to re-thinking metrics (SLAs on Stone Tablets to negotiated SLOs), a shift in how to judge what “healthy” is (not uptime and availability, but MTTR and latency), and a kind of passive aggressive version of “you build it, you run it” (I will help you only if you follow my rules and use my tools). What’s easy to forget about the SRE book is, like, all the Google proprietary parts. The entire platform and tools that only Google has to actually do all of that. Again, we slip into the comfort of being tools agnostic.
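To make that metrics shift a little more concrete, here is a minimal sketch of what “judging healthy” by latency and MTTR (instead of raw uptime) could look like. Everything in it is an assumption for illustration: the function names, the 300 ms threshold, the 75% target, and the sample numbers are made up, and none of it comes from the SRE book or from Google’s tooling.

```python
from datetime import timedelta


def latency_slo_met(latencies_ms, threshold_ms, target):
    """Return True if at least `target` (a fraction, e.g. 0.99) of the
    requests finished within `threshold_ms` milliseconds."""
    fast = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return fast / len(latencies_ms) >= target


def mttr(incident_durations):
    """Mean time to restore: the average length of the given incidents."""
    return sum(incident_durations, timedelta()) / len(incident_durations)


# Hypothetical numbers, purely for illustration.
latencies = [120, 180, 250, 900]  # request latencies in milliseconds
incidents = [timedelta(minutes=12), timedelta(minutes=8), timedelta(minutes=10)]

print("latency SLO met:", latency_slo_met(latencies, threshold_ms=300, target=0.75))
print("MTTR:", mttr(incidents))
```

The point of the sketch is only that an SLO is a number you negotiate and then check, and that MTTR is an average you can compute from your own incident history. The hard part, the platform that produces those numbers, is exactly what the book leaves to Google’s own tools.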

Let me soapbox-on-a-soapbox here a bit: being technology agnostic is a trap. You need to always talk about the technology, and most of how you want to improve things will be made possible by the technology. To ignore the technology, to say it’s “easy” or “not important” is like the old how to draw an owl bit: all that stuff that happens off-stage (the technology) is critical. Otherwise you’re just two overlapping ovals followed by disappointment.

Platform engineering is different than DevOps in three ways. Hold on, let me correct that. Platform engineering evolves DevOps (“builds on”?) in three ways (shoulders of giants - [finger guns!]):

Platform engineering uses product management to build the platform. Developers are the customer.
Platform engineering focuses on marketing, advocacy, and driving usage for the platform. You have to get people to use the platform.
Platform engineering talks about what tools to use and how to use them. It will even prescribe which tools to use.

Let’s look at all three.

How is platform engineering different than platform as a product?

The first two fold in an older concept, “platform as a product” and are the key, new points about platform as a product.

Product management

First, in that thinking, operations shifts to thinking about developers as their customer and, then, ops continuously builds the MVP product that those developers would want to use. Then you observe if the developers are doing well, and take another iteration to make it better. Eventually, the MVP becomes the ideal solution, often one that you couldn’t have predicted at first. You know, like, lean startup and stuff.

This notion also pulls on that Netflix (of course!) thing from long ago about an internal tools group that has to win over the developers, and (of course, again!) some of the SRE-think. To avoid going on about it, “what if enterprise infrastructure and architecture but with product management?”

Here’s a quick test for the “we’ve always done Agile/DevOps/platform engineering, we’ve just never called it that” set (yes, I’m looking at you over there): do you have a product manager in your ops, DevOps, SRE, whatever team? No? Well, time for some platform engineering!

Product marketing

Second, people won’t use your platform (correctly) without you doing some marketing to them. I don’t like the following way of putting it, but it is true: you can’t mandate that developers use your centralized IT stacks. Well, you can, and they will, but they won’t do a good job at it. And they’ll also try to subvert you and find someone else’s budget and authority to move to. They’ll even build a whole new organization in your org-chart, like, “product” to escape IT mandates. The top-down reaction (from the senior executives and board, even) is to just outsource it all.

Both of these are usually not good, especially the developer one. Developers will not take care of the infrastructure and platform after one or two years. They will lose interest, get new, exotic needs and build a different platform, and, equally likely, find a new job based on their ability to put in place a platform. And then you have this nifty platform but no one who knows how to maintain it, let alone upgrade it.

Application developers are very much a launch culture, not a “keep it running so nobody notices for five years” culture. Developers will create a beautiful mess (build their own platform), and then ops will have to clean it up…no…not even clean it up…ops will have to feed and care for that mess. And then ops will get in trouble when it all melts down. I mean. I don’t like that idea or even typing it. But. I think it’s true…?

So. You have to get people to use your centralized standardized platform beyond mandates. Hence, you pave the desire paths with gold, and so forth.

More than just building, though, this means going out there and talking with developers, doing training, workshops…convincing them that your platform is good and helping them use it. Taking their feedback seriously and back to the product manager. Making sure they understand why using your platform is good and the problems it solves for them (the value props), why it’s better than alternatives, and building trust based on other developers who’ve been successful (customer references).

This is all marketing!

If you don’t like that, you can call it “developer advocacy,” or even “developer relations.” It doesn’t matter: platform as a product includes marketing the platform. As companies like JP Morgan Chase, BT, and many others do, you hire full time people to do this - something like 5 or 6 at JP Morgan last I checked. Like, they have developer advocates at a bank!

Getting mentally prepared to break a taboo

Product managing and marketing the platform are the first two things that make platform engineering different than DevOps. The last is a focus on actual technology…even strong opinions about the technology you should use.

OK. I am being a bit aspirational here. This may not be an actual thing, but I want to suggest that technology needs to be a focus if platform engineering wants to fully evolve from DevOps…be long-term transformative.

As we saw with DevOps, if you don’t keep a strong focus on tools, you will become only about culture, and you will lose track of how to achieve your desires, at least, when it comes to how computers fit into it all.

Here is why technology is important for platform engineering.

Story time.

The Innovator’s Dilemma: Kubernetes Case Study

In the 2010’s, many people were hard at work solving all these problems we’re talking about here: how can we get developers to focus on writing applications and spend less of their time on infrastructure stuff? Oh, and, also: still keep everything up and running.

How can we achieve the DevOps dreams?

At the time (and now!), this meant building container-based platforms. This was the most recent, great attempt to build the PaaS Utopia. I hope we’ll get back to that, but, as you recall, we don’t talk about PaaS.

Let us pick, I would say, the most proven, popular, successful, and still widely used PaaS from that era: Cloud Foundry. Now, I am biased because I worked at Pivotal and, now, VMware. We built a business on it. Large organizations used it to build out their own businesses by making developers more productive. Etcetera. I talk with people to this day in large organizations who use Cloud Foundry based platforms (Pivotal Cloud Foundry/Tanzu Application Service) to great effect, with benefits, and most of the people (ops and developers) who use it love it.

But something weird happened. This new container orchestration platform that barely could be installed, lacked many “enterprise grade” capabilities you would need, was impossible to find knowledgeable people for…and so forth…totally killed the 2010’s PaaS ecosystem.

HEY! Kubernetes!

Kubernetes is a perfect Innovator’s Dilemma upper-case-d Disruptor: it won the market with less functionality at a lower price (free!) and was ignored by many of the incumbents for too long.

What kubernetes won over (and I’m still not really sure how, or why, because, like, I am the disrupted so blind by definition) was the community, the people - the ops people building platforms, the developers too. Despite the capabilities of the 2010’s PaaSes, the solid examples of customer success, “the market” was just sort of like, “how about: nope.”

And now we have kubernetes as the infrastructure people want, you can see growing use and desire. Along with many other surveys and analyses, you can see that in these two charts from the VMware State of Kubernetes 2022 survey:

Here is the problem as described by one of the creators of kubernetes:

Well, I don’t know how many of you have built kubernetes based apps, but one of the key pieces of feedback we get is that it’s powerful, but it can be a little inscrutable for folks that haven’t grown up with a distributed systems background. The initial experience, that wall of yaml as we like to say, to configure your first application can be a little bit daunting. And I’m sorry about that. We never really intended folks to interact directly with that subsystem. It’s more or less developed a life of its own over time.

The kubernetes people have been saying this all along

And, look, I’m going to keep saying this: this is fine, not dog in a house on fire fine, but like, fine as in “totally cool, the natural order of things.”

We are, then, back to building a good developer experience on-top of a container-based platform. The way the diffusion of innovation happens is that your “innovators” and “early adopters” will put up with anything because they want to use the new technology: they’ll fill in the gaps, burn the oil, build that platform on-top of the platform on their own. But as you go mainstream (“cross the chasm”), the people using the technology don’t want to do all that. In our case, “they” are developers, and they want to write applications.

I think that’s why you see the decrease in developer benefits from kubernetes over the past three years in the same survey:

These are not the two best metrics to track “developer benefits,” but shortening developer cycles is, sort of, like, incredibly important. You don’t want to make them longer. And, I mean, there’s all the “wall of yaml,” etc. stuff.

Find whatever survey you want, and you’ll see the same thing: kubernetes is too hard for application developers.

We’re all focusing now on fixing that problem by building a platform on-top of a platform for kubernetes. That was the whole intention of kubernetes, remember? And, thus, that’s the whole game in this space right now. This whole “DevX gap” (or whatever) with kubernetes is what’s summoned all this attention on platform engineering recently.
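If “a platform on-top of a platform” sounds abstract, here is a minimal sketch of the idea: the developer hands over two facts (an app name and a container image) and the platform fills in the rest of the wall of yaml. To be loud about the assumptions: `render_manifests`, `PLATFORM_DEFAULTS`, the `checkout` app, and the `registry.example.com` image are all hypothetical names I made up for illustration, and this is not how any particular product does it. The manifest fields themselves follow the standard kubernetes Deployment (`apps/v1`) and Service (`v1`) shapes.

```python
import json

# Hypothetical platform-wide opinions the platform team maintains,
# so that individual developers do not have to.
PLATFORM_DEFAULTS = {"replicas": 2, "port": 8080}


def render_manifests(name, image, **overrides):
    """Expand a tiny app description into kubernetes Deployment and
    Service manifests (as plain dicts), filling gaps with platform defaults."""
    settings = {**PLATFORM_DEFAULTS, **overrides}
    labels = {"app": name}

    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": settings["replicas"],
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": settings["port"]}],
                        }
                    ]
                },
            },
        },
    }

    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "selector": labels,
            "ports": [{"port": 80, "targetPort": settings["port"]}],
        },
    }

    return [deployment, service]


# What the developer actually types: two facts about their app.
manifests = render_manifests("checkout", "registry.example.com/checkout:1.4.2")
print(json.dumps(manifests, indent=2))
```

Two lines of input turn into dozens of lines of output, and every one of those generated lines is a decision the platform team made once instead of each developer making it over and over. A real platform has far more of those decisions (security, monitoring, rollout strategy, and so on), which is why it needs a team and a product manager rather than a script.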

(I would also like to point out, too, that if you’d prefer to just get on with things, and the platform’s opinions fit yours, that Cloud Foundry is available, evolving, and getting its relationship with kubernetes squared up. I mean, you know, it works and is in production all over the world. [two thumbs up with smile!])

Platform engineering needs to focus on technology

And, thus, we end up at the third way that platform engineering is different than DevOps. Well, like, the way I hope it will be.

With “exceptions that prove the rule” aside, platform engineering is about building a platform on-top of kubernetes. This drives everything up the stack: the management and monitoring tools you use, the way you package and configure applications, the way you architect applications, the way you create automation and self-service, the way you secure and standardize things, the way you trouble-shoot incidents in production…and on…and on…

Kubernetes is a leaky abstraction, and this is fine. What kubernetes does so well is create a standard way of representing and working with infrastructure, even programming infrastructure. They call this an “API” which I find incredibly confusing as a retired application developer, but, hey, now-a-days I just do slides. Kubernetes wants one way of dealing with infrastructure, no more variation and heterogeneity. Variation is a hassle for operations and leads to problems.

But, when you do that, you get leaks up the stack. When there is only one way of thinking about working with infrastructure, it drives everything up the stack. This was true of the 2010’s PaaSes; true of bare IaaS; true of distributed, three tier applications; true of LAMP, J(3)EE, and rails; true of mobile apps; true of desktop-GUI three tier applications; and, I assume, true of minicomputer and mainframe…uh…workloads and…er…“batch jobs” (did I get those last two right?).

The point of accepting the leaky abstraction theory is that it gives you guidelines for what matters up the stack, what you need to do to be successful. Hence, the technology is important. Sure, it’s dangerous to tie your methodology-as-culture to tools and technology (witness SRE with Google’s proprietary technology), but…it’s also not bad. As we saw with DevOps, when you distance yourself too much from technology, you become something else.

Platform Engineering is Three Things

Getting kubernetes ready for developers and continuing to evolve it as such.
Running and trouble-shooting that platform, including all the usual ITSM stuff like support, capacity management, funding, security, etc.
Product managing and product marketing the platform to build a platform developers want to use and get them to use it.

I don’t know. I should probably go watch those PlatformCon videos more and start lurking in the platform engineering Slack channel to learn more.

Addition: it’s a good point to make that I am overstating the one to one mapping to kubernetes a lot here. In fact, a lot of the lessons learned and experience to draw on comes from platform engineering that wasn’t using kubernetes, e.g., the Cloud Foundry and other PaaS users out there (that are still using it to delight, as well). What I’m trying to capture is that the “platform engineering” discussion I’m currently encountering is very kubernetes focused. This is similar to DevOps, in the initial years, being very focused on Puppet, Chef, etc. Sure, you could call that “automation,” but those were the two big ones and were the focus of so much at the time. Also, I’m overstating to add to the point at the end: since kubernetes is so desired now, a new methodology with a new name is bound to emerge. So, sure, it doesn’t have to be kubernetes, but it seems most often that it is and that, maybe, kubernetes is a large part of what’s driving so much interest in this now.