Coté's Commonplace Book #34
The problem with me and newsletters is that I constantly want to make one, but they're a lot of work. I long for the old blog days of just a splatter of content, a wunderkammer. Let's try this commonplace book format.
Software Defined Talk Episode 141: Broadcom acquiring CA, AT&T acquiring AlienVault, the mysteries of cloud native vendor product management — www.softwaredefinedtalk.com We try to discern the strategy behind two acquisitions this week: Broadcom buying CA and AT&T buying AlienVault. Seems fine. Meanwhile, you get to join the conversation as we talk about how different product management seems at cloud native vendors compared to traditional “enterprise product management.”
Databasing with Greenplum, with Ivan Novick (Ep. 108) by Pivotal Conversations — soundcloud.com
We talk databases in this episode. First, the history of databases and why the relational database became king, for a while at least, and then how databases evolved, ending up talking about Greenplum. Greenplum is the world’s first fully-featured, multi-cloud, massively parallel processing (MPP) data platform based on the open source. With Ivan Novick, we go over all that and cover some use cases. Also, as always, some recent infrastructure software news.
"Metrics"
[W]hat the business considers to be valuable is not always what Agile developers focus on when they think about delivering value. All too often, developers like to concentrate on stuffing as much new code and as many new features as possible into every release without regard to the level of business value generated.
Core DevOps (tech) metrics, from Nicole Forsgren
"For organizations in technology, I really push to the IT performance metrics we've identified in our research, because we've found they drive value in so many areas. These IT performance metrics capture speed and stability of software delivery: lead time for changes (from code commit to code deploy), deployment frequency, mean time to restore (MTTR), and change fail rate."
"It depends on what your organization does, what's most important and relevant to your organization, etc. For example, one good candidate might be Net Promoter Score (NPS) -- but that assumes you work in an industry where referrals are important. This metric might not be relevant (and maybe not applicable at all) for public goods and government services."
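The four delivery metrics in the first quote are easy to compute once you have a record of each deployment. Here's a minimal sketch, not Forsgren's method: the `Deployment` record and its field names are hypothetical, I use the median for lead time, and I measure time to restore from the moment of the failed deploy, which are all my own assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape; real data would come from your CI/CD and incident tools.
    committed_at: datetime                   # code commit
    deployed_at: datetime                    # code deploy
    failed: bool = False                     # did the change fail in production?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def delivery_metrics(deployments, period_days):
    """Summarize the four delivery metrics over one reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Assumption: restore time is counted from the failed deploy to restoration.
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "median_lead_time": median(lead_times),
        "deploys_per_day": len(deployments) / period_days,
        "change_fail_rate": len(failures) / len(deployments),
        "mean_time_to_restore": sum(restores, timedelta()) / len(restores) if restores else None,
    }


# Made-up sample data: three deployments over three days, one of which failed.
sample = [
    Deployment(datetime(2018, 7, 1, 9), datetime(2018, 7, 1, 11)),
    Deployment(datetime(2018, 7, 2, 9), datetime(2018, 7, 2, 13), failed=True,
               restored_at=datetime(2018, 7, 2, 13, 30)),
    Deployment(datetime(2018, 7, 3, 9), datetime(2018, 7, 3, 10)),
]
metrics = delivery_metrics(sample, period_days=3)
```

The point is less the arithmetic than the inputs: if you can't fill in a record like this for every deploy, that gap is the first thing to fix.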
Andrew, Israel, and Patrick on metrics
Simple ops: Availability (uptime), capacity, health (status, red/yellow/green).
Team: burn-up/down chart.
Team: velocity (# of stories per release); utilization (are they on the bench or overworked?)
Business: ROI, inventory (WIP), time to market,
Others: Mean Time Between Failures, MTTR, technical debt,
Customer responsiveness is (negatively) affected by technical debt.
Business-centric metrics from Starbucks
Starbucks created its mobile order and pay service to increase revenue from people with limited time, so it gauges success by measuring the usage of mobile order and pay overall, usage in the busiest stores, frequency of customer visits, and ticket size. Starbucks even monitors queue behavior, as Howard Schultz shared when he told investors that mobile order and pay “significantly reduced attrition off the line.”
HP Lovecraft corner
Lovecraft's one line story ideas
"The walking dead—seemingly alive, but—."
John Hay Library of Providence and H. P. Lovecraft
Deadpool
Toil
Google's SRE methodology identifies an intriguing term, "toil":
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
Toil can also include non-technical things like needing to respond to low-priority emails, investigating routine production failures, and having to focus on manual releases. You want to eliminate toil because it's time consuming, error-prone, and demoralizing. While any given toil task may seem like it must be done manually, chances are it can be automated, at least with contemporary technology.
While the SRE book focuses on toil in production, you should also track toil in development. Just as you don't want your platform operators doing manual tasks over and over, you want your developers to avoid such low-value, but time consuming work. In both cases, you automate any toil you encounter.
You track the amount of toil developers and platform operators do over time to identify when you need to focus more on automating toil. If you find that people are spending too much time on toil, you need to quickly investigate why. It may be the case that the cloud platform you're using is lacking critical features, that people are not using those features, or that they are not being given the time to work on automating toil.
Tracking toil can be difficult - few people want to do hourly time logs. If you've gone full-bore IT Service Management, you may be able to derive some metrics from your ticketing system to give you a sense of toil.
One very rough way to track toil is to keep track of two ratios: developers to platform operators, and applications to platform operators. In theory, if your platform is highly automated, you'll need fewer operators to support the number of developers and applications using it.
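As a back-of-the-envelope illustration of those ratios, here's a small sketch. The function name and the quarterly headcounts are made up for the example; the only idea taken from the text is dividing developers and applications by the number of platform operators and watching the trend.

```python
def platform_ratios(developers, applications, operators):
    """Return how many developers and applications each platform operator supports."""
    if operators <= 0:
        raise ValueError("need at least one platform operator")
    return {
        "developers_per_operator": developers / operators,
        "applications_per_operator": applications / operators,
    }


# Hypothetical headcounts per quarter: (label, developers, applications, operators).
history = [
    ("Q1", 200, 80, 10),
    ("Q2", 260, 120, 10),
]

# If the ratios rise while the operator count stays flat, the platform is
# absorbing more work per operator, which hints at less toil per application.
trend = {label: platform_ratios(devs, apps, ops) for label, devs, apps, ops in history}
```

This says nothing about how people actually spend their hours, so treat a flat or falling ratio as a prompt to go ask, not as a verdict.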
Availability
Availability answers questions like: is your software up, and for how long monthly, annually, etc.? How frequently is it down? Availability is the source of the infamous "five nines" discussions, which show what percentage of time your software is "up" and ready to go during some period of time, usually annually or monthly.
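To make the "nines" concrete, here's the arithmetic as a tiny sketch. The function name is mine, and it assumes a plain 365-day year with no allowance for planned maintenance windows.

```python
def allowed_downtime_minutes(nines, period_days=365):
    """Minutes of downtime permitted by an availability target of N nines.

    Two nines means 99%, three nines means 99.9%, and so on.
    """
    unavailability = 10 ** -nines          # e.g. 5 nines -> 0.00001
    return unavailability * period_days * 24 * 60


# Downtime budget per year for two through five nines.
budgets = {n: allowed_downtime_minutes(n) for n in range(2, 6)}
```

Five nines works out to a little over five minutes of downtime a year, while three nines allows a bit under nine hours. Each extra nine cuts the budget by a factor of ten.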
Day-to-day, the point of tracking availability is to know if your software is ready to be used; if it's not, you get an alert, and then you start remediating and fixing it. Over the long term, you want to report on availability to gauge your organization's ability to keep the software up and running. Clearly, if the trend line is going down, you should fix that.
For financial reporting, availability will also be a good, historic metric to use when demonstrating IT's value. And, for various regulations and compliance, you might need to prove that you're providing the mandated or agreed on availability.
While availability is an intuitively important metric, there's been much discussion recently that it is far from the best one. If your software can recover from errors in milliseconds, with no visible effect on the people using it, what does it matter how much uptime the software has? You want to look at availability through the lens of users: are people using the software having problems because the system is unavailable? It's often more useful to narrow down "availability." As the Google SRE book puts it:
Using an aggregate unavailability metric (i.e., "X% of all operations failed") is more useful than focusing on outage lengths for services that may be partially available—for instance, due to having multiple replicas, only some of which are unavailable—and for services whose load varies over the course of a day or week rather than remaining constant.
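Here's a rough sketch of the difference between the two views, under my own assumptions: the hourly buckets and their counts are invented, and "time-based" here simply means counting any hour with failures as a bad hour.

```python
def aggregate_availability(buckets):
    """Share of operations that succeeded: 1 - (failed operations / all operations)."""
    total = sum(total_ops for total_ops, _ in buckets)
    failed = sum(failed_ops for _, failed_ops in buckets)
    if total == 0:
        return None  # no traffic, nothing to measure
    return 1 - failed / total


def time_based_availability(buckets):
    """Share of time buckets with no failed operations at all."""
    clean = sum(1 for _, failed_ops in buckets if failed_ops == 0)
    return clean / len(buckets)


# Hypothetical day as (total_operations, failed_operations) per hour:
# one busy hour with a partial outage, 23 quiet hours with no failures.
day = [(10_000, 1_000)] + [(100, 0)] * 23

by_operations = aggregate_availability(day)   # weights the busy hour heavily
by_time = time_based_availability(day)        # treats every hour the same
```

With uneven load, the aggregate number comes out lower than the time-based one because most of the day's traffic arrived during the bad hour, which is closer to what users actually experienced.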
Also worth considering, if you're an availability fanatic, is that five, six, or n+1 nines might be so expensive as to wipe out any profit your software makes: being perfect could be your downfall. What's key with availability is figuring out what actually matters and adjusting it over time.
(Above cut from a chapter on "metrics" I'm writing.)
Revenue and Spending
'You might point out that you own a share in the company that grows in value as the company does, and that right now you can sell that share on the stock exchange for $13.31. But that evades rather than answering the question: What does the person who buys the share from you expect to get from it? The value of a stock in the market is supposed to be equal to the present value of its future cash flows, and there’s nothing about the stock itself that promises you any cash flows. Or you might say that Snap’s directors and officers have a fiduciary duty to you to maximize the profits of the company and the value of your shares, but even if that were true—it’s pretty debatable—it continues to avoid the question. If Snap made massive consistent profits for decades, it would still never have to give any money back to shareholders, and the shareholders would have no way to force it to. “I own a 1/1,258,171,112 share of a massive pile of cash,” you could say, but you could never spend it.'
Micro Focus belches as it struggles to digest HPE Software
"Operational difficulties fed into the financial results for Micro Focus's half-year ended 30 April 2018 with sales down 8 per cent year-on-year to $1.974bn. Nearly all of the revenue lines were down with the exception of subscription and SaaS. Licence sales dropped 18.4 per cent to $396.4m, maintenance was down 3.5 per cent to $1.109bn, and consulting dropped 27.5 per cent to $149.9m." Also, SUSE revenue: "The top line number included an $182.9m contribution in sales from SUSE, which Micro Focus is offloading to a private equity biz for $2.53bn, and was 17.2 per cent higher than a year earlier."
Acquisitions
AT&T to Acquire AlienVault DALLAS, July 10, 2018 — AT&T* today announced its plans to acquire AlienVault®, a privately held company based in San Mateo, Calif. The
Weirdest. Acquisition. Ever. Broadcom buys CA Technologies
Broadcom buying CA for US$18.9bn. "Mainframe solutions dominate CA’s income, pulling nearly $2.2bn in the 2017-2018 financial year, followed by its enterprise solutions segment at $1.75bn and services at $311m."
Red Hat's James Talks About the Importance of Open Source Innovat
Topic: what about price? Also, I want a more nuanced understanding of open source and lock-in. I think lock-in in open source is
won’t be held hostage to higher prices, and,
if there are multiple distros, you can move (not too easily, but at least you can) to other distros, which is more about shared standards and conventions (APIs and standards) than the actual implementation.
Books
Summer reading list for building your community
Books To Base Your Life on (The Reading List)
Finished reading:
Currently reading:
Amsterdam: A History of the World's Most Liberal City: "what is today the Netherlands is one vast river delta."
How to computer
Farmers Insurance Tests AI, Automation’s Potential For Speeding Up Claims Process
“AI-powered image recognition could help speed up the claims process for damages related to, for example, windshields. Windshield claims are common and easy to resolve but still often require claims adjusters to go out in the field and make assessments, Mr. Guerra said.”
Australia’s Digital Transformation Stumbles Badly
“Many Australians, especially the poor, now see their government using digital technology as an indiscriminate, uncaring, and illegal club to beat them with. The government’s planned use of facial recognition to determine if a welfare recipient should receive benefits will do nothing to change their minds.”
Goodbye Microservices: From 100s of problem children to 1 superstar
“Briefly, microservices is a service-oriented software architecture in which server-side applications are constructed by combining many single-purpose, low-footprint network services. The touted benefits are improved modularity, reduced testing burden, better functional composition, environmental isolation, and development team autonomy. The opposite is a Monolithic architecture, where a large amount of functionality lives in a single service which is tested, deployed, and scaled as a single unit.” That’s a good definition!
Preliminary Analysis of the Site Reliability Engineer Survey
If the response takes too long to get to your phone, the system might as well be "unavailable": 'If a page takes too long to load a user will consider it to be unavailable. I realized after the fact the nuances of this were not considered in the phrasing of one of our questions. We asked “What service level indicators are most important for your services?” Three of the options were end-user response time, latency, and availability. I view availability as the system up or down, latency as delays before a response is generated and end-user response time as how long before the user received the information they wanted. If an error message appears or the page fails to load, an application is unavailable. If a page takes 10 seconds to load, it’s available but incredibly frustrating to use. For SREs availability means more than is a system up or down. If the response time or latency exceeds a certain threshold the application is considered unavailable.'
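That threshold idea is simple to express in code. A minimal sketch, with assumptions flagged: the 2-second threshold and the function names are mine, not from the survey.

```python
def counts_as_available(succeeded, response_seconds, threshold_seconds=2.0):
    """A request counts as available only if it succeeded AND came back fast enough."""
    return succeeded and response_seconds <= threshold_seconds


def user_facing_availability(requests, threshold_seconds=2.0):
    """Share of (succeeded, response_seconds) requests that count as available."""
    if not requests:
        return None  # no requests, nothing to measure
    good = sum(1 for ok, secs in requests if counts_as_available(ok, secs, threshold_seconds))
    return good / len(requests)


# Hypothetical requests: the 10-second page load technically succeeded,
# but under this definition it counts against availability.
requests = [(True, 0.3), (True, 10.0), (False, 0.1), (True, 1.0)]
score = user_facing_availability(requests)
```

Picking the threshold is the hard part, and it probably differs per page or per API call.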
Monitoring SRE's Golden Signals
How to actually get all those metrics from various types of middleware and web infrastructure.
Happy 10th birthday, Evernote: You have survived Google and Microsoft.
Once or twice a year, I try to switch to something else - usually markdown-driven stuff that saves to Dropbox. I always return to Evernote.