AI Reviews Aren't Very Good

They're hard to review, but charts are meaningless to most humans. // Also, fun and strange finds from the World Wide Web

Aug 08, 2025

Making AI Reviews More Useful

Reading all the ChatGPT 5 coverage confirms my feeling that no one knows how to review these models yet. It’s either those incomprehensible charts (and also, who cares how good they are at math? More: who understands what those tests even mean?) or people just saying “I really like it.”

It’s a classic IT/business alignment problem. Until you define the “business outcome” you want and how you can improve it, you’re vibe-ROI’ing.

Maybe some tests:

Have it give you sartorial advice. Dress like it tells you to and see if you get treated better.
Can it (tell you how to) fix your plumbing problems cheaper than hiring a plumber?
Ask it what art you would like and see if you do like it. Will it help you discover new art?
Have it give you a walking tour in your own city. Does it match what you would show people?
Ask it to critique your work’s strategy and suggest small and large changes to make to improve profits, share price, etc. Do they make sense?
What is the best episode/season of a TV show to start with to see if you like it?
What’s a recipe for a Greek-inspired cucumber salad? Given the following ingredients, what should I make?
Look at my email and my calendar and give me a list of valuable things I can do in five minutes each. What are things I’m missing, and what are things I can completely ignore? Make me GMail (etc.) filters accordingly.
Explain to me when I should spend Euros, pounds, or dollars. Make me a spreadsheet to tell me when.
Based on recent earnings calls, what is Apple’s strategy and when is an ideal time to buy and sell their stock?
What is the ideal garbage and recycling plan/strategy for New York vs. Amsterdam vs. Waco? How does it compare to what they currently do?
Write me an 800 word short story that I’d like.
What are some astounding things I could ask you to do?

And so forth.

Maybe we could get inspiration from car reviews. I think those are based on aesthetics, performance (I’m guessing speed and things like ability to turn), smooth ride, (I’d hope!) durability and cost of maintenance over time, all compared to price. How do you evaluate kitchen gadgets or the price/performance of Velux windows?

The other part of the evaluation needs to consider the app’s capabilities. The model is just part of the overall experience. The actual app and integrations/tools in ChatGPT and Claude make a huge difference in the quality of the experience. For example, ChatGPT’s long-term memory features are incredibly important. How good is it at using your Google Drive, etc.?

A simple one that most fail utterly at is something like “tell me who I emailed with the most in 2003 and what we talked about.” I just added my GMail to ChatGPT today and asked that and it said, “I don’t have access to your GMail” even though I’d just added the integration.[1] The last time I tried this with Gemini, while in Gmail, the result was sad.

You could probably also re-use whatever test cases we had for Alexa and Siri before we all realized they were just good for turning on music and telling the time.

You can of course have it write things for you. I feel like programming is mostly a finished task. We sort of know that it can write the first few passes of code. We can intuit that the limitations of AI programming are: (1) long term maintenance will be a nightmare, probably no better than it currently is, (2) a lot of programming is not actually writing code, but product management, design, etc. The AI is fine at programming, but all the other stuff is equally important.

For writing, maybe give it the daily White House press releases and videos, along with other agencies. Have it write a daily briefing and compare it to what the NYT writes about that day. Then compare it to what The Economist publishes that week. You could do it for tech news.

Upload all of your journal entries for 10+ years and say “what is wrong with me, and what should I do to improve?” After a week, are you better, happier, did you change anything?

A lot of these tasks have to do with data and content: getting access to a lot of it and having the AI work with it. Again, that's something that has little to do with the model itself, and more to do with the app.

You could also take this text and have the AI write a better version incorporating other content and coming up with some original thoughts.

Less Goofy. More Enterprise.

I give my take on Google Cloud’s progress and prospects in this week’s Software Defined Talk: “This week, we discuss cloud earnings, what’s driving valuations, and why AWS says it’s still early innings for cloud. Plus, Coté does a deep dive on Shipley Donuts.” Listen to the audio, or check out the un-edited video.

Also, in my self-proclaimed role as Apple’s (fractional) VP of Cables, I suggest a complementary product:

The more ports, the more cables they’ll sell, right?

Wastebook

“We all used to meet at weekends and draw each other. We went to the cinema and discovered the films of Jean Renoir, Rene Clair and Marcel Carne. We jived and jitterbugged to Humphrey Lyttleton’s jazz band every Monday evening. We also got sucked into the drinking scene in Soho. It was all over in 2 or 3 years, but in my memory it seems to have been longer.” Some UK weirdos, long ago.
“Slopject.” From bruces.
They say developers don't like being marketed to. Yes, and, no one likes being marketed to, right? Successful marketing is rarely thought of as “marketing.” As with any marketing, from toothpaste to piping valves, in developer marketing, if a developer doesn’t like your marketing, it just means you need to come up with different marketing.
This is a good meta episode because the guest is so slippery at answering questions and does not “play in the space.” She refuses to follow the norms of the questions, which, in this podcast, are largely about coming up with new thoughts based on what you know, not just talking about what you already know and have proven. As a result, Tyler has to coax her and explain how to play in the space, revealing the mechanics of the format and the tactics for getting people to have interesting conversations. The actual topic is interesting too. And, to be clear, I don’t think the guest is being “bad,” I like her style and responses as well.

Relative to your interests

At-Scale Management and Multi-Foundation Views with Tanzu Hub - If you’re running a platform - or platforms - inside a large organization, Tanzu’s got something for you.
Your AI workloads still need a service mesh - You always need a load-balancer/proxy/gateway, more generally, “middleware.”
Why AI Isn’t The Silver Bullet For Customer Service — Yet - A typical digital transformation problem: you change just one tool, but not everything else, especially how people work and the (now) old systems you integrate with.
Columbus, Ohio’s Revival: a Model for the Rust Belt - Update from IRL. Now, back to cyberspace.
Gartner Says Worldwide IaaS Public Cloud Services Market Grew 22.5% in 2024
Remembering Dominic Pannell: Dom’s Framework for Influencer Ecosystem Mapping - Managing the people who influence enterprise IT buying.
Heroku brings app development to the AI era - One day I hope people stop focusing on building containers, and start focusing on building apps.
In the world of podcasts, YouTube is now the elephant in the room — just like in TV - This is obviously a category error: if I can’t add it to Overcast with an RSS feed, it’s not a podcast. But, (1) old man yelling at clouds, it me! and, (2) aside from using the word “podcast,” good info. // “The elephant in the room of all of this upheaval is YouTube – the silly viral internet video giant that became a TV, music, advertising, and now podcast giant. Per an April survey by Cumulus Media and media research firm Signal Hill Insights, 39% of all weekly podcast consumers use YouTube as their primary platform, more than double the share from late 2019. The video platform estimated that more than a billion people a month are watching podcasts as of February.”
The Reformist CTO’s Guide to Impact Intelligence - A good take on metrics and frameworks to show the “business value” of tech projects.

Conferences

SpringOne, Las Vegas, August 25th to 28th. VMUG London, September 18th, speaking. SREDay London, September 18th and 19th, speaking. Civo Navigate London, September 30th, speaking. Cloud Foundry Day EU, Frankfurt, October 7th, 2025, speaking. AI for the Rest of Us, London, October 15th to 16th, speaking. SREDay Amsterdam, November 7th, speaking.

Logoff

In this week’s Software Defined Interviews, Whitney, James Eastham, and I talk a lot about learning how to be “social.” This is not only being a good “communicator,” but being good in a group of people. For the three of us, this is part of our job, our professional life. And you can see how, like any other nerd, we studied how to do it. But, look at us nerds now! Also, we cover the tyranny of those stupid, childish YouTube thumbnails we all have to use.

I like how the Interviews show has turned out - it’s difficult to find a good co-host. You need someone who will put their energy and time into it, maintain the “lively” feeling, and develop both a podcast persona for themselves, as well as co-create the overall persona (“vibe”?) for the show. Also, they have to be maniacal about scheduling and showing up on time.[2] I think we’ve got it worked out.

Make sure you subscribe to it! And, yes, it’s in YouTube if you consider that a “podcast.” People from all channels are welcome and appreciated.

[1] I’m sure there’s some reason, but the point is, it didn’t work, nor did it tell me how to make it work.

[2] And, when you work with me, you have to be tolerant of me often being the opposite of that.
