58 Comments
Verdi's avatar

I agree, and I have come to a similar realization in the past few months. The value of the time gained by throwing some tasks at the LLM can't be ignored, and I will still use it, but I've gone from being "100% all-in vibecoding" to a much more measured stance. Maybe < 40% use. For these reasons:

- You cannot get away from the tech debt models produce

- Over time, these subtle issues compound into a codebase the LLM itself finds difficult to navigate and work in

- Simultaneously, you, the dev, have lost that inherent, deep insight into the code, to the point where you *cannot* troubleshoot without using AI, and it's a self-reinforcing loop.

- You, the dev, become 0% productive on the train or the airplane, because you forget how to write some algorithm or syntax and you lack a mental model of the codebase

Mo's avatar

40% honestly sounds like the perfect number. I don't know anything about rallying, but you have someone in the driver's seat and the co-driver calling audibles. I think most people imagine the future is us being the co-driver. At least for now, being in the driver's seat and using AI as your sort of partner-in-crime is most productive, if you have the skills to do both.

Oliver Schoenborn's avatar

Totally agree, same conclusion. I was never a believer in the "create a spec and let it go" approach that many YouTubers taunt their audience with; it just makes no sense, for a very fundamental reason: the human is the consumer. Therefore, the human has to be in the loop for UX (whether it is graphical or terminal based). So it's quite irrelevant how long a task it can run. And we all know that specifying upfront is a losing battle when the problem or solution is novel. So, lots of excitement and discovery of what it can and cannot do, but ultimately the amount of coding it can do unsupervised is irrelevant. Baby steps, just with this new tool on which you can bounce ideas and do research without knowing intricate syntax, just plain English. But it's just a tool, as useful as the skills and intelligence of the person using it.

Mo's avatar

> And we all know that specifying upfront is a losing battle when the problem or solution is novel.

Yup, exactly.

Gregory Forché's avatar

Great post! Couple of comments:

-- the idea that 'all you need to do is specify' has been wrongly internalized by many.

a) there's another name for it: it's called "waterfall". We know what happened there.

b) the entire structure of interaction is plan-then-execute. This is NOT how software development actually works (though people can be excused for not being conscious of how they had been doing it before AI). There is a huge amount of improvisational work in a larger unfolding situation. The linear flow from plan to execution is frequently incompatible with any nontrivial development effort.

c) software development - like anything else that is 'work' as we have come to understand it - is not merely a collection of tasks that neatly decompose just because we want them to.

Basically, I see you are observing these things in your work. Check out my substack if you are interested as I am trying to share out some compatible perspectives - would love some eyeballs / feedback.

Mo's avatar

> Check out my substack if you are interested

Actually as I was reading your comment I was thinking, "ok, I want to read more of what this guy has to say." Because it's exactly what I've begun to suspect: we don't actually know what software is. We don't know what code is.

I'll take a look shortly. Is there a particular article you'd recommend starting with?

Gregory Forché's avatar

Yeah, so I just wrote two articles on the relevance problem.

The latest is this: https://gregore.substack.com/p/on-ais-incapacity-to-understand-what?r=d1a2a

benjamin ar's avatar

> "Amazingly, I’m faster, more accurate, more creative, more productive, and more efficient than AI, when you price everything in, and not just code tokens per hour."

Have you been measuring this? I'm not doubtful, but I am curious how much of this is measured vs vibes. I think "vibes" and the empowering feelings of writing your own code are a solid argument on their own, but it would be great to have empirical evidence too.

Mo's avatar

It's hard to measure things like creativity and productivity of course, but yeah it's just a holistic assessment of where I am compared to where I would typically be in a project.

As I started rewriting the code these past few weeks, I was moving much faster than I recalled was typical. I think Claude/Codex have trained us to think coding is laborious and painful. Some parts certainly are, like debugging and tracing. But structuring classes and designing architectures is pretty enjoyable and relatively quick.

Sure there are puzzles to solve along the way. In my experience agents aren't good at minding architecture unless you explicitly call out the traps they might fall into. But how will you know the traps if you haven't undertaken the problem yet?

By the end of it, "when you price *everything* in", it was clear one took way more time than the other for the kind of problems I was working on (which were basic engineering problems tbh).

Mike Crowley's avatar

> I think Claude/Codex have trained us to think coding is laborious and painful

Every time I've heard somebody say "oh, AI knocked this out for me," I could look at that person and tell they didn't like doing that thing. Whether it was writing, or reading their email, or building presentations, or writing code, or writing tests, etc... It all comes down to: it felt good because I didn't have to do the thing I didn't like to do, or wasn't good at...

Mo's avatar

This is true. But it's also being made to seem, in the public discourse, like coding sucked or something. Absolute nonsense. I would wager only a minority of devs hated coding. Probably corporate coding (working on a project you have no interest in) is what sucked. So now AIs are a godsend. But even with AI, a project you have no interest in is a project you have no interest in.

Mike Crowley's avatar

Also, the push I see is from above... a simple "The industry says we can move faster with this, so use it," without people realizing how much joy they're draining from the simple act of doing a thing.

Doesn't matter if devs liked what they did, this technique goes faster (aka cheaper) so do it. And the perceived economics of an inexperienced newbie being suddenly able to produce something that looks awesome (even if it's crap) scream dollar signs to execs... Suddenly it seems like we don't need to pay devs so much... and they wonder why experienced devs wouldn't like this trend.

Mo's avatar

There will likely be an exploration of the extrema before we correct to sensible means.

Jan Machacek's avatar

I wholeheartedly agree. I don't want to be all hating on gen-AI; no wait, I do!

Making software that is sufficiently useful and innovative is a process of discovery. One starts from fuzzy details and then *discovers* the boundaries of understanding as the development progresses. I don't think it's possible to define all of this up front, even if there were an agent that could then execute the plan flawlessly.

I am baffled by engineers who say that gen-AI helps them with boilerplate. I mean, sure, there is some code that needs to be written that is mechanical but, and humour me for one second here, didn't we engineers make these libraries and abstractions that reduce the boilerplate? And if you are finding yourself writing boilerplate, maybe that's a good indication that some engineering should be done.

I am baffled by engineers who claim that code should be self-documenting; if the code needs comments, it is simply bad code, etc, etc. What do they do now? Just one more markdown file full of documentation for the agents to consume. That'll make it all right. I mean markdown isn't Confluence [shudders] or Google Doc [oh-the-humanity] or something even worse, so I suppose that makes it OK.

I am baffled by engineers who are willing to, to paraphrase the cruel joke, let the AI take away thing they love [programming], and instead go and do the dishes and laundry instead of the other way around. Even if AGI is coming, and even if it is going to be smarter than all of us combined, we are still going to be stuck with our thoughts for the rest of our lives. I'd rather be stuck with some at least marginally interesting thoughts. And if, perish the thought!, AGI is not coming... well, it'll pay off to be able to do some thinking.

Mo's avatar

It’s a baffling time for sure.

Per Anderson's avatar

What do you think of incremental cleanup after every major change? To remove duplicated or dead code, etc. I am considering automated cleanup tasks after every X lines of code changed.

Mo's avatar

Honestly, I don't even think it's the kind of thing it would consider cleanup-worthy. It might catch some things, like unnecessary comments. But a cleanup won't fundamentally detect if, say, you're repeating boilerplate among many different models or services. You could potentially tell it to keep an eye on these things. But then, because it's just doing a drive-by cleanup, it will find the laziest way to abstract the common denominator into some ungodly abstraction.

Oliver Schoenborn's avatar

I've used the following: after a module or class has gotten too large for my liking (which implies I'm at least looking at the overall look and feel of the code it is creating), I ask it to "find the main concerns that this class addresses and let me know which ones can be refactored to separate classes used by the current one". I usually have an idea or two of such concerns and I will also list those as examples. And it usually does a pretty good job of listing them and explaining why. Then it seems to naturally propose, e.g., "items 1-5 can be refactored, and the remaining 3 can remain in the current class". But the AI will not do those things on its own. You could schedule another AI agent to have a look and notify you of proposed design changes, but every automation requires maintenance, so sometimes it's not worth setting up; just remember to supervise the system.

Per Anderson's avatar

I found that code review usually detects repeated boilerplate. Two or three rounds of GitHub PR review usually help. If you go with just the first draft it will definitely be a mess.

Nickolai Belakovski's avatar

I had a similar realization about 6 months ago and went back to coding by hand, and I was much more productive. I still use AI to ask about situations where I see behavior I don't understand, or where I'm doing something very non-standard with a given library and need some help, but I generally write all the code myself.

Mo's avatar

It really is a tremendous tool. If you can do both, you'll be in a different league imo than someone just doing one.

SB's avatar

My main gain with agents was being able to try projects that I have no qualifications in. Do I want to create a modern website with a Vite SPA frontend and Tailwind styling, despite lacking any frontend experience? You bet I do. So instead of spending hours or days fiddling with documentation or learning some dead-end tech from scratch, I just take a swing at it and quickly have something that is good enough. So an entire world of "good enough" projects has opened that I would have never attempted in the past.

If your goal is not mere "good enough", but the best it can be, AI is not something that will bring huge value. It might help you with some menial tasks like cleanups and automated predictive typing of boilerplate, but it won't boost productivity by orders of magnitude.

Mo's avatar

Depends what you mean by "good enough." Because in fact the people reading this code are saying it's not good enough. If, on the other hand, you mean purely the UI presentation rather than the underlying constituents, then yes, it's very good at UI work.

SB's avatar

The standard is "is it roughly functional and fulfilling 80% of the requirements, purely from a black box perspective, without looking at the code?"

There is an immense world of projects that opens up when you have an army of interns capable of delivering that for free. Your small-business uncle who "wants a web site". The intelligent door your cat deserves. The 1000 business ideas you had that need a quick prototype to validate with real customers. The online stalking agent for your ex. The possibilities are endless.

ScienceGrump's avatar

The "internet of things"/smart home debacle shows that having lots of little systems, each intended to make life easier, cumulatively make life nearly impossible unless they are *very* thoughtfully designed. Your uncle's website will probably be fine, but all that other stuff will quickly end in the typical destination of AI projects: orphaned and decaying.

It's also just really easy for nerds to overestimate how many people out there want to make things. The great majority of people do not want to make things. They want things to work without thinking about them, because they got enough going on as it is. See: the 3D printing revolution that never was.

Mo's avatar

Yes true that an immense world does open up. But a large set of existing software engineering remains unaffected.

Mike Crowley's avatar

I'm just curious - was this even after giving it context about design patterns, coding style guidelines, etc? Just wondering whether it had access to that context and just completely ignored them.

This is what I've long suspected would happen, that AI is as effective as a junior developer, who you have to explain every little detail to... which frustrates senior developers because they don't (usually) like explaining everything to that minute level for either JDs or AI...

But when you give it something complex, or it has to find the needle in the haystack of a mature and complex application, it would struggle because there was too much context to fully grasp.

Mo's avatar

Of course, I iterated endlessly on the spec/style docs I gave it. No use.

Phil T's avatar

I've been using an LLM for support while I set up a new Linux environment. It gives me a sounding board and, I hope, some idea of good/best practice.

It also gives me commands, with most of which I am already familiar, so I am happy to copy/paste... BUT, even in a single conversation, I can see it deviating from its own suggestions, or making subtle tweaks to commands it already advised me to run, and not always explaining why.

It's clearly not building a holistic "mental" picture of what we're trying to achieve. It keeps assuring me it is, but it is evidently not actually able to do that. Maybe this is demonstrated by the way it often regenerates ALL the advice it has previously given when asked for clarification on an apparent caveat. This is when the unexplained tweaks can creep in.

Knowing a little of how an LLM works, I think I can understand exactly why that is.

With all that in mind, I can completely understand how it can't reliably develop an actual code base!

It is pretty great at tedious shit like combining and transforming two simple JSON datasets, though!
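
To give a made-up example of the kind of thing I mean (hypothetical file names and fields), something like joining two small JSON files on a shared "id" key:

import json

# Load the two datasets (made-up files: a list of users and a list of orders).
with open("users.json") as f:
    users = json.load(f)
with open("orders.json") as f:
    orders = json.load(f)

# Index one dataset by id, then merge matching records into the other.
orders_by_id = {o["id"]: o for o in orders}
combined = [{**u, **orders_by_id.get(u["id"], {})} for u in users]

with open("combined.json", "w") as f:
    json.dump(combined, f, indent=2)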

Mo's avatar

That’s the worst part. When you ask it, it clearly demonstrates understanding of your intention. But it just doesn’t get demonstrated in the code consistently enough.

Nick Borodinov's avatar

An LLM can handle IO, tests, or specific adjustments.

Maybe some super simple code.

Anything of substance needs to be written manually.

Mo's avatar

The thinking needs to be done by you, one way or another. Simple as that. I just found it more efficient to think in code than to transcribe it into English for a middleman.

Abhijit Bakshi's avatar

This tracks with my experience also. The problem for me is that in my big-ish tech company, it seems like anything, however silly, will be approved if it promises Massive Productivity Gains from AI, and anything, however worthy, is at risk if it doesn't. As far as I can tell, there are entire layers of upper management who have no idea how to develop software or what LLMs can or cannot do, who just believe that if people's "productivity" isn't doubling or tripling, it's because they aren't LLM-ing hard enough.

So assuming the AI realism reflected in this article is indeed closer to the truth (as I think it is), it's gonna be a long, long time before sanity returns to a lot of alleged industry leaders.

Mo's avatar

Yeah the suits are gonna be a bit annoying for a bit until the dust settles.

Aleksey Anisimov's avatar

Somehow I reckon I agree with both you and Andrej!

https://x.com/karpathy/status/2015883857489522876

Mo's avatar

Yeah, this is an incredible tweet.

Rainbow Roxy's avatar

Thanks for writing this, it clarifies a lot. Do you think this is a general trend? So insightful.

Mo's avatar

Yeah, Andrej Karpathy’s tweet from yesterday shows the general zeitgeist right now: understanding that AI writes like a junior developer.

Chadius's avatar

Sounds about right.

The most useful part I've found is that trying to write the task out in the first place makes me notice missing requirements, or ambiguities I didn't think of. I've gotten into dead ends because I didn't think things through and leapt in. In the effort to not make the Robot screw up, it forced me to actually write out my thoughts and process.

I treat these gen AIs as fancy autocompletes. I wouldn't accept the first suggestion autocomplete gives me without reading it over. So if I can't read the PR in a reasonable amount of time, that means I did too much, and I should read the summary, throw the code away, and execute the summary one step at a time.

Mo's avatar

1000%. You don't know what problems need solving until you're halfway into building the dumb solution you thought would work.

Konsti's avatar

Well put. In my experience it's much better to have the AI do small, contained tasks, sit by it and hold its hand. You potentially learn something new, it reduces the risk of "under the radar" tech debt and the effects of the cognitive atrophy are also mitigated.

I have also made it a rule to not prompt it a second time if the first time fails. It can't do it. Fine. I'll do it myself.

Mo's avatar

Well said, I largely have the same meta-rule: if you need to follow up with an agent, you've already failed. Agents like to one-shot things. Otherwise they've misunderstood you.

Andrei Bozantan's avatar

This could be much more persuasive. Could you add some code examples and explain the problems?

ScienceGrump's avatar

A minor, relatively self-contained example from my own project. Here, Opus 4.6 has identified a real bug, albeit one that would be trivial to identify and fix from the traceback:

4. parse_choice needs to resolve card names to uniqueIDs for card selections

/ask for Error 4, how would you resolve card names to unique IDs?

After some rigamarole, it suggested:

if contract.selectionType == "card":
    for card in host.cards.values():
        if card.name == decision.strip():
            decision = card.uniqueID
            break
    else:
        raise ValueError(f"Card named '{decision}' not found")

My preferred solution:

if decision in host.cards:
    decision = host.cards[decision]

The model had all the necessary context to know that host.cards is keyed by card names btw.

But the real problems are not self-contained enough to post as code. For example, a previous feature was implemented poorly by an agent, causing objects to hand off responsibility for a behavior back and forth several times. I so feel Mo when he writes that the LLM eventually reaches a point where it itself cannot understand what it's done, because it is already happening with this bug that I was really hoping the magical Opus would fix for me. It keeps going, "Actually, on closer inspection..." as it hits the same counterintuitive switch. When a human winds up in this situation, they try to refactor so behavior is more self-contained. LLMs are even more dependent than humans on clean, intuitive code, but seem totally unable to maintain it. They just march forward, spewing ever more kludges until the code base is impossible. In my case, it is already unable to think through the problem, so it keeps proposing a solution that won't work.

Just like Mo, I also thought I was being very careful with reviewing its code, too.

Mo's avatar

I think Karpathy's post from yesterday captures it perfectly:

> The mistakes have changed a lot - they are not simple syntax errors anymore, they are subtle conceptual errors that a slightly sloppy, hasty junior dev might do. The most common category is that the models make wrong assumptions on your behalf and just run along with them without checking. They also don't manage their confusion, they don't seek clarifications, they don't surface inconsistencies, they don't present tradeoffs, they don't push back when they should, and they are still a little too sycophantic. Things get better in plan mode, but there is some need for a lightweight inline plan mode. They also really like to overcomplicate code and APIs, they bloat abstractions, they don't clean up dead code after themselves, etc. They will implement an inefficient, bloated, brittle construction over 1000 lines of code and it's up to you to be like "umm couldn't you just do this instead?" and they will be like "of course!" and immediately cut it down to 100 lines.

https://x.com/karpathy/status/2015883857489522876