> So far the biggest limiting factor is remembering to use it. Even people I consider power users (based on their Claude token usage) agree with the sentiment that sometimes you just forget to ask Claude to do a task for you, and end up doing it manually. Sometimes you only notice that Claude could have done it, once you are finished. This happens to me an embarrassing amount.
Yea, this happens to me too. Does it say something about the tool?
It's not like we are talking about luddites who refuse to adopt the technology, but rather a group that is very open to using it. And yet sometimes, we "forget".
I very rarely regret forgetting. I feel a combination of (a) it's good practice, I don't want my skills to wither and (b) I don't think the AI would've been that much faster, considering the cost of thinking the prompt and that I was probably in flow.
If you're forgetting to use the tool, is the tool really providing benefit in that case? I mean, if a tool truly made something easier or faster that was onerous to accomplish, you should be much less likely to forget there's a better way ...
Yep! Most tools are there to handle the painful aspects of your tasks. It's not like you are consciously thinking about them, but just the fact of doing them without the tool will get a groan out of you.
A lot of current AI tools are toys. Fun to play around with, but as soon as you have some real-world tasks, you just do them your usual way that gets the job done.
There's a balance to be calculated each time you're presented with the option. It's difficult to predict how much iteration the agent is going to require and how frustrating it might end up being, all the while losing grip on the code being your own and your mental model of it, versus just going in and doing it, knowing exactly what's going on, and simply asking it questions if any unknowns arise. Sometimes it's easier to just not even make the decision, so you disregard firing up the agent in a blink.
> is the tool really providing benefit in that case?
Yes, much of the time and esp. for tests. I've been writing code for 35 years. It takes a while to break old habits!
Why would you want to break the habit? If you are not feeling a strong urge to use it...
Because I don't particularly like writing bulk code, but solving problems and designing things.
Our meat blobs forget things all the time. It's why todo apps and reminders even exist. Not using something every time doesn't mean it's not beneficial.
You never forgot your reusable grocery bag, umbrella, or sunglasses? You've never reassembled something and found a few "extra" screws?
Yes, but once I'm at the checkout or it starts raining, I reach for it...
Many CLI tools that I love using now took some deliberate practice to establish a habit of using them.
I really really hate this idea that you should have AI do anything it can do, and that there's no value in doing it manually.
The value is in doing the thing, how it's done is just a matter of preference and efficiency.
This is a myopic and dehumanizing way to look at it.
It really isn't. Not any more than growing your own herbs vs buying them at the market is.
Are you implying that there's no value to be found in someone growing their own herbs?
I resonate with your case (b). One reason why I intentionally don't use it is cases where I know exactly what code I want to write, and can thus write it quicker than I can explain it to Claude + have Claude write it.
Some tasks are faster to just do than the cognitive load of creating a prompt and then waiting for execution.
Also if you like doing certain tasks, then it is like eating an ice cream vs telling someone to eat an ice cream.
And the waiting is somewhat frustrating, what am I supposed to do while I wait? I could just sit and watch, or context switch to another task then forget the details on what I was originally doing.
I typically just think of more ideas, prompts, etc. while I wait.
I think you’re supposed to spin up another to do a different task. Then you’ll be occupied checking up on all of them, checking their output and prodding them along. At least that’s what Anthropic said you should do with Claude Code.
If I wanted to be an EM, I'd apply for that job.
The thing is others will eat ice cream faster so very soon there'll be no ice cream for me.
> The most common thing that makes agentic code ugly is the overuse of comments.
I've seen this complaint a lot, and I honestly don't get it. I have a feeling it helps LLMs write better code. And removing comments can be done in the reading pass, somewhat forcing you to go through the code line by line and "accept" the code that way. In the grand scheme of things, if this were the only downside to using LLM-based coding agents, I think we've come a long way.
I work with python codebases and consider comments that answer "what?" instead of "why?" bad.
LLMs tend to write comments answering "what?", sometimes to a silly extent. What I found helpful when using Claude 3.7 was to add this rule in Cursor. The fake XML tag helped to decrease the number of times it strays from my instructions.
<mandatory_code_instruction>
YOU ARE FORBIDDEN FROM ADDING ANY COMMENTS OR DOCSTRINGS. The only code accepted will be self-documenting code.
</mandatory_code_instruction>
If there's a section of code where a comment answering "why?" is needed, this rule doesn't seem to interfere when I explicitly ask it to add one there.
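For anyone who hasn't internalized the distinction, here is a small, made-up Python example of the two comment styles; the function and its error handling are illustrative only, not from any codebase discussed here.

    import time

    def fetch_with_backoff(fetch, base_delay: float = 0.5, max_retries: int = 5):
        # "What?" comment (redundant): loop over the allowed number of retries.
        for attempt in range(max_retries):
            try:
                return fetch()
            except TimeoutError:
                # "Why?" comment (useful): the upstream API rate-limits bursts,
                # so exponential backoff gives it time to recover instead of
                # hammering it with immediate retries.
                time.sleep(base_delay * 2 ** attempt)
        raise TimeoutError("upstream did not recover after retries")

The first comment restates the code; the second records a reason the code alone can't express, which is the kind worth keeping.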
I've noticed Gemini 2.5 Pro does this a lot in Cursor. I'm not sure if it's because it doesn't work well with the system prompt or tools, but it's very annoying. There are comments for nearly every line, and it's like it's thinking out loud in comments, with lots of TODOs and placeholders.
That is because it is thinking out loud. Producing tokens is how it thinks.
Producing tokens is all it does, it’s not thinking.
Eh, you really have to take the "thinking" in quotes, in the LLM context "thinking" out loud is what makes the results better. I think (hah) their point stands.
I agree. I don’t always keep the comments, but I’m 100% ok with them.
You can literally just ask it to not write too many comments, describe the kind of comments you want, and give a couple of examples. Save that in rules or whatever. And it's solved for the future :)
I tell them to write self-documenting code and to only leave comments when it's essential for understanding, and that's worked out pretty well
They tend to add really bad comments though. I was looking at an LLM generated codebase recently and the comments are along the lines of “use our newly created Foo model”, which is pretty useless.
Yeah that's what I do, remove the comments as I read through.
If it helps... shouldn't you be "not deleting them" for future feature additions?
I'm still having a hard time with coding agents. They are useful but also somehow immature, hence dangerous. The other day I asked Copilot with GPT-4o to add docstrings to my functions in a long Python file. It did a good job on the first half. But when I looked carefully, I realized the second half of my file was gone. Just like that. Half of my file had been silently deleted, replaced by a single terrifying comment along the lines of "continue similarly with the rest of the file". I use Git of course, so I could recover my deleted code. But I feel I still can't fully trust an AI assistant that will silently delete hundreds of lines of my codebase just because it is too lazy or something.
I’ve also seen catastrophic failures where the code returned completely fails for numerous “obvious” problems, including but not limited to missing code.
I tend to have to limit the code I share and ask more pointed / targeted questions in order to lead the AI to a non catastrophic result.
These models have a hard time modifying LARGE files and then returning them back to you. It's inefficient, too.
What you want is to ask for a list of changes and then apply them. That's what aider, codex, etc. all do.
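A rough sketch of that pattern in Python, purely illustrative (it is not how aider or codex actually implement edits): ask the model for a list of (search, replace) blocks and apply them locally, so the whole file never round-trips through the model.

    from pathlib import Path

    def apply_edits(path: Path, edits: list[tuple[str, str]]) -> None:
        """Apply (search, replace) edit blocks to a file, failing loudly on a miss."""
        text = path.read_text()
        for search, replace in edits:
            if search not in text:
                # Refuse to guess: a missed anchor usually means the model's
                # view of the file is stale.
                raise ValueError(f"edit anchor not found in {path}:\n{search}")
            text = text.replace(search, replace, 1)
        path.write_text(text)

    # Hypothetical usage: the model returns small blocks instead of the whole file.
    # apply_edits(Path("settings.py"), [("DEBUG = True", "DEBUG = False")])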
I made a tool to apply human-readable changes back to files, which you might find useful: https://github.com/asadm/vibemode
aider has this feature too.
Your first mistake was using Copilot. Your second mistake was using GPT 4o
>Our head of product is a reformed lawyer who taught himself to code while working here. He’s shipped 150 PRs in the last 12 months.
>The product manager he sits next to has shipped 130 PRs in the last 12 months.
In a serious organization, non technical people should not be shipping any sort of code. They should be doing the highest leverage things possible to help the business, and if that seems to be coding, there are grave issues in the company.
I see a fraudulent benefit in this case. When these non-tech people go into public talks or anything, they can suddenly claim "oh, I use AI to write 80% of my code" and voila! No one will ask whether their responsibility is to write code or do any engineering; simply being able to make some surface-level claims makes them credible enough and feeds the hype while appearing cool.
It also gives investors more confidence to shower them with money when needed, as non-tech people are also doing AI coding and they are super agile!
When the Microsoft CEO claims that 80% of code is written by AI, there is 50% doubt, but when someone adds "yeah, so I have done 150 PRs", it feels more concrete and real.
Author here. We don't have (or want) any investors. I encourage PMs to code because it's good for the business. Otherwise I wouldn't do it.
I wrote about this years before we started doing AI-backed-coding: https://ghiculescu.substack.com/p/opening-the-codebase-up-to..., so some of the details are no longer correct, but the philosophy is the same.
How do you credibly define "non technical people" when everyone's code is written by an LLM?
Serious organisations sound awful.
non technical people are people who are not hired for an engineering role
>> In a serious organization, non technical people should not be shipping any sort of code
Just as you want developers building domain knowledge for all the benefits it brings, you want a role like product owner to be developing their understanding of delivering software.
Sometimes the highest leverage thing to be done is making teams of people work better together. One aspect of achieving that can be better understanding of what each role does.
This sounds like a disaster waiting to happen
I have a confession. I don’t really get what Claude Code is… It’s not a model, it’s not an editor with AI integrated… So what is it? It bugs me on the website, I click on it, read, still don’t get it.
I have a Claude console account, if you can call it that? It always takes me 3 times to get the correct email address because it does not work with passkeys or anything that lets me store credentials. I just added the api key to OpenWebUI. It’s nice and cheaper than a subscription for me even though I use it all day.
But I’m still confused. I just now clicked on “build with Claude”, it takes me to that page where I put in the wrong email address 3 times. And then you can buy credits.
Have you installed the Claude CLI tool?
Think of it as an LLM that automagically pulls in context from your working directory and can directly make changes to files.
So rather than pasting code and a prompt into ChatGPT and then copy and pasting the results back into your editor, you tell Claude what you want and it does it for you.
It’s a convenient, powerful, and expensive wrapper
Not the OP but in a similar boat. Curious to understand: how do you make sure the Claude Code CLI tool does not break existing code functionality?
My hesitation to adopt stems from the times the Claude.ai web UI blindly breaks the code; but since I can visibly verify it, I iterate until it seems reasonable syntactically and logically, and then paste it back.
With autonomous changing of code lines, I'm slightly nervous it would/could break too many parts concurrently -- hence my hesitation to use it. Any best practices would be insightful.
Claude Code will happily and enthusiastically do horrible things to your code. But it always asks first. So you can tell it "NO!" and suggest a better approach.
Imagine having a college sophomore CS major who types really quickly and who is up-to-date on lots of new technologies. But they're prone to cutting corners when they get stuck, and they have never worked on anything larger than a group project. Now imagine watching them as they work (really quickly) and correcting them when they mess up.
This is... tolerable for small apps. If you have problems that could be solved by a team of very junior programmers, and if you're willing to provide close supervision, then it might even make sense for some real code. Or if you kind of know how to code, and you just need little 1,000 line throwaway tools (like a lot of other STEM fields), eh, it's probably OK.
But your mentoring effort will never result in the model actually learning anything, so it's more like you get a new very junior programmer for each PR.
I don't want to completely badmouth this. For very early stage startups where you need to throw 50 things at the wall, most of them glorified CRUD apps, and see what sticks with the customers, then a senior engineer could make it work. But if you have a half dozen people who only sort of know how to write code all "mentoring" Claude, then your code base will become complete trash within two weeks. In practice, I see significant degradation above 1,000 lines for "hands off" operation, and around 5,000 lines if I'm watching it intensely and carefully reading all code.
By default, it shows you a diff of what it plans to do and ask you for permission before making the change.
So it's like Cursor/Windsurf, minus the GUI?
> You can see this in practice when you use Claude Code, which is pay-per-token. Our heaviest users are using $50/month of tokens. That’s a lot of tokens.
How is your usage so low! Every time I do anything with Claude Code I spend a couple of bucks; for a day of coding it's about $20. Is there a way to save on tokens on a mid-sized Python project, or are people just using it less?
It's because by default it'll try to solve most problems agentically / by "thinking", even if your prompt is fairly prescriptive.
I use aider.chat with Claude 3.5 haiku / 3.7 sonnet, cram the context window, and my typical day is under $5.
One thing that can help for lengthy conversations is caching your prompts (which aider supports, but I'm sure Claude Code does, too?)
Obviously, Anthropic has an incentive to get people to use more tokens (i.e. by encouraging you to use tokens on "thinking"). It's one reason to prefer a vendor-neutral solution like aider.
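For reference, prompt caching at the raw API level looks roughly like the sketch below, using the Anthropic Python SDK's cache_control marker on a large, stable system block. The model name and the repo_context file are placeholders, and tools like aider and Claude Code manage this for you; this is only to show where the savings come from.

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # A large, rarely-changing chunk of context (placeholder path).
    repo_context = open("docs/architecture.md").read()

    response = client.messages.create(
        model="claude-3-7-sonnet-latest",  # placeholder; use whatever model you run
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": repo_context,
                # Marks this block as cacheable, so follow-up requests that reuse
                # the same prefix are billed at the cheaper cached-read rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": "Summarise the module boundaries."}],
    )
    print(response.content[0].text)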
This is my experience too, I can burn $20 on a big refactoring in a few hours no problem
A lot of the time (when it works) I think it's easily worth the money, but I would quickly break their $100 a month budget
How though? Are you putting a massive code base into context each time?
Yeah, I think you need to do tasks in more discrete chunks, so you aren't sending so many tokens per request by the end.
With Aider, you typically select only that part of the codebase that you want to work on. You can do this manually, or let the agent find files itself. It tends to break down if you need more than 20 files or so in the context.
That seems really high to me. Maybe you write a lot more code than anyone else around. How big is the codebase? I have a feeling that (+ the stack) has a big impact.
> The product manager he sits next to has shipped 130 PRs in the last 12 months. When we look for easy wins and small tasks for new starters, it’s harder now, because he’s always got an agent chewing through those in the background.
I'd be curious to hear more about this, whether from the author or from someone who does something similar. When the author says "background", does that literally mean JIRA tickets are being assigned to the agent, and it's spitting back full PRs? Is this setup practical?
No, not so literally. I just meant that whenever a small throwaway task comes across his desk, he does it instead of making a ticket. Lots of "small tweak" PRs.
(Also, your life will get better when you delete Jira!)
"Making it easy to run tests with a single command. We used to do development & run tests via docker over ssh. It was a good idea at the time. But fixing a few things so that we could run tests locally meant we could ask the agent to run (and fix!) tests after writing code."
Good devops practices make AI coding easier!
> Good devops practices make AI coding easier!
Good devops practices make coding easier!
This is one of the most exciting things about coding agents: they make a lot of tooling that was so tedious to use it was impractical now ultra relevant. I wrote a short post about this a few weeks ago, the idea that things like "Semgrep" are now super valuable where they were kind of marginal before agents.
The fact that you have to describe things in text suddenly made a few companies also publish API documentation as llm.txt and the content is what I wish they published for people instead.
It also makes the payoff for "minor" improvements bigger.
We've started more aggressively linting our code because a) it makes the AI better and b) we made the AI do the tedious work of fixing our existing lint violations.
It can automate a lot of the tediousness for static typing, too
Having linting/prettifying and fast test runs in Cursor is absolutely necessary. On a new-ish React Typescript project, all the frontier models insist on using outdated React patterns which consistently need to be corrected after every generation.
Now I only wish for a Product Manager model that can render the code and provide feedback on the UI issues. Using Cursor and Gemini, we were able to get an impressively polished UI, but it needed a lot of guidance.
> I haven’t yet come across an agent that can write beautiful code.
Yes, the AI doesn't mind hundreds of lines of if statements; as long as it works, it's happy. It's another thing that needs several rounds of feedback and adjustments to make it human-friendly. I guess you could argue that human-friendly code is soon a thing of the past, so maybe there's no point fixing that part.
I think improving the feedback loops and reducing the frequency of "obvious" issues would do a lot to increase the one-shot quality and raise the productivity gains even further.
Unless you are prototyping, human-friendly code is a must. It is easy to write huge amounts of low-quality code without AI. The hard part is long-term maintenance. I have not seen any AI tool help with that.
I find that there's a very high correlation between code I'd expect to confuse humans and code which confuses/breaks LLMs.
When you let them pump out code without any intervention, there comes a point where they start introducing bugs faster than they get fixed, and things don't get better.
Changing my old coding behavior aside, the biggest limiting factor for me is understanding how and why the coding agent is doing things a certain way, so that I have the confidence to continually sharpen my tools.
I want something simple that I have full control over, if only to understand how it works. So I made a minimal coding agent (with edit capability) that is fully functional using only seven tools: read, write, diff, browse, command, ask, and think.
As an example, I can just disable the `ask` tool to have it go fully autonomous on certain tasks. Or ask it to `think` for refactoring.
Have a look at https://github.com/aperoc/toolkami to see if it might be useful for you.
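Not the parent's code, but the shape of such an agent is small enough to sketch: a loop that sends the conversation to a model, dispatches whichever tool it asks for, and feeds the result back. The tool-request format and the call_model function below are assumptions for illustration, and only four of the seven tools are shown to keep it short.

    import subprocess
    from pathlib import Path

    def read(path: str) -> str:
        return Path(path).read_text()

    def write(path: str, content: str) -> str:
        Path(path).write_text(content)
        return f"wrote {len(content)} bytes to {path}"

    def command(cmd: str) -> str:
        # Run a shell command and return combined output for the model to inspect.
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return result.stdout + result.stderr

    def ask(question: str) -> str:
        return input(f"[agent asks] {question}\n> ")

    TOOLS = {"read": read, "write": write, "command": command, "ask": ask}

    def agent_loop(goal: str, call_model, max_steps: int = 20) -> None:
        history = [{"role": "user", "content": goal}]
        for _ in range(max_steps):
            # call_model is assumed to return either a final answer or a tool
            # request of the form {"tool": name, "args": {...}}.
            step = call_model(history)
            if "tool" not in step:
                print(step["content"])
                return
            result = TOOLS[step["tool"]](**step["args"])
            history.append({"role": "tool", "content": result})

Removing "ask" from TOOLS is all that "going fully autonomous" amounts to in this sketch.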
It’s producing statically likely next tokens based on its training corpus and prompts. That’s what it’s doing. It’s not analyzing your code, it’s not considering approaches, it’s not theorizing about behavior, it’s just producing statistically likely tokens.
This is why I don’t touch the shit—it’s fucking snake oil.
If you "don't touch the stuff", you are probably relying on your initial impressions from 2022 to evaluate models in 2025.
Also, predicting the next token with high accuracy may demand very high degrees of knowledge and reasoning. As an extreme example, please predict the next 1,000 tokens in this sequence: "ABSTRACT. In this paper, we show a mathematically elegant unification of quantum mechanics and gravity, which makes testable predictions. Testing these predictions shows that the quantum gravity model provides previously unexpected results accurate to 1 part in..."
To predict the rest of that paper, it helps to actually come up with a workable model of quantum gravity. Which no current LLM can do, happily.
But lots of current models are good enough at "predicting the next token" to solve high school honors math problems that they've never seen before. They can apply the chain rule, factor polynomials, double-check their work, backtrack, etc. Similarly, current-generation coding models are perfectly capable of reading compiler error messages, and generating diffs that fix the underlying problem. Current-generation summarization models are capable of reading several scientific papers, extracting the key concepts, and turning them into a fairly serviceable podcast.
All of this happens because (1) in order to predict the next token better, LLMs actually build thousands of specialized models that do things like "keep track of the state of a chess board" or "recognize lions in photos", and (2) transformer models are sufficiently powerful to model many problems well. So a big current-generation LLM is basically an ensemble of thousands of domain models glued together with a language model and a bunch of feed-forward layers.
Now, none of the outputs from LLMs are of super high quality compared to experienced humans. And there are deep reasons for that. But if you happen to have a problem where medium-competence AI "slop" is actually beneficial, then yes, LLMs can actually be of real-world use.
Good to see experiences from people rolling out AI code assistance at scale. For me the part that resonates the most is the ambition unlock. Using Brokk to build Brokk (a new kind of code assistant focused on supervising AI rather than autocompletes, https://brokk.ai/) I'm seriously considering writing my own type inference engine for dynamic languages which would have been unthinkable even a year ago. (But for now, Brokk is using Joern with a side helping of tree-sitter.)
> You can see this in practice when you use Claude Code, which is pay-per-token. Our heaviest users are using $50/month of tokens. That’s a lot of tokens. I asked our CFO and he said he’d be happy to spend $100/dev/month on agents. To get 20% more productive that’s a bargain.
fwiw we interviewed the Claude Code team (https://www.latent.space/p/claude-code) and they said that even within Anthropic (where Claude is free, we got into this a bit), the usage is $6/day so about $200/month. not bad! especially because it goes down when you under-use.
Thanks for sharing, looks like a great interview. Tell them I'm a big fan!
Author here. Ask me anything!
Since writing this a tangentially related thing we've added, is a github action that runs on any PR that includes a (Rails) database migration, and reviews it, comparing it to our docs for how to write good migrations.
Claude helped write the action so it was super easy to set up.
You mentioned using Claude to help set up a GitHub Action for reviewing Rails migrations. How do you see agentic tools like Claude evolving in their ability to reason about big-picture concerns—not just boilerplate generation, but things like validating database changes, architectural decisions, or spotting long-term risks that aren’t immediately visible?
They are really good if you give them guidance and tight scope. For example for the database migrations review bot, we give it our Cursor rules file on database migrations (which is about 200 lines) and tell it to review the PR based on that.
It works particularly well for migrations because all the context is in the PR. We haven't had as much luck with reviewing general PRs where the reason for a change being good or bad could be outside the diff, and where there aren't as clearly defined rules for what should be avoided.
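The post doesn't include the action itself, so the following is only a hypothetical Python sketch of the review step such a workflow might run after checking out the PR; the rules-file path, model name, and prompt wording are all assumptions.

    import subprocess
    import anthropic

    # Diff of the PR against the target branch, limited to migration files.
    diff = subprocess.run(
        ["git", "diff", "origin/main...HEAD", "--", "db/migrate/"],
        capture_output=True, text=True, check=True,
    ).stdout

    rules = open(".cursor/rules/migrations.md").read()  # hypothetical path

    client = anthropic.Anthropic()
    review = client.messages.create(
        model="claude-3-7-sonnet-latest",  # placeholder model id
        max_tokens=2000,
        system="You review Rails database migrations strictly against the team's rules.",
        messages=[{
            "role": "user",
            "content": f"Rules:\n{rules}\n\nMigration diff:\n{diff}\n\n"
                       "List any rule violations, or reply LGTM if there are none.",
        }],
    )
    print(review.content[0].text)  # the workflow would post this as a PR comment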
How do you square this with the fact that LLM-based tools aren’t actually doing any analysis whatsoever, but are just pachinko machines that produce statistically likely output tokens when given input tokens?
Have you noticed people more productive, in general, using Claude Code or Cursor? Obviously it varies, but curious if one or the other is the clear productivity champion at this point.
I wouldn't say there's an obvious winner yet. I do think it depends on the person and how they like to work / how they reason about problems.
For Claude specifically, people who take more time to write long detailed prompts tend to get much better outcomes. Including me, since I made the effort to get better at prompt writing.
As someone who really dislikes using Cursor, what does the HN hivemind think of alternatives? Is there a good CLI like Claude Code but for Gemini / other models? Is there a good Neovim plugin that gets the contextual agent mode right?
Have you tried Aider? They're making a CLI coding agent for quite some time and have gained quite a bit of traction.
[0] https://aider.chat/
Seconding aider, which was recommended to me months ago on HN. They don't integrate with vim directly per-se, but I'm a heavy vim user and I like the workflow of `aider --vim`, `ctrl-z`, `vim`.
They also have a mode (--watch-files) that allows you to talk to a running aider instance from inside vim, but I haven't used it much yet.
The file watch mode is great. Also nvim-aider is a great plugin which makes it really easy to manage context from within nvim and add diagnostics to chat
The sweet spot for me was Cursor for autocomplete/editing and manually using Claude for more deep-dive questions.
I can't go back to a regular IDE after being able to tab my way through most boilerplate changes, but anytime I have Cursor do something relatively complex it generates a bunch of stuff I don't want. If I use Claude chat, the barrier of manually auditing anything that gets copied over stays in place.
I also have pretty low faith in a fully useful version of Cursor anytime soon.
JetBrains tools have an MCP plugin and can work with Claude.
> Is there a good CLI like Claude Code but for Gemini / other models
I built an open source CLI coding agent that is essentially this[1]. It combines Claude/Gemini/OpenAI models in a single agent, using the best/most cost effective model for different steps in the workflow and different context sizes. The models are configurable so you can try out different combinations.
It uses OpenRouter for the API layer to simplify use of APIs from multiple providers, though I'm also working on direct integration of model provider API keys.
It doesn't have a Neovim plugin, but I'd imagine it would be one of the easier IDEs to integrate with given that it's also terminal-based. I will look into it—also would be happy to accept a PR if someone wants to take a crack at it.
1 - https://github.com/plandex-ai/plandex
> I haven’t yet come across an agent that can write beautiful code.
o3 in codex is pretty close sometimes. I prefer to use it for planning/review but it far exceeds my expectations (and sometimes my own abilities) quite regularly.
Even if you don't think AI will replace the job of software developer completely, there's no way compensation is going to stay at the current level when anyone can ship code.
>Our head of product is a reformed lawyer who taught himself to code while working here. He’s shipped 150 PRs in the last 12 months.
>The product manager he sits next to has shipped 130 PRs in the last 12 months.
This is actually horrifying, lol. I hadn't even considered product guys going ham on the codebase.
Honestly, it's been pretty great at my tiny startup. The designer has a list of tweaks he wants that I could do pretty quickly... once I'm done with my current thing in a day or two. Or he can just throw claude at it. We've got CI, we've got visual diff testing, and I'll review his simple `margin-left: 12px;`->`margin-left: 16px;`.
But we're unlocking:
A) more dev capacity by having non-devs do simple tasks
B) a much tighter feedback loop between "designer wants a thing" and "thing exists in product"
C) more time for devs like me to focus on deeper, more involved work
Ostensibly the PRs are getting reviewed so it’s, maybe, not that bad but I had a similar reaction: I can slap together something with some wood, hammer, nails and call it a chair. Should I be manufacturing furniture?
Sure, if it's good, it works, it's reliable and people like it.
Code review makes this a lot less scary. Honestly it seems like mostly a win. A while ago at my day job, a moderately technical manager on another team attempted to contribute a relatively simple feature to my team's codebase. It took many rounds of review feedback for his PR to converge on something close to our general design guidelines. I imagine it would have been way less frustrating and time consuming for him if he could have just told an AI agent what to do and then have it respond to review feedback for him.
It is definitely scary if the good PMs who are solid at authoring requirements are able to just make it happen. Not worried for the most part tho.
that's actually great! win-win for everybody. Although not fun reviewing those early PRs.
Mwahahaha
Noncoders are about to learn about the code maintenance cycle
I'm wondering how much better LLMs are at "untyped"/loosely typed scripting languages like Ruby for large-scale coding and error avoidance.
Presumably an LLM can actually maintain better contextual awareness of code and variables than, say, cold-loaded syntax highlighting.
As a hobby Rubyist, pretty terrible compared to TypeScript. I suspect it's also that there's less Ruby in the training data, but Ruby being so dynamic and duck-typeable means it produces some truly nightmarish shit from time to time.
They likely aren't. I'm getting occasional issues caught by type checks in the LLM output. They would be weird crashes instead in other languages.
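As a concrete (made-up) example of the kind of slip a type checker catches in generated Python before it becomes a weird runtime crash:

    def parse_port(value: str) -> int:
        # Typical generated-code slip: the error branch silently falls through
        # and returns None despite the declared int return type.
        if value.isdigit():
            return int(value)
        print(f"invalid port: {value}")
        # mypy reports: error: Missing return statement  [return]

In a dynamic language the equivalent mistake only shows up later, when some caller does arithmetic on a nil.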
ultimately, it comes down to training data and character count
llms can be better than most humans at applescript
just need a fuck ton in context