

Are you saying that it is not possible to use scientific methods to systematically and objectively compare programming tools and methods?
No, I’m saying the opposite, and I’m offended by what the author seems to be suggesting: that this should only be attempted by academics, and that programmers should simply defer to them rather than attempt it themselves to inform their own work and decide which tools will be useful to them. That is an absolutely insane idea, given that systematic evaluation and the pursuit of greater objectivity are at the core of what programmers do. A programmer should obviously use their experience writing and testing code under both type systems to decide which is right for their project. They should not assume they are incapable of objective judgment and defer their thinking to computer science researchers, who don’t deal directly with the same things they do and aren’t considering the same questions.
This was given as an example of someone falling for manipulative trickery:
A recent example was an experiment by a CloudFlare engineer at using an “AI agent” to build an auth library from scratch.
From the project repository page:
I was an AI skeptic. I thought LLMs were glorified Markov chain generators that didn’t actually understand code and couldn’t produce anything novel. I started this project on a lark, fully expecting the AI to produce terrible code for me to laugh at. And then, uh… the code actually looked pretty good. Not perfect, but I just told the AI to fix things, and it did. I was shocked.
But understanding and testing code is not (necessarily) guesswork. There is no reason to assume this person is incapable of it, and nothing to justify the idea that ordinary programmers should never attempt it, when that is the main task of programming.
I’m not confusing that. Effective programming requires, and consists of, the small-scale application of the scientific method to the systems you work with.
I wasn’t making that argument, so I don’t know what you’re getting at with this. For the purposes of this discussion, I think it doesn’t matter at all how the code was written or whether whatever wrote it is truly intelligent. What matters is the code that is the end result: whether it does what it is intended to do and nothing harmful, and whether the programmer working with it can accurately determine that it does.
I feel like “not even possible to assess their usefulness for programming by self-experimentation(!)” is necessarily a claim that reading and testing code is something no one can do, which is absurd. If the output is often correct, then the means of creating it is likely useful, and you can tell whether the output is correct by evaluating it the same way you evaluate any computer program, without needing to evaluate the LLM itself directly. It should be obvious that this is possible. Saying not to do it seems kind of like some “don’t look up” stuff.