You did not explicitly state the goal of the advice. I think it would be interesting to distinguish between advice that is meant to increase your value to the company and advice meant to increase your satisfaction with your work, especially when the two point in opposite directions.
For example, it could be that "swallow[ing] your pride and us[ing] that garbage language you hate so much" is good for the company in some cases but terrible for job satisfaction, making you depressed or angry every time you have to use that silly language/tool.
For that reason try to structure teams such that every team has everything it needs for its day to day work.
I would extend that to "have as much control as you can over what you do". I increasingly find that this is key to moving fast and producing quality software.
Applied to code, this means dependencies should be owned and open to modification, so the team understands them well and can fix bugs or add features as needed.
This avoids ridiculous situations where bugs are never fixed or shipping very simple features (such as changing a theme for a UI component) is impossible or takes weeks because a framework actively prevents it.
More control and understanding also tend to be better for satisfaction. Of course, all this is on a spectrum and should be balanced against other requirements.
Interactions with ChatGPT can be customized persistently in the settings; for example, you can add the following instruction: "include a confidence rating at the end of your response in the format 'Confidence: X%'. If your confidence is below 80%, briefly explain why".
Here is a sample conversation demonstrating this and showing what ChatGPT has to say about its calibration:
Me: Are you calibrated, by which I mean, when you output a confidence X as a percentage, are you right X times out of 100?
ChatGPT: Reasoned about confidence calibration for 27 seconds. I don’t have calibration in that strict sense. When I mention a confidence percentage, it’s a heuristic internal signal rather than a guarantee that I’ll be right exactly that proportion of the time. Research shows that language models often exhibit miscalibration, meaning a stated confidence (say 70%) doesn’t empirically translate to being correct 70% of the time. I'm about 90% confident in this explanation.
Me: How bad is the miscalibration?
ChatGPT: Thought for a second. LLMs—especially those fine‐tuned with RLHF—are often notably overconfident. For example, a model might claim 90% confidence while its actual accuracy is closer to 70–80%. Studies report expected calibration errors (ECE) of roughly 5–10% or more, and effective calibration methods can sometimes cut these errors by about half, underscoring that miscalibration is a significant issue. I'm about 90% confident in this summary, based on multiple recent calibration studies.
Me: If you are predictably overconfident, and you know it, do you think you can notice that before stating your confidence and correct for it?
ChatGPT: Reasoned about overconfidence and calibration adjustment for 4 seconds. I don’t have a real-time “internal monitor” to detect my own overconfidence before stating it. Instead, I rely on calibration techniques—like temperature scaling and post-hoc adjustments—that help align my confidence levels with actual accuracy. In other words, I can be tuned to be less overconfident, but I don’t consciously correct my confidence on the fly. I'm about 90% confident in this explanation based on current research findings.
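For what it's worth, the expected calibration error mentioned above is easy to estimate yourself if you log the stated confidences together with whether the answers turned out to be correct. A minimal sketch (the logged records below are invented for illustration):

```python
# Toy expected-calibration-error (ECE) estimate from logged
# (stated confidence, answer was actually correct) pairs -- the data is invented.

def expected_calibration_error(records, n_bins=5):
    bins = [[] for _ in range(n_bins)]
    for confidence, correct in records:
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        # Weight each bin's |confidence - accuracy| gap by its share of the records.
        ece += len(bucket) / len(records) * abs(avg_confidence - accuracy)
    return ece

# Example: the model claims ~90% confidence but is right ~75% of the time.
records = [(0.9, True), (0.9, False), (0.9, True), (0.85, True),
           (0.9, True), (0.95, True), (0.8, True), (0.9, False)]
print(f"ECE ~ {expected_calibration_error(records):.2f}")  # ~0.14
```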
Many developers have been reporting that this is dramatically increasing their productivity, up to 5x'ing/10x'ing it.
I challenge the data: none of my colleagues have reported this high a speed-up. I think your observation can largely be explained by sampling bias.
People who do not use AI, or who saw no improvement, are unlikely to report their experience. You also mention Twitter, where users share "hot takes" to increase engagement.
It's good to have actual numbers before we try to explain them, so I ran a quick search and found three articles that look promising (I only did a basic check of the methodology; don't take these numbers at face value without analyzing the sources in depth):
Documentation quality (+7.5%)
Code review speed (+3.1%)
Delivery throughput (-1.5%): AI adoption slightly decreases delivery throughput, usually due to over-reliance, learning curve, and increased complexity.
Delivery stability (-7.2%): It is significantly impacted because AI tools can generate incorrect or incomplete code, increasing the risk of production errors.
What are the DORA metrics?
Metric | Definition
Deployment frequency | How often a team puts an item into production.
Lead time for changes | Time required for a commit to go into production.
Change failure rate | Percentage of deployments resulting in production failure.
Failed deployment recovery time | Time required for a team to recover from a production failure.
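For concreteness, here is a rough sketch of how these four metrics could be computed from a deployment log; the record structure and numbers are invented for the example, not taken from any of the articles above.

```python
# Sketch of the four DORA metrics over a hypothetical deployment log.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Deployment:
    commit_time: datetime      # when the change was committed
    deploy_time: datetime      # when it reached production
    failed: bool               # did this deployment cause a production failure?
    recovery_hours: float      # time to recover if it failed, else 0

def dora_metrics(deployments, period_days):
    n = len(deployments)
    failures = [d for d in deployments if d.failed]
    lead_times = [(d.deploy_time - d.commit_time).total_seconds() / 3600
                  for d in deployments]
    return {
        "deployment_frequency_per_week": n / (period_days / 7),
        "lead_time_for_changes_hours": sum(lead_times) / n,
        "change_failure_rate": len(failures) / n,
        "failed_deployment_recovery_hours":
            sum(d.recovery_hours for d in failures) / len(failures) if failures else 0.0,
    }

# Example: two deployments over a 30-day window, one of which failed.
now = datetime(2024, 1, 31)
log = [
    Deployment(now - timedelta(days=20, hours=5), now - timedelta(days=20), False, 0.0),
    Deployment(now - timedelta(days=5), now - timedelta(days=3), True, 2.5),
]
print(dora_metrics(log, period_days=30))
```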
Analyzing actual engineering data from a sample of nearly 800 developers and objective metrics, such as cycle time, PR throughput, bug rate, and extended working hours (“Always On” time), we found that Copilot access provided no significant change in efficiency metrics.
The group using Copilot introduced 41% more bugs
Copilot access didn’t mitigate the risk of burnout
“The adoption rate is significantly below 100% in all three experiments,” the researchers wrote. “With around 30-40% of the engineers not even trying the product.”
In my experience, LLMs are a replacement for search engines (these days, search engines are only good for finding info when you already know which website to look at ...). They don't do well in moderately sized code bases, nor in code bases with lots of esoteric business logic, which is to say they don't do well in most enterprise software.
I mostly use them as:
I think they're also good at one-shot scripts, such as data wrangling and data viz, but that does not come up often in my current position.
I would rate the productivity increase at about 10%. I think the use of modal editors (Vim or modern alternatives) improves coding speed more than inline AI completion, which is often distracting.
A lot of my time is spent understanding what I need to code, in a back and forth with the product manager or clients. Then I can either code it myself (it's not so hard once I understand the requirements) or spend time explaining to the AI what it needs to do, watching it write sloppy code, and rewriting the thing.
Once in a while the coding part is actually the hard part, for example when I need to make the code fast or make a feature play well with the existing architecture. But then the AI can't do logic well enough to optimize code, nor can it reason about the entire code base, which it can't see anyway.
I also think it is unlikely that AGIs will compete in human status games. Status games are not just about being the best: Deep Blue is not high status, and athletes who take drugs to improve their performance are not high status.
Status games have rules, and you only win if you do something impressive while competing within the rules. Being an AGI is likely to be seen as an unfair advantage, so AIs will be banned from human status games, in the same way that current sports competitions are split by gender and weight class.
Even if they are not banned, given their abilities they will be expected to do much better than humans; it will just be a normal thing, not a high-status, impressive one.
For those interested in writing better trip reports, there is a "Guide to Writing Rigorous Reports of Exotic States of Consciousness" at https://qri.org/blog/rigorous-reports
A trip report is an especially hard case of something one can write about:
I have a similar intuition that if mirror-life is dangerous to Earth-life, then the mirror version of mirror-life (that is, Earth-life) should be about as dangerous to mirror-life as mirror-life is to Earth-life. Having only read this post, and in the absence of any evidence either way, this default intuition seems reasonable.
I find the post alarming and I really wish it had some numbers instead of words like "might" to back up the claims of threat. At the moment my uneducated mental model is that for mirror-life to be a danger it has to:
Hmm, six ifs seems like a lot, so is it unlikely? In the absence of any odds it is hard to say.
The post would be more convincing and useful if it included a more detailed threat model, or some probabilities, or a simulation, or anything quantified.
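As a purely illustrative toy calculation (every number below is made up, and the six conditions are presumably not independent), the conjunction of six conditions can shrink quickly, but it does not have to:

```python
# Toy calculation: probability that six independent conditions all hold,
# for a few made-up per-condition probabilities.
for p in (0.9, 0.7, 0.5):
    print(f"per-condition p = {p}: p**6 = {p**6:.3f}")
# per-condition p = 0.9: p**6 = 0.531
# per-condition p = 0.7: p**6 = 0.118
# per-condition p = 0.5: p**6 = 0.016
```

So "six ifs" only buys much safety if the individual probabilities are low and the conditions are roughly independent, which is exactly the kind of information the post does not give.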
A last question: how many mirror molecules does an organism need to count as mirror-life? Is one enough? Does it make any difference to its threat level?
[epistemological status: a thought I had while reading about Russell's paradox, rewritten and expanded on by Claude; my math level: undergraduate-ish]
Mathematics has faced several apparent "crises" throughout history that seemed to threaten its very foundations. However, these crises largely dissolve when we recognize a simple truth: mathematics consists of coherent systems designed for specific purposes, rather than a single universal "true" mathematics. This perspective shift—from seeing mathematics as the discovery of absolute truth to viewing it as the creation of coherent and sometimes useful logical systems—resolves many historical paradoxes and controversies.
The only fundamental requirement for a mathematical system is internal coherence—it must operate according to consistent rules without contradicting itself. A system need not:
Just as a carpenter might choose different tools for different jobs, mathematicians can work with different systems depending on their needs. This insight resolves numerous historical "crises" in mathematics.
For two millennia, mathematicians struggled to prove Euclid's parallel postulate from his other axioms. The discovery that you could create perfectly consistent geometries where parallel lines behave differently initially seemed to threaten the foundations of geometry itself. How could there be multiple "true" geometries? The resolution? Different geometric systems serve different purposes:
Euclidean geometry, where exactly one parallel passes through a point not on a line, fits everyday engineering and construction.
Spherical geometry, where there are no parallels at all, fits navigation on the surface of the Earth.
Hyperbolic geometry, where infinitely many parallels exist, shows up in modern physics and the study of curved spaces.
None of these systems is "more true" than the others—they're different tools for different jobs.
Consider the set of all sets that don't contain themselves. Does this set contain itself? If it does, it shouldn't; if it doesn't, it should. This paradox seemed to threaten the foundations of set theory and logic itself.
The solution was elegantly simple: we don't need a set theory that can handle every conceivable set definition. Modern set theories (like ZFC) simply exclude problematic cases while remaining perfectly useful for mathematics. This isn't a weakness—it's a feature. A hammer doesn't need to be able to tighten screws to be an excellent hammer.
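For reference, here is the standard formal statement of the paradox and the restricted ("separation") form of comprehension that ZFC keeps instead; this is textbook material, not something specific to this post:

```latex
% Unrestricted comprehension lets us form R, which is contradictory:
R = \{\, x \mid x \notin x \,\} \quad\Longrightarrow\quad (R \in R \iff R \notin R)

% ZFC's separation schema only carves subsets out of an already-given set A,
% so the "set of all sets that don't contain themselves" can never be formed:
\{\, x \in A \mid \varphi(x) \,\}
```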
Early calculus used "infinitesimals"—infinitely small quantities—in ways that seemed logically questionable. Rather than this destroying calculus, mathematics evolved multiple rigorous frameworks:
Standard analysis, which replaces infinitesimals with epsilon-delta limits.
Non-standard analysis, which makes infinitesimals rigorous as hyperreal numbers.
Smooth infinitesimal analysis, which treats infinitesimals as nilpotent quantities in a constructive setting.
Each approach has its advantages for different applications, and all are internally coherent.
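For example (standard textbook formulations, not taken from the post), the derivative can be defined either as an epsilon-delta limit or, in non-standard analysis, as the standard part of a quotient involving a genuine infinitesimal:

```latex
% Standard analysis: derivative as a limit.
f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}

% Non-standard analysis: the same derivative via a nonzero infinitesimal \varepsilon,
% taking the standard part (st) of the difference quotient.
f'(x) = \operatorname{st}\!\left( \frac{f(x + \varepsilon) - f(x)}{\varepsilon} \right)
```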
This perspective—that mathematics consists of various coherent systems with different domains of applicability—aligns perfectly with modern mathematical practice. Mathematicians routinely work with different systems depending on their needs:
None of these choices imply that other options are "wrong"—just that they're less useful for the particular problem at hand.
This view of mathematics parallels modern physics, where seemingly incompatible theories (quantum mechanics and general relativity) can coexist because each is useful in its domain. We don't need a "theory of everything" to do useful physics, and we don't need a universal mathematics to do useful mathematics.
The recurring "crises" in mathematical foundations largely stem from an overly rigid view of what mathematics should be. By recognizing mathematics as a collection of coherent tools rather than a search for absolute truth, these crises dissolve into mere stepping stones in our understanding of mathematical systems.
Mathematics isn't about discovering the one true system—it's about creating useful systems that help us understand and manipulate abstract patterns. The only real requirement is internal coherence, and the main criterion for choosing between systems is their utility for the task at hand.
This perspective not only resolves historical controversies but also liberates us to create and explore new mathematical systems without worrying about whether they're "really true." The question isn't truth—it's coherence.
Planecrash (from Eliezer and Lintamande) seems highly relevant here: the hero, Keltham, tries to determine whether he is in a conspiracy or not. To do that, he basically applies Bayes' theorem to each new fact he encounters: "Is fact F more likely to happen if I am in a conspiracy or if I am not? Hmm, fact F seems more likely to happen if I am not in a conspiracy; let's update my prior a bit towards the 'not in a conspiracy' side".
Planecrash is a great walkthrough of how to apply that kind of thinking to evaluate whether someone is bullshitting you, by keeping two alternative worlds that could explain what they are saying and updating their relative likelihoods as the discussion goes on.
Surely, if you start putting probabilities on events such as "someone stole my phone" and "that person then tailed me", and multiply the probability of each new detail, the story gets really unlikely really fast. Also relevant: Burdensome Details
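A minimal sketch of that odds-form updating, with entirely made-up likelihoods, just to show the mechanics:

```python
# Toy odds-form Bayesian updating: for each observed fact, multiply the odds
# by the likelihood ratio P(fact | conspiracy) / P(fact | no conspiracy).

prior_odds = 1.0  # 1:1, i.e. P(conspiracy) = 0.5

# (description, P(fact | conspiracy), P(fact | no conspiracy)) -- all numbers invented
facts = [
    ("the host seems suspiciously well prepared", 0.8, 0.4),
    ("nobody else finds the situation strange",   0.6, 0.5),
    ("a convenient coincidence saves the day",    0.7, 0.2),
]

odds = prior_odds
for description, p_if_conspiracy, p_if_not in facts:
    odds *= p_if_conspiracy / p_if_not
    probability = odds / (1 + odds)
    print(f"after '{description}': P(conspiracy) ~ {probability:.2f}")
```

The same arithmetic makes the burdensome-details point: a story that requires many specific facts to all be true has its probability multiplied down with each added detail.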