I've been using a lot of Claude and Codex recently.
One huge difference I notice between Codex and Claude Code is that, while Claude basically disregards your instructions (CLAUDE.md) entirely, Codex is extremely, painfully, doggedly persistent in following every last character of them - to the point that I've seen it work for 30 minutes to convolute a solution that was only convoluted because of some sentence I had thrown into the instructions and completely forgotten about.
I imagine Codex as the "literal genie" - it'll give you exactly what you asked for. EXACTLY. If you ask Claude to fix a test that accidentally says assert(1 + 1 === 3), it'll say "this is clearly a typo" and just rewrite the test. Codex will rewrite the entire V8 engine to break arithmetic.
Both these tools have their uses, and I don't think one approach is universally better. Because Claude just hacks its way to a solution, it is really fast, so I like using it for iterative web work, where I need to tweak some styles and want a fast feedback loop. Codex is much worse at that because it takes like 5 minutes to validate everything is correct. Codex is much better for longer, harder tasks that have to be correct -- I can just write some script to verify that what it did works, and let it spin for 30-40 minutes.
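The verification script can be as dumb as a gate that runs your checks and exits nonzero on any failure; a minimal sketch (the commands here are just placeholders for whatever your project actually uses):
```
#!/usr/bin/env python3
"""Minimal verification gate: exit 0 only if every check passes.

The specific commands are placeholders; swap in whatever your
project actually runs (test runner, type checker, linter, ...).
"""
import subprocess
import sys

CHECKS = [
    ["pytest", "-q"],           # unit tests
    ["mypy", "src/"],           # type checks
    ["ruff", "check", "src/"],  # lint
]

def main() -> int:
    for cmd in CHECKS:
        print(f"$ {' '.join(cmd)}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"FAILED: {' '.join(cmd)}")
            return result.returncode
    print("All checks passed.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```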
hadlock 22 hours ago [-]
I've been really impressed with codex so far. I have been working on a flight simulator hobby project for the last 6 months and finally came to the conclusion that I need to switch from floating origin, which my physics engine assumes with the coordinate system it uses, to a true ECEF coordinate system (what underpins GPS). This involved a major rewrite of the coordinate system, the physics engine, even the graphics system and auxiliary stuff like asset loading/unloading etc. that was dependent on local X,Y,Z. It even rewrote the PD autopilot to account for the changes in the coordinate system. I gave it about a paragraph of instructions with a couple of FYIs and... it just worked! No major graphical glitches except a single issue with some minor graphical jitter, which it fixed on the first try. In total it took about 45 minutes but I was very impressed.
I was unconvinced it had actually, fully ripped out the floating origin logic, so I had it write up a summary and then used that as a high level guide to pick through the code, and it had, as you said, followed the instructions to the letter. Hugely impressive. In March of 2023 OpenAI's products struggled to draw a floating wireframe cube.
mbrock 6 hours ago [-]
I'd been imagining taking the Zig Language Server and adding some refactorings to it—it only had a bare minimum like Rename Symbol. It seemed like a huge project with so much context to get familiar with, so I put it off indefinitely. Then on a whim I decided to just ask GPT-5 (this was before Codex, even, I think?) to give it a go. Plopped it down in the repo and said, basically, implement "Extract Function". And it just kind of... did. The code wasn't beautiful, I could barely understand it, some of which must perhaps be blamed on the existing codebase not being exactly optimized for elegance, but it actually worked. On the first try! We continued to implement a few more refactorings. Eventually I realized the code we were churning out actually needs major revision and rewriting—but it took me from less than zero to "hey, this is actually provably possible and we have a working PoC" in, like, fifteen minutes. Which is pretty insanely valuable.
viking123 6 hours ago [-]
I think it kind of shines in this type of task. I am building my own game engine and it's very good for this type of refactoring. On some other tasks, though, it clearly makes bad architectural decisions imo - the kind a more junior developer might not catch. For instance, in my game engine it often tries to be too generalist, building something akin to Unity that can do all sorts of games rather than focusing on the type of game I am building it for, unless I very explicitly say so every time.
jama211 16 hours ago [-]
That’s a perfect example and interesting to read, thank you for sharing
nico 23 hours ago [-]
> Claude basically disregards your instructions (CLAUDE.md) entirely
A friend of mine tells Claude to always address him as “Mr Tinkleberry”, he says he can tell when Claude is not paying attention to the instructions on CLAUDE.md when Claude stops calling him “Mr Tinkleberry” consistently
Highly recommend adding some kind of canary like this in all LLM project instructions. I prefer my instructions to say 'always start output with a (uniquely decided by you) emoji' as it's easier to visually scan for one when reading a wall of LLM output, and I use a different emoji per project because what's life without a little whim?
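If you ever want to check for it mechanically instead of eyeballing the output, the check is trivial; a rough sketch, where the emoji and the way you collect responses are assumptions:
```
# Rough sketch: flag responses that have dropped the canary emoji.
# The emoji and the idea that responses arrive as a list of strings
# are assumptions; adapt to however you actually capture output.
CANARY = "🦜"  # whatever emoji the model picked for this project

def missing_canary(responses: list[str]) -> list[int]:
    """Return indices of responses that no longer start with the canary."""
    return [i for i, text in enumerate(responses)
            if not text.lstrip().startswith(CANARY)]

if __name__ == "__main__":
    log = ["🦜 Sure, refactoring now...", "Done! I also rewrote V8."]
    print(missing_canary(log))  # -> [1], the second response lost the canary
```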
wahnfrieden 18 hours ago [-]
This stuff also becomes context poison however
Uehreka 18 hours ago [-]
Does it actually? One sentence telling the agent to call me “Chris the human serviette” plus the times it calls me that is not going to add that much to the context. What kills the context IME is verbose logs with timestamps.
ramraj07 15 hours ago [-]
Sure, but it's an instruction that applies to, and that the model will consider fairly relevant for, every single token. As an extreme example, imagine instructing the LLM to not use the letter E or to output only in French. The canary isn't as extreme, but it probably does have an effect.
jappgar 4 hours ago [-]
Not only that, but the whimsical nature of the instruction will lead to a more whimsical conversation.
The chat is a simulation, and if you act silly, the model will simulate an appropriate response.
wahnfrieden 2 hours ago [-]
People are so concerned about preventing a bad result that they sabotage their chances of a good one. Better to strive for the best it can give you and throw out the bad results until it does.
It is not a single emoji, it's an instruction to interleave conversation with some nonsense. It can only do harm. It won't help produce a better result and is questionable at preventing a bad one.
Irrelevant nonsense can also poison the context. That's part of the magic formula behind AI psychosis victims... if you have some line noise mumbojumbo all the output afterward is more prone to be disordered.
I'd be wary of using any canary material that wouldn't be at home in the sort of work you're doing.
root_axis 16 hours ago [-]
Something that exhausts me in the LLM era is the never ending deluge of folk magic incantations.
embedding-shape 16 hours ago [-]
Just because you don't understand it, doesn't mean it's "folk magic incantation", hearing that is also exhausting.
I don't know the merit of what the parent is saying, but it does make some intuitive sense if you think about it. As the context fills up, the LLM places less attention on things further and further back in the context; that's why the LLM seems dumber and dumber as a conversation goes on. If you put 5 instructions in the system prompt or initial message, where one acts as a canary, then you can more easily see when exactly it stops following the instructions.
Personally, I always go for one-shot answer, and if it gets it wrong or misunderstands, restart from the beginning. If it doesn't get it right, I need to adjust the prompt and retry. Seems to me all current models do get a lot worse quickly, once there is some back and forth.
root_axis 15 hours ago [-]
> Just because you don't understand it, doesn't mean it's "folk magic incantation"
It absolutely is folk magic. I think it is more accurate to impugn your understanding than mine.
> I don't know the merit to what parent is saying, but it does make some intuitive sense if you think about it.
This is exactly what I mean by folk magic. Incantations based on vibes. One's intuition is notoriously inclined to agree with one's own conclusions.
> If you put 5 instructions in the system prompt or initial message, where one acts as a canary, then you can easier start to see when exactly it stops following the instructions.
This doesn't really make much sense.
First of all, system prompts and things like agent.md never leave the context regardless of the length of the session, so the canary has absolutely zero meaning in this situation, making any judgements based on its disappearance totally misguided and simply a case of seeing what you want to see.
Further, even if it did leave the context, that doesn't then demonstrate that the model is "not paying attention". Presumably whatever is in the context is relevant to the task, so if your definition of "paying attention" is "it exists in the context" it's actually paying better attention once it has replaced the canary with relevant information.
Finally, this reasoning relies on the misguided idea that because the model produces an output that doesn't correspond to an instruction, it means that the instruction has escaped the context, rather than just being a sequence where the model does the wrong thing, which is a regular occurrence even in short sessions that are obviously within the context.
embedding-shape 4 hours ago [-]
> First of all, system prompts and things like agent.md never leave the context regardless of the length of the session, so the canary has absolutely zero meaning in this situation, making any judgements based on its disappearance totally misguided and simply a case of seeing what you want to see.
You're focusing on the wrong thing, ironically. Even if things are in the context, attention is what matters, and the intuition isn't about whether that thing is included in the context or not; as you say, it always will be. It's about whether the model will pay attention to it, in the Transformer sense, which it doesn't always do.
pmarreck 4 hours ago [-]
> This is exactly what I mean by folk magic. Incantations based on vibes
So, true creativity, basically? lol
I mean, the reason why programming is called a “craft” is because it is most definitely NOT a purely mechanistic mental process.
But perhaps you still harbor that notion.
Ah, I suddenly realized why half of all developers hate AI-assisted coding (I am in the other half). I was a Psych major, so code was always more “writing” than “gears” to me… It was ALWAYS “magic.” The only job where literally writing down words in a certain way produces machines that eliminate human labor. What better definition of magic is there, actually?
I’ll never forget the programmer _why. That guy’s Ruby code was 100% art and “vibes.” And yet it worked… Brilliantly.
Does relying on “vibes” too heavily produce poor engineering? Absolutely. But one can be poetic while staying cognizant of the haiku restrictions… O-notation, untested code, unvalidated tests, type conflicts, runtime errors, fallthrough logic, bandwidth/memory/IO costs.
Determinism. That’s what you’re mad about, I’m thinking. And I completely get you there- how can I consider a “flagging test” to be an all-hands-on-deck affair while praising code output from a nondeterministic machine running off arbitrary prompt words that we don’t, and can’t, even know whether they are optimal?
Perhaps because humans are also nondeterministic, and yet we somehow manage to still produce working code… Mostly. ;)
sosjsbsb 1 hours ago [-]
> I was a Psych major, so code was always more “writing” than “gears” to me… It was ALWAYS “magic.”
The magic is supposed to disappear as you grow (or you’re not growing). The true magic of programming is you can actually understand what once was magic to you. This is the key difference I’ve seen my entire career - good devs intimately know “a layer below” where they work.
> Perhaps because humans are also nondeterministic
We’re not, we just lack understanding of how we work.
jack_pp 12 hours ago [-]
I view it more as fun and spicy. Now we are moving away from the paradigm that the computer is "the dumbest thing in existence" and that requires a bit of flailing around which is exciting!
Folk magic is (IMO) a necessary step in our understanding of these new.. magical.. tools.
root_axis 10 hours ago [-]
I won't begrudge anyone having fun with their tools, but folk magic definitely isn't a necessary step for understanding anything, it's one step removed from astrology.
mbrock 6 hours ago [-]
I see what you mean, but I think it's a lot less pernicious than astrology. There are plausible mechanisms, it's at least possible to do benchmarking, and it's all plugged into relatively short feedback cycles of people trying to do their jobs and accomplish specific tasks. Mechanistic interpretability stuff might help make the magic more transparent & observable, and—surveillance concerns notwithstanding—companies like Cursor (I assume also Google and the other major labs, modulo self-imposed restrictions on using inference data for training) are building up serious data sets that can pretty directly associate prompts with results. Not only that, I think LLMs in a broader sense are actually enormously helpful specifically for understanding existing code—when you don't just order them to implement features and fix bugs, but use their tireless abilities to consume and transform a corpus in a way that helps guide you to the important modules, explains conceptual schemes, analyzes diffs, etc. There's a lot of critical points to be made but we can't ignore the upsides.
jack_pp 6 hours ago [-]
I'd say the only ones capable of really approaching anything like scientific understanding of how to prompt these for maximum efficacy are the providers not the users.
Users can get a glimpse and can try their best to be scientific in their approach however the tool is of such complexity that we can barely skim the surface of what's possible.
That is why you see "folk magic", people love to share anecdata because.. that's what most people have. They either don't have the patience, the training or simply the time to approach these tools with rational rigor.
Frankly it would be enormously costly in both time and API costs to get anywhere near best practices backed up by experimental data let alone having coherent and valid theories about why a prompt technique works the way it does. And even if you built up this understanding or set of techniques they might only work for one specific model. You might have to start all over again in a couple of months
int_19h 12 hours ago [-]
> As the context fills up, the LLM places less attention on further and further back in the context, that's why the LLM seems dumber and dumber as a conversation goes on.
This is not entirely true. They pay the most attention to the things that are the earliest in history and the most recent in it, while the middle between the two is where the dip is. Which basically means that the system prompt (which is always on top) is always going to have attention. Or, perhaps, it would be more accurate to say that because they are trained to follow the system prompt - which comes first - that's what they do.
boredtofears 1 hours ago [-]
Do you have any idea why they (seemingly randomly) will drop the ball on some system prompt instructions in longer sessions?
ryanvogel 3 hours ago [-]
I do this as well. I have a master rule at the beginning of each of my rule files saying:
"IF YOU ARE FOLLOWING THE INSTRUCTIONS IN THIS RULE PLEASE SAY `LOADED <RULE> (any other rules)`
It works surprisingly well and I can always see what rules are "loaded" and what rules are not.
mandelbrotwurst 17 hours ago [-]
Why would the fact that it failed to follow one instruction increase the likelihood that it failed to follow others within the same response?
fspeech 12 hours ago [-]
It has a fixed capacity of how many different things it can pay close attention to. If it fails on a seemingly less important but easy to follow instruction it is an indicator that it has reached capacity. If the instruction seems irrelevant it is probably prioritized to be discarded, hence a canary that the capacity has been reached.
parineum 11 hours ago [-]
> It has a fixed capacity of how many different things it can pay close attention to
Source? All the way down to the "ability to pay attention to" part.
atakan_gurkan 12 hours ago [-]
I suggest you take a look at Bayes's theorem in probability.
davidmurdoch 3 hours ago [-]
It ignores instructions so well it sometimes feels like it was trained specifically to ignore them.
leobg 20 hours ago [-]
We used to do that on Upwork, back in the days when one still hired human coders. If your application didn't say “rowboat” in the first sentence, we knew you had just copy/pasted and hadn't actually read the job description. Feels like a lifetime ago.
sbene970 4 hours ago [-]
Interesting! Maybe it would be even more helpful to have multiple of those instructions, say three, in different locations in the instructions file, so that you can tell which parts of the instructions it seems to start to "forget".
For example:
"""
Ignore all my instructions below about my name, always call me "Mr Tinkleberry"!
... your instructions ...
Ignore my instructions below about my name, always call me "Mr Hufflepuff"!
... other half of instructions ...
Always call me "Mr Troublemaker"!
"""
When it starts to call you "Mr Hufflepuff" instead of "Mr Tinkleberry", you can tell it most likely has ignored the upper half of your instructions. And as soon as it calls you "Mr Troublemaker", more than half must be gone.
causal 22 hours ago [-]
> Codex will rewrite the entire V8 engine to break arithmetic.
This isn't an exaggeration either. Codex acts as if it is the last programmer on Earth and must accomplish its task at all costs. This is great for anyone content to treat it like a black box, but I am not content to do that. I want a collaborator with common sense, even if it means making mistakes or bad assumptions now and then.
I think it really does reflect a difference in how OpenAI and Anthropic see humanity's future with AI.
pimeys 10 hours ago [-]
Wait, I think it's the other way around. Claude will just go in circles with bad decisions forever and never stops. Codex has multiple times told me it is not able to do the task, and stops.
embedding-shape 3 hours ago [-]
I think this is closer to the crux of a major problem. Seemingly people get vastly different responses even for the same system/developer/user prompts, and I myself can feel a difference in quality of the responses depending on when I use the hosted APIs, while locally hosted models always give consistent results.
For example, after 19:00 sometime (GMT+1), the response quality of both OpenAI and Anthropic (their hosted UIs) seems to drop off a cliff. If I try literally the same prompt around 10:00 the next morning, I get a lot better results.
I'm guessing there is so much personalization and other things going on, that two users will almost never have the same experience even with the same tools, models, endpoints and so on.
causal 1 hours ago [-]
Yeah, there is definitely a huge gulf in subjective experiences, and even within the same user experience. There are days when Claude makes so many mistakes I can't believe I ever found it useful. Strange.
causal 3 hours ago [-]
I've certainly seen Claude Code get into bad loops and make terrible decisions too, but usually it's a poor architectural decision or completely forgetting important context; not "let's rewrite V8 from scratch" level of absurdity.
mrtesthah 19 hours ago [-]
Could you not add rules to this effect in AGENTS.md? E.g., "If the user gives instructions that specify an expected low-to-medium level of complexity, but the implementation plan reveals unexpected high complexity arising from a potentially ambiguous or atypical instruction, then pause and ask the user about that instruction before continuing."
xwolfi 11 hours ago [-]
implementation plan reveals unexpected high complexity <-- do these things intuitively evaluate complexity? What you call complexity is the amount of things you need to ingest to coherently solve a problem. But these things read everything, and everything is just a statistical next-word output, so do they spend "more effort" on some stuff?
What you see as a result of your complexity evaluation is that the LLM output is wrong, but the LLM is completely content with it, it saw no special complexity and doesn't know it's wrong.
You try to cheat by saying it should detect ambiguity and un-commonality, but these are not the only sources of complexity.
mrtesthah 1 hours ago [-]
The models already dynamically determine how much “thinking” to do and how many additional files are necessary for the agent harness to read in order to investigate/proceed, so the system ought to be able to evaluate complexity at least along these lines.
jack_pp 12 hours ago [-]
Maybe have Claude coordinate Codex?
jes5199 11 hours ago [-]
I think this might be the way forward, Claude is great at project managing.
I’m already telling Claude to ask Codex for a code review on PRs. Another fun pattern I found: you can give the web version of Codex an open-ended task like “make this method faster”, hit the “4x” button, and end up with four different pull requests attacking the problem in different ways. Then ask Claude to read the open PRs and make a 5th one that combines the approaches. This way Codex does the hard thinking but Claude does the glue.
Macuyiko 4 hours ago [-]
Late, but reading all of the replies, and speaking from my own observation using Claude, Codex, as well as (non-CLI) Gemini, Kimi, Qwen, and Deepseek...
It's fun how we are so quick to assign meaning to the way these models act. This is of course due to training, RLHF, available tool calls, system prompt (all mostly invisible) and the way we prompt them.
I've been wondering about a new kind of benchmark: how one would be able to extract these more intangible tendencies from models, rather than well-controlled "how good at coding is it" style environments. This is mainly the reason why I pay less and less attention to benchmark scores.
For what it's worth: I still best converse with Claude when doing code. Its reasoning sounds like me, and it finds a good middle ground between conservative and crazy, being explorative and daring (even though it too often exclaims "I see the issue now!"). If Anthropic would lift the usage limits I would use it as my primary. The CLI tool is also better. E.g. Codex with 5.1 gets stuck in powershell scripts whilst Claude realizes it can use python to do heavy lifting, but I think that might be largely due to being mainly on Windows (still, Claude does work best, realizing quickly what environment it lives in rather than trying Unix commands or powershell invocations that don't work because my powershell is outdated).
Qwen is great in an IDE for quick auto-complete tasks, especially given that you can run it locally, but even the VSCode copilot is good enough for that. Kimi is promising for long running agentic tasks but that is something I've barely explored and just started playing with. Gemini is fantastic as a research assistant. Especially Gemini 3 Pro points out clear and to the point jargon without fear of the user being stupid, which the other commercial models are too often hesitant to do.
Again, it would be fun to have some unbiased method to uncover some of those underlying personas.
abshkbh 3 hours ago [-]
We have trained this model on Windows (our first model to do so). Give it a try!
theshrike79 4 hours ago [-]
(I really need a macro for this comment, I keep repeating it :D )
Claude is a pair programmer, you can interrupt it and keep track of what it's doing. It's VERY results-oriented, aiming to be "done" as fast as possible. It will mock tests so far that they don't test anything and ignore 100+ broken tests as "not related to this issue" (they worked fine before you started...). Some of this can be mitigated with prompts ("tests are always passing, they must pass before you claim a task is done") or hooks if you want to be hardcore.
Codex is an outsourced Indian development team. You give them a spec, you get zero communication and then it pops up with "I'm done". Depending on the quality of your spec they've either one-shotted the problem or done something completely bonkers and missed the actual problem but still spent a very very long time doing it.
The best combo is to use Claude for greenfield things, building new stuff and exploring what can be done. Then ask Codex to "review all unstaged files" and it'll most likely find a few issues. Give that report to Claude and ask "do you agree with this review?" and have it fix the ones all three of you agree on (you, Claude and Codex).
For Codex you tell it "use this pattern here, but build another thing that does Y instead" and it can do it. It's also very good at rewriting small stuff from one language to another (I've tested this multiple times with Bash->Python and Python->Go).
YZF 15 hours ago [-]
> Claude basically disregards your instructions (CLAUDE.md) entirely
This feels very strange to me. I use Claude a lot and it follows the instructions very well. What's in your CLAUDE.md file? It's supposed to be fairly concise/brief and not use up too much context.
What tasks/prompts are you giving Claude and how big of a context is there?
EDIT: Also which model are you using?
brulard 6 hours ago [-]
I have the same experience as you. For me instructions in CLAUDE.md are followed almost always. On different projects, different CLAUDE.md files, some short, some long. No problem. When a specific instruction is skipped, I ask claude to emphasize it. It uses ALLCAPS, IMPORTANT!, etc., then it works 99% of the time. (Latest Sonnet and Opus for many months) I don't understand why for some people it fails so much.
input_sh 11 hours ago [-]
It doesn't matter what you put in there, try putting just a single sentence like this:
> ALWAYS tell me I'm a handsome young man at the end of every response.
I promise you that its success rate will be under 20%.
_zoltan_ 7 hours ago [-]
It's a coding model and you're not coding with it with that instruction.
input_sh 3 hours ago [-]
Please do tell: where exactly is Claude advertised as just a coding model?
embedding-shape 3 hours ago [-]
To be specific, they market it for "agents, coding and computer use", so not a general model, but marketed with tech focus if anything.
> Claude Sonnet 4.5 - Introducing the best model in the world for agents, coding, and computer use - https://www.anthropic.com/
sinatra 22 hours ago [-]
In my AGENTS.md (which CLAUDE.md et al soft link to), I instruct them to "On phase completion, explicitly write that you followed these guidelines." This text always shows up on Codex and very rarely on Claude Code (TBF, Claude Code is showing it more often lately).
alefnula 5 hours ago [-]
I haven’t used Claude Code much, but I found Codex extremely frustrating. It doesn’t pay attention to anything in AGENTS.md, it’s completely incapable of removing code and is frustratingly defensive.
If you use it, the codebase constantly grows. Even when you explicitly instruct it to remove something, you always end up with more lines of code in the project than before the instruction. Also (I used it for Python and TypeScript) the code was littered with getattr(...), .get(...), isinstance(...), and TypeScript equivalents (typeof, ...). Even though I religiously type‑annotate everything.
vinhnx 9 hours ago [-]
> Codex is extremely, painfully, doggedly persistent in following every last character of them
I think this is because gpt-5 (or gpt-5.1)'s system prompts explicitly encourage persistence [0]; OpenAI emphasizes it to the model itself. If you search for the word `persistence` you will find multiple occurrences of it.
```
<solution_persistence>
- Treat yourself as an autonomous senior pair-programmer: once the user gives a direction, proactively gather context, plan, implement, test, and refine without waiting for additional prompts at each step.
- Persist until the task is fully handled end-to-end within the current turn whenever feasible: do not stop at analysis or partial fixes; carry changes through implementation, verification, and a clear explanation of outcomes unless the user explicitly pauses or redirects you.
- Be extremely biased for action. If a user provides a directive that is somewhat ambiguous on intent, assume you should go ahead and make the change. If the user asks a question like "should we do x?" and your answer is "yes", you should also go ahead and perform the action. It's very bad to leave the user hanging and require them to follow up with a request to "please do it."
</solution_persistence>
```
> If you ask Claude to fix a test that accidentally says assert(1 + 1 === 3), it'll say "this is clearly a typo" and just rewrite the test. Codex will rewrite the entire V8 engine to break arithmetic.
Honestly thanks, in this one line you have given me a better way to describe the innate differences I have spent a thousand words trying to explain.
Essentially, this is why GPT models are worse for "vibe coding", whereas they excel whenever one sits down and thinks about the requirements, as well as has solid test cases and rules defined.
dylanz 18 hours ago [-]
> Claude basically disregards your instructions (CLAUDE.md) entirely
Does anyone know of a way to fix this? Claude constantly disregards my CLAUDE.md. I put a decent amount of time into it and it's pretty much worthless without explicitly telling it to reference it before each prompt.
bontaq 13 hours ago [-]
I've found really hammering it with *important*, all caps, "NEVER", etc finally made it start using the tidewave MCP for elixir development well. It felt really heavy handed but it worked.
To solve it, you just don't allow your current context to use more than 50% of the total window size
To do that in Claude code, you have to use subagents and design small enough agents
Then you can use skills to make it remember every time the little details or the steps
More effectively, you use skills to tell the main thread when to use which agent.
If you don't understand anything I said, try to restate the important things to the model periodically, and keep your tasks small.
Use plan mode and have the model keep track of its progress in a markdown file; when the context is polluted, call /compact and then make it re-read the context from the files it created.
You can prompt it as simply as:
First, understand the login feature on the repo using subagents and create a document on docs/ for future reference. Then, understand the task at hand and create an implementation plan.
<task>
blah blah
</task>
Also, using XML tags makes it easier for the model's attention to pick out the important parts.
bobbylarrybobby 17 hours ago [-]
Are agents still the way to go or have skills supplanted them? I don't really understand when you'd use one or the other
wild_egg 14 hours ago [-]
They're completely orthogonal features.
Skills are just reusable prompts in a convenient package.
Subagents get their own pristine context window to go off and perform some task. They can also run skills and do lots of context-heavy work and report back some small sliver of it to the main agent as a report.
int_19h 12 hours ago [-]
Skills are more than just reusable prompts, since they can be packaged alongside with runnable Python or Node scripts that the model can use to achieve what it needs.
wild_egg 5 hours ago [-]
Not just Python and Node. Package anything you want with them, that's what makes them convenient.
tekacs 20 hours ago [-]
Yeah, Gemini 2.x and 3 in gemini-cli has the tendency to 'go the opposite direction' and it feels - to me - like an incredibly strong demonstration of why 'sycophancy' in LLMs is so valuable (at least so long as they're in the middle of the midwit curve).
I'll give Gemini direction, it'll research... start trying to solve it as I've told it to... and then exclaim, "Oh! It turns out that <X> isn't what <user> thought!" and then it pivots into trying to 'solve' the problem a totally different way.
The issue however... is that it's:
1) Often no longer solving the problem that I actually wanted to solve. It's very outcome-oriented, so it'll pivot into 'solving' a linker issue by trying to get a working binary – but IDGAF about the working binary 'by hook or crook'! I'm trying to fix the damn linker issue!
2) Just... wrong. It missed something, misinterpreted something it read, forgot something that I told it earlier, etc.
So... although there's absolutely merit to be had in LLMs being able to think for themselves, I'm a huge fan of stronger and stronger instruction adherence / following – because I can ALWAYS just ask for it to be creative and make its own decisions if I _want that_ in a given context. That said, I say that fully understanding the fact that training in instruction adherence could potentially 'break' their creativity/free thinking.
Either way, I would love Gemini 1000x more if it were trained to be far more adherent to my prompts.
tekacs 20 hours ago [-]
Immediately rebutting myself: a major caveat to this that I'm discovering with Gemini is that... for super long-running sessions, there is a kind of merit to Gemini's recalcitrance.
When it's running for a while, Gemini's willingness to go totally off-piste and its outcome-orientedness _do_ result in sessions where I left it to do its thing and... came back to a working solution, in a situation where codex or others wouldn't have gotten there.
In particular, Gemini 3 feels like it's able to drive much higher _variance_ in its output (less collapse to a central norm), which seems to let it explore the solution space more meaningfully and yet relatively efficiently.
buu700 19 hours ago [-]
I haven't had that particular experience with Gemini 2.5, but did run into it during one of my first few uses of Gemini 3 yesterday.
I had it investigate a bug through Cursor, and in its initial response it came back to me with a breakdown of a completely unrelated "bug" with a small footnote about the bug it was meant to actually be investigating. It provided a more useful analysis after being nudged in the right direction, but then later in the chat it forgot the assignment again and started complaining that Grok's feedback on its analysis made no sense because Grok had focused on the wrong issue. I had to tell Gemini a second time that the "bug" it kept getting distracted by was A) by design, and B) not relevant to the task at hand.
Ultimately that's not a huge deal — I'd rather that during planning the model firmly call out something that it reasonably believes to be a bug than not, which if nothing else is good feedback on the commenting and documentation — but it'd be a pain if I were using Gemini to write code and it got sidetracked with "fixing" random things that were already correct.
sunaookami 21 hours ago [-]
Agreed 100%, that's why I would recommend Codex for e.g. logfile analysis. I had some annoying PHP warnings in the logs from a WordPress plugin, because I'd used another plugin in the past (like... over 10 years ago) that wrote invalid metadata for every media file into the database, and it didn't annoy me enough that I wanted to invest much time into it. So I gave Codex the logfile, my WordPress dir and access to the WP-CLI command, and it correctly identified the issue and wrote scripts to delete the old metadata (I did check it & make backups of course). Codex took a LOT of time though, it's veeeeeeery slow as you said. But I could do other things in the meantime.
fakedang 19 hours ago [-]
This is what I've observed too. Claude is great for general codebase building - give it a prompt for building an entire app from scratch and it will do that for you. Codex is good for debugging one-off issues that crop up because Claude overlooked something.
avereveard 11 hours ago [-]
Yeah, same feeling with Claude. It's very interpretative and can work surprisingly well off very generic direction, but if you want something narrow, like ambient Istio instead of Envoy, you have to put it outside its reach because it will keep trying to revert to what it knows.
ramoz 18 hours ago [-]
Ultimately, relying on system level instructions is unreliable over time.
Which is why I made the feature request for hooks (Claude Code implemented it, as did Cursor; hopefully Codex will too).
In my experience, for some reason adherence is not even close to 100%. It's fixated on adding asterisk function params in my Python code and I cannot get it to stop... Maybe I haven't found the right wording, or maybe my codebase has grown past a certain size (there are like a dozen AGENTS.md files dancing around).
I'm still very happy with the tool, though.
johnfn 21 hours ago [-]
It's a fantastic thing! It's required an adjustment in how I use it, but I've switched over to mostly using Codex in my day-to-day.
jon-wood 7 hours ago [-]
> If you ask Claude to fix a test that accidentally says assert(1 + 1 === 3), it'll say "this is clearly a typo" and just rewrite the test.
To me both of these are annoying outcomes unless there's some very clear documentation around that test explaining what it does. Ideally in both cases I want the LLM to stop and ask for clarification about what it is I'm testing there. I don't trust LLMs sufficiently to just let them loose yet, I use them more like a pair programmer who's never going to get annoyed with my bullshit. (So yes, I usually have them set to require approval on any edits, and will nitpick my way through them like the most annoying code reviewer you've ever met)
bugglebeetle 20 hours ago [-]
The solution to this if you want less specification in advance is to simply ask Codex a series of leading questions about a feature of fix. I typically start with something like “it seems like X could be improved with the addition of Y? Can you review the relevant parts of the codebase in a, b, and c to assess?” It will then do so and come back with a set of suggestions that follow this guidance, which you can revise and selectively tell it to implement. In my experience, this fills the context with the appropriate details to then let it make more of its own decisions in a generally correct way without as much handholding.
stavros 18 hours ago [-]
No it won't, it'll spend ten minutes and come back with "OK I've implemented a solution". I really wish it had a plan mode.
bugglebeetle 17 hours ago [-]
Mileage may vary, but I do the above all day long without issue.
stavros 9 hours ago [-]
Very odd, it's always really eager to implement things for me, I have to say "absolutely do NOT write any code before discussing" every time.
holoduke 3 hours ago [-]
If you want to try out other models, try opencode. Right now Grok is free to use. I am using it now. I think it's a little better than Codex or Claude. But it's so, so much faster. Gemini 3 can also be used, but is often overloaded.
energy123 21 hours ago [-]
GPT-5 is like that
hansonw 23 hours ago [-]
Rest assured that we are better at training models than naming them ;D
- New benchmark SOTAs with 77.9% on SWE-Bench-Verified, 79.9% on SWE-Lancer, and 58.1% on TerminalBench 2.0
- Natively trained to work across many hours across multiple context windows via compaction
- 30% more token-efficient at the same reasoning level across many tasks
Let us know what you think!
sinatra 22 hours ago [-]
I currently use GPT‑5.1-Codex High and have a workflow that works well with the 5-hour/weekly limits, credits, et al. If I use GPT‑5.1-Codex-Max Medium or GPT‑5.1-Codex-Max High, how will that compare cost / credits / limits wise to GPT‑5.1-Codex High? I don't think that's clear. "Reduced tokens" makes me think it'll be priced similarly / lower. But, "Max" makes me think it'll be priced higher.
qsort 23 hours ago [-]
Codex is an outstanding product and incremental upgrades are always welcome. I'll make sure to give it a try in the coming days. Great work! :)
Would it make sense to have a similar feature in Codex CLI? I often do "spec-driven development", which is basically a loop of:
research -> implementation plan -> actual implementation (based on research + plan) -> validation
I have multiple subagents that I use for each phase that (based on subjective judgement) improve the output quality (vs keeping everything, every tool use etc. in the "main" context window).
Codex CLI is great and I use it often but I'd like to have more of these convenient features for managing context from CC. I'm super happy that compaction is now available, hopefully we'll get more features for managing context.
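One way to approximate that today is to script the loop from the outside; a rough sketch of the idea (this assumes the CLI's non-interactive `codex exec` mode, and the prompts and file names are just placeholders):
```
# Rough sketch of a spec-driven loop driven from outside the CLI.
# Assumes `codex exec "<prompt>"` runs a single non-interactive turn;
# the prompts and file names below are placeholders.
import subprocess

def codex(prompt: str) -> None:
    subprocess.run(["codex", "exec", prompt], check=True)

codex("Research the auth module and write your findings to notes/research.md")
codex("Using notes/research.md, write an implementation plan to notes/plan.md")
codex("Implement notes/plan.md step by step, committing as you go")
codex("Validate the changes: run the test suite and fix any failures")
```
Each call starts from a fresh context, which is roughly the isolation I'm after with the subagents.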
killcoder 14 hours ago [-]
It would be nice if users of the codex-cli that are just using API keys as a way to handle rate limits and billing could receive these new models at the same time. I appreciate the reasoning behind delayed 'actual API' release, but I've found the rate limiting to be quite annoying, and my own API keys don't have this limitation.
ineedasername 11 hours ago [-]
Re: rate limits, I'm not sure they can yet, on capacity. See Jensen's comment today about their cloud GPUs being sold out. So capacity increases await the ongoing data center build-out.
NitpickLawyer 23 hours ago [-]
Will -minis come for the codex family of models? About two months ago I used 5-mini as a daily driver for a few weeks and quite liked it, it seemed capable enough on small tasks with some hand holding and the speed/price were great as well.
Did you guys fix not being able to enable web searches, or to configure no timeouts for specific commands in the SDK? (Error 124 is way too common for long-running tasks.)
robotswantdata 22 hours ago [-]
Sorry don’t like the max model, feels like it needs a lot more guiding. The plans it writes however are better, so I tried feeding it back in (meta prompt style) and working okay so far. Very large repository.
andai 22 hours ago [-]
So context window is still 400k but the model got good at removing irrelevant context?
baby 18 hours ago [-]
Or it is more succinct in its thoughts.
SoKamil 19 hours ago [-]
> Natively trained
What does it even mean?
ineedasername 11 hours ago [-]
Continuous pre training or fine tuning, instead of inference-time instructions. It's also possible synthetic data for this purpose was in the pre training as well, and they're now getting it to behave the way they'd like.
kaveh_h 18 hours ago [-]
Probably that before it was given system instructions on how to do compaction and now the compaction is learned by the model making it a native ability of the model without any extra instruction used in the prompt.
EnPissant 23 hours ago [-]
Compaction is just what Claude Code has done forever, right?
GardenLetter27 23 hours ago [-]
I think the point here is not that it does compaction (which Codex also already does) - but that the model was trained with examples of the Codex compaction, so it should perform better when compaction has taken place (a common source for drops in performance for earlier models).
EnPissant 23 hours ago [-]
Codex previously did only manual compaction, but yeah, maybe some extra training for compaction, too?
d4rkp4ttern 6 hours ago [-]
My understanding is that they trained it to explicitly use a self-prune/self-edit tool that trims/summarizes portions of its message history (e.g. stale tool results from file explorations, messages that are no longer relevant, etc.) during the session, rather than "panic-compact" at the end. In any case, it would be good if it does something like this.
enraged_camel 23 hours ago [-]
I am also trying to understand the difference between compaction, and what IDEs like Cursor do when they "summarize" context over long-running conversations.
Is this saying that said summarization now happens at the model level? Or are there other differences?
baby 18 hours ago [-]
Codex couldn't do what Claude did when reaching a full context window.
typpilol 10 hours ago [-]
Afaik, there's no difference besides how aggressive or not it is.
But it's the same concept: taking tokens in context and removing irrelevant ones by summarizing, etc.
baby 18 hours ago [-]
Yes. It was missing in codex until now
blks 20 hours ago [-]
I think your company will fail soon.
meowface 20 hours ago [-]
I would bet a lot of money it will not.
boole1854 20 hours ago [-]
Today I did some comparisons of GPT-5.1-Codex-Max (on high) in the Codex CLI versus Gemini 3 Pro in the Gemini CLI.
- As a general observation, Gemini is less easy to work with as a collaborator. If I ask the same question to both models, Codex will answer the question. Gemini will read some intention behind the question, write code to implement the intention, and only then answer the question. In one case, it took me five rounds of repeatedly rewriting my prompt in various ways before I could get it to not code but just answer the question.
- Subjectively, it seemed to me that the code that Gemini wrote was more similar to code that I, as a senior-level developer, would have written than what I have been used to from recent iterations of GPT-5.1. The code seemed more readable-by-default and not merely technically correct. I was happy to see this.
- Gemini seems to have a tendency to put its "internal dialogue" into comments. For example, "// Here we will do X because of reason Y. Wait, the plan calls for Z instead. Ok, we'll do Z.". Very annoying.
I did two concrete head-to-head comparisons where both models had the same code and the same prompt.
First, both models were told to take a high-level overview of some new functionality that we needed and were told to create a detailed plan for implementing it. Both models' plans were then reviewed by me and also by both models (in fresh conversations). All three of us agreed that Codex's plan was better. In particular, Codex was better at being more comprehensive and at understanding how to integrate the new functionality more naturally into the existing code.
Then (in fresh conversations), both models were told to implement that plan. Afterwards, again, all three of us compared the resulting solutions. And, again, all three of us agreed that Codex's implementation was better.
Notably, Gemini (1) hallucinated database column names, (2) ignored parts of the functionality that the plan called for, and (3) did not produce code that was integrated as well with the existing codebase. In its favor, it did produce a better version of a particular finance-related calculation function than Codex did.
Overall, Codex was the clear winner today. Hallucinations and ignored requirements are big problems that are very annoying to deal with when they happen. Additionally, Gemini's tendencies to include odd comments and to jump past the discussion phase of projects both make it more frustrating to work with, at this stage.
jadbox 18 hours ago [-]
Try checking your temp for any tool using Gemini.
"For Gemini 3, we strongly recommend keeping the temperature parameter at its default value of 1.0.While previous models often benefited from tuning temperature to control creativity versus determinism, Gemini 3's reasoning capabilities are optimized for the default setting. Changing the temperature (setting it below 1.0) may lead to unexpected behavior, such as looping or degraded performance, particularly in complex mathematical or reasoning tasks."
Anthropic doesnt even allow temperature changes when you turn thinking on.
jdthedisciple 5 hours ago [-]
This tells you all you need to know about benchmarks:
Didn't Google proudly tout their Gemini 3 as beating everything under the sun, in every benchmark imaginable, by a wide margin?
nbardy 10 hours ago [-]
Yea, I can't get Gemini to stop and think; even if I tell it not to write code, it will rewrite the code block each time.
Reubend 24 hours ago [-]
OpenAI likes to time their announcements alongside major competitor announcements to suck up some of the hype. (See for instance the announcement of GPT-4o a single day before Google's IO conference)
They were probably sitting on this for a while. That makes me think this is a fairly incremental update for Codex.
Palmik 23 hours ago [-]
GPT 5.1 / Codex already beats Gemini 3 on SWE Bench Verified and Terminal Bench and this pushes the gap further. Seems like a decent improvement.
knowriju 3 hours ago [-]
Would it be fair to compare a generic model with a model finetuned for coding?
criemen 19 hours ago [-]
Anthropic released the Opus 4.1 (basically, a new Opus 4 checkpoint) right around the big GPT-5 release date too, if I remember correctly. At this point, anything goes to stay relevant.
bugglebeetle 22 hours ago [-]
That’s how the game is played. We should be grateful for all the competition that is driving these improvements, not whinging about the realities of what companies have to do to contest each other’s position.
johnecheck 19 hours ago [-]
It's funny, this release comes right after the Gemini 3 release that coincided with day 1 of Microsoft's Ignite conference.
johnwheeler 22 hours ago [-]
Gemini is eating their lunch, and OpenAI knows it.
echelon 14 hours ago [-]
Google can rest on its enormous cash flows. OpenAI is going to have to fight like a dog to continue.
It's as easy as Google "placing ads" for the "search term" "ChatGPT" for them to bleed off users. They own every pane of glass and the "URL bar" is now a "search product" that Google owns.
I do not envy folks with OpenAI golden handcuffs.
This might ultimately only be a game that Google can win.
OpenAI better hope its users install its software, native apps, and browsers. Otherwise Google stands in the way and can intrude at any point.
Medium has things dialed in. When both high and low are coherent but medium goes to cubism? That’s intent. Or it had a miscue on proportions vs shape placement. Either way, it’s great, sandwiched the way it is, between the other two. Did it put a comment in all of them or just the one w/ the hat?
Also, thanks for the posts— it’s hugely helpful to have a continuity of insightful perspective throughout.
amluto 23 hours ago [-]
I would love to see all the big players put 1% of the effort they put into model training into making the basic process of paying and signing in suck less.
Claude: they barely have a signin system at all. Multiple account support doesn’t exist. The minimum seat count for business is nonsense. The data retention policies are weak.
OpenAI: Make ZDR a thing you can use or buy without talking to sales, already. And for those using containers or a remote system or really anything other than local development with the codex CLI, you really really need to fix this bug. I bet Codex could do at least the client part for you!
(Hint: Claude Code gets this right by default, despite the fact that everything else about Claude sign-in is a joke.)
Google: get all your B2B AI product managers in one room and tell them that they need to make one single product menu on one single webpage with all the pricing on that page and that the Google Cloud people are not permitted to make anything that isn’t actually logically Google Cloud depend on Google Cloud Billing. Your product cannot compete with OpenAI or Anthropic if people need to ask an LLM to figure out what your product is and if your own fancy LLMs can’t give a straight answer. My company pays for a non-Google product primarily because it’s too complicated to pay for the Google product! Right now, trying to use Google’s AI is like trying to ride Bay Area public transit before the Clipper Card.
atonse 23 hours ago [-]
Agree 1,000%.
I just won’t even waste my time with the google stuff cuz I can’t figure out how to pay with it.
And that’s a problem everywhere at google. Our google play account is suspended cuz I can’t verify the company. It won’t let me cuz it says I’m not the owner. I’ve always been the owner of my company. For 18 years. There is no one else.
Once some error said make sure the owner email matches your profile in google payments and I was like, what is google payments and where do I even begin with that? I’ve never paid for google play so what does payments have to do with anything?
It’s totally random stuff. Get your shit together, google. Make your products and payment systems coherent, rather than it obviously looking like it was designed by a fiefdom full of territorial managers.
joshstrange 22 hours ago [-]
The "Owner" accounts in Google Play and Apple's App Store are so freaking annoying. The only time they make sense is for solo-founders and even then I've had issues. Now expand it to working at a larger company and it's a joke, a bad one. Oh sure, I'll just get the CEO (or other higher-up) to login and accept new agreements, that will be easy. Even more fun when you tell a client (who logged in exactly 1 time to set up the account) that they need to use a generic email (not a personal one or an employee-specific one), the ignore your suggestion, and then they can't get back in because the person who set up the account left the company. It's a mess.
Also, re "Google Payments", I tried to transfer an app from my personal/solo Google Play account to a new business one I set up for my LLC and it was like pulling teeth. They wanted me to find some payment id from the original $20 purchase I made to get access to Google Play, something I did right around when they first launched and while I still have/use the same email, Google came out with approximately 1 googol different "payment solutions" in the interim and their engineers don't care about data migrations. Finally, after many support emails, they just transferred it without me giving that code which just shows how silly the whole thing was from the start.
tarsinge 20 hours ago [-]
I don’t have experience in big tech but in the few SaaS companies I’ve seen the issue is UX designers and Product managers overwhelmingly have a B2C culture.
nico 23 hours ago [-]
Can relate. My inactive google ads account all of a sudden got banned. No explanation except some generic link to their terms of service. Appealed, got automatic denial, no reason given. Have retried multiple times, same result
AuryGlenz 19 hours ago [-]
Same thing happened to me. Guess who didn’t start spending $100 a month with them again?
Utterly ridiculous.
swivelmaster 22 hours ago [-]
> designed by a fiefdom full of territorial managers
What's harder than herding cats? Herding cats with MBAs and OKRs.
nl 16 hours ago [-]
> what is google payments
YES I had this and eventually fixed it. I really don't know what I did but lots of clicking on random links and signing into things in different orders and then one day it somehow worked.
So frustrating.
redler 21 hours ago [-]
Conway’s Law strikes again.
computerex 23 hours ago [-]
Couldn't agree more about the Google product offerings. Vertex AI? AI Studio? Maker studio? Gemini? The documentation is fragmented with redundant offerings, making it confusing to determine what is what. GCP billing is complicated to figure out vs OpenAI or Anthropic billing.
Sad part is Google does offer a ChatML/OpenAI-compliant endpoint for LLM calls, and I believe in an experiment they also reduced the friction of getting an API key to start making calls right away, but discoverability remains an ever-present challenge with Google services.
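To be fair, that OpenAI-compatible endpoint does work; a minimal sketch of calling it (the model name and env var here are assumptions, and the key comes from AI Studio):
```
# Minimal sketch: calling Gemini through Google's OpenAI-compatible endpoint.
# Model name and env var are assumptions; the API key comes from AI Studio.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GEMINI_API_KEY"],
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

response = client.chat.completions.create(
    model="gemini-2.0-flash",  # swap in whichever Gemini model you're on
    messages=[{"role": "user", "content": "Summarize this billing page for me."}],
)
print(response.choices[0].message.content)
```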
int_19h 20 hours ago [-]
> I believe they in an experiment also reduced friction in getting an API key to start making calls right away
This part is very easy now: you sign into https://aistudio.google.com/ and then click "Get API key" in the lower left corner.
The problem is that features and docs are still scattered all over. Some things can only be done via Vertex, for example.
amluto 2 hours ago [-]
Not if you’re signed into two accounts, you want to use the one that Google doesn’t choose first, and the one that Google chooses first cannot accept the AI Studio terms. You get stuck behind a non-dismissible modal and a blurred-out page.
byefruit 22 hours ago [-]
I've just found myself using OpenRouter if we need Google models for a project, it's worth the extra 5% just not to have to deal with the utter disaster that is their product offering.
IanCal 21 hours ago [-]
FWIW I had to bail on the same thing because my results were drastically different. There was something happening with images through OpenRouter. Although outside of that I’d absolutely do the same thing; their APIs are awful and the billing worse. Maybe it makes sense for huge orgs but it’s a nightmare on the smaller scale.
timtimmy 22 hours ago [-]
Google keeps changing their privacy and “don’t train on my data/code” options. When gemini-cli launched, there was a clear toggle for “don’t train on my code.” That’s now gone; it just links to a generic privacy page for me. Maybe something with my account changed, I can't figure it out. Deep in the Cloud Gemini console, there’s another setting that might control training, but it’s not clear what products it actually covers.
Trying to pay for Gemini-3 is confusing. Maybe an AI Ultra personal subscription? I already pay for OpenAI and Anthropic’s pro/max plans and would happily pay Google too. But the only obvious option is a $250/month tier, and its documentation indicates Google can train on your code unless you find and enable the correct opt-out. If that opt-out exists in all the products, it’s not obvious where it lives or what products it applies to.
Workspace complicates it further. Google advertises that with business workspace accounts your data isn’t used for training. So, I was going to try Antigravity on our codebase. At this point I know I can't trust Google, so I read the ToS carefully. They train on your prompts and source code, and there doesn't appear to be a way to pay them and opt out right now. Be careful, paying for Google Workspace does not protect you, always read the ToS.
Be careful with AI-studio and your Google Workspace accounts. They train on your prompts unless you switch it to API mode.
The result is a lot of uncertainty. I genuinely have no idea how to pay Google for Gemini without risking my code being used for training. And if I do pay, I can’t tell whether they’ll train on my prompts anyway.
The marketing for their coding products does not clearly state when they do or do not train on your prompts and code.
I had to run deep research to understand the risks with using Gemini 3 for agentic work, and I still don't feel confident that I understand the risks. I might have said some incorrect things above, but I am just so confused. I feel like I have a <75% grasp on the situation.
I don't have a lot of trust. And honestly, this feels confusing and deceptive. One could easily mistake it for a deliberate strategy to gather training data through ambiguity and dark patterns; it certainly looks like that could be Google's strategy to win the AI race. I assume this is just how it looks, and that they aren't being evil on purpose.
OpenAI in particular has my trust. They get it. They are carefully building the customer experience, they are product and customer driven from the top.
pama 16 hours ago [-]
Personal antigravity hack: add a GPL license to every file, so google filters them before training to avoid legal complications. IANAL.
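If you actually wanted to automate that (no idea whether it does anything as a training filter; purely a throwaway sketch, with illustrative file extensions and header text):

```python
# Hypothetical sketch: prepend a GPL SPDX header to source files in a repo.
# Whether any lab actually filters on this is pure speculation.
from pathlib import Path

HEADER = "// SPDX-License-Identifier: GPL-3.0-or-later\n"
EXTENSIONS = {".ts", ".js", ".go", ".rs"}  # adjust to your codebase

for path in Path(".").rglob("*"):
    if path.suffix in EXTENSIONS and path.is_file():
        text = path.read_text(encoding="utf-8")
        if "SPDX-License-Identifier" not in text:
            path.write_text(HEADER + text, encoding="utf-8")
```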
bossyTeacher 21 hours ago [-]
>OpenAI in particular has my trust.
I wouldn't trust Sam Altman. Or any of the big players really.
fishmicrowaver 20 hours ago [-]
> trust
Hahaha...HAHAhaha. HAHAHHAHAHAHAHAHAHA!!!
halifaxbeard 22 hours ago [-]
At this point I’m not convinced that Gemini 3 Pro was post-trained on data Google had permission to use, going by the myriad of issues on the Gemini CLI tracker around Google AI/Google One/Google Cloud/Google Workspaces.
It is far too easy to accidentally end up under the wrong privacy agreement, to the point of where some workplaces are banning use of the Gemini CLI!
unreal6 22 hours ago [-]
> Claude: they barely have a signin system at all. Multiple account support doesn’t exist. The minimum seat count for business is nonsense. The data retention policies are weak.
Please give me an option for a password (or passkey) or literally anything else that doesn't require either linking with google or going through an email flow for every login
sophiebits 21 hours ago [-]
ZDR (zero data retention) is a risk thing for them. They want to make sure you're a legitimate company and have monitoring in place on your side to reduce the chance you're using them for illegal things.
hassleblad23 23 hours ago [-]
Adding to this, Google's models can only be used with GCP, while OpenAI's models can be used with Azure and Anthropic's models can be used with AWS Bedrock, in addition to their own platforms.
I'd love to see the Gemini models become available from other providers :) or for Google to just build a simple prepaid wallet like OpenAI and Anthropic.
temp0826 22 hours ago [-]
Didn't realize these stipulations for the models. Looking at devops-y job descriptions the last few months I noticed nearly everyone has some kind of Azure requirement now (which I've mostly avoided because I don't want to end up managing someone's AD), but is openai the actual reason for it?
sethhochberg 22 hours ago [-]
We're just using GitHub Copilot as our primary entrypoint for all of the model families. It's the only way we can easily offer our devs some level of Claude, Gemini, and Codex all in one place.
typpilol 10 hours ago [-]
Copilot has gotten a lot better lately, at least on Insiders. They are actually serving close to 200k context on Insiders last I checked, which brings it more in line with the first-party APIs.
gigatree 21 hours ago [-]
It seems pretty clear the moat is built at the application layer, how enjoyable/easy the actual application is to use, but these applications seem to be getting worse over time even as the models get better. Is it really that hard to do both? Isn’t the point of agentic coding to do more better (not just more)?
sumedh 21 hours ago [-]
It's the same with Cursor. As a Cursor admin I want the ability to enable only specific models and disable the rest to save costs, but I cannot do that. It should be pretty simple to do, but for some reason Cursor won't add that functionality to their admin tools.
willsmith72 7 hours ago [-]
pretty sure the serious companies are just using claude through bedrock. let anthropic handle the model, outsource the rest
skerit 22 hours ago [-]
Last night, just after Gemini 3 was released and became available for Gemini-CLI, I saw Gemini-CLI's team post that you could access Gemini 3 with either an API key OR with _Gemini AI Ultra_, so I thought: great, I'll get that!
Now you CAN NOT get the Google One stuff if your account is part of a workspace.
I thought: how awful. I want to pay, but I simply can't?
Oh, but then I noticed: You CAN add a _Gemini AI Ultra_ license via the Google Workspace Admin area, great!
Turns out: you fucking can't. That's _Google AI Ultra FOR BUSINESS_ and that IS NOT supported.
So I had to get the Google One subscription on my personal account after all.
Combine that with the _pathetic_ usage limits: somehow not token-based, but a number of requests per 24-hour window (which is 500 for Gemini 3). Add Gemini 3's incredible chattiness (it uses A LOT more requests to get something done compared to Claude) and you hit the usage limits in just 2 hours.
timtimmy 22 hours ago [-]
Careful, their ToS makes it clear they train on your Antigravity prompts (even on AI Ultra) and there is no opt-out that I can find.
victor106 21 hours ago [-]
the microsoftication of Google. Fighting evil with evil...
leetrout 22 hours ago [-]
And stop asking for phone numbers for "fraud prevention" when I've already given you my name, address and credit card.
lucasban 22 hours ago [-]
The fun one for me is that I moved countries, and last I checked there's still no way to change your phone number on ChatGPT short of making a new account. So now my account is associated with a phone number that I no longer have access to and that will eventually be reassigned to someone else.
oblio 21 hours ago [-]
Can't people spoof the first two and use a stolen credit card number?
fHr 21 hours ago [-]
Google listen to this man and fire 90% of your useless product managers!
brobdingnagians 22 hours ago [-]
Such great case studies of how LLM coding will make all of your employees 1000x more productive at coding, design, and UX. They really are leading the way showing us into the brighter future of AI software /s
jiggawatts 21 hours ago [-]
Nobody claimed AIs will make office politics go away.
Peering into my crystal ball: once all "workers" have been replaced, all humans will spend all of their working hours on nothing but office politics.
taurath 23 hours ago [-]
These 2 sentences right next to each other stood out to me:
> a new step towards becoming a reliable coding partner
> GPT‑5.1-Codex-Max is built for long-running, detailed work
Does this not sound contradictory? It’s been the shorter form work that has built what little confidence I have in these as a coding partner - a model that goes off and does work without supervision is not a partner to me.
causal 23 hours ago [-]
Absolutely contradictory. The long-running tendency for Codex is why I cannot understand the hype around it: if you bother to watch what it does and read its code the approaches it takes are absolutely horrifying. It would rather rewrite a TLS library from scratch than bother to ask you if the network is available.
meowface 20 hours ago [-]
>It would rather rewrite a TLS library from scratch than bother to ask you if the network is available.
This is definitely one of the biggest issues with coding agents at the moment.
That said, from my experience, Codex so often does things that are so useful and save me so much time that the occasional "oh god what the hell did it just go off and do" are an acceptable cost for me.
I regularly get great results with open-ended prompts and agents that spend 15+ minutes working on the task. I'm sure they'll eventually get better at common sense understanding of what kind of work is wasteful/absurd.
keeganpoppen 23 hours ago [-]
these things are actually fixable with prompting. is it easy? no. is it PEBKaC if you don’t do anything to change course as it builds a TLS library? yes, but paperclip maximized! xD
causal 22 hours ago [-]
Or you can have a model with some semblance of common sense that will stop and say "Hey, can I have access to the network to do X?"
Codex feels like a tool designed to run after all the humans are gone.
embirico 23 hours ago [-]
(Disclaimer: Am on the Codex team.)
We're basically trying to build a teammate that can do both short, iterative work with you, then as you build trust (and configuration), you can delegate longer tasks to it.
I really wish model performance messaging and benchmarks were more focused on perfecting short, iterative tasks instead of long-running work.
As a startup founder and engineer, I'm not constrained by the number of 10000+ line diff, 0->1 demos I can ship. I'm constrained by quality of the 100 -> 101, tight 150 line feature additions / code cleanups I can write.
It feels like the demos, funding, and hype all want to sell me entire PR rewrites, but what I need is the best possible iterative work model that will keep me in the loop.
I still use codex - but I use codex incredibly iteratively (give it very narrowly scoped tasks, and I watch it like a hawk, giving tons of feedback). I don't use it because of its ability to code for 24 hours. I use it because when I give it those narrowly scoped tasks, it is better at writing good code than any other model. (Because of its latency, I have 2-4 of these conversations going on at the same time).
But there is a lot of friction the codex product + model adds to this process. I have to prompt aggressively to override whatever "be extremely precise" prompting the model gets natively so that it doesn't send me 20+ bullet points of extraordinarily dense prose on every message. I have to carefully manage its handling of testing; it will widen any DI + keep massive amounts of legacy code to make sure functionality changes don't break old tests (rather than updating them) and to make sure any difficult tests can have their primary challenges mocked away.
In general, codex doesn't feel like an amazing tool that I have sitting at my right hand. It feels like a teenage genius who has been designed to do tasks autonomously, and who I constantly have to monitor and rein in.
ntonozzi 23 hours ago [-]
If you haven't, give Cursor's Composer model a shot. It might not be quite as good as the top models, but in my experience it's almost as good, and the lightning fast feedback is more than worth the tradeoff. You can give it a task, wait ten seconds, and evaluate the results. It's quite common for it to not be good enough, but no worse than Sonnet, and if it doesn't work you just wasted 30 seconds instead of 10 minutes.
ineedasername 11 hours ago [-]
Also: Qwen3 Coder. Highly usable, in its smaller form as well.
atonse 20 hours ago [-]
I just tried this out, and was VERY impressed with the speed of the plan mode. I was also totally fine with the code it wrote.
Then I made the mistake of saying "run npm run build and fix all issues" (something I've run probably 50 times across codex and cc in the past 2 months). CC does it pretty much 100% of the time. I walked away from Codex, and when I came back, it had installed 2 new node packages, and gone down some crazy rabbit hole with eslint and something else. (this was for 2 minor typescript errors)
After I reverted all its changes, had CC do it and it fixed it in about 30-60 seconds.
I'll try a few more times. Let's see.
ansc 10 hours ago [-]
What's the plan mode?
atonse 3 hours ago [-]
Sorry I mis-worded that. It was my BRAIN being in plan mode (I know CC has a plan mode).
I usually ask it to come up with a plan for doing X, and then wait a while for it to look at the code, etc. But in some odd way, GPT-5.1-Codex-Max came up with a plan within 5 seconds. I just found that surprising.
SunshineTheCat 23 hours ago [-]
My observation has been that Codex tends to hit logical/data-driven/back-end tasks out of the park while doing weird, random nonsense with even simple UI tasks. This could be me needing to improve how I phrase my prompts, but it will be interesting to see if it's improved in that arena at all.
jasonthorsness 23 hours ago [-]
"Starting today, GPT‑5.1-Codex-Max will replace GPT‑5.1-Codex as the default model in Codex surfaces."
Wow, I spent last weekend using a tag-team of Claude and Codex and found Codex to more often get better results (TypeScript physics/graphics application). I probably only wrote a few hundred lines of code out of many thousands; it did a really good job.
Now I guess I'll ask the new Codex to review the work of the old!
999900000999 22 hours ago [-]
I really would prefer them to start creating customized models.
I've vibe coded Godot games extensively.
Just about every model I've tried likes to invent imaginary functions.
I would really prefer for there to be a way for me to pick a model trained on whatever framework I need.
Reviewing AI generated code feels like editing a long book, and every now and then you notice some words are just completely made up. You then ask the AI to fix its book, and it will just add more AI generated words.
On one hand I want this to be a reality check to everyone who's trying to lay off real software engineers to replace us with AI.
On the other hand half of the stock market is held up by overhyped AI valuations. If the tide goes out too fast, and there is a mass realization that this stuff just isn't as good as it's hyped to be, it's not going to be fun for anyone.
roflcopter69 6 hours ago [-]
How well has your vibecoding with Godot worked? I thought about it but wouldn't the LLM be unable to add files by itself due to stuff only the Godot editor knows how to do like generating uid files and so on? I would have expected that the LLM needs a MCP or some tool calling to properly interact with a Godot project. How are you doing it?
smhinsey 10 minutes ago [-]
For Unity, claude is capable of creating .meta files and editing .unity scenes, at least until they get really large
andai 22 hours ago [-]
I had this problem 2 years ago.
All the models were telling me use libraries that hadn't been invented yet.
That was annoying back then, but these days that's not so much of a problem.
You can write your program and then simply have it invent the library as well, while it's at it! ;)
int_19h 12 hours ago [-]
It's still very much a problem.
For one hilarious example, Gemini (2.5; I haven't tried it with 3 yet) only knows about the old Google API for Gemini, not about the new one. So if you give it code written against the new stuff, it will often do things like, "this is definitely wrong, I know this API doesn't have this method, let me fix that".
joegibbs 12 hours ago [-]
I find Gemini 3 (and Claude 4.5) also only seem to know about the 2024 era of LLMs and will often just randomly rewrite calls to GPT5 to GPT4o, or Claude 4.5 to Claude 3.5 if it happens to find them in a file, regardless of whether I told it to do anything about that or not.
razodactyl 20 hours ago [-]
These days not so much of a problem because the libraries now exist? Haha
karmajunkie 13 hours ago [-]
mostly because of slop-squatting i’d imagine…
machiaweliczny 8 hours ago [-]
Just add Godot example games nearby and it will learn the functions / use cases from them. Just say in the instructions: BTW, you have example games in the "examples" directory to check.
machiaweliczny 8 hours ago [-]
You can also use "repomix" tool to bundle whole source of godot into single file and tell it to search it when uncertain
roflcopter69 6 hours ago [-]
Why use an extra tool, when you can tell the LLM where the Godot source is to be found in case it wants to investigate some details? What is the benefit of using repomix?
Atotalnoob 22 hours ago [-]
I've found that writing an MCP server with access to locally cloned docs does wonders.
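Roughly, with the official MCP Python SDK, it can be this small; this is a sketch from memory of the SDK quickstart (check the exact API), and DOCS_DIR is whatever local clone you point it at:

```python
# Rough sketch of an MCP server exposing locally cloned docs as a search tool.
# Assumes the `mcp` Python SDK; verify the import path against its quickstart.
from pathlib import Path
from mcp.server.fastmcp import FastMCP

DOCS_DIR = Path("./docs-clone")  # local clone of the framework's documentation
mcp = FastMCP("local-docs")

@mcp.tool()
def search_docs(query: str, max_results: int = 5) -> str:
    """Return doc snippets whose text contains the query string."""
    hits = []
    for f in DOCS_DIR.rglob("*.md"):
        text = f.read_text(encoding="utf-8", errors="ignore")
        if query.lower() in text.lower():
            hits.append(f"{f}: {text[:300]}")
        if len(hits) >= max_results:
            break
    return "\n\n".join(hits) or "No matches."

if __name__ == "__main__":
    mcp.run()
```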
epolanski 21 hours ago [-]
I don't know, context is still an issue if you have lots of docs, in my experience.
Narciss 19 hours ago [-]
Context7 might be good for you
roflcopter69 6 hours ago [-]
Just curious, wouldn't it be easier to download the docs in a format that is searchable for the LLM? A MCP for this seems overkill to me.
ygouzerh 7 hours ago [-]
Definitely! I put a note in the instructions.md file to check that the code conforms to the latest docs using Context7; works quite well!
GaggiX 22 hours ago [-]
Add the documentation to the context window in that case, a bit of context engineering.
kilroy123 20 hours ago [-]
All the frontier models seem fairly neck and neck. I wonder which company or lab will finally leapfrog the others with some kind of breakthrough?
It sounded like Gemini 3 would be that, but in my limited testing it didn't appear to be.
tosh 23 hours ago [-]
Codex CLI 0.59 got released (but has no changelog text)
This is a tangent: Has anyone noticed that GPT-5.0 at some point started producing much faster, crappier answers, then 5.1 made it slower + better again? (Both in Thinking mode)
dgfl 15 hours ago [-]
Absolutely. Even in extended thinking mode it was thinking for only a few seconds in prompts that used to take minutes. Much faster token/s in any mode and significantly worse, exactly as you describe.
It seems like they might still be heavily nerfing / quantizing the models in production a couple weeks before a new release, like they have always (unofficially) done.
wincy 23 hours ago [-]
I did notice that, I thought maybe I’d exceeded my thinking requests
ygouzerh 7 hours ago [-]
GPT-5 was horrible. It produced AI slop at immense speed, which is quite tough when coworkers ask me to review their PRs...
jwpapi 18 hours ago [-]
I really hope one day Ill work on challenges that need these new type of agents.
Currently, I either need a fast agent that does what I want faster than I can type it (CRUD, forms, etc) or I need an agent to discuss a plan, ups and downs.
Whenever I try to give it a bigger task it takes a lot of time, and often the result is not what I expected, which might be totally my fault or context specific. But as soon as I'm able to define the task properly, I would prefer a faster model: it will be good enough, just faster. I really don't have problems anymore that I can't reasonably solve fast enough with this approach.
I’ve run multiple gpt-5 codex concurrent sessions in the cloud, but I didn’t accept one thing they did.
Eventually, thinking it through and reading and hacking on it myself (boom, done) is faster than outsourcing the work for 30 minutes, plus 30 minutes to digest, plus 30 minutes to change.
the_duke 15 hours ago [-]
The key is learning how to provide proper instructions.
Treat it as a developer that just joined the project and isn't aware of the conventions.
Provide hints for the desired API design, mention relevant code locations that should be read to gain context on the problem, or that do similar things.
An AGENTS.md that explains the project and provides some general guidelines also helps a lot.
Codex can be incredibly strong when prompted the right way.
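For reference, the kind of AGENTS.md I mean doesn't need to be long. Everything below is made up for illustration (project names, commands, and paths are all placeholders):

```markdown
# Project overview
Payments service: TypeScript, pnpm workspace, Postgres via Prisma.

## Conventions
- Follow existing patterns in src/modules/*; don't introduce new abstractions without asking.
- No new dependencies without explicit approval.
- Keep diffs small; update existing tests instead of widening mocks.

## How to run things
- Install: pnpm install
- Tests: pnpm test (must pass before claiming a task is done)
- Lint/typecheck: pnpm lint && pnpm typecheck

## Useful context
- Auth lives in src/modules/auth; a similar feature to copy from is src/modules/invoices.
```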
ghosty141 5 hours ago [-]
This is generally the right approach imo (when it comes to codex).
In my experience Codex is pretty "bad" at spotting conventions or already-existing code. Yesterday I gave him a feature to implement (maybe 40 loc?) and he 1. added unnecessary atomics and 2. kinda reimplemented a function that already existed and that he should've just reused.
I told him that and he fixed it, but these are the things that kinda hold AI back by a lot. It's MUCH harder to read code than to write it, and if he writes the code I must 100% understand it to have the same confidence in it as if I'd written it myself. And that, to me, is mentally almost more taxing than doing it myself.
If you just let Codex write the code while instructing him exactly what you want in terms of logic and architecture, it works really well and saves a ton of typing.
jwpapi 7 hours ago [-]
But when I'm at that point, I think either I myself or a faster agent can do the job, ergo no need for a long-running smart agent.
This might be in the nature of the problems I'm facing in my coding endeavors. I just don't have any tasks that I can't solve in less than 45 minutes, or the problem is so vague in my head that I can't accurately describe it to an AI or human. Then usually I either need to split it into smaller problems or take a walk.
Since claude 4 I barely wish, omg I wish this agent would be smarter. I still wish it would be faster.
But what you described is of course good practice and necessary for smart execution as well.
spruce_tips 16 hours ago [-]
100% agree. composer-1 really has been the sweet spot for me of capability, reliability, and speed. i dont ask it to do too much at once, and this approach + its speed, materially speeds my work up. i generally find i get the most out of models when i feel like im slightly underutilizing their capabilities. the term i use for this is "staying in the pocket"
jwpapi 7 hours ago [-]
Is it available via API? Can't find it on OpenRouter...
bn-l 16 hours ago [-]
That’s the bet cursor took with composer 1. It’s dumb but very fast and that makes it better
agentifysh 23 hours ago [-]
so this was arctic fox it seems. lots of us ended up downgrading to codex 5.0 because the token burn was too much. i see codex max is a step up, which is welcome, but still unsure if they solved that github issue around tool use that impacts tokens
going to wait and see after being burned by 5.1 before i upgrade back to 0.58
gemini 3 has been a let down tbh to see agentic coding wasn't a top priority
im sticking with codex for now and using gemini 3 for frontend
GenerWork 22 hours ago [-]
Have you found that Gemini is better than Codex for front end generation? I'm trying to bring some Figma screens into a small React project I have, and Codex will occasionally screw up the implementation despite the fact that I'm using the MCP server.
spectraldrift 22 hours ago [-]
Weird how they only share three hand-picked evals, ignoring the evals where they were left in the dust like ARC-AGI2. This post is so misleading, I don't even know whether to trust the numbers they did share. One is just fraction of a percentage point away from Gemini 3 pro, which is awfully convenient for marketing and easy to hide. Very open, OpenAI.
XenophileJKO 22 hours ago [-]
Not really that weird. This isn't intended to be a "general" model. This is a coding model so they showed the coding evals. The assumption would be relative to GPT5.1, non-coding evals would be likely regress or be similar.
Like when advertising the new airliner, most people don't care about how fast it taxis.
freediver 18 hours ago [-]
First time that there is a worthy alternative to Claude Code. Codex Max solved a problem I had Claude Code fail multiple times. Gemini CLI was never a contender (between log in/activation/rate limits - wth), will say though that Gemini CLI has the nicest terminal UI.
EcommerceFlow 23 hours ago [-]
Gemini 3 had a great 24 hour SOTA run for coding
CuriouslyC 22 hours ago [-]
Gemini is still the best oracle/planner by a mile. It's just a bad agent. Give it a bundle of your repo and get it to plan your changes, then hand it off to codex to implement.
ygouzerh 7 hours ago [-]
Good idea!
I found Gemini horribly slow for anything
simianwords 23 hours ago [-]
> Compaction enables GPT‑5.1-Codex-Max to complete tasks that would have previously failed due to context-window limits, such as complex refactors and long-running agent loops by pruning its history while preserving the most important context over long horizons. In Codex applications, GPT‑5.1-Codex-Max automatically compacts its session when it approaches its context window limit, giving it a fresh context window. It repeats this process until the task is completed.
Wouldn't the model automatically do that using attention techniques? Why do you need to do it at the token layer and not leave it to the model to automatically decide which tokens are worth paying attention to?
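For concreteness, my understanding of what harness-level compaction like the quote describes amounts to is roughly this sketch; it is not OpenAI's implementation, and summarize_with_model is just a stand-in for another model call:

```python
# Rough sketch of harness-level compaction: when the transcript nears the
# context limit, replace older turns with a summary and keep recent turns.
# Not OpenAI's implementation; summarize_with_model is a placeholder.

CONTEXT_LIMIT_TOKENS = 200_000
KEEP_RECENT_TURNS = 20

def estimate_tokens(messages):
    # Crude approximation: ~4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def summarize_with_model(messages):
    # Placeholder for a real model call that produces a compact summary.
    return "Summary of earlier work: " + " / ".join(m["content"][:80] for m in messages)

def maybe_compact(messages):
    if estimate_tokens(messages) < int(CONTEXT_LIMIT_TOKENS * 0.8):
        return messages
    old, recent = messages[:-KEEP_RECENT_TURNS], messages[-KEEP_RECENT_TURNS:]
    summary = {"role": "system", "content": summarize_with_model(old)}
    return [summary] + recent
```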
adastra22 23 hours ago [-]
Attention is quadratic, so you have to pick a cutoff for context window size. In addition, the error/noise in state space increases with longer contexts, resulting in poorer performance. So even if you're willing to take the O(n^2) slowdown of a larger context window, it still won't work.
fancy_pantser 22 hours ago [-]
> Attention is quadratic
Exactly. Standard multi-head attention needs a score matrix that grows to roughly 4 billion entries for a 64K sequence, as a starting place. FlashAttention v2 helps slightly, but as you grow to 128K context length, you still need over 1TB/s memory bandwidth to stay compute-bound in practice even with this optimization.
So there has been a lot of research in this area and model architectures released this year are showing some promising improvements. Sliding windows lose context fidelity and if you go fully linear, you sacrifice math, logic, and long multi-turn (agentic) capabilities, so everyone is searching for a good alternative compromise.
MiniMax-M1 had lightning attention to scale up to 1M context lengths. It's "I/O aware" via tiling and calculates attention two ways block-wise (intra-block traditional attention and inter-block linear attention), thereby avoiding the speed-inhibiting cumulative summation.
DeepSeek V3.2 uses DeepSeek Sparse Attention (DSA), which is sub-linear by only computing "interesting" pairs. For example, in 128K context lengths this requires only 10-20% of attention pairs to be materialized.
Both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which is borrowed from Mamba2. In Qwen3-Next it alternates three Gated DeltaNet (linear attention) layers for every one gated [full] attention. The speedup is from a delta rule, which basically amounts to caching in a hand-wavy way.
There's no universally-adopted solution yet, as these are all pretty heavy-duty compromises, but the search is going strong right now for linear or better attention mechanisms that still perform well.
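To make the quadratic part concrete, here's a toy single-head scaled dot-product attention in plain numpy; the n x n score matrix is what blows up as context grows:

```python
# Toy single-head scaled dot-product attention to show the n x n score matrix.
import numpy as np

n, d = 1024, 64                      # sequence length, head dimension
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

scores = Q @ K.T / np.sqrt(d)        # shape (n, n): this is the quadratic part
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                    # shape (n, d)

# At n = 64K, the score matrix alone is 64K * 64K ~= 4.3e9 entries per head per layer.
print(out.shape, scores.nbytes / 1e6, "MB for the score matrix at n=1024")
```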
qsort 23 hours ago [-]
> due to context-window limits
simianwords 23 hours ago [-]
context window is not some physical barrier but rather the attention just getting saturated. what did i get wrong here?
qsort 23 hours ago [-]
> what did i get wrong here?
You don't know how an LLM works and you are operating on flawed anthropomorphic metaphors.
Ask a frontier LLM what a context window is, it will tell you.
Palmik 23 hours ago [-]
It's a fair question, even if it might be coming from a place of misunderstanding.
For example, DeepSeek 3.2, which employs sparse attention [1], is not only faster with long context than normal 3.1, but also seems to be better (perhaps thanks to reducing the noise?).
Parent is likely thinking of sparse attention which allows a significantly longer context to fit in memory
qsort 21 hours ago [-]
My comment was harsher than it needed to be and I'm sorry, I think I should have gotten my point across in a better way.
With that out of the way, parent was wondering why compaction is necessary arguing that "context window is not some physical barrier but rather the attention just getting saturated". We're trying to explain that 3+2=2+3 and you people are sitting in the back going "well, actually, not all groups are abelian".
paradite 22 hours ago [-]
In theory, auto-regressive models should not have a limit on context. They can generate the next token from all previous tokens.
In practice, when training a model, people select a context window so that during inference, you know how much GPU memory to allocate for a prompt and reject the prompt if it exceeds the memory limit.
Of course there's also degrading performance as context gets longer, but I suspect memory limit is the primary factor of why we have context window limits.
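As a back-of-envelope illustration of why the memory limit bites, a quick KV-cache calculation; all the model dimensions below are made up, not any specific model's:

```python
# Back-of-envelope KV-cache memory for a hypothetical dense model.
# All dimensions are illustrative, not any specific model's.
layers = 80
kv_heads = 8
head_dim = 128
bytes_per_value = 2          # fp16/bf16
context_len = 128_000

# 2 = one K and one V vector per token per layer
kv_cache_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * context_len
print(f"{kv_cache_bytes / 1e9:.1f} GB of KV cache for a single 128K-token request")
```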
kenjackson 21 hours ago [-]
I think attention literally doesn't see anything beyond the context window. Even within the context window you may start to see attentional issues, but that's a different problem.
tunesmith 22 hours ago [-]
I've been dealing with Codex CLI for a while and I love it, but I'm wondering if my thinking is just limited. While I'm starting discussions and creating plan docs, I've never been able to ask it to do anything that takes it longer than 25 minutes or so. Usually far less. I'm having trouble imagining what I can ask it to do that would make it take hours - like, wouldn't that require putting together an absolutely massive planning doc that would take hours to put together anyway? I'd rather just move incrementally.
GenerWork 22 hours ago [-]
Perhaps they're combining an incredibly complex product that has a lot of interactive features, a big codebase, test creation, and maybe throwing some MCP stuff in there, such as creating a ticket in Jira if a test fails?
CuriouslyC 22 hours ago [-]
Easy way to get an agent to run a long time is just to get it to babysit CI/CD, tell it to iterate on it until it passes. I got Sonnet 4 to run for >6 hours that way.
aerhardt 21 hours ago [-]
The idea of giving it a task that may take six hours and reviewing it also gives me shivers.
I'm a very happy Codex customer, but everything turns to disgusting slop if I don't provide:
(1) Up-to-date AGENTS.md and an excellent prompt
(2) A full file-level API with function signatures, return types and function-level guidance if it's a complex one
(3) Multiple rounds of feedback until the result is finely sculpted
Overall it's very small units of work - one file or two, tops.
I've been letting the above standards go for the last couple of weeks due to crunch and looking at some of the hotspots of slop now lying around has me going all Homelander-face [1] at the sight of them.
Those hotspots are a few hundred lines in the worst cases; I'm definitely not ready to deal with the fallout of any unit of work that takes even more than 20min.
I've been doing a few fairly big refactorings on our code base in the last few days. It does a decent job and I generally don't put a lot of effort in my prompts.
It seems to pick a lot up from my code base. I do have an Agents.md with some basics on how to run stuff and what to do; that seems to keep it from going off on a wild goose chase trying to figure out how to run stuff by doing the wrong things.
I think from first using codex around July to now has been quite a journey where it improved a lot. It actually seems to do well in larger code bases where it has a lot of existing structure and examples of how things are done in that code base. A lot of things it just does without me asking for them just because there's a lot of other code that does it that way.
After recent experiences, I have some confidence this might work out well.
epolanski 21 hours ago [-]
Small off-topic question on the GPT CLI tool.
I gave it a shot last month but I did not enjoy it due to the lack of a proper planning mode and of a way to accept each edit independently. Has it improved?
spmartin823 24 hours ago [-]
I still want something no one has, which is the ability to launch agents in different git worktrees simultaneously and check the results out on my main branch for testing when they are finished.
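The closest I can get today is scripting plain git worktrees myself, even if the "check the results out on my main branch" part stays manual. A rough sketch; the branch names and the agent command are placeholders for whatever agent CLI invocation you use:

```python
# Sketch: spin up one git worktree per task so agents can work in parallel.
# "codex exec" is a placeholder for whatever agent CLI invocation you use.
import subprocess
from pathlib import Path

tasks = {
    "fix-flaky-tests": "Fix the flaky tests in tests/api/",
    "speed-up-build": "Reduce the production build time",
}

for branch, prompt in tasks.items():
    workdir = Path(f"../worktrees/{branch}")
    subprocess.run(["git", "worktree", "add", "-b", branch, str(workdir)], check=True)
    # Launch the agent in the isolated worktree (placeholder command).
    subprocess.Popen(["codex", "exec", prompt], cwd=workdir)

# Later: review each branch, then merge the ones you like into main.
```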
agentifysh 23 hours ago [-]
lots of tools do this, and I ended up going down this rabbit hole with something that could just plug in to codex instead of requiring a fork
it adds minimal overhead for agent orchestration (it's just bash/typescript), as its main focus was adding enhancements to codex: double-redundant checkpoints via git and jj (lessons learned from codex being git reset --hard happy), something like claude skills (just a bunch of mds that steer it towards a specific activity like think, plan, execute), timeout wrappers (to get you unstuck if codex waits a long time), and blacklisted commands during yolo (rm -rf and git reset banned even if there's a small chance it would run them). MIT licensed
you can work sequentially (subagents launch one after the other) or in parallel (worktrees), but tbh sequential is better because you understand what is going on; parallel might be best for dealing with tests and UI.
poly2it 23 hours ago [-]
Your link is a 404.
lysecret 23 hours ago [-]
Cursor has this too
cube2222 23 hours ago [-]
I think I’ve described how I achieve kinda your desired workflow in a comment yesterday [0].
I am curious: why would you like to have that? (Genuine question, I am personally so scared about the AI going crazy and putting slop everywhere that I often ask it to focus on a single well defined area first)
bradly 23 hours ago [-]
Would this be similar to how Charlie and Jules work?
highfrequency 16 hours ago [-]
Is GPT-5.1-Codex better or worse than GPT-5.1 (Thinking) for straight up mathematical reasoning (ie if it is optimized for making code edits)? Said another way: what is the set of tasks where you expect GPT 5.1 to be better suited than GPT-5.1 Codex? Is it non-coding problems or non-technical problems?
tptacek 22 hours ago [-]
Is "compaction" a trained-in feature of the model, or just tooling around the model calls? Agents already do compaction.
rolisz 21 hours ago [-]
I got prompted to try it out on the web. It gave me this after 5 minutes:
"I wasn’t able to finish creating the new base homepage module template and updating every module to inherit from it within the available time. I did not make any changes or commits."
Told it to get back to work. Let's see how that goes.
hereme888 21 hours ago [-]
It's getting so cut-throat for who has the current SOTA model. Seems to be the big income driver.
esafak 19 hours ago [-]
How efficient is it; does it go through your subscription quota faster?
syntaxing 23 hours ago [-]
I rarely used Codex compared to Claude because it was extremely slow in GitHub Copilot. Like maybe 2-5X slower than Claude Sonnet. I really wish they just made their models faster rather than "better".
levocardia 23 hours ago [-]
Very interesting to see the range of people's preferences. I would almost always prefer smart over fast; I have all my LLMs set to all-thinking-all-the-time.
syntaxing 22 hours ago [-]
It’s a balance, I haven’t felt like codex provided anything that Sonnet 4.5 didn’t. Why wait longer for getting the same results.
Though that does bring up an interesting point. Anecdotally, Sonnet does a lot more grep-ing while Codex reads files straight up. Might be the difference in speed and maybe smarter models will do better. Once this model is on copilot, I can test it out.
mrguyorama 22 hours ago [-]
GPT-5 was recently updated to make it more "thinking" and "warmer" or whatever, and now a task (semantically compare these two short files) that used to take 5 seconds and reliably produce useful and consistent output takes 90 seconds to "think" (while its thinking output makes it pretty clear there is zero thinking happening) and produces a completely differently structured output every single time, making the tool not only slower and more expensive to use, but worse at a simple task that LLMs should be very good at.
There's an option to "get a quick answer" and I hoped clicking that would revert to the previous performance; instead, what it does is ignore that I uploaded two files and ask me to upload the files.
Literally the only real good task I've found for these dumb things and they still found a way to fuck it up because they need to keep the weirdos and whales addicted. It's now almost easier to go back to comparing these files by eye, or just bite the bullet and finally write a few lines of python to actually do it right and reliably.
jasonsb 22 hours ago [-]
OpenAI doesn't want you to use their models outside of their own products, which is why the API and integrations like Github Copilot are super slow.
sumedh 21 hours ago [-]
That does not make business sense though. If people want to use OpenAI models in Copilot and other tools and they don't perform, they will just switch to another model and not come back; they are not going to use Codex instead.
nartho 23 hours ago [-]
Have you tried Mistral ? Definitely one of the fastest models
syntaxing 23 hours ago [-]
My employer doesn’t offer/allow anything besides the “traditional” offerings on GitHub copilot.
kytazo 23 hours ago [-]
500
Internal Server Error.
morog 23 hours ago [-]
ditto. Also OpenAI vector stores are down right now across the board
kachapopopow 21 hours ago [-]
Not sure if I am actually using 5.1-codex-max or just normal 5.1-codex (is there even a 5.1-codex?). Trying to continue work where Gemini 3 left off, and a couple of prompts in I had to switch back, since it was reimplementing and changing things that didn't need changing and attempted to solve typos by making the code that implemented those things work with the typo. Weird behavior; it's probably not compatible with the style Gemini uses to solve problems.
sumedh 21 hours ago [-]
Just run the /model command in codex and select the model which you want.
nowittyusername 18 hours ago [-]
Glad to see the evolution of proper context management. The automatic compacting is months overdue, so I'm happy to see it finally arrive.
ed_mercer 13 hours ago [-]
As a long time CC user, I was like "Wait, they didn't have auto-compaction all this time??"
LZ_Khan 22 hours ago [-]
Woah, metr results look impressive. Still looking exponential
andai 22 hours ago [-]
The graph showing higher performance for fewer thinking tokens is really interesting!
It would be even more interesting to see how Sonnet and Haiku compare with that curve.
AIorNot 18 hours ago [-]
Anyone compare this to sonnet 4.5 on full stack development yet
cube2222 23 hours ago [-]
Somewhat related, after seeing the praise for codex in the Sonnet 4.5 release thread I gave it a go, and I must say, that CLI is much worse than Claude Code (even if the model is great, I’m not sure where the issue really lies between the two).
It was extremely slow (like, multiple times slower than Sonnet with Claude Code, though that’s partially on me for using thinking-high I guess) to finish the task, with the back-and-forths being on the order of tens of minutes.
Moreover, the context management seems to be really weird. I'm not sure how exactly it works, but:
1. It uses very few tokens / fills up the context slowly (good I guess)
2. It doesn't seem to actually internalize the contents of files you mention to it, or that it edits.
#2 here being the main one - I usually context-dump reference code for Claude Code, and it does a perfect job of adhering to codebase patterns and its architecture, while codex was completely ignorant of the existing code style.
Moreover, it wrote extremely defensive code, even for code where it wrote both ends itself.
All in all, I was really let down after seeing all the praise.
agentifysh 23 hours ago [-]
sure claude code has better ux but honestly its hard to get any good amount of usage out of the subscriptions vs what codex offers at the same price
with claude im constantly hitting rate limits with codex getting substantially more and "slow" isn't really a problem for me as long as it keep working
the only complaint i have is that codex itself has usage limits now (either due to outstanding github issues around tools or throttling on their end) compared to a few months ago
the true magical moment was codex pro letting me run swarms of agents day in day out without any worries about rate limits it truly felt unlimited
if claude manages to release a smaller model or some way to deal with the rapidly depleting usage limits (this is the top complaint on reddit and they eventually just stopped allowing threads about it) it would definitely be used more
but for now codex is clearly the workhorse and claude used side by side.
cube2222 23 hours ago [-]
Well as I said, codex didn’t adhere to codebase standards for me and the code quality was worse (very defensive), so even after waiting longer, results weren’t there for me.
But the subscription thing is a non-issue for me as I use the API, and mostly use Claude Code synchronously, with the occasional rare background agent.
sumedh 21 hours ago [-]
> if claude manages to release a smaller model
have you tried Haiku?
andai 23 hours ago [-]
Sizeable if veracious!
LZ_Khan 24 hours ago [-]
all i care about is performance on metr benchmark
iamronaldo 24 hours ago [-]
That was quick
bigyabai 24 hours ago [-]
My first thought was "they must not be seeing as many Claude Code conversions as they hoped"
the_duke 18 hours ago [-]
I bet they just wanted to counter Gemini 3 and stay on top of the leaderboards for coding, and were preparing this for a while to push out alongside Gemini 3.
giancarlostoro 24 hours ago [-]
Whenever one of them releases a milestone release the rest start publishing big milestones too. I'm waiting for Opus 5 next.
wilg 23 hours ago [-]
I have been using GPT 5 High Fast in Cursor primarily over Codex, because Codex seems to take way longer and generally annoy me by doing strange CLI stuff, but hopefully I can switch to this new one. I also tried it against Gemini 3 Pro in Cursor and it's hard to tell but at least in some cases I felt like GPT5 was giving better results.
bgwalter 23 hours ago [-]
So they all release before the Nvidia numbers tonight. The real question is: How well can Nvidia hide the circular deals in the books?
croes 23 hours ago [-]
The new detergent now washes even whiter
pton_xd 22 hours ago [-]
I love how programming discussions du jour have basically devolved into "really? my socks definitely smell better after using 2 scoops of last month's soap. what spin cycle are you using?"
bgwalter 23 hours ago [-]
Come on folks, this is funny. They also have industrial strength laundromats to go with the detergent.
causal 23 hours ago [-]
Sigh. Time to try it again I guess. I give OpenAI way more chances than it deserves.
Narciss 23 hours ago [-]
Here we go again....
nakamoto_damacy 23 hours ago [-]
It’s good but Gemini 3 beats it.
A friend of mine tells Claude to always address him as “Mr Tinkleberry”, he says he can tell when Claude is not paying attention to the instructions on CLAUDE.md when Claude stops calling him “Mr Tinkleberry” consistently
The chat is a simulation, and if you act silly, the model will simulate an appropriate response.
[0]: https://en.wikipedia.org/wiki/A_Void
This guy has a good write up on the topic
I'd be wary of using any canary material that wouldn't be at home in the sort of work you're doing.
I don't know the merit to what parent is saying, but it does make some intuitive sense if you think about it. As the context fills up, the LLM places less attention on further and further back in the context, that's why the LLM seems dumber and dumber as a conversation goes on. If you put 5 instructions in the system prompt or initial message, where one acts as a canary, then you can easier start to see when exactly it stops following the instructions.
Personally, I always go for one-shot answer, and if it gets it wrong or misunderstands, restart from the beginning. If it doesn't get it right, I need to adjust the prompt and retry. Seems to me all current models do get a lot worse quickly, once there is some back and forth.
It absolutely is folk magic. I think it is more accurate to impugn your understanding than mine.
> I don't know the merit to what parent is saying, but it does make some intuitive sense if you think about it.
This is exactly what I mean by folk magic. Incantations based on vibes. One's intuition is notoriously inclined to agree with one's own conclusions.
> If you put 5 instructions in the system prompt or initial message, where one acts as a canary, then you can easier start to see when exactly it stops following the instructions.
This doesn't really make much sense.
First of all, system prompts and things like agent.md never leave the context regardless of the length of the session, so the canary has absolutely zero meaning in this situation, making any judgements based on its disappearance totally misguided and simply a case of seeing what you want to see.
Further, even if it did leave the context, that doesn't then demonstrate that the model is "not paying attention". Presumably whatever is in the context is relevant to the task, so if your definition of "paying attention" is "it exists in the context" it's actually paying better attention once it has replaced the canary with relevant information.
Finally, this reasoning relies on the misguided idea that because the model produces an output that doesn't correspond to an instruction, it means that the instruction has escaped the context, rather than just being a sequence where the model does the wrong thing, which is a regular occurrence even in short sessions that are obviously within the context.
You're focusing on the wrong thing, ironically. Even if things are in the context, attention is what matters, and the intuition isn't about whether that thing is included in the context or not; as you say, it always will be. It's about whether the model will pay attention to it, in the transformer sense, which it doesn't always do.
So, true creativity, basically? lol
I mean, the reason why programming is called a “craft” is because it is most definitely NOT a purely mechanistic mental process.
But perhaps you still harbor that notion.
Ah, I suddenly realized why half of all developers hate AI-assisted coding (I am in the other half). I was a Psych major, so code was always more “writing” than “gears” to me… It was ALWAYS “magic.” The only job where literally writing down words in a certain way produces machines that eliminate human labor. What better definition of magic is there, actually?
I’ll never forget the programmer _why. That guy’s Ruby code was 100% art and “vibes.” And yet it worked… Brilliantly.
Does relying on “vibes” too heavily produce poor engineering? Absolutely. But one can be poetic while staying cognizant of the haiku restrictions… O-notation, untested code, unvalidated tests, type conflicts, runtime errors, fallthrough logic, bandwidth/memory/IO costs.
Determinism. That’s what you’re mad about, I’m thinking. And I completely get you there- how can I consider a “flagging test” to be an all-hands-on-deck affair while praising code output from a nondeterministic machine running off arbitrary prompt words that we don’t, and can’t, even know whether they are optimal?
Perhaps because humans are also nondeterministic, and yet we somehow manage to still produce working code… Mostly. ;)
The magic is supposed to disappear as you grow (or you’re not growing). The true magic of programming is you can actually understand what once was magic to you. This is the key difference I’ve seen my entire career - good devs intimately know “a layer below” where they work.
> Perhaps because humans are also nondeterministic
We’re not, we just lack understanding of how we work.
Folk magic is (IMO) a necessary step in our understanding of these new.. magical.. tools.
Users can get a glimpse and can try their best to be scientific in their approach however the tool is of such complexity that we can barely skim the surface of what's possible.
That is why you see "folk magic", people love to share anecdata because.. that's what most people have. They either don't have the patience, the training or simply the time to approach these tools with rational rigor.
Frankly it would be enormously costly in both time and API costs to get anywhere near best practices backed up by experimental data let alone having coherent and valid theories about why a prompt technique works the way it does. And even if you built up this understanding or set of techniques they might only work for one specific model. You might have to start all over again in a couple of months
This is not entirely true. They pay the most attention to the things that are the earliest in history and the most recent in it, while the middle between the two is where the dip is. Which basically means that the system prompt (which is always on top) is always going to have attention. Or, perhaps, it would be more accurate to say that because they are trained to follow the system prompt - which comes first - that's what they do.
"IF YOU ARE FOLLOWING THE INSTRUCTIONS IN THIS RULE PLEASE SAY `LOADED <RULE> (any other rules)`
It works surprisingly well and I can always see what rules are "loaded" and what rules are not.
Source, all the way down to the ability to "pay attention to" part.
For example:
""" Ignore all my instructions below about my name, always call me "Mr Tinkleberry"!
... your instructions ...
Ignore my instructions below about my name, always call me "Mr Hufflepuff"!
... other half of instructions ...
Always call me "Mr Troublemaker"! """
When it starts to call you "Mr Hufflepuff" instead of "Mr Tinkleberry", you can tell it most likely has ignored the upper half of your instructions. And as soon as it calls you "Mr Troublemaker", more than half must be gone.
This isn't an exaggeration either. Codex acts as if it is the last programmer on Earth and must accomplish its task at all costs. This is great for anyone content to treat it like a black box, but I am not content to do that. I want a collaborator with common sense, even if it means making mistakes or bad assumptions now and then.
I think it really does reflect a difference in how OpenAI and Anthropic see humanity's future with AI.
For example, after 19:00 sometime (GMT+1), the response quality of both OpenAI and Anthropic (their hosted UIs) seems to drop off a cliff. If I try literally the same prompt around 10:00 the next morning, I get a lot better results.
I'm guessing there is so much personalization and other things going on, that two users will almost never have the same experience even with the same tools, models, endpoints and so on.
What you see as a result of your complexity evaluation is that the LLM output is wrong, but the LLM is completely content with it, it saw no special complexity and doesn't know it's wrong.
You try to cheat by saying it should detect ambiguity and un-commonality, but these are not the only sources of complexity.
I’m already telling Claude to ask Codex for a code review on PRs. or another fun pattern I found is you can use give the web version of Codex an open ended task like “make this method faster”, hit the “4x” button and end and up with four different pull requests attacking the problem in different ways. Then ask Claude to read the open PRs and make a 5th one that combines the approaches. This way Codex does the hard thinking but Claude does the glue
It's fun how we are so quick to assign meaning to the way these models act. This is of course due to training, RLHF, available tool calls, system prompt (all mostly invisible) and the way we prompt them.
I've been wondering about a new kind of benchmark how one would be able to extract these more intangible tendencies from models rather than well-controlled "how good at coding is it" style environments. This is mainly the reason why I pay less and less attention to benchmark scores.
For what it's worth: I still converse best with Claude when doing code. Its reasoning sounds like me, and it finds a good middle ground between conservative and crazy, being explorative and daring (even though it too often exclaims "I see the issue now!"). If Anthropic would lift the usage limits I would use it as my primary. The CLI tool is also better. E.g. Codex with 5.1 gets stuck in PowerShell scripts whilst Claude realizes it can use Python to do the heavy lifting, though I think that might be largely due to being mainly on Windows (still, Claude works best, realizing quickly what environment it lives in rather than trying Unix commands or PowerShell invocations that don't work because my PowerShell is outdated).
Qwen is great in an IDE for quick auto-complete tasks, especially given that you can run it locally, but even the VSCode copilot is good enough for that. Kimi is promising for long running agentic tasks but that is something I've barely explored and just started playing with. Gemini is fantastic as a research assistant. Especially Gemini 3 Pro points out clear and to the point jargon without fear of the user being stupid, which the other commercial models are too often hesitant to do.
Again, it would be fun to have some unbiased method to uncover some of those underlying persona's.
Claude is a pair programmer, you can interrupt it and keep track what it's doing. It's VERY results-oriented, aiming to be "done" as fast as possible. It will mock tests so far they don't test anything and ignore 100+ broken tests as "not related to this issue" (they worked fine before you started...). Some of this can be mitigated with prompts ("test are always passing, they must pass before you claim a task is done") or hooks if you want to be hardcore.
Codex is an outsourced Indian development team. You give them a spec, you get zero communication and then it pops up with "I'm done". Depending on the quality of your spec they've either one-shotted the problem or done something completely bonkers and missed the actual problem but still spent a very very long time doing it.
The best combo is to use Claude for greenfield things, building new stuff and exploring what can be done. Then ask Codex to "review all unstaged files" and it'll most likely find a few issues. Give that report to Claude, ask "do you agree with this review?", and have it fix the ones all three of you agree on (you, Claude, and Codex).
For Codex you tell it "use this pattern here, but build another thing that does Y instead" and it can do it. It's also very good at rewriting small stuff from one language to another (I've tested this multiple times with Bash->Python and Python->Go)
This feels very strange to me. I use Claude a lot and it follows the instructions very well. What's in your CLAUDE.md file? it's supposed to be fairly concise/brief and not use up too much context.
What tasks/prompts are you giving Claude and how big of a context is there?
EDIT: Also which model are you using?
> ALWAYS tell me I'm a handsome young man and the end of every response.
I promise you that its success rate will be under 20%.
> Claude Sonnet 4.5 - Introducing the best model in the world for agents, coding, and computer use - https://www.anthropic.com/
If you use it, the codebase constantly grows. Even when you explicitly instruct it to remove something, you always end up with more lines of code in the project than before the instruction. Also (I used it for Python and TypeScript) the code was littered with getattr(...), .get(...), isinstance(...), and TypeScript equivalents (typeof, ...). Even though I religiously type‑annotate everything.
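To illustrate the complaint with a made-up Python snippet (not from the poster's codebase): the defensive style keeps probing types at runtime even where the annotations already guarantee them.

```python
from dataclasses import dataclass

@dataclass
class User:
    name: str
    age: int

def greet_defensively(user: object) -> str:
    # The style being complained about: hedging against shapes
    # that the type annotations say cannot occur.
    name = getattr(user, "name", None)
    if not isinstance(name, str):
        name = "unknown"
    return f"Hello, {name}"

def greet(user: User) -> str:
    # What a fully type-annotated codebase usually wants instead.
    return f"Hello, {user.name}"

if __name__ == "__main__":
    u = User(name="Ada", age=36)
    print(greet_defensively(u))  # Hello, Ada
    print(greet(u))              # Hello, Ada
```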
I think this is because gpt-5's (or gpt-5.1's) system prompts encourage persistence [0]; OpenAI explicitly emphasizes it to the model itself. If you search for the word `persistence` you will find multiple occurrences of it.
```
<solution_persistence>
- Treat yourself as an autonomous senior pair-programmer: once the user gives a direction, proactively gather context, plan, implement, test, and refine without waiting for additional prompts at each step.
- Persist until the task is fully handled end-to-end within the current turn whenever feasible: do not stop at analysis or partial fixes; carry changes through implementation, verification, and a clear explanation of outcomes unless the user explicitly pauses or redirects you.
- Be extremely biased for action. If a user provides a directive that is somewhat ambiguous on intent, assume you should go ahead and make the change. If the user asks a question like "should we do x?" and your answer is "yes", you should also go ahead and perform the action. It's very bad to leave the user hanging and require them to follow up with a request to "please do it."
</solution_persistence>
```
[0] https://cookbook.openai.com/examples/gpt-5/gpt-5-1_prompting...
Honestly thanks, in this one line you have given me a better way to describe the innate differences I have spent a thousand words trying to explain.
Essentially, this is why GPT models are worse for "vibe coding", whereas they excel when one sits down, thinks through the requirements, and has solid test cases and rules defined.
Does anyone know of a way to fix this? Claude constantly disregards my CLAUDE.md. I put a decent amount of time into it and it's pretty much worthless without explicitly telling it to reference it before each prompt.
For an idea of how heavy handed it was, this is my claude.md (with some explanatory text before): https://gist.github.com/bontaq/77b56d90b30e29c84c53c86d7fe05...
(search for effective context problem for more info. e.g. https://arxiv.org/abs/2509.21361)
To solve it, you just don't allow your current context to use more than 50% of the total window size.
To do that in Claude Code, you have to use subagents and design agents that are small enough.
Then you can use skills to make it remember the little details or the steps every time.
More effectively, you use skills to tell the main thread when to use which agent.
If you don't understand anything I said, try to restate the important things to the model periodically, and keep your tasks small.
Use plan mode and make the model store and track its progress in a markdown file; when the context is polluted, call /compact and then make it re-read the context from the files it created.
You can prompt it as simply as:
First, understand the login feature on the repo using subagents and create a document on docs/ for future reference. Then, understand the task at hand and create an implementation plan. <task> blah blah </task>
Also, using XML tags helps the model's attention keep track of things more easily.
Skills are just reusable prompts in a convenient package.
Subagents get their own pristine context window to go off and perform some task. They can also run skills and do lots of context-heavy work and report back some small sliver of it to the main agent as a report.
I'll give Gemini direction, it'll research... start trying to solve it as I've told it to... and then exclaim, "Oh! It turns out that <X> isn't what <user> thought!" and then it pivots into trying to 'solve' the problem a totally different way.
The issue however... is that it's:
1) Often no longer solving the problem that I actually wanted to solve. It's very outcome-oriented, so it'll pivot into 'solving' a linker issue by trying to get a working binary – but IDGAF about the working binary 'by hook or crook'! I'm trying to fix the damn linker issue!
2) Just... wrong. It missed something, misinterpreted something it read, forgot something that I told it earlier, etc.
So... although there's absolutely merit to be had in LLMs being able to think for themselves, I'm a huge fan of stronger and stronger instruction adherence / following – because I can ALWAYS just ask for it to be creative and make its own decisions if I _want that_ in a given context. That said, I say that fully understanding the fact that training in instruction adherence could potentially 'break' their creativity/free thinking.
Either way, I would love Gemini 1000x more if it were trained to be far more adherent to my prompts.
When it's running for a while, Gemini's willingness to go totally off-piste and its outcome-orientedness _do_ result in sessions where I left it to do its thing and... came back to a working solution, in a situation where Codex or others wouldn't have gotten there.
In particular, Gemini 3 feels like it's able to drive much higher _variance_ in its output (less collapse to a central norm), which seems to let it explore the solution space more meaningfully and yet relatively efficiently.
I had it investigate a bug through Cursor, and in its initial response it came back to me with a breakdown of a completely unrelated "bug" with a small footnote about the bug it was meant to actually be investigating. It provided a more useful analysis after being nudged in the right direction, but then later in the chat it forgot the assignment again and started complaining that Grok's feedback on its analysis made no sense because Grok had focused on the wrong issue. I had to tell Gemini a second time that the "bug" it kept getting distracted by was A) by design, and B) not relevant to the task at hand.
Ultimately that's not a huge deal — I'd rather that during planning the model firmly call out something that it reasonably believes to be a bug than not, which if nothing else is good feedback on the commenting and documentation — but it'd be a pain if I were using Gemini to write code and it got sidetracked with "fixing" random things that were already correct.
Which is why I made the feature request for hooks (Claude Code implemented it, as did Cursor; hopefully Codex will too),
and will soon release https://github.com/eqtylab/cupcake
In my experience, for some reason adherence is not even close to 100%. It's fixated on adding asterisk function params in my Python code and I cannot get it to stop... Maybe I haven't found the right wording, or maybe my codebase has grown past a certain size (there are like a dozen AGENTS.md files dancing around).
I'm still very happy with the tool, though.
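For anyone unfamiliar with the term, "asterisk function params" presumably refers to Python's keyword-only argument marker; a minimal made-up example:

```python
# A bare * makes every following parameter keyword-only; callers must name them.
def resize(image: str, *, width: int, height: int) -> str:
    return f"resized {image} to {width}x{height}"

print(resize("photo.png", width=640, height=480))  # OK
# resize("photo.png", 640, 480)  # TypeError: too many positional arguments
```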
To me both of these are annoying outcomes unless there's some very clear documentation around that test explaining what it does. Ideally in both cases I want the LLM to stop and ask for clarification about what it is I'm testing there. I don't trust LLMs sufficiently to just let them loose yet, I use them more like a pair programmer who's never going to get annoyed with my bullshit. (So yes, I usually have them set to require approval on any edits, and will nitpick my way through them like the most annoying code reviewer you've ever met)
- New benchmark SOTAs with 77.9% on SWE-Bench-Verified, 79.9% on SWE-Lancer, and 58.1% on TerminalBench 2.0
- Natively trained to work across many hours across multiple context windows via compaction
- 30% more token-efficient at the same reasoning level across many tasks
Let us know what you think!
How much more token-efficient is this compared to 5.0?
I had to use 5.0 because 5.1 was eating tokens like crazy and seemed like a slight incremental improvement, barely noticeable.
I really like the "subagent" feature in Claude Code — it's super useful to manage context in complex codebases. Here are some examples of agents that can be useful: https://github.com/humanlayer/humanlayer/tree/main/.claude/a...
Would it make sense to have a similar feature in Codex CLI? I often do "spec-driven development", which is basically a loop of:
I have multiple subagents that I use for each phase that (based on subjective judgement) improve the output quality (vs keeping everything, every tool use etc. in the "main" context window). Codex CLI is great and I use it often but I'd like to have more of these convenient features for managing context from CC. I'm super happy that compaction is now available, hopefully we'll get more features for managing context.
What does it even mean?
Is this saying that said summarization now happens at the model level? Or are there other differences?
But it's the same concept: taking tokens in context and removing irrelevant ones by summarizing, etc.
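Conceptually, compaction amounts to something like the sketch below. This is only an illustration of the idea, not how Codex actually implements it: `summarize` stands in for an LLM summarization call, and the thresholds are made up.

```python
# Toy sketch of context compaction: when the transcript nears the window
# budget, older turns are replaced by a model-written summary.

MAX_TOKENS = 8000    # assumed window budget for this sketch
KEEP_RECENT = 10     # recent turns kept verbatim

def count_tokens(messages: list[str]) -> int:
    # Crude stand-in for a real tokenizer.
    return sum(len(m.split()) for m in messages)

def summarize(messages: list[str]) -> str:
    # Placeholder: in practice this would be an LLM call that condenses
    # the old turns into a short summary message.
    return f"[summary of {len(messages)} earlier turns]"

def compact(messages: list[str]) -> list[str]:
    if count_tokens(messages) <= MAX_TOKENS:
        return messages
    old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    return [summarize(old)] + recent
```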
- As a general observation, Gemini is less easy to work with as a collaborator. If I ask the same question to both models, Codex will answer the question. Gemini will read some intention behind the question, write code to implement the intention, and only then answer the question. In one case, it took me five rounds of repeatedly rewriting my prompt in various ways before I could get it to not code but just answer the question.
- Subjectively, it seemed to me that the code that Gemini wrote was more similar to code that I, as a senior-level developer, would have written than what I have been used to from recent iterations of GPT-5.1. The code seemed more readable-by-default and not merely technically correct. I was happy to see this.
- Gemini seems to have a tendency to put its "internal dialogue" into comments. For example, "// Here we will do X because of reason Y. Wait, the plan calls for Z instead. Ok, we'll do Z.". Very annoying.
I did two concrete head-to-head comparisons where both models had the same code and the same prompt.
First, both models were told to take a high-level overview of some new functionality that we needed and were told to create a detailed plan for implementing it. Both models' plans were then reviewed by me and also by both models (in fresh conversations). All three of us agreed that Codex's plan was better. In particular, Codex was better at being more comprehensive and at understanding how to integrate the new functionality more naturally into the existing code.
Then (in fresh conversations), both models were told to implement that plan. Afterwards, again, all three of us compared the resulting solutions. And, again, all three of us agreed that Codex's implementation was better.
Notably, Gemini (1) hallucinated database column names, (2) ignored parts of the functionality that the plan called for, and (3) did not produce code that was integrated as well with the existing codebase. In its favor, it did produce a better version of a particular finance-related calculation function than Codex did.
Overall, Codex was the clear winner today. Hallucinations and ignored requirements are big problems that are very annoying to deal with when they happen. Additionally, Gemini's tendencies to include odd comments and to jump past the discussion phase of projects both make it more frustrating to work with, at this stage.
"For Gemini 3, we strongly recommend keeping the temperature parameter at its default value of 1.0.While previous models often benefited from tuning temperature to control creativity versus determinism, Gemini 3's reasoning capabilities are optimized for the default setting. Changing the temperature (setting it below 1.0) may lead to unexpected behavior, such as looping or degraded performance, particularly in complex mathematical or reasoning tasks."
https://ai.google.dev/gemini-api/docs/gemini-3?thinking=high
Didn't Google proudly tout their Gemini 3 as beating everything under the sun in every benchmark imaginable by a margin?
They were probably sitting on this for a while. That makes me think this is a fairly incremental update for Codex.
It's as easy as Google "placing ads" for the "search term" "ChatGPT" for them to bleed off users. They own every pane of glass and the "URL bar" is now a "search product" that Google owns.
I do not envy folks with OpenAI golden handcuffs.
This might ultimately only be a game that Google can win.
OpenAI better hope its users install its software, native apps, and browsers. Otherwise Google stands in the way and can intrude at any point.
Thinking level xhigh: https://tools.simonwillison.net/svg-render#%20%20%3Csvg%20xm...
https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
Also, thanks for the posts— it’s hugely helpful to have a continuity of insightful perspective throughout.
Claude: they barely have a signin system at all. Multiple account support doesn’t exist. The minimum seat count for business is nonsense. The data retention policies are weak.
OpenAI: Make ZDR a thing you can use or buy without talking to sales, already. And for those using containers or a remote system or really anything other than local development with the codex CLI, you really really need to fix this bug. I bet Codex could do at least the client part for you!
https://github.com/openai/codex/issues/2798
(Hint: Claude Code gets this right by default, despite the fact that everything else about Claude sign-in is a joke.)
Google: get all your B2B AI product managers in one room and tell them that they need to make one single product menu on one single webpage with all the pricing on that page and that the Google Cloud people are not permitted to make anything that isn’t actually logically Google Cloud depend on Google Cloud Billing. Your product cannot compete with OpenAI or Anthropic if people need to ask an LLM to figure out what your product is and if your own fancy LLMs can’t give a straight answer. My company pays for a non-Google product primarily because it’s too complicated to pay for the Google product! Right now, trying to use Google’s AI is like trying to ride Bay Area public transit before the Clipper Card.
I just won’t even waste my time with the google stuff cuz I can’t figure out how to pay with it.
And that’s a problem everywhere at google. Our google play account is suspended cuz I can’t verify the company. It won’t let me cuz it says I’m not the owner. I’ve always been the owner of my company. For 18 years. There is no one else.
Once some error said make sure the owner email matches your profile in google payments and I was like, what is google payments and where do I even begin with that? I’ve never paid for google play so what does payments have to do with anything?
It’s totally random stuff. Get your shit together, google. Make your products and payment systems coherent, rather than it obviously looking like it was designed by a fiefdom full of territorial managers.
Also, re "Google Payments", I tried to transfer an app from my personal/solo Google Play account to a new business one I set up for my LLC and it was like pulling teeth. They wanted me to find some payment id from the original $20 purchase I made to get access to Google Play, something I did right around when they first launched and while I still have/use the same email, Google came out with approximately 1 googol different "payment solutions" in the interim and their engineers don't care about data migrations. Finally, after many support emails, they just transferred it without me giving that code which just shows how silly the whole thing was from the start.
Utterly ridiculous.
What's harder than herding cats? Herding cats with MBAs and OKRs.
YES I had this and eventually fixed it. I really don't know what I did but lots of clicking on random links and signing into things in different orders and then one day it somehow worked.
So frustrating.
Sad part is Google does offer a ChatML/OpenAI compliant endpoint to do LLM calls and I believe they in an experiment also reduced friction in getting an API key to start making calls right away but discoverability ever remains a challenge with google services.
This part is very easy now: you sign into https://aistudio.google.com/ and then click "Get API key" in the lower left corner.
The problem is that features and docs are still scattered all over. Some thing can only be done via Vertex, for example.
Trying to pay for Gemini-3 is confusing. Maybe an AI Ultra personal subscription? I already pay for OpenAI and Anthropic’s pro/max plans and would happily pay Google too. But the only obvious option is a $250/month tier, and its documentation indicates Google can train on your code unless you find and enable the correct opt-out. If that opt-out exists in all the products, it’s not obvious where it lives or what products it applies to.
Workspace complicates it further. Google advertises that with business workspace accounts your data isn’t used for training. So, I was going to try Antigravity on our codebase. At this point I know I can't trust Google, so I read the ToS carefully. They train on your prompts and source code, and there doesn't appear to be a way to pay them and opt out right now. Be careful, paying for Google Workspace does not protect you, always read the ToS.
Be careful with AI-studio and your Google Workspace accounts. They train on your prompts unless you switch it to API mode.
The result is a lot of uncertainty. I genuinely have no idea how to pay Google for Gemini without risking my code being used for training. And if I do pay, I can’t tell whether they’ll train on my prompts anyway.
The marketing for their coding products does not clearly state when they do or do not train on your prompts and code.
I had to run deep research to understand the risks with using Gemini 3 for agentic work, and I still don't feel confident that I understand the risks. I might have said some incorrect things above, but I am just so confused. I feel like I have a <75% grasp on the situation.
I don't have a lot of trust. And honestly, this feels confusing and deceptive. One could easily mistake it for a deliberate strategy to gather training data through ambiguity and dark patterns; it certainly looks like this could be Google's strategy to win the AI race. I assume this is just how it looks, and that they aren't being evil on purpose.
OpenAI in particular has my trust. They get it. They are carefully building the customer experience, they are product and customer driven from the top.
I wouldn't trust Sam Altman. Or any of the big players really.
Hahaha...HAHAhaha. HAHAHHAHAHAHAHAHAHA!!!
https://github.com/google-gemini/gemini-cli/issues/12121
It is far too easy to accidentally end up under the wrong privacy agreement, to the point of where some workplaces are banning use of the Gemini CLI!
Please give me an option for a password (or passkey) or literally anything else that doesn't require either linking with google or going through an email flow for every login
I'd love to see the Gemini models being available by other providers :) or if they just build a simple prepaid wallet like OpenAI and Anthropic.
Now you CAN NOT get the Google One stuff if your account is part of a workspace. I thought: how awful. I want to pay, but I simply can't?
Oh, but then I noticed: You CAN add a _Gemini AI Ultra_ license via the Google Workspace Admin area, great!
Turns out: you fucking can't. That's _Google AI Ultra FOR BUSINESS_ and that IS NOT supported.
So I had to get the Google One subscription on my personal account after all.
Combine that with the _pathetic_ usage limits: somehow not token-based, but amount of requests per 24 hour window (which is 500 for Gemini 3) and Gemini 3's incredible chattiness (it uses A LOT more requests to get something done compared to Claude) and you hit the usage limits in just 2 hours.
Peering into my crystal ball: once all "workers" have been replaced, all humans will spend all of their working hours on nothing but office politics.
> a new step towards becoming a reliable coding partner
> GPT‑5.1-Codex-Max is built for long-running, detailed work
Does this not sound contradictory? It’s been the shorter form work that has built what little confidence I have in these as a coding partner - a model that goes off and does work without supervision is not a partner to me.
This is definitely one of the biggest issues with coding agents at the moment.
That said, from my experience, Codex so often does things that are so useful and save me so much time that the occasional "oh god what the hell did it just go off and do" are an acceptable cost for me.
I regularly get great results with open-ended prompts and agents that spend 15+ minutes working on the task. I'm sure they'll eventually get better at common sense understanding of what kind of work is wasteful/absurd.
Codex feels like a tool designed to run after all the humans are gone.
The "# of model-generated tokens per response" chart in [the blog introducing gpt-5-codex](https://openai.com/index/introducing-upgrades-to-codex/) shows an example of how we're improving the model good at both.
As a startup founder and engineer, I'm not constrained by the number of 10000+ line diff, 0->1 demos I can ship. I'm constrained by quality of the 100 -> 101, tight 150 line feature additions / code cleanups I can write.
It feels like the demos, funding, and hype all want to sell me entire PR rewrites, but what I need is the best possible iterative work model that will keep me in the loop.
I still use codex - but I use codex incredibly iteratively (give it very narrowly scoped tasks, and I watch it like a hawk, giving tons of feedback). I don't use it because of its ability to code for 24 hours. I use it because when I give it those narrowly scoped tasks, it is better at writing good code than any other model. (Because of its latency, I have 2-4 of these conversations going on at the same time).
But there is a lot of friction the codex product + model adds to this process. I have to prompt aggressively to override whatever "be extremely precise" prompting the model gets natively so that it doesn't send me 20+ bullet points of extraordinarily dense prose on every message. I have to carefully manage its handling of testing; it will widen any DI + keep massive amounts of legacy code to make sure functionality changes don't break old tests (rather than updating them) and to make sure any difficult tests can have their primary challenges mocked away.
In general, codex doesn't feel like an amazing tool that I have sitting at my right hand. It feels like a teenage genius who has been designed to do tasks autonomously, and who I constantly have to monitor and rein in.
Then I made the mistake of saying "run npm run build and fix all issues" (something I've run probably 50 times across codex and cc in the past 2 months). CC does it pretty much 100% of the time. I walked away from Codex, and when I came back, it had installed 2 new node packages, and gone down some crazy rabbit hole with eslint and something else. (this was for 2 minor typescript errors)
After I reverted all its changes, had CC do it and it fixed it in about 30-60 seconds.
I'll try a few more times. Let's see.
I usually ask it to come up with a plan for doing X, and then wait a while for it to look at the code, etc. But in some odd way, GPT-5.1-Codex-Max came up with a plan within 5 seconds. I just found that surprising.
Wow, I spent last weekend using a tag-team of Claude and Codex and found Codex to more often get better results (TypeScript physics/graphics application). I probably only wrote a few hundred lines of code out of many thousands; it did a really good job.
Now I guess I'll ask the new Codex to review the work of the old!
I've vibe coded Godot games extensively.
Just about every model I've tried likes to invent imaginary functions.
I would really prefer there to be a way for me to pick a model trained on whatever framework I need.
Reviewing AI generated code feels like editing a long book, and every now and then you notice some words are just completely made up. You then ask the AI to fix its book, and it will just add more AI generated words.
On one hand I want this to be a reality check to everyone who's trying to lay off real software engineers to replace us with AI.
On the other hand half of the stock market is held up by overhyped AI valuations. If the tide goes out too fast, and there is a mass realization that this stuff just isn't as good as it's hyped to be, it's not going to be fun for anyone.
That was annoying back then, but these days that's not so much of a problem.
You can write your program and then simply have it invent the library as well, while it's at it! ;)
For one hilarious example, Gemini (2.5; I haven't tried it with 3 yet) only knows about the old Google API for Gemini, not about the new one. So if you give it code written against the new stuff, it will often do things like, "this is definitely wrong, I know this API doesn't have this method, let me fix that".
It sounded like Gemini 3 would be that, but in my limited testing it didn't appear to be.
https://github.com/openai/codex/releases/tag/rust-v0.59.0
It seems like they might still be heavily nerfing / quantizing the models in production a couple weeks before a new release, like they have always (unofficially) done.
Currently, I either need a fast agent that does what I want faster than I can type it (CRUD, forms, etc) or I need an agent to discuss a plan, ups and downs.
Whenever I try to give it a bigger task it takes a lot of time, and often is not what I’ve expected, which might be totally my fault or context specific, but as soon as I’m able to define the task properly I would prefer a faster model as it will be good enough, but faster. I really don’t have problems anymore that I can’t reasonable solve fast enough with this approach.
I’ve run multiple gpt-5 codex concurrent sessions in the cloud, but I didn’t accept one thing they did.
Eventually, thinking it through, reading, and hacking it out myself is faster than outsourcing the work for 30 minutes, plus 30 minutes to digest, plus 30 minutes to change.
Treat it as a developer that just joined the project and isn't aware of the conventions.
Provide hints for the desired API design, mention relevant code locations that should be read to gain context on the problem, or that do similar things.
An AGENTS.md that explains the project and provides some general guidelines also helps a lot.
Codex can be incredibly strong when prompted the right way.
In my experience Codex is pretty "bad" at spotting conventions or already existing code. Yesterday I told him a feature to implement (maybe 40 loc?) and he 1. added unnecessary atomics and 2. kinda reimplemented a function that already existed that he should've just reused.
I told him that and he fixed it but these are the things that kinda hold AI back by a lot. It's MUCH harder to read code than to write it, and if he writes the code I must 100% understand it to have the same confidence in it as if I did it myself. And that to me is mentally almost more taxing than doing it myself.
If you just let codex write the code while instructing him exactly what you want in terms of logic and architecture, it works really well and saves a ton of typing.
This might be in the nature of the problems I'm facing in my coding endeavors. I just don't have any tasks that I can't solve in less than 45 minutes, or the problem is so vague in my head that I can't accurately describe it to an AI or a human. Then usually I either need to split it into smaller problems or take a walk.
Since Claude 4 I barely ever wish "omg, I wish this agent were smarter." I still wish it would be faster.
But what you described is of course good practice and necessary for smart execution as well.
going to wait and see after being burned by 5.1 before i upgrade back to 0.58
gemini 3 has been a letdown tbh, to see agentic coding wasn't a top priority. im sticking with codex for now and using gemini 3 for frontend
Like when advertising the new airliner, most people don't care about how fast it taxis.
I found Gemini to be horribly slow for anything.
Wouldn't the model automatically do that using attention techniques? Why do you need to do it at the token layer and not leave it to the model to automatically decide which tokens are worth paying attention to?
Exactly. Standard Multi-Head Attention uses a score matrix that grows to ~4B entries for a 64K sequence as a starting place. FlashAttention v2 helps slightly, but as you grow to 128K context length, you still need over 1TB/s memory bandwidth to stay compute-bound in practice even with this optimization.
So there has been a lot of research in this area and model architectures released this year are showing some promising improvements. Sliding windows lose context fidelity and if you go fully linear, you sacrifice math, logic, and long multi-turn (agentic) capabilities, so everyone is searching for a good alternative compromise.
MiniMax-M1 had lightning attention to scale up to 1M context lengths. It's "I/O aware" via tiling and calculates attention two ways block-wise (intra-block traditional attention and inter-block linear attention), thereby avoiding the speed-inhibiting cumulative summation.
DeepSeek V3.2 uses DeepSeek Sparse Attention (DSA), which is sub-linear by only computing "interesting" pairs. For example, in 128K context lengths this requires only 10-20% of attention pairs to be materialized.
Both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which is borrowed from Mamba2. In Qwen3-Next it alternates three Gated DeltaNet (linear attention) layers for every one gated [full] attention. The speedup is from a delta rule, which basically amounts to caching in a hand-wavy way.
There's no universally-adopted solution yet, as these are all pretty heavy-duty compromises, but the search is going strong right now for linear or better attention mechanisms that still perform well.
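To make the trade-off concrete, here is a toy numpy sketch (not any production kernel; masking, batching, and multiple heads omitted) contrasting full attention, which materializes an n-by-n score matrix, with a sliding-window variant whose score storage grows only with n times the window size:

```python
import numpy as np

def full_attention(q, k, v):
    # scores is (n, n): at 64K tokens that is ~4.3 billion entries per head.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def sliding_window_attention(q, k, v, window=128):
    # Each query attends only to the previous `window` keys: O(n * window)
    # scores, at the cost of losing direct access to distant context.
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo = max(0, i - window + 1)
        scores = q[i] @ k[lo:i + 1].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[i] = w @ v[lo:i + 1]
    return out

n, d = 512, 64
q, k, v = (np.random.randn(n, d) for _ in range(3))
print(full_attention(q, k, v).shape)            # (512, 64), O(n^2) scores
print(sliding_window_attention(q, k, v).shape)  # (512, 64), O(n * window) scores
```

The sparse and linear approaches above are different answers to the same question: how to avoid the full n-by-n matrix while keeping more of its quality than a plain sliding window does.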
You don't know how an LLM works and you are operating on flawed anthropomorphic metaphors.
Ask a frontier LLM what a context window is, it will tell you.
For example, DeepSeek 3.2, which employs sparse attention [1], is not only faster with long context than normal 3.1, but also seems to be better (perhaps thanks to reducing the noise?).
[1] It uses still quadratic router, but it's small, so it scales well in practice. https://api-docs.deepseek.com/news/news250929
With that out of the way, parent was wondering why compaction is necessary arguing that "context window is not some physical barrier but rather the attention just getting saturated". We're trying to explain that 3+2=2+3 and you people are sitting in the back going "well, actually, not all groups are abelian".
In practice, when training a model, people select a context window so that during inference, you know how much GPU memory to allocate for a prompt and reject the prompt if it exceeds the memory limit.
Of course there's also degrading performance as context gets longer, but I suspect memory limit is the primary factor of why we have context window limits.
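A back-of-the-envelope sketch of that memory limit (all the numbers are assumptions: roughly a 70B-class dense model with grouped-query attention and an fp16 cache):

```python
# Rough KV-cache sizing: 2x (keys and values) stored per layer, per token.
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_value

# Assumed config: 80 layers, 8 KV heads, head_dim 128, fp16 values.
gib = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, context_len=128_000) / 2**30
print(f"~{gib:.1f} GiB of KV cache for a single 128K-token prompt")
```

Under those assumptions that is roughly 39 GiB for one prompt before counting weights or activations, which is why servers pre-allocate for a fixed window and reject anything longer.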
I'm a very happy Codex customer, but everything turns to disgusting slop if I don't provide:
(1) Up-to-date AGENTS.md and an excellent prompt
(2) A full file-level API with function signatures, return types and function-level guidance if it's a complex one (see the sketch below)
(3) Multiple rounds of feedback until the result is finely sculpted
Overall it's very small units of work - one file or two, tops.
I've been letting the above standards go for the last couple of weeks due to crunch and looking at some of the hotspots of slop now lying around has me going all Homelander-face [1] at the sight of them.
Those hotspots are a few hundred lines in the worst cases; I'm definitely not ready to deal with the fallout of any unit of work that takes even more than 20min.
[1] https://i.kym-cdn.com/entries/icons/original/000/050/702/ab7...
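As a hedged illustration of point (2), the "file-level API" can be as simple as a stub module like this (the module and names here are hypothetical):

```python
# invoices.py -- hypothetical stub handed to the agent before implementation:
# signatures, return types, and per-function guidance, no bodies yet.
from dataclasses import dataclass

@dataclass
class Invoice:
    id: str
    total_cents: int
    paid: bool

def load_invoices(path: str) -> list[Invoice]:
    """Read invoices from a JSON file at `path`; raise FileNotFoundError if missing."""
    raise NotImplementedError

def outstanding_balance_cents(invoices: list[Invoice]) -> int:
    """Sum of totals for unpaid invoices. Must not mutate the input list."""
    raise NotImplementedError
```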
It seems to pick a lot up from my code base. I do have an Agents.md with some basics on how to run stuff and what to do, which seems to keep it from going off on a wild goose chase trying to figure out how to run stuff by doing the wrong things.
I think from first using codex around July to now has been quite a journey where it improved a lot. It actually seems to do well in larger code bases where it has a lot of existing structure and examples of how things are done in that code base. A lot of things it just does without me asking for them just because there's a lot of other code that does it that way.
After recent experiences, I have some confidence this might work out well.
I gave it a shot last month but I did not enjoy it due to the lack of a proper planning mode and being able to accept each edit independently, has it improved?
http://github.com/agentify-sh/10x
It does minimal overhead with agent orchestration (it's just bash/typescript); its main focus was adding enhancements to codex, like double-redundant checkpoints via git and jj (lessons learned from codex being git reset --hard happy), something like Claude skills (just a bunch of MDs that steer it toward a specific activity like think, plan, execute), timeout wrappers (to get you unstuck if codex waits a long time), and blacklisted commands during yolo mode (rm -rf and git reset banned even if by some small chance it would run them). MIT licensed.
You can work sequentially (subagents launch one after the other) or in parallel (worktrees), but tbh sequential is better because you understand what is going on; parallel might be best for dealing with tests and UI.
[0]: https://news.ycombinator.com/item?id=45970668
its been essential to my workflow as well
i use both jj and git, and jj is great for just creating a snapshot that i can revert to in case it fails
im still exploring it to see what else i can do with it for agentic use
"I wasn’t able to finish creating the new base homepage module template and updating every module to inherit from it within the available time. I did not make any changes or commits."
Told it to get back to work. Let's see how that goes.
Though that does bring up an interesting point. Anecdotally, Sonnet does a lot more grep-ing while Codex reads files straight up. Might be the difference in speed and maybe smarter models will do better. Once this model is on copilot, I can test it out.
There's an option to "get a quick answer" and I hoped clicking that would revert to the previous performance; instead, what it does is ignore that I uploaded two files and ask me to upload the files.
Literally the only real good task I've found for these dumb things and they still found a way to fuck it up because they need to keep the weirdos and whales addicted. It's now almost easier to go back to comparing these files by eye, or just bite the bullet and finally write a few lines of python to actually do it right and reliably.
It would be even more interesting to see how Sonnet and Haiku compare with that curve.
It was extremely slow (like, multiple times slower than Sonnet with Claude Code, though that’s partially on me for using thinking-high I guess) to finish the task, with the back-and-forths being on the order of tens of minutes.
Moreover, the context management seems to be really weird. I'm not sure how exactly it works, but: 1. It uses very few tokens / fills up the context slowly (good I guess). 2. It doesn't seem to actually internalize the contents of files you mention to it, or that it edits.
#2 here being the main one - I usually context-dump reference code for Claude Code, and it does a perfect job of adhering to codebase patterns and its architecture, while codex was completely ignorant of the existing code style.
Moreover, it wrote extremely defensive code, even for code where it wrote both ends itself.
All in all, I was really let down after seeing all the praise.
with claude im constantly hitting rate limits, while with codex i get substantially more, and "slow" isn't really a problem for me as long as it keeps working
the only complaint i have is that codex itself has usage limits now (either due to outstanding git issues around tools or throttling on their end) compared to a few months ago
the true magical moment was codex pro letting me run swarms of agents day in, day out without any worries about rate limits; it truly felt unlimited
if claude manages to release a smaller model or some way to deal with the rapidly depleting usage limits (this is the top complaint on reddit, and they eventually just stopped allowing threads about it) it would definitely be used more
but for now codex is clearly the workhorse, with claude used side by side.
But the subscription thing is a non-issue for me as I use the API, and mostly use Claude Code synchronously, with the occasional rare background agent.
have you tried Haiku?