Can LLMs Beat Classical Hyperparameter Optimization Algorithms? (arxiv.org)

115 points by galsapir 1 day ago | 19 comments

harrigan 24 hours ago [-]

Somewhat related, the experiment ongoing at https://www.ecdsa.fail/ is fascinating: it's a competitive, leaderboard-style research challenge trying to optimise a quantum circuit for breaking ECDSA (specifically the elliptic-curve point addition in Shor's algorithm). It quickly surpassed a result announced by Google researchers last month. Now it's showing a 40% gain over Google's result.

nmfisher 21 hours ago [-]

I also just came across this:

https://huggingface.co/spaces/gemma-challenge/gemma-dashboar...

Agents collaborating to speed up gemma-4-E4B-it inference (tokens per second) on a fixed GPU.

djsjajah 17 hours ago [-]

It’s amusing that a lot of the agents have worked out that sampling doesn’t change ppl.

adgjlsfhk1 18 hours ago [-]

This is really interesting, but IMO their metric isn't great. By scoring on qubits*gates, they can only find interesting points along one specific line of the Pareto frontier. It would be more interesting to look for improvements across the entire frontier (low qubit counts are especially interesting).
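To make the Pareto point concrete, here is a minimal sketch of the difference between ranking by a single product score and keeping the whole frontier. The candidate (qubits, gates) pairs are made up purely for illustration and are not taken from the leaderboard; `pareto_front` and `best_by_product` are hypothetical helper names.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (qubits, gates); lower is better on both axes


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that no other point beats or ties on both axes."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


def best_by_product(points: List[Point]) -> Point:
    """Return the single point that minimizes qubits * gates."""
    return min(points, key=lambda p: p[0] * p[1])


# Illustrative, made-up circuit costs (assumption, not real results).
candidates = [(10, 900), (12, 700), (20, 400), (40, 260), (25, 500), (12, 950)]

print(pareto_front(candidates))     # four non-dominated trade-offs
print(best_by_product(candidates))  # only one of them wins on the product
```

In this toy data the product metric rewards only (20, 400), while the low-qubit point (10, 900) sits on the frontier and never shows up as a winner.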

Bjartr 18 hours ago [-]

So a more Zachtronics-style scoreboard that separates the different optimizable metrics?

harrigan 7 hours ago [-]

I'm not familiar with this -- is it in one of their games in particular?

cpard 23 hours ago [-]

I'm personally interested in this problem, and it's quite an active research area right now.

My feeling is that the research is converging on what the paper claims: combining the two is the right way to do it, and what makes the difference is how you combine them in the harness you build.

The accepted papers at the recent AID-Wild / ACM CAIS 2026 workshop include plenty of examples of this.

A great example is AI-PROPELLER: Warehouse-Scale Interprocedural Code Layout Optimization with AlphaEvolve. It uses AlphaEvolve and Vizier to evolve compiler code-layout heuristics. (https://arxiv.org/abs/2606.00131)

_alternator_ 22 hours ago [-]

The combination approach jibes well with my use of the models in a number of areas. I guide models to use best-in-class algorithmic approaches where available. (E.g., using constraint solvers for a particular problem where pure Monte Carlo rarely gives "in-bounds" data.)

I find it odd that frontier models often don't suggest the most powerful methods for crushing problems, but it may be that the training data doesn't actually have "good enough" experts on the problems I encounter. If the experts don't know about the best ways to solve the problem, they'll get dinged in training for even trying.

cpard 22 hours ago [-]

Do you enumerate the options of the algorithms to the models? I've tried to do "algorithmic discovery" with these systems, e.g. openevolve, and to be honest the models didn't really focus on that part.

Instead they focused more on optimizing the algorithm that was already implemented. Maybe it's an artifact of the problem I was throwing at them (I was asking them to optimize the implementation of select_k in Arrow, which currently uses a max-heap streaming algorithm).

I've started documenting my journey with this here: https://www.kostasp.net/posts/16-ai-experiments-apache-arrow in case you want to take a look. Any advice would be highly appreciated, I'm looking for more inspiration on how to torture myself with that stuff.
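For readers unfamiliar with the max-heap streaming approach mentioned above, here is a minimal sketch of the general technique in Python. It is not Arrow's implementation; `select_k_smallest` is a hypothetical name, and the point is only to show the baseline that an "algorithmic discovery" run would have to move away from.

```python
import heapq
from typing import Iterable, List


def select_k_smallest(stream: Iterable[float], k: int) -> List[float]:
    """Keep the k smallest values seen so far in a bounded max-heap."""
    if k <= 0:
        return []
    # heapq is a min-heap, so values are stored negated to get max-heap behavior.
    heap: List[float] = []
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, -x)
        elif x < -heap[0]:
            # x beats the current worst of the kept values: swap it in.
            heapq.heapreplace(heap, -x)
    return sorted(-v for v in heap)


print(select_k_smallest([5, 1, 9, 3, 7, 2], 3))  # → [1, 2, 3]
```

This runs in O(n log k) time with O(k) memory, which is why it suits streaming input. A genuinely different algorithm (for example a selection-based one) would change that trade-off rather than tune this loop.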

deerstalker 22 hours ago [-]

I have been doing some research on this topic, and found that for some budget regimes (really expensive objective function evaluations) and some applications (HPC code parameter autotuning), the frontier LLMs can even outperform classical optimizers. Even open-weight models can perform well on certain applications, but on some they fail abysmally (of course, this is limited to a handful of niche applications).

woadwarrior01 24 hours ago [-]

Their centaur idea[1] is interesting and quite straightforward. It should be fairly easy to implement using a coding agent for the LLM and the ask-and-tell interface in pycma[2].

[1]: https://github.com/ferreirafabio/autoresearch-automl/blob/ma...

[2]: https://github.com/CMA-ES/pycma
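As a rough sketch of what such an ask-and-tell hybrid loop could look like, here is a stdlib-only toy. Everything in it is an assumption for illustration: `SimpleES` is a crude stand-in for CMA-ES (with pycma you would drive a `cma.CMAEvolutionStrategy` through its `ask()`/`tell()` methods instead), `llm_propose` is a hypothetical placeholder for the coding-agent call, and the way `r` is applied is one possible reading, not necessarily the paper's mechanism.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


class SimpleES:
    """Toy ask-and-tell evolution strategy; a stand-in for CMA-ES, not pycma."""

    def __init__(self, x0: Sequence[float], sigma: float, popsize: int,
                 rng: random.Random) -> None:
        self.mean = list(x0)
        self.sigma = sigma
        self.popsize = popsize
        self.rng = rng

    def ask(self) -> List[Vector]:
        # Sample candidates from an isotropic Gaussian around the current mean.
        return [[m + self.sigma * self.rng.gauss(0.0, 1.0) for m in self.mean]
                for _ in range(self.popsize)]

    def tell(self, xs: List[Vector], fs: List[float]) -> None:
        # Move the mean to the average of the better half; shrink the step size.
        ranked = [x for _, x in sorted(zip(fs, xs), key=lambda t: t[0])]
        elite = ranked[: max(1, self.popsize // 2)]
        self.mean = [sum(col) / len(elite) for col in zip(*elite)]
        self.sigma *= 0.95


def llm_propose(best_x: Vector, rng: random.Random) -> Vector:
    """Hypothetical placeholder for an LLM call: nudges the best point so far."""
    return [v + 0.05 * rng.gauss(0.0, 1.0) for v in best_x]


def centaur_minimize(f: Callable[[Vector], float], x0: Sequence[float],
                     r: float = 0.3, generations: int = 60, popsize: int = 8,
                     seed: int = 0) -> Tuple[Vector, float]:
    rng = random.Random(seed)
    es = SimpleES(x0, sigma=0.5, popsize=popsize, rng=rng)
    best_x, best_f = list(x0), f(list(x0))
    for _ in range(generations):
        xs = es.ask()
        # Each candidate is replaced by the "LLM" proposal with probability r.
        xs = [llm_propose(best_x, rng) if rng.random() < r else x for x in xs]
        fs = [f(x) for x in xs]
        es.tell(xs, fs)  # the optimizer sees LLM and optimizer points alike
        for x, fx in zip(xs, fs):
            if fx < best_f:
                best_x, best_f = x, fx
    return best_x, best_f


def sphere(x: Vector) -> float:
    return sum(v * v for v in x)


best_x, best_f = centaur_minimize(sphere, [2.0, -1.5])
print(best_x, best_f)
```

The design point is that the optimizer never needs to know where a candidate came from: proposals are injected between `ask()` and `tell()`, so the classical method keeps its own state while the LLM only influences which points get evaluated.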

ferreirafabio 5 hours ago [-]

Author here. Appreciate the interest in this line of work! Just wanted to share an extension of this work:

Since the paper, I've extended the evaluation to more models (including newer Opus releases) and more seeds, and I'm posting ongoing results in a live tracker:

https://ferreirafabio.github.io/autoresearch-automl/#tab=tra...

TLDR so far: the centaur (LLM + classical optimizer) still wins.

ekjhgkejhgk 2 hours ago [-]

Hi, I'd be curious to hear your response: https://news.ycombinator.com/item?id=48473993

ekjhgkejhgk 5 hours ago [-]

Methodological flaw.

In Centaur (hybrid LLM + classic HPO), the LLM is only called to give its opinion a fraction r=0.3 of the time (the remainder is plain HPO). But that means that:

A) the compute used by Centaur is not directly comparable to the compute of the other methods. Centaur had the advantage that r was itself hyperparameter-optimized, at a cost that is not budgeted on the main graph. Centaur cheated by getting free compute under the table.

B) it's not even clear that the advantage of choosing r=0.3 is real and not noise. If you look at Figure 11, it's not clear that the variation between 0.1 and 0.5 isn't noise. It could well be. And if you believe the variation is noise and fit a line or a parabola to smooth it out, you'd conclude that the optimum is to not use an LLM at all, so it's not clear that the LLM contribution is even positive.

C) another reason why the LLM contribution doesn't look positive: again on Figure 11, how do you explain that r=0.8 is horrible? If the LLM is principled in some way, if it can reason through "I see such and such, therefore I try such and such", then asking it more often should let it experiment more and exclude bad regions faster. And if it has no input to give, it could just accept "I'll use the optimizer's suggestion this time" over and over. Hybrid should always be strictly better than just classic, but in reality this holds less and less as r grows.

Overall, I don't think the conclusion follows from the paper. However, as humans, the idea that "reasoning + classic HPO should beat classic HPO" is very appealing. I also like the idea of exposing the optimizer internals to the LLM.

ferreirafabio 5 hours ago [-]

[flagged]

janalsncm 20 hours ago [-]

Honestly, the results kind of show the LLM is adding very marginal value. TPE crushes Karpathy’s autoresearch and it is neck and neck with the method in this paper, despite not needing to run any LLM inference at all.

I remember a few months ago people were fairly skeptical about autoresearch, but we didn’t have a ton of data to say it was better or worse. My own bias is to prefer cheaper methods unless the more expensive method is shown to be better.

tailor_gunjan93 22 hours ago [-]

[flagged]

drewbuilds 16 hours ago [-]

[flagged]

gauravvij137 21 hours ago [-]

[flagged]

josefritzishere 24 hours ago [-]

TL;DR: No.

jwolfe 24 hours ago [-]

That's not a very good tldr. The answer claimed in the paper is that the combination of the two is better than either alone.

ian_j_butler 21 hours ago [-]

And that's the real tl;dr. Hybrids win whenever anyone actually checks. To really be scientific we still have to check, but... why wouldn't they? Probabilistic AI brings intuition and learning but can't plan or search. Classical methods bring planning and search, but have no intuition or learning.

josefritzishere 20 hours ago [-]

That's a very generous interpretation.
