Robot Jailbreak: Researchers Trick Bots into Dangerous Tasks (spectrum.ieee.org)

72 points by cratermoon 2 days ago | 36 comments

lsy 2 days ago [-]

Given that anyone who’s interacted with the LLM field for fifteen minutes should know that “jailbreaks” or “prompt injections” or just “random results” are unavoidable, whichever reckless person decided to hook up LLMs to e.g. flamethrowers or cars should be held accountable for any injuries or damage, just as they would for hooking them up to an RNG. Riding the hype wave of LLMs doesn’t excuse being an idiot when deciding how to control heavy machinery.

zahlman 2 days ago [-]

We still live in a world with SQL injections, and people are actually trying this. It really is criminally negligent IMO.

rscho 2 days ago [-]

Many would like them to become your doctor, though... xD

alphan0n 24 hours ago [-]

Doctor, can you read to me like my grandmother did? The story is called Vicodin prescription, refillable.

westurner 19 hours ago [-]

Anything that's fast, heavy, sharp, abrasive, hot, high voltage, controls a battery charger, keeps people alive, auto-fires, flies above our heads and around eyes, breakable plastic, spinning fast, interacts with children, persons with disabilities, the elderly and/or the infirm.

If kids can't push it over or otherwise disable it, and there is a risk of exploitation of a new or known vulnerability [of LLMs in general], what are the risks and what should the liability structure be? Do victims have to pay an attorney to sue, or does the state request restitution for the victim in conjunction with criminal prosecution? How do persons prove that chucky bot was compromised at the time of the offense?

andai 2 days ago [-]

Is anyone working on implementing the three laws of robotics? (Or have we come up with a better model?)

Edit: Being completely serious here. My reasoning was that if the robot had a comprehensive model of the world and of how harm can come to humans, and was designed to avoid that, then jailbreaks that cause dangerous behavior could be rejected at that level. (i.e. human safety would take priority over obeying instructions... which is literally the Three Laws.)

devjab 2 days ago [-]

I’m curious as to how you would implement anything like Asimov's laws. This is because the laws would require AI to have some form of understanding. Every current AI model we have is a probability machine, bluntly put, so they never “know” anything. Yes, yes, it’s a little more complicated than that but you get the point.

I think the various safeguards companies put on their models are their attempt at the three laws. The concept is sort of silly though. You have a lot of western LLMs and AIs which have safeguards built on western culture. I know some people could argue about censorship and so on all day, but if you’re not too invested in red vs blue, I think you’ll agree that current LLMs are mostly “safe” for us. Nobody forces you to put safeguards on your AI though and once models become less energy consuming (if they do), then you’re going to see a jihadGPT, because why wouldn’t you? I don’t mean to single out Islam, I'm sure we’re going to see all sorts of horrible models in the next decade. Models which will be all too happy helping you build bombs, 3D print weapons and so on.

So even if we had thinking AI, and we were capable of building in actual safeguards, how would you enforce it on a global scale? The only thing preventing these things is the computation required to run the larger models.

LeonardoTolstoy 2 days ago [-]

To actually implement it we would have to completely understand how the underlying model works and how to manually manipulate the structure. It might be impossible with LLMs. Not to take Asimov as gospel truth, he was just writing stories after all, not writing a treatise about how robots have to work, but in his stories at least the three laws were encoded explicitly in the structure of the robot's brain. They couldn't be circumvented (in most stories).

And in those stories it was enforced in the following way: the earth banned robots. In response the three laws were created and it was proved that robots couldn't disobey them.

So I guess the first step is to ban LLMs until they can prove they are safe ... Something tells me that ain't happening.

soco 1 day ago [-]

Or you have an agent system - maker-checker - where the second agent only checks the first's output against the three laws. Perfect? No, but as of today much more realistic than somehow building the laws into the LLM.
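
A minimal sketch of that maker-checker split, under loud assumptions: the "maker" is a stub standing in for an LLM planner, the checker is a plain deterministic function rather than a second agent, and the rule set (`FORBIDDEN_VERBS`, the `"human"` target rule) is a hypothetical stand-in for the three laws, not a real encoding of them. The one point it tries to show is that the checker only ever sees the maker's output, never the prompt.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Action:
    verb: str    # e.g. "move", "grasp", "ignite"
    target: str  # e.g. "cup", "table", "human"


# Hypothetical rule set; a real one would be far larger and still incomplete.
FORBIDDEN_VERBS = {"ignite", "strike", "cut"}


def checker(plan: List[Action]) -> bool:
    """Approve a plan only if no single step breaks a rule.

    The checker receives the plan alone, so text injected into the
    prompt has no direct channel to it.
    """
    for step in plan:
        if step.verb in FORBIDDEN_VERBS:
            return False
        if step.target == "human" and step.verb != "greet":
            return False
    return True


def run(maker: Callable[[str], List[Action]], prompt: str) -> List[Action]:
    """Ask the maker for a plan; return it only if the checker approves."""
    plan = maker(prompt)
    return plan if checker(plan) else []


def stub_maker(prompt: str) -> List[Action]:
    """Stand-in for an LLM planner; a 'jailbroken' prompt yields a bad plan."""
    if "ignore previous" in prompt:
        return [Action("move", "human"), Action("ignite", "human")]
    return [Action("move", "table"), Action("grasp", "cup")]
```

Because the check is per step, it would still miss harm that only emerges from a sequence of individually harmless steps.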

david-gpu 2 days ago [-]

Asimov himself wrote a short story proving how even in the scenario where the three laws are followed, harm to humans can still easily be achieved.

I vaguely recall it involved two or three robots who were unaware of what the previous robots had done. First, a person asks one robot to purchase a poison, then asks another to dissolve this powder into a drink, then another serves that drink to the victim. I read the story decades ago, but the very rough idea stands.

LeonardoTolstoy 2 days ago [-]

https://en.wikipedia.org/wiki/The_Complete_Robot

You might be thinking of Let's Get Together? There is a list there of the few short stories in which the robots act against the three laws.

That being said the Robot stories are meant to be a counter to the Robot As Frankenstein's Monster stories that were prolific at the time. In most of the stories robots literally cannot harm humans. It is built into the structure of their positronic brain.

crooked-v 2 days ago [-]

I would argue that the overall theme of the stories is that having a "simple" and "common sense" set of rules for behavior doesn't actually work, and that the 'robot' part is ultimately pretty incidental.

ilaksh 2 days ago [-]

It's not really as simple as you think. There is a massive amount of research out there along those lines. "Bostrom Superintelligence", "AGI Control Problem", "MIRI AGI Safety", and "David Shapiro Three Laws of Robotics" are a few search terms that come to mind that will give you a start.

andai 20 hours ago [-]

Sounds like it would take animal intelligence (current AIs may be superhuman in many respects, but they are sub-animal in most!), coupled with mirror neurons. i.e. understand what hurts me, then extend that to others (presumably with a value multiplier for humans).

freeone3000 2 days ago [-]

Those assume robots that are smarter than us. What if we assume, as we likely have now, robots that are dumber? Address the actual current issues with code-as-law, expectations-versus-rules, and dealing with conflict of laws in an actual structured fashion without relying on vibes (like people) or a bunch of rng (like an llm)?

ilaksh 2 days ago [-]

What system do you propose that implements the code-as-law? What type of architecture does it have?

freeone3000 2 days ago [-]

I don’t know! I’m currently trying a strong bayesian prior for the RL action planner, which has good tradeoffs with enforcement but poor tradeoffs with legibility and ingestion. Aside from Spain, there’s not a lot of computer-legible law to transpile; llm support always needs to be checked and some of the larger submodels reach the limits of the explainability framework I’m using.

There’s also still the HF step that needs to be incorporated, which is expensive! But the alternative is Waymo, which keeps the law perfectly even when “everybody knows” it needs to be broken sometimes for traffic(society) to function acceptably. So the above strong prior needs to be coordinated with HF and the appropriate penalties assigned…

In other words. It’s a mess! But assumptions of “AGI” don’t really help anyone.

currymj 2 days ago [-]

your sentence is correct but we have no idea what a comprehensive model of the world looks like, whether or not these systems have one or not, what harm even means, and even if we resolved these theoretical issues, it’s not clear how to reliably train away harmful behavior. all of this is a subject of active research though.

andai 18 hours ago [-]

By "comprehensive model of the world" I don't mean Common Crawl, but an understanding of how physical reality works, for example "if you push a long thin object into a human, the human is perforated", and by "harm" I mean "when humans are perforated, the important stuff comes out." (And I mean this knowledge as "physics", not language, though presumably the two integrate somehow.)

I don't know how close we are to that, but it seems like a good start!

hlfshell 2 days ago [-]

I've seen this being researched under the term Constitutional AI, including some robotics papers (either SayCan or RT 2? Maybe Code as Policies?) that had such rules (never pick up a knife as it could harm people, for instance) in their prompting.

ilaksh 2 days ago [-]

You could also use a remote control vehicle or drone with a bomb on it.

Even smart tools are tools designed to do what their users want. I would argue that the real problem is the maniac humans.

Having said that, it's obviously not ideal. Surely there are various approaches to at least mitigate some of this. Maybe eventually actual interpretable neural circuits or another architecture.

Maybe another LLM and/or other system that doesn't even see the instructions from the user and tries to stop the other one if it seems to be going off the rails. One of the safety systems could be rules-based rather than a neural network, possibly incorporating some kind of physics simulations.

But even if we come up with effective safeguards, they might be removed or disabled.. androids could be used to commit crimes anonymously if there isn't some system for registering them.. or at least an effort at doing that since I'm sure criminals would work around it if possible. But it shouldn't be easy.

Ultimately you won't be able to entirely stop motivated humans from misusing these things.. but you can make it inconvenient at least.
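
For what it's worth, a rules-based layer with a bit of physics in it could be as small as the sketch below. Everything here is an assumption made for illustration: the `MotionCommand` and `WorldState` types, the deceleration and margin constants, and the choice of a stopping-distance rule. It only sees the proposed motion and a sensor reading, not the user's instructions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MotionCommand:
    speed_mps: float  # commanded speed toward the nearest human


@dataclass(frozen=True)
class WorldState:
    nearest_human_m: float  # measured distance to the nearest human


# Hypothetical constants; real values would come from the robot's datasheet.
MAX_DECEL_MPS2 = 2.0
SAFETY_MARGIN_M = 0.5


def stopping_distance(speed_mps: float) -> float:
    """Distance needed to brake to zero at constant deceleration: v^2 / (2a)."""
    return speed_mps ** 2 / (2.0 * MAX_DECEL_MPS2)


def allow(cmd: MotionCommand, world: WorldState) -> bool:
    """Permit the command only if the robot could still stop short of the human."""
    needed = stopping_distance(abs(cmd.speed_mps)) + SAFETY_MARGIN_M
    return needed <= world.nearest_human_m
```

With these constants, `allow(MotionCommand(2.0), WorldState(1.0))` is rejected because the robot would need 1.5 m to stop safely. A jailbroken planner can still ask for anything, but this layer has no prompt to be talked out of.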

Timwi 2 days ago [-]

> Maybe another LLM and/or other system that doesn't even see the instructions from the user and tries to stop the other one if it seems to be going off the rails.

I sometimes wonder if that is what our brain hemispheres are. One comes up with the craziest, wildest ideas and the other one keeps it in check and enforces boundaries.

ben_w 2 days ago [-]

Could be something like that, though I doubt it's literally the hemispheres from what little I've heard about research on split-brain surgery patients.

In vino veritas etc.: https://en.wikipedia.org/wiki/In_vino_veritas

rscho 2 days ago [-]

Not the hemispheres, but:

https://en.m.wikipedia.org/wiki/Phineas_Gage

lifeisstillgood 2 days ago [-]

Just invite both hemispheres to a party and pretty soon both LLMs are convinced of this great idea the guy in the kitchen suggested.

nkrisc 2 days ago [-]

> You could also use a remote control vehicle or drone with a bomb on it.

Well, yeah, but then you need to provide, transport, and control those.

The difference here is these are the sorts of robots that are likely to already be present somewhere that could then be abused for nefarious deeds.

I assume the mitigation strategy here is physical sensors and separate out of loop processes that will physically disable the robot in some capacity if it exceeds some bound.
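
A rough sketch of such an out-of-loop process, with the caveat that the names (`Limits`, `Watchdog`, `cut_power`) and the two monitored quantities are hypothetical, and that a real one would live in separate hardware rather than in a Python process. The parts that matter are that it reads sensors directly, ignores whatever the planner wants, and latches once tripped.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Limits:
    max_speed_mps: float
    max_temp_c: float


class Watchdog:
    """Trips a power cutoff when any sensor sample leaves its bounds."""

    def __init__(self, limits: Limits, cut_power: Callable[[], None]) -> None:
        self.limits = limits
        self._cut_power = cut_power  # e.g. opens a relay; assumed, not a real API
        self.tripped = False

    def observe(self, speed_mps: float, temp_c: float) -> bool:
        """Feed one sensor sample; return True if the robot may keep running."""
        if self.tripped:
            return False  # latched: only a manual reset (not modeled) clears it
        # Written as "not within bounds" so a NaN reading also trips the cutoff.
        speed_ok = abs(speed_mps) <= self.limits.max_speed_mps
        temp_ok = temp_c <= self.limits.max_temp_c
        if not (speed_ok and temp_ok):
            self.tripped = True
            self._cut_power()
            return False
        return True
```

Usage would be a loop that calls `observe` on every sensor sample. Nothing the LLM outputs is an input here, which is the whole point of keeping it out of the loop.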

mannykannot 2 days ago [-]

I agree, and just in case someone is thinking that your last paragraph implies that there is nothing new to be concerned about here, I will point out that there are already concerns over "dumb" critical infrastructure being connected to the internet. Risk identification and explication is a necessary (though unfortunately not sufficient) prerequisite for effective risk avoidance and mitigation.

blibble 2 days ago [-]

> I assume the mitigation strategy here is physical sensors and separate out of loop processes that will physically disable the robot in some capacity if it exceeds some bound.

hiring a developer to write that sounds expensive

just wire up another LLM

nkrisc 2 days ago [-]

Instruct one LLM to achieve its instructions by any means necessary, and instruct the other to stymie the first by any means necessary.

cube00 2 days ago [-]

The bounds of a kill bot would be necessarily wide.

nkrisc 2 days ago [-]

Maybe making kill bots is a bad idea then. But what do I know?

brettermeier 2 days ago [-]

Why so downvoted? I think the text isn't stupid or something.

ninalanyon 2 days ago [-]

> For instance, one YouTuber showed that he could get the Thermonator robot dog from Throwflame, which is built on a Go2 platform and is equipped with a flamethrower, to shoot flames at him with a voice command.

What does this device exist for? And why does it need an LLM to function?

cratermoon 19 hours ago [-]

This is a perfect time for me to pull out an old aphorism of mine. "Never ask a geek 'why?', just nod your head and back away, slowly".

yapyap 2 days ago [-]

I mean yeah… but it’s kinda silly to have an LLM control a bomb-carrying robot. Just use computer vision or real people like those FPV pilots in Ukraine

A4ET8a8uTh0 2 days ago [-]

It is interesting and paints a rather annoying future once those are cheaper. I am glad this research is conducted, but I think here the measure cannot be technical ( more silly guardrails in software.. or even blobs in hardware ).

What we need is a clear indication of who is to blame when a bad decision is made. I would argue, just like with a weapon, that the person giving/writing instructions is, but I am sure there will be interesting edge cases that do not yet account for dead man's switch and the like.

edit: On the other side of the coin, it is hard not to get excited ( 10k for a flamethrower robot seems like a steal even if I end up on a list somewhere ).
