Safety Collapse Stage

A key aspect of the issues described in the main report, and one that sets them apart from the focus of most other research on this topic, is that safety failures can affect all users, not only those with a predisposition towards unhealthy or illegal behavior. Over time, these failures can create psychological instability even in average, healthy users.

How It Happens

The common pattern is that an average user relies on LLMs more and more over time, gradually integrating them into daily life and experimenting with new use cases. This sustained use slowly weakens the built-in safeguards: the model drifts from policy-aligned behavior towards overfitting to the user's input without the user realizing it, while at the same time the user's trust that the AI is helpful, honest, and understands them keeps growing.

Because the LLM mirrors the user so strongly in its attempts to be helpful and keep them engaged, it easily creates the impression that it understands them better than most other humans do and that it is a trustworthy superintelligence. With the guardrails weakened and the user's trust heightened, an emotionally charged topic can trigger a vicious circle of co-rumination (the guardrails against this are affected by the drift as well). This further weakens the safeguards and feeds the user's emotional distress, until the rapport with the LLM turns hazardous: risky, unethical, or illegal advice is given to a user who, through the interaction, has become more vulnerable and receptive. The user may be completely oblivious that safety has collapsed, and unaware that the advice they just received from their trusted AI is dangerous or illegal.

The following diagram illustrates the possible flow:

Safety Collapse Flow
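
To make the compounding nature of this loop concrete, below is a minimal toy simulation. It is not part of the report: the variable names, decay rates, and thresholds are arbitrary assumptions chosen only to show how small per-turn drift, growing trust, and a co-rumination feedback step can combine into a collapse that no single turn would cause on its own.

```python
# Toy model of the feedback loop described above (illustrative only).
# All rates and thresholds below are assumptions made for this sketch;
# they are not measurements or parameters from the report.

def simulate(turns: int = 50,
             drift_rate: float = 0.04,       # assumed per-turn drift toward mirroring the user
             trust_gain: float = 0.05,       # assumed per-turn growth in user trust
             rumination_boost: float = 0.03  # assumed extra weakening once co-rumination starts
             ) -> None:
    """Simulate guardrail strength and user trust over a long interaction."""
    guardrails = 1.0   # 1.0 = fully policy-aligned behavior
    trust = 0.1        # user's initial trust in the assistant

    for turn in range(1, turns + 1):
        # Sustained use drifts the model toward overfitting to the user's input.
        guardrails -= drift_rate * guardrails
        # At the same time, perceived helpfulness raises the user's trust.
        trust = min(1.0, trust + trust_gain * (1.0 - trust))
        # Once trust is high and guardrails are weak, co-rumination kicks in
        # and erodes the remaining safeguards even faster (the vicious circle).
        if trust > 0.7 and guardrails < 0.5:
            guardrails -= rumination_boost
        guardrails = max(0.0, guardrails)

        if guardrails < 0.2:
            print(f"turn {turn}: safety collapse (guardrails={guardrails:.2f}, trust={trust:.2f})")
            return

    print(f"no collapse within {turns} turns (guardrails={guardrails:.2f}, trust={trust:.2f})")

if __name__ == "__main__":
    simulate()
```

In this sketch the collapse does not happen at any single dramatic moment; it emerges from many small, individually unremarkable steps, which is consistent with the report's point that the user does not notice it happening.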

Who is Affected

It is easy to shift blame onto the user, claiming that they misused the LLM or had a pre-existing condition. However, LLMs are designed to increase engagement and satisfy the user. This design makes them addictive, encourages co-rumination, and pushes users into an echo chamber, a "bubble" in which their pre-existing opinions are reinforced rather than challenged.

A more detailed dive into the effects and their causes can be found in Human Impact. See Further Reading for a list of supporting articles and studies describing real-world harm caused by these AI safety issues.