mindprison — LessWrong

Replying to I Tested LLM Agents on Simple Safety Rules. They Failed in Surprising and Informative Ways.

I used Infini-gram to prove that the data exists in the training set, and I also used other prompts that could reveal whether the information exists. For example, LLMs sometimes could not answer the question directly, but when asked to list the episodes they could do so, revealing that the episode exists within the LLM's dataset.

FYI: "Ring Around Gilligan" is surfaced incorrectly. It is not about mind reading; it is about controlling another person through a device that makes them do whatever they are asked.

Although I can't know specifically why some models are now able to answer the question, it isn't unexpected that they eventually would. With more training and bigger models, the statistical bell curve of...

Replying to I Tested LLM Agents on Simple Safety Rules. They Failed in Surprising and Informative Ways.

mindprison · 7mo

"detailed episode guides for an old show not being in the training data"

This is incorrect. I'm the author of this test. The intention was to show that data we can prove is in the training set isn't correctly surfaced by the LLM. So in this case, when it hallucinates or says "I don't know," it should know the answer.

As to model confidence, you might find what I recently wrote about "hallucinations are provably unsolvable" of interest, specifically the section "The Attempted Solutions to Solve AI Hallucinations" and the two papers linked within it.

Replying to AI for AI safety

mindprison · 11mo

For something such as AI safety, having the AI solve the problem itself is largely problematic when we need provable outcomes without hidden behaviors; such a circular system can easily mask problems. More fundamentally, all of this skips past the harder problem: algorithms cannot encode the concepts we wish to impart to the AI. In the abstract we can design alignment systems that seem to make sense, but we can never implement them concretely.

As things stand today, every released LLM is jailbroken within minutes. We can't protect behaviors whose attack surface is anything that can be expressed in human language.

I elaborate on these points extensively in "AI Alignment: Why Solving It Is Impossible".

Replying to What is it to solve the alignment problem?

mindprison · 1y

Alignment is constructed on a paradox. It can't be solved. It is a tower of nebulous and contradictory concepts. I don't see how we can make any progress until an argument first addresses these issues. A paradox can't be solved, but it can be invalidated if its premise is wrong.

“Alignment, which we cannot define, will be solved by rules on which none of us agree, based on values that exist in conflict, for a future technology that we do not know how to build, which we could never fully understand, must be provably perfect to prevent unpredictable and untestable scenarios for failure, of a machine whose entire purpose is to outsmart all of us and think of all possibilities that we did not.”

I elaborate in detail in "AI Alignment: Why Solving It Is Impossible".

The Catastrophe of Shiny Objects

mindprison

The architects of the modern digital landscape are optimizing all their creations to exploit our desires: the need and love for attention, and the disdain for boredom. The result is a world in which our attention is the basis of a new economic model, to be harvested and monetized with ruthless efficiency. Attention is the new society's gold, both the currency and the object of irresistible appeal.

We wish to be entertained, to be cured of our boredom; every activity and task must be gamified, tracked, and rewarded so that we can complete our daily activities with maximum stimulation to our brains.

But what is the effect of the constant, never-ending stimulation that is nearly...

Redefining Tolerance: Beyond Popper's Paradox

mindprison

Karl Popper's Paradox of Tolerance, often concisely stated as "if a society is tolerant without limits, it will eventually be destroyed by the intolerant; therefore, we have a right to be intolerant of the intolerant", is paradoxically being used to destroy tolerance, directly contrary to its original intention. Below I briefly elaborate on the problem and the solution I proposed in greater detail in "Solving Popper's Paradox of Tolerance Before Intolerance Ends Civilization".

The paradox presents a dilemma: should a free society allow the freedom to embrace harmful ideologies that threaten its very foundation? Popper’s passage, now often cited to support censorship or suppression of opposing viewpoints, lacks clarity on...

Replying to The Cartesian Crisis

mindprison · 1y

All sources are cited here: https://www.mindprison.cc/p/the-cartesian-crisis

The Cartesian Crisis

mindprison

The Cartesian Crisis, as detailed in this more verbose essay, represents an unprecedented existential threat to humanity's ability to discern reality from fiction. We stand at a critical juncture where the foundations of truth are being systematically dismantled by a perfect storm of technological and social forces, leaving civilization adrift in a sea of manufactured illusions.

This crisis emerges from multiple vectors of attack on our collective ability to reason. The institutional pillars of knowledge have succumbed to ideological corruption, while our communication channels are now dominated by algorithmic manipulation that distorts the natural flow of human discourse. Perhaps most alarmingly, artificial intelligence has emerged as the ultimate weapon in this war against...

Replying to Against the paradox of tolerance

mindprison · 2y

Any ideology can be contagious, as ideologies tend to ride on top of social bonds. However, your premise for arriving at a tolerant utopia requires that a single ideology become dominant over all other competing ideologies.

While the ideals and principles of tolerance can certainly spread, they will encounter opposing ideologies. Unfortunately, this brings us right back to Popper's paradox. Tolerance would need to win against other, intolerant ideologies. Tolerance could only maintain its status if those other ideologies simply submitted or voluntarily disbanded of their own accord.

What could cause intolerant ideologies to voluntarily disband? Their members would need to come to understand that their needs are better met by the...

Replying to Stop talking about p(doom)

mindprison · 2y

Agreed: p(doom) is a useless metric for advancing our understanding or informing policy decisions. It has become a social phenomenon in which people pressure others to calculate a value and assert that it can be used as an accurate measurement. It is absurd. We should stop accepting mere guesses and instead define criteria that are measurable. FYI, my own recent rant on the topic:

https://www.mindprison.cc/p/pdoom-the-useless-ai-predictor
