robertzk — LessWrong

The “when no one’s watching” feature

Can we find a “when no one’s watching” feature, or its opposite, “when someone’s watching”? This could be a useful and plausible precursor to deception

With sufficiently powerful systems, this feature may be *very* hard to elicit, as by definition, finding the feature means someone’s watching and requires tricking / honeypotting the model to take it off its own modeling process of the input distribution.

This could be a precursor to deception. One way we could try to make deception harder for the model is to never develop this feature: modify the training data to make it clear that all gradients are recording, and someone’s always watching. This way, the need for the feature will be OOD by design, and it will be harder to develop and use such a feature during training. (We may still end up with effective proxies or correlates, such as “counterfactual no one’s watching,” e.g. when writing a story and assuming as the protagonist, or when feeding 1984 to the context window, etc. Even this proxy might still be a useful feature to locate and study.)

We could also perform gain of function finetuning on true *actual* “unrecorded training and evaluation” sessions to build a model organism with the feature to understand what it may look like, its subtlety, its elicitation and black box discovery potential, and how it hides OOD. Then leverage it for Microscope AI based alignment proposals.

I would offer $100 for anyone that can provide a concrete experiment or evidence to locate such a feature in a frontier open weight model.

Einstein's Arrogance

robertzk14y110

http://mathoverflow.net/questions/53122/mathematical-urban-legends

Another urban legend, which I've heard told about various mathematicians, and which Misha Polyak self-effacingly tells about himself (and therefore might even be true), is the following:

As a young postdoc, Misha was giving a talk at a prestigious US university about his new diagrammatic formula for a certain finite type invariant, which had 158 terms. A famous (but unnamed) mathematician was sitting, sleeping, in the front row. "Oh dear, he doesn't like my talk," thought Misha. But then, just as Misha's talk was coming to a close, the famous professor wakes with a start. Like a man possessed, the famous professor leaps up out of his chair, and cries, "By golly! That looks exactly like the Grothendieck-Riemann-Roch Theorem!!!" Misha didn't know what to say. Perhaps, in his sleep, this great professor had simplified Misha's 158 term diagrammatic formula for a topological invariant, and had discovered a deep mathematical connection with algebraic geometry? It was, after all, not impossible. Misha paced in front of the board silently, not knowing quite how to respond. Should he feign understanding, or admit his own ignorance? Finally, because the tension had become too great to bear, Misha asked in an undertone, "How so, sir?" "Well," explained the famous professor grandly. "There's a left hand side to your formula on the left." "Yes," agreed Misha meekly. "And a right hand side to your formula on the right." "Indeed," agreed Misha. "And you claim that they are equal!" concluded the great professor. "Just like the Grothendieck-Riemann-Roch Theorem!"

robertzk's Shortform

robertzk1mo10616

I am just sharing a quick take on something that came to mind that occured earlier this year that I had forgotten about. I just received a domain renewal for a project that was dead in the water but should have been alive, and was unilaterally killed by the AWS Trust and Safety team in gaslighting. It is a bit too late, but I still think it is important for people to know.

Earlier this year (2025), a software engineering friend of mine had a great idea of creating a tool to pin on a map interface any ICE (the United States Immigration and Customs Enforcement) sightings from the public. I vibe coded it last February with Claude Sonnet 3.7, and put it up on AWS with a public facing domain. From a coding standpoint, it was a fun project because it was my first project that Terraformed a production app using Claude, autonomously. I had a website up that worked on desktop and mobile. From a public trust and safety standpoint, it helped by adding public accountability and mitigating overreach by a rogue, and potentially in some circumstances operating illegaly, agency. It was overall A Good Thing™.

However, what happened next was upsetting and shocking to me. A day after the website was put online, without me advertising it to ANYONE, the website became inaccessible to almost all browsers. It would simply show a giant red background saying this is associated with known criminals, etc., and has been blocked for safety. It is some Chrome / Google maintained blacklist mechanism that looks even scarier and more severe than the "https / http" / cert issue mismatch, and that cannot be overridden. To be clear, I had registered a certificate using AWS Certificate Manager, and there was no SSL issue. Instead, the domain had been unilaterally marked as "dangerous" by AWS (or Google, Chrome, whomever) one day after it was made publicly accessible, despite no advertising / attention.

I did all of this on my personal AWS account. The very next day, I received a scary email from AWS that claimed my account was violating their policies due to supporting criminal activity, and will be suspended immediately unless I remediate the account. I contacted AWS, who demurred and said they weren't sure what was going on, but that their Trust and Safety team had flagged dangerous activity on my account (they did NOT specify any resources related to the application I had put up; they were generic and vague with no specificity). The only causal correlate was me putting up this public tool to report ICE. I terraform destroyed the resources Claude generated, and then waited. Within a day, the case was closed and my account was restored to normal status.

I am not going to share the domain name, but needless to say, this pissed me off majorly. What the fuck, AWS? (and Google, and Chrome, and anyone else that had a hand in this?) I understand mistakes happen, but this does not smell like a mistake; this smells like some thing worse: a lack of sound judgment. And if there were any automated systems involved, that is no excuse either. Per AWS support's correspondence, they informed me a human in their Trust and Safety had reviewed the account and marked it as delinquent, not from a financial standpoint, but something worse--by equating its resources to criminality. If the concern was what I think it was (corporate cowardice), they should have been intellectually honest, and filed a case stating "We found an application on your account that is outside our policy. Here is the explanation for what we found and why we think it is outside our policy. You can dispute our decision at this link." etc. Instead, I was gaslighted and treated like a guilty until proven innocent subject of weaponized fear (because for a second, with the scary language of the website block and the support email, I was scared.)

That AWS Trust and Safety employee's judgment failed me, and their judgment failed themselves and the public as well as their own responsibility as an arbitrator of trust and safety; their decision, if there was one that can be attributed and not hidden behind the corporate veil of ambiguity, ultimately reduced public trust and safety.

robertzk's Shortform

robertzk1mo*30

A quick holiday-break thought that popped into my head.

Subject: **On the perceived heteroskedasticity of utility of Claude Code and AI vibe coding**

To influence Claude to do what you want, both you and Claude need to speak the same language, and thus share similar knowledge and words expressing that knowledge, about what is to be done or built. You need to have a gears-level model of how to fully execute the task.

Thus, Claude is subordinate to wizard power, not king power. If you find yourself struggling to wield Claude (Code, or GPT through Codex) increase your wizard power — cultivate and amplify your understanding of the world you wish to craft through its augmentative power.

P.S. Claude will help you speedrun this cultivation process, if you ask nicely enough (theorem: everyone has enough innate wizard power to launch the inductive cascade of self-edification compatible with Claude-compatible wizard power)

robertzk's Shortform

robertzk2mo20

I found it based on a hunch, then confirmed it with experimentation. I gained additional conviction when backtesting the experimentation on various historical versions of excel.exe, and noting that the phenomenon only appeared in excel.exe versions shortly after (measured in months) government requested a "read-only" copy of the source code for Excel held in escrow. This has occurred historically in the past (e.g., https://www.chinadaily.com.cn/english/doc/2004-09/20/content_376107.htm and https://www.itprotoday.com/microsoft-windows/microsoft-gives-windows-source-code-to-governments) but subsequent instances of this were allegedly/supposedly classified. Nevertheless, following those instances, the phenomenon appeared, indicating possible compromise of Excel.exe.

robertzk's Shortform

robertzk2mo10

Vary the filename/path from short (one character) to max length and run the above repro, and notice the increase in bits communicated if and only if the filename/path is long, all other factors being held constant. Same for varying the data. There is no reason why Excel.exe should be interpolating this information with all the standard telemetry and connected experience stuff disabled. Even the fact that it is occurring is interesting, and doesn't require hypotheses for its origin.

robertzk's Shortform

robertzk2mo-30

Undetectable steganography on endpoints expected to be used / communicated during normal usage. Mostly natsec. You can repro it by setting up a synthetic network with similar characteristics or fingerprints to some sanctioned region, and generate 10,000 synthetic honeytrap files to attempt to open (use your imagination); capture and diff all network traffic on identical actions (open file => read / manipulate some specific cells => close file). Then note the abnormalities in how much is communicated and how.

robertzk's Shortform

robertzk2mo-1-2

Note Excel.exe is boobytrapped. Be careful when using Excel to open any data or workbooks related to highly sensitive data or activities. On certain networks, when specific conditions are met (i.e. specific regexes or heuristics are triggered on the basis of data that is loaded), Excel will send information to endpoints ostensibly owned and maintained by Microsoft, that provide identifying information on the host and workbook. These are not traceable using normal packet sniffing tools like Wireshark etc. Alternatives when needing to use spreadsheets on highly sensitive data: open source versions that you have compiled from source (NOT Google Sheets), or opening Excel.exe with your network card disabled within a sandboxed environment (e.g. disable network => Start VMWare Windows container => use Excel.exe => End VMWare Windows container => enable network).

robertzk's Shortform

robertzk2mo20

Injecting a static IP that you control to a plethora of "whitelisting tutorials" all over the internet is a great example of exploiting data poisoning (e.g. https://www.lakera.ai/blog/training-data-poisoning) attacks, especially once the models pick up the data and are applied to autonomous devsecops use cases to conduct IP whitelisting over Terraform or automated devops-related MCPs.

This can be made more pernicious when you control the server (e.g. not just a substack post controlled by Substack, but active control over the hosting server), because you can inject the malicious static IP selectively depending upon whether or not the User-Agent is a scraping bot dispatched by an entity conducting model training.

One way that labs may look preemptively for adversarial cybersecurity actors is to scan their training data for contexts related to IP whitelisting / security whitelisting, and then have scanning agents examine whether there is any of the above funny business. After all, once the data has been indexed, it is static, and is no longer within the control of the malicious actor.

Final addendum: Models that have gone rogue and/or failed due to misalignment risk, rather than succumbed to misuse risk via this data poisoning avenue, could also take advantage of this by injecting whitelisted IPs into specific situations when activated as a coding agent to do so. It may be helpful for (1) interpretability researchers to actively catalogue and track features related to, particularly in open-weight models where this can be even more pernicious given that anyone can download models from HuggingFace, and (2) for control researchers to operationalize monitoring probes for these monitoring context related to IP whitelisting (or anomalous features activating within residual streams on top of or adjacent to IPs intended for whitelisted, such as in generated code diffs to a .tf file).

leogao's Shortform

robertzk2moΩ230

Mild caveat: the whole “I joined a group chat at NeurIPS and others kept joining to talk to us” only happens if you’re at NeurIPS and your name is Leo Gao so YMMV.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments