US Government Taps Anthropic’s Claude for Pioneering AI Safety Research

In an unprecedented collaboration, the United States government is joining forces with AI companies to rigorously test the safety of advanced models before they’re released to the public.

The U.S. Department of Commerce’s National Institute of Standards and Technology (NIST) announced it has signed agreements with Anthropic – maker of the Claude AI assistant – and OpenAI to give government experts early access to these firms’ most powerful AI systems.

This move establishes a formal pipeline for independent AI safety evaluations and marks the first time AI developers will systematically share pre-release models with regulators for scrutiny.

The partnership centers on NIST’s newly launched U.S. AI Safety Institute, which is tasked with developing standards and tests for trustworthy AI. Under Memoranda of Understanding (MOUs), Anthropic will provide the institute “access to major new models prior to and following their public release”.

In practice, when Anthropic is training a new Claude model (say Claude 3 or Claude Next), NIST’s team of researchers will be invited to put it through its paces before it goes live.

They’ll probe for things like: Does the model make harmful errors or hallucinations? Could it be manipulated into disclosing sensitive info? Does it exhibit biases or unfair behavior? And importantly, can it follow Anthropic’s own safety measures or are there gaps?

Elizabeth K. Kelly, the director of the U.S. AI Safety Institute, called these agreements “first-of-their-kind” and said they “will help advance safe and trustworthy AI innovation for all”. “Safety is essential to fueling breakthrough technological innovation,” she noted, emphasizing that having guardrails will ultimately enable AI’s benefits.

NIST will collaborate on research with Anthropic – meaning they won’t just test passively; they’ll work together on methods to evaluate capabilities and mitigate risks. For instance, they might jointly develop new stress tests for “situational awareness” of AI (to ensure models know their limits), or techniques to make models refuse certain dangerous requests more effectively.

And it’s a two-way street: NIST will give feedback and potential safety improvements to Anthropic for their models. So Anthropic might actually tweak Claude’s training or fine-tuning based on NIST’s findings, leading to safer final releases.

The collaboration also extends internationally. The announcement mentions the U.S. AI Safety Institute will work closely with its counterpart in the UK, sharing results. Indeed, in late 2023, the UK government set up an AI Safety Institute as well, and it’s hosting a global AI Safety Summit.

Anthropic is one of the companies engaging there too. So, if Anthropic gave NIST early access to, say, a Claude 3 model, the UK institute might be involved in evaluating it as well, via a partnership. This is creating an emerging framework of trusted circles of experts who vet advanced AI across borders before they’re deployed widely.

Why this matters: For the public, this is reassurance that someone outside of the AI companies themselves is checking these models.

After the surprise launch of ChatGPT, there were calls (even from Elon Musk and some researchers) for independent review of AI due to potential societal risks. This NIST-Anthropic arrangement is a concrete step in that direction.

By getting a sneak peek, government scientists can benchmark how a model performs on safety metrics and identify any red flags. They might catch issues companies missed or confirm the company’s own claims. It’s akin to FDA inspections or aircraft safety certifications, but for AI algorithms.

For Anthropic, agreeing to this shows confidence in their safety practices – they’re basically saying “we’re not afraid to let a third party poke and prod our model”.

It also might pre-empt heavier regulation: if voluntary sharing works and builds trust, maybe legislators won’t feel the need to mandate stricter controls like licensing or pre-approval (some proposals floating around). Also, NIST’s feedback could genuinely help Anthropic make Claude better.

For example, NIST might have specialized tests for adversarial robustness or bias that Anthropic’s internal team didn’t have. If Claude can pass NIST’s gauntlet, that’s a strong validation they can tout to customers: “Claude has been independently evaluated for safety by the U.S. government.”

Voluntary Commitments to Policy: This agreement builds on a series of moves by the Biden administration regarding AI.

In July 2023, Anthropic and six other AI companies (OpenAI, Google, Meta, Amazon, Microsoft, Inflection) made voluntary commitments at a White House meeting to prioritize safety, including allowing third-party testing before release. Anthropic’s co-founder, Dario Amodei, was at that meeting.

One key pledge was to share models with the government for testing – and this NIST MOU is essentially the follow-through.

It’s happening under the umbrella of the administration’s agenda: President Biden had issued an Executive Order instructing NIST to establish a process for AI testing. So we see policy turning into action here, relatively quickly by government standards (in about a year).

Scope of access: While the details are under wraps, NIST likely gets to test frontier models – those closest to public release or just released. For Anthropic, that could mean NIST had access to Claude 2 shortly before its July 2023 debut, or Claude 3.5 in mid-2024, etc.

They might also test iterative updates. The MOU likely has strict confidentiality – NIST won’t leak model details or allow others to use it beyond testing, addressing companies’ IP concerns. It’s about evaluation, not putting the model in government products or something.

Research focus: The announcement highlights evaluating capabilities and safety risks. Capabilities could range from the model’s knowledge and reasoning to its ability to follow instructions.

Safety risks could include generating disinformation, hate speech, instructions for violence, or cybersecurity issues (like leaking private training data or code). They’ll also look at mitigation methods – e.g., how effective are Anthropic’s filters or its “Constitutional AI” approach in practice.

Results will help NIST develop standards. For instance, they might come up with “AI Safety Level” definitions (Anthropic actually has something called ASL in-house), or evaluation benchmarks a model should pass. Over time, such standards could become baseline requirements for AI deployments in sensitive areas.

Elizabeth Kelly’s quote hints that this is just a start but an important one. NIST has a history of working with industries to set standards (e.g., cryptography). If these evaluations go well, it could inform broader certification schemes for AI.

NIST cooperating with both Anthropic and OpenAI also means they can compare notes across two different architectures and training styles, which could yield best practices applicable industry-wide.

For the public sector, this collaboration might also pave the way for government use of AI. If NIST gives Claude a green light on safety for certain uses, agencies might be more willing to deploy it (for instance, using Claude in federal customer service chatbots or data analysis) knowing NIST vetted it.

The government tends to buy tech that meets standards or has security certifications – maybe in the future “NIST safety-tested” becomes one for AI.

The bigger picture: 2024 saw increasing calls in Washington D.C. for AI regulation – from Senate hearings with OpenAI’s Sam Altman to a bipartisan push for AI licensing. While legislation will take time, these NIST agreements are an immediate executive action showing the U.S. is trying to get ahead of AI risks without stifling innovation.

It’s notable that only Anthropic and OpenAI are in this first cohort. Likely they were most willing or their models most relevant; we might see others like Google or Meta join later (Meta’s open-source approach is different, but they might do something via NIST too, especially for their powerful models that are not fully open).

Kelly mentions working with the UK’s AI Safety Institute, which indicates alignment with allies on AI oversight. U.S. and UK sharing test results can reduce duplication and set common standards so AI companies aren’t facing wildly different expectations in each country.

It’s a step toward global governance norms for AI, something experts have called for to manage frontier AI which doesn’t respect borders.

In essence, what we’re seeing is the beginning of a “test before trust” regime for AI. Much like new drugs go through trials or new airplanes through test flights, advanced AI models are starting to get independent check-ups.

Anthropic’s cooperation shows that at least some AI labs view this positively – presumably they feel their models will shine under examination, or at least that the process will be fair and beneficial.

And for Anthropic’s mission of safety, it’s a way to operationalize their values: allowing an external watchdog to validate their safety claims. This likely boosts confidence among cautious customers (like government or healthcare clients) that might use Claude.

We’ll be watching what tangible outputs come from this collaboration. NIST might publish some findings or methodologies (without spilling proprietary details). Perhaps we’ll see new NIST AI test suites as a result. If issues are found, will we hear about them? Maybe not publicly, but one hopes it will influence improvements quietly.

Over time, if such testing becomes standard, it could be one of the key mechanisms that ensure powerful AI systems are robust and safe – balancing the rapid pace of AI advancement with the diligence of independent oversight. Anthropic’s partnership with NIST thus stands as a milestone in bridging tech innovation with public accountability.

Leave a ReplyCancel Reply