Notion and Stripe trust this year-old startup to make sure their AI tools work

Ankur Goyal built infrastructure for AI products twice before at Impira and Figma. Now, his AI evaluations startup Braintrust is valued at about $150 million following a $36 million Series A round led by a16z.

Braintrust CEO Ankur Goyal, center, with the Braintrust team. Goyal says his team can make it easier for “anyone” to build software that makes use of AI. (Image: Braintrust)

Three months after a tech company releases new software utilizing artificial intelligence, Ankur Goyal’s team usually gets the call for help. As more users try the company’s shiny new AI tools, complaints that they provide nonsensical answers follow close behind.

“People are good at predicting what people will need an AI tool for. The hard part is that if you just type something up and ship it, it doesn’t work,” Goyal said. “It’s like baking without measuring the ingredients: you end up with mush.”

Goyal’s startup, Braintrust, tackles that challenge. Its software evaluates and monitors how an AI product is performing, then helps pinpoint the problem when something breaks. Bake in Braintrust for just a few weeks, and companies typically see their AI products’ self-reported accuracy – how they score on a test, such as one for factual accuracy – soar from below 40% to north of 80%, Goyal claimed.

Just a year old, Braintrust is already used by several unicorns, including Airtable, Brex, Instacart and Stripe. The company’s dozens of customers, a group that has doubled in the past three months, Goyal said, typically pay tens of thousands of dollars and sometimes more than $100,000. Now, Braintrust has raised $36 million in a Series A round led by a16z partner Martin Casado to reach more companies beyond the Silicon Valley bubble.

The funding values the San Francisco-based startup at about $150 million, according to a source with knowledge of the deal. Braintrust and a16z declined to comment on the valuation. Cloud leaders Databricks and Datadog also joined the round.

Braintrust works by providing a software development kit, or SDK, for a business to operate within its own IT infrastructure. At first, AI early adopters like Notion and Zapier layered its evaluations (“evals” to industry practitioners) on top of what they’d already built to better gauge performance. It could show them how tweaks, like using more customized prompts or switching from OpenAI’s GPT-4 to Anthropic’s Claude, helped or hurt accuracy.

Today, companies are integrating Braintrust’s evaluations before a tool launches, then using its monitoring tools to track and trace how well and how often it’s delivering prompts. The company also offers building blocks of code, known as functions, to help product developers and others newer to AI bridge the gap between a company’s non-AI core product features and its experimental, model-based additions. “We’re not an AI product ourselves, although there are AI components in it,” Goyal said. “We’re more a tool for helping everyone else to build AI software.”

With Braintrust, Goyal is hoping to solve a problem for others that he previously worked on twice: first at Impira, a startup he founded and led as CEO, which earned him a spot on the Forbes Under 30 list for 2020, and later at Figma, the product software unicorn that acquired Impira in January 2023. Impira’s use of machine learning to automatically extract data from documents and invoices required building out in-house evaluations; at Figma, Goyal helped with similar problems, but this time for features like the design software’s visual search, which can surface similar results from a user’s assets library.

Goyal only lasted eight months at Figma. While on a long walk with one of the backers of his previous company, Elad Gil, he realized that the jury-rigged solutions he’d built internally addressed a more universal and growing problem. After speaking with 25 potential customers, Goyal decided to launch Braintrust in August 2023.

“There was a common stack that enterprises need to build over and over in order to use modern AI approaches, and the key wedge into that stack is the evals,” Gil told Forbes. “This allows you to take the vibes out of AI development and actually understand what’s happening.”

At $5 billion-valued Zapier, cofounder Bryan Helmig said the workflow automation platform’s AI tools now handle tens of millions of tasks every month. But before using Braintrust, controls to catch when the AI hallucinated were “ad hoc.” The company turned to the startup first for its evaluations, then started using its logging and management of datasets of prompts, too. “It’s a grab bag of a lot of the utility stuff you need, the grunt work around deploying generative AI,” Helmig said.

Notion, the productivity software maker last valued at $10 billion, also fancied itself an aggressive early adopter of LLMs. But as developers shared individual failed prompts and errors with each other manually via files, “it felt like we were in the dark ages,” said cofounder Simon Last. One of Braintrust’s heaviest users, Notion is now pushing the startup to be able to process more and more of its product interactions, he added. “It lets us build much more complex stuff, more confidently,” he said.

With evals increasingly a subject of conversation and recognition among developers in the AI community (“The future belongs to those who do evals,” X.ai cofounder Greg Yang posted on X last month), Braintrust isn’t the only company eyeing solutions. A startup called Galileo announced its own models for evaluating other LLMs in June; one large software unicorn’s CTO told Forbes that their company had selected a U.K.-based challenger, Humanloop, instead of Braintrust for a trial. The model providers themselves are also unlikely to stay off Braintrust’s turf. A week ago, OpenAI announced its own basic evaluation tools.

For Braintrust and its supporters, more attention on evals is a good thing. Prominent AI research labs are power users of Braintrust, in frequent contact with the team for insights into how customers are using their models, Goyal said (he declined to name them). Braintrust also counts as investors and supporters the CEOs of Datadog, AI unicorn Hugging Face, customers Airtable and Instacart, as well as former Lattice CEO Jack Altman; OpenAI cofounder Greg Brockman and board director Adam D’Angelo, the CEO of answers site Quora, are also personal backers.

Investors also noted that Braintrust and its ilk will provide more neutral building blocks for companies looking to work with the major AI research labs interchangeably, similar to cloud businesses flourishing alongside heavyweights Amazon Web Services and Microsoft Azure. At Notion, cofounder Last said that companies would be “insane” to put all their eggs in one basket with one model provider like OpenAI. “Nobody wants to be stuck,” he said.

“I view this as a new systems layer, where you’ve got an unreliable substrate – the AI models – and you need a bunch of code to make programming work on top,” said a16z’s Casado, who first tested Braintrust against its peers as a hobbyist programmer looking to build games with LLMs, then again more formally with his firm. “I think there’s a new type of product developer that’s emerging… this can make you a sophisticated LLM programmer right away.”

The biggest risk to Braintrust, according to Goyal, is not that it might be outmaneuvered; it’s that AI tools in general might fail to live up to their promise. “If AI is impactful, Braintrust is incredibly well positioned,” the CEO said. “If it turns out not to be a big disruptive thing, well, that’s a bet I’m willing to take.”

This article was originally published on forbes.com and all figures are in USD.