Changelog

Week of 2024-10-07

After using "Copy to Dataset" to create a new dataset row, the audit log of the new row now links back to the original experiment, log, or other dataset.
Tools now stream their stdout and stderr to the UI. This is helpful for debugging.
Fix prompt, scorer, and tool dropdowns to only show the correct function types.

Week of 2024-09-30

The GitHub action now supports Python runtimes.
Add support for Cerebras models in the proxy, playground, and saved prompts.
You can now create span iframe viewers to visualize span data in a custom iframe. In this example, the "Table" section is a custom span iframe.
NOT LIKE, NOT ILIKE, NOT INCLUDES, and NOT CONTAINS are now supported in BTQL.
Add "Upload Rows" button to insert rows into an existing dataset from CSV or JSON.
Add "Maximum" aggregate score type.
The experiment table now supports grouping by input (for trials) or by a metadata field.
- The Name and Input columns are now pinned
Gemini models now support multimodal inputs.

Week of 2024-09-23

Basic monitor page that shows aggregate values for latency, token count, time to first token, and cost for logs.
Create custom tools to use in your prompts and in the playground. See the docs for more details.
Set org-wide environment variables to use in these tools
Pull your prompts to your codebase using the braintrust pull command.
Select and compare multiple experiments in the experiment view using the compared with dropdown.
The playground now displays aggregate scores (avg/max/min) for each prompt and supports sorting rows by a score.
Compare span field values side-by-side in the trace viewer when fullscreen and diff mode is enabled.

SDK (version 0.0.160)

Fix a bug with setFetch() in the TypeScript SDK.

SDK (version 0.0.159)

In Python, running the CLI with --verbose now uses the INFO log level, while still printing full stack traces. Pass the flag twice (-vv) to use the DEBUG log level.
Create and push custom tools from your codebase with braintrust push. See docs for more details. TypeScript only for now.
A long-awaited feature: you can now pull prompts to your codebase using the braintrust pull command. TypeScript only for now.

API (version 0.0.56)

Hosted tools are now available in the API.
Environment variables are now supported in the API (not yet in the standard REST API). See the docker compose file for information on how to configure the secret used to encrypt them if you are using Docker.
Automatically backfill function_data for prompts created via the API.

Week of 2024-09-16

The tag picker now includes tags that were added dynamically via API, in addition to the tags configured for your project.
Added a REST API for managing AI secrets. See docs.

SDK (version 0.0.158)

A dedicated update method is now available for datasets.
Fixed a Python-specific error causing experiments to fail initializing when git diff --cached encounters invalid or inaccessible Git repositories.
Token counts have the correct units when printing ExperimentSummary objects.
In Python, MetricSummary.metric could have an int value. The type annotation has been updated.

Week of 2024-09-09

You can now create server-side online evaluations for your logs. Online evals support both autoevals and custom scorers you define as LLM-as-a-judge, TypeScript, or Python functions. See docs for more details.

New member invitations now support being added to multiple permission groups.
Move datasets and prompts to a new Library navigation tab, and include a list of custom scorers.
Clean up tree view by truncating the root preview and showing a preview of a node only if collapsed.
Automatically save changes to table views.

Week of 2024-09-02

You can now upload TypeScript evals from the command line as functions, and then use them in the playground.
Click a span field line to highlight it and pin it to the URL.
Copilot tab autocomplete for prompts and data in the Braintrust UI.

# This will bundle and upload the task and scorer functions to Braintrust
npx braintrust eval --bundle

API (version 0.0.54)

Support for bundled eval uploads.
The PATCH endpoint for prompts now supports updating the slug field.

SDK (version 0.0.157)

Enable the --bundle flag for braintrust eval in the TypeScript SDK.

Week of 2024-08-26

Basic filter UI (no BTQL necessary)
Add to dataset dropdown now supports adding to datasets across projects.
Add REST endpoint for batch-updating ACLs: /v1/acl/batch_update.
Cmd/Ctrl click on a table row to open it in a new tab.
Show the last 5 basic filters in the filter editor.
You can now explicitly set and edit prompt slugs.

SDK (version 0.0.155)

The client wrappers wrapOpenAI()/wrap_openai() now support Structured Outputs.

API (version 0.0.54) [Upcoming]

Don't fail insertion requests if realtime broadcast fails

Week of 2024-08-19

Fixed comment deletion.
You can now use % in BTQL queries to represent percent values. E.g. 50% will be interpreted as 0.5.
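As a rough illustration of the percent conversion described above, the sketch below turns a literal such as 50% into its fractional value. The helper name parsePercentLiteral is hypothetical and is not part of BTQL or the Braintrust SDK; it only mirrors the arithmetic (divide by 100).

```typescript
// Hypothetical helper, for illustration only: BTQL performs this
// conversion internally, and its actual parser is not shown in the changelog.
function parsePercentLiteral(literal: string): number {
  // Accept an optional sign, digits, an optional decimal part, then "%".
  const match = /^(-?\d+(?:\.\d+)?)%$/.exec(literal.trim());
  if (match === null) {
    throw new Error(`Not a percent literal: ${literal}`);
  }
  // A percent value is the number divided by 100.
  return Number(match[1]) / 100;
}

console.log(parsePercentLiteral("50%")); // 0.5
```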

API (version 0.0.54) [Upcoming]

Performance optimizations to filters on scores, metrics, and created fields.
Performance optimizations to filter subfields of metadata and span_attributes.

Week of 2024-08-12

You can now create custom LLM and code (TypeScript and Python) evaluators in the playground.

Fullscreen trace toggle
Datasets now accept JSON file uploads
When uploading a CSV/JSON file to a dataset, columns/fields named input, expected, and metadata are now auto-assigned to the corresponding dataset fields
Fix bug in logs/dataset viewer when changing the search params.
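The auto-assignment of uploaded columns could be pictured roughly as follows. This is a sketch under assumptions: the function name, the exact-match comparison on column names, and the "unassigned" bucket for other columns are all hypothetical, since the changelog does not say how remaining columns are handled.

```typescript
type Row = Record<string, unknown>;

interface MappedRow {
  input?: unknown;
  expected?: unknown;
  metadata?: unknown;
  // Hypothetical bucket: the changelog does not specify where other columns go.
  unassigned: Row;
}

// Route columns named input, expected, and metadata to the matching
// dataset fields; keep everything else aside for manual assignment.
function autoAssignFields(row: Row): MappedRow {
  const mapped: MappedRow = { unassigned: {} };
  for (const key of Object.keys(row)) {
    if (key === "input" || key === "expected" || key === "metadata") {
      mapped[key] = row[key];
    } else {
      mapped.unassigned[key] = row[key];
    }
  }
  return mapped;
}

console.log(autoAssignFields({ input: "2+2", expected: "4", notes: "arithmetic" }));
```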

API (version 0.0.53)

The API now supports running custom LLM and code (TypeScript and Python) functions. To enable this in your deployment:
- AWS Cloudformation stack: turn on the EnableQuarantine parameter
- Docker deployment: set the ALLOW_CODE_FUNCTION_EXECUTION environment variable to true

Week of 2024-08-05

Full text search UI for all span contents in a trace
New metrics in the UI and summary API: prompt tokens, completion tokens, total tokens, and LLM duration
- These metrics, along with cost, now exclude LLM calls used in autoevals (as of 0.0.85)
Switching organizations via the header navigates to the same-named project in the selected organization
Added MarkAsyncWrapper to the Python SDK to allow explicitly marking functions which return awaitable objects as async

Autoevals (version 0.0.85)

LLM calls used in autoevals are now marked with span_attributes.purpose = "scorer" so they can be excluded from metric and cost calculations.
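To show how the purpose marker can be used, here is a minimal sketch that sums token counts while skipping scorer spans. Only span_attributes.purpose = "scorer" comes from the entry above; the SpanRecord shape, the metrics.tokens field, and the function name are assumptions made for the example.

```typescript
// Assumed, simplified span shape for illustration.
interface SpanRecord {
  span_attributes: { purpose?: string };
  metrics: { tokens: number };
}

// Sum tokens over all spans except those marked as scorer calls.
function totalTokensExcludingScorers(spans: SpanRecord[]): number {
  return spans
    .filter((span) => span.span_attributes.purpose !== "scorer")
    .reduce((sum, span) => sum + span.metrics.tokens, 0);
}

const exampleSpans: SpanRecord[] = [
  { span_attributes: {}, metrics: { tokens: 120 } },
  { span_attributes: { purpose: "scorer" }, metrics: { tokens: 30 } },
  { span_attributes: {}, metrics: { tokens: 80 } },
];

console.log(totalTokensExcludingScorers(exampleSpans)); // 200
```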

Autoevals (version 0.0.84)

Fix a bug where rationale was incorrectly formatted in Python.
Update the full docker deployment configuration to bundle the metadata DB (supabase) inside the main docker compose file. Thus no separate supabase cluster is required. See docs for details. If you are upgrading an existing full deployment, you will likely want to mark the supabase db volumes external to continue using your existing data (see comments in the docker-compose.full.yml file for more details).

SDK (version 0.0.151)

Eval() can now take a base experiment. Provide either baseExperimentName/base_experiment_name or baseExperimentId/base_experiment_id.

Week of 2024-07-29

Errors now show up in the trace viewer.
New cookbook recipe on benchmarking LLM providers.
Viewer mode selections no longer automatically switch to a non-editable view when the field is editable, and they now persist across trace/span changes.
Show % in diffs instead of pp.
Add rename, delete and copy current project id actions to the project dropdown.
Playgrounds can now be shared publicly.
Duration now reflects the "task" duration not the overall test case duration (which also includes scores).
Duration is now also displayed in the experiment overview table.
Add support for Fireworks and Lepton inference providers.
"Jump to" menu to quickly navigate between span sections.
Speed up queries involving metadata fields, e.g. metadata.foo ILIKE '%bar%', using the columnstore backend if it is available.
Added project_id query param to REST API queries which already accept project_name. E.g. GET experiments.
Update to include the latest Mistral models in the proxy/playground.

SDK (version 0.0.148)

While tracing, if your code errors, the error will be logged to the span. You can also manually log the error field through the API or the logging SDK.

SDK (version 0.0.147)

In TypeScript, the invoke(...) function now uses camelCase argument names (project_name is now projectName, etc.).
Eval() return values are printed in a nicer format (e.g. in Notebooks)
updateSpan()/update_span() allows you to update a span's fields after it has been created.

Week of 2024-07-22

Categorical human review scores can now be re-ordered via Drag-n-Drop.
Human review row selection is now a free text field, enabling a quick jump to a specific row.
Added REST endpoint for managing org membership. See docs.

API (version 0.0.51)

The proxy is now a first-class citizen in the API service, which simplifies deployment and sets the groundwork for some exciting new features. Here is what you need to know:
- The updates are available as of API version 0.0.51.
- The proxy is now accessible at https://api.braintrust.dev/v1/proxy. You can use this as a base URL in your OpenAI client, instead of https://braintrustproxy.com/v1. [NOTE: The latter is still supported, but will be deprecated in the future.]
- If you are self-hosting, the proxy is now bundled into the API service. That means you no longer need to deploy the proxy as a separate service.
- If you have deployed through AWS, after updating the Cloudformation, you'll need to grab the "Universal API URL" from the "Outputs" tab.


Then, replace that in your settings page.


If you have a Docker-based deployment, you can just update your containers.
Once you see the "Universal API" indicator, you can remove the proxy URL from your settings page, if you have it set.

SDK (version 0.0.146)

Add support for max_concurrency in the Python SDK
Hill climbing evals that use a BaseExperiment as data will use that as the default base experiment.

Week of 2024-07-15

In preparation for auth changes, we are making a series of updates that may affect self-deployed instances:
- Preview URLs will now be subdomains of *.preview.braintrust.dev instead of vercel.app. Please add this domain to your allow list.
- To continue viewing preview URLs, you will need to update your stack (to update the allow list to include the new domain pattern).
- The data plane may make requests back to *.preview.braintrust.dev URLs. This allows you to test previews that include control plane changes. You may need to whitelist traffic from the data plane to *.preview.braintrust.dev domains.
- Requests will optionally send an additional x-bt-auth-token header. You may need to whitelist this header.
- User impersonation through the x-bt-impersonate-user header now accepts either the user's id or email. Previously only user id was accepted.

Autoevals (version 0.0.80)

New ExactMatch scorer for comparing two values for exact equality.

Autoevals (version 0.0.77)

Officially switch the default model to be gpt-4o. Our testing showed that it performed on average 10% more accurately than gpt-3.5-turbo!
Support claude models (e.g. claude-3-5-sonnet-20240620). You can use them by simply specifying the model param in any LLM based evaluator.
- Under the hood, this will use the proxy, so make sure to configure your Anthropic API keys in your settings.

Week of 2024-07-08

Human review scores are now sortable from the project configuration page.
Streaming support for tool calls in Anthropic models through the proxy and playground.
The playground now supports different "parsing" modes:
- auto: (same as before) the completion text and the first tool call arguments, if any
- parallel: the completion text and a list of all tool calls
- raw: the completion in the OpenAI non-streaming format
- raw_stream: the completion in the OpenAI streaming format
Cleaned up environment variables in the public docker deployment. Functionally, nothing has changed.

If you are running the full-mode deployment, next time you update your docker images, please make sure to pull the latest compose file. Specifically, we added a new env var CHALICE_LOCAL_USE_LOCAL_ENV: 1 to the braintrust-standalone-api and braintrust-standalone-proxy containers.

Autoevals (version 0.0.76)

New .partial(...) syntax to initialize a scorer with partial arguments like criteria in ClosedQA.
Allow messages to be inserted in the middle of a prompt.

Week of 2024-07-01

Table views can now be saved, persisting the BTQL filters, sorts, and column state.
Add support for the new window.ai model into the playground.
Use push history when navigating table rows to allow for back button navigation.
In the experiments list, grouping by a metadata field will group rows in the table as well.
Allow the trace tree panel to be resized.
Port the log summary query to BTQL. This should speed up the query, especially if you have clickhouse configured in your cloud environment. This functionality requires upgrading your data backend to version 0.0.50.

SDK (version 0.0.140)

New wrapTraced function allows you to trace JavaScript functions in a more ergonomic way.

import { wrapTraced } from "braintrust";
 
const foo = wrapTraced(async function foo(input) {
  const resp = await client.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [{ role: "user", content: input }],
  });
  return resp.choices[0].message.content ?? "unknown";
});

SDK (version 0.0.138)

The TypeScript SDK's Eval() function now takes a maxConcurrency parameter, which bounds the number of concurrent tasks that run.
braintrust install api now sets up your API and Proxy URL in your environment.
You can now specify a custom fetch implementation in the TypeScript SDK.
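For readers unfamiliar with what a maxConcurrency bound means in practice, the sketch below shows one common way to cap the number of in-flight async tasks. It is not the SDK's implementation; mapWithConcurrency is a hypothetical standalone helper that only illustrates the idea.

```typescript
// Hypothetical helper: run `task` over `items` with at most
// `maxConcurrency` tasks in flight, preserving result order.
async function mapWithConcurrency<T, R>(
  items: T[],
  maxConcurrency: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let next = 0; // index of the next item to claim

  // Each worker repeatedly claims the next unprocessed index.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index] as T);
    }
  }

  const workerCount = Math.max(1, Math.min(maxConcurrency, items.length));
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i++) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}
```

Because JavaScript runs these callbacks on a single thread, claiming an index with next++ needs no extra locking.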

Week of 2024-06-24

Update the experiment progress and experiment score distribution chart layouts
Format table column headers with icons
Move active filters to the table toolbar
Enable RBAC for all users. When inviting a new member, prompt to add that member to an RBAC Permission group.
Use BTQL to power the datasets list, making it significantly faster if you have multiple large datasets.
Experiments list chart supports click interactions. Left click to select an experiment, right click to add an annotation.
Jump into comparison view between 2 experiments by selecting them in the table and clicking "Compare".

Deployment

The proxy service now supports more advanced functionality, which requires setting the PG_URL and REDIS_URL parameters. If you do not set them, the proxy will still run, but without caching credentials or requests.

Week of 2024-06-17

Add support for labeling expected fields using human review.
Create and edit descriptions for datasets.
Create and edit metadata for prompts.
Click scores and attributes (tree view only) in the trace view to filter by them.
Highlight the experiments graph to filter down the set of experiments.
Add support for new models including Claude 3.5 Sonnet.

Week of 2024-06-10

Improved empty state and instructions for custom evaluators in the playground.
Show query examples when filtering/sorting.
Custom comparison keys for experiments.
New model dropdown in the playground/prompt editor that is organized by provider and model type.

Week of 2024-06-03

You can now collapse the trace tree. It's auto collapsed if you have a single span.
Improvements to the experiment chart including greyed out lines for inactive scores and improved legend.
Show diffs when you save a new prompt version.


Week of 2024-05-27

You can now see which users are viewing the same traces as you are in real-time.
Improve whitespace and presentation of diffs in the trace view.
Show markdown previews in score editor.
Show cost in spans and display the average cost on experiment summaries and diff views.
Published a new Text2SQL eval recipe
Add groups view for RBAC.

Week of 2024-05-20

Deprecate the legacy dataset format (output in place of expected) in a new version of the SDK (0.0.130). For now, data can still be fetched in the legacy format by setting the useOutput / use_output flag to true when using initDataset() / init_dataset(). We recommend updating your code to use datasets with expected instead of output as soon as possible.
Improve the UX for saving and updating prompts from the playground.
New hide/show column controls on all tables.
New model comparison cookbook recipe.
Add support for model / metadata comparison on the experiments view.
New experiment picker dropdown.
Markdown support in the LLM message viewer.

Week of 2024-05-13

Support copying to clipboard from input, output, etc. views
Improve the empty-state experience for datasets.
New multi-dimensional charts on the experiment page for comparing models and model parameters.
Support HTTPS_PROXY, HTTP_PROXY, and NO_PROXY environment variables in the API containers.
Support infinite scroll in the logs viewer and remove dataset size limitations.

Week of 2024-05-06

Denser trace view with span durations built in.
Rework pagination and fix scrolling across multiple pages in the logs viewer.
Make BTQL the default search method.
Add support for Bedrock models in the playground and the proxy.
Add "copy code" buttons throughout the docs.
Automatically overflow large objects (e.g. experiments) to S3 for faster loading and better performance.

Week of 2024-04-29

Show images in the LLM view of the trace viewer.
Send an invite email when you invite a new user to your organization.
Support selecting/deselecting scores in the experiment view.
Roll out Braintrust Query Language (BTQL) for querying logs and traces.

Week of 2024-04-22

Smart relative time labels for dates (1h ago, 3d ago, etc.)
Added double quoted string literals support, e.g., tags contains "foo".
Jump to top button in trace details for easier navigation.
Fix a race condition in distributed tracing, in which subspans could hit the backend before their parent span, resulting in an inaccurate trace structure.

As part of this change, we removed the parent_id argument from the latest SDK, which was previously deprecated in favor of parent. parent_id is only able to use the race-condition-prone form of distributed tracing, so we felt it would be best for folks to upgrade any of their usages from parent_id to parent. Before upgrading your SDK, if you are currently using parent_id, you can port over to using parent by changing any exported IDs from span.id to span.export() and then changing any instances of parent_id=[span_id] to parent=[exported_span].

For example, if you had distributed tracing code like the following:

import { initLogger } from "braintrust";
 
const logger = initLogger({
  projectName: "My Project",
  apiKey: process.env.BRAINTRUST_API_KEY,
});
 
export async function POST(req: Request) {
  return logger.traced(async (span) => {
    const { body } = req;
    const result = await someLLMFunction(body);
    span.log({ input: body, output: result });
    return {
      result,
      requestId: span.id,
    };
  });
}
 
export async function POSTFeedback(req: Request) {
  logger.traced(
    async (span) => {
      logger.logFeedback({
        id: span.id, // Use the newly created span's id, instead of the original request's id
        comment: req.body.comment,
        scores: {
          correctness: req.body.score,
        },
        metadata: {
          user_id: req.user.id,
        },
      });
    },
    {
      parentId: req.body.requestId,
      name: "feedback",
    },
  );
}

It would now look like this:

import { initLogger } from "braintrust";
 
const logger = initLogger({
  projectName: "My Project",
  apiKey: process.env.BRAINTRUST_API_KEY,
});
 
export async function POST(req: Request) {
  return logger.traced(async (span) => {
    const { body } = req;
    const result = await someLLMFunction(body);
    span.log({ input: body, output: result });
    return {
      result,
      requestId: span.export(),
    };
  });
}
 
export async function POSTFeedback(req: Request) {
  logger.traced(
    async (span) => {
      logger.logFeedback({
        id: span.id, // Use the newly created span's id, instead of the original request's id
        comment: req.body.comment,
        scores: {
          correctness: req.body.score,
        },
        metadata: {
          user_id: req.user.id,
        },
      });
    },
    {
      parent: req.body.requestId,
      name: "feedback",
    },
  );
}

Week of 2024-04-15

Incremental support for roles-based access control (RBAC) logic within the API server backend.

As part of this change, we removed certain API endpoints which are no longer in use, in particular the /crud/{object_type} endpoint. For the handful of usages of these endpoints in old versions of the SDK libraries, we added backwards-compatibility routes, but it is possible we may have missed a few. Please let us know if your code is trying to use an endpoint that no longer exists and we can remediate.

Changed the semantics of experiment initialization with update=True. Previously, we required the experiment to already exist. Now we create the experiment if it doesn't already exist, and otherwise return the existing one.

This change affects the semantics of the PUT /v1/experiment operation, so that it will not replace the contents of an existing experiment with a new one, but instead just return the existing one, meaning it behaves the same as POST /v1/experiment. Eventually we plan to revise the update semantics for other object types as well. Therefore, we have deprecated the PUT endpoint across the board and plan to remove it in a future revision of the API.
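The new "create if missing, otherwise return the existing one" behavior can be sketched with a small in-memory store. Everything here (the class, the record shape, the id format) is hypothetical and only models the semantics described above; it is not the API server's code.

```typescript
interface ExperimentRecord {
  id: string;
  name: string;
}

// Hypothetical in-memory stand-in for the experiment table.
class InMemoryExperimentStore {
  private byName: Record<string, ExperimentRecord> = {};
  private nextId = 1;

  // Return the existing experiment untouched, or create it if it is missing.
  getOrCreate(name: string): { experiment: ExperimentRecord; created: boolean } {
    const existing = this.byName[name];
    if (existing !== undefined) {
      return { experiment: existing, created: false };
    }
    const experiment: ExperimentRecord = { id: `exp-${this.nextId++}`, name };
    this.byName[name] = experiment;
    return { experiment, created: true };
  }
}

const store = new InMemoryExperimentStore();
const first = store.getOrCreate("baseline");
const second = store.getOrCreate("baseline");
console.log(first.created, second.created, first.experiment.id === second.experiment.id);
```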

Week of 2024-04-08

Added support for new multimodal models (gpt-4-turbo, gpt-4-vision-preview, gpt-4-1106-vision-preview, gpt-4-turbo-2024-04-09, claude-3-opus-20240229, claude-3-sonnet-20240229, claude-3-haiku-20240307).
Introduced REST API for RBAC (Role-Based Access Control) objects including CRUD operations on roles, groups, and permissions, and added a read-only API for users.
Improved AI search and added positive/negative tag filtering in AI search. To positively filter, prefix the tag with +, and to negatively filter, prefix the tag with -.

We are making some systematic changes to the search experience, and the search syntax is subject to change.
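As an illustration of the + and - tag prefixes, the sketch below splits a query into included and excluded tags and checks a record against them. The whitespace-separated token format and both function names are assumptions for the example; the actual search syntax may differ and, as noted, is subject to change.

```typescript
interface TagFilter {
  include: string[];
  exclude: string[];
}

// Hypothetical parser: "+tag" means the tag must be present,
// "-tag" means it must be absent. Other tokens are ignored here.
function parseTagFilter(query: string): TagFilter {
  const filter: TagFilter = { include: [], exclude: [] };
  for (const token of query.split(/\s+/)) {
    if (token.length < 2) continue;
    if (token[0] === "+") {
      filter.include.push(token.slice(1));
    } else if (token[0] === "-") {
      filter.exclude.push(token.slice(1));
    }
  }
  return filter;
}

// A record matches when it has every included tag and none of the excluded ones.
function matchesTags(tags: string[], filter: TagFilter): boolean {
  const hasAllIncluded = filter.include.every((t) => tags.indexOf(t) !== -1);
  const hasNoExcluded = filter.exclude.every((t) => tags.indexOf(t) === -1);
  return hasAllIncluded && hasNoExcluded;
}
```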

Week of 2024-04-01

Added functionality for distributed tracing. See the docs for more details.

As part of this change, we had to rework the core logging implementation in the SDKs to rely on some newer backend API features. Therefore, if you are hosting Braintrust on-prem, before upgrading your SDK to any version >= 0.0.115, make sure your API version is >= 0.0.35. You can query the version of the on-prem server with curl [api-url]/version, where the API URL can be found on the settings page.
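If you want to script the compatibility check above, something like the following sketch could work. The version thresholds come from this entry; the helper names are hypothetical, and the code assumes you have already extracted a plain dotted version string (the exact response format of the /version endpoint is not described here).

```typescript
// Compare dotted numeric versions component by component (not lexically).
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  const len = Math.max(pa.length, pb.length);
  for (let i = 0; i < len; i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff > 0 ? 1 : -1;
  }
  return 0;
}

const FIRST_SDK_NEEDING_NEW_API = "0.0.115";
const MIN_API_FOR_NEW_SDK = "0.0.35";

// Hypothetical check: SDK versions >= 0.0.115 need API version >= 0.0.35.
function canUpgradeSdk(targetSdk: string, apiVersion: string): boolean {
  if (compareVersions(targetSdk, FIRST_SDK_NEEDING_NEW_API) < 0) {
    return true; // older SDKs do not rely on the newer backend features
  }
  return compareVersions(apiVersion, MIN_API_FOR_NEW_SDK) >= 0;
}

console.log(canUpgradeSdk("0.0.115", "0.0.34")); // false
```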

Week of 2024-03-25

Introduce multimodal support for OpenAI and Anthropic models in the prompt playground and proxy. You can now pass image URLs, base64-encoded image strings, or mustache template variables to models that support multimodal inputs.
The REST API now gzips responses.
You can now return dynamic arrays of scores in Eval() functions (docs).
Launched Reporters, a way to summarize and report eval results in a custom format.
New coat of paint in the trace view.
Added support for Clickhouse as an additional storage backend, offering a more scalable solution for handling large datasets and performance improvements for certain query types. You can enable it by setting the UseManagedClickhouse parameter to true in the CloudFormation template or installing the docker container.
Implemented realtime checks using a WebSocket connection and updated proxy configurations to include CORS support.
Introduced an API version checker tool so you know when your API version is outdated.

Week of 2024-03-18

Add new database parameters for external databases in the CloudFormation template.
Faster optimistic updates for large writes in the UI.
"Open in playground" now opens a lighter weight modal instead of the full playground.
You can now create a new prompt playground from the prompt viewer.

Week of 2024-03-11

Shipped support for prompt management.
Moved playground sessions to be within projects. All existing sessions are now in the "Playground Sessions" project.
Allowed customizing proxy and real-time URLs through the web application, adding flexibility for different deployment scenarios.
Improved documentation for Docker deployments.
Improved folding behavior in data editors.

Week of 2024-03-04

Support custom models and endpoint configuration for all providers.
New add team modal with support for multiple users.
New information architecture to enable faster project navigation.
Experiment metadata now visible in the experiments table.
Improve UI write performance with batching.
Log filters now apply to any span.
Share button for traces
Images now supported in the tree view (see tracing docs for more).

Week of 2024-02-26

Show auto scores before manual scores (matching trace) in the table
New logo is live!
Any span can now submit scores, which automatically average in the trace. This makes it easier to label scores in the spans where they originate.
Improve sidebar scrolling behavior.
Add AI search for datasets and logs.
Add tags to the SDK.
Support viewing and updating metadata on the experiment page.
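One way to picture span-level scores averaging into the trace is sketched below. It assumes that scores sharing a name are averaged across the spans that reported them; the function name and data shape are hypothetical and do not reflect the backend's actual implementation.

```typescript
type Scores = Record<string, number>;

// Hypothetical roll-up: average each named score over the spans that logged it.
function averageScoresAcrossSpans(spanScores: Scores[]): Scores {
  const totals: Record<string, { total: number; count: number }> = {};
  for (const scores of spanScores) {
    for (const name of Object.keys(scores)) {
      const entry = totals[name] ?? { total: 0, count: 0 };
      entry.total += scores[name] as number;
      entry.count += 1;
      totals[name] = entry;
    }
  }
  const averaged: Scores = {};
  for (const name of Object.keys(totals)) {
    const entry = totals[name] as { total: number; count: number };
    averaged[name] = entry.total / entry.count;
  }
  return averaged;
}

console.log(averageScoresAcrossSpans([{ correctness: 1 }, { correctness: 0, relevance: 0.5 }]));
```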

Week of 2024-02-19

We rolled out a breaking change to the REST API that renames the output field to expected on dataset records. This change brings the API in line with last week's update to the Braintrust SDK. For more information, refer to the REST API docs for dataset records (insert and fetch).

Add support for tags.
Score fields are now sorted alphabetically.
Add support for Groq models.
Improve tree viewer and XML parser.
New experiment page redesign

Week of 2024-02-12

We are rolling out a change to dataset records that renames the output field to expected. If you are using the SDK, datasets will still fetch records using the old format for now, but we recommend future-proofing your code by setting the useOutput / use_output flag to false when calling initDataset() / init_dataset(), which will become the default in a future version of Braintrust.

When you set useOutput to false, your dataset records will contain expected instead of output. This makes it easy to use them with Eval(...) to provide expected outputs for scoring, since you'll no longer have to manually rename output to expected when passing data to the evaluator:

import { Eval, initDataset } from "braintrust";
import { Levenshtein } from "autoevals";
 
Eval("My Eval", {
  data: initDataset("Existing Dataset", { useOutput: false }), // Records will contain `expected` instead of `output`
  task: (input) => "foo",
  scores: [Levenshtein],
});

Here's an example of how to insert and fetch dataset records using the new format:

import { initDataset } from "braintrust";
 
// Currently `useOutput` defaults to true, but this will change in a future version of Braintrust.
const dataset = initDataset("My Dataset", { useOutput: false });
 
dataset.insert({
  input: "foo",
  expected: { result: 42, error: null }, // Instead of `output`
  metadata: { model: "gpt-3.5-turbo" },
});
await dataset.flush();
 
for await (const record of dataset) {
  console.log(record.expected); // Instead of `record.output`
}

Support duplicate Eval names.
Fallback to BRAINTRUST_API_KEY if OPENAI_API_KEY is not set.
Throw an error if you use experiment.log and experiment.start_span together.
Add keyboard shortcuts (j/k/p/n) for navigation.
Increased tooltip size and delay for better usability.
Support more viewing modes: HTML, Markdown, and Text.
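The API key fallback mentioned above amounts to a simple lookup order. The sketch below is an assumption-laden illustration: resolveApiKey is a hypothetical helper, the environment is passed in as a plain object, and empty strings are treated as unset.

```typescript
// Hypothetical helper: prefer OPENAI_API_KEY, fall back to BRAINTRUST_API_KEY.
function resolveApiKey(env: Record<string, string | undefined>): string {
  // `||` (rather than `??`) treats an empty string as "not set" (an assumption).
  const key = env["OPENAI_API_KEY"] || env["BRAINTRUST_API_KEY"];
  if (!key) {
    throw new Error("Set OPENAI_API_KEY or BRAINTRUST_API_KEY");
  }
  return key;
}

console.log(resolveApiKey({ BRAINTRUST_API_KEY: "bt-example" })); // "bt-example"
```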

Week of 2024-02-05

Playground

Tons of improvements to the prompt playground:
- A new "compact" view, that shows just one line per row, so you can quickly scan across rows. You can toggle between the two modes.
- Loading indicators per cell
- The run button transforms into a "Stop" button while you are streaming data
- Prompt variables are now syntax highlighted in purple and use a monospace font
- Tab now autocompletes
- We no longer auto-create variables as you're typing (was causing more trouble than helping)
- Slider params like max_tokens are now optional
Cloudformation now supports more granular RDS configuration (instance type, storage, etc)
Support optional slider params
- Made certain parameters like max_tokens optional.
- Accompanies pull request https://github.com/braintrustdata/braintrust-proxy/pull/23.
Lots of style improvements for tables.
- Fixed filter bar styles.
- Rendered JSON cell values using monospace type.
- Adjusted margins for horizontally scrollable tables.
- Implemented a smaller size for avatars in tables.
Deleting a prompt takes you back to the prompts tab

Week of 2024-01-29

New REST API.
Cookbook of common use cases and examples.
Support for custom models in the playground.
Search now works across spans, not just top-level traces.
Show creator avatars in the prompt playground
Improved UI breadcrumbs and sticky table headers

Week of 2024-01-22

UI improvements to the playground.
Added an example of closed QA / extra fields.
New YAML parser and new syntax highlighting colors for data editor.
Added support for enabling/disabling certain git fields from collection (in org settings and the SDK).
Added new GPT-3.5 and 4 models to the playground.
Fixed scrolling jitter issue in the playground.
Made table fields in the prompt playground sticky.

Week of 2024-01-15

Added ability to download dataset as CSV
Added YAML support for logging and visualizing traces
Added JSON mode in the playground
Added span icons and improved readability
Enabled shift modifier for selecting multiple rows in Tables
Improved tables to allow editing expected fields and moved datasets to trace view

Week of 2024-01-08

Released new Docker deployment method for self hosting
Added ability to manually score results in the experiment UI
Added comments and audit log in the experiment UI

Week of 2024-01-01

Added ability to upload dataset CSV files in prompt playgrounds
Published new guide for tracing and logging your code
Added support to download experiment results as CSVs

Week of 2023-12-25

API keys are now scoped to organizations, so if you are part of multiple orgs, new API keys will only permit access to the org they belong to.
You can now search for experiments by any metadata, including their name, author, or even git metadata.
Filters are now saved in URL state so you can share a link to a filtered view of your experiments or logs.
Improve performance of project page by optimizing API calls.

We made several cleanups and improvements to the low-level TypeScript and Python SDKs (0.0.86). If you use the Eval framework, nothing should change for you, but keep in mind the following differences if you use the manual logging functionality:

Simplified the low-level tracing API (updated docs coming soon!)
- The current experiment and current logger are now maintained globally rather than as async-task-local variables. This makes it much simpler to start tracing with minimal code modification. Note that creating experiments/loggers with withExperiment/withLogger will now set the current experiment globally (visible across all async tasks) rather than local to a specific task. You may pass setCurrent: false/set_current=False to avoid setting the global current experiment/logger.
- In Python, the @traced decorator now logs the function input/output by default. This might interfere with code that already logs input/output inside the traced function. You may pass notrace_io=True as an argument to @traced to turn this logging off.
- In TypeScript, the traced method can start spans under the global logger, and is thus async by default. You may pass asyncFlush: true to these functions to make the traced function synchronous. Note that if the function tries to trace under the global logger, it must also have asyncFlush: true.
- Removed the withCurrent/with_current functions
- In typescript, the Span.traced method now accepts name as an optional argument instead of a required positional param. This matches the behavior of all other instances of traced. name is also now optional in python, but this doesn't change the function signature.
Experiments and Datasets are now lazily initialized, similar to Loggers. This means all write operations are immediate and synchronous, but any metadata accessor methods ([Experiment|Logger].[id|name|project]) are now async.
Undo auto-inference of force_login if login is invoked with different params than last time. Now login will only re-login if forceLogin: true/force_login=True is provided.
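
To illustrate the difference between async-task-local and global state described in the tracing notes above, here is a minimal sketch using only the Python standard library. It is not the SDK's implementation; the names task_local_current, global_current, and start_experiment are hypothetical.

```python
import asyncio
import contextvars

# Hypothetical stand-ins for "the current experiment" (not SDK internals).
task_local_current = contextvars.ContextVar("task_local_current", default=None)
global_current = None


async def start_experiment(name):
    """Record `name` as current in both a task-local and a global slot."""
    global global_current
    # A ContextVar set inside a task is only visible in that task's context.
    task_local_current.set(name)
    # A module-level global is visible from every task.
    global_current = name


async def main():
    # create_task runs the coroutine in a copy of the current context,
    # so the ContextVar change does not propagate back to this task.
    await asyncio.create_task(start_experiment("exp-1"))
    return task_local_current.get(), global_current


seen_locally, seen_globally = asyncio.run(main())
print(seen_locally, seen_globally)
```

In this sketch the parent task cannot see the task-local value but can see the global one, which is the behavior change the notes describe for withExperiment/withLogger.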

Week of 2023-12-18

Released the official 2023 Year-in-Review dashboard. Check out yours here!

Improved ergonomics for the Python SDK:
- The @traced decorator will automatically log inputs/outputs
- You no longer need to use context managers to scope experiments or loggers.
Enable skew protection in frontend deploys, so hopefully no more hard refreshes.
Added syntax highlighting in the sidepanel to improve readability.
Add jsonl mode to the eval CLI to log experiment summaries in an easy-to-parse format.
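
As a rough sketch of why a JSONL format is easy to parse: each line is an independent JSON document, so summaries can be read line by line. The field names below (experiment_name, scores, accuracy) are hypothetical and only illustrate the shape. Check the actual CLI output for the real keys.

```python
import json


def parse_jsonl(text):
    """Parse JSON Lines text into a list of objects, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical sample output; the real summary fields may differ.
sample = (
    '{"experiment_name": "exp-1", "scores": {"accuracy": 0.9}}\n'
    '{"experiment_name": "exp-2", "scores": {"accuracy": 0.8}}\n'
)

summaries = parse_jsonl(sample)
best = max(summaries, key=lambda s: s["scores"]["accuracy"])
print(best["experiment_name"])
```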

Week of 2023-12-11

Released new trials feature to rerun each input multiple times and collect aggregate results for a more robust score.
Added ability to run evals in the prompt playground. Use your existing dataset and the autoevals functions to score playground outputs.
Released new version of SDK (0.0.81) including a small breaking change. When setting the experiment name in the Eval function, the experimentName key should be moved out of metadata to a top-level argument. Before:

Eval([eval_name], {
  ...,
  metadata: {
    experimentName: [experimentName]
  }
})

After:

Eval([eval_name], {
  ...,
  experimentName: [experimentName]
})

Added support for Gemini and Mistral Platform in AI proxy and playground
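
The idea behind the trials feature mentioned above (rerun each input several times and aggregate the scores) can be sketched in plain Python. This is an illustration only, not how Braintrust implements trials; run_trials, exact_match, and flaky_task are hypothetical names, and the alternating task stands in for a non-deterministic LLM call.

```python
import itertools
from statistics import mean


def run_trials(data, task, scorer, trial_count=3):
    """Run `task` trial_count times per input and average the scores."""
    summary = {}
    for item in data:
        scores = [
            scorer(task(item["input"]), item["expected"])
            for _ in range(trial_count)
        ]
        summary[item["input"]] = {"scores": scores, "mean": mean(scores)}
    return summary


def exact_match(output, expected):
    """Score 1.0 for an exact match, 0.0 otherwise."""
    return 1.0 if output == expected else 0.0


# Stand-in for a flaky model: alternates between two outputs.
_suffixes = itertools.cycle(["", "!"])


def flaky_task(text):
    return "Hi " + text + next(_suffixes)


data = [{"input": "Foo", "expected": "Hi Foo"}]
summary = run_trials(data, flaky_task, exact_match, trial_count=4)
print(summary["Foo"]["mean"])
```

A single run of this task would score either 0 or 1 depending on luck; averaging over several trials gives a steadier estimate.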

Week of 2023-12-04

Enabled the prompt playground and datasets for free users
Added Together.ai models including Mixtral to AI Proxy
Turned prompts tab on organization view into a list
Removed data row limit for the prompt playground
Enabled configuration for dark mode and light mode in settings
Added automatic logging of a diff if an experiment is run on a repo with uncommitted changes

Week of 2023-11-27

Added experiment search on project view to filter by experiment name
Upgraded AI Proxy to support tracking Prometheus metrics
Modified Autoevals library to use the AI proxy
Upgraded Python braintrust library to parallelize evals
Optimized experiment diff view for performance improvements

Week of 2023-11-20

Added support for new Perplexity models (e.g. pplx-7b-online) to playground
Released AI proxy: access many LLMs using one API, with caching
Added load balancing endpoints to AI proxy
Updated org-level view to show projects and prompt playground sessions
Added ability to batch delete experiments
Added support for Claude 2.1 in playground

Week of 2023-11-13

Made experiment column resized widths persistent
Fixed our libraries including Autoevals to work with OpenAI’s new libraries
Added support for function calling and tools in our prompt playground
Added tabs on a project page for datasets, experiments, etc.

Week of 2023-11-06

Improved selectors for diffing and comparison modes on experiment view
Added support for new OpenAI models (GPT-4 preview, gpt-3.5-turbo-1106) in playground
Added support for open-source models (Mistral, Codellama, Llama2, etc.) in playground using Perplexity's APIs

Week of 2023-10-30

Improved experiment sidebar to be fully responsive and resizable
Improved tooltips within the web UI
Multiple performance optimizations and bug fixes

Week of 2023-10-23

Improved prompt playground variable handling and visualization
Added time duration statistics per row to experiment summaries

Multiple performance optimizations and bug fixes

Week of 2023-10-16

Launched new tracing feature: log and visualize complex LLM chains and executions.
Added a new “text-block” prompt type in the playground that returns a string or variable without an LLM call (useful for chaining prompts and debugging)
Increased the default number of rows per page from 10 to 100 for experiments
UI fixes and improvements for the side panel and tooltips
The experiment dashboard can be customized to show the most relevant charts

Week of 2023-10-09

Performance improvements related to user sessions

Week of 2023-10-02

All experiment loading HTTP requests are 100-200ms faster
The prompt playground now supports autocomplete
Dataset versions are now displayed on the datasets page

Projects in the summary page are now sorted alphabetically
Long text fields in logged data can be expanded into scrollable blocks
We evaluated the Alpaca evals leaderboard in Braintrust
New tutorial for fine-tuning GPT-3.5 and evaluating with Braintrust

Week of 2023-09-18

The Eval framework is now supported in Python! See the updated evals guide for more information:

from braintrust import Eval
from autoevals import LevenshteinScorer
 
Eval(
    "Say Hi Bot",
    data=lambda: [
        {
            "input": "Foo",
            "expected": "Hi Foo",
        },
        {
            "input": "Bar",
            "expected": "Hello Bar",
        },
    ],  # Replace with your eval dataset
    task=lambda input: "Hi " + input,  # Replace with your LLM call
    scores=[LevenshteinScorer],
)

Onboarding and signup flow for new users
Switch product font to Inter

Week of 2023-09-11

Big performance improvements for registering experiments (down from ~5s to <1s). Update the SDK to take advantage of these improvements.
New graph shows aggregate accuracy between experiments for each score.
Throw errors in the prompt playground if you reference an invalid variable.
A major backend database change that significantly improves performance while reducing costs. Please contact us if you have not already heard from us about upgrading your deployment.
No more record size constraints (previously, strings could be at most 64KB long).
New autoevals for numeric diff and JSON diff

Week of 2023-09-05

You can duplicate prompt sessions, prompts, and dataset rows in the prompt playground.
You can download prompt sessions as JSON files (including the prompt templates, prompts, and completions).
You can adjust model parameters (e.g. temperature) in the prompt playground.
You can publicly share experiments (e.g. Alpaca Evals).
Datasets now support editing, deleting, adding, and copying rows in the UI.
There is no longer a 64KB limit on strings.

Week of 2023-08-28

The prompt playground is now live! We're excited to get your feedback as we continue to build this feature out. See the docs for more information.

Week of 2023-08-21

A new chart shows experiment progress per score over time.

The eval CLI now supports --watch, which will automatically re-run your evaluation when you make changes to your code.
You can now edit datasets in the UI.

Week of 2023-08-14

Introducing datasets! You can now upload datasets to Braintrust and use them in your experiments. Datasets are versioned, and you can use them in multiple experiments. You can also use datasets to compare your model's performance against a baseline. Learn more about how to create and use datasets in the docs.
Fix several performance issues in the SDK and UI.

Week of 2023-08-07

Complex data is now substantially more performant in the UI. Prior to this change, we ran schema inference over the entire input, output, expected, and metadata fields, which could result in complex structures that were slow and difficult to work with. Now, we simply treat these fields as JSON types.
The UI updates in real-time as new records are logged to experiments.
Ergonomic improvements to the SDK and CLI:
- The JS library is now isomorphic and supports both Node.js and the browser.
- The Evals CLI warns you when no files match the .eval.[ts|js] pattern.

Week of 2023-07-31

You can now break down scores by metadata fields.

Improve performance for experiment loading (especially complex experiments). Prior to this change, you may have seen experiments take 30s+ occasionally or even fail. To enable this, you'll need to update your CloudFormation.
Support for renaming and deleting experiments.

When you expand a cell in detail view, the row is now highlighted.

Week of 2023-07-24

A new framework for expressing evaluations in a much simpler way:

import { Eval } from "braintrust";
import { Factuality } from "autoevals";
 
Eval("My Evaluation", {
  data: () => [
    {
      input: "Which country has the highest population?",
      expected: "China",
      meta: { type: "question" },
    },
  ],
  task: (input) => callModel(input),
  scores: [Factuality],
});

Besides being much easier than the logging SDK, this framework sets the foundation for evaluations that can be run automatically as your code changes, built and run in the cloud, and more. We are very excited about the use cases it will open up!

inputs is now input in the SDK (>= 0.0.23) and UI. You do not need to make any code changes, although you should gradually start using the input field instead of inputs in your SDK calls, as inputs is now deprecated and will eventually be removed.
Improved diffing behavior for nested arrays.
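
If you have your own code that still builds records with the deprecated inputs field, a small shim like the following can ease the migration to input. This is a hypothetical helper for your own codebase, not part of the SDK; normalize_record is an assumed name.

```python
import warnings


def normalize_record(record):
    """Return a copy of `record` with the legacy 'inputs' key renamed to 'input'."""
    record = dict(record)  # do not mutate the caller's dict
    if "inputs" in record:
        warnings.warn(
            "'inputs' is deprecated; use 'input' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        legacy_value = record.pop("inputs")
        # Prefer an explicit 'input' if both keys are present.
        record.setdefault("input", legacy_value)
    return record


print(normalize_record({"inputs": "Foo", "expected": "Hi Foo"}))
```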

Week of 2023-07-17

A couple of SDK updates (>= v0.0.21) allow you to update an existing experiment with init(..., update=True) and to specify an id in log(..., id='my-custom-id'). These tools are useful for running an experiment across multiple processes, tasks, or machines, and for idempotently logging the same record (identified by its id).
- Note: If you have Braintrust installed in your own cloud environment, make sure to update the CloudFormation (available at https://braintrust-cf.s3.amazonaws.com/braintrust-latest.yaml).
Tables with many columns are now visually more compact in the UI.
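
The idempotent logging described in this week's SDK update (logging the same record by its id from several processes) can be pictured with a small in-memory sketch. RecordStore is a hypothetical stand-in, not the SDK or its storage layer, and the merge-on-repeat behavior shown here is an assumption made for illustration.

```python
import uuid
from typing import Any, Dict, Optional


class RecordStore:
    """Hypothetical in-memory stand-in for an experiment's records."""

    def __init__(self) -> None:
        self.records: Dict[str, Dict[str, Any]] = {}

    def log(self, id: Optional[str] = None, **fields: Any) -> str:
        # Without an explicit id, every call creates a new record.
        record_id = id or str(uuid.uuid4())
        # With an explicit id, repeated calls update the same record
        # instead of creating duplicates (assumed semantics).
        self.records.setdefault(record_id, {}).update(fields)
        return record_id


store = RecordStore()
# Two workers retrying the same logical record end up with one row.
store.log(id="my-custom-id", input="Foo", output="Hi Foo")
store.log(id="my-custom-id", input="Foo", output="Hi Foo")
print(len(store.records))
```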

Week of 2023-07-10

A new Node.js SDK (npm) which mirrors the Python SDK. As this SDK is new, please let us know if you run into any issues or have any feedback.

If you have Braintrust installed in your own cloud environment, make sure to update the CloudFormation (available at https://braintrust-cf.s3.amazonaws.com/braintrust-latest.yaml) to include some functionality the Node.js SDK relies on.

You can do this in the AWS console, or by running the following command (using the braintrust command included in the Python SDK):

braintrust install api <YOUR_CLOUDFORMATION_STACK_NAME> --update-template

You can now swap the primary and comparison experiment with a single click.

You can now compare output vs. expected within an experiment.

Version 0.0.19 is out for the SDK. It is an important update that throws an error if your payload is larger than 64KB in size.
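
To check your own records against the 64KB limit before logging, a pre-flight check along these lines may help. This is a sketch under assumptions: it measures the UTF-8 size of the JSON-serialized record, which may not match exactly how the SDK measures payloads, and check_payload_size is a hypothetical helper. Later entries in this changelog note that the limit was eventually removed.

```python
import json

MAX_PAYLOAD_BYTES = 64 * 1024  # 64KB, per the changelog entry above


def check_payload_size(record):
    """Return the serialized size in bytes, raising if it exceeds the limit."""
    size = len(json.dumps(record).encode("utf-8"))
    if size > MAX_PAYLOAD_BYTES:
        raise ValueError(
            f"payload is {size} bytes, which exceeds {MAX_PAYLOAD_BYTES} bytes"
        )
    return size


print(check_payload_size({"input": "hi"}))
```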

Week of 2023-07-03

Support for real-time updates, using Redis. Prior to this, Braintrust would wait for your data warehouse to sync up with Kafka before you could view an experiment, often adding a minute or two before a page loaded. Now, we cache experiment records as your experiment is running, making experiments load instantly. To enable this, you'll need to update your CloudFormation.
New settings page that consolidates team, installation, and API key settings. You can now invite team members to your Braintrust account from the "Team" page.
The experiment page now shows commit information for experiments run inside of a git repository.

Week of 2023-06-26

Experiments track their git metadata and automatically find a "base" experiment to compare against, using your repository's base branch.
The Python SDK's summarize() method now returns an ExperimentSummary object with score differences against the base experiment (v0.0.10).
Organizations can now be "multi-tenant", i.e. you do not need to install in your cloud account. If you start with a multi-tenant account to try out Braintrust, and decide to move it into your own account, Braintrust can migrate it for you.

Week of 2023-06-19

New scatter plot and histogram insights to quickly analyze scores and filter down examples.
API keys that can be set in the SDK (explicitly or through an environment variable) and do not require user login. Visit the settings page to create an API key.
- Update the braintrust Python SDK to version 0.0.6 and the CloudFormation template (https://braintrust-cf.s3.amazonaws.com/braintrust-latest.yaml) to use the new API key feature.

Week of 2023-06-12

New braintrust install CLI for installing the CloudFormation
Improved performance for event logging in the SDK
Auto-merge experiment fields with different types (e.g. number and string)

Week of 2023-06-05

Tutorial guide + notebook
Automatically refresh Cognito tokens in the Python client
New filter and sort operators on the experiments table:
- Filter experiments by changes to scores (e.g. only examples with a lower score than another experiment)
- Custom SQL filters
- Filter and sort bubbles to visualize/clear current operations
[Alpha] SQL query explorer to run arbitrary queries against one or more experiments
