AI speaker detection
How we unlocked a $100k+ ARR deal by designing human-in-the-loop AI that journalists could trust.
Project overview
Trint is an AI-powered transcription platform that helps teams turn audio into accurate, editable text for collaborative content creation. It’s trusted by journalists and media organisations worldwide, including teams at The New York Times, The Washington Post, The Financial Times, and The Associated Press.
This project was triggered by a $100k+ prospective deal with a large US media organisation, which exposed a critical competitive gap. While competing platforms offered automatic speaker recognition, Trint still relied on manual tagging.
As the lead Designer, I drove discovery and end-to-end design to close that gap, focusing on accuracy, transparency, and user trust, and ensuring the feature worked not just technically but cognitively for journalists.
The challenge
Manual speaker labelling was slow, inconsistent, and frustrating, especially for journalists. We needed to build a solution that reduced effort, created trust, and helped secure a six-figure deal with a prospective client.
Using discovery to prioritise & scope a focused MVP
I kicked off the project by partnering with the PM to run a competitive analysis and capture requirements from the prospective client. To maintain momentum, engineering delivered a quick proof of concept to validate that speaker recognition was technically feasible. My focus then shifted to shaping how that capability would work as a product.
Before joining this team, discovery had been largely developer-led, with little structure around user needs. I ran a discovery workshop to help the team shift focus from shipping features to understanding real user problems.
Together, we identified five core problem spaces, each supported by job stories:
Enrolment - adding known speakers
Management - editing and maintaining voiceprints
Categorisation - organising libraries
Permissions - managing who can do what
Identification - automated detection by the AI model
These job stories clarified the problem space, prioritised scope, and shaped a realistic roadmap. Using How Might We prompts, we explored multiple solutions, then prioritised them using dot voting and an impact vs. effort matrix.
This helped the team align on a focused MVP. We concentrated on Enrolment, Management, and Identification, the minimum set required to validate value quickly.
From this process we prioritised three core ideas:
1. Enrolment - An inline CTA that allows users to enrol speakers directly within the transcript, keeping enrolment at a transcript level.
2. Management - A dedicated space to view and manage all enrolled speakers.
3. Identification - Automatic identification for high-confidence matches, and human verification for low confidence results.
Validating broader relevance
To check whether this was valuable beyond the prospective client, we looked at existing product data. On average, there were around 730k speaker name edit events per month, showing how often teams were already managing speakers manually.
This confirmed that speaker identification wasn’t a niche requirement, but a common workflow across accounts.
Alongside usage data, we spoke with a small number of existing Enterprise customers to understand how they currently managed speakers and where friction appeared in real workflows. We shared the problem space and direction we were exploring, rather than proposing a specific solution.
Their response suggested this wasn’t an isolated request, but a shared problem.
Early collaboration to understand edge cases & model constraints
Before any screens were created, I partnered with Engineers and the PM to understand how the model worked, taking an AI-first approach to the design process. This early collaboration shaped the entire approach: instead of pitching ideas and checking feasibility later, I spent time understanding the model’s constraints, edge cases, and potential.
Some outcomes of this collaboration include:
🛠 Identifying model unavailability
We discovered a model limitation that was too complex to explain clearly to users.
Rather than introduce additional feedback for this edge case, we chose to monitor how often it occurred in practice.
🚫 Anticipating model failure states
We explored scenarios where the model might not be available, for example when audio was too short or unclear. These cases helped us understand where feedback or guidance would eventually be required.
✅ Identifying trust as a requirement
Through conversations with journalists, it became clear that trust was paramount, particularly understanding what was AI-generated versus what had been created or edited by a human. We also recognised early that users would need to know when labels came from the AI and when they had been added manually. This shaped later work on confidence, suggestions, and verification.
Shaping the workflow through continuous feedback
The development team built an initial POC for the prospective client, demonstrating the system’s ability to recognise distinct speakers within transcripts. With early confidence established, I led the transition into the design phase.
I created designs and prototypes for a complete end-to-end solution, to articulate a clear vision for how the feature could evolve beyond the POC. Presenting this future state to the client helped secure further buy-in, and we aligned on progressing towards an MVP.
Throughout development, designs, prototypes, and early builds were continuously tested and reviewed. This allowed us to iterate quickly, validate assumptions, and refine the experience in parallel with engineering, rather than treating design as a one-off handover.
Designing clear, predictable interactions
My focus was on introducing a human layer to the enrolment flow, adding transparency, clarity, and predictable behaviour so the system felt natural and trustworthy to use.
💬 Familiar interactions, clearer outcomes
Our core users (journalists) are time-poor and sensitive to changes that disrupt their workflow, so I deliberately built on familiar interaction patterns to minimise friction and learning overhead.
While we explored more granular control, testing showed these options added unnecessary cognitive load. 4 out of 5 participants described applying a speaker to a single entry as a corrective action, rather than a bulk one.
As a result, we removed granular options when no speaker had been enrolled and introduced a single, explicit “apply to all entries” action to keep the interaction predictable and transparent.
⚡️ Reducing interaction cost
I auto-focused the input so users could type immediately. It removed an extra click and made repetitive edits feel faster.
🌍️ Real-world behaviour
Defaulting “remember speaker” to ON would artificially boost usage and create cleanup work when unwanted speakers were saved. Testing and client feedback showed remembering speakers is an intentional action, not a default. Early copy referencing ‘future transcripts’ caused confusion, with testers interpreting this as future uploads. Simplifying the language to “future use” removed ambiguity and aligned expectations.
⏮️ Introducing reversibility
Once a speaker was enrolled, a success state appeared with an option to undo, allowing users to reverse large changes, a capability missing in competitor systems. Adding this simple control made bulk actions safer.
Step 1 in the flow - Pre feedback
Step 1 in the flow - Post feedback
Designing clear mental models
This work focused on separating similar actions with very different consequences, while giving users confidence and control over how speaker data was stored and shared.
While designing the end-to-end flow, this area quickly emerged as one of the most cognitively demanding parts of the experience. During usability testing, it was consistently the point where participants asked the most questions, about the difference between editing a speaker label and changing a speaker, and the impact those actions had on future transcripts and saved data.
🧠 Clarifying mental models
‘Editing a known speaker’ and ‘correcting an attribution error’ appear similar, but represent fundamentally different user intentions. Without a clear distinction, users risked unintentionally overwriting saved speaker data when they simply meant to correct a transcript.
To make this distinction explicit, I separated these actions through a dedicated UI state when a recognised speaker name was present. This reinforced the mental model that users were either managing a known speaker or correcting an error, reducing ambiguity and protecting enrolled speakers.
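To make the split concrete, here is a minimal sketch of how the two intents could be modelled. All names and shapes are assumptions for illustration, not taken from Trint’s actual implementation:

```typescript
// Hypothetical model of the two intents we separated in the UI.

interface SpeakerLabel {
  transcriptId: string;
  segmentId: string;
  displayName: string;
  enrolledSpeakerId?: string; // set only when the label points at a saved speaker
}

// Intent 1: correct an attribution error in this transcript only.
// The enrolled speaker's saved data is never touched.
function correctAttribution(label: SpeakerLabel, newName: string): SpeakerLabel {
  return { ...label, displayName: newName, enrolledSpeakerId: undefined };
}

// Intent 2: edit a known speaker. This updates the saved record,
// so every transcript referencing it reflects the change.
interface EnrolledSpeaker {
  id: string;
  name: string;
}

function renameEnrolledSpeaker(speaker: EnrolledSpeaker, newName: string): EnrolledSpeaker {
  return { ...speaker, name: newName };
}
```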
Speaker management & privacy controls
Through client conversations, journalists highlighted that some speakers needed to remain private, while others (e.g. News Anchors or regular contributors) benefited from being recognised organisation-wide.
🔐 Private speaker management
I introduced a user-specific speaker management panel where users could review and edit their enrolled speakers, keeping personal or sensitive voices private.
Admin users, however, could allow specific speakers (e.g., anchors or reporters) to be recognised organisation-wide. This created a scalable model, private when it needed to be, shared when it added value.
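A rough sketch of that visibility model, using hypothetical types and field names, shows how a private-by-default speaker could be promoted to organisation-wide recognition by an admin:

```typescript
// Illustrative only: private-by-default visibility for enrolled speakers.
type SpeakerVisibility = 'private' | 'organisation';

interface EnrolledSpeaker {
  id: string;
  name: string;
  ownerId: string;               // the user who enrolled the speaker
  visibility: SpeakerVisibility; // defaults to 'private'
}

// Only admins can promote a speaker (e.g. a news anchor) to organisation-wide recognition.
function shareSpeaker(speaker: EnrolledSpeaker, requestedBy: { isAdmin: boolean }): EnrolledSpeaker {
  if (!requestedBy.isAdmin) {
    throw new Error('Only admins can share speakers organisation-wide');
  }
  return { ...speaker, visibility: 'organisation' };
}

// A speaker is available for identification if the user owns it,
// or it has been shared across the organisation.
function canUseForIdentification(speaker: EnrolledSpeaker, userId: string): boolean {
  return speaker.ownerId === userId || speaker.visibility === 'organisation';
}
```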
Designing AI predictions users can trust
For users to trust automated speaker identification, they needed to clearly understand when AI was acting and when a human had intervened, and to retain control over predictions and saved speakers. AI accuracy varies, so the experience needed to make uncertainty visible and keep humans in charge of final decisions.
🔐 Making AI behaviour visible
Users needed to understand when a speaker label came from the model. I introduced clear visual markers (AI sparkles) to distinguish AI-generated labels from human-added ones, making the system’s behaviour visible.
✅ Introducing lightweight verification
Users needed a way to confirm what had been reviewed by a human.
I added a subtle verification checkbox users could click to verify both the speaker label and the transcribed text. On hover, it showed who reviewed it and when. This added trust and accountability without adding friction: users could see, at a glance, what was human-checked versus AI-suggested.
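Behind the checkbox sits a small verification record. The sketch below uses assumed field and function names to show how the hover copy could be derived:

```typescript
// Illustrative shape for a human verification record attached to a transcript segment.
interface Verification {
  verifiedBy: string; // display name of the reviewer
  verifiedAt: Date;
}

interface SegmentLabel {
  speakerName: string;
  source: 'ai' | 'human';      // drives the AI sparkle marker
  verification?: Verification; // present only once a human has reviewed it
}

// Tooltip copy shown on hover over the verification checkbox.
function verificationTooltip(label: SegmentLabel): string {
  if (!label.verification) {
    return 'Not yet verified';
  }
  const { verifiedBy, verifiedAt } = label.verification;
  return `Verified by ${verifiedBy} on ${verifiedAt.toLocaleDateString()}`;
}
```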
🔐 Balancing automation with user control
High-confidence predictions (90%+) were applied automatically, marked with a small AI sparkle so users knew the source.
Anything below that threshold appeared as a suggestion the user could apply, edit, or reject. This kept the workflow fast while ensuring users always stayed in control.
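The 90% threshold is the behaviour described above; the rest of this sketch (types, names, structure) is an assumption to illustrate the decision logic:

```typescript
// Illustrative decision logic: auto-apply only high-confidence matches,
// surface everything else as a suggestion the user can apply, edit, or reject.
const AUTO_APPLY_THRESHOLD = 0.9; // 90%+ confidence, per the agreed behaviour

interface Prediction {
  speakerId: string;
  speakerName: string;
  confidence: number; // 0..1, as returned by the identification model
}

type LabelDecision =
  | { kind: 'auto-applied'; speakerName: string; markedAsAI: true }
  | { kind: 'suggestion'; speakerName: string };

function decideLabel(prediction: Prediction): LabelDecision {
  if (prediction.confidence >= AUTO_APPLY_THRESHOLD) {
    // Applied automatically, but marked with the AI sparkle so the source stays visible.
    return { kind: 'auto-applied', speakerName: prediction.speakerName, markedAsAI: true };
  }
  // Below the threshold, the user stays in control of the final decision.
  return { kind: 'suggestion', speakerName: prediction.speakerName };
}
```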
Together, these patterns ensured AI acted as an assistive layer, fast when confident, cautious when uncertain, and always open to final human review.
Outcomes & impact
This work secured the $100k+ Enterprise deal and helped generate new opportunities with other major US broadcasters during enterprise sales conversations. Following this, the feature was rolled out gradually, starting with larger Enterprise accounts.
During this period, we collected qualitative feedback, helping us understand how the feature was performing.
Business outcome
$100k+
Enterprise deal secured with a major US media organisation
Feedback survey results
65%
Users said they were satisfied or better with the new speaker recognition feature
82%
Users rated the accuracy as ‘Mostly accurate’ or ‘Very accurate - rarely incorrect’
78%
Users reported that the feature saved time within their existing workflows
Alongside these positives, feedback also highlighted important limitations. Speaker attribution still varied across accents, dialects, and languages. This feedback helped reframe our next challenge:
How might we design systems that allow users to actively help improve the model over time?