AI quality assurance

A lot has been written about AI models hallucinating - and it’s true - they do.

But this only tells part of the story.

Often the problem isn’t the AI - it’s the prompt. Prompts are too high level, too vague, or point the model at the wrong data.

A great way to mitigate this is to be very specific about the data the LLM should look at, and to ensure each element of that data is correct.

You do this by checking sources to ensure the outputs are accurate, unbiased, and professionally reliable - much like any other research process.

How do we do this?

Through systematic fact verification.

Depending on the model and the task, AI can confidently be wrong on roughly 5% to 15% of factual claims.

Professional AI users must have systematic verification processes to catch those errors.

The easiest way to think about it is to apply a Risk Assessment Filter over each claim (a minimal code sketch follows the lists below):

HIGH RISK INFORMATION (Always verify):

  • Statistics and data points
  • Historical facts and dates
  • Legal or regulatory information
  • Medical or safety advice
  • Financial calculations
  • Recent news or events
  • Technical specifications

MEDIUM RISK (Spot check):

  • Industry best practices
  • General procedures
  • Common knowledge facts
  • Established theories

LOW RISK (Usually safe):

  • Creative content
  • Brainstorming ideas
  • Opinion-based content
  • Formatting and structure
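
To make the filter concrete, here is a minimal Python sketch that triages claims with simple keyword heuristics. The patterns and example claims are illustrative assumptions, not a production classifier - a real workflow would tune them to its own domain.

```python
import re

# Hypothetical keyword heuristics mirroring the risk tiers above.
HIGH_RISK_PATTERNS = [
    r"\d+(\.\d+)?\s*%",          # percentages and statistics
    r"\$\s?\d",                  # monetary figures
    r"\b(19|20)\d{2}\b",         # years and dates
    r"\b(law|regulation|compliance)\b",
    r"\b(dose|diagnosis|safety)\b",
]
MEDIUM_RISK_PATTERNS = [
    r"\b(best practice|typically|procedure|standard)\b",
]

def risk_level(claim: str) -> str:
    """Return HIGH / MEDIUM / LOW for a single factual claim."""
    text = claim.lower()
    if any(re.search(p, text) for p in HIGH_RISK_PATTERNS):
        return "HIGH"
    if any(re.search(p, text) for p in MEDIUM_RISK_PATTERNS):
        return "MEDIUM"
    return "LOW"

claims = [
    "The market is valued at $6.68 billion in 2023.",
    "Daily stand-ups are a common best practice.",
    "A fresh colour palette would suit the brand.",
]
for claim in claims:
    print(f"{risk_level(claim):6} | {claim}")
```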

Another trick in this phase of analysis is to feed the model the RIGHT factual data to work from. LLMs know a lot about a lot of things - but they work even better when they draw only on high quality data. There are two ways of doing this (a sketch of the first approach follows the list):

  • Provide the data, and get the AI to index it, and only answer based on that content; or
  • Use a ‘DEEP RESEARCH’ tool to surface high quality content - then prompt engineer what you want over the top of that.
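
Here is a minimal sketch of the first approach. The naive keyword-overlap retriever and the prompt wording are assumptions for illustration - in practice you would swap in a proper embedding index and your preferred chat API.

```python
# Ground the model on supplied documents only.
# retrieve() is a naive keyword-overlap ranker; call_llm() is a placeholder
# for whichever chat API you use - both are illustrative assumptions.

def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by crude word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:top_k]

def build_grounded_prompt(question: str, excerpts: list[str]) -> str:
    """Constrain the model to answer only from the retrieved excerpts."""
    sources = "\n\n".join(f"[Source {i+1}]\n{e}" for i, e in enumerate(excerpts))
    return (
        "Answer using ONLY the sources below. "
        "If the answer is not in the sources, reply 'Not found in the provided data.'\n\n"
        f"{sources}\n\nQuestion: {question}"
    )

docs = [
    "Q3 revenue was 4.2m, up 8% year on year.",
    "The support team resolved 1,930 tickets in Q3.",
]
question = "What was Q3 revenue?"
prompt = build_grounded_prompt(question, retrieve(question, docs))
print(prompt)  # send this to your model of choice, e.g. call_llm(prompt)
```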

An example task

Your Task: Practice identifying high-risk claims that require verification

AI generated business report excerpt:

"The global project management software market is valued at $6.68 billion in 2023 and is expected to grow at a CAGR of 10.77% through 2030. Leading companies include Microsoft (31% market share), Atlassian (18% market share), and Asana (12% market share). Recent studies show that 73% of companies using project management software report improved team productivity, with average time savings of 21% per project. The COVID-19 pandemic accelerated adoption, with remote work driving 45% increase in software purchases during 2020-2021. Key trends include AI integration, mobile-first design, and advanced analytics. Gartner predicts that by 2025, 80% of project management software will include AI-powered features for task automation and resource optimization."

  1. Identify all factual claims in the excerpt - list each specific fact, statistic, or claim.

    Found: [count your list] factual claims

  2. Categorise each claim by risk level: High Risk / Medium Risk / Low Risk.

  3. Priority verification list - which 5 claims would you verify first?

Expert analysis:

  • High risk claims (Must Verify): $6.68B market value, 10.77% CAGR, market share percentages, 73% productivity stat, 21% time savings, 45% pandemic increase, Gartner prediction
  • Medium risk claims: General trend descriptions, COVID impact direction
  • Low risk claims: Software category definitions, general benefit claims
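
If you want to automate the first pass of this exercise, a prompt along the following lines can do the extraction and tagging for you. The wording and the expected JSON shape are assumptions to adapt, not a canonical template.

```python
# Ask a model to extract each factual claim and tag it HIGH / MEDIUM / LOW.
# The prompt wording and JSON shape below are assumptions - adapt to your workflow.

EXCERPT = (
    "The global project management software market is valued at $6.68 billion "
    "in 2023 and is expected to grow at a CAGR of 10.77% through 2030."
)

AUDIT_PROMPT = f"""Extract every factual claim from the text below.
Return one JSON object per claim: {{"claim": "...", "risk": "HIGH|MEDIUM|LOW",
"why": "..."}}. Treat statistics, dates, market figures, and named third-party
predictions (e.g. analyst forecasts) as HIGH risk.

Text:
{EXCERPT}"""

print(AUDIT_PROMPT)  # send this to your model, then parse the JSON response
```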

Another pro tip: ask AI to do the work for you with a deep research tool, a tool like NotebookLM, or an agent over your own data!

Conclusion

Effective AI quality assurance requires systematic verification of high-risk claims, appropriate risk categorisation, and providing models with quality data sources. By implementing these practices, you can minimise the impact of hallucinations and ensure AI outputs are accurate and reliable. It’s always your responsibility to ship accurate content.