AI Content Systems
How to Reduce LLM Hallucinations and Keep Writing Human
Sharing our learnings from generating thousands of articles and posts with LLMs.
We build software that produces content. How to avoid hallucinations is by far the most frequent question we get asked, and as you'd expect, we work on it every day.
There are a few layers to avoiding hallucinations, and here's a range of techniques that work for us, organised by difficulty.
Copy <> Hallucinations
First, you need to understand that almost everything you do to prevent hallucinations will also help your copy sound more human.
This is because the LLM will be reacting to clearer prompts, consuming less noise (which affects how your brand voice comes through), and doing less meta-textual commentary ("wait, no, that can't be right, I thought it was... let me check...").
Let me take you through some from easiest to hardest:
Easy
- Explain what you are writing and who the audience is (2-3 sentences). If the LLM knows it's a personal health article for a healthcare company, this narrows the universe of what it will talk about.
- Position the LLM as the audience, e.g. "You are a 50yo mother of three, exhausted and scrolling this on your phone at 10pm". LLMs have been trained on enough media that they are pretty decent at capturing these subtle and complex emotional states.
- Provide compliance and brand voice guidelines (<1 page each). Providing these both upfront and later gets the best results in our testing. If you're using API calls, these will typically sit (and be cached) in the system prompt, which the model tends to forget (more on this shortly).
- Chunk the text down. Generate each paragraph individually instead of doing the article at once. Depending on what you're writing, you can either generate the paragraphs independently (for more informational pieces), or pass the results of the first paragraph to generation for the second paragraph (to maintain narrative continuity), and then pass those two to the next, and so on.
- Use examples sparingly. Showing how a sentence should or should not be structured can be helpful, but pretty frequently it also results in the LLM over-using that style.
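The chunking step above can be sketched roughly as follows. This is a minimal illustration, not our production code: `generate` stands in for whatever LLM call you use (it is a hypothetical callable that takes a prompt and returns text), and the names `build_prompt`, `write_article` and `carry_context` are made up for this example.

```python
from typing import Callable, List

# Assumption: your LLM call can be wrapped as "prompt in, text out".
Generate = Callable[[str], str]


def build_prompt(brief: str, topic: str, previous: List[str]) -> str:
    """Assemble the prompt for a single paragraph."""
    parts = [brief, f"Write one paragraph about: {topic}"]
    if previous:
        # Earlier paragraphs are included only when continuity matters.
        parts.append("Paragraphs written so far:\n" + "\n\n".join(previous))
    return "\n\n".join(parts)


def write_article(
    brief: str,
    topics: List[str],
    generate: Generate,
    carry_context: bool = True,
) -> List[str]:
    """Generate one paragraph per topic.

    carry_context=True passes every earlier paragraph forward (narrative
    pieces); carry_context=False generates each paragraph independently
    (informational pieces).
    """
    paragraphs: List[str] = []
    for topic in topics:
        previous = paragraphs if carry_context else []
        paragraphs.append(generate(build_prompt(brief, topic, previous)))
    return paragraphs
```

The `brief` is where the 2-3 sentence description of the piece and audience would go, so every paragraph prompt carries it.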
Medium
- Provide more specific information. Don't hand over a PDF or point to a web page and expect the LLM to find the numbers you want. Select the data up front and provide it to the LLM, e.g. as an API response. Even a page of text or a screenshot of a table can improve your results.
Note: the less extraneous text you provide, the more likely the LLM is to obey your voice guidelines. Even though this sounds purely like a hallucination safeguard, it will improve your copy.
- Reiterate important rules in every prompt. Do it once initially when you kick off the task, and then repeat the most critical ones for every paragraph. In our testing, models doing multi-step content creation tend to "forget" parts of the system prompt and benefit from repeated instructions.
- Don't use Tropes.md as-is to remove "LLM-sounding" copy. There's way too much editorial commentary in Tropes and it's too generic for your business and audience. You will get actively worse results if you use it unmodified. Take it as a starting point and rewrite it to suit your specific needs, company, and content type.
- Use an eval loop, either manual (copy and paste) or automated. Pass the article along with a description of possible concerns to another LLM and ask it to flag risky sentences. We use both a script eval loop (cheap, deterministic) for forbidden phrases, and an LLM eval. You can go as complex as you want here. Try to use a different model for the eval; Claude in particular loves the sound of its own voice.
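As a rough sketch of the LLM side of the eval loop: number the sentences, hand them to a second model with your list of concerns, and map its reply back to the sentences. Everything here is an assumption for illustration: `judge` is a hypothetical callable wrapping whichever (ideally different) model you use, the prompt wording is ours to invent, and the sentence splitter is deliberately naive.

```python
import re
from typing import Callable, List

# Assumption: the judge model can be wrapped as "prompt in, raw reply out".
Judge = Callable[[str], str]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_judge_prompt(sentences: List[str], concerns: List[str]) -> str:
    """Ask the judge to reply with sentence numbers only."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(sentences, 1))
    concern_list = "\n".join(f"- {c}" for c in concerns)
    return (
        "Review the numbered sentences below for these concerns:\n"
        f"{concern_list}\n\n"
        "Reply with the numbers of risky sentences, comma-separated, "
        "or NONE.\n\n"
        f"{numbered}"
    )


def flag_risky(text: str, concerns: List[str], judge: Judge) -> List[str]:
    """Return the sentences the judge flagged, in document order."""
    sentences = split_sentences(text)
    reply = judge(build_judge_prompt(sentences, concerns))
    # Any integer in the reply is treated as a sentence number; numbers
    # outside the valid range are silently ignored.
    numbers = {int(n) for n in re.findall(r"\d+", reply)}
    return [s for i, s in enumerate(sentences, 1) if i in numbers]
```

Asking for numbers rather than free-text feedback keeps the reply easy to parse, and the flagged sentences can then go to a human or back into a rewrite prompt.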
Hard
If you're writing software, here are some more complex techniques we use.
- Swap em dashes for en dashes via script; don't bother trying to prompt this out. You can do this for other "tells" as well, but don't overdo it; ideally this is for grammar only.
- Use a forbidden-phrase list. Some of our clients simply cannot have particular phrases in their copy. If you're in financial services, you might want to avoid "guaranteed". The opposite case also applies: if you sell medication and your content doesn't contain the words "pregnant" or "breastfeeding", you know you've forgotten to address pregnant women (a regulatory requirement). Depending on what you're creating, either hardcode required text so it appears every single time (e.g. as a disclaimer), or run a linter script that fails the content if a forbidden phrase is spotted or a required one is absent.
Direct phrase matching is messy, so you shouldn't have more than a handful of words on your list, but this is an easy way to absolutely stop red-line issues. Don't rely on prompting for these!
- Run an automated eval loop and bulk generate content for it. Do 10-100 pieces of content, see what gets flagged, update your prompts and do it again. (You can also swap out models to compare performance or cost.)
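The script-side checks above could look something like this. Treat it as a sketch under assumptions rather than our actual tooling: the function names are invented for the example, matching is a simple case-insensitive whole-word search, and each required phrase is treated as individually mandatory (you may want "any of" logic instead).

```python
import re
from collections import Counter
from typing import Iterable, List

EM_DASH = "\u2014"
EN_DASH = "\u2013"


def swap_dashes(text: str) -> str:
    """Replace every em dash with an en dash."""
    return text.replace(EM_DASH, EN_DASH)


def _contains(text: str, phrase: str) -> bool:
    """Case-insensitive whole-word match; 'guaranteed' will not match 'guarantee'."""
    pattern = rf"\b{re.escape(phrase)}\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def lint(text: str, forbidden: List[str], required: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the content passes."""
    problems = [f"forbidden: {p}" for p in forbidden if _contains(text, p)]
    problems += [f"missing: {p}" for p in required if not _contains(text, p)]
    return problems


def bulk_report(
    pieces: Iterable[str], forbidden: List[str], required: List[str]
) -> Counter:
    """Lint a batch of generated pieces and count how often each problem occurs."""
    counts: Counter = Counter()
    for piece in pieces:
        counts.update(lint(swap_dashes(piece), forbidden, required))
    return counts
```

For a financial-services batch you might call `bulk_report(pieces, ["guaranteed"], [])`; for medication content, `bulk_report(pieces, [], ["pregnant", "breastfeeding"])`. The counts tell you which prompt rules to tighten before the next run.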
How to make LLMs sound like you
This is a bigger piece that deserves an article of its own, but if you follow the above steps, you will have much less noise in your prompts, which makes it much easier for the LLM to follow your specific voice.
It also depends a lot on whether the LLM is writing copy from scratch (easier), or using existing copy, e.g. quotes, to create its own stories (much harder). In the latter case, the model will often follow the format of the quote rather than your brand voice.
Do not give the LLM your brand guide; these are noisy documents that often contain information that's irrelevant to the task.
Brand guides typically cover all channels, e.g. this is how we do Meta ads, this is how we do images, this is how we show times/dates, this is how we describe our customers, etc. Pull the most visible and frequent tells from the guide (e.g. exclamation marks, unique words) into your prompts, and this will carry you a long way.
Additionally, brand voices tend to be difficult to replicate because the full voice and energy often come from someone who truly embodies the brand (e.g. a founder); this is part of why brands matter. In these situations, you can often improve performance by distilling writing samples from that person and making them part of the prompt.
Either way, you can often get 80-90% of the results with a fraction of the effort through careful prompting. 80% is well inside the "human-tolerable" quality level.
Let us help you
If you need help setting up a story engine in your company, shoot me an email. I will embed with your team to set up Fireside to tell compelling stories about your own work and products for your audience, and roll up my sleeves to help you deliver.