Building a Knowledge Base Your AI Can Actually Use


Every B2B operation accumulates institutional knowledge in the worst possible format: WhatsApp threads, email chains, PDFs nobody finds, and the heads of three long-serving employees. When you deploy an AI support layer or internal assistant, it answers from what it’s given. If what it’s given is a poorly organised document dump, it will answer poorly.

A well-structured knowledge base — one designed for how AI retrieval actually works, not for how humans browse — is the difference between a chatbot that resolves 65% of queries and one that resolves 15%. The gap is not the AI model. It’s the knowledge base.

This article explains what that structure looks like, how to build it from existing material, and how to test whether it’s working before you go live.

Why Most Knowledge Bases Fail AI Retrieval

Human-readable knowledge bases are organised for browsing. They have categories, nested sub-pages, long explanatory PDFs, and running context embedded in surrounding paragraphs. A human skimming a policy document understands that the refund terms on page 4 connect to the payment terms on page 2. They carry that context in their head.

AI retrieval does not browse. Most B2B AI assistants are built on retrieval-augmented generation (RAG), which retrieves chunks. A RAG system takes an incoming query, converts it to a vector (an embedding), compares that vector with the vectors already stored for each KB chunk, and passes the closest-matching chunks to the language model to generate an answer.
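To make the retrieval step concrete, here is a minimal sketch in Python. It is a toy: the embed function uses plain word counts as a stand-in for a real embedding model, and the three chunks are made-up examples, so treat every name here as an assumption rather than any particular product's API.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks whose vectors are closest to the query vector."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


# Hypothetical KB chunks, one topic each.
chunks = [
    "We accept returns within 14 days of delivery.",
    "First-time orders are paid by bank transfer.",
    "Standard delivery takes 3 to 5 working days.",
]
top = retrieve("How many days do I have for returns?", chunks, top_k=1)
```

Only the chunks returned by retrieve ever reach the language model. Anything that ranks below the cut-off is invisible to it, which is why the shape of each chunk matters so much.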

The three failure modes this creates:

Too long, too dense. A 15-page product manual gets split into chunks of 400–800 tokens. If the relevant answer is buried in paragraph 11, it may not rank highly enough to be retrieved. Long documents dilute specific answers.

Context-free entries. A FAQ entry that says “Yes, we accept returns within 14 days” is retrieval-ready. An entry that says “As mentioned in our terms, the policy applies here” is not — the AI doesn’t know what “here” means without the surrounding context.

No atomic structure. Atomic means one question or one policy per entry. When a single KB document covers order entry, payment terms, and delivery timelines in the same block, the retrieval system cannot cleanly separate them. Queries about payment terms will sometimes return the delivery section instead.

The implication: building a KB for AI requires breaking your institutional knowledge into small, self-contained, context-complete units before loading anything.

The Three Input Categories

Before writing a single entry, map your source material into three categories. Each behaves differently in retrieval and requires different treatment.

Category 1 — Policies and procedures. Refund terms, payment conditions, minimum order quantities, delivery timelines, credit terms, escalation paths. This is the highest-value category for customer-facing AI support: it’s well-defined, relatively static, and directly answers the queries buyers ask most often. It also tends to exist in the messiest formats — scattered across email footers, buried in contracts, or living only in someone’s head.

Category 2 — Product data. SKU specifications, compatibility notes, pricing tiers, availability windows, substitution options. For distributors, this is often the largest category by volume and the most frequently outdated. AI retrieval against stale product data produces confidently wrong answers — a worse outcome than no answer at all. Product data entries need a refresh cadence, not just an initial load.

Category 3 — Conversation history. Resolved support tickets, WhatsApp exchanges, email threads. This category is often ignored during KB builds and is frequently the most operationally useful. Real buyer queries and real answers are the best training signal for what your buyers actually ask. Cleaned, anonymised conversation history — stripped of personal data, reformatted as Q&A pairs — can double the retrieval relevance for common query types.

Most operators start with Category 1 only. A functional KB needs all three.
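Conversation history needs personal data removed before it goes anywhere near the KB. A rough first pass in Python might look like the sketch below. The two patterns are simplified assumptions, not a complete anonymisation method: they catch typical email addresses and long phone-like digit runs, and they can also hit long order or SKU numbers. Names need a separate manual pass.

```python
import re

# Simplified, assumed patterns; tune them against your own exports.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[email]", text)
    return PHONE.sub("[phone]", text)


cleaned = redact("Call me on +44 20 7946 0958 or mail jo@example.com")
```

Run it over every extracted message, then spot-check a sample by hand before loading anything.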

Extracting Useful Content from WhatsApp, Email, and PDFs

The extraction step is where most KB projects stall. The raw material is there — years of resolved queries, policy documents, supplier correspondence — but it’s in formats that require processing before it’s loadable.

WhatsApp: Export the chat history using the Export chat option, without media (the exact menu path varies by platform and app version). The raw export is a time-stamped text file. Filter for exchanges where a staff member answered a customer question, not internal coordination. The pattern to extract: customer question → staff answer. Strip names, dates, and irrelevant context. What remains is a Q&A pair suitable for a KB entry.
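The question → answer extraction can be sketched in Python as below. The line format in the regular expression ("DD/MM/YYYY, HH:MM - Sender: message") is an assumption: export formats differ by platform and locale, so check your own file first. The sender names and messages are made up, and extract_qa_pairs is a hypothetical helper.

```python
import re

# Assumed export line format: "12/01/2025, 10:15 - Sender: message"
LINE = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2} - ([^:]+): (.*)$")


def extract_qa_pairs(export_text: str, staff: set[str]) -> list[tuple[str, str]]:
    """Pair each customer question with the staff reply that directly follows it."""
    messages = []
    for raw in export_text.splitlines():
        match = LINE.match(raw)
        if match:
            messages.append([match.group(1), match.group(2)])
        elif messages and raw.strip():
            # A line without a timestamp continues the previous message.
            messages[-1][1] += " " + raw.strip()
    pairs = []
    for (sender, text), (next_sender, reply) in zip(messages, messages[1:]):
        if sender not in staff and next_sender in staff and "?" in text:
            pairs.append((text, reply))  # names and timestamps are dropped here
    return pairs


sample = (
    "12/01/2025, 10:15 - Buyer A: What is the minimum order quantity?\n"
    "12/01/2025, 10:17 - Staff B: The minimum order is 10 cartons.\n"
    "12/01/2025, 10:20 - Staff B: Reminder to team, stock count on Friday\n"
    "12/01/2025, 10:25 - Buyer A: Thanks\n"
)
pairs = extract_qa_pairs(sample, staff={"Staff B"})
```

This only catches the simplest case, a question answered in the very next message. Treat the output as candidates for a human to review, not as finished entries.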

Email: If you’re using a shared inbox (Outlook, Gmail with shared access), export threads by category or label. The useful threads are the ones with a clean resolution — buyer asked, someone on your team answered factually. Avoid threads with multiple back-and-forths, pricing negotiations, or complaints: these introduce ambiguity that confuses retrieval. Filter for resolution-confirmed threads only.

PDFs: Supplier specifications, internal policy documents, and rate cards are usually PDF-native. Run them through a PDF extraction tool (Docling, Adobe Extract API, or even a direct copy-paste for shorter documents). The output is raw text. The work is then splitting that text into atomic entries — one concept, one policy, one specification per entry — and adding explicit headers so the retrieval system knows what each chunk covers.
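One way to do the splitting, sketched in Python: assume someone has already marked each topic heading in the extracted text with a leading "## " during clean-up. That marker is a convention invented for this example, not something PDF extraction tools produce for you, and the sample text is made up.

```python
def split_into_entries(text: str, marker: str = "## ") -> list[dict]:
    """Split cleaned text into one entry per marked heading."""
    entries = []
    title, body = None, []
    for line in text.splitlines():
        if line.startswith(marker):
            if title is not None:
                entries.append({"title": title, "body": " ".join(body)})
            title, body = line[len(marker):].strip(), []
        elif title is not None and line.strip():
            body.append(line.strip())
    if title is not None:
        entries.append({"title": title, "body": " ".join(body)})
    return entries


raw_text = """## What is the minimum order quantity?
The minimum order is 10 cartons.

## How long does standard delivery take?
Standard delivery takes 3 to 5 working days.
"""
entries = split_into_entries(raw_text)
```

Each resulting entry carries its own heading, so the retrieval system never has to guess what a chunk is about.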

Total realistic time for a 50-account B2B distributor doing a first KB build: 15–20 hours of extraction and formatting work, done once. If you’re already running an AI support layer, that time starts paying back within the first month through fewer escalations.

How to Structure Entries for RAG

A RAG-ready KB entry has four components:

1. A specific title that mirrors how buyers ask. Not “Returns Policy” — “How do I return a product that arrived damaged?” Not “Payment Terms” — “What payment methods do you accept for first-time orders?” The title should match natural language query patterns, not internal document taxonomy.

2. A self-contained answer. The entry must be fully understandable without reading anything else. If you reference another policy, state the relevant part of that policy in the same entry. Don’t cross-reference; duplicate the context. Retrieval systems don’t follow hyperlinks.

3. Relevant synonyms and variations. If your buyers call a product category “consumables” in some markets and “supplies” in others, include both terms in the entry. If a process is called “order confirmation” internally but buyers ask about “purchase order acknowledgement,” include both phrasings. The retrieval system matches vectors, not keywords — but semantic similarity improves with more surface coverage.

4. A clear scope statement for edge cases. Where the policy has exceptions, state them explicitly: “This applies to orders placed after 01 January 2026. Orders placed before that date follow the previous terms, available on request.” Ambiguous scope is a common source of AI hallucination: the model fills gaps with plausible-sounding but incorrect information.

Entry length: Keep entries between 150 and 400 words. Below 150, there’s usually not enough context for accurate retrieval. Above 400, the entry is trying to cover too many things.

Entry volume for a mid-size distributor: A functional starting KB covers 60–120 entries. This is enough to handle 70–80% of inbound query types for a B2B operation with 50–200 accounts. The long tail of edge cases can be added iteratively after launch.
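The four components and the length guideline can be turned into a simple pre-load check. The Python sketch below is one possible shape, not a standard schema: the KBEntry fields, the cross-reference phrases, and the rule that a title should end in a question mark are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Assumed phrases that suggest the entry leans on text outside itself.
CROSS_REFERENCES = ("as mentioned", "see above", "see below", "as per our terms")


@dataclass
class KBEntry:
    title: str
    answer: str
    synonyms: list[str] = field(default_factory=list)
    scope: str = ""


def check_entry(entry: KBEntry, min_words: int = 150, max_words: int = 400) -> list[str]:
    """Return a list of problems; an empty list means the entry passes."""
    problems = []
    if not entry.title.strip().endswith("?"):
        problems.append("title is not phrased as a question")
    words = len(entry.answer.split())
    if words < min_words:
        problems.append(f"answer too short ({words} words)")
    elif words > max_words:
        problems.append(f"answer too long ({words} words)")
    if any(phrase in entry.answer.lower() for phrase in CROSS_REFERENCES):
        problems.append("answer relies on a cross-reference")
    if not entry.scope.strip():
        problems.append("no scope statement")
    return problems


weak = KBEntry(
    title="Returns Policy",
    answer="As mentioned in our terms, the policy applies here.",
)
weak_problems = check_entry(weak)
```

The weak entry above fails on all four counts. A check like this will not tell you whether an answer is correct, only whether it has the shape that retrieval needs.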

Testing the KB Before Going Live — The 20-Question Method

Loading the KB and hoping it works is how you get a chatbot that confidently answers incorrectly in front of customers. The 20-question method is a pre-launch validation that takes two to three hours and identifies structural problems before they’re public.

Step 1 — Write 20 questions your buyers actually ask. Pull from recent WhatsApp threads, email, or support tickets. Use the exact phrasing buyers used, not internal paraphrases. Include a mix of: common queries (5 questions), edge cases (5 questions), questions the KB currently covers well (5 questions), and questions it may not cover (5 questions).

Step 2 — Run each question through the AI assistant. Record: the answer given, the source entry retrieved, and whether the answer was accurate, partially accurate, or wrong.

Step 3 — Categorise failures. Wrong answers usually fall into one of three buckets: (a) no entry covered the topic — the KB has a gap; (b) the right entry existed but wasn’t retrieved — the entry title or phrasing needs adjustment; (c) the answer combined two entries incorrectly — the entries overlap and need to be separated.

Step 4 — Fix before launch. Gaps require new entries. Retrieval failures require rephrasing titles and adding synonym coverage. Overlap failures require splitting entries.

A 90%+ accuracy rate on the 20-question test (18 of 20 correct) is a reasonable bar for go-live. Below 80%, the KB has structural problems that will generate customer-facing errors. This connects directly to the measurement approach in the AI measurement framework — you want a pre-launch baseline, not a post-complaint diagnosis.

For context on where a well-built KB fits within a broader AI operations decision: if you haven’t yet resolved the ERP question, the sequencing matters. A KB-backed assistant can operate alongside legacy systems without deep integration, which is part of why AI often makes sense before ERP modernisation for many mid-market operators.

Maintenance: The Step Most Operators Skip

A KB is not a one-time project. It decays. Products change. Policies update. New query types emerge. Without a maintenance schedule, a KB that was accurate at launch can drift to something like 60% accurate six months later, and a partially accurate AI support layer is worse for customer trust than a slower human response.

Minimum viable maintenance for a distributor KB:

Monthly: Review the previous month’s escalated queries. Every escalation is either a KB gap or a retrieval failure. Add or fix the corresponding entry.
Quarterly: Audit Category 2 (product data). Flag entries where specifications, pricing, or availability have changed.
On policy change: Update affected entries within 48 hours of any change to payment terms, delivery conditions, or return policies. Stale policy entries are a compliance and customer-relations risk.

Assign one person 2–3 hours per month to this. For a B2B distributor handling 300–500 queries per month through an AI layer, the maintenance cost is small relative to the volume it keeps functioning accurately.
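The quarterly product-data audit is easier if every entry records when it was last reviewed. The Python sketch below assumes each entry is stored with a category and a last_reviewed date, and treats 90 days as the quarterly window. Those field names and the sample entries are illustrative only.

```python
from datetime import date, timedelta


def entries_due_for_audit(entries: list[dict], today: date,
                          max_age_days: int = 90) -> list[str]:
    """Return titles of product-data entries not reviewed within the window."""
    cutoff = today - timedelta(days=max_age_days)
    return [
        e["title"] for e in entries
        if e["category"] == "product" and e["last_reviewed"] < cutoff
    ]


# Hypothetical entries for illustration.
kb = [
    {"title": "What sizes does carton A come in?", "category": "product",
     "last_reviewed": date(2025, 1, 10)},
    {"title": "Which pricing tier applies to repeat orders?", "category": "product",
     "last_reviewed": date(2025, 5, 20)},
    {"title": "How do I return a damaged product?", "category": "policy",
     "last_reviewed": date(2024, 11, 1)},
]
due = entries_due_for_audit(kb, today=date(2025, 6, 1))
```

Policy entries are deliberately left out of this check because they are updated on change, not on a calendar.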

The work of building this properly is front-loaded and tractable. Operators who skip it and load a document dump instead typically see resolution rates in the 15–30% range and conclude AI support doesn’t work. Operators who build it correctly — atomic entries, tested retrieval, maintained currency — typically see 60–70% resolution rates within 60 days of launch. The delta is almost entirely in the KB structure, not the model.

AHoosh builds and validates KB architecture as part of AI support layer deployments. ahoosh.ai/contact