
NER Synonyms Training

This guide outlines how to improve the semantic search capabilities of TalkingDB by training its Named Entity Recognition (NER) component. Because TalkingDB is not a pre-trained LLM, it requires specific training to accurately recognize entities and their synonyms within your industry's context.

By leveraging Wikipedia’s massive link graph, we can train TalkingDB to understand that different surface forms (e.g., "Big Blue" and "IBM") refer to the same entity.


  1. Identify Wikipedia Categories

    First, you must manually identify which Wikipedia categories best represent your corpus text. This ensures the model learns synonyms relevant to your specific industry.

    1. Analyze your corpus: Review the category titles in your corpus to identify its main topics.

    2. Locate Categories: Find the corresponding Wikipedia Category URLs.

      • Input: Your corpus text
      • Output: A list of ~10 Wikipedia Category URLs
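
Later steps are easier to script if each category URL is first reduced to its page title. The sketch below is one possible way to do that with the Python standard library; the function name `category_title_from_url` is hypothetical and not part of TalkingDB.

```python
from urllib.parse import unquote, urlparse


def category_title_from_url(url: str) -> str:
    """Return the page title (e.g. 'Category:Sealants') from a Wikipedia category URL."""
    path = urlparse(url).path
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError(f"Not a Wikipedia page URL: {url}")
    # Decode percent-escapes and turn underscores back into spaces.
    title = unquote(path[len(prefix):]).replace("_", " ")
    if not title.startswith("Category:"):
        raise ValueError(f"Not a category page: {url}")
    return title
```

Rejecting non-category URLs early helps catch an article URL pasted into the category list by mistake.
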
  2. Cherry-pick Wikipedia Articles

    Once the categories are identified, you need to select the specific articles that TalkingDB should learn from.

    1. Extract Articles: Manually skim the articles under each listed category and pick the ones relevant to your industry.

    2. Filter Results: For sandbox testing, 10–50 articles are sufficient. For production, you may target thousands.

      • Input: Wikipedia Category URLs.
      • Output: A curated list of Wikipedia Article URLs.
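
To speed up the manual skim, a candidate list can be pulled from the public MediaWiki Action API (`list=categorymembers`). This is a minimal sketch, not part of TalkingDB: the helper names and the `User-Agent` string are assumptions, it reads only the first page of results (no continuation handling), and its output still needs manual review before training.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_category_query(category_title: str, limit: int = 50) -> dict:
    """Build the query parameters for listing the pages in one category."""
    return {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category_title,  # e.g. "Category:Sealants"
        "cmtype": "page",           # skip subcategories and files
        "cmlimit": str(limit),
        "format": "json",
    }


def article_urls_from_response(data: dict) -> list:
    """Turn a categorymembers response into article URLs (main namespace only)."""
    members = data.get("query", {}).get("categorymembers", [])
    return [
        "https://en.wikipedia.org/wiki/" + quote(m["title"].replace(" ", "_"))
        for m in members
        if m.get("ns") == 0
    ]


def fetch_category_articles(category_title: str, limit: int = 50) -> list:
    """Fetch the first page of candidate article URLs for one category."""
    url = API_URL + "?" + urlencode(build_category_query(category_title, limit))
    # Hypothetical User-Agent; replace with one that identifies your own tool.
    request = Request(url, headers={"User-Agent": "ner-synonyms-example/0.1"})
    with urlopen(request) as response:
        return article_urls_from_response(json.load(response))
```

Calling `fetch_category_articles("Category:Sealants")` would return up to 50 candidate URLs to review by hand.
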
  3. Train Synonyms into TalkingDB

    In this step, we give TalkingDB a list of synonyms for each selected article title.

    1. Submit for Training: Use the NER Synonyms endpoint to ingest the curated article list:

      • Example Input (Request body): { "namespace": "polyguard", "src_id": "sealant", "aliases": [ "caulks" ] }
    2. Bulk Upload (Optional): If you already have a custom synonym list, use the "Many" endpoint: POST /ner/synonyms/many. See the sample JSON file for the expected format.
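
Request bodies shaped like the example above can be assembled programmatically. The sketch below mirrors only the three fields shown in the example (`namespace`, `src_id`, `aliases`); the helper name `build_synonym_body` is hypothetical, and the alias clean-up (trimming whitespace, case-insensitive de-duplication) is an assumption rather than documented TalkingDB behavior. The format expected by the "Many" endpoint is not shown here, so check the sample JSON file before sending a list of these bodies to it.

```python
import json


def build_synonym_body(namespace: str, src_id: str, aliases: list) -> dict:
    """Build one request body with the fields used in the example input."""
    seen = set()
    cleaned = []
    for alias in aliases:
        text = alias.strip()
        key = text.lower()
        # Drop empty strings and case-insensitive duplicates, keeping first spelling.
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    if not cleaned:
        raise ValueError(f"No usable aliases for src_id {src_id!r}")
    return {"namespace": namespace, "src_id": src_id, "aliases": cleaned}


body = build_synonym_body("polyguard", "sealant", ["caulks", " Caulks ", ""])
print(json.dumps(body))
```
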

  4. Test and Fine-tune

    The final step is to verify that the semantic search now recognizes the synonyms you've trained.

    1. Verify via Search: Go to the Kray AI Search Interface, run a query that uses an alternate name or synonym, and check that the results match the intended entity.

    2. Iterate: If the entity is not recognized, return to Step 2 and ensure more relevant articles are included in the training set.

Now you are done with the NER synonyms training! 🎉