Extracting Keywords and Word Counts from Text in MongoDB

In the design bureau we were trained to impose order on messy logbooks. MongoDB’s aggregation framework gives the same feeling: raw sentences come in, tidy keyword summaries go out, and the data never leaves the database.

When text stays unstructured inside documents, I want fast counts and clear signals. With a few aggregation stages I can slice tokens, throw away useless words, and tally the rest without leaning on external ETL machinery.

By the end of this drill you will know how to:

  1. Load a small batch of text documents
  2. Tokenize and normalize every sentence
  3. Remove common stopwords that add no value
  4. Count keyword frequency per document
  5. Estimate simple TF-IDF weights for rarer terms

1. Prepare a Sample Collection

As with any lab bench exercise, start with a compact dataset you can inspect by eye.

use textdb

db.articles.insertMany([
  {
    _id: 1,
    lang: "en",
    text: "MongoDB is a powerful NoSQL database that allows flexible data processing and text analysis."
  },
  {
    _id: 2,
    lang: "en",
    text: "Text processing in MongoDB can extract keywords, analyze frequency, and generate useful insights."
  }
])

Later stages will run directly over the text field, so keep it consistent.

2. Register a Stopword Inventory

In our factories we kept bins of fasteners; here we keep a short list of high-frequency words to discard. A lightweight collection works fine.

db.stopwords.insertMany([
  { lang: "en", word: "a" },
  { lang: "en", word: "the" },
  { lang: "en", word: "and" },
  { lang: "en", word: "is" },
  { lang: "en", word: "in" },
  { lang: "en", word: "of" },
  { lang: "en", word: "that" },
  { lang: "en", word: "can" },
  { lang: "en", word: "to" },
  { lang: "en", word: "be" }
])
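If the inventory grows beyond a toy size, an index keeps per-language reads cheap. A sketch; the unique option is an assumption to guard against duplicate entries:

```javascript
// Compound index so stopword lookups by language stay fast;
// unique blocks accidental duplicate (lang, word) pairs.
db.stopwords.createIndex({ lang: 1, word: 1 }, { unique: true })
```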

3. Tokenize and Normalize Text

MongoDB does not hand us a full tokenizer, so we assemble one from $regexFindAll, $map, and $filter. The pipeline below drops everything to lowercase, extracts alphabetic words, and removes empty strings.

db.articles.aggregate([
  {
    $addFields: {
      tokens: {
        $filter: {
          input: {
            $map: {
              input: {
                $regexFindAll: {
                  input: { $toLower: "$text" },
                  regex: "[a-zA-Z]+" // stick with alphabetic terms
                }
              },
              as: "t",
              in: "$$t.match"
            }
          },
          as: "word",
          cond: { $ne: ["$$word", ""] }
        }
      }
    }
  },
  { $project: { _id: 1, lang: 1, tokens: 1 } }
])

Sample output:

{
  "_id": 1,
  "lang": "en",
  "tokens": [
    "mongodb", "is", "a", "powerful", "nosql", "database",
    "that", "allows", "flexible", "data", "processing", "and", "text", "analysis"
  ]
}
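The same transformation is easy to sanity-check client-side before trusting the pipeline. A minimal mirror of the tokenizer, assuming the same alphabetic regex after lowercasing:

```javascript
// Client-side mirror of $toLower + $regexFindAll, for desk checks only;
// the pipeline itself does this work inside the database.
function tokenize(text) {
  // match() returns null when nothing matches, so default to an empty array
  return text.toLowerCase().match(/[a-z]+/g) || [];
}

tokenize("MongoDB is a powerful NoSQL database.")
// → ["mongodb", "is", "a", "powerful", "nosql", "database"]
```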

4. Remove Stopwords

Before counting, we strip away the verbal filler. Think of it as deburring a machined part. For this example I read the stopwords once and keep them in memory; production code can reach for $lookup.

const stopwords = db.stopwords.find({ lang: "en" }).toArray().map(w => w.word);

db.articles.aggregate([
  {
    $addFields: {
      tokens: {
        $filter: {
          input: {
            $map: {
              input: {
                $regexFindAll: {
                  input: { $toLower: "$text" },
                  regex: "[a-zA-Z]+"
                }
              },
              as: "t",
              in: "$$t.match"
            }
          },
          as: "word",
          cond: { $and: [
            { $ne: ["$$word", ""] },
            { $not: { $in: ["$$word", stopwords] } }
          ]}
        }
      }
    }
  },
  { $project: { _id: 1, tokens: 1 } }
])

The resulting token list now ignores items such as “is”, “a”, and “and”.
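The filtering condition itself is simple enough to verify on the bench first. A plain-JavaScript mirror of the $filter logic (a Set gives constant-time membership checks; the in-database $in does the same job):

```javascript
// Mirror of the stopword $filter condition, for desk-checking only.
function removeStopwords(tokens, stopwords) {
  const stop = new Set(stopwords);
  return tokens.filter(word => word !== "" && !stop.has(word));
}

removeStopwords(
  ["mongodb", "is", "a", "powerful", "database"],
  ["a", "the", "and", "is"]
)
// → ["mongodb", "powerful", "database"]
```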

5. Count Keyword Occurrences (TF)

With the debris removed, we can pull counts. Reusing the stopwords array from the previous step, I unwind the token array and group by document and word, same way we would tally parts in a storeroom.

db.articles.aggregate([
  {
    $addFields: {
      tokens: {
        $filter: {
          input: {
            $map: {
              input: {
                $regexFindAll: {
                  input: { $toLower: "$text" },
                  regex: "[a-zA-Z]+"
                }
              },
              as: "t",
              in: "$$t.match"
            }
          },
          as: "word",
          cond: { $and: [
            { $ne: ["$$word", ""] },
            { $not: { $in: ["$$word", stopwords] } }
          ]}
        }
      }
    }
  },
  { $unwind: "$tokens" },
  {
    $group: {
      _id: { doc: "$_id", word: "$tokens" },
      count: { $sum: 1 }
    }
  },
  {
    $sort: { "_id.doc": 1, count: -1 }
  }
])

Example output:

{ "_id": { "doc": 1, "word": "mongodb" }, "count": 1 }
{ "_id": { "doc": 1, "word": "powerful" }, "count": 1 }
{ "_id": { "doc": 1, "word": "database" }, "count": 1 }
{ "_id": { "doc": 1, "word": "text" }, "count": 1 }
{ "_id": { "doc": 2, "word": "mongodb" }, "count": 1 }
{ "_id": { "doc": 2, "word": "text" }, "count": 1 }
{ "_id": { "doc": 2, "word": "keywords" }, "count": 1 }

These values are the term frequencies for each document.
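For a quick desk check, the $unwind/$group tally can be mirrored in plain JavaScript. A sketch, assuming the tokens are already stopword-filtered:

```javascript
// Mirror of the $unwind + $group + $sum counting stage.
function termFrequencies(tokens) {
  const tf = {};
  for (const word of tokens) {
    tf[word] = (tf[word] || 0) + 1;
  }
  return tf;
}

termFrequencies(["mongodb", "text", "processing", "text"])
// → { mongodb: 1, text: 2, processing: 1 }
```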

6. (Optional) Compute TF-IDF

When I need to highlight uncommon words across the corpus, I add a quick TF-IDF stage. It multiplies the per-document count by the natural logarithm of the total number of documents divided by the number of documents that contain the word.

db.articles.aggregate([
  // reuse the same tokenization and stopword filter
  { $addFields: { ... } },
  { $unwind: "$tokens" },
  { $group: { _id: { doc: "$_id", word: "$tokens" }, tf: { $sum: 1 } } },
  { $group: { _id: "$_id.word", df_docs: { $addToSet: "$_id.doc" }, tfs: { $push: "$$ROOT" } } },
  { $project: { _id: 1, df: { $size: "$df_docs" }, tfs: 1 } },
  { $unwind: "$tfs" },
  {
    $addFields: {
      tfidf: {
        $multiply: [
          "$tfs.tf",
          { $ln: { $divide: [db.articles.countDocuments(), "$df"] } } // $ln, not $log: $log requires an explicit base
        ]
      }
    }
  },
  { $sort: { "tfs._id.doc": 1, tfidf: -1 } }
])

For larger workloads, measure the document counts once and reuse them, same way we would pre-calc material allowances.
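The weight itself reduces to one formula, so the counting is the only expensive part. A sketch with the corpus size passed in once (totalDocs is the precomputed document count, not a pipeline field):

```javascript
// tf * ln(N / df), with N (total documents) measured once up front.
function tfIdf(tf, df, totalDocs) {
  return tf * Math.log(totalDocs / df);
}

// A word present in every document scores zero:
tfIdf(3, 2, 2) // → 0
```

A word appearing in one of two documents gets weighted by ln(2) ≈ 0.693, which is exactly the discrimination signal the pipeline version produces.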

7. Results and Uses

At this point we hold a compact set of keyword counts and optional weights for every document. That dataset supports search, quick tagging, dashboards, or even a word cloud if management insists.

Practical deployments:

  • Build lightweight full-text search helpers inside MongoDB
  • Auto-tag articles, product reviews, or maintenance logs
  • Track keyword trends over time for reporting
  • Run NLP-style preprocessing without shelling out to another system

Summary

Step               | Purpose                             | Key MongoDB Operators
Tokenize           | Break text into normalized tokens   | $regexFindAll, $map, $toLower
Filter             | Drop high-noise stopwords           | $filter, $not, $in
Count              | Tally per-document frequency        | $unwind, $group, $sum
Rank               | Order results for inspection        | $sort, $limit
(Optional) TF-IDF  | Emphasize rare but important terms  | $ln, $divide

Key Takeaways

  • The aggregation pipeline is enough for hands-on keyword extraction; no external library is required for a basic rig.
  • Store the output in a separate collection such as keywords if you plan to reuse it in dashboards or search.
  • Treat each stage like a machining step: keep inputs clean, outputs inspected, and the whole process stays predictable.
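Persisting the output can be a one-stage change: appending a $merge stage to the counting pipeline from step 5 writes each result straight into a reusable collection. A sketch; the keywords collection name and the merge behavior are assumptions:

```javascript
// Appended as the final stage of the TF pipeline:
// upserts each { doc, word } count into the keywords collection.
{ $merge: { into: "keywords", whenMatched: "replace", whenNotMatched: "insert" } }
```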

MongoDB’s pipeline architecture works the way we liked our tooling in the Union—modular, deterministic, and fully within reach of a disciplined engineer.