Extracting Keywords and Word Counts from Text in MongoDB
In the design bureau we were trained to impose order on messy logbooks. MongoDB’s aggregation framework gives the same feeling: raw sentences come in, tidy keyword summaries go out, and the data never leaves the database.
When text stays unstructured inside documents, I want fast counts and clear signals. With a few aggregation stages I can slice tokens, throw away useless words, and tally the rest without leaning on external ETL machinery.
By the end of this drill you will know how to:
- Load a small batch of text documents
- Tokenize and normalize every sentence
- Remove common stopwords that add no value
- Count keyword frequency per document
- Estimate simple TF-IDF weights for rarer terms
As with any lab bench exercise, start with a compact dataset you can inspect by eye.
use textdb
db.articles.insertMany([
{
_id: 1,
lang: "en",
text: "MongoDB is a powerful NoSQL database that allows flexible data processing and text analysis."
},
{
_id: 2,
lang: "en",
text: "Text processing in MongoDB can extract keywords, analyze frequency, and generate useful insights."
}
])
Later stages will run directly over the text field, so keep it consistent.
In our factories we kept bins of fasteners; here we keep a short list of high-frequency words to discard. A lightweight collection works fine.
db.stopwords.insertMany([
{ lang: "en", word: "a" },
{ lang: "en", word: "the" },
{ lang: "en", word: "and" },
{ lang: "en", word: "is" },
{ lang: "en", word: "in" },
{ lang: "en", word: "of" },
{ lang: "en", word: "that" },
{ lang: "en", word: "can" },
{ lang: "en", word: "to" },
{ lang: "en", word: "be" }
])
MongoDB does not hand us a full tokenizer, so we assemble one from $regexFindAll, $map, and $filter. The pipeline below drops everything to lowercase, extracts alphabetic words, and removes empty strings.
db.articles.aggregate([
{
$addFields: {
tokens: {
$filter: {
input: {
$map: {
input: {
$regexFindAll: {
input: { $toLower: "$text" },
regex: "[a-zA-Z]+" // stick with alphabetic terms
}
},
as: "t",
in: "$$t.match"
}
},
as: "word",
cond: { $ne: ["$$word", ""] }
}
}
}
},
{ $project: { _id: 1, lang: 1, tokens: 1 } }
])
Sample output:
{
"_id": 1,
"lang": "en",
"tokens": [
"mongodb", "is", "a", "powerful", "nosql", "database",
"that", "allows", "flexible", "data", "processing", "and", "text", "analysis"
]
}
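Outside the shell, the same tokenization is easy to check with a few lines of plain JavaScript. This is a client-side sketch for verifying the regex behavior, not part of the pipeline:

```javascript
// Plain-JavaScript sketch of the pipeline's tokenizer: lowercase the
// sentence, then extract alphabetic runs, mirroring $toLower + $regexFindAll.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z]+/g) || [];
}

const tokens = tokenize(
  "MongoDB is a powerful NoSQL database that allows flexible data processing and text analysis."
);
console.log(tokens.length); // 14 tokens, starting with "mongodb"
```

The `|| []` guard covers sentences with no alphabetic characters, where `match` would return `null`.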
Before counting, we strip away the verbal filler. Think of it as deburring a machined part. For this example I read the stopwords once and keep them in memory; production code can reach for $lookup.
const stopwords = db.stopwords.find({ lang: "en" }).toArray().map(w => w.word);
db.articles.aggregate([
{
$addFields: {
tokens: {
$filter: {
input: {
$map: {
input: {
$regexFindAll: {
input: { $toLower: "$text" },
regex: "[a-zA-Z]+"
}
},
as: "t",
in: "$$t.match"
}
},
as: "word",
cond: { $and: [
{ $ne: ["$$word", ""] },
{ $not: { $in: ["$$word", stopwords] } }
]}
}
}
}
},
{ $project: { _id: 1, tokens: 1 } }
])
The resulting token list now ignores items such as “is”, “a”, and “and”.
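The same membership test can be mirrored client-side; this sketch uses a JavaScript `Set` built from the word list we inserted above:

```javascript
// Stopword filtering outside the database: a Set gives O(1) membership
// checks, mirroring the $in test inside the $filter condition.
const STOPWORDS = new Set([
  "a", "the", "and", "is", "in", "of", "that", "can", "to", "be"
]);

function keepKeywords(tokens) {
  return tokens.filter((w) => w.length > 0 && !STOPWORDS.has(w));
}

console.log(keepKeywords(["mongodb", "is", "a", "powerful"])); // ["mongodb", "powerful"]
```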
With the debris removed, we can pull counts. I unwind the token array and group by document and word, the same way we would tally parts in a storeroom.
db.articles.aggregate([
{
$addFields: {
tokens: {
$filter: {
input: {
$map: {
input: {
$regexFindAll: {
input: { $toLower: "$text" },
regex: "[a-zA-Z]+"
}
},
as: "t",
in: "$$t.match"
}
},
as: "word",
cond: { $and: [
{ $ne: ["$$word", ""] },
{ $not: { $in: ["$$word", stopwords] } }
]}
}
}
}
},
{ $unwind: "$tokens" },
{
$group: {
_id: { doc: "$_id", word: "$tokens" },
count: { $sum: 1 }
}
},
{
$sort: { "_id.doc": 1, count: -1 }
}
])
Example output:
{ "_id": { "doc": 1, "word": "mongodb" }, "count": 1 }
{ "_id": { "doc": 1, "word": "powerful" }, "count": 1 }
{ "_id": { "doc": 1, "word": "database" }, "count": 1 }
{ "_id": { "doc": 1, "word": "text" }, "count": 1 }
{ "_id": { "doc": 2, "word": "mongodb" }, "count": 1 }
{ "_id": { "doc": 2, "word": "text" }, "count": 1 }
{ "_id": { "doc": 2, "word": "keywords" }, "count": 1 }
These values are the term frequencies for each document.
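For intuition, the $unwind + $group + $sum combination is the aggregation equivalent of a simple counting loop. A client-side sketch, not a replacement for the pipeline:

```javascript
// Count occurrences of each token, as $unwind + $group + { $sum: 1 }
// does server-side within each document.
function termFrequencies(tokens) {
  const counts = {};
  for (const w of tokens) {
    counts[w] = (counts[w] || 0) + 1;
  }
  return counts;
}

console.log(termFrequencies(["text", "mongodb", "text"])); // { text: 2, mongodb: 1 }
```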
When I need to highlight uncommon words across the corpus, I add a quick TF-IDF stage. It multiplies the per-document count by the logarithm of the total number of documents divided by the number of documents that contain the word.
const totalDocs = db.articles.countDocuments();
db.articles.aggregate([
// reuse the same tokenization and stopword filter
{ $addFields: { ... } },
{ $unwind: "$tokens" },
{ $group: { _id: { doc: "$_id", word: "$tokens" }, tf: { $sum: 1 } } },
{ $group: { _id: "$_id.word", df_docs: { $addToSet: "$_id.doc" }, tfs: { $push: "$$ROOT" } } },
{ $project: { _id: 1, df: { $size: "$df_docs" }, tfs: 1 } },
{ $unwind: "$tfs" },
{
$addFields: {
tfidf: {
$multiply: [
"$tfs.tf",
{ $log: { $divide: [totalDocs, "$df"] } }
]
}
}
},
{ $sort: { "tfs._id.doc": 1, tfidf: -1 } }
])
For larger workloads, measure the document counts once and reuse them, same way we would pre-calc material allowances.
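The weight formula itself is small enough to verify by hand. Here is a sketch of the same tf * log(N / df) computation, using numbers from our two-document corpus:

```javascript
// TF-IDF as the pipeline computes it: tf * ln(N / df), where N is the
// total number of documents and df is how many documents contain the word.
function tfidf(tf, df, totalDocs) {
  return tf * Math.log(totalDocs / df);
}

// "mongodb" appears in both documents (df = 2), so its weight is 0;
// "powerful" appears in only one (df = 1), so it scores higher.
console.log(tfidf(1, 2, 2)); // 0
console.log(tfidf(1, 1, 2)); // ≈ 0.693
```

A word present in every document always scores zero, which is exactly the behavior we want from a rarity signal.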
At this point we hold a compact set of keyword counts and optional weights for every document. That dataset supports search, quick tagging, dashboards, or even a word cloud if management insists.
Practical deployments:
- Build lightweight full-text search helpers inside MongoDB
- Auto-tag articles, product reviews, or maintenance logs
- Track keyword trends over time for reporting
- Run NLP-style preprocessing without shelling out to another system
| Step | Purpose | Key MongoDB Operator |
|---|---|---|
| Tokenize | Break text into lowercase tokens | $regexFindAll, $map, $toLower |
| Filter | Drop the high-noise words | $filter, $not, $in |
| Count | Tally per-document frequency | $unwind, $group, $sum |
| Rank | Order results for inspection | $sort, $limit |
| (Optional) TF-IDF | Emphasize rare but informative terms | $log, $divide |
- The aggregation pipeline is enough for hands-on keyword extraction; no external library is required for a basic rig.
- Store the output in a separate collection such as `keywords` if you plan to reuse it in dashboards or search.
- Treat each stage like a machining step: keep inputs clean and outputs inspected, and the whole process stays predictable.
MongoDB’s pipeline architecture works the way we liked our tooling in the Union—modular, deterministic, and fully within reach of a disciplined engineer.