Text File Cleanup and Deduplication Techniques

Efficient methods for cleaning up text files and removing duplicate content using command-line tools.

Sort and Remove Duplicates

Sort file lines alphabetically and remove duplicate entries (sort -u file.txt does the same in a single command):

sort file.txt | uniq > cleaned_file.txt

For unique lines only (lines that appear exactly once):

sort file.txt | uniq -u > unique_lines.txt
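The two commands behave differently on repeated lines. A quick check with throwaway sample data (the file name sample.txt is just for illustration):

```shell
# Four lines, with "b" appearing twice
printf 'b\na\nb\nc\n' > sample.txt

sort sample.txt | uniq      # one copy of every line: a, b, c
sort sample.txt | uniq -u   # only lines that occur exactly once: a, c
```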

AWK-Based Deduplication

Use AWK for more efficient deduplication that preserves original order:

awk '!seen[$0]++' file.txt > deduplicated_file.txt

This approach:

  • Maintains the original order of first occurrences
  • Runs in a single pass with no sorting step, so it is typically faster on large files (at the cost of holding every distinct line in memory)
  • Doesn’t require pre-sorting

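Each line is used as a key into the seen array; the expression is true only the first time a line appears, so later copies are skipped. The order preservation is easy to see on a small made-up input:

```shell
printf 'banana\napple\nbanana\ncherry\napple\n' | awk '!seen[$0]++'
# banana
# apple
# cherry
```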
Remove Empty Lines

Clean up files by removing blank lines:

grep -v "^[[:space:]]*$" file.txt > cleaned_file.txt

Or using sed (note this removes only completely empty lines, not whitespace-only ones):

sed '/^$/d' file.txt > cleaned_file.txt
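The two filters are not quite equivalent: sed '/^$/d' keeps lines that contain only spaces or tabs, while the grep pattern drops them. A small comparison on made-up input:

```shell
# Input: "a", an empty line, a line holding a single space, "b"
printf 'a\n\n \nb\n' | sed '/^$/d'               # keeps the space-only line
printf 'a\n\n \nb\n' | grep -v "^[[:space:]]*$"  # drops it
printf 'a\n\n \nb\n' | sed '/^[[:space:]]*$/d'   # sed equivalent of the grep filter
```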

Combined Cleanup

Remove duplicates, empty lines, and sort in one command:

awk 'NF && !seen[$0]++' file.txt | sort > fully_cleaned.txt

This command:

  • NF - Skips empty lines (NF, the field count, is zero for blank or whitespace-only lines)
  • !seen[$0]++ - Removes duplicates
  • sort - Alphabetically sorts the results (omit this stage to keep the original order instead)
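Putting it together on a small hypothetical input containing duplicates, an empty line, and a whitespace-only line:

```shell
printf 'b\n\na\nb\n \nc\n' | awk 'NF && !seen[$0]++' | sort
# a
# b
# c
```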

Case-Insensitive Deduplication

Remove duplicates while ignoring case differences:

awk '!seen[tolower($0)]++' file.txt > case_insensitive_clean.txt
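The first occurrence wins, so its original casing is kept in the output. For example, on made-up input:

```shell
printf 'Apple\napple\nAPPLE\nbanana\n' | awk '!seen[tolower($0)]++'
# Apple
# banana
```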