Text File Cleanup and Deduplication Techniques

Efficient methods for cleaning up text files and removing duplicate content using command-line tools.

Sort and Remove Duplicates

Sort file lines alphabetically and remove duplicate entries (sort -u file.txt does the same in a single command):

sort file.txt | uniq > cleaned_file.txt

For unique lines only (lines that appear exactly once):

sort file.txt | uniq -u > unique_lines.txt
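The two commands behave differently on repeated lines. A quick check with throwaway sample data (the file name sample.txt is just for illustration):

```shell
# Four lines, with "b" appearing twice
printf 'b\na\nb\nc\n' > sample.txt

sort sample.txt | uniq      # one copy of every line: a, b, c
sort sample.txt | uniq -u   # only lines that occur exactly once: a, c
```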

AWK-Based Deduplication

Use AWK for more efficient deduplication that preserves original order:

awk '!seen[$0]++' file.txt > deduplicated_file.txt

This approach:

  • Maintains the original order of first occurrences
  • Runs in a single pass with no sorting step, so it is typically faster on large files (at the cost of holding every distinct line in memory)
  • Doesn’t require pre-sorting

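Each line is used as a key into the seen array; the expression is true only the first time a line appears, so later copies are skipped. The order preservation is easy to see on a small made-up input:

```shell
printf 'banana\napple\nbanana\ncherry\napple\n' | awk '!seen[$0]++'
# banana
# apple
# cherry
```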
Remove Empty Lines

Clean up files by removing blank lines:

grep -v "^[[:space:]]*$" file.txt > cleaned_file.txt

Or using sed (note this removes only completely empty lines, not whitespace-only ones):

sed '/^$/d' file.txt > cleaned_file.txt
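The two filters are not quite equivalent: sed '/^$/d' keeps lines that contain only spaces or tabs, while the grep pattern drops them. A small comparison on made-up input:

```shell
# Input: "a", an empty line, a line holding a single space, "b"
printf 'a\n\n \nb\n' | sed '/^$/d'               # keeps the space-only line
printf 'a\n\n \nb\n' | grep -v "^[[:space:]]*$"  # drops it
printf 'a\n\n \nb\n' | sed '/^[[:space:]]*$/d'   # sed equivalent of the grep filter
```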

Combined Cleanup

Remove duplicates, empty lines, and sort in one command:

awk 'NF && !seen[$0]++' file.txt | sort > fully_cleaned.txt

This command:

  • NF - Skips empty lines (NF, the field count, is zero for blank or whitespace-only lines)
  • !seen[$0]++ - Removes duplicates
  • sort - Alphabetically sorts the results (omit this stage to keep the original order instead)
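Putting it together on a small hypothetical input containing duplicates, an empty line, and a whitespace-only line:

```shell
printf 'b\n\na\nb\n \nc\n' | awk 'NF && !seen[$0]++' | sort
# a
# b
# c
```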

Case-Insensitive Deduplication

Remove duplicates while ignoring case differences:

awk '!seen[tolower($0)]++' file.txt > case_insensitive_clean.txt
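The first occurrence wins, so its original casing is kept in the output. For example, on made-up input:

```shell
printf 'Apple\napple\nAPPLE\nbanana\n' | awk '!seen[tolower($0)]++'
# Apple
# banana
```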