Text File Cleanup and Deduplication Techniques
Efficient methods for cleaning up text files and removing duplicate content using command-line tools.
Sort file lines alphabetically and remove duplicate entries:
sort file.txt | uniq > cleaned_file.txt
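A quick sketch of this pipeline on hypothetical sample data (the file name and contents are illustrative only); note that sort -u file.txt is an equivalent shorthand:

```shell
# Create a sample file with duplicate entries (illustrative data).
printf 'banana\napple\nbanana\ncherry\napple\n' > file.txt

# sort groups identical lines together; uniq collapses adjacent duplicates.
sort file.txt | uniq > cleaned_file.txt

cat cleaned_file.txt
# apple
# banana
# cherry
```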
For unique lines only (lines that appear exactly once):
sort file.txt | uniq -u > unique_lines.txt
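To see the difference between deduplication and the -u flag, here is a small sketch with made-up data: -u discards every line that has a duplicate, rather than keeping one copy of it.

```shell
# Sample data: apple and cherry are duplicated, banana appears once.
printf 'apple\napple\nbanana\ncherry\ncherry\n' > file.txt

# uniq -u keeps only lines that occur exactly once in the sorted input.
sort file.txt | uniq -u > unique_lines.txt

cat unique_lines.txt
# banana
```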
Use AWK for more efficient deduplication that preserves original order:
awk '!seen[$0]++' file.txt > deduplicated_file.txt
This approach:
- Maintains the original order of first occurrences
- Often runs faster on large files, since it makes a single pass instead of sorting (though it keeps every unique line in memory)
- Doesn’t require pre-sorting
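The order-preserving behavior is easiest to see on unsorted input (sample data below is hypothetical): the awk array seen counts occurrences of each whole line ($0), and only lines whose count was zero are printed.

```shell
# Unsorted input with duplicates scattered throughout.
printf 'cherry\napple\ncherry\nbanana\napple\n' > file.txt

# Print a line only the first time it is seen, in original order.
awk '!seen[$0]++' file.txt > deduplicated_file.txt

cat deduplicated_file.txt
# cherry
# apple
# banana
```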
Clean up files by removing blank lines:
grep -v "^[[:space:]]*$" file.txt > cleaned_file.txt
Or using sed (note: this pattern deletes only completely empty lines; use sed '/^[[:space:]]*$/d' to also drop whitespace-only lines, matching the grep version above):
sed '/^$/d' file.txt > cleaned_file.txt
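A short illustration with hypothetical data containing both an empty line and a whitespace-only line; the grep pattern removes both:

```shell
# Sample file: one empty line, one line containing only spaces.
printf 'alpha\n\n   \nbeta\n' > file.txt

# The [[:space:]]* pattern matches empty and whitespace-only lines.
grep -v "^[[:space:]]*$" file.txt > cleaned_file.txt

cat cleaned_file.txt
# alpha
# beta
```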
Remove duplicates, empty lines, and sort in one command:
awk 'NF && !seen[$0]++' file.txt | sort > fully_cleaned.txt
This command:
- NF skips empty lines (lines with no fields)
- !seen[$0]++ removes duplicates while preserving order
- sort alphabetically sorts the results
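Putting the three steps together on sample data (file contents are illustrative):

```shell
# Sample input: duplicates mixed with empty lines.
printf 'cherry\n\napple\ncherry\n\nbanana\n' > file.txt

# Drop empty lines, deduplicate, then sort the survivors.
awk 'NF && !seen[$0]++' file.txt | sort > fully_cleaned.txt

cat fully_cleaned.txt
# apple
# banana
# cherry
```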
Remove duplicates while ignoring case differences:
awk '!seen[tolower($0)]++' file.txt > case_insensitive_clean.txt
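Because the array is keyed on tolower($0), lines differing only in case count as duplicates; the first-seen casing of each line is the one that survives. A sketch with made-up data:

```shell
# Sample data: the same words in different cases.
printf 'Apple\napple\nBANANA\nbanana\n' > file.txt

# Deduplicate case-insensitively, keeping the first-seen spelling.
awk '!seen[tolower($0)]++' file.txt > case_insensitive_clean.txt

cat case_insensitive_clean.txt
# Apple
# BANANA
```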