Essential Text Transformation Commands for File Processing

A comprehensive guide to text transformation commands for encoding conversion, file cleanup, and batch processing operations.

Character Encoding Conversion

Convert text files between different character encodings using iconv:

Basic encoding conversion:

iconv -f CP1252 -t UTF-8 example.txt > example-utf-8.txt

Windows-1251 to UTF-8 with error handling:

iconv -f windows-1251 -t UTF-8//IGNORE example.txt > example-utf-8.txt

The //IGNORE option skips characters that cannot be converted rather than stopping with an error.
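As a quick sanity check, a string can be round-tripped through a legacy encoding. This sketch assumes the local iconv build supports CP1252 (true for glibc and GNU libiconv):

```shell
# UTF-8 -> CP1252 -> UTF-8 round trip; the accented character survives
# because CP1252 can represent it.
printf 'café\n' | iconv -f UTF-8 -t CP1252 | iconv -f CP1252 -t UTF-8
```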

Text File Cleanup Operations

Remove duplicates and empty lines in one command:

awk 'NF && !seen[$0]++' inputfile.txt > outputfile.txt
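In that one-liner, NF is non-zero only for lines containing at least one field (so blank lines are dropped), and seen[$0]++ evaluates to zero only the first time a given line appears. A quick demonstration:

```shell
# The blank line and the repeated "a" are removed; the order of first
# occurrences is preserved.
printf 'a\n\na\nb\n' | awk 'NF && !seen[$0]++'
```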

Alternative method using grep and sort (uniq alone only removes adjacent duplicates, so the input must be sorted first; note this changes the line order):

grep -v "^[[:space:]]*$" input.txt | sort -u > output.txt

File Analysis and Statistics

Count the number of lines in a file:

wc -l file.txt

Get comprehensive file statistics:

wc file.txt  # Shows lines, words, and bytes
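Note that the third column reported by wc is a byte count, not a character count. For multi-byte UTF-8 text, wc -c (bytes) and wc -m (characters) differ; the -m result below assumes a UTF-8 locale:

```shell
printf 'café' | wc -c   # 5 bytes (é is two bytes in UTF-8)
printf 'café' | wc -m   # 4 characters in a UTF-8 locale
```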

File Splitting Operations

Split large files by line count:

split -l 1000 largefile.txt chunk_

This creates files named chunk_aa, chunk_ab, etc., each containing 1000 lines (the final chunk may be shorter).

Split files by size:

split -b 10M largefile.txt chunk_  # 10MB chunks
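Because split names its chunks in lexical order, a plain glob reassembles them. A small end-to-end sketch (file names are illustrative):

```shell
seq 1 6 > sample.txt          # create a 6-line sample file
split -l 2 sample.txt chunk_  # produces chunk_aa, chunk_ab, chunk_ac
cat chunk_* > restored.txt    # the glob expands in sorted order
cmp sample.txt restored.txt && echo "files match"
```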

File Search and Removal Operations

Find and remove files by name (the -type f test restricts matches to regular files, so plain rm is sufficient):

find . -type f -name "Success.txt" -exec rm {} \;

Alternative using xargs (more efficient for many files; the null-delimited form handles file names containing spaces or newlines):

find . -type f -name "Success.txt" -print0 | xargs -0 rm

Find files by pattern and size:

find . -name "*.log" -size +10M -exec rm {} \;  # Remove log files larger than 10MB
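GNU and BSD find also provide a built-in -delete action, which avoids spawning rm entirely. A sketch with illustrative directory and file names:

```shell
mkdir -p demo && touch demo/old.log demo/keep.txt
find demo -name "*.log" -delete
ls demo   # only keep.txt remains
```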

Batch Processing Examples

Convert multiple files to UTF-8:

mkdir -p utf8  # ensure the output directory exists
for file in *.txt; do
    iconv -f windows-1251 -t UTF-8//IGNORE "$file" > "utf8/${file}"
done

Clean multiple files:

mkdir -p cleaned  # ensure the output directory exists
for file in *.txt; do
    awk 'NF && !seen[$0]++' "$file" > "cleaned/${file}"
done
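The two batch loops can also be combined into a single pass that converts and cleans in one pipeline. The directory name processed/ and the sample file are illustrative; the sample is plain ASCII, which is also valid windows-1251:

```shell
mkdir -p processed
printf 'x\n\nx\ny\n' > sample.txt   # demo input with a blank line and a duplicate
for file in *.txt; do
    iconv -f windows-1251 -t UTF-8//IGNORE "$file" | awk 'NF && !seen[$0]++' > "processed/${file}"
done
```

Since each awk invocation starts with an empty seen array, duplicates are tracked per file, not across files.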