-rw-r--r-- 1 user user 123 Apr 5 12:00 zh_utf8.txt
-rw-r--r-- 1 user user 123 Apr 5 12:00 zh_gbk.txt
-rw-r--r-- 1 user user 123 Apr 5 12:00 zh_big5.txt
-rw-r--r-- 1 user user 123 Apr 5 12:00 ja_shift_jis.txt
-rw-r--r-- 1 user user 123 Apr 5 12:00 ko_euc_kr.txt
-rw-r--r-- 1 user user 123 Apr 5 12:00 fr_latin1.txt
-rw-r--r-- 1 user user 123 Apr 5 12:00 en_utf16le.txt
...
🔍 Step 4: Identify file types with the file command
file *.txt
✅ Example of expected output:
en_ascii.txt: ASCII text
en_utf8.txt: UTF-8 Unicode text
zh_gbk.txt: ISO-8859 text
zh_big5.txt: ISO-8859 text
ja_shift_jis.txt: ISO-8859 text
ko_euc_kr.txt: ISO-8859 text
fr_latin1.txt: ISO-8859 text
mixed_utf16le.txt: Little-endian UTF-16 Unicode text
en_utf16be.txt: Big-endian UTF-16 Unicode text
zh_utf8_bom.txt: UTF-8 Unicode (with BOM) text
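The output above puts GBK, Big5, Shift_JIS, EUC-KR and Latin-1 files all under the same "ISO-8859 text" label, while the BOM-carrying files are named precisely. The sketch below is one way to see why a coarse classifier ends up there. It is not how file is implemented. The helper names sniff_bom and classify are hypothetical, and the checks (BOM first, then ASCII, then strict UTF-8) are an assumed simplification.

```python
import codecs

# BOM signatures, checked longest-first so that the UTF-32 LE BOM
# (FF FE 00 00) is not mistaken for the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None


def classify(raw: bytes) -> str:
    """Coarse classification: BOM, then pure ASCII, then valid UTF-8."""
    bom_encoding = sniff_bom(raw)
    if bom_encoding is not None:
        return bom_encoding
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Legacy 8-bit and double-byte encodings all land here; telling
        # them apart needs statistical detection, not structural checks.
        return "unknown-8bit"
```

Anything that falls through to "unknown-8bit" is where a statistical detector earns its keep, since the bytes alone carry no marker of which legacy encoding produced them.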
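import glob

import cchardet  # third-party package: pip install cchardet

# Detect the encoding of every .txt file in the current directory.
for filepath in sorted(glob.glob("*.txt")):
    with open(filepath, "rb") as f:
        raw = f.read()
    result = cchardet.detect(raw)
    # detect() may return None for both fields (for example, on an empty
    # file), which would break the format specs below.
    encoding = result["encoding"] or "unknown"
    confidence = result["confidence"] or 0.0
    print(f"{filepath:20} → {encoding:10} (confidence: {confidence:.2f})")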
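| Language | Encoding | Filename |
| --- | --- | --- |
| Chinese (Simplified) | UTF-8 | zh_utf8.txt |
| Chinese (Simplified) | GBK | zh_gbk.txt |
| Chinese (Traditional) | Big5 | zh_big5.txt |
| Japanese | Shift_JIS | ja_shift_jis.txt |
| Korean | EUC-KR | ko_euc_kr.txt |
| Russian | UTF-8 | ru_utf8.txt |
| Arabic | UTF-8 | ar_utf8.txt |
| French | ISO-8859-1 | fr_latin1.txt |
| English | ASCII | en_ascii.txt |
| English | UTF-16LE | en_utf16le.txt |
| English | UTF-16BE | en_utf16be.txt |
| Chinese | UTF-8 with BOM | zh_utf8_bom.txt |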
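A set of test files like the one listed here can be generated with Python's built-in codecs. The following is a minimal sketch, not the original author's script. The sample strings, the write_samples helper and the target directory are all placeholders chosen for illustration. It also assumes the two UTF-16 files should start with a BOM, which Python's utf-16-le and utf-16-be codecs do not write on their own.

```python
import codecs
from pathlib import Path

# (filename, Python codec name, sample text). The texts are placeholders.
SAMPLES = [
    ("zh_utf8.txt", "utf-8", "你好世界"),
    ("zh_gbk.txt", "gbk", "你好世界"),
    ("zh_big5.txt", "big5", "你好世界"),
    ("ja_shift_jis.txt", "shift_jis", "こんにちは世界"),
    ("ko_euc_kr.txt", "euc_kr", "안녕하세요"),
    ("ru_utf8.txt", "utf-8", "Привет, мир"),
    ("ar_utf8.txt", "utf-8", "مرحبا بالعالم"),
    ("fr_latin1.txt", "latin-1", "Café à la crème"),
    ("en_ascii.txt", "ascii", "Hello, world"),
    ("en_utf16le.txt", "utf-16-le", "Hello, world"),
    ("en_utf16be.txt", "utf-16-be", "Hello, world"),
    ("zh_utf8_bom.txt", "utf-8-sig", "你好世界"),
]

# Assumption: the UTF-16 samples carry an explicit BOM. The endian-specific
# codecs omit it, so it is prepended by hand.
UTF16_BOMS = {
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
}


def write_samples(directory="."):
    """Write every sample file into `directory` and return the paths written."""
    out = Path(directory)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, codec, text in SAMPLES:
        data = UTF16_BOMS.get(codec, b"") + text.encode(codec)
        path = out / name
        path.write_bytes(data)
        written.append(path)
    return written
```

Calling write_samples("samples") writes the twelve files into a samples directory. Writing raw bytes with write_bytes, rather than opening the file in text mode, keeps the platform's default encoding and newline translation out of the picture.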