How to Convert Files to UTF-8 Encoding in Linux


In this guide, we will describe what character encoding is and cover a few examples of converting files from one character encoding to another using a command-line tool. Finally, we will look at how to convert several files from any character set (charset) to UTF-8 encoding in Linux.

As you may already know, a computer does not understand or store letters, numbers, or anything else that we as humans can perceive; it stores only bits. A bit has just two possible values: 0 or 1, true or false, yes or no. Everything else, such as letters, numbers, and images, must be represented in bits for a computer to process it.

In simple terms, a character encoding is a way of telling a computer how to interpret raw zeros and ones as actual characters, where each character is represented by a set of numbers. When we type text in a file, the words and sentences we form are built from different characters, and characters are organized into a charset.

There are various encoding schemes in use, such as ASCII, ANSI, and Unicode. Below is an example of ASCII encoding.

Character  bits
A               01000001
B               01000010
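
As a quick check of the table above, the following sketch prints the 8-bit pattern of a character using only POSIX shell arithmetic. The helper name to_bits is hypothetical and introduced only for illustration; it assumes a single ASCII character as input.

```shell
# to_bits: hypothetical helper that prints the 8-bit binary form of one ASCII character
to_bits() {
    # A leading quote makes printf output the numeric code of the character
    code=$(printf '%d' "'$1")
    bits=""
    i=7
    # Walk from the most significant bit (7) down to the least significant (0)
    while [ "$i" -ge 0 ]; do
        bits="$bits$(( (code >> i) & 1 ))"
        i=$((i - 1))
    done
    printf '%s\n' "$bits"
}

to_bits A   # prints 01000001
to_bits B   # prints 01000010
```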

In Linux, the iconv command-line tool is used to convert text from one encoding to another.

You can check the encoding of a file using the file command with the -i or --mime flag, which prints the MIME type string, as in the examples below:

$ file -i Car.java
$ file -i CarDriver.java
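
When only the charset is of interest, the output can be narrowed down. The sketch below assumes the file utility supports the -b (brief) and --mime-encoding options; the helper name charset_of and the sample file names are made up for this illustration.

```shell
# charset_of: hypothetical helper that prints only the detected charset of a file
charset_of() {
    # Fall back to "unknown" when the file utility is not installed
    if ! command -v file >/dev/null 2>&1; then
        echo "unknown"
        return 0
    fi
    # -b drops the file name, --mime-encoding prints only the charset
    file -b --mime-encoding "$1"
}

# Work in a scratch directory so no real files are touched
workdir=$(mktemp -d)

# One plain ASCII file and one file holding "é" encoded as UTF-8 (bytes 0xC3 0xA9)
printf 'hello\n' > "$workdir/ascii.txt"
printf '\303\251\n' > "$workdir/utf8.txt"

charset_of "$workdir/ascii.txt"
charset_of "$workdir/utf8.txt"
```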

The syntax for using iconv is as follows:

$ iconv option
$ iconv options -f from-encoding -t to-encoding inputfile(s) -o outputfile 

Here, -f or --from-code specifies the input encoding, and -t or --to-code specifies the output encoding.
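
iconv also works as a filter: with no input file it reads standard input, and without -o it writes to standard output. The sketch below uses a made-up two-letter input and pipes the result into od to show the bytes produced.

```shell
# Convert the ASCII text "hi" to UTF-16LE and dump the resulting bytes in hex.
# Each ASCII character becomes two bytes in UTF-16LE, low byte first.
printf 'hi' | iconv -f ASCII -t UTF-16LE | od -An -tx1
```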

To list all known coded character sets, run the command below:

$ iconv -l 
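
The list is long, so it is often easier to filter it. A small sketch, assuming grep is available, that checks whether a particular name (UTF-8 here) is known to the local iconv:

```shell
# Case-insensitive search of the list for names containing "utf-8"
if iconv -l | grep -qi 'utf-8'; then
    echo "UTF-8 is supported"
else
    echo "UTF-8 is not listed"
fi
```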

Convert Files from ISO-8859-1 to UTF-8 Encoding

Next, we will learn how to convert from one encoding scheme to another. The command below converts from ISO-8859-1 to UTF-8.

Consider a file named input.file which contains the characters:

� � � �

Let us start by checking the encoding of the characters in the file and then viewing the file's contents. After that, we can convert all the characters to UTF-8 encoding.

After running the iconv command, we check the contents of the output file and its new character encoding, as shown below.

$ file -i input.file
$ cat input.file 
$ iconv -f ISO-8859-1 -t UTF-8//TRANSLIT input.file -o out.file
$ cat out.file 
$ file -i out.file 
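
To reproduce these steps without an existing file, the sketch below builds a one-character ISO-8859-1 sample, converts it, and compares the bytes before and after. The file names sample.latin1 and sample.utf8 are placeholders chosen for this example, and output is redirected with > rather than -o, which has the same effect.

```shell
workdir=$(mktemp -d)

# "é" is the single byte 0xE9 in ISO-8859-1 (octal 351)
printf '\351' > "$workdir/sample.latin1"

# Convert to UTF-8; the result goes to standard output and is redirected to a file
iconv -f ISO-8859-1 -t UTF-8 "$workdir/sample.latin1" > "$workdir/sample.utf8"

# Dump both files in hex: one byte (e9) before, two bytes (c3 a9) after
od -An -tx1 "$workdir/sample.latin1"
od -An -tx1 "$workdir/sample.utf8"
```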

Note: If the string //IGNORE is appended to the to-encoding, characters that cannot be converted are discarded and an error is printed after conversion.

In addition, if the string //TRANSLIT is appended to the to-encoding, as in the example above (UTF-8//TRANSLIT), characters being converted are transliterated when needed and where possible. This means that when a character cannot be represented in the target character set, it can be approximated by one or more similar-looking characters.

Consequently, any character that cannot be transliterated and is not in the target character set is replaced with a question mark (?) in the output.
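
The difference between the two suffixes is easiest to see when the target charset is small, such as ASCII. The sketch below feeds the UTF-8 word "café" through both. The replacement chosen by //TRANSLIT depends on the iconv implementation and the current locale, so no particular output is assumed here.

```shell
# "café" in UTF-8: the last character is the two bytes 0xC3 0xA9
word=$(printf 'caf\303\251')

# //TRANSLIT: é is approximated (for example by "e") or replaced by "?"
translit=$(printf '%s' "$word" | iconv -f UTF-8 -t ASCII//TRANSLIT)
echo "TRANSLIT: $translit"

# //IGNORE: é is dropped; iconv may still exit non-zero to report the loss,
# so the exit status is deliberately ignored here
ignored=$(printf '%s' "$word" | iconv -f UTF-8 -t ASCII//IGNORE 2>/dev/null || true)
echo "IGNORE:   $ignored"
```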

Convert Multiple Files to UTF-8 Encoding

Coming back to our main topic, to convert multiple or all files in a directory to UTF-8 encoding, you can write a small shell script called encoding.sh as follows:

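#!/bin/bash
# Input encoding: replace value_here with the encoding of your source files
FROM_ENCODING="value_here"
# Output encoding (UTF-8)
TO_ENCODING="UTF-8"
# Loop over every .txt file in the current directory and convert it
for file in *.txt; do
    # Skip the unexpanded pattern when the directory has no .txt files
    [ -e "$file" ] || continue
    iconv -f "$FROM_ENCODING" -t "$TO_ENCODING" "$file" -o "${file%.txt}.utf8.converted"
done
exit 0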

Save the file, then make the script executable. Run it from the directory where your files (*.txt) are located.

$ chmod  +x  encoding.sh
$ ./encoding.sh
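
A variation on the script is sketched here as a function, so that a failed conversion does not leave a partial output file behind. The name convert_all, its arguments, and the demo file are assumptions made for this illustration; output is redirected with > instead of -o.

```shell
# convert_all FROM_ENCODING DIR: hypothetical helper that converts DIR/*.txt to UTF-8
convert_all() {
    from=$1
    dir=$2
    for f in "$dir"/*.txt; do
        [ -e "$f" ] || continue              # the pattern matched nothing
        out="${f%.txt}.utf8.converted"
        if iconv -f "$from" -t UTF-8 "$f" > "$out"; then
            printf 'converted: %s\n' "$f"
        else
            printf 'failed: %s\n' "$f" >&2
            rm -f "$out"                     # remove the incomplete result
        fi
    done
}

# Demo in a scratch directory with a single ISO-8859-1 file containing "é"
demo=$(mktemp -d)
printf '\351\n' > "$demo/note.txt"
convert_all ISO-8859-1 "$demo"
```

Passing the source encoding and directory as arguments keeps the loop reusable, while the original files are left untouched.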

Important: You can also use this script for general conversion of multiple files from one encoding to another. Simply change the values of the FROM_ENCODING and TO_ENCODING variables, and remember to adjust the output file name "${file%.txt}.utf8.converted" as well.

For more information, look through the iconv man page.

$ man iconv

To sum up this guide, understanding encoding and how to convert from one character encoding scheme to another is necessary knowledge for every computer user, and even more so for programmers, when it comes to dealing with text.

Lastly, you can get in touch with us through the comment section below with any questions or feedback.