<SOLVED> How to set encoding for fonts


    Hi, I'm having trouble with encoding. I thought I was using only UTF-8, as defined in my locale.
    Code:
    $ locale
    LANG=pt_BR.UTF-8
    ...
    Well, I guessed that was enough, but it is not. Today I downloaded a file (and you are welcome to do the same):
    Code:
    $ wget -k https://www.planalto.gov.br/ccivil_03/Constituicao/Constitui%E7ao_Compilado.htm
    The file on my HD should be Constituiçao_Compilado.htm, but when I run 'ls' on it:
    Code:
    $ ls Cons*
    Constitui?ao_Compilado.htm
    All right, something's wrong here. Have to figure out what's going on...
    Code:
    $ ls Cons* | hexdump -C
    00000000 43 6f 6e 73 74 69 74 75 69 e7 61 6f 5f 43 6f 6d |Constitui.ao_Com|
    00000010 70 69 6c 61 64 6f 2e 68 74 6d 0a        |pilado.htm.|
    0000001b
    See that e7 on the first line? I assumed that was a ccedil (ç) in UTF-8, as it should be. So why is it displayed wrong?
    OK, I'll try renaming it:
    Code:
    $ mv $'Constitui\xe7ao_Compilado.htm' Constituiçao_Compilado.htm
    $ ls Cons*
    Constituiçao_Compilado.htm
    $
    $ ls Cons* | hexdump -C
    00000000 43 6f 6e 73 74 69 74 75 69 c3 a7 61 6f 5f 43 6f |Constitui..ao_Co|
    00000010 6d 70 69 6c 61 64 6f 2e 68 74 6d 0a       |mpilado.htm.|
    0000001c
    Gotcha! Now c3 a7 appears, which I took to be a ccedil in ISO-8859-1. Geez, where's that coming from?
    I know fonts are not standardized on Linux, but how can I apply UTF-8 system-wide and get rid of that annoying ISO-8859-1?
    I think what I'm seeing is related to X (whose configuration is not tied to the locale), but I had no luck in System Settings; the setting is probably somewhere else. Ideas are welcome.
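For what it's worth, the manual 'mv' above can be sketched as a small helper built on iconv. This is only a sketch under one assumption: the old name really is stored as ISO-8859-1 bytes. The function names to_utf8_name and fix_name are made up for this example.

```shell
#!/bin/sh
# Sketch: rename a file whose name is stored as ISO-8859-1 bytes so that the
# name becomes UTF-8. Assumption: the old name really is ISO-8859-1.

# to_utf8_name NAME: print NAME converted from ISO-8859-1 to UTF-8.
to_utf8_name() {
    printf '%s' "$1" | iconv -f ISO-8859-1 -t UTF-8
}

# fix_name FILE: rename FILE (in the current directory) to its UTF-8 form.
fix_name() {
    old=$1
    new=$(to_utf8_name "$old") || return 1
    # A pure-ASCII name converts to itself, so there is nothing to rename.
    [ "$old" = "$new" ] && return 0
    mv -- "$old" "$new"
}

# Example (the octal escape \347 is the byte e7):
#   fix_name "$(printf 'Constitui\347ao_Compilado.htm')"
```

Note that running it on a name that is already UTF-8 would encode it a second time, so it should only be applied to names known to be ISO-8859-1.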

    #2
    Re: How to set encoding for fonts

    Well, I did some research on this and reached some conclusions.
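    1- according to the Unicode HOWTO, the locale drives the UTF-8 configuration and X follows it, so X needs no specific UTF-8 configuration
    2- I made a mistake: I had the two the wrong way round. E7 is the Unicode code point of 'ç' (U+00E7), which is also its single byte in ISO-8859-1, while C3 A7 is its UTF-8 encoding
    3- consequently, nothing on my system is producing ISO-8859-1; the lone E7 byte came from the %E7 in the URL, which wget wrote into the file name as-is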
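To check which bytes each encoding actually uses for 'ç', here is a minimal sketch with iconv and od. The helper name bytes_of is made up for this example, and the character is fed in as its UTF-8 bytes so the script itself stays plain ASCII.

```shell
#!/bin/sh
# bytes_of ENCODING: print the bytes of 'ç' in ENCODING as hex.
# The character is given as its UTF-8 bytes (octal \303\247 = c3 a7).
bytes_of() {
    printf '\303\247' | iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
}

for enc in UTF-8 ISO-8859-1 ISO-8859-15; do
    printf '%-12s %s\n' "$enc" "$(bytes_of "$enc")"
done
```

This should report c3a7 for UTF-8 and e7 for both ISO encodings.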

    The ISO stuff came up because I wrote a very simple Perl script to see how some input looks in a specific encoding. AFAIK, there is no reliable way to detect which encoding a file is using, so this script would show me the output for each encoding and let me guess the most probable one.
    The results were not as expected; I'll explain below. For the brave, here is the code:
    Code:
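    #!/usr/bin/perl
    # name: show_encodes.pl
    # Prints each input line three times, switching STDOUT's encoding layer each time.
    # Note: the input is never decoded, so $_ holds raw bytes, one per character.
    while (<>)
    {
      eval
      {
         binmode(STDOUT, ":utf8");                    # emit as UTF-8
         print "UTF-8\n", $_, "\n";
         binmode(STDOUT, ":encoding(iso-8859-1)");    # emit as ISO-8859-1
         print "ISO-8859-1\n", $_, "\n";
         binmode(STDOUT, ":encoding(iso-8859-15)");   # emit as ISO-8859-15
         print "ISO-8859-15\n", $_, "\n";
      };
    }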
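On the point that an encoding cannot be guessed with certainty: one thing that can be checked is whether the bytes are valid UTF-8 at all, because iconv exits with an error on input that is not. A rough sketch, with made-up helper names is_utf8 and classify; a "not valid" result does not say which other encoding it is.

```shell
#!/bin/sh
# is_utf8: succeed if the bytes on stdin form valid UTF-8.
# Pure ASCII also passes, because ASCII is a subset of UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# classify STRING: report whether STRING's bytes are valid UTF-8.
classify() {
    if printf '%s' "$1" | is_utf8; then
        echo "valid UTF-8"
    else
        echo "not valid UTF-8"
    fi
}

classify "$(printf 'Constitui\303\247ao')"   # contains c3 a7
classify "$(printf 'Constitui\347ao')"       # contains a lone e7
```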
    Well, the best source of information I found is here; please have a look at the table on that page.
    My conclusion: when we run 'touch ç', the file name is stored using the bytes in the column named UTF-hx (the UTF-8-encoded bytes as hexadecimal numbers):
    Code:
    $ ls ç | hexdump -C
    00000000 c3 a7 0a                     |...|
    But see what happens when we run the script above:
    Code:
    $ ls ç | ~/bin/show_encodes.pl
    UTF-8
    ç
    
    ISO-8859-1
    ç
    
    ISO-8859-15
    �
    Nice, huh? It seems to be ISO-8859-1. What happened is that the ':encoding' layer expects characters, that is, the values in the column U-hex (the Unicode code point in hexadecimal), not UTF-8 bytes, and the script fed it the raw bytes.

    Well, I don't know exactly what the difference between those formats is, but it was pretty confusing, especially because a lot of tables do not show both columns.
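As far as the two columns go, U-hex is the character's number (its code point) and UTF-hx is that number packed into bytes by the UTF-8 rules. A small worked sketch for the two-byte case, which covers U+0080 to U+07FF; the function name utf8_2byte is made up for this example.

```shell
#!/bin/sh
# utf8_2byte CODEPOINT: print the two UTF-8 bytes for a code point in the
# range U+0080..U+07FF, e.g. "utf8_2byte 0xE7".
utf8_2byte() {
    cp=$(($1))
    if [ "$cp" -lt 128 ] || [ "$cp" -gt 2047 ]; then
        echo "code point outside U+0080..U+07FF" >&2
        return 1
    fi
    b1=$(( 0xC0 | (cp >> 6) ))      # 110xxxxx: the top 5 bits
    b2=$(( 0x80 | (cp & 0x3F) ))    # 10xxxxxx: the low 6 bits
    printf '%02x %02x\n' "$b1" "$b2"
}

utf8_2byte 0xE7    # U+00E7 is 'ç'
```

For U+00E7 this prints c3 a7, which matches the hexdump of the file created with 'touch ç'.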
    I hope that can be useful to others.



      #3
      Re: <SOLVED> How to set encoding for fonts

      Walfred,
      Are you Brazilian, or do you speak Portuguese?

      Regards,

      Maciel
