    [RESOLVED] Having a heck of a time getting a file "cleaned" so postgres 'copy' will accept it. Is "iconv" not working, or...?

    I have a text file that is 8,000,000+ lines long. It is tab-separated text, with the fields in double quotes. I am attempting to convert it to a csv file so I can "copy" the data into a postgres table. However, I keep running into non-UTF-8 encoding errors when attempting the copy.

    I can convert it to csv with this:

    Code:
    tr -s '\t' ',' < sourcefile.txt > sourcefile.csv
    and this works fine. It replaces the tabs with commas. However, when I go to "copy" it into postgres I get this:

    ERROR: invalid byte sequence for encoding "UTF8": 0xee 0x3c 0xf9
    CONTEXT: COPY sourcefile, line 1833286
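
One thing worth checking in that conversion (a sketch, using a made-up three-field sample row): `tr -s` squeezes runs of tabs into a single comma, so two consecutive tabs, i.e. an empty field, collapse and every later column shifts left. Dropping `-s` keeps the column count. Also, if I'm not mistaken, postgres COPY can read tab-delimited data directly with `DELIMITER E'\t'`, which would sidestep the conversion entirely.

```shell
# Made-up sample row: three fields, the middle one empty.
printf 'a\t\tb\n' > /tmp/demo.txt

# 'tr -s' squeezes the run of tabs into one comma: the empty
# field disappears and the later columns shift left.
tr -s '\t' ',' < /tmp/demo.txt        # prints: a,b

# Without -s, each tab is replaced individually and the column
# count is preserved.
tr '\t' ',' < /tmp/demo.txt           # prints: a,,b
```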

    This is supposed to strip anything that isn't valid UTF-8 from the file (note the output has to go to a different file; redirecting back onto the input truncates it before iconv reads it, which corrupts the result):

    Code:
    iconv -f utf-8 -t utf-8 -c sourcefile.csv > sourcefile_clean.csv

    'file' returns 'charset=us-ascii'

    so I used "ENCODING 'sql-ascii'" in the copy line, and then got this error:

    ERROR: unquoted carriage return found in data
    HINT: Use quoted CSV field to represent carriage return.
    CONTEXT: COPY sourcefile, line 1833286

    So same line, different error. I can't edit the file manually; it's too big for every editor I've tried so far. I could delete the line and re-enter it by hand, but I'm concerned there will be thousands more lines with the same issue. AFAIK I can't fix it with sed, because I don't know exactly what the bad part looks like.

    Looking for suggestions on how to clean up this file...
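
One way to see exactly which lines COPY will choke on, without opening the file: in a UTF-8 locale, grep's `.` only matches valid characters, so inverting a match-everything pattern prints just the lines containing invalid bytes, with their line numbers. A sketch with a made-up two-line sample (the octal escapes below reproduce the 0xee 0x3c 0xf9 sequence from the error); the same cleanup pass can also drop stray carriage returns, which trigger the "unquoted carriage return" error:

```shell
# Made-up sample: line 2 carries invalid UTF-8 (0xee 0x3c 0xf9, as in
# the COPY error) plus a bare carriage return.
printf 'good line\nbad \356<\371 line\r\n' > /tmp/sample.csv

# In a UTF-8 locale '.' matches only valid characters, so inverting a
# match-anything pattern lists the offending lines with line numbers.
LC_ALL=C.UTF-8 grep -naxv '.*' /tmp/sample.csv

# Drop the invalid bytes (-c) and stray carriage returns in one pass,
# writing to a NEW file so the input isn't truncated mid-read.
iconv -f utf-8 -t utf-8 -c /tmp/sample.csv | tr -d '\r' > /tmp/sample_clean.csv
```

This only tells you *where* the damage is; whether to drop the bytes or repair the fields by hand is still a judgment call.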

    Please Read Me

    #2
    OK, well, never mind... sort of. The above errors happened because the file got corrupted during one of the conversions, and the corruption started at line 1833286. I'm going to try it all again.


      #3
      Tried again with a fresh csv conversion. Since "file" reports it as ascii, I tried both sql-ascii and utf8 and got these errors:

      postgres=# \copy schema.table from '/tmp/sourcefile.csv' DELIMITER ',' ENCODING 'sql-ascii' CSV HEADER;
      ERROR: unquoted carriage return found in data
      HINT: Use quoted CSV field to represent carriage return.
      CONTEXT: COPY sourcefile, line 195432

      postgres=# \copy schema.table from '/tmp/sourcefile.csv' DELIMITER ',' ENCODING 'utf8' CSV HEADER;
      ERROR: invalid byte sequence for encoding "UTF8": 0xbd
      CONTEXT: COPY sourcefile, line 1725
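
When COPY names a specific line like that, it can help to look at the raw bytes before deciding how to fix them. A sketch using `sed` to pull the line and `od` to hex-dump it (the three-line file here is made up; the 0xbd byte and line number 1725 are from the error above):

```shell
# Made-up three-line file; line 2 contains the stray 0xbd byte from
# the "invalid byte sequence ... 0xbd" error (0xbd = octal 275).
printf 'aaa\nb\275b\nccc\n' > /tmp/demo.csv

# Pull one line by number and hex-dump it; 'bd' shows up in the dump.
sed -n '2p' /tmp/demo.csv | od -An -tx1

# Against the real file the same idea would be:
#   sed -n '1725p' /tmp/sourcefile.csv | od -An -tx1
```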



        #4
        I was able to open the file with Kate (it took almost a full minute to open), then changed the encoding to UTF8. Kate reported "non-UTF8" characters in the file. So I saved it, and then I ran:

        Code:
        iconv -f utf8 -t utf8 -c sourcefile.csv -o sourcefile1.csv


        Diff showed there were a lot of changes made, so here I go to try postgres again...
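
One way to quantify what `iconv -c` dropped, instead of eyeballing a huge diff: compare byte counts, and count the source lines diff flags as changed. A sketch with a made-up two-line file standing in for sourcefile.csv (0xbd is the byte from the earlier COPY error):

```shell
# Made-up stand-in for sourcefile.csv: line 2 contains a lone 0xbd
# byte (octal 275), which is never valid UTF-8 on its own.
printf 'ok\nb\275ad\n' > /tmp/sourcefile.csv

# -c silently drops bytes it cannot convert; write to a NEW file.
iconv -f utf-8 -t utf-8 -c /tmp/sourcefile.csv -o /tmp/sourcefile1.csv

# Byte counts show how much was dropped ...
wc -c < /tmp/sourcefile.csv      # 8 bytes
wc -c < /tmp/sourcefile1.csv     # 7 bytes

# ... and counting diff's '<' lines gives the number of source lines
# that were altered.
diff /tmp/sourcefile.csv /tmp/sourcefile1.csv | grep -c '^<'     # 1
```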



          #5
          OK, I got it to load. The file still had some invalid characters in it, but when I set the encoding to sql-ascii instead of utf8, it loaded all the lines. I suspect a few fields may be corrupted or missing, but I can live with that for now.



            #6
            Originally posted by oshunluvr
            I have a text file that is 8,000,000+ lines in length.
            Good gawd man, what magical tome do you have! An 8.5x11 page in portrait orientation holds an average of 66 single-spaced lines, so your document is 121,212.121212 pages long!
            Windows no longer obstructs my view.
            Using Kubuntu Linux since March 23, 2007.
            "It is a capital mistake to theorize before one has data." - Sherlock Holmes



              #7
              I know, right?

              It's actually nearly 9 million lines.

              8,944,327 rows and 67 columns. Not all the fields are populated.



                #8
                I'm surprised at how quickly the server processes queries, considering the size of the database. Super simple ones, admittedly, but still, that's a lot of data. My postgres server is Kubuntu 24.04 running in a QEMU VM.



                  #9
                  Originally posted by oshunluvr
                  8,944,327 rows and 67 columns.
                  That's 599,269,909 cells! That's enough room to stuff every American into a cell, and still have 253,247,093 cells left over! Like I said: Oh my gawd!