I have a tab-separated text file that is 8,000,000+ lines long, with the fields in double quotes. I am attempting to convert it to a CSV file so I can COPY the data into a postgres table. However, I keep running into non-UTF-8 encoding errors when attempting the COPY.
I can convert it to csv with this:
Code:
tr '\t' ',' < sourcefile.txt > sourcefile.csv
and this works fine. It replaces each tab with a comma. However, when I go to COPY it into postgres I get this:
ERROR: invalid byte sequence for encoding "UTF8": 0xee 0x3c 0xf9
CONTEXT: COPY sourcefile, line 1833286
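One way to see what the offending line actually contains is to print it with non-printing characters made visible. A sketch, using a small hypothetical sample in place of sourcefile.csv (note the line number COPY reports counts CSV records, so it may not match a physical file line if any quoted field contains an embedded newline):

```shell
# Hypothetical stand-in for sourcefile.csv; line 2 carries the bad bytes
# (\356 = 0xee, \371 = 0xf9) and a trailing carriage return.
printf 'good line\nbad\356<\371line\r\n' > sample.csv

# cat -v makes the problem visible: high bytes print as M-x, CR prints as ^M.
sed -n '2p' sample.csv | cat -v
```

Substitute the real file and the line number from the COPY error to inspect the actual data.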
This supposedly converts everything in the file to UTF8 (the output has to go to a new file; redirecting back onto sourcefile.csv would truncate it before iconv reads it):
iconv -f utf-8 -t utf-8 -c sourcefile.csv > sourcefile_clean.csv
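Before converting anything, it may help to count how many lines are affected. With GNU grep in a UTF-8 locale, inverting a whole-line match of `.*` flags lines that are not made entirely of valid characters. A sketch, assuming the `C.UTF-8` locale is available (adjust the locale name if not):

```shell
# Hypothetical stand-in for sourcefile.csv; line 2 holds invalid UTF-8 bytes.
printf 'ok line\nbad \356<\371 line\nalso ok\n' > sample.csv

# -a: treat the file as text; -x: whole-line match; -v: invert; -n: line numbers.
# In a UTF-8 locale '.' does not match invalid byte sequences, so only lines
# containing invalid UTF-8 survive the inversion.
LC_ALL=C.UTF-8 grep -naxv '.*' sample.csv
```

If the count is large, that confirms the problem is widespread and worth fixing with a streaming tool rather than by hand.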
Running 'file' on the result reports 'charset=us-ascii', so I used "ENCODING 'SQL_ASCII'" in the COPY command, and then got this error:
ERROR: unquoted carriage return found in data
HINT: Use quoted CSV field to represent carriage return.
CONTEXT: COPY sourcefile, line 1833286
So same line, different error. I cannot edit the file manually; it is too big for anything I've tried so far. I could delete the line and re-enter it by hand, but I'm concerned there will be thousands more lines with the same issue. AFAIK I can't fix it with a sed replace because I don't know exactly what the bad part looks like.
Looking for suggestions on how to clean up this file...
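One possible cleanup pass combines the two fixes in a single pipe: `iconv -c` drops bytes that are not valid UTF-8, and `tr -d '\r'` strips carriage returns. A sketch on a hypothetical sample (caveat: this removes every CR, including any that might legitimately belong inside a quoted field):

```shell
# Hypothetical stand-in for sourcefile.csv, with invalid bytes on line 2
# and stray carriage returns on lines 2 and 3.
printf 'good,row\nbad\356<\371,row\r\nmore,rows\r\n' > sourcefile_sample.csv

# Drop invalid UTF-8 bytes, then strip carriage returns. Always write to a
# new file: redirecting back onto the file being read would truncate it.
iconv -f utf-8 -t utf-8 -c sourcefile_sample.csv | tr -d '\r' > sourcefile_clean.csv
```

After a pass like this, `file` should no longer flag the encoding and COPY should not hit unquoted carriage returns, though spot-checking the previously failing line is still worthwhile.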