Reading .csv files

dan...@macrosistemas.com.uy

Nov 1, 2023, 4:17:51 PM
to Harbour Users
Hello Forum,
I need to read .csv files, and the files I receive seem to be "encoded" in a way I don't recognize.
I have never had trouble reading files at a low level, but in this case, when I get a line using something like
cBuffer := hb_FReadLn(), I see that cBuffer contains characters that Excel, for example, does not show, but that are visible if I open the .csv with a
hexadecimal editor.

For example, the following fragment of a line of the csv, viewed in Excel, shows:
00:02:35;;"EL ULTIMO PIRATA

Using the debugger, I see that the content of cBuffer is:
e" ■0\0000\000:\0000\0002\000:\0003\0005\000;\0000;\000\"\000E\000L\000 \000U\000L\000T\000I\000M\000O\000 \000P\000I\000R\000A\000T\000A\

Is there a function that converts the content of cBuffer into a string like the one Excel shows?

Thanks in advance for any help.
Regards

Daniele Campagna

Nov 1, 2023, 5:45:54 PM
to harbou...@googlegroups.com

It seems UTF-16 Big Endian.

If so, the first 2 chars should be chr(254) and chr(255). (The BOM)

If it's UTF-32 Big Endian the first 4 chars should be 0,0,254,255
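
The BOM check described above can be sketched in C. This is only an illustration: the function name detect_bom is hypothetical, and the table also covers the little-endian and UTF-8 marks, which the message does not list. Note that chr(255) followed by chr(254), the reverse of the order given above, is the UTF-16 Little Endian mark.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns a label for the byte-order mark at the
   start of buf, or "none" if no known BOM is present. */
const char *detect_bom(const unsigned char *buf, size_t len)
{
    /* Test the 4-byte marks first: the UTF-32LE BOM starts with FF FE,
       the same two bytes as the UTF-16LE BOM. (A UTF-16LE file whose
       first character is U+0000 would be misreported; that is assumed
       not to happen in a text file.) */
    if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32BE";
    if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32LE";
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8";
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)         return "UTF-16BE";
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)         return "UTF-16LE";
    return "none";
}
```

Calling it on the first bytes read from the file tells you which conversion to apply before parsing the csv.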

Dan

--
You received this message because you are subscribed to the Google Groups "Harbour Users" group.
Unsubscribe: harbour-user...@googlegroups.com
Web: https://groups.google.com/group/harbour-users

dan...@macrosistemas.com.uy

Nov 1, 2023, 6:45:59 PM
to Harbour Users
Dan,
Thanks for answering.
I see that the first two characters are indeed chr(255) and chr(254).
I am attaching the first line of a sample csv.

Thank you
Daniel
ejemplo.csv

CV

Nov 2, 2023, 8:37:11 AM
to Harbour Users
Daniel

My solution for cases like yours was to read the file with memoread(), check that the first 2 characters are xFF and xFE and, if so, remove them, and then apply strtran() to replace every x00 with an empty string.
That leaves you with normal ascii text that you can process in the standard way.

My simple solution for this case:

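--- bof ---
// read the whole file into a string
wText := memoread("ejemplo.csv")
// drop the UTF-16LE byte-order mark (xFF xFE) if present
if substr(wText, 1, 2) == chr(255) + chr(254)
    wText := substr(wText, 3)
endif

// remove the x00 bytes (the high byte of every character with a code below 256)
wText := strtran(wText, chr(0), "")
// wText is now a plain ascii string, as long as the file holds no characters with codes above 255.
--- eof ---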

Regards
--
Claudio Voskian
Buenos Aires - Argentina

dan...@macrosistemas.com.uy

Nov 2, 2023, 3:52:36 PM
to Harbour Users
Claudio,
Thank you very much for the idea.
After applying it I see that the text is "cleaner", but it is still full of \ characters (I believe the term is "escaped"); in fact, the content of wText
begins with (including the e):
e"00:02:35;;\"EL ULTIMO PIRATA SOLO~0.5\";\"C:\\100.3\\04 - PROGRAMAS\\EL ULTIMO PIRATA\\PISADORES\\EL ULTIMO PIRATA SOLO~0.5.MP3\";\"<TRACK SKIPRE ....................... (the dots are mine and indicate that the text continues)
and I would need it like this:
00:02:35;;"EL ULTIMO PIRATA SOLO~0.5";"C:\100.3\04 - PROGRAMAS\EL ULTIMO PIRATA\PISADORES\EL ULTIMO PIRATA SOLO~0.5.MP3";"<TRACK SKIPRE.............

As far as I could search, going through every library I could, I found no function that "unescapes" a text.
I thought about writing a routine that walks through the whole file character by character and, if a \ is followed by another \, removes the first one, and if a \ is followed by ", removes the \, but I don't think that is the best solution.

Any idea that sheds light on how to "clean" the csv files is appreciated.

Regards
Daniel

Daniele Campagna

Nov 2, 2023, 9:13:06 PM
to harbou...@googlegroups.com

This is just a workaround, it is not the correct way to convert UTF16BE to <whatever>. 

I know that my good friend fdaniele does it that way (from UTF16 LE) and I told him I was "horrified" by that. Eh Eh, just joking. It's a quick and dirty method but it works, more or less.

Unfortunately, there is no Windows API function that converts directly from UTF16BE. For a big-endian file you should first change the "endianness" by swapping the two bytes of each 16-bit unit, for example with a C function that calls the C runtime function _swab.

Once the bytes are swapped (or right away if the file is already UTF16LE, which is what a BOM of chr(255), chr(254) indicates), you can convert from UTF16LE using the Harbour functions or the Windows API function WideCharToMultiByte (again, you need a C wrapper).
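
The two steps described here (swap the bytes, then convert from UTF16LE) can be sketched in portable C. This is an illustration under stated assumptions, not the Windows route: it does not call _swab or WideCharToMultiByte, the names swap_utf16_bytes and utf16le_to_utf8 are hypothetical, UTF-8 is chosen as the target encoding, and the BOM is assumed to have been removed already. Since the sample file starts with chr(255), chr(254), it is already little-endian and the swap step would not be needed for it.

```c
#include <stddef.h>
#include <stdint.h>

/* Swap the two bytes of every 16-bit unit in place (UTF-16BE <-> UTF-16LE). */
void swap_utf16_bytes(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i + 1 < len; i += 2) {
        unsigned char tmp = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = tmp;
    }
}

/* Convert UTF-16LE bytes (BOM already removed) to UTF-8.
   out must have room for at least (len / 2) * 3 bytes.
   Returns the number of bytes written; unpaired surrogates become U+FFFD. */
size_t utf16le_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i + 1 < len; i += 2) {
        uint32_t cp = (uint32_t)in[i] | ((uint32_t)in[i + 1] << 8);

        if (cp >= 0xD800 && cp <= 0xDBFF) {            /* high surrogate */
            uint32_t lo = 0;
            if (i + 3 < len)
                lo = (uint32_t)in[i + 2] | ((uint32_t)in[i + 3] << 8);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;                                /* consume the low half */
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {     /* stray low surrogate */
            cp = 0xFFFD;
        }

        if (cp < 0x80) {                               /* 1-byte sequence */
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {                       /* 2-byte sequence */
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {                     /* 3-byte sequence */
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {                                       /* 4-byte sequence */
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

Unlike removing every x00 byte, this keeps characters above code 255 intact, because each 16-bit unit is decoded as a whole.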

In UTF-16 you have a character codified in 2 bytes. The second byte is 00 unless the character is in the first 256 ASCII codes (they more or less match the ANSI codify, 1 byte needed), in LE you have the 00 as second byte, in BE the order is inverted.

Anyway, a text editor should let you open a UTF-16 text file and save it as, say, UTF-8. UTF-8 can represent every character that UTF-16 can, so nothing is lost that way; characters can only be lost if you save to a single-byte ANSI code page instead.

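As for the e" preceding the string: e"..." is Harbour's notation for a string literal with C-style escapes, and it appears to be how the debugger displays the value. The e, the surrounding quotes and the backslashes (\" and \\) belong to that display, not to the data, so there should be nothing to "unescape".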

Dan

Daniele Campagna

Nov 3, 2023, 6:50:34 AM
to harbou...@googlegroups.com

Correction: my explanation of the UTF16 encoding was very approximate, and there is a typo.

On 03/11/2023 02:12, Daniele Campagna wrote:

In UTF-16 you have a character codified in 2 bytes. The second byte is 00 unless the character is in the first 256 ASCII codes (they more or less match the ANSI codify, 1 byte needed), in LE you have the 00 as second byte, in BE the order is inverted.

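It should read:

In UTF-16 most characters are encoded in 2 bytes. The second byte (in LE order) is 00 IF the character is one of the first 256 Unicode code points (ASCII plus Latin-1); otherwise both bytes carry data. Characters outside the Basic Multilingual Plane take 4 bytes (a surrogate pair).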
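
To make the byte layout concrete, here is a minimal C sketch that encodes a single Unicode code point as UTF-16LE. The function name utf16le_encode is hypothetical and the code is only an illustration of the encoding rule, not part of Harbour or of the Windows API.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16LE into out (up to 4 bytes).
   Returns the number of bytes written: 2 for U+0000..U+FFFF, 4 above that,
   and 0 for values that are not valid code points. */
size_t utf16le_encode(uint32_t cp, unsigned char out[4])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {
        out[0] = (unsigned char)(cp & 0xFF);   /* low byte first (LE) */
        out[1] = (unsigned char)(cp >> 8);     /* 0x00 whenever cp < 256 */
        return 2;
    }

    /* Above U+FFFF: split into a high and a low surrogate. */
    cp -= 0x10000;
    uint32_t hi = 0xD800 + (cp >> 10);
    uint32_t lo = 0xDC00 + (cp & 0x3FF);
    out[0] = (unsigned char)(hi & 0xFF);
    out[1] = (unsigned char)(hi >> 8);
    out[2] = (unsigned char)(lo & 0xFF);
    out[3] = (unsigned char)(lo >> 8);
    return 4;
}
```

With this rule "A" (U+0041) becomes the bytes 41 00, which is why the csv looks like ASCII text interleaved with zero bytes, while the euro sign (U+20AC) becomes AC 20, where neither byte is zero.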

dan...@macrosistemas.com.uy

Nov 3, 2023, 10:54:38 AM
to Harbour Users
Daniele,

Thank you very much for your explanation and help.
Regards