way to change column value rendering text to UTF-8
lionel

972 post(s)
#27-May-22 17:04

I copied data from an official HTML web page (1), whose encoding is ISO-8859-15, into a file using Notepad++. When I import that data file, the diacritics (é è û) come out wrong. I don't know whether the Windows clipboard re-encodes text behind my back when I copy and paste from an HTML page into Notepad++!

It is strange that the data is shown correctly before import (Ardèche), while after import into Manifold the diacritics are wrong: 07 - Ardèche !!
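For readers following along, here is a minimal Python sketch of how this kind of damage can arise. It only illustrates the general mechanism; which pair of encodings was actually involved in this import is an assumption, and the sample string is simply the one quoted above.

```python
# Illustrative only: the codecs chosen here are an assumption about what
# went wrong, not a diagnosis of the actual import.
text = "07 - Ardèche"

# Case 1: bytes written as UTF-8, then read back as ISO-8859-15.
utf8_bytes = text.encode("utf-8")
misread_as_latin9 = utf8_bytes.decode("iso-8859-15")

# Case 2: bytes written as ISO-8859-15, then read back as UTF-8.
# The single byte for "è" is not valid UTF-8, so it is replaced.
latin9_bytes = text.encode("iso-8859-15")
misread_as_utf8 = latin9_bytes.decode("utf-8", errors="replace")

print(misread_as_latin9)
print(misread_as_utf8)
```

In the first case the two UTF-8 bytes of "è" show up as two unrelated characters; in the second the lone ISO-8859-15 byte becomes a replacement character. Either way the bytes in the file are fine, only the interpretation is wrong.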

It is also strange that there are two Style panels!

Is there a way in Manifold 9, after a file has been imported as a table, to declare the encoding of a column and then convert it to UTF-8?

Is there a useful, recommended editor for Windows 10 that supports encodings when creating, pasting, converting, and exporting text?

(1) source www.resultats-elections.interieur.gouv.fr/legislatives-2022

Attachments:
departement_iso8859_15.txt
MAnifold_import.png
MAnifold_styleTable_encoding.png
notepad++_iso8859_15.png


INFOGRAPHY union , LINK doc , API, deepl & keyboard shortcut

Dimitri

7,025 post(s)
#28-May-22 10:01

Begin by reading the Data Types topic. Don't miss the discussion of VARCHAR and NVARCHAR types.

Is there a way in Manifold 9, after a file has been imported as a table, to declare the encoding of a column and then convert it to UTF-8?

A table data type is either VARCHAR or NVARCHAR. NVARCHAR is Unicode using Manifold's internal storage to cover the entire Unicode standard. You don't define an encoding because all that is internal to Manifold: it covers the entire listing of all Unicode symbols.

UTF-8 is one of the three standard encodings (the others being UTF-16 and UTF-32) that are used in interchange formats to represent Unicode text, or by software that isn't fully Unicode as a way of handling Unicode characters.

Genuinely Unicode applications like Manifold are fairly rare. To be fully Unicode, you have to have full Unicode capability in everything that touches data, or you get into issues with encodings and such where some parts of the program correctly handle Unicode and others don't. What often happens is that programs write what they think is Unicode text but they don't get the nuances right.

A good example is Notepad, which will often display Unicode text using the correct characters, but when that text is saved as UTF-8, a genuinely Unicode program reading the resulting file will display different characters. Why? Because the text should have been saved as UTF-8 with BOM, or as UTF-16 or UTF-32, and not as plain UTF-8 without a BOM. "BOM" stands for Byte Order Mark.

If you've imported something into a Manifold table and Unicode text is not showing up as it should, that's an indication that the format from which you imported didn't get the Unicode encoding right or didn't specify it. The classic case is importing "Unicode" from a text file, like CSV, that was saved using UTF-8 when UTF-8 with BOM should have been used. That's easy to fix: Open the text file in Notepad and then save it choosing UTF-8 with BOM in the Encoding box in the File - Save dialog.
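The same fix can be scripted. The following is a minimal Python sketch, assuming the source file really is ISO-8859-15 as stated in the first post; the function name and the output file name are made up for illustration.

```python
def convert_to_utf8_bom(src, dst, source_encoding="iso-8859-15"):
    """Decode src using source_encoding and rewrite it as UTF-8 with a BOM.

    source_encoding is an assumption supplied by the caller; nothing here
    detects it automatically.
    """
    # newline="" keeps the original line endings untouched.
    with open(src, "r", encoding=source_encoding, newline="") as f:
        text = f.read()
    # The "utf-8-sig" codec writes the bytes EF BB BF before the text.
    with open(dst, "w", encoding="utf-8-sig", newline="") as f:
        f.write(text)


# Hypothetical usage with the attachment from the first post:
# convert_to_utf8_bom("departement_iso8859_15.txt", "departement_utf8_bom.txt")
```

Writing to a separate output file rather than overwriting the source keeps the original bytes available if the assumed source encoding turns out to be wrong.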

rk
518 post(s)
#28-May-22 14:16

For encoding related stuff I recommend EditPad https://www.editpadlite.com/

I'd say that the Notepad++ choices are a bit confusing. It is not entirely clear which menu item changes the interpretation of the bytes and which one converts the bytes.

If you know the difference in Manifold between Assigning a Projection and Changing a Projection, it becomes clear.

Changing the (encoding) interpretation is like Assigning or Repairing a Coordinate System: the bytes themselves are left unchanged.

Re-encoding is like Re-projecting: taking the current encoding/projection as true and converting it to another, if possible.

EditPad makes it as clear with encodings as Manifold with projections.
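The distinction between reinterpreting and re-encoding can be sketched in a few lines of Python. This is only an illustration of the analogy above; the choice of cp437 as the "wrong" interpretation is arbitrary.

```python
# The bytes as they sit in the file (ISO-8859-15 in this example).
raw = "Ardèche".encode("iso-8859-15")

# Reinterpreting (like assigning a projection): the bytes stay the same,
# only the codec used to read them changes.
as_latin9 = raw.decode("iso-8859-15")   # the right interpretation
as_cp437 = raw.decode("cp437")          # a wrong interpretation, same bytes

# Re-encoding (like re-projecting): trust the current interpretation and
# produce new bytes in the target encoding.
reencoded = as_latin9.encode("utf-8")

print(raw.hex(" "))        # 41 72 64 e8 63 68 65
print(reencoded.hex(" "))  # 41 72 64 c3 a8 63 68 65
```

Re-encoding only gives a sensible result if the interpretation it starts from is correct, just as re-projecting only works when the assigned coordinate system is right.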

If you save a file in UTF-8, then Manifold likes to see it with BOM. You can add/remove BOM with Notepad++ too. Without BOM, Manifold tries to interpret your file in your "current locale encoding" or whatever it is called. This is correct behavior, although since "BOM use is optional" [wikipedia], many today treat files as UTF-8 by default, even without BOM.
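As a rough Python sketch of the rule described here (BOM present means UTF-8, otherwise fall back to the locale encoding): this is not Manifold's actual code, and the function name and the `fallback` parameter are invented for the example.

```python
import codecs
import locale


def decode_with_bom_fallback(data, fallback=None):
    """Decode bytes as UTF-8 if they start with a UTF-8 BOM,
    otherwise use a fallback encoding (the locale's by default)."""
    if data.startswith(codecs.BOM_UTF8):
        # Strip the three signature bytes, then decode the rest as UTF-8.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if fallback is None:
        fallback = locale.getpreferredencoding(False)
    return data.decode(fallback, errors="replace")


utf8_bytes = "Ardèche".encode("utf-8")
with_bom = decode_with_bom_fallback(codecs.BOM_UTF8 + utf8_bytes)
without_bom = decode_with_bom_fallback(utf8_bytes, fallback="cp1252")
print(with_bom)
print(without_bom)
```

The second call shows why a BOM-less UTF-8 file comes out garbled on a machine whose locale encoding is a Western European codepage.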

Attachments:
EditPad_encoding.png

hugh
194 post(s)
#28-May-22 15:28

I've used editpad for 20 years and never noticed that BOM button--thanks!

lionel

972 post(s)
#29-May-22 01:38

So, I can open any file (CSV, binary, shapefile, MapInfo, GDB, SQLite) and save the file back to itself (like opening a Map 8 project in Manifold 9 and saving it as Map v9).

I really like the comparison between projections and charsets/encodings. Unicode lets users work with many languages and glyphs, right-to-left scripts, and so on. This makes me think of POSIX for operating systems. I have in mind that Manifold 9 should help us by letting us assign the original encoding so that Manifold can convert it to UTF-8, since Manifold's only behaviour is to process data that by default should be UTF-8 (mandatory). Only a human can tell whether he is using UTF-8. If converting is as easy as opening a file and saving it with suitable software, then this functionality does not have to be handled by Manifold itself. I hope that files don't use different encodings within the same file (structure, metadata, and data)! I hope most web servers, database servers, and data transfer protocols (stream, file) negotiate the encoding (endianness, BOM, charset) in a handshake before sending and receiving data!



rk
518 post(s)
#29-May-22 04:47

Maybe we should advocate for 'UTF-8 without BOM' as a default in Manifold too.

  • to export without BOM,
  • to read BOM-less files as UTF-8 by default.

Two years ago I sent in a suggestion for setting the encoding for CSV and other text dataports, or at least an option to force UTF-8 on import.

Breaking change proposal: Encoding.UTF8 singleton should not have a BOM · Issue #51353 · dotnet/runtime (github.com)

BTW, I didn't realize that "with UTF-8, BOM is only used as an encoding signature to distinguish UTF-8 from other encodings — it has nothing to do with byte order."

https://unicode.org/faq/utf_bom.html

Q: Is the UTF-8 encoding scheme the same irrespective of whether the underlying processor is little endian or big endian?

A: Yes. Since UTF-8 is interpreted as a sequence of bytes, there is no endian problem as there is for encoding forms that use 16-bit or 32-bit code units. Where a BOM is used with UTF-8, it is only used as an encoding signature to distinguish UTF-8 from other encodings — it has nothing to do with byte order.
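A quick way to see this for yourself, as a small Python sketch using only the standard `codecs` constants:

```python
import codecs

ch = "è"

# UTF-8 is a byte sequence: there is only one possible ordering.
utf8 = ch.encode("utf-8")

# UTF-16 uses 16-bit code units, so the two byte orders differ.
utf16_le = ch.encode("utf-16-le")
utf16_be = ch.encode("utf-16-be")

print(utf8.hex(" "))       # c3 a8
print(utf16_le.hex(" "))   # e8 00
print(utf16_be.hex(" "))   # 00 e8

# The BOMs themselves: a fixed signature for UTF-8, and a real
# byte-order marker for UTF-16.
print(codecs.BOM_UTF8.hex(" "))      # ef bb bf
print(codecs.BOM_UTF16_LE.hex(" "))  # ff fe
print(codecs.BOM_UTF16_BE.hex(" "))  # fe ff
```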

adamw

10,011 post(s)
#31-May-22 13:12

We'll look into making UTF8 the default when there is no BOM. When there is no BOM, we currently invoke a system function, the same one that Notepad uses, which tries to determine the encoding from the first several thousand characters. It catches UTF16 BE / LE quite well, but obviously it cannot distinguish between valid UTF8 and 'current old-style codepage'. When UTF8 was relatively new, it made sense to resolve an ambiguity like that to 'current old-style codepage', but right now it makes more sense to resolve it to UTF8.

The problem with this is that if we decide that the text is UTF8 while it is really just old-style text in an old-style codepage, then we might hit an invalid character sometime later and the process of reading data will fail. With old-style codepage being the default, the conflict is resolved in favor of UTF8 by adding a BOM. With the new approach and UTF8 being the default, we will need to add controls to tell the system "hey, this is not UTF8, although it looks like it is, this really is an old-style codepage, don't get fancy because if you do, you will fail half-way through the file".
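To make the failure mode concrete, here is a small Python sketch. It does not reproduce the system function mentioned above; the helper name, the made-up sample data, and the 4096-byte sample size are all assumptions for illustration.

```python
def looks_like_utf8(sample):
    """Return True if the sample bytes decode cleanly as UTF-8."""
    try:
        sample.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical file: many plain ASCII rows, then one ISO-8859-15 row.
data = b"01 - Ain\n" * 1000 + "07 - Ardèche\n".encode("iso-8859-15")

# Guess the encoding from a leading sample only (size is arbitrary here).
sample = data[:4096]
guess = "utf-8" if looks_like_utf8(sample) else "iso-8859-15"

# Reading the whole file with the guessed encoding fails far past the sample.
try:
    data.decode(guess)
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start  # byte offset of the first invalid byte

print(guess, failed_at)
```

Pure ASCII is valid in both UTF-8 and the old-style codepages, so a sample that happens to contain no accented characters cannot settle the question, and the wrong guess only surfaces at the first non-ASCII byte.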

lionel

972 post(s)
#29-May-22 14:09

There is some strange behaviour (no voodoo in coding, so everything has a rational explanation!) when launching an SQL query using WSL Linux and LINQPad. I think such behaviour should not arise when using Manifold!

Attachments:
character_endBegin.png
LinqPAd_charset_unicode.png
LinQPad_testmySQL_charset.png



lionel

972 post(s)
#16-Jun-22 04:25

A) Did you notice in the screenshot above the "character_set_database" value returned by the SQL query on the MySQL database? The value is utf8mb4, not utf8!

MySQL utf8 vs utf8mb4 - What's the difference between utf8 and utf8mb4? (eversql.com)

MySQL :: MySQL 8.0: When to use utf8mb3 over utf8mb4?

B) Did you try to write the m of Manifold without using the M key on the keyboard?

The solution is in the file, at the top right, in pink.



lionel

972 post(s)
#16-Jun-22 04:47

Collation for SQL Server , manifold9 , MySQL 8.0 , PostgreSQL , sqlite

mysql - What does character set and collation mean exactly? - Stack Overflow



adamw

10,011 post(s)
#16-Jun-22 07:52

Just for the record, with MySQL we support both utf8mb4 and utf8 / utf8mb3. (And the difference between the two is that when the MySQL folks tried to implement UTF-8, they got it wrong, but by the time they realized that they had got it wrong it was too late and the encoding had to be preserved or tons of databases would stop working. So they implemented a new encoding which gets UTF-8 right and called it utf8mb4.)
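The practical difference is the byte limit per character: utf8 / utf8mb3 stores at most three bytes per character, while real UTF-8 needs up to four. A minimal Python sketch of that check follows; the function name is invented, and this only illustrates the byte-length rule, it is not a MySQL API.

```python
def fits_in_utf8mb3(text):
    """Return True if every character needs at most 3 bytes in UTF-8,
    which is the limit of MySQL's utf8 / utf8mb3 character set."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


print(fits_in_utf8mb3("07 - Ardèche"))  # True: "è" needs 2 bytes
print(fits_in_utf8mb3("😀"))            # False: the emoji needs 4 bytes
```

Anything outside the Basic Multilingual Plane, such as emoji, falls into the four-byte case and therefore needs utf8mb4.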

Manifold User Community Use Agreement Copyright (C) 2007-2021 Manifold Software Limited. All rights reserved.