
Unicode Issues


fizzymagic

Recommended Posts

I've been pretty happy so far with the new HTML filters put in place, but I discovered tonight that they strip out Unicode characters. I know that HTML Tidy is not responsible, as I sent my text through it and it survived unscathed.

 

This is not acceptable. There is no reason for this behavior. It needs to be fixed.

Link to comment
Unicode HTML entities, or actual Unicode characters encoded in whatever encoding the form's accept-charset allows (UTF-8 or whatever)?

Unicode entities. I have always assumed that it is the browser's job to transcode them to the appropriate code page. Maybe I am wrong, and there is some clever hack that will allow transcoding beforehand.

 

If so, I'd love to hear about it.

 

In my case I was just trying to use some Greek characters for a puzzle cache. HTML Tidy would change the Unicode for the "gamma" character to &gamma; which is fine. But when I submitted the output of HTML Tidy on the cache page, the character came out as either a "g" or a "?" (for uppercase letters).

 

How do cachers in Europe or Japan or China write cache pages without Unicode?

Edited by fizzymagic
Link to comment

Their editors store it in UTF-16 (raw Unicode, two bytes per character) or UTF-8 (one or more bytes per character). These get stored in the file. IE can display these encodings. Of course, before Unicode, JIS and Shift-JIS were far more common (for the Japanese), so I'm sure a LOT of them still use native encodings.

 

For example, you can write &#65; or you can write A in the file. The first takes 5 bytes and the second takes 1 byte (assuming the file/stream is encoded in ASCII or UTF-8, which is identical to ASCII for these characters). Now if the file is encoded in UTF-16, the first takes 10 bytes and the second 2 bytes, but every second byte will be 0.
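The byte arithmetic above can be checked with a short sketch (Python is used here purely for illustration; the forum and site software are not Python):

```python
# Compare the byte counts of the entity "&#65;" versus the literal "A"
# under the encodings discussed above.

entity = "&#65;"   # five characters
literal = "A"      # one character

# ASCII / UTF-8: one byte per character in this range.
assert len(entity.encode("utf-8")) == 5
assert len(literal.encode("utf-8")) == 1

# UTF-16 (little-endian, no BOM): two bytes per character, and for
# ASCII characters the second byte of each pair is 0.
assert len(entity.encode("utf-16-le")) == 10
assert len(literal.encode("utf-16-le")) == 2
assert literal.encode("utf-16-le") == b"A\x00"
```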

 

And IE will sort it all out, or use the Content-Type header. There is also an accept-charset attribute on forms which can indicate what encodings are accepted.

 

It can get kind of complicated, since the browser will replace entities in the stream with single glyphs even in text boxes - so sometimes you can't tell whether the underlying data is entities or binary encodings.

Link to comment
Unicode HTML entities, or actual Unicode characters encoded in whatever encoding the form's accept-charset allows (UTF-8 or whatever)?

Unicode entities. I have always assumed that it is the browser's job to transcode them to the appropriate code page. Maybe I am wrong, and there is some clever hack that will allow transcoding beforehand.

 

If so, I'd love to hear about it.

 

In my case I was just trying to use some Greek characters for a puzzle cache. HTML Tidy would change the Unicode for the "gamma" character to &gamma; which is fine. But when I submitted the output of HTML Tidy on the cache page, the character came out as either a "g" or a "?" (for uppercase letters).

 

How do cachers in Europe or Japan or China write cache pages without Unicode?

I run native-language text (Japanese) through a converter so it ends up as Unicode entities: a bunch of &#xxxx;&#xxxx; where each x represents a decimal digit. That's what I end up posting in my logs. I think this is how the information is stored on GC.com.
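The kind of converter described here can be sketched in a couple of lines (an illustrative Python sketch, not the actual tool the poster used; Python's xmlcharrefreplace error handler produces exactly these decimal numeric references):

```python
# Turn every non-ASCII character into a decimal numeric character
# reference (&#NNNN;), leaving plain ASCII untouched.

def to_entities(text: str) -> str:
    # xmlcharrefreplace substitutes &#NNNN; for unencodable characters
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(to_entities("こんにちは"))
# -> &#12371;&#12435;&#12395;&#12385;&#12399;
```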

 

The browser (Internet Explorer, at least) parses these correctly when the posted log is viewed, which is consistent with what you said. I presume this is how cachers in Japan include Japanese text in their cache descriptions.

 

Now, it gets a bit messy when I need to edit the same log. I have to re-enter the correct text offline, then run it through the converter again. If I try to edit what's displayed (which would be in Shift-JIS encoding), then the resulting log looks like a bunch of ?.

 

I don't know if this helps you solve the problem you are having, but I wanted to add my 2 cents.

Link to comment
I run native-language text (Japanese) through a converter so it ends up as Unicode entities: a bunch of &#xxxx;&#xxxx; where each x represents a decimal digit. That's what I end up posting in my logs. I think this is how the information is stored on GC.com.

 

The browser (Internet Explorer, at least) parses these correctly when the posted log is viewed, which is consistent with what you said. I presume this is how cachers in Japan include Japanese text in their cache descriptions.

Unfortunately, this is EXACTLY how I entered the text into the cache description. The Unicode was entered as HTML entities, as you describe. They were stripped out of the cache description completely.

 

So do logs get treated differently than cache descriptions? What the heck is going on?

Link to comment
Their editors store it in UTF-16 (raw Unicode, two bytes per character) or UTF-8 (one or more bytes per character). These get stored in the file. IE can display these encodings.

Caderoux, I understand completely about encoding Unicode into HTML entities. Not a problem here.

 

My problem is this:

 

GEOCACHING.COM IS STRIPPING THE ENTITIES OUT OF THE HTML IN THE CACHE DESCRIPTION!

 

I enter them, the HTML displays perfectly in my browser, but they get removed from the cache description. Just deleted. Gone. Not there.

 

I changed the entities from numeric Unicode entities (&#xxxx;) to named HTML entities (&gamma;). These are treated differently, but still destroyed.

 

Is there any way I can make this more clear?

 

Try it for yourself. Go edit one of your cache pages, and stick a few Unicode entities at the bottom. View the altered page. Note that they are gone. Edit the HTML. Note that the entities have been deleted.

Edited by fizzymagic
Link to comment

gc.com is definitely mangling the data.

 

Browsers are also part of the roundtrip problem budd-rdc reports, because the entities get changed into native unicode characters, which are then posted to the form as whatever encoding IE decides is appropriate based on the accept-charset (which should be utf-8). The web server should be able to take UTF-8 and convert it to internal Unicode (assuming the site is written in .NET, which uses UTF-16 internally throughout) and then post it back to the page in some encoding. It may be UTF-8 that the browser doesn't detect automatically because the header specifies a different charset. It may be rendered in a font which doesn't have those characters. It may not even be UTF-8. These failures tend to show up as boxes or ? in place of the glyphs.
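Both failure modes described here are easy to reproduce (a minimal Python sketch; the gamma character and charsets are just examples):

```python
# 1) UTF-8 bytes re-read under the wrong charset become garbage.
# 2) A lossy conversion to a code page lacking the character gives "?".

s = "γ"                           # GREEK SMALL LETTER GAMMA, U+03B3
utf8 = s.encode("utf-8")          # b'\xce\xb3'

# Server sends UTF-8, but the page header claims iso-8859-1:
print(utf8.decode("iso-8859-1"))  # -> Î³  (two junk characters)

# Character forced through a charset that cannot represent it:
print(s.encode("ascii", errors="replace").decode("ascii"))  # -> ?
```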

 

We had a whole system running one time, thinking the data was saved correctly since it re-displayed fine - and then we found out it wasn't storing the text as UTF-16 (nchar/nvarchar) in the database at all, but was using one byte of each UTF-16 code unit for half the glyph.
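That kind of corruption can be demonstrated in a few lines (an illustrative Python sketch; the sample characters are arbitrary):

```python
# Keeping only one byte of each UTF-16 code unit destroys any
# character above U+00FF, leaving unrelated ASCII-range bytes behind.

s = "あい"                        # two Japanese characters
utf16 = s.encode("utf-16-le")     # two bytes per character

low_bytes = bytes(utf16[::2])     # keep only the low byte of each pair
print(low_bytes.decode("iso-8859-1"))  # -> BD (original text unrecoverable)
```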

 

How did we fix it? We output it all to the screen in IE, copied the data off the screen (so it was stored in the clipboard, presumably as UTF-16), then fixed the system and pasted it into a form with accept-charset set to utf-8.

Link to comment
gc.com is definitely mangling the data.

 

Whew. It's nice to have this confirmed. But I remain shocked that cachers in Japan haven't complained bitterly about this behavior. Maybe they are just used to being mistreated.

 

Browsers are also part of the roundtrip problem budd-rdc reports, because the entities get changed into native unicode characters, which are then posted to the form as whatever encoding IE decides is appropriate based on the accept-charset (which should be utf-8).

 

I don't use IE; I use Firefox. But if I understand you correctly, you are saying that the browser is responsible for translating the default textbox content sent by the server into an appropriate encoding. I would have expected that the browser would just escape the entities, which would make the round-trip transparent to the user. In any case, though, the browser should translate the entities into a form such that if the default data in the textbox is submitted, the server receives exactly the same data it sent.
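The lossless round trip described here can be sketched as follows (a Python illustration of the expected behavior, not the site's actual code):

```python
# Server stores entities -> browser unescapes them for display in the
# textbox -> on submit, re-escaping should recover the original exactly.

import html

stored = "&#947; is gamma"            # what the server sends
displayed = html.unescape(stored)     # what the textbox shows: "γ is gamma"

# Converting back to entities on submission is byte-identical:
resubmitted = displayed.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
assert resubmitted == stored
```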

 

In other words, there may be a browser interaction happening here, but this is not the browser's fault.

 

IMO, this is a pretty serious issue and it needs to be resolved!

Link to comment

gc.com can fix their HTML stripping issue.

 

As far as browsers, I think we'll always have some weirdness. (Same in Firefox).

 

For instance, take this character: あ (&#12354;)

 

After I typed it in as an entity and then went to preview, the textbox presents the data as a single glyph. The underlying HTML is the entity, but it doesn't show as the entity any more in the text box. If I copy this to the clipboard and paste it into notepad, I then get prompted that the file must be saved as one of the unicode options or data will be lost.

 

The rest of the entity/display behavior is specific to this forum. After repeated round trips to the server, the data is always the entity in the stream; it's probably stored that way in the DB for the forum. The forum is PHP - and PHP's Unicode support is poor - and this page uses <meta http-equiv="content-type" content="text/html; charset=iso-8859-1" /> and the form uses no accept-charset, so the forum is not necessarily a good example of how the regular gc.com handles it (or even of how it should be handled). For instance, when I paste clipboard Unicode into the forum, it converts it TO an entity.

 

The cache editing page uses charset=utf-8, but does not specify an accept-charset on the form ( <form name="Form1" method="post" action="report.aspx?guid=2d33e6d4-9b37-48e0-a1b9-bd96ad87ed1e" language="javascript" onsubmit="if (!ValidatorOnSubmit()) return false;" id="Form1"> ). I'm not sure what effect this has, but when Unicode is pasted in, the characters are not converted to entities - they stay - and when you view the source in Notepad, they do show the correct characters. Yet after repeated saves, the data is lost. It looks to me like an accept-charset could help a lot here, but some of it might depend on the code: (1) any time you do an explicit or implicit conversion, it's going to be done in the current locale context of the thread handling the request (usually this only screws with dates and currencies to and from text); and (2) if the code sends the string through a layer not expecting Unicode (like perhaps HTML Tidy, or God forbid, DAO), the strings could go through ANSI marshalling and have problems.

 

(Edited to correct the URL for the form tag to move it to the paragraph about report.aspx, not about the forum)

Edited by caderoux
Link to comment

Is there some reason why people cannot just create HTML documents for their cache pages, using code that browsers understand?

 

Although FizzyMagic is very specific in his concerns, even for a user who knows a smattering of HTML tags, the Cache Page editor often strips away code that it doesn't understand and deletes it. That has annoyed me on several occasions where I make a simple typo and don't catch it... only to find the entire tag wiped clean from the editor.

 

Sometimes the editor takes the entire text block and crushes it into one huge paragraph... other times not... this only seems to happen when the HTML box is unchecked.

 

It got to the point where I was copying every cache page I created to a .TXT Document, just in case the Cache Page Editor was having a cranky moment.

 

So is there a reason why everyone can't just create web pages using the standard coding that browsers support, without Groundspeak having to check the coding and possibly altering it?

 

Then again... on cache pages the sunglasses smilie is [8D] but in the Forums it is B ) without the space... so what do I know?

 

:D The Blue Quasar

Link to comment
Is there some reason why people cannot just create HTML documents for their cache pages, using code that browsers understand?

I'm sure TPTB have reasons - one of which is that arbitrary code could allow people to launch malicious attacks on viewers' computers which appear to come from Groundspeak.

 

Notice that no IFRAME or SCRIPT tags are allowed - they have potential for good but also for evil.

Link to comment
It probably has to do with the conversion back and forth between the html coded in the textbox and how it is stored in the database. I'll look into it.

Has there been, or will there be, any development in this case? I suppose that when characters greater than 0xFF are stored to the database, they are replaced by the most similar character less than 0x100 (for example, a letter with a diacritic is replaced by the corresponding letter without the diacritic). Could they instead be replaced by their numeric entity? And when creating the .gpx file, the entity should be converted back to its corresponding UTF-8 sequence. At present, it is possible to place the entity in the log, and this is displayed OK on the web, but in the .gpx file the leading ampersand of the entity is replaced by its own entity, &amp;
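The double-escaping described here is a classic bug, sketched below (illustrative Python; the gamma entity is just an example, not taken from an actual GPX file):

```python
# An entity stored in the log gets its leading "&" escaped again when
# the GPX file is written, so &#947; becomes &amp;#947; in the output.

import html

log_text = "&#947;"                  # entity as stored in the log
gpx_field = html.escape(log_text)    # naive re-escape when writing GPX
print(gpx_field)                     # -> &amp;#947;

# A consumer now has to unescape twice to recover the character:
print(html.unescape(html.unescape(gpx_field)))  # -> γ
```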

Link to comment

Wow! It's been a year already! Thanks for doing a search on the forums to find this thread, A. da Mek. To TPTB: has there been any progress on allowing Unicode entities on cache pages, and also in exporting them properly to GPX files? I know that it's not an HTML Tidy issue, since I have run many pages with such entities through Tidy with nary a hiccup.

Link to comment

Wow! It's been a year already! Thanks for doing a search on the forums to find this thread, A. da Mek. To TPTB: has there been any progress on allowing Unicode entities on cache pages, and also in exporting them properly to GPX files? I know that it's not an HTML Tidy issue, since I have run many pages with such entities through Tidy with nary a hiccup.

 

Now it's more than three years already!

 

I would like to use the "not equal" character (≠), but it still doesn't work (UTF-8 charset) :(.

Link to comment

The fix is the same as with the extended characters discussed in the other thread. We have a plan to replace UBB Textbox which should fix this issue:

 

The problem is with UBB Textbox and we have spent hours trying to fix the issue. The solution is to eventually replace all textboxes with a WYSIWYG HTML editor. In the meantime, I'm very sorry this is such an annoyance.

Link to comment
This topic is now closed to further replies.