Log Change from HTML to Markdown

9.2k · January 19, 2016

Hello everyone! There has been a lot of talk and there are many questions about the changeover to Markdown. I am one of the developers on this project and I will try to provide some insights into the decisions that were made as well as correct the misconceptions that are swirling around this topic.

Excellent! Thank you. As a professional web app developer (as with others who've commented in this thread posing questions and proposing suggestions towards a 'nicer' process), I just have a few responses.

First and foremost this change, as with every change we make, has no ulterior motive or malicious intent. There is an understanding that change is disruptive and usually ill-received by some, but we are not making changes simply for the sake of change. There is a method to the (perceived) madness and we believe that we are moving towards a game that is better for everyone.

That's the position I prefer to take about this as well; it really annoys me when people jump to maliciousness in GS motives for any 'change' when they don't like sais change.

In order to properly secure the website so it cannot be used as a portal for malicious attack it was deemed necessary to remove the rendering of all user generated HTML in the logs.

This part doesn't bug me. Understandable.

BBCode is also a security vulnerability because it ultimately generates HTML that can contain its own vulnerabilities.

eh? BBCode was secure because it only generates approved html it knows is secure. Can you point to any studies that explain foundational BBCode insecurities that cannot be addressed because they're fundamental to the basic syntax of simple markup? Which BBCode features that you allow were creating HTML that itself is insecure? I can't seem to find anywhere discussing HTML-specific security risks to BBCode, especially related to anything that can't be restricted from use (such as unsanitized text parameter entry)

(I realize this is a dumb public question; maybe a PM would be appropriate?)

Our implementation of BBCode in geocache and trackable logs is old and brittle and would need to be completely replaced in order to account for this.

Ok, so if I read this right, when a new log is posted it should be sanitized for acceptable content - the system's HTML sanitization script is too old and insufficient to provide a sufficiently secure algorithm (moving forward and for legacy content), and your old implementation of BBCode has known and unfixable security flaws and is too complex to swap out with a newer implementation? (I don't know what "brittle" means)

there have been comments that it would be trivial to allow the rendering of both standards by adding a flag or switching based on the log date. I would challenge anyone who claims that maintaining duplicate code paths in a legacy system is a trivial task with no impact on future development or peformance.

I for one would accept that challenge

How do you respond to the proposed ideas made in an effort to maintain legacy rendering, at least until no more are an issue? There have been a number of people who've all posted similar solutions which have minimal to no impact on the system. Challenging the claim that it's not trivial (and supporting with practical suggestions) is, well, precisely what we've done for you

When you add up geocache and trackable logs we are dealing with over 1 billion records, ranging from 1 character to thousands of characters in length. The type of conversion that is required is difficult to perform in an automated manner because we will need to touch every single log. No query can reliably filter down the number of logs so we need to pull them all back. When dealing with that many logs we cannot account for every permutation in the logs which means that either logs will be missed or worse (and more likely) logs will be incorrectly converted. As a general rule we err on the side of avoiding permanent data loss.

This is why the suggestion was made to have an asynchronous bot that runs at low traffic times to analyze and deal with legacy logs as long as any exist which haven't yet been updated. It's fundamentally dynamic in that you can tell it to do 100 at a time or 5000, to work at whatever speed has an imperceptable impact, whether it takes days to get through the database or months - at least the content is being dealt with in a more friendly manner than that which is having people up in arms here.

I don't think anyone argued that running a single query on the entire database of logs would be remotely feasible...

It has been pointed out that logs that were written before Markdown might be converted inadvertently. It is true that there may be cases where plain text gets rendered by Markdown incorrectly. We do not have the concept of plain text vs HTML logs so everything is rendered as if it might contain Markdown. We are working to account for this in advance and we believe that any remaining cases will be minimal in number and impact.

Conversion was one suggestion which many of us could write simple regex scripts to handle. For example, if we know the acceptable output for a Markdown link, then we can search for any matching HTML syntax for a link and automatically convert them - anything not 'allowed' will become 'broken' - but far less than were the entire log to be left as is. Same with any other Markdown content. Convert what can be converted, leave the rest.

Another benefit is that unlike HTML and BBCode, Markdown is more human readable and works cross platform so that all users can share the same experience. I'm sure I have left some questions unanswered and there are still people that are unconvinced about this change, but please know that the change is made with good intentions and we are continuing to work towards a smooth transition. Also, just know that we are reading the forums and taking notes of constructive opinions in the various threads on the topic.

Yeah I think everyone here sees the value in Markdown. I don't think the issue is so much that Groundspeak is moving to Markdown, it's the method by which they're executing the moving, and the effect it'll have on old content. Change itself isn't the big problem in this case for once. :ninja:

Annamoritz posted a good list of specific examples of questions for how old logs will display text not intended to be using Markdown.

I don't think forcing every single user to escape usernames, for example, is going to go over well...

You suggested escaping by typing \*username*, but how would it render if used in a bold sentence - *Out with \*username** or what if they wanted to bold the username - **username**, presumably that would require *\*username\**. You can see how it might start getting difficult for a user to understand why their log isn't displaying as intended. What about merely having a special Markdown character you can surround a string (up to a max length) that indicates to Markdown that it's a username? Then sanitization can check that specific string against the db and render it (or not) appropriately. Such as {*username*}, or something like that. ..it's not perfect, since usernames with "}" would break it if it's unescaped (it would need to be {*namewith\}inside*}) but the idea keeps the text essentially human-readable. I don't think there's any solution that is perfect, since it's really the fact of including complex username strings in plaintext which is fundamentally problematic.

With HTML and BBCode, markup was nicely separated from human-readable text. With Markdown, users will need to know basic markup to understand how and why the plaintext they type doesn't appear exactly how they type. So there is a level additional learning curve that log-writing in the past didn't have.

Thanks for taking time to chime in, HiddenGnome. Definitely much appreciated!

2k · January 19, 2016

but 35% of the 20353 caches in this database are affected.

Sure, the number of caches impacted will be greater than the number of logs impacted. But over time these will slide down to the bottom of a cache page and almost nobody will ever actually see it.

Let's be real - for the most part (there are exceptions) most caches pages don't have any logs read other than the last half dozen or so. These will always be "good" unless somebody insists on typing in html/bbcode that doesn't render.

When I look at a cache to do, I'll look at the location, the description, and probably the last 4-5 logs. I'm not going to go back in history and read anything else unless it's one of the rare exception caches.

While it would be nice for everything to either be reformatted, or have a flag to support old logs, the cost/benefit probably just isn't there.

6.8k · January 19, 2016

While it would be nice for everything to either be reformatted, or have a flag to support old logs, the cost/benefit probably just isn't there.

Actually, I rather would like to see the restriction to pure text logs for all logs than this Markdown solution.

I often read old logs - I read logs for the enjoyment and only indoors on a PC with a large screen and not in the field in order to find a cache. The quality of logs for comes from the contents and the photos and not colors and fonts in the text. Apparently GS has a very different idea of log quality than I do have. It feels like someone pulls my leg if I read that they intend to make geocaching a better game for everyone.

One of the biggest contribution they could make towards nice logs is an update to the annoying photo upload system. Switching over to Markdown and sending around html e-mails certainly does not contribute positively to what makes log quality for me. Until recently I was looking forward to notifications about logs for my caches - now I hate them.

1.1k · January 19, 2016

Let's be real - for the most part (there are exceptions) most caches pages don't have any logs read other than the last half dozen or so. These will always be "good" unless somebody insists on typing in html/bbcode that doesn't render.

Then why, do you suppose the website has the scroll down feature in the logs where you can scroll down to the very first log posted to that cache?

If somehow malware entered the system (heaven forbid) to where every such defective rendering where happen on logs (with html or BB codes) at the website today, it would be treated as a security breach and the display system would be repaired. It would NOT be treated as "so few persons will see it that it does not matter".

What is being proposed to prevent such problems seems worse. It seems to me that the “medicine” is not going to prevent the “disease” (broken display) but will actually cause the very “disease” it aims to prevent.

I am trying to figure out how the author of a log might pose a security risk by

using the same subset of html code currently being used on the descriptions,
where doing the same thing with Markdown is not possible?

Phishing is just as easy with Markdown so that would not be an example.

I would submit that if the webpage is currently secure in this regard for the descriptions, then it can be secured this way for the logs. If it is not secure, then adding Markdown will be a nice addition but will accomplish little for security.

7.5k · January 19, 2016

(I don't know what "brittle" means)

In software development, "brittle" means that the code is, for one reason or another, very likely to break in unexpected ways when modified. This often happens to large, complex projects that have evolved over time.

How do you respond to the proposed ideas made in an effort to maintain legacy rendering, at least until no more are an issue?

As long as they continue to support bbcode instead of dropping it, there will never be an identifiable time when it is no longer an issue.

9.2k · January 19, 2016

(I don't know what "brittle" means)

In software development, "brittle" means that the code is, for one reason or another, very likely to break in unexpected ways when modified. This often happens to large, complex projects that have evolved over time.

I understand the concept of 'brittle' applied to code. I mean I don't understand how brittle applies in this context - it's a self-contained algorithm applied to a set of data (text). It should be modular. If it's being replaced, any 'brittleness' to the algorithm itself is irrelevant - it's being cleaned out and replaced with a new algorithm.

How do you respond to the proposed ideas made in an effort to maintain legacy rendering, at least until no more are an issue?

As long as they continue to support bbcode instead of dropping it, there will never be an identifiable time when it is no longer an issue.

"at least until no more an issue" was added in reference to the points I followed up - cleaning up outstanding data until no more remains. So yeah, if conversion is done over time to existing legacy data, then eventually the existence of BBCode would no longer be an issue; it would no longer exist. That was one of the proposed ideas about which I was hoping to hear a response...

7.5k · January 19, 2016

It should be modular.

I think this explains why you see the issue differently than HiddenGnome. You think "should be", not "is".

January 19, 2016

I will try to answer these where I can.

Will inserting a pure link like www.geocaching.com be treated like now, and be shown as clickable (visit link)? Or will users really have to go back and write "[some text](" before and ")" after a link?

We are adding logic so that URLs will be auto-linked without needing to reformat the URL into Markdown syntax, but I would recommend using the Markdown format after the switchover.

Will it be possible to use something comparable to <s> / ?

I believe you are referring to the strike-through. The Markdown standard does not support strike-through so out of the box it will not be implemented on day one, but we are always open to feedback and feature requests.

# and ##: Will there be really an extra rule compared to Markdown and # and ## only interpreted as header if the line is also ending with a #, or all lines starting with # or ## etc.? Millions of logs would be affected.

This was a large concern that was raised and we are modifying the rule to require a header to begin and end with a #.

# this is not a header

## this is a header ##

Lines starting with numbers, followed by dots (german way to say first, second ...): Always interpreted as ordered list, so other numbers than 2. in the second paragraph/line and 3. in the third paragraph/line will be 'overwritten' on the website?

You have described this accurately.

Lines starting with > (meant as more than): Interpreted as 'quote' and not showing the > anymore?

This will render as a quote unless the '>' is escaped '\>'

Lines starting with *** or ___ or * * *: interpreted like <hr>?

Lines starting with *** or ___ will render normal as long as there is other text on the same line.

Lines starting with * * * will render the first * as a bullet point and the remaining 2 * normally.

Lines starting with + or - or *: interpreted as unordered lists?

If there is a space between the +, - or * and then next character then the line will render as an unordered list. If there is no space then the line will render as before, e.g. "+123"

Lines starting with four or more blanks: Interpreted as 'Code'?

Yes.

And the wide field of emphasis: Will * and _ work as described "Emphasis can be used in the middle of a word" which will alter all usernames containing more than one _ or *, which would also affect usernames like yes_no_maybe, or will only usernames like *Eisbär* and _law_ be affected? Or will usernames like Alter_Fuchs also be affected if an opening or closing _ coincidentally happens to stand nearby?

*text* and _text_ will render as italic

**text** and __text__ will render as bold

Emphasis marks that are in the middle of words or "near" words will not change the rendering.

Alter_Fuchs = Alter_Fuchs

Alter_Fuchs _test = Alter_Fuchs _test

Emphasis marks that do not match will be rendered as written

*text_ = *text_

__text** = __text**

4.6k · January 19, 2016

Eagerly awaiting a document that specifies the exact syntax as implemented by Groundspeak.

I imagine the API partners are awaiting that too...

Thanks in advance.

January 19, 2016

Ok, so if I read this right, when a new log is posted it should be sanitized for acceptable content - the system's HTML sanitization script is too old and insufficient to provide a sufficiently secure algorithm (moving forward and for legacy content), and your old implementation of BBCode has known and unfixable security flaws and is too complex to swap out with a newer implementation? (I don't know what "brittle" means)

The code that is currently used to sanitize HTML and render BBCode has become the equivalent of Frankenstein's monster over the years with new "features" or bandaids being bolted on. One of the biggest factors that go into making the code "brittle" is the fact that there are no regression tests around this functionality. For anyone that doesn't know, a regression test is helpful because it allows you to make changes, run your tests and make sure that you haven't broken your existing functionality. Without that in place we are blind and have to rely on manual testing which is time consuming and prone to mistakes. This is why there are occasionally bugs introduced to the website when seemingly unrelated changes are deployed. Our goal is to slowly (because there is too much code to do it quickly) rewrite sections/areas of code, surrounded by a full test suite so that future changes can be made quickly and with confidence that nothing has be unintentionally broken. So to your other comment about it being "module" - I completely agree that it should be, but the current implementation is not.

Challenging the claim that it's not trivial (and supporting with practical suggestions) is, well, precisely what we've done for you

The act of challenging the idea that maintaining multiple rendering paths adds complexity and hinders maintainability does not change the fact that it is not trivial. There have been several proposals about how this could be accomplished and I won't discredit them as they are technical solutions that would accomplish the desired effect of maintaining multiple rendering options, but at the end of the day each and every one will add to the complexity and reduce the maintainability of the system while potentially continuing to allow security vulnerabilities.

This is why the suggestion was made to have an asynchronous bot that runs at low traffic times to analyze and deal with legacy logs as long as any exist which haven't yet been updated. It's fundamentally dynamic in that you can tell it to do 100 at a time or 5000, to work at whatever speed has an imperceptable impact, whether it takes days to get through the database or months

You make a good point here. The main issue I was trying to raise was that whether we update 10 logs at a time or 10 million we are still relying on an algorithm to decipher the content of a log in an automated manner which is prone to errors when you consider the number of permutations that exist across 1 billion logs.

January 19, 2016

I am still reading through the various ideas and feedback, but I want to thank everyone who is contributing to the discussion. In order to have a conversation we need a place to begin which is why we wanted to let players know well in advance.

9.2k · January 19, 2016

It should be modular.

I think this explains why you see the issue differently than HiddenGnome. You think "should be", not "is".

Yes, that's why I said 'should be' instead of presuming 'is'. If it's not modular (which I still don't grasp how parsing log text through a filter isn't) then it would not be trivial, as there may be many multiple locations to make code changes, and each location may be different or have different effects and scopes. That would not be pretty. It sounds like that's how the current system may be, and if so - I can't fathom how got to that point, as it pertains to a Log text sanitization routine. Alas.

Thanks again HiddenGnome for the response... Just want to say I appreciate the exchange - I love programatic problem solving, so I'm just attempting to wrap my head around the current setup to understand the reasoning. Couple things:

The code that is currently used to sanitize HTML and render BBCode has become the equivalent of Frankenstein's monster over the years with new "features" or bandaids being bolted on...So to your other comment about it being "module" - I completely agree that it should be, but the current implementation is not.

Now I'm very curious how the process for posting a new log to the database was programmed. Do you run multiple functions on the log text, each affecting a bit of the sanitization task, and each function also being used in other locations so that it would be a frustrating task to have to track each down to fix? By modular in this context I'm picturing something along the lines of a function that takes the text input, and returns a result - whether it's sanitized text, or a flag, or a warning, or errors out, or what have you. Replacing that function, even with its internal code being 'brittle', with an updated sanitization routine is essentially trivial. The change happens where the log storing process runs. If it happens multiple locations, the function's internal code can be replaced; for instance bump the legacy code into a secondary routine and wrap it under a check that determines if the log text should be sanitized by the legacy routine or the new Markdown routine.

Again, of course none of us are privvy to the source or how it's all actually programmed, but I'm having difficulty understanding conceptually how a text parser routine replacement for a very specific database function (as the request to retrieve log text can come from multiple sources, not just web) is more complex than how a few of us here have imagined. =/

If it is, then wow I feel for y'all having to fix this all up... :shocked:

The act of challenging the idea that maintaining multiple rendering paths adds complexity and hinders maintainability does not change the fact that it is not trivial.

Maintaining? Our suggestion is rather for a gentle phasing out of legacy content and code in a much more graceful manner so that it doesn't have to be maintained in the long run. It's short term complexity (in the form of an A or B parser, essentially) for a smoother transition.

This is of course presuming a pretty standard process by my understanding (for creation or editing logs from this point forward):

[submitted text] --> {input parser} --> [sanitized/Safe log text to store]

And for display:

[Log text] --> {output parser: as HTML, plaintext, XML, etc} --> [Destination display] (3rd party apps can trust the log text they receive as safe)

As currently, the input parser would be replaced to only allow Markdown input (no need for any legacy code there).

But where right now the output function returns at least HTML knowing that the source text may contain sanitized HTML or BBCode or neither, and presumably the intended new output function would treat every input log text as already Markdown-converted (which corrupts old text that may contain unintended Markdown syntax), this proposed interim output function would be augmented to first decide how to treat the input text based on some property of the Log, whether it's an additional flag for unconverted log text, or the last edited date, etc. Otherwise it would function exactly as GS's upcoming system has been explained.

...while potentially continuing to allow security vulnerabilities.

So am I right in inferring from what you're saying that if log creation/editing were suspended as of this moment, then there currently does exist a fatal security vulnerability in at least some person's log text, in which case it seems that it's bad enough that the only feasible solution is to turn off the html/bbcode system in its entirety as soon as possible that allows that known vulnerability and chop it off at the root? It's not about what people may enter from now on, but that there is a known flaw that exists at least once in the current log data and parsing algorithms, and the amount of time/money it would take to implement a graceful conversion process far outweighs the risk of leaving that vulnerability active until it would inevitably be located and removed?

If that's the case, then I'll support this whole thing... (but I, at least, don't know of any flaw in the BBCode system that drastic, and I'd be surprised if there's a flaw in the existing HTML parsing code that currently allows for a security hole that significant, since it's been in use for years, which can't be patched even on the short term until all legacy log text conversion is complete)

You make a good point here. The main issue I was trying to raise was that whether we update 10 logs at a time or 10 million we are still relying on an algorithm to decipher the content of a log in an automated manner which is prone to errors when you consider the number of permutations that exist across 1 billion logs.

enh, I still don't buy it. Let's say you have text input that's unpredictable at best and malicious at worst - you run it through trusted parsing code that strips out anything that's not whitelisted, and out the other end can only come trusted parsed text. Despite whether there's bad or malicious syntax in the input text, the parser can easily be designed to only find and replace safe syntax, thus removing anything malicious; like escaping any control code that defines HTML so no HTML can ever be output. That function wouldn't raise fatal errors. But ok, let's say it did. You trap the error, log where it happened so at some point someone can review it, and the automated process continues on to the next log (whether it's fast or slow, or running in 10 or 10 million log batches).

Anyway, if TPTB have already determined that any effort more than what's currently being done is not financially feasible, then all this is moot anyway

It's still fun to attempt to think through though.

Oh one other thing, you described a few point of how Markdown would interpret codes, but didn't address anything about username confusion. How would Markdown code handle usernames wrapped in Markdown syntax characters, like _username_? Or is GS unabashedly going to require all its users to know that they will have to escape all problematic usernames from this point forward?

7.9k · January 19, 2016

HiddenGnome, it's great to see someone so closely involved in the project discussing things with us here so openly. It's refreshing, and I'd love to see it happen more often. Thanks for taking the time to post here.

The main issue I was trying to raise was that whether we update 10 logs at a time or 10 million we are still relying on an algorithm to decipher the content of a log in an automated manner which is prone to errors when you consider the number of permutations that exist across 1 billion logs.

But how many permutations are there really? We're only talking about a limited set of valid HTML and BBCode tag pairs that would need to be converted. Taking a quick look through the list of BBCode tags and having a good knowledge of HTML, I would think there are at most a few dozen different tags that would need to be searched for. These are the ones I think would be used the most and would cover the majority of the BBCode/HTML use in the logs (that can be converted to Markup; I've left out things like font colouring or strikethroughs):

/<b>
/<i>
/<u>
/<font size>

/<center>

/<blockquote>

/<a href>
[ul]/[ol]/

/[li] / <ul>/<ol>/<li>

In most of these cases, a RegEx and a simple one-to-one text replacement is all that's necessary. In a few cases like those tags using attributes, some simple rules may be required. The use of most other BBCode or HTML tags would be very limited compared to the tags above, so leaving everything else as-is wouldn't leave much "broken" markup. During the conversion process, the script could even make note of any logs with unconverted [] or <> tags and give the cacher a list of logs they'd have to manually fix.

I just don't see how automatically converting the above tags could possibly give undesired results. If someone uses a valid BBCode or HTML tag pair wrapped around some text, I can't think of any use-case where the user *wouldn't* want the wrapped text to be formatted. The whole point of the BBCode and HTML tag pairs is that they're easily-distinguishable from normal text. That makes it simple to detect them, and it makes it vanishingly-unlikely that normal text would be incorrectly detected as BBCode or HTML.

14k · January 19, 2016

Eagerly awaiting a document that specifies the exact syntax as implemented by Groundspeak.

I imagine the API partners are awaiting that too...

Thanks in advance.

Hopefully there will be a page in the Help center which describes the Markdown syntax implemented by Groundspeak before the feature is released. I'm not holding my breath.

I hadn't considered the impact on API partners. I know that there is a GSAK macro that can be used for logging caches. Unless that macro is modified to use Markdown rather than html (as it does now) it's going to submit logs which contain html markup. Will the API strip out HTML before the log is persisted to the database and will it return some sort of status message indicating that it has done so? The Publish Logs macro is frequently used by those that want to log power trails. Often those logs might be boiler plate using a template. Suppose that template contains a brief log with a link to the first cache of the PT for the full log (rather than cut-n-paste a long log on every cache). Will every log be broken?

How about all the different mobile apps out there? Is GS expecting all of them to support Markdown for creating logs? Will the official app support Markdown with a WYSIWYG editor?

What will happen if someone tries to put HTML in their log? At what point will the HTML be filtered out. My assumption is that the Markdown will be transformed to "safe" html when reading the log out of the database for rendering on the GS webpages or in the log elements in a GSAK file. If the Markdown is transformed *before* persisting the log to the database how would someone edit their log?

9.2k · January 19, 2016

My assumption I outlined in my previous comment. Ideally, the input parser would strip anything known to be malicious so that it stores only 'safe' text - not formatted for any particular output. Just a basic sanitizer. I'd presume this routine has already been developed and is waiting for GS to flip the switch so all input moving forward is sanitized, however that may look to the devs.

It's the output parser that converts the raw log text to whatever context/display is being used, be it web page HTML, XML data, mobile app interface, etc. (I would think they'd attempt to make the API log text output universal - perhaps either raw Markdown text so the external app can decide whether to html-format the syntax or not, or already formatted from Markdown syntax to HTML source if an app just wants to display the log text without having to parse teh Markdown text to HTML itself.

I don't think the input is the problem - apart from the fact that people will need to know basic Markdown to avoid corruption, and any external scripts will need to stop inserting HTML and BBCode which won't parse properly into the GS database. (I certainly hope GS would implement tools to make that transition smoother, but it doesn't sound like that'll happen).

The legacy issue really only pertains to displaying of current (old) log text which was saved before the input parser was enabled for Markdown. That's the one that would best use the flag-of-some-kind to decide how to format the stored log text for external display.

January 19, 2016

Considering everything said until now I'm against automatic conversion of bbc/html to Markdown except for the most simple operations.

Even converting [ i] and [ /i] to _ won't work out well because before you could use it for parts of words, single words or more than one word in contrast to exact one word surrunded by _ as stated by HiddenGnome.

Automatic conversion might be fine for removing then useless because unrendered (simple formed) ubb/html code from logs, but only code that really can be removed without danger of ruining the meaning of what is written. Like the often used [colorname] etc.

But better not simply delete [ s] [ /s] without asking - instead let the users decide themselves what to do.

Edited January 19, 2016 by AnnaMoritz

4.5k · January 20, 2016

HiddenGnome, it's great to see someone so closely involved in the project discussing things with us here so openly. It's refreshing, and I'd love to see it happen more often. Thanks for taking the time to post here.

I just stated in another thread, that when GS made changes, they rarely responded or acknowledged concerns or problems with those changes. Even though this stuff is over my head :blink: , it's still nice having someone from GS discussing and answering questions here. Thanks HiddenGnome!

January 20, 2016

Thanks for joining the discussion HiddenGnome!

The code that is currently used to sanitize HTML and render BBCode has become the equivalent of Frankenstein's monster over the years with new "features" or bandaids being bolted on. One of the biggest factors that go into making the code "brittle" is the fact that there are no regression tests around this functionality. For anyone that doesn't know, a regression test is helpful because it allows you to make changes, run your tests and make sure that you haven't broken your existing functionality.

So, you say that you're going to break the existing functionality now, so that you don't have to break it again? I appreciate that you make it testable, but why can't you add the tests first, and then change it? (that's how we do it where I work)

You make a good point here. The main issue I was trying to raise was that whether we update 10 logs at a time or 10 million we are still relying on an algorithm to decipher the content of a log in an automated manner which is prone to errors when you consider the number of permutations that exist across 1 billion logs.

What if you use your existing parser to parse all the logs, and store the parsed value as html. Then use a flag to just display those logs without running them through the new parser. You won't have any old logs interpreted as markdown, and none of them will be broken.

I believe you are referring to the strike-through. The Markdown standard does not support strike-through so out of the box it will not be implemented on day one, but we are always open to feedback and feature requests.

I understand and appreciate that you want to improve. But you're taking away something that works, and replacing it with something that lacks a lot of features.

*text* and _text_ will render as italic

Not that this is a big problem to me, but does this mean that the much used {*FTF*} will be displayed as {FTF}?

I personally think that this change will lead to a lot of logs being wrongly formatted, because people write clear text that happens to be markdown syntax.

I think just removing formatting all together would have been better (but keeping the existing would have been the best - as long as HTML is considered safe in the cache descriptions, I don't see how they should not be in the logs...).

As an alternative to you doing the change, can you expand the API so that we can use the API to change our logs? Then we could use a GSAK macro, and get it done ourselves.

7.5k · January 20, 2016

So, you say that you're going to break the existing functionality now, so that you don't have to break it again?

No, they're going to remove functionality now so they never have to worry about it again.

January 26, 2016

Can you at least make the function that automatically links URLs exclude the old UBB-tags?

If you look at the picture below, the link is including way to many characters, making it useless.

12606986_10153187913257142_149890834_n.jpg?oh=e91a1348a6f03273afb3866403d70544&oe=56A98916

Instead of linking to http://www.site.tld, it now links to http://www.site.tld/]Text

Please fix at least that.

5.2k · January 26, 2016

[...]

Instead of linking to http://www.site.tld, it now links to Text"]http://www.site.tld/]Text

Please fix at least that.

Use this GSAK macro

Hans

7.9k · January 26, 2016

[...]

Instead of linking to http://www.site.tld, it now links to Text"]http://www.site.tld/]Text

Please fix at least that.

Use this GSAK macro

Hans

A GSAK macro won't have any effect on how the website renders old BBCode. Since it's unlikely that every user will edit every log to remove BBCode links, thomfre's concern is valid.

5.2k · January 26, 2016

[...] thomfre's concern is valid.

I never doubted it.

Hans

January 26, 2016

I made this nice test, in Norwegian, but I guess you get the point even if you don't understand Norwegian.

On the first line, I start with a date - February 2nd, which is written "2. februar" in Norwegian.

That suddenly becomes "1. Februar", the day before. Logical, isn't it?

5.2k · January 26, 2016

I made this nice test, in Norwegian, but I guess you get the point even if you don't understand Norwegian.

On the first line, I start with a date - February 2nd, which is written "2. februar" in Norwegian.

[...]

Shouldn't that "på den andre av Februar" at all? :lol:

Happy Hunting

Hans

NB: could you please provide the new log page's url?

4.6k · January 26, 2016

Easy to guess, having seen it mentioned here in the past.

staging.geocaching.com

For example,

https://staging.geoc...D=4454437&lcn=1

EDIT: Hmm, inserting a link doesn't seem to work. It ends up showing raw HTML, angle brackets, ooh yuck, instead of rendering it. I guess they're still working on it.

Edited January 26, 2016 by Viajero Perdido

January 26, 2016

And xx_yy_zz still shows as xxyyzz

Edited January 26, 2016 by AnnaMoritz

January 27, 2016

I'll respond to several comments without quoting.

Security of BBcode: I found a discussion of XSS issues in BBcode at http://1nfosec4all.blogspot.com/2012/07/bulletin-board-code-bbcode-xss-exploit.html. I tested all the examples given. I found that gc.com is not vulnerable to any of them. Of course there may be others, and reading the article gives one a good idea of how hackers can slip through unseen cracks, but the implication is that the GC code already interprets BBcode narrowly (safely) rather than broadly (accepting anything stuck in).

How can code get that bad? Without going into details, I'll just say that I've seen lots of bad code that was written with good intent. Sometimes systems start small and grow without ever being designed. Other times it's just bad programming, but in this case I'm willing to assume the first interpretation.

I remain unconvinced that it's impossible to do a reasonably safe conversion. (And if there's concern about loss of data, remember that garbaging the display of old logs is effectively losing data.) My approach would be narrow: pick out patterns which can be converted safely and convert those. In no case delete any text -- that should not be necessary in any case. I don't see any need to make any transformation which would delete text. I may yet try doing this in a GSAK macro, but the lack of capability to modify an existing log via the API would make this a less than satisfactory solution. Plus, I don't just want MY logs kept legible ... I want all the logs that I READ kept legible. Note that I'm harping on the theme "failure to act will constitute loss of data".

As for "a RegEx and a simple one-to-one text replacement is all that's necessary", I'm not convinced it's that simple. I'd say it's simple as long as the tags are properly nested. But what happens, for example, if a [b] is nested inside a [url]? There are real issues to be analyzed. For example

[url=http://[b]example.com[/b]]EXAMPLE[/url]

might get converted to

[EXAMPLE](http://**example.com**)

No danger there (the link is broken, but the BBcode link probably didn't work either), but it's a sample of what strange things can happen. I'm not devious enough to think up truly dangerous examples. I am however a believer in a modified version of the Law of Large Numbers: when you have enough records in a database, you'll find all kinds of things you never expected. That was true forty years ago when I was working with a database of ten million driver licenses -- absolutely huge for its time. It's certainly true for a billion log records, even if 95% of them are just a couple of words.

So "strip everything known to be malicious" is not sufficient. The design has to be "only allow what's known to be safe".

The function of the Publish Logs macro is now built in to GSAK, and Clyde is busy updating the function for Markdown.

I have not yet heard any clarification on the statement in the announcement that "Smileys will continue to be supported as they are today". Does this mean that the existing syntax for smileys will be supported, and thus that smileys in old logs will still be OK? Or that smileys will be supported with a different syntax?

Edward

14k · January 27, 2016

Easy to guess, having seen it mentioned here in the past.

staging.geocaching.com

For example,

https://staging.geoc...D=4454437&lcn=1

EDIT: Hmm, inserting a link doesn't seem to work. It ends up showing raw HTML, angle brackets, ooh yuck, instead of rendering it. I guess they're still working on it.

The link syntax is not real intuitive if you're familiar with HTML. You have to put the link text in square brackets, immediate followed by the url in parentheses. For example:

[Mingo](https://www.geocaching.com/geocache/GC30_mingo)

It's much easier just to highlight some text, and click on the "Insert hyperlink" icon, then enter the URL.

There's a "How to Format" link on the wysiwyg editor that explains some of the syntax. It has examples of a bunch of emoticons that can be used but they don't seem to render in the "preview" just below the editor, but do render correctly once the log is submitted.

4.6k · January 27, 2016

You have to put the link text in square brackets, immediate followed by the url in parentheses.

Did that; the editor makes it pretty obvious.

But I see now, it does work IF you manually add "http://" to the URL. No "http://", no linky. And heh, the resulting URL doesn't include it.

Nothing else but an URL is supposed to go in there, so it seems that part should be automatic. I guess you need to choose http vs https, but it could default to the former.

Edited January 27, 2016 by Viajero Perdido

9.2k · January 27, 2016

I still think that flagging 'old'/unconverted logs to remain rendered with the legacy system until either edited by the user or converted (presuming they produce an effective conversion algorithm) is the most favourable procedure. I agree with the idea that 'garbaging the display of old logs is effectively losing data'. I also think that just like with cache descriptions, it may be better to provide a flag for people to toggle if they want to display Markdown formatting, instead of presuming that every single log contains valid and intentional markdown syntax (thus garbaging those that do but are not intentional).

January 27, 2016

it may be better to provide a flag for people to toggle if they want to display Markdown formatting, instead of presuming that every single log contains valid and intentional markdown syntax (thus garbaging those that do but are not intentional).

^^^^THIS.

Markdown has the unfortunate attribute that it is all too easy to accidentally introduce formatting by using perfectly un-format-like character sequences. That's not a feature, it's a bug.

January 27, 2016

I don't mind that much that excessive formatting (colors, fonts) in logs like this one will be ruined. And maybe automatic conversion of links isn't that trivial and maybe GS won't find a perfect solution for handling old links formatted with ubb/bbc or html and leaves it up to the users to do that.

but I really don't like what the current state of handling Markdown syntax at staging.geocaching.com does to usernames. Hopefully the current state of handling Markdown syntax at staging.geocaching.com won't be what will be seen soon on www.geocaching.com. Moreover in preview Markdown syntax is implemented differently compared to the resulting log.

January 27, 2016

Handling urls or strings resembling urls seems a bit green at staging.geocaching.com, at least at the moment.

9.2k · January 27, 2016

it may be better to provide a flag for people to toggle if they want to display Markdown formatting, instead of presuming that every single log contains valid and intentional markdown syntax (thus garbaging those that do but are not intentional).

^^^^THIS.

Markdown has the unfortunate attribute that it is all too easy to accidentally introduce formatting by using perfectly un-format-like character sequences. That's not a feature, it's a bug.

The clear benefit to Markdown is that one can write a Markdown-styled page of text and have it readable as plaintext, which other markup languages have a more difficult time with.

The clear weakness to Markdown is that it alters unintentional text if the formatting syntax is forced on text that was created as plaintext.

The former? Excellent! The latter - please provide a resolution for this problem, Groundspeak!

January 27, 2016

As it still seems that there will be no converter to assist (semi-automatic) conversion of logs, I just made a feature-request at Project-GC.

Maybe they could help with a nice tool - if they are able to edit logs via API.

The reason I don't think GSAK will be of much help is that it doesn't have the ability to edit logs online - at least as far as I know.

January 27, 2016

As it still seems that there will be no converter to assist (semi-automatic) conversion of logs, I just made a feature-request at Project-GC.

Maybe they could help with a nice tool - if they are able to edit logs via API.

The reason I don't think GSAK will be of much help is that it doesn't have the ability to edit logs online - at least as far as I know.

GSAK and Project-GC both access the same API. And none of them can make this work without new functionality in the API.

January 28, 2016

Just waiting for the real fun and games to start when it actually goes live then all the real problems will become apparent. <_<

January 28, 2016

Just waiting for the real fun and games to start when it actually goes live then all the real problems will become apparent.

:drama:

January 28, 2016

I tried the staging.geocaching.com site

It is impossible to add the standard {*FTF*} tag to a log. First the *FTF* will be FTF in italic and when the } is added the row is select and } is not added to the text.

And please don't say that FTF in not a goundspeek thing. You have made adds for premium accounts that said get premium to get notification an FTF.

The `Preformated text` that i have seen on markup spec works but is not in the "How to Format" text. Will is always be supported and just missed in the guide or is the inclusion temporary. It is quire useful when writing logs on challenges to get text with GC codes and cachenames readable

Edited January 28, 2016 by Target.

2k · January 28, 2016

I tried the staging.geocaching.com site

It is impossible to add the standard {*FTF*} tag to a log. First the *FTF* will be FTF in italic and when the } is added the row is select and } is not added to the text.

And please don't say that FTF in not a goundspeek thing. You have made adds for premium accounts that said get premium to get notification an FTF.

The `Preformated text` that i have seen on markup spec works but is not in the "How to Format" text. Will is always be supported and just missed in the guide or is the inclusion temporary. It is quire useful when writing logs on challenges to get text with GC codes and cachenames readable

There is a standard for saying you are FTF? I don't think I've seen that syntax. Most people, on my caches, just say FTF.

January 28, 2016

There is a standard for saying you are FTF? I don't think I've seen that syntax. Most people, on my caches, just say FTF.

Yes, {*FTF*} is used a lot, at least in Europe. Project-GC and various other tools pick them up.

January 28, 2016

I tried the staging.geocaching.com site

It is impossible to add the standard {*FTF*} tag to a log. First the *FTF* will be FTF in italic and when the } is added the row is select and } is not added to the text.

And please don't say that FTF in not a goundspeek thing. You have made adds for premium accounts that said get premium to get notification an FTF.

The `Preformated text` that i have seen on markup spec works but is not in the "How to Format" text. Will is always be supported and just missed in the guide or is the inclusion temporary. It is quire useful when writing logs on challenges to get text with GC codes and cachenames readable

There is a standard for saying you are FTF? I don't think I've seen that syntax. Most people, on my caches, just say FTF.

No, but people playing the FTF-sidegame use {*FTF*} or {FTF} or [FTF] in order to let a third party programme know that this cache was FTF-ed. So they can move over to [FTF] anyway.

January 28, 2016

Now it seems that {`*`FTF`*` followed by 125; whithout space lets it _look_ like {*FTF*} in the log, but nothing more.

That won't help because other programs look for 'real' {*FTF*}.

For old logs there doesn't seem to be a problem, {*FTF*} is still there, and only at geocaching.com it will look like {FTF} which is not uglier than {*FTF*} how old log {*FTF*} it might look later.

And for new logs [FTF] won't be a problem, except when followed immediately by (some text) without blank. And as it seems you can't enter } via website [edit: on german and other keyboards where it is AltGr + 0] in the future there won't be any problem left.

Edited January 28, 2016 by AnnaMoritz

6.2k · January 28, 2016

I tried the staging.geocaching.com site

It is impossible to add the standard {*FTF*} tag to a log. First the *FTF* will be FTF in italic and when the } is added the row is select and } is not added to the text.

And please don't say that FTF in not a goundspeek thing. You have made adds for premium accounts that said get premium to get notification an FTF.

This might not be a problem, as although it displays as italic, the underlying text in the log is still {*FTF*}, and I would guess that the API pulls down the underlying text and so Project-GC et all will still see it as an FTF tag and record it as such.

6.8k · January 28, 2016

And as it seems you can't enter } via website in the future there won't be any problem left.

A new problem then however. In my opinion it is not acceptable to not allow { and } as symbols in logs.

January 28, 2016

I tried the staging.geocaching.com site

It is impossible to add the standard {*FTF*} tag to a log. First the *FTF* will be FTF in italic and when the } is added the row is select and } is not added to the text.

And please don't say that FTF in not a goundspeek thing. You have made adds for premium accounts that said get premium to get notification an FTF.

The `Preformated text` that i have seen on markup spec works but is not in the "How to Format" text. Will is always be supported and just missed in the guide or is the inclusion temporary. It is quire useful when writing logs on challenges to get text with GC codes and cachenames readable

There is a standard for saying you are FTF? I don't think I've seen that syntax. Most people, on my caches, just say FTF.

No, but people playing the FTF-sidegame use {*FTF*} or {FTF} or [FTF] in order to let a third party programme know that this cache was FTF-ed. So they can move over to [FTF] anyway.

The reason it was impossible to write {*FTF*} for me is that i use Swedish keyboard. } is on the 0 key with altgr. altgr triggers the key codes for alt and ctrl and the markdown editor uses ctrl for short cuts. shortcut 0 is H0 ie normal text and the line is set as normal an no } is added. If i switched to english keyboard it is possible to write } when only shift is used.

For the same reason an Swedish user cant write @ £ $ in logs when they are on AltGr+ 2 3 4

2k · January 28, 2016

I tried the staging.geocaching.com site

It is impossible to add the standard {*FTF*} tag to a log. First the *FTF* will be FTF in italic and when the } is added the row is select and } is not added to the text.

And please don't say that FTF in not a goundspeek thing. You have made adds for premium accounts that said get premium to get notification an FTF.

This might not be a problem, as although it displays as italic, the underlying text in the log is still {*FTF*}, and I would guess that the API pulls down the underlying text and so Project-GC et all will still see it as an FTF tag and record it as such.

That is what I would assume. If you use the API to download then you'll get the raw markdown text, which is {*FTF*}. Everything should work fine.

If you use something that screen scrapes, then all bets are off as the formatted would have been applied.

January 28, 2016

For new } (on keyboards where it is something like AltGr + 0 like on german keyboards where you can't use that) you can take it from clipboard. Or write followed by 125; but with bad luck it won't display correctly in notification mails and export.

All of {[]}\²³@ is possible when dropped from clipboard. And on german keyboards {[]\ is possible when entered by AltGr + 7, AltGr + 8, AltGr + 9, AltGr + ß

Loggers with german keyboard will be very happy to see what happens when they enter AltGr + 2 expecting ² and to AltGr + 3 expecting ³ and AltGr + q expecting @. And better not try to use tilde ~, you never dreamed that it will display as - after you managed to see ~ at least in the editor.

6.8k · January 28, 2016

For new } (on keyboards where it is something like AltGr + 0 like on german keyboards where you can't use that) you can take it from clipboard. Or write followed by 125; but with bad luck it won't display correctly in notification mails and export.

All of {[]}\²³@ is possible when dropped from clipboard. And on german keyboards {[]\ is possible when entered by AltGr + 7, AltGr + 8, AltGr + 9, AltGr + ß

What is highly needed in my opinion is a switch that allows to switch off markdown and the annoying editor and allow pure text logs (like pure text cache descriptions are possible) also for those who log via the website.

Log Change from HTML to Markdown

Recommended Posts

Join the conversation