
Invalid character in GPX output file


gopman


There appears to be a problem with a GPX file I received from geocaching.com yesterday. I tried to read it with GPSBabel and a home-grown Perl program using XML::Parser. They both complain about illegal characters. I narrowed the problem down to cache ID=9167, the log entry by JoeyBob. There appear to be some unprintable characters just before the word "sorry". I'm not an XML expert, so I can't guarantee it's a bug in geocaching.com's GPX extract, but that's what it looks like. Thanks.

 

Rich


If you didn't get it zipped, switch to zipped files. Sometimes you get a corrupted file if it's not zipped. I've seen at least one GPX file corrupted when e-mailed unzipped (one of the "id=" attributes turned into "=d=", which is obviously an illegal attribute name).

 

If you *did* get it zipped and want to let me have a look at the characters, you can e-mail it to me, and I'll take a look and see what's up with it. (GPS@[MyName].com) Oh, and obviously, zip it before e-mailing it. icon_wink.gif


Since I'm on the developers list for GPX, I know the answer to this will be "what does SAXCount say about the file?" When the GPX spec itself is vague, the answer defers to what the validating parser SAXCount says. So I created a tiny little pocket query that contained the cache in question. It hurls:

 

(robertl) rjlinux:/tmp

$ SAXCount 6659.gpx

 

Fatal Error at file /tmp/6659.gpx, line 36, char 134

Message: Invalid character reference

 

The character in question is indeed the one before "sorry". It appears to be an attempt at an entity encoding, but the character it references is illegal.

 

I'm not an XML jock, so I can't tell if any pertinent standards prohibit this, but empirically it appears that below 0x20, the only values allowed are the whitespace brothers: 0x09, 0x0A, and 0x0D (tab, linefeed, and carriage return).
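That empirical observation matches the XML 1.0 spec's Char production exactly. A small Python sketch of the full rule (the function name is mine):

```python
def is_legal_xml_char(cp: int) -> bool:
    """True if code point cp may appear in XML 1.0 content.

    Per the XML 1.0 Char production: below 0x20, only tab (0x09),
    LF (0x0A), and CR (0x0D) are legal; surrogates and 0xFFFE/0xFFFF
    are excluded higher up.
    """
    return (cp in (0x09, 0x0A, 0x0D)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

print(is_legal_xml_char(0x03))  # False: the culprit before "sorry"
print(is_legal_xml_char(0x09))  # True: tab
```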

 

There are four solutions, each with a different degree of moral purity:

 

1) Edit the character in question out of the GPX file you're trying to process and move on. This solves the problem for you for this specific run.

 

2) Get the poster of the log in question to edit the log in question. This solves the problem for anyone creating a GPX file with this specific cache until there are four more logs to bump this log or the number of logs is raised. You get docked karma points if you post multiple notes or no-finds to this cache just to make this one scroll away....

 

3) Lobby all the authors of GPX-reading software to handle illegal input. (Hint: as the author of one of the packages in question, I can tell you this will be a non-starter.)

 

4) Bribe the admins of geocaching.com (maybe a "please" will suffice) to fix this problem upstream. Even with this, the degrees of rightness range from editing this specific log to trapping this on generation of HTML and XML to trapping it on input.
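For option 1, the hand edit can be scripted. A rough, illustrative Python sketch (mine, not the Perl script mentioned later in this thread) that drops both raw control bytes and numeric character references naming illegal characters:

```python
import re

# Raw control bytes illegal in XML 1.0 (below 0x20, minus tab/LF/CR).
RAW_BAD = re.compile(r'[\x00-\x08\x0B\x0C\x0E-\x1F]')
# Any numeric character reference, decimal or hex.
REF = re.compile(r'&#(x[0-9a-fA-F]+|[0-9]+);')

def _keep_if_legal(m):
    body = m.group(1)
    cp = int(body[1:], 16) if body.startswith('x') else int(body)
    legal = (cp in (0x09, 0x0A, 0x0D) or 0x20 <= cp <= 0xD7FF
             or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)
    return m.group(0) if legal else ''

def scrub(text):
    """Make a GPX file digestible by a conforming parser."""
    return REF.sub(_keep_if_legal, RAW_BAD.sub('', text))

print(scrub('I am &#x3;sorry'))  # -> I am sorry
```

Legal references like &#65; pass through untouched, so the change is limited to the offending characters.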

 

In working with Geocaching.com guys on similar issues, it was a stated goal to generate output that validated successfully, so I suspect there will be little debate that there is breakage.

 

I'll let you start the bidding with your favorite admin. icon_cool.gif


Thank you. I was hoping someone else would take a look at that log entry...

 

Clearly, Geocaching.com needs to fix their GPX output routines to catch this sort of problem. That's much safer than assuming that all of the input methods will catch the problem.

 

Rich


quote:
Originally posted by gopman:

Geocaching.com needs to fix their GPX output routines to catch this sort of problem. That's much safer than assuming that all of the input methods will catch the problem.

Rich


 

I disagree. Data will be input once and output many times. Pay the cost of scanning, filtering, and fixing it once instead of many times. In this very discussion, we've seen that there is illegal output in both the HTML (web page) and GPX (XML) versions of the data. I'm speculating that's probably two different sets of code emitting those, so that's at least two code bases that would have to do this. A log is entered once, but may appear on web pages and in GPX files for months; spend the clock cycles to catch it at the point of violation.
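A sketch of what catching it at the point of violation might look like; the function name and the reject-rather-than-strip policy are my assumptions, not anything the site actually does:

```python
LEGAL_CONTROLS = {'\t', '\n', '\r'}

def validate_log_entry(text: str) -> str:
    """Reject a submitted log containing control characters that are
    illegal in XML (and useless in HTML): anything below 0x20 except
    tab, LF, and CR."""
    bad = [c for c in text if ord(c) < 0x20 and c not in LEGAL_CONTROLS]
    if bad:
        raise ValueError(
            "log entry contains %d illegal control character(s)" % len(bad))
    return text
```

Filter once here and every downstream emitter (web page, GPX, email) inherits clean data, which is exactly the once-versus-many argument.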

 

That said, I'm not the one doing the work, so I don't much care where (or how) it's fixed. icon_biggrin.gif


We're digressing from the topic, but I believe in paranoid programming. Don't assume the data is correct even if you ensure that it is during the input process. Some day, someone will find a clever way to input bad data to the database. It will happen. Guaranteed.

 

Rich


I've been looking into this problem, since some people are getting GPX files that die because of illegal XML characters.

 

The problem is that some GPX files that are being sent out are NOT legal XML. This is ultimately a Groundspeak problem; GPX is supposed to be valid XML. The parsers that are dying are doing exactly what they are supposed to do as specified in the XML spec. Upon encountering an illegal character, a conforming parser must die and not deliver any additional information to the calling program.
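That mandated fatal-error behavior is easy to demonstrate with any conforming parser; for example, Python's expat-backed ElementTree (the sample document is made up):

```python
import xml.etree.ElementTree as ET

# A conforming XML processor must report a fatal error on an illegal
# character reference and stop handing data to the application.
try:
    ET.fromstring('<gpx><desc>I am &#x3;sorry about that</desc></gpx>')
except ET.ParseError as err:
    print('fatal error:', err)
```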

 

However, in the meantime I have written a little Perl script that will remove illegal characters from the XML files. I've put it up on the gpx2html web page.


quote:
Originally posted by robertlipe:

quote:
Originally posted by gopman:

Geocaching.com needs to fix their GPX output routines to catch this sort of problem. That's much safer than assuming that all of the input methods will catch the problem.

Rich


 

I disagree. Data will be input once and output many times. Pay the cost of scanning, filtering, and fixing it once instead of many times. In this very discussion, we've seen that there is illegal output in both the HTML (web page) and GPX (XML) versions of the data. I'm speculating that's probably two different sets of code emitting those, so that's at least two code bases that would have to do this. A log is entered once, but may appear on web pages and in GPX files for months; spend the clock cycles to catch it at the point of violation.

 

That said, I'm not the one doing the work, so I don't much care where (or how) it's fixed. icon_biggrin.gif


 

OK, I hope you guys are talking about two totally different "inputs".

 

I think gopman uses "inputs" to refer to the inputs to the GPX user programs. robertlipe is talking about input to a log entry, i.e. don't allow illegal characters to be typed into a log entry. At least I hope that is what robertlipe is referring to.

 

I think it is too late to block entering illegal characters into a log, because illegal characters have already been logged.


quote:

The character in question is indeed the one before "sorry". It appears to be an attempt at an entity encoding, but the character it references is illegal.


 

I think the problem may be deeper than we (or at least I) expected. One of these showed up in one of my queries. The query in question references the binary entity &#x3;. (Shouldn't that at least be &#x0003;?)
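As it turns out, the number of digits doesn't matter: leading zeros in a numeric character reference are insignificant, so both spellings name code point U+0003, and U+0003 is illegal no matter how it's written. A quick check:

```python
# "&#x3;", "&#x0003;", and decimal "&#3;" all decode to the same
# code point; legality depends only on that code point.
for ref in ('&#x3;', '&#x0003;', '&#3;'):
    body = ref[2:-1]  # strip "&#" and ";"
    cp = int(body[1:], 16) if body.startswith('x') else int(body)
    print(ref, '-> U+%04X' % cp)  # all print U+0003
```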

 

Now for the kicker: Internet Explorer, Watcher, and GPXView all happily grok the invalid encoding back into an 'unprintable box' character. (All MSXML-based, I'm assuming.)

 

Everything else on the planet seems to hit it and blow chunks.

 

Based on just that, I'm led to believe that this is acceptable "Micro$oft XML" even though it's not acceptable "XML Standard" XML.

 

Perhaps the geocaching.com gods have no choice with a MS based back-end?

 

grrr...

 

alex


The problem is, as you correctly divine, with the MSXML component, which allows illegal characters into the XML stream. Bad. Conforming XML implementations are required to barf upon encountering those characters; as usual, MS doesn't have a conforming implementation.

 

The problem here is that geocaching.com is sending out illegal XML files, probably because they rely on the MSXML component to generate them in the first place.

 

I have a Perl script at the gpx2html site that will fix GPX files with these problems. gpx2html also does it automatically.

This topic is now closed to further replies.