
Web Scraping Cache Details


arrowroot


I've looked at the HTML that's displayed for cache details, and I'm quite impressed with the structure that's been put in for the various pieces of information. The SPAN tag has been used to identify each data field, making it simple to extract the information from the page using the document object model.
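
For instance, something along these lines would pull every id-tagged span into a dictionary. This is just a rough sketch in Python with BeautifulSoup rather than the IE DOM, and the span id "CacheName" at the end is a made-up placeholder, not the actual id used on the cache page:

code:

from bs4 import BeautifulSoup

def extract_fields(html):
    # Grab every <span> that carries an id, keyed by that id.
    soup = BeautifulSoup(html, "html.parser")
    return {span["id"]: span.get_text(strip=True)
            for span in soup.find_all("span", id=True)}

# fields = extract_fields(page_html)
# name = fields.get("CacheName")  # placeholder id -- check the real page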

 

My guess is that the data is stored in XML, and the page is generated by an XSLT transformation -- but it's nearly as simple to go the other way 'round.

 

A couple questions:

1) Is it possible to retrieve the XML instead of the HTML, or is this limited to the GPX files you get when subscribing? (I'm not a subscriber yet.)

 

2) Could a few more things be tagged? Specifically, some log information would be valuable to extract in a structured format: Date of log, success of log...

 

3) Would a web scraping app be in violation of the terms of service? I'd tend to doubt it, but it does undercut the value of the subscription.

 

Joel (Arrowroot, son of Arrowshirt)


quote:

1) Is it possible to retrieve the XML instead of the HTML, or is this limited to the GPX files you get when subscribing? (I'm not a subscriber yet.)


Subscribe and get the GPX.

 

quote:

2) Could a few more things be tagged? Specifically, some log information would be valuable to extract in a structured format: Date of log, success of log...


All of this, and more, is in the GPX.
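
To give an idea of how easy that is to get at, here is a rough sketch in Python's ElementTree that pulls the log date and type for each cache. The groundspeak namespace and element names are what I believe the 1.0 cache extension uses; treat them as assumptions and check them against a real file.

code:

import xml.etree.ElementTree as ET

NS = {"gpx": "http://www.topografix.com/GPX/1/0",
      "gs": "http://www.groundspeak.com/cache/1/0"}

def cache_logs(path):
    # Walk each waypoint and yield (cache name, log date, log type).
    tree = ET.parse(path)
    for wpt in tree.findall("gpx:wpt", NS):
        name = wpt.findtext("gpx:name", default="", namespaces=NS)
        for log in wpt.findall(".//gs:logs/gs:log", NS):
            date = log.findtext("gs:date", default="", namespaces=NS)
            kind = log.findtext("gs:type", default="", namespaces=NS)  # e.g. "Found it"
            yield name, date, kind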

 

quote:

3) Would a web scraping app be in violation of the terms of service? I'd tend to doubt it, but it does undercut the value of the subscription.


It tends to piss off the geocaching.com gods.

Subscribe and get the GPX. I went searching but couldn't locate the last thread I read where this came up and jeremy gave his official "don't do it" answer. If I can find it, I'll add it later.

 

If you would like to see a GPX example, check out ClayJar's Watcher download directory; it has a dated GPX file of Louisiana geocaches.

 

...

alex


quote:
Originally posted by alexm:

It tends to piss off the geocaching.com gods.


 

I'll second that. I've seen those curt answers.

 

Also, from what I understand, the servers behind GC.com are under heavy load. Adding yet another avenue for data mining would slow the site down for everyone.

 

I'd prefer a utility that crunches records on my own machine, or even on a different server altogether, similar to Lil' Devil's Spinner.

 

Personally, for local crunching I'm looking forward to a mature Watcher from ClayJar. I feel he is on the right track with what he's got so far, and I'm hoping for user-defined printouts, more powerful searches, output to HTML and Palm files, and uploading directly to the GPS.

 

Plus, data mining from GC.com would be slow for anyone on dialup.

 

CR

 



I get your point -- I'll send 'em a check (not a big fan of Paypal).

 

The only other comment is that scraping should have a pretty low cost: I'm using the Internet Explorer automation server, so if a page is already in your local cache there's little cost, in time (for dialup) or server load, to read it again -- unless the server sends Pragma: no-cache, which they probably do on pages like this that get updated frequently (urgh).
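
An easy way to see what they're actually sending is to fetch just the headers and look at Pragma / Cache-Control / Expires. A quick sketch in Python (the URL is a placeholder, not a real cache page):

code:

from urllib.request import Request, urlopen

def cache_headers(url):
    # HEAD request: headers only, no page body.
    req = Request(url, method="HEAD")
    with urlopen(req) as resp:
        return {k: resp.headers.get(k)
                for k in ("Pragma", "Cache-Control", "Expires")}

# print(cache_headers("http://example.com/cache_details"))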

 

Oh, well, it should be pretty darn easy to convert the GPX file to the same format I was planning on.
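
The conversion really is just walking the waypoints and writing them out flat. A sketch in Python, using CSV as a stand-in for whatever format I end up with (element names assume plain GPX 1.0):

code:

import csv
import xml.etree.ElementTree as ET

NS = {"gpx": "http://www.topografix.com/GPX/1/0"}

def gpx_to_csv(gpx_path, csv_path):
    # Flatten each waypoint to one CSV row: name, lat, lon, description.
    tree = ET.parse(gpx_path)
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["name", "lat", "lon", "desc"])
        for wpt in tree.findall("gpx:wpt", NS):
            writer.writerow([
                wpt.findtext("gpx:name", default="", namespaces=NS),
                wpt.get("lat"),
                wpt.get("lon"),
                wpt.findtext("gpx:desc", default="", namespaces=NS),
            ])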

 

I'd rather go to a real Palm database, but conduits are a real pain to write.

 

Arrowroot

This topic is now closed to further replies.