Server Error: Connection Reset


ertyu

Hey Justin,

 

The site has been nearly unusable for the last 36+ hours. How come status.geocaching.com shows 5 minutes of downtime during that whole time? I don't know how Pingdom works, but it seems like they must be pinging things (or however they do it) in an odd way if the community is seeing no ability to use the site but they are saying it's speeding along.

Link to comment

Thank you for taking time early and out of your Saturday to look into this and resolve it for us. Thanks for being so transparent about what the problem was and updating everyone. Thanks for the fix. Nice job. :D

 

I've reported the issue to our provider and have started waking people up. This isn't something we've seen before.

 

Does anyone know if this started before the thread was created early yesterday morning?

 

No, the problem started after this thread was posted. The OP looked into their crystal ball and knew there was going to be a problem. :blink:

 

Thanks for at least acknowledging someone is paying attention. This has been going on for closer to 36 hours now and this is the first response from Groundspeak that I have seen. Hope it gets fixed soon. Server overload maybe?

 

:anicute: Thanks for not being too hard on me... I hadn't had my coffee yet.

 

We traced the issues to one of our F5 BigIP LTMs. As you likely noticed, it wasn't routing traffic properly and was resulting in connection resets and the 502 Bad Gateway errors. Unfortunately, the health checks on the system didn't recognize the sub-optimal state of the primary unit and did not automatically failover to the standby unit.

 

After manually failing over to the standby unit, the problems appear to be resolved. We're pulling logs and will be submitting them to F5 support for an investigation to determine why it was failing and what can be done to prevent it in the future.

 

We're keeping a close eye, and hope we haven't impacted your ability to log #10in31.
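
To illustrate how a health check can miss a unit that is degraded rather than dead, here is a minimal Python sketch. It is not F5's health-check logic or anything Groundspeak runs; the class name, window size, and threshold are all hypothetical. It only shows the general idea of judging a unit by its share of failed probes over a recent window, instead of by whether it answers at all.

```python
from collections import deque


class WindowedHealthCheck:
    """Hypothetical monitor: unhealthy when too many recent probes failed."""

    def __init__(self, window: int = 20, max_failure_ratio: float = 0.2):
        # Keep only the most recent `window` probe results (True = success).
        self.results = deque(maxlen=window)
        self.max_failure_ratio = max_failure_ratio

    def record(self, ok: bool) -> None:
        """Store the outcome of one probe, e.g. one test request."""
        self.results.append(ok)

    def failure_ratio(self) -> float:
        """Fraction of failed probes in the window (0.0 if no probes yet)."""
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def healthy(self) -> bool:
        return self.failure_ratio() <= self.max_failure_ratio


def choose_active(primary: WindowedHealthCheck) -> str:
    """Assumed policy: stay on the primary unless its check says unhealthy."""
    return "primary" if primary.healthy() else "standby"
```

With this policy, a primary that resets every other connection has a failure ratio of 0.5 and `choose_active` returns `"standby"`, whereas a check that only asks "did the unit respond at least once?" would keep it in service.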

Link to comment

Hey Justin,

 

The site has been nearly unusable for the last 36+ hours. How come status.geocaching.com shows 5 minutes of downtime during that whole time? I don't know how Pingdom works, but it seems like they must be pinging things (or however they do it) in an odd way if the community is seeing no ability to use the site but they are saying it's speeding along.

I would not call the site completely unusable. Once I was able to get past the connection reset problem when I initially logged onto the site last night, I was able to visit any page I liked without any issues.

Link to comment
I would not call the site completely unusable. Once I was able to get past the connection reset problem when I initially logged onto the site last night, I was able to visit any page I liked without any issues.

Lucky you. For me, every click was a Connection Reset requiring 3-5 reloads to get anything. I finally gave up and it was even worse this AM. Much better now, although a bit slower than usual.

Link to comment

Hey Justin,

 

The site has been nearly unusable for the last 36+ hours. How come status.geocaching.com shows 5 minutes of downtime during that whole time? I don't know how Pingdom works, but it seems like they must be pinging things (or however they do it) in an odd way if the community is seeing no ability to use the site but they are saying it's speeding along.

 

I'm not sure how Pingdom does it, but a typical ping utility will send a number of pings in succession and then average the response times (the Windows ping utility defaults to 4). If the utility gets a response from any of the pings, the host has responded, and that is probably what is being reported. The downtime that was actually reported this morning is obviously when they reset things.
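
The "any response counts as up" behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea in this post, not how Pingdom actually measures uptime; the function name and the convention of using `None` for a lost probe are assumptions made for the example.

```python
from typing import Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarize one batch of probes.

    Each entry is a round-trip time in milliseconds, or None if that
    probe got no reply (hypothetical convention for this sketch).
    """
    answered = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    return {
        # "Up" if at least one probe was answered.
        "up": bool(answered),
        # Average only over the probes that came back.
        "avg_ms": sum(answered) / len(answered) if answered else None,
        # Share of probes that were lost, as a percentage.
        "loss_pct": 100.0 * (sent - len(answered)) / sent if sent else 0.0,
    }
```

Calling `summarize_probes([None, None, 42.0, None])` reports the host as up even though three of the four probes were lost, which is the kind of gap between a green status bar and what users were seeing. Looking at `loss_pct` as well would make the degraded state visible.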

Link to comment
I would not call the site completely unusable. Once I was able to get past the connection reset problem when I initially logged onto the site last night, I was able to visit any page I liked without any issues.

Lucky you. For me, every click was a Connection Reset requiring 3-5 reloads to get anything. I finally gave up and it was even worse this AM. Much better now, although a bit slower than usual.

 

You had to click your mouse five times so that made the site completely unusable? Yes it is frustrating, so much so that you gave up, but to say that it was completely unusable is an extreme exaggeration. The site was perfectly usable for those that had a bit of patience.

Link to comment

I too noticed that the status page was not representing the situation correctly for much longer than I would have thought normal. I wonder if the IT folks at Groundspeak might be helped with a bit more quantitative information about the specific incident this morning:

 

I was not able to access the API, nor view any cachepages, for 45 to 60 minutes after Pingdom began reporting the site to be healthy again.

 

I have marked the range on the timeline that shows green in which I could not get any geocaching.com web page to show nor API responses:

[Image: p7j.gif]

 

In the time from the blue box to the next hashmark (~9 am) I was occasionally able to get the API to respond to a call without timing out.

 

After that timeframe the API began to respond but it was bogged down. I imagined 10000 geocachers suddenly realizing it was back and all hitting the service at once.

 

I would note that this screenshot was taken just now and the green bar is green all the way to now and at this moment all the services that I typically use appear healthy. Yet the icon for Aug 10 (today) is red on this page as well as on the summary page. It would seem to me that a "status" page would highlight the instantaneous condition rather than an incident since midnight.

 

So I see two issues:

1. I thought the page was a status page, and it is not for the purposes of this geocacher wanting to go caching. Now for the rest of the day the red icon will suggest to geocachers that the site is failing when it is actually working fine.

2. This tool seems to be useful only to diagnose if the problem was so severe that the IT hardware was not working; it really cannot be trusted with the present condition at even an hourly resolution.

 

I do join in saying thanks to those who got up early this morning to resolve this issue.

Edited by Hynr
Link to comment

:anicute: Thanks for not being too hard on me... I hadn't had my coffee yet.

This I understood.

 

We traced the issues to one of our F5 BigIP LTMs. As you likely noticed, it wasn't routing traffic properly and was resulting in connection resets and the 502 Bad Gateway errors. Unfortunately, the health checks on the system didn't recognize the sub-optimal state of the primary unit and did not automatically failover to the standby unit.

 

After manually failing over to the standby unit, the problems appear to be resolved. We're pulling logs and will be submitting them to F5 support for an investigation to determine why it was failing and what can be done to prevent it in the future.

 

We're keeping a close eye, and hope we haven't impacted your ability to log #10in31.

Would you translate the bolded part from "techese" into English please.

For instance... "The software messed up on the servers and we had to do a manual failover to the secondary servers. We are having the server provider look at it." If that is what happened. :)

Link to comment

:anicute: Thanks for not being too hard on me... I hadn't had my coffee yet.

This I understood.

 

We traced the issues to one of our F5 BigIP LTMs. As you likely noticed, it wasn't routing traffic properly and was resulting in connection resets and the 502 Bad Gateway errors. Unfortunately, the health checks on the system didn't recognize the sub-optimal state of the primary unit and did not automatically failover to the standby unit.

 

After manually failing over to the standby unit, the problems appear to be resolved. We're pulling logs and will be submitting them to F5 support for an investigation to determine why it was failing and what can be done to prevent it in the future.

 

We're keeping a close eye, and hope we haven't impacted your ability to log #10in31.

Would you translate the bolded part from "techese" into English please.

For instance... "The software messed up on the servers and we had to do a manual failover to the secondary servers. We are having the server provider look at it." If that is what happened. :)

One of the load balancers that distributes web site traffic amongst the web servers to handle lots of site visitors was malfunctioning.

 

Or to make it simpler, piece of equipment to make internet page function go boom... :lol:

Edited by Dgwphotos
Link to comment
We traced the issues to one of our F5 BigIP LTMs. As you likely noticed, it wasn't routing traffic properly and was resulting in connection resets and the 502 Bad Gateway errors. Unfortunately, the health checks on the system didn't recognize the sub-optimal state of the primary unit and did not automatically failover to the standby unit.

 

After manually failing over to the standby unit, the problems appear to be resolved. We're pulling logs and will be submitting them to F5 support for an investigation to determine why it was failing and what can be done to prevent it in the future.

 

We're keeping a close eye, and hope we haven't impacted your ability to log #10in31.

I have no idea what you said :laughing: , but appreciate that it's been fixed.

Thanks !

Link to comment

I've reported the issue to our provider and have started waking people up. This isn't something we've seen before.

 

Does anyone know if this started before the thread was created early yesterday morning?

 

No, the problem started after this thread was posted. The OP looked into their crystal ball and knew there was going to be a problem. :blink:

 

Thanks for at least acknowledging someone is paying attention. This has been going on for closer to 36 hours now and this is the first response from Groundspeak that I have seen. Hope it gets fixed soon. Server overload maybe?

 

:anicute: Thanks for not being too hard on me... I hadn't had my coffee yet.

 

We traced the issues to one of our F5 BigIP LTMs. As you likely noticed, it wasn't routing traffic properly and was resulting in connection resets and the 502 Bad Gateway errors. Unfortunately, the health checks on the system didn't recognize the sub-optimal state of the primary unit and did not automatically failover to the standby unit.

 

After manually failing over to the standby unit, the problems appear to be resolved. We're pulling logs and will be submitting them to F5 support for an investigation to determine why it was failing and what can be done to prevent it in the future.

 

We're keeping a close eye, and hope we haven't impacted your ability to log #10in31.

 

I was going to suggest a DOS attack by Ashnikes. Errrr, oops, I mean Ashnike's roommate who is trying to frame him. :laughing:

 

Sorry, just being goofy. Thanks for the quick weekend response, and fix of the issue.

Link to comment

At this moment (3 hours later) I am having no trouble with GSAK using the API.

Posted too soon. Upon inspection the results are basically showing

<?xml version="1.0" encoding="utf-8"?>
<GetGeocacheDataResponse xmlns="http://www.geocaching.com/Geocaching.Live/data" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Status>
 <StatusCode>1</StatusCode>
 <StatusMessage>Fail</StatusMessage>
 <ExceptionDetails/>
 <Warnings/>
</Status>
etc

So, it seems that the API is responding but not delivering anything useful.

I am also noting further down in the response that the API appears to have no clue as to what my limits are:

<a:CachesLeft>2147483647</a:CachesLeft>
<a:CurrentCacheCount>2147483647</a:CurrentCacheCount>
<a:MaxCacheCount>2147483647</a:MaxCacheCount></CacheLimits>
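
For anyone who wants to catch this in a script, here is a minimal Python sketch that reads the status fields and flags limit values equal to 2147483647, which is the largest value of a signed 32-bit integer. Reading that value as "limit not filled in" is only a guess. The element nesting in `SAMPLE`, the placeholder URI bound to the `a:` prefix, and the function names are assumptions for illustration; the excerpt above does not show the real namespace or the full response.

```python
import xml.etree.ElementTree as ET

INT32_MAX = 2**31 - 1  # 2147483647, the value seen in the limit fields

# Hypothetical sample assembled from the fragments above. The URI for the
# "a" prefix is a placeholder; the real one is not shown in the excerpt.
SAMPLE = """<GetGeocacheDataResponse
    xmlns="http://www.geocaching.com/Geocaching.Live/data"
    xmlns:a="urn:example:placeholder">
<Status>
 <StatusCode>1</StatusCode>
 <StatusMessage>Fail</StatusMessage>
</Status>
<CacheLimits>
 <a:CachesLeft>2147483647</a:CachesLeft>
 <a:CurrentCacheCount>2147483647</a:CurrentCacheCount>
 <a:MaxCacheCount>2147483647</a:MaxCacheCount>
</CacheLimits>
</GetGeocacheDataResponse>"""

LIMIT_FIELDS = ("CachesLeft", "CurrentCacheCount", "MaxCacheCount")


def local_name(tag: str) -> str:
    """Drop the '{namespace}' part that ElementTree prepends to tag names."""
    return tag.rsplit("}", 1)[-1]


def inspect_response(xml_text: str) -> dict:
    """Collect the status fields and list limit fields set to INT32_MAX."""
    root = ET.fromstring(xml_text)
    fields = {}
    for element in root.iter():
        # Keep the first occurrence of each element name, ignoring namespaces.
        fields.setdefault(local_name(element.tag), (element.text or "").strip())
    suspect = [name for name in LIMIT_FIELDS if fields.get(name) == str(INT32_MAX)]
    return {
        "status_code": fields.get("StatusCode"),
        "status_message": fields.get("StatusMessage"),
        "suspect_limits": suspect,
    }
```

Running `inspect_response(SAMPLE)` returns the status code and message as strings plus the names of any limit fields holding 2147483647. Matching on local element names keeps the sketch independent of the namespace URIs.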

Edited by Hynr
Link to comment

The Geocaching website hasn't been working efficiently for the past few days. It seems that this just started after they updated the site. Servers won't connect and you keep getting error messages when you try going to a different page or back. I'm sure the administration is aware of this problem and I sure hope it gets corrected soon. It takes all day to log caches when you've done a bunch.

Link to comment

I like geocaching, your web site and what Groundspeak generally does to support us very much. But in the last days and weeks it really seems to me that your web servers are not supported by hamsters as you stated but by slugs. Response times are in a range up to 30 seconds, independent of the device, operating system, browser or internet connection. I hope you will recover soon. My best wishes and kind regards Gerhard_S

Link to comment

Posted too soon. Upon inspection the results are basically showing

<?xml version="1.0" encoding="utf-8"?>
<GetGeocacheDataResponse xmlns="http://www.geocaching.com/Geocaching.Live/data" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Status>
 <StatusCode>1</StatusCode>
 <StatusMessage>Fail</StatusMessage>
 <ExceptionDetails/>
 <Warnings/>
</Status>
etc

So, it seems that the API is responding but not delivering anything useful.

I am also noting further down in the response that the API appears to have no clue as to what my limits are:

<a:CachesLeft>2147483647</a:CachesLeft>
<a:CurrentCacheCount>2147483647</a:CurrentCacheCount>
<a:MaxCacheCount>2147483647</a:MaxCacheCount></CacheLimits>

 

I have the same problem. I can't change my IP address. What can I do?

Link to comment

I've noticed that the site seems to have trouble maintaining lots of connections at one time. For example, if I'm looking at a number of different caches at the same time in browser tabs, the site will sometimes take quite a while to connect to a cache page after I open it, but others will open almost immediately. I suspect the load balancer is acting up again.

Edited by Dgwphotos
Link to comment

I've noticed that the site seems to have trouble maintaining lots of connections at one time. For example, if I'm looking at a number of different caches at the same time in browser tabs, the site will sometimes take quite a while to connect to a cache page after I open it, but others will open almost immediately. I suspect the load balancer is acting up again.

 

I reported this issue a few weeks ago and still have... no response from GS until now...

 

http://forums.Ground...dpost&p=5337344

 

It seems this issue is back again.

 

When I click on a geocaching link I get a brown background and have to wait...

 

fyi: other websites are fast. Only access to geocaching.com is slow.

 

Nobody else has "loading" problems?

 

Today, I opened a new IE session to www.geocaching.com -> the page hangs after loading 60% of progress.

 

I opened a new session in the same IE window to www.geocaching.com -> loading page was as quick as 'greased lightning'...

Edited by DanPan
Link to comment
