Server "overload" Conclusions


Jeremy

Recommended Posts

We're still investigating the current rash of "Server Too Busy" Errors, but our initial discoveries are as follows:

 

- We're having the most simultaneous traffic we've ever experienced on the site.

 

- So many requests are getting in line that the server is throwing out the newest ones until it can process the ones it already has. That is the main reason you will get a "Server Too Busy" error.

 

- There are settings which can increase the queue so more folks can get in line.

 

The server, during these "Server Too Busy" times, is artificially rejecting requests it could actually process, because the default settings keep the queue threshold low. We have doubled this setting, so we should be able to queue more requests and reduce the number of errors.

 

This is a unique process of tweaking the site so it runs smoothly and keeps you from getting frustrated with slow site responses.

Link to comment

Not sure if this is related but for the last 10 minutes I have been getting "Service Unavailable" errors after getting numerous server overload errors.

 

Hope you can straighten out the situation, and get back to some serious caching instead of chasing phantom server errors. :ph34r:

 

Happy caching!

 

Logger

Link to comment

I'm surprised when I read 'server' and not 'servers' with an 's'. If it's a single server, perhaps it's time to invest in some infrastructure. I'd be curious to know the setup gc is running. All I know is that a 1-minute page load (and probably longer than that) is frustrating, especially when you get a deadlock message after waiting that long for the load.

Link to comment
Thanks for the update, Jeremy...with my dinosaur dial-up, I never know whether the problems come from the server, the computer, or (most likely) the user-to-keyboard interface.....

Isn't it hard to type with socks on your hands? :ph34r:

 

BTW, I've only had a few of those "Server Errors" everyone is having. Is that due to non-sock-puppet status or something? :ph34r:

Link to comment

 

BTW, I've only had a few of those "Server Errors" everyone is having. Is that due to non-sock-puppet status or something? :ph34r:

No sock puppet here.

 

I've had a number of these in the last few days. I've attributed this to the beautiful weather in the Midwest and the number of people who need to print cache information or log finds.

 

After a few of them, I just do something else...

Link to comment

Note also that server failure is one of those fast-growing curves. As traffic goes up, the processes handling the requests bump into each other *much* more than when traffic is lower. Even assuming the slope of the traffic line is constant, at a certain point the failure rate just heads for the sky. IMO, adding to the size of the 'wait queue' is just going to unmask problems elsewhere with (as I've seen) memory consumption and resource locking.

Link to comment

Yes, we do believe that locking is a major contributor to these issues. A balance of indexes in the database needs to be created. As part of moving the geocaching.com site to the new codebase, most queries are now stored procedures, but there are still issues with efficiency.

Link to comment

SQL deadlock issues most often result from database design issues and/or inefficiencies in the SQL code that's being executed. Deadlocks can sometimes be temporarily side-stepped by increasing timeouts on SQL code execution until the real problems are uncovered and addressed. Of course, you could then run into the problem of ASP.NET killing any threads that don't respond to a request in time. Oh the agony! Your SQL code might also be locking records when it doesn't really need to, which can itself set up a resource deadlock.

 

I suggest getting out the SQL Profiler and finding out where and when the deadlocks are occurring, as well as the efficiency of the code being executed. No easy feat sometimes, though!
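
If Profiler isn't running when a deadlock hits, SQL Server can also be told to write the deadlock detail to its error log via trace flags. A rough sketch (these are the trace flag numbers as I recall them from the SQL Server 2000 docs, so double-check before relying on them):

-- 1204 dumps the deadlock chains and the chosen victim, 3605 routes that output
-- to the error log, and -1 applies the flags to all connections.
DBCC TRACEON (1204, 3605, -1)

-- Turn them back off once you've captured what you need.
DBCC TRACEOFF (1204, 3605, -1)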

 

Disclaimer: I'm not a professional DBA, but I am a professional jack-of-all-trades-master-of-none software engineer.

Edited by Ferreter5
Link to comment

Well, I'm sure you're too busy to discuss architecture much, but here's hoping your new codebase runs against a separate database server. With a data-intensive app like this, that is where you need to throw the hardware. The webby-bit is almost secondary.

 

If I could even pretend to be a DBA and not just a run-of-the-mill programmer, and didn't have a son here in the Bay Area, I would be camped in your server room tomorrow. You need triage, then you need to assign a person to profiling. The Linux-based SQL databases will explain what they're up to, to some degree; I can't believe MS SQL won't do the same somehow. You really need to keep an eye on all your searches - when schemas change, queries can unexpectedly degrade into linear scans. Profiling has to be part of change management in a data-centric environment.
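
For instance, the showplan output will at least tell you when a lookup has quietly degraded into a table scan. Something like this is standard T-SQL (the table and column names here are just placeholders, not the real schema):

-- Ask SQL Server to print the plan it would choose, without actually running the statement.
set showplan_text on
go
select log_id from logs where cache_id = 12345
go
set showplan_text off
go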

 

I've seen the movement towards GUIDs. Have you implemented a GUID translation table? Drop some of the WP/ID indexes and look them up using a two-step process on a master GUID table. Too many indexes on your main tables could be killing you.

Link to comment

Could this be a disk full problem at its root? I hope not, but it could certainly be the case.

 

The site seems to be running faster, but that may just be because it is now 7:30pm PST (10:30pm EST). We've done a lot of investigation and I have some solutions I will implement tomorrow and Tuesday.

 

One big hangup is the way I calculate finds on the cache logs. The current way is inefficient, so I'm implementing some changes that shouldn't be obvious on the page. I also made some database query changes to speed up some of the more common queries.
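
For the curious, the expensive pattern is roughly recounting a logger's total finds for every log shown on a cache page. A simplified illustration of that shape (not the literal query, and the type id is just an example):

-- Simplified illustration only - not the production query. A correlated count like
-- this runs once per log row, which adds up fast on a busy cache page.
-- (log_type_id = 2 is an assumed id for a 'Found it' log.)
select l.log_id, l.user_id,
    (select count(*) from logs f
     where f.user_id = l.user_id and f.log_type_id = 2) as total_finds
from logs l
where l.cache_id = 12345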

 

I'll continue to monitor it tomorrow, and make changes if I can.

Link to comment
Could this be a disk full problem at its root? I hope not, but it could certainly be the case.

I assume you're running a full version of SQL Server, but I suppose I should ask ... you're not running MSDE are you? Please say, "no". MSDE has a 2GB limit on the database size.

 

Also, as your experiences have probably already told you, SQL Server will never, ever give up memory it has gotten its hands on. It operates under the "If I needed it once, I'll most likely need it again" philosophy of memory hoarding. It assumes it's the biggest fish in the pond and should be allowed to use as much memory as it needs.
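
If SQL Server and IIS are sharing a box, you can at least put a ceiling on it. A minimal sketch, using the standard sp_configure option (the 1024 is just an example value):

-- Cap the buffer pool so the web server keeps some headroom.
exec sp_configure 'show advanced options', 1
reconfigure
exec sp_configure 'max server memory (MB)', 1024
reconfigure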

Edited by Ferreter5
Link to comment
Could this be a disk full problem at its root? I hope not, but it could certainly be the case.

If your database is on raw partitions, it's unlikely. If it's on filesystem partitions, it certainly can't help, as most filesystems show exponentially poorer performance as they approach "full". No serious filesystem is optimized for the "find the last 5K that's free on a 500G JBOD array" case.

One big hangup is the way I calculate finds on the cache logs. The current way is inefficient, so I'm implementing some changes that shouldn't be obvious on the page. I also made some database query changes to speed up some of the more common queries.

Due to the geek appeal of this hobby, you have an unbelievable number of programmer types that are willing to help with such things. (They do range from script-kiddie-programmer-wanna-bes to actual professionals; I'm not claiming that everyone that's willing to help actually has pertinent skills.) If you want actual help instead of mass sympathy, post the code that you'd like help analyzing and get additional eyeballs on it.

 

(Personally, I am not a SQL jock, but I do design enterprise OSes by profession....)

Edited by robertlipe
Link to comment

These current issues are not with the database server, but with the web server. However, the web server can sit waiting for SQL queries in such a way that it can block other requests, so if the SQL queries aren't running fast enough, a bottleneck occurs.

 

I'll try clearing up the user caches found issue and see if that helps the queries.

Link to comment
One big hangup is the way I calculate finds on the cache logs. The current way is inefficient, so I'm implementing some changes that shouldn't be obvious on the page. I also made some database query changes to speed up some of the more common queries.

 

I'll continue to monitor it tomorrow, and make changes if I can.

In all honesty, I don't pay much attention to the find number on cache logs. I find that it's usually not accurate: if more caches were found that day, it is not the correct number of finds that user has, and it is not always the number of finds they had at that point either. So it seems like a pretty useless number to me. Personally I'd just remove it, at least until you have something more efficient to figure it out.

 

My $.02.

 

--RuffRidr

Link to comment

Poor Jeremy.

 

Everyone in the US simultaneously said "What great weather! Let's go Geocaching!" and Jeremy gets to sit at his computer improving SQL Server performance so everyone doesn't flame him for having a %@$ site.

 

All hail Jeremy the Geocaching martyr. He improves SQL query performance for our Geocaching. We should celebrate his birthday by giving each other gifts of dollar store trinkets, placed under trees, with an easy find to log.

 

Of course that also makes Jeremy's situation worse. Oh such a vicious cycle of life. :ph34r:

Edited by woo2
Link to comment
I removed it around this time last year, and caught holy hell for it. So I'd rather leave it there.

You're right, you did. I can't blame you for hesitating in removing it again.

 

However, like RuffRidr, I only use the find count as a general indication of a person's finds, because its accuracy is highly suspect due to the nature of the number.

 

As he suggested, and as has been suggested before, I'd like to recommend that the find count numbers associated with cache logs be the number of finds that person had at the time of that find. I.e., my first find would have a (1) after it and my 100th find would have a (100) after it. In my layman's mind, this would also reduce server load, as the number would be totally static.

 

I realize there might be people who object to this idea, and I realize this post is slightly off-topic, but when the time comes, a separate topic could be started to gauge opinions.

 

Jamie

Link to comment

Here is some free advice, worthy of the price:

 

Being a SQL Server DBA of sorts, I am amazed at how well an RDBMS can work when tuned properly and at the same time, allow a single database query to bring a server to its knees.

 

Whenever I encounter performance problems and I don't have the luxury of throwing more hardware at it, I focus on poor relational table design, proper indexing, identifying unnecessary locking situations, and code-reviewing stored procedures for inefficient data processing and bad use of transactions.

 

Not knowing your schema or how well you code your procs, I can still suggest looking at one area, if you haven't already. I have been chosen as a 'deadlock' victim in cases where the gc.com server was in heavy demand and all I wanted to do was browse a cache page. If all I truly am doing is SELECTing data, there is no need for my data request to lock a table, data page, or row. Since data at gc.com changes so often, 'dirty reads' should be allowed. Consider lowering the isolation level or adding the WITH (NOLOCK) hint to processes that only SELECT data (as long as it is OK that the data might become stale a second later). This would reduce contention by preventing locking that isn't necessary.
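
To illustrate (table and column names invented here, just to show where the hint goes):

-- Inside a proc that takes @cache_id: read without taking shared locks.
-- Rows another transaction is mid-update on may come back 'dirty', which is
-- acceptable for something like a cache page.
select log_id, log_text, log_date
from logs with (nolock)
where cache_id = @cache_id

-- Or set it once for a read-only batch:
set transaction isolation level read uncommitted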

 

You still have to identify the SQL that is causing the deadlock, so use SQL Server Profiler, or the raw output of the sp_who and sp_lock stored procedures that SQL Server gives you (I have my own custom mods to these procs that serve me better).
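
For anyone following along, the stock versions are as simple as:

-- Current sessions (the blk column shows who is blocking whom), then the locks held.
exec sp_who
exec sp_lock

-- sp_lock also takes a spid if you want to zero in on one connection:
exec sp_lock 53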

Edited by konopapw
Link to comment
If anyone is an accomplished SQL Server DBA, I'd love to speak with them. Send resumes to jobs@Groundspeak.com

I was going to ask about the pay scale for an SOB with a BS degree. After reading these posts, I've concluded that I'm totally out of this league. Reading Konopapw's post confirmed this. I will now slink off to peruse the usual 'rants'. :ph34r:

Link to comment
These current issues are not with the database server, but with the web server. However, the web server can sit waiting for SQL queries in such a way that it can block other requests, so if the SQL queries aren't running fast enough, a bottleneck occurs.

 

I'll try clearing up the user caches found issue and see if that helps the queries.

Are you guys running queries or Stored Procedures? From a Java/WebSphere perspective, I've noticed that Stored Procedures seem to run a lot faster. The faster you can complete the transaction, the better.
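
Even something as basic as wrapping the common lookups in a proc helps, since the plan is compiled once and reused. A quick sketch (names made up for the example):

create procedure GetCacheLogCount
    @cache_id int
as
    select count(log_id) as logs_count
    from logs
    where cache_id = @cache_id
go

-- Called from the page instead of sending the raw SELECT each time:
exec GetCacheLogCount @cache_id = 12345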

Link to comment
Consider lowering the isolation level or adding the WITH (NOLOCK) hint to processes that only SELECT data (as long as it is OK that the data might become stale a second later).

I've been hesitant to use WITH (NOLOCK), as every DBA in the forums has warned against dirty reads. Considering that the data I care about isn't like an ATM transaction, I don't think it's a big deal either. I'll investigate that angle. Thanks!

 

The other issue is a poor query I've been using for a while. I think making some changes to the stored procedures may clear up the locking issue, but I need to add some triggers first.

 

Thanks :ph34r:

Link to comment

Jeremy,

 

I tried emailing jobs@Groundspeak.com and both contact@ addresses last night. All three addresses bounced back as undeliverable. I would like to talk with you regarding your SQL issues (I have experience in this area that may be helpful). Feel free to contact me at my account email address at any time.

Edited by M-Class
Link to comment
I removed it around this time last year, and caught holy hell for it. So I'd rather leave it there.

You're right, you did. I can't blame you for hesitating in removing it again.

 

However, like RuffRidr, I only use the find count as a general indication of a person's finds, because its accuracy is highly suspect due to the nature of the number.

 

As he suggested, and as has been suggested before, I'd like to recommend that the find count numbers associated with cache logs be the number of finds that person had at the time of that find. I.e., my first find would have a (1) after it and my 100th find would have a (100) after it. In my layman's mind, this would also reduce server load, as the number would be totally static.

 

I realize there might be people who object to this idea, and I realize this post is slightly off-topic, but when the time comes, a separate topic could be started to gauge opinions.

 

Jamie

You know, I like that idea. It would relieve server load and give some good information at the same time. Just change the old title from "DirtRunner (103 found)" to something like "DirtRunner (Find: xx)". I know how many finds I have, and it would be good to have something like this to let me know roughly when it was I originally found it.

 

Looking back I could see that Cache X was my first find (awwwwww, isn't that cute?) and my 100th find was Cache Y (hoorah!).

 

Just a thought

DirtRunner

Link to comment
Jeremy,

 

I tried emailing jobs@Groundspeak.com and both contact@ addresses last night.  All three addresses bounced back as undeliverable.  I would like to talk with you regarding your SQL issues (I have experience in this area that may be helpful).  Feel free to contact me at my account email address at any time.

Yeah, they had to delete those mailboxes to make more disk space... :)

 

--Marky

Link to comment

As an aside, I think it's interesting that you chose to display the *current* number of finds for a cacher next to any find that displays. It would have been perfectly valid to structure it so that the 'find number' that displays is that cacher's find number at the time the find was made, so that on the first find I ever made, the 'number of finds' next to my ID would be 1. In a way, it would have been *more* informative. Doing so would have moved that data into the cache data and saved you much woe. Of course that's water over the dam now, but...

Link to comment

I can't add much to what has been said by others who know their way around SQL here. But one thing I have found is to go over the code with a very sharp eye and make sure you are closing all your connections when you are done using them. I try to not only call connection.Close but also set connection = Nothing if I don't need to use the object again. It is a tough balance, because opening a connection is expensive, but if you are done with it, make sure it gets closed.

 

But it is hard to diagnose without seeing code, stored procedures, indexing, schema, key relations, etc. So yes, kick open the Profiler and see what is happening there; it is a great tool that has saved me many hours.

Link to comment

Showing the number of logs at the time the user found it, if anything, compounds the problem. It also confuses just about everyone that has seen it the current way for the last 3 years.

 

What if I, say, find caches x, y, and z but log them in the order z, y, x? Now the count shown next to a log isn't the count I actually had at that find, and I have to explain why they are in the wrong order. And if an older log gets removed, I then need to renumber anywhere from 1 to infinity logs to reorder them.

 

My personal feeling is that since points aren't fair, we should just have a system similar to the forums: if you have found 1-10 caches you get one designation, and so forth. The main reason for having that info in the first place is to inform other cachers that the user who found it (or didn't find it) is a veteran, so there is a good chance it is there (or isn't).

 

Otherwise, the changes I'm making now will fix the current response time with building cache pages, so the point is moot. In the future we can get back to adding new features to the cache listings.

Link to comment
The main reason for having that info in the first place is to inform other cachers that the user who found it is a veteran, or didn't find it and is a veteran, so there is a good chance it is there (or isn't).

But my fourth find, even if I am considered a veteran, represents an inexperienced cacher. For that particular cache, my log does not have the weight that the log on cache #653 would have, using that reasoning.

Link to comment

Do you store the count total on the user profile and pull it from there, or do you query for all the caches the user has found at the time the cache page is generated and then just display it?

 

If it's not being pulled from the profile, that might be a more efficient way of arriving at that number. When a user enters a new log, update their stats and store them in the profile. Then you'd have an easier query on the stats page as well as when you generate the main cache page. My expertise in actual SQL and DBA stuff is limited, but this is one thing I haven't seen pondered; I may have missed it, though.
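
Something along these lines is what I mean - a trigger that bumps a stored count whenever a find log goes in. (Table and column names are guesses since I can't see the schema, and log_type_id = 2 is assumed to mean 'Found it'.)

-- Keep a running find count on the user record instead of recounting logs every time.
create trigger trg_logs_update_find_count on logs
after insert
as
    update u
    set find_count = u.find_count + i.new_finds
    from users u
    join (select user_id, count(*) as new_finds
          from inserted
          where log_type_id = 2
          group by user_id) i on i.user_id = u.user_id
go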

 

It would be equivalent to what you are suggesting about "ranks" for use on cache logs, but would still keep the number intact. I do like seeing a number if possible. And it's actually one step easier to store the number than to translate it to a rank.

Link to comment

Ok, I'll give some free unsolicited advice...

 

Jeremy, if you have been transitioning from raw TSQL to Stored Procedures over time, you may not be aware that the TSQL you write the stored procedure with _is not_ the code that gets executed once the stored procedure is compiled. SQL Server will "optimize" the TSQL into a form it thinks is more efficient when it compiles the stored procedure. Usually this works great, but sometimes it makes the wrong decision. I've particularly noticed that the compiler sometimes prefers an inner join rather than a subquery, even if the subquery is better suited for the particular application. You can test for this by generating an execution plan for the TSQL itself, then compiling it into a stored proc and re-running the execution plan. If you see a _huge_ difference, you can bet that the server is trying to outsmart you and is fouling things up. Although this isn't the specific problem, consider the following queries as a simple example:

 

These queries are an example of returning a number of logs for a cache. An assumption is made that you have certain types which will be included in the count (those that display on the web) and certain types that will not ("deleted" logs, approver messages, etc.)...

 

Query Method #1 (inner join):

 

select
    count(log_id) as logs_count
from
    logs, logtypes
where
    logs.cache_id = @cache_id and
    logtypes.log_type_id = logs.log_type_id and
    logtypes.logs_visible = 1

Query Method #2 (subquery):

select
    count(log_id) as logs_count
from
    logs
where
    logs.cache_id = @cache_id and
    logs.log_type_id in (
        select
            log_type_id
        from
            logtypes
        where
            logtypes.logs_visible = 1
    )

 

Both query methods assume the following indexes:

logs.cache_id, logs.log_type_id

logtypes.logs_visible

 

Both queries return the same results, but one is going to be faster than the other. For the count() aggregate, you want the database engine to do two simple index lookups on fixed-length fields in the index and do a row comparison (subtraction) to generate your aggregate. You do not want the database engine to generate a temp table in memory and then table-scan that table to generate your count. In some cases the subquery method is better for this, but once a stored proc is compiled it may inadvertently be using the join instead. I've on numerous occasions seen the stored procedure compiler foul this up, and it goes unnoticed until the query is put under serious load. There is a command that will force a stored procedure to execute according to your TSQL if you are having this problem:

 

option(force order)
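
For clarity, it just gets tacked onto the end of the SELECT inside the proc - e.g., with the first query above:

select count(log_id) as logs_count
from logs, logtypes
where logs.cache_id = @cache_id
  and logtypes.log_type_id = logs.log_type_id
  and logtypes.logs_visible = 1
option (force order)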

 

Without knowing your precise data structure, my example is very much oversimplified, and the real performance difference may not be as pronounced as with some more complicated queries (multiple nested subqueries combined with outer joins - I've written one of these for our club leaderboard).

 

I know you are not much of a fan of statistics, but I did write our local club statistics app (stats.mtgc.org). I do have an insight into some of the problems you may be facing with aggregates on a database your size. I also have some idea what structure you probably have under the hood (after all, I probably rewrote half of it for our local app). I've managed to get ours to do _very_ sophisticated queries that would make even the most experienced DBA's head spin. Furthermore, all of the queries execute in real time without the convenience of having cached pages to rely on. Granted we don't get nearly the traffic that geocaching.com does (from one developer to another, I am thankful for that). Even under the load of geocaching.com, it should be no problem to stream simple aggregates such as log counts in real-time, not to mention the ease of doing so behind cached pages.

 

For the record... before I get blamed for all of the traffic on the weekends, our data collector only views on the order of 175 pages per day on the weekends.

 

If I can help you in any way, let me know.

Link to comment
This topic is now closed to further replies.