Looking for strange (in your SQL Server) – Tibor Karaszi's SQL Server blog

Do you know if you have weird stuff going on in your SQL Server? Perhaps you walk through the errorlog every now and then? (Do you, really? You should!).

But there are lots of thing that doesn’t end up in the errorlog. For these kind of things, I created what I refer to as my KMDFM trace. This is from a tune of theirs, “Looking for strange“.

I decided to have this as an article, on my web-site. The reason is that I think if it as something that may evolve over time. Hopefully I’ll get feedback, with suggestions to refine the trace (adding filters, events, etc).

Needless to say, I’m using Extended Events. Instead of capturing each event, which can get pretty noisy, I use the event_counter target. This just count how many has happened.

You find the article here, with the details: https://karaszi.com/looking-for-strange

Here are some info from that article. In particular, I included below what events I added, and why:

General events

Attention. This fires when a client times out. The client might do a re-try, so then end-user isn’t even aware that this is happening, except that the performance sucks. The time-out period is specified at the client side, in the database API. “Indicates that a cancel operation, client-interrupt request, or broken client connection has occurred. Be aware that cancel operations can also occur as the result of implementing data access driver time-outs.“

Auto_stats. If you have lots of auto update statistics going on, then you might want to do something about that. The user submitting a query when auto-update occurs will notice a halt while the update happens (unless set to async, but then the query is ran with inferior statistics). Perhaps you want to run manual update more frequently? Or remove some table/index/stats from the auto-update handling? I had to work quite heavily with filtering for this to ger rid of noise. “Occurs when index and column statistics are automatically updated. This event can be generated multiple times per statistics collection when the update is asynchronous.“

Database_file_size_change. The change can be for several reasons. A grow or shrink can be automatic or manual. This should capture all. You should be aware of manual size changes, and they shouldn’t be frequent. But frequent autogrow you might want to handle by manually growing the file or changing the increment. And frequent autosrink is so bad and I doubt I have to explain any further on this. Hopefully you don’t have autosrink turned on for any of your databases. If this happens a lot,. consider adding a filter to only catch shrink operations and/or add a file target. “Occurs when any of the data or log files for a database changes size. Use this event to monitor changes to the size of database files.“

Database_started. You probably want to know if databases are created, attached frequently. Not to mention f you have autoclose turned on for any databases and that causes a database to be started. “Database started.“‘

Lock_timeout_greater_than_0. If a programmer specified to get a time-out while waiting for a log, then this can fire. Why not the evant simply named “lock_timeout”? For two reasons: One is that I noticed it fires more than once for each occurrence. And the other reason is that a developer might have deliberately set a timeout of 0, to immediately get an exception, and those you probably don’t want to catch. “Occurs when a request for a lock times out because of a blocking lock that is being held by a different resource. This event is identical to lock_timeout, but excludes cases in which the time-out value is 0, such as when lock probes are used. Time-outs with a duration of 0 do not necessarily indicate a blocking problem.“

Long_io_detected. You might have seem messages in the log that an I/O operation took longer than 15 seconds. This event fires when that happens. I didn’t include this in the performance section since 15 seconds is such a ridiculous long time for an I/O operation and something is clearly not normal if this event fires. “Long IO detected“

Lock_deadlock. Hopefully this is obvious. If you have lots of deadlocks, then you want to do something about it. Perhaps run a trace and catch the xml_deadlock_report event, and then pass this on to the developers. “Occurs when an attempt to acquire a lock is canceled for the victim of a deadlock.“

Lock_escalation. When SQL server escalates a log to a higher level (table), then it first acquired lots is fine-grained locks, then the use the sledge by acquiring a table level lock. You don’t want this to happen too frequently, adding indexes or other measures can reduce lock escalations. “Occurs when a fine-grain lock is converted into a coarser-grain lock, such as when a row lock is converted into an object lock. Use this event to monitor how locks are being used on the server.“

I also wanted to have the plan_affecting_convert event, to capture when some covers happens in a WHERE clause so that it makes it impossible to use an index seek. But SQL server seems to include anything with a conversion here, including conversion in the column list. This makes the even, IMO, useless.

Performance related events

Check the trace definitions for these. Some might happen a lot. If you aren’t interested in these, then just skip them. Perhaps you can’t do anything about it in the first place?