Management Pack for SQL Agent Alerts

I have finished my article about suggested SQL Server Agent Alerts. Perhaps calling this a “Management Pack” is over the top, but hopefully it will be useful to some of you folks.

We are talking about having Agent send email to us when SQL Server writes messages to the EventLog. The functionality has been in the product since version 6.0, but I have always lacked a good “default setting”: some easy way to tell Agent to notify me about the messages I typically want to be notified about. (Many of you already use various management software that has this functionality, so this won’t be for you 🙂 ).

I very much welcome feedback; you will find all the information in the article:

http://karaszi.com/agent-alerts-management-pack

 

sp_altermessage is back in business!

Just a quick note that we can again modify whether system messages are written to the eventlog/errorlog. That is, we can change the is_event_logged column in sys.messages. This is very valuable in general, and specifically if you want to define Agent alerts (for which Agent polls the EventLog). For instance:

SELECT * FROM sys.messages
WHERE message_id = 1205
AND language_id = 1033

Notice the value for the is_event_logged column. Now, run below:

EXEC sp_altermessage
 @message_id = 1205
,@parameter = 'WITH_LOG'
,@parameter_value = 'true'

Now, re-run the SELECT statement and see that you have modified the behavior for the system message. Then re-run sp_altermessage with 'false' to reset to the default.
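For completeness, the reset call is just the same procedure with 'false':

EXEC sp_altermessage
 @message_id = 1205
,@parameter = 'WITH_LOG'
,@parameter_value = 'false'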

The ability to modify this behavior for system messages was available prior to SQL Server 2005, but some re-architecting in 2005 removed the functionality. kozloski informed me in this blog post that 2005 SP3 re-introduced the functionality, and as of SP1 the functionality is back in 2008 as well.

 

Match those types!

This is a recommendation I believe is worth repeating from time to time: Make sure you match data types when you write TSQL code. Otherwise you will in most cases end up with an implicit data type conversion, and in the worst case this conversion is performed at the column side – not the literal side of your query. What does that mean? Consider below:

WHERE Col = Val

Now, say that the types for the above don’t match. Val might be a parameter (to a stored procedure, for instance), a variable or a written value (literal). In any case, when SQL Server needs to perform some operation (like the comparison here) involving values which aren’t of the same type, one of the values needs to be converted so it is of the same type as the other. Which one is determined by the rules for “Data Type Precedence“.

What we don’t want is a conversion at the column side. Why? I would argue that an implicit conversion in the first place in many cases means I have misunderstood something about the types in my system and am on my way to producing a bug in my code. But a stricter language is not the topic for this particular post (check out this). My point here is that it is bad for performance. Just yesterday and today I was involved in a thread on the MSDN forum. Here’s the repro script from that thread (slightly adjusted by me):

USE tempdb
GO
IF OBJECT_ID('t') IS NOT NULL DROP TABLE t
GO
CREATE TABLE t
(
 c1 int IDENTITY(10000001,1) NOT NULL
,c2 char(8) NULL
,c3 datetime NULL
)
GO
INSERT INTO t(c3)
SELECT TOP 3000000 '20080203'
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
UPDATE t SET c2 = CAST(c1 AS char(8))
CREATE UNIQUE CLUSTERED INDEX x ON t(c2)
GO

SELECT * FROM t WHERE c2 = N'10000009'
SELECT * FROM t WHERE c2 = N'10000008'

Don’t bother too much about the first section. The relevant part here is that we have a char(8) column with unique values (enforced by a unique index), on a table with three million rows.

Now, before we dig deeper: What is the problem with the above queries? The types do not match! The type in the table is char(8), but the literal is Unicode (the N before the string). Since nchar has higher precedence than char, the column side needs to be converted to the type of the literal side. Not good. Try the code on SQL Server 2000 and you will see a disaster. For me, STATISTICS IO gave 10755 logical reads for both queries, with a Clustered Index Scan (which is the same as a table scan). Hover over the Clustered Index Scan icon and you will find “WHERE:(Convert([t].[c2])=[@1])”. See how the conversion is done at the column side? This, my friend, means that SQL Server cannot seek the index to find the rows. Ouch! And that’s just because we were a bit sloppy… Compare to below:

SELECT * FROM t WHERE c2 = '10000009'
SELECT * FROM t WHERE c2 = '10000008'

You will find that for both of the above we get a Clustered Index Seek with only three logical reads. That is a good improvement.

OK, enough about SQL Server 2000, let’s discuss 2005/2008. At first I couldn’t see any difference between the two original queries (N’10000009′ and N’10000008′). Both seemed to be instant, and both had the same execution plan. I even saved the execution plans as XML and used WinDiff to compare the XML files. No difference. But the poster said the “9” query was slower than the “8” query. I executed to text and kept Ctrl-E pressed for a while. Now I could see a difference: the flicker was quicker for the “8” version than for the “9” version. I used Profiler and confirmed it. The “9” version consistently had a duration of about 90000 microseconds, where the “8” version had a duration of about 0. This was also reflected in CPU usage, as shown by Profiler. I hoped to use Extended Events to find something deeper about this, but that didn’t give me anything. Perhaps I simply didn’t capture the relevant events. X/Event experts are welcome to suggest event session configurations. 🙂

My guess is that the difference lies in the execution plan. You will see an index seek for both. That is good, since SQL Server nowadays can potentially use an index even if you end up with an implicit conversion at the column side. But you will also see a few more operators in addition to the index seek. See that Constant Scan followed by a Compute Scalar? My guess is that the built-in internal function used inside the Compute Scalar operator is more expensive for N’10000009′ than for N’10000008′. You can see this function, GetRangeThroughConvert(), if you use SET STATISTICS PROFILE ON, and my guess is that the answer is the cost of that function. (Search the Internet for things like “Constant Scan” and “GetRangeThroughConvert” for other blog posts on this topic.)
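If you want to look at this yourself, using the repro table from above:

SET STATISTICS PROFILE ON
SELECT * FROM t WHERE c2 = N'10000009'
SET STATISTICS PROFILE OFF

Check the output rows for the Constant Scan and Compute Scalar operators; that is where the GetRangeThroughConvert() call shows up.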

The moral of the story? Here we have a cost which can be relevant in some cases, but we will only see it if we happen to execute our query with the right values and are very, very observant. We can eliminate all this mess by using the right type in the first place!

Is it only me or are "uncomfortable" answers largely ignored?

I do quite a bit of posting on forums, and I have noticed that in many cases when you take the time to explain how something works, or why you should do something in a perhaps not so obvious way, that answer tends to be ignored and never replied to.

Instead the original poster replies to the short answer, which perhaps ignored things like best practices, performance, readability etc. Very often I then see an endless discussion based on that reply, where the original poster never gets a satisfactory solution, or perhaps even a working solution. And there sits the elaboration I posted days ago, which held the answer! I also explained why solution X doesn’t work, why solution Y only works in some circumstances, and so on.

In many cases, we have taken the time to write articles or blog posts where we have anticipated most misunderstandings and explained why you should do it this or that way, and what drawbacks workaround A has, and also what drawbacks workaround B has. So we post the URL to the article, only to find that it apparently never was read, because the discussion in the thread goes on for days and wastes time elaborating on things that could have been understood by spending some 10–20 minutes actually reading the article or blog post we pointed to.

I find this state mildly depressing. Have we reached the sad state where we don’t tend to be interested in why something works, how it works, or even understand what we are actually doing? Quick fixes, anyone? Only 10 cents each!

Is this something new? I’ve been on forums for > 10 years now, let me try to think….. I believe that tendency has always been there, but perhaps it has been getting worse with time. What are your experiences? Am I only having a bad day?

Do you have Instant File Initialization?

You might ask yourself: What is Instant File Initialization and why does it matter?

What Instant File Initialization is:
Instant File Initialization allows SQL Server to allocate storage (space from disks) very, very quickly. As you probably know, when you delete files they are not physically deleted from the disk – they are only marked as deleted in the file system allocation structures. This is why various undelete programs can recover deleted files. So imagine a process requiring disk space. The disk area given to the process by the file system and operating system could potentially contain secret information from deleted files. This is why the file system/OS insists on zeroing out the storage before the process can actually see it. (I’m not an OS/file system expert, so anyone is free to correct me here.) That is, unless the one allocating the storage has a special privilege.

When does the zeroing out of disk space occur?
Whenever SQL Server needs disk space, such as:

  • Create database
  • Add file to database
  • Restore (if the restore process includes database creation)
  • File growth (both manual and auto-grow)
  • Backup [edit 2011-08-26: Not sure how this got here, AFAIK shouldn’t be here]

Can we avoid the slow zeroing out of space?
Yes, but only if you are on SQL Server 2005 or higher, and only for some operations: creation and growth of database data files (mdf and ndf). The SQL Server service account needs to have the appropriate permission from the OS. To be more precise, it needs a privilege called SE_MANAGE_VOLUME_NAME. This is granted to Administrators by default. Do you run your SQL Server as an account that is a member of Administrators? I hope not. Did you grant this permission to the service account?

How do I grant this privilege?
This is easy. Add the SQL Server service account to the “Perform Volume Maintenance Tasks” security policy.

Does it matter?
You be the judge. Just to give you an idea, I created a database with data files of various sizes (I kept the log file at 1MB for all tests so it would influence the result as little as possible) and timed it both with and without Instant File Initialization. I ran it on my desktop machine, which has a RAID0 of two 7200RPM disks:

Size    Without IFI    With IFI
1 GB    10.3 s         0.3 s
10 GB   128 s          1.3 s
50 GB   663 s          4.5 s

The difference is roughly a factor of 100!

When does it hurt?
Basically every time disk storage is allocated. But let us focus on the occasions where you can do anything about it, i.e., where you can have Instant File Initialization. Such occasions include:

  • Whenever a database is created. Space needs to be allocated for the data file(s).
  • Whenever a data file is manually expanded in size.
  • Whenever auto-grow occurs for a data file. Note that potentially some poor user will now sit and wait for the auto-grow to complete.
  • When you start SQL Server. Why? Tempdb needs to be re-created.
  • When you perform a restore, if the destination database does not already exist with a matching database file structure.

How can I tell if I have Instant File Initialization?
I find it easiest to just create a database and time it, using some reasonable size for the data file, like 5GB. Actually, run two tests: one with a 5GB data file and a really small log file, and then another with a very small data file and a 5GB log file. Remember that we never get Instant File Initialization for log files. For instance, run the code below and you will see (adjust the file path for the database files). You may also need to adapt the database names and the datetime handling if you are on something lower than SQL Server 2008:

DECLARE @t time(3) = SYSDATETIME()
CREATE DATABASE IFI_test_ld
ON PRIMARY
(NAME = N'IFI_test', FILENAME = N'C:\IFI_test\IFI_test_ld.mdf', SIZE = 5GB, FILEGROWTH = 100MB)
LOG ON
(NAME = N'IFI_test_log', FILENAME = N'C:\IFI_test\IFI_test_ld.ldf', SIZE = 1MB, FILEGROWTH = 10MB)
SELECT DATEDIFF(ms, @t, CAST(SYSDATETIME() AS time(3))) AS LargeDataFile

SET @t = SYSDATETIME()
CREATE DATABASE IFI_test_ll
ON PRIMARY
(NAME = N'IFI_test', FILENAME = N'C:\IFI_test\IFI_test_ll.mdf', SIZE = 3MB, FILEGROWTH = 100MB)
LOG ON
(NAME = N'IFI_test_log', FILENAME = N'C:\IFI_test\IFI_test_ll.ldf', SIZE = 5GB, FILEGROWTH = 10MB)
SELECT DATEDIFF(ms, @t, CAST(SYSDATETIME() AS time(3))) AS LargeLogFile

Are the numbers for the above two about the same? If yes, then you don’t have Instant File Initialization. If the one with the large data file is much quicker, then you do have Instant File Initialization. And now you also have an idea of approximately how long it takes to allocate data and log file space on your SQL Server.

John Samson blogged about an alternative way to check, involving trace flags.
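If I remember those flags correctly (treat the flag numbers as my assumption and check John’s post), the idea is that trace flag 3004 makes SQL Server report file zeroing and 3605 sends that output to the errorlog. Something like below, as a sketch:

DBCC TRACEON(3004, 3605, -1)
CREATE DATABASE IFI_traceflag_test  -- small throw-away database
DBCC TRACEOFF(3004, 3605, -1)
EXEC sp_readerrorlog  -- undocumented procedure, reads the current errorlog
DROP DATABASE IFI_traceflag_test

If the errorlog only shows zeroing messages for the ldf file, you have Instant File Initialization; if the mdf file is also zeroed, you don’t.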

I’m curious: Did you have Instant File Initialization?

Where’s that sys.messages management pack?

I’ve often been thinking that “somebody” should produce a list, and perhaps some ready-made TSQL code, for the SQL Server error messages we want to be alerted for. Something like a Management Pack, but not tied to any particular systems management software. Essentially: which error messages would we typically want to be alerted for, plus some ready-made TSQL code which defines an alert (msdb.dbo.sp_add_alert, see also http://www.sqlskills.com/BLOGS/PAUL/post/Easy-monitoring-of-high-severity-errors-create-Agent-alerts.aspx) and operator for each of those.

Honestly, how many of you have gone through all the SQL Server messages and determined which of them you want to be alerted for? Sure, we can decide to specify alerts for all high-severity messages (like 19, 20, 21, 22, 23, 24 and 25). But unfortunately the severity classification isn’t as coherent as we would hope.

For instance, Paul Randal just recently blogged about error 825, which has a severity as low as 10! Do we want to be alerted for 825? You bet! Check out Paul’s blog at http://www.sqlskills.com/BLOGS/PAUL/post/A-little-known-sign-of-impending-doom-error-825.aspx.
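You can see this for yourself in the message metadata (1033 is us_english):

SELECT message_id, severity, is_event_logged, text
FROM sys.messages
WHERE message_id = 825
AND language_id = 1033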

We can, however, make it a little bit easier for ourselves. Since Agent detects the message by sniffing the eventlog, we can immediately discard all messages which aren’t written to the eventlog (there’s no use defining an alert for something the alerting mechanism never detects, right?):

SELECT *
FROM sys.messages
WHERE language_id = 1033
AND is_event_logged = 1
ORDER BY severity

Now we are down to 699 messages. It used to be that we could configure whether a message should be written to the eventlog (sp_altermessage), but this functionality disappeared in 2005. I haven’t checked Connect, but my guess is that if there is such a wish/entry in the first place, it doesn’t have too many votes.

My suggestion is that we create some small community project somewhere, where we add the messages we want to be alerted for. This can be as simple as a table with error numbers and some inserts, which we then use to generate our sp_add_alert commands. The question is where to host this so it can be maintained and refined over time. Or perhaps it has already been done by somebody…
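Just to give an idea of what the generated commands could look like, below is a sketch for error 825. The alert name, operator name and e-mail address are made up by me, and you need Database Mail and an Agent mail profile configured for the notification to actually be sent:

EXEC msdb.dbo.sp_add_operator
 @name = N'DBA Team'
,@enabled = 1
,@email_address = N'dba@example.com'

EXEC msdb.dbo.sp_add_alert
 @name = N'Error 825 - read retry'
,@message_id = 825
,@delay_between_responses = 60
,@include_event_description_in = 1  -- include the eventlog text in the e-mail

EXEC msdb.dbo.sp_add_notification
 @alert_name = N'Error 825 - read retry'
,@operator_name = N'DBA Team'
,@notification_method = 1  -- 1 = e-mail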

 

Troubleshooting login problems

I’ve been learning a few things about troubleshooting login problems over the last few months. I first want to say that I’ve learned most of this through other blogs, comments etc. At the end of this blog post you will find a list of URLs where you can look for more details.

It started a few months back when I read a blog post somewhere stating that we can use the state in a “login failed” error message to determine why the login failed. As you can imagine, there can be several reasons for a login failure (the user doesn’t exist, the password doesn’t match, the Windows login isn’t a trusted login, etc.). For some reason I didn’t save that URL, and of course I needed it some time later and couldn’t find it.

One place where you can find such a list of states is… Books Online. BOL 2008 documents every error number, so it is a matter of knowing what error number to search for: 18456. You can also search for “login failed” (pretty easy to remember) and the first hit is a different page (from above) but with similar information. I just tried a search in 2005 BOL for the same and had similar hits. This information might have been added to BOL 2005 in a more recent BOL 2005 release, though.

Now, don’t be surprised if the client app only receives state 1 for the login failed messages. This is a good thing, and on purpose. We don’t want the system to disclose too much information to a malicious user (or app) about why a login fails. So what we do is look in the errorlog for the “real” state (the EventLog doesn’t seem to carry this information, for some reason). Below is an example from SQL Server 2005 SP2, with timestamps removed:

Error: 18456, Severity: 14, State: 5.
Login failed for user ‘gurka’. [CLIENT: <local machine>]

So, state 5 tells us, according to SQL Server 2005 BOL, that “User ID is not valid.”.

Now, doing the same in SQL Server 2008 was interesting. Looking in the EventLog, we do see an explanation of why the login failed. We still don’t see the state in the EventLog, but since we have an explanation we might not need the state. And in the errorlog file, I had the message below:

Error: 18456, Severity: 14, State: 5.
Login failed for user ‘gurka’. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>]

So, not only do we have a state, we also have a description of why the login failed. Is it pathetic that such a thing can make you happy? Perhaps it is… 😉
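A small tip in this context: if you want to dig out these messages from the errorlog without opening the file, the sp_readerrorlog procedure accepts a search string. It is undocumented, so consider the parameter meanings below my assumption:

EXEC sp_readerrorlog 0, 1, '18456'
-- 0 = current errorlog, 1 = SQL Server log (as opposed to Agent log), '18456' = string to search for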

Here are some other blog posts on the subject:
http://blogs.msdn.com/sql_protocols/archive/2006/02/21/536201.aspx
http://jmkehayias.blogspot.com/2008/03/understanding-login-failed-error-18456.html

German translation of my updated datetime article

Frank Kalis has been kind enough to translate my datetime article (http://www.karaszi.com/SQLServer/info_datetime.asp) into German. I updated the article a while ago to reflect the new date and time related types in SQL Server 2008, and I just learned that Frank has now finished translating the updated article. Check it out at:

http://www.insidesql.org/beitraege/datetime/der-ultimative-guide-fuer-die-datetime-datentypen

What does RESTORE do?

I was replying to a newsgroup post today, explaining the restore process. I find some confusion in various places about what actually happens during a restore, and hopefully the text below can help a bit:

Fact: all backups contain log records. A log backup consists only of log records (more later on the bulk-logged recovery model). The different types of database backup contain the log records that were produced while the backup was running – in addition to data pages.
The restore command can perform several things:
  1. Delete the existing database. This happens if you specify the REPLACE option.
  2. Create the database. This happens if the database name you specify in the RESTORE command doesn’t exist.
  3. Copy data pages to the same position in each database file as they had when the backup was produced, and of course also copy log records to the ldf file. The source for these copy operations is the backup file that you restore from.
  4. REDO. Roll forward transactions based on the log records copied in the above step.
  5. UNDO. Roll back all open transactions. This step is not performed if you specify NORECOVERY (more later on the STANDBY option). The database is now accessible, and no further backups (diff or log) can be restored.
A couple of special cases:
If you are in the bulk-logged recovery model, then a log backup taken after minimally logged operations have been performed since the last log backup will also contain data pages (in addition to log records). This type of log backup cannot be performed if any of the data files are lost, and when you restore from it, you cannot do a point-in-time restore.
The STANDBY option of the RESTORE command does perform UNDO, but saves that work to a file you specify. This is so that the UNDO work can be undone when you restore a subsequent backup.
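To tie the steps above to actual commands, below is a minimal sketch (database name, file paths and backup files are made up by me) where the database is left inaccessible until the last command:

-- Steps 1-4: copy pages and log records and run REDO, but skip UNDO (NORECOVERY)
RESTORE DATABASE MyDb
FROM DISK = N'C:\backup\MyDb_full.bak'
WITH REPLACE, NORECOVERY

-- Restore a subsequent log backup, still without UNDO
RESTORE LOG MyDb
FROM DISK = N'C:\backup\MyDb_log1.trn'
WITH NORECOVERY

-- Step 5: UNDO open transactions and make the database accessible
RESTORE DATABASE MyDb WITH RECOVERY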
I think I managed to compress the topic pretty well, so unless you have worked a bit with backup and restore in SQL Server, you might want to read the above a couple of times. 🙂

Does the Resource Governor govern?

Two weeks ago, we did the “SQL 2008 Summit” roadshow here in Sweden: 4 cities around Sweden in 4 days (http://www.expertzone.se/sql2k8/). It was a bit exhausting, but even more fun – being able to travel and spend some time with people I wish I could meet more often (Kalen), others I meet regularly but only at work (Roger, Patrik, Anna), and yet others I had just met (George, Per).

One of my presentations was on Resource Governor (RG), and I had this super-simple demo meant to show CPU throttling. I classified connections into one of two Workload Groups based on login name. One group used a Resource Pool with max CPU at 10% and the other a Resource Pool with max CPU at 90%. Since I have two CPU cores, I started two execution loops for each login. An execution loop uses SQLCMD to log in with the appropriate login ID and execute a proc which loops and outputs a counter using RAISERROR with NOWAIT (so we see something happening in the console).
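For reference, the demo setup was along the lines of below. Pool, group and login names here are made-up stand-ins for what I used, and the classifier function has to live in the master database:

CREATE RESOURCE POOL PoolLow WITH (MAX_CPU_PERCENT = 10)
CREATE RESOURCE POOL PoolHigh WITH (MAX_CPU_PERCENT = 90)

CREATE WORKLOAD GROUP GroupLow USING PoolLow
CREATE WORKLOAD GROUP GroupHigh USING PoolHigh
GO
-- Classifier: route connections to a workload group based on login name
CREATE FUNCTION dbo.RgClassifier() RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
  RETURN CASE SUSER_SNAME()
           WHEN N'LowCpuLogin' THEN N'GroupLow'
           WHEN N'HighCpuLogin' THEN N'GroupHigh'
           ELSE N'default'
         END
END
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.RgClassifier)
ALTER RESOURCE GOVERNOR RECONFIGURE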

For two of my presentations it worked just as expected. For two presentations it didn’t: the CPU usage looked very, very strange – nowhere near what we expected. So, during the final day, I managed to spend some time with Mikael Colliander from MS Sweden. At first we couldn’t reproduce the strange behavior, but after several restarts, re-boots etc. we could. We finally got to look at which scheduler each connection was using, and there was the answer. One connection (ending up in the pool with max 10% CPU) was alone on one scheduler, meaning alone on one CPU! The other three connections (one in the max 10% pool and two in the max 90% pool) were using the other CPU. So for the CPU where we had only one connection (belonging to the pool capped at 10% CPU), there was simply no contention, and that connection could use everything on that CPU since nobody else was assigned to it.
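To see for yourself which scheduler (and in practice which CPU) each connection ends up on, a query like below does the trick, run while the load loops are executing:

SELECT r.session_id, s.login_name, r.scheduler_id, r.cpu_time
FROM sys.dm_exec_requests AS r
INNER JOIN sys.dm_exec_sessions AS s ON s.session_id = r.session_id
WHERE s.is_user_process = 1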

Now that I understand why this happened, it doesn’t look that strange. But I think we need to be very careful when we monitor resource usage for our connections while using Resource Governor. The more CPUs we have, the less chance that we will see the (perhaps expected) distribution of load.