The checkpoint process writes dirty pages from the buffer pool area of memory to physical disk. It does not remove the pages from the buffer pool; instead it marks them as clean again. Its purpose is to reduce the active portion of the log needed to perform recovery, so the frequency at which the checkpoint process runs determines how long it will take to recover the database when it is started up. This needs to be balanced against the overhead of running the checkpoint process. Like the lazy writer process, checkpoint runs as a background task. We can query the sysprocesses system table to view its status:
SELECT cmd, spid, status, dbid, login_time, waittime, lastwaittype, loginame,
       cpu, memusage, physical_io
FROM sysprocesses
WHERE cmd IN ('LAZY WRITER', 'CHECKPOINT')
By default SQL Server determines the interval at which the checkpoint process runs, which is usually about every minute. This behaviour can be changed by setting the recovery interval, an advanced configuration option.
Generally I don't change the default value of 0, which allows SQL Server to control the checkpoint process.
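If you do need to change it, the recovery interval is set through sp_configure; the value of 5 minutes below is just an illustration, not a recommendation:

```sql
-- recovery interval is an advanced option, so expose it first
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Target recovery time in minutes (0 = let SQL Server decide)
EXEC sp_configure 'recovery interval', 5;
RECONFIGURE;
```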
Checkpoint and the buffer pool
I want to see exactly what the checkpoint process does and how this impacts the buffer pool. I have set up a test database called DBA (DB_ID = 7) with a table called tblTest. This is a small table with a couple of rows, and I have already determined that it is using page 222. The first thing I want to do is clear the page from cache. To do this I am going to issue the CHECKPOINT statement to clean the page and then call DBCC DROPCLEANBUFFERS to remove all clean pages from the buffer pool. Since I have set the recovery interval to 1000 minutes, the automatic checkpoint process should not kick in. Just to be sure, I am also going to turn on trace flag 3502, which logs all checkpoint activity in the SQL Server log.
DBCC TRACEON (3502, -1);
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
I can verify that the page is not in the buffer pool by using the sys.dm_os_buffer_descriptors DMV.
select * from sys.dm_os_buffer_descriptors
where page_id = 222 and database_id = 7
Next I am going to bring tblTest into the buffer pool and dirty it. I am going to do this by updating all the existing rows in the table.
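An update along these lines is enough to dirty the page; the column name col1 is just a placeholder for an actual column in tblTest:

```sql
-- col1 is a placeholder; any update that touches the rows will dirty the page
UPDATE tblTest
SET col1 = col1;
```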
I can verify that the page is now in the buffer pool and that it has been modified (dirty). The is_modified column in sys.dm_os_buffer_descriptors indicates the status of the page: 1 for dirty, 0 for clean.
select * from sys.dm_os_buffer_descriptors
where page_id = 222 and database_id = 7
When we run a checkpoint again, the page should remain in the buffer pool but it should now be clean.
CHECKPOINT;

select * from sys.dm_os_buffer_descriptors
where page_id = 222 and database_id = 7
This shows that the checkpoint process cleaned the page and left it in the buffer pool.
Checkpoint and the transaction log
I am not going to get into transaction log management here. For the purposes of the demo I set up the following:
- The database is in the simple recovery mode
- There were no outstanding active transactions
- No other transaction log dependent features, such as replication or database mirroring, were set up on the box.
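Assuming the demo database is the DBA database used above, the first point can be set like this:

```sql
-- Put the demo database into the simple recovery model
ALTER DATABASE DBA SET RECOVERY SIMPLE;
```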
I have a database set up and I have run a number of transactions against it. The recovery interval has been set to 1000 minutes. I am going to use DBCC LOGINFO and the undocumented fn_dblog function:
DBCC LOGINFO;

SELECT [Current LSN], [Previous LSN], [Operation], [Context],
       [Checkpoint Begin], [Checkpoint End], [Page ID], [Minimum LSN],
       [Dirty Pages], [SPID], [Begin Time], [Transaction Name],
       [End Time], [Lock Information], [Description]
FROM ::fn_dblog(NULL, NULL);
So when does checkpoint run?
- When SQL Server is shut down, a checkpoint is run against each database. The only exceptions are when SHUTDOWN WITH NOWAIT is used or the server crashes.
- The checkpoint process is run before each backup.
- The active portion of the transaction log exceeds what could be recovered within the configured recovery interval.
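On top of these automatic triggers, a checkpoint can also be issued manually. The optional argument requests how many seconds SQL Server should aim to complete the checkpoint within:

```sql
CHECKPOINT;      -- run a checkpoint now
CHECKPOINT 10;   -- ask SQL Server to complete it within roughly 10 seconds
```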
We currently run a couple of production SQL Server clusters. The trouble for me is that they are very stable, so I do not have to do much to maintain availability. To improve my clustering knowledge I am taking a two-phased approach: since I learn best by doing, on top of reading up on the theory I am going to build a two node Windows 2003 cluster (virtualised) and install SQL Server 2005 on top of it.
I was lucky that a few things went wrong, as I learned a lot doing the install. The most important lesson: do not attempt to install a SQL Server cluster over RDP. Make sure all RDP connections are closed before starting the install, and install from the console. I do not know why this is an issue and must investigate further. The errors returned included a message saying the native client would not install, and "No process is on the other end of the pipe". After troubleshooting both of these, in the end it was RDP that was causing the issues.
I used VMware to build a 2 node Windows 2003 cluster and I installed a SQL Server 2005 Developer Edition cluster on top of this. The main documents I consulted during the install were:
- SQL 2005 Clustering White Paper: https://www.microsoft.com/downloads/en/details.aspx?familyid=818234dc-a17b-4f09-b282-c6830fead499&displaylang=en
- Windows 2003 Cluster White Paper: http://technet.microsoft.com/en-us/library/cc778252%28WS.10%29.aspx
Following this documentation, these are the notes I made to assist me with future installs.
- Setup a single Virtual Machine using a standard configuration but add an additional NIC for the private heartbeat
- Install Microsoft Windows Enterprise Edition with all applicable updates on both nodes, keeping both configurations identical. When the setup completes:
- Configure networking for both the LAN and the Private heartbeat NIC’s
- Under Control Panel > Network Connections > Advanced Settings >
- Configure the binding order, the LAN NIC should come first
- Configure the private NIC
- Disable everything except TCP/IP
- Set the duplex to half / 10 Mbps – reliability over performance
- Use a separate network class from the public network
- For TCP/IP: no default gateway, no DNS, and disable NetBIOS
- Power off Node1
- Within VMware Clone Node1
- Turn on the cloned Node1 which becomes Node2
- Change the server name and the assigned IP addresses
- Add Node2 to a domain
- Turn on Node1
- On both nodes run ipconfig /all and complete a quick comparison between nodes
- Add Node1 to domain
- A minimum of 3 shared disks are required (Quorum, MSDTC, SQL)
- Add the shared disks to Node1; to do this the disks must first be zeroed out
- Use PuTTY to log on to the server console and run vmkfstools -w diskname.vmdk
- When adding the disk there is a requirement to use a new SCSI controller which is set to LSI Logic, with the SCSI bus sharing set to Virtual.
- Add the disks to Node2 following the same procedure
- Power off both nodes
- Power on Node1 and add the disks through disk management and DISKPART setting the partition alignment.
- Verify the disks in Windows Explorer – create a text document, edit it, and delete it
- Power off Node1
- Power on Node2 and Test the disks
- Power off Node2
- Power on Node1
- Ensure you have:
- An IP address for the cluster and a cluster name
- A domain account which has local admin rights on Node1 and Node2
- From Cluster Administrator launch the cluster setup wizard
- When the setup completes, review the cluster configuration log file C:\WINDOWS\system32\LogFiles\Cluster\ClCfgSrv.log
- Power on Node2
- From Node1 add Node2 using Cluster Administrator add node
- Test that failover works between both nodes.
- From Cluster Administrator
- Set the network priority, giving the private network priority
- Configure the Private (heartbeat) network for private (internal cluster) communication only
- Configure the Public (LAN) network for mixed communication
On both node1 and node2
How can I identify slow running queries and what causes a query to run slowly?
One approach I like to use to identify query performance issues is the built-in DMVs available from SQL Server 2005 onwards. One of my favourites to start with is sys.dm_exec_query_stats. This DMV looks at plans that are stored in the plan (procedure) cache and returns one row for each statement in a plan. When using this DMV I have to remind myself that not everything gets stored in the plan cache, for instance:
- Zero cost or trivial plans
- DBCC commands, backups and recompiled statements
On top of this, older plans can get aged out depending on resource requirements, so it will not give a complete picture of your system, but it is a good place to start given its benefits, including:
- the data is already collected for you by SQL Server
- it is non-intrusive and resource light
It is worth taking a quick step back and reviewing how the plan cache fits into the anatomy of a query. When a user executes a query, SQL Server asks the command parser to verify that the query is syntactically correct. Once this completes successfully, it generates a hash for the query and checks the plan cache, an area of the buffer pool memory, to see if an existing plan is available for the query. It is this area of memory that sys.dm_exec_query_stats queries for its information.
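As a quick aside, you can peek at what is currently sitting in the plan cache with the related sys.dm_exec_cached_plans DMV, for example:

```sql
-- One row per cached plan, with the text of the batch it belongs to
SELECT cp.usecounts, cp.objtype, cp.size_in_bytes, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st;
```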
sys.dm_exec_query_stats can help identify where plans are over utilising resources such as:
- CPU (worker time)
- I/O – memory (logical)
- I/O – disk (physical)
- Recompiles

A bottleneck on any of these resources can cause queries to run slowly.
A quick but important distinction for SQL Server performance:
- Physical I/O = Disk = Expensive (slow)
- Logical I/O = Memory = Faster
sys.dm_exec_query_stats gives the time the plan last started executing and the time the plan was compiled. This can be useful for narrowing the scope of the result set, for example if I only want to look at queries executed today or this week. The DMV also gives the plan_handle and sql_handle; these are tokens that refer to the plan and batch that the query is part of, more on these later.
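For example, to narrow the results to plans that have executed in the last day (the one-day window here is arbitrary):

```sql
-- Only plans that started executing within the last 24 hours
SELECT *
FROM sys.dm_exec_query_stats
WHERE last_execution_time >= DATEADD(DAY, -1, GETDATE());
```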
The chart below distinguishes between the various resources required by a query and the values we can obtain from sys.dm_exec_query_stats.
- Totals are the totals since the plan was last compiled
- max / min values are the highest / lowest values ever achieved by the query
When running a query I can order the result set by any of the resources identified above. If, for example, I have already identified that CPU is a bottleneck, then I can order the result set by either total_worker_time or the average worker time, descending.
Taking all this on board, I usually end up with a query that looks like:
SELECT creation_time          -- time the plan was last compiled
     , last_execution_time    -- time the plan last started executing
     , plan_generation_num    -- number of times the plan was recompiled
     , execution_count        -- number of executions since the last recompile
     , total_worker_time / execution_count AS avg_CPU_time
     , total_logical_reads / execution_count AS avg_logical_reads
     , total_physical_reads / execution_count AS avg_physical_reads
     , total_logical_writes / execution_count AS avg_logical_writes
FROM sys.dm_exec_query_stats
ORDER BY avg_CPU_time DESC;
The result set will tell me, for an averaged execution, which queries are using the most CPU time.
- Has a query started to run slower, and how can we tell?
- We already looked at total_elapsed_time / execution_count = average elapsed time
- The difference between last_elapsed_time and the average elapsed time gives us an idea if it is running more slowly; we can also look at max_elapsed_time and min_elapsed_time
- The difference between the average worker time and last_worker_time = the increase in query CPU time for the last run
- Is the difference across multiple queries or just one?
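As a sketch, the elapsed-time comparison above can be expressed directly; the factor of 2 is an arbitrary threshold for "noticeably slower":

```sql
-- Plans whose most recent run took more than twice their average elapsed time
SELECT sql_handle, plan_handle,
       total_elapsed_time / execution_count AS avg_elapsed_time,
       last_elapsed_time, max_elapsed_time, min_elapsed_time
FROM sys.dm_exec_query_stats
WHERE last_elapsed_time > 2 * (total_elapsed_time / execution_count);
```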
To make it more meaningful, and to allow me to start troubleshooting performance issues, what I need to see is the text of the query and the plan the query is using. To get these I use the sql_handle and plan_handle tokens I mentioned earlier to call a couple more DMVs. First, sys.dm_exec_sql_text: this DMV is a function that takes the sql_handle as a parameter; it needs to be combined with statement_start_offset and statement_end_offset to retrieve the query text. Books Online has an example of how to achieve this.
(SELECT SUBSTRING(st.text, statement_start_offset/2 + 1,
        (CASE WHEN statement_end_offset = -1
              THEN LEN(CONVERT(nvarchar(max), st.text)) * 2
              ELSE statement_end_offset
         END - statement_start_offset) / 2)
 FROM sys.dm_exec_sql_text([sql_handle]) AS st) AS query_text
To look at the query plan I use the sys.dm_exec_query_plan DMV. This takes the plan_handle as a parameter and returns the query's execution plan. Exercise caution when executing this DMV as it can be resource intensive.
select * from sys.dm_exec_query_plan([plan_handle])
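Putting the pieces together, here is a sketch of a combined query that returns the statistics, statement text and plan in one result set; the TOP 10 filter is just to keep the output manageable:

```sql
-- Top 10 statements by average CPU time, with their text and XML plan
SELECT TOP 10
       qs.execution_count,
       qs.total_worker_time / qs.execution_count AS avg_CPU_time,
       SUBSTRING(st.text, qs.statement_start_offset/2 + 1,
                 (CASE WHEN qs.statement_end_offset = -1
                       THEN LEN(CONVERT(nvarchar(max), st.text)) * 2
                       ELSE qs.statement_end_offset
                  END - qs.statement_start_offset) / 2) AS query_text,
       qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
ORDER BY avg_CPU_time DESC;
```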
BOL search for : sys.dm_exec_query_stats
When I attempted to remove a virtual switch using the vSphere Client I kept getting the following error:
"HostNetworkSystem.UpdateNetworkConfig" for object "networkSystem" failed
To resolve the issue I deleted the configuration from the console using the command:

esxcfg-vswitch -d <virtual switch name>
Our VMware vSphere 4 Enterprise license covers us for 2 physical CPUs (1-6 cores per CPU). We are running it on a 2 CPU x 4 core box, giving us 8 cores, but as we discovered we can only allocate 4 virtual CPUs per virtual machine.
To resolve this we would have to purchase an Enterprise Plus license per CPU for a 2 CPU server. In fact, to enable failover and VMware high availability features we would also need to purchase a set of licences for an additional server.
We are currently evaluating the best way to provide a highly available SQL Server environment. At the moment we are using MSCS. We felt we might be able to better utilise server resources if we could move SQL Server to VMware, but this may not be justifiable given the licence costs of Enterprise Plus.
VMware has a Compare vSphere Editions chart at http://www.vmware.com/products/vsphere/buy/editions_comparison.html
Removing a LUN is a 2 step process, and it can be done while the server is running:
- Log in to Navisphere Manager and remove the LUN
- Tidy up the Windows OS
Log in to Navisphere Manager
- Select Storage > Storage Groups > select the LUNs
- From the selected LUNs list choose the LUNs to remove
- Remove the LUNs and click Apply
Log in to the Windows server
- From Computer Management select Disk Management > Rescan Disks
- From the Command Prompt navigate to C:\Program Files (x86)\EMC\PowerPath
- run powermt check
- Accept all the changes