Saturday 21 June 2014

Flushed Cached Execution Plan Caused By Memory Pressure


In this scenario the end users where experiencing intermittent slow performance of their application. It would occur at random intervals during the working day at no specific times.
Most of the symptoms were timeouts or a slow loading time for specific reports.
The first check was to make sure the cached execution plans looked ok for the specific query.

Select
DB_NAME(st.dbid) DBName
,OBJECT_SCHEMA_NAME(st.objectid,st.dbid) SchemaName
,OBJECT_NAME(st.objectid,st.dbid) StoredProcedure
,max(cp.usecounts) Execution_count
,st.text
,cp.plan_handle
,CONVERT( XML, CAST(qp.query_plan AS VARCHAR(MAX))  )
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
where DB_NAME(st.dbid) = 'MyDatabaseNameHere'
and cp.objtype = 'Adhoc'
and st.text LIKE '%                        ''%'
Group by
cp.plan_handle,
st.text,
DB_NAME(st.dbid),
OBJECT_SCHEMA_NAME(st.objectid,st.dbid),
OBJECT_NAME(st.objectid,st.dbid)
,  CAST (qp.query_plan AS VARCHAR(MAX))
order by max(cp.usecounts) desc
A look at the cached execution showed a RID Lookup with 90 %Cost to the query. This was reviewed and a covering index was created to overcome the RID Lookup shown in Figure 1.
Figure 1
The execution plan now improved, looked like this with an Index Seek as shown in Figure 2.
Figure 2
It was anticipated this improvement in the execution plan would help performance but users where still complaining of slow performance at intervals during the working day. This was something that was not easily replicated as each time the application was tested by the support team performance was good including during peak times.
A further revisit of the cached execution plan was done. Later in the day an unexpected observation was noted on the query execution count. Early on in the day the execution count of the query was around 50. When looked at again in the afternoon it had decreased in count to 1.
At first glance this did not make sense as to how an execution count could have decreased. This required a look at the stats of the query using the DMV dm_exec_query_stats. Using the plan_handle of the execution plan we could pull up more information about its stats.
SELECT * FROM sys.dm_exec_query_stats S
CROSS APPLY sys.dm_exec_sql_text (S.sql_handle)
WHERE plan_handle= 0x060005002F20A31FE0043F1…………...………
Below you can see the creation time for the plan was in the afternoon with an execution count of 1. This now shed light on why users were experencing intermittent slow performance because the execuction plan was being flushed from the cache. So when the user tried to run the application a new plan was being complied and thus they experineced either the timeouts or slow loading times.
Figure 3
There are no jobs running during that time of the day to cause the execution plan to be removed e.g Index rebuilds, Alterations to Indexes or tables. These types of operations only occur out of normal working hours, late at night or very early in the morning. So this eliminated these as potential causes. The other possible cause of flushing the cached plan was memory pressures on the Sql Instance. So a Perfmon counter was setup and run at 15 second interval over a 3 day period.
The analysis of perform counters turned up some interesting results. The first counter to review was the Page life expectancy.
Figure 4
It’s shown in image 1 that spikes in User Connectivity correlate with Page life expectancy (PLE) sharp drop although this is not the initial cause of the PLE downward trends.
A further look at the Batch requests along with PLE did identify some relationship with the sharp drops in PLE. There were some spikes during normal working hours (9am-5pm) in Batch requests that preceded PLE drops.
Figure 5


The below screenshot shows the correlation between PLE, Batch Requests/sec and SQL Compilations/sec. The scale has been increased for viewing purposes. The Microsoft recommendation is for SQL Compilations to be 10% below Batch Requests. For the majority of times its observed this was the case. However not exclusively, as its been noticed the SQL Compilations/Sec did exceed Batch Requests/Sec during normal work hours.

Figure 6
A granular look at the counters shows at times SQL Compilations were higher than Batch Requests as highlighted in figure 7. Below is a zoomed in view of a 9AM to 10PM time slot.
Figure 7
It’s shown in figure 7 in three points SQL Compilations/Sec were higher than Batch Requests.
The perform counter Average disk Sec/Read was also reviewed. Below it’s shown a correlation between Disk Read activity and PLE downward trends.


                Figure 8


In Figure 9 it’s showing the correlation with Average Disk Reads and Batch Requests activity.
Figure 9
From these findings it decided that a further investigation is required on using either a profile trace or Extended Events. A more granular observation is needed to identify the queries running at these periods of PLE downward trends.