Environment:
- TMG Array
- ACE Load Balancer (I know it's not supported; we have bypassed it and also see the issue behind F5)
- VMware environments
- McAfee Anti-Virus
POC:
- 3 FE servers combined with Mediation servers
Production:
- 7 FE servers
- 5 ME servers
- Dedicated VM gear
Users are losing connections to the Lync environment on mobile devices. This is happening every 24 hours. The UCWA service crashes on a server, leaving everyone on that particular server unable to connect. This ONLY affects users on the server where the crash happened. You can find out which server a user is on by running Get-CsUserPoolInfo -Identity for that user.
Corresponding with the following error on the iPhone:
Findings:
We found that the w3wp.exe worker process fails for the UCWA application pool. The pool is hitting too many exceptions, which causes the .NET application domain to shut down.
Running "C:\windows\system32\inetsrv\appcmd.exe list wp" will show the worker processes that IIS reports on the box.
Next, open "Task Manager".
You will see that process 2332 is not listed. It is missing because the worker process has shut down. The only way to get it back up and running is to perform an IIS reset.
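The manual comparison above (appcmd output versus Task Manager) could be scripted. The following is only a rough sketch in Python, not something we ran in this environment: it assumes `appcmd list wp` prints lines of the form `WP "2332" (applicationPool:LyncUcwa)` and that `tasklist /FO CSV /NH` prints the PID in the second CSV column. The function names are made up for illustration.

```python
import csv
import io
import re
import subprocess

APPCMD = r"C:\Windows\System32\inetsrv\appcmd.exe"

# Assumed appcmd output format, one line per worker process:
#   WP "2332" (applicationPool:LyncUcwa)
WP_LINE = re.compile(r'WP "(\d+)" \(applicationPool:(.+)\)')


def parse_worker_processes(appcmd_output):
    """Map application pool name -> list of PIDs reported by appcmd."""
    pools = {}
    for line in appcmd_output.splitlines():
        match = WP_LINE.match(line.strip())
        if match:
            pools.setdefault(match.group(2), []).append(int(match.group(1)))
    return pools


def parse_tasklist_pids(tasklist_csv):
    """Collect PIDs from 'tasklist /FO CSV /NH' output (PID is column 2)."""
    pids = set()
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if len(row) >= 2 and row[1].isdigit():
            pids.add(int(row[1]))
    return pids


def stale_workers(appcmd_output, running_pids):
    """Return pool -> PIDs that appcmd lists but that are not running."""
    stale = {}
    for pool, pids in parse_worker_processes(appcmd_output).items():
        missing = [pid for pid in pids if pid not in running_pids]
        if missing:
            stale[pool] = missing
    return stale


def check_local_server():
    """Run both tools on the local box (Windows with IIS only)."""
    wp = subprocess.run([APPCMD, "list", "wp"],
                        capture_output=True, text=True, check=True).stdout
    tl = subprocess.run(["tasklist", "/FO", "CSV", "/NH"],
                        capture_output=True, text=True, check=True).stdout
    return stale_workers(wp, parse_tasklist_pids(tl))
```

If `check_local_server()` returns an entry for LyncUcwa, the server is in the state described above: IIS still reports a worker process that no longer exists.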
We tried just recycling the application pool for UCWA, but found that this would sometimes work and sometimes not. We also tried IISReset -noforce, but this was hit or miss as well. From time to time the w3wp.exe process would not restart, and we had to "End Task" on the process.
Working with Microsoft, we took many logs. We started with the DebugDiag.exe tool. This caused us other issues: within 12 hours the boxes would be crippled by the tool, since it consumed 28 GB of the 32 GB available. We also took logs with Procdump.exe. This was unhelpful too: we would see the crash happen with Procdump running and it would not catch it.
The DebugDiag logs that did get captured showed a few things. First, they showed that we had memory issues. This was to be expected, since we had consumed most of the memory in the box. The frustrating part was that we only ran this for 8 hours and still had the crash. Second, these logs showed that we had a TON of exceptions; the million dollar question is what is causing them. We spent about 2 weeks on this issue with MSFT with no resolution. So, the decision to rebuild was made, because it was suspected that the issue was 2012 R2 with a new patch that fixed issues with Windows Update Services.
Possible Fixes Tried:
Anti-virus Exclusions
Windows 2012 R2 and Windows 2012 both showing symptoms of this issue
Bypassing Load Balancers
Moved Servers to dedicated VM gear
Installed F5 to replace the Cisco ACE
We rebuilt the entire pool to eliminate 2012 R2 as a possible issue
Final Resolution:
For the resolution we started working with MSFT and McAfee together. The initial fix was to add the following anti-virus exclusions and change some IIS settings:
- Ensure exclusions for Anti-Virus programs include the following:
- %systemdrive%\Windows\Microsoft.NET
- %systemdrive%\Windows\assembly
- %systemdrive%\Windows\system32\inetsrv
- %systemdrive%\inetpub\temp
Also, include all sub folders and the exclusions specified in http://technet.microsoft.com/en-us/library/dn440138.aspx.
The following commands need to be run on the Lync servers:
- To ensure UCWA continues to work following a recycle event run the following command on Lync machines:
- C:\Windows\System32\inetsrv>appcmd set config /section:applicationPools /[name='LyncUcwa'].recycling.disallowOverlappingRotation:true
Ensure IIS is logging all recycle events by running the following on Lync servers:
- C:\Windows\System32\inetsrv>appcmd set config /section:applicationPools /[name='LyncUcwa'].recycling.logEventOnRecycle:Time,Requests,Schedule,Memory,IsapiUnhealthy,OnDemand,ConfigChange,PrivateMemory
The disallowOverlappingRotation setting was used to ensure that the old and new worker processes (PIDs) didn't overlap when the UCWA application pool restarted.
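With 7 FE servers, typing these commands by hand on each box gets tedious. Below is a small Python sketch that builds the same two appcmd command lines for the LyncUcwa pool and, by default, only prints them. The helper names and the dry-run switch are illustrative, not part of anything MSFT supplied; the appcmd arguments are copied from the commands above.

```python
import subprocess

APPCMD = r"C:\Windows\System32\inetsrv\appcmd.exe"

# Recycle reasons to log, taken from the logEventOnRecycle command above.
LOG_EVENTS = ["Time", "Requests", "Schedule", "Memory", "IsapiUnhealthy",
              "OnDemand", "ConfigChange", "PrivateMemory"]


def recycling_commands(pool="LyncUcwa"):
    """Build the two appcmd invocations as argument lists."""
    prefix = [APPCMD, "set", "config", "/section:applicationPools"]
    target = "/[name='{0}'].recycling.".format(pool)
    return [
        prefix + [target + "disallowOverlappingRotation:true"],
        prefix + [target + "logEventOnRecycle:" + ",".join(LOG_EVENTS)],
    ]


def apply_recycling_settings(pool="LyncUcwa", dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    for cmd in recycling_commands(pool):
        if dry_run:
            print(" ".join(cmd))
        else:
            # Needs an elevated prompt on a server with IIS installed.
            subprocess.run(cmd, check=True)
```

Calling `apply_recycling_settings()` prints the two command lines for review; passing `dry_run=False` runs them on the local server.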
So, the final verdict came in: the issue is two-fold. First, McAfee is scanning and manipulating files for the UCWA application pool. This causes UCWA to fail, and MSFT gave us the following command to resolve the failure:
C:\Windows\System32\inetsrv>appcmd set config /section:applicationPools /[name='LyncUcwa'].recycling.disallowOverlappingRotation:true
In short, this allows the UCWA pool to recycle properly when this event happens during the scan. McAfee and MSFT are both working on a permanent fix: MSFT is coming out with a hotfix for this situation, and McAfee is coming out with a fix to leave the UCWA directory files alone. So, for now, setting disallowOverlappingRotation to true is the fix that relieves the frustration.