Monday, July 14, 2014

Lync Mobility Issues - Event IDs 1309, 5011, 20002

Environment:
TMG Array
ACE Load Balancer (not supported, I know; we bypassed it and still saw the issue behind an F5)
VMware Environments
McAfee Anti-Virus

POC
3 FE servers with collocated Mediation servers

Production
7 FE servers
5 ME Servers
Dedicated VM gear

Users on mobile devices are losing their connection to the Lync environment roughly every 24 hours. The UCWA service crashes on a server, leaving everyone homed on that server unable to connect. ONLY users on the server where the crash happened are affected. You can find out which server a user is on by running Get-CsUserPoolInfo -Identity <user>.

At the same time, the iPhone client reports a corresponding error.

Findings:

We found that the W3WP worker process for the UCWA application pool fails. The pool throws too many exceptions, which causes the .NET application domain to shut down.

Running "C:\windows\system32\inetsrv\appcmd.exe list wp" lists the IIS worker processes on the server, along with their process IDs.

Next, open Task Manager and look for those process IDs. In our case process 2332 was not listed there, because the worker process had shut down. The only way to get it back up and running is to perform an IIS reset.
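
The manual comparison above (PIDs reported by appcmd versus processes that actually exist) can be scripted. The following is a minimal Python sketch, not something we ran in production. It assumes that "appcmd list wp" prints one line per worker in the form WP "2332" (applicationPool:LyncUcwa), and that you collect the list of running PIDs yourself, for example from Task Manager or tasklist. The function names and the ExamplePool pool are hypothetical.

```python
import re

# Assumed line format of `appcmd list wp` output, e.g.:
#   WP "2332" (applicationPool:LyncUcwa)
WP_LINE = re.compile(r'WP "(?P<pid>\d+)" \(applicationPool:(?P<pool>[^)]+)\)')


def parse_appcmd_wp(output):
    """Return {pid: pool_name} for every worker process line in the output."""
    workers = {}
    for line in output.splitlines():
        match = WP_LINE.search(line)
        if match:
            workers[int(match.group("pid"))] = match.group("pool")
    return workers


def find_dead_workers(workers, running_pids):
    """Return {pid: pool_name} for workers whose PID is not actually running."""
    running = set(running_pids)
    return {pid: pool for pid, pool in workers.items() if pid not in running}


if __name__ == "__main__":
    # Sample text standing in for real appcmd output (ExamplePool is made up).
    sample = (
        'WP "2332" (applicationPool:LyncUcwa)\n'
        'WP "4120" (applicationPool:ExamplePool)\n'
    )
    # PIDs that the operating system reports as alive (illustrative values).
    alive = [4120, 812]
    print(find_dead_workers(parse_appcmd_wp(sample), alive))
```

Any pool that shows up in the result is one whose worker process IIS still lists but which is no longer running, which is the state we kept finding the LyncUcwa pool in.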

We tried recycling just the UCWA application pool, but this only worked some of the time. We also tried IISReset -noforce, which was just as hit or miss. From time to time the w3wp.exe process would not restart, and we had to "End Task" on it.

Working with Microsoft, we collected many logs. We started with the DebugDiag.exe tool, which caused other issues: within 12 hours it would cripple the servers, since it consumed 28 GB of the 32 GB of memory available. We also took dumps with Procdump.exe. This was unhelpful too: the crash would happen with Procdump running, and it would not catch it.

The DebugDiag logs that did get captured showed a few things. First, they showed memory issues, which was to be expected since we had consumed most of the memory on the server. The frustrating part was that we had only run the tool for 8 hours when the crash occurred. Second, the logs showed a TON of exceptions; the million-dollar question was what was causing them. We spent about 2 weeks on this issue with MSFT with no resolution. The decision to rebuild was then made, because the suspected cause was 2012 R2 with a new patch that fixed issues with Windows Update Services.

Possible Fixes Tried:
Anti-virus Exclusions
Windows 2012 R2 and Windows 2012 both showing symptoms of this issue
Bypassing Load Balancers
Moved Servers to dedicated VM gear
Installed F5 to replace the Cisco ACE
We rebuilt the entire pool to eliminate 2012 R2 as a possible issue

Final Resolution:
For the resolution we worked with MSFT and McAfee together. The initial fix was to add the following anti-virus exclusions and change some IIS settings:
 
  • Ensure exclusions for Anti-Virus programs include the following:
    • %systemdrive%\Windows\Microsoft.NET
    • %systemdrive%\Windows\assembly
    • %systemdrive%\Windows\system32\inetsrv
    • %systemdrive%\inetpub\temp

Also include all subfolders, plus the exclusions specified in http://technet.microsoft.com/en-us/library/dn440138.aspx.

Two commands need to be run on the Lync servers:

  • To ensure UCWA keeps working after a recycle event, run the following on each Lync server. It ensures that the PIDs don't overlap when the UCWA application pool restarts:
    • C:\Windows\System32\inetsrv>appcmd set config /section:applicationPools /[name='LyncUcwa'].recycling.disallowOverlappingRotation:true

  • To ensure IIS logs all recycle events, run the following on each Lync server:
    • C:\Windows\System32\inetsrv>appcmd set config /section:applicationPools /[name='LyncUcwa'].recycling.logEventOnRecycle:Time,Requests,Schedule,Memory,IsapiUnhealthy,OnDemand,ConfigChange,PrivateMemory

So, the final verdict came in: the issue is two-fold. First, McAfee is scanning and manipulating files used by the UCWA application pool, which causes UCWA to fail. Second, the pool does not recycle properly when that happens. MSFT provided the following command to resolve the failure:

C:\Windows\System32\inetsrv>appcmd set config /section:applicationPools /[name='LyncUcwa'].recycling.disallowOverlappingRotation:true

In short, this allows the UCWA pool to recycle properly when the failure happens during a scan. Both vendors are working on a permanent fix: MSFT is coming out with a hotfix for this situation, and McAfee is coming out with a fix to leave the UCWA directory files alone. So, for now, setting disallowOverlappingRotation to true is the fix that solves the frustration.


Lync 2013 WAC Issue "Sorry, PowerPoint Web App Ran into a problem opening this presentation."

    Issue: When users try to share PowerPoint files in meetings, the files (mainly PPT) upload fine, but the user receives the error "Sorry, PowerPoint Web App ran into a problem opening this presentation. To view this presentation please open it in Microsoft PowerPoint."

    Environment:
    (2) WAC servers behind an F5 Load Balancer
    TMG Cluster with 4 TMG servers
    (7) Lync 2013 Enterprise Servers

    Trials:
    I have tried a number of things to resolve this issue:
    • Bypassed TMG by connecting internally - Fail
    • Bypassed the F5 using hosts files, targeting each WAC server individually to rule out a problem with a single server - Fail
    • Loaded the PPT in other environments to rule out a corrupt file - Success
    • Loaded the PPT on the WAC farm in our DR site - Success
    • Installed WAC SP1 - Fail
    • Rebuilt both WAC servers - Fail

    What I used to troubleshoot this issue:
    • Fiddler - the most detailed and the most helpful in pinpointing the issue; it showed a 200 during the upload but then a 500 during the failed download
    • IIS Logs - showed pretty much what Fiddler showed
    • ULS Logs - showed little beyond the 500 error until verbose logging was enabled, then showed the following errors relating to the cache:

    DiskCacheReader: TimeoutException [Machine: http://lyncWACServer01:809/diskcache/DiskCache.svc, Exception:System.TimeoutException: The HTTP request to 'http://lyncWACServer01:809/diskcache/DiskCache.svc' has exceeded the allotted timeout of 00:00:02. The time allotted to this operation may have been a portion of a longer timeout. ---> System.Net.WebException: The request was aborted: The request was canceled.   
    DocumentInfoCache.GetDocumentCacheItem: Item found, 0 minutes old
    SetCompleted - Completed with unthrown exception Microsoft.Office.Server.Powerpoint.Pipe.Interface.PipeApplicationException: Exception of type 'Microsoft.Office.Server.Powerpoint.Pipe.Interface.PipeApplicationException' was thrown.   

    • Event Viewer on WAC server - found this fairly useless
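
When combing through verbose ULS logs for the cache timeout quoted above, a small filter saves a lot of scrolling. This is a minimal Python sketch, assuming the entries look like the DiskCacheReader line shown above. The function name and the regular expression are my own illustration, not part of any OWAS tooling.

```python
import re

# Matches entries shaped like the DiskCacheReader timeout quoted above and
# captures the cache endpoint plus the timeout value that was exceeded.
TIMEOUT_ENTRY = re.compile(
    r"DiskCacheReader: TimeoutException \[Machine: (?P<endpoint>[^,\]]+),"
    r".*?allotted timeout of (?P<timeout>\d{2}:\d{2}:\d{2})"
)


def find_cache_timeouts(lines):
    """Return a list of (endpoint, timeout) tuples, one per matching log line."""
    hits = []
    for line in lines:
        match = TIMEOUT_ENTRY.search(line)
        if match:
            hits.append((match.group("endpoint"), match.group("timeout")))
    return hits


if __name__ == "__main__":
    # Two of the log lines from this post, used as sample input.
    sample_lines = [
        "DiskCacheReader: TimeoutException [Machine: "
        "http://lyncWACServer01:809/diskcache/DiskCache.svc, "
        "Exception:System.TimeoutException: The HTTP request to "
        "'http://lyncWACServer01:809/diskcache/DiskCache.svc' has exceeded "
        "the allotted timeout of 00:00:02.",
        "DocumentInfoCache.GetDocumentCacheItem: Item found, 0 minutes old",
    ]
    for endpoint, timeout in find_cache_timeouts(sample_lines):
        print(endpoint, timeout)
```

To run it over a real log you would pass the open file object (or its lines) to find_cache_timeouts; where the ULS logs live on your servers is left out here because it depends on the farm configuration.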

    Resolution:
    In working on this, a few commands came in handy:

    • Set-OfficeWebAppsFarm -OpenFromUrlEnabled - This command lets you open a PowerPoint file right on the server itself. This is very useful for eliminating Lync, as well as any network-related issues, from the equation. To get to this tool you simply browse straight to the server itself (or the VIP).

    To use it, I created a shared folder on the desktop, entered the path to the PowerPoint file in that folder (\\testserver\c$\user\test\PPT\Test.pptx) in the first line, and clicked "Create Link". Then, using "Test This Link", I was able to see whether the PowerPoint would render on the screen.

    In my case no such luck: the same old error as above. Since I knew it was failing locally, I figured why not turn up the logging to see if I could find something. This command allowed me to do that:

    • Set-OfficeWebAppsFarm -LogVerbosity Verbose - this turns logging up to verbose in OWAS; a services restart is also required for it to take effect.

    With logging set to verbose, I ran back through the same tests with the same result, and saw little other than cache issues. Speaking with MSFT PSS, we were informed that whenever a PowerPoint fails it is never cleaned from the cache and will no longer display. The only way to clean it out is to clear the cache. To do this, stop the services on the servers again, browse to "C:\ProgramData\Microsoft\OfficeWebApps\Working\d", and remove ALL of the contents of the "d" folder.

    With the "d" folder clear, restart your services and test again. In my case I was now able to render the PowerPoint file just fine. With everything back to normal, you can now disable the verbose logging:

    • Set-OfficeWebAppsFarm -LogVerbosity ""
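
The cache cleanup described above could be scripted along these lines. This is a minimal Python sketch, assuming the Office Web Apps services are already stopped and that the cache lives in the "d" folder mentioned earlier. clear_directory_contents is a hypothetical helper of my own, and the demo runs against a throwaway temp folder rather than the real cache.

```python
import shutil
import tempfile
from pathlib import Path


def clear_directory_contents(folder):
    """Delete everything inside `folder` but keep the folder itself.

    Returns the number of top-level entries that were removed.
    """
    folder = Path(folder)
    removed = 0
    for entry in folder.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a subfolder and everything under it
        else:
            entry.unlink()  # remove a single file (or a symlink)
        removed += 1
    return removed


if __name__ == "__main__":
    # Demo against a throwaway folder; on a WAC server you would pass the
    # "d" cache folder instead, and only after stopping the services.
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "d"
        (cache / "sub").mkdir(parents=True)
        (cache / "sub" / "slide.bin").write_text("x")
        (cache / "stale.cache").write_text("x")
        print(clear_directory_contents(cache))  # top-level entries removed
        print(list(cache.iterdir()))  # contents left behind
```

The helper deliberately leaves the "d" folder itself in place and only empties it, matching the instruction to remove the contents of the folder rather than the folder.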