Hi All,
Perhaps someone has some ideas on this rather strange problem we have been having with our Lync Basic 2013 roll-out?
After pushing to around the first 100 PCs, we started seeing an occasional BSOD. Initially we did not suspect the Lync job at all, but after a bit of event log harvesting, it was plain to see that these BSODs coincided with the installation of Lync. All the BSODs that we looked at contained some reference to fonts in the stack trace of the memory.dmp file. Most were page faults, although there was the odd bad pool header.
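For anyone wanting to repeat that correlation, here is a rough sketch of the idea in Python. The hostnames, timestamps and the 24-hour window are invented for illustration (they are not our real data), and the function name is my own; it assumes the install and BSOD times have already been exported from the event logs.

```python
from datetime import datetime, timedelta


def bsods_near_install(install_times, bsod_times, window=timedelta(hours=24)):
    """Return {host: [BSOD timestamps that fall within `window` after the install]}."""
    hits = {}
    for host, installed in install_times.items():
        near = [
            t for t in bsod_times.get(host, [])
            if timedelta(0) <= t - installed <= window
        ]
        if near:
            hits[host] = near
    return hits


# Hypothetical sample data, purely illustrative.
install_times = {
    "PC-001": datetime(2013, 5, 1, 9, 0),
    "PC-002": datetime(2013, 5, 1, 9, 30),
}
bsod_times = {
    "PC-001": [datetime(2013, 5, 1, 10, 15)],  # shortly after the install
    "PC-002": [datetime(2013, 4, 20, 8, 0)],   # well before the install
}

hits = bsods_near_install(install_times, bsod_times)
for host, times in hits.items():
    print(host, [t.isoformat() for t in times])
```

Only hosts whose crash follows the install inside the window are reported, so unrelated earlier crashes are filtered out.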
The job itself is run from SCCM. It first uninstalls Lync 2010 and then installs 2013, with some very basic OCT/config.xml customization (reboots suppressed, logging directory changed and org name set). Note that although the reboots are suppressed, SCCM manages the jobs separately, so until the uninstall has completed entirely (including a user-initiated reboot if necessary) the second job, the Lync 2013 install, will not run. For the most part, the uninstall did not require a reboot, although the install job did.
We first found that machines waiting for a restart would BSOD if we viewed their Fonts folder. This seemed like a simple fix: install Lync when no user is logged on, so that no restart would be required and the problem would hopefully go away. This turned out not to be the case; we still got the odd BSOD even after doing this.
So this time I set up a VM and a snapshot so I could test a few more scenarios, and I now have the following information:
1. Install running in the logged-on user's context, no restart (none requested by the installer either) = no crash
2. Install running in system context with a user logged on, no restart (none requested by the installer either) = no crash
3. Install running in system context with no user logged on, no restart (none requested by the installer either) = CRASH
4. Install running in system context with no user logged on, manual restart (none requested by the installer either) = no crash
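To keep track of what has and has not been tested, the four VM scenarios above can be written down as data. This is only a sketch in Python; the field names are my own, and I am assuming that an install in the user's context with nobody logged on is not a meaningful combination.

```python
from collections import namedtuple
from itertools import product

Scenario = namedtuple("Scenario", "context user_logged_on restarted crashed")

# The four scenarios tested on the VM, in the order listed above.
scenarios = [
    Scenario("user", True, False, False),
    Scenario("system", True, False, False),
    Scenario("system", False, False, True),
    Scenario("system", False, True, False),
]

# Conditions under which a crash was observed.
crash_conditions = [
    (s.context, s.user_logged_on, s.restarted) for s in scenarios if s.crashed
]

# Combinations not yet tested (skipping user context with nobody logged on).
tested = {(s.context, s.user_logged_on, s.restarted) for s in scenarios}
untested = [
    combo
    for combo in product(("user", "system"), (True, False), (False, True))
    if combo not in tested and not (combo[0] == "user" and combo[1] is False)
]

print("crashed:", crash_conditions)
print("untested:", untested)
```

On the VM results so far, the only crashing combination is system context, no user logged on and no restart; the two combinations still untested are the ones with a user logged on followed by a restart.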
That seems simple enough to fix, but the problem is that out in production we find some PCs actually needing a restart to finish the install, which is not something I have been able to replicate in testing yet. So if we assume that having a user logged on and using Lync and Outlook throws a spanner in the works for options 1 and 2, that leaves option 4 as the best way to proceed.
However, I do not feel entirely comfortable with this. Production seems to make everything a lot less predictable (approx. 400 clients upgraded, at most 50 BSODs seen), and there is nothing I can see that links this together into a reliable explanation. It is not the case that all the machines that BSODed were in a certain state and the ones that did not were not; we found examples of machines that fell outside of just about any pattern we could identify.
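The kind of pattern hunting I mean is roughly the following: take one record per machine and compare the BSOD rate for each value of some attribute. This is a sketch only; the records and attribute names (reboot_pending, user_logged_on) are hypothetical placeholders, not our production data.

```python
from collections import defaultdict


def bsod_rate_by(records, attribute):
    """Return {attribute value: fraction of machines with that value that crashed}."""
    totals = defaultdict(int)
    crashes = defaultdict(int)
    for rec in records:
        key = rec[attribute]
        totals[key] += 1
        crashes[key] += 1 if rec["bsod"] else 0
    return {key: crashes[key] / totals[key] for key in totals}


# Hypothetical per-machine records, purely illustrative.
records = [
    {"host": "PC-001", "reboot_pending": True, "user_logged_on": False, "bsod": True},
    {"host": "PC-002", "reboot_pending": True, "user_logged_on": True, "bsod": False},
    {"host": "PC-003", "reboot_pending": False, "user_logged_on": False, "bsod": False},
    {"host": "PC-004", "reboot_pending": False, "user_logged_on": False, "bsod": True},
]

for attribute in ("reboot_pending", "user_logged_on"):
    print(attribute, bsod_rate_by(records, attribute))
```

An attribute whose values show clearly different rates would be worth a closer look; one whose values show the same rate probably is not the deciding factor on its own.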
Does anyone have any suggestions on how I can investigate this further? Preferably I'd like to understand what is going on rather than just carry out more and more testing to find a workaround.
One last bit of information I can offer: I viewed each of the fonts that the installer appears to touch, one at a time, to see which of them caused the crash. On my test box, leelawdb and msuigur were to blame (although perhaps this is just random, depending on which bit of memory is being annoyed at the time!?)
Many thanks for reading
Colin