OWSTimer.exe consistently crashes on specific job
The "Windows SharePoint Services Timer" OWSTimer.exe consistently crashes on our 64-bit test server since the introduction of new custom timer jobs (one of which causes the error). This error does not occur on our 32-bit development or integration servers. The Test Server is 12.0.0.6327 so it should already include KB949399, which is a very similar error.

The job in question simply calls a web service after checking the date it last ran. We believe this could be the cause, as asking a job when it last ran seems to require a specific SharePoint_Config DB Stored Procedure permission which appears to be lacking.

Here is the event log entry:

Event Type: Error
Event Source: .NET Runtime 2.0 Error Reporting
Event Category: None
Event ID: 1000
Date: 8/05/2009
Time: 2:30:05 PM
User: N/A
Computer: XXXSP03
Description: Faulting application owstimer.exe, version 12.0.6318.5000, stamp 4845bb0b, faulting module kernel32.dll, version 5.2.3790.4062, stamp 462643a7, debug? 0, fault address 0x0000000000027d8d.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 41 00 70 00 70 00 6c 00 A.p.p.l.
0008: 69 00 63 00 61 00 74 00 i.c.a.t.
0010: 69 00 6f 00 6e 00 20 00 i.o.n. .
0018: 46 00 61 00 69 00 6c 00 F.a.i.l.
0020: 75 00 72 00 65 00 20 00 u.r.e. .
0028: 20 00 6f 00 77 00 73 00  .o.w.s.
0030: 74 00 69 00 6d 00 65 00 t.i.m.e.
0038: 72 00 2e 00 65 00 78 00 r...e.x.
0040: 65 00 20 00 31 00 32 00 e. .1.2.
0048: 2e 00 30 00 2e 00 36 00 ..0...6.
0050: 33 00 31 00 38 00 2e 00 3.1.8...
0058: 35 00 30 00 30 00 30 00 5.0.0.0.
0060: 20 00 34 00 38 00 34 00  .4.8.4.
0068: 35 00 62 00 62 00 30 00 5.b.b.0.
0070: 62 00 20 00 69 00 6e 00 b. .i.n.
0078: 20 00 6b 00 65 00 72 00  .k.e.r.
0080: 6e 00 65 00 6c 00 33 00 n.e.l.3.
0088: 32 00 2e 00 64 00 6c 00 2...d.l.
0090: 6c 00 20 00 35 00 2e 00 l. .5...
0098: 32 00 2e 00 33 00 37 00 2...3.7.
00a0: 39 00 30 00 2e 00 34 00 9.0...4.
00a8: 30 00 36 00 32 00 20 00 0.6.2. .
00b0: 34 00 36 00 32 00 36 00 4.6.2.6.
00b8: 34 00 33 00 61 00 37 00 4.3.a.7.
00c0: 20 00 66 00 44 00 65 00  .f.D.e.
00c8: 62 00 75 00 67 00 20 00 b.u.g. .
00d0: 30 00 20 00 61 00 74 00 0. .a.t.
00d8: 20 00 6f 00 66 00 66 00  .o.f.f.
00e0: 73 00 65 00 74 00 20 00 s.e.t. .
00e8: 30 00 30 00 30 00 30 00 0.0.0.0.
00f0: 30 00 30 00 30 00 30 00 0.0.0.0.
00f8: 30 00 30 00 30 00 32 00 0.0.0.2.
0100: 37 00 64 00 38 00 64 00 7.d.8.d.
0108: 0d 00 0a 00 ....

(This error appears to always be the same.) Any advice would be appreciated.
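The Data block in the event log entry above is ordinary text stored as UTF-16LE, which is why every second byte is 00. A minimal Python sketch for decoding such a dump is below; the function name `decode_event_data` and the `DUMP` sample (the first five rows of the entry) are only illustrative, and the parser assumes the "offset: hex bytes, then ASCII column" layout shown in this thread.

```python
import re

# First five rows of the Data block from the event log entry (sample only).
DUMP = """\
0000: 41 00 70 00 70 00 6c 00 A.p.p.l.
0008: 69 00 63 00 61 00 74 00 i.c.a.t.
0010: 69 00 6f 00 6e 00 20 00 i.o.n. .
0018: 46 00 61 00 69 00 6c 00 F.a.i.l.
0020: 75 00 72 00 65 00 20 00 u.r.e. .
"""

HEX_BYTE = re.compile(r"[0-9a-fA-F]{2}")


def decode_event_data(dump):
    """Decode an 'offset: hex bytes ascii' dump as UTF-16LE text."""
    raw = bytearray()
    for line in dump.splitlines():
        # Drop the "0000:" offset, then read hex pairs until the ASCII column starts.
        _, _, rest = line.partition(":")
        for token in rest.split():
            if HEX_BYTE.fullmatch(token) is None:
                break  # reached the printable-character column
            raw.append(int(token, 16))
    return raw.decode("utf-16-le")


print(repr(decode_event_data(DUMP)))
```

Decoding the whole block should give the same faulting-application details that already appear in the Description field, so the hex adds no extra clue about the crash.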
May 8th, 2009 9:59am
Are the patch levels the same between the two environments? You said that the stored procedure requires a permission change; was that done in the other environment?
SharePoint Developer | Administrator | Evangelist -- Twitter -- Blog - http://nextconnect.blogspot.com
May 9th, 2009 9:27pm
Thanks for the reply Mike. Unfortunately no. I was only recently made aware that Test/Prod are at the August updates (KB956056, KB956057, KB957109), while Dev/Integration are at the Infrastructure Updates. I will be remedying this shortly.
Regarding permissions, I cannot see why the issue would occur if the Timer Service is running as the MOSS Farm account. I have read about some strange permission issues with the App Pool account not having enough permissions, but surely when it runs as a job, it would be OK.
I did actually locate a Thread.Sleep in one of the custom timer jobs, which I will remove as it's made me suspicious of possible causes of the Timer Job crash. I also suspect Nintex Workflow running on the server could be related to this issue. Nothing yet explains why OWSTimer.exe simply crashes. Cheers.
May 11th, 2009 2:10am
Here's a new addition to the error log:

Event Type: Information
Event Source: .NET Runtime 2.0 Error Reporting
Event Category: None
Event ID: 1001
Date: 11/05/2009
Time: 8:58:25 AM
User: N/A
Computer: XXXSP03
Description: Bucket 03074118, bucket table 4, faulting application owstimer.exe, version 12.0.6318.5000, stamp 4845bb0b, faulting module kernel32.dll, version 5.2.3790.4062, stamp 462643a7, debug? 0, fault address 0x0000000000027d8d.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 42 00 75 00 63 00 6b 00 B.u.c.k.
0008: 65 00 74 00 3a 00 20 00 e.t.:. .
0010: 30 00 33 00 30 00 37 00 0.3.0.7.
0018: 34 00 31 00 31 00 38 00 4.1.1.8.
0020: 0d 00 0a 00 42 00 75 00 ....B.u.
0028: 63 00 6b 00 65 00 74 00 c.k.e.t.
0030: 54 00 61 00 62 00 6c 00 T.a.b.l.
0038: 65 00 20 00 34 00 0d 00 e. .4...
0040: 0a 00 ..
May 11th, 2009 2:14am
Further detail: We thought it was a specific job crashing the Timer Service, but it now seems another job caused it to crash. (This second job too checks the date it last ran and calls a Web Service.) Interestingly, even when running as MOSS Farm we are unable to change the schedule of a Timer Job (using Object Model code). We simply receive a 403 Forbidden on a custom page that allows us to adjust the frequency of jobs. [Under the hood it's a SecurityException at Microsoft.SharePoint.Administration.SPPersistedObject.Update().] It makes me wonder if it's all related.
May 11th, 2009 2:52am
Sounds like there could be something else going on. Try installing those other updates and see where you stand. This spring I spent a lot of time chasing 2-3 issues only to find that they were resolved in the Cumulative Updates.
SharePoint Developer | Administrator | Evangelist -- Twitter -- Blog - http://nextconnect.blogspot.com
May 11th, 2009 2:56am
The title now needs to change to "OWSTimer.exe inconsistently crashes on some custom jobs".
Attempts to resolve the issue by:
- Removing calls to Thread.Sleep
- Removing calls to "LastRunTime"
- Removing any re-throw of exceptions
... have failed. Something seems to be triggering the timer service to crash, and now it seems 3/5 custom jobs can cause the crash, but don't seem to cause it in all cases. (Though one still appears to cause it consistently, which is very strange.) I have raised an MS Support Case... will try to keep this post up to date.
May 11th, 2009 9:56am
Most recent update from MS Support: "The causes varies case by case and they were mostly caused by some custom timer jobs or solution deployed to the SharePoint server farm." (A re-register of ASP.NET into IIS is recommended, however as the crash doesn't occur after disabling the custom job this is not logical). Code has been provided to MS for investigation.
May 18th, 2009 5:50am
Update: Cause was a stack overflow caused by a shared Error Handling library that relied on <appSettings> in a configuration file. The library failed to get the necessary entry and went into an infinite loop, which unfortunately wasn't sufficiently guarded against. (In this case there was no OWSTimer.exe.config file present to prevent the issue.) If only the logs actually dumped out the stack so this issue could be resolved in 5 minutes and not 2 weeks.
May 21st, 2009 11:11am
FYI, Nick Huang @ MS provided some excellent details on how to low-level debug OWSTimer.exe (note the 14GB exception dump):
"We have to use windbg.exe to attach in owstimer.exe process and output details of every exception (C++ exception, CLR exception etc.). Only after going through the exception list, we build up some knowledge what may have gone wrong and take corresponding action.
It took a while to get that. Not to mention what issues we faced ... but the final way it worked is:
1. Create a registry key to allow Timer service to sleep for 15 secs on start (to allow us to attach windbg.exe). There is other way to do same job, but we found it is easiest this way.
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\WSS\SPTimerV3]
"SleepOnStart"=dword:00003a98
2. Stop and restart Timer service.
3. Attach windbg.exe into Owstimer.exe
4. Created a log file for debug output
.logopen c:\owstimer_debug.log
5. Load the .NET sos debug extension
.loadby sos mscorwks
6. Enable C++ exception to dump out native call stack (k) and then continue (g). eh stands for C++ exception
sxe -c "k;g" eh
7. Enable CLR exception to dump out native call stack (k), exception details (!pe), CLR stack (!clrstack) and then continue (g). clr is CLR exception
sxe -c "k;!pe;!clrstack;g" clr
8. hit F5 to let it run till it dies and we were able to get a complete exception list. The exception list is huge (just the text output is 14GB and full of the exceptions)
9. By walking through the exception list, we found that exception to be outstanding. Rest is to review source code & take dump on corresponding exception which is relatively easy.
All these (commands listed above) are debug commands. If you're interested, you can check Windbg help. There is also a very good blog by a MS engineer tess (http://blogs.msdn.com/tess/) on debugging. Probably you're already aware of that, but in case not, you can have a look. It is really a highly recommended blog. :)"
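With a debug log that size, a first pass that counts exceptions by type can help narrow down where to look. The sketch below is an assumption-laden helper, not part of the procedure quoted above: it assumes the `!pe` output in the log contains lines beginning with "Exception type:", and the names `count_exception_types`, `summarize` and `SAMPLE` are made up for illustration. It streams the file line by line so the log never has to fit in memory.

```python
from collections import Counter

PREFIX = "Exception type:"


def count_exception_types(lines):
    """Count '!pe' style 'Exception type: X' lines from any iterable of lines."""
    counts = Counter()
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(PREFIX):
            counts[stripped[len(PREFIX):].strip()] += 1
    return counts


def summarize(path):
    """Stream a log file from disk and print exception types, most frequent first."""
    with open(path, "r", errors="replace") as handle:
        for name, total in count_exception_types(handle).most_common():
            print(total, name)


# Tiny made-up sample standing in for a few lines of the real log.
SAMPLE = [
    "Exception object: 0000000002a1b2c8",
    "Exception type: System.IO.FileNotFoundException",
    "Message: Could not find file.",
    "Exception type: System.Security.SecurityException",
    "   Exception type: System.IO.FileNotFoundException",
]

print(count_exception_types(SAMPLE).most_common())
```

Calling `summarize(r"c:\owstimer_debug.log")` would run it against the log file opened in step 4. Types that occur only once or twice, or that appear right before the process dies, are probably the ones worth reading in full.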
May 21st, 2009 2:08pm
Thanks for this, had the exact same problem, also caused by a stack overflow. Didn't suspect my code because no CLR exception was caught, but this led me back on the right track!
July 29th, 2009 9:19am
Thanks Mike for the details. I had a similar configuration on 64-bit machines, and after long nights and a lot of analysis of all the logs I could get my hands on (Windows and SharePoint) I found the problem was the antivirus application on the same machine (McAfee), which blocks certain workflow files (xoml) as well as outbound mail (SMTP on port 25). It looks like the actual problem is that the OWSTimer service retries and consumes tons of memory, and eventually crashes. That may not apply to everyone, but if you have a local security suite that behaves the same way, try disabling it first and attack one problem at a time. Good luck.
-George Gergues
-SharePoint Architect
January 30th, 2010 1:20am