Cortex PC BSOD overnight :-(

  • marcuslee
    Automated Home Ninja
    • Dec 2009
    • 279

    Cortex PC BSOD overnight :-(

    Hi Guys,

    Woke up this morning to find the dreaded DFP panels' button lights lit and Idratek shown on the LCD display, giving me the heads up that the system was down :-(

    Remote logging into the Cortex PC was a no go. So going to the PC, I found it had BSOD'd with a 'dumping Kernel' message.

    As it's all in Production/not at a scheduled down time (!!), a quick hardware reboot solved it.

    Forensically:
    - I wonder how to check when it went offline? I'm not that familiar with WinXP Event Viewer, but clicking around, I don't see anything
    - I didn't see anything in the Cortex Logs, but then again I wouldn't expect to if it went down suddenly
    - I looked at the temp graphs to see if they might show a blank patch in the data, but there isn't one. However, it appears that just after 5am the line goes dead straight, and stays that way until the reboot this morning.
    - I've now enabled logging on the switch the PC connects to so I've got a means to see it when it went off
    - I don't suppose anyone else has seen this before or any insight? Obviously I'm a bit worried now about this being something that's going to come up time and again
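A rough way to pin down the "line goes dead straight" moment from the temperature data is to look for where the trailing run of identical readings begins. This is only a sketch: it assumes the graph data can be exported as (timestamp, value) pairs, which the thread doesn't confirm, and `flatline_start` and `min_run` are made-up names for illustration.

```python
from datetime import datetime, timedelta


def flatline_start(samples, min_run=3):
    """Return the timestamp where the trailing run of identical readings begins.

    samples: list of (timestamp, value) pairs in time order (assumed export format).
    Returns None if the trailing run is shorter than min_run samples.
    """
    if len(samples) < min_run:
        return None
    last_value = samples[-1][1]
    start = len(samples) - 1
    # Walk backwards while the previous reading equals the final one.
    while start > 0 and samples[start - 1][1] == last_value:
        start -= 1
    run_length = len(samples) - start
    return samples[start][0] if run_length >= min_run else None
```

The timestamp returned is the last genuinely new reading, so the failure would have happened somewhere between that sample and the next one. A room that really is at a steady temperature would look the same, so treat the result as a hint rather than proof.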

    [PC is WinXP SP3, AVG Free antivirus, Cortex 26.7.2. Been rock steady for over a year etc, other than Cortex upgrades and AVG upgrades]

    Going forward:
    - I don't think Reflex would have helped as the system only uses DFPH panels, QLD for turning on/off lights and QRI relays for heating and Cortex for heating temp set points?
    - other than rebuild on the cards for Win7, is there any benefit to using Win7 64bit?
    - also has anyone any experience with Cortex redundant PC setup?
    - also I'm beginning to see some benefits to having Cortex as VM (though I tend to distrust the reliability of VMs!)

    Cheers,

    Marcus
  • marcuslee
    Automated Home Ninja
    • Dec 2009
    • 279

    #2
    - or of course Windows 8.1, if anyone has any insight into going to that (with its faster boot times)?

    Cheers,

    Marcus


    • Paul_B
      Automated Home Legend
      • Jul 2006
      • 608

      #3
      Originally posted by marcuslee View Post
      Hi Guys,

      Forensically:
      - I wonder how to check when it went offline? I'm not that familiar with WinXP Event Viewer, but clicking around, I don't see anything
      • If you have a look in the System Event log of Event Viewer and at the times of events, do you see an obvious gap?
      • If you don't see a gap, do you see EventLog entries with EventID 6005, 6006, 6008 or 6009? Events 6005 and 6009 are generated on every reboot when the OS starts up. Just before or after them you might see an event 6008; if it exists, it might give some more info. See http://support.microsoft.com/kb/196452 for more info
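The event ID check above can be sketched in a few lines. The assumption here is that the System log entries have already been exported from Event Viewer and reduced to (timestamp, event ID) pairs; that parsing step isn't shown, and `unclean_boots` is a made-up helper name. It relies on 6006 being logged at a clean shutdown and 6005 at start-up, so a boot with no 6006 since the previous boot suggests the machine went down hard.

```python
from datetime import datetime

# Meanings of the EventLog IDs mentioned above.
EVENT_MEANING = {
    6005: "Event Log service started (boot)",
    6006: "Event Log service stopped (clean shutdown)",
    6008: "previous system shutdown was unexpected",
    6009: "OS version information logged at boot",
}


def unclean_boots(events):
    """Return timestamps of boots (6005) not preceded by a clean shutdown (6006).

    events: iterable of (timestamp, event_id) pairs, in any order.
    The first boot seen is skipped, because nothing is known about what came before it.
    """
    unclean = []
    seen_boot = False
    clean_stop = False
    for ts, event_id in sorted(events):
        if event_id == 6006:
            clean_stop = True
        elif event_id == 6005:
            if seen_boot and not clean_stop:
                unclean.append(ts)
            seen_boot = True
            clean_stop = False
    return unclean
```

The last event logged before an unclean boot then gives the earliest time the machine could have gone down, although on a quiet system that can be days earlier.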


      Originally posted by marcuslee View Post
      [PC is WinXP SP3, AVG Free antivirus, Cortex 26.7.2. Been rock steady for over a year etc, other than Cortex upgrades and AVG upgrades]
      • If this is the first time this has happened it might have been a one off, for example a bit flip on a cluster written to disk (OS sends a 1 to disk but disk records a 0 due to electronic interpretation). It might not happen again for another year or more.



      Originally posted by marcuslee View Post
      Going forward:
      - other than rebuild on the cards for Win7, is there any benefit to using Win7 64bit?
      • Not really because Cortex is a 32 bit application, plus drivers for hardware are probably more mature in 32 bit guise than 64 bit.


      Originally posted by marcuslee View Post
      - also I'm beginning to see some benefits to having Cortex as VM (though I tend to distrust the reliability of VMs!)
      • To counter this thinking a VM can make Cortex harder to setup and run because of the abstraction layer and trying to pass-through devices. In addition if the memory dump was down to a hardware problem a VM wouldn't have helped




      To check whether your machine is set up to generate a memory dump file, and to find where it is written and which tools can read it, refer to http://support.microsoft.com/kb/254649
      Last edited by Paul_B; 22 October 2013, 09:02 PM.


      • marcuslee
        Automated Home Ninja
        • Dec 2009
        • 279

        #4
        Originally posted by Paul_B View Post
        • If you have a look in the System Event log of Event Viewer and at the times of events, do you see an obvious gap?
        • If you don't see a gap, do you see EventLog entries with EventID 6005, 6006, 6008 or 6009? Events 6005 and 6009 are generated on every reboot when the OS starts up. Just before or after them you might see an event 6008; if it exists, it might give some more info. See http://support.microsoft.com/kb/196452 for more info
        Thanks for reply Paul.

        In answer:
        - no obvious gap, in so far as there were no log entries past 14/10/13 (I guess the last time something happened) until the reboot this morning and the 6009 log entry
        - thanks for the MS kb link

        Originally posted by Paul_B View Post
        • If this is the first time this has happened it might have been a one off, for example a bit flip on a cluster written to disk (OS sends a 1 to disk but disk records a 0 due to electronic interpretation). It might not happen again for another year or more.
        - thanks muchly for the reassurance and insight. Makes me feel better than previously where I saw a rebuild as being of extremely high priority


        Originally posted by Paul_B View Post
        • Not really because Cortex is a 32 bit application, plus drivers for hardware are probably more mature in 32 bit guise than 64 bit.
        Got it. My thinking also, but I was surprised to see a recent reply from Karam about building Cortex on 64bit Windows (I wasn't sure if that meant Cortex was 64bit available/compatible, or if it was simply running 32bit mode on 64bit OS). So I wasn't sure if there was some meat in moving to 64bit.


        Originally posted by Paul_B View Post
        • To counter this thinking a VM can make Cortex harder to setup and run because of the abstraction layer and trying to pass-through devices. In addition if the memory dump was down to a hardware problem a VM wouldn't have helped
        - understood with this. Also to be frank I dislike VMs for that reason. Also I find vmplayer not great (used on a daily basis, over months, I'll always find something odd happening like USB pass through failing). VMware ESXi though in my limited use, seems to be better, though I didn't punish it nearly as much as VM player on Windows to know.
        - additionally in the banking and finance sector I worked at a bank which used VMs exclusively (I'm in networks, so not very OS or App savvy). At some point I did ask the Linux and Windows teams there about how robust that was. Their answer was: at another bank (of higher standing), they never used VMs. At most VMs were only used for lab purposes. And it was sufficient to say, they absolutely stood by this (even though they were admin'ing in this bank's VM environment).
        - thanks though for the information on it not helping.

        Originally posted by Paul_B View Post
        To check whether your machine is set up to generate a memory dump file, and to find where it is written and which tools can read it, refer to http://support.microsoft.com/kb/254649
        Thanks for the link. I had a look and the last mini dump file was from back in September 2012. So it appears that it was definitely a completely catastrophic failure that came out of the blue.


        I should add I've been meaning to start a "Best Practice for Cortex?" thread, to assist with, or at least possibly produce, a template that new Idratek people can work from as a reference platform. For instance:
        - I have a shortcut to the Cortex log file on my desktop for a quick double click to check the logs
        - I usually use Ctrl+F in the logs to search for "error" to grab what's going on
        - Also a shortcut to Event Viewer
        - also I guess I should (but haven't so far), have some form of monitoring or at least some event notification taking place if the Cortex PC is running low on disk, memory, or possibly even high temp (if that exists)
        - also what's the preferred OS people are using.
        - and I suppose I should setup proper error handling from Cortex also
        - disable auto updates and only update Cortex, OS, anti virus etc when necessary or at least vetted and confirmed to be ok?
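As a starting point for the disk monitoring idea in the list above, here is a minimal sketch using Python's standard `shutil.disk_usage`. The 10% threshold, the function names and the message wording are all assumptions for illustration, and how the warning gets delivered (email, a Cortex event, etc.) is left out. It avoids newer syntax so it should also run on the older Python 3 releases that still install on XP.

```python
import shutil


def free_space_warning(total_bytes, free_bytes, min_free_fraction=0.10):
    """Return a warning string if free space is below the threshold, else None."""
    fraction = free_bytes / total_bytes
    if fraction < min_free_fraction:
        return "Low disk space: {:.0%} free".format(fraction)
    return None


def low_disk_warning(path, min_free_fraction=0.10):
    """Check the drive holding `path`, e.g. "C:\\" on the Cortex PC."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return free_space_warning(usage.total, usage.free, min_free_fraction)
```

Run from a scheduled task every hour or so, `low_disk_warning("C:\\")` returns None while all is well and a short message once free space drops below the threshold.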

        I think it'd assist greatly, and I'd like the opportunity to pool efforts. If there were a reference platform, we could help out those who might go to newer versions of Cortex, system updates etc by flagging any gotchas and whether to hang back. That applies especially to those on the forum who have more resources (such as VMs to support easier roll back), versus those who are running a single PC, where if you somehow trash the system you're completely stuck.

        I should add, my own strategy has been:
        - single dedicated PC (as I deem Cortex to be a Priority 1 device ie the function it serves is too important since in my case it covers climate control for babies etc)
        - regular copy to separate NAS of the Idratek Database
        - I also have an identical cold spare PC to be used in the event of catastrophic failure. It's a mirror image, but would only require a database copying to

        As of this morning however I've determined it's flawed, as:
        - the cold spare hasn't had its Cortex updated, so it actually couldn't be enacted at time of failure, as the production database relies on a minimum version of Cortex (and I'm not 100% sure where to find what minimum version is required, actually)
        - Node 0 doesn't have any lighting that works independently of Cortex, so I was fumbling in the dark!
        - also what bothers me, is that if I run the cold spare up to the same level as Production Cortex PC, it could very well be that a bug which kills the Production PC, is introduced into the cold spare. So I suppose best practice would dictate to leave it one compatible revision older??
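On the cold spare version question, one small trap when checking "is the spare new enough" is comparing version strings as text, where "26.10.0" sorts before "26.9.1". A sketch of a numeric comparison follows. It assumes Cortex versions are plain dotted numbers like 26.7.2, and it doesn't address where the database's minimum required version is actually recorded, which is still an open question here. Both function names are made up.

```python
def parse_version(text):
    """Turn a dotted version such as "26.7.2" into a tuple of ints: (26, 7, 2)."""
    return tuple(int(part) for part in text.strip().split("."))


def spare_can_open(spare_version, minimum_required):
    """True if the spare's Cortex version meets the assumed minimum for the database."""
    return parse_version(spare_version) >= parse_version(minimum_required)
```

With this, keeping the spare "one compatible revision older" just means picking the oldest version for which `spare_can_open` is still true against the production database's minimum.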


        • mcockerell
          Automated Home Sr Member
          • Jan 2009
          • 74

          #5
          For what it's worth:

          I used to have problems with our Cortex box hanging occasionally (WinXP SP3) which I suspected were due to hard disk reliability problems.
          I switched to a solid-state disk last Christmas and, touch wood, have experienced no problems since.
          The system is backed up automatically every night to our home server, and I also 'lock' a backup when I update Cortex or make significant changes.

          One useful side effect of using the SSD is that the system boots very quickly now.
          I disabled Windows Update some time ago - on more than one occasion an update prevented Cortex from running.


          • marcuslee
            Automated Home Ninja
            • Dec 2009
            • 279

            #6
            Also Karam, Viv, if you get to reading this; Oddly post crash in the morning, there were some parts of the house which had their QLD channels lit (where previously when we went to bed, they were off).

            Not sure if there's any insight as to how/why that should be?

            Marcus


            • pbj
              Automated Home Sr Member
              • Jul 2004
              • 57

              #7
              Hi Marcus,

              Interesting to read your backup precautions. My PC occasionally BSODs, but always seems to restart itself (not sure if this is a w7 thing or I've got lucky with a setting somewhere!).

              On the one occasion so far that the PC didn't restart itself, reflex was plenty capable enough to allow lights to be turned on and off, presence to work at a basic level etc. Your story reminds me I must at the very least auto generate the reflex vectors again. I probably should put the effort in to understand reflex more fully too!

              Peter.


              • Karam
                Automated Home Legend
                • Mar 2005
                • 863

                #8
                Marcus, do you have defibrillator running on your machine? If the problem affected Cortex in the first instance then defibrillator will have tried to reboot the PC and possibly something might have gone AWOL after that. However in that case defibrillator will also have written a log file with defibrillator in the name before trying the reboot. So I guess either it wasn't running or whatever crashed the machine was some more global event.

                We noticed a couple of bugs in 26.7.2 - one was: having moved ourselves to a newer version of jquery (Cortex mobile related) we found that sliders in the Cortex mobile interface didn't seem to update values correctly so we have rolled back to a previous version. Also a bug was introduced which meant that Cortex would immediately fail an object if it failed to receive a communication acknowledge at first attempt no matter the retry settings in network supervisor. The symptoms were then that people suddenly started finding Cortex reporting failed modules (flashing icons) every so often where their system had been running just fine before. So there is an update now to correct these - 26.7.3 in case you haven't noticed it yet.

                Regarding running Cortex on other platforms, I can't say I'm an expert on the details but yes Cortex is a 32 bit application and as to whether it would benefit from a 64bit environment I'd say would have more to do with other processes that are running on that platform which can take advantage of that - in other words indirect speed benefits for example. We have a test installation running on Windows 8.0 (don't dare to try the upgrade to 8.1 just yet from the various reports I've heard). There were a number of initial hurdles to overcome mainly relating to more pernickety program privilege requirements but otherwise the installation has been running fine for a couple of months now. Incidentally the machine in question is an HP Pavilion laptop running an i3 processor which was purchased at the time for around £320 inc VAT. When you consider you also get a screen and integral battery backup IMHO it's quite good value, and needless to say the performance improvement relative to its XP predecessor (otherwise reliable for over two years) was very noticeable - especially so when it came to camera handling, the feel of the Cortex mobile interface and the overall machine. There is still plenty to irritate with Windows 8 itself (IMO) but seeing as the machine is dedicated to Cortex that doesn't really matter much.


                • marcuslee
                  Automated Home Ninja
                  • Dec 2009
                  • 279

                  #9
                  Originally posted by mcockerell View Post
                  I used to have problems with our Cortex box hanging occasionally (WinXP SP3) which I suspected were due to hard disk reliability problems.
                  I switched to a solid-state disk last Christmas and, touch wood, have experienced no problems since.
                  I've also considered SSD, given the light requirements of Cortex + OS not being significantly large (and therefore SSD price worthy!). And whilst lack of mechanical failure = good, I was wondering about SSD failure. I'm aware of a colleague's use of SSD for syslog server and he believed it came to complete failure due to small file write nature, which I thought Cortex might end up doing.

                  Quick re-read, I see there's been gains in SSD reliability, but not sure if it's sufficient? Hence I was going to live through slower boots for a trade for reliability.

                  The other thing is that I don't think faster execution after boot helps performance too much (at least on non Chris Hunter sized Idratek installs!), since I believe the Idratek serial comms are the bottleneck?

                  Originally posted by mcockerell View Post
                  The system is backed up automatically every night to our home server, and I also 'lock' a backup when I update Cortex or make significant changes.

                  One useful side effect of using the SSD is that the system boots very quickly now.
                  I disabled Windows Update some time ago - on more than one occasion an update prevented Cortex from running.
                  Could I ask if this is a manual backup? And is it just the Cortex database or something more?

                  Agreed with windows Update!



                  Originally posted by pbj View Post
                  Interesting to read your backup precautions. My PC occasionally BSODs, but always seems to restart itself (not sure if this is a w7 thing or I've got lucky with a setting somewhere!).
                  Indeed which is what I found so disturbing in this particular black out. I'm used to BSOD, and then some sort of dumps which can take an eternity, but they usually do come back to life. But in this case it was 2+ hours in and it was still stuck :-(

                  Originally posted by pbj View Post
                  On the one occasion so far that the PC didn't restart itself, reflex was plenty capable enough to allow lights to be turned on and off, presence to work at a basic level etc.
                  You're absolutely right. For the readers of the thread, I should concur with this: DFP mapped light buttons will continue to work. Unfortunately for me though, heating doesn't.



                  Originally posted by Karam View Post
                  Marcus, do you have defibrillator running on your machine? If the problem affected Cortex in the first instance then defibrillator will have tried to reboot the PC and possibly something might have gone AWOL after that. However in that case defibrillator will also have written a log file with defibrillator in the name before trying the reboot. So I guess either it wasn't running or whatever crashed the machine was some more global event.
                  Thanks for the reply Karam. It is running (as part of the standard install, so untouched in that regard), and indeed there are no logs, so as you say it points to a global event.

                  Originally posted by Karam View Post
                  We noticed a couple of bugs in 26.7.2 - one was: having moved ourselves to a newer version of jquery (Cortex mobile related) we found that sliders in the Cortex mobile interface didn't seem to update values correctly so we have rolled back to a previous version. Also a bug was introduced which meant that Cortex would immediately fail an object if it failed to receive a communication acknowledge at first attempt no matter the retry settings in network supervisor. The symptoms were then that people suddenly started finding Cortex reporting failed modules (flashing icons) every so often where their system had been running just fine before. So there is an update now to correct these - 26.7.3 in case you haven't noticed it yet.
                  I haven't got to using Cortex Mobile extensively yet, and no, haven't had failed modules, so indeed haven't had to move to 26.7.3 yet.

                  Originally posted by Karam View Post
                  Regarding running Cortex on other platforms, I can't say I'm an expert on the details but yes Cortex is a 32 bit application and as to whether it would benefit from a 64bit environment I'd say would have more to do with other processes that are running on that platform which can take advantage of that - in other words indirect speed benefits for example. We have a test installation running on Windows 8.0 (don't dare to try the upgrade to 8.1 just yet from the various reports I've heard). There were a number of initial hurdles to overcome mainly relating to more pernickety program privilege requirements but otherwise the installation has been running fine for a couple of months now.
                  A couple of months? hmmm... well I'd trade a couple of months Win8 install vs another user's report of Win7 install known to be running for couple of years.

                  Just kidding of course, thanks for feedback. It does sound like either platform will suffice, though realistically speaking other than faster boot with Win8, I would probably put forward Win7 for the time being, until the leap-aheaders like yourself etc have come back and let us all know Win8 can hit the same reliability benchmarks :-S :-D

                  Originally posted by Karam View Post
                  Incidentally the machine in question is an HP Pavilion laptop running an i3 processor which was purchased at the time for around £320 inc VAT. When you consider you also get a screen and integral battery backup IMHO it's quite good value, and needless to say the performance improvement relative to its XP predecessor (otherwise reliable for over two years) was very noticeable - especially so when it came to camera handling, the feel of the Cortex mobile interface and the overall machine. There is still plenty to irritate with Windows 8 itself (IMO) but seeing as the machine is dedicated to Cortex that doesn't really matter much.
                  I should add I'm somewhat agreed on this. As a compromise (vs running enterprise servers etc), this is what I go for also:
                  - being a notebook, it's already been tuned with low power consumption in mind (for those of us energy conscious)
                  - it's a quick swap hard drive if need be - Thinkpads, and I've seen others, have a 1 screw removal hard drive swap
                  - in my case I go for Thinkpads as it is tested to destruction, but also that aside (as I don't advocate throwing laptops around to begin with!), they also offer onsite hardware repair. And I think it's 3 years, and it's global regardless of where it was purchased to where it's now homed (I've authenticated this), and regardless of whether you purchased it new / second hand / however you came to be the owner of it (!!), as it tracks the machine, not owner (also authenticated).


                  • Paul_B
                    Automated Home Legend
                    • Jul 2006
                    • 608

                    #10
                    Just an update on my experience of SSDs. I originally purchased an OCZ and had no end of problems including BSoD. I then purchased an Intel SSD and haven't had any problems.


                    • mcockerell
                      Automated Home Sr Member
                      • Jan 2009
                      • 74

                      #11
                      Originally posted by marcuslee View Post
                      Could I ask if this is a manual backup? And is it just the Cortex database or something more?
                      We have a Windows Home Server which holds all our media files (mainly music) and shared data - during the daily backup window it backs up every attached PC.
                      As the Cortex PC is always on-line we get a full system backup every day (it's actually incremental) and this is sufficient to restore to a new disk if necessary.
                      The server automatically manages the backup retention periods. I believe that it's possible to set-up something similar with a Windows 8 system.
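For anyone without a home server, the retention idea (keep recent backups, plus any that were deliberately "locked" around a Cortex update) can be sketched as below. This is not how Windows Home Server does it internally; the 14 day window, the function name and the use of plain dates to identify backups are all assumptions for illustration.

```python
from datetime import date, timedelta


def backups_to_delete(backup_dates, today, keep_days=14, locked=()):
    """Return the backup dates that are older than the retention window and not locked.

    backup_dates: dates of the existing backups.
    locked: dates of backups to keep regardless of age (e.g. taken before an upgrade).
    """
    cutoff = today - timedelta(days=keep_days)
    locked = set(locked)
    return sorted(d for d in backup_dates if d < cutoff and d not in locked)
```

A nightly job would call this after the new backup has completed and only then remove whatever it returns, so a failed backup never causes older copies to be pruned.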


                      • marcuslee
                        Automated Home Ninja
                        • Dec 2009
                        • 279

                        #12
                        Originally posted by mcockerell View Post
                        We have a Windows Home Server which holds all our media files (mainly music) and shared data - during the daily backup window it backs up every attached PC.
                        As the Cortex PC is always on-line we get a full system backup every day (it's actually incremental) and this is sufficient to restore to a new disk if necessary.
                        The server automatically manages the backup retention periods. I believe that it's possible to set-up something similar with a Windows 8 system.
                        Thanks for reply. I wasn't aware of this feature and it's definitely of interest (if not a must have really).

                        Also to add, about the laptop bit, another couple of things:
                        - Lenovo battery utility comes with a few options, one of which is change charge thresholds to try to "Optimize for battery lifespan". Not sure if it fully works, but for what it's worth
                        - also one final big one: it's BIOS set-able to "power the laptop on when AC power is attached". Something I've found missing from a lot of laptops vs their desktop cousins is the ability to "restore PC powered up state as it was prior to power cut". ie when a laptop loses AC power, its battery depletes, and Windows either goes into hibernate (or is left to run through to nothing), it won't ever come back on when power is finally restored


                        • JonS
                          Automated Home Guru
                          • Dec 2007
                          • 202

                          #13
                          Re Lenovo laptops, do they all have these features (esp BIOS one) or just more business oriented models?
                          Very happy with my existing laptop but it will only run XP, so looking for a change in the next 6 months.
                          Thanks
                          JonS


                          • marcuslee
                            Automated Home Ninja
                            • Dec 2009
                            • 279

                            #14
                            Originally posted by JonS View Post
                            Re Lenovo laptops, do they all have these features (esp BIOS one) or just more business oriented models?
                            Very happy with my existing laptop but it will only run XP, so looking for a change in the next 6 months.
                            I'm not sure actually, but I am aware of the old Thinkpad SL series, not having this feature.

                            Having said that there's this Lenovo portal which allows you to boot the BIOS for machines so you can see what they look like:



                            On another note, I note that no one replied to say they're using the dual Cortex PC setup? I wonder about the reliability of the setup?

