Faulting nodes (cabling)

  • Kevin
    Moderator
    • Jan 2004
    • 558

    Faulting nodes (cabling)

    Earlier this week my IDRANet network became really unresponsive and Cortex indicated that a couple of nodes had failed. The effect on the rest of the network was awful, with buttons not working, long delays, and indicators not matching the current states. I quickly checked the wiring on these two nodes, which seemed fine, and restarted Cortex a couple of times to no avail. So I switched to 'fault the node and stop communicating' and things settled down. I was, however, surprised by how disruptive this was to the remaining, unrelated network nodes. A lesson learnt: I shall disable continual retries, as I couldn't afford this to happen if I wasn't at home.

    Anyway, on to finding the problem. I was surprised the structure view didn't identify failed nodes somehow (e.g. with a red cross) and I couldn't see an obvious way to find this info... but once I had selected the 'stop communicating/advise user' option, the plan view flashed the failed sub-nodes. The two failing nodes are both connected to one six-way junction box, so I decided to check the other nodes on that box from within Cortex. The two errant nodes showed 'node faulted' (IIRC) in their properties, but other attached hardware (ITR02s) did not; these instead showed node status '0', which I think means 'OK'. They were also not flashing in the plan view, so I assumed they were OK. I reduced the polling period to 5 seconds just to be sure: still no fault dialog, no fault showing in properties and no flashing in the plan view. Stupidly, I also assumed that the rapidly flashing red light on an attached RS232 interface was indicative of communication.

    So I then assumed I had either a coincident double fault or perhaps a track fault on the junction box. Red herring. In fact I had lost IDRANet comms to the whole junction box because of a broken wire, so all five attached devices had failed, but Cortex only advised me of two, and only those two flashed in the plan view. It's likely that on a network restart it would have advised me of all five, but I didn't take that path. I'm not sure if this is as expected? It just seems a little misleading, though it was fairly quickly resolved, as two coincidental failures are always an unlikely scenario.

    K
  • Paul_B
    Automated Home Legend
    • Jul 2006
    • 608

    #2
    Thanks for sharing your experience, Kevin. It will be interesting to see Viv's response to your observations.

    Paul

    • chris_j_hunter
      Automated Home Legend
      • Dec 2007
      • 1713

      #3
      we've had ours go unresponsive etc, just like that, several times ...

      frustrating & surprisingly disruptive, but always due (in our case) to a break in one of the comm's connections occurring sometime after successful commissioning (we're still building & adding to our network, so occasional disturbance is probably inevitable) ...

      sometimes because of a broken wire (solid core Cat-5e is fairly prone to this, when disturbed, we find, perhaps brought-on by the copper occasionally acquiring a slight nick during stripping), once because of something marginal in an 8WS distribution box, a couple of times due to marginal connections in a green plug (insulation being caught by the screw) ...

      we've found the Idratek resistor-plug has been a great help in isolating such problems ...

      we have our network set for two retries, because we found some (many, maybe all) of our nodes fail to respond (in-time) every now & then ...

      Chris
      Last edited by chris_j_hunter; 11 August 2012, 07:19 PM.
      Our self-build - going further with HA...

      • Karam
        Automated Home Legend
        • Mar 2005
        • 863

        #4
        I believe 'Fault node and stop communicating with it' is the default setting, but as with many things, Cortex provides the flexibility to do otherwise, for example for diagnostic purposes. Anyhow, with this setting selected, if Cortex can't communicate with a node (after a few retries) it will 'fault' the node: it stops trying to communicate with it, makes any associated object icons flash in the database, and generates error messages, both audible and text. These are also logged (assuming default logging settings), together with pre- and post-event history of system data packets to aid diagnosis.

        If the nodes themselves are faulted then that should be the end of any disruption; however, if there is a physical fault, then depending on its nature it may impact other networked modules. A broken wire may have no effect after the initial faulting process, but shorts between certain pairs could. So the recommendation for any serious installation is to use an IPS (or IPD) as the system power supply and bus spurring source, because when faults are detected it has the ability to attempt more active corrective actions, such as resetting power to the spur containing the suspect device and, if communications or power are being disrupted by some physical fault, isolating that spur. These actions would normally be undertaken without user intervention, for the very reasons mentioned.
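        The poll-retry-fault sequence described above can be sketched roughly as follows. This is purely an illustrative outline of the behaviour as described, not the real Cortex implementation: the function names, the retry count, and the node dictionary fields are all assumptions.

```python
# Illustrative sketch (not Cortex code) of "fault node and stop
# communicating": after a few failed retries, the node is faulted,
# its icons are flagged, and a fault handler (e.g. IPS-driven spur
# recovery) is invoked. MAX_RETRIES is a hypothetical value.

MAX_RETRIES = 3

def poll_node(node, send, on_fault):
    """Try to reach a node; fault it after MAX_RETRIES failures.

    `send` returns True if the node acknowledged; `on_fault` stands
    in for the advise-user / log / spur-recovery actions.
    """
    for _attempt in range(MAX_RETRIES):
        if send(node):              # node acknowledged: all is well
            node["faulted"] = False
            return True
    node["faulted"] = True          # stop communicating, flash icons
    on_fault(node)                  # log, advise user, try recovery
    return False
```

        A dead node (one that never acknowledges) would be faulted after the retries and then left alone, which matches the "that should be the end of any disruption" behaviour, provided no physical fault keeps disturbing the shared bus.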

        If, as Chris indicates, there seem to be a lot of retries, then some investigative work probably needs to be undertaken. The reasons for retries are unlikely to be network congestion (even though I know Chris has 200 or so nodes online) but rather other causes. For example, node buffer-full conditions can occur if too many locally time-consuming commands are sent to a module in quick succession (in which case the module returns a buffer-full NACK to Cortex, and Cortex retries a little while later). Alternatively, a different type of NACK, e.g. a code indicating that the module simply didn't acknowledge, could point to something flaky in the communication channel to that module. If these appear frequently amongst a set of nodes on a common cable, that is a good indication to check the connections on that cable before (upstream of) the earliest node showing the problems.
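        The diagnostic reasoning in that paragraph can be sketched as two small helpers: one classifying the two NACK cases mentioned, and one locating the cable segment to check. The NACK code names and node identifiers are hypothetical, chosen only to illustrate the logic.

```python
# Illustrative sketch (not the real protocol codes) of the two NACK
# cases described: buffer-full means the module is busy and Cortex
# should retry later; no-acknowledge suggests a flaky comms channel.

BUFFER_FULL = "buffer_full"   # hypothetical code names
NO_ACK = "no_ack"

def classify_nack(nack_code):
    """Map a NACK code to the suggested interpretation."""
    if nack_code == BUFFER_FULL:
        return "module busy: retry after a short delay"
    if nack_code == NO_ACK:
        return "suspect communication channel to module"
    return "unknown NACK: log for diagnosis"

def suspect_cable_segment(no_ack_nodes, cable_order):
    """Given nodes showing frequent no-acks and the order of nodes
    along a common cable (closest to source first), return the
    earliest affected node: the wiring before it is the place to
    check."""
    affected = [n for n in cable_order if n in no_ack_nodes]
    return affected[0] if affected else None
```

        For instance, if nodes B and D on a cable running A-B-C-D both show frequent no-acks, the sketch points at the connections before B, much as the broken wire at Kevin's junction box took out everything downstream of it.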
        Last edited by Karam; 15 August 2012, 12:00 AM.

        • chris_j_hunter
          Automated Home Legend
          • Dec 2007
          • 1713

          #5
          Interesting ('though we're still some way off 200) ...

          we don't get many auto-logs these days (we're pretty-much de-bugged, now) - those we do are just-about always down to power-cuts (rural location, building-work ongoing, etc) & our SLDs ...

          last week, we had three (auto-logs) and, when we thought to check & went through them, we found one had messages relating to one of the AUIs (speaker still not working, we had it out of its cover, it seemed fine, but amongst other things tried the reset button), and two had messages relating to SLDs, with mention of possible power or bulb failure - all bulbs still working, so presumably the former, 'though none had been noticed or mentioned (the Cortex PC runs off a UPS, and the IPSs protect the network) ...

          we didn't notice any retries relating to other modules, but we weren't looking for them, either - it's maybe possible those in the past all related to physical comm's disconnects, all now sorted, but there's no way to tell, now, because after identifying and/or solving a problem we routinely delete the logs (to save space) ... perhaps we should stop doing this but, either-way, we'll keep an eye hereon ...

          we do keep an eye on CPU usage - via Task Manager - from time to time, and generally (with the PC being dedicated to Cortex & its needs) it's around 30%, with occasional forays to rather more, and some excursions to around 100% when doing (eg) a Save ...
          Our self-build - going further with HA...
