On-Prem Connector Host Resiliency

Hi Team,

We use an On-Prem Connector Host for each of our instances, and in general it works great. Every so often, however, we face issues with a connector host and it results in a loss of connectivity (and therefore machine data) for the duration of the outage.

It would be great to be able to deploy multiple OPCHs per instance, and to automatically fail over between them if an issue is detected with the primary CH. This would decrease disruption to our end-users and buy us valuable time to adequately troubleshoot and resolve issues.

Hey Dan -

Thanks for the suggestion. A few notes on how we have thought about approaching this, and a question for you -

  • One approach we have considered is adding a connector host state change as an event that can trigger an automation. This would allow you to do a whole lot more than assigning a new connector host, including sending an email to IT stakeholders, tracking any potential outages to a table or other external system, etc. (see the rough sketch after this list). The downside here is that it is build-it-yourself, but the upside is that you get tons of flexibility.
  • Another approach that we have considered is just building this as a point solution on top of the connector host UI, as you propose. Less flexible, but easier to configure.
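To make the first option concrete, here is a minimal sketch of what such an automation-triggered failover could look like. To be clear, the event payload, the reassignment endpoint, the connector host IDs, and the addresses below are all placeholders I made up for illustration, not real Tulip APIs:

```python
# Hypothetical handler for a "connector host state changed" automation trigger.
# Endpoints, payload fields, hostnames, and IDs are placeholders, not real APIs.
import smtplib
from email.message import EmailMessage

import requests

INSTANCE_URL = "https://example.tulip.co"      # placeholder instance URL
API_TOKEN = "replace-me"                       # placeholder credential
BACKUP_CONNECTOR_HOST_ID = "ch-backup-01"      # hypothetical backup host ID


def on_connector_host_state_change(event: dict) -> None:
    """React to a (hypothetical) connector-host-down event."""
    if event.get("newState") != "DISCONNECTED":
        return

    failed_host = event["connectorHostId"]

    # 1. Reassign affected connectors to the backup host (hypothetical endpoint).
    requests.post(
        f"{INSTANCE_URL}/api/v1/connector-hosts/{failed_host}/reassign",
        json={"targetConnectorHostId": BACKUP_CONNECTOR_HOST_ID},
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )

    # 2. Notify IT stakeholders so the outage can be tracked and investigated.
    msg = EmailMessage()
    msg["Subject"] = f"Connector host {failed_host} down; failed over to backup"
    msg["From"] = "alerts@example.com"
    msg["To"] = "it-team@example.com"
    msg.set_content(f"Event payload: {event}")
    with smtplib.SMTP("mail.example.com") as smtp:
        smtp.send_message(msg)
```

The same trigger could just as easily write to a table or call an external monitoring system instead of (or in addition to) reassigning the host.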

Do you have a high level preference here as far as approach?

Both of these approaches come with the added challenge that the connector hosts will need to be the same version (with the same functionality) to ensure that all of the associated connectors will be supported. When creating functions, we automatically check what capabilities the assigned connector host has and block those that are not possible on the respective connector host; if we allowed redundancy, we would need to add a further layer of checking across both the primary and secondary connector host.
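As a rough illustration of that extra layer of checking (capability names here are invented), a redundant pair would only be able to safely expose the capabilities that both hosts support:

```python
# Illustrative only: validate a redundant pair before allowing a connector
# function. The capability names are made up for the example.
PRIMARY_CAPABILITIES = {"http", "sql", "opc-ua"}
SECONDARY_CAPABILITIES = {"http", "sql"}  # e.g. an older build without OPC UA


def allowed_capabilities(primary: set[str], secondary: set[str]) -> set[str]:
    """Only capabilities supported by *both* hosts are safe to expose;
    otherwise a failover could leave some connector functions unrunnable."""
    return primary & secondary


print(allowed_capabilities(PRIMARY_CAPABILITIES, SECONDARY_CAPABILITIES))
# {"http", "sql"} - any "opc-ua" functions would be blocked or flagged
```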

We do have a number of customers who are running OPCH instances within contained pods that monitor the connector host health and will automatically spin up new instances of the OPCH if the health of the connection to Tulip is poor, which addresses this problem fairly well. Not zero downtime, but well under one minute. I can connect you with some resources at those customers if you would be interested in their deployment details.
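For reference, the general watchdog pattern those deployments use looks roughly like the sketch below: poll the connector host's link to the Tulip instance and recreate the container when it looks unhealthy. The container name, instance URL, and thresholds are placeholders, and an orchestrator would typically express this as a failing liveness probe rather than an explicit restart loop:

```python
# Rough sketch of a connector host watchdog. Names, URLs, and thresholds
# are placeholders; adapt to your own deployment.
import subprocess
import time

import requests

INSTANCE_URL = "https://example.tulip.co"   # placeholder instance URL
CONTAINER_NAME = "tulip-connector-host"     # placeholder container name
FAILURES_BEFORE_RESTART = 3
POLL_INTERVAL_S = 10


def tulip_reachable() -> bool:
    """Cheap proxy for 'connection to Tulip is healthy': can we reach the instance?"""
    try:
        return requests.get(INSTANCE_URL, timeout=5).status_code < 500
    except requests.RequestException:
        return False


def main() -> None:
    consecutive_failures = 0
    while True:
        if tulip_reachable():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURES_BEFORE_RESTART:
                # Recreate the connector host container; in a pod this would
                # instead surface as a liveness probe failure.
                subprocess.run(["docker", "restart", CONTAINER_NAME], check=False)
                consecutive_failures = 0
        time.sleep(POLL_INTERVAL_S)


if __name__ == "__main__":
    main()
```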

Pete


Hey Pete,

Thanks for the thorough response.

I think the first option (triggering an automation) is inherently preferable, provided that cutting over to a new CH is an available action. This would be particularly valuable for customers that may not have mature infrastructure monitoring tools outside of Tulip, though for us it would mostly be a matter of convenience to be able to configure alerts, etc. in Tulip.

I’ll get in touch to discuss HA options in more depth, as that is a preferable setup for us in most scenarios.