diff --git a/user/pages/01.home/burble-dn42-services/default.md b/user/pages/01.home/burble-dn42-services/default.md
index 73dd195..afcade3 100755
--- a/user/pages/01.home/burble-dn42-services/default.md
+++ b/user/pages/01.home/burble-dn42-services/default.md
@@ -19,7 +19,7 @@ Longer term, regional replicas of the DN42 site may be provided however this is
 
 ## Looking Glass
 
-[lg.burble.com](https://lg.burble.com) (public internet link)
+[lg.burble.com](https://lg.burble.com) (public internet link)
 [lg.burble.dn42](https://lg.burble.dn42) (dn42 link)
 
 The burble.dn42 looking glass is based on [bird-lg](https://github.com/sileht/bird-lg) with patches by
diff --git a/user/pages/01.home/grafana-service/DN42 Monitoring 190524.png b/user/pages/01.home/grafana-service/DN42 Monitoring 190524.png
new file mode 100644
index 0000000..10a44d5
Binary files /dev/null and b/user/pages/01.home/grafana-service/DN42 Monitoring 190524.png differ
diff --git a/user/pages/01.home/grafana-service/default.md b/user/pages/01.home/grafana-service/default.md
new file mode 100644
index 0000000..05d0d1f
--- /dev/null
+++ b/user/pages/01.home/grafana-service/default.md
@@ -0,0 +1,75 @@
+Details of the burble.dn42 hosted Grafana service.
+
+===
+
+# Hosted Grafana Service
+
+|Host / URL|Service|
+|:--|:--|
+|[http://grafana.burble.dn42](http://grafana.burble.dn42)|Grafana Dashboards (dn42 link)|
+|[https://grafana.burble.com](https://grafana.burble.com)|Grafana Dashboards (public internet link)|
+|influx.burble.dn42:8086|InfluxDB Endpoint|
+
+The hosted grafana service provides an [InfluxDB](https://www.influxdata.com/) and
+[Grafana](https://grafana.com/) combination for storing and displaying stats and metrics.
+The service can accept metrics from any source that is able to
+[publish](https://docs.influxdata.com/influxdb/v1.7/supported_protocols/) to the InfluxDB, including
+[Prometheus](https://prometheus.io/) and
+[Telegraf](https://www.influxdata.com/time-series-platform/telegraf/).
+
+To apply for an account, contact dn42@burble.com.
+
+Accounts are provided with a dedicated database and Grafana organisation
+allowing users to create and manage their own graphs and dashboards as required. The Influx
+database will store up to 1 year of data with a minimum interval of 1 minute.
+
+The grafana service is hosted on dn42-fr-rbx1.burble.dn42. Service users are encouraged to peer
+directly with the service node in order to lower latencies and avoid sending large amounts of
+data through other nodes in DN42.
+
+# DN42 Infrastructure Monitoring
+
+The burble.dn42 network provides monitoring and alerting of key DN42 infrastructure.
+The monitoring service logs metrics to the hosted grafana service, and presents alerts to
+the #dn42-bots channel and Slack. Two monitoring nodes hosted in separate regions ensure that
+alerts will be generated if the main monitoring node fails.
+
+The monitoring architecture is detailed below:
+
+![Monitoring Diagram](DN42%20Monitoring%20190524.png)
+
+#### Nodes
+
+The main monitoring node is hosted on dn42-de-fra1, with a secondary backup node on dn42-us-nyc1.
+Both nodes monitor the availability of services on each other and are capable of alerting if the
+peer node is unavailable.
+
+#### Presentation
+
+Metrics collected by the service are presented as public graphs in the burble.dn42 grafana service
+(see above).
+
+#### Alerting
+
+AlertManager is configured as a cluster, operating across both monitoring nodes. Alerts are
+published in real time to the #dn42-bots hackint IRC channel (using
+[alertmanager-irc-relay](https://github.com/google/alertmanager-irc-relay)) and the
+burble.dn42/dn42-alerts channel in Slack.
+
+Alerts typically fire when a problem occurs for 5 minutes or longer.
+
+#### Collection and Storage
+
+Prometheus is used to collect metrics from the various probes and publish them to the hosted Influx
+database.
Typically metrics are collected every minute, although this is reduced to every five minutes
+for the clearnet DN42 services to avoid excessive load.
+
+The main node for data collection is monitor.de-fra1.burble.dn42
+
+#### Probes
+
+|||
+|:--|:--|
+|[blackbox_exporter](https://github.com/prometheus/blackbox_exporter)|Used to ping hosts or query services (e.g. HTTP/s probes)|
+|[netdata](https://github.com/netdata/netdata)|Used to collect many host system metrics|
+|[dn42promsrv](https://git.burble.com/burble.dn42/dn42promsrv)|Custom scripts for DN42 specific probes|
diff --git a/user/pages/01.home/maintenance-log/default.md b/user/pages/01.home/maintenance-log/default.md
index 43239c5..9c4d8c7 100755
--- a/user/pages/01.home/maintenance-log/default.md
+++ b/user/pages/01.home/maintenance-log/default.md
@@ -10,6 +10,12 @@ A log of changes to the burble.dn42 network.
 
 ## burble.dn42 Maintenance Log
 
+#### 24th May 2019
+
+Moved and extended the DN42 monitoring so that it is more independent and also clustered.
+
+A writeup of the hosted grafana service and monitoring is available [here](/home/grafana-services).
+
 #### 21st May 2019
 
 dn42-uk-lon1 is back again after being out of action for the day.