diff --git a/user/pages/01.home/burble-dn42-services/default.md b/user/pages/01.home/burble-dn42-services/default.md index afcade3..6225cab 100755 --- a/user/pages/01.home/burble-dn42-services/default.md +++ b/user/pages/01.home/burble-dn42-services/default.md @@ -115,10 +115,28 @@ Please mail [dn42@burble.com](mailto:dn42@burble.com) for further details. ## Network Status and Reporting +### Hosted Grafana Service + +- [http://grafana.burble.dn42](http://grafana.burble.dn42) (dn42 link) +- [https://grafana.burble.com](https://grafana.burble.com) (public internet link) + +The hosted grafana service has its own page [here](/home/grafana-service). + +### DN42 Infrastructure Monitoring + +burble.dn42 hosts monitoring and alerting of key DN42 services; see the +[hosted grafana service](/home/grafana-service) for more details. + +### burble.dn42 status + [dn42.status.burble.com](https://dn42.status.burble.com/) -Each node in the network is monitored by [UptimeRobot](https://uptimerobot.com/) with alerts if a node becomes unavailable. +Each node in the network is monitored by [UptimeRobot](https://uptimerobot.com/) with alerts +if a node becomes unavailable. -Internally, nodes are measured by [netdata](https://github.com/netdata/netdata) which provides a real time view of each node. [prometheus](https://prometheus.io/) is then used to collect and store that data for historical reporting. [grafana](https://grafana.com/) is used for visualisation, but this is not currently a public service. +Internally, nodes are measured by [netdata](https://github.com/netdata/netdata), which provides +a real-time view of each node. [prometheus](https://prometheus.io/) is then used to collect and +store that data for historical reporting. [grafana](https://grafana.com/) is used for +visualisation. Some public graphs are available on the [hosted grafana service](/home/grafana-service). Syslogs are exported in real time to a central logging node on the internal network. 
\ No newline at end of file diff --git a/user/pages/01.home/grafana-service/default.md b/user/pages/01.home/grafana-service/default.md index 05d0d1f..55389e0 100644 --- a/user/pages/01.home/grafana-service/default.md +++ b/user/pages/01.home/grafana-service/default.md @@ -1,17 +1,23 @@ +--- +title: Hosted Grafana +visible: true +--- + Details of the burble.dn42 hosted Grafana service. === -# Hosted Grafana Service +## Hosted Grafana Service |Host / URL|Service| |:--|:--| -|[http://grafana.burble.dn42](http://grafana.burble.dn42)|Grafana Dashboards (dn42 link)| -|[https://grafana.burble.com](https://grafana.burble.com)|Grafana Dashboards (public internet link)| +|[http://grafana.burble.dn42/](http://grafana.burble.dn42/)|Grafana Dashboards (dn42 link)| +|[https://grafana.burble.com/](https://grafana.burble.com/)|Grafana Dashboards (public internet link)| |influx.burble.dn42:8086|InfluxDB Endpoint| + The hosted grafana service provides an [InfluxDB](https://www.influxdata.com/) and -[Grafana](https://grafana.com/) combination for storing and displaying stats and metrics. +[Grafana](https://grafana.com/) combination for storing and displaying stats and metrics. The service can accept metrics from any source that is able to [publish](https://docs.influxdata.com/influxdb/v1.7/supported_protocols/) to the InfluxDB, including [Prometheus](https://prometheus.io/) and @@ -27,9 +33,9 @@ The grafana service is hosted on dn42-fr-rbx1.burble.dn42. Service users are enc directly with the service node in order to lower latencies and avoid sending large amounts of data through other nodes in DN42. -# DN42 Infrastructure Monitoring +## DN42 Infrastructure Monitoring -The burble.dn42 network provides monitoring and alerting of key DN42 infrastructure. +The burble.dn42 network hosts monitoring and alerting of key DN42 infrastructure. The monitoring service logs metrics to the hosted grafana service, and presents alerts to the #dn42-bots channel and slack. 
Two monitoring nodes hosted in separate regions ensure that alerts will be generated if the main monitoring node fails. @@ -40,7 +46,7 @@ The monitoring architecture is detailed below: #### Nodes -The main monitoring node is hosted on dn42-de-fra1, with a secondary backup node on dn42-us-nyc1. +The main monitoring node is hosted on dn42-de-fra1, with a secondary backup node on dn42-us-nyc1. Both nodes monitor the availability of services on each other and are capable of alerting if the peer node is unavailable. @@ -51,8 +57,8 @@ Metrics collected by the service are presented as public graphs in the burble.dn #### Alerting -AlertManager is configured as a cluster, operating across both monitoring nodes. Alerts are -published in real time to the #dn42-bots hackint IRC channel (using +AlertManager is configured as a cluster, operating across both monitoring nodes. +Alerts are published in real time to the #dn42-bots hackint IRC channel (using [alertmanager-irc-relay](https://github.com/google/alertmanager-irc-relay) and burble.dn42/dn42-alerts channel in slack. @@ -61,7 +67,8 @@ Alerts typically fire when a problem occurs for 5 minutes or longer. #### Collection and Storage Prometheus is used to collect metrics from the various probes and publish them to the hosted Influx -database. Typically metrics are collected every minute, although this is reduced to every five minutes +database. +Typically metrics are collected every minute, although this is reduced to every five minutes for the clearnet DN42 services to avoid excessive load. The main node for data collection is monitor.de-fra1.burble.dn42 @@ -72,4 +79,4 @@ The main node for data collection is monitor.de-fra1.burble.dn42 |:--|:--| |[blackbox_exporter](https://github.com/prometheus/blackbox_exporter)|Used to ping hosts or query services (e.g. 
HTTP/s probes)| |[netdata](https://github.com/netdata/netdata)|Used to collect many host system metrics| -|[dn42promsrv](https://git.burble.com/burble.dn42/dn42promsrv)|Custom scripts for DN42 specific probdes| +|[dn42promsrv](https://git.burble.com/burble.dn42/dn42promsrv)|Custom collector for DN42-specific probes|
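The grafana service page in this patch says that the InfluxDB endpoint (`influx.burble.dn42:8086`) accepts metrics from any source that can publish to InfluxDB. The page also links the InfluxDB v1.7 protocol documentation. The following minimal Python sketch formats a point in InfluxDB line protocol and POSTs it to the InfluxDB 1.x HTTP `/write` endpoint.

It rests on several assumptions:
- The endpoint accepts unauthenticated 1.x-style writes over plain HTTP. The page does not describe authentication or database naming.
- The helper names are hypothetical.
- The database name `my_metrics` used below is hypothetical.

```python
import urllib.parse
import urllib.request


def _escape(s: str) -> str:
    # Tag keys, tag values and field keys: escape commas, equals signs and spaces
    return s.replace(",", r"\,").replace("=", r"\=").replace(" ", r"\ ")


def _field_value(v) -> str:
    # bool must be checked before int, because bool is a subclass of int
    if isinstance(v, bool):
        return "true" if v else "false"
    if isinstance(v, int):
        return f"{v}i"  # integer fields carry an 'i' suffix
    if isinstance(v, float):
        return repr(v)
    # String fields are double-quoted; escape backslashes and double quotes
    s = str(v).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{s}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns=None) -> str:
    """Format a single point as one line of InfluxDB line protocol (hypothetical helper)."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    meas = measurement.replace(",", r"\,").replace(" ", r"\ ")
    # Sorting tags by key is what InfluxDB recommends for write performance
    tag_str = "".join(f",{_escape(k)}={_escape(v)}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{_escape(k)}={_field_value(v)}" for k, v in fields.items())
    line = f"{meas}{tag_str} {field_str}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"  # default write precision is nanoseconds
    return line


def write_points(lines, db, host="influx.burble.dn42", port=8086) -> int:
    """POST points to an InfluxDB 1.x /write endpoint (no auth assumed)."""
    url = f"http://{host}:{port}/write?" + urllib.parse.urlencode({"db": db})
    body = "\n".join(lines).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )
    # urlopen raises HTTPError on 4xx/5xx; InfluxDB 1.x answers 204 on success
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Usage, with a hypothetical database name:

```python
write_points([to_line_protocol("ping", {"host": "dn42-de-fra1"}, {"rtt_ms": 12.5})], db="my_metrics")
```

Line protocol keeps the client dependency-free. The page also notes that tools such as Prometheus can publish to the same endpoint, which is likely the more practical route for anything beyond ad-hoc metrics.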