Add Grafana Service writeup

Simon Marsh 2019-05-24 15:45:03 +01:00
parent f56e1ed11b
commit 521eb692ad
Signed by: burble
GPG Key ID: 7B9FE8780CFB6593
4 changed files with 82 additions and 1 deletions

@ -19,7 +19,7 @@ Longer term, regional replicas of the DN42 site may be provided however this is
## Looking Glass
[lg.burble.com](https://lg.burble.com) (public internet link)
[lg.burble.dn42](https://lg.burble.dn42) (dn42 link)
The burble.dn42 looking glass is based on [bird-lg](https://github.com/sileht/bird-lg) with patches by

Binary file not shown.


@ -0,0 +1,75 @@
Details of the burble.dn42 hosted Grafana service.
===
# Hosted Grafana Service
|Host / URL|Service|
|:--|:--|
|[http://grafana.burble.dn42](http://grafana.burble.dn42)|Grafana Dashboards (dn42 link)|
|[https://grafana.burble.com](https://grafana.burble.com)|Grafana Dashboards (public internet link)|
|influx.burble.dn42:8086|InfluxDB Endpoint|
The hosted Grafana service combines [InfluxDB](https://www.influxdata.com/) and
[Grafana](https://grafana.com/) to store and display stats and metrics.
The service can accept metrics from any source that is able to
[publish](https://docs.influxdata.com/influxdb/v1.7/supported_protocols/) to InfluxDB, including
[Prometheus](https://prometheus.io/) and
[Telegraf](https://www.influxdata.com/time-series-platform/telegraf/).
To apply for an account, contact dn42@burble.com.
Accounts are provided with a dedicated database and Grafana organisation
allowing users to create and manage their own graphs and dashboards as required. The Influx
database will store up to 1 year of data with a minimum interval of 1 minute.
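A minimal sketch of publishing a metric, assuming the InfluxDB 1.x HTTP `/write` endpoint and line protocol covered by the InfluxDB documentation linked above. The database name `example_db`, the measurement `ping_rtt` and the helper names are hypothetical, and authentication is left out because account credentials are not described here.

```python
import urllib.parse
import urllib.request

# Endpoint taken from the service table; everything else below is illustrative.
INFLUX_URL = "http://influx.burble.dn42:8086"


def _escape(text: str, chars: str) -> str:
    """Backslash-escape the given characters, which are special in line protocol."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line(measurement: str, tags: dict, fields: dict, timestamp_s: int) -> str:
    """Format one point as InfluxDB line protocol (float fields only)."""
    tag_part = "".join(
        f",{_escape(k, ',= ')}={_escape(v, ',= ')}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(k, ',= ')}={float(v)}" for k, v in sorted(fields.items())
    )
    return f"{_escape(measurement, ', ')}{tag_part} {field_part} {timestamp_s}"


def build_write_request(database: str, lines: list) -> urllib.request.Request:
    """Build (but do not send) a POST to /write with second-precision timestamps."""
    query = urllib.parse.urlencode({"db": database, "precision": "s"})
    body = "\n".join(lines).encode("utf-8")
    return urllib.request.Request(f"{INFLUX_URL}/write?{query}", data=body, method="POST")


# Hypothetical point: one round-trip-time sample tagged with its source host.
line = to_line("ping_rtt", {"host": "node a"}, {"ms": 12.5}, 1558709103)
req = build_write_request("example_db", [line])
```

To send the point, pass `req` to `urllib.request.urlopen` from a host that can reach the endpoint; InfluxDB 1.x answers a successful write with HTTP 204. Given the one minute minimum interval, there is little value in writing a series more often than once a minute.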
The grafana service is hosted on dn42-fr-rbx1.burble.dn42. Service users are encouraged to peer
directly with the service node in order to lower latencies and avoid sending large amounts of
data through other nodes in DN42.
# DN42 Infrastructure Monitoring
The burble.dn42 network provides monitoring and alerting of key DN42 infrastructure.
The monitoring service logs metrics to the hosted Grafana service, and publishes alerts to
the #dn42-bots IRC channel and to Slack. Two monitoring nodes hosted in separate regions ensure that
alerts will be generated if the main monitoring node fails.
The monitoring architecture is detailed below:
![Monitoring Diagram](DN42%20Monitoring%20190524.png)
#### Nodes
The main monitoring node is hosted on dn42-de-fra1, with a secondary backup node on dn42-us-nyc1.
Both nodes monitor the availability of services on each other and are capable of alerting if the
peer node is unavailable.
#### Presentation
Metrics collected by the service are presented as public graphs in the burble.dn42 grafana service
(see above).
#### Alerting
AlertManager is configured as a cluster, operating across both monitoring nodes. Alerts are
published in real time to the #dn42-bots hackint IRC channel (using
[alertmanager-irc-relay](https://github.com/google/alertmanager-irc-relay)) and to the
burble.dn42/dn42-alerts channel in Slack.
Alerts typically fire when a problem persists for 5 minutes or longer.
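The 5 minute threshold can be pictured as a hold period: a failing check only becomes an alert once it has failed continuously for that long. The sketch below assumes Prometheus-style behaviour where any recovery resets the timer; the actual alert rules are not shown in this writeup, and the function name and sample data are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def first_firing_time(
    samples: Iterable[Tuple[int, bool]], hold_s: int = 300
) -> Optional[int]:
    """Return the timestamp at which an alert would fire, or None.

    `samples` is a time-ordered sequence of (timestamp_s, failing) pairs.
    The alert fires once the check has been failing continuously for
    at least `hold_s` seconds; a healthy sample resets the timer.
    """
    pending_since = None
    for ts, failing in samples:
        if not failing:
            pending_since = None  # recovered: start over
            continue
        if pending_since is None:
            pending_since = ts  # first failing sample of this run
        if ts - pending_since >= hold_s:
            return ts
    return None


# Hypothetical one-minute samples: a sustained outage, and a flapping check.
sustained = [(t, True) for t in range(0, 361, 60)]
flapping = [(0, True), (60, True), (120, True), (180, True), (240, False),
            (300, True), (360, True), (420, True), (480, True), (540, True)]

fired_at = first_firing_time(sustained)
flap_result = first_firing_time(flapping)
```

With these inputs the sustained outage fires at the sample taken 300 seconds after the first failure, while the flapping check never fires because the recovery at 240 seconds resets the hold period.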
#### Collection and Storage
Prometheus is used to collect metrics from the various probes and publish them to the hosted Influx
database. Metrics are typically collected every minute, although the collection frequency is reduced to
every five minutes for the clearnet DN42 services to avoid excessive load.
The main node for data collection is monitor.de-fra1.burble.dn42.
#### Probes
|Probe|Purpose|
|:--|:--|
|[blackbox_exporter](https://github.com/prometheus/blackbox_exporter)|Used to ping hosts or query services (e.g. HTTP/s probes)|
|[netdata](https://github.com/netdata/netdata)|Used to collect many host system metrics|
|[dn42promsrv](https://git.burble.com/burble.dn42/dn42promsrv)|Custom scripts for DN42-specific probes|

@ -10,6 +10,12 @@ A log of changes to the burble.dn42 network.
## burble.dn42 Maintenance Log
#### 24th May 2019
Moved and extended the DN42 monitoring so that it is more independent and also clustered.
A writeup of the hosted grafana service and monitoring is available [here](/home/grafana-services).
#### 21st May 2019
dn42-uk-lon1 is back again after being out of action for the day.