
Grafana Monitoring Stack Deployment

Executive Summary

Objective: Deploy Grafana + Prometheus + Loki stack for centralised infrastructure monitoring and log aggregation on current cPanel servers.

Benefits:

  • Immediate Value: Provides monitoring NOW, before ApisCP migration
  • No Conflicts: Compatible with Imunify360 FREE + KernelCare on cPanel
  • Centralised Visibility: Unified dashboards for all 3 servers (EU1, NS1, NS2)
  • Log Aggregation: Centralised log collection and search with Loki
  • Performance Monitoring: Real-time metrics with Prometheus
  • Cost Effective: Open-source stack, only infrastructure costs
  • Future Integration: Foundation for Wazuh integration post-ApisCP migration

Timeline: 2-3 weeks for full deployment

Infrastructure Required: New Hetzner CPX31 server (€13.79/month, £12/month)

Project Status: Phase 1 - Planning and preparation

Current State

Existing Monitoring Limitations

Current Challenges:

  • Distributed Logs: Must SSH to each server individually to review logs
  • No Centralised Metrics: Cannot compare resource usage across servers
  • Manual Correlation: Difficult to correlate events across infrastructure
  • No Historical Trends: Limited ability to track performance over time
  • Reactive Posture: Discover issues when clients report problems
  • Time-Consuming: Daily monitoring tasks take 15-30 minutes manually

Current Servers:

  • eu1.cp (CPX31): Hosting server with ~30 client accounts
  • ns1.mdhosting.co.uk (CX22): Primary DNS server
  • ns2.mdhosting.co.uk (CX22): Secondary DNS server

Current Security Stack (Must Not Conflict):

  • Imunify360 FREE (uses /var/ossec directory - OSSEC-based)
  • KernelCare (live kernel patching)
  • CSF Firewall (connection tracking, rate limiting)
  • Fail2Ban (intrusion prevention)
  • ClamAV (malware scanning)

Why Grafana Stack Now?

Immediate Benefits:

  1. Wazuh Incompatibility: Cannot deploy Wazuh on current cPanel servers due to Imunify360 conflict
  2. Operational Visibility: Need monitoring improvements before ApisCP migration (Q2-Q3 2026)
  3. Foundation Building: Grafana provides infrastructure for future Wazuh integration
  4. Risk Reduction: Better visibility into system health during migration preparation
  5. Skill Development: Gain experience with monitoring stack before more complex Wazuh deployment

Target State

Grafana Stack Architecture

graph TB
    subgraph "Current Production Servers - Hetzner Germany"
        EU1[eu1.cp<br/>CPX31 Hosting Server<br/>~30 Client Accounts<br/>Imunify360 + KernelCare]
        NS1[ns1.mdhosting.co.uk<br/>CX22 DNS Server 1<br/>Primary DNS<br/>Imunify360]
        NS2[ns2.mdhosting.co.uk<br/>CX22 DNS Server 2<br/>Secondary DNS<br/>Imunify360]
    end

    subgraph "Monitoring Agents - Production Servers"
        NEXPORTER1[Node Exporter<br/>System metrics<br/>Port 9100]
        NEXPORTER2[Node Exporter<br/>System metrics<br/>Port 9100]
        NEXPORTER3[Node Exporter<br/>System metrics<br/>Port 9100]
        PROMTAIL1[Promtail<br/>Log shipping<br/>Sends to Loki]
        PROMTAIL2[Promtail<br/>Log shipping<br/>Sends to Loki]
        PROMTAIL3[Promtail<br/>Log shipping<br/>Sends to Loki]
    end

    subgraph "NEW Monitoring Server - Hetzner Germany"
        MONITOR[Monitoring Server<br/>CPX31: 4 vCPU, 8GB RAM, 80GB<br/>AlmaLinux 10<br/>Docker + Portainer]
    end

    subgraph "Docker Containers - Monitoring Server"
        GRAFANA[Grafana Container<br/>Dashboards & Visualization<br/>Port 3000]
        PROMETHEUS[Prometheus Container<br/>Metrics collection & storage<br/>Port 9090]
        LOKI[Loki Container<br/>Log aggregation & indexing<br/>Port 3100]
        PORTAINER[Portainer Container<br/>Docker management UI<br/>Port 9443]
    end

    EU1 --> NEXPORTER1
    EU1 --> PROMTAIL1
    NS1 --> NEXPORTER2
    NS1 --> PROMTAIL2
    NS2 --> NEXPORTER3
    NS2 --> PROMTAIL3

    NEXPORTER1 -->|HTTP 9100<br/>Metrics pull| PROMETHEUS
    NEXPORTER2 -->|HTTP 9100<br/>Metrics pull| PROMETHEUS
    NEXPORTER3 -->|HTTP 9100<br/>Metrics pull| PROMETHEUS

    PROMTAIL1 -->|HTTP 3100<br/>Log push| LOKI
    PROMTAIL2 -->|HTTP 3100<br/>Log push| LOKI
    PROMTAIL3 -->|HTTP 3100<br/>Log push| LOKI

    PROMETHEUS -->|Data source| GRAFANA
    LOKI -->|Data source| GRAFANA

    MONITOR --> GRAFANA
    MONITOR --> PROMETHEUS
    MONITOR --> LOKI
    MONITOR --> PORTAINER

    PORTAINER -.->|Manages| GRAFANA
    PORTAINER -.->|Manages| PROMETHEUS
    PORTAINER -.->|Manages| LOKI

    ADMIN[Administrator] -->|HTTPS 443<br/>Grafana UI| GRAFANA
    ADMIN -->|HTTPS 9443<br/>Docker Management| PORTAINER

    style EU1 fill:#3498db,stroke:#2c3e50,stroke-width:2px,color:#fff
    style NS1 fill:#3498db,stroke:#2c3e50,stroke-width:2px,color:#fff
    style NS2 fill:#3498db,stroke:#2c3e50,stroke-width:2px,color:#fff
    style MONITOR fill:#27ae60,stroke:#2c3e50,stroke-width:2px,color:#fff
    style GRAFANA fill:#f39c12,stroke:#2c3e50,stroke-width:2px,color:#fff
    style PROMETHEUS fill:#f39c12,stroke:#2c3e50,stroke-width:2px,color:#fff
    style LOKI fill:#f39c12,stroke:#2c3e50,stroke-width:2px,color:#fff
    style PORTAINER fill:#8e44ad,stroke:#2c3e50,stroke-width:2px,color:#fff
    style NEXPORTER1 fill:#27ae60,stroke:#2c3e50,stroke-width:2px,color:#fff
    style NEXPORTER2 fill:#27ae60,stroke:#2c3e50,stroke-width:2px,color:#fff
    style NEXPORTER3 fill:#27ae60,stroke:#2c3e50,stroke-width:2px,color:#fff
    style PROMTAIL1 fill:#27ae60,stroke:#2c3e50,stroke-width:2px,color:#fff
    style PROMTAIL2 fill:#27ae60,stroke:#2c3e50,stroke-width:2px,color:#fff
    style PROMTAIL3 fill:#27ae60,stroke:#2c3e50,stroke-width:2px,color:#fff
    style ADMIN fill:#8e44ad,stroke:#2c3e50,stroke-width:2px,color:#fff

Stack Components

Grafana

Purpose: Visualization and dashboarding platform for metrics and logs.

Capabilities:

  • Unified dashboards combining metrics (Prometheus) and logs (Loki)
  • Alert rule configuration with email notifications
  • User-friendly query builder (no complex query language required)
  • Pre-built dashboard templates for common monitoring scenarios
  • Mobile-responsive interface for on-the-go monitoring
  • Dashboard sharing and export

Use Cases:

  • Real-time system health monitoring
  • Historical trend analysis
  • Alert management and notification
  • Incident investigation (correlate metrics + logs)
  • Capacity planning and resource optimization

Prometheus

Purpose: Time-series metrics collection, storage, and querying.

Capabilities:

  • Pull-based metrics collection (scrapes Node Exporter endpoints)
  • Efficient time-series database optimised for metrics
  • PromQL query language for data analysis
  • Built-in alerting rules (integrated with Grafana)
  • Service discovery and target management
  • Long-term metrics retention (configurable, 30+ days recommended)

Metrics Collected:

  • System: CPU usage, memory, disk I/O, network traffic
  • Services: Apache/nginx, MySQL/MariaDB, Exim, Dovecot
  • Node Exporter: 1000+ system-level metrics per server
  • Custom Exporters: Can add cPanel-specific metrics if needed
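
As a quick illustration of PromQL, the queries below can be run against Prometheus's HTTP API from the monitoring server once scraping is working; the endpoint and metric names are standard Prometheus/Node Exporter, but treat this as a sketch rather than the final dashboard queries.

# Example: CPU usage (%) per instance over the last 5 minutes, via the Prometheus HTTP API
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'

# Example: root filesystem usage (%) per instance
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100'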

Loki

Purpose: Log aggregation system designed for efficiency and Grafana integration.

Capabilities:

  • Horizontal log aggregation (receive logs from all servers)
  • Efficient storage (indexes only metadata, not full log content)
  • LogQL query language (similar to PromQL, easy to learn)
  • Label-based log organization (by server, service, severity)
  • Stream processing and filtering
  • Long-term log retention (configurable, 14+ days recommended)

Logs Collected:

  • System Logs: /var/log/messages, /var/log/secure, /var/log/cron
  • Web Server: Apache access/error logs, per-domain logs
  • Email: Exim mainlog, rejectlog, paniclog, maillog (Dovecot)
  • Security: CSF/LFD, Fail2Ban, Imunify360, ClamAV
  • cPanel: cPanel/WHM access, error, login logs
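
To give a feel for LogQL before Promtail is deployed, the examples below query Loki's HTTP API directly from the monitoring server; the label values (server, job) are assumptions that match the Promtail configuration later in this document.

# Example: list the labels Loki has indexed so far
curl -s http://localhost:3100/loki/api/v1/labels

# Example: last hour of SSH authentication failures shipped from EU1
curl -sG http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={server="eu1-hosting", job="secure"} |= "Failed password"' \
  --data-urlencode 'since=1h'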

Portainer

Purpose: Docker container management with web interface.

Capabilities:

  • Visual Docker management (no command line required for common tasks)
  • Container lifecycle management (start, stop, restart, logs)
  • Docker Compose deployment and management
  • Resource usage monitoring per container
  • Log viewing and download
  • Access control and user management

Benefits:

  • Simplifies Docker operations for a single-operator environment
  • Quick troubleshooting (view container logs, restart services)
  • Backup and restore of container configurations
  • Portainer Business license: 5 nodes included (free for MDHosting use case)

Infrastructure Requirements

New Monitoring Server

Specifications:

  • Server Type: Hetzner CPX31 (recommended)
  • vCPU: 4 cores (sufficient for 3 monitored servers)
  • RAM: 8GB (Prometheus + Loki + Grafana + overhead)
  • Storage: 80GB SSD (metrics + logs retention)
  • Network: 20TB traffic (ample for metrics/logs)
  • Cost: €13.79/month (~£12/month, £144/year)

Alternative (Budget Option):

  • Server Type: Hetzner CX32 (if CPX31 unavailable)
  • vCPU: 4 cores
  • RAM: 8GB
  • Storage: 80GB
  • Cost: Similar to CPX31

Operating System:

  • AlmaLinux 10 (consistent with planned ApisCP infrastructure)
  • Fresh installation (no cPanel, no Imunify360)
  • Docker and Docker Compose installed
  • Portainer for container management

Why Separate Server?

  • Isolation: Monitoring infrastructure separate from production
  • Performance: No impact on client services during metric collection
  • Security: Separate security profile, no client data
  • Flexibility: Easy to scale monitoring without affecting production
  • Future-Proof: Can be used for Wazuh deployment post-ApisCP migration

Monitored Servers (Current Production)

No Major Changes Required:

  • Node Exporter: Lightweight binary (~10MB RAM, <1% CPU)
  • Promtail: Lightweight log shipper (~20MB RAM, <1% CPU)
  • Firewall Rules: Allow outbound HTTP to monitoring server
  • Disk Space: Minimal (<100MB for exporters/shippers)

Total Production Impact: <50MB RAM, <2% CPU per server - negligible

Cost Analysis

| Component | Monthly Cost | Annual Cost | Notes |
|---|---|---|---|
| NEW Monitoring Server (CPX31) | £12 | £144 | 4 vCPU, 8GB RAM, 80GB SSD |
| Node Exporter (3x) | £0 | £0 | Open-source, runs on existing servers |
| Promtail (3x) | £0 | £0 | Open-source, runs on existing servers |
| Grafana | £0 | £0 | Open-source (OSS version) |
| Prometheus | £0 | £0 | Open-source |
| Loki | £0 | £0 | Open-source |
| Portainer Business | £0 | £0 | Free for 5 nodes (we have 4: monitoring + 3 production) |
| Total | £12 | £144 | One-time setup ~4-6 hours + £144/year ongoing |

Value Proposition:

  • £144/year for comprehensive infrastructure monitoring
  • Immediate benefit (before Wazuh deployment post-ApisCP)
  • Foundation for future Wazuh integration (unified dashboards)
  • Risk mitigation during ApisCP migration planning and execution

Comparison with Alternatives

| Feature | Grafana Stack | Wazuh SIEM | Commercial (Datadog/New Relic) |
|---|---|---|---|
| Cost (Annual) | £144 | £144 (post-ApisCP only) | £300-600+ |
| Deployment Time | 2-3 weeks | 8-11 weeks | 1-2 weeks (SaaS) |
| Current Compatibility | ✅ Works with Imunify360 | ❌ Conflicts with Imunify360 | ✅ Usually compatible |
| Infrastructure Monitoring | ✅ Excellent (Prometheus) | ⚠️ Basic | ✅ Excellent |
| Log Aggregation | ✅ Good (Loki) | ✅ Excellent (OpenSearch) | ✅ Excellent |
| Security Event Detection | ⚠️ Basic (manual rules) | ✅ Excellent (SIEM) | ✅ Good |
| Customization | ✅ Highly customizable | ✅ Highly customizable | ⚠️ Limited |
| Data Sovereignty | ✅ Self-hosted (Germany) | ✅ Self-hosted (Germany) | ❌ Third-party SaaS |
| Future Wazuh Integration | ✅ Designed for it | N/A (is Wazuh) | ⚠️ Possible but complex |
| Skill Development | ✅ Industry-standard stack | ✅ SIEM expertise | ⚠️ Vendor-specific |

Conclusion: Grafana stack is the optimal Phase 1 solution given Imunify360 conflict with Wazuh and immediate monitoring needs.

Deployment Strategy

Phased Deployment Approach

Phase 1: Monitoring Server Setup (Week 1)

Objectives:

  • Provision new Hetzner CPX31 server
  • Install AlmaLinux 10 operating system
  • Configure Docker and Docker Compose
  • Deploy Portainer for container management
  • Secure server (firewall, SSH keys, fail2ban)

Tasks:

  1. Order Hetzner CPX31 server via Hetzner Cloud Console
  2. Install AlmaLinux 10 (select from Hetzner image library)
  3. Complete initial server hardening (disable root SSH, create admin user, SSH keys)
  4. Install Docker Engine and Docker Compose
  5. Deploy Portainer via Docker Compose
  6. Configure firewall rules (CSF or firewalld)
  7. Set up DNS record (e.g., monitoring.mdhosting.internal or monitor.mdhosting.co.uk)
  8. Configure SSL certificate (Let's Encrypt for Grafana/Portainer)

Success Criteria:

  • Monitoring server accessible via SSH (key-based authentication)
  • Portainer web interface accessible at https://monitor.mdhosting.co.uk:9443
  • Docker containers can be created and managed via Portainer
  • Firewall rules configured and tested

Phase 2: Monitoring Stack Deployment (Week 1-2)

Objectives:

  • Deploy Prometheus, Loki, Grafana via Docker Compose
  • Configure Prometheus scrape targets (prepare for Node Exporter)
  • Configure Loki to receive logs from Promtail
  • Set up Grafana data sources (Prometheus + Loki)
  • Create initial Grafana dashboards

Tasks:

  1. Create Docker Compose configuration for monitoring stack
  2. Deploy Prometheus container with persistent storage
  3. Deploy Loki container with persistent storage and retention policy
  4. Deploy Grafana container with persistent storage
  5. Configure Prometheus: Add scrape configs for 3x Node Exporter endpoints (will be deployed in Phase 3)
  6. Configure Loki: Set up log retention (14 days minimum, 30 days recommended)
  7. Add Prometheus as Grafana data source
  8. Add Loki as Grafana data source
  9. Import pre-built dashboards (Node Exporter Full, Loki Logs)
  10. Configure Grafana SMTP for email alerts (use existing admin@mdhosting.co.uk)
  11. Set up Grafana authentication (admin user, consider adding client access later)

Success Criteria:

  • All 3 containers running and healthy (check with Portainer)
  • Grafana accessible at https://monitor.mdhosting.co.uk (reverse proxy via nginx/caddy or direct port)
  • Prometheus and Loki accessible as data sources in Grafana
  • Pre-built dashboards loaded (will show no data until Phase 3)

Phase 3: Production Server Agent Deployment (Week 2)

Objectives:

  • Install Node Exporter on EU1, NS1, NS2
  • Install Promtail on EU1, NS1, NS2
  • Configure firewall rules for metrics/log shipping
  • Verify metrics and logs flowing to monitoring server
  • Validate dashboards displaying data

Tasks:

  1. Node Exporter Deployment:
    • Download Node Exporter binary to each server
    • Create systemd service for Node Exporter
    • Configure Node Exporter to listen on port 9100
    • Start and enable Node Exporter service
    • Test metrics endpoint: curl http://localhost:9100/metrics

  2. Promtail Deployment:
    • Download Promtail binary to each server
    • Create Promtail configuration file (specify log paths and Loki endpoint)
    • Create systemd service for Promtail
    • Start and enable Promtail service
    • Verify Promtail connecting to Loki (check logs)

  3. Firewall Configuration (per server):
    • Allow outbound HTTP to monitoring server (port 3100 for Loki)
    • Optionally restrict Node Exporter port 9100 to monitoring server IP only
    • Verify connectivity: telnet monitor.mdhosting.internal 3100

  4. Validation:
    • Verify Prometheus scraping metrics from all 3 Node Exporters (check Prometheus UI)
    • Verify Loki receiving logs from all 3 Promtail instances (check via Grafana)
    • Check Grafana dashboards now displaying data from all servers
    • Investigate and resolve any "no data" issues

Success Criteria:

  • Node Exporter metrics visible in Prometheus for all 3 servers
  • Logs from all 3 servers visible in Grafana (Loki data source)
  • "Node Exporter Full" dashboard showing CPU, memory, disk, network for all servers
  • "Loki Logs" dashboard showing recent log entries from all servers

Phase 4: Dashboard Configuration & Alerting (Week 2-3)

Objectives:

  • Customize dashboards for MDHosting-specific monitoring needs
  • Configure alerting rules for critical conditions
  • Set up notification channels (email, optionally SMS/Slack)
  • Test alerting functionality
  • Document monitoring procedures

Tasks:

  1. Dashboard Customization:
    • Clone and customize "Node Exporter Full" dashboard
    • Add MDHosting-specific panels:
      • cPanel service status (httpd, mysqld, exim, dovecot)
      • Client account resource usage (if metrics available)
      • Backup job status (if metrics available)
      • Disk usage trends per partition
    • Create separate dashboards for each server (EU1, NS1, NS2)
    • Create unified overview dashboard (all servers at a glance)
    • Create log analysis dashboard (security events, errors, authentication failures)

  2. Alerting Configuration:
    • Create alert rules in Grafana:
      • Critical: Disk usage >90%, service down, out of memory
      • High: Disk usage >80%, high CPU (>80% for 10min), high load average
      • Medium: Unusual log patterns, authentication failures spike
    • Configure notification channel: Email to admin@mdhosting.co.uk
    • Test alerts by simulating conditions (fill disk, stop service, etc.)
    • Adjust alert thresholds based on false positive rate

  3. Log Queries and Saved Searches:
    • Create LogQL queries for common investigations:
      • SSH authentication failures
      • Apache errors by domain
      • Exim mail queue issues
      • CSF/Fail2Ban blocks
      • Imunify360 security events
    • Save queries in Grafana for quick access
    • Document query patterns for team reference

  4. Documentation:
    • Create runbook for common monitoring tasks
    • Document dashboard navigation and usage
    • Create alert response procedures
    • Update Security Monitoring with Grafana integration

Success Criteria:

  • Customized dashboards meeting operational needs
  • Alert rules configured and tested (receive test alerts via email)
  • Saved log queries for common security/troubleshooting scenarios
  • Documentation complete and accessible

Phase 5: Testing & Optimization (Week 3)

Objectives:

  • Comprehensive testing of monitoring stack
  • Performance optimization (reduce false positives, tune retention)
  • Integration with existing procedures
  • Training and familiarization
  • Production cutover

Tasks:

  1. Functional Testing:
    • Simulate various failure scenarios (service down, disk full, high load)
    • Verify alerts fire correctly and notifications received
    • Test log search functionality for incident investigation
    • Validate dashboard accuracy against manual checks

  2. Performance Optimization:
    • Review Prometheus/Loki resource usage on monitoring server
    • Adjust scrape intervals if needed (default 15s, can increase to 30s-60s if performance issue)
    • Tune log retention based on disk usage (balance storage vs. historical analysis needs)
    • Optimize dashboard queries for faster loading

  3. Integration with Existing Procedures:
    • Update daily monitoring checklist to include Grafana checks
    • Modify incident response procedures to leverage Grafana for investigation
    • Train on correlating metrics + logs for faster troubleshooting
    • Integrate Grafana links into documentation (e.g., link to specific dashboards)

  4. Production Cutover:
    • Transition from manual log analysis to Grafana-first approach
    • Maintain manual checks as backup for first 2 weeks
    • Monitor false positive rate and adjust alert rules
    • Gather feedback and iterate on dashboards/alerts

Success Criteria:

  • Monitoring stack stable and performant (monitoring server <70% resource usage)
  • Alert false positive rate <10%
  • Dashboards provide actionable insights (reduce MTTR for incidents)
  • Operator comfortable with Grafana interface and workflows

Deployment Timeline

gantt
    title Grafana Monitoring Stack Deployment Timeline
    dateFormat YYYY-MM-DD
    section Phase 1: Server Setup
    Provision Hetzner CPX31          :p1a, 2026-01-15, 1d
    Install AlmaLinux 10              :p1b, after p1a, 1d
    Install Docker + Portainer        :p1c, after p1b, 1d
    Server Hardening & SSL            :p1d, after p1c, 1d
    section Phase 2: Stack Deployment
    Deploy Prometheus                 :p2a, after p1d, 1d
    Deploy Loki                       :p2b, after p2a, 1d
    Deploy Grafana                    :p2c, after p2b, 1d
    Configure Data Sources            :p2d, after p2c, 1d
    Import Dashboards                 :p2e, after p2d, 1d
    section Phase 3: Agent Deployment
    Deploy Node Exporter (3x)         :p3a, after p2e, 2d
    Deploy Promtail (3x)              :p3b, after p3a, 2d
    Validate Metrics & Logs           :p3c, after p3b, 1d
    section Phase 4: Dashboards & Alerts
    Customize Dashboards              :p4a, after p3c, 2d
    Configure Alerting Rules          :p4b, after p4a, 2d
    Test Alerts                       :p4c, after p4b, 1d
    Documentation                     :p4d, after p4c, 1d
    section Phase 5: Testing & Launch
    Comprehensive Testing             :p5a, after p4d, 2d
    Performance Optimization          :p5b, after p5a, 1d
    Production Cutover                :p5c, after p5b, 2d

Total Timeline: 18-21 days (approximately 3 weeks)

Target Deployment: Q1 2026 (January-February 2026) - BEFORE ApisCP migration preparation intensifies

Technical Implementation

Monitoring Server Setup

Step 1: Provision Hetzner Server

Via Hetzner Cloud Console:

  1. Log in to https://console.hetzner.cloud/
  2. Select MDHosting project
  3. Click "Add Server"
  4. Location: Nuremberg, Germany (same as existing servers)
  5. Image: AlmaLinux 10 (from Apps/Distribution list)
  6. Type: CPX31 (4 vCPU, 8GB RAM, 80GB SSD)
  7. Networking:
    • Enable IPv4 (public IP assigned automatically)
    • Enable IPv6 if needed
    • Add to existing network if using Hetzner private networking
  8. SSH Keys: Add existing SSH public key for initial access
  9. Firewall: Create/assign firewall rules (see below)
  10. Volume: Not needed (80GB included storage sufficient)
  11. Name: monitor.mdhosting or monitoring-server
  12. Labels: environment=monitoring, role=observability
  13. Click "Create & Buy Now"

Cost: €13.79/month, billed monthly

Post-Provision:

  • Note the assigned public IPv4 address
  • Create DNS A record: monitor.mdhosting.co.uk → IPv4 address
  • Test SSH access: ssh root@monitor.mdhosting.co.uk
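
If the hcloud CLI is preferred over the console, the same server can be created from the command line; a minimal sketch, assuming the CLI is authenticated against the MDHosting project and that the image slug and SSH key name used below (alma-10, mdhosting-admin) actually exist in the account.

# Hypothetical hcloud equivalent of the console steps above - verify the image slug and key name first
hcloud server create \
  --name monitor \
  --type cpx31 \
  --image alma-10 \
  --location nbg1 \
  --ssh-key mdhosting-admin \
  --label environment=monitoring \
  --label role=observability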

Step 2: Initial Server Hardening

As root user (initial SSH access):

# Update system packages
dnf update -y

# Set hostname
hostnamectl set-hostname monitor.mdhosting.co.uk

# Create administrative user (non-root)
useradd -m -s /bin/bash mdhosting
usermod -aG wheel mdhosting  # Grant sudo access

# Set strong password
passwd mdhosting

# Copy SSH keys to new user
mkdir -p /home/mdhosting/.ssh
cp /root/.ssh/authorized_keys /home/mdhosting/.ssh/
chown -R mdhosting:mdhosting /home/mdhosting/.ssh
chmod 700 /home/mdhosting/.ssh
chmod 600 /home/mdhosting/.ssh/authorized_keys

# Test new user access (open new terminal)
ssh mdhosting@monitor.mdhosting.co.uk
sudo whoami  # Should return "root"

# Once confirmed, disable root SSH login
sed -i 's/#PermitRootLogin yes/PermitRootLogin no/' /etc/ssh/sshd_config
sed -i 's/PermitRootLogin yes/PermitRootLogin no/' /etc/ssh/sshd_config
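
# Validate the modified SSH config before restarting (sshd -t prints nothing when the syntax is OK)
sshd -t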
systemctl restart sshd

Configure Firewall (CSF recommended for consistency):

# Install CSF (ConfigServer Security & Firewall)
cd /usr/src
wget https://download.configserver.com/csf.tgz
tar -xzf csf.tgz
cd csf
sh install.sh

# Configure CSF
vi /etc/csf/csf.conf
# Key settings:
#   TESTING = "0"  # Set to 0 for production
#   TCP_IN = "22,80,443,3000,3100,9090,9443"  # SSH, HTTP, HTTPS, Grafana, Loki (receives logs from Promtail), Prometheus, Portainer
#   TCP_OUT = "22,80,443,587,9100"  # Outbound: SSH, HTTP, HTTPS, SMTP submission (Grafana alerts), Node Exporter scrapes on production servers
#   ICMP_IN = "1"  # Allow ping
#   ETH_DEVICE = ""  # Leave blank for automatic detection

# Allow monitoring server IP to access Node Exporter on production servers
# (Configure this on eu1.cp, ns1, ns2, not monitoring server)

# Enable and start CSF
systemctl enable csf
systemctl enable lfd
systemctl start csf
systemctl start lfd

# Test firewall
csf -r  # Restart CSF

Alternative: Firewalld (if preferring AlmaLinux default):

# Enable firewalld
systemctl enable --now firewalld

# Configure zones and services
firewall-cmd --permanent --add-service=ssh
firewall-cmd --permanent --add-service=http
firewall-cmd --permanent --add-service=https
firewall-cmd --permanent --add-port=3000/tcp   # Grafana
firewall-cmd --permanent --add-port=3100/tcp   # Loki (receive logs)
firewall-cmd --permanent --add-port=9090/tcp   # Prometheus
firewall-cmd --permanent --add-port=9100/tcp   # Node Exporter (only needed if this server runs its own Node Exporter)
firewall-cmd --permanent --add-port=9443/tcp   # Portainer

# Reload firewall
firewall-cmd --reload

# Verify rules
firewall-cmd --list-all

Step 3: Install Docker and Docker Compose

Install Docker Engine:

# Install Docker repository
sudo dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

# Install Docker
sudo dnf install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Start and enable Docker
sudo systemctl start docker
sudo systemctl enable docker

# Add mdhosting user to docker group (allow non-root Docker commands)
sudo usermod -aG docker mdhosting

# Log out and back in for group change to take effect
exit
ssh mdhosting@monitor.mdhosting.co.uk

# Verify Docker installation
docker --version
docker ps  # Should return empty list (no containers yet)

Install Docker Compose (V2, plugin-based):

Docker Compose V2 is installed as a Docker plugin with the above commands. Verify:

docker compose version
# Should output: Docker Compose version v2.x.x

Step 4: Deploy Portainer

Create Portainer Docker Compose configuration:

# Create directory for Portainer
mkdir -p /opt/portainer
cd /opt/portainer

# Create docker-compose.yml
cat > docker-compose.yml <<'EOF'
version: '3.8'

services:
  portainer:
    image: portainer/portainer-ce:latest
    container_name: portainer
    restart: unless-stopped
    security_opt:
      - no-new-privileges:true
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - portainer_data:/data
    ports:
      - "9443:9443"
      - "9000:9000"
      - "8000:8000"

volumes:
  portainer_data:
EOF

# Deploy Portainer
docker compose up -d

# Verify Portainer running
docker ps
docker logs portainer

# Access Portainer web interface
# Navigate to: https://monitor.mdhosting.co.uk:9443
# Create admin account (username: admin, strong password)

Portainer Initial Configuration:

  1. Access https://monitor.mdhosting.co.uk:9443 (accept self-signed certificate warning)
  2. Create admin user:
    • Username: admin
    • Password: [Strong password, store in password manager]
  3. Select environment: Docker (local)
  4. Click "Connect"
  5. Portainer dashboard should load showing local Docker environment

Optional: Configure Portainer Business (Free for 5 nodes):

  1. In Portainer UI, go to Settings → Licenses
  2. Click "Add License"
  3. Enter MDHosting business email (admin@mdhosting.co.uk)
  4. Request free Business Edition license (supports up to 5 nodes)
  5. Enter license key once received
  6. Business features enabled (RBAC, edge agent support, etc.)

Monitoring Stack Deployment

Step 1: Create Docker Compose Configuration

Create directory structure:

# Create monitoring stack directory
mkdir -p /opt/monitoring-stack/{prometheus,loki,grafana}
cd /opt/monitoring-stack

# Create Prometheus configuration
mkdir -p prometheus/config
cat > prometheus/config/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s  # Scrape targets every 15 seconds
  evaluation_interval: 15s  # Evaluate rules every 15 seconds

# Alertmanager configuration (optional, for advanced alerting)
# alerting:
#   alertmanagers:
#     - static_configs:
#         - targets: []

# Rule files (for recording rules and alerts)
rule_files:
  # - "alerts/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter - EU1 (Hosting Server)
  - job_name: 'node-exporter-eu1'
    static_configs:
      - targets: ['eu1.cp:9100']
        labels:
          instance: 'eu1-hosting'
          server_type: 'cpanel_hosting'
          location: 'hetzner_de'

  # Node Exporter - NS1 (DNS Server 1)
  - job_name: 'node-exporter-ns1'
    static_configs:
      - targets: ['ns1.mdhosting.co.uk:9100']
        labels:
          instance: 'ns1-dns'
          server_type: 'dns_primary'
          location: 'hetzner_de'

  # Node Exporter - NS2 (DNS Server 2)
  - job_name: 'node-exporter-ns2'
    static_configs:
      - targets: ['ns2.mdhosting.co.uk:9100']
        labels:
          instance: 'ns2-dns'
          server_type: 'dns_secondary'
          location: 'hetzner_de'
EOF

# Create Loki configuration
cat > loki/loki-config.yml <<'EOF'
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 0.0.0.0
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v12
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
  filesystem:
    directory: /loki/chunks

limits_config:
  retention_period: 336h  # 14 days retention (adjust as needed)
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_query_series: 500
  max_query_lookback: 720h  # 30 days max query lookback

compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150

ruler:
  alertmanager_url: http://localhost:9093
  enable_api: true
  rule_path: /loki/rules-temp
  storage:
    type: local
    local:
      directory: /loki/rules
EOF

# Create Grafana configuration (datasources provisioning)
mkdir -p grafana/provisioning/{datasources,dashboards}
cat > grafana/provisioning/datasources/datasources.yml <<'EOF'
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: '15s'

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: true
    jsonData:
      maxLines: 1000
EOF

# Create dashboard provisioning config
cat > grafana/provisioning/dashboards/dashboards.yml <<'EOF'
apiVersion: 1

providers:
  - name: 'MDHosting Dashboards'
    orgId: 1
    folder: 'MDHosting'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards/files
EOF

# Create dashboards directory (will add dashboard JSON files later)
mkdir -p grafana/provisioning/dashboards/files

Create main docker-compose.yml:

cd /opt/monitoring-stack
cat > docker-compose.yml <<'EOF'
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus/config:/etc/prometheus
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    networks:
      - monitoring

  loki:
    image: grafana/loki:latest
    container_name: loki
    restart: unless-stopped
    command: -config.file=/etc/loki/loki-config.yml
    volumes:
      - ./loki:/etc/loki
      - loki_data:/loki
    ports:
      - "3100:3100"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-changeme}
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=https://monitor.mdhosting.co.uk
      - GF_SMTP_ENABLED=true
      - GF_SMTP_HOST=eu1.cp:587
      - GF_SMTP_USER=admin@mdhosting.co.uk
      - GF_SMTP_PASSWORD=${SMTP_PASSWORD}
      - GF_SMTP_FROM_ADDRESS=admin@mdhosting.co.uk
      - GF_SMTP_FROM_NAME=MDHosting Monitoring
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"
    networks:
      - monitoring
    depends_on:
      - prometheus
      - loki

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  loki_data:
  grafana_data:
EOF

# Create .env file for sensitive configuration
cat > .env <<'EOF'
# Grafana admin password (change this!)
GRAFANA_ADMIN_PASSWORD=CHANGE_THIS_STRONG_PASSWORD

# SMTP password for Grafana alerts (use existing email account password)
SMTP_PASSWORD=YOUR_EMAIL_PASSWORD_HERE
EOF

# Secure .env file
chmod 600 .env

# IMPORTANT: Edit .env and set strong passwords
vi .env

Step 2: Deploy Monitoring Stack

cd /opt/monitoring-stack

# Pull images (optional, docker compose up will do this automatically)
docker compose pull

# Start monitoring stack
docker compose up -d

# Verify all containers running
docker compose ps
# Should show: prometheus, loki, grafana - all "running" status

# Check container logs
docker compose logs prometheus
docker compose logs loki
docker compose logs grafana

# Verify services accessible
curl http://localhost:9090/-/healthy  # Prometheus health
curl http://localhost:3100/ready       # Loki ready
curl http://localhost:3000/api/health  # Grafana health
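
# Validate the Prometheus configuration from inside the container (promtool ships with the prom/prometheus image)
docker exec prometheus promtool check config /etc/prometheus/prometheus.yml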

Access Services:

  • Grafana: http://monitor.mdhosting.co.uk:3000
    • Username: admin
    • Password: (from .env file GRAFANA_ADMIN_PASSWORD)
  • Prometheus: http://monitor.mdhosting.co.uk:9090
  • Loki: http://monitor.mdhosting.co.uk:3100 (no UI, use Grafana)

Verify Data Sources in Grafana:

  1. Log in to Grafana (http://monitor.mdhosting.co.uk:3000)
  2. Go to Configuration → Data Sources
  3. Verify "Prometheus" data source exists and shows green "Data source is working" message
  4. Verify "Loki" data source exists and shows green "Data source is working" message

If data sources not working:

  • Check docker network connectivity: docker network inspect monitoring-stack_monitoring
  • Check container logs: docker compose logs grafana
  • Verify URLs in datasources.yml match container names (prometheus, loki)
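
If those checks do not pinpoint the problem, the simplest test is to reach Prometheus and Loki from inside the Grafana container itself; a small sketch, assuming the stock Grafana image (Alpine-based, so BusyBox wget should be available):

docker exec grafana wget -qO- http://prometheus:9090/-/healthy
docker exec grafana wget -qO- http://loki:3100/ready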

Production Server Agent Deployment

Now deploy Node Exporter and Promtail on all 3 production servers (EU1, NS1, NS2).

Node Exporter Installation

On each production server (eu1.cp, ns1, ns2):

# Create node_exporter user
sudo useradd --no-create-home --shell /bin/false node_exporter

# Download Node Exporter (check for latest version at https://github.com/prometheus/node_exporter/releases)
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz

# Extract and install
tar xvf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

# Clean up
rm -rf node_exporter-1.7.0.linux-amd64*

# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# Reload systemd, start and enable Node Exporter
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

# Verify Node Exporter running
sudo systemctl status node_exporter
curl http://localhost:9100/metrics | head -20

# Configure firewall to allow monitoring server to scrape metrics
# If using CSF:
sudo vi /etc/csf/csf.allow
# Add line: tcp|in|d=9100|s=MONITORING_SERVER_IP  # Allow Prometheus scrape

# Reload CSF
sudo csf -r

# If using firewalld:
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="MONITORING_SERVER_IP" port protocol="tcp" port="9100" accept'
sudo firewall-cmd --reload

Verify from monitoring server:

# From monitoring server
curl http://eu1.cp:9100/metrics | head -20
curl http://ns1.mdhosting.co.uk:9100/metrics | head -20
curl http://ns2.mdhosting.co.uk:9100/metrics | head -20

# Check Prometheus is scraping targets
# Open Prometheus UI: http://monitor.mdhosting.co.uk:9090
# Go to Status → Targets
# Should see node-exporter-eu1, node-exporter-ns1, node-exporter-ns2 all "UP" status
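
# The same check from the command line (assumes jq is installed on the monitoring server)
curl -s http://localhost:9090/api/v1/targets | \
  jq -r '.data.activeTargets[] | "\(.labels.job)\t\(.health)"'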

Promtail Installation

On each production server (eu1.cp, ns1, ns2):

# Create promtail user
sudo useradd --no-create-home --shell /bin/false promtail

# Download Promtail (check for latest version at https://github.com/grafana/loki/releases)
cd /tmp
wget https://github.com/grafana/loki/releases/download/v2.9.3/promtail-linux-amd64.zip

# Extract and install
unzip promtail-linux-amd64.zip
sudo cp promtail-linux-amd64 /usr/local/bin/promtail
sudo chown promtail:promtail /usr/local/bin/promtail
sudo chmod +x /usr/local/bin/promtail

# Clean up
rm -rf promtail-linux-amd64*

# Create Promtail configuration directory
sudo mkdir -p /etc/promtail

# Create Promtail configuration (customize per server)
sudo tee /etc/promtail/promtail-config.yml > /dev/null <<'EOF'
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: http://monitor.mdhosting.internal:3100/loki/api/v1/push

scrape_configs:
  # System logs
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: systemlogs
          server: eu1-hosting  # Change per server: eu1-hosting, ns1-dns, ns2-dns
          __path__: /var/log/messages

  - job_name: secure
    static_configs:
      - targets:
          - localhost
        labels:
          job: secure
          server: eu1-hosting  # Change per server
          __path__: /var/log/secure

  - job_name: cron
    static_configs:
      - targets:
          - localhost
        labels:
          job: cron
          server: eu1-hosting  # Change per server
          __path__: /var/log/cron

  # Web server logs (EU1 only)
  - job_name: apache-access
    static_configs:
      - targets:
          - localhost
        labels:
          job: apache-access
          server: eu1-hosting
          __path__: /usr/local/apache/logs/access_log

  - job_name: apache-error
    static_configs:
      - targets:
          - localhost
        labels:
          job: apache-error
          server: eu1-hosting
          __path__: /usr/local/apache/logs/error_log

  # Email logs (EU1 only)
  - job_name: exim
    static_configs:
      - targets:
          - localhost
        labels:
          job: exim
          server: eu1-hosting
          __path__: /var/log/exim_mainlog

  - job_name: maillog
    static_configs:
      - targets:
          - localhost
        labels:
          job: maillog
          server: eu1-hosting
          __path__: /var/log/maillog

  # Security logs
  - job_name: fail2ban
    static_configs:
      - targets:
          - localhost
        labels:
          job: fail2ban
          server: eu1-hosting  # Change per server
          __path__: /var/log/fail2ban.log

  - job_name: lfd
    static_configs:
      - targets:
          - localhost
        labels:
          job: lfd
          server: eu1-hosting  # Change per server
          __path__: /var/log/lfd.log

  # cPanel logs (EU1 only)
  - job_name: cpanel-access
    static_configs:
      - targets:
          - localhost
        labels:
          job: cpanel-access
          server: eu1-hosting
          __path__: /usr/local/cpanel/logs/access_log

  - job_name: cpanel-error
    static_configs:
      - targets:
          - localhost
        labels:
          job: cpanel-error
          server: eu1-hosting
          __path__: /usr/local/cpanel/logs/error_log
EOF

# Create positions directory
sudo mkdir -p /var/lib/promtail
sudo chown promtail:promtail /var/lib/promtail

# Adjust log file permissions (Promtail needs read access)
# Option 1: Add promtail to adm group (can read most logs)
sudo usermod -aG adm promtail

# Option 2: Specific log file permissions (more restrictive)
# sudo setfacl -m u:promtail:r /var/log/messages
# sudo setfacl -m u:promtail:r /var/log/secure
# ... (repeat for each log file)

# Create systemd service
sudo tee /etc/systemd/system/promtail.service > /dev/null <<'EOF'
[Unit]
Description=Promtail
Wants=network-online.target
After=network-online.target

[Service]
User=promtail
Group=promtail
Type=simple
ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/promtail-config.yml

[Install]
WantedBy=multi-user.target
EOF

# Reload systemd, start and enable Promtail
sudo systemctl daemon-reload
sudo systemctl start promtail
sudo systemctl enable promtail

# Verify Promtail running
sudo systemctl status promtail
sudo journalctl -u promtail -f  # Watch logs for connection to Loki

# Look for "clients/client.go" messages showing successful log shipping
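
# Optional sanity check: -dry-run prints what Promtail would ship instead of sending it to Loki
sudo -u promtail /usr/local/bin/promtail -config.file=/etc/promtail/promtail-config.yml -dry-run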

Important: Customize promtail-config.yml per server!

  • eu1.cp: Include all scrape_configs (system, web, email, cPanel)
  • ns1/ns2: Remove web server, email, and cPanel sections (DNS servers don't have these services)

Verify from Grafana:

  1. Open Grafana: http://monitor.mdhosting.co.uk:3000
  2. Go to Explore
  3. Select Loki data source
  4. Run query: {server="eu1-hosting"} (should show logs from EU1)
  5. Run query: {server="ns1-dns"} (should show logs from NS1)
  6. Run query: {server="ns2-dns"} (should show logs from NS2)

If no logs appearing:

  • Check Promtail status: sudo systemctl status promtail
  • Check Promtail logs: sudo journalctl -u promtail -n 100
  • Verify firewall allows outbound HTTP to monitoring server port 3100
  • Check Loki logs on monitoring server: docker compose logs loki
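
A quick way to confirm what Loki has actually ingested is to query its HTTP API directly on the monitoring server; a sketch, assuming the label names from the Promtail configuration above:

# Label values Loki has seen for "server" (should list eu1-hosting, ns1-dns, ns2-dns)
curl -s http://localhost:3100/loki/api/v1/label/server/values

# Last 15 minutes of log lines from one server
curl -sG http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={server="ns1-dns"}' \
  --data-urlencode 'since=15m' \
  --data-urlencode 'limit=20'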

Dashboard Configuration

Import Pre-Built Dashboards

Node Exporter Full Dashboard:

  1. In Grafana, go to Dashboards → Import
  2. Enter dashboard ID: 1860 (Node Exporter Full)
  3. Click "Load"
  4. Select "Prometheus" as data source
  5. Click "Import"
  6. Dashboard should load showing metrics from all 3 servers

Alternative Node Exporter Dashboard:

  • Dashboard ID: 11074 (Node Exporter for Prometheus Dashboard by StarsL.cn)

Loki Logs Dashboard:

  1. In Grafana, go to Dashboards → Import
  2. Enter dashboard ID: 13639 (Logs / App by Loki)
  3. Click "Load"
  4. Select "Loki" as data source
  5. Click "Import"
  6. Dashboard should show log streams from all servers

Create Custom MDHosting Dashboard

Create Overview Dashboard:

  1. In Grafana, click + → Dashboard
  2. Click "Add visualization"
  3. Select "Prometheus" data source

Panel 1: Server Status (Stat panel):

  • Query: up{job=~"node-exporter-.*"}
  • Visualization: Stat
  • Title: "Server Status"
  • Value mappings: 1 = "UP" (green), 0 = "DOWN" (red)
  • Display: All values

Panel 2: CPU Usage (Time series):

  • Query: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
  • Visualization: Time series
  • Title: "CPU Usage %"
  • Legend: {{instance}}
  • Unit: Percent (0-100)
  • Thresholds: Yellow at 70%, Red at 90%

Panel 3: Memory Usage (Gauge):

  • Query: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
  • Visualization: Gauge
  • Title: "Memory Usage %"
  • Unit: Percent (0-100)
  • Thresholds: Yellow at 75%, Red at 90%

Panel 4: Disk Usage (Bar gauge):

  • Query: (1 - (node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / node_filesystem_size_bytes)) * 100
  • Visualization: Bar gauge
  • Title: "Disk Usage %"
  • Unit: Percent (0-100)
  • Thresholds: Yellow at 80%, Red at 90%

Panel 5: Network Traffic (Time series):

  • Query A (Received): rate(node_network_receive_bytes_total{device!~"lo|veth.*|docker.*"}[5m])
  • Query B (Transmitted): rate(node_network_transmit_bytes_total{device!~"lo|veth.*|docker.*"}[5m])
  • Visualization: Time series
  • Title: "Network Traffic"
  • Unit: Bytes/sec

Panel 6: System Load (Time series):

  • Query A (1min): node_load1
  • Query B (5min): node_load5
  • Query C (15min): node_load15
  • Visualization: Time series
  • Title: "System Load Average"
  • Legend: {{instance}} - {{name}}

Panel 7: Recent Errors (Logs panel):

  • Data Source: Loki
  • Query: {job=~"systemlogs|secure"} |~ "error|fail|critical"
  • Visualization: Logs
  • Title: "Recent System Errors"
  • Time: Last 1 hour

Panel 8: SSH Authentication Failures (Stat):

  • Data Source: Loki
  • Query: count_over_time({job="secure"} |= "Failed password" [1h])
  • Visualization: Stat
  • Title: "SSH Failed Logins (Last Hour)"
  • Thresholds: Yellow at 10, Red at 50

Save Dashboard:

  • Click Save dashboard (top right)
  • Name: "MDHosting Infrastructure Overview"
  • Folder: MDHosting
  • Tags: infrastructure, overview
  • Click "Save"

Configure Alert Rules

Alert 1: High CPU Usage:

  1. In Grafana, go to Alerting → Alert rules
  2. Click "New alert rule"
  3. Alert name: High CPU Usage
  4. Query A: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  5. Condition: WHEN last() OF query(A) IS ABOVE 80
  6. Evaluate every: 1m
  7. For: 10m (only alert if condition persists for 10 minutes)
  8. Annotations:
    • Summary: High CPU usage on {{ $labels.instance }}
    • Description: CPU usage is {{ $value }}% on {{ $labels.instance }}
  9. Labels: severity=warning, server={{ $labels.instance }}
  10. Click "Save"

Alert 2: High Memory Usage:

  1. Create new alert rule
  2. Alert name: High Memory Usage
  3. Query A: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
  4. Condition: WHEN last() OF query(A) IS ABOVE 85
  5. Evaluate every: 1m
  6. For: 5m
  7. Annotations:
    • Summary: High memory usage on {{ $labels.instance }}
    • Description: Memory usage is {{ $value }}% on {{ $labels.instance }}
  8. Labels: severity=warning
  9. Click "Save"

Alert 3: Disk Space Critical:

  1. Create new alert rule
  2. Alert name: Disk Space Critical
  3. Query A: (1 - (node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / node_filesystem_size_bytes)) * 100 > 90
  4. Condition: WHEN last() OF query(A) IS ABOVE 90
  5. Evaluate every: 5m
  6. For: 5m
  7. Annotations:
    • Summary: CRITICAL: Disk space low on {{ $labels.instance }}
    • Description: Disk usage is {{ $value }}% on {{ $labels.instance }}. Immediate action required.
  8. Labels: severity=critical
  9. Click "Save"

Alert 4: Server Down:

  1. Create new alert rule
  2. Alert name: Server Down
  3. Query A: up{job=~"node-exporter-.*"} == 0
  4. Condition: WHEN last() OF query(A) IS BELOW 1
  5. Evaluate every: 1m
  6. For: 2m (give 2 minutes for temporary network issues)
  7. Annotations:
    • Summary: CRITICAL: Server {{ $labels.instance }} is DOWN
    • Description: Cannot reach {{ $labels.instance }}. Check server immediately.
  8. Labels: severity=critical
  9. Click "Save"

Configure Notification Channel:

  1. In Grafana, go to Alerting → Contact points
  2. Click "New contact point"
  3. Name: Email Admin
  4. Integration: Email
  5. Addresses: admin@mdhosting.co.uk
  6. Optional: Test notification (click "Test" button)
  7. Click "Save contact point"

Configure Notification Policy:

  1. Go to Alerting → Notification policies
  2. Edit "Default policy"
  3. Contact point: Email Admin
  4. Group by: alertname, instance
  5. Timing:
    • Group wait: 30s
    • Group interval: 5m
    • Repeat interval: 4h
  6. Click "Save policy"

Test Alerting:

Simulate high CPU to test alerting:

# On any production server
# Install stress tool
sudo dnf install -y stress

# Generate CPU load for 15 minutes
stress --cpu 4 --timeout 900s &

# Watch for alert to fire in Grafana (after 10 minutes)
# Check email for alert notification
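
# To exercise the disk-space alert as well, temporarily create a large file
# (hypothetical path; pick a size that pushes the filesystem past the threshold)
fallocate -l 10G /tmp/alert-test.img
df -h /
# ...once the alert has fired and the notification arrived, clean up:
# rm -f /tmp/alert-test.img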

Operational Procedures

Daily Monitoring Workflow

With Grafana (5-10 minutes):

  1. Open Grafana Dashboard:
    • Navigate to "MDHosting Infrastructure Overview" dashboard
    • Quick visual scan: All servers green? Any panels in red/yellow?

  2. Check Key Metrics:
    • Server Status: All showing "UP"?
    • CPU Usage: Any sustained high usage (>70%)?
    • Memory Usage: Any servers >75%?
    • Disk Usage: Any approaching 80%?
    • Network Traffic: Any unusual spikes or patterns?

  3. Review Recent Errors:
    • Check "Recent System Errors" panel
    • Click on any interesting errors for full log context
    • Investigate critical/error level messages

  4. Check Alerts:
    • Go to Alerting → Alert rules
    • Any active alerts? If yes, investigate and resolve
    • Review recently resolved alerts for patterns

  5. Log Investigation (if needed):
    • Go to Explore → Select Loki
    • Run targeted queries for specific investigations:
      • SSH failures: {job="secure"} |= "Failed password"
      • Email issues: {job="exim"} |~ "error|frozen"
      • Apache errors: {job="apache-error"}

Compared to Manual Monitoring (Before Grafana):

  • Before: 15-30 minutes SSHing to each server, running commands, checking logs
  • After: 5-10 minutes in Grafana dashboard, only SSH if specific issue requires intervention
  • Time Saved: 10-20 minutes per day = 60-140 minutes per week

Incident Investigation Workflow

Scenario: Client reports "website slow"

Traditional Approach (Before Grafana):

  1. SSH to eu1.cp
  2. Check load average: uptime
  3. Check disk: df -h
  4. Check memory: free -h
  5. Check processes: top
  6. Check Apache logs: tail -f /usr/local/apache/logs/error_log
  7. Check specific domain logs: tail -f /usr/local/apache/domlogs/client-domain.com
  8. Correlate timing with other events manually
  9. Time: 15-30 minutes

Grafana Approach:

  1. Open "MDHosting Infrastructure Overview" dashboard
  2. Visual inspection: CPU/memory/disk spikes at time of complaint?
  3. Time series analysis: Set time range to when client reported issue
  4. Log correlation: In Explore, query:

     {server="eu1-hosting", job=~"apache-access|apache-error"} |= "client-domain.com"

  5. Identify issue: High CPU correlates with error spike in client's domain logs
  6. Root cause: Specific error pattern identified (e.g., PHP fatal error, slow database query)
  7. Resolution: SSH to server only if needed for remediation
  8. Time: 5-10 minutes

Efficiency Gain: 2-3x faster incident investigation

Alert Response Procedures

Critical Alert: Server Down

Alert Received:

Subject: [CRITICAL] Server ns1-dns is DOWN
Body: Cannot reach ns1.mdhosting.co.uk. Check server immediately.

Response Steps:

  1. Verify Alert:
    • Open Grafana: Check "Server Status" panel
    • Confirm server showing as DOWN
    • Check alert timeline: How long has it been down?

  2. Initial Investigation:
    • Attempt to ping server: ping ns1.mdhosting.co.uk
    • Attempt SSH access: ssh ns1.mdhosting.co.uk
    • Check Hetzner Cloud Console: Server status, console access

  3. Root Cause Analysis:
    • If accessible via Hetzner console: Check system logs for crash/kernel panic
    • If not accessible: Hardware failure, network issue, or hosting provider outage
    • Check Grafana logs (if server was up recently): Any errors before going down?

  4. Resolution:
    • If server reboot needed: Via Hetzner console or SSH
    • If hardware failure: Contact Hetzner support, consider failover (DNS has redundancy)
    • If network issue: Check Hetzner status page, contact support

  5. Validation:
    • Verify server returns to "UP" status in Grafana
    • Check all services running: systemctl status named
    • Monitor for stability (watch dashboard for 15 minutes)

  6. Follow-Up:
    • Document incident in incident log
    • Update Incident Response if new pattern
    • Review alert threshold if false positive

High Alert: Disk Space Critical

Alert Received:

Subject: [CRITICAL] Disk space low on eu1-hosting
Body: Disk usage is 92% on eu1.cp. Immediate action required.

Response Steps:

  1. Verify Alert:
    • Open Grafana: Check "Disk Usage %" panel
    • Confirm disk usage level and trend (increasing rapidly or stable?)

  2. Immediate Investigation:
    • SSH to server: ssh eu1.cp
    • Check disk usage: df -h
    • Identify largest directories: du -sh /home/* /var/* | sort -hr | head -20

  3. Common Causes:
    • Client backups not cleaned up
    • Log files growing excessively
    • Abandoned files in /tmp or /var/tmp
    • Email queue buildup (frozen messages)
    • cPanel backup staging files

  4. Resolution:
    • Remove old backups: Review /backup directory, clean up manually
    • Rotate logs: logrotate -f /etc/logrotate.conf
    • Clean up temp files: rm -rf /tmp/* /var/tmp/* (be cautious)
    • Clear mail queue frozen messages: exiqgrep -z -i | xargs exim -Mrm
    • Identify and remove large unnecessary files

  5. Validation:
    • Verify disk usage dropped below 85%: df -h
    • Check Grafana dashboard: Disk usage panel should reflect reduction
    • Wait for alert to auto-resolve

  6. Follow-Up:
    • Investigate why disk usage increased (client issue? backup issue?)
    • Adjust backup retention policy if needed (see Backup Recovery)
    • Consider increasing disk space if recurring issue (upgrade to larger Hetzner server)

Warning Alert: High CPU Usage

Alert Received:

Subject: [WARNING] High CPU usage on eu1-hosting
Body: CPU usage is 87% on eu1.cp

Response Steps:

  1. Verify Alert:
    • Open Grafana: Check "CPU Usage %" time series panel
    • Assess pattern: Sustained high usage or spike?
    • Check timing: Correlates with known events (backups, cron jobs)?

  2. Investigate Cause:
    • SSH to server: ssh eu1.cp
    • Check top processes: top or htop
    • Identify resource-heavy processes

  3. Common Causes:
    • Client website generating high traffic (legitimate)
    • Backup job running
    • ClamAV malware scan (scheduled daily)
    • Brute force attack (check Apache logs for excessive requests)
    • Compromised account running malicious scripts

  4. Resolution:
    • Legitimate traffic: Monitor, consider optimizations or upgrades
    • Backup/scan: Wait for completion, adjust schedule if problematic
    • Attack: Block attacking IPs (CSF), investigate compromised account
    • Malicious scripts: Identify and kill processes, investigate compromise

  5. Validation:
    • Verify CPU usage returns to normal (<70%)
    • Check Grafana: CPU panel should show reduction
    • Wait for alert to auto-resolve

  6. Follow-Up:
    • If attack: Follow Incident Response - Unauthorized Access
    • If legitimate load: Review capacity planning, consider infrastructure upgrades
    • If recurring: Adjust alert threshold (e.g., increase to 90% if 80% too sensitive)

Backup and Disaster Recovery

Backing Up Monitoring Configuration:

What to Backup:

  • Grafana dashboards and configuration (automated via volume)
  • Prometheus configuration (prometheus.yml)
  • Loki configuration (loki-config.yml)
  • Docker Compose files (docker-compose.yml, .env)
  • Grafana provisioning files (data sources, dashboards)
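
Dashboards can also be exported as JSON over the Grafana HTTP API and kept alongside the backup; a minimal sketch, assuming jq is installed and GRAFANA_TOKEN holds a (hypothetical) service-account token created in the Grafana UI:

# Export every dashboard as JSON into the backup directory
mkdir -p /root/monitoring-backups/dashboards
for uid in $(curl -s -H "Authorization: Bearer ${GRAFANA_TOKEN}" \
    "http://localhost:3000/api/search?type=dash-db" | jq -r '.[].uid'); do
  curl -s -H "Authorization: Bearer ${GRAFANA_TOKEN}" \
    "http://localhost:3000/api/dashboards/uid/${uid}" > /root/monitoring-backups/dashboards/${uid}.json
done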

Automated Backup (Recommended):

# Create backup script
cat > /opt/monitoring-stack/backup.sh <<'EOF'
#!/bin/bash
# MDHosting Monitoring Stack Backup Script

BACKUP_DIR="/root/monitoring-backups"
DATE=$(date +%Y-%m-%d)
BACKUP_FILE="monitoring-backup-${DATE}.tar.gz"

# Create backup directory
mkdir -p ${BACKUP_DIR}

# Stop containers (optional, for consistent backup)
cd /opt/monitoring-stack
# docker compose stop

# Create backup archive
tar czf ${BACKUP_DIR}/${BACKUP_FILE} \
    /opt/monitoring-stack/ \
    /opt/portainer/

# Restart containers if stopped
# docker compose start

# Remove backups older than 30 days
find ${BACKUP_DIR} -name "monitoring-backup-*.tar.gz" -mtime +30 -delete

echo "Backup completed: ${BACKUP_FILE}"
ls -lh ${BACKUP_DIR}/${BACKUP_FILE}
EOF

# Make script executable
chmod +x /opt/monitoring-stack/backup.sh

# Add to cron (daily at 2am)
crontab -e
# Add line:
0 2 * * * /opt/monitoring-stack/backup.sh >> /var/log/monitoring-backup.log 2>&1

Manual Backup:

# Create backup
cd /opt
tar czf /root/monitoring-backup-$(date +%Y-%m-%d).tar.gz monitoring-stack/ portainer/

# Transfer to Hetzner Storage Box or local machine
scp /root/monitoring-backup-*.tar.gz user@backup-server:/backups/

Restore Procedure:

Scenario: Monitoring server failure, need to rebuild

  1. Provision New Monitoring Server:
  2. Follow "Monitoring Server Setup" steps (provision Hetzner CPX31, install AlmaLinux, Docker, etc.)

  3. Restore Configuration:

    # Transfer backup to new server
    scp monitoring-backup-YYYY-MM-DD.tar.gz mdhosting@new-monitor.mdhosting.co.uk:/tmp/
    
    # SSH to new server
    ssh mdhosting@new-monitor.mdhosting.co.uk
    
    # Extract backup
    cd /opt
    sudo tar xzf /tmp/monitoring-backup-YYYY-MM-DD.tar.gz
    
    # Start monitoring stack
    cd /opt/monitoring-stack
    docker compose up -d
    
    # Start Portainer
    cd /opt/portainer
    docker compose up -d
    

  3. Verify Restoration (CLI health checks are also sketched after this list):

    • Access Grafana: https://new-monitor.mdhosting.co.uk:3000
    • Verify dashboards present
    • Verify data sources configured
    • Check Prometheus scraping metrics (may need to wait for next scrape interval)
    • Check Loki receiving logs

  4. Update DNS:

    • Update monitor.mdhosting.co.uk DNS A record to new server IP
    • Wait for DNS propagation (5-60 minutes)

  5. Decommission Old Server:

    • Once new server verified working, delete old server from Hetzner Cloud Console
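
For step 3 (Verify Restoration), a few command-line checks on the new monitoring server complement the UI checks; a minimal sketch, using the same health endpoints referenced in the Troubleshooting section:

# Confirm all containers are up
cd /opt/monitoring-stack && docker compose ps

# Prometheus and Loki health endpoints
curl -s http://localhost:9090/-/healthy
curl -s http://localhost:3100/ready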

Data Retention and Disaster Recovery:

Metrics (Prometheus): - Retention: 30 days (configurable in docker-compose.yml) - Impact of Loss: Lose historical metrics trend analysis - Mitigation: 30 days sufficient for most investigations, can extend to 90 days if needed - Recovery: Cannot recover lost metrics, but Node Exporter will continue sending new metrics immediately

Logs (Loki): - Retention: 14 days (configurable in loki-config.yml) - Impact of Loss: Lose historical log analysis capabilities - Mitigation: Critical logs still on production servers (/var/log/), Loki is additional aggregation layer - Recovery: Cannot recover lost logs from Loki, but original logs on production servers remain

Dashboards and Configuration: - Retention: Persistent Docker volumes + daily backups - Impact of Loss: Need to recreate dashboards and alerts manually - Mitigation: Regular backups (automated script above) + export dashboards as JSON - Recovery: Restore from backup (see Restore Procedure above)
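
For quick reference, the two retention settings live in the files named above; a short check, assuming the paths used elsewhere in this document:

# Prometheus retention (flag in docker-compose.yml; planned value 30d)
grep 'storage.tsdb.retention.time' /opt/monitoring-stack/docker-compose.yml

# Loki retention (limits_config in loki-config.yml; planned value 336h = 14 days)
grep 'retention_period' /opt/monitoring-stack/loki/loki-config.yml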

Critical: Production server logs are always retained on the production servers themselves per normal logrotate policies (4 weeks). Loki provides centralised access but does NOT replace the original log files.

Integration with Future Wazuh Deployment

Phase 2 Integration Strategy

When Wazuh Deploys (Post-ApisCP Migration):

Grafana stack provides the perfect foundation for Wazuh integration:

Wazuh + Grafana Unified Monitoring

Architecture (Post-ApisCP):

graph TB
    subgraph "NEW ApisCP Servers - Post-Migration"
        APIS1[ApisCP Server 1<br/>No Imunify360]
        APIS2[ApisCP Server 2<br/>No Imunify360]
    end

    subgraph "Monitoring Server - Shared Infrastructure"
        GRAFANA[Grafana<br/>Unified Dashboards]
        PROMETHEUS[Prometheus<br/>Infrastructure Metrics]
        LOKI[Loki<br/>Log Aggregation]
    end

    subgraph "NEW Wazuh SIEM Server"
        WAZUH[Wazuh Manager<br/>Security Events]
        WAZUH_IDX[Wazuh Indexer<br/>OpenSearch]
    end

    subgraph "Agents on ApisCP Servers"
        NEXPORTER[Node Exporter<br/>Metrics]
        PROMTAIL[Promtail<br/>Logs]
        WAZUH_AGENT[Wazuh Agent<br/>Security Events]
    end

    APIS1 --> NEXPORTER
    APIS1 --> PROMTAIL
    APIS1 --> WAZUH_AGENT

    APIS2 --> NEXPORTER
    APIS2 --> PROMTAIL
    APIS2 --> WAZUH_AGENT

    NEXPORTER --> PROMETHEUS
    PROMTAIL --> LOKI
    WAZUH_AGENT --> WAZUH

    PROMETHEUS --> GRAFANA
    LOKI --> GRAFANA
    WAZUH_IDX --> GRAFANA

    WAZUH --> WAZUH_IDX

    style GRAFANA fill:#f39c12,stroke:#2c3e50,stroke-width:3px,color:#fff
    style APIS1 fill:#3498db,stroke:#2c3e50,stroke-width:2px,color:#fff
    style APIS2 fill:#3498db,stroke:#2c3e50,stroke-width:2px,color:#fff
    style WAZUH fill:#e74c3c,stroke:#2c3e50,stroke-width:2px,color:#fff

Benefits of Unified Approach:

  1. Single Pane of Glass:

    • Infrastructure metrics (Prometheus) + Security events (Wazuh) + Logs (Loki) all in Grafana
    • No need to switch between Wazuh Dashboard and Grafana
    • Correlate infrastructure anomalies with security events

  2. Enhanced Incident Investigation:

    • Example: High CPU alert fires (Prometheus)
    • Correlate with security events: Was it a brute force attack? (Wazuh)
    • Investigate logs: What happened? (Loki)
    • All in one dashboard, single query interface

  3. Unified Alerting:

    • Grafana Unified Alerting handles both infrastructure and security alerts
    • Single notification channel (email, Slack, PagerDuty, etc.)
    • Consistent alert format and response procedures

Grafana + Wazuh Integration Steps

When Wazuh is deployed (Q3-Q4 2026):

  1. Add Wazuh Data Source to Grafana:

    # In /opt/monitoring-stack/grafana/provisioning/datasources/datasources.yml
    - name: Wazuh
      type: elasticsearch
      access: proxy
      url: http://wazuh-indexer:9200
      database: "wazuh-alerts-*"
      isDefault: false
      editable: true
      jsonData:
        esVersion: "7.10.0"
        timeField: "@timestamp"
        logMessageField: "full_log"
        logLevelField: "rule.level"
    

  2. Import Wazuh Dashboards:

    • Wazuh provides pre-built Grafana dashboards
    • Import via Grafana UI: Dashboards → Import → Upload Wazuh JSON files
    • Dashboards available for:
      • Security Overview
      • Threat Intelligence
      • File Integrity Monitoring
      • Vulnerability Detection
      • Compliance (PCI DSS, GDPR, CIS)

  3. Create Unified Security + Infrastructure Dashboard:

    • Combine panels from Node Exporter and Wazuh dashboards
    • Example layout:
      • Row 1: Server status (infrastructure) + Security alert summary (Wazuh)
      • Row 2: CPU/Memory/Disk (Prometheus) + Top security events (Wazuh)
      • Row 3: Network traffic (Prometheus) + Intrusion attempts (Wazuh)
      • Row 4: Recent logs (Loki) + File integrity changes (Wazuh)

  4. Configure Cross-Data-Source Alerts:

    • Example: "High CPU + Security Alert Correlation"
      • Query A (Prometheus): CPU > 80% (see the PromQL sketch after this list)
      • Query B (Wazuh): rule.level >= 10 (high-severity security events)
      • Condition: IF both true within a 5-minute window, fire alert
      • Interpretation: Possible crypto-mining malware or resource-intensive attack

  5. Update Operational Procedures:

    • Daily monitoring: Include Wazuh panels in dashboard review
    • Incident investigation: Use Grafana Explore to correlate all 3 data sources
    • Alert response: Unified alert handling procedures
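
For Query A, a commonly used node_exporter expression for overall CPU utilisation is sketched below, evaluated here through the Prometheus HTTP API on the monitoring server; the exact expression and threshold should be adjusted to match whatever the final dashboard and alert rule use:

# Percentage of non-idle CPU per instance over the last 5 minutes, filtered to > 80%
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80'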

Documentation Updates Required: - Wazuh Deployment - Add Grafana integration section - Security Monitoring - Update with unified monitoring workflows - This document (Grafana Monitoring) - Add post-Wazuh integration procedures

Maintaining Separation During Migration

During ApisCP Migration (Q2-Q3 2026):

Grafana stack will monitor BOTH old cPanel and new ApisCP infrastructure:

Monitoring Strategy: - Old cPanel servers (EU1, NS1, NS2): Keep existing Node Exporter + Promtail (no Wazuh) - New ApisCP servers: Add Node Exporter + Promtail + Wazuh agents (once deployed) - Grafana dashboards: Separate "cPanel Infrastructure" and "ApisCP Infrastructure" dashboards - Migration visibility: Track metrics on both platforms during migration

Benefits: - Comparison: Compare performance of cPanel vs. ApisCP side-by-side - Risk Mitigation: Detect issues early on new ApisCP servers before full migration - Confidence: Verify new infrastructure stable before decommissioning old servers

Post-Migration: - Remove old cPanel servers from monitoring once fully decommissioned - Unified dashboard: Single "MDHosting Infrastructure" dashboard for all ApisCP servers - Wazuh integration: Deploy Wazuh agents on ApisCP servers (no Imunify360 conflict)

Troubleshooting

Common Issues and Resolutions

Issue 1: Prometheus Not Scraping Targets

Symptoms: - Prometheus UI (http://monitor.mdhosting.co.uk:9090) shows targets as "DOWN" - No metrics in Grafana dashboards - Error: "context deadline exceeded" or "connection refused"

Diagnosis:

# On monitoring server
# Check Prometheus logs
docker logs prometheus | tail -50

# Test connectivity to Node Exporter from monitoring server
curl http://eu1.cp:9100/metrics
curl http://ns1.mdhosting.co.uk:9100/metrics
curl http://ns2.mdhosting.co.uk:9100/metrics

# Check Prometheus configuration
cat /opt/monitoring-stack/prometheus/config/prometheus.yml
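
The configuration can also be validated with promtool, which ships inside the Prometheus container image; a quick check, assuming the container is named prometheus and the config is mounted at /etc/prometheus/prometheus.yml inside it:

docker exec prometheus promtool check config /etc/prometheus/prometheus.yml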

Resolution:

  1. Firewall Issue:

    # On production server (eu1.cp, ns1, ns2)
    # Verify firewall allows monitoring server IP to port 9100
    sudo csf -g MONITORING_SERVER_IP  # Should show rule allowing port 9100
    
    # If not, add rule
    sudo vi /etc/csf/csf.allow
    # Add: tcp|in|d=9100|s=MONITORING_SERVER_IP
    sudo csf -r
    

  2. Node Exporter Not Running:

    # On production server
    sudo systemctl status node_exporter
    # If not running:
    sudo systemctl start node_exporter
    sudo systemctl enable node_exporter
    

  3. Incorrect Hostname in Prometheus Config:

    # On monitoring server
    # Verify hostnames resolve
    nslookup eu1.cp
    nslookup ns1.mdhosting.co.uk
    nslookup ns2.mdhosting.co.uk
    
    # If resolution fails, add to /etc/hosts or fix DNS
    sudo vi /etc/hosts
    # Add: [IP_ADDRESS] eu1.cp
    

  4. Reload Prometheus Configuration:

    # On monitoring server
    cd /opt/monitoring-stack
    docker compose restart prometheus
    

Issue 2: Loki Not Receiving Logs

Symptoms: - No logs appearing in Grafana Explore with Loki data source - Grafana query returns "No data" - Promtail logs show errors

Diagnosis:

# On production server (check Promtail status)
sudo systemctl status promtail
sudo journalctl -u promtail -n 100

# Look for errors like:
# "Post http://monitor.mdhosting.internal:3100/loki/api/v1/push: dial tcp: lookup monitor.mdhosting.internal: no such host"
# "error sending batch, will retry: 400 Bad Request"

# On monitoring server (check Loki logs)
docker logs loki | tail -50
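
Loki's HTTP API can also confirm whether any log streams have been ingested at all; a quick check from the monitoring server using Loki's standard endpoints:

# Readiness check
curl -s http://localhost:3100/ready

# List label names Loki has indexed (empty or near-empty output suggests no logs are arriving)
curl -s http://localhost:3100/loki/api/v1/labels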

Resolution:

  1. Hostname Resolution Issue:

    # On production server
    # Verify monitoring server hostname resolves
    nslookup monitor.mdhosting.internal
    
    # If fails, add to /etc/hosts
    sudo vi /etc/hosts
    # Add: [MONITORING_SERVER_IP] monitor.mdhosting.internal
    

  2. Firewall Blocking Loki Port:

    # On production server
    # Test connectivity to Loki
    telnet monitor.mdhosting.internal 3100
    curl http://monitor.mdhosting.internal:3100/ready
    
    # If fails, check monitoring server firewall allows port 3100
    

  3. Promtail Configuration Error:

    # On production server
    # Validate Promtail config syntax
    /usr/local/bin/promtail -config.file=/etc/promtail/promtail-config.yml -dry-run
    
    # Common errors:
    # - Incorrect indentation (YAML)
    # - Invalid label names
    # - Incorrect Loki URL
    
    # Fix config and restart
    sudo systemctl restart promtail
    

  4. Log File Permissions:

    # On production server
    # Verify promtail user can read log files
    sudo -u promtail cat /var/log/messages
    # If "Permission denied":
    
    # Add promtail to adm group
    sudo usermod -aG adm promtail
    sudo systemctl restart promtail
    

Issue 3: Grafana Not Loading Dashboards

Symptoms: - Grafana UI loads but dashboards show "No data" - Data sources show errors - Panels display "Failed to fetch" errors

Diagnosis:

# On monitoring server
# Check Grafana logs
docker logs grafana | tail -100

# Check data source configuration
cat /opt/monitoring-stack/grafana/provisioning/datasources/datasources.yml

# Test data sources
# Prometheus:
curl http://localhost:9090/-/healthy
# Loki:
curl http://localhost:3100/ready
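
Grafana's HTTP API can also list the data sources it currently knows about, which helps distinguish a provisioning problem from a connectivity problem; a sketch assuming the Grafana admin credentials set for this stack (placeholder password shown):

curl -s -u admin:GRAFANA_ADMIN_PASSWORD http://localhost:3000/api/datasources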

Resolution:

  1. Data Source URL Incorrect:

    # Verify container names in docker network
    docker network inspect monitoring-stack_monitoring
    
    # URLs should use container names (not localhost):
    # Prometheus: http://prometheus:9090
    # Loki: http://loki:3100
    
    # Fix datasources.yml if needed
    vi /opt/monitoring-stack/grafana/provisioning/datasources/datasources.yml
    # Restart Grafana
    docker compose restart grafana
    

  2. Data Source Not Provisioned:

    # In Grafana UI, go to Configuration → Data Sources
    # If Prometheus/Loki not listed, manually add:
    # 1. Click "Add data source"
    # 2. Select type (Prometheus or Loki)
    # 3. Enter URL (http://prometheus:9090 or http://loki:3100)
    # 4. Click "Save & Test"
    

  3. Docker Network Issue:

    # Verify all containers on same network
    docker ps
    docker inspect prometheus | grep NetworkMode
    docker inspect grafana | grep NetworkMode
    
    # If different networks, recreate containers
    cd /opt/monitoring-stack
    docker compose down
    docker compose up -d
    

Issue 4: Alerts Not Firing or Email Not Received

Symptoms: - Alert conditions met but no alert fires - Alert fires but no email received - Grafana shows alert as "Pending" indefinitely

Diagnosis:

# On monitoring server
# Check Grafana logs for SMTP errors
docker logs grafana | grep -i smtp
docker logs grafana | grep -i email
docker logs grafana | grep -i alert

# In Grafana UI:
# Go to Alerting → Alert rules
# Check rule status (Normal, Pending, Firing)
# Click rule to see evaluation history
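
If email delivery is the suspect, a TLS-aware handshake test from the monitoring server gives more detail than plain telnet; a sketch using the same SMTP host and port configured for Grafana (eu1.cp:587):

# Expect a 220 banner and a successful STARTTLS negotiation; Ctrl+C to exit
openssl s_client -starttls smtp -connect eu1.cp:587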

Resolution:

  1. SMTP Configuration Issue:

    # Verify SMTP settings in docker-compose.yml
    cd /opt/monitoring-stack
    cat docker-compose.yml | grep GF_SMTP
    
    # Test SMTP connectivity from monitoring server
    telnet eu1.cp 587
    
    # If fails, check:
    # - SMTP host/port correct
    # - SMTP password correct in .env file
    # - Firewall allows outbound port 587
    

  2. Email Address Configuration:

    # In Grafana UI:
    # Go to Alerting → Contact points
    # Edit "Email Admin" contact point
    # Verify email address correct: admin@mdhosting.co.uk
    # Click "Test" button to send test email
    # Check spam folder if not received
    

  3. Alert Rule Threshold Not Met:

    # In Grafana UI:
    # Go to Alerting → Alert rules
    # Click alert rule
    # Check "For" duration: Alert only fires if condition persists for specified time
    # Example: "CPU > 80% for 10m" means CPU must be >80% for 10 continuous minutes
    
    # Adjust "For" duration if too long
    # Save rule and wait for next evaluation
    

  4. Notification Policy Not Configured:

    # In Grafana UI:
    # Go to Alerting → Notification policies
    # Verify "Default policy" has:
    #   - Contact point: Email Admin
    #   - Group by: alertname, instance
    # If not configured, edit and save
    

Issue 5: High Resource Usage on Monitoring Server

Symptoms: - Monitoring server running slow - Prometheus/Loki/Grafana containers using excessive CPU/memory - Disk space filling up rapidly

Diagnosis:

# On monitoring server
# Check resource usage
docker stats

# Check disk usage
df -h
du -sh /var/lib/docker/volumes/*

# Check Prometheus data size
du -sh /var/lib/docker/volumes/monitoring-stack_prometheus_data/

# Check Loki data size
du -sh /var/lib/docker/volumes/monitoring-stack_loki_data/
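
Docker can also summarise what is consuming space across images, containers, and volumes, which helps separate monitoring data growth from general Docker clutter:

docker system df -v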

Resolution:

  1. Reduce Prometheus Retention:

    # Edit docker-compose.yml
    vi /opt/monitoring-stack/docker-compose.yml
    
    # Change Prometheus retention from 30d to 14d:
    command:
      - '--storage.tsdb.retention.time=14d'
    
    # Restart Prometheus
    docker compose restart prometheus
    

  2. Reduce Loki Retention:

    # Edit Loki config
    vi /opt/monitoring-stack/loki/loki-config.yml
    
    # Change retention_period from 336h (14d) to 168h (7d)
    limits_config:
      retention_period: 168h
    
    # Restart Loki
    docker compose restart loki
    

  3. Increase Scrape Interval (Prometheus):

    # Edit Prometheus config
    vi /opt/monitoring-stack/prometheus/config/prometheus.yml
    
    # Change scrape_interval from 15s to 30s or 60s
    global:
      scrape_interval: 30s
    
    # Restart Prometheus
    docker compose restart prometheus
    

  4. Upgrade Monitoring Server:

    # If resource optimization not sufficient, upgrade server
    # Via Hetzner Cloud Console:
    # 1. Shut down monitoring server
    # 2. Resize to CPX41 (8 vCPU, 16GB RAM) or larger
    # 3. Restart server
    # Cost increase: ~£10/month
    

Getting Help

Internal Resources: - This documentation (docs/projects/grafana-monitoring.md) - Security Monitoring - Integration procedures - Incident Response - Escalation procedures - Contacts - Vendor contacts

External Resources:

| Resource | URL | Use Case |
|----------|-----|----------|
| Grafana Documentation | https://grafana.com/docs/grafana/latest/ | Grafana features, configuration, troubleshooting |
| Prometheus Documentation | https://prometheus.io/docs/ | PromQL queries, configuration, best practices |
| Loki Documentation | https://grafana.com/docs/loki/latest/ | LogQL queries, configuration, retention |
| Portainer Documentation | https://docs.portainer.io/ | Container management, troubleshooting |
| Grafana Community Forums | https://community.grafana.com/ | User questions, dashboard sharing |
| Prometheus Mailing List | https://groups.google.com/forum/#!forum/prometheus-users | Technical questions, best practices |

Vendor Support:

| Vendor | Service | Contact | Notes |
|--------|---------|---------|-------|
| Hetzner | Infrastructure hosting | https://robot.hetzner.com/ | Server issues, network problems |
| Grafana Labs | Grafana OSS (free) | Community forums only | No paid support for OSS version |

Emergency Escalation: - If monitoring server failure impacts production monitoring, follow Incident Response - Infrastructure Failure - Monitoring server failure does NOT impact production services (EU1, NS1, NS2 continue operating) - Can operate without monitoring temporarily; rebuild from backup


Document Control

Version History:

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | January 2026 | Claude Sonnet 4.5 | Initial comprehensive Grafana monitoring deployment documentation |

Review Schedule:

  • Post-Deployment Review: 2 weeks after Phase 5 completion (validate effectiveness)
  • Quarterly Review: Assess monitoring coverage, dashboard effectiveness, alert accuracy
  • Post-Wazuh Integration: Major revision after Wazuh SIEM deployment and integration
  • Annual Review: Comprehensive review and update (January each year)

Next Review Date: March 2026 (post-deployment review)

Related Documentation:

Document Status: ✅ Complete - Comprehensive Grafana monitoring deployment plan Classification: Confidential - Internal Use Only Document Owner: MDHosting Ltd Director (Matthew Dinsdale)


Last updated: January 2026