Grafana Monitoring Stack Deployment
Executive Summary
Objective: Deploy Grafana + Prometheus + Loki stack for centralised infrastructure monitoring and log aggregation on current cPanel servers.
Benefits: - Immediate Value: Provides monitoring NOW, before ApisCP migration - No Conflicts: Compatible with Imunify360 FREE + KernelCare on cPanel - Centralised Visibility: Unified dashboards for all 3 servers (EU1, NS1, NS2) - Log Aggregation: Centralised log collection and search with Loki - Performance Monitoring: Real-time metrics with Prometheus - Cost Effective: Open-source stack, only infrastructure costs - Future Integration: Foundation for Wazuh integration post-ApisCP migration
Timeline: 2-3 weeks for full deployment
Infrastructure Required: New Hetzner CPX31 server (€13.79/month, £12/month)
Project Status: Phase 1 - Planning and preparation
Current State
Existing Monitoring Limitations
Current Challenges: - Distributed Logs: Must SSH to each server individually to review logs - No Centralised Metrics: Cannot compare resource usage across servers - Manual Correlation: Difficult to correlate events across infrastructure - No Historical Trends: Limited ability to track performance over time - Reactive Posture: Discover issues when clients report problems - Time-Consuming: Daily monitoring tasks take 15-30 minutes manually
Current Servers: - eu1.cp (CPX31): Hosting server with ~30 client accounts - ns1.mdhosting.co.uk (CX22): Primary DNS server - ns2.mdhosting.co.uk (CX22): Secondary DNS server
Current Security Stack (Must Not Conflict):
- Imunify360 FREE (uses /var/ossec directory - OSSEC-based)
- KernelCare (live kernel patching)
- CSF Firewall (connection tracking, rate limiting)
- Fail2Ban (intrusion prevention)
- ClamAV (malware scanning)
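Because Imunify360's agent owns /var/ossec and the tools above already hold several ports, a quick pre-flight check on each cPanel server is worth running before the agents in this plan are installed. A minimal sketch; the service names shown are the common Imunify360/CSF/Fail2Ban defaults and may differ on a given server:
# Confirm /var/ossec belongs to Imunify360's bundled OSSEC (do not install a second OSSEC/Wazuh agent)
ls -ld /var/ossec
# Confirm the ports this plan will use (9100 Node Exporter, 9080 Promtail HTTP) are currently free
ss -ltnp | grep -E ':(9100|9080)\b' || echo "ports 9100/9080 free"
# Confirm existing security services are running before and after agent installation
systemctl is-active imunify360 lfd fail2ban 2>/dev/null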
Why Grafana Stack Now?
Immediate Benefits: 1. Wazuh Incompatibility: Cannot deploy Wazuh on current cPanel servers due to Imunify360 conflict 2. Operational Visibility: Need monitoring improvements before ApisCP migration (Q2-Q3 2026) 3. Foundation Building: Grafana provides infrastructure for future Wazuh integration 4. Risk Reduction: Better visibility into system health during migration preparation 5. Skill Development: Gain experience with monitoring stack before more complex Wazuh deployment
Target State
Grafana Stack Architecture
graph TB
subgraph "Current Production Servers - Hetzner Germany"
EU1[eu1.cp<br/>CPX31 Hosting Server<br/>~30 Client Accounts<br/>Imunify360 + KernelCare]
NS1[ns1.mdhosting.co.uk<br/>CX22 DNS Server 1<br/>Primary DNS<br/>Imunify360]
NS2[ns2.mdhosting.co.uk<br/>CX22 DNS Server 2<br/>Secondary DNS<br/>Imunify360]
end
subgraph "Monitoring Agents - Production Servers"
NEXPORTER1[Node Exporter<br/>System metrics<br/>Port 9100]
NEXPORTER2[Node Exporter<br/>System metrics<br/>Port 9100]
NEXPORTER3[Node Exporter<br/>System metrics<br/>Port 9100]
PROMTAIL1[Promtail<br/>Log shipping<br/>Sends to Loki]
PROMTAIL2[Promtail<br/>Log shipping<br/>Sends to Loki]
PROMTAIL3[Promtail<br/>Log shipping<br/>Sends to Loki]
end
subgraph "NEW Monitoring Server - Hetzner Germany"
MONITOR[Monitoring Server<br/>CPX31: 4 vCPU, 8GB RAM, 80GB<br/>AlmaLinux 10<br/>Docker + Portainer]
end
subgraph "Docker Containers - Monitoring Server"
GRAFANA[Grafana Container<br/>Dashboards & Visualization<br/>Port 3000]
PROMETHEUS[Prometheus Container<br/>Metrics collection & storage<br/>Port 9090]
LOKI[Loki Container<br/>Log aggregation & indexing<br/>Port 3100]
PORTAINER[Portainer Container<br/>Docker management UI<br/>Port 9443]
end
EU1 --> NEXPORTER1
EU1 --> PROMTAIL1
NS1 --> NEXPORTER2
NS1 --> PROMTAIL2
NS2 --> NEXPORTER3
NS2 --> PROMTAIL3
NEXPORTER1 -->|HTTP 9100<br/>Metrics pull| PROMETHEUS
NEXPORTER2 -->|HTTP 9100<br/>Metrics pull| PROMETHEUS
NEXPORTER3 -->|HTTP 9100<br/>Metrics pull| PROMETHEUS
PROMTAIL1 -->|HTTP 3100<br/>Log push| LOKI
PROMTAIL2 -->|HTTP 3100<br/>Log push| LOKI
PROMTAIL3 -->|HTTP 3100<br/>Log push| LOKI
PROMETHEUS -->|Data source| GRAFANA
LOKI -->|Data source| GRAFANA
MONITOR --> GRAFANA
MONITOR --> PROMETHEUS
MONITOR --> LOKI
MONITOR --> PORTAINER
PORTAINER -.->|Manages| GRAFANA
PORTAINER -.->|Manages| PROMETHEUS
PORTAINER -.->|Manages| LOKI
ADMIN[Administrator] -->|HTTPS 443<br/>Grafana UI| GRAFANA
ADMIN -->|HTTPS 9443<br/>Docker Management| PORTAINER
style EU1 fill:#3498db,stroke:#2c3e50,stroke-width:2px,color:#fff
style NS1 fill:#3498db,stroke:#2c3e50,stroke-width:2px,color:#fff
style NS2 fill:#3498db,stroke:#2c3e50,stroke-width:2px,color:#fff
style MONITOR fill:#27ae60,stroke:#2c3e50,stroke-width:2px,color:#fff
style GRAFANA fill:#f39c12,stroke:#2c3e50,stroke-width:2px,color:#fff
style PROMETHEUS fill:#f39c12,stroke:#2c3e50,stroke-width:2px,color:#fff
style LOKI fill:#f39c12,stroke:#2c3e50,stroke-width:2px,color:#fff
style PORTAINER fill:#8e44ad,stroke:#2c3e50,stroke-width:2px,color:#fff
style NEXPORTER1 fill:#27ae60,stroke:#2c3e50,stroke-width:2px,color:#fff
style NEXPORTER2 fill:#27ae60,stroke:#2c3e50,stroke-width:2px,color:#fff
style NEXPORTER3 fill:#27ae60,stroke:#2c3e50,stroke-width:2px,color:#fff
style PROMTAIL1 fill:#27ae60,stroke:#2c3e50,stroke-width:2px,color:#fff
style PROMTAIL2 fill:#27ae60,stroke:#2c3e50,stroke-width:2px,color:#fff
style PROMTAIL3 fill:#27ae60,stroke:#2c3e50,stroke-width:2px,color:#fff
style ADMIN fill:#8e44ad,stroke:#2c3e50,stroke-width:2px,color:#fff
Stack Components
Grafana
Purpose: Visualization and dashboarding platform for metrics and logs.
Capabilities: - Unified dashboards combining metrics (Prometheus) and logs (Loki) - Alert rule configuration with email notifications - User-friendly query builder (no complex query language required) - Pre-built dashboard templates for common monitoring scenarios - Mobile-responsive interface for on-the-go monitoring - Dashboard sharing and export
Use Cases: - Real-time system health monitoring - Historical trend analysis - Alert management and notification - Incident investigation (correlate metrics + logs) - Capacity planning and resource optimization
Prometheus
Purpose: Time-series metrics collection, storage, and querying.
Capabilities: - Pull-based metrics collection (scrapes Node Exporter endpoints) - Efficient time-series database optimised for metrics - PromQL query language for data analysis - Built-in alerting rules (integrated with Grafana) - Service discovery and target management - Long-term metrics retention (configurable, 30+ days recommended)
Metrics Collected: - System: CPU usage, memory, disk I/O, network traffic - Services: Apache/nginx, MySQL/MariaDB, Exim, Dovecot - Node Exporter: 1000+ system-level metrics per server - Custom Exporters: Can add cPanel-specific metrics if needed
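Once the stack is running (deployment steps appear later in this document), any of these metrics can also be pulled ad hoc from Prometheus's HTTP API rather than through Grafana. A small sketch, assuming the default port 9090 used in this plan:
# Instantaneous CPU usage per server, evaluated by Prometheus's query API
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
# Simple gauge lookup: 1-minute load average for every scraped node
curl -s -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=node_load1'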
Loki
Purpose: Log aggregation system designed for efficiency and Grafana integration.
Capabilities: - Horizontal log aggregation (receive logs from all servers) - Efficient storage (indexes only metadata, not full log content) - LogQL query language (similar to PromQL, easy to learn) - Label-based log organization (by server, service, severity) - Stream processing and filtering - Long-term log retention (configurable, 14+ days recommended)
Logs Collected: - System Logs: /var/log/messages, /var/log/secure, /var/log/cron - Web Server: Apache access/error logs, per-domain logs - Email: Exim mainlog, rejectlog, paniclog, maillog (Dovecot) - Security: CSF/LFD, Fail2Ban, Imunify360, ClamAV - cPanel: cPanel/WHM access, error, login logs
Portainer
Purpose: Docker container management with web interface.
Capabilities: - Visual Docker management (no command-line required for common tasks) - Container lifecycle management (start, stop, restart, logs) - Docker Compose deployment and management - Resource usage monitoring per container - Log viewing and download - Access control and user management
Benefits: - Simplifies Docker operations for single-operator environment - Quick troubleshooting (view container logs, restart services) - Backup and restore of container configurations - Portainer Business license: 5 nodes included (free for MDHosting use case)
Infrastructure Requirements
New Monitoring Server
Specifications: - Server Type: Hetzner CPX31 (recommended) - vCPU: 4 cores (sufficient for 3 monitored servers) - RAM: 8GB (Prometheus + Loki + Grafana + overhead) - Storage: 80GB SSD (metrics + logs retention) - Network: 20TB traffic (ample for metrics/logs) - Cost: €13.79/month (~£12/month, £144/year)
Alternative (Budget Option): - Server Type: Hetzner CX32 (if CPX31 unavailable) - vCPU: 4 cores - RAM: 8GB - Storage: 80GB - Cost: Similar to CPX31
Operating System: - AlmaLinux 10 (consistent with planned ApisCP infrastructure) - Fresh installation (no cPanel, no Imunify360) - Docker and Docker Compose installed - Portainer for container management
Why Separate Server? - Isolation: Monitoring infrastructure separate from production - Performance: No impact on client services during metric collection - Security: Separate security profile, no client data - Flexibility: Easy to scale monitoring without affecting production - Future-Proof: Can be used for Wazuh deployment post-ApisCP migration
Monitored Servers (Current Production)
No Major Changes Required: - Node Exporter: Lightweight binary (~10MB RAM, <1% CPU) - Promtail: Lightweight log shipper (~20MB RAM, <1% CPU) - Firewall Rules: Allow outbound HTTP to monitoring server - Disk Space: Minimal (<100MB for exporters/shippers)
Total Production Impact: <50MB RAM, <2% CPU per server - negligible
Cost Analysis
| Component | Monthly Cost | Annual Cost | Notes |
|---|---|---|---|
| NEW Monitoring Server (CPX31) | £12 | £144 | 4 vCPU, 8GB RAM, 80GB SSD |
| Node Exporter (3x) | £0 | £0 | Open-source, runs on existing servers |
| Promtail (3x) | £0 | £0 | Open-source, runs on existing servers |
| Grafana | £0 | £0 | Open-source (OSS version) |
| Prometheus | £0 | £0 | Open-source |
| Loki | £0 | £0 | Open-source |
| Portainer Business | £0 | £0 | Free for 5 nodes (we have 4: monitoring + 3 production) |
| Total | £12 | £144 | One-time setup ~4-6 hours + £144/year ongoing |
Value Proposition: - £144/year for comprehensive infrastructure monitoring - Immediate benefit (before Wazuh deployment post-ApisCP) - Foundation for future Wazuh integration (unified dashboards) - Risk mitigation during ApisCP migration planning and execution
Comparison with Alternatives
| Feature | Grafana Stack | Wazuh SIEM | Commercial (Datadog/New Relic) |
|---|---|---|---|
| Cost (Annual) | £144 | £144 (post-ApisCP only) | £300-600+ |
| Deployment Time | 2-3 weeks | 8-11 weeks | 1-2 weeks (SaaS) |
| Current Compatibility | ✅ Works with Imunify360 | ❌ Conflicts with Imunify360 | ✅ Usually compatible |
| Infrastructure Monitoring | ✅ Excellent (Prometheus) | ⚠️ Basic | ✅ Excellent |
| Log Aggregation | ✅ Good (Loki) | ✅ Excellent (OpenSearch) | ✅ Excellent |
| Security Event Detection | ⚠️ Basic (manual rules) | ✅ Excellent (SIEM) | ✅ Good |
| Customization | ✅ Highly customizable | ✅ Highly customizable | ⚠️ Limited |
| Data Sovereignty | ✅ Self-hosted (Germany) | ✅ Self-hosted (Germany) | ❌ Third-party SaaS |
| Future Wazuh Integration | ✅ Designed for it | N/A (is Wazuh) | ⚠️ Possible but complex |
| Skill Development | ✅ Industry-standard stack | ✅ SIEM expertise | ⚠️ Vendor-specific |
Conclusion: Grafana stack is the optimal Phase 1 solution given Imunify360 conflict with Wazuh and immediate monitoring needs.
Deployment Strategy
Phased Deployment Approach
Phase 1: Monitoring Server Setup (Week 1)
Objectives: - Provision new Hetzner CPX31 server - Install AlmaLinux 10 operating system - Configure Docker and Docker Compose - Deploy Portainer for container management - Secure server (firewall, SSH keys, fail2ban)
Tasks: 1. Order Hetzner CPX31 server via Hetzner Cloud Console 2. Install AlmaLinux 10 (select from Hetzner image library) 3. Complete initial server hardening (disable root SSH, create admin user, SSH keys) 4. Install Docker Engine and Docker Compose 5. Deploy Portainer via Docker Compose 6. Configure firewall rules (CSF or firewalld) 7. Set up DNS record (e.g., monitoring.mdhosting.internal or monitor.mdhosting.co.uk) 8. Configure SSL certificate (Let's Encrypt for Grafana/Portainer)
Success Criteria: - Monitoring server accessible via SSH (key-based authentication) - Portainer web interface accessible at https://monitor.mdhosting.co.uk:9443 - Docker containers can be created and managed via Portainer - Firewall rules configured and tested
Phase 2: Monitoring Stack Deployment (Week 1-2)
Objectives: - Deploy Prometheus, Loki, Grafana via Docker Compose - Configure Prometheus scrape targets (prepare for Node Exporter) - Configure Loki to receive logs from Promtail - Set up Grafana data sources (Prometheus + Loki) - Create initial Grafana dashboards
Tasks: 1. Create Docker Compose configuration for monitoring stack 2. Deploy Prometheus container with persistent storage 3. Deploy Loki container with persistent storage and retention policy 4. Deploy Grafana container with persistent storage 5. Configure Prometheus: Add scrape configs for 3x Node Exporter endpoints (will be deployed in Phase 3) 6. Configure Loki: Set up log retention (14 days minimum, 30 days recommended) 7. Add Prometheus as Grafana data source 8. Add Loki as Grafana data source 9. Import pre-built dashboards (Node Exporter Full, Loki Logs) 10. Configure Grafana SMTP for email alerts (use existing admin@mdhosting.co.uk) 11. Set up Grafana authentication (admin user, consider adding client access later)
Success Criteria: - All 3 containers running and healthy (check with Portainer) - Grafana accessible at https://monitor.mdhosting.co.uk (reverse proxy via nginx/caddy or direct port) - Prometheus and Loki accessible as data sources in Grafana - Pre-built dashboards loaded (will show no data until Phase 3)
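The success criteria above assume Grafana is reachable at https://monitor.mdhosting.co.uk behind a reverse proxy, but neither this phase nor the implementation section below prescribes one. The following Caddy sketch is only one option; it assumes Caddy is available from EPEL/COPR on AlmaLinux and that ports 80/443 are open, and nginx plus certbot would work equally well:
# Install Caddy (assumes EPEL or the upstream COPR repo provides it on AlmaLinux)
sudo dnf install -y caddy
# Reverse-proxy Grafana (port 3000) behind automatic Let's Encrypt TLS
sudo tee /etc/caddy/Caddyfile > /dev/null <<'EOF'
monitor.mdhosting.co.uk {
    reverse_proxy localhost:3000
}
EOF
sudo systemctl enable --now caddy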
Phase 3: Production Server Agent Deployment (Week 2)
Objectives: - Install Node Exporter on EU1, NS1, NS2 - Install Promtail on EU1, NS1, NS2 - Configure firewall rules for metrics/log shipping - Verify metrics and logs flowing to monitoring server - Validate dashboards displaying data
Tasks:
1. Node Exporter Deployment:
- Download Node Exporter binary to each server
- Create systemd service for Node Exporter
- Configure Node Exporter to listen on port 9100
- Start and enable Node Exporter service
- Test metrics endpoint: curl http://localhost:9100/metrics
2. Promtail Deployment:
- Download Promtail binary to each server
- Create Promtail configuration file (specify log paths and Loki endpoint)
- Create systemd service for Promtail
- Start and enable Promtail service
- Verify Promtail is connecting to Loki (check logs)
3. Firewall Configuration (per server):
- Allow outbound HTTP to monitoring server (port 3100 for Loki)
- Optionally restrict Node Exporter port 9100 to monitoring server IP only
- Verify connectivity: telnet monitor.mdhosting.internal 3100
4. Validation:
- Verify Prometheus is scraping metrics from all 3 Node Exporters (check Prometheus UI)
- Verify Loki is receiving logs from all 3 Promtail instances (check Loki UI or Grafana)
- Check that Grafana dashboards now display data from all servers
- Investigate and resolve any "no data" issues
Success Criteria: - Node Exporter metrics visible in Prometheus for all 3 servers - Logs from all 3 servers visible in Grafana (Loki data source) - "Node Exporter Full" dashboard showing CPU, memory, disk, network for all servers - "Loki Logs" dashboard showing recent log entries from all servers
Phase 4: Dashboard Configuration & Alerting (Week 2-3)
Objectives: - Customize dashboards for MDHosting-specific monitoring needs - Configure alerting rules for critical conditions - Set up notification channels (email, optionally SMS/Slack) - Test alerting functionality - Document monitoring procedures
Tasks: 1. Dashboard Customization: - Clone and customize "Node Exporter Full" dashboard - Add MDHosting-specific panels: - cPanel service status (httpd, mysqld, exim, dovecot) - Client account resource usage (if metrics available) - Backup job status (if metrics available) - Disk usage trends per partition - Create separate dashboards for each server (EU1, NS1, NS2) - Create unified overview dashboard (all servers at a glance) - Create log analysis dashboard (security events, errors, authentication failures)
2. Alerting Configuration:
- Create alert rules in Grafana:
- Critical: Disk usage >90%, service down, out of memory
- High: Disk usage >80%, high CPU (>80% for 10min), high load average
- Medium: Unusual log patterns, authentication failures spike
- Configure notification channel: Email to admin@mdhosting.co.uk
- Test alerts by simulating conditions (fill disk, stop service, etc.)
- Adjust alert thresholds based on false positive rate
3. Log Queries and Saved Searches:
- Create LogQL queries for common investigations:
- SSH authentication failures
- Apache errors by domain
- Exim mail queue issues
- CSF/Fail2Ban blocks
- Imunify360 security events
- Save queries in Grafana for quick access
- Document query patterns for team reference
4. Documentation:
- Create runbook for common monitoring tasks
- Document dashboard navigation and usage
- Create alert response procedures
- Update Security Monitoring documentation with Grafana integration
Success Criteria: - Customized dashboards meeting operational needs - Alert rules configured and tested (receive test alerts via email) - Saved log queries for common security/troubleshooting scenarios - Documentation complete and accessible
Phase 5: Testing & Optimization (Week 3)
Objectives: - Comprehensive testing of monitoring stack - Performance optimization (reduce false positives, tune retention) - Integration with existing procedures - Training and familiarization - Production cutover
Tasks: 1. Functional Testing: - Simulate various failure scenarios (service down, disk full, high load) - Verify alerts fire correctly and notifications received - Test log search functionality for incident investigation - Validate dashboard accuracy against manual checks
2. Performance Optimization:
- Review Prometheus/Loki resource usage on the monitoring server
- Adjust scrape intervals if needed (default 15s; can increase to 30s-60s if performance is an issue)
- Tune log retention based on disk usage (balance storage vs. historical analysis needs)
- Optimize dashboard queries for faster loading
3. Integration with Existing Procedures:
- Update daily monitoring checklist to include Grafana checks
- Modify incident response procedures to leverage Grafana for investigation
- Train on correlating metrics + logs for faster troubleshooting
- Integrate Grafana links into documentation (e.g., link to specific dashboards)
4. Production Cutover:
- Transition from manual log analysis to a Grafana-first approach
- Maintain manual checks as backup for the first 2 weeks
- Monitor false positive rate and adjust alert rules
- Gather feedback and iterate on dashboards/alerts
Success Criteria: - Monitoring stack stable and performant (monitoring server <70% resource usage) - Alert false positive rate <10% - Dashboards provide actionable insights (reduce MTTR for incidents) - Operator comfortable with Grafana interface and workflows
Deployment Timeline
gantt
title Grafana Monitoring Stack Deployment Timeline
dateFormat YYYY-MM-DD
section Phase 1: Server Setup
Provision Hetzner CPX31 :p1a, 2026-01-15, 1d
Install AlmaLinux 10 :p1b, after p1a, 1d
Install Docker + Portainer :p1c, after p1b, 1d
Server Hardening & SSL :p1d, after p1c, 1d
section Phase 2: Stack Deployment
Deploy Prometheus :p2a, after p1d, 1d
Deploy Loki :p2b, after p2a, 1d
Deploy Grafana :p2c, after p2b, 1d
Configure Data Sources :p2d, after p2c, 1d
Import Dashboards :p2e, after p2d, 1d
section Phase 3: Agent Deployment
Deploy Node Exporter (3x) :p3a, after p2e, 2d
Deploy Promtail (3x) :p3b, after p3a, 2d
Validate Metrics & Logs :p3c, after p3b, 1d
section Phase 4: Dashboards & Alerts
Customize Dashboards :p4a, after p3c, 2d
Configure Alerting Rules :p4b, after p4a, 2d
Test Alerts :p4c, after p4b, 1d
Documentation :p4d, after p4c, 1d
section Phase 5: Testing & Launch
Comprehensive Testing :p5a, after p4d, 2d
Performance Optimization :p5b, after p5a, 1d
Production Cutover :p5c, after p5b, 2d
Total Timeline: 18-21 days (approximately 3 weeks)
Target Deployment: Q1 2026 (January-February 2026) - BEFORE ApisCP migration preparation intensifies
Technical Implementation
Monitoring Server Setup
Step 1: Provision Hetzner Server
Via Hetzner Cloud Console:
- Log in to https://console.hetzner.cloud/
- Select MDHosting project
- Click "Add Server"
- Location: Nuremberg, Germany (same as existing servers)
- Image: AlmaLinux 10 (from Apps/Distribution list)
- Type: CPX31 (4 vCPU, 8GB RAM, 80GB SSD)
- Networking:
- Enable IPv4 (public IP assigned automatically)
- Enable IPv6 if needed
- Add to existing network if using Hetzner private networking
- SSH Keys: Add existing SSH public key for initial access
- Firewall: Create/assign firewall rules (see below)
- Volume: Not needed (80GB included storage sufficient)
- Name: monitor.mdhosting or monitoring-server
- Labels: environment=monitoring, role=observability
- Click "Create & Buy Now"
Cost: €13.79/month, billed monthly
Post-Provision:
- Note the assigned public IPv4 address
- Create DNS A record: monitor.mdhosting.co.uk → IPv4 address
- Test SSH access: ssh root@monitor.mdhosting.co.uk
Step 2: Initial Server Hardening
As root user (initial SSH access):
# Update system packages
dnf update -y
# Set hostname
hostnamectl set-hostname monitor.mdhosting.co.uk
# Create administrative user (non-root)
useradd -m -s /bin/bash mdhosting
usermod -aG wheel mdhosting # Grant sudo access
# Set strong password
passwd mdhosting
# Copy SSH keys to new user
mkdir -p /home/mdhosting/.ssh
cp /root/.ssh/authorized_keys /home/mdhosting/.ssh/
chown -R mdhosting:mdhosting /home/mdhosting/.ssh
chmod 700 /home/mdhosting/.ssh
chmod 600 /home/mdhosting/.ssh/authorized_keys
# Test new user access (open new terminal)
ssh mdhosting@monitor.mdhosting.co.uk
sudo whoami # Should return "root"
# Once confirmed, disable root SSH login
sed -i 's/#PermitRootLogin yes/PermitRootLogin no/' /etc/ssh/sshd_config
sed -i 's/PermitRootLogin yes/PermitRootLogin no/' /etc/ssh/sshd_config
systemctl restart sshd
Configure Firewall (CSF recommended for consistency):
# Install CSF (ConfigServer Security & Firewall)
cd /usr/src
wget https://download.configserver.com/csf.tgz
tar -xzf csf.tgz
cd csf
sh install.sh
# Configure CSF
vi /etc/csf/csf.conf
# Key settings:
# TESTING = "0" # Set to 0 for production
# TCP_IN = "22,80,443,3000,9090,9100,9443" # SSH, HTTP, HTTPS, Grafana, Prometheus, Node Exporter, Portainer
# TCP_OUT = "22,80,443,3100" # Outbound: SSH, HTTP, HTTPS, Loki (for receiving logs)
# ICMP_IN = "1" # Allow ping
# ETH_DEVICE = "" # Leave blank for automatic detection
# Allow monitoring server IP to access Node Exporter on production servers
# (Configure this on eu1.cp, ns1, ns2, not monitoring server)
# Enable and start CSF
systemctl enable csf
systemctl enable lfd
systemctl start csf
systemctl start lfd
# Test firewall
csf -r # Restart CSF
Alternative: Firewalld (if preferring AlmaLinux default):
# Enable firewalld
systemctl enable --now firewalld
# Configure zones and services
firewall-cmd --permanent --add-service=ssh
firewall-cmd --permanent --add-service=http
firewall-cmd --permanent --add-service=https
firewall-cmd --permanent --add-port=3000/tcp # Grafana
firewall-cmd --permanent --add-port=3100/tcp # Loki (receive logs)
firewall-cmd --permanent --add-port=9090/tcp # Prometheus
firewall-cmd --permanent --add-port=9100/tcp # Node Exporter (optional, can restrict to monitoring server only)
firewall-cmd --permanent --add-port=9443/tcp # Portainer
# Reload firewall
firewall-cmd --reload
# Verify rules
firewall-cmd --list-all
Step 3: Install Docker and Docker Compose
Install Docker Engine:
# Install Docker repository
sudo dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
# Install Docker
sudo dnf install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# Start and enable Docker
sudo systemctl start docker
sudo systemctl enable docker
# Add mdhosting user to docker group (allow non-root Docker commands)
sudo usermod -aG docker mdhosting
# Log out and back in for group change to take effect
exit
ssh mdhosting@monitor.mdhosting.co.uk
# Verify Docker installation
docker --version
docker ps # Should return empty list (no containers yet)
Install Docker Compose (V2, plugin-based):
Docker Compose V2 is installed as the docker-compose-plugin package by the commands above, so no separate installation is required. Verify it as shown below.
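A quick check that both the engine and the Compose plugin are usable (standard Docker CLI commands):
# Confirm the Compose V2 plugin responds
docker compose version
# Confirm the Docker engine is reachable for the current user
docker info --format '{{.ServerVersion}}'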
Step 4: Deploy Portainer
Create Portainer Docker Compose configuration:
# Create directory for Portainer
mkdir -p /opt/portainer
cd /opt/portainer
# Create docker-compose.yml
cat > docker-compose.yml <<'EOF'
version: '3.8'
services:
portainer:
image: portainer/portainer-ce:latest
container_name: portainer
restart: unless-stopped
security_opt:
- no-new-privileges:true
volumes:
- /etc/localtime:/etc/localtime:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
- portainer_data:/data
ports:
- "9443:9443"
- "9000:9000"
- "8000:8000"
volumes:
portainer_data:
EOF
# Deploy Portainer
docker compose up -d
# Verify Portainer running
docker ps
docker logs portainer
# Access Portainer web interface
# Navigate to: https://monitor.mdhosting.co.uk:9443
# Create admin account (username: admin, strong password)
Portainer Initial Configuration:
- Access https://monitor.mdhosting.co.uk:9443 (accept self-signed certificate warning)
- Create admin user:
- Username: admin
- Password: [Strong password, store in password manager]
- Select environment: Docker (local)
- Click "Connect"
- Portainer dashboard should load showing local Docker environment
Optional: Configure Portainer Business (Free for 5 nodes):
- In Portainer UI, go to Settings → Licenses
- Click "Add License"
- Enter MDHosting business email (admin@mdhosting.co.uk)
- Request free Business Edition license (supports up to 5 nodes)
- Enter license key once received
- Business features enabled (RBAC, edge agent support, etc.)
Monitoring Stack Deployment
Step 1: Create Docker Compose Configuration
Create directory structure:
# Create monitoring stack directory
mkdir -p /opt/monitoring-stack/{prometheus,loki,grafana}
cd /opt/monitoring-stack
# Create Prometheus configuration
mkdir -p prometheus/config
cat > prometheus/config/prometheus.yml <<'EOF'
global:
scrape_interval: 15s # Scrape targets every 15 seconds
evaluation_interval: 15s # Evaluate rules every 15 seconds
# Alertmanager configuration (optional, for advanced alerting)
# alerting:
# alertmanagers:
# - static_configs:
# - targets: []
# Rule files (for recording rules and alerts)
rule_files:
# - "alerts/*.yml"
# Scrape configurations
scrape_configs:
# Prometheus itself
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node Exporter - EU1 (Hosting Server)
- job_name: 'node-exporter-eu1'
static_configs:
- targets: ['eu1.cp:9100']
labels:
instance: 'eu1-hosting'
server_type: 'cpanel_hosting'
location: 'hetzner_de'
# Node Exporter - NS1 (DNS Server 1)
- job_name: 'node-exporter-ns1'
static_configs:
- targets: ['ns1.mdhosting.co.uk:9100']
labels:
instance: 'ns1-dns'
server_type: 'dns_primary'
location: 'hetzner_de'
# Node Exporter - NS2 (DNS Server 2)
- job_name: 'node-exporter-ns2'
static_configs:
- targets: ['ns2.mdhosting.co.uk:9100']
labels:
instance: 'ns2-dns'
server_type: 'dns_secondary'
location: 'hetzner_de'
EOF
# Create Loki configuration
cat > loki/loki-config.yml <<'EOF'
auth_enabled: false
server:
http_listen_port: 3100
grpc_listen_port: 9096
common:
instance_addr: 0.0.0.0
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
schema_config:
configs:
- from: 2024-01-01
store: tsdb
object_store: filesystem
schema: v12
index:
prefix: index_
period: 24h
storage_config:
tsdb_shipper:
active_index_directory: /loki/tsdb-index
cache_location: /loki/tsdb-cache
filesystem:
directory: /loki/chunks
limits_config:
retention_period: 336h # 14 days retention (adjust as needed)
ingestion_rate_mb: 10
ingestion_burst_size_mb: 20
max_query_series: 500
max_query_lookback: 720h # 30 days max query lookback
compactor:
working_directory: /loki/compactor
retention_enabled: true
retention_delete_delay: 2h
retention_delete_worker_count: 150
ruler:
alertmanager_url: http://localhost:9093
enable_api: true
rule_path: /loki/rules-temp
storage:
type: local
local:
directory: /loki/rules
EOF
# Create Grafana configuration (datasources provisioning)
mkdir -p grafana/provisioning/{datasources,dashboards}
cat > grafana/provisioning/datasources/datasources.yml <<'EOF'
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: true
jsonData:
timeInterval: '15s'
- name: Loki
type: loki
access: proxy
url: http://loki:3100
editable: true
jsonData:
maxLines: 1000
EOF
# Create dashboard provisioning config
cat > grafana/provisioning/dashboards/dashboards.yml <<'EOF'
apiVersion: 1
providers:
- name: 'MDHosting Dashboards'
orgId: 1
folder: 'MDHosting'
type: file
disableDeletion: false
updateIntervalSeconds: 30
allowUiUpdates: true
options:
path: /etc/grafana/provisioning/dashboards/files
EOF
# Create dashboards directory (will add dashboard JSON files later)
mkdir -p grafana/provisioning/dashboards/files
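Before starting the stack, the Prometheus configuration written above can be syntax-checked with promtool, which ships inside the prom/prometheus image. A hedged sketch using an entrypoint override and a read-only mount:
# Validate prometheus.yml without starting the full stack
docker run --rm \
  -v /opt/monitoring-stack/prometheus/config:/etc/prometheus:ro \
  --entrypoint promtool \
  prom/prometheus:latest check config /etc/prometheus/prometheus.yml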
Create main docker-compose.yml:
cd /opt/monitoring-stack
cat > docker-compose.yml <<'EOF'
version: '3.8'
services:
prometheus:
image: prom/prometheus:latest
container_name: prometheus
restart: unless-stopped
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
volumes:
- ./prometheus/config:/etc/prometheus
- prometheus_data:/prometheus
ports:
- "9090:9090"
networks:
- monitoring
loki:
image: grafana/loki:latest
container_name: loki
restart: unless-stopped
command: -config.file=/etc/loki/loki-config.yml
volumes:
- ./loki:/etc/loki
- loki_data:/loki
ports:
- "3100:3100"
networks:
- monitoring
grafana:
image: grafana/grafana:latest
container_name: grafana
restart: unless-stopped
environment:
- GF_SECURITY_ADMIN_USER=admin
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-changeme}
- GF_USERS_ALLOW_SIGN_UP=false
- GF_SERVER_ROOT_URL=https://monitor.mdhosting.co.uk
- GF_SMTP_ENABLED=true
- GF_SMTP_HOST=eu1.cp:587
- GF_SMTP_USER=admin@mdhosting.co.uk
- GF_SMTP_PASSWORD=${SMTP_PASSWORD}
- GF_SMTP_FROM_ADDRESS=admin@mdhosting.co.uk
- GF_SMTP_FROM_NAME=MDHosting Monitoring
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
ports:
- "3000:3000"
networks:
- monitoring
depends_on:
- prometheus
- loki
networks:
monitoring:
driver: bridge
volumes:
prometheus_data:
loki_data:
grafana_data:
EOF
# Create .env file for sensitive configuration
cat > .env <<'EOF'
# Grafana admin password (change this!)
GRAFANA_ADMIN_PASSWORD=CHANGE_THIS_STRONG_PASSWORD
# SMTP password for Grafana alerts (use existing email account password)
SMTP_PASSWORD=YOUR_EMAIL_PASSWORD_HERE
EOF
# Secure .env file
chmod 600 .env
# IMPORTANT: Edit .env and set strong passwords
vi .env
Step 2: Deploy Monitoring Stack
cd /opt/monitoring-stack
# Pull images (optional, docker compose up will do this automatically)
docker compose pull
# Start monitoring stack
docker compose up -d
# Verify all containers running
docker compose ps
# Should show: prometheus, loki, grafana - all "running" status
# Check container logs
docker compose logs prometheus
docker compose logs loki
docker compose logs grafana
# Verify services accessible
curl http://localhost:9090/-/healthy # Prometheus health
curl http://localhost:3100/ready # Loki ready
curl http://localhost:3000/api/health # Grafana health
Access Services:
- Grafana: http://monitor.mdhosting.co.uk:3000 (username: admin, password: GRAFANA_ADMIN_PASSWORD from the .env file)
- Prometheus: http://monitor.mdhosting.co.uk:9090
- Loki: http://monitor.mdhosting.co.uk:3100 (no UI, use Grafana)
Verify Data Sources in Grafana:
- Log in to Grafana (http://monitor.mdhosting.co.uk:3000)
- Go to Configuration → Data Sources
- Verify "Prometheus" data source exists and shows green "Data source is working" message
- Verify "Loki" data source exists and shows green "Data source is working" message
If data sources not working:
- Check docker network connectivity: docker network inspect monitoring-stack_monitoring
- Check container logs: docker compose logs grafana
- Verify URLs in datasources.yml match container names (prometheus, loki)
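If the UI checks are inconclusive, the same connectivity Grafana uses can be tested from inside its container. A sketch, assuming the grafana/grafana image still ships BusyBox wget (if not, docker exec into the container and use whatever HTTP client is present):
# Test Prometheus and Loki from inside the Grafana container (same Docker network and URLs as datasources.yml)
docker exec grafana wget -qO- http://prometheus:9090/-/healthy
docker exec grafana wget -qO- http://loki:3100/ready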
Production Server Agent Deployment
Now deploy Node Exporter and Promtail on all 3 production servers (EU1, NS1, NS2).
Node Exporter Installation
On each production server (eu1.cp, ns1, ns2):
# Create node_exporter user
sudo useradd --no-create-home --shell /bin/false node_exporter
# Download Node Exporter (check for latest version at https://github.com/prometheus/node_exporter/releases)
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
# Extract and install
tar xvf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
# Clean up
rm -rf node_exporter-1.7.0.linux-amd64*
# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
# Reload systemd, start and enable Node Exporter
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter
# Verify Node Exporter running
sudo systemctl status node_exporter
curl http://localhost:9100/metrics | head -20
# Configure firewall to allow monitoring server to scrape metrics
# If using CSF:
sudo vi /etc/csf/csf.allow
# Add line: tcp|in|d=9100|s=MONITORING_SERVER_IP # Allow Prometheus scrape
# Reload CSF
sudo csf -r
# If using firewalld:
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="MONITORING_SERVER_IP" port protocol="tcp" port="9100" accept'
sudo firewall-cmd --reload
Verify from monitoring server:
# From monitoring server
curl http://eu1.cp:9100/metrics | head -20
curl http://ns1.mdhosting.co.uk:9100/metrics | head -20
curl http://ns2.mdhosting.co.uk:9100/metrics | head -20
# Check Prometheus is scraping targets
# Open Prometheus UI: http://monitor.mdhosting.co.uk:9090
# Go to Status → Targets
# Should see node-exporter-eu1, node-exporter-ns1, node-exporter-ns2 all "UP" status
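The same target health can be confirmed from the command line via Prometheus's HTTP API instead of the UI. A small sketch; jq is assumed to be installed on the monitoring server (or pipe through python3 -m json.tool instead):
# List every scrape target and its health as seen by Prometheus
curl -s http://localhost:9090/api/v1/targets | \
  jq -r '.data.activeTargets[] | "\(.labels.job) \(.scrapeUrl) \(.health)"'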
Promtail Installation
On each production server (eu1.cp, ns1, ns2):
# Create promtail user
sudo useradd --no-create-home --shell /bin/false promtail
# Download Promtail (check for latest version at https://github.com/grafana/loki/releases)
cd /tmp
wget https://github.com/grafana/loki/releases/download/v2.9.3/promtail-linux-amd64.zip
# Extract and install
unzip promtail-linux-amd64.zip
sudo cp promtail-linux-amd64 /usr/local/bin/promtail
sudo chown promtail:promtail /usr/local/bin/promtail
sudo chmod +x /usr/local/bin/promtail
# Clean up
rm -rf promtail-linux-amd64*
# Create Promtail configuration directory
sudo mkdir -p /etc/promtail
# Create Promtail configuration (customize per server)
sudo tee /etc/promtail/promtail-config.yml > /dev/null <<'EOF'
server:
http_listen_port: 9080
grpc_listen_port: 0
positions:
filename: /var/lib/promtail/positions.yaml
clients:
- url: http://monitor.mdhosting.internal:3100/loki/api/v1/push
scrape_configs:
# System logs
- job_name: system
static_configs:
- targets:
- localhost
labels:
job: systemlogs
server: eu1-hosting # Change per server: eu1-hosting, ns1-dns, ns2-dns
__path__: /var/log/messages
- job_name: secure
static_configs:
- targets:
- localhost
labels:
job: secure
server: eu1-hosting # Change per server
__path__: /var/log/secure
- job_name: cron
static_configs:
- targets:
- localhost
labels:
job: cron
server: eu1-hosting # Change per server
__path__: /var/log/cron
# Web server logs (EU1 only)
- job_name: apache-access
static_configs:
- targets:
- localhost
labels:
job: apache-access
server: eu1-hosting
__path__: /usr/local/apache/logs/access_log
- job_name: apache-error
static_configs:
- targets:
- localhost
labels:
job: apache-error
server: eu1-hosting
__path__: /usr/local/apache/logs/error_log
# Email logs (EU1 only)
- job_name: exim
static_configs:
- targets:
- localhost
labels:
job: exim
server: eu1-hosting
__path__: /var/log/exim_mainlog
- job_name: maillog
static_configs:
- targets:
- localhost
labels:
job: maillog
server: eu1-hosting
__path__: /var/log/maillog
# Security logs
- job_name: fail2ban
static_configs:
- targets:
- localhost
labels:
job: fail2ban
server: eu1-hosting # Change per server
__path__: /var/log/fail2ban.log
- job_name: lfd
static_configs:
- targets:
- localhost
labels:
job: lfd
server: eu1-hosting # Change per server
__path__: /var/log/lfd.log
# cPanel logs (EU1 only)
- job_name: cpanel-access
static_configs:
- targets:
- localhost
labels:
job: cpanel-access
server: eu1-hosting
__path__: /usr/local/cpanel/logs/access_log
- job_name: cpanel-error
static_configs:
- targets:
- localhost
labels:
job: cpanel-error
server: eu1-hosting
__path__: /usr/local/cpanel/logs/error_log
EOF
# Create positions directory
sudo mkdir -p /var/lib/promtail
sudo chown promtail:promtail /var/lib/promtail
# Adjust log file permissions (Promtail needs read access)
# Option 1: Add promtail to adm group (can read most logs)
sudo usermod -aG adm promtail
# Option 2: Specific log file permissions (more restrictive)
# sudo setfacl -m u:promtail:r /var/log/messages
# sudo setfacl -m u:promtail:r /var/log/secure
# ... (repeat for each log file)
# Create systemd service
sudo tee /etc/systemd/system/promtail.service > /dev/null <<'EOF'
[Unit]
Description=Promtail
Wants=network-online.target
After=network-online.target
[Service]
User=promtail
Group=promtail
Type=simple
ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/promtail-config.yml
[Install]
WantedBy=multi-user.target
EOF
# Reload systemd, start and enable Promtail
sudo systemctl daemon-reload
sudo systemctl start promtail
sudo systemctl enable promtail
# Verify Promtail running
sudo systemctl status promtail
sudo journalctl -u promtail -f # Watch logs for connection to Loki
# Look for "clients/client.go" messages showing successful log shipping
Important: Customize promtail-config.yml per server!
- eu1.cp: Include all scrape_configs (system, web, email, cPanel)
- ns1/ns2: Remove web server, email, and cPanel sections (DNS servers don't have these services)
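After editing the per-server configuration, it can be exercised without shipping anything to Loki by using Promtail's dry-run mode, which prints scraped entries to stdout instead of pushing them:
# Print scraped log lines instead of sending them to Loki, to confirm paths and labels per server
sudo -u promtail /usr/local/bin/promtail -config.file=/etc/promtail/promtail-config.yml -dry-run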
Verify from Grafana:
- Open Grafana: http://monitor.mdhosting.co.uk:3000
- Go to Explore
- Select Loki data source
- Run query: {server="eu1-hosting"} (should show logs from EU1)
- Run query: {server="ns1-dns"} (should show logs from NS1)
- Run query: {server="ns2-dns"} (should show logs from NS2)
If no logs appearing:
- Check Promtail status: sudo systemctl status promtail
- Check Promtail logs: sudo journalctl -u promtail -n 100
- Verify firewall allows outbound HTTP to monitoring server port 3100
- Check Loki logs on monitoring server: docker compose logs loki
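To separate a Promtail problem from a Loki/network problem, a test entry can be pushed straight to Loki's push API from the production server. A sketch using the documented /loki/api/v1/push endpoint; the test labels are arbitrary:
# Push a single synthetic log line to Loki; then query {job="connectivity-test"} in Grafana Explore
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Content-Type: application/json" \
  -X POST "http://monitor.mdhosting.internal:3100/loki/api/v1/push" \
  --data-raw "{\"streams\":[{\"stream\":{\"job\":\"connectivity-test\",\"server\":\"$(hostname -s)\"},\"values\":[[\"$(date +%s%N)\",\"promtail connectivity test\"]]}]}"
# A 204 response means Loki accepted the entry; connection errors point at firewall/DNS instead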
Dashboard Configuration
Import Pre-Built Dashboards
Node Exporter Full Dashboard:
- In Grafana, go to Dashboards → Import
- Enter dashboard ID: 1860 (Node Exporter Full)
- Click "Load"
- Select "Prometheus" as data source
- Click "Import"
- Dashboard should load showing metrics from all 3 servers
Alternative Node Exporter Dashboard: - Dashboard ID: 11074 (Node Exporter for Prometheus Dashboard)
Loki Logs Dashboard:
- In Grafana, go to Dashboards → Import
- Enter dashboard ID: 13639 (Logs / App by Loki)
- Click "Load"
- Select "Loki" as data source
- Click "Import"
- Dashboard should show log streams from all servers
Create Custom MDHosting Dashboard
Create Overview Dashboard:
- In Grafana, click + → Dashboard
- Click "Add visualization"
- Select "Prometheus" data source
Panel 1: Server Status (Stat panel):
- Query: up{job=~"node-exporter-.*"}
- Visualization: Stat
- Title: "Server Status"
- Value mappings:
- 1 = "UP" (green)
- 0 = "DOWN" (red)
- Display: All values
Panel 2: CPU Usage (Time series):
- Query: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- Visualization: Time series
- Title: "CPU Usage %"
- Legend: {{instance}}
- Unit: Percent (0-100)
- Thresholds: Yellow at 70%, Red at 90%
Panel 3: Memory Usage (Gauge):
- Query: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
- Visualization: Gauge
- Title: "Memory Usage %"
- Unit: Percent (0-100)
- Thresholds: Yellow at 75%, Red at 90%
Panel 4: Disk Usage (Bar gauge):
- Query: (1 - (node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / node_filesystem_size_bytes)) * 100
- Visualization: Bar gauge
- Title: "Disk Usage %"
- Unit: Percent (0-100)
- Thresholds: Yellow at 80%, Red at 90%
Panel 5: Network Traffic (Time series):
- Query A (Received): rate(node_network_receive_bytes_total{device!~"lo|veth.*|docker.*"}[5m])
- Query B (Transmitted): rate(node_network_transmit_bytes_total{device!~"lo|veth.*|docker.*"}[5m])
- Visualization: Time series
- Title: "Network Traffic"
- Unit: Bytes/sec
Panel 6: System Load (Time series):
- Query A (1min): node_load1
- Query B (5min): node_load5
- Query C (15min): node_load15
- Visualization: Time series
- Title: "System Load Average"
- Legend: {{instance}} - {{name}}
Panel 7: Recent Errors (Logs panel):
- Data Source: Loki
- Query: {job=~"systemlogs|secure"} |= "error" or "fail" or "critical"
- Visualization: Logs
- Title: "Recent System Errors"
- Time: Last 1 hour
Panel 8: SSH Authentication Failures (Stat):
- Data Source: Loki
- Query: count_over_time({job="secure"} |= "Failed password" [1h])
- Visualization: Stat
- Title: "SSH Failed Logins (Last Hour)"
- Thresholds: Yellow at 10, Red at 50
Save Dashboard: - Click Save dashboard (top right) - Name: "MDHosting Infrastructure Overview" - Folder: MDHosting - Tags: infrastructure, overview - Click "Save"
Configure Alert Rules
Alert 1: High CPU Usage:
- In Grafana, go to Alerting → Alert rules
- Click "New alert rule"
- Alert name: High CPU Usage
- Query A: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- Condition: WHEN last() OF query(A) IS ABOVE 80
- Evaluate every: 1m
- For: 10m (only alert if condition persists for 10 minutes)
- Annotations:
- Summary: High CPU usage on {{ $labels.instance }}
- Description: CPU usage is {{ $value }}% on {{ $labels.instance }}
- Labels: severity=warning, server={{ $labels.instance }}
- Click "Save"
Alert 2: High Memory Usage:
- Create new alert rule
- Alert name: High Memory Usage
- Query A: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
- Condition: WHEN last() OF query(A) IS ABOVE 85
- Evaluate every: 1m
- For: 5m
- Annotations:
- Summary: High memory usage on {{ $labels.instance }}
- Description: Memory usage is {{ $value }}% on {{ $labels.instance }}
- Labels: severity=warning
- Click "Save"
Alert 3: Disk Space Critical:
- Create new alert rule
- Alert name: Disk Space Critical
- Query A: (1 - (node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / node_filesystem_size_bytes)) * 100
- Condition: WHEN last() OF query(A) IS ABOVE 90
- Evaluate every: 5m
- For: 5m
- Annotations:
- Summary: CRITICAL: Disk space low on {{ $labels.instance }}
- Description: Disk usage is {{ $value }}% on {{ $labels.instance }}. Immediate action required.
- Labels: severity=critical
- Click "Save"
Alert 4: Server Down:
- Create new alert rule
- Alert name: Server Down
- Query A: up{job=~"node-exporter-.*"}
- Condition: WHEN last() OF query(A) IS BELOW 1
- Evaluate every: 1m
- For: 2m (give 2 minutes for temporary network issues)
- Annotations:
- Summary: CRITICAL: Server {{ $labels.instance }} is DOWN
- Description: Cannot reach {{ $labels.instance }}. Check server immediately.
- Labels: severity=critical
- Click "Save"
Configure Notification Channel:
- In Grafana, go to Alerting → Contact points
- Click "New contact point"
- Name: Email Admin
- Integration: Email
- Addresses: admin@mdhosting.co.uk
- Optional: Test notification (click "Test" button)
- Click "Save contact point"
Configure Notification Policy:
- Go to Alerting → Notification policies
- Edit "Default policy"
- Contact point: Email Admin
- Group by: alertname, instance
- Timing:
- Group wait: 30s
- Group interval: 5m
- Repeat interval: 4h
- Click "Save policy"
Test Alerting:
Simulate high CPU to test alerting:
# On any production server
# Install stress tool
sudo dnf install -y stress
# Generate CPU load for 15 minutes
stress --cpu 4 --timeout 900s &
# Watch for alert to fire in Grafana (after 10 minutes)
# Check email for alert notification
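The disk-space rule can be exercised in a similarly reversible way. The sketch below temporarily allocates a large placeholder file and then removes it; adjust the size so usage crosses 90% without actually filling the disk, and prefer a non-critical partition if available:
# Create a large placeholder file to push disk usage over the alert threshold
sudo fallocate -l 10G /tmp/alert-test.img
df -h /
# Wait for the "Disk Space Critical" alert to fire (evaluated every 5m, for 5m), then clean up
sudo rm -f /tmp/alert-test.img
df -h /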
Operational Procedures
Daily Monitoring Workflow
With Grafana (5-10 minutes):
1. Open Grafana Dashboard:
- Navigate to "MDHosting Infrastructure Overview" dashboard
- Quick visual scan: All servers green? Any panels in red/yellow?
2. Check Key Metrics:
- Server Status: All showing "UP"?
- CPU Usage: Any sustained high usage (>70%)?
- Memory Usage: Any servers >75%?
- Disk Usage: Any approaching 80%?
- Network Traffic: Any unusual spikes or patterns?
3. Review Recent Errors:
- Check "Recent System Errors" panel
- Click on any interesting errors for full log context
- Investigate critical/error level messages
4. Check Alerts:
- Go to Alerting → Alert rules
- Any active alerts? If yes, investigate and resolve
- Review recently resolved alerts for patterns
5. Log Investigation (if needed):
- Go to Explore → Select Loki
- Run targeted queries for specific investigations:
- SSH failures: {job="secure"} |= "Failed password"
- Email issues: {job="exim"} |= "error" or "frozen"
- Apache errors: {job="apache-error"}
Compared to Manual Monitoring (Before Grafana): - Before: 15-30 minutes SSHing to each server, running commands, checking logs - After: 5-10 minutes in Grafana dashboard, only SSH if specific issue requires intervention - Time Saved: 10-20 minutes per day = roughly 70-140 minutes per week
Incident Investigation Workflow
Scenario: Client reports "website slow"
Traditional Approach (Before Grafana):
1. SSH to eu1.cp
2. Check load average: uptime
3. Check disk: df -h
4. Check memory: free -h
5. Check processes: top
6. Check Apache logs: tail -f /usr/local/apache/logs/error_log
7. Check specific domain logs: tail -f /usr/local/apache/domlogs/client-domain.com
8. Correlate timing with other events manually
9. Time: 15-30 minutes
Grafana Approach:
1. Open "MDHosting Infrastructure Overview" dashboard
2. Visual inspection: CPU/memory/disk spikes at time of complaint?
3. Time series analysis: Set time range to when the client reported the issue
4. Log correlation: In Explore, query the relevant Apache/domain logs (see the sketch below)
5. Identify issue: High CPU correlates with error spike in client's domain logs
6. Root cause: Specific error pattern identified (e.g., PHP fatal error, slow database query)
7. Resolution: SSH to server only if needed for remediation
8. Time: 5-10 minutes
Efficiency Gain: 2-3x faster incident investigation
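As a concrete illustration of step 4, the same LogQL that would be pasted into Explore can also be run against Loki's query_range API. A sketch, where the job/server labels come from the Promtail configuration earlier in this document and the one-hour window and keyword filter are just examples:
# Pull the last hour of Apache error-log lines for EU1 matching common failure keywords
START=$(date -d '1 hour ago' +%s)000000000   # Loki expects nanosecond epoch timestamps
END=$(date +%s)000000000
curl -s -G "http://monitor.mdhosting.co.uk:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={job="apache-error", server="eu1-hosting"} |~ "(?i)(error|timeout|fatal)"' \
  --data-urlencode "start=${START}" \
  --data-urlencode "end=${END}" \
  --data-urlencode "limit=100"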
Alert Response Procedures
Critical Alert: Server Down
Alert Received:
Subject: [CRITICAL] Server ns1-dns is DOWN
Body: Cannot reach ns1.mdhosting.co.uk. Check server immediately.
Response Steps:
1. Verify Alert:
- Open Grafana: Check "Server Status" panel
- Confirm server showing as DOWN
- Check alert timeline: How long has it been down?
2. Initial Investigation:
- Attempt to ping server: ping ns1.mdhosting.co.uk
- Attempt SSH access: ssh ns1.mdhosting.co.uk
- Check Hetzner Cloud Console: Server status, console access
3. Root Cause Analysis:
- If accessible via Hetzner console: Check system logs for crash/kernel panic
- If not accessible: Hardware failure, network issue, or hosting provider outage
- Check Grafana logs (if server was up recently): Any errors before going down?
4. Resolution:
- If server reboot needed: Via Hetzner console or SSH
- If hardware failure: Contact Hetzner support, consider failover (DNS has redundancy)
- If network issue: Check Hetzner status page, contact support
5. Validation:
- Verify server returns to "UP" status in Grafana
- Check all services running: systemctl status named
- Monitor for stability (watch dashboard for 15 minutes)
6. Follow-Up:
- Document incident in incident log
- Update Incident Response if new pattern
- Review alert threshold if false positive
High Alert: Disk Space Critical
Alert Received:
Subject: [CRITICAL] Disk space low on eu1-hosting
Body: Disk usage is 92% on eu1.cp. Immediate action required.
Response Steps:
1. Verify Alert:
- Open Grafana: Check "Disk Usage %" panel
- Confirm disk usage level and trend (increasing rapidly or stable?)
2. Immediate Investigation:
- SSH to server: ssh eu1.cp
- Check disk usage: df -h
- Identify largest directories: du -sh /home/* /var/* | sort -hr | head -20
3. Common Causes:
- Client backups not cleaned up
- Log files growing excessively
- Abandoned files in /tmp or /var/tmp
- Email queue buildup (frozen messages)
- cPanel backup staging files
4. Resolution:
- Remove old backups: Review /backup directory, clean up manually
- Rotate logs: logrotate -f /etc/logrotate.conf
- Clean up temp files: rm -rf /tmp/* /var/tmp/* (be cautious)
- Clear mail queue frozen messages: exiqgrep -z -i | xargs exim -Mrm
- Identify and remove large unnecessary files
5. Validation:
- Verify disk usage dropped below 85%: df -h
- Check Grafana dashboard: Disk usage panel should reflect reduction
- Wait for alert to auto-resolve
6. Follow-Up:
- Investigate why disk usage increased (client issue? backup issue?)
- Adjust backup retention policy if needed (see Backup Recovery)
- Consider increasing disk space if it becomes a recurring issue (upgrade to larger Hetzner server)
Warning Alert: High CPU Usage
Alert Received:
Subject: [WARNING] High CPU usage on eu1-hosting
Body: CPU usage is 85% on eu1-hosting for more than 10 minutes.
Response Steps:
1. Verify Alert:
- Open Grafana: Check "CPU Usage %" time series panel
- Assess pattern: Sustained high usage or spike?
- Check timing: Correlates with known events (backups, cron jobs)?
2. Investigate Cause:
- SSH to server: ssh eu1.cp
- Check top processes: top or htop
- Identify resource-heavy processes
3. Common Causes:
- Client website generating high traffic (legitimate)
- Backup job running
- ClamAV malware scan (scheduled daily)
- Brute force attack (check Apache logs for excessive requests)
- Compromised account running malicious scripts
4. Resolution:
- Legitimate traffic: Monitor, consider optimizations or upgrades
- Backup/scan: Wait for completion, adjust schedule if problematic
- Attack: Block attacking IPs (CSF), investigate compromised account
- Malicious scripts: Identify and kill processes, investigate compromise
5. Validation:
- Verify CPU usage returns to normal (<70%)
- Check Grafana: CPU panel should show reduction
- Wait for alert to auto-resolve
6. Follow-Up:
- If attack: Follow Incident Response - Unauthorized Access
- If legitimate load: Review capacity planning, consider infrastructure upgrades
- If recurring: Adjust alert threshold (e.g., increase to 90% if 80% is too sensitive)
Backup and Disaster Recovery
Backing Up Monitoring Configuration:
What to Backup: - Grafana dashboards and configuration (automated via volume) - Prometheus configuration (prometheus.yml) - Loki configuration (loki-config.yml) - Docker Compose files (docker-compose.yml, .env) - Grafana provisioning files (data sources, dashboards)
Automated Backup (Recommended):
# Create backup script
cat > /opt/monitoring-stack/backup.sh <<'EOF'
#!/bin/bash
# MDHosting Monitoring Stack Backup Script
BACKUP_DIR="/root/monitoring-backups"
DATE=$(date +%Y-%m-%d)
BACKUP_FILE="monitoring-backup-${DATE}.tar.gz"
# Create backup directory
mkdir -p ${BACKUP_DIR}
# Stop containers (optional, for consistent backup)
cd /opt/monitoring-stack
# docker compose stop
# Create backup archive
tar czf ${BACKUP_DIR}/${BACKUP_FILE} \
/opt/monitoring-stack/ \
/opt/portainer/
# Restart containers if stopped
# docker compose start
# Remove backups older than 30 days
find ${BACKUP_DIR} -name "monitoring-backup-*.tar.gz" -mtime +30 -delete
echo "Backup completed: ${BACKUP_FILE}"
ls -lh ${BACKUP_DIR}/${BACKUP_FILE}
EOF
# Make script executable
chmod +x /opt/monitoring-stack/backup.sh
# Add to cron (daily at 2am)
crontab -e
# Add line:
0 2 * * * /opt/monitoring-stack/backup.sh >> /var/log/monitoring-backup.log 2>&1
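Backups are only useful if they restore, so it is worth spot-checking the archive occasionally. A minimal sketch that lists the newest archive and confirms it unpacks cleanly:
# Show the newest backup and verify the archive is readable end to end
LATEST=$(ls -t /root/monitoring-backups/monitoring-backup-*.tar.gz | head -1)
ls -lh "${LATEST}"
tar tzf "${LATEST}" > /dev/null && echo "archive OK: ${LATEST}"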
Manual Backup:
# Create backup
cd /opt
tar czf /root/monitoring-backup-$(date +%Y-%m-%d).tar.gz monitoring-stack/ portainer/
# Transfer to Hetzner Storage Box or local machine
scp /root/monitoring-backup-*.tar.gz user@backup-server:/backups/
Restore Procedure:
Scenario: Monitoring server failure, need to rebuild
1. Provision New Monitoring Server:
- Follow "Monitoring Server Setup" steps (provision Hetzner CPX31, install AlmaLinux, Docker, etc.)
2. Restore Configuration:
# Transfer backup to new server
scp monitoring-backup-YYYY-MM-DD.tar.gz mdhosting@new-monitor.mdhosting.co.uk:/tmp/
# SSH to new server
ssh mdhosting@new-monitor.mdhosting.co.uk
# Extract backup
cd /opt
sudo tar xzf /tmp/monitoring-backup-YYYY-MM-DD.tar.gz
# Start monitoring stack
cd /opt/monitoring-stack
docker compose up -d
# Start Portainer
cd /opt/portainer
docker compose up -d
3. Verify Restoration:
- Access Grafana: https://new-monitor.mdhosting.co.uk:3000
- Verify dashboards present
- Verify data sources configured
- Check Prometheus scraping metrics (may need to wait for next scrape interval)
- Check Loki receiving logs
4. Update DNS:
- Update monitor.mdhosting.co.uk DNS A record to new server IP
- Wait for DNS propagation (5-60 minutes)
5. Decommission Old Server:
- Once new server verified working, delete old server from Hetzner Cloud Console
Data Retention and Disaster Recovery:
Metrics (Prometheus): - Retention: 30 days (configurable in docker-compose.yml) - Impact of Loss: Lose historical metrics trend analysis - Mitigation: 30 days sufficient for most investigations, can extend to 90 days if needed - Recovery: Cannot recover lost metrics, but Node Exporter will continue sending new metrics immediately
Logs (Loki): - Retention: 14 days (configurable in loki-config.yml) - Impact of Loss: Lose historical log analysis capabilities - Mitigation: Critical logs still on production servers (/var/log/), Loki is additional aggregation layer - Recovery: Cannot recover lost logs from Loki, but original logs on production servers remain
Dashboards and Configuration: - Retention: Persistent Docker volumes + daily backups - Impact of Loss: Need to recreate dashboards and alerts manually - Mitigation: Regular backups (automated script above) + export dashboards as JSON - Recovery: Restore from backup (see Restore Procedure above)
Critical: Production server logs always retained on production servers per normal logrotate policies (4 weeks). Loki provides centralized access but does NOT replace original log files.
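In addition to the volume backups, dashboards can be exported as JSON through Grafana's HTTP API so they survive even a corrupted data volume. A sketch using the standard /api/search and /api/dashboards/uid endpoints with basic auth; jq is assumed to be installed, and a service-account token would be the better long-term choice than the admin password:
# Export every dashboard to JSON files (run on the monitoring server)
GRAFANA_URL="http://localhost:3000"
GRAFANA_AUTH="admin:YOUR_GRAFANA_ADMIN_PASSWORD"   # password from /opt/monitoring-stack/.env
mkdir -p /root/grafana-dashboard-exports
for uid in $(curl -s -u "${GRAFANA_AUTH}" "${GRAFANA_URL}/api/search?type=dash-db" | jq -r '.[].uid'); do
  curl -s -u "${GRAFANA_AUTH}" "${GRAFANA_URL}/api/dashboards/uid/${uid}" \
    > "/root/grafana-dashboard-exports/${uid}.json"
done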
Integration with Future Wazuh Deployment
Phase 2 Integration Strategy
When Wazuh Deploys (Post-ApisCP Migration):
Grafana stack provides the perfect foundation for Wazuh integration:
Wazuh + Grafana Unified Monitoring
Architecture (Post-ApisCP):
graph TB
subgraph "NEW ApisCP Servers - Post-Migration"
APIS1[ApisCP Server 1<br/>No Imunify360]
APIS2[ApisCP Server 2<br/>No Imunify360]
end
subgraph "Monitoring Server - Shared Infrastructure"
GRAFANA[Grafana<br/>Unified Dashboards]
PROMETHEUS[Prometheus<br/>Infrastructure Metrics]
LOKI[Loki<br/>Log Aggregation]
end
subgraph "NEW Wazuh SIEM Server"
WAZUH[Wazuh Manager<br/>Security Events]
WAZUH_IDX[Wazuh Indexer<br/>OpenSearch]
end
subgraph "Agents on ApisCP Servers"
NEXPORTER[Node Exporter<br/>Metrics]
PROMTAIL[Promtail<br/>Logs]
WAZUH_AGENT[Wazuh Agent<br/>Security Events]
end
APIS1 --> NEXPORTER
APIS1 --> PROMTAIL
APIS1 --> WAZUH_AGENT
APIS2 --> NEXPORTER
APIS2 --> PROMTAIL
APIS2 --> WAZUH_AGENT
NEXPORTER --> PROMETHEUS
PROMTAIL --> LOKI
WAZUH_AGENT --> WAZUH
PROMETHEUS --> GRAFANA
LOKI --> GRAFANA
WAZUH_IDX --> GRAFANA
WAZUH --> WAZUH_IDX
style GRAFANA fill:#f39c12,stroke:#2c3e50,stroke-width:3px,color:#fff
style APIS1 fill:#3498db,stroke:#2c3e50,stroke-width:2px,color:#fff
style APIS2 fill:#3498db,stroke:#2c3e50,stroke-width:2px,color:#fff
style WAZUH fill:#e74c3c,stroke:#2c3e50,stroke-width:2px,color:#fff
Benefits of Unified Approach:
1. Single Pane of Glass:
- Infrastructure metrics (Prometheus) + Security events (Wazuh) + Logs (Loki) all in Grafana
- No need to switch between Wazuh Dashboard and Grafana
- Correlate infrastructure anomalies with security events
2. Enhanced Incident Investigation:
- Example: High CPU alert fires (Prometheus)
- Correlate with security events: Was it a brute force attack? (Wazuh)
- Investigate logs: What happened? (Loki)
- All in one dashboard, single query interface
3. Unified Alerting:
- Grafana Unified Alerting handles both infrastructure and security alerts
- Single notification channel (email, Slack, PagerDuty, etc.)
- Consistent alert format and response procedures
Grafana + Wazuh Integration Steps
When Wazuh is deployed (Q3-Q4 2026):
1. Add Wazuh Data Source to Grafana:

# In /opt/monitoring-stack/grafana/provisioning/datasources/datasources.yml
- name: Wazuh
  type: elasticsearch
  access: proxy
  url: http://wazuh-indexer:9200
  database: "wazuh-alerts-*"
  isDefault: false
  editable: true
  jsonData:
    esVersion: "7.10.0"
    timeField: "@timestamp"
    logMessageField: "full_log"
    logLevelField: "rule.level"

2. Import Wazuh Dashboards:
- Wazuh provides pre-built Grafana dashboards
- Import via the Grafana UI: Dashboards → Import → Upload Wazuh JSON files
- Dashboards available for:
  - Security Overview
  - Threat Intelligence
  - File Integrity Monitoring
  - Vulnerability Detection
  - Compliance (PCI DSS, GDPR, CIS)

3. Create Unified Security + Infrastructure Dashboard:
- Combine panels from the Node Exporter and Wazuh dashboards
- Example layout:
  - Row 1: Server status (infrastructure) + security alert summary (Wazuh)
  - Row 2: CPU/Memory/Disk (Prometheus) + top security events (Wazuh)
  - Row 3: Network traffic (Prometheus) + intrusion attempts (Wazuh)
  - Row 4: Recent logs (Loki) + file integrity changes (Wazuh)

4. Configure Cross-Data-Source Alerts (a hedged query sketch follows this list):
- Example: "High CPU + Security Alert Correlation"
  - Query A (Prometheus): CPU > 80%
  - Query B (Wazuh): rule.level >= 10 (high-severity security events)
  - Condition: IF both are true within a 5-minute window, fire the alert
  - Interpretation: Possible crypto-mining malware or a resource-intensive attack

5. Update Operational Procedures:
- Daily monitoring: Include Wazuh panels in the dashboard review
- Incident investigation: Use Grafana Explore to correlate all three data sources
- Alert response: Unified alert-handling procedures
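As referenced in step 4, a hedged sketch of the two queries behind that correlation rule: the PromQL line is the usual Node Exporter CPU-utilisation expression, and the Wazuh line is a Lucene query against the wazuh-alerts-* index defined in the data source above.

# Query A - PromQL (Prometheus data source): CPU utilisation above 80% per instance
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80

# Query B - Lucene query (Wazuh/Elasticsearch data source): high-severity security events
rule.level:>=10

# In Grafana Unified Alerting, combine the two so the rule only fires when both queries
# return results within the same 5-minute evaluation window.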
Documentation Updates Required:
- Wazuh Deployment - add Grafana integration section
- Security Monitoring - update with unified monitoring workflows
- This document (Grafana Monitoring) - add post-Wazuh integration procedures
Maintaining Separation During Migration
During ApisCP Migration (Q2-Q3 2026):
Grafana stack will monitor BOTH old cPanel and new ApisCP infrastructure:
Monitoring Strategy:
- Old cPanel servers (EU1, NS1, NS2): Keep existing Node Exporter + Promtail (no Wazuh)
- New ApisCP servers: Add Node Exporter + Promtail + Wazuh agents (once deployed)
- Grafana dashboards: Separate "cPanel Infrastructure" and "ApisCP Infrastructure" dashboards (a hedged scrape-config sketch follows below)
- Migration visibility: Track metrics on both platforms during migration

Benefits:
- Comparison: Compare performance of cPanel vs. ApisCP side-by-side
- Risk Mitigation: Detect issues early on new ApisCP servers before full migration
- Confidence: Verify new infrastructure is stable before decommissioning old servers

Post-Migration:
- Remove old cPanel servers from monitoring once fully decommissioned
- Unified dashboard: Single "MDHosting Infrastructure" dashboard for all ApisCP servers
- Wazuh integration: Deploy Wazuh agents on ApisCP servers (no Imunify360 conflict)
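One way to keep the two platforms separable in Prometheus during the overlap is a static platform label per scrape job. A hedged sketch follows; the job names, the platform label, and the ApisCP hostname are illustrative rather than part of the existing configuration.

# prometheus.yml - hypothetical scrape jobs labelling each platform during the migration
scrape_configs:
  - job_name: 'cpanel-nodes'
    static_configs:
      - targets: ['eu1.cp:9100', 'ns1.mdhosting.co.uk:9100', 'ns2.mdhosting.co.uk:9100']
        labels:
          platform: 'cpanel'
  - job_name: 'apiscp-nodes'
    static_configs:
      - targets: ['apiscp1.mdhosting.co.uk:9100']   # placeholder - replace with the real ApisCP hostnames
        labels:
          platform: 'apiscp'

Dashboard panels can then filter on {platform="cpanel"} or {platform="apiscp"}, which keeps the side-by-side comparison simple.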
Troubleshooting
Common Issues and Resolutions
Issue 1: Prometheus Not Scraping Targets
Symptoms:
- Prometheus UI (http://monitor.mdhosting.co.uk:9090) shows targets as "DOWN"
- No metrics in Grafana dashboards
- Error: "context deadline exceeded" or "connection refused"
Diagnosis:
# On monitoring server
# Check Prometheus logs
docker logs prometheus | tail -50
# Test connectivity to Node Exporter from monitoring server
curl http://eu1.cp:9100/metrics
curl http://ns1.mdhosting.co.uk:9100/metrics
curl http://ns2.mdhosting.co.uk:9100/metrics
# Check Prometheus configuration
cat /opt/monitoring-stack/prometheus/config/prometheus.yml
Resolution (hedged example commands follow this list):
1. Firewall Issue: ensure each production server allows the monitoring server to reach port 9100 (CSF allow rule)
2. Node Exporter Not Running: restart the node_exporter service on the affected server
3. Incorrect Hostname in Prometheus Config: correct the target hostname in prometheus.yml
4. Reload Prometheus Configuration: restart or reload Prometheus so the corrected configuration takes effect
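A hedged sketch of the commands behind those four steps; the CSF syntax is standard, but the node_exporter service name and the config path are assumptions to verify against the actual installation.

# 1. Firewall - on each production server, allow the monitoring server to reach Node Exporter (CSF)
sudo csf -a <monitoring-server-ip> "Monitoring server - Node Exporter"
# For a port-scoped rule, add to /etc/csf/csf.allow instead: tcp|in|d=9100|s=<monitoring-server-ip>

# 2. Node Exporter - on the affected production server
sudo systemctl status node_exporter
sudo systemctl restart node_exporter

# 3. Hostname - on the monitoring server, correct the target in the Prometheus config
vi /opt/monitoring-stack/prometheus/config/prometheus.yml

# 4. Reload - restart the Prometheus container so the corrected config is loaded
cd /opt/monitoring-stack
docker compose restart prometheus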
Issue 2: Loki Not Receiving Logs
Symptoms:
- No logs appearing in Grafana Explore with the Loki data source
- Grafana query returns "No data"
- Promtail logs show errors
Diagnosis:
# On production server (check Promtail status)
sudo systemctl status promtail
sudo journalctl -u promtail -n 100
# Look for errors like:
# "Post http://monitor.mdhosting.internal:3100/loki/api/v1/push: dial tcp: lookup monitor.mdhosting.internal: no such host"
# "error sending batch, will retry: 400 Bad Request"
# On monitoring server (check Loki logs)
docker logs loki | tail -50
Resolution (hedged example commands follow this list):
1. Hostname Resolution Issue: confirm the production servers can resolve the Loki endpoint used in the Promtail config
2. Firewall Blocking Loki Port: allow the production servers to reach port 3100 on the monitoring server
3. Promtail Configuration Error: review the Promtail configuration and restart the service
4. Log File Permissions: confirm the Promtail service can read the log files it is configured to ship
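A hedged sketch for those four checks; the internal hostname comes from the diagnosis output above, the IP shown is purely illustrative, firewalld is assumed on the AlmaLinux monitoring server (substitute whichever firewall is actually in use), and the Promtail config path and service user are assumptions.

# 1. Hostname resolution - on each production server, confirm the Loki endpoint resolves
getent hosts monitor.mdhosting.internal
# If it does not resolve, add an entry (IP is illustrative):
echo "10.0.0.5  monitor.mdhosting.internal" | sudo tee -a /etc/hosts

# 2. Firewall - on the monitoring server, allow port 3100 from the production servers
sudo firewall-cmd --add-port=3100/tcp --permanent && sudo firewall-cmd --reload

# 3. Promtail configuration - on the production server, review the client/scrape config and restart
sudo vi /etc/promtail/promtail-config.yml   # path assumed
sudo systemctl restart promtail
sudo journalctl -u promtail -n 20

# 4. Permissions - if Promtail runs as a dedicated user, confirm it can read the shipped logs
sudo -u promtail head -1 /var/log/messages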
Issue 3: Grafana Not Loading Dashboards
Symptoms:
- Grafana UI loads but dashboards show "No data"
- Data sources show errors
- Panels display "Failed to fetch" errors
Diagnosis:
# On monitoring server
# Check Grafana logs
docker logs grafana | tail -100
# Check data source configuration
cat /opt/monitoring-stack/grafana/provisioning/datasources/datasources.yml
# Test data sources
# Prometheus:
curl http://localhost:9090/-/healthy
# Loki:
curl http://localhost:3100/ready
Resolution (a hedged sketch for items 2 and 3 follows this list):
1. Data Source URL Incorrect:

# Verify container names in the docker network
docker network inspect monitoring-stack_monitoring
# URLs should use container names (not localhost):
# Prometheus: http://prometheus:9090
# Loki: http://loki:3100
# Fix datasources.yml if needed
vi /opt/monitoring-stack/grafana/provisioning/datasources/datasources.yml
# Restart Grafana
docker compose restart grafana

2. Data Source Not Provisioned: confirm the provisioning directory is mounted into the Grafana container, then restart it
3. Docker Network Issue: confirm all containers are attached to the monitoring network and recreate the stack if not
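A hedged sketch for items 2 and 3; the container-side path is Grafana's standard provisioning directory, and --force-recreate simply rebuilds the containers against the existing compose definition.

# 2. Provisioning - confirm the datasources directory is mounted into the Grafana container
docker inspect grafana | grep -A 5 '"Mounts"'
# Expect the host provisioning path mapped to /etc/grafana/provisioning inside the container
docker compose restart grafana

# 3. Network - confirm every container is attached to the monitoring network
docker network inspect monitoring-stack_monitoring | grep '"Name"'
# If containers are missing, recreate the stack
cd /opt/monitoring-stack
docker compose up -d --force-recreate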
Issue 4: Alerts Not Firing or Email Not Received
Symptoms:
- Alert conditions met but no alert fires
- Alert fires but no email received
- Grafana shows the alert as "Pending" indefinitely
Diagnosis:
# On monitoring server
# Check Grafana logs for SMTP errors
docker logs grafana | grep -i smtp
docker logs grafana | grep -i email
docker logs grafana | grep -i alert
# In Grafana UI:
# Go to Alerting → Alert rules
# Check rule status (Normal, Pending, Firing)
# Click rule to see evaluation history
Resolution (a hedged SMTP sketch follows this list):
1. SMTP Configuration Issue: verify Grafana's SMTP settings and restart the container
2. Email Address Configuration: check the email address configured on the alert's contact point
3. Alert Rule Threshold Not Met:

# In the Grafana UI:
# Go to Alerting → Alert rules
# Click the alert rule
# Check the "For" duration: the alert only fires if the condition persists for the specified time
# Example: "CPU > 80% for 10m" means CPU must stay above 80% for 10 continuous minutes
# Adjust the "For" duration if it is too long
# Save the rule and wait for the next evaluation

4. Notification Policy Not Configured: ensure a notification policy routes the alert to the intended contact point
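A hedged sketch of Grafana's SMTP settings as environment variables in docker-compose.yml; the GF_SMTP_* names are standard Grafana configuration keys, while the relay host, account, and addresses shown are placeholders.

# docker-compose.yml - Grafana service (placeholders; keep credentials in an env file, not in git)
  grafana:
    environment:
      - GF_SMTP_ENABLED=true
      - GF_SMTP_HOST=smtp.example.com:587
      - GF_SMTP_USER=alerts@mdhosting.co.uk
      - GF_SMTP_PASSWORD=change-me
      - GF_SMTP_FROM_ADDRESS=alerts@mdhosting.co.uk
      - GF_SMTP_FROM_NAME=MDHosting Monitoring

# Apply and re-check the logs
cd /opt/monitoring-stack
docker compose up -d grafana
docker logs grafana | grep -i smtp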
Issue 5: High Resource Usage on Monitoring Server
Symptoms:
- Monitoring server running slowly
- Prometheus/Loki/Grafana containers using excessive CPU/memory
- Disk space filling up rapidly
Diagnosis:
# On monitoring server
# Check resource usage
docker stats
# Check disk usage
df -h
du -sh /var/lib/docker/volumes/*
# Check Prometheus data size
du -sh /var/lib/docker/volumes/monitoring-stack_prometheus_data/
# Check Loki data size
du -sh /var/lib/docker/volumes/monitoring-stack_loki_data/
Resolution (a hedged scrape-interval sketch follows this list):
1. Reduce Prometheus Retention: lower --storage.tsdb.retention.time in docker-compose.yml (see the retention sketch in Data Retention and Disaster Recovery above)
2. Reduce Loki Retention: lower retention_period in loki-config.yml (same section above)
3. Increase Scrape Interval (Prometheus): raise the global scrape_interval so fewer samples are collected and stored
4. Upgrade Monitoring Server: resize the Hetzner instance (e.g. CPX31 → CPX41) if retention and interval tuning are not enough
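A hedged sketch of the scrape-interval change; scrape_interval is the standard Prometheus global setting, and 60s is only an example of trading metric resolution for lower CPU, memory, and disk usage.

# /opt/monitoring-stack/prometheus/config/prometheus.yml
global:
  scrape_interval: 60s        # e.g. raised from 15s; fewer samples collected and stored
  evaluation_interval: 60s

# Apply
cd /opt/monitoring-stack
docker compose restart prometheus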
Getting Help
Internal Resources:
- This documentation (docs/projects/grafana-monitoring.md)
- Security Monitoring - integration procedures
- Incident Response - escalation procedures
- Contacts - vendor contacts
External Resources:
| Resource | URL | Use Case |
|---|---|---|
| Grafana Documentation | https://grafana.com/docs/grafana/latest/ | Grafana features, configuration, troubleshooting |
| Prometheus Documentation | https://prometheus.io/docs/ | PromQL queries, configuration, best practices |
| Loki Documentation | https://grafana.com/docs/loki/latest/ | LogQL queries, configuration, retention |
| Portainer Documentation | https://docs.portainer.io/ | Container management, troubleshooting |
| Grafana Community Forums | https://community.grafana.com/ | User questions, dashboard sharing |
| Prometheus Mailing List | https://groups.google.com/forum/#!forum/prometheus-users | Technical questions, best practices |
Vendor Support:
| Vendor | Service | Contact | Notes |
|---|---|---|---|
| Hetzner | Infrastructure hosting | https://robot.hetzner.com/ | Server issues, network problems |
| Grafana Labs | Grafana OSS (free) | Community forums only | No paid support for OSS version |
Emergency Escalation:
- If a monitoring server failure impacts production monitoring, follow Incident Response - Infrastructure Failure
- A monitoring server failure does NOT impact production services (EU1, NS1, NS2 continue operating)
- Can operate without monitoring temporarily; rebuild from backup
Document Control
Version History:
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | January 2026 | Claude Sonnet 4.5 | Initial comprehensive Grafana monitoring deployment documentation |
Review Schedule:
- Post-Deployment Review: 2 weeks after Phase 5 completion (validate effectiveness)
- Quarterly Review: Assess monitoring coverage, dashboard effectiveness, alert accuracy
- Post-Wazuh Integration: Major revision after Wazuh SIEM deployment and integration
- Annual Review: Comprehensive review and update (January each year)
Next Review Date: March 2026 (post-deployment review)
Related Documentation:
- Security Monitoring - Current monitoring practices, integration with Grafana
- Wazuh Deployment Project - Phase 2 SIEM deployment plan
- ApisCP Migration - Infrastructure migration context
- Incident Response - Incident response procedures using monitoring data
- Backup Recovery - Backup procedures (separate from monitoring backups)
- Network Architecture - Network topology and firewall rules
Document Status: ✅ Complete - Comprehensive Grafana monitoring deployment plan
Classification: Confidential - Internal Use Only
Document Owner: MDHosting Ltd Director (Matthew Dinsdale)
Last updated: January 2026