WWW.ADMIN-MAGAZINE.COM
ADMIN Network & Security
IPv6-Mostly
ISSUE 92
IAM for machines, workloads, and agents
Non-Human IAM
Prometheus + Cortex
DVD INSIDE
Bloonix
Combine numerous
services for continuous
IT monitoring
Prowler
Check AWS infrastructure for
vulnerabilities, compliance,
and security gaps
Geofencing
Isolate web services from
the public Internet
Zabbix
Monitor constrained
environments over time
Prometheus plus Cortex
Monitoring, alerting, and trending (MAT) for large volumes of historical data
Uptime Kuma
Self-hosted uptime monitoring
IPv6-Mostly Networks
RT Industrial Ethernet
Protocols
Datapizza-AI
Edge AI automation on
constrained hardware
Non-Human
Identity Management
E-Ticket
Artificial Intelligence (AI) and its value, ethics, power, and future are in your daily news feed. Journalists raise
important questions such as “Will AI replace your job?” and “Will the government really remove guardrail
protection and train AI bots to surveil and potentially harm citizens?” The topic makes good headlines, but the
concern is real. I’m sure some of you IT administrators out there have wondered if an AI bot will take your job.
The answer isn’t obvious, but as company executives strive to make investors happy, the measures they take will
surely affect employment in a negative manner.
The funny part is that executives are always hunting for ways to save money without disrupting business
continuity, and they do it by targeting the people in the trenches doing the actual work. Those of us in the trenches
aren’t making the big money that the executives enjoy. We all know about the disparity between
worker pay and executive pay, so why don’t business owners look to replace management with
AI bots rather than the people who flip the switches and push the buttons? Management
would be far easier to replace than someone who performs hands-on tasks. If I were to
write a simple script to replace almost every IT manager, it would go something like:
#!/bin/bash
# Read all input (but ignore its contents)
read -r input
# Randomly choose response
if (( RANDOM % 2 )); then
  echo "Yes."
else
  echo "I'll get back to you with an answer."
fi
Even if you can’t read a Bash script, I think you get the idea that it’s much
simpler to replace someone who only supplies a “Yes” or an “I’ll get back
to you with an answer” than it is to replace someone who needs to make
decisions; fix what’s broken; troubleshoot complex situations; and interact with
users, customers, and managers. That’s my hot take. I’m sure some very competent middle
managers and executives do much more than placate their management, owners, and shareholders, but I have
yet to encounter them in my career. Perhaps my scope and experience are limited.
This part of the AI roller coaster is that long, slow ride to the top before you’re released into freefall with your
hands held high: waiting on so-called decision makers to contemplate your fate while you worry about your
mortgage, children’s healthcare, and career options in a world motivated by finding the lowest successful bidder.
You’ll also observe on your trek around the loops that no matter how successful AI companies are, their stock
prices still fall. It’s the exact opposite of what should happen. It’s the feeling of falling although you’re traveling
up against gravity. I understand the uncertainty surrounding AI: its promises, its future, and, more personally,
what it’s going to do to me and my family.
What is certain, though, is that soon AI will affect every part of your life – your car, your appliances, your home,
your communications, your privacy, your healthcare, and even your food. People are blindly embracing AI and
its flaws as if it were as safe as those foam balls introduced back in the 1970s that “won’t hurt babies or old
people.” AI is the new foam ball. It seems safe and benign because we control it. What happens, though, when
no human riders are on board or no person is pulling the lever to start and stop the roller coaster? Will we still
feel the same?
Don’t get me wrong. I am a daily user of AI tools. I have multiple AI “badges” that prove my competence.
However, as with any tool, there is good and bad. A hammer is a great tool, but if you drop it on your foot or
hit your finger, it’s now a @#$%! menace. I expect to see a lot of AI hammers being dropped onto human hands,
feet, and careers. Tools are good until you lose control of them. The roller coaster still requires a human hand
at the switch. A roller coaster without human riders is no fun. Let’s go forward and vow to use this new tool
ethically, safely, and with restraint.
No AI was used in the writing of this article.
Ken Hess • Senior ADMIN Editor
Let’s keep the AI technology roller coaster in human hands.
Lead Image © kgtoh, 123RF.com
54 Forced Tunneling
The Microsoft security service
tunnels all traffic from Azure
resources downstream, so
Internet-bound traffic can be
inspected and monitored by a
local firewall before it leaves the
regional Azure gateway.
62 Geofencing
Use geofence technology to isolate
your web services from the broader
public Internet with custom security
rules and worker routes.
32 Azure Storage Explorer
Manage, automate, and perform
diagnostics while supporting Azurite
storage integration, shared access
signature management, and error
analysis.
36 Datapizza-AI PHP
Orchestrate API-first agents and
local vector stores on constrained
hardware without GPUs.
42 IPv6-Mostly Networks
Offer the best user experience
while reducing IPv4 resource
consumption to a minimum.
48 Java Memory
Management
Scale the steep Java memory
management learning curve
while keeping applications up and
running and looking for trends
that signal imminent crashes.
12 Non-Human Identity
Management
Many non-human identities —
workloads in the cloud,
service accounts in IT systems,
autonomous agents in AI
applications — are poorly
managed or not managed at all.
We present a strategic, holistic
approach to managing these
identities.
18 Prometheus plus Cortex
This monitoring, alerting, and
trending software is considered
the standard, but it is slow when
faced with a large volume of
historical data. Cortex comes to
the rescue, with cluster support,
as well.
26 Uptime Kuma
A combination of easy installation,
attractive interface, and extensive
feature set makes Uptime Kuma
a good choice for self-hosted
uptime monitoring.
68 Prowler
Systematically check your AWS
infrastructure for vulnerabilities,
meet compliance requirements,
and automatically plug security
gaps.
76 MITRE Caldera
Emulate attacks and optimize
monitoring with automated security
testing that facilitates the work of
red and blue teams.
Features
Tools
Containers and Virtualization
Security
ADMIN
Network & Security
@adminmag
ADMIN magazine
@adminmagazine@hachyderm.io
@admin-magazine.com
You’ll find code and listings for ADMIN articles here:
https://linuxnewmedia.thegood.cloud/s/9nFQcFb2p8oRMEJ
Nuts and Bolts
78 Bloonix
Combine the numerous
monitoring services in complex
environments into a single
interface.
82 Data Collection with
Zabbix
Available system utilities and
tools can provide reliable, policy-
compliant monitoring coverage
in restricted environments where
traditional approaches fail.
Rocky Linux 10.1
The first minor release since the 2025
major release of the Rocky Linux (RL)
enterprise operating system retains
the improvements and upgrades of
RL 10 [1], including the following tools:
12 | Non-Human
Identity Management
IAM for machines, workloads, and agents
Address the increasing number of attack surfaces presented
by NHIs by focusing on attribute and capability descriptions.
18 | Prometheus plus Cortex

the community growing steadily and the project seeing a continuous influx of work from contributors worldwide.
Behind the Bear
At its core, Uptime Kuma (Figure 1)
is a monitoring tool that monitors
the availability of network services.
Unlike commercial SaaS solutions, it
Formerly a hobbyist project, Uptime Kuma has developed into one of today’s most popular open source
monitoring tools in just a few years. By Marius Quabeck
Uptime Kuma Open Source Monitoring Tool
Ursa Major
Figure 1: The dashboard shows all monitors at a glance, color-coded by status.
Photo by Zdeněk Macháček on Unsplash
runs entirely on your infrastructure –
whether a Raspberry Pi in your liv-
ing room, network-attached storage
(NAS) in your basement, a virtual
server at your hosting provider, or a
full-fledged data center. The software
does not need a cloud connection and
stores all its data locally, making it
the ideal choice for anyone who pre-
fers to keep their data on-premises.
The architecture is based on modern
web technologies: Node.js and Vue.js
provide the back end and reactive
user interface, respectively. TypeScript
ensures type safety in the code, and
SCSS stylesheets enable an attractive
design with a dark and light mode.
SQLite is the default database, sup-
porting operation without the need
to install a separate database. Version
2.0 introduced MariaDB as an alterna-
tive – an important boost for larger
installations.
The installation is typically Docker-
based; a single command is all it takes:
docker run -d \
  --restart=unless-stopped \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  --name uptime-kuma \
  louislam/uptime-kuma:2
After a few seconds, the dashboard
becomes accessible on port 3001.
When you access the dashboard for
the first time, you need to create
the admin user account and select
the language and, optionally, the
database type before monitoring can
alternative to commercial monitoring
solutions. The software is completely
free, available under the non-restric-
tive MIT license, and does not incur
any costs, apart from whatever you
pay for the hosting infrastructure,
with no subscription model, no limit
on the number of monitors, and no
hidden costs for additional notifica-
tion channels.
Uptime Kuma’s ease of use allows
even employees without in-depth
Linux knowledge to create new moni-
tors or configure notifications. The
web interface is intuitively designed
and does not require a command
line. Status pages keep customers and
stakeholders informed of service sta-
tus – professional external communi-
cation without additional tools.
Enterprise Environments
Larger organizations also benefit from
Uptime Kuma, although it is often on
top of more comprehensive monitor-
ing stacks. The software is relatively
straightforward; a combination of
Prometheus, Grafana, and Alertman-
ager is definitely more powerful, but
ultimately far more complex. Instead
of needing weeks of training, Uptime
Kuma gives you quick results without
complex configuration.
In version 2.0 and with MariaDB
support, scalability has improved sig-
nificantly. Installations with several
hundred monitors and longer data
histories benefit from the more ro-
bust database infrastructure. Rootless
begin. The entire setup takes just a
few minutes.
Home Lab Enthusiasts
The largest target group is probably
operators of private servers and home
networks. Anyone who owns a media
server such as Jellyfin or Plex, uses a
Nextcloud instance for data synchro-
nization, uses Home Assistant for
home automation, or hosts other ser-
vices at home wants to know whether
everything is working. Uptime Kuma
works seamlessly on a Raspberry Pi
and reliably keeps an eye on your
home infrastructure.
The resource requirements are frugal:
A single CPU core and 512MB of RAM
are sufficient for smaller installations
with a few dozen monitors. Of course,
as the number of monitors increases
and the check intervals become
shorter, the requirements also grow,
but even faced with several hundred
endpoints, Uptime Kuma still has a
modest appetite.
What is particularly practical for
home lab users is that Uptime Kuma
can be integrated directly into Home
Assistant, where it appears as an add-
on, which means you can integrate
availability data into existing dash-
boards and automations.
Small and Medium-Sized
Enterprises
For companies with limited IT bud-
gets, Uptime Kuma is a cost-effective
Figure 2: Uptime Kuma supports numerous monitoring types, from HTTP through DNS to Docker containers.
Docker images also address security
requirements relevant in enterprise
environments.
Versatile Monitoring
Uptime Kuma is not limited to simple
HTTP checks. The software supports
a wide range of protocols and testing
methods. HTTP/ HTTPS monitoring
(Figure 2) is the classic example
where the software calls a URL and
checks the status code. Advanced
options let admins validate certain
keywords in the response body,
which comes in handy when a faulty
application is returning HTTP 200 but
displaying an error page. Targeting
API responses with JSON queries is
useful for health check endpoints that
provide structured status information.
SSL certificate monitoring is integrated
into HTTP(S) monitors. Uptime Kuma
not only checks whether an encrypted
connection is possible but also warns
of expiring certificates. The lead time
for warnings can be set; timely notifi-
cations prevent unpleasant surprises
from expired certificates.
TCP port monitoring checks whether a
specific port on a server is accessible,
which is useful for services such as
databases, email servers, or proprietary
applications that do not use HTTP. The
check confirms ac-
cessibility at the net-
work level but does
not provide informa-
tion on the applica-
tion status.
Ping/ ICMP monitor-
ing tests the basic
accessibility of a
host. If you cannot
even ping the target,
the problem is likely
to be more serious
than just a crashed
web server. The en-
tire machine could
be offline, or a net-
work route might
be down. DNS
monitoring checks
for correct domain
name resolution by
helping to identify
misconfigurations or propagation
problems at an early stage before they
affect end users.
Docker container monitoring provides
a direct view of the container status,
provided Uptime Kuma can access
the Docker socket. This information is
particularly useful when services are
running inside the container but are
difficult to test from the outside. Up-
time Kuma displays the container status
(running, stopped, restarting) and can
alert you to status changes.
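If Uptime Kuma itself runs in Docker, access to the socket is typically granted by mounting it into the container. The following minimal sketch extends the earlier docker run command; be aware that access to the Docker socket effectively grants control over the Docker host:

# Same container as before, plus the Docker socket for container monitoring
docker run -d \
  --restart=unless-stopped \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --name uptime-kuma \
  louislam/uptime-kuma:2

In the web interface, you then add a Docker host entry that points to the mounted socket path.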
Database checks for MySQL, MariaDB,
PostgreSQL, Redis, and MongoDB en-
able genuine connectivity tests instead
of simple port availability. The software
opens a connection and runs a simple
query. If the query is successful, the
monitor is considered “up.” Game
server monitoring checks the availabil-
ity of Steam game servers, such as for
Counter-Strike, Team Fortress 2, Rust,
or ARK. For community server opera-
tors, this helpful feature is rarely found
in generic monitoring tools.
Monitoring of MQTT (a standards-
based publish-subscribe messaging pro-
tocol) checks message brokers, such as
those found in Internet of things (IoT)
environments. Version 2.0 introduced
the ability to run JSON queries to eval-
uate specific message content. Push
monitoring reverses the principle:
Instead of Uptime Kuma querying a
service, the service regularly reports
to the application. An alarm is trig-
gered if an expected message is not
received, which is useful for cron jobs,
backup scripts, or other periodic tasks.
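For example, a backup script can confirm completion by calling the push URL that Uptime Kuma generates for the monitor. The token, port, and paths below are placeholders; the exact URL is shown in the monitor's settings:

#!/bin/bash
# Nightly backup that reports its result to an Uptime Kuma push monitor
# (replace <push-token> with the token from the monitor's settings)
if tar -czf /backup/data-$(date +%F).tar.gz /srv/data; then
  curl -fsS "http://localhost:3001/api/push/<push-token>?status=up&msg=OK" > /dev/null
else
  curl -fsS "http://localhost:3001/api/push/<push-token>?status=down&msg=backup+failed" > /dev/null
fi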
SMTP and SNMP monitoring have also
been part of the feature set since ver-
sion 2.0. You can use these functions
to monitor mail servers and network
devices such as switches and routers
without additional tools.
Browser monitoring relies on an em-
bedded Chromium browser (or Micro-
soft Edge as an alternative as of ver-
sion 2.0) to load web pages like a real
user and identifies JavaScript errors
that remain invisible in simple HTTP
requests. Remote browser support al-
lows resource-intensive checks to be
outsourced to dedicated machines.
Flexible Check Intervals
The shortest possible check frequency
is 20 seconds – a value more com-
monly found in enterprise solutions.
A SaaS service like UptimeRobot will
only support such short intervals in
commercial plans. Of course, longer
intervals can also be set up, such as
60 seconds, five minutes, or more.
Shorter intervals mean faster noti-
fication in the event of failures, but
Figure 3: More than 90 notification channels are available, ranging from email and messengers to incident
management systems.
pending or maintenance. The project
improved the loading performance
in version 2.0, meaning that even
installations with a large number of
monitors can be operated smoothly.
Detailed views (Figure 5) provide
information with interactive ping
diagrams, uptime percentages over
various periods (24 hours, 30 days,
one year), and average response
times over time. The diagrams are
reactive and show details when
moused over.
Badges are available for integration
into other systems. These small
graphics reveal the current status or
uptime of a monitor. You can embed
them in README files on GitHub,
integrate them into internal wikis,
or display them on dashboards. The
badge URLs support different styles
and time periods.
Security Features
Two-factor authentication (2FA) pro-
tects access to the dashboard. Uptime
Kuma supports time-based one-time
password (TOTP) apps such as
Google Authenticator or Authy. Since
version 2.0, the input field automati-
cally focuses on the token field when
logging in with 2FA.
The application lets you manage
multiple users, although they all have
identical authorizations. Role-based
access control (RBAC) or the option
to give certain users read-only access
does not exist, which is problematic
for larger teams or organizations with
different responsibilities.
API and Integration
With the help of a Prometheus
metrics interface, you can integrate
Uptime Kuma into existing monitor-
ing stacks, which means you can
visualize information from the tool
in Grafana dashboards or correlate
it with other metrics. This option
proves particularly useful when
Uptime Kuma is used as part of a
larger observability solution.
API keys secure access to the met-
rics endpoint. Once you have con-
figured an API key, simple HTTP
Webhooks for individual integrations
mean you can connect virtually any
system. The webhook payload is docu-
mented and can be processed in your
own scripts or automations. SMS mes-
sages by various providers (e.g., Twilio,
Clickatell, or SMSEagle) reach recipi-
ents without an Internet connection.
In version 2.0 the list was expanded
to include Nextcloud Talk, Brevo (for-
merly Sendinblue), Evolution API,
and Home Assistant. A new environ-
ment variable supports operation
behind proxies, which is important in
corporate environments with restric-
tive network policies.
Status Pages
Uptime Kuma can generate public
status pages (Figure 4) that display
the current status of selected services.
Multiple status pages with different
monitors can be set up and broken
down by, say, customer group, prod-
uct, or internal and external services.
Additionally, you can customize the
design with your own logo, title, and
description. The pages contain the cur-
rent status of each monitor displayed
in groups, a timeline with the latest
events, and optional maintenance
notes. You can define maintenance
windows in advance so that planned
downtime does not trigger alarms.
Public monitor URLs were introduced
in version 2 that let you access indi-
vidual monitors directly without hav-
ing to share the entire status page.
The dashboard shows all monitors at
a glance, color-coded by status: green
for up, red for down, and yellow for
also higher resource load on both the
monitoring server and the monitored
targets. For each monitor, you can in-
dividually specify the number of failed
checks after which you want Uptime Kuma
to alert you, which also prevents false
positives in the event of short-term
outages or slow network connections.
Notifications
Uptime Kuma’s strength lies in its in-
tegration with more than 90 different
notification channels (Figure 3) with
email (SMTP) as the legacy choice.
Uptime Kuma supports any SMTP
server, and allows TLS encryption.
As of version 2.0, the templates use
LiquidJS, which allows for flexible
customization with variables such as
name, msg, status, heartbeatJSON,
monitorJSON, and hostnameOrUrl. HTML
support in the templates ensures at-
tractively formatted notifications.
When it comes to messenger integra-
tion, Uptime Kuma impresses with
choice: Telegram, Signal, Discord,
Slack, Microsoft Teams, Mattermost,
Rocket.Chat, and more are supported.
Integration typically relies on web-
hooks or bot APIs. Discord and Slack
also support rich embeds with color-
highlighted status indicators.
Push services such as Pushover, Go-
tify, ntfy, Pushbullet, and Apprise pave
the way for notifications on mobile
devices without email or messenger.
PagerDuty, Opsgenie, Splunk On-Call,
Grafana OnCall, and other enterprise
tools can be connected as incident
management systems, enabling escala-
tion chains and on-call rotations.
Figure 4: Public status pages inform customers and stakeholders about the status of services.
basic authentication is disabled.
The keys can be managed in the
dashboard.
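A quick manual check of the Prometheus metrics endpoint might look like the following sketch; the convention of passing the API key as the basic-auth password (with an empty username) should be verified against your installation:

# Query Uptime Kuma's metrics endpoint; the key shown is a placeholder
curl -s -u ":uk1_yourapikey" http://localhost:3001/metrics | head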
Limitations
Despite the enthusiasm generated by
Uptime Kuma, the application does
have some limitations. Because all
tests are run from a single location,
if the server running Uptime Kuma
is experiencing network problems,
all monitors will appear to be down.
Conversely, local problems are not
detected if the checks come from out-
side. Commercial services such as Up-
timeRobot, Pingdom, or Better Stack
test from multiple geographically dis-
tributed locations and can therefore
more reliably distinguish between
genuine outages and local problems.
This Uptime Kuma disadvantage can
be partially compensated for by run-
ning multiple instances at different
locations, but you won’t get a central
correlation of the results.
A full-fledged REST API for manag-
ing monitors is also still a work in
progress. If you want to create or
configure monitors automatically,
you will encounter limitations. The
WebSocket-based API is primarily
intended for the web interface and is
not documented. Infrastructure-as-
code approaches, in which monitors
are versioned in Git, are therefore dif-
ficult to implement.
Uptime Kuma primarily measures
availability and response time. More
in-depth performance metrics such
as throughput, error rates by cat-
egory, synthetic transactions across
multiple steps, or real-user monitor-
ing (RUM) are outside its feature
scope. If you need these features,
you will have to turn to specialized
application performance manage-
ment (APM) tools such as New Relic,
Datadog, or Grafana Cloud.
You do have to operate the software
yourself, with all the responsibilities
that entails: installing updates, creat-
ing backups, and ensuring the avail-
ability of the monitoring server. The
paradox is obvious – who is monitor-
ing the monitor? If you don’t want to
or can’t do this, you might be better
off with a SaaS service.
As a community project, Uptime
Kuma does not guarantee support.
GitHub issues and discussions are
active, and the maintainer responds
regularly, but you won’t be subject to
any service-level agreements (SLAs),
which is a risk for business-critical
applications that require guaranteed
response times. The application stores
all the settings in the database, and
you manage them in the web inter-
face. With no option for YAML file-
based configuration, in contrast to
Gatus, for example, versioning, code
reviews, and automated deployment
are more complicated.
Version 2.0
On October 20, 2025, Lam released
Uptime Kuma 2.0. After a year of de-
velopment with five beta versions, it
was the biggest release to date. The
list of changes is extensive, and some
of them require attention during the
update.
The most important new feature is
the optional use of MariaDB as a da-
tabase. SQLite works well for smaller
installations but reaches its perfor-
mance limits if you have several hun-
dred monitors and an extensive data
history. In particular, SQLite’s locking
behavior during simultaneous access
can cause problems at times.
MariaDB offers better scalability,
more robust locking, and the ability
to run the database on a dedicated
server. This is an important step for
enterprise deployment. Please note
that currently automatic migration
Figure 5: The detailed view shows history charts, uptime statistics, and average response times.
Small setup (up to 50 monitors): one CPU core, 512MB of RAM
Medium setup (50 to 200 monitors): two CPU cores, 1GB of RAM
Large setup (at least 200 monitors): four CPU cores, 2GB of RAM; MariaDB recommended
Of course, memory requirements grow
with data history. SQLite databases of-
ten reach several gigabytes if they have
been running for a long time.
Conclusion
Uptime Kuma has earned its place
in the toolboxes of admins and de-
velopers. The combination of easy
installation, attractive interface,
and extensive feature set makes it a
good choice for self-hosted uptime
monitoring.
Version 2.0 eliminates some sig-
nificant weaknesses of the previous
version – in particular, scalability
thanks to MariaDB support and se-
curity with rootless containers. The
project is showing no signs of slow-
ing down; on the contrary, the active
community and dedicated maintainer
ensure continuous improvements.
For example, more than 100 pull re-
quests were incorporated into the 2.0
beta versions alone.
If you are aware of, and can live
with, the limitations and do not
need distributed monitoring, a REST
API, or enterprise features such as
role-based access control, you will
find Uptime Kuma to be a mature,
perfectly functional tool. For more
complex requirements, alternatives
such as Gatus, Prometheus with
Blackbox Exporter, or commercial
solutions are worth a look. For the
majority of use cases, though, from
home labs to small and medium-
sized enterprises, Uptime Kuma is
just the right size, and the little bear
is a reliable watcher.
Info
[1] Project wiki: [https://github.com/louislam/uptime-kuma/wiki]
[2] Project page: [https://github.com/louislam/uptime-kuma]
[3] GitHub Container Registry: [https://ghcr.io]
devices such as switches, routers,
and uninterruptible power systems
(UPSs). Uptime Kuma thus is making
inroads into classic network monitor-
ing territory that previously required
other tools.
The real browser monitor supports
Microsoft Edge in addition to Chro-
mium. You can also connect remote
browsers without a fully installed
browser on every monitoring server,
which saves resources and means the
browser infrastructure can be distrib-
uted. JSON queries for MQTT moni-
tors enable targeted evaluation of IoT
messages. Instead of simply checking
whether a message has arrived, the
content can now be validated.
Hands On
A minimal configuration for a main-
tainable installation can be created in
a docker-compose.yml file:
services:
  uptime-kuma:
    image: louislam/uptime-kuma:2
    container_name: uptime-kuma
    restart: unless-stopped
    ports:
      - "3001:3001"
    volumes:
      - ./data:/app/data
The data directory contains the
SQLite database and all the settings.
Creating regular backups, ideally by
a cronjob with rsync or tar, is essen-
tial. If you want to use the embed-
ded MariaDB, select the matching
option in the setup wizard when you
first start the application. The data
will then also be stored in the data
directory.
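Either way, the data directory is what you need to back up. A minimal cron job sketch follows; the paths are examples, not part of the default installation:

# /etc/cron.d/uptime-kuma-backup -- nightly archive of the data directory
30 2 * * * root tar -czf /backup/uptime-kuma-$(date +\%F).tar.gz -C /opt/uptime-kuma ./data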
For production use, you will want to
run Uptime Kuma behind a reverse
proxy with TLS termination. The ap-
plication supports NGINX, Caddy,
Traefik, Apache, and HAProxy. Cloud-
flare Tunnel also works, allowing op-
eration without a public IP address.
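With Caddy, for example, a TLS-terminating proxy can be started straight from the command line, assuming the hostname resolves publicly so that Caddy can obtain a certificate automatically:

# Terminate TLS on status.example.com and forward to the local Uptime Kuma port
caddy reverse-proxy --from status.example.com --to localhost:3001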
Resource Planning
A guideline for hardware resource
planning comes in three variants:
from SQLite to MariaDB cannot be
done. If you want to switch, you will
need to export and import the data
manually with tools such as sqlite3-to-mysql.
The developer also explicitly
points out that support for migra-
tion problems cannot be provided.
For new installations, an embedded
MariaDB option is available that
does not require a separate database
installation.
Rootless Docker Images
Security-conscious administrators
can run Uptime Kuma in containers
without root privileges as of version
2. The new rootless images run as
an unprivileged user node (UID 1000),
thereby reducing the attack surface. If
an attacker breaks out of the applica-
tion context, they do not have root
privileges in the container.
Some restrictions are in place, how-
ever: Docker monitoring does not
run without additional configuration,
because access to the Docker socket
requires root privileges. The file per-
missions in the data directory must
also be correct (ownership 1000:1000),
which can make migration from a
non-rootless installation problem-
atic. Lam expressly recommends not
switching directly to rootless images
for upgrades from v1 to v2, but only
after completing the initial migration.
In addition to the rootless images, new
slim versions occupy 300 to 400MB
less space than the full version.
However, some features such
as Docker monitoring, the embedded
Chromium browser for real browser
testing, and the embedded MariaDB
are not included. If you do not need
all of that, you can save storage space
and download time. Finally, the im-
ages are now available on GitHub
Container Registry (ghcr.io) [3], not
just on Docker Hub.
Advanced Monitoring
Direct SMTP monitoring of email
servers is now possible. The check
goes beyond a simple port test and
validates SMTP communication.
SNMP support is aimed at network
Azure Storage Explorer (ASE) [1]
offers a lean, locally installable
environment for managing Azure
storage resources. ASE supports di-
rect access to blob, file, queue, and
table storage, with native support for
Windows, Linux, and macOS, and
it integrates with local development
environments such as Visual Studio.
ASE makes tasks such as copying
blobs between storage accounts,
generating Shared Access Signature
(SAS) tokens, or working with the
Azurite emulator far more efficient
than when working in the Azure
portal.
Connection Options and
Authentication
ASE offers several options for con-
necting to Azure
Storage in the Get
Started tab with
Attach to a resource
links (Figure 1):
with an Entra ID
login, a connec-
tion key, a SAS, or
direct URI access.
When connecting
to a storage ac-
count with a con-
nection key, you
can view and copy
the key directly in
Access keys on the
Azure portal. If
you use Entra ID to
log in, make sure
the appropriate
data roles, such as
Storage Blob Data
Contributor or Stor-
age Queue Data
Reader, are as-
signed; otherwise,
many operations
Learn how to manage, automate, and perform diagnostics with Microsoft Azure Storage Explorer, which also
supportsAzurite storage integration, shared access signature management, and error analysis. By Thomas Joos
Managing Azure Storage Resources
Expedition
Photo by Christopher Ruel on Unsplash
Figure 1: ASE manages storage services in Azure without relying on the Azure portal.
will remain grayed out or throw errors.
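If you prefer to manage these assignments from the command line rather than the portal, the Azure CLI can do the job; the user, subscription ID, resource group, and account names below are placeholders:

# Grant the data-plane role needed to browse blobs in ASE
az role assignment create \
  --assignee "user@example.com" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"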
In some scenarios, you will not need
access to an entire Azure subscrip-
tion, just a single resource. ASE lets
you integrate resources directly with
the Connect to Azure Storage option.
When you get there, you can select
Blob container or directory, for exam-
ple, as the resource type (Figure 2).
Next, specify the account and ten-
ant, assign a display name, and
enter the full URL for the resource.
Alternatively, you can add a SAS as
a URL to grant selective access to
containers, files, queues, or tables.
The resource then appears in Local
& Attached | Storage Accounts | (At-
tached Containers) in the navigation
pane. This method is particularly
useful for automated processes or
third-party systems with limited
permissions.
Creating and Managing
Containers for Blob Storage
To create a new blob container in
ASE, right-click on the Blob Contain-
ers node of the desired storage ac-
count and select the Create Blob Con-
tainer entry. After
entering a valid
name, the new
container appears
in the tree struc-
ture. Blobs can
be dragged and
dropped into the
main window or
uploaded with Up-
load | Upload Files,
or you can transfer
entire folders with
Upload | Upload
Folder. If you
want to adjust the
container’s access
level, select Set
Public Access Level
in the context
menu and then
No public access,
Public read access
for container and
blobs, or Public
read access for
blobs only.
time-limited SAS tokens. In the
Shared Access Signature dialog, you
can specify the start and end times,
permitted actions (read, write, de-
lete, list), and the time zone. Click-
ing Create generates a link including
a token, which you can copy di-
rectly with the Copy button.
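The same kind of token can also be generated outside the GUI. The following Azure CLI sketch uses placeholder account, container, and key values:

# Create a read/list SAS token for a single container, valid until the given expiry
az storage container generate-sas \
  --account-name <account> \
  --name <container> \
  --permissions rl \
  --expiry 2025-12-31T23:59:59Z \
  --account-key "<account-key>" \
  --output tsv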
Individual blobs can be managed
directly in the main window, where
functions such as Upload, Download,
Delete, Open, and Copy are available
in the toolbar. ASE automatically
recognizes virtual directories in a con-
tainer. By selecting a blob and click-
ing Open, you can download the file
locally, and it will open in the default
program. Blobs are moved by click-
ing Copy and then Paste in the target
container.
Advanced Container Actions
Containers can be created, deleted, and
copied in full. To copy, select the Copy
option in the context menu and paste
into another storage account – again,
from the context menu. To delete,
either select Delete or press the Del
key. If a blob has snapshots, a dialog
ASE supports two equivalent access
methods for specifically checking
the contents of an existing blob
container. You can either double-
click on the container entry in the
tree structure or select the Open
Blob Container Editor option from
the context menu. This function is
particularly helpful if the container
contains complex directory struc-
tures with many files. The main
window lists all blobs, subfolders,
and metadata in a table. You can fil-
ter the overview by simply reorder-
ing the columns or entering search
terms. You can navigate deeply
nested structures simply by clicking
on the directory entries in the path
bar, and you can edit the metadata
of individual blobs directly: Use the
context menu of the object in ques-
tion to call up Container properties
and adjust the blob type or user-
defined keys, for example.
ASE offers two mechanisms for
granular access control: policies and
SAS tokens. The Manage Access Poli-
cies option lets you create perma-
nently valid rules, and Get Shared
Access Signature lets you generate
Figure 2: ASE offers numerous options for connecting to storage resources in Azure.
box opens when you delete it, giv-
ing you the option to Delete blobs
with snapshots. Azure file shares act
like network drives. Creating a share
with Create File Share makes the
share available on user devices. The
Get Shared Access Signature option
also lets you transfer temporary ac-
cess rights during this process.
The context menu of the menu item
in question also lets you create
queues. Messages are added with Add
Message, displayed by View Message,
or removed with Clear Queue. The
message content is structured, and
the JSON data is directly readable.
You can also use the context menu
to create tables, including manual
insertion of rows with partition keys,
row keys, and user-defined fields. The
filter function lets you analyze indi-
vidual entries.
Local Testing with Azurite
Azurite [2] is a local emulator ideal
for developing and testing storage
platforms without active Azure ac-
cess. ASE recognizes running Azurite
instances by default, provided they
are listening on the expected ports:
10000 for blobs, 10001 for queues,
and 10002 for table storage. Azurite is
located at Local & Attached | Storage
Accounts | Emulator – Default Ports in
the tree structure.
You can set up an alternative port or
container name manually. To do so,
open the Connect to Azure storage
dialog, select Local storage emulator
as the resource type, and specify the
ports and a display name. ASE does
not launch Azurite automatically –
the container or local instance needs
to be up and running beforehand. You
can do this at the command line by
typing:
azurite --silent \
  --location c:\azurite \
  --debug c:\azurite\debug.log
With Docker, it is a good idea to
check that the container is running
(docker container list --all). If the
ports and network settings are incor-
rect, ASE cannot open a connection.
If needed, you can reset the ports
and network settings by typing
docker restart or create new con-
tainers by executing docker run with
the mcr.microsoft.com/azure-storage/azurite
image. The Docker context
you use is also important. Linux us-
ers should note that ASE only works
in the default context. Further ad-
justments can be made with:
docker context use default
Snap users need to set additional per-
missions – for example by typing:
snap connect storage-explorer:docker docker:docker-daemon
After using Azurite to configure a
custom storage account, you can then
use the command
docker exec <container> printenv AZURITE_ACCOUNTS
to check the name and key.
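Putting the pieces together, a typical way to launch Azurite in Docker with a custom account is sketched below; the account name and base64 key are placeholders for the values you would then see with printenv:

# Start Azurite with a custom account on the default blob/queue/table ports
docker run -d --name azurite \
  -p 10000:10000 -p 10001:10001 -p 10002:10002 \
  -e AZURITE_ACCOUNTS="devaccount:<base64-key>" \
  mcr.microsoft.com/azure-storage/azurite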
Troubleshooting and
Diagnostics
If you come across access problems,
check the role assignments. Without
the Storage Blob Data Reader role,
blobs cannot even be displayed. The
administration level (subscriptions,
storage accounts) requires reader or
contributor roles. If SAS tokens or ac-
count keys are missing, you can add
them by looking under the Manage
section, choosing Connect Resource,
and selecting Shared Access Signature
(SAS). You can fix TLS/SSL problems
by going to Edit | SSL Certificates |
Import Certificates. The --ignore-certificate-errors
parameter at startup
is not recommended for security
reasons.
Problems with the authentication
broker, the login window, or re-login
can be resolved with Help | Reset or
by deleting the .IdentityService di-
rectory in the user profile. On macOS,
go to Login to lock the keychain or
reauthorize it. If you are using Linux,
tools such as Seahorse can help you
manage the standard keychain.
Check the proxy configurations in Set-
tings | Application | Proxy. ASE only
supports standard authentication;
NTLM is not compatible. A network
tool such as Fiddler can help with
diagnostics if you set it up on local-
host:8888 and configure the proxy
source in ASE to Use system proxy.
Logging
You can go to Help | Open Logs Direc-
tory to access the application logs.
The log level can be increased to
Verbose in Settings | Application |
Log Level. AzCopy logs end up in
C:\Users\<username>\.azcopy (Windows) or
~/.azcopy (Linux/macOS). Authentication
logs are in C:\Users\<username>\AppData\Local\Temp\servicehub\logs or
~/.ServiceHub/logs also offer valuable
information in the event of errors.
To save and manage your own con-
nections, navigate to Help | Switch de-
veloper tools in the local storage area.
When you get there, you can clear the
settings in case of issues by deleting
the entry in question, leaving just the
square brackets, [ ]. You can also de-
lete non-functioning SAS URIs in this
menu by removing targeted entries in
StorageExplorer_AddStorageServiceSAS_v1_blob.
Automating ASE
One key advantage of ASE is its com-
prehensive support for complex role
models. Although you can manage
the way admin-level subscriptions and
accounts are displayed with reader or
contributor roles, data operations at
the resource level require specific as-
signments such as Storage Blob Data
Contributor or Storage Queue Data
Reader. The fact that these two levels
are separated often leads to errors,
but you can avoid issues by choosing
clear-cut role assignments.
ASE offers alternatives for environ-
ments with restricted GUI access: The
various connection options support
common authentication types and
connection scripts that use SAS URIs
or connection keys. You can then
integrate these into your deployment
or automation processes. A simple
resources in Azure. Its strengths lie
in its broad format support, seam-
less integration of local develop-
ment environments, granular access
control, and high level of automa-
tion. Practical features such as
blob copies between accounts, SAS
management, queue handling, table
management, emulator support,
detailed logging, and extensive trou-
bleshooting options make it a go-to
tool for managing various storage
platforms in Azure.
Info
[1] Azure Storage Explorer: [https://azure.microsoft.com/en-us/products/storage/storage-explorer]
[2] Azurite Emulator: [https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azurite]
The Author
Thomas Joos is a freelance IT consultant and has been working in IT for more than 20 years. In addition, he writes hands-on books and papers on Windows and other Microsoft topics. Online you can meet him on [http://thomasjoos.spaces.live.com].
and continuous deployment (CI/CD)
pipelines. When managing access
rights, permanent access policies com-
bined with ephemeral SAS tokens are
recommended. You can use a program
to create the tokens, assign an
expiration date, and deploy them
specifically for each service (blob, queue,
table). The integration of storage re-
sources in containers with anonymous
access or publicly accessible blobs is
also possible, provided you configure
Set Public Access Level appropriately.
Finally, integration with AzCopy plays
a central role. Internally, ASE relies
on the command-line tool for trans-
fers and lets you track every action
in the GUI. Integration is so tight that
you can open AzCopy logs directly
without leaving the tool. If you also
want to track data movements in the
background or automate them with
scripts, instead of just using the inter-
face, you can implement some actions
in a far faster and easier way than in
the Azure portal.
Conclusion
ASE offers a powerful interface
for effectively managing storage
sample script that automates the pro-
cess of uploading a file to an Azure
Blob container uses azcopy:
azcopy copy "C:\Data\example.txt" U
"https://mystorage.blob.core.windows.net/U
mycontainer/example.txt?U
sv=2022-11-02&U
ss=b&srt=sco&sp= wac&U
se=2025-12-31T23:59:59Z&U
st= 2025-01-01T00:00:00Z&U
spr=https& sig=" U
--overwrite=true
The command copies the files; C:\Data\
example.txt defines the local path to
the file, and https://… is the target
URL of the blob container, including
the SAS token. The --overwrite=true
parameter lets you overwrite existing
files with the same name. To upload an
entire folder instead of a single file, just
extend the command:
azcopy copy "C:\Projects\UploadFolder" U
"https://mystorage.blob.core.windows.net/U
mycontainer?sv=" U
--recursive=true
You can also integrate this script into
batch files or continuous integration
What if the key to understanding the
future of artificial intelligence (AI)
lies not in the latest GPU, but in a
14-year-old piece of hardware? In this
article, I demonstrate how to build
and run a modern AI agent on a 2011
Raspberry Pi Model B, a single-core
computer with just 256MB of RAM.
The goal is not just to prove it can be
done, but to show why it matters for
system administrators and developers:
By embracing constraints, you can
design AI systems that are more effi-
cient, transparent, and secure.
Datapizza-AI PHP [1] is an open
source dependency-free framework
written in pure PHP 7.4+. Here, I
show how to build a Sysadmin Agent
capable of monitoring server health,
analyzing logs, and reasoning about
its own actions. This exercise isn’t
theoretical; it’s a hands-on journey
into the core mechanics of AI orches-
tration, proving that sophisticated
automation doesn’t require a cloud-
sized budget. You’ll learn how to
decouple local logic from remote in-
ference, manage local data with a file-
based vector store, and create custom
tools that give your agents real-world
capabilities.
API-First Agent Architecture
At the heart of this project is a
simple but powerful idea: decoupled
orchestration. Instead of running
a massive AI model locally, which
is impossible on the hardware I’m
using, I run only the “brain” of the
agent – the reasoning loop. The
Raspberry Pi acts as a conductor,
managing the conversation between
local tools and powerful remote
large language models (LLMs) over
API calls.
This architecture offers three key
advantages:
Efficiency: The local footprint is
tiny. The agent’s logic consumes
only a few megabytes of RAM,
making it a negligible load on
any server – from a vintage Pi
to a production-grade enterprise
machine.
Data Sovereignty: Sensitive data,
like internal documentation or
system logs, can be processed and
stored locally. Only the abstract
queries or sanitized data snippets
are sent to the external AI model,
keeping confidential information
within your network.
Transparency: Without complex
SDKs or black-box libraries, every
step of the agent’s process – every
API call, every tool execution, ev-
ery piece of context retrieved – is a
simple, auditable HTTP request or a
local file operation (Figure 1).
Why PHP for AI
Orchestration?
Although Python dominates the
AI landscape, PHP is surprisingly
well-suited for the role of an agent
orchestrator. At its core, an AI agent’s
reasoning loop is a series of block-
ing, I/ O-bound operations: Make an
API call, wait for the response, run
a local tool, wait for the result. PHP
was born for this role. Its simple, pro-
cedural nature and robust handling
of HTTP requests make it a perfect fit
for managing the call-and-response
flow of an agent’s thought process.
PHP is the language that built the
modern web, and its core strengths
– simplicity, ubiquity, and state man-
agement – are exactly what’s needed
to build transparent and reliable AI
agents on the edge.
The API-First Pattern
Most local AI tutorials focus on quan-
tization – shrinking a 70B parameter
model until it barely fits in RAM. I
Orchestrating API-first agents and local vector stores on constrained
hardware without GPUs. By Paolo Mulas
Edge AI Automation on a 2011 Raspberry Pi
Sublime Pie
Photo by Laura Seaman on Unsplash
take the opposite approach by accept-
ing that a 2011 Raspberry Pi cannot
run inference. Instead, it is optimized
for what it does best: I/ O orchestra-
tion (Figure 2).
The architecture comprises three
decoupled layers:
1. The brain (remote): a high-
intelligence API (OpenAI GPT-4o,
Anthropic Claude 3.5 Sonnet, or
a local llama.cpp library instance
on another machine) that handles
reasoning and natural language
generation.
2. The memory (local): a file-based
vector store holding embeddings
Observe: The agent reads the con-
versation history and the user’s
latest query.
Reason: This context is sent to the
LLM with a system prompt de-
scribing available tools.
Decide: The LLM replies not with
an answer, but with a structured
JSON payload: {"tool": "disk_
space", "params": {"path": "/"}}.
Act: The PHP script parses this
JSON, instantiates the DiskSp-
aceTool class, executes it, and cap-
tures the output.
Loop: The tool’s output is ap-
pended to the history, and the loop
repeats until the LLM decides it
has enough information to answer.
This transparency is a security feature.
You can log every single step of this
decision tree to a text file, creating a
perfect audit trail of why the AI de-
cided to check a specific logfile.
Environment Setup
Getting started requires minimal
setup by design. The entire frame-
work is self-contained and has no
external dependencies – not even the
Composer dependency manager for
PHP. All you need is a Linux environ-
ment with PHP and Git.
To begin, clone the repository from
GitHub:
git clone https://github.com/paolomulas/datapizza-ai-php.git
cd datapizza-ai-php
of your local data (logs, docs,
notes) that resides entirely on the
Pi’s SD card.
3. The hands (local): PHP classes that
execute system commands, read
files, or query internal APIs. These
run on the Pi’s bare metal CPU.
The ReAct Loop in PHP
The core of the agent is the reasoning
and acting (ReAct) loop. In Python
frameworks like LangChain, this
logic is often buried under layers of
abstraction. In Datapizza-AI PHP, it
is exposed as a single, readable while
loop.
The incredibly easy-to-debug process is synchronous and linear:
Figure 1: The API-first architecture comprises a Raspberry Pi that orchestrates local tools
and remote API calls, keeping the logic local and the heavy lifting in the cloud.
Figure 2: The decoupled architecture: The Pi acts as the secure
gateway, holding tools and memory, and the LLM provides pure
reasoning power.
Figure 3: The output of the 00_sanity_check.php script
confirms that the environment is ready.
Next, create a .env file in the root
directory to store your API key (a tem-
plate is provided as .env.example), add
your API key to the file, and finally,
run the built-in sanity check script to
ensure your environment is configured
correctly by verifying the PHP version,
permissions, and API connectivity:
cp .env.example .env
nano .env    # add: OPENAI_API_KEY="sk-..."
php 00_sanity_check.php
If all checks pass, you’ll see a confir-
mation message (Figure 3), and your
minimal AI lab is ready for its first
experiment.
Implementing a Sysadmin
Agent
Now that the environment is ready,
you should build something practi-
cal: a Sysadmin Agent designed to
monitor server health and analyze
logs autonomously. This step shows
where the true power of the frame-
work’s extensibility shines. The
Datapizza-AI PHP architecture treats
“tools” as modular PHP classes that
the AI can invoke according to its
reasoning.
In this way, you can expose any
PHP logic (system calls, database
queries, or API integrations) to the
agent simply by extending the Base-
Tool class.
Designing Custom Tools
To create a tool that allows the
agent to check disk usage, leverage
PHP’s native disk_free_space and
disk_total_space functions instead of
relying on potentially unsafe shell_
exec calls or parsing df -h output.
This approach is faster, safer, and
platform-independent.
Listing 1 shows the implementation
of DiskSpaceTool. Notice how it de-
fines a JSON schema in the crucial get_
parameters_schema() that is injected into
the LLM’s system prompt, teaching the
model exactly how to use this tool and
what parameters to provide.
This pattern is universally applicable.
You could easily write ServiceRestart-
Tool to wrap systemctl commands
(with strict whitelisting for security) or
DatabaseHealthTool to run a quick SQL
diagnostic query. The agent doesn’t
need to know the implementation de-
tails; it just needs the schema.
Log Analysis Tool
Giving an AI agent read access to
system logs is powerful but risky.
To mitigate this risk, implement
LogGrepTool to enforce strict access
controls at the application level. The
tool allows the agent to search for
specific string patterns (e.g., error
or segfault) but restricts access to a
pre-defined whitelist of logfiles (e.g.,
/var/log/syslog, /var/log/auth.log),
which prevents the LLM from hal-
lucinating a request to read sensitive
files like /etc/shadow or strictly pri-
vate user data.
The implementation uses PHP’s file
handling to read lines safely, avoiding
the overhead of spawning external
grep processes – a critical optimiza-
tion when running on constrained
hardware like the Pi 1.
When the Agent Lies
A common fear is that an AI agent
will “go rogue.” In testing, I found
that the ReAct loop is remarkably
robust, but not infallible.
For example, if you ask, Check the
health of the Postgres database, but
haven’t written PostgresTool, the LLM
might try to hallucinate a solution. It
might incorrectly guess that it can
use LogGrepTool to read /var/lib/
postgresql/data, which is blocked by
your whitelist.
This scenario triggers a safety failure
in the tool: Error: Path not allowed.
Crucially, the agent sees this error in
its observation step. The LLM then
“reasons” about the failure:
Thought: I cannot read the data di-
rectory directly. I should check the
standard logs instead.
Action: log_grep on /var/log/syslog.
This self-correction capability is
what differentiates an agent from a
simple script. It adapts to permis-
sion-denied errors just like a human
operator would – by trying a safer
alternative path.
01 name = "disk_space";
08 $thi s->description = "Checks disk usage for a
given path. Returns free/total space and
percentage.";
09 }
10
11 public function execute($params = []) {
12 $path = $params['path'] ?? '/';
13
14 if (!file_exists($path)) {
15 return "Error: Path '$path' does not exist.";
16 }
17
18 $free = @disk_free_space($path);
19 $total = @disk_total_space($path);
20
21 if ($free === false || $total === false) {
22 return "Error: Unable to read disk stats.";
23 }
24
25 // Calculate percentage and format output
26 $used_p = (1 - ($free / $total)) * 100;
27
28 return sprintf(
29 "Dis k '%s': %.1f%% used (%.2f GB free / %.2f
GB total)",
30 $path, $used_p, $free/1e9, $total/1e9
31 );
32 }
33
34 public function get_parameters_schema() {
35 return [
36 'type' => 'object',
37 'properties' => [
38 'path' => [
39 'type' => 'string',
40 'des cription' => 'Path to check (default:
"/")'
41 ]
42 ]
43 ];
44 }
45 }
46 ?>
Listing 1: DiskSpaceTool Implementation
takes a different approach: a serverless
vector store that lives entirely in a local
JSON file.
The Math Behind the Magic
How do you search text by meaning
rather than keywords? You convert
configuration is minimal (Listing 2;
Figure 4):
File-Based Vector Store for
Local Context
Although the Sysadmin Agent is power-
ful, it is stateless. To build a truly intel-
ligent assistant –
one that knows
your specific server
configurations, run-
books, or incident
history – you need
persistent memory.
In the AI world,
this means a vector
store.
Standard vector
databases (e.g.,
Pinecone or Weavi-
ate) are overkill
for a Raspberry Pi.
They require Docker
containers, sig-
nificant RAM, and
a complex setup.
Datapizza-AI PHP
Reasoning Loop in Action
With the tools defined, you now con-
figure ReactAgent. The ReAct pattern
is the engine that drives the agent’s
autonomy. In this framework, the
loop is implemented as a straightfor-
ward while loop in datapizza/agents/
react_agent.php.
1. Thought: The agent receives a user
query (e.g., Is the disk full?). It
analyzes the available tool sche-
mas and decides it needs to call
disk_space.
2. Action: The framework intercepts
this decision, instantiates Disk-
SpaceTool, and executes it with
the parameters generated by the
model.
3. Observation: The tool returns the
raw string output (e.g., Disk '/':
45.2% used), which is fed back into
the conversation history.
4. Final Answer: The model sees the
observation and formulates a natu-
ral language response for the user.
To instantiate your Sysadmin
Agent with these capabilities, the
Figure 4: The agent’s reasoning trace in the terminal: Note how it sequentially calls disk_space and then log_grep before synthesizing
a final report.
Listing 2: Initializing the Sysadmin Agent

// ... (agent construction and tool registration omitted) ...
$response = $agent->run(
    "Check system health: verify disk space and look for errors in syslog."
);
echo $response;
text into embeddings – vectors of
floating-point numbers (e.g., [0.123,
-0.567, …]), where similar concepts
are mathematically close (Figure 5).
To find the most relevant document
for a query, calculate the cosine simi-
larity between the query vector and
your stored document vectors.
Most developers import a Python li-
brary for this task. In Datapizza-AI PHP,
the math is implemented in pure PHP
to demystify the process (Listing 3).
This function is the engine of your
retrieval-augmented generation
(RAG) system, and it runs surpris-
ingly fast on the Raspberry Pi for
datasets of fewer than 10,000 docu-
ments, proving that big data tools
aren’t always necessary for personal
AI projects.
The Limits of Bare Metal
Why does this JSON approach work
for fewer than 10,000 documents? You
need to do the math, because on a
256MB Raspberry Pi, every byte counts.
An OpenAI text-embedding-3-small
vector consists of 1,536 floating-
point numbers. In PHP, an array
of floats consumes significantly
more memory than a packed C
structure. A conservative estimate is
roughly 16KB per vector in memory
overhead:
1,000 documents is about 16MB
RAM
10,000 documents is about 160MB
RAM
On a Raspberry Pi Model B with
256MB of total RAM (and the operat-
ing system taking ~80MB), loading
10,000 vectors leaves practically zero
headroom for the PHP runtime itself
(Figure 6).
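You can sanity-check this estimate on your own hardware with a few lines of PHP (actual numbers vary between PHP versions and builds):

<?php
// Rough measurement of what one 1,536-dimension float array costs in PHP.
$before = memory_get_usage();

$vector = [];
for ($i = 0; $i < 1536; $i++) {
    $vector[] = mt_rand() / mt_getrandmax();  // dummy embedding component
}

printf("One vector: ~%.1f KB\n", (memory_get_usage() - $before) / 1024);
?>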
The Ingestion Pipeline
To populate this store, you need
a pipeline that converts raw text
files (Markdown, logs, configura-
tion files) into vectors (Listing 4).
The ingestion_pipeline.php script
handles the steps:
Load: Read files from a directory.
Split: Break text into chunks (e.g.,
500 tokens) to fit LLM context
windows.
Embed: Send each chunk to the
OpenAI API to get its vector
representation.
Store: Save the text, vector, and
metadata to data/vectorstore.json.
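The pipeline_ingest_single() function from Listing 4 can then be driven by a few lines of PHP; the folder path and the $embedder and $vectorstore objects in this sketch are placeholders:

<?php
// Hypothetical driver: ingest every Markdown runbook in a directory.
foreach (glob('/home/paolo/runbooks/*.md') as $file) {
    pipeline_ingest_single($file, $embedder, $vectorstore, 500);
}
?>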
This simple pipeline allows you to
teach your AI agent about your in-
frastructure. You can feed it a folder
of post-mortem.md files, and suddenly
your Sysadmin Agent can answer
questions like: How did we fix the
MySQL crash last November? by
Figure 5: A visual representation of vector search. The query is converted into numbers
and compared against the database to find the closest match.
// datapizza/vectorstores/simple_vectorstore.php

private function cosine_similarity($vec1, $vec2) {
    $dot_product = 0.0;
    $norm1 = 0.0;
    $norm2 = 0.0;

    // O(d) complexity where d is the vector dimension (e.g., 1536)
    for ($i = 0; $i < count($vec1); $i++) {
        $dot_product += $vec1[$i] * $vec2[$i];
        $norm1 += $vec1[$i] * $vec1[$i];
        $norm2 += $vec2[$i] * $vec2[$i];
    }

    if ($norm1 == 0.0 || $norm2 == 0.0) {
        return 0.0;  // avoid division by zero for empty vectors
    }

    return $dot_product / (sqrt($norm1) * sqrt($norm2));
}

Listing 3: Cosine Similarity in Pure PHP

Headless Execution with Cron

Because Datapizza-AI PHP is just a PHP script, deploying it is as simple as adding a line to your crontab. This headless mode allows the agent to perform scheduled health checks without human intervention.
To run your Sysadmin Agent every
morning at 8:00am, simply point
cron to your PHP executable and
your agent script:
# /etc/cron.d/sysadmin-agent
0 8 * * * paolo /usr/bin/php /home/paolo/datapizza-ai-php/examples/05_sysadmin/sysadmin_agent.php >> /var/log/sysadmin_agent.log 2>&1
Because the framework uses standard
output for logging (the $this->log()
method you saw in ReactAgent),
all reasoning steps – thoughts, tool
outputs, and final answers – are au-
tomatically captured in the logfile,
which creates a comprehensive audit
trail. You can review /var/log/sys-
admin_agent.log to see exactly why
the agent decided to flag a disk space
warning.
Integration by HTTP
For more interactive use cases, the
framework also includes a simple
api/chat.php endpoint that allows
you to trigger your agent from
Keywords: Datapizza-AI, PHP, artificial, intelligence, Raspberry, Pi, local, vector store, API, agents, edge, computing, automation, decoupled, orchestration
Listing 4: The Ingestion Pipeline

// datapizza/pipeline/ingestion_pipeline.php

function pipeline_ingest_single($filepath, $embedder, $vectorstore, $chunk_size=1000) {
    // 1. Parse text
    $parsed = parser_parse_text($filepath);

    // 2. Split into chunks
    $chunks = splitter_split($parsed['text'], $chunk_size);

    foreach ($chunks as $i => $chunk) {
        // 3. Generate embedding (Remote API call)
        $embedding = $embedder->embed($chunk);

        // 4. Store locally
        $vectorstore->add_document($chunk, $embedding, [
            'source' => basename($filepath),
            'chunk_index' => $i
        ]);
    }
}
The Author
Paolo Mulas is a developer special-
izing in edge AI and minimal com-
puting architectures.
Set Up an IPv6-Mostly Network

Twofer

IPv6-mostly networks primarily use IPv6 for communication but also support IPv4 as a fallback, simplifying address management, reducing the load on the IPv4 infrastructure, and allowing IPv6-only and IPv4-enabled endpoints to coexist on the same network. We describe transition mechanisms that facilitate the operation of an IPv6-mostly network and the few technical hurdles to overcome. By Mathias Hein
For years, networks have been slated
to migrate to IPv6 to address the short-
age of IPv4 addresses. This problem
can be an issue for users, because
many older devices, applications,
and services still in use do not work
properly in an IPv6-only environment.
Therefore, dual-stack networks cur-
rently offer the best user experience,
but at the expense of running out
of IPv4 addresses. Although most
network operators initially tend to
introduce IPv6 in parallel with their
existing IPv4 infrastructure, IPv6-only
networks are still uncommon outside
the mobile communications sector.
Most admins agree that the dual-stack
approach is an unavoidable transition
phase that allows lessons to be learned
with the IPv6 protocol while minimiz-
ing disruptions to network operations.
Admittedly, dual-stack networks do
not solve the core problem: running
out of IPv4 addresses. A network
operator still needs the same IPv4 re-
sources as for an IPv4-only network.
Worse still, a dual-stack infrastructure
often has to remain in operation for
many years. Many applications still
rely on IPv4, as well, which leads to
a chicken-and-egg problem: IPv6-only
networks are impractical for incom-
patible applications, while applica-
tions continue to rely on IPv4 because
IPv6-only networks are rare.
One possible solution is what are dubbed IPv6-mostly networks, which allow IPv6-enabled devices to operate in IPv6-only mode while IPv4 connectivity is seamlessly delivered to those devices that still need this protocol version.
What Defines IPv6-Mostly Networks?

An IPv6-mostly network is very similar to a dual-stack network, with two ad-
ditional key elements. First, the net-
work provides NAT64 functionality in
line with RFC 6146 [1], which enables
IPv6-only clients to communicate
with IPv4 destinations. Second, the
DHCPv4 infrastructure processes the
IPv6-Only Preferred DHCPv4
option (Option 108) in line with RFC
8925 [2]. When connecting to an
IPv6-enabled network segment, an
endpoint configures its IP stack ac-
cording to its capabilities:
An IPv4-only endpoint obtains an
IPv4 address by DHCPv4.
A dual-stack endpoint (not just
IPv6-capable) configures IPv6
addresses by stateless address
autoconfiguration (SLAAC) and op-
tionally by DHCPv6. Additionally,
this device obtains an IPv4 address
by DHCPv4.
An IPv6-only endpoint configures
its IPv6 addresses and, when per-
forming DHCPv4, includes Option
108 (in line with RFC 8925) in the
parameter request list. The DHCP
server returns the option, and the
endpoint waives the request for an
IPv4 address and remains in IPv6-
only mode.
A network segment primarily based
on IPv6 can support a mix of IPv4-
only, dual-stack, and IPv6-only de-
vices. IPv6-only endpoints use the
NAT64 provided by the network to
reach IPv4-only destinations.
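How Option 108 is delivered depends on your DHCP server. As a sketch, on an ISC dhcpd server the option can be declared and handed out like this (the subnet values are placeholders, and 300 seconds matches the conservative starting value discussed later for rollouts):

# /etc/dhcp/dhcpd.conf (sketch)
# RFC 8925: clients that include Option 108 in their parameter request list
# receive this value and waive their IPv4 lease for the given number of seconds.
option v6-only-preferred code 108 = unsigned integer 32;

subnet 192.0.2.0 netmask 255.255.255.0 {
    range 192.0.2.100 192.0.2.200;
    option v6-only-preferred 300;
}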
However, the term “IPv6-only enabled
endpoint” is not a strict technical
definition. Instead, it describes a
device that can work without native
IPv4 connectivity or IPv4 addresses
while providing the same user experi-
ence. The most common method is to
implement a customer-side translator
(CLAT) as described in 464XLAT [3]
(RFC 6877). Devices that support
CLAT (e.g., mobile phones) are
known to operate in IPv6-only mode
without any problems. In some cases,
however, a network administrator
might consider a device to be IPv6-only-capable even without a CLAT implementation – for example, if all applications running on the device have been tested to work in a NAT64 environment without IPv4 dependencies.

Coexistence of IPv6- and IPv4-Capable Endpoints

One effective way to restrict IPv4 addresses only to those devices that need them is to use Option 108. Most CLAT-enabled systems also support this setting. When a network detects this option, it can configure these devices as IPv6-only devices so that they use CLAT to provide IPv4 addresses to the local endpoint's network stack.

Certain devices, such as resource-constrained embedded systems, can operate in IPv6-only mode without CLAT if their communication is limited to IPv6-enabled destinations. Because these systems often do not support Option 108, you might need to use alternative methods to prevent the assignment of IPv4 addresses. One approach is to block IPv4 traffic at the switch port level, which can be done either with a static access control list (ACL) and a filter with a deny ip any any rule or with a dynamic ACL from the RADIUS server. If 802.1x authentication is used, RADIUS can provide an ACL that blocks all IPv4 traffic. However, the ACL-based approach has some implications for scalability and is detrimental in terms of operational complexity, which is why it is only recommended as a temporary solution.
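On a Cisco IOS switch, for example, such a static port ACL could look like the following sketch (ACL and interface names are placeholders):

ip access-list extended BLOCK-IPV4
 deny ip any any
!
interface GigabitEthernet1/0/10
 ip access-group BLOCK-IPV4 in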
Access to IPv4-Only
Destinations
IPv6-only endpoints require NAT64 to
access IPv4-only destinations. Admins
often opt for a combination of NAT44
and NAT64 functions, but if not all
internal services are IPv6-enabled,
NAT64 might need to be implemented
closer to the user. If internal IPv4-
only destinations use the RFC 1918
address space, the known prefix
64:ff9b::/96 does not need to be used
for NAT64 (Figure 1; see section 3.1
of RFC 6052).
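On Linux, one way to provide this NAT64 function is the open source Jool translator; a minimal sketch for a recent Jool 4.x release using the well-known prefix looks like this:

$ sudo modprobe jool
$ sudo jool instance add "nat64" --netfilter --pool6 64:ff9b::/96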
Enabling CLAT on endpoints is es-
sential for running IPv4-only appli-
cations in IPv6-only environments.
CLAT provides an RFC 1918-compat-
ible address and a default IPv4 route,
ensuring functionality even without a
native IPv4 address from the network.
Without CLAT, IPv4-only applications would fail, negatively affecting the user experience and adding to the support overhead.

Figure 1: Onboarding a new device without IPv6-mostly support (source: Ruhr University Bochum, Germany).
Recommendations for network ad-
mins who control the endpoints are
(1) controlling endpoint configura-
tion and enabling CLAT on endpoints
that send DHCPv4 Option 108 and
(2) enabling Option 108 without
CLAT if you are set to identify and
fix IPv4-only systems and applica-
tions or if all applications will run
reliably in IPv6-only mode.
Signaling the NAT64 Prefix
to Hosts
Hosts running 464XLAT must deter-
mine the IPv6 prefix (PREF64) used
by NAT64. The network administrator
needs to configure the first-hop rout-
ers to include PREF64 information in
router advertisements [4] (RA; RFC
8781), even if the network provides
DNS64 (so that hosts can use DNS64-
based prefix discovery, RFC 7050).
This measure is important because
hosts or individual applications could
have a custom DNS configuration (or
even run a local DNS server) and ig-
nore the DNS64 information provided
by the network, preventing them from
using the RFC 7050 method for de-
tecting PREF64 (Figure 2).
In the absence of PREF64 informa-
tion in RAs, these systems would be
unable to perform CLAT, resulting
in connectivity issues for all IPv4-
only applications running on the af-
fected device. Because such a device
would be unable to use the DNS64
provided by the network, access to
IPv4-only destinations would also
be disrupted.
All common operating systems cur-
rently support DHCPv4 Option 108
and automatically enable CLAT
according to RFC 8781. Therefore,
providing PREF64 information in
RAs can reliably reduce the effect
of a user-defined DNS configuration
on these systems. Receiving PREF64
information in RAs also speeds up
the CLAT startup time, making an
IPv4 address and a default route
available to applications in a far
faster way.
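On a Linux first-hop router running radvd, for example, recent releases can announce the NAT64 prefix alongside the regular prefix information; the exact clause name depends on your radvd version (check radvd.conf(5)), but the configuration is roughly:

interface eth0 {
    AdvSendAdvert on;
    prefix 2001:db8:10::/64 { };
    # RFC 8781 PREF64 option; requires a radvd release with NAT64 prefix support
    nat64prefix 64:ff9b::/96 { };
};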
DNS vs. DNS64
DNS64 with NAT64 enables end-
points that exclusively use IPv6 to
access destinations that only use
IPv4. However, this arrangement has
some disadvantages. For example,
Domain Name System Security Ex-
tension (DNSSEC) incompatibility
causes DNS64 responses to fail DNS-
SEC validation. Moreover, endpoints
or applications configured with
custom resolvers are left out in the
cold when it comes to DNS64. The
application has additional require-
ments: To use DNS64, applications
must be IPv6-capable and use DNS
(i.e., not use IPv4 literals). Many
programs do not meet this require-
ment and therefore fail if the end-
point does not have an IPv4 address
or native IPv4 connectivity.
If the network provides PREF64 in
RAs and all endpoints are guaranteed
to enable CLAT, DNS64 is not needed,
and you should not enable it. How-
ever, if some IPv6-only devices may
not have CLAT support, the network
must provide DNS64 unless these
endpoints are guaranteed never to
require IPv4-only destinations (e.g.,
in specialized network segments that
exclusively communicate with IPv6-
enabled destinations).
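A quick way to check whether a network segment provides DNS64 is to resolve the well-known name ipv4only.arpa (RFC 7050), which only has IPv4 addresses (192.0.0.170 and 192.0.0.171); a DNS64 resolver synthesizes AAAA records for it that contain the NAT64 prefix:

$ dig +short AAAA ipv4only.arpa
64:ff9b::c000:aa
64:ff9b::c000:ab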
Advantages of IPv6-Mostly
IPv6-mostly networks offer sig-
nificant advantages over traditional
dual-stack models, where endpoints
have both IPv4 and IPv6 addresses.
The first advantage is a drastic re-
duction in IPv4 address consumption
through IPv6-mostly. This reduction
depends on the capabilities of the
terminal devices (DHCPv4 Option
108 and CLAT support). In real-
world scenarios (e.g., WiFi confer-
ences), 60 to 70 percent of endpoints
can support IPv6-only operation,
reducing the size of IPv4 subnets by up to 75 percent.

Figure 2: With IPv6-mostly support, a device can discover that it can use IPv6 (source: Ruhr University Bochum, Germany).

Managing dual-stack networks means operating two network layers simultaneously, which increases complexity, costs, and susceptibility to errors. IPv6-mostly enables the elimination of IPv4 at many endpoints, simplifying operations and improving the reliability of the entire network. It also reduces dependencies on DHCPv4. With increasing numbers of devices operating seamlessly in IPv6-only mode, the importance of the DHCPv4 service has dropped significantly, making it possible to downsize the DHCPv4 infrastructure or operate the infrastructure with less stringent service level objectives (SLOs) and a view to optimizing costs and resource allocation.

Traditional IPv6 deployment required separate IPv6-only networks plus dual-stack networks. IPv6-mostly offers significant improvements here, too, primarily by improving scalability. Separate IPv6-only networks double the number of service set identifiers (SSIDs) in wireless environments, leading to channel congestion and performance degradation. IPv6-mostly does not require additional SSIDs. Additionally, IPv4 and IPv6 devices can coexist on the same wired virtual LANs (VLANs), eliminating the need for additional VLANs.

Troubleshooting, in turn, provides improved visibility: User-selected fallback to dual-stack networks can obscure issues with IPv6-only operation and make it difficult to report and resolve problems. IPv6-mostly forces users to deal with all the issues, which improves identification and enables troubleshooting for a smoother long-term migration. Finally, IPv6-mostly allows for a gradual migration of devices on the basis of individual segments. Devices only become IPv6-only if they are fully compatible with this mode.

Gradual Transition

Migrating endpoints to IPv6 fundamentally changes the network dynamics by removing the IPv4 safety net; the same applies to the masking effect of Happy Eyeballs [5]. IPv6 connectivity issues are now far more apparent, including those that were previously hidden in dual-stack environments. You should be prepared for problems with both the endpoints and the network infrastructure, even if the dual-stack network is running smoothly. Some considerations for the rollout follow.

With limited control over endpoint configuration, a rollout in each subnet is essential, where you gradually enable Option 108 processing in DHCP. If you have control over the endpoint, a rollout per device is possible (at least for operating systems with a configurable Option 108). Note that some operating systems enable Option 108 support unconditionally and only use IPv6 once it is running on the server side. I therefore recommend that you enable Option 108 processing when enabling DHCP server-side.

Some operating systems automatically switch to IPv6-only. A rollback at this stage affects the entire subnet, so it makes more sense to enable Option 108 on the endpoints and make sure each device can roll back if needed. For a quick rollback, you should start with a minimum Option 108 value (300 seconds) and increase it if the IPv6-centric network proves to be reliable.

Network Operation

CLAT requires either a dedicated IPv6 prefix or a dedicated IPv6 address. Currently, all implementations use SLAAC to acquire CLAT addresses. To enable CLAT functionality in IPv6 network segments, first-hop routers therefore need to advertise a prefix information option (PIO) containing a globally routable, SLAAC-compatible prefix with the autonomous address-configuration flag set to 1.

Because extension headers are specific to IPv6, they are often neglected in dual-stack networks or even explicitly prohibited by security policies. The problems caused by blocking extension headers are obscured by Happy Eyeballs, but they become apparent if you have no IPv4 on which to fall back. The network should at least allow the Fragment and ESP extension headers (for IPSec traffic such as VPN).

Solving Typical Problems

Hidden problems usually occur on IPv6 networks because the IPv4 safety net is no longer present. Although implementation errors vary greatly, I focus here on configuration, topology, and design decisions. It is important to note that these problems are already likely to exist on dual-stack networks, although they will go unnoticed because of IPv4 fallback.

In the past, disabling IPv6 was considered a quick workaround for problems, but that leaves the affected devices without IPv6. Similarly, the IT department might have disabled or filtered IPv6 on the assumption that it is not widely used. Devices that request Option 108 cannot connect on an IPv6-centric network because they do not receive IPv4 addresses and IPv6 is disabled. You must therefore ensure that IPv6 is enabled on your endpoints before migrating the network to IPv6-centric mode.

When you expand your network, NAT44 allows endpoints to extend IPv4 connectivity to downstream systems without the upstream network being aware or granting permission. However, this situation leads to problems with IPv6 if the endpoints do not have IPv4 addresses.

The following solutions are available for the problems mentioned:

DHCPv6-PD for assigning prefixes to endpoints: Provides downstream systems with IPv6 addresses and native connectivity.

Enabling the CLAT function on the endpoint: Functions similar to the wired network architecture described in section 4.1 of RFC 6877. The downstream systems receive IPv4 addresses, and their IPv4 traffic is translated into IPv6 by the endpoint. However, this approach means that the downstream
systems exclusively use IPv4 and
do not benefit from end-to-end IPv6
connectivity. To take advantage of
IPv6 despite these circumstances,
you can use a combination with
IPv6 prefix delegation (PD).
Bridging and ND proxy: Bridges
IPv6 traffic and hides all down-
stream devices behind its MAC
address. However, this arrange-
ment can lead to scalability is-
sues, because a single MAC ad-
dress is assigned to many IPv6
addresses.
Multiple Addresses per
Device
Unlike IPv4, where end devices typi-
cally have a single IPv4 address per
interface, IPv6 end devices inherently
use multiple addresses: the link-local
address, a temporary address (com-
monly used on mobile devices for pri-
vacy protection), a stable address for
long-term identification, and a CLAT
address. Endpoints with containers,
namespaces, or Neighbor Discovery
(ND) Proxy functions can have even
more addresses, posing a challenge for
network infrastructure devices such as
switches, wireless access points, and
so on that map MAC addresses to IPv6
addresses, often with limitations to
prevent resource exhaustion or denial-
of-service (DoS) attacks.
If the number of IP addresses per
MAC is exceeded, infrastructure de-
vices behave differently in different
implementations, resulting in incon-
sistent connectivity losses. Although
some systems reject new addresses,
others delete older entries, causing
previously functioning addresses to
lose their connection. In all these
cases, endpoints and applications are
not explicitly told that the address has
become unusable.
Assigning prefixes to endpoints by
DHCP-PD can eliminate this problem
and the associated scalability issues,
but not all devices support this op-
tion. You will therefore need to en-
sure that the deployed infrastructure
devices support a sufficient number
of IPv6 addresses that can be as-
signed to a client’s MAC address, and
you need to watch for events that in-
dicate that the limit has been reached
(e.g., syslog messages).
Avoiding Fragmentation
Because the basic IPv6 header is 20
bytes longer than the IPv4 header, the
transition from IPv4 to IPv6 can cause
packets to exceed the path maximum
transmission unit (MTU) on the IPv6
side. In this case, NAT64 generates
IPv6 packets with fragment headers.
In line with RFC 6145, the translator
fragments IPv4 packets by default
so that they will fit into 1280-byte
IPv6 packets: All IPv4 packets larger
than 1260 bytes are fragmented or
discarded if the DF (don’t fragment)
bit is set.
To minimize fragmentation, you need
to maximize the path MTU on the
IPv6 side (from the translator to the
IPv6-only hosts). Configuring NAT64
devices to use the actual path MTU
on the IPv6 side when fragmenting
IPv4 packets also makes sense.
Another common cause of IPv6
fragmentation is the use of protocols
such as DNS and RADIUS, where the
server response must be sent as a
single UDP datagram. Security poli-
cies must allow IPv6 fragments for
permitted UDP traffic if responses
in the form of single datagrams are
required. You need to allow IPv6 frag-
ments for permitted TCP traffic unless
the network infrastructure reliably
performs TCP maximum segment size
(MSS) provisioning.
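If the infrastructure does not already clamp the MSS, a rule on the NAT64 box or border router can take care of it; with ip6tables, for example:

$ ip6tables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
            -j TCPMSS --clamp-mss-to-pmtu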
Custom DNS Configuration
On IPv6 networks without PREF64
in RAs, hosts rely on DNS64 to de-
termine the NAT64 prefix for CLAT
operation. Endpoints or applications
configured with custom DNS resolv-
ers (e.g., public or corporate DNS)
can bypass the network-provided
DNS64, preventing the NAT64 prefix
from being detected and obstructing
CLAT functionality.
If possible, try to integrate PREF64 into
RAs on IPv6-centric networks to mini-
mize reliance on DNS64. Be aware of
the possibility of CLAT failures when
endpoints use custom resolvers in en-
vironments without PREF64.
Conclusion
In practice, IPv6-mostly networks offer the best user experience while reducing IPv4 resource consumption to a minimum. Conventional dual-stack networks, on the other hand, retain their complexity without delivering any direct IPv4 resource savings, because IPv4 is still necessary everywhere
to support older devices. In coming
years, the volume of native IPv4 traffic
on these networks is likely to decline to
such an extent that it will start to make
sense to stop using IPv4 altogether.
Info

[1] RFC 6146: Stateful NAT64: https://www.rfc-editor.org/info/rfc6146
[2] RFC 8925: IPv6-Only Preferred Option for DHCPv4: https://www.rfc-editor.org/info/rfc8925
[3] RFC 6877: 464XLAT: https://www.rfc-editor.org/info/rfc6877
[4] RFC 8781: Discovering PREF64 in Router Advertisements: https://www.rfc-editor.org/info/rfc8781
[5] Happy Eyeballs: https://en.wikipedia.org/wiki/Happy_Eyeballs
The Author
Mathias Hein is a freelance
IT consultant and technical
writer with more than
40 years of professional
experience in the field of
networking. He also serves
as an adjunct instructor at several universities.
As a trainer and speaker at technical seminars, he
shares his expertise in the areas of switching,
TCP/IP, Voice over IP, Carrier Ethernet, and network
management. As an author of technical books and
articles in relevant trade journals, Hein regularly
contributes to the dissemination of knowledge.
Managing JVM Applications in Production

Defensive Driving

Java's memory management has quite a steep learning curve when you are tasked with operating a Java Virtual Machine efficiently in production. We guide you through the waters of keeping applications up and running and what signals to look for to prevent crashes. By Henner Schmidt and Max Jonas Werner
Thirty years of unbroken compatibility promises have accumulated into quite a number of choices between
different concepts in the Java Virtual
Machine (JVM) runtime environment
(Figure 1). Looking at some of those
ideas today makes them appear to
be a statement of Zeitgeist more than
anything else, but you still have to
choose the parameters for running
your applications. Understanding
how the JVM manages memory and
how to observe the many metrics it
exposes is essential to operating JVM
applications in production.
To observe a JVM application’s be-
havior and memory usage, operators
need
more than a few heap graphs:
They need a reliable way to observe
JVM memory usage, interpret it cor-
rectly, and spot failure patterns early
enough to intervene. In the following
sections, we dive into the details
of JVM memory observation: which
metrics exist, how to retrieve them,
how to visualize them in Grafana,
and what memory trends tend to pre-
cede JVM memory failures.
JVM Memory Areas
Explained
The JVM exposes memory telemetry
in several layers. At the conceptual
level, the most important metric
groups are heap usage, non-heap
usage, garbage collection activ-
ity, and off-heap/ native memory
consumption. These can be broken
down further into “used,” “com-
mitted,” and “max” values, which
appear across the JVM’s memory
subsystems.
The heap is the best-known area
because it holds Java objects cre-
ated by the application. Heap met-
rics typically include current heap
usage (used), the amount currently
requested from the operating system
(committed, always greater than
used), and the configured upper
bound (max, usually derived from the
-Xmx flag value). If you only monitor
one thing, heap usage is the baseline.
Most garbage collectors organize the
heap into generations. New objects
are created in a young area, and ob-
jects that survive garbage collection
cycles are eventually promoted into
an old area. When administrators talk
about “the memory leak curve,” they
usually mean old-generation (Old
Gen) usage creeping upward. Heap
memory is managed by the garbage
collector, so a healthy service often
shows a sawtooth pattern: Allocations
push heap up, garbage collection
drops it down, then it repeats.
Although heap exhaustion is by far the
most common cause of crashes, some
real-world incidents happen outside
the heap. The non-heap category con-
tains several memory pools that are es-
sential for runtime execution. Most no-
tably it includes Metaspace, where the
JVM stores class metadata. Metaspace
exhaustion can lead to failures that
look like memory leaks while the heap
remains stable, but they are extremely
rare. Metaspace is not cleaned by nor-
mal object garbage collection in the
same way heap memory is. If the ap-
plication repeatedly loads new classes
(e.g., because of ClassLoader leaks,
redeploy loops, or dynamic proxy
generation without proper cleanup),
Metaspace usage could climb steadily
until the JVM fails.
Another relevant area is the JVM’s
code cache, which stores just-in-time
(JIT) compiled code. Less frequently,
code cache pressure can also create
instability.
Finally, off-heap areas such as direct
buffers often explain cases in which
operating system-level memory pres-
sure endangers the workload while
heap graphs look normal. In contain-
erized environments, this distinction
matters even more, because cgroup
limits apply to the overall process
memory, not just the heap.
As a general rule of thumb, non-heap
memory should be rather constant
across the runtime of an application,
whereas heap memory will fluctuate
a lot. Additionally, Java processes
consume memory in places that are
not always represented well in heap/
non-heap metrics. Every non-virtual
thread has a native stack, defaulting
to 1MB, which becomes significant
when thread counts are high. Large
non-virtual thread counts can cause
high memory usage, even when heap
is stable, which is a common surprise
in systems that use blocking I/ O or
misconfigured thread pools. The JVM
can also allocate memory off-heap
through direct buffers (e.g., by Byte-
Buffer.allocateDirect() as part of
non-blocking I/ O (NIO)), which is
common in asynchronous network-
heavy stacks such as Netty.
Slight differences occur regarding
metrics between garbage collector
(GC) implementations, but these mat-
ter mainly in naming and structure.
Most production distributions (Open-
JDK, Eclipse Temurin, Corretto) share
the same HotSpot foundation, so the
concepts are identical, and most pool
names are similar. Alternative JVMs
such as Eclipse OpenJ9 expose com-
parable metrics, but memory pool
labels and some GC-related signals
can differ. For this reason, building
dashboards around stable top-level
categories (heap/ non-heap/ resident
set size (RSS)) and treating pool-spe-
cific graphs as JVM- and GC-specific
are good practices.
For JVM applications, we highly ad-
vise opting for a white box or gray
box approach, because it allows you
to understand the application’s mem-
ory usage much better than looking at
it from the perspective of the operat-
ing system. The Java runtime allows
for different ways to obtain these
metrics.
Obtaining Memory Usage Metrics

For production environments, we want to define three fundamentally different approaches to collecting Java application memory usage metrics before diving into the details:

White box instrumentation: Application metrics are exposed through measures directly baked into the source code, optionally through a library, gaining insights into the different memory areas and garbage collector cycles.

Gray box instrumentation: Metrics are consumed from the running Java Virtual Machine. With regard to memory, the same metrics can be retrieved as with white box instrumentation.

Black box instrumentation: Metrics are exposed by the operating system, which allows observation of the overall use of memory and CPU by an application, thread count or network latency, and saturation.

Figure 1: Java Virtual Machine tuning is an art in and of itself. Over the last 30 years, many knobs have been added, but only a subset is crucial for operators to know.

Source-Level Instrumentation

Source-level instrumentation means the application actively exposes metrics as part of its runtime behavior, typically with an HTTP endpoint (Prometheus/OpenMetrics format) or an OpenTelemetry exporter. This approach is common in modern Java services because it integrates well into the same observability pipeline as request latency, error rates, database timings, and other application signals.

In practice, this approach is often implemented with Micrometer (e.g., through Spring Boot Actuator) or
with the OpenTelemetry SDK. Both
options provide memory-related
metrics, such as heap and non-heap
usage, memory pool usage (Old Gen,
Metaspace, etc.), garbage collection
pause behavior, thread counts, and
class loading statistics. These metrics
are exported in a monitoring-friendly
format and are easy to scrape with
Prometheus, without the need to ex-
pose JMX ports.
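With Spring Boot, Micrometer, and the micrometer-registry-prometheus dependency, for example, little more than the following is needed (the host name and port are placeholders):

# application.properties
management.endpoints.web.exposure.include=health,prometheus

# prometheus.yml
scrape_configs:
  - job_name: 'jvm-app'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['myapp.example.com:8080']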
For administrators, the main advan-
tage is reliability and consistency:
You get a stable metrics endpoint,
consistent naming, and easier integra-
tion in containerized environments.
The trade-off is that you either need
application or, at minimum, runtime
configuration changes. The benefit of
source-level instrumentation over an
approach that uses Java agents is that
the application is clearly defined and
does not change at runtime, making
it easier to tick all the boxes in high-
security environments.
Java Agents
The easiest way to expose metrics
from a JVM without changing an ap-
plication’s source code is to run a
Java agent such as the Prometheus
JMX Exporter [1] or the OpenTelem-
etry Java Agent [2]. Both use the
same mechanism to gather metrics
from a running application, which is
to run a Java agent alongsidethe ap-
plication in the JVM, making use of
Java’s instrumentation API to extract
metrics. An example command line
for running an application with the
Prometheus JMX Exporter might be
$ java -Xmx32M \
       -javaagent:./jmx_prometheus_javaagent-1.5.0.jar=9090:exporter.yaml \
       -jar MyApp.jar
The most basic exporter configuration
(exporter.yaml) is
rules:
- pattern: ".*"
The exporter then listens on TCP port
9090 and provides metrics through
the /metrics endpoint. You would
subsequently configure Prometheus
to scrape http://HOSTNAME:9090/
metrics and get all JVM memory met-
rics right out of the box.
Important to know, though, is that
Java agents manipulate the applica-
tion’s bytecode for inspection, which,
although not necessarily a concern in
general if you trust the agent, might
be an operational obstacle to deploy-
ing the agent files together with the
application, especially when running
the application in a container. In
highly regulated environments, run-
ning agents manipulating bytecode
might also not be possible because of
regulatory concerns. After all, with an
agent, you are not running the exact,
certified, and audited binary but a dy-
namically modified one. Make sure to
understand the implications of using
Java agents and document their use
in your operational handbooks.
Black Box Instrumentation
If you cannot or do not want to run
a Java agent and can’t change the
application’s source code, black box
instrumentation is your only way to
retrieve metrics from the JVM. Fortu-
nately, the Java ecosystem provides
tools to provide detailed metrics even
in these cases: jcmd and jstat. The
jcmd utility is the modern, supported,
all-purpose JVM control interface.
It supersedes many jmap and jstat
use cases. Given the JVM’s PID, you
would run the following command to
gather JVM heap metrics:
jcmd <PID> GC.heap_info
This single-shot command would
need to be scheduled to gather met-
rics continuously for time-based ob-
servations; therefore, jstat might be
preferable:
jstat -gc <PID> 1000
With this command, jstat would
gather and print metrics from the pro-
cess each second.
To provide these metrics in a Pro-
metheus-compatible format, you
would have to convert them so that
Prometheus can scrape at a regular
interval. Existing tools such as jstat-
2prom facilitate this process, but they
are usually not well maintained or
they are straight up abandoned, so
you would likely have to write your
own glue code. It becomes apparent
that JVM black box instrumentation
adds considerable operational over-
head that you should be well aware of
when planning your instrumentation
strategy. Tools like those mentioned
here might not be very well suited for
continuous monitoring, but they do
come in handy for ad hoc profiling of
applications.
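A minimal sketch of such glue code pushes one jstat value through node_exporter's textfile collector (column positions differ between JDK versions, so verify them for your JVM):

#!/bin/bash
# Export the old-generation usage (OU, reported in KB by jstat -gc) as a
# Prometheus metric via node_exporter's textfile collector directory.
PID=$(pgrep -f MyApp.jar)
DIR=/var/lib/node_exporter
jstat -gc "$PID" | awk 'NR==2 {printf "jvm_jstat_old_used_bytes %d\n", $8 * 1024}' \
  > "$DIR/jvm_jstat.prom.$$" && mv "$DIR/jvm_jstat.prom.$$" "$DIR/jvm_jstat.prom"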
As a last resort, you can still rely
on operating system metrics for the
JVM process. With Prometheus, the
easiest way is to use the node_ex-
porter [3] and process-exporter [4]
applications, which gather all kinds
of metrics about the node on which
they are running or the processes
that run on that node, respectively.
Obviously, you will not be able to
gather insights into the application’s
memory internals (such as the dif-
ferent memory areas), but you will
still be able to create coarse-grained
alerts that are based on the overall
process or node memory usage, still
allowing you to prevent a memory-
related application crash.
Visualizing JVM Memory
in Grafana
A useful Grafana dashboard should
make it easy to answer one opera-
tional question: What type of memory
pressure is killing my process?
The most important visualization is
heap usage compared with its con-
figured maximum. This graph pro-
vides an immediate view of whether
heap headroom exists. A panel that
tracks old-generation usage is also
extremely valuable, because it is the
best early-warning signal for retained
object growth.
Non-heap metrics should be dis-
played separately from heap. Meta-
space deserves its own graph,
because it can fail independently.
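The exact queries depend on how the metrics are exposed; with Micrometer's Prometheus naming, for example, panels along these lines cover the basics (pool names such as "G1 Old Gen" depend on the garbage collector in use):

# Heap used as a percentage of the configured maximum
100 * sum(jvm_memory_used_bytes{area="heap"})
    / sum(jvm_memory_max_bytes{area="heap"})

# Old-generation pool usage
jvm_memory_used_bytes{area="heap", id="G1 Old Gen"}

# Metaspace usage
jvm_memory_used_bytes{area="nonheap", id="Metaspace"}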
Garbage collection activity should be shown as both pause time and pause frequency, because rising full-GC rates are often the leading indicator that the JVM is trying (and failing) to recover memory. As another rule of thumb, major GC pause times are a sign that the application is struggling to stay operational with the given amount of heap space as old-generation heap space fills up.

Minor GC pause times should be rather constant across the runtime of the application; an increase in minor pause times might indicate that the application is experiencing memory pressure (e.g., because of higher load). In such a case, a countermeasure would be to scale the application either horizontally (by deploying more instances) or vertically (by allocating more memory, e.g., through -Xmx).

Predicting OutOfMemoryError

JVM memory incidents rarely happen instantly. They show recognizable shapes and may build up over the span of days or even weeks.

One of the most common patterns is old-generation growth that does not drop after major collections. If Old Gen usage trends upward over minutes or hours and never meaningfully returns downward, the service is retaining objects. This classic memory leak signature is independent of the GC algorithm (Figure 2).

Figure 2: Memory leak curve. Used Heap reaches 100 percent at 13GB, and the JVM crashes. Heap looks fine afterward, but not for long.

Another warning sign is increasing full GC frequency with diminishing benefit (Figure 3). The heap reaches a point where every GC cycle frees too little memory. The JVM responds by collecting more often, causing latency spikes and throughput collapse. This phase often comes before the final OutOfMemoryError and is where intervention still helps. One intervention that is not infrequently deployed in real-world production environments is restarting the application under controlled conditions. Although this intervention should be seen as a last resort to avoid service interruptions, increases in full GC cycles are a good sign that you should capture a heap dump of the application for further inspection or capture a runtime profile with Java Flight Recorder. In the best case, the application's heap memory just needs to be adjusted. In the worst case, such observations point to a memory leak that can only be fixed by the application developer.

Figure 3: The GC frequently fails its 200ms pause time target, indicating GC pressure because of too little available memory.

A separate but equally important pattern is monotonic Metaspace growth. When Metaspace continually increases, heap graphs can look healthy right up until the crash. Operators should treat this as a first-class signal, especially in environments with frequent redeployments.

Finally, watch for the mismatch between JVM-internal metrics and operating system-level memory. If RSS climbs while heap remains stable, suspect native allocations: direct buffers, thread stacks, or JNI libraries. In this situation, increasing heap will not solve the problem and may accelerate it by reducing headroom for native memory.

Now that you understand how to observe JVM memory usage, you can look at how to optimize it.
Choosing a Garbage
Collector
The term “garbage collector” has
always been kind of a misnomer. It
is a bit like calling the mayor of a
town a “trash guy” just because tak-
ing care of the town's trash manage-
ment is one of the mayor’s duties.
The GC is basically doing everything
about the system RAM used by the
JVM. You might ask: Why have
a choice? Why not just have one
“optimal” garbage collector? The
reason is the amount of sophistica-
tion in JVM memorymanagement.
Over the evolution of the JVM, it
quickly reached a degree at which
it became impossible to create one
GC that was optimal under all cir-
cumstances. The main challenge is
explained quickly: The “better” a
GC, the more resources – RAM and
CPU – it needs by itself. Therefore,
choosing the correct GC has con-
sequences for the overall memory
consumption, behavior under load,
and resource efficiency.
The Contenders
We stick to the garbage collectors
available in HotSpot OpenJDK.
Other JDK distributions, like the one
by Azul, and even alternative JVM
implementations like Eclipse OpenJ9
introduce broader implications than
just memory management; therefore,
the contenders as of Java 25 are:
1. Serial/ Parallel
2. G1
3. Z (ZGC)/ Shenandoah
Technically the list has five garbage
collectors from which to choose, but
you can put them into three groups to
make the first decision.
Nature of Your Workflow
The first group has by far the least
cost regarding CPU and memory, but
it is also the least favorable option
because of its huge stop-the-world
collection pauses. You might consider
it for jobs or CLI applications but
almost never for long-running server
applications.
The second group consists of only G1.
This GC is special in that it aims for
the sweet spot between cost and per-
formance and will therefore be your
most likely pick.
Z and Shenandoah are made for
workloads that trade maximum
throughput and efficiency for ultra-
short pause times. Z is marketed with
pauses around 1ms, which is interest-
ing to contrast against the 200ms tar-
geted by G1. Shenandoah is the only
garbage collector being developed
outside of the Java team. It was con-
tributed by Red Hat and more or less
has the same features as ZGC. Now
all you’re left to do is load test your
workload with both options if you
need the qualities of this group.
To Not Choose
Not picking and pinning an option is
not advisable because the JVM will
then make the decision for you. JVM
version-dependent magic numbers
form a heuristic to decide whether to
use Serial in case you did not bother
to define the GC of your choice. One
CPU and less than 1,792MB of RAM
will let it choose Serial because it has
a better efficiency than G1 under such
circumstances.
The Java team is working to change
this behavior, though. JEP 523 is
proposing to make G1 the default
under all circumstances. They laid the
groundwork for this option with Java
release 25, of which they claim to
reach a good-enough efficiency of G1
even for constrained environments.
When G1 is the default, when can
you choose ZGC or Shenandoah? See
it as an optimization for workloads
that have CPU and memory to trade
for shortest GC latency: probably
most web back ends these days. Keep
in mind, though, as with all optimiza-
tions, you will have to load test and
prove their positive effect.
Tuning Garbage Collection
The JVM is versatile with its ways
of optimizing for a range of envi-
ronments. We want to share some
of the tuning knobs that should be
considered for Linux server workloads.
The one thing you should always do
is set the garbage collector explicitly.
The JVM will otherwise attempt to
be smart by heuristically picking one,
as explained above, which could lead
to a different GC being chosen across
your development workstation, your
Build Server, or your deployment
stages, leading to hard-to-debug is-
sues that might only be visible with a
specific GC.
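Setting the collector explicitly is a single command-line flag, for example:

$ java -XX:+UseG1GC -jar MyApp.jar           # balanced default choice
$ java -XX:+UseZGC -jar MyApp.jar            # ultra-low pause times
$ java -XX:+UseShenandoahGC -jar MyApp.jar   # Shenandoah, where the build includes it
$ java -XX:+UseSerialGC -jar MyApp.jar       # small jobs and CLI tools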
Planning Heap Size
Memory capacity planning for JVM
instances is complex because of the
several very specific uses GCs have
for memory.
The irony is that the heap size is not
the only use; it is just the most promi-
nent. You also have direct buffers,
the metaspace, class caches, memory
maps, and so on, as stated earlier.
Although you could attempt to budget
all of them explicitly, we argue that
it would do more harm than good.
Getting these explicit sizes right is dif-
ficult, partly because they are tightly
coupled to the features of the envi-
ronment the JVM runs in. Just two
examples are:
Thread stack size is double the
size on ARM 64 bit. Setting it to a
fixed value would allow for half the
amount of possible threads, just be-
cause you picked a different machine
type. You would like to prevent these
kinds of surprises.
Direct buffers are used by NIO.
NIO is the current I/ O component
of the JVM. The amount of memory
it uses is dependent on load: Set it
to a fixed size you figured out by
load testing and be unpleasantly
surprised that a small change in,
for example, network latency might
quickly lead to an out of memory
(OOM) while the heap is not even
close to being saturated.
A proven approach is to let the JVM
have some breathing room for dy-
namically sizing everything outside
the heap. Observing the total memory
utilization of the JVM while load test-
ing will inform you about how to set
the heap.
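A container-oriented start command that leaves this kind of breathing room might look like the following; the values are illustrative and should be validated by your own load tests:

$ java -XX:+UseG1GC \
       -XX:MaxRAMPercentage=70 \
       -XX:MaxGCPauseMillis=200 \
       -jar MyApp.jar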
52 A D M I N 92 W W W. A D M I N - M AGA Z I N E .CO M
Java Memory ManagementTO O L S
controllers by considering which
area of memory is quickest to reach
from each CPU. NUMA awareness is
disabled by default because VM plat-
forms might not be honest about the
amount of memory controllers. Not
setting this flag on a NUMA-enabled
host will set you back between 10 and
30 percent of throughput, depending
on your workload.
Parting Words
The JVM and its memory manage-
ment is a huge topic. We can recom-
mend its man pages for those of you
who want more information [5],
but even those docs do not mention
every aspect worth knowing – for
example, when you wonder why
the JVM is not exiting, although
you clearly see it running into out-
of-memory exceptions, and find
out that you explicitly have to set
-XX:+CrashOnOutOfMemoryError to make
it exit.
Info
[1] Prometheus JMX Exporter:
[https:// prometheus. github. io/ jmx_ex-
porter/]
[2] OpenTelemetry Java Agent:
[https:// opentelemetry. io/ docs/ zero-code/
java/ agent/]
[3] Node Exporter: [https:// github. com/
prometheus/ node_exporter]
[4] Process Exporter: [https:// github. com/
ncabatoff/ process-exporter]
[5] java command man page:
[https:// docs. oracle. com/ en/ java/ javase/
21/ docs/ specs/ man/ java. html]
Authors
Max Jonas Werner is a software engineer, pref-
erably working on and with Kubernetes. As part
of his day job at Coppersoft GmbH, he builds
and operates third-party applications for criti-
cal infrastructure suppliers across Europe. He
is one of the core maintainers of the Flux open
source continuous delivery solution.
Henner Schmidt works as a Fullstack staff
engineer for development and operations at As-
sense Software Solutions in Hamburg, Germany.
His expertise lies in writing and operating JVM
applications for the service industry.
touch every requested memory
page right at launch. You probably
want to use this approach if your
workload requires maximum risk
avoidance, because it is no longer
possible for the operating system to
give this memory to another process
without killing the JVM.
Performance Tuning
The most effective tuning measure
is to update the JVM version. The
developers introduce optimiza-
tions to footprint, throughput, and
latency with almost every release.
The ZGC design goal is to not of-
fer tuning knobs, but if you are
looking at G1-specific tuning with
JVM parameters, the champion is
-XX:MaxGCPauseMillis, which sets
G1 a soft but usually very effective
goal for pause times. It will trade
this value for some CPU cycles, but
most people will be happy to spend
that for shortening the maximum
response time of their applications
endpoints.
Reducing JVM Memory
Footprint
The -XX:+UseCompactObjectHeaders
and -XX:+UseStringDeduplication
parameters both target reducing the
memory footprint. They are off by
default, even in JVM 25, which is
a testament to the JVM developers’
conservative nature when introduc-ing rather radical changes to the
ecosystem. The ecosystem is prob-
ably what you want to look out
for when testing your workloads
for problems that might come up,
because the JVM itself will have no
issues with flags.
Taking Advantage of
Environment Features
A property of the server on which
your workload runs might be having
more than one memory controller.
Performance-wise, knowing this pos-
sibility is relevant information for
the GC. The -XX:+UseNUMA parameter
lets it work with multiple memory
Heap Sizing Example
Say you want to optimize the
memory allocation for a JVM run-
ning in a container: You go with
G1 as the starting point and want
to know the amount of JVM heap
space you should allocate relative
to the memory limit given to the
container. You have prepared a load
test and would love to see the appli-
cation get along with a total of 4GB
so it would fit the node sizing of
the cluster in which the container is
supposed to run. What you do not
know is whether this size is enough
under load. You can ask the JVM to
set the divide between heap and off-
heap as a percentage to avoid set-
ting the size with an absolute value.
A good starting point for this test is
70 percent heap and 30 percent off-
heap. The JVM parameter for that
is -XX:MaxRAMPercentage=70. You can
now alter the memory limit for the
container, test each, and adjust ac-
cording to what the metrics show.
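A minimal sketch of this test loop, assuming Docker and a hypothetical image named myapp that contains a JVM and /app.jar; only the container limit changes between runs:

docker run --memory=4g myapp java -XX:MaxRAMPercentage=70 -jar /app.jar
docker run --memory=5g myapp java -XX:MaxRAMPercentage=70 -jar /app.jar   # compare the metrics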
Overcommitting Memory
All applications request memory
from the operating systems on
which they run, not just the JVM.
Digital evolution has led most
operating systems to answer these
requests with virtual memory,
even if that means they overcom-
mit the physical memory available.
The pieces of the virtual memory
(pages) are backed with actual
memory the first time the applica-
tion uses them. You can change that
globally on some operating systems
and in some environments, like a
VM, but not everywhere you might
run an application.
Therefore, letting the JVM claim
the maximum amount of memory
you would want it to have does no
harm and has no effect on systems
that allow for unbound overcommit-
ting – other than keeping your met-
rics informed, which the parameter
-XX:InitialRAMPercentage does.
You also have a way to commit the
physical memory. The parameter
-XX:+AlwaysPreTouch makes the JVM
Keywords: Java, JVM, virtual, machine, memory, management
Azure Firewall is a fully stateful
firewall as a service with built-in
high availability and unlimited cloud
scalability that manages both east-
west and north-south traffic. In this
article, I look at forced tunneling,
which allows northbound traffic to
be inspected by a local firewall before
leaving the regional Azure gateway.
To assess the capabilities of the ser-
vice properly in relation to its price
[1], you should understand how in-
frastructure-based workloads – think
virtual machines (VMs) – typically
communicate with the outside world.
In Azure, VMs are always deployed
on virtual networks (VNets), where
each VNet uses a freely selectable
RFC 1918-compliant address range.
The VNet must have at least one
subnet on which each VM uses the
private IP address of its virtual net-
work interface – that is, an address
from the subnet’s address range. Of
course, the VM can also access mul-
tiple private IP addresses, either in
the form of multiple IP configurations
(one of which is always primary) or
in the form of multiple network in-
terfaces. The VM needs the VNet to communicate:
• with other VMs in the same network,
• with other Azure services that reside on the same virtual network with a service or private endpoint,
• with Azure VMs on other Azure VNets by VNet peering or IPsec virtual private network (VPN),
• with the local site by IPsec VPN or Microsoft Azure ExpressRoute,
• with other Azure resources by their public endpoint, or
• with the Internet.
Outgoing Internet communication
worked automatically out of the box
(up to September 30, 2025) without
any further configuration – even with-
out an explicit public IP. Microsoft
refers to this implicit network address
translation (NAT)-like procedure as
default outbound access. However,
it was discontinued on the date men-
tioned above. Ever since, customers
have had to configure outbound In-
ternet communication explicitly (e.g.,
with the use of a public IP address,
a NAT gateway, or a source NAT
(SNAT) in conjunction with the Azure
Firewall). For inbound Internet con-
nectivity, the VM always (directly or
indirectly) needs a public IP address –
at an additional cost, priced by
standard stock keeping units (SKUs;
basic SKUs were discontinued at the
same time).
Every virtual network in Azure has
a default gateway (as a service) that
is not visible to the customer and a
default route table. Although you can-
not see the gateway as an entity in
Azure, typing ipconfig on the guest
system of the VM or with a Power-
Shell (PS) script injected into the VM
from outside by the VM agent will
reveal its existence. The gateway runs
on the first available IP address (after
the network address) in the VNet’s
address range (e.g., 10.0.0.1).
Routing in Azure
The invisible (system) routing table
contains matching routes (e.g., to the
default gateway for Internet commu-
nication). Azure automatically cre-
ates system routes and assigns them
to all subnets of a virtual network,
with the route definition consisting
of an address prefix and a next hop
type, which can be a kind of alias or
service tag.
For example, the address prefix
0.0.0.0/ 0 is assigned to Internet.
If the destination is not within the
network’s address range, the route
passes through the default gateway.
In contrast, the VirtualNetwork next
hop type stands for the address space
Photo by Anthony Reungère on Unsplash
The Azure Firewall network security service combines threat protection, packet filtering,
and application firewalling for cloud workloads in a platform-based offering. By Thomas Drilling
Forced Tunneling in Azure Firewall
Thoroughfare
Azure. Although every newly created
VM in Azure (or its network inter-
face, if you prefer) is linked to a new
packet filter, users can also skip cre-
ating a filter. In Azure, packet filters
do pretty much what their name sug-
gests, that is, what Linux users but
not Windows users understand when
they hear the word “firewall.” Secu-
rity groups only work in OSI Layer 4
and, therefore, only support the UDP
and TCP protocols (plus ICMP).
Otherwise, the functionalities are
similar to those of the Linux kernel
(netfilter or iptables), which means
input and output rules with stateful
inspection (also known as connection
tracking on Linux), whose process-
ing order is determined by priority.
The rule with the highest priority value (lower values take precedence) is processed last and is usually a deny.
Each rule includes information about
the port (e.g., 80), protocol (TCP or
UDP), source, destination, and action
(allow or deny). IP addresses, IP ad-
dress ranges, or service tags (aliases)
define the source, destination, or
both. Every VM can communicate
with every other VM on the same net-
work with the three default inbound
and outbound rules that are always
created automatically. Additionally,
why your firewall rules do not seem
to be working. Even worse, you might
not notice that something is awry.
You have an easy way of checking
whether Azure Firewall is being used
with Azure Network Watcher, by
the use of either topology visualiza-
tion, next hop analysis, connection
troubleshooting, VPN analysis, or a
combination of methods. Because
services such as Azure Firewall incur
costs (with standard SKUs, the offer is
approximately $1,000 per month for
provisioning,plus data transfer), the
service is usually operated as part of
a hub-spoke architecture, where the
spoke networks need to be connected
to the hub by a peering connection,
and each must have a user-defined
routing table to reach Azure Firewall.
The Azure Architecture Center [2]
provides more information about this
process.
Packet Filters Instead of
Firewalls
Although routing tables and default
gateways are mandatory for every
virtual network in Azure, they are
not used for packet filters, which are
known as network security groups in
of the virtual network itself, which
means that Azure automatically cre-
ates a route with an address prefix
that matches the address range de-
fined in the VNet’s address space.
If you need special routes that go
beyond the system routes, as in the
firewall scenario, you will need to
create your own routing tables (Fig-
ure 1) and populate them with routes,
because you cannot see or change
the system routes. This user-defined
routing (UDR) always takes higher
priority in Azure, ranking higher even
than learned routes (Border Gateway
Protocol, BGP) and system routes.
The next hop type is also fundamen-
tal later on, because you need it to
define Azure Firewall as the next hop
in a user-defined route table if devices
(e.g., VMs) use it as a gateway on the
source network. The definition uses
the VirtualAppliance next hop type,
which is specified by Azure Firewall’s
private IP address.
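As an az CLI illustration (the listings in this article otherwise use PowerShell), assuming hypothetical resource names and a firewall private IP address of 10.1.0.4, such a table could be created and attached like this:

az network route-table create -g fw-demo-rg -n rt-workload
az network route-table route create -g fw-demo-rg --route-table-name rt-workload \
  -n default-via-azfw --address-prefix 0.0.0.0/0 \
  --next-hop-type VirtualAppliance --next-hop-ip-address 10.1.0.4
az network vnet subnet update -g fw-demo-rg --vnet-name vnet-workload \
  -n snet-workload --route-table rt-workload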
A user-defined routing table must
be actively assigned to the source
network in Azure, as well; otherwise,
Azure would continue to use the de-
fault gateway for all Internet traffic,
and you might wonder why Azure
Firewall is not being used or wonder
Figure 1: User-defined routes override standard system routes in Azure.
Azure Load Balancer accepts all in-
ternal inbound traffic and blocks all
other inbound traffic.
The default outbound rules work in
a similar way, except that they allow
outbound Internet traffic. If security
groups are created automatically
when a VM is created, Azure links
them to the network interface of the
specific VM. If you create network
security groups up front, you can as-
sign them to a VM when you add the
VM or link them to a subnet, which
means all VMs on the subnet can use
the packet filter.
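A short az CLI sketch of this pattern, with hypothetical names, allowing only inbound HTTPS and relying on the default DenyAllInBound rule for everything else:

az network nsg create -g fw-demo-rg -n nsg-web
az network nsg rule create -g fw-demo-rg --nsg-name nsg-web -n allow-https \
  --priority 100 --direction Inbound --access Allow --protocol Tcp \
  --destination-port-ranges 443
az network vnet subnet update -g fw-demo-rg --vnet-name vnet-workload \
  -n snet-workload --network-security-group nsg-web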
At the end of the day, though, packet
filters are just packet filters; they do
not work at the application level, al-
though Windows users might imagine
they do, which leaves you with Azure
Firewall as a platform as a service
(PaaS) for Layers 3 to 7. Alternatively,
you could search the Azure Market-
place for available firewall offerings;
doing so will bring up offerings from
virtually all noteworthy vendors,
from Barracuda through Fortigate to
Sophos. Many are based on virtual
machines (infrastructure as a service,
IaaS) and require the same mainte-
nance and administration as your lo-
cal firewall. You are then responsible
for high availability and patching
yourself, but software as a service
(SaaS) and PaaS are offered in the
marketplace, as well, including Azure
Firewall.
Deploying Azure Firewall
As mentioned earlier, Azure Firewall
is available in three SKUs: Basic,
Standard, and Premium [3]; you need
at least Standard for tunnel enforce-
ment. Deployment itself is simple and largely self-explanatory. The work lies more in the accompanying planning in terms of virtual networking, hub-spoke networking (peering), and UDR. The first point is important
because Azure Firewall, as PaaS, re-
quires a specific subnet for itself that
is visible with its own IP addresses.
This subnet must be named Azure-
FirewallSubnet and have a dimension
of at least /26 (i.e., 64 IP addresses)
in CIDR notation. However, you
can easily ensure this if you use the
Azure Firewall template to create the
subnet. For the tunnel enforcement
feature, the firewall VNet must also
contain another subnet named Azure-
FirewallManagementSubnet. Another
template is available for this purpose
when you create the Firewall Man-
agement (forced tunneling) network
(Figure 2).
A few terms need to be clarified.
On the Azure portal, you will find
Firewalls, Firewall Manager, Firewall
Policies (all three related to Azure
Firewall), and WAF (web application
firewall) policies. The latter do what
the name suggests and are not relevant
here. Firewall Manager is a central
management hub for use cases in
which you operate multiple Azure
firewalls and want to create, manage,
and assign your firewall rules inde-
pendently of the firewall instances.
However, Azure Firewall can also be
operated in a kind of classic mode,
where the firewall rules are created
firewall-side.
Figure 2: Deploying and operating Azure Firewall requires specific subnets.
Listing 1: Deploying the Firewall with PS

$resourceGroupName = "fw-demo-rg"
$vnetName = "fw-vnet"
$firewallName = "fw-firewall"
$firewallPipName = "fw-pip"
$firewallMgmtPipName = "fw-mgmt-pip"
$location = "germanywestcentral"
New-AzFirewall -ResourceGroupName $resourceGroupName -Name $firewallName `
  -Location $location -Sku AZFW_VNet -VirtualNetworkName $vnetName -PublicIpName $firewallPipName `
  -ManagementPublicIpName $firewallMgmtPipName -EnableDnsProxy $true -EnableForcedTunnel $true
that the address is only resolved by
user-defined DNS servers that you
stored previously as the DNS servers
responsible for the workload network.
The matching network rule on Azure
Firewall would then be a UDP rule
for port 53 with the destination of
your user-defined DNS servers (e.g.,
8.8.8.8).
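If the firewall is operated in classic mode (rules on the firewall itself) and the azure-firewall CLI extension is installed, the two rules described here could be sketched roughly as follows; resource names and priorities are assumptions, while the 10.2.0.0/24 source and the DNS destination come from the example:

az network firewall application-rule create -g fw-demo-rg -f fw-firewall \
  --collection-name app-web -n allow-duckduckgo --priority 200 --action Allow \
  --protocols Https=443 --source-addresses 10.2.0.0/24 --target-fqdns duckduckgo.com
az network firewall network-rule create -g fw-demo-rg -f fw-firewall \
  --collection-name net-dns -n allow-dns --priority 210 --action Allow \
  --protocols UDP --source-addresses 10.2.0.0/24 \
  --destination-addresses 8.8.8.8 --destination-ports 53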
Forced Tunneling in Azure
You can also configure Azure Firewall
to forward first all Internet traffic to
the specified next hop, such as an
edge firewall at the local site, instead
of directly to the Internet. For com-
pliance reasons, some companies
require that multiple network security
devices (e.g., firewalls) inspect outgo-
ing network traffic before it goes to
an Internet destination. Perhaps your
security policy also requires you to
send all Internet-bound traffic first to
another network firewall (a network
virtual appliance, NVA) in Azure or
directly to a local firewall for inspec-
tion before it reaches the Internet.
Azure Firewall also supports split tun-
neling, which is the ability to forward
traffic selectively. One such scenario
is activating Windows licenses with
a key management service (KMS)
system, where Azure-based Windows
VMs require a public source IP ad-
dress owned by Microsoft rather than
their local Internet gateway IP ad-
dress. You could solve this situation
with custom routing tables on Azur-
eFirewallSubnet (see below). For the
tunneling enforcement scenarios here,
the protocol, 3389 as the destination
protocol, and the destination VM’s
private IP address as the translated
destination (i.e., also 3389).
The most important use case for
Azure Firewall is application rules
for outgoing HTTPS traffic. The key
feature here is that you can use fully
qualified domain names (FQDNs;
instead of IP addresses) in the rules;
moreover, Azure Firewall recognizes
FQDN tags. Microsoft defines these as
a group of FQDNs that are assigned
to known Microsoft services – for ex-
ample, to allow the required outgoing
network traffic (e.g., Windows update
traffic) to pass through the firewall.
Additionally, when creating rules,
Azure Firewall service tags can be
used in the target field for network
rules instead of specific IP addresses.
A service tag represents a group of
IP addresses. Service tags are primar-
ily used to reduce the complexity of
security rules. Also, Azure Firewall
includes a built-in rule collection
for infrastructure FQDNs that are al-
lowed by default.These FQDNs are
platform-specific and cannot be used
for other purposes.
The rule configured in Figure 3 is
a very simple example of a type of
web browsing rule. It allows outgoing
HTTPS traffic to https:// duckduckgo.
com from the workload source net-
work 10.2.0.0/ 24. Azure Firewall
also recognizes network rules (Layer
4). Application and network rules
then take effect in combination. For
example, you could restrict the reso-
lution of https:// duckduckgo.com so
If you set up the required resources
in advance – all that you are miss-
ing is two public IP addresses for the
firewall itself and its management
interface (only required for tunnel en-
forcement), which you can also create
on the fly when creating the firewall –
the deployment dialog for Azure
Firewall on the portal is completed
quickly. Optionally, you can deploy
the firewall with Azure PowerShell
(Listing 1).
Defining Firewall Rules
The main reason (apart from forced
tunneling) for the use of Azure Fire-
wall instead of default routing with a
default gateway, system routes, and
network security groups (see above)
is to control incoming and outgoing
traffic from Azure to the Internet, the
local site (over IPsec VPN), or both in
a more precise and granular way than
you can with packet filters in OSI
Layer 4. Azure Firewall supports rules
for the application layer (app rules),
network layer (network rules), and
network address translation rules for
destination NAT (DNAT).
DNAT rules are useful for securely ac-
cessing virtual machines on an Azure
VNet (without a public IP address)
for maintenance or management
tasks (e.g., by Remote Desktop Pro-
tocol (RDP)). Otherwise, you would
need a self-managed jump host or the
Azure Bastion service, for which you
would be billed. For a DNAT rule, you
would use the Azure firewall’s public
IP address as the destination, TCP as
Figure 3: A simple application rule on Azure Firewall allows traffic to DuckDuckGo.
you need to deploy Azure Firewall
with the Management NIC enabled,
but without a public IP address. En-
abling the Management NIC means
that Azure will create a separate
management network interface with a
public IP address that Azure Firewall
uses for its management operations.
Setting Up a Test
Environment
To try out tunnel enforcement, you
can turn to a test lab provided by
Microsoft on GitHub [4] that covers
both normal tunnel enforcement and
split tunneling for the KMS scenario
outlined above.
The lab uses a simulated local site
in Azure for its site-to-site VPN. This
site is connected to the hub VNet of
an Azure firewall by IPsec site-to-site
VPN with the Azure VPN gateway,
which requires an additional subnet
named “GatewaySubnet” for each site.
The firewall at the local site is also
represented by an Azure firewall,
which means you need a total of
three virtual networks: (1) a network
representing the on-premises side
and containing a gateway subnet for
the VPN gateway, (2) the mandatory
firewall subnet for the Azure firewall
representing the local firewall, and
(3) a workload subnet for a test work-
load in the form of an Azure VM.
On the Azure side, the lab uses a
hub-spoke architecture in which the
workload VNet is peered with the
hub VNet, which also contains: a
gateway subnet for the VPN gateway;
AzureFirewallSubnet for the Azure
firewall; and another subnet, Azure-
FirewallManagementSubnet, for the
Management NIC.
Additionally, the lab environment cre-
ates the required routing tables with
custom routes. The ARM template
on GitHub makes it extremely easy
to deploy the required components.
All you need to do is specify an ad-
min username and password and a
pre-shared key (PSK) for the IPsec
connection. Deployment takes about
40 minutes and outputs the mapped
environment distributed across two
resource groups.
The first resource group, rg-fw-azure,
contains all of the Azure environ-
ment components – that is, the hub
network with the required subnets,
the spoke network with the Workers
subnet, and the VPN gateway (in-
cluding the matching local network
gateway). It also includes the site-to-
site connection for the VPN gateway,
the Azure firewall, the firewall policy,
three custom routing tables (route-
spokes-snets, route-fw-snets, and
platform-managed-rt), the required
public IP addresses, a diagnostics set-
ting, and a log analytics workspace
for monitoring.
The second resource group, rg-fw-
onprem, contains the simulated
on-premises firewall in the form of
an Azure firewall, the simulated on-
premises VPN device on the subnets
of the on-premises VNet in the form
of an Azure VPN gateway and local
network gateway, a site-to-site VPN
connection (always a separate entity
in the case of the Azure VPN gate-
way) to Azure, and a workload VM
on the associated subnet for the on-
premises Workers. Also created here
are the required firewall policies and
a diagnostics setting.
Incidentally, the master template
references four linked templates. The
first template creates all of the Azure-
side resources in one fell swoop, the
second linked template creates the
complete on-premises environment
(simulated in Azure), the third cre-
ates the two Azure VPN connection
objects (VPN Connection) for the IPsec
site-to-site policy, and the fourth cre-
ates the diagnostics settings for the
Figure 4: You can visualize an ARM template in VS Code.
the form of FQDN owaspdirect.azur-
ewebsites.net (an app deployed in
Azure as an Azure Container Instance
during deployment) is allowed by an
application rule on Azure Firewall if
the source is the IP group ipg-azure-
network; in this case, it is then routed
over the local site because of tunnel
enforcement and rejected by the fire-
wall there. To test this arrangement,
call up owaspdirect.azurewebsites.net
in the VM’s browser.
Not only will you see an error in the
browser, you will also find an error
message in the Azure Firewall log
analytics – Action: Deny. Cause: No
rule matches. Proceed with default ac-
tion – because the Azure firewall in
the local environment is dropping the
traffic, which you can see in the con-
figured Log Analytics workspace. You
see two entries for this request: one
for each Azure firewall. The second
log entry shows that the local firewall
rejected the request, which in turn
confirms that the configuration forces
all Internet traffic to use the local
network because the Azure container
instance has a public endpoint.
A quick look at the source IP address
of the local firewall reveals that a
source NAT rule sent it to the private
following cmdlet should complete
successfully:
Test-NetConnection U
-ComputerName 10.100.0.68 U
-Port 3389
and initiate a TCP connection to port
3389, which is open by default on
Windows computers. The IP address
of the “local” VM is 10.100.0.68. To
avoid having to connect to the VM
console by RDP up front, you can
use the option of executing PS scripts
externally with the VM agent under
the Operations section of the Azure
portal.
Of course, you will probably want to
know whether the outgoing Internet
traffic from the workload network in
Azure uses the local firewall as its
Internet gateway. To find out, access
a public IP address from the Azure
VM. You will then see in log analytics
how the request reaches the local fire-
wall because of the enforced tunnel
configuration. You can also see that
tunnel enforcement works throughout
the environment for any traffic des-
tined for a public IP address, which
confirms that the application rules
configured on Azure Firewall are
working correctly. You can view these in Firewall Manager or directly in the associated policies (Figure 5). The allowed destination in
on-premises firewall to stream the
required monitoring logs to a log
analytics workspace in Azure. If you
installed the ARM extension and the
ARM template viewer in your local VS
Code, you can also visualize the tem-
plate components (Figure 4).
Another option is to deploy all the re-
quired objects manually, step by step,
in the Azure portal. The procedure
and required parameters for network
ranges can be found on GitHub [5];
however, the easiest and fastest way
to deploy the test environment is by
entering the PowerShell commands
shown in Listing 2 directly from the
Cloud Shell terminal.
Trying Out Tunnel
Enforcement
After successful deployment, it’s time
to test the setup. First, check the con-
nectivity from the Azure VM to the
local VM to determine whether the
basic deployment, routing, and tunnel
enforcement are working. The data
traffic flows from the VM hosted in
the snet-trust-workers subnet of the
vnet-spoke-workers VNet on Azure
through the Azure firewall in the hub.
The hub, in turn, resides on the vnet-
hub-secured VNet thanks to the vgw-
vnet-hub-secured VPN gateway.
The reason for this setup is that only
the default (system) route for IPsec,
which the system learns from the
BGP, is used here. The data traffic
then reaches the local firewall and,
from there, the local VM. To avoid
asymmetric routing, the return path
to the Azure VM is the same; the
Figure 5: The application rule in Azure Firewall allows access to a container app with a public endpoint provided by the lab scenario.
Listing 2: Deploying the Test Environment
$securePassword = ConvertTo-SecureString "YourPassword" -AsPlainText -Force
$securePSK = ConvertTo-SecureString "YourPSK" -AsPlainText -Force
New-AzSubscriptionDeployment -Name demoSubDeployment -Location westeurope `
  -TemplateUri "https://raw.githubusercontent.com/Azure/Azure-Network-Security/master/Lab%20Templates/Lab%20Template%20-%20Azure%20Firewall%20Forced%20Tunnel%20Lab/Templates/azfwForceTunnelTemplate.json" `
  -AdminPassword $securePassword -SharedKey $securePSK
IP address of one of the Azure firewall
instances (192.168.0.70). This behav-
ior is the result of the firewall classify-
ing all traffic with a destination IP ad-
dress outside the RFC 1918 ranges as
NATed. Incidentally, you can change
Azure Firewall’s SNAT behavior by
switching to the Private IP ranges
(SNAT) tab in the associated policy
and selecting one of the available op-
tions. For example, the Learned SNAT
IP Prefixes option was still a preview
feature at the time of writing.
Another supported scenario is the
shared tunnel mentioned above,
which is used, for example, in the
KMS scenario described earlier during
Windows activation, because activa-
tions would initially fail with forced
tunneling; after all, the configuration
routes all traffic from the Azure VM
to be activated to the local network.
The Azure VM is then unable to con-
nect to KMS servers for Windows
activation. A troubleshooting docu-
ment [6] describes this scenario in
detail and suggests custom routing.
Now add a new send-to-kms route
with destination 23.102.135.246/
32 and Internet as the next hop type
to the route-fw-snet routing table at-
tached to AzureFirewallSubnet. The
IP address, 23.102.135.246, is one
of three KMS servers that process
Windows activations for Azure VMs
worldwide. You must allow the traffic
to pass through Azure Firewall, which
the send-to-kms rule does, allowing
all connections from the 192.168.2.0/
24 subnet to KMS servers over the
Internet. You can test this again in
your Azure VM PowerShell session or
by remote script execution. To do so,
drag your Azure Firewall logs into log
analytics again, which should confirm
that the traffic passed through the
firewall and that the TCP request to
the Internet was allowed.
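Assuming the lab's resource group and the route table attached to AzureFirewallSubnet (named route-fw-snets in the deployed environment), the equivalent az CLI call for the send-to-kms route described above would be:

az network route-table route create -g rg-fw-azure --route-table-name route-fw-snets \
  -n send-to-kms --address-prefix 23.102.135.246/32 --next-hop-type Internet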
Conclusion
Forced tunneling has long been an
important security requirement for
many organizations. The need to
inspect and monitor Internet-bound
traffic from Azure resources is grow-
ing with the increasing prevalence
of Azure-powered infrastructures.
Configuring Azure Firewall to tunnel
all traffic downstream for additional
monitoring meets the strict require-
ments for maintaining compliance in
many organizations’ environments.
Additionally, the ability to split spe-
cific traffic to meet other dependen-
cies and requirements is key to main-
taining an operational and controlled
infrastructure.
Info
[1] Azure Firewall pricing:
[https:// azure. microsoft. com/ en-us/
pricing/ details/ azure-firewall/ # pricing]
[2] Azure Architecture Center: [https:// learn.
microsoft. com/ en-us/ azure/ architecture/
networking/ architecture/ hub-spoke]
[3] Firewall SKUs:
[https:// learn. microsoft. com/ en-us/ azure/
firewall/ choose-firewall-sku]
[4] Test environment: [https:// github. com/
Azure/ Azure-Network-Security/ tree/
master/ Lab%20Templates]
[5] Manual deployment: [https:// github. com/
Azure/ Azure-Network-Security/ tree/
master/ Lab%20Templates/ Lab%20Tem-
plate%20-%20Azure%20Firewall%20
Forced%20Tunnel%20Lab# readme]
[6] Azure Windows VM troubleshoot-
ing documentation: [https:// learn.
microsoft. com/ en-us/ troubleshoot/
azure/ virtual-machines/ windows/
welcome-virtual-machines-windows]
The Author
Thomas Drilling has been a full-time free-
lance journalist and editor for science and IT
magazines for more than 10 years. He and his
team make contributions on the topics of open
source, Linux, servers, IT administration, and
Mac OS X. Drilling is also a book author and
publisher, advises small and medium-sized en-
terprises as an IT consultant, and lectures on
Linux, open source, and IT security.
Have you ever gone online, per-
haps while on holiday, and received
a notification that your favorite BBC
original comedy isn’t available in
your area? How does the service
know what country you’re in? You
might assume it’s information sent
by the client in an HTTP header,
but in reality, it involves a combina-
tion of allocation records, inference,
and engineering that allows web
applications to assess where an IP
address likely originates (Figure 1).
IP address blocks are distributed by
regional Internet registries (RIRs):
ARIN in North America, RIPE in
Europe, and others worldwide. Each
registry records who owns a block
and for which country it is intended.
If an IP address belongs to a block
registered to a British Internet ser-
vice provider (ISP), for example, it
is reasonable to infer that the traffic
originates from the United Kingdom.
Geolocation databases aggregate RIR
allocation data, ISP documentation,
and historical routing information to
estimate a country ISO code (e.g., US)
and often a subdivision code (e.g., WI
for Wisconsin; Figure 2).
These databases can be queried di-
rectly by modules such as GeoIP2
Python [1] or indirectly through cloud
services. In this article, I describe how
to set up geolocation through a cloud
service. Here, I’ll use Cloudflare as an
example. For other environments, see
your cloud vendor documentation.
Cloudflare relies on proprietary infer-
ence systems to attach geographic
metadata to incoming requests. In
this tutorial, I parse the region code
from Cloudflare’s edge request con-
text object and use it to build a lay-
ered geofence control. Although the
example uses Cloudflare, the same
principles apply to other geolocation
platforms and can be adapted to im-
plement your own geofence policy.
Know Your Geofence
There are many reasons to deploy
a geofenced application. Private
Use geofence technology to isolate your web services from the broader public Internet with customsecurity
rules and worker routes. By Sam Klein
Isolating Cloud Web Services
Passport Check
Photo by Oxana Melis on Unsplash
Figure 1: Cloudflare blocks access to the origin server. Users see this when a geofence
security rules policy blocks access to a website.
the client’s direct source IP. Although
the original client IP can still be for-
warded by headers (which can be
implemented with Apache modifica-
tion to accept reverse proxy headers),
IP-based bans enforced at the origin
are not a substitute for edge-level
geographic controls or account-based
enforcement. IP geolocation is also in-
herently approximate. Users might be
restricted incorrectly if their network
is registered in a different region from
their physical location.
Another limitation is the widespread
availability of virtual private network
(VPN) services and proxy services.
Users can deliberately route traffic
through another geographic region
(e.g., appearing to originate from
Chicago while physically located in
Germany). As a result, geofencing pri-
marily filters low-effort or incidental
access. It should not be expected to
prevent determined circumvention.
Taken together, these limitations
reinforce that geofencing is a coarse-
grained control. It reduces routine
access from specific regions but does
not reliably constrain user behavior
on its own. Where the risk of circum-
vention is unacceptable, geofencing
should be combined with additional
controls such as account verification,
identity checks, or application-level
monitoring. Geographic filtering
might help determine when such ad-
ditional measures are appropriate, but
it should not be relied on as the sole
mechanism of enforcement.
Cloudflare Example
Consider a phpBB (PHP bulletin
board package) forum hosted for a
local club in Wisconsin. The admin-
istrator wants to reduce spam and
unwanted traffic by limiting access
primarily to users within the state.
Open registration is closed, members
are known personally, and SSH access
is restricted to a specific administra-
tive IP address.
The forum is hosted on a Digital-
Ocean Droplet. Each Droplet is as-
signed a static public IPv4 and IPv6
address for its lifetime [2], providing
a stable origin endpoint. A domain
This approach is particularly useful for
operators who want to avoid collecting
or verifying personal information but
still want to run a limited-scope service,
such as a hobby site or community
forum intended for a specific local-
ity or private network. In these cases,
geofencing functions as a proportional
control: It filters routine access without
introducing additional identity verifica-
tion or data collection obligations.
Policy decisions can be enforced at
the network edge without retaining
geographic information beyond what
is necessary to evaluate a request.
Persistently storing IP-based location
data or building user profiles on the
basis of geographic behavior intro-
duces privacy, security, and regula-
tory concerns that extend beyond
simple access control. For this reason,
geofencing mechanisms should be
implemented as stateless controls.
They should evaluate a request, en-
force policy, and discard geographic
context immediately. Logging should
be limited to aggregate operational
metrics rather than per-user geo-
graphic reporting.
Geofencing is a technical control, not
legal advice. Its effectiveness and ac-
ceptability vary by jurisdiction and
evolve over time. Even administrators
of small, non-commercial sites might
eventually be required to implement
more precise compliance mechanisms.
Geofencing should therefore be under-
stood as one tool for reducing opera-
tional exposure, not as a comprehen-
sive or permanent compliance strategy.
Limitations to Geofencing
Depending on how it is implemented,
geofencing can introduce unintended
side effects. It is important to un-
derstand its practical limitations and
failure modes.
One limitation involves shared IP infrastructure: banning users by IP address (e.g., to block registrations) is unreliable because an IP address no longer identifies an individual user's behavior. In a Cloudflare
deployment, incoming requests termi-
nate at Cloudflare’s edge network. At
the TCP layer, the origin server sees
Cloudflare’s IP addresses rather than
organizations might operate region-
specific web assets that are not in-
tended for access outside a defined
locality. Commercial services use geo-
fencing to prevent fraudulent transac-
tions or to control the distribution
of licensed content (e.g., streaming
platforms that offer different catalogs
in different countries). Geofencing
can also be used defensively. Some
national networks employ large-scale
filtering regimes, and some US-based
retailers temporarily blocked Euro-
pean IP addresses after the publica-
tion of the European Union General
Data Protection Regulation (GDPR) to
reduce regulatory uncertainty.
Blocking requests from specific juris-
dictions does not guarantee regulatory
compliance. However, it can reduce
the number of users and interactions
originating from particular geographic
locations, thereby shrinking a ser-
vice’s regulatory footprint while still
requiring administrators to comply
with applicable law.
Figure 2: Geolocation isn’t a fixed attribute
but is inferred from data.
name is configured to proxy through
Cloudflare, so all inbound HTTP traf-
fic first passes through Cloudflare’s
edge before reaching the Droplet.
The following sections consider build-
ing such an example from scratch,
not by going into the details of the application itself, but by examining its network infrastructure. You will see why a domain name is essential to geofencing, how a reverse proxy and firewall rules ensure that user traffic doesn't bypass your geofence policy, and, finally, how to implement the controls specific to the geofence.
Creating a Domain Name
This section explains why a domain name is necessary for your geofence policy with Cloudflare. I start by buying
a domain name from a registrar, of
which you have many to consider:
GoDaddy, Namecheap, etc. You
can use an existing account or any
domain name that Cloudflare accepts.
Subdomains can also be delegated
through name server (NS) record redi-
rection to Cloudflare or through enter-
prise management [3]. For long-term
accessibility, though, it will always be
a better idea to own the root domain.
On creating an account, Cloudflare
asks for your domain name. Cloudflare
needs an address for its reverse
proxy (i.e., a server positioned be-
fore the application) so requests
can be handled and forwarded
for the reliability of the service.
For example, Apache can act as a
proxy for services like the Forgejo
forge and repository. When a cli-
ent types http:// example.com (port
80), Apache forwards the request
to a localhost service on a specific
port number (not on port 80 pub-
licly), which, if you were using the
server as a developer, might appear
in your browser locally as http://
localhost:3000/ example (which is
private; the Internet doesn’t see
port 3000). This approach hides the
internal structure of the application
from the client.
Cloudflare’s reverse proxy does the
same for your IP address so that
when the client looks for example.
com, the DNS address first goes to
Cloudflare and not to the origin ad-
dress (your static IP address); then,
it comes out from Cloudflare’s proxy,
providing a security boundary be-
tween your server and the request.
Because Cloudflare proxies at the
public edge, the origin can still run
its own reverse proxy such as Apache
forwarding requests to a backend ser-
vice like Forgejo.
After you secure a domain name
(preferablya root domain, because
that is the assumed condition on-
ward), you will drop in the external
nameservers Cloudflare provides into
your domain's nameserver list (Figure 3) after enrolling your domain. If you use Cloudflare as a registrar, this
behavior will be the default. These
nameservers are generated for your account [4]. Now go to your Cloudflare account. On the Domain Management page click Onboard a domain (Figure 4). In the field, submit your domain name with a quick scan. You will be given the name servers to put into your records, as shown in Figure 3. Once onboarded, the console will refresh to show that your link is active. (You can see that example.com was added in Figure 4.)
Figure 3: Adding Cloudflare nameservers to a domain registrar.
Figure 4: Onboarding your domain from the Domain Management console.
block traffic at a lower network layer
if possible.
Blocking packets at Layer 3 or 4 (with the DigitalOcean firewall) is a better use of resources than blocking further up at the application layer, saving capacity for legitimate requests (see Figure 5). At the top
layer, the traffic through the Internet
has yet to be processed. Before it
reaches your application (which has
finite resources), DigitalOcean firewall
blocks anything you specify.
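As a sketch, such a DigitalOcean firewall can also be created with the doctl CLI instead of the web console; the firewall name and Droplet ID are placeholders, and only one of Cloudflare's published ranges is shown here – in practice you would list every range from the Cloudflare documentation, as discussed below:

doctl compute firewall create --name cloudflare-only \
  --inbound-rules "protocol:tcp,ports:443,address:173.245.48.0/20" \
  --outbound-rules "protocol:tcp,ports:all,address:0.0.0.0/0" \
  --droplet-ids <droplet-id>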
The last step in this section is to copy
the IP address from the Droplet page
and paste it into your Cloudflare DNS
management console. Return to the
Cloudflare Domain Management con-
sole (Figure 4) and select the desired
domain name. In the left column
navigation bar, select DNS | Records
the Firewalls tab on the Networking
page from the Droplet itself or from
the Management | Firewalls screen.
In the Create Firewall window, you
can add a Name for the firewall (Fig-
ure 7) that is used to differentiate
multiple policies. Adding inbound
and outbound rules is the focus of
this article.
An important aspect of effective
Cloudflare implementation is to set
inbound rules so that only Cloudflare
IPs have direct access to HTTP 80 or
443. You must block all other traffic.
With inbound rules, anything not
explicitly allowed will be considered
blocked, so you will simply add all
the IP addresses from their documen-
tation [5] (Figure 7). Although you
could do this at the application layer
with the operating system, it’s best to
Once your domain has been regis-
tered with Cloudflare and it is using
the correct name servers, you should
assign your domain to a static IP
address.
Creating a Proxy Connection
DigitalOcean is an infrastructure as a
service (IaaS), implementing virtual
servers and storage. Other services
you might use include AWS Elastic
Container Service (ECS), Google
Kubernetes Engine (GKE), VMware
Tanzu Platform, Azure Kubernetes
Service (AKS), and IBM Red Hat
OpenShift on IBM Cloud. Other
smaller platforms can also meet the
requirements, as long as they can
host a virtual private server (VPS)
or virtual dedicated server (VDS),
maintain a static IP address for your
service, and enforce firewall controls.
(Although your application can ac-
complish this task, it’s better to do it
on the network and transport layer,
functioning with IP addresses, ports,
and protocols.)
Now you create a DigitalOcean drop-
let (which contains your VPS) by
clicking the Create button at the top
right of the navigation bar adjacent to
your project and team name. Select
Droplets, then choose whatever speci-
fications you desire: region, datacen-
ter, image, size, CPU options, storage,
backups, SSH key, and hostname. For
the example in this article, you don’t
need to specify any of these options
because you’ll just generate a static IP
address. Once you create the Droplet,
the next window shows its progress.
At any time, if you need the IP ad-
dress, you can find it by the time the
animation finishes or under the Man-
age category in the left navigation
column. Select Droplets to preview
the name, IP address, and time cre-
ated in a table.
Before copying the IP address, create
a firewall policy for the Droplet. At
the Droplets table mentioned in the
previous paragraph, click the name
of your latest Droplet, go to Network-
ing (Figure 6), and select an exist-
ing firewall or create your own by
clicking the Create Firewall button in
Figure 5: Users accessing your web services go through the DigitalOcean network from
the Internet.
Figure 6: Under the Manage category in the left column navigation bar is the Droplets
option. From here you can manage your firewall and see networking information.
and add your Droplet static IP address
to the IPv4 address field for Type A
and Name @ (Figure 8).
Security Policy for
Geofencing
At a high level, enforcement occurs
in two layers (Figure 9). In this ex-
ample, (1) a Cloudflare security rule
blocks all traffic originating outside
the United States, and (2) a Cloud-
flare Worker enforces a second check,
allowing only requests from Wiscon-
sin (region code WI).
Layer 1: Country-Level Block
(Security Rule)
After onboarding the domain in
Cloudflare and updating nameservers
at the registrar, navigate to: Security |
Security rules | Create rule. Here, cre-
ate a rule with the Action value Block
and use the following expression:
(ip.geoip.country ne "US")
and not http.request.uri.path contains U
"/.well-known/acme-challenge/"
This expression blocks all requests
that do not originate from the United
States, while allowing Let’s Encrypt
ACME validation traffic. If certificate
management is handled entirely by
Cloudflare, the ACME exception might
not be required. The result should
look like Figure 10.
The field ip.geoip.country (also avail-
able as ip.src.country) returns the
two-letter ISO country code inferred
by Cloudflare [6] (see Figure 2). Be-
cause the rules on the Security rules tab execute before Workers, requests blocked here never reach the next enforcement layer.
Layer 2: State-Level Enforcement (Worker)
To create a Worker, go to Workers | Manage Workers | Create Application | Start with Hello World, then replace the default code with that in Listing 1.
Cloudflare attaches geographic metadata to each request through the request.cf object. The country field contains the ISO country code, and regionCode contains the state or provincial subdivision when
Figure 7: Adding an inbound or outbound rule can be done in the web console GUI. Select the table element you
want to edit or create a new row. Fill out the Type, which determines what client is expected, the Protocol type
(either TCP or UDP), Port Range, and Sources.
Figure 8: The proxy configuration for a static IP.
Listing 1: Worker Route Expression

export default {
  async fetch(request) {
    const url = new URL(request.url);

    // Allow Let's Encrypt validation
    if (url.pathname.startsWith("/.well-known/acme-challenge/")) {
      return fetch(request);
    }

    const regionCode = request.cf?.regionCode;
    const country = request.cf?.country;

    // Allow only US traffic from Wisconsin
    if (!(country === "US" && regionCode === "WI")) {
      return new Response("Access denied", { status: 403 });
    }

    return fetch(request);
  }
}
[3] Cloudflare CNAME partial setup, no root:
[https:// developers. cloudflare. com/ dns/
zone-setups/ partial-setup/]
[4] Cloudflare nameservers:
[https:// developers. cloudflare. com/ dns/
nameservers/]
[5] Cloudflare IP ranges:
[https:// www. cloudflare. com/ ips/]
[6] Cloudflare ip.src.country:
[https:// developers. cloudflare. com/
ruleset-engine/ rules-language/ fields/
reference/ ip. src. country/]
The Author
Sam Klein is a cybersecurity engineer with
more than six years of experience safeguarding
enterprise infrastructure and shaping resilient
systems at scale. His career spans embedded
Linux platforms, open source development,
and academic collaborations on privacy and
application security. Klein is currently on sab-
batical for full-time parenting while continuing
to contribute to the cybersecurity community.
thoughtfully, it can serve as a practical access control mechanism for small organizations, private communities, and region-specific services that want to limit exposure without expanding their data collection footprint.
Info
[1] GeoIP2 Python:
[https:// geoip2. readthedocs. io/ en/ latest/]
[2] DigitalOcean Droplet static IP address:
[https:// docs. digitalocean. com/ support/
are-my-droplets-ip-addresses-static/]
available. This Worker evaluates both
values. If the request is not from the
United States and Wisconsin, it re-
turns an HTTP 403 response, as seen
in the test panel (Figure 11, right);
otherwise, the request proceeds to the
origin server.
By blocking non-US traffic at the Security rules level, most unwanted requests are dropped before Worker execution. The Worker then applies a more granular regional check. The origin server only receives traffic that has passed both controls.
This design keeps enforcement at the edge, minimizes origin load, and avoids retaining geographic data beyond the evaluation of each request.
Conclusion
Geofencing is often associated with media licensing and streaming restrictions, but its utility extends well beyond entertainment platforms. When implemented
Figure 10: Create a custom rule in the Cloudflare Security console.
Figure 9: Each request goes through different edge layers. Security
rules for the country are checked before the region check, reducing
checks per request.
Figure 11: Custom rules are created in the Cloudflare Security console.
Many vulnerabilities in AWS are not
caused by zero-day attacks but by
configuration errors – from Amazon
Simple Storage Service (S3) buckets
with open write permissions, Elastic
Compute Cloud (EC2) snapshots that
accidentally publish access creden-
tials, or identity and access manage-
ment (IAM) roles without multifac-
tor authentication. The Prowler [1]
open source tool [2] systematically
checks for violations of security stan-
dards and visualizes risks, and it can
be precisely tailored to individual
requirements.
The software is not a black box analy-
sis tool, but a framework for traceable
security audits at the command line
level. The checks are based on best
practices and benchmarks (e.g., from
such organizations as the Center for
Internet Security (CIS), the US Na-
tional Institute of Standards and Tech-
nology (NIST), and Payment Card
Industry Data Security Standard (PCI-
DSS)) and deliver immediately action-
able results for AWS, Azure, Google
Cloud Platform (GCP), Kubernetes,
and Microsoft 365. One focus is on
AWS, where the scope of testing is
greatest and integration with cloud-
native services such as Security Hub
and GuardDuty is most advanced.
Getting Started
If you want to use Prowler locally on
Linux, you need to install it with the
Python package manager (e.g., with pipx on Ubuntu) or with Homebrew:
pipx install prowler
brew install prowler
Alternatively, you can use the Docker
container:
docker run -it U
--rm ghcr.io/prowler-cloud/prowler U
prowler -v
The tool uses existing AWS CLI pro-
files for authentication. To use all
of the checks, the profile requires at
least the SecurityAudit and ViewOnly-
Access managed policies. Additionally,
an inline policy is recommended
to unlock specific read permissions
for non-standard resources. This
extension is found in permissions/
prowler-additions-policy.json in the
official repository.
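For a dedicated audit user, the two managed policies and the inline additions policy could be attached with the AWS CLI roughly as follows (the user name prowler-audit is a placeholder):

aws iam attach-user-policy --user-name prowler-audit \
  --policy-arn arn:aws:iam::aws:policy/SecurityAudit
aws iam attach-user-policy --user-name prowler-audit \
  --policy-arn arn:aws:iam::aws:policy/job-function/ViewOnlyAccess
aws iam put-user-policy --user-name prowler-audit --policy-name prowler-additions \
  --policy-document file://permissions/prowler-additions-policy.json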
Initial Security Scans
The following commands carry out
a basic audit of all regions of an ac-
count with a named profile; the second command creates such a profile with the AWS CLI if it does not exist yet:

prowler aws --profile <profile-name>
aws configure --profile <profile-name>
The tool will prompt you for four things:
• AWS access key ID: the key ID belonging to the IAM user or a role with sufficient authorizations
• AWS secret access key: the matching secret access token
• Default region name: the name of the region (e.g., eu-central-1)
• Default output format: optional information, such as json
The open source Prowler is ideal for systematically checking your AWS infrastructure for vulnerabilities,
meeting compliance requirements, and automatically plugging security gaps. We show you how to use this
tool in a production environment – from initial scan to integration into CI/ CD pipelines, dashboards, and
organization-wide audits. By Thomas Joos
AWS Security Audits with Prowler
Prowling the Depths
Photo by Joseph Northcutt on Unsplash
The profile is saved in the ~/.aws/
credentials or ~/.aws/config file. Next,
call up Prowler as shown before. In ad-
dition to the SecurityAudit and ViewOn-
lyAccess managed policies, the IAM
account requires the inline policy from
the Prowler repository in permissions/
prowler-additions-policy.json to per-
form all the checks. The above call iter-
ates through all the configured checks
and stores the results in the output/ di-
rectory. In addition to CSV and HTML,
Prowler generates standards-compliant
JSON files in OCSF or ASFF format:
prowler aws -M html json-ocsf json-asff
ASFF is used for direct transfer to
AWS Security Hub (Figure 1) – more
on that later. The HTML output pro-
vides a clear overview with filter
options for compliance standards,
prowler aws --list-checks
You can use the next command to
carry out three specific checks that
focus on key aspects of IAM and ACM
security:
prowler aws U
--checks accessanalyzer_enabled U
acm_certificates_expiration_check U
iam_root_mfa_enabled
Prowler checks whether IAM Access
Analyzer is enabled (accessana-
lyzer_enabled), whether any ACM
certificates are close to their expira-
tion dates (acm_certificates_expira-
tion_check), and whether multifactor
authentication has been enabled for
the root account (iam_root_mfa_enabled).
These checks address typical vulner-
abilities in AWS accounts, can be
severity, and affected resources. The
checks can be restricted to individual
services, regions, or test groups.
Selectively Controlling
Checks
To carry out targeted security checks
for the three specified AWS services
(Amazon S3, EC2, and IAM), use the
command:
prowler aws --services s3 ec2 iam
Prowler limits the scan to these ser-
vices and checks their configurations
for security-related vulnerabilities,
policy violations, and potential risks,
which gives you a focused security re-
port without analyzing other services.
You can display a list of all available
checks (Figure 2), for example, with:
Figure 1: Prowler also works with AWS Security Hub.
validated separately, and provide a
focused report on particularly critical
configurations. The process of choos-
ing can be facilitated by a JSON check
file. One file with three security-
related checks that address common
vulnerabilities in AWS environments
could look like:
{
  "checks": [
    "s3_bucket_public_access",
    "ec2_instance_port_ssh_exposed_to_internet",
    "cloudtrail_multi_region_enabled"
  ]
}
This file ensures that Prowler performs three critical checks: s3_bucket_public_access, ec2_instance_port_ssh_exposed_to_internet, and cloudtrail_multi_region_enabled.
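Assuming you save the file as custom-checks.json, recent Prowler releases can read it with the --checks-file option (check prowler aws --help if your version differs):

prowler aws --checks-file custom-checks.json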
For example, within the Security and Observability section, you’ll find lessons on:
• Cloud-native security
• Prometheus monitoring
• Log management and analysis
• Tracing concepts
“The new DevOps Tools Engineer exam covers the most important methodologies and tools
along the entire lifecycle of modern software applications,” says Fabian Thorns, Director of Product
Development at LPI.
The exam consists of 60 questions that must be answered in 90 minutes and is available in
English, with a Japanese language version planned for 2026.
Visit LPI for more details: https://www.lpi.org/.
Power Demands and Complexity Limit AI Deployments, per DDN Report
AI deployments introduce power demands and other challenges that most infrastructure budgets
and facilities were never designed for, says the 2026 State of AI Infrastructure Report from DDN
(https://www.ddn.com/2026-state-of-ai-infrastructure-report/).
“Energy consumption, cooling capacity, and inefficient data movement have become real oper-
ational constraints — often limiting progress long before compute capacity or GPU availability,”
the company says.
Specifically, the report says that:
• 65% of infrastructure sits idle while still consuming power.
• 93% of respondents are actively working to reduce AI’s energy footprint.
• 47% cite energy and cooling as their top inefficiency.
• Only 41% report efficiency gains from recent AI investments.
Complexity in AI infrastructure was cited as another top challenge, as:
• 98% of respondents report a skills gap related to AI infrastructure.
• 65% say their AI environments are already too complex.
• 54% say they have postponed or cancelled AI initiatives.
Read more at DDN: https://www.ddn.com/.
Microsoft Announces Open Source Litebox OS
Microsoft has announced Litebox, an open source “security-focused library OS supporting kernel-
and user-mode execution.”
According to the project page, “LiteBox is a sandboxing library OS that drastically cuts down the
interface to the host, thereby reducing attack surface.”
LiteBox, which is written in Rust and developed under the MIT license, is designed for use in
both kernel and non-kernel scenarios, with example use cases including:
• Running unmodified Linux programs on Windows
• Sandboxing Linux applications on Linux
The LiteBox team is currently working toward a stable release and notes that some APIs and in-
terfaces may change as development continues. Learn more from the GitHub page (https://github.com/
microsoft/litebox).
OpenMP Adds Support for Python
The OpenMP Architecture Review Board (ARB) has created a Python Language Subcommittee to
add Python support to version 7.0 of the OpenMP API specification for parallel programming. This
move will make Python the fourth officially supported language in the specification, alongside C,
C++, and Fortran.
“Adding Python support to the OpenMP standard will provide Python developers with a new way
to express parallelism portably and accelerate Python applications running on CPUs, GPUs, and other
accelerators,” the announcement states (https://www.openmp.org/press-release/python-new-member-anaconda/).
Additionally, the organization notes that Anaconda has joined the OpenMP ARB and will play a key
role in the Python integration.
The OpenMP 7.0 release is planned for 2029, while version 6.1 (https://www.openmp.org/wp-content/
uploads/openmp-TR14.pdf) is expected in November 2026. For more information, visit OpenMP:
https://www.openmp.org/.
SUSE Offers Cloud Sovereignty Framework Self Assessment
SUSE has created a Cloud Sovereignty Framework Self Assessment tool aimed at helping organiza-
tions identify gaps in their digital strategy.
This web-based, self-service assessment tool lets organizations quickly see how their infrastructure
measures up against the 2025 EU Cloud Sovereignty Framework, providing an analysis that includes:
• Overall sovereignty score (0-100%)
• Individual scores per area
• Critical violation warnings
• Prioritized gap analysis
• SUSE solution recommendations
“Most organizations struggle to bridge the gap between policy and production,” says Andreas
Prins, head of Global Sovereign Solutions at SUSE, but the Cloud Sovereignty Framework Self
Assessment “gives words to an abstract principle and gives them verified open source pathways to
make it resilient.”
Features include:
• The SEAL benchmark: Maps the organization to one of five Sovereignty Effective Assurance
Levels (SEAL 0–4). This creates a common language for organizations to discuss risk (e.g., “We
are currently SEAL-1, but our public sector contracts require SEAL-3”).
• Weighted risk analysis: The tool weighs eight sovereignty objectives (SOVs), prioritizing supply
chain and operational autonomy.
• Trust-based engagement: Results are stored only in the user’s browser.
Check out this video walkthrough (https://www.youtube.com/watch?v=c9y0YUHcObE) to see how the
assessment works and learn more at SUSE: https://www.suse.com/.
Open Invention Network Releases OIN 2.0
The Open Invention Network (OIN) has released OIN 2.0 (https://www.openinventionnetwork.com/license-
agreement-2/) — a “significant evolution” of its open source software patent protection program.
With this update, OIN has introduced a shared funding model with a modified, fee-based ap-
proach. Under the new model, participation remains free to individuals and small businesses,
while medium- and large-sized organizations will help support OIN through a tiered, annual fee
based on revenue.
Additionally, OIN has released Linux System Table 13 (https://www.openinventionnetwork.com/linux-
system/), which details the patent protection coverage offered under the OIN 2.0 license agreement.
This update “covers over 650 new open source software packages, including smart technologies,
security, networking, data centers, and automotive. It increases coverage for cloud computing,
including for Kubernetes and Eclipse, and expands coverage for modern languages by adding
many new libraries for Go, Python, and Rust.”
“OIN 2.0 is a continuation of OIN’s long-standing commitment to protect OSS from patent
threats, modified to reflect today’s realities,” said Keith Bergelt, CEO of Open Invention Network.
Learn more at Open Invention Network: https://www.openinventionnetwork.com/.
New Global Open Source Vulnerability Database Launched
The Global CVE (GCVE) initiative has launched a new open and freely accessible vulnerability
advisory database. According to the announcement, “the platform aggregates and correlates
vulnerability information from more than 25 public sources, including GCVE GNA (Numbering
Authority) sources and other established vulnerability databases.”
The GCVE database (https://db.gcve.eu/), which is maintained by the Computer Incident Response
Center Luxembourg (CIRCL), provides a public web interface, a public API (https://db.gcve.eu/api/),
and open data dumps for offline analysis. It also provides compatibility with existing CVEs through
a backward-compatible ID scheme.
The platform is powered by vulnerability-lookup (https://www.vulnerability-lookup.org/), an open source
project also maintained by CIRCL, which implements the Best Current Practices (https://gcve.eu/bcp/)
defined by the GCVE initiative.
By bringing together data from public sources, the GCVE vulnerability database “helps reduce frag-
mentation and improves visibility across the global vulnerability landscape,” the announcement says.
Learn more at GCVE: https://gcve.eu/.
Non-human identities (NHIs) are not
a new phenomenon, but they are rap-
idly becoming increasingly prevalent
and complex. NHIs include identities
for workloads, services, Internet of
Things. The s3_bucket_public_access check detects S3 buckets
that are publicly accessible without
restrictions. This misconfiguration is
one of the most common causes of
data leaks in AWS environments. The
ec2_instance_port_ssh_exposed_to_in-
ternet check looks at whether EC2
instances allow SSH access across
the entire IPv4 space (0.0.0.0/0). An
open port 22 is an inviting gateway
for brute force or exploit attempts.
Finally, cloudtrail_multi_region_en-
abled ensures that AWS CloudTrail
is active in all regions. Without this
setting, security-related activities in
regions that are not used by default
but might still be vulnerable will fly
under the radar. Targeted check pro-
files like this can be called up directly
with the command,
prowler aws --checks-file ./my-checks.json
which gives you targeted results on
particularly security-critical areas with-
out having to run through the entire
check catalog. If you need regularly
recurring checks, you can combine
profiles with a scheduler, such as AWS
Systems Manager or local cron jobs.
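For example, a nightly cron entry along the following lines (the paths and schedule here are purely illustrative) would rerun the profile every day at 3am:

0 3 * * * prowler aws --checks-file /opt/prowler/my-checks.json --output-folder /var/log/prowler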
Restrictions and Profiles
If needed, you can limit the analysis
to individual regions,
prowler aws --profile audit-profile U
-f eu-central-1 us-east-1
which reduces run time and costs,
especially if you want to initiate fur-
ther processing with Security Hub.
For multi-account setups with mul-
tiple CLI profiles, scans can also be
scripted:
for profile in audit-prod U
audit-dev U
audit-test
do
prowler aws U
--profile "$profile" U
--output-folder ./audit-$profile
done
This loop lets you carry out auto-
matic security checks for multiple
AWS accounts by calling Prowler se-
quentially with different CLI profiles.
The three profile names audit-prod,
audit-dev, and audit-test stand for
the different production, develop-
ment, and testing environments.
A separate scan is started for each
profile, and the results are stored
in a dedicated folder named for the
respective profile (e.g., ./audit-audit-prod), which facilitates struc-
tured evaluation and archiving of the
results, especially in multi-account
environments with role-based access
control and separate responsibilities.
The prerequisite is that a correspond-
ing entry exists in the AWS CLI con-
figuration for each profile.
In environments with multiple AWS
accounts, Prowler offers the op-
tion of centrally checking entire
Figure 2: Use the appropriate command to display the available Prowler checks at the command line.
-o prowler.env
docker compose up -d
The GUI is then available from http://
localhost:3000. After logging in, you
can start scans, compare results,
and export compliance reports. The
interface is based on Next.js, and
the back end is based on Django and
PostgreSQL. For production environ-
ments, I recommend a separate de-
ployment including role-based access
control (RBAC) hardening and HTTPS
encryption.
Prowler can be run locally and
deployed as a fully managed solu-
tion. The Prowler Managed Service
automates daily audits across mul-
tiple cloud providers, storing the
results centrally and making them
accessible in a consolidated web
interface (Figure 3) that includes
AWS, Azure, GCP, and Kubernetes
– including compliance evaluations,
risk ratings, and context-related
recommendations.
The service also supports RBAC, API
access, and centralized visualization
in dashboards. In hybrid scenarios,
Managed Service can be synchronized
with local Prowler instances.
master-profile. The --org-role op-
tion lets you specify an IAM role that
can be temporarily assumed by the
subordinate accounts. This role must
be present in all audited accounts
and allow cross-account access.
Prowler stores the results of each
account check in the ./org-audit
directory, structured by account ID.
In this way, you get complete reports for each member account, from which centralized evaluations or targeted security measures can be derived.
Dashboard and Web
Interface
Besides the CLI, Prowler introduced a
locally hostable dashboard in version
5, which you can install with Docker
Compose:
curl U
-LO https://raw.githubusercontent.com/U
prowler-cloud/prowler/refs/heads/U
master/docker-compose.yml
curl U
-L https://raw.githubusercontent.com/U
prowler-cloud/prowler/refs/heads/U
master/.env U
organizational units. If you have a
management account, the scan can be
extended to all subordinate accounts.
This operation requires a central role
with cross-account authorizations
and is available as a CloudFormation
template in the Prowler repository.
The check can then be performed
either sequentially or in parallel,
with the results stored in separate
subdirectories.
A typical scenario is an automated
run with separate report storage per
account and optional transmission to
the Security Hub. For larger organi-
zations with hundreds of accounts,
this method provides a consolidated
security overview without the need
for manual evaluation of individual
profiles. The command
prowler aws --org-role U
arn:aws:iam::111111111111U
:role/ProwlerAuditRole U
--org-master-profile master-profile U
--output-folder ./org-audit
starts an organization-wide secu-
rity audit across all your AWS ac-
counts. The management account
is addressed by the profile called
Figure 3: A security scan can also be initiated from the Prowler web interface.
Customization and
Compliance
Prowler supports extensive custom-
ization at the configuration level,
such as extending the maximum log
retention time:
log_group_retention_days=500
This setting stipulates that Cloud-
Watch log groups be retained for at
least 500 days. The parameters are
stored in a user-defined configuration
file, usually in INI format, which you
pass in at startup:
prowler aws --config-file .

of the hub with each scan, enabling flexible, audit-proof rule management tailored to your requirements.
Prowler in DevOps Workflows
In the continuous integration and
continuous deployment (CI/CD)
context, Prowler can be integrated
as a build step. In conjunction with
the new Prowler Fixer module, auto-
matic remediation is even possible.
A typical CI/CD workflow performs
two consecutive steps. In the first
step, Prowler scans the AWS environ-
ment, executing only the security
checks from the checks.json file. This
file contains a list of check IDs that
you have defined specifically. The
--output-folder ./results parameter
ensures that all scan results, including
CSV, JSON, and HTML reports, are
stored in the results directory.
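Put together, the first step boils down to a single call, using the file and folder names described above:

prowler aws --checks-file ./checks.json --output-folder ./results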
In the second step, the
cat ./results/html-report.html
command outputs the HTML report
directly to the console. It is particu-
larly useful for automated pipelines
for which you want to save the
results as an artifact or pass them
on to downstream steps, which in
conjunction with CI/CD systems
such as GitLab or Jenkins, helps
you map out a continuous security
check process.
A webhook can be used to forward
the results to security information
platforms, such as Splunk or Elastic,
provided the JSON is in the OCSF
format, which saves integration work
and improves traceability.
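How that hand-off looks depends entirely on your platform; as a rough sketch, a pipeline step could POST the OCSF output to a collector endpoint (the file name and URL below are purely illustrative):

curl -X POST -H "Content-Type: application/json" \
  --data @./results/prowler-output.ocsf.json \
  https://siem.example.com/collector/prowler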
In addition to the familiar command
mode, Prowler introduced support for
scanning GitHub repositories for se-
curity risks in version 5. Among other
things, this capability helps you de-
tect publicly accessible secrets, miss-
ing branch protection rules, or unpro-
tected repository settings. Authentica-
tion is handled by personal access
tokens, OAuth, or GitHub app access.
Microsoft 365 environments can now
also be checked, for example, for
inadequate authentication policies or
overly broad access authorizations in
Exchange Online.
Prowler offers a Checkov-based dedi-
cated scan engine for infrastructure
as code (IaC) that helps you analyze
Terraform, CloudFormation, or Ku-
bernetes manifest files before they
even reach the cloud. In this way,
Prowler can integrate with local-only
Kubernetes Checks and EKS Scans
The Prowler command
prowler kubernetes U
--kubeconfig-file ~/.kube/config
analyzes EKS clusters. The focus is on CIS
Benchmark 1.10, including checks for pod
security, network traffic, and API server
permissions and for securing worker nodes.
Vulnerabilities such as runAsRoot, missing
seccomp profiles, or overly open Cluster-
RoleBinding resources are listed along with
recommendations for hardening. A specific
namespace combination can also be scanned
for targeted analyses:
prowler kubernetes U
--namespaces kube-system production
The following job deployment is used for
integration with existing clusters:
kubectl apply -f kubernetes/job.yaml
The dashboard-based visualization of the
EKS results is similar to the AWS evaluation,
optional filtering by namespace, category,
and compliance framework.
detected vulnerabilities appear there
after a few minutes, including a
reference to the affected resources.
Actions can then be derived through
manual annotation or playbooks in
AWS Systems Manager.
Version 5 also saw Prowler introduce
the ability to analyze threat patterns
with AWS CloudTrail. The --cat-
egories threat-detection parameter
lets you enable checks that detect
typical attack indicators in the logs.
Examples include unusual API calls,
sudden privilege escalations, or activ-
ity in inactive regions. The tool evalu-
ates standardized events from the last
24 hours but can be adjusted to other
time periods if necessary. The prereq-
uisite is that CloudTrail is active in
all regions being checked. The results
can be filtered by severity or resource
type and help to identify compro-
mised identities or misused services
at an early stage.
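On the basis of this parameter, such a threat-detection run could be started as follows (add further flags, such as a region restriction, as needed):

prowler aws --categories threat-detection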
Conclusion
Prowler covers a wide range of ap-
plication scenarios as a CLI tool, a
browser-based dashboard, or a man-
aged service for enterprise-wide com-
pliance. With customizable checks,
reporting in standard formats, direct
integration into CI/CD pipelines, and
extensive support for multi-account
environments, the tool offers a com-
bination of flexibility, automation,
and transparency. As such, Prowler
provides a great basis for audit-proof,
traceable, and continuously improv-
able audits, especially for admins
who are responsible for the security
of complex cloud structures.
Info
[1] Prowler homepage: [https://prowler.com]
[2] Prowler on GitHub: [https://github.com/prowler-cloud/prowler]
The Author
Thomas Joos is a freelance IT consultant and
has been working in IT for more than 20 years.
In addition, he writes hands-on books and
papers on Windows and other Microsoft topics.
Online you can meet him on [http://thomasjoos.spaces.live.com].
requests or modify Terraform files.
Organizations with a high degree of
automation will benefit from a fast
feedback cycle between analysis, find-
ings, and validation.
Fixer works directly at the API level;
in other words, it accesses AWS re-
sources directly. By default, though,
only low-risk changes are supported,
such as the aforementioned activa-
tion of GuardDuty, enforcing secure
password rules, or setting missing
CloudTrail parameters. Individual ad-
justments are possible because each
remediation is written in Python and
can be modified as needed.
Without the use of external tools,
then, you can integrate your own se-
curity policies directly into the testing
and hardening process. A combina-
tion with IaC is also in the pipeline.
Instead of making live changes,
you can generate a pull request that
provides the desired changes with
GitOps.
Prowler Meets Security Hub
Prowler can transfer the results of its
security checks directly to AWS Secu-
rity Hub. Security Hub acts as a cen-
tral consolidation and analysis tool
for security-related events in an AWS
organization. Once a scan is com-
plete, the findings can be exported to
AWS Security Finding Format (ASFF)
and automatically transmitted with
the command:
prowler aws --security-hub --status FAIL
Security Hub fields this information
across regions and assigns it to the
respective accounts. As an admin,
you can see at a glance where a
problem was detected, including
the account, region, and resource
information. For multi-account en-
vironments with an organizational
structure, Security Hub provides a
unified interface for prioritizing,
categorizing, and tracking vulner-
abilities. Moreover, alerts can be
automated (e.g., with EventBridge
rules or playbooks for AWS Systems
Manager) to trigger coordinated
responses to critical findings. The
development environments while
safeguarding the shift-left approach in
DevSecOps pipelines.
Automatically Fixing Typical
Misconfigurations
Practical audits regularly reveal the
same misconfigurations, such as:
• S3 buckets with public read permissions
• IAM users without multifactor authentication (MFA)
• EC2 instances with open ports in 0.0.0.0/0
• Missing CloudTrail configuration
• Lambda functions with sensitive environment variables
• Roles with permissions that are far too broad
• Outdated KMS policies
To address these vulnerabilities in a
targeted manner, Prowler can use the
Fixer module to fix selected findings
directly and automatically.
Each supported check can be supple-
mented with predefined remediation
logic. For example, a missing Cloud-
Trail in the us-east-1 region can be de-
tected and immediately resolved with
the command:
prowler aws U
--checks cloudtrail_enabled_multi_region U
--region us-east-1 U
--fixer
In this case, Prowler automatically
creates a new CloudTrail configura-
tion with the recommended settings.
However, this action only works if the
required IAM authorizations are in
place. Anothercommon tactic is acti-
vating GuardDuty, which can also be
automated by the Fixer module:
prowler aws U
--checks guardduty_enabled U
--region eu-central-1 U
--fixer
Prowler checks whether the service is
active and enables it if needed. These
automations can also be executed
with CI/CD support or controlled
by IaC processes. Instead of active
changes, Fixer can generate pull
Keywords: Prowler, AWS, Amazon, vulnerability, compliance, automation,
CI/CD, audit, CLI, reporting
Cybersecurity is not a one-time
investment, but an ongoing budget
item. Attackers are constantly im-
proving their tools, techniques, and
methods, which means defenders
also need to up their detection and
response game and improve security
checks. If you perform manual at-
tack analysis and emulation, you will
realize how expensive, time-consum-
ing, and difficult to repeat this work
can be.
Other articles have covered tools and
knowledge databases from US-based
research institution MITRE. With
Caldera [1], the organization now
promotes a tool that helps you auto-
matically replicate attacker behavior,
allowing you to simulate complex at-
tack chains without the need for a red
team on site. You execute the same
playbook of an attack pattern repeat-
edly to adjust your defenses in real
time and validate their effectiveness.
ATT&CK Framework Basis
Caldera is available as a free open
source platform and enables attacker
emulation exercises with the MITRE
ATT&CK framework [2]. The platform
is a plugin-based framework in which
modular attack steps, known as “abil-
ities,” are grouped into sequences or
“adversaries” that are then executed
by agents on the target computers.
The agents are cross-platform capable
and can be used on Windows, Linux,
and macOS.
Instead of targeting exploits or vul-
nerabilities like other tools, Caldera
targets the behavior of an attacker by
simulating techniques that attackers
use after a compromise, such as privi-
lege escalation, lateral movement, or
the exfiltration of company data. Its
modularity and automation will help
you hone your skills and adapt them
to the existing IT infrastructure.
Setting Up Caldera
To get a feel for how you can use
Caldera productively, I’ll first look at
a straightforward scenario. Of course,
you need a running Caldera instance,
which you can easily set up in the
usual way with Docker: To begin,
clone the current Git repository,
git clone https://github.com/mitre/ U
caldera.git --recursive
then change to the Caldera directory
and run the command
docker build . -t caldera:server
to create the Docker image for later
use. This step takes a while, because
all the dependencies and supplied
plugins for Caldera are either loaded
or generated on the fly.
If the build was successful, which is
the case if you see Successfully tagged
caldera:server as the output, you can
launch the platform with:
docker run U
-p 7010:7010 U
-p 7011:7011/udp U
-p 7012:7012 U
-p 8888:8888 caldera:server
As soon as the Caldera label appears
in ASCII art in the console after start-
up, call http:// localhost:8888 in your
browser to access the login page. The
different access credentials for the red
and blue teams can be found on the
Docker container console without any
further configuration.
Because of the unusual width of
the log output, you cannot simply
Organizations often lack the human and financial resources for red and blue teaming, forcing many admins
to become both the attacker and the defender. The MITRE Caldera cybersecurity platform supports attack
emulation and automates security testing. By Matthias Wübbeling
Emulate Attacks with MITRE Caldera
Volcanic
Photo by Jeferson Argueta on Unsplash
copy and paste the passwords; you
will need to copy both parts of the
password separately. Also note the
instructions that follow the access
credentials (Figure 1).
Emulating Attacks
As a blue team member, imagine you
operate a security information and
event management (SIEM) or an end-
point detection and response (EDR)
system in your organization and want
to test whether an attacker who is suc-
cessful in the first step will be detected
by your monitoring systems as the as-
sailant continues their activities (e.g.,
lateral movement or data exfiltration).
To simulate this situation effectively,
the above-mentioned Caldera agents
now come into play. For the scenario
described, you need to create a
Sandcat agent under agents, which
simulates the attacker’s remote access
tool (RAT) already installed on one
of your servers. After clicking Deploy
an agent, select sandcat and then the
operating system of the machine that
has already been compromised.
In this example, I simply set up a
Linux virtual machine (VM) in my
cluster. Clicking on the penguin icon
opens the documentation relating
you should adjust your detection
logic. You can repeat the same op-
eration as often as you like to check
whether your new rules will work
later as intended. This iterative pro-
cess in Caldera helps you gradually
optimize your SIEM.
Besides the simple scenario looked
at here, Caldera offers many more
possibilities for carrying out attacks
or tests. Red teams, for example, can
investigate and develop new attack
chains, and blue teams can use the
tool for postmortem analyses of simu-
lated incidents.
Conclusion
MITRE Caldera is a proven and well-
equipped open platform for attack
simulation. In this article, I used a
small example to show how to use
Caldera to optimize monitoring. Cal-
dera also offers many other possibili-
ties to facilitate the work of red and
blue teams.
Although Caldera is already quite
mature, don’t expect a miracle solu-
tion. It does not replace the entire
spectrum of red teaming measures,
especially those that focus on social
engineering or zero-day vulnerabili-
ties. On the upside, you will gain a
basic understanding of attack tactics
and the MITRE ATT&CK framework.
Info
[1] Caldera: [https://caldera.mitre.org]
[2] MITRE ATT&CK: [https://attack.mitre.org]
The Author
Dr. Matthias Wübbeling is an IT security en-
thusiast, scientist, author, consultant, and
speaker. As a Lecturer at the University of
Bonn in Germany and Researcher at Fraunhofer
FKIE, he works on projects in network security,
IT security awareness, and protection against
account takeover and identity theft. He is the
CEO of the university spin-off Identeco, which
keeps a leaked identity database to protect
employee and customer accounts against iden-
tity fraud. As a practitioner, he supports the
German Informatics Society (GI), administrat-
ing computer systems and service back ends.
He has published more than 100 articles on IT
security and administration.
to the various installation methods.
The agent uses HTTP to communi-
cate with the framework’s open port
8888 and must not be filtered by the
firewall. The easiest way to start the
agent is with the first command from
the documentation. To do this, you
can execute the following commands
(replace the IP address with that of
the Caldera server in your setup):
server="http://127.0.0.1:8888"
curl -s -X POST -H "file:sandcat.go" U
-H "platform:linux" U
$server/file/download > splunkd
chmod +x splunkd
./splunkd -server $server -group red -v
Of course, this binary does not con-
tain a real Splunk daemon; the agent merely hides behind the name you give the binary. Once the test environment
is ready, you can select predefined
functions for the first test. These
functions represent an attacker’s
individual actions and include com-
mands from genuine attack behavior,
such as searching for files, creating
directories, or exfiltrating data. You
will notice that each ability contains
informationabout a relevant MITRE
ATT&CK technique, which means you
can immediately see the kind of be-
havior being emulated.
For this example, press the Create Op-
eration button under operations at the
top of the page. Assign a name, select Worm as the adversary profile, and click
Start. The agent you created previ-
ously is now the focus of the graphi-
cal SVG view, and various worm
techniques are now being deployed
against your network. Once the op-
eration is complete, Caldera provides
a detailed log of each run. You can
view logs and collected files or export
them in JSON format with the button
at top right.
Testing SIEM
After the run, the all-important ques-
tion now arises: Did your monitoring
system detect the emulated attacks
and warn you appropriately? If not,
you might now have a good clue,
from the MITRE findings, as to how
Keywords: MITRE, Caldera, ATT&CK, security, red, blue, team, attack, simulation, defend, automation
Figure 1: Until you create and apply your own
configuration in the conf/local.yml file,
the access credentials will be regenerated
each time you start.
Bloonix [1] is a user-friendly envi-
ronment for monitoring tasks; it is
capable of monitoring all of your IT
assets, assuming they are accessible
over a network connection. You
do need to install special plugins
to query certain network compo-
nents – but more on that later. The
web-based, modular Bloonix envi-
ronment comes with modules for a
massive crop of popular hardware
and software components and is
fundamentally based on the Simple
Network Management Protocol
(SNMP), although it can also use
other protocols for monitoring.
The first lines of code for Bloonix, de-
veloped by the company of the same
name based in Germany, date back to
2006. The project is available under the
GNU Affero General Public License ver-
sion 3 (AGPLv3), which allows users to
run, modify, and share software while
ensuring that any modified versions are
also made available to the public.
According to the developers, Bloonix’s
server software is highly available
and highly scalable and can be op-
erated on multiple servers for load
balancing. For monitoring tasks, the
tool primarily relies on agents that
are available for popular operat-
ing systems. It has no packages for
desktop systems – not even for ma-
cOS. Bloonix is available as a man-
aged server and as a self-hosted en-
vironment [2]. You can gain an ini-
tial impression of the environment
in the online demo [3]. Free support
is provided by the community [4],
although the extent of this support
is limited. Professional support is
also available (see the “Commercial
Services” box).
Bloonix Architecture
To meet the complex challenges in-
volved in monitoring heterogeneous
environments, Bloonix uses a modular
architecture comprising five compo-
nents: the Bloonix server, a WebGUI,
agents, plugins,
and satellites. At the heart of the sys-
tem is the Bloonix server, which brings
together the various modules. When
this server boots up, it launches vari-
ous process pools, including listeners,
database (DB) managers, Keepalived,
and various checker and scheduler
modules. The Bloonix server usually
Continuous IT monitoring often requires multiple tools, depending on the scope and complexity of the
environment. The Bloonix modular monitoring tool combines numerous services in a single interface. We
show you how to set up and handle monitoring tasks with this free software. By Holger Reibold
Infrastructure monitoring with Bloonix
Guardian
Lead Image © tarokichi, 123RF.com
Commercial Services
Bloonix is commercially available as a man-
aged server or as a self-hosted environment.
Customers who opt for the managed server
option are assigned their own virtual ma-
chine (VM). Daily backups and gateways for
SMS are also available. Prices start at around
EUR60 per month, depending on the number
of virtual CPUs (VCPUs) and the RAM and
hard disk size. Companies that host Bloonix
themselves but do not want to do without
support can choose between different sup-
port options starting at EUR600 per year.
writes its data to PostgreSQL and Redis
databases.
The monitoring environment primar-
ily relies on agents to collect relevant
system metrics on the target hosts –
in particular, CPU utilization, memory
usage, and database services. In
principle, an agent can also run on a
server that Bloonix itself cannot query
directly. From this vantage point, it
can then monitor routers, switches,
and other relevant network services.
For organizations with distributed
locations, Bloonix Satellite offers
monitoring of globally distributed
web services.
The WebGUI is used to control the
environment, which includes manag-
ing hosts, groups, clients, and other
services. The Bloonix environment
also has a plugin mechanism that
more or less includes complex scripts
that query the status of one or more
services. The extensions are usually
installed with the agents, the server,
or the satellites. On Linux, the pl-
ugins are stored in the /usr/lib/bloo-
nix/plugins directory by default.
The listener components field status
information and metrics from the
agents, validate the information, and
store it in the database; additionally,
the server checks whether or not it
needs to notify the administrator. The
DB Manager module is responsible
for all database-specific tasks and
writes the metrics to the PostgreSQL
database. An NGINX web server pre-
pares the data for the web interface.
The Bloonix server is also respon-
sible for checking registered routers,
switches, and services, and it queries
the satellite configuration.
Putting Monitoring into
Operation
When it comes to monitoring, Bloonix
distinguishes between monitoring
hosts and monitoring services. Basi-
cally, the environment prefers to work
with Linux servers. To simplify the
configuration, an agent should already
be installed on the monitored system.
In this kind of scenario, the host con-
figuration can be completed host-side.
To add a first host to your monitoring
executed by the server, the agent, and
the satellite component (Figure 1).
The Plugin World
Bloonix has plugins for external
tests, as well as for Linux, SNMP,
web servers, caching, and database-
specific checks. According to the
documentation, you can choose
from more than 40 plugins. For ex-
ample, to monitor the CPU load of a
Linux server, select check-linux-cpu.
Other useful SNMP checks examine
memory and hard disk usage, as
well as the number of services and
processes. The two most important
external checks are used to monitor
TCP/IP or UDP/IP. However, when
monitoring databases, you are lim-
ited to PostgreSQL and MySQL.
In this context, it is interesting to
see how the server and the agents
interact. After setting up a service, it
initially has an INFO status, because
no monitoring has yet taken place.
The service overview displays the
status information in the column of
the same name (Figure 2). When
monitoring is initiated, the agent es-
tablishes a connection to the server,
authenticates, and continuously trans-
mits the plugin-specific data.
Ideally, the blue INFO message will
change to a green OK. Bloonix recog-
nizes seven different status messages:
OK (exit code 0), INFO (code 0),
NOTICE (purple, 1), WARNING (light
orange, 2), ALERT (pink, 3), CRITI-
CAL (red, 4), and UNKNOWN (dark
orange, 5). Color highlighting in the
WebGUI makes it easy to classify the
messages when skimming through. In
the Services overview, you can also
change the sort order by clicking on
the header, simplifying your analysis
of the output.
When checking services, you are
not forced to rely on agents; you
can also have the checks carried
out by a Bloonixserver or Bloonix
satellites. This service is available
for all checks that can be performed
locally, which occurs when the
Remote check option is set to No.
Nevertheless, remote checks have
advantages; for example, you can
setup, open the Hosts menu (stacked
rectangles), click on the plus sign,
and specify the typical server data
in the corresponding dialog. You can
customize the hostname in the sys-
tem settings (cog wheel) with Bloonix
Server Hostnames. You have two ways
to configure the agent: manually or
with the bloonix-init-host script. For
a manual configuration, you need to
edit the /etc/bloonix/agent/main.conf
and /etc/bloonix/agent/conf.d/host.
conf files and enter the address of the
Bloonix server in the /etc/bloonix/
agent/main.conf file; then, configure
the required settings in the server
section and the corresponding host
parameter:
server {
host 127.0.0.1
host bloonix.server.de
}
Now, save the host ID and password in
/etc/bloonix/agent/conf.d/host.conf:
host {
host_id 01
password
}
For the changes to take effect, you
need to restart the Bloonix agent.
You can use the aforementioned
bloonix-init-host script to automate
the agent configuration, assuming the
agent configuration file has not been
modified:
bloonix-init-host U
--host-id 33 U
--password U
--server bloonix.server.de
Alternatively, you can simply enter
a line reading host 127.0.0.1 in the
server section.
After creating your first host, you can
begin configuring the services to be
monitored. The procedure is similar
to adding hosts: In the Services menu
(three stacked documents), click on
the plus sign and specify the proper-
ties, which includes selecting the
plugins (i.e., the scripts responsible
for the monitoring). The plugins are
assess the quality of a service by
accessing it from the perspective of
different locations. This method is
particularly useful when monitoring
HTTP, IMAP, POP3, and SMTP. If
you implement a satellite configura-
tion, a special dashboard is avail-
able in the WebGUI that allows you
to filter response times by different
locations.
Configuration and
Administration
The WebGUI not only lets you cre-
ate hosts and services, you can also
Figure 2: The Services overview resulting from monitoring shows a variety of useful details.
Figure 1: After commissioning, Bloonix provides information in its web-based dashboard.
bundles the collected information
and lists the various notices and
warnings. Clicking on these visu-
alizations takes you to a detailed
view, where you can initiate any
required action.
Conclusion
Bloonix is designed for monitoring IT
infrastructure components and does so
with flying colors. The only downside
is the lack of autodiscovery, which is
offset by the extensive plugin ecosys-
tem. Professional support is available
in the two commercial versions. You
can also get help for the more complex
configuration steps by reading the ex-
cellent documentation.
Info
[1] Bloonix homepage: [https://www.bloonix.org/en/]
[2] Commercial Bloonix services: [https://www.bloonix.com]
[3] Online demo: [https://demo.bloonix.org/login]
[4] Bloonix forum: [https://community.bloonix.org]
The Author
Holger Reibold is a computer scientist, having
worked as an IT journalist since 1995. Currently,
he works as a key account manager for a Ger-
man ISP. His main interests are open source
tools and security topics.
procedure: You need to save the
script to the directory specified as
the message_service_script_path op-
tion in the Bloonix server configura-
tion. As an example, I used test.py
and saved it in /usr/local/lib/bloo-
nix/message-service (Listing 1).
Next, create a new message service
of the Script type in the WebGUI and
assign a value of %message% to Mes-
sage. For send_to, enter %send_to%
and for foo, enter bar.
When Bloonix triggers an alarm, the
parameters defined in the WebGUI are
transferred to the script in JSON for-
mat by STDIN. The exit code tells the
Bloonix server whether the message
was sent successfully: 0 is a success-
ful transmission and 1 is unsuccess-
ful. You will find the entry for this
option in the /tmp/test.log file.
The targets for notifications are con-
tacts or contact groups. You can as-
sign different numbers of messaging
services to a contact and specify the
notification periods, provided their
content is not critical. The idea of
contact groups is that you can link
contacts to hosts and services, which
gives you precise control over which
contacts are notified in case of a fail-
ure of a specific host or service. In
practice, it is useful to assign at least
one group to each host.
Setting up host and service configu-
rations proves to be very time con-
suming in practice because Bloonix
lacks an autodiscovery function,
although it has an alternative in the
form of a Service Templates func-
tion, thanks to which you can bun-
dle services and service parameters
and apply them to any number of
hosts. When you create new hosts,
the templates are automatically ap-
plied there. To access the template
function, go to Configuration | Tem-
plates. When you get there, you will
find a selection of templates that
deliver standard checks for Apache
or MySQL servers along with ge-
neric Linux checks. You can also
create your own templates in the
WebGUI and assign checks to them.
Variables let you define different
thresholds for outputting warnings
or critical messages. The dashboard
manage host groups, users, and
satellites. Bloonix uses the admin,
operator, and user roles for user ad-
ministration, along with the associ-
ated permissions. If you want to delve
deeper into the specifics, it is worth
taking a look at the configuration of
the various components.
The Bloonix server configuration is
stored in the /etc/bloonix/server/
main.conf file. You can use the web-
gui_domain parameter to specify the
domain for the web interface. The
environment usually manages the
plugins in /usr/lib/bloonix/plugins,
which is also where you store your
development projects. The database
and storage are set up in two configu-
ration files: /etc/bloonix/database/
main.conf and /etc/bloonix/datas-
tore/main.conf.
Bloonix writes the WebGUI configu-
ration to the /etc/bloonix/webgui/
main.conf file, and the Bloonix agent
configuration is located in /etc/bloo-
nix/agent/main.conf. Modifying either
of these is only advisable if special
circumstances dictate this action.
Finally, you can edit the satellite con-
figuration in /etc/bloonix/satellite/
main.conf.
Setting Up Notifications
Monitoring the IT infrastructure
is not really useful if you are not
notified in the event of critical inci-
dents. To prevent this from happen-
ing, the software offers a notifica-
tion function that can communicate
in three ways: Sendmail, HTTP, and
script-based notification output. For
Sendmail, you need a mail transfer
agent (MTA; e.g., Postfix or Exim).
Specifying a valid sender is impor-
tant. With the help of the MTA, you
can decide whether email notifica-
tions are sent by a relay server or by
SMTP. With HTTP-based transmis-
sion, you can forward URL-encoded
or JSON-based data over a corre-
sponding HTTP interface.
If the HTTP and Sendmail variants
do not meet your requirements, you
can opt for script-based notification,
which integrates into the WebGUI.
A simple example illustrates the
Listing 1: Script-Based Notifications
#> cat /usr/local/lib/bloonix/message-service/test.py
#!/usr/bin/python3
import json
import sys

# Read the notification parameters that Bloonix passes in JSON format on STDIN
lines = ""
while True:
    try:
        line = input()
    except EOFError:
        break
    lines += line

param = json.loads(lines)

# Write the received parameters to a logfile for testing
f = open("/tmp/test.log", "w")
f.write(json.dumps(param))
f.close()

# Exit code 0 tells the Bloonix server the message was sent successfully
sys.exit(0)
At first glance, collecting metrics
from IT infrastructure seems straight-
forward: Deploy an agent, configure
some checks, and watch the num-
bers roll in. However, anyone who
has spent time building production
monitoring systems knows that effec-
tive data collection is far from trivial.
The challenge isn’t simply gathering
data – it’s collecting the right data, at
the right intervals, with the right con-
text, all while minimizing the effect
on the systems being monitored.
Technical, policy, and economic
constraints that restrict many en-
vironments are also of concern. As
infrastructure becomes increasingly
complex and security requirements
more stringent, the ability to adapt
monitoring approaches to constrained
environments becomes not just valu-
able, but essential. Zabbix’s flexible ar-
chitecture and support for diverse col-
lection methods make it well-suited for
these challenging scenarios, enabling
comprehensive monitoring even when
circumstances are far from ideal.
Beyond Simple Numbers
When you instrument systems for
monitoring, you’re not just collecting
isolated data points: You’re also cap-
turing the relationships between them
by attempting to capture the behavior
of complex, dynamic systems that
operate continuously across multiple
dimensions. A CPU utilization metric
at 14:23:47 tells you something, but
that single number lacks the context
that makes it actionable. Was this
value typical for that time of day? Is
it trending upward? Did it spike mo-
mentarily or sustain for minutes?
The true value of monitoring data
emerges not from individual measure-
ments, but from the patterns and
relationships that become visible
when you collect data consistently
over time. In this case, measurement
transcends simple observation and
becomes a tool for understanding sys-
tem behavior.
Patterns in Time Series Data
Modern IT infrastructure exhibits
rhythmic behavior. Web applica-
tions see traffic patterns that mirror
human activity – morning rushes,
lunch lulls, evening peaks, and
overnight quiet periods. Database
systems show query patterns tied to
business processes. Backup systems
create predictable load cycles. These
rhythms exist at multiple time scales:
hourly patterns within days, weekly
patterns across months, and seasonal
patterns throughout years.
Effective monitoring systems must
capture these patterns because they
form the baseline against which you
detect anomalies. A database con-
suming 80% CPU might be alarming
at 3am, but perfectly normal during
end-of-month reporting. Without
historical context and pattern recog-
nition, you cannot distinguish be-
tween normal variation and genuine
problems.
Trend analysis adds another dimen-
sion to pattern recognition (Figure 1).
Whereas patterns reveal cyclical
behavior, trends show directional
change over time. Is disk usage grow-
ing linearly, or has growth acceler-
ated? Are response times gradually
degrading? These trends often signal
problems long before they become
critical, enabling proactive interven-
tion rather than reactive firefighting.
Sampling Continuous
Systems
One of the most significant challenges
in monitoring is measuring discrete
snapshots of systems that operate
continuously. Monitoring systems col-
lect data at intervals – perhaps every
Zabbix has emerged as a compelling choice for monitoring restricted
environments over time. By Attila Bartek
Monitoring Constrained Environments
For Good Measure
Lead Image © Konstantin Yuganov, 123RF.com
Why Zabbix?
Given the complexities of monitor-
ing, choosing the right monitoring
platform becomes a critical decision.
The monitoring tool must balance ca-
pability against complexity, flexibility
against maintainability, and power
against ease of deployment. Zabbix
has emerged as a compelling choice
for organizations, from small startups
to large enterprises.
Zabbix is Free and Open Source Soft-
ware (FOSS) licensed under the GNU
General Public License v2. The open
source nature of Zabbix is not merely
a cost consideration – although the
absence of per-host licensing fees
certainly matters at scale. The license
also provides several strategic ad-
vantages that proprietary monitoring
solutions cannot match, including
transparency, community, and inde-
pendence from corporate control.
Zabbix appears in the package reposi-
tories of virtually every major Linux
distribution: Debian, Ubuntu, Red
Hat Enterprise Linux, CentOS, Rocky
Linux, AlmaLinux, SUSE, and count-
less others. This universal availability
significantly reduces deployment
friction. You don’t need to config-
ure third-party repositories, manage
custom package signing keys, or
explain to security teams why you’re
installing software from non-standard
sources.
For organizations with standard-
ized deployment procedures, mature
change management processes, and
strict security requirements, being
able to install Zabbix through native
package managers means infrastruc-
ture monitoring follows the same
deployment, patching, and lifecycle
management processes as all other
metrics rarely tell complete stories.
A memory utilization metric means
something different on a database
server than on a web server. High
disk I/O might indicate a problem
on one system and normal opera-
tion on another. Network throughput
numbers lack meaning without un-
derstanding the application’s require-
ments and typical behavior.
This context dependency means that
effective monitoring requires not just
collecting data, but collecting the
right combination of data points and
understanding their relationships.
You need to know not just that CPU
is high, but also whether it correlates
with increased request rates, whether
memory pressure exists simultane-
ously, and whether response times
have degraded. Single metrics viewed
in isolation can mislead as easily as
they inform.
Building Toward
Understanding
These challenges – pattern recogni-
tion, sampling limitations, observer
effects, and context dependencies –
shape how you should approach
monitoring. Understanding these
fundamental issues helps you make
informed decisions about what to
measure, how frequently to measure
it, and how to interpret the data
collected.
The goal of monitoring isn’t to elimi-
nate all uncertainty or capture every
possible event. Rather, it’s to build
a practical observability framework
that provides sufficient visibility into
system behavior to support opera-
tional decision-making, while remain-
ing sustainable in terms of cost and
complexity.
30 seconds, every minute, or every
five minutes. Between these col-
lection points, an infinite amount
of activity occurs that is never
observed.
Sampling strategies introduce sev-
eral well-known problems. First,
you face the risk of aliasing – miss-
ing important events that occur
between collection intervals. A CPU
spike that lasts 10 seconds will be
invisible if you collect data every 60
seconds and happen to sample dur-
ing the quiet periods before and af-
ter. Critical errors might be logged,
processed, and resolved entirely
within the gaps of the monitoring.
The sampling frequency creates a
fundamental trade-off. More fre-
quent collection provides better
visibility and reduces the chance of
missing transient events. However,
higher collection frequency means
more agent overhead, more network
traffic, more database writes, and
more storage consumption. In large
environments with thousands of
monitored items across hundreds of
hosts, these costs multiply rapidly.
Moreover, the act of measure-
ment itself affects the system be-
ing measured: the observer effect.
Monitoring agents consume CPUcycles, memory, and I/ O bandwidth.
Checking networks generates traf-
fic. Database queries for monitoring
compete with application queries.
At extreme scales, the monitoring
system can become a significant
portion of the infrastructure load it’s
meant to observe.
The Context Problem
Even if you collect data successfully
at appropriate intervals, individual
Figure 1: A long-term graph clearly illustrates the wavering behavior of the filling rate.
infrastructure components. This con-
sistency reduces operational overhead
and simplifies compliance.
Simple Defaults, Complex
Capabilities
One of Zabbix’s most valuable char-
acteristics is its graduated complexity
curve. A default installation provides
immediate utility – basic host moni-
toring, common service checks, and
a functional web interface – without
requiring extensive configuration
or deep expertise. You can have a
working monitoring system for a
handful of servers within an hour of
installation.
This simplicity at entry doesn’t come
at the cost of capability. As require-
ments grow and expertise deepens,
Zabbix scales both technically and
functionally. The same platform that
monitors 10 servers with default tem-
plates can evolve into a sophisticated
monitoring infrastructure handling
thousands of hosts, custom metrics,
complex trigger logic, distributed
collection through proxies, and high-
availability configurations.
This architectural approach solves a
common problem in the selection of
monitoring tools: the tension between
immediate usability and long-term
capability. Tools that are simple to
deploy often lack depth for complex
environments. Tools with enterprise
capabilities often require significant
investment in time before delivering
any value. Zabbix occupies a middle
ground – quick wins early, with a
clear path to sophistication.
The Learning Curve
Advantage
Zabbix’s learning curve is notably
progressive. Initial deployment and
basic monitoring require minimal
expertise to get operational quickly,
just by following documentation and
using the provided templates. As you
work with the system, additional
capabilities become discoverable
organically. Template customization
leads to understanding items and
triggers. Trigger customization leads
to expression syntax. Expression
work leads to calculated items and
dependencies.
From Default to High
Availability
Perhaps the most compelling aspect
of the Zabbix architecture is continu-
ity from simple to complex deploy-
ments. A basic single-server installa-
tion can evolve incrementally toward
enterprise-grade high availability
without fundamental architectural
changes or data migration:
• Do you need to distribute the collection across network segments or geographical locations? Add Zabbix proxies.
• Is your growing data volume straining database performance? Implement database partitioning and optimize retention policies.
• Do you need to eliminate single points of failure? Configure active-passive database clustering and load-balanced Zabbix servers.
• Do you have to meet strict uptime SLAs? Implement a full high-availability architecture with redundant components.
Each of these evolutions represents
architectural enhancement rather
than replacement. The templates, trig-
gers, and configurations developed
on a simple deployment remain valid
and functional in complex high-
availability setups. This continuity
protects operational investment and
reduces the risk of the monitoring in-
frastructure itself becoming a barrier
to growth.
Practical Considerations
From the perspective of security en-
gineering, several practical aspects of
Zabbix deserve mention. The system
supports encrypted agent communica-
tions, privilege separation, and granu-
lar access controls. It integrates with
enterprise authentication systems
through LDAP and SAML. Audit log-
ging tracks configuration changes and
user actions. These features aren’t
afterthoughts – they’re fundamental
design elements that enable Zabbix
deployment in security-conscious
environments.
The data model is well documented
and accessible, enabling integration
with other tools through the API or
direct database access where ap-
propriate. This openness facilitates
building monitoring into broader
operational workflows rather than
creating isolated silos of observabil-
ity data.
Zabbix Internal Structure
To configure and operate Zabbix ef-
fectively, you must understand its
internal data model: how monitoring
targets are represented, how collec-
tion is defined, and how these logi-
cal structures map to data collection
activities. This conceptual framework
shapes every aspect of Zabbix con-
figuration and operation.
In Zabbix terminology, a “host” repre-
sents any entity you want to monitor,
which seems straightforward until
you realize that “host” is a deliber-
ately abstract concept that doesn’t
necessarily correspond to what is
traditionally thought of as a host or
server. Traditional hosts are exactly
what you’d expect: physical servers,
virtual machines, network devices –
discrete systems with IP addresses
that you monitor as unified entities.
A web server is a host. A database
server is a host. A network switch is
a host.
However, hosts that aren’t traditional
systems demonstrate where the ab-
straction becomes powerful. A cloud
service with no single IP address can
be represented as a host collecting
metrics over API calls. A clustered
database system might be monitored
both as individual node hosts and
as a logical cluster host tracking ag-
gregate metrics. Business processes,
external APIs, and distributed
workflows can all be represented as
hosts – each serving as a container
for related monitoring items.
This flexible host concept becomes
powerful once you embrace the ab-
straction: A host is simply a container
for related monitoring items – nothing
more, nothing less.
awk processes structured command
output, extracting specific columns
and performing calculations; sed
transforms and extracts text through
pattern substitution, which is particu-
larly useful for parsing configuration
files; and cut provides simple column
extraction from delimited data. The
real power emerges when combining
these utilities through pipes. A single
command chain can extract, filter,
parse, and aggregate data without
requiring any additional software
installation.
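As a small illustration (the log path and severity keywords are hypothetical), the following chain extracts severity markers from an application log, aggregates them, and sorts them by frequency:

grep -oE "ERROR|WARN|INFO" /var/log/app/application.log | sort | uniq -c | sort -rn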
The Role of Zabbix Sender
The zabbix_sender utility provides the
universal integration point, enabling
any script or utility chain to push col-
lected metrics to the Zabbix server:
Define a trapper item on a host
in Zabbix to receive the collected
data.
Extract the necessary information
from the system with available
utilities.
Deliver the information to Zabbix
with zabbix_sender:
zabbix_sender \
  -z [zabbix_server] \
  -s [hostname] \
  -k [item_key] \
  -o [value]
The process follows this consistent
pattern.
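Put together, a complete collection script can be as small as the following sketch; the host name, item key, and log path are placeholders, and a matching trapper item must already exist in Zabbix:

#!/bin/bash
# Minimal push-style collection script (illustrative names throughout)

# Extract the metric with standard utilities
errors=$(grep -c "ERROR" /var/log/app/application.log)

# Deliver it to the trapper item app.error.count on host app-host
zabbix_sender -z zabbix.example.com -s app-host -k app.error.count -o "$errors"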
Real-World Production
Examples
The following examples reflect real
production challenges encountered
in monitoring SIEM appliances and
other restricted systems. All examples
are custom-tailored because the ap-
pliances require the monitoring of
specific parameters to determine op-
erational bottlenecks.
Monitoring High-Performance
Storage
If you’re managing a server compo-
nent responsible for continuously
storing large amounts of data on a
preconfigured packages that enable
a quick and efficient deployment.
For environments requiring a more
tailored setup, the official Zabbix
website [1] offers a guided selection
tool to help choose the most appro-
priate components for your existing
infrastructure. A complete installa-
tion involves more than just deploy-
ing the zabbix-server package. The
zabbix-frontend-php package must
also be installed and integrated with
a supported web server to provide
the web-based management interface.
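For example, on a Debian- or Ubuntu-based system the two packages might be pulled in as follows (package names and the database back end vary by distribution and repository, so treat this only as a sketch):

# Illustrative Debian/Ubuntu-style installation; adjust package names for
# your distribution and chosen database back end
apt install zabbix-server-mysql zabbix-frontend-php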
Once the web front end is accessible
and the initial login is completed,
additional users can be created, and
the system configuration can begin.
Zabbix allows administrators to de-
fine alerting rules on the basis of con-
figurable thresholds. Numerous pre-
defined triggers are already available
through built-in templates, allowing
rapid implementation of monitoring
policies. Additionally, customizable
dashboards provide consolidated vi-
sual insight into monitored metrics,
helping teams track critical param-
eters and maintain operational visibil-
ity across the infrastructure.
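As an illustration of such a threshold rule – assuming Zabbix 6.x expression syntax and the hypothetical storage-host/disk.tps trapper item used later in this article – a trigger expression can be as simple as:

last(/storage-host/disk.tps)>5000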
Default Utilities in
Restricted Environments
Sometimes the environment is not
friendly to IT security engineers. On
restricted systems where installing
additional components is impossible
because of policy constraints, vendor
limitations, or technical restrictions,
you must work with the tools avail-
able by default. Fortunately, standard
Unix utilities (e.g., awk, sed, grep, cat,
sort, cut, wc) provide powerful data
extraction and processing capabilities
present on virtually every Unix and
Linux system from initial installation.
These utilities aren’t workarounds;
they’re legitimate monitoring tools
that have existed for decades. When
agents cannot be installed and custom
software is prohibited, these standard
utilities become the primary mecha-
nism for extracting monitoring data.
The grep tool filters and pattern-
matches text, making it ideal for log
analysis and counting occurrences;
If hosts represent what you’re moni-
toring, items represent specific metrics
you’re collecting from those hosts.
Items are the atomic units of data col-
lection in Zabbix: Each item collects
one metric.
An item definition includes several
critical attributes. The item key speci-
fies exactly what to collect – for an
agent item, this might be system.
cpu.load[percpu,avg1]. The item type
determines how collection happens:
Zabbix agent, SNMP, IPMI, simple
check, HTTP agent, or other methods.
The update interval controls col-
lection frequency, directly affecting
database load and monitoring respon-
siveness. The value type defines the
data format: numeric, character, log,
or text.
Items can also include preprocess-
ing steps – that is, transformations
applied to collected data before stor-
age, such as unit conversion, regular
expression extraction, or JSON path
parsing.
Host-Item Relationship
The relationship between hosts and
items creates the Zabbix monitoring
hierarchy. A single host might have
hundreds of predefined items: CPU
metrics, memory usage, disk I/ O,
network statistics, application logs,
and custom metrics. This relationship
enables bulk operations (disabling a
host stops all its items), templating
(define items once, apply to many
hosts), and logical organization.
Templates deserve special mention:
They define collections of items, trig-
gers, and monitoring elements that
can be linked to multiple hosts. A
“Linux Server” template might define
50 items for monitoring the operat-
ing system. Link that template to 100
hosts and you’ve defined 5,000 items
through a single template relation-
ship. Change the template, and all
linked hosts inherit the change.
Zabbix to Production
Installing Zabbix Server is rela-
tively straightforward because most
major Linux distributions provide
high-performance local disk, monitor-
ing its performance is crucial, espe-
cially if the component is part of a log
storage and analysis system, such as
a SIEM. Given that this component
often functions as a dedicated ap-
pliance, which limits the ability to
install arbitrary software, the use of
a standalone Zabbix sender to trans-
mit critical performance data to your
monitoring system is a highly effec-
tive solution.
The iostat command provides disk
performance metrics, and with the -Nd
parameter, you
can retrieve de-
vice names with
utilization reports
(Listing 1).
Monitoring Net-
work Filesystem
Performance
When the local
disk reaches its
size limitations,
a common best
practice is to
offload data to
a more cost-
effective storage
device, such as
a network at-
tached storage
(NAS) device.
Because NAS operates over the
network, it’s equally important to
monitor its performance.
Monitoring NFS performance requires
parsing the output of nfsiostat to
extract read and write throughput
data (Figure 2). Because the com-
mand output follows a different two-
line format compared with previous
versions, a workaround is needed to
identify the relevant numbers cor-
rectly, which can be achieved in mul-
tiple ways. The approach used here
is to capture the required line along
with the next one and then exclude
the first line (Listing 2).
Monitoring Critical Network
Connections
For business continuity and to en-
sure no network issues exist on
your side, it’s a best practice to
monitor the TCP connection status
between sensitive components. Un-
fortunately, direct monitoring isn’t
feasible, so you enabled a password-
less SSH connection to allow remote
command execution. This approach
Figure 2: As a result of bandwidth disruption, the previously utilized bandwidth was no longer available.
# Collect iostat output for a specific 10TB device
meter=$(iostat -Nd | grep 10T)
# Extract specific metrics
tps=$(echo "$meter" | awk '{print $2}')   # transactions per second
blkr=$(echo "$meter" | awk '{print $3}')  # read performance (kB/s)
blkw=$(echo "$meter" | awk '{print $4}')  # write performance (kB/s)
# Send to Zabbix
zabbix_sender -z zabbix.example.com -s storage-host -k disk.tps -o "$tps"
zabbix_sender -z zabbix.example.com -s storage-host -k disk.read -o "$blkr"
zabbix_sender -z zabbix.example.com -s storage-host -k disk.write -o "$blkw"
Listing 1: High-Performance Storage
# Extract read throughput (kB/s)
rk=$(/usr/sbin/nfsiostat | grep -A 1 "read" | grep -v "read" | awk '{print $2}')
# Extract write throughput (kB/s)
wk=$(/usr/sbin/nfsiostat | grep -A 1 "write" | grep -v "write" | awk '{print $2}')
# Send to Zabbix
zabbix_sender -z zabbix.example.com -s nfs-client -k nfs.read.throughput -o "$rk"
zabbix_sender -z zabbix.example.com -s nfs-client -k nfs.write.throughput -o "$wk"
Listing 2: Read and Write Throughput
# Identify the Java process
jid=$(jps | grep [TaskIdentifier] | awk '{print $1}')
# Extract heap usage (removing formatting for Zabbix integer requirement)
jused=$(jcmd $jid GC.heap_info | awk 'NR==2 {print $6}' | sed "s/K//g" | sed "s/,//g")
jtotal=$(jcmd $jid GC.heap_info | awk 'NR==2 {print $4}' | sed "s/K//g" | sed "s/,//g")
# Send to Zabbix
zabbix_sender -z zabbix.example.com -s java-host -k java.heap.used -o "$jused"
zabbix_sender -z zabbix.example.com -s java-host -k java.heap.total -o "$jtotal"
Listing 4: Java Heap Status
# Execute netstat remotely via SSH
out=$(ssh user@remote-host -i /home/user/.ssh/id_rsa "netstat -tupn 2>/dev/null")
# Count established connections to important service
established=$(echo "$out" | grep [critical_ip] | grep ESTABLISHED | wc -l)
# Send to Zabbix
zabbix_sender -z zabbix.example.com -s remote-host -k connections.established -o "$established"
Listing 3: TCP Connection Status
[2] Crontab guru: [https://crontab.guru/]
[3] Zabbix community forum: [https://www.zabbix.com/forum/]
Author
Attila Bartek has more than 25 years of experi-
ence as a Cybersecurity Engineer and Advisor.
persistent monitoring agents cannot
be deployed. Crontab guru [2] offers
a better understanding of the crontab
setting and structure.
Although the minimum scheduling
interval for crontab is one minute, you can achieve sub-minute collection
with sleep commands in a crontab file
(Listing 7).
Conclusion
Effective monitoring in restricted en-
vironments requires creativity, a deep
understanding of available system
utilities, and strategic use of tools like
zabbix_sender. Although these ap-
proaches might lack the elegance of
purpose-built monitoring agents, they
provide reliable, policy-compliant
monitoring coverage in environments
where traditional approaches fail.
The examples presented here dem-
onstrate that monitoring isn’t about
having perfect tools – it’s about
working effectively within constraints
while still achieving operational vis-
ibility. By combining standard Unix
utilities, zabbix_sender, and sched-
uled execution through crontab files,
security engineers can build robust
monitoring solutions that respect or-
ganizational policies, vendor limita-
tions, and technical realities.
If you require additional informa-
tion or have outstanding questions,
feel free to reach out to the Zabbix
community forum [3], a trusted
source of practical guidance and
best practices from experienced
professionals.
Info
[1] Zabbix server installation: [https://www.zabbix.com/download/]
monitors established connections to
critical services (Listing 3).
Java Heap Monitoring Without
Agents
An official Java monitoring agent is
available for installation, but some-
times it’s not feasible because of
compliance or technical constraints.
However, monitoring the Java heap
status (Listing 4) is always important
because reaching the maximum heap
size can cause the application to stop
functioning (Figure 3).
Kafka Buffer Monitoring with
Auto-Remediation
An application failing to handle Kafka
buffers properly requires monitoring
with automatic intervention when
thresholds are exceeded (Listing 5).
DNS Resolution Monitoring
After an automated protection system
inadvertently made a critical domain
unresolvable, leading to the failure
of other systems, you can implement
monitoring to detect similar issues
(Listing 6).
Use with Crontab
Crontab provides reliable scheduling
for zabbix_sender collection scripts on
any Unix or Linux system. Universally
available, crontab requires no installa-
tion, and administrators already under-
stand how to use and maintain cron
jobs. A simple crontab entry schedules
your collection script that gathers data
with standard utilities and calls zab-
bix_sender at whatever interval makes
sense for your metrics. This arrange-
ment creates lightweight, scheduled
monitoring that works even in the
most restricted environments where
Figure 3: An issue was encountered once the used heap size (brown) reached the total heap size (green).
Listing 7: Sub-Minute Collection
* * * * * /home/cronscript/myscript.sh             # Run script every minute
* * * * * sleep 30; /home/cronscript/myscript.sh   # Run script every minute, but wait 30 seconds first
Listing 6: DNS Resolution
# Attempt DNS resolution
resolvedIP=$(nslookup api.example.com | awk -F':' '/^Address: / {matched=1} matched {print $2}' | xargs)
# Determine success (1) or failure (0)
[[ -z "$resolvedIP" ]] && result=0 || result=1
# Send to Zabbix
zabbix_sender -z zabbix.example.com -s dns-monitor -k dns.resolution.status -o "$result"
Listing 5: Kafka Buffers
# Monitor partition fill percentage
fill=$(df -h | grep "[kafka_partition]" | awk '{print $5}' | sed 's/%//')
# Define threshold
threshold=85
if [ $fill -ge $threshold ]; then
  echo "Threshold $threshold reached - executing cleanup."
  /usr/local/bin/kafka_cleanup.sh
  zabbix_sender -z zabbix.example.com -s kafka-host -k kafka.cleanup.triggered -o 1
else
  echo "Below threshold - no action needed."
fi
# Always send current fill level
zabbix_sender -z zabbix.example.com -s kafka-host -k kafka.buffer.fill -o "$fill"
The Certificate Enrollment Web
Service was introduced in Windows
Server 2008 R2 to modernize certifi-
cate requests and make them more
flexible. Unlike traditional requests
by Remote Procedure Call (RPC) and
Distributed Component Object Model
(DCOM) protocols, which require a
direct connection to internal network
ports and domain membership, both
Certificate Enrollment Policy (CEP)
web service and Certificate Enroll-
ment Web Service (CES) are imple-
mented on the Simple Object Access
Protocol (SOAP) standard, which
allows certificate requests to be made
over an HTTPS interface, facilitating
the integration of systems that are
not part of the Active Directory (AD)
domain or even reside on remote
networks.
Two Central Services
The CEP web service is based on
X.509 CEP (MS-XCEP) [1] and is used
to provide clients with information
about available certificate templates
and certification authorities. The ser-
vice provides this information over
an HTTPS interface. Authentication
is handled by Kerberos, by a
username/password combination, or
by a client certificate.
In contrast, the CES web service is
based on the WS-Trust X.509v3 Token
Enrollment Protocol (MS-WSTEP) [2] –
a Microsoft-specific implementation of
the OASIS WS-TRUST [3] standard. It
is responsible for requesting the cer-
tificate, which it does by forwarding
certificate signing requests (CSRs) to
the certification authority (CA). As with
CEP, communication takes place over
HTTPS, and authentication is identical
to the CEP protocol.
Managing Certificates with
certmonger
The certmonger tool [4] helps with all
the tasks related to managing X.509
certificates on Linux systems, which
means everything from generation
of private keys, through certificate
requests (CSRs), to automatic renewal
of certificates before they expire.
The cepces [5] plugin lets you use
CEP/ CES to procure a certificate from
AD Certificate Services (CS) and place
it under the control of certmonger. This
function is used by Samba to provision
certificates automatically for clients
(Certificate Auto Enrollment) with a
Group Policy Object (GPO) [6].
To ensure that communication with
AD CS over CEP and CES protocols
works, make sure the Certificate En-
rollment Web Service and Certificate
Enrollment Policy Web Service roles
are installed on an AD system, in ad-
dition to the Certificate Services. If
these roles are not available, you can
discover online how to add the roles
to your existing AD CA [7][8].
Requesting a Certificate
The following example is based on
a current Fedora system, but it also
works on all other Linux systems on
which the certmonger tool and the cep-
ces plugin are available. As usual, the
two packages are installed on Fedora
Microsoft’s Certificate Enrollment Web Service offers an easy way to
obtain X.509 certificates from Active Directory Certificate Services. We
introduce the protocols and investigate how to use the certmonger tool
to issue certificates for Linux systems. By Thorsten Scherf
Linux Meets Windows CA
Bridges
Photo by Sandip Roy on Unsplash
At the end of the day, though, the
combination of CEP, CES, and cert-
monger offers a very useful approach
for automated certificate requests in
heterogeneous environments.
Info
[1] MS-XCEP: [https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-xcep/08ec4475-32c2-457d-8c27-5a176660a210]
[2] MS-WSTEP: [https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-wstep/4766a85d-0d18-4fa1-a51f-e5cb98b752ea]
[3] WS-TRUST: [https://docs.oasis-open.org/ws-sx/ws-trust/v1.4/ws-trust.html]
[4] certmonger: [https://pagure.io/certmonger]
[5] cepces on GitHub: [https://github.com/openSUSE/cepces]
[6] Certificate Auto Enrollment: [https://wiki.samba.org/index.php/Certificate_Auto_Enrollment]
[7] Configuring the Certificate Enrollment Web Service: [https://learn.microsoft.com/en-us/windows-server/identity/ad-cs/configure-certificate-enrollment-web-service]
[8] Configuring the Certificate Enrollment Policy Web Service: [https://learn.microsoft.com/en-us/windows-server/identity/ad-cs/configure-certificate-enrollment-policy-web-service]
[9] realmd: [https://www.freedesktop.org/software/realmd/]
[10] realmd and AD: [https://www.freedesktop.org/software/realmd/docs/guide-active-directory.html]
realm discover win2022-1g7p.test
Make sure you use the server that
has the AD DNS entries as the DNS
resolver [10]. After making sure
this worked, add the system to the
domain:
realm join win2022-1g7p.test
A simple id command lets you
verify that you can query users from
the domain; then finally, test the
authentication:
id Administrator@win2022-1g7p.test
kinit Administrator@win2022-1g7p.test
If everything worked, you can now
manually request the certificate for
your system:
getcert request \
  -c cepces \
  -k /etc/pki/tls/private/machine.key \
  -f /etc/pki/tls/certs/machine.crt
Specify the -c option here to use the
previously installed cepces plugin. If
everything worked, you will see from
the output of getcert list that a certif-
icate was issued, and the system jour-
nal will also display information about
a successful certificate issuance.
You can also use openssl to query the
certificate’s details (Listing 1).
Conclusion
The certmonger tool and cepces
plugin make it very easy to obtain
certificates from an AD CS if the
CEP and CES CA features are avail-
able. Currently, the client must be a
domain member, because certmonger
only supports Kerberos for authen-
tication. However, this situation
could change in future versions of
the tool.
Alternatively, you could check with
curl and openssl
whether addi-
tional wrappers
or manual re-
quests let you log
in with a certifi-
cate or password.
by the dnf package manager from the
distribution’s standard repository:
dnf install certmonger cepces-certmonger
The package manager automatically
adds the Cepces CA plug-in to the
certmonger configuration. To verify
that this install worked, use:
getcert list-cas
[...]
CA 'cepces':
is-default: no
ca-type: EXTERNAL
helper-location: /usr/libexec/certmonger/cepces-submit
In addition to several other plugins,
you should now see a CA named cepces
in the command output. If this entry
does not appear, simply add the plugin
manually by typing the command:
getcert add-ca \
  -c cepces \
  -e '/usr/libexec/certmonger/cepces-submit'
In the /etc/cepces/cepces.conf configuration file, the next step is to enter
the name of the AD system on which you previously installed the CEP and
CES roles; a quick grep confirms the setting:

grep '^server' /etc/cepces/cepces.conf
server=ad1-1g7p.win2022-1g7p.test
Into the Domain with realmd
Before you can request a certificate
from AD CS, you first need to add the
client system to the domain, which
might seem a little surprising because
CEP and CES support different authen-
tication methods. Unfortunately, the
certmonger plugin currently only uses
Kerberos to log in to an AD system.
The easiest way to add the client to
the AD domain is to use the realmd
tool [9]. The package is available for
most Linux distributions. Once the
package is installed on the system,
the first step is to perform a domain
discovery:
Listing 1: Certificate Details
# openssl x509 -in /etc/pki/tls/certs/machine.crt -noout -issuer -subject -dates
issuer=DC=test, DC=win2022-1g7p, CN=win2022-1g7p-AD1-1G7P-CA
subject=CN=client.win2022-yn6a.test
notBefore=May 30 10:18:31 2025 GMT
notAfter=May 30 10:18:31 2026 GMT
The Author
Thorsten Scherf is the
global Product Lead for
Identity Management and
Platform Security in Red
Hat’s Product Operations
group. He is a regular
speaker at various international conferences
and writes a lot about open source software.
Demand for Ethernet as a real-
time control network is growing as
manufacturers and other companies
discover the advantages of a single
network technology throughout the
enterprise (from the office floor
to the factory floor). This kind of
vertical integration offers many
benefits in terms of administration
and support for IT. Lower product
costs combined with the potential
for overlap in training and mainte-
nance costs for information, field,
control, and possibly device net-
works, are expected to reduce costs
significantly.
Ethernet offers many advantages
over existing approaches at the
real-time control level. As a con-
trol network, it offers a bandwidth
of 10Gbps (and higher), which is
almost 1,000 times faster than com-
parable fieldbus networks. However,
distributed applications in control
environments require tight syn-
chronization to guarantee message
delivery within defined cycle times.
Conventional Ethernet and fieldbus
systems are unable to meet the tim-
ing requirements of less than a few
milliseconds, but real-time Industrial
Ethernet enables cycle times of just a
few microseconds.
Ethernet also promises less complex-
ity with all the features required for
a field, control, or device network.
Moreover, Ethernet devices support
TCP/ IP stacks, allowing Ethernet
to connect to the Internet without
problems. This feature is attractive
because it enables remote diagnostics,
control, and monitoring of an indus-
trial network from any device con-
nected to the Internet.
Real-Time Systems
Various organizations like IEEE and
ISO define standards and guidelines
for real-time systems that can vary by
context and application, but real time
generally can be defined as the opera-
tion of a computing system in which
programs for processing incoming
data are constantly ready for immedi-
ate execution, enabling the system
to process data and produce outputs
within a strict, predefined time con-
straint. Depending on the applica-
tion, the data could occur at random
intervals or at predetermined times.
Appropriate hardware and software
must be used to avoid the occurrence
of delays capable of preventing com-
pliance with this condition.
Correct execution of real-time (RT)
systems depends not only on the logi-
cal validity of the data, but also on
its timeliness. Hard real-time (HRT)
systems are those in which faulty
operation can lead to catastrophic
events. Errors can lead to accidents or
even death. Such computers are typi-
cally found in flight or train control
systems. In contrast, soft real-time
(SRT) systems are not as vulnerable.
Although errors are undesirable, they
do not lead to the loss of property or
human life.
The building blocks on which real-
time systems are based are referred
to as “jobs.” Each real-time job is
assigned specific timing parameters:
release time, readiness time, execu-
tion time, response time, and dead-
line. The release time of a job is the
point at which the job is available
to the system. The execution time is
the time required for a job to be fully
processed. The response time is the
period between the release time and
the completion of execution. The readiness
time is the earliest time at which the
The replacement of first-generation fieldbuses with real-time Ethernet creates a single network that extends from the
control level in the office to field devices. We describe the challenges and solutions of various protocols for Industrial
Ethernet with real-time capabilities that currently is not governed by a single uniform standard. By Mathias Hein
Real-Time Industrial Ethernet Protocols
Every Second Counts
Lead Image © Orlando Rosu, 123RF.com
(IoT) devices, machines, and,
increasingly, autonomous artificial
intelligence (AI) applications. Studies
and observations in corporate envi-
ronments show that NHIs exceed the
number of human identities many
times over: Ratios of 40:1 to 80:1
have been reported. Whether or not
these numbers are accurate, clearly
NHIs give rise to an identity and ac-
cess management (IAM) and cyber-
security problem of a considerable
magnitude, creating a variety of
security risks and prompting the need
for automation.
The challenge lies not only in the
sheer numbers. NHIs are often cre-
ated automatically, for example, as
part of continuous integration and
continuous delivery (CI/ CD) pipelines
or through instances of Kubernetes
pods. Their lifespans can range from
a few seconds to several years, and
their privileges range from simple
read access to comprehensive admin-
istrative rights.
The majority of today’s NHIs are
either unknown or work with static
access credentials that do not change
over long periods of time. This com-
bination of opacity and permanent
authorizations creates a massive at-
tack surface that classic strategies
in the area of IAM do not address.
The strategies currently in place only
consider human identities and a small
subset of NHIs – the technical and
functional user accounts managed by
privileged access management (PAM;
i.e., service and system accounts to
be more precise).
Management of Non-Human
Identities
Different terms are sometimes used
synonymously with the umbrella term
“non-human identity management”
for strategies, technologies, and pro-
cesses, and sometimes specific sub-
areas (Table 1).
NHI management encompasses iden-
tifying, creating, governing, and de-
leting these identities, including cre-
dential (authentication information)
management; creating, managing,
and assigning policies; and manag-
ing the resulting risks. The goal is to
enforce basic principles, such as least
Many non-human identities – workloads in the cloud, service accounts in IT systems, autonomous agents
in AI applications – are poorly managed or not managed at all. We present a strategic, holistic approach to
managing these identities. By Martin Kuppinger
Identity for Machines, Workloads, and Agents
Digital Colleagues
Lead Image © ARMMY PICCA, 123RF.com
Table 1: Non-Human Identities
Term | Definition or Focus | Distinction
Non-human identity | Generic term for all digital identities not related to a human being | Also includes workloads, machines, agents
Machine identity | Identity of physical or virtual systems (e.g., servers, IoT devices) | Typically long-term; secured by certificates
Workload identity | Identities for temporary processes (e.g., containers, serverless functions) | Ephemeral, dynamic; token-based
Service account or API identity | Functional accounts for services, pipelines, APIs | Often static, with wide-ranging authorizations
Agentic AI identity | Identity for autonomous AI agents | Contextual, adaptive, goal-oriented
privilege and zero standing privileges,
for NHIs, too.
The diversity of terms in the field of
NHIs also reflects the complexity of
the topic. Although workload iden-
tity often plays a role in the context
of cloud-native architectures and
DevOps, machine identity is more
focused on classic system-to-system
communication (e.g., in the context
of TLS certificates or device certifi-
cates for IoT). Agentic AI identity,
on the other hand, describes a new
class of identities that is increasingly
characterized by autonomous, adap-
tive systems. These identities come
with additional requirements – for
example, with regard to the decision-
making context and the ability to
change over time.
A future-oriented model for NHI
management therefore needs to avoid
being based on fixed typologies and
must instead focus on attribute and
capability descriptions. These descrip-
tions cover, for example, the duration
of an identity’s existence (short-lived
or ephemeral vs. persistent), its origin
(automatically generated vs. manually
created), the degree of autonomy, and
the type of interaction with systems.
An attribute-based approach enables
more flexible governance and pro-
motes a better understanding of how
identities should be treated, regard-
less of how they are labeled.
CIEM
Cloud infrastructure entitlement man-
agement (CIEM) focuses on managing
and analyzing access authorizations
security technologies. For example,
anomalies in access behavior can be
identified by combining CIEM with
identity threat detection and response
(ITDR). Automatically remedying in-
correct or risk-fraught authorizations
leads to an adaptive security model
that can also respond to volatile and
short-term workload identities, which
makes CIEM an indispensable com-
ponent of any modern, risk-oriented
cloud security architecture.
Interfaces to Other
Security Segments
NHI management does not stand
alone. Numerous other segments in
the area of IAM and cybersecurity
overlap or relate to NHI management
(Table 2). These segments need to be
integrated into a holistic identity and
security model, such as an identity
fabric that covers both human and
non-human actors (Figure 1).
Historically, PAM did not focus exclu-
sively on privileged human users. Early
on, PAM also addressed functional
and technical accounts, in particular
shared accounts or service accounts
at the operating system and database
level. These accounts form a subset of
non-human identities because they are
either used automatically or shared by
multiple people with elevated privi-
leges. However, with the expansion of
dynamic, short-lived workload identi-
ties, PAM needs to be rethought and
more closely integrated with modern
NHI management strategies.
Secrets management is a central com-
ponent in dealing with NHI, because
almost every non-human identity
requires authentication information
in the form of secrets. The secure
management, versioning, and rotation
of these secrets is essential to avoid-
ing security risks and vulnerabilities.
However, the isolated use of vault
technologies is not up to this task.
Secrets, including authentication cre-
dentials, are linked to identities, such
as workloads, defined owners of ap-
plications and software components,
and security policies.
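As a sketch of what lifecycle-bound secrets can look like in practice – assuming HashiCorp Vault with a database secrets engine as one possible implementation; the mount path and role name are illustrative, not taken from this article – a workload obtains a short-lived credential instead of a static one:

# Request a dynamic database credential tied to a lease (expires automatically)
vault read database/creds/readonly

# Revoke the lease early if the workload is torn down or compromised
vault lease revoke database/creds/readonly/<lease_id>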
Simply managing different types
of secrets, from SSH keys and SSH
in cloud infrastructures. In other
words, CIEM occupies the space be-
tween NHI management and access
governance. Another term for this
could be non-human access manage-
ment (NHA); in fact, this term would
describe CIEM’s function far
more accurately and make clear the
very close relationships between
NHI and CIEM, although usually
they currently are not implemented
in business practice.
CIEM tools analyze which NHIs have
access to which resources, whether
these authorizations are too exten-
sive, and whether principles such
as least privilege are being violated.
CIEM therefore provides the required
counterbalance to identity manage-
ment: Where NHI management is
responsible for managing identity and
associated credentials, CIEM consid-
ers access to specific resources in the
cloud.
Especially in dynamic cloud environ-
ments with infrastructure as code
and automated provisioning, it is
nearly impossible for companies to
keep track manually of all access
authorizations for NHIs. CIEM tools
enable transparency here through
continuous analysis and visualization
of entitlement structures, identifying
overprivileged roles, and recommend-
ing optimizations on the basis of
usage patterns, which is an essential
step toward implementing the least
privilege principle in complex cloud
landscapes.
Additionally, modern CIEM ap-
proaches increasingly offer integrated
functions for correlation with other
job can be executed (always greater
than or equal to the release time).
The deadline is the time by which the
execution must be completed.
All real-time systems exhibit a certain
degree of jitter (i.e., a deviation from
the actual timing of the aforemen-
tioned times). In a real-time system,
the jitter should be measurable within
a defined interval so that system per-
formance can still be guaranteed.
Ethernet Without Collisions
Ethernet is a non-deterministic net-
work protocol and therefore inher-
ently unsuitable for hard real-time
applications. The Carrier Sense Mul-
tiple Access with Collision Detection
(CSMA/ CD) media access control
protocol specified in the IEEE 802.3
standard, with its binary, exponential
backoff algorithm, does not enable
the network to support hard real-time
communication, because it includes
random delays and allows for the pos-
sibility of transmission errors.
With the CSMA/ CD mechanism, each
node detects whether another node is
transmitting on the medium (carrier
sense). If the carrier sense function
is active on a node, it delays trans-
mission until it determines that the
medium is free. Whenever two nodes
transmit simultaneously (multiple
access), a collision occurs in the net-
work, and all packets become invalid.
The nodes can detect collisions by
monitoring the collision signal pro-
vided by the bit transmission layer. If
a collision occurs, the node sends a
corresponding notification.
When a node begins transmission on
the medium, a specific time interval,
known as the collision window, takes
place, during which a collision can
occur. This window is large enough
for the signal to propagate throughout
the entire network segment. Once this
time window has expired, all (func-
tioning) nodes should have their car-
rier detection enabled and therefore
not attempt to begin transmission.
If a collision occurs, the backoff algo-
rithm is applied to each colliding node.
One advantage of this algorithm is
that it controls the use of the medium.
switches. These devices can isolate
collision domains by segmenting the
network, as each device connection
is configured as a single collision
domain, which means full-duplex
switches in combination with full-
duplex-capable nodes can eliminate
collisions in all segments.
The IEEE 802.1Q standard provides
for the required quality of service
(QoS) at the media access control
(MAC) level and defines how these
switches can handle prioritization.
An 802.1Q implementation has
certain advantages for real-time
Industrial Ethernet applications: It
introduces standardized prioritiza-
tion on Ethernet and enables control
engineers to implement up to eight
different user-defined priority levels
for their data traffic.
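On Linux, for instance, these 802.1p priority bits can be set per VLAN with iproute2; the interface name, VLAN ID, and mapping below are illustrative assumptions:

# Map kernel skb priority 0 to 802.1p priority 5 on a new VLAN interface
ip link add link eth0 name eth0.100 type vlan id 100 egress-qos-map 0:5
ip link set dev eth0.100 up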
To classify real-time capability, implementations are divided into three
classes relative to the OSI model: class 1 operates above the Transport
Layer, class 2 above the Ethernet Layer, and class 3 modifies the
Ethernet Layer itself (Figure 1).
In class 1, the entire protocol stack is
retained, preserving full compatibility
with conventional Ethernet up to the
Application Layer. Well-known imple-
mentations of this class are Modbus/
TCP, P-NET, JetSync, EtherNet/ IP
with CIP Sync, and Foundation Field-
bus High Speed Ethernet (HSE).
EtherNet/ IP
EtherNet/ IP (EIP, where IP stands for
Industrial Protocol) is an open Ap-
plication Layer protocol that is based
on the existing IEEE 802.3 Physical/
Data Layers and TCP/ UDP/ IP, which
ensures interoperability with most
information layer networks. EIP offers
real-time performance if strict guide-
lines are followed but is not determin-
istic. It uses the open, object-oriented
Control and Information Protocol
(CIP) as its Application Layer – the
same Layers 5 through 7 as DeviceNet
and ControlNet, providing full in-
teroperability with those networks.
CIP is a flexible and scalable au-
tomation protocol, well suited for
distributed systems, and features
object orientation, electronic data
If the medium is heavily loaded, the
probability of collisions increases, and
the algorithm increases the interval
from which the random delay time
is selected. This step is intended to
reduce the load and avoid further col-
lisions. However, Ethernet’s CSMA/ CD
algorithm can result in complete trans-
mission failure and the possibility of a
random transmission time, making the
protocol non-deterministic, especially
in heavily loaded networks.
That said, Ethernet is only non-deter-
ministic when collisions can occur. To
implement a fully deterministic Eth-
ernet, all collisions must be avoided.
A collision domain is a CSMA/ CD
segment in which simultaneous trans-
missions can lead to a collision. The
probability of collision increases with
the number of nodes transmitting in a
single collision domain.
Switched Ethernet
in the IoT
When Ethernet was standardized,
all communication was based on a
half-duplex transmission mechanism,
wherein a node can either send or
receive, but not do both at the same
time. Nodes that share a half-duplex
connection operate in the same col-
lision domain, which means these
nodes compete for bus access and
their packets can collide with other
packets on the network. With full-
duplex, a node can send and receive
simultaneously, and a maximum of
two nodes can be connected to it,
which is usually a node-to-switch or
switch-to-switch configuration where
each network node has its own colli-
sion domain. This method completely
avoids collisions. Because full-duplex
connections can serve a maximum of
two nodes per connection, this tech-
nology is not practical without the
use of fast switches.
The most common method of colli-
sion avoidance is the introduction of
individual collision domains for each
node, because this guarantees the
node sole use of the medium, elimi-
nating access conflicts. This system is
achieved by implementing full-duplex
connections and hardware such as
sheets, and device profiles. EIP
with CIP is not a real-time protocol.
To achieve RT for EIP, CIP Sync (a
high-speed CIP synchronization
solution) is used. With 100Mbps
switched Ethernet, it achieves a syn-
chronization accuracy of better than
500ns between devices, although jit-
ter caused by the protocol stack still
poses a problem.
EIP uses both TCP and UDP with IP
for communication. If a connection-
oriented exchange is preferred (e.g.,
during initialization), it uses TCP
(Explicit Messaging). Explicit Mes-
saging contains protocol and service
information but has no strict timing
requirements; it is therefore perfectly
okay to use the slower but guaranteed
TCP protocol. For RT traffic, EIP uses
the unicast and multicast capabilities
of UDP to implement the producer-
consumer model of communication,
which is popular in control applica-
tions. Implicit messages do not con-
tain commands, only data. The mean-
ing of this data is configured during
initialization, which reduces runtime
processing in the nodes. Network
collisions are avoided by switches,
whereas EIP generally operates in a
star topology. One variant uses virtual
local area networks (VLANs) and
places all devices that exchange time-
critical data on the same VLAN.
Foundation Fieldbus HSE
The starting point for Foundation
Fieldbus HSE (Figure 2) is Founda-
tion Fieldbus H1, introduced in 1995,
with a transmission rate of 31.25Kbps
and identical bus physics to Profibus
PA (process automation) in accor-
dance with IEC 61158-2. Because the
transmission rate is very low, a faster
Figure 1: Overview of real-time capability classes.
Figure 2: On a Foundation Fieldbus network, linking devices (e.g., by ABB) connect systems
that communicate over Ethernet.
switch. The minions have an inte-
grated memory of between 2 bits and
64KB. They look like a single device
to the Ethernet, although in real-
ity they can comprise up to 65,535
devices configured in an open ring
topology with the Ethernet interface
at the open end. The manager sends
commands to the MAC address of the
first device. When the signal reaches
the Ethernet-minion interface, it is
converted to eBus specifications (if
eBus is used) and forwarded.
The fieldbus memory management unit
(FMMU) of each configurable min-
ion converts a logical address into a
physical address; this information is
available to the manager during ini-
tialization, which is why each minion
requires a special application-specific
integrated circuit (ASIC). When a min-
ion receives a datagram, it determines
whether it is being addressed and
then forwards the data to or from the
datagram, resulting in a delay of a few
nanoseconds. EtherCAT is therefore
a fast real-time Ethernet and is deter-
ministic when not used with UDP/
IP or between managers and minions
connected by switches or routers.
Ethernet Powerlink
Ethernet Powerlink (EPL) is a hard RT
protocol that is based on Fast Ether-
net. EPL devices use standard Ether-
net hardware without special ASICs.
EPL can deliver a cycle time of 200µs
with jitter of less than 1µs. EPL uses
cyclic communication with time slot
allocation and the manager-minion
model. One manager is
allowed per network. This manager
schedules all transmissions and is the
only active station; the minions trans-
mit on demand.
The EPL cycle comprises four sec-
tions. During the start period, the EPL
manager sends the start-of-cyclic (SoC)
frame, which synchronizes the min-
ions. The timing of this frame is the
only time base for network synchro-
nization; all other frames are purely
event-driven. The SoC is followed by
the cyclic period, when the manager
polls each station with a poll request
frame. At this point, the minion
can process 1,000 I/ O in 30ms, but
requires a full-duplex transmission
mechanism of copper or fiber optic
cables. EtherCAT is based on the
manager-minion principle and can
interact with normal TCP/ IP and
other Ethernet-based networks such
as EIP or Profinet. It also supports
any Ethernet topology, including bus.
The EtherCAT manager processes
the RT data with dedicated hardware
and software. The manager priori-
tizes EtherCAT frames over normal
Ethernet traffic and controls traffic
by initiating all transmissions. The
datagrams (data packets between
manager and minions) are standard
Ethernet packets, where the data field
encapsulates the EtherCAT frame (an
EtherCAT header and one or more
EtherCAT commands). Each com-
mand contains a header, data, and a
working counter field. Each Ethernet
datagram can contain many Ether-
CAT commands, resulting in higher
bandwidth and more efficient use of
the large Ethernet data field size and
header. The standard Ethernet cyclic
redundancy check (CRC) is used to
verify the correctness of the message.
The EtherCAT manager completely
controls its minions. Its commands
only trigger responses; the minions
do not initiate transmissions. The
two EtherCAT communication meth-
ods used are EtherType or UDP/
IP encapsulation. The EtherType
implementation does not use IP,
which limits EtherCAT traffic to the
original subnet. Encapsulating com-
mands with UDP/ IP lets EtherCAT
frames traverse subnets but has dis-
advantages. The UDP/ IP header adds
28 bytes to the Ethernet frame and
undermines RT performance with its
non-deterministic stack.
EtherCAT minions range from intel-
ligent nodes to 2-bit I/ O modules and
are networked by 100BASE-TX, fiber
optic cable, or eBus. eBus is a Physi-
cal Layer of EtherCAT for Ethernet
that provides a low-voltage differen-
tial signal (LVDS) scheme. Minions
are hot-pluggable in any topology
of branches or stub lines. Multiple
minion rings can exist on a single
network if they are connected by a
H2 bus was initially considered for
communication at the control level.
However, because of the widespread
use of Ethernet in this area, develop-
ment was discontinued at an early
stage, and the development of Foun-
dation Fieldbus HSE was initiated.
A mixture of tree and bus topology
can be used with the H1 fieldbus.
Communication takes place by man-
ager-minion access or a deterministic
token passing procedure. In specifica-
tion 1.2, a maximum of 32 devices
can be located on an H1 subnet, in
which non-deterministic communica-
tion is not permitted. A linking device
uses a bridge to connect several H1
subnets and form an HSE network
on which conventional 100Mbps
switches operate. Because HSE is still
based on standard Ethernet with a
superimposed TCP/ UDP/ IP protocol
stack, it is not real-time-capable itself.
However, the development of another
real-time-capable network was not
the focus of the Fieldbus Foundation.
At the application level of the H1
fieldbus, much like EtherNet/ IP, a
function block model already existed
for managing reusable hardware and
software components of the auto-
mated facility. The function block is
standardized according to IEC 61131
and interacts with other function
blocks by way of I/ O variables. Here,
too, are gateways to other fieldbuses
(third-party I/ O gateways). The goal
of the Fieldbus Foundation was to
transfer this function block model
to the HSE level and use the same
object model. In this way, the bridge
between the two buses appears
transparent. The user on the Ether-
net side has the impression of being
able to access all H1 devices directly
and equally. Multiple H1 buses can
exchange non-time-critical manage-
ment, diagnostic, and configuration
data with each other over the HSE
bridges.
EtherCAT
The EtherCAT (Ethernet for Control
Automation Technology) protocol is
a real-time motion control concept
defined in IEC standard 61158. It
responds with a poll response frame
containing data, avoiding collisions.
The minion sends its response to all
devices, enabling communication be-
tween the minions. After successfully
polling all minions, the manager sends
the end-of-cyclic frame, which informs
each minion that cyclic traffic has
been completed correctly.
The asynchronous section allows
non-cyclic data transfers under the
control of the manager. To transmit
during this period, a minion must
have informed the manager in its poll
response during the cyclic period.
The manager creates a list of wait-
ing minions and uses a scheduler to
ensure that no transmission request is
delayed indefinitely. Standard IP data-
grams can be transmitted during the
asynchronous period.
EPL does not use switches to avoid
collisions or ensure network syn-
chronization – this responsibility
is controlled by the manager. EPL
networks can be based on standard
hubs, the recommendation being that
each device contain a hub to facilitate
bus implementation. Switches are not
prohibited, but they add jitter and
reduce determinism. Because the EPL
network avoids collisions through
time-controlled bus access, up to 10
hubs can cascade.
Currently, EPL devices that require
RT communication cannot coexist in
the same segment as non-RT Ethernet
devices. However, EPL devices can be
operated like normal Ethernet hard-
ware. In protected mode, the real-time
segment must be separated from
normal traffic by a switch or router.
In open mode, RT traffic shares the
segment with normal traffic, but real-
time communication is impaired.
Profinet
Profinet is a fieldbus standard for dis-
tributed automation systems. It uses
object orientation and existing IT
standards (TCP/ IP, Ethernet, XML).
Profinet is based on IEEE 802.3, is in-
teroperable with TCP/ IP and therefore
with Ethernet, and is compatible with
Profibus-DP (decentralized peripher-
als). Profinet v1 has a response time
of 10 to 100ms (Figure 3).
In contrast, Profinet-SRT (soft real-
time) with a cycle time of 5 to 10ms is
designed to work in factory automa-
tion and to implement real-time ex-
clusively in the software. It uses TCP/
IP and its own software channel for
RT communication. Profinet-IRT (iso-
chronous RT) introduces a hard RT el-
ement into the Profinet protocols. The
three Profinet protocols enable differ-
ent degrees of real-time performance.
Profinet-IRT supports systems that
require synchronization in the sub-
microsecond range, typically high-
performance motion control systems.
The benchmark for such a system is
a millisecond cycle time, microsecond
jitter accuracy, and guaranteed deter-
minism; IRT meets all three criteria.
However, because the software causes
jitter of greater than 1ms, IRT (unlike
SRT) is implemented in hardware with
synchronized Ethernet nodes. With
the use of full-duplex Fast Ethernet,
the communication cycle is divided
into an open standard TCP/ IP channel
and a deterministic RT channel. Each
Profinet-IRT device has a special ASIC
for handling node synchronization and
cycle division and includes an intel-
ligent two- or four-port switch.
The Profinet switch in each node con-
tains a bus access schedule and can
process RT and non-RT traffic. This
bus prioritizes real-time traffic and
provides full-duplex connections for
all ports. Classic switches add jitter,
which affects determinism. Profinet
switches minimize jitter so that it
has a negligible effect. The Profinet
communication model enables the
coexistence of RT and non-RT traf-
fic in a network without additional
precautions.
Conclusion
Currently no uniform standard for
automation technology has been
determined for Industrial Ethernet
with real-time capabilities. The IEC
61784-2 standard specifies at least
10 different, and mostly incompat-
ible, technical solutions. In practice,
though, no technical reason demands
that so many different real-time Ether-
net implementations should be main-
tained. Pressure from users likely will
lead to a reduction in these numbers
in the medium term, with the market
deciding which candidates best meet
the requirements of the respective
automation applications.
Keywords: Ethernet, real time, RT, protocol, frame, layer, EIP, fieldbus,
HSE, EtherCAT, Powerlink, EPL, Profinet, SRT, IRT
Figure 3: Profinet occupies its own area in the data packet of an Ethernet frame (FCS, frame check sequence).
The Author
Mathias Hein is a freelance
IT consultant and technical
writer with more than
40 years of professional
experience in the field of
networking. He also serves
as an adjunct instructor at several universities.
As a trainer and speaker at technical seminars, he
shares his expertise in the areas of switching,
TCP/IP, Voice over IP, Carrier Ethernet, and network
management. As an author of technical books and
articles in relevant trade journals, Hein regularly
contributes to the dissemination of knowledge.
ADMIN is your source for technical solutions to real-world problems. Every issue
is packed with practical articles on the topics you need, such as: security, cloud
computing, DevOps, HPC, storage, and more! Explore our full catalog of back
issues for specific topics or to complete your collection.
#87 – May/June 2025
Lightweight Kubernetes
K3s, k0s, and MicroK8s vie for performance honors on the control plane and data plane in
artificially created extreme stress scenarios.
On the DVD: AlmaLinux 9.5 Minimal
#89 – September/October 2025
Automation
Optimize, automate, and manage workflows and processes in your data center.
• Microsoft Power Automate
• Ansible Automation Platform
On the DVD: IPFire 2.29 Core Update 196
#91 – January/February 2026
AI in the Enterprise
New tools bring the power of artificial intelligence and machine learning to the
corporate world.
On the DVD: Fedora Server 43
NEWSSTAND
ADMIN
Network & Security
#90 – November/December 2025
VoIP Network Security
Session Initiation Protocol provides both the underpinnings for VoIP and a potential attack
vector for hackers. Open source tools can help you test and secure your VoIP networks.
On the DVD: Ubuntu 25.10 Server
Order online:
bit.ly/ADMIN-Library
#88 – July/August 2025
5 Network Admin Distros
Admin distros take both workstations and servers into account, have broad support for
various filesystems, deploy in heterogeneous environments without restrictions, and come
with the necessary tool collections.
On the DVD: openSUSE Leap 15.6
#86 – March/April 2025
Data Obfuscation
Generalization, suppression, perturbation, and differential privacy are essential data
protection techniques that enable a balance between data security and usability and
ensure compliance with legal requirements.
On the DVD: Rocky Linux 9.5 Minimal
Admin: Network and Security is
looking for good, practical articles on
system administration topics. We
love to hear from IT professionals
who have discovered innovative tools
or techniques for solving real-world
problems.
Tell us about your favorite:
• Interoperability solutions
• Practical tools for cloud
environments
• Security problems and how you
solved them
• Ingenious custom scripts
• Unheralded open source utilities
• Windows networking techniques
that aren’t explained (or aren’t
explained well) in the standard
documentation
We need concrete, fully developed solu-
tions: installation steps, configuration
files, examples – we are looking for a
complete discussion, not just a “hot tip”
that leaves the details to the reader.
If you have an idea for an article, send
a 1-2 paragraph proposal describing
your topic to:
edit@admin-magazine. com.
WRITE FOR US
Authors
Amber Ankerholz 6
Attila Bartek 82
Thomas Drilling 54
Mathias Hein 42, 90
Ken Hess 3
Thomas Joos 32, 68
Samuel Klein 62
Martin Kuppinger 12
Martin Gerhard Loschwitz 18
Paolo Mulas 36
Marius Quabeck 26
Dr. Holger Reibold 78
Thorsten Scherf 88
Henner Schmidt 48
Max Werner 48
Matthias Wübbeling 76
Contact Info
Editor in Chief
Joe Casad, jcasad@linuxnewmedia.com
Managing Editors
Rita L Sooby, rsooby@linuxnewmedia.com
Lori White, lwhite@linuxnewmedia.com
Senior Editor
Ken Hess
Localization & Translation
Ian Travis
News Editor
Amber Ankerholz
Copy Editors
Amy Pettle, Aubrey Vaughn
Layout
Dena Friesen, Lori White
Cover Design
Lori White, Illustration based on graphics by
kgtoh 123RF.com
Advertising
Brian Osborn, bosborn@linuxnewmedia.com
Publisher
Brian Osborn
Marketing Communications
Gwen Clark, gclark@linuxnewmedia.com
Linux New Media USA, LLC
4840 Bob Billings Parkway, Ste 104
Lawrence, KS 66049 USA
Customer Service / Subscription
For USA and Canada:
Email: cs@linuxnewmedia.com
Phone: 1-785-856-3080
For all other countries:
Email: subs@linuxnewmedia.com
www.admin-magazine.com
While every care has been taken in the content of
the magazine, the publishers cannot be held re-
sponsible for the accuracy of the information con-
tained within it or any consequences arising from
the use of it. The use of the DVD provided with the
magazine or any material provided on it is at your
own risk.
Copyright and Trademarks © 2026 Linux New
Media USA, LLC.
No material may be reproduced in any form
whatsoever in whole or in part without the writ-
ten permission of the publishers. It is assumed
that all correspondence sent, for example, let-
ters, email, faxes, photographs, articles, draw-
ings, are supplied for publication or license to
third parties on a non-exclusive worldwide
basis by Linux New Media unless otherwise
stated in writing.
All brand or product names are trademarks
of their respective owners. Contact us if we
haven’t credited your copyright; we will always
correct any oversight.
Printed in Nuremberg, Germany by be1druckt GmbH.
Distributed by Seymour Distribution Ltd, United
Kingdom
ADMIN (Print ISSN: 2045-0702, Online ISSN: 2831-
9583, USPS No: 347-931) is published bimonthly by
Linux New Media USA, LLC, and distributed in the
USA by Asendia USA, 701 Ashland Ave, Folcroft PA.
March/April 2026. Application to Mail at
Periodicals Postage Prices is pending at
Philadelphia, PA and additional mailing offices.
POSTMASTER: send address changes to ADMIN,
4840 Bob Billings Parkway, Ste 104, Lawrence,
KS 66049, USA.
Represented in Europe and other territories by:
Sparkhaus Media GmbH, Bialasstr. 1a, 85625
Glonn, Germany.
BE THE FIRST TO SEE WHAT'S NEXT
Subscribe free to the ADMIN
Preview newsletter and get a
sneak peek at every article
included in the next issue of
ADMIN.
Sign up today at https://bit.ly/admin-preview
Next Issue Preview
ADMIN 93: Available Starting June 5
Image © artnovielysa, 123RF.com
Our next issue will be packed with all the great content you expect from ADMIN. Here are a few of the upcoming articles:
• Microsoft Dataverse
• Virtualizing with Neko
• BunkerWeb Firewall
• Repairing MySQL Tables
And much more!
Please note: Articles could change before the next issue.
Table 2: Security Segments and NHI Management
• Privileged access management (PAM): Management of privileged NHI and privileged human-assigned user accounts, especially technical and shared service accounts
• Secrets management: Securing and rotating access information (tokens, keys) for NHI
• Identity governance and administration (IGA): Governance, responsibilities, and lifecycle management, including for NHI
• Cloud-native application protection platform (CNAPP): Consideration of security aspects at the application level, including NHI context
• Cloud workload protection platforms (CWPPs): Protection of workloads, including their identities, runtime monitoring, and vulnerability analysis
• Identity threat detection and response (ITDR): Detection of anomalous behavior by NHIs
certificates to API tokens, is not
enough. The goal must be not just to
store secrets in a vault, but to manage
them in a controlled lifecycle.
CNAPPs (please refer to Table 2 for
security technology acronyms) extend
protection to the application level.
Among other things, they combine
CIEM, CWPPs, and vulnerability
management. CNAPPs are particu-
larly relevant in the context of NHI
because they provide contextual in-
formation about workloads and their
interactions. This information makes
it possible to assess the risk context
of individual identities better – for
example, when a workload identity
requests access to particularly sensi-
tive resources or originates from a
vulnerable application component.
CWPPs focus on protecting workloads
in cloud and hybrid environments.
They monitor runtime behavior,
identify vulnerabilities, and isolate
or block workloads with policies.
In terms of NHI, CWPP solutions
provide valuable signals that reveal
which workload identities are active,
and in which context, and whether
they originate from potentially com-
promised instances or exhibit suspi-
cious behavior. They therefore com-
plement the purely access-based view
of other security technologies with an
operational perspective.
Another key link is IGA. For NHI, too,
owners must be named, lifecycles
defined, and access authorizations
regularly reviewed. Classic IGA pro-
cesses such as recertification and the
joiner-mover-leaver (JML) principle
can be adapted to ensure control and
accountability for non-human identi-
ties, too. However, these cases require
customized workflows and an evalu-
ation logic that draws on technical
metadata and usage patterns rather
than personal attributes. Additionally,
a high degree of automation is neces-
sary, if only because of the volatility
of NHIs and their large numbers.
Last but not least, interaction with
ITDR plays a special role. NHIs oper-
ate in a highly automated way, often
in the background, which makes
them particularly vulnerable to mis-
use and difficult to monitor. Only
through behavior-based analysis are
anomalies, such as the misuse of a
secret or the expansion of an access
pattern, detected in a timely fashion.
ITDR therefore significantly boosts
the ability to respond to threats in
the context of NHIs and must be an
integral part of any security strategy
in this area.
What Is Delivered and What
Is Missing
Many of the products currently mar-
keted as NHI management primarily
address the management of secrets,
but less so the entire identity and
authorization model. Several dimen-
sions need to be taken into account:
• Secret vs. identity: A secret is not the same as an identity. Secrets are access credentials; identity defines the entity, its characteristics, and responsibilities.
• Static vs. dynamic: Long-lived secrets contradict security principles. Ephemeral identities with short-lived tokens are the goal (see the sketch after this list).
• Credentials vs. entitlements: Possession of a secret alone says nothing about entitlements. The mapping of identity to entitlement is crucial.
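To make the static vs. dynamic point tangible, the following sketch contrasts the two patterns with HashiCorp Vault's standard CLI. The secret path, policy name, and TTL are placeholder assumptions, and in practice a workload would obtain the short-lived credential itself (e.g., through a dynamic secrets engine or workload identity) rather than an administrator typing the command.

# Anti-pattern: a static, long-lived secret that is written once and rarely rotated
vault kv put secret/payment-service db_password='S3cretValue'

# Preferred pattern: a short-lived token bound to a defined policy; it expires
# automatically, so a leaked credential is only useful for a few minutes
vault token create -policy=payment-service -ttl=15m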
In practice, many products lack a
consistent view of the relationship
between identity, assigned secret,
technical and organizational owner-
ship, and actual access rights. Indi-
vidual components such as token
issuers, vaults, or identity providers
(IDPs) typically operate in isolation
and without consistent policies and
enforcement, which creates gray areas
Figure 1: Human and non-human identities play an equal role in the identity fabric.
and agility requirements of software
development.
Also important is that the identity
fabric has an integrative meta model
that can map different identity types,
credential types, usage contexts, and
trust levels. This model serves as the
basis for automated decisions, such
as granting temporary access rights or
escalating rule violations. It also helps
to implement regulatory requirements
such as traceability, data residency,
and client separation for NHI.
Such a strategic concept must also be
designed for heterogeneity from the
outset. In reality, companies typically
rely on multiple cloud platforms, a
variety of vault technologies, and
different approaches to software de-
velopment. A central identity fabric is
required to orchestrate this diversity
without artificially restricting it. The
goal is comprehensive control, not
the homogenization of tools, which is
why modularity is a key success fac-
tor: Organizations need to be able to
rely on interoperable building blocks
that can be flexibly integrated into
existing landscapes.
Finally, the integration of NHI into
the identity fabric also has a cultural
component. Cooperation between
IT security, IAM, cloud governance,
and software development must be
institutionalized, which can only be
achieved through clearly defined pro-
cesses, coordinated interfaces, and a
common vision. The identity fabric is
thus not only a technological archi-
tecture, but also the organizational
framework for modern, scalable iden-
tity management. NHIs are therefore
no longer an exception in this con-
struct – they are an integral part of it.
Organizational Challenges
Responsibility for NHIs typically
lies between software development,
and access model. The identity fabric
forms the structural and conceptual
backbone, enabling different types of
identities with their specific require-
ments to be managed consistently
and holistically.
An identity fabric typically includes
functions for identity provisioning,
authentication, authorization, gover-
nance, and access protection across
platform boundaries. For NHIs, it
means that the NHI must not be
treated as a special case, but as an
equivalent entity with the same re-
quirements for traceability, control,
and automation, necessitating a clear
extension of classic IAM models to
include NHI-specific elements, such
as those for managing ephemeral
workloads, cross-platform secrets, or
autonomous agents.
A strategic NHI approach within the
identity fabric begins with a complete
inventory of all NHIs. This discov-
ery process must be continuous and
include both declarative (e.g., infra-
structure definitions) and observable sources (e.g., runtime data). On this
basis, a clear assignment of respon-
sibilities follows: Who is the owner
of an identity? Who is allowed to use
it? Who controls the assigned permis-
sions? Without this governance, NHI
management remains fragmented and
difficult to audit.
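For one narrow class of NHI, Kubernetes service accounts, such a discovery step can be as simple as the following sketch. It assumes kubectl access to the cluster and jq on the admin workstation and merely contrasts the declared identities with those actually used by running pods; a real inventory would cover far more identity types and sources.

# Declarative view: all service accounts defined in the cluster
kubectl get serviceaccounts --all-namespaces

# Observable view: which service accounts running pods actually use
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.spec.serviceAccountName // "default")"' | \
  sort | uniq -c | sort -rn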
Additionally, the identity fabric must
also provide the technical mecha-
nisms for security and enforcement.
These mechanisms include automated
provisioning and deletion processes,
standardized interfaces for integrating
vaults and policy engines, and central
control mechanisms for access control
and role management. A policy-as-
code approach, with code generated
automatically on the basis of poli-
cies, can help enforce policies con-
sistently across systemand platform
boundaries while meeting the speed
in the security architecture, especially
where identities are generated by
automated processes outside of tradi-
tional IAM provisioning.
Another shortcoming is the lack of
defined and managed lifecycle man-
agement for NHI. Whereas typical
events for human identities, such as
entry, role changes, or departure, are
clearly defined and automated, NHI
has no comparable triggers. Without
explicit definitions of expiration dates,
dependencies, or usage context, many
identities remain active even though
they are no longer needed. This type
of shadow identity poses a significant
risk, especially in combination with
overprivileged secrets.
Moreover, the integration of analysis
and response mechanisms shows
weaknesses. Only a few products
offer native support for continuous
monitoring of secret usage or detect
anomalies in the behavior of individ-
ual workloads. As already mentioned,
a closer link to ITDR is essential to
evaluate the behavior of non-human
identities on a situational basis and
initiate automatic countermeasures.
These functions are still the exception
rather than the rule today.
Table 3 shows which vaults are
commonly used today and in which
contexts. These vaults need to be
incorporated into overarching NHI
management to achieve a centralized
view of identities, policies, secrets,
and access.
NHI as Part of the Identity
Fabric
An isolated view of NHI manage-
ment falls short. In modern IT
landscapes, which are increasingly
characterized by hybrid, dynamic,
and distributed architectures, all
identities – human and non-human –
must be part of a common identity
Table 3: Important NHI Management Vaults (vault, provider or technology, and typical area of application)
• HashiCorp Vault [1], open source and enterprise: multicloud, development security operations (DevSecOps), Kubernetes
• AWS Secrets Manager [2], Amazon Web Services (AWS): AWS services, Lambda, Elastic Container Service (ECS), etc.
• Azure Key Vault [3], Microsoft Azure: Entra ID, functions, app services
• Google Secret Manager [4], Google Cloud Platform (GCP): GCP-native workloads, IAM integration
DevOps, IAM, and IT security. This
shared responsibility model often
leads to gaps. Clear role assignments
are necessary:
• IAM/IT security defines governance and security mechanisms.
• Software development consumes services and vaults as part of agile processes.
The goal is cooperation by division
of labor. Security must not slow
down development but must support
it through automated services and
guidelines. In DevOps environments
in particular, different vaults are used
in parallel (Table 3). These vaults
must be identified, managed, and in-
tegrated in a controlled way.
An additional challenge arises from
the lack of standardization of re-
sponsibility models for non-human
identities. Although roles and re-
sponsibilities for human users are
often defined as part of onboarding
processes and organizational struc-
tures, comparable mechanisms are
often lacking for NHIs. Organizations
therefore need defined procedures
for assigning technical ownership
that are clearly documented and
regularly reviewed, which also in-
cludes processes for transferring
responsibilities when projects
change or technical owners leave the
organization.
Equally important is the integration of
security requirements through deploy-
ment pipelines in application develop-
ment. Security guidelines must be for-
mulated and implemented such that
they can be seamlessly integrated into
existing CI/CD processes. Instead of
checking security as a separate con-
trol instance downstream, audits and
policy checks should be an integral
part of automation. In this way, both
security and development goals can
be achieved efficiently without creat-
ing conflicting objectives.
Conclusion
Non-human identities are a central
element of modern IT landscapes.
Their secure management requires
more than ad hoc solutions. Compa-
nies need a holistic strategy for NHI
management that is embedded in an
identity fabric and tailored to cloud
and DevOps realities. Future-proof
NHI management must be based on
a modular architecture principle that
takes into account the diversity of
platforms, vaults, and development
methods used, allowing the com-
bination of agility with central
controllability.
For this reason, central governance re-
quirements must be combined on an
organizational level with decentral-
ized implementation options within
development teams. This tension can
only be resolved through defined in-
terfaces, coordinated role models, and
common goal definitions. Securing
non-human identities is crucial to the
resilience of digital infrastructures.
The challenges of dealing with NHI
affect not only IT departments, but
the entire organization.
Info
[1] HashiCorp Vault: [https://www.hashicorp.com/en/products/vault]
[2] AWS Secrets Manager: [https://aws.amazon.com/secrets-manager/]
[3] Azure Key Vault: [https://azure.microsoft.com/en-us/products/key-vault]
[4] Google Secret Manager: [https://cloud.google.com/security/products/secret-manager]
Author
Martin Kuppinger is the founder of and Principal
Analyst at KuppingerCole Analysts AG.
Keywords: non-human, identity, NHI, management, attribute,
CIEM, access, NHA, modular, role, security, automation
Long-Term Prometheus Data Storage with Cortex
Trend Scout
Prometheus is the standard application when it comes to monitoring, alerting, and trending, but the software is slow when faced with a large volume of historical data. Cortex comes to the rescue and offers cluster support, as well. By Martin Loschwitz
Photo by FLOUFFY on Unsplash
Prometheus suffers from a structural
problem: It does not offer a true clus-
ter mode. A single instance stores its
data locally and responds to queries
from this local database. High avail-
ability therefore requires a separate
design. Many teams solve this prob-
lem with the use of two Prometheus
instances that query the same targets
and with graphical or logical abstrac-
tions that merge the two data sources
(e.g., in Grafana).
This solution increases availability,
but it does not eliminate the fun-
damental problem of scaling. Each
instance continues to back up locally,
each instance compresses its own
data, and each instance only stores its
own data. Growing metrics volumes and longer retention times are more than an inconvenience: You have to deal with multiple points of administration, and if you have several locations with the same setup, the unpredictability of slow connections between them adds to the problem.
In these scenarios, Cortex [1] [2]
enters the scene. Cortex is directly
related to Prometheus, because it oper-
ates in the same data model and pro-
tocol world and natively understands
Prometheus data. It fields Prometheus
metrics, stores them long term on scal-
able back ends, and makes them avail-
able again for queries. Instead of each
Prometheus instance keeping its entire
dataset locally, Prometheus transfers
the data to Cortex at defined intervals,
typically with its remote write module.
Cortex then assumes responsibility for
long-term storage and distributes the
data across multiple instances of itself
that scale horizontally, which means
you can offload the pressure from the
individual Prometheus instance to a
system designed for scaling.
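In a plain (non-operator) setup, this handover is only a few lines of configuration. The following minimal sketch assumes that no remote_write block exists yet and that the Cortex distributor (or its gateway) is reachable at the placeholder URL shown; current Cortex versions accept pushes on /api/v1/push, older setups on /api/prom/push.

# Append a remote_write block to an existing configuration (placeholder URL)
cat >> /etc/prometheus/prometheus.yml <<'EOF'
remote_write:
  - url: http://cortex.example.com:9009/api/v1/push
EOF

# Tell the running Prometheus to reload its configuration
pkill -HUP prometheus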
Monitoring
Containers have secured their place in
everyday IT, and they are here to stay.
As new as the deployment mechanism
may be, Kubernetes (K8s) [3] and the
like face exactly the same challenges
in everyday operation as their conven-
tional counterparts. At the top of the
scale of operational pain points for
containers, much as in conventional
environments, is monitoring, because
if something goes wrong in Kuber-
netes, you want to know about it, just
as when operating typical monoliths in
legacy environments.
Monitoring in legacy environments
has long followed a familiar pattern:
You monitor hosts and services, check
system statuses, and respond to events
flagged by the monitoring system. If a
service fails, a check triggers an alert.
If a value exceeds a threshold, the
system reports an error. This model
is still in place today in the majority
of conventional setups. In contrast,
container platforms fundamentally
shift the requirements. In K8s environ-
ments, plain vanilla event monitoring
is no longer fit for the purpose because
stability depends not just on whether
or not something is running.
Container workloads are rarely binary
or simple. Instead, they gradually run
into difficulties: increasing latencies,
CPU or memory bottlenecks, satu-
rated networks, excessive requests,
and full buffers on the network and in
storage are just a few of the potential
issues. These kinds of developments
need to be identified by analyzing
time series; otherwise, you only see
the end state in the form of a failure,
which is precisely why event logging
is seeing a second aspect coming to
the fore in container platforms: trend-
ing – that is, continuous monitoring
of utilization and behavior over time.
The value that legacy event moni-
toring systems such as Nagios [4],
Icinga [5], or Checkmk [6] add in
this scenario is limited. These tools
are an excellent choice for static hosts
and static services where you can
clearly define thresholds for each
check. They record statuses, generate
warnings, and provide an actionable
list of problems.
Trending, on the other hand, is a very
different function. Historical evalua-
tions are often incomplete, depending
as they do on retention parameters,
or the data could end up in graphs
that look great but do not allow for
real-time series analysis. What is
more, the systems do not scale well in
container environments because the
number of targets to be monitored is
constantly changing.
Of course, this change is a key aspect
of container environments: Applica-
tions and their instances come and
go dynamically. A Kubernetes clus-
ter creates and destroys pods every
second in high-traffic setups, mov-
ing workloads between nodes and
automatically scaling deployments.
A monitoring approach that needs to
input every host and every service
manually cannot hope to keep up
with the pace.
Where Have All the Targets
Gone?
The next key challenge is reliably
capturing the targets. In conventional
environments, you know your servers
and enter them as static objects. In
container environments, these objects
might not even exist. Pods in Kuber-
netes are created dynamically, and if
you use mesh tools such as Istio [7],
services are sometimes even given
completely new endpoints.
Monitoring that can be used effectively
in K8s needs to detect these changes
and respond to them. The system must
determine for itself which endpoints
provide metrics, and it has to refresh its
internal list of these endpoints at short
intervals. Even a well-maintained data
center inventory management (DCIM)
and dashboard logic for Kubernetes
itself and for hosts and services (Fig-
ure 1), and Alertmanager distributes
the alerts generated by Prometheus.
Together, this trio forms the basis
for observability in platforms where
workloads change dynamically and
traditional monitoring models fail.
Data Flood
Prometheus also offers a crucial feature
that container environments absolutely
need: automatic service discovery
(SD). In Kubernetes, Prometheus uses
its API to identify pods, services, and
endpoints automatically. Admins also
define scrape jobs and labels that
regularly query the data and store the
results in Prometheus in a structured
way. Prometheus itself continuously
updates the metrics data from its recog-
nized sources, eliminating the biggest
hurdle that legacy systems face: manu-
ally maintaining a constantly chang-
ing inventory list. Prometheus works
closely with the platform and follows
its reality every step of the way.
However, Prometheus soon reaches
its limits, a fact that quickly becomes
apparent in larger environments.
Prometheus’ local storage saves time
series on the local server drive. As
hours pass, the data volume grows,
tool or a server and service database
(configuration management database,
CMDB) is not particularly helpful here.
After all, the reality in clusters changes
far faster than the documentation can
ever hope to. Monitoring then becomes
a question of integration into the or-
chestration logic that already exists in
K8s: If you monitor Kubernetes, you
have to understand it.
In this context, Prometheus [8] has
established itself as the standard appli-
cation. It combines three features that
are crucial in container and platform
environments: It stores metrics as time
series, actively pulls data from export-
ers, and has a powerful query lan-
guage in the form of PromQL, which
makes the tool ideal for trending,
capacity analysis, and understanding
system behavior over time.
State monitoring is more-or-less a by-
product, because the number of httpd
services running at any point in time
is also a time series, but admittedly
a very short one. Automated pro-
cesses such as alerts can be tailored
to this scenario. In combination with
Grafana [9] and Prometheus’ own
Alertmanager [10], you have a combi-
nation currently considered the gold
standard by many teams.
Prometheus provides the time se-
ries, Grafana has the visualization
Figure 1: Grafana is the miracle tool for visualizing metrics data in Prometheus, as shown in this example from Kubernetes. © CNCF
and the bigger the history collection
becomes, the greater the demands on
I/ O, CPU, and memory become.
Although Prometheus itself works ef-
ficiently, it remains a system designed
for a small number of time series.
Queries over large periods of time tend
to take quite a while to complete, and
compressing and storing data long-term
exposes the hardware to excessive load.
Admins constantly have to keep an eye
on the data repositories. To keep a long
story short, the more historical data
Prometheus needs to keep, the slower
it becomes in everyday use – especially
where large numbers of metrics and
labels come together.
Help in Sight
Cortex not only addresses the issue of
more history, it also addresses opera-
tion in larger, distributed structures.
The tool consists of a distributed sys-
tem of several components, such as
the distributor, the ingester, the store,
and the query service, and follows a microservices-style principle: Each
of the components listed here is re-
sponsible for precisely one task.
This architecture separates the tasks
of processing incoming data, storing
the data, and retrieval. Each layer
scales independently horizontally,
leading to a monitoring back end that
grows with the platform, while Pro-
metheus is at the forefront, handling
the monitoring and trending.
Cortex uses object storage such as
Amazon S3-compatible targets – think
a local instance of the Ceph Object
Gateway or other scalable variants – for
storage. Therefore, the local drives of
the individual Prometheus servers lose
their central importance. They only
contain a core set of data that must re-
main accessible for quick access. Que-
ries are no longer run against a single
local Prometheus instance, but against
a distributed storage system that pro-
vides long-term data in a powerful way.
Administrators can look forward to a
monitoring and trending architecture
that reliably captures all the relevant
vital signs in extremely dynamic con-
tainer environments and then visual-
izes trends over months and years
without a single service instance col-
lapsing under the weight of its own
history.
On the basis of an existing Kuber-
netes cluster, I describe how to roll
out Prometheus, Prometheus Alert-
manager, and Grafana and how to use
the Prometheus Node Exporter to ac-
quire metrics for key vital signs of the
hardware platform. Later, I also look
into processing the metrics data from
applications in detail.
Installing Prometheus
Helm [11] has established itself as
the ideal solution for integrating
Prometheus, its Alertmanager, and
Grafana. It consistently versions the
components, cleanly resolves depen-
dencies between them, and automati-
cally casts an immutable configuration
into declarative statements for K8s.
To begin, you need to set up a
dedicated namespace (e.g., monitor-
ing), and then use the kube-pro-
metheus-stack [12] Helm chart
package to roll out Prometheus in
the form of its own operator [13];
the Alertmanager, Grafana, and
kube-state-metrics [14]; plus the
Prometheus Node Exporter as a coor-
dinated set across the cluster. In this
way, the entire cluster has a com-
plete monitoring basis – essentially
launched in a single command line
– without the need to put together de-
ployments and services individually.
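In practice, the roll-out looks roughly like the following; the release name monitoring is an arbitrary choice, not a requirement of the chart.

# Add the community chart repository and install the complete stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace monitoring
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring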
The most important aspect for the suc-
cess of this endeavor is a clean labeling
strategy in Prometheus because the
application recognizes targets in K8s
by label and logically groups metrics
on that basis. If you ensure that all
monitoring components have consis-
tent labels (e.g., app.kubernetes.io/
part-of=monitoring and app.kubernetes.
io/managed-by=helm), you are implicitly
enabling the automatic detection of
services and their grouping within Pro-
metheus itself.
For workloads that provide their own
metrics, you ideally also want to es-
tablish a uniform schema on the basis
of parameters such as team, service,
env, or component. These labels end up
both on the objects created in K8s and
later in label queries in Prometheus,
meaning that queries, dashboards,
and alarms are governed by a fixed
structure.
Ideally, you will also separate your
platform metrics from application
metrics in line with best practices with
the use of namespaces or additional
labels, such as metrics=platform and
metrics=app, which allows the data to
be accessed separately in services such
as Grafana, preventing Grafana dash-
boards from becoming too chaotic.
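Applied to a single application service, such a labeling strategy could look like the following sketch; the service, namespace, and label values are purely illustrative placeholders.

# Attach a consistent label set to an application service (placeholder names)
kubectl label service webshop-api \
  app.kubernetes.io/part-of=webshop \
  team=shop-backend service=webshop-api env=prod metrics=app \
  --namespace webshop --overwrite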
After the install, Node Exporter (Fig-
ure 2) provides the host metrics of
the physical systems. The Helm chart
used here installs the exporter as a
daemon set, giving each K8s node a
local metric source. Prometheus auto-
matically scrapes these targets because
the stack chart contains matching
service monitor entries. Finally, you
need to check in the Prometheus user
interface whether the targets appear
there as expected and whether en-
tries such as those for node-exporter,
kubelet, kube-state-metrics, and
Prometheus’s own components are
present there. If these targets have an
UP status, basic data acquisition is
working reliably, which means that
the cluster will field CPU, RAM, disk,
and network metrics for each node,
plus status metrics for deployments,
pods, daemon sets, and many other
Kubernetes objects.
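You can also run this check from the command line instead of the web interface. The sketch below assumes the operator's default prometheus-operated service; adjust the service name and namespace if your release exposes Prometheus differently.

# Forward the Prometheus web port to the workstation
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &
sleep 3

# Summarize the health of all active scrape targets
curl -s http://localhost:9090/api/v1/targets | \
  jq -r '.data.activeTargets[] | "\(.labels.job): \(.health)"' | sort | uniq -c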
Fine-Tuning
The Prometheus Operator is respon-
sible for automatic service discovery
in the Kube Prometheus stack. It
observes K8s objects and generates
scrape configurations for Prometheus
from them, which means you no lon-
ger need legacy scrape_config entries
for applications but instead create
ServiceMonitor or PodMonitor objects
in Kubernetes. A service monitor de-
scribes a service with a label selector,
plus the port and path for the /metrics
endpoint.
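A minimal ServiceMonitor might look like the following sketch. The names and namespaces are placeholders; the release label reflects a common kube-prometheus-stack default in which the operator only picks up monitors carrying the Helm release label, so check your chart values before relying on it.

cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: webshop-api
  namespace: monitoring
  labels:
    release: monitoring        # match the operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: webshop-api         # services carrying this label are scraped
  namespaceSelector:
    matchNames:
      - webshop
  endpoints:
    - port: metrics            # name of the service port exposing /metrics
      path: /metrics
      interval: 30s
EOF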
As soon as a team assigns a suitable
label to a K8s service, Prometheus
automatically adds the endpoint to the
list of targets to be monitored. This
pattern scales because you no longer
with Grafana’s Prometheus source
plugin. For admins, this means leav-
ing Prometheus in place as the data
collector and for real-time queries in
the cluster but shifting the long-term
history to Cortex.
Deployment takes place on the same
Kubernetes cluster as Prometheus,
preferably also in the monitoring
namespace or in a separate
namespace such as cortex. This
arrangement makes sense if
separation of concerns is important
in your organization.
Helm is again recommended for the
installation, because most Cortex de-
ployments are highly specific to their
environments and can be handled
easily on a per cluster basis with Cor-
tex Helm charts and values.yaml files.
Much like Prometheus, Cortex does
not consist of a single pod, but of
several components that perform dif-
ferent tasks. Deployment in K8s must
take this setting into account.
cluster status, node utilization, and
workload behavior. You need to con-
nect Grafana to Prometheus as a data
source, enable the chart’s predefined
dashboards, and then add your
own views according to the previ-
ously defined labeling strategy. That
completes the Prometheus installa-
tion. All that’s missing is the cluster
mechanism.
Operating Cortex
After the basic installation of the
monitoring system, the next step
is to convert plain vanilla cluster
monitoring into a scalable metrics
back end. In other words, you need
to roll out Cortex (Figure 3), which
relies on the remote_write interface
to field metrics as described, before
putting them into long-term storage,
while remaining alert to queries from
PromQL-compatible query endpoints,
which also makes it fully compatible
need to maintain a centrally managed
target list. The platform identifies new
services by labels and automatically
includes them in the scraping process.
Alertmanager extends this setup to
include central routing for all alerts.
To do this, you need to set up routing
rules in the Alertmanager configura-
tion by severity, team assignment,
and namespace. External tools for
delivering alerts also need to be cre-
ated here (e.g., for mailing or Matrix
messages).
Prometheus, in turn, relies on
PromQL rules to generate alerts, and
you manage the rules as Prometheus-
Rule objects directly in Kubernetes.
Assigning consistent labels (e.g., team
and severity) to ruleset objects is
important so that Alertmanager can
route them in a targeted way – and to
support deduplication.
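A rule defined this way could look like the following sketch; the metric, threshold, and label values are examples, not recommendations.

cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: webshop-api-alerts
  namespace: monitoring
  labels:
    release: monitoring
spec:
  groups:
    - name: webshop-api
      rules:
        - alert: WebshopHighErrorRate
          expr: sum(rate(http_requests_total{service="webshop-api",code=~"5.."}[5m])) > 5
          for: 10m
          labels:
            team: shop-backend     # used by Alertmanager routing
            severity: critical
          annotations:
            summary: "webshop-api is returning too many 5xx responses"
EOF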
Grafana closes the loop by querying
dashboard data from Prometheus
and providing visualizations for the
Figure 2: Grafana evaluates data from Node Exporter and visualizes the results, as shown here with an example of a simple Raspberry Pi.
© Grafana
Storage as a Backbone
A stable Cortex installation stands
and falls with its storage back end.
You need to choose an object storage
system that is compatible with the S3
API (e.g., MinIO [15] on the cluster or,
as mentioned, the Ceph [16] Object
Gateway [17]). Cortex writes its data
directly to this object storage system,
which eliminates the need to bind the
local disks of individual systems. For
fast processing, Cortex also uses a key-
value store for ring and status informa-
tion, often in the form of memberlist ob-
jects or, alternatively, with Consul [18]
or Etcd [19]. In Kubernetes, memberlist is becoming the norm because it has no external dependencies and is well suited to dynamic environments; you only need to ensure stable DNS names for the memberlist members so that the individual Cortex services can find each other for communication.
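A corresponding configuration fragment could look like the following sketch. The endpoint, bucket, and credentials are placeholders for a MinIO or Ceph Object Gateway installation, and where exactly this block lands in the Cortex Helm chart's values.yaml depends on the chart version, so check its values reference before deploying.

# Cortex configuration fragment (placeholder values) for S3-compatible storage
cat > cortex-storage.yaml <<'EOF'
blocks_storage:
  backend: s3
  s3:
    endpoint: minio.monitoring.svc.cluster.local:9000
    bucket_name: cortex-blocks
    access_key_id: cortex
    secret_access_key: changeme
    insecure: true               # plain HTTP inside the cluster
EOF
# Merge this fragment into the chart's values (key nesting varies by chart
# version) before running helm install/upgrade for Cortex.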
For the connection between Cortex
and Prometheus, you need to add a
remote_write block to the Prometheus configuration that points to an in-
stance of the Cortex distributor. In an
operator-based setup, as rolled out by
the Helm chart, you will use the Pro-
metheus Custom Resource Definition
(CRD) for this purpose and add the
remote write endpoint there.
Prometheus is responsible for query-
ing all the active exporters – that is,
the Node Exporter, kube-state-met-
rics, and all application-specific
ServiceMonitor objects, but it ad-
ditionally forwards scraped samples
to Cortex. To allow this to happen,
the creation of additional labels in
Prometheus to designate the envi-
ronment or cluster (external label-
ing) makes sense. A label such as
cluster or platform, set to a unique value per environment, prevents time series from different Prometheus instances from being mixed up
later. The label automatically ends up
as an identifier for each time series
on the Cortex back end and supports
queries across multiple clusters with-
out collisions because of identical job
or instance names.
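With the kube-prometheus-stack chart, both settings live in the Prometheus custom resource, which the chart exposes under prometheus.prometheusSpec. The endpoint URL, port, and label value below are placeholders; verify the value paths against the chart version you run.

cat > remote-write-values.yaml <<'EOF'
prometheus:
  prometheusSpec:
    externalLabels:
      cluster: prod-eu-1                 # unique per environment
    remoteWrite:
      - url: http://cortex-distributor.cortex.svc.cluster.local:9009/api/v1/push
EOF
helm upgrade monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring -f remote-write-values.yaml --reuse-values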
Everyday Use
A pattern has established itself in op-
eration, with Prometheus continuing
to be used for short-term queries and
Grafana querying Cortex for long-
term views. Grafana accesses two
data sources for this purpose: one for
Prometheus and one for Cortex. On
your dashboards, you need to define
which panels require short-term de-
tailed data and which panels will dis-
play the long-term history (Figure 4).
Cortex provides this history with the
query endpoint, whereas Prometheus
communicates directly with Grafana.
In this way, the query language re-
mains the same, but the back end
changes. The operator stack fits in
neatly because service discovery
continues to take place in Pro-
metheus, and Cortex does not query
any data itself.
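If Grafana comes from the same Helm chart, the second data source can be provisioned declaratively. The sketch below uses the chart's additionalDataSources value and a placeholder URL; Cortex usually serves its PromQL-compatible API under the /prometheus prefix, but confirm both details against your deployment.

cat > grafana-cortex-datasource.yaml <<'EOF'
grafana:
  additionalDataSources:
    - name: Cortex (long-term)
      type: prometheus
      access: proxy
      url: http://cortex-query-frontend.cortex.svc.cluster.local:9009/prometheus
EOF
helm upgrade monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring -f grafana-cortex-datasource.yaml --reuse-values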
Multicluster Scenarios
Multiple Cortex instances within
the same K8s cluster can be imple-
mented in different ways. Here, I
look at two models. The first relies
on multitenancy to separate tenants
within a central Cortex installation.
Cortex supports tenant IDs by entries
in the HTTP header of incoming re-
quests. You need to define a tenant
ID for each cluster or each team in
Prometheus to keep all the data logi-
cally separate, even though the same
Cortex cluster is used. This model
reduces resource requirements, sim-
plifies operation, and creates a central
query layer. However, it does pose a
challenge in terms of separation of
concerns: The data is stored in the
same repositories, and auditors might
raise an eyebrow at that.
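Technically, the tenant ID travels in the X-Scope-OrgID header of each remote-write request (and of every query). With the operator, it can be set per Prometheus instance, as in the following sketch; the tenant name and URL are placeholders, and the headers field requires a reasonably recent Prometheus and Prometheus Operator version.

cat > tenant-values.yaml <<'EOF'
prometheus:
  prometheusSpec:
    remoteWrite:
      - url: http://cortex-distributor.cortex.svc.cluster.local:9009/api/v1/push
        headers:
          X-Scope-OrgID: team-shop       # tenant ID evaluated by Cortex
EOF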
The second model avoids this prob-
lem by rolling out multiple sepa-
rate Cortex stacks within the same
platform (e.g., for different security
zones or platform teams). In this
way, you can ensure hard isolation
at the Kubernetes level, but at the
cost of significantly greater opera-
tional overhead. In this model, the
instances are assigned separate
buckets or separate prefixes in ob-
ject storage so that block collisions
cannot occur. You also need to use
different block storage – for ex-
ample, different storage classes (in
Kubernetes).
Figure 3: Cortex follows a microservices principle and comprises several components that work together with Prometheus and Grafana. © Cortex
[8] Prometheus: [https://prometheus.io/]
[9] Grafana: [https://grafana.com/]
[10] Prometheus Alertmanager: [https://prometheus.io/docs/alerting/latest/alertmanager/]
[11] Helm: [https://helm.sh/]
[12] kube-prometheus-stack: [https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/README.md]
[13] Prometheus Operator: [https://github.com/prometheus-operator/prometheus-operator]
[14] kube-state-metrics: [https://github.com/kubernetes/kube-state-metrics]
[15] MinIO: [https://www.min.io/]
[16] Ceph: [https://ceph.io/en/]
[17] Ceph Object Gateway: [https://docs.ceph.com/]
[18] Consul: [https://www.consul.io/]
[19] etcd: [https://etcd.io/]
Conclusion: Easy to Do
Monitoring applications and plat-
forms with Kubernetes, Prometheus,
Grafana, and Cortex promises an
extremely flexible monitoring archi-
tecture. It keeps pace with the re-
quirements of modern, scalable apps
and neatly integrates trending. If you
are used to Checkmk or Nagios, familiarizing yourself with all the components that make up the stack can be a considerable change.
However, the bottom line is a solution
whose performance far exceeds that
of legacy event monitoring. I can only
encourage anyone who works with
Kubernetes to take the plunge – if
only for your own peace of mind.
Info
[1] Cortex on GitHub: [https://github.com/cortexproject/cortex]
[2] Cortex: [https://cortexmetrics.io/]
[3] Kubernetes: [https://kubernetes.io/]
[4] Nagios: [https://www.nagios.org/]
[5] Icinga: [https://icinga.com/]
[6] Checkmk: [https://checkmk.com/]
[7] Istio: [https://istio.io/]
A hub-and-spoke approach is recom-
mended for platforms that spread
across cluster boundaries. Each K8s
cluster runs Prometheus locally for
scraping and short-term queries but
uses remote_write to send metrics to
a central Cortex back end running
either on a dedicated observability
cluster or as a standalone platform
instance. You can use (external)
labels, typically with the values for
cluster, region, and environment, to
enforce a clean identity for the in-
coming data.
Grafana then accesses this central
Cortex instance and creates global
dashboards that map multiple plat-
forms simultaneously. The approach
scales across locations as long as a
network path exists between Pro-
metheus and the central Cortex.
In practice, companies tend to rely
on TLS-secured ingress endpoints
in K8s, mutual (m)TLS between
clusters, or dedicated private net-
work connections for this purpose.
However, the latter are difficult to
implement in public cloud environments and
are usually reserved as a feature for
private clouds.
Keywords: Kubernetes, Prometheus, Grafana, Cortex, data, storage, trending, monitoring, alerting
The Author
Martin Loschwitz is the
founder and managing
director of True West IT
Services GmbH, which offers
scalable IT infrastructure
based on OpenStack and Kubernetes.
Figure 4: Visualization of long-term trending enables not only the rapid identification of specific problems, but also the detection of
long-term trends, such as the emergence of alerts, before they become a problem. © Grafana
In Japanese, Kuma means bear,
which the Ainu associate with protec-
tive qualities. Uptime Kuma [1], [2] is
a little bear that keeps a watchful eye
on your websites, servers, and ser-
vices 24x7, as described by developer
Louis Lam. What began in July 2021
as a personal solution to a specific
problem has grown into one of the
most successful self-hosted monitor-
ing tools.
The story behind Uptime Kuma is
typical of open
source projects:
Lam was looking for
a free, self-hosted
monitoring tool
with a state-of-the-
art interface. He
was unimpressed
by the alternatives
available at the
time: statping-ng
was no longer ac-
tively maintained
and seemed out-
dated. The free
version of Uptime-
Robot, a software-
as-a-service (SaaS)
solution, proved to
be too limited in
its scope; to make
matters worse, it’s not open source.
All of these problems prompted Lam
to write his own tool.
The numbers speak for themselves:
nearly 79,000 GitHub stars and
more than 127 million Docker pulls
make Uptime Kuma one of the most
popular projects in its category. The
growth curve is also impressive: In
August 2021, just a few weeks after
the initial release, the project reached
its first 1,000 stars. Within a year, that
number rose to more than 20,000,
with