NCP-CN 6.10 Study Guide · Section 3 of 4

Perform Day 2 Operations

Six objectives, the biggest section, and per community intel the densest question territory: auth, logging, backup, monitoring, autoscaling, lifecycle. The platform is built; now you run it.

🎯 Objectives 3.1 – 3.6 · 📅 War plan: Week 2, days 1-4 · Versions: NKP 2.12 · AOS 6.10 · pc2024.2

3.1 Authentication and authorization

Authentication: NKP embeds Dex as an OIDC broker. You configure an identity provider (external LDAP directory, OIDC, or SAML) in Kommander, map groups from the IdP, and users sign in through a generated login URL (per-tenant URLs exist for multi-tenancy). Service and automation access uses tokens.

Authorization is layered, and the layering is the exam question:

Layer	Governs	Examples
Kommander roles	The fleet management plane	Who can administer workspaces, projects, attach clusters, deploy platform apps
Cluster roles (plain K8s RBAC)	Inside each cluster	Role/ClusterRole + bindings from Guide 00 §4; what users can do to workloads

Role inheritance / environment contexts: grants flow downhill. A role granted at the workspace level federates to the clusters in that workspace; project-level grants scope to the project's namespaces on its clusters. Custom roles + role bindings let you go finer than the built-ins. Gatekeeper (OPA admission policy) is the policy-enforcement piece the blueprint names under security.
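To make "grants flow downhill" concrete, here is a toy model in Python. It is a sketch under stated assumptions, not how Kommander is implemented: the inventory dicts, the `Grant` type, and every workspace, project, cluster, and namespace name are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical toy inventory. None of these names come from NKP.
WORKSPACE_CLUSTERS = {"prod-ws": {"cluster-a", "cluster-b"}}
PROJECT_NAMESPACES = {"payments": {"payments-api", "payments-db"}}
PROJECT_WORKSPACE = {"payments": "prod-ws"}


@dataclass(frozen=True)
class Grant:
    user: str
    level: str   # "workspace" or "project"
    target: str  # name of the workspace or project


def allowed(grants, user, cluster, namespace):
    """Return True if any grant held by `user` reaches (cluster, namespace)."""
    for g in grants:
        if g.user != user:
            continue
        if g.level == "workspace":
            # A workspace grant reaches every cluster in that workspace.
            if cluster in WORKSPACE_CLUSTERS.get(g.target, set()):
                return True
        elif g.level == "project":
            # A project grant reaches only the project's namespaces,
            # on clusters of the workspace that owns the project.
            ws = PROJECT_WORKSPACE.get(g.target)
            in_workspace = cluster in WORKSPACE_CLUSTERS.get(ws, set())
            in_project = namespace in PROJECT_NAMESPACES.get(g.target, set())
            if in_workspace and in_project:
                return True
    return False


grants = [
    Grant("alice", "workspace", "prod-ws"),
    Grant("bob", "project", "payments"),
]
```

With these grants, `alice` reaches any namespace on `cluster-a`, `bob` reaches only the `payments` namespaces, and a user with no grant at any layer reaches nothing, which is the "authenticates fine but sees nothing" case.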

Exam shape

Scenario gives a permission outcome and asks which layer to touch. Heuristic: anything about seeing/administering workspaces, projects, or platform apps = Kommander role. Anything about pods/deployments/namespaces on a specific cluster = cluster RBAC. "User authenticates fine but sees nothing" = authenticated via Dex but no role binding at any layer.

3.2 Logging

The NKP logging stack is a platform application set built on the Logging Operator with Loki as the log store and Grafana as the view. It is not enabled by default; you enable the logging stack apps per workspace/cluster.

Piece	Job
Logging Operator	Orchestrates collection: Fluent Bit agents on nodes, flows/outputs route logs
Loki	Stores logs (label-indexed, object-storage friendly)
Grafana (logging)	Query/view logs

Multi-tenant logging: restrict who sees which logs by scoping flows to namespaces/tenants, so each team views only its own. Persistence and scale: Loki's backend can target object storage (the blueprint names an S3 example and integrating persistent data with Nutanix Unified Storage, i.e. Objects/Files); scaling the stack means scaling Loki components and collectors for volume.
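The scoping idea can be sketched in a few lines of Python. This is a toy filter, not the Logging Operator's flow mechanism: the tenant map, the record shape, and all names are hypothetical.

```python
# Hypothetical namespace-to-tenant map. Names are illustrative only.
TENANT_NAMESPACES = {
    "team-a": {"team-a-apps", "team-a-jobs"},
    "team-b": {"team-b-apps"},
}


def visible_logs(records, tenant):
    """Keep only the records whose namespace belongs to `tenant`."""
    namespaces = TENANT_NAMESPACES.get(tenant, set())
    return [r for r in records if r["namespace"] in namespaces]


records = [
    {"namespace": "team-a-apps", "msg": "started"},
    {"namespace": "team-b-apps", "msg": "oom"},
    {"namespace": "team-a-jobs", "msg": "done"},
]
```

The point to carry into the exam is that the namespace is the scoping key: a tenant with no mapped namespaces sees no logs at all.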

3.3 Backup and recovery (Velero)

Velero backs up cluster API objects + persistent volume data. NKP ships it in the platform app stack (license-tier gated; "NKP License Support for Backup & Restore").

Dependencies to recognize: a target object storage location for backup data (S3 compatible; Nutanix Objects works), credentials for it, and VolumeSnapshotClasses (CSI) so PV data can be snapshotted. No target storage, no backups.

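# The Velero CLI vocabulary the blueprint expects
velero backup create nightly-app --include-namespaces team-a   # one-off backup of one namespace
velero schedule create nightly --schedule "0 2 * * *"          # cron syntax: daily at 02:00
velero backup get                                              # list backups and their status
velero backup describe nightly-app                             # diagnose: phase, errors, warnings
velero backup logs nightly-app                                 # diagnose: full backup log
velero restore create --from-backup nightly-app                # restore from that backup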

Diagnosing backup issues: backup PartiallyFailed or Failed → read velero backup logs; usual suspects are storage location credentials/reachability, snapshot class missing/misconfigured, or namespace scoping. Cluster restore = restoring from a backup into a rebuilt/replacement cluster; restores are how you prove backups (test them).
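The triage order above can be written down as a small Python sketch. It is a memory aid, not a Velero client: `cfg` is a hypothetical dict you fill in by hand, and the returned strings are illustrative. Only the phase names `Completed`, `PartiallyFailed`, and `Failed` are Velero's own.

```python
def missing_dependencies(cfg):
    """Return the backup dependencies that are absent from `cfg`.

    `cfg` is a hypothetical dict, not a Velero object.
    """
    checks = [
        ("object storage location", cfg.get("storage_location")),
        ("storage credentials", cfg.get("credentials")),
        ("VolumeSnapshotClass", cfg.get("snapshot_classes")),
    ]
    return [name for name, value in checks if not value]


def next_step(phase, cfg):
    """Suggest what to look at next for a backup in the given phase."""
    if phase == "Completed":
        return "test a restore"
    if phase in ("PartiallyFailed", "Failed"):
        missing = missing_dependencies(cfg)
        if missing:
            return "fix: " + ", ".join(missing)
        return "read velero backup logs; check namespace scoping"
    return "wait"
```

For example, `next_step("Failed", {"storage_location": "s3://backups"})` points at the missing credentials and snapshot class before you spend time in the logs.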

3.4 Performance and health monitoring

The monitoring stack platform apps: Prometheus (scrape + store metrics), Alertmanager (route alerts), Grafana (dashboards). Per-cluster stacks, plus centralized metrics in a multi-cluster fleet: the management cluster aggregates/federates so you watch the estate from one place (Kommander dashboards; NKP Insights adds anomaly detection at Ultimate).

Customization the blueprint expects you to know exists: custom dashboards in Grafana, custom service/system-level metrics (ServiceMonitors scraping your apps), alert rules + notification endpoints (where Alertmanager sends: email/webhook/chat), and backend storage for the monitoring app (Prometheus retention needs a PV; size it or it falls over).
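"Size it" can be made concrete with the rule of thumb from the Prometheus storage documentation: disk needed ≈ retention seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The Python sketch below applies that formula. The 20% headroom default and the example workload are assumptions for illustration, not NKP guidance.

```python
def prometheus_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Estimate raw TSDB disk usage for a retention window."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


def with_headroom_gib(raw_bytes, headroom=0.2):
    """Add headroom (assumed 20% by default) and convert bytes to GiB."""
    return raw_bytes * (1 + headroom) / 2**30


# Hypothetical workload: 15 days of retention at 100,000 samples per second.
raw = prometheus_disk_bytes(15, 100_000)
pvc_gib = with_headroom_gib(raw)
```

For this assumed workload the raw estimate is about 259 GB, so the PVC request lands a little under 290 GiB once headroom is added. Measure your real ingestion rate before trusting any such number.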

Exam shape

"Centralize monitoring in a multi-cluster environment" → management cluster federation/centralized metrics, not logging into 12 Grafanas. "Configure backend storage for a monitoring application" → PVC/StorageClass sizing for Prometheus. Health diagnosis questions pair metrics (what is slow) with the platform app status (what is broken).

3.5 Cluster autoscaling

What it is: the Cluster Autoscaler grows or shrinks a node pool between min and max bounds based on pod schedulability: pods Pending for lack of resources → scale up; nodes underutilized → scale down. This is NODE autoscaling, as opposed to the Horizontal Pod Autoscaler (HPA), which scales pod replicas; know the difference cold.

Configuring it: per-node-pool min/max settings; provider-specific configuration exists for Nutanix, AWS, and vSphere (the blueprint lists all three docs). Use cases: bursty workloads, cost control, CI farms.
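The min/max behaviour is easy to see in a toy decision function. This Python sketch is deliberately simplified: the real Cluster Autoscaler simulates scheduling rather than dividing by a fixed `pods_per_node`, and every parameter name here is hypothetical.

```python
import math


def desired_nodes(current, min_nodes, max_nodes,
                  pending_pods, pods_per_node, idle_nodes):
    """Toy node-pool decision: scale up for Pending pods, down for idle nodes.

    The result is always clamped to the pool's [min_nodes, max_nodes] bounds.
    """
    if pending_pods > 0:
        # Add enough nodes to place the pods that cannot be scheduled.
        target = current + math.ceil(pending_pods / pods_per_node)
    else:
        # Nothing is Pending, so underutilized nodes can be removed.
        target = current - idle_nodes
    return max(min_nodes, min(max_nodes, target))
```

With bounds of 2 and 6, a pool of 3 nodes facing 50 Pending pods still stops at 6, and a pool of 5 with 4 idle nodes stops at 2. The bounds win over demand in both directions.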

3.6 Lifecycle management

Operation	How it works in NKP
Upgrade clusters	Sequenced: check Upgrade Prerequisites, upgrade the management cluster/Kommander first, then workload clusters; CAPI rolls nodes with new machine images (Kubernetes version upgrades on managed clusters). Air-gapped upgrades have their own doc: new bundles seeded to the registry first. Pro vs Ultimate upgrade docs differ in scope (single cluster vs fleet).
Update cluster configuration	Edit the owning CAPI resource (Guide 02 table) and let reconcile do it; scenario questions ask when config update vs rebuild is right
Manage node pools	Create/resize/delete pools = MachineDeployments; heterogeneous pools for different workload shapes
Manually scale	Bump replica count on a node pool (CLI or UI); autoscaler bounds vs manual setpoints
Delete clusters	`nkp delete cluster` ("Delete an NKP Cluster with One Command"); cleans up provider resources; managed vs attached matters (Guide 04)

Exam shape

Upgrade ORDER is the classic: management cluster before workload clusters; air gap means re-seed the registry with the new release bundle BEFORE anything. Node pool questions hinge on MachineDeployment being the unit of scaling.
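The ordering rule can be encoded as a tiny plan builder. This Python sketch only arranges step descriptions. It does not call the `nkp` CLI, and the step strings and cluster names are hypothetical.

```python
def upgrade_plan(management, workloads, air_gapped=False):
    """Return upgrade steps in the order the exam expects."""
    steps = []
    if air_gapped:
        # Air gap: the new release bundle must be in the registry before anything else.
        steps.append("seed registry with new release bundle")
    steps.append("check upgrade prerequisites")
    # The management cluster (with Kommander) always goes before workload clusters.
    steps.append(f"upgrade management cluster: {management}")
    steps.extend(f"upgrade workload cluster: {name}" for name in workloads)
    return steps


plan = upgrade_plan("mgmt", ["workload-1", "workload-2"], air_gapped=True)
```

Here `plan` starts with the registry seeding step and lists `mgmt` ahead of both workload clusters. Dropping `air_gapped=True` removes only the first step.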

§ Section 3 in one breath

Dex authenticates against your IdP; Kommander roles govern the fleet plane while cluster RBAC governs inside, with inheritance flowing workspace → project → cluster. The logging stack (Logging Operator + Loki) and monitoring stack (Prometheus + Alertmanager + Grafana, centralized at the management cluster) are platform apps you enable, customize, persist, and scale. Velero needs object storage + snapshot classes and is driven by its CLI. The autoscaler manages node counts per pool bounds. Upgrades flow management first, air gaps re-seed first, and node pools are MachineDeployments.

§ Self check

1. A user logs in via LDAP fine but sees an empty Kommander UI. What is missing, and at which layer?

2. Velero backups fail with storage location errors. Name the two dependencies you verify first.

3. Team A must not see Team B's logs in Grafana. What feature solves this and what is it scoped to?

4. Pods sit Pending at 9am daily, nodes idle by noon. Which feature, and what two bounds configure it?

5. Air-gapped fleet upgrade from NKP 2.12 to the next release: what is step zero, and which cluster upgrades first?