Pubstack Blog - Cloud computing and engineering articles 2024-12-05T13:18:57+00:00 https://www.pubstack.com/ Carlos Camacho PostgreSQL Modeling Workflow with pgModeler and binaries from pgmodeler-builder If you work with PostgreSQL databases and have a passion for open-source tools, you’ve likely heard of pgModeler—a powerful, multiplatform database modeler. But what if you need consistent, automated builds of this essential tool across major platforms... 2024-12-05T00:00:00+00:00 https://www.pubstack.com/blog/2024/12/05/pgmodeler-builder Carlos Camacho <p>If you work with PostgreSQL databases and have a passion for open-source tools, you’ve likely heard of <strong>pgModeler</strong>—a powerful, multiplatform database modeler. But what if you need consistent, automated builds of this essential tool across major platforms like Linux, Windows, and macOS? Enter <a href="https://github.com/ccamacho/pgmodeler-builder"><strong>pgmodeler-builder</strong></a>, a repository designed to make monthly builds of pgModeler seamless and accessible.</p> <h3 id="what-is-pgmodeler-builder">What is <strong>pgmodeler-builder</strong>?</h3> <p><strong>pgmodeler-builder</strong> is a GitHub Actions-powered workflow designed to generate monthly builds of <strong>pgModeler</strong>. It ensures that the tool is always up-to-date for users on <strong>Linux</strong>, <strong>Windows</strong>, and <strong>macOS</strong>. Each release includes:</p> <ul> <li><strong>.tar.gz</strong> and <strong>.zip</strong> archives of the compiled sources.</li> <li><strong>.AppImage</strong> for Linux users.</li> <li><strong>.dmg</strong> for ARM-based macOS.</li> <li><strong>.exe</strong> for Windows.</li> </ul> <h3 id="why-pgmodeler">Why pgModeler?</h3> <p><strong>pgModeler</strong> is a standout in the PostgreSQL ecosystem.
As an open-source and cross-platform database modeler, it simplifies database design and management with features tailored for PostgreSQL. It’s the go-to solution for developers and DBAs who prioritize free and open-source software.</p> <h3 id="key-benefits-of-pgmodeler-builder">Key Benefits of pgmodeler-builder</h3> <p>This repository brings several advantages to the table:</p> <ol> <li><strong>Automated, Regular Builds</strong>: Ensure you’re using the latest version with scheduled monthly releases.</li> <li><strong>Cross-Platform Compatibility</strong>: Access builds tailored to your operating system.</li> <li><strong>Streamlined Updates</strong>: The repository makes it easy to add new releases or versions with minimal effort.</li> <li><strong>Community-Driven Improvements</strong>: Contributions are welcomed, allowing developers to refine and optimize workflows further.</li> </ol> LLM performance: Speculative decoding In the world of natural language processing (NLP), large language models (LLMs) like GPT-3 and GPT-4 have set new benchmarks in generating human-like text. However, as these models grow in size and complexity, the computational resources required to generate text,... 2024-07-01T00:00:00+00:00 https://www.pubstack.com/blog/2024/07/01/llm-speculative-decoding Carlos Camacho <p>In the world of natural language processing (NLP), large language models (LLMs) like GPT-3 and GPT-4 have set new benchmarks in generating human-like text. However, as these models grow in size and complexity, the computational resources required to generate text, especially in real-time applications, become a significant challenge. 
One of the techniques developed to address this challenge is <strong>speculative decoding</strong>.</p> <h2 id="speculative-decoding-enhancing-efficiency-in-large-language-models">Speculative Decoding: Enhancing Efficiency in Large Language Models</h2> <p>In this article, we’ll explore what speculative decoding is, how it works, and its impact on the efficiency of large language models.</p> <h2 id="what-is-speculative-decoding">What is Speculative Decoding?</h2> <p>Speculative decoding is a technique designed to accelerate the text generation process in large language models by making informed “guesses” about future tokens (words or subwords) in a sequence. The core idea is to leverage the model’s knowledge and certain statistical assumptions to predict parts of the output sequence in parallel, rather than generating tokens one at a time sequentially. This parallelization can lead to significant speed-ups, especially in latency-sensitive applications like chatbots, real-time translation, and voice assistants.</p> <h3 id="the-problem-with-sequential-decoding">The Problem with Sequential Decoding</h3> <p>In traditional autoregressive language models, text generation is inherently sequential. The model generates one token at a time, with each token conditioned on all previously generated tokens. For instance, when generating a sentence like “The cat sat on the mat,” the model first generates “The,” then “cat,” and so on. Each step involves calculating probabilities for the next token based on the previously generated tokens.</p> <p>While this approach ensures that each token is optimally generated given the context, it also means that the process is slow, particularly when working with long sequences or complex models. The time complexity of generating a sequence of <code>n</code> tokens in a sequential manner is <code>O(n)</code>, where each step requires a full pass through the model. 
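</p> <p>To make the sequential cost concrete, and to preview the draft-and-verify idea that speculative decoding builds on, here is a toy sketch in plain Python. The “models” are lookup-table stubs standing in for real networks, and all names (<code>target_next</code>, <code>draft_next</code>) are illustrative, not a real API:</p>

```python
# Toy contrast: sequential decoding vs. one speculative draft-and-verify step.
# The "models" are lookup tables; a real LLM would return token probabilities.

def target_next(context):
    """Stub for the large (target) model: one 'expensive' forward pass."""
    table = {"The": "cat", "cat": "sat", "sat": "on", "on": "the", "the": "mat"}
    return table.get(context[-1], "END")

def draft_next(context):
    """Stub for a small, cheap draft model that is sometimes wrong."""
    table = {"The": "cat", "cat": "sat", "sat": "on", "on": "a"}  # diverges at "on"
    return table.get(context[-1], "END")

def sequential_decode(prompt, n):
    """Standard autoregressive loop: n tokens cost n target-model passes."""
    ctx = list(prompt)
    for _ in range(n):
        ctx.append(target_next(ctx))
    return ctx

def speculative_step(prompt, k):
    """Draft k tokens cheaply, then verify them with the target model.

    The accepted output is the longest drafted prefix the target agrees
    with, plus one corrected target token at the first divergence.
    """
    ctx = list(prompt)
    drafted = []
    for _ in range(k):
        drafted.append(draft_next(ctx + drafted))
    accepted = []
    for tok in drafted:
        expected = target_next(ctx + accepted)  # in practice: one batched pass
        if tok == expected:
            accepted.append(tok)
        else:
            accepted.append(expected)  # backtrack: keep the target's token
            break
    return ctx + accepted

print(sequential_decode(["The"], 5))  # 5 expensive target passes
print(speculative_step(["The"], 4))   # 4 cheap draft passes + verification
```

<p>In the sketch the verification loop still calls the target stub once per drafted token; in a real implementation those calls collapse into a single batched forward pass, which is where the speed-up comes from.</p> <p>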
This limitation becomes more pronounced as the model size and the length of the generated text increase.</p> <h2 id="how-does-speculative-decoding-work">How Does Speculative Decoding Work?</h2> <p>Speculative decoding introduces parallelism into the text generation process by predicting multiple tokens at once. It involves generating “speculative” sequences or fragments of text that are likely to occur, based on the model’s understanding of the context and certain statistical heuristics.</p> <h3 id="steps-involved-in-speculative-decoding">Steps Involved in Speculative Decoding</h3> <ol> <li> <p><strong>Initial Prediction</strong>: The model starts by generating a few tokens using the traditional autoregressive approach. These tokens serve as a seed for speculative decoding.</p> </li> <li> <p><strong>Speculative Branching</strong>: Based on the initial tokens, the model speculatively generates multiple possible sequences (branches) for the next few tokens. Each branch is a possible continuation of the text, generated in parallel.</p> </li> <li> <p><strong>Pruning and Selection</strong>: Once the speculative branches are generated, the model evaluates them using a scoring mechanism (e.g., likelihood or a heuristic based on language structure). The most likely branch or branches are selected for further expansion.</p> </li> <li> <p><strong>Verification</strong>: The selected speculative sequence is then verified by generating the next token(s) using the standard autoregressive method. If the speculative sequence aligns with the sequentially generated sequence, it is accepted and integrated into the final output.</p> </li> <li> <p><strong>Backtracking</strong>: If the speculative sequence deviates significantly from what the sequential model would generate, the model can backtrack to the point of divergence and correct the sequence. 
This ensures that the final output maintains high quality and coherence.</p> </li> </ol> <h3 id="example-of-speculative-decoding">Example of Speculative Decoding</h3> <p>Imagine you are using a language model to generate a story. After generating the phrase “Once upon a time,” the model could speculatively generate several possible continuations like:</p> <ul> <li>“there was a king who ruled over a vast kingdom.”</li> <li>“a young girl found a mysterious key.”</li> <li>“an old man discovered a hidden treasure.”</li> </ul> <p>Each of these continuations could be scored based on how likely they are given the initial phrase. The model might select “there was a king who ruled” as the most probable continuation. It would then verify this by generating the next token (“over”) using the standard method. If everything aligns, it proceeds with the speculative branch, saving time by not generating each token one by one.</p> <h2 id="benefits-of-speculative-decoding">Benefits of Speculative Decoding</h2> <p>Speculative decoding offers several advantages, particularly in the context of large language models:</p> <h3 id="1-increased-speed">1. <strong>Increased Speed</strong></h3> <p>The most immediate benefit of speculative decoding is the speed-up in text generation. By generating multiple tokens in parallel, the model can produce text faster, which is crucial for applications where response time is critical. For example, in a real-time translation system, reducing latency can significantly improve user experience.</p> <h3 id="2-reduced-computational-load">2. <strong>Reduced Computational Load</strong></h3> <p>Although speculative decoding involves generating multiple branches, it can ultimately reduce the computational load by avoiding unnecessary calculations. Since speculative branches that deviate from the expected output are pruned early, the model doesn’t waste resources on less likely sequences.</p> <h3 id="3-improved-throughput">3. 
<strong>Improved Throughput</strong></h3> <p>In environments where multiple text generation tasks are running concurrently (e.g., cloud-based NLP services), speculative decoding can improve overall throughput. By handling multiple tokens at once, the system can serve more requests in a given time frame, making it more scalable.</p> <h3 id="4-enhanced-flexibility">4. <strong>Enhanced Flexibility</strong></h3> <p>Speculative decoding can be fine-tuned based on the application’s needs. For example, the number of speculative branches, the depth of branching, and the criteria for pruning can all be adjusted to balance between speed and quality. This flexibility allows developers to optimize performance for specific use cases.</p> <h2 id="challenges-and-considerations">Challenges and Considerations</h2> <p>While speculative decoding offers significant benefits, it also comes with challenges that need to be carefully managed:</p> <h3 id="1-trade-off-between-speed-and-accuracy">1. <strong>Trade-off Between Speed and Accuracy</strong></h3> <p>One of the primary challenges is balancing speed with accuracy. If the model speculatively generates too many tokens without sufficient verification, the quality of the generated text might suffer. On the other hand, excessive verification can negate the speed benefits.</p> <h3 id="2-complexity-in-implementation">2. <strong>Complexity in Implementation</strong></h3> <p>Implementing speculative decoding requires a more complex setup compared to traditional sequential decoding. It involves managing multiple speculative branches, scoring them, and handling backtracking when necessary. This added complexity can increase the development time and computational overhead.</p> <h3 id="3-resource-management">3. 
<strong>Resource Management</strong></h3> <p>While speculative decoding can reduce the time complexity of text generation, it may increase the overall resource usage during the speculative phase, particularly in terms of memory and computational power. Efficient management of these resources is crucial to maximize the benefits of speculative decoding.</p> <h3 id="4-quality-assurance">4. <strong>Quality Assurance</strong></h3> <p>Ensuring that the final output remains coherent and high-quality is essential. Speculative decoding must be carefully tuned to avoid producing text that is disjointed or semantically inconsistent. This requires rigorous testing and fine-tuning, especially for applications where accuracy is critical.</p> <h2 id="conclusion">Conclusion</h2> <p>Speculative decoding is a powerful technique that enhances the efficiency of large language models by introducing parallelism into the text generation process. By speculatively generating multiple tokens at once and selecting the most likely sequence, models can achieve faster response times without compromising too much on quality.</p> <p>As large language models continue to evolve, techniques like speculative decoding will play an increasingly important role in making these models more practical for real-world applications. While challenges remain in balancing speed, accuracy, and resource usage, the potential benefits make speculative decoding a valuable tool in the NLP toolkit. As researchers and developers continue to refine these methods, we can expect to see even more efficient and powerful language models in the future.</p> Sizing an LLM: Parameters Large language models (LLMs) like GPT-3 and GPT-4 have become a cornerstone of modern natural language processing (NLP) applications, driving advancements in machine translation, text generation, question answering, and more. These models are powered by neural networks with billions of... 
2024-05-01T00:00:00+00:00 https://www.pubstack.com/blog/2024/05/01/llm-parameters Carlos Camacho <p>Large language models (LLMs) like GPT-3 and GPT-4 have become a cornerstone of modern natural language processing (NLP) applications, driving advancements in machine translation, text generation, question answering, and more. These models are powered by neural networks with billions of parameters, making them incredibly powerful but also resource-intensive.</p> <h2 id="understanding-large-language-models-parameters-calculations-and-vram-requirements">Understanding Large Language Models: Parameters, Calculations, and VRAM Requirements</h2> <p>In this blog post, we’ll explore what parameters in large language models are, how they are calculated, and the implications of these calculations on the space required to load a model into VRAM (Video RAM).</p> <h2 id="what-are-large-language-models">What Are Large Language Models?</h2> <p>Large language models are deep learning models designed to understand and generate human language. They are typically based on transformer architectures, which allow the model to handle complex language tasks by capturing contextual information from input text sequences.</p> <h3 id="parameters-the-building-blocks-of-llms">Parameters: The Building Blocks of LLMs</h3> <p>In the context of neural networks, parameters refer to the weights and biases within the model. These are the elements that the model adjusts during training to minimize the difference between its predictions and the actual outcomes (i.e., to reduce the loss function). The parameters are what the model “learns” and are essential for the model’s ability to generalize from training data to unseen text.</p> <p>The number of parameters in a model is a key factor in its complexity and capacity. For example, GPT-3 has 175 billion parameters, while earlier versions like GPT-2 had 1.5 billion parameters. 
The sheer scale of these models means they can capture a wide range of linguistic patterns, making them versatile across different NLP tasks.</p> <h2 id="how-are-parameters-calculated">How Are Parameters Calculated?</h2> <p>Calculating the parameters of a language model involves understanding the architecture of the model. For transformer-based models, the parameters are primarily located in:</p> <ol> <li> <p><strong>Embedding Layers</strong>: These layers convert input tokens (words, subwords, or characters) into vectors of continuous numbers. The parameters here are the weights used to create these embeddings.</p> </li> <li> <p><strong>Attention Mechanisms</strong>: Transformers rely heavily on attention mechanisms, particularly self-attention, to determine the importance of each word in a sequence relative to others. The parameters in these mechanisms include weights that govern the interactions between different tokens.</p> </li> <li> <p><strong>Feedforward Layers</strong>: After the attention mechanism processes the input, the output is passed through feedforward neural networks, which consist of fully connected layers. 
Each of these layers has its own set of parameters (weights and biases).</p> </li> <li> <p><strong>Output Layers</strong>: Finally, the processed data is passed through the output layers, which convert the model’s internal representation back into a format that can be interpreted (e.g., predicting the next word in a sequence).</p> </li> </ol> <h3 id="example-calculation">Example Calculation</h3> <p>Let’s consider a simplified transformer model with the following characteristics:</p> <ul> <li><strong>Embedding Size</strong>: 512</li> <li><strong>Number of Layers</strong>: 12</li> <li><strong>Attention Heads</strong>: 8</li> <li><strong>Vocabulary Size</strong>: 50,000</li> </ul> <p>The total number of parameters can be broken down as follows:</p> <ol> <li><strong>Embedding Layer</strong>: <ul> <li>The embedding layer has <code>Vocabulary Size * Embedding Size</code> parameters.</li> <li>Example: 50,000 × 512 = 25,600,000 parameters.</li> </ul> </li> <li><strong>Attention Mechanism</strong>: <ul> <li>Each attention head operates on a slice of the embedding of size <code>Embedding Size / Attention Heads</code>.</li> <li>Example: with 8 heads, each head works on 512 / 8 = 64 dimensions.</li> <li>Counting one 512 × 64 projection per head, per layer: 12 × 8 × (512 × 64) = 3,145,728 parameters. (Real transformers use separate query, key, value, and output projections, so actual attention counts are several times larger; this simplified tally keeps the arithmetic short.)</li> </ul> </li> <li><strong>Feedforward Layers</strong>: <ul> <li>Each layer typically has <code>2 * Embedding Size^2</code> parameters.</li> <li>Example: 12 × 2 × 512 × 512 = 6,291,456 parameters.</li> </ul> </li> <li><strong>Output Layers</strong>: <ul> <li>The output layer usually has <code>Embedding Size * Vocabulary Size</code> parameters.</li> <li>Example: 512 × 50,000 = 25,600,000 parameters.</li> </ul> </li> </ol> <p>Adding these up, the total number of parameters for this simplified model is:</p> <p>25,600,000 + 3,145,728 + 6,291,456 + 25,600,000 = 60,637,184 parameters, or roughly 61 million.</p> <h2 id="vram-requirements-how-much-space-do-llms-need">VRAM Requirements: How Much Space Do LLMs Need?</h2> <p>The enormous number of parameters in large language models poses significant challenges for deployment, especially concerning memory requirements. VRAM, which is specialized memory used by GPUs (Graphics Processing Units), plays a critical role in running these models efficiently.</p> <h3 id="parameter-storage">Parameter Storage</h3> <p>Each parameter in a model is typically stored as a floating-point number, usually in 16-bit (FP16), 32-bit (FP32), or in some cases, 8-bit (INT8) precision. The choice of precision affects both the VRAM requirements and the performance of the model.</p> <ul> <li><strong>FP32 (32-bit precision)</strong>: Each parameter requires 4 bytes.</li> <li><strong>FP16 (16-bit precision)</strong>: Each parameter requires 2 bytes.</li> <li><strong>INT8 (8-bit precision)</strong>: Each parameter requires 1 byte.</li> </ul> <p>Using our previous example of a model with approximately 61 million parameters:</p> <ul> <li><strong>FP32</strong>: 60,637,184 × 4 = 242,548,736 bytes, or roughly 231 MB of VRAM.</li> <li><strong>FP16</strong>: 60,637,184 × 2 = 121,274,368 bytes, or roughly 116 MB of VRAM.</li> <li><strong>INT8</strong>: 60,637,184 × 1 = 60,637,184 bytes, or roughly 58 MB of VRAM.</li> </ul> <h3 id="additional-memory-requirements">Additional Memory Requirements</h3> <p>Besides storing the parameters, VRAM is also needed for:</p> <ol> <li> <p><strong>Activations</strong>: Intermediate activations during the forward and backward passes require additional memory. The size of activations can be substantial, especially for deep networks with many layers.</p> </li> <li> <p><strong>Optimizer States</strong>: When training a model, optimizers like Adam maintain additional states for each parameter, such as momentums and velocities, which also consume VRAM.</p> </li> <li> <p><strong>Batch Size</strong>: The size of the input batch processed by the model in one go also affects VRAM usage. Larger batch sizes require more memory for both the input data and the activations.</p> </li> </ol> <h3 id="estimating-vram-requirements">Estimating VRAM Requirements</h3> <p>To estimate the total VRAM required, you can use the following rule of thumb:</p> <p>Total VRAM ≈ Model Parameters + Activations + Optimizer States + Input Data</p> <p>For a model with 61 million parameters using FP32 precision, a small batch size, and moderate complexity, you might need:</p> <ul> <li><strong>Model Parameters</strong>: 231 MB (from above)</li> <li><strong>Activations</strong>: 200 MB (estimated)</li> <li><strong>Optimizer States</strong>: 462 MB (2 times model parameters, typical for Adam)</li> <li><strong>Input Data</strong>: 50 MB (depends on input size and batch size)</li> </ul> <p>Total estimated VRAM requirement: 231 + 200 + 462 + 50 = 943 MB.</p> <h2 id="conclusion">Conclusion</h2> <p>Large language models are incredibly powerful tools that rely on a vast number of parameters to perform complex language tasks. Understanding how these parameters are calculated and their impact on VRAM usage is crucial for effectively deploying these models in real-world applications. While the resource demands are significant, advances in model optimization, mixed precision training, and hardware capabilities are making it increasingly feasible to leverage these models in various settings.
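</p> <p>The rule of thumb for VRAM sizing is easy to turn into a quick calculator. Here is a minimal sketch in plain Python; the 7-billion-parameter model at the end is a hypothetical size chosen for illustration, and any activation or input figures passed in are rough estimates, not measurements:</p>

```python
# Back-of-the-envelope VRAM estimate:
#   total ~ parameters + activations + optimizer states + input data

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def param_bytes(n_params, precision):
    """Bytes needed just to store the weights at a given precision."""
    return n_params * BYTES_PER_PARAM[precision]

def training_vram_bytes(n_params, precision="fp32",
                        activations=0, input_data=0,
                        optimizer_multiplier=2):
    """Adam-style optimizers keep roughly two extra per-parameter states
    (momentum and variance), hence the default multiplier of 2."""
    weights = param_bytes(n_params, precision)
    return weights + optimizer_multiplier * weights + activations + input_data

GIB = 1024 ** 3
# Inference-only weight footprint of a hypothetical 7B-parameter model:
print(f"7B fp16 weights: {param_bytes(7_000_000_000, 'fp16') / GIB:.1f} GiB")
```

<p>The same function reproduces the FP32/FP16/INT8 comparison for any parameter count, which is handy when sizing a GPU before committing to a deployment.</p> <p>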
As technology continues to evolve, the balance between model size, performance, and resource efficiency will remain a critical area of focus in the development of large language models.</p> Sharing your GPU in the cloud In this post we will introduce different methods for sharing GPU resources across workloads in K8s clusters. GPU sharing refers to the practice of allowing multiple users or processes to access and utilize the resources of a single Graphics Processing... 2024-04-08T00:00:00+00:00 https://www.pubstack.com/blog/2024/04/08/sharing-your-gpu-in-the-cloud Carlos Camacho <p>In this post we will introduce different methods for sharing GPU resources across workloads in K8s clusters.</p> <p><img src="/static/gpu_post/cloud_lego.jpg" alt="" /></p> <p>GPU sharing refers to the practice of allowing multiple users or processes to access and utilize the resources of a single Graphics Processing Unit (GPU) concurrently. This approach can be beneficial in scenarios where there are multiple workloads that can utilize GPU resources intermittently or simultaneously, such as in cloud computing environments, data centers, or research institutions.</p> <h3 id="prerequisites">Prerequisites:</h3> <ul> <li>OpenShift 4.15 deployed.</li> <li>NFD operator.</li> <li>NVIDIA GPU operator installed from the master branch.</li> </ul> <div style="float: right; width: 230px; background: white;"><img src="/static/gpu_post/slice.png" alt="" style="border:15px solid #FFF" /></div> <p>Time-slicing is a technique used in GPU resource management where the available GPU resources are divided into time intervals, or “slices,” and allocated to different users or processes sequentially. Each user or process is granted access to the GPU for a specified duration, known as a time slice, before the GPU is relinquished and made available to the next user or process in the queue.
This method of GPU sharing is particularly useful in environments where there are multiple users or processes vying for access to limited GPU resources. By allocating GPU time slices to different users or processes, time-slicing ensures fair and efficient utilization of the GPU among multiple competing workloads.</p> <div style="float: left; width: 230px; background: white;"><img src="/static/gpu_post/instance.jpg" alt="" style="border:15px solid #FFF" /></div> <p>Multi-Instance GPU (MIG) revolutionizes GPU utilization in data center environments by allowing a single physical GPU to be partitioned into multiple isolated instances, each with its own dedicated compute resources, memory, and performance profiles. MIG enables efficient sharing of GPU resources among multiple users or workloads by providing predictable performance guarantees and workload isolation. With MIG, administrators can dynamically adjust the number and size of MIG instances to adapt to changing workload demands, ensuring optimal resource utilization and scalability. This innovative technology enhances flexibility, efficiency, and performance in GPU-accelerated computing environments, enabling organizations to meet the diverse needs of applications and users while maximizing GPU utilization.</p> <div style="float: right; width: 230px; background: white;"><img src="/static/gpu_post/semaphore.png" alt="" style="border:15px solid #FFF" /></div> <p>Multi-Process Service (MPS) facilitates the concurrent sharing of a single GPU among multiple CUDA applications. By allowing the GPU to swiftly transition between various CUDA contexts, MPS optimizes GPU resource utilization across multiple processes. This capability enables efficient allocation of GPU resources to different applications running simultaneously, ensuring that the GPU is effectively utilized even when serving multiple workloads concurrently.
MPS enhances GPU efficiency by dynamically managing CUDA contexts, enabling seamless context switching and minimizing overhead, thus maximizing GPU throughput and responsiveness in multi-application environments.</p> <div style="display: flex; justify-content: center;"> <div style="width: 230px; background: white;"><img src="/static/gpu_post/gpu_lego.png" alt="" style="border:15px solid #FFF" /></div> </div> <p>Overall, these techniques aim to improve the efficiency of GPU utilization and accommodate the diverse needs of users or processes sharing the GPU resources.</p> <p>Use these shortcuts to jump to the strategies reviewed below:</p> <h3 id="strategies">Strategies</h3> <ol> <li><a href="#quick-check">Quick check</a></li> <li><a href="#time-slicing">Enabling Time-slicing</a></li> <li><a href="#mig">Enabling Multi-Instance GPU (MIG)</a></li> <li><a href="#mps">Enabling MPS</a></li> <li><a href="#reset">Reset</a></li> </ol> <h2 id="quick-check-">Quick check <a name="quick-check"></a></h2> <p>Before continuing, let’s check that the GPU is detected and configured.</p> <pre><code class="language-bash">NODE=my-host-example-com kubectl label --list nodes $NODE | \ grep nvidia.com </code></pre> <h2 id="enabling-time-slicing-">Enabling Time-slicing <a name="time-slicing"></a></h2> <div style="float: right; width: 150px; background: white;"><img src="/static/gpu_post/icon.png" alt="" style="border:15px solid #FFF" /></div> <p>The following example defines, in a config map, how we will share the GPU (this won’t have any effect on the cluster yet).</p> <pre><code class="language-bash">cat &lt;&lt; EOF | kubectl apply -f - --- apiVersion: v1 kind: ConfigMap metadata: # name: device-plugin-config # NVIDIA name: time-slicing-config #OpenShift namespace: nvidia-gpu-operator data: NVIDIA-A100-PCIE-40GB: |- version: v1 sharing: timeSlicing: resources: - name: nvidia.com/gpu replicas: 10 EOF </code></pre> <p>With the resource created, we need to patch
the initial ClusterPolicy from the GPU operator, <code>gpu-cluster-policy</code>.</p> <pre><code class="language-bash">oc patch clusterpolicy \ gpu-cluster-policy \ -n nvidia-gpu-operator \ --type merge \ -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config"}}}}' </code></pre> <p>The previous example will update the <code>devicePlugin</code> configuration in the <code>gpu-cluster-policy</code>. When inspecting the cluster policy it should look like this:</p> <pre><code class="language-bash">devicePlugin: config: default: '' name: time-slicing-config enabled: true mps: root: /run/nvidia/mps </code></pre> <p>This tells the device plugin to fetch its configuration from the previously created config map <code>time-slicing-config</code>.</p> <p>We then label the node we want to configure so that <code>device-plugin.config</code> points to the <code>NVIDIA-A100-PCIE-40GB</code> entry of that config map:</p> <pre><code class="language-bash">oc label \ --overwrite node my-host-example-com \ nvidia.com/device-plugin.config=NVIDIA-A100-PCIE-40GB </code></pre> <p>After a few minutes we can check that the node was reconfigured to use time-slicing:</p> <pre><code class="language-bash">oc get node \ --selector=nvidia.com/gpu.product=NVIDIA-A100-PCIE-40GB-SHARED \ -o json | jq '.items[0].status.capacity' </code></pre> <pre><code class="language-bash">{ "cpu": "128", "ephemeral-storage": "3123565732Ki", "hugepages-1Gi": "0", "hugepages-2Mi": "0", "memory": "527845520Ki", "nvidia.com/gpu": "8", "pods": "250" } </code></pre> <p>Now we test that we can schedule the shared GPU to a pod (or many slices).
Note that we request the GPU through the resource limits instead of a node selector: the GPU product label changed when sharing was enabled (note the <code>-SHARED</code> suffix above), so we assume we don’t know the new name.</p> <pre><code class="language-bash">cat &lt;&lt; EOF | kubectl create -f - --- apiVersion: v1 kind: Pod metadata: # generateName: command-nvidia-smi- name: command-nvidia-smi spec: restartPolicy: Never containers: - name: cuda-container image: nvcr.io/nvidia/cuda:12.1.0-base-ubi8 command: ["/bin/sh","-c"] args: ["nvidia-smi"] resources: limits: nvidia.com/gpu: 1 # requesting 1 GPU # nodeSelector: # nvidia.com/gpu.product: NVIDIA-A100-PCIE-40GB EOF </code></pre> <pre><code class="language-bash">pod/command-nvidia-smi created ccamacho@guateque:~/dev/$ kubectl logs command-nvidia-smi Mon Apr 8 10:01:26 2024 +-----------------------------------------------------------------------------------------+ | NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 | |-----------------------------------------+------------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M.
| |=========================================+========================+======================| | 0 NVIDIA A100-PCIE-40GB On | 00000000:0A:00.0 Off | 0 | | N/A 35C P0 38W / 250W | 39023MiB / 40960MiB | 0% Default | | | | Disabled | +-----------------------------------------+------------------------+----------------------+ +-----------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=========================================================================================| +-----------------------------------------------------------------------------------------+ </code></pre> <p>In the previous case a single pod was scheduled and ran to completion; now let’s see how we can fill the 8 slices that were created before.</p> <p>We can create a deployment with 11 pods that will run <a href="https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#cuda-test-generator-dcgmproftester">dcgmproftester12</a>. In this case dcgmproftester12 will generate load as a half-precision matrix-multiply-accumulate for the Tensor Cores (<code>-t 1004</code>), in 30-second bursts (<code>-d 30</code>) inside an endless loop.</p> <pre><code class="language-bash">cat &lt;&lt; EOF | kubectl apply -f - --- apiVersion: apps/v1 kind: Deployment metadata: name: nvidia-plugin-test labels: app: nvidia-plugin-test spec: replicas: 11 selector: matchLabels: app: nvidia-plugin-test template: metadata: labels: app: nvidia-plugin-test spec: tolerations: - key: nvidia.com/gpu operator: Exists effect: NoSchedule containers: - name: dcgmproftester12 image: nvcr.io/nvidia/cloud-native/dcgm:3.3.5-1-ubi9
command: ["/bin/sh", "-c"] args: - while true; do /usr/bin/dcgmproftester12 --no-dcgm-validation -t 1004 -d 30; sleep 30; done resources: limits: # To check gpu vs gpu.shared (MPS) nvidia.com/gpu: "1" securityContext: privileged: true # The following SYS_ADMIN capability is not enough # securityContext: # capabilities: # add: ["SYS_ADMIN"] EOF </code></pre> <p>All 8 of the slices created before are now in use (8 pods running and 3 pending).</p> <pre><code class="language-bash">ccamacho@guateque:~/dev$ kubectl get pods --selector=app=nvidia-plugin-test NAME READY STATUS RESTARTS AGE nvidia-plugin-test-6988cc8f8f-22gt2 1/1 Running 0 53s nvidia-plugin-test-6988cc8f8f-45dgd 1/1 Running 0 53s nvidia-plugin-test-6988cc8f8f-gjrg8 1/1 Running 0 53s nvidia-plugin-test-6988cc8f8f-lwlvh 1/1 Running 0 53s nvidia-plugin-test-6988cc8f8f-rq6hx 1/1 Running 0 53s nvidia-plugin-test-6988cc8f8f-tgkxm 1/1 Running 0 53s nvidia-plugin-test-6988cc8f8f-wdc7l 1/1 Running 0 53s nvidia-plugin-test-6988cc8f8f-wmzz9 1/1 Running 0 53s nvidia-plugin-test-6988cc8f8f-wn8jt 0/1 Pending 0 53s nvidia-plugin-test-6988cc8f8f-x5lbp 0/1 Pending 0 53s nvidia-plugin-test-6988cc8f8f-zdvwn 0/1 Pending 0 53s </code></pre> <p>If the pods fail with <code>is forbidden: unable to validate against any security context constraint: [spec.initContainers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed spec.init</code>, allow privileged containers for the namespace’s default service account:</p> <pre><code class="language-bash">oc adm policy add-scc-to-user privileged -z default -n &lt;namespace&gt; </code></pre> <p>After <code>dcgmproftester12</code> finishes, the pod logs show:</p> <pre><code class="language-bash">Skipping CreateDcgmGroups() since DCGM validation is disabled Worker 0:0 [1004]: TensorEngineActive: generated ???, dcgm 0 (1.78e+05 gflops). . . . Worker 0:0 [1004]: TensorEngineActive: generated ???, dcgm 0 (1.89e+05 gflops). Worker 0:0 [1004]: TensorEngineActive: generated ???, dcgm 0 (1.89e+05 gflops).
Worker 0:0 [1004]: TensorEngineActive: generated ???, dcgm 0 (1.89e+05 gflops).
Worker 0:0 [1004]: TensorEngineActive: generated ???, dcgm 0 (1.89e+05 gflops).
Worker 0:0 [1004]: TensorEngineActive: generated ???, dcgm 0 (1.89e+05 gflops).
Worker 0:0[1004]: Message: Bus ID 00000000:0A:00.0 mapped to cuda device ID 0
DCGM CudaContext Init completed successfully.
CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR: 2048
CUDA_VISIBLE_DEVICES: GPU-539a31f4-1487-3a42-cc55-910b7bf82d0c
CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT: 108
CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR: 167936
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR: 8
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR: 0
CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH: 5120
CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE: 1215
Max Memory bandwidth: 1555200000000 bytes (1555.2 GiB)
CU_DEVICE_ATTRIBUTE_ECC_SUPPORT: true
</code></pre>
<pre><code class="language-bash">oc rsh \
  -n nvidia-gpu-operator \
  $(kubectl get pods -n nvidia-gpu-operator | grep -E 'nvidia-dcgm-exporter.*' | awk '{print $1}') nvidia-smi
</code></pre>
<h2 id="enabling-multi-instance-gpu-mig-">Enabling Multi-Instance GPU (MIG) <a name="mig"></a></h2>
<div style="float: right; width: 150px; background: white;"><img src="/static/gpu_post/motherboard_gpu.png" alt="" style="border:15px solid #FFF" /></div>
<p>As with time-slicing, the configuration is adjusted by creating config maps and updating the cluster policy so the changes are applied correctly.
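</p>

<p>Throughout this section, MIG profiles are named with the pattern <code>&lt;COMPUTE&gt;g.&lt;MEMORY&gt;gb</code>; for example, <code>3g.20gb</code> means 3 compute slices backed by a 20 GB framebuffer slice. A tiny shell helper, purely illustrative and not part of any NVIDIA tooling, decodes the notation:</p>

```shell
# Decode a MIG profile name of the form <COMPUTE>g.<MEMORY>gb,
# e.g. "3g.20gb" -> 3 compute slices, 20 GB of framebuffer.
decode_mig_profile() {
  local profile=$1
  local slices=${profile%%g.*}   # text before "g." -> compute slices
  local mem=${profile#*g.}       # text after "g."  -> "20gb"
  mem=${mem%gb}                  # strip the "gb" suffix
  echo "${slices} compute slice(s), ${mem} GB memory"
}

decode_mig_profile 3g.20gb   # -> 3 compute slice(s), 20 GB memory
```

<p>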
The labeling step done in time-slicing needs to be executed only once.</p>
<p>Depending on the GPU, different <a href="https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html">MIG partitions can be created</a>:</p>
<table>
<thead>
<tr> <th>Product</th> <th>Architecture</th> <th>Microarchitecture</th> <th>Compute Capability</th> <th>Memory Size</th> <th>Max Number of Instances</th> </tr>
</thead>
<tbody>
<tr> <td>H100-SXM5</td> <td>Hopper</td> <td>GH100</td> <td>9.0</td> <td>80GB</td> <td>7</td> </tr>
<tr> <td>H100-PCIE</td> <td>Hopper</td> <td>GH100</td> <td>9.0</td> <td>80GB</td> <td>7</td> </tr>
<tr> <td>H100-SXM5</td> <td>Hopper</td> <td>GH100</td> <td>9.0</td> <td>94GB</td> <td>7</td> </tr>
<tr> <td>H100-PCIE</td> <td>Hopper</td> <td>GH100</td> <td>9.0</td> <td>94GB</td> <td>7</td> </tr>
<tr> <td>H100 on GH200</td> <td>Hopper</td> <td>GH100</td> <td>9.0</td> <td>96GB</td> <td>7</td> </tr>
<tr> <td>A100-SXM4</td> <td>NVIDIA Ampere</td> <td>GA100</td> <td>8.0</td> <td>40GB</td> <td>7</td> </tr>
<tr> <td>A100-SXM4</td> <td>NVIDIA Ampere</td> <td>GA100</td> <td>8.0</td> <td>80GB</td> <td>7</td> </tr>
<tr> <td>A100-PCIE</td> <td>NVIDIA Ampere</td> <td>GA100</td> <td>8.0</td> <td>40GB</td> <td>7</td> </tr>
<tr> <td>A100-PCIE</td> <td>NVIDIA Ampere</td> <td>GA100</td> <td>8.0</td> <td>80GB</td> <td>7</td> </tr>
<tr> <td>A30</td> <td>NVIDIA Ampere</td> <td>GA100</td> <td>8.0</td> <td>24GB</td> <td>4</td> </tr>
</tbody>
</table>
<p>In the case of the A100, the following profiles are available:</p>
<table>
<thead>
<tr> <th>Profile</th> <th>Memory</th> <th>Compute Units</th> <th>Maximum number of homogeneous instances</th> </tr>
</thead>
<tbody>
<tr> <td>1g.5gb</td> <td>5 GB</td> <td>1</td> <td>7</td> </tr>
<tr> <td>2g.10gb</td> <td>10 GB</td> <td>2</td> <td>3</td> </tr>
<tr> <td>3g.20gb</td> <td>20 GB</td> <td>3</td> <td>2</td> </tr>
<tr> <td>4g.20gb</td> <td>20 GB</td> <td>4</td> <td>1</td> </tr>
<tr> <td>7g.40gb</td> <td>40 GB</td> <td>7</td> <td>1</td>
</tr>
</tbody>
</table>
<p>We can check that by running:</p>
<pre><code class="language-bash">oc rsh \
  -n nvidia-gpu-operator \
  $(kubectl get pods -n nvidia-gpu-operator | grep -E 'nvidia-dcgm-exporter.*' | awk '{print $1}') nvidia-smi mig -lgip
</code></pre>
<p>The profiles are described with the notation <code>&lt;COMPUTE&gt;.&lt;MEMORY&gt;</code>; an administrator creates a set of profiles that can then be consumed by the workloads.</p>
<p>Now let's check how many MIG-enabled devices are available.</p>
<pre><code class="language-bash">oc rsh \
  -n nvidia-gpu-operator \
  $(kubectl get pods -n nvidia-gpu-operator | grep -E 'nvidia-driver-daemonset.*' | awk '{print $1}') nvidia-smi mig -lgi
</code></pre>
<p>This should output <code>No MIG-enabled devices found.</code> because we haven't created any devices yet.</p>
<p><a href="https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/mig-ocp.html#configuring-mig-devices-in-openshift">The available strategies</a> (it is important that <code>A GEOMETRY MUST BE DEFINED PER CLUSTER</code>) are <strong>single</strong>, where all GPUs within the same node expose the same geometry (i.e. 1g.5gb), and <strong>mixed</strong>, a heterogeneous advertisement where, on a single node with multiple GPUs, each GPU can be configured with a different MIG geometry.</p>
<p>Let's see our MIG-related labels:</p>
<pre><code class="language-bash">kubectl get node -o json | \
  jq '.items[0].metadata.labels | with_entries(select(.key | startswith("nvidia.com")))'
</code></pre>
<p>The MIG-related labels are:</p>
<pre><code class="language-bash">. . .
"nvidia.com/mig.capable": "true",
"nvidia.com/mig.config": "all-disabled",
"nvidia.com/mig.config.state": "success",
"nvidia.com/mig.strategy": "single",
"nvidia.com/mps.capable": "false"
</code></pre>
<pre><code class="language-bash">cat &lt;&lt; EOF | kubectl apply -f -
---
apiVersion: v1
kind: ConfigMap
metadata:
  # name: device-plugin-config # NVIDIA
  name: mig-profiles-config # OpenShift
  namespace: nvidia-gpu-operator
data:
  NVIDIA-A100-PCIE-40GB: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 2
        - name: nvidia.com/mig-1g.5gb
          replicas: 2
        - name: nvidia.com/mig-2g.10gb
          replicas: 2
        - name: nvidia.com/mig-3g.20gb
          replicas: 2
        - name: nvidia.com/mig-7g.40gb
          replicas: 2
EOF
</code></pre>
<pre><code class="language-bash">oc patch clusterpolicy gpu-cluster-policy \
  -n nvidia-gpu-operator --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "mig-profiles-config"}}}}'
</code></pre>
<p>The device plugin config already points to this GPU name, <code>NVIDIA-A100-PCIE-40GB</code>, from when we configured time-slicing.</p>
<pre><code class="language-bash">oc label \
  --overwrite node my-host-example-com \
  nvidia.com/device-plugin.config=NVIDIA-A100-PCIE-40GB
</code></pre>
<p>We patch the cluster policy to enable a <code>mixed</code> strategy.</p>
<pre><code class="language-bash">STRATEGY=mixed
oc patch clusterpolicy/gpu-cluster-policy \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/mig/strategy", "value": "'$STRATEGY'"}]'
</code></pre>
<p>We apply the MIG partitioning profiles; relabel the node whenever you need to switch between different MIG profiles:</p>
<pre><code class="language-bash">NODE_NAME=my-host-example-com
MIG_CONFIGURATION=all-disabled
# MIG_CONFIGURATION=all-1g.5gb
# MIG_CONFIGURATION=all-1g.10gb
# MIG_CONFIGURATION=all-2g.10gb
# MIG_CONFIGURATION=all-3g.20gb
# MIG_CONFIGURATION=all-7g.40gb
oc label node/$NODE_NAME nvidia.com/mig.config=$MIG_CONFIGURATION --overwrite=true
# or
kubectl label nodes $NODE_NAME \
  nvidia.com/mig.config=$MIG_CONFIGURATION --overwrite
</code></pre>
<p>Wait for the mig-manager to perform the reconfiguration:</p>
<pre><code class="language-bash">oc -n nvidia-gpu-operator logs ds/nvidia-mig-manager --all-containers -f --prefix
</code></pre>
<p>Get all the MIG-enabled devices.</p>
<pre><code class="language-bash">oc rsh \
  -n nvidia-gpu-operator \
  $(kubectl get pods -n nvidia-gpu-operator | grep -E 'nvidia-driver-daemonset.*' | awk '{print $1}') nvidia-smi mig -lgi
</code></pre>
<p>If this returns <code>No MIG-enabled devices found.</code>, something is wrong and needs to be debugged.</p>
<p>Otherwise:</p>
<pre><code class="language-bash">+-------------------------------------------------------+
| GPU instances:                                        |
| GPU   Name          Profile  Instance   Placement     |
|       ID                     ID         Start:Size    |
|=======================================================|
|   0  MIG 3g.20gb    9        1          4:4           |
+-------------------------------------------------------+
|   0  MIG 3g.20gb    9        2          0:4           |
+-------------------------------------------------------+
</code></pre>
<p>Make sure the capabilities match:</p>
<pre><code class="language-bash">kubectl describe nodes my-host-example-com | grep "nvidia.com/"
</code></pre>
<p>Note that <code>nvidia.com/gpu</code> is no longer used; <code>nvidia.com/mig-3g.20gb</code> is exposed instead.</p>
<pre><code class="language-bash">nvidia.com/mig-3g.20gb.slices.gi=3
nvidia.com/mig.capable=true
nvidia.com/mig.config=all-3g.20gb
nvidia.com/mig.config.state=success
nvidia.com/mig.strategy=mixed
nvidia.com/mps.capable=false
nvidia.com/gpu-driver-upgrade-enabled: true
nvidia.com/gpu: 0
nvidia.com/mig-3g.20gb: 4
nvidia.com/gpu: 0
nvidia.com/mig-3g.20gb: 4
nvidia.com/gpu 0 0
nvidia.com/mig-3g.20gb 0 0
</code></pre>
<p>Let's test it:</p>
<pre><code class="language-bash">cat &lt;&lt; EOF | kubectl apply -f -
---
apiVersion: v1
kind: Pod
metadata:
  name: test-1
  namespace: model-serving-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: gpu-test
    image: nvcr.io/nvidia/pytorch:23.07-py3
    command:
      ["nvidia-smi"]
    args: ["-L"]
    resources:
      limits:
        nvidia.com/mig-3g.20gb: 1
EOF
</code></pre>
<p>The pod should be scheduled, and its logs should show something like:</p>
<pre><code class="language-bash">GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-539a31f4-1487-3a42-cc55-910b7bf82d0c)
  MIG 3g.20gb     Device  0: (UUID: MIG-e3864840-4bd8-5b03-9ed2-59a7f1d73b5f)
</code></pre>
<h2 id="enabling-mps-">Enabling MPS <a name="mps"></a></h2>
<div style="float: right; width: 150px; background: white;"><img src="/static/gpu_post/multi.png" alt="" style="border:15px solid #FFF" /></div>
<p>The process is similar to enabling time-slicing: we create a config map with the MPS configuration. The labeling step done for time-slicing needs to be executed only once.</p>
<pre><code class="language-bash">cat &lt;&lt; EOF | kubectl apply -f -
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: mps-enable-config # OpenShift
  namespace: nvidia-gpu-operator
data:
  NVIDIA-A100-PCIE-40GB: |-
    version: v1
    sharing:
      mps:
        # renameByDefault: false # if we rename the resources they will move
        # to gpu.shared vs gpu
        resources:
        - name: nvidia.com/gpu
          replicas: 10
EOF
</code></pre>
<p>We patch the cluster policy to apply the configuration from the new config map:</p>
<pre><code class="language-bash">oc patch clusterpolicy \
  gpu-cluster-policy \
  -n nvidia-gpu-operator \
  --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "mps-enable-config"}}}}'
</code></pre>
<p>With the cluster policy patched, we don't have to relabel the node because it was done in a previous step:</p>
<pre><code class="language-bash">oc label \
  --overwrite node my-host-example-com \
  nvidia.com/device-plugin.config=NVIDIA-A100-PCIE-40GB
</code></pre>
<p>To trigger the NFD <code>gpu-feature-discovery-XXX</code> pod, <code>YOU NEED TO MAKE SURE THE CLUSTER POLICY IS PATCHED</code>.
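</p>

<p>As a back-of-the-envelope sketch (assuming an even split across replicas, which is not an official formula): with <code>replicas: 10</code>, the MPS control daemon gives each client an equal share of the GPU's SMs and an equal slice of its 40 GB framebuffer. This is where the <code>set_default_active_thread_percentage 10</code> and <code>set_default_device_pinned_mem_limit 0 4096M</code> lines in the MPS control daemon logs shown further below come from:</p>

```shell
# Sketch: per-client MPS defaults derived from "replicas: 10" on a
# 40 GB A100, assuming an even split across replicas.
REPLICAS=10
GPU_MEMORY_MIB=40960  # NVIDIA A100-PCIE-40GB framebuffer

# Share of the SMs each MPS client may use.
ACTIVE_THREAD_PERCENTAGE=$((100 / REPLICAS))
# Pinned-memory limit (framebuffer slice) per client.
PINNED_MEM_LIMIT_MIB=$((GPU_MEMORY_MIB / REPLICAS))

echo "set_default_active_thread_percentage ${ACTIVE_THREAD_PERCENTAGE}"
echo "set_default_device_pinned_mem_limit 0 ${PINNED_MEM_LIMIT_MIB}M"
```

<p>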
Now the product name should change from <code>-SHARED</code> (from time-slicing) back to the original name.</p>
<pre><code class="language-bash">oc get node \
  --selector=nvidia.com/gpu.product=NVIDIA-A100-PCIE-40GB \
  -o json | jq '.items[0].status.capacity'
</code></pre>
<p>Then we query the GPU labels:</p>
<pre><code class="language-bash">kubectl describe node my-host-example-com | awk '/^Labels:/,/^Taints:/' | grep nvidia.com
</code></pre>
<p>And these should be available:</p>
<pre><code class="language-bash">nvidia.com/gpu.sharing-strategy=mps
nvidia.com/mps.capable=true
</code></pre>
<p>The MPS Control Daemon should be up and running:</p>
<pre><code class="language-bash">pod_name=$(kubectl get pods -n nvidia-gpu-operator -o=name | grep 'nvidia-device-plugin-mps-control-daemon-' | tail -n 1 | cut -d '/' -f 2)
kubectl logs -n nvidia-gpu-operator $pod_name
</code></pre>
<pre><code class="language-bash">nvidia-device-plugin-mps-control-daemon-9ff2n
I0405 12:49:04.996976      46 main.go:78] Starting NVIDIA MPS Control Daemon d838ad11 commit: d838ad11d323a71f19c136fabbf65c9e23b2ae81
I0405 12:49:04.997136      46 main.go:55] "Starting NVIDIA MPS Control Daemon" version=&lt;
        d838ad11
        commit: d838ad11d323a71f19c136fabbf65c9e23b2ae81
&gt;
I0405 12:49:04.997150      46 main.go:107] Starting OS watcher.
I0405 12:49:04.997615      46 main.go:121] Starting Daemons.
I0405 12:49:04.997654      46 main.go:164] Loading configuration.
I0405 12:49:04.998024      46 main.go:172] Updating config with default resource matching patterns.
I0405 12:49:04.998064      46 main.go:183] Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": null,
    "gdsEnabled": null,
    "mofedEnabled": null,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": null,
      "deviceListStrategy": null,
      "deviceIDStrategy": null,
      "cdiAnnotationPrefix": null,
      "nvidiaCTKPath": null,
      "containerDriverRoot": null
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {},
    "mps": {
      "renameByDefault": true,
      "failRequestsGreaterThanOne": true,
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "rename": "nvidia.com/gpu.shared",
          "devices": "all",
          "replicas": 10
        }
      ]
    }
  }
}
I0405 12:49:04.998071      46 main.go:187] Retrieving MPS daemons.
I0405 12:49:05.073068      46 daemon.go:93] "Staring MPS daemon" resource="nvidia.com/gpu.shared"
I0405 12:49:05.113709      46 daemon.go:131] "Starting log tailer" resource="nvidia.com/gpu.shared"
[2024-04-05 12:49:05.086 Control    61] Starting control daemon using socket /mps/nvidia.com/gpu.shared/pipe/control
[2024-04-05 12:49:05.086 Control    61] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/mps/nvidia.com/gpu.shared/pipe
[2024-04-05 12:49:05.100 Control    61] Accepting connection...
[2024-04-05 12:49:05.100 Control    61] NEW UI
[2024-04-05 12:49:05.100 Control    61] Cmd:set_default_device_pinned_mem_limit 0 4096M
[2024-04-05 12:49:05.100 Control    61] UI closed
[2024-04-05 12:49:05.113 Control    61] Accepting connection...
[2024-04-05 12:49:05.113 Control    61] NEW UI
[2024-04-05 12:49:05.113 Control    61] Cmd:set_default_active_thread_percentage 10
[2024-04-05 12:49:05.113 Control    61] 10.0
[2024-04-05 12:49:05.113 Control    61] UI closed
</code></pre>
<p>Make sure the capabilities match:</p>
<pre><code class="language-bash">kubectl describe nodes my-host-example-com | grep "nvidia.com/"
</code></pre>
<pre><code class="language-bash">nvidia.com/device-plugin.config=NVIDIA-A100-PCIE-40GB
nvidia.com/gfd.timestamp=1715175821
nvidia.com/gpu.product=NVIDIA-A100-PCIE-40GB
nvidia.com/gpu.replicas=10
nvidia.com/gpu.sharing-strategy=mps
nvidia.com/gpu: 2
nvidia.com/gpu.shared: 20
nvidia.com/gpu: 0
nvidia.com/gpu.shared: 20
nvidia.com/gpu 0 0
nvidia.com/gpu.shared 0 0
</code></pre>
<p>Let's see how to schedule a pod using MPS:</p>
<pre><code class="language-bash">cat &lt;&lt; EOF | kubectl create -f -
---
apiVersion: v1
kind: Pod
metadata:
  generateName: command-nvidia-smi-
  # name: command-nvidia-smi
spec:
  hostIPC: true
  # securityContext:
  #   runAsUser: 1000
  # restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/cuda:12.1.0-base-ubi8
    command: ["/bin/sh","-c"]
    args: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: "1"
  # nodeSelector:
  #   nvidia.com/gpu.product: NVIDIA-A100-PCIE-40GB
EOF
</code></pre>
<pre><code class="language-bash">cat &lt;&lt; EOF | kubectl create -f -
---
apiVersion: v1
kind: Pod
metadata:
  generateName: dcgmproftester12-
  # name: dcgmproftester12
  namespace: nvidia-gpu-operator
spec:
  hostIPC: true
  # securityContext:
  #   runAsUser: 0
  # restartPolicy: Never
  containers:
  - name: dcgmproftester12
    image: nvcr.io/nvidia/cloud-native/dcgm:3.3.5-1-ubi9
    command: ["/bin/sh", "-c"]
    args:
    - while true; do /usr/bin/dcgmproftester12 --no-dcgm-validation -t 1004 -d 300; sleep 30; done
    securityContext:
      privileged: true
      # The following SYS_ADMIN capability is not enough
      capabilities:
        add: ["SYS_ADMIN"]
    resources:
      limits:
        nvidia.com/gpu: "1"
  # nodeSelector:
  #   nvidia.com/gpu.product:
  #     NVIDIA-A100-PCIE-40GB
EOF
</code></pre>
<p>From the host:</p>
<pre><code class="language-bash">sudo chroot /run/nvidia/driver nvidia-smi
tail -f /run/nvidia/mps/nvidia.com/gpu/log/server.log
</code></pre>
<p>If it works you should see something like:</p>
<pre><code class="language-bash">sudo chroot /run/nvidia/driver nvidia-smi
Thu May  9 18:01:08 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:0A:00.0 Off |                    0 |
| N/A   55C    P0            121W / 250W  |    495MiB /  40960MiB  |    100%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  |   00000000:AE:00.0 Off |                    0 |
| N/A   38C    P0             62W / 250W  |     38MiB /  40960MiB  |      0%   E.
Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     543063      C   nvidia-cuda-mps-server                        30MiB |
|    0   N/A  N/A     686657    M+C   /usr/bin/dcgmproftester12                    456MiB |
|    1   N/A  N/A     543063      C   nvidia-cuda-mps-server                        30MiB |
+-----------------------------------------------------------------------------------------+
</code></pre>
<p>Both <code>dcgmproftester12</code> and <code>nvidia-cuda-mps-server</code> are running.</p>
<p>And a deployment to schedule multiple pods:</p>
<pre><code class="language-bash">cat &lt;&lt; EOF | kubectl apply -f -
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nvidia-plugin-test
  labels:
    app: nvidia-plugin-test
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nvidia-plugin-test
  template:
    metadata:
      labels:
        app: nvidia-plugin-test
    spec:
      tolerations:
      - key: nvidia.com/gpu.shared
        operator: Exists
        effect: NoSchedule
      containers:
      - name: dcgmproftester12
        image: nvcr.io/nvidia/cloud-native/dcgm:3.3.5-1-ubi9
        # What is the difference between nvcr.io/nvidia/cloud-native and nvcr.io/nvidia/k8s?
        command: ["/bin/sh", "-c"]
        args:
        - while true; do /usr/bin/dcgmproftester12 --no-dcgm-validation -t 1004 -d 30; sleep 30; done
        resources:
          limits:
            nvidia.com/gpu: "1"
        securityContext:
          privileged: true
          # The following SYS_ADMIN capability is not enough
          # securityContext:
          #   capabilities:
          #     add: ["SYS_ADMIN"]
EOF
</code></pre>
<h2 id="reset-">Reset <a name="reset"></a></h2>
<p>To reset the cluster and un-share the GPUs, proceed as follows.</p>
<p>First, make sure no pods are using the GPU.
The following snippet fetches, from any namespace, all pods requesting a GPU via <code>resources.limits.nvidia\.com/gpu: "1"</code> in the pod spec.</p>
<pre><code class="language-bash">kubectl get pods --all-namespaces -o=jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}' \
  | while read -r namespace pod; do \
      kubectl get pod "$pod" -n "$namespace" -o=jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits.nvidia\.com/gpu}{"\n"}{end}' \
      | grep '1' &gt;/dev/null &amp;&amp; \
      printf '%s\t%s\n' "$namespace" "$pod"; \
    done
</code></pre>
<p>The previous command shouldn't return any pod name; otherwise, you still have pods requesting GPU access.</p>
<p>Recreate the NVIDIA GPU operator cluster policy.</p>
<pre><code class="language-bash">oc get clusterpolicies gpu-cluster-policy -o yaml
oc delete clusterpolicy gpu-cluster-policy

#oc get nodefeaturediscoveries nfd-instance -o yaml -n openshift-nfd
#oc delete nodefeaturediscovery nfd-instance -n openshift-nfd

#NODE=my-host-example-com
#kubectl label --list nodes $NODE | \
#  grep nvidia.com | \
#  awk -F= '{print $1}' | xargs -I{} kubectl label node $NODE {}-
</code></pre>
<!-- - Create both the NFD nodefeaturediscovery and the Nvidia clusterpolicy - Make sure labels are consistent with an initialized GPU -->
<pre><code class="language-bash">kubectl label --list nodes $NODE | grep nvidia.com
</code></pre>
<p>Create a new NVIDIA GPU operator cluster policy.</p>
<!-- - Remove the nodefeaturediscovery. - Remove the clusterpolicy. - Remove the node labels. - Create a new nodefeaturediscovery and a new clusterpolicy. -->
<h2 id="update-log">Update log:</h2>
<div style="font-size:10px"> <blockquote> <p><strong>2024/04/04:</strong> Initial draft.</p> <p><strong>2024/04/08:</strong> Initial post.</p> </blockquote> </div>
7 career skills: Build your network
I'm writing this blog post as a partial output of the work done with James Mernim in the RHT's mentoring program.
We reviewed a framework called the “7 career skills” in which we focused on one specifically called ‘building your... 2023-10-03T00:00:00+00:00 https://www.pubstack.com/blog/2023/10/03/build-your-network Carlos Camacho
<p>I'm writing this blog post as a partial output of the work done with James Mernim in the RHT's mentoring program. We reviewed a framework called the “7 career skills” and focused on one in particular: ‘building your network’.</p>
<h2 id="careers-skills-build-your-network">Career skills: Build your network</h2>
<p>The “7 career skills” is a framework that is part of the official career coaching program at Red Hat; its main goal is to provide a growth mindset and the tools to move forward in the direction the coachee decides to go.</p>
<p>There are many reasons for writing this post. Among them, I would like to introduce the “7 career skills” framework, starting with a description of the exercise called ‘building your network’, and to highlight the importance of having a set of well-defined ‘career tools’ as part of an officially supported career coaching program.</p>
<p>The content described here is a written exercise designed to help individuals reflect on their career, gain insights, and develop a growth mindset. The exercise maps a set of relationships (personal, informational, or structural) that play a vital role in career development, and helps materialize them as “opportunities for growth”.
This post aims to educate readers about the significance of these relationships and how they can be leveraged.</p>
<p>At Red Hat, a coaching framework called the ‘7 career skills’ is commonly used; the skills are: Build your network, Stretch yourself, Adapt to change, Reflect and plan, Know yourself, Spot opportunities, and Build your brand.</p>
<p>Today I'm going to describe “Building your network”, and we will end with an exercise to assess how big and healthy your network is.</p>
<p>A career tool is a written exercise for a mentee or coachee to complete that assists their thinking, insights, and reflection; it helps build the right mindset going forward and provides structure, focus and, if possible, some lightbulb moments.</p>
<h2 id="mapping-your-network-of-career-supporters">Mapping your network of career supporters</h2>
<p>When we think about our career, people are a key differentiator; the right person in the right place will boost your motivation and help you thrive. Chances are that when you reflect on your career, you'll identify particular people who have played a major role, both in a positive and negative way. It can be family, friends or co-workers who support us. Often, it's a great manager who gives us a chance. We succeed via our relationships with others: it's usually how we get access to opportunities for career development and learning.</p>
<h2 id="spoiler-alert">Spoiler alert!!!</h2>
<p>Did you watch the movie “Oppenheimer”? (If not, I would love to spoil the first 5 minutes for you…)</p>
<p>Robert Oppenheimer was a theoretical physicist studying in Cambridge's Cavendish Laboratory under a professor named Blackett. He didn't seem happy working there in the lab, to the point of adding cyanide to his professor's apple.
Then someone in his network, Niels Bohr, said to him: “you don't seem happy here”… “Get out of Cambridge and go somewhere they let you think”… “Where to go”… “Go with Born (University of Göttingen)”… Then the movie actually starts building its arguments, and you will see Oppenheimer's evolution over time.</p>
<hr />
<ul> <li>The main goal of this exercise is to help you “find your Niels Bohr”.</li> </ul>
<hr />
<p>There is also some interesting research and academic theory around how these networks function. Sociologist Mark Granovetter developed the idea of ‘strong’ and ‘weak’ ties within networks. Weak ties are relationships outside of your immediate circle, and they are more important for spotting opportunities and accessing new information. Granovetter also found that a person's number of connections was less important than their diversity. This activity is designed to help you think about and understand your network. It draws on theory about the kinds of relationships we need to build to have the right kind of career support in place.</p>
<h2 id="relationships">Relationships</h2>
<p>There are three categories of relationships which can be helpful:</p>
<ul>
<li>Personal: Those who believe in you, listen to you and whom you trust. You can show ‘vulnerability’ and they can provide reassurance. Surround yourself with positive people. Those in your emotional support network will:
<ul>
<li>Be willing to listen to you.</li>
<li>Provide reassurance.</li>
<li>Be dependable.</li>
<li>Be a person to go to for advice.</li>
<li>Be a person you can trust and who trusts you.</li>
<li>Encourage you to be your best and have faith in your abilities.</li>
<li>Be concerned that you reach your goals.</li>
<li>Provide a positive but honest voice.</li>
</ul>
</li>
<li> <p>Informational: This type of connection is about know-how.
These are the people who keep you in the loop about what is going on, tell you about the unwritten rules and share their knowledge and experience with you. Those in your informational support network will:</p>
<ul>
<li>Share their know-how, knowledge and experience with you.</li>
<li>Keep you in touch with what is going on.</li>
<li>Decode unwritten rules.</li>
<li>Inform you of company policies and procedures that may affect you.</li>
</ul>
</li>
<li> <p>Structural: These are the connectors to other functions. These people will endorse you, champion you, and help you be more visible. Find people who will mentor or champion you. Those in your structural support network will:</p>
<ul>
<li>Connect you to other functions.</li>
<li>Endorse you.</li>
<li>Champion you.</li>
<li>Help you be more visible.</li>
<li>Help you maximize your exposure.</li>
</ul>
</li>
</ul>
<h3 id="example-of-the-exercise-build-your-network">Example of the exercise “Build your network”</h3>
<p>Here is an example network map. The three types of relationship are represented as a Venn diagram; you'll notice there may be some overlaps, depending on how many people you are able to place in the diagram.</p>
<p><img src="/static/build_network/network_map.png" alt="" /></p>
<p>Once you finish placing all the people in the diagram (write their names by hand where it says “Insert text”), it should help make the abstract idea of your support network more concrete and understandable.</p>
<p>There are many suggestions for expanding one's network; they aim to provide insights and ideas about practical steps you can take to improve it, among them contributing to other organizations, attending conferences, and taking on stronger communication roles within your own organization (internally, we also have the technical gigs program).
All of this is aimed at giving you practical steps to improve your network.</p>
<p>In summary, this post aims to educate and guide readers on the importance of building and maintaining a supportive network in their careers. It introduces the “7 career skills” framework with emphasis on building your network, along with a glimpse of theoretical and practical advice to help individuals strengthen their professional connections and ultimately advance in their careers.</p>
Kubernetes MLOps 101: Deployment and usage
MLOps stands for Machine Learning Operations. MLOps is a core function of Machine Learning engineering, focused on streamlining the process of taking machine learning models to production and then maintaining and monitoring them. MLOps is a collaborative function, often comprising... 2023-07-17T00:00:00+00:00 https://www.pubstack.com/blog/2023/07/17/mlops-101 Carlos Camacho
<p><a href="https://www.databricks.com/glossary/mlops">MLOps</a> stands for Machine Learning Operations. MLOps is a core function of Machine Learning engineering, focused on streamlining the process of taking machine learning models to production and then maintaining and monitoring them. MLOps is a collaborative function, often comprising data scientists, DevOps engineers, and IT.</p>
<p>Kubeflow is an open-source platform built on Kubernetes designed to simplify the deployment and management of machine learning (ML) workflows. It provides a set of tools and frameworks that enable data scientists and ML engineers to build, deploy, and scale ML models efficiently.</p>
<p>Kubeflow's goal is to facilitate the adoption of machine learning best practices and enable reproducibility, scalability, and collaboration in ML workflows.</p>
<h2 id="tldr">TL;DR</h2>
<p>This post is an initial approach to deploying some MLOps tools on top of OpenShift 4.12, among them Kubeflow and the OpenDataHub project.
The goal of this tutorial is to play with the technology and have a learning environment as fast as possible.</p>
<h3 id="references-for-future-steps-workshops-and-activities">References for future steps, workshops, and activities</h3>
<ol>
<li><a href="https://developers.redhat.com/developer-sandbox/activities/use-rhods-to-master-nlp">https://developers.redhat.com/developer-sandbox/activities/use-rhods-to-master-nlp</a></li>
<li><a href="https://demo.redhat.com/">https://demo.redhat.com/</a> “MLOps with Red Hat OpenShift Data Science: Retail Coupon App Workshop”</li>
</ol>
<h2 id="1-deploying-the-infrastructure">1. Deploying the infrastructure</h2>
<h3 id="requirements-installation">Requirements installation</h3>
<p>KubeInit is a project that aims to simplify the deployment of Kubernetes distributions. It provides a set of Ansible playbooks and collections to automate the installation and configuration of Kubernetes clusters.</p>
<p>Install the required dependencies by running the following commands:</p>
<pre><code class="language-bash"># Install the requirements assuming python3/pip3 is installed
python3 -m pip install --upgrade pip \
  shyaml \
  ansible \
  netaddr

# Clone the KubeInit repository and navigate to the project's directory
git clone https://github.com/Kubeinit/kubeinit.git
cd kubeinit

# Install the Ansible collection requirements
ansible-galaxy collection install --force --requirements-file kubeinit/requirements.yml

# Build and install the collection
rm -rf ~/.ansible/collections/ansible_collections/kubeinit/kubeinit
ansible-galaxy collection build kubeinit --verbose --force --output-path releases/
ansible-galaxy collection install --force --force-with-deps releases/kubeinit-kubeinit-`cat kubeinit/galaxy.yml | shyaml get-value version`.tar.gz
</code></pre>
<h3 id="deploy-a-single-node-412-okd-cluster-as-our-development-environment">Deploy a single node 4.12 OKD cluster as our development environment</h3>
<p>This step will get us a single-node OpenShift
4.12 development environment. From the hypervisor run:</p>
<pre><code class="language-bash"># Run the playbook
ansible-playbook \
  -v --user root \
  -e kubeinit_spec=okd-libvirt-1-0-1 \
  -e hypervisor_hosts_spec='[{"ansible_host":"nyctea"},{"ansible_host":"tyto"}]' \
  -e controller_node_disk_size='300G' \
  -e controller_node_ram_size='88080384' \
  ./kubeinit/playbook.yml
</code></pre>
<p>To clean up the environment, include the <code>-e kubeinit_stop_after_task=task-cleanup-hypervisors</code> variable.</p>
<p>Depending on the value of <code>kubeinit_spec</code> we can choose between multiple K8s distributions, set how many controller and compute nodes to deploy, and choose how many hypervisors to spread the cluster guests across; for more information go to the <a href="https://github.com/kubeinit/kubeinit">Kubeinit Github</a> repository or the <a href="https://docs.kubeinit.org/">docs website</a>.</p>
<h3 id="configuring-the-storage-pv-backend">Configuring the storage PV backend</h3>
<p>From the hypervisor connect to the service guest by running:</p>
<pre><code class="language-bash">ssh -i ~/.ssh/okdcluster_id_rsa [email protected]

# Install some tools
yum groupinstall 'Development Tools' -y
yum install git -y
</code></pre>
<p>To support the PersistentVolumeClaims (PVCs) from the Kubeflow deployment, a storage PV backend needs to be set up; we will set up a basic default storage class for this.
Follow the steps below to configure the backend:</p> <pre><code class="language-bash"># Create a new namespace for the NFS provisioner
oc new-project nfsprovisioner-operator
sleep 30;

# Deploy the NFS Provisioner operator in the terminal
cat &lt;&lt; EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfs-provisioner-operator
  namespace: openshift-operators
spec:
  channel: alpha
  installPlanApproval: Automatic
  name: nfs-provisioner-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
EOF

# We assume this is a single-node deployment; get the first worker node
export target_node=$(oc get nodes | grep worker | head -1 | cut -d' ' -f1)

# Assign the NFS provisioner role to our first worker node
oc label node/${target_node} app=nfs-provisioner
</code></pre> <p>Now, we need to configure the worker node filesystem to support the location where the PVs will be stored.</p> <pre><code class="language-bash"># ssh to the node
oc debug node/${target_node}

# Configure the local folder for the PVs
chroot /host
mkdir -p /home/core/nfs
chcon -Rvt svirt_sandbox_file_t /home/core/nfs
exit; exit
</code></pre> <h3 id="setting-up-the-container-registries-credentials">Setting up the container registries credentials</h3> <p>Connect to the worker node to configure the OpenShift registry token. Get the OpenShift registry token list (pull secrets) from <a href="https://cloud.redhat.com/openshift/install/pull-secret">cloud.redhat.com</a> and store it locally as <code>/root/config.json</code>.</p> <pre><code class="language-bash"># There is a local pull-secret for pulling from the internal cluster container registry
# TODO: Make sure we have the local registry and the RHT credentials together
# By default there is a local container registry in this single node cluster
# and those credentials are deployed in the OCP cluster.
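#
# (Clarifying note, not part of the original guide: `jq -s '.[0] * .[1]' a.json b.json`
# slurps both files into a two-element array and recursively merges the second
# object over the first, so the "auths" entries from both credential files end
# up in a single dockerconfigjson; for keys present in both files the second
# file wins. For example, merging {"auths":{"a":1}} and {"auths":{"b":2}} this
# way yields {"auths":{"a":1,"b":2}}.)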
# Merge the local rendered registry-auths.json from the services guest
# with the token downloaded from cloud.redhat.com;
# store it as rht-registry-auths.json and merge them with:
cd
jq -s '.[0] * .[1]' registry-auths.json rht-registry-auths.json &gt; full-registry-auths.json
oc set data secret/pull-secret -n openshift-config --from-file=.dockerconfigjson=full-registry-auths.json

# oc create secret generic pull-secret -n openshift-config --type=kubernetes.io/dockerconfigjson --from-file=.dockerconfigjson=/root/downloaded_token.json
# oc secrets link default pull-secret -n openshift-config --for=pull
# Refer to: https://access.redhat.com/solutions/4902871 for further information
</code></pre> <p>Make sure the registry secret is correct by printing its value.</p> <pre><code class="language-bash">oc get secret pull-secret -n openshift-config --template='{{index .data ".dockerconfigjson" | base64decode}}'
</code></pre> <p>Alternatively, configure it in the cluster per node (not required; already done in the previous step):</p> <pre><code class="language-bash">ssh [email protected] # (First controller node, in this case, a single node cluster)
podman login registry.redhat.io
podman login registry.access.redhat.com
# Username: ***@redhat.com
# Password:
# Login Succeeded!
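
# (Optional check, not part of the original guide: after the cluster-wide
# pull secret changes, the machine config operator rolls the update out to
# the nodes; you can watch that rollout with:
# oc get machineconfigpool
# before retrying any previously failing image pulls.)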
</code></pre> <p>Create the NFSProvisioner Custom Resource:</p> <pre><code class="language-bash">cat &lt;&lt; EOF | oc apply -f -
apiVersion: cache.jhouse.com/v1alpha1
kind: NFSProvisioner
metadata:
  name: nfsprovisioner-sample
  namespace: nfsprovisioner-operator
spec:
  nodeSelector:
    app: nfs-provisioner
  hostPathDir: "/home/core/nfs"
EOF
sleep 30;

# Check if the NFS server is running
oc get pod

# Update the annotation of the NFS StorageClass
oc patch storageclass nfs -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

# Check the (default) mark next to the nfs StorageClass
oc get sc
NAME            PROVISIONER       RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
nfs (default)   example.com/nfs   Delete          Immediate           false                  4m29s
</code></pre> <p>Create a test PVC to verify that claims can be fulfilled correctly:</p> <pre><code class="language-bash"># Create a test PVC
cat &lt;&lt; EOF | oc apply -f -
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs-pvc-example
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Mi
  storageClassName: nfs
EOF

# Check the test PV/PVC
oc get pv,pvc
NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                     STORAGECLASS   REASON   AGE
persistentvolume/pvc-e30ba0c8-4a41-4fa0-bc2c-999190fd0282   1Mi        RWX            Delete           Bound    nfsprovisioner-operator/nfs-pvc-example   nfs                     5s

NAME                                    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/nfs-pvc-example   Bound    pvc-e30ba0c8-4a41-4fa0-bc2c-999190fd0282   1Mi        RWX            nfs            5s
</code></pre> <p>The output shown here indicates that the NFS server, NFS provisioner, and NFS StorageClass are all working fine. You can use the NFS StorageClass for any test scenario that needs a PVC.</p> <p>The following snippets let you play with the applications deployed to the single-node cluster while keeping an easy revert option.</p> <blockquote> <p><strong><em>Note:</em></strong> To roll back the environment and try new things, restore the snapshots (1 min) instead of redeploying (30 mins).</p> </blockquote> <pre><code class="language-bash">########
# BACKUP
vms=( $(virsh list --all | grep running | awk '{print $2}') )

# Create an initial snapshot for each VM
for i in "${vms[@]}"; \
  do \
    echo "virsh snapshot-create-as --domain $i --name $i-fresh-install --description $i-fresh-install --atomic"; \
    virsh snapshot-create-as --domain "$i" --name "$i"-fresh-install --description "$i"-fresh-install --atomic; \
  done

# List current snapshots (after creation they should show up here)
for i in "${vms[@]}"; \
  do \
    virsh snapshot-list --domain "$i"; \
  done
########

#########
# RESTORE
vms=( $(virsh list --all | grep running | awk '{print $2}') )
for i in "${vms[@]}"; \
  do \
    virsh shutdown $i;
    virsh snapshot-revert --domain $i --snapshotname $i-fresh-install --running;
    virsh list --all;
  done
#########

#########
# DELETE
vms=( $(virsh list --all | grep -E 'running|shut' | awk '{print $2}') )
for i in "${vms[@]}"; \
  do \
    virsh snapshot-delete --domain $i --snapshotname $i-fresh-install;
  done
#########
</code></pre> <h2 id="2-deploying-the-mlops-applications">2.
Deploying the MLOps applications</h2> <p>This section explores the installation of different MLOps components in an OCP 4.12 cluster.</p> <p>Select only one subsection (2.1, 2.2, or 2.3), not all of them.</p> <p>From the services guest execute:</p> <h3 id="21-deploying-the-kubeflow-pipelines-component">2.1 Deploying the Kubeflow pipelines component</h3> <pre><code class="language-bash">###################################
# Installing the Kubeflow templates
# https://www.kubeflow.org/docs/components/pipelines/v1/installation/localcluster-deployment/#deploying-kubeflow-pipelines
#
###############################
# Kubeflow pipelines standalone
# We will deploy Kubeflow pipelines 2.0.0
#
cd
export PIPELINE_VERSION=2.0.0
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-pns?ref=$PIPELINE_VERSION"
sleep 30

oc get pods -n kubeflow
NAME                                               READY   STATUS    RESTARTS      AGE
cache-deployer-deployment-76f8bc8897-t48vs         1/1     Running   0             3m
cache-server-65fc86f747-2rg7t                      1/1     Running   0             3m
metadata-envoy-deployment-5bf6bbb856-tqw85         1/1     Running   0             3m
metadata-grpc-deployment-784b8b5fb4-l94tm          1/1     Running   3 (52s ago)   3m
metadata-writer-647bfd9f77-m5c8w                   1/1     Running   0             3m
minio-65dff76b66-vstbk                             1/1     Running   0             3m
ml-pipeline-86965f8976-qbgqs                       1/1     Running   0             3m
ml-pipeline-persistenceagent-dbc9d95b6-g7nsb       1/1     Running   0             3m
ml-pipeline-scheduledworkflow-6fbf57b54d-446f5     1/1     Running   0             2m59s
ml-pipeline-ui-5b99c79fc8-2vbcp                    1/1     Running   0             2m59s
ml-pipeline-viewer-crd-5fdb467bb5-rktvs            1/1     Running   0             2m59s
ml-pipeline-visualizationserver-6cf48684f5-b929v   1/1     Running   0             2m59s
mysql-c999c6c8-jzths                               1/1     Running   0             2m59s
workflow-controller-6c85bc4f95-lmkrg               1/1     Running   0             2m59s
##########################
</code></pre> <p>To access the Kubeflow pipelines UI, follow these steps:</p>
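<p>The commands below chain two SSH port forwards so that a browser on the workstation can reach the in-cluster UI service. Conceptually, this is the path the traffic takes (a sketch using the hostnames and ports of this environment):</p> <pre><code class="language-bash"># Traffic path once all hops are in place:
# browser -> localhost:38080        (workstation)
#         -> labserver:38080        (first hop, hypervisor)
#         -> services guest :8080   (second hop)
#         -> svc/ml-pipeline-ui:80  (kubectl port-forward, kubeflow namespace)
</code></pre>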
<pre><code class="language-bash"># Create an initial SSH hop from your machine to the hypervisor
ssh -L 38080:localhost:38080 root@labserver

# A second hop will connect you to the services guest
ssh -L 38080:localhost:8080 -i ~/.ssh/okdcluster_id_rsa [email protected]

# Once we are in a network segment with access to the Kubeflow services
# we can forward the traffic to the ml-pipeline-ui pod
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
</code></pre> <p>After running the hop/forwarding commands, you can access the Kubeflow pipelines UI by opening your browser and visiting <code>http://localhost:38080</code>.</p> <p><img src="/static/kubeflow/kubeflow_ui.png" alt="" /></p> <p>Once all the pods are running, the UI should work as expected without reporting any issues.</p> <h3 id="22-deploying-the-modelmesh-service-operator">2.2 Deploying the modelmesh service operator</h3> <pre><code class="language-bash">cd
git clone https://github.com/opendatahub-io/modelmesh-serving.git --branch release-v0.11.0-alpha
cd modelmesh-serving
cd opendatahub/quickstart/basic/
./deploy.sh
</code></pre> <p>Make sure the PVCs, replicasets, pods, and services are all running before continuing.</p> <pre><code class="language-bash"># Check the OpenDataHub ModelServing's inference service
oc get isvc -n modelmesh-serving
# NAME                 URL                                               READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
# example-onnx-mnist   grpc://modelmesh-serving.modelmesh-serving:8033   True                                                                  42m
</code></pre> <pre><code class="language-bash"># Check the URL for the deployed model
oc get routes
# NAME                 HOST/PORT                                                             PATH                            SERVICES            PORT   TERMINATION     WILDCARD
# example-onnx-mnist   example-onnx-mnist-modelmesh-serving.apps.okdcluster.kubeinit.local   /v2/models/example-onnx-mnist   modelmesh-serving   8008   edge/Redirect   None
</code></pre> <p>Let’s test a model from the manifests folder.</p> <pre><code class="language-bash">export HOST_URL=$(oc get route example-onnx-mnist -ojsonpath='{.spec.host}' -n modelmesh-serving)
export HOST_PATH=$(oc get route example-onnx-mnist -ojsonpath='{.spec.path}' -n modelmesh-serving)
export COMMON_MANIFESTS_DIR='/root/modelmesh-serving/opendatahub/quickstart/common_manifests'

curl --silent --location --fail --show-error --insecure https://${HOST_URL}${HOST_PATH}/infer -d @${COMMON_MANIFESTS_DIR}/input-onnx.json

# This is the expected output
# {"model_name":"example-onnx-mnist__isvc-b29c3d91f3","model_version":"1","outputs":[{"name":"Plus214_Output_0","datatype":"FP32","shape":[1,10],"data":[-8.233053,-7.7497034,-3.4236815,12.3630295,-12.079103,17.266596,-10.570976,0.7130762,3.321715,1.3621228]}]}
</code></pre> <h3 id="23-deploying-all-kubeflow-components-wip-pending-failing-resources">2.3 Deploying all Kubeflow components (WIP pending failing resources)</h3> <pre><code class="language-bash">##########################
# Complete install
# From: https://github.com/kubeflow/manifests#installation
cd
git clone https://github.com/kubeflow/manifests kubeflow_manifests
cd kubeflow_manifests
while ! kustomize build example | awk '!/well-defined/' | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
##########################

# WIP checking Security Context Constraints by executing:
# We deploy all the Security Context Constraints
# cd
# git clone https://github.com/opendatahub-io/manifests.git ocp_manifests
# cd ocp_manifests
# while ! kustomize build openshift/openshiftstack/application/openshift/openshift-scc/base | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
# while ! kustomize build openshift/openshiftstack/application/openshift/openshift-scc/overlays/istio | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
# while ! kustomize build openshift/openshiftstack/application/openshift/openshift-scc/overlays/servicemesh | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
</code></pre> <h3 id="checking-that-all-the-services-are-running">Checking that all the services are running</h3> <p>To ensure that all deployed services are running, check the pods, services, replicasets, deployments, and PVCs.</p> <h2 id="3-running-kubeflow-creating-experiments-pipelines-and-executions">3. Running Kubeflow (Creating experiments, pipelines, and executions)</h2> <p>Kubeflow simplifies the development and deployment of machine learning pipelines by providing a higher level of abstraction over Kubernetes. It offers a resilient framework for distributed computing, allowing ML pipelines to be scalable and production-ready.</p> <div style="float: left; width: 400px; background: white;"> <img src="/static/kubeflow/kubeflow_components.webp" alt="" style="border:15px solid #FFF" /> </div> <p>In this section, we will explore the process of creating a machine learning pipeline using Kubeflow, covering various components and their integration throughout the ML solution lifecycle, that is, creating the experiment, pipeline, and run.</p> <p>The following components are the main organizational structure within Kubeflow.</p> <ul> <li> <p>A Kubeflow Experiment is a logical grouping of machine learning runs or trials. It provides a way to organize and track multiple iterations of training or evaluation experiments. Experiments help in managing different versions of models, hyperparameters, and data configurations.</p> </li> <li> <p>A Kubeflow Pipeline is a workflow that defines a series of interconnected steps or components for an end-to-end machine learning process. It allows for the orchestration and automation of complex ML workflows, including data preprocessing, model training, evaluation, and deployment.
Pipelines enable reproducibility, scalability, and collaboration in ML development by providing a visual representation of the workflow and its dependencies.</p> </li> <li> <p>A Kubeflow Run refers to the execution of a pipeline or an individual component within a pipeline. It represents a specific instance of running a pipeline or a component with specific inputs and outputs. Runs can be triggered manually or automatically based on predefined conditions or events. Each run captures metadata and logs, allowing for easy tracking, monitoring, and troubleshooting of the pipeline’s execution.</p> </li> </ul> <h3 id="running-our-first-experiment">Running our first experiment</h3> <p>Text</p> <h3 id="creating-an-experiment">Creating an experiment</h3> <p>Text</p> <h3 id="creating-the-pipeline">Creating the pipeline</h3> <p>Text</p> <h3 id="running-the-pipeline">Running the pipeline</h3> <p>Text</p> <h2 id="conclusions">Conclusions</h2> <p>Deploying Kubeflow using KubeInit simplifies the process of setting up a scalable and reproducible ML workflow platform. With KubeInit’s Ansible playbooks and collections, you can automate the deployment of Kubernetes and easily configure the necessary components for Kubeflow. By leveraging Kubeflow’s templates and services, data scientists and ML engineers can accelerate the development and deployment of machine learning models.</p> <h2 id="interesting-errors">Interesting errors</h2> <pre><code class="language-bash">Warning   Failed    8m1s (x6 over 11m)    kubelet   Error: ImagePullBackOff
Warning   Failed    6m16s (x5 over 11m)   kubelet   Failed to pull image "mysql:8.0.29": rpc error: code = Unknown desc = reading manifest 8.0.29 in docker.io/library/mysql: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
Normal    BackOff   103s (x28 over 11m)   kubelet   Back-off pulling image "mysql:8.0.29"
</code></pre> <h2 id="the-end">The end</h2> <p>If you like this post, please try the code, raise issues, and ask for more details, features, or anything else you are interested in. Also, it would be awesome if you became a stargazer to catch up on updates and new features.</p> <p>This is the <a href="https://github.com/kubeinit/kubeinit">main repository</a> for the infrastructure automation based on Ansible.</p> <p>Happy Kubeflow’ing &amp; Kubeinit’ing!</p> The homelab project Homelabs, also known as home data center, refers to a personal setup created by technology enthusiasts or professionals in their homes for learning, experimentation, or production purposes. It typically consists of a collection of computer hardware, networking equipment, and software... 2023-05-26T00:00:00+00:00 https://www.pubstack.com/blog/2023/05/26/the-homelab-project Carlos Camacho <p>A homelab, also known as a home data center, is a personal setup created by technology enthusiasts or professionals in their homes for learning, experimentation, or production purposes. It typically consists of a collection of computer hardware, networking equipment, and software that simulates or replicates a professional IT environment.</p> <p>Homelabs serve various purposes, including:</p> <ul> <li> <p>Learning and Skill Development: Many individuals use homelabs to enhance their knowledge and skills in areas such as system administration, networking, virtualization, storage, and cybersecurity. A homelab provides a hands-on environment to experiment with different technologies, configurations, and software without the constraints of a production environment.</p> </li> <li> <p>Testing and Prototyping: Homelabs offer a safe and controlled environment to test new software, applications, or hardware configurations.
It allows individuals to evaluate their performance, compatibility, and feasibility before deploying them in a production environment.</p> </li> <li> <p>Personal Projects and Services: Some people build homelabs to host personal projects, services, or applications. For example, they may set up media servers, game servers, file sharing platforms, or websites to meet their specific needs or interests.</p> </li> <li> <p>Home Automation and Internet of Things (IoT): Homelabs can be utilized for setting up smart home automation systems and experimenting with IoT devices. This enables individuals to control and manage various aspects of their home, such as lighting, temperature, security, and entertainment systems.</p> </li> <li> <p>Data Storage and Backup: Homelabs often include storage solutions like Network-Attached Storage (NAS) or storage area network (SAN) devices, which allow users to store and backup their data locally.</p> </li> </ul> <p>Homelabs can range from a single server or a few network devices to a comprehensive setup with multiple servers, switches, routers, and other equipment. They can be built using off-the-shelf components or repurposed enterprise-grade hardware, depending on the user’s requirements, budget, and level of expertise.</p> <p><img src="/static/homelab/00_intro/00_homelab.jpg" alt="" /></p> <p>While building and maintaining a homelab can be challenging, it can also be a rewarding and educational experience. The availability of upstream communities and resources can help you overcome difficulties and find support from other homelab enthusiasts. 
Start with a clear plan, research thoroughly, and take it step by step, adjusting your goals and complexity as you gain experience.</p> <h2 id="origins">Origins</h2> <div style="float: left; width: 230px; background: white;"><img src="/static/homelab/00_intro/02_humble_home_lab.jpg" alt="" style="border:15px solid #FFF" /></div> <p>Datacenters are large-scale facilities designed to house and manage extensive computing infrastructure for businesses or organizations. They have high-end server racks, robust networking equipment, redundant power and cooling systems, and advanced security measures. Datacenters offer scalability, reliability, and support for critical operations, often involving multiple clients or users. In contrast, homelabs are personal setups in individuals’ homes used for learning, experimentation, or small-scale production. They typically consist of a collection of hardware, networking devices, and software. Homelabs provide a controlled environment for skill development, testing, and hosting personal projects, but on a smaller scale with limited resources and fewer users.</p> <div style="float: right; width: 230px; background: white;"><img src="/static/homelab/00_intro/03_ideal_rack.jpg" alt="" style="border:15px solid #FFF" /></div> <p>The ideal homelab architecture typically includes a combination of components to create a versatile and powerful setup. It starts with robust server hardware, whether it’s rack-mounted servers or repurposed desktop machines, equipped with ample processing power, memory, and storage capacity. Networking equipment such as routers, switches, and firewalls play a crucial role in connecting devices and enabling communication within the homelab. Virtualization software like VMware or Hyper-V allows for the creation and management of virtual machines, enabling efficient resource utilization. 
Storage solutions, including NAS (Network Attached Storage) or SAN (Storage Area Network), provide ample storage capacity for data and backups. Additionally, monitoring and management tools help ensure the stability and performance of the homelab environment. The ideal architecture emphasizes flexibility, scalability, and the ability to experiment with various technologies and configurations to meet specific learning or project requirements.</p> <div style="float: left; width: 230px; background: white;"><img src="/static/homelab/00_intro/04_networking.png" alt="" style="border:15px solid #FFF" /></div> <p>Networking is a crucial component in an ideal homelab architecture, enabling devices to communicate and share resources effectively. It starts with a reliable and feature-rich router that provides internet access and manages IP addresses. A network switch connects multiple devices within the homelab, allowing seamless communication. Managed switches offer advanced features for improved network performance and segmentation. Implementing a firewall ensures the security of the homelab, protecting against unauthorized access. Additionally, utilizing technologies like VLANs and QoS can enhance network efficiency and prioritize traffic. Proper networking setup enables connectivity, facilitates the sharing of services and resources, and creates a robust foundation for various homelab activities and projects.</p> <div style="float: right; width: 230px; background: white;"><img src="/static/homelab/00_intro/05_storage.jpg" alt="" style="border:15px solid #FFF" /></div> <p>Storage is a critical component in an ideal homelab architecture, providing ample space to store data, virtual machine images, and backups. Network Attached Storage (NAS) or Storage Area Network (SAN) solutions offer centralized storage with high capacity and performance. 
NAS devices are easy to set up and provide shared file access over the network, making them suitable for storing media, documents, and backups. SANs, on the other hand, deliver fast and reliable storage for virtual machine environments, supporting features like RAID, snapshotting, and replication. Additionally, leveraging cloud storage services can provide off-site backups and enhance data accessibility. By incorporating a robust storage solution into the homelab architecture, users can ensure data integrity, scalability, and efficient management of their digital assets.</p> <div style="float: left; width: 230px; background: white;"><img src="/static/homelab/00_intro/06_compute.jpg" alt="" style="border:15px solid #FFF" /></div> <p>Compute is a fundamental aspect of an ideal homelab architecture, empowering users to run diverse workloads and applications. It typically revolves around powerful server hardware, which can include rack-mounted servers or repurposed desktop machines with ample processing power, memory, and storage capacity. Virtualization technologies like VMware or Hyper-V allow for the creation and management of virtual machines, enabling efficient utilization of resources and the ability to run multiple operating systems and applications simultaneously. Additionally, containerization platforms such as Docker and Kubernetes offer lightweight and scalable environments for deploying and managing containerized applications. The compute component of a homelab provides the necessary horsepower to support a wide range of experiments, projects, and learning opportunities, empowering users to explore various technologies and configurations.</p> <div style="float: right; width: 230px; background: white;"><img src="/static/homelab/00_intro/07_nice_homelab.jpg" alt="" style="border:15px solid #FFF" /></div> <p>A cool homelab is a sight to behold, with a sleek equipment rack housing powerful servers and illuminated by LED lights. 
Neatly managed cables run along cable management arms and trays, with color-coded sleeves adding style and clarity. High-performance networking switches, routers, and firewalls display blinking status lights, creating an entrancing visual display. Multiple monitors on a spacious desk provide a command center, surrounded by shelves and storage units for additional hardware. The room is optimized for airflow and ventilation, ensuring the equipment remains cool and efficient. Whiteboards or smart boards adorn the walls, inspiring creativity and organization. This cool homelab is a blend of functionality, organization, and aesthetic appeal, reflecting a passion for technology and a commitment to continuous learning and exploration.</p> <div style="float: left; width: 230px; background: white;"><img src="/static/homelab/00_intro/08_reality_mess.jpg" alt="" style="border:15px solid #FFF" /></div> <p>Keeping things organized in a homelab can be a real challenge, but it’s crucial for maintaining efficiency and ease of use. One area that often requires attention is cable management. With numerous devices, power cords, and networking cables, it’s easy for things to become a tangled mess. Implementing cable management techniques such as using cable ties, Velcro straps, or cable management trays can help keep cables organized and prevent them from becoming a jumbled mess. Additionally, labeling cables and ports can make it easier to trace connections and troubleshoot any issues that may arise. Regular maintenance and periodic cable cleanups can go a long way in maintaining a tidy and well-organized homelab.</p> <div style="float: right; width: 230px; background: white;"><img src="/static/homelab/00_intro/09_reality_mess.jpg" alt="" style="border:15px solid #FFF" /></div> <p>Aside from cables, proper equipment and component organization also contribute to an organized homelab. 
Utilizing equipment racks or shelves can provide a designated space for servers, switches, and other hardware. Grouping similar components together and using clear labeling or color-coding techniques can simplify identification and access. It’s also helpful to have a central documentation system to keep track of configurations, IP addresses, and system changes. By investing time and effort into organizing the physical and virtual aspects of the homelab, users can save valuable time during troubleshooting, upgrades, and expansions, ensuring a smoother and more efficient homelab experience overall.</p> <div style="float: left; width: 230px; background: white;"><img src="/static/homelab/00_intro/10_reality_cabling.jpg" alt="" style="border:15px solid #FFF" /></div> <p>Cabling is a crucial aspect of homelab organization, and a few extra tips can help maintain a tidy setup. Planning the cabling layout in advance, considering cable lengths and future expansions, reduces clutter. Utilizing color-coded cables or cable sleeves simplifies identification. Within server racks or cabinets, employing cable management tools like arms, managers, and routing channels guides cables neatly and improves airflow. Regular cable audits and cleanups remove unused cables and ensure secure connections. Documenting the cabling infrastructure with diagrams or management software aids in troubleshooting and modifications, preventing confusion and saving time. By implementing these strategies, the homelab maintains an organized and efficient cabling system.</p> <h2 id="before">Before</h2> <p>Building a homelab is all about trial and error, and things can get a bit messy and disorganized at times. It’s totally normal! When you start setting up your homelab, you might run into issues and face configuration problems. But don’t worry, that’s how you learn! Embrace the challenges and see them as opportunities to grow your skills. 
Your homelab might end up looking like a tangled web of cables, but that’s part of the fun.</p> <div class="col"> <h4 class="block-title">Before homelab gallery</h4> <div class="block-body"> <ul class="item-list-round" data-magnific="gallery"> <li><a href="/static/homelab/02_building/before/00.jpg" style="background-image: url('/static/homelab/02_building/before/00.jpg');"></a></li> <li><a href="/static/homelab/02_building/before/01.jpg" style="background-image: url('/static/homelab/02_building/before/01.jpg');"></a></li> <li><a href="/static/homelab/02_building/before/02.jpg" style="background-image: url('/static/homelab/02_building/before/02.jpg');"></a></li> <li><a href="/static/homelab/02_building/before/03.jpg" style="background-image: url('/static/homelab/02_building/before/03.jpg');"></a></li> <li><a href="/static/homelab/02_building/before/04.jpg" style="background-image: url('/static/homelab/02_building/before/04.jpg');"></a></li> <li><a href="/static/homelab/02_building/before/05.jpg" style="background-image: url('/static/homelab/02_building/before/05.jpg');"></a></li> <li><a href="/static/homelab/02_building/before/06.jpg" style="background-image: url('/static/homelab/02_building/before/06.jpg');"></a></li> <li><a href="/static/homelab/02_building/before/07.jpg" style="background-image: url('/static/homelab/02_building/before/07.jpg');"></a></li> </ul> </div> </div> <hr style="clear:both; border:0px solid #fff;" /> <p>Take the time to label and document everything; it’ll save you headaches down the road. And remember, there’s a whole <a href="https://www.reddit.com/r/homelab">homelab</a> community out there ready to help you out when things get rough!</p> <h2 id="bom">BoM</h2> <p>Next are the details of my enclosure homelab’s bill of materials (BoM). Picture this: a comprehensive list of all the hardware and equipment required to build this box of awesomeness.
The BoM is like a treasure map, guiding me through the vast landscape of parts that make up the heart and soul of my homelab setup. There are many parts not in this BoM list because they are not used in the final layout.</p> <table> <thead> <tr> <th>Item</th> <th>Quantity</th> <th>Price</th> <th>Total</th> <th>Image</th> </tr> </thead> <tbody> <tr> <td>Countertop (240cm, cut in 2x70cm &amp; 2x50cm)</td> <td>1</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/01_tablero.jpg" width="50" height="50" /></td> </tr> <tr> <td>T-slot 2020 profiles</td> <td>12</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/02_t-slot.jpg" width="50" height="50" /></td> </tr> <tr> <td>3 way connectors</td> <td>8</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/03_3_way_connector.jpg" width="50" height="50" /></td> </tr> <tr> <td>M4 hexagonal head screws (5mm and 25mm)</td> <td>50/50</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/04_m3_4mmlong_screw.jpg" width="50" height="50" /></td> </tr> <tr> <td>T-slot 2020 corner connectors</td> <td>8</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/06_90_degree_connector.jpg" width="50" height="50" /></td> </tr> <tr> <td>T-slot 90 degrees connectors</td> <td>50</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/07_90_degree_connector.jpg" width="50" height="50" /></td> </tr> <tr> <td>T-slot M4 nut</td> <td>50</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/08_t-slot_nut.jpg" width="50" height="50" /></td> </tr> <tr> <td>Spax wood screws 2mmx25mm</td> <td>50</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/09_spax_2mm_screw.jpg" width="50" height="50" /></td> </tr> <tr> <td>L (90 degrees) bracket 30mmx80mmx55mmx2mm</td> <td>2</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/12_L_bracket_30X80X55X2.5MM_SIMPSON.jpg"
width="50" height="50" /></td> </tr> <tr> <td>Heavy 90-degree connectors</td> <td>50</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/13_corner.jpg" width="50" height="50" /></td> </tr> <tr> <td>Heavy floor feet</td> <td>4</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/16_feet.jpg" width="50" height="50" /></td> </tr> <tr> <td>Upper door shock absorber</td> <td>1</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/17_shock.jpg" width="50" height="50" /></td> </tr> <tr> <td>Nylon wheels (1 inch), M8</td> <td>4</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/18_wheels_1inch_nylon.jpg" width="50" height="50" /></td> </tr> <tr> <td>M8 hexagon connector</td> <td>4</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/19_m10_hexagon_connector.png" width="50" height="50" /></td> </tr> <tr> <td>Washers</td> <td>50</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/21_washers.jpg" width="50" height="50" /></td> </tr> <tr> <td>Hinges</td> <td>2</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/22_hinge.jpg" width="50" height="50" /></td> </tr> <tr> <td>Glass doors (255mmx610mm)</td> <td>2</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/23_glass_doors_5mm.jpg" width="50" height="50" /></td> </tr> <tr> <td>19” rack profile (2x12U,4x2U,2x4U)</td> <td>8</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/81_Adam_Hall_19inch_61535.jpg" width="50" height="50" /></td> </tr> <tr> <td>19” rack shelf (250mm)</td> <td>1</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/91_shelve.jpg" width="50" height="50" /></td> </tr> <tr> <td>19” rack shelf (150mm)</td> <td>1</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/92_shelve_150mm.jpg" width="50" height="50" /></td> </tr> <tr> <td>19” rack power distributor</td> <td>1</td>
<td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/93_power.jpg" width="50" height="50" /></td> </tr> <tr> <td>19” rack rail depth adapter kit</td> <td>2</td> <td>TBD</td> <td>TBD</td> <td><img src="/static/homelab/01_materials/95_RDA2U.jpg" width="50" height="50" /></td> </tr> </tbody> </table> <div class="col"> <h4 class="block-title">Materials gallery</h4> <div class="block-body"> <ul class="item-list-round" data-magnific="gallery"> <li><a href="/static/homelab/01_materials/01_tablero.jpg" style="background-image: url('/static/homelab/01_materials/01_tablero.jpg');"></a></li> <li><a href="/static/homelab/01_materials/02_t-slot.jpg" style="background-image: url('/static/homelab/01_materials/02_t-slot.jpg');"></a></li> <li><a href="/static/homelab/01_materials/03_3_way_connector.jpg" style="background-image: url('/static/homelab/01_materials/03_3_way_connector.jpg');"></a></li> <li><a href="/static/homelab/01_materials/04_m3_4mmlong_screw.jpg" style="background-image: url('/static/homelab/01_materials/04_m3_4mmlong_screw.jpg');"></a></li> <li><a href="/static/homelab/01_materials/06_90_degree_connector.jpg" style="background-image: url('/static/homelab/01_materials/06_90_degree_connector.jpg');"></a></li> <li><a href="/static/homelab/01_materials/07_90_degree_connector.jpg" style="background-image: url('/static/homelab/01_materials/07_90_degree_connector.jpg');"></a></li> <li><a href="/static/homelab/01_materials/08_t-slot_nut.jpg" style="background-image: url('/static/homelab/01_materials/08_t-slot_nut.jpg');"></a></li> <li><a href="/static/homelab/01_materials/09_spax_2mm_screw.jpg" style="background-image: url('/static/homelab/01_materials/09_spax_2mm_screw.jpg');"></a></li> <li><a href="/static/homelab/01_materials/12_L_bracket_30X80X55X2.5MM_SIMPSON.jpg" style="background-image: url('/static/homelab/01_materials/12_L_bracket_30X80X55X2.5MM_SIMPSON.jpg');"></a></li> <li><a href="/static/homelab/01_materials/13_corner.jpg" 
style="background-image: url('/static/homelab/01_materials/13_corner.jpg');"></a></li> <li><a href="/static/homelab/01_materials/16_feet.jpg" style="background-image: url('/static/homelab/01_materials/16_feet.jpg');"></a></li> <li><a href="/static/homelab/01_materials/17_shock.jpg" style="background-image: url('/static/homelab/01_materials/17_shock.jpg');"></a></li> <li><a href="/static/homelab/01_materials/18_wheels_1inch_nylon.jpg" style="background-image: url('/static/homelab/01_materials/18_wheels_1inch_nylon.jpg');"></a></li> <li><a href="/static/homelab/01_materials/19_m10_hexagon_connector.png" style="background-image: url('/static/homelab/01_materials/19_m10_hexagon_connector.png');"></a></li> <li><a href="/static/homelab/01_materials/21_washers.jpg" style="background-image: url('/static/homelab/01_materials/21_washers.jpg');"></a></li> <li><a href="/static/homelab/01_materials/22_hinge.jpg" style="background-image: url('/static/homelab/01_materials/22_hinge.jpg');"></a></li> <li><a href="/static/homelab/01_materials/23_glass_doors_5mm.jpg" style="background-image: url('/static/homelab/01_materials/23_glass_doors_5mm.jpg');"></a></li> <li><a href="/static/homelab/01_materials/81_Adam_Hall_19inch_61535.jpg" style="background-image: url('/static/homelab/01_materials/81_Adam_Hall_19inch_61535.jpg');"></a></li> <li><a href="/static/homelab/01_materials/91_shelve.jpg" style="background-image: url('/static/homelab/01_materials/91_shelve.jpg');"></a></li> <li><a href="/static/homelab/01_materials/92_shelve_150mm.jpg" style="background-image: url('/static/homelab/01_materials/92_shelve_150mm.jpg');"></a></li> <li><a href="/static/homelab/01_materials/93_power.jpg" style="background-image: url('/static/homelab/01_materials/93_power.jpg');"></a></li> <li><a href="/static/homelab/01_materials/95_RDA2U.jpg" style="background-image: url('/static/homelab/01_materials/95_RDA2U.jpg');"></a></li> </ul> </div> </div> <hr style="clear:both; border:0px solid #fff;" /> <p>In 
conclusion, the bill of materials (BoM) serves as the backbone of my homelab’s main rack enclosure, providing a detailed roadmap to its inner workings. Through extensive research, careful consideration, and a touch of geeky enthusiasm, I’ve curated a collection of hardware and equipment that forms its foundation.</p> <h2 id="after-building-the-homelab">After building the homelab</h2> <p>I’d like to share the journey I embarked on to build my homelab. It was no walk in the park, but oh, the end result is worth every bit of the sweat and tears. First, I dove headfirst into the technical realm, researching hardware options like a mad scientist on a caffeine-fueled mission. Then came the fun part: trial and error galore! But hey, that’s how we learn, right? Finally, I had my homelab up and running, ready to tackle any tech challenge thrown my way. So, buckle up, folks, and get ready to witness the fruits of my homelab labor. Let the geeky adventures begin!</p> <div class="col"> <h4 class="block-title">After homelab gallery</h4> <div class="block-body"> <ul class="item-list-round" data-magnific="gallery"> <li><a href="/static/homelab/02_building/after/00.jpg" style="background-image: url('/static/homelab/02_building/after/00.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/01.jpg" style="background-image: url('/static/homelab/02_building/after/01.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/02.jpg" style="background-image: url('/static/homelab/02_building/after/02.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/03.jpg" style="background-image: url('/static/homelab/02_building/after/03.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/04.jpg" style="background-image: url('/static/homelab/02_building/after/04.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/05.jpg" style="background-image: url('/static/homelab/02_building/after/05.jpg');"></a></li> <li><a
href="/static/homelab/02_building/after/06.jpg" style="background-image: url('/static/homelab/02_building/after/06.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/07.jpg" style="background-image: url('/static/homelab/02_building/after/07.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/08.jpg" style="background-image: url('/static/homelab/02_building/after/08.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/09.jpg" style="background-image: url('/static/homelab/02_building/after/09.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/10.jpg" style="background-image: url('/static/homelab/02_building/after/10.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/11.jpg" style="background-image: url('/static/homelab/02_building/after/11.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/12.jpg" style="background-image: url('/static/homelab/02_building/after/12.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/13.jpg" style="background-image: url('/static/homelab/02_building/after/13.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/14.jpg" style="background-image: url('/static/homelab/02_building/after/14.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/15.jpg" style="background-image: url('/static/homelab/02_building/after/15.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/16.jpg" style="background-image: url('/static/homelab/02_building/after/16.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/17.jpg" style="background-image: url('/static/homelab/02_building/after/17.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/20.jpg" style="background-image: url('/static/homelab/02_building/after/20.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/21.jpg" style="background-image: url('/static/homelab/02_building/after/21.jpg');"></a></li> <li><a 
href="/static/homelab/02_building/after/22.jpg" style="background-image: url('/static/homelab/02_building/after/22.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/23.jpg" style="background-image: url('/static/homelab/02_building/after/23.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/24.jpg" style="background-image: url('/static/homelab/02_building/after/24.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/25.jpg" style="background-image: url('/static/homelab/02_building/after/25.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/26.jpg" style="background-image: url('/static/homelab/02_building/after/26.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/27.jpg" style="background-image: url('/static/homelab/02_building/after/27.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/28.jpg" style="background-image: url('/static/homelab/02_building/after/28.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/29.jpg" style="background-image: url('/static/homelab/02_building/after/29.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/30.jpg" style="background-image: url('/static/homelab/02_building/after/30.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/31.jpg" style="background-image: url('/static/homelab/02_building/after/31.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/32.jpg" style="background-image: url('/static/homelab/02_building/after/32.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/33.jpg" style="background-image: url('/static/homelab/02_building/after/33.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/34.jpg" style="background-image: url('/static/homelab/02_building/after/34.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/35.jpg" style="background-image: url('/static/homelab/02_building/after/35.jpg');"></a></li> <li><a 
href="/static/homelab/02_building/after/36.jpg" style="background-image: url('/static/homelab/02_building/after/36.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/40.jpg" style="background-image: url('/static/homelab/02_building/after/40.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/41.jpg" style="background-image: url('/static/homelab/02_building/after/41.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/42.jpg" style="background-image: url('/static/homelab/02_building/after/42.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/50.jpg" style="background-image: url('/static/homelab/02_building/after/50.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/51.jpg" style="background-image: url('/static/homelab/02_building/after/51.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/52.jpg" style="background-image: url('/static/homelab/02_building/after/52.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/53.jpg" style="background-image: url('/static/homelab/02_building/after/53.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/54.jpg" style="background-image: url('/static/homelab/02_building/after/54.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/55.jpg" style="background-image: url('/static/homelab/02_building/after/55.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/60.jpg" style="background-image: url('/static/homelab/02_building/after/60.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/61.jpg" style="background-image: url('/static/homelab/02_building/after/61.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/80.jpg" style="background-image: url('/static/homelab/02_building/after/80.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/81.jpg" style="background-image: url('/static/homelab/02_building/after/81.jpg');"></a></li> <li><a 
href="/static/homelab/02_building/after/82.jpg" style="background-image: url('/static/homelab/02_building/after/82.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/83.jpg" style="background-image: url('/static/homelab/02_building/after/83.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/84.jpg" style="background-image: url('/static/homelab/02_building/after/84.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/85.jpg" style="background-image: url('/static/homelab/02_building/after/85.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/86.jpg" style="background-image: url('/static/homelab/02_building/after/86.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/87.jpg" style="background-image: url('/static/homelab/02_building/after/87.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/88.jpg" style="background-image: url('/static/homelab/02_building/after/88.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/89.jpg" style="background-image: url('/static/homelab/02_building/after/89.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/90.jpg" style="background-image: url('/static/homelab/02_building/after/90.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/91.jpg" style="background-image: url('/static/homelab/02_building/after/91.jpg');"></a></li> <li><a href="/static/homelab/02_building/after/92.jpg" style="background-image: url('/static/homelab/02_building/after/92.jpg');"></a></li> </ul> </div> </div> <hr style="clear:both; border:0px solid #fff;" /> <p>My homelab build process was an exhilarating roller coaster ride filled with learning, challenges, and moments of pure triumph.</p> <h2 id="showcase">Showcase</h2> <p>After a lot of trial and error and plenty of messy cables, I’ve finally created a tech wonderland right in the comfort of my own home. Picture this: a sleek rack filled with powerful custom machines and blinking lights.
I’ve got my own mini data center going on! With my homelab, I can experiment with cutting-edge technologies, host my own services, and learn like a pro. It’s my own little tech playground, and boy, it feels good to have accomplished this. Welcome to my homelab, where the possibilities are endless. BTW, all this stuff was 100% sponsored by ME.</p> <div class="col"> <h4 class="block-title">Showcase gallery</h4> <div class="block-body"> <ul class="item-list-round" data-magnific="gallery"> <li><a href="/static/homelab/03_showcase/00.jpg" style="background-image: url('/static/homelab/03_showcase/00.jpg');"></a></li> <li><a href="/static/homelab/03_showcase/01.jpg" style="background-image: url('/static/homelab/03_showcase/01.jpg');"></a></li> <li><a href="/static/homelab/03_showcase/02.jpg" style="background-image: url('/static/homelab/03_showcase/02.jpg');"></a></li> <li><a href="/static/homelab/03_showcase/03.jpg" style="background-image: url('/static/homelab/03_showcase/03.jpg');"></a></li> <li><a href="/static/homelab/03_showcase/04.jpg" style="background-image: url('/static/homelab/03_showcase/04.jpg');"></a></li> <li><a href="/static/homelab/03_showcase/05.jpg" style="background-image: url('/static/homelab/03_showcase/05.jpg');"></a></li> </ul> </div> </div> <hr style="clear:both; border:0px solid #fff;" /> <div class="center"> <video width="480" height="320" controls="controls"> <source src="/static/homelab/02_building/after/93.mp4" type="video/mp4" /> </video> </div> <div class="center"> <video width="480" height="320" controls="controls"> <source src="/static/homelab/02_building/after/94.mp4" type="video/mp4" /> </video> </div> <h2 id="status">Status</h2> <p>At the moment, some pictures of the build…</p> <div class="col"> <h4 class="block-title">Status gallery</h4> <div class="block-body"> <ul class="item-list-round" data-magnific="gallery"> <li><a href="/static/homelab/04_state/00_homelab_status_2023_08.jpg" style="background-image:
url('/static/homelab/04_state/00_homelab_status_2023_08.jpg');"></a></li> </ul> </div> </div> <hr style="clear:both; border:0px solid #fff;" /> <h2 id="deploying">Deploying</h2> <p>To deploy things here I’m using Ansible and tools like Kubeinit; these are designed to simplify the process of setting up and managing complex infrastructure, such as Kubernetes clusters, within a homelab environment.</p> <p>Let’s talk about deployment tools for homelabs, like Kubeinit. These tools are super handy when you want to set up and manage complex stuff like Kubernetes clusters in your homelab. Kubeinit, for example, is an awesome open-source tool that makes deploying Kubernetes a breeze. It takes care of all the nitty-gritty installation and configuration details, so you don’t have to stress about it. Plus, it lets you customize your cluster to fit your needs. And the best part? There’s a friendly community around tools like Kubeinit that’s always ready to lend a hand and share their wisdom.</p> <p>Kubeinit is an open-source deployment tool specifically tailored for setting up Kubernetes clusters. It aims to provide an easy and reproducible way to deploy Kubernetes in different configurations, such as single-node or multi-node clusters. Kubeinit automates the installation process, handles configuration management, and assists in deploying additional services and tools commonly used with Kubernetes.</p> <p>Deployment tools like Kubeinit can be highly beneficial for homelab enthusiasts looking to leverage Kubernetes and containerization technologies.
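To make that concrete, here is a rough sketch of what a Kubeinit run can look like, following the clone-and-run pattern the project documents. The spec string (distro, driver, and node counts), the SSH user, and the paths below are illustrative assumptions, so double-check them against the upstream README before running anything.

```shell
# Illustrative sketch only: clone Kubeinit and launch a small
# OKD-on-libvirt cluster. The kubeinit_spec value is an assumption
# (distro-driver-controllers-workers-hypervisors); adjust it and
# the SSH user to match your environment.
git clone https://github.com/Kubeinit/kubeinit.git
cd kubeinit

# Point ansible-playbook at the inventory and main playbook
# shipped with the repository.
ansible-playbook \
    --user root \
    -e kubeinit_spec=okd-libvirt-1-1-1 \
    -i ./kubeinit/inventory.yml \
    ./kubeinit/playbook.yml
```

The nice part of driving everything through one Ansible entry point is reproducibility: tearing the cluster down and redeploying it is just another playbook run.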
They simplify the setup process, reduce manual work, and provide a standardized approach to deploying and managing Kubernetes clusters within a homelab environment.</p> <p><img src="/static/homelab/00_intro/deploy.jpg" alt="" /></p> <h2 id="update-log">Update log:</h2> <div style="font-size:10px"> <blockquote> <p><strong>2023/05/26:</strong> Initial version.</p> <p><strong>2023/10/04:</strong> Status gallery.</p> </blockquote> </div> Agile 101 +Jira Agile methodologies are a set of iterative and incremental approaches to software development that prioritize collaboration, flexibility, and rapid response to change. NOTE: WIP document subject to changes. Agile 101 - DFG:Upgrades (migrations, adoption, and backup and recovery) Benefits Some... 2023-03-23T00:00:00+00:00 https://www.pubstack.com/blog/2023/03/23/Agile-101 Carlos Camacho <p>Agile methodologies are a set of iterative and incremental approaches to software development that prioritize collaboration, flexibility, and rapid response to change.</p> <blockquote> <p><strong><em>NOTE:</em></strong> WIP document subject to changes.</p> </blockquote> <h2 id="agile-101---dfgupgrades-migrations-adoption-and-backup-and-recovery">Agile 101 - DFG:Upgrades (migrations, adoption, and backup and recovery)</h2> <h3 id="benefits">Benefits</h3> <p>Some benefits include:</p> <ul> <li> <p>Increased flexibility: Agile methodologies enable teams to respond quickly to changes in requirements, timelines, or resources. Teams can adjust their approach based on feedback from stakeholders or changes in the market.</p> </li> <li> <p>Better collaboration: Agile methodologies emphasize communication and collaboration between team members, stakeholders, and customers. 
This leads to better alignment and understanding of project goals and expectations.</p> </li> <li> <p>Improved quality: Agile methodologies prioritize continuous testing and feedback, leading to higher-quality software products.</p> </li> <li> <p>Faster time-to-market: Agile methodologies help teams deliver software products faster and more frequently. This enables teams to respond quickly to changing market conditions and customer needs.</p> </li> <li> <p>Increased customer satisfaction: By prioritizing collaboration, feedback, and responsiveness, agile methodologies help teams deliver products that better meet customer needs and expectations.</p> </li> </ul> <p>Also, Agile methodologies improve team performance and vision by promoting:</p> <ul> <li> <p>Transparency: Agile methodologies encourage teams to be transparent about their progress, challenges, and goals. This helps teams stay aligned and focused on the most important priorities.</p> </li> <li> <p>Continuous improvement: Agile methodologies encourage teams to reflect on their processes and outcomes and make changes based on feedback. This helps teams continually improve and innovate.</p> </li> <li> <p>Empowerment: Agile methodologies empower team members to take ownership of their work, make decisions, and collaborate with others. This leads to higher levels of engagement, creativity, and job satisfaction.</p> </li> </ul> <p>Agile methodologies are a powerful approach to software development that promote collaboration, flexibility, and continuous improvement. They can help teams deliver higher-quality software products faster, while also increasing customer satisfaction and team morale.</p> <h3 id="core-values">Core values</h3> <p>The Agile Manifesto outlines four core values that underpin all agile methodologies:</p> <ol> <li> <p>Individuals and interactions over processes and tools: This value emphasizes the importance of people and collaboration in the software development process. 
Agile methodologies prioritize face-to-face communication, feedback, and collaboration among team members and stakeholders over following rigid processes or relying on tools.</p> </li> <li> <p>Working software over comprehensive documentation: This value prioritizes delivering working software over spending time and resources on extensive documentation. Agile methodologies emphasize the importance of frequent releases and testing to ensure that software is functional and meets user needs.</p> </li> <li> <p>Customer collaboration over contract negotiation: This value emphasizes the importance of working closely with customers and stakeholders throughout the software development process. Agile methodologies prioritize customer feedback and input to ensure that the final product meets their needs and expectations.</p> </li> <li> <p>Responding to change over following a plan: This value recognizes that the software development process is inherently unpredictable and that plans may need to change. 
Agile methodologies prioritize flexibility and the ability to respond quickly to changes in requirements, timelines, or resources.</p> </li> </ol> <p>These core values guide the behavior and decision-making of agile teams and help them deliver high-quality software products that meet customer needs and expectations.</p> <h3 id="principles">Principles</h3> <ol> <li> <p>Customer satisfaction: Delivering value to the customer is the highest priority.</p> </li> <li> <p>Changing requirements: Agile processes are flexible and can adapt to changing requirements, even in the later stages of development.</p> </li> <li> <p>Delivering frequently: Frequent delivery of working software builds trust, encourages feedback, and enables the customer to realize benefits earlier.</p> </li> <li> <p>Collaboration: Agile processes emphasize collaboration between customers, developers, and stakeholders to ensure the best outcome.</p> </li> <li> <p>Motivated individuals: Teams should be composed of self-motivated individuals who are empowered to make decisions and work collaboratively.</p> </li> <li> <p>Face-to-face communication: Communication is key, and face-to-face communication is the most effective way to convey information.</p> </li> <li> <p>Working software: Working software is the primary measure of progress.</p> </li> <li> <p>Sustainable development: Agile processes promote sustainable development, with a focus on maintaining a steady pace and avoiding burnout.</p> </li> <li> <p>Technical excellence: A strong focus on technical excellence is necessary to maintain quality and enable agility.</p> </li> <li> <p>Simplicity: Simplicity is a key aspect of agile development, with a focus on delivering the simplest possible solution that meets customer needs.</p> </li> <li> <p>Self-organizing teams: Teams should be self-organizing, with the ability to adapt to changing requirements and make decisions.</p> </li> <li> <p>Reflection and adaptation: Agile processes promote reflection and 
adaptation, with a focus on continuous improvement and learning from experience.</p> </li> </ol> <h3 id="roles">Roles</h3> <p>The main roles in agile are:</p> <ul> <li> <p>Product Owner: The product owner is responsible for defining and prioritizing the features of the product or service being developed. They are the primary point of contact for the development team and are responsible for communicating the vision and goals of the product to the team.</p> </li> <li> <p>Scrum Master: The scrum master is responsible for ensuring that the development team follows the agile process and for facilitating communication and collaboration within the team. They help remove obstacles and ensure that the team is working efficiently and effectively.</p> </li> <li> <p>Development Team: The development team is responsible for designing, developing, testing, and delivering the product or service being developed. They work together to complete the tasks necessary to meet the project goals.</p> </li> <li> <p>Stakeholders: Stakeholders are individuals or groups with an interest in the project, such as customers, users, or managers. They provide feedback and input to the product owner and development team to ensure that the product meets their needs and expectations.</p> </li> <li> <p>Agile Coach: An agile coach is a mentor who helps teams adopt and implement agile methodologies. They provide guidance and support to the team and help them identify areas for improvement.</p> </li> <li> <p>Business Analyst: A business analyst is responsible for understanding and documenting the requirements of the product or service being developed. They work closely with the product owner to ensure that the development team understands the project goals and requirements.</p> </li> <li> <p>Quality Assurance (QA) Engineer: A QA engineer is responsible for testing the product or service to ensure that it meets quality standards and user requirements. 
They work closely with the development team to identify and resolve defects.</p> </li> </ul> <p>The specific roles and responsibilities may vary depending on the particular agile methodology being used and the needs of the project.</p> <h3 id="agile-components">Agile components</h3> <p>The six key components that make up the Agile methodology:</p> <ul> <li> <p>User Stories: User stories are brief, non-technical descriptions of a feature or requirement from the user’s perspective. They are used to capture user requirements and help the development team understand what the user wants.</p> </li> <li> <p>Backlog: The product backlog is a prioritized list of user stories or features that need to be developed. It is managed by the product owner and is used to guide the development team’s work.</p> </li> <li> <p>Sprints: Sprints are short, time-boxed iterations in which the development team works to deliver a set of features or user stories. Sprints typically last 1-4 weeks, and at the end of each sprint, the team delivers a potentially shippable product increment.</p> </li> <li> <p>Daily Stand-ups: Daily stand-ups are brief, daily meetings in which the development team discusses progress, identifies obstacles, and plans the day’s work. They are intended to keep the team informed and ensure that everyone is working towards the same goals.</p> </li> <li> <p>Sprint Review: The sprint review is a meeting held at the end of each sprint to demonstrate the work that has been completed to the stakeholders. It is an opportunity for the team to get feedback on the product and make any necessary adjustments.</p> </li> <li> <p>Sprint Retrospective: The sprint retrospective is a meeting held at the end of each sprint to review the team’s process and identify areas for improvement. 
The team reflects on what went well and what didn’t, and then makes adjustments for the next sprint.</p> </li> </ul> <p>These six components work together to guide the agile process, from capturing user requirements through to delivering a high-quality product increment at the end of each sprint.</p> <h3 id="agile-phases">Agile Phases</h3> <p>While there are different variations of the Agile methodology, most include the following phases:</p> <ul> <li> <p>Project Initiation: In this phase, the team identifies the scope of the project, the stakeholders involved, and the objectives and goals of the project. The team also establishes the project’s vision and creates a product roadmap to guide the development process.</p> </li> <li> <p>Planning and Requirements Analysis: In this phase, the team identifies the user requirements and translates them into user stories or features. The product backlog is created and prioritized based on customer needs and business objectives. The team also creates a project plan, which outlines the timeline, resources, and budget required to complete the project.</p> </li> <li> <p>Design and Prototyping: In this phase, the team designs the architecture of the system and creates a prototype of the product. The design and prototype are reviewed and tested to ensure they meet the user requirements and project goals.</p> </li> <li> <p>Development and Iterations: In this phase, the team begins to develop the product incrementally through a series of iterations or sprints. Each iteration typically lasts 1-4 weeks and results in a working product increment. The team conducts regular sprint reviews and retrospectives to evaluate progress and identify areas for improvement.</p> </li> <li> <p>Testing: In this phase, the team tests the product to ensure that it meets the user requirements and quality standards. 
Testing is conducted throughout the development process and includes unit testing, integration testing, and acceptance testing.</p> </li> <li> <p>Deployment and Maintenance: In this phase, the team deploys the product to the production environment and provides ongoing maintenance and support. The team continues to refine the product based on user feedback and implements updates and improvements as necessary.</p> </li> </ul> <p>It’s important to note that Agile methodology is iterative and flexible, and these phases are not necessarily sequential. The team can move back and forth between phases as needed to respond to changes and adjust the product as it evolves.</p> <h2 id="delivery-focused-group-dfg-internal-mechanics">Delivery Focused Group (DFG) internal mechanics</h2> <p>A delivery focused group is responsible end-to-end for the delivery of the stories (this includes improvements that can’t be characterized as features, delivering new upstream releases, bug fixing, errata, test coverage improvements, working with support and the field, etc…) in its functional scope. The group is cross-functional, bringing together product managers, engineers, and quality engineers to achieve the delivery.</p> <p>Groups will cross organizational boundaries, but also technology component boundaries. For example, a group in charge of developing a feature will also take care of the installation, update, and upgrade of this feature using the framework provided by the deployment tooling group. The group is responsible for the product implementation: its members determine the technical solution in their area of concern. Each group maintains its own backlog. The delivery focused groups should be small (5 to 9 members), and our goal is self-organisation.
You can only be on one team at a time.</p> <h3 id="organization-of-the-dfgupgrades-squads">Organization of the DFG:Upgrades (squads)</h3> <p>The DFG:Upgrades provides and maintains a framework for migrating, updating or upgrading from an earlier Red Hat OpenStack (OSP) version to an equal or later version.</p> <p>There are three squads in this DFG, each of them with a specific core mission.</p> <table> <thead> <tr> <th>Updates</th> <th>Upgrades</th> <th>Migrations, adoption, and backup and recovery</th> </tr> </thead> <tbody> <tr> <td>…</td> <td>…</td> <td>OpenStack migration tooling (OS migrate)</td> </tr> <tr> <td>…</td> <td>…</td> <td>Next-gen adoption framework</td> </tr> <tr> <td>…</td> <td>…</td> <td>Backup and recovery strategies for updates and upgrades</td> </tr> </tbody> </table> <h3 id="agile-tooling-for-the-squad">Agile tooling for the squad</h3> <h4 id="main-jira-plan">Main Jira plan</h4> <p>A Jira plan is a high-level view of a project’s roadmap, timelines, and milestones. It is a visual representation of the project plan that allows teams to see the big picture and understand how different tasks and issues fit together. Jira plans can include details such as start and end dates, priorities, dependencies, and progress indicators. They are often used to communicate project status and progress to stakeholders, and can help teams stay aligned and focused on project goals. Jira plans can be created and managed within the Jira software, and are often used in conjunction with other Jira tools such as boards and dashboards. 
It allows:</p> <p>Resource allocation: Jira Plan allows teams to allocate resources effectively by identifying tasks and assigning them to specific team members.</p> <p>Time management: Jira Plan helps teams manage their time more efficiently by setting deadlines and prioritizing tasks.</p> <p>Budgeting: Jira Plan can be used to monitor project budgets and ensure that resources are being used effectively.</p> <p><img src="/static/agile_101/plan.png" alt="" /></p> <p>In particular, the above image shows the <a href="https://issues.redhat.com/secure/PortfolioPlanView.jspa?id=1988&amp;sid=1996#plan/backlog">Jira plan</a> for OSP, including a time-based view of the roadmap, timelines, and milestones.</p> <h4 id="main-jira-board">Main Jira board</h4> <p>A Jira board is a visual representation of a team’s work that allows team members to track and manage their tasks and issues in real-time. Jira boards can be customized to reflect a team’s unique workflow and can include different views, such as a Kanban board or a Scrum board.</p> <p>In a Kanban board, tasks are represented as cards that move across different columns that reflect different stages of the workflow, such as “To Do,” “In Progress,” and “Done.” The columns can be customized to match the team’s specific workflow, and team members can easily see the status of each task and identify any bottlenecks or areas for improvement.</p> <p>In a Scrum board, tasks are represented as cards that are grouped into sprints, which are typically two-week time periods in which the team works to complete specific goals. 
The Scrum board includes columns for the different stages of the sprint, such as “To Do,” “In Progress,” “Code Review,” and “Done,” and team members can track the progress of each task and see how it fits into the larger sprint goal.</p> <p>Jira boards can be used by development teams, project managers, and other stakeholders to manage and track work in real-time, and to ensure that everyone is aligned and focused on the team’s goals.</p> <p>Visualization: Jira Board provides a visual representation of tasks and project progress, making it easier for team members to see what’s been done and what still needs to be done.</p> <p>Collaboration: Jira Board facilitates collaboration and communication between team members, allowing them to easily see what others are working on and identify areas where they can help.</p> <p>Prioritization: Jira Board helps teams prioritize tasks and focus on what’s most important, ensuring that the team is working on the right things at the right time.</p> <p><img src="/static/agile_101/board_main.png" alt="" /></p> <p>In particular, the previous image (see the <a href="https://issues.redhat.com/secure/RapidBoard.jspa?rapidView=16007&amp;view=reporting&amp;chart=burndownChart&amp;sprint=49387">Jira DFG:Upgrades board</a>) shows the whole team’s work.</p> <h4 id="squad-board">Squad board</h4> <p>The squad board is represented by a subset of the filters used for the main DFG Jira board. This can help the different squads that are part of the main DFG to focus on specific subsets of work, and to avoid being overwhelmed by too much information.</p> <p>For example, a DFG might have a main Jira board that includes all of their projects and issues, but they may also have sub-boards for each squad that focus on specific projects or workflows. 
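</p> <p>As an illustration, a squad sub-board is usually backed by a saved filter. A sketch of what such a filter could look like in JQL follows; the project key, component, and label are hypothetical placeholders, not the real board configuration:</p> <pre><code class="language-bash">project = OSP AND component = "Upgrades" AND labels = "squad-migrations" ORDER BY Rank ASC
</code></pre> <p>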
These sub-boards would only include the filters and columns that are relevant to that specific project or workflow, allowing team members to focus on the work that is most important to them.</p> <p>Sub-boards can also be used to create more detailed views of specific issues or sets of issues. For example, a team might create a sub-board that shows all of the issues related to a specific feature or bug fix, allowing them to easily track progress and collaborate on solutions.</p> <p>Sub-boards can be a useful tool for organizing and managing work in Jira, and can help teams stay focused and productive.</p> <p><img src="/static/agile_101/board_secondary.png" alt="" /></p> <p>In particular, the previous image (see the <a href="https://issues.redhat.com/secure/RapidBoard.jspa?rapidView=16774&amp;view=reporting&amp;chart=burndownChart&amp;sprint=49387">Jira DFG:Upgrades squad sub-board</a>) shows a subset of all the team’s tasks, so the squad can focus on its specific planned work. It is worth noting that this view makes it possible to compare the overall team’s progress with that of each squad.</p> <h4 id="squad-dashboard">Squad dashboard</h4> <p>A Jira dashboard is a customizable, visual display of key metrics and information from Jira projects and issues. It is designed to provide users with a high-level overview of project status, progress, and key performance indicators (KPIs).</p> <p>Jira dashboards can be configured to display different types of data, such as progress of individual projects, workload of team members, and number of open and closed issues. 
They can also include custom widgets and gadgets, such as charts, graphs, and filters, that allow users to drill down into specific data and see details about individual issues or projects.</p> <p>Jira dashboards can be personalized to meet the needs of individual users, teams, or departments, and can be shared across the organization to ensure everyone is aligned and working towards the same goals. They are a powerful tool for project managers and team members to stay on top of their work and make informed decisions based on real-time data. In particular, they allow:</p> <p>Performance tracking: Jira Dashboards provide real-time insights into project performance, enabling teams to identify areas for improvement and track progress against goals.</p> <p>Customization: Jira Dashboards can be customized to display the information that’s most important to each team member, helping to streamline workflows and improve productivity.</p> <p>Decision making: Jira Dashboards provide the information needed to make informed decisions, helping teams to adjust course and make changes as needed.</p> <p><img src="/static/agile_101/dashboard.png" alt="" /></p> <p>The previous image shows the <a href="https://issues.redhat.com/secure/Dashboard.jspa?selectPageId=12349106">Jira DFG:Upgrades migrations dashboard</a>, where you can see:</p> <ul> <li>The remaining days of the sprint.</li> <li>The people involved in the squad tasks.</li> <li>The effort allocation for the remaining tasks planned for the sprint.</li> <li>The current open tasks for the sprint.</li> <li>The current closed tasks for the sprint.</li> <li>The already groomed list of tasks that will be planned for the next sprint.</li> <li>The burn-down charts for both the whole team and the squad.</li> <li>The list of epics that should be worked on.</li> </ul> <h4 id="sprint-timeline">Sprint Timeline</h4> <p>As the sprint moves forward, there is a set of activities that are handled by all the team members.</p> <table> <thead> <tr> 
<th>Sprint Planning</th> <th>Sprint Execution</th> <th>Sprint Review</th> <th>Sprint Retrospective</th> </tr> </thead> <tbody> <tr> <td>Determine scope and objectives of the sprint.</td> <td>Develop and test user stories.</td> <td>Demo the product to stakeholders and gather feedback.</td> <td>Reflect on the sprint and identify areas for improvement.</td> </tr> <tr> <td>Create a sprint backlog.</td> <td>Conduct daily stand-up meetings.</td> <td>Review and prioritize backlog for next sprint.</td> <td>Discuss what went well, what could have been better, and how to improve.</td> </tr> <tr> <td>Define user stories and acceptance criteria.</td> <td>Collaborate and communicate with team members.</td> <td>Analyze metrics and data to identify areas for improvement.</td> <td>Assign action items and determine a plan for implementing changes.</td> </tr> <tr> <td>Estimate the effort required for each user story.</td> <td>Conduct user testing and integrate feedback.</td> <td>Celebrate successes and recognize team members for their contributions.</td> <td>Plan for the next sprint and make adjustments as needed.</td> </tr> </tbody> </table> <h4 id="constant-grooming-and-backlog-ranking">Constant grooming and backlog ranking</h4> <p>It might be considered a good practice to do constant grooming and backlog ranking in agile development.</p> <p>Grooming, also known as backlog refinement, is the process of reviewing and updating the product backlog to ensure that it is up-to-date and prioritized. This involves reviewing user stories, breaking them down into smaller tasks, and estimating the effort required to complete each task. By doing this regularly, the team can ensure that the backlog is accurate and up-to-date, and that everyone understands the scope of the work that needs to be done.</p> <p>Backlog ranking is the process of prioritizing user stories and tasks based on their importance and value to the product. 
This involves considering factors such as customer needs, business goals, and technical dependencies, and assigning each item in the backlog a priority level. By doing this regularly, the team can ensure that they are always working on the most important and valuable items, and that they are delivering value to the customer in a timely manner.</p> <p><img src="/static/agile_101/grooming.png" alt="" /></p> <p>The previous image shows how, in each sprint, the process of constantly revising the backlog and ranking the tasks takes place.</p> <p>By doing constant grooming and backlog ranking, the team can stay focused on the most important work, and can ensure that they are delivering value to the customer in a timely manner. This helps to ensure that the project stays on track, and that the team is able to deliver a high-quality product that meets the needs of the customer.</p> <h2 id="some-final-conclusions">Some final conclusions</h2> <p>Agile is an iterative and flexible approach to software development that emphasizes collaboration, customer satisfaction, and continuous improvement. It consists of a set of core values, such as valuing individuals and interactions over processes and tools, and 12 principles, such as delivering working software frequently and embracing change.</p> <p>Agile methodology is made up of six key components, including user stories, backlog, sprints, daily stand-ups, sprint review, and sprint retrospective. These components work together to guide the agile process, from capturing user requirements through to delivering a high-quality product increment at the end of each sprint.</p> <p>By embracing agile methodologies, teams can achieve a variety of benefits, such as increased productivity, better collaboration, improved quality, faster time-to-market, and greater customer satisfaction. 
By focusing on continuous improvement and adapting to change, agile teams can remain responsive to customer needs and competitive in an ever-changing market.</p> <p>Jira is a popular project management tool that can be used to support agile development. It includes a number of features, such as Jira plans, boards, and dashboards, that allow teams to plan, track, and manage their work in a flexible and collaborative way.</p> <p>Jira plans are high-level roadmaps that help teams to visualize their work over time, and to track progress towards their goals. Jira boards are visual representations of the work that needs to be done, and can be customized to reflect different workflows and priorities. Jira dashboards are customizable views that provide users with a high-level overview of project status and progress, and can be personalized to meet the needs of individual users or teams.</p> <p>Some of the benefits of using Jira for agile development include improved collaboration, increased transparency, and better visibility into project status and progress. By using Jira, teams can stay organized, focused, and productive, and can deliver high-quality software products that meet the needs of their customers.</p> <p>Finally, constant grooming and backlog ranking are important practices in agile development that help teams to stay focused on the most important work, and to deliver value to the customer in a timely manner. By regularly reviewing and updating the backlog, teams can ensure that they are working on the most important and valuable items, and can adjust their plans as needed to stay on track.</p> ChatGPT a world breaker technology ChatGPT is a large language model developed by OpenAI. It is a variant of the GPT (Generative Pre-training Transformer) model, which is trained on a massive amount of text data and is capable of generating human-like text. Benefits of ChatGPT... 
2022-12-10T00:00:00+00:00 https://www.pubstack.com/blog/2022/12/10/chatgpt Carlos Camacho <p>ChatGPT is a large language model developed by OpenAI. It is a variant of the GPT (Generative Pre-training Transformer) model, which is trained on a massive amount of text data and is capable of generating human-like text.</p> <h2 id="benefits-of-chatgpt">Benefits of ChatGPT</h2> <ul> <li>ChatGPT’s ability to generate human-like text makes it useful in a variety of applications such as chatbots, language translation, and text summarization.</li> <li>ChatGPT can be fine-tuned on specific tasks, such as answering questions, providing information, or composing coherent and grammatically correct text.</li> <li>ChatGPT’s ability to understand and respond to context allows it to generate text that is relevant to the input it receives.</li> </ul> <h2 id="potential-usage-of-chatgpt">Potential usage of ChatGPT</h2> <ul> <li>Chatbots for customer service, virtual assistance, and e-commerce.</li> <li>Generating text for social media, email, and messaging platforms.</li> <li>Improving language translation and summarization systems.</li> <li>Helping with content creation, such as composing articles, stories, or poetry.</li> <li>Generating responses in virtual reality and gaming and many more.</li> </ul> <h2 id="technical-details">Technical details</h2> <ul> <li>ChatGPT is based on the transformer architecture.</li> <li>It is trained on a massive amount of text data, which allows it to generate human-like text.</li> <li>The cost to run ChatGPT will depend on the specific use case and the hardware it is running on.</li> <li>The cost can be reduced by using the model in a “serverless” fashion, where the model is hosted by the provider and you only pay for the computation you use.</li> </ul> <h2 id="conclusion">Conclusion</h2> <p>ChatGPT is a powerful and versatile language model that can be used to improve a wide range of natural language processing tasks. 
Its ability to generate human-like text, understand context, and perform various language-based tasks makes it an attractive option for developers and researchers looking to build advanced language-based applications.</p> <p>This article was generated with ChatGPT.</p> Deploying a Kubernetes cluster with Windows containers support Workloads running on top of Windows-based infrastructure still represent a huge opportunity for ‘the cloud’; specific applications like video-game development (Unity) rely heavily on the Microsoft Windows ecosystem to work and be used among developers and customers. Hence the need... 2022-06-30T00:00:00+00:00 https://www.pubstack.com/blog/2022/06/30/Kubernetes-cluster-with-Windows-containers-support Carlos Camacho <p>Workloads running on top of Windows-based infrastructure still represent a huge opportunity for ‘the cloud’; specific applications like video-game development (Unity) rely heavily on the Microsoft Windows ecosystem to work and be used among developers and customers. Hence the need to provide a consistent cloud infrastructure for such Windows-based software.</p> <h1 id="tldr">TL;DR;</h1> <p>This post will show you how to deploy a Kubernetes cluster with Windows containers support, and what to expect from it.</p> <h2 id="the-good-the-bad-and-the-ugly">The good, the bad and the ugly</h2> <p>From what is available in the documentation, the support for Windows workloads started in Kubernetes 1.5 (2017, ‘alpha’) and became stable in K8s 1.14 (2019), but even at the moment of writing this post, all the documentation, the container runtime support, and the CNI connectivity supporting this specific type of workloads is fairly limited. 
Before showing the actual 1-command deployment magic, this post will go through a brief review of what was needed, and what was done to have this ‘working’.</p> <p>The following sections are an initial assessment of what is working, and how easy it is to integrate these functional components in a more polished and automated way using Kubeinit.</p> <h3 id="the-good">The good</h3> <p>It works! With some specific limitations on which container runtimes are supported, and which CNI plugins support topologies like VXLAN tunnels, you can have something working once you know what is currently supported for your distro.</p> <p>There are a lot of resources spread on the Internet that will give you an idea of what should work, and in some cases how to deploy it.</p> <h4 id="useful-links">Useful links:</h4> <p>This is the partial list of resources checked to finish the integration of Windows workloads in Kubeinit:</p> <p>Blog posts and documentation:</p> <ul> <li>https://www.jamessturtevant.com/posts/Windows-Containers-on-Windows-10-without-Docker-using-Containerd/</li> <li>https://github.com/lippertmarkus/vagrant-k8s-win-hostprocess</li> <li>https://docs.microsoft.com/en-us/virtualization/windowscontainers/kubernetes/common-problems</li> <li>https://techcommunity.microsoft.com/t5/networking-blog/introducing-kubernetes-overlay-networking-for-windows/ba-p/363082</li> <li>https://deepkb.com/CO_000014/en/kb/IMPORT-4bb99d54-1582-32fa-b130-b496089f7678/guide-for-adding-windows-nodes-in-kubernetes</li> </ul> <p>Official code from Kubernetes:</p> <ul> <li>https://github.com/kubernetes/kubernetes/issues/94924</li> <li>https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/windows/k8s-node-setup.psm1</li> </ul> <p>Code from the CNI Microsoft team:</p> <ul> <li>https://github.com/Microsoft/SDN/blob/master/Kubernetes/windows/start-kubelet.ps1</li> <li>https://github.com/microsoft/SDN/blob/master/Kubernetes/windows/helper.psm1</li> 
<li>https://github.com/microsoft/SDN/tree/master/Kubernetes/flannel/overlay</li> <li>https://github.com/microsoft/SDN/blob/master/Kubernetes/flannel/register-svc.ps1</li> </ul> <p>Code from the Windows K8s sig:</p> <ul> <li>https://github.com/kubernetes-sigs/sig-windows-dev-tools</li> <li>https://github.com/kubernetes-sigs/sig-windows-tools/issues/128</li> </ul> <p>Code from the CNI dev team:</p> <ul> <li>https://github.com/containernetworking/plugins/blob/main/plugins/main/windows/win-overlay/sample-v2.conf</li> </ul> <h3 id="the-bad">The bad</h3> <p>It is pretty cumbersome to catch all the steps, ordering, and combinations of services that actually work. Also, the support for different distributions might force you to use, for example, container runtimes which at the moment are just not supported.</p> <h3 id="the-ugly">The ugly</h3> <p>While reading about what to install and how to configure it, I ended up in situations like the following one. What the difference is between sdnoverlay from <a href="https://github.com/microsoft/windows-container-networking/releases/download/v0.3.0/windows-container-networking-cni-amd64-v0.3.0.zip">microsoft/windows-container-networking</a> and winoverlay from <a href="https://github.com/containernetworking/plugins/releases/download/v1.1.1/cni-plugins-windows-amd64-v1.1.1.tgz">containernetworking/plugins</a> remains unknown to me (mostly because of my limited time to dig into it in more detail); they are supposed to do the same thing, but there are plugins maintained by different folks with different names, so it created a little bit (or too much) of confusion for me when doing this initial integration.</p> <p>We assume by default that the components that can be consumed in their “stable” releases will just work. An example of this “not happening”: by default, I thought that the CNI plugin winoverlay would work “as-is” in vanilla Kubernetes, because this same configuration is supported and working on OpenShift and OKD. 
This assumption might not be true: at the moment of writing this post (July 1st, 2022), the support for containerd and Windows compute nodes in this <a href="https://github.com/containernetworking/plugins/pull/725">GitHub PR</a> is merged, but not released upstream (v1.1.1 does not have this change), so whatever you pull from the released versions just won’t work…</p> <h2 id="architectural-considerations-and-deployment">Architectural considerations and deployment</h2> <p>Before going ahead, this section will briefly introduce Kubeinit’s architectural reference so you know ahead of time what is deployed, where, and how.</p> <p><img src="/static/kubeinit/arch/kubeinit_network_legend.png" alt="" /> The picture above shows the legend of the main functional components of the platform.</p> <p>The left icon represents the services pod, which will run all the infrastructure services required to have the cluster up and running: services like HAProxy, Bind, and a local container registry, among others, that will support the cluster’s required external services.</p> <p>The right icon represents the Virtual Machine instances that will host each node of the cluster. These nodes can be control-plane nodes or worker nodes; the control-plane nodes will be Linux, and depending on the Kubernetes distribution they will have CentOS Stream, Debian, Fedora CoreOS, or CoreOS installed. The worker nodes will have the same Linux distribution as the control-plane nodes, with the addition of the recently added Windows worker nodes.</p> <p><img src="/static/kubeinit/arch/kubeinit_network_physical.png" alt="" /></p> <p>Now we have a representation of the physical topology of a deployed cluster. The requirement is to have a set of hypervisors where the cluster nodes (guests) will be deployed; these hypervisors must be connected in a way that they can all reach each other (or be connected in an L2 segment). 
Another assumed-by-default requirement is to have passwordless SSH access to the nodes from the machine where you are calling Ansible.</p> <p><img src="/static/kubeinit/arch/kubeinit_network_logical.png" alt="" /></p> <p>The logical topology represents a more detailed view of how the components are actually ‘connected’, by allowing all the guests in the cluster (including the services pod) to be reachable without any distinction of what is deployed where within the cluster.</p> <p>In this particular case, we have installed OVS in each hypervisor and by using OVN we create an internal overlay network to provide a consistent and uniform way to access any cluster resource.</p> <h3 id="latest-kubeinits-support-for-windows-workloads">Latest Kubeinit’s support for Windows workloads</h3> <p>With some context of what will be deployed from the previous sections, let’s go ahead and test this awesome cool feature in a magical 1-command deployment.</p> <blockquote> <p>NOTE: Please check the complete instructions from the main <a href="https://github.com/Kubeinit/kubeinit#readme">README</a> page. Also, a good reference for the hypervisor requirements is the <a href="https://github.com/Kubeinit/kubeinit/blob/main/ci/install_gitlab_node.sh">CI install script</a>, where you will find what is required to set up the hypervisors based on Debian/Ubuntu/CentOS Stream/Fedora.</p> </blockquote> <h3 id="deploying">Deploying</h3> <pre><code class="language-bash"># Install the requirements assuming python3/pip3 is installed
pip3 install \
    --upgrade \
    pip \
    shyaml \
    ansible \
    netaddr

# Get the project's source code
git clone https://github.com/Kubeinit/kubeinit.git
cd kubeinit

# Install the Ansible collection requirements
ansible-galaxy collection install --force --requirements-file kubeinit/requirements.yml

# Build and install the collection
rm -rf ~/.ansible/collections/ansible_collections/kubeinit/kubeinit
ansible-galaxy collection build kubeinit --verbose --force --output-path releases/
ansible-galaxy collection install --force --force-with-deps releases/kubeinit-kubeinit-`cat kubeinit/galaxy.yml | shyaml get-value version`.tar.gz

# Run the deployment
ansible-playbook \
    --user root \
    -v \
    -e kubeinit_spec="k8s-libvirt-1-1-1" \
    -e kubeinit_libvirt_cloud_user_create=true \
    -e hypervisor_hosts_spec='[[ansible_host=nyctea],[ansible_host=tyto]]' \
    -e cluster_nodes_spec='[[when_group=compute_nodes,os=windows]]' \
    -e compute_node_ram_size=16777216 \
    ./kubeinit/playbook.yml
</code></pre> <p>If you would like to clean up your environment after using Kubeinit, just run the same deployment command appending <code>-e kubeinit_stop_after_task=task-cleanup-hypervisors</code>, which will clean all the resources deployed by the installer.</p> <blockquote> <p>NOTE: If you have an error when deploying the services pod like ‘TASK [kubeinit.kubeinit.kubeinit_services : Install python3] … unreachable Could not resolve host: mirrorlist.centos.org’ make sure your DNS server is reachable. By default the 1.1.1.1 DNS server from Cloudflare is used, and it might be blocked in your internal network; run <code>export KUBEINIT_COMMON_DNS_PUBLIC=&lt;your valid DNS&gt;</code> and then run the deployment as usual.</p> </blockquote> <p>The ‘new’ parameter present when running the Ansible playbook is called <code>cluster_nodes_spec</code>; this parameter allows you to determine the OS (Operating System) of the compute nodes. There are still some features in progress to fully allow customizing the setup and deciding how many nodes will be Windows-based and how many will have Linux installed. The only supported version is Windows Server 2022 Datacenter Edition. 
This Windows Server installation is based on the actual .ISO installer, so if a user wants to use this in a more stable scenario they will need to register these cluster nodes.</p> <p>Once the deployment finishes, which in the case of having Windows compute nodes will take around ~50 minutes (at least the first time, because we need to download all the .ISO images), you can VNC into your Windows compute nodes, first by forwarding port 5900 like:</p> <pre><code class="language-bash">[ccamacho@laptop]$ ssh root@nyctea -L 5900:127.0.0.1:5900
</code></pre> <p>Then, from your workstation start a VNC session to 127.0.0.1:5900.</p> <p>Or even better, you can SSH directly into your Windows computes like:</p> <pre><code class="language-bash"># This will get you into the first controller node.
[ccamacho@nyctea]$ ssh -i .ssh/k8scluster_id_rsa [email protected]

# This will get you to your first compute node.
[ccamacho@nyctea]$ ssh -i .ssh/k8scluster_id_rsa [email protected]

# This will get you to the services pod.
[ccamacho@nyctea]$ ssh -i .ssh/k8scluster_id_rsa [email protected]
</code></pre> <p>The default IP segment assigned for the cluster nodes is the 10.0.0.0/24 network, where the IPs are assigned in order, first the controller nodes, then the computes, and the last IP for the services pod. 
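</p> <p>The assignment order just described can be sketched in shell. This is only an illustration: the node counts follow the <code>k8s-libvirt-1-1-1</code> spec used above, and the concrete addresses (in particular the services pod taking the last usable address of the /24) are my assumption, not something stated by the installer:</p> <pre><code class="language-bash"># Sketch of the in-order IP assignment described above (addresses are assumed).
controllers=1
computes=1
ip=1
for n in $(seq 1 "$controllers"); do
  echo "controller-$n gets 10.0.0.$ip"
  ip=$((ip + 1))
done
for n in $(seq 1 "$computes"); do
  echo "compute-$n gets 10.0.0.$ip"
  ip=$((ip + 1))
done
# Assumed: the services pod gets the last usable address of the /24.
echo "services pod gets 10.0.0.254"
</code></pre> <p>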
In this example, given that the kubeinit_spec is <code>k8s-libvirt-1-1-1</code>, we will deploy a vanilla Kubernetes cluster on top of libvirt with one controller, one compute, and a single hypervisor (listed in the same order as the spec content).</p> <p>This same configuration is deployed periodically; you can check the status of the execution on the <a href="https://ci.kubeinit.org/file/kubeinit-ci/jobs/okd-libvirt-1-1-1-h-periodic-pid-weekly-u/index.html">periodic job page</a>.</p> <h3 id="considerations">Considerations</h3> <p>After having the cluster deployed, one consideration is that, from now on, application deployments need to determine the OS of the guests where the workloads will run.</p> <p>The behavior I was able to see is that Linux deployments tried to be scheduled on the Windows compute nodes, and at some point they just timed out.</p> <p>So make sure your deployments use a nodeSelector like:</p> <pre><code class="language-yaml">nodeSelector:
  kubernetes.io/os: linux
</code></pre> <p>or</p> <pre><code class="language-yaml">nodeSelector:
  kubernetes.io/os: windows
</code></pre> <h2 id="conclusions">Conclusions</h2> <p>While it was hard to make it work at first, having Windows workloads integrated into some of the use cases proves how easy it was to extend Kubeinit’s core architectural structure with new types of nodes and new types of workloads.</p> <p>This new use case enables other types of workloads that might benefit completely different usages of ‘the cloud’ than we were used to seeing.</p> <p>If you are interested in checking the PRs’ code, they are <a href="https://github.com/Kubeinit/kubeinit/pull/664">664</a>, <a href="https://github.com/Kubeinit/kubeinit/pull/668">668</a>, <a href="https://github.com/Kubeinit/kubeinit/pull/669">669</a>, <a href="https://github.com/Kubeinit/kubeinit/pull/672">672</a>, and <a href="https://github.com/Kubeinit/kubeinit/pull/676">676</a>.</p> <p>There might still be features not working 
properly, or connectivity issues between pods, as that is something I didn’t have time to test properly so far. The architectural <a href="https://drive.google.com/file/d/1l9AHJ_60SNWYFVPWgO4C__U0bwPJvVIT/view?usp=sharing">diagrams</a> are available online for further edits and references.</p> <h2 id="update-log">Update log:</h2> <div style="font-size:10px"> <blockquote> <p><strong>2022/07/01:</strong> Initial version and minor edits.</p> </blockquote> </div> <h2 id="the-end">The end</h2> <p>If you like this post, please try the code, raise issues, and ask for more details, features, or anything that you find interesting. Also, it would be awesome if you became a stargazer to catch up on updates and new features.</p> <p>This is the project’s main <a href="https://github.com/kubeinit/kubeinit">repository</a>.</p> <p>Happy KubeIniting!</p> Learning how to deploy OpenShift with KubeInit KubeInit is an Ansible collection to ease the deployment of multiple Kubernetes distributions. This post will show you how to use it to deploy OpenShift in your infrastructure. Note 2021/10/13: DEPRECATED - This tutorial only works with kubeinit 1.0.2 make... 2021-03-12T00:00:00+00:00 https://www.pubstack.com/blog/2021/03/12/Learning-how-to-deploy-OpenShift-with-KubeInit Carlos Camacho <p>KubeInit is an Ansible collection to ease the deployment of multiple Kubernetes distributions. 
This post will show you how to use it to deploy OpenShift in your infrastructure.</p> <blockquote> <p><strong><em>Note 2021/10/13:</em></strong> DEPRECATED - This tutorial only works with <a href="https://github.com/Kubeinit/kubeinit/releases/tag/1.0.2">kubeinit 1.0.2</a>; make sure you use this version of the code if you are following this tutorial, or <a href="https://docs.kubeinit.org/">refer to the documentation</a> to use the latest code.</p> </blockquote> <h1 id="tldr">TL;DR;</h1> <p>This post will show you the command and the parameters that need to be configured to deploy OpenShift (4.7).</p> <h3 id="prerequirements">Prerequisites</h3> <p>Adjust your inventory file according to what you would like to deploy. Please make sure you read older posts to understand KubeInit’s deployment workflow, or give the docs a try at <a href="https://docs.kubeinit.com/">https://docs.kubeinit.com/</a>.</p> <h3 id="openshift-registry-token">OpenShift registry token</h3> <p>The next step is to fetch a valid registry token list (pull secret) from <a href="https://cloud.redhat.com/openshift/install/pull-secret">https://cloud.redhat.com/openshift/install/pull-secret</a>.</p> <p>You should get a long JSON dictionary with the credential details needed to adjust our deployment pull secrets so the images can be fetched.</p> <p>The pull secret syntax should look like:</p> <pre><code class="language-bash">{
  "auths":{
    "cloud.openshift.com":{"auth":"TOKEN1_GOES_HERE","email":"email@example"},
    "quay.io":{"auth":"TOKEN2_GOES_HERE","email":"[email protected]"},
    "registry.connect.redhat.com":{"auth":"TOKEN3_GOES_HERE","email":"[email protected]"},
    "registry.redhat.io":{"auth":"TOKEN4_GOES_HERE","email":"[email protected]"}
  }
}
</code></pre> <h3 id="deploying">Deploying</h3> <p>The deployment procedure is the same as it is for all the other Kubernetes distributions that can be deployed with KubeInit.</p> <p>Please follow the <a 
href="http://docs.kubeinit.com/usage.html">usage documentation</a> to understand the system’s requirements and the required host supported Linux distributions.</p> <p>At the moment we will deploy OpenShift 4.7.0 (the latest release available), if you need to deploy other releases adjust the value of the <code>kubeinit_okd_registry_release_tag</code> variable.</p> <pre><code class="language-bash"># Choose the distro distro=okd # Run the deployment command ansible-playbook \ -v \ --user root \ -i ./hosts/$distro/inventory \ --become \ --become-user root \ -e kubeinit_okd_openshift_deploy=True \ -e kubeinit_okd_openshift_registry_token_cloud_openshift_com="TOKEN1_GOES_HERE" \ -e kubeinit_okd_openshift_registry_token_quay_io="TOKEN2_GOES_HERE" \ -e kubeinit_okd_openshift_registry_token_registry_connect_redhat_com="TOKEN3_GOES_HERE" \ -e kubeinit_okd_openshift_registry_token_registry_redhat_io="TOKEN4_GOES_HERE" \ -e kubeinit_okd_openshift_registry_token_email="[email protected]" \ ./playbooks/$distro.yml </code></pre> <p>Note: The variables required to override an OpenShift deployment are <code>kubeinit_okd_openshift_deploy</code>, <code>kubeinit_okd_openshift_registry_token_cloud_openshift_com</code>, <code>kubeinit_okd_openshift_registry_token_quay_io</code>, <code>kubeinit_okd_openshift_registry_token_registry_connect_redhat_com</code>, <code>kubeinit_okd_openshift_registry_token_registry_redhat_io</code>, and <code>kubeinit_okd_openshift_registry_token_email</code>.</p> <h3 id="conclusions">Conclusions</h3> <p>Deploying also OpenShift demostrates how flexible KubeInit can be. 
With a few changes we can also deploy downstream Kubernetes distributions ready for production-grade deployments.</p> <p>Once <a href="https://github.com/Kubeinit/kubeinit/pull/219/files">#219</a> is merged you should be able to run this.</p> <h3 id="the-end">The end</h3> <p>If you like this post, please try the code, raise issues, and ask for more details, features, or anything else you are interested in. Also, it would be awesome if you became a stargazer to catch up on updates and new features.</p> <p>This is the main project <a href="https://github.com/kubeinit/kubeinit">repository</a>.</p> <p>Happy KubeIniting!</p> Multihost deployments with Kubeinit Until now it was only possible to deploy Kubeinit in a single-host configuration, meaning all the guest VMs run on the same hypervisor. Now, we can decouple the deployment architecture across multiple hosts. Note 2021/10/13: DEPRECATED - This... 2021-02-20T00:00:00+00:00 https://www.pubstack.com/blog/2021/02/20/Multihost-deployment-with-kubeinit Carlos Camacho <p>Until now it was only possible to deploy Kubeinit in a single-host configuration, meaning all the guest VMs run on the same hypervisor. Now, we can decouple the deployment architecture across multiple hosts.</p> <blockquote> <p><strong><em>Note 2021/10/13:</em></strong> DEPRECATED - This tutorial only works with <a href="https://github.com/Kubeinit/kubeinit/releases/tag/1.0.2">kubeinit 1.0.2</a>; make sure you use this version of the code if you are following this tutorial, or <a href="https://docs.kubeinit.org/">refer to the documentation</a> to use the latest code.</p> </blockquote> <h1 id="tldr">TL;DR;</h1> <p>This post will show how to adjust the inventory files to deploy Kubeinit across multiple hosts.</p> <h3 id="network-architecture">Network architecture</h3> <p>From the official docs, OVN (Open Virtual Network) is a series of daemons for Open vSwitch that translate virtual network configurations into OpenFlow. 
OVN is licensed under the open source Apache 2 license.</p> <p>OVN provides a higher layer of abstraction than Open vSwitch, working with logical routers and logical switches rather than flows. OVN is intended to be used by cloud management software (CMS).</p> <p>Open vSwitch is a free and open source multi-layer software switch, which is used to manage the traffic between virtual machines and physical or logical networks.</p> <p>The following is the current network architecture of a Kubeinit deployment.</p> <p><img src="/static/kubeinit/net/ovn.png" alt="" /></p> <h3 id="adjusting-the-inventory-files">Adjusting the inventory files</h3> <p>To deploy Kubeinit across multiple hosts, the only changes that need to be made are in the inventory file.</p> <p>In this case, by default there is only one hypervisor enabled (nyctea).</p> <pre><code class="language-bash">[hypervisor_nodes]
hypervisor-01 ansible_host=nyctea
# hypervisor-02 ansible_host=tyto
# hypervisor-03 ansible_host=strix
# hypervisor-04 ansible_host=otus
</code></pre> <p>The only action that needs to be taken is to uncomment the lines that enable the extra hosts required.</p> <p>The next step is to determine where to deploy each guest. 
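</p> <p>Enabling the extra hosts can also be scripted. As a small sketch (assumptions: the fragment below mirrors the example inventory above, and GNU sed is available), piping the fragment through sed previews the uncommenting of hypervisor-02:</p> <pre><code class="language-bash"># Preview enabling hypervisor-02 by stripping its leading '# '
printf '%s\n' \
  '[hypervisor_nodes]' \
  'hypervisor-01 ansible_host=nyctea' \
  '# hypervisor-02 ansible_host=tyto' \
  '# hypervisor-03 ansible_host=strix' \
  '# hypervisor-04 ansible_host=otus' \
| sed 's/^# \(hypervisor-02\)/\1/'
</code></pre> <p>Against the real file you would run the same expression in place, e.g. <code>sed -i 's/^# \(hypervisor-02\)/\1/' ./hosts/okd/inventory</code> (path as used elsewhere in these posts).</p> <p>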
In this example all guests are deployed to hypervisor-01; make the adjustments as required.</p> <pre><code class="language-bash">[okd_master_nodes]
okd-master-01 ansible_host=10.0.0.1 mac=52:54:00:34:84:26 interfaceid=47f2be09-9cde-49d5-bc7b-76189dfcb8a9 target=hypervisor-01 type=virtual
okd-master-02 ansible_host=10.0.0.2 mac=52:54:00:53:75:61 interfaceid=fb2028cf-dfb9-4d17-827d-3fae36cb3e98 target=hypervisor-01 type=virtual
okd-master-03 ansible_host=10.0.0.3 mac=52:54:00:96:67:20 interfaceid=d43b705e-86ce-4955-bbf4-3888210af82e target=hypervisor-01 type=virtual
</code></pre> <p>For example, let’s deploy each master node in a different hypervisor:</p> <pre><code class="language-bash">[okd_master_nodes]
okd-master-01 ansible_host=10.0.0.1 mac=52:54:00:34:84:26 interfaceid=47f2be09-9cde-49d5-bc7b-76189dfcb8a9 target=hypervisor-01 type=virtual
okd-master-02 ansible_host=10.0.0.2 mac=52:54:00:53:75:61 interfaceid=fb2028cf-dfb9-4d17-827d-3fae36cb3e98 target=hypervisor-02 type=virtual
okd-master-03 ansible_host=10.0.0.3 mac=52:54:00:96:67:20 interfaceid=d43b705e-86ce-4955-bbf4-3888210af82e target=hypervisor-03 type=virtual
</code></pre> <p>Now, okd-master-01 will be deployed to hypervisor-01, okd-master-02 will be deployed to hypervisor-02, and okd-master-03 will be deployed to hypervisor-03.</p> <h3 id="requirements">Requirements</h3> <p>To deploy Kubeinit across multiple hypervisors, the only requirement is to have passwordless root access to the hosts from the machine executing <code>ansible-playbook</code>.</p> <h3 id="deploying">Deploying</h3> <p>The deployment procedure is the same as it is for all the other Kubernetes distributions that can be deployed with KubeInit.</p> <p>Please follow the <a href="http://docs.kubeinit.com/usage.html">usage documentation</a> to understand the system requirements and the supported host Linux distributions.</p> <pre><code class="language-bash"># Choose the distro
distro=okd

# Run the deployment command
git clone 
https://github.com/kubeinit/kubeinit.git
cd kubeinit
ansible-playbook \
    --user root \
    -v -i ./hosts/$distro/inventory \
    --become \
    --become-user root \
    ./playbooks/$distro.yml
</code></pre> <h3 id="conclusions">Conclusions</h3> <p>Being able to decouple the cluster across multiple hosts makes it possible to scale and deploy production-ready environments for different use cases. The overlay network deployed with OVN across the hosts provides a simple abstraction layer to connect all the guests in the cluster.</p> <p>Special thanks to <a href="http://dani.foroselectronica.es/">@dalvarez</a> and <a href="https://github.com/danielmellado">@dmellado</a> for their help and insights.</p> <h3 id="the-end">The end</h3> <p>If you like this post, please try the code, raise issues, and ask for more details, features, or anything else you are interested in. Also, it would be awesome if you became a stargazer to catch up on updates and new features.</p> <p>This is the main project <a href="https://github.com/kubeinit/kubeinit">repository</a>.</p> <p>Happy KubeIniting!</p> How to deploy Amazon EKS-D on top of a Libvirt host with KubeInit in 15 minutes And there is a new distro in town; today we will talk about Amazon EKS Distro (EKS-D), a Kubernetes distribution based on Amazon Elastic Kubernetes Service (Amazon EKS) and how to deploy it on a Libvirt host with almost or... 
2020-12-07T00:00:00+00:00 https://www.pubstack.com/blog/2020/12/07/How-to-deploy-Amazon-EKS-D-in-a-Libvirt-host Carlos Camacho <p>And there is a new distro in town; today we will talk about Amazon EKS Distro (EKS-D), a Kubernetes distribution based on Amazon Elastic Kubernetes Service (Amazon EKS), and how to deploy it on a Libvirt host with almost zero effort in a few minutes.</p> <blockquote> <p><strong><em>Note 2021/10/13:</em></strong> DEPRECATED - This tutorial only works with <a href="https://github.com/Kubeinit/kubeinit/releases/tag/1.0.2">kubeinit 1.0.2</a>; make sure you use this version of the code if you are following this tutorial, or <a href="https://docs.kubeinit.org/">refer to the documentation</a> to use the latest code.</p> </blockquote> <h1 id="tldr">TL;DR;</h1> <p>Using KubeInit, we will deploy a Kubernetes cluster based on Amazon’s EKS distribution. <strong>Disclaimer:</strong> This is not completely implemented, as there are still some EKS-D images to be added to the deployment.</p> <h3 id="components">Components</h3> <p>Here is a list of the components that are currently deployed:</p> <ul> <li>Guest OS: CentOS 8 (8.2.2004)</li> <li>Kubernetes distribution: EKS-D</li> <li>Infrastructure provider: Libvirt</li> <li>A service machine with the following services: <ul> <li>HAProxy: 1.8.23 2019/11/25</li> <li>Apache: 2.4.37</li> <li>NFS (nfs-utils): 2.3.3</li> <li>DNS (bind9): 9.11.13</li> <li>Disconnected docker registry: v2</li> <li>Skopeo: 0.1.40</li> </ul> </li> <li>Control plane services: <ul> <li>Kubelet 1.18.4</li> <li>CRI-O: 1.18.4</li> <li>Podman: 1.6.4</li> </ul> </li> <li>Controller nodes: 3</li> <li>Worker nodes: 1</li> </ul> <h3 id="deploying">Deploying</h3> <p>The deployment procedure is the same as it is for all the other Kubernetes distributions that can be deployed with KubeInit.</p> <p><strong>Note:</strong> Make sure you can connect to your hypervisor (called nyctea) with passwordless access.</p> <p>Please follow the <a 
href="http://docs.kubeinit.com/usage.html">usage documentation</a> to understand the system’s requirements and the required host supported Linux distributions.</p> <pre><code class="language-bash"># Choose the distro distro=eks # Run the deployment command git clone https://github.com/kubeinit/kubeinit.git cd kubeinit ansible-playbook \ --user root \ -v -i ./hosts/$distro/inventory \ --become \ --become-user root \ ./playbooks/$distro.yml </code></pre> <p>You can also <a href="https://www.pubstack.com/blog/2020/09/11/Deploying-KubeInit-from-a-container.html">run it from a container</a> to avoid compatibility issues between your set up and the required libraries.</p> <p>This will deploy by default a 3 controllers 1 compute cluster.</p> <p>The deployment time was fairly quick (around 15 minutes):</p> <pre><code class="language-bash">. . . " description: snapshot-validation-webhook container image", " image:", " uri: public.ecr.aws/eks-distro/kubernetes-csi/external-snapshotter/snapshot-validation-webhook:v3.0.2-eks-1-18-1", " name: snapshot-validation-webhook-image", " os: linux", " type: Image", " gitTag: v3.0.2", " name: external-snapshotter", " date: \"2020-12-01T00:05:35Z\"" ] } META: ran handlers META: ran handlers PLAY RECAP ***************************************************************************************************************** hypervisor-01 : ok=188 changed=93 unreachable=0 failed=0 skipped=43 rescued=0 ignored=4 real 17m12.889s user 1m24.846s sys 0m24.366s </code></pre> <p>Let’s run some commands in the cluster.</p> <pre><code class="language-bash">[root@eks-service-01 ~]# curl --user registryusername:registrypassword https://eks-service-01.clustername0.kubeinit.local:5000/v2/_catalog { "repositories":[ "aws-iam-authenticator", "coredns", "csi-snapshotter", "eks-distro/coredns/coredns", "eks-distro/etcd-io/etcd", "eks-distro/kubernetes/go-runner", "eks-distro/kubernetes/kube-apiserver", "eks-distro/kubernetes/kube-controller-manager", 
"eks-distro/kubernetes/kube-proxy", "eks-distro/kubernetes/kube-proxy-base", "eks-distro/kubernetes/kube-scheduler", "eks-distro/kubernetes/pause", "eks-distro/kubernetes-csi/external-attacher", "eks-distro/kubernetes-csi/external-provisioner", "eks-distro/kubernetes-csi/external-resizer", "eks-distro/kubernetes-csi/external-snapshotter/csi-snapshotter", "eks-distro/kubernetes-csi/external-snapshotter/snapshot-controller", "eks-distro/kubernetes-csi/external-snapshotter/snapshot-validation-webhook", "eks-distro/kubernetes-csi/livenessprobe", "eks-distro/kubernetes-csi/node-driver-registrar", "eks-distro/kubernetes-sigs/aws-iam-authenticator", "eks-distro/kubernetes-sigs/metrics-server", "etcd", "external-attacher", "external-provisioner", "external-resizer", "go-runner", "kube-apiserver", "kube-controller-manager", "kube-proxy", "kube-proxy-base", "kube-scheduler", "livenessprobe", "metrics-server", "node-driver-registrar", "pause", "snapshot-controller", "snapshot-validation-webhook" ] } </code></pre> <p>And check some of the deployed resources.</p> <pre><code class="language-bash">[root@eks-service-01 ~]# kubectl describe pods etcd-eks-master-01.kubeinit.local -n kube-system Name: etcd-eks-master-01.kubeinit.local Namespace: kube-system Priority: 2000000000 Priority Class Name: system-cluster-critical Node: eks-master-01.kubeinit.local/10.0.0.1 Start Time: Sun, 06 Dec 2020 19:51:25 +0000 Labels: component=etcd tier=control-plane Annotations: kubeadm.kubernetes.io/etcd.advertise-client-urls: https://10.0.0.1:2379 kubernetes.io/config.hash: 3be258678a84985dbdb9ae7cb90c6a97 kubernetes.io/config.mirror: 3be258678a84985dbdb9ae7cb90c6a97 kubernetes.io/config.seen: 2020-12-06T19:51:18.652592779Z kubernetes.io/config.source: file Status: Running IP: 10.0.0.1 IPs: IP: 10.0.0.1 Controlled By: Node/eks-master-01.kubeinit.local Containers: etcd: Container ID: cri-o://7a52bd0b80feb8c861c502add4c252e83c7e4a1f904a376108e3f6f787fd342c Image: 
eks-service-01.clustername0.kubeinit.local:5000/etcd:v3.4.14-eks-1-18-1 </code></pre> <h3 id="conclusions">Conclusions</h3> <p>Whether or not this new distribution is useful for your use cases, there is certainly value in having the same architecture and service consistency both in the cloud and on-premises.</p> <h3 id="the-end">The end</h3> <p>If you like this post, please try the code, raise issues, and ask for more details, features, or anything else you are interested in. Also, it would be awesome if you became a stargazer to catch up on updates and new features.</p> <p>This is the main project <a href="https://github.com/kubeinit/kubeinit">repository</a>.</p> <p>Happy KubeIniting!</p> KubeInit 4-in-1 - Deploying multiple Kubernetes distributions (K8S, OKD, RKE, and CDK) with the same platform One of KubeInit’s pillars is to define a common framework to deploy multiple Kubernetes distributions; once you finish the deployment you should be able to use the specific distro tooling to manage the lifecycle of your deployment (or multiple... 
2020-10-19T00:00:00+00:00 https://www.pubstack.com/blog/2020/10/19/KubeInit-4-in-1-Deploying-multiple-Kubernetes-distributions-K8S-OKD-RKE-and-CDK-with-the-same-platform Carlos Camacho <p>One of KubeInit’s pillars is to define a common framework to deploy multiple Kubernetes distributions; once you finish the deployment you should be able to use the specific distro tooling to manage the lifecycle of your deployment (or multiple deployments).</p> <blockquote> <p><strong><em>Note 2021/10/13:</em></strong> DEPRECATED - This tutorial only works with <a href="https://github.com/Kubeinit/kubeinit/releases/tag/1.0.2">kubeinit 1.0.2</a>; make sure you use this version of the code if you are following this tutorial, or <a href="https://docs.kubeinit.org/">refer to the documentation</a> to use the latest code.</p> </blockquote> <p>Using KubeInit you should be able to reuse a common set of third-party services and infrastructure deployment assets with any already integrated distro.</p> <p>The current distributions of Kubernetes that should be deployable are:</p> <ul> <li>Origin Kubernetes Distribution</li> <li>Kubernetes</li> <li>Rancher Kubernetes Distribution</li> <li>Canonical Distribution of Kubernetes</li> </ul> <h1 id="tldr">TL;DR;</h1> <p><strong>Disclaimer:</strong> This does not fully work yet xD… Multiple scenarios might be broken, the DNS might not work properly, and the HAProxy service might also be failing; this is the reason why it is not documented in the <a href="https://docs.kubeinit.com">official docs</a>. Any contribution is and always will be welcome. 
Yet, the deployment should finish successfully.</p> <h3 id="the-roles-and-playbooks-structure">The roles and playbooks structure</h3> <p>Every supported distro has a role folder named <code>kubeinit_&lt;distro&gt;</code>; this means <code>kubeinit_okd</code>, <code>kubeinit_k8s</code>, <code>kubeinit_rke</code>, and <code>kubeinit_cdk</code>.</p> <p>Then, there is a specific playbook for each distro named using its distribution initials: <code>okd</code>, <code>k8s</code>, <code>rke</code>, and <code>cdk</code>.</p> <p>This means that the workflow to deploy any distro is the same for every one of them.</p> <h3 id="deploying">Deploying</h3> <p>Choose the Kubernetes distribution you will use:</p> <ul> <li>Origin Kubernetes Distribution: okd</li> <li>Kubernetes: k8s</li> <li>Rancher Kubernetes Distribution: rke</li> <li>Canonical Distribution of Kubernetes: cdk</li> </ul> <pre><code class="language-bash"># Choose the distro
# distro=k8s
# distro=rke
# distro=cdk
distro=okd

# Run the deployment command
git clone https://github.com/kubeinit/kubeinit.git
cd kubeinit
ansible-playbook \
    --user root \
    -v -i ./hosts/$distro/inventory \
    --become \
    --become-user root \
    ./playbooks/$distro.yml
</code></pre> <h3 id="conclusions">Conclusions</h3> <p>Being able to deploy multiple Kubernetes distributions in an easy, quick, and reproducible way, using the same interface, allows users to test and evaluate them to see which one (or several) best fits their use cases.</p> <h3 id="the-end">The end</h3> <p>If you like this post, please try the code, raise issues, and ask for more details, features, or anything else you are interested in. 
Also, it would be awesome if you became a stargazer to catch up on updates and new features.</p> <p>This is the main project <a href="https://github.com/kubeinit/kubeinit">repository</a>.</p> <p>Happy KubeIniting!</p> <hr /> <p><img src="/static/kubeinit/yaml.jpeg" alt="" /></p> Deploying multiple KubeInit clusters in the same hypervisor The Ansible inventory file defines the hosts and groups of hosts upon which commands, modules, and tasks in a playbook operate. This post explains how to update the inventory files to allow deploying multiple KubeInit... 2020-10-04T00:00:00+00:00 https://www.pubstack.com/blog/2020/10/04/Multiple-KubeInit-clusters-in-the-same-hypervisor Carlos Camacho <p>The Ansible inventory file defines the hosts and groups of hosts upon which commands, modules, and tasks in a playbook operate. This post explains how to update the inventory files to allow deploying multiple KubeInit clusters in the same hypervisor.</p> <blockquote> <p><strong><em>Note 2021/10/13:</em></strong> DEPRECATED - This tutorial only works with <a href="https://github.com/Kubeinit/kubeinit/releases/tag/1.0.2">kubeinit 1.0.2</a>; make sure you use this version of the code if you are following this tutorial, or <a href="https://docs.kubeinit.org/">refer to the documentation</a> to use the latest code.</p> </blockquote> <h1 id="tldr">TL;DR;</h1> <p>We will show the required changes to the inventory file to deploy more than one KubeInit cluster in the same host.</p> <h3 id="basic-information">Basic information</h3> <p>All the changes required to achieve the goal of this post are made in the same file.</p> <p>As an example, in this post we will use the <a href="https://github.com/Kubeinit/kubeinit/blob/master/kubeinit/hosts/okd/inventory">OKD inventory file</a>.</p> <h3 id="steps">Steps</h3> <p>The following steps are required to deploy a second (or as many as required) KubeInit cluster in the same host.</p> <h4 
id="1-duplicate-the-main-inventory-file">1. Duplicate the main inventory file.</h4> <pre><code>echo "NEW_ID MUST BE AN INTEGER" new_id=2 cp inventory inventory$new_id </code></pre> <h4 id="2-adjust-network-parameters">2. Adjust network parameters.</h4> <p>The default internal network used is 10.0.0.0/24 so we need to change it to a new range.</p> <p>We will change from the range 10.0.0 to 10.0.2 (referring to step 1 <em>new_id=2</em>)</p> <pre><code>sed -i "s/10\.0\.0/10\.0\.$new_id/g" inventory$new_id </code></pre> <h4 id="3-adjust-the-network-and-bridges-names">3. Adjust the network and bridges names.</h4> <p>We will create new bridges and networks for the new deployment.</p> <p>We will change from i.e. kimgtnet0 to kimgtnet2 (referring to step 1 <em>new_id=2</em>)</p> <pre><code>sed -i "s/kimgtnet0/kimgtnet$new_id/g" inventory$new_id sed -i "s/kimgtbr0/kimgtbr$new_id/g" inventory$new_id sed -i "s/kiextbr0/kiextbr$new_id/g" inventory$new_id </code></pre> <h4 id="4-replace-the-hosts-mac-addresses-for-new-addresses">4. Replace the hosts MAC addresses for new addresses.</h4> <p>We will randomly replace the MAC addresses for all guest definitions. The following command will shuffle the MAC addresses in the file each time is executed. <em>Note:</em> awk does not support hexadecimal number operations, and it is no possible to replace characters by colons.</p> <pre><code>awk -v seed="$RANDOM" ' BEGIN { srand(seed) } { while(sub(/52:54:00:([[:xdigit:]]{1,2}:){2}[[:xdigit:]]{1,2}/, "52,,,54,,,00,,,"int(10+rand()*(99-10+1))",,,"int(10+rand()*(99-10+1))",,,"int(10+rand()*(99-10+1)))); print &gt; "tmp" } END { print "MAC shuffled" } ' "inventory$new_id" mv tmp inventory$new_id sed -i "s/,,,/:/g" inventory$new_id </code></pre> <h4 id="5-change-the-guest-names">5. 
Change the guest names.</h4> <p>VMs are cleaned up every time the host is provisioned; if their names are not updated they will be removed on each run.</p> <p>We will change from okd- to okd2- (referring to step 1, <em>new_id=2</em>).</p> <pre><code>sed -i "s/okd-/okd$new_id-/g" inventory$new_id
sed -i "s/clustername0/clustername$new_id/g" inventory$new_id
</code></pre> <h4 id="6-run-the-deployment-command-using-the-new-inventory-file">6. Run the deployment command using the new inventory file.</h4> <p>The deployment command should remain exactly as it was; just update the reference to the new inventory file.</p> <pre><code># Use the following inventory in your deployment command
-i ./hosts/okd/inventory$new_id
</code></pre> <h4 id="7-cleaning-up-the-host">7. Cleaning up the host.</h4> <p>Just in case you need to clean things up.</p> <p><strong>Warning:</strong> This will destroy every guest on the host, run with caution.</p> <pre><code>for i in $(virsh -q list | awk '{ print $2 }'); do
  virsh destroy $i
  virsh undefine $i --remove-all-storage
done
</code></pre> <h3 id="conclusions">Conclusions</h3> <p>Being able to deploy multiple clusters in the same hypervisor allows you to test multiple architectures and scenarios without completely re-provisioning the environment.</p> <h3 id="the-end">The end</h3> <p>If you like this post, please try the code, raise issues, and ask for more details, features, or anything else you are interested in. Also, it would be awesome if you became a stargazer to catch up on updates and new features.</p> <p>This is the main project <a href="https://github.com/kubeinit/kubeinit">repository</a>.</p> <p>Happy KubeIniting!</p> Persistent volumes and claims in KubeInit Pods are ephemeral and any information stored in them is not persistent; this means that every time you restart or create pods from the same application any internal data will be lost. Note 2021/10/13: DEPRECATED - This tutorial only works... 
2020-09-28T00:00:00+00:00 https://www.pubstack.com/blog/2020/09/28/Persistent-volumes-and-claims-in-KubeInit Carlos Camacho <p>Pods are ephemeral and any information stored in them is not persistent; this means that every time you restart or create pods from the same application any internal data will be lost.</p> <blockquote> <p><strong><em>Note 2021/10/13:</em></strong> DEPRECATED - This tutorial only works with <a href="https://github.com/Kubeinit/kubeinit/releases/tag/1.0.2">kubeinit 1.0.2</a>; make sure you use this version of the code if you are following this tutorial, or <a href="https://docs.kubeinit.org/">refer to the documentation</a> to use the latest code.</p> </blockquote> <p>The solution for this is to use persistent volumes so pods can persist their data every time they are restarted; a volume is YAKR (Yet Another Kubernetes Resource).</p> <p><img src="/static/kubeinit/pv/stone_pv.png" alt="" /></p> <h1 id="tldr">TL;DR;</h1> <p>We will create a static PV/PVC to be used with an example application.</p> <h3 id="basic-information">Basic information</h3> <p>From [1], let’s define some basic concepts.</p> <p><img src="/static/kubeinit/pv/cs_storage_pvc_pv.png" alt="" /></p> <ul> <li> <p>Cluster: By default, every cluster is set up with a plug-in to provision file storage. You can choose to install other add-ons, such as the one for block storage. To use storage in a cluster, you must create a persistent volume claim, a persistent volume, and a physical storage instance. When you delete the cluster, you have the option to delete related storage instances.</p> </li> <li> <p>App: To read from and write to your storage instance, you must mount the persistent volume claim (PVC) to your app. Different storage types have different read-write rules. For example, you can mount multiple pods to the same PVC for file storage. 
Block storage comes with a RWO (ReadWriteOnce) access mode so that you can mount the storage to one pod only.</p> </li> <li> <p>Persistent volume claim (PVC): A PVC is the request to provision persistent storage with a specific type and configuration. To specify the persistent storage flavor that you want, you use Kubernetes storage classes. The cluster admin can define storage classes. When you create a PVC, the request is sent to the storage provider. If the requested configuration does not exist, the storage is not created.</p> </li> <li> <p>Persistent volume (PV): A PV is a virtual storage instance that is added as a volume to the cluster. The PV points to a physical storage device in your account and abstracts the API that is used to communicate with the storage device. To mount a PV to an app, you must have a matching PVC. Mounted PVs appear as a folder inside the container’s file system.</p> </li> <li> <p>Physical storage: A physical storage instance that you can use to persist your data. Examples of physical storage in Kubernetes clusters include File Storage, Block Storage, Object Storage, and local worker node storage. However, data that is stored on a physical storage instance is not backed up automatically. Depending on the type of storage that you use, different methods exist to set up backup and restore solutions.</p> </li> </ul> <h3 id="static-provisioning">Static provisioning</h3> <p>If you have an existing persistent storage device, you can use static provisioning to make the storage instance available to your cluster.</p> <h4 id="how-does-it-work">How does it work?</h4> <p>Static provisioning is a feature that is native to Kubernetes and that allows cluster administrators to make existing storage devices available to a cluster. As a cluster administrator, you must know the details of the storage device, its supported configurations, and mount options. 
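</p> <p>When creating the pair manually, the binding is driven by a few fields that must line up; a summary sketch, using the names and values from the example in the next section:</p> <pre><code># Fields that must line up for a static PV/PVC pair to bind:
#   PV metadata.name      matches  PVC spec.volumeName    (test-nfs-pv)
#   PV storageClassName   matches  PVC storageClassName   (nfs01testnfs)
#   PV capacity.storage   covers   PVC requests.storage   (1Gi)
#   PV accessModes        include  PVC accessModes        (ReadWriteMany)
</code></pre> <p>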
To make existing storage available to a cluster user, you must manually create the storage device, a PV, and a PVC.</p> <h4 id="setting-up-a-static-pvpvc-in-a-kubeinits-nfs-share">Setting up a static PV/PVC in a KubeInit’s NFS share</h4> <p>Execute all the following steps from the service machine.</p> <pre><code># We create a new folder in the main NFS share mkdir -p /var/nfsshare/test-nfs chmod -R 777 /var/nfsshare/test-nfs chown -R nobody:nobody /var/nfsshare/test-nfs # We define the resources for the PV, PVC, and an example pod to use them cat &lt;&lt; EOF &gt; ~/test_nfs_pv.yaml apiVersion: v1 kind: PersistentVolume metadata: name: test-nfs-pv spec: capacity: storage: 1Gi accessModes: - ReadWriteMany storageClassName: nfs01testnfs persistentVolumeReclaimPolicy: Retain nfs: path: /var/nfsshare/test-nfs server: 10.0.0.100 EOF cat &lt;&lt; EOF &gt; ~/test_nfs_pvc.yaml apiVersion: v1 kind: PersistentVolumeClaim metadata: name: test-nfs-pvc spec: volumeName: test-nfs-pv accessModes: - ReadWriteMany resources: requests: storage: 1Gi storageClassName: nfs01testnfs volumeMode: Filesystem EOF cat &lt;&lt;EOF &gt; ~/pod.yaml apiVersion: v1 kind: Pod metadata: name: nfs-test labels: name: frontendhttp spec: containers: - name: nginx image: nginx:1.7.9 imagePullPolicy: IfNotPresent ports: - containerPort: 80 name: http-server volumeMounts: - mountPath: /var/nfsshare/testmount name: pvol volumes: - name: pvol persistentVolumeClaim: claimName: test-nfs-pvc EOF # We create those resources export KUBECONFIG=~/install_dir/auth/kubeconfig oc create -f ~/test_nfs_pv.yaml oc create -f ~/test_nfs_pvc.yaml oc create -f ~/pod.yaml # We test the PV is bound to the PVC oc get pv oc get pvc kubectl get pods showmount -e 10.0.0.100 # Now we check that our test application is mounting the volume correctly # As you can see we have the NFS PV mounted in /var/nfsshare/testmount # we connect to the container and put something in the PV kubectl exec --stdin --tty nfs-test -- /bin/bash echo 
"hello world" &gt; /var/nfsshare/testmount/asdf exit cat /var/nfsshare/test-nfs/asdf # hello world </code></pre> <p>This proves how easy is to create persistent volumes and claims to be used in KubeInit.</p> <h3 id="next-steps">Next steps</h3> <p>Dynamic provisioning is a feature that is native to Kubernetes and that allows a cluster developer to order storage with a pre-defined type and configuration without knowing all the details about how to provision the physical storage device. To abstract the details for the specific storage type, the cluster admin must create storage classes that the developer can use.</p> <p>Ideally assigning the persistent volumes should be dynamically assigned. In the future this should be natively be part of KubeInit.</p> <h3 id="references">References</h3> <ol> <li><a href="https://cloud.ibm.com/docs/containers?topic=containers-kube_concepts">https://cloud.ibm.com/docs/containers?topic=containers-kube_concepts</a></li> </ol> <h3 id="the-end">The end</h3> <p>If you like this post, please try the code, raise issues, and ask for more details, features or anything that you feel interested in. Also it would be awesome if you become a stargazer to catch up updates and new features.</p> <p>This is the main project <a href="https://github.com/kubeinit/kubeinit">repository</a>.</p> <p>Happy KubeIniting!</p> <hr /> <p><img src="/static/kubeinit/pv/meme.jpg" alt="" /></p> Deploying KubeInit from a container There are some use cases where users have old libraries versions, old environments, or in general, difficulties to execute the ansible-playbook command to deploy KubeInit, due to unrelated issues. The following steps will help users to deploy KubeInit by launching... 
2020-09-11T00:00:00+00:00 https://www.pubstack.com/blog/2020/09/11/Deploying-KubeInit-from-a-container Carlos Camacho <p>There are some use cases where users have old library versions, old environments, or, in general, difficulties executing the ansible-playbook command to deploy KubeInit, due to unrelated issues. The following steps will help users to deploy KubeInit by launching the ansible-playbook command from a container.</p> <blockquote> <p><strong><em>Note 2021/10/13:</em></strong> DEPRECATED - This tutorial only works with <a href="https://github.com/Kubeinit/kubeinit/releases/tag/1.0.2">kubeinit 1.0.2</a>; make sure you use this version of the code if you are following this tutorial, or <a href="https://docs.kubeinit.org/">refer to the documentation</a> to use the latest code.</p> </blockquote> <p><img src="/static/kubeinit/kubeinit_in_docker.png" alt="" /></p> <h1 id="tldr">TL;DR;</h1> <p>We will describe how to run KubeInit within a container.</p> <h3 id="requirements">Requirements</h3> <ul> <li>Have docker or podman installed.</li> <li>If you do not have podman (which is what is used in the commands below), replace podman with docker in all the commands below.</li> </ul> <h3 id="deploying-kubeinit-within-a-container">Deploying KubeInit within a container</h3> <h4 id="building-the-image">Building the image</h4> <p>We will clone the repository as usual and build the container image.</p> <pre><code class="language-bash">git clone https://github.com/Kubeinit/kubeinit.git cd kubeinit podman build -t kubeinit/kubeinit .
</code></pre> <h4 id="run-the-deployment-command-from-the-recently-created-container">Run the deployment command from the recently created container</h4> <p><strong><em>NOTE:</em></strong> Beware of the <a href="https://www.redhat.com/sysadmin/user-namespaces-selinux-rootless-containers">:z flag if you have SELinux enabled</a>: you must use it, otherwise you will get a permission denied error, as the private key won’t be mounted correctly.</p> <p><strong><em>NOTE:</em></strong> Each time a change is included in any code inside the repository <strong>BUILD THE IMAGE YOU MUST</strong>…</p> <pre><code class="language-bash">podman run --rm -it \ -v ~/.ssh/id_rsa:/root/.ssh/id_rsa:z \ -v /etc/hosts:/etc/hosts \ kubeinit/kubeinit \ --user root \ -v -i ./hosts/okd/inventory \ --become \ --become-user root \ ./playbooks/okd.yml </code></pre> <p>As is clearly visible in the previous command, starting from the 5th line the arguments are exactly the same as if we were executing them with <code>ansible-playbook</code>.</p> <p>What we are doing is adding <code>ansible-playbook</code> as this container’s ENTRYPOINT, so you can add any variable that will be part of the <code>ansible-playbook</code> command.</p> <p>Easy as always: with a single deployment command you should have your cluster in approximately 30 minutes.</p> <h3 id="pros-and-cons-of-executing-ansible-playbook-within-a-container">Pros and Cons of executing ansible-playbook within a container</h3> <p>This is a very opinionated section to show that sometimes running containers can add some extra steps that might be useful or not depending on your environment.</p> <h4 id="pros">Pros</h4> <ul> <li>No dependencies on the host.</li> <li>Easy to run if there is no other change to make inside the collection.</li> </ul> <h4 id="cons">Cons</h4> <ul> <li>Debugging will always be harder.</li> <li>Another layer of something that might be hard to understand.</li> <li>More time spent building the image and deploying.</li> <li>Each
time you need to make a change in any part of the code you will need to build the image again. This in particular can be really painful, as each small change in the code will make you invest some possibly unneeded extra time before you can run the code again.</li> </ul> <h4 id="the-end">The end</h4> <p>If you like this post, please try the code, raise issues, and ask for more details, features, or anything you find interesting. It would also be awesome if you became a stargazer, to keep up with updates and new features.</p> <p>This is the main project <a href="https://github.com/kubeinit/kubeinit">repository</a>.</p> <p>Happy KubeIniting!</p> <hr /> <p><img src="/static/pod/k8s.jpg" alt="" /></p> KubeInit External access for OpenShift/OKD deployments with Libvirt This post describes the basic network architecture when OKD is deployed using KubeInit on a KVM host. Note 2021/10/13: DEPRECATED - This tutorial only works with kubeinit 1.0.2; make sure you use this version of the...
2020-08-25T00:00:00+00:00 https://www.pubstack.com/blog/2020/08/25/KubeInit-External-access-for-OpenShift-OKD-deployments-with-Libvirt Carlos Camacho <p>This post describes the basic network architecture when OKD is deployed using <a href="https://github.com/kubeinit/kubeinit">KubeInit</a> on a KVM host.</p> <blockquote> <p><strong><em>Note 2021/10/13:</em></strong> DEPRECATED - This tutorial only works with <a href="https://github.com/Kubeinit/kubeinit/releases/tag/1.0.2">kubeinit 1.0.2</a>; make sure you use this version of the code if you are following this tutorial, or <a href="https://docs.kubeinit.org/">refer to the documentation</a> to use the latest code.</p> </blockquote> <h2 id="tldr">TL;DR;</h2> <p>We will describe how to extend the basic network configuration to provide external access to the cluster services by adding an external IP to the service machine.</p> <h3 id="what-is-kubeinit">What is KubeInit?</h3> <p>KubeInit provides Ansible playbooks and roles for the deployment and configuration of multiple Kubernetes distributions.</p> <p>The main goal of KubeInit is to have a fully automated way to deploy, in a single command, a curated list of prescribed architectures.</p> <p><a href="https://github.com/kubeinit/kubeinit">KubeInit</a> is open source, licensed under the Apache 2.0 license.
The project’s source code is hosted on <a href="https://github.com/kubeinit/kubeinit">GitHub</a>.</p> <p><img src="/static/kubeinit/net/thumb.png" alt="" /></p> <h3 id="initial-hypervisor-status">Initial hypervisor status</h3> <p>We check both the routing table and the network connections in the hypervisor host.</p> <pre><code class="language-bash">[root@nyctea ~]# route Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface default _gateway 0.0.0.0 UG 100 0 0 eno1 10.19.41.0 0.0.0.0 255.255.255.0 U 100 0 0 eno1 192.168.122.0 0.0.0.0 255.255.255.0 U 0 0 0 virbr0 </code></pre> <pre><code class="language-bash">[root@nyctea ~]# nmcli con show NAME UUID TYPE DEVICE System eno1 162499bc-a6fa-45db-ba76-1b45f0be46e8 ethernet eno1 virbr0 4ba12c69-3a8b-42e8-a9dd-bc020fdc1a90 bridge virbr0 eno2 e19725f2-84f5-4f71-b300-469ffc99fd99 ethernet -- enp6s0f0 7348301f-8cae-4ab1-9061-97d7a344699c ethernet -- enp6s0f1 8a96c226-959a-4218-b9f7-c3ab6ee3d02b ethernet -- </code></pre> <p>As you can see, there are two physical network interfaces (eno1 and eno2), of which only one is actually connected.</p> <h4 id="initial-network-architecture">Initial network architecture</h4> <p>The following picture represents the default network layout for a usual deployment.</p> <p><img src="/static/kubeinit/net/arch01.png" alt="" /></p> <p>The default deployment will install a multi-master cluster, with one worker node (up to 10). From the above figure it is possible to see:</p> <ul> <li> <p>All cluster nodes are connected to the 10.0.0.0/24 network.
This will be the cluster management network, and the one we will use to access the nodes from the hypervisor.</p> </li> <li> <p>The 10.0.0.0/24 network is defined as a Virtual Network Switch implementing both NAT and DHCP for any interface connected to the <code>kimgtnet0</code> network.</p> </li> <li> <p>All bootstrap, master, and worker nodes are installed with Fedora CoreOS, as it is the required OS for OKD &gt; 4.</p> </li> <li> <p>The services machine runs CentOS 8 with BIND, HAProxy, and NFS.</p> </li> <li> <p>Using DHCP, we assign the following IP mapping based on the MAC address of each node (defined in the Ansible inventory).</p> </li> </ul> <pre><code class="language-bash"> # Master okd-master-01 ansible_host=10.0.0.1 mac=52:54:00:aa:6c:b1 okd-master-02 ansible_host=10.0.0.2 mac=52:54:00:59:0e:e4 okd-master-03 ansible_host=10.0.0.3 mac=52:54:00:b4:39:45 # Worker okd-worker-01 ansible_host=10.0.0.4 mac=52:54:00:61:22:5a okd-worker-02 ansible_host=10.0.0.5 mac=52:54:00:21:fd:fd okd-worker-03 ansible_host=10.0.0.6 mac=52:54:00:4c:0a:81 okd-worker-04 ansible_host=10.0.0.7 mac=52:54:00:54:ff:ac okd-worker-05 ansible_host=10.0.0.8 mac=52:54:00:4a:6b:f6 okd-worker-06 ansible_host=10.0.0.9 mac=52:54:00:40:22:52 okd-worker-07 ansible_host=10.0.0.10 mac=52:54:00:6c:0a:03 okd-worker-08 ansible_host=10.0.0.11 mac=52:54:00:0b:14:f8 okd-worker-09 ansible_host=10.0.0.12 mac=52:54:00:f5:6e:e5 okd-worker-10 ansible_host=10.0.0.13 mac=52:54:00:5c:26:4f # Service okd-service-01 ansible_host=10.0.0.100 mac=52:54:00:f2:46:a7 # Bootstrap okd-bootstrap-01 ansible_host=10.0.0.200 mac=52:54:00:6e:4d:a3 </code></pre> <blockquote> <p>The previous deployment can be used for any purpose, but it has one limitation: the endpoints do not have external access. This means that, e.g.,
https://console-openshift-console.apps.watata.kubeinit.local cannot be accessed from anywhere except the hypervisor itself.</p> </blockquote> <h3 id="extending-the-basic-network-layout">Extending the basic network layout</h3> <p>We will now describe a simple way to provide external access to the cluster’s public endpoints published in the service machine.</p> <h4 id="requirements">Requirements</h4> <ul> <li>An additional IP address to be mapped to the services machine from an external location.</li> <li>A network bridge slaving the interface used for the external access.</li> </ul> <p>If a user has one extra IP (public or private), it will be enough to configure remote access to the cluster endpoints.</p> <p>As long as we have an extra IP, it does not matter how many physical interfaces we have, as we can have multiple IP addresses configured on a single physical NIC.</p> <h4 id="new-network-layout">New network layout</h4> <p>This is the resulting network architecture to remotely access our freshly installed OKD cluster.</p> <p><img src="/static/kubeinit/net/arch02.png" alt="" /></p> <p>As is visible in the above figure, there is an extra connection to the service machine, connected directly to the virtual bridge slaving a physical interface.</p> <blockquote> <p>Our development environment has only one network card connected; in this case, after we create the main switch and slave the network device, it will lose the assigned IP automatically.
Do not try this from a remote shell, as your connection will be dropped.</p> </blockquote> <h4 id="how-to-enable-the-external-interface">How to enable the external interface</h4> <p>To deploy this architecture please follow the next steps:</p> <ol> <li>Create a virtual bridge slaving the selected physical interface.</li> <li>Adjust the deployment command.</li> <li>Run <a href="https://github.com/kubeinit/kubeinit">KubeInit</a>.</li> <li>Adjust your local Domain Name System (DNS) resolver.</li> </ol> <h5 id="step-1-creating-the-virtual-bridge">Step 1 (creating the virtual bridge)</h5> <h6 id="using-centos8-cockpit">Using CentOS8 cockpit</h6> <p>We create an initial bridge using the CentOS cockpit; after losing the IP, it will be recovered/reconfigured automatically (don’t try this from the CLI as you will lose access).</p> <p>In this case, create a bridge called <code>kiextbr0</code> connected to eno1:</p> <p><img src="/static/kubeinit/net/cockpit_00.PNG" alt="" /></p> <p>Click on: Networking -&gt; Add Bridge</p> <p>Then adjust the bridge configuration options (bridge name and the interface to slave).</p> <p><img src="/static/kubeinit/net/cockpit_01.PNG" alt="" /></p> <p>Write <code>kiextbr0</code> as the bridge name, and select your network interface <code>eno1</code>.</p> <p>Go to the dashboard and verify that everything is OK.</p> <p><img src="/static/kubeinit/net/cockpit_02.PNG" alt="" /></p> <p>Check that the bridge is created correctly and has the IP configured correctly.</p> <h6 id="manual-bridge-creation">Manual bridge creation</h6> <p>As an example, you can run these steps from the CLI, adjusting your interface and bridge names accordingly.</p> <pre><code class="language-bash">nmcli connection add ifname br0 type bridge con-name br0 nmcli connection add type bridge-slave ifname enp0s25 master br0 nmcli connection modify br0 bridge.stp no nmcli connection delete enp0s25 nmcli connection up br0 </code></pre> <blockquote> <p><strong><em>NOTE:</em></strong> If you have only one interface
the connection will be dropped and you will lose connectivity.</p> </blockquote> <h6 id="checking-the-system-status">Checking the system status</h6> <p>We check the system status again:</p> <pre><code class="language-bash">[root@nyctea ~]# route Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface default _gateway 0.0.0.0 UG 425 0 0 kiextbr0 10.0.0.0 0.0.0.0 255.255.255.0 U 0 0 0 kimgtbr0 10.19.41.0 0.0.0.0 255.255.255.0 U 425 0 0 kiextbr0 NAME UUID TYPE DEVICE kiextbr0 55d0a549-8123-488a-815b-5771b62644d2 bridge kiextbr0 kimgtbr0 3e73e0d9-28bd-4db7-8ccf-be11297e3300 bridge kimgtbr0 System eno1 3251ed0c-706a-463e-aeac-2a57782ce7c1 ethernet eno1 vnet0 4515a0b8-1a20-4414-86b2-2ff5545fcffa tun vnet0 vnet1 5f1b253f-9c38-4637-8a02-222aa5c51be3 tun vnet1 vnet2 e7d466d5-bc2b-47b0-a6ca-5a3825170501 tun vnet2 eno2 190c35fb-1ff0-41ff-b32e-c190f513b2a0 ethernet -- eno3 1b644415-0a91-44a9-bfd0-2279ddca0020 ethernet -- eno4 c99ba8a7-b62c-4b1f-b191-8798f0eff2ff ethernet -- enp6s0f0 11c63800-8cd9-4411-8854-43ced2a464f3 ethernet -- enp6s0f1 be01957b-2933-47df-9793-156fe3b1d767 ethernet -- </code></pre> <p>We can see that the new bridge was created successfully and that it has the IP address configured correctly.</p> <h5 id="step-2-adjusting-the-deployment-command">Step 2 (adjusting the deployment command)</h5> <p>There are a few variables that need to be adjusted in order to successfully configure the external interface.</p> <p>These variables are defined in the libvirt role (the location of these variables may change, but not their names).</p> <p><img src="/static/kubeinit/net/config_vars.PNG" alt="" /></p> <p>The meanings of the variables are:</p> <ul> <li> <p>kubeinit_libvirt_external_service_interface_enabled: true - This will enable the Ansible configuration of the external interface, the BIND update, and the additional interface in the service node.</p> </li> <li> <p>kubeinit_libvirt_external_service_interface.attached: kiextbr0 - This is the virtual
bridge where we will plug the <code>eth1</code> interface of the services machine. The bridge <code>MUST</code> be created first, slaving the physical interface we will use.</p> </li> <li> <p>kubeinit_libvirt_external_service_interface.dev: eth1 - This is the name of the external interface we will add to the services machine.</p> </li> <li> <p>kubeinit_libvirt_external_service_interface.ip: 10.19.41.157 - The external IP address of the services machine.</p> </li> <li> <p>kubeinit_libvirt_external_service_interface.gateway: 10.19.41.254 - The gateway IP address of the services machine.</p> </li> <li> <p>kubeinit_libvirt_external_service_interface.netmask: 255.255.255.0 - The network mask of the external interface of the services machine.</p> </li> </ul> <p>After we correctly configure the previous variables, we can run the deployment command.</p> <h5 id="step-3-run-the-deployment-command">Step 3 (run the deployment command)</h5> <p>Now we deploy <a href="https://github.com/kubeinit/kubeinit">KubeInit</a> as usual:</p> <blockquote> <p>Remember that you can execute this deployment command before creating the bridge with the CentOS cockpit; the bridge creation has no impact on how we deploy KubeInit.</p> </blockquote> <pre><code class="language-bash">ansible-playbook \ -v \ --user root \ -i ./hosts/okd/inventory \ --become \ --become-user root \ -e "{ \ 'kubeinit_libvirt_external_service_interface_enabled': 'true', \ 'kubeinit_libvirt_external_service_interface': { \ 'attached': 'kiextbr0', \ 'dev': 'eth1', \ 'ip': '10.19.41.157', \ 'mac': '52:54:00:6a:39:ad', \ 'gateway': '10.19.41.254', \ 'netmask': '255.255.255.0' \ } \ }" \ ./playbooks/okd.yml </code></pre> <h5 id="step-4-adjust-your-resolvconf">Step 4 (adjust your resolv.conf)</h5> <p>You must be able to reach the cluster’s external endpoints by DNS; this means the dashboard and any other application deployed (you can add entries for any registry pointing to the service machine, but this can be cumbersome).</p>
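<p>One lightweight option is conditional forwarding: send only the cluster’s domain to the services machine and leave every other query on your usual resolvers. This is a sketch, assuming a local <code>dnsmasq</code> instance and this post’s example external IP <code>10.19.41.157</code>; the file path is illustrative.</p> <pre><code class="language-bash"># Hypothetical drop-in, e.g. /etc/dnsmasq.d/kubeinit.conf
# Forward every *.kubeinit.local query to the services machine's BIND,
# keeping all other DNS traffic on your normal upstream resolvers
server=/kubeinit.local/10.19.41.157
</code></pre> <p>With an entry like this there is no need to put the cluster’s BIND first for all lookups.</p>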
<p>For example, configure your local DNS resolver to point to <code>10.19.41.157</code></p> <pre><code class="language-bash"> [ccamacho@localhost]$ cat /etc/resolv.conf nameserver 10.19.41.157 nameserver 8.8.8.8 </code></pre> <p>After that you should be able to access the cluster without any issue and use it for any purpose with the following URL <a href="https://console-openshift-console.apps.clustername0.kubeinit.local/">https://console-openshift-console.apps.clustername0.kubeinit.local/</a>.</p> <p>Voilà!</p> <p><img src="/static/kubeinit/net/dashboard.PNG" alt="" /></p> <h4 id="final-considerations">Final considerations</h4> <p>One of the most interesting changes in BIND is how we manage both external and internal views.</p> <p><img src="/static/kubeinit/net/bind_views.png" alt="" /></p> <p>In this case we have an <code>internal</code> and an <code>external</code> view that will behave differently depending on where the requests originate from.</p> <p>If a DNS request comes through the cluster’s external interface, the reply will be created based on the external view; in this case we only reply with the external HAProxy endpoints related to the services node, thus we will only reply with <code>10.19.41.157</code>, as it is the only address that needs to be presented externally.</p> <h4 id="the-end">The end</h4> <p>If you like this post, please try the code, raise issues, and ask for more details, features, or anything you find interesting.
It would also be awesome if you became a stargazer, to keep up with updates and new features.</p> <p>This is the main project <a href="https://github.com/kubeinit/kubeinit">repository</a>.</p> <p>Happy KubeIniting!</p> <blockquote> <p><strong><em>Updated 2020/08/25:</em></strong> First version (draft).</p> <p><strong><em>Updated 2020/08/26:</em></strong> Published.</p> <p><strong><em>Updated 2020/10/06:</em></strong> Update in network details.</p> </blockquote> A review of the MachineConfig operator The latest versions of OpenShift rely on operators to completely manage the cluster and OS state; this state includes, for instance, configuration changes and OS upgrades. For example, to install additional packages or change any configuration file to execute whatever... 2020-08-16T00:00:00+00:00 https://www.pubstack.com/blog/2020/08/16/a-review-of-the-machineconfig-operator Carlos Camacho <p>The latest versions of OpenShift rely on operators to completely manage the cluster and OS state; this <strong>state</strong> includes, for instance, configuration changes and OS upgrades.
For example, to install additional packages or change any configuration file to execute whatever task is required, the MachineConfig operator should be the one in charge of applying these changes.</p> <p><img src="/static/machineconfig/machineconfig.png" alt="" /></p> <p>These configuration changes are executed by an instance of the ‘openshift-machine-config-operator’ pod; after the new state is reached, the updated nodes will be automatically restarted.</p> <p>There are several mature and production-ready technologies for automating and applying configuration changes to the underlying infrastructure nodes, like Ansible, Helm, Puppet, Chef, and many others; yet, the MachineConfig operator forces users to adopt this new method and pretty much discard any previously developed automation infrastructure.</p> <h3 id="the-machineconfig-operator-is-more-than-installing-packages-and-updating-configuration-files">The MachineConfig operator is more than installing packages and updating configuration files</h3> <p>I see the MachineConfig operator as a finite state machine representing cluster-wide sequential logic that ensures the cluster’s state is preserved and consistent.
This notion has several very powerful benefits, like making the cluster resilient to failures due to unfulfilled conditions in each of the sub-stages of this finite state machine workflow.</p> <p>The following example shows a practical application of the benefits of this approach.</p> <p>We assume the following architecture reference:</p> <ul> <li>3 master nodes.</li> <li>1 worker node.</li> <li>The master nodes are not schedulable.</li> </ul> <p>This is a quite simple multi-master deployment with a single worker node for development purposes; before the cluster was deployed, the master nodes were set as “mastersSchedulable: False” by running <code>"sed -i 's/mastersSchedulable: true/mastersSchedulable: False/' install_dir/manifests/cluster-scheduler-02-config.yml"</code>. Now, after deploying the cluster, executing a configuration change in the worker node will fail; let’s investigate why.</p> <p>The following YAML file will be applied; its content is correct and it should work out of the box:</p> <pre><code class="language-bash">cat &lt;&lt; EOF &gt; ~/99_kubeinit_extra_config_worker.yaml apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig metadata: creationTimestamp: null labels: machineconfiguration.openshift.io/role: worker name: 99-kubeinit-extra-config-worker spec: osImageURL: '' config: ignition: config: replace: verification: {} security: tls: {} timeouts: {} version: 2.2.0 networkd: {} passwd: {} storage: files: - contents: source: data:text/plain;charset=utf-8;base64,IyEvdXNyL2Jpbi9iYXNoCnNldCAteAptYWluKCkgewpzdWRvIHJwbS1vc3RyZWUgaW5zdGFsbCBwb2xpY3ljb3JldXRpbHMtcHl0aG9uLXV0aWxzCnN1ZG8gc2VkIC1pICdzL2VuZm9yY2luZy9kaXNhYmxlZC9nJyAvZXRjL3NlbGludXgvY29uZmlnIC9ldGMvc2VsaW51eC9jb25maWcKfQptYWluCg== verification: {} filesystem: root mode: 0755 path: /usr/local/bin/kubeinit_kubevirt_extra_config_script EOF oc apply -f ~/99_kubeinit_extra_config_worker.yaml </code></pre> <p>The defined
MachineConfig object will create a file in <code>/usr/local/bin/kubeinit_kubevirt_extra_config_script</code> that, once executed, will install a package and disable SELinux on the worker nodes.</p> <p>Now, let’s check the state of the worker machine config pool.</p> <pre><code class="language-bash">oc get machineconfigpool/worker </code></pre> <p>This is the result:</p> <pre><code>NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE worker rendered-worker-a9.. False True True 1 0 0 1 12h </code></pre> <p>It is possible to see that the operator state is degraded, without much more information about why. Let’s get the status of the machine-config pods.</p> <pre><code class="language-bash">kubectl get pod -o wide --all-namespaces | grep machine-config </code></pre> <p>It is possible to see that all pods are running without issues.</p> <pre><code>openshift-machine-config-operator etcd-quorum-guard-7bb76959df-5bj7g 1/1 Running 0 11h 10.0.0.2 okd-master-02 &lt;none&gt; &lt;none&gt; openshift-machine-config-operator etcd-quorum-guard-7bb76959df-jdtbv 1/1 Running 0 11h 10.0.0.3 okd-master-03 &lt;none&gt; &lt;none&gt; openshift-machine-config-operator etcd-quorum-guard-7bb76959df-sndb2 1/1 Running 0 11h 10.0.0.1 okd-master-01 &lt;none&gt; &lt;none&gt; openshift-machine-config-operator machine-config-controller-7cbb584655-bfjmh 1/1 Running 0 11h 10.100.0.20 okd-master-01 &lt;none&gt; &lt;none&gt; openshift-machine-config-operator machine-config-daemon-ctczg 2/2 Running 0 12h 10.0.0.3 okd-master-03 &lt;none&gt; &lt;none&gt; openshift-machine-config-operator machine-config-daemon-m82gz 2/2 Running 0 12h 10.0.0.2 okd-master-02 &lt;none&gt; &lt;none&gt; openshift-machine-config-operator machine-config-daemon-qfc82 2/2 Running 0 12h 10.0.0.1 okd-master-01 &lt;none&gt; &lt;none&gt; openshift-machine-config-operator machine-config-daemon-vwh4d 2/2 Running 0 11h 10.0.0.4 okd-worker-01 &lt;none&gt;
&lt;none&gt; openshift-machine-config-operator machine-config-operator-c98bb964d-5vnww 1/1 Running 0 11h 10.100.0.21 okd-master-01 &lt;none&gt; &lt;none&gt; openshift-machine-config-operator machine-config-server-g75x5 1/1 Running 0 12h 10.0.0.2 okd-master-02 &lt;none&gt; &lt;none&gt; openshift-machine-config-operator machine-config-server-kpwqb 1/1 Running 0 12h 10.0.0.3 okd-master-03 &lt;none&gt; &lt;none&gt; openshift-machine-config-operator machine-config-server-n9q2r 1/1 Running 0 12h 10.0.0.1 okd-master-01 &lt;none&gt; &lt;none&gt; </code></pre> <p>Let’s check the logs of the machine-config-daemon pod in the worker node. This pod has two containers, machine-config-daemon, and oauth-proxy.</p> <pre><code class="language-bash">kubectl logs -f machine-config-daemon-vwh4d -n openshift-machine-config-operator -c machine-config-daemon </code></pre> <p>Now, it is possible to see the actual error in the container execution:</p> <pre><code>I0816 06:58:42.985762 3240 update.go:283] Checking Reconcilable for config rendered-worker-a9681850fe39078ea0f42bd017922eb7 to rendered-worker-7131e04f110c489a0ad171e719cedc24 I0816 06:58:43.849830 3240 update.go:1403] Starting update from rendered-worker-a9681850fe39078ea0f42bd017922eb7 to rendered-worker-7131e04f110c489a0ad171e719cedc24: &amp;{osUpdate:false kargs:false fips:false passwd:false files:true units:false kernelType:false} I0816 06:58:43.852961 3240 update.go:1403] Update prepared; beginning drain E0816 06:58:43.911711 3240 daemon.go:336] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-48g5s, openshift-dns/dns-default-h2lt5, openshift-image-registry/node-ca-9z9zt, openshift-machine-config-operator/machine-config-daemon-vwh4d, openshift-monitoring/node-exporter-m5p2n, openshift-multus/multus-lnsng, openshift-sdn/ovs-5xzqs, openshift-sdn/sdn-vplps . . . 
I0816 06:58:43.918261 3240 daemon.go:336] evicting pod openshift-ingress/router-default-796df5847b-9hxzx E0816 06:58:43.928176 3240 daemon.go:336] error when evicting pod "router-default-796df5847b-9hxzx" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget. I0816 07:08:44.981198 3240 update.go:172] Draining failed with: error when evicting pod "router-default-796df5847b-9hxzx": global timeout reached: 1m30s, retrying E0816 07:08:44.981273 3240 writer.go:135] Marking Degraded due to: failed to drain node (5 tries): timed out waiting for the condition: error when evicting pod "router-default-796df5847b-9hxzx": global timeout reached: 1m30s </code></pre> <p>The log shows that the machine config operator failed to drain the worker node before applying the configuration and executing the restart, as the router-default pod could not be rescheduled on another node. Not being able to reschedule this pod <code>violates the pod's disruption budget</code>; thus, the operator is now degraded.</p> <p>Let’s check the router-default pod status:</p> <pre><code class="language-bash">kubectl get pod -o wide --all-namespaces | grep "router-default" </code></pre> <p>It is possible to see that the pod is pending to be scheduled.</p> <pre><code>openshift-ingress router-default-796df5847b-9hxzx 1/1 Running 0 12h 10.0.0.4 okd-worker-01 &lt;none&gt; &lt;none&gt; openshift-ingress router-default-796df5847b-h8bm4 0/1 Pending 0 12h &lt;none&gt; &lt;none&gt; &lt;none&gt; &lt;none&gt; </code></pre> <p>Let’s check its status:</p> <pre><code class="language-bash">oc describe pod router-default-796df5847b-h8bm4 -n openshift-ingress </code></pre> <p>Now, it is possible to confirm that the pod is <code>Pending</code>, as there is no available node to schedule it on.</p> <pre><code>Name: router-default-796df5847b-h8bm4 Namespace: openshift-ingress . . .
Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedScheduling &lt;unknown&gt; default-scheduler 0/4 nodes are available: 1 node(s) were unschedulable, 3 node(s) didn't match node selector. </code></pre> <p>We check the nodes’ status</p> <pre><code class="language-bash">oc get nodes </code></pre> <p>Again, it is possible to see that the MachineConfig operator tried to drain the node but failed when rescheduling its pods.</p> <pre><code>NAME STATUS ROLES AGE VERSION okd-master-01 Ready master 12h v1.18.3 okd-master-02 Ready master 12h v1.18.3 okd-master-03 Ready master 12h v1.18.3 okd-worker-01 Ready,SchedulingDisabled worker 12h v1.18.3 </code></pre> <h3 id="why-did-this-happen">Why did this happen?</h3> <p>Master nodes are not schedulable to handle workloads; this was configured when the cluster was deployed, so the operator simply did not have enough room to reschedule the pods on other nodes.</p> <p>The benefits of this approach (using the MachineConfig operator) are countless, as the operator is smart enough to avoid breaking services when it finds that a configuration change cannot bring the system back to a consistent state.</p> <p><img src="/static/machineconfig/rabbithole.jpg" alt="" /></p> <h3 id="but-not-everything-is-as-perfect-as-it-sounds">But, not everything is as perfect as it sounds…</h3> <p>Ignition files are used to apply these configuration changes, and their JSON representation is not human-readable at all; for this reason we use Fedora CoreOS Configuration (FCC) files in YAML format, which are then converted into an Ignition (JSON) file by the Fedora CoreOS Config Transpiler.</p> <p>There is a huge limitation in the resources that can be defined: only storage, systemd services, and users are supported, so to execute anything the user will have to render a script that must be called once by a systemd service
after the node restarts. After using technologies like Ansible, Puppet, or Chef for many years, this looks like a hacky and dirty approach for users to apply their custom configurations.</p> <p>Another issue is debugging: if there is a problem with your MachineConfig object you might see only this <strong>degraded</strong> state, forcing you to dig into the container logs to, hopefully, find the source of any issue you might have.</p> <p>I believe there is a lot of room for improvement in the MachineConfig operator; I would love to see an Ansible interface to plug my configuration changes into the openshift-machine-config-operator pod. Also, it was shown that the operator improves the system’s resiliency by preventing a configuration change from breaking what we defined as a <strong>cluster’s consistent state</strong>.</p> The easiest and fastest way to deploy an OKD 4.5 cluster in a Libvirt/KVM host Long story short… We will deploy an OKD 4.5 cluster (3 controllers, 1 to 10 workers, 1 service, and 1 bootstrap node) with one single command, in around 30 minutes, using a tool called KubeInit. Note 2021/10/13:...
2020-07-31T00:00:00+00:00 https://www.pubstack.com/blog/2020/07/31/the-fastest-and-simplest-way-to-deploy-okd-openshift-4-5 Carlos Camacho <p>Long story short… <strong>We will deploy an OKD 4.5 cluster (3 controllers, 1 to 10 workers, 1 service, and 1 bootstrap node) with one single command, in around 30 minutes, using a tool called <a href="https://github.com/kubeinit/kubeinit">KubeInit</a>.</strong></p> <blockquote> <p><strong><em>Note 2021/10/13:</em></strong> DEPRECATED - This tutorial only works with <a href="https://github.com/Kubeinit/kubeinit/releases/tag/1.0.2">kubeinit 1.0.2</a>; make sure you use this version of the code if you are following this tutorial, or <a href="https://docs.kubeinit.org/">refer to the documentation</a> to use the latest code.</p> </blockquote> <p><img src="/static/kubeinit/okd-libvirt.png" alt="" /></p> <p>I wrote a lot of automation while I worked, learned, and practiced with OpenStack/RHOSP/Kubernetes/OpenShift/OKD over the last two years, but suddenly I “lost” the machine where I hosted all these valuable code snippets.</p> <p>With all this… I had to quickly invest some time to put together all that code. The first part is related to K8s/OKD, and I created a small project called KubeInit “The KUBErnetes INITiator” to share it with the world.</p> <p>The first (and only, for now) playbook will deploy in a single command a fully operational OKD 4.5 cluster with 3 master nodes, 1 compute node (configurable from 1 to 10 nodes), 1 services node, and 1 dummy bootstrap node.
The services node runs HAProxy, Bind, Apache httpd, and NFS to host some of the required external cluster services.</p> <p><img src="/static/kubeinit/fast.jpg" alt="" /></p> <hr /> <h2 id="introduction">Introduction</h2> <p>What is OpenShift?</p> <blockquote> <p>Red Hat OpenShift is an open source container application platform based on the Kubernetes container orchestrator for enterprise application development and deployment.</p> <p>– <cite>https://www.openshift.com/</cite></p> </blockquote> <p>There are multiple ways of deploying the Community Distribution of Kubernetes that powers Red Hat OpenShift (<a href="https://www.okd.io/">OKD</a>), depending on the underlying infrastructure where it will be installed. In this particular blog post, we will deploy it on top of a KVM host using Libvirt. The initial upstream support is described in the <a href="https://github.com/openshift/installer/tree/fcos/docs/dev/libvirt">official upstream OpenShift documentation</a>, but as you can see, it involves a high number of error-prone manual steps and, most importantly, references that become outdated when the deployment workflow changes.</p> <p>In this case, we will use a project based on Ansible playbooks and roles for deploying and configuring multiple Kubernetes distributions; the project is called <a href="https://github.com/kubeinit/kubeinit">KubeInit</a>.</p> <h2 id="requirements">Requirements</h2> <ul> <li>A freshly deployed CentOS 8 host for hosting all the guests.</li> <li>RAM: depending on how many compute nodes you deploy, this can go up to 384GB (the smallest amount required is around 64GB); configure the nodes’ resources in the <a href="https://github.com/kubeinit/kubeinit/blob/master/hosts/okd/inventory#L8">inventory file</a>.</li> <li>Be able to log in as <code>root</code> on the hypervisor node without using passwords (using SSH certificate authentication).</li> <li>Reach the hypervisor node using the hostname <code>nyctea</code>, <a 
href="https://github.com/kubeinit/kubeinit/blob/master/hosts/okd/inventory#L56">you can change this in the inventory</a> or add an entry in your <code>/etc/hosts</code> file.</li> </ul> <h2 id="deploy">Deploy</h2> <p>That’s it, now, let’s execute the deployment command:</p> <pre><code class="language-bash">git clone https://github.com/kubeinit/kubeinit.git cd kubeinit ansible-playbook \ --user root \ -v -i ./hosts/okd/inventory \ --become \ --become-user root \ ./playbooks/okd.yml </code></pre> <p>You should get something like:</p> <pre><code>[ccamacho@wakawaka kubeinit]$ time ansible-playbook \ --user root \ -i ./hosts/okd/inventory \ --become \ --become-user root \ ./playbooks/okd.yml Using /etc/ansible/ansible.cfg as config file PLAY [Main deployment playbook for OKD] ******************************************** TASK [Gathering Facts] ************************************************************* ok: [hypervisor-01] . . . "NAME STATUS ROLES AGE VERSION", "okd-master-01 Ready master 16m v1.18.3", "okd-master-02 Ready master 15m v1.18.3", "okd-master-03 Ready master 12m v1.18.3", "okd-worker-01 Ready worker 6m12s v1.18.3" ]}]}} PLAY RECAP ************************************************************************* hypervisor-01: ok=83 changed=39 unreachable=0 failed=0 skipped=6 rescued=0 ignored=3 real 33m49.483s user 2m30.920s sys 0m19.678s </code></pre> <p>A ready to use OKD 4.5 cluster in ~30 minutes!</p> <p>What you just executed should give you an operational OKD 4.5 cluster with 3 master nodes, 1 compute node (configurable from 1 to 10 nodes), 1 services node, and 1 dummy bootstrap node. 
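As a reminder, the requirements above boil down to the hypervisor being reachable as <code>nyctea</code> with passwordless root SSH. A minimal sketch of that preparation from your workstation (the IP address is an example; adjust it to your hypervisor, and the privileged steps are left as comments):

```shell
# Make the hypervisor resolvable as "nyctea" (example IP -- adjust to your host).
HYPERVISOR_IP="192.168.1.10"
ENTRY="${HYPERVISOR_IP} nyctea"
echo "${ENTRY}"   # append to /etc/hosts with: echo "${ENTRY}" | sudo tee -a /etc/hosts

# Passwordless root SSH via certificate authentication:
#   ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
#   ssh-copy-id root@nyctea
```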
The services node runs HAProxy, Bind, Apache httpd, and NFS to host some of the required external cluster services.</p> <p>Now, ssh into your hypervisor node and check the cluster status from the services machine.</p> <pre><code class="language-bash">ssh root@nyctea ssh [email protected] # This is now the service node (check the Ansible inventory for IPs and other details) export KUBECONFIG=~/install_dir/auth/kubeconfig oc get pv oc get nodes </code></pre> <p>The root password of the services machine is <a href="https://github.com/kubeinit/kubeinit/blob/master/playbooks/okd.yml#L54">defined as a variable in the playbook</a>, but the public key of the hypervisor root user is deployed across all the cluster nodes, so you should be able to connect to any node from the hypervisor machine using SSH certificate authentication. Connect as the <code>root</code> user to the services machine (because it is CentOS-based) or as the <code>core</code> user to any other node (CoreOS-based), using the IP addresses defined in the inventory file.</p> <p>There is a reason for keeping this password-based access to the services node: sometimes we need to connect to the services machine during a deployment for debugging purposes, and if we don’t set a password for the user we won’t be able to log in using the console. The CoreOS nodes, on the other hand, bootstrap automatically, so there is no need to log in using the console; just wait until they are deployed and connect to them using SSH.</p> <h2 id="final-thoughts">Final thoughts</h2> <p><a href="https://github.com/kubeinit/kubeinit">KubeInit</a> is a simple and intuitive way to show potential users and customers how easily an OpenShift (OKD) cluster can be deployed, managed, and used for any purpose they might require (production or development environments). 
Once they have the environment deployed, it’s always easier to learn how it works, hack on it, and even start contributing to the upstream community; if you are interested in that last part, please read the <a href="https://www.okd.io/#contribute">contribution page</a> on the official OKD website.</p> <p>All the Ansible automation is hosted at <a href="https://github.com/kubeinit/kubeinit/">https://github.com/kubeinit/kubeinit/</a>.</p> <p><img src="/static/kubeinit/happy.jpg" alt="" /></p> <hr /> <p>The code is not perfect by any means, but it is a good example of how to use a Libvirt host to run your OKD cluster, and it’s incredibly easy to improve and to add other roles and scenarios.</p> <p>Next steps: I’ll clean up all the lint nits…</p> <p>This is the GitHub repository: <a href="https://github.com/kubeinit/kubeinit/">https://github.com/kubeinit/kubeinit/</a>.</p> <p>Please, if you like it, add some comments, test it, use it, hack it, break it, or become a stargazer ;)</p> Stay at home! For the people you love, please stay at home! COVID-19 is a respiratory illness (which affects breathing) caused by a new coronavirus. Symptoms can range from mild, such as a sore throat, to severe, such as pneumonia. Most people will... 2020-03-24T00:00:00+00:00 https://www.pubstack.com/blog/2020/03/24/pod Carlos Camacho <p>For the people you love, please stay at home!</p> <p><img src="/static/pod/stayathome.png" alt="" /></p> <p>COVID-19 is a respiratory illness (which affects breathing) caused by a new coronavirus.</p> <p>Symptoms can range from mild, such as a sore throat, to severe, such as pneumonia. Most people will not need medical attention for their symptoms. 
Together we can slow the spread and protect those at higher risk of severe illness and our health care workers from getting sick.</p> <p><br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> TripleO deep dive session #14 (Containerized deployments without paunch) This is the 14th release of the TripleO “Deep Dive” sessions Thanks to Emilien Macchi for this deep dive session about the status of the containerized deployment without Paunch. You can access the presentation. So please, check the full session... 2020-02-18T00:00:00+00:00 https://www.pubstack.com/blog/2020/02/18/tripleo-deep-dive-session-14 Carlos Camacho <p>This is the 14th release of the <a href="http://www.tripleo.org/">TripleO</a> “Deep Dive” sessions</p> <p>Thanks to <a href="http://my1.fr/blog">Emilien Macchi</a> for this deep dive session about the status of the containerized deployment without Paunch.</p> <p>You can access the <a href="https://docs.google.com/presentation/d/1dndHde25r8MPSdakLp9y5ztL3d6jXmCy-JfhIc6bJbo/edit">presentation</a>.</p> <p>So please, check the full <a href="https://www.youtube.com/watch?v=D18RaSBGyQU">session</a> content on the <a href="https://www.youtube.com/channel/UCNGDxZGwUELpgaBoLvABsTA/">TripleO YouTube channel</a>.</p> <div class="center"> <iframe width="560" height="315" src="https://www.youtube.com/embed/D18RaSBGyQU" frameborder="0" allowfullscreen=""></iframe> </div> <p><br /> <br /></p> <p>Please check the <a href="http://www.pubstack.com/blog/2017/06/15/tripleo-deep-dive-session-index.html">sessions index</a> to have access to all available content.</p> Badgeboard - GitHub actions, where is my CI dashboard! A widely used term in the agile world is the information radiator, which refers to display the project’s critical information as simple as possible. These information radiators improve the team’s communication by amplifying pieces of data to get a better... 
2019-12-04T00:00:00+00:00 https://www.pubstack.com/blog/2019/12/04/github-actions-where-is-my-ci-dashboard Carlos Camacho <p>A widely used term in the agile world is the information radiator, which refers to displaying the project’s critical information as simply as possible. These information radiators improve the team’s communication by amplifying key pieces of data, giving the team a better notion of self-awareness.</p> <h2 id="tldr">TL;DR;</h2> <p>If you just want to go straight to the solution of how to convert SVG badges into a widget-based CI dashboard, go to the <a href="https://github.com/pystol/badgeboard">Badgeboard</a> repository or open the <a href="https://badgeboard.pystol.org">demo</a>.</p> <p>Otherwise, continue reading.</p> <p><img src="/static/badgeboard/01_build_monitor.png" alt="" /></p> <p>If you are beginning to apply agile methodologies in your team, a good information radiator can be, for example, a CI status dashboard.</p> <p>The purpose of these information radiators, as the name implies, is to radiate information. It is something people know about and can see easily. Keep in mind that a good information radiator will adapt to the needs of the project throughout its life, so try not to invest too much time in its initial design, and make sure that it can be easily changed/fixed/used/improved.</p> <p>Some features of these information radiators:</p> <ul> <li><strong>It reflects the now</strong>: Information radiators always show what is going on (whether things are going north or south). They help us see what matters to the team right now and what to focus on, for example when we hit regressions.</li> <li><strong>Maximum value with minimum information</strong>: Simple and highly valuable. The more information, the less focus on the important parts, and the more effort to maintain the panel.</li> <li><strong>Must be alive</strong>: This information artifact should be kept up to date. 
As soon as reality changes, the artifact status should also change.</li> </ul> <hr /> <h1 id="ci-dashboards">CI Dashboards</h1> <p>CI dashboards are a graphical representation of the continuous integration test results, usually HTML-based, displaying the current results in colors (red, yellow, and green).</p> <p><img src="/static/badgeboard/02_intro.png" alt="" /></p> <hr /> <h1 id="github-badges">GitHub badges</h1> <p>We can see status badges as a brief summary of the CI pipeline status. Badges<a href="https://docs.gitlab.com/ee/user/project/badges.html">1</a> are a unified way to present condensed pieces of information about your projects. They can also be any visual token of achievement, affiliation, authorization, or other trust relationship.</p> <p>They consist of a small image and, additionally, a URL that the image points to. Examples of badges can be the pipeline status, test coverage, or ways to contact the project maintainers.</p> <p><img src="/static/badgeboard/03_badges.png" alt="" /></p> <hr /> <h1 id="what-now">What now?</h1> <p>We introduce a tool to convert SVG badges into CI dashboards (<a href="https://github.com/pystol/badgeboard">Badgeboard</a>).</p> <h2 id="github-actions---no-ci-dashboard-by-default-">GitHub actions -&gt; No CI dashboard by default :(</h2> <p>I really liked the big dashboard view shown on a big screen so everyone can check it in a quick and easy manner. If we start using, for example, GitHub Actions, we lose this graphical representation in favor of a badge-based view.</p> <h2 id="yehi-here-we-have-badgeboard">Yehi!!! 
Here we have badgeboard!!!</h2> <p><a href="https://github.com/pystol/badgeboard">Badgeboard</a> is an awesome information radiator that shows the status of your project’s badges as a widget-based dashboard; in particular, it’s the main CI dashboard of <a href="https://github.com/pystol/pystol">Pystol</a>.</p> <p>It is a very simple tool that converts the information inside any SVG badge you define, from any source, into a widget-based dashboard.</p> <p><img src="/static/badgeboard/04_badgeboard.png" alt="" /></p> <h2 id="demo">Demo</h2> <p>Just <a href="https://badgeboard.pystol.org/">open the index.html</a> file and see how the dashboard is rendered.</p> <h2 id="requirements">Requirements</h2> <p>None! Just clone the repo and open the index.html file in your favorite browser.</p> <p>Once you have a copy, adjust the configuration file located at <strong>assets/data_source/badges_list.js</strong> to use your own badges.</p> <p><strong>Note:</strong> Due to CORS restrictions, badgeboard uses a <a href="https://cors-anywhere.herokuapp.com/">proxy</a> to add cross-origin headers when building the widgets panel. 
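In practice, the proxying just means prefixing each badge URL with the proxy address before fetching it; a minimal sketch (the badge URL below is only an example):

```shell
# Build the proxied URL the dashboard fetches for a badge
# (sketch; the badge URL is an example, the proxy is the one from the note above).
PROXY="https://cors-anywhere.herokuapp.com/"
BADGE="https://img.shields.io/badge/build-passing-brightgreen.svg"
PROXIED="${PROXY}${BADGE}"
echo "${PROXIED}"
```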
Check additional information about the CORS proxy on <a href="https://www.npmjs.com/package/cors-anywhere">NPM</a>.</p> <h2 id="how-it-works">How it works</h2> <p>We take the list of badges (SVG files) and read the color information from a single pixel; depending on that pixel’s color, the widget is painted with its corresponding color.</p> <p><img src="/static/badgeboard/05_measure.png" alt="" /></p> <p>This would be the usual view of the project badges.</p> <p><img src="/static/badgeboard/06_badges.png" alt="" /></p> <h2 id="adding-your-badges-and-colors">Adding your badges and colors</h2> <p>Use the <strong>coordinates_testing.html</strong> file to determine, from the SVG coordinates, the RGB color to be used in the JS configuration file.</p> <p>To do so, copy the link to your badge, find the badge example in the file, replace it with yours, open the file in a browser, check the console logs, and move the mouse over the badge to see the coordinates and the RGB color that matches them.</p> <h2 id="adding-custom-color-badges">Adding custom color badges</h2> <p>To add new colors, edit the <strong>assets/css/custom.css</strong> file and add new color definitions for the widgets. Once you define the new color, use it in the configuration file called <strong>assets/data_source/badges_list.js</strong> like in the following example.</p> <pre><code class="language-bash">colors:[['&lt;new_color_definition&gt;','&lt;matching_rgb_from_the_badge&gt;'],['status-good','48,196,82']], </code></pre> <h2 id="troubleshooting">Troubleshooting</h2> <p>If the board does not render correctly (no widgets at all), you most likely refreshed the page too many times. 
We use a <strong>CORS</strong> proxy to add cross-origin headers when building the widgets panel.</p> <p>The number of requests it can handle is limited in order to avoid crashing the container, so we can all use it.</p> <p><strong>Please read the requirements</strong> and use your own <a href="https://www.npmjs.com/package/cors-anywhere">NPM proxy</a> so these restrictions go away.</p> <h2 id="references">References</h2> <p>We use both <a href="https://github.com/smashing/smashing">smashing</a> and <a href="https://github.com/ducksboard/gridster.js">gridster</a> to create the dashboard and its widgets.</p> <h2 id="license">License</h2> <p>Badgeboard is part of <a href="https://github.com/pystol/pystol">Pystol</a>, and <a href="https://github.com/pystol/pystol">Pystol</a> is open source software licensed under the <a href="LICENSE">Apache license</a>.</p> <h1 id="next-steps">Next steps</h1> <p>It would be awesome to get some feedback on the tool, so please feel free to file issues, open pull requests, or leave comments in this post or in the <a href="https://github.com/pystol/badgeboard">Badgeboard</a> repository.</p> <h2 id="list-of-to-dos">List of TO-DOs</h2> <p>There are still some bits to fix in <a href="https://github.com/pystol/badgeboard">Badgeboard</a>, for example:</p> <ul> <li><del>Make the links from the widgets work.</del></li> <li>Move common hardcoded bits into variables for easier updates.</li> <li>Improve the documentation.</li> </ul> <div style="font-size:10px"> <blockquote> <p><strong>Updated 2019/12/04:</strong> Initial version.</p> </blockquote> </div> Oil painting and Minikube - Installing Minikube in Centos 7 Today I got some time to do some oil painting and reading about techy stuff :) This post is a brief summary of the deployment steps for installing Minikube in a Centos 7 baremetal machine, and, to show you my... 
2019-10-13T00:00:00+00:00 https://www.pubstack.com/blog/2019/10/13/oil-painting-and-installing-minikube-in-centos-7 Carlos Camacho <p>Today I got some time to do some oil painting and to read about techy stuff :)</p> <p>This post is a brief summary of the deployment steps for installing Minikube on a CentOS 7 bare-metal machine, and an excuse to show you my painting (check the fedora!).</p> <p><img src="/static/Terraza-En-Grecia-by-Carlos-Camacho.jpg" alt="" /></p> <p>The following steps need to run on the hypervisor machine on which you would like to have your Minikube deployment.</p> <p>You need to execute them one after the other; the idea of this recipe is to have something ready for copying/pasting.</p> <p>The usual steps are:</p> <p><strong>01 - Prepare the hypervisor node.</strong></p> <p>Now, let’s install some dependencies. Same hypervisor node, same <code>root</code> user.</p> <pre><code class="language-bash"># In this dev. env. /var is only 50GB, so I will create # a sym link to another location with more capacity. 
sudo mkdir -p /home/libvirt/ sudo ln -sf /home/libvirt/ /var/lib/libvirt sudo mkdir -p /home/docker/ sudo ln -sf /home/docker/ /var/lib/docker # Install some packages sudo yum install dnf -y sudo dnf update -y sudo dnf groupinstall "Virtualization Host" -y sudo dnf install libvirt qemu-kvm virt-install virt-top libguestfs-tools bridge-utils -y sudo dnf install git lvm2 lvm2-devel -y sudo dnf install libvirt-python python-lxml libvirt curl -y sudo dnf install binutils qt gcc make patch libgomp -y sudo dnf install glibc-headers glibc-devel kernel-headers -y sudo dnf install kernel-devel dkms bash-completion -y sudo dnf install nano wget -y sudo dnf install python3-pip -y </code></pre> <p><strong>02 - Check that the kernel modules are OK.</strong></p> <pre><code class="language-bash"># Check the kernel modules are OK sudo lsmod | grep kvm </code></pre> <p><strong>03 - Enable libvirtd, disable SElinux xD and firewalld.</strong></p> <pre><code class="language-bash"># Enable libvirtd sudo systemctl start libvirtd sudo systemctl enable libvirtd # Disable selinux &amp; stop firewall as needed. 
setenforce 0 perl -pi -e 's/SELINUX\=enforcing/SELINUX\=disabled/g' /etc/selinux/config systemctl stop firewalld systemctl disable firewalld </code></pre> <p><strong>04 - Install Minikube.</strong></p> <pre><code class="language-bash"># Install minikube /usr/bin/curl -Lo minikube https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 &amp;&amp; chmod +x minikube cp -p minikube /usr/local/bin &amp;&amp; rm -f minikube # Install the kvm2 VM driver required later by --vm-driver kvm2 /usr/bin/curl -Lo docker-machine-driver-kvm2 https://storage.googleapis.com/minikube/releases/latest/docker-machine-driver-kvm2 &amp;&amp; chmod +x docker-machine-driver-kvm2 cp -p docker-machine-driver-kvm2 /usr/local/bin &amp;&amp; rm -f docker-machine-driver-kvm2 # Create the repo for kubernetes cat &lt;&lt; EOF &gt; /etc/yum.repos.d/kubernetes.repo [kubernetes] name=Kubernetes baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64 enabled=1 gpgcheck=1 repo_gpgcheck=1 gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg EOF # Install kubectl sudo dnf install kubectl -y source &lt;(kubectl completion bash) echo "source &lt;(kubectl completion bash)" &gt;&gt; ~/.bashrc </code></pre> <p><strong>05 - Create the toor user (from the hypervisor node, as root).</strong></p> <pre><code class="language-bash">sudo useradd toor echo "toor:toor" | sudo chpasswd echo "toor ALL=(root) NOPASSWD:ALL" \ | sudo tee /etc/sudoers.d/toor sudo chmod 0440 /etc/sudoers.d/toor sudo su - toor cd mkdir .ssh ssh-keygen -t rsa -N "" -f .ssh/id_rsa </code></pre> <p>Now, continue as the <code>toor</code> user and prepare the hypervisor node for Minikube.</p> <p><strong>06 - Install Docker.</strong></p> <p>We would also like to have Docker on the hypervisor node for creating images and for debugging purposes.</p> <pre><code class="language-bash"># Install docker sudo dnf install docker -y sudo usermod --append --groups dockerroot toor sudo tee /etc/docker/daemon.json &gt;/dev/null &lt;&lt;-EOF { "live-restore": true, "group": "dockerroot" } EOF sudo systemctl start docker sudo systemctl enable docker </code></pre> <p><strong>07 - Finish the Minikube configuration.</strong></p> <pre><code class="language-bash"># Add to bashrc in toor user 
source &lt;(kubectl completion bash) echo "source &lt;(kubectl completion bash)" &gt;&gt; ~/.bashrc # We add toor to the libvirtd group sudo usermod --append --groups libvirt toor </code></pre> <p><strong>08 - Start Minikube.</strong></p> <pre><code class="language-bash">minikube start --memory=65536 --cpus=4 --vm-driver kvm2 export no_proxy=$no_proxy,$(minikube ip) nohup kubectl proxy --address='0.0.0.0' --port=8001 --disable-filter=true &amp; sleep 30 minikube addons enable dashboard nohup minikube dashboard &amp; minikube addons open dashboard </code></pre> <p>The Minikube instance should be reachable from the following URL:</p> <p>http://machine_ip:8001/api/v1/namespaces/kubernetes-dashboard/services/http:kubernetes-dashboard:/proxy/</p> <pre><code class="language-bash"># To stop/delete kubectl delete deploy,svc --all minikube stop minikube delete </code></pre> <p><strong>09 - Minikube cheat sheet.</strong></p> <pre><code class="language-bash"># set &amp; get current context of cluster kubectl config use-context minikube kubectl config current-context # fetch all the kubernetes objects for a namespace kubectl get all -n kube-system # display cluster details kubectl cluster-info # set custom memory and cpu minikube config set memory 4096 minikube config set cpus 2 # fetch cluster ip minikube ip # ssh to the minikube vm minikube ssh # display addons list and status minikube addons list # exposes service to vm &amp; retrieves url minikube service elasticsearch minikube service elasticsearch --url </code></pre> <div style="font-size:10px"> <blockquote> <p><strong>Updated 2019/10/13:</strong> Initial version.</p> <p><strong>Updated 2019/10/15:</strong> Install also docker in the hypervisor.</p> </blockquote> </div> Automated weekly reports Remote work is a trend among Senior or hi-performant roles. In general when working remotely it might be hard to measure or be accountable for all the work we/you usually do. Some of these tasks can be among user escalations,... 
2019-07-07T00:00:00+00:00 https://www.pubstack.com/blog/2019/07/07/remote-automated-report Carlos Camacho <p>Remote work is a trend among senior or high-performing roles.</p> <p><img src="/static/1and1s/its-time-for-weekly-reports.jpg" alt="" /></p> <p>In general, when working remotely it might be hard to measure, or be accountable for, all the work you usually do.</p> <p>These tasks range from user escalations, preparing development environments, reviews, writing code, checking unit tests, and verifying CI systems status, to meetings and 1-on-1s with your associates, among many others…</p> <p>And it’s really easy to lose track of the tasks we actually want to report.</p> <p>In this particular case I will explain how to generate automated reports from the tasks you have logged in a Trello board, using a Google Docs template; we will also fetch data from Bugzilla, Launchpad, and Storyboard.</p> <h3 id="resources">Resources</h3> <ul> <li>The main <a href="https://docs.google.com/document/d/1qh7vuC8vPTum_BItCm5O0c6DmyXGXK-NABBXaoNuRMM/edit">Google doc</a> we will use for this how-to.</li> <li>A code repository on <a href="https://github.com/ccamacho/gdocsreport">GitHub</a> with the JS/HTML code used inside the Google doc (it is all integrated into the Google doc, so using this is not mandatory; just read the <a href="https://github.com/ccamacho/gdocsreport/blob/master/README.md">README</a>).</li> </ul> <p><img src="/static/1and1s/00_google_docs_menu.PNG" alt="" /></p> <h3 id="how-to">How to</h3> <p>If you open the <a href="https://docs.google.com/document/d/1qh7vuC8vPTum_BItCm5O0c6DmyXGXK-NABBXaoNuRMM/edit">Google document</a> and click “Main menu” -&gt; “Scrum” you will see the following options:</p> <ul> <li>Create today’s agenda.</li> <li>Create tomorrow’s agenda.</li> <li>Create today’s agenda with Trello items.</li> <li>Generate activity report.</li> </ul> <h3 id="create-todays-and-tomorrows-agenda">Create today’s and tomorrow’s agenda</h3> <p>These 
two options are really simple to explain: we just generate an enumerated text section where we can write anything we need.</p> <p><img src="/static/1and1s/01_empty_agendas.PNG" alt="" /></p> <p>You will need to grant permissions to the script; the scripts are located under <strong>“Main menu” -&gt; “Tools” -&gt; “Script editor”</strong>.</p> <p>There you will be able to see the code we can run from this Google document (we will talk about the code later).</p> <p><img src="/static/1and1s/02_tomorrow_empty_agenda.png" alt="" /></p> <h3 id="create-todays-agenda-with-trello-items">Create today’s agenda with Trello items.</h3> <p>In this case we will fetch the content of three lists in a Trello board.</p> <p>In particular, I have created an example Trello board for this tutorial.</p> <p><img src="/static/1and1s/03_trello_dashboard.PNG" alt="" /></p> <p>The board in question is private (you could set it up as public, but it is kept private to show how to configure the Trello Key and Token; otherwise you would not be able to access it).</p> <p>When you click on the option <strong>Create today’s agenda with Trello items</strong>,</p> <p><img src="/static/1and1s/04_index_create_agenda_with_trello_items.png" alt="" /></p> <p>the agenda is generated <strong>at the cursor position</strong>, pulling the content of the board lists.</p> <p><img src="/static/1and1s/05_index_create_agenda_with_trello_items.png" alt="" /></p> <p>As you can see, the generated agenda contains the following information for each list. 
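Under the hood, the script reads each board list through the Trello REST API; roughly like the following sketch, where the key and token are placeholders (the list ID reuses the example board's "To Do" list from the configuration shown later):

```shell
# Build the Trello REST API request that returns the cards of one list
# (sketch; TRELLO_KEY and TRELLO_TOKEN are placeholders).
TRELLO_KEY="your-key"
TRELLO_TOKEN="your-token"
LIST_ID="5d20bd50eb63e24f7a0c8744"
URL="https://api.trello.com/1/lists/${LIST_ID}/cards?key=${TRELLO_KEY}&token=${TRELLO_TOKEN}"
echo "${URL}"   # fetch the cards as JSON with: curl -s "${URL}"
```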
In this case we have three lists, <strong>To Do</strong>, <strong>Doing</strong>, and <strong>Done</strong>, and in each list we have:</p> <ul> <li>Assignees.</li> <li>Description.</li> <li>Tags.</li> <li>Link to the Trello card.</li> </ul> <p>Now, to customize this, please <strong>create your own copy of the <a href="https://docs.google.com/document/d/1qh7vuC8vPTum_BItCm5O0c6DmyXGXK-NABBXaoNuRMM/edit">Google document</a>!!</strong> and proceed to edit its content by clicking:</p> <p><strong>“Main menu” -&gt; “Tools” -&gt; “Script editor”</strong></p> <p><img src="/static/1and1s/06_google_script_editor.PNG" alt="" /></p> <p>The section we will focus on customizing contains the following parameters.</p> <p><img src="/static/1and1s/07_google_script_parameters.PNG" alt="" /></p> <p><strong>You can update and modify the script as you wish.</strong></p> <p>If you open your board with <strong>.json</strong> appended to the URL, you will fetch the IDs of the lists we need to display.</p> <p>For this example it would be:</p> <pre><code>https://trello.com/b/L8t9rCOz/1and1.json </code></pre> <p>In this case the variables <strong>TRELLO_KEY</strong> and <strong>TRELLO_TOKEN</strong> will be needed if the board is private; otherwise you won’t need them at all.</p> <pre><code>var TRELLO_KEY = '8b164347d9cf6a1026b5d20dc8556620'; var TRELLO_TOKEN = '4a64be415e2f7d128d2543fbbddd2b8ffd33d3ead6921803668ca39fe715f5cd'; </code></pre> <p>The following parameters are:</p> <ul> <li> <p><strong>TRELLO_LIST_ID</strong>: The IDs of the lists whose content we will fetch.</p> </li> <li> <p><strong>TRELLO_TITLES</strong>: The titles we will add to the report (the order should match: the first list ID will be displayed under the first title, and so on).</p> </li> <li> <p><strong>TRELLO_USER_FILTER</strong>: Filters the cards based on the assignee name; if this is a report, it might be a good idea to include your manager’s name and your own. Use the name you have in Trello. 
(see the example)</p> </li> </ul> <pre><code>var TRELLO_LIST_ID = ["5d20bd50eb63e24f7a0c8744", "5d20bd50eb63e24f7a0c8745", "5d20bd50eb63e24f7a0c8746"]; //To Do, Doing, Done var TRELLO_TITLES = ["To Do", "Doing", "Done"]; //To Do, Doing, Done var TRELLO_USER_FILTER = ["Carlos Camacho", "Pubstack"]; //Only display these people's cards </code></pre> <p>You will need to generate a <strong>Developer key</strong> and a <strong>Token</strong>.</p> <p><img src="/static/1and1s/08_key.PNG" alt="" /></p> <p>Use the generated Key.</p> <p><img src="/static/1and1s/09_key_view.PNG" alt="" /></p> <p>Then you need to generate a token to be able to interact with Trello from the Google docs scripts.</p> <p><img src="/static/1and1s/10_token.PNG" alt="" /></p> <p>Copy the Token.</p> <p><img src="/static/1and1s/11_token_view.PNG" alt="" /></p> <p>Replace both <strong>TRELLO_KEY</strong> and <strong>TRELLO_TOKEN</strong> with the values you now have.</p> <h3 id="generate-activity-report">Generate activity report.</h3> <p>This section will generate an activity report based on quarters, and this one will have a little more information:</p> <ul> <li>Stackalytics.</li> <li>Launchpad.</li> <li>Storyboard.</li> <li>Bugzilla.</li> </ul> <p>The information retrieved here is not private, as it is based on upstream metrics, so you can just use mine as an example.</p> <p>It is basically driven by the following configuration parameters:</p> <pre><code>/* Stackalytics variables */ var STACKALYTICS_USER = "ccamacho"; /* Bugzilla variables */ var BZ_HOST = "https://bugzilla.redhat.com"; var BZ_STATUS = "bug_status=NEW&amp;bug_status=ASSIGNED&amp;bug_status=POST&amp;bug_status=MODIFIED&amp;bug_status=ON_DEV&amp;bug_status=ON_QA&amp;bug_status=VERIFIED&amp;bug_status=RELEASE_PENDING"; var BZ_EMAIL = "ccamacho%40redhat.com"; /* Launchpad variables */ var LAUNCHPAD_USER = "ccamacho"; /* Storyboard variables */ var STORYBOARD_USER = "3328"; </code></pre> <p>Those parameters basically describe the data we will fetch, 
i.e., the Bugzilla query and email, and your Launchpad and Storyboard user IDs.</p> <p>Then, when you generate the activity report, you filter by dates; in my particular case I based the reports on quarters.</p> <p><img src="/static/1and1s/13_generate_quarterly_report_selection.png" alt="" /></p> <p>The code to configure the date filter is also stored in the Google doc.</p> <p><img src="/static/1and1s/12_quarterly_report.png" alt="" /></p> <p>And then you can see the activity report generated correctly.</p> <p><img src="/static/1and1s/14_activity_report.png" alt="" /></p> <p>Again, if you would rather not create a copy of the document and use it directly because of the Google Apps permission requirements, you can always get the code from <a href="https://github.com/ccamacho/gdocsreport">GitHub</a> and follow the <a href="https://github.com/ccamacho/gdocsreport/blob/master/README.md">README</a>.</p> <p>This is an easy and automated way of keeping your contributions to the team reported. Having a record of all the tasks you did also makes it easy to, for example, put together a yearly retrospective to share your achievements.</p> <p><img src="/static/1and1s/weekly_reports.jfif" alt="" /></p> <div style="font-size:10px"> <blockquote> <p><strong>Updated 2019/07/07:</strong> Initial version.</p> <p><strong>Updated 2019/07/08:</strong> Fixed some nits.</p> </blockquote> </div> Advanced Deployment with Red Hat OpenShift - Retrospective Last week I had the opportunity to attend a training session in the office about Advanced Deployment with Red Hat OpenShift. The course in general covers installing Red Hat OpenShift Container Platform in an HA environment or without an... 2019-07-02T00:00:00+00:00 https://www.pubstack.com/blog/2019/07/02/advanced-deployment-with-red-hat-openshift-retro Carlos Camacho <p>Last week I had the opportunity to attend a training session in the office about Advanced Deployment with Red Hat OpenShift. 
The course covers installing Red Hat OpenShift Container Platform in an HA environment or without an internet connection.</p> <p><img src="/static/Red-Hat-OpenShift-4.png" alt="" /></p> <p>Other topics include networking and security configuration, and management skills using Red Hat OpenShift Container Platform. In theory, after completing the course, you should be able to:</p> <ul> <li>Describe and install OpenShift in an HA environment.</li> <li>Describe and configure OpenShift Machines.</li> <li>Describe and configure networking, including creating network policies to secure applications.</li> <li>Describe and configure the OpenShift Scheduler.</li> <li>Protect the platform using quotas and limits.</li> <li>Describe and install OpenShift without an internet connection (disconnected install).</li> </ul> <p>As usual, there is some prior knowledge about the technology presented in this course that you should have, among others:</p> <ul> <li>Understanding of networking and concepts such as routing and software-defined networking (SDN)</li> <li>Understanding of containers and virtualization</li> <li>Basic understanding of development life cycle and developer workflow</li> <li>Ability to read and modify code</li> </ul> <p>The course comprises 8 modules covering:</p> <ul> <li>Introduction to Course and Learning Environment.</li> <li>Learn about the Advanced Deployment with Red Hat OpenShift course.</li> <li>Understand the prerequisites, training environment, and system designations used during the lab procedures.</li> <li>Learn tips for successfully completing the labs.</li> <li>Understand course resources.</li> </ul> <h2 id="disconnected-install">Disconnected Install</h2> <ul> <li>Learn what a disconnected install is and about the architectures for disconnected environments.</li> <li>Learn about the advanced installer and the reference configuration implemented with Ansible Playbooks.</li> <li>Review the software components required 
for a disconnected install, including Red Hat OpenShift Container Platform installation software, Red Hat OpenShift Container Platform images, a source code repository, and a development artifact repository.</li> <li>Learn how to import images from preloaded Docker storage or a local repository.</li> <li>Perform an installation of Red Hat OpenShift Container Platform, including importing other images such as Nexus and deploying other infrastructure like the source code repository.</li> </ul> <h2 id="openshift-4-installation">OpenShift 4 Installation</h2> <ul> <li>Review the many components of the OpenShift architecture.</li> <li>Use OpenShift installer to deploy an HA OpenShift cluster.</li> <li>Understand how application HA is achieved with the replication controller and the scheduler.</li> <li>Learn about container log aggregation and metrics collection.</li> <li>Use diagnostics tools in server and client environments.</li> </ul> <h2 id="machine-management">Machine Management</h2> <ul> <li>Review how OpenShift manages underlying infrastructure</li> <li>Change MachineSet and Machine Configuration</li> <li>Add nodes by scaling MachineSets</li> <li>Understand and configure the Cluster Autoscaler</li> </ul> <h2 id="networking">Networking</h2> <ul> <li>Review networking goals and software-defined networking (SDN).</li> <li>Review packet-flow scenarios and learn about traffic inside an OpenShift cluster.</li> <li>Learn about local traffic in a cluster and how OpenShift controls access between different OpenShift namespaces and projects.</li> <li>Learn how pods connect to external hosts and how IPTables controls access to networks outside the SDN cluster.</li> <li>Study how pods communicate across a cluster and about traffic inside a cluster.</li> <li>Learn about pod IP allocation and network isolation.</li> <li>Configure SDN and set up external access.</li> <li>Study project network management and setting secure network policies.</li> <li>Learn about the seven common 
use cases for NetworkPolicy.</li> <li>Learn about OpenShift internal DNS.</li> <li>Learn about external access, including load balancing in SDN and establishing a tunnel in ramp mode.</li> <li>Understand how OpenShift masters also serve as an internal domain name service (DNS).</li> </ul> <h2 id="network-policy">Network Policy</h2> <ul> <li>Learn about NetworkPolicy</li> <li>Configure NetworkPolicy objects in the cluster</li> <li>Protect a complex application using NetworkPolicy</li> </ul> <h2 id="managing-compute-resources">Managing Compute Resources</h2> <ul> <li>Learn what compute resources are and about requesting, allocating, and consuming them.</li> <li>Learn about compute requests, CPU limits, and memory limits.</li> <li>Learn about quality of service (QoS) tiers.</li> <li>Create, edit, and delete project resource limits.</li> <li>Learn how limit ranges enumerate and specify project compute resource constraints for pods, containers, images, and image streams.</li> <li>Learn about container limits and image limits.</li> <li>Learn how to use quotas and limit ranges to limit the number of objects or amount of compute resources that are used in a project.</li> <li>Understand which resources can be managed with quotas.</li> <li>Learn how the BestEffort, NotBestEffort, Terminating, and NotTerminating quota scopes restrict pod resources.</li> <li>Understand how quotas are enforced and how to set quotas across multiple projects.</li> <li>Learn about overcommitting CPU and memory and how to configure overcommitting.</li> </ul> <h2 id="general-comments">General comments</h2> <p>I really liked the course. It wasn’t very advanced in general, but it was a very good approach to “the actual thing” people are doing when deploying the technology for production-ready environments.</p> <p>It’s an operators course IMHO and not a developers course; I missed having an actual view of where the code is, how it is organized, and how we can contribute to the Kubernetes/OpenShift 
community.</p> <p>It has a very nice 8-hour exam xD xD :/ ;( at the very end as proof that you understood what you did there. I think I did OK, but I’m still waiting for the course test results.</p> <p>Even so, I tried to dig into the code to better understand the integration with the python-kubernetes client, and gladly created an issue report with its corresponding PR to fix it :)</p> <p>What follows is a brief history of what I hacked in the cluster while doing the course.</p> <h2 id="hacking-the-environment">Hacking the environment</h2> <p>Let’s start with the research questions!</p> <ul> <li>How does the replication controller manage a massive pod kill?</li> <li>Does it recover fast?</li> <li>Does it recover at all?</li> </ul> <p>Let’s create a simple web application using 500 pods.</p> <p><img src="/static/openshift_1.png" alt="" /> <img src="/static/openshift_2.png" alt="" /> <img src="/static/openshift_3.png" alt="" /> <img src="/static/openshift_4.png" alt="" /></p> <p>I have created a simple script using the Python Kubernetes client to demonstrate the cluster behavior.</p> <p>The Python script is a simple application using two threads to measure the number of available and unavailable pods for the application while we kill some pods based on a Poisson distribution.</p> <p>Let’s see how the cluster behaves when killing pods from the test-webapp namespace.</p> <pre><code>[ccamacho@localhost mocoloco]$ python script.py </code></pre> <h2 id="buum">Buum!</h2> <p>The Python Kubernetes client failed, allowing me to create a GitHub <a href="https://github.com/kubernetes-client/python-base/issues/139">Issue report</a> and a <a href="https://github.com/kubernetes-client/python-base/pull/140">Pull request</a> with its corresponding fix.</p> <pre><code>First error within the Kubernetes python client: test-webapp-1-zv6jr Running 10.129.2.168 Traceback (most recent call last): File "watch.py", line 75, in &lt;module&gt; for event in 
stream: File "/usr/lib/python2.7/site-packages/kubernetes/watch/watch.py", line 134, in stream for line in iter_resp_lines(resp): File "/usr/lib/python2.7/site-packages/kubernetes/watch/watch.py", line 47, in iter_resp_lines for seg in resp.read_chunked(decode_content=False): AttributeError: 'HTTPResponse' object has no attribute 'read_chunked' </code></pre> <h2 id="the-execution">The execution</h2> <p>After hacking around a bit with the client’s methods, this is the result.</p> <p><img src="/static/openshift_5.png" alt="" /></p> <p>It seems that even if pod creation is fast enough, it takes some time to fully recover.</p> <p>Here is the whole script if you are interested:</p> <pre><code>import matplotlib
matplotlib.use('Agg')
import random
import time
from scipy.stats import poisson
import matplotlib.pyplot as plt
import matplotlib.dates as md
from kubernetes import client, config
from kubernetes.client.rest import ApiException
import threading
import datetime

global_available = []
global_unavailable = []
global_kill = []

t1_stop = threading.Event()
t2_stop = threading.Event()


def delete_pod(name, namespace):
    core_v1 = client.CoreV1Api()
    delete_options = client.V1DeleteOptions()
    try:
        api_response = core_v1.delete_namespaced_pod(
            name=name,
            namespace=namespace,
            body=delete_options)
    except ApiException as e:
        print("Exception when calling CoreV1Api-&gt;delete_namespaced_pod: %s\n" % e)


def get_pods(namespace=''):
    api_instance = client.CoreV1Api()
    try:
        if namespace == '':
            api_response = api_instance.list_pod_for_all_namespaces()
        else:
            api_response = api_instance.list_namespaced_pod(namespace, field_selector='status.phase=Running')
        return api_response
    except ApiException as e:
        print("Exception when calling CoreV1Api-&gt;list_pod_for_all_namespaces: %s\n" % e)


def get_event(namespace, stop):
    while not stop.is_set():
        config.load_kube_config()
        configuration = client.Configuration()
        configuration.assert_hostname = False
        api_client = client.api_client.ApiClient(configuration=configuration)
        dat = datetime.datetime.now()
        api_instance = client.AppsV1beta1Api()
        api_response = api_instance.read_namespaced_deployment_status('example', namespace)
        global_available.append((dat, api_response.status.available_replicas))
        global_unavailable.append((dat, api_response.status.unavailable_replicas))
        time.sleep(2)
    t2_stop.set()
    print("Ending live monitor")


def run_histogram(namespace, stop):
    # random numbers from poisson distribution
    n = 500
    a = 0
    data_poisson = poisson.rvs(mu=10, size=n, loc=a)
    counts, bins, bars = plt.hist(data_poisson)
    plt.close()
    config.load_kube_config()
    configuration = client.Configuration()
    configuration.assert_hostname = False
    api_client = client.api_client.ApiClient(configuration=configuration)
    for experiment in counts:
        pod_list = get_pods(namespace=namespace)
        aux_li = []
        for fil in pod_list.items:
            if fil.status.phase == "Running":
                aux_li.append(fil)
        pod_list = aux_li
        # From the Running pods I randomly choose those to die
        # based on the histogram length
        to_be_killed = random.sample(pod_list, int(experiment))
        for pod in to_be_killed:
            delete_pod(pod.metadata.name, pod.metadata.namespace)
        print("To be killed: " + str(experiment))
        global_kill.append((datetime.datetime.now(), int(experiment)))
        print(datetime.datetime.now())
    print("Ending histogram execution")
    time.sleep(300)
    t1_stop.set()


def plot_graph():
    plt.style.use('classic')
    ax = plt.gca()
    ax.xaxis_date()
    xfmt = md.DateFormatter('%H:%M:%S')
    ax.xaxis.set_major_formatter(xfmt)
    x_available = [x[0] for x in global_available]
    y_available = [x[1] for x in global_available]
    plt.plot(x_available, y_available, color='blue')
    plt.plot(x_available, y_available, color='blue', marker='o', label='Available pods')
    x_unavailable = [x[0] for x in global_unavailable]
    y_unavailable = [x[1] for x in global_unavailable]
    plt.plot(x_unavailable, y_unavailable, color='magenta')
    plt.plot(x_unavailable, y_unavailable, color='magenta', marker='o', label='Unavailable pods')
    x_kill = [x[0] for x in global_kill]
    y_kill = [x[1] for x in global_kill]
    plt.plot(x_kill, y_kill, color='red', marker='o', label='Killed pods')
    plt.legend(loc='upper left')
    plt.savefig('foo.png', bbox_inches='tight')
    plt.close()


if __name__ == "__main__":
    namespace = "test-webapp"
    try:
        t1 = threading.Thread(target=get_event, args=(namespace, t1_stop))
        t1.start()
        t2 = threading.Thread(target=run_histogram, args=(namespace, t2_stop))
        t2.start()
    except:
        print("Error: unable to start thread")
    while not t2_stop.is_set():
        pass
    print("Ended all threads")
    plot_graph()
</code></pre> Porsche 993 The Porsche 993 is the internal designation for the Porsche 911 model manufactured and sold between January 1994 and early 1998 (model years 1995–1998 in the United States), replacing the 964. Its discontinuation marked the end of air-cooled Porsches. The... 2019-06-13T00:00:00+00:00 https://www.pubstack.com/blog/2019/06/13/porsche-993 Carlos Camacho <p>The Porsche 993 is the internal designation for the Porsche 911 model manufactured and sold between January 1994 and early 1998 (model years 1995–1998 in the United States), replacing the 964. Its discontinuation marked the end of air-cooled Porsches.</p> <p><img src="/static/993_side.jpg" alt="" /></p> <p>The 993 was much improved over, and quite different from, its predecessor. According to Porsche, every part of the car was designed from the ground up, including the engine, and only 20% of its parts were carried over from the previous generation. Porsche refers to the 993 as “a significant advance, not just from a technical, but also a visual perspective.” Porsche’s engineers devised a new light-alloy subframe with coil and wishbone suspension (an all-new multi-link system), putting the previous lift-off oversteer behind and making significant progress with the engine and handling, creating a more civilized car overall and providing an improved driving experience. 
The 993 was also the first 911 to receive a six-speed transmission.</p> <p><img src="/static/964_side.jpg" alt="" /></p> <p>The external design of the Porsche 993, penned by English designer Tony Hatter, retained the basic body shell architecture of the 964 and other earlier 911 models, but with revised exterior panels, much more flared wheel arches, a smoother front and rear bumper design, an enlarged retractable rear wing, and teardrop mirrors. A 993 was promoted globally via its role as the safety car during the 1994 Formula One season.</p> <p>Next, you can find several resources related to the 993 internals.</p> <p>Please feel free to submit any pull request to add more resources here if you find them useful. New files added to the folder <a href="https://github.com/pubstack/pubstack.github.io/tree/master/static/993">/static/993/</a> will be added automatically to this post after the commits are merged.</p> <ul id="files_list"></ul> <script> $.getJSON('https://api.github.com/repos/pubstack/pubstack.github.io/contents/static/993/', function(data) { for (var i = 0; i < data.length; i++) { var ul = document.getElementById("files_list"); var li = document.createElement("li"); var a = document.createElement('a'); a.setAttribute('href',data[i].download_url); a.innerHTML = data[i].name; li.appendChild(a); ul.appendChild(li); } }); </script> <p>Now some more pictures, and of course, if you find this article useful, feel free to share it!</p> <div class="col"> <h4 class="block-title">Gallery</h4> <div class="block-body"> <ul class="item-list-round" data-magnific="gallery"> <li><a href="/static/964_front.jpg" style="background-image: url('/static/964_front.jpg');"></a></li> <li><a href="/static/964_rear.jpg" style="background-image: url('/static/964_rear.jpg');"></a></li> <li><a href="/static/964_rear2.jpg" style="background-image: url('/static/964_rear2.jpg');"></a></li> <li><a href="/static/993_inside.jpg" style="background-image: 
url('/static/993_inside.jpg');"></a></li> <li><a href="/static/993_inside2.jpg" style="background-image: url('/static/993_inside2.jpg');"></a></li> <li><a href="/static/964_inside2.jpg" style="background-image: url('/static/964_inside2.jpg');"></a></li> <li><a href="/static/964_inside.jpg" style="background-image: url('/static/964_inside.jpg');"></a></li> </ul> </div> </div> <hr style="clear:both; border:0px solid #fff;" /> Kubebox - Tray The components tray is the enclosure part allowing the user to allocate motherboards, GPUs, FPGAs, and disk arrays. Up to 8 trays fit in the same enclosure. Frontal view of the generic tray. Lorem ipsum dolor sit amet, consectetur adipiscing elit,... 2019-05-22T00:00:00+00:00 https://www.pubstack.com/blog/2019/05/22/kubebox-01-tray Carlos Camacho <p>The components tray is the enclosure part allowing the user to allocate motherboards, GPUs, FPGAs, and disk arrays.</p> <p>Up to 8 trays fit in the same enclosure.</p> <div id="kubebox_tray.glb" style="height: 100%; height:400px; display: flex; align-items: center; justify-content: center;"></div> <script> init("static/kubebox/", "kubebox_tray.glb"); </script> <p>Frontal view of the generic tray.</p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p> <div style="float: right; width: 400px; background: white;"><img width="400px" src="/static/kubebox/kubebox_tray_00.png" alt="" style="border:15px solid #FFF" /></div> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. 
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p> <div style="float: left; width: 100px; background: white;"><img width="100px" src="/static/kubebox/kubebox_tray_01.png" alt="" style="border:15px solid #FFF" /></div> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. 
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p> <div style="float: right; width: 400px; background: white;"><img width="400px" src="/static/kubebox/kubebox_tray_02.png" alt="" style="border:15px solid #FFF" /></div> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p> <div style="float: left; width: 100px; background: white;"><img width="100px" src="/static/kubebox/kubebox_tray_03.png" alt="" style="border:15px solid #FFF" /></div> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. 
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p> <div style="float: right; width: 400px; background: white;"><img width="400px" src="/static/kubebox/kubebox_tray_04.png" alt="" style="border:15px solid #FFF" /></div> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p> <div style="float: left; width: 100px; background: white;"><img width="100px" src="/static/kubebox/kubebox_tray_05.png" alt="" style="border:15px solid #FFF" /></div> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. 
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p> <div style="float: right; width: 400px; background: white;"><img width="400px" src="/static/kubebox/kubebox_tray_06.png" alt="" style="border:15px solid #FFF" /></div> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p> <p><strong><em>Comments are welcomed as usual, thank you!!!!</em></strong></p> <h2 id="go-to-the-project-index">Go to the <a href="https://www.pubstack.com/blog/2019/05/21/kubebox.html">project index</a></h2> <h2 id="update-log">Update log:</h2> <div style="font-size:10px"> <blockquote> <p><strong>2019/05/22:</strong> Initial version.</p> </blockquote> </div> The Kubernetes in a box project Implementing cloud computing solutions that runs in hybrid environments might be the final solution when comes to finding the best benefits/cost ratio. This post will be the main thread to build and describe the KIAB/Kubebox project (www.kubebox.org and/or www.kiab.org). Spoiler... 
2019-05-21T00:00:00+00:00 https://www.pubstack.com/blog/2019/05/21/kubebox Carlos Camacho <p>Implementing cloud computing solutions that run in hybrid environments might be the final answer when it comes to finding the best benefit/cost ratio.</p> <p>This post will be the main thread to build and describe the KIAB/Kubebox project (<a href="http://www.kubebox.org">www.kubebox.org</a> and/or <a href="http://www.kiab.org">www.kiab.org</a>).</p> <p>Spoiler alert!</p> <p><img src="/static/kubebox/kubebox.jpg" alt="" /></p> <h1 id="the-name">The name</h1> <p>First things first, the name. I have two names in mind, both with the same meaning. The first one is KIAB (Kubernetes In A Box); this name came to my mind from the <a href="https://es.wikipedia.org/wiki/Kiai">Kiai</a> sound of karatekas (practitioners of karate). The second one is more traditional, “Kubebox”. I have no preference, but it would be awesome if you helped me decide the official name for this project.</p> <p><strong><em>Add a comment and contribute to select the project name!</em></strong></p> <h1 id="introduction">Introduction</h1> <p>This project is about integrating devices already available on the market to run cloud software as an appliance.</p> <p>The proof-of-concept delivered in this series of posts will allow people to put a well-known set of hardware devices into a single chassis to either create their own cloud appliances, do research and development, continuous integration, testing, home labs, staging or production-ready environments, or simply just for fun.</p> <p>Here I humbly present to you the design of KubeBox/KIAB, an open chassis specification for building cloud appliances.</p> <p>The case enclosure is fully designed and hopefully in the last phases of building the first set of enclosures; the posts will appear as I have some free cycles for writing the overall description.</p> <h1 id="use-cases">Use cases</h1> <p>Several use cases can be defined to run on a 
KubeBox chassis.</p> <ul> <li>AWS outpost.</li> <li>Development environments.</li> <li>EDGE.</li> <li>Production Environments for small sites.</li> <li>GitLab CI integration.</li> <li>Demos for summits and conferences.</li> <li>R&amp;D: FPGA usage, deep learning, AI, TensorFlow, among many others.</li> <li>Marketing WOW effect.</li> <li>Training.</li> </ul> <h1 id="enclosure-design">Enclosure design</h1> <p>The enclosure is designed as a rackable unit, using 7U. It tries to minimize the space used to deploy an up to 8-node cluster with redundancy for both power and networking.</p> <h1 id="cloud-appliance-description">Cloud appliance description</h1> <p>This build will be described across several sub-posts linked from this main thread. The posts will be created particularly without any specific order depending on my availability.</p> <ul> <li>Backstory and initial parts selection.</li> <li>Designing the case part 1: Design software.</li> <li>A brief introduction to CAD software.</li> <li>Designing the case part 2: U’s, brakes, and ghosts.</li> <li>Designing the case part 3: Sheet thickness and bend radius.</li> <li>Designing the case part 4: Parts Allowance (finish, tolerance, and fit).</li> <li>Designing the case part 5: Vent cutouts and frickin’ laser beams!.</li> <li>Designing the case part 6: Self-clinching nuts and standoffs.</li> <li>Designing the case part 7: The standoffs strike back.</li> <li>A brief primer on screws and PEMSERTs.</li> <li>Designing the case part 8: Implementing PEMSERTs and screws.</li> <li>Designing the case part 9: Bend reliefs and flat patterns.</li> <li><a href="https://www.pubstack.com/blog/2019/05/22/kubebox-01-tray.html">Designing the case part 10: Tray caddy, to be used with GPU, Mother boards, disks, any other peripherals you want to add to the enclosure.</a></li> <li>Designing the case part 11: Components rig.</li> <li>Designing the case part 12: Power supply.</li> <li>Designing the case part 13: Networking.</li> <li>Designing 
the case part 14: 3D printed supports.</li> <li>Designing the case part 15: Adding computing power.</li> <li>Designing the case part 16: Adding Storage.</li> <li>Designing the case part 17: Front display and bastion for automation.</li> <li>Manufacturing the case part 1: PEMSERT installation.</li> <li>Manufacturing the case part 2: Bending metal.</li> <li>Manufacturing the case part 3: Bending metal.</li> <li>KubeBox cloud appliance in detail!</li> <li>Manufacturing the case part 0: Getting quotes.</li> <li>Manufacturing the case part 1: Getting the cases.</li> <li>Software deployments: Reference architecture.</li> <li>Design final source files for the enclosure design.</li> <li>KubeBox is fully functional.</li> </ul> <h2 id="update-log">Update log:</h2> <div style="font-size:10px"> <blockquote> <p><strong>2019/05/21:</strong> Initial version.</p> </blockquote> </div> Running Relax-and-Recover to save your OpenStack deployment ReaR is a pretty impressive disaster recovery solution for Linux. Relax-and-Recover creates both a bootable rescue image and a backup of the associated files you choose. When doing disaster recovery of a system, this rescue image plays the files back... 2019-05-20T00:00:00+00:00 https://www.pubstack.com/blog/2019/05/20/relax-and-recover-backups Carlos Camacho <p>ReaR is a pretty impressive disaster recovery solution for Linux. Relax-and-Recover creates both a bootable rescue image and a backup of the associated files you choose.</p> <p><img src="/static/ReAR_and_OpenStack.png" alt="" /></p> <p>When doing disaster recovery of a system, this rescue image plays the files back from the backup and so restores the latest state in the twinkling of an eye.</p> <p>Various configuration options are available for the rescue image. For example, slim ISO files, USB sticks or even images for PXE servers are generated. Just as many backup options are possible. 
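<p>To make those options concrete, here is a minimal sketch of what a ReaR configuration could look like when selecting the ISO rescue image and a gzip-compressed archive. The server name and paths below are illustrative placeholders, not values from this deployment; the actual configuration used against the Undercloud appears further down in this post.</p>

```shell
# Illustrative /etc/rear/local.conf sketch ("backupserver" and paths are placeholders).
OUTPUT=ISO                                  # rescue image format; USB or PXE are also possible
OUTPUT_URL=nfs://backupserver/export/rear   # where the rescue image is pushed
BACKUP=NETFS                                # ReaR's built-in file-level backup method
BACKUP_URL=nfs://backupserver/export/rear   # where the backup archive is stored
BACKUP_PROG_COMPRESS_OPTIONS=( --gzip )     # produce a gzip-compressed tar archive
BACKUP_PROG_EXCLUDE=( '/tmp/*' )            # paths to leave out of the backup
```

<p>Swapping OUTPUT and the two URLs is enough to move between ISO, USB or PXE targets without touching the rest of the workflow.</p>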
Starting with a simple archive file (e.g. *.tar.gz), various backup technologies such as IBM Tivoli Storage Manager (TSM), EMC NetWorker (Legato), Bacula or even Bareos can be addressed.</p> <p>ReaR, written in Bash, enables the skilful distribution of the rescue image and, if necessary, the archive file via NFS, CIFS (SMB) or another transport method on the network. The actual recovery process then takes place via this transport route.</p> <p>In this specific case, due to the nature of the OpenStack deployment, we will choose those protocols that are allowed by default in the iptables rules (SSH and SFTP in particular).</p> <p>But enough with the theory; here’s a practical example of one of many possible configurations. We will apply this specific use of ReaR to recover a failed control plane after a critical maintenance task (like an upgrade).</p> <p><strong>01 - Prepare the Undercloud backup bucket.</strong></p> <p>We need to prepare the place to store the backups from the Overcloud. From the Undercloud, check that you have enough space to make the backups and prepare the environment. 
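<p>A quick way to do that space check before touching anything is sketched below. It runs unprivileged against a temporary directory, and the 1 GiB threshold is a placeholder assumption; on the real Undercloud the target would be /data/backup owned by a dedicated backup user, and the threshold should be sized to your control plane backups.</p>

```shell
# Sketch: verify that a backup drop area exists and has enough free space.
# Temporary directory and threshold are placeholders for the real /data/backup.
BACKUP_ROOT="$(mktemp -d)"
mkdir -p "$BACKUP_ROOT/backup"
chmod 755 "$BACKUP_ROOT" "$BACKUP_ROOT/backup"

REQUIRED_KB=1048576  # placeholder: ~1 GiB
# POSIX df output: available kilobytes are in column 4 of the data row.
AVAIL_KB="$(df -Pk "$BACKUP_ROOT" | awk 'NR==2 {print $4}')"

if [ "$AVAIL_KB" -ge "$REQUIRED_KB" ]; then
    echo "OK: ${AVAIL_KB} kB available under ${BACKUP_ROOT}"
else
    echo "WARNING: need at least ${REQUIRED_KB} kB free under ${BACKUP_ROOT}" >&2
fi
```

<p>The same check can be pointed at /data once the backup user and directory from the next step exist.</p>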
We will also create a user in the Undercloud with no shell access to be able to push the backups from the controllers or the compute nodes.</p> <pre><code class="language-bash">groupadd backup
mkdir /data
useradd -m -g backup -d /data/backup backup
echo "backup:backup" | chpasswd
chown -R backup:backup /data
chmod -R 755 /data
</code></pre> <p><strong>02 - Run the backup from the Overcloud nodes.</strong></p> <p>Let’s install some required packages and run some preliminary configuration steps.</p> <pre><code class="language-bash"># Install packages
sudo yum install rear genisoimage syslinux lftp wget -y

# Make sure you are able to use sshfs to store the ReaR backup
sudo yum install fuse -y
sudo yum groupinstall "Development tools" -y
wget http://download-ib01.fedoraproject.org/pub/epel/7/x86_64/Packages/f/fuse-sshfs-2.10-1.el7.x86_64.rpm
sudo rpm -i fuse-sshfs-2.10-1.el7.x86_64.rpm
sudo mkdir -p /data/backup
sudo sshfs -o allow_other backup@undercloud-0:/data/backup /data/backup
# Use the backup password, which is...
backup </code></pre> <p>Now, let’s configure the ReaR config file.</p> <pre><code class="language-bash"># Configure ReaR
sudo tee -a "/etc/rear/local.conf" &gt; /dev/null &lt;&lt;'EOF'
OUTPUT=ISO
OUTPUT_URL=sftp://backup:backup@undercloud-0/data/backup/
BACKUP=NETFS
BACKUP_URL=sshfs://backup@undercloud-0/data/backup/
BACKUP_PROG_COMPRESS_OPTIONS=( --gzip )
BACKUP_PROG_COMPRESS_SUFFIX=".gz"
BACKUP_PROG_EXCLUDE=( '/tmp/*' '/data/*' )
EOF
</code></pre> <p>Now run the backup; this should create an ISO image on the Undercloud node (/data/backup/).</p> <p><strong>You will be asked for the backup user password.</strong></p> <pre><code class="language-bash">sudo rear -d -v mkbackup
</code></pre> <p>Now, simulate a failure xD</p> <pre><code># sudo rm -rf /lib
</code></pre> <p>After the ISO image is created, we can proceed to verify that we can restore it from the Hypervisor.</p> <p><strong>03 - Prepare the hypervisor.</strong></p> <pre><code class="language-bash"># Enable the use of fusefs for the VMs on the hypervisor
setsebool -P virt_use_fusefs 1

# Install some required packages
sudo yum install -y fuse-sshfs

# Mount the Undercloud backup folder to access the images
mkdir -p /data/backup
sudo sshfs -o allow_other root@undercloud-0:/data/backup /data/backup
ls /data/backup/*
</code></pre> <p><strong>04 - Stop the damaged controller node.</strong></p> <pre><code class="language-bash">virsh shutdown controller-0
# virsh destroy controller-0

# Wait until it is down
watch virsh list --all

# Back up the guest definition
virsh dumpxml controller-0 &gt; controller-0.xml
cp controller-0.xml controller-0.xml.bak
</code></pre> <p>Now, we need to change the guest definition to boot from the ISO file.</p> <p>Edit controller-0.xml and update it to boot from the ISO file.</p> <p>Find the OS section, add the cdrom device, and enable the boot menu.</p> <pre><code>&lt;os&gt;
  &lt;boot dev='cdrom'/&gt;
  &lt;boot dev='hd'/&gt;
  &lt;bootmenu enable='yes'/&gt;
&lt;/os&gt;
</code></pre> <p>Edit the devices
section and add the CDROM.</p> <pre><code>&lt;disk type='file' device='cdrom'&gt;
  &lt;driver name='qemu' type='raw'/&gt;
  &lt;source file='/data/backup/rear-controller-0.iso'/&gt;
  &lt;target dev='hdc' bus='ide'/&gt;
  &lt;readonly/&gt;
  &lt;address type='drive' controller='0' bus='1' target='0' unit='0'/&gt;
&lt;/disk&gt;
</code></pre> <p>Update the guest definition.</p> <pre><code class="language-bash">virsh define controller-0.xml
</code></pre> <p>Restart and connect to the guest.</p> <pre><code class="language-bash">virsh start controller-0
virsh console controller-0
</code></pre> <p>You should be able to see the boot menu to start the recovery process; select Recover controller-0 and follow the instructions.</p> <p><img src="/static/ReAR1.PNG" alt="" /></p> <p>Now, before proceeding with the controller restore, note that it’s possible the host undercloud-0 can’t be resolved; just run:</p> <pre><code class="language-bash">echo "192.168.24.1 undercloud-0" &gt;&gt; /etc/hosts
</code></pre> <p>Having resolved the Undercloud host, just follow the wizard. Relax And Recover :)</p> <p>You should see a message like:</p> <pre><code>Welcome to Relax-and-Recover. Run "rear recover" to restore your system !

RESCUE controller-0:~ # rear recover
</code></pre> <p><img src="/static/ReAR2.PNG" alt="" /></p> <p>The image restore should progress quickly.</p> <p><img src="/static/ReAR3.PNG" alt="" /></p> <p>Continue to watch the restore progress.</p> <p><img src="/static/ReAR4.PNG" alt="" /></p> <p>Now, each time you reboot, the node will have the ISO file as the first boot option, so that’s something we need to fix. In the meantime, let’s check if the restore went fine.</p> <p>Reboot the guest, booting from the hard disk. <img src="/static/ReAR5.PNG" alt="" /></p> <p>Now we can see that the guest VM started successfully.
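</p> <p>As a side note, the temporary cdrom device could also be dropped without hand-editing XML; a hedged sketch, assuming the <code>virt-xml</code> tool from the virt-install package is available on the Hypervisor:</p>

```shell
# Sketch: remove the temporary cdrom (target hdc in the device XML above)
# so the guest boots from disk again. Guarded so it is a no-op on hosts
# without libvirt tooling installed.
if command -v virt-xml >/dev/null 2>&1; then
    virt-xml controller-0 --remove-device --disk target=hdc
else
    echo "virt-xml not available; restore controller-0.xml.bak instead"
fi
```

<p>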
<img src="/static/ReAR6.PNG" alt="" /></p> <p>Now we need to restore the guest to its original definition, so from the Hypervisor we need to restore the <code>controller-0.xml.bak</code> file we created.</p> <pre><code class="language-bash"># From the Hypervisor
virsh shutdown controller-0
watch virsh list --all
virsh define controller-0.xml.bak
virsh start controller-0
</code></pre> <p>Enjoy.</p> <h2 id="considerations">Considerations:</h2> <ul> <li>Space.</li> <li>Multiple protocols are supported, but we might then need to update firewall rules; that’s why I preferred SFTP.</li> <li>Network load when moving data.</li> <li>Shutdown/startup sequence for the HA control plane.</li> <li>Do we need to back up the data plane?</li> <li>User workloads should be handled by third-party backup software.</li> </ul> <h2 id="update-log">Update log:</h2> <div style="font-size:10px"> <blockquote> <p><strong>2019/05/20:</strong> Initial version.</p> <p><strong>2019/06/18:</strong> Appeared in <a href="https://superuser.openstack.org/articles/tutorial-rear-openstack-deployment/">OpenStack Superuser blog.</a></p> </blockquote> </div> A Linux or Unix sysadmin in his natural habitat A Linux sysadmin in his natural habitat. Explanation: No comments. Disclaimer. 2019-02-17T00:00:00+00:00 https://www.pubstack.com/blog/2019/02/17/pod Carlos Camacho <p>A Linux sysadmin in his natural habitat.</p> <p><img src="/static/pod/2019-02-17-sysadmin.jpg" alt="" /></p> <p>Explanation: No comments. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> Bye bye old theme Let’s say bye bye and thanks to the old theme I used for the last 3 years. I created this blog, Pubstack, just after joining Red Hat, with the purpose of logging part of my upstream work within the...
2019-02-16T00:00:00+00:00 https://www.pubstack.com/blog/2019/02/16/bye-bye-old-theme Carlos Camacho <p>Let’s say bye bye and thanks to the old theme I used for the last 3 years.</p> <p>I created this blog, <a href="https://www.pubstack.com">Pubstack</a>, just after joining Red Hat, with the purpose of logging part of my upstream work within the <a href="https://www.tripleo.org">TripleO</a> community.</p> <p>It evolved in such a way that I’m currently adding other types of posts, like management, hobbies, and software development posts, other cloud computing technologies, talks, among many other things.</p> <p>I hit a milestone: after 70 posts, the old theme did not scale correctly anymore.</p> <p>That’s why I welcome the new theme and thank the old one for holding up over the last years.</p> <p><img src="/static/pubstack-v1.png" alt="" /></p> <p>There are a few things to polish in the new site, like adding my CV and about page. Otherwise, shiiiiip it!!!</p> <p>Thank you!!!!</p> TripleO - Deployment configurations This post is a summary of the deployments I usually test for deploying TripleO using quickstart. The following steps need to run in the Hypervisor node in order to deploy both the Undercloud and the Overcloud. You need to execute...
2019-02-05T00:00:00+00:00 https://www.pubstack.com/blog/2019/02/05/tripleo-quickstart-deployments Carlos Camacho <p>This post is a summary of the deployments I usually test for deploying TripleO using quickstart.</p> <p><img src="/static/dude-just-deploy-it-already.jpg" alt="" /></p> <p>The following steps need to run on the Hypervisor node in order to deploy both the Undercloud and the Overcloud.</p> <p>You need to execute them one after the other; the idea of this recipe is to have something just for copying/pasting.</p> <p>Once the last step ends you should be able to connect to the Undercloud VM to start operating your Overcloud deployment.</p> <p>The usual steps are:</p> <p><strong>01 - Prepare the hypervisor node.</strong></p> <p>Now, let’s install some dependencies. Same Hypervisor node, same <code>root</code> user.</p> <pre><code class="language-bash"># In this dev. env. /var is only 50GB, so I will create
# a sym link to another location with more capacity.
# It will easily take more than 50GB deploying a 3+1 overcloud
sudo mkdir -p /home/libvirt/
sudo ln -sf /home/libvirt/ /var/lib/libvirt

# Disable IPv6 lookups
# sudo bash -c "cat &gt;&gt; /etc/sysctl.conf" &lt;&lt; EOL
# net.ipv6.conf.all.disable_ipv6 = 1
# net.ipv6.conf.default.disable_ipv6 = 1
# EOL
# sudo sysctl -p

# Enable IPv6 in kernel cmdline
# sed -i s/ipv6.disable=1/ipv6.disable=0/ /etc/default/grub
# grub2-mkconfig -o /boot/grub2/grub.cfg
# reboot

sudo yum groupinstall "Virtualization Host" -y
sudo yum install git lvm2 lvm2-devel -y
sudo yum install libvirt-python python-lxml libvirt -y
</code></pre> <p><strong>02 - Create the toor user (from the Hypervisor node, as root).</strong></p> <pre><code class="language-bash">sudo useradd toor
echo "toor:toor" | sudo chpasswd
echo "toor ALL=(root) NOPASSWD:ALL" \
  | sudo tee /etc/sudoers.d/toor
sudo chmod 0440 /etc/sudoers.d/toor
sudo su - toor
cd
mkdir .ssh
ssh-keygen -t rsa -N "" -f .ssh/id_rsa
cat .ssh/id_rsa.pub &gt;&gt; .ssh/authorized_keys
cat .ssh/id_rsa.pub | sudo tee -a /root/.ssh/authorized_keys
echo '127.0.0.1 127.0.0.2' | sudo tee -a /etc/hosts
export VIRTHOST=127.0.0.2
ssh root@$VIRTHOST uname -a
</code></pre> <p>Now, continue as the <code>toor</code> user and prepare the Hypervisor node for the deployment.</p> <p><strong>03 - Clone repos and install deps.</strong></p> <pre><code class="language-bash">git clone \
  https://github.com/openstack/tripleo-quickstart
chmod u+x ./tripleo-quickstart/quickstart.sh
bash ./tripleo-quickstart/quickstart.sh \
  --install-deps
sudo setenforce 0
</code></pre> <p>Export some variables used in the deployment command.</p> <p><strong>04 - Export common variables.</strong></p> <pre><code class="language-bash">export CONFIG=~/deploy-config.yaml
export VIRTHOST=127.0.0.2
</code></pre> <p>Now we will create the configuration file used for the deployment; depending on the file you choose, you will deploy different environments.</p> <p><strong>05 - Click on the environment description to expand the recipe.</strong></p> <details> <summary><strong>OpenStack [Containerized &amp; HA] - 1 Controller, 1 Compute</strong></summary> <pre><code class="language-bash">cat &gt; $CONFIG &lt;&lt; EOF
overcloud_nodes:
  - name: control_0
    flavor: control
    virtualbmc_port: 6230
  - name: compute_0
    flavor: compute
    virtualbmc_port: 6231
node_count: 2
containerized_overcloud: true
delete_docker_cache: true
enable_pacemaker: true
run_tempest: false
extra_args: &gt;-
  --libvirt-type qemu
  --ntp-server pool.ntp.org
  -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml
EOF
</code></pre> </details> <details> <summary><strong>OpenStack [Containerized &amp; HA] - 3 Controllers, 1 Compute</strong></summary> <pre><code class="language-bash">cat &gt; $CONFIG &lt;&lt; EOF
overcloud_nodes:
  - name: control_0
    flavor: control
    virtualbmc_port: 6230
  - name: control_1
    flavor: control
    virtualbmc_port: 6231
  - name: control_2
    flavor: control
    virtualbmc_port: 6232
  - name: compute_1
    flavor: compute
    virtualbmc_port: 6233
node_count: 4
containerized_overcloud: true
delete_docker_cache: true
enable_pacemaker: true
run_tempest: false
extra_args: &gt;-
  --libvirt-type qemu
  --ntp-server pool.ntp.org
  --control-scale 3
  --compute-scale 1
  -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml
EOF
</code></pre> </details> <details> <summary><strong>OpenShift [Containerized] - 1 Controller, 1 Compute</strong></summary> <pre><code class="language-bash">cat &gt; $CONFIG &lt;&lt; EOF
# Original from https://github.com/openstack/tripleo-quickstart/blob/master/config/general_config/featureset033.yml
composable_scenario: scenario009-multinode.yaml
deployed_server: true

network_isolation: false
enable_pacemaker: false
overcloud_ipv6: false
containerized_undercloud: true
containerized_overcloud: true

# This enables TLS for the undercloud which will also make haproxy bind to the
# configured public-vip and admin-vip.
undercloud_generate_service_certificate: false
undercloud_enable_validations: false

# This enables the deployment of the overcloud with SSL.
ssl_overcloud: false

# Centos Virt-SIG repo for atomic package
add_repos:
  # NOTE(trown) The atomic package from centos-extras does not work for
  # us but its version is higher than the one from the virt-sig. Hence,
  # using priorities to ensure we get the virt-sig package.
  - type: package
    pkg_name: yum-plugin-priorities
  - type: generic
    reponame: quickstart-centos-paas
    filename: quickstart-centos-paas.repo
    baseurl: https://cbs.centos.org/repos/paas7-openshift-origin311-candidate/x86_64/os/
  - type: generic
    reponame: quickstart-centos-virt-container
    filename: quickstart-centos-virt-container.repo
    baseurl: https://cbs.centos.org/repos/virt7-container-common-candidate/x86_64/os/
    includepkgs:
      - atomic
    priority: 1

extra_args: ''

container_args: &gt;-
  # If Pike or Queens
  #-e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml
  # If Ocata, Pike, Queens or Rocky
  #-e /home/stack/containers-default-parameters.yaml
  # If &gt;= Stein
  -e /home/stack/containers-prepare-parameter.yaml
  -e /usr/share/openstack-tripleo-heat-templates/openshift.yaml

# NOTE(mandre) use container images mirrored on the dockerhub to take advantage
# of the proxy setup by openstack infra
docker_openshift_etcd_namespace: docker.io/
docker_openshift_cluster_monitoring_namespace: docker.io/tripleomaster
docker_openshift_cluster_monitoring_image: coreos-cluster-monitoring-operator
docker_openshift_configmap_reload_namespace: docker.io/tripleomaster
docker_openshift_configmap_reload_image: coreos-configmap-reload
docker_openshift_prometheus_operator_namespace: docker.io/tripleomaster
docker_openshift_prometheus_operator_image: coreos-prometheus-operator
docker_openshift_prometheus_config_reload_namespace: docker.io/tripleomaster
docker_openshift_prometheus_config_reload_image: coreos-prometheus-config-reloader
docker_openshift_kube_rbac_proxy_namespace: docker.io/tripleomaster
docker_openshift_kube_rbac_proxy_image: coreos-kube-rbac-proxy
docker_openshift_kube_state_metrics_namespace: docker.io/tripleomaster
docker_openshift_kube_state_metrics_image: coreos-kube-state-metrics

deploy_steps_ansible_workflow: true
config_download_args: &gt;-
  -e /home/stack/config-download.yaml
  --disable-validations
  --verbose
composable_roles: true

overcloud_roles:
  - name: Controller
    CountDefault: 1
    tags:
      - primary
      - controller
    networks:
      - External
      - InternalApi
      - Storage
      - StorageMgmt
      - Tenant
  - name: Compute
    CountDefault: 0
    tags:
      - compute
    networks:
      - External
      - InternalApi
      - Storage
      - StorageMgmt
      - Tenant

tempest_config: false
test_ping: false
run_tempest: false
EOF
</code></pre> </details> <p><br /></p> <p>From the Hypervisor, as the <code>toor</code> user, run the deployment command to deploy both your Undercloud and Overcloud.</p> <p><strong>06 - Deploy TripleO.</strong></p> <pre><code class="language-bash">bash ./tripleo-quickstart/quickstart.sh \
  --clean \
  --release master \
  --teardown all \
  --tags all \
  -e @$CONFIG \
  $VIRTHOST
</code></pre> <div style="font-size:10px"> <blockquote> <p><strong>Updated 2019/02/05:</strong> Initial version.</p> <p><strong>Updated 2019/02/05:</strong> TODO: Test the OpenShift deployment.</p> <p><strong>Updated 2019/02/06:</strong> Added some clarifications about where the commands should run.</p> </blockquote> </div> Remote work management, the never ending story... This will be a quick summary of my last 3 years’ experience working remotely for one of the best and biggest companies investing in and developing Open Source software ever. Also, my idea is to keep this post updated with my latest... 2019-01-23T00:00:00+00:00 https://www.pubstack.com/blog/2019/01/23/remote-work-management Carlos Camacho <p>This will be a quick summary of my last 3 years’ experience working remotely for one of the best and biggest companies investing in and developing Open Source software ever. Also, my idea is to keep this post updated with my latest experiences and tips for making remote work the best possible experience.</p> <p><img src="/static/remote_01_naked.jpg" alt="" /></p> <p>The first feeling that comes to my mind is a picture of a well-known TV series, “Naked and Afraid”.
Yes, you will end up naked in the jungle waiting for guidance, and it’s not your fault if this help never comes.</p> <p>Working remotely gives you several nice and not-so-nice experiences while performing your functions in your awesome new remote job.</p> <h1 id="good">Good:</h1> <ul> <li>Working hours flexibility.</li> <li>Work-life balance.</li> <li>Saving commuting time.</li> <li>Cooking your own healthy food.</li> <li>Geographically distributed teams allow companies to get the best talent across the globe.</li> <li>And many more …</li> </ul> <p>There are also several benefits for the company you work for, so it’s not a benefit only for you: in most cases the company will optimize how many resources it needs to scale its workforce, saving on offices, electricity, heating, Internet access, food, furniture, and many other things…</p> <p>But this is not about the good things; this post is about identifying those corner cases in which working remotely might become a daunting task. And after identifying those cases, we should be able to provide some countermeasures to avoid sad, frustrated, and burnt-out employees (at least for me they worked quite well).</p> <h1 id="not-so-good">Not so good:</h1> <h2 id="operation-costs-are-translated-to-the-employee">Operation costs are translated to the employee</h2> <p><img src="/static/remote_03_time.png" alt="" /></p> <p>Yes, in some cases you will have to pay for all the expenses generated from working from home, like a good chair, Internet access, electricity, gas, among others.
But this is not generally a win-lose deal; it depends, because you will also save on fuel, food, and, most importantly, <strong>TIME</strong>.</p> <h2 id="the-slow-start">The slow start</h2> <p><img src="/static/remote_03_onboard.jpg" alt="" /></p> <p>Starting to work remotely is hard, so first things first; IMHO you need to do a few things:</p> <ul> <li>Get the context of your role in the team.</li> <li>Understand your team’s goals, functions, and responsibilities.</li> <li>Get yourself a development environment as soon as possible.</li> </ul> <h2 id="hard-to-effectively-communicate-with-all-your-colleagues">Hard to effectively communicate with ALL your colleagues</h2> <p><img src="/static/remote_04_com.png" alt="" /></p> <p>This might create a sort of clique feeling inside the team, where some people communicate with each other and others don’t, which is bad because those left out might feel they are not part of the team’s mission.</p> <p>There are several solutions to this:</p> <ul> <li>Have a common, real-time communication channel, like IRC, Hangouts, Slack, or any other technology you want to use.</li> <li>Make all the people in the team speak daily about “What I’m doing and what is blocking me from achieving my tasks”. No more than 2 minutes per person in a daily stand-up call.</li> <li>Avoid solo tasks; try to get at least 2 people per task, and even if it can be decomposed into different sub-tasks, make your team work together. Using tools like tmate for peer-programming sessions is also a good idea.</li> </ul> <h2 id="senior-roles-usually-feel-they-need-to-pay-back-the-freedom-with-more-hours">Senior roles usually feel they need to pay back the freedom with more hours</h2> <p><img src="/static/remote_04_overwork.jpg" alt="" /></p> <p>The freedom of remote work might be wrongly interpreted by some people: yes, you have the freedom to do laundry, take your kids to school, or write this post.
But there is no need to do excessive extra hours; if you are able to measure the value you generate each day, you won’t feel the need to burn yourself out with extra hours.</p> <p>If you don’t have clear tasks or goals you might end up with this question, <strong>Am I doing too much? Am I working less than I should?</strong> and there you have the error. If you measure the value you are generating, you won’t have this question, period.</p> <p>Measure everything! Measuring as much as you can gives you a better overview of your current, day-to-day performance; i.e. something really simple might be a Google Docs script connecting to the apps you usually use to generate a report, giving you a better view of your own personal performance.</p> <hr /> <p>PRODUCTIVITY IS NOT ABOUT TIME, PRODUCTIVITY IS ABOUT GENERATING ADDED VALUE (work smart!).</p> <hr /> <blockquote> <p>In the meantime, if you have covered all the tasks committed for the sprint, you should not feel bad about your performance; but what if you haven’t defined those sprint tasks?</p> </blockquote> <p>So, the next item speaks about productivity, innovating at your workplace, and doing some planning kung-fu.</p> <h2 id="productivity-and-innovation-vs-planned-work">Productivity and innovation vs Planned work</h2> <p>There are people who perform their work incredibly well and just don’t like to plan the day-to-day tasks they need to achieve, mostly because they have all the bits in their minds.
But this is a team, and we are only as good as the lowest-performing component in the chain.</p> <p><img src="/static/remote_02_improve.png" alt="" /></p> <p>There are several solutions to this:</p> <ul> <li>Agree on a way of defining the tasks that need to be achieved in the sprint; try not to use time as a measure for finishing tasks, and instead try something more subjective, like value or difficulty.</li> <li>Know what to do at any moment; sometimes not knowing will force you to invest much more time than you actually need to.</li> <li>Keep all the knowledge in a single source of truth; yes, this is painful when you don’t have it. Trello, Taiga, GitHub issues, Google Docs, Google Spreadsheets, Launchpad, Bugzilla, Storyboard, and so many others… this is the toolset I usually use in my day-to-day work. It has since evolved to Jira with some plugins to keep and maintain a single source of truth.</li> </ul> <p>The catch: sometimes overcommitment leaves you unable to innovate in your role, so keep a little bit of time for improving your product and yourself.</p> <div style="font-size:10px"> <blockquote> <p><strong>Updated 2019/01/23:</strong> First version</p> </blockquote> </div> Vote for the OpenStack Berlin Summit presentations! I pushed some presentations for this year’s OpenStack Summit in Berlin; the presentations are related to updates, upgrades, backups, failures, and restores. ¡¡¡Please vote!!! TripleO presentation for Updates and Upgrades TripleO presentation for Backups and Restores Happy TripleOing!
2018-07-24T00:00:00+00:00 https://www.pubstack.com/blog/2018/07/24/openstack-berlin-summit-vote-for-presentations Carlos Camacho <p>I pushed some presentations for this year’s OpenStack Summit in Berlin; the presentations are related to updates, upgrades, backups, failures, and restores.</p> <p><img src="/static/OpenStack-Summit-2018.png" alt="" /></p> <h2 id="please-vote">¡¡¡Please vote!!!</h2> <ul> <li><a href="https://www.openstack.org/summit/berlin-2018/vote-for-speakers/#/21961">TripleO presentation for Updates and Upgrades</a></li> <li><a href="https://www.openstack.org/summit/berlin-2018/vote-for-speakers/#/22101">TripleO presentation for Backups and Restores</a></li> </ul> <p>Happy TripleOing!</p> TripleO deep dive session #13 (Containerized Undercloud) This is the 13th release of the TripleO “Deep Dive” sessions. Thanks to Dan Prince & Emilien Macchi for this deep dive session about the next step of TripleO’s Undercloud evolution. In this session, they will explain in detail...
2018-05-31T00:00:00+00:00 https://www.pubstack.com/blog/2018/05/31/tripleo-deep-dive-session-13 Carlos Camacho <p>This is the 13th release of the <a href="http://www.tripleo.org/">TripleO</a> “Deep Dive” sessions.</p> <p>Thanks to <a href="https://dprince.github.io/">Dan Prince</a> &amp; <a href="http://my1.fr/blog">Emilien Macchi</a> for this deep dive session about the next step of TripleO’s Undercloud evolution.</p> <p>In this session, they explain in detail the effort to re-architect the Undercloud towards containers in order to reuse the containerized Overcloud ecosystem.</p> <p>You can access the <a href="https://docs.google.com/presentation/d/17Sbo0i0o2AhQBSjYH7eUrXKVHJcItklGZUUleFiC0ZU/">presentation</a> or the <a href="https://etherpad.openstack.org/p/tripleo-deep-dive-containerized-undercloud">Etherpad</a> notes.</p> <p>So please, check the full <a href="https://www.youtube.com/watch?v=lv233gPynwk">session</a> content on the <a href="https://www.youtube.com/channel/UCNGDxZGwUELpgaBoLvABsTA/">TripleO YouTube channel</a>.</p> <div class="center"> <iframe width="560" height="315" src="https://www.youtube.com/embed/lv233gPynwk" frameborder="0" allowfullscreen=""></iframe> </div> <p><br /> <br /></p> <p>Please check the <a href="http://www.pubstack.com/blog/2017/06/15/tripleo-deep-dive-session-index.html">sessions index</a> to have access to all available content.</p> Testing Undercloud backup and restore using Ansible This post introduces how to run backups and restores using Ansible in TripleO. Testing the Undercloud backup and restore It is possible to test how the Undercloud backup and restore should be performed using Ansible. The following...
2018-05-18T00:00:00+00:00 https://www.pubstack.com/blog/2018/05/18/testing-undercloud-backup-and-restore-using-ansible Carlos Camacho <p>This post introduces how to run backups and restores using Ansible in TripleO.</p> <h1 id="testing-the-undercloud-backup-and-restore">Testing the Undercloud backup and restore</h1> <p>It is possible to test how the Undercloud backup and restore should be performed using Ansible.</p> <p>The following Ansible playbooks show how Ansible can be used to test the backup execution in a test environment.</p> <h2 id="creating-the-ansible-playbooks-to-run-the-tasks">Creating the Ansible playbooks to run the tasks</h2> <p>Create a yaml file called uc-backup.yaml with the following content:</p> <pre><code>---
- hosts: localhost
  tasks:
    - name: Remove any previously created UC backups
      shell: |
        source ~/stackrc
        openstack container delete undercloud-backups --recursive
      ignore_errors: True
    - name: Create UC backup
      shell: |
        source ~/stackrc
        openstack undercloud backup --add-path /etc/ --add-path /root/
</code></pre> <p>Create a yaml file called uc-backup-download.yaml with the following content:</p> <pre><code>---
- hosts: localhost
  tasks:
    - name: Make sure the temp folder used for the restore does not exist
      become: true
      file:
        path: "/var/tmp/test_bk_down"
        state: absent
    - name: Create temp folder to unzip the backup
      become: true
      file:
        path: "/var/tmp/test_bk_down"
        state: directory
        owner: "stack"
        group: "stack"
        mode: "0775"
        recurse: "yes"
    - name: Download the UC backup to a temporary folder (After breaking the UC we won't be able to get it back)
      shell: |
        source ~/stackrc
        cd /var/tmp/test_bk_down
        openstack container save undercloud-backups
    - name: Unzip the backup
      become: true
      shell: |
        cd /var/tmp/test_bk_down
        tar -xvf UC-backup-*.tar
        gunzip *.gz
        tar -xvf filesystem-*.tar
    - name: Make sure stack user can get the backup files
      become: true
      file:
        path: "/var/tmp/test_bk_down"
        state: directory
        owner: "stack"
        group: "stack"
        mode: "0775"
        recurse: "yes"
</code></pre> <p>Create a yaml file called uc-destroy.yaml with the following content:</p> <pre><code>---
- hosts: localhost
  tasks:
    - name: Remove mariadb
      become: true
      yum: pkg="{{ item }}" state=absent
      with_items:
        - mariadb
        - mariadb-server
    - name: Remove files
      become: true
      file: path="{{ item }}" state=absent
      with_items:
        - /root/.my.cnf
        - /var/lib/mysql
</code></pre> <p>Create a yaml file called uc-restore.yaml with the following content:</p> <pre><code>---
- hosts: localhost
  tasks:
    - name: Install mariadb
      become: true
      yum: pkg="{{ item }}" state=present
      with_items:
        - mariadb
        - mariadb-server
    - name: Restart MariaDB
      become: true
      service: name=mariadb state=restarted
    - name: Restore the backup DB
      shell: cat /var/tmp/test_bk_down/all-databases-*.sql | sudo mysql
    - name: Restart MariaDB to refresh permissions
      become: true
      service: name=mariadb state=restarted
    - name: Register root password
      become: true
      shell: cat /var/tmp/test_bk_down/root/.my.cnf | grep -m1 password | cut -d'=' -f2 | tr -d "'"
      register: oldpass
    - name: Clean root password from MariaDB to reinstall the UC
      shell: |
        mysqladmin -u root -p"{{ oldpass.stdout }}" password ''
    - name: Clean users
      become: true
      mysql_user: name="{{ item }}" host_all="yes" state="absent"
      with_items:
        - ceilometer
        - glance
        - heat
        - ironic
        - keystone
        - neutron
        - nova
        - mistral
        - zaqar
    - name: Reinstall the undercloud
      shell: |
        openstack undercloud install
</code></pre> <h2 id="running-the-undercloud-backup-and-restore-tasks">Running the Undercloud backup and restore tasks</h2> <p>To test the UC backup and restore procedure, run from the UC after creating the Ansible playbooks:</p> <pre><code># This playbook will create the UC backup
ansible-playbook uc-backup.yaml

# This playbook will download the UC backup to be used in the restore
ansible-playbook uc-backup-download.yaml

# This playbook will destroy the UC (remove DB server, remove DB files, remove config files)
ansible-playbook uc-destroy.yaml

# This playbook will reinstall the DB server, restore the DB backup, fix permissions and reinstall the UC
ansible-playbook uc-restore.yaml
</code></pre> <h2 id="checking-the-undercloud-state">Checking the Undercloud state</h2> <p>After finishing the Undercloud restore playbook, the user should again be able to execute any CLI command, like:</p> <pre><code>source ~/stackrc
openstack stack list
</code></pre> <p>Source code is available on <a href="https://github.com/ccamacho/tripleo-ansible/tree/master/undercloud-backup-restore-check">GitHub</a>.</p> Install tmate.io and share your terminal session This is an easy solution for sharing terminal sessions over SSH. Tmate.io is a great terminal-sharing app; you can think of it as TeamViewer for SSH. To avoid compiling issues and dependencies, we will get the static build directly from... 2018-03-13T00:00:00+00:00 https://www.pubstack.com/blog/2018/03/13/installing-and-using-tmate Carlos Camacho <p>This is an easy solution for sharing terminal sessions over SSH. <a href="https://tmate.io">Tmate.io</a> is a great terminal-sharing app; you can think of it as TeamViewer for SSH.</p> <p><img src="/static/tmate.jpg" alt="" /></p> <p>To avoid compiling issues and dependencies, we will get the static build directly from GitHub to <code>automagically</code> use it.</p> <pre><code class="language-bash"># Get files and install
wget https://github.com/tmate-io/tmate/releases/download/2.2.1/tmate-2.2.1-static-linux-amd64.tar.gz
tar -xvzf tmate-2.2.1-static-linux-amd64.tar.gz
sudo mv ./tmate-2.2.1-static-linux-amd64/tmate /usr/bin/
sudo chmod +x /usr/bin/tmate
rm -rf tmate-2.2.1-static-linux-amd64*
# echo "export TERM=xterm" &gt;&gt; .bashrc

# Configure tmate using ln2 as the default server
sudo tee -a ~/.tmate.conf &gt; /dev/null &lt;&lt;'EOF'
set -g tmate-server-host "ln2.tmate.io"
EOF
</code></pre> <p>And that is it, enjoy.</p> <p>Use tmate and share the link…</p> <h3 id="running-tmate-as-a-daemon">Running tmate as a daemon</h3> <p>You can run tmate detached and retrieve the SSH connection strings as follows:</p>
<pre><code class="language-bash">tmate -S /tmp/tmate.sock new-session -d # Launch tmate in a detached state tmate -S /tmp/tmate.sock wait tmate-ready # Blocks until the SSH connection is established tmate -S /tmp/tmate.sock display -p '#{tmate_ssh}' # Prints the SSH connection string tmate -S /tmp/tmate.sock display -p '#{tmate_ssh_ro}' # Prints the read-only SSH connection string tmate -S /tmp/tmate.sock display -p '#{tmate_web}' # Prints the web connection string tmate -S /tmp/tmate.sock display -p '#{tmate_web_ro}' # Prints the read-only web connection string </code></pre> <p>Note that it is important to specify a socket (e.g. /tmp/tmate.sock, as above), as tmate uses a random socket name by default.</p> <p>You can think of tmate as a reverse SSH tunnel accessible from anywhere.</p> <p>Read more directly from <a href="https://tmate.io/">Tmate.io</a>.</p> My 2nd birthday as a Red Hatter This post is about my experience working on TripleO as a Red Hatter for the last 2 years. On my 2nd birthday as a Red Hatter, I have learned about many technologies, really a lot… But... 2018-03-01T00:00:00+00:00 https://www.pubstack.com/blog/2018/03/01/2nd-birthday-as-a-red-hatter Carlos Camacho <p>This post is about my experience working on TripleO as a Red Hatter for the last 2 years.</p> <div style="float: left; width: 230px; background: white;"><img width="230px" src="/static/bday.gif" alt="" style="border:15px solid #FFF" /></div> <p>On my 2nd birthday as a Red Hatter, I have learned about many technologies, really a lot… But the most intriguing thing is that here you never stop learning.
Not just because you want to learn new things, but because of the project’s nature, this project… TripleO…</p> <div style="float: right; width: 230px; background: white;"><img width="230px" src="/static/tripleo_logo.png" alt="" style="border:15px solid #FFF" /></div> <p>TripleO (OpenStack On OpenStack) is software aimed at deploying OpenStack services using the same OpenStack ecosystem; this means that we will deploy a minimal OpenStack instance (Undercloud) and from there, deploy our production environment (Overcloud)… Yikes! What a mouthful, huh? Put simply, TripleO is an installer which should make integrators’/operators’/developers’ lives easier, but the reality sometimes is far from the expectation.</p> <p>TripleO is capable of doing wonderful things; with a little patience, love, and dedication, your hands can be the right hands to deploy complex environments with ease.</p> <p>One of the cool things about being one of the programmers who write TripleO, from now on TripleOers, is that many of us also use the software regularly. We are writing code not just because we are told to do it, but because we want to improve it for our own purposes.</p> <p>Part of the programmers’ motivation momentum has to do with TripleO’s open‐source nature, so if you code in TripleO you are part of a community.</p> <div style="float: left; width: 230px; background: white;"><img width="230px" src="/static/community.gif" alt="" style="border:15px solid #FFF" /></div> <p>Congratulations! As a TripleO user or a TripleOer, you are a part of our community and it means that you’re joining a diverse group that spans all age ranges, ethnicities, professional backgrounds, and parts of the globe.
We are a passionate bunch of crazy people, proud of this “little” monster and more than willing to help others enjoy using it as much as we do.</p> <p>Getting to know the interface (the templates, Mistral, Heat, Ansible, Docker, Puppet, Jinja, …) and how all the components are tied together is probably one of the most daunting aspects of TripleO for newcomers (and not newcomers). This for sure will raise the blood pressure of some of you who tried using TripleO in the past, but failed miserably and gave up in frustration when it did not behave as expected. Yeah.. sometimes that “$h1t” happens…</p> <p>Although learning TripleO isn’t that easy, the architecture updates, the decoupling of the role services (“composable roles”), the backup and restore strategies, and the integration of Ansible, among many others, have made great strides toward alleviating that frustration, and the improvements continue through to today.</p> <p>So this is the question…</p> <p><img src="/static/fast_to.png" alt="" /></p> <p>Is TripleO meant to be “fast to use” or “fast to learn”?</p> <p>This is a significant way of describing software products, but we need to know what our software will be used for… TripleO is designed to work at scale; it might be easier to manually deploy a few controllers and computes, but what about deploying 100 computes, 3 controllers and 50 cinder nodes, all of them configured to be integrated and work as one single “cloud”? Buum! So there we find the TripleO benefits: if we want to make it scale, we need to make it fast to use…</p> <p>This means that we will find several customizations, hacks, and workarounds to make it work as we need it.</p> <p>The upside to this approach is that TripleO evolved to be super-ultra-giga customizable, so operators are enabled to produce great environments blazingly fast…</p> <p>The downside? Haha, yes… there is a downside (or several). As with most things that are customized, TripleO became somewhat difficult for new people to understand.
Also, it’s incredibly hard to test all the possible deployments, and when a user does non-standard or unsupported customizations, the upgrades are not as intuitive as they need to be…</p> <p>This trade‐off is what I mean when I say “fast to use versus fast to learn.” You can be extremely productive with TripleO after you understand how it thinks (yes, it thinks).</p> <p>However, your first few deployments and patches might be arduous. Of course, alleviating that potential pain is what our work is about. IMHO the pros outweigh the cons, and once you find a niche to improve it will be a really nice experience.</p> <p>Also, we have the TripleO YouTube channel, a place to push video tutorials and deep dive sessions driven by the community for the community.</p> <p>For the Spanish community we have a 100% translated TripleO UI; go to https://translate.openstack.org and help us to reach as many languages as possible!!!</p> <div style="float: left; width: 230px; background: white;"><img width="230px" src="/static/logo.png" alt="" style="border:15px solid #FFF" /></div> <p>www.pubstack.com was born on July 5th of 2016 (first GitHub commit); yeah, it is my way of expressing my gratitude to the community, doing some CtrlC + CtrlV recipes to avoid the frustration of working with TripleO and not having something deployed and easy to use ASAP.</p> <p>Pubstack does not have much traffic, but it reached superuser.openstack.org, and the TripleO cheatsheets were at devconf.cz and FOSDEM, so in general, it is really nice when people reference your writings anywhere. Maybe in the future it can evolve to be more related to ANsible and openSTACK ;) as TripleO is adding more and more support for Ansible.</p> <div style="float: right; width: 230px; background: white;"><img width="230px" src="/static/red_hat.png" alt="" style="border:15px solid #FFF" /></div> <p>What about Red Hat? Yeahp, I have spent a long time speaking about the project but haven’t spoken about the company making it all real.
Red Hat is the world’s leading provider of open source solutions, using a community-powered approach to provide reliable and high-performing cloud, virtualization, storage, Linux, and middleware technologies.</p> <p>There is a strong feeling of belonging at Red Hat: you are part of a team, a culture, and you are able to find a perfect balance between your work and life. Also, having people from all over the globe makes it a perfect place for sharing ideas and collaborating. Not all of it is good, e.g. working mostly remotely in upstream communities can be really hard to manage if you are not 100% sure about the tasks that need to be done.</p> <p>Keep rocking and become part of the TripleO community!</p> TripleO deep dive session #12 (config-download) This is the 12th release of the TripleO “Deep Dive” sessions Thanks to James Slagle for this new session, in which he will describe and speak about a feature called config-download. In this session we will have an update for... 2018-02-23T00:00:00+00:00 https://www.pubstack.com/blog/2018/02/23/tripleo-deep-dive-session-12 Carlos Camacho <p>This is the 12th release of the <a href="http://www.tripleo.org/">TripleO</a> “Deep Dive” sessions.</p> <p>Thanks to <a href="http://blog-slagle.rhcloud.com/">James Slagle</a> for this new session, in which he will describe and speak about a feature called <code>config-download</code>.</p> <p>In this session we will have an update for the TripleO ansible integration called <code>config-download</code>.
It’s about applying all the software configuration with Ansible instead of doing it with the Heat agents.</p> <p>So please, check the full <a href="https://www.youtube.com/watch?v=-6ojHT8P4RE">session</a> content on the <a href="https://www.youtube.com/channel/UCNGDxZGwUELpgaBoLvABsTA/">TripleO YouTube channel</a>.</p> <div class="center"> <iframe width="560" height="315" src="https://www.youtube.com/embed/-6ojHT8P4RE" frameborder="0" allowfullscreen=""></iframe> </div> <p><br /> <br /></p> <p>Please check the <a href="http://www.pubstack.com/blog/2017/06/15/tripleo-deep-dive-session-index.html">sessions index</a> to have access to all available content.</p> The Fender Stratocaster coffee table This post will briefly show a project I did a long time ago to build a Stratocaster coffee table. Here you have the pictures: IKEA countertop to make the table Source code for the table design This link has the... 2018-01-08T00:00:00+00:00 https://www.pubstack.com/blog/2018/01/08/stratocaster-coffee-table Carlos Camacho <p>This post will briefly show a project I did a long time ago to build a Stratocaster coffee table.</p> <p>Here you have the pictures:</p> <h1 id="ikea-countertop-to-make-the-table">IKEA countertop to make the table</h1> <p><img src="/static/strato-coffee-table/01-countertop-raw.jpg" alt="" /></p> <h1 id="source-code-for-the-table-design">Source code for the table design</h1> <p><a href="/static/strato-coffee-table/02-table-design.svg">This link</a> has the design I used to CNC the countertop.
It’s a simple SVG file that can be translated to .gcode without issues.</p> <h1 id="cnc-machine-view-1">CNC machine view 1</h1> <p><img src="/static/strato-coffee-table/03-cncing.jpeg" alt="" /></p> <h1 id="cnc-machine-view-2">CNC machine view 2</h1> <p><img src="/static/strato-coffee-table/04-cncing.jpeg" alt="" /></p> <h1 id="cnc-video-building-the-table">CNC video building the table</h1> <div class="center"> <video width="480" height="320" controls="controls"> <source src="/static/strato-coffee-table/05-cncing.mp4" type="video/mp4" /> </video> </div> <h1 id="machined-countertop">Machined countertop</h1> <p><img src="/static/strato-coffee-table/06-cnced.jpg" alt="" /></p> <h1 id="applying-some-epoxy-resin-to-fill-the-holes">Applying some epoxy resin to fill the holes</h1> <p><img src="/static/strato-coffee-table/07-epoxyed.jpg" alt="" /></p> <h1 id="table-after-first-round-of-sanding">Table after first round of sanding</h1> <p><img src="/static/strato-coffee-table/08-partially-sanded.jpeg" alt="" /></p> <h1 id="table-sanded">Table sanded</h1> <p><img src="/static/strato-coffee-table/09-partially-sanded.jpeg" alt="" /></p> <p>Hope you enjoyed reading this.</p> New TripleO quickstart cheatsheet I have created some cheatsheets for people starting to work on TripleO, mostly to help them to bootstrap a development environment as quickly as possible. The previous version of this cheatsheet series was used in several community conferences (FOSDEM, DevConf.cz),... 
2018-01-05T00:00:00+00:00 https://www.pubstack.com/blog/2018/01/05/tripleo-quickstart-cheatsheet Carlos Camacho <p>I have created some cheatsheets for people starting to work on TripleO, mostly to help them bootstrap a development environment as quickly as possible.</p> <p><a href="https://github.com/ccamacho/tripleo-graphics/tree/master/cheatsheets/old_style">The previous version</a> of this cheatsheet series was used in several community conferences (FOSDEM, DevConf.cz); now, they are deprecated, as the way TripleO should be deployed has changed considerably in the last months.</p> <p>Here you have the latest version:</p> <p><img src="/static/01-tripleo-cheatsheet-deploying-tripleo_p1.jpg" alt="" /></p> <p><img src="/static/01-tripleo-cheatsheet-deploying-tripleo_p2.jpg" alt="" /></p> <p>The source code of these bookmarks is available as usual on <a href="https://github.com/ccamacho/tripleo-graphics/tree/master/cheatsheets/latest_style">GitHub</a></p> <p>And this is the code if you want to execute it directly:</p> <pre><code># 01 - Create the toor user. sudo useradd toor echo "toor:toor" | sudo chpasswd echo "toor ALL=(root) NOPASSWD:ALL" \ | sudo tee /etc/sudoers.d/toor sudo chmod 0440 /etc/sudoers.d/toor su - toor # 02 - Prepare the hypervisor node. cd mkdir .ssh ssh-keygen -t rsa -N "" -f .ssh/id_rsa cat .ssh/id_rsa.pub &gt;&gt; .ssh/authorized_keys cat .ssh/id_rsa.pub | sudo tee -a /root/.ssh/authorized_keys echo '127.0.0.1 127.0.0.2' | sudo tee -a /etc/hosts export VIRTHOST=127.0.0.2 sudo yum groupinstall "Virtualization Host" -y sudo yum install git lvm2 lvm2-devel -y ssh root@$VIRTHOST uname -a # 03 - Clone repos and install deps. git clone \ https://github.com/openstack/tripleo-quickstart chmod u+x ./tripleo-quickstart/quickstart.sh bash ./tripleo-quickstart/quickstart.sh \ --install-deps sudo setenforce 0 # 04 - Configure the TripleO deployment with Docker and HA.
export CONFIG=~/deploy-config.yaml cat &gt; $CONFIG &lt;&lt; EOF overcloud_nodes: - name: control_0 flavor: control virtualbmc_port: 6230 - name: compute_0 flavor: compute virtualbmc_port: 6231 node_count: 2 containerized_overcloud: true delete_docker_cache: true enable_pacemaker: true run_tempest: false extra_args: &gt;- --libvirt-type qemu --ntp-server pool.ntp.org -e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml EOF # 05 - Deploy TripleO. export VIRTHOST=127.0.0.2 bash ./tripleo-quickstart/quickstart.sh \ --clean \ --release master \ --teardown all \ --tags all \ -e @$CONFIG \ $VIRTHOST </code></pre> <p>Happy TripleOing!!!</p> <h2 id="update-log">Update log:</h2> <div style="font-size:10px"> <blockquote> <p><strong>2018/01/05:</strong> Initial version.</p> <p><strong>2019/01/16:</strong> Appeared in <a href="https://superuser.openstack.org/articles/new-tripleo-quick-start-cheatsheet/">OpenStack Superuser blog.</a></p> </blockquote> </div> Automating Undercloud backups and a Mistral introduction for creating workbooks, workflows and actions The goal of this developer documentation is to address the automated process of backing up a TripleO Undercloud and to give developers a complete description about how to integrate Mistral workbooks, workflows and actions into the Python TripleO client. This... 
2017-12-18T00:00:00+00:00 https://www.pubstack.com/blog/2017/12/18/automating-the-undercloud-backup-and-mistral-workflows-intro Carlos Camacho <p>The goal of this developer documentation is to address the automated process of backing up a TripleO Undercloud and to give developers a complete description about how to integrate Mistral workbooks, workflows and actions into the Python TripleO client.</p> <p>This tutorial will be divided into several sections:</p> <ol> <li>Introduction and prerequisites</li> <li>Undercloud backups</li> <li>Creating a new OpenStack CLI command in python-tripleoclient (openstack undercloud backup).</li> <li>Creating Mistral workflows for the new python-tripleoclient CLI command.</li> <li>Give support for new Mistral environment variables when installing the undercloud.</li> <li>Show how to test locally the changes in python-tripleoclient and tripleo-common.</li> <li>Give elevated privileges to specific Mistral actions that need to run with elevated privileges.</li> <li>Debugging actions</li> <li>Unit tests</li> <li>Why are all the previous sections related to Upgrades?</li> </ol> <h2 id="1-introduction-and-prerequisites">1. Introduction and prerequisites</h2> <p>Let’s assume you have a healthy and properly working TripleO development environment.
All the commands and customizations we are going to run will be executed in the Undercloud, as usual, logged in as the stack user and having sourced the stackrc file.</p> <p>Then let’s proceed by cloning the repositories we are going to work with into a temporary folder:</p> <pre><code>mkdir dev-docs cd dev-docs git clone https://github.com/openstack/python-tripleoclient git clone https://github.com/openstack/tripleo-common git clone https://github.com/openstack/instack-undercloud </code></pre> <ul> <li><strong>python-tripleoclient:</strong> Will define the OpenStack CLI commands.</li> <li><strong>tripleo-common:</strong> Will have the Mistral logic.</li> <li><strong>instack-undercloud:</strong> Allows updating and creating Mistral environments to store configuration details needed when executing Mistral workflows.</li> </ul> <h2 id="2-undercloud-backups">2. Undercloud backups</h2> <p>Most of the Undercloud backup procedure is available on the <a href="https://docs.openstack.org/tripleo-docs/latest/install/post_deployment/backup_restore_undercloud.html">TripleO official documentation site</a>.</p> <p>We will focus on the automation of backing up the resources required to restore the Undercloud in case of a failed upgrade:</p> <ul> <li>All MariaDB databases on the undercloud node</li> <li>The MariaDB configuration file on the undercloud (so we can restore databases accurately)</li> <li>All glance image data in /var/lib/glance/images</li> <li>All swift data in /srv/node</li> <li>All data in the stack user’s home directory</li> </ul> <p>To do this we need to be able to:</p> <ul> <li>Connect to the database server as root.</li> <li>Dump all databases to file.</li> <li>Create a filesystem backup of several folders (and be able to access folders with restricted access).</li> <li>Upload this backup to a swift container to be able to get it from the TripleO web UI.</li> </ul> <h2 id="3-creating-a-new-openstack-cli-command-in-python-tripleoclient-openstack-undercloud-backup">3.
Creating a new OpenStack CLI command in python-tripleoclient (openstack undercloud backup).</h2> <p>The first action needed is to be able to create a new CLI command for the OpenStack client. In this case, we are going to implement the <strong>openstack undercloud backup</strong> command.</p> <pre><code>cd dev-docs cd python-tripleoclient </code></pre> <p>Let’s list the files inside this folder:</p> <pre><code>[stack@undercloud python-tripleoclient]$ ls AUTHORS doc setup.py babel.cfg LICENSE test-requirements.txt bindep.txt zuul.d tools build README.rst tox.ini ChangeLog releasenotes tripleoclient config-generator requirements.txt CONTRIBUTING.rst setup.cfg </code></pre> <p>Once inside the <strong>python-tripleoclient</strong> folder we need to check the following file:</p> <p><strong>setup.cfg:</strong> This file defines all the CLI commands for the Python TripleO client. Specifically, at the end of this file we will need our new command definition:</p> <pre><code>undercloud_backup = tripleoclient.v1.undercloud_backup:BackupUndercloud </code></pre> <p>This means that we have a new command defined as <strong>undercloud backup</strong> that will instantiate the <strong>BackupUndercloud</strong> class defined in the file <strong>tripleoclient/v1/undercloud_backup.py</strong>.</p> <p>For further details related to this class definition please go to the <a href="https://review.openstack.org/#/c/466213">gerrit review</a>.</p> <p>Now, having our class defined, we can call other methods to invoke Mistral in this way:</p> <pre><code>clients = self.app.client_manager files_to_backup = ','.join(list(set(parsed_args.add_files_to_backup))) workflow_input = { "sources_path": files_to_backup } output = undercloud_backup.prepare(clients, workflow_input) </code></pre> <p>From there, we will call the <strong>undercloud_backup.prepare</strong> method defined in the file <strong>tripleoclient/workflows/undercloud_backup.py</strong> which will call the Mistral workflow:</p> <pre><code>def
prepare(clients, workflow_input): workflow_client = clients.workflow_engine tripleoclients = clients.tripleoclient with tripleoclients.messaging_websocket() as ws: execution = base.start_workflow( workflow_client, 'tripleo.undercloud_backup.v1.prepare_environment', workflow_input=workflow_input ) for payload in base.wait_for_messages(workflow_client, ws, execution): if 'message' in payload: return payload['message'] </code></pre> <p>In this case, we will create a loop within the tripleoclient and wait until we receive a message from the Mistral workflow <strong>tripleo.undercloud_backup.v1.prepare_environment</strong> that indicates whether the invoked workflow ended correctly.</p> <h2 id="4-creating-mistral-workflows-for-the-new-python-tripleoclient-cli-command">4. Creating Mistral workflows for the new python-tripleoclient CLI command.</h2> <p>The next step is to define the <strong>tripleo.undercloud_backup.v1.prepare_environment</strong> Mistral workflow; all the Mistral workbooks, workflows and actions will be defined in the <strong>tripleo-common</strong> repository.</p> <p>Let’s go inside <strong>tripleo-common</strong></p> <pre><code>cd dev-docs cd tripleo-common </code></pre> <p>And see its content:</p> <pre><code>[stack@undercloud tripleo-common]$ ls AUTHORS doc README.rst test-requirements.txt babel.cfg HACKING.rst releasenotes tools build healthcheck requirements.txt tox.ini ChangeLog heat_docker_agent scripts tripleo_common container-images image-yaml setup.cfg undercloud_heat_plugins contrib LICENSE setup.py workbooks CONTRIBUTING.rst playbooks sudoers zuul.d </code></pre> <p>Again we need to check the following file:</p> <p><strong>setup.cfg:</strong> This file defines all the Mistral actions we can call.
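</p>

<p>As a side note, each of these action entries uses the standard setuptools entry-point format, <code>name = module.path:ClassName</code>; when the actions are populated, the right-hand side is resolved to a Python class. The following minimal sketch shows that resolution mechanic; since <code>tripleo_common</code> is only importable on an Undercloud, it resolves a standard-library class as a stand-in:</p>

```python
import importlib

def resolve(spec):
    """Resolve a 'module.path:ClassName' target, the same shape used by
    the action entries in setup.cfg, into the actual class object."""
    module_path, _, class_name = spec.partition(":")
    return getattr(importlib.import_module(module_path), class_name)

# One of the action entries from setup.cfg, split into name and target:
line = "tripleo.undercloud.get_free_space = tripleo_common.actions.undercloud:GetFreeSpace"
action_name, target = (part.strip() for part in line.split("=", 1))
print(action_name)  # -> tripleo.undercloud.get_free_space
print(target)       # -> tripleo_common.actions.undercloud:GetFreeSpace

# Resolving a standard-library target with the same mechanics:
print(resolve("json.decoder:JSONDecoder").__name__)  # -> JSONDecoder
```

<p>This is also why a syntax error in an action module only surfaces later, as a stevedore “Could not load” error when the entries are populated.</p>

<p>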
Specifically, at the end of this file we will need our new actions:</p> <pre><code>tripleo.undercloud.get_free_space = tripleo_common.actions.undercloud:GetFreeSpace tripleo.undercloud.create_backup_dir = tripleo_common.actions.undercloud:CreateBackupDir tripleo.undercloud.create_database_backup = tripleo_common.actions.undercloud:CreateDatabaseBackup tripleo.undercloud.create_file_system_backup = tripleo_common.actions.undercloud:CreateFileSystemBackup tripleo.undercloud.upload_backup_to_swift = tripleo_common.actions.undercloud:UploadUndercloudBackupToSwift </code></pre> <h3 id="41-action-definition">4.1. Action definition</h3> <p>Let’s take the first action to describe its definition, <strong>tripleo.undercloud.get_free_space = tripleo_common.actions.undercloud:GetFreeSpace</strong></p> <p>We have defined the action named <strong>tripleo.undercloud.get_free_space</strong>, which will instantiate the class <strong>GetFreeSpace</strong> defined in the file <strong>tripleo_common/actions/undercloud.py</strong>.</p> <p>If we open <strong>tripleo_common/actions/undercloud.py</strong> we can see the class definition as:</p> <pre><code>class GetFreeSpace(base.Action): """Get the Undercloud free space for the backup. The default path to check will be /tmp and the default minimum size will be 10240 MB (10GB). """ def __init__(self, min_space=10240): self.min_space = min_space def run(self, context): temp_path = tempfile.gettempdir() min_space = self.min_space while not os.path.isdir(temp_path): head, tail = os.path.split(temp_path) temp_path = head available_space = ( (os.statvfs(temp_path).f_frsize * os.statvfs(temp_path).f_bavail) / (1024 * 1024)) if (available_space &lt; min_space): msg = "There is no enough space, avail. - %s MB" \ % str(available_space) return actions.Result(error={'msg': msg}) else: msg = "There is enough space, avail.
- %s MB" \ % str(available_space) return actions.Result(data={'msg': msg}) </code></pre> <p>In this specific case the class will check if there is enough space to perform the backup. Later we will be able to invoke the action as</p> <pre><code>mistral run-action tripleo.undercloud.get_free_space </code></pre> <p>or use it in workbooks.</p> <h3 id="42-workflow-definition">4.2. Workflow definition.</h3> <p>Once we have defined all our new actions, we need to orchestrate them in order to have a fully working Mistral workflow.</p> <p>All <strong>tripleo-common</strong> workbooks are defined in the workbooks folder.</p> <p>In the next example we have a workbook definition with all the actions inside it; in this case the example shows the first workflow with all the tasks involved.</p> <pre><code>--- version: '2.0' name: tripleo.undercloud_backup.v1 description: TripleO Undercloud backup workflows workflows: prepare_environment: description: &gt; This workflow will prepare the Undercloud to run the database backup tags: - tripleo-common-managed input: - queue_name: tripleo tasks: # Action to know if there is enough available space # to run the Undercloud backup get_free_space: action: tripleo.undercloud.get_free_space publish: status: &lt;% task().result %&gt; free_space: &lt;% task().result %&gt; on-success: send_message on-error: send_message publish-on-error: status: FAILED message: &lt;% task().result %&gt; # Sending a message that the folder to create the backup was # created successfully send_message: action: zaqar.queue_post retry: count=5 delay=1 input: queue_name: &lt;% $.queue_name %&gt; messages: body: type: tripleo.undercloud_backup.v1.launch payload: status: &lt;% $.status %&gt; execution: &lt;% execution() %&gt; message: &lt;% $.get('message', '') %&gt; on-success: - fail: &lt;% $.get('status') = "FAILED" %&gt; </code></pre> <p>The workflow is self-explanatory; the only not-so-clear part might be the last one, as the workflow uses an action to send a message
stating that the workflow ended correctly, passing as the message the output of the previous task, in this case the result of <strong>create_backup_dir</strong>.</p> <h2 id="5-give-support-for-new-mistral-environment-variables-when-installing-the-undercloud">5. Give support for new Mistral environment variables when installing the undercloud.</h2> <p>Sometimes it is needed to use additional values inside a Mistral task. For example, if we need to create a dump of a database we might need credentials other than the Mistral user’s for authentication purposes.</p> <p>Initially, when the Undercloud is installed, a Mistral environment called <strong>tripleo.undercloud-config</strong> is created. This environment will have all the required configuration details that we can get from Mistral. This is defined in the <strong>instack-undercloud</strong> repository.</p> <p>Let’s get into the repository and check the content of the file <strong>instack_undercloud/undercloud.py</strong>.</p> <p>This file defines a set of methods to interact with the Undercloud; specifically, the method called <strong>_create_mistral_config_environment</strong> allows configuring additional environment variables when installing the Undercloud.</p> <p>For additional testing, you can use the <a href="https://gist.github.com/ccamacho/354f798102710d165c1f6167eb533caf#file-mistral_client_snippet-py">Python snippet</a> to call the Mistral client from the Undercloud node, available on gist.github.com.</p> <h2 id="6-show-how-to-test-locally-the-changes-in-python-tripleoclient-and-tripleo-common">6.
Show how to test locally the changes in python-tripleoclient and tripleo-common.</h2> <p>If a local test of a change in python-tripleoclient or tripleo-common is needed, the following procedures allow testing it locally.</p> <p>For a change in <strong>python-tripleoclient</strong>, assuming you already have downloaded the change you want to test, execute:</p> <pre><code>cd python-tripleoclient sudo rm -Rf /usr/lib/python2.7/site-packages/tripleoclient* sudo rm -Rf /usr/lib/python2.7/site-packages/python_tripleoclient* sudo python setup.py clean --all install </code></pre> <p>For a change in <strong>tripleo-common</strong>, assuming you already have downloaded the change you want to test, execute:</p> <pre><code>cd tripleo-common sudo rm -Rf /usr/lib/python2.7/site-packages/tripleo_common* sudo python setup.py clean --all install sudo cp /usr/share/tripleo-common/sudoers /etc/sudoers.d/tripleo-common # this loads the actions via entrypoints sudo mistral-db-manage --config-file /etc/mistral/mistral.conf populate # make sure the new actions got loaded mistral action-list | grep tripleo for workbook in workbooks/*.yaml; do mistral workbook-create $workbook done for workbook in workbooks/*.yaml; do mistral workbook-update $workbook done sudo systemctl restart openstack-mistral-executor sudo systemctl restart openstack-mistral-engine </code></pre> <p>If you want to execute a Mistral action or a Mistral workflow independently, you can do it as follows.</p> <p>Examples of how to test Mistral actions independently:</p> <pre><code>mistral run-action tripleo.undercloud.get_free_space # Without parameters mistral run-action tripleo.undercloud.get_free_space '{"path": "/etc/"}' # With parameters mistral run-action tripleo.undercloud.create_file_system_backup '{"sources_path": "/tmp/asdf.txt,/tmp/asdf", "destination_path": "/tmp/"}' </code></pre> <p>Examples of how to test a Mistral workflow independently:</p> <pre><code>mistral execution-create tripleo.undercloud_backup.v1.prepare_environment #
No parameters mistral execution-create tripleo.undercloud_backup.v1.filesystem_backup '{"sources_path": "/tmp/asdf.txt,/tmp/asdf", "destination_path": "/tmp/"}' # With parameters </code></pre> <h2 id="7-give-elevated-privileges-to-specific-mistral-actions-that-need-to-run-with-elevated-privileges">7. Give elevated privileges to specific Mistral actions that need to run with elevated privileges.</h2> <p>Sometimes it is not possible to execute some restricted actions as the Mistral user; for example, when creating the Undercloud backup we won’t be able to access the <strong>/home/stack/</strong> folder to create a tarball of it. For these cases it’s possible to execute elevated actions as the Mistral user.</p> <p>This is the content of the <strong>sudoers</strong> file in the root of the <strong>tripleo-common</strong> repository at the time of the creation of this guide.</p> <pre><code>Defaults!/usr/bin/run-validation !requiretty Defaults:validations !requiretty Defaults:mistral !requiretty mistral ALL = (validations) NOPASSWD:SETENV: /usr/bin/run-validation mistral ALL = NOPASSWD: /usr/bin/chown -h validations\: /tmp/validations_identity_[A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_], \ /usr/bin/chown validations\: /tmp/validations_identity_[A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_], \ !/usr/bin/chown /tmp/validations_identity_* *, !/usr/bin/chown /tmp/validations_identity_*..* mistral ALL = NOPASSWD: /usr/bin/rm -f /tmp/validations_identity_[A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_], \ !/usr/bin/rm /tmp/validations_identity_* *, !/usr/bin/rm /tmp/validations_identity_*..* mistral ALL = NOPASSWD: /bin/nova-manage cell_v2 discover_hosts * mistral ALL = NOPASSWD: /usr/bin/tar --ignore-failed-read -C / -cf /tmp/undercloud-backup-*.tar * mistral ALL = NOPASSWD: /usr/bin/chown mistral.
/tmp/undercloud-backup-*/filesystem-*.tar validations ALL = NOPASSWD: ALL </code></pre> <p>Here you can grant permissions for specific tasks when executing Mistral workflows from <strong>tripleo-common</strong>.</p> <h2 id="7-debugging-actions">8. Debugging actions.</h2> <p>Let’s assume the action is written and added to setup.cfg but does not appear. First, check if the action was added by <code>sudo mistral-db-manage populate</code>. Run</p> <pre><code>mistral action-list -f value -c Name | grep -e '^tripleo.undercloud' </code></pre> <p>If you don’t see your actions, check the output of <code>sudo mistral-db-manage populate</code> as</p> <pre><code>sudo mistral-db-manage populate 2&gt;&amp;1| grep ERROR | less </code></pre> <p>The following output may indicate issues in the code; simply fix the code.</p> <pre><code>2018-01-01:00:59.730 7218 ERROR stevedore.extension [-] Could not load 'tripleo.undercloud.get_free_space': unexpected indent (undercloud.py, line 40): File "/usr/lib/python2.7/site-packages/tripleo_common/actions/undercloud.py", line 40 </code></pre> <p>Execute the single action, and execute the workflow from the workbook, to make sure it works as designed.</p> <h2 id="8-unit-tests">9. Unit tests</h2> <p>Writing unit tests is an essential instrument of a software developer. Unit tests are much faster than running the workflow itself. So, let’s write unit tests for the action we wrote.
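</p>

<p>Before looking at the test file, it is worth checking the numbers the assertions will expect. <code>GetFreeSpace</code> computes the available space as <code>f_frsize * f_bavail</code> bytes from <code>os.statvfs</code>, converted to MB; this standalone sketch (plain Python, not TripleO code) reproduces the two mocked cases:</p>

```python
# Available space in MB, as GetFreeSpace derives it from os.statvfs():
# fragment size (f_frsize) times the blocks available to unprivileged
# users (f_bavail), converted from bytes to MB.
def free_mb(f_frsize, f_bavail):
    return (f_frsize * f_bavail) / (1024 * 1024)

print(free_mb(4096, 1024))      # 4.0 MB: below the 10240 MB default, so the action errors
print(free_mb(4096, 10240000))  # 40000.0 MB: enough space for the backup
```

<p>These are exactly the “avail. - 4 MB” and “avail. - 40000 MB” figures asserted in the tests.</p>

<p>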
Let’s add a <strong>tripleo_common/tests/actions/test_undercloud.py</strong> file with the following content in the <strong>tripleo-common</strong> repository.</p> <pre><code>import mock

from tripleo_common.actions import undercloud
from tripleo_common.tests import base


class GetFreeSpaceTest(base.TestCase):

    def setUp(self):
        super(GetFreeSpaceTest, self).setUp()
        self.temp_dir = "/tmp"

    @mock.patch('tempfile.gettempdir')
    @mock.patch("os.path.isdir")
    @mock.patch("os.statvfs")
    def test_run_false(self, mock_statvfs, mock_isdir, mock_gettempdir):
        mock_gettempdir.return_value = self.temp_dir
        mock_isdir.return_value = True
        mock_statvfs.return_value = mock.MagicMock(
            spec_set=['f_frsize', 'f_bavail'],
            f_frsize=4096,
            f_bavail=1024)
        action = undercloud.GetFreeSpace()
        action_result = action.run(context={})
        mock_gettempdir.assert_called()
        mock_isdir.assert_called()
        mock_statvfs.assert_called()
        self.assertEqual("There is no enough space, avail. - 4 MB",
                         action_result.error['msg'])

    @mock.patch('tempfile.gettempdir')
    @mock.patch("os.path.isdir")
    @mock.patch("os.statvfs")
    def test_run_true(self, mock_statvfs, mock_isdir, mock_gettempdir):
        mock_gettempdir.return_value = self.temp_dir
        mock_isdir.return_value = True
        mock_statvfs.return_value = mock.MagicMock(
            spec_set=['f_frsize', 'f_bavail'],
            f_frsize=4096,
            f_bavail=10240000)
        action = undercloud.GetFreeSpace()
        action_result = action.run(context={})
        mock_gettempdir.assert_called()
        mock_isdir.assert_called()
        mock_statvfs.assert_called()
        self.assertEqual("There is enough space, avail. - 40000 MB",
                         action_result.data['msg'])
</code></pre> <p>Run</p> <pre><code>tox -epy27
</code></pre> <p>to check for any unit test errors.</p> <h2 id="8-why-all-previous-sections-are-related-to-upgrades">8. Why are all the previous sections related to Upgrades?</h2> <ul> <li>Undercloud backups are an important step before running an Upgrade.</li> <li>Writing developer docs will help people create and develop new features.</li> </ul> <h2 id="9-references">9.
References</h2> <ul> <li>http://www.dougalmatthews.com/2016/Sep/21/debugging-mistral-in-tripleo/</li> <li>http://blog.johnlikesopenstack.com/2017/06/accessing-mistral-environment-in-cli.html</li> <li>http://hardysteven.blogspot.com.es/2017/03/developing-mistral-workflows-for-tripleo.html</li> </ul> Restarting your TripleO hypervisor will break the cinder volume service and thus the overcloud pingtest I don’t usually restart my hypervisor, but today I had to install LVM2 and virsh stopped working, so a restart was required. Once the VMs were up and running, the overcloud pingtest failed because cinder was not able to start.... 2017-10-30T00:00:00+00:00 https://www.pubstack.com/blog/2017/10/30/restarting-your-tripleo-hypervisor-will-break-cinder Carlos Camacho <p>I don’t usually restart my hypervisor, but today I had to install LVM2 and virsh stopped working, so a restart was required. Once the VMs were up and running, the overcloud pingtest failed because cinder was not able to start.</p> <p>From your Overcloud controller run:</p> <pre><code>sudo losetup -f /var/lib/cinder/cinder-volumes
sudo vgdisplay
sudo service openstack-cinder-volume restart
</code></pre> <p>This will make your Overcloud pingtest work again.</p> Create a TripleO snapshot before breaking it... The idea of this post is to show how developers can save some time by creating snapshots of their development environments, so they do not have to redeploy them each time they break. So, don’t waste time re-deploying your environment when testing submissions. I’ll show...
2017-07-14T00:00:00+00:00 https://www.pubstack.com/blog/2017/07/14/snapshots-for-your-tripleo-vms Carlos Camacho <p>The idea of this post is to show how developers can save some time by creating snapshots of their development environments, so they do not have to redeploy them each time they break.</p> <p>So, don’t waste time re-deploying your environment when testing submissions.</p> <p>I’ll show here how to be a little more agile when deploying your Undercloud/Overcloud for testing purposes.</p> <p>Deploying a fully working development environment takes around 3 hours with human supervision… And breaking it just after deploying it is not cool at all…</p> <h1 id="step-1">Step 1</h1> <p>Deploy your environment as usual.</p> <h1 id="step-2">Step 2</h1> <p>Create your Undercloud/Overcloud snapshots. <strong>Do this as the stack user, otherwise virsh won’t see the VMs.</strong></p> <pre><code># The VMs deployed are:
# $vms will have something like the next line...
# vms=( "undercloud" "control_0" "compute_0" )
vms=( $(virsh list --all | grep running | awk '{print $2}') )

# List all VMs
virsh list --all

# List current snapshots
for i in "${vms[@]}"; \
do \
  virsh snapshot-list --domain "$i"; \
done

# Dump VMs XML and check for qemu
for i in "${vms[@]}"; \
do \
  virsh dumpxml "$i" | grep -i qemu; \
done

# Create an initial snapshot for each VM
for i in "${vms[@]}"; \
do \
  echo "virsh snapshot-create-as --domain $i --name $i-fresh-install --description $i-fresh-install --atomic"; \
  virsh snapshot-create-as --domain "$i" --name "$i"-fresh-install --description "$i"-fresh-install --atomic; \
done

# List current snapshots (after they have been created)
for i in "${vms[@]}"; \
do \
  virsh snapshot-list --domain "$i"; \
done

#########################################################################################################
# Current libvirt version does not support live snapshots.
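# A hedged way to confirm this on your own hypervisor before scripting
# around it (standard virsh flags; the domain name is just an example):
# virsh snapshot-create-as --domain undercloud --name probe --disk-only --atomic
# On an unsupported QEMU binary the attempt fails with: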
# error: Operation not supported: live disk snapshot not supported with this QEMU binary
# --disk-only and --live not yet available.

# Create the folder for the images
# cd
# mkdir ~/backup_images
# for i in "${vms[@]}"; \
# do \
#   echo "&lt;domainsnapshot&gt;" &gt; $i.xml; \
#   echo " &lt;memory snapshot='external' file='/home/stack/backup_images/$i.mem.snap2'/&gt;" &gt;&gt; $i.xml; \
#   echo " &lt;disks&gt;" &gt;&gt; $i.xml; \
#   echo " &lt;disk name='vda'&gt;" &gt;&gt; $i.xml; \
#   echo " &lt;source file='/home/stack/backup_images/$i.disk.snap2'/&gt;" &gt;&gt; $i.xml; \
#   echo " &lt;/disk&gt;" &gt;&gt; $i.xml; \
#   echo " &lt;/disks&gt;" &gt;&gt; $i.xml; \
#   echo "&lt;/domainsnapshot&gt;" &gt;&gt; $i.xml; \
# done
# for i in "${vms[@]}"; \
# do \
#   echo "virsh snapshot-create $i --xmlfile ~/$i.xml --atomic"; \
#   virsh snapshot-create $i --xmlfile ~/$i.xml --atomic; \
# done
</code></pre> <h1 id="step-3">Step 3</h1> <p>Break your environment xD</p> <h1 id="step-4">Step 4</h1> <p>Restore your snapshots</p> <pre><code># Commented for safety reasons...
# i=compute_0
i=blehblehbleh
virsh list --all
virsh shutdown $i
sleep 120
virsh list --all
virsh snapshot-revert --domain $i --snapshotname $i-fresh-install --running
virsh list --all
</code></pre> <h1 id="or-restore-them-all-at-once">Or restore them all at once</h1> <pre><code>vms=( $(virsh list --all | grep running | awk '{print $2}') )

for i in "${vms[@]}"; \
do \
  virsh shutdown "$i"; \
  virsh snapshot-revert --domain "$i" --snapshotname "$i"-fresh-install --running; \
  virsh list --all; \
done
</code></pre> TripleO deep dive session #11 (i18n) This is the 11th release of the TripleO “Deep Dive” sessions In this session we will have an update for the TripleO internationalization project for the TripleO UI, kindly presented by Julie Pichon. So please, check the full session content...
2017-07-07T00:00:00+00:00 https://www.pubstack.com/blog/2017/07/07/tripleo-deep-dive-session-11 Carlos Camacho <p>This is the 11th release of the <a href="http://www.tripleo.org/">TripleO</a> “Deep Dive” sessions</p> <p>In this session we will have an update for the TripleO internationalization project for the TripleO UI, kindly presented by Julie Pichon.</p> <p>So please, check the full <a href="https://www.youtube.com/watch?v=dmAw7b2yUEo">session</a> content on the <a href="https://www.youtube.com/channel/UCNGDxZGwUELpgaBoLvABsTA/">TripleO YouTube channel</a>.</p> <div class="center"> <iframe width="560" height="315" src="https://www.youtube.com/embed/dmAw7b2yUEo" frameborder="0" allowfullscreen=""></iframe> </div> <p>Please check the <a href="http://www.pubstack.com/blog/2017/06/15/tripleo-deep-dive-session-index.html">sessions index</a> to have access to all available content.</p> OpenStack versions - Upstream/Downstream I’m adding this note as I’m prone to forget how upstream and downstream versions match. RHOS Version 0 = Diablo RHOS Version 1 = Essex RHOS Version 2 = Folsom RHOS Version 3 = Grizzly RHOS Version 4 =... 2017-06-27T00:00:00+00:00 https://www.pubstack.com/blog/2017/06/27/openstack-versions-upstream-downstream Carlos Camacho <p>I’m adding this note as I’m prone to forget how upstream and downstream versions match.</p> <ul> <li>RHOS Version 0 = Diablo</li> <li>RHOS Version 1 = Essex</li> <li>RHOS Version 2 = Folsom</li> <li>RHOS Version 3 = Grizzly</li> <li>RHOS Version 4 = Havana</li> <li>RHOS Version 5 = Icehouse</li> <li>RHOS Version 6 = Juno</li> <li>RHOS Version 7 = Kilo</li> <li>RHOS Version 8 = Liberty</li> <li>RHOS Version 9 = Mitaka</li> <li>RHOS Version 10 = Newton</li> <li>RHOS Version 11 = Ocata</li> <li>RHOS Version 12 = Pike</li> <li>RHOS Version 13 = Queens</li> <li>RHOS Version 14 = R</li> <li>RHOS Version 15 = S</li> </ul> Ph.inally D.one!
- Español This article summarizes my experience over the last 5 years while working on my doctoral thesis. Mainly, it is a collection of tips and advice about what I enjoyed and what I suffered. I agree that my circumstances are not... 2017-06-20T00:00:00+00:00 https://www.pubstack.com/blog/2017/06/20/Ph-inally-D-one-esp Carlos Camacho <p>This article summarizes my experience over the last 5 years while working on my doctoral thesis. Mainly, it is a collection of tips and advice about what I enjoyed and what I suffered. I agree that my circumstances are not the same as those of every Ph.D. student, so this article is based exclusively on my personal opinion and experience.</p> <div style="float: left; width: 230px; background: white;"><img src="/static/phd/blocks.jpg" alt="" style="border:15px solid #FFF" /></div> <p><strong>Divide and conquer.</strong> Split your research work into blocks you can tackle one at a time; it is important to define these blocks at the beginning of the research project, even if only tentatively. This will let you decompose your work so that you can write, for example, one paper for each of these blocks.</p> <div style="float: right; width: 230px; background: white;"><img src="/static/phd/papers-to-me.png" alt="" style="border:15px solid #FFF" /></div> <p><strong>Write papers as if there were no tomorrow.</strong> True story: to finish your doctoral thesis you must have a minimum number of research papers for the committee to consider your work ‘fit’. Usually, you must have 1 in the first quartile as a requirement to present the thesis as a monograph, and 3 to present it as a compilation of papers.</p> <div style="float: left; width: 230px; background: white;"><img src="/static/phd/allow.gif" alt="" style="border:15px solid #FFF" /></div> <p>So, why not base all your work on writing those papers?
From the very first minute, start with your LaTeX template and use it to write everything related to that section of your research work. It is extremely important to consider the journals you submit your papers to; only submit to journals in quartiles 1 and 2.</p> <div style="float: right; width: 230px; background: white;"><img src="/static/phd/free-time.jpg" alt="" style="border:15px solid #FFF" /></div> <p><strong>Don’t forget your free time.</strong> This is one of the things you must keep in mind to make the course of your doctorate as pleasant as possible. Having no free time frustrates you and takes away the will to keep moving forward. Try to keep your social life no matter what. In my particular case, I always tried to devote 2 hours a day after work to research activities, and one out of every 2 weekends to the hard tasks like programming, writing, etc…</p> <div style="float: left; width: 230px; background: white;"><img src="/static/phd/fuck-this-shit.jpg" alt="" style="border:15px solid #FFF" /></div> <p><strong>Frustration-free.</strong> Surely the thought of ‘quitting’ will cross your mind… Never! Do you really think it is worth abandoning the doctorate after having invested your life (nobody will give you that time back), sweat and tears? So cheer up! The worst that could happen is that it takes you longer than you had planned. In my particular case, the Computer Science Doctorate Program regulated by RD 1393/2007, to which I belong, will be discontinued at the end of the current academic year (2017). So, we neither quit nor fall behind. Time to work, there is little left!</p> <div style="float: right; width: 230px; background: white;"><img src="/static/phd/persistence.png" alt="" style="border:15px solid #FFF" /></div> <p><strong>Persist.</strong> Finishing your Ph.D. is an endurance job; eventually, if you keep researching, you will get to defend your thesis. Don’t give up and you will surely reach your goal.
Keep in mind that finishing the doctorate is about accumulating a considerable amount of research work as time goes by; nobody can refute the papers you have already published, so you just have to ‘hang in there’ and, of course, enjoy what you do.</p> <div style="float: left; width: 230px; background: white;"><img src="/static/phd/im-fine.gif" alt="" style="border:15px solid #FFF" /></div> <p><strong>This is a job.</strong> People tend to confuse the work you do during your Ph.D. with ‘studying’. They ask how it is going as if you were studying for your driving license. This is serious! It is work that takes effort and dedication like any other job, and in many cases much more, since it includes evenings, nights and weekends.</p> <div style="float: right; width: 230px; background: white;"><img src="/static/phd/great.gif" alt="" style="border:15px solid #FFF" /></div> <p><strong>Don’t let them see you cry.</strong> Cry, cry all you want, but don’t let them see you. In my case, when my friends and family asked me about ‘uni’, I used to remember how little I had done in the last few days, and all I could do was cry in silence. Hahaha, I remember that question, “How is the doctorate going?”, like having cold water thrown over me…</p> <div style="float: left; width: 230px; background: white;"><img src="/static/phd/show-me-the-money.gif" alt="" style="border:15px solid #FFF" /></div> <p><strong>The Ph.D. student needs to eat.</strong> Spain immensely values its doctoral students, pays them a fair salary for their ‘contribution’ to the development of humanity and, even more, motivates them to keep going down the path they are on. Maybe I have exaggerated a little; the real situation would be closer to being… badly paid, fighting for grants, suffering cuts, enjoying the very common lack of budget, unemployment, rising academic fees…
We all need to eat, and that is why I decided to have a job where my effort is fairly valued, apart from my ‘job’ as a researcher, a job nobody pays me for and for which, on the contrary, I have to pay. Unfortunately, the job market for ‘Doctors’ is quite limited in Spain, but the effort will always be rewarded; at least you will be able to state that you are a doctor when buying Ryanair tickets…</p> <div style="float: right; width: 230px; background: white;"><img src="/static/phd/done-dissertation.jpg" alt="" style="border:15px solid #FFF" /></div> <p><strong>Be realistic and objective.</strong> Reflect on the effort your research work has taken and don’t let anyone belittle it; at the end of the day, you will be the doctor. You could always fix typos, add more use cases, esoteric implementations… Bah… You could spend your whole life generating knowledge, but once you have completed those ‘blocks’ we talked about before… go defend your thesis!</p> <div style="float: left; width: 230px; background: white;"><img src="/static/phd/plan.jpg" alt="" style="border:15px solid #FFF" /></div> <p>Keep in mind that no matter how good your ‘planning’ was, you will face countless unforeseen events. So a plan ‘B’ always helps to get past obstacles. Keep the deadlines of your doctorate program in mind so you don’t get unpleasant ‘surprises’. It is better to know in advance the problems you will face than to discover that you cannot meet the deadlines you had planned.
My advice: every time you get a paper published, work on integrating it into your thesis manuscript; it is quite a lot of work and it is better to get it done little by little.</p> <div style="float: right; width: 230px; background: white;"><img src="/static/phd/please.gif" alt="" style="border:15px solid #FFF" /></div> <p><strong>In Spanish or English, by papers or as a monograph?</strong> It is a difficult question, and it is important to have a plan ‘B’ in case you have a deadline to submit your manuscript. The papers you have as WIP (work in progress) can always be finished during your post-doc or in your free time. In my particular case, with Spanish as my native language, I preferred to write the entire manuscript in Spanish for two reasons. The first is that I could write much faster, and the second is that publishing a paper usually takes around 6 months on average; in my case, my third paper is still under review by the journal. This plan ‘B’ saved me and allowed me to defend the thesis within the corresponding deadlines.</p> <div style="float: left; width: 230px; background: white;"><img src="/static/phd/notes.gif" alt="" style="border:15px solid #FFF" /></div> <p><strong>Take notes for the defense.</strong> It seems simple, but something that will probably happen is that, for a second, you won’t remember something you have been asked. That is why it is important to prepare material for the question round, for when they ask you to go to a specific page and clarify something that was not well understood.</p> <div style="float: right; width: 230px; background: white;"><img src="/static/phd/notas.jpg" alt="" style="border:15px solid #FFF" /></div> <p>In my case, I printed the manuscript one page per sheet and wrote down, on the blank side of each sheet, the notes I considered relevant.
A simple and reliable trick to avoid freezing in those moments where an ‘I don’t remember’ is not allowed.</p> <div style="float: left; width: 230px; background: white;"><img src="/static/phd/phinally_done.jpg" alt="" style="border:15px solid #FFF" /></div> <p><strong>Finally.</strong> On a day like today, June 20th 2017, if the force is with me, I will become a doctor… Yeah motherfuckers! It seems everything has come to a good end, and today at 11:00 CET I will defend my doctoral thesis at the Faculty of Computer Science of the UCM. I can only conclude that, without any doubt, I would do it again if I were in the position I was in 5 years ago. But I would not repeat it (do another Ph.D.), honestly…</p> <div style="float: right; width: 230px; background: white;"><img src="/static/phd/quack-motherfucker.jpg" alt="" style="border:15px solid #FFF" /></div> <p>Doing a doctorate not only lets you grow professionally; it also lets you meet fantastic people who will surely accompany you for the rest of your life and become good friends, partners, etc…</p> <p>In case anyone is interested in taking a look at my research work, it is publicly available at <a href="http://ccamacho.github.io/phd">ccamacho.github.io/phd</a>. And my professional profile, as usual, is on <a href="https://www.linkedin.com/in/ccamacho-/?locale=en_US">LinkedIn</a>.</p> <p>Dr. Carlos Camacho ☚ (&lt;‿&lt;)☚</p> <p><img src="/static/phd/cover.jpg" alt="" /></p> <p>P.S. Just in case you find it curious, let me clarify the origin of the <a href="https://es.wikipedia.org/wiki/Matrioshka">Matrioshkas</a> on the cover. They represent one of the main achievements of the research work and, on the cover, a metaphor… My thesis proposes a set of <em>equivalence</em> relations to simplify the terms of the algebra and reduce them to their <em>normal forms</em>.
Among other things, these normal forms make it possible to remove the combinatorial operators that render the implementation of the operational and denotational semantics ‘intractable’.</p> TripleO deep dive session index This is a brief index with all TripleO deep dive sessions; you can see all the videos on the TripleO YouTube channel. Sessions index:     * TripleO deep dive #1 (Quickstart deployment)     * TripleO deep dive #2 (TripleO Heat Templates)... 2017-06-15T00:00:00+00:00 https://www.pubstack.com/blog/2017/06/15/tripleo-deep-dive-session-index Carlos Camacho <p>This is a brief index with all TripleO deep dive sessions; you can see all the videos on the <a href="https://www.youtube.com/channel/UCNGDxZGwUELpgaBoLvABsTA/">TripleO YouTube channel</a>.</p> <blockquote> <p>Sessions index:</p> <p>    * <a href="http://www.pubstack.com/blog/2016/07/11/tripleo-deep-dive-session-1.html">TripleO deep dive #1 (Quickstart deployment)</a></p> <p>    * <a href="http://www.pubstack.com/blog/2016/07/18/tripleo-deep-dive-session-2.html">TripleO deep dive #2 (TripleO Heat Templates)</a></p> <p>    * <a href="http://www.pubstack.com/blog/2016/07/22/tripleo-deep-dive-session-3.html">TripleO deep dive #3 (Overcloud deployment debugging)</a></p> <p>    * <a href="http://www.pubstack.com/blog/2016/08/01/tripleo-deep-dive-session-4.html">TripleO deep dive #4 (Puppet modules)</a></p> <p>    * <a href="http://www.pubstack.com/blog/2016/08/05/tripleo-deep-dive-session-5.html">TripleO deep dive #5 (Undercloud - Under the hood)</a></p> <p>    * <a href="http://www.pubstack.com/blog/2016/08/15/tripleo-deep-dive-session-6.html">TripleO deep dive #6 (Overcloud - Physical network)</a></p> <p>    * <a href="http://www.pubstack.com/blog/2017/01/16/tripleo-deep-dive-session-7.html">TripleO deep dive #7 (Undercloud - TripleO UI)</a></p> <p>    * <a href="http://www.pubstack.com/blog/2017/05/04/tripleo-deep-dive-session-8.html">TripleO deep dive #8 (TripleO - Deployed server)</a></p> <p>    * <a
href="http://www.pubstack.com/blog/2017/05/05/tripleo-deep-dive-session-9.html">TripleO deep dive #9 (TripleO - Quickstart)</a></p> <p>    * <a href="http://www.pubstack.com/blog/2017/06/15/tripleo-deep-dive-session-10.html">TripleO deep dive #10 (TripleO - Containers)</a></p> <p>    * <a href="http://www.pubstack.com/blog/2017/07/07/tripleo-deep-dive-session-11.html">TripleO deep dive #11 (TripleO - i18n)</a></p> <p>    * <a href="http://www.pubstack.com/blog/2018/02/23/tripleo-deep-dive-session-12.html">TripleO deep dive #12 (TripleO - config-download)</a></p> <p>    * <a href="http://www.pubstack.com/blog/2018/05/31/tripleo-deep-dive-session-13.html">TripleO deep dive #13 (TripleO - Containerized Undercloud)</a></p> <p>    * <a href="http://www.pubstack.com/blog/2020/02/18/tripleo-deep-dive-session-14.html">TripleO deep dive #14 (TripleO - Containerized deployments without Paunch)</a></p> </blockquote> TripleO deep dive session #10 (Containers) This is the 10th release of the TripleO “Deep Dive” sessions In this session we will have an update for the TripleO containers effort, thanks to Jiri Stransky. So please, check the full session content on the TripleO YouTube channel.... 
2017-06-15T00:00:00+00:00 https://www.pubstack.com/blog/2017/06/15/tripleo-deep-dive-session-10 Carlos Camacho <p>This is the 10th release of the <a href="http://www.tripleo.org/">TripleO</a> “Deep Dive” sessions</p> <p>In this session we will have an update for the TripleO containers effort, thanks to Jiri Stransky.</p> <p>So please, check the full <a href="https://www.youtube.com/watch?v=xhTwHfi65p8">session</a> content on the <a href="https://www.youtube.com/channel/UCNGDxZGwUELpgaBoLvABsTA/">TripleO YouTube channel</a>.</p> <div class="center"> <iframe width="560" height="315" src="https://www.youtube.com/embed/xhTwHfi65p8" frameborder="0" allowfullscreen=""></iframe> </div> <p>Please check the <a href="http://www.pubstack.com/blog/2017/06/15/tripleo-deep-dive-session-index.html">sessions index</a> to have access to all available content.</p> Git merge!!! Git merge!!! Explanation: No comments. Disclaimer. 2017-06-13T00:00:00+00:00 https://www.pubstack.com/blog/2017/06/13/pod Carlos Camacho <p>Git merge!!!</p> <p><img src="/static/pod/2017-06-13-git_merge.gif" alt="" /></p> <p>Explanation: No comments. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> Map, filter and reduce explained Map, filter and reduce explained Explanation: No comments. Disclaimer. 2017-06-12T00:00:00+00:00 https://www.pubstack.com/blog/2017/06/12/pod Carlos Camacho <p>Map, filter and reduce explained</p> <p><img src="/static/pod/2017-06-12-map_filter_reduce.jpg" alt="" /></p> <p>Explanation: No comments. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> Software Engineer taxonomy Software Engineer taxonomy Explanation: No comments. Disclaimer. 2017-06-01T00:00:00+00:00 https://www.pubstack.com/blog/2017/06/01/pod Carlos Camacho <p>Software Engineer taxonomy</p> <p><img src="/static/pod/2017-06-01-software-engineer.jpg" alt="" /></p> <p>Explanation: No comments. 
<br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> Converting AAX audiobooks to MP3 This is a quick guide to convert AAX files (DRMed audiobooks) to their MP3 equivalent. I just got ‘The Phoenix Project’ book from amazon.es; it was on sale together with its Audible audiobook version.. The thing is that I don’t... 2017-05-31T00:00:00+00:00 https://www.pubstack.com/blog/2017/05/31/convert-aax-to-mp3 Carlos Camacho <p>This is a quick guide to convert AAX files (DRMed audiobooks) to their MP3 equivalent.</p> <p>I just got ‘The Phoenix Project’ book from <a href="https://www.amazon.es/Phoenix-Project-DevOps-Helping-Business/dp/0988262509">amazon.es</a>; it was on sale together with its Audible audiobook version.</p> <p>The thing is that I don’t want to install any additional software, and I just want to listen to the MP3 version on my phone (and no, it isn’t a smartphone)… And to keep a sort of backup in something other than this mumbo-jumbo AAX stuff.</p> <p>So this is a small copy/edit/paste recipe to convert the AAX files to MP3… Working fine and really easy to achieve…</p> <pre><code>git clone https://github.com/inAudible-NG/tables.git
git clone https://github.com/KrumpetPirate/AAXtoMP3.git
wget http://project-rainbowcrack.com/rainbowcrack-1.7-linux64.zip
unzip rainbowcrack-1.7-linux64.zip
mv AAXtoMP3/* tables/
mv rainbowcrack-1.7-linux64/* tables/
mv &lt;your_aax_file_name&gt;.aax tables/
cd tables
make
ffprobe &lt;your_aax_file_name&gt;.aax   # Get the "[aax] file checksum"
./rcrack . -h &lt;your_checksum&gt;      # Get the activation bytes hex
bash AAXtoMP3 &lt;your_activation_bytes&gt; &lt;your_aax_file_name&gt;.aax
# An audiobook folder with the MP3 files in it will be generated.
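# Note (hedged): the activation bytes are derived from your Audible
# account, not from the individual book, so you should be able to reuse
# them for other titles without re-running rcrack, e.g.:
# bash AAXtoMP3 &lt;your_activation_bytes&gt; another_book.aax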
# *-*-Enjoy-*-*
</code></pre> <p>This is basically all you need to convert your AAX file to MP3.</p> <p>And no, I won’t share with you my copy of the ‘The Phoenix Project’ DeDRMed audiobook…</p> <blockquote> <p>Disclaimer:</p> <p>The purpose of this recipe is to be able to create backups of your audiobooks and to be able to play them on other, non-DRM-capable players. Please do not share or upload your DeDRMed files with anyone else.</p> </blockquote> TripleO deep dive session #9 (TripleO - Quickstart) This is the ninth release of the TripleO “Deep Dive” sessions In this session we will have an overall description for TripleO Quickstart, thanks to Gabriele Cerami. So please, check the full session content on the TripleO YouTube channel. Please... 2017-05-05T00:00:00+00:00 https://www.pubstack.com/blog/2017/05/05/tripleo-deep-dive-session-9 Carlos Camacho <p>This is the ninth release of the <a href="http://www.tripleo.org/">TripleO</a> “Deep Dive” sessions</p> <p>In this session we will have an overall description of TripleO Quickstart, thanks to Gabriele Cerami.</p> <p>So please, check the full <a href="https://www.youtube.com/watch?v=PwHEgHJ9ePU">session</a> content on the <a href="https://www.youtube.com/channel/UCNGDxZGwUELpgaBoLvABsTA/">TripleO YouTube channel</a>.</p> <div class="center"> <iframe width="560" height="315" src="https://www.youtube.com/embed/PwHEgHJ9ePU" frameborder="0" allowfullscreen=""></iframe> </div> <p>Please check the <a href="http://www.pubstack.com/blog/2017/06/15/tripleo-deep-dive-session-index.html">sessions index</a> to have access to all available content.</p> TripleO deep dive session #8 (TripleO - Deployed server) This is the eighth release of the TripleO “Deep Dive” sessions In this session we will have a full description of the deployed server feature in TripleO, thanks to James Slagle. So please, check the full session content on the...
2017-05-04T21:00:00+00:00 https://www.pubstack.com/blog/2017/05/04/tripleo-deep-dive-session-8 Carlos Camacho <p>This is the eighth release of the <a href="http://www.tripleo.org/">TripleO</a> “Deep Dive” sessions</p> <p>In this session we will have a full description of the deployed server feature in TripleO, thanks to James Slagle.</p> <p>So please, check the full <a href="https://www.youtube.com/watch?v=s8Hm4n9IjYg">session</a> content on the <a href="https://www.youtube.com/channel/UCNGDxZGwUELpgaBoLvABsTA/">TripleO YouTube channel</a>.</p> <div class="center"> <iframe width="560" height="315" src="https://www.youtube.com/embed/s8Hm4n9IjYg" frameborder="0" allowfullscreen=""></iframe> </div> <p>Please check the <a href="http://www.pubstack.com/blog/2017/06/15/tripleo-deep-dive-session-index.html">sessions index</a> to have access to all available content.</p> The Seven Circles of Developer Hell The Seven Circles of Developer Hell Explanation: From Toggl. Disclaimer. 2017-03-07T00:00:00+00:00 https://www.pubstack.com/blog/2017/03/07/pod Carlos Camacho <p>The Seven Circles of Developer Hell</p> <p><img src="/static/pod/2017-03-07-7-circles-of-developer-hell.jpg" alt="" /></p> <p>Explanation: From <a href="https://blog.toggl.com/2017/02/seven-circles-of-developer-hell/">Toggl</a>. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> Installing TripleO Quickstart This is a brief recipe about how to manually install TripleO Quickstart in a remote 32GB RAM box and not die trying. Now instack-virt-setup is deprecated :( :( :( so the manual process needs to evolve and use OOOQ...
2017-02-24T00:00:00+00:00 https://www.pubstack.com/blog/2017/02/24/install-tripleo-quickstart Carlos Camacho <p>This is a brief recipe about how to manually install TripleO Quickstart in a remote 32GB RAM box and not die trying.</p> <p>Now <code>instack-virt-setup</code> is deprecated :( :( :( so the manual process needs to evolve and use OOOQ (TripleO Quickstart).</p> <p>This post is a brief recipe about how to provision the Hypervisor node and deploy an end-to-end development environment based on TripleO-Quickstart.</p> <p>From the hypervisor run:</p> <pre><code class="language-bash"># In this dev. env. /var is only 50GB, so I will create
# a sym link to another location with more capacity.
# It will easily take more than 50GB deploying a 3+1 overcloud
sudo mkdir -p /home/libvirt/
sudo ln -sf /home/libvirt/ /var/lib/libvirt

# Add default toor user
sudo useradd toor
echo "toor:toor" | sudo chpasswd
echo "toor ALL=(root) NOPASSWD:ALL" | sudo tee -a /etc/sudoers.d/toor
sudo chmod 0440 /etc/sudoers.d/toor
sudo yum install -y lvm2 lvm2-devel
su - toor
whoami
sudo yum groupinstall "Virtualization Host" -y
sudo yum install git -y

# Disable requiretty otherwise the deployment will fail...
sudo sed -i -e 's/Defaults[ \t]*requiretty/#Defaults requiretty/g' /etc/sudoers

cd
mkdir .ssh
ssh-keygen -t rsa -N "" -f .ssh/id_rsa
cat .ssh/id_rsa.pub &gt;&gt; .ssh/authorized_keys
sudo bash -c "cat .ssh/id_rsa.pub &gt;&gt; /root/.ssh/authorized_keys"
sudo bash -c "echo '127.0.0.1 127.0.0.2' &gt;&gt; /etc/hosts"
export VIRTHOST=127.0.0.2
ssh root@$VIRTHOST uname -a
</code></pre> <p>Now, we can start deploying TripleO Quickstart by following:</p> <pre><code class="language-bash"># Source: http://rdo-ci-doc.usersys.redhat.com/docs/tripleo-environments/en/latest/oooq-downstream.html
# Downstream bits for OSP8 ...
# cd # sudo yum -y install /usr/bin/c_rehash ca-certificates # sudo update-ca-trust check # sudo update-ca-trust force-enable # sudo update-ca-trust enable # wget cert.pem # sudo cp cert.pem /etc/pki/tls/certs/ # sudo cp cert.pem /etc/pki/ca-trust/source/anchors/ # sudo c_rehash # sudo update-ca-trust extract # git clone https://github.com/openstack/tripleo-quickstart # cd tripleo-quickstart # wget http://rhos-release.virt.bos.redhat.com/ci-images/internal-requirements-new.txt # cd # chmod u+x ./tripleo-quickstart/quickstart.sh # bash ./tripleo-quickstart/quickstart.sh --install-deps # bash ./tripleo-quickstart/quickstart.sh -v --release rhos-8-baseos-undercloud --clean --teardown all --requirements "/home/toor/tripleo-quickstart/internal-requirements-new.txt" $VIRTHOST </code></pre> <pre><code class="language-bash">git clone https://github.com/openstack/tripleo-quickstart chmod u+x ./tripleo-quickstart/quickstart.sh printf "\n\nSee:\n./tripleo-quickstart/quickstart.sh --help for a full list of options\n\n" bash ./tripleo-quickstart/quickstart.sh --install-deps export VIRTHOST=127.0.0.2 export CONFIG=~/deploy-config.yaml cat &gt; $CONFIG &lt;&lt; EOF # undercloud_undercloud_hostname: undercloud.ratata-domain # control_memory: 8192 # compute_memory: 6120 # undercloud_memory: 10240 # undercloud_vcpu: 4 # undercloud_workers: 3 # default_vcpu: 1 custom_nameserver: '10.16.36.29' undercloud_undercloud_nameservers: '10.16.36.29' overcloud_dns_servers: '10.16.36.29' # node_count: 4 # overcloud_nodes: # - name: control_0 # flavor: control # virtualbmc_port: 6230 # - name: control_1 # flavor: control # virtualbmc_port: 6231 # - name: control_2 # flavor: control # virtualbmc_port: 6232 # - name: compute_0 # flavor: compute # virtualbmc_port: 6233 # topology: &gt;- # --control-scale 3 # --compute-scale 1 extra_args: &gt;- --libvirt-type qemu --ntp-server pool.ntp.org # -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml run_tempest: false EOF # We 
disable SELINUX as it breaks the deployment # You will get some permission denied errors when running # the Ansible playbooks sudo setenforce 0 bash ./tripleo-quickstart/quickstart.sh \ --clean \ --release master \ --teardown all \ --tags all \ -e @$CONFIG \ $VIRTHOST </code></pre> <p>In the hypervisor run the following command to log in to the Undercloud:</p> <pre><code class="language-bash">ssh -F /home/toor/.quickstart/ssh.config.ansible undercloud # Add the TRIPLEO_ROOT var to stackrc # to use with tripleo-ci echo "export TRIPLEO_ROOT=~" &gt;&gt; stackrc source stackrc </code></pre> <p>At this point you should have your development environment deployed correctly.</p> <p>Clone the tripleo-ci repo:</p> <pre><code class="language-bash">git clone https://github.com/openstack-infra/tripleo-ci </code></pre> <p>And, run the Overcloud pingtest:</p> <pre><code class="language-bash">~/tripleo-ci/scripts/tripleo.sh --overcloud-pingtest </code></pre> <p>Enjoy TripleOing (~˘▾˘)~</p> <p>Note: I had to execute the deployment command 3 or 4 times to get an OK deployment; sometimes it just fails (e.g. a timeout getting the images).</p> <p>Note: From the host, <code>virsh list --all</code> will work only as the stack user.</p> <p>Note: Each time you run the quickstart.sh command from the hypervisor the UC and OC will be nuked (<code>--teardown all</code>), you will see tasks like ‘PLAY [Tear down undercloud and overcloud vms] **’.</p> <p>Note: If you delete the Overcloud, e.g.
using <code>heat stack-delete overcloud</code>, you can re-deploy what you had by running the dynamically generated overcloud-deploy.sh script in the stack home folder from the UC.</p> <p>Note: There are several options for TripleO Quickstart besides the basic virthost deployment, check them here: <code>https://docs.openstack.org/developer/tripleo-quickstart/working-with-extras.html</code></p> <div style="font-size:10px"> <blockquote> <p><strong>Updated 2017/03/17:</strong> Bleh, had to execute the deployment command several times to get it working.. :/ I miss you instack-virt-setup</p> <p><strong>Updated 2017/03/16:</strong> The --config option seems to be broken, using instead -e @~/deploy-config.yaml.</p> <p><strong>Updated 2017/03/14:</strong> New workflow added.</p> <p><strong>Updated 2017/02/27:</strong> Working fine.</p> <p><strong>Updated 2017/02/23:</strong> Seems to work.</p> <p><strong>Updated 2017/02/23:</strong> instack-virt-setup is deprecated :( moving to tripleo-quickstart.</p> </blockquote> </div> I am not a robot I am not a robot Explanation: No comments. Disclaimer. 2017-01-27T00:00:00+00:00 https://www.pubstack.com/blog/2017/01/27/pod Carlos Camacho <p>I am not a robot</p> <p><img src="/static/pod/2017-01-27-not-a-robot.gif" alt="" /></p> <p>Explanation: No comments. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> OpenStack and services for BigData applications Yesterday I had the opportunity of presenting together with Daniel Mellado a brief talk about OpenStack and its integration with services to support Big Data tools (OpenStack Sahara). It was a combined talk for two Meetups MAD-for-OpenStack and Data-Science-Madrid. The...
2017-01-26T00:00:00+00:00 https://www.pubstack.com/blog/2017/01/26/mad-for-openstack-meetup Carlos Camacho <p>Yesterday I had the opportunity of presenting together with Daniel Mellado a brief talk about OpenStack and its integration with services to support Big Data tools (OpenStack Sahara).</p> <p><img src="/static/meetup_openstack.png" alt="" /></p> <p>It was a combined talk for two Meetups <a href="https://www.meetup.com/es-ES/MAD-for-OpenStack/events/237131857/">MAD-for-OpenStack</a> and <a href="https://www.meetup.com/es-ES/Data-Science-Madrid/events/236991190/">Data-Science-Madrid</a>.</p> <p>The presentation is stored in <a href="https://github.com/ccamacho/openstack-presentations/tree/master/2017-01-25-meetup-openstack101-bigdata">GitHub</a>.</p> <p>Regrets:</p> <ul> <li>We prepared a one-hour presentation that had to be delivered in 20 minutes.</li> <li>We weren’t able to access our demo server.</li> </ul> TripleO deep dive session #7 (Undercloud - TripleO UI) This is the seventh release of the TripleO “Deep Dive” sessions In this session Liz Blanchard and Ana Krivokapic will give us some bits about how to contribute to the TripleO UI project. After watching this session we will have... 2017-01-16T16:00:00+00:00 https://www.pubstack.com/blog/2017/01/16/tripleo-deep-dive-session-7 Carlos Camacho <p>This is the seventh release of the <a href="http://www.tripleo.org/">TripleO</a> “Deep Dive” sessions</p> <p>In this session Liz Blanchard and Ana Krivokapic will give us some bits about how to contribute to the <a href="https://github.com/openstack/tripleo-ui">TripleO UI</a> project.
After watching this session you will have a general overview of the project’s history, properties, architecture and contribution steps.</p> <p>So please, check the full <a href="https://www.youtube.com/watch?v=9TseONVfLR8">session</a> content on the <a href="https://www.youtube.com/channel/UCNGDxZGwUELpgaBoLvABsTA/">TripleO YouTube channel</a>.</p> <div class="center"> <iframe width="560" height="315" src="https://www.youtube.com/embed/9TseONVfLR8" frameborder="0" allowfullscreen=""></iframe> </div> <p>Here you will be able to see a quick overview of how to install the UI as a development environment.</p> <div class="center"> <iframe width="560" height="315" src="https://www.youtube.com/embed/1puSvUqTKzw" frameborder="0" allowfullscreen=""></iframe> </div> <p>The summarized steps are also available in <a href="http://www.pubstack.com/blog/2017/01/13/installing-tripleo-ui.html">this</a> blog post.</p> <p>Please check the <a href="http://www.pubstack.com/blog/2017/06/15/tripleo-deep-dive-session-index.html">sessions index</a> to have access to all available content.</p> Programmer reality Programmer reality Explanation: No comments. Disclaimer. 2017-01-13T00:00:00+00:00 https://www.pubstack.com/blog/2017/01/13/pod Carlos Camacho <p>Programmer reality</p> <p><img src="/static/pod/2017-01-13-programmer-reality.gif" alt="" /></p> <p>Explanation: No comments. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> Installing the TripleO UI This is a brief recipe to use or install the TripleO UI in the Undercloud. First, once the Undercloud is installed, the TripleO UI is already available on port 3000. Let’s assume you have the root passwords for both your development environment...
2017-01-13T00:00:00+00:00 https://www.pubstack.com/blog/2017/01/13/installing-tripleo-ui Carlos Camacho <p>This is a brief recipe to use or install the TripleO UI in the Undercloud.</p> <p>First, once the Undercloud is installed, the TripleO UI is already available on port 3000.</p> <p>Let’s assume you have the root passwords for both your development environment and the Undercloud node.</p> <p>The TripleO UI queries the endpoints (e.g. Keystone) directly from your browser, so we need the traffic for the net 192.168.24.0/24 forwarded from your workstation to the Undercloud node in order to reach all required ports (6385, 5000, 8004, 8080, 9000, 8989, 3000, 13385, 13000, 13004, 13808, 13989 and 443).</p> <p>Let’s install sshuttle on your workstation.</p> <pre><code>sudo yum install -y sshuttle </code></pre> <p>Now, let’s get the Undercloud IP and configure SSH with a ProxyCommand.</p> <pre><code>undercloudIp=`ssh root@labserver "arp -e" | grep brext | grep -v incomplete | awk '{print $1}' | sed 's/\/.*$//'` cat &lt;&lt; EOF &gt;&gt; ~/.ssh/config Host lab Hostname labserver User root Host uc Hostname $undercloudIp User root ProxyCommand ssh -vvvv -W %h:%p root@lab EOF </code></pre> <p>sshuttle will ask you for your hypervisor and Undercloud root passwords.</p> <p>To start forwarding the traffic execute:</p> <pre><code>sshuttle -e "ssh -vvv" -r root@uc -vvvv 192.168.24.0/24 </code></pre> <p>Once you have done this, open http://192.168.24.1:3000/ from your browser and the TripleO UI should be shown correctly.</p> <hr /> <p>It’s probable that you will receive an error like: <strong>Connection to Keystone is not available</strong>.</p> <p>This is because you are trying to access the Keystone endpoint from your workstation and it fails as the certificate is self-signed. In order to fix this, open the developer view in your browser and check the endpoint you are using to access Keystone.
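</p>

<p>Before switching back to the browser, you can confirm from a terminal that this really is a certificate problem rather than a connectivity one. The snippet below is a hypothetical check, not part of TripleO; the endpoint address is only an illustrative default, substitute the URL you found in the developer view:</p>

<pre><code class="language-bash"># Hypothetical check: if curl only succeeds with --insecure,
# the self-signed certificate is the culprit.
ENDPOINT="https://192.168.24.2/keystone/v2.0/tokens"
if curl --silent --connect-timeout 3 --output /dev/null "$ENDPOINT"; then
  RESULT="certificate already trusted"
elif curl --silent --insecure --connect-timeout 3 --output /dev/null "$ENDPOINT"; then
  RESULT="self-signed certificate is the problem"
else
  RESULT="endpoint unreachable: check the sshuttle tunnel first"
fi
echo "$RESULT"
</code></pre>

<p>The first curl fails only on TLS or connection errors, so comparing it with the <code>--insecure</code> retry separates the certificate issue from a broken tunnel.</p>

<p>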
For example, https://192.168.24.2/keystone/v2.0/tokens: open this URL in your browser and accept the certificate. If you do this the Keystone error should go away.</p> <hr /> <p>If you need a TripleO UI development environment follow:</p> <p>The first step will be to install the TripleO UI and all the dependencies.</p> <pre><code class="language-bash"> cd sudo yum install -y nodejs npm tmux git clone https://github.com/openstack/tripleo-ui.git cd tripleo-ui npm install </code></pre> <p>Now, we need to update all the TripleO UI config files.</p> <pre><code class="language-bash"> cd cp ~/tripleo-ui/config/tripleo_ui_config.js.sample ~/tripleo-ui/config/tripleo_ui_config.js echo "Changing the default IP" export ENDPOINT_ADDR=$(cat stackrc | grep OS_AUTH_URL= | awk -F':' '{print $2}'| tr -d /) sed -i "s/[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}/$ENDPOINT_ADDR/g" ~/tripleo-ui/config/tripleo_ui_config.js echo "Removing comments for the keystone URI" sed -i '/^ \/\/ '\''keystone'\''\:/s/^ \/\///' ~/tripleo-ui/config/tripleo_ui_config.js echo "Removing comments for the zaqar_default_queue" sed -i '/^ \/\/ '\''zaqar_default_queue'\''\:/s/^ \/\///' ~/tripleo-ui/config/tripleo_ui_config.js # Uncomment all the parameters # sed -i '/^ \/\/ '\''.*'\''\:/s/^ \/\///' ~/tripleo-ui/config/tripleo_ui_config.js echo "Changing listening port for the dev server, 3000 already used" sed -i '/port: 3000/s/3000/33000/' ~/tripleo-ui/webpack.dev.js </code></pre> <p>In the following step we will use tmux to keep the service running for debugging purposes.</p> <pre><code class="language-bash"> cd tmux new -s tripleo-ui cd ~/tripleo-ui/ npm start </code></pre> <p>At this stage you should have your node server up and running (port 33000).</p> <p>If you followed the first steps to see the default TripleO UI installation, log in to the TripleO UI at http://192.168.24.1:33000/</p> <p>Happy TripleOing!</p> <div style="font-size:10px"> <blockquote> <p><strong>Updated
2017/01/13:</strong> First version.</p> <p><strong>Updated 2017/01/14:</strong> Add default TripleO UI info. Still getting 'Connection to Keystone is not available'; the config params are correct, checking it...</p> <p><strong>Updated 2017/01/17:</strong> Forwarded all the required ports using sshuttle.</p> </blockquote> </div> Printed TripleO cheatsheets for FOSDEM/DevConf (feedback needed) We are preparing some cheatsheets for people jumping into TripleO. So here is an early WIP version of a few cheatsheets that we want to share: TripleO manual installation (Just a copy/paste step-wise process to install TripleO). Deployments -... 2016-12-16T00:00:00+00:00 https://www.pubstack.com/blog/2016/12/16/printing-tripleo-cheat-sheet Carlos Camacho <p>We are preparing some cheatsheets for people jumping into TripleO.</p> <p>So here is an early WIP version of a few cheatsheets that we want to share:</p> <blockquote> <p>TripleO manual installation (Just a copy/paste step-wise process to install TripleO).</p> <p>Deployments - Debugging tips (Relevant commands to know what’s happening with the deployment).</p> <p>Deployments - CI (URL’s and resources to check our CI status).</p> <p>OOOQ installation (Also a step-wise recipe to install OOOQ, does not exist yet).</p> </blockquote> <p>We already have some drafts available in <a href="https://github.com/ccamacho/tripleo-cheatsheet">GitHub</a>.</p> <p>So, we would like to get some feedback from the community and publish a stable version of the cheatsheets next week.</p> <p>Feedback on adding/removing content and general reviews of all of them are welcome.</p> <p>Thanks!!!!</p> Testing composable upgrades This is a brief recipe about how I’m testing composable upgrades O->P. Based on shardy’s original notes from this link. The following steps upgrade your Overcloud from Ocata to latest master (Pike). Deploy latest master TripleO...
2016-11-28T00:00:00+00:00 https://www.pubstack.com/blog/2016/11/28/testing-composable-upgrades Carlos Camacho <p>This is a brief recipe about how I’m testing composable upgrades O-&gt;P.</p> <p>Based on shardy’s original notes from <a href="http://paste.openstack.org/show/590436/">this</a> link.</p> <p>Follow these steps to upgrade your Overcloud from Ocata to latest master (Pike).</p> <ul> <li> <p>Deploy latest master TripleO following <a href="http://www.pubstack.com/blog/2016/07/04/manually-installing-tripleo-recipe.html">this</a> post.</p> </li> <li> <p>Remove the current Overcloud deployment.</p> </li> </ul> <pre><code> source stackrc heat stack-delete overcloud </code></pre> <ul> <li>Remove the Overcloud images and create new ones (for the Overcloud).</li> </ul> <pre><code> cd openstack image list openstack image delete &lt;image_ID&gt; #Delete all the Overcloud images overcloud-full* rm -rf /home/stack/overcloud-full.* export STABLE_RELEASE=ocata export USE_DELOREAN_TRUNK=1 export DELOREAN_TRUNK_REPO="https://trunk.rdoproject.org/centos7-ocata/current/" export DELOREAN_REPO_FILE="delorean.repo" /home/stack/tripleo-ci/scripts/tripleo.sh --overcloud-images # Or reuse images # wget https://images.rdoproject.org/ocata/delorean/current-tripleo/stable/overcloud-full.tar # tar -xvf overcloud-full.tar # openstack overcloud image upload --update-existing </code></pre> <ul> <li>Download Ocata tripleo-heat-templates.</li> </ul> <pre><code> cd git clone -b stable/ocata https://github.com/openstack/tripleo-heat-templates tht-ocata </code></pre> <ul> <li>Configure the DNS (needed when upgrading the Overcloud).</li> </ul> <pre><code> neutron subnet-update `neutron subnet-list | grep ctlplane-subnet | awk '{print $2}'` --dns-nameserver 192.168.122.1 </code></pre> <ul> <li>Deploy an Ocata Overcloud.</li> </ul> <pre><code> openstack overcloud deploy \ --libvirt-type qemu \ --ntp-server pool.ntp.org \ --templates /home/stack/tht-ocata/ \ -e
/home/stack/tht-ocata/overcloud-resource-registry-puppet.yaml \ -e /home/stack/tht-ocata/environments/puppet-pacemaker.yaml </code></pre> <ul> <li>Install prerequisites on the nodes (if no DNS is configured this will fail, so check that your nodes can connect to the Internet).</li> </ul> <pre><code>cat &gt; upgrade_repos.yaml &lt;&lt; EOF parameter_defaults: UpgradeInitCommand: | set -e #Master repositories sudo curl -o /etc/yum.repos.d/delorean.repo https://trunk.rdoproject.org/centos7-master/current-passed-ci/delorean.repo sudo curl -o /etc/yum.repos.d/delorean-deps.repo https://trunk.rdoproject.org/centos7/delorean-deps.repo export HOME=/root cd /root/ if [ ! -d tripleo-ci ]; then git clone https://github.com/openstack-infra/tripleo-ci.git else pushd tripleo-ci git checkout master git pull popd fi if [ ! -d tripleo-heat-templates ]; then git clone https://github.com/openstack/tripleo-heat-templates.git else pushd tripleo-heat-templates git checkout master git pull popd fi ./tripleo-ci/scripts/tripleo.sh --repo-setup sed -i "s/includepkgs=/includepkgs=python-heat-agent*,/" /etc/yum.repos.d/delorean-current.repo #yum -y install python-heat-agent-ansible yum install -y python-heat-agent-* rm -f /usr/libexec/os-apply-config/templates/etc/puppet/hiera.yaml rm -f /usr/libexec/os-refresh-config/configure.d/40-hiera-datafiles rm -f /etc/puppet/hieradata/*.yaml yum remove -y python-UcsSdk openstack-neutron-bigswitch-agent python-networking-bigswitch openstack-neutron-bigswitch-lldp python-networking-odl crudini --set /etc/ansible/ansible.cfg DEFAULT library /usr/share/ansible-modules/ EOF </code></pre> <ul> <li>Download master tripleo-heat-templates.</li> </ul> <pre><code> cd git clone https://github.com/openstack/tripleo-heat-templates tht-master </code></pre> <ul> <li>Upgrade the Overcloud to master.</li> </ul> <pre><code> cd openstack overcloud deploy \ --libvirt-type qemu \ --ntp-server pool.ntp.org \ --templates /home/stack/tht-master/ \ -e
/home/stack/tht-master/overcloud-resource-registry-puppet.yaml \ -e /home/stack/tht-master/environments/puppet-pacemaker.yaml \ -e /home/stack/tht-master/environments/major-upgrade-composable-steps.yaml \ -e upgrade_repos.yaml </code></pre> <p>Note: if upgrading to a containerized Overcloud (Pike and beyond) do:</p> <pre><code>cat &gt; docker_registry.yaml &lt;&lt; EOF parameter_defaults: DockerNamespace: 192.168.24.1:8787/tripleoupstream DockerNamespaceIsRegistry: true EOF # This will take some time... openstack overcloud container image upload --config-file /usr/share/openstack-tripleo-common/container-images/overcloud_containers.yaml openstack overcloud container image prepare \ --namespace tripleoupstream \ --tag latest \ --env-file docker-centos-tripleoupstream.yaml cd source ~/stackrc export THT=/home/stack/tht-master openstack overcloud deploy --templates $THT \ --libvirt-type qemu \ --ntp-server pool.ntp.org \ -e $THT/overcloud-resource-registry-puppet.yaml \ -e $THT/environments/puppet-pacemaker.yaml \ -e $THT/environments/major-upgrade-composable-steps.yaml \ -e upgrade_repos.yaml \ -e $THT/environments/docker.yaml \ -e $THT/environments/docker-ha.yaml \ -e $THT/environments/major-upgrade-composable-steps-docker.yaml \ -e docker-centos-tripleoupstream.yaml \ -e docker_registry.yaml </code></pre> <ul> <li>Run the converge step ** Not tested on the containerized upgrade **</li> </ul> <pre><code> cd openstack overcloud deploy \ --libvirt-type qemu \ --ntp-server pool.ntp.org \ --templates /home/stack/tht-master/ \ -e /home/stack/tht-master/overcloud-resource-registry-puppet.yaml \ -e /home/stack/tht-master/environments/puppet-pacemaker.yaml \ -e /home/stack/tht-master/environments/major-upgrade-converge.yaml </code></pre> <p>If the last steps manage to finish successfully, you have just upgraded your Overcloud from Ocata to Pike (latest master).</p> <p>For more resources related to TripleO deployments, check out the <a
href="https://www.youtube.com/channel/UCNGDxZGwUELpgaBoLvABsTA">TripleO YouTube channel</a>.</p> <div style="font-size:10px"> <blockquote> <p><strong>Updated 2017/01/28:</strong> Working fine.</p> </blockquote> </div> Oh yeah, NES classic mini! I had the chance to find a treasure last week, the NES classic mini in game.es for its official price (Shame on you speculators!!!!). So.. I have loved this tiny console since the moment I opened the package. Lots of people... 2016-11-27T00:00:00+00:00 https://www.pubstack.com/blog/2016/11/27/nes-classic-mini Carlos Camacho <p>I had the chance to find a treasure last week, the NES classic mini in <a href="https://www.game.es/nintendo-classic-mini-nes-electronica-129248">game.es</a> for its official price (Shame on you speculators!!!!).</p> <p><img src="/static/nes-mini-classic-2016-nintendo.jpg" alt="" /></p> <p>So.. I have loved this tiny console since the moment I opened the package. Lots of people complain about the controller cord being too short (there are workarounds for this), so for me it’s not a problem.</p> <p>It boots up in less than 5 seconds, shows a nice interface with all 30 pre-loaded games ready to play in an alphabetically sorted list, and you have 4 memory slots for each game.</p> <p>Playing with this console is like going back 20 years, but so much better: perfect pixels and really good sound emulation.</p> <p>The last statement from Nintendo:</p> <blockquote> <p>The Nintendo Entertainment System: NES Classic Edition system is a hot item, and we are working hard to keep up with consumer demand. There will be a steady flow of additional systems through the holiday shopping season and into the new year. Please contact your local retailers to check product availability.
A selection of participating retailers can be found at www.Nintendo.com/nes-classic.</p> </blockquote> <p>You just need to ping retailers’ pages until you find stock again.</p> <p>If you can get one, go for it (but don’t buy it from speculators)!</p> TripleO cheatsheet This is a cheatsheet of some of my regularly used commands to test, develop or debug TripleO deployments. Deployments swift download overcloud Download the overcloud swift container files in the current folder (With the rendered j2 templates). heat resource-list --nested-depth 5... 2016-11-26T00:00:00+00:00 https://www.pubstack.com/blog/2016/11/26/tripleo-cheatsheet Carlos Camacho <p>This is a cheatsheet of some of my regularly used commands to test, develop or debug TripleO deployments.</p> <p>Deployments</p> <ul> <li> <code class="highlighter-rouge"> swift download overcloud </code><br /> <p class="tdesc"> Download the <code class="highlighter-rouge">overcloud</code> swift container files in the current folder (With the rendered j2 templates). </p> </li> <li> <code class="highlighter-rouge"> heat resource-list --nested-depth 5 overcloud | grep FAILED </code><br /> <p class="tdesc"> Show resources, filtering to get those that have failed. </p> </li> <li> <code class="highlighter-rouge"> heat deployment-show &lt;deployment_ID&gt; </code><br /> <p class="tdesc"> Get the deployment details for &lt;deployment_ID&gt;. </p> </li> <li> <code class="highlighter-rouge"> openstack image list </code><br /> <p class="tdesc"> List images. </p> </li> <li> <code class="highlighter-rouge"> openstack image delete &lt;image_ID&gt; </code><br /> <p class="tdesc"> Delete &lt;image_ID&gt;. </p> </li> <li> <code class="highlighter-rouge"> wget http://buildlogs.centos.org/centos/7/cloud/x86_64/tripleo_images/&lt;release&gt;/delorean/overcloud-full.tar </code><br /> <p class="tdesc"> Download the &lt;release&gt; overcloud images tar file [liberty|mitaka|newton|...]
</p> </li> <li> <code class="highlighter-rouge"> openstack overcloud image upload --update-existing </code><br /> <p class="tdesc"> Once the images are downloaded, this command will upload them to Glance. </p> </li> </ul> <p>Debugging CI</p> <ul> <li> <code class="highlighter-rouge"> http://status.openstack.org/zuul/ </code><br /> <p class="tdesc"> Check submissions CI status. </p> </li> <li> <code class="highlighter-rouge"> wget -e robots=off -r --no-parent &lt;patch_ID&gt; </code><br /> <p class="tdesc"> Download all logs from &lt;patch_ID&gt;. </p> </li> <li> <code class="highlighter-rouge"> console.html </code> &amp; <code class="highlighter-rouge"> logs/postci.txt.gz </code><br /> <p class="tdesc"> Relevant log files when debugging a TripleO Gerrit job. </p> </li> </ul> <blockquote> <p>If you think there are more useful commands to add to the list just add a <a href="https://github.com/pubstack/pubstack.github.io/issues/22">comment</a>!</p> </blockquote> <p><strong>Happy TripleOing!</strong></p> IT detox IT detox Explanation: No comments. Disclaimer. 2016-11-25T00:00:00+00:00 https://www.pubstack.com/blog/2016/11/25/pod Carlos Camacho <p>IT detox</p> <p><img src="/static/pod/2016-11-25-it-sicherheit-comic.jpg" alt="" /></p> <p>Explanation: No comments. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> The Venezuelan mug The Venezuelan mug Explanation: No comments. Disclaimer. 2016-11-24T00:00:00+00:00 https://www.pubstack.com/blog/2016/11/24/pod Carlos Camacho <p>The Venezuelan mug</p> <p><img src="/static/pod/2016-11-24-taza.png" alt="" /></p> <p>Explanation: No comments. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> I just don't like the new drafted TripleO logo I just don’t like the new drafted TripleO logo Explanation: No comments. Disclaimer.
2016-11-23T00:00:00+00:00 https://www.pubstack.com/blog/2016/11/23/pod Carlos Camacho <p>I just don’t like the new drafted TripleO logo</p> <p><img src="/static/pod/2016-11-23-owl-angry.jpg" alt="" /></p> <p>Explanation: No comments. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> How to recover a commit from GitHub's Reflog While writing this blog post, suddenly and without noticing, I ended up squashing/removing the commit holding those changes. The first thing that came to my mind was to locally check the Reflog to restore the commit, but sadly I had... 2016-11-23T00:00:00+00:00 https://www.pubstack.com/blog/2016/11/23/how-to-recover-a-commit-from-github-reflog Carlos Camacho <p>While writing <a href="http://www.pubstack.com/blog/2016/11/21/openstack-summit-2016-bcn.html">this</a> blog post, suddenly and without noticing, I ended up squashing/removing the commit holding those changes.</p> <p><img src="/static/crashed-those-commits.png" alt="" /></p> <p>The first thing that came to my mind was to locally check the Reflog to restore the commit, but sadly I had removed the repo from my laptop and the Reflog didn’t exist anymore. Then I went into panic as I didn’t want to lose the hours it took me to write the post. Then, I said… Does GitHub have a Reflog?
The sweet answer, yes..</p> <p>So let’s learn how to recover a commit from the GitHub Reflog.</p> <p>Relevant strings to fill:</p> <ul> <li>&lt;user&gt;: The user holding the git repo.</li> <li>&lt;repo&gt;: The repository name.</li> <li>&lt;recover-branch-name&gt;: The remote branch that you will create.</li> <li>&lt;sha-goes-here&gt;: The commit sha to be restored.</li> </ul> <p>Let’s curl GitHub to get all commits in the Reflog:</p> <pre><code>$ curl https://api.github.com/repos/&lt;user&gt;/&lt;repo&gt;/events </code></pre> <p>Now let’s create a remote branch with your commit.</p> <p>Choose your commit sha and run:</p> <pre><code>$ curl -i -H "Accept: application/json" -H "Content-Type: application/json" -X POST -d '{"ref":"refs/heads/&lt;recover-branch-name&gt;", "sha":"&lt;sha-goes-here&gt;"}' https://api.github.com/repos/&lt;user&gt;/&lt;repo&gt;/git/refs </code></pre> <p>You should now have a branch called &lt;recover-branch-name&gt; in your GitHub repo. You can safely cherry-pick it to your master branch or fetch those changes locally and do whatever you want with them.</p> <p>So yeah, I have to admit that I <s>squashed</s>crashed those commits somehow…</p> <p>I hope these tips are useful if you are ever in trouble like me, trying to recover some lost commits from GitHub.</p> Humanity... Humanity… Explanation: No comments. Disclaimer. 2016-11-22T00:00:00+00:00 https://www.pubstack.com/blog/2016/11/22/pod Carlos Camacho <p>Humanity…</p> <p><img src="/static/pod/2016-11-22-humanity-patch.jpg" alt="" /></p> <p>Explanation: No comments. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> Enabling nested KVM support for an instack-virt-setup deployment. The following bash snippet will enable nested KVM support in the host when deploying TripleO using instack-virt-setup. This will work on AMD or Intel architectures. #!/bin/bash echo "Checking if nested KVM is enabled in the host." ARCH=$(lscpu | grep Architecture...
2016-11-21T12:30:00+00:00 https://www.pubstack.com/blog/2016/11/21/enabling-nested-kvm-on-tripleo-host Carlos Camacho <p>The following bash snippet will enable nested KVM support in the host when deploying TripleO using instack-virt-setup.</p> <p>This will work on AMD or Intel architectures.</p> <pre><code class="language-bash">#!/bin/bash echo "Checking if nested KVM is enabled in the host." # Both Intel and AMD hosts report the x86_64 architecture, # so check the CPU vendor string instead. if grep -q GenuineIntel /proc/cpuinfo; then ARCH_BRAND=intel KVM_STATUS_FILE=/sys/module/kvm_intel/parameters/nested ENABLE_NESTED_KVM=Y else ARCH_BRAND=amd KVM_STATUS_FILE=/sys/module/kvm_amd/parameters/nested ENABLE_NESTED_KVM=1 fi if [[ -f $KVM_STATUS_FILE ]]; then KVM_CURRENT_STATUS=$(head -n 1 $KVM_STATUS_FILE) if [[ "${KVM_CURRENT_STATUS^^}" != "${ENABLE_NESTED_KVM^^}" ]]; then echo "This host does not have nested KVM enabled, enabling." sudo rmmod kvm-$ARCH_BRAND sudo sh -c "echo 'options kvm-$ARCH_BRAND nested=$ENABLE_NESTED_KVM' &gt;&gt; /etc/modprobe.d/dist.conf" sudo modprobe kvm-$ARCH_BRAND else echo "Nested KVM support is already enabled." fi else echo "$KVM_STATUS_FILE does not exist." fi </code></pre> <p>By default nested virtualization with KVM is disabled in the host, so in order to run the overcloud-pingtest correctly we have two options: either run the previous snippet on the host, or, when deploying the Compute node in a virtual machine, add <code>--libvirt-type qemu</code> to the deployment command.
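</p>

<p>The choice between the two options can also be scripted. The helper below is hypothetical, not part of TripleO; it just reads the nested module parameter described above and prints the matching deploy flag:</p>

<pre><code class="language-bash"># Hypothetical helper: print the --libvirt-type flag to pass to the
# deploy command, based on the host's nested KVM module parameter.
nested_file=""
for f in /sys/module/kvm_intel/parameters/nested /sys/module/kvm_amd/parameters/nested; do
  if [ -f "$f" ]; then nested_file="$f"; fi
done
state=""
if [ -n "$nested_file" ]; then state=$(cat "$nested_file"); fi
# Intel reports Y/N, AMD reports 1/0.
if [ "$state" = "Y" ] || [ "$state" = "1" ]; then
  LIBVIRT_TYPE=kvm    # nested KVM is active, full virtualization works
else
  LIBVIRT_TYPE=qemu   # fall back to plain emulation
fi
echo "--libvirt-type $LIBVIRT_TYPE"
</code></pre>

<p>On a host with nested KVM enabled it prints <code>--libvirt-type kvm</code>; everywhere else it falls back to <code>qemu</code>, matching the manual advice above.</p>

<p>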
If you do neither, launching instances on the deployed overcloud will fail.</p> <p>Here is an example of the deployment command, forcing the libvirt type to qemu.</p> <pre><code class="language-bash">cd openstack overcloud deploy \ --libvirt-type qemu \ --ntp-server pool.ntp.org \ --templates /home/stack/tripleo-heat-templates \ -e /home/stack/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \ -e /home/stack/tripleo-heat-templates/environments/puppet-pacemaker.yaml </code></pre> <p>Have a happy TripleO deployment!</p> Ocata OpenStack summit 2016 - Barcelona A few weeks ago I had the opportunity to attend the Barcelona OpenStack summit ‘Ocata design session’ and this post collects some overall information about it. In order to achieve this, I’m crawling into my paper... 2016-11-21T00:00:00+00:00 https://www.pubstack.com/blog/2016/11/21/openstack-summit-2016-bcn Carlos Camacho <p>A few weeks ago I had the opportunity to attend the Barcelona OpenStack summit ‘Ocata design session’ and this post collects some overall information about it. In order to achieve this, I’m crawling through my paper notes to highlight the aspects that IMHO are relevant.</p> <p><img src="/static/openstack-summit-2016-bcn.jpeg" alt="" /></p> <hr /> <blockquote> <p>Sessions list by date.</p> </blockquote> <h3 id="tuesday---oct-25th">Tuesday - Oct. 25th</h3> <ul> <li>RDO Booth: Carlos Camacho TripleO composable roles demo (12:15pm-12:55pm)</li> <li>What the Heck is OoO: Owls All the Way Down (5:55pm – 6:35)</li> </ul> <h3 id="wednesday---oct-26th">Wednesday - Oct.
26th</h3> <ul> <li>Anomaly Detection in Contrail Networking (1:15pm-1:29pm)</li> <li>Freezer: Plugin Architecture and Deduplication (3:05pm-3:45pm)</li> <li>TripleO: Containers - Current Status and Roadmap (3:55pm-4:35pm)</li> <li>TripleO: Work Session - Growing the team (5:05pm-5:45pm)</li> <li>TripleO: Work Session - CI - current status and roadmap (5:55pm-6:35pm)</li> </ul> <h3 id="thursday---oct-27th">Thursday - Oct. 27th</h3> <ul> <li>Zuul v3: OpenStack and Ansible Native CI/CD (11:00am-11:40am)</li> <li>The Latest in the Container World and the Role of Container in OpenStack (11:50am-12:30pm)</li> <li>TripleO: Upgrades - current status and roadmap (1:50pm-2:30pm)</li> <li>Mistral: Mistral and StackStorm (3:30pm-4:10pm)</li> <li>Nokia: TOSCA &amp; Mistral: Orchestrating End-to-End Telco Grade NFV (5:30pm-6:10pm)</li> </ul> <h3 id="friday---oct-28th">Friday - Oct. 28th</h3> <ul> <li>TripleO: Work Session - Composable Undercloud deployment with Heat (9:00am-9:20am)</li> <li>TripleO: Work Session - GUI, CLI, Validations current status, roadmap, requirements (9:20am-9:40am)</li> <li>TripleO: Work Session - Multiple topics - Blueprints, specs, tools and Ocata summary. (9:50am-10:30am)</li> </ul> <hr /> <p>Beyond the analysis of “What I did there in a week” I want to state a few facts that are relevant to me.</p> <blockquote> <p>Why is it important to attend a design session (my case: upstream TripleO developer)?</p> </blockquote> <p>I think that when working remotely on OpenStack projects, especially one as complex as TripleO, it is really hard to know what other people are doing. Design sessions therefore push engineers to find out what their peers are working on, in other OpenStack projects or even in the same one ;)</p> <p>This will give you ideas for future features, new services to integrate, and issues that you might face in the future, among many other things. 
Also, in the specific case of TripleO, if you are interested in working on a particular service, you can join that service’s sessions to learn more about it.</p> <blockquote> <p>Where is the value for companies when sending engineers to design sessions?</p> </blockquote> <p>There might be several answers to this question, but I believe the overall one is that sending engineers to the design sessions keeps them aligned with their company’s goals, especially when several companies are involved in the same project. It also allows team members to get to know each other; maybe this is a soft benefit, but for me it is as important as being aligned on future features or architectural agreements.</p> <blockquote> <p>Is it really mandatory to send people to design sessions?</p> </blockquote> <p>I think this is not mandatory at all, but a relevant factor is whether all the knowledge and value generated in those sessions can be delivered to and processed by the rest of the team members.</p> <blockquote> <p>Do attendees gain value when attending these design sessions?</p> </blockquote> <p>Of course they gain a lot of value:</p> <ul> <li>You might have a wrong impression of people from IRC; meeting them in person can change this dramatically ‘or not’.</li> <li>Get to know and engage with your team members and other peers.</li> <li>Align better with your project goals.</li> <li>Discuss blueprints and gain a better understanding of the feature life-cycle and roadmap.</li> <li>Know what other people are doing.</li> <li>Improve your overall knowledge of other projects (this time is for doing exactly that, not your free time, even if, like me, you enjoy it).</li> </ul> <blockquote> <p>If we are in a design session, are we in “working mode” or just in “learning mode”?</p> </blockquote> <p>Hardest question by far; I had the nagging feeling that I was not working enough. 
There were a lot of distractions if you actually wanted to make some time for coding or reviewing submissions. But that’s the thing: I believe it is a good time to align, and after the summit you will always have time for coding :)</p> <blockquote> <p>What about some business alignment?</p> </blockquote> <p>I believe this is also an important factor for understanding how OpenStack evolves over time, the release announcements, and how the summits themselves are evolving.</p> Querying haproxy data using socat from CLI Currently most users don’t have a way to check the haproxy status in a TripleO virtual deployment (via web browser) unless some tunnels were previously created for that purpose. So let’s check some haproxy data from our controller. Check... 2016-11-04T00:00:00+00:00 https://www.pubstack.com/blog/2016/11/04/querying-haproxy-data-using-socat-from-cli Carlos Camacho <p>Currently most users don’t have a way to check the haproxy status in a TripleO virtual deployment (via web browser) unless some tunnels were previously created for that purpose.</p> <p>So let’s check some haproxy data from our controller.</p> <p>Check the controller IP:</p> <pre><code class="language-bash">nova list
</code></pre> <p>Connect to the controller:</p> <pre><code>ssh [email protected]
</code></pre> <p>Now, we need to have socat installed:</p> <pre><code>sudo yum install -y socat
</code></pre> <p>By default, haproxy is already configured to dump stats data to <code>/var/run/haproxy.sock</code>, so let’s query haproxy to get some data from it:</p> <ul> <li>Show details like the haproxy version, PID, current connections, session rates, and tasks, among others.</li> </ul> <pre><code>echo "show info" | socat unix-connect:/var/run/haproxy.sock stdio
</code></pre> <ul> <li>Echo the stats about all frontends and backends as a CSV.</li> </ul> <pre><code>echo "show stat" | socat unix-connect:/var/run/haproxy.sock stdio
</code></pre> <ul> <li>Display information about errors if there are any.</li> 
</ul> <pre><code>echo "show errors" | socat unix-connect:/var/run/haproxy.sock stdio
</code></pre> <ul> <li>Display open sessions.</li> </ul> <pre><code>echo "show sess" | socat unix-connect:/var/run/haproxy.sock stdio
</code></pre> How to un-brick a Sony Xperia S and install Oneofakind Android 6.0.1_r10 Ok, after playing with converting partitions to f2fs I made a huge mistake and corrupted the partition table of my mobile… Yeahp, I messed it up. The following notes are meant to be a reminder about how to follow this process... 2016-10-08T00:00:00+00:00 https://www.pubstack.com/blog/2016/10/08/un-brick-sony-xperia-s Carlos Camacho <p>Ok, after playing with converting partitions to f2fs I made a huge mistake and corrupted the partition table of my mobile… Yeahp, I messed it up.</p> <p><img src="/static/fist.gif" alt="" /></p> <p>The following notes are meant to be a reminder of how to follow this process, as it’s a cumbersome and time-consuming task to get the correct versions of the firmware and tools needed.</p> <p>A hard brick is the state of an Android device that won’t boot at all, i.e. 
no boot loop, no recovery, and no charging.</p> <p>To fix the phone, the steps below flash a new ROM and then upgrade it to Android 6.0.1.</p> <h1 id="prerequisites">Prerequisites</h1> <ul> <li>Download all the required files from <a href="http://goo.gl/aBYq5w">here</a> (TWRP, flashtool, flashtool drivers, OEM firmware, root exploit and Oneofakind ROM).</li> <li>Have a PC near you.</li> <li>A USB cable.</li> <li>1 beer (save it for the end of the tutorial).</li> </ul> <h1 id="part-1">Part 1</h1> <p>The first part of the tutorial will get you a working installation of Android 4.1 Jelly Bean on the bricked phone.</p> <h2 id="install-flashtool">Install Flashtool</h2> <p>Download and install Flashtool; this tool is in the package that you should have downloaded in the prerequisites.</p> <h2 id="install-flashtool-drivers-for-your-phone">Install flashtool drivers for your phone</h2> <p>In order to install the flashtool drivers you need to run some preliminary steps, as Windows won’t let you install unsigned drivers (these are also inside the prerequisites package).</p> <p>Run the following steps on your PC:</p> <ul> <li>Press the Windows key + R together and in the ‘Run’ box type <code>shutdown.exe /r /o /f /t 00</code></li> <li>Now make the following selections to boot into the Start Up Settings screen: Troubleshoot &gt; Advanced options &gt; Start Up Settings &gt; Restart</li> <li>Then, when the machine restarts, select number 7, i.e. ‘Disable driver signature enforcement’. Your machine will start with driver signature enforcement disabled until the next reboot.</li> </ul> <p>Now you can install the Flashtool drivers.</p> <ul> <li>From the install options, select the flashmode and fastboot drivers.</li> </ul> <p>Windows will warn that the driver is not signed and will require you to confirm the installation. 
Once the installation is complete, reboot the machine.</p> <h2 id="start-the-xperia-s-in-flashmode">Start the Xperia S in flashmode</h2> <p>Switch off your Xperia S first, then press and hold the <code>Volume Down</code> button. Connect to your PC using a USB cable while holding down the <code>Volume Down</code> button on your Xperia S. Your phone should be in Flash Mode now and the device’s LED light should turn green.</p> <h2 id="flash-the-image-using-flashtool">Flash the image using flashtool</h2> <p>With your phone in flashmode, open Flashtool and then click on the lightning bolt. Select the folder where you have the firmware image <code>LT26i_6.2.B.1.96_World.ftf</code> (downloaded from the prerequisites package). Check all the items in the wipe list and uncheck all items in the exclude list. Click flash and wait until it finishes.</p> <p>Reboot the phone and you should have Android 4.1 Jelly Bean installed and ready to use, but we still want to install Android 6.0.1, so the first part of the tutorial is finished.</p> <hr /> <h1 id="part-2">Part 2</h1> <p>After having a working installation of Android on our already un-bricked phone, the next steps upgrade the standard ROM to Oneofakind Android 6.0.1_r10.</p> <h2 id="unlock-the-phone-boot-loader">Unlock the phone boot loader</h2> <p>Follow the instructions from the official <a href="http://developer.sonymobile.com/unlockbootloader/unlock-yourboot-loader/">Sony website</a></p> <h2 id="gain-root-access-to-the-phone">Gain root access to the phone.</h2> <p>First, on your mobile phone:</p> <ul> <li> <p>Activate USB Debugging, Settings -&gt; Developer Options</p> </li> <li> <p>Activate Unknown Sources, Settings -&gt; Security</p> </li> </ul> <p>Now open the application <code>RunMe.bat</code> within Root_with_Restore_by_Bin4ry_v36 and select option 2. Your phone should be rooted and rebooted.</p> <h2 id="installing-twrp">Installing TWRP</h2> <p>Switch off your Xperia S first. 
Press and hold the <code>Volume Up</code> button. Connect to your PC using a USB cable while holding down the <code>Volume Up</code> button on your Xperia S. Your Xperia S should be in fastboot mode now and the device’s LED light should turn blue. Then run:</p> <pre><code>fastboot flash boot twrp-3.0.2-0-nozomi.img
</code></pre> <p>At this point you should have the recovery installed.</p> <h2 id="re-partitioning">Re-partitioning</h2> <p>Let’s start TWRP. Turn on the phone and when the Sony screen appears press the <code>Volume Up</code> button.</p> <p>Go to Mount in the TWRP GUI (uncheck system, data, cache), open the terminal and run:</p> <pre><code>fdisk -l /dev/block/mmcblk0
</code></pre> <p>Copy the output of the command to a file, together with your backup.</p> <p>The interesting parts are the following lines:</p> <pre><code>/dev/block/mmcblk0p14   42945  261695   7000024  83  Linux
/dev/block/mmcblk0p15  261696  954240  22161424  83  Linux
</code></pre> <p>The values may not be exactly the same for you, depending on the size of your /data (p14) and /sdcard (p15). Run:</p> <pre><code>fdisk /dev/block/mmcblk0
</code></pre> <p>The following steps will merge p14 and p15 into one big partition for data; do this, otherwise you won’t be able to use the internal SD.</p> <pre><code>Command (m for help): p
Command (m for help): d
Partition number (1-15): 15
Command (m for help): d
Partition number (1-14): 14
Command (m for help): n
First cylinder (769-954240, default 769): 42945
Last cylinder or +size or +sizeM or +sizeK (42945-954240, default 954240): (just press enter if the default value is correct)
Using default value 954240
Command (m for help): t
Partition number (1-14): 14
Hex code (type L to list codes): 83
Command (m for help): w
The partition table has been altered. 
Calling ioctl() to re-read partition table
</code></pre> <p>Once re-partitioning is done, do NOT do anything else; just reboot the device (to be sure that the partition table is taken into account by the kernel).</p> <p>Now we will convert /data and /cache to F2FS; ext4 is not supported anymore on nAOSProm. You don’t need to worry about the 16384 bytes reserved for encryption; TWRP will do it for you.</p> <h2 id="install-and-run-twrp-again">Install and run TWRP again</h2> <p>Switch off your Xperia S first. Press and hold the <code>Volume Up</code> button. Connect to your PC using a USB cable while holding down the <code>Volume Up</code> button on your Xperia S. Your phone should be in fastboot mode now and the device’s LED light should turn blue, then run:</p> <pre><code>fastboot flash boot twrp-3.0.2-0-nozomi.img
</code></pre> <p>Let’s start TWRP. Turn on the phone and when the Sony screen appears press the <code>Volume Up</code> button.</p> <p>Mount all the partitions.</p> <p>From the menu click:</p> <ul> <li> <p>Wipe -&gt; Advanced Wipe -&gt; select Data -&gt; Repair or Change File system -&gt; Change File System -&gt; F2FS -&gt; Swipe to Change</p> </li> <li> <p>Wipe -&gt; Advanced Wipe -&gt; select Cache -&gt; Repair or Change File system -&gt; Change File System -&gt; F2FS -&gt; Swipe to Change</p> </li> </ul> <p>Once done, again, do NOT do anything else; just reboot the device (required by TWRP).</p> <p>Congratulations, if everything went fine you should be able to mount /cache and /data and to see a big /data volume of around 28 GiB.</p> <p>Boot the phone again in recovery mode (TWRP).</p> <p>Start TWRP, mount all partitions and from your PC run:</p> <pre><code>adb push open_gapps-arm-6.0-pico-20161006.zip /sdcard/
adb push oneofakind_nozomi-27-Jan-2016.zip /sdcard/
</code></pre> <p>This will load those files into your smartphone, allowing the installation of Oneofakind.</p> <h2 id="final-installation-of-oneofakind">Final installation of Oneofakind….</h2> 
<p>From the TWRP menu click install and select the two files that we have just uploaded (oneofakind_nozomi-27-Jan-2016.zip and open_gapps-arm-6.0-pico-20161006.zip).</p> <p>Click flash and wait until the installation is complete; meanwhile, open the beer (the last prerequisite) and drink it.</p> <p>Cheers!</p> Deployment tips for puppet-tripleo changes This post will describe different ways of debugging puppet-tripleo changes. Deploying puppet-tripleo using gerrit patches or source code repositories In some cases, dependencies should be merged first in order to test newer patches when adding new features to THT. With... 2016-09-29T09:00:00+00:00 https://www.pubstack.com/blog/2016/09/29/tripleo-debugging-tips Carlos Camacho <p>This post will describe different ways of debugging puppet-tripleo changes.</p> <h2 id="deploying-puppet-tripleo-using-gerrit-patches-or-source-code-repositories">Deploying puppet-tripleo using gerrit patches or source code repositories</h2> <p>In some cases, dependencies should be merged first in order to test newer patches when adding new features to THT. With the following procedure, the user will be able to create the overcloud images using work-in-progress patches from the gerrit code review without having them merged (for CI testing purposes).</p> <p>If you are using third-party repos included in the overcloud image, such as 
the puppet-tripleo repository, your changes will not be available by default in the overcloud until you write them into the overcloud image (by default: overcloud-full.qcow2).</p> <p>In order to make <del>quick</del> changes to the overcloud image for testing purposes, you can:</p> <p>Export the paths to your submission by following an <a href="http://tripleo.org/developer/in_progress_review.html">In-Progress review</a>:</p> <pre><code class="language-bash">export DIB_INSTALLTYPE_puppet_tripleo=source
export DIB_REPOLOCATION_puppet_tripleo=https://review.openstack.org/openstack/puppet-tripleo
export DIB_REPOREF_puppet_tripleo=refs/changes/25/310725/14
</code></pre> <p>In order to avoid noise on IRC, it is possible to clone puppet-tripleo and apply the changes from your github account. In some cases this is particularly useful, as there is no need to update the patchset number.</p> <pre><code class="language-bash">export DIB_INSTALLTYPE_puppet_tripleo=source
export DIB_REPOLOCATION_puppet_tripleo=https://github.com/&lt;usergoeshere&gt;/puppet-tripleo
</code></pre> <p>Remove previously created images from glance and from the user home folder by running:</p> <pre><code class="language-bash">rm -rf /home/stack/overcloud-full.*
glance image-delete overcloud-full
glance image-delete overcloud-full-initrd
glance image-delete overcloud-full-vmlinuz
</code></pre> <p>After this step the images can be recreated by executing:</p> <pre><code class="language-bash">./tripleo-ci/scripts/tripleo.sh --overcloud-images
</code></pre> <h2 id="debugging-puppet-tripleo-from-overcloud-images">Debugging puppet-tripleo from overcloud images</h2> <p>For debugging purposes, it is possible to mount the overcloud .qcow2 file:</p> <pre><code class="language-bash">#Install the libguestfs tools:
sudo yum install -y libguestfs-tools

#Create a temp folder to mount the overcloud-full image:
mkdir /tmp/overcloud-full

#Mount the image:
guestmount -a overcloud-full.qcow2 -i --rw /tmp/overcloud-full

#Check and 
#validate all the changes to your overcloud image under /tmp/overcloud-full.
#For example, in this step you can go to /opt/puppet-modules/tripleo.

#Umount the image
sudo umount /tmp/overcloud-full
</code></pre> <p>From the mounted image file it is also possible to run, for testing purposes, the puppet manifests by using <code>puppet apply</code> and including your manifests:</p> <pre><code class="language-bash">sudo puppet apply -v --debug --modulepath=/tmp/overcloud-full/opt/stack/puppet-modules -e "include ::tripleo::services::time::ntp"
</code></pre> Crazy murdering robots Crazy murdering robots Explanation: No comments. Disclaimer. 2016-09-22T00:00:00+00:00 https://www.pubstack.com/blog/2016/09/22/pod Carlos Camacho <p>Crazy murdering robots</p> <p><img src="/static/pod/2016-09-22-killing_robot.jpg" alt="" /></p> <p>Explanation: No comments. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> 6 or 9 6 or 9 Explanation: No comments. Disclaimer. 2016-09-21T00:00:00+00:00 https://www.pubstack.com/blog/2016/09/21/pod Carlos Camacho <p>6 or 9</p> <p><img src="/static/pod/2016-09-21-6-or-9.png" alt="" /></p> <p>Explanation: No comments. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> So You Think You Are Ready For The RHCSA Exam? I just want to share this amazing blog post from Finnbarr P. Murphy, originally from the Musing of an OS plumber blog. Again, this is awesome! – So you have studied hard, maybe even attended a week or two of... 2016-09-18T00:00:00+00:00 https://www.pubstack.com/blog/2016/09/18/So-You-Think-You-Are-Ready-For-The-RHCSA-Exam Finnbarr P. Murphy <p>I just want to share this amazing blog post from Finnbarr P. 
Murphy, originally from the <a href="http://blog.fpmurphy.com/2016/09/so-you-think-you-are-ready-for-the-rhcsa-exam.html">Musing of an OS plumber</a> blog.</p> <p>Again, this is awesome!</p> <p>–</p> <p>So you have studied hard, maybe even attended a week or two of formal training, for the Red Hat Certified System Administrator exam and now you think you are ready to take the actual examination.</p> <p>Before you spend your money (currently $400) on the actual examination, why not download this custom <a href="http://fpmurphy.com/public/RHCSA_SampleTest_1.ova">CentOS 7.2 VM</a> and attempt a real-world test of your RHCSA skills.</p> <p>This VM, which is in the form of an OVA (Open Virtualization Archive), will work with VMware Workstation 10 or later. Sorry, but if you want to use the VM in other environments, you are going to have to figure out how to do so; no support will be forthcoming from me. You will also need network access from your host system to the default public CentOS repos.</p> <p>There are twelve (12) tasks that you need to complete in 90 minutes from VM power up. Most, if not all, of these tasks will probably appear in the real exam. But first you must fix a problem booting the operating system and get past the lost root password before you can read the file /TASKS, which contains the twelve tasks that you must complete. Oh, and by the way, networking and package management are broken. You will have to get networking and package management working in order to install some necessary packages.</p> <p>Just like in the real examination, no answers are or will be provided.</p> <p>If you cannot correctly complete all the tasks in 120 minutes or less, you are absolutely NOT ready for the actual RHCSA exam.</p> <p>Good luck!</p> Debugging submissions errors in TripleO CI Landing upstream submissions might be hard if you are not passing all the CI jobs that try to check that your code actually works. 
Let’s assume that CI is working properly without any kind of infra issue or without any... 2016-08-25T00:00:00+00:00 https://www.pubstack.com/blog/2016/08/25/debugging-submissions-errors-in-tripleo-ci Carlos Camacho <p>Landing upstream submissions might be hard if you are not passing all the CI jobs that try to check that your code actually works.</p> <p>Let’s assume that CI is working properly, without any kind of infra issue and without any error introduced by mistake from other submissions. In that case, we might end up having something like:</p> <p><img src="/static/gerrit_failed_jobs.png" alt="" /></p> <p>The first thing we should do is double-check the status of all the other jobs on the TripleO CI status page. This can be checked on the following site:</p> <p><a href="http://tripleo.org/cistatus.html">http://tripleo.org/cistatus.html</a></p> <p>Also, we can get the jobs status by checking the Zuul dashboard.</p> <p><a href="http://status.openstack.org/zuul/">http://status.openstack.org/zuul/</a></p> <p>Or by checking the TripleO test cloud nodepool.</p> <p><a href="http://grafana.openstack.org/dashboard/db/nodepool-tripleo-test-cloud">http://grafana.openstack.org/dashboard/db/nodepool-tripleo-test-cloud</a></p> <p>After checking that there are jobs passing CI, let’s see why our job is not passing correctly.</p> <p>For each job the folder structure should be similar to:</p> <pre><code>[TXT] console.html
[DIR] logs/
[DIR] overcloud-cephstorage-0/
[DIR] overcloud-controller-0/
[DIR] overcloud-novacompute-0/
[   ] postci.txt.gz
[DIR] undercloud/
</code></pre> <p>It’s possible to check the deployment status in the <code>console.html</code> file; there you will see the result of all the deployment steps executed in order to pass the CI job.</p> <p>In the case of, for example, 
a failed deployment, you can check <code>postci.txt.gz</code> to get the actual standard error from the deployment.</p> <p>Also, the folders <code>overcloud-cephstorage-0</code>, <code>overcloud-controller-0</code> and <code>overcloud-novacompute-0</code> hold the content of <code>/var</code>, which contains all the service logs.</p> <p>Another useful tip is to fetch the whole job logs folder with wget and crawl it for a string containing the <code>Error</code> word.</p> <pre><code class="language-bash">#Get the CI job folder, i.e. using the following URL.
wget -e robots=off -r --no-parent http://logs.openstack.org/00/000000/0/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/xxxxxx/
#Parse:
grep -iR "Error: " *
</code></pre> <p>You will probably find something pointing out an error, which hopefully will give you clues about the next steps to fix it and land your submissions as soon as possible.</p> BAND-AID for OOM issues with TripleO manual deployments This post will explain how to fix OOM issues when using TripleO. If running free -m from your Undercloud or Overcloud nodes and getting some output like: [asdf@fdsa]$ free -m total used free shared buff/cache available Mem: 7668 5555 219... 
2016-08-23T00:00:00+00:00 https://www.pubstack.com/blog/2016/08/23/oom-swap-fix-in-tripleo Carlos Camacho <p>This post will explain how to fix OOM issues when using TripleO.</p> <p><img src="/static/bandaid.jpg" alt="" /></p> <p>If running <code>free -m</code> from your Undercloud or Overcloud nodes and getting some output like:</p> <pre><code>[asdf@fdsa]$ free -m
              total   used   free   shared   buff/cache   available
Mem:           7668   5555    219     1065         1893         663
</code></pre> <p>And as in the example there is no line pointing to the swap memory size and/or usage, you might not be using swap in your TripleO deployments. To enable it, you just have to follow two steps.</p> <p>First, in the Undercloud: when deploying stacks you might find that heat-engine (4 workers) takes a lot of RAM, in which case, for specific usage peaks, it can be useful to have a swap file. In order to have this swap file enabled and used by the OS, execute the following instructions in the Undercloud:</p> <pre><code class="language-bash">#Add a 4GB swap file to the Undercloud
sudo dd if=/dev/zero of=/swapfile bs=1024 count=4194304
sudo mkswap /swapfile
#Turn ON the swap file
sudo chmod 600 /swapfile
sudo swapon /swapfile
#Enable it on start
echo "/swapfile swap swap defaults 0 0" | sudo tee -a /etc/fstab
</code></pre> <p>Also, when deploying the Overcloud nodes, the controller might face some RAM usage peaks; in that case, create a swap file on each Overcloud node by using an already existing “extraconfig swap” template.</p> <p>To achieve this second part, we just need to use the environment file that loads the swap template in the resource <a href="https://review.openstack.org/#/c/418273/">registry</a> when deploying the overcloud.</p> <p>Now, deploy your Overcloud as usual, i.e.:</p> <pre><code class="language-bash">cd
openstack overcloud deploy \
  --libvirt-type qemu \
  --ntp-server pool.ntp.org \
  --templates /home/stack/tripleo-heat-templates \
  -e /home/stack/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
  -e /home/stack/tripleo-heat-templates/environments/enable-swap.yaml \
  -e /home/stack/tripleo-heat-templates/environments/puppet-pacemaker.yaml
</code></pre> <p>Bye bye OOMs!!!!</p> TripleO deep dive session #6 (Overcloud - Physical network) This is the sixth video from a series of “Deep Dive” sessions related to TripleO deployments. In this session Dan Prince will dig into the physical overcloud networks. So please, check the full session content on the TripleO YouTube channel.... 2016-08-15T12:00:00+00:00 https://www.pubstack.com/blog/2016/08/15/tripleo-deep-dive-session-6 Carlos Camacho <p>This is the sixth video from a series of “Deep Dive” sessions related to <a href="http://www.tripleo.org/">TripleO</a> deployments.</p> <p>In this session Dan Prince will dig into the physical overcloud networks.</p> <p>So please, check the full <a href="https://www.youtube.com/watch?v=zYNq2uT9pfM">session</a> content on the <a href="https://www.youtube.com/channel/UCNGDxZGwUELpgaBoLvABsTA/">TripleO YouTube channel</a>.</p> <div class="center"> <iframe width="560" height="315" src="https://www.youtube.com/embed/zYNq2uT9pfM" frameborder="0" allowfullscreen=""></iframe> </div> <p>Please check the <a href="http://www.pubstack.com/blog/2017/06/15/tripleo-deep-dive-session-index.html">sessions index</a> to have access to all available content.</p> TripleO deep dive session #5 (Undercloud - Under the hood) This is the fifth video from a series of “Deep Dive” sessions related to TripleO deployments. In this session James Slagle and Steven Hardy will dig into some underlying aspects related to the TripleO Undercloud. This video session aims to... 
2016-08-05T20:00:00+00:00 https://www.pubstack.com/blog/2016/08/05/tripleo-deep-dive-session-5 Carlos Camacho <p>This is the fifth video from a series of “Deep Dive” sessions related to <a href="http://www.tripleo.org/">TripleO</a> deployments.</p> <p>In this session James Slagle and Steven Hardy will dig into some underlying aspects related to the TripleO Undercloud.</p> <p>This video session aims to cover the following sections:</p> <ul> <li>What is under the hood of a TripleO undercloud deployment.</li> <li>Description of the undercloud components.</li> <li>Show the undercloud components interaction.</li> <li>Undercloud installation process.</li> <li>Undercloud customization.</li> <li>How to apply and test submissions in instack-undercloud.</li> </ul> <p>So please, check the full <a href="https://www.youtube.com/watch?v=h32z6Nq8Byg">session</a> content on the <a href="https://www.youtube.com/channel/UCNGDxZGwUELpgaBoLvABsTA/">TripleO YouTube channel</a>.</p> <div class="center"> <iframe width="560" height="315" src="https://www.youtube.com/embed/h32z6Nq8Byg" frameborder="0" allowfullscreen=""></iframe> </div> <p>Please check the <a href="http://www.pubstack.com/blog/2017/06/15/tripleo-deep-dive-session-index.html">sessions index</a> to have access to all available content.</p> TripleO deep dive session #4 (Puppet modules) This is the fourth video from a series of “Deep Dive” sessions related to TripleO deployments. This session will cover a series of basic Puppet topics related to TripleO deployments. This video session aims to cover the following sections: Introduction... 
2016-08-01T10:00:00+00:00 https://www.pubstack.com/blog/2016/08/01/tripleo-deep-dive-session-4 Carlos Camacho <p>This is the fourth video from a series of “Deep Dive” sessions related to <a href="http://www.tripleo.org/">TripleO</a> deployments.</p> <p>This session will cover a series of basic Puppet topics related to TripleO deployments.</p> <p>This video session aims to cover the following sections:</p> <ul> <li>Introduction to the Puppet OpenStack modules.</li> <li>Services deployment using Puppet profiles.</li> <li>Deployment composability with Heat.</li> <li>Bring your own service to TripleO.</li> </ul> <p>So please, check the full <a href="https://www.youtube.com/watch?v=-b4cdfzvFDY">session</a> content on the <a href="https://www.youtube.com/channel/UCNGDxZGwUELpgaBoLvABsTA/">TripleO YouTube channel</a>.</p> <div class="center"> <iframe width="560" height="315" src="https://www.youtube.com/embed/-b4cdfzvFDY" frameborder="0" allowfullscreen=""></iframe> </div> <p>Please check the <a href="http://www.pubstack.com/blog/2017/06/15/tripleo-deep-dive-session-index.html">sessions index</a> to have access to all available content.</p> Responsiveness (Graphical description) Responsiveness (Graphical description) Explanation: No comments. Disclaimer. 2016-08-01T00:00:00+00:00 https://www.pubstack.com/blog/2016/08/01/pod Carlos Camacho <p>Responsiveness (Graphical description)</p> <p><img src="/static/pod/2016-08-01-responsive.png" alt="" /></p> <p>Explanation: No comments. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> Testing instack-undercloud submissions locally This post is to describe how to run/test gerrit submissions related to instack-undercloud locally. For this example I’m going to use this submission: https://review.openstack.org/#/c/347389/ The following steps allow testing the submissions related to instack-undercloud in a working environment. ./tripleo-ci/scripts/tripleo.sh... 
2016-07-26T00:00:00+00:00 https://www.pubstack.com/blog/2016/07/26/testing-tripleo-undercloud-gerrit-submission Carlos Camacho <p>This post describes how to run/test gerrit submissions related to instack-undercloud locally.</p> <p>For this example I’m going to use this submission: https://review.openstack.org/#/c/347389/</p> <p>The following steps allow testing the submissions related to instack-undercloud in a working environment.</p> <pre><code class="language-bash">./tripleo-ci/scripts/tripleo.sh --delorean-setup
./tripleo-ci/scripts/tripleo.sh --delorean-build openstack/instack-undercloud
cd tripleo/instack-undercloud/

#The submission to be tested
git review -d 347389

cd
./tripleo-ci/scripts/tripleo.sh --delorean-build openstack/instack-undercloud
rpm -qa | grep instack-undercloud
sudo rpm -e --nodeps &lt;old_installed_instack-undercloud&gt;
find tripleo/ -name "*rpm"
sudo rpm -iv --replacepkgs --force &lt;located package&gt;

#Here we need to check that the changes were actually applied.
#What I usually do is search the updated files using locate
#and manually check that the changes are OK.
sudo rm -rf /root/.cache/image-create/source-repositories/*
sudo rm -rf /opt/stack/puppet-modules
</code></pre> <p>Now, in case a puppet-tripleo change is needed, you can add the env. vars before re-installing the undercloud.</p> <pre><code class="language-bash">export DIB_INSTALLTYPE_puppet_tripleo=source
export DIB_REPOLOCATION_puppet_tripleo=https://review.openstack.org/openstack/puppet-tripleo
export DIB_REPOREF_puppet_tripleo=refs/changes/XX/XXXXX/X
</code></pre> <p>Now, we just need to run the installer.</p> <pre><code class="language-bash">./tripleo-ci/scripts/tripleo.sh --undercloud
</code></pre> <p>Once this process completes, the output should be something similar to:</p> <pre><code class="language-text">#################
tripleo.sh -- Undercloud install - DONE. 
################# </code></pre> Monday's coffee ensure => 'present' Monday’s coffee ensure => ‘present’ Explanation: No comments. Disclaimer. 2016-07-25T00:00:00+00:00 https://www.pubstack.com/blog/2016/07/25/pod Carlos Camacho <p>Monday’s coffee ensure =&gt; ‘present’</p> <p><img src="/static/pod/2016-07-25-no-coffee-no-brain.png" alt="" /></p> <p>Explanation: No comments. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> TripleO deep dive session #3 (Overcloud deployment debugging) This is the third video from a series of “Deep Dive” sessions related to TripleO deployments. This session covers how to troubleshoot a failed THT deployment. This video session aims to cover the following topics: Debug a TripleO... 2016-07-22T00:00:00+00:00 https://www.pubstack.com/blog/2016/07/22/tripleo-deep-dive-session-3 Carlos Camacho <p>This is the third video from a series of “Deep Dive” sessions related to <a href="http://www.tripleo.org/">TripleO</a> deployments.</p> <p>This session covers how to troubleshoot a failed THT deployment.</p> <p>This video session aims to cover the following topics:</p> <ul> <li>Debugging a failed TripleO overcloud deployment.</li> <li>Debugging the deployed resources in real time.</li> <li>Basic OpenStack commands to see the deployment status.</li> </ul> <p>So please check the full <a href="https://www.youtube.com/watch?v=fspnjD-1DNI">session</a> content on the <a href="https://www.youtube.com/channel/UCNGDxZGwUELpgaBoLvABsTA/">TripleO YouTube channel</a>.</p> <div class="center"> <iframe width="560" height="315" src="https://www.youtube.com/embed/fspnjD-1DNI" frameborder="0" allowfullscreen=""></iframe> </div> <p>Please check the <a href="http://www.pubstack.com/blog/2017/06/15/tripleo-deep-dive-session-index.html">sessions index</a> to access all available content.</p> Climbing day in Patones Climbing day in Patones Explanation: No comments. Disclaimer.
2016-07-22T00:00:00+00:00 https://www.pubstack.com/blog/2016/07/22/pod Carlos Camacho <p>Climbing day in Patones</p> <p><img src="/static/pod/2016-07-22-climbing-patones.jpg" alt="" /></p> <p>Explanation: No comments. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> When the shower has more commits than you When the shower has more commits than you Explanation: No comments. Disclaimer. 2016-07-21T00:00:00+00:00 https://www.pubstack.com/blog/2016/07/21/pod Carlos Camacho <p>When the shower has more commits than you</p> <p><img src="/static/pod/2016-07-21-shower-more-commits-than-you.png" alt="" /></p> <p>Explanation: No comments. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> Timeouts day.. Timeouts day.. Explanation: No comments. Disclaimer. 2016-07-20T00:00:00+00:00 https://www.pubstack.com/blog/2016/07/20/pod Carlos Camacho <p>Timeouts day..</p> <p><img src="/static/pod/2016-07-20-owl-wet-face.jpg" alt="" /></p> <p>Explanation: No comments. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> Happy Tuesday! Happy Tuesday! Explanation: No comments. Disclaimer. 2016-07-19T00:00:00+00:00 https://www.pubstack.com/blog/2016/07/19/pod Carlos Camacho <p>Happy Tuesday!</p> <p><img src="/static/pod/2016-07-19-owl-happy.jpg" alt="" /></p> <p>Explanation: No comments. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> TripleO deep dive session #2 (TripleO Heat Templates) This is the second video from a series of “Deep Dive” sessions related to TripleO deployments. This session is a THT overview for all users who want to dig into the project. This video session aims to cover...
2016-07-18T00:00:00+00:00 https://www.pubstack.com/blog/2016/07/18/tripleo-deep-dive-session-2 Carlos Camacho <p>This is the second video from a series of “Deep Dive” sessions related to <a href="http://www.tripleo.org/">TripleO</a> deployments.</p> <p>This session is a THT overview for all users who want to dig into the project.</p> <p>This video session aims to cover the following topics:</p> <ul> <li>A basic THT introduction and overview.</li> <li>The template model used.</li> <li>A description of the new composable services approach.</li> <li>A code overview of the related code repositories.</li> <li>A cloud deployment demo session.</li> <li>A live deployment demo session covering debugging hints.</li> </ul> <p>So please check the full <a href="https://www.youtube.com/watch?v=gX5AKSqRCiU">session</a> content on the <a href="https://www.youtube.com/channel/UCNGDxZGwUELpgaBoLvABsTA/">TripleO YouTube channel</a>.</p> <div class="center"> <iframe width="560" height="315" src="https://www.youtube.com/embed/gX5AKSqRCiU" frameborder="0" allowfullscreen=""></iframe> </div> <p>Please check the <a href="http://www.pubstack.com/blog/2017/06/15/tripleo-deep-dive-session-index.html">sessions index</a> to access all available content.</p> The chapter 1 The chapter 1 Explanation: No comments. Disclaimer. 2016-07-18T00:00:00+00:00 https://www.pubstack.com/blog/2016/07/18/pod Carlos Camacho <p>The chapter 1</p> <p><img src="/static/pod/2016-07-18-sleep.gif" alt="" /></p> <p>Explanation: No comments. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> Pointers Pointers Explanation: No comments. Disclaimer. 2016-07-15T00:00:00+00:00 https://www.pubstack.com/blog/2016/07/15/pod Carlos Camacho <p>Pointers</p> <p><img src="/static/pod/2016-07-15-pointers.png" alt="" /></p> <p>Explanation: No comments. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> Compiling Compiling Explanation: No comments. Disclaimer.
2016-07-14T00:00:00+00:00 https://www.pubstack.com/blog/2016/07/14/pod Carlos Camacho <p>Compiling</p> <p><img src="/static/pod/2016-07-14-compiling.png" alt="" /></p> <p>Explanation: No comments. <br /><a href="https://www.pubstack.com/disclaimer">Disclaimer.</a></p> TripleO deep dive session #1 (Quickstart deployment) This is the first video from a series of “Deep Dive” sessions related to TripleO deployments. The first session is related to the TripleO deployment using Quickstart. Quickstart comes from RDO to reduce the complexity of getting a TripleO environment... 2016-07-11T00:00:00+00:00 https://www.pubstack.com/blog/2016/07/11/tripleo-deep-dive-session-1 Carlos Camacho <p>This is the first video from a series of “Deep Dive” sessions related to <a href="http://www.tripleo.org/">TripleO</a> deployments.</p> <p>The first session is about deploying TripleO using Quickstart.</p> <p>Quickstart comes from <a href="http://www.rdoproject.org/">RDO</a> to reduce the complexity of getting a TripleO environment up quickly, mostly for users without strong, deep knowledge of TripleO configuration; it uses Ansible roles to automate all the different configuration tasks.</p> <p>So please check the full <a href="https://www.youtube.com/watch?v=E1d_RmysnA8">session</a> content on the <a href="https://www.youtube.com/channel/UCNGDxZGwUELpgaBoLvABsTA/">TripleO YouTube channel</a>.</p> <div class="center"> <iframe width="560" height="315" src="https://www.youtube.com/embed/E1d_RmysnA8" frameborder="0" allowfullscreen=""></iframe> </div> <p>Last but not least, James Slagle (slagle) has posted some comments about how to apply new changes to the Puppet modules when deploying the overcloud, as re-creating them is a time-consuming and cumbersome process.</p> <p>Using the upload-puppet-modules script we will be able to update the Puppet modules when executing the overcloud deployment.</p> <pre><code class="language-bash"># From the undercloud
mkdir puppet-modules
cd puppet-modules
git clone https://git.openstack.org/openstack/puppet-tripleo tripleo
# Edit as needed under the tripleo folder
cd
git clone https://git.openstack.org/openstack/tripleo-common
export PATH="$PATH:tripleo-common/scripts"
upload-puppet-modules --directory puppet-modules/
</code></pre> <p>Please check the <a href="http://www.pubstack.com/blog/2017/06/15/tripleo-deep-dive-session-index.html">sessions index</a> to access all available content.</p> <pre><code class="language-text">--------------------------------------------------------------------------------------- | , . , | | )-_'''_-( | | ./ o\ /o \. | | . \__/ \__/ . | | ... V ... | | ... - - - ... | | . - - . | | `-.....-´ | | _______ _ _ ____ | | |__ __| (_) | | / __ \ | | | |_ __ _ _ __ | | ___| | | | | | | | '__| | '_ \| |/ _ \ | | | | | | | | | | |_) | | __/ |__| | | | _____ |_|_| |_| .__/|_|\___|\____/ | | | __ \ | | | __ \(_) | | | | | | ___ ___|_|__ | | | |___ _____ | | | | | |/ _ \/ _ \ '_ \ | | | | \ \ / / _ \ | | | |__| | __/ __/ |_) | | |__| | |\ V / __/ | | |_____/ \___|\___| .__/_ |_____/|_| \_/ \___| | | | | (_) | | ___ ___ __|_|__ _ ___ _ __ ___ | | / __|/ _ \/ __/ __| |/ _ \| '_ \/ __| | | \__ \ __/\__ \__ \ | (_) | | | \__ \ | | |___/\___||___/___/_|\___/|_| |_|___/ | | | --------------------------------------------------------------------------------------- </code></pre> The Venezuelan Cuatro Let me present you the Venezuelan Cuatro. This magical device is a nylon-string musical instrument, tuned (ad’f#’b), typical of Venezuela, similar to the ukulele in its shape, but their character and playing technique are vastly different. If you want...
2016-07-07T00:00:00+00:00 https://www.pubstack.com/blog/2016/07/07/venezuelan-cuatro Carlos Camacho <p>Let me present you the Venezuelan Cuatro.</p> <p><img src="/static/cuatro/cuatro.jpg" alt="" /></p> <p>This magical device is a nylon-string musical instrument, tuned (ad’f#’b), typical of Venezuela. It is similar to the ukulele in its shape, but their character and playing technique are vastly different.</p> <p>If you want to learn more about this instrument, you can download a complete <a href="/static/cuatro/cuatro-chords.pdf">chords sheet</a> from The Ukulele Orchestra of Great Britain or this <a href="/static/cuatro/cuatro-method.pdf">Cuatro method</a>.</p> <p>You can also listen to these excellent videos from YouTube.</p> <div class="center"> <iframe width="560" height="315" src="https://www.youtube.com/embed/NZ123ysut9s" frameborder="0" allowfullscreen=""></iframe> </div> <p><br /></p> <div class="center"> <iframe width="560" height="315" src="https://www.youtube.com/embed/3JqMTEr1HIg" frameborder="0" allowfullscreen=""></iframe> </div> OpenStack & TripleO deployment using Inlunch - DEPRECATED Today I’m going to speak about the first OpenStack installer I used to deploy TripleO. Inlunch, as its name implies, should “Get an Instack environment prepared for you while you head out for lunch.” The steps that... 2016-07-07T00:00:00+00:00 https://www.pubstack.com/blog/2016/07/07/tripleo-deployment-with-inluch Carlos Camacho <p>Today I’m going to speak about the first OpenStack installer I used to deploy <a href="http://www.tripleo.org">TripleO</a>.
<a href="https://github.com/jistr/inlunch">Inlunch</a>, as its name implies, should “Get an Instack environment prepared for you while you head out for lunch.”</p> <p>The steps I usually run are:</p> <ul> <li>Connect to your remote server (your physical server) as root, generate the id_rsa.pub file and append it to the authorized_keys file.</li> </ul> <pre><code class="language-bash">ssh-keygen -t rsa
cd .ssh
cat id_rsa.pub &gt;&gt; authorized_keys
</code></pre> <ul> <li>Install some dependencies and clone <a href="https://github.com/jistr/inlunch">Inlunch</a>.</li> </ul> <pre><code class="language-bash">rpm -iUvh http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-7.noarch.rpm
sudo yum -y install git ansible nano
git clone https://github.com/jistr/inlunch
</code></pre> <ul> <li>Go to the inlunch folder and edit the answers file to fit your needs. By default it creates 6 nodes with 5GB RAM each; I usually change this to 3 nodes with 8GB RAM.</li> </ul> <pre><code class="language-bash">cd inlunch
vi answers.yml.example
</code></pre> <ul> <li>Last but not least, the final <a href="https://github.com/jistr/inlunch">Inlunch</a> step is to deploy our undercloud!!! As simple as it sounds. As you can see, <a href="https://github.com/jistr/inlunch">Inlunch</a> uses <a href="http://www.ansible.com/">Ansible</a> over SSH to automate all the steps.
In this case we added the root public key to the same server, and the installation points to localhost.</li> </ul> <pre><code class="language-bash">INLUNCH_ANSWERS=answers.yml.example INLUNCH_FQDN=localhost ./instack-virt.sh
</code></pre> <ul> <li>Once you have finished this last step, you can log in to the undercloud node by SSHing to the physical server on port 2200.</li> </ul> <pre><code class="language-bash">ssh -p 2200 root@&lt;your_server_fqdn_goes_here&gt;
</code></pre> <p>This is it :) your undercloud is up and running.</p> <p>Now I will show the following steps to deploy the master branch of tripleo-heat-templates and finish the overcloud deployment.</p> <ul> <li>Log in as the stack user and source the stackrc file.</li> </ul> <pre><code class="language-bash">su - stack
source stackrc
</code></pre> <ul> <li>Let’s clone all the needed repositories.</li> </ul> <pre><code class="language-bash">git clone https://github.com/openstack/puppet-tripleo
git clone https://github.com/openstack/tripleo-docs
git clone https://github.com/openstack/tripleo-heat-templates
</code></pre> <ul> <li>And to finish, let’s deploy the <a href="http://www.tripleo.org">TripleO</a> pacemaker environment.</li> </ul> <pre><code class="language-bash">openstack overcloud deploy \
  --libvirt-type qemu \
  --ntp-server clock.redhat.com \
  --templates /home/stack/tripleo-heat-templates \
  -e /home/stack/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
  -e /home/stack/tripleo-heat-templates/environments/puppet-pacemaker.yaml
</code></pre> <p>Now you should have successfully deployed your undercloud/overcloud environment using <a href="https://github.com/jistr/inlunch">Inlunch</a>.</p> <p>Thanks <a href="https://github.com/jistr/">Jiri</a> for this amazing installer!!</p> Connecting the ADXL345 accelerometer to the Raspberry Pi 3 For a few months now I’ve been interested in research topics related to displacement calculation based on time-series acceleration.
This basically means that when something starts moving, its displacement depends on the velocity and the time it has... 2016-07-05T00:00:00+00:00 https://www.pubstack.com/blog/2016/07/05/accelerometer-intro Carlos Camacho <p>For a few months now I’ve been interested in research topics related to displacement calculation based on time-series acceleration. Basically, when something starts moving, its displacement depends on the velocity and the time it has been moving; but since the velocity changes continuously, we take this variation as the object’s acceleration (meters per second per second). In this case we will use the accelerometer to estimate the position relative to a point p, which is derived using the double integration of the acceleration metrics.</p> <p>Until now we have no speed, time or acceleration metrics, so we are going to start from the very beginning.</p> <p>The items needed for this project are: <a href="https://www.amazon.es/gp/product/B005I4QCB4">Solder</a>, <a href="https://www.amazon.es/gp/product/B001BMSBD4">Solder support</a>, <a href="https://www.amazon.es/gp/product/B000LFTN1G">Solder wire</a>, <a href="https://www.amazon.es/gp/product/B00CIOVF8W">Flux</a>, <a href="https://www.amazon.es/gp/product/B0144HG2RE">Jumpers kit</a>, <a href="https://www.amazon.es/gp/product/B01CD5VC92">Raspberry Pi 3</a>, <a href="https://www.amazon.es/gp/product/B00W7S1BFG">Raspberry Pi 3 case</a>, <a href="https://www.amazon.es/gp/product/B0144HFO0A">GPIO board</a> and the <a href="https://www.amazon.es/gp/product/B0151FIBZO">ADXL345 sensor</a>.</p> <p>Total budget for this project: 105.73 EUR.</p> <p>First, let’s connect our ADXL345 accelerometer to the Raspberry by connecting the jumpers as follows:</p> <table> <thead> <tr> <th>Raspberry GPIO pin</th> <th>ADXL345 pin</th> </tr> </thead> <tbody> <tr> <td>GND</td> <td>GND</td> </tr> <tr> <td>3V</td> <td>3V</td> </tr> <tr> <td>SDA</td> <td>SDA</td> </tr> <tr> <td>SCL</td>
<td>SCL</td> </tr> </tbody> </table> <p>This will lead to something like: <img src="/static/accel/accelerometer-01-build.jpeg" alt="" /></p> <p>Or a wider view: <img src="/static/accel/accelerometer-00-build.jpeg" alt="" /></p> <p>Once the Raspberry is wired, let’s configure it with the following steps.</p> <ul> <li>From our Raspberry Pi, install:</li> </ul> <pre><code class="language-bash">sudo apt-get install python-smbus i2c-tools
</code></pre> <ul> <li>Enable the I2C kernel module in the Raspberry Pi:</li> </ul> <pre><code class="language-bash">sudo raspi-config
</code></pre> <p>Now, enable the I2C kernel module in: Advanced Options -&gt; I2C -&gt; Would you like the ARM module.. -&gt; Would you like it enabled by default..</p> <ul> <li>Edit the modules file (sudo vim /etc/modules) and make sure it contains the following lines:</li> </ul> <pre><code class="language-bash">i2c-bcm2708
i2c-dev
</code></pre> <ul> <li>Remove I2C from the blacklist file (/etc/modprobe.d/raspi-blacklist.conf) by commenting out the following line if it appears:</li> </ul> <pre><code class="language-bash">#blacklist i2c-bcm2708
</code></pre> <ul> <li>After all these previous steps, reboot the Raspberry Pi:</li> </ul> <pre><code class="language-bash">sudo reboot
</code></pre> <ul> <li>Test the connection to the I2C module:</li> </ul> <pre><code class="language-bash">sudo i2cdetect -y 1
</code></pre> <p>The command should print the following output: <img src="/static/accel/accelerometer-02-port-test.png" alt="" /></p> <p>Now that the module is working properly, we need to get and use the Python ADXL345 library to access the time-based data.</p> <ul> <li>Clone the library repository and execute the example code:</li> </ul> <pre><code class="language-bash">git clone https://github.com/pimoroni/adxl345-python
cd adxl345-python
sudo python example.py
</code></pre> <p>The command’s output should be: <img src="/static/accel/accelerometer-03-g-test.png" alt="" /></p> <p>Showing the G forces
in each axis.</p> <p>This is it for the first part of the tutorial. The next posts will dig into real-time data processing.</p> TripleO manual deployment - DEPRECATED This is a brief recipe about how to manually install TripleO in a remote 32GB RAM box. From the hypervisor run: #In this dev. env. /var is only 50GB, so I will create #a sym link to another location with... 2016-07-04T00:00:00+00:00 https://www.pubstack.com/blog/2016/07/04/manually-installing-tripleo-recipe Carlos Camacho <p>This is a brief recipe about how to manually install TripleO in a remote 32GB RAM box.</p> <p>From the hypervisor run:</p> <pre><code class="language-bash">#In this dev. env. /var is only 50GB, so I will create
#a sym link to another location with more capacity.
#It will easily take more than 50GB deploying a 3+1 overcloud
sudo mkdir -p /home/libvirt/
sudo ln -sf /home/libvirt/ /var/lib/libvirt
#Add default stack user
sudo useradd stack
echo "stack:stack" | chpasswd
echo "stack ALL=(root) NOPASSWD:ALL" | sudo tee -a /etc/sudoers.d/stack
sudo chmod 0440 /etc/sudoers.d/stack
su - stack
sudo yum -y install epel-release
sudo yum -y install yum-plugin-priorities
export TRIPLEO_ROOT=/home/stack
export TRIPLEO_RELEASE=rdo-trunk-master-tripleo
#export TRIPLEO_RELEASE=rdo-trunk-newton-tested
export TRIPLEO_RELEASE_DEPS=centos7
#export TRIPLEO_RELEASE_DEPS=centos7-newton
#Repository configured pointing to the above release!
sudo curl -o /etc/yum.repos.d/delorean.repo https://buildlogs.centos.org/centos/7/cloud/x86_64/$TRIPLEO_RELEASE/delorean.repo
sudo curl -o /etc/yum.repos.d/delorean-deps.repo https://trunk.rdoproject.org/$TRIPLEO_RELEASE_DEPS/delorean-deps.repo
#Configure the undercloud deployment
export NODE_DIST=centos7
export NODE_CPU=4
export NODE_MEM=9000
export NODE_COUNT=6
export UNDERCLOUD_NODE_CPU=4
export UNDERCLOUD_NODE_MEM=9000
export FS_TYPE=ext4
sudo yum install -y instack-undercloud
instack-virt-setup
</code></pre> <p>In the hypervisor run the following command to log in to the undercloud:</p> <pre><code class="language-bash">ssh root@`sudo virsh domifaddr instack | grep $(tripleo get-vm-mac instack) | awk '{print $4}' | sed 's/\/.*$//'`
</code></pre> <p>From the undercloud we will install all the packages:</p> <pre><code class="language-bash">#Add a 4GB swap file to the undercloud
sudo dd if=/dev/zero of=/swapfile bs=1024 count=4194304
sudo mkswap /swapfile
#Turn ON the swap file
sudo chmod 600 /swapfile
sudo swapon /swapfile
#Enable it on start (use sudo tee, a sudo'ed echo cannot write through the redirection)
echo "/swapfile swap swap defaults 0 0" | sudo tee -a /etc/fstab
#Login as the stack user
su - stack
export TRIPLEO_ROOT=/home/stack
sudo yum -y install yum-plugin-priorities
export TRIPLEO_RELEASE=rdo-trunk-master-tripleo
#export TRIPLEO_RELEASE=rdo-trunk-newton-tested
export TRIPLEO_RELEASE_BRANCH=master
#export TRIPLEO_RELEASE_BRANCH=stable/newton
export USE_DELOREAN_TRUNK=1
export DELOREAN_TRUNK_REPO="https://buildlogs.centos.org/centos/7/cloud/x86_64/$TRIPLEO_RELEASE/"
export DELOREAN_REPO_FILE="delorean.repo"
export FS_TYPE=ext4
git clone -b $TRIPLEO_RELEASE_BRANCH https://github.com/openstack/tripleo-heat-templates
git clone https://github.com/openstack-infra/tripleo-ci.git
./tripleo-ci/scripts/tripleo.sh --all
# The last command will execute:
# repo_setup        --repo-setup
# undercloud        --undercloud
# overcloud_images  --overcloud-images
# register_nodes    --register-nodes
# introspect_nodes  --introspect-nodes
#
overcloud_deploy --overcloud-deploy
</code></pre> <p>Once the undercloud is fully installed, deploy an overcloud (the last command should already have deployed one; this step is only needed if you want to deploy another).</p> <pre><code class="language-bash">cd
openstack overcloud deploy \
  --libvirt-type qemu \
  --ntp-server pool.ntp.org \
  --templates /home/stack/tripleo-heat-templates \
  -e /home/stack/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml \
  -e /home/stack/tripleo-heat-templates/environments/puppet-pacemaker.yaml
#Also can be added:
#--control-scale 3 \
#--compute-scale 3 \
#--ceph-storage-scale 1 -e /home/stack/tripleo-heat-templates/environments/storage-environment.yaml
</code></pre> <p>This will hopefully deploy the TripleO overcloud; if not, refer to the <a href="http://tripleo.org/troubleshooting/troubleshooting.html">troubleshooting</a> section on the official site.</p> <pre><code class="language-bash">#Configure a DNS for the OC subnet, do this before deploying the overcloud
neutron subnet-update `neutron subnet-list -f value | awk '{print $1}'` --dns-nameserver 192.168.122.1
</code></pre> <div style="font-size:10px"> <blockquote> <p><strong>Updated 2017/02/23:</strong> instack-virt-setup is deprecated :( moving to tripleo-quickstart.</p> <p><strong>Updated 2016/11/25:</strong> instack-virt-setup env. vars. are defaulted to sane defaults, so they are optional now.</p> </blockquote> </div> Connecting from your local machine to the TripleO overcloud horizon dashboard This will be my first blog post about TripleO deployments. The goal of this post is to show how to chain multiple SSH tunnels to browse into the Horizon dashboard, deployed in a TripleO environment. In this case, we have...
2016-07-02T00:00:00+00:00 https://www.pubstack.com/blog/2016/07/02/ssh-multi-hop-tripleo Carlos Camacho <p>This will be my first blog post about TripleO deployments.</p> <p>The goal of this post is to show how to chain multiple SSH tunnels to browse into the Horizon dashboard, deployed in a TripleO environment.</p> <p>In this case, we have deployed <a href="http://www.tripleo.org/">TripleO</a> on a remote server, “labserver”, on which an undercloud and an overcloud were deployed.</p> <p>The Horizon dashboard listens on port 80 of the overcloud controller. From the user terminal we want to have access to the Horizon dashboard, currently unreachable because we don’t have access to the deployed private IPs from the user’s terminal.</p> <p>Below is a graphical representation of the described scenario. <img src="/static/multi-hop.png" alt="" /></p> <h2 id="steps">STEPS</h2> <ul> <li>Connect the local terminal to labserver (create the first tunnel)</li> </ul> <pre><code class="language-bash">#Forward incoming 38080 traffic to local 38080 in the hypervisor
#labserver must be a reachable host
ssh -L 38080:localhost:38080 root@labserver
</code></pre> <ul> <li>Connect to the undercloud from the labserver (create the second tunnel)</li> </ul> <pre><code class="language-bash">#Log in as the stack user and get the undercloud IP
su - stack
undercloudIp=`sudo virsh domifaddr instack | grep $(tripleo get-vm-mac instack) | awk '{print $4}' | sed 's/\/.*$//'`
#Forward incoming 38080 traffic to local 38080 in the undercloud
ssh -L 38080:localhost:38080 root@$undercloudIp
</code></pre> <ul> <li>Get the admin password for the Horizon dashboard</li> </ul> <pre><code class="language-bash">su - stack
source stackrc
cat overcloudrc | grep OS_PASSWORD | awk -F '=' '{print $2}'
</code></pre> <ul> <li>Connect to the overcloud controller from the undercloud (create the third and last tunnel)</li> </ul> <pre><code class="language-bash">#Get the controller IP
controllerIp=`nova list |
grep controller | awk -F '|' '{print $7}' | awk -F '=' '{print $2}'`
#Forward incoming 38080 traffic to the controller IP on port 80
ssh -L 38080:"$controllerIp":80 heat-admin@"$controllerIp"
echo "From your browser open: http://localhost:38080/"
</code></pre> <p>Now, if everything went as expected, you should be able to see the Horizon dashboard by opening your favorite browser and typing http://localhost:38080/dashboard,<br /> then use the admin user and the password printed before.</p> <p>Note that in this case, by default, the SSH hops should be done using different users.</p>
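<p>As a side note, on modern OpenSSH (7.3 and later, so newer than the setup described in this post) the three chained tunnels above can be expressed as a single command with the -J (ProxyJump) option. This is only a sketch under the same assumptions as the steps above: labserver is reachable from your machine, each hop keeps its own user, and the undercloud and controller IPs are placeholders you must substitute with the values obtained during the deployment.</p> <pre><code class="language-bash">#Hypothetical one-liner equivalent of the three manual tunnels:
#jump through labserver and the undercloud, then forward local
#port 38080 to port 80 on the controller. Replace &lt;undercloudIp&gt;
#and &lt;controllerIp&gt; with the real addresses.
ssh -J root@labserver,root@&lt;undercloudIp&gt; \
    -L 38080:localhost:80 heat-admin@&lt;controllerIp&gt;
</code></pre> <p>The manual chain is still useful on older SSH clients, or when you want to keep an interactive shell open on each hop.</p>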