How to save Gatling session variables or query results into files

This post is for users who are using the Gatling Gradle plugin io.gatling.gradle and trying to solve the runtime resource path problem with ./bin/gatling.sh after building the bundle.

Background

I initially set up the project following the Gradle plugin documentation. However, after developing the load tests, I realized it's not much fun to run them with Gradle on a remote host, so I decided to pack everything into a [bundle structure](https://gatling.io/docs/gatling/reference/current/general/bundle_structure/) so that I can run my tests from the command line anywhere (this post will not go into how I build the bundle). It worked fine in the beginning for a simple Simulation; however, once I started to use resources, specifically a Feeder to inject some users' data, figuring out which resources folder my test is using kept bothering me, both in the Gradle project and in the bundle.

The Gradle project has the following layout:

project
+ build
++ classes
++ resources
+++ myFile.csv
+ src
++ gatling
+++ resources
++++ myFile.csv
+++ scala
++++ MySimulation

The structure of the gatling bundle:

bundle
+ bin
++ gatling.sh
+ conf
++ gatling.conf
+ lib
+ user-files
++ resources
+++ myFile.csv
++ simulations
+++ MySimulation

What the problem is and where I met it

In Gatling, it's quite common to utilize the Session to save or fetch data for a virtual user.

In a recent project, I realized that we might have a few Simulations which are not run at the same time; in fact, one Simulation may use the results from another Simulation, or we may need to log the results from one Simulation.

So I decided to use a PrintWriter to write some session variables into a file under the Gradle project. I also want to persist it in the Git repo so that I can rerun the Simulation that needs the results.

.exec(session => {
  val writer = new PrintWriter(new FileOutputStream(new File("src/gatling/resources/myFile.csv"), true))
  writer.write(session("username").as[String] + "," + session("sessionKey").as[String])
  writer.write("\n")
  writer.close()
  session
})

The runtime resource path problem

Running Gatling Simulations with a Gradle task in the IDE works great. However, once I packed things into a bundle, the above code obviously won't work, since there is no src/gatling anymore. There is a resources folder on the Gatling classpath, though, as we can see in gatling.sh:

GATLING_CLASSPATH="$GATLING_HOME/lib/*:$GATLING_HOME/user-files/resources:$GATLING_HOME/user-files:$GATLING_CONF:"

which means we can use the classloader to find the resource via getClass.getResource("/myFile.csv").getPath, which should give us the path to the file.

.exec(session => {
  val writer = new PrintWriter(new FileOutputStream(new File(getClass.getResource("/myFile.csv").getPath), true))
  writer.write(session("username").as[String] + "," + session("sessionKey").as[String])
  writer.write("\n")
  writer.close()
  session
})

The gradle src resource path problem

With this code change, we can still run the Simulations with Gradle. However, Gradle copies src/gatling/resources into build/resources, so the classloader resolves the file path inside build/resources. Whenever I run the simulations, files are generated in build/resources, and they are deleted when I clean the project, intentionally or accidentally.
Oh no, I need this file, so I copied it back to my src resources each time I ran it. Not fun, but fine…

Easy solution

Put an environment variable Bundle in gatling.sh and use the following wrapper to get the file path:

// gatling.sh sets env Bundle=true

def testGeneratedFilePath(filename: String): String = {
  val isBundle = System.getenv("Bundle") != null
  if (isBundle)
    getClass.getResource(s"/${filename}").getPath // bundle: resolved from user-files/resources on the classpath
  else
    s"src/gatling/resources/${filename}" // gradle project: avoid build/resources, which is wiped on clean
}

val testFileAbsolutePath = new File(testGeneratedFilePath("myFile.csv"))

Set up a two-stage Dockerfile to build a Node.js app

In this post, I'll create a simple HTTP server app with Node.js and set up build scripts to bundle the project into a dist folder.
After validating that this works, I'll create a Dockerfile that builds the Node.js app in a container and then creates a final image with only the dist folder as a new image layer.

A simple Node.js app

  • Create a folder mkdir -p ~/two-stages-docker-build-nodejs-app and go to that folder cd ~/two-stages-docker-build-nodejs-app to initialize an npm project npm init --yes.
  • Add a license: npx license Apache-2
  • Create a src folder mkdir -p ~/two-stages-docker-build-nodejs-app/src and add an index.js file cd ~/two-stages-docker-build-nodejs-app/src && touch index.js with a minimal HTTP server.
    var http = require('http');

    // create a server object:
    http.createServer(function (req, res) {
      res.write('Hello World!'); // write a response to the client
      res.end();                 // end the response
    }).listen(8080);             // the server object listens on port 8080
  • We want to use gulp as a bundling tool and add the babel, uglify and rename pipes to it
    npm install @babel/core gulp gulp-babel gulp-uglify gulp-rename --save-dev
  • Create a gulpfile.js and add a simple build task to it; we take everything from the src folder and put the result into the dist folder
    const { src, dest, parallel } = require('gulp');
    const babel = require('gulp-babel');
    const uglify = require('gulp-uglify');
    const rename = require('gulp-rename');

    function srcBuild() {
      return src("src/*.js")
        .pipe(babel())
        .pipe(dest('dist/'))
        .pipe(uglify())
        .pipe(rename({ extname: '.min.js' }))
        .pipe(dest('dist/'));
    }

    exports.default = srcBuild
  • Run the build to test it works npx gulp
  • Run the server to test it works node dist/index.min.js

Build the Node.js app into a Docker container

  • Create a Dockerfile under the root of the project touch Dockerfile

    FROM node:12 AS builder
    WORKDIR /build
    COPY . .
    RUN npm install && npx gulp

    FROM node:12
    WORKDIR /app
    COPY --from=builder /build/dist .
    ENTRYPOINT [ "node", "./index.js" ]
  • Build the image docker build . -t two-stages-docker-nodejs

  • Run the image docker run -p 8080:8080 two-stages-docker-nodejs

  • Access in the browser http://localhost:8080

Related tools:

NPM
Gulp
Docker

References:

https://gulpjs.com/docs/en/getting-started/quick-start
https://docs.npmjs.com/getting-started
https://docs.docker.com/develop/develop-images/multistage-build/

Source code

https://github.com/kai-chu/PieceOfCodes/tree/master/two-stages-docker-build-nodejs-app

Ansible built-in module - with_items

ansible

This is a demo of the Ansible built-in lookup with_items.

Before getting to the topic, I want to recap the basic YAML syntax for lists and dictionaries.
To create a list in YAML, there are two equivalent forms (block style and an abbreviated flow style), and we can use either of them in a YAML file.

listName:
- 1
- 2
- 3
# which is the same as the abbreviated form
listName: [1,2,3]

And it's the same for a dictionary

- firstName: kai
  lastName: chu
  age: 29
  phone: 888888
# which is the same as the abbreviated form
- {firstName: kai, lastName: chu, age: 29, phone: 888888}

Keeping that in mind, it's easier to understand the different usages in different projects, even when the playbooks written by DevOps mix both syntaxes.

The following explanation is similar to what is given in the official examples.
If you have already understood the official examples, you don't have to read further.

This post gives examples of a list of values and a list of dictionaries.

In an Ansible playbook, we can use with_items with a list of values, a list of dictionaries, or a variable, written either in block YAML syntax or in the abbreviated form.

4 forms of using a list of values

---
- name: >-
    Demo ansible build-in withItems with list,
    this lookup returns a list of items given to it,
    if any of the top level items is also a list it will flatten it,
    but it will not recurse
  hosts: localhost
  connection: local
  vars:
    list_in_var:
      - green
      - red
      - blue

    list_in_var_as_abbreviated_form: [green, red, blue]

  tasks:
    - name: "[List of items - 01] items defined in the same playbook"
      debug:
        msg: "An item: {{ item }}"
      with_items:
        - green
        - red
        - blue

    - name: "[List of items - 02] items defined in a variable"
      debug:
        msg: "An item: {{ item }}"
      with_items: "{{ list_in_var }}"

    - name: "[List of items - 03] items in an abbreviated form defined in the same playbook"
      debug:
        msg: "An item: {{ item }}"
      with_items: [green, red, blue]

    - name: "[List of items - 04] items in an abbreviated form variable"
      debug:
        msg: "An item: {{ item }}"
      with_items: "{{ list_in_var_as_abbreviated_form }}"

The output:

$ ansible-playbook playbook.yaml 

PLAY [Demo ansible build-in withItems, this lookup returns a list of items given to it, if any of the top level items is also a list it will flatten it, but it will not recurse] ***

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [[List of items - 01] items defined in the same playbook] *****************
ok: [localhost] => (item=green) => {
"msg": "An item: green"
}
ok: [localhost] => (item=red) => {
"msg": "An item: red"
}
ok: [localhost] => (item=blue) => {
"msg": "An item: blue"
}
...

4 forms of using a list of dictionaries

There is nothing special about dictionaries compared with a list of values. The item will be a dictionary in this case, and we can use item.key to access a value.

---
- name: >-
    Demo ansible build-in with_items with list of dictionaries,
    this lookup returns a list of items given to it,
    if any of the top level items is also a list it will flatten it,
    but it will not recurse
  hosts: localhost
  connection: local
  vars:
    list_of_dictionaries_in_var:
      - name: Green
        color: green
      - name: Red
        color: red
      - name: Blue
        color: blue

    list_of_dictionaries_in_var_as_abbreviated_form:
      - {name: Green, color: green}
      - {name: Red, color: red}
      - {name: Blue, color: blue}

  tasks:
    - name: "[list of dict items - 01] items defined in the same playbook"
      debug:
        msg: "An item name: {{ item.name }}, color: {{ item.color }}"
      with_items:
        - name: Green
          color: green
        - name: Red
          color: red
        - name: Blue
          color: blue

    - name: "[list of dict items - 02] items in an abbreviated form defined in the same playbook"
      debug:
        msg: "An item name: {{ item.name }}, color: {{ item.color }}"
      with_items:
        - { name: Green, color: green }
        - { name: Red, color: red }
        - { name: Blue, color: blue }

    - name: "[List of dict items - 03] items defined in a variable"
      debug:
        msg: "An item name: {{ item.name }}, color: {{ item.color }}"
      with_items: "{{ list_of_dictionaries_in_var }}"

    - name: "[List of dict items - 04] items defined in an abbreviated form variable"
      debug:
        msg: "An item name: {{ item.name }}, color: {{ item.color }}"
      with_items: "{{ list_of_dictionaries_in_var_as_abbreviated_form }}"

The output:

$ ansible-playbook playbook-with-items-dict.yaml 

PLAY [Demo ansible build-in with_items with list of dictionaries, this lookup returns a list of items given to it, if any of the top level items is also a list it will flatten it, but it will not recurse] ***

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [[list of dict items - 01] items defined in the same playbook] ************
ok: [localhost] => (item={u'color': u'green', u'name': u'Green'}) => {
"msg": "An item name: Green, color: green"
}
ok: [localhost] => (item={u'color': u'red', u'name': u'Red'}) => {
"msg": "An item name: Red, color: red"
}
ok: [localhost] => (item={u'color': u'blue', u'name': u'Blue'}) => {
"msg": "An item name: Blue, color: blue"
}
...

Summary

I find with_items really useful when it comes to adding a few configuration entries for a provision. It is much more flexible when we put the configuration as key-value items in a variable file; with with_items in the playbook, we don't have to change the playbook when we need to add a new item.

Kubernetes - 02 - the 3 practical ways to use k8s secret

kubernetes

This is the second post about Kubernetes secrets. In the previous one, I listed the 3 ways to create secrets, and we can create as many secrets as we want. In this post, I will give the 3 practical ways to use secrets in a k8s deployment.

Let's assume we have created a secret named mysecret as follows:

apiVersion: v1
kind: Secret
metadata:
  name: mysecret
type: Opaque
stringData:
  username: admin
  password: 1f2d1e2e67df

3 ways to use k8s secrets

  • As environment variables
  • As volume files
  • As Kubelet auth credentials to pull image

1 Use as environment variables

A secret is a dictionary, so we can expose it in the environment as key-value pairs.
We can put the whole dictionary into the environment, or we can refer to one of the keys and use its value.

1.1 Use a key as the value of a user-defined env var

apiVersion: v1
kind: Pod
metadata:
  name: secret-env-pod
spec:
  containers:
  - name: container-name
    env:
    - name: PGDATA
      value: /var/lib/postgresql/data/pgdata
    - name: POSTGRES_USER
      valueFrom:
        secretKeyRef:
          name: mysecret
          key: username

The environment variable POSTGRES_USER will have the value admin.

1.2 Use keys from the secret directly as env vars, keyword envFrom

apiVersion: v1
kind: Pod
metadata:
  name: secret-env-pod
spec:
  containers:
  - name: container-name
    env:
    - name: PGDATA
      value: /var/lib/postgresql/data/pgdata
    envFrom:
    - secretRef:
        name: mysecret

The environment variables username and password will be available in the container.

2 Use as volume files

We can use the secret keys to generate files in a volume, and mount it into a container. We have 2 keys in our secret, which means we will have two files (/username and /password) created in the volume.

2.1 Mount all keys

Two steps to mount a secret into a container.

2.1.1 Create a volume from a secret

volumes:
- name: volume-from-secret
  secret:
    secretName: mysecret

2.1.2 Mount the volume to a directory

volumeMounts:
- name: volume-from-secret
  mountPath: "/etc/foo"
  readOnly: true

Put them together in a pod yaml file

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mypod
    image: redis
    volumeMounts:
    - name: foo
      mountPath: "/etc/foo"
      readOnly: true
  volumes:
  - name: foo
    secret:
      secretName: mysecret

Since we mount the secret volume (what I call a volume created from a secret) at /etc/foo, we can read the values from the files created from the secret keys.

$cat /etc/foo/username 
admin
$cat /etc/foo/password
1f2d1e2e67df

2.2 Mount a subset of the secret keys to a user-defined subfolder

To select specific keys instead of mounting all keys into the folder, we can add items when creating the secret volume.

2.2.1 Create a volume from a secret

volumes:
- name: volume-from-secret
  secret:
    secretName: mysecret
    items:
    - key: username
      path: my-group/my-username

2.2.2 Mount the volume to a directory (the same as above)

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mypod
    image: redis
    volumeMounts:
    - name: foo
      mountPath: "/etc/foo"
      readOnly: true
  volumes:
  - name: foo
    secret:
      secretName: mysecret
      items:
      - key: username
        path: my-group/my-username

Now you will find only username is projected.

$cat /etc/foo/my-group/my-username
admin

2.3 File mode

0644 is used by default, which can be changed as follows:

volumes:
- name: foo
  secret:
    secretName: mysecret
    defaultMode: 0400

3 Kubelet auth to pull image

3.1 Use image pull secret in pod spec

When k8s tries to pull an image from an image registry, it checks the list of docker-registry secrets referenced by the imagePullSecrets field of your Pod specification; the secrets must live in the same namespace.

  • To create a secret for docker authentication

    kubectl create secret docker-registry myregistrykey --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD --docker-email=DOCKER_EMAIL
  • Reference the secret in the Pod spec

    apiVersion: v1
    kind: Pod
    metadata:
      name: foo
      namespace: awesomeapps
    spec:
      containers:
      - name: foo
        image: YOUR_PRIVATE_DOCKER_IMAGE
      imagePullSecrets:
      - name: myregistrykey

3.2 Use image pull secret in pod service account

Since each Pod is associated with a service account, we can also add the imagePullSecrets to the service account.
If we don't specify a service account when defining a Pod or Deployment, the default service account is used.

3.2.1 Associate a secret to service account

To add a secret to a service account, add a field ‘imagePullSecrets’ to the sa spec.

  • Patch an existing service account
    We can patch the service account as following

    kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "myregistrykey"}]}'
  • Create a new service account with imagePullSecrets

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: my-pod-used-service-account
      namespace: default
    imagePullSecrets:
    - name: myregistrykey

3.2.2 Configure a Pod to use the service account, field 'serviceAccountName' in the Pod spec

apiVersion: v1
kind: Pod
metadata:
  name: foo
  namespace: awesomeapps
spec:
  containers:
  - name: foo
    image: YOUR_PRIVATE_DOCKER_IMAGE
  serviceAccountName: my-pod-used-service-account

When the Pod is created with the service account my-pod-used-service-account, the imagePullSecrets will be added automatically to the spec, which we can verify:

kubectl get pod THE_POD_NAME -o=jsonpath='{.spec.imagePullSecrets[0].name}{"\n"}'

Related:
https://kaichu.se/Kubernetes/2020/09/19/kubernetes-01-the-3-practical-ways-to-create-k8s-secret.html

References:
https://kubernetes.io/docs/concepts/configuration/secret/#using-secrets-as-files-from-a-pod

https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/

Kubernetes - 01 - the 3 practical ways to create k8s secret

kubernetes
Just a summary of how to create a k8s Secret object, which is used to store a piece of sensitive information.

String data and Base64 encoding

A secret is saved as a base64-encoded string. To generate a base64 string from your password in bash:

$ echo -n "mypassword123" | base64 -w0

To decode a base64 string

$ echo 'MWYyZDFlMmU2N2Rm' | base64 --decode

Note: The serialized JSON and YAML values of Secret data are encoded as base64 strings. Newlines are not valid within these strings and must be omitted. When using the base64 utility on Darwin/macOS, users should avoid using the -b option to split long lines. Conversely, Linux users should add the option -w 0 to base64 commands or the pipeline base64 | tr -d ‘\n’ if the -w option is not available.

3 ways to manage secrets

There are 3 ways to use the kubectl CLI, and the 3 corresponding ways to create secrets are as follows.

3.1 Imperative commands to edit

3.1.1 Create from file

  • Write the raw values to files
    $ echo -n 'admin' > ./username.txt
    $ echo -n '1f2d1e2e67df' > ./password.txt
  • Create the secret from the files; the keys of the secret will be the filenames
    $ kubectl create secret generic db-user-pass \
    --from-file=./username.txt \
    --from-file=./password.txt
    To specify other key names:
    $ kubectl create secret generic db-user-pass \
    --from-file=username=./username.txt \
    --from-file=password=./password.txt

3.1.2 Create from literal

Escape special characters in literals with single quotes (')

kubectl create secret generic dev-db-secret \
--from-literal=username=devuser \
--from-literal=password='S!B\*d$zDsb='

Note: to edit the secret, use: kubectl edit secrets dev-db-secret

3.2 Imperative object files

3.2.1 Using here doc

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
type: Opaque
data:
  password: $(echo -n "s33msi4" | base64 -w0)
  username: $(echo -n "jane" | base64 -w0)
EOF

3.2.2 yaml File

This is the same as the following 2 commands plus a YAML file:

$ echo -n 'admin' | base64
$ echo -n '1f2d1e2e67df' | base64

# mysecret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
type: Opaque
data:
  username: YWRtaW4=
  password: MWYyZDFlMmU2N2Rm

3.2.3 string data

The above is the same as the following stringData example; the string data will be encoded when k8s creates the secret.

apiVersion: v1
kind: Secret
metadata:
  name: mysecret
type: Opaque
stringData:
  username: admin
  password: 1f2d1e2e67df

Note: if you specify both data and stringData in the same secret, the values from stringData will be used. I find this useful when I want to encode a few lines of information.

apiVersion: v1
kind: Secret
metadata:
  name: mysecret
type: Opaque
data:
  username: YWRtaW4=
  password: MWYyZDFlMmU2N2Rm
stringData:
  username: admin
  password: 1f2d1e2e67df

The values from stringData will be used.

3.3 Using a kustomization.yaml file, declarative object configuration

To use the kustomization feature, we need to create a folder first and add our files there:

$ mkdir myconfigs
$ touch myconfigs/kustomization.yaml

3.3.1 Generate from file

  • Write the values to files
    $ echo -n 'admin' > ./username.txt
    $ echo -n '1f2d1e2e67df' > ./password.txt
  • Add the following generator to the kustomization.yaml file:
    secretGenerator:
    - name: db-user-pass
      files:
      - username.txt
      - password.txt

3.3.2 Generate from literal

secretGenerator:
- name: db-user-pass
  literals:
  - username=admin
  - password=1f2d1e2e67df

The next post will be the 3 practical ways to use k8s secret

References:
https://kubernetes.io/docs/concepts/configuration/secret/
https://kubernetes.io/docs/tasks/configmap-secret/managing-secret-using-kustomize/

Airflow variables in DAG

Variables are a generic way to store and retrieve arbitrary content or settings as a simple key value store within Airflow.

While your pipeline code definition and most of your constants and variables should be defined in code and stored in source control, it can be useful to have some variables or configuration items accessible and modifiable through the UI.

It can also be used as a context for different environments.

There are 3 ways to create variables

  • UI
  • CLI
  • code

UI

From the UI, we can navigate to Admin -> Variables to manage them.

CLI

From the CLI, we can use the following commands:

airflow variables -g key
airflow variables -s key value

Code

No matter where we set up a variable, in the end we want to read it in a DAG so that we can easily change the context of a DAG run.

There are two ways to read variables in a DAG

  • Python Code

    from airflow.models import Variable
    Variable.set("foo", "value")
    foo = Variable.get("foo")
    bar = Variable.get("bar", deserialize_json=True)
    baz = Variable.get("baz", default_var=None)
  • Jinja template
    You can use a variable in a Jinja template with the following syntax, for example in a BashOperator command:

    echo {{ var.value.<variable_name> }}

    or if you need to deserialize a JSON object from the variable:

    echo {{ var.json.<variable_name> }}

Best practice

You should avoid usage of Variables outside an operator’s execute() method or Jinja templates if possible, as Variables create a connection to metadata DB of Airflow to fetch the value, which can slow down parsing and place extra load on the DB.

Variable.get creates a DB connection every time the scheduler parses the DAG; a sketch of the recommended pattern follows the example below.

Example to understand the best practice

  • Let’s set variable env=dev from CLI

    $ airflow variables -s env dev
  • Create a DAG

    from airflow import DAG
    from airflow.models import Variable
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.python_operator import PythonOperator
    from airflow.utils.dates import days_ago

    # module level: runs every time the scheduler parses this DAG file
    env = Variable.get("env")
    print('' if env is None else env + 'parse time')

    def print_env():
        print(env + 'execution time')

    dag = DAG('demo-variables', default_args={'start_date': days_ago(1)})
    with dag:
        os_operator = PythonOperator(task_id="os_operator", python_callable=print_env)
        jinja_operator = BashOperator(task_id="get_variable_value", bash_command='echo {{ var.value.env }} ')

  • Running explanation
    When the scheduler parses the DAG, which happens every few seconds, we will find devparse time in the scheduler log.
    When the DAG run is scheduled, we will see the operators print the dev variable:

    //os_operator
    [2020-04-08 14:56:50,752] {{logging_mixin.py:112}} INFO - devexecution time
    ...
    //get_variable_value
    [2020-04-08 14:56:59,133] {{bash_operator.py:115}} INFO - Running command: echo dev
    [2020-04-08 14:56:59,151] {{bash_operator.py:122}} INFO - Output:
    [2020-04-08 14:56:59,158] {{bash_operator.py:126}} INFO - dev
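To follow the best practice, the Variable.get call can be moved inside the callable (or replaced by the Jinja template) so the metadata DB is only hit when the task actually runs. A minimal sketch of that pattern, my own variation rather than code from the original example:

    from airflow.models import Variable
    from airflow.operators.python_operator import PythonOperator

    def print_env_at_runtime(**context):
        # fetched at execution time only, not during every scheduler parse
        env = Variable.get("env", default_var=None)
        print(env)

    runtime_operator = PythonOperator(
        task_id="print_env_at_runtime",
        python_callable=print_env_at_runtime,
        provide_context=True,
        dag=dag,  # assumes the dag object defined in the example above
    )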

Another recommended layout to use profiles in an ansible-playbook project

Usually, I prefer to start a project from the recommended best-practice layout on the Ansible official website.

inventories/
   production/
      hosts              # inventory file for production servers
      group_vars/
         group1.yml      # here we assign variables to particular groups
         group2.yml
      host_vars/
         hostname1.yml   # here we assign variables to particular systems
         hostname2.yml

   staging/
      hosts              # inventory file for staging environment
      group_vars/
         group1.yml      # here we assign variables to particular groups
         group2.yml
      host_vars/
         stagehost1.yml  # here we assign variables to particular systems
         stagehost2.yml

library/
module_utils/
filter_plugins/

site.yml
webservers.yml
dbservers.yml

roles/
   common/
   webtier/
   monitoring/
   fooapp/

It covers the essential multi-environment deployment, so we can easily switch between production and staging by running the following commands:

$ ansible-playbook -i inventories/production webservers.yml -k -K --ask-vault-pass
$ ansible-playbook -i inventories/staging webservers.yml -k -K --ask-vault-pass

As I mentioned in my previous post Using ansible playbook in a DevOps pipeline, we could add an all.yml file to the playbook group_vars to provide the following information to ansible-playbook, so we don't have to type passwords.

ansible_user: YOUR_USER_NAME
ansible_password: YOUR_USER_PASSWORD
ansible_become_password: YOUR_BECOME_PASSWORD

The group_vars folder in the root of the playbook directory is called the playbook group_vars.

inventories/
group_vars/
   all.yml
webservers.yml

This feels inconvenient when I'm using my own user password instead of a service account shared between team members.
I don't want to tell others my vault password, because then they would know my ansible_password and ansible_become_password.
Initially, I thought I could create a template, and everyone who wants to use the playbook would copy the project template and create their all.yml locally. It results in the following project structure.

inventories/
group_vars/
   .gitignore -> all.yml
   all.yml.cfg
   all.yml (anyone who doesn't want to use the -k -K --ask-vault-pass options can create this on their local machine)
webservers.yml

It turns out it’s even more cumbersome, obviously…

I found a better solution, where we can use the --extra-vars option to achieve my goal without those constraints.
I decided to use the profile concept, which I learnt from Ant build scripts at my previous company.
Here we don't use playbook group_vars; instead, we create a profiles folder and add the vars for each profile, such as kai and chu.

inventories/
   production/
   staging/
profiles/
   template/
      all.yml
   kai/
      all.yml
   chu/
      all.yml
webservers.yml

I have put ansible_user, ansible_password and ansible_become_password in the all.yml in the kai folder.
Now we gain the benefit of the profile by running the following command:

$ ansible-playbook -i inventories/production --extra-vars @profiles/kai/all.yml webservers.yml --vault-password-file ~/.ansible-vault-pass

It is an env/profile matrix solution; it gives us the flexibility to test our ansible-playbook with whichever vars we prefer.
Let's run the playbook with chu's profile in staging before finishing this post:

$ ansible-playbook -i inventories/staging --extra-vars @profiles/chu/all.yml webservers.yml --vault-password-file ~/.ansible-vault-pass

Summary

  • It's good to use --extra-vars when some variables are specific to the ansible-playbook user, in other words, when the variables differ between Ansible users.
  • It would be more appropriate to add one more inventory, e.g. inventories/test, if there are a lot of environment-related differences.

an sible book
Check out what sible means here in the urban dictionary

Airflow start_date with a cron schedule_interval is not confusing anymore once you know this

Airflow DAG start_date with days_ago confuses us all the time. When will a DAG be kicked off? Will it be started at all?

The first DAG Run is created based on the minimum start_date for the tasks in your DAG.
Subsequent DAG Runs are created by the scheduler process, based on your DAG’s schedule_interval, sequentially.

The note from the Airflow official website makes sense at first glance; however, when you try to create your Airflow DAG with a cron string, you never quite know what it means.
When my friend Yuxia came to discuss her case of running a DAG every even day at 1 a.m., I thought it would be easy.

default_args = {
    'start_date': days_ago(1)
}
dag = DAG('demo-cron-schedule-interval', default_args=default_args, schedule_interval='0 1 2-30/2 * *', ...)
  • I can check the correctness with [Crontab guru](https://crontab.guru/#0_1_1-31/2_*_*).
    Today is 2020-08-14. From the quote above, the start_date would initially be 2020-08-13 and the first DAG Run should be created, but my cron says it shouldn't create a DAG Run yesterday, since that was an odd day.

In fact, when she checked the system, there was no DAG Run started at 2020-08-14 01:00:00.

So what's wrong with the quote? Why was Yuxia's DAG not running?

Out of curiosity, I checked the code logic in the scheduler_job.
In this post, I'll try to explain the outcome.

The scope

The DAG start_date is not a fixed datetime or timedelta; it is something relative, e.g. days_ago(1).
The start_date is set in default_args ONLY.
The DAG uses a cron string or preset as schedule_interval, e.g. 0 1 2-30/2 * *.

Issue to explain

Will the first DAG Run be kicked off by the Airflow scheduler?

Concepts from code

  • DAG start_date resolution: the scheduler parses the DAGs every 5 seconds (depending on setup).
    Each time the scheduler runs, it calculates a start_date based on the current time (utcnow()).
    days_ago(1) is resolved as follows.

    start_date = utcnow() - 1 day, with the time truncated to midnight, i.e. (day - 1) 00:00:00

    It's very important to realise that start_date + 1 day != utcnow()

  • DAG start_date adjustment: Airflow starts subprocesses to create DAG Runs. It first checks the schedule_interval and, based on the current time, calculates the previous cron time (previous_cron), the one before that (previous_pre_cron), and the next cron time (next_cron):
    previous_pre_cron -> previous_cron -> utcnow() -> next_cron.
    The start_date of the DAG is then adjusted by the scheduler. Within our scope, we can think of the start_date as adjusted by the following rule:
    it picks the later of previous_pre_cron and the resolved start_date and updates dag.start_date.

    dag.start_date = max(previous_pre_cron, dag.start_date)
  • Normalizing to next_run_date, which is the execution date (called normalize_schedule in the code logic). It is the adjusted start_date that gets normalized, and the resulting next_run_date becomes the DAG Run execution_date. Normalization tries to align the start_date to one of the cron times.
    For example, if the cron times are 08-14 01:00:00 and 08-16 01:00:00, any start_date in between, e.g. 08-15 00:00:00, is aligned to 08-16 01:00:00, i.e. the next cron time after the start date. If a start_date equals a cron time, the result is that same time, e.g. normalize_schedule(08-14 01:00:00) = 08-14 01:00:00.

  • Period end
    From the FAQ, we know that the Airflow scheduler triggers a task soon after start_date + schedule_interval has passed, which I suspect causes the confusion in a cron schedule_interval context.

    From the code logic, I think it actually means execution_date + schedule_interval. If your cron means every 2 days, then the schedule_interval is 2 days.

Figure out when a dag will be scheduled

To answer the question, we need to go through the following steps to get the result:

  • Calculate the cron times: previous_pre_cron, previous_cron, next_cron
  • Resolve the start_date
  • Adjust the start_date to align with the schedule_interval
  • Normalize the adjusted start_date
  • Calculate the period end
  • Decide whether a DAG Run is created

Let's assume some facts to continue with a calculation example (a croniter sketch reproducing the first set of numbers follows the worked example below).

cron is set: `0 1 2-30/2 * *` 
start_date: days_ago(1)
today: 2020-08-14
  • Calculate the previous_pre_cron, previous_cron and next_cron times based on the time when the scheduler runs. Since it runs periodically, those times change during the day. We can take 3 examples, as follows.
| scheduler time | previous_pre_cron | previous_cron | next_cron |
|----------------|-------------------|----------------|----------------|
| 08-14 00:30:00 | 08-10 01:00:00    | 08-12 01:00:00 | 08-14 01:00:00 |
| 08-14 02:00:00 | 08-12 01:00:00    | 08-14 01:00:00 | 08-16 01:00:00 |
| 08-15 08:00:00 | 08-12 01:00:00    | 08-14 01:00:00 | 08-16 01:00:00 |
  • Resolve the start_date
    Calculate the start_date based on the time when the scheduler runs; it changes as well with a config such as days_ago(1).
    The start_date is the same across different scheduler times within a day: they all resolve to midnight of the previous day, as you can see below.

    | scheduler time | start_date     |
    |----------------|----------------|
    | 08-14 00:30:00 | 08-13 00:00:00 |
    | 08-14 02:00:00 | 08-13 00:00:00 |
    | 08-15 08:00:00 | 08-14 00:00:00 |
  • Adjust start_date to align with schedule_interval
    As discussed above, we compare the start_date with previous_pre_cron to get the real start_date. The later one wins!

    | scheduler time | previous_pre_cron | previous_cron | next_cron | start_date | adjusted_start |
    |----------------|-------------------|----------------|----------------|----------------|----------------|
    | 08-14 00:30:00 | 08-10 01:00:00 | 08-12 01:00:00 | 08-14 01:00:00 | 08-13 00:00:00 | 08-13 00:00:00 |
    | 08-14 02:00:00 | 08-12 01:00:00 | 08-14 01:00:00 | 08-16 01:00:00 | 08-13 00:00:00 | 08-13 00:00:00 |
    | 08-15 08:00:00 | 08-12 01:00:00 | 08-14 01:00:00 | 08-16 01:00:00 | 08-14 00:00:00 | 08-14 00:00:00 |
  • Normalize the adjusted start_date to find the possible execution_date.
    The normalized start_date (execution_date) is calculated in two steps:

    • Find the next cron time (next_cron(adjusted_start)) and the previous cron time (pre_cron(adjusted_start)) based on the adjusted start_date (which is different from now()).
    • Compare to normalize:

    normalize(adjusted_start) = adjusted_start == pre_cron(adjusted_start) ? pre_cron(adjusted_start) : next_cron(adjusted_start)
| adjusted_start | pre_cron(adjusted_start) | next_cron(adjusted_start) | normalize(adjusted_start) |
|----------------|--------------------------|---------------------------|---------------------------|
| 08-13 00:00:00 | 08-12 01:00:00 | 08-14 01:00:00 | 08-14 01:00:00 |
| 08-13 00:00:00 | 08-12 01:00:00 | 08-14 01:00:00 | 08-14 01:00:00 |
| 08-14 00:00:00 | 08-14 01:00:00 | 08-16 01:00:00 | 08-16 01:00:00 |
  • Period end
    It's easy to get the period end from the normalized start date:

    Period end = normalize(adjusted_start) + schedule_interval
    | adjusted_start | normalize(adjusted_start) | Period end     |
    |----------------|---------------------------|----------------|
    | 08-13 00:00:00 | 08-14 01:00:00 | 08-16 01:00:00 |
    | 08-13 00:00:00 | 08-14 01:00:00 | 08-16 01:00:00 |
    | 08-14 00:00:00 | 08-16 01:00:00 | 08-18 01:00:00 |
  • Decide if a run will be started
    We compare the normalized start date and period end with the current time again; if either one is later than now(), the scheduler won't create a DAG Run.

    | scheduler time | adjusted_start | normalize(adjusted_start) | Period end | DAG Run? |
    |----------------|----------------|---------------------------|----------------|----------|
    | 08-14 00:30:00 | 08-13 00:00:00 | 08-14 01:00:00 | 08-16 01:00:00 | no |
    | 08-14 02:00:00 | 08-13 00:00:00 | 08-14 01:00:00 | 08-16 01:00:00 | no |
    | 08-15 08:00:00 | 08-14 00:00:00 | 08-16 01:00:00 | 08-18 01:00:00 | no |
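The cron arithmetic above can be reproduced with croniter, the library Airflow uses for cron schedules. A minimal sketch of the first row (scheduler time 08-14 00:30:00), my own illustration rather than code taken from the scheduler:

    from datetime import datetime
    from croniter import croniter

    cron = '0 1 2-30/2 * *'
    now = datetime(2020, 8, 14, 0, 30)         # scheduler time
    start_date = datetime(2020, 8, 13, 0, 0)   # days_ago(1) resolved at that moment

    backwards = croniter(cron, now)
    previous_cron = backwards.get_prev(datetime)      # 2020-08-12 01:00:00
    previous_pre_cron = backwards.get_prev(datetime)  # 2020-08-10 01:00:00
    next_cron = croniter(cron, now).get_next(datetime)  # 2020-08-14 01:00:00

    adjusted_start = max(previous_pre_cron, start_date)                 # 2020-08-13 00:00:00
    # normalize: jump to the next cron time (it would stay put if adjusted_start were exactly a cron time)
    execution_date = croniter(cron, adjusted_start).get_next(datetime)  # 2020-08-14 01:00:00
    period_end = croniter(cron, execution_date).get_next(datetime)      # 2020-08-16 01:00:00

    # a DAG Run is only created once the period end has passed
    print(period_end <= now)  # False -> no DAG Run at 08-14 00:30:00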

Let's take the same example as above, but change the start_date to days_ago(2).

  • Cron times and adjusted_start

    | scheduler time | previous_pre_cron | previous_cron | next_cron | start_date | adjusted_start |
    |----------------|-------------------|----------------|----------------|----------------|----------------|
    | 08-14 00:30:00 | 08-10 01:00:00 | 08-12 01:00:00 | 08-14 01:00:00 | 08-12 00:00:00 | 08-12 00:00:00 |
    | 08-14 02:00:00 | 08-12 01:00:00 | 08-14 01:00:00 | 08-16 01:00:00 | 08-12 00:00:00 | 08-12 01:00:00 |
  • normalize

    | scheduler time | adjusted_start | pre_cron(adjusted_start) | next_cron(adjusted_start) | normalize(adjusted_start) |
    |----------------|----------------|--------------------------|---------------------------|---------------------------|
    | 08-14 00:30:00 | 08-12 00:00:00 | 08-10 01:00:00 | 08-12 01:00:00 | 08-12 01:00:00 |
    | 08-14 02:00:00 | 08-12 01:00:00 | 08-12 01:00:00 | 08-14 01:00:00 | 08-12 01:00:00 |
  • Period end and decision

    | scheduler time | normalize(adjusted_start) | Period end     | DAG Run? |
    |----------------|---------------------------|----------------|----------|
    | 08-14 00:30:00 | 08-12 01:00:00 | 08-14 01:00:00 | no |
    | 08-14 02:00:00 | 08-12 01:00:00 | 08-14 01:00:00 | yes |

Summary

  • The first DAG Run is created based on the minimum start_date for the tasks in your DAG.
    It says based on, which doesn’t mean it will run the DAG at start_date.

  • Airflow scheduler triggers the task soon after the start_date + schedule_interval is passed
    Here start_date doesn't mean the start_date you put in default_args; in fact, it doesn't mean any raw start_date when the schedule_interval is a cron expression.
    It means the resolved, adjusted and normalized start_date.

  • Will a DAG Run be started?
    If we want to make sure a DAG Run starts on a specific day (2020-08-14): while the scheduler is running through that day (08-14 00:00:00 to 08-14 23:59:59), the start_date resolved from days_ago(2) is effectively fixed (2020-08-12 00:00:00), which makes it easier to ensure a DAG Run is triggered.

The simple rule is to make the number in days_ago(number_of_days) the same as or larger than the interval of your cron, e.g. if the cron says every 2 days, then use days_ago(2) as the start_date.

  • More
    Once a DAG Run has been triggered, the start date is not that important anymore. Subsequent runs are calculated from the previous DAG Run's execution date, which is already a normalized, fixed date.

Advanced utils you would love in Airflow

There are some objects in Airflow that usually don't appear in any demo on the official website; sometimes we need to read the source code to get inspired by some pieces of code.
In this post, I will collect some of the usages that I have tested, and hopefully someone who lands on this page will take them away and use them in their project.

To get connection details in a DAG

In my case, I have some DB details that we are not fetching data with through an Airflow operator; instead, I need to pass those DB details, such as host, username and password,
to another external system as parameters. I don't want to have them in my code, and the best concept for keeping that kind of information in Airflow is connections!
So I create a connection in Airflow and fetch those details in my DAG. Here is the working example (a sketch showing the other connection attributes follows the steps below).

  • Create a connection from the CLI
    $ airflow connections -a --conn_id test_connection_name \
    --conn_type http \
    --conn_host my_host_name.com \
    --conn_port 8999 \
    --conn_password 123456
  • Create a DAG
    We can use the BaseHook class method to get connection details by id. A full DAG can be found here.
    from airflow.hooks.base_hook import BaseHook
    ...
    connection = BaseHook.get_connection("test_connection_name")
    password = connection.password # This is a getter that returns the unencrypted password.
  • Backfill the DAG and check the log from the task in UI
    $ airflow backfill test-connection-hook -s 2020-08-01 -e 2020-08-01
    [2020-08-13 21:54:40,333] {standard_task_runner.py:78} INFO - Job 9150: Subtask getDetailsFromConnection
    ...
    [2020-08-13 21:54:40,478] {logging_mixin.py:112} INFO - 41, 123456
    ...
    [2020-08-13 21:54:45,256] {local_task_job.py:102} INFO - Task exited with return code 0
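If you need more than the password, the Connection object returned by BaseHook.get_connection also exposes host, port, login and schema. A rough sketch of passing those details on to the external system, using the test_connection_name connection created above and a simple print as a placeholder for the real hand-over:

    from airflow.hooks.base_hook import BaseHook
    from airflow.operators.python_operator import PythonOperator

    def push_db_details_downstream(**context):
        conn = BaseHook.get_connection("test_connection_name")
        # host, port, login and password are plain attributes on the Connection object
        details = {
            "host": conn.host,
            "port": conn.port,
            "username": conn.login,
            "password": conn.password,
        }
        print(details)  # placeholder: hand these over to the external system instead

    push_details = PythonOperator(
        task_id="push_db_details_downstream",
        python_callable=push_db_details_downstream,
        provide_context=True,
        dag=dag,  # assumes the dag object from the full DAG linked above
    )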

To render a Jinja template by using your own context

We usually provide operator args as Jinja templates. If an arg is templated, you will find template_fields defined in the docs or code, such as template_fields = ['bash_command', 'env'] in the bash_operator.

However, if you have Python code where you want to render your own variables, you can use the following method from the helpers module.

There is a helper method built on top of Jinja in Airflow; you can import it in your DAG file:
from airflow.utils.helpers import parse_template_string

Suppose you have a template string in your DAG definition, but you only know the context when the DAG task is running,
for example the execution_date, which is provided in the context as ds.
Then you can use the parse_template_string method to get a template object and render it with your context to get your filename, as follows.

filename_template = 'abc-{{ my_name }}.csv'

def my_sleeping_function(**context):
    # the second element is a Jinja Template when the string contains {{ }}
    _, filename_jinja_template = parse_template_string(filename_template)
    filename = filename_jinja_template.render(my_name='Kai')
    print(filename)  # abc-Kai.csv

task = PythonOperator(
    task_id='sleep',
    python_callable=my_sleeping_function,
    provide_context=True,
    dag=dag,
)

I have another post with more details.

Using ansible playbook in a DevOps pipeline

In the DevOps world, the monkeys only know one thing: automate everything. No interaction between humans and machines!
A password prompt is always against the rule, so here are a few steps to avoid it for an Ansible playbook. The solution is based on an SSH username/password connection.

In the manual way, we usually run an Ansible playbook as follows, and Ansible will prompt us to input the passwords.

  • In our situation, we cannot use an SSH private key to connect to the remote hosts.

    ansible-playbook -i inventory playbook.yml --ask-pass --ask-become-pass --ask-vault-pass

    However, it's not very friendly to a CI/CD process. A few steps to change your playbook to make it easier to run in a pipeline:

  • Add group_vars or host_vars for your playbook, refer to Organize group vars

    playbook/
        group_vars/
            your_group_name.yml
        playbook.yml
  • Configure ansible_user, ansible_password and ansible_become_password in the your_group_name.yml file; they will be loaded when we run the playbook, so we can avoid --ask-pass and --ask-become-pass.
    More info

    ansible_user: YOUR_USER_NAME
    ansible_password: YOUR_USER_PASSWORD
    ansible_become_password: YOUR_BECOME_PASSWORD
  • Encrypt your group vars to avoid storing passwords in clear text

    $ pwd
    playbook
    $ ansible-vault encrypt group_vars/your_group_name.yml
    # input the vault password YOUR_VAULT_PASS

    Now if you run the playbook with the following command, you will only need to input the vault password:

    $ ansible-playbook -i inventory playbook.yml --ask-vault-pass

    Btw, you can always use ansible-vault edit group_vars/your_group_name.yml to change the variables.

  • Create a vault password file instead of using the prompt; this is one of the two ways of giving the vault password.

    $ echo YOUR_VAULT_PASS >> ~/.ansible-vault-pass && chmod 600 ~/.ansible-vault-pass
  • The last step: run your Ansible playbook with the vault password file instead of being asked for it.

    $ ansible-playbook -i inventory playbook.yml --vault-password-file ~/.ansible-vault-pass