Python template engine - Jinja2

Programmers usually want to use variables in their programs to produce different results dynamically from the same code. For example, the first program a newbie writes is usually

print("Hello World")
print("Hello Python")
print("Hello Kai")

The second program teaches us how to read input from the keyboard and print it out.

myinput = input()  # get input from keyboard
print("Hello {}".format(myinput))

A template engine does something similar, but much more powerful. We usually call it a template language, since it provides simple programming capabilities such as variables, loops and conditions.

Jinja2 is a template engine in Python that combines a model (object values), expressions and statement tags with a template.
To understand what Jinja2 does, there are a few things we need to understand first (a tiny example follows the list).

  • Template, a predefined file or string written in Jinja2 syntax and semantics
  • Model or variable, the values used to replace the placeholders in your template
  • Expression, a function used to transform the value of a variable
  • Statement tags, logic control statements such as loops or if conditions
  • Comment, a description of the template that the engine ignores when rendering
  • Render, the procedure of replacing variables or the model in a template while following the statement controls
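
To make these terms concrete, here is a minimal runnable sketch (the variable name and value are purely illustrative):

from jinja2 import Template  # pip install Jinja2

# template: a string containing the {{ name }} placeholder
# model/variable: name='Kai'
# render: combine the two into the final output
print(Template("Hello {{ name }}").render(name="Kai"))  # -> Hello Kai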

Initial look

Here is a simple template in Jinja2:

<!DOCTYPE html>
<html lang="en">
<head>
<title>My Webpage</title>
</head>
<body>
<ul id="navigation">
{% for item in navigation %} <- statement to loop through navigation
<li><a href="{{ item.href }}">{{ item.caption }}</a></li> <- use the variable 'item' created by statement
{% endfor %}
</ul>

<h1>My Webpage</h1>
{{ my_name }} <- will be replaced as Kai

{# a comment is ignored #} <- will be ignored
</body>
</html>

Let's look at how Jinja renders this template when given variables. It's clearer than just reading the explanations.

navigation = [{'href': '/', 'caption': 'Home'}]
my_name = 'Kai'
Template(template).render(navigation=navigation, my_name=my_name)
# rendered output:
<!DOCTYPE html>
<html lang="en">
<head>
<title>My Webpage</title>
</head>
<body>
<ul id="navigation">
<li><a href="/">Home</a></li> <- loop is executed and the values are replaced
</ul>

<h1>My Webpage</h1>
Kai <- my_name is replaced

</body>
</html>

Delimiters

Delimiters are used by the Jinja engine to find variables, expressions, statements and comments in a template; Jinja runs different rendering logic depending on the delimiter.

In a programming language, the compiler or interpreter runs the code line by line and understands our commands through keywords.
In Jinja2, the engine first recognizes these special delimiters, and then tries to interpret the words between them.

There are a few kinds of delimiters. The default Jinja delimiters are configured as follows.

{{ ... }} for Expressions and variables to print to the template output
{% ... %} for Statements
{# ... #} for Comments not included in the template output
# ... ## for Line Statements
  • Advanced users can configure different delimiters when setting up Jinja2

Expressions

Math

{{ XX + - * / // % ** YY }}

String concatenation

We can concatenate strings with the ~ operator:

{{ 'name' ~ name }}

Comparisons

{{ XX == != >= <= < > YY }}

Logic

{{ XX and or YY }}
{{ not XX}}
{{ not (XX and YY) }}

Test

A test expression is used with the is operator; a test is a function such as defined(value).
Full list of test expressions

{{ XX in [YY,ZZ]}}
{{ XX is test expression }}

When using a test in a Jinja template, we can omit the single value parameter, as follows:

{{ XX is defined }}

Some tests take extra parameters, e.g. divisibleby(value, num):

{{ XX is divisibleby(3) }}
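
These tests can be tried directly from Python; a tiny sketch:

from jinja2 import Template

print(Template("{{ 21 is divisibleby(7) }}").render())  # -> True
print(Template("{{ missing is defined }}").render())    # -> False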

Object

We can call Python methods, access attributes, or look up items on a variable object:

{{ XX.method() XX.YY XX[YY] }}

Filter

We can transform a value using filters; 'mapper' might be a more fitting name, since it works like XX.map(mapper1).map(mapper2). Full list of filters

{{ XX | filter1 | filter2 }}
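
For example, chaining the built-in upper and replace filters (the value 'kai' is purely illustrative):

from jinja2 import Template

# each filter receives the value produced by the previous one, like chained mappers
print(Template("{{ name | upper | replace('I', '1') }}").render(name='kai'))  # -> KA1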

Statements

Control statement

In a template, we can use for and if to express simple control logic.

<ul>
{% for user in users %}
<li>{{ user.username }}</li>
{% endfor %}
</ul>

{% if login %} <- test if login is defined, not empty and not false
<div>Home</div>
{% endif %}

https://jinja.palletsprojects.com/en/2.11.x/templates/#list-of-control-structures

Define macros

A macro defines a commonly used piece of template code so that it can be reused in the template like a function.

{% macro input(name, value='', type='text', size=20) -%}
<input type="{{ type }}" name="{{ name }}" value="{{
value|e }}" size="{{ size }}">
{%- endmacro %}

<p>{{ input('username') }}</p>
<p>{{ input('password', type='password') }}</p>

https://jinja.palletsprojects.com/en/2.11.x/templates/#macros
https://jinja.palletsprojects.com/en/2.11.x/templates/#call

Filters

Jinja2 ships with built-in filters, such as upper; a filter can also be applied as a statement to convert a whole block of content to uppercase:

{% filter upper %}
This text becomes uppercase
{% endfilter %}

Assignment

Sometimes we want to set up variables inside the template; the assignment statement is made for that:

{% set navigation = [('index.html', 'Index'), ('about.html', 'About')] %}
{% set key, value = call_something() %}

Block Assignments

Similar to assignment, we can capture the content of a block into a variable:

{% set navigation %}
my name is kai
{% endset %}

Same as
{% set navigation = 'my name is kai' %}

Understand some details about airflow dags

A DAG (directed acyclic graph) is a collection of tasks with directional dependencies. A DAG also has a schedule, a start date and an optional end date.
For each schedule (say daily or hourly), the DAG runs each individual task as its dependencies are met.
Certain tasks depend on their own past, meaning they can't run until their previous schedule (and its upstream tasks) has completed.
A task is a unit of work in Airflow that runs some Python code for a given execution_date.

execution_date: in Airflow, each DAG run is for a specific date and handles the data for that date.
The execution_date is the logical date and time which the DAG run, and its task instances, are running for.
While a task instance or DAG run might have a physical start date of now, its logical date might be 3 months ago because we are busy reloading something.

A DAG run and all task instances created within it share the same execution_date.

Basically, a DAG consists of the following elements, and as an Airflow DAG developer you define them in a DAG file.

  • schedule, how often we run this DAG
  • tasks or operators, what to do in this DAG
  • own-past condition (whether a task depends on its previous scheduled run)
  • dependencies, how tasks depend on each other

Note: a pipeline definition and a DAG definition mean the same thing in the context of Airflow.

DAG definition file or pipeline definition

A DAG definition file is a Python file where we define the elements above; it is picked up by the Airflow server, parsed and persisted into a database.
When the Airflow server parses the DAG definition, it only collects meta information from the Python file; it does not execute the tasks defined in it.
The actual tasks run in a different context, when their scheduled time is met, than the context in which the script is parsed.
Different tasks run on different workers at different points in time, which means this script cannot be used to cross-communicate between tasks.

To define a DAG

from airflow import DAG
from airflow.utils.dates import days_ago
from datetime import timedelta

default_args = {
    'start_date': days_ago(2),
    'schedule_interval': timedelta(seconds=5),
}

dag = DAG('dag-id-data', default_args=default_args)
  • Explain:
    DAGs essentially act as namespaces for tasks. A task_id can only be added once to a DAG.
    If a dictionary of default_args is passed to a DAG, it will apply them to any of its operators. This makes it easy to apply a common parameter to many operators without having to type it many times. (Code 1)

Arguments to DAG:

dag_id, the id of the DAG, defined by the developer when creating the dag in the Python file
description, the description of the DAG, e.g. to be shown on the webserver
schedule_interval, defines how often the DAG runs; it accepts several kinds of values (see the sketch after this list):

  • datetime.timedelta, added to your latest task instance's execution_date to figure out the next schedule
  • dateutil.relativedelta.relativedelta
  • str that acts as a cron expression

start_date (datetime.datetime), the timestamp from which the scheduler will attempt to backfill, e.g. days_ago(2)
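
A minimal sketch of the two most common schedule_interval forms (the dag ids here are just illustrative):

from datetime import datetime, timedelta
from airflow import DAG

# interval as a timedelta: run once a day, starting from start_date
dag_daily = DAG(
    'example-timedelta-schedule',
    start_date=datetime(2020, 1, 1),
    schedule_interval=timedelta(days=1),
)

# interval as a cron expression: run at 06:00 every day
dag_cron = DAG(
    'example-cron-schedule',
    start_date=datetime(2020, 1, 1),
    schedule_interval='0 6 * * *',
)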

To define a Task

from airflow.operators.bash_operator import BashOperator

op_print_date = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag,
)
  • Explain:
    task_id is defined by the DAG developer
    BashOperator is used to run the command 'date'
    the task is assigned to the dag defined above

DAG context when defining tasks

Pass the dag keyword argument when creating the task

task1 = Task(task_id='', dag = dag, ...)

Using python DAG context manager

with dag:
    task1 = Task(task_id='', ...)

Defining dependencies

Method call

task1.set_downstream(task2)

Bitshift Composition

task1 >> task2
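
Bitshift composition also chains longer sequences, and recent Airflow versions accept lists on either side to fan dependencies in or out; a small sketch with throwaway DummyOperator tasks, assuming the dag object defined earlier:

from airflow.operators.dummy_operator import DummyOperator

t1, t2, t3, t4 = [DummyOperator(task_id='t{}'.format(i), dag=dag) for i in range(1, 5)]

# t1 runs first, then t2 and t3 in parallel, then t4
t1 >> [t2, t3] >> t4
# equivalent to:
# t1.set_downstream(t2)
# t1.set_downstream(t3)
# t2.set_downstream(t4)
# t3.set_downstream(t4)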

Code Examples:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'start_date': datetime(2016, 1, 1),
    'owner': 'airflow',
}
dag = DAG('dummy-dag', default_args=default_args)
op = DummyOperator(task_id='dummy', dag=dag)
print(op.owner)  # 'airflow' - the task inherits the owner property from the dag's default_args

Code 2: A simple DAG definition

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago
from datetime import timedelta

default_args = {
    'depends_on_past': False,
    'start_date': days_ago(2),
    'schedule_interval': timedelta(seconds=5),
}

dag = DAG(
    'dag-data-id',
    default_args=default_args,
    description='Data pipeline for data',
)

op_print_date = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag,
)
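
To extend Code 2 with a dependency, a second task could be added like this (the sleep command is just illustrative):

op_sleep = BashOperator(
    task_id='sleep_5',
    bash_command='sleep 5',
    dag=dag,
)

# print_date must finish before sleep_5 starts
op_print_date >> op_sleep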

References

Airflow official website
Python datetime delta

What is apache airflow

What is airflow

Airflow is a clustered platform to schedule, run and monitor series of tasks organized in directed acyclic graphs (DAGs).
It gives better management capabilities to create, log and link jobs that would otherwise be done with cron.

Architecture

An Airflow cluster setup has webserver, scheduler, message broker and worker components, as shown below.
(diagram: Airflow architecture)
The webserver serves the UI, and the scheduler parses the task definitions, the DAGs, from the Airflow home folder; usually the Airflow maintainer
deploys the DAG definitions there, and the scheduler parses them and adds metadata to the meta DB.
Each DAG is given a trigger time, similar to cron; when it is met, the scheduler puts the task into the broker and one of the workers
picks up that task and runs it. Workers work on tasks while following the dependencies defined in the DAG.

Concepts to understand before starting

DAG

Directed Acyclic Graph: a workflow of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. In general, each DAG should correspond to a single logical workflow.

DAG definition

Airflow uses Python files to declare dag objects; a DAG class is provided that accepts the configuration, e.g. dag = DAG('dag_name', default_args=default_args)

  • Dag arguments
    The arguments provided to the dag object when the dag is run
  • Dag default arguments
    The default arguments, provided when defining the dag in the DAG definition Python file
  • Context manager
    Airflow finds a dag definition in a Python file and assigns that dag to the operators defined in the same file
  • Dag objects build
    Airflow finds and runs the DAG definition Python files to create the dags
  • Task relationship
    Tasks in a dag can have dependencies defined with task_1 >> task_2, which is the same as task_1.set_downstream(task_2)
  • Bitshift Composition
    A convenient way to define relationships between tasks, using >>
  • Upstream
    If we have a taskA, a task running before it is an upstream of taskA; taskA needs to wait until its upstream has completed successfully
  • Downstream
    If we have a taskA, a task running after it is a downstream of taskA

Dagbag

A folder in which Airflow searches for DAG definition Python files, executes them and builds the DAG objects into the Airflow system.
When searching for DAGs, Airflow only considers Python files that contain the strings "airflow" and "DAG" by default.
To consider all Python files instead, disable the DAG_DISCOVERY_SAFE_MODE configuration flag.

Task

A unit of work in a DAG; it is an implementation of an Operator, for example a PythonOperator to execute some Python code, or a BashOperator to run a Bash command.

Task instance

A runtime execution of a task; it is the combination of a DAG, a task, and a point in time (execution_date).

Task lifecycle

A task goes through various stages from start to completion.
(diagram: Airflow task lifecycle)

Operators

Operators are the concrete implementations of tasks. An operator defines what the task does, and operators are only loaded by Airflow if they are assigned to a DAG.
e.g.
BashOperator - executes a bash command
PythonOperator - calls an arbitrary Python function
EmailOperator - sends an email
SimpleHttpOperator - sends an HTTP request

Pool

A named pool of slots that tasks can refer to in order to manually balance the Airflow workload.
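
For example, a task opts into a pool via the pool argument on its operator; a small sketch with a hypothetical pool named 'db_pool' (created beforehand in the UI under Admin -> Pools) and a dag object assumed to be defined elsewhere:

from airflow.operators.bash_operator import BashOperator

# only as many of these tasks run concurrently as 'db_pool' has free slots
load_task = BashOperator(
    task_id='load_to_db',
    bash_command='echo loading',
    pool='db_pool',  # hypothetical pool name
    dag=dag,
)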

Queue

A queue that the executor uses to hold tasks until a worker picks them up.

Worker

Processes that take tasks from the queue and run them; workers can listen to one or multiple task queues.

Understand different levels of hypervisors beneath os

I have chosen the phrase 'beneath the OS' instead of 'on top of hardware' to describe hypervisors. It reminds me that hypervisors were born as an abstraction layer between an OS and the actual hardware devices. Noticing QEMU inside KVM led me to try to understand the differences between them.

(diagram: hypervisor types)

Hosted hypervisor (type 2)

It is an application running on top of another OS (the host OS); it runs as a process on the host.
It is essentially software that emulates another system's commands and translates or adapts them into host OS commands, using the host OS drivers to communicate with the hardware (no hardware virtualization support involved).

guest os -> type2 hypervisor -> host os -> hardware

Native hypervisor (type 1)

Also called a bare-metal hypervisor, it is virtualization software that is installed directly on the hardware.
It requires a processor with hardware virtualization extensions, such as Intel VT or AMD-V, to help translate or map instructions to the hardware CPU.

guest os -> type1 hypervisor -> hardware

KVM is a type 1 hypervisor and QEMU is a type 2 hypervisor.

A type 1 hypervisor comes installed with the hardware system, like KVM in Linux. KVM provides hardware acceleration for virtual machines, but it needs QEMU to emulate an operating system.

QEMU is a type 2 hypervisor: it can be installed on an operating system, runs as an independent process, and the instructions we give to QEMU are executed on the host machine. QEMU can run independently without KVM since it is an emulator, but the performance will be poor because QEMU alone does not do any hardware acceleration.

There is an ongoing effort to integrate QEMU and KVM; the combination acts as a type 1 hypervisor, so we get all the benefits of QEMU as an emulator plus KVM hardware acceleration for better performance.

References

  1. Wiki hypervisor
  2. Kernel based virtual machine
  3. linux-kvm org
  4. Quora difference between kvm and qemu
  5. UCSD hardwareVirt paper

Start a nodejs project and publish it into public npm registry

This post covers the basic steps to start an npm project and publish it to npmjs.

Npmjs is the public npm registry that manages packages. Anyone can register an account and publish a package to share with others. It's free as long as you only publish public packages, which anyone can access. Paid features can be found here.

Register an account

To manage your own packages, an account is required in npmjs

Create a project

Simply create a folder on your local machine and run npm init to start an npm project

$ mkdir node_starter && cd node_starter
node_starter$ npm init --scope=kaichu

package name: (@kaichu/node_starter)
version: (1.0.0)
git repository:
keywords:
license: (ISC)
About to write to /Users/kaichu/Workspace/Dev/js/poc/node_in_deep/node_starter/package.json:
{
  "name": "@kaichu/node_starter",
  "version": "1.0.0",
  "description": "A starter demo",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "Kai Chu",
  "license": "ISC"
}

Create the index.js file

which is the main entry point of the package, as specified by the "main" key in package.json

node_starter$ touch index.js
node_starter$ echo 'console.log("hi nodejs")' >> index.js

Log in to the registry from the command line

This looks good enough to try publishing. Before publishing, you need to log in to the registry from your local command line. Run the following command and fill in the username and password you set up in step 1.

$ npm login

which actually creates an authToken entry in your ~/.npmrc file

$ cat ~/.npmrc 
//registry.npmjs.org/:_authToken="xxxxx-xxxx-xxx"

Publish your first npm package

Make sure you are in the node_starter dir and run publish. The --access flag tells the npm registry whether it's a public or restricted package.
Since we are not paying, only the public option is useful.

$ npm publish --access=public

Check the published result

You should have received an email telling you that your package has been published. Go to the npm registry and check:

//Email
Successfully published @kaichu/node_starter@1.0.0

// Npm registry
Click your profile pic -> packages, node_starter is ready for you there.

(screenshot: NPM packages)

Use your package

Since the package is published in the npm registry, you can now use npm i to install it.

$ npm i @kaichu/node_starter

Tag your published package

When publishing a package, npm adds the tag latest to your package, which can be seen on npmjs as follows:
![Latest tag default](/assets/npm_before_tag.png)

When publishing, we can pass the extra option --tag to attach a different tag to the publish:

npm publish --access=public --tag=dev

However, since we have already published this version, by default we cannot publish the same version again. We can use
a separate npm command to add the dev tag instead.

$ npm dist-tag add @kaichu/node_starter@1.0.0 dev
$ npm dist-tag ls @kaichu/node_starter@1.0.0

The result will be shown in npmjs as well.
(screenshot: Latest tag default)

References

  1. my node_starter

Setup airflow with openldap authentication with docker

Airflow provides a few ways to add authentication: https://airflow.apache.org/docs/stable/security.html
In this post, I want to demo how to set up an LDAP server with docker and configure Airflow to use it.

1 Set up OpenLDAP using the docker image; remember to enable LDAP_RFC2307BIS_SCHEMA, which brings in the groupOfNames and posixGroup schema definitions

$ docker run -p 389:389 -p 636:636 -v C:\Users\kach07:/ldap \
--env LDAP_RFC2307BIS_SCHEMA=true \
--name my-openldap-container \
--detach \
osixia/openldap:1.3.0

2 Create two files and put them in your home folder (C:\Users\kach07) so that you can find them inside the container.
I'm lazy and have mapped the whole home dir into my container; alternatively, you can use docker cp to move the files into your container.

  • group.ldif

    dn: cn=airflow-super-users,dc=example,dc=org
    cn: airflow-super-users
    objectClass: top
    objectClass: groupOfNames
    objectClass: posixGroup
    member: uid=test_user,dc=example,dc=org
    gidNumber: 100
    memberUid: test_user
  • user.ldif

    dn: uid=test_user,dc=example,dc=org
    uid: test_user
    sn: User
    cn: Test User
    objectClass: person
    objectClass: organizationalPerson
    objectClass: posixAccount
    objectClass: top
    loginShell: /bin/bash
    homeDirectory: /home/testuser
    uidNumber: 1001
    gidNumber: 1001
  • You can change the password for your test_user [commands are run inside your container]

    $ docker exec -it my-openldap-container /bin/bash
    $ ldappasswd -x -H ldap://localhost -D "cn=admin,dc=example,dc=org" -w admin -S -ZZ uid=test_user,dc=example,dc=org

    or

    $ docker exec my-openldap-container \
    ldappasswd -x -H ldap://localhost -D "cn=admin,dc=example,dc=org" -w admin -S -ZZ uid=test_user,dc=example,dc=org

3 Go into the container and use the ldapadd command to add the user and group entries [commands are run inside your container]

$ docker exec -it my-openldap-container /bin/bash
$ cd /ldap
$ ldapadd -x -H ldap://localhost -w admin -D 'cn=admin,dc=example,dc=org' -f /ldap/user.ldif
$ ldapadd -x -H ldap://localhost -w admin -D 'cn=admin,dc=example,dc=org' -f /ldap/group.ldif

4 Check your container IP. If you are running on Linux, you can use docker run --network host and skip this part. I'm too lazy to create a network for the two containers, so I simply use this IP later in airflow to connect to the LDAP server.

$ docker inspect my-openldap-container --format='{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}'
//my output => 172.17.0.2

5 Change config in your airflow.cfg

[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.ldap_auth

[ldap]
# set this to ldaps://<your.ldap.server>:<port>
uri =ldap://172.17.0.2:389
user_filter = objectClass=*
user_name_attr = uid
group_member_attr = memberOf
superuser_filter = memberOf=CN=airflow-super-users,dc=example,dc=org
data_profiler_filter =
bind_user = cn=admin,dc=example,dc=org
bind_password = admin
basedn = dc=example,dc=org
# Do not comment this line out; explicitly setting an empty cacert= is important
cacert =
search_scope = LEVEL
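
Before starting Airflow, you can sanity-check the bind credentials and the test user with the ldap3 library (the same library the ldap_auth backend uses); a quick sketch reusing the host, bind DN and password from the config above:

from ldap3 import Server, Connection, ALL

server = Server('ldap://172.17.0.2:389', get_info=ALL)
conn = Connection(server, user='cn=admin,dc=example,dc=org', password='admin')

if conn.bind():
    # look up the test user created in step 3
    conn.search('dc=example,dc=org', '(uid=test_user)', attributes=['cn', 'uid'])
    print(conn.entries)
else:
    print('bind failed:', conn.result)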

6 Start your airflow docker

$ docker run  --mount type=bind,source="C:/Users/kach07/airflow/airflow.cfg",target=/usr/local/airflow/airflow.cfg -p 8080:8080 testairflow webserver

I have built my own local image, since airflow.contrib.auth.backends.ldap_auth requires the ldap3 package and the airflow[ldap] extra

7 Open airflow in your browser; you will be asked to log in. Use the user test_user and the password you set up.
http://localhost:8080/admin

8 More things to say

  • Others: if you want to play around with ldapsearch on the command line
    ldapsearch -xLLL -H ldap://localhost -w admin -D 'cn=admin,dc=example,dc=org' -b dc=example,dc=org cn=*

  • Important:
    Don't add any quotes in your configuration values. I have a # in my password, so I decided to use an env variable to test my setup

    docker run --mount type=bind,source="C:/Users/kach07/airflow/airflow.cfg",target=/usr/local/airflow/airflow.cfg \
      --env "AIRFLOW__LDAP__BIND_PASSWORD=Fw5Mk#Sr" -p 8080:8080 testairflow webserver
  • For AD Users

    user_name_attr = sAMAccountName
    search_scope = SUBTREE

Start an ansible project

  1. Create a folder to hold all your files
    $ mkdir ansible-project
  2. Add hosts file
    $ touch hosts
    $ vi hosts
    abc[1:4].example.domain
  3. Add ansible.cfg file
    Create a template in the project folder and copy it to your home folder so that others won't change it by mistake; the available configuration options are documented at https://docs.ansible.com/ansible/latest/reference_appendices/config.html
    $ touch ansible.cfg.template
    [defaults]
    remote_user = YOUR_USER_NAME
    $ cp ansible.cfg.template ~/.ansible.cfg
  4. Run ping to test connection
    $ ansible -i hosts all -k -m ping
  5. Add vars or vaults
    mkdir vars && cd vars
    ansible-vault create vaults.yml
    touch vars.yml
  6. Add main.yml
    - hosts: all
      vars_files:
        - vars/vars.yml
      tasks:
        - name: Push dags to remote
          taskName:

Using airflow jinja to render a template with your own context

Jinja is well explained when used with operators that support templated fields, such as the bash_command parameter of a BashOperator:

templated_command = """
{% for i in range(5) %}
echo "{{ ds }}"
echo "{{ macros.ds_add(ds, 7)}}"
echo "{{ params.my_param }}"
{% endfor %}
"""

t3 = BashOperator(
    task_id='templated',
    depends_on_past=False,
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag,
)

Using a Jinja template in op_args or op_kwargs of a PythonOperator:

def my_sleeping_function(**context):
    # context['random_base'] holds the rendered execution date, e.g. '2020-02-20'
    print(context['random_base'])

task = PythonOperator(
    task_id='sleep',
    python_callable=my_sleeping_function,
    op_kwargs={'random_base': '{{ ds }}'},
    dag=dag,
)

However, if you have Python code where you want to render your own variables, you can use the following method from the helpers module.
It is a helper built on top of Jinja in Airflow, and you can import it in your dag file:

from airflow.utils.helpers import parse_template_string

Suppose you have a template string in your dag definition, but you only know the context when the dag task is running,
for example the execution_date, which is provided in the context as ds.
Then you can use the parse_template_string method to get a template object and render it with the context to build your filename, as follows:

filename_template = 'abc-{{ ds }}.csv'

def my_sleeping_function(**context):
    # parse_template_string returns (plain_string, jinja_template); the template is set when '{{' is present
    _, filename_jinja_template = parse_template_string(filename_template)
    filename = filename_jinja_template.render(**context)
    print(filename)  # e.g. abc-2020-02-20.csv

task = PythonOperator(
    task_id='sleep',
    python_callable=my_sleeping_function,
    provide_context=True,  # Airflow 1.x: pass the execution context (ds, etc.) to the callable
    op_kwargs={'random_base': '{{ ds }}'},
    dag=dag,
)

Understand airflow execution time and dag parsing time

Airflow dags are written in Python, and there are actually two runtimes for the same code. First, Airflow parses the dags and creates metadata in the database.
Then, when the execution time is met, a worker runs the Python code to do the actual scheduled work.

Don't mix them up when writing an Airflow dag. Here is a small example to show the difference between them.

In this example, I want to demonstrate why it is important for developers to keep this in mind.
Usually, when we create a dag, we want to parameterize it so that we can reuse it in different environments without changing the code.
This is implemented using variables.
In the following case, I want to get the Python pip version and show it when the dag is executed.
I print it in two places: once at the module (root) level and once inside the Python operator's callable function.

# demo-airflow-capability.py
import os
...
print(os.environ['PYTHON_PIP_VERSION'])

def print_env():
    print(os.environ['PYTHON_PIP_VERSION'])

with dag:
    os_operator = PythonOperator(task_id="os_operator", python_callable=print_env)
...

As we know, we can run python demo-airflow-capability.py to validate the syntax, which is essentially what the parser does.
The pip version is printed once when the file is parsed:

$ python demo-airflow-capability.py
'20.0.2'

When we trigger the dag to be executed, the pip version '20.0.2' is printed once as well; that comes from the Python operator.
To test a dag execution, we can use this command from the airflow CLI:

$ airflow test demo-airflow-capability os_operator -e 2020-04-08

Conclusion:

  1. The same dag file is parsed and executed in two different places with different logic routines.
  2. Think of the dag file as the main entry point for the parser.
  3. Think of the root operator of the dependency tree as the main entry point for the executor.
  4. Keep time-consuming operations, such as fetching variables, out of the dag parsing phase (see the sketch below).
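
One way to follow point 4 is to fetch Airflow Variables through the template engine, so the lookup happens at execution time rather than at parse time; a sketch assuming a Variable named my_conf exists and the dag object from the appendix below:

from airflow.operators.bash_operator import BashOperator
# from airflow.models import Variable

# Parse-time fetch: this would hit the metadata DB every time the scheduler parses the file
# my_conf = Variable.get("my_conf")

# Execution-time fetch: {{ var.value.my_conf }} is only rendered when the task runs
print_conf = BashOperator(
    task_id='print_conf',
    bash_command='echo {{ var.value.my_conf }}',
    dag=dag,
)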

Appendix:

Dag file demo-airflow-capability.py

from airflow.utils.dates import days_ago
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
import os

args = {
    'owner': 'airflow',
    'start_date': days_ago(2),
}

dag = DAG(
    dag_id='demo-airflow-capability',
    default_args=args,
    schedule_interval="@once",
    tags=['example']
)


print(os.environ['PYTHON_PIP_VERSION'])

def print_env():
    print(os.environ['PYTHON_PIP_VERSION'])

with dag:
    os_operator = PythonOperator(task_id="os_operator", python_callable=print_env)

os_operator

Ways to add tasks to an ansible playbook

Going through the official website as a beginner is somewhat difficult and confusing, since ansible provides a few slightly different ways to add tasks to
a playbook. As a software developer, I think this does not make users' lives better; on the contrary, it makes ansible harder to work with.
My opinion is that

Giving a single, consistent interface makes things much easier for users than giving them the freedom to do the same thing in several ways.

In this post, I'm summarizing 4 ways of adding tasks to an ansible playbook. I recommend that you don't mix them too much.

1 Tasks

1.1 Use tasks with key-values

---
- hosts: webservers
  tasks:
    - name: make sure apache is running
      service:
        name: httpd
        state: started

1.2 Use tasks with module argument lists

---
- hosts: webservers
  tasks:
    - name: run this command and ignore the result
      shell: /usr/bin/somecommand || /bin/true

1.3 Use tasks with import_tasks

---
- hosts: webservers
  tasks:
    - name: run this command and ignore the result
      import_tasks: sometask.yaml

The command and shell modules are the only modules that just take a list of arguments and don’t use the key=value form

2 Roles

2.1 Use roles with simple name

---
- hosts: webservers
  roles:
    - common

2.2 Use roles with full key-values

---
- hosts: webservers
  roles:
    - role: common
      vars:
        dir: '/opt/a'

In 2.2, we add the role: key explicitly.

3. Task Role (Ansible 2.4)

Use import_role or include_role to add the tasks of a target role. [Difference between import and include](https://docs.ansible.com/ansible/latest/user_guide/playbooks_reuse.html#dynamic-vs-static)

---
- hosts: webservers
  tasks:                  # tasks here, not roles, compared to section 2
    - import_role:
        name: example
    - include_role:
        name: example
      vars:
        dir: '/opt/a'

4. Import a whole playbook

---
- name: Include a play after another play   # this is at the top level of the playbook
  import_playbook: otherplays.yaml

# otherplays.yaml
---
- hosts: webservers
  tasks:
    ...

5. References

https://docs.ansible.com/ansible/latest/user_guide/playbooks_intro.html
https://docs.ansible.com/ansible/latest/user_guide/playbooks_reuse_roles.html