In this post:

- How to approach Scheduler's fine-tuning
- What resources might limit Scheduler's performance
- What you can do to improve Scheduler's performance

The Scheduler is responsible for two operations:

- continuously parsing DAG files and synchronizing with the DAGs in the database
- continuously scheduling tasks for execution

Those two operations are executed in parallel by the scheduler and run independently of each other in different processes.

The short version is that users of PostgreSQL 10+ or MySQL 8+ are all ready to go: you can start running as many copies of the scheduler as you like, and there is no further setup or configuration needed. If you are using a different database, please read on.

To maintain performance and throughput, one part of the scheduling loop does a number of calculations in memory (because round-tripping to the database for each TaskInstance would be too slow), so we need to ensure that only a single scheduler is in this critical section at once; otherwise the various limits would not be correctly respected. To achieve this, we use database row-level locks (using SELECT ... FOR UPDATE).

This critical section is where TaskInstances move out of the scheduled state and are enqueued to the executor, whilst ensuring the various concurrency and pool limits are respected. The critical section is obtained by asking for a row-level write lock on every row of the Pool table (roughly equivalent to SELECT * FROM slot_pool FOR UPDATE NOWAIT, although the exact query is slightly different). That is why the databases that are fully supported and provide an "optimal" experience are PostgreSQL 10+ and MySQL 8+.

In order to fine-tune your scheduler, you need to take a number of factors into account:

- What kind of filesystem you use to share the DAGs (this impacts the performance of continuously reading DAGs)
- How fast the filesystem is (with many distributed cloud filesystems you can pay extra to get better performance)
- How much memory you have for your processing
- How much networking throughput you have available
- The logic and definition of your DAG structure:
  - How large the DAG files are (remember that the DAG parser needs to read and re-parse each file every n seconds)
  - How complex the DAGs are (how fast they can be parsed, how many tasks and dependencies they have)
  - Whether parsing your DAG file involves importing a lot of libraries or heavy processing at the top level
- The scheduler's own configuration:
  - How many parsing processes you have in your scheduler
  - How much time the scheduler waits between re-parsing the same DAG (parsing happens continuously)
  - How many task instances the scheduler processes in one loop
  - How many new DAG runs should be created/scheduled per loop
  - How often the scheduler should perform cleanup and check for orphaned tasks, adopting them if needed

In order to perform the fine-tuning, it is good to understand how the Scheduler works under the hood. You can take a look at the Airflow Summit 2021 talk "Deep Dive into the Airflow Scheduler" before fine-tuning.
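The scheduler knobs listed above map onto `[scheduler]` options in airflow.cfg. The option names below follow Airflow 2.x but are an assumption to verify against the configuration reference for your version (some have moved or been renamed across releases); the values are only examples, not recommendations:

```ini
[scheduler]
# How many DAG parsing processes run inside the scheduler
parsing_processes = 2
# Minimum seconds between re-parsing the same DAG file
min_file_process_interval = 30
# How many task instances the scheduler queries per scheduling loop
max_tis_per_query = 512
# How many new DAG runs to create per scheduler loop
max_dagruns_to_create_per_loop = 10
# Seconds between checks for orphaned tasks to clean up / adopt
orphaned_tasks_check_interval = 300.0
```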
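The "lock or fail immediately" behaviour that SELECT ... FOR UPDATE NOWAIT gives each scheduler can be sketched with a toy example. SQLite is used here only because it ships with Python; it has no FOR UPDATE, but BEGIN IMMEDIATE with a zero busy-timeout fails straight away when another connection holds the write lock, which illustrates the same "skip the critical section instead of waiting" idea (the real Airflow query and table usage differ):

```python
import os
import sqlite3
import tempfile

# Toy illustration of "lock or fail immediately" (the NOWAIT idea).
# SQLite stands in for the scheduler database; Airflow itself uses
# row-level SELECT ... FOR UPDATE locks on PostgreSQL/MySQL.
path = os.path.join(tempfile.mkdtemp(), "toy.db")

scheduler_a = sqlite3.connect(path, timeout=0)  # timeout=0 ~ NOWAIT
scheduler_b = sqlite3.connect(path, timeout=0)

scheduler_a.execute("CREATE TABLE slot_pool (pool TEXT, slots INTEGER)")
scheduler_a.commit()

# Scheduler A enters the critical section by taking the write lock.
scheduler_a.execute("BEGIN IMMEDIATE")
scheduler_a.execute("INSERT INTO slot_pool VALUES ('default_pool', 128)")

# Scheduler B tries to enter too; with no busy-timeout it fails at
# once instead of blocking, so it can simply retry on its next loop.
try:
    scheduler_b.execute("BEGIN IMMEDIATE")
    entered = True
except sqlite3.OperationalError:
    entered = False

scheduler_a.commit()  # A leaves the critical section
print(entered)  # False: only one scheduler was in the critical section
```

Failing fast rather than queueing on the lock is what lets a second scheduler carry on with other work and try again on its next loop.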
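Because DAG files are re-parsed continuously, heavy imports or processing at the top level of the file are paid on every parse, not just when a task runs. A minimal sketch of the deferral pattern (the function name is illustrative, and `json` stands in for a genuinely heavy library):

```python
# A DAG file is re-read by the DAG file processor every n seconds,
# so everything at module top level runs on every parse.

# Anti-pattern (costly on every single parse):
#   import pandas as pd
#   LOOKUP = pd.read_csv("big_lookup.csv")

def transform(**context):
    """Task callable: heavy imports live here, so their cost is paid
    only when the task actually executes, not at parse time."""
    import json  # stand-in for a heavy dependency such as pandas
    return json.dumps({"parsed_cheaply": True})

# Keep only cheap DAG/task wiring at top level (sketched, not run here):
#   with DAG("example") as dag:
#       PythonOperator(task_id="transform", python_callable=transform)
```

The same reasoning applies to any top-level work: network calls, file reads, or expensive object construction should live inside the callable.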