Creating cron jobs with Python

As measured by the TIOBE index, Python was the most popular programming language used worldwide in June 2022, with a 12.2% share. Python is popular because it's easy to learn, has excellent libraries, and hits a sweet-spot of performance and flexibility that appeals to developers, data scientists and analysts alike. This flexibility also makes Python a robust and popular choice when creating a cron job.

Before going further we should define what, exactly, is a cron a job. By the classic definition, a cron job is a command on a Linux or Unix server that's run on a user-defined schedule using the system-level cron scheduler. This is still true today, and now nearly 50 years old, cron schedulers worldwide are running more jobs than ever before. But it's also true that here in 2022 the definition of cron job has expanded quite a bit, and generally means something running in the background at a scheduled time.

Most of the time, but not always, a defining characteristic of a cron job is the infamous cron schedule syntax, a terse format to define almost any kind of schedule that looks sort of like a regular expression. These schedules, like 5 4 * * * are hard enough to read and write that 400,000 visitors use Crontab Guru every month to decipher them into plain english.

How to create a cron job with Python

This guide will introduce a few of the most popular ways to create cron jobs with Python, but it is by no means an exhaustive list. There are many libraries and platforms for running jobs, but here we will cover:

  1. Running any Python script as a cron job on Linux.
  2. Running a cron job using Celery with the beat scheduler.
  3. Using a third party tool like Google Cloud Scheduler to run your Python code.

Running a Python script as a cron job

Before you can schedule your python code to run as a cron job, you need to have it organized as a script that you can run from the command line. If your code exists inside something like a Django view or FastAPI path, you will need to do some refactoring to move code into a shared module that can be used by your framework and invoked directly in a Python script.

If you are running your code as a command line script for the first time, there are usually a few things that every script needs.

  1. You will need to import every library you use, and some may require initialization. In this example, the script is loading and initializing Django:
    import os
    import sys
    import django
    
    BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    sys.path.append(BASE_DIR)
    django.setup()
    
    from application import archiver
    archiver.run() 
  2. If you are using any observability tools like Sentry or DataDog, you will need to ensure those are initialized by whatever code you are calling, or initialize them here as part of your bootstrap. For example, this code initializing Sentry could be added before calling django.setup() above.
    import sentry_sdk
    sentry_sdk.init(
        dsn="https://examplePublicKey@o0.ingest.sentry.io/0",
        traces_sample_rate=1.0,
    )
  3. Write progress and error logs to stdout by configuring a logger or simply using print(). Cronitor users will be able to see full output for all jobs directly in the Cronitor UI, but even if you're not using Cronitor this output is gathered automatically by cron for failed jobs, so it's a best practice no matter what you have chosen for your monitoring stack.
  4. Save your script and commit it to your source control. We will call our script run_archiver.py

With script in hand, the first thing you should do is run it yourself from the command line to ensure that it runs successfully and writes the output logs that you will want to see from your job.

$ python run_archiver.py
Running Job
...
Success!

Finally, it's time to register your script with the cron daemon that is running on your machine. There are multiple ways to schedule cron jobs but this guide will focus on two simple options. The method you choose should depend on whether or not you have root access on the linux machine.

  • Scheduling a cron job when you have root or sudo privileges

    With administrator-level access to a linux instance, the best practice for scheduling a cron job for your scripts is to create an application-specific crontab file in the server's cron.d directory, usually located at /etc/cron.d/. This "drop in" directory is scanned every minute by cron and all you need to do is copy a crontab file into place and wait for the scheduled time for your jobs to run.

    The crontab file itself has a very simple format. It may include comments, environment variables, and the jobs to be run. Each job must fit on a single line that begins with a valid cron schedule expression, the user the job should run as, and finally the command string to be run.

    # Run the data archiver once a day as the dataproc user
    5 4 * * * dataproc /etc/scripts/run_archiver.py
  • Scheduling a cron job without root privileges

    If your system administrator allows it, cron jobs can also be scheduled by users in a private user crontab file. These user crontabs can only run commands as yourself. To interact with your user crontab, use the crontab command.

    View your crontab file:

    $ crontab -l

    Open your crontab in a text editor to make updates:

    $ crontab -e

    From the text editor, add your new cron entry:

    # Run the data archiver once a day
    5 4 * * * /etc/scripts/run_archiver.py

With your script written, tested and scheduled you are almost done. You would think that a script that works for you will also work when run as a cron job, but there are a few key differences between your shell and the cron execution environment and these differences can often cause headaches when scheduling jobs

  1. Use absolute paths

    The current working directory used by Cron depends on several factors, but it's usually the job owner's home directory. Avoid any confusion here by exclusively using fully-qualified paths when you add the job to the crontab file. Similarly, if your script is invoking any other commands, it should use a fully qualified path or set its own working directory before doing any invocations.

  2. Bring your environment variables

    Even though cron runs your cron jobs using a real user account, it does not trigger an interactive session, so any environment variables you may be loading in .bash_profile or .bashrc will not be available. If your script expects certain environment variables, you can either set them within the crontab file itself, or have each line of your crontab source your .bash_profile before invoking your script.

    5 4 * * * source ~/.bash_profile ; python /etc/scripts/run_archiver.py
  3. Optionally invoke your virtual environment

    If you are using a tool like virtualenv to isolate your Python dependencies, you will need to activate that virtual environment when invoking your script. The easiest way to do this is to use the python binary symlinked in the environment:

    5 4 * * * /etc/virtualenvs/example/bin/python /etc/scripts/run_archiver.py

By following these best practices and with a small amount luck, you will hopefully see that your cron job runs precisely at it's scheduled time and finishes successfully. If you are not so lucky, we have a comprehensive troubleshooting guide for cron jobs that will walk you step-by-step through the reasons that cause most cron job failures.


Running a cron job using Celery with the beat scheduler.

Celery is a popular Python library for running background tasks with asynchronous task queues that trigger commands in a separate worker process. The most common use case is to add jobs to a celery task queue from your application code when a user does things like ask for a password reset email. When you use the celery-beat package, you can also add scheduled jobs alongside these event-driven tasks. Note: This guide assumes that you are already running Celery in your app.

First, locate your Celery initialization code and add a task for your new cron job. This is usually in a file called tasks.py:

from celery import Celery
from celery.schedules import crontab
app = Celery()

# ... existing celery tasks here

@app.task
def run_archiver():
    from application import archiver
    archiver.run()

Define a beat schedule that will be read by Celery on start-up:

app.conf.beat_schedule = {
    'run-archiver': {
        'task': 'tasks.run_archiver',
        'schedule': '5 4 * * *'
    },
}
app.conf.timezone = 'UTC'

Bonus: Automatically monitor your Celery cron jobs with Cronitor

import cronitor.celery
# Discover all of your celery tasks and automatically add monitoring.
cronitor.celery.initialize(app, api_key="<cronitor api key here>")

Using a third party tool like Google Cloud Scheduler to run your Python code.

The idea behind a cloud scheduler is simple: You provide a schedule and a URL to an endpoint in your web application, and the service will load your endpoint at the appropriate times, often with nice features like retries and output capture. Major cloud vendors offer schedulers, for example the Google Cloud Scheduler offered by GCP, and there are a number of other SaaS services for scheduling and running jobs, including the popular free service Cron-Job.org

Here's a look at the configuration form for a new scheduled job using Cloud Scheduler. Beyond common fields like name and description, you are left with the same configuration options as you find with Linux or Celery cron jobs: A Frequency field for the cron schedule, a Timezone selector used when evaluating your schedule, and finally a Target that specifies the web URL to trigger.

Google Cloud Scheduler User Interface

The advantage of using a 3rd party scheduler is that you can add a cron job to your application simply by adding a new endpoint and scheduling its URL for execution. This is a popular method for getting started with cron jobs, but there are a few concerns with this approach that could result in unreliable cron job execution:

  1. Be careful with timeouts Because this job is triggered by a web request there are at least two places it might hit a timeout that kills your job before it has completed all of its work. First, the service you are using likely has implemented a timeout. For example, Google Cloud Scheduler has a 30 minute timeout on all requests. Separately, your webserver or application platform may also enforce a timeout. It is common for a nginx webserver to have timeouts as low as 60 seconds, and if you are running your application in a Serverless environment, AWS Lambda invocations are limited to just 15 minutes. When a timeout is hit your job will be killed immediately, no matter how close or far it is from finishing its job.
  2. Limited log ingestion Writing audit logs from a cron job as it runs is critical to verifying and troubleshooting your jobs, but this can be tricky when you are using a 3rd party service to invoke your jobs. Third party services will limit both the total bytes allowed and their retention period. This truncation may even happen silently so it's important to verify and validate your job logs after scheduling a new job.

Cronitor automatically finds and monitors cron jobs.  Complete your setup in minutes.