Nucleotide Sequence Querying with BLASTX

This tutorial shows how a computational biology workload can run on Flyte. The task is to query a nucleotide sequence against a local protein database to identify potential homologues. This guide shows you how to:

  • Load the data

  • Instantiate a ShellTask to generate and run the BLASTX search command

  • Load BLASTX results and plot a graph (e_value vs. pc_identity)
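The third step can be sketched without Flyte or a plotting library. The sketch below assumes the search was run with BLAST's tabular output (`-outfmt 6`), which writes twelve tab-separated columns and no header row. The names `pc_identity` and `e_value` follow the axis labels mentioned above; the other column labels, the function name, and the sample row are illustrative assumptions, not output from this tutorial's data.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
# These labels are our own; BLAST itself writes no header row.
BLAST_COLUMNS = [
    "query", "subject", "pc_identity", "alignment_length",
    "mismatches", "gap_opens", "q_start", "q_end",
    "s_start", "s_end", "e_value", "bitscore",
]


def parse_blast_tabular(handle):
    """Return one dict per hit, with the numeric fields we plot converted."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST_COLUMNS, row))
        hit["pc_identity"] = float(hit["pc_identity"])
        hit["e_value"] = float(hit["e_value"])
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    return hits


# Made-up row, for illustration only.
sample = "query1\tsubjectA\t98.5\t200\t3\t0\t1\t600\t1\t200\t1e-120\t420\n"
hits = parse_blast_tabular(io.StringIO(sample))

# (e_value, pc_identity) pairs are what the final plot would display.
points = [(h["e_value"], h["pc_identity"]) for h in hits]
print(points)
```

The resulting pairs can be handed to any plotting library; E-values span many orders of magnitude, so a logarithmic axis is usually the more readable choice.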

This tutorial is adapted from the guide "Using BLAST+ Programmatically with Biopython".

About BLAST

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

You can read more about BLAST on the BLAST homepage.

BLASTX

BLASTX translates a nucleotide query in all six reading frames and compares the translations against a protein database. It is used to search for genes and predict their functions or relationships with other sequences, typically to identify protein-coding genes in genomic DNA or cDNA. It can also indicate whether a novel nucleotide sequence is a protein-coding gene and identify the proteins encoded by transcripts or transcript variants.
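BLASTX works on a nucleotide query by comparing its six-frame conceptual translation against protein sequences. The sketch below illustrates only that translation idea in plain Python, using the standard genetic code. It is not how BLAST implements the step, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unrecognised codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame labels (+1..+3, -1..-3) to translated protein strings."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


frames = six_frame_translations("ATGGCC")
for label, protein in frames.items():
    print(label, protein)
```

Because the correct reading frame of a novel sequence is generally unknown, searching all six translations lets a protein-level match surface regardless of strand or offset.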

In this tutorial, we will run a BLASTX search.
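As a rough picture of what such a search looks like on the command line, the sketch below assembles a `blastx` invocation as an argument list in plain Python. It does not use Flyte's ShellTask and does not execute anything. The options `-query`, `-db`, `-out`, `-outfmt`, and `-evalue` are standard BLAST+ options; the helper name and the file paths are hypothetical.

```python
import shlex


def build_blastx_command(query, db, out, outfmt=6, evalue=None):
    """Assemble a blastx argument list; nothing is executed here."""
    cmd = [
        "blastx",
        "-query", str(query),    # nucleotide query in FASTA format
        "-db", str(db),          # protein database name (path prefix)
        "-out", str(out),        # file that receives the results
        "-outfmt", str(outfmt),  # 6 = tab-separated table
    ]
    if evalue is not None:
        cmd += ["-evalue", str(evalue)]  # optional E-value cutoff
    return cmd


# Hypothetical paths, chosen only for this example.
cmd = build_blastx_command("query.fasta", "db/proteins", "results.tab")
print(shlex.join(cmd))
```

Once BLAST+ is installed, a list like this can be passed to `subprocess.run(cmd, check=True)`; building it as a list rather than a single string avoids shell quoting problems with unusual paths.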

Data

The database comprises predicted gene products from five Kitasatospora genomes. The query is a single nucleotide sequence of a predicted penicillin-binding protein from Kitasatospora sp. CB01950.
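Before running a search, it can help to confirm that the query file really holds a single nucleotide record. The following is a minimal sketch, assuming the query is stored in FASTA format; the reader function and the example record are hypothetical and are not the tutorial's actual data.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up record, for illustration only.
example = io.StringIO(">example_query hypothetical record\nATGGCC\nGGTTAA\n")
records = list(read_fasta(example))

# A BLASTX query should be nucleotide: only A, C, G, T (and N for unknowns).
single_record = len(records) == 1
is_nucleotide = set(records[0][1].upper()) <= set("ACGTN")
print(single_record, is_nucleotide)
```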

To run the example, download the database from Flytesnacks datasets.

Note

To run the example locally, install BLAST+ first. OS-specific installation instructions are in the BLAST user manual. This example uses BLAST version 2.12.0.
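A quick way to confirm a local installation is to look for the `blastx` executable on `PATH` and ask it for its version. This is a minimal sketch using only the standard library; the helper names are hypothetical, and it assumes the version line is the first line `blastx -version` prints.

```python
import shutil
import subprocess


def find_blastx():
    """Return the path to blastx if it is on PATH, else None."""
    return shutil.which("blastx")


def blastx_version(executable):
    """Return the first line printed by `blastx -version`, or '' if none."""
    result = subprocess.run(
        [executable, "-version"],
        capture_output=True, text=True, check=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


path = find_blastx()
if path is None:
    print("blastx not found on PATH; install BLAST+ first")
else:
    print(blastx_version(path))
```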

Dockerfile

FROM ubuntu:focal

ENV VENV=/opt/venv
ENV LANG=C.UTF-8
ENV LC_ALL=C.UTF-8
ENV PYTHONPATH=/root

RUN apt-get update \
    && apt-get install -y software-properties-common \
    && add-apt-repository ppa:deadsnakes/ppa \
    && apt-get update \
    && apt-get install -y \
    cmake \
    python3.8 \
    python3.8-venv \
    python3.8-dev \
    make \
    build-essential \
    libssl-dev \
    libffi-dev \
    python3-pip \
    zlib1g-dev \
    vim \
    wget

# Install the AWS cli separately to prevent issues with boto being written over
RUN pip3 install awscli

# Install gcloud for GCP
RUN apt-get install curl --assume-yes

RUN curl -sSL https://sdk.cloud.google.com | bash
ENV PATH=$PATH:/root/google-cloud-sdk/bin

# Virtual environment
RUN python3.8 -m venv ${VENV}
RUN ${VENV}/bin/pip install wheel

# Download BLAST
RUN wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.12.0/ncbi-blast-2.12.0+-x64-linux.tar.gz && \
    tar -xzf ncbi-blast-2.12.0+-x64-linux.tar.gz

ENV PATH=".:/ncbi-blast-2.12.0+/bin:${PATH}"

# Fail the build if BLAST is not installed
RUN blastx -version

# Set the working directory
WORKDIR /root

# Install Python dependencies
COPY blast/requirements.txt /root
RUN ${VENV}/bin/pip install -r /root/requirements.txt

# Copy the makefile targets to expose on the container. This makes it easier to register.
COPY in_container.mk /root/Makefile
COPY blast/sandbox.config /root

# Copy the actual code
COPY blast/ /root/blast/

# Copy over the helper script that the SDK relies on
RUN cp ${VENV}/bin/flytekit_venv /usr/local/bin/
RUN chmod a+x /usr/local/bin/flytekit_venv

# This tag is supplied by the build script and will be used to determine the version
# when registering tasks, workflows, and launch plans
ARG tag
ENV FLYTE_INTERNAL_IMAGE=$tag
ENV FLYTE_SDK_USE_STRUCTURED_DATASET=TRUE

# Enable the virtualenv for this image. Note this relies on the VENV variable we've set in this image.
ENTRYPOINT ["/usr/local/bin/flytekit_venv"]
