What is Dask in Python?

Overview

Dask is a Python library for parallel computing. It offers features like:

  • Big data collections: It extends familiar Python interfaces like pandas and NumPy to datasets that do not fit in memory.

  • Dynamic task scheduling: It is optimized for interactive computational workloads.


Why do we need Dask?

Most analytics workflows rely on NumPy and pandas, which support a wide range of computations on data that fits in memory. Dask becomes especially useful when a dataset does not fit in the available memory: it can scale the same computations up to a cluster with thousands of cores or CPUs, and it can also scale down to run in a single process on a single core.
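As a minimal sketch of this scale-up/scale-down idea (the array size and chunk size below are arbitrary illustrative choices), the same computation can run on one core or across all local cores just by changing the scheduler:

```python
import dask.array as da

# A chunked array; each 1000 x 1000 chunk can be processed independently
x = da.random.random((4000, 4000), chunks=(1000, 1000))

# Scale down: run the whole task graph in a single thread on one core
total_single = x.sum().compute(scheduler="synchronous")

# Scale up (locally): run chunk computations in parallel across all cores
total_parallel = x.sum().compute(scheduler="threads")

# Both schedulers execute the same task graph and agree on the result
print(abs(total_single - total_parallel) < 1e-6)
```

Scaling out to a multi-machine cluster uses the same code with a distributed scheduler attached.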

Features of Dask

  • Developed with a wider community: Dask is an open-source project. It was built in coordination with other community projects like NumPy, pandas, and scikit-learn.

  • Dask DataFrames: A Dask DataFrame mirrors the pandas DataFrame API, but is split into many smaller pandas DataFrames (partitions). This lets users manipulate DataFrames that are larger than memory with familiar pandas-style code.

  • Dask array and Dask-ML: Dask array offers parallel, larger-than-memory, n-dimensional arrays by using blocked algorithms:

    • A blocked algorithm performs many smaller computations and combines their results into the larger computation, making use of all the cores on our system.

    • These arrays are helpful when a dataset is larger than the available memory.

    • Dask-ML provides tools for both parallel and distributed machine learning.

Moreover, Dask breaks a given array into smaller chunks, which enables efficient streaming of data from disk and decreases the memory footprint of the computation.
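A small sketch of a blocked computation (the sizes are chosen arbitrarily for illustration): summing a chunked array computes a partial sum per chunk and then combines the partial results:

```python
import dask.array as da

# 10,000 elements split into ten 1,000-element chunks
x = da.arange(10_000, chunks=1_000)

# Each chunk is summed independently; the partial sums are then combined
total = x.sum().compute()
print(total)  # 49995000 (the sum of 0..9999)
```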

Installation

We can use the following command to install the package:

python -m pip install "dask[complete]"
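Once the install finishes, a quick sanity check is to import the package and print its version:

```python
# confirm that Dask is importable and report its installed version
import dask

print(dask.__version__)
```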

Limitations of Dask

  • Dask cannot parallelize work within a discrete or individual task; each task runs sequentially on a single worker thread or process.

  • Because it is a distributed computing framework, it permits the remote execution of arbitrary code. Thus, Dask workers should only be run within trusted networks.

Explanations

In the sections below, we discuss different concepts of the Dask library that are used for parallel computing, including arrays, bags, DataFrames, and more.

Arrays

In the code below, we create a Dask array of size 1000*1000, split into 100*100 chunks, where each chunk is a NumPy array. Each chunk is filled with random values generated by daskArray.random.random().

# import the dask.array module
import dask.array as daskArray
# invoke random() to generate a 2D Dask array of 1000 * 1000 with 100 * 100 chunks
arr = daskArray.random.random((1000, 1000), chunks=(100, 100))
# print the computed array on the console
print(arr.compute())

Explanation

  • Line 4: We create a 1000*1000 Dask array, split into 100*100 NumPy chunks to enable efficient parallel processing.

  • Line 6: We print the array to the console; invoking the compute() method triggers the actual computation.
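Operations on a Dask array are lazy: they only build a task graph, and nothing runs until compute() is called. A short sketch using the same array (the mean shown is approximate, since the data is random):

```python
import dask.array as da

arr = da.random.random((1000, 1000), chunks=(100, 100))

# chunks records how the array is split: ten 100-element blocks per axis
print(arr.chunks)

# mean() only builds a task graph; compute() executes it chunk by chunk
print(arr.mean().compute())  # approximately 0.5 for uniform random values
```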

DataFrames

In the code below, we create a Dask DataFrame. For this purpose, we load a built-in demo dataset: the timeseries() method returns a Dask DataFrame containing randomly generated time-series data.

# import the dask.datasets module
import dask.datasets as dd
# invoke timeseries() from dask.datasets
df = dd.timeseries()
# print the computed time-series DataFrame on the console
print(df.compute())

Explanation

  • Line 4: We load the built-in time-series dataset as the df DataFrame.

  • Line 6: We print the dataset to the console by invoking the compute() method.
