Dask is a Python library for parallel computing. It offers features like:
Big data collections: It mirrors commonly known Python interfaces like pandas, NumPy, and more, so existing code can scale with few changes.
Dynamic task scheduling: It is optimized for interactive computational workloads.
Most analytics workflows use NumPy and pandas, which support a wide range of computations on big data. Dask is especially useful when a dataset does not fit in the available memory. It helps in scaling up to clusters with thousands of cores or CPUs, and it also allows scaling down to a single process or a single core.
Developed with a wider community: Dask is an open-source project. It was built in coordination with various other community projects like NumPy, pandas, and scikit-learn.
Dask DataFrames: A Dask DataFrame mirrors the pandas DataFrame API. By using Dask, we can manipulate DataFrames that are too large for memory with ease.
Dask array and Dask-ML: Dask arrays offer parallel, larger-than-memory, n-dimensional arrays by using blocked algorithms:
A blocked algorithm completes a large computation by performing many smaller computations, using all the cores on our system.
These arrays are helpful when we are working on a dataset that is larger than our memory.
Dask-ML also provides tools for both parallel and distributed machine learning.
Moreover, Dask breaks a given array into smaller chunks to stream data efficiently from the disk. This also decreases the memory footprint of our computation.
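To make the idea of blocked algorithms concrete, the sketch below sums a large array in fixed-size blocks using plain NumPy. Dask automates exactly this pattern and runs the per-block work in parallel; the block size of 100 here is an arbitrary choice for illustration.

```python
import numpy as np

# An array that we pretend is too large to reduce all at once.
data = np.arange(1_000, dtype=np.float64)

# Blocked sum: reduce each 100-element block, then reduce the partial results.
block_size = 100
partial_sums = [data[i:i + block_size].sum()
                for i in range(0, len(data), block_size)]
total = sum(partial_sums)

# The blocked result matches the direct, all-at-once reduction.
print(total == data.sum())  # True
```

Because each block is reduced independently, the per-block sums can run on different cores, and only one block needs to be in memory at a time when the data is streamed from disk.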
We can use the following command to install the package:
python -m pip install "dask[complete]"
Dask parallelizes work across tasks; it cannot parallelize within a single, individual task.
Because Dask is a distributed computing framework, it permits the remote execution of arbitrary code. Thus, Dask workers should be run within trusted networks only.
In the sections below, we discuss different concepts of the Dask library used for parallel computing, including arrays, bags, DataFrames, workloads, xarrays, and more.
In the code below, we are going to create a Dask array of size 1000*1000, where each 100*100 chunk is a NumPy array. Each chunk will be filled with random values using daskArray.random.random().
# import the dask.array module
import dask.array as daskArray

# invoke random() to generate a dask 2D array of 1000 * 1000 with 100 * 100 chunks
arr = daskArray.random.random((1000, 1000), chunks=(100, 100))

# print the array on the console
print(arr.compute())
Line 5: We create a 1000*1000 Dask array containing 100*100 chunks of NumPy arrays for efficient data processing.
Line 8: We print the array to the console by invoking the compute() method.
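To see why compute() is needed, the sketch below contrasts a lazy Dask aggregation with its computed result. Until compute() is called, an operation like arr.mean() only builds a task graph; it is not yet a number. This assumes dask is installed as shown earlier.

```python
import dask.array as daskArray

# Same chunked array as above: 1000 x 1000, split into 100 x 100 blocks.
arr = daskArray.random.random((1000, 1000), chunks=(100, 100))

# mean() is lazy: it builds a task graph instead of touching the data.
lazy_mean = arr.mean()
print(type(lazy_mean))  # a dask Array object, not a float

# compute() triggers the blocked computation across all chunks.
result = lazy_mean.compute()
print(0.0 <= result <= 1.0)  # True: uniform random values lie in [0, 1)
```

Deferring work this way lets Dask fuse operations and process one chunk at a time, which is what makes larger-than-memory arrays practical.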
In the code below, we are going to create a Dask DataFrame. For this purpose, we are going to load a built-in time series dataset. Here, the timeseries() method returns a synthetic time series as a single Dask DataFrame.
# include the dask.datasets module
import dask.datasets as dd

# invoke timeseries() from dask.datasets
df = dd.timeseries()

# print the time series DataFrame on the console
print(df.compute())
Line 5: We load the time series dataset as the df DataFrame.
Line 8: We print the dataset to the console.