A DataFrame is one of the most popular data structures for manipulating tabular data. When we read data into a DataFrame, it is organized into rows and columns, which makes it easy to analyze and work with.
In Julia, the DataFrames.jl package offers several ways to select a subset of a DataFrame's columns, which we cover in this Answer.
We can select a subset of columns using their actual column names, as shown below:
df = df[:, [:A, :B]]
The above code selects columns with names A and B from df.
using DataFrames
df = DataFrame(student_id=[1,2,3,4,5],
               name=["Amy","Jane","John","Nancy","Peter"],
               marks=[50,60,40,47,30],
               age=[15,16,19,18,15])

df = df[:, [:name, :age]]  # keep only the name and age columns
println(df)
Let’s explain the code provided above.
Line 1: We load the DataFrames package.
Lines 2–5: We create a DataFrame with four columns and five rows that holds students' information.
Line 7: We select only the name and age columns and assign the result back to df, replacing the original DataFrame.
Line 8: We print the resulting DataFrame.
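As a small variation on the example above, column names can be given either as Symbols or as strings, and both forms should select the same columns. The sketch below assumes the same sample data; the variable names by_symbol and by_string are illustrative only.

```julia
using DataFrames

# Same sample data as in the example above
df = DataFrame(student_id=[1,2,3,4,5],
               name=["Amy","Jane","John","Nancy","Peter"],
               marks=[50,60,40,47,30],
               age=[15,16,19,18,15])

# Column names given as Symbols
by_symbol = df[:, [:name, :age]]

# Column names given as strings
by_string = df[:, ["name", "age"]]

# Show the selected column names and compare the two results
println(names(by_symbol))
println(by_symbol == by_string)
```

Which form to use is mostly a matter of taste; strings are convenient when a column name contains spaces or other special characters.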
We can select a subset of columns by specifying their index numbers. Here’s an example:
df = df[:,[1,3]]
The code df = df[:, [1, 3]] selects the first and third columns of df (Julia indexes from 1, not 0). The resulting DataFrame contains only those two columns.
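Two related points can be sketched briefly, assuming the sample student data used throughout this Answer: a range such as 1:2 works in place of a vector of indices, and a column's position can be looked up from its name. Using findfirst over names(df) is just one illustrative way to do the lookup, and the variable names first_two, marks_idx, and picked are hypothetical.

```julia
using DataFrames

df = DataFrame(student_id=[1,2,3,4,5],
               name=["Amy","Jane","John","Nancy","Peter"],
               marks=[50,60,40,47,30],
               age=[15,16,19,18,15])

# A range selects consecutive columns: here the first two
first_two = df[:, 1:2]
println(names(first_two))

# Look up the position of the marks column by its name
marks_idx = findfirst(==("marks"), names(df))
println(marks_idx)

# Combine a fixed index with the looked-up position
picked = df[:, [1, marks_idx]]
println(names(picked))
```

Looking positions up by name makes index-based code less fragile if the column order changes later.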
using DataFrames
df = DataFrame(student_id=[1,2,3,4,5],
               name=["Amy","Jane","John","Nancy","Peter"],
               marks=[50,60,40,47,30],
               age=[15,16,19,18,15])

df = df[:, [1, 3]]  # keep only the first and third columns
println(df)
Line 7: We select the columns at index 1 (student_id) and 3 (marks), which returns a new DataFrame with only these columns. We assign the result back to df.
select() or select!()
We can also use the select() or select!() functions to select a subset of DataFrame columns, as explained below.
Option 1
select!(df, [:A, :B])
The select!() function keeps only the columns A and B and modifies the original DataFrame, df. This is referred to as modifying in place.
Option 2
df = select(df, [:A, :B])
The select() function selects columns A and B from the original DataFrame and returns a new DataFrame without changing the original. Here we assign the result back to df; assigning it to a different variable instead keeps the original DataFrame available.
using DataFrames
df = DataFrame(student_id=[1,2,3,4,5],
               name=["Amy","Jane","John","Nancy","Peter"],
               marks=[50,60,40,47,30],
               age=[15,16,19,18,15])

# using select
df1 = select(df, [:student_id, :marks])
println(df1)

println("-------------------------")
# using select!
select!(df, [:name, :age])
println(df)
Let’s explain the code provided above.
Lines 8–9: We use select() to subset the columns and assign the new DataFrame to a variable named df1 and then we print out df1.
Lines 13–14: We use select!() to keep only the name and age columns. select!() modifies the original DataFrame, df, so no variable assignment is needed. We then print the modified df.
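The difference between the two functions can be checked directly by counting columns before and after each call. This is a minimal sketch that assumes the same sample data; the variable name copy_df is illustrative.

```julia
using DataFrames

df = DataFrame(student_id=[1,2,3,4,5],
               name=["Amy","Jane","John","Nancy","Peter"],
               marks=[50,60,40,47,30],
               age=[15,16,19,18,15])

# select returns a new DataFrame; df itself is left untouched
copy_df = select(df, [:student_id, :marks])
println(ncol(copy_df))  # number of columns in the new DataFrame
println(ncol(df))       # number of columns still in df

# select! removes the unselected columns from df itself
select!(df, [:name, :age])
println(ncol(df))
println(names(df))
```

A reasonable rule of thumb is to prefer select() when the original data is still needed, and select!() when it is not and an extra copy would be wasteful.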
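We can use Boolean indexing, where we pass one true or false value per column, to subset the columns of a DataFrame.
df = df[:, [true, false]]
The code above assumes df has exactly two columns: it keeps the first and drops the second. The Boolean vector must have the same length as the number of columns.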
using DataFrames
df = DataFrame(student_id=[1,2,3,4,5],
               name=["Amy","Jane","John","Nancy","Peter"],
               marks=[50,60,40,47,30],
               age=[15,16,19,18,15])

df = df[:, [true, false, true, true]]  # one Bool per column
println(df)
Line 7: We use Boolean indexing to select three columns, where true keeps a column and false omits it. Consequently, we keep only the columns student_id, marks, and age. The resulting DataFrame is then assigned back to df.
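Writing the mask by hand becomes error-prone as the number of columns grows. As a sketch under the same sample data, the mask can instead be computed from the column names with broadcasting. The variable names keep and without_name are illustrative.

```julia
using DataFrames

df = DataFrame(student_id=[1,2,3,4,5],
               name=["Amy","Jane","John","Nancy","Peter"],
               marks=[50,60,40,47,30],
               age=[15,16,19,18,15])

# Build one Bool per column: true for every column except "name"
keep = names(df) .!= "name"
println(keep)

# The mask must have exactly one entry per column
println(length(keep) == ncol(df))

# Apply the mask to select the columns
without_name = df[:, keep]
println(names(without_name))
```

Because the mask is derived from names(df), it always has the right length, which avoids the length mismatch that a hand-written vector can cause.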