Pandas: How to drop data, read files and work with String methods

Hi and welcome

This is the third post on my series of blog posts on the pandas library. if you're new to the pandas library or need a re-fresher on it, then i suggest you go through the previous post .

In this post, i will be treating the following

  • How to import csv files
  • How to drop columns/rows
  • How to work with string methods

    Importing csv files

So far we have being working with data generated with python and numpy, however in the real situation, data scientist hardly work with those kind of data. The ideal situation is to import data from different file types, including databases. In this section we shall be considering how to import Comma Separated Values(csv) .

I will be working with a file saved in the same directly as the python file i'm working with. You can also import from other directories using absolute paths and other method of relative paths.

Now, create a file with the following content and save it as records.csv

re.PNG

To get the data into jupyter IDE the .read_csv() method will be used. To use the method, we pass the file path as the first argument; there are other arguments that can be passed along, but for now let's work with the defaults.

1    import pandas as pd
2    df=pd.read_csv("records.csv") # Reads in the data. It is saves in the variable 'df'
3    print(df) # displays the data set read in

1.PNG

Notice the cells with NaN. These are due to the missing values included in the data set, as shown in the image below

2.PNG

Now, there are a few things we need to take care of in the data set. First, we assume we don't need some information, hence we need to remove them.

Dropping Index/Columns

To drop a column/Index, we use the .drop() method, which takes a list of columns/indices to drop as the first argument

4    df.drop(["Name"])

3.PNG

The above code gives gives an error because by default the method looks for the items in the list on the row, i.e assumes "Name" is an index. To take care of this issue, we need to point it to the column. This is done using the axis parameter as shown below

4    df.drop(["Name"], axis="columns")

4.PNG That worked out fine.

Now print the content of df and notice that all columns still remains intact. This is because the .drop() method does not modify the DataFrame; it returns a new object. If you want the operation to modify the DataFrame, then you either reassign to the DataFrame or set the inplace parameter to True, as shown below

4    df.drop(["Name","age"], axis="columns", inplace=True)
5    print(df)

5.PNG

Notice that we dropped two columns and set the inplace=True.

For the sake of other illustrations we shall be working with the complete data set, i.e the data set with all four columns.

To drop a row or record , we follow the same pattern.

1    df.drop([1,4])#  Drops the indices with values of 1 and 4, hence left with [0,2,3,5,6,7]

6.PNG

Notice we didn't use the axis parameter; this is because rows is the default value of this parameter.

Working with string method

One may want to perform some string manipulation on the columns that are of string datatype, however just applying the string method on this is not possible.

For instance, notice that the names in the Name column are not consistent. The names are meant to be title case, i.e the first character of each word being upper case, however this is not the case for most of the cases in the Name and dept columns. Let's fix this

First, let's select and print the columns There are two basic methods that can be used to select a column. One is by using the bracket notation as shown below

1    print(df["Name"])

7.PNG

And the other is by the use of the dot notation as shown below

2    print(df.dept)

8.PNG

The bracket notation is better in most cases, especially when the column is not a single word or when one wants to make use of variables instead of the name as a literal.

Ok, let's do some string manipulations. We want to transform the items in the dept column to upper case and we could try the code that follows

3    df.dept.upper()

This gives an error

9.PNG

We can't just access the string methods. This is because we're not dealing directly with a string. To make this operation possible, we need to use the str sub-module, then we can access the string methods as shown below

4    df.dept.str.upper() # accesses the .upper method in the str sub- module

10.PNG

Simple, right? Now if you print df, you would notice that the column remains the same. To make this act stay in place, we need to reassign as shown below

5    df.dept=df.dept.str.upper()

Now, let's use this on the Name column, transforming the names to title case in order to make correction to the name in index 5.

6    df.Name=df.Name.str.title()  
7    print(df)

11.PNG

Finally, let's change the column names to upper case

8    df.columns=df.columns.str.upper()  
7    print(df)

12.PNG

Conclusion

Once you understand the methods in pandas, you will find out that there are a whole lot of ways of carrying out the same action. Well, we always look for the fastest or the one we can easily remember. In the next post, we shall be considering different ways of handling missing data(data imputation) and how to make selections from a DataFrame. have fun, practicing. Cheers!!

This post is part of a series of blog post on the Pandas library , based on the course Practical Machine Learning Course from The Port Harcourt School of AI (pmlcourse).

No Comments Yet