Data analysis from a CSV file in Python

Learn to read and write CSV files in Python

What is a CSV file?

CSV stands for comma-separated values.

You might have come across this file format while downloading data from an Excel spreadsheet or a database. CSV files are convenient for storing tabular data.

As the name suggests, values in a CSV file are separated by a comma (by default).

Below is an example of a CSV file containing information about a family.

my_family.csv

name,age,height(cm),weight(kg)
Lenin,30,188,90
Phil,42,178,76
Claire,40,165,54
Alex,18,140,46

Usually the first line in a CSV file is the header, which names the columns. A CSV file does not store data types; every value is plain text. Every row after the header is a data record.

From the above example, you can see that each value (whether part of the header or a data record) is separated by a comma. This separator character is called a delimiter. A CSV file may use a delimiter other than a comma.

Examples of other delimiters:

  1. tab \t
  2. colon :
  3. semicolon ;
  4. pipe |
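As a minimal sketch of how a non-comma delimiter is handled, the snippet below parses a semicolon-delimited version of the family data with Python's built-in csv module (introduced in the next section). The sample string and the use of io.StringIO in place of a real file are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample: two family rows, delimited by semicolons instead of commas.
sample = "name;age;height(cm);weight(kg)\nLenin;30;188;90\nPhil;42;178;76\n"

# csv.reader accepts any iterable of lines; io.StringIO stands in for an open file.
rows = list(csv.reader(io.StringIO(sample), delimiter=';'))

for row in rows:
    print(row)
```

Each row comes back as a list of strings, exactly as it would for a comma-delimited file.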

In this article, you will learn to work with CSV files using the csv module and the pandas library.


How to read CSV files using the csv module?

Reading from a CSV file is done with the csv.reader object. You can open the CSV file as a text file with Python’s built-in open() function.

Example 1

import csv

with open('my_family.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            print(f'Header row - {", ".join(row)}')
            line_count += 1
        else:
            print(f'{row[0]} is {row[1]} years old, {row[2]} cm tall and {row[3]} kg heavy')
            line_count += 1
    print(f'Total: {line_count} lines')

Output

Header row - name, age, height(cm), weight(kg)
Lenin is 30 years old, 188 cm tall and 90 kg heavy
Phil is 42 years old, 178 cm tall and 76 kg heavy
Claire is 40 years old, 165 cm tall and 54 kg heavy
Alex is 18 years old, 140 cm tall and 46 kg heavy
Total: 5 lines

Since the first row is the header row (line_count is 0 at that point), it is handled differently. You can also skip the header row while reading the CSV.

How to skip the header row in CSV with Python?

Since the csv.reader object is an iterator, you can call next(csv_reader, None) to consume the header row and skip over it. The second argument, None, is returned instead of raising StopIteration if the file is empty.

Example 2

import csv

with open('my_family.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    next(csv_reader, None)  # skip the header row
    for row in csv_reader:
        print(f'{row[0]} is {row[1]} years old, {row[2]} cm tall and {row[3]} kg heavy')
        line_count += 1
    print(f'Total: {line_count} lines')

Output

Lenin is 30 years old, 188 cm tall and 90 kg heavy
Phil is 42 years old, 178 cm tall and 76 kg heavy
Claire is 40 years old, 165 cm tall and 54 kg heavy
Alex is 18 years old, 140 cm tall and 46 kg heavy
Total: 4 lines
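Instead of discarding the header, you may want to keep the value that next() returns. The sketch below, which assumes an in-memory copy of the first two family rows rather than the real file, pairs each value with its column name using zip().

```python
import csv
import io

# Stand-in for my_family.csv so the sketch is self-contained.
sample = "name,age,height(cm),weight(kg)\nLenin,30,188,90\nPhil,42,178,76\n"

csv_reader = csv.reader(io.StringIO(sample))
header = next(csv_reader, None)  # first row, or None if the input is empty

# Pair each value with its column name to build one dictionary per record.
records = [dict(zip(header, row)) for row in csv_reader]

print(records[0]["name"], records[0]["age"])
```

This is roughly the convenience that csv.DictReader, covered next, provides out of the box.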


Reading CSV files as a dictionary

You can read the CSV file as a dictionary by using the csv.DictReader object.

An advantage of using the DictReader object is that it turns each row into a dictionary, which makes accessing the fields a little easier.

Example 3

import csv

with open('my_family.csv', newline='') as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=',')
    for row in csv_reader:
        print(f'{row["name"]} is {row["age"]} years old, {row["height(cm)"]} cm tall and {row["weight(kg)"]} kg heavy')
    print(f'Total: {csv_reader.line_num} lines')

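csv_reader.line_num is an attribute, not a method. It holds the number of lines read from the file so far, so after the loop it equals the total number of lines in the file, header included (5 for this file).

For the csv.DictReader object, Python uses the column names from the header row as the dictionary keys. The header row itself is not returned as a data record.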
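One thing to keep in mind is that DictReader returns every value as a string. The sketch below, using an in-memory copy of the family data as an assumption in place of the real file, converts the age column to integers before averaging it.

```python
import csv
import io

# In-memory copy of my_family.csv so the sketch is self-contained.
sample = (
    "name,age,height(cm),weight(kg)\n"
    "Lenin,30,188,90\n"
    "Phil,42,178,76\n"
    "Claire,40,165,54\n"
    "Alex,18,140,46\n"
)

csv_reader = csv.DictReader(io.StringIO(sample))

# Values arrive as strings, so convert before doing arithmetic.
ages = [int(row["age"]) for row in csv_reader]
average_age = sum(ages) / len(ages)

print(csv_reader.fieldnames)  # column names taken from the header row
print(average_age)
```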


How to write to CSV files using the csv module?

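You can write to a CSV file using the csv.writer object. Be careful to open the file in write mode ('w'), which overwrites any existing file of the same name. The csv documentation also recommends passing newline='' to open() so that line endings are not translated, which can otherwise produce blank rows on Windows.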

Example 1

import csv

header = ['Name', 'Age', 'Height(cm)', 'Weight(kg)']

data = [
    ['Phil', 42, 178, 76],
    ['Alex', 18, 140, 46],
    ['Claire', 40, 165, 54],
]

filename = "my_family.csv"

with open(filename, 'w', newline='') as output:
    csvwriter = csv.writer(output)

    # Write a single list
    csvwriter.writerow(header)

    # Writing a list of lists
    csvwriter.writerows(data)

Output

Name,Age,Height(cm),Weight(kg)
Phil,42,178,76
Alex,18,140,46
Claire,40,165,54

The writerow method writes a single list of values as one row, whereas writerows writes multiple rows from an iterable of lists.
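To see exactly what the two methods emit, the sketch below writes to an in-memory io.StringIO buffer instead of a file (an assumption made so the result can be inspected directly) and prints the raw text.

```python
import csv
import io

buffer = io.StringIO()
csvwriter = csv.writer(buffer)

csvwriter.writerow(['Name', 'Age'])                 # one row
csvwriter.writerows([['Phil', 42], ['Alex', 18]])   # several rows at once

text = buffer.getvalue()
print(repr(text))
```

The repr shows that each row ends with \r\n, the writer's default line terminator. That is why opening real files with newline='' matters: it stops Python from translating those line endings a second time.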

Using the delimiter parameter

Notice that no delimiter was specified when creating the csv.writer object. In that case, a comma is used as the default delimiter. You can choose a different delimiter by passing the delimiter parameter.

Example 2

import csv

header = ['Name', 'Age', 'Height(cm)', 'Weight(kg)']

data = [
    ['Phil', 42, 178, 76],
    ['Alex', 18, 140, 46],
    ['Claire', 40, 165, 54],
]

filename = "my_family.csv"

with open(filename, 'w', newline='') as output:
    csvwriter = csv.writer(output, delimiter='|')

    # Write a single list
    csvwriter.writerow(header)

    # Writing a list of lists
    csvwriter.writerows(data)

Output

Name|Age|Height(cm)|Weight(kg)
Phil|42|178|76
Alex|18|140|46
Claire|40|165|54
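A file written with a custom delimiter has to be read back with the same one. The round trip below is a small sketch that uses an in-memory buffer and a shortened data set, both chosen for illustration.

```python
import csv
import io

data = [['Name', 'Age'], ['Phil', 42], ['Alex', 18]]

# Write the rows with a pipe delimiter.
buffer = io.StringIO()
csv.writer(buffer, delimiter='|').writerows(data)

# Rewind and read them back with the same delimiter.
buffer.seek(0)
rows = list(csv.reader(buffer, delimiter='|'))

print(rows)
```

Note that the numbers come back as strings: the csv module does not remember that 42 was written as an integer.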

Writing a dictionary to a CSV file

You can write dictionaries to a CSV file using the csv.DictWriter class. The fieldnames parameter is required; it supplies the header and fixes the column order.

Example 3

import csv

header = ['Name', 'Age', 'Height(cm)', 'Weight(kg)']

data = [
    {"Name":"Phil", "Age": 42, "Height(cm)":178, "Weight(kg)":76},
    {"Name":"Claire", "Age": 40, "Height(cm)":165, "Weight(kg)":54},
    {"Name":"Alex", "Age": 18, "Height(cm)":140, "Weight(kg)":46}
]

filename = "my_family.csv"

with open(filename, 'w', newline='') as output:
    csvwriter = csv.DictWriter(output, fieldnames=header)
    csvwriter.writeheader()
    for row in data:
        csvwriter.writerow(row)

Output

Name,Age,Height(cm),Weight(kg)
Phil,42,178,76
Claire,40,165,54
Alex,18,140,46

You can also use writerows to write all the dictionaries to the CSV file at once.

Example 4

import csv

header = ['Name', 'Age', 'Height(cm)', 'Weight(kg)']

data = [
    {"Name":"Phil", "Age": 42, "Height(cm)":178, "Weight(kg)":76},
    {"Name":"Claire", "Age": 40, "Height(cm)":165, "Weight(kg)":54},
    {"Name":"Alex", "Age": 18, "Height(cm)":140, "Weight(kg)":46}
]

filename = "my_family.csv"

with open(filename, 'w', newline='') as output:
    csvwriter = csv.DictWriter(output, fieldnames=header)
    csvwriter.writeheader()
    csvwriter.writerows(data)

Output

Name,Age,Height(cm),Weight(kg)
Phil,42,178,76
Claire,40,165,54
Alex,18,140,46
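Real-world dictionaries do not always match the header exactly. By default, DictWriter raises a ValueError for keys that are not in fieldnames and writes an empty string for missing ones. The sketch below shows how the extrasaction and restval parameters change that; the "Nickname" key and the 'N/A' placeholder are made up for illustration.

```python
import csv
import io

header = ['Name', 'Age', 'Height(cm)']

data = [
    {"Name": "Phil", "Age": 42, "Height(cm)": 178, "Nickname": "Philly"},  # extra key
    {"Name": "Alex", "Age": 18},                                            # missing key
]

buffer = io.StringIO()
csvwriter = csv.DictWriter(
    buffer,
    fieldnames=header,
    extrasaction='ignore',  # silently drop keys that are not in fieldnames
    restval='N/A',          # value written when a key is missing
)
csvwriter.writeheader()
csvwriter.writerows(data)

lines = buffer.getvalue().splitlines()
for line in lines:
    print(line)
```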

Playing with additional parameters in csv module

quotechar

quotechar is the single character used to quote values that contain special characters such as the delimiter. It defaults to ".

For example, suppose the delimiter of your CSV file is a comma and you have an address column whose values may contain commas. Check out the example below.

my_family.csv

Name,Age,Height(cm),Weight(kg),Address
Phil,42,178,76,'Gryffindor room, Hogwarts'
Claire,40,165,54,'Snapes room, Hogwarts'
Alex,18,140,46,'4 Private Drive, Little Whinging'

The above CSV file uses single quotes to enclose the address field of each data record. You can pass this character as the quotechar value.

Example 1

import csv

filename = "my_family.csv"

with open(filename, 'r', newline='') as csv_file:
    csvreader = csv.reader(csv_file, quotechar="'")
    for row in csvreader:
        print(row)

Output

['Name', 'Age', 'Height(cm)', 'Weight(kg)', 'Address']
['Phil', '42', '178', '76', 'Gryffindor room, Hogwarts']
['Claire', '40', '165', '54', 'Snapes room, Hogwarts']
['Alex', '18', '140', '46', '4 Private Drive, Little Whinging']

quoting

The quoting argument controls when quotes should be generated by the writer or recognized by the reader. This article covers four options.

  1. csv.QUOTE_MINIMAL - Quotes a field only when required, for example when it contains the delimiter or the quotechar (default).
  2. csv.QUOTE_ALL - Quotes every field regardless of its type.
  3. csv.QUOTE_NONNUMERIC - On writing, quotes every field except integers and floats. On reading, converts every unquoted field to a float.
  4. csv.QUOTE_NONE - Never quotes fields on output. On reading, quote characters get no special treatment and stay part of the field value.

Example 1

import csv

filename = "my_family.csv"

header = ['Name','Age','Height(cm)','Weight(kg)','Address']

data = [
  ['Phil',42,178,76,'Gryffindor room, Hogwarts'],
  ['Claire',40,165,54,'Snapes room, Hogwarts'],
  ['Alex',18,140,46,'4 Private Drive, Little Whinging']
]

with open(filename, 'w', newline='') as output:
    csvwriter = csv.writer(output, quotechar="'", quoting=csv.QUOTE_ALL)
    csvwriter.writerow(header)
    csvwriter.writerows(data)

The above code uses csv.QUOTE_ALL as the quoting argument. This ensures that every field is wrapped in single quotes when it is written to the CSV.

my_family.csv

'Name','Age','Height(cm)','Weight(kg)','Address'
'Phil','42','178','76','Gryffindor room, Hogwarts'
'Claire','40','165','54','Snapes room, Hogwarts'
'Alex','18','140','46','4 Private Drive, Little Whinging'
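To compare the quoting modes side by side, the sketch below writes the same row three times with the default double-quote quotechar. The write_row helper is a hypothetical convenience function defined only for this illustration.

```python
import csv
import io

row = ['Phil', 42, 'Gryffindor room, Hogwarts']

def write_row(quoting):
    """Return the row as a single CSV line written with the given quoting mode."""
    buffer = io.StringIO()
    csv.writer(buffer, quoting=quoting).writerow(row)
    return buffer.getvalue().rstrip('\r\n')

minimal = write_row(csv.QUOTE_MINIMAL)        # quotes only the field containing a comma
quote_all = write_row(csv.QUOTE_ALL)          # quotes every field
nonnumeric = write_row(csv.QUOTE_NONNUMERIC)  # quotes everything except numbers

print(minimal)
print(quote_all)
print(nonnumeric)
```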

escapechar

Let's say you don't want any quotes in your CSV file when running the above code, so you use csv.QUOTE_NONE as the quoting argument.

Example 1

import csv

filename = "my_family.csv"

header = ['Name','Age','Height(cm)','Weight(kg)','Address']

data = [
  ['Phil',42,178,76,'Gryffindor room, Hogwarts'],
  ['Claire',40,165,54,'Snapes room, Hogwarts'],
  ['Alex',18,140,46,'4 Private Drive, Little Whinging']
]

with open(filename, 'w', newline='') as output:
    csvwriter = csv.writer(output, quotechar="'", quoting=csv.QUOTE_NONE)
    csvwriter.writerow(header)
    csvwriter.writerows(data)

The above code raises an error.

Output

Traceback (most recent call last):
  File "main.py", line 16, in <module>
    csvwriter.writerows(data)
_csv.Error: need to escape, but no escapechar set

The problem is that the address field contains commas. Since the quoting argument is set to csv.QUOTE_NONE, the csv module doesn't know how to escape the commas properly.

For this purpose, you can use the escapechar argument. It takes a single-character string that is used to escape the delimiter when quoting is turned off.

The code below escapes the comma using a backslash \.

Example 2

import csv

filename = "my_family.csv"

header = ['Name','Age','Height(cm)','Weight(kg)','Address']

data = [
  ['Phil',42,178,76,'Gryffindor room, Hogwarts'],
  ['Claire',40,165,54,'Snapes room, Hogwarts'],
  ['Alex',18,140,46,'4 Private Drive, Little Whinging']
]

with open(filename, 'w', newline='') as output:
    csvwriter = csv.writer(output, quotechar="'", quoting=csv.QUOTE_NONE, escapechar='\\')
    csvwriter.writerow(header)
    csvwriter.writerows(data)

my_family.csv

Name,Age,Height(cm),Weight(kg),Address
Phil,42,178,76,Gryffindor room\, Hogwarts
Claire,40,165,54,Snapes room\, Hogwarts
Alex,18,140,46,4 Private Drive\, Little Whinging

Notice how the commas have been escaped with a backslash \ and no error is raised.
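A file written this way should be read back with the same escapechar, otherwise the escaped comma would still split the address. The sketch below assumes a shortened, in-memory version of the escaped file.

```python
import csv
import io

# Shortened stand-in for the escaped file; "\\," is a literal backslash followed by a comma.
escaped = "Name,Address\nPhil,Gryffindor room\\, Hogwarts\n"

csv_reader = csv.reader(
    io.StringIO(escaped),
    quoting=csv.QUOTE_NONE,
    escapechar='\\',  # treat the character after a backslash literally
)
rows = list(csv_reader)

print(rows)
```

The backslash is consumed by the reader, so the address comes back as one field containing a plain comma.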

skipinitialspace

When set to True, whitespace immediately following the delimiter is ignored. It defaults to False.

my_family.csv

Name, Age, Height(cm), Weight(kg), Address
Phil, 42, 178, 76, 'Gryffindor room, Hogwarts'
Claire, 40, 165, 54, 'Snapes room, Hogwarts'
Alex, 18, 140, 46, '4 Private Drive, Little Whinging'

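The above CSV file has a space after every delimiter. If you read it without the skipinitialspace argument, the spaces end up in your values. The quoted addresses are also split in two, because a quote character is only recognized at the very start of a field.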

Example 1

import csv

with open('my_family.csv', 'r') as f:
    csv_reader = csv.reader(f, quotechar="'")

    for line in csv_reader:
        print(line)

Output

['Name', ' Age', ' Height(cm)', ' Weight(kg)', ' Address']
['Phil', ' 42', ' 178', ' 76', " 'Gryffindor room", " Hogwarts'"]
['Claire', ' 40', ' 165', ' 54', " 'Snapes room", " Hogwarts'"]
['Alex', ' 18', ' 140', ' 46', " '4 Private Drive", " Little Whinging'"]

To get rid of the whitespace, set the skipinitialspace argument to True.

Example 2

import csv

with open('my_family.csv', 'r') as f:
    csv_reader = csv.reader(f, quotechar="'", skipinitialspace=True)

    for line in csv_reader:
        print(line)

Output

['Name', 'Age', 'Height(cm)', 'Weight(kg)', 'Address']
['Phil', '42', '178', '76', 'Gryffindor room, Hogwarts']
['Claire', '40', '165', '54', 'Snapes room, Hogwarts']
['Alex', '18', '140', '46', '4 Private Drive, Little Whinging']


How to read CSV files using the pandas module?

Reading CSV files into a pandas DataFrame is very straightforward. A pandas DataFrame is a two-dimensional data structure, like a table with rows and columns.

Example 1

import pandas as pd

df = pd.read_csv('my_family.csv')
print(df)

Output

     Name  Age  Height(cm)  Weight(kg)
0    Phil   42         178          76
1  Claire   40         165          54
2    Alex   18         140          46

Notice the following points:

  1. Pandas automatically used the first line of the CSV as the column names.
  2. It displays the CSV data like a spreadsheet, which makes it easy to perform data analysis.
  3. Pandas automatically converted the Age, Height(cm) and Weight(kg) columns to integers.

Example 2

import pandas as pd

df = pd.read_csv('my_family.csv')

print(type(df['Age'][0]))
print(type(df['Height(cm)'][0]))
print(type(df['Weight(kg)'][0]))

Output

<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>

Pandas trick to deal with CSVs without header

If your CSV is missing the header row, use the names argument of pd.read_csv().

my_family.csv

Phil,42,178,76
Claire,40,165,54
Alex,18,140,46

Example 3

import pandas as pd

df = pd.read_csv(
    'my_family.csv',
    index_col='Name',
    names=['Name', 'Age', 'Height(cm)', 'Weight(kg)'],
)
print(df)

Output

        Age  Height(cm)  Weight(kg)
Name                               
Phil     42         178          76
Claire   40         165          54
Alex     18         140          46
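To make the effect of names more concrete, the sketch below reads the same header-less data three ways. It uses an in-memory string instead of the file, which is an assumption made to keep the example self-contained.

```python
import io
import pandas as pd

sample = "Phil,42,178,76\nClaire,40,165,54\nAlex,18,140,46\n"

# Without names, the first data row is mistaken for the header.
wrong = pd.read_csv(io.StringIO(sample))

# header=None keeps every row as data and numbers the columns 0, 1, 2, ...
numbered = pd.read_csv(io.StringIO(sample), header=None)

# names supplies the column labels explicitly.
named = pd.read_csv(
    io.StringIO(sample),
    names=['Name', 'Age', 'Height(cm)', 'Weight(kg)'],
)

print(list(wrong.columns))
print(numbered.shape)
print(named)
```

In the first case Phil's record is silently lost as data, which is why it is worth checking df.columns after loading an unfamiliar file.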


How to write to CSV files using the pandas module?

To write a pandas DataFrame to a CSV file, use the df.to_csv() method.

Example 1

import pandas as pd

df = pd.read_csv(
    'my_family.csv',
    index_col='Name',
    names=['Name', 'Age', 'Height(cm)', 'Weight(kg)'],
)
df.to_csv('my_new_family.csv')

my_new_family.csv

Name,Age,Height(cm),Weight(kg)
Phil,42,178,76
Claire,40,165,54
Alex,18,140,46
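By default, to_csv also writes the DataFrame index as the first column. The sketch below, built on a small hand-made DataFrame as an assumption, shows the index and sep parameters. Calling to_csv without a path returns the CSV text instead of writing a file, which makes the result easy to inspect.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Phil', 'Claire'], 'Age': [42, 40]})

with_index = df.to_csv()                  # default: the 0, 1, ... index is written too
without_index = df.to_csv(index=False)    # leave the index out
piped = df.to_csv(index=False, sep='|')   # use a different delimiter

print(with_index)
print(without_index)
print(piped)
```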


Data Analyst Project: Analyze Titanic data from Kaggle

The famous Titanic challenge on Kaggle is to build a machine learning model that predicts which passengers survived the Titanic shipwreck.

However, in this section you are going to do a simple data analysis on the train.csv file and answer the following questions:

  1. How many male and female passengers were onboard the Titanic?
  2. How many male and female passengers survived the Titanic shipwreck?
  3. What is the median age of each sex?

Male to Female ratio on the Titanic

import pandas as pd

#load the csv file
df = pd.read_csv('train.csv')

# Column Names
print(df.columns)

# Count unique values in Sex column
print(df['Sex'].value_counts())

# Percentage of male and female passengers
print(df['Sex'].value_counts(normalize=True))

Output

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
male      577
female    314
Name: Sex, dtype: int64
male      0.647587
female    0.352413
Name: Sex, dtype: float64

The above analysis shows that about 65% of the passengers in train.csv were male and 35% were female.

Surviving male to female ratio on the Titanic

import pandas as pd

#load the csv file
df = pd.read_csv('train.csv')

# Column Names
print(df.columns)

# Count surviving passengers by sex
print(df[df["Survived"] == 1]['Sex'].value_counts())

# Percentage of surviving male and female passengers
print(df[df["Survived"] == 1]['Sex'].value_counts(normalize=True))

Output

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
female    233
male      109
Name: Sex, dtype: int64
female    0.681287
male      0.318713
Name: Sex, dtype: float64

In the above code, you first filter the DataFrame for surviving passengers and then use the value_counts() method to count the surviving male and female passengers.

The above analysis shows that 68% of the surviving passengers in train.csv were female.

Median age of each sex

import pandas as pd

#load the csv file
df = pd.read_csv('train.csv')

# median age of each sex
median_age_men=df[df['Sex']=='male']['Age'].median()
median_age_women=df[df['Sex']=='female']['Age'].median()

print(f"The median age of men is {median_age_men}")
print(f"The median age of women is {median_age_women}")

Output

The median age of men is 29.0
The median age of women is 27.0

The above analysis shows that the median age of male passengers was 29, whereas the median age of female passengers was 27.
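The per-sex filtering above can also be expressed with groupby, which computes a statistic for every group in one call. The sketch below runs on a handful of made-up rows that only borrow the train.csv column names; they are not real Titanic records, so the numbers it prints say nothing about the actual data set.

```python
import pandas as pd

# Made-up toy rows that mimic the train.csv column names; NOT real Titanic data.
df = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'female', 'male'],
    'Age': [22.0, 38.0, None, 26.0, 35.0],   # None becomes NaN and is skipped by median()
    'Survived': [0, 1, 0, 1, 1],
})

# Median age per sex in a single call.
median_age = df.groupby('Sex')['Age'].median()

# The mean of a 0/1 column is the share of survivors within each sex.
survival_rate = df.groupby('Sex')['Survived'].mean()

print(median_age)
print(survival_rate)
```

Swapping the toy DataFrame for pd.read_csv('train.csv') would apply the same two lines to the real data.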
