Data Analysis Using Python/Pandas

What is Data Analysis?

Data analysis is the process of evaluating data using analytical and statistical tools to discover useful information and aid business decision-making.

Need for Data Analysis

[Figure: Google Trends on Data Analysis, Science and Python Pandas]

Almost every company and industry needs analysts to better understand how to build products, serve customers, leverage new opportunities, and improve current processes. The techniques and methodologies used stem from the fields of computer science and statistics.


Python

Python is an interpreted language: its code isn't compiled directly to machine code ahead of time, and it is commonly used interactively. This differs from languages like Java or C, where you write your code, compile it, and then run it. In Python, you can start the interactive interpreter and write code line by line, with the interpreter evaluating each statement as you enter it.

Quick Overview of Python

# Python Function
def multiply_numbers(x, y):
    return x * y

multiply_numbers(1, 2)
# Can you try to multiply three numbers?

# Data Structures in Python
# tuples - Immutable - cannot be changed after construction.
x = (1, 'a', 2, 'b')
type(x)
# x.append(3.3)  # would raise AttributeError: tuples have no append method
print(x)

# list - Mutable
x = [1, 'a', 2, 'b']
type(x)
x.append(3.3)
print(x)
# Concatenate two lists with + (returns a new list)
[1, 2] + [3, 4]

# Last element
x[-1]

# Strings/Arrays
firstname = 'Christopher Arthur Hansen Brooks'.split(' ')[0] # [0] selects the first element of the list
lastname = 'Christopher Arthur Hansen Brooks'.split(' ')[-1] # [-1] selects the last element of the list
print(firstname)
print(lastname)
# Reading CSV files
import csv

# IPython/Jupyter magic: display floats with 2 decimal places (not valid in a plain .py script)
%precision 2

with open('mpg.csv') as csvfile:
    mpg = list(csv.DictReader(csvfile))

mpg[:3] # The first three dictionaries in our list.

# Average city fuel economy across all cars.
print(sum(float(d['cty']) for d in mpg) / len(mpg))

# Average highway fuel economy across all cars.
print(sum(float(d['hwy']) for d in mpg) / len(mpg))

Let's group the cars by number of cylinders and find the average cty mpg for each group.
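One way to do this grouping in plain Python is sketched below. It does not read mpg.csv; instead it uses a small hypothetical list of rows shaped like the output of csv.DictReader (all values are strings). The 'cyl' column name, the sample values, and the helper name average_cty_by_cyl are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows shaped like csv.DictReader output (values are strings).
# The 'cyl' column name is an assumption about mpg.csv.
sample_mpg = [
    {'cyl': '4', 'cty': '18'},
    {'cyl': '4', 'cty': '22'},
    {'cyl': '6', 'cty': '16'},
    {'cyl': '8', 'cty': '12'},
    {'cyl': '8', 'cty': '14'},
]

def average_cty_by_cyl(rows):
    """Return a sorted list of (cylinders, average city mpg) pairs."""
    groups = defaultdict(list)
    for row in rows:
        # Collect the city mpg values under each cylinder count.
        groups[row['cyl']].append(float(row['cty']))
    return sorted((cyl, sum(vals) / len(vals)) for cyl, vals in groups.items())

cty_mpg_by_cyl = average_cty_by_cyl(sample_mpg)
print(cty_mpg_by_cyl)  # [('4', 20.0), ('6', 16.0), ('8', 13.0)]
```

With the real file, the same function could be called on the mpg list loaded earlier, provided the column names match.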

Before we move to Pandas, let's check out NumPy

import numpy as np
mylist = [1, 2, 3]
x = np.array(mylist)
x

y = np.array([4, 5, 6])
y

print(x + y) # elementwise addition     [1 2 3] + [4 5 6] = [5  7  9]
print(x - y) # elementwise subtraction  [1 2 3] - [4 5 6] = [-3 -3 -3]

x.dot(y) # dot product                  1*4 + 2*5 + 3*6 = 32

Pandas

Pandas is the “Python Data Analysis Library”. It provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Video on Pandas/Python from the Pandas website: https://www.youtube.com/watch?v=_T8LGqJtuGc

Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language like R.

  • A fast and efficient DataFrame object for data manipulation with integrated indexing;
  • Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
  • Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
  • Flexible reshaping and pivoting of data sets;
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
  • Columns can be inserted and deleted from data structures for size mutability;
  • Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
  • High performance merging and joining of data sets;
  • Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
  • Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. You can even create domain-specific time offsets and join time series without losing data;
  • Highly optimized for performance, with critical code paths written in Cython or C;
  • Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.
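To make the "label-based alignment" and "missing data" points above concrete, here is a minimal sketch. The two Series and their values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: two Series whose index labels only partly overlap.
q1_sales = pd.Series({'apples': 10, 'pears': 5})
q2_sales = pd.Series({'pears': 7, 'plums': 3})

# Addition aligns on index labels, not on position.
# Labels present in only one Series produce NaN (missing).
total = q1_sales + q2_sales
print(total)

# fill_value=0 treats a label missing from one side as 0 instead of NaN.
total_filled = q1_sales.add(q2_sales, fill_value=0)
print(total_filled)
```

Only 'pears' appears in both Series, so it is the only label with a value in total; total_filled has a value for all three labels.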

    import pandas as pd
    
    animals = ['Tiger', 'Bear', None]
    pd.Series(animals)
    
    sports = {'Archery': 'Bhutan',
              'Golf': 'Scotland',
              'Sumo': 'Japan',
              'Taekwondo': 'South Korea'}
    s = pd.Series(sports)
    s
    
    original_sports = pd.Series({'Archery': 'Bhutan',
                                 'Golf': 'Scotland',
                                 'Sumo': 'Japan',
                                 'Taekwondo': 'South Korea'})
    cricket_loving_countries = pd.Series(['Australia',
                                          'Barbados',
                                          'India',
                                          'England'],
                                         index=['Cricket',
                                                'Cricket',
                                                'Cricket',
                                                'Cricket'])
    # Series.append was removed in pandas 2.0; pd.concat gives the same result.
    all_countries = pd.concat([original_sports, cricket_loving_countries])
    
    import pandas as pd
    purchase_1 = pd.Series({'Name': 'Abc',
                            'Item Purchased': 'Dog Food',
                            'Cost': 22.50})
    purchase_2 = pd.Series({'Name': 'Def',
                            'Item Purchased': 'Kitty Litter',
                            'Cost': 2.50})
    purchase_3 = pd.Series({'Name': 'Ghi',
                            'Item Purchased': 'Bird Seed',
                            'Cost': 5.00})
    df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
    df.head()
    
    # Merge two DataFrames
    staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins', 'Role': 'Director of HR'},
                             {'First Name': 'Sally', 'Last Name': 'Brooks', 'Role': 'Course liaison'},
                             {'First Name': 'James', 'Last Name': 'Wilde', 'Role': 'Grader'}])
    student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond', 'School': 'Business'},
                               {'First Name': 'Mike', 'Last Name': 'Smith', 'School': 'Law'},
                               {'First Name': 'Sally', 'Last Name': 'Brooks', 'School': 'Engineering'}])
    staff_df
    student_df
    # Inner join on both name columns: only Sally Brooks appears in both tables.
    pd.merge(staff_df, student_df, how='inner', on=['First Name', 'Last Name'])
    
    ## Group By
    mpg = pd.read_csv('mpg.csv')
    ...
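As a rough illustration of the group-by step, the sketch below repeats the earlier "average cty mpg per cylinder count" exercise in pandas. It does not read mpg.csv; the small DataFrame is a hypothetical stand-in, and the 'cyl' column name and the values are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for mpg.csv; the 'cyl' column name and values are assumptions.
mpg_sample = pd.DataFrame({
    'cyl': [4, 4, 6, 8, 8],
    'cty': [18, 22, 16, 12, 14],
})

# Split the rows by cylinder count, then average city mpg within each group.
avg_cty_by_cyl = mpg_sample.groupby('cyl')['cty'].mean()
print(avg_cty_by_cyl)
```

The result is a Series indexed by cylinder count, so avg_cty_by_cyl.loc[4] gives the average for four-cylinder cars.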
    
Last modified October 6, 2020