Introduction to Pandas¶

BUSI 520: Python for Business Research¶

Kerry Back, JGSB, Rice University¶

What is Pandas?¶

Pandas is a powerful and flexible open-source data analysis and manipulation library. It provides two data structures (Series and DataFrame), each of which has many built-in methods. It provides constructors for Series and DataFrames, functions for reading data into DataFrames (from csv, stata, sas, sql, excel, html, and other formats), and functions for operating on Series and DataFrames.

Pandas was created by Wes McKinney, formerly of AQR and the author of the book Python for Data Analysis.

Creating Series and DataFrame using dictionaries¶

In [1]:
import pandas as pd
# Creating a Series using a dictionary
data_series = pd.Series({'a': 1, 'b': 2, 'c': 3})
data_series
Out[1]:
a    1
b    2
c    3
dtype: int64
In [2]:
# Creating a DataFrame using a dictionary
data_frame = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Occupation': ['Engineer', 'Doctor', 'Teacher']
})
data_frame
Out[2]:
      Name  Age Occupation
0    Alice   25   Engineer
1      Bob   30     Doctor
2  Charlie   35    Teacher

Creating a test dataframe with random numbers¶

In [3]:
import numpy as np
np.random.seed(0)
# Generating a 5x3 array of random integers between 0 and 99 (the high endpoint is excluded)
random_data = np.random.randint(0, 100, size=(5, 3))
# Creating a DataFrame from the random data
random_df = pd.DataFrame(random_data, columns=['Column1', 'Column2', 'Column3'])
random_df
Out[3]:
   Column1  Column2  Column3
0       44       47       64
1       67       67        9
2       83       21       36
3       87       70       88
4       88       12       58

Exploring the seaborn tips dataset¶

In [4]:
import seaborn as sns
# Loading the tips dataset
tips = sns.load_dataset('tips')
tips.head()
Out[4]:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

Extracting Columns and Rows from Series and DataFrames¶

Pandas provides various methods to extract data from Series and DataFrames. In this section, we'll explore three primary ways to achieve this:

  1. Using [] (basic indexing)
  2. Using .loc (label-based indexing)
  3. Using .iloc (integer-location based indexing)
In [5]:
# Using [] (basic indexing)
single_column = random_df['Column1']
multiple_columns = random_df[['Column1', 'Column3']]
series_value = data_series['a']
single_column, multiple_columns, series_value
Out[5]:
(0    44
 1    67
 2    83
 3    87
 4    88
 Name: Column1, dtype: int32,
    Column1  Column3
 0       44       64
 1       67        9
 2       83       36
 3       87       88
 4       88       58,
 1)
In [6]:
# Using .loc (label-based indexing)
row_1_col_1 = random_df.loc[0, 'Column1']
single_row = random_df.loc[1, :]
multiple_rows_cols = random_df.loc[[0, 2], ['Column1', 'Column3']]
series_value_loc = data_series.loc['b']

row_1_col_1, single_row, multiple_rows_cols, series_value_loc
Out[6]:
(44,
 Column1    67
 Column2    67
 Column3     9
 Name: 1, dtype: int32,
    Column1  Column3
 0       44       64
 2       83       36,
 2)
In [7]:
# Using .iloc (integer-location based indexing)
row_2_col_2 = random_df.iloc[1, 1]
single_row_iloc = random_df.iloc[2, :]
multiple_rows_cols_iloc = random_df.iloc[[1, 3], [0, 2]]
series_value_iloc = data_series.iloc[1]

row_2_col_2, single_row_iloc, multiple_rows_cols_iloc, series_value_iloc
Out[7]:
(67,
 Column1    83
 Column2    21
 Column3    36
 Name: 2, dtype: int32,
    Column1  Column3
 1       67        9
 3       87       88,
 2)
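
One difference between .loc and .iloc is worth noting: label slices with .loc include both endpoints, while integer slices with .iloc exclude the stop position, like ordinary Python slices. A minimal sketch using random_df (the variable names here are just illustrative):

In [ ]:
# .loc slices are inclusive of both endpoints: rows labeled 0, 1, and 2
rows_loc = random_df.loc[0:2, 'Column1':'Column2']
# .iloc slices exclude the stop position: rows at positions 0 and 1 only
rows_iloc = random_df.iloc[0:2, 0:2]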

Filtering Rows Based on Logical Conditions¶

In pandas, you can filter rows of a DataFrame based on logical conditions. This is a common operation when you want to focus on a subset of your data that meets specific criteria. The process involves specifying conditions that return a Boolean Series, which can then be used to index the DataFrame and retrieve the desired rows. Let's explore how to perform this operation with examples.

In [8]:
# Filtering rows where Column1 values are greater than 50
condition_1 = random_df['Column1'] > 50
filtered_df_1 = random_df[condition_1]
filtered_df_1
Out[8]:
   Column1  Column2  Column3
1       67       67        9
2       83       21       36
3       87       70       88
4       88       12       58
In [9]:
# Filtering rows where Column2 values are less than 30 or Column3 values are greater than 70
condition_2 = (random_df['Column2'] < 30) | (random_df['Column3'] > 70)
filtered_df_2 = random_df[condition_2]
filtered_df_2
Out[9]:
   Column1  Column2  Column3
2       83       21       36
3       87       70       88
4       88       12       58
In [10]:
# Filtering rows where Column1 values are not equal to 50 (~ negates a Boolean condition)
condition_3 = ~(random_df['Column1'] == 50)
filtered_df_3 = random_df[condition_3]
filtered_df_3
Out[10]:
   Column1  Column2  Column3
0       44       47       64
1       67       67        9
2       83       21       36
3       87       70       88
4       88       12       58
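
Conditions can also be combined with & (and). A sketch using random_df follows; the parentheses are required because & binds more tightly than the comparison operators, and the names condition_4 and filtered_df_4 are just illustrative:

In [ ]:
# Filtering rows where Column1 is greater than 50 and Column3 is less than 60
condition_4 = (random_df['Column1'] > 50) & (random_df['Column3'] < 60)
filtered_df_4 = random_df[condition_4]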

Element-wise Unary and Binary Operations on Columns¶

Pandas allows you to perform element-wise operations on columns, which means applying an operation to each individual element in a column or between columns. These operations can be broadly categorized into unary and binary operations.

  • Unary Operations: These operations are applied to a single column. Examples include taking the square root of each value or negating each value in a column.
  • Binary Operations: These operations are applied between two columns. Examples include adding, subtracting, multiplying, or dividing values between two columns.

Additionally, we'll explore the np.where function, which is a versatile tool for creating new columns based on conditions.
In [11]:
# Unary Operation: Negating the values in Column1
random_df['Negated_Column1'] = -random_df['Column1']
# Binary Operation: Adding the values of Column1 and Column2
random_df['Sum_Column1_Column2'] = random_df['Column1'] + random_df['Column2']
# Using np.where: Creating a new column based on a condition
random_df['Column1_Positive'] = np.where(random_df['Column1'] > 50, 'Yes', 'No')
random_df
Out[11]:
   Column1  Column2  Column3  Negated_Column1  Sum_Column1_Column2 Column1_Positive
0       44       47       64              -44                   91               No
1       67       67        9              -67                  134              Yes
2       83       21       36              -83                  104              Yes
3       87       70       88              -87                  157              Yes
4       88       12       58              -88                  100              Yes
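
The square-root example mentioned above would look like the following sketch, which uses NumPy's element-wise np.sqrt; the column name Sqrt_Column2 is just illustrative:

In [ ]:
# Unary Operation: Taking the square root of each value in Column2
random_df['Sqrt_Column2'] = np.sqrt(random_df['Column2'])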

Handling Missing Data: dropna and fillna¶

In real-world datasets, it's common to encounter missing values. Pandas provides various methods to handle missing data. Two of the most commonly used methods are dropna and fillna.

  • dropna: This method removes missing values. You can specify whether to remove rows or columns that contain missing values.
  • fillna: This method fills in missing values with a specified value or method (e.g., forward fill, backward fill).

Let's create a sample DataFrame with missing values and demonstrate the use of these methods.
In [12]:
import numpy as np
# Creating a sample DataFrame with missing values
data = {
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [9, 10, 11, np.nan]
}
df_missing = pd.DataFrame(data)
df_missing
Out[12]:
     A    B     C
0  1.0  5.0   9.0
1  2.0  NaN  10.0
2  NaN  7.0  11.0
3  4.0  8.0   NaN
In [13]:
# Using dropna to drop rows with missing values
df_dropped = df_missing.dropna()
# Using fillna to fill missing values with 0
df_filled = df_missing.fillna(0)
df_dropped, df_filled
Out[13]:
(     A    B    C
 0  1.0  5.0  9.0,
      A    B     C
 0  1.0  5.0   9.0
 1  2.0  0.0  10.0
 2  0.0  7.0  11.0
 3  4.0  8.0   0.0)
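
The forward-fill and backward-fill methods mentioned above propagate the nearest non-missing value within each column. A minimal sketch using df_missing (the variable names are illustrative):

In [ ]:
# Forward fill: propagate the last valid value downward within each column
df_ffill = df_missing.ffill()
# Backward fill: propagate the next valid value upward within each column
df_bfill = df_missing.bfill()
# dropna can also drop columns (rather than rows) that contain missing values
df_dropped_cols = df_missing.dropna(axis=1)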

Aggregation on Columns and Rows¶

Aggregation is the process of combining multiple values into a single value. In the context of a DataFrame, this often means computing summary statistics over columns or rows. Pandas provides a variety of methods to perform aggregation, allowing you to quickly compute statistics like the sum, average, maximum, minimum, and more. Let's explore how to perform aggregation on columns and rows using a sample DataFrame.

In [14]:
# Creating a sample DataFrame for demonstration
data = {
    'A': [10, 20, 30, 40],
    'B': [5, 15, 25, 35],
    'C': [1, 2, 3, 4]
}
df_agg = pd.DataFrame(data)
df_agg
Out[14]:
    A   B  C
0  10   5  1
1  20  15  2
2  30  25  3
3  40  35  4
In [15]:
# Column-wise Aggregation
column_sum = df_agg.sum()
column_mean = df_agg.mean()
column_sum, column_mean
Out[15]:
(A    100
 B     80
 C     10
 dtype: int64,
 A    25.0
 B    20.0
 C     2.5
 dtype: float64)
In [16]:
# More column-wise aggregation
column_max = df_agg.max()
column_min = df_agg.min()
column_max, column_min
Out[16]:
(A    40
 B    35
 C     4
 dtype: int64,
 A    10
 B     5
 C     1
 dtype: int64)
In [17]:
# Row-wise Aggregation
row_sum = df_agg.sum(axis=1)
row_sum
Out[17]:
0    16
1    37
2    58
3    79
dtype: int64
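
The other statistics work row-wise as well. A quick sketch of row-wise means and maxima over the same DataFrame:

In [ ]:
# Row-wise mean and max across columns A, B, and C
row_mean = df_agg.mean(axis=1)
row_max = df_agg.max(axis=1)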

Indexing in Pandas¶

The index of a DataFrame is one of the most essential components in pandas. It provides a label for each row, allowing for fast lookups and powerful data alignment capabilities. In this section, we'll explore the following indexing-related functionalities:

  1. set_index: This method allows you to set one or more columns as the index of a DataFrame.
  2. reset_index: This method resets the index of a DataFrame and brings it back to the default integer index.
  3. Multi-indexing: Multi-indexing allows you to have multiple levels of indices in a DataFrame, providing a way to work with higher-dimensional data using a two-dimensional DataFrame structure.
In [18]:
# Creating a sample DataFrame for demonstration
data = {
    'Year': [2020, 2020, 2021, 2021],
    'Month': ['Jan', 'Feb', 'Jan', 'Feb'],
    'Value': [100, 150, 110, 160]
}
df_index = pd.DataFrame(data)
df_index
Out[18]:
   Year Month  Value
0  2020   Jan    100
1  2020   Feb    150
2  2021   Jan    110
3  2021   Feb    160
In [19]:
# Setting the 'Year' column as the index
df_year_index = df_index.set_index('Year')
# Resetting the index
df_reset = df_year_index.reset_index()

df_year_index, df_reset
Out[19]:
(     Month  Value
 Year             
 2020   Jan    100
 2020   Feb    150
 2021   Jan    110
 2021   Feb    160,
    Year Month  Value
 0  2020   Jan    100
 1  2020   Feb    150
 2  2021   Jan    110
 3  2021   Feb    160)
In [20]:
# Setting a multi-index using 'Year' and 'Month'
df_multi_index = df_index.set_index(['Year', 'Month'])
df_multi_index
Out[20]:
            Value
Year Month       
2020 Jan      100
     Feb      150
2021 Jan      110
     Feb      160
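
Once an index is set, .loc lookups use its labels, which is the fast-lookup benefit mentioned above. A sketch using the two indexed DataFrames just created (the variable names are illustrative):

In [ ]:
# Label-based lookup on the single index: all rows for 2020
rows_2020 = df_year_index.loc[2020]
# Label-based lookup on the multi-index: the row for (2021, 'Feb')
value_2021_feb = df_multi_index.loc[(2021, 'Feb')]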

Aggregating by Groups¶

Grouping and aggregating data is a powerful feature in pandas that allows you to segment your data into groups and then apply aggregate functions to each group. This is similar to SQL's GROUP BY clause. The primary method for this in pandas is groupby(). Once data is grouped, you can apply various aggregation functions such as sum(), mean(), max(), and many others to compute statistics for each group. We'll use the tips dataset from seaborn to demonstrate this functionality. This dataset contains information about restaurant tips, with columns indicating the total bill, tip amount, sex of the tipper, day of the week, and other attributes.

In [21]:
import seaborn as sns
# Loading the tips dataset from seaborn
tips = sns.load_dataset('tips')
tips.head()
Out[21]:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
In [22]:
# Grouping by a Single Column: Average total bill by day
avg_bill_by_day = tips.groupby('day')['total_bill'].mean()

avg_bill_by_day
Out[22]:
day
Thur    17.682742
Fri     17.151579
Sat     20.441379
Sun     21.410000
Name: total_bill, dtype: float64
In [23]:
# Grouping by Multiple Columns: Total tip amount by day and time
total_tip_by_day_time = tips.groupby(['day', 'time'])['tip'].sum()

total_tip_by_day_time
Out[23]:
day   time  
Thur  Lunch     168.83
      Dinner      3.00
Fri   Lunch      16.68
      Dinner     35.28
Sat   Lunch       0.00
      Dinner    260.40
Sun   Lunch       0.00
      Dinner    247.39
Name: tip, dtype: float64
In [24]:
# Applying Multiple Aggregation Functions: Mean, sum, and max of total_bill by day
multiple_aggregations = tips.groupby('day')['total_bill'].agg(['mean', 'sum', 'max'])
multiple_aggregations
Out[24]:
           mean      sum    max
day                            
Thur  17.682742  1096.33  43.11
Fri   17.151579   325.88  40.17
Sat   20.441379  1778.40  50.81
Sun   21.410000  1627.16  48.17
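
As an aside, recent versions of pandas (0.25 and later) also support named aggregation, in which the keyword names become the output column names. A sketch (the names avg_bill and total_tip are illustrative):

In [ ]:
# Named aggregation: keyword arguments name the output columns
named_agg = tips.groupby('day').agg(
    avg_bill=('total_bill', 'mean'),
    total_tip=('tip', 'sum')
)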

Stack and Unstack Methods in Pandas¶

When working with multi-index DataFrames or Series in pandas, the stack and unstack methods are useful tools for reshaping the data.

  • stack(): This method "compresses" the inner-most level of the DataFrame's columns into the row index, producing either:
    • A Series (if there's only one column level) or
    • A DataFrame with a new inner-most level of row labels (if there are multiple column levels).
  • unstack(): This method "expands" a level in the DataFrame's rows to produce a DataFrame with a new inner-most level of column labels.

To better understand these methods, let's use the aggregated data from the tips dataset and demonstrate the use of stack and unstack.
In [25]:
# using unstack to convert to 'wide form'

wide1 = total_tip_by_day_time.unstack()
wide1
Out[25]:
time   Lunch  Dinner
day                 
Thur  168.83    3.00
Fri    16.68   35.28
Sat     0.00  260.40
Sun     0.00  247.39
In [26]:
wide2 = total_tip_by_day_time.unstack('day')
wide2
Out[26]:
day       Thur    Fri    Sat     Sun
time                                
Lunch   168.83  16.68    0.0    0.00
Dinner    3.00  35.28  260.4  247.39
In [30]:
# using stack to convert to 'long form'

long = wide1.stack()
long
Out[30]:
day   time  
Thur  Lunch     168.83
      Dinner      3.00
Fri   Lunch      16.68
      Dinner     35.28
Sat   Lunch       0.00
      Dinner    260.40
Sun   Lunch       0.00
      Dinner    247.39
dtype: float64
In [31]:
# extracting a row of a multi-indexed dataframe

long.loc[('Fri', 'Dinner')]
Out[31]:
35.28
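
Partial indexing also works: supplying only the outer level of the multi-index returns all rows under that label. A sketch using the same long Series:

In [ ]:
# Selecting by the outer index level only: the Lunch and Dinner totals for Friday
long.loc['Fri']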

Combining DataFrames: Concatenate, Join, and Merge¶

Pandas provides several methods to combine multiple DataFrames or Series. These methods are essential when working with data from multiple sources or when preparing data for analysis. In this section, we'll explore the following methods:

  1. concat: This function concatenates two or more DataFrames or Series along a particular axis (either rows or columns).
  2. join: This method joins two or more DataFrames based on their indices.
  3. merge: This method is similar to SQL joins. It allows you to merge DataFrames based on common columns.

Let's delve into each of these methods with examples.
In [32]:
# Sample DataFrames for demonstration
# DataFrames for concatenate
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})
# DataFrames for join
df3 = pd.DataFrame({'C': ['C0', 'C1', 'C2'], 'D': ['D0', 'D1', 'D2']}, index=['X', 'Y', 'Z'])
df4 = pd.DataFrame({'E': ['E0', 'E1', 'E2'], 'F': ['F0', 'F1', 'F2']}, index=['X', 'Y', 'Z'])
# DataFrames for merge
df5 = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'G': ['G0', 'G1', 'G2']})
df6 = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'H': ['H0', 'H1', 'H2']})
In [33]:
# concatenating vertically
concatenated_df_v = pd.concat([df1, df2])
df1, df2, concatenated_df_v
Out[33]:
(    A   B
 0  A0  B0
 1  A1  B1,
     A   B
 0  A2  B2
 1  A3  B3,
     A   B
 0  A0  B0
 1  A1  B1
 0  A2  B2
 1  A3  B3)
In [34]:
# concatenating horizontally
concatenated_df_h = pd.concat([df3, df4], axis=1)
df3, df4, concatenated_df_h
Out[34]:
(    C   D
 X  C0  D0
 Y  C1  D1
 Z  C2  D2,
     E   F
 X  E0  F0
 Y  E1  F1
 Z  E2  F2,
     C   D   E   F
 X  C0  D0  E0  F0
 Y  C1  D1  E1  F1
 Z  C2  D2  E2  F2)
In [35]:
# Using join to join df3 and df4 based on their indices
joined_df = df3.join(df4)
joined_df
Out[35]:
    C   D   E   F
X  C0  D0  E0  F0
Y  C1  D1  E1  F1
Z  C2  D2  E2  F2
In [36]:
# Using merge to merge df5 and df6 based on the 'key' column
merged_df = pd.merge(df5, df6, on='key')
merged_df
Out[36]:
  key   G   H
0  K0  G0  H0
1  K1  G1  H1
2  K2  G2  H2

Inner and Outer Joins/Merges¶

When merging or joining DataFrames in pandas, you can specify how you want to handle the combination of rows based on the keys. The two most common types of joins/merges are:

  1. Inner Join/Merge: This type returns only the rows with matching keys in both DataFrames.
  2. Outer Join/Merge: This type returns all rows from both DataFrames, filling in NaN for missing values in columns.

Let's delve deeper into each of these types with examples.
In [37]:
# Sample DataFrames for demonstration
df_A = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'value_A': ['A0', 'A1', 'A2']})
df_B = pd.DataFrame({'key': ['K1', 'K2', 'K3'], 'value_B': ['B1', 'B2', 'B3']})
df_A, df_B
Out[37]:
(  key value_A
 0  K0      A0
 1  K1      A1
 2  K2      A2,
   key value_B
 0  K1      B1
 1  K2      B2
 2  K3      B3)
In [38]:
# Performing an inner merge on df_A and df_B based on the 'key' column
inner_merged = pd.merge(df_A, df_B, on='key', how='inner')
inner_merged
Out[38]:
  key value_A value_B
0  K1      A1      B1
1  K2      A2      B2
In [39]:
# Performing an outer merge on df_A and df_B based on the 'key' column
outer_merged = pd.merge(df_A, df_B, on='key', how='outer')
outer_merged
Out[39]:
  key value_A value_B
0  K0      A0     NaN
1  K1      A1      B1
2  K2      A2      B2
3  K3     NaN      B3

Map, Apply, and Transform Methods in Pandas¶

Pandas provides several methods to apply functions to Series and DataFrames, allowing for powerful and flexible data manipulation. In this section, we'll explore the following methods:

  1. map: Used with a Series to substitute each value with another value, specified by a function, dict, or Series.
  2. apply: Used with both Series and DataFrames to apply a function along the input axis.
  3. transform: Used with groupby objects to produce a like-indexed DataFrame that is broadcastable back to the original object.

Let's delve into each of these methods with explanations and examples.
In [40]:
# Sample Series for demonstration
names = pd.Series(['Alice', 'Bob', 'Charlie'])
# Using map to prefix each name with 'Hello, '
greetings = names.map(lambda x: 'Hello, ' + x)
greetings
Out[40]:
0      Hello, Alice
1        Hello, Bob
2    Hello, Charlie
dtype: object
In [41]:
# Sample Series and DataFrame for demonstration
numbers = pd.Series([1, 2, 3, 4])
df_numbers = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Using apply with a Series to square each number
squared_numbers = numbers.apply(lambda x: x**2)
# Using apply with a DataFrame to sum values along the columns
column_sums = df_numbers.apply(sum, axis=0)
squared_numbers, column_sums
Out[41]:
(0     1
 1     4
 2     9
 3    16
 dtype: int64,
 A     6
 B    15
 dtype: int64)
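
apply can also work row-wise on a DataFrame by passing axis=1, in which case each row is passed to the function as a Series. A sketch using df_numbers (the name row_products is illustrative):

In [ ]:
# Row-wise apply: multiply each row's A and B values
row_products = df_numbers.apply(lambda row: row['A'] * row['B'], axis=1)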
In [42]:
# Using the tips dataset for demonstration
import seaborn as sns
tips = sns.load_dataset('tips')
# Grouping by 'day' and using transform to compute mean tip for each day
mean_tips_by_day = tips.groupby('day')['tip'].transform('mean')
tips['mean_tip_by_day'] = mean_tips_by_day
tips.head()
Out[42]:
   total_bill   tip     sex smoker  day    time  size  mean_tip_by_day
0       16.99  1.01  Female     No  Sun  Dinner     2         3.255132
1       10.34  1.66    Male     No  Sun  Dinner     3         3.255132
2       21.01  3.50    Male     No  Sun  Dinner     3         3.255132
3       23.68  3.31    Male     No  Sun  Dinner     2         3.255132
4       24.59  3.61  Female     No  Sun  Dinner     4         3.255132
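
Because transform returns a like-indexed Series, it can be combined directly with the original columns. For example, here is a sketch that demeans each tip by its day's average (the column name tip_demeaned is illustrative):

In [ ]:
# Demeaning: subtract each day's average tip from the individual tips
tips['tip_demeaned'] = tips['tip'] - tips.groupby('day')['tip'].transform('mean')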

Exploratory Methods in Pandas: describe, info, head, and tail¶

When working with DataFrames in pandas, it's often useful to quickly explore and understand the structure and content of the data. Pandas provides several methods to help with this initial data exploration. In this section, we'll explore the following methods:

  1. describe: Provides a summary of the central tendencies, dispersion, and shape of the distribution of a dataset.
  2. info: Gives a concise summary of a DataFrame, including the number of non-null values in each column, data type, and memory usage.
  3. head: Returns the first few rows of a DataFrame or Series.
  4. tail: Returns the last few rows of a DataFrame or Series.

Let's delve into each of these methods with explanations and examples using the tips dataset.
In [43]:
# Using describe method on the tips dataset
tips.describe()
Out[43]:
       total_bill         tip        size  mean_tip_by_day
count  244.000000  244.000000  244.000000       244.000000
mean    19.785943    2.998279    2.569672         2.998279
std      8.902412    1.383638    0.951100         0.197993
min      3.070000    1.000000    1.000000         2.734737
25%     13.347500    2.000000    2.000000         2.771452
50%     17.795000    2.900000    2.000000         2.993103
75%     24.127500    3.562500    3.000000         3.255132
max     50.810000   10.000000    6.000000         3.255132
In [44]:
# Using info method on the tips dataset
tips.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   total_bill       244 non-null    float64 
 1   tip              244 non-null    float64 
 2   sex              244 non-null    category
 3   smoker           244 non-null    category
 4   day              244 non-null    category
 5   time             244 non-null    category
 6   size             244 non-null    int64   
 7   mean_tip_by_day  244 non-null    float64 
dtypes: category(4), float64(3), int64(1)
memory usage: 9.3 KB
In [45]:
# Using head method on the tips dataset to view the first 3 rows
tips.head(3)
Out[45]:
   total_bill   tip     sex smoker  day    time  size  mean_tip_by_day
0       16.99  1.01  Female     No  Sun  Dinner     2         3.255132
1       10.34  1.66    Male     No  Sun  Dinner     3         3.255132
2       21.01  3.50    Male     No  Sun  Dinner     3         3.255132
In [46]:
# Using tail method on the tips dataset to view the last 3 rows
tips.tail(3)
Out[46]:
     total_bill   tip     sex smoker   day    time  size  mean_tip_by_day
241       22.67  2.00    Male    Yes   Sat  Dinner     2         2.993103
242       17.82  1.75    Male     No   Sat  Dinner     2         2.993103
243       18.78  3.00  Female     No  Thur  Dinner     2         2.771452

Reading and Writing DataFrames¶

Pandas provides functions for reading files in different formats into DataFrames. These include read_csv, read_stata, read_sas, read_sql, and read_html.

There are corresponding write methods: to_csv, to_stata, ...

Stata files can include variable descriptions in addition to the data. If you execute read_stata, you will only get the data. If you also want the variable descriptions, then use the pandas StataReader.

In [47]:
from pandas.io.stata import StataReader

stata_file = StataReader('WAGE1.dta')
variables = stata_file.variable_labels()
data = stata_file.read()

variables
Out[47]:
{'wage': 'average hourly earnings',
 'educ': 'years of education',
 'exper': 'years potential experience',
 'tenure': 'years with current employer',
 'nonwhite': '=1 if nonwhite',
 'female': '=1 if female',
 'married': '=1 if married',
 'numdep': 'number of dependents',
 'smsa': '=1 if live in SMSA',
 'northcen': '=1 if live in north central U.S',
 'south': '=1 if live in southern region',
 'west': '=1 if live in western region',
 'construc': '=1 if work in construc. indus.',
 'ndurman': '=1 if in nondur. manuf. indus.',
 'trcommpu': '=1 if in trans, commun, pub ut',
 'trade': '=1 if in wholesale or retail',
 'services': '=1 if in services indus.',
 'profserv': '=1 if in prof. serv. indus.',
 'profocc': '=1 if in profess. occupation',
 'clerocc': '=1 if in clerical occupation',
 'servocc': '=1 if in service occupation',
 'lwage': 'log(wage)',
 'expersq': 'exper^2',
 'tenursq': 'tenure^2'}
In [48]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 526 entries, 0 to 525
Data columns (total 24 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   wage      526 non-null    float32
 1   educ      526 non-null    int8   
 2   exper     526 non-null    int8   
 3   tenure    526 non-null    int8   
 4   nonwhite  526 non-null    int8   
 5   female    526 non-null    int8   
 6   married   526 non-null    int8   
 7   numdep    526 non-null    int8   
 8   smsa      526 non-null    int8   
 9   northcen  526 non-null    int8   
 10  south     526 non-null    int8   
 11  west      526 non-null    int8   
 12  construc  526 non-null    int8   
 13  ndurman   526 non-null    int8   
 14  trcommpu  526 non-null    int8   
 15  trade     526 non-null    int8   
 16  services  526 non-null    int8   
 17  profserv  526 non-null    int8   
 18  profocc   526 non-null    int8   
 19  clerocc   526 non-null    int8   
 20  servocc   526 non-null    int8   
 21  lwage     526 non-null    float32
 22  expersq   526 non-null    int16  
 23  tenursq   526 non-null    int16  
dtypes: float32(2), int16(2), int8(20)
memory usage: 20.5 KB
In [49]:
# write to a csv file (by default, the row index is written as the first column)

data.to_csv("wage1.csv")
In [50]:
# read a csv file

data2 = pd.read_csv("wage1.csv")
data2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 25 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  526 non-null    int64  
 1   wage        526 non-null    float64
 2   educ        526 non-null    int64  
 3   exper       526 non-null    int64  
 4   tenure      526 non-null    int64  
 5   nonwhite    526 non-null    int64  
 6   female      526 non-null    int64  
 7   married     526 non-null    int64  
 8   numdep      526 non-null    int64  
 9   smsa        526 non-null    int64  
 10  northcen    526 non-null    int64  
 11  south       526 non-null    int64  
 12  west        526 non-null    int64  
 13  construc    526 non-null    int64  
 14  ndurman     526 non-null    int64  
 15  trcommpu    526 non-null    int64  
 16  trade       526 non-null    int64  
 17  services    526 non-null    int64  
 18  profserv    526 non-null    int64  
 19  profocc     526 non-null    int64  
 20  clerocc     526 non-null    int64  
 21  servocc     526 non-null    int64  
 22  lwage       526 non-null    float64
 23  expersq     526 non-null    int64  
 24  tenursq     526 non-null    int64  
dtypes: float64(2), int64(23)
memory usage: 102.9 KB
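
The Unnamed: 0 column above is the row index that to_csv wrote to the file by default. A sketch of two common ways to avoid it:

In [ ]:
# Option 1: don't write the index to the file
data.to_csv("wage1.csv", index=False)
# Option 2: when reading, treat the first column as the index
data2 = pd.read_csv("wage1.csv", index_col=0)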