Lambda functions¶
Convenient way to write short function definitions
In [9]:
def f(x):
return x**2
print(f(3))
f = lambda x: x**2
print(f(3))
9 9
Line continuations¶
- can use a backslash \ at the end of a line to continue it on the next
- or enclose in ()
- lists can be continued without a backslash
In [18]:
def f(x):
return 2*x \
+ 3
print(f(2))
def f(x):
return (
2*x
+ 3
)
print(f(2))
lst = [
"a",
"b",
"c"
]
print(lst)
7 7 ['a', 'b', 'c']
Reading and writing dataframes with pandas¶
- Can read csv, excel, sas, stata, sql, ...
- Can write also
Two ways to read stata dta files¶
- pd.read_stata
- StataReader
In [3]:
import pandas as pd
df = pd.read_stata("WAGE1.DTA")
df
Out[3]:
wage | educ | exper | tenure | nonwhite | female | married | numdep | smsa | northcen | ... | trcommpu | trade | services | profserv | profocc | clerocc | servocc | lwage | expersq | tenursq | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3.10 | 11 | 2 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.131402 | 4 | 0 |
1 | 3.24 | 12 | 22 | 2 | 0 | 1 | 1 | 3 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1.175573 | 484 | 4 |
2 | 3.00 | 11 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1.098612 | 4 | 0 |
3 | 6.00 | 8 | 44 | 28 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1.791759 | 1936 | 784 |
4 | 5.30 | 12 | 7 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.667707 | 49 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
521 | 15.00 | 16 | 14 | 2 | 0 | 1 | 1 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 2.708050 | 196 | 4 |
522 | 2.27 | 10 | 2 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0.819780 | 4 | 0 |
523 | 4.67 | 15 | 13 | 18 | 0 | 0 | 1 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1.541159 | 169 | 324 |
524 | 11.56 | 16 | 5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.447551 | 25 | 1 |
525 | 3.50 | 14 | 5 | 4 | 1 | 1 | 0 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1.252763 | 25 | 16 |
526 rows × 24 columns
In [5]:
from pandas.io.stata import StataReader
file = StataReader("WAGE1.dta")
variables = file.variable_labels()
df = file.read()
In [6]:
variables
Out[6]:
{'wage': 'average hourly earnings', 'educ': 'years of education', 'exper': 'years potential experience', 'tenure': 'years with current employer', 'nonwhite': '=1 if nonwhite', 'female': '=1 if female', 'married': '=1 if married', 'numdep': 'number of dependents', 'smsa': '=1 if live in SMSA', 'northcen': '=1 if live in north central U.S', 'south': '=1 if live in southern region', 'west': '=1 if live in western region', 'construc': '=1 if work in construc. indus.', 'ndurman': '=1 if in nondur. manuf. indus.', 'trcommpu': '=1 if in trans, commun, pub ut', 'trade': '=1 if in wholesale or retail', 'services': '=1 if in services indus.', 'profserv': '=1 if in prof. serv. indus.', 'profocc': '=1 if in profess. occupation', 'clerocc': '=1 if in clerical occupation', 'servocc': '=1 if in service occupation', 'lwage': 'log(wage)', 'expersq': 'exper^2', 'tenursq': 'tenure^2'}
In [7]:
df
Out[7]:
wage | educ | exper | tenure | nonwhite | female | married | numdep | smsa | northcen | ... | trcommpu | trade | services | profserv | profocc | clerocc | servocc | lwage | expersq | tenursq | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3.10 | 11 | 2 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.131402 | 4 | 0 |
1 | 3.24 | 12 | 22 | 2 | 0 | 1 | 1 | 3 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1.175573 | 484 | 4 |
2 | 3.00 | 11 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1.098612 | 4 | 0 |
3 | 6.00 | 8 | 44 | 28 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1.791759 | 1936 | 784 |
4 | 5.30 | 12 | 7 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.667707 | 49 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
521 | 15.00 | 16 | 14 | 2 | 0 | 1 | 1 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 2.708050 | 196 | 4 |
522 | 2.27 | 10 | 2 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0.819780 | 4 | 0 |
523 | 4.67 | 15 | 13 | 18 | 0 | 0 | 1 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1.541159 | 169 | 324 |
524 | 11.56 | 16 | 5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.447551 | 25 | 1 |
525 | 3.50 | 14 | 5 | 4 | 1 | 1 | 0 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1.252763 | 25 | 16 |
526 rows × 24 columns
What can we do with pandas?¶
- Explore
- Select
- Transform
- Aggregate
- Sort, rank, and cut
- Filter
- Aggregate by group
- Merge
- Plot
Explore¶
- .info()
- .describe()
- .head()
- .tail()
- .columns
- .index
Select¶
- [] for columns
- or .column_name for columns
- .loc[] to get rows using labels
- .iloc[] to get rows using index 0, 1, ...
Transform¶
- mathematical operations to individual columns or rows
- combine columns or combine rows in mathematical operations (add, subtract, ...)
- create new columns or rows
- np.where(condition, if_true, if_false)
- .map(f) or .map(lambda x: ...)
- exercises:
- change the female column to be "F" if 1 and "M" if 0
- create a new column that is 1 if experience <=5, is 2 if 5 < experience <= 10, 3 if 10 < experience <= 20, 4 otherwise
- create a new column that is "northcen" if northcen=1, is "south" if south=1, is "west" if west=1, and is "east" otherwise.
Aggregate¶
- .sum(), .mean(), .median(), .quantile(), .std(), .var(), .min(), .max()
- .apply(lambda x: ...)
Sort, rank, and cut¶
- sort_values() or sort_index()
- .rank()
- pd.cut(data, bin_edges, right=True or False, labels=...)
- pd.qcut(data, number_of_groups, labels=...)
Filter¶
- df[df.wage<10]
- df[(df.wage<10) & (df.female==1)]
- df[(df.wage<10) | (df.female==1)]
- df.dropna() or df.dropna(subset=[...])
- not filtering, but .fillna is also useful
Aggregate by group¶
- df.groupby("female").wage.mean()
- df.groupby(["female", "married"]).wage.mean().unstack()
- can use any aggregation function including custom functions with .apply
- df.groupby("female").wage.apply( lambda x: x.quantile([0.2, 0.4, 0.6, 0.8]) )
- df.groupby("female").wage.apply( lambda x: pd.qcut(x, 5, labels=range(1, 6)) )
In [20]:
df["group"] = df.groupby(["female"], group_keys=False).wage.apply(
lambda x: pd.qcut(x, 5, labels=range(1, 6))
)
df[["female", "wage", "group"]]
ask Julius¶
- to read WAGE1.DTA and describe it and then do some aggregation by groups with it
- to get the seaborn tips dataset and do some aggregation by groups with it
- to use pandas datareader to get the Fama-French factors from French's data library and merge them with MOM from French's data library
- to create boxplots for the Fama-French factors and MOM