Python Pandas Basics Exercises_temp

02 Jan 2018

pd.Series(myarg), argument can be e.g. list, array, dictionary

myserie.head() myserie.tail()

ser.name : to assign name to a serie

pd.concat([myserie1, myserie2]) # set axis for rowise or columnwise if mydf1

myserie.to_frame().reset_index()

ser1.isin(ser2)

pd.DataFrame(mydictionary)

other useful functions

np.setdiff1d(ser1,ser2),

np.percentile(ser, q=[0, 25, 50, 75, 100])

ser.value_counts() frequency counts of each unique value

pd.qcut(): to bin a numeric series, e.g. pd.qcut(ser, q=[0, .10, .20, .3, .4, .5, .6, .7, .8, .9, 1], labels=[‘1st’, ‘2nd’, ‘3rd’, ‘4th’, ‘5th’, ‘6th’, ‘7th’, ‘8th’, ‘9th’, ‘10th’])

ser.values.reshape(7,5)

np.argwhere(ser % 3 == 0) find the positions of numbers that are multiples of 3 f

ser.map(lambda x: x.title())

ser.diff()

pd.to_datetime(ser)

ser_ts.dt.day.tolist()

ser_ts.dt.weekofyear.tolist()

ser_ts.dt.dayofyear.tolist()

ser_ts.dt.weekday_name.tolist()

mydf.groupby(myvalue).mean()

version 1

ser.resample(‘D’).ffill() # fill with previous value

can also be done like that:

ser.resample(‘D’).bfill() # fill with next value ser.resample(‘D’).bfill().ffill() # fill next else prev value

ser.autocorr(i)

df = pd.read_csv(‘https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv’, chunksize=50)

df2 = df2.transpose()

df = pd.read_csv(‘https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv’, converters={‘medv’: lambda x: ‘High’ if float(x) > 25 else ‘Low’} #-> converters !!! )

df.shape

df.dtypes

how many column under each dtype

df.get_dtype_counts()

same as

df.dtypes.value_counts()

summary statistics

df.describe()

-> by default, only numeric columns are included; to be sure to include all columns regardless of data types use include=”all” in argument

numpy array:

df_arr = df.values print(df_arr)

to list

df_list = df.values.tolist() print(df_list)

38. How to extract the row and column number of a particular cell with given criterion?

df.loc[df.Price == np.max(df.Price), [“Manufacturer”, “Model”, “Type”]]

b) What is the row and column number of the cell with the highest Price value?

rowN, colN = np.where(df.values == np.max(df.Price)) row, col = np.where(df.values == np.max(df.Price))

to get the value from the indices:

df.iloc[rowN, colN]

Get the value

df.iat[row[0], col[0]] df.iloc[row[0], col[0]] #=> iat snd iloc accepts row and column numbers.

Alternates

df.at[row[0], ‘Price’] df.get_value(row[0], ‘Price’) #=> Whereas at and loc accepts index and column names.

39. How to rename a specific columns in a dataframe?

Rename the column Type as CarType in df and replace the ‘.’ in column names with ‘_’.

df = pd.read_csv(‘https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv’) print(df.columns)

rename a column:

df=df.rename(columns = {‘Type’:’CarType’})

same as:

df.columns.values[2] = “CarType”

df.columns = df.columns.map(lambda x: x.replace(‘.’, ‘_’)) print(df.columns)

column selection: DataFrame or Series as return type

d1 = df[[“a”]] type(d1)

d2 = df.loc[:,[“a”]] type(d2)

d3 = df.iloc[:,[0]] type(d3)

=> on the other hand, this returns Series

s1 = df[“a”] type(s1)

s2 = df.loc[:,”a”] type(s2)

s3 = df.iloc[:,0] type(s3)

c. Sort the columns in reverse alphabetical order, that is colume ‘e’ first through column ‘a’ last.

df.sort_index(axis=1, ascending=False, inplace=True)

46. How to set the number of rows and columns displayed in the output?

pd.set_option(‘display.max_columns’, 10) pd.set_option(‘display.max_rows’, 10) pd.set_option(‘display.float_format’, lambda x: ‘%.4f’ % x)

all available options:

pd.describe_option()

version 1: Rounding

df.round(4)

version 2: Use apply to change format

df.apply(lambda x: ‘%.4f’ % x, axis=1)

or

df.applymap(lambda x: ‘%.4f’ % x)

Reset/undo float formatting

pd.options.display.float_format = None

48. How to format all the values in a dataframe as percentages?

Format the values in column ‘random’ of df as percentages.

df = pd.DataFrame(np.random.random(4), columns=[‘random’])

out = df.style.format({ ‘random’: ‘{0:.2%}’.format, })

out

df.index = primary key index

n = 5

df[‘a’].argsort()[::-1][n]

print(‘ser: ‘, ser.tolist(), ‘mean: ‘, round(ser.mean())) np.argwhere(ser > ser.mean())[1]

df = pd.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4))

print row sums

rowsums = df.apply(np.sum, axis=1)

last two rows with row sum greater than 100

last_two_rows = df.iloc[np.where(rowsums > 100)[0][-2:], :]

ser.quantile([low_perc, high_perc])

def cap_outliers(ser, low_perc, high_perc):

print(low_perc, '%ile: ', low, '|', high_perc, '%ile: ', high)
ser[ser < low] = low
ser[ser > high] = high
return(ser)

capped_ser = cap_outliers(ser, .05, .95)

55. How to reshape a dataframe to the largest possible square after removing the negative values?

Reshape df to the largest possible square with negative values removed. Drop the smallest values if need be. The order of the positive numbers in the result should remain the same as the original.

df = pd.DataFrame(np.random.randint(-20, 50, 100).reshape(10,-1))

Step 1: remove negative values from arr

arr = df[df > 0].values.flatten() arr_qualified = arr[~np.isnan(arr)]

Step 2: find side-length of largest possible square

n = int(np.floor(arr_qualified.shape[0]**.5))

Step 3: Take top n^2 items without changing positions

top_indexes = np.argsort(arr_qualified)[::-1] output = np.take(arr_qualified, sorted(top_indexes[:n**2])).reshape(n, -1) print(output)

56. How to swap two rows of a dataframe?

Swap rows 1 and 2 in df.

df = pd.DataFrame(np.arange(25).reshape(5, -1))

def swap_rows(df, i1, i2): a, b = df.iloc[i1, :].copy(), df.iloc[i2, :].copy() df.iloc[i1, :], df.iloc[i2, :] = b, a return df

print(swap_rows(df, 1, 2))

57. How to reverse the rows of a dataframe?

Reverse all the rows of dataframe df.

df = pd.DataFrame(np.arange(25).reshape(5, -1))

version 1

df.iloc[::-1, :]

version 2

df.loc[df.index[::-1], :]

df.apply(np.argmax, axis=1).value_counts().index[0]

mydict.update()

version 1

df.apply(lambda x: ((x - x.mean())/x.std()).round(2))

df.iat[i, i] = 0

df_grouped = df.groupby([‘col1’])

version 1

df_grouped.get_group(‘apple’)

version 2

df_grpd = df[‘taste’].groupby(df.fruit) df_grpd.get_group(‘banana’).sort_values().iloc[-2]

pd.merge(df1, df2, how=’inner’, left_on=[‘fruit’, ‘weight’], right_on=[‘pazham’, ‘kilo’], suffixes=[‘_left’, ‘_right’])

73. How to create lags and leads of a column in a dataframe?

Create two new columns in df, one of which is a lag1 (shift column a down by 1 row) of column ‘a’ and the other is a lead1 (shift column b up by 1 row).

df = pd.DataFrame(np.random.randint(1, 100, 20).reshape(-1, 4), columns = list(‘abcd’))

df[‘a_lag1’] = df[‘a’].shift(1) df[‘b_lead1’] = df[‘b’].shift(-1)

75. How to split a text column into two separate columns?

Split the string column in df to form a dataframe with 3 columns as shown.

Solution

df_out = df.row.str.split(‘,|\t’, expand=True)

Make first row as header

new_header = df_out.iloc[0] df_out = df_out[1:] df_out.columns = new_header print(df_out)

[ ]