LEARN DATA SCIENCE ONLINE DATAQUEST
Start Learning For Free - www.dataquest.io

Data Science Cheat Sheet: Pandas

KEY IMPORTS
Import these to start:
import pandas as pd
import numpy as np

Shorthand used in this cheat sheet:
df - A pandas DataFrame object
s - A pandas Series object

IMPORTING DATA
pd.read_csv(filename) - From a CSV file
pd.read_table(filename) - From a delimited text file (like TSV)
pd.read_excel(filename) - From an Excel file
pd.read_sql(query, connection_object) - Reads from a SQL table/database
pd.read_json(json_string) - Reads from a JSON formatted string, URL or file
pd.read_html(url) - Parses an HTML URL, string or file and extracts tables to a list of dataframes
pd.read_clipboard() - Takes the contents of your clipboard and passes it to read_table()
pd.DataFrame(dict) - From a dict: keys for column names, values for data as lists

SELECTION
df[col] - Returns column with label col as a Series
df[[col1, col2]] - Returns columns as a new DataFrame
s.iloc[0] - Selection by position
s.loc[0] - Selection by index
df.iloc[0,:] - First row
df.iloc[0,0] - First element of first column
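A minimal sketch tying the selection commands above together on a toy frame (the column names and data here are invented for illustration):

```python
import pandas as pd

# Hypothetical toy DataFrame, built from a dict as in IMPORTING DATA
df = pd.DataFrame({"name": ["Ada", "Ben", "Cal"], "score": [91, 85, 78]})

col_as_series = df["score"]        # single label -> Series
sub_frame = df[["name", "score"]]  # list of labels -> new DataFrame
first_row = df.iloc[0, :]          # selection by integer position: first row
first_cell = df.iloc[0, 0]         # first element of first column
```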
EXPORTING DATA
df.to_csv(filename) - Writes to a CSV file
df.to_excel(filename) - Writes to an Excel file
df.to_sql(table_name, connection_object) - Writes to a SQL table
df.to_json(filename) - Writes to a file in JSON format
df.to_html(filename) - Saves as an HTML table
df.to_clipboard() - Writes to the clipboard

DATA CLEANING
df.columns = ['a', 'b', 'c'] - Renames columns
pd.isnull() - Checks for null values; returns a Boolean array
pd.notnull() - Opposite of pd.isnull()
df.dropna() - Drops all rows that contain null values
df.dropna(axis=1) - Drops all columns that contain null values
df.dropna(axis=1, thresh=n) - Drops all columns that have fewer than n non-null values
df.fillna(x) - Replaces all null values with x
s.fillna(s.mean()) - Replaces all null values with the mean (mean can be replaced with almost any function from the statistics section)
s.astype(float) - Converts the datatype of the series to float
s.replace(1, 'one') - Replaces all values equal to 1 with 'one'
s.replace([1, 3], ['one', 'three']) - Replaces all 1 with 'one' and 3 with 'three'
df.rename(columns=lambda x: x + 1) - Mass renaming of columns
df.rename(columns={'old_name': 'new_name'}) - Selective renaming
df.set_index('column_one') - Changes the index
df.rename(index=lambda x: x + 1) - Mass renaming of index

JOIN/COMBINE
df1.append(df2) - Adds the rows in df1 to the end of df2 (columns should be identical)
pd.concat([df1, df2], axis=1) - Adds the columns in df1 to the end of df2 (rows should be identical)
df1.join(df2, on=col1, how='inner') - SQL-style join: joins the columns in df1 with the columns of df2 where the rows for col1 have identical values. how can be one of 'left', 'right', 'outer', 'inner'
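A short sketch of the cleaning and combining commands on invented data (values and column names are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value
s = pd.Series([1.0, np.nan, 3.0])

filled = s.fillna(s.mean())   # null replaced with the mean of the series
dropped = s.dropna()          # rows containing nulls removed

# Selective renaming of one column
renamed = pd.DataFrame({"a": [1], "b": [2]}).rename(columns={"a": "x"})

# Side-by-side combine: columns of the second frame appended to the first
combined = pd.concat([pd.DataFrame({"a": [1]}), pd.DataFrame({"b": [2]})], axis=1)
```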
CREATE TEST OBJECTS
Useful for testing:
pd.DataFrame(np.random.rand(20, 5)) - 5 columns and 20 rows of random floats
pd.Series(my_list) - Creates a series from an iterable my_list
df.index = pd.date_range('1900/1/30', periods=df.shape[0]) - Adds a date index

VIEWING/INSPECTING DATA
df.head(n) - First n rows of the DataFrame
df.tail(n) - Last n rows of the DataFrame
df.shape - Number of rows and columns
df.info() - Index, datatype and memory information
df.describe() - Summary statistics for numerical columns
s.value_counts(dropna=False) - Views unique values and counts
df.apply(pd.Series.value_counts) - Unique values and counts for all columns

FILTER, SORT, & GROUPBY
df[df[col] > 0.5] - Rows where the col column is greater than 0.5
df[(df[col] > 0.5) & (df[col] < 0.7)] - Rows where 0.7 > col > 0.5
df.sort_values(col1) - Sorts values by col1 in ascending order
df.sort_values(col2, ascending=False) - Sorts values by col2 in descending order
df.sort_values([col1, col2], ascending=[True, False]) - Sorts values by col1 in ascending order then col2 in descending order
df.groupby(col) - Returns a groupby object for values from one column
df.groupby([col1, col2]) - Returns a groupby object for values from multiple columns
df.groupby(col1)[col2].mean() - Returns the mean of the values in col2, grouped by the values in col1 (mean can be replaced with almost any function from the statistics section)
df.pivot_table(index=col1, values=[col2, col3], aggfunc=mean) - Creates a pivot table that groups by col1 and calculates the mean of col2 and col3
df.groupby(col1).agg(np.mean) - Finds the average across all columns for every unique col1 group
df.apply(np.mean) - Applies a function across each column
df.apply(np.max, axis=1) - Applies a function across each row

STATISTICS
These can all be applied to a series as well.
df.describe() - Summary statistics for numerical columns
df.mean() - Returns the mean of all columns
df.corr() - Returns the correlation between columns in a DataFrame
df.count() - Returns the number of non-null values in each DataFrame column
df.max() - Returns the highest value in each column
df.min() - Returns the lowest value in each column
df.median() - Returns the median of each column
df.std() - Returns the standard deviation of each column
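A small sketch of grouping and the statistics commands on an invented table (the city/sales columns are made up for illustration):

```python
import pandas as pd

# Hypothetical sales table to illustrate grouping and aggregation
df = pd.DataFrame({"city": ["NY", "NY", "LA"], "sales": [10, 20, 30]})

means = df.groupby("city")["sales"].mean()  # mean of sales per city
overall = df["sales"].mean()                # column-wise mean
top = df["sales"].max()                     # highest value in the column
filtered = df[df["sales"] > 15]             # rows where sales > 15
```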