Pandas Cheat Sheet
A complete Pandas reference for Python data analysis — DataFrames, indexing, filtering, grouping, merging, time series, and more. Use the search bar to instantly find the operation you need.
import pandas
Setup & Import: Import the pandas library and common companions
Syntax:
import pandas as pd
Examples:
import pandas as pd  # Standard pandas import alias
import numpy as np  # Import pandas with NumPy for numerical operations
pd.__version__  # Check the installed pandas version
pd.set_option('display.max_columns', None)  # Show all columns when printing a DataFrame
Notes:
The 'pd' alias is the universal convention for pandas in scripts and notebooks.
pd.Series()
Creating Data: Create a one-dimensional labeled array
Syntax:
pd.Series(data, index=None, name=None)
Examples:
s = pd.Series([10, 20, 30, 40])  # Create a Series from a list
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])  # Series with custom index labels
s = pd.Series({'a': 1, 'b': 2, 'c': 3})  # Create a Series from a dictionary
s = pd.Series(np.random.randn(5), name='random')  # Series with NumPy random values and a name
Notes:
A Series is essentially a single column — building block of a DataFrame.
pd.DataFrame()
Creating Data: Create a two-dimensional labeled data structure
Syntax:
pd.DataFrame(data, index=None, columns=None)
Examples:
df = pd.DataFrame({'name': ['Ana', 'Bob'], 'age': [25, 30]})  # From a dictionary of lists
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])  # From a list of lists with column names
df = pd.DataFrame(np.random.rand(4, 3), columns=['x', 'y', 'z'])  # From a NumPy array with column labels
df = pd.DataFrame(records, index=['r1', 'r2', 'r3'])  # Provide a custom row index
Notes:
DataFrames are the central pandas object — think of them as in-memory spreadsheets.
pd.date_range()
Creating Data: Generate a fixed-frequency DatetimeIndex
Syntax:
pd.date_range(start, end=None, periods=None, freq='D')
Examples:
pd.date_range('2024-01-01', periods=7)  # Seven daily timestamps starting Jan 1, 2024
pd.date_range('2024-01-01', '2024-12-31', freq='M')  # Month-end dates across the year
pd.date_range('2024-01-01', periods=24, freq='H')  # Hourly timestamps for one day
pd.date_range('2024-01-01', periods=10, freq='B')  # Ten business days starting Jan 1, 2024
Notes:
Common freq values: 'D' day, 'B' business day, 'W' week, 'M' month-end, 'Q' quarter, 'Y' year, 'H' hour, 'T' minute. Note that pandas 2.2+ deprecates 'M', 'Q', 'Y', 'H', and 'T' in favor of 'ME', 'QE', 'YE', 'h', and 'min'.
pd.read_csv()
Reading & Writing: Read a CSV file into a DataFrame
Syntax:
pd.read_csv(filepath, sep=',', header=0, index_col=None, ...)
Examples:
df = pd.read_csv('data.csv')  # Read a basic CSV file
df = pd.read_csv('data.csv', sep=';', encoding='utf-8')  # Custom separator and encoding
df = pd.read_csv('data.csv', index_col=0, parse_dates=['date'])  # Use the first column as index, parse dates
df = pd.read_csv('data.csv', nrows=1000, usecols=['id', 'name'])  # Read only the first 1000 rows of selected columns
Notes:
Use chunksize=N for streaming large files instead of loading everything into memory.
df.to_csv()
Reading & Writing: Write a DataFrame to a CSV file
Syntax:
df.to_csv(path, sep=',', index=True, header=True)
Examples:
df.to_csv('output.csv', index=False)  # Save without the row index
df.to_csv('output.csv', sep='\t', encoding='utf-8')  # Tab-separated with UTF-8 encoding
df.to_csv('output.csv.gz', compression='gzip')  # Compressed CSV output
df.to_csv('output.csv', columns=['a', 'b'], index=False)  # Export only selected columns
Notes:
Always pass index=False unless the row index actually carries information.
Other formats
Reading & Writing: Read and write Excel, JSON, Parquet, SQL, and HTML
Syntax:
pd.read_excel / read_json / read_parquet / read_sql / read_html
Examples:
df = pd.read_excel('file.xlsx', sheet_name='Sheet1')  # Read a specific Excel sheet
df = pd.read_json('data.json', orient='records')  # Read a JSON file (record-oriented)
df = pd.read_parquet('data.parquet')  # Read a Parquet columnar file
df = pd.read_sql('SELECT * FROM users', conn)  # Run a SQL query and load the results
df.to_excel('output.xlsx', sheet_name='Data', index=False)  # Write to Excel
df.to_parquet('output.parquet')  # Write to Parquet (compact, fast)
Notes:
Parquet is recommended for large datasets — much faster I/O and smaller files than CSV.
head() / tail() / sample()
Inspecting Data: Preview the top, bottom, or random rows of a DataFrame
Syntax:
df.head(n=5) / df.tail(n=5) / df.sample(n=5)
Examples:
df.head()  # First 5 rows (default)
df.head(20)  # First 20 rows
df.tail(3)  # Last 3 rows
df.sample(n=5, random_state=42)  # 5 random rows (reproducible)
Notes:
Use sample() for visual sanity checks — head() can be misleading on sorted data.
info() / describe() / shape
Inspecting Data: Inspect dtypes, summary statistics, and dimensions
Syntax:
df.info() / df.describe() / df.shape
Examples:
df.info()  # Column dtypes, non-null counts, memory usage
df.describe()  # Summary statistics for numeric columns
df.describe(include='all')  # Stats for all columns (numeric + object)
df.shape  # Tuple (rows, columns)
df.dtypes  # Data type of each column
df.columns  # List of column names
Notes:
df.info() is the fastest way to spot missing data and incorrect dtypes.
value_counts()
Inspecting Data: Count unique values in a Series
Syntax:
s.value_counts(normalize=False, dropna=True)
Examples:
df['city'].value_counts()  # Frequency count of each city
df['city'].value_counts(normalize=True)  # Return proportions (0–1) instead of counts
df['city'].value_counts(dropna=False)  # Include NaN in the counts
df['city'].nunique()  # Number of distinct values
Notes:
Indispensable for exploring categorical columns — sorts results by count by default.
Select columns
Selecting & Indexing: Access one or multiple columns by name
Syntax:
df['col'] / df[['col1', 'col2']]
Examples:
df['name']  # Single column as a Series
df[['name', 'age']]  # Multiple columns as a DataFrame
df.name  # Dot access (only works for valid identifiers)
df.filter(like='date')  # Columns whose names contain 'date'
df.filter(regex='^id_')  # Columns matching a regex pattern
Notes:
Always prefer df['col'] over df.col — dot access fails on names with spaces or special chars.
.loc[]
Selecting & Indexing: Label-based row and column selection
Syntax:
df.loc[row_label, col_label]
Examples:
df.loc[3]  # Row with label 3
df.loc[0:5, 'name']  # Rows labeled 0–5 (inclusive), name column
df.loc[:, ['name', 'age']]  # All rows, specific columns
df.loc[df['age'] > 21, 'name']  # Boolean filter combined with column selection
df.loc[3, 'age'] = 99  # Update a single cell by label
Notes:
.loc is inclusive on both ends of slices, unlike Python list slicing.
.iloc[]
Selecting & Indexing: Integer position-based row and column selection
Syntax:
df.iloc[row_pos, col_pos]
Examples:
df.iloc[0]  # First row
df.iloc[-1]  # Last row
df.iloc[0:3, 0:2]  # First 3 rows, first 2 columns
df.iloc[[0, 2, 4], :]  # Rows 0, 2, and 4 with all columns
df.iloc[:, -1]  # Last column, all rows
Notes:
.iloc uses standard Python slicing — end index is exclusive.
Boolean indexing
Filtering: Filter rows using a boolean condition
Syntax:
df[condition]
Examples:
df[df['age'] > 21]  # Rows where age > 21
df[(df['age'] > 21) & (df['country'] == 'US')]  # Combine conditions with & and |
df[~(df['status'] == 'inactive')]  # Negate a condition with ~
df[df['name'].str.startswith('A')]  # Names starting with 'A'
Notes:
Always wrap each condition in parentheses — & and | have higher precedence than comparison operators.
query()
Filtering: Filter rows using a string expression
Syntax:
df.query(expr)
Examples:
df.query('age > 21')  # Rows where age > 21
df.query('age > 21 and country == "US"')  # Multiple conditions, readable syntax
df.query('city in ["NYC", "LA", "SF"]')  # Filter using membership
min_age = 30
df.query('age >= @min_age')  # Reference a Python variable with @
Notes:
Often more readable than boolean indexing, and can be faster on large DataFrames (query uses the numexpr engine when it is installed).
isin() / between()
Filtering: Filter by membership or numeric range
Syntax:
s.isin(values) / s.between(low, high)
Examples:
df[df['city'].isin(['NYC', 'LA', 'SF'])]  # Rows whose city is in a list
df[~df['city'].isin(['NYC'])]  # Rows whose city is NOT NYC
df[df['age'].between(18, 65)]  # Age between 18 and 65 (inclusive)
df[df['date'].between('2024-01-01', '2024-12-31')]  # Date range filter
Notes:
between() is inclusive on both ends — pass inclusive='neither' to exclude bounds.
isna() / dropna() / fillna()
Cleaning: Detect, drop, or impute missing values
Syntax:
df.isna() / df.dropna() / df.fillna(value)
Examples:
df.isna().sum()  # Count missing values per column
df.dropna()  # Drop rows containing any NaN
df.dropna(subset=['email'])  # Drop rows with NaN in 'email' only
df.fillna(0)  # Replace all NaN with 0
df['age'] = df['age'].fillna(df['age'].mean())  # Fill NaN with the column mean
df.ffill()  # Forward-fill — propagate the last valid value (fillna(method='ffill') is deprecated)
Notes:
Always investigate WHY data is missing before dropping or imputing.
duplicated() / drop_duplicates()
Cleaning: Identify and remove duplicate rows
Syntax:
df.duplicated() / df.drop_duplicates(subset=None)
Examples:
df.duplicated().sum()  # Count duplicate rows
df.drop_duplicates()  # Remove duplicate rows, keeping the first occurrence
df.drop_duplicates(subset=['email'])  # Remove duplicates based on the email column
df.drop_duplicates(keep='last')  # Keep the last occurrence instead of the first
Notes:
Pass keep=False to remove ALL occurrences of duplicates.
rename()
Cleaning: Rename columns or index labels
Syntax:
df.rename(columns={...}, index={...})
Examples:
df.rename(columns={'old_name': 'new_name'})  # Rename a single column
df.rename(columns=str.lower)  # Lowercase all column names
df.columns = ['a', 'b', 'c']  # Replace all column names at once
df.rename(index={0: 'first', 1: 'second'})  # Rename row index labels
Notes:
Pass inplace=True to modify the DataFrame directly without re-assignment.
astype() / replace()
Cleaning: Convert dtypes or substitute specific values
Syntax:
df.astype(dtype) / df.replace(old, new)
Examples:
df['age'] = df['age'].astype(int)  # Convert column to integer
df['date'] = pd.to_datetime(df['date'])  # Convert string to datetime
df['category'] = df['category'].astype('category')  # Convert to memory-efficient categorical
df.replace({'N/A': np.nan, '-': np.nan})  # Replace placeholder strings with NaN
df['status'].replace({'A': 'active', 'I': 'inactive'})  # Map short codes to full labels
Notes:
Categorical dtype dramatically reduces memory for low-cardinality string columns.
apply() / map()
Modifying: Apply a function across rows, columns, or every cell
Syntax:
df.apply(func) / s.map(func) / df.map(func)
Examples:
df['name'].map(str.upper)  # Uppercase a Series with a function
df['age'].apply(lambda x: x * 2)  # Apply a lambda to each value
df.apply(lambda col: col.max() - col.min())  # Apply a function to each column
df.apply(lambda row: row['a'] + row['b'], axis=1)  # Apply a function row-wise (axis=1)
df.map(lambda x: f'${x}')  # Apply to every cell (pandas 2.1+; applymap is deprecated)
Notes:
Vectorized operations (df['x'] * 2) are far faster than apply() — only use apply for complex logic.
Add / drop columns
Modifying: Create new columns or remove existing ones
Syntax:
df['new_col'] = ... / df.drop(columns=[...])
Examples:
df['total'] = df['price'] * df['quantity']  # Computed column from existing ones
df['rank'] = range(1, len(df) + 1)  # Add a sequential column
df.drop(columns=['unused'])  # Drop a single column
df.drop(columns=['col1', 'col2'], inplace=True)  # Drop multiple columns in place
df.assign(total=df['price'] * df['qty'])  # Method-chain-friendly column add
Notes:
df.assign() returns a new DataFrame — ideal for fluent method chains.
sort_values() / sort_index()
Sorting: Sort rows by column values or by the index
Syntax:
df.sort_values(by, ascending=True) / df.sort_index()
Examples:
df.sort_values('age')  # Sort ascending by age
df.sort_values('age', ascending=False)  # Sort descending by age
df.sort_values(['country', 'age'], ascending=[True, False])  # Multiple columns with mixed order
df.sort_index()  # Sort by the row index
df.nlargest(5, 'sales')  # 5 rows with the largest sales values
df.nsmallest(5, 'price')  # 5 rows with the smallest price values
Notes:
nlargest / nsmallest are faster than sort_values + head when you only need the top N.
groupby()
Grouping & Aggregation: Group rows by a column and aggregate each group
Syntax:
df.groupby(by).agg(func)
Examples:
df.groupby('country')['sales'].sum()  # Total sales per country
df.groupby('country').agg({'sales': 'sum', 'orders': 'count'})  # Different aggregations per column
df.groupby(['country', 'year'])['sales'].mean()  # Group by two columns
df.groupby('country')['sales'].agg(['sum', 'mean', 'count'])  # Multiple aggregations per group
df.groupby('country').filter(lambda g: len(g) > 10)  # Keep only groups with more than 10 rows
df.groupby('country')['sales'].transform('mean')  # Broadcast group mean back to original shape
Notes:
Pass as_index=False to keep the grouping column as a regular column rather than the index.
Aggregation functions
Grouping & Aggregation: Common reduction functions on Series and DataFrames
Syntax:
df.sum() / mean() / median() / min() / max() / std() / var() / count()
Examples:
df['sales'].sum()  # Total of a column
df.mean(numeric_only=True)  # Mean of every numeric column
df['sales'].agg(['sum', 'mean', 'std'])  # Multiple aggregations on one column
df.agg({'sales': 'sum', 'price': 'mean'})  # Different aggregations per column
df['sales'].cumsum()  # Cumulative sum (running total)
df['sales'].rank(method='dense')  # Rank values (dense — no gaps after ties)
Notes:
Pass numeric_only=True to silence warnings on DataFrames with mixed dtypes (pandas 2.x).
pivot_table() / pivot()
Grouping & Aggregation: Reshape data into a pivot summary table
Syntax:
df.pivot_table(index, columns, values, aggfunc='mean')
Examples:
df.pivot_table(index='country', columns='year', values='sales')  # Country × year matrix of mean sales
df.pivot_table(index='country', values='sales', aggfunc='sum')  # Total sales per country
df.pivot_table(index='country', columns='year', values='sales', aggfunc='sum', fill_value=0)  # Sum with NaN replaced by 0
df.pivot(index='date', columns='product', values='price')  # pivot() — reshape only, no aggregation
pd.crosstab(df['country'], df['status'])  # Cross-tabulation (frequency table)
Notes:
Use pivot_table when there are duplicates in the index/columns — pivot will raise an error.
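The note above is easy to see with a toy frame (column names here are made up for illustration): when two rows share the same index/columns pair, pivot_table aggregates them, while pivot has no aggfunc to fall back on and raises.

```python
import pandas as pd

# Two rows share the same (key, col) pair, so an aggregation decision is needed
df = pd.DataFrame({'key': ['a', 'a'], 'col': ['x', 'x'], 'val': [1, 3]})

wide = df.pivot_table(index='key', columns='col', values='val', aggfunc='mean')
print(wide.loc['a', 'x'])  # 2.0: pivot_table averaged the duplicates

pivot_failed = False
try:
    df.pivot(index='key', columns='col', values='val')
except ValueError:  # "Index contains duplicate entries, cannot reshape"
    pivot_failed = True
```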
merge()
Merging & Joining: SQL-style join of two DataFrames on key columns
Syntax:
pd.merge(left, right, on=, how='inner')
Examples:
pd.merge(df1, df2, on='id')  # Inner join on the 'id' column
pd.merge(df1, df2, on='id', how='left')  # Left join — keep all rows from df1
pd.merge(df1, df2, on='id', how='outer')  # Outer join — keep all rows from both
pd.merge(df1, df2, left_on='user_id', right_on='id')  # Different key column names
pd.merge(df1, df2, on='id', suffixes=('_left', '_right'))  # Custom suffixes for overlapping columns
Notes:
Always check the resulting row count — duplicates in either side cause unexpected row multiplication.
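That row-count check can be automated. A small sketch with toy data (the frames and column names are invented for illustration): indicator=True records each row's origin, and validate= raises instead of silently multiplying rows.

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'name': ['Ana', 'Bob', 'Cal']})
right = pd.DataFrame({'id': [2, 2, 3], 'order': ['x', 'y', 'z']})  # id=2 appears twice

# indicator=True adds a '_merge' column showing where each row came from
merged = pd.merge(left, right, on='id', how='left', indicator=True)
print(len(merged))  # 4 rows from a 3-row left frame: id=2 matched twice

# validate='one_to_one' raises MergeError when either side has duplicate keys
dup_detected = False
try:
    pd.merge(left, right, on='id', validate='one_to_one')
except pd.errors.MergeError:
    dup_detected = True
```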
concat()
Merging & Joining: Stack DataFrames vertically or horizontally
Syntax:
pd.concat(objs, axis=0)
Examples:
pd.concat([df1, df2])  # Stack vertically (more rows)
pd.concat([df1, df2], ignore_index=True)  # Reset the index after concatenation
pd.concat([df1, df2], axis=1)  # Stack horizontally (more columns)
pd.concat([df1, df2], keys=['a', 'b'])  # Add a hierarchical key per source
Notes:
concat is for stacking; merge is for joining on a key. Don't mix them up.
join()
Merging & Joining: Join DataFrames using their indexes
Syntax:
df1.join(df2, how='left')
Examples:
df1.join(df2)  # Left join on the index
df1.join(df2, how='inner')  # Inner join on the index
df1.join([df2, df3])  # Join multiple DataFrames at once
Notes:
join() is a thin wrapper around merge() that defaults to using the index. Use merge() when joining on column values.
pd.to_datetime()
Time Series: Convert strings or numbers to pandas datetime
Syntax:
pd.to_datetime(arg, format=None, errors='raise')
Examples:
df['date'] = pd.to_datetime(df['date'])  # Auto-parse a date column
pd.to_datetime(df['date'], format='%Y-%m-%d')  # Explicit format (faster; raises on bad rows)
pd.to_datetime(df['date'], errors='coerce')  # Set unparseable values to NaT instead of raising
df['ts'] = pd.to_datetime(df['epoch'], unit='s')  # Convert Unix epoch seconds
Notes:
Set the column as the index (df.set_index('date')) to unlock resample() and time-based slicing.
.dt accessor
Time Series: Access datetime components on a Series
Syntax:
s.dt.<component>
Examples:
df['date'].dt.year  # Extract the year
df['date'].dt.month_name()  # Month name (e.g. 'January')
df['date'].dt.day_name()  # Day of the week name
df['date'].dt.dayofweek  # Day of the week (0=Mon, 6=Sun)
df['date'].dt.strftime('%Y-%m')  # Format as a string
Notes:
The .dt accessor only works on datetime Series — convert with pd.to_datetime first.
resample()
Time Series: Group time-series data by a frequency and aggregate
Syntax:
df.resample(rule).agg(func)
Examples:
df.set_index('date').resample('D').sum()  # Daily totals
df.resample('W')['sales'].mean()  # Weekly mean of sales
df.resample('M').agg({'sales': 'sum', 'orders': 'count'})  # Monthly aggregation across columns
df.resample('Q').last()  # Quarter-end snapshot (last value)
df['sales'].rolling(window=7).mean()  # 7-day rolling mean (moving average)
Notes:
Resample requires a DatetimeIndex — call set_index() on your date column first.
.str accessor
String Operations: Vectorized string methods on a Series
Syntax:
s.str.<method>
Examples:
df['name'].str.lower()  # Lowercase every string
df['name'].str.strip()  # Strip leading and trailing whitespace
df['email'].str.contains('@gmail')  # Boolean mask for a substring match
df['name'].str.replace('Mr. ', '', regex=False)  # Replace a literal substring
df['phone'].str.extract(r'(\d{3})-(\d{4})')  # Extract regex capture groups into columns
df['tags'].str.split(',', expand=True)  # Split into multiple columns
Notes:
All .str methods propagate NaN untouched — no need to dropna first.
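A minimal illustration of that NaN behavior, with toy values:

```python
import pandas as pd
import numpy as np

s = pd.Series(['Alice', np.nan, 'bob'])
out = s.str.upper()        # the NaN entry passes through unchanged
print(out.tolist())
```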
set_index() / reset_index()
Index Operations: Promote a column to the index or restore the default integer index
Syntax:
df.set_index(col) / df.reset_index()
Examples:
df.set_index('id')  # Use the 'id' column as the row index
df.set_index(['country', 'year'])  # Create a MultiIndex from two columns
df.reset_index()  # Move the index back to columns and reset to integers
df.reset_index(drop=True)  # Reset the index and discard the old one
df.sort_index(level=0)  # Sort a MultiIndex by the first level
Notes:
MultiIndex DataFrames enable powerful hierarchical slicing — use df.xs(key, level=) to cross-section.
Statistical methods
Statistics: Built-in statistical and mathematical operations
Syntax:
df.corr() / cov() / quantile() / pct_change() / diff()
Examples:
df.corr()  # Pearson correlation matrix between numeric columns
df['sales'].quantile([0.25, 0.5, 0.75])  # Quartiles of a column
df['price'].pct_change()  # Percent change between consecutive rows
df['sales'].diff()  # Difference between consecutive rows
df['x'].corr(df['y'])  # Correlation between two specific columns
Notes:
df.corr(method='spearman') for rank correlation when relationships aren't linear.
df.plot()
Visualization: Quick built-in matplotlib visualizations
Syntax:
df.plot(kind='line', x=, y=)
Examples:
df.plot(x='date', y='sales')  # Line plot (default)
df.plot(kind='bar', x='country', y='sales')  # Bar chart
df['age'].plot(kind='hist', bins=20)  # Histogram
df.plot(kind='scatter', x='age', y='income')  # Scatter plot
df.boxplot(column='sales', by='country')  # Box plot grouped by country
Notes:
Run %matplotlib inline in Jupyter, or call plt.show() in scripts to display the plot.
About the Pandas Cheat Sheet
This Pandas cheat sheet is a searchable, copy-ready quick reference for the most-used operations in the pandas Python library — the de facto standard for tabular data analysis. It covers loading data, exploring it, cleaning it, reshaping it, joining it together, and producing aggregates and time-series summaries.
Every command on this page is grouped into a clear category and includes a real Python example with syntax highlighting. Whether you are wrangling a CSV in Jupyter, building an ETL pipeline, or studying for a data engineering interview, this reference helps you skip the documentation tab and find the right method in seconds.
Pandas is built on top of NumPy and provides two primary data structures — the one-dimensional Series and the two-dimensional DataFrame — alongside hundreds of methods for I/O,
selection, group-wise computation, and time-series analysis.
How to Use This Cheat Sheet
1. Search — type any keyword (e.g. groupby, merge, fillna, resample) into the search bar to instantly filter all matching commands and examples.
2. Filter by category — click a category pill (Cleaning, Grouping & Aggregation, Time Series, etc.) to narrow the list to a single topic.
3. Copy and adapt — every example is real, runnable Python. Select the snippet, paste it into your notebook, and tweak the column names to fit your DataFrame.
4. Read the notes — the note under each command captures the gotcha, performance tip, or "why this matters" that the official docs often bury.
Common Use Cases
Exploratory Data Analysis
Loading a CSV, eyeballing it with head() / info(), computing summary stats, and surfacing missing values before deciding how to clean.
Data Cleaning & Wrangling
Dropping duplicates, filling NaN, converting types, renaming columns, and applying functions to standardise messy real-world data into something analysable.
Aggregation & Reporting
Using groupby(), pivot_table(), and crosstab() to roll up granular data into summary tables for dashboards or stakeholders.
Joining Datasets
SQL-style merge() and concat() for combining customer data with orders, lookup tables with fact tables, or stacking files from many days.
Time-Series Analysis
Resampling to daily/weekly/monthly grain, computing rolling means, and extracting datetime components for seasonality and trend analysis.
ETL & Data Pipelines
Reading from CSV, Excel, Parquet, or SQL; transforming with vectorised operations; and writing to a downstream warehouse or analytics tool.
Frequently Asked Questions
What is Pandas used for?
Pandas is a Python library for analysing tabular data — anything you would put in a spreadsheet or SQL table. It is widely used for data cleaning, exploration, transformation, time-series analysis, and as the input layer for machine learning libraries like scikit-learn. If your data has rows and columns, pandas is probably the fastest way to work with it in Python.
What is the difference between a Series and a DataFrame?
A Series is a one-dimensional labelled array — essentially a single column with an index.
A DataFrame is a two-dimensional structure with rows and columns,
where every column is a Series. Selecting a single column from a DataFrame returns a Series; selecting multiple columns returns a DataFrame.
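A two-line sketch of that distinction, using an invented toy frame: single brackets return a Series, double brackets return a one-column DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ana', 'Bob'], 'age': [25, 30]})  # toy data

col = df['name']     # single brackets  -> Series
sub = df[['name']]   # double brackets  -> one-column DataFrame
print(type(col).__name__, type(sub).__name__)
```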
Should I use loc or iloc?
Use .loc[] when you want to select by label
(column names or index labels). Use .iloc[] when you want to select by integer position.
One key gotcha: .loc slices are inclusive on both ends, while .iloc follows standard Python slicing (end is exclusive).
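The inclusive/exclusive gotcha in one runnable sketch (toy data; any frame with a default integer index behaves the same):

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40, 50]})  # default integer index 0..4

print(len(df.loc[0:2]))   # 3 rows: label slice includes the end label
print(len(df.iloc[0:2]))  # 2 rows: position slice excludes the end position
```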
How do I handle missing values in Pandas?
Detect missing data with df.isna().sum(), drop it with df.dropna(), or fill it with df.fillna(value). Common imputation strategies include filling with a constant (0 or "Unknown"), the column mean/median, or a forward fill (df.ffill()) for time-series data. Always investigate why values are missing before deciding which strategy to use.
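The detect / impute / drop workflow can be sketched on invented toy data (the column names are illustrative only):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [25, np.nan, 30, np.nan],
                   'city': ['NYC', 'LA', None, 'SF']})  # toy data

print(df.isna().sum().to_dict())                # missing count per column
df['age'] = df['age'].fillna(df['age'].mean())  # impute with the column mean
df = df.dropna(subset=['city'])                 # drop rows missing city only
print(len(df))                                  # 3 rows remain
```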
When should I use apply() vs vectorised operations?
Always prefer vectorised operations like df['x'] * 2 or df['a'] + df['b'] — they run in optimised C code and are typically
10–100× faster than apply(). Reserve apply() for genuinely complex per-row logic that cannot be expressed in vectorised form. The .str and .dt accessors are also vectorised and should be used over apply.
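A small sketch (toy data) showing that the two forms produce identical results, which is exactly why the vectorized one should win by default:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})  # toy data

looped = df.apply(lambda row: row['a'] + row['b'], axis=1)  # Python-level loop
vector = df['a'] + df['b']                                  # runs in optimized C

print(looped.equals(vector))  # identical results, very different speed
```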
What is the difference between merge, join, and concat?
pd.merge() performs SQL-style joins on key columns and is the most flexible. df.join() is a thin wrapper around merge that defaults to joining on the index. pd.concat() stacks DataFrames end-to-end (vertically by default, horizontally with axis=1)
without any key matching — use it for combining files of the same shape.
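The key-matching vs stacking distinction in a toy example (the users/orders frames are invented for illustration):

```python
import pandas as pd

users = pd.DataFrame({'id': [1, 2], 'name': ['Ana', 'Bob']})    # toy data
orders = pd.DataFrame({'id': [1, 1, 2], 'total': [10, 20, 30]})

joined = pd.merge(users, orders, on='id')               # key matching: 3 rows
stacked = pd.concat([users, users], ignore_index=True)  # pure stacking: 4 rows
print(len(joined), len(stacked))
```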
Why is my Pandas code slow on large datasets?
Common causes: using apply() or iterrows() instead of vectorised ops; not setting the right dtypes (use 'category' for low-cardinality strings);
reading CSV when Parquet would be much faster; or chaining many intermediate DataFrames that copy memory.
For datasets larger than memory, consider Polars, Dask, or DuckDB as drop-in-ish alternatives.
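The dtype point above is easy to measure. A sketch with synthetic data: a low-cardinality string column stored as object vs as category.

```python
import pandas as pd
import numpy as np

# 100k rows but only 3 distinct values: a classic low-cardinality column
s = pd.Series(np.random.choice(['NYC', 'LA', 'SF'], size=100_000))

obj_bytes = s.memory_usage(deep=True)
cat_bytes = s.astype('category').memory_usage(deep=True)
print(obj_bytes, cat_bytes)  # the categorical copy is far smaller
```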
Is this cheat sheet up to date with Pandas 2.x?
Yes. The examples are written for modern pandas (2.x) — including the numeric_only requirement on aggregations over mixed-dtype DataFrames, copy_on_write-friendly patterns,
and the recommendation to use pd.to_datetime() with an explicit format= argument for speed.