Seaborn Visualization
import seaborn as sns
Seaborn is a Python data visualization library that provides simple code to create elegant visualizations for statistical exploration and insight. Seaborn is based on Matplotlib, but improves on Matplotlib in several ways:
- Seaborn provides a more visually appealing plotting style and concise syntax.
- Seaborn natively understands Pandas DataFrames, making it easier to plot data directly from CSVs.
- Seaborn can easily summarize Pandas DataFrames with many rows of data into aggregated charts.
If you’re unfamiliar with Pandas, just know that Pandas is a data analysis library for Python that provides easy-to-use data structures and allows you to organize and manipulate datasets so they can be visualized. To fully leverage the power of Seaborn, it is best to prepare your data using Pandas.
Over the next few exercises, we will explain how Seaborn relates to Pandas and how we can transform massive datasets into easily understandable graphics.
Basic Syntax
For Barplot
By default, Seaborn uses something called a bootstrapped confidence interval. Roughly speaking, this interval means that “based on this data, 95% of similar situations would have an outcome within this range”.
In our gradebook example, the confidence interval for the assignments means “if we gave this assignment to many, many students, we’re confident that the mean score on the assignment would be within the range represented by the error bar”.
The confidence interval is a nice error bar measurement because it is defined for different types of aggregate functions, such as medians and mode, in addition to means.
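To make the idea of a bootstrapped confidence interval more concrete, here is a minimal sketch of a percentile bootstrap written with NumPy. It illustrates the general technique and is not Seaborn's internal code; the function name bootstrap_ci, its parameters, and the grades are hypothetical.

```python
import numpy as np

def bootstrap_ci(values, estimator=np.mean, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap confidence interval for estimator(values)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    # Resample the data with replacement n_boot times.
    samples = rng.choice(values, size=(n_boot, values.size), replace=True)
    # Apply the aggregate function to every resample.
    stats = np.array([estimator(sample) for sample in samples])
    # Keep the central `level` percent of the resampled statistics.
    tail = (100 - level) / 2
    low, high = np.percentile(stats, [tail, 100 - tail])
    return low, high

grades = [72, 85, 90, 64, 88, 79, 95, 70]  # hypothetical assignment scores
low, high = bootstrap_ci(grades)
print(f"mean = {np.mean(grades):.1f}, 95% interval = ({low:.1f}, {high:.1f})")
```

Because the estimator is just a function argument, the same sketch works for np.median, which is why this kind of error bar is not tied to the mean.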
If you’re calculating a mean and would prefer to use standard deviation for your error bars, you can pass the keyword argument ci="sd" to sns.barplot(), which makes each error bar represent one standard deviation. (Seaborn 0.12 and later deprecate ci in favor of errorbar="sd".) It would look like this:
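sns.barplot(data=gradebook, x="name", y="grade", ci="sd")
By default, sns.barplot() aggregates with the mean, but the estimator keyword accepts other aggregate functions. For example, to calculate the median, you can pass np.median (with NumPy imported as np) to the estimator keyword: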
sns.barplot(data=df,
            x="x-values",
            y="y-values",
            estimator=np.median)
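As a rough check of what the estimator keyword does, the sketch below draws a median bar plot and compares the bar heights with the same aggregation done in Pandas. The gradebook data and column names are made up for illustration, and reading heights from ax.patches is just one convenient way to inspect the chart.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical gradebook with three scores per assignment.
gradebook = pd.DataFrame({
    "assignment": ["quiz", "quiz", "quiz", "exam", "exam", "exam"],
    "grade": [70, 80, 100, 50, 60, 95],
})

# Each bar's height is np.median applied to the grades of one assignment.
ax = sns.barplot(data=gradebook, x="assignment", y="grade", estimator=np.median)
bar_heights = sorted(patch.get_height() for patch in ax.patches)

# The same aggregation done directly in Pandas, for comparison.
pandas_medians = sorted(gradebook.groupby("assignment")["grade"].median())

print(bar_heights)
print(pandas_medians)
plt.show()
```

Both printed lists should hold the same two medians, one per assignment.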
Sometimes we’ll want to aggregate our data by multiple columns to visualize nested categorical variables.
For example, consider our hospital survey data. The mean satisfaction seems to depend on Gender, but it might also depend on another column: Age Range.
We can compare both the Gender and Age Range factors at once by using the keyword hue.
sns.barplot(data=df,
x="Gender",
y="Response",
hue="Age Range")
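To see which numbers a chart with hue represents, it can help to compute the nested means directly in Pandas. This is only a sketch: the survey rows are invented, and only the column names follow the example above.

```python
import pandas as pd

# Hypothetical survey rows; the column names match the barplot call above.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Age Range": ["18-30", "31-50", "18-30", "31-50", "18-30", "31-50"],
    "Response": [7, 5, 8, 6, 9, 4],
})

# One mean per (Gender, Age Range) pair: the value each bar would show.
nested_means = (
    df.groupby(["Gender", "Age Range"])["Response"]
    .mean()
    .unstack("Age Range")
)
print(nested_means)
```

Each row of the resulting table corresponds to one group on the x-axis, and each column corresponds to one hue color within that group.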
To review the Seaborn workflow:
1. Ingest data from a CSV file into a Pandas DataFrame.
df = pd.read_csv('file_name.csv')
2. Call sns.barplot() with the desired values for x and y, and set data equal to your DataFrame.
sns.barplot(data=df, x='X-Values', y='Y-Values')
3. Set desired values for the estimator and hue parameters.
sns.barplot(data=df, x='X-Values', y='Y-Values', estimator=len, hue='Value')
4. Render the plot using plt.show().
plt.show()
Seaborn also offers several plots for visualizing distributions:
- KDE plot - Kernel density estimator; shows a smoothed version of the dataset. Use sns.kdeplot().
- Box plot - A classic statistical model that shows the median, interquartile range, and outliers. Use sns.boxplot().
- Violin plot - A combination of a KDE and a box plot. Good for showing multiple distributions at a time. Use sns.violinplot().