Correlation – explained with Python – Useful code

When you plot two variables, you see data dots scattered across the plane. Their overall tilt and shape tell you how the variables move together. Correlation turns that visual impression into a single number you can report and compare.

What correlation measures

Correlation summarises the direction and strength of association between two numeric variables on a scale from −1 to +1.

Sign shows direction
- positive – larger x tends to come with larger y
- negative – larger x tends to come with smaller y
Magnitude shows strength
- near 0 – weak association
- near 1 in size – strong association

Correlation does not prove causation.

Two methods to measure correlation

Pearson correlation – distance based

Pearson asks: how straight is the tilt of the data dots? It uses actual distances from a straight line, so it is excellent for line-like patterns and sensitive to outliers. Use when:

- you expect a roughly straight relationship
- units and distances matter
- residuals look symmetric around a line

Spearman correlation – rank based

Spearman converts each variable to ranks (1st, 2nd, 3rd, …) and then computes Pearson on those ranks. It measures monotonic association: do higher x values tend to come with higher y values overall, even if the shape is curved?

Ranks ignore distances and care only about order, which gives two benefits:

- robust to outliers and arbitrary units
- invariant to any strictly increasing transform (log, sqrt, min-max scaling), since order does not change; a decreasing transform reverses the order and flips the sign

Use when:

- you expect a consistent up or down trend that may be curved
- the data are ordinal or have many ties
- outliers are a concern

r and p in plain language

r is the correlation coefficient. It is your effect size on the −1 to +1 scale.
p answers: if there were truly no association, how often would we see an r at least this large in magnitude just by random chance?

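A small p flags a statistical signal. It is not a measure of importance. By convention, a p above 0.05 is reported as not statistically significant: the data are inconclusive, which is not the same as showing that there is no association.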

When Pearson and Spearman disagree

- Curved but monotonic (for example, price vs horsepower with diminishing returns): Spearman stays high because the order increases consistently. Pearson is smaller because a straight line underfits the curve.
- Outliers (for example, a 10-year-old exotic car priced very high): Pearson can jump because distances change a lot. Spearman changes less because the rank order barely changes.

https://www.youtube.com/watch?v=IdffxjPdNJY

Jupyter Notebook in GitHub with code from the video above.

Enjoy it! 🙂
