Correlation – explained with Python


When you plot two variables, you see data dots scattered across the plane. Their overall tilt and shape tell you how the variables move together. Correlation turns that visual impression into a single number you can report and compare.

What correlation measures

Correlation summarises the direction and strength of association between two numeric variables on a scale from −1 to +1.

  • Sign shows direction
    • positive – larger x tends to come with larger y
    • negative – larger x tends to come with smaller y
  • Magnitude shows strength
    • near 0 – weak association
    • near 1 in size – strong association

Correlation does not prove causation.
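To make sign and magnitude concrete, here is a minimal sketch that computes the coefficient on synthetic data with NumPy's `np.corrcoef`. The data, the seed, and the helper name `pearson_r` are illustrative assumptions and are not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
x = rng.normal(size=500)
noise = rng.normal(size=500)


def pearson_r(a, b):
    """Correlation coefficient read from NumPy's 2x2 correlation matrix."""
    return float(np.corrcoef(a, b)[0, 1])


r_strong_pos = pearson_r(x, 2 * x + 0.3 * noise)   # tight upward tilt
r_strong_neg = pearson_r(x, -2 * x + 0.3 * noise)  # tight downward tilt
r_weak = pearson_r(x, noise)                       # two unrelated variables

print(f"strong positive: {r_strong_pos:+.2f}")
print(f"strong negative: {r_strong_neg:+.2f}")
print(f"weak:            {r_weak:+.2f}")
```

The first two values should land close to +1 and −1, and the third close to 0.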

Two methods to measure correlation

Pearson correlation – distance based

Pearson asks: how straight is the tilt of the data dots? It uses actual distances from a straight line, so it is excellent for line-like patterns and sensitive to outliers. Use when:

  • you expect a roughly straight relationship
  • the size of the differences between values matters, not just their order
  • residuals look symmetric around a line
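One common way to write Pearson's r is the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The sketch below computes it that way and compares the result with `np.corrcoef`. The function name `pearson_from_formula` and the mileage and price figures are made up for illustration.

```python
import numpy as np


def pearson_from_formula(x, y):
    """Pearson r computed directly from deviations around the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # distance of each x from the mean of x
    dy = y - y.mean()  # distance of each y from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Hypothetical cars: mileage in thousands of km, price in thousands
mileage = np.array([10, 20, 30, 40, 50, 60], dtype=float)
price = np.array([30, 27, 25, 21, 18, 16], dtype=float)

r_manual = pearson_from_formula(mileage, price)
r_numpy = float(np.corrcoef(mileage, price)[0, 1])

print(f"formula: {r_manual:.4f}")
print(f"numpy:   {r_numpy:.4f}")
```

Both numbers should agree, and the value is strongly negative because price falls almost in a straight line as mileage grows. In practice `scipy.stats.pearsonr` returns the coefficient together with a p-value in one call.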

Spearman correlation – rank based

Spearman converts each variable to ranks (1st, 2nd, 3rd, …) and then computes Pearson on those ranks. It measures monotonic association: do higher x values tend to come with higher y values overall, even if the shape is curved.

Ranks ignore distances and care only about order, which gives two benefits:

  • robust to outliers and awkward measurement scales
  • invariant to any strictly increasing transform (log, sqrt, min-max scaling), since order does not change

Use when:

  • you expect a consistent up or down trend that may be curved
  • the data are ordinal or have many ties
  • outliers are a concern
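The description of Spearman as "Pearson on ranks" can be sketched in a few lines. The helpers `average_ranks` and `spearman_rho` are hypothetical names written for this example, and the cubic data is invented. Tied values are given the mean of the ranks they occupy, which is the usual convention.

```python
import numpy as np


def average_ranks(values):
    """Rank from 1 (smallest) upward; tied values share their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson computed on the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


x = np.arange(1, 11, dtype=float)
y = x ** 3  # curved, but always increasing

pearson_xy = float(np.corrcoef(x, y)[0, 1])
rho_xy = spearman_rho(x, y)
# Strictly increasing transforms keep the order, so the ranks do not change
rho_transformed = spearman_rho(np.log(x), np.sqrt(y))

print(f"Pearson:                {pearson_xy:.3f}")
print(f"Spearman:               {rho_xy:.3f}")
print(f"Spearman after log/sqrt: {rho_transformed:.3f}")
```

Spearman is at its maximum for this curve and is unaffected by the transforms, while Pearson falls short of 1 because the points do not sit on a straight line. In practice `scipy.stats.spearmanr` does the ranking and the correlation in one call.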

r and p in plain language

  • r is the correlation coefficient. It is your effect size on the −1 to +1 scale.
  • p answers: if there were truly no association, how often would we see an r at least this large in magnitude just by random chance.

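Small p flags statistical signal. It is not a measure of importance. By convention, a result with p above 0.05 is treated as not statistically significant: the data do not give enough evidence of an association, which is not the same as showing that there is none.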
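The definition of p can be turned into a small simulation. Shuffling y breaks any real link with x, so the share of shuffles that produce an |r| at least as large as the observed one approximates p. This is a permutation-test sketch, not the analytic p-value that library functions report; the function name `permutation_p_value`, the sample size, and the synthetic data are assumptions for illustration.

```python
import numpy as np


def permutation_p_value(x, y, n_permutations=5000, seed=0):
    """Share of shuffles whose |r| is at least as large as the observed |r|."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(np.corrcoef(x, y)[0, 1])
    hits = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(y)  # "no association" world
        if abs(np.corrcoef(x, shuffled)[0, 1]) >= observed:
            hits += 1
    # The +1 counts the observed arrangement itself, so p is never exactly 0
    return (hits + 1) / (n_permutations + 1)


rng = np.random.default_rng(1)
x = rng.normal(size=40)
y_related = x + 0.5 * rng.normal(size=40)  # built to depend on x
y_unrelated = rng.normal(size=40)          # built independently of x

p_related = permutation_p_value(x, y_related)
p_unrelated = permutation_p_value(x, y_unrelated)

print(f"related:   p = {p_related:.4f}")
print(f"unrelated: p = {p_unrelated:.4f}")
```

The related pair should give a very small p, because almost no shuffle reproduces such a large r by chance.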

When Pearson and Spearman disagree

  • Curved but monotonic (for example price vs horsepower with diminishing returns)
    Spearman stays high because order increases consistently. Pearson is smaller because a straight line underfits the curve.

  • Outliers (for example a 10-year-old exotic car priced very high)
    Pearson can jump because distances change a lot. Spearman changes less because rank order barely changes.
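Both cases can be reproduced on invented data. The sketch below assumes there are no tied values, so a simple double `argsort` is enough for ranking; the helper names and all numbers are hypothetical and are not taken from the notebook.

```python
import numpy as np


def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])


def spearman_no_ties(x, y):
    """Pearson on ranks; valid only when no value is repeated."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)


# Case 1: curved but monotonic (diminishing returns)
horsepower = np.linspace(0, 10, 30)  # arbitrary units
price = 1 - np.exp(-horsepower)      # rises quickly, then flattens
curve_pearson = pearson(horsepower, price)
curve_spearman = spearman_no_ties(horsepower, price)

# Case 2: a clean downward line, then one old but very expensive car
age = np.linspace(0.5, 9.5, 30)
value = 30 - 2 * age
age_out = np.append(age, 10.0)       # the 10-year-old exotic car
value_out = np.append(value, 200.0)  # priced far above everything else

pearson_before = pearson(age, value)
pearson_after = pearson(age_out, value_out)
spearman_before = spearman_no_ties(age, value)
spearman_after = spearman_no_ties(age_out, value_out)

print(f"curve:   Pearson {curve_pearson:.2f}, Spearman {curve_spearman:.2f}")
print(f"outlier: Pearson {pearson_before:.2f} -> {pearson_after:.2f}")
print(f"outlier: Spearman {spearman_before:.2f} -> {spearman_after:.2f}")
```

On this data a single outlier moves Pearson much further than Spearman, and Spearman still reports a clearly negative trend.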

https://www.youtube.com/watch?v=IdffxjPdNJY

The Jupyter Notebook with the code from the video above is on GitHub.

Enjoy it! 🙂


