Source: 數據STUDIO

This article is about 1,000 words; suggested reading time is 6 minutes.

This article summarizes several ways to compute a correlation matrix in Python.

A correlation matrix is a basic tool of data analysis. It shows how different variables relate to one another. Python offers many ways to compute a correlation matrix, and this article summarizes them.

Pandas

A Pandas DataFrame can create a correlation matrix directly with its corr method. Since most people in data science already use Pandas to load their data, this is usually one of the quickest and simplest ways to check correlations in the data.

    import pandas as pd
    import seaborn as sns

    data = sns.load_dataset('mpg')
    # numeric_only=True skips the text columns ('origin', 'name')
    correlation_matrix = data.corr(numeric_only=True)
    correlation_matrix
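To make it clearer what corr returns, here is a minimal sketch that recomputes one Pearson coefficient by hand and compares it with the Pandas result. The dataset `toy` and the helper `pearson_by_hand` are made up for illustration and are not part of the original article. The sketch also shows the `method` argument, which switches to a rank-based (Spearman) correlation.

```python
import numpy as np
import pandas as pd

# Small synthetic dataset (an assumption for illustration, not the mpg data)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "y": 2.0 * x + rng.normal(scale=0.5, size=200),  # strongly related to x
    "z": rng.normal(size=200),                        # unrelated noise
})


def pearson_by_hand(a, b):
    """Pearson r: sum of centered products over the root of the
    product of the centered sums of squares."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum())


pearson_matrix = toy.corr()                    # Pearson is the default method
spearman_matrix = toy.corr(method="spearman")  # rank-based alternative

r_xy = pearson_by_hand(toy["x"], toy["y"])
print(r_xy, pearson_matrix.loc["x", "y"])      # the two values should agree
```

The matrix is symmetric with ones on the diagonal, so only the entries below (or above) the diagonal carry information.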

If you work in statistics or analytics, you may be asking "Where are the p-values?" We cover that at the end.

Numpy

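Numpy also includes a function for computing the correlation matrix, which we can call directly. Because it returns an ndarray, though, the result is not as readable as with pandas.

    import numpy as np
    from sklearn.datasets import load_iris

    iris = load_iris()
    # rowvar=False treats each column (feature) as a variable, giving a 4x4 matrix;
    # the default (rowvar=True) would correlate the 150 samples with one another
    np.corrcoef(iris["data"], rowvar=False)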
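One detail is easy to miss with np.corrcoef: by default it treats each row as a variable and each column as an observation, which is the opposite of the usual "one row per sample" layout. The sketch below uses random data with the same shape as the iris feature table (an assumption for illustration) to show how the `rowvar` argument changes the result.

```python
import numpy as np

rng = np.random.default_rng(42)
samples = rng.normal(size=(150, 4))  # 150 observations (rows), 4 features (columns)

by_rows = np.corrcoef(samples)                   # default rowvar=True: rows are variables
by_columns = np.corrcoef(samples, rowvar=False)  # columns are variables
transposed = np.corrcoef(samples.T)              # same result as rowvar=False

print(by_rows.shape)     # (150, 150)
print(by_columns.shape)  # (4, 4)
```

For a feature-by-feature matrix, pass rowvar=False or transpose the array first.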

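For a better visualization, we can pass the matrix directly to the sns.heatmap() function.

    import seaborn as sns

    data = sns.load_dataset('mpg')
    # numeric_only=True skips the text columns ('origin', 'name')
    correlation_matrix = data.corr(numeric_only=True)

    sns.heatmap(correlation_matrix,
                annot=True,
                cmap='coolwarm')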

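The annot=True argument writes the correlation value inside each cell. A common hack is to call sns.set_context('talk') to get more readable output.

This setting is intended for images used in slide presentations, and it makes the plot easier to read (larger fonts).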
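Because a correlation matrix is symmetric, half of the heatmap repeats the other half. A common refinement, not covered in the original article, is to hide the upper triangle. The sketch below builds the boolean mask with NumPy on made-up data; the names `toy`, `mask`, and `lower_only` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for a real dataset
rng = np.random.default_rng(3)
toy = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
corr = toy.corr()

# True above the diagonal (k=1 keeps the diagonal itself visible)
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Preview of what stays visible: masked cells become NaN
lower_only = corr.mask(mask)
print(lower_only.round(2))
```

The same `mask` array can then be passed to the heatmap call as sns.heatmap(corr, mask=mask, annot=True, cmap='coolwarm'), since sns.heatmap leaves cells blank wherever the mask is True.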

Statsmodels

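The statistical analysis library Statsmodels can of course do this too:

    import statsmodels.api as sm

    correlation_matrix = data.corr(numeric_only=True)
    # plot_corr draws the matrix and returns a matplotlib Figure;
    # the axis labels must match the columns of the matrix
    fig = sm.graphics.plot_corr(
        correlation_matrix,
        xnames=correlation_matrix.columns.tolist())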

plotly

Note that by default plotly draws the diagonal of 1.0 values running from bottom-left to top-right. This is the opposite of most other tools, so take particular care if you use plotly.

    import plotly.offline as pyo
    pyo.init_notebook_mode(connected=True)

    import plotly.figure_factory as ff

    # numeric_only=True skips the text columns ('origin', 'name')
    correlation_matrix = data.corr(numeric_only=True)

    fig = ff.create_annotated_heatmap(
        z=correlation_matrix.values,
        x=list(correlation_matrix.columns),
        y=list(correlation_matrix.index),
        colorscale='Blues')

    fig.show()
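If you want the plotly diagonal to run from top-left to bottom-right like the other tools, one option is to reverse the row order before plotting, since plotly draws the first row at the bottom of the y-axis. This is a sketch on made-up data, using only Pandas; the names `toy` and `flipped` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
corr = toy.corr()

# Reverse the row order only; the column order stays the same
flipped = corr.iloc[::-1]

z = flipped.values                # values to pass as z=
x_labels = list(flipped.columns)  # labels to pass as x=
y_labels = list(flipped.index)    # labels to pass as y=

print(x_labels)  # ['a', 'b', 'c']
print(y_labels)  # ['c', 'b', 'a']
```

Passing `z`, `x_labels`, and `y_labels` to ff.create_annotated_heatmap in place of the unflipped values puts the first variable in the top row.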

Pandas + Matplotlib for better visualization

A scatter-plot matrix can also be drawn directly with sns.pairplot(data). The two approaches produce similar figures, but seaborn needs only one line:

    sns.pairplot(data[['mpg', 'weight', 'horsepower', 'acceleration']])

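Here is how to do the same with Pandas and Matplotlib:

    import matplotlib.pyplot as plt
    import pandas as pd

    # Scatter plots for every pair of numeric columns, histograms on the diagonal
    pd.plotting.scatter_matrix(
        data, alpha=0.2,
        figsize=(6, 6),
        diagonal='hist')

    plt.show()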

p-values for correlations

Many other tools (SPSS, Stata, R, SAS, etc.) produce a simple matrix with p-values by default. How do you get one in Python?

This is where the scientific computing library scipy comes in. The following function implements it:

    from scipy.stats import pearsonr
    import pandas as pd
    import seaborn as sns

    def corr_full(df, numeric_only=True, rows=['corr', 'p-value', 'obs']):
        """
        Generates a correlation matrix with correlation coefficients,
        p-values, and observation count.

        Args:
        - df:                  Input dataframe
        - numeric_only (bool): Whether to consider only numeric columns for
                               correlation. Default is True.
        - rows:                Determines the information to show.
                               Default is ['corr', 'p-value', 'obs'].

        Returns:
        - formatted_table: The correlation matrix with the specified rows.
        """
        # Calculate Pearson correlation coefficients
        corr_matrix = df.corr(numeric_only=numeric_only)

        # Calculate the p-values using scipy's pearsonr
        pvalue_matrix = df.corr(
            numeric_only=numeric_only,
            method=lambda x, y: pearsonr(x, y)[1])

        # Count, for each pair of columns, the rows where both are non-null
        # (the same pairwise-complete rows that df.corr works with)
        not_null = df[corr_matrix.columns].notnull().astype(int)
        obs_matrix = not_null.T.dot(not_null)

        # Create a multi-index dataframe to store the formatted correlations
        formatted_table = pd.DataFrame(
            index=pd.MultiIndex.from_product([corr_matrix.columns, rows]),
            columns=corr_matrix.columns
        )

        # Assign values to the appropriate cells in the formatted table
        for col1 in corr_matrix.columns:
            for col2 in corr_matrix.columns:
                if 'corr' in rows:
                    formatted_table.loc[
                        (col1, 'corr'), col2] = corr_matrix.loc[col1, col2]

                if 'p-value' in rows:
                    # Skip p-values on the diagonal: a column correlates
                    # perfectly with itself
                    if col1 != col2:
                        formatted_table.loc[
                            (col1, 'p-value'), col2] = f"({pvalue_matrix.loc[col1, col2]:.4f})"

                if 'obs' in rows:
                    formatted_table.loc[
                        (col1, 'obs'), col2] = obs_matrix.loc[col1, col2]

        return (formatted_table.fillna('')
                .style.set_properties(**{'text-align': 'center'}))
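To see where such a p-value comes from, the sketch below recomputes it for a single pair of variables. It assumes the usual two-sided test of "no correlation", under which t = r * sqrt((n - 2) / (1 - r^2)) follows a t distribution with n - 2 degrees of freedom. The data and variable names are made up for illustration, and the result is compared with scipy's pearsonr.

```python
import numpy as np
from scipy import stats

# Synthetic pair of weakly related variables (an assumption for illustration)
rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)

r, p_scipy = stats.pearsonr(x, y)

# t statistic with n - 2 degrees of freedom, then the two-sided tail probability
t_stat = r * np.sqrt((n - 2) / (1.0 - r ** 2))
p_manual = 2.0 * stats.t.sf(abs(t_stat), df=n - 2)

print(p_scipy, p_manual)  # the two values should agree closely
```

The formula shows why the observation count matters: the same r gives a smaller p-value as n grows.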

Calling this function directly returns the following result:

    df = sns.load_dataset('mpg')
    result = corr_full(df, rows=['corr', 'p-value'])
    result
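The call above leaves out the 'obs' row. When columns have missing values in different places, the number of observations behind a correlation is the number of rows where both columns are present, which can be smaller than either column's own count. A minimal sketch on made-up data (the frame `toy` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "b": [2.0, np.nan, 3.0, 5.0, 4.0, 7.0],
    "c": [1.0, 3.0, 2.0, 5.0, 4.0, 6.0],
})

present = toy.notna().astype(int)     # 1 where a value exists, 0 where it is missing
pair_counts = present.T.dot(present)  # entry (i, j): rows where both i and j exist

per_column = toy.notna().sum()        # a: 5, b: 5, c: 6
print(pair_counts.loc["a", "b"])      # 4, smaller than min(5, 5)
```

Here columns a and b each have five values, but only four rows contain both, so the correlation between them rests on four observations.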

Summary

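We have covered several ways to create a correlation matrix in Python; pick whichever is most convenient. The default output of most Python tools does not include p-values or observation counts, so if you need those statistics, you can use the corr_full function from the final section. For a thorough and complete correlation analysis, p-values and observation counts are very helpful references.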

Editor: 黃繼彥

6 Ways to Create a Correlation Matrix in Python