Make Pandas 3 Times Faster with PyPolars
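[51CTO.com translation] Pandas is one of the most important Python packages data scientists use to work with data. The library is mainly used for data exploration and visualization, and it ships with a large number of built-in functions. Pandas struggles with large datasets, however, because it cannot scale or distribute its work across all the cores of the CPU.
To speed up computation, you can put every CPU core to work. Several open-source libraries, including Dask, Vaex, Modin, Pandarallel, and PyPolars, parallelize computation across multiple cores. This article discusses the implementation and usage of the PyPolars library and compares its performance with that of Pandas.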
What is PyPolars?
PyPolars is an open-source Python dataframe library similar to Pandas. It uses all available CPU cores, so it processes computations faster than Pandas does. Its API resembles the Pandas API. It is written in Rust with a Python wrapper.
Ideally, use PyPolars when the data is too large for Pandas and too small for Spark.
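How does PyPolars work?
The PyPolars library has two APIs: an Eager API and a Lazy API. The Eager API is very similar to Pandas: each operation runs immediately and returns its result as soon as it completes. The Lazy API is very similar to Spark: running a query first builds a plan (a map of the work to be done), and that plan is then executed in parallel across all CPU cores.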
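The difference between the two styles can be sketched with a toy example. This is an illustration only and not the PyPolars API: the `EagerTable` and `LazyTable` classes, their methods, and the sample rows are hypothetical, and the toy runs on a single core. It only shows that the eager style computes at every step, while the lazy style records a plan and computes when `collect()` is called.

```python
# Toy illustration only: hypothetical classes, NOT the PyPolars API.

class EagerTable:
    """Runs every operation immediately, like the eager style."""

    def __init__(self, rows):
        self.rows = list(rows)

    def filter(self, predicate):
        # The filtering work happens right here.
        return EagerTable(r for r in self.rows if predicate(r))

    def select(self, column):
        return [r[column] for r in self.rows]


class LazyTable:
    """Records operations in a plan and runs them only on collect()."""

    def __init__(self, rows, plan=()):
        self.rows = rows
        self.plan = tuple(plan)

    def filter(self, predicate):
        # Nothing is computed yet; the step is appended to the plan.
        return LazyTable(self.rows, self.plan + (("filter", predicate),))

    def select(self, column):
        return LazyTable(self.rows, self.plan + (("select", column),))

    def collect(self):
        # Execute the recorded plan step by step.
        result = list(self.rows)
        for op, arg in self.plan:
            if op == "filter":
                result = [r for r in result if arg(r)]
            elif op == "select":
                result = [r[arg] for r in result]
        return result


rows = [{"PAID_AMT": 120.0}, {"PAID_AMT": 730.5}, {"PAID_AMT": 980.0}]

eager_result = EagerTable(rows).filter(lambda r: r["PAID_AMT"] > 500).select("PAID_AMT")

query = LazyTable(rows).filter(lambda r: r["PAID_AMT"] > 500).select("PAID_AMT")
lazy_result = query.collect()  # the plan runs only at this point
```

Because the whole plan is known before anything runs, a real lazy engine is free to reorder or parallelize the steps; the toy above does neither.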
Figure 1. PyPolars API
PyPolars is essentially the Python binding to the Polars library. Its main convenience is that the API is similar to the Pandas API, which makes it easier for developers to pick up.
Installation:
PyPolars can be installed from PyPI with the following command:
- pip install py-polars
The library is then imported with:
- import pypolars as pl
Benchmark timings:
For the demonstration, I used a large dataset (~6.4 GB) containing 25 million instances.
Figure 2. Benchmark times for basic operations in Pandas and Py-Polars
From the benchmark times above for some basic operations, we can observe that PyPolars is roughly 2 to 3 times faster than Pandas on most of them.
PyPolars has an API very similar to that of Pandas, but it does not yet cover every Pandas function. For example, PyPolars has no .describe() function; instead we can use df_pypolars.to_pandas().describe().
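The fallback described above can be wrapped in a small helper. This is a minimal sketch, assuming only that the object either has a `describe()` method or offers `to_pandas()`; the helper name `describe_with_fallback` and the two stand-in classes are hypothetical and exist only so the example runs without either library installed.

```python
def describe_with_fallback(df):
    """Call df.describe() if it exists; otherwise convert with to_pandas() first."""
    native = getattr(df, "describe", None)
    if callable(native):
        return native()
    # No native describe(): pay for a conversion, then describe the result.
    return df.to_pandas().describe()


# Hypothetical stand-ins for a Pandas-like and a describe-less dataframe.
class PandasLike:
    def describe(self):
        return "summary statistics"


class NoDescribe:
    def to_pandas(self):
        return PandasLike()


print(describe_with_fallback(PandasLike()))
print(describe_with_fallback(NoDescribe()))
```

The conversion is not free. In the timings recorded later in this article, the converted call took about 44.3 s against 15.5 s for the Pandas call.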
Usage:
- import pandas as pd
- import numpy as np
- import pypolars as pl
- import time
- WARNING!
- py-polars was renamed to polars, please install polars!
- https://pypi.org/project/polars/
- path = "data.csv"
Reading the data:
- s = time.time()
- df_pandas = pd.read_csv(path)
- e = time.time()
- pd_time = e - s
- print("Pandas Loading Time = {}".format(pd_time))
- C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3071: DtypeWarning: Columns (2,7,14) have mixed types.Specify dtype option on import or set low_memory=False.
- has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
- Pandas Loading Time = 217.1734380722046
- s = time.time()
- df_pypolars = pl.read_csv(path)
- e = time.time()
- pl_time = e - s
- print("PyPolars Loading Time = {}".format(pl_time))
- PyPolars Loading Time = 114.0408570766449
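The listings in this article repeat the same `s = time.time()` / `e = time.time()` pattern around each operation. As a possible alternative, the pattern can be folded into a context manager. This is a sketch, not part of the original benchmark: the `timed` helper and the `timings` dictionary are hypothetical names, and `time.perf_counter()` is swapped in for `time.time()` because it is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Time the enclosed block, store the seconds in results[label], and print them."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start
        print("{} Time = {}".format(label, results[label]))


timings = {}

# Any operation can go inside the block; a plain sum stands in for a dataframe call.
with timed("Sum", timings):
    total = sum(range(1_000_000))
```

With this helper, a benchmark step becomes a `with timed("Pandas Loading", timings):` block around the call being measured, and all results end up in one dictionary.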
shape:
- s = time.time()
- print(df_pandas.shape)
- e = time.time()
- pd_time = e - s
- print("Pandas Shape Time = {}".format(pd_time))
- (25366521, 19)
- Pandas Shape Time = 0.0
- s = time.time()
- print(df_pypolars.shape)
- e = time.time()
- pl_time = e - s
- print("PyPolars Shape Time = {}".format(pl_time))
- (25366521, 19)
- PyPolars Shape Time = 0.0010192394256591797
Filter:
- s = time.time()
- temp = df_pandas[df_pandas['PAID_AMT']>500]
- e = time.time()
- pd_time = e - s
- print("Pandas Filter Time = {}".format(pd_time))
- Pandas Filter Time = 0.8010377883911133
- s = time.time()
- temp = df_pypolars[df_pypolars['PAID_AMT']>500]
- e = time.time()
- pl_time = e - s
- print("PyPolars Filter Time = {}".format(pl_time))
- PyPolars Filter Time = 0.7790462970733643
Groupby:
- s = time.time()
- temp = df_pandas.groupby(by="MARKET_SEGMENT").agg({'PAID_AMT':np.sum, 'QTY_DISPENSED':np.mean})
- e = time.time()
- pd_time = e - s
- print("Pandas GroupBy Time = {}".format(pd_time))
- Pandas GroupBy Time = 3.5932095050811768
- s = time.time()
- temp = df_pypolars.groupby(by="MARKET_SEGMENT").agg({'PAID_AMT':np.sum, 'QTY_DISPENSED':np.mean})
- e = time.time()
- pl_time = e - s
- print("PyPolars GroupBy Time = {}".format(pl_time))
- PyPolars GroupBy Time = 1.2332513110957213
Applying a function:
- %%time
- s = time.time()
- temp = df_pandas['PAID_AMT'].apply(round)
- e = time.time()
- pd_time = e - s
- print("Pandas Loading Time = {}".format(pd_time))
- Pandas Loading Time = 13.081078290939331
- Wall time: 13.1 s
- s = time.time()
- temp = df_pypolars['PAID_AMT'].apply(round)
- e = time.time()
- pl_time = e - s
- print("PyPolars Loading Time = {}".format(pl_time))
- PyPolars Loading Time = 6.03610580444336
Value counts:
- %%time
- s = time.time()
- temp = df_pandas['MARKET_SEGMENT'].value_counts()
- e = time.time()
- pd_time = e - s
- print("Pandas ValueCounts Time = {}".format(pd_time))
- Pandas ValueCounts Time = 2.8194501399993896
- Wall time: 2.82 s
- %%time
- s = time.time()
- temp = df_pypolars['MARKET_SEGMENT'].value_counts()
- e = time.time()
- pl_time = e - s
- print("PyPolars ValueCounts Time = {}".format(pl_time))
- PyPolars ValueCounts Time = 1.7622406482696533
- Wall time: 1.76 s
Describe:
- %%time
- s = time.time()
- temp = df_pandas.describe()
- e = time.time()
- pd_time = e - s
- print("Pandas Describe Time = {}".format(pd_time))
- Pandas Describe Time = 15.48347520828247
- Wall time: 15.5 s
- %%time
- s = time.time()
- temp = df_pypolars[temp_cols].to_pandas().describe()
- e = time.time()
- pl_time = e - s
- print("PyPolars Describe Time = {}".format(pl_time))
- PyPolars Describe Time = 44.31892013549805
- Wall time: 44.3 s
Unique values:
- %%time
- s = time.time()
- temp = df_pandas['MARKET_SEGMENT'].unique()
- e = time.time()
- pd_time = e - s
- print("Pandas Unique Time = {}".format(pd_time))
- Pandas Unique Time = 2.1443397998809814
- Wall time: 2.15 s
- %%time
- s = time.time()
- temp = df_pypolars['MARKET_SEGMENT'].unique()
- e = time.time()
- pl_time = e - s
- print("PyPolars Unique Time = {}".format(pl_time))
- PyPolars Unique Time = 1.0320448875427246
- Wall time: 1.03 s
Saving the data:
- s = time.time()
- df_pandas.to_csv("delete_1May.csv", index=False)
- e = time.time()
- pd_time = e - s
- print("Pandas Saving Time = {}".format(pd_time))
- Pandas Saving Time = 779.0419402122498
- s = time.time()
- df_pypolars.to_csv("delete_1May.csv")
- e = time.time()
- pl_time = e - s
- print("PyPolars Saving Time = {}".format(pl_time))
- PyPolars Saving Time = 439.16817021369934
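The timings printed in the listings above can be turned into speedup ratios with a few lines of arithmetic. The sketch below copies the printed numbers as they appear; the dictionary layout and the operation labels are my own. The shape check is left out because its Pandas time printed as 0.0, which makes a ratio meaningless, and "Apply" refers to the step whose output is labeled "Loading Time" in its listing.

```python
# (pandas_seconds, pypolars_seconds), copied from the outputs printed above.
timings = {
    "Loading": (217.1734380722046, 114.0408570766449),
    "Filter": (0.8010377883911133, 0.7790462970733643),
    "GroupBy": (3.5932095050811768, 1.2332513110957213),
    "Apply": (13.081078290939331, 6.03610580444336),
    "ValueCounts": (2.8194501399993896, 1.7622406482696533),
    "Describe": (15.48347520828247, 44.31892013549805),
    "Unique": (2.1443397998809814, 1.0320448875427246),
    "Saving": (779.0419402122498, 439.16817021369934),
}

# Speedup > 1 means PyPolars was faster; < 1 means Pandas was faster.
speedups = {
    op: pandas_s / pypolars_s for op, (pandas_s, pypolars_s) in timings.items()
}

for op, ratio in speedups.items():
    print("{:<12} {:.2f}x".format(op, ratio))
```

On these numbers the ratio runs from about 1.0x for the filter to about 2.9x for the groupby. The describe step is the one case where the ratio falls below 1, since it goes through a conversion to Pandas first.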
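Conclusion
This article briefly introduced the PyPolars library, covering its implementation, its usage, and how its benchmark timings compare with those of Pandas on some basic operations. Note that PyPolars works in much the same way as Pandas, and that it is a memory-efficient library because the memory backing it is immutable.
You can read the documentation to learn more about the library. Various other open-source libraries also parallelize Pandas operations and speed up processing.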
References:
[1] Polars Documentation and GitHub repository: https://github.com/ritchie46/polars
Original title: Make Pandas 3 Times Faster with PyPolars, by Satyam Kumar
[51CTO translation. Partner sites reposting this article should credit the original translator and cite 51CTO.com as the source.]