Introduction
When it comes to choosing a library or framework for working with DataFrames, there is a wide selection to choose from, such as Pandas, Modin, or Dask. Each of these tools has its own strengths in terms of execution speed, memory efficiency, parallel computing, or ease of use.
Among the existing ones, Pandas is no doubt the most famous Python library for working with structured data, specifically DataFrames. Thanks to its ease of use and its foundation on an approachable programming language like Python, Pandas has been trusted for everything from simple analyses to more complex data manipulation and automated data pipelines.
With the large amount of data available these days, the demand for high-speed execution on DataFrames has led to the emergence of some alternatives to Pandas. For me personally, the most interesting one is Polars, a library that offers better performance while remaining simple to use for anyone who has experience with Pandas.
To find out whether Polars truly outshines Pandas, we will explore it in the following sections:
- What is Polars?
- Polars expressions
- Performance comparison
What is Polars?
A look back on Pandas
Pandas does not always achieve the best performance, especially on large datasets (in the gigabyte range, I would say). Data scientists who work with DataFrames regularly have probably experienced the slowness of Pandas at least once.
One reason is that Pandas only supports eager execution, in which each statement is evaluated as soon as it runs. Another is that the core of Pandas is single-threaded and has no built-in support for parallelism. These disadvantages lead to situations where reading, retrieving, or manipulating a DataFrame is slow with Pandas, and it needs help from frameworks like Dask or Modin to take advantage of parallelization.
Origin
Polars was created in 2020 by Ritchie Vink, during the COVID-19 pandemic. He started the project because he was dissatisfied with how DataFrames operated, which led him to create Polars to solve his own use case.
Characteristics
Polars is a library that utilizes all available cores on a computer to parallelize DataFrame operations. It is written in Rust, a language that aims to optimize performance, safety, and concurrency. Rust is a low-level programming language with direct access to hardware and memory, which makes it a great fit for efficient memory access and CPU multithreading.
Besides Rust, Polars is also built on top of Apache Arrow, a framework that provides a standardized column-oriented memory structure and in-memory computing. It helps address the poor performance on hierarchical data that we observed in Pandas.
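To make "column-oriented" a bit more concrete, here is a minimal plain-Python sketch that contrasts a row-oriented layout with a columnar one. It is only an illustration of the idea, not Arrow itself; the variable names are made up, and the three rows are taken from the asteroid dataset used later in this post.

```python
from array import array

# Row-oriented: each record keeps all of its fields together.
rows = [
    {"full_name": "1 Ceres", "diameter": 939.4},
    {"full_name": "2 Pallas", "diameter": 545.0},
    {"full_name": "3 Juno", "diameter": 246.596},
]

# Column-oriented: each column is stored as one contiguous sequence.
columns = {
    "full_name": ["1 Ceres", "2 Pallas", "3 Juno"],
    "diameter": array("d", [939.4, 545.0, 246.596]),  # packed 64-bit floats
}

# Aggregating one column in the row layout has to visit every record...
row_total = sum(record["diameter"] for record in rows)

# ...while the columnar layout scans a single contiguous buffer.
column_total = sum(columns["diameter"])

print(row_total == column_total)
```

Both layouts hold the same data; the columnar one simply keeps the values of a column next to each other, which is the property that makes scans and aggregations over a single column cache-friendly.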
With all of these benefits, Polars performs at least on par with other existing libraries and frameworks. It can avoid redundant copies (less overhead), traverse the memory cache efficiently, and process data in parallel.
A lazy API
Polars offers both a lazy and a semi-lazy API, which allow it to optimize an entire query before running it (note that it supports eager evaluation as well). When a query is submitted, Polars records it as a logical plan, runs an optimizer over that plan to speed up the query and reduce memory usage, and then distributes the work to different executors that use the algorithms of the eager API. Once all operations have completed in parallel, the result is returned, often faster than with an eager-only API like that of Pandas.
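The idea of recording a plan first and executing it later can be sketched in a few lines of plain Python. This is a toy model under my own assumptions, not how Polars is implemented: the class ToyLazyFrame and its methods are hypothetical, and the only "optimization" shown is moving filters ahead of column selections.

```python
class ToyLazyFrame:
    """A toy stand-in for a lazy frame: it records steps instead of running them."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = tuple(plan)

    def filter(self, predicate):
        # Nothing is evaluated here; the step is only appended to the plan.
        return ToyLazyFrame(self._rows, self._plan + (("filter", predicate),))

    def select(self, *names):
        return ToyLazyFrame(self._rows, self._plan + (("select", names),))

    def optimized_plan(self):
        # Stable sort that moves filters to the front, so fewer rows
        # reach the later steps.
        return sorted(self._plan, key=lambda step: step[0] != "filter")

    def collect(self):
        # Only now is the (reordered) plan actually executed.
        rows = self._rows
        for kind, arg in self.optimized_plan():
            if kind == "filter":
                rows = [row for row in rows if arg(row)]
            else:  # "select"
                rows = [{name: row[name] for name in arg} for row in rows]
        return rows


stands = [
    {"Stand": "Anubis", "PWR": "B"},
    {"Stand": "Atum", "PWR": "D"},
    {"Stand": "Cream", "PWR": "B"},
]

query = ToyLazyFrame(stands).select("Stand", "PWR").filter(lambda row: row["PWR"] == "B")
plan_order = [kind for kind, _ in query.optimized_plan()]
result = query.collect()
print(plan_order)
print(result)
```

The query is written as select-then-filter, but the toy optimizer runs the filter first. A real optimizer does far more than this, but the principle is the same: because nothing runs until collect(), the whole query can be rearranged before any data is touched.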
Polars expressions
Now it is time to look at some Polars code. This section walks through some simple expressions and functions so you can get a general idea of how it works.
For these Polars examples, we use a dataset called "All Stands in JoJo Bizarre Adventure with Stats", contributed by Shi Long Zhuang on Kaggle. The dataset describes fictional entities called 'Stands', each of which is a unique embodiment of a person's life energy that can wield supernatural abilities. A Stand is depicted as a spiritual being that hovers around its master and helps them in fights.
But that is drifting away from our objective. In short, the dataset contains the stats of every Stand that has appeared in the anime, e.g. speed, power, stamina, and so on.
Read CSV (eager)
Reading a CSV file in Polars is quite similar to Pandas. We can simply call the .read_csv() function in this case.
This eagerly executes the query and returns a DataFrame read from the input path.
import polars as pl

jojo = pl.read_csv('jojo.csv')
print(jojo.head(5))
Result:
┌────────────────┬─────┬─────┬─────┬─────┬─────┬─────┬────────────────────────────┐
│ Stand ┆ PWR ┆ SPD ┆ RNG ┆ STA ┆ PRC ┆ DEV ┆ Story │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞════════════════╪═════╪═════╪═════╪═════╪═════╪═════╪════════════════════════════╡
│ Anubis ┆ B ┆ B ┆ E ┆ A ┆ E ┆ C ┆ Part 3: Stardust Crusaders │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Atum ┆ D ┆ C ┆ D ┆ B ┆ D ┆ D ┆ Part 3: Stardust Crusaders │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Bastet ┆ E ┆ E ┆ B ┆ A ┆ E ┆ E ┆ Part 3: Stardust Crusaders │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Cream ┆ B ┆ B ┆ D ┆ C ┆ C ┆ D ┆ Part 3: Stardust Crusaders │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Death Thirteen ┆ C ┆ C ┆ E ┆ B ┆ D ┆ B ┆ Part 3: Stardust Crusaders │
└────────────────┴─────┴─────┴─────┴─────┴─────┴─────┴────────────────────────────┘
Read CSV (lazy)
Earlier, we learned that Polars can also operate in lazy mode.
To use lazy execution, you can simply call the lazy() method right after read_csv():
jojo = pl.read_csv('jojo.csv').lazy()
print(jojo.collect().head(5))
By calling lazy(), you are telling Polars to hold off on execution and optimize the whole query until collect() is called.
The collect() method starts the execution and returns a DataFrame as the result.
In lazy mode, we work on a LazyFrame object instead. The above example uses explicit lazy evaluation.
For the implicit version, you can replace the combination read_csv().lazy() with scan_csv().
Expression pipeline
In Polars, expressions (or queries) can be chained together to build new expressions. The basic usage is to call the .select method on your DataFrame (or LazyFrame) and pass multiple expressions to it.
exp = jojo.select([
    pl.col("Stand").count(),
    pl.col("Story").unique().count()
])
print(exp)
The snippet above basically says: count the number of records in column Stand and the number of unique values in column Story. The result is printed as follows:
Result:
┌───────┬───────┐
│ Stand ┆ Story │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═══════╪═══════╡
│ 156 ┆ 8 │
└───────┴───────┘
Filter and conditionals
In Pandas, we filter rows by indexing the DataFrame with a condition on its columns. Polars can perform such filtering as well, including more complex conditions. In the next snippet, we list all Stands whose names start with the letter A and that appear in Part 3 of the story.
exp = jojo.select([
    pl.col('Stand').filter(
        pl.col("Stand").str.starts_with("A")
        & pl.col("Story").str.contains("Part 3"))
])
print(exp)
Result:
┌────────┐
│ Stand │
│ --- │
│ str │
╞════════╡
│ Anubis │
├╌╌╌╌╌╌╌╌┤
│ Atum │
└────────┘
Binary functions
Polars can also transform data in an if-else fashion. In Pandas, we can select or manipulate data with conditional statements by using the .apply() method or the .loc indexer. To me personally, Polars has much better readability here. The expression is written as a when -> then -> otherwise construct:
exp = jojo.select([
    pl.col('*'),
    # pl.lit marks "Strong" and "Weak" explicitly as literal values
    pl.when(pl.col("PWR").is_in(["A", "B"])).then(pl.lit("Strong")).otherwise(pl.lit("Weak"))
])
print(exp.head(5))
The snippet above selects all columns in the DataFrame. It then uses a predicate expression to check whether PWR is A or B. If the predicate evaluates to true, the then branch classifies the character's power as Strong; if not, the otherwise branch classifies it as Weak.
Result:
┌────────────────┬─────┬─────┬─────┬─────┬─────┬─────┬────────────────────────────┬─────────┐
│ Stand ┆ PWR ┆ SPD ┆ RNG ┆ ... ┆ PRC ┆ DEV ┆ Story ┆ literal │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ ┆ str ┆ str ┆ str ┆ str │
╞════════════════╪═════╪═════╪═════╪═════╪═════╪═════╪════════════════════════════╪═════════╡
│ Anubis ┆ B ┆ B ┆ E ┆ ... ┆ E ┆ C ┆ Part 3: Stardust Crusaders ┆ Strong │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Atum ┆ D ┆ C ┆ D ┆ ... ┆ D ┆ D ┆ Part 3: Stardust Crusaders ┆ Weak │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Bastet ┆ E ┆ E ┆ B ┆ ... ┆ E ┆ E ┆ Part 3: Stardust Crusaders ┆ Weak │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Cream ┆ B ┆ B ┆ D ┆ ... ┆ C ┆ D ┆ Part 3: Stardust Crusaders ┆ Strong │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Death Thirteen ┆ C ┆ C ┆ E ┆ ... ┆ D ┆ B ┆ Part 3: Stardust Crusaders ┆ Weak │
└────────────────┴─────┴─────┴─────┴─────┴─────┴─────┴────────────────────────────┴─────────┘
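For readers who prefer to see the rule spelled out, the same classification can be written as an ordinary Python conditional. This is only a plain-Python mirror of the when -> then -> otherwise logic (the helper name classify is made up), applied to the PWR grades of the five Stands in the result above.

```python
# PWR grades of the five Stands shown in the result table above.
power_by_stand = {
    "Anubis": "B",
    "Atum": "D",
    "Bastet": "E",
    "Cream": "B",
    "Death Thirteen": "C",
}


def classify(grade):
    """Mirror when(PWR is A or B) -> then "Strong" -> otherwise "Weak"."""
    return "Strong" if grade in ("A", "B") else "Weak"


labels = {stand: classify(grade) for stand, grade in power_by_stand.items()}
for stand, label in labels.items():
    print(f"{stand}: {label}")
```

The difference in Polars is that the condition is evaluated on the whole column at once rather than one value at a time.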
GroupBy
The groupby operation can be slow in Pandas when working on large datasets. Polars improves the processing speed by running it multi-threaded. In Polars, you can combine different aggregations by adding multiple expressions to a list (just like in the .select() method). In the following example, we will try a combination of aggregations:
Per group "Story":
- count the number of Stands: pl.count()
- aggregate the Stand values of each group into a list: pl.col("Stand").list()
- keep only the Stands whose PWR and SPD are both A, and name the column StrongestStands: pl.col("Stand").filter((pl.col("PWR") == "A") & (pl.col("SPD") == "A")).alias("StrongestStands")
df = (
    jojo
    .groupby("Story")
    .agg(
        [
            pl.count(),
            pl.col("Stand").list(),
            pl.col("Stand").filter((pl.col("PWR") == "A")
                                   & (pl.col("SPD") == "A")).alias("StrongestStands")
        ]
    )
    .sort("count", reverse=True)
)
print(df.head(5))
Result:
┌──────────────────────────────┬───────┬─────────────────────────────┬─────────────────────────────┐
│ Story ┆ count ┆ Stand ┆ StrongestStands │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ list[str] ┆ list[str] │
╞══════════════════════════════╪═══════╪═════════════════════════════╪═════════════════════════════╡
│ Part 3: Stardust Crusaders ┆ 33 ┆ ["Anubis", "Atum", ... ┆ ["Star Platinum", "The │
│ ┆ ┆ "Yellow T... ┆ World"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Part 5: Vento Aureo ┆ 29 ┆ ["Black Sabbath", "Baby ┆ ["Sticky Fingers", "King │
│ ┆ ┆ Face", .... ┆ Crimson... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Part 4: Diamond is ┆ 28 ┆ ["Achtung Baby", "Aqua ┆ ["Crazy Diamond", "Red Hot │
│ Unbreakable ┆ ┆ Necklace"... ┆ Chili... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Part 6: Stone Ocean ┆ 26 ┆ ["Whitesnake", "C-Moon", ┆ ["Star Platinum", "Star │
│ ┆ ┆ ... "Yo... ┆ Platinum... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Part 7: Steel Ball Run ┆ 24 ┆ ["20th Century Boy", "Ball ┆ ["Ball Breaker", "Dirty │
│ ┆ ┆ Break... ┆ Deeds Do... │
└──────────────────────────────┴───────┴─────────────────────────────┴─────────────────────────────┘
Performance comparison
In this final section, we will run a comparison to see whether Polars can outperform Pandas. Here are the versions of the libraries used:
- Pandas: 1.3.4
- Polars: 0.13.57
The Python version used in this test is 3.8.5.
Each test was run 5 times consecutively, and the average execution time is reported.
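The post does not show how the five runs were averaged, so the following is just one way it could be done. It is a minimal sketch: the helper name average_runtime is hypothetical, and it uses time.perf_counter() rather than the time.time() call that appears in the timing code later on.

```python
import statistics
import time


def average_runtime(task, runs=5):
    """Run `task` `runs` times in a row and return the mean wall-clock time in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)


# Example with a stand-in workload; a real test would pass a use-case function.
mean_seconds = average_runtime(lambda: sum(range(100_000)))
print(f"Average over 5 runs: {mean_seconds:.4f} seconds")
```

Averaging several consecutive runs smooths out one-off effects such as a cold file cache on the first run.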
Dataset
The dataset used in the comparison is an enlarged version of the Asteroid Dataset contributed by Mir Sakhawat Hossain. The data is collected and maintained under the supervision of NASA's Jet Propulsion Laboratory at the California Institute of Technology. It is publicly available, and you can find more details about it via: Asteroid Dataset.
The original dataset was 456MB with 958,524 rows. It was duplicated 50 times on top of the original, so it now has 48,884,724 rows (51 × 958,524) and a final size of more than 3GB. Since there were 45 columns at the beginning, I decided to cut them down to 13 main features to keep things simple and make data manipulation easier.
┌──────────┬─────────┬────────────────┬──────┬─────┬──────────┬────────┬────────────────┬───────┐
│ id ┆ spkid ┆ full_name ┆ pdes ┆ ... ┆ diameter ┆ albedo ┆ diameter_sigma ┆ class │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ str ┆ ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞══════════╪═════════╪════════════════╪══════╪═════╪══════════╪════════╪════════════════╪═══════╡
│ a0000001 ┆ 2000001 ┆ 1 Ceres ┆ 1 ┆ ... ┆ 939.4 ┆ 0.09 ┆ 0.2 ┆ MBA │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a0000002 ┆ 2000002 ┆ 2 Pallas ┆ 2 ┆ ... ┆ 545.0 ┆ 0.101 ┆ 18.0 ┆ MBA │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a0000003 ┆ 2000003 ┆ 3 Juno ┆ 3 ┆ ... ┆ 246.596 ┆ 0.214 ┆ 10.594 ┆ MBA │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a0000004 ┆ 2000004 ┆ 4 Vesta ┆ 4 ┆ ... ┆ 525.4 ┆ 0.4228 ┆ 0.2 ┆ MBA │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a0000005 ┆ 2000005 ┆ 5 Astraea ┆ 5 ┆ ... ┆ 106.699 ┆ 0.274 ┆ 3.14 ┆ MBA │
└──────────┴─────────┴────────────────┴──────┴─────┴──────────┴────────┴────────────────┴───────┘
Use cases
The list below shows the different tests that were run:
- Open the file and display the dimension of the DataFrame
- Sort the name column alphabetically (in ascending order)
- Apply a function to the diameter column that divides it by 2 to transform it into a radius column
- Filter all rows with the diameter equal to or larger than that of the 4 Vesta asteroid (an asteroid having a diameter of 525 km)
- Filter all rows with the diameter smaller than that of the 4 Vesta asteroid, group them by asteroid class, and calculate the mean diameter (in kilometers).
Before heading to the tests, let's have a look at the timing code (main function):
import time

def run_use_case(n):
    if n == 1:
        use_case_1()
    elif n == 2:
        use_case_2()
    elif n == 3:
        use_case_3()
    elif n == 4:
        use_case_4()
    else:
        use_case_5()

def main(n):
    print(f"Use case {n} start")
    # Get the start time
    start_time = time.time()
    # Perform the test
    run_use_case(n)
    # Get the end time
    end_time = time.time()
    # Print out the elapsed time (seconds)
    elapsed_time = round(end_time - start_time, 2)
    print(f"Use case {n} completes in {elapsed_time} seconds.")
    # Logging into file
    run_log = "Use case {}".format(n)
    time_log = "{} Seconds ".format(elapsed_time)
    with open("polars_logs.txt", "a") as logfile:
        logfile.write("%s: %s\n" % (run_log, time_log))
There are two separate files containing this same main code, one for the Polars tests and one for the Pandas tests. Each receives arguments such as the use case number and the number of iterations.
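The argument handling itself is not shown in this post. As a rough sketch of what it might look like, the snippet below uses argparse from the standard library; the flag names --use-case and --iterations are my own assumptions, not the names used in the actual test scripts.

```python
import argparse


def parse_args(argv=None):
    """Parse the use case number and the number of iterations (flag names are assumed)."""
    parser = argparse.ArgumentParser(description="Run one benchmark use case.")
    parser.add_argument("--use-case", type=int, choices=range(1, 6), required=True,
                        help="which of the five use cases to run")
    parser.add_argument("--iterations", type=int, default=5,
                        help="how many consecutive runs to time")
    return parser.parse_args(argv)


# Passing a list makes the example reproducible; a script would call parse_args().
args = parse_args(["--use-case", "3", "--iterations", "5"])
print(args.use_case, args.iterations)
```

With something like this in place, the script could loop args.iterations times and call main(args.use_case) on each pass.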
1. Open the file and display the dimension of the DataFrame
This is the most common use case, since it is usually the first thing we do when we start working with a CSV file. After opening a CSV file, we want to quickly check the size of the dataset. This can be accomplished by reading the shape of the DataFrame, which returns the number of rows and the number of columns.
The code for Pandas:
def get_dataframe():
    # `name` holds the path to the CSV file
    input_df = pd.read_csv(name)
    return input_df

def use_case_1():
    global df
    print(df.shape)
In the snippet above, I made the DataFrame globally accessible since we will reuse it in the other use cases. When we execute the Pandas code, we get the following result:
(48884724, 13)
Now we run a test using Polars with the following code:
def get_dataframe():
    input_df = pl.read_csv(name, dtype={'full_name': pl.Utf8, 'pdes': pl.Utf8, 'name': pl.Utf8}).lazy()
    return input_df

def use_case_1():
    global lf
    print(lf.collect().shape)
Notice that the Polars code uses the .lazy() method. From now on, all use cases run their operations on a LazyFrame object. Here is the result after running the snippet:
(48884724, 13)
The result is the same as that of Pandas, which is great. Here are the execution times:
Use case | Pandas | Polars | Difference |
---|---|---|---|
Open the file and display the dimension of the DataFrame | 380.52 seconds | 33.22 seconds | 347.3 seconds |
In this use case, Polars performed more than 10 times faster than Pandas. Reading the file is likely one of the longest-running operations in many DataFrame libraries, yet Polars keeps the execution time reasonable.
2. Sort the name column alphabetically (in ascending order)
The second use case is also a simple one. We will sort the name column in ascending order. Since sorting can be troublesome on a large dataset, we expect that Pandas may be slow in this use case.
The code for Pandas:
def use_case_2():
    global df
    print((
        df.sort_values('name')
        .head(5)
    ))
By default, sort in Polars places null values first. To achieve the same result as Pandas, where null values are all placed at the end, we can pass the parameter nulls_last=True to the method.
def use_case_2():
    global lf
    print((
        lf.sort('name', nulls_last=True)
        .collect()
        .head()
    ))
When we execute both snippets, they return the same result as follows:
┌──────────┬─────────┬────────────────────────────┬────────┬─────┬──────────┬────────┬────────────────┬───────┐
│ id ┆ spkid ┆ full_name ┆ pdes ┆ ... ┆ diameter ┆ albedo ┆ diameter_sigma ┆ class │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ str ┆ ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞══════════╪═════════╪════════════════════════════╪════════╪═════╪══════════╪════════╪════════════════╪═══════╡
│ a0388282 ┆ 2388282 ┆ 388282 'Akepa (2006 RC118) ┆ 388282 ┆ ... ┆ null ┆ null ┆ null ┆ MBA │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a0388282 ┆ 2388282 ┆ 388282 'Akepa (2006 RC118) ┆ 388282 ┆ ... ┆ null ┆ null ┆ null ┆ MBA │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a0388282 ┆ 2388282 ┆ 388282 'Akepa (2006 RC118) ┆ 388282 ┆ ... ┆ null ┆ null ┆ null ┆ MBA │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a0388282 ┆ 2388282 ┆ 388282 'Akepa (2006 RC118) ┆ 388282 ┆ ... ┆ null ┆ null ┆ null ┆ MBA │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a0388282 ┆ 2388282 ┆ 388282 'Akepa (2006 RC118) ┆ 388282 ┆ ... ┆ null ┆ null ┆ null ┆ MBA │
└──────────┴─────────┴────────────────────────────┴────────┴─────┴──────────┴────────┴────────────────┴───────┘
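The two null placements mentioned in this use case (nulls first versus nulls last) can be reproduced on a small list in plain Python. This sketch only illustrates the ordering; it says nothing about how either library sorts internally, and the short list of names is made up for the example.

```python
names = ["Vesta", None, "Ceres", "Pallas", None]

# Sort key: (is_missing, value). False sorts before True, so real names come
# first in ascending order and missing values are pushed to the end.
nulls_last = sorted(names, key=lambda name: (name is None, name or ""))

# Flipping the first key element puts the missing values at the top instead.
nulls_first = sorted(names, key=lambda name: (name is not None, name or ""))

print(nulls_last)
print(nulls_first)
```

Choosing one placement explicitly is what makes the head of the sorted result comparable between the two libraries.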
Now let's see which library sorts faster:
Use case | Pandas | Polars | Difference |
---|---|---|---|
Sort the name column alphabetically (in ascending order) | 302.64 seconds | 30.65 seconds | 271.99 seconds |
Once again, the execution time of Polars is significantly shorter than that of Pandas. Sorting is approximately 10 times faster in Polars.
3. Apply a function to the diameter column that divides it by 2 to transform it into a radius column
In this use case, let's say we would like to derive the radius values from the diameters provided in the dataset. To achieve this, we can divide each value in the diameter column by 2 and generate a new column called radius.
There are different ways to do this in Pandas, but the most common one would be the .apply() method:
def use_case_3():
    global df
    df['radius'] = df['diameter'].apply(lambda x: x / 2)
    print(df.head(5))
Now in Polars, we can use the .map() method to accomplish this task:
def use_case_3():
    global lf
    print((
        lf.with_column(
            pl.col('diameter')
            .map(lambda s: s / 2)
            .alias('radius')
        )
        .collect()
        .head()
    ))
Both snippets return the same result:
┌──────────┬─────────┬────────────────┬──────┬─────┬────────┬────────────────┬───────┬─────────┐
│ id ┆ spkid ┆ full_name ┆ pdes ┆ ... ┆ albedo ┆ diameter_sigma ┆ class ┆ radius │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ str ┆ ┆ f64 ┆ f64 ┆ str ┆ f64 │
╞══════════╪═════════╪════════════════╪══════╪═════╪════════╪════════════════╪═══════╪═════════╡
│ a0000001 ┆ 2000001 ┆ 1 Ceres ┆ 1 ┆ ... ┆ 0.09 ┆ 0.2 ┆ MBA ┆ 469.7 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ a0000002 ┆ 2000002 ┆ 2 Pallas ┆ 2 ┆ ... ┆ 0.101 ┆ 18.0 ┆ MBA ┆ 272.5 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ a0000003 ┆ 2000003 ┆ 3 Juno ┆ 3 ┆ ... ┆ 0.214 ┆ 10.594 ┆ MBA ┆ 123.298 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ a0000004 ┆ 2000004 ┆ 4 Vesta ┆ 4 ┆ ... ┆ 0.4228 ┆ 0.2 ┆ MBA ┆ 262.7 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ a0000005 ┆ 2000005 ┆ 5 Astraea ┆ 5 ┆ ... ┆ 0.274 ┆ 3.14 ┆ MBA ┆ 53.3495 │
└──────────┴─────────┴────────────────┴──────┴─────┴────────┴────────────────┴───────┴─────────┘
And the results of the performance tests in this use case:
Use case | Pandas | Polars | Difference |
---|---|---|---|
Apply a function to the diameter column that divides it by 2 to transform it into a radius column | 15.24 seconds | 0.86 seconds | 14.38 seconds |
In this test, Polars also beats Pandas in terms of execution time, this time by a factor of roughly 18.
4. Filter all rows with the diameter equal to or larger than that of the 4 Vesta asteroid (an asteroid having a diameter of 525 km)
Next, we will try a use case where we extract a subset of the dataset based on a condition. This is a common operation during EDA, in which we index the DataFrame to get the desired rows.
In Pandas, the indexing operation is quite simple:
def use_case_4():
    global df
    vesta_diameter = 525
    print((
        df[df.diameter >= vesta_diameter]
        .head(5)
    ))
For Polars, we can use the .filter() method that we learned about in the previous sections:
def use_case_4():
    global lf
    vesta_diameter = 525
    print((
        lf.filter(pl.col('diameter') >= vesta_diameter)
        .collect()
        .head()
    ))
As expected, both snippets returned the same result:
┌──────────┬─────────┬────────────────────────────┬───────┬─────┬──────────┬────────┬────────────────┬───────┐
│ id ┆ spkid ┆ full_name ┆ pdes ┆ ... ┆ diameter ┆ albedo ┆ diameter_sigma ┆ class │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ str ┆ ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞══════════╪═════════╪════════════════════════════╪═══════╪═════╪══════════╪════════╪════════════════╪═══════╡
│ a0000001 ┆ 2000001 ┆ 1 Ceres ┆ 1 ┆ ... ┆ 939.4 ┆ 0.09 ┆ 0.2 ┆ MBA │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a0000002 ┆ 2000002 ┆ 2 Pallas ┆ 2 ┆ ... ┆ 545.0 ┆ 0.101 ┆ 18.0 ┆ MBA │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a0000004 ┆ 2000004 ┆ 4 Vesta ┆ 4 ┆ ... ┆ 525.4 ┆ 0.4228 ┆ 0.2 ┆ MBA │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a0020000 ┆ 2020000 ┆ 20000 Varuna (2000 WR106) ┆ 20000 ┆ ... ┆ 900.0 ┆ 0.07 ┆ 140.0 ┆ TNO │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a0000001 ┆ 2000001 ┆ 1 Ceres ┆ 1 ┆ ... ┆ 939.4 ┆ 0.09 ┆ 0.2 ┆ MBA │
└──────────┴─────────┴────────────────────────────┴───────┴─────┴──────────┴────────┴────────────────┴───────┘
Let's compare the performance of these two libraries:
Use case | Pandas | Polars | Difference |
---|---|---|---|
Filter all rows with the diameter equal to or larger than that of the 4 Vesta asteroid (an asteroid having a diameter of 525 km) | 1.02 seconds | 0.24 seconds | 0.78 seconds |
Despite being beaten by Polars in this use case as well, Pandas still performs very well in the filtering (indexing) operation. The ratio is much smaller compared to the other use cases: about 4x this time.
5. Filter all rows with the diameter smaller than that of 4 Vesta asteroid, grouped by asteroid class, and calculate the mean diameter (in kilometers)
We have come a long way! Now in this final test, let's try out a more complex DataFrame operation. We will filter all rows based on the diameter feature, then group the filtered records by asteroid class and calculate the mean diameter for each group.
Here is the Pandas code:
def use_case_5():
    global df
    vesta_diameter = 525
    print((
        df[df.diameter <= vesta_diameter]
        .groupby(['class'])['diameter'].mean()
        .head(5)
    ))
And here is the Polars code. Notice that there is no sort call in the Pandas snippet, since its groupby already returns the groups in ascending order.
def use_case_5():
    global lf
    vesta_diameter = 525
    print((
        lf.filter(pl.col('diameter') <= vesta_diameter)
        .groupby(pl.col('class'))
        .agg(pl.col('diameter').mean())
        .sort('class')
        .collect()
        .head()
    ))
The result obtained by running both snippets:
┌───────┬───────────┐
│ class ┆ diameter │
│ --- ┆ --- │
│ str ┆ f64 │
╞═══════╪═══════════╡
│ AMO ┆ 1.752 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ APO ┆ 0.955645 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ AST ┆ 13.044125 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ATE ┆ 0.615705 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ CEN ┆ 52.731294 │
└───────┴───────────┘
So far, so good. Let's have a look at the performance:
Use case | Pandas | Polars | Difference |
---|---|---|---|
Filter all rows with the diameter smaller than that of 4 Vesta asteroid, grouped by asteroid class, and calculate the mean diameter (in kilometers). | 68.63 seconds | 0.66 seconds | 67.97 seconds |
Polars performed roughly 100 times faster than Pandas in this test (68.63 seconds versus 0.66 seconds). From my prior experience working with DataFrames, Pandas operations involving groupby have usually taken this long. Polars, with its ability to parallelize, completes this operation extremely fast.
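To trace what this last use case computes, the filter, group, and mean steps can be followed by hand on a handful of rows in plain Python. This is only a sketch of the logic: the first five rows come from the sample of the dataset shown earlier, while the two APO rows are hypothetical and exist only so that there is a second group.

```python
from collections import defaultdict

VESTA_DIAMETER = 525

rows = [
    # (class, diameter in km) from the sample table shown earlier
    ("MBA", 939.4),
    ("MBA", 545.0),
    ("MBA", 246.596),
    ("MBA", 525.4),
    ("MBA", 106.699),
    # Hypothetical rows, added only so that a second group exists
    ("APO", 1.0),
    ("APO", 3.0),
]

# Steps 1 and 2: keep rows that pass the filter and group them by class.
groups = defaultdict(list)
for asteroid_class, diameter in rows:
    if diameter <= VESTA_DIAMETER:
        groups[asteroid_class].append(diameter)

# Step 3: mean per group, sorted by class name like the Polars query.
mean_diameter = {
    asteroid_class: sum(values) / len(values)
    for asteroid_class, values in sorted(groups.items())
}
print(mean_diameter)
```

Only two of the five MBA rows pass the filter here, so the MBA mean is taken over 246.596 and 106.699 alone.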
Conclusion
So far, we have learned about the characteristics of Polars and some common expressions. We have also conducted a comparison between it and Pandas in 5 different use cases. Overall, Polars had a faster execution time than Pandas in every test.
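To put the five results side by side, the speedup of each use case can be computed directly from the average times reported in the tables above. The short labels in this sketch are my own abbreviations of the use case names.

```python
# Average execution times in seconds (Pandas, Polars), copied from the tables above.
timings = {
    "1. Open file and show shape": (380.52, 33.22),
    "2. Sort by name": (302.64, 30.65),
    "3. Diameter to radius": (15.24, 0.86),
    "4. Filter by diameter": (1.02, 0.24),
    "5. Filter, group, mean": (68.63, 0.66),
}

speedups = {
    case: pandas_seconds / polars_seconds
    for case, (pandas_seconds, polars_seconds) in timings.items()
}
for case, ratio in speedups.items():
    print(f"{case}: Polars is about {ratio:.1f}x faster")
```

The ratios range from about 4x for the simple filter to about 100x for the filter, group, and mean query.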
Personal comment
Pandas is (again) no doubt one of the easiest libraries to learn, especially as its documentation and built-in code keep being improved by a generous community. Its best perk seems to be its ease of use. As for performance, you can actually improve it by combining it with the to_numpy() function (since Pandas is built on top of NumPy), or with Dask or Modin.
However, users may need to put in more effort to reach such high performance. This is where Polars can shine: you can get familiar with its functions and syntax quickly while getting better performance without much difficulty or hacky workarounds.
So, Polars or Pandas?
Well, my only suggestion is this: give Polars a try. Then maybe you will come up with ideas for where Polars can help optimize your work or projects.
Thank you for reading this blog!