Leitura CSV – Python Pandas

Você tem um CSV grande, você vai lê-lo no Pandas - mas toda vez que você o carrega, você tem que esperar o carregamento do CSV. E isso diminui o ciclo de feedback do desenvolvimento e pode diminuir significativamente o processamento da produção.

Mas é mais rápido ler os dados mais rapidamente. Vamos ver como.

Neste artigo abordaremos:

Leitura CSV padrão do Pandas.
O leitor CSV mais rápido e paralelo introduzido na v1.4.
Uma abordagem diferente que pode tornar as coisas ainda mais rápidas.

—Itamar

Original: https://pythonspeed.com/articles/pandas-read-csv-fast/

The fastest way to read a CSV in Pandas

by Itamar Turner-Trauring
Last updated 25 Jan 2022, originally created 25 Jan 2022

You have a large CSV, you’re going to be reading it in to Pandas—but every time you load it, you have to wait for the CSV to load. And that slows down your development feedback loop, and might meaningfully slows down your production processing.

But it’s faster to read the data in faster. Let’s see how.

In this article we’ll cover:

  1. Pandas’ default CSV reading.
  2. The faster, more parallel CSV reader introduced in v1.4.
  3. A different approach that can make things even faster.

Reading a CSV, the default way

I happened to have a 850MB CSV lying around with the local transit authority’s bus delay data, as one does. Here’s the default way of loading it with Pandas:

import pandas as pd

df = pd.read_csv("large.csv")

Here’s how long it takes, by running our program using the time utility:

$ time python default.py 

real    0m13.245s
user    0m11.808s
sys     0m1.378s

If you’re not familiar with the time utility’s output, I recommend reading my article on the topic, but basically “real” is the elapsed time on a wallclock, and the other two measures are CPU time broken down by time running in application code (“user”) and time running in the Linux kernel (“sys”).

The Pandas CSV reader has multiple backends; this is the "c" one written in C. If we use the "python" backend it runs much slower, but I won’t bother demonstrating that because it’s, well, slow.

Reading a CSV with PyArrow

In Pandas 1.4, released in January 2022, there is a new backend for CSV reading, relying on the Arrow library’s CSV parser. It’s still marked as experimental, and it doesn’t support all the features of the default parser—but it is faster.

Here’s how we use it:

import pandas as pd

df = pd.read_csv("large.csv", engine="pyarrow")

And when we run it:

$ time python arrow.py 

real    0m2.707s
user    0m4.945s
sys     0m1.527s

Let’s compare the two implementations:

CSV parserElapsed timeCPU time (user+sys)
Default C13.2 seconds13.2 seconds
PyArrow2.7 seconds6.5 seconds

Focusing first on the amount of CPU time, the PyArrow implementation uses half as much CPU. So that’s a good improvement.

Second, the elapsed time is even faster, in fact elapsed time is much lower than the CPU time. That’s because it’s using parallelism—unlike the default backend it’s taking advantage of the fact that my computer has multiple cores.

Now, parallelism may or may not be a benefit, depending on how you ran your code. If you previously only ran it on a single core, that’s a free performance improvement. But if you were already manually utilizing multiple cores, for example by loading multiple CSV files in parallel, adding parallelism here won’t speed things up, and potentially could slow things down a little.

However, given that the PyArrow backend is also inherently faster, seeing as the total CPU time was cut in half, it’s likely to provide meaningful speedups even if you already had parallelism.

Rethinking the problem

Loading a CSV is fundamentally a lot of work:

  1. You need to split into lines.
  2. You need to split each line on commas.
  3. You need to deal with string quoting.
  4. You need to guess(!) the types of columns, unless you explicitly pass them to Pandas.
  5. You need to convert strings into integers, dates, and other non-string types.

All of this takes CPU time.

And if you’re getting a CSV from a third-party, and you’re only processing it once, there’s not much you can do about this. But what if you’re loading the same CSV multiple times? Or, alternatively, what if you are the one generating the input file in some other part of your data processing pipeline?

Instead of reading in a CSV, you could read in some other file format that is faster to process. Let’s see an example, using the Parquet data format. Parquet files are designed to be read quickly: you don’t have to do as much parsing as you would with CSV. And unlike CSV, where the column type is not encoded in the file, in Parquet the columns have types stored in the actual file.

First, we’ll convert the CSV file to a Parquet file; we disable compression so we’re doing a more apples-to-apples comparison with the CSV. Of course, if you’re the one generating the file in the first place, you don’t need a conversion step, you can just write your data straight to Parquet.

import pandas as pd

df = pd.read_csv("large.csv")
df.to_parquet("large.parquet", compression=None)

We run this once:

$ time python convert.py

real    0m18.403s
user    0m15.695s
sys     0m2.107s

We can read the Parquet file; the fastparquet engine seems the faster of the two options on my computer, but you can also the try the pyarrow backend.

import pandas as pd

df = pd.read_parquet("large.parquet", engine="fastparquet")

If we run this:

$ time python parquet.py 

real    0m2.441s
user    0m1.990s
sys     0m0.575s

To compare:

ParserElapsed timeCPU time (user+sys)
Default CSV13.2 seconds13.2 seconds
PyArrow CSV2.7 seconds6.5 seconds
fastparquet2.4 seconds2.6 seconds

Measured purely by CPU, fastparquet is by far the fastest. Whether it gives you an elapsed time improvement will depend on whether you have existing parallelism or not, your particular computer, and so on. And different CSV files will presumably have different parsing costs; this is just one example. But clearly reading the Parquet format is much more efficient.

The best CSV is no CSV

CSV is a bad format. Besides the inefficiency of parsing it, the lack of type data means parsing is always going to be more error-prone and ambiguous than a structured file format with actual column types. So if you can, avoid using CSV and use a better format, for example Parquet.

If you are stuck with CSV, consider using the new PyArrow CSV parser in Pandas 1.4; you’ll get a nice speed-up, especially if your program is not currently taking advantage of multiple CPUs.

Deixe um comentário

Faça o login usando um destes métodos para comentar:

Logo do WordPress.com

Você está comentando utilizando sua conta WordPress.com. Sair /  Alterar )

Imagem do Twitter

Você está comentando utilizando sua conta Twitter. Sair /  Alterar )

Foto do Facebook

Você está comentando utilizando sua conta Facebook. Sair /  Alterar )

Conectando a %s

Este site utiliza o Akismet para reduzir spam. Saiba como seus dados em comentários são processados.