Who doesn’t love the read_csv() function of Pandas?
It checks the boxes of being both simple and elegant!
And it is the first thing I learned when I started my journey in the data world (which, to be honest, is still in its early days and hasn’t moved much further. But that’s a story for some other time.)
But how does this function fare against datasets that contain millions or even billions of rows?
The Pandas API is primarily designed for in-memory analytics, and hence it tries to load the entire dataset into memory.
So reading large datasets with this function can be slow and may even lead to out-of-memory errors!
One way around this problem is to use the ‘chunksize’ parameter along with the read_csv() function.
It splits the dataset into smaller portions, and we can load and process each chunk separately, as shown in the sketch below.
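Here is a minimal sketch of chunked reading, assuming a hypothetical file named large_dataset.csv and a chunk size of 100,000 rows (adjust both to your own data):

```python
import pandas as pd

# 'large_dataset.csv' is a placeholder path for illustration.
total_rows = 0
for chunk in pd.read_csv("large_dataset.csv", chunksize=100_000):
    # Each chunk is a regular DataFrame, so any Pandas operation works here.
    total_rows += len(chunk)

print(f"Processed {total_rows} rows in chunks")
```

Only one chunk lives in memory at a time, so the peak memory usage stays roughly at the size of a single chunk rather than the whole file.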
Another alternative is Dask’s read_csv() function.
Dask mirrors the Pandas API but uses parallel, out-of-core computing to resolve our woes when faced with large datasets!
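A minimal sketch of the Dask version, again assuming the hypothetical large_dataset.csv and a placeholder column called "amount":

```python
import dask.dataframe as dd

# Dask splits the CSV into partitions and processes them in parallel,
# so the full file never has to fit in memory at once.
ddf = dd.read_csv("large_dataset.csv")

# Operations are lazy: they build a task graph and only run on .compute().
mean_amount = ddf["amount"].mean().compute()
print(f"Mean amount: {mean_amount}")
```

The familiar Pandas-style syntax stays the same; the main difference is that nothing is actually computed until you call .compute().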