macORnvidia OP t1_j0z24ie wrote
Reply to comment by sayoonarachu in laptop for Data Science and Scientific Computing: proart vs legion 7i vs thinkpad p16/p1-gen5 by macORnvidia
> For example, the largest Parquet file I've cleaned in pandas was about 7 million rows and about 10 GB of just text. Pandas can run queries through it in a few seconds.
Using RAPIDS? Like cuDF?
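If not, cuDF is worth a look: it mirrors a large chunk of the pandas API, so this kind of cleanup is often close to an import swap. A rough sketch (untested; the file name and `prompt` column are just placeholders):

```python
import cudf  # RAPIDS GPU DataFrame library

# cuDF mirrors much of the pandas API, so typical cleanup
# (dedup, string-length filters) ports almost unchanged.
# File name and column name below are illustrative.
gdf = cudf.read_parquet("prompts.parquet")
gdf = gdf.drop_duplicates(subset="prompt")
gdf = gdf[gdf["prompt"].str.len() > 50]  # keep prompts longer than 50 chars
gdf.to_parquet("prompts_clean.parquet")
```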
sayoonarachu t1_j0zlw4i wrote
No, just plain pandas on CPU for some quick regex work and removing/replacing text rows. It was for a hobby project. The data was scraped from the Midjourney and Stable Diffusion Discord servers, so there were millions of rows of duplicate and poor-quality prompts, which I had pandas delete. In the end, the number of unique prompts longer than 50 characters came to about 700k, which was then used to train GPT-Neo 125M.
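Roughly, the pipeline looked like this (file name, column name, and regexes below are illustrative, not my actual ones):

```python
import pandas as pd

# Load the scraped prompts; Parquet keeps a ~10 GB text dump manageable.
df = pd.read_parquet("prompts.parquet")

# Strip Discord artifacts such as user mentions and URLs (example regexes).
df["prompt"] = (
    df["prompt"]
    .str.replace(r"<@!?\d+>", "", regex=True)      # user mentions
    .str.replace(r"https?://\S+", "", regex=True)  # links
    .str.strip()
)

# Drop duplicate and short, low-quality prompts.
df = df.drop_duplicates(subset="prompt")
df = df[df["prompt"].str.len() > 50]

df.to_parquet("prompts_clean.parquet")
```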
I didn't know about cuDF. Thanks 😅