macORnvidia OP t1_j0z24ie wrote
Reply to comment by sayoonarachu in laptop for Data Science and Scientific Computing: proart vs legion 7i vs thinkpad p16/p1-gen5 by macORnvidia
> For example, the largest Parquet file I've cleaned in pandas was about 7 million rows and about 10 GB of just text. Pandas can run queries through it in a few seconds.
Using RAPIDS? Like cuDF?
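If not, cuDF is worth a look: it mirrors a large chunk of the pandas API, so this kind of cleanup is often close to an import swap. A rough sketch (untested; the file name and `prompt` column are just placeholders):

```python
import cudf  # RAPIDS GPU DataFrame library

# cuDF mirrors much of the pandas API, so typical cleanup
# (dedup, string-length filters) ports almost unchanged.
# File name and column name below are illustrative.
gdf = cudf.read_parquet("prompts.parquet")
gdf = gdf.drop_duplicates(subset="prompt")
gdf = gdf[gdf["prompt"].str.len() > 50]  # keep prompts longer than 50 chars
gdf.to_parquet("prompts_clean.parquet")
```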
sayoonarachu t1_j0zlw4i wrote
No, just plain pandas on CPU for some quick regex work and removing/replacing text rows. It was for a hobby project. The data was scraped from the Midjourney and Stable Diffusion Discord servers, so there were millions of rows of duplicate and poor-quality prompts, which I had pandas delete. In the end, the number of unique prompts longer than 50 characters came to about 700k, which was then used to train GPT-Neo 125M.
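Roughly, the pipeline looked like this (file name, column name, and regexes below are illustrative, not my actual ones):

```python
import pandas as pd

# Load the scraped prompts; Parquet keeps a ~10 GB text dump manageable.
df = pd.read_parquet("prompts.parquet")

# Strip Discord artifacts such as user mentions and URLs (example regexes).
df["prompt"] = (
    df["prompt"]
    .str.replace(r"<@!?\d+>", "", regex=True)      # user mentions
    .str.replace(r"https?://\S+", "", regex=True)  # links
    .str.strip()
)

# Drop duplicate and short, low-quality prompts.
df = df.drop_duplicates(subset="prompt")
df = df[df["prompt"].str.len() > 50]

df.to_parquet("prompts_clean.parquet")
```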
I didn't know about cuDF. Thanks 😅