Submitted by AutoModerator t3_100mjlp in MachineLearning
hysse t1_j30tpx8 wrote
Reply to comment by jakderrida in [D] Simple Questions Thread by AutoModerator
Thanks for the answer. I need to train a relatively large model, so I need an efficient tokenizer.
I don't see how a tokenizer written in PyTorch (or TensorFlow) could be faster than, say, a HuggingFace tokenizer. HuggingFace tokenizers have a Rust backend that makes them fast, and I'd guess torchtext has an optimized backend too.
Given that the tokenizer runs on the CPU and not the GPU, how could it be faster if I wrote it in PyTorch (or even in plain Python)?
jakderrida t1_j30u1jq wrote
To be clear, ChatGPT provided the answer.
Here's how ChatGPT answered your next question...
It is generally true that a Rust backend can provide better performance than a pure Python implementation, especially for a high-throughput task like tokenization. This is because Rust is a systems programming language that is designed to be fast and efficient. It can also make use of low-level optimization techniques that are not possible in Python, such as manual memory management.
That being said, it is still possible to write a fast and efficient tokenizer in Python. There are several ways to do this:
- Use optimized libraries: Libraries such as NumPy are implemented in C and can vectorize a tokenizer's hot loops; alternatively, running unmodified code under PyPy (an alternative Python interpreter with a built-in JIT) can help.
- Use Cython or Numba: These tools compile Python code to native machine code, which can provide a significant speedup over pure Python (see the second sketch below).
- Use multithreading or multiprocessing: If your tokenizer is CPU-bound, multiprocessing can parallelize the work across multiple CPU cores; note that plain multithreading is limited by the GIL for CPU-bound Python code (see the sketch right after this list).
- Use a just-in-time (JIT) compiler: JIT compilers, such as Numba, analyze your Python code at runtime and generate optimized machine code on the fly, which can also be a significant boost over pure Python.
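For concreteness, here is a minimal sketch of the multiprocessing approach. The whitespace `tokenize` function and the toy corpus are hypothetical placeholders, not a real tokenizer:

```python
from multiprocessing import Pool

def tokenize(text):
    # Placeholder tokenizer: a real one (BPE, WordPiece, ...) goes here.
    return text.lower().split()

def main():
    # Toy corpus standing in for real training text.
    corpus = ["The quick brown fox", "jumps over the lazy dog"] * 10_000
    # Each worker process tokenizes a batch of texts; chunksize
    # amortizes the inter-process communication overhead.
    with Pool(processes=4) as pool:
        token_lists = pool.map(tokenize, corpus, chunksize=1_000)
    print(token_lists[0])  # ['the', 'quick', 'brown', 'fox']

if __name__ == "__main__":
    main()
```

Because each worker is a separate process, this sidesteps the GIL for CPU-bound work, at the cost of pickling the texts between processes.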
Overall, there are many ways to optimize the performance of a Python-based tokenizer, and it is worth combining several of these techniques to achieve the best results.
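And a minimal sketch of the Numba/JIT idea: compile the hot inner loop over raw bytes to machine code. The byte-to-ID mapping here is a made-up toy (each byte maps to its own ID), not a real vocabulary lookup:

```python
import numpy as np
from numba import njit

@njit(cache=True)
def bytes_to_ids(buf):
    # Hot loop: compiled to native machine code by Numba on first call.
    ids = np.empty(buf.shape[0], dtype=np.int32)
    for i in range(buf.shape[0]):
        ids[i] = buf[i]  # toy mapping: token ID == byte value
    return ids

buf = np.frombuffer("hello world".encode("utf-8"), dtype=np.uint8)
print(bytes_to_ids(buf))
```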
hysse t1_j30ub8q wrote
Haha I knew it. Unfortunately, I don't think ChatGPT can give a good answer to that question...
jakderrida t1_j30uieu wrote
Yeah, when I read it, I knew right away that I'd seem like a bigger imbecile if I let you think it was me. The first one was impressive, though.