
hysse t1_j30tpx8 wrote

Thanks for the answer. I need to train a relatively large model, so I need an efficient tokenizer.

I don't see how a tokenizer written in PyTorch (or TensorFlow) can be faster than a HuggingFace tokenizer, for example. HuggingFace has a Rust backend that makes its tokenizers fast, and I guess torchtext has an optimized backend too.

Given that the tokenizer runs on the CPU and not the GPU, how could it run faster if I wrote it in PyTorch (or even in pure Python)?


jakderrida t1_j30u1jq wrote

To be clear, ChatGPT provided the answer.

Here's how ChatGPT answered your next question...

It is generally true that a Rust backend can provide better performance than a pure Python implementation, especially for a high-throughput task like tokenization. Rust is a systems programming language designed to be fast and efficient, and it allows low-level optimizations that are not possible in Python, such as fine-grained control over memory allocation without a garbage collector.

That being said, it is still possible to write a fast and efficient tokenizer in Python. There are several ways to do this:

  1. Use optimized libraries and runtimes: libraries such as NumPy are implemented in C, and alternative runtimes such as PyPy can execute pure-Python code much faster. Using these can help to improve the performance of your tokenizer.

  2. Use Cython or Numba: Cython and Numba can compile Python code to native machine code, which can provide a significant performance boost over pure Python code (see the first sketch after this list).

  3. Use multiprocessing: if your tokenizer is CPU-bound, you can use multiprocessing to parallelize the work across multiple CPU cores (see the second sketch after this list). Note that multithreading rarely helps here, because CPython's GIL prevents CPU-bound Python threads from running in parallel.

  4. Use a just-in-time (JIT) compiler: a JIT compiler such as Numba analyzes your Python code at runtime and generates optimized machine code on the fly, so a hot loop pays the interpreter's overhead only once.
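For points 2 and 4, here is a minimal Numba sketch (illustrative, not from the thread): a toy byte-level pass that maps each input byte to a token id. The `bytes_to_ids` function and its offset of 4 reserved special-token ids are hypothetical; real subword tokenization is far more involved, but a hot loop like this compiles to machine code the same way.

```python
import numpy as np
from numba import njit

@njit(cache=True)
def bytes_to_ids(buf):
    # Toy byte-level "tokenizer": one token id per input byte.
    # The +4 offset is a hypothetical reservation for special tokens.
    ids = np.empty(buf.shape[0], dtype=np.int32)
    for i in range(buf.shape[0]):
        ids[i] = buf[i] + 4
    return ids

text = "hello world".encode("utf-8")
ids = bytes_to_ids(np.frombuffer(text, dtype=np.uint8))
print(ids)  # first call includes JIT compilation; later calls run the compiled loop
```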
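And for point 3, a minimal multiprocessing sketch; the whitespace `tokenize` function here is a stand-in for whatever tokenizer you actually use:

```python
from multiprocessing import Pool

def tokenize(text):
    # Stand-in tokenizer: lowercase + whitespace split.
    return text.lower().split()

if __name__ == "__main__":
    corpus = ["The quick brown fox jumps over the lazy dog"] * 100_000
    # Pool() defaults to os.cpu_count() worker processes; chunksize
    # batches work items to reduce inter-process overhead.
    with Pool() as pool:
        tokenized = pool.map(tokenize, corpus, chunksize=1000)
    print(tokenized[0])
```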

Overall, there are many ways to optimize the performance of a Python-based tokenizer, and it is worth combining several of these techniques to achieve the best results.


hysse t1_j30ub8q wrote

Haha I knew it. Unfortunately, I don't think ChatGPT can give a good answer to that question...


jakderrida t1_j30uieu wrote

Yeah, when I read it, I knew right away that I'd seem like a bigger imbecile if I let you think it was me. The first one was impressive, though.
