Submitted by SAbdusSamad t3_10siibd in MachineLearning

Hello everyone,

I'm interested in diving into the field of computer vision and I recently came across the concept of Vision Transformer (ViT). I want to understand this concept in depth but I'm not sure what prerequisites I need to have in order to grasp the concept fully.

Do I need to have a strong background in Recurrent Neural Networks (RNNs) and Transformer (Attention Is All You Need) to understand ViT, or can I get by just knowing the basics of deep learning and Convolutional Neural Networks (CNNs)?

I would really appreciate it if someone could shed some light on this and provide some guidance.

Thank you in advance!

85

Comments


the_architect_ai t1_j71izep wrote

I suggest you just dive straight in. Part of learning is to find out what you don’t know and slowly cover your bases from there.

61

AerysSk t1_j71kz0d wrote

This is the correct attitude. Dive in, and when you hit obstacles, work through them. That's what makes the learning journey fun: you don't just learn one thing, but many things.

14

SAbdusSamad OP t1_j758r29 wrote

Great advice. This seems to be a good starting point.

2

SimonJDPrince t1_j72bw7l wrote

Explained in my forthcoming book:

https://udlbook.github.io/udlbook/

Should be a good place to start, and if it isn't then I'm really interested to know where you struggled so I can improve the explanation.

23

fermangas t1_j734867 wrote

I was going to recommend this book. You beat me to it.

5

jmmcd t1_j7qb9i1 wrote

This book is really excellent! I'm working through it and collecting a few typos. I'll pass them on when done. I'm going to recommend it to my students this semester.

2

SAbdusSamad OP t1_j757w05 wrote

I recently obtained a PDF of the book and began searching for information on ViT. Unfortunately, it appears that the book does not cover this topic. However, I plan to utilize the Transformer chapter to gain an understanding of ViT.

1

SimonJDPrince t1_j784yjf wrote

ViT is at the end of the transformers chapter. Perhaps I forgot to put it in the index?

1

SAbdusSamad OP t1_j79ub7q wrote

I apologize for that oversight. Yes, the book does cover Transformers for images.

1

42gauge t1_j7eal36 wrote

What are the math/ML prerequisites for this text?

1

SimonJDPrince t1_j7htrs8 wrote

Pretty much nothing to get through the first half. High school calculus and a basic grasp of probability. Should be accessible to almost everyone. Second half needs more knowledge of probability, but I'm filling out appendices with this info.

1

Jurph t1_j71nymu wrote

I recommend diving in, but getting out a notepad and writing down any term you don't understand. So if you get two paragraphs in and someone says this simply replaces back-propagation, making the updated weights sufficient for the skip-layer convolution and you realize that you don't understand back-prop or weights or skip-layer convolution ... then you probably need to stop, go learn those ideas, and then go back and try again.

For deep neural nets, back-propagation, etc., there will be a point where a full understanding requires calculus or other strong mathematical foundations. For example, you can't accurately explain why back-prop works without a basic intuition for the Chain Rule. Similarly, activation functions like ReLU and sigmoid require a strong algebraic background for their graphs to be a useful shorthand. But you can "take it on faith" that it works, treat that part of the system like a black box, and revisit it once you understand what it's doing.
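For instance, here's a tiny illustration (plain NumPy, made-up numbers, not from any particular paper) of how back-prop is really just the Chain Rule: you multiply the local derivatives together.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Forward pass: y = sigmoid(w * x)
x, w = 2.0, 0.5
z = w * x
y = sigmoid(z)

# Backward pass: dL/dw = dL/dy * dy/dz * dz/dw  (Chain Rule)
dL_dy = 1.0                    # pretend the upstream loss gradient is 1
dy_dz = y * (1.0 - y)          # derivative of the sigmoid
dz_dw = x                      # derivative of w * x with respect to w
dL_dw = dL_dy * dy_dz * dz_dw  # multiply the local derivatives together
print(dL_dw)
```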

I would say the biggest piece of foundational knowledge is the idea of "functions", their role in mappings and transforms, and how iterative methods like Newton's Method arrive at approximate solutions over several steps. A lot of machine learning is based on the idea of expressing the problem as a composed set of mathematical expressions that can be solved iteratively. Grasping the idea of a "loss function" that can be minimized is core to the entire discipline. (See the toy sketch below.)
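As a toy illustration of that last point (nothing ViT-specific, just the core idea of iteratively minimizing a loss with gradient descent):

```python
# Toy loss: L(w) = (w - 3)^2, minimized at w = 3
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)   # dL/dw

w = 0.0           # initial guess
lr = 0.1          # learning rate (step size)
for step in range(50):
    w -= lr * grad(w)        # move against the gradient
print(w, loss(w))            # w approaches 3, loss approaches 0
```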

18

[deleted] t1_j72u4c2 wrote

[deleted]

5

Jurph t1_j73ozbe wrote

Hey, I dove into "Progressive Growing of GANs" without knowing what weights were. And now here I am, four or five years later: I've trained my own classifiers based on ViTs and DNNs, written Python interfaces for them, and I'm working on tooling to make Automatic1111's GUI behave better with Stable Diffusion. We've all got to start somewhere.

3

atharvat80 t1_j71u3oa wrote

If you want to take the top-down approach, I'd recommend that you start by learning what transformers are. Transformers were originally intended for language modelling, so if you look up an NLP lecture series like Stanford CS224n, they cover that in detail from an NLP perspective; it should be helpful regardless. Or you can check out CS231n, which has a whole lecture on attention, transformers and ViT. Start there and look up whatever is unclear as you go.

Lmk if you'd like me to link any other resources, I'll edit this later. Happy learning!

13

SAbdusSamad OP t1_j75922f wrote

These courses seem to have excellent content. I will definitely consider these as great resources.

1

new_name_who_dis_ t1_j71w8up wrote

If I recall correctly, ViT is a purely transformer-based architecture. So you don't need to know RNNs or CNNs, just transformers.
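Roughly speaking (this is just an illustrative NumPy sketch with random made-up weights, not the actual ViT code), the only image-specific part is chopping the image into patches and linearly projecting them into tokens; everything after that is a standard transformer encoder:

```python
import numpy as np

img = np.random.rand(224, 224, 3)            # H x W x C image
patch, d_model = 16, 768                     # patch size, embedding dim

# Split into non-overlapping 16x16 patches and flatten each one.
patches = img.reshape(224 // patch, patch, 224 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)  # (196, 768)

# Linear projection to token embeddings, plus (random stand-in) position embeddings.
W = np.random.randn(patch * patch * 3, d_model) * 0.02
tokens = patches @ W + np.random.randn(patches.shape[0], d_model) * 0.02
# From here on, `tokens` goes through an ordinary transformer encoder.
print(tokens.shape)                          # (196, 768)
```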

7

JustOneAvailableName t1_j71yj42 wrote

Understanding the *what* is extremely easy and rather useless; to understand a paper you need to understand some level of the *why*. If you have time to go in depth, aim to understand the *what not* and *why not* as well.

So I would argue at least some basic knowledge of CNNs is required.

2

SAbdusSamad OP t1_j71z0zp wrote

Well, I do have an idea about CNNs. I have limited knowledge of RNNs. But I'm not familiar with Attention Is All You Need.

1

Erosis t1_j72rzdl wrote

You'll probably be fine learning transformers directly, but a better understanding of RNNs might make some of the NLP tutorials/papers containing transformers more easily comprehensible.

Attention is a very important component of transformers, but attention can be applied to RNNs, too.

3

SAbdusSamad OP t1_j759v4v wrote

I agree that having a background in RNNs and attention with RNNs can make the learning process for transformers, and by extension ViT, much easier.

1

tripple13 t1_j723bf0 wrote

I strongly disagree. Having an understanding of seq2seq prior to Transformers goes a long way.

1

new_name_who_dis_ t1_j723k5w wrote

I mean the more you understand the better obviously. But it's not necessary, it's just context for what we don't do anymore.

2

icanelectoo t1_j75h90j wrote

Look up some papers that discuss them, then look up the papers those papers refer to. Write out a summary as if you had to explain it to someone who's never seen it before.

Alternatively you could ask chatGPT.

2

teenaxta t1_j76i085 wrote

Most of the ViT discussions and videos I've seen assume you already have an idea of attention and transformers.

Watch this video series to get a feel for attention and transformers in general, and then you'll be good to go:

https://www.youtube.com/watch?v=mMa2PmYJlCo

2

juanigp t1_j71p88u wrote

matrix multiplication, linear projections, dot product

−3

nicholsz t1_j728g2l wrote

OS. Kernel. Bus. Processor. Transistor. p-n junction

15

juanigp t1_j73a6z4 wrote

That was just my two cents: self-attention is a bunch of matrix multiplications, 12 layers of the same, so it makes sense to understand why QK^T is there. If the question had been how to understand Mask R-CNN, the answer would have been different.

Edit: 12 layers in ViT base / BERT base
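To make that concrete, here is roughly what a single self-attention head boils down to (a simplified NumPy sketch with random weights, ignoring multi-head splitting and masking):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: linear projections plus matrix multiplications."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # linear projections
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # scaled dot products: QK^T / sqrt(d)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of the values

n_tokens, d = 196, 64
X = np.random.randn(n_tokens, d)
Wq, Wk, Wv = (np.random.randn(d, d) * 0.02 for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (196, 64)
```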

0