Submitted by AutoModerator t3_100mjlp in MachineLearning

Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead!

The thread will stay alive until the next one, so keep posting after the date in the title.

Thanks to everyone for answering questions in the previous thread!

19

Comments


ateeb_khan_13 t1_j2imtti wrote

Hey, I have trained a deep learning model that I want to deploy. Does anyone have tips or a tutorial that could help me with that?

1

CygnusX1 t1_j2jxcl2 wrote

What are good techniques or best practices for detecting/segmenting large objects in high-resolution images? One problem I run into is that training with large image chip sizes (e.g. 1024x1024), so that the entire object fits on a chip, leads to GPU memory management pains. I've been using Mask R-CNN.

7

i_likebrains t1_j2k46t1 wrote

What batch sizes, learning rates and number of epochs are suitable for smaller datasets?

2

pacozaa t1_j2l0cz1 wrote

How do I start learning machine learning? Kaggle? A course? There are so many keywords here that I don't know.

5

waiting4omscs t1_j2lbbsu wrote

Not sure if this is a simple question, so I'll ask here before making a thread. How would on-device machine learning be done? I just watched a video about "Moonwalker" shoes that use AI to adapt your stride to their mechanical wheeled shoes; in it, the developer said that the shoe "learns your stride". How would that be done on-device? What would the underlying architecture be? What kind of algorithms/models? Would there be trained parameters already?

3

gmish27 t1_j2liuuh wrote

Problem - how to detect empty areas on a brick wall?

Context - posters/hoardings are put up on brick walls. I have images in which people have manually annotated the empty areas (the sections of the wall not covered by any hoarding) with rectangular bounding boxes. Can I use these images to train an object detection model? If so, which one should I use?

Expected outcome - If a new image of a brick wall is presented the model should create the bounding box for any empty area on its own

2

-s-u-n-n-y- t1_j2lnrj2 wrote

Is there any kind of available AI tool I can use for asking a wide range of questions to? For example, I work autonomously in a role at my job that has never existed before. I’d love to be able to ask AI how to automate some tasks or for excel formulas that are quite complex. Any advice appreciated.

1

v2thegreat t1_j2lpumb wrote

These come under hyperparameter optimization, so you will definitely need to play around with them, but here are my rules of thumb (take them with a grain of salt!):

Learning rate: start with a large learning rate (e.g. 10e-3), and if the model overfits, reduce it down toward 10e-6. There's a Stack Overflow post that explains this quite well.

Number of epochs: stop right before your model's training loss starts diverging from the validation loss. Plot them both; the point where they diverge is where the overfitting starts.

Batch size: in general, large enough that the data fits in memory, to speed things up.
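
A rough sketch of what I mean by plotting them out, assuming you collect per-epoch losses in plain Python lists from your own training loop:

```python
import matplotlib.pyplot as plt

def plot_loss_divergence(train_losses, val_losses):
    """train_losses/val_losses: per-epoch losses from your own training history."""
    epochs = range(1, len(train_losses) + 1)
    plt.plot(epochs, train_losses, label="training loss")
    plt.plot(epochs, val_losses, label="validation loss")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.show()
    # A reasonable number of epochs: roughly where validation loss bottoms out
    best_epoch = min(epochs, key=lambda e: val_losses[e - 1])
    print(f"Validation loss was lowest at epoch {best_epoch}")
```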

3

tdgros t1_j2m63e5 wrote

I can't say for sure, but there isn't necessarily any online training. You can imagine some hypernetwork regressing good parameters for a low-level task such as controlling the shoes' motors. It could also be a combination of good old-school sensor fusion and a nice marketing pitch ;)

3

Rawvik t1_j2mj7bj wrote

My manager recently asked me to create a chatbot for the company that helps end users solve their queries related to the product. As far as I know, the company already has some Word documents that describe how various features of the product work. There are multiple documents, so I think I will have to use them as my data to train the chatbot. My question is: does anyone have any idea how to approach this problem? Or is there a step-by-step guide for tackling this in NLP? Please suggest something.

1

oilfee t1_j2mki8y wrote

How much data do I need for a transformer model? If I'm not mistaken, GPT-3 uses something like 50 PB of text? But maybe it gets 'decent' results with much less? I just don't want to fall into the trap of the small business owner who hires a data scientist and asks her to use deep learning for their 130-entry customer database (which I've encountered before). But like, 1M tokens? 10M?

1

tdgros t1_j2nfzj6 wrote

A hypernetwork is a term used when one network outputs the coefficients of another network.

Sensor fusion is typically used with low-level sensors that are noisy, biased, limited in their dynamics... but that can complement each other, i.e. be "fused". For UAV navigation, we fuse accelerometers, gyros, pressure sensors, GPS and vision...

2

No_Remote5392 t1_j2nltj5 wrote

Hello, I'm trying to develop a 1D CNN with gene expression as input to predict cancer type.
The problem is that my labels are very imbalanced, and I am wondering what I should do:

  • Squamous cell carcinoma, NOS: 368
  • Transitional cell carcinoma: 66
  • Papillary transistional cell carcinoma: 1
  • Carcinoma NOS: 1
  • Papillary transitional cell carcinoma: 1

What should I do with the labels that have only 1 observation?
Thank you very much

1

v2thegreat t1_j2o5v0y wrote

It can, but I want to know why you want to use transformers in the first place. Having the entire context is important to avoid solving the wrong problem, especially one that might get expensive depending on what you're trying to do

1

v2thegreat t1_j2o816v wrote

Well, to answer your original question: it depends on what problem you're trying to solve!

In theory, yes, you can work with a large corpus of data and a large language model, but as ChatGPT showed us, a larger model won't always do better; fine-tuning might give better results.

I hope this helps!

1

oilfee t1_j2o8mba wrote

I'm interested in numbers, not "it depends". How much data in bytes or tokens would I need for

- text generation

- image generation

- sound generation

- function classes

- protein sequences

- chess games

to achieve some sort of saturation of learnability, like diminishing returns for a given architecture? Is it the same ballpark? Have different dataset sizes been compared with different model sizes?

1

v2thegreat t1_j2oablu wrote

For transformers that's likely a difficult question to answer without experimentation, but I always recommend starting small. It's generally hard enough to go from 0 to 1 without also worrying about scaling things up.

Currently, we're seeing that larger and larger models aren't really slowing down and continue to become more powerful.

I'd say this deserves its own post rather than a simple question.

Good luck and please respond when you end up solving it!

1

Useful-Command-8793 t1_j2osba8 wrote

Looking for a tool/website or approach (basically, point me in the right direction) that I can input several paragraphs of text into and get an output which is similar to those paragraphs.

This might be too specific, but Google has turned up zero answers for me.

0

Odd_Engineer20 t1_j2pfdud wrote

So I'm working on creating a new ML algorithm (I know, stupid), but it's for fun. I'm in the testing phase, and one thing I haven't figured out is how to turn a picture into an (x, y) coordinate. I'm trying to do this with the MNIST dataset but I'm not sure how to go about it.

1

hysse t1_j2qqwsf wrote

Which tool is best for training a tokenizer? The Hugging Face library seems the simplest, but is it the most computationally efficient? If so, what are torchtext, NLTK, etc. useful for?

3

SnowTime11 t1_j2r5jk0 wrote

I have been classifying series of data that are cyclic, as in not exactly periodic but repeating. As a form of data augmentation, I've been trying to separately classify the single cycles rather than the whole series. To get the final class score, I average the scores (before softmax) of the cycles belonging to the same series. This approach seems to yield very good results, I believe for a few reasons:

  • Smaller input data leads to a smaller model, and segmenting the input increases the available data.
  • Focusing on a single period seems to make the classifier highlight better features in saliency maps.
  • Combining the outputs of the classifier can be beneficial: if one cycle is corrupted and wrongly classified, the others may compensate for it. This probably happens even when classifying the whole time series, but with the segmentation it is more explicit.

Has this been done in any other work? Am I falling into some kind of fallacy by applying this segmentation?

1

emmytau t1_j2r7cbc wrote

Is it problematic that my BART summarization model's training loss drops below the validation loss? I could, for example, stop the training after just 2 epochs. However, it would be nice to train for more epochs; maybe that would just require more data, or do you have any training-argument suggestions?

See graph of training- and validation loss https://imgur.com/mF7Frfd

Model here: https://huggingface.co/emmyapi/distilbart-podimo-data-eval-2

2

Own_Neighborhood_773 t1_j2uv5mk wrote

Hi! I'm a video editor and I have several videos of a person walking that the client wants stitched together, almost like a timelapse, except each shot is handheld, with different clothing, lighting, etc. They want this sequence to change shots seamlessly every step. I've done similar things before, but they have a few hundred clips. Is there a way/service I can use that will compare all the videos and identify where each shot is most similar? I've researched, and things like pose estimation, indexers, and analyzers all come up, but I'm computer savvy, not a coder. Thanks for any direction you can give me!

0

Yukary t1_j2v6309 wrote

So we have an NLU model trained using spaCy in Python. We don't know when to stop training. How do we know our model is production-ready? If anyone has experience with this, please tell me. Thanks.

0

Yukary t1_j2v81j9 wrote

Between Node NLP in Node.js and spaCy in Python, which one do you see as more powerful?

0

LoquatFabulous6947 t1_j2w6cpv wrote

Hi guys, I'm not sure if this is a simple question, so I will ask here before making a thread. Can somebody help me create a k-fold cross-validation class from scratch?

0

alcanthro t1_j2xj4kv wrote

How do you make a person: how do you take LLMs and give them memory and understanding, including an understanding of the self and the other?

−1

jakderrida t1_j2zxpxe wrote

The batch size, learning rate, and number of epochs can all affect the model's performance on a smaller dataset. Here are some general guidelines that you can use as a starting point:

Batch size: A smaller batch size can be more appropriate for smaller datasets because it allows the model to make updates based on more diverse data. For example, a batch size of 32 or 64 is a good starting point for a smaller dataset.

Learning rate: The learning rate determines how fast the model updates its weights. A higher learning rate can allow the model to make rapid progress at the beginning of training, but it can also make the model more prone to overfitting. A lower learning rate can make the model's progress slower, but it can also help the model to generalize better to new data. A learning rate in the range of 0.001 to 0.01 is a good starting point for a smaller dataset.

Number of epochs: The number of epochs is the number of times the model sees the entire dataset during training. A smaller dataset may require fewer epochs to prevent overfitting. For example, you may want to start with a small number of epochs (e.g., 10 or 20) and increase it if the model's performance on the validation set is still improving.

Keep in mind that these are just general guidelines, and the optimal batch size, learning rate, and number of epochs will depend on the specific characteristics of your dataset and model. It may be helpful to experiment with different combinations of these hyperparameters to find the best settings for your particular case.
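
If it helps, here's a minimal sketch of sweeping those hyperparameters with scikit-learn's GridSearchCV; the MLPClassifier and the exact values are just placeholders for whatever model you're actually tuning:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Small-dataset sweep over batch size, learning rate, and (roughly) number of epochs
param_grid = {
    "batch_size": [32, 64],
    "learning_rate_init": [0.001, 0.005, 0.01],
    "max_iter": [10, 20],  # kept small to limit overfitting
}
search = GridSearchCV(MLPClassifier(), param_grid, cv=3, scoring="accuracy")
# search.fit(X_train, y_train)   # X_train, y_train: your own data
# print(search.best_params_)
```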

3

jakderrida t1_j2zxul1 wrote

The Hugging Face library is a popular tool for training a tokenizer and is relatively easy to use. It is based on the Transformers library, which is built on top of PyTorch, and it provides a wide range of pre-trained models and tools for natural language processing tasks.

In terms of efficiency, the Hugging Face library should be sufficient for most use cases. However, if you need to train a very large model or you want to optimize the training process for maximum efficiency, you may want to consider using a more specialized library like PyTorch or TensorFlow directly.

Other natural language processing libraries like NLTK (Natural Language Toolkit) and torchtext are also useful for a variety of tasks, such as text preprocessing, part-of-speech tagging, and language modeling. NLTK is a general-purpose library that provides a wide range of tools for working with human language data, while torchtext is a PyTorch library that provides tools for preprocessing and working with text data in PyTorch.

3

jakderrida t1_j2zy3s2 wrote

I would recommend considering the following strategies to handle imbalanced labels in your dataset:

Oversampling: You can oversample the minority classes by generating synthetic examples or by sampling with replacement from the minority classes. This can help to balance the class distribution and improve the model's performance on the minority classes.

Undersampling: You can undersample the majority classes by randomly sampling a smaller number of examples from the majority classes. This can help to balance the class distribution and prevent the model from being biased towards the majority classes.

Weighted loss: You can assign higher weights to the minority classes in the loss function to give them more influence on the model's learning. This can help to balance the class distribution and improve the model's performance on the minority classes.

Class-specific metrics: You can use metrics that are specifically designed to evaluate the model's performance on imbalanced datasets, such as the F1 score or the AUC (Area Under the Curve) of a precision-recall curve.

In your particular case, you may want to consider oversampling or using weighted loss, since you have only one example for some of the minority classes. It may also be helpful to combine these strategies to achieve the best results.
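
For the weighted-loss option, a minimal PyTorch sketch with inverse-frequency weights (the example counts mirror the ones posted above; plug in your real label vector):

```python
import numpy as np
import torch
import torch.nn as nn

# Integer class labels for the training set (hypothetical counts)
labels = np.array([0] * 368 + [1] * 66 + [2] * 1 + [3] * 1 + [4] * 1)

# Inverse-frequency weights: rare classes contribute more to the loss
counts = np.bincount(labels)
weights = counts.sum() / (len(counts) * counts)
class_weights = torch.tensor(weights, dtype=torch.float32)

criterion = nn.CrossEntropyLoss(weight=class_weights)
# loss = criterion(model(x), y)  # use as usual in your training loop
```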

1

jakderrida t1_j2zybhc wrote

There are a few ways to determine when to stop training a natural language understanding (NLU) model:

Monitoring the performance on a validation set: One approach is to monitor the performance of the model on a validation set during training and stop training when the performance on the validation set stops improving or starts to degrade. This can help to prevent overfitting and ensure that the model generalizes well to new data.

Using early stopping: Another approach is to use early stopping, which involves setting a maximum number of epochs and stopping training when the performance on the validation set has not improved for a certain number of epochs. This can help to prevent overfitting by stopping training when the model is no longer making progress.

Using human evaluation: If you have access to human annotators, you can also use human evaluation to determine when the model is ready for production. You can use a subset of your data as a test set and have the annotators evaluate the model's performance on this test set. When the model's performance meets your desired accuracy threshold, you can consider it ready for production.

Ultimately, the best way to determine when a model is production-ready will depend on the specific requirements of your application and the resources available to you. It may be helpful to experiment with different approaches and see which one works best for your particular case.
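
For the early-stopping option, a bare-bones patience-based sketch; the three callables are placeholders for your own training, evaluation, and checkpointing code:

```python
def train_with_early_stopping(train_one_epoch, evaluate, save_checkpoint,
                              max_epochs=100, patience=5):
    """Stop once validation loss hasn't improved for `patience` epochs."""
    best_val_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = evaluate()
        if val_loss < best_val_loss:
            best_val_loss, epochs_without_improvement = val_loss, 0
            save_checkpoint()  # keep the best model seen so far
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Stopping early after epoch {epoch}")
                break
```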

1

banana-apple123 t1_j309whg wrote

So I am trying to reduce the dimensions of my hypothetical data.

I read that PCA is a good tool, but it only works for linear datasets; if the data is nonlinear, autoencoders can do a better job.

First of all, how does one determine whether their data is linear? Do I just plot the features against each other and see if they form a straight line?

Second, ignoring compute limitations, are autoencoders better than PCA for nonlinear data?

Thanks for any comments and help!

1

hysse t1_j30tpx8 wrote

Thanks for the answer. I need to train a relatively large model and I need an efficient tokenizer.

I don't see how a tokenizer written in PyTorch (or TensorFlow) can be faster than a Hugging Face tokenizer (for example). Hugging Face has a Rust backend that makes its tokenizers faster, and I guess torchtext has an optimized backend too.

Knowing that the tokenizer runs on the CPU and not the GPU, how could it run faster if I wrote it in PyTorch (or even in plain Python)?

1

jakderrida t1_j30u1jq wrote

To be clear, ChatGPT provided the answer.

Here's how ChatGPT answered your next question...

It is generally true that a Rust backend can provide better performance than a pure Python implementation, especially for a high-throughput task like tokenization. This is because Rust is a systems programming language that is designed to be fast and efficient. It can also make use of low-level optimization techniques that are not possible in Python, such as manual memory management.

That being said, it is still possible to write a fast and efficient tokenizer in Python. There are several ways to do this:

  1. Use optimized libraries: There are several libraries in Python that are optimized for performance, such as NumPy and PyPy. Using these libraries can help to improve the performance of your tokenizer.

  2. Use Cython or Numba: Cython and Numba are tools that can be used to optimize Python code by compiling it to native machine code. This can provide a significant performance boost over pure Python code.

  3. Use multithreading or multiprocessing: If your tokenizer is CPU-bound, you can use multithreading or multiprocessing to parallelize the work and make use of multiple CPU cores.

  4. Use a just-in-time (JIT) compiler: JIT compilers, such as Numba, can analyze your Python code at runtime and generate optimized machine code on the fly. This can provide a significant performance boost over pure Python code.

Overall, there are many ways to optimize the performance of a Python-based tokenizer. It is worth considering using a combination of these techniques to achieve the best results.
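
For instance, a minimal sketch of option 3 using only the standard library; `tokenize_one` is a placeholder for whatever CPU-bound tokenization you end up using:

```python
from multiprocessing import Pool

def tokenize_one(text):
    # Placeholder: swap in your real tokenizer here
    return text.lower().split()

def tokenize_corpus(texts, workers=4):
    # Split the corpus across CPU cores and tokenize in parallel
    with Pool(processes=workers) as pool:
        return pool.map(tokenize_one, texts, chunksize=1000)

if __name__ == "__main__":
    corpus = ["Some example sentence to tokenize."] * 10_000
    print(tokenize_corpus(corpus)[0])
```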

1

proto-n t1_j30xu2t wrote

Is there anywhere one can download the training dataset of GPT-2 (or an equivalent)? Or do you have to crawl it yourself for legal reasons?

Nvm, after an hour: Common Crawl, OpenWebText2, The Pile.

1

anakin9472 t1_j316b7w wrote

Hey guys, I'm doing research on Named Entity Recognition (NER) in Natural Language Processing (NLP) with Python, and I'm wondering which library is best suited for the task:

  1. Natural Language Toolkit (NLTK)

  2. BERT

  3. Spacy

Or if you guys have better recommendations, would you give me some advice?

1

RedBallG t1_j318vl2 wrote

I recently read the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" and I was fascinated by their masked language modeling method of pre-training. However, trying to implement the method in PyTorch for my own transformer model has been difficult. The paper states:

"In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM."

How is it possible to only consider the masked positions and feed only those outputs of the transformer encoder into the output softmax?

I tried masking the output of the model so that only those positions went into the softmax, but the model learned this and outputted the mask token by default. I felt that wasn't a correct implementation of masked language modeling, so I disregarded it.

1

Lolkac t1_j319nir wrote

Hello all,

I have a business idea that I would like to check with someone who knows machine learning, AI generation of people, and deepfakes. If my current understanding of the industry is correct, it's a billion-dollar industry.

But i want to consult with experts first.

1

comradeswitch t1_j33yet8 wrote

And what you describe can also happen partially, where a model is developed offline that "learns to learn" or simply pretrained on data that's likely to be representative, and then this is placed on the embedded system that has a much simpler learning task or just starts out much closer to optimal.

But I think you nailed it with the last sentence. I need the Scooby Doo meme, where it's "AI on a wearable embedded computer" revealed to have been a Kalman filter all along.

2

comradeswitch t1_j33zsw6 wrote

Do you have data with no cancer? It's going to require careful treatment of the categories with only one example, but one-shot learning is a topic of great research that describes this problem exactly. Starting there should be helpful.

Also, you have "transistional" and "transitional" listed with 1 each; if that typo is in the original data, you should fix it! Then you'll have 2 examples.

Unfortunately, the answer here may be "acquire more data", because you have many categories relative to the total number of samples you have, as well as multiple categories with only 1 example.

1

comradeswitch t1_j341em8 wrote

This is in essence how convolutional neural networks work: most often, looking at small patches of an image with many overlapping windows and the same core model looking at each. Then the same can be done on the outputs over the very small patches to summarize slightly larger patches of the image, and so on. At the end, the output comes from many different analyses of different, overlapping segments of the data considered together.

I'd be wary of creating explicit synthetic examples that contain e.g. exactly one cycle of interest or whatever unless you know for a fact that it's how the model will be evaluated. You can imagine how snipping out a cycle from beginning to end could give an easier problem than taking segments of the same length but with random phase, for example. It may be simpler and more robust to do this in the model directly with convolution and feed in the whole series at once.
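
For what it's worth, a minimal PyTorch sketch of that "feed in the whole series and let the convolutions cover the cycles" idea; the channel counts and kernel sizes are arbitrary placeholders:

```python
import torch
import torch.nn as nn

class CyclicSeriesClassifier(nn.Module):
    """Sees the whole series; the convolutions cover overlapping windows of it."""
    def __init__(self, n_classes, in_channels=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 16, kernel_size=15, padding=7), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=15, padding=7), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pools over however many cycles there are
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):  # x: (batch, channels, series_length)
        return self.classifier(self.features(x).squeeze(-1))

# model = CyclicSeriesClassifier(n_classes=5)
# logits = model(torch.randn(8, 1, 1000))
```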

1

euphoriation t1_j349t7o wrote

In order to find text similarity by comparing string embeddings, is it necessary to use a vector database? Alternatively, could the same results be achieved by averaging the embeddings of a set of strings and then calculating the distance between the average and the embedding of another string? In this context, would a vector database provide any additional benefits, or is it possible to achieve the same results without one? Additionally, I am wondering if the pricing of vector database solutions such as Pinecone and Milvus is justified for my use case, or if there are other more cost-effective options available.

1

thchang-opt t1_j34lb34 wrote

How do I set PyTorch to use double precision for all layers and calculations when creating my torch.nn model?

1

sdw23 t1_j37cbqs wrote

Hi, is there any free tool to convert audio/video to text (as subtitles), locally or in the cloud? (It must support Japanese and English.)

1

throwaway2676 t1_j39vamk wrote

Is an embedding layer (or at least a simple/standard one) the same thing as a fully connected layer from one-hot encoded tokens to a hidden layer of length <embedding dimension>? The token embeddings would be the weight matrix, but with the biases set to 0.

3

idonthaveenoughchara t1_j3b4q6p wrote

Does anyone here happen to have a trained GANILLA model and would you be able to translate a few images for me into stylised images? I’m making a book for my niece’s 5th birthday and would love to have the images of our holiday as if an illustrator drew them. I would do this myself but my laptop has unfortunately given up on me - any help or advice would also be appreciated :)

1

fr4nl4u t1_j3cwune wrote

I have to accelerate the labelling process for a collection of sounds. To do so, I would like to build a representation from the audio data and compute distances/find clusters. Do you know the most common representations used and/or possible embedding techniques?

1

trnka t1_j3g5f4a wrote

Yeah that's pretty common. If you'd like to do more machine learning, as your team and company grows you might try asking your boss to hire more SDEs so that you can spend more time with machine learning. Or alternatively, ask for more training so that the backend engineering goes more quickly.

As for "keeping up with the field", I don't recommend worrying about it. It's challenging, maybe impossible, to actually stay up to date on everything even if it's only ML. I find it's better to make a habit of learning something every day, however small, and focus on the growth aspect rather than some sense of "falling behind".

1

trnka t1_j3g5uer wrote

You're right that it's just a matrix multiply of a one-hot encoding. Though representing it as an embedding layer is just faster.

I wouldn't call it a fully-connected layer though. In a fully-connected layer, the input to the matrix multiply is the output of everything in the previous layer, not just the output of a single unit. The weights that multiply the output(s) of the first unit are not the same weights multiplying the output of any other unit.

It's more like a length 1 convolution that projects the one-hot vocab down to the embedding space.
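
And a tiny sanity check of the original point, that the embedding lookup and the one-hot matrix multiply give identical vectors:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim = 10, 4
emb = nn.Embedding(vocab_size, embed_dim)

tokens = torch.tensor([3, 7, 7, 0])
one_hot = F.one_hot(tokens, num_classes=vocab_size).float()

via_embedding = emb(tokens)        # lookup
via_matmul = one_hot @ emb.weight  # one-hot times the weight matrix

print(torch.allclose(via_embedding, via_matmul))  # True
```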

3

trnka t1_j3g6cax wrote

It depends on what you want to do:

  • If you just want to apply NER, I'd recommend Spacy because it's fast and they have pretrained models for many languages.
  • If you're looking to fine-tune or train your own NER, either Spacy or Huggingface to use BERT.
  • If you're looking to build your own neural network architecture for NER, PyTorch is most popular.

1

PleasantInspection12 t1_j3gkrqy wrote

Hi,
I am currently pursuing my undergraduate degree in CS. I am very interested in ML and want to pursue a career in this field as an ML Engineer. I am currently learning ML and building a few projects alongside.
However, I want to know how realistic it is to get a job as an MLE right after a Bachelor's degree (I certainly don't want to be jobless after graduation, even though I love ML). I would really like to hear about other members' experiences with this.

1

rudtjeban t1_j3gto4q wrote

So, I am very new to AI and would just like to learn the basics from the perspective of an AI user (not a developer). I found the MLPerf AI benchmark v2.1 results here: mlcommons.org/en/training-normal-21/
But there are so many different numbers, which confuses me. At the top of the table it says "benchmark results (minutes)", but what does that mean? Does a higher number equal better performance, or is it the opposite?
The reason I am confused is that all the tech media said the Nvidia H100 GPU outperforms everyone in MLPerf's v2.1 results,
but the table on the website above shows that the numbers in the Nvidia H100 rows are neither the highest nor the lowest in many of the categories. Can someone tell me how to read the numbers properly and which ones to look out for?

1

throwaway2676 t1_j3h780s wrote

> In a fully-connected layer, the input to the matrix multiply is the output of everything in the previous layer, not just the output of a single unit.

But if the previous layer is 0 everywhere except for one unit, the result is the same, no?

My mental picture is that input layer 0 has V = <token vocabulary size> neurons, and layer 1 has E_d = <embedding dimension> neurons. Layer 0 is 1 in 1 neuron, 0 everywhere else, as one-hot encoding normally goes. The embedding layer 1 is then given by x@W, where x is the layer 0 as a row vector, and W is the weight matrix with dimensions V x E_d. The matrix multiplication then "picks out" the desired row. That would be a fully connected linear layer with no bias.

1

Ralen_Hlaalo t1_j3h9uzg wrote

What are the best resources to get into AI as someone who is already a professional software engineer? A lot of the courses and tutorials seem targeted at complete beginners.

1

Ellianel t1_j3hc1a5 wrote

While reviewing works concerning automatic fake news detection, I discovered that some papers tend to divide the topic into two approaches: data-mining-oriented and NLP-oriented (both using ML).

I'm not sure what the difference is, since NLP can also use hand-crafted features obtained by data mining. Can someone explain to me how these approaches differ?

1

trnka t1_j3i2zxx wrote

You don't need to choose, and there's definitely a market for people that are capable of both good software engineering and good machine learning. Personally I'm a big believer in being well-rounded in terms of skills.

If I had to guess, what you're saying might just mean that you have more to learn about software engineering than machine learning right now. And that'll change over time.

1

trnka t1_j3i3vk4 wrote

If your input is only ever a single word, that's right.

Usually people work with texts, or sequences of words. The embedding layer maps the sequence of words to a sequence of embedding vectors. It could be implemented as a sequence of one-hot encodings multiplied by the same W though.

2

mbrtlchouia t1_j3if998 wrote

I want to learn deep learning... Do I need to start with classical machine learning first and then head to DL? And what course/book do you recommend?

1

debrises t1_j3irf9s wrote

>What are techniques or best practices for detecting/segmenting large objects in high resolution images? Some problems I run into are training with large image chip sizes

The first thing that came to my mind was gradient accumulation if you have limited GPU memory. Fitting an image of that size on a single GPU could result in a very small batch, which is not so good for training speed and stability.

PyTorch Lightning offers such a feature if you're using PyTorch.
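
In plain PyTorch the idea looks roughly like this (a sketch; your own model, optimizer, and dataloader go in as-is), and I believe Lightning's `Trainer(accumulate_grad_batches=...)` flag does the same thing for you:

```python
def train_epoch_with_accumulation(model, optimizer, dataloader, accumulation_steps=8):
    """Effective batch size = accumulation_steps * the loader's batch size."""
    optimizer.zero_grad()
    for step, (images, targets) in enumerate(dataloader):
        loss = model(images, targets)           # placeholder for your model's loss
        (loss / accumulation_steps).backward()  # scale so gradients average correctly
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```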

1

debrises t1_j3itq4j wrote

Larger batch sizes lead to better gradient estimates, meaning optimizer steps tend to be in the "right" direction, which leads to faster convergence.

Run a test epoch to see when your model converges, and then use slightly more epochs so that your model can try to find different minimum points. And use a model checkpoint callback.

As for the optimizer, just use one from the Adam family, like AdamW. It handles most of the problems that can come up pretty well.

The learning rate heavily depends on what range of values your loss takes. Think about it this way: if your loss is equal to 10, then using an lr of 0.01 gives 10 * 0.01 = 0.1. We then compute partial derivatives with respect to each weight, backpropagate, and update our weights. Usually we want our weights to have small values centered around zero, updated by even smaller values at every step. The point is that your model doesn't know what range of values your loss takes, so you have to tune the learning rate to find the value that connects your loss signal to your weights.

1

Remote_Event_4290 t1_j3jrvah wrote

Hi! I am a student and have been very interested in the ways that bias can be removed from ML datasets, and I have some ideas of how bias could hypothetically be reduced but am by no means an expert. I would greatly appreciate any feedback, recommendations, or additions to some of the ideas that I currently have.

Right now, it seems that there is no specific way to completely remove bias from ML datasets, but I have been attempting to create a hypothetical design or a process to prevent bias as much as possible.

First off, the quality of the raw data is really the most important part of machine learning datasets, but collecting good data is more of a statistical problem. Based on what the learning model is trying to do, you would need to consult with statisticians on determining the quality of the data and if it is even valid, and if you should be generating a random sample, or using all raw data.

As far as the learning model itself, I have formulated a few suggestions for the dataset itself:

  • One of the first ideas that came to mind is excluding 'sensitive' demographic data, like age, sex, race, etc., which may work in certain cases but could also backfire. For example, one way of reducing bias is to use the demographics to pre-filter the data and ensure groups are accurately represented.
  • One thing you can do is create two datasets and run them through a machine learning model, one with the demographics, and one without, and then compare the results, audit for bias, and see if there is anything you can improve.
  • In some cases, it is also possible to only include variables relevant to the topic, but ultimately could be harmful as you lose more and more data points.
  • It's also possible that you could pick a subset of the data to do things like, ensure minority populations were represented or alternatively create a dataset to represent each option, run each through the model with known outcomes, and evaluate and/or train it against itself.

I also found that there must be input and opinions on the dataset from multiple professionals of different backgrounds, to prevent bias from the creator. Most importantly, there must always be frequent checkups to monitor whether any bias has arisen and, if so, ways it can be removed.

Does anyone have any feedback or suggestions for me?

1

tridentsaredope t1_j3jwot4 wrote

How can I store the information needed to regenerate my features?

Let's say I have a feature f0 that was generated by a function foo with the inputs foo(a,b,c). I store the feature once it is created but if new data becomes available I want to update the feature.

I thought to do a simple table with [name, function, inputs] for the rows but I'm not sure this is the best method. Is there a standard practice for this regeneration of features?

1

RandomScriptingQs t1_j3kmc1x wrote

I'm only peripherally involved with ML/AI in that I try and apply some helpful techniques to biological problems but recently I have enjoyed listening to discussions around AGI but most of the papers I've come across from a quick google scholar search seem to be *about* AGI and not attempts at implementing something closer to, or approaching, AGI.
Is that a fair assessment? Has my lack of depth in the field given me a false initial glance?
Are there any authors/labs working on AGI in particular whose papers you would recommend reading?
e.g. "Artificial General Intelligence vs. Industry 4.0: Do They Need Each Other?", "Deep Learning and Artificial General Intelligence: Still a Long Way to Go", "Why general artificial intelligence will not be realized", and, "Approaches to Artificial General Intelligence: An Analysis", all seem to be about AGI in contrast to, "Towards artificial general intelligence via a multimodal foundation model", which attempts to implement something.
Full disclosure: I haven't read these papers yet. I am trying to find good, reputable papers to read.

1

trnka t1_j3ldgc9 wrote

Microsoft has a good checklist to consider if you haven't seen it.

There are many publications on fairness nowadays, so I'd also suggest reading some of the survey papers that have a good number of citations.

I'm pretty sure there are many workshops and conferences on fairness in AI nowadays too that would be good for ideas. There are even ML toolkits to help detect or reduce bias these days, so those would be good to search for.

Hope this helps! Fairness has become a pretty big area over the last several years

1

cborja36 t1_j3nzxjv wrote

What should you do when your model fails the test set? That is, the test set is supposed to give you an unbiased view of how your model should behave with real-world data, but the moment that test set prevents you from deploying a model and forces you to improve it, it is no longer unbiased. And if this keeps happening, what is the difference between a test set and a validation set?

1

LetGoAndBeReal t1_j3oit19 wrote

How should I think about the way a large language model gains new specific knowledge? For example, suppose you have a model trained on hundreds of gigabytes of text and then want to continue its training to gain knowledge of a single specific fact it has not yet encountered such as “Steven Pinker is the author of The Language Instinct.”

I imagine that presenting it with a single sentence such as this embedded in a training set would contribute very little to its ability to subsequently answer the question “Who was the author of The Language Instinct?” Is that correct?

Is there some heuristic for how many exposures a model like GPT3.5 would need to a new fact, as such, before its weights and biases were adjusted enough to embody this fact?

1

Mountain_Past_6513 t1_j3qddlc wrote

My company has budgeted $2000 for a GPU for NLP fine-tuning tasks. What are my options? I would prefer a professional card instead of the gaming GPUs, whichever fits in the budget. Thanks.

1

leonardokoen t1_j3qiyt4 wrote

I want to build a surrogate model and perform sensitivity analysis for a heavy simulation. The number of inputs is 5 and the number of simulations is ~80. The question is: should I perform sensitivity analysis using the Morris method (with optimal trajectories and simplexes) and then build the surrogate, or describe the input space better with Latin Hypercube Sampling, build the surrogate model, and perform sensitivity analysis on the surrogate using Sobol indices? If there are any papers on this issue, let me know...

1

lilpolymorph t1_j3qrnoh wrote

I don't understand: I'm supposed to perform preprocessing and feature selection on my training dataset only, so as to prevent data leakage, but when I try to use my classifiers in Python they expect my train and validation sets to have the same dimensions. Of course they don't anymore if I only preprocess the training set. What do I have to do?

1

Asheradd0 t1_j3qshxs wrote

I want to increase the resolution of the output images. I want to implement the progressive growing of GANs approach with a pre-trained model, but it is complicated since the pre-trained model contains two encoders (one for faces and one for voices), a decoder, and a discriminator.

How should I update the actual architecture/code to reach my goal?

PS: I saw Medium posts about this topic, but they don't apply here because they create their generator/discriminator from scratch.

1

LetGoAndBeReal t1_j3r0p45 wrote

Thank you for this. It seems this paper could surely help answer my question, if only I could understand it!

A challenge I keep coming up against in my quest to quickly learn about ML/NN is that almost everything I read is either too high level to provide meaningful explanation or too technically dense for me to follow. I guess I will just take note of this paper for now and circle back to it when I'm a bit further along.

1

trnka t1_j3t1t18 wrote

If you're doing the preprocessing and feature selection manually (meaning without the use of a library), yeah that's a pain.

If you're using sklearn, generally if you do all your preprocessing and feature selection with their classes in a sklearn pipeline you should be good. For example, if your input data is a pandas dataframe you can use a ColumnTransformer to tell it which columns to preprocess in which ways, such as a OneHotEncoder on categorical columns. Then you can follow it up with feature selection before your model.

Sklearn's classes are implemented so that they only train the preprocessing and feature selection on the training data.
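
A minimal sketch of that setup; the column names and the estimator are made-up placeholders:

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["color", "city"]),
    ("numeric", StandardScaler(), ["age", "income"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectKBest(f_classif, k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])

# pipeline.fit(train_df, train_labels)       # preprocessing is fit on train only
# predictions = pipeline.predict(valid_df)   # the same transforms are applied here
```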

1

psy_cho_path t1_j3tai3g wrote

Do I need to learn the mathematical foundations of AI/ML before learning ML?

1

LifeguardPrudent7217 t1_j3tjdsz wrote

Is it a good idea to use a confusion matrix for a multi-class problem? I'm talking about 7 classes. It just looks a bit confusing...

1

bundapeste07 t1_j3u4y1r wrote

Hey guys! Is there an AI that can add objects into an image, mainly house/construction-related ones? For example, "Add a window to this house" along with a specific house photo?

1

trnka t1_j3vcyo3 wrote

Yeah, it can be helpful even if you can't easily read the axes. I've even found it helpful for a 50-class problem. It helped me quickly see that the model was over-predicting the top few classes, which showed up as vertical bands.

2

TiredMoose69 t1_j3w9j4x wrote

I would like to start a project, possibly using OpenAI's API, to make a GPT-3-based bot fine-tuned on Messenger/WhatsApp chat logs of mine. Any suggestions on which model to use?

From what I see, I have around 100M tokens for it to learn from. I am currently working on reformatting the JSON and TXT files of the exported data, but I am confused about which model to use. I think davinci003 is overkill for something like that.

(My data is in Greek. I used a GPT-2-Small-125M model trained for many hours in a Colab notebook, and the results were not that great :D, which is why I wanted to try a bigger model.)

Ideally I want a chatbot that you ask something and it replies as the other person (my friend) would.

Do you think it's possible to train it on my local PC (RTX 3060 Ti) for privacy reasons?

Any help/suggestions would be highly appreciated!

1

cdrn83 t1_j3x455a wrote

I have a bunch of Word documents (more or less in the same format). What tool/service would you recommend to make something learn from all those documents so that I can ask it questions about the content?

1

Tart_Beginning t1_j3zub5s wrote

Is it true that learning rate matters less if you’re using an adaptive optimizer? If so, would you argue for or against using learning rate decay, and why?

1

Tart_Beginning t1_j3zutzn wrote

What is the difference between fine-tuning and transfer learning? Can you do deep learning without either of those things?

1

Tart_Beginning t1_j3zwkgv wrote

I have mixed feelings about this. I am getting my bachelor’s in CS in May and have (ENTIRELY unexpectedly) gotten to the final round of interviews for an MLE job; I have done no formal course work on anything AI related, but I have had two internships in ML. I feel like I have sold myself short for the software engineer, data scientist, or even MLE I could have been with a bit more work by BS-ing my way into a field I barely understand, and I know if I do somehow get this job, I will be scrambling to learn everything I don’t know on the fly. Having bombed plenty of interviews so far, it’s become painfully clear that this might have been a mistake. If you’re going to do ML, dive in head first and try to deeply understand it - take any course you can on it and start reading papers NOW to best prepare yourself for getting a job when you graduate. It IS possible, but you also want to be good at what you do and not end up in way over your head. I know I’m headed for at least a master’s degree later even if I get this job because this field is evolving too quickly for me to grasp the major concepts and keep up with the current science. Sorry if this is rambling, these are just my thoughts being in the position I am currently, hopefully this is helpful in some way!

1

trnka t1_j40i4fa wrote

Fine-tuning is when you take a pretrained network, change the output layer only, and run the optimizer a little more.

Transfer learning is when you take any sort of pretraining. Fine-tuning is one example of transfer learning. Using pretrained word embeddings is another example of transfer learning.

You can do deep learning without either. It's just that existing pretrained models and components are so good that it's tough to build a competitive model without either.
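
A rough sketch of what fine-tuning looks like in PyTorch, using a torchvision ResNet as an arbitrary example of a pretrained network (assumes a recent torchvision):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="DEFAULT")  # pretrained backbone

# Freeze the pretrained weights...
for param in model.parameters():
    param.requires_grad = False

# ...then replace the output layer for the new task and train only that layer
model.fc = nn.Linear(model.fc.in_features, 10)  # e.g. 10 new classes
```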

2

SpaceBoy4984 t1_j43o134 wrote

Hi, I don't know if this is the right place to ask this, but what book should I study to learn the theoretical math behind machine learning? I've implemented ML in code, so I have a good grasp of the "big picture", but I want to start reading research papers and understanding the math. I already have a solid understanding of calculus, linear algebra, statistics, and probability. As of now I am considering:

  1. Probabilistic Machine Learning: An Introduction (Adaptive Computation and Machine Learning series) by Kevin P. Murphy
  2. Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz and Shai Ben-David

but I'm not sure which one to get. A recommendation on which one, and why, would be a great help!

I've also had some recommendations for "The Elements of Statistical Learning" by Trevor Hastie and took a look at it, but I want something a bit more advanced.

1

trnka t1_j45klqc wrote

No it's not strictly needed, though I haven't seen a course that teaches ML starting from the application and working backwards to the fundamentals. In teaching that's sometimes called "top down" as opposed to starting from fundamentals.

If you're taking courses, you may need to pick up a bit of math along the way. If you're self-taught, you might try starting with tutorials of ML libraries like scikit-learn and keeping a journal of any terms you need to look up later.

1

JobPsychological5509 t1_j46h1lm wrote

Hi,

I need to build a prediction model using a classification model and a pattern-recognition model cascaded together.

The classification model will have two classes, 0 and 1. It will output a series of 0s and 1s, which will then be fed into the pattern-recognition model. Please let me know if this sounds feasible.

1

farox t1_j46pkao wrote

The company is essentially a grow house. Think growing weed in containers (but that's not it). They want to integrate AI into this whole thing. My understanding is that we will have way too few data points to train any sort of model. In my mind, we could probably use some statistical analysis (if I pee on that plant, two weeks later it's grown 10% more than the un-peed one).

Does that make sense? How best to go about this? Thanks!

1

EdenistTech t1_j46q21e wrote

Does anyone have working example code for the Supervised Clustering algorithms (SPAM, SRIDHCR, and SCEC) by Eick et al.? I haven’t been able to find any online.

1

mildresponse t1_j46sh8k wrote

Why do some tokenizers assign negative floats to each token? For instance, I am looking at this json file, and the tokens start about 1/3 of the way down the page. Each one is part of a two-element list with the structure "[<token>, negative decimal number with 15 digits of accuracy]"

1

theghostofmandela t1_j47f4zf wrote

how do you stay up to date on ML/AI research? what are good feeds to follow?

1

JustHereForATechProb t1_j47yxsr wrote

Hi, I'm making an automatic bookmark organizer.

It consists of two tasks

  • Finding similarity between bookmarks, in order to put them in the same folder. [Solved using the "all-MiniLM-L6-v2" model.]
  • Tagging bookmarks with relevant tags

A bookmark contains:

  • Page title (String)
  • URL (String, regex'd "\W+" filtered)

Right now, I am looking for a model that, given a string, gives back tags. Or to put it another way: given a list/string of words, give back a set of words that generalizes/summarizes said string.

But I don't know what kind of machine learning task that would be categorized as, so I don't know what to search for.

Any suggestions would be most helpful.

1

blaher123 t1_j48jl4q wrote

Does anyone have experience using YouTube videos for text-to-speech / speech-to-text data?

I can get the subtitle data for videos, although they don't make it easy. While the subtitles themselves are accurate, I also need accurate timestamps, and the timestamps from YouTube (which seem to be designed for closed captioning rather than accuracy) are just inaccurate enough to not be useful. Am I just doing things wrong, or is there a way to get accurately timed YouTube transcripts?

1

I-am_Sleepy t1_j49d3mv wrote

FYI, using the output of a first-stage model as the input to a second is called model stacking.

Are you trying to do time-series classification (many-to-one)? I don't know if making it a two-stage model is appropriate, i.e. using 0 and 1 as an intermediate representation.

The classification error will propagate through multiple stages if you use the raw predictions from the previous stage alone. For example, if the first-stage model is 0.9 in accuracy and the second stage is also 0.9, the maximum accuracy of the two-stage model will be 0.9 * 0.9 = 0.81 (performance degrades).

1

Competitive-Net-1483 t1_j4aq1to wrote

What about an APP social-Fi and Web3 sports games running on blockchain, with AI technology as a referee to manage your performance ? dotmoovs

1

two-legged-greek t1_j4btrbx wrote

How is Lensa able to so strongly align the generated facial features with the user's? I've been trying for a while in DreamStudio using my own images, and I can't seem to generate anything decent. I've tried a variety of steps and CFG parameters, and augmented the prompt with keywords like "hyperrealistic", but zilch. Any thoughts?

1

MegavirusOfDoom t1_j4cj1u8 wrote

[D] What is the future of NLP for the coming 24 months? DALL-E clones Midjourney and SD took 6-8 months to appear, so is that how long it will take for clones of ChatGPT? Perhaps less, given the higher investment and market potential?

1

MegavirusOfDoom t1_j4cro9x wrote

Check how the magic wand works in GIMP's open-source code on GitHub. There is probably a lot of specific terminology for these selection algorithms, and once you've found descriptions from people working in the field, you will have access to a lot of their research.

1

PleasantInspection12 t1_j4cykir wrote

Hi, regarding your suggestion to build deep intuition: I am continuously learning (as fast as I can without losing details) through courses, books, and articles. I am also building projects (kinda basic, but it's a start) with the datasets available on Kaggle. I also try to brainstorm in my free time about how and why a given model works.

You mentioned that you had two ML-related roles without any AI-related coursework. I would really like to learn more about your experience. If you don't mind, can I message you privately?

1

TiredMoose69 t1_j4l525j wrote

No :( But I did train a GPT-2 355M model on chatbot-like data. The output was fun but not that great hahaha

I am now looking into something like this:

https://github.com/daveshap/LongtermChatExternalSources

I think I will use the OpenAI API to load messages like this so that it can "remember" them every time I prompt it. If you're interested in working on something similar, PM me and we can share ideas.

1