qalis
qalis t1_j9y4c1m wrote
Yes, absolutely, for any size of dataset and model this is strictly necessary. You can use cross-validation, Leave-One-Out CV, or bootstrap techniques (e.g. the 0.632+ bootstrap). You don't need to validate if you don't have any hyperparameters, but this is very rarely the case; the only examples I can think of are Random Forest and Extremely Randomized Trees, where a sufficiently large number of trees is typically enough.
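A minimal sketch of what this looks like with scikit-learn (the dataset and hyperparameter grid are just placeholders): plain k-fold CV to evaluate a model, and nested CV when hyperparameters are tuned so the tuning does not bias the estimate.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# plain 5-fold CV: estimate of generalization performance without tuning
rf = RandomForestClassifier(n_estimators=500, random_state=0)
scores = cross_val_score(rf, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# nested CV: the inner loop tunes hyperparameters, the outer loop evaluates,
# so the performance estimate is not biased by the tuning itself
inner = GridSearchCV(RandomForestClassifier(random_state=0),
                     {"max_features": ["sqrt", "log2", None]}, cv=3)
nested_scores = cross_val_score(inner, X, y, cv=5)
print(f"Nested CV accuracy: {nested_scores.mean():.3f}")
```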
qalis t1_j8r97h8 wrote
Reply to comment by krumb0y in [D] HuggingFace considered harmful to the community. /rant by drinkingsomuchcoffee
I do make PRs for those things. The average waiting time for a review is a few months. The average time until the change is actually released is even longer. I both support and criticize Huggingface.
qalis t1_j8r6o9x wrote
Completely agree. Their "side libraries", such as Optimum, are even worse. The design decisions there are not just questionable, they are outright stupid at times. Like forcing the input to be a PyTorch tensor... and then converting it to a Numpy array inside. Without any option to pass a Numpy array directly. Even first-time interns at my company tend not to make such mistakes.
qalis t1_j8driqb wrote
I am working in this field for my PhD, so I think I can help.
A bit of self-promotion, but my Master's thesis was about GNNs: https://arxiv.org/abs/2211.03666. It should be very beginner-friendly, since I had to write it while also learning about this field step by step.
"Introduction to. Graph Neural Networks". Zhiyuan Liu and Jie Zhou. Tsinghua University is slightly outdated due to how fast this field is going on, but good intro.
"Graph Neural Networks Foundations, Frontiers, and Applications" (https://graph-neural-networks.github.io/) is cutting-edge, good reviews. I haven't read it though, but looks very promising.
Overviews and articles are also great, e.g. https://distill.pub/2021/gnn-intro/ or the well-known (in this field) https://arxiv.org/abs/1901.00596. You should also definitely read the papers about GCN (very intuitively written), GAT, GraphSAGE and GIN, the 4 most classic graph convolution architectures.
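To make the graph convolution idea concrete, here is a minimal sketch of a GCN-style graph classifier using PyTorch Geometric (the layer sizes and mean-pool readout are arbitrary choices for illustration, not a recommendation from any of the papers above):

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class SimpleGCN(nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x, edge_index, batch):
        # two rounds of neighborhood aggregation (message passing)
        x = torch.relu(self.conv1(x, edge_index))
        x = torch.relu(self.conv2(x, edge_index))
        # readout: pool node embeddings into a single graph-level embedding
        x = global_mean_pool(x, batch)
        return self.head(x)
```

GAT, GraphSAGE and GIN follow the same pattern, differing mainly in how the neighborhood aggregation step is defined.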
Fair comparison is, unfortunately, not common in this field. Many well-known works, e.g. GIN, do not even use a test set, and are quite unclear about this, so approach every paper with a lot of suspicion. This paper about fair comparison is becoming more and more widely used: https://arxiv.org/abs/1912.09893. This baseline, not a GNN but similar, gives very strong results: https://arxiv.org/abs/1811.03508. I will be releasing a paper about a related method, LTP (Local Topological Profile); you can look out for it later this year.
Other interesting architectures to read about: graph transformers, Simple Graph Convolution (SGC), DiffPool, gPool, PinSAGE, DimeNet.
This very exciting area is just starting to develop, despite a lot of work already done. There is no well-working way to do transfer learning, for example. It is very hard to predict what will happen in 4-5 years, but e.g. Google Maps travel time prediction is currently based on GAT, and Pinterest recommendations on PinSAGE, so graph-based ML is already used in large-scale production systems. Those methods are also more and more commonly used in the biological sciences, where molecular data is ubiquitous.
qalis t1_j8csdd1 wrote
Reply to [D] Quality of posts in this sub going down by MurlocXYZ
On a related note, can anyone recommend more technical or research-oriented ML subreddits? I already unsubscribed from r/Python due to the sheer amount of low-effort spam questions, and I am considering doing the same for r/MachineLearning for the same reason.
qalis t1_j6xkna5 wrote
Reply to [D] PC takes a long time to execute code, possibility to use a cloud/external device? by Emergency-Dig-5262
If you are tuning hyperparameters for a RF for 21 hours, you are doing something wrong. RFs often do not require any tuning at all! Additionally, are you using all available cores? Are you using a better HPO algorithm like TPE in Optuna or Hyperopt?
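For reference, a minimal sketch of TPE-based tuning of a Random Forest with Optuna, using all available cores (the dataset and search ranges are just placeholders):

```python
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def objective(trial):
    model = RandomForestClassifier(
        n_estimators=trial.suggest_int("n_estimators", 100, 1000),
        max_features=trial.suggest_categorical("max_features", ["sqrt", "log2", None]),
        min_samples_leaf=trial.suggest_int("min_samples_leaf", 1, 10),
        n_jobs=-1,  # use all available cores for training
        random_state=0,
    )
    return cross_val_score(model, X, y, cv=3, n_jobs=-1).mean()

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=30)
print(study.best_params)
```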
qalis t1_j6o79xv wrote
Reply to comment by Internal-Diet-514 in [D] Have researchers given up on traditional machine learning methods? by fujidaiti
A better distinction would be that deep learning excels in applications that require representation learning, i.e. a transformation from domains that do not lie in a Euclidean metric space (e.g. graphs) or that are too problematic in their raw form and require processing in another domain (e.g. images, audio). This is very similar to feature extraction, but representation learning is a slightly more general term.
Tabular ML does not need this in general: after obtaining feature vectors we already have a representation, and a deep model like an MLP can only apply a (highly) nonlinear transformation of that space, instead of really learning fundamentally new representations of the data. That is the case e.g. for images, where we go from the space of raw pixel values to a vector space that captures the semantic features of the image.
qalis t1_j6o6jjv wrote
Reply to comment by aschroeder91 in [D] Have researchers given up on traditional machine learning methods? by fujidaiti
The simplest way to do this is to combine autoencoders (e.g. VAEs) and boosting; I have seen this multiple times on Kaggle.
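A rough sketch of the idea (using a plain autoencoder rather than a VAE, with toy data and arbitrary layer sizes): train the autoencoder for reconstruction, then fit a boosting model on the learned latent codes.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import GradientBoostingClassifier

# toy data: 1000 samples, 50 features (placeholder shapes for illustration)
X = np.random.rand(1000, 50).astype(np.float32)
y = np.random.randint(0, 2, size=1000)

# simple (non-variational) autoencoder; a VAE would add a KL term to the loss
encoder = nn.Sequential(nn.Linear(50, 16), nn.ReLU(), nn.Linear(16, 8))
decoder = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 50))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

X_t = torch.from_numpy(X)
for epoch in range(20):
    opt.zero_grad()
    recon = decoder(encoder(X_t))
    loss = loss_fn(recon, X_t)
    loss.backward()
    opt.step()

# use the learned latent codes as features for the boosting model
with torch.no_grad():
    Z = encoder(X_t).numpy()
clf = GradientBoostingClassifier().fit(Z, y)
```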
qalis t1_j6o6cou wrote
Reply to comment by nucLeaRStarcraft in [D] Have researchers given up on traditional machine learning methods? by fujidaiti
That's a nice paper. There is also an interesting, but very niche, line of work on using gradient boosting as a classification head for neural networks. The gradient flows through it normally; after all, tree addition is simply used instead of gradient descent steps. Sadly, I could not find any trustworthy open-sourced implementation of this approach. If it works, it could bridge the gap between deep learning and boosting models.
qalis t1_j6o5zha wrote
Reply to comment by coffeecoffeecoffeee in [D] Have researchers given up on traditional machine learning methods? by fujidaiti
Yeah, I like her works. The iModels library (linked in my comment under the "rule-based learning" link) is also written by her coworkers IIRC, or at least implements a lot of models from her works. Although I disagree with her arguments in "Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead", the paper she is arguably best known for.
qalis t1_j6mmvwg wrote
Reply to comment by silentsnake in [D] Have researchers given up on traditional machine learning methods? by fujidaiti
Absolutely. OP's question was about research, so I did not include this, but it's absolutely true. It also makes sense - everyone has relational DBs, they are cheap and scalable, so chances are a business already has quite reasonable data for ML just waiting in their tabular database. This, of course, means money, which means money for research, even in-company research that may never be published, but is research nonetheless.
qalis t1_j6mczg1 wrote
Absolutely not! There is still a lot of research going into traditional ML methods. For tabular data, it is typically vastly superior to deep learning. Boosting models in particular receive a lot of attention due to the very good implementations available. See for example:
- SketchBoost, CuPy-based boosting from NeurIPS 2022, aimed at incredibly fast multioutput classification
- A Short Chronology Of Deep Learning For Tabular Data by Sebastian Raschka, a great literature overview of deep learning on tabular data; spoiler: it does not work, and XGBoost or similar models are just better
- in time series forecasting, LightGBM-based ensembles typically beat all deep learning methods while being much faster to train; see e.g. this paper, and you can also see it in Kaggle competitions and other papers; my friend works in this area at NVidia and their internal benchmarks (soon to be published) show that the top 8 models in a large-scale comparison are in fact various LightGBM ensemble variants, not deep learning models (which, in fact, kinda disappointed them, since it's, you know, NVidia); a minimal sketch of this approach follows after this list
- all domains requiring high interpretability largely ignore deep learning altogether and put their research into traditional ML; see e.g. counterfactual examples, important interpretability methods in finance, or rule-based learning, important in medical or law applications
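As referenced in the time series point above, a minimal sketch of the usual recipe: turn the series into a tabular problem with lag features and fit a LightGBM regressor (toy data, arbitrary hyperparameters, no ensembling; real pipelines would add calendar features, multiple models, etc.).

```python
import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor

# toy univariate series (placeholder data for illustration)
rng = np.random.default_rng(0)
y = pd.Series(np.sin(np.arange(500) / 10) + 0.1 * rng.standard_normal(500))

# turn forecasting into tabular regression with lag features
def make_lags(series, n_lags=12):
    df = pd.DataFrame({f"lag_{i}": series.shift(i) for i in range(1, n_lags + 1)})
    df["target"] = series
    return df.dropna()

data = make_lags(y)
X, target = data.drop(columns="target"), data["target"]
train, test = X.index < 400, X.index >= 400

model = LGBMRegressor(n_estimators=300, learning_rate=0.05)
model.fit(X[train], target[train])
preds = model.predict(X[test])
```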
qalis t1_j6mbu5s wrote
Reply to [Discussion] Misinformation about ChatGPT and ML in media and where to find good sources of information by Silvestron
I recently compiled and went through a reading / watching list, going from basic NLP to ChatGPT:
- NLP Demystified to learn NLP, especially transformers
- Medium article nicely summarizing the main points of GPT-1, 2 and 3
- GPT-1 lecture and GPT-1 paper to learn about the general idea of GPT-like models
- GPT-2 lecture and GPT-2 paper to learn about large scale self-supervised pretraining that fuels GPT training
- GPT-3 lecture 1 and GPT-3 lecture 2 and GPT-3 paper to learn about GPT-3
- InstructGPT page and InstructGPT paper to learn about InstructGPT, the sibling model of ChatGPT; as far as I understand, this is the same as "GPT-3.5"
- ChatGPT page to learn about differences between InstructGPT and ChatGPT, which are relatively small as far as I understand; it is also sometimes called "fine-tuned GPT-3.5", AFAIK
Bonus reading (heavy math warning, experience with RL required!):
- the main difference between GPT-3 and InstructGPT/ChatGPT is reinforcement learning with human feedback (RLHF)
- RLHF is based on the Proximal Policy Optimization (PPO) algorithm
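For intuition, a minimal sketch of the clipped surrogate objective at the heart of PPO (simplified; a full RLHF setup also adds a learned reward model and a KL penalty against the original model):

```python
import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # probability ratio between the updated policy and the policy that collected the data
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    # clipping removes the incentive to move the policy too far in a single update
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the clipped objective, so the loss is its negative
    return -torch.min(unclipped, clipped).mean()
```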
qalis t1_j6ir4fh wrote
Reply to comment by RogerKrowiak in [D] Simple Questions Thread by AutoModerator
Yes, you can. Variables in tabular learning are (in general) independent in terms of preprocessing. In fact, in most cases you will apply such different preprocessings, e.g. one-hot + SVD for high-cardinality categorical variables, binary encoding for simple binary choices, and integer encoding for ordinal variables.
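A minimal sketch of mixing per-column preprocessings with scikit-learn's ColumnTransformer (the column names here are hypothetical):

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

preprocess = ColumnTransformer([
    # high-cardinality categoricals: one-hot, then SVD to compress the sparse matrix
    ("high_card", make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                                TruncatedSVD(n_components=10)), ["city", "product_id"]),
    # simple binary columns: plain 0/1 encoding
    ("binary", OrdinalEncoder(), ["is_member"]),
    # ordinal columns with a known category order
    ("ordinal", OrdinalEncoder(categories=[["low", "medium", "high"]]), ["priority"]),
], remainder="passthrough")
```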
qalis t1_j6iqvql wrote
Reply to comment by grenouillefolle in [D] Simple Questions Thread by AutoModerator
Somewhat more limited than your question, but I know of two such papers: "Tunability: Importance of Hyperparameters of Machine Learning Algorithms" by P. Probst et al., and "Hyperparameters and tuning strategies for random forest" by P. Probst et al.
Both are on arXiv. The first one concerns the tunability of multiple ML algorithms, i.e. how sensitive they are in general to hyperparameter choice. The second one delves deeper into the same area, but specifically for random forests, gathering results from many other works. Using those ideas, I was able to dramatically decrease the computational resources needed for tuning by better designing hyperparameter grids.
qalis t1_j5ukjvr wrote
ChatGPT does NOT retrieve any data at all from the internet. It merely remembers statistical patterns of which words come one after another in typical texts. It has no knowledge of facts and no means to get them whatsoever. It was also trained on data only up to 2021, so there is no training data after that point. There was an older attempt with WebGPT, but it did not get anywhere AFAIK.
What you need is a semantic search model, which summarizes the semantic information in texts as vectors and then performs vector search based on your query. You can use a transformer-based model for text vectorization, of course, which may work reasonably well. For specific searches, however, I am pretty sure that in your use case regexes will be just fine.
If you are sure that you need semantic search, use a domain-specific model like SciBERT for best results, or fine-tune some pretrained model from Huggingface.
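A minimal sketch of such a semantic search with the sentence-transformers library (the model name and corpus are just placeholders; a domain-specific or fine-tuned model would slot in the same way):

```python
from sentence_transformers import SentenceTransformer, util

# any pretrained sentence-embedding model works here; this is just a common lightweight choice
model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = ["Paper on graph neural networks for molecules",
          "Notes on relational database indexing",
          "Tutorial on gradient boosting for tabular data"]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

query_emb = model.encode("machine learning on molecular graphs", convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])
```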
qalis t1_jeaqs4u wrote
Reply to [D] Directed Graph-based Machine Learning Pipeline tool? by Driiper
Airflow, Metaflow, ZenML, Kedro