dojoteef

dojoteef OP t1_jc6om7a wrote

Thanks for the vote of confidence!

Unfortunately, I recently deleted my twitter account 🫣. I was barely active there: a handful of tweets in nearly a decade and a half...

That said, I'll probably post my preprint on this sub when it's ready. I also need to recruit some playtesters, so I'll probably post on r/discoelysium to recruit participants in the next few weeks (to ensure high-quality evaluations we need people who have played the game before, rather than using typical crowdsourcing platforms like MTurk).

1

dojoteef OP t1_jc4hwyw wrote

If you actually want the NPCs to meaningfully add to the game rather than merely being mouthpieces then your approach won't work. How do you ensure what they say is consistent with the game world? E.g. what if they make up the location of a hidden treasure, offer to give you an item, etc. All of that needs to be accounted for in the game logic as well, otherwise they'll say things that make no sense in the game world.

It's actually a challenging problem and requires research. As far as I know there are very few people actively researching this area; if they are, then they certainly aren't publishing it. Hopefully my next paper, which investigates using LLMs in Disco Elysium, will help change that.

18

dojoteef t1_j8sqm4i wrote

I commend what Huggingface is trying to do (be the source for the latest models that is consistent and easy to use), but every time I've used the library I've had to tackle bugs that were very time consuming to pinpoint, which is exacerbated by the structure of the code. The worst bugs have been subtle heisenbugs: the code seemed to work most of the time, but failed at other times. The heisenbugs are what made me stop using Huggingface altogether, unless it's my only option.

For example, I ran into a bug that only manifested when downloading a specific pretrained model for a task, which in turn downloaded a config file that itself contained a bug. As a user it was super difficult to know where the source of the bug was without extensive spelunking. I've had many similarly difficult-to-diagnose issues each time I've used the Huggingface ecosystem.

I understand that what you're tasked with as a company is a huge undertaking for such a small team. Maybe splitting the package into a "stable" package and a "nightly" package could help (with stable being extensively bug tested more like an Ubuntu LTS release). My guess is that your team is likely too small to support that approach while adding new features at the same speed.

14

dojoteef t1_j8e2m8g wrote

Tbh, it's because I took a step back and haven't been moderating the sub the past week and a half. I've been the one mod doing the majority of the filtering of these posts over the past couple of years and the noise has just been going up exponentially over that time. It's very time consuming and I'm pretty burned out doing it, so I've taken some time away. I brought this up with the other mods before stepping back a bit.

It's probably good to try to get more mods, but I think the majority of the current mods are afraid to hire on new mods that might have a different philosophy of moderating, thus changing the feel of the sub.

1

dojoteef t1_j60evd7 wrote

I'd guess that it's an easier optimization problem. GANs are known to have stability issues during training, likely due to the adversarial formulation.

I think a more interesting question is why it also performs better than VAEs, since diffusion models also fall under the category of variational inference. Again I'd assume it's an easier optimization problem due to having a large number of denoising steps. Perhaps a technique like DRAW could match diffusion models if used with more steps? Not sure.

13

dojoteef t1_j5l399n wrote

This has been studied quite a bit. You can just follow the citation graph of the fastText paper: Enriching Word Vectors with Subword Information

For example, people have investigated sampling different subword tokenizations during training (Stochastic Tokenization with a Language Model for Neural Text Classification) and character-aware embeddings (CharBERT: Character-aware Pre-trained Language Model).
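
As a rough illustration of the subword idea in the fastText paper, here is a minimal sketch of extracting character n-grams from a word wrapped in boundary markers. The function name `char_ngrams` and its parameters are my own for illustration, not the fastText API; in the paper a word's vector is then built from the vectors of its n-grams.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, wrapped in < and > boundary markers.

    Hypothetical helper for illustration only, not part of the fastText library.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because rare or unseen words still share n-grams with known words, they get a usable representation instead of a single unknown token.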

4

dojoteef t1_j2p0jrg wrote

Better late than never. Started my PhD in my mid thirties and I'm glad I did.

That said, I knew exactly what I wanted to work on (it's relatively niche) and have been fortunate enough to find an advisor willing to let me work in that area. If you're unsure, then it might make sense to work in industry for a while and later decide if you want to come back for a PhD.

9

dojoteef t1_j1v4j4r wrote

You don't need to tell them one is AI or model generated. Could be two model generated texts or two human written texts. Merely having another text for comparison allows people to better frame the task since otherwise they essentially need to imagine a baseline for comparison, which people rarely do.

−3

dojoteef t1_j1uy04f wrote

Very interesting idea. It could easily be applied to images since digital watermarks already exist. Not sure how feasible it is for AI generated text.

Tbh, I imagine it behooves companies to do this so they are less likely to train on media (text, images, audio, etc.) produced by a model. The more ubiquitous the use of AI generation becomes, the more of an issue this poses. Currently that problem is likely quite minimal and probably acts to inject a small bit of noise into training (and the knowledge distillation effect could slightly improve training efficiency).

Though I guess a new data cleaning step could be running a classification model to flag training media that is likely AI generated, though that would likely be less efficient than a hash produced at the time of generation.
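
To make the hash idea concrete, here is a minimal sketch, assuming a provider records a hash of every output at generation time and exposes a lookup. `GenerationRegistry` and `filter_training_data` are hypothetical names, not an existing service. Note that an exact hash like SHA-256 only matches byte-identical content, so any edit, crop, or re-encoding would slip through.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Exact-match fingerprint of a piece of media."""
    return hashlib.sha256(content).hexdigest()


class GenerationRegistry:
    """Hypothetical registry a provider fills in at generation time."""

    def __init__(self):
        self._seen = set()

    def record(self, content: bytes) -> None:
        self._seen.add(fingerprint(content))

    def is_generated(self, content: bytes) -> bool:
        return fingerprint(content) in self._seen


def filter_training_data(samples, registry):
    """Drop samples whose hash was recorded as model output."""
    return [s for s in samples if not registry.is_generated(s)]


registry = GenerationRegistry()
registry.record(b"a model-written paragraph")
corpus = [b"a model-written paragraph", b"a human-written paragraph"]
print(filter_training_data(corpus, registry))
```

The lookup is a constant-time set membership test per sample, which is why it would be cheaper than running a classifier over the whole corpus.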

0

dojoteef t1_j1uwubj wrote

Nice job!

Though, to produce a better comparison it's best to show two examples side-by-side (one by a human, the other by the model, in a randomized order of course). The reason is that most people are not trained to analyze short snippets of text out of context. People trained to do that, e.g. English teachers, can better distinguish generated text without a baseline to compare against, but most people (crowdsourced evaluation) will likely produce a very biased analysis that does not reflect humans' actual ability to distinguish between the two.

For a more thorough investigation of this phenomenon you can check out our research:

The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation
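
The side-by-side setup suggested above could be sketched roughly like this. It's a minimal sketch, not the protocol from the paper; `make_pairs` and the dictionary keys are names I made up for illustration.

```python
import random


def make_pairs(human_texts, model_texts, seed=0):
    """Build side-by-side evaluation items with a randomized left/right order.

    Hypothetical helper: each item records which slot holds the human text
    so that rater judgments can be scored afterwards.
    """
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    items = []
    for human, model in zip(human_texts, model_texts):
        pair = [("human", human), ("model", model)]
        rng.shuffle(pair)
        items.append({
            "text_a": pair[0][1],
            "text_b": pair[1][1],
            "answer": "a" if pair[0][0] == "human" else "b",
        })
    return items


items = make_pairs(["A human story."], ["A model story."])
print(items[0]["text_a"], "|", items[0]["text_b"])
```

Raters only ever see `text_a` and `text_b`; the `answer` field stays hidden until scoring.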

26

dojoteef t1_j0ayqqq wrote

See the graphs in the paper that introduced nucleus sampling: The Curious Case of Neural Text Degeneration. They visualize how human-authored text has different statistical properties from machine-generated text. That's mainly a tradeoff between fluency and coherence. Sampling procedures like top-k or nucleus sampling restrict the tokens that can be emitted and thus introduce statistical bias in the generated text, but produce more fluent text. In contrast, sampling from the full distribution gets closer to the distribution of human-authored text, but often degenerates into incoherence (hence the title of the paper).
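
For a concrete picture of how these procedures restrict the tokens that can be emitted, here is a minimal sketch over a toy next-token distribution. The function names and the toy probabilities are made up for illustration; real implementations operate on model logits.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in keep)
    return {t: probs[t] / total for t in keep}


def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    keep, mass = [], 0.0
    for tok in sorted(probs, key=probs.get, reverse=True):
        keep.append(tok)
        mass += probs[tok]
        if mass >= p:
            break
    return {t: probs[t] / mass for t in keep}


def sample(probs, rng):
    """Draw one token from a (possibly truncated) distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Toy distribution, invented for illustration.
next_token = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
rng = random.Random(0)
print(sorted(top_k_filter(next_token, 2)))      # ['a', 'the']
print(sorted(nucleus_filter(next_token, 0.9)))  # ['a', 'cat', 'the']
print(sample(nucleus_filter(next_token, 0.9), rng))
```

Both filters zero out the tail ("zebra" can never be emitted), which is exactly the statistical bias mentioned above; sampling from `next_token` directly keeps the tail.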

12

dojoteef t1_j02kku4 wrote

This is great! Is it realistically possible to train LLMs à la BLOOM from scratch using these, or just do finetuning? I guess I'm wondering how the training speed scales with more compute nodes.

Even if we assume high-end GPUs/TPUs, a frequent bottleneck is throughput due to network latency. How big of an issue is that? For example, I had previously tried scaling to multi-node training on my university's cluster and it turned out that it was faster to do gradient accumulation on a single node than to do multi-node training because the network switches were not purchased with high throughput in mind.
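
For anyone unfamiliar with gradient accumulation, here is a minimal sketch on a toy one-parameter regression showing why it can stand in for a larger batch: the size-weighted average of micro-batch gradients equals the full-batch gradient, with no network communication involved. The function names and data are made up for illustration.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for y ≈ w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Average micro-batch gradients, weighting each by its size."""
    total = 0.0
    for start in range(0, len(data), micro_batch_size):
        micro = data[start:start + micro_batch_size]
        # Undo the per-micro-batch mean so every example counts equally.
        total += grad_mse(w, micro) * len(micro)
    return total / len(data)


# Toy data, invented for illustration.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5
print(grad_mse(w, data), accumulated_grad(w, data, micro_batch_size=2))
```

The cost is that micro-batches run sequentially, so this trades wall-clock time per step for not having to synchronize gradients over a slow interconnect.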

1

dojoteef t1_j0275on wrote

While there is a field of research investigating federated learning which might one day allow for an ML@Home type project, as it stands the current algorithms require too much memory, computation, and bandwidth for training very large models like GPT-3.

I'm hopeful that an improved approach will be devised that mitigates these issues (in fact I have some ideas I'm considering for my next research project), but for now they render a real ML@Home type project infeasible.

1

dojoteef t1_iyw254f wrote

Mistakes happen. In this case the authors report the issue publicly and should be commended for that.

The NeurIPS organizers can choose to address the issue in whatever way they deem appropriate, especially as the authors are not hiding the fact that their results were changed.

Of course you're free to assume it's malicious if you want (at least that seems to be the stance you're taking, but if it's not then I might have misinterpreted your response).

174

dojoteef t1_iyvxzsz wrote

See the authors' explanation on OpenReview:

> We update the result tables in the camera-ready version. The revision is due to a different data version of query augmentation. Previously, the data is cooked by one of our co-authors while using a different train-test split to train the query generator, causing some data leakage issue. All experiments in the previous submission are based on this query augmentation version, so the performance is relatively higher. When preparing the camera-ready version, we review and reproduce the code end-to-end for official release. At that time, we realize the data leakage problem. So, we re-cook the query augmentation data and reproduce all the experiments again in the new table. After solving the data leakage problem, NCI still shows more than 15% improvement over the current best SOTA. We have released the complete open-source code at GitHub:
>
> https://github.com/solidsea98/Neural-Corpus-Indexer-NCI
>
> Welcome to follow and reproduce our work. Looking forward to further discussions and collaborations.

171