niclas_wue t1_j842pyo wrote

Hey, great idea, looks very interesting. Do you use the abstract as input, or do you actually parse the paper? I built something quite similar: http://www.arxiv-summary.com, which summarizes trending AI papers as bullet points. However, I think a Chrome extension allows for much more flexible paper choice, which is really great.

37

niclas_wue OP t1_j4yukoz wrote

Yes, it is possible to use citations as a measure of a paper's impact. However, when a paper is newly published, there are typically no citations yet, so this would result in a delayed signal. Retweets and GitHub stars provide a faster indication of a paper's impact. I believe that speed is important because, as a paper becomes older, there are already many reviews and articles written by humans that (at least for now) provide a better summary of the paper.

2

niclas_wue OP t1_j4r5wb1 wrote

Yes, it can be applied to any document. A book would be more expensive because it has more text and thus more input tokens. The PDF needs to be converted to text because the API only accepts text. Equations that can be written in Unicode are fed directly to the network, and it can understand them; other equations are currently skipped. So far I have spent almost $100 in tokens to summarize the papers, so there will need to be some paid features in the near future, or a reduction in the number of papers.
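To make the cost point concrete, here is a back-of-the-envelope sketch. The $0.02 per 1,000 tokens (davinci-style pricing) and the ~0.75 words-per-token heuristic are my assumptions, not figures from the comment:

```python
# Rough cost estimate for summarizing a document with GPT-3.
# ASSUMPTIONS: davinci-style pricing of $0.02 per 1,000 tokens and
# roughly 0.75 words per token -- neither figure is from the original post.

PRICE_PER_1K_TOKENS = 0.02
WORDS_PER_TOKEN = 0.75

def estimated_cost(word_count: int) -> float:
    """Approximate API cost in dollars for a document of `word_count` words."""
    tokens = word_count / WORDS_PER_TOKEN
    return tokens / 1000 * PRICE_PER_1K_TOKENS

# A ~6,000-word paper vs. a ~100,000-word book:
paper_cost = estimated_cost(6_000)    # roughly $0.16
book_cost = estimated_cost(100_000)   # roughly $2.67
```

This is why a book costs an order of magnitude more than a paper: cost scales linearly with input tokens.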

1

niclas_wue OP t1_j4k17s2 wrote

Thank you, I am glad you like it! At the moment, only the web server is public. You can find it here: https://github.com/niclaswue/arxiv-smry It is a Hugo server with a blog theme. Every blog post is a markdown file; when a new file is pushed to Git, it is automatically published on the blog.

The rest is basically a bunch of (messy) Python scripts for extracting the text, then asking GPT-3 for a summary and compiling the answers to a markdown file. Finally, I use GitPython to automatically push new summaries to the repo.
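The publish step described above could be sketched like this. The front-matter fields, file layout, and repo paths are illustrative assumptions, not the project's actual conventions:

```python
# Minimal sketch of publishing a summary as a Hugo markdown post.
# ASSUMPTIONS: TOML front matter with only title/date fields and the
# example paths below are hypothetical, not taken from the actual repo.
from datetime import date

def make_post(title: str, summary_points: list[str]) -> str:
    """Render a Hugo-style markdown post: TOML front matter + bullet list."""
    front_matter = "\n".join([
        "+++",
        f'title = "{title}"',
        f'date = "{date.today().isoformat()}"',
        "+++",
    ])
    body = "\n".join(f"- {point}" for point in summary_points)
    return front_matter + "\n\n" + body

# The GitPython push could then look roughly like this (not run here):
# from git import Repo
# repo = Repo("arxiv-smry")
# repo.index.add(["content/posts/new-paper.md"])
# repo.index.commit("Add new summary")
# repo.remote("origin").push()
```

Once the commit lands on the remote, a Hugo rebuild picks up the new markdown file automatically.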

3

niclas_wue OP t1_j4i9r9w wrote

Thanks for your ideas. Building a paid experience for companies is a great idea, I will consider it.

Category tagging like „computer vision“, „natural language processing“ etc. should be relatively straightforward. Will implement this in the next couple of days :)

More paper-specific tags could be generated using GPT-3. I think that would make sense once the database is a bit larger; right now, I would guess that most tags would be unique to a single paper.
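One way GPT-3-generated tags could work is to prompt for a short comma-separated list and normalize the answer. The prompt wording and helper names here are hypothetical, not the site's actual code:

```python
# Sketch of GPT-3-based tag generation: build a prompt, then parse the
# model's comma-separated completion into clean tags.
# ASSUMPTIONS: the prompt text and function names are illustrative only.

def tag_prompt(title: str, abstract: str) -> str:
    """Ask the model for three short, comma-separated topic tags."""
    return (
        "Suggest three short topic tags, comma-separated, for this paper.\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n"
        "Tags:"
    )

def parse_tags(completion: str) -> list[str]:
    """Normalize a comma-separated answer into lowercase, trimmed tags."""
    return [t.strip().lower() for t in completion.split(",") if t.strip()]

# e.g. parse_tags(" Computer Vision, 3D Detection , LiDAR")
# -> ["computer vision", "3d detection", "lidar"]
```

Normalizing to lowercase and trimming whitespace helps near-duplicate tags collapse into one, which matters most while the database is still small.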

3

niclas_wue OP t1_j4fqqy6 wrote

Thanks for asking! My first prototype collected all new arXiv papers in certain ML-related categories via the API; however, I quickly realized that this would be way too costly. Right now, I collect all papers from PapersWithCode's "Top" (last 30 days) and "Social" tabs, the latter of which is based on Twitter likes and retweets. Finally, I filter using this formula:

p.number_of_likes + p.number_of_retweets > 20 or p.number_github_stars > 100

In rare cases, when the paper is really long or not parsable with GROBID, I exclude it for now.
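The filter above can be sketched as a small predicate. The field names come from the formula in the comment; the `Paper` dataclass wrapper is my own assumption:

```python
# Sketch of the traction filter from the formula above.
# The field names mirror the comment's formula; the Paper dataclass
# itself is an assumed stand-in for the real paper record.
from dataclasses import dataclass

@dataclass
class Paper:
    number_of_likes: int
    number_of_retweets: int
    number_github_stars: int

def passes_filter(p: Paper) -> bool:
    """Keep papers with strong Twitter or GitHub traction."""
    return (p.number_of_likes + p.number_of_retweets > 20
            or p.number_github_stars > 100)

papers = [Paper(15, 10, 0), Paper(0, 0, 150), Paper(5, 5, 50)]
kept = [p for p in papers if passes_filter(p)]  # first two pass, third does not
```

Either signal alone is enough to keep a paper, so a GitHub-only hit with no Twitter attention still makes the cut.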

10