Challenge questions for The Data Incubator

- Run Challenge.ipynb for challenge questions 1 and 2.
- Run Project.ipynb for challenge question 3, the project proposal.

Images generated by Challenge.ipynb are displayed inline in the IPython notebooks, while images generated by Project.ipynb are displayed both inline and as.

Project proposal: arX-Live

As a doctoral student, I was used to feeling hopeless about keeping up to date on all the latest research in my field. At different times, I've had both the experience of being scooped on a project and of discovering, after many hours of grueling labor, that a newly developed technique would have saved me time. There have also been moments after finishing a project when I had a choice of what direction to take my research in next, but I simply didn't know which topics were in high enough demand to obtain funding.

With all these relatable problems in mind, I propose arX-Live, the arXiv-trend predictive engine! Using the arXiv API, I scraped a year's worth of eprint abstracts from the arXiv servers in my chosen category (high energy physics - theory, but it could be anything). I then cleaned the data by removing MathJax/LaTeX tags, extra whitespace, punctuation, and common stopwords, and by fixing the case of the remaining text.

My next goal was to filter out any non-technical terminology in each of the abstracts. However, technical terms are often phrases constructed from more mundane words (usually no more than three), so we must be careful here! Using Python's Natural Language Toolkit (NLTK), I analyzed the remaining text from each abstract to determine the most likely bigram and trigram word collocations, and merged each occurrence into a single monogram. Now I could safely filter out the words in NLTK's convenient corpus of English-language words, after reducing each matching word to its stem.

At this stage, each abstract has been reduced to a normalized bag of buzzwords. Computing the frequencies of each buzzword in a rolling window of 6 months, offset by a week between iterations to reduce the volume of data, I obtained a series of buzzword frequencies over time. These can be visualized in a variety of ways, but I chose first to look at both a word cloud and a heatmap showing the evolution of buzzword frequencies over time.

Much of the work done so far is in the realm of data scraping and munging, and this is now all automated in preparation for implementation in an app. However, there is still a great deal of analysis to be done. I propose to train a neural network on (buzzword, date) -> frequency data in order to predict future trends in research. One can also incorporate the number of citations and references, or even the full text of the eprint, as well as analyze the frequency with which groups of buzzwords appear in the same abstract.

As a hypothetical example of a distribution of suggested next words that could be submitted by collective thought stream contributors, I processed a corpus of 2,360,000 US English Twitter messages collected in May 2012 (freely available here) into every four-gram word sequence. I then put together some R code that generates the distribution of next words from any specified trigram. My end goal for this project would be to create an app (aided by the parallel processing capability of PySpark, as the data wrangling stage can be expensive) that will aid researchers in planning the trajectory of their work.
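As an illustration of the scraping step, here is a minimal sketch against the arXiv API's Atom feed. The function names (`build_query_url`, `parse_abstracts`) are my own, and the network call is kept behind a `__main__` guard so the parsing logic can be exercised offline; this is not the exact code used in the notebooks.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace used by the arXiv API


def build_query_url(category="hep-th", start=0, max_results=100):
    """Build an arXiv API query URL for eprints in one category."""
    params = urllib.parse.urlencode({
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
    })
    return f"http://export.arxiv.org/api/query?{params}"


def parse_abstracts(atom_xml):
    """Extract (title, abstract) pairs from an Atom feed returned by the API."""
    root = ET.fromstring(atom_xml)
    return [
        (entry.findtext(f"{ATOM}title", default="").strip(),
         entry.findtext(f"{ATOM}summary", default="").strip())
        for entry in root.findall(f"{ATOM}entry")
    ]


if __name__ == "__main__":
    # Network call kept behind the guard; run only when online.
    with urllib.request.urlopen(build_query_url(max_results=5)) as resp:
        for title, _ in parse_abstracts(resp.read()):
            print(title)
```

Paging with `start`/`max_results` is how the API expects bulk harvesting to be done, which is why both appear as parameters.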
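The cleaning step described above can be sketched as follows. This is a simplification: the real pipeline uses NLTK's full English stopword list, while `STOPWORDS` here is a tiny stand-in set, and the LaTeX-stripping regexes only cover inline math and simple commands.

```python
import re
import string

# Tiny stand-in for NLTK's English stopword list (the real pipeline
# would use nltk.corpus.stopwords.words("english")).
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to", "we", "is"}


def clean_abstract(text):
    """Strip LaTeX/MathJax, punctuation, and stopwords; lowercase the rest."""
    text = re.sub(r"\$[^$]*\$", " ", text)                # inline math: $...$
    text = re.sub(r"\\[A-Za-z]+(\{[^}]*\})?", " ", text)  # commands like \emph{...}
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)  # splitting/joining also collapses whitespace
```

For example, `clean_abstract("We study the $AdS_5$ black hole.")` reduces to `"study black hole"`.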
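The collocation-merging and dictionary-filtering steps can be sketched in pure Python. Note the stand-ins: a raw bigram-count threshold replaces NLTK's `BigramCollocationFinder` scoring, and `crude_stem` is a toy suffix stripper in place of NLTK's PorterStemmer; `dictionary_stems` plays the role of the stemmed NLTK words corpus.

```python
from collections import Counter


def merge_collocations(tokens, min_count=2):
    """Join bigrams that recur at least min_count times into single tokens
    (stand-in for NLTK's BigramCollocationFinder)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and pair_counts[(tokens[i], tokens[i + 1])] >= min_count:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def crude_stem(word):
    """Toy stemmer stand-in (the real pipeline would use NLTK's PorterStemmer)."""
    for suffix in ("ings", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def bag_of_buzzwords(tokens, dictionary_stems):
    """Drop ordinary English words (matched by stem); keep merged collocations
    and out-of-dictionary terms, returned as a normalized bag."""
    kept = [t for t in tokens if "_" in t or crude_stem(t) not in dictionary_stems]
    total = len(kept) or 1
    return {w: c / total for w, c in Counter(kept).items()}
```

Merging first and filtering second is the point of the "be careful" caveat above: once "black hole" has become the single token `black_hole`, removing the dictionary words "black" and "hole" can no longer destroy the technical term.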
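The rolling-window computation might look something like the sketch below. The input format (a list of dated buzzword bags) and the choice to average the bags inside each window are my assumptions; the post only specifies a 6-month window advanced by one week per iteration.

```python
from datetime import date, timedelta


def rolling_frequencies(dated_bags, window_days=182, step_days=7):
    """dated_bags: list of (date, bag) pairs, where each bag maps
    buzzword -> normalized frequency in one abstract.
    Returns (window_start, averaged_bag) pairs, with the window start
    advancing by step_days between iterations."""
    if not dated_bags:
        return []
    dated_bags = sorted(dated_bags, key=lambda p: p[0])
    start, end = dated_bags[0][0], dated_bags[-1][0]
    series, t = [], start
    while t <= end:
        window = [bag for d, bag in dated_bags
                  if t <= d < t + timedelta(days=window_days)]
        totals = {}
        for bag in window:
            for w, f in bag.items():
                totals[w] = totals.get(w, 0.0) + f
        n = len(window) or 1
        series.append((t, {w: f / n for w, f in totals.items()}))
        t += timedelta(days=step_days)
    return series
```

Each element of the resulting series is one column of the heatmap mentioned above: a snapshot of average buzzword frequencies over the preceding six months.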
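The next-word distribution itself was built in R, as noted above; for consistency with the rest of this post, here is an equivalent Python sketch that conditions every four-gram in a tokenized corpus on a chosen trigram.

```python
from collections import Counter


def next_word_distribution(corpus_tokens, trigram):
    """From all four-grams in the corpus, return the normalized distribution
    of words that follow the given trigram (a Python equivalent of the
    post's R code)."""
    counts = Counter(
        corpus_tokens[i + 3]
        for i in range(len(corpus_tokens) - 3)
        if tuple(corpus_tokens[i:i + 3]) == tuple(trigram)
    )
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}
```

On a toy corpus, `next_word_distribution("i want to eat i want to sleep i want to eat".split(), ("i", "want", "to"))` yields `eat` with probability 2/3 and `sleep` with 1/3; at Twitter-corpus scale this per-four-gram counting is exactly the stage where PySpark's parallelism would pay off.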