Wednesday, July 16, 2014

One of the things I realized after moving into industry is that, contrary to a widely held belief in academia, it is important to stay on top of the latest research, not only in the fields that directly affect my line of work but also in less closely related areas. It is very difficult to do a competent job by relying exclusively on the tricks you learned in graduate school. I am still very much a neophyte in machine learning, but even just to organize my own thinking when I work through textbooks, I find it important to know which topics and questions people are currently paying attention to at the research frontier. Besides, every day you find new problems at work, and you never know where you are going to find the tip that helps you solve them.[1] In short, reading abstracts of the newest research in statistics and machine learning cannot hurt.

To make the process easier, I spent my Saturday morning writing a Twitter bot (@arXivStats) in Python that calls the arXiv API periodically and tweets all new papers that are published in the stat category. The structure is very simple: a class for inputting data (papers) that collects and parses the response of the API using BeautifulSoup, and a class for outputting data (tweet) that transforms the dictionary returned by the papers class into a list of tweets and publishes them through the tweepy module. A cron job sitting on my Raspberry Pi runs the script every 24 hours. Just another step toward making my Twitter feed my main source of scientific and technical literature.
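
For readers who want a feel for the parse-then-format step, here is a minimal sketch. It is not the bot's actual code. It uses the standard library's ElementTree instead of BeautifulSoup so that it runs without extra packages, plain functions instead of the two classes, and it leaves out both the HTTP call to the arXiv API and the posting through tweepy. The names parse_papers, format_tweets and SAMPLE_FEED are made up for illustration, the sample entry is not a real paper, and the 140-character limit is an assumption that ignores Twitter's link shortening.

```python
import xml.etree.ElementTree as ET

# The arXiv API answers with an Atom feed, so tags live in the Atom namespace.
ATOM = "{http://www.w3.org/2005/Atom}"
TWEET_LIMIT = 140  # assumption: the plain character limit, no link shortening

# Hypothetical stand-in for an API response; the entry is invented.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>A made-up
      paper title</title>
  </entry>
</feed>"""


def parse_papers(atom_xml):
    """Return a list of {'title', 'link'} dicts, one per feed entry."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.findall(ATOM + "entry"):
        # Titles can be wrapped over several lines; collapse the whitespace.
        title = " ".join(entry.findtext(ATOM + "title", default="").split())
        link = entry.findtext(ATOM + "id", default="").strip()
        papers.append({"title": title, "link": link})
    return papers


def format_tweets(papers, limit=TWEET_LIMIT):
    """Turn paper dicts into 'title link' strings that fit within limit."""
    tweets = []
    for paper in papers:
        room = limit - len(paper["link"]) - 1  # one space before the link
        title = paper["title"]
        if len(title) > room:
            # Shorten the title and mark the cut with an ellipsis.
            title = title[: room - 3].rstrip() + "..."
        tweets.append(title + " " + paper["link"])
    return tweets


if __name__ == "__main__":
    for tweet in format_tweets(parse_papers(SAMPLE_FEED)):
        print(tweet)
```

In a real run the feed string would come from the arXiv API and each formatted string would be handed to tweepy; keeping those two ends separate from the parsing and formatting makes the middle part easy to check offline.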

The source code can be found here.

  1. It is uncanny how many things I have figured out just by taking a very quick look at an introductory textbook on statistical mechanics.

