In 2007, I was introduced to Twitter via the written qualifying exam towards my Ph.D. At first, I did not know what to do with it. After a year or so (maybe sooner), I began to follow some very interesting people who share my interests, and it has transformed my academic experience. It is great to run across tweets promoting conferences and newly released papers in my field. One of my favorite parts of Twitter, aside from interacting with tweeps, is being able to quickly post a status update on what I am doing and refer back to it later. I consider it a platform for collaboration because I can see what others are doing through their tweets and linked blogs, whether the author is a Twitter user or someone who is not on Twitter. I quickly realized, though, that 140 characters were not enough to solidify my thoughts and participate in the community. Thus, I decided to start this blog so I can share cool things I have found in my research and work with others anywhere on the web, and communicate in more than 140 characters.
Here are some things that I am very interested in and will post about. Of course, this is not an exhaustive list, and a lot of these topics overlap!
- Data Extraction. Data extraction covers topics such as working with data feeds, XML, JSON, social media APIs, and web scraping.
- Data Mining and Cleaning. Extracting elementary building blocks, then extracting meaning. This covers topics such as text mining and parsing.
- Data Processing. Processing data using open-source software such as R, as well as the map/reduce paradigm for processing big data. The implementation of map/reduce I have used is Hadoop, but I may discuss others like Project Disco and Greenplum.
- Information Retrieval. Techniques for searching and crawling the web, recommendation systems, etc.
- Natural Language Processing. Training computers to make sense of text produced by humans. I am particularly interested in topic models for representing documents.
- Machine Learning. Using data to help machines (computers) make better decisions. This is where statistics comes in.
- Databases and Data Warehousing. Databases and programming with databases such as MySQL and PostgreSQL, but more so…
- NoSQL. An interesting movement promoting non-relational databases, many of which use document-oriented data stores. Proponents argue that data from the web and similar data streams is better represented and stored as documents than as rectangular tables. Some systems I have used include CouchDB, MongoDB, and Neo4j (a graph database rather than a document store), but several others exist. So far, I am impressed.
- High Performance Computing. Working with ubiquitous data streams can be computationally expensive because, well, there is so much data. Topics such as using multiple cores in a single system, performing computations on disk, and using clusters and the map/reduce platform (mostly Hadoop) are important here.
- Programming Languages. Programming language theory is not really my thing, but every once in a while I run across some really interesting tidbits and trivia that may be worth sharing here.
Now that I’ve written my introductory posts, the real fun begins…