SDS-2.2, Scalable Data Science

List of pointers to potential course projects

from 2016 and 2017

Theoretical Projects in Scalable Data Science

  1. Exact Matrix Completion via Convex Optimization
    • Abstract: Suppose that one observes an incomplete subset of entries selected from a low-rank matrix. When is it possible to complete the matrix and recover the entries that have not been seen? We demonstrate that in very general settings, one can perfectly recover all of the missing entries from most sufficiently large subsets by solving a convex programming problem that finds the matrix with the minimum nuclear norm agreeing with the observed entries. The techniques used in this analysis draw upon parallels in the field of compressed sensing, demonstrating that objects other than signals and images can be perfectly reconstructed from very limited information.
    • As published in DOI:10.1145/2184319.2184343
    • Originally published in Foundations of Computational Mathematics 9, 6 (2009)
  2. Data Science and Prediction
    • Big data promises automated actionable knowledge creation and predictive models for use by both humans and computers.
    • DOI:10.1145/2500499
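
The convex program mentioned in the matrix completion abstract above can be written out explicitly. The following is a sketch in notation chosen here for illustration (the symbols M, X, and Ω are assumptions of this note, not taken from the listing): M is the unknown low-rank matrix, Ω is the set of index pairs whose entries were observed, and X is the candidate completion. The nuclear norm is the sum of the singular values of X, and it serves as a convex surrogate for the rank.

```latex
\begin{aligned}
\text{minimize}   \quad & \|X\|_{*} = \sum_{k} \sigma_{k}(X) \\
\text{subject to} \quad & X_{ij} = M_{ij}, \qquad (i,j) \in \Omega
\end{aligned}
```

Here \sigma_k(X) denotes the k-th singular value of X. The constraint forces agreement with the observed entries, and the objective favours the low-rank solutions among all matrices that agree.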

Applied Projects in Scalable Data Science

  1. Content Recommendation on Web Portals
    • How to offer recommendations to users when they have not specified what they want.
    • DOI:10.1145/2461256.2461277
  2. Techniques and Applications for Sentiment Analysis
  3. Looking back at big data (Digital humanities)
    • As computational tools open up new ways of understanding history, historians and computer scientists are working together to explore the possibilities.
    • DOI:10.1145/2436256.2436263
  4. Computational Epidemiology
    • The challenge of developing and using computer models to understand and control the diffusion of disease through populations.
    • DOI:10.1145/2483852.2483871
  5. Replicated Data Consistency Explained Through Baseball
    • A broader class of consistency guarantees can, and perhaps should, be offered to clients that read shared data.
    • DOI:10.1145/2500500
  6. Community Sense and Response Systems: Your phone as a quake detector
    • The Caltech CSN project collects sensor data from thousands of personal devices for real-time response to dangerous earthquakes.
    • DOI:10.1145/2622628.2622633
  7. Reshaping (non-State) Terrorist Networks (fields: State/national security, social psychology, counter-recruitment, counter/de-radicalization...)
    • To destabilize terrorist organizations, the STONE algorithms identify a set of operatives whose removal would maximally reduce lethality.
    • DOI:10.1145/2632661.2632664
  8. Rise of Hate Groups in the US (fields: social psychology, understanding online emergence of "hate groups", ...)
  9. New News Aggregator Apps (ML at work)
  10. Natural Language Translation at the Intersection of AI and HCI
    • Abstract: The fields of artificial intelligence (AI) and human-computer interaction (HCI) are influencing each other like never before. Widely used systems such as Google Translate, Facebook Graph Search, and RelateIQ hide the complexity of large-scale AI systems behind intuitive interfaces. But relations were not always so auspicious. The two fields emerged at different points in the history of computer science, with different influences, ambitions, and attendant biases. AI aimed to construct a rival, and perhaps a successor, to the human intellect. Early AI researchers such as McCarthy, Minsky, and Shannon were mathematicians by training, so theorem-proving and formal models were attractive research directions. In contrast, HCI focused more on empirical approaches to usability and human factors, both of which generally aim to make machines more useful to humans. Many attendees at the first CHI conference in 1983 were psychologists and engineers. Presented papers had titles such as "Design Principles for Human-Computer Interfaces" and "Psychological Issues in the Use of Icons in Command Menus," hardly appealing fare for mainstream AI researchers.

      Since the 1960s, HCI has often been ascendant when setbacks in AI occurred, with successes and failures in the two fields redirecting mindshare and research funding.14 Although early figures such as Allen Newell and Herbert Simon made fundamental contributions to both fields, the competition and relative lack of dialogue between AI and HCI are curious. Both fields are broadly concerned with the connection between machines and intelligent human agents. What has changed recently is the deployment and adoption of user-facing AI systems. These systems need interfaces, leading to natural meeting points between the two fields.

  11. Sensing Emotions
  12. Putting the Data Science into Journalism
    • News organizations increasingly use techniques like data mining, Web scraping, and data visualization to uncover information that would be impossible to identify and present manually.
    • DOI:10.1145/2742484
  13. Big Data Meets Big Science (Extra Reading)
    • Next-generation scientific instruments are forcing researchers to question the limits of massively parallel computing.
    • DOI:10.1145/2617660
  14. Big Data and its technical challenges (Extra Reading)
    • Exploring the inherent technical challenges in realizing the potential of Big Data.
    • DOI:10.1145/2611567
  15. Exascale Computing and Big Data (Extra Reading)
  16. Battling Evil: Dark Silicon (Extra Reading)
  17. TensorFlow: Google Open Sources Their Machine Learning Tool (see InfoQ)
    • TensorFlow is a machine learning library created by the Brain Team researchers at Google and now open sourced under the Apache License 2.0. TensorFlow is detailed in the whitepaper TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. The source code can be found on Google Git. It is a tool for writing and executing machine learning algorithms. Computations are done in a data flow graph where the nodes are mathematical operations and the edges are tensors (multidimensional data arrays) that are exchanged between nodes. A user constructs the graph and writes the algorithms that are executed on each node. TensorFlow takes care of executing the code asynchronously on different devices, cores, and threads.... TensorFlow is used by Google for GMail (SmartReply), Search (RankBrain), Pictures (Inception Image Classification Model), Translator (Character Recognition), and other products.
    • See https://databricks.com/blog/2016/01/25/deep-learning-with-spark-and-tensorflow.html
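
The data flow graph idea in the TensorFlow item above (nodes are operations, edges carry arrays between them) can be illustrated with a small sketch. This is plain Python written for this note and does not use the TensorFlow API. The names Graph, Node, elementwise, and the input labels "w", "x", "b" are all hypothetical, and plain lists stand in for tensors. It also evaluates synchronously on one thread, whereas TensorFlow schedules the work asynchronously across devices.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Node:
    """One operation in the graph (hypothetical class, not a TensorFlow type)."""
    name: str
    op: Callable[..., List[float]]
    inputs: Tuple[str, ...]  # names of upstream nodes or fed inputs


class Graph:
    """A toy data flow graph: build it first, evaluate it later."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}

    def add(self, name: str, op: Callable[..., List[float]], *inputs: str) -> str:
        self.nodes[name] = Node(name, op, inputs)
        return name

    def run(self, target: str, feed: Dict[str, List[float]]) -> List[float]:
        # Fed inputs seed the cache; each node is computed at most once.
        cache: Dict[str, List[float]] = dict(feed)

        def evaluate(name: str) -> List[float]:
            if name not in cache:
                node = self.nodes[name]
                args = [evaluate(i) for i in node.inputs]
                cache[name] = node.op(*args)
            return cache[name]

        return evaluate(target)


def elementwise(fn: Callable[[float, float], float]) -> Callable[..., List[float]]:
    """Lift a scalar function to operate on two equal-length lists."""
    return lambda a, b: [fn(p, q) for p, q in zip(a, b)]


# Build the graph for y = w * x + b (elementwise), then run it with data.
g = Graph()
g.add("wx", elementwise(operator.mul), "w", "x")
g.add("y", elementwise(operator.add), "wx", "b")

result = g.run("y", {"w": [2.0, 3.0], "x": [1.0, 4.0], "b": [0.5, 0.5]})
print(result)  # [2.5, 12.5]
```

Separating graph construction (the add calls) from execution (run) is the point of the sketch: because the whole computation is described before any data flows through it, a real system is free to decide where and in what order each node executes.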

Keep reading... I have not updated this list since early 2017!

  • Communications of the ACM (Association for Computing Machinery) is a nice central point for a quick overview of current computationally focussed mathematical sciences.
  • PNAS/Science/Nature - usual popular science venues
  • Hacker News
  • ...

Shared Student Notebooks for sds-2.2

Several notebooks that students tried out during the course are part of the course content