SDS-2.2, Scalable Data Science

List of pointers to potential course projects

from 2016 and 2017

Theoretical Projects in Scalable Data Science

Exact Matrix Completion via Convex Optimization
Abstract
- Suppose that one observes an incomplete subset of entries selected from a low-rank matrix. When is it possible to complete the matrix and recover the entries that have not been seen? We demonstrate that in very general settings, one can perfectly recover all of the missing entries from most sufficiently large subsets by solving a convex programming problem that finds the matrix with the minimum nuclear norm agreeing with the observed entries. The techniques used in this analysis draw upon parallels in the field of compressed sensing, demonstrating that objects other than signals and images can be perfectly reconstructed from very limited information.
- As published in DOI:10.1145/2184319.2184343
- Originally published in Foundations of Computational Mathematics 9, 6 (2009)
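The convex program described in the abstract can be written out as follows. The notation is an assumption for illustration and does not appear in the text above: $M$ is the unknown low-rank matrix, $\Omega$ is the set of observed index pairs, and $\sigma_k(X)$ are the singular values of a candidate matrix $X$.

```latex
\begin{aligned}
\underset{X}{\text{minimize}} \quad & \|X\|_* = \sum_{k} \sigma_k(X) \\
\text{subject to} \quad & X_{ij} = M_{ij}, \qquad (i,j) \in \Omega
\end{aligned}
```

The nuclear norm $\|X\|_*$ is the sum of the singular values and acts as a convex surrogate for the rank, which is what makes the problem tractable for a convex solver.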
Data Science and Prediction
- Big data promises automated actionable knowledge creation and predictive models for use by both humans and computers.
- DOI:10.1145/2500499

Applied Projects in Scalable Data Science

Content Recommendation on Web Portals
- How to offer recommendations to users when they have not specified what they want.
- DOI:10.1145/2461256.2461277
Techniques and Applications for Sentiment Analysis
- The main applications and challenges of one of the hottest research areas in computer science.
- DOI:10.1145/2436256.2436274
Looking Back at Big Data (Digital Humanities)
- As computational tools open up new ways of understanding history, historians and computer scientists are working together to explore the possibilities.
- DOI:10.1145/2436256.2436263
Computational Epidemiology
- The challenge of developing and using computer models to understand and control the diffusion of disease through populations.
- DOI:10.1145/2483852.2483871
Replicated Data Consistency Explained Through Baseball
- A broader class of consistency guarantees can, and perhaps should, be offered to clients that read shared data.
- DOI:10.1145/2500500
Community Sense and Response Systems: Your phone as a quake detector
- The Caltech CSN project collects sensor data from thousands of personal devices for real-time response to dangerous earthquakes.
- DOI:10.1145/2622628.2622633
Reshaping (non-State) Terrorist Networks (fields: State/national security, social psychology, counter-recruitment, counter/de-radicalization...)
- To destabilize terrorist organizations, the STONE algorithms identify a set of operatives whose removal would maximally reduce lethality.
- DOI:10.1145/2632661.2632664
Rise of Hate Groups in the US (fields: social psychology, understanding online emergence of "hate groups", ...)
- Watch the Democracy Now story on This Year (2016) in Hate and Extremism
- Read https://www.splcenter.org/intelligence-report. The Intelligence Report is the Southern Poverty Law Center's award-winning magazine. The quarterly publication provides comprehensive updates to law enforcement agencies, the media and the general public. See several articles on different 'hate groups' published on February 17, 2016.
New News Aggregator Apps (ML at work)
- How apps like Inkl and SmartNews are overcoming the challenges of aggregation to win over content publishers and users alike
- DOI:10.1145/2800445
Natural Language Translation at the Intersection of AI and HCI
- Abstract: The fields of artificial intelligence (AI) and human-computer interaction (HCI) are influencing each other like never before. Widely used systems such as Google Translate, Facebook Graph Search, and RelateIQ hide the complexity of large-scale AI systems behind intuitive interfaces. But relations were not always so auspicious. The two fields emerged at different points in the history of computer science, with different influences, ambitions, and attendant biases. AI aimed to construct a rival, and perhaps a successor, to the human intellect. Early AI researchers such as McCarthy, Minsky, and Shannon were mathematicians by training, so theorem-proving and formal models were attractive research directions. In contrast, HCI focused more on empirical approaches to usability and human factors, both of which generally aim to make machines more useful to humans. Many attendees at the first CHI conference in 1983 were psychologists and engineers. Presented papers had titles such as "Design Principles for Human-Computer Interfaces" and "Psychological Issues in the Use of Icons in Command Menus," hardly appealing fare for mainstream AI researchers.
  Since the 1960s, HCI has often been ascendant when setbacks in AI occurred, with successes and failures in the two fields redirecting mindshare and research funding. Although early figures such as Allen Newell and Herbert Simon made fundamental contributions to both fields, the competition and relative lack of dialogue between AI and HCI are curious. Both fields are broadly concerned with the connection between machines and intelligent human agents. What has changed recently is the deployment and adoption of user-facing AI systems. These systems need interfaces, leading to natural meeting points between the two fields.
- DOI:10.1145/2767151
Sensing Emotions
- How computer systems detect the internal emotional states of users.
- DOI:10.1145/2800498
- Further Reading:
  - Rosalind W. Picard, Affective computing, MIT Press, Cambridge, MA, 1997
  - The latest scientific findings indicate that emotions play an essential role in decision making, perception, learning, and more—that is, they influence the very mechanisms of rational thinking. Not only too much, but too little emotion can impair decision making. According to Rosalind Picard, if we want computers to be genuinely intelligent and to interact naturally with us, we must give computers the ability to recognize, understand, even to have and express emotions.
  - Rafael A. Calvo, Sidney D'Mello, Jonathan Gratch, Arvid Kappas, The Oxford Handbook of Affective Computing, Oxford University Press, Oxford, 2014
  - The Oxford Handbook of Affective Computing is a definitive reference in the burgeoning field of affective computing (AC).
  - Bartlett, M., Littlewort, G., Frank, M., and Lee, K. Automated Detection of Deceptive Facial Expressions of Pain, Current Biology, 2014.
    - Highlights
      - Untrained human observers cannot differentiate faked from genuine pain expressions
      - With training, human performance is above chance but remains poor
      - A computer vision system distinguishes faked from genuine pain better than humans
      - The system detected distinctive dynamic features of expression missed by humans
  - Carlos Busso, Zhigang Deng, Serdar Yildirim, Murtaza Bulut, Chul Min Lee, Abe Kazemzadeh, Sungbok Lee, Ulrich Neumann, Shrikanth Narayanan, Analysis of emotion recognition using facial expressions, speech and multimodal information, Proceedings of the 6th international conference on Multimodal interfaces, October 13-15, 2004, State College, PA, USA
    - Abstract: The interaction between human beings and computers will be more natural if computers are able to perceive and respond to human non-verbal communication such as emotions. Although several approaches have been proposed to recognize human emotions based on facial expressions or speech, relatively limited work has been done to fuse these two, and other, modalities to improve the accuracy and robustness of the emotion recognition system. This paper analyzes the strengths and the limitations of systems based only on facial expressions or acoustic information. It also discusses two approaches used to fuse these two modalities: decision level and feature level integration. Using a database recorded from an actress, four emotions were classified: sadness, anger, happiness, and neutral state. By the use of markers on her face, detailed facial motions were captured with motion capture, in conjunction with simultaneous speech recordings. The results reveal that the system based on facial expression gave better performance than the system based on just acoustic information for the emotions considered. Results also show the complementarity of the two modalities and that when these two modalities are fused, the performance and the robustness of the emotion recognition system improve measurably.
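The two fusion strategies named in the abstract above can be contrasted with a small sketch. Everything here is an assumption for illustration: the function names, the weighted-average combining rule, and the toy probabilities are hypothetical and are not taken from the paper.

```python
# Toy contrast between feature-level and decision-level fusion.
# All names, numbers, and the weighted-average rule are illustrative assumptions.

EMOTIONS = ["sadness", "anger", "happiness", "neutral"]


def feature_level_fusion(face_features, speech_features):
    """Concatenate per-modality feature vectors into one input for a single classifier."""
    return list(face_features) + list(speech_features)


def decision_level_fusion(face_probs, speech_probs, face_weight=0.5):
    """Combine per-modality class probabilities and return the top class and all fused scores."""
    fused = {
        emotion: face_weight * face_probs[emotion]
        + (1.0 - face_weight) * speech_probs[emotion]
        for emotion in EMOTIONS
    }
    best = max(fused, key=fused.get)
    return best, fused


# Hypothetical outputs of two separate classifiers for one utterance.
face_probs = {"sadness": 0.1, "anger": 0.6, "happiness": 0.2, "neutral": 0.1}
speech_probs = {"sadness": 0.2, "anger": 0.3, "happiness": 0.4, "neutral": 0.1}

label, fused = decision_level_fusion(face_probs, speech_probs)
joint_features = feature_level_fusion([0.3, 0.7], [0.1])
print(label, len(joint_features))
```

Feature-level fusion defers the decision to one classifier trained on the joint vector, whereas decision-level fusion keeps one classifier per modality and merges only their outputs.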
Putting the Data Science into Journalism
- News organizations increasingly use techniques like data mining, Web scraping, and data visualization to uncover information that would be impossible to identify and present manually.
- DOI:10.1145/2742484
Big Data Meets Big Science (Extra Reading)
- Next-generation scientific instruments are forcing researchers to question the limits of massively parallel computing.
- DOI:10.1145/2617660
Big Data and its Technical Challenges (Extra Reading)
- Exploring the inherent technical challenges in realizing the potential of Big Data.
- DOI:10.1145/2611567
Exascale Computing and Big Data (Extra Reading)
- Scientific discovery and engineering innovation require unifying traditionally separated high-performance computing and big data analytics. The article covers the twin ecosystems of HPC and big data and the challenges facing both.
- DOI:10.1145/2699414
- Watch https://www.youtube.com/watch?list=PLn0nrSd4xjjbIHhktZoVlZuj2MbrBBC_f&v=eLMChVev6hw
Battling Evil: Dark Silicon (Extra Reading)
- The changing nature of computing as chips with more transistors than can be concurrently activated become more commonplace.
- Read http://www.hpcdan.org/reeds_ruminations/2011/05/battling-evil-dark-silicon.html
TensorFlow: Google Open Sources Their Machine Learning Tool (see InfoQ)
- TensorFlow is a machine learning library created by the Brain Team researchers at Google and now open sourced under the Apache License 2.0. TensorFlow is detailed in the whitepaper TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. The source code can be found on Google Git. It is a tool for writing and executing machine learning algorithms. Computations are done in a data flow graph where the nodes are mathematical operations and the edges are tensors (multidimensional data arrays) that are exchanged between nodes. A user constructs the graph and writes the algorithms that are executed on each node. TensorFlow takes care of executing the code asynchronously on different devices, cores, and threads... TensorFlow is used by Google for GMail (SmartReply), Search (RankBrain), Pictures (Inception Image Classification Model), Translator (Character Recognition), and other products.
- See https://databricks.com/blog/2016/01/25/deep-learning-with-spark-and-tensorflow.html
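The data flow graph idea in the TensorFlow description above (nodes are operations, edges carry arrays, the graph is built first and executed afterwards) can be illustrated with a minimal pure-Python sketch. This is not the TensorFlow API: the `Node`, `constant`, and `evaluate` names are hypothetical, and plain lists stand in for tensors.

```python
# Minimal data flow graph sketch in plain Python (NOT TensorFlow's API).
# Node, constant, and evaluate are hypothetical names used for illustration.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Node:
    """One operation in the graph; `inputs` are the edges that feed it."""
    name: str
    op: Callable[..., object]
    inputs: Sequence["Node"] = ()


def constant(name, value):
    """A source node that produces a fixed value and has no inputs."""
    return Node(name, lambda: value)


def evaluate(node, cache=None):
    """Evaluate `node` after its inputs, reusing results of shared nodes."""
    if cache is None:
        cache = {}
    if node.name not in cache:
        args = [evaluate(parent, cache) for parent in node.inputs]
        cache[node.name] = node.op(*args)
    return cache[node.name]


# Build the graph first; nothing is computed at this point.
x = constant("x", [1.0, 2.0, 3.0])
w = constant("w", [0.5, 0.5, 0.5])
prod = Node("prod", lambda a, b: [ai * bi for ai, bi in zip(a, b)], (x, w))
total = Node("total", lambda v: sum(v), (prod,))

# Execute the graph by asking for the output node.
result = evaluate(total)
print(result)
```

Separating graph construction from execution is what lets a runtime decide where and in what order to run each node; this sketch simply evaluates the nodes recursively on one thread.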

Keep reading... I have not updated since early 2017!!!

Communications of the ACM (Association for Computing Machinery) is a nice central point for a quick overview of the current computationally focused mathematical sciences.
PNAS/Science/Nature - usual popular science venues
Hacker News
...

Shared Student Notebooks for sds-2.2

Several notebooks that students tried during the course are part of the course content

Yevgen text analysis of Russian words