Big Data Management & Analysis

Instructor

Professor Huy Vo

Rationale

Big data is sometimes defined as data that are too big to fit onto the analyst’s computer. With storage and networking getting significantly cheaper and faster, big data sets can easily reach the hands of data enthusiasts with just a few mouse clicks. These enthusiasts could be policy makers, government employees, or managers who would like to draw insights and (business) value from big data. Thus, it is crucial for big data to be made available to non-expert users in such a way that they can process the data without the help of a supercomputing expert. The course aims to provide a broad understanding of big data and of current technologies for managing and processing it, with a focus on urban data sets.

Course description

One approach to making big data available to a wide audience is to build big data programming frameworks that handle big data in a paradigm as close as possible to the one used for “small data.” Such a framework should be as simple as possible, even if it is not as efficient as custom-designed parallel solutions. Users should expect that if their code works within these frameworks for small data, it will also work for big data. The course will provide an overview of the big data analytics lifecycle and show how to use current techniques and frameworks to build big data analysis pipelines. This course will use Python as the main programming language; however, other languages may also be accepted where applicable, e.g., Java for Hadoop.
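As a rough illustration of the "works for small data, works for big data" idea, the sketch below computes a mean from per-partition summaries that are merged with an associative function. It uses plain Python rather than any particular framework; the function names and the trip-duration values are hypothetical and chosen only for this example.

```python
from functools import reduce


def summarize(values):
    """Summarize one partition as a (total, count) pair."""
    values = list(values)
    return (sum(values), len(values))


def merge(a, b):
    """Combine two partial summaries. The operation is associative,
    so partitions can be merged in any grouping."""
    return (a[0] + b[0], a[1] + b[1])


def mean_of_partitions(partitions):
    """Compute the overall mean from any number of partitions."""
    total, count = reduce(merge, map(summarize, partitions), (0, 0))
    return total / count if count else float("nan")


# Hypothetical trip durations in minutes (illustrative values only).
durations = [12, 7, 31, 18, 5, 22, 9, 16]

# "Small data": everything in a single partition.
small = mean_of_partitions([durations])

# "Big data": the same records split across three partitions.
big = mean_of_partitions([durations[:3], durations[3:6], durations[6:]])

print(small, big)
```

Because the per-partition step and the merge step do not depend on how the records are split, the same code gives the same answer for one partition or many. Frameworks such as Hadoop and Spark rely on this kind of property to distribute the work.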

List of topics

Topics may include but are not limited to:

  • Big Data Analytics Lifecycle

    • Roles for a successful analytics project

    • Case study to apply the data analytics lifecycle

  • Big Data challenges and how to deal with them

    • Volume/Velocity: streaming computation

    • Volume: massive parallel computing

    • Variety: NoSQL databases

  • Big Data computing model

    • MapReduce paradigm

    • Bring compute to the data

    • Data transformation with higher order functions

  • Big Data technologies and frameworks

    • Apache Hadoop and Hadoop Streaming

    • Apache Hue, Pig, Hive, Oozie

    • Apache Spark

    • Virtualization and cloud computing with Amazon and Azure

  • Big Data analytics with Spark

    • Processing spatial-temporal data sets efficiently

    • Building and running machine learning pipelines at scale

    • Debugging big data pipelines
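To give a flavor of the MapReduce paradigm and of data transformation with higher-order functions listed above, here is a minimal single-machine sketch of a word count in plain Python. It only mimics the map, shuffle, and reduce phases; it is not Hadoop or Spark code, and the function names and input lines are assumptions made for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold the values of one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Run the three phases in sequence on an iterable of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input lines (illustrative only).
lines = ["big data big ideas", "data pipelines at scale"]
counts = word_count(lines)
print(counts)
```

In a real cluster, the map and reduce functions would run in parallel on the machines that hold the data, and the framework would handle the shuffle. The user-written logic stays close to what is shown here.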

Learning objectives

Students must be able to demonstrate a working knowledge of big data technologies and how to use them in real-world applications. In particular, after the course, students are expected to:

  • Understand the big data ecosystem including its data life cycle

  • Gain experience in identifying big urban data challenges and developing analytical solutions for them

  • Understand the big data programming paradigms: streaming, parallel computing, and MapReduce

  • Gain knowledge of implementing analytical tools to analyze big data with Apache Spark & Hadoop

Assessment

  • Hands-on labs will be offered throughout the course to reinforce the knowledge learned in each topic. Each class session will be divided into a 60-minute lecture and a 60-minute lab, and a lab submission is mandatory to assess the knowledge learned. Students will also be given weekly programming assignments to assess their ability to implement analytic algorithms using big data frameworks. (60%)

  • A final project using real city data will assess key big data knowledge, including but not limited to big data management and analysis with Hadoop Streaming, Pig, Hive, Apache Spark, and Spark ML, as well as spatial analysis at scale. (40%)