Description (CISC 5950, Spring 2021)

This course covers Apache Hadoop and Spark technologies and their ecosystems in the context of mining big data. It provides students with both the theoretical background and hands-on computing techniques for big data analytics and its applications. Students will learn how to collect, query, and analyze data. Topics include Hadoop core technologies (HDFS, MapReduce, YARN), Spark Streaming, MLlib, clustering, and Spark SQL. The Scala programming language will be taught as part of the Spark component.

  • Instructor: Dr. Ying Mao
  • Lectures: Wednesday 5:30 p.m. - 7:45 p.m.
  • Location: Online
  • Email: ymao41 at fordham
  • Office: Online
  • Office Hours: by appointment

Lecture Plan

The course materials, including assignments, will be listed on this webpage. Please check back regularly.
  1. Course Logistics and Introduction to Big Data Systems
    • Lecture Materials:
      • Slides are available on Blackboard.
      • Assignment:
        • Create your cloud service account (Google, Azure, and AWS).
        • Hands-on: Start a virtual machine on the cloud and run “Hello World” Java/Python programs to make sure your environment works.
        • Note: We will mainly focus on computing resources on the cloud, not networking, storage, or other services.
        • PS: Please pay attention to your billing settings and always remember to shut down the cloud instance when you are done.
      • Reading: Most of you probably already know how to use “git” for version management. If you are not familiar with “git”, please read the link. There is no hurry, but we will need it in future labs and projects.
  2. Hadoop Distributed File System and Hands-on Examples
    • Lecture Materials:
      • Slides are available on Blackboard.
      • Assignment:
        • Follow the instructions and complete the 3-node cluster configuration.
        • Read the code on GitHub: HDFS-Test.
        • Read the HDFS commands reference: Link.
  3. MapReduce Programming Framework
    • Lecture Materials:
      • Slides are available on Blackboard.
      • Assignment:
        • Set up the 3-node HDFS and MapReduce cluster and check the web interfaces: port 50070 for HDFS and port 8088 for MapReduce.
        • Run the HDFS and MapReduce examples on the cloud.
        • Read and understand the code for both HDFS and MapReduce (the Python or Java code, not the configuration scripts).
      • Reading Assignment:
        • MapReduce tutorial: link.
        • Basic Maven materials: Maven-1, Maven-2.
  4. Resource Management on the Cloud and YARN
    • Lecture Materials:
      • Slides are available on Blackboard.
      • Assignment:
        • Lab 1 has been posted on Blackboard.
        • Online reading:
          • Hadoop Capacity Scheduler
          • Hadoop Fair Scheduler
          • Hadoop YARN
          • Passing parameters to Hadoop Streaming Python: link.
          • Passing parameters to shell: link.
        • Run the experiments in yarn-test in your Google Cloud environment (please open port 9000).
        • Please git pull the latest code in the mapreduce-test folder and read the code for the examples in it.
        • Optional: Conduct Hadoop MapReduce experiments with Intel HiBench: link. (In fact, if you read the code, you will see that our yarn-test is based on HiBench.)
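
As a companion to the reading on passing parameters to Hadoop Streaming Python, here is a minimal sketch of a word-count mapper that accepts one parameter. It is not taken from the course repositories: the names `map_lines`, `read_min_length`, and `MIN_LENGTH`, and the idea of filtering words by length, are assumptions made for illustration. The sketch assumes the parameter arrives either as a command-line argument (for example `-mapper "python3 mapper.py 4"`) or as an environment variable set with the Hadoop Streaming `-cmdenv` option.

```python
import os
import sys


def map_lines(lines, min_length):
    """Yield (word, 1) for every word with at least min_length characters."""
    for line in lines:
        for word in line.split():
            if len(word) >= min_length:
                yield word.lower(), 1


def read_min_length(argv, environ):
    """Read the parameter from argv[1] if given, else from MIN_LENGTH, else use 1.

    MIN_LENGTH is a hypothetical variable name chosen for this sketch.
    """
    if len(argv) > 1:
        return int(argv[1])
    return int(environ.get("MIN_LENGTH", "1"))


def main():
    """Mapper entry point: read lines from stdin, write tab-separated pairs to stdout."""
    min_length = read_min_length(sys.argv, os.environ)
    for word, count in map_lines(sys.stdin, min_length):
        print(f"{word}\t{count}")


# When saved as a mapper script (hypothetical name: mapper.py), finish the file with:
#     if __name__ == "__main__":
#         main()
```

Keeping the mapping logic in `map_lines` separate from the stdin/stdout handling makes it possible to try the function on a small in-memory list of lines before submitting a job to the cluster.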