• Online, Self-Paced
Course Description

Dataproc supports a variety of methods for executing big data workloads. This course continues the study of Dataproc implementations with Spark and Hadoop using Cloud Shell and introduces BigQuery and the PySpark REPL package.

Learning Objectives

Implementation using Dataproc

  • start the course
  • describe the various Spark and Hadoop processes that can be performed with Dataproc
  • recognize the benefits of separating storage and compute services using Cloud Dataproc
  • recall the process of monitoring and logging Dataproc jobs
  • demonstrate the process of using an SSH tunnel to connect to the master and worker nodes in a cluster
  • describe the Spark REPL package and its use in Linux
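The objectives above can be sketched as a short Cloud Shell session. This is a minimal illustration, not material from the course itself; the cluster name, region, and zone are placeholder assumptions:

```shell
# Create a Dataproc cluster (name and region are illustrative assumptions).
gcloud dataproc clusters create example-cluster --region=us-central1

# Open an SSH tunnel (SOCKS proxy on local port 1080) to the master node.
# Dataproc names the master "<cluster>-m"; workers follow "<cluster>-w-0", etc.
gcloud compute ssh example-cluster-m --zone=us-central1-a -- -D 1080 -N

# Once connected to the master node, launch the Spark REPL
# (spark-shell for Scala, or pyspark for the Python REPL).
spark-shell
```

These commands assume an active Google Cloud project with the Dataproc API enabled; without one, they serve only to show the shape of the workflow.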

Implementation using Cloud Shell

  • describe compute and storage processes, the benefits of separating them, and the virtualized distribution of Hadoop
  • define BigQuery and its benefits for large-scale analytics
  • describe the MapReduce programming model
  • demonstrate the process of submitting multiple jobs with Dataproc
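The MapReduce model named above can be illustrated without a cluster: a classic word count maps each word to a record, shuffles by sorting so identical keys are adjacent, and reduces by counting each run. The coreutils pipeline below is a minimal stand-in for that model, not Dataproc itself; on a real cluster, each job would instead be submitted (repeatedly, for multiple jobs) with `gcloud dataproc jobs submit`:

```shell
# MapReduce in miniature with coreutils:
#   map:     split the input into one word per line
#   shuffle: sort so identical keys are adjacent
#   reduce:  count each run of identical keys, then rank by count
printf 'spark hadoop spark\ndataproc spark\n' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
```

The same three phases map directly onto a distributed run: the map and reduce steps are parallelized across worker nodes, and the shuffle moves intermediate keys between them.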

Practice: Dataproc Implementations

  • recognize the various Dataproc and Cloud Shell job operations and implementations

Framework Connections

The materials within this course focus on the Knowledge, Skills, and Abilities (KSAs) identified within the Specialty Areas listed below. Click to view Specialty Area details within the interactive National Cybersecurity Workforce Framework.

Feedback

If you would like to provide feedback for this course, please e-mail the NICCS SO at NICCS@hq.dhs.gov.