Hadoop Development Fundamentals Training Course
Course Summary
The two-day Hadoop Development Fundamentals course gives software developers a hands-on introduction to "big data" and distributed computing. We will cover the two most important parts of the Hadoop software stack: the Hadoop Distributed File System (HDFS) and MapReduce. We will also look at several higher-level tools in the Hadoop ecosystem that make writing code to store and process terabytes and petabytes of data across a cluster as simple and straightforward as working with traditional in-memory, file-based or RDBMS data. This course includes labs on HDFS, MapReduce, Pig, Hive and HBase.
Duration
2 days.
Objectives
On completion of this course, students will have implemented several common algorithms using MapReduce, and will have optimized them using advanced features of the MapReduce API. Students will be able to:
- Explain what Hadoop is, and how it compares to traditional data storage and processing
- Articulate strategies for safely and efficiently distributing computation over a large cluster
- Implement and debug distributed algorithms
- Navigate the rich Hadoop ecosystem to choose appropriate higher-level libraries and frameworks for development tasks
Audience
The Hadoop Development Fundamentals course is geared toward software developers with experience in the Java programming language.
Instructors
Dan Rosen believes in beautiful code. Beautiful code is understandable and maintainable, it is self-documenting and self-testing, it is robust and scalable, it can be composed and reused. Beautiful code doesn't come around every day, and even the most elegant code can still have its warts, but when you see beautiful code, you know it.
For twelve years, Dan has been doing his best to write and help others write some damn fine code. Dan is the author of Marakana's Scala Fundamentals course, the latest addition to the Marakana course catalog. Before joining Marakana, he worked as a Developer Advocate at Atlassian, teaching developers how to write plugins for Atlassian's collaboration and development tools. Prior to Atlassian, Dan worked in both engineering and sales for Coverity, helping developers maintain code quality using Coverity's sophisticated static and dynamic analysis tools.
Between Coverity, Atlassian and Marakana, his tutorials have covered C/C++ best practices, Java web development (including Maven, Spring, OSGi, Guava, and RESTful web services using Jersey and Jackson), front-end development using jQuery, and functional programming with Scala.
Dan's latest hobby is lurking on StackOverflow as user "mergeconflict," waiting for tricky Haskell and Scala language questions to jump on.
Outline
- Introduction to Hadoop:
- RDBMS vs Hadoop
- Ecosystem tour (9 products)
- Vendor comparison (Cloudera, Hortonworks, MapR, Amazon EMR)
- Hardware Recommendations
- HDFS: File System details
- NameNode and DataNode architecture
- Write pipeline
- Read pipeline
- Heartbeats
- Rack awareness
- Block scanner
- MapReduce:
- JobTracker/TaskTracker architecture
- Shuffle: Sort + Partitioning
- Speculative Execution
- Input/output formats
- Distributed cache
- Pig:
- Pig philosophy and architecture
- Grunt shell
- Loading data
- Exploring Pig Latin commands
- Hive:
- Hive architecture
- Hive vs RDBMS
- HiveQL and the shell
- Managing tables (external vs managed)
- Data types and schemas
- Partitions and buckets
- HBase:
- Architecture and schema design
- HBase vs. RDBMS
- HMaster and Region Servers
- Column Families and Regions
- Write pipeline
- Read pipeline
- Next-Gen Hadoop:
- Intro to the high-level concepts coming in Hadoop 2.0:
- HDFS HA
- HDFS Federation
- MapReduce 2.0