Let’s dive in:
1. Programming languages
Python is a dynamically typed, easy-to-learn programming language. Scala is a statically typed programming language that runs on the JVM and is widely used in the IT industry.
2. Hadoop (MapReduce and HDFS)
Many of the technologies in this list are built on Hadoop or on ideas it popularized. Learning it helps you understand fundamental concepts such as scalability, replication, fault tolerance, and partitioning.
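To make the MapReduce and partitioning ideas concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names, the choice of crc32 as the partitioning hash, and the sample input are all assumptions made for illustration. A real cluster would run the map and reduce steps on different machines and replicate the data between them.

```python
from collections import defaultdict
from zlib import crc32


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def partition(key, num_partitions):
    """Assign a key to a partition using a stable hash (crc32 is an arbitrary choice)."""
    return crc32(key.encode("utf-8")) % num_partitions


def shuffle(pairs, num_partitions):
    """Shuffle step: group values by key inside the partition that owns the key."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition(key, num_partitions)][key].append(value)
    return partitions


def reduce_phase(grouped):
    """Reduce step: sum the values collected for each key in one partition."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, num_partitions=3):
    pairs = [pair for line in lines for pair in map_phase(line)]
    result = {}
    # Each partition could be reduced on a separate worker; here we loop instead.
    for grouped in shuffle(pairs, num_partitions):
        result.update(reduce_phase(grouped))
    return result


counts = word_count(["big data big pipelines", "data lake"])
print(counts)
```

Because every occurrence of a key hashes to the same partition, each reducer sees all the values for its keys, which is what allows the work to be spread across machines.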
3. Ingestion
Raw data is ingested into the platform using the tools listed below.
Cloud Pub/Sub, Apache Kafka, BigQuery, AWS Kinesis
4. Fast Storage
This category is for storing real-time data that can be accessed quickly.
AWS Kinesis, Apache Kafka, and Cloud Pub/Sub
5. Slow storage/direct data lake access
This category stores the data used for batch processing. The data is read directly from the data lake, which is slower than reading from fast storage.
AWS S3, Google Cloud Storage, HDFS
6. Real-time processing and analytics
The tools listed below process streaming data as it arrives.
Cloud Dataflow, Apache Flink, Spark Streaming, Amazon Kinesis Analytics
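A common operation in stream processing is aggregating events over fixed time windows. The sketch below shows the idea in plain Python. The function name, the event format of (timestamp in seconds, key), and the sample events are assumptions made for illustration. The engines listed above also deal with late or out-of-order events, state, and scaling, none of which is modeled here.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping (tumbling) time windows.

    events: iterable of (timestamp_seconds, key) tuples (a hypothetical format).
    Returns a dict keyed by (window_start_seconds, key).
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(0, "click"), (12, "click"), (59, "view"), (61, "click"), (125, "view")]
windowed = tumbling_window_counts(events, window_seconds=60)
print(windowed)
```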
7. Batch processing and analytics
Batch processing is done with these tools.
AWS Glue, Apache Spark, and Dataproc
8. Orchestration overlay
The data pipelines are orchestrated using these tools.
AWS Step Functions, Google Cloud Composer, and Apache Airflow
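Orchestration tools generally model a pipeline as a directed acyclic graph (DAG) of tasks and run each task only after the tasks it depends on have finished. The sketch below illustrates that idea with Python's standard-library graphlib module (Python 3.9+). It does not use the API of any of the tools above, and the task names and the run_pipeline helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "load_warehouse": {"aggregate"},
    "refresh_dashboard": {"load_warehouse"},
}


def run_pipeline(dag, run_task):
    """Run every task after all of its dependencies and return the order used."""
    order = list(TopologicalSorter(dag).static_order())
    for task in order:
        run_task(task)
    return order


executed = []
order = run_pipeline(pipeline, executed.append)
print(order)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering.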
9. Data Warehouse
Processed, structured data is stored in these systems for analytical queries.
AWS Redshift, BigQuery, Snowflake, PostgreSQL
10. Serving Layer
This layer serves data to its consumers.
Relational databases: PostgreSQL, MySQL
NoSQL databases: Cassandra, DynamoDB
These categories and technologies help you grow your tree of knowledge: concentrate first on the most important parts, the root and the stem, and branch out as needed.