
# Spark 2 Workbook Answers

```python
# 4️⃣ Action – trigger the computation and collect the count
unique_word_count = distinct_words.count()

print(f"Unique words: {unique_word_count}")
```

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DeptSalary")
  .getOrCreate()
```

```python
import requests

def fetch_batch(it):
    # One HTTP session per partition for bulk HTTP calls
    session = requests.Session()
    try:
        for url in it:
            yield session.get(url).text
    finally:
        session.close()
```

```scala
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/employees.csv")
```


```scala
import org.apache.spark.sql.functions.{avg, count}
import spark.implicits._

val result = df
  .groupBy($"department")
  .agg(count("*").as("emp_cnt"), avg($"salary").as("avg_salary"))
  .filter($"emp_cnt" > 5)

// 4️⃣ Action – trigger the computation
result.show()
```

```python
from pyspark import SparkContext
```

Add a short paragraph for each stage, explaining why you chose that API.

---

## 7. Putting It All Together – A Mini‑Project Blueprint

1. **Ingestion** – `spark.read.json` or `textFile`.
2. **Parsing** – `withColumn` + `from_unixtime`, `regexp_extract`.
3. **Cleaning** – filter out malformed rows, `na.drop`.
4. **Enrichment** – join with a static lookup table (broadcast).
5. **Aggregation** – `groupBy(date, status).agg(count("*").as("cnt"))`.
6. **Output** – write to Parquet partitioned by `date` **or** stream to console for debugging.

| Tip | How to Apply |
|-----|--------------|
| **Show Spark’s lazy evaluation** | Mention that transformations build a DAG; actions trigger execution. |
| **Explain the physical plan** | Use `df.explain()` in a note to demonstrate understanding of shuffle, broadcast, etc. |
| **State assumptions** | “Assume the input file fits in HDFS and each line is a UTF‑8 string.” |
| **Edge‑case handling** | Talk about empty files, null values, or malformed CSV rows. |
| **Performance hints** | Suggest `repartition` before a heavy shuffle or using `broadcast` for small lookup tables. |
| **Testing** | Show a tiny local test (e.g., `sc.parallelize(["a b","b c"]).flatMap(...).collect()`). |
| **Clean code** | Use meaningful variable names, consistent indentation, and short comments. |
