From Code to Cloud: A Beginner's Guide to Data Analytics with Python & Microsoft Fabric

Hey everyone, Roberto here! 👋
Have you ever wondered how fast the world's population is growing? It's a fascinating question! I recently built a small project to track this in near real time, and along the way, I got to explore an amazing tool called Microsoft Fabric.
In this post, I'll walk you through my journey, from writing a simple Python script to creating interactive dashboards. We'll break down what Microsoft Fabric is, how it works with tools like Apache Spark and Power BI, and how you can start your own data adventures. Let's dive in!
Note: You'll find the Python code I wrote at the end of this post, along with screenshots of the Power BI visuals.
Part 1: The Code & The Cloud
Every data project starts with a question and some data. My goal was to get live population numbers. To do this, I wrote a simple Python script.

As the graphic shows, the code does three main things:
Scrapes Data: It visits the Worldometers website, which has live population counters, and grabs the data.
Cleans It Up: The raw data isn't always perfect, so the script cleans it and organizes it.
Calculates Growth: It then calculates metrics like how many people are added per day or per second for different countries.
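The growth math in step 3 boils down to simple arithmetic: a country's annual growth rate (in percent) times its population gives people added per year, and dividing by the days or seconds in a year gives the per-day or per-second figures. Here's a quick sketch in plain Python, using illustrative sample numbers rather than live data:

```python
# Growth arithmetic behind step 3 (sample values, not live data).
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

def growth_per_day(population: float, annual_rate_pct: float) -> float:
    """People added per day, given an annual growth rate in percent."""
    return population * (annual_rate_pct / 100) / 365

def growth_per_second(population: float, annual_rate_pct: float) -> float:
    """People added per second."""
    return population * (annual_rate_pct / 100) / SECONDS_PER_YEAR

# Example: a country of 100 million people growing 1% per year
pop, rate = 100_000_000, 1.0
print(round(growth_per_day(pop, rate)))        # 2740 people per day
print(round(growth_per_second(pop, rate), 3))  # 0.032 people per second
```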
But where does this data go? And how do we process it efficiently? That's where Microsoft Fabric comes in. Think of Fabric as a giant, all-in-one workshop for data. It combines everything you need—data engineering, data science, storage, and business intelligence—into a single, unified platform. This means you don't have to jump between a dozen different tools. It's all right there, making it easier to build powerful analytics solutions.
Part 2: What's Inside Fabric? A Look at the Engine
So, what makes Fabric so powerful? It’s built from several core components that work together seamlessly. At the heart of it all is Apache Spark, the engine that does the heavy lifting.

Let's break it down:
Fabric Core Components: Fabric bundles powerful services like Data Factory for moving data, Synapse Analytics for deep analysis, OneLake (a unified data lake) for storage, and Power BI for visualization. It's a complete toolkit!
Apache Spark: This is the real powerhouse. Spark is a distributed computing engine, which is a fancy way of saying it can process massive amounts of data incredibly fast by splitting tasks across many computers. In Fabric, Spark is built-in, so you can run complex calculations (like our population growth) without any complicated setup.
How They Work Together: In our project, we use a Spark Notebook directly inside Fabric to run our Python code. Fabric automatically handles the scaling and optimization, so we can focus on the logic, not the infrastructure.
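That "splitting tasks across many computers" idea can be sketched in plain Python: partition the data, process each partition independently, then combine the partial results. This toy version runs on one machine, but it's the same partition → map → reduce shape Spark applies across a whole cluster (the numbers are sample values for illustration):

```python
# Toy illustration of Spark's partition -> map -> reduce model (single machine).
def partition(data, n):
    """Split a list into n roughly equal chunks (Spark calls these partitions)."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_partition(chunk):
    """Work done independently on each partition -- here, a partial sum."""
    return sum(chunk)

populations = [1_450, 1_420, 340, 280, 240]  # sample values, in millions
partial_sums = [process_partition(p) for p in partition(populations, 2)]
total = sum(partial_sums)  # combine the partial results
print(total)  # 3730
```

In Spark, each partition would live on a different worker node, and the engine handles the splitting and combining for you.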
Part 3: Our Data's New Home - The Lakehouse
Once our data is processed, it needs a place to live. In modern data platforms, this home is often a Lakehouse. It’s a cool concept that combines the best of two worlds: a Data Lake and a Data Warehouse.

What is a Lakehouse? Imagine a Data Lake as a huge, natural lake where you can store anything—raw files, images, logs (unstructured data). A Data Warehouse, on the other hand, is like a neatly organized library with structured shelves for specific books (structured data). A Lakehouse gives you both: the flexibility of a lake and the organization of a warehouse. It allows you to store all your data in one place, in an open format.
Why is it useful? This approach is cost-effective, flexible, and prevents data from being locked away in separate silos. It’s the perfect foundation for analytics, business intelligence, and even AI.
Our data journey is simple: the Python script scrapes the data, Spark transforms it, and the result is saved to a table in our Fabric Lakehouse, ready for the final step.
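You can rehearse that same three-step journey locally with pandas alone: a hard-coded sample table stands in for the live scrape, and a CSV file stands in for the Lakehouse table (both stand-ins are my own, for illustration only):

```python
import pandas as pd

# 1. "Scrape": a hard-coded sample standing in for the Worldometers table.
df_raw = pd.DataFrame({
    "Country": ["India", "China", "United States"],
    "Population_2025": ["1,450,000,000", "1,420,000,000", "340,000,000"],
})

# 2. Transform: strip thousands separators and compute a derived column.
df = df_raw.copy()
df["Population_2025"] = df["Population_2025"].str.replace(",", "").astype("int64")
df["Share_Pct"] = (df["Population_2025"] / df["Population_2025"].sum() * 100).round(1)

# 3. "Save": write to CSV, standing in for saveAsTable() on the Lakehouse.
df.to_csv("population_snapshot.csv", index=False)
print(df)
```

In Fabric, step 3 becomes a Delta table in the Lakehouse, which is exactly what Power BI connects to.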
Part 4: Making Data Beautiful with Power BI
Data is great, but it’s even better when you can see it! This is where Power BI shines. It’s a tool that creates beautiful, interactive charts, graphs, and dashboards from your data.

Seamless Integration: Because Power BI is part of Fabric, it connects directly to our Lakehouse. This means our dashboards can display live, real-time data without us having to manually export or refresh anything. The charts you see in the graphic are the ones I built for this project!
The Value of Visualization: Power BI lets you drag and drop to create visuals, explore trends, and find insights. You can see which countries are growing the fastest, compare trends over time, and even view the data on a map. It turns rows of numbers into a story you can understand.
From a simple line of code to a fully interactive dashboard, the entire flow is managed within Fabric. It’s a smooth, end-to-end journey from raw data to valuable insights.
Conclusion: Your Turn to Explore!
So, what did I learn? That building a data project from scratch is more accessible than ever with tools like Microsoft Fabric. It simplifies the entire process, letting you focus on creativity and discovery.
Ready to start your own adventure? Here’s how you can begin.
How to Start Exploring Fabric:

Get the Free Trial: Microsoft offers a free trial for Microsoft Fabric. You can sign up and get access to all the tools we discussed.
Start with a Notebook: The easiest way to begin is by creating a PySpark Notebook. You can write simple Python code and see how it runs on the Spark engine.
Follow a Tutorial: Microsoft has excellent tutorials. Try the “end-to-end” tutorials for Lakehouse or Data Warehouse to get a feel for the platform.
Free Data to Play With:
You don’t need a fancy, expensive dataset to learn. There are tons of free, public datasets you can use:
Kaggle: A massive repository of datasets on everything from movies to machine learning.
Google Dataset Search: A search engine specifically for datasets.
Awesome Public Datasets on GitHub: A curated list of free datasets covering topics like climate, economics, and healthcare.
Pick a topic that interests you, find a dataset, and start exploring. The best way to learn is by doing! I hope this post inspires you to build something amazing.
The Code
This is the Python code I used to scrape the population data. I scheduled it to run every 45 minutes from a Fabric pipeline, and the Power BI report picks up the new rows from the Lakehouse table in near real time.
import requests
import pandas as pd
import datetime
from io import StringIO  # lets pd.read_html accept the HTML text as a file-like object
from pyspark.sql import functions as F
# 1. SCRAPE DATA
url = "https://www.worldometers.info/world-population/population-by-country/"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail fast if the site is unreachable
# Wrapping response.text in StringIO avoids pandas' FutureWarning about literal HTML
df_list = pd.read_html(StringIO(response.text))
df_raw = df_list[0]
# 2. CLEAN & TIMESTAMP
now = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
df_raw = df_raw.iloc[:20, [1, 2]]  # top 20 rows; columns: country name, 2025 population
df_raw.columns = ['Country', 'Population_2025']
# 3. CREATE SPARK DATAFRAME (`spark` is pre-defined in Fabric notebooks -- no setup needed)
df_spark = spark.createDataFrame(df_raw)
# 4. DEFINE ANNUAL GROWTH RATES (percent per year)
growth_rates = {
"India": 0.89, "China": -0.23, "United States": 0.54, "Indonesia": 0.79,
"Pakistan": 1.57, "Nigeria": 2.08, "Brazil": 0.38, "Bangladesh": 1.22,
"Russia": -0.57, "Ethiopia": 2.58, "Mexico": 0.83, "Japan": -0.52,
"Egypt": 1.57, "Philippines": 0.81, "DR Congo": 3.25, "Vietnam": 0.6,
"Iran": 0.93, "Turkey": 0.24, "Germany": -0.56, "Thailand": -0.07
}
# Flatten {country: rate} into [key1, val1, key2, val2, ...] for create_map
mapping_expr = F.create_map([F.lit(x) for x in sum(growth_rates.items(), ())])
# 5. CALCULATE EVERYTHING
df_final = (
    df_spark
    .withColumn("Scrape_Timestamp", F.to_timestamp(F.lit(now), "yyyy-MM-dd HH:mm:ss"))
    .withColumn("Annual_Rate", mapping_expr[F.col("Country")])
    .withColumn("Base_Pop", F.col("Population_2025").cast("double"))
    .withColumn("Seconds_Today",
        F.hour(F.current_timestamp()) * 3600
        + F.minute(F.current_timestamp()) * 60
        + F.second(F.current_timestamp()))
    .withColumn("Live_Population",
        F.col("Base_Pop")
        + F.col("Base_Pop") * (F.col("Annual_Rate") / 100) * (F.col("Seconds_Today") / 31536000))
    .withColumn("Growth_Per_Second",
        F.col("Base_Pop") * (F.col("Annual_Rate") / 100) / 31536000)
    .withColumn("Growth_Per_Day",
        F.col("Base_Pop") * (F.col("Annual_Rate") / 100) / 365)
)
# 6. SAVE TO TABLE
(df_final.select("Country", "Scrape_Timestamp", "Live_Population", "Growth_Per_Second", "Growth_Per_Day")
.write
.mode("append")
.option("mergeSchema", "true")
.saveAsTable("population_growth_history"))
print(f"✅ Success! Data appended for {now}")
The Look from Power BI (images)



Happy coding!
- Roberto




