Marco Santoni

Talk at Politecnico di Milano

2025-11-05T00:00:00+01:00

I was invited to give a talk at the Osservatorio Big Data at Politecnico di Milano. The event took place on November 4th, 2025, and my talk was titled "Guida galattica per data product AI-ready", which means "Hitchhiker's guide to AI-ready data products".

I shared the experience we had at TeamSystem in building data products that are enablers for text2SQL agents. These agents require high quality context about the data they are querying. This context is not only about the schema of the database but also about business rules, data quality, and other metadata that can help the agent to generate accurate SQL queries.

We developed a platform that integrates the metadata of our data products into the context of our text2SQL agents. The idea is to build the metadata only once and reuse it across different use cases (both AI agents and traditional data consumers).

Below is a picture with the Databricks folks who contributed to the talk.

Switch to UV

2025-10-20T00:00:00+02:00

I just moved the git repo of this blog from an old conda+pip based setup to uv. On a Mac, start with:

brew install uv

Then, I initialized the uv project and just imported the dependencies specified in the requirements.txt file.

uv init --python 3.13
uv add --requirements requirements.txt

And that is basically it. From now on, to run pelican commands, just prefix them with uv run, e.g.,

uv run pelican content

The transition was smooth and fast. Highly recommended!

Talk at Codemotion 2025

2025-10-15T00:00:00+02:00

I spoke at Codemotion for the first time in October 2025, thanks to the work done over the last months at TeamSystem. With my colleague Mattia De Leo, we presented our recent work on building AI assistants based on knowledge graphs and large language models. The talk was well received, and we had a lot of interesting questions from the audience.

I focused my talk on Text2SQL, the task of translating natural language questions into SQL queries. This is a challenging task, and I explained why understanding the semantics of the natural language question and the structure of the database schema is not enough. Business context and further metadata are key to generating accurate SQL queries.
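
To make the point about context more concrete, here is a minimal Python sketch of how schema, business rules, and data quality notes could be rendered together as context for a Text2SQL agent. Everything in it (the TableMetadata class, its fields, the invoices table and its rule) is hypothetical and made up for illustration; it is not the platform we built at TeamSystem.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """Hypothetical metadata record for one table of a data product."""
    name: str
    columns: dict[str, str]  # column name -> SQL type
    business_rules: list[str] = field(default_factory=list)
    quality_notes: list[str] = field(default_factory=list)


def build_context(tables: list[TableMetadata]) -> str:
    """Render schema plus business metadata as plain text for an LLM prompt."""
    lines = []
    for t in tables:
        cols = ", ".join(f"{col} {typ}" for col, typ in t.columns.items())
        lines.append(f"TABLE {t.name} ({cols})")
        lines += [f"  RULE: {rule}" for rule in t.business_rules]
        lines += [f"  QUALITY: {note}" for note in t.quality_notes]
    return "\n".join(lines)


# Made-up example table: the schema alone would not tell the agent what "revenue" means.
invoices = TableMetadata(
    name="invoices",
    columns={"id": "BIGINT", "issued_at": "DATE", "total": "DOUBLE", "status": "STRING"},
    business_rules=["Revenue counts only invoices with status = 'PAID'."],
    quality_notes=["issued_at is null for drafts."],
)

context = build_context([invoices])
print(context)
```

The same metadata records could feed a data catalog page as well as the agent prompt, which is the "build once, reuse" idea.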

Speaking at Big Data London 2025

2025-10-01T00:00:00+02:00

I attended Big Data London for the first time in September 2025. I gave a talk with my colleague Andrea Romeo about a challenging task we faced at TeamSystem: offloading thousands of SQL Server tenants to a data lake via CDC (Change Data Capture) with Debezium.

I appreciated the conference, which I consider close to actually being a trade fair. It was a great chance to meet key vendors in the big data space and to discuss their products with them.

Weight AI Eng skills by page count

2025-08-31T21:41:00+02:00

I am currently reading "AI Engineering" by Chip Huyen and am really enjoying it. I spent some years as a data scientist in the past, and I now see some analogies between data science and AI engineering. The analogy lies in the gap between how the industry talks about the discipline and what engineering teams actually struggle with every day. Before going into that, let's define how we can evaluate the importance of different skills in AI engineering.

Which AI engineering skill is actually the most important while being the least talked about?

To do so, I will use a totally arbitrary method: weighting each skill by the number of pages it appears in the book. This is not a perfect metric, but it can give us some insights into which skills the author considers more important. So I drew the following chart based on the page count:

Evaluation, evaluation, evaluation

I have long considered evaluation the skill that differentiates a senior data scientist from a junior one, just as testing is the skill that differentiates a senior software engineer from a junior one. In the context of AI engineering, evaluation becomes even more crucial. Why is it so relevant? Taking some points from Huyen's book:

open-ended outputs: for a given input, there are many possible correct responses.
the more intelligent AI models become, the harder it is to evaluate them. You can no longer evaluate a response based on how it sounds.
black box models: no details are available about the model architecture, the training data, or the training process.

evals are surprisingly often all you need
— Greg Brockman (@gdb) December 9, 2023

Book review: thinking in bets

2025-08-24T13:41:00+02:00

“Wanna bet?” triggers us to engage in that third step that we only sometimes get to. Being asked if we are willing to bet money on it makes it much more likely that we will examine our information in a less biased way, be more honest with ourselves about how sure we are of our beliefs, and be more open to updating and calibrating our beliefs.

A couple of months ago, I read Thinking in Bets by Annie Duke. The book presents a compelling case for decision-making under uncertainty and offers practical strategies for improving our thinking processes.

Key Takeaways

One of the key takeaways from the book is the concept of "resulting," which is the tendency to judge the quality of a decision based on its outcome rather than the reasoning behind it. Duke argues that this mindset can lead to poor decision-making in the long run, as it encourages us to ignore valuable information and lessons learned from our experiences.

My (conditioned) opinion on the book

Before reading Thinking in Bets, I had the chance to read books like

Thinking, Fast and Slow by Daniel Kahneman
The Signal and the Noise by Nate Silver
The Black Swan by Nassim Nicholas Taleb

While Thinking in Bets offers valuable insights, most concepts reminded me of ideas presented in these other works. I think it serves as a useful primer for those new to the subject, but it may not offer enough depth for readers already familiar with these concepts.

Other quotes I liked

We are discouraged from saying “I don’t know” or “I’m not sure.” We regard those expressions as vague, unhelpful, and even evasive. But getting comfortable with “I’m not sure” is a vital step to being a better decision-maker. We have to make peace with not knowing.

In most of our decisions, we are not betting against another person. Rather, we are betting against all the future versions of ourselves that we are not choosing. We are constantly deciding among alternative futures: one where we go to the movies, one where we go bowling, one where we stay home.

People are credulous creatures who find it very easy to believe and very difficult to doubt. [actually citing Daniel Gilbert]

Surprisingly, being smart can actually make bias worse. Let me give you a different intuitive frame: the smarter you are, the better you are at constructing a narrative that supports your beliefs, rationalizing and framing the data to fit your argument or point of view. After all, people in the “spin room” in a political setting are generally pretty smart for a reason.

Is Lakehouse Monitoring worth it?

2025-08-23T09:41:00+02:00

I've created a toy Lakehouse Monitoring in Databricks setup to explore its features and capabilities. The goal is to understand how it works and what benefits it can bring. Here's an overview of what I cover in this post:

How to Setup a toy Lakehouse Monitoring
- Dashboard
- Alerts
Pricing
My opinion on what I've seen
Where's Databricks going?
- My2C

If you want to know more about what Databricks' Lakehouse Monitoring can do, I recommend checking out the official documentation. I have prepared a basic map of concepts that can help you get started.

How to Setup a toy Lakehouse Monitoring

Let's start by creating a table we can work with. It should be a time-series table

create table workspace.default.sales (
    timestamp TIMESTAMP,
    amount DOUBLE
)

I then created a basic notebook, insert 1h of data.ipynb, to fill the table with data. Then I set up a job to run that notebook every hour.

I won't add the code here because it is quite basic. It adds records with random values to the table (with timestamps within the time window of the hour).
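
For readers who want a starting point anyway, here is a minimal Python sketch of what such a notebook could do. It is not the original notebook: the generate_rows function, the fixed seed, and the 20 to 30 range for amount (guessed from the sample output) are all assumptions, and the Spark write is left as a comment because it only works inside a Databricks session.

```python
import random
from datetime import datetime, timedelta, timezone


def generate_rows(hour_start, n, seed=None):
    """Return n (timestamp, amount) rows with timestamps inside the hour starting at hour_start."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        offset = timedelta(seconds=rng.uniform(0, 3600))
        amount = rng.uniform(20.0, 30.0)  # assumed range, inferred from the sample output
        rows.append((hour_start + offset, amount))
    return rows


hour_start = datetime(2025, 8, 22, 8, 0, tzinfo=timezone.utc)
rows = generate_rows(hour_start, n=5, seed=42)
for ts, amount in rows:
    print(ts.isoformat(), round(amount, 2))

# In a Databricks notebook, the rows could then be appended with something like:
# spark.createDataFrame(rows, "timestamp TIMESTAMP, amount DOUBLE") \
#     .write.mode("append").saveAsTable("workspace.default.sales")
```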

select * from workspace.default.sales
limit 10

timestamp	amount
2025-08-22T08:58:07.929Z	22.570402586080093
2025-08-22T08:51:51.929Z	20.713874028846366
2025-08-22T09:03:54.929Z	21.97633174572098
2025-08-22T08:28:44.929Z	27.94416169489641
2025-08-22T09:05:29.929Z	21.307407500066127
2025-08-22T08:17:03.929Z	22.37476392747984
2025-08-22T09:05:05.929Z	26.446829879517953
2025-08-22T08:42:32.929Z	27.86840740526422
2025-08-22T08:33:55.929Z	27.236961570798947
2025-08-22T08:42:30.929Z	25.395336538015343

Then, let's create the Monitor via Unity Catalog Explorer 👇

I set up the monitor with a TimeSeries profile. I selected the timestamp column and a granularity of 1 hour. The schedule of the monitor is actually daily.

Below, a screenshot of the Unity Catalog Explorer page to create the Lakehouse Monitoring.

What happens after the creation of the Monitoring? By default, two new tables are created

<table_name>_profile_metrics
<table_name>_drift_metrics

Let's inspect them

SHOW TABLES IN workspace.default;

database	tableName	isTemporary
default	sales	false
default	sales_drift_metrics	false
default	sales_profile_metrics	false
	_sqldf	true

select * from workspace.default.sales_profile_metrics

window	log_type	logging_table_commit_version	granularity	slice_key	slice_value	column_name	count	data_type	num_nulls	avg	min	max	stddev	num_zeros	num_nan	min_length	max_length	avg_length	non_null_columns	frequent_items	median	distinct_count	percent_nan	percent_null	percent_zeros	percent_distinct
List(2025-08-22T08:00:00.000Z, 2025-08-22T09:00:00.000Z)	INPUT	26	1 hour	null	null	:table	1344	null	null	null	null	null	null	null	null	null	null	null	List(timestamp, amount)	null	null	null	null	null	null	null
List(2025-08-22T08:00:00.000Z, 2025-08-22T09:00:00.000Z)	INPUT	26	1 hour	null	null	amount	1344	double	0	25.059042979855878	20.0007185120017	29.99500143216646	2.8817003714622857	0	0	null	null	null	null	null	25.133808306174608	1277	0.0	0.0	0.0	95.01488095238095
List(2025-08-22T08:00:00.000Z, 2025-08-22T09:00:00.000Z)	INPUT	26	1 hour	null	null	timestamp	1344	timestamp	0	null	1.755850122929519E9	1.755853196019946E9	null	null	null	null	null	null	null	null	1.755851952929519E9	1161	null	0.0	null	86.38392857142857
List(2025-08-22T09:00:00.000Z, 2025-08-22T10:00:00.000Z)	INPUT	26	1 hour	null	null	timestamp	1192	timestamp	0	null	1.755853200929519E9	1.755856792565437E9	null	null	null	null	null	null	null	null	1.755854777019946E9	997	null	0.0	null	83.64093959731544
List(2025-08-22T09:00:00.000Z, 2025-08-22T10:00:00.000Z)	INPUT	26	1 hour	null	null	:table	1192	null	null	null	null	null	null	null	null	null	null	null	List(timestamp, amount)	null	null	null	null	null	null	null
List(2025-08-22T09:00:00.000Z, 2025-08-22T10:00:00.000Z)	INPUT	26	1 hour	null	null	amount	1192	double	0	24.847526487373074	20.01063581181195	29.99539398790598	2.8880160456500867	0	0	null	null	null	null	null	24.694267025212234	1192	0.0	0.0	0.0	100.0
List(2025-08-22T10:00:00.000Z, 2025-08-22T11:00:00.000Z)	INPUT	26	1 hour	null	null	amount	941	double	0	24.969054277784952	20.015936515703217	29.981071930502402	2.846773237150427	0	0	null	null	null	null	null	24.969920953185955	925	0.0	0.0	0.0	98.29968119022317
List(2025-08-22T10:00:00.000Z, 2025-08-22T11:00:00.000Z)	INPUT	26	1 hour	null	null	timestamp	941	timestamp	0	null	1.755856803565437E9	1.755860399121952E9	null	null	null	null	null	null	null	null	1.755858763121952E9	857	null	0.0	null	91.07332624867162
List(2025-08-22T10:00:00.000Z, 2025-08-22T11:00:00.000Z)	INPUT	26	1 hour	null	null	:table	941	null	null	null	null	null	null	null	null	null	null	null	List(timestamp, amount)	null	null	null	null	null	null	null
List(2025-08-22T11:00:00.000Z, 2025-08-22T12:00:00.000Z)	INPUT	26	1 hour	null	null	:table	995	null	null	null	null	null	null	null	null	null	null	null	List(timestamp, amount)	null	null	null	null	null	null	null

The profile table has a row for each pair of

window: the beginning and end of every hour
column_name: every column of the table. In addition, it adds a special :table row to compute the table-level profile.

Optionally, it can slice on column values when specified at the time of the creation of the Monitor.

For each row, it computes a bunch of statistics like avg, quantiles, min, max, etc. (when applicable, e.g., for float columns).

select * from workspace.default.sales_drift_metrics

window	granularity	slice_key	slice_value	column_name	data_type	window_cmp	drift_type	count_delta	avg_delta	percent_null_delta	percent_zeros_delta	percent_distinct_delta	non_null_columns_delta	js_distance	ks_test	wasserstein_distance	population_stability_index	chi_squared_test	tv_distance	l_infinity_distance
List(2025-08-22T11:00:00.000Z, 2025-08-22T12:00:00.000Z)	1 hour	null	null	:table	null	List(2025-08-22T10:00:00.000Z, 2025-08-22T11:00:00.000Z)	CONSECUTIVE	-418	null	null	null	null	List(0, 0)	null	null	null	null	null	null	null
List(2025-08-22T09:00:00.000Z, 2025-08-22T10:00:00.000Z)	1 hour	null	null	:table	null	List(2025-08-22T08:00:00.000Z, 2025-08-22T09:00:00.000Z)	CONSECUTIVE	-152	null	null	null	null	List(0, 0)	null	null	null	null	null	null	null
List(2025-08-22T10:00:00.000Z, 2025-08-22T11:00:00.000Z)	1 hour	null	null	:table	null	List(2025-08-22T09:00:00.000Z, 2025-08-22T10:00:00.000Z)	CONSECUTIVE	-251	null	null	null	null	List(0, 0)	null	null	null	null	null	null	null
List(2025-08-22T11:00:00.000Z, 2025-08-22T12:00:00.000Z)	1 hour	null	null	timestamp	timestamp	List(2025-08-22T10:00:00.000Z, 2025-08-22T11:00:00.000Z)	CONSECUTIVE	-418	null	0.0	null	-5.604875005841791	null	null	null	null	null	null	null	null
List(2025-08-22T09:00:00.000Z, 2025-08-22T10:00:00.000Z)	1 hour	null	null	timestamp	timestamp	List(2025-08-22T08:00:00.000Z, 2025-08-22T09:00:00.000Z)	CONSECUTIVE	-152	null	0.0	null	-2.742988974113132	null	null	null	null	null	null	null	null
List(2025-08-22T10:00:00.000Z, 2025-08-22T11:00:00.000Z)	1 hour	null	null	timestamp	timestamp	List(2025-08-22T09:00:00.000Z, 2025-08-22T10:00:00.000Z)	CONSECUTIVE	-251	null	0.0	null	7.4323866513561825	null	null	null	null	null	null	null	null
List(2025-08-22T11:00:00.000Z, 2025-08-22T12:00:00.000Z)	1 hour	null	null	amount	double	List(2025-08-22T10:00:00.000Z, 2025-08-22T11:00:00.000Z)	CONSECUTIVE	-418	0.013276990751066364	0.0	0.0	-0.21172707932451829	null	null	List(0.049, 0.3829208808885818)	0.16058939377867895	0.028216393260236203	null	null	null
List(2025-08-22T10:00:00.000Z, 2025-08-22T11:00:00.000Z)	1 hour	null	null	amount	double	List(2025-08-22T09:00:00.000Z, 2025-08-22T10:00:00.000Z)	CONSECUTIVE	-251	0.12152779041187856	0.0	0.0	-1.7003188097768316	null	null	List(0.038, 0.4228041687817168)	0.14481617304902772	0.021676869284417942	null	null	null
List(2025-08-22T09:00:00.000Z, 2025-08-22T10:00:00.000Z)	1 hour	null	null	amount	double	List(2025-08-22T08:00:00.000Z, 2025-08-22T09:00:00.000Z)	CONSECUTIVE	-152	-0.21151649248280435	0.0	0.0	4.985119047619051	null	null	List(0.062, 0.014853915612309707)	0.21231645982023184	0.022438267042335924	null	null	null

The drift table is similar to the profile table. The drift table has a row for each pair of

window: the beginning and end of every hour
column_name: every column of the table. In addition, it adds a special :table row to compute the table-level drift.

In addition, it has the window_cmp, where cmp stands for compare. All the statistics are compared against another window (the previous one). There are various statistics like

count_delta
ks_test: in statistics, the Kolmogorov–Smirnov test can be used to test whether two samples come from the same distribution
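
To check my reading of these columns, here is a small Python sketch that recomputes a few deltas for the amount column from the profile values shown earlier (window 09:00-10:00 compared against 08:00-09:00). It assumes each delta is simply "current window minus compared window", which matches the numbers in the drift table. The ks_statistic function is a plain textbook version of the two-sample Kolmogorov–Smirnov statistic run on toy samples; it is not how Databricks computes the ks_test column.

```python
from bisect import bisect_right

# Values copied from the sales_profile_metrics rows of the `amount` column.
previous = {"count": 1344, "avg": 25.059042979855878, "distinct_count": 1277}  # 08:00-09:00
current = {"count": 1192, "avg": 24.847526487373074, "distinct_count": 1192}   # 09:00-10:00


def percent_distinct(profile):
    return 100.0 * profile["distinct_count"] / profile["count"]


# Assumption: delta = current window minus compared window.
count_delta = current["count"] - previous["count"]
avg_delta = current["avg"] - previous["avg"]
percent_distinct_delta = percent_distinct(current) - percent_distinct(previous)
print(count_delta, avg_delta, percent_distinct_delta)


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.0 for identical samples
print(ks_statistic([1, 2, 3], [4, 5, 6]))        # 1.0 for fully separated samples
```

The recomputed count_delta (-152), avg_delta, and percent_distinct_delta agree with the corresponding row of the drift table.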

Dashboard

Lakehouse Monitoring also automatically creates a dashboard that displays the data in these profile and drift tables.

😓 However, I find this dashboard too crowded and not ready to use. You need to customize it yourself.

Alerts

Monitor alerts are created and used the same way as other Databricks SQL alerts. You create a Databricks SQL query on the monitor profile metrics table or drift metrics table. You then create a Databricks SQL alert for this query.
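
In Databricks the rule itself lives in a SQL query plus a SQL alert. To show the kind of logic such a query encodes, here is a Python stand-in that applies a threshold to rows copied from the drift table above. The 0.05 threshold and the windows_to_alert function are hypothetical, and I am assuming that the second element of ks_test is the p-value.

```python
# Rows copied from sales_drift_metrics for the `amount` column:
# (window start, count_delta, second element of ks_test).
# Assumption: that second element is the p-value of the KS test.
drift_rows = [
    ("2025-08-22T09:00", -152, 0.014853915612309707),
    ("2025-08-22T10:00", -251, 0.4228041687817168),
    ("2025-08-22T11:00", -418, 0.3829208808885818),
]

P_VALUE_THRESHOLD = 0.05  # hypothetical threshold, to be tuned per table


def windows_to_alert(rows, threshold=P_VALUE_THRESHOLD):
    """Return the windows whose KS p-value falls below the threshold."""
    return [window for window, _count_delta, p_value in rows if p_value < threshold]


alerts = windows_to_alert(drift_rows)
print(alerts)  # ['2025-08-22T09:00']
```

Note that the flagged window comes from purely random data, so a naive threshold like this one can fire even when nothing is wrong.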

Pricing

Lakehouse Monitoring is billed under a serverless jobs SKU. You can monitor its usage via the system.billing.usage table or via the Usage dashboard in the Account console.

You need to pay attention. I expect that the costs may rise for tables with a high number of columns if you don't fine-tune the monitor.

SELECT usage_date, sum(usage_quantity) as dbus
FROM system.billing.usage
WHERE
  usage_date >= DATE_SUB(current_date(), 30) AND
  sku_name like "%JOBS_SERVERLESS%" AND
  custom_tags["LakehouseMonitoring"] = "true"
GROUP BY usage_date
ORDER BY usage_date DESC

usage_date	dbus
2025-08-22	1.852757467777777736

My opinion on what I've seen

Lakehouse monitoring is all about these two profile and drift tables. It is a kind of brute force approach that runs standardized monitoring over the specified table and stores the output in the profiling tables. Is it convenient? It depends on what you're looking for. It is not a free lunch.

Pros 🟢

It takes little effort to set up. By default, common controls are applied to all columns in the monitored table.
Most common monitoring scenarios are covered by the TimeSeries profile or by the Snapshot profile (I left out the ML inference profile for the sake of simplicity). The setup time is shorter compared to anything you build yourself.
You have a framework ready to use. You save the time required to design it, and you avoid reinventing the wheel. You can focus on your business needs rather than on data engineering stuff.
I like the simple but effective design of the drift metric table and of the windowing. Building something like this yourself will probably make you hit some hidden edge cases (as happens any time you work with times and dates).

Cons 🔴

Once the metrics are computed in the profile and drift tables, only half of the job is done. You still have to decide what to monitor and how to do it. You're probably not interested in monitoring every single column in every row of the metric tables (otherwise you may be flooded with false alarms). Fine-tuning the actual alerts is still required, and it does not come for free.
You can't know in advance the overall cost of the monitoring. You need to try it with a realistic (production-like) scenario and check early how much you're paying. I expect it to depend mainly on
- the data volume
- the columns in the table
- the frequency of the controls

Where's Databricks going?

In addition to Lakehouse Monitoring, Databricks has released a feature (in Beta) of data quality monitoring. This new monitoring

is quicker to set up. It is toggled on for an entire schema and monitors all the tables in the schema.
monitors only simple freshness and completeness quality controls
has no parametrization
still needs alerts to be set manually

I made a short recap here.

Feature	Lakehouse Monitoring	Data Quality Monitoring (Beta)
Scope	Table. It is set at table level. It monitors the table and its columns.	Schema. It is set at schema level and monitors all tables in such schema.
Setup	Choose the profile, optional slicing, window and frequency.	On-off on the schema.
What is monitored	Various statistics as snapshot, time series, and inference.	Freshness (is data recent?) and completeness (is the volume as expected?)
Customization	Limited	No
Alert	To be set manually on the output table.	To be set manually on the output table.

My2C

🟢 I think Databricks is going in the right direction. Fast adoption of basic quality controls. Avoid the "didn't notice data is old in production" moments with little effort.

🔴 The alerting setup is still quite SQL-based and there is some trial-and-error around it. I would expect that a basic alert should be enabled by default.

Learn basics of MCP with FastMCP

2025-08-17T08:35:00+02:00

I was looking for a resource to get a deeper understanding of MCP (Model Context Protocol). Rather than looking for courses or books, I opted for RTFM. Not the MCP manual itself, though, as I would not have fun reading protocol specs. I took FastMCP and went through the docs. I was not just reading the docs. While reading them, I was:

sketching a concept map
coding some basic hello-world examples

I enjoyed this approach because it was quite rapid (I could invest hours, not days) while still practical and hands-on. The concept map helps me keep some notes for the future. Notes you write yourself are the ones that stick best.

Concept map

I used draw.io to draw the concept map and exported to SVG for best web rendering. You can explore it here 👇

Little codebase

I made a basic server and client using the streamable HTTP transport. Everything is in this repo. There is a basic example for each key component of an MCP server:

tool
resource
prompt
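
To give an idea of what sits underneath these three components, here is a stdlib-only Python sketch of the JSON-RPC 2.0 requests that a client sends for each of them, as I understand the MCP spec. It is not the code from my repo and it does not use FastMCP; the tool name, the resource URI, and the prompt name are hypothetical.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests carry an id so responses can be matched to them


def jsonrpc_request(method, params):
    """Serialize a JSON-RPC 2.0 request, the message format MCP is built on."""
    return json.dumps({"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params})


# Hypothetical tool, resource URI and prompt names, for illustration only.
call_tool = jsonrpc_request("tools/call", {"name": "add", "arguments": {"a": 1, "b": 2}})
read_resource = jsonrpc_request("resources/read", {"uri": "config://version"})
get_prompt = jsonrpc_request("prompts/get", {"name": "summarize", "arguments": {"text": "hello"}})

for message in (call_tool, read_resource, get_prompt):
    print(message)
```

A framework like FastMCP hides these messages behind Python functions, which is why the examples in the docs feel so compact.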

Opinions about MCP

I'll share a couple of opinions I got while exploring MCP and FastMCP.

Evolving rapidly

FastMCP is evolving according to the MCP specs of course. These specs are quite recent. The first stable version was released in November 2024, while the latest (and third) was released in June 2025. I read about an important feature like Structured Output and found it was only a few weeks old at the time of my reading. It is a great sign that things are moving so fast, but, at the same time, you should consider this quick evolution if you're working on a production-ready application.

You may want to stay simple and minimize the overall engineering investment. Otherwise, you may find yourself investing in engineering features that a few months later might be supported by the protocol or by the ecosystem.

Good design

I appreciated the design of the protocol and of FastMCP itself. It is simple enough and based on three elements (tools, resources, prompts), but it still covers a large share of the needs of agent applications. There are useful interfaces and features for common needs like

interactive input by users
progress monitoring
logging and messaging
sampling from client's LLM

The design is composable, which makes it scalable for larger applications. An MCP server can literally import another MCP server or mount it. Name clashes or duplicates can be handled explicitly by developers.

Rich ecosystem

FastMCP is one example of the ecosystem of tools and frameworks that is growing around MCP. The ecosystem is what matters (more than the protocol design).

Engineering is still THE thing

Building an MCP server is still an engineering and design job. Should feature X of my server be a resource? Or a tool? Should this input be parametrized? There are no right or wrong answers to these questions. Design styles will emerge, and the touch of software architects will be needed to make sure things can scale and are easy to maintain.

Why reading Deep Work

2025-04-13T19:35:00+02:00

Shallow Work: Noncognitively demanding, logistical-style tasks, often performed while distracted. These efforts tend not to create much new value in the world and are easy to replicate.

Cal Newport gives this definition of how I spend part of my work time in his book Deep Work. Checking emails, messaging on Teams, and attending meetings are a few examples of such "noncognitively demanding" activities. On some days, these activities may even fill up my whole work time. How much space is left for cognitively intensive tasks? When I have space for a few consecutive hours of highly focused work, I have the feeling that what I'm doing is actually valuable.

I'm not alone. Talking with my peers, I understood that almost everyone is facing the same issue: days full of unfocused and undemanding tasks. I'm part of what Newport defines as "knowledge workers". The output of our work is neither a physical artifact nor a service that is directly customer-facing. The paradox is the following. Knowledge workers give their best when they have space for focus; however, organizations adopt instant communication tools that are designed to drain such attention.

Deep Work Hypothesis

Is it de facto a problem? Or is it actually an opportunity for those who are aware of it? Here comes the interesting point made by Newport. He defines the "Deep Work Hypothesis" as:

The ability to perform deep work is becoming increasingly rare at exactly the same time it is becoming increasingly valuable in our economy. As a consequence, the few who cultivate this skill, and then make it the core of their working life, will thrive.

The claim is that we operate in a distracting work environment. At the same time, the job market requires continuous learning and high specialization. Both of these requirements need deep focus, which is harder to achieve today. People who can shield themselves from distractions and dedicate time to focused work are the ones who will thrive.

Busyness

What are the symptoms of a lack of deep work in your daily job? If you consider yourself a knowledge worker, spending most of your time answering emails or instant messages is an indicator that deep work is lacking. Why is it so common in most companies then? Newport introduces the Principle of Least Resistance.

In a business setting, without clear feedback on the impact of various behaviors to the bottom line, we will tend toward behaviors that are easiest in the moment.

So, why do many workers (especially in large organizations) spend so much time on low-intensity activities (like emailing or chatting)? Because it is easy. This is the short answer given by the Principle of Least Resistance. Answering emails or messages gives immediate feedback to the worker, while long and focused activities likely lack such quick feedback. Our brain is tempted by such quick feedback. Resisting the quick feedback of going through email requires a greater effort.

This workday schedule has two drawbacks. First, continuous interruptions and context switches reduce focus and attention. Low levels of cognitive intensity reduce the value generated by such activities. Second, workers are unhappy with the work they are doing. Newport cites a study by the psychologist Csikszentmihalyi. His studies demonstrated that, surprisingly, we are most satisfied when we're given difficult tasks to accomplish rather than when relaxing.

The best moments usually occur when a person’s body or mind is stretched to its limits in a voluntary effort to accomplish something difficult and worthwhile.

Budget of willpower

We cannot just decide to concentrate and expect it to happen. It just does not work. We have a limited amount of willpower. It decreases when we use it. For simplicity, we can consider it as a daily budget of willpower. The key recommendation by Newport is the following. Build a set of routines and rituals that help you develop deep work habits. By doing so, you minimize the amount of willpower you need to use to focus. The less willpower you use for each focus session, the more you save for focusing during the rest of the day.

Newport describes four approaches to building such routines:

monastic philosophy
bimodal philosophy
rhythmic philosophy
journalistic philosophy

I won't describe each of them here. You should read the book to get a full picture of them.

I'll share an example to help you get a taste of what it means to build your "philosophy" of routines. The book uses the example of the workday schedule of Charles Darwin (you can get more details here).

Charles Darwin had a similarly strict structure for his working life during the period when he was perfecting On the Origin of Species. As his son Francis later remembered, he would rise promptly at seven to take a short walk. He would then eat breakfast alone and retire to his study from eight to nine thirty. The next hour was dedicated to reading his letters from the day before, after which he would return to his study from ten thirty until noon. After this session, he would mull over challenging ideas while walking on a prescribed route that started at his greenhouse and then circled a path on his property. He would walk until satisfied with his thinking then declare his workday done.

What I do

To cultivate the habit of deep work and improve my focus, I have set the following goals for myself:

Focused Activities Early Morning or Weekends
I aim to dedicate the early hours of my workdays or weekends to cognitively demanding tasks. These are the times when my mind is fresh, and distractions are minimal. By prioritizing deep work during these periods, I can make significant progress on challenging projects.
Turn Off Popups and Notifications on Desktop
To shield myself from distractions, I will disable all unnecessary popups and notifications on my desktop. This includes email alerts, instant messaging notifications, and other interruptions that can break my focus. Creating a distraction-free environment is essential for maintaining deep concentration.
Emails in Late Afternoon or Evenings
I will reserve time for checking and responding to emails in the late afternoon or evenings. This ensures that my most productive hours are not consumed by shallow work. By batching email tasks, I can handle them more efficiently without constant context switching.
Take Notes During Meetings to Help Focus
During meetings, I will take detailed notes to stay engaged and focused. This practice not only helps me retain important information but also prevents my mind from wandering. It ensures that I am fully present and can contribute meaningfully to discussions.

By adhering to these goals, I aim to build a sustainable routine that supports deep work and enhances the quality of my output.

Careers in data and AI: seminar and podcast

2024-06-30T06:41:00+02:00

I celebrated 10 years since my graduation by trying to give back to students something I learned about our industry. I was invited by Prof. Francesco Calimeri from UniCal to hold a seminar for students in the last year of the MSc in Computer Science and AI. A couple of pictures from the event here 👇

I then took the content of my presentation and organized a panel with 2 special guests (Paolo Platter and Alberto Danese) at Intervista Pythonista. This is what came out of it:

Organizing a conference: Py4AI

2024-03-25T09:21:00+01:00

I had a lot of fun (and work) over the last 6 months organizing a conference, Py4AI. The conference was held in Pavia, Italy, on March 16th, 2024. It was a success, with over 200 attendees and 12 speakers. It was organized by a group of volunteers, including myself, Alessandro Ferrari, Pietro Peterlongo, Cesare Placanica, Thao Hoang, and Luca Baggi. A screenshot with the speakers lineup:

You can find here the YouTube playlist with the talks delivered at the conference.

Reading "The Design of Web APIs"

2023-08-27T07:35:00+02:00

Why bother reading a book about the design of web APIs when working in data science like I do? I found this book called The Design of Web APIs by Arnaud Lauret and decided to give it a try.

Why reading it

Data science is shifting towards turning models and solutions into API products. And this can be true not only when you develop a product you actually sell publicly. Even when developing a data product inside an organization, you may want to expose your data service via APIs.

So, when you reach the point of developing an API, there are plenty of design decisions to make (e.g., which routes to expose, which result codes, which response payload, etc.). If you start this design process without some guidelines, you may spend plenty of energy trying to answer these design questions, or even risk introducing technical debt that you will pay for later.

What is the book about

The book states that, when you take the role of an API designer, you are just like a designer of real-world objects. An API is made for users and should help them achieve their goals. API designers should prevent internal details of the backend from affecting the design of the APIs. The focus of the designer is to simplify the job of the consumer.

Usability is what distinguishes awesome APIs from mediocre or passable ones.

The book is focused on shifting your point of view from the provider to the consumer. It may seem obvious and everyone might agree on it, but it is not that straightforward to make it happen because we may have biases or we may take design shortcuts that simplify the development of our backend. The author introduces methods like the API goals canvas to help us list the needs of the user and focus on them.

You will find in the book charts or schemas like the one above that explain the design choices you can face on a daily basis. In this example, you may want to stick to full REST compliance with a POST /orders. Or you may want to relax this constraint via a non-REST design like POST /cart/check-out that might actually be more intuitive for the consumer developers.

And more technicalities

The book has a focus on these design choices (eg the resource expansion pattern for nested objects in API responses), but it is also a good source of knowledge about technical details around APIs that you can use on a daily basis. For example, you will read chapters about

OpenAPI Specification
OAuth2
features of HTTP you might not be using (eg there are around 200 different standard HTTP headers)
data format standards like ISO 4217 for currencies or ISO 8601 for date and time-related data
etc.
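
As a small illustration of the data format standards mentioned above, here is a minimal sketch (not taken from the book) of a response payload that uses ISO 8601 for the timestamp and an ISO 4217 alphabetic code for the currency. The order_payload function and its field names are hypothetical, and the currency check only validates the shape of the code, not that it is a registered currency.

```python
import json
from datetime import datetime, timezone


def order_payload(order_id: str, amount: str, currency: str, created_at: datetime) -> str:
    """Serialize a hypothetical order using ISO 8601 and ISO 4217 conventions."""
    # Shape check only: three uppercase ASCII letters, like EUR or USD.
    if len(currency) != 3 or not (currency.isascii() and currency.isalpha() and currency.isupper()):
        raise ValueError("currency must be a 3-letter uppercase code, e.g. EUR")
    # A naive datetime is ambiguous for API consumers, so require an offset.
    if created_at.tzinfo is None:
        raise ValueError("created_at must be timezone-aware")
    return json.dumps({
        "id": order_id,
        "amount": amount,                     # string, to avoid float rounding
        "currency": currency,                 # ISO 4217 alphabetic code
        "createdAt": created_at.isoformat(),  # ISO 8601 date and time
    })


payload = order_payload(
    "o-1", "19.90", "EUR", datetime(2023, 8, 27, 7, 35, tzinfo=timezone.utc)
)
print(payload)
```

Sticking to these standards means the consumer can parse the payload with off-the-shelf tools instead of reverse engineering a custom date or currency format.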

All about dev experience

It is the first book I have read that is entirely dedicated to developer experience. How can we improve the productivity and the overall satisfaction of the developers using our APIs?

By reading it, you can learn an approach that goes beyond designing web APIs. You learn to focus on what simplifies the life of a developer, and I'm sure this thinking has an effect on how you write your code, your internal tools or even your docs.

Expectations from a Data Analyst

2023-08-07T07:35:00+02:00

When you work as a data analyst or data scientist (I'll use the terms interchangeably) in a company, you may not be training predictive models every single day. A significant (and often interesting) part of your job is answering business questions via data mining, regardless of whether you do it with machine learning, descriptive statistics or whatever. You may start with a business question like:

why are our revenues increasing in the last quarter?

what are common patterns between our loyal customers?

Such simple questions often require complex work that goes beyond knowing statistics well. You need to know how your business works and what the expectations of your stakeholders are.

What the analyst enjoys the most (and the least)

Once a business question has arrived, where do we start from? Most data analysts would dive into the data exploration phase. This phase is usually the first one of the activity, and the data analysts look into distributions and patterns in the data. The goal here is to get a good comprehension of the data we are sitting on. And usually the data analyst has fun during this data exploration time.🎉🙌 He or she is playing with charts and with some statistics from the dataset.

What does the data analyst usually not enjoy doing? 👎😭 Based on my experience, preparing the presentation about the results of the analysis is the part of the activity that most data analysts enjoy the least. And what does it imply?

Imagine you have 10 days to work on this data analysis before the meeting with your stakeholders. As our dear data analysts enjoy playing with the data more than playing with PowerPoint, they would probably spend 9 days on mining the data and 1 day working on the presentation. And probably the 9 days do not depend on the actual complexity of the task. If the business question can be answered in 5 days with some basic descriptive statistics, the data analysts would probably invest more and more time trying some more advanced modelling technique or some fancier data visualization. Why? Because they enjoy it. So, the data analysis part of the activity fills all the available space like a gas in a room would do.

The last day (if not the very last hours) is usually left for working on the presentation.

The wrong interpretation of the role

I used to expect a data analyst to focus on the data mining, and that looked fine to me. She/he would share the data with other stakeholders (eg marketing staff), and THEY would get the insights because THEY are the domain experts. The data scientist would get the data, would let the data talk, and the business stakeholder would read the insights.

I thought it was OK to present an exploratory analysis. And I was wrong.

Explanatory over exploratory

Recently, I read Storytelling with Data by Cole Nussbaumer Knaflic. The author explains why data scientists should show explanatory analyses (rather than exploratory).

If you are the one analyzing and communicating the data, you likely know it best—you are a subject matter expert. This puts you in a unique position to interpret the data and help lead people to understanding and action.

Once the exploratory data mining phase is over, the data analyst should take the time and the effort to interpret the data. She/he should turn the data into information that can answer the need of the audience.

Why is it hard? We often believe that the audience is the subject matter expert and knows what the valuable information behind the data actually is. That's why the explanatory phase is an uncomfortable zone for a data scientist, but she/he should feel confident in making recommendations and observations.

If we entitle a data analyst to interpret the business insights of the data, there are at least 2 things he/she should take into consideration:

take enough time to interpret the data
review the data visualizations to explicitly communicate his/her interpretation

Regarding the 1st point, looking for business insights is surprisingly time-consuming. You cannot just dedicate the very last hours of your activity to looking for explanations behind data patterns. We should probably reconsider the classic sequential approach to the activity (eg explore, explain, present) in favor of an approach that organizes our time into quick iterations around business hypotheses, repeated multiple times before concluding the activity.

Regarding the 2nd point, I'll go a bit deeper with an example.

Example: review your data viz

You can of course find many examples in Knaflic's book. Let's look at one I picked from her website. Imagine we're working in a hospital and are analyzing the lengths of hospital stays after surgery. For each stay in 2019, we're given

the quarter of the year
the length of stay

When a data analyst is done with the first exploratory analysis, what could be the output?

In this chart, data is presented to the audience. However, how easy is it to get valuable information out of it? You may notice some patterns (eg an increase in the frequency of <=24 stays over the year); however, finding patterns is hard or requires quite some cognitive effort.

What if the data analyst took on the effort of extracting the information out of the data? How should the presentation be revisited? She/he should be confident in highlighting what's actually valuable in the data and focus the attention of the reader on that.

In this example, the data analyst can make the key information explicit. She/he can find out that the <=24 stays have increased over the year and could know that this is considered a success. Why not emphasize it on the chart?

Let's look at what an explanatory chart would look like.

The chart now has a clear message that is stated in the title and is fully described in the text next to the actual plot. The new chart looks clean because any visual component that is not useful to grab the chosen message is either hidden or grayed out. The data analyst in this case has focused on explaining why 2019 was a success rather than showing plain data. That's why the bars of the <=24 stays are highlighted in black, while, in contrast, the remaining bars are grayed out. The choice of colors captures the attention of the reader on the patterns and on the signal, rather than on the data itself.

Looking forward

This article is mainly inspired by Knaflic's book and by my experience in interacting with stakeholders over the last few years. I haven't reviewed the literature on the topic, so please consider this article as a set of opinionated recommendations on how a data scientist could maximize his/her impact when working on data mining activities. Agreeing with this approach means that a data scientist should dedicate energy to

getting a deep knowledge of the business of the company she/he works at and the market where it competes
fine-tuning and improving her/his data visualizations by iterating over and over on them (not stopping at the default chart styles generated by statistics software)

These thoughts do not apply fully to every single company of course. They make sense in teams or companies where data scientists spend a part of their time making data explorations and data mining activities to answer questions that business stakeholders ask them. I would appreciate any feedback or thought you have on it!

Guest at DaGrande podcast

2023-07-23T06:35:00+02:00

I was recently a guest on a new podcast called "DaGrande". The podcast was launched by Stefano Bosisio and aims at helping students who are close to concluding their studies. "DaGrande" consists of a series of interviews where professionals from a variety of industries share career tips and insights that they would have loved to hear when they were younger (eg when still at university).

I was the one interviewed in the second episode (see below), and I shared my advice for starting a career in the Data and AI world.

Learning by teaching

2023-01-28T09:35:00+01:00

The picture below was taken just at the beginning of the exam of the course called Apache Spark for Data Analysis at ITS Rizzoli in Milan in November 2022. I was the one taking the picture because I was actually the lecturer of this course. In this post, I'll tell you why I ended up teaching this course.

Learning

I have been working daily with Apache Spark for three years so far, and I've been implementing a variety of batch and streaming data transformations with it. I felt I knew the basics of the framework so that I was autonomous in creating new jobs. However, I wanted to go deeper in understanding how Spark works and what are the best practices to follow (or the antipatterns to avoid).

Rather than studying a book about Spark by myself or something like that, I asked myself: "why not teach an introductory course?" And that was actually a good idea. I found that teaching has been an extremely effective way to learn. My course consisted of 44 hours of training spanning 11 lessons over 2 months. While that may not look like much, preparing 44 hours of training material and designing the lessons requires thorough preparation on the topic you are teaching. I decided to design the course with more practice than theory and with plenty of live coding.

So, preparing this course has been an amazing opportunity to actually learn how Apache Spark works. After the end of the course, I have the impression I've truly improved my coding skills in PySpark way more than what I would have achieved by any dedicated training.

Impact

The ITS is a 2-year technical school for 19-20 year old students. It is an alternative to academic studies, and it is designed to be shorter and with a technical focus. These ITS schools often focus on areas or skills that are in high demand by the job market. Therefore, students are often able to find their first job soon.

My goal was helping young developers learn a key technology like Spark. Knowing Spark is almost a requirement for applying to Data Engineer positions, and the role of the Data Engineer is one with the highest demand in the tech job market. So, I decided to design the course I would have liked to follow 3 years ago to speed up learning these skills. I liked the idea of contributing to helping these young developers find their first job in the Data domain.

Teaching

When you know a topic or a technology, it does not mean you are able to teach it. Teaching is a complex task where you cannot take anything for granted and need to find the right pace for the class. In a class of 23 students, I found a variety of skill levels and backgrounds, meaning that you need to balance them to teach at the right rhythm.

Another challenge is how to keep a lecture from being boring while having a good mix of theory and practice, because you'll find both students that look for more of one and students that look for more of the other. Teaching this course was then also an opportunity to improve my teaching skills, and these are not skills that you apply only during lectures. They are actually communication skills that you can apply and distill on a daily basis when working.

Revenue

I liked the idea of having a small second revenue, and, before starting to prepare the course, I thought teaching would be a great idea because I would be paid to learn. The salary of a lecturer in the tech domain can vary a lot depending on the context, but it generally ranges from 40 to 200 euro per hour (this is not an official statistic, it's just an approximation). However, this salary does not account for the preparation of the training. So, is revenue actually a good reason to teach? Probably not if you give this course only once or twice. The effort of the preparation is so large that the revenue will not compensate for it. If instead you have the opportunity to repeat the same training over and over, then it starts to make sense on the economic side too.

Opportunity

Three years ago I just would not have had the time to prepare a course like this. Why? I used to spend 2 hours per day commuting. I now have instead the chance to work from home quite often, and this gives me 1-2 hours of extra life. I was then able to prepare the course material incrementally over a couple of months before the course started.

The opportunity came when I interviewed Andrea Biancini on the Intervista Pythonista podcast. Thanks to him, I learned a bit more about the tech education and training world and heard about ITS for the first time. The idea then stuck in my head because I was looking forward to experiencing being a lecturer for the first time.

Course design

When you prepare a course, the nice part is that you can actually design the course you would like to attend. My course then consisted mainly of live coding sessions that started with a brief introduction of a topic (eg Spark APIs, Streaming, etc) and then ended with an exercise on that topic that the students could try to solve. I decided to open source the training material I prepared so that any other student or teacher may benefit from it when needed. To simplify the course setup, I ran the coding sessions on Databricks community edition so that students only needed a browser and an internet connection to work on a Spark cluster.

What helped the design of the course was adopting a textbook. Having a textbook speeds up the design of the contents of the course and gives the students a reference resource in case they want to go deeper on the topic. I chose Learning Spark second edition by Damji et al., which is made freely available by Databricks.

Speaker at PyCon IT 2022

2022-08-05T06:41:00+02:00

I went back to PyCon IT 2022 in Florence in June. I gave one talk called Why Is Our Project Late? where I introduced the mental and statistical biases that lead us to make wrong estimates when making a plan.

Furthermore, we held a live session of Intervista Pythonista podcast interviewing Fabio Pliger, the creator of PyScript.

My Webinar on Databricks and PySpark

2022-06-11T19:35:00+02:00

I was invited by the Python Biella community to hold a webinar introducing PySpark on Databricks (in Italian). You can find the video below and the code here.

My 6 Gems on Data Visualization

2022-02-26T19:35:00+01:00

I have been working quite some time with charts and business intelligence in the last 5 years. When you spend time building business reports, you may perceive data visualization as a cold technical and business tool. However, there are 6 hidden gems in data visualization that I found by chance. I realized data visualization is not as cold as I thought. Let me recap for you these 6 gems.

1) The first chart ever

William Playfair was a Scottish engineer and political scientist from the 18th century. He is considered the author of the very first chart:

The chart was published back in 1786. It shows the volumes of imports and exports of Scotland over one year on a scale of 10k pounds. Each country is given two bars: one for volume of imports, one for volume of exports.

I am so used to seeing bar charts that I never asked myself who invented them or when they first appeared. It's nice to find out that they were invented way before the invention of calculators and that they have changed so little since then.

2) The best graphic ever

Charles Minard represented 6 types of data about Napoleon's 1812 Russia campaign in one single chart. This visual was considered by Edward Tufte as "the best statistical graphic ever produced".

Minard represented in two dimensions six types of data: the number of Napoleon's troops; distance; temperature; the latitude and longitude; direction of travel; and location relative to specific dates.

3) Non-neutrality: the Legarithmic scale

Is data visualization a neutral discipline? Not really. Basic decisions like the choice of scale or of the limit of axes might change radically the information perceived by the reader. Take a look at the following tweet by Matteo Salvini (leader of "Lega" party) about results of a poll on popularity of Italian politicians:

Despite lies, attacks and trials, millions of Italians believe, hope, and trust in the Lega.
Oh yes, and we are still here…
We never give up, THANK YOU! pic.twitter.com/DFMecxPFzC
— Matteo Salvini (@matteosalvinimi) September 11, 2021

Do you notice anything wrong with the chart? The y axis looks a bit tweaked. The spacing along the axis does not follow any reasonable scale (perhaps a "Legarithmic" scale?) since the difference between the 3 bars is not consistent. Here is how the same data looks when plotted in Excel.

However, the effect on the reader is not the same, is it?

4) Beyond shapes: infographics

Otto Neurath was one of the main contributors to the picture language, aka ISOTYPE (International System of Typographic Picture Education). This method consists of replacing classic shapes in data visualization (eg bars, circles, etc) with a set of standardized symbols. Quantities are represented by repeating the same symbol over and over proportionally to the measure. Consider the following example by Otto Neurath from 1930.

The chart represents the density of population in different cities. The information is represented as the number of persons that would live in a flat of 200 m2. The count of persons is not represented by a digit or by a bar, but it is represented by the repetition of a symbol as many times as the count of persons for that city. The result is effective. Density is no longer just a number, and you can feel the size of the measure. Infographics can turn cold numbers into tangible perceptions of a phenomenon.
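
The repetition principle is easy to try out in plain text. The snippet below is a minimal sketch of the idea, not a reproduction of Neurath's chart: the pictogram function is hypothetical, and the city names and counts are made up for illustration only.

```python
def pictogram(counts: dict[str, int], symbol: str = "●") -> str:
    """Render one row per label, repeating the symbol once per counted unit."""
    width = max(len(label) for label in counts)
    rows = [f"{label.ljust(width)}  {symbol * n}" for label, n in counts.items()]
    return "\n".join(rows)


# Made-up numbers for illustration only (not Neurath's data).
print(pictogram({"City A": 3, "City B": 7, "City C": 12}))
```

Even in this crude form, the longest row stands out before you read a single digit.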

5) Pie charts: bad by definition

"Bad by definition" is the title of one of my favourite blog posts about data visualization. This article is a clean explanation of why you should not use pie charts for most of the use cases. The article starts with this example.

Can you rank the slices of the pie by size? You'd probably struggle a bit trying to answer. The reason is that our brain is not used to measuring and comparing angles. It's funny to see pie charts being used every now and then in business reports. Most of the time, a basic bar chart would be way more effective to let the user understand the numbers behind. However, it seems that pie charts are now endemic in corporations, and the way is still long before getting rid of them 😁

6) What is data visualization?

Is data visualization a branch of computer science? It turns out that data visualization belongs to a broader discipline: it is part of information design. Information design is the practice of presenting information in a way that fosters an efficient and effective understanding of the information.

Can the same data of a bar chart be represented in plain text? Yes.

Would plain text require us the same effort to understand the information behind the numbers? Probably not.

Would we even be able to get such information from plain text? Probably not because visualizing information helps our brain to perceive what's going on.

I recently wrote an article on the impact of information design on journalism. The article starts from a recent tax reform in Italy. Most news media kept showing tables about the new tax rates, however I found it quite hard to get a clear and full picture of the reform. I was not able to find online a single data visualization about the data behind the reform. So, I made one myself, and it turned out the article was quite appreciated (with more than 2.3k reads at the time of this writing and plenty of positive feedback on social networks).

The reason why the article was so viral is that one single line chart was able to describe the reform way more effectively than the textual tables you could find online. I find this a decent example of "efficient and effective understanding of information" that is the overall goal of information design.

References

This article is a collection of notes I took in the last couple of years. Historical charts are inspired by talks by Paolo Ciuccarelli. The criticism of pie charts is inspired by the article by Yan Holtz. Plenty of details are of course from Wikipedia.

How I started podcasting

2021-11-07T09:35:00+01:00

In May 2021, the first episode of my first podcast went live. The podcast is called Intervista Pythonista and is co-hosted with Cesare Placanica. Cesare and I are members of the Python Milano community that helped us to kick off the idea.

Why podcasting?

I am a heavy podcast listener. I love podcasts because they are dense conversations on topics I love. These conversations let me hear the points of view of experts in the field and stay up to date with new trends.

I prefer podcasts over videos for two reasons. First, I can listen to them while I'm doing something else (usually low-attention tasks like dish-washing or running). Second, I don't need to sit in front of a screen after I've been working daily for 8+ hours still in front of a screen.

Why now?

Cesare and I participated as panelists in a community talk at the last Codemotion conference. The panel was an informal discussion on topics like data team organization, learning tips, and latest trends in data science.

We had a surprisingly high number of attendees during the panel. I noticed that an informal chat between experts is content that people enjoyed more than I expected. I suspect that people miss the informal chats they used to have during in-person meetups and conferences (ie suspended since the beginning of the pandemic).

So, I got back to Cesare with the idea:

Why don't we start podcasting?

Cesare was like: "tell me more about it". The idea was to interview an expert in Python or in its neighborhood. The format was inspired by Michael Kennedy's Talk Python to Me podcast. I was thinking of a similar format but narrowing it to an Italian audience by running interviews in Italian. The goal was not only to create valuable content for Italian Pythonistas, but also to give voice to local community members. Getting to know the people behind a tech community through a direct interview helps the community grow by making it feel closer to you.

The decision was taken. It was time to start.

How to run a podcast?

Neither Cesare nor I had ever run a podcast before. Neither of us was an expert in audio recording and audio post-processing. Fortunately, we live in a time where you can find plenty of user-friendly tools to create digital content. After doing some research, I found Anchor by Spotify. Anchor defines itself as "the easiest way to make a podcast". And it probably is.

Anchor lets you start a new podcast in minutes for free. You can record, cut, merge, and publish episodes directly via the mobile app. The app lets you invite guests to join the recording too. Anchor will then take care of distributing the content on major podcasting platforms.

What is missing? A website and a logo! It turns out that Anchor creates a podcast page for your podcast. I simply bought a domain and linked it to that page. Regarding the logo, I have to confess I designed it in PowerPoint.

Guests?

Ok, we decided how to record and how to publish. It's time to record our first episode... who should we invite? Cesare and I started listing names of community members, colleagues, and even friends that could be interviewed. We soon had around 20 names, and our first choice was Marco Bonzanini (thanks Marco again for your availability!).

We keep on updating a kind of kanban board that lists potential guests, guests that have accepted the invitation, and those that have already been scheduled. We decided to have a fixed schedule for recording (every 2 weeks, on the same day, at the same time). Having a recurring schedule reduces complexity and makes things work.

At the end of every recording, we ask the guest to suggest 1 or 2 names of potential future guests. This recommendation helps us fill the list of future guests with new names, and it lets us meet new Pythonistas outside of our direct network.

Some numbers

Two days ago, we published the 10th episode, and we have enough history to look back at numbers. As of 7th November 2021, we had 1,364 plays. Our top episode had 167 plays. 84% of listeners are from Italy, and 2 out of 3 listeners use their mobile device to listen to the podcast.

What I'm most glad of are not these numbers, but the messages we often receive via Slack or LinkedIn. Sometimes listeners write to us to say thanks for the valuable content they listened to. These messages are the highest reward for the time and effort we put into this podcast and the main reason we are doing this.

Getting PSM I Scrum Certification

2021-08-24T09:41:00+02:00

I've been working with the Scrum framework over the last 18 months, and I thought it was time to test whether what I was doing was real Scrum or kind-of-Scrum. I decided to take the Professional Scrum Master I certification exam to test my knowledge of the framework.

Which certification?

Where to start? It seems that the founders of Scrum have created 3 independent organizations that have 3 independent certification paths.

Scrum.org
Scrum Alliance
Scrum Inc

While Scrum Alliance and Scrum Inc require attending a class to take the exam, Scrum.org lets you directly take the exam thus allowing self-study. I did not find any in-person class in my area anytime soon and decided to go for Scrum.org exam. I did not consider attending an online class because I already spend most of the working time in front of a screen and prefer other ways of learning rather than online courses.

How to prepare?

In short, read the Scrum Guide at least 3-4 times. Focus on highlighting who is accountable for every artifact and activity (eg only the Developers are accountable for the Sprint Backlog, all the Scrum Team is accountable for the Sprint Goal, etc).

Repeat a few times exercises that simulate exam questions (either official or not)

Scrum.org Open Assessment
Great set of 80 questions by Mikhail Lapshin
A few free questions on Volkerdon

I also enjoyed looking at some posters available on Scrum.org that help you visualize some aspects of the framework:

The exam

The exam is an online quiz of 80 questions to be answered in 60 minutes. I suggest using the Bookmark feature of the quiz. It lets you bookmark questions you're doubtful about and review them later. It took me about 40-45 minutes to go quickly through all questions. I then had approximately 15 minutes to review the bookmarked questions.

I've read on a few forums that people encountered performance issues in the exam webpage. However, I did not find any issue and the exam ran smoothly.

You can have notes either printed or on your laptop because there are no controls like browser locks or similar ones. You are basically free to look at any resource you like during the exam. The time pressure is a decent guarantee against cheating.

When you complete the exam, you'll get a certificate and a badge like this:

Your certificate will also be available on your Credly profile (if you have one).

Notes from Designing Data-Intensive Applications

2021-04-10T07:31:00+02:00

Designing Data-Intensive Applications by Martin Kleppmann was not a quick read. Let me be clear, it is not such a long book (the paper version is 400 pages), but it is so dense with information that it takes some time to go through. The book indeed covers a broad spectrum of data technologies and is dense with details in each paragraph. So, be ready before starting the journey.

What did I learn from the book? I'll take a few quotes from my notes.

An architecture that is appropriate for one level of load is unlikely to cope with 10 times that load. If you are working on a fast-growing service, it is therefore likely that you will need to rethink your architecture on every order of magnitude load increase — or perhaps even more often than that.

We need to be able to test, develop, and change our architecture quickly. The book covers the main data solution designs, but you need a team and an organization that is able to adapt and improve the architecture constantly. And more importantly, avoid premature optimization as much as possible. Prefer simplicity over complexity.

If the same query can be written in 4 lines in one query language but requires 29 lines in another, that just shows that different data models are designed to satisfy different use cases. It’s important to pick a data model that is suitable for your application.

Don't focus on data processing performance only: data models and query languages do matter. The overall simplicity and readability of the solution design should be taken into account when choosing the data model.

On the surface, a data warehouse and a relational OLTP database look similar, because they both have a SQL query interface. However, the internals of the systems can look quite different, because they are optimized for very different query patterns. Many database vendors now focus on supporting either transaction processing or analytics workloads, but not both.

We experienced this difference in my team. We started by building a data warehouse on top of SQL, but we ran into performance issues quite soon. The statement by Kleppmann may seem obvious, but there are plenty of organizations building data warehouses on SQL for a variety of reasons.

... we will explore some of the most common ways how data flows between processes: via databases, via service calls (eg REST and RPC), and via asynchronous message passing (eg MQTT, AMQP).

I find this an amazing summary. In the end, any data flow architecture falls into one of these 3 categories, doesn't it?

When you deploy a new version of your application (of a server-side application, at least), you may entirely replace the old version with the new version within a few minutes. The same is not true of database contents: the five-year-old data will still be there, in the original encoding, unless you have explicitly rewritten it since then. This observation is sometimes summed up as data outlives code.

Migrating data is harder than updating an application (and there are richer tools available for deploying an application than migrating a database).
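As a small illustration of "data outlives code", the sketch below decodes records that were written by different versions of an application. The record layout, the `country` field, and the `decode_customer` helper are all hypothetical; the point is only that new code must keep reading old encodings.

```python
import json

# Records written at different times: the older one has no "country" field
stored_records = [
    '{"id": 1, "name": "Ada"}',
    '{"id": 2, "name": "Linus", "country": "FI"}',
]


def decode_customer(raw: str) -> dict:
    """Decode a record written by any past version of the application."""
    record = json.loads(raw)
    # Field added in a later version: fall back to a default for old records
    record.setdefault('country', None)
    return record


customers = [decode_customer(raw) for raw in stored_records]
print(customers)
```

The alternative is to rewrite all stored records whenever the schema changes, which is exactly the kind of migration that is harder than a redeploy.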

May your application’s evolution be rapid and your deployments be frequent.

I love this wish 😊

All of the difficulty in replication lies in handling changes to replicated data, and that’s what this chapter is about. We will discuss three popular algorithms for replicating changes between nodes: single-leader, multi-leader, and leaderless replication. Almost all distributed databases use one of these three approaches.

I found this quote in the introduction to the Replication chapter of the book. I had often heard these replication mechanisms mentioned, but this was the first time I did a deep dive into the topic (which is not as easy as I would have expected). Throughout the book, Kleppmann makes one thing clear: there are many things that can go wrong around data (timestamp alignment, networking, nodes down, etc), and they will go wrong at some point.

Because of this risk of skew and hot spots, many distributed datastores use a hash function to determine the partition for a given key. A good hash function takes skewed data and makes it uniformly distributed.

And fortunately, this hashing is often managed under the hood by the datastores themselves, eg Azure Cosmos DB.
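To make the idea concrete, here is a minimal sketch of hash partitioning. The number of partitions, the key format, and the use of MD5 are assumptions chosen for the example; real datastores pick their own hash functions and partitioning schemes.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical cluster size


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition using a stable hash of its bytes."""
    digest = hashlib.md5(key.encode('utf-8')).digest()
    # Read the first 8 bytes as an integer and reduce it modulo the partition count
    return int.from_bytes(digest[:8], 'big') % num_partitions


# Skewed keys: sequential IDs that all share a long common prefix
keys = [f'user-2020-{i:05d}' for i in range(10_000)]
counts = Counter(partition_for(k) for k in keys)
print(sorted(counts.items()))
```

Partitioning the same keys by prefix or by range would send them all to the same node; hashing spreads them roughly evenly. Python's built-in `hash()` is avoided here because its result for strings changes between processes.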

Atomicity, isolation, and durability are properties of the database, whereas consistency (in the ACID sense) is a property of the application. The application may rely on the database’s atomicity and isolation properties in order to achieve consistency, but it’s not up to the database alone. Thus, the letter C doesn’t really belong in ACID.

Interesting to read that the C in such a popular acronym is there just to make the acronym work.

Errors will inevitably happen, but many software developers prefer to think only about the happy path rather than the intricacies of error handling.

True story, but experience helps you think a bit more about the sad path.

Simply dumping data in its raw form allows for several such transformations. This approach has been dubbed the sushi principle: "raw data is better"

I have been following the sushi principle in the last year without being aware of this definition. Nice name!

Database triggers can be used to implement change data capture by registering triggers that observe all changes to data tables and add corresponding entries to a changelog table. However, they tend to be fragile and have significant performance overheads. Parsing the replication log can be a more robust approach, although it also comes with challenges, such as handling schema changes.

I see replication log parsing as a growing trend. It enables the approach "take the data to the data lake and then we'll see what to do". Furthermore, it fits streaming data applications too. Today, not all vendors support the publication of such change logs natively (eg I didn't find a simple solution for SQL Server).
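For reference, the trigger-based approach described in the quote can be sketched with SQLite from the Python standard library. The `customers` and `changelog` tables and the trigger names are hypothetical, and this is only a toy: it shows the mechanism, not the fragility and overhead that Kleppmann warns about.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);

    -- Changelog table filled by the triggers below
    CREATE TABLE changelog (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        operation TEXT,
        customer_id INTEGER,
        name TEXT
    );

    CREATE TRIGGER customers_insert AFTER INSERT ON customers
    BEGIN
        INSERT INTO changelog (operation, customer_id, name)
        VALUES ('INSERT', NEW.id, NEW.name);
    END;

    CREATE TRIGGER customers_update AFTER UPDATE ON customers
    BEGIN
        INSERT INTO changelog (operation, customer_id, name)
        VALUES ('UPDATE', NEW.id, NEW.name);
    END;

    CREATE TRIGGER customers_delete AFTER DELETE ON customers
    BEGIN
        INSERT INTO changelog (operation, customer_id, name)
        VALUES ('DELETE', OLD.id, OLD.name);
    END;
    """
)

# Every write to the table leaves an entry in the changelog
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.execute("UPDATE customers SET name = 'Ada L.' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")

changes = conn.execute(
    'SELECT operation, customer_id, name FROM changelog ORDER BY seq'
).fetchall()
print(changes)
```

A downstream consumer would read the `changelog` table in `seq` order. Every write to `customers` now costs a second write, which hints at the performance overhead mentioned in the quote.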

If you are mathematically inclined, you might say that the application state is what you get when you integrate an event stream over time, and a change stream is what you get when you differentiate the state by time, as shown in figure. The analogy has limitations (for example, the second derivative of state does not seem to be meaningful), but it’s a useful starting point for thinking about data.

This brilliant analogy is the intro of the chapter I enjoyed the most in the entire book, ie the Stream Processing chapter. It presents a database as a cache of the latest state of the replication log (the opposite of the point of view we normally have).
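The analogy can be tried out on a toy example. The event stream below (changes to an account balance) is made up for illustration; "integrating" is a running sum and "differentiating" is the difference between consecutive states.

```python
from itertools import accumulate

# Hypothetical event stream: each event is a change to an account balance
events = [+100, -30, +50, -20]

# "Integrating" the events over time yields the successive states
states = list(accumulate(events))

# "Differentiating" the states gives back the change stream
changes = [curr - prev for prev, curr in zip([0] + states, states)]

print(states)   # [100, 70, 120, 100]
print(changes)  # [100, -30, 50, -20]
```

The last element of `states` is what a database would normally store; the list of events is the log from which it can be rebuilt.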

In the absence of widespread support for a good distributed transaction protocol, I believe that log-based derived data is the most promising approach for integrating different data systems.

I have seen Kafka as a tool for stream processing so far. I was not thinking of it as a tool for integrating data systems. The last chapter of the book gives a hint on how log-based derived data may become a popular pattern soon.

The trend has been to keep stateless application logic separate from state management (databases): not putting application logic in the database and not putting persistent state in the application. As people in the functional programming community like to joke, "We believe in the separation of Church and state"

Good one.

Models of Data Science teams: Chess vs Checkers

2021-03-27T09:35:00+01:00

How many data engineers should we hire? Are they too many compared to our data scientists?

One of the key decisions to take when building a data science team is the mix of roles. This means choosing the right mix of background and of activities that each member of the team should have. I'll compare two models of teams I've experienced so far and define them as chess-team model and checkers-team model.

Chess-Team Model

The chess-team model is the common model we read about in the literature. In a chess-team, each member of the team has a specific role. Roles are usually: data engineers, data scientists, and machine learning engineers. These roles typically correspond to different sets of skills (eg ML and statistics vs coding and devops) and to different sets of activities (model selection vs data preparation vs model deployment).

Similarly to a chess piece which has a clear role that is different from the other pieces, a member of a data science chess-team is assigned a subset of the tasks that are part of the development pipeline. Let's consider a simplistic development pipeline:

data preparation -> data engineer
model development -> data scientists
model deployment -> machine learning engineer

The three activities of this development pipeline correspond to the three roles of the team, and there is little space for confusion. A data engineer probably won't work a lot on the model development and selection, while a data scientist probably won't be the one deploying the model in production.

Checkers-Team Model

The checkers-team model is a definition of a team model that I introduce in this post. In a checkers-team, a team member does not have a specific role because he may be in charge of working on any step of the development pipeline. There are no roles like data engineer or data scientist because taking such a role implies limiting the scope of activities a team member should work on. Let's make an example. In a checkers-team, there is no data scientist because no one is in charge of model development only.

So, what is the role of someone working in a checkers-team? A member of the team can be defined as a full-stack data developer. A full-stack data developer is someone that for example works on data extraction AND model development AND model deployment. In a checkers-team, everyone works possibly on every piece of the development lifecycle. In this sense, the team is more similar to checkers pieces. There is no move that a piece can take and another piece cannot. Similarly, there is no activity that any team member cannot do. For example, everyone can contribute to building devops pipelines and automation.

Of course, every team member has a different background and a different set of skills from his/her teammates. One can come from a software engineering experience, another one can come from data science studies. However, the strategy of building a checkers-team is to invest in training team members to grow their skill sets horizontally.

Pros and Cons

Let's consider some key differences between a chess and a checkers team model.

Flexibility. The balance of types of activities is not stable over time in a team. There can be times when there is a peak of work items in data engineering and little or no work items in ML model development. These peaks can be due to different phases of the data product development cycle or due to varying business requirements. A checkers-team is flexible and can adapt quickly to these peaks. A checkers-team could for example dedicate the entire team to develop data engineering pipelines in a Scrum sprint if needed. The same flexibility is not as easy in a chess-team model where you have constraints due to different skills and different responsibilities.

Complexity. Not every data science team faces the same level of complexity in its projects. Imagine a team that is building an AI model for self-driving cars. It is a complex problem to solve that requires advanced skills in computer vision and AI. These skills cannot be learned quickly but usually need a specific education or career path. When facing such problems, you need team members who are specialists in areas like vision or AI. A chess-team is designed to host specialists in certain fields and to grow such skills vertically. In a checkers-team, there are no such specialists.

Awareness. A member of a checkers-team knows every phase of the development cycle in detail. While he is designing an ML model, he is aware at the same time of how the release pipeline and the operations of the model work. He may take decisions during model selection that take into consideration where the model will be hosted and possible constraints of the production platform. On the other hand, a data scientist in a chess-team knows fewer details of how the model will be deployed and run (because he has not been working on it himself). This lower awareness may lead to assumptions being made during model development, and these assumptions can bring more complexity to those in charge of deploying the model.

Sense of Ownership. In a checkers-team, you are in charge of engineering data pipelines, developing models, and deploying them. Any issue that may occur in these phases is also your issue. You can't delegate too much, and, therefore, you naturally feel responsible for contributing to the resolution. Distributing the ownership makes every team member more active in improving the development life cycle.

When is a Team Model Right?

The answer depends on the context and the organization you work at. Is the data science team working on the core product of the company? If this is the case, the models that are developed may need a level of specialization that can't be achieved by a checkers-team.

Or is the team rather working on adding tiny features or on improving the operations of the company? In this case, you probably won't be developing state-of-the-art AI models, and you can rely on existing libraries or SaaS that make life easier for you. As complexity is not an obstacle, going for a checkers-team may be a good option.

What is the size of your data science team? Or even how many teams do you have? Large organizations go for multiple data teams. These teams may be divided functionally (eg 1 team of data engineers + 1 separate team of data scientists) or they may be divided by business units (eg 1 data team for marketing and 1 data team for recommender systems). You can't of course adopt the checkers-team model in a large organization that designs its data teams by function, but you may still adopt this model in a large organization that creates multiple self-organized teams, each dedicated to a specific business unit.

A last point to consider is the IT architecture. A checkers-team requires the same person to work on very different tasks. This is viable only if the complexity of such tasks is small. Adopting SaaS and PaaS resources simplifies every task by hiding the complexity of managing and running the resources. They let you focus on your goal. For example, building an API endpoint hosted by a function-as-a-service is feasible for a data scientist with a mathematical background. Doing the same from scratch on an on-premise server is not as feasible.

Images courtesy of @pecanlie and @rafaelrex

Choosing my next job title (in a data science career)

2021-01-08T07:41:00+01:00

I'm now part of a data and AI team in a fintech spinoff. When I joined the company, it did not make sense to spend time in defining precise job titles because we were to build everything from scratch (both software, teams and organization). My job title was therefore a generic "AI Practitioner". One year later, teams and responsibilities are more clear, and it is now time to define my job title.

What was I doing up to now?

I have a background in data science and software engineering. I started my career in 2013 as "Data Scientist and Software Developer" (what we would call today a Machine Learning Engineer?) in a small startup. I was then defined as an "Associate" when working as a data scientist in a consulting firm. In the last 3 years, I worked in a manufacturing firm as "Data Scientist".

What am I doing now?

In the company I currently work at, I work in the data and AI team. My main activities include:

planning and prioritizing of our data solution
designing our data and software architecture
developing in first person our data integrations, analytics reporting, ML models and data solutions
making sure our Scrum ceremonies run smoothly

My job has a mix of coding, architecture design, and project/product management. Why such a variety of responsibilities? I work in a small team that is part of a company growing quickly from zero. Each team is quite autonomous in doing its work and takes end-to-end ownership of its activities. For example, in my data and AI team we are responsible for the entire pipeline: defining roadmaps, development, deployment, and monitoring.

My job title?

It is now time to define a job title that can summarize my responsibilities listed above. These are some alternatives I took into consideration:

Job title	Comments
Senior Data Scientist/Engineer	Too vertical on a piece of the pipeline compared to the spectrum of activities I work on
Data Architect	Nicely defines the technical activities of designing and scaling our data solutions, but lacks the ownership of the backlog and of the product roadmap
Data Product Owner	States clearly the ownership of the product backlog, but I feel that the "Product Owner" title is too tied to a Scrum role and lacks technical responsibilities
Lead Data and AI	States the responsibility of leading a team of experts in a domain. However, it does not feature any ownership on the product roadmap. Furthermore, it states a clear hierarchy in the team that goes against our team and company culture (a culture of distributed ownership and flat organization)

I was not satisfied with the job titles above. Then, I came up with "Data Product Manager". I felt this job title was what I was looking for because:

as a Product Manager, you are responsible for the product roadmap and strategy
the prefix "Data" adds a technical taste. By doing some research, I found that a TPM (Technical Product Manager) is a common job title that defines a product manager that is also in charge of the technical side of the product (architecture, etc)
it states the ownership of our data product but does not add any hierarchy-sounding adjectives
my end-to-end range of activities can fit well in this definition

I shared these thoughts with my manager, who agreed both on the definition of my responsibilities and on the job title. I hope these notes can help those who are facing the same challenge of choosing their own job title.

What we expected from Covid on March 10th

2020-12-26T09:35:00+01:00

The first Covid case in Italy was found on February 21st 2020. A couple of weeks later we were entering the lockdown with this number of new daily cases.

The number of new Covid-19 cases was growing really fast every day. We had no clue about what was going to happen and about when it would end. Was it going to end soon? How quickly was the virus spreading? I was wondering whether our feelings and expectations would turn out to be true or not. So, I ran a little experiment with 7 friends. I asked each of them the following 2 questions on March 10th 2020:

What will the total number of Covid19 cases be by April 1st?
When will the number of new cases be smaller than 50 again?

The goal of these questions was to investigate our ability as humans to roughly grasp the size and the duration of an unprecedented event like a global pandemic. Let's look at the answers we gave to these 2 questions.

Total cases by April 1st

On April 1st, the total number of Covid19 cases in Italy was 110k (precisely 110574). These were our 7 predictions made on March 10th.

We see that 5 out of 7 respondents predicted a number of cases below 60k (with 2 respondents even below 25k). Only 2 out of 7 respondents gave more realistic predictions (110k and 130k respectively). Why were most respondents too optimistic? If we look at the very first chart, an exponential growth of new cases was already happening on March 10th. Perhaps the majority of respondents were perceiving the growth as linear.

Does our brain have misperceptions about exponential growth? My little experiment gave this insight, but I was curious whether there is some scientific literature about this misperception. I found a paper written back in 1975: "Misperception of exponential growth" by Wagenaar and Sagaria.

In this paper, the researchers presented the beginning of an exponential time series covering the years 1970 to 1974. They presented this time series in different experiments, both as a series of numbers and as a graph (see chart above). They asked participants to predict the value of the time series in 1979. A considerable underestimation of growth was encountered in all groups in all conditions.

The results of this paper helped me understand why most of my respondents notably underestimated the growth of Covid-19 cases in Italy. Our intuition seems to work well for linear growth but not for exponential growth.
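A bit of arithmetic shows how large the gap between the two intuitions can get. The series below is synthetic (it is not the Italian Covid-19 data): it starts at 100 cases and doubles every 3 days, and the "linear guess" simply keeps adding the last observed daily increase.

```python
# Synthetic series (not real Covid-19 data): cases double every 3 days
def exponential_cases(day: int, start: float = 100.0, doubling_days: float = 3.0) -> float:
    return start * 2 ** (day / doubling_days)


# We observe the first 10 days (day 0 to day 9)
observed = [exponential_cases(d) for d in range(10)]

# Linear intuition: tomorrow will grow by as much as today did
last_increase = observed[-1] - observed[-2]


def linear_guess(day: int) -> float:
    return observed[-1] + last_increase * (day - 9)


target_day = 30
actual = exponential_cases(target_day)
guess = linear_guess(target_day)
print(f'linear guess: {guess:.0f}, exponential value: {actual:.0f}')
print(f'the linear guess is {actual / guess:.1f} times too small')
```

After only three weeks of extrapolation, the linear guess is off by more than an order of magnitude, which is the same direction of error that most answers in the experiment showed.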

The following question naturally comes up: if the underestimation of the Covid-19 growth was common in the vast majority of the citizens due to our unavoidable misperception, how has this impacted on micro and macro decisions when facing the pandemic?

New cases smaller than 50 again

We now move to the second question of my little experiment (asked on March 10th): "When will the number of new cases be smaller than 50 again?". Plotting the answers:

I drew a black vertical line for each date given as an answer. We were too optimistic in this survey too. 5 out of 7 respondents expected the situation to be under control (the red horizontal line represents the threshold of 50 cases in the question) by May 1st. No one was expecting the high number of daily cases to last beyond June 18th. As of today (December 26th), the number of daily cases in Italy has not gone below 50 on a single day since then.

We were just starting to experience an extraordinary event, and we were not expecting it to last that long. This bias of perceiving the pandemic as shorter than it actually was probably helped the social distancing policies. Changing your social habits is a sacrifice that you willingly make if you expect it to last for a short time. Imagine if we had known that Covid-19 would last for 9 months or even more.

Another question naturally comes up about the economic policies that were taken to tackle the pandemic: were they subject to the same short-term bias that was measured in this experiment?

Summary: Building AI Solutions with Azure ML

2020-08-19T06:41:00+02:00

While studying for the Azure Data Scientist Associate certification, I took notes from the Building AI Solutions with Azure ML course. In this single page, you'll find the entire content of the course (as of 18th August, 2020). This page is a small support for those preparing for the certification.

Intro

Azure ML Workspace

workspaces are azure resources. include:

compute
notebooks
pipelines
data
experiments
models

created alongside

storage account: files by WS + data
application insights
key vault
vm
container registry

permission: RBAC

edition - basic (no graphic designer) - enterprise

Tools

Azure ML Studio - designer (no code ML model dev) - automated ML

Azure ML SDK

Azure ML CLI Extensions

Compute Instances - choose VM - store notebooks independently of VMs

VS Code - Azure ML Extension

Experiments

Azure ML tracks runs of experiments

...
run = experiment.start_logging()
...
run.complete()

logging metrics. run.log('name', value). You can review them via RunDetails(run).show()
experiment output file. Example: trained models. run.upload_file(..).

Script as an experiment. In the script, you can get the context: run = Run.get_context(). To run it, you define:

RunConfiguration: python environment
ScriptRunConfig: associates RunConfiguration with script

Train a ML model

Estimators

Estimator: encapsulates a run configuration and a script configuration in a single object. Save trained model as pickle in outputs folder

estimator = Estimator(
  source_directory='experiment',
  entry_script='training.py',
  compute_target='local',
  conda_packages=['scikit-learn']
)
experiment = Experiment(workspace, name='train_experiment')
run = experiment.submit(config=estimator)

Framework-specific estimators simplify configurations

from azureml.train.sklearn import SKLearn

estimator = SKLearn(
  source_directory='experiment',
  entry_script='training.py',
  compute_target='local'
)

Script parameters

Use argparse to read the parameters in a script (eg regularization rate). To pass a parameter to an Estimator:

estimator = SKLearn(
  source_directory='experiment',
  entry_script='training.py',
  script_params={'--reg_rate': 0.1},
  compute_target='local'
)
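On the script side, reading the parameter could look like the sketch below. The `--reg_rate` name comes from the snippet above; the `parse_params` helper and the default value are assumptions made for the example.

```python
import argparse


def parse_params(argv=None):
    """Parse the hyperparameters passed to the training script."""
    parser = argparse.ArgumentParser()
    # Same flag name as the key used in script_params
    parser.add_argument('--reg_rate', type=float, dest='reg_rate', default=0.01)
    return parser.parse_args(argv)


# Simulate the arguments that the estimator would pass on the command line
args = parse_params(['--reg_rate', '0.1'])
print('Regularization rate:', args.reg_rate)
```

In the real `training.py`, you would call `parse_params()` without arguments so that it reads `sys.argv`.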

Registering models

Once the experiment Run has completed, you can retrieve its outputs (eg trained model).

run.download_file(name='outputs/model.pkl', output_file_path='model.pkl')

Registering a model allows you to track multiple versions of a model.

model = Model.register(
  workspace=ws,
  model_name='classification_model',
  model_path='model.pkl', #local path
  description='a classification model',
  tags={'dept': 'sales'},
  model_framework=Model.Framework.SCIKITLEARN,
  model_framework_version='0.20.3'
)

or register from run:

run.register_model(
  ...
  model_path='outputs/model.pkl'
  ...
  )

Datastores

Abstractions of cloud data sources encapsulating the information required to connect.

You can register a data store

via ML Studio
via SDK

ws = Workspace.from_config()
blob = Datastore.register_azure_blob_container(
  workspace=ws,
  datastore_name='blob_data',
  container_name='data_container',
  account_name='az_acct',
  account_key='123456'
)

In the SDK, you can list data stores.

Use datastores

Most common: Azure blob and file

blob_ds.upload(
  src_dir='/files',
  target_path='/data/files',
  overwrite=True
)
blob_ds.download(
  target_path='downloads',
  prefix='/data'
)

You pass a data reference to the script to use a datastore. Data access modes:

download: contents downloaded to the compute context of experiment
upload: files generated by experiment are uploaded after run
mount: path of datastore mounted as remote storage (only on remote compute target)

Pass reference as script parameter:

data_ref = blob_ds.path('data/files').as_download(path_on_compute='training_data')
estimator = SKLearn(
  source_directory='experiment_folder',
  entry_script='training_script.py',
  compute_target='local',
  script_params={'--data_folder': data_ref}
)

Retrieve it in script and use it like local folder:

import argparse
import os

parser = argparse.ArgumentParser()
# type must be the str class, not the string 'str'
parser.add_argument('--data_folder', type=str, dest='data_folder')
args = parser.parse_args()
data_files = os.listdir(args.data_folder)

Datasets

Datasets are versioned packaged data objects consumed in experiments and pipelines. Types

tabular: read as table
file: list of file paths

You can create a dataset via Azure ML Studio or via SDK. File paths can have wildcards (/files/*.csv).

Once a dataset is created, you can register it in the workspace (available later too).

Tabular:

from azureml.core import Dataset

blob_ds = ws.get_default_datastore()
csv_paths = [
  (blob_ds, 'data/files/current_data.csv'),
  (blob_ds, 'data/files/archive/*.csv')
]
tab_ds = Dataset.Tabular.from_delimited_files(path=csv_paths)
tab_ds = tab_ds.register(workspace=ws, name='csv_table')

File:

blob_ds = ws.get_default_datastore()
file_ds = Dataset.File.from_files(path=(blob_ds, 'data/files/images/*.jpg'))
file_ds = file_ds.register(workspace=ws, name='img_files')

Retrieve a dataset

ws = Workspace.from_config()

# Get a dataset from workspace datasets collection
ds1 = ws.datasets['csv_table']

# Get a dataset by name from the datasets class
ds2 = Dataset.get_by_name(ws, 'img_files')

Datasets can be versioned. Create a new version by registering with the same name and the create_new_version property:

file_ds = file_ds.register(workspace=ws, name='img_files', create_new_version=True)

Retrieve specific version:

img_ds = Dataset.get_by_name(workspace=ws, name='img_files', version=2)

Compute Contexts

The runtime context for each experiment consists of

environment for the script, which includes all packages
compute target on which the environment will be deployed

Intro to Environments

Python runs in virtual environments (eg Conda, pip). Azure creates a Docker container and creates the environment. You create environments by

Conda or pip yaml file and load it:

env = Environment.from_conda_specification(name='training_env', file_path='./conda.yml')

from existing Conda environment:

env = Environment.from_conda_environment(name='training_env',
                            conda_environment_name='py_env')

specifying packages:

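from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

env = Environment('training_env')
deps = CondaDependencies.create(conda_packages=['pandas', 'numpy'],
                              pip_packages=['azureml-defaults'])
env.python.conda_dependencies = deps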

Once created, you can register the environment in the workspace.

env.register(workspace=ws)

Retrieve and assign it to a ScriptRunConfig or an Estimator

tr_env = Environment.get(workspace=ws, name='training_env')
estimator = Estimator(
  source_directory='experiment_folder',
  entry_script='training_script.py',
  compute_target='local',
  environment_definition=tr_env
  )

Compute targets

Compute targets are physical or virtual computers on which experiments are run. Types of compute

local compute: your workstation or a virtual machine
compute clusters: multi-node clusters of VMs that automatically scale up or down
inference clusters: to deploy models, they use containers to initiate computing
attached compute: attach a VM or Databricks cluster that you already use

You can create a compute target via AML studio or via SDK. A managed compute target is one managed by AML. Via SDK

ws = Workspace.from_config()
compute_name = 'aml-cluster'
compute_config = AmlCompute.provisioning_configuration(
  vm_size='STANDARD_DS12_V2',
  min_nodes=0,
  max_nodes=4,
  vm_priority='dedicated'
  )
aml_cluster = ComputeTarget.create(ws, compute_name, compute_config)
aml_cluster.wait_for_completion()

An unmanaged compute target is defined and managed outside AML. You can attach it via SDK:

ws = Workspace.from_config()
compute_name = 'db-cluster'
db_workspace_name = 'db_workspace'
db_resource_group = 'db_resource_group'
db_access_token = 'aocsinaocnasoivn'
db_config = DatabricksCompute.attach_configuration(
  resource_group=db_resource_group,
  workspace_name=db_workspace_name,
  access_token=db_access_token
  )
db_cluster = ComputeTarget.create(ws, compute_name, db_config)
db_cluster.wait_for_completion()

You can check whether a compute target already exists:

compute_name = 'aml-cluster'
try:
  aml_cluster = ComputeTarget(workspace=ws, name=compute_name)
except ComputeTargetException:
  # create it
  ...

You can use a compute target in an experiment run by specifying it as a parameter

compute_name = 'aml-cluster'
training_env = Environment.get(workspace=ws, name='training_env')
estimator = Estimator(
  source_directory='experiment_folder',
  entry_script='training_script.py',
  environment_definition=training_env,
  compute_target=compute_name
  )
# or specify a ComputeTarget object
training_cluster = ComputeTarget(workspace=ws, name=compute_name)
estimator = Estimator(
  source_directory='experiment_folder',
  entry_script='training_script.py',
  environment_definition=training_env,
  compute_target=training_cluster
  )

Orchestrating with Pipelines

A pipeline is a workflow of ML tasks in which each task is implemented as a step (either sequential or parallel). You can combine different compute targets. Common types of step:

PythonScriptStep
EstimatorStep: runs an estimator
DataTransferStep: uses ADF
DatabricksStep
AdlaStep: runs a U-SQL job in Azure Data Lake Analytics

Define steps:

step1 = PythonScriptStep(
  name='prepare data',
  source_directory='scripts',
  script_name='data_prep.py',
  compute_target='aml-cluster',
  runconfig=run_config
  )

step2 = EstimatorStep(
  name='train model',
  estimator=sk_estimator,
  compute_target='aml-cluster'
  )

Assign steps to pipeline:

train_pipeline = Pipeline(
  workspace=ws,
  steps=[step1,step2]
  )
# create experiment and run pipeline
experiment = Experiment(workspace=ws, name='training-pipeline')
pipeline_run = experiment.submit(train_pipeline)

Pass data between steps

The PipelineData object is a special kind of DataReference that

references a location in a datastore
creates a data dependency between pipeline steps

To pass it

define a PipelineData object that references a location in a data store
specify the object as input or output for the steps that use it
pass the PipelineData object as a script parameter in steps that run scripts

Example

raw_ds = Dataset.get_by_name(ws, 'raw_dataset')
# Define object to pass data between steps
data_store = ws.get_default_datastore()
prepped_data = PipelineData('prepped', datastore=data_store)

step1 = PythonScriptStep(
  name='prepare data',
  source_directory='scripts',
  script_name='data_prep.py',
  compute_target='aml-cluster',
  runconfig=run_config,
  # specify dataset
  inputs = [raw_ds.as_named_input('raw_data')],
  # specify PipelineData as output
  outputs = [prepped_data],
  # script reference
  arguments = ['--folder', prepped_data]
  )

step2 = EstimatorStep(
  name='train model',
  estimator=sk_estimator,
  compute_target='aml-cluster',
  # specify PipelineData
  inputs = [prepped_data],
  # pass reference to estimator script
  estimator_entry_script_arguments = ['--folder', prepped_data]
  )

Inside the script, you can get reference to PipelineData object from the argument, and use it like a local folder.

parser = argparse.ArgumentParser()
parser.add_argument('--folder', type=str, dest='folder')
args = parser.parse_args()
output_folder = args.folder

# ...

# save data to PipelineData location
os.makedirs(output_folder, exist_ok=True)
output_path = os.path.join(output_folder, 'prepped_data.csv')
df.to_csv(output_path)
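The script of the next step can read from the same location in a similar way. Below is a minimal sketch of such a consumer: the `prepped_data.csv` file name comes from the snippet above, while the `load_prepped_rows` and `main` helpers, the column names, and the use of the `csv` module (instead of pandas) are assumptions made to keep the example self-contained.

```python
import argparse
import csv
import os
import tempfile


def load_prepped_rows(folder):
    """Read the CSV file that the previous step wrote into the shared folder."""
    input_path = os.path.join(folder, 'prepped_data.csv')
    with open(input_path, newline='') as f:
        return list(csv.DictReader(f))


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--folder', type=str, dest='folder')
    args = parser.parse_args(argv)
    return load_prepped_rows(args.folder)


# Usage: a temporary directory stands in for the PipelineData location
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, 'prepped_data.csv'), 'w', newline='') as f:
        f.write('feature,label\n1.5,0\n2.5,1\n')
    rows = main(['--folder', folder])

print(rows)
```

In a real pipeline run, the `--folder` value is filled in by Azure ML from the PipelineData object, so the script never needs to know where the data is physically stored.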

Reuse steps

By default, the step output from a previous pipeline run is reused without rerunning the step (if script, source directory and other params have not changed). You can control this:

step1 = PythonScriptStep(
  #...
  allow_reuse=False
  )

You can force the steps to run regardless of individual configuration:

pipeline_run = experiment.submit(train_pipeline, regenerate_outputs=True)
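To build an intuition for step reuse, here is a toy model of the idea. It is not how Azure ML implements it: the `step_fingerprint` and `run_step` helpers and the in-memory `cache` are hypothetical. A step is recomputed only when its script or parameters change, or when reuse is disabled.

```python
import hashlib
import json


def step_fingerprint(script_source: str, params: dict) -> str:
    """Hash everything that should invalidate a cached step output."""
    payload = json.dumps({'script': script_source, 'params': params}, sort_keys=True)
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()


cache = {}  # fingerprint -> output of a previous run


def run_step(script_source, params, compute, allow_reuse=True):
    """Return (output, reused), recomputing only when needed."""
    key = step_fingerprint(script_source, params)
    if allow_reuse and key in cache:
        return cache[key], True
    output = compute(params)
    cache[key] = output
    return output, False


def double(params):
    return params['x'] * 2


first = run_step('script v1', {'x': 21}, double)                      # computed
second = run_step('script v1', {'x': 21}, double)                     # reused
forced = run_step('script v1', {'x': 21}, double, allow_reuse=False)  # rerun
changed = run_step('script v2', {'x': 21}, double)                    # script changed
print(first, second, forced, changed)
```

In this toy model, `allow_reuse=False` plays the role of the step-level setting shown above, while clearing `cache` before a run would correspond to `regenerate_outputs=True`.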

Publish pipelines

You can publish a pipeline to create a REST endpoint through which the pipeline can be run on demand.

published_pipeline = pipeline.publish(
  name='training_pipeline',
  description='Model training pipeline',
  version='1.0'
  )

You can view it in ML Studio and get the endpoint:

published_pipeline.endpoint

You start a published endpoint by making an HTTP request to it. You pass the authorisation header (with token) and a JSON payload specifying the experiment name. The pipeline is run asynchronously, and you get the run ID as the response.
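As a rough sketch, the request described above can be assembled with the standard library. The endpoint URL and the token are placeholders, the `build_pipeline_request` helper is hypothetical, and the "Bearer" header format is an assumption about how the token is passed; the request is built here but not sent.

```python
import json
import urllib.request


def build_pipeline_request(endpoint, token, experiment_name):
    """Build (without sending) the POST request that starts a published pipeline."""
    body = json.dumps({'ExperimentName': experiment_name}).encode('utf-8')
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            'Authorization': f'Bearer {token}',  # assumed header format
            'Content-Type': 'application/json',
        },
        method='POST',
    )


# Hypothetical endpoint and token, for illustration only
request = build_pipeline_request(
    'https://example.invalid/pipelines/1234',
    token='<access-token>',
    experiment_name='run_training_pipeline',
)
print(request.get_method(), request.full_url)
print(json.loads(request.data))
```

Passing the request to `urllib.request.urlopen` would actually trigger the run; in practice the endpoint value comes from `published_pipeline.endpoint`.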

Pipeline parameters

Create a PipelineParameter object for each parameter. Example:

reg_param = PipelineParameter(name='reg_rate', default_value=0.01)
# ...
step2 = EstimatorStep(
  # ...
  estimator_entry_script_arguments=[
    '--folder', prepped,
    '--reg', reg_param
  ]
)

After you publish a parametrised pipeline, you can pass parameter values in the JSON payload of the REST interface. Example

requests.post(
  endpoint,
  headers=auth_header,
  json={
    'ExperimentName': 'run_training_pipeline',
    'ParameterAssignments': {
      'reg_rate': 0.1
    }
  }
  )

Schedule pipelines

Define a ScheduleRecurrence and use it to create a Schedule.

daily = ScheduleRecurrence(
  frequency='Day',
  interval=1
  )
pipeline_schedule = Schedule.create(
  ws,
  name='Daily Training',
  description='train model every day',
  pipeline_id=published_pipeline.id,
  experiment_name='Training_Pipeline',
  recurrence=daily
  )

To schedule a pipeline to run whenever data changes, you must create a Schedule that monitors a specific path on a datastore:

training_datastore = Datastore(workspace=ws, name='blob_data')
pipeline_schedule = Schedule.create(
  # ...
  datastore=training_datastore,
  path_on_datastore='data/training'
  )

Deploy ML Models

You can deploy a model as a container to several compute targets:

Azure ML compute instance
Azure container instance
Azure function
Azure Kubernetes service
IoT module

Steps

register the model
inference configuration
deployment configuration
deploy model

Register the model

After training, you must register the model to Azure ML workspace.

classification_model = Model.register(
  workspace=ws,
  model_name='classification_model',
  model_path='model.pkl',
  description='A classification model'
  )

Or you can use the reference to the run:

run.register_model(
  model_name='classification_model',
  model_path='outputs/model.pkl',
  description='A classification model'
  )

Inference configuration

The model will be deployed as a service consisting of

a script to load the model and return predictions for submitted data
an environment in which the script will be run

Create the entry script (or scoring script) as a Python file including 2 functions

init() called when service is initialised (load model from registry)
run(raw_data) called when new data is submitted to the service (generate predictions)

Example

def init():
  global model
  model_path = Model.get_model_path('classification_model')
  model = joblib.load(model_path)

def run(raw_data):
  data = np.array(json.loads(raw_data)['data'])
  predictions = model.predict(data)
  # return predictions in any JSON serializable format
  return predictions.tolist()
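To check the init()/run() contract before deploying, you can exercise the two functions locally. The sketch below replaces the registered model with a made-up ThresholdModel class (an assumption for illustration only), so it needs neither Azure ML nor scikit-learn.

```python
import json


class ThresholdModel:
    """Hypothetical stand-in for a trained classifier; not part of Azure ML."""

    def predict(self, rows):
        # Predict class 1 when the feature values sum to more than 1.0.
        return [1 if sum(row) > 1.0 else 0 for row in rows]


def init():
    """Called once when the service starts; here it loads the stand-in model."""
    global model
    model = ThresholdModel()


def run(raw_data):
    """Called for each request; parses the JSON body and returns predictions."""
    data = json.loads(raw_data)['data']
    predictions = model.predict(data)
    # The return value must be JSON serializable, so a plain list is returned.
    return list(predictions)


# Simulate what the service does: initialise once, then score a request body.
init()
request_body = json.dumps({'data': [[0.1, 0.2, 3.4], [0.1, 0.2, 0.3]]})
print(run(request_body))  # [1, 0]
```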

You can configure the environment using Conda. You can use a CondaDependencies class to create a default environment (including azureml-defaults and other commonly used packages) and add any other required packages. You then serialize the environment to a string and save it.

myenv = CondaDependencies()
myenv.add_conda_package('scikit-learn')

env_file = 'service_files/env.yml'
with open(env_file, 'w') as f:
  f.write(myenv.serialize_to_string())

After creating the script and the environment, you combine them in an InferenceConfig:

classifier_inference_config = InferenceConfig(
  runtime='python',
  source_directory='service_files',
  entry_script='score.py',
  conda_file='env.yml'
  )

Deployment configuration

Now that you have the entry script and the environment, you configure the compute service. If you deploy to an AKS cluster, you create it

cluster_name = 'aks-cluster'
compute_config = AksCompute.provisioning_configuration(location='eastus')
production_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
production_cluster.wait_for_completion()

You define the deployment configuration

classifier_deploy_config = AksWebservice.deploy_configuration(
  cpu_cores=1,
  memory_gb=1
)

Deploy the model

model = ws.models['classification_model']
service = Model.deploy(
  name='classification-service',
  models=[model],
  inference_config=classifier_inference_config,
  deploy_config=classifier_deploy_config,
  deployment_target=production_cluster
  )
service.wait_for_deployment()

Consuming a real-time inferencing service

For testing, you can use the AML SDK to call a web service through the run method of a Webservice object. Typically, you send data to the run method as JSON, like

{
  "data":[
    [0.1, 0.2, 3.4],
    [0.9, 8.2, 2.5],
    ...
  ]
}

The response is a JSON with a prediction for each case

response = service.run(input_data=json_data)
predictions = json.loads(response)

In production, you use a REST endpoint. You find the endpoint of a deployed service in Azure ML studio, or by retrieving the scoring_uri property of a Webservice object:

endpoint = service.scoring_uri

There are 2 kinds of authentication:

key: requests are authenticated by specifying the key associated with the service
token: requests are authenticated by providing a JSON Web Token (JWT)

By default, authentication is disabled for Azure Container Instance service (set to key-based authentication for AKS).

To make an authenticated call to the REST endpoint, you include the key or the token in the request header.
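A minimal sketch of such a call, using only the standard library. It assumes the key or token is sent as a Bearer value in the Authorization header; the function name build_scoring_request, the localhost URL, and the placeholder credential are invented for this example. The request is only built here, not sent.

```python
import json
import urllib.request


def build_scoring_request(endpoint, rows, credential=None):
    """Build a POST request for a scoring endpoint.

    credential is the service key or a JWT; pass None when authentication
    is disabled for the service.
    """
    headers = {'Content-Type': 'application/json'}
    if credential is not None:
        # Assumption: key and token are both passed as a Bearer credential.
        headers['Authorization'] = 'Bearer ' + credential
    body = json.dumps({'data': rows}).encode('utf-8')
    return urllib.request.Request(endpoint, data=body, headers=headers, method='POST')


# Hypothetical endpoint and credential, for illustration only.
request = build_scoring_request(
    'http://localhost:8890/score',
    [[0.1, 0.2, 3.4]],
    credential='<key-or-token>',
)
print(request.get_method(), request.get_header('Authorization'))
```

To actually send it, you would pass the request to urllib.request.urlopen and parse the JSON response.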

Troubleshooting service deployment

You can

check the service state (should be healthy): service.state
review service logs: service.get_logs()
deploy to local container

Batch inference pipelines

A batch inference pipeline reads input data, loads a registered model, predicts labels, and writes the results. The steps are:

Register a model
Create a scoring script. The run(mini_batch) method makes the inference on each batch.
Create a pipeline with ParallelRunStep
Run the pipeline and retrieve the step output

Azure ML provides a pipeline step that performs parallel batch inference. Using the ParallelRunStep class, you can read batches of files from a File dataset and write the output to a PipelineData reference. You can set the output_action to "append_row" (ensuring all instances of the step will collate the result to a single output file named parallel_run_step.txt).
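The scoring script for such a step could look roughly like the sketch below. It assumes each mini-batch is a list of file paths and that every element returned by run() becomes one row of the collated output. MeanThresholdModel is a made-up stand-in for the registered model, and the temporary files only simulate a File dataset.

```python
import os
import tempfile


class MeanThresholdModel:
    """Hypothetical stand-in for a registered model; not part of Azure ML."""

    def predict(self, values):
        # Predict class 1 when the mean of the values is above 0.5.
        return 1 if sum(values) / len(values) > 0.5 else 0


def init():
    """Called once per worker; a real script would load the registered model."""
    global model
    model = MeanThresholdModel()


def run(mini_batch):
    """Score every file in the mini-batch and return one result row per file."""
    results = []
    for file_path in mini_batch:
        with open(file_path) as f:
            values = [float(x) for x in f.read().split(',')]
        prediction = model.predict(values)
        results.append('{}: {}'.format(os.path.basename(file_path), prediction))
    return results


# Local dry run with two small files standing in for the dataset.
init()
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, content in [('a.csv', '0.9,0.8'), ('b.csv', '0.1,0.2')]:
        path = os.path.join(tmp, name)
        with open(path, 'w') as f:
            f.write(content)
        paths.append(path)
    print(run(paths))  # ['a.csv: 1', 'b.csv: 0']
```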

batch_data_set = ws.datasets['batch-data']

# output location
default_ds = ws.get_default_datastore()
output_dir = PipelineData(
  name='inferences',
  datastore=default_ds,
  output_path_on_compute='results'
)

parallel_run_config = ParallelRunConfig(
  source_directory='batch_scripts',
  entry_script='batch_scoring_script.py',
  mini_batch_size="5",
  error_threshold=10,
  output_action="append_row",
  environment=batch_env,
  compute_target=aml_cluster,
  node_count=4
  )

parallelrun_step = ParallelRunStep(
  name="batch-score",
  parallel_run_config=parallel_run_config,
  inputs=[batch_data_set.as_named_input('batch_data')],
  output=output_dir,
  arguments=[],
  allow_reuse=True
  )

pipeline = Pipeline(
  workspace=ws,
  steps=[parallelrun_step]
  )

Run the pipeline and retrieve output.

pipeline_run = Experiment(ws, 'batch_prediction_pipeline').submit(pipeline)
pipeline_run.wait_for_completion()

prediction_run = next(pipeline_run.get_children())
prediction_output = prediction_run.get_output_data('inferences')
prediction_output.download(local_path='results')

Publishing a batch inference pipeline

You can publish it as a REST service.

published_pipeline = pipeline_run.publish_pipeline(
  name='Batch_Prediction_Pipeline',
  description='Batch Pipeline',
  version='1.0'
  )

rest_endpoint = published_pipeline.endpoint

Once published, you can use the endpoint to initiate a batch inferencing job.

You can also schedule the published pipeline to have it run automatically.

weekly = ScheduleRecurrence(frequency='Week', interval=1)
pipeline_schedule = Schedule.create(
  ws,
  name='Weekly Predictions',
  description='batch inferencing',
  pipeline_id=published_pipeline.id,
  experiment_name='Batch_Prediction',
  recurrence=weekly
  )

Tuning hyperparameters

Hyperparameter tuning is accomplished by training multiple models, using the same algorithm and training data but different hyperparameter values. Then, the performance metric (eg accuracy) is evaluated for each model, and the best-performing model is selected.

In Azure ML, you make an experiment that consists of a hyperdrive run, which initiates a child run for each hyperparameter combination. Each child run uses a training script with parametrised hyperparameter values to train a model, and logs the target performance metric achieved by the trained model.

Define a search space

Depends on the type of hyperparameter:

discrete. Make a choice out of
an explicit python list: choice([10, 20, 30])
a range: choice(range(1,10))
select values from a discrete distribution: qnormal, quniform, qlognormal, qloguniform
continuous. Use any of these distributions: normal, uniform, lognormal, loguniform

Define a search space by creating a dictionary with parameter expressions for each hyperparameter.

from azureml.train.hyperdrive import choice, normal

param_space = {
  '--batch_size': choice(16, 32, 64),
  '--learning_rate': normal(10, 3)
}

Configuring sampling

The values used in a tuning run depend on the type of sampling used.

Grid sampling. Every possible combination when hyperparameters are discrete.

param_space = {
  '--batch_size': choice(16, 32, 64),
  '--learning_rate': choice(10, 20)
}

param_sampling = GridParameterSampling(param_space)
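To see what "every possible combination" means, here is a plain-Python sketch that enumerates the grid for the same two hyperparameters. It does not use the hyperdrive classes; grid_combinations is a name made up for this example.

```python
from itertools import product

# Same discrete values as in the grid example, written as plain lists.
grid_space = {
    '--batch_size': [16, 32, 64],
    '--learning_rate': [10, 20],
}


def grid_combinations(space):
    """Return one dict per combination of the discrete values in the space."""
    names = list(space)
    value_lists = [space[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


combinations = grid_combinations(grid_space)
print(len(combinations))  # 6 (3 batch sizes x 2 learning rates)
for combination in combinations:
    print(combination)
```

The number of child runs grows multiplicatively with each added hyperparameter, which is why grid sampling is only practical for small discrete spaces.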

Random sampling. Randomly select a value for each hyperparameter.

param_space = {
  '--batch_size': choice(16, 32, 64),
  '--learning_rate': normal(10, 3)
}

param_sampling = RandomParameterSampling(param_space)
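The idea can be sketched in plain Python, without the hyperdrive classes: each hyperparameter gets its own draw function, and one configuration is produced by drawing once per hyperparameter. The names random_space and sample_configuration are invented for this example, and normal(10, 3) is read as mean 10 and standard deviation 3.

```python
import random

# Each entry maps a hyperparameter to a function that draws one value.
random_space = {
    '--batch_size': lambda rng: rng.choice([16, 32, 64]),
    '--learning_rate': lambda rng: rng.gauss(10, 3),
}


def sample_configuration(space, rng):
    """Draw one value per hyperparameter, independently of the others."""
    return {name: draw(rng) for name, draw in space.items()}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_configuration(random_space, rng) for _ in range(4)]
for sample in samples:
    print(sample)
```

Unlike grid sampling, the number of runs is chosen by you rather than dictated by the size of the space.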

Bayesian sampling. Based on Bayesian optimisation algorithm that tries to select parameter combinations that will result in improved performance from the previous selection.

param_space = {
  '--batch_size': choice(16, 32, 64),
  '--learning_rate': uniform(0.1, 0.5)
}

param_sampling = BayesianParameterSampling(param_space)

Bayesian sampling can only be used with choice, uniform, and quniform parameter expressions, and it can't be combined with an early termination policy.

Configuring an early termination

Typically, you set a maximum number of iterations, but this could still result in a large number of runs that don't result in a better model than a combination that has already been tried.

To help prevent wasting time, you can set an early termination policy that abandons runs that are unlikely to produce a better result than previously completed runs. The policy is evaluated at an evaluation interval you specify, based on each time the target performance metric is logged. You can also set a delay evaluation parameter to avoid evaluating the policy until a minimum number of iterations have been completed.

Note. Early termination is particularly useful for deep learning scenarios where a deep neural network is trained iteratively over a number of epochs. The training script can report the target metric after each epoch, and if the run is significantly underperforming previous runs after the same number of intervals, it can be abandoned.

Bandit policy. Stop a run if the target performance metric underperforms the best run so far by a specified margin.

early_termination_policy = BanditPolicy(
  slack_amount=0.2, # abandon runs when metric is 0.2 or more worse than best run after the same number of intervals
  evaluation_interval=1,
  delay_evaluation=5
  )

You can also use a slack factor, which compares the metric as a ratio rather than an absolute value.
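The two variants of the bandit rule can be written down as a small function. This is a sketch under assumptions: the metric is maximised, the slack factor is applied as best / (1 + slack_factor), and a run exactly on the threshold is kept. bandit_should_stop is a name made up for this example, not an SDK function.

```python
def bandit_should_stop(metric, best_metric, slack_amount=None, slack_factor=None):
    """Return True when a run falls outside the allowed slack of the best run.

    Assumes a metric where higher is better. Exactly one of slack_amount
    (absolute distance) or slack_factor (ratio) must be given.
    """
    if (slack_amount is None) == (slack_factor is None):
        raise ValueError('specify exactly one of slack_amount or slack_factor')
    if slack_amount is not None:
        threshold = best_metric - slack_amount
    else:
        threshold = best_metric / (1 + slack_factor)
    return metric < threshold


# Best run so far reports 0.9 at this interval.
print(bandit_should_stop(0.65, 0.9, slack_amount=0.2))  # True, threshold is about 0.7
print(bandit_should_stop(0.75, 0.9, slack_amount=0.2))  # False
print(bandit_should_stop(0.70, 0.9, slack_factor=0.2))  # True, threshold is 0.75
print(bandit_should_stop(0.80, 0.9, slack_factor=0.2))  # False
```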

Median stopping policy. Abandons runs where the target performance metric is worse than the median of the running averages for all runs.

early_termination_policy = MedianStoppingPolicy(
  evaluation_interval=1,
  delay_evaluation=5
  )
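A simplified reading of the median stopping rule, as a sketch: compute the running average of each run's logged metrics, take the median of those averages, and stop the runs whose best value so far is below it. It assumes a metric that is maximised; median_stopping and the run names are invented for this example.

```python
from statistics import mean, median


def median_stopping(history):
    """Return the ids of runs to stop.

    history maps a run id to the list of metric values it has logged so far
    (higher is better). A run is stopped when its best value is below the
    median of the running averages of all runs.
    """
    running_averages = {run_id: mean(values) for run_id, values in history.items()}
    cutoff = median(running_averages.values())
    return sorted(run_id for run_id, values in history.items() if max(values) < cutoff)


history = {
    'run_a': [0.60, 0.70, 0.80],
    'run_b': [0.50, 0.55, 0.60],
    'run_c': [0.30, 0.35, 0.40],
}
print(median_stopping(history))  # ['run_c']
```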

Truncation selection policy. Cancels the lowest-performing X% of runs at each evaluation interval, based on the truncation_percentage value you specify for X.

early_termination_policy = TruncationSelectionPolicy(
  truncation_percentage=10,
  evaluation_interval=1,
  delay_evaluation=5
  )
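As a sketch, truncation selection amounts to ranking the runs by their latest metric and cancelling the bottom slice. The example assumes a metric that is maximised and rounds the number of cancelled runs down; both choices, and the name truncation_selection, are assumptions of this illustration.

```python
import math


def truncation_selection(latest_metrics, truncation_percentage):
    """Return the ids of the lowest-performing runs to cancel.

    latest_metrics maps a run id to its latest metric value (higher is better).
    """
    n_cancel = math.floor(len(latest_metrics) * truncation_percentage / 100)
    ranked = sorted(latest_metrics, key=latest_metrics.get)  # worst run first
    return ranked[:n_cancel]


# Ten runs with metrics from 0.50 (run_0) up to 0.95 (run_9).
latest = {'run_{}'.format(i): 0.5 + 0.05 * i for i in range(10)}
print(truncation_selection(latest, truncation_percentage=10))  # ['run_0']
print(truncation_selection(latest, truncation_percentage=20))  # ['run_0', 'run_1']
```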

Running a hyperparameter tuning experiment

In Azure ML, you tune hyperparameters by running a hyperdrive experiment. You need to create a training script just the way you would do for any other training experiment, except that you must:

include an argument for each hyperparameter
log the target performance metric.

This example script trains a logistic regression using a --regularization argument (regularization rate), and logs the accuracy.

parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01)
args = parser.parse_args()
reg = args.reg_rate

# get experiment run context
run = Run.get_context()

data = run.input_datasets['training_data'].to_pandas_dataframe()
X = data[['feature1', 'feature2', 'feature3', 'feature4']].values
y = data['label'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

model = LogisticRegression(C=1/reg, solver='liblinear').fit(X_train, y_train)

# calculate and log accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
run.log('Accuracy', float(acc))

# save trained model
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename='outputs/model.pkl')

run.complete()

To prepare the hyperdrive experiment, you use a HyperDriveConfig object to configure the experiment run.

hyperdrive = HyperDriveConfig(
  estimator=sklearn_estimator,
  hyperparameter_sampling=param_sampling,
  policy=None,
  primary_metric_name='Accuracy',
  primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
  max_total_runs=6,
  max_concurrent_runs=4
  )

experiment = Experiment(workspace=ws, name='hyperdrive_training')
hyperdrive_run = experiment.submit(config=hyperdrive)

You can monitor the hyperdrive experiment in Azure ML studio. The experiment will initiate a child run for each hyperparameter combination to be tried.

Automate model selection

The visual interface for automated ML in Azure ML Studio is available in the Enterprise edition only.

You can use automated ML to train models for the tasks below. Azure ML supports common algorithms for these tasks:

classification
logistic regression
light gradient boosting machine
decision tree
random forest
naive Bayes
linear SVM
XGBoost
DNN classifier
others...
regression
linear regression
light gradient boosting machine
decision tree
random forest
elastic net
LARS Lasso
XGBoost
Others
time series forecasting
linear regression
light gradient boosting machine
decision tree
random forest
elastic net
LARS Lasso
XGBoost
others

By default, automated machine learning will randomly select from the full range of algorithms for the specified task. You can choose to block individual algorithms from being selected.

Preprocessing and featurization

Automated ML (AutoML) can apply preprocessing transformations to your data.

scaling and normalization applied to numeric data automatically
optional featurization
missing value imputation
categorical encoding
dropping high cardinality features (eg IDs)
feature engineering (eg date parts from DateTime)

Running AutoML experiment

You can use the Azure ML Studio UI or the SDK (using the AutoMLConfig class).

automl_run_config = RunConfiguration(framework='python')
automl_config = AutoMLConfig(
  name='auto ml experiment',
  task='classification',
  primary_metric='AUC_weighted',
  compute_target=aml_compute,
  training_data=train_dataset,
  validation_data=test_dataset,
  label_column_name='label',
  featurization='auto',
  iterations=12,
  max_concurrent_iterations=4
  )

With Azure ML Studio, you can create or select an Azure ML dataset to be used as input for your AutoML experiment. When using the SDK, you can submit data by

specify a dataset or dataframe of training data that includes features and label to be predicted
optionally, specify a second validation data dataset or dataframe. If this is not provided, Azure ML will apply cross-validation.

Alternatively:

specify a dataset, dataframe, or numpy array of X values containing features with a corresponding y array of label values

One of the most important settings you specify is primary_metric (ie target performance metric). Azure ML supports a set of named metrics for each type of task.

get_primary_metrics('classification')

You can submit an AutoML experiment like any other SDK-based experiment:

automl_experiment = Experiment(ws, 'automl_experiment')
automl_run = automl_experiment.submit(automl_config)

You can easily identify the best run in Azure ML studio, and download or deploy the model it generated. Via SDK:

best_run, fitted_model = automl_run.get_output()
best_run_metrics = best_run.get_metrics()
for metric_name in best_run_metrics:
  metric = best_run_metrics[metric_name]
  print(metric_name, metric)

AutoML uses scikit-learn pipelines. You can view the steps in the fitted model you obtained from the best run.

for step in fitted_model.named_steps:
  print(step)

Explain ML models

Model explainers use statistical techniques to calculate feature importance. Explainers work by evaluating a test data set of feature cases and the labels the model predicts for them.

Global feature importance quantifies the relative importance of each feature in the test dataset as a whole: which feature in the dataset influences prediction?

Local feature importance measures the influence of each feature value for a specific individual prediction. For example, will Sam default?

Prediction=0: Samuel won't default on the loan repayment

Features:

loan amount; support for 0: 0.9; support for 1: -0.9
income; support for 0: 0.6
age; support for 0: -0.2
marital status; support for 0: 0.1

Because this is a classification model, each feature gets a local importance value for each possible class, indicating the amount of support for that class based on the feature value.

The most important feature for a prediction of class 1 is loan amount. There could be multiple reasons why local importance for an individual prediction varies from global importance for the overall dataset. For example, Sam might have a lower income than average, but the loan amount in this case might be unusually small.

For a multi-class classification model, a local importance value for each possible class is calculated for every feature, with the total across all classes always being 0.

For a regression model, the local importance values simply indicate the level of influence each feature has on the predicted scalar label.

Using explainers

You can use Azure ML SDK to create explainers for models even if they were not trained using an Azure ML experiment.

You install the azureml-interpret package. Types of explainer include:

MimicExplainer creates a global surrogate model that approximates your trained model and can be used to generate explanations. This explainable model must have the same kind of architecture as your trained model (eg linear or tree-based)
TabularExplainer acts as a wrapper around various SHAP explainer algorithms, automatically choosing the one that is most appropriate for your model architecture
PFIExplainer (Permutation Feature Importance) analyzes feature importance by shuffling feature values and measuring the impact on prediction performance
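The permutation idea behind PFIExplainer can be illustrated in a few lines of plain Python. This is a sketch of the general technique, not the azureml-interpret implementation: shuffle one feature column at a time and measure how much accuracy drops. The toy model and data are invented for the example.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_features, seed=0):
    """Return, per feature, the accuracy drop after shuffling that feature."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)
        # Rebuild the rows with only feature j replaced by its shuffled values.
        X_shuffled = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
        importances.append(baseline - accuracy(model, X_shuffled, y))
    return importances


def toy_model(row):
    """Toy classifier that looks only at the first feature."""
    return 1 if row[0] > 0.5 else 0


X = [[i % 2, (i * 7) % 5] for i in range(20)]  # feature 0 alternates 0/1
y = [row[0] for row in X]                      # label equals feature 0
print(permutation_importance(toy_model, X, y, n_features=2))
```

The second feature always gets importance 0.0 here because the toy model never reads it, while shuffling the first feature can only lower the accuracy.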

Example for hypothetical model named loan_model

mim_explainer = MimicExplainer(
  model=loan_model,
  initialization_examples=X_test,
  explainable_model=DecisionTreeExplainableModel,
  features=['loan_amount', 'income', 'age', 'marital_status'],
  classes=['reject', 'approve']
  )

tab_explainer = TabularExplainer(
  model=loan_model,
  initialization_examples=X_test,
  features=['loan_amount', 'income', 'age', 'marital_status'],
  classes=['reject', 'approve']
  )

pfi_explainer = PFIExplainer(
  model=loan_model,
  features=['loan_amount', 'income', 'age', 'marital_status'],
  classes=['reject', 'approve']
  )

To retrieve global feature importance, call the explain_global() method of your explainer, and then use the get_feature_importance_dict() method to get a dictionary of the feature importance values.

global_mim_explanation = mim_explainer.explain_global(X_train)
global_mim_feature_importance = global_mim_explanation.get_feature_importance_dict()

# same as MimicExplainer
global_tab_explanation = tab_explainer.explain_global(X_train)
global_tab_feature_importance = global_tab_explanation.get_feature_importance_dict()

# requires actual labels
global_pfi_explanation = pfi_explainer.explain_global(X_train, y_train)
global_pfi_feature_importance = global_pfi_explanation.get_feature_importance_dict()

To retrieve local feature importance from a MimicExplainer or a TabularExplainer, you must call the explain_local() method, specifying the subset of cases you want to explain. Then you use get_ranked_local_names() and get_ranked_local_values() to retrieve the feature names and importance values, ranked by importance.

# same for tab_explainer too
local_mim_explanation = mim_explainer.explain_local(X_test[0:5])
local_mim_features = local_mim_explanation.get_ranked_local_names()
local_mim_importance = local_mim_explanation.get_ranked_local_values()

PFIExplainer does not support local feature importance explanations.

Creating explanations

You can create an explainer and upload the explanation it generates to the run for later analysis.

To create an explanation for the experiment script, you'll need to ensure that the azureml-interpret and azureml-contrib-interpret packages are installed in the run environment. Then you can use these to create an explanation from your trained model and upload it to the run outputs.

run = Run.get_context()

# code to train model goes here

# get explanation
explainer = TabularExplainer(model, X_train, features=features, classes=labels)
explanation = explainer.explain_global(X_test)

# get an explanation client and upload the explanation
explain_client = ExplanationClient.from_run(run)
explain_client.upload_model_explanation(explanation, comment='Tabular Explanation')

run.complete()

You can view the explanation you created for your model in the Explanations tab for the run in Azure ML Studio.

Visualizing explanations

Model explanations in Azure ML Studio include multiple visualizations that you can use to explore feature importance. Visualizations:

global feature importance
summary importance: shows the distribution of individual importance values for each feature across the test dataset
local feature importance by selecting an individual data point

Monitor models

You can use Application Insights to capture and review telemetry from models published with Azure ML. You must have an Application Insights resource associated with your Azure ML workspace.

When you create an Azure ML workspace, you can select an Application Insights resource. If you do not select an existing resource, a new one is created in the same resource group as your workspace.

When deploying a new real-time service, you can enable Application Insights in the deployment configuration for the service.

dep_config = AciWebservice.deploy_configuration(
  cpu_cores=1,
  memory_gb=1,
  enable_app_insights=True
  )

If you want to enable Application Insights for a service that is already deployed, you can modify the deployment configuration for AKS based services in the Azure portal.

Capture and view telemetry

Application Insights automatically captures any information written to the standard output and error logs, and provides a query capability to view data in these logs.

You can write any value to the standard output in the scoring script by using a print:

def run(raw_data):
  data = json.loads(raw_data)['data']
  predictions = model.predict(data)
  print('Data: ' + str(data) + ' - Predictions: ' + str(predictions))
  return predictions.tolist()

Azure ML creates a custom dimension in the data model for the output you write.

You can use the Log Analytics query interface for Application Insights in the Azure portal. It supports a SQL-like query syntax.

Monitor data drift

Over time there may be trends that change the profile of the data, making your model less accurate. This change in data profiles between training and inferencing is known as data drift.

Azure ML supports data drift monitoring through the use of datasets. You can compare two registered datasets to detect data drift, or you can capture new feature data submitted to a deployed model service and compare it to the dataset with which the model was trained.

You register 2 datasets:

a baseline dataset: original training data
a target dataset that will be compared to the baseline on time intervals. This dataset requires a column for each feature you want to compare, and a timestamp column

You define a dataset monitor to detect data drift and trigger alerts if the rate of drift exceeds a specified threshold. You can create dataset monitors using Azure ML Studio or by using the DataDriftDetector class.

monitor = DataDriftDetector.create_from_datasets(
  workspace=ws,
  name='dataset-drift-monitor',
  baseline_data_set=train_ds,
  target_data_set=new_data_ds,
  compute_target='aml-cluster',
  frequency='Week',
  feature_list=['age', 'height', 'bmi'],
  latency=24
  )

You can backfill to immediately compare baseline to existing data in target.

backfill = monitor.backfill(dt.datetime.now() - dt.timedelta(weeks=6), dt.datetime.now())

If you have deployed a model as a real-time web service, you can capture new inferencing data as it is submitted, and compare it to the original training data. It has the benefit of automatically collecting new target data as the deployed model is used.

You include the training dataset in the model registration to provide a baseline.

model = Model.register(
  workspace=ws,
  model_path='./model/model.pkl',
  model_name='mymodel',
  datasets=[(Dataset.Scenario.TRAINING, train_ds)]
  )

You enable data collection for services in which the model is used. You use the ModelDataCollector class in each service's scoring script, writing code to capture data and predictions and write them to the data collector (which will store them in Azure blob storage).

def init():
  global model, data_collect, predict_collect
  model_name = 'mymodel'
  model = joblib.load(Model.get_model_path(model_name))

  # enable collection of data and predictions
  data_collect = ModelDataCollector(
    model_name,
    designation='inputs',
    feature_names=['age', 'height', 'bmi']
    )
  predict_collect = ModelDataCollector(
    model_name,
    designation='predictions',
    feature_names=['prediction']
    )

def run(raw_data):
  data = json.loads(raw_data)['data']
  predictions = model.predict(data)

  # write inputs and predictions to the data collectors
  data_collect.collect(data)
  predict_collect.collect(predictions)

  return predictions.tolist()

With the data collection code in place in the scoring script, you can enable data collection in the deployment configuration.

dep_config = AksWebservice.deploy_configuration(collect_model_data=True)

You can configure data drift monitoring by using a DataDriftDetector class.

model = ws.models['mymodel']
datadrift = DataDriftDetector.create_from_model(
  ws,
  model.name,
  model.version,
  services=['my-svc'],
  frequency='Week'
  )

Scheduling alerts

You can specify a threshold for the rate of data drift and an operator email for notifications.

Monitoring works by running a comparison at a scheduled frequency (day, week, or month), and calculating data drift metrics for the features. For dataset monitors, you can specify a latency indicating the number of hours to allow for new data to be collected and added to the target dataset. For deployed model data drift monitors, you can specify a schedule_start time value to indicate when the data drift run should start (if omitted, the run will start at the current time).

Data drift is measured using a calculated magnitude of change in the statistical distributions of feature values over time. You can configure a threshold for data drift magnitude.
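To build intuition for "magnitude of change in the statistical distributions", here is an illustrative proxy: the one-dimensional Wasserstein (earth mover's) distance between baseline and target samples of each feature, compared against a threshold. This is not how Azure ML computes its drift magnitude, and the result is in raw feature units rather than a normalised score. The function names and sample values are made up for the example.

```python
def wasserstein_1d(baseline, target):
    """Earth mover's distance between two equally sized 1-D samples."""
    if len(baseline) != len(target) or not baseline:
        raise ValueError('samples must be non-empty and of equal size')
    # For equal-size samples, the distance is the mean gap between sorted values.
    pairs = zip(sorted(baseline), sorted(target))
    return sum(abs(b - t) for b, t in pairs) / len(baseline)


def features_over_threshold(baseline, target, threshold):
    """Return the features whose distance exceeds the threshold."""
    distances = {name: wasserstein_1d(baseline[name], target[name]) for name in baseline}
    return {name: d for name, d in distances.items() if d > threshold}


baseline = {'age': [30, 32, 34, 36], 'height': [160, 170, 180, 190]}
target = {'age': [40, 42, 44, 46], 'height': [190, 160, 180, 170]}
print(features_over_threshold(baseline, target, threshold=5))  # {'age': 10.0}
```

Here the age distribution has shifted by ten years, while the height values are the same set in a different order, so only age is flagged.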

alert_email = AlertConfiguration('data_scientist@contoso.com')
monitor = DataDriftDetector.create_from_datasets(
  ws,
  'dataset-drift-detector',
  baseline_data_set,
  target_data_set,
  compute_target=cpu_cluster,
  frequency='Week',
  latency=2,
  drift_threshold=0.3,
  alert_configuration=alert_email
  )

Error when restarting Databricks streaming job

2020-04-19T18:00:00+02:00

This is an error I encountered while running a Spark Streaming job on Databricks 6.1. Consider the case where I have to update a running streaming query. Databricks recommends always starting (and restarting too?) a streaming query on a new dedicated cluster. However, in some scenarios you might not be able to do so, and you may want to:

cancel the job run
update the notebooks
restart the job run

By taking these steps, I encountered these errors:

Concurrent update to the log. Multiple streaming jobs detected for ...

# or

Multiple streaming queries are concurrently using ... [checkpoint]

They did not occur every time I restarted the query, but most of the time. After restarting 2-3 times, the issue was solved and the streaming query ran smoothly. Investigating the error a bit more, we found that cancelling a job run via the Databricks CLI was not letting the streaming query shut down gracefully. What happened? The running query was not closing its checkpoints cleanly. So, when a new job run started, it raised an error because it found a corrupted checkpoint.

Solution

You can

upgrade to Databricks 6.3 and set spark.sql.streaming.stopActiveRunOnRestart to true
wait for Databricks 7 to be released, where this configuration is enabled by default

New Work on atacmonitor.com

2020-03-08T18:00:00+01:00

My side project atacmonitor features a new guise. Data is now being collected for all bus and tram lines in Rome. Data pull is achieved via Python functions running on AWS Lambda. Data is then stored in MongoDB hosted in MongoDB Atlas. Atlas also provides the charts in the page. An overview of the new architecture is presented below.

Link to the post of the first release.

The Pragmatic Programmer [Highlights]

2018-02-10T14:31:00+01:00

Rather than construction, software is more like gardening— it is more organic than concrete. You plant many things in a garden according to an initial plan and conditions. Some thrive, others are destined to end up as compost. [...] You constantly monitor the health of the garden, and make adjustments (to the soil, the plants, the layout) as needed.

The Pragmatic Programmer: from Journeyman to Master by Andrew Hunt and David Thomas is a guide to best practices of software development. A software developer is like a woodcrafter. There are good practices that help them achieve quality and efficiency in their work. I will summarize here some interesting hints that you can find in the book.

The book was originally published in 1999, so technologies and tools are quite outdated. However, the main principles remain surprisingly up to date.

1. Don't Repeat Yourself

DRY— Don't Repeat Yourself The alternative is to have the same thing expressed in two or more places. If you change one, you have to remember to change the others [...]. It isn't a question of whether you'll remember: it's a question of when you'll forget.

2. Coding over GUIs

A benefit of GUIs is WYSIWYG— what you see is what you get. The disadvantage is WYSIAYG— what you see is all you get.

3. One Editor for All

We think it is better to know one editor very well, and use it for all editing tasks: code, documentation, memos, system administration, and so on. Without a single editor, you face a potential modern day Babel of confusion.

4. Always Source Control. Always.

Always. Even if you are a single-person team on a one-week project. Even if it's a "throw-away" prototype. Even if the stuff you're working on isn't source code. Make sure that everything is under source code control— documentation, phone number lists, memos to vendors, makefiles, build and release procedures, that little shell script that burns the CD master— everything.

5. Things can Happen

It goes THIS CAN NEVER HAPPEN... "This code won't be used 30 years from now, so two-digit dates are fine." "This application will never be used abroad, so why internationalize it?" "count can't be negative." "This printf can't fail.". Let's not practice this kind of self-deception, particularly when coding.

6. Become a User

There's a simple technique for getting inside your users' requirements that isn't used often enough: become a user. Are you writing a system for the help desk? Spend a couple of days monitoring the phones with an experienced support person. Are you automating a manual stock control system? Work in the warehouse for a week. As well as giving you insight into how the system will really be used, you'd be amazed at how the request "May I sit in for a week while you do your job?" helps build trust and establishes a basis for communication with your users. Just remember not to get in the way!

7. Web Docs over Files

Web-based distribution also avoids the typical two-inch-thick binder entitled Requirements Analysis that no one ever reads and that becomes outdated the instant ink hits paper. If it's on the Web, the programmers may even read it.

8. Quality, quality, quality.

Teams as a whole should not tolerate broken windows— those small imperfections that no one fixes. The team must take responsibility for the quality of the product, supporting developers who understand the no broken windows

Some team methodologies have a quality officer— someone to whom the team delegates the responsibility for the quality of the deliverable. This is clearly ridiculous: quality can come only from the individual contributions of all team members.

9. Marketing the Project

There is a simple marketing trick that helps teams communicate as one: generate a brand. When you start a project, come up with a name for it, ideally something off-the-wall.

10. Manual Ensures Errors

A great way to ensure both consistency and accuracy is to automate everything the team does.

Reference

Hunt, Andrew; Thomas, David. The Pragmatic Programmer: From Journeyman to Master. Pearson Education. Kindle Edition.

6 Take-Aways after Reading "The Signal and The Noise"

2017-11-11T19:07:00+01:00

The Signal and The Noise by Nate Silver is a must-read book for those interested in predictions. It is not a technical book. You will not learn any algorithm. However, it presents a series of real-world scenarios where predictions worked and where they did not. The book is well written and full of valuable references that support its arguments.

1. Anyone can beat an index fund

After all, any investor can do as well as the average investor with almost no effort. All he needs to do is buy an index fund that tracks the average of the S&P500. In so doing he will come extremely close to replicating the average portfolio of every other trader, from Harvard MBAs to noise traders to George Soros' hedge fund manager. You have to be really good -or foolhardy- to turn that proposition down.

2. Bayesian statistics is less wrong

Recently, however, some well-respected statisticians have begun to argue that frequentist statistics should no longer be taught to undergraduates.

Frequentist statistics emphasizes the purity of the experiment: every hypothesis could be tested to a perfect conclusion if only enough data were collected. These methods don't encourage us to think about the plausibility of our hypothesis.

3. A bug made Deep Blue beat Kasparov

But what had inspired Kasparov to commit this mistake? His anxiety over Deep Blue's forty-fourth move in the first game - the move in which the computer had moved its rook for no apparent purpose. Kasparov had concluded that the counterintuitive play must be a sign of superior intelligence. He had never considered that it was simply a bug.

4. When predictions work - Weather

Weather predictions do not rely on statistics or on machine learning. They employ heavy simulations. The earth is split into cells, and the meteorological dynamics are simulated via well-known models. The first weather simulation was attempted by the English physicist Lewis Fry Richardson in 1916.

5. When predictions don't work - Earthquakes

These processes may not literally be random, but they are so irreducibly complex (right down the last grain of sand) that it just won't be possible to predict them beyond a certain level.

6. When predictions don't work - Economics

Raw data for economics isn't much good.

"Why do people [economists ed.] not give intervals? Because they're embarrassed"

They are embarrassed because they are just too large.

My Talk about Superset [Python Milano Meetup]

2017-06-22T17:56:00+02:00

Yesterday, I gave a talk at the Python Milano Meetup. The Meetup was designed as Python pills: three 20-minute talks in a row. The talks:

Superset: data visualization at AirBnB - Marco Santoni
Java Vs Python - Cesare Placanica
pdb in action - Lorenzo Mele

Very nice talk of @Airbnb #Superset with @MrSantoni at #PythonMilano. I see juicy applications for us #BIM guys. https://t.co/Pf1r9nhNEd
— Chiara Rizzarda (@CrShelidon) June 21, 2017

I presented Superset, the open source project by AirBnB. It is a data visualization platform developed in Python that lets you create interactive dashboards. The setup time is extremely short. It is interesting for enterprises because the package features deep and granular authorization policies. The dashboards can be designed by business users too: you can design dashboards without writing SQL queries (but there's still the option to write SQL, of course). Superset can integrate with most SQL databases thanks to the SQLAlchemy query layer. Furthermore, the druid.io database is supported. I presented atacmonitor as an example of a Superset application.

Manufacturing. When data is not a commodity

2017-02-25T17:56:00+01:00

What does it mean to work as a data scientist in manufacturing? What is the value behind data? Data science has gained popularity in domains like internet, but the industrial production domain has specific requirements.

I gave a talk at Data Driven Innovation about the specific challenges when doing data science in manufacturing. I introduced the approach to data science that we are deploying at Brembo. The talk was part of a track dedicated to Industry 4.0 and to IoT.

Weighted Random Sampling with PostgreSQL [Follow-up]

2017-02-10T21:00:00+01:00

I received valuable feedback from Jim Nasby regarding the post about weighted random sampling with PostgreSQL. I report Jim's email here.

Sadly, Common Table Expressions (CTE)s are insanely expensive, because each one must be fully materialized. So in your example, you're essentially creating 5 temp tables (one for each CTE). Obviously that's not a big deal with only 4 weights and 1000 samples, but for other use cases that overhead could really add up. Note that this is not the same as the OFFSET 0 trick... You can get a similar breakdown of code by using subselects in FROM clauses. That would look something like:

SELECT color
   FROM (<samples code>) AS samples
   JOIN (
     SELECT <cumulative_bounds SELECT>
       FROM (
         SELECT <sampling_cumulative_prob SELECT>
           FROM (....)
        ) AS sampling_cumulative_prob
     ) AS cumulative_bounds ON ...

Not as nice as WITH, but not horrible. You can also create temporary views for each of the intermediate steps.

In weights_with_sum, you can get rid of the join in favor of sum(weight) OVER() AS weight_sum.

Finally, random() produces 0.0 <= x < 1.0, so the bounds on the numrange should be '[)', not '(]'. Personally, I would just create the numrange immediately in cumulative_bounds, but that's mostly just a matter of style.

BTW, if you've got plpythonu loaded there's probably an easier way to generate the set of ranges, which could then be joined to the random samples.

BTW, width_bucket(operand anyelement, thresholds anyarray) (see second instance on docs) might be even faster; it'd definitely be simpler:

-- + 1: width_bucket returns 0 below the first threshold, and arrays are 1-based
SELECT colors[width_bucket(random(), thresholds) + 1]
   FROM generate_series(1,1000)
     , (
       SELECT array_agg(color) AS colors
           , array_agg(cum_prob) AS thresholds
         FROM sampling_cumulative_prob
     ) AS prob;

Monitoring Bus Frequencies in Rome

2017-01-21T18:00:00+01:00

I have just launched atacmonitor. It is a website providing information about the waiting time at bus stops in Rome.

Overview

The data source is live data about bus waiting times from ATAC, Rome's public transport company. The transport office provides a public API with real-time data.

I have implemented a simple application that regularly pulls this data and stores it in a PostgreSQL database. The data is presented via AirBnB's Superset, an open source visualization platform. I deployed the application on the Heroku PaaS.

I have just kicked off the project, and only a few bus stops are being monitored. The goal is to have all bus stops monitored soon.

Blog Migrated to Pelican on GitHub Pages

2016-12-28T15:38:00+01:00

I have migrated my blog. It is built with Pelican, a static site generator. It allows me to write posts as plain markdown or even Jupyter notebooks. I then use GitHub Pages to version and publish the blog. I am continuing to use Aruba as domain provider. It is sufficient to update the CNAME and ANAME records to serve the blog under the marcosantoni.com domain.

The migration from Wordpress to Pelican was sped up by the pelican-import plugin. This blog post is a good reference for deploying a Pelican blog on GitHub Pages.

Insights from IEEE Big Data 16

2016-12-26T16:22:00+01:00

I have attended the IEEE Big Data 16 conference in Washington DC. I thank my company for sponsoring the trip. The conference included a special symposium dedicated to manufacturing. The symposium hosted some participants of the Bosch Production Line Performance competition from Kaggle.

2016 IEEE International Conference on Big Data kicked off today in Washington, DC. Share highlights w/ hashtag #IEEEBigData16 & we’ll RT!
— IEEE Big Data (@ieeebigdata) December 5, 2016

I'll list here a few notes I took during the conference.

Streaming Processing. I heard about the most popular architectures nowadays, and I highly recommend reading the blog posts by the authors of such architectures:
- Lambda architecture
- Kappa architecture
K-Spectral Centroid. The K-Spectral Centroid algorithm clusters time series by their shape, and finds the most representative shape (the cluster centroid) for each cluster.
K-D Tree partition: an algorithm for space partitioning.
Database Decay. Interesting keynote by Michael Stonebraker. In short, large applications often share a centralized database used by different groups of a company. The DBA point of view:
- High Risk. When changing a DB schema, I need to find applications all around the company and update them accordingly (do I have budget for that?).
- Low Risk. No change in schema, I do a workaround in data.
- Claim. DBAs want to lower the risk. --> no change in schema --> ER diagram diverges from reality --> database decay.
- At some point, a total rewrite is the only way forward.
- If you work in analytics getting data from operational DB, you realize data is getting more and more dirty.
PMML Scoring Engine. Max Ferguson introduced the Predictive Model Markup Language (PMML). Basically, if you train a model and want to use it in a different application, PMML is a standard that defines how models should be stored as XML.
Uncertainty in RFs. Random Forests can express uncertainty. One just needs to look at the distribution of predictions among the decision trees of the model.
Bosch. Rumi Ghosh introduced the data science team at Bosch.
- Insight from production plants: plant managers prefer interpretable models (logistic regression or decision tree) over black box models.
- Research directions:
- Root cause analysis (via Bayesian inference)
- Class imbalance
3 Approaches in Kaggle Competition. Bohdan Pavlyshenko gave a talk on the three approaches he explored during the Kaggle competition about failure detection:
- Pure machine learning approach. Two levels of model ensembling, a pure black box.
- Generalized Linear Model with Lasso regularization. Informative about feature impact.
- Bayesian model in BUGS. It yields an estimate of the probability distribution of each coefficient.
FTRL. Follow the regularized leader: a feature engineering method used to convert all categorical features into one numerical feature.
CRF. Conditional Random Fields is a class of predictive models used when the dataset is represented as a graph. Each node is a sample with a vector X and a target variable y.

Weighted Random Sampling with PostgreSQL

2016-08-23T16:22:00+02:00

You have a table like the following:

CREATE TABLE weights (
color varchar primary key,
weight float
);

INSERT INTO weights (color, weight)
VALUES
('red', 8),
('blue', 3),
('green', 10),
('yellow', 10);

The table lists the weights associated with certain colors. Imagine a weight representing how much you like that color.

Now, you want to add 1000 colored tiles to your website. You want the color of the tiles to be sampled at random according to the weights table.

We'll write a PostgreSQL script that implements such random sampling. I'll write the entire query first, and then explain each part separately.

CREATE TABLE sampled_colors AS
WITH weights_with_sum AS (
SELECT
color,
weight,
weight_sum
FROM weights
CROSS JOIN (SELECT sum(weight) AS weight_sum FROM weights) s
),
sampling_probability AS (
SELECT
color,
weight / weight_sum AS prob
FROM weights_with_sum
),
sampling_cumulative_prob AS (
SELECT
color,
sum(prob) OVER (order by color) AS cum_prob
FROM sampling_probability
),
cumulative_bounds AS (
SELECT
color,
COALESCE(
lag(cum_prob) OVER (ORDER BY cum_prob, color),
0
) AS lower_cum_bound,
cum_prob AS upper_cum_bound
FROM sampling_cumulative_prob
),
samples AS (
SELECT
generate_series(1, 1000) AS sample_idx,
random() AS sample
)
SELECT
color
FROM samples
JOIN cumulative_bounds ON
sample::numeric <@ numrange(lower_cum_bound::numeric,
upper_cum_bound::numeric, '(]');

Let's look at one piece at a time.

WITH weights_with_sum AS (
SELECT
color,
weight,
weight_sum
FROM weights
CROSS JOIN (SELECT sum(weight) AS weight_sum FROM weights) s
),
sampling_probability AS (
SELECT
color,
weight / weight_sum AS prob
FROM weights_with_sum
)
SELECT *
FROM sampling_probability;
-- output:
color | prob
--------+--------------------
red | 0.258064516129032
blue | 0.0967741935483871
green | 0.32258064516129
yellow | 0.32258064516129

Here, we're just normalizing the weights. Each weight is divided by the total sum of the weights. In this way, we are re-writing each weight as a discrete probability of that color being sampled.

...
sampling_cumulative_prob AS (
SELECT
color,
sum(prob) OVER (order by color) AS cum_prob
FROM sampling_probability
),
cumulative_bounds AS (
SELECT
color,
COALESCE(
lag(cum_prob) OVER (ORDER BY cum_prob, color),
0
) AS lower_cum_bound,
cum_prob AS upper_cum_bound
FROM sampling_cumulative_prob
)
SELECT *
FROM cumulative_bounds;
-- output:
color | lower_cum_bound | upper_cum_bound
--------+--------------------+--------------------
blue | 0 | 0.0967741935483871
green | 0.0967741935483871 | 0.419354838709677
red | 0.419354838709677 | 0.67741935483871
yellow | 0.67741935483871 | 1

In this piece of code, we are representing the weights as a cumulative distribution function.

...
samples AS (
SELECT
generate_series(1, 1000) AS sample_idx,
random() AS sample
)
SELECT
color
FROM samples
JOIN cumulative_bounds ON
sample::numeric <@ numrange(lower_cum_bound::numeric,
upper_cum_bound::numeric, '(]');

In the last part, we sample a random number between 0 and 1 a total of 1000 times. We then assign each sample to the corresponding color based on the values of the cumulative function. For example, if the first sample is 0.45, it will match the 'red' range (0.42 to 0.68). Therefore, that sample will be 'red'.
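The same idea can be sketched outside the database. The snippet below is only an illustration in plain Python, not a translation of the query: the function name sample_colors and the fixed seed are my own choices. It builds the cumulative upper bounds and uses bisect to find the interval each random draw falls into, treating the intervals as half-open because random() never returns 1.

```python
import bisect
import itertools
import random

# Weights copied from the weights table above.
weights = {"red": 8, "blue": 3, "green": 10, "yellow": 10}


def sample_colors(weights, n, rng):
    """Draw n colors with probability proportional to their weights."""
    colors = sorted(weights)  # same ordering as ORDER BY color
    total = sum(weights.values())
    # Upper cumulative bound of each color, e.g. blue -> 3/31, green -> 13/31.
    upper_bounds = list(itertools.accumulate(weights[c] / total for c in colors))
    upper_bounds[-1] = 1.0  # guard against floating-point round-off
    # rng.random() lies in [0, 1); bisect_right picks the first bucket whose
    # upper bound is strictly greater than the draw (half-open intervals).
    return [colors[bisect.bisect_right(upper_bounds, rng.random())] for _ in range(n)]


rng = random.Random(42)  # fixed seed, only to make the sketch reproducible
samples = sample_colors(weights, 1000, rng)
print({color: samples.count(color) for color in sorted(weights)})
```

A draw of 0.45 lands in the third interval, which is 'red', as in the example above.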

The result of the query is a table filled with 1000 colors sampled at random based on the weights.

SELECT *
FROM sampled_colors
LIMIT 10;
-- output:
color
--------
green
green
red
yellow
yellow
green
blue
red
red
red

Can we check that the result is correct? Were the weights really taken into account?

SELECT
color,
count(*)
FROM sampled_colors
GROUP BY 1;
-- output:
color | count
--------+-------
yellow | 309
green | 320
red | 276
blue | 95

The proportion of samples is quite close to the proportion of the weights. This similarity is clear if we compare this table with the discrete probability table above.
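To make "quite close" a bit more concrete, one option is a Pearson chi-square check of the observed counts against the counts expected from the weights. This is a sketch I am adding for illustration; the helper name chi_square_statistic is hypothetical, and the critical value is the standard 5% threshold for 3 degrees of freedom (four colors minus one).

```python
# Weights from the weights table and counts from the GROUP BY output above.
weights = {"red": 8, "blue": 3, "green": 10, "yellow": 10}
observed = {"yellow": 309, "green": 320, "red": 276, "blue": 95}


def chi_square_statistic(weights, observed):
    """Pearson chi-square statistic of observed counts against the weights."""
    total_weight = sum(weights.values())
    n = sum(observed.values())
    statistic = 0.0
    for color, weight in weights.items():
        expected = n * weight / total_weight
        statistic += (observed[color] - expected) ** 2 / expected
    return statistic


# 5% critical value of the chi-square distribution with 3 degrees of freedom.
CRITICAL_VALUE_DF3 = 7.815

statistic = chi_square_statistic(weights, observed)
print("chi-square statistic: %.2f" % statistic)
print("consistent with the weights:", statistic < CRITICAL_VALUE_DF3)
```

The statistic comes out at about 1.87, well below the critical value, so the sampled counts give no reason to doubt that the weights were respected.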

Applied Bayesian Inference with PyMC [video]

2016-06-30T17:03:00+02:00

I was glad to give an intro to Bayesian Inference at PyData Florence 2016. The video of the talk is now out.

A Simple Machine Learning Pipeline

2016-06-19T10:37:00+02:00

This post contains the code that I used in my talk at Python Milano Meetup on June 22nd 2016. The talk was a quick overview of Pipeline, a nice API by scikit-learn that chains the steps of your machine learning algorithm into a single estimator. It is based on the Boston Housing Data Set.

We'll just load the data set from sklearn.

from sklearn.datasets import load_boston
housing_data = load_boston()
print(housing_data.DESCR)

We might want to make it a Pandas dataframe to make things easier to handle.

import pandas as pd
df = pd.DataFrame(housing_data.data)
df.columns = housing_data.feature_names
df['PRICE'] = housing_data.target
df.head()

The goal is to predict the PRICE variable given the other features. How is this variable distributed?

import matplotlib.pyplot as plt
df.PRICE.hist()
plt.xlabel('PRICE')

Let's turn the dataframe into a ML-friendly notation.

X = df.drop('PRICE', axis=1)
y = df['PRICE']

We will now define the metric that assesses the accuracy of our algorithm/pipeline. Let's use the good old cross validation.

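from sklearn import cross_validation

def evaluate_model(X, y, algorithm):
    # cross-validated mean squared error of the given model or pipeline
    print('Mean Squared Error')
    scores = cross_validation.cross_val_score(algorithm, X, y,
                                              scoring='mean_squared_error')
    # the scorer returns negated errors, so flip the sign before printing
    print(-scores)
    print('Accuracy: %0.2f' % -scores.mean())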
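To show what a cross-validated score is made of, here is a minimal sketch in plain Python. It is not scikit-learn's implementation: k_fold_mse, fit_mean and predict_mean are hypothetical helpers, the data is made up, and I am assuming contiguous folds and three folds (the outputs in this post list three scores).

```python
def k_fold_mse(xs, ys, fit, predict, k=3):
    """Return one mean squared error per fold, using contiguous folds."""
    n = len(xs)
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    scores = []
    start = 0
    for size in fold_sizes:
        stop = start + size
        # Hold out [start, stop) and train on everything else.
        train_x = xs[:start] + xs[stop:]
        train_y = ys[:start] + ys[stop:]
        model = fit(train_x, train_y)
        errors = [
            (predict(model, x) - y) ** 2
            for x, y in zip(xs[start:stop], ys[start:stop])
        ]
        scores.append(sum(errors) / len(errors))
        start = stop
    return scores


# Hypothetical baseline: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


# Toy data, made up for illustration (not the Boston data set).
rooms = [4.0, 5.0, 5.5, 6.0, 6.5, 7.0]
prices = [12.0, 17.0, 20.0, 22.0, 26.0, 31.0]

scores = k_fold_mse(rooms, prices, fit_mean, predict_mean, k=3)
print(scores)
print("Mean MSE: %0.2f" % (sum(scores) / len(scores)))
```

Each fold is scored by a model that never saw it during training, which is why the averaged error is a fairer estimate than the error on the training data.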

So, now, we can try a bunch of algorithms and see which one works best by calling evaluate_model. It is now time to implement a first algorithm. So, let's explore the data set a bit. Is there any pattern we can exploit?

plt.figure(figsize=(10,7))
plt.scatter(df['RM'], y)
plt.xlabel('Average number of rooms')
plt.ylabel('Housing price in \$1000\'s')
plt.show()

As expected, there is a relation between the average number of rooms and the median price. So, let's build the first algorithm.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LinearRegression

def just_RM_column(X):
    # RM (average number of rooms) is the sixth column of the data set
    RM_col_index = 5
    return X[:, [RM_col_index]]

pipe = make_pipeline(
    FunctionTransformer(just_RM_column),
    LinearRegression()
)

How well does it perform?

evaluate_model(X, y, pipe)
'''
Mean Squared Error
[43.19492771 41.72813479 46.89293772]
Accuracy: 43.94
'''

Can we visualize what the pipeline is actually doing?

import numpy as np

def plot_model_RM(X, y, pipe):
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
        X,
        y,
        test_size=0.33,
        random_state=5
    )
    pipe.fit(X_train, y_train)
    # replace RM with an evenly spaced grid so the fitted curve can be drawn
    fake_X_train = np.array(X_train)
    fake_X_train[:, 5] = np.linspace(min(fake_X_train[:, 5]),
                                     max(fake_X_train[:, 5]),
                                     num=len(fake_X_train[:, 5]))
    fake_X_test = np.array(X_test)
    fake_X_test[:, 5] = np.linspace(min(fake_X_test[:, 5]),
                                    max(fake_X_test[:, 5]),
                                    num=len(fake_X_test[:, 5]))
    plt.figure(figsize=(20,7))
    # left panel: training data and fitted curve
    plt.subplot(1, 2, 1)
    plt.scatter(X_train['RM'], y_train)
    plt.scatter(fake_X_train[:, 5], pipe.predict(fake_X_train),
                color='r')
    plt.xlabel('Average number of rooms')
    plt.ylabel('Housing price in \$1000\'s')
    plt.title('Train Data Set')
    # right panel: test data and fitted curve
    plt.subplot(1, 2, 2)
    plt.scatter(X_test['RM'], y_test)
    plt.scatter(fake_X_test[:, 5], pipe.predict(fake_X_test),
                color='r')
    plt.xlabel('Average number of rooms')
    plt.ylabel('Housing price in \$1000\'s')
    plt.title('Test Data Set')
    plt.show()

plot_model_RM(X, y, pipe)

We now do a bit of feature engineering. We square the features.

def add_squared_col(X):
    # append the square of every column to the original columns
    return np.hstack((X, X**2))

pipe = make_pipeline(
    FunctionTransformer(just_RM_column),
    FunctionTransformer(add_squared_col),
    LinearRegression()
)

We evaluate this other pipeline.

evaluate_model(X, y, pipe)
'''
Mean Squared Error
[ 40.31207562 36.75642688 40.75444834]
Accuracy: 39.27'''

And we see how the algorithm is fitting the data set.

plot_model_RM(X, y, pipe)

We now try a different model like a decision tree.

from sklearn.tree import DecisionTreeRegressor

pipe = make_pipeline(
FunctionTransformer(just_RM_column),
FunctionTransformer(add_squared_col),
DecisionTreeRegressor(max_depth=3)
)
evaluate_model(X, y, pipe)
'''
Mean Squared Error
[ 57.28366371 61.5437311 84.32756118]
Accuracy: 67.72
'''
plot_model_RM(X, y, pipe)

We now explore a second feature: INDUS.

plt.figure(figsize=(10,7))
plt.scatter(df['INDUS'], y)
plt.xlabel('Proportion of non-retail business acres (INDUS)')
plt.ylabel('Housing price in \$1000\'s')
plt.show()

So, we see another relation between INDUS and PRICE. So, let's add this second feature.

def RM_and_INDUS_cols(X):
    # keep only the RM and INDUS columns
    RM_col_index = 5
    INDUS_col_index = 2
    return X[:, [RM_col_index, INDUS_col_index]]

pipe = make_pipeline(
    FunctionTransformer(RM_and_INDUS_cols),
    FunctionTransformer(add_squared_col),
    LinearRegression()
)

evaluate_model(X, y, pipe)
'''
Mean Squared Error
[ 32.3420789 31.4260901 35.95835866]
Accuracy: 33.24
'''

Now, plotting a model in 3D needs a bit more effort.

import mpl_toolkits.mplot3d.axes3d as p3

def plot_model_RM_INDUS(X, y, pipe):
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
        X,
        y,
        test_size=0.33,
        random_state=5
    )
    pipe.fit(X_train, y_train)
    X_test = np.array(X_test)
    fig = plt.figure(figsize=(10,7))
    ax = p3.Axes3D(fig)
    # scatter the test points: INDUS on x, RM on y, price on z
    x = X_test[:, 2]
    y = X_test[:, 5]
    z = y_test
    ax.scatter(x, y, z, c='r', marker='o')
    # evaluate the fitted model on a 100 x 100 grid to draw the surface
    x = np.arange(min(x), max(x), (max(x) - min(x)) / 100.0)
    y = np.arange(min(y), max(y), (max(y) - min(y)) / 100.0)
    X, Y = np.meshgrid(x, y)
    Z = np.zeros(X.shape)
    # one fake row with as many columns as the data; only INDUS and RM are set
    fake_X = np.zeros((1, X_test.shape[1]))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            fake_X[0, 2] = X[i, j]
            fake_X[0, 5] = Y[i, j]
            Z[i, j] = pipe.predict(fake_X)[0]
    ax.plot_surface(X, Y, Z, alpha=0.2)
    ax.set_xlabel('INDUS')
    ax.set_ylabel('RM')
    ax.set_zlabel('Price')

plot_model_RM_INDUS(X, y, pipe)

How pretty is that?

The next step is to use all the features available. So, we move to a 13-dimensional feature vector.

pipe = make_pipeline(
LinearRegression()
)
evaluate_model(X, y, pipe)
'''
Mean Squared Error
[ 20.50009513 22.42870192 27.88911654]
Accuracy: 23.61'''

The error dropped considerably. We cannot, however, plot the model in 13 dimensions. We will now re-use the function that adds the squared features.

pipe = make_pipeline(
FunctionTransformer(add_squared_col),
LinearRegression()
)
evaluate_model(X, y, pipe)
'''
Mean Squared Error
[ 16.7819682 14.599869 18.17785453]
Accuracy: 16.52'''

Even better. Now, we will switch to a ridge regressor (combined with a standardization of the features).

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

pipe = make_pipeline(
StandardScaler(),
FunctionTransformer(add_squared_col),
Ridge(alpha=3)
)
evaluate_model(X, y, pipe)
'''
Mean Squared Error
[ 16.4292824 14.50522561 18.27167008]
Accuracy: 16.40'''

Install a .deb file from terminal on Ubuntu

2016-05-23T08:18:00+02:00

I use Ubuntu 16.04. Sometimes, when I double-click a .deb file, the installation program does not work. What often solves the problem is installing it from the terminal.

sudo dpkg -i my_deb_file.deb
sudo apt-get -f install

Insights from Data Science Milan - 19/05/16

2016-05-20T17:56:00+02:00

#DeepLearning introduction and enterprise architectures using #H2O - first #DataScienceMilan meetup! - https://t.co/I8LsfaFJSu
— Andrea Scarso (@andreaesseci) May 18, 2016

A new Data Science meetup is out in Milan. Two talks about Deep Learning were given in the first event.

Neural Networks and Deep Learning: An Introduction. @MilanHighTech. The first talk by Valentino Zocca was a quick intro to Deep Learning. The speaker was able to explain the role of the additional layers in a neural network. Each layer is learning something, and each one is learning a different representation of the output. In particular, each additional layer is learning a more abstract representation of the output.

Each layer is learning a higher level of abstraction. In the example, the first layer is learning the edges in the image; the second layer is learning the parts of a face like the nose or the eye; the third layer is learning large sections of a face. Ref: "Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations", Lee et al.

Bringing Deep Learning into production. @axlpado. The speaker gave his point of view on deploying machine learning algorithms in production. There are a variety of frameworks, and it's not always easy to choose which one to adopt. He gave a series of interesting tips, and I'll write the main ones here.

You can write machine learning in many languages such as Python, Java, R, Matlab, Scala, etc. A good guideline is: choose the one you know the most. Do not add the complexity of learning a new language to the complexity of designing the algorithm.

Different languages in different teams.

It can be a challenge to bring machine learning models from one team to another. The reason is that teams often work in different languages or in different frameworks. This organization leads to complex deployment processes.

Paolo recommended having the entire team on the same framework. The idea is to make the deployment pipeline as smooth as possible. It can be an effort for the data scientists at the beginning to learn the data engineering tools, but it can make the difference in the long term.

Bayesian A/B Testing in Python

2016-05-15T15:33:00+02:00

Imagine you are redesigning your e-commerce website. You have to decide whether the "Buy Item" button should be blue or green. You decide to set up an A/B test, so you build two versions of the item page:

Page A which has a blue button;
Page B which has a green button.

Pages A and B are identical except for the color of the button. You want to quantify the likelihood of a user clicking the "Buy Item" button when she is on page A or on page B. So, you start the experiment by sending each user either to page A or to page B. Each time, you monitor whether she clicked "Buy Item" or not.

Frequentist vs Bayesian

One could simply approximate the effectiveness of each page by computing the success rate on the two pages. E.g. if N=1000 users visited page A, and 50 of them clicked the button, one could say that the likelihood of clicking the button on page A is 50/1000 = 5%. This is the so-called Frequentist approach, which envisions probability in terms of event frequency. However, the following issues might arise on a daily basis:

What if N is small (e.g. N=50)? Can we still be confident by just computing the success rate?
What if N is different between page A and page B? Let's say that 500 users visited page A and 2000 users visited page B. How can we compare such imbalanced experiments?
How large should N be to achieve a 90% confidence in my estimates?

We'll now introduce a simple Bayesian solution that lets us run the A/B test and handle the issues listed above. The code makes use of the PyMC package, and it was inspired by reading "Bayesian Methods for Hackers" by Cameron Davidson-Pilon.

Evaluate Page A

We'll first show how to evaluate the success rate on page A with a Bayesian approach. The goal is to infer the probability of clicking the "Buy Item" button on page A. We model this probability as a Bernoulli distribution with parameter $p_A$:

$$P(\text{click} \mid \text{page}=A) = \begin{cases} p_A & \text{click}=1 \\ 1-p_A & \text{click}=0 \end{cases}$$

So, $p_A$ is the parameter indicating the probability of clicking the button on page A. This parameter is unknown and the goal of the experiment is to infer it.

from pymc import Uniform, rbernoulli, Bernoulli, MCMC
from matplotlib import pyplot as plt
import numpy as np

# true value of p_A (unknown)
p_A_true = 0.05
# number of users visiting page A
N = 1500
occurrences = rbernoulli(p_A_true, N)

print('Click-BUY:')
print(occurrences.sum())
print('Observed frequency:')
print(occurrences.sum() / float(N))

In this code, we are simulating a realisation of the experiment where 1500 users visited page A. Here, occurrences holds one outcome per visitor, and its sum is the number of visitors who actually clicked on the button in this realisation.

The next step consists of defining our prior on the $p_A$ parameter. The prior definition is the first step of Bayesian inference and is a way to indicate our prior belief in the variable.

p_A = Uniform('p_A', lower=0, upper=1)
obs = Bernoulli('obs', p_A, value=occurrences, observed=True)

In this section, we define the prior of $p_A$ to be a uniform distribution. The obs variable indicates the Bernoulli distribution representing the observations of the click events (indeed governed by the $p_A$ parameter). The two variables are assigned to Uniform and Bernoulli, which are stochastic variable objects that are part of PyMC. Each variable is associated with a string name (p_A and obs in this case). The obs variable has the value and the observed parameters set because we have observed the realisations of the experiment.

# defining a Markov chain Monte Carlo (MCMC) model
mcmc = MCMC([p_A, obs])
# drawing 20k samples and discarding the first 1000 as burn-in
mcmc.sample(20000, 1000)
# the samples of the posterior distribution are available via the trace method
print(mcmc.trace('p_A')[:])

In this section, the MCMC model is initialised, and the variables p_A and obs are given to it as input. The sample method runs the Monte Carlo simulation and combines the observed data with the prior belief. The posterior distribution is accessible via the trace method as an array of realisations. We can now visualise the result of the inference.
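For intuition about what such a sampler does, here is a minimal random-walk Metropolis sketch in plain Python. It is an illustration under my own assumptions, not PyMC's implementation: the step size, the starting point and the data (75 clicks out of 1500 visitors) are made up.

```python
import math
import random


def log_posterior(p, clicks, visitors):
    """Log of Uniform(0, 1) prior times Bernoulli likelihood, up to a constant."""
    if not 0.0 < p < 1.0:
        return -math.inf  # outside the support of the uniform prior
    return clicks * math.log(p) + (visitors - clicks) * math.log(1.0 - p)


def metropolis(clicks, visitors, n_iter, burn, step=0.01, seed=0):
    """Random-walk Metropolis sampler for the click probability."""
    rng = random.Random(seed)
    current = 0.5  # arbitrary starting point
    current_lp = log_posterior(current, clicks, visitors)
    trace = []
    for i in range(n_iter):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, clicks, visitors)
        # Accept with probability min(1, posterior ratio).
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        if i >= burn:
            trace.append(current)  # keep only post burn-in samples
    return trace


# Hypothetical data: 75 clicks out of 1500 visitors.
trace = metropolis(clicks=75, visitors=1500, n_iter=20000, burn=1000)
print(len(trace), sum(trace) / len(trace))
```

The burn-in samples are thrown away because the chain starts far from the region where the posterior has most of its mass.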

plt.figure(figsize=(8, 7))
plt.hist(mcmc.trace('p_A')[:], bins=35, histtype='stepfilled',
normed=True)
plt.xlabel('Probability of clicking BUY')
plt.ylabel('Density')
plt.vlines(p_A_true, 0, 90, linestyle='--', label='True p_A')
plt.legend()
plt.show()

Then, we might want to answer the question: where am I 90% confident that the true $p_A$ lies? That's easy to answer.

p_A_samples = mcmc.trace('p_A')[:]
# the 5th and 95th percentiles delimit the central 90% of the posterior
lower_bound = np.percentile(p_A_samples, 5)
upper_bound = np.percentile(p_A_samples, 95)
print('There is 90%% probability that p_A is between %s and %s' %
      (lower_bound, upper_bound))
# There is 90% probability that p_A is between 0.0373019596856 and 0.0548052806892
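As a sanity check on this kind of interval, the model is simple enough to have a closed-form posterior: a uniform prior is a Beta(1, 1), and after observing the clicks the posterior is a Beta(1 + clicks, 1 + non-clicks). The sketch below draws from that Beta with the standard library. The helper name and the counts (75 clicks out of 1500 visitors) are hypothetical, since the simulated counts above change at every run.

```python
import random


def beta_credible_interval(clicks, visitors, level=0.90, draws=20000, seed=0):
    """Equal-tailed credible interval from draws of the Beta posterior."""
    rng = random.Random(seed)
    # Uniform prior = Beta(1, 1); the posterior is Beta(1 + clicks, 1 + misses).
    samples = sorted(
        rng.betavariate(1 + clicks, 1 + visitors - clicks) for _ in range(draws)
    )
    tail = (1.0 - level) / 2.0
    lower = samples[int(tail * draws)]
    upper = samples[int((1.0 - tail) * draws) - 1]
    return lower, upper


# Hypothetical data: 75 clicks out of 1500 visitors.
lower, upper = beta_credible_interval(clicks=75, visitors=1500)
print("90%% credible interval: %.4f to %.4f" % (lower, upper))
```

If the MCMC interval and the closed-form interval disagree noticeably on the same counts, the chain probably needs more samples or a longer burn-in.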

Comparing Page A and Page B

We'll now repeat what we have done for page A, and we add a new variable delta indicating the difference between $p_A$ and $p_B$.

from pymc import Uniform, rbernoulli, Bernoulli, MCMC, deterministic
from matplotlib import pyplot as plt

p_A_true = 0.05
p_B_true = 0.04
N_A = 1500
N_B = 750

occurrences_A = rbernoulli(p_A_true, N_A)
occurrences_B = rbernoulli(p_B_true, N_B)

print('Observed frequency:')
print('A')
print(occurrences_A.sum() / float(N_A))
print('B')
print(occurrences_B.sum() / float(N_B))

p_A = Uniform('p_A', lower=0, upper=1)
p_B = Uniform('p_B', lower=0, upper=1)

@deterministic
def delta(p_A=p_A, p_B=p_B):
    # difference between the two click probabilities
    return p_A - p_B

obs_A = Bernoulli('obs_A', p_A, value=occurrences_A, observed=True)
obs_B = Bernoulli('obs_B', p_B, value=occurrences_B, observed=True)

mcmc = MCMC([p_A, p_B, obs_A, obs_B, delta])
mcmc.sample(25000, 5000)

p_A_samples = mcmc.trace('p_A')[:]
p_B_samples = mcmc.trace('p_B')[:]
delta_samples = mcmc.trace('delta')[:]

# posterior of p_A
plt.subplot(3,1,1)
plt.xlim(0, 0.1)
plt.hist(p_A_samples, bins=35, histtype='stepfilled', normed=True,
         color='blue', label='Posterior of p_A')
plt.vlines(p_A_true, 0, 90, linestyle='--',
           label='True p_A (unknown)')
plt.xlabel('Probability of clicking BUY via A')
plt.legend()
# posterior of p_B
plt.subplot(3,1,2)
plt.xlim(0, 0.1)
plt.hist(p_B_samples, bins=35, histtype='stepfilled', normed=True,
         color='green', label='Posterior of p_B')
plt.vlines(p_B_true, 0, 90, linestyle='--',
           label='True p_B (unknown)')
plt.xlabel('Probability of clicking BUY via B')
plt.legend()
# posterior of delta = p_A - p_B
plt.subplot(3,1,3)
plt.xlim(0, 0.1)
plt.hist(delta_samples, bins=35, histtype='stepfilled', normed=True,
         color='red', label='Posterior of delta')
plt.vlines(p_A_true - p_B_true, 0, 90, linestyle='--',
           label='True delta (unknown)')
plt.xlabel('p_A - p_B')
plt.legend()
plt.show()


Then, we can answer a question like: what is the probability that $p_A > p_B$?

print('Probability that p_A > p_B:')
print((delta_samples > 0).mean())
# Probability that p_A > p_B:
# 0.8919
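
The same question can be approximated without PyMC. The sketch below assumes Uniform(0, 1) priors, so that each posterior is a Beta distribution we can draw from directly; the click counts are hypothetical and only illustrate the mechanics.

```python
import random

# Hypothetical observed counts (not taken from the run above).
clicks_A, visitors_A = 75, 1500
clicks_B, visitors_B = 30, 750

rng = random.Random(42)  # fixed seed so the run is reproducible
n_draws = 20000

def posterior_draw(clicks, visitors):
    # Uniform prior + Bernoulli data gives a Beta(1 + clicks, 1 + misses) posterior.
    return rng.betavariate(1 + clicks, 1 + visitors - clicks)

# One draw of delta = p_A - p_B per iteration.
delta_draws = [
    posterior_draw(clicks_A, visitors_A) - posterior_draw(clicks_B, visitors_B)
    for _ in range(n_draws)
]

# Fraction of draws in which page A beats page B.
prob_A_better = sum(d > 0 for d in delta_draws) / float(n_draws)
print('Probability that p_A > p_B: %.3f' % prob_A_better)
```

Counting the share of positive draws is exactly what `(delta_samples > 0).mean()` does on the MCMC trace.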

Insights from PyData Florence 16

2016-04-20T06:05:00+02:00

I have just attended the PyData conference in Florence, and I will briefly list some interesting insights.

Oh my... We are already overcrowded @pyconit and it's *just* the beginning!! 🎉🎉 good job guys! 🙌🏻 #pycon7
— (((Valerio Maggio))) (@leriomaggio) April 15, 2016

Time Travel and Time Series Analysis with Pandas and Statsmodels, @hendorf. The talk focused on time series analysis. The speaker pointed out something a data scientist should not forget: the time level of aggregation must be chosen with care. Do you take into account that February has only about 90% of the days of March? If you compare e.g. sales per month, you cannot just ignore this fact. During the talk, I also found out that statsmodels has some nice tools for trend and seasonality analysis.
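
To make the point about month lengths concrete, here is a small sketch of my own (not from the talk) that divides monthly totals by the number of days in each month. The sales figures are made up.

```python
import calendar

# Hypothetical monthly sales totals for 2016 (a leap year).
monthly_sales = {1: 3100.0, 2: 2900.0, 3: 3100.0}

def sales_per_day(year, sales_by_month):
    """Divide each monthly total by the number of days in that month."""
    per_day = {}
    for month, total in sales_by_month.items():
        days_in_month = calendar.monthrange(year, month)[1]
        per_day[month] = total / days_in_month
    return per_day

daily = sales_per_day(2016, monthly_sales)
for month in sorted(daily):
    print('%s: %.1f per day' % (calendar.month_name[month], daily[month]))
```

In this made-up example the February total looks lower than January and March, but the daily rate is the same in all three months.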

Machine learning and IoT for automatic presence detection of workers on fall protection life lines, @stefanoterna. The talk was an excellent overview of how TomorrowData is able to deploy machine learning systems in the "real world". Their system uses neural networks to detect a man walking on industrial cables. It was interesting to hear about the different challenges that one has to consider in the Internet of Things area due to hardware and environmental constraints. The fact that they had to manually annotate the signals coming from an accelerometer reminded me of my work on indoor localization. In this kind of area, data collection is indeed a challenge due to its manual cost (compared to the datasets you can easily collect through a web app).

Introduzione a Orange Data Mining, @ericbonfadini. Eric introduced Orange Data Mining, which is both a Python library and a GUI for machine learning projects. I found the GUI interesting: it lets you define pipelines of jobs to mine data, so you can quickly get insights about data and play around with machine learning models. I see this tool as useful mainly for didactic purposes. It can help teachers explain data mining and machine learning in a graphical way, and it is really suitable for lectures.

"Simple APIs and innovative documentation processes" keynote by @EGouillart now live @PyData @pyconit #pydatait pic.twitter.com/Gt8cxIyafJ
— PyData Italy (@pydatait) April 16, 2016

Simple APIs and innovative documentation processes: looking back at the success of Scientific Python, @EGouillart. The talk presented the point of view of a core developer of a scientific package like scikit-image. The speaker gave nice insights about the API design choices you face when you contribute to open source projects. For example, what is the advantage of getting rid of most classes in your package and mainly exposing functions? The idea is that, if you get rid of the boilerplate of classes, you are forced to accept and return just numpy arrays, which you can then easily integrate with other tools in your pipeline, e.g. scikit-learn. Another thing to take into account is that 54% of the users of packages are running a Windows machine (although the developers of such packages probably aren't). So, you need to take into account the tech gap between the developers and the end users. Finally, the speaker mentioned the power of Sphinx as a documentation tool.

Building Data Pipelines in Python, @marcobonzanini. Luigi is an awesome tool simply because it makes you feel relaxed when you are running a data pipeline. You can programmatically define arbitrary dependencies between tasks, and Luigi will make sure that the dependencies are fulfilled. Marco's talk was a really nice intro to the tool.
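
To give a feel for what "making sure that the dependencies are fulfilled" means, here is a toy sketch in plain Python. It is not Luigi's API: the task names and the `run_order` helper are hypothetical, and it assumes the dependencies contain no cycles.

```python
# Hypothetical pipeline: each task maps to the tasks it depends on.
dependencies = {
    'report': ['aggregate'],
    'aggregate': ['clean_sales', 'clean_customers'],
    'clean_sales': ['download'],
    'clean_customers': ['download'],
    'download': [],
}

def run_order(target, deps):
    """Return the tasks needed for `target`, each listed after its dependencies."""
    ordered = []
    done = set()

    def visit(task):
        if task in done:
            return  # already scheduled, do not run it twice
        for upstream in deps[task]:
            visit(upstream)
        done.add(task)
        ordered.append(task)

    visit(target)
    return ordered

order = run_order('report', dependencies)
print(order)
```

Note that `download` is scheduled only once even though two tasks depend on it, which is the kind of bookkeeping you do not want to do by hand.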

Going Functional in the Python Data Science Stack, @data_hope. The speaker explained the directed acyclic graphs that are behind functional programming. It was interesting to hear about the Dask package and its lazy evaluation model. Dask allows you to abstract your code and perform operations on datasets that do not fit in memory. The speaker pointed out that doing functional programming means decoupling "how" from "what". You can just focus on "what" your algorithm should do, then you choose "how" it will do it (e.g. Dask).
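
The "what" versus "how" split can be sketched with a toy task graph. This is loosely modelled on the idea of describing a computation as a dictionary of tasks; it is an illustration under my own assumptions, not Dask's API, and the `compute` helper is hypothetical.

```python
from operator import add, mul

# "What": a graph of named results.
# A tuple means (function, argument, argument, ...), where string
# arguments that are keys of the graph refer to other results.
graph = {
    'x': 1,
    'y': 2,
    'sum': (add, 'x', 'y'),
    'result': (mul, 'sum', 10),
}

def compute(key, graph, cache=None):
    # "How": a naive recursive evaluator; nothing runs until a key is requested.
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    node = graph[key]
    if isinstance(node, tuple):
        func, args = node[0], node[1:]
        values = [
            compute(a, graph, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        value = func(*values)
    else:
        value = node
    cache[key] = value
    return value

print(compute('result', graph))  # (1 + 2) * 10
```

The graph itself never changes if you swap this evaluator for a parallel or out-of-core one, which is the decoupling the speaker was describing.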

Reti Neurali in Python, @spiunno. The talk was a great overview of what neural networks are and how you can implement them with Theano and Lasagne. The speaker was able to give a talk that suited both beginners and an intermediate audience. In particular, the Q&A session was really active, and interesting topics were discussed, e.g. preventing overfitting, computational costs, gravitational waves, etc. Regarding overfitting prevention, I learnt about "dropout", which is a nice technique that basically consists of dropping out units of the network, together with their links, at random for each training sample. The advantage is that you prevent overfitting and reduce the computational cost at the same time.
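
Here is a minimal sketch of dropout on a single layer's activations, in plain Python rather than Theano or Lasagne. The rescaling of the surviving units ("inverted dropout") is an assumption of this sketch, a common variant and not necessarily what was shown in the talk.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Zero each activation with probability drop_prob during training."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Rescale survivors so the expected value of each unit is unchanged
    # ("inverted dropout", assumed here for illustration).
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)  # fixed seed so the run is reproducible
hidden = [0.5, 1.0, 1.5, 2.0]

# During training each unit is either zeroed or doubled (keep_prob = 0.5).
print(dropout(hidden, 0.5, rng))
# At prediction time the activations pass through unchanged.
print(dropout(hidden, 0.5, rng, training=False))
```

A fresh random mask is drawn on every call, so each training sample sees a different thinned network.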

@hendorf thank you for coming! enjoy your next conference :)
— PyCon Italy (@pyconit) April 20, 2016