<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Marco Santoni</title><link href="https://www.marcosantoni.com/" rel="alternate"></link><link href="https://www.marcosantoni.com/feeds/all.atom.xml" rel="self"></link><id>https://www.marcosantoni.com/</id><updated>2026-04-27T00:00:00+02:00</updated><entry><title>An AI-augmented workflow for a yearly coding exam: authoring + grading</title><link href="https://www.marcosantoni.com/an-ai-augmented-workflow-for-a-yearly-coding-exam-authoring-grading.html" rel="alternate"></link><published>2026-04-27T00:00:00+02:00</published><updated>2026-04-27T00:00:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2026-04-27:/an-ai-augmented-workflow-for-a-yearly-coding-exam-authoring-grading.html</id><summary type="html">&lt;p&gt;Each year I run an end-of-course coding exam for a Big Data Specialist class. The setup had ossified over time: I'd hand-author the exercises, hand-grade dozens of submissions over a weekend, and hand-write per-student feedback reports. It worked, but it didn't scale — and year-over-year consistency drifted because I was the …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Each year I run an end-of-course coding exam for a Big Data Specialist class. The setup had ossified over time: I'd hand-author the exercises, hand-grade dozens of submissions over a weekend, and hand-write per-student feedback reports. It worked, but it didn't scale — and year-over-year consistency drifted because I was the only consistency check.&lt;/p&gt;
&lt;p&gt;This year I rebuilt the workflow around &lt;strong&gt;two AI-powered loops&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Authoring&lt;/strong&gt; — a coding agent generates a new yearly edition from a written rubric.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Grading&lt;/strong&gt; — each student submission is graded per-exercise by the OpenAI API against a reference solution.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This post is about what I learned: &lt;em&gt;what to delegate to the AI, what to lock down with code, and where I still keep a human in the loop&lt;/em&gt;. I won't show the actual exam content (it rotates yearly and is not meant to leak), but I'll show the surrounding scaffolding in detail.&lt;/p&gt;
&lt;h2&gt;The constraints&lt;/h2&gt;
&lt;p&gt;The constraints set the shape of everything else:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Yearly editions.&lt;/strong&gt; Each year I publish a new edition with different specifics, so a leaked solution from last year is useless. The skills tested must stay the same.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mixed delivery.&lt;/strong&gt; Several Python exercises (Pandas + SQL), one Power BI exercise, plus two open-ended discussion questions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These constraints push toward a &lt;em&gt;templated&lt;/em&gt; approach: the shape of the exam stays fixed, the contents rotate.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Part 1 — Authoring: rubric-driven generation&lt;/h2&gt;
&lt;p&gt;The hard authoring problem isn't writing one good exercise. It's writing a new set of exercises that together feel &lt;strong&gt;as hard as last year's, no more, no less&lt;/strong&gt;. If the bonus is too easy this year, top students walk through it; if it's too hard, the class average tanks. Either way, the score distribution is no longer comparable with previous cohorts.&lt;/p&gt;
&lt;h3&gt;Step 1: write the rubric first&lt;/h3&gt;
&lt;p&gt;I wrote a single document — &lt;code&gt;editions/README.md&lt;/code&gt; — that pins the &lt;em&gt;shape&lt;/em&gt; of each exercise:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Difficulty stars (⭐ to ⭐⭐⭐)&lt;/li&gt;
&lt;li&gt;Expected time (10–25 min)&lt;/li&gt;
&lt;li&gt;Concepts tested&lt;/li&gt;
&lt;li&gt;The SQL surface (&lt;code&gt;WHERE&lt;/code&gt;, &lt;code&gt;JOIN&lt;/code&gt;, &lt;code&gt;GROUP BY&lt;/code&gt;, …)&lt;/li&gt;
&lt;li&gt;The Pandas surface (&lt;code&gt;groupby&lt;/code&gt;, &lt;code&gt;dt.to_period&lt;/code&gt;, &lt;code&gt;diff&lt;/code&gt;, &lt;code&gt;fillna&lt;/code&gt;, …)&lt;/li&gt;
&lt;li&gt;"Authoring guidance" describing &lt;strong&gt;which knobs to turn between editions&lt;/strong&gt; and which to leave alone&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The rubric is anchored to a known-good edition. So when I author the next year's edition, I'm not inventing complexity from scratch — I'm parametrising the same complexity over a different choice of table, dimension, time bucket, etc. This makes the next edition a &lt;em&gt;constrained transformation&lt;/em&gt;, not an open-ended creative task — exactly the kind of work an LLM does well.&lt;/p&gt;
&lt;h3&gt;Step 2: the authoring loop&lt;/h3&gt;
&lt;div class="mermaid"&gt;
flowchart LR
    R[editions/README.md&lt;br/&gt;complexity rubric] --&gt; A[Coding agent&lt;br/&gt;Claude Code]
    P[Previous edition&lt;br/&gt;e.g. editions/2025/] --&gt; A
    A --&gt; D[Draft new edition:&lt;br/&gt;Esame, esercizio, soluzione, Domande]
    D --&gt; V[Run soluzione.py&lt;br/&gt;against shared DB]
    V -- fail --&gt; A
    V -- ok --&gt; H[Human review:&lt;br/&gt;wording, ambiguity, fairness]
    H --&gt; S[Ship to students]
&lt;/div&gt;

&lt;p&gt;I open a coding agent in the repo and prompt with something like:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Generate the 2027 edition. Read &lt;code&gt;editions/README.md&lt;/code&gt; for the rubric and &lt;code&gt;editions/2026/&lt;/code&gt; as the most recent example. Create &lt;code&gt;editions/2027/Esame.md&lt;/code&gt;, &lt;code&gt;esercizio.py&lt;/code&gt;, &lt;code&gt;soluzione.py&lt;/code&gt;, &lt;code&gt;Domande.md&lt;/code&gt;. Verify the solution runs end-to-end against the shared SQLite DB and produces coherent CSVs. Don't reuse the entity from 2026."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The agent reads the rubric and the most recent edition, drafts the four files following the contracted shape, runs the reference solution against the shared DB, reports row counts and a summary, and stops.&lt;/p&gt;
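&lt;p&gt;The verification step can be sketched roughly as follows. This is an illustration under assumptions, not the agent's actual procedure: the helper name &lt;code&gt;verify_solution&lt;/code&gt; is hypothetical, and it assumes that &lt;code&gt;soluzione.py&lt;/code&gt; writes its CSVs, each with a header row, into the edition folder.&lt;/p&gt;

```python
import csv
import subprocess
import sys
from pathlib import Path


def verify_solution(edition_dir: Path) -> dict[str, int]:
    """Run soluzione.py inside the edition folder and count rows per CSV."""
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run([sys.executable, "soluzione.py"], cwd=edition_dir, check=True)
    counts = {}
    # Assumption: the solution writes its CSV outputs next to itself.
    for csv_path in sorted(edition_dir.glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle))
        counts[csv_path.name] = max(len(rows) - 1, 0)  # exclude the header row
    return counts
```

&lt;p&gt;A non-zero exit code stops the loop early, and the row counts give a quick sanity check that each output is non-empty before the human review starts.&lt;/p&gt;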
&lt;h3&gt;What I deliberately did NOT delegate&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Writing the rubric itself.&lt;/strong&gt; That's the institutional memory of "what does this exam test, at what level". An LLM shouldn't draft its own exam standards from scratch — but it can faithfully apply standards that already exist.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The choice of difficulty caps.&lt;/strong&gt; The bonus is hard &lt;em&gt;on purpose&lt;/em&gt;. The agent doesn't get to make it easier because that's the local optimum on whatever it tried first.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Final review.&lt;/strong&gt; I read every word of the new edition. The agent is fast, not infallible — and exam wording in Italian needs a native ear.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Pitfalls I hit, and how the rubric absorbed them&lt;/h3&gt;
&lt;p&gt;Three real bugs surfaced while authoring the second edition. Each one became a new line in the rubric so it doesn't recur.&lt;/p&gt;
&lt;p&gt;The pattern: &lt;strong&gt;whenever the AI got something subtly wrong in a way I had to think about, I codified the lesson in the rubric.&lt;/strong&gt; The rubric grows; future editions get safer.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;Part 2 — Grading: OpenAI API, one exercise at a time&lt;/h2&gt;
&lt;p&gt;After the exam, dozens of &lt;code&gt;esercizio.py&lt;/code&gt; files land in &lt;code&gt;evaluation/exam_submissions/&amp;lt;student&amp;gt;/&lt;/code&gt;. The hard part of grading isn't producing one good evaluation — it's producing many that are &lt;em&gt;consistent across students&lt;/em&gt; and &lt;em&gt;fair across exercises&lt;/em&gt;.&lt;/p&gt;
&lt;h3&gt;The bedrock: a section-marker convention&lt;/h3&gt;
&lt;p&gt;The single most important infrastructure choice was forcing each submission to be &lt;strong&gt;syntactically partitionable&lt;/strong&gt;. Each exercise is wrapped in explicit markers:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# === EXCERCISE N START === do not edit this line&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="c1"&gt;# === EXCERCISE N END === do not edit this line&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Students see this clearly in the template. They are told, in writing and out loud, not to touch these lines. The evaluator's &lt;code&gt;extract_code_sections&lt;/code&gt; is then a single regex.&lt;/p&gt;
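&lt;p&gt;A minimal sketch of what such an extractor could look like, assuming the markers are spelled exactly as in the template above. The function name comes from the evaluator, but the regex and the return shape here are illustrative rather than the exact code.&lt;/p&gt;

```python
import re

# Matches one marked section; the back-reference \1 requires the END marker
# to carry the same exercise number as the START marker.
# Assumption: markers are spelled exactly as in the student template.
SECTION_RE = re.compile(
    r"^# === EXCERCISE (\d+) START ===[^\n]*\n"
    r"(.*?)"
    r"^# === EXCERCISE \1 END ===",
    re.DOTALL | re.MULTILINE,
)


def extract_code_sections(source: str) -> dict[int, str]:
    """Return {exercise number: student code} for every marked section."""
    return {
        int(number): body.strip()
        for number, body in SECTION_RE.findall(source)
    }
```

&lt;p&gt;A section whose markers were edited or deleted simply does not appear in the result, which makes tampered submissions easy to flag for manual review.&lt;/p&gt;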
&lt;h3&gt;The grading pipeline&lt;/h3&gt;
&lt;div class="mermaid"&gt;
flowchart TD
    A[exam_submissions/&lt;br/&gt;student folder] --&gt; B[load esercizio.py]
    B --&gt; C[extract sections&lt;br/&gt;via marker regex]
    C --&gt; D{for each&lt;br/&gt;exercise N}
    D --&gt; E[build prompt:&lt;br/&gt;description + reference + student code]
    E --&gt; F[OpenAI API call]
    F --&gt; G[parse JSON:&lt;br/&gt;score + feedback]
    G --&gt; H[merge into&lt;br/&gt;code_evaluation.json]
    D --&gt; D
    H --&gt; I[postprocess.py]
    I --&gt; J[student_reports/&lt;br/&gt;Name.md]
&lt;/div&gt;

&lt;script type="module"&gt;
  import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs';
  mermaid.initialize({ startOnLoad: true });
&lt;/script&gt;

&lt;p&gt;Each exercise is graded &lt;strong&gt;independently&lt;/strong&gt;. The prompt structure is identical across students.&lt;/p&gt;
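&lt;p&gt;To make "identical prompt structure" concrete, here is a sketch of a fixed template filled per exercise. The template wording and the &lt;code&gt;build_prompt&lt;/code&gt; name are hypothetical; only the three ingredients (description, reference, student code) and the JSON keys come from the pipeline described above.&lt;/p&gt;

```python
# Hypothetical template wording; the structure never changes between students.
PROMPT_TEMPLATE = """You are grading exercise {number} of a coding exam.

## Exercise description
{description}

## Reference solution
{reference}

## Student submission
{student}

Reply with a JSON object with the keys "score" and "feedback".
"""


def build_prompt(number: int, description: str, reference: str, student: str) -> str:
    """Fill the fixed template; only the four inputs vary between calls."""
    return PROMPT_TEMPLATE.format(
        number=number,
        description=description.strip(),
        reference=reference.strip(),
        # An empty section is still graded, but the model is told explicitly.
        student=student.strip() or "(empty submission)",
    )
```

&lt;p&gt;Because &lt;code&gt;str.format&lt;/code&gt; only parses the template, curly braces inside the student's code are inserted verbatim and cannot break the prompt.&lt;/p&gt;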
&lt;p&gt;A few details that made the pipeline reliable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reference comparison.&lt;/strong&gt; Every prompt includes the reference solution's code &lt;em&gt;for that exact exercise&lt;/em&gt;. The model isn't grading in a vacuum; it's comparing against a known-good version on the same task.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Structured JSON output.&lt;/strong&gt; A fixed schema makes downstream merging trivial. No prose-parsing fragility.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Per-exercise scope.&lt;/strong&gt; The model never sees the whole file. It sees one exercise at a time. This keeps the reasoning &lt;em&gt;local&lt;/em&gt;: a student who failed exercise 1 can still get full marks on exercise 5, and the model isn't tempted to "average vibes" across the file.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resume from cache.&lt;/strong&gt; The merged &lt;code&gt;code_evaluation.json&lt;/code&gt; is the source of truth. Re-running the evaluator only grades students that aren't already in it. A transient API failure no longer restarts a 2-hour batch.&lt;/li&gt;
&lt;/ul&gt;
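&lt;p&gt;The resume-from-cache behaviour can be sketched in a few lines. This is a simplified illustration: &lt;code&gt;grade&lt;/code&gt; is a hypothetical callable standing in for the per-student API calls, and the flat &lt;code&gt;{student: result}&lt;/code&gt; layout of the JSON file is an assumption.&lt;/p&gt;

```python
import json
from pathlib import Path
from typing import Callable


def load_cache(path: Path) -> dict:
    """Read the merged evaluation file, or start empty on the first run."""
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}


def grade_all(students: list[str], grade: Callable[[str], dict], path: Path) -> dict:
    """Grade only students missing from the cache, saving after each one."""
    results = load_cache(path)
    for student in students:
        if student in results:
            continue  # already graded in a previous run
        results[student] = grade(student)
        # Persist immediately so a crash on the next student loses nothing.
        path.write_text(json.dumps(results, indent=2, ensure_ascii=False), encoding="utf-8")
    return results
```

&lt;p&gt;Writing the file after every student costs a little I/O, but it means a failed run can be restarted with the same command and picks up where it stopped.&lt;/p&gt;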
&lt;h3&gt;From JSON to per-student markdown&lt;/h3&gt;
&lt;p&gt;A short &lt;code&gt;postprocess.py&lt;/code&gt; fans the merged JSON out into one markdown report per student:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gh"&gt;# Valutazione Mario Rossi&lt;/span&gt;

&lt;span class="gs"&gt;**Punteggio totale**&lt;/span&gt;: 17.0/20
&lt;span class="gs"&gt;**Percentuale**&lt;/span&gt;: 85.0%

&lt;span class="gu"&gt;## Esercizio 1&lt;/span&gt;
&lt;span class="gs"&gt;**Punteggio**&lt;/span&gt;: 5/5
&lt;span class="gs"&gt;**Feedback**&lt;/span&gt;:
&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Implementazione corretta e completa
&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Buon uso di ORDER BY e di un confronto stretto

&lt;span class="gu"&gt;## Esercizio 2&lt;/span&gt;
&lt;span class="gs"&gt;**Punteggio**&lt;/span&gt;: 4/5
&lt;span class="gs"&gt;**Punti di forza**&lt;/span&gt;:
&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Join multi-tabella corretto
&lt;span class="gs"&gt;**Suggerimenti di miglioramento**&lt;/span&gt;:
&lt;span class="k"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Considera l&amp;#39;uso di &lt;span class="sb"&gt;`parse_dates`&lt;/span&gt; direttamente in &lt;span class="sb"&gt;`read_sql_query`&lt;/span&gt;
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
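&lt;p&gt;The fan-out step itself is mostly string formatting. The sketch below is illustrative and simplified: it assumes each exercise entry carries &lt;code&gt;score&lt;/code&gt;, &lt;code&gt;max_score&lt;/code&gt; and a list of &lt;code&gt;feedback&lt;/code&gt; points, which is a hypothetical schema, and it renders a single feedback list per exercise.&lt;/p&gt;

```python
def render_report(student: str, exercises: dict[str, dict]) -> str:
    """Render one student's results as markdown, one section per exercise."""
    total = sum(e["score"] for e in exercises.values())
    maximum = sum(e["max_score"] for e in exercises.values())
    lines = [
        f"# Valutazione {student}",
        "",
        f"**Punteggio totale**: {total:.1f}/{maximum}",
        f"**Percentuale**: {100 * total / maximum:.1f}%",
    ]
    # JSON object keys are strings, so sort them numerically.
    for number, e in sorted(exercises.items(), key=lambda item: int(item[0])):
        lines += [
            "",
            f"## Esercizio {number}",
            f"**Punteggio**: {e['score']}/{e['max_score']}",
            "**Feedback**:",
        ]
        lines += [f"- {point}" for point in e["feedback"]]
    return "\n".join(lines) + "\n"
```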

&lt;p&gt;Students get one file with a per-exercise breakdown. I do a final spot-check pass on a sample — usually the AI gradings line up with my own, with disagreement clustered on partially-correct solutions where partial credit is genuinely subjective. That's exactly where I want my human time to go.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;What stays human&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Authoring sign-off.&lt;/strong&gt; I read every word of every new edition.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open-ended discussion questions.&lt;/strong&gt; Each edition includes two &lt;code&gt;Domande.md&lt;/code&gt; questions probing &lt;em&gt;why&lt;/em&gt; the student did what they did. These are written or oral, reviewed by me. The signal is exactly the kind of thing that gets flattened by an LLM summary.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Takeaway&lt;/h2&gt;
&lt;p&gt;The interesting question isn't "can AI grade exams" or "can AI write exams". Both are demonstrably yes, and have been for a while. The interesting question is &lt;strong&gt;what's the smallest structure I need to put around the AI&lt;/strong&gt; so its output is consistent year-over-year and fair across dozens of students.&lt;/p&gt;
&lt;p&gt;For me that turned out to be three pieces of scaffolding:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A &lt;strong&gt;rubric&lt;/strong&gt; that locks down what must stay the same across editions.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;marker convention&lt;/strong&gt; that gives the grader a clean unit of work.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;resume cache&lt;/strong&gt; that makes the whole pipeline idempotent.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Once those three are in place, the AI does what it's good at — fluent text, code comparison, structured feedback — and I do what I'm good at — judging whether the result is fair, and updating the rubric when it isn't.&lt;/p&gt;
&lt;p&gt;Net result: a process that used to take a weekend per edition and a weekend per grading round now takes a few hours of focused human review on top of the AI's first pass. And the year-over-year consistency, which used to live entirely in my head, now lives in a markdown file that I can hand to a colleague.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Panel at PyData Milan: Managing Teams, Stakeholders and Delivery in the GenAI Era</title><link href="https://www.marcosantoni.com/panel-at-pydata-milan-managing-teams-stakeholders-and-delivery-in-the-genai-era.html" rel="alternate"></link><published>2026-03-22T00:00:00+01:00</published><updated>2026-03-22T00:00:00+01:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2026-03-22:/panel-at-pydata-milan-managing-teams-stakeholders-and-delivery-in-the-genai-era.html</id><summary type="html">&lt;p&gt;&lt;img src="../images/pydata-milan-2026/panel.jpg" width="400" /&gt;&lt;/p&gt;
&lt;p&gt;I had the pleasure of being a guest panelist at &lt;a href="https://www.meetup.com/pydata-milano/events/313716178/"&gt;PyData Milan&lt;/a&gt; on March 18, 2026. The event was hosted at TeamSystem's office in Milan.&lt;/p&gt;
&lt;p&gt;The panel was about &lt;strong&gt;Managing Teams, Stakeholders and Delivery in the GenAI Era&lt;/strong&gt;. Together with &lt;a href="https://www.linkedin.com/in/parvaneh-shafiei/"&gt;Parvaneh Shafiei&lt;/a&gt; (AI Manager at TUI Musement) and &lt;a href="https://www.linkedin.com/in/albertodanese/"&gt;Alberto Danese …&lt;/a&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img src="../images/pydata-milan-2026/panel.jpg" width="400" /&gt;&lt;/p&gt;
&lt;p&gt;I had the pleasure of being a guest panelist at &lt;a href="https://www.meetup.com/pydata-milano/events/313716178/"&gt;PyData Milan&lt;/a&gt; on March 18, 2026. The event was hosted at TeamSystem's office in Milan.&lt;/p&gt;
&lt;p&gt;The panel was about &lt;strong&gt;Managing Teams, Stakeholders and Delivery in the GenAI Era&lt;/strong&gt;. Together with &lt;a href="https://www.linkedin.com/in/parvaneh-shafiei/"&gt;Parvaneh Shafiei&lt;/a&gt; (AI Manager at TUI Musement) and &lt;a href="https://www.linkedin.com/in/albertodanese/"&gt;Alberto Danese&lt;/a&gt; (Head of Data Science &amp;amp; Advanced Analytics at Nexi), we discussed topics ranging from data and AI governance, to shipping AI solutions in large enterprises, to how GenAI and coding agents are transforming team organization.&lt;/p&gt;
&lt;p&gt;It was a great conversation, and I enjoyed sharing experiences and perspectives with the other panelists in front of the PyData Milan community.&lt;/p&gt;
&lt;h2&gt;Recording&lt;/h2&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/Pm6uaaLBwVY?si=WkUjCj2bkbzZGnVI" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen&gt;&lt;/iframe&gt;</content><category term="posts"></category></entry><entry><title>Fitness Functions for Data and AI: Computational Policies</title><link href="https://www.marcosantoni.com/fitness_functions_data_mesh.html" rel="alternate"></link><published>2026-03-15T08:35:00+01:00</published><updated>2026-03-15T08:35:00+01:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2026-03-15:/fitness_functions_data_mesh.html</id><summary type="html">&lt;p&gt;In &lt;em&gt;Building Evolutionary Architectures&lt;/em&gt; and &lt;em&gt;Software Architecture: The Hard Parts&lt;/em&gt;, Neal Ford and colleagues introduce the concept of &lt;strong&gt;fitness functions&lt;/strong&gt; — automated checks that verify whether a system preserves its desired architectural characteristics over time. The idea is simple: if you care about a quality (latency, coupling, resilience), define an objective …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In &lt;em&gt;Building Evolutionary Architectures&lt;/em&gt; and &lt;em&gt;Software Architecture: The Hard Parts&lt;/em&gt;, Neal Ford and colleagues introduce the concept of &lt;strong&gt;fitness functions&lt;/strong&gt; — automated checks that verify whether a system preserves its desired architectural characteristics over time. The idea is simple: if you care about a quality (latency, coupling, resilience), define an objective measure and automate its verification. Don't rely on manual reviews or good intentions.&lt;/p&gt;
&lt;p&gt;&lt;img src="./images/software_architecture_hard_parts.jpg" alt="Software Architecture: The Hard Parts book cover" style="max-width: 400px; height: auto;" /&gt;&lt;/p&gt;
&lt;p&gt;Fitness functions can run automatically in your &lt;strong&gt;CI/CD pipelines&lt;/strong&gt;, but they are not typical testing or static analysis tools. Unit tests verify that your code behaves correctly. Static analysis tools check for code smells or vulnerabilities. Fitness functions operate at a different level: they measure how well your system &lt;strong&gt;fits the overall architecture design principles&lt;/strong&gt; adopted by your team or company. Is your service respecting the agreed coupling boundaries? Is your module's dependency graph consistent with the target architecture? These are the kinds of questions fitness functions answer.&lt;/p&gt;
&lt;h3&gt;Example: component size threshold&lt;/h3&gt;
&lt;p&gt;Let's make this concrete with an example from the book. Suppose your team has an architecture principle that &lt;strong&gt;no single component should dominate the codebase&lt;/strong&gt;. You can encode this as a fitness function:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fitness function&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No component shall exceed X% of the overall codebase&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Type&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Holistic, automated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Trigger&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CI/CD pipeline on deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it checks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The percentage of overall source code represented by each component&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Action&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Alerts the architect if any component exceeds the threshold&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For instance, imagine your application has 6 components and you set the threshold at 30%:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Codebase %&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Account&lt;/td&gt;
&lt;td&gt;18%&lt;/td&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ticketing&lt;/td&gt;
&lt;td&gt;15%&lt;/td&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Purchases&lt;/td&gt;
&lt;td&gt;22%&lt;/td&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Notifications&lt;/td&gt;
&lt;td&gt;8%&lt;/td&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reporting&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;35%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Fail or Warning&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auth&lt;/td&gt;
&lt;td&gt;2%&lt;/td&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The Reporting component has grown beyond the threshold. This doesn't necessarily mean something is broken — but it's a signal that this component may be accumulating too many responsibilities and should be reviewed by the architect.&lt;/p&gt;
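&lt;p&gt;As a rough sketch, the check behind this table fits in a few lines of Python. Measuring component size in lines of code is an assumption made for illustration, and the function name and line counts are hypothetical, chosen only to reproduce the percentages above.&lt;/p&gt;

```python
def oversized_components(lines_of_code: dict[str, int], threshold_pct: float) -> dict[str, float]:
    """Return the components whose share of the codebase exceeds the threshold."""
    total = sum(lines_of_code.values())
    shares = {name: 100 * loc / total for name, loc in lines_of_code.items()}
    return {name: share for name, share in shares.items() if share > threshold_pct}


# Line counts chosen to reproduce the percentages in the table.
codebase = {
    "Account": 1800, "Ticketing": 1500, "Purchases": 2200,
    "Notifications": 800, "Reporting": 3500, "Auth": 200,
}
violations = oversized_components(codebase, threshold_pct=30)
print(violations)  # {'Reporting': 35.0}
```

&lt;p&gt;In a pipeline, a non-empty result would either fail the build or raise a warning for the architect, depending on how strict the team wants the guardrail to be.&lt;/p&gt;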
&lt;p&gt;The threshold depends on the size of the application. For a small application with 10 components, 30% might be a reasonable limit to catch outliers. For a large application with 50 components, 10% would be more appropriate. The point is not the exact number — it's that &lt;strong&gt;the architectural principle is encoded as an automated, objective check&lt;/strong&gt; rather than left to someone's judgment during code review.&lt;/p&gt;
&lt;p&gt;This is what makes fitness functions different from tests. A unit test would check that a function returns the right output. This fitness function checks that your system's structure still reflects the architecture your team agreed on.&lt;/p&gt;
&lt;h2&gt;Data Mesh Computational Policies as Fitness Functions&lt;/h2&gt;
&lt;p&gt;My data engineering team at TeamSystem has been applying this same concept to &lt;strong&gt;data mesh&lt;/strong&gt; — but instead of checking software architecture characteristics, we check &lt;strong&gt;data product characteristics&lt;/strong&gt;. In data mesh terms, these are called &lt;strong&gt;computational policies&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Example of policy output" src="./images/computational_policy.png"&gt;&lt;/p&gt;
&lt;h3&gt;What are computational policies?&lt;/h3&gt;
&lt;p&gt;In &lt;em&gt;Data Mesh&lt;/em&gt;, Zhamak Dehghani introduces computational policies as automated governance mechanisms. Each domain team owns its data products autonomously, but certain global guarantees must hold across the mesh — schema conventions, SLO adherence, discoverability metadata, interoperability contracts, access control rules. Rather than enforcing these through a central governance board or manual review, you encode them as automated checks that run continuously.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Example of computational policy architecture" src="./images/computational_policies_data_mesh.svg"&gt;&lt;/p&gt;
&lt;h2&gt;The parallel with fitness functions&lt;/h2&gt;
&lt;p&gt;The overlap is tight. Replace "architecture characteristic" with "data product governance rule" and the pattern is the same.&lt;/p&gt;
&lt;p&gt;Both fitness functions and computational policies share key properties:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Automated&lt;/strong&gt; — neither relies on humans reviewing things manually at scale&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Objective and measurable&lt;/strong&gt; — they produce a pass/fail or a metric, not a subjective opinion&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Guardrails, not gates&lt;/strong&gt; — the goal is to enable team autonomy while preventing drift from system-wide properties&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lifecycle-aware&lt;/strong&gt; — some run at build time (schema validation), some at deploy time (contract compatibility), some continuously at runtime (SLO monitoring)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Protect emergent properties&lt;/strong&gt; — individual teams making locally rational decisions can degrade global qualities like interoperability or discoverability, so you need automated checks at a higher level&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What changes is the scope&lt;/h2&gt;
&lt;p&gt;The main difference is one of scope and domain. Fitness functions in &lt;em&gt;The Hard Parts&lt;/em&gt; focus on software architecture characteristics — coupling, cohesion, latency, scalability, resilience. Computational policies focus on data product characteristics — schema quality, freshness, discoverability, interoperability, compliance.&lt;/p&gt;
&lt;p&gt;The underlying mechanism and philosophy are identical.&lt;/p&gt;
&lt;h2&gt;Atomic vs holistic&lt;/h2&gt;
&lt;p&gt;The book categorizes fitness functions as &lt;strong&gt;atomic&lt;/strong&gt; (checking a single property) or &lt;strong&gt;holistic&lt;/strong&gt; (checking cross-cutting concerns). The same categorization applies to computational policies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Atomic&lt;/strong&gt;: does this data product expose standard metadata? Does it meet its declared freshness SLO?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Holistic&lt;/strong&gt;: can a consumer actually discover, understand, and consume this product end-to-end?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Holistic policies are harder to implement but more valuable — they catch the gaps that atomic checks miss.&lt;/p&gt;
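&lt;p&gt;The two atomic examples can be sketched as plain predicates over a data product descriptor. Everything here is hypothetical and only meant to show the shape of such a check: the &lt;code&gt;DataProduct&lt;/code&gt; class, the set of required metadata keys, and the function names are not taken from any specific platform.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

REQUIRED_METADATA = {"owner", "description", "schema"}  # hypothetical convention


@dataclass
class DataProduct:
    name: str
    last_updated: datetime
    freshness_slo: timedelta
    metadata: dict = field(default_factory=dict)


def has_standard_metadata(product: DataProduct) -> bool:
    """Atomic policy: every required metadata key is present."""
    return REQUIRED_METADATA.issubset(product.metadata)


def meets_freshness_slo(product: DataProduct, now: datetime) -> bool:
    """Atomic policy: the age of the last update is within the declared SLO."""
    return product.freshness_slo >= now - product.last_updated
```

&lt;p&gt;Each predicate returns a plain pass/fail, so the same functions can run at build time, at deploy time, or on a schedule. A holistic policy would need to exercise discovery and consumption end-to-end, which does not reduce to a single predicate like these.&lt;/p&gt;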
&lt;h2&gt;Why this framing matters&lt;/h2&gt;
&lt;p&gt;If you come from a software architecture background and are entering the data mesh space, recognizing that computational policies are fitness functions gives you a familiar mental model. You already know the philosophy. You already understand why automation matters, why objectivity matters, why guardrails beat gates. And if you're already practicing data mesh governance, calling your policies "fitness functions" can help communicate their purpose to stakeholders who come from a software engineering background. It's a useful shared vocabulary.&lt;/p&gt;
&lt;p&gt;But to me, there's a deeper reason why this framing matters. Guidelines can — and sometimes should — be ignored. Domain teams must be free to adopt their own design principles when their context demands it. The power of computational policies is that they &lt;strong&gt;enable adoption&lt;/strong&gt; by teams while &lt;strong&gt;enforcing or nudging&lt;/strong&gt; company-level principles through automation, not bureaucracy. Governance teams evolve from writing documents that nobody reads into &lt;strong&gt;computational policy engineers&lt;/strong&gt; — people who encode organizational knowledge into automated, executable checks. That's a fundamentally different role, and a far more effective one.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Festina Lente - Make haste slowly while learning technology</title><link href="https://www.marcosantoni.com/festina_lente.html" rel="alternate"></link><published>2025-12-31T08:35:00+01:00</published><updated>2025-12-31T08:35:00+01:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2025-12-31:/festina_lente.html</id><summary type="html">&lt;p&gt;The phrase &lt;em&gt;festina lente&lt;/em&gt;—“make haste slowly”—was used by the Roman emperor Augustus as a personal motto. Suetonius reports that Augustus repeated it to his generals and administrators as a guiding principle: advance steadily, but never recklessly; act with urgency, but never without reflection. In that sense, the slogan …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The phrase &lt;em&gt;festina lente&lt;/em&gt;—“make haste slowly”—was used by the Roman emperor Augustus as a personal motto. Suetonius reports that Augustus repeated it to his generals and administrators as a guiding principle: advance steadily, but never recklessly; act with urgency, but never without reflection. 
In that sense, the slogan is not about slowness at all, but about disciplined speed—progress that is fast precisely because it is anchored in care, preparation, and clear thinking.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Statue of Augustus" src="./images/augustus.jpg"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;By Vicenç Valcárcel Pérez - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=99990278&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For a software engineer constantly exposed to new tools, frameworks, and abstractions, &lt;strong&gt;festina lente is a survival strategy&lt;/strong&gt;. The industry encourages you to chase every new shiny thing, but adopting technology too quickly means keeping up with the latest trends at the cost of feeling overwhelmed. Festina lente reminds you to &lt;strong&gt;pause and reflect&lt;/strong&gt; before jumping on the next bandwagon. Take the time to evaluate whether a new technology truly fits your needs, aligns with your goals, and is worth the investment of learning and integration.&lt;/p&gt;
&lt;p&gt;Once the decision to learn a new technology is made, festina lente encourages you to &lt;strong&gt;approach the learning process with patience and care&lt;/strong&gt;. This means avoiding distractions from trying to learn too many things at once, and instead focusing on mastering one technology at a time. It also means being willing to invest the necessary time and effort to truly understand the technology, rather than rushing through tutorials or documentation.&lt;/p&gt;
&lt;h2&gt;Turtle and sail&lt;/h2&gt;
&lt;p&gt;&lt;img alt="Festina lente representation by Medici" src="./images/festina_lente.jpg"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Image by © Marie-Lan Nguyen / Wikimedia Commons, CC BY 2.5, https://commons.wikimedia.org/w/index.php?curid=21764505&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The phrase &lt;em&gt;festina lente&lt;/em&gt; is often associated with the image of a turtle carrying a sail. The turtle represents the slow and steady progress that comes from careful reflection and preparation, while the sail represents the speed and agility that come from taking action with urgency. Together, they symbolize the balance between speed and caution that is at the heart of the festina lente philosophy.&lt;/p&gt;
&lt;p&gt;&lt;img alt="AI publications by year" src="./images/ai_publications_yearly.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Steady progress&lt;/strong&gt;. The chart above is taken from &lt;a href="https://hai.stanford.edu/assets/files/hai_ai_index_report_2025.pdf"&gt;AI Index 2025 Annual Report&lt;/a&gt; by Stanford University. It shows the number of AI-related publications per year. The trend is clear: the number of publications is growing steadily over time. This indicates that the field of AI is advancing at a steady pace, with new research and developments being published regularly.&lt;/p&gt;
&lt;p&gt;The risk is that what we learnt just a few years ago becomes obsolete very quickly. I have learnt to organize my learning over longer time horizons (approximately 3 to 6 months) to avoid being overwhelmed by the pace of change. I focus on mastering one technology at a time, and I avoid trying to learn too many things at once. I make sure I &lt;strong&gt;consciously decide not to learn&lt;/strong&gt; or even explore some technologies that are trending but not relevant to my goals.&lt;/p&gt;
&lt;p&gt;What I most care about is that the turtle 🐢 keeps moving forward steadily, even if slowly.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Talk at Politecnico di Milano</title><link href="https://www.marcosantoni.com/talk-at-politecnico-di-milano.html" rel="alternate"></link><published>2025-11-05T00:00:00+01:00</published><updated>2025-11-05T00:00:00+01:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2025-11-05:/talk-at-politecnico-di-milano.html</id><summary type="html">&lt;p&gt;I was invited to give a talk at Politecnico di Milano at Osservatorio Big Data. The event took place on November 4th, 2025, and my talk was &lt;em&gt;"Guida galattica per data product AI-ready"&lt;/em&gt;, which means &lt;em&gt;"Hitchhiker's guide to AI-ready data products"&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="../images/poli-big-data/poli1.jpeg" width="400" /&gt;&lt;/p&gt;
&lt;p&gt;I shared the experience we had at TeamSystem …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I was invited to give a talk at Politecnico di Milano at Osservatorio Big Data. The event took place on November 4th, 2025, and my talk was &lt;em&gt;"Guida galattica per data product AI-ready"&lt;/em&gt;, which means &lt;em&gt;"Hitchhiker's guide to AI-ready data products"&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="../images/poli-big-data/poli1.jpeg" width="400" /&gt;&lt;/p&gt;
&lt;p&gt;I shared the experience we had at TeamSystem in building data products that are enablers for text2SQL agents. These agents require high-quality context about the data they are querying. This context is not only about the schema of the database but also about business rules, data quality, and other metadata that can help the agent generate accurate SQL queries.&lt;/p&gt;
&lt;p&gt;We developed a platform that integrates the metadata of our data products into the context of our text2SQL agents. The idea is to build the metadata only once and reuse it across different use cases (both AI agents and traditional data consumers).&lt;/p&gt;
&lt;p&gt;Below is a picture with the Databricks folks who contributed to the talk.&lt;/p&gt;
&lt;p&gt;&lt;img src="../images/poli-big-data/poli2.jpeg" width="400" /&gt;&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Switch to UV</title><link href="https://www.marcosantoni.com/switch-to-uv.html" rel="alternate"></link><published>2025-10-20T00:00:00+02:00</published><updated>2025-10-20T00:00:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2025-10-20:/switch-to-uv.html</id><summary type="html">&lt;p&gt;I just moved the git repo of this blog from an old conda+pip based setup to using &lt;code&gt;uv&lt;/code&gt;. On Mac, start by&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;brew&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;uv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then, I initialized the uv project and just imported the dependencies specified in the &lt;code&gt;requirements.txt&lt;/code&gt; file.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;uv&lt;span class="w"&gt; &lt;/span&gt;init&lt;span class="w"&gt; &lt;/span&gt;--python&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;.13
uv&lt;span class="w"&gt; &lt;/span&gt;add …&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</summary><content type="html">&lt;p&gt;I just moved the git repo of this blog from an old conda+pip based setup to using &lt;code&gt;uv&lt;/code&gt;. On Mac, start by&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;brew&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;uv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then, I initialized the uv project and just imported the dependencies specified in the &lt;code&gt;requirements.txt&lt;/code&gt; file.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;uv&lt;span class="w"&gt; &lt;/span&gt;init&lt;span class="w"&gt; &lt;/span&gt;--python&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;.13
uv&lt;span class="w"&gt; &lt;/span&gt;add&lt;span class="w"&gt; &lt;/span&gt;--requirements&lt;span class="w"&gt; &lt;/span&gt;requirements.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And that is basically it. From now on, to run pelican commands, just prefix them with &lt;code&gt;uv run&lt;/code&gt;, e.g.,&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;uv&lt;span class="w"&gt; &lt;/span&gt;run&lt;span class="w"&gt; &lt;/span&gt;pelican&lt;span class="w"&gt; &lt;/span&gt;content
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The transition was smooth and fast. Highly recommended!&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Talk at Codemotion 2025</title><link href="https://www.marcosantoni.com/talk-at-codemotion-2025.html" rel="alternate"></link><published>2025-10-15T00:00:00+02:00</published><updated>2025-10-15T00:00:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2025-10-15:/talk-at-codemotion-2025.html</id><summary type="html">&lt;p&gt;&lt;img src="../images/codemotion/talk1.jpeg" width="400" /&gt;&lt;/p&gt;
&lt;p&gt;I spoke at Codemotion for the first time in October 2025 thanks to the work done over the last months at TeamSystem. With my colleague Mattia De Leo, we presented our recent work on building AI assistants based on knowledge graphs and large language models. The talk was well …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img src="../images/codemotion/talk1.jpeg" width="400" /&gt;&lt;/p&gt;
&lt;p&gt;I spoke at Codemotion for the first time in October 2025 thanks to the work done over the last months at TeamSystem. With my colleague Mattia De Leo, we presented our recent work on building AI assistants based on knowledge graphs and large language models. The talk was well received, and we had a lot of interesting questions from the audience.&lt;/p&gt;
&lt;p&gt;&lt;img src="../images/codemotion/talk2.jpeg" width="400" /&gt;&lt;/p&gt;
&lt;p&gt;I focused my talk on Text2SQL, a task that consists of translating natural language queries into SQL queries. This is a challenging task, and I explained why understanding the semantics of the natural language query and the structure of the database schema is not enough. Business context and further metadata are key to generating accurate SQL queries.&lt;/p&gt;
&lt;p&gt;&lt;img src="../images/codemotion/talk3.jpeg" width="400" /&gt;&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Speaking at Big Data London 2025</title><link href="https://www.marcosantoni.com/speaking-at-big-data-london-2025.html" rel="alternate"></link><published>2025-10-01T00:00:00+02:00</published><updated>2025-10-01T00:00:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2025-10-01:/speaking-at-big-data-london-2025.html</id><summary type="html">&lt;p&gt;&lt;img src="../images/bigdatalondon/talk1.jpg" width="400" /&gt;&lt;/p&gt;
&lt;p&gt;I attended Big Data London for the first time in September 2025. I gave a talk with my colleague Andrea Romeo about a challenging task we faced at TeamSystem. We developed a way to offload thousands of SQL Server tenants via CDC (Change Data Capture) to a data lake via …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img src="../images/bigdatalondon/talk1.jpg" width="400" /&gt;&lt;/p&gt;
&lt;p&gt;I attended Big Data London for the first time in September 2025. I gave a talk with my colleague Andrea Romeo about a challenging task we faced at TeamSystem. We developed a way to offload thousands of SQL Server tenants via CDC (Change Data Capture) to a data lake via Debezium.&lt;/p&gt;
&lt;p&gt;&lt;img src="../images/bigdatalondon/talk2.jpeg" width="400" /&gt;&lt;/p&gt;
&lt;p&gt;I appreciated the conference, which I consider closer to a trade fair. It was a great chance to meet key vendors in the big data space and to discuss their products with them.&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/79xt6HGG2L4?si=1wRkwcXHzM68EgCQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen&gt;&lt;/iframe&gt;</content><category term="posts"></category></entry><entry><title>Weight AI Eng skills by page count</title><link href="https://www.marcosantoni.com/weight-ai-eng-skills-by-page-count.html" rel="alternate"></link><published>2025-08-31T21:41:00+02:00</published><updated>2025-08-31T21:41:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2025-08-31:/weight-ai-eng-skills-by-page-count.html</id><summary type="html">&lt;p&gt;I am currently reading "AI Engineering" by Chip Huyen and am really enjoying it. I spent some years as data scientist in the past, and now I found some analogies between data science and AI engineering. The analogy is in the way the industry is talking about the discipline and …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I am currently reading "AI Engineering" by Chip Huyen and am really enjoying it. I spent some years as data scientist in the past, and now I found some analogies between data science and AI engineering. The analogy is in the way the industry is talking about the discipline and what actually engineering teams are fighting on daily. Before going into that, let's define how we can evaluate the importance of different skills in AI engineering.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Which AI engineering skill is actually the most important while being the least talked about?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To do so, I will use a totally arbitrary method: weighting each skill by the number of pages devoted to it in the book. This is not a perfect metric, but it can give us some insight into which skills the author considers more important. So I drew the following chart based on the page count:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Chart with AI engineering skills weighted by page count" src="./images/ai_eng_skills_by_pages_in_book.png"&gt;&lt;/p&gt;
&lt;h3&gt;Evaluation, evaluation, evaluation&lt;/h3&gt;
&lt;p&gt;I have long considered evaluation &lt;strong&gt;the skill&lt;/strong&gt; that differentiates a senior data scientist from a junior one, just as testing is the skill that differentiates a senior software engineer from a junior one. In the context of AI engineering, evaluation becomes even more crucial. Why is it so relevant? Taking some points from Huyen's book:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Open-ended outputs: for a given input, there are many possible correct responses.&lt;/li&gt;
&lt;li&gt;The more intelligent AI models become, the harder it is to evaluate them. You can no longer evaluate a response based on how it sounds.&lt;/li&gt;
&lt;li&gt;Black-box models: details such as the model architecture, training data, and training process are not available.&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;evals are surprisingly often all you need&lt;/p&gt;&amp;mdash; Greg Brockman (@gdb) &lt;a href="https://twitter.com/gdb/status/1733553161884127435?ref_src=twsrc%5Etfw"&gt;December 9, 2023&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="https://platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;</content><category term="posts"></category></entry><entry><title>Book review: thinking in bets</title><link href="https://www.marcosantoni.com/book-review-thinking-in-bets.html" rel="alternate"></link><published>2025-08-24T13:41:00+02:00</published><updated>2025-08-24T13:41:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2025-08-24:/book-review-thinking-in-bets.html</id><summary type="html">&lt;blockquote&gt;
&lt;p&gt;“&lt;strong&gt;Wanna bet?&lt;/strong&gt;” triggers us to engage in that third step that we only sometimes get to. Being asked if we are willing to bet money on it makes it much more likely that we will examine our information in a less biased way, be more honest with ourselves about how …&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;blockquote&gt;
&lt;p&gt;“&lt;strong&gt;Wanna bet?&lt;/strong&gt;” triggers us to engage in that third step that we only sometimes get to. Being asked if we are willing to bet money on it makes it much more likely that we will examine our information in a less biased way, be more honest with ourselves about how sure we are of our beliefs, and be more open to updating and calibrating our beliefs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A couple of months ago, I read &lt;em&gt;Thinking in Bets&lt;/em&gt; by Annie Duke. The book presents a compelling case for decision-making under uncertainty and offers practical strategies for improving our thinking processes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="book cover" src="./images/bookshelf/thinking_in_bets.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;Key Takeaways&lt;/h2&gt;
&lt;p&gt;One of the key takeaways from the book is the concept of "&lt;strong&gt;resulting&lt;/strong&gt;," which is the tendency to judge the quality of a decision based on its outcome rather than the reasoning behind it. Duke argues that this mindset can lead to poor decision-making in the long run, as it encourages us to ignore valuable information and lessons learned from our experiences.&lt;/p&gt;
&lt;h2&gt;My (conditioned) opinion on the book&lt;/h2&gt;
&lt;p&gt;Before reading &lt;em&gt;Thinking in Bets&lt;/em&gt;, I had the chance to read books like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Thinking, Fast and Slow&lt;/em&gt; by Daniel Kahneman&lt;/li&gt;
&lt;li&gt;&lt;em&gt;The Signal and the Noise&lt;/em&gt; by Nate Silver&lt;/li&gt;
&lt;li&gt;&lt;em&gt;The Black Swan&lt;/em&gt; by Nassim Nicholas Taleb&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While &lt;em&gt;Thinking in Bets&lt;/em&gt; offers valuable insights, most concepts reminded me of ideas presented in these other works. I think it serves as a useful primer for those new to the subject, but it may not offer enough depth for readers already familiar with these concepts.&lt;/p&gt;
&lt;h2&gt;Other quotes I liked&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;We are discouraged from saying “I don’t know” or “I’m not sure.” We regard those expressions as vague, unhelpful, and even evasive. But &lt;strong&gt;getting comfortable with “I’m not sure”&lt;/strong&gt; is a vital step to being a better decision-maker. We have to make peace with not knowing.&lt;/p&gt;
&lt;p&gt;In most of our decisions, we are not betting against another person. Rather, we are betting against all the &lt;strong&gt;future versions of ourselves&lt;/strong&gt; that we are not choosing. We are constantly deciding among alternative futures: one where we go to the movies, one where we go bowling, one where we stay home.&lt;/p&gt;
&lt;p&gt;People are credulous creatures who find it very easy to believe and very difficult to doubt. [actually citing Daniel Gilbert]&lt;/p&gt;
&lt;p&gt;Surprisingly, being smart can actually make bias worse. Let me give you a different intuitive frame: the smarter you are, the better you are at constructing a narrative that supports your beliefs, rationalizing and framing the data to fit your argument or point of view. After all, people in the “spin room” in a political setting are generally pretty smart for a reason.&lt;/p&gt;
&lt;/blockquote&gt;</content><category term="posts"></category></entry><entry><title>Is Lakehouse Monitoring worth it?</title><link href="https://www.marcosantoni.com/is-lakehouse-monitoring-worth-it.html" rel="alternate"></link><published>2025-08-23T09:41:00+02:00</published><updated>2025-08-23T09:41:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2025-08-23:/is-lakehouse-monitoring-worth-it.html</id><summary type="html">&lt;p&gt;I've created a toy Lakehouse Monitoring setup in Databricks to explore its features and capabilities. The goal is to understand how it works and what benefits it can bring. Here's an overview of what I cover in this post:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#setup-a-toy-lakehouse-monitoring"&gt;How to set up a toy Lakehouse Monitoring&lt;/a&gt;  &lt;ul&gt;
&lt;li&gt;Dashboard&lt;/li&gt;
&lt;li&gt;Alerts&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pricing"&gt;Pricing&lt;/a&gt;  &lt;/li&gt;
&lt;li&gt;&lt;a href="#my-opinion-on-what-ive-seen"&gt;My …&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;</summary><content type="html">&lt;p&gt;I've created a toy Lakehouse Monitoring in Databricks setup to explore its features and capabilities. The goal is to understand how it works and what benefits it can bring. Here's an overview of what I cover in this post:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#setup-a-toy-lakehouse-monitoring"&gt;How to Setup a toy Lakehouse Monitoring&lt;/a&gt;  &lt;ul&gt;
&lt;li&gt;Dashboard&lt;/li&gt;
&lt;li&gt;Alerts&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#pricing"&gt;Pricing&lt;/a&gt;  &lt;/li&gt;
&lt;li&gt;&lt;a href="#my-opinion-on-what-ive-seen"&gt;My opinion on what I've seen&lt;/a&gt;  &lt;/li&gt;
&lt;li&gt;&lt;a href="#wheres-databricks-going"&gt;Where's Databricks going?&lt;/a&gt;  &lt;ul&gt;
&lt;li&gt;My2C&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you want to know more about what Databricks' Lakehouse Monitoring can do, I recommend checking out the official documentation. I have prepared a basic map of concepts that can help you get started.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Lakehouse Monitoring Concepts Map" src="./images/map_of_concepts_lakehouse_monitoring.svg"&gt;&lt;/p&gt;
&lt;h2 id="setup-a-toy-lakehouse-monitoring"&gt;How to Setup a toy Lakehouse Monitoring&lt;/h2&gt;

&lt;p&gt;Let's start by creating a table we can work with. It should be a time-series table:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;timestamp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DOUBLE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I then created a basic notebook &lt;code&gt;insert 1h of data.ipynb&lt;/code&gt; to fill the table with data, and set up a job to run that notebook every hour.&lt;/p&gt;
&lt;p&gt;I won't include the notebook's code here because it is quite basic: it adds a random number of records with random values, with timestamps that fall within the one-hour window.&lt;/p&gt;
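&lt;p&gt;For readers who want a starting point, here is a rough sketch of what such a generator could look like. It is not the notebook's actual code: the function name &lt;code&gt;generate_hour_of_sales&lt;/code&gt;, the row count, and the 20 to 30 value range are assumptions (the range is only inferred from the sample rows shown in this post).&lt;/p&gt;

```python
import random
from datetime import datetime, timedelta, timezone


def generate_hour_of_sales(window_start, n_rows, seed=None):
    """Return n_rows (timestamp, amount) tuples inside the hour starting at window_start."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        # Random offset within the one-hour window, at millisecond resolution.
        offset_ms = rng.randrange(3_600_000)
        ts = window_start + timedelta(milliseconds=offset_ms)
        # Assumed value range: the sample rows in the post all fall between 20 and 30.
        amount = rng.uniform(20.0, 30.0)
        rows.append((ts, amount))
    return rows


# Hypothetical usage: one hour of data starting at 08:00 UTC.
window_start = datetime(2025, 8, 22, 8, 0, tzinfo=timezone.utc)
rows = generate_hour_of_sales(window_start, n_rows=1000, seed=42)
```

&lt;p&gt;In a Databricks notebook, the rows could then presumably be appended with something like &lt;code&gt;spark.createDataFrame(rows, "timestamp TIMESTAMP, amount DOUBLE").write.mode("append").saveAsTable("workspace.default.sales")&lt;/code&gt;.&lt;/p&gt;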
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;
&lt;span class="k"&gt;limit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;style scoped&gt;
  .table-result-container {
    max-height: 300px;
    overflow: auto;
  }
  table, th, td {
    border: 1px solid black;
    border-collapse: collapse;
  }
  th, td {
    padding: 5px;
  }
  th {
    text-align: left;
  }
&lt;/style&gt;
&lt;div class='table-result-container'&gt;&lt;table class='table-result'&gt;&lt;thead style='background-color: white'&gt;&lt;tr&gt;&lt;th&gt;timestamp&lt;/th&gt;&lt;th&gt;amount&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;2025-08-22T08:58:07.929Z&lt;/td&gt;&lt;td&gt;22.570402586080093&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2025-08-22T08:51:51.929Z&lt;/td&gt;&lt;td&gt;20.713874028846366&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2025-08-22T09:03:54.929Z&lt;/td&gt;&lt;td&gt;21.97633174572098&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2025-08-22T08:28:44.929Z&lt;/td&gt;&lt;td&gt;27.94416169489641&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2025-08-22T09:05:29.929Z&lt;/td&gt;&lt;td&gt;21.307407500066127&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2025-08-22T08:17:03.929Z&lt;/td&gt;&lt;td&gt;22.37476392747984&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2025-08-22T09:05:05.929Z&lt;/td&gt;&lt;td&gt;26.446829879517953&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2025-08-22T08:42:32.929Z&lt;/td&gt;&lt;td&gt;27.86840740526422&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2025-08-22T08:33:55.929Z&lt;/td&gt;&lt;td&gt;27.236961570798947&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2025-08-22T08:42:30.929Z&lt;/td&gt;&lt;td&gt;25.395336538015343&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Then, let's create the Monitor via Unity Catalog Explorer 👇&lt;/p&gt;
&lt;p&gt;I set up the monitor with a &lt;code&gt;TimeSeries&lt;/code&gt; profile, using the &lt;code&gt;timestamp&lt;/code&gt; column and a granularity of 1 hour. The monitor itself is scheduled to run daily.&lt;/p&gt;
&lt;p&gt;Below is a screenshot of the Unity Catalog Explorer page used to create the monitor.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of Unity Catalog Explorer page to create the Lakehouse Monitoring" src="./images/create_monitor.png"&gt;&lt;/p&gt;
&lt;p&gt;What happens after the monitor is created? By default, two new tables are created:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;table_name&amp;gt;_profile_metrics&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;table_name&amp;gt;_drift_metrics&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let's inspect them&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;TABLES&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;IN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class='table-result-container'&gt;&lt;table class='table-result'&gt;&lt;thead style='background-color: white'&gt;&lt;tr&gt;&lt;th&gt;database&lt;/th&gt;&lt;th&gt;tableName&lt;/th&gt;&lt;th&gt;isTemporary&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;default&lt;/td&gt;&lt;td&gt;sales&lt;/td&gt;&lt;td&gt;false&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;default&lt;/td&gt;&lt;td&gt;sales_drift_metrics&lt;/td&gt;&lt;td&gt;false&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;default&lt;/td&gt;&lt;td&gt;sales_profile_metrics&lt;/td&gt;&lt;td&gt;false&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;_sqldf&lt;/td&gt;&lt;td&gt;true&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales_profile_metrics&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class='table-result-container'&gt;&lt;table class='table-result'&gt;&lt;thead style='background-color: white'&gt;&lt;tr&gt;&lt;th&gt;window&lt;/th&gt;&lt;th&gt;log_type&lt;/th&gt;&lt;th&gt;logging_table_commit_version&lt;/th&gt;&lt;th&gt;monitor_version&lt;/th&gt;&lt;th&gt;granularity&lt;/th&gt;&lt;th&gt;slice_key&lt;/th&gt;&lt;th&gt;slice_value&lt;/th&gt;&lt;th&gt;column_name&lt;/th&gt;&lt;th&gt;count&lt;/th&gt;&lt;th&gt;data_type&lt;/th&gt;&lt;th&gt;num_nulls&lt;/th&gt;&lt;th&gt;avg&lt;/th&gt;&lt;th&gt;min&lt;/th&gt;&lt;th&gt;max&lt;/th&gt;&lt;th&gt;stddev&lt;/th&gt;&lt;th&gt;num_zeros&lt;/th&gt;&lt;th&gt;num_nan&lt;/th&gt;&lt;th&gt;min_length&lt;/th&gt;&lt;th&gt;max_length&lt;/th&gt;&lt;th&gt;avg_length&lt;/th&gt;&lt;th&gt;non_null_columns&lt;/th&gt;&lt;th&gt;frequent_items&lt;/th&gt;&lt;th&gt;median&lt;/th&gt;&lt;th&gt;distinct_count&lt;/th&gt;&lt;th&gt;percent_nan&lt;/th&gt;&lt;th&gt;percent_null&lt;/th&gt;&lt;th&gt;percent_zeros&lt;/th&gt;&lt;th&gt;percent_distinct&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;List(2025-08-22T08:00:00.000Z, 2025-08-22T09:00:00.000Z)&lt;/td&gt;&lt;td&gt;INPUT&lt;/td&gt;&lt;td&gt;26&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1 hour&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;:table&lt;/td&gt;&lt;td&gt;1344&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;List(timestamp, amount)&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;List(2025-08-22T08:00:00.000Z, 
2025-08-22T09:00:00.000Z)&lt;/td&gt;&lt;td&gt;INPUT&lt;/td&gt;&lt;td&gt;26&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1 hour&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;amount&lt;/td&gt;&lt;td&gt;1344&lt;/td&gt;&lt;td&gt;double&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;25.059042979855878&lt;/td&gt;&lt;td&gt;20.0007185120017&lt;/td&gt;&lt;td&gt;29.99500143216646&lt;/td&gt;&lt;td&gt;2.8817003714622857&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;25.133808306174608&lt;/td&gt;&lt;td&gt;1277&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;95.01488095238095&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;List(2025-08-22T08:00:00.000Z, 2025-08-22T09:00:00.000Z)&lt;/td&gt;&lt;td&gt;INPUT&lt;/td&gt;&lt;td&gt;26&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1 hour&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;timestamp&lt;/td&gt;&lt;td&gt;1344&lt;/td&gt;&lt;td&gt;timestamp&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;1.755850122929519E9&lt;/td&gt;&lt;td&gt;1.755853196019946E9&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;1.755851952929519E9&lt;/td&gt;&lt;td&gt;1161&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;86.38392857142857&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;List(2025-08-22T09:00:00.000Z, 2025-08-22T10:00:00.000Z)&lt;/td&gt;&lt;td&gt;INPUT&lt;/td&gt;&lt;td&gt;26&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1 
hour&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;timestamp&lt;/td&gt;&lt;td&gt;1192&lt;/td&gt;&lt;td&gt;timestamp&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;1.755853200929519E9&lt;/td&gt;&lt;td&gt;1.755856792565437E9&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;1.755854777019946E9&lt;/td&gt;&lt;td&gt;997&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;83.64093959731544&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;List(2025-08-22T09:00:00.000Z, 2025-08-22T10:00:00.000Z)&lt;/td&gt;&lt;td&gt;INPUT&lt;/td&gt;&lt;td&gt;26&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1 hour&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;:table&lt;/td&gt;&lt;td&gt;1192&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;List(timestamp, amount)&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;List(2025-08-22T09:00:00.000Z, 2025-08-22T10:00:00.000Z)&lt;/td&gt;&lt;td&gt;INPUT&lt;/td&gt;&lt;td&gt;26&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1 
hour&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;amount&lt;/td&gt;&lt;td&gt;1192&lt;/td&gt;&lt;td&gt;double&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;24.847526487373074&lt;/td&gt;&lt;td&gt;20.01063581181195&lt;/td&gt;&lt;td&gt;29.99539398790598&lt;/td&gt;&lt;td&gt;2.8880160456500867&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;24.694267025212234&lt;/td&gt;&lt;td&gt;1192&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;100.0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;List(2025-08-22T10:00:00.000Z, 2025-08-22T11:00:00.000Z)&lt;/td&gt;&lt;td&gt;INPUT&lt;/td&gt;&lt;td&gt;26&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1 hour&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;amount&lt;/td&gt;&lt;td&gt;941&lt;/td&gt;&lt;td&gt;double&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;24.969054277784952&lt;/td&gt;&lt;td&gt;20.015936515703217&lt;/td&gt;&lt;td&gt;29.981071930502402&lt;/td&gt;&lt;td&gt;2.846773237150427&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;24.969920953185955&lt;/td&gt;&lt;td&gt;925&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;98.29968119022317&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;List(2025-08-22T10:00:00.000Z, 2025-08-22T11:00:00.000Z)&lt;/td&gt;&lt;td&gt;INPUT&lt;/td&gt;&lt;td&gt;26&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1 
hour&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;timestamp&lt;/td&gt;&lt;td&gt;941&lt;/td&gt;&lt;td&gt;timestamp&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;1.755856803565437E9&lt;/td&gt;&lt;td&gt;1.755860399121952E9&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;1.755858763121952E9&lt;/td&gt;&lt;td&gt;857&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;91.07332624867162&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;List(2025-08-22T10:00:00.000Z, 2025-08-22T11:00:00.000Z)&lt;/td&gt;&lt;td&gt;INPUT&lt;/td&gt;&lt;td&gt;26&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1 hour&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;:table&lt;/td&gt;&lt;td&gt;941&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;List(timestamp, amount)&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;List(2025-08-22T11:00:00.000Z, 2025-08-22T12:00:00.000Z)&lt;/td&gt;&lt;td&gt;INPUT&lt;/td&gt;&lt;td&gt;26&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;1 
hour&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;:table&lt;/td&gt;&lt;td&gt;995&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;List(timestamp, amount)&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;profile&lt;/strong&gt; table has one row for each pair of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;window&lt;/code&gt; (the beginning and end of every hour)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;column_name&lt;/code&gt; (every column of the table). In addition, a special &lt;code&gt;:table&lt;/code&gt; row holds the table-level profile.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Optionally, it can slice on column values when specified at the time of the creation of the &lt;em&gt;Monitor&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;For each row, it computes a set of statistics such as &lt;code&gt;avg&lt;/code&gt;, &lt;code&gt;quantiles&lt;/code&gt;, &lt;code&gt;min&lt;/code&gt;, &lt;code&gt;max&lt;/code&gt;, etc. (when applicable, e.g. for float columns).&lt;/p&gt;
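&lt;p&gt;To make the idea of a per-window, per-column profile concrete, here is a minimal sketch in plain Python. It is not how Lakehouse Monitoring is implemented. The function name, the sample rows and the output keys are assumptions chosen for illustration, and it only profiles a single &lt;code&gt;amount&lt;/code&gt; column with one-hour windows.&lt;/p&gt;

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hourly_profile(rows):
    """Return {window_start: stats} for (timestamp, amount) rows, one window per hour."""
    windows = defaultdict(list)
    for ts, amount in rows:
        # Truncate the timestamp to the start of its hour to get the window key.
        windows[ts.replace(minute=0, second=0, microsecond=0)].append(amount)

    profile = {}
    for start, amounts in sorted(windows.items()):
        non_null = [a for a in amounts if a is not None]
        profile[start] = {
            "count": len(amounts),
            "num_nulls": len(amounts) - len(non_null),
            "avg": mean(non_null) if non_null else None,
            "min": min(non_null) if non_null else None,
            "max": max(non_null) if non_null else None,
        }
    return profile


# Hypothetical sample rows, not taken from the monitored table.
rows = [
    (datetime(2025, 8, 22, 10, 5), 20.0),
    (datetime(2025, 8, 22, 10, 40), 30.0),
    (datetime(2025, 8, 22, 11, 15), 25.0),
    (datetime(2025, 8, 22, 11, 50), None),
]
profile = hourly_profile(rows)
for window_start, stats in profile.items():
    print(window_start, stats)
```

&lt;p&gt;The real profile table repeats this for every column and adds many more statistics, which is why the number of columns matters for the cost.&lt;/p&gt;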
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales_drift_metrics&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;style scoped&gt;
  .table-result-container {
    max-height: 300px;
    overflow: auto;
  }
  table, th, td {
    border: 1px solid black;
    border-collapse: collapse;
  }
  th, td {
    padding: 5px;
  }
  th {
    text-align: left;
  }
&lt;/style&gt;
&lt;div class='table-result-container'&gt;&lt;table class='table-result'&gt;&lt;thead style='background-color: white'&gt;&lt;tr&gt;&lt;th&gt;window&lt;/th&gt;&lt;th&gt;granularity&lt;/th&gt;&lt;th&gt;monitor_version&lt;/th&gt;&lt;th&gt;slice_key&lt;/th&gt;&lt;th&gt;slice_value&lt;/th&gt;&lt;th&gt;column_name&lt;/th&gt;&lt;th&gt;data_type&lt;/th&gt;&lt;th&gt;window_cmp&lt;/th&gt;&lt;th&gt;drift_type&lt;/th&gt;&lt;th&gt;count_delta&lt;/th&gt;&lt;th&gt;avg_delta&lt;/th&gt;&lt;th&gt;percent_null_delta&lt;/th&gt;&lt;th&gt;percent_zeros_delta&lt;/th&gt;&lt;th&gt;percent_distinct_delta&lt;/th&gt;&lt;th&gt;non_null_columns_delta&lt;/th&gt;&lt;th&gt;js_distance&lt;/th&gt;&lt;th&gt;ks_test&lt;/th&gt;&lt;th&gt;wasserstein_distance&lt;/th&gt;&lt;th&gt;population_stability_index&lt;/th&gt;&lt;th&gt;chi_squared_test&lt;/th&gt;&lt;th&gt;tv_distance&lt;/th&gt;&lt;th&gt;l_infinity_distance&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;List(2025-08-22T11:00:00.000Z, 2025-08-22T12:00:00.000Z)&lt;/td&gt;&lt;td&gt;1 hour&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;:table&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;List(2025-08-22T10:00:00.000Z, 2025-08-22T11:00:00.000Z)&lt;/td&gt;&lt;td&gt;CONSECUTIVE&lt;/td&gt;&lt;td&gt;-418&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;List(0, 0)&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;List(2025-08-22T09:00:00.000Z, 2025-08-22T10:00:00.000Z)&lt;/td&gt;&lt;td&gt;1 hour&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;:table&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;List(2025-08-22T08:00:00.000Z, 
2025-08-22T09:00:00.000Z)&lt;/td&gt;&lt;td&gt;CONSECUTIVE&lt;/td&gt;&lt;td&gt;-152&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;List(0, 0)&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;List(2025-08-22T10:00:00.000Z, 2025-08-22T11:00:00.000Z)&lt;/td&gt;&lt;td&gt;1 hour&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;:table&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;List(2025-08-22T09:00:00.000Z, 2025-08-22T10:00:00.000Z)&lt;/td&gt;&lt;td&gt;CONSECUTIVE&lt;/td&gt;&lt;td&gt;-251&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;List(0, 0)&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;List(2025-08-22T11:00:00.000Z, 2025-08-22T12:00:00.000Z)&lt;/td&gt;&lt;td&gt;1 hour&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;timestamp&lt;/td&gt;&lt;td&gt;timestamp&lt;/td&gt;&lt;td&gt;List(2025-08-22T10:00:00.000Z, 2025-08-22T11:00:00.000Z)&lt;/td&gt;&lt;td&gt;CONSECUTIVE&lt;/td&gt;&lt;td&gt;-418&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;-5.604875005841791&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;List(2025-08-22T09:00:00.000Z, 2025-08-22T10:00:00.000Z)&lt;/td&gt;&lt;td&gt;1 
hour&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;timestamp&lt;/td&gt;&lt;td&gt;timestamp&lt;/td&gt;&lt;td&gt;List(2025-08-22T08:00:00.000Z, 2025-08-22T09:00:00.000Z)&lt;/td&gt;&lt;td&gt;CONSECUTIVE&lt;/td&gt;&lt;td&gt;-152&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;-2.742988974113132&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;List(2025-08-22T10:00:00.000Z, 2025-08-22T11:00:00.000Z)&lt;/td&gt;&lt;td&gt;1 hour&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;timestamp&lt;/td&gt;&lt;td&gt;timestamp&lt;/td&gt;&lt;td&gt;List(2025-08-22T09:00:00.000Z, 2025-08-22T10:00:00.000Z)&lt;/td&gt;&lt;td&gt;CONSECUTIVE&lt;/td&gt;&lt;td&gt;-251&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;7.4323866513561825&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;List(2025-08-22T11:00:00.000Z, 2025-08-22T12:00:00.000Z)&lt;/td&gt;&lt;td&gt;1 hour&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;amount&lt;/td&gt;&lt;td&gt;double&lt;/td&gt;&lt;td&gt;List(2025-08-22T10:00:00.000Z, 2025-08-22T11:00:00.000Z)&lt;/td&gt;&lt;td&gt;CONSECUTIVE&lt;/td&gt;&lt;td&gt;-418&lt;/td&gt;&lt;td&gt;0.013276990751066364&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;-0.21172707932451829&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;List(0.049, 
0.3829208808885818)&lt;/td&gt;&lt;td&gt;0.16058939377867895&lt;/td&gt;&lt;td&gt;0.028216393260236203&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;List(2025-08-22T10:00:00.000Z, 2025-08-22T11:00:00.000Z)&lt;/td&gt;&lt;td&gt;1 hour&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;amount&lt;/td&gt;&lt;td&gt;double&lt;/td&gt;&lt;td&gt;List(2025-08-22T09:00:00.000Z, 2025-08-22T10:00:00.000Z)&lt;/td&gt;&lt;td&gt;CONSECUTIVE&lt;/td&gt;&lt;td&gt;-251&lt;/td&gt;&lt;td&gt;0.12152779041187856&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;-1.7003188097768316&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;List(0.038, 0.4228041687817168)&lt;/td&gt;&lt;td&gt;0.14481617304902772&lt;/td&gt;&lt;td&gt;0.021676869284417942&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;List(2025-08-22T09:00:00.000Z, 2025-08-22T10:00:00.000Z)&lt;/td&gt;&lt;td&gt;1 hour&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;amount&lt;/td&gt;&lt;td&gt;double&lt;/td&gt;&lt;td&gt;List(2025-08-22T08:00:00.000Z, 2025-08-22T09:00:00.000Z)&lt;/td&gt;&lt;td&gt;CONSECUTIVE&lt;/td&gt;&lt;td&gt;-152&lt;/td&gt;&lt;td&gt;-0.21151649248280435&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;0.0&lt;/td&gt;&lt;td&gt;4.985119047619051&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;List(0.062, 0.014853915612309707)&lt;/td&gt;&lt;td&gt;0.21231645982023184&lt;/td&gt;&lt;td&gt;0.022438267042335924&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;td&gt;null&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;drift&lt;/strong&gt; table is similar to the profile table. It has one row for each pair of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;window&lt;/code&gt; (the beginning and end of every hour)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;column_name&lt;/code&gt; (every column of the table). In addition, a special &lt;code&gt;:table&lt;/code&gt; row holds the table-level drift.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition, it has the &lt;code&gt;window_cmp&lt;/code&gt; column, where &lt;em&gt;cmp&lt;/em&gt; stands for &lt;em&gt;compare&lt;/em&gt;. All the statistics are compared against another window (the previous one). There are various statistics, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;count_delta&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ks_test&lt;/code&gt;: in statistics, the Kolmogorov–Smirnov test can be used to check whether two samples come from the same distribution&lt;/li&gt;
&lt;/ul&gt;
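&lt;p&gt;As a rough illustration of what these two drift statistics measure, here is a minimal sketch in plain Python. It assumes that &lt;code&gt;count_delta&lt;/code&gt; is the row count of the current window minus the row count of the comparison window, and it computes only the two-sample Kolmogorov–Smirnov statistic (the largest gap between the two empirical distribution functions), not the p-value. The function names and the sample values are hypothetical.&lt;/p&gt;

```python
from bisect import bisect_right


def count_delta(current, previous):
    """Row-count change of the current window relative to the comparison window."""
    return len(current) - len(previous)


def ks_statistic(current, previous):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    cur, prev = sorted(current), sorted(previous)
    gap = 0.0
    # The largest gap is always reached at one of the observed values.
    for x in cur + prev:
        cdf_cur = bisect_right(cur, x) / len(cur)
        cdf_prev = bisect_right(prev, x) / len(prev)
        gap = max(gap, abs(cdf_cur - cdf_prev))
    return gap


# Hypothetical amounts observed in two consecutive windows.
previous_window = [20.1, 22.4, 25.0, 27.3, 29.9]
current_window = [20.5, 23.0, 26.1, 28.8]

print(count_delta(current_window, previous_window))
print(ks_statistic(current_window, previous_window))
```

&lt;p&gt;A statistic close to 0 means the two windows look alike, while a value close to 1 means the distributions barely overlap.&lt;/p&gt;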
&lt;h4&gt;Dashboard&lt;/h4&gt;
&lt;p&gt;Lakehouse Monitoring also automatically creates a dashboard that displays the data in these &lt;em&gt;profile and drift&lt;/em&gt; tables.&lt;/p&gt;
&lt;p&gt;😓 However, I find this dashboard too crowded and not ready to use. You need to customize it yourself.&lt;/p&gt;
&lt;h4&gt;Alerts&lt;/h4&gt;
&lt;p&gt;Monitor alerts are created and used the same way as other Databricks SQL alerts. You create a Databricks SQL query on the monitor profile metrics table or drift metrics table. You then create a Databricks SQL alert for this query.&lt;/p&gt;
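&lt;p&gt;As a sketch of what such a query could look like, the helper below builds the SQL text for an alert on the table-level row count. It reuses the drift table and column names shown above. The helper name is hypothetical, and the &lt;code&gt;start&lt;/code&gt; field of the &lt;code&gt;window&lt;/code&gt; column is an assumption about the table schema that you should check on your own table.&lt;/p&gt;

```python
def build_alert_query(drift_table, column_name=":table"):
    """Return SQL text yielding the latest consecutive-window count_delta for one column."""
    return (
        "SELECT count_delta\n"
        f"FROM {drift_table}\n"
        f"WHERE column_name = '{column_name}'\n"
        "  AND drift_type = 'CONSECUTIVE'\n"
        # Assumption: the window column is a struct with a start field.
        "ORDER BY `window`.start DESC\n"
        "LIMIT 1"
    )


query = build_alert_query("workspace.default.sales_drift_metrics")
print(query)
```

&lt;p&gt;The query only returns a value. The trigger condition (for example, a &lt;code&gt;count_delta&lt;/code&gt; below a threshold you choose) is then configured on the Databricks SQL alert itself.&lt;/p&gt;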
&lt;h2 id="pricing"&gt;Pricing&lt;/h2&gt;

&lt;p&gt;Lakehouse Monitoring is billed under a serverless jobs SKU. You can monitor its usage via the &lt;code&gt;system.billing.usage&lt;/code&gt; table or via the Usage dashboard in the account console.&lt;/p&gt;
&lt;p&gt;You need to pay attention: I expect the costs to rise for tables with a high number of columns if you don't fine-tune the monitor.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;usage_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usage_quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dbus&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;billing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;usage&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;usage_date&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;DATE_SUB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;current_date&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AND&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sku_name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;like&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;&amp;quot;%JOBS_SERVERLESS%&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AND&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;custom_tags&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;&amp;quot;LakehouseMonitoring&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;&amp;quot;true&amp;quot;&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;BY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;usage_date&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;BY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;usage_date&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class='table-result-container'&gt;&lt;table class='table-result'&gt;&lt;thead style='background-color: white'&gt;&lt;tr&gt;&lt;th&gt;usage_date&lt;/th&gt;&lt;th&gt;dbus&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;2025-08-22&lt;/td&gt;&lt;td&gt;1.852757467777777736&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;

&lt;h2 id="my-opinion-on-what-ive-seen"&gt;My opinion on what I've seen&lt;/h2&gt;

&lt;p&gt;Lakehouse monitoring is all about these two profile and drift tables. It is a kind of brute force approach that runs standardized monitoring over the specified table and stores the output in the profiling tables. Is it convenient? It depends on what you're looking for. It is not a free lunch.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pros 🟢&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It takes little &lt;strong&gt;effort&lt;/strong&gt; to set up. By default, common controls are applied to all columns in the monitored table.&lt;/li&gt;
&lt;li&gt;Most common monitoring scenarios are covered by the &lt;code&gt;TimeSeries&lt;/code&gt; profile or by the &lt;code&gt;Snapshot&lt;/code&gt; profile (I left out the ML inference profile for the sake of simplicity). The &lt;strong&gt;setup time&lt;/strong&gt; is shorter than for anything you would build yourself.&lt;/li&gt;
&lt;li&gt;You have a framework ready to use. You save the time required to design it, and you avoid reinventing the wheel. You can &lt;strong&gt;focus on your business&lt;/strong&gt; needs rather than on data engineering stuff.&lt;/li&gt;
&lt;li&gt;I like the simple but effective &lt;strong&gt;design&lt;/strong&gt; of the &lt;em&gt;drift&lt;/em&gt; metric table and of the windowing. Building something like this yourself will probably make you run into some hidden edge case (as happens any time you work with times and dates).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cons 🔴&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Once the metrics are computed in the profile and drift tables, only half of the job is done. You still have to decide &lt;strong&gt;what&lt;/strong&gt; to monitor and &lt;strong&gt;how&lt;/strong&gt; to do it. You're probably not interested in monitoring every single column in every row of the metric tables (otherwise you may be alerted by too many false alarms). Fine-tuning the actual alerts is still required, and it does not come for free.&lt;/li&gt;
&lt;li&gt;You can't know in advance the overall &lt;strong&gt;cost&lt;/strong&gt; of the monitoring. You need to try it with a realistic (production-like) scenario and check early how much you're paying. I expect it to depend mainly on&lt;ul&gt;
&lt;li&gt;the data volume&lt;/li&gt;
&lt;li&gt;the columns in the table&lt;/li&gt;
&lt;li&gt;the frequency of the controls&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="wheres-databricks-going"&gt;Where's Databricks going?&lt;/h2&gt;

&lt;p&gt;In addition to Lakehouse Monitoring, Databricks has released a feature (in Beta) of &lt;a href="https://docs.databricks.com/aws/en/lakehouse-monitoring/data-quality-monitoring"&gt;data quality monitoring&lt;/a&gt;. This new monitoring&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;is quicker to set up. It is toggled on for an entire schema and monitors all the tables in that schema.&lt;/li&gt;
&lt;li&gt;monitors only simple freshness and completeness quality controls&lt;/li&gt;
&lt;li&gt;has no parametrization&lt;/li&gt;
&lt;li&gt;still needs alerts to be set manually&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I made a short recap here.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Lakehouse Monitoring&lt;/th&gt;
&lt;th&gt;Data Quality Monitoring (Beta)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;Table. It is set at table level. It monitors the table and its columns.&lt;/td&gt;
&lt;td&gt;Schema. It is set at schema level and monitors all tables in that schema.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup&lt;/td&gt;
&lt;td&gt;Choose the profile, optional slicing, window and frequency.&lt;/td&gt;
&lt;td&gt;On-off on the schema.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What is monitored&lt;/td&gt;
&lt;td&gt;Various statistics as snapshot, time series, and inference.&lt;/td&gt;
&lt;td&gt;Freshness (is data recent?) and completeness (is the volume as expected?)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customization&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alert&lt;/td&gt;
&lt;td&gt;To be set manually on the output table.&lt;/td&gt;
&lt;td&gt;To be set manually on the output table.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h4&gt;My2C&lt;/h4&gt;
&lt;p&gt;🟢 I think Databricks is going in the right direction. Fast adoption of basic quality controls. Avoid the "&lt;em&gt;didn't notice data is old in production&lt;/em&gt;" moments with little effort.&lt;/p&gt;
&lt;p&gt;🔴 The alerting setup is still quite SQL-based and there is some trial-and-error around it. I would expect that a basic alert should be enabled by default.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Learn basics of MCP with FastMCP</title><link href="https://www.marcosantoni.com/explore_fast_mcp.html" rel="alternate"></link><published>2025-08-17T08:35:00+02:00</published><updated>2025-08-17T08:35:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2025-08-17:/explore_fast_mcp.html</id><summary type="html">&lt;p&gt;I was looking for a resource to get a deeper understanding of MCP (Model Context Protocol). Rather than looking for resources or books, I opted for RTFM. Actually not the MCP manual as I would not have fun reading protocol specs. I took &lt;a href="https://gofastmcp.com/getting-started/welcome"&gt;FastMCP&lt;/a&gt; and went through the docs. I …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I was looking for a resource to get a deeper understanding of MCP (Model Context Protocol). Rather than looking for resources or books, I opted for RTFM. Actually not the MCP manual as I would not have fun reading protocol specs. I took &lt;a href="https://gofastmcp.com/getting-started/welcome"&gt;FastMCP&lt;/a&gt; and went through the docs. I was not just reading the docs. While reading the docs, I was:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;sketching a concept map&lt;/li&gt;
&lt;li&gt;coding some basic hello-world examples&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I enjoyed this approach because it was quick (I could invest hours rather than days) while still practical and hands-on. The concept map doubles as notes for my future self. Notes you write yourself are the ones that stick best.&lt;/p&gt;
&lt;h2&gt;Concept map&lt;/h2&gt;
&lt;p&gt;I used &lt;a href="https://www.drawio.com/"&gt;draw.io&lt;/a&gt; to draw the concept map and exported to SVG for best web rendering. You can explore it here 👇&lt;/p&gt;
&lt;p&gt;&lt;img src="./images/fastmcp_concepts.svg" width="800" alt="A concept diagram of the main FastMCP Pyhon library components." /&gt;&lt;/p&gt;
&lt;h2&gt;Little codebase&lt;/h2&gt;
&lt;p&gt;I made a basic server and client with the streamable HTTP transport mode. Everything is in this &lt;a href="https://github.com/Marco-Santoni/explore-fast-mcp"&gt;repo&lt;/a&gt;. There is a basic example for each key component of an MCP server:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;tool&lt;/li&gt;
&lt;li&gt;resource&lt;/li&gt;
&lt;li&gt;prompt&lt;/li&gt;
&lt;/ul&gt;
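&lt;p&gt;FastMCP hides the wire format, but it helps to know that each of these three components maps to JSON-RPC 2.0 methods defined by the MCP specification: &lt;code&gt;tools/call&lt;/code&gt;, &lt;code&gt;resources/read&lt;/code&gt; and &lt;code&gt;prompts/get&lt;/code&gt;. The sketch below only builds the request bodies with the standard library. It is not code from the repo, and the tool name, resource URI and prompt name are hypothetical.&lt;/p&gt;

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests carry an id so responses can be matched to them.


def jsonrpc_request(method, params):
    """Build a JSON-RPC 2.0 request body of the kind an MCP client sends."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params}
    )


# Hypothetical tool, resource and prompt, chosen only for illustration.
call_tool = jsonrpc_request("tools/call", {"name": "add", "arguments": {"a": 1, "b": 2}})
read_resource = jsonrpc_request("resources/read", {"uri": "config://version"})
get_prompt = jsonrpc_request(
    "prompts/get", {"name": "summarize", "arguments": {"text": "hello"}}
)

for body in (call_tool, read_resource, get_prompt):
    print(body)
```

&lt;p&gt;With a client library you rarely write these messages by hand, but seeing them makes the split between tools, resources and prompts easier to reason about.&lt;/p&gt;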
&lt;h2&gt;Opinions about MCP&lt;/h2&gt;
&lt;p&gt;I'll share a couple of opinions I got while exploring MCP and FastMCP.&lt;/p&gt;
&lt;h3&gt;Evolving rapidly&lt;/h3&gt;
&lt;p&gt;FastMCP evolves along with the MCP specs, of course. These specs are quite recent. The first stable version came out in November 2024, while the latest (and third) came out in June 2025. I read about an important feature like &lt;em&gt;Structured Output&lt;/em&gt; and found it was only &lt;a href="https://modelcontextprotocol.io/specification/2025-06-18/changelog"&gt;a few weeks old&lt;/a&gt; at the time of my reading. It is a great sign that things are moving so fast but, at the same time, you should take this quick evolution into account if you're working on a &lt;strong&gt;production-ready&lt;/strong&gt; application.&lt;/p&gt;
&lt;p&gt;You may want to stay simple and minimize the overall engineering investment. Otherwise, you may find yourself investing in engineering features that a few months later are supported by the protocol or by the ecosystem.&lt;/p&gt;
&lt;h3&gt;Good design&lt;/h3&gt;
&lt;p&gt;I appreciated the design of the protocol and of FastMCP itself. It is simple enough and based on three elements (tools, resources, prompts), but it still covers a large share of the needs of agent applications. There are useful interfaces and features for common needs like&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;interactive input by users&lt;/li&gt;
&lt;li&gt;progress monitoring&lt;/li&gt;
&lt;li&gt;logging and messaging&lt;/li&gt;
&lt;li&gt;sampling from client's LLM&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The design is &lt;strong&gt;composable&lt;/strong&gt;, which makes it scalable for larger applications. An MCP server can literally &lt;em&gt;import&lt;/em&gt; another MCP server or mount it. Name clashes or duplicates can be handled explicitly by developers.&lt;/p&gt;
&lt;h3&gt;Rich ecosystem&lt;/h3&gt;
&lt;p&gt;FastMCP is one example of the ecosystem of tools and frameworks that is growing around MCP. The ecosystem is what matters (more than the protocol design).&lt;/p&gt;
&lt;h3&gt;Engineering is still THE thing&lt;/h3&gt;
&lt;p&gt;Building an MCP server is still an engineering and design job. Should feature X of my server be a resource? Or a tool? Should this input be parametrized? There will be no correct or wrong answers to these questions. Design styles will emerge and the touch of software architects will emerge to make sure things can scale and are easy to maintain.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Why reading Deep Work</title><link href="https://www.marcosantoni.com/deep_work_book.html" rel="alternate"></link><published>2025-04-13T19:35:00+02:00</published><updated>2025-04-13T19:35:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2025-04-13:/deep_work_book.html</id><summary type="html">&lt;blockquote&gt;
&lt;p&gt;Shallow Work: Noncognitively demanding, logistical-style tasks, often performed while distracted. These efforts tend not to create much new value in the world and are easy to replicate.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cal Newport gives this definition of how I spend part of my work-time in his &lt;em&gt;Deep Work&lt;/em&gt; book. Checking emails, messaging on Teams …&lt;/p&gt;</summary><content type="html">&lt;blockquote&gt;
&lt;p&gt;Shallow Work: Noncognitively demanding, logistical-style tasks, often performed while distracted. These efforts tend not to create much new value in the world and are easy to replicate.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cal Newport gives this definition of how I spend part of my work-time in his &lt;em&gt;Deep Work&lt;/em&gt; book. Checking emails, messaging on Teams, and attending meetings are a few examples of such "noncognitively demanding" activities. On some days, these activities may even fill up my work-time. How much space is left for cognitively intensive tasks? When I have space for a few consecutive hours of highly focused work, I have the feeling that what I'm doing is actually valuable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I'm not alone&lt;/strong&gt;. Talking with my peers, I understood that almost everyone is facing the same issue: days full of unfocused, undemanding tasks. I'm part of what Newport defines as "knowledge workers". The output of our work is neither a physical artifact nor a service that is directly customer-facing. The paradox is the following. Knowledge workers give their best when they have space for focus; however, organizations adopt tools for instant communication that are designed to drain such attention.&lt;/p&gt;
&lt;p&gt;&lt;img src="./images/bookshelf/deep_work.jpg" width="200" alt="Cover of the book Deep Work by Cal Newport." /&gt;&lt;/p&gt;
&lt;h2&gt;Deep Work Hypothesis&lt;/h2&gt;
&lt;p&gt;Is it de facto a problem? Or is it actually an &lt;strong&gt;opportunity&lt;/strong&gt; for those who are aware of it? Here comes the interesting point made by Newport. He defines the "Deep Work Hypothesis" as:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The ability to perform deep work is becoming increasingly rare at exactly the same time it is becoming increasingly valuable in our economy. As a consequence, the few who cultivate this skill, and then make it the core of their working life, will thrive.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The claim is that we operate in a distracting work environment. At the same time, the job market requires continuous learning and high specialization. Both of these requirements need deep focus, which is harder to achieve today. People who can shield themselves from distractions and dedicate time to focused work are the ones who will thrive.&lt;/p&gt;
&lt;p&gt;&lt;img src="./images/work_schedule.png" width="300" alt="A visual representation of a typical work schedule highlighting interruptions and shallow work." /&gt;&lt;/p&gt;
&lt;h2&gt;Busyness&lt;/h2&gt;
&lt;p&gt;What are the symptoms of a lack of deep work in your daily job? If you consider yourself a knowledge worker, spending most of your time answering emails or instant messages is an indicator that deep work is lacking. Why is it so common in most companies then? Newport introduces the &lt;em&gt;Principle of Least Resistance&lt;/em&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In a business setting, without clear feedback on the impact of various behaviors to the bottom line, we will tend toward behaviors that are &lt;strong&gt;easiest in the moment&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So, why do many workers (especially in large organizations) spend so much time in low-intensity activities (like emailing or chatting)? Because it is easy. This is the short answer given by the &lt;em&gt;Principle of Least Resistance&lt;/em&gt;. Answering emails or messages gives immediate feedback to the worker, while long and focused activities likely lack such quick feedback. Our brain is tempted by quick feedback. Resisting the quick feedback of going through email requires a greater effort.&lt;/p&gt;
&lt;p&gt;This workday schedule has two drawbacks. First, continuous interruptions and context switching reduce focus and attention. Low levels of cognitive intensity reduce the value generated by such activities. Second, workers are unhappy with the work they are doing. Newport cites the psychologist Csikszentmihalyi, whose studies demonstrated that, surprisingly, we are most satisfied when we're given difficult tasks to accomplish rather than when relaxing.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The best moments usually occur when a person’s body or mind is stretched to its limits in a voluntary effort to accomplish something difficult and worthwhile.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Budget of willpower&lt;/h2&gt;
&lt;p&gt;We cannot just decide to concentrate and expect it to happen. It just does not work. We have a limited amount of willpower. It decreases when we use it. For simplicity, we can consider it as a daily budget of willpower. The key recommendation by Newport is the following. Build a set of routines and rituals that help you develop deep work habits. By doing so, you minimize the amount of willpower you need to use to focus. The less willpower you use for each focus session, the more you save for focusing during the rest of the day.&lt;/p&gt;
&lt;p&gt;Newport describes four approaches to building such routines:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;monastic philosophy&lt;/li&gt;
&lt;li&gt;bimodal philosophy&lt;/li&gt;
&lt;li&gt;rhythmic philosophy&lt;/li&gt;
&lt;li&gt;journalistic philosophy&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I'll not go through describing each of them. You should read the book to get a full picture of them.&lt;/p&gt;
&lt;p&gt;&lt;img src="./images/darwin_sandwalk.jpg" width="400" alt="Charles Darwin's Sandwalk, a path he used for focused thinking and reflection." /&gt;&lt;/p&gt;
&lt;p&gt;I'll share an example to help you get a taste of what it means to build your "philosophy" of routines. The book gives the example of the workday schedule of Charles Darwin (you can get more details &lt;a href="https://www.darwinproject.ac.uk/commentary/curious/darwin-and-working-home"&gt;here&lt;/a&gt;).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Charles Darwin had a similarly strict structure for his working life during the period when he was perfecting On the Origin of Species. As his son Francis later remembered, he would rise promptly at seven to take a short walk. He would then eat breakfast alone and retire to his study from eight to nine thirty. The next hour was dedicated to reading his letters from the day before, after which he would return to his study from ten thirty until noon. After this session, he would mull over challenging ideas while walking on a prescribed route that started at his greenhouse and then circled a path on his property. He would walk until satisfied with his thinking then declare his workday done.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;What I do&lt;/h2&gt;
&lt;p&gt;To cultivate the habit of deep work and improve my focus, I have set the following goals for myself:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Focused Activities Early Morning or Weekends&lt;/strong&gt;&lt;br&gt;
    I aim to dedicate the early hours of my workdays or weekends to cognitively demanding tasks. These are the times when my mind is fresh, and distractions are minimal. By prioritizing deep work during these periods, I can make significant progress on challenging projects.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Turn Off Popups and Notifications on Desktop&lt;/strong&gt;&lt;br&gt;
    To shield myself from distractions, I will disable all unnecessary popups and notifications on my desktop. This includes email alerts, instant messaging notifications, and other interruptions that can break my focus. Creating a distraction-free environment is essential for maintaining deep concentration.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Emails in Late Afternoon or Evenings&lt;/strong&gt;&lt;br&gt;
    I will reserve time for checking and responding to emails in the late afternoon or evenings. This ensures that my most productive hours are not consumed by shallow work. By batching email tasks, I can handle them more efficiently without constant context switching.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Take Notes During Meetings to Help Focus&lt;/strong&gt;&lt;br&gt;
    During meetings, I will take detailed notes to stay engaged and focused. This practice not only helps me retain important information but also prevents my mind from wandering. It ensures that I am fully present and can contribute meaningfully to discussions.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By adhering to these goals, I aim to build a sustainable routine that supports deep work and enhances the quality of my output.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Careers in data and AI: seminar and podcast</title><link href="https://www.marcosantoni.com/seminar_careers.html" rel="alternate"></link><published>2024-06-30T06:41:00+02:00</published><updated>2024-06-30T06:41:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2024-06-30:/seminar_careers.html</id><summary type="html">&lt;p&gt;I celebrated 10 years since my graduation by trying to give back to students something I learned about our industry. I was invited by Prof. &lt;a href="https://www.mat.unical.it/calimeri/"&gt;Francesco Calimeri&lt;/a&gt; from UniCal to hold a seminar for students in the last year of the MSc in Computer Science and AI. A couple of pictures from …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I celebrated 10 years since my graduation by trying to give back to students something I learned about our industry. I was invited by Prof. &lt;a href="https://www.mat.unical.it/calimeri/"&gt;Francesco Calimeri&lt;/a&gt; from UniCal to hold a seminar for students in the last year of the MSc in Computer Science and AI. A couple of pictures from the event here 👇&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.marcosantoni.com/images/careers_data_ai_1.jpg" width="400" /&gt;
&lt;img src="https://www.marcosantoni.com/images/careers_data_ai_2.jpg" width="400" /&gt;&lt;/p&gt;
&lt;p&gt;I then took the content of my presentation and organized a panel with 2 special guests (Paolo Platter and Alberto Danese) at Intervista Pythonista. This is what came out of it:&lt;/p&gt;
&lt;iframe style="border-radius:12px" src="https://open.spotify.com/embed/episode/7yX22Ixbe1Us74G4srANHs?utm_source=generator&amp;theme=0" width="100%" height="152" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"&gt;&lt;/iframe&gt;</content><category term="posts"></category></entry><entry><title>Organizing a conference: Py4AI</title><link href="https://www.marcosantoni.com/py4ai2024.html" rel="alternate"></link><published>2024-03-25T09:21:00+01:00</published><updated>2024-03-25T09:21:00+01:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2024-03-25:/py4ai2024.html</id><summary type="html">&lt;p&gt;I had a lot of fun (and work) over the last 6 months working on organizing a conference, Py4AI. The conference was held in Pavia, Italy, on March 16th, 2024. It was a success, with over 200 attendees and 12 speakers. The conference was organized by a group of …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I had a lot of fun (and work) over the last 6 months working on organizing a conference, Py4AI. The conference was held in Pavia, Italy, on March 16th, 2024. It was a success, with over 200 attendees and 12 speakers. The conference was organized by a group of volunteers, including myself, Alessandro Ferrari, Pietro Peterlongo, Cesare Placanica, Thao Hoang, and Luca Baggi. Here is a screenshot of the speaker lineup:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://www.marcosantoni.com/images/py4ai_speakers.png" width="600" /&gt;&lt;/p&gt;
&lt;p&gt;You can find &lt;a href="https://youtube.com/playlist?list=PL0RwQVm3YPu5k9iIaQUehwgh2M1DgKWaT&amp;amp;si=O3AhF3JMpTA3iVZ3"&gt;here&lt;/a&gt; the YouTube playlist with the talks delivered at the conference.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Reading "The Design of Web APIs"</title><link href="https://www.marcosantoni.com/design_web_api.html" rel="alternate"></link><published>2023-08-27T07:35:00+02:00</published><updated>2023-08-27T07:35:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2023-08-27:/design_web_api.html</id><summary type="html">&lt;p&gt;Why bother reading a book about the design of web APIs when you work in data science like I do? I found this book called &lt;em&gt;The Design of Web APIs&lt;/em&gt; by &lt;a href="https://apihandyman.io/about/"&gt;Arnaud Lauret&lt;/a&gt; and decided to give it a try.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The Design of Web APIs" src="https://www.marcosantoni.com/images/bookshelf/webapi.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;Why read it&lt;/h2&gt;
&lt;p&gt;Data science is shifting towards turning models and solutions …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Why bother reading a book about the design of web APIs when you work in data science like I do? I found this book called &lt;em&gt;The Design of Web APIs&lt;/em&gt; by &lt;a href="https://apihandyman.io/about/"&gt;Arnaud Lauret&lt;/a&gt; and decided to give it a try.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The Design of Web APIs" src="https://www.marcosantoni.com/images/bookshelf/webapi.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;Why read it&lt;/h2&gt;
&lt;p&gt;Data science is shifting towards turning models and solutions into API products. And this can be true not only when you develop a product you actually sell publicly. Even when developing a data product inside an organization, you may want to expose your data service via APIs.&lt;/p&gt;
&lt;p&gt;So, when you get to the point of developing an API, there are plenty of design decisions to make (eg which routes to expose, which result codes, which response payload, etc.). If you start this design process without some guidelines, you may spend a lot of energy trying to answer these design questions, or even risk introducing technical debt that you will pay for later.&lt;/p&gt;
&lt;h2&gt;What is the book about&lt;/h2&gt;
&lt;p&gt;The book states that, when you take the role of an &lt;em&gt;API designer&lt;/em&gt;, you are just like a designer of real-world objects. An API is made for users and should help &lt;strong&gt;them&lt;/strong&gt; to achieve &lt;strong&gt;their&lt;/strong&gt; goals. API designers should prevent internal details of the backend from affecting the design of the APIs. The focus of the designer is to simplify the job of the consumer.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Usability is what distinguishes awesome APIs from mediocre or passable ones.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The book is focused on shifting your point of view &lt;strong&gt;from the provider to the consumer&lt;/strong&gt;. It may seem obvious and everyone might agree on it, but it is not that straightforward to make it happen, because we may have biases or we may take design shortcuts that simplify the development of our backend. The author introduces methods like the &lt;em&gt;API goals canvas&lt;/em&gt; to help us list the needs of the user and focus on them.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Example of design tradeoffs" src="https://www.marcosantoni.com/images/api_design_tradeoff.png"&gt;&lt;/p&gt;
&lt;p&gt;In the book you will find charts or schemas like the one above that explain the design choices you can face on a daily basis. In this example, you may want to stick to full REST compliance with a &lt;code&gt;POST /orders&lt;/code&gt;. Or you may want to relax this constraint via a non-REST design like &lt;code&gt;POST /cart/check-out&lt;/code&gt; that might actually be more intuitive for the consumer developers.&lt;/p&gt;
&lt;h2&gt;And more technicalities&lt;/h2&gt;
&lt;p&gt;The book has a focus on these design choices (eg the &lt;em&gt;resource expansion&lt;/em&gt; pattern for nested objects in API responses), but it is also a good source of knowledge about some technical details around APIs that you can use on a daily basis. For example, you will read chapters about&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;OpenAPI Specification&lt;/li&gt;
&lt;li&gt;OAuth2&lt;/li&gt;
&lt;li&gt;features of HTTP you might not be using (eg there are around 200 different standard HTTP headers)&lt;/li&gt;
&lt;li&gt;data format standards like &lt;em&gt;ISO 4217&lt;/em&gt; for currencies or &lt;em&gt;ISO 8601&lt;/em&gt; for date and time-related data&lt;/li&gt;
&lt;li&gt;etc.&lt;/li&gt;
&lt;/ul&gt;
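&lt;p&gt;To give a taste of the last standards in the list, here is a minimal Python sketch (not taken from the book) of a response payload that uses ISO 4217 for the currency and ISO 8601 for the timestamp. The &lt;code&gt;build_order_payload&lt;/code&gt; helper, its field names and the values are hypothetical, made up for illustration only.&lt;/p&gt;

```python
import json
from datetime import datetime, timedelta, timezone


def build_order_payload(amount, currency, created_at):
    """Build a JSON-ready dict for a hypothetical order resource."""
    return {
        # Amount kept as a string to avoid floating point surprises
        "amount": amount,
        # ISO 4217: three-letter alphabetic currency code, e.g. "EUR"
        "currency": currency.upper(),
        # ISO 8601: date, time and UTC offset in one unambiguous string
        "created_at": created_at.isoformat(),
    }


# Illustrative values only: a timestamp with a +02:00 UTC offset
created = datetime(2023, 8, 27, 7, 35, tzinfo=timezone(timedelta(hours=2)))
payload = build_order_payload("19.99", "eur", created)
print(json.dumps(payload, indent=2))
```

&lt;p&gt;A consumer can parse the timestamp back with &lt;code&gt;datetime.fromisoformat&lt;/code&gt; without guessing the date order or the time zone, which is the kind of usability gain these standards are meant to provide.&lt;/p&gt;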
&lt;h2&gt;All about dev experience&lt;/h2&gt;
&lt;p&gt;It is the first book I have read that is entirely dedicated to developer experience. How can we improve the productivity and the overall satisfaction of the developers using our APIs?&lt;/p&gt;
&lt;p&gt;By reading it, you can learn an approach that goes beyond designing web APIs. You learn to focus on what simplifies the life of a developer, and I'm sure this thinking has an effect on how you write your code, your internal tools or even your docs.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Expectations from a Data Analyst</title><link href="https://www.marcosantoni.com/expectations_from_data_analyst.html" rel="alternate"></link><published>2023-08-07T07:35:00+02:00</published><updated>2023-08-07T07:35:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2023-08-07:/expectations_from_data_analyst.html</id><summary type="html">&lt;p&gt;When you work as a data analyst or data scientist (I'll use the terms interchangeably) in a company, you may not be training predictive models every single day. A significant (and often interesting) part of your job is answering business questions via data mining regardless of whether you do it with …&lt;/p&gt;</summary><content type="html">&lt;p&gt;When you work as a data analyst or data scientist (I'll use the terms interchangeably) in a company, you may not be training predictive models every single day. A significant (and often interesting) part of your job is answering business questions via data mining regardless of whether you do it with machine learning, descriptive statistics or anything else. You may start with a business question like:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;why are our revenues increasing in the last quarter?&lt;/p&gt;
&lt;p&gt;what are common patterns between our loyal customers?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Such simple questions often require complex work that goes beyond knowing statistics well. You need to know how your business works and what the expectations of your stakeholders are.&lt;/p&gt;
&lt;h2&gt;What the analyst enjoys the most (and the least)&lt;/h2&gt;
&lt;p&gt;Once a business question has arrived, where do we start? Most data analysts would dive into the &lt;strong&gt;data exploration&lt;/strong&gt; phase. This is usually the first phase of the activity, where the data analyst looks into distributions and patterns in the data. The goal here is to get a good understanding of the data we are sitting on. And usually the data analyst has fun during this data exploration time.🎉🙌 He or she is playing with charts and with some statistics from the dataset.&lt;/p&gt;
&lt;p&gt;What does the data analyst usually &lt;strong&gt;not&lt;/strong&gt; enjoy doing? 👎😭 Based on my experience, preparing the &lt;strong&gt;presentation&lt;/strong&gt; about the results of the analysis is the part of the activity that most data analysts enjoy the least. And what does it imply?&lt;/p&gt;
&lt;p&gt;&lt;img alt="Time dedicated to slides" src="https://www.marcosantoni.com/images/time_dedicated_to_slides_small.png"&gt;&lt;/p&gt;
&lt;p&gt;Imagine you have &lt;strong&gt;10 days&lt;/strong&gt; to work on this data analysis before the meeting with your stakeholders. As our dear data analysts enjoy playing with the data more than playing with PowerPoint, they would probably spend 9 days mining the data and 1 day working on the presentation. And the 9 days probably do not depend on the actual complexity of the task. If the business question can be answered in 5 days with some basic descriptive statistics, the data analysts would probably invest more and more time trying some more advanced modelling technique or some fancier data visualization. Why? Because they enjoy it. So, the data analysis part of the activity fills all the available space, like a gas in a room would do.&lt;/p&gt;
&lt;p&gt;The last day (if not the very last hours) is usually left for working on the presentation.&lt;/p&gt;
&lt;h2&gt;The wrong interpretation of the role&lt;/h2&gt;
&lt;p&gt;I used to expect a data analyst to focus on the data mining, and that looked fine to me. She/he would share the data with other stakeholders (eg marketing staff), and THEY would get the insights because THEY are the domain experts. The data scientist would get the data and let the data talk, and the business stakeholders would read the insights.&lt;/p&gt;
&lt;p&gt;I thought it was OK to present an exploratory analysis. And I was wrong.&lt;/p&gt;
&lt;h2&gt;Explanatory over exploratory&lt;/h2&gt;
&lt;p&gt;Recently, I read &lt;a href="https://www.storytellingwithdata.com/books"&gt;Storytelling with Data&lt;/a&gt; by Cole Nussbaumer Knaflic. The author explains why data scientists should show &lt;strong&gt;explanatory&lt;/strong&gt; analyses (rather than exploratory).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you are the one analyzing and communicating the data, you likely know it best—you are a subject matter expert. This puts you in a unique position to interpret the data and help lead people to understanding and action.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Once the exploratory data mining phase is over, the data analyst should take the time and the effort to &lt;strong&gt;interpret&lt;/strong&gt; the data. She/he should turn the data into information that can answer the need of the audience.&lt;/p&gt;
&lt;p&gt;Why is it hard? We often believe that the audience is the subject matter expert and knows what the valuable information behind the data actually is. That's why working on the explanatory phase is an uncomfortable zone for a data scientist, but she/he should feel confident in making recommendations and observations.&lt;/p&gt;
&lt;p&gt;If we entitle a data analyst to interpret the business insights of the data, there are at least 2 things he/she should take into consideration:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;take enough &lt;strong&gt;time to interpret&lt;/strong&gt; the data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;review the data visualizations&lt;/strong&gt; to explicitly communicate his/her interpretation&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Regarding the 1st point, looking for business insights is surprisingly time consuming. You cannot just dedicate the very last hours of your activity to looking for explanations behind data patterns. We should probably reconsider the classic sequential approach to the activity (eg explore, explain, present) in favor of an approach that organizes our time around quick iterations on business hypotheses, repeated multiple times before concluding our activity.&lt;/p&gt;
&lt;p&gt;Regarding the 2nd point, I'll go a bit deeper with an example.&lt;/p&gt;
&lt;h2&gt;Example: review your data viz&lt;/h2&gt;
&lt;p&gt;You can of course find many examples in Knaflic's book. Let's look at one I picked from &lt;a href="https://www.storytellingwithdata.com/makeovers"&gt;her website&lt;/a&gt;. Imagine we're working in a hospital and are analyzing lengths of hospital stays after a surgery. For each stay of year 2019, we're given&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the quarter of the year&lt;/li&gt;
&lt;li&gt;the length of stay&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When a data analyst is done with the first &lt;em&gt;exploratory&lt;/em&gt; analysis, what could be the output?&lt;/p&gt;
&lt;p&gt;&lt;img alt="Exploratory data analysis chart" src="https://www.marcosantoni.com/images/surgery_data_exploratory_small.png"&gt;&lt;/p&gt;
&lt;p&gt;In this chart, data is presented to the audience. However, how easy is it to get valuable information out of it? You may notice some patterns (eg an increase in the frequency of &lt;code&gt;&amp;lt;=24&lt;/code&gt; stays over the year), but finding them requires quite some cognitive effort.&lt;/p&gt;
&lt;p&gt;What if the data analyst took on the effort of extracting the information out of the data? How should the presentation be revisited? She/he should be confident in highlighting what's actually valuable in the data and focus the attention of the reader on that.&lt;/p&gt;
&lt;p&gt;In this example, the data analyst can make the key information explicit. She/he can find out that the &lt;code&gt;&amp;lt;=24&lt;/code&gt; stays have increased over the year and could know that this is considered a success. Why not emphasize it on the chart?&lt;/p&gt;
&lt;p&gt;Let's see what an &lt;em&gt;explanatory&lt;/em&gt; chart would look like.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Explanatory data analysis chart" src="https://www.marcosantoni.com/images/surgery_data_explanatory_small.png"&gt;&lt;/p&gt;
&lt;p&gt;The chart now has a clear message that is stated in the title and fully described in the text next to the actual plot. The new chart looks clean because any visual component that does not help convey the chosen message is either hidden or grayed out. The data analyst in this case has focused on explaining why 2019 was a success rather than showing plain data. That's why the bars of the &lt;code&gt;&amp;lt;=24&lt;/code&gt; stays are highlighted in black, while, in contrast, the remaining bars are grayed out. The choice of colors draws the reader's attention to the patterns and the signal, rather than to the data itself.&lt;/p&gt;
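&lt;p&gt;As a rough idea of how this highlighting could be done in code, here is a minimal Python sketch. It is my own illustration, not the code behind the chart above: the &lt;code&gt;bar_colors&lt;/code&gt; helper and the category labels are hypothetical placeholders.&lt;/p&gt;

```python
def bar_colors(categories, highlight, accent="black", muted="lightgray"):
    """Return one color per bar: accent for the highlighted category, muted otherwise."""
    return [accent if category == highlight else muted for category in categories]


# Hypothetical length-of-stay buckets; the labels are placeholders, not the real data
categories = ["short stay", "medium stay", "long stay"]
colors = bar_colors(categories, highlight="short stay")
print(colors)

# The list can then be handed to a plotting library, for example as the
# color argument of matplotlib's Axes.bar, together with a title that states the message.
```

&lt;p&gt;Keeping the color choice in a small function makes the message of the chart an explicit decision rather than a side effect of the default palette.&lt;/p&gt;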
&lt;h2&gt;Looking forward&lt;/h2&gt;
&lt;p&gt;This article is mainly inspired by Knaflic's book and by my experience in interacting with stakeholders over the last years. I haven't researched the literature on the topic, so please consider this article as a set of opinionated recommendations on how a data scientist could maximize his/her impact when working on data mining activities. Agreeing with this approach means that a data scientist should dedicate energy to&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;getting a deep knowledge of the business of the company she/he works at and the market where it competes&lt;/li&gt;
&lt;li&gt;fine-tuning and improving the data visualizations she/he produces by iterating over and over on them (not stopping at the default chart styles generated by statistical software)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These thoughts do not apply fully to every single company of course. They make sense in teams or companies where data scientists spend a part of their time on data exploration and data mining activities to answer questions that business stakeholders ask them. I would appreciate any feedback or thoughts you have on it!&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Guest at DaGrande podcast</title><link href="https://www.marcosantoni.com/guest_at_dagrande_podcast.html" rel="alternate"></link><published>2023-07-23T06:35:00+02:00</published><updated>2023-07-23T06:35:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2023-07-23:/guest_at_dagrande_podcast.html</id><summary type="html">&lt;p&gt;I was recently a guest on a new podcast called "DaGrande". The podcast was launched by &lt;a href="https://www.linkedin.com/in/stefano-bosisio1/"&gt;Stefano Bosisio&lt;/a&gt; and aims at helping students who are about to finish their studies. "DaGrande" consists of a series of interviews where professionals from a variety of industries share tips or insights about careers that …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I was recently a guest on a new podcast called "DaGrande". The podcast was launched by &lt;a href="https://www.linkedin.com/in/stefano-bosisio1/"&gt;Stefano Bosisio&lt;/a&gt; and aims at helping students who are about to finish their studies. "DaGrande" consists of a series of interviews where professionals from a variety of industries share tips or insights about careers that they would have loved to hear when they were younger (eg when still at university).&lt;/p&gt;
&lt;p&gt;I was the one interviewed in the second episode (see below), and I shared my advice for starting a career in the Data and AI world.&lt;/p&gt;
&lt;iframe style="border-radius:12px" src="https://open.spotify.com/embed/episode/0ePWYBwwqq71hGJw4H0woX?utm_source=generator&amp;theme=0" width="100%" height="352" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"&gt;&lt;/iframe&gt;</content><category term="posts"></category></entry><entry><title>Learning by teaching</title><link href="https://www.marcosantoni.com/learning_by_teaching.html" rel="alternate"></link><published>2023-01-28T09:35:00+01:00</published><updated>2023-01-28T09:35:00+01:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2023-01-28:/learning_by_teaching.html</id><summary type="html">&lt;p&gt;The picture below was taken just at the beginning of the exam for the course called &lt;em&gt;Apache Spark for Data Analysis&lt;/em&gt; at &lt;a href="https://www.itsrizzoli.it/en/home-en/"&gt;ITS Rizzoli&lt;/a&gt; in Milan in November 2022. I was the one taking the picture because I was actually the lecturer of this course. In this post, I'll tell …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The picture below was taken just at the beginning of the exam for the course called &lt;em&gt;Apache Spark for Data Analysis&lt;/em&gt; at &lt;a href="https://www.itsrizzoli.it/en/home-en/"&gt;ITS Rizzoli&lt;/a&gt; in Milan in November 2022. I was the one taking the picture because I was actually the lecturer of this course. In this post, I'll tell you why I ended up teaching this course.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The day of the exam" src="https://www.marcosantoni.com/images/spark_exam.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;Learning&lt;/h2&gt;
&lt;p&gt;I have been working daily with Apache Spark for three years so far, and I've been implementing a variety of batch and streaming data transformations with it. I felt I knew the basics of the framework well enough to be autonomous in creating new jobs. However, I wanted to go deeper in understanding how Spark works and what the best practices to follow (or the antipatterns to avoid) are.&lt;/p&gt;
&lt;p&gt;Rather than studying a book about Spark by myself or something like that, I asked myself: "&lt;em&gt;why not teach an introductory course&lt;/em&gt;"? And that was actually a good idea. I found that &lt;strong&gt;teaching&lt;/strong&gt; has been an extremely effective way to &lt;strong&gt;learn&lt;/strong&gt;. My course consisted of 44 hours of training spanning 11 lessons over 2 months. While that may not look like much, preparing 44 hours of training material and designing the lessons requires thorough preparation on the topic you are teaching. I decided to design the course with more practice than theory and with plenty of live coding.&lt;/p&gt;
&lt;p&gt;So, preparing this course has been an amazing opportunity to actually learn how Apache Spark works. After the end of the course, I have the impression I've truly improved my coding skills in PySpark way more than what I would have achieved by any dedicated training.&lt;/p&gt;
&lt;h2&gt;Impact&lt;/h2&gt;
&lt;p&gt;The &lt;a href="https://en.wikipedia.org/wiki/Istituto_tecnico_superiore"&gt;ITS&lt;/a&gt; is a 2-year technical school for 19-20 year old students. It is an alternative to academic studies, and it is designed to be shorter and with a technical focus. These ITS schools often focus on areas or skills that are in high demand in the job market. Therefore, students are often able to find their first job soon.&lt;/p&gt;
&lt;p&gt;My goal was to help young developers learn a key technology like Spark. Knowing Spark is almost a requirement for applying to Data Engineer positions, and the Data Engineer role is one of the most in-demand in the tech job market. So, I decided to design the course I would have liked to follow 3 years ago to speed up learning these skills. I liked the idea of contributing to supporting these young developers in finding their first job in the Data domain.&lt;/p&gt;
&lt;h2&gt;Teaching&lt;/h2&gt;
&lt;p&gt;Knowing a topic or a technology does not mean you are able to teach it. Teaching is a complex task where you cannot take anything for granted and need to find the right pace for the class. In a class of 23 students, I found a variety of expertise levels and backgrounds, meaning that you need to balance them to teach at the right rhythm.&lt;/p&gt;
&lt;p&gt;Another challenge is how to keep a lecture from being boring and to have a good mix of theory and practice, because you'll find both students that look for more of one and students that look for more of the other. Teaching this course was then also an opportunity to improve my teaching skills, and these are not skills that you apply only during lectures. They are actually communication skills that you can apply and distill on a daily basis when working.&lt;/p&gt;
&lt;h2&gt;Revenue&lt;/h2&gt;
&lt;p&gt;I liked the idea of having a small second revenue, and, before starting to prepare the course, I thought teaching would be a great idea because I would be paid to learn. The salary of a lecturer in the tech domain can vary a lot depending on the context, but it generally ranges from 40 to 200 euro per hour (this is not an official statistic, it's just an approximation). However, this salary does &lt;strong&gt;not account&lt;/strong&gt; for the preparation of the training. So, is &lt;em&gt;revenue&lt;/em&gt; actually a good reason to teach? Probably not if you give this course only once or twice. The effort of the preparation is so large that the revenue will not compensate for it. If instead you have the opportunity to repeat the same training over and over, then it starts to make sense on the economic side too.&lt;/p&gt;
&lt;h2&gt;Opportunity&lt;/h2&gt;
&lt;p&gt;Three years ago I just would not have had the time to prepare a course like this. Why? I used to spend 2 hours per day commuting. I now have instead the chance to work from home quite often, and this gives me 1-2 hours of extra life. I was then able to prepare the course material incrementally over a couple of months before the course started.&lt;/p&gt;
&lt;p&gt;The opportunity came when I &lt;a href="https://open.spotify.com/episode/4OWbyxGWcEPcQULpNTiNqU"&gt;interviewed Andrea Biancini&lt;/a&gt; on the Intervista Pythonista podcast. Thanks to him, I learned a bit more about the tech education and training world and heard about ITS for the first time. Then, the idea stuck in my head because I was looking forward to experiencing being a lecturer for the first time.&lt;/p&gt;
&lt;h2&gt;Course design&lt;/h2&gt;
&lt;p&gt;When you prepare a course, the nice part is that you can actually design the course you would like to attend. My course then consisted mainly of live coding sessions that started with a brief introduction of a topic (eg Spark APIs, Streaming, etc) and then ended with an exercise on that topic that the students could try to solve. I decided to &lt;a href="https://github.com/Marco-Santoni/databricks-from-scratch/tree/main/training-spark"&gt;open source&lt;/a&gt; the training material I prepared so that any other student or teacher may benefit from it when needed. To simplify the course setup, I ran the coding sessions on &lt;a href="https://community.cloud.databricks.com/login.html"&gt;Databricks community edition&lt;/a&gt; so that students only needed a browser and an internet connection to work on a Spark cluster.&lt;/p&gt;
&lt;p&gt;What helped the design of the course was adopting a textbook. Having a textbook speeds up the design of the contents of the course and gives the students a reference resource in case they want to go deeper on the topic. I chose &lt;a href="https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf"&gt;Learning Spark&lt;/a&gt; second edition by Damji et al., which is made freely available by Databricks.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Speaker at PyCon IT 2022</title><link href="https://www.marcosantoni.com/pyconit_2022.html" rel="alternate"></link><published>2022-08-05T06:41:00+02:00</published><updated>2022-08-05T06:41:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2022-08-05:/pyconit_2022.html</id><summary type="html">&lt;p&gt;I went back to PyCon IT 2022 in Florence in June. I gave one talk called &lt;em&gt;Why Is Our Project Late?&lt;/em&gt; where I introduced the mental and statistical biases that lead us to make wrong estimates when making a plan.&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/zcDQwIQQwR4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;

&lt;p&gt;Furthermore, we held a live session of the &lt;a href="https://intervistapythonista.com/"&gt;Intervista Pythonista&lt;/a&gt; podcast interviewing …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I went back to PyCon IT 2022 in Florence in June. I gave one talk called &lt;em&gt;Why Is Our Project Late?&lt;/em&gt; where I introduced the mental and statistical biases that lead us to make wrong estimates when making a plan.&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/zcDQwIQQwR4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;

&lt;p&gt;Furthermore, we held a live session of the &lt;a href="https://intervistapythonista.com/"&gt;Intervista Pythonista&lt;/a&gt; podcast interviewing Fabio Pliger, the creator of PyScript.&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/l5-ecdsBaHE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;</content><category term="posts"></category></entry><entry><title>My Webinar on Databricks and PySpark</title><link href="https://www.marcosantoni.com/webinar_databricks_biella.html" rel="alternate"></link><published>2022-06-11T19:35:00+02:00</published><updated>2022-06-11T19:35:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2022-06-11:/webinar_databricks_biella.html</id><content type="html">&lt;p&gt;I was invited by &lt;a href="https://pythonbiellagroup.it/it/"&gt;Python Biella&lt;/a&gt; community to hold a webinar introducing PySpark on Databricks (in Italian). You can find the video below and the &lt;a href="https://github.com/Marco-Santoni/databricks-from-scratch/tree/main/live_python_biella"&gt;code here&lt;/a&gt;.&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/m0OiFDBJ0Rw?start=114" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;</content><category term="posts"></category></entry><entry><title>My 6 Gems on Data Visualization</title><link href="https://www.marcosantoni.com/data_viz_hidden_gems.html" rel="alternate"></link><published>2022-02-26T19:35:00+01:00</published><updated>2022-02-26T19:35:00+01:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2022-02-26:/data_viz_hidden_gems.html</id><summary type="html">&lt;p&gt;I have been working quite some time with charts and business intelligence in the last 5 years. When you spend time building business reports, you may perceive data visualization as a cold technical and business tool. However, there are &lt;strong&gt;6 hidden gems&lt;/strong&gt; in data visualization that I found by chance …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I have been working quite some time with charts and business intelligence in the last 5 years. When you spend time building business reports, you may perceive data visualization as a cold technical and business tool. However, there are &lt;strong&gt;6 hidden gems&lt;/strong&gt; in data visualization that I found by chance. I realized data visualization is not as cold as I thought. Let me recap for you these 6 gems.&lt;/p&gt;
&lt;h2&gt;1) The first chart ever&lt;/h2&gt;
&lt;p&gt;William Playfair was a Scottish engineer and political scientist from the 18th century. He is considered the author of the very first chart:&lt;/p&gt;
&lt;p&gt;&lt;img alt="By William Playfair - The Commercial and Political Atlas, 1786 (3rd edition, 1801), Public Domain" src="https://www.marcosantoni.com/images/datavizhiddengems/playfair_first_chart_800.jpg"&gt;&lt;/p&gt;
&lt;p&gt;The chart was published back in 1786. It shows the volumes of imports and exports of Scotland over one year on a scale of 10k pounds. Each country is given two bars: one for volume of imports, one for volume of exports.&lt;/p&gt;
&lt;p&gt;I am so used to seeing bar charts that I never asked myself who invented them or when they first appeared. It's nice to find out that they were invented long before calculators and that they have changed so little since then.&lt;/p&gt;
&lt;h2&gt;2) The best graphic ever&lt;/h2&gt;
&lt;p&gt;Charles Minard represented 6 types of data about Napoleon's 1812 Russia campaign in one single chart. This visual was considered by &lt;a href="https://www.nationalgeographic.com/culture/article/charles-minard-cartography-infographics-history"&gt;Edward Tufte&lt;/a&gt; as "&lt;em&gt;the best statistical graphic ever produced&lt;/em&gt;".&lt;/p&gt;
&lt;p&gt;&lt;img alt="By Charles Minard (1869): map of Napoleon's disastrous Russian campaign of 1812" src="https://www.marcosantoni.com/images/datavizhiddengems/minardnapoleon_800.png"&gt;&lt;/p&gt;
&lt;p&gt;Minard represented in two dimensions &lt;a href="https://ageofrevolution.org/200-object/flow-map-of-napoleons-invasion-of-russia/"&gt;six types&lt;/a&gt; of data: the number of Napoleon's troops; distance; temperature; the latitude and longitude; direction of travel; and location relative to specific dates.&lt;/p&gt;
&lt;h2&gt;3) Non-neutrality: the Legarithmic scale&lt;/h2&gt;
&lt;p&gt;Is data visualization a neutral discipline? Not really. Basic decisions like the choice of scale or of the axis limits can radically change the information perceived by the reader. Take a look at the following tweet by Matteo Salvini (leader of the "Lega" party) about the results of a poll on the popularity of Italian politicians:&lt;/p&gt;
&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="it" dir="ltr"&gt;Nonostante menzogne, attacchi e processi, milioni di Italiani credono, sperano, confidano nella Lega. &lt;br&gt;Eh già, e siamo ancora qua…&lt;br&gt;Non si molla mai, GRAZIE! &lt;a href="https://t.co/DFMecxPFzC"&gt;pic.twitter.com/DFMecxPFzC&lt;/a&gt;&lt;/p&gt;&amp;mdash; Matteo Salvini (@matteosalvinimi) &lt;a href="https://twitter.com/matteosalvinimi/status/1436662148709629952?ref_src=twsrc%5Etfw"&gt;September 11, 2021&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="https://platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;Do you notice anything wrong with the chart? The y axis looks a bit tweaked. The bar heights do not follow any reasonable scale (perhaps a "Legarithmic" scale?), since the differences between the 3 bars are not consistent with the values they represent. Here is how the same data looks when plotted in Excel.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Unbiased chart of the same data shown in Matteo Salvini's tweet" src="https://www.marcosantoni.com/images/datavizhiddengems/realchartfromtweet_800.png"&gt;&lt;/p&gt;
&lt;p&gt;However, the effect on the reader is not the same, is it?&lt;/p&gt;
&lt;h2&gt;4) Beyond shapes: infographics&lt;/h2&gt;
&lt;p&gt;Otto Neurath was one of the main contributors to the &lt;em&gt;picture language&lt;/em&gt;, aka ISOTYPE (International System of Typographic Picture Education). This method consists of replacing classic shapes in data visualization (eg bars, circles, etc) with a set of standardized symbols. Quantities are represented by repeating the same symbol over and over, proportionally to the measure. Consider the following example by Otto Neurath from 1930.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Otto Neurath, Residential density in big cities - 1930" src="https://www.marcosantoni.com/images/datavizhiddengems/isotypeexample_800.png"&gt;&lt;/p&gt;
&lt;p&gt;The chart represents the density of population in different cities. The information is represented as the number of persons that would live in a flat of 200 m2. The count of persons is not represented by a digit or by a bar; it is represented by repeating a symbol as many times as the count of persons for that city. The result is effective. Density is no longer just a number, and you can &lt;em&gt;feel&lt;/em&gt; the size of the measure. Infographics can turn cold numbers into tangible perceptions of a phenomenon.&lt;/p&gt;
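&lt;p&gt;The repetition idea is simple enough to sketch in a few lines of Python. This is only an illustration: the city names, the counts, and the &lt;code&gt;pictogram&lt;/code&gt; helper are made up for the example and are not taken from Neurath's chart.&lt;/p&gt;

```python
def pictogram(count, symbol="#", per_symbol=1):
    """Return `symbol` repeated once for every `per_symbol` units of `count`."""
    return symbol * round(count / per_symbol)


# Hypothetical densities (persons per flat), for illustration only.
densities = {"City A": 3, "City B": 7, "City C": 12}

for city, persons in densities.items():
    # Pad the label so that the rows of symbols start in the same column.
    print(f"{city:10}{pictogram(persons)}")
```

&lt;p&gt;Each row grows with the quantity it represents, so the comparison is made by eye rather than by reading digits.&lt;/p&gt;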
&lt;h2&gt;5) Pie charts: bad by definition&lt;/h2&gt;
&lt;p&gt;"Bad by definition" is the title of one of my &lt;a href="https://www.data-to-viz.com/caveat/pie.html"&gt;favourite blog posts&lt;/a&gt; about data visualization. This article is a clean explanation of why you should not use pie charts for most of the use cases. The article starts with this example.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Yan Holtz - The issue with pie chart" src="https://www.marcosantoni.com/images/datavizhiddengems/piechart_400.png"&gt;&lt;/p&gt;
&lt;p&gt;Can you rank the slices of the pie by size? You'd probably struggle a bit trying to answer. The reason is that our brain is not used to measuring and comparing angles. It's funny to see pie charts being used every now and then in business reports. Most of the time, a basic bar chart would be far more effective at letting the reader understand the numbers behind it. However, it seems that pie charts are now endemic in corporations, and there is still a long way to go before we get rid of them 😁&lt;/p&gt;
&lt;h2&gt;6) What is data visualization?&lt;/h2&gt;
&lt;p&gt;Is data visualization a branch of computer science? It turns out that data visualization is a broader discipline, and it is part of &lt;a href="https://visme.co/blog/information-design/"&gt;information design&lt;/a&gt;. Information design is the practice of presenting information in a way that fosters an efficient and effective understanding of the information.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Plain text representation of data" src="https://www.marcosantoni.com/images/datavizhiddengems/irpef_table_800.png"&gt;&lt;/p&gt;
&lt;p&gt;Can the same data of a bar chart be represented in plain text? Yes.&lt;/p&gt;
&lt;p&gt;Would plain text require the same effort from us to understand the information behind the numbers? Probably not.&lt;/p&gt;
&lt;p&gt;Would we even be able to get such information from plain text? Probably not because visualizing information helps our brain to perceive what's going on.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Same representation of data via line chart" src="https://www.marcosantoni.com/images/datavizhiddengems/irpef_chart.png"&gt;&lt;/p&gt;
&lt;p&gt;I recently wrote &lt;a href="https://medium.com/@marcosantoni_39266/riforma-irpef-i-grafici-che-avrei-voluto-vedere-7a69f7577bc3"&gt;an article&lt;/a&gt; on the impact of information design on journalism. The article starts from a recent tax reform in Italy. Most news media kept showing tables of the new tax rates, but I found it quite hard to get a clear and full picture of the reform. I was not able to find a single data visualization online about the data behind the reform. So, I made one myself, and it turned out the article was quite appreciated (with more than 2.3k reads at the time of this writing and plenty of positive feedback on social networks).&lt;/p&gt;
&lt;p&gt;The reason why the article went so viral is that one single line chart was able to describe the reform far more effectively than the textual tables you could find online. I find this a decent example of the "&lt;em&gt;efficient and effective understanding of information&lt;/em&gt;" that is the overall goal of information design.&lt;/p&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;p&gt;This article is a collection of notes I took in the last couple of years. Historical charts are inspired by talks by &lt;a href="https://twitter.com/pciuccarelli"&gt;Paolo Ciuccarelli&lt;/a&gt;. The critique of pie charts is inspired by the article by &lt;a href="https://www.data-to-viz.com/caveat/pie.html"&gt;Yan Holtz&lt;/a&gt;. Plenty of details are of course from Wikipedia.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>How I started podcasting</title><link href="https://www.marcosantoni.com/start_podcasting.html" rel="alternate"></link><published>2021-11-07T09:35:00+01:00</published><updated>2021-11-07T09:35:00+01:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2021-11-07:/start_podcasting.html</id><summary type="html">&lt;p&gt;In May 2021, the first episode of my first podcast went live. The podcast is called &lt;a href="http://intervistapythonista.com/"&gt;Intervista Pythonista&lt;/a&gt; and is co-hosted with &lt;a href="https://it.linkedin.com/in/cesare-placanica"&gt;Cesare Placanica&lt;/a&gt;. Cesare and I are members of the &lt;a href="http://milano.python.it/"&gt;Python Milano&lt;/a&gt; community, which helped us kick off the idea.&lt;/p&gt;
&lt;h2&gt;Why podcasting?&lt;/h2&gt;
&lt;p&gt;I am a heavy podcast listener. I …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In May 2021, the first episode of my first podcast went live. The podcast is called &lt;a href="http://intervistapythonista.com/"&gt;Intervista Pythonista&lt;/a&gt; and is co-hosted with &lt;a href="https://it.linkedin.com/in/cesare-placanica"&gt;Cesare Placanica&lt;/a&gt;. Cesare and I are members of the &lt;a href="http://milano.python.it/"&gt;Python Milano&lt;/a&gt; community, which helped us kick off the idea.&lt;/p&gt;
&lt;h2&gt;Why podcasting?&lt;/h2&gt;
&lt;p&gt;I am a heavy podcast listener. I love podcasts because they are dense conversations on topics I love. These conversations let me hear the points of view of experts in the field and stay up to date with new trends.&lt;/p&gt;
&lt;p&gt;I prefer podcasts over videos for two reasons. First, I can listen to them while I'm doing something else (usually low-attention tasks like dish-washing or running). Second, I don't need to sit in front of a screen after I've already spent 8+ hours of my working day in front of one.&lt;/p&gt;
&lt;h2&gt;Why now?&lt;/h2&gt;
&lt;p&gt;Cesare and I participated as panelists in a &lt;a href="https://talks.codemotion.com/panel-online---stories-of-python-and-dat"&gt;community talk&lt;/a&gt; at the last Codemotion conference. The panel was an informal discussion on topics like data team organization, learning tips, and the latest trends in data science.&lt;/p&gt;
&lt;p&gt;We had a surprisingly high number of attendees during the panel. I noticed that people enjoyed an informal chat between experts more than I expected. I suspect that people miss the &lt;strong&gt;informal chat&lt;/strong&gt; they used to have during in-person meetups and conferences (suspended since the beginning of the pandemic).&lt;/p&gt;
&lt;p&gt;So, I got back to Cesare with the idea:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Why don't we start podcasting?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cesare was like: "tell me more about it". The idea was to interview an expert in Python or in its neighborhood. The format was inspired by Michael Kennedy's &lt;a href="https://talkpython.fm/"&gt;Talk Python to Me&lt;/a&gt; podcast. I was thinking of a similar format, but narrowing it to an Italian audience by running interviews in Italian. The goal was not only to create valuable content for Italian Pythonistas, but also to give voice to local community members. Getting to know the people behind a tech community through a direct interview helps the community grow by making it feel closer to you.&lt;/p&gt;
&lt;p&gt;The decision was taken. It was time to start.&lt;/p&gt;
&lt;h2&gt;How to run a podcast?&lt;/h2&gt;
&lt;p&gt;Neither Cesare nor I had ever run a podcast before. Neither of us was an expert in audio recording or audio post-processing. Fortunately, we live in a time where you can find plenty of user-friendly tools to create digital content. After doing some research, I found &lt;a href="https://anchor.fm/"&gt;Anchor&lt;/a&gt; by Spotify. Anchor defines itself as "&lt;em&gt;the easiest way to make a podcast&lt;/em&gt;". And it probably is.&lt;/p&gt;
&lt;p&gt;Anchor lets you start a new podcast in minutes for free. You can record, cut, merge, and publish episodes directly via the mobile app. The app lets you invite guests to join the recording too. Anchor will then take care of distributing the content on major podcasting platforms.&lt;/p&gt;
&lt;p&gt;What is missing? A website and a logo! It turns out that Anchor creates a podcast page for your podcast. I simply bought a domain and linked it to that page. Regarding the logo, I have to confess I designed it in PowerPoint.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Intervista Pythonista logo" src="https://www.marcosantoni.com/images/intervista_pythonista.png"&gt;&lt;/p&gt;
&lt;h2&gt;Guests?&lt;/h2&gt;
&lt;p&gt;Ok, we decided how to record and how to publish. It's time to record our first episode... who should we invite? Cesare and I started listing names of community members, colleagues, and even friends that could be interviewed. We soon had around 20 names, and our first choice was &lt;a href="https://marcobonzanini.com/category/podcast/"&gt;Marco Bonzanini&lt;/a&gt; (thanks Marco again for your availability!).&lt;/p&gt;
&lt;iframe src="https://anchor.fm/marco-santoni/embed/episodes/Ep-1-Diventare-imprenditori-di-se-stessi-con-NLP-e10a9g9/a-a5fjhcg" height="102px" width="400px" frameborder="0" scrolling="no"&gt;&lt;/iframe&gt;

&lt;p&gt;We keep a kind of kanban board up to date that lists potential guests, guests that have accepted the invitation, and those that have already been scheduled. We decided to have a fixed schedule for recording (every 2 weeks, on the same day, at the same time). Having a recurring schedule reduces complexity and makes things work.&lt;/p&gt;
&lt;p&gt;At the end of every recording, we ask the guest to suggest 1 or 2 names of potential future guests. This recommendation helps us fill the list of future guests with new names, and it lets us meet new Pythonistas outside of our direct network.&lt;/p&gt;
&lt;h2&gt;Some numbers&lt;/h2&gt;
&lt;p&gt;Two days ago, we published the &lt;a href="https://anchor.fm/marco-santoni/episodes/Ep-10-Demand-forecasting-con-serie-temporali-gerarchiche-e19q48p"&gt;10th episode&lt;/a&gt;, and we have enough history to look back at numbers. As of 7th November 2021, we had &lt;em&gt;1,364&lt;/em&gt; plays. Our top episode had &lt;em&gt;167&lt;/em&gt; plays. &lt;em&gt;84%&lt;/em&gt; of listeners are from Italy, and 2 out of 3 listeners use their mobile device to listen to the podcast.&lt;/p&gt;
&lt;p&gt;What I'm most glad of are not these numbers, but the messages we often receive via &lt;a href="https://pythonmilano.herokuapp.com/"&gt;Slack&lt;/a&gt; or &lt;a href="https://www.linkedin.com/company/python-milano"&gt;LinkedIn&lt;/a&gt;. Sometimes listeners write to us to say thanks for the valuable content they listened to. These messages are the highest reward for the time and effort we put into this podcast and the main reason we are doing this.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Getting PSM I Scrum Certification</title><link href="https://www.marcosantoni.com/getting-psm-i-scrum-certification.html" rel="alternate"></link><published>2021-08-24T09:41:00+02:00</published><updated>2021-08-24T09:41:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2021-08-24:/getting-psm-i-scrum-certification.html</id><summary type="html">&lt;p&gt;I've been working with the Scrum framework over the last 18 months, and I thought it was time to check whether what I was doing was real Scrum or kind-of-Scrum. I decided to take the &lt;em&gt;Professional Scrum Master I&lt;/em&gt; certification exam to test my knowledge of the framework.&lt;/p&gt;
&lt;h2&gt;Which certification?&lt;/h2&gt;
&lt;p&gt;Where …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I've been working with the Scrum framework over the last 18 months, and I thought it was time to check whether what I was doing was real Scrum or kind-of-Scrum. I decided to take the &lt;em&gt;Professional Scrum Master I&lt;/em&gt; certification exam to test my knowledge of the framework.&lt;/p&gt;
&lt;h2&gt;Which certification?&lt;/h2&gt;
&lt;p&gt;Where to start? It seems that the founders of Scrum have created 3 independent organizations that have 3 independent certification paths.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Scrum.org&lt;/li&gt;
&lt;li&gt;Scrum Alliance&lt;/li&gt;
&lt;li&gt;Scrum Inc&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While &lt;em&gt;Scrum Alliance&lt;/em&gt; and &lt;em&gt;Scrum Inc&lt;/em&gt; require attending a class to take the exam, &lt;em&gt;Scrum.org&lt;/em&gt; lets you directly take the exam thus allowing self-study. I did not find any in-person class in my area anytime soon and decided to go for &lt;em&gt;Scrum.org&lt;/em&gt; exam. I did not consider attending an online class because I already spend most of the working time in front of a screen and prefer other ways of learning rather than online courses.&lt;/p&gt;
&lt;h2&gt;How to prepare?&lt;/h2&gt;
&lt;p&gt;In short, read the &lt;a href="https://scrumguides.org/"&gt;Scrum Guide&lt;/a&gt; at least 3-4 times. Focus on highlighting &lt;strong&gt;who&lt;/strong&gt; is accountable for every artifact and activity (eg only the Developers are accountable for the Sprint Backlog, all the Scrum Team is accountable for the Sprint Goal, etc).&lt;/p&gt;
&lt;p&gt;Repeat a few times the exercises that simulate exam questions (either official or not):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Scrum.org &lt;a href="https://www.scrum.org/open-assessments/scrum-open"&gt;Open Assessment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Great set of 80 questions by &lt;a href="https://mlapshin.com/index.php/scrum-quizzes/"&gt;Mikhail Lapshin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;A few free questions on &lt;a href="https://www.volkerdon.com/courses/take/sm-po-scaled-scrum-3-in-1/quizzes/24259915-product-owner-free-assessment"&gt;Volkerdon&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I also enjoyed looking at some posters available on Scrum.org that help you visualize some aspects of the framework:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://scrumorg-website-prod.s3.amazonaws.com/drupal/2021-01/Scrumorg-Scrum-Framework-tabloid.pdf"&gt;Scrum Framework Poster&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://scrumorg-website-prod.s3.amazonaws.com/drupal/2018-05/ScrumValues-Tabloid.pdf"&gt;Scrum Values Poster&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The exam&lt;/h2&gt;
&lt;p&gt;The exam is an online quiz of 80 questions to be answered in 60 minutes. I suggest using the &lt;em&gt;Bookmark&lt;/em&gt; feature of the quiz. It lets you bookmark questions you're doubtful about and review them later. It took me about 40-45 minutes to go quickly through all questions. I then had approximately 15 minutes to review the bookmarked questions.&lt;/p&gt;
&lt;p&gt;I've read on a few forums that people encountered performance issues on the exam webpage. However, I did not find any issue and the exam ran smoothly.&lt;/p&gt;
&lt;p&gt;You can have notes either printed or on your laptop because there are no controls like browser locks or similar ones. You are basically free to look at any resource you like during the exam. The time pressure is a decent guarantee against cheating.&lt;/p&gt;
&lt;p&gt;When you complete the exam, you'll receive a printable certificate and a badge like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="PSM I badge" src="https://www.marcosantoni.com/images/psmi_badge.png"&gt;&lt;/p&gt;
&lt;p&gt;Your certificate will also be available on your &lt;a href="https://www.credly.com/"&gt;Credly&lt;/a&gt; profile (if you have one).&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Notes from Designing Data-Intensive Applications</title><link href="https://www.marcosantoni.com/review_designing_data_intensive.html" rel="alternate"></link><published>2021-04-10T07:31:00+02:00</published><updated>2021-04-10T07:31:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2021-04-10:/review_designing_data_intensive.html</id><summary type="html">&lt;p&gt;&lt;a href="https://dataintensive.net/"&gt;Designing Data-Intensive Applications&lt;/a&gt; by Martin Kleppmann was not a quick read. Let me be clear, it is not such a long book (the paper version is 400 pages), but it is so dense with information that it takes some time to go through. The book indeed covers a broad spectrum of data …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="https://dataintensive.net/"&gt;Designing Data-Intensive Applications&lt;/a&gt; by Martin Kleppmann was not a quick read. Let me be clear, it is not such a long book (the paper version is 400 pages), but it is so dense with information that it takes some time to go through. The book indeed covers a broad spectrum of data technologies and is dense with details in every paragraph. So, be ready before starting the journey.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Ocean of distributed data" src="https://www.marcosantoni.com/images/data_map_600.jpg"&gt;&lt;/p&gt;
&lt;p&gt;What did I learn from the book? I'll take a few quotes from my notes.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;An architecture that is appropriate for one level of load is unlikely to cope with 10 times that load. If you are working on a fast-growing service, it is therefore likely that you will need to rethink your architecture on every order of magnitude load increase — or perhaps even more often than that.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We need to be able to test, develop, and quickly change our architecture. The book covers the main data solution designs, but you need a team and an organization that are able to adapt and improve the architecture constantly. And more importantly, avoid &lt;a href="http://wiki.c2.com/?PrematureOptimization"&gt;premature optimization&lt;/a&gt; as much as possible. Prefer simplicity over complexity.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If the same query can be written in 4 lines in one query language but requires 29 lines in another, that just shows that different data models are designed to satisfy different use cases. It’s important to pick a data model that is suitable for your application.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Don't focus on data processing performance only: data models and query languages do matter. The overall simplicity and readability of the solution design should be taken into account when choosing the data model.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;On the surface, a data warehouse and a relational OLTP database look similar, because they both have a SQL query interface. However, the internals of the systems can look quite different, because they are optimized for very different query patterns. Many database vendors now focus on supporting either transaction processing or analytics workloads, but not both.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We experienced this difference in my team. We started by building a data warehouse on top of SQL, but we ran into performance issues quite soon. The statement by Kleppmann may seem obvious, but there are plenty of organizations building data warehouses on SQL for a variety of reasons.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;... we will explore some of the most common ways how data flows between processes: via databases,  via service calls (eg REST and RPC), and via asynchronous message passing (eg MQTT, AMQP).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I find this an amazing summary. In the end, any data flow architecture falls into one of these 3 categories, doesn't it?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;When you deploy a new version of your application (of a server-side application, at least), you may entirely replace the old version with the new version within a few minutes. The same is not true of database contents: the five-year-old data will still be there, in the original encoding, unless you have explicitly rewritten it since then. This observation is sometimes summed up as data outlives code.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Migrating data is harder than updating an application (and there are richer tools available for deploying an application than migrating a database).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;May your application’s evolution be rapid and your deployments be frequent.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I love this wish 😊&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;All of the difficulty in replication lies in handling changes to replicated data, and that’s what this chapter is about. We will discuss three popular algorithms for replicating changes between nodes: &lt;em&gt;single-leader&lt;/em&gt;, &lt;em&gt;multi-leader&lt;/em&gt;, and &lt;em&gt;leaderless replication&lt;/em&gt;. Almost all distributed databases use one of these three approaches.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I found this quote in the introduction to the &lt;em&gt;Replication&lt;/em&gt; chapter of the book. I had often heard these replication mechanisms mentioned, but this was the first time I did a deep dive into the topic (which is not as easy as I would have expected). Throughout the book, Kleppmann makes one thing clear: there are many things that can go wrong around data (timestamp alignment, networking, nodes down, etc), and they will go wrong at some point.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Because of this risk of skew and hot spots, many distributed datastores use a hash function to determine the partition for a given key. A good hash function takes skewed data and makes it uniformly distributed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And fortunately this hashing is often managed under the hood by the datastores themselves, eg Azure Cosmos.&lt;/p&gt;
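&lt;p&gt;To make the idea concrete, here is a minimal sketch of hash partitioning in Python. It is not how any specific datastore implements it: the choice of MD5, the modulo step, the 4 partitions, and the &lt;code&gt;partition_for&lt;/code&gt; helper are all assumptions made for the example.&lt;/p&gt;

```python
import hashlib
from collections import Counter


def partition_for(key, num_partitions):
    """Map a key to a partition number using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer, then wrap around with modulo.
    return int.from_bytes(digest[:8], "big") % num_partitions


# Skewed keys: they all share the same prefix, so partitioning on ranges
# of the key would send them to the same partition.
keys = [f"2021-04-10-sensor-{i}" for i in range(1000)]
counts = Counter(partition_for(key, 4) for key in keys)
print(sorted(counts.items()))
```

&lt;p&gt;Plain modulo is the simplest option, but most keys move to a different partition when the number of partitions changes, so take it as an illustration of the principle rather than a production scheme.&lt;/p&gt;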
&lt;blockquote&gt;
&lt;p&gt;Atomicity, isolation, and durability are properties of the database, whereas consistency (in the ACID sense) is a property of the application. The application may rely on the database’s atomicity and isolation properties in order to achieve consistency, but it’s not up to the database alone. Thus, the letter C doesn’t really belong in ACID.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Interesting to read that the &lt;em&gt;C&lt;/em&gt; in such a popular acronym is there just to make the acronym work.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Errors will inevitably happen, but many software developers prefer to think only about the happy path rather than the intricacies of error handling.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;True story, but experience helps you think a bit more about the &lt;em&gt;sad path&lt;/em&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Simply dumping data in its raw form allows for several such transformations. This approach has been dubbed the sushi principle: "raw data is better"&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I have been following the &lt;em&gt;sushi principle&lt;/em&gt; in the last year without being aware of this definition. Nice name!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Database triggers can be used to implement change data capture by registering triggers that observe all changes to data tables and add corresponding entries to a changelog table. However, they tend to be fragile and have significant performance overheads. Parsing the replication log can be a more robust approach, although it also comes with challenges, such as handling schema changes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I see replication log parsing as a growing trend. It enables the "take the data to the data lake and then we'll see what to do" approach. Furthermore, it fits streaming data applications too. Today, not all vendors support the publication of such change logs natively (eg I didn't find a simple solution for &lt;em&gt;SQL Server&lt;/em&gt;).&lt;/p&gt;
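&lt;p&gt;For comparison, the trigger-based approach mentioned in the quote can be sketched as follows. I use SQLite only because it ships with Python; the &lt;code&gt;customers&lt;/code&gt; and &lt;code&gt;changelog&lt;/code&gt; tables and the trigger names are hypothetical, and a real setup would also have to deal with the fragility and overhead the quote warns about.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE changelog (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        operation TEXT,
        customer_id INTEGER,
        name TEXT
    );

    -- Each trigger copies the change into the changelog table.
    CREATE TRIGGER customers_insert AFTER INSERT ON customers
    BEGIN
        INSERT INTO changelog (operation, customer_id, name)
        VALUES ('insert', NEW.id, NEW.name);
    END;

    CREATE TRIGGER customers_update AFTER UPDATE ON customers
    BEGIN
        INSERT INTO changelog (operation, customer_id, name)
        VALUES ('update', NEW.id, NEW.name);
    END;

    CREATE TRIGGER customers_delete AFTER DELETE ON customers
    BEGIN
        INSERT INTO changelog (operation, customer_id, name)
        VALUES ('delete', OLD.id, OLD.name);
    END;
    """
)

# Ordinary writes to the data table; the triggers record them on the side.
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.execute("UPDATE customers SET name = 'Ada L.' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")

changes = conn.execute(
    "SELECT operation, customer_id, name FROM changelog ORDER BY seq"
).fetchall()
for change in changes:
    print(change)
```

&lt;p&gt;The &lt;code&gt;customers&lt;/code&gt; table ends up empty, while the changelog keeps the full history of what happened to it. That history is what a downstream consumer would read.&lt;/p&gt;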
&lt;p&gt;&lt;img alt="Database state as integral of stream" src="https://www.marcosantoni.com/images/state_as_integral_of_stream_600.png"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you are mathematically inclined, you might say that the application state is what you get when you integrate an event stream over time, and a change stream is what you get when you differentiate the state by time, as shown in figure. The analogy has limitations (for example, the second derivative of state does not seem to be meaningful), but it’s a useful starting point for thinking about data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This brilliant analogy is the intro to the &lt;strong&gt;chapter I enjoyed the most&lt;/strong&gt; in the entire book, ie the &lt;em&gt;Stream Processing&lt;/em&gt; chapter. It presents a database as a cache of the latest state derived from the replication log (the opposite of the point of view we normally have).&lt;/p&gt;
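&lt;p&gt;A toy version of the analogy fits in a few lines of Python. The events below are made-up changes to a single account balance, and I assume the state starts from zero: the running sum plays the role of the integral, and the difference between consecutive states plays the role of the derivative.&lt;/p&gt;

```python
from itertools import accumulate

# Hypothetical change events for a single account balance.
events = [100, -30, 45, -15]

# "Integrating" the stream: the running sum gives the state after each event.
states = list(accumulate(events))
print(states)

# "Differentiating" the state: the difference between consecutive states
# gives back the original change stream.
previous_states = [0] + states[:-1]
changes = [after - before for before, after in zip(previous_states, states)]
print(changes)
```

&lt;p&gt;The last element of &lt;code&gt;states&lt;/code&gt; is what a database would normally show you, while &lt;code&gt;changes&lt;/code&gt; is what a change stream would publish.&lt;/p&gt;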
&lt;blockquote&gt;
&lt;p&gt;In the absence of widespread support for a good distributed transaction protocol, I believe that log-based derived data is the most promising approach for integrating different data systems.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I have seen Kafka as a tool for stream processing so far. I was not thinking of it as a tool for integrating data systems. The last chapter of the book gives a hint on how &lt;em&gt;log-based derived data&lt;/em&gt; may become a popular pattern soon.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The trend has been to keep stateless application logic separate from state management (databases): not putting application logic in the database and not putting persistent state in the application. As people in the functional programming community like to joke, "We believe in the separation of Church and state"&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Good one.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Models of Data Science teams: Chess vs Checkers</title><link href="https://www.marcosantoni.com/chess_vs_checkers_teams.html" rel="alternate"></link><published>2021-03-27T09:35:00+01:00</published><updated>2021-03-27T09:35:00+01:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2021-03-27:/chess_vs_checkers_teams.html</id><summary type="html">&lt;blockquote&gt;
&lt;p&gt;How many data engineers should we hire? Are they too many compared to our data scientists?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;One of the key decisions to take when building a data science team is the &lt;strong&gt;mix of roles&lt;/strong&gt;. This means choosing the right mix of background and of activities that each member of the …&lt;/p&gt;</summary><content type="html">&lt;blockquote&gt;
&lt;p&gt;How many data engineers should we hire? Are they too many compared to our data scientists?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;One of the key decisions to take when building a data science team is the &lt;strong&gt;mix of roles&lt;/strong&gt;. This means choosing the right mix of background and of activities that each member of the team should have. I'll compare two models of teams I've experienced so far and call them the &lt;strong&gt;chess-team&lt;/strong&gt; model and the &lt;strong&gt;checkers-team&lt;/strong&gt; model.&lt;/p&gt;
&lt;h2&gt;Chess-Team Model&lt;/h2&gt;
&lt;p&gt;&lt;img alt="Chess board" src="https://www.marcosantoni.com/images/chess_400.jpg"&gt;&lt;/p&gt;
&lt;p&gt;The chess-team model is the common model we read about in literature. In a chess-team, each member of the team has a &lt;strong&gt;specific role&lt;/strong&gt;. Roles are usually: &lt;em&gt;data engineers&lt;/em&gt;, &lt;em&gt;data scientists&lt;/em&gt;, and &lt;em&gt;machine learning engineers&lt;/em&gt;. These roles typically correspond to different sets of skills (eg ML and statistics vs coding and devops) and to different sets of activities (model selection vs data preparation vs model deployment).&lt;/p&gt;
&lt;p&gt;Similarly to a chess piece which has a clear role that is different from the other pieces, a member of a data science chess-team is assigned a subset of the tasks that are part of the development pipeline. Let's consider a simplistic development pipeline:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;data preparation -&amp;gt; data engineer&lt;/li&gt;
&lt;li&gt;model development -&amp;gt; data scientists&lt;/li&gt;
&lt;li&gt;model deployment -&amp;gt; machine learning engineer&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The three activities of this development pipeline correspond to the three roles of the team, and there is little space for confusion. A data engineer probably won't work a lot on the model development and selection, while a data scientist probably won't be the one deploying the model in production.&lt;/p&gt;
&lt;h2&gt;Checkers-Team Model&lt;/h2&gt;
&lt;p&gt;&lt;img alt="Checkers board" src="https://www.marcosantoni.com/images/checkers_400.jpg"&gt;&lt;/p&gt;
&lt;p&gt;The checkers-team model is a definition of a team model that I introduce in this post. In a checkers-team, each member of the team does not have a specific role because he or she may be in charge of working on &lt;strong&gt;any step of the development&lt;/strong&gt; pipeline. There are no roles like &lt;em&gt;data engineer&lt;/em&gt; or &lt;em&gt;data scientist&lt;/em&gt; because taking such a role implies limiting the scope of activities a team member should work on. Let's take an example. In a checkers-team, there is no &lt;em&gt;data scientist&lt;/em&gt; because no one is in charge of model development &lt;strong&gt;only&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;So, what is the role of someone working in a checkers-team? A member of the team can be defined as a &lt;strong&gt;full-stack data developer&lt;/strong&gt;. A full-stack data developer is someone who, for example, works on data extraction &lt;em&gt;AND&lt;/em&gt; model development &lt;em&gt;AND&lt;/em&gt; model deployment. In a checkers-team, everyone can potentially work on every piece of the development lifecycle. In this sense, the team is more similar to checkers pieces. There is no move that a piece can take and another piece cannot. Similarly, there is no activity that any team member cannot do. For example, everyone can contribute to building devops pipelines and automation.&lt;/p&gt;
&lt;p&gt;Of course, every team member has a different &lt;strong&gt;background&lt;/strong&gt; and a different set of skills from his/her teammates. One can come from a software engineering experience, another one can come from data science studies. However, the strategy of building a checkers-team is to invest in &lt;strong&gt;training&lt;/strong&gt; team members to grow their set of skills &lt;strong&gt;horizontally&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Pros and Cons&lt;/h2&gt;
&lt;p&gt;Let's consider some key differences between a chess and a checkers team model.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Flexibility.&lt;/strong&gt; The balance of types of activities is not stable over time in a team. There can be times when there is a peak of work items in data engineering and little or no work items in ML model development. These peaks can be due to different phases of the data product development cycle or due to varying business requirements. A checkers-team is flexible and can adapt quickly to these peaks. A checkers-team could for example dedicate the entire team to develop data engineering pipelines in a Scrum sprint if needed. The same flexibility is not as easy in a chess-team model where you have constraints due to different skills and different responsibilities.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Complexity.&lt;/strong&gt; Not every data science team is facing the same level of complexity in their projects. Imagine a team that is building an AI model for self-driving cars. It is a complex problem to solve that requires advanced skills in computer vision and AI. These skills cannot be learned quickly but usually need a specific education or career path. When facing such problems, you need team members who are specialists in areas like vision or AI. A chess-team is designed to host specialists in certain fields and is designed to grow such skills vertically. In a checkers-team, there are no such specialists.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Awareness.&lt;/strong&gt; A member of a checkers-team knows every phase of the development cycle in detail. While he is designing an ML model, he is aware at the same time of how the release pipeline and the operations of the model work. He may take decisions during model selection that take into consideration where the model will be hosted and possible constraints of the production platform. On the other hand, a data scientist in a chess-team knows fewer details (because he has not been working on it himself) of how the model will be deployed and run. This lower awareness may lead to assumptions being made during model development, and these assumptions can add complexity for those in charge of deploying the model.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sense of Ownership.&lt;/strong&gt; In a checkers-team, you are in charge of engineering data pipelines, developing models, and deploying them. Any issue that may occur in these phases is also &lt;em&gt;your&lt;/em&gt; issue. You can't delegate too much, and, therefore, you naturally feel responsible to contribute to the resolution. Distributing the ownership makes every team member more active in improving the development life cycle.&lt;/p&gt;
&lt;h2&gt;When is a Team Model Right?&lt;/h2&gt;
&lt;p&gt;The answer depends on the context and the organization you work at. Is the data science team working on the &lt;strong&gt;core product&lt;/strong&gt; of the company? If this is the case, the models that are developed may need a level of specialization that can't just be achieved by a checkers-team.&lt;/p&gt;
&lt;p&gt;Or is the team rather working on adding tiny features or on improving the operations of the company? In this case, you probably won't be developing state-of-the-art AI models, and you can rely on existing &lt;strong&gt;libraries or SaaS&lt;/strong&gt; that make life easier for you. As complexity is not an obstacle, going for a checkers-team may be a good option.&lt;/p&gt;
&lt;p&gt;What is the size of your data science team? Or even how many teams do you have? Large organizations go for multiple data teams. These teams may be divided &lt;strong&gt;functionally&lt;/strong&gt; (eg 1 team of data engineers + 1 separate team of data scientists) or they may be divided by &lt;strong&gt;business units&lt;/strong&gt; (eg 1 data team for marketing and 1 data team for recommender system). You can't of course adopt the checkers-team model in a large organization that designs its data teams by function, but you may still adopt this model in a large organization that creates multiple self-organized teams each dedicated to a specific business unit.&lt;/p&gt;
&lt;p&gt;A last point to consider is the &lt;strong&gt;IT architecture&lt;/strong&gt;. A checkers-team requires the same person to work on very different tasks. This is viable only if the complexity of such tasks is small. Adopting &lt;strong&gt;SaaS and PaaS&lt;/strong&gt; resources simplifies every task by hiding the complexity of managing and running the resources. They let you focus on your goal. For example, building an API endpoint hosted by a function-as-a-service is something feasible by a data scientist with a mathematical background. Doing the same from scratch on an on-premise server is not as feasible.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Images courtesy of &lt;a href="https://unsplash.com/photos/DC-UrroFRr4"&gt;@pecanlie&lt;/a&gt; and &lt;a href="https://unsplash.com/photos/U_Kz2RnfFAk"&gt;@rafaelrex&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Choosing my next job title (in a data science career)</title><link href="https://www.marcosantoni.com/choosing_next_job_title.html" rel="alternate"></link><published>2021-01-08T07:41:00+01:00</published><updated>2021-01-08T07:41:00+01:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2021-01-08:/choosing_next_job_title.html</id><summary type="html">&lt;p&gt;I'm now part of a data and AI team in a fintech spinoff. When I joined the company, it did not make sense to spend time in defining precise job titles because we were to build everything from scratch (both software, teams and organization). My job title was therefore a …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I'm now part of a data and AI team in a fintech spinoff. When I joined the company, it did not make sense to spend time in defining precise job titles because we were to build everything from scratch (both software, teams and organization). My job title was therefore a generic "&lt;em&gt;AI Practitioner&lt;/em&gt;". One year later, teams and responsibilities are more clear, and it is now time to define my job title.&lt;/p&gt;
&lt;h2&gt;What was I doing up to now?&lt;/h2&gt;
&lt;p&gt;I have a background in data science and software engineering. I started my career in 2013 as "&lt;em&gt;Data Scientist and Software Developer&lt;/em&gt;" (what we would call today a &lt;em&gt;Machine Learning Engineer&lt;/em&gt;?) in a small startup. I was then defined as an "&lt;em&gt;Associate&lt;/em&gt;" when working as a data scientist in a consulting firm. In the last 3 years, I worked in a manufacturing firm as "&lt;em&gt;Data Scientist&lt;/em&gt;".&lt;/p&gt;
&lt;h2&gt;What am I doing now?&lt;/h2&gt;
&lt;p&gt;In the company I currently work at, I work in the data and AI team. My main activities include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;planning and prioritizing of our data solution&lt;/li&gt;
&lt;li&gt;designing our data and software architecture&lt;/li&gt;
&lt;li&gt;personally developing our data integrations, analytics reporting, ML models and data solutions&lt;/li&gt;
&lt;li&gt;making sure our Scrum ceremonies run smoothly&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;My job has a mix of coding, architecture design, and project/product management. Why such a variety of responsibilities? I work in a small team that is part of a company that is growing quickly starting from zero. Each team is quite autonomous in doing their work by taking end-to-end ownership of the activity. For example, in my data and AI team we handle our work end-to-end. We are responsible for the entire pipeline: defining roadmaps, development, deployment, and monitoring.&lt;/p&gt;
&lt;h2&gt;My job title?&lt;/h2&gt;
&lt;p&gt;It is now time to define a job title that can summarize my responsibilities listed above. These are some alternatives I took into consideration:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Job title&lt;/th&gt;
&lt;th&gt;Comments&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Senior Data Scientist/Engineer&lt;/td&gt;
&lt;td&gt;Too vertical on a piece of the pipeline compared to the spectrum of activities I work on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Architect&lt;/td&gt;
&lt;td&gt;Nicely defines the technical activities of designing and scaling our data solutions, but lacks the ownership of the backlog and of the product roadmap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Product Owner&lt;/td&gt;
&lt;td&gt;States clearly the ownership of the product backlog, but I feel that the "Product Owner" title is too tied to a Scrum role and lacks technical responsibilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lead Data and AI&lt;/td&gt;
&lt;td&gt;States the responsibility of leading a team of experts in a domain. However, it does not feature any ownership on the product roadmap. Furthermore, it states a clear hierarchy in the team that goes against our team and company culture (a culture of distributed ownership and flat organization)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I was not satisfied with the job titles above. Then, I came up with &lt;strong&gt;"Data Product Manager"&lt;/strong&gt;. I felt this job title was what I was looking for because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;as a Product Manager, you are responsible for the product roadmap and strategy&lt;/li&gt;
&lt;li&gt;the prefix "Data" adds a technical taste. By doing some research, I found that a TPM (Technical Product Manager) is a common job title that defines a product manager that is also in charge of the technical side of the product (architecture, etc)&lt;/li&gt;
&lt;li&gt;it states the ownership of our data product but does not add any hierarchy-sounding adjectives&lt;/li&gt;
&lt;li&gt;my end-to-end range of activities can fit well in this definition&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I shared these thoughts with my manager, who agreed both on the definition of my responsibilities and on the job title. I hope these notes can help those who are facing the same challenge of choosing their own job title.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>What we expected from Covid on March 10th</title><link href="https://www.marcosantoni.com/what_we_expected_from_covid.html" rel="alternate"></link><published>2020-12-26T09:35:00+01:00</published><updated>2020-12-26T09:35:00+01:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2020-12-26:/what_we_expected_from_covid.html</id><summary type="html">&lt;p&gt;The first Covid case in Italy was found on February 21st 2020. A couple of weeks later we were entering the lockdown with this number of new daily cases.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Cases up to March 10th" src="https://www.marcosantoni.com/images/cases_up_to_march_10.png"&gt;&lt;/p&gt;
&lt;p&gt;The number of Covid-19 new cases was growing really fast every day. We had no clue about what was going to …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The first Covid case in Italy was found on February 21st 2020. A couple of weeks later we were entering the lockdown with this number of new daily cases.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Cases up to March 10th" src="https://www.marcosantoni.com/images/cases_up_to_march_10.png"&gt;&lt;/p&gt;
&lt;p&gt;The number of new Covid-19 cases was growing really fast every day. We had no clue about what was going to happen and about when it would end. Was it going to end soon? How quickly was the virus spreading? I was wondering whether our feelings and &lt;strong&gt;expectations&lt;/strong&gt; would turn out to be true or not. So, I ran a little &lt;strong&gt;experiment&lt;/strong&gt; with 7 friends. I asked each of them the following 2 questions on March 10th 2020:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;What will the total number of Covid19 cases be by April 1st?&lt;/li&gt;
&lt;li&gt;When will the number of new cases be smaller than 50 again?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The goal of these questions was to investigate our ability as humans to roughly grasp the size and the duration of an unprecedented event like a global pandemic. Let's look at the answers we gave to these 2 questions.&lt;/p&gt;
&lt;h2&gt;Total cases by April 1st&lt;/h2&gt;
&lt;p&gt;The actual total number of Covid19 cases in Italy on April 1st was &lt;code&gt;110k&lt;/code&gt; (precisely &lt;code&gt;110574&lt;/code&gt;). These were our 7 predictions made on March 10th.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Cases up to March 10th" src="https://www.marcosantoni.com/images/prediction_cases_april_1.png"&gt;&lt;/p&gt;
&lt;p&gt;We see that 5 out of 7 respondents predicted a number of cases below &lt;code&gt;60k&lt;/code&gt; (with 2 respondents even below &lt;code&gt;25k&lt;/code&gt;). Only 2 out of 7 respondents gave more realistic predictions (&lt;code&gt;110k&lt;/code&gt; and &lt;code&gt;130k&lt;/code&gt; respectively). Why were most respondents too &lt;strong&gt;optimistic&lt;/strong&gt;? If we look at the very first chart, an exponential growth of new cases was already happening on March 10th. Perhaps the majority of respondents were perceiving the growth as linear.&lt;/p&gt;
&lt;p&gt;Does our brain have &lt;strong&gt;misperceptions&lt;/strong&gt; about exponential growth? My little experiment gave this insight, but I was curious whether there is some scientific literature about this misperception. I found a &lt;a href="https://link.springer.com/article/10.3758/BF03204114"&gt;paper&lt;/a&gt; written back in 1975: &lt;em&gt;"Misperception of exponential growth"&lt;/em&gt; by Wagenaar and Sagaria.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Experiment in 1975 paper" src="https://www.marcosantoni.com/images/prediction_cases_misperception_experiment.png"&gt;&lt;/p&gt;
&lt;p&gt;In this paper, researchers presented the beginning of an exponential time series ranging from 1970 to 1974. They presented this time series in different experiments both in the form of a series of numbers and in the form of a graph (see chart above). They asked participants to predict the value of this time series by 1979. A considerable &lt;strong&gt;underestimation&lt;/strong&gt; of growth was encountered in all groups in all conditions.&lt;/p&gt;
&lt;p&gt;The results of this paper helped me understand why most of my respondents notably underestimated the growth of Covid-19 cases in Italy. Our intuition seems to work well for linear growth but not for exponential growth.&lt;/p&gt;
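&lt;p&gt;A small sketch shows how far apart the two readings of the same series end up. The numbers are hypothetical (they are not the Italian case counts), and the two forecast functions are deliberately naive: one repeats the last absolute increase, the other repeats the last growth ratio.&lt;/p&gt;

```python
def linear_forecast(series, steps):
    """Extend the series by repeating the last absolute increase."""
    increase = series[-1] - series[-2]
    return series[-1] + increase * steps


def exponential_forecast(series, steps):
    """Extend the series by repeating the last growth ratio."""
    ratio = series[-1] / series[-2]
    return series[-1] * ratio ** steps


# Hypothetical cumulative counts that double at every step (not real Covid-19 data).
observed = [100, 200, 400, 800]

linear = linear_forecast(observed, steps=5)            # 800 + 400 * 5
exponential = exponential_forecast(observed, steps=5)  # 800 * 2 ** 5
print(linear, exponential)
```

&lt;p&gt;With these made-up numbers, the linear reading ends up about nine times lower than the exponential one after only five steps.&lt;/p&gt;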
&lt;p&gt;The following question naturally comes up: if the underestimation of the Covid-19 growth was common in the vast majority of citizens due to this misperception, how did it affect micro and macro decisions when facing the pandemic?&lt;/p&gt;
&lt;h2&gt;New cases smaller than 50 again&lt;/h2&gt;
&lt;p&gt;We now move to the second question of my little experiment (asked on March 10th): &lt;em&gt;"When will the number of new cases be smaller than 50 again?"&lt;/em&gt;. Plotting the answers:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Daily cases below 50" src="https://www.marcosantoni.com/images/prediction_cases_below_50.png"&gt;&lt;/p&gt;
&lt;p&gt;I drew a black vertical line for each date given as answer. We were too optimistic in this survey too. 5 out of 7 respondents expected the situation to be under control (the red horizontal line represents the threshold of 50 cases in the question) by May 1st. No one expected the high number of daily cases to last beyond &lt;strong&gt;June 18th&lt;/strong&gt;. As of today (December 26th), the number of daily cases in Italy has not gone below 50 on a single day since then.&lt;/p&gt;
&lt;p&gt;We were just starting to experience an extraordinary event, and we were not expecting it to last for that long. This bias towards perceiving the pandemic as shorter than it turned out to be probably helped the social distancing policies. Changing your social habits is a sacrifice that you make more willingly if you expect it to last for a short time. Imagine if we had known that Covid-19 would last for 9 months or even more.&lt;/p&gt;
&lt;p&gt;Another question naturally comes up about the economic policies that were taken to tackle the pandemic: were they subject to the same short-term bias that was measured in this experiment?&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Summary: Building AI Solutions with Azure ML</title><link href="https://www.marcosantoni.com/summary_building_ai_solutions_azure_ml.html" rel="alternate"></link><published>2020-08-19T06:41:00+02:00</published><updated>2020-08-19T06:41:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2020-08-19:/summary_building_ai_solutions_azure_ml.html</id><summary type="html">&lt;p&gt;While studying for the &lt;em&gt;Azure Data Scientist Associate&lt;/em&gt; certification, I took notes from &lt;a href="https://docs.microsoft.com/en-us/learn/paths/build-ai-solutions-with-azure-ml-service/"&gt;Building AI Solution with Azure ML&lt;/a&gt; course. In this single page, you'll find the entire content of the course (as of 18th August, 2020). This page is a small support for those preparing for earning the certification …&lt;/p&gt;</summary><content type="html">&lt;p&gt;While studying for the &lt;em&gt;Azure Data Scientist Associate&lt;/em&gt; certification, I took notes from &lt;a href="https://docs.microsoft.com/en-us/learn/paths/build-ai-solutions-with-azure-ml-service/"&gt;Building AI Solution with Azure ML&lt;/a&gt; course. In this single page, you'll find the entire content of the course (as of 18th August, 2020). This page is a small support for those preparing for earning the certification.&lt;/p&gt;
&lt;h1&gt;Intro&lt;/h1&gt;
&lt;h2&gt;Azure ML Workspace&lt;/h2&gt;
&lt;p&gt;Workspaces are Azure resources. They include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;compute&lt;/li&gt;
&lt;li&gt;notebooks&lt;/li&gt;
&lt;li&gt;pipelines&lt;/li&gt;
&lt;li&gt;data&lt;/li&gt;
&lt;li&gt;experiments&lt;/li&gt;
&lt;li&gt;models&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Azure resources created alongside a workspace:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;storage account: files by WS + data&lt;/li&gt;
&lt;li&gt;application insights&lt;/li&gt;
&lt;li&gt;key vault&lt;/li&gt;
&lt;li&gt;vm&lt;/li&gt;
&lt;li&gt;container registry&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;permission: RBAC&lt;/p&gt;
&lt;p&gt;edition&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;basic (no graphic designer)&lt;/li&gt;
&lt;li&gt;enterprise&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Tools&lt;/h2&gt;
&lt;p&gt;Azure ML Studio&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;designer (no code ML model dev)&lt;/li&gt;
&lt;li&gt;automated ML&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Azure ML SDK&lt;/p&gt;
&lt;p&gt;Azure ML CLI Extensions&lt;/p&gt;
&lt;p&gt;Compute Instances&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;choose VM&lt;/li&gt;
&lt;li&gt;store notebooks independently of VMs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;VS Code - Azure ML Extension&lt;/p&gt;
&lt;h2&gt;Experiments&lt;/h2&gt;
&lt;p&gt;Azure ML tracks the runs of experiments&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_logging&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;logging metrics. &lt;code&gt;run.log('name', value)&lt;/code&gt;. You can review them via &lt;code&gt;RunDetails(run).show()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;experiment output file. Example: trained models. &lt;code&gt;run.upload_file(..)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Script as an experiment&lt;/strong&gt;. In the script, you can get the context: &lt;code&gt;run = Run.get_context()&lt;/code&gt;. To run it, you define:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;RunConfiguration: python environment&lt;/li&gt;
&lt;li&gt;ScriptRunConfig: associates RunConfiguration with script&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Train a ML model&lt;/h1&gt;
&lt;h2&gt;Estimators&lt;/h2&gt;
&lt;p&gt;Estimator: encapsulates a run configuration and a script configuration in a single object. Save trained model as pickle in &lt;code&gt;outputs&lt;/code&gt; folder&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;estimator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Estimator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;source_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;experiment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;entry_script&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;training.py&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;compute_target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;local&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;conda_packages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;scikit-learn&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;experiment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Experiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;train_experiment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
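&lt;p&gt;A minimal sketch of the "save trained model as pickle in &lt;code&gt;outputs&lt;/code&gt; folder" step, as it could appear at the end of a training script. It uses only the standard library; &lt;code&gt;save_model&lt;/code&gt; and &lt;code&gt;load_model&lt;/code&gt; are hypothetical helper names, and the dictionary in the usage line is just a stand-in for a real trained model.&lt;/p&gt;

```python
import os
import pickle


def save_model(model, folder="outputs", filename="model.pkl"):
    """Pickle the model into the folder and return the file path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, "wb") as handle:
        pickle.dump(model, handle)
    return path


def load_model(path):
    """Read back a pickled model from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

&lt;p&gt;Usage: &lt;code&gt;path = save_model({"coef": [0.1, 0.2]})&lt;/code&gt; writes &lt;code&gt;outputs/model.pkl&lt;/code&gt; relative to the working directory of the script.&lt;/p&gt;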

&lt;p&gt;Framework-specific estimators simplify configurations&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;azureml.train.sklearn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SKLearn&lt;/span&gt;

&lt;span class="n"&gt;estimator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SKLearn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;source_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;experiment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;entry_script&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;training.py&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;compute_target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;local&amp;#39;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Script parameters&lt;/h2&gt;
&lt;p&gt;Use &lt;code&gt;argparse&lt;/code&gt; to read the parameters in a script (eg regularization rate). To pass a parameter to an &lt;code&gt;Estimator&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;estimator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SKLearn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;source_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;experiment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;entry_script&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;training.py&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;script_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;--reg_rate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="n"&gt;compute_target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;local&amp;#39;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
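&lt;p&gt;On the script side, reading the parameter with &lt;code&gt;argparse&lt;/code&gt; could look like the sketch below. The function name &lt;code&gt;parse_args&lt;/code&gt; and the default value &lt;code&gt;0.01&lt;/code&gt; are assumptions for illustration; only the &lt;code&gt;--reg_rate&lt;/code&gt; flag comes from the notes above.&lt;/p&gt;

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters passed to the training script."""
    parser = argparse.ArgumentParser(description="Training script parameters")
    parser.add_argument(
        "--reg_rate",
        type=float,
        dest="reg_rate",
        default=0.01,  # assumed default, used when the flag is not passed
        help="regularization rate",
    )
    return parser.parse_args(argv)


# Simulate the command line that would carry the regularization rate.
args = parse_args(["--reg_rate", "0.1"])
print("Regularization rate:", args.reg_rate)
```

&lt;p&gt;Calling &lt;code&gt;parse_args()&lt;/code&gt; without arguments inside &lt;code&gt;training.py&lt;/code&gt; reads the real command line instead of the simulated one.&lt;/p&gt;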

&lt;h2&gt;Registering models&lt;/h2&gt;
&lt;p&gt;Once the experiment &lt;code&gt;Run&lt;/code&gt; has completed, you can retrieve its outputs (eg trained model).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;download_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outputs/model.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_file_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;model.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Registering a model allows you to track multiple versions of a model.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification_model&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;model.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;#local path&lt;/span&gt;
  &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;a classification model&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;dept&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;sales&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="n"&gt;model_framework&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Framework&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SCIKITLEARN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;model_framework_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;0.20.3&amp;#39;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or register from run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="o"&gt;...&lt;/span&gt;
  &lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outputs/model.pkl&amp;#39;&lt;/span&gt;
  &lt;span class="o"&gt;...&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h1&gt;Datastores&lt;/h1&gt;
&lt;p&gt;Abstractions of cloud data sources encapsulating the information required to connect.&lt;/p&gt;
&lt;p&gt;You can register a datastore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;via ML Studio&lt;/li&gt;
&lt;li&gt;via SDK&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;azureml.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Datastore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Workspace&lt;/span&gt;

&lt;span class="n"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Workspace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;blob_ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Datastore&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register_azure_blob_container&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;datastore_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;blob_data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;container_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data_container&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;account_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;az_acct&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;account_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;123456&amp;#39;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In the SDK, you can list the datastores registered in a workspace (for example, via &lt;code&gt;ws.datastores&lt;/code&gt;).&lt;/p&gt;
&lt;h2&gt;Use datastores&lt;/h2&gt;
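&lt;p&gt;The most common datastore types are Azure blob containers and Azure file shares. You can upload files to a datastore and download files from it:&lt;/p&gt;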
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;blob_ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;upload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;src_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/files&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;target_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/data/files&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;overwrite&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;blob_ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;target_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;downloads&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/data&amp;#39;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To use a datastore in an experiment, you pass a data reference to the script. Data access modes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;download: the contents are downloaded to the compute context of the experiment&lt;/li&gt;
&lt;li&gt;upload: files generated by the experiment are uploaded to the datastore after the run&lt;/li&gt;
&lt;li&gt;mount: the datastore path is mounted as remote storage (only on a remote compute target)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Pass reference as script parameter:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;data_ref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blob_ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data/files&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;as_download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path_on_compute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;training_data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;estimator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SKLearn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;source_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;experiment_folder&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;entry_script&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;training_script.py&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;compute_target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;local&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;script_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;--data_folder&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data_ref&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Retrieve it in the script and use it like a local folder:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;--data_folder&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dest&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data_folder&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;data_files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;listdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_folder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
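&lt;p&gt;To try the argument handling without Azure, here is a minimal local sketch. It makes no SDK calls: the helper name &lt;code&gt;list_data_files&lt;/code&gt;, the temporary folder and the file names are illustrative stand-ins for the training script and for the path that the data reference resolves to on the compute.&lt;/p&gt;

```python
import argparse
import os
import tempfile


def list_data_files(argv):
    """Parse --data_folder the way the training script does and list its files."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--data_folder', type=str, dest='data_folder')
    args = parser.parse_args(argv)
    # Once downloaded or mounted, the folder behaves like any local directory.
    return sorted(os.listdir(args.data_folder))


# Stand-in for the folder Azure ML would hand to the script (hypothetical files).
with tempfile.TemporaryDirectory() as folder:
    for file_name in ('a.csv', 'b.csv'):
        with open(os.path.join(folder, file_name), 'w') as handle:
            handle.write('x,y\n1,2\n')
    found = list_data_files(['--data_folder', folder])

print(found)
```

&lt;p&gt;In a real run the list passed to &lt;code&gt;parse_args&lt;/code&gt; is omitted, so the parser reads the command line that the estimator builds from &lt;code&gt;script_params&lt;/code&gt;.&lt;/p&gt;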

&lt;h2&gt;Datasets&lt;/h2&gt;
&lt;p&gt;Datasets are versioned, packaged data objects consumed in experiments and pipelines. There are two types:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;tabular: read as table&lt;/li&gt;
&lt;li&gt;file: list of file paths&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can create a dataset via Azure ML Studio or via the SDK. File paths can contain wildcards (&lt;code&gt;/files/*.csv&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;Once a dataset is created, you can &lt;strong&gt;register&lt;/strong&gt; it in the workspace so that it is available for later use.&lt;/p&gt;
&lt;p&gt;Tabular:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;azureml.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;

&lt;span class="n"&gt;blob_ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_default_datastore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;csv_paths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blob_ds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;data/files/current_data.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blob_ds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;data/files/archive/*.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;tab_ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tabular&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_delimited_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;csv_paths&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tab_ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tab_ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;csv_table&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;File:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;blob_ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_default_datastore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;file_ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blob_ds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;data/files/images/*.jpg&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;file_ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file_ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;img_files&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Retrieve&lt;/strong&gt; a dataset&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Workspace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Get a dataset from workspace datasets collection&lt;/span&gt;
&lt;span class="n"&gt;ds1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;csv_table&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Get a dataset by name from the datasets class&lt;/span&gt;
&lt;span class="n"&gt;ds2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_by_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;img_files&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Datasets can be &lt;strong&gt;versioned&lt;/strong&gt;. Create a new version by registering under the same name with the &lt;code&gt;create_new_version&lt;/code&gt; parameter set to &lt;code&gt;True&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;file_ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file_ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;img_files&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;create_new_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Retrieve specific version:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;img_ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_by_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;img_files&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h1&gt;Compute Contexts&lt;/h1&gt;
&lt;p&gt;The runtime context for each experiment consists of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;environment&lt;/em&gt; for the script, which includes all packages&lt;/li&gt;
&lt;li&gt;&lt;em&gt;compute target&lt;/em&gt; on which the environment will be deployed&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Intro to Environments&lt;/h2&gt;
&lt;p&gt;Python code runs in virtual environments (e.g. managed with &lt;code&gt;Conda&lt;/code&gt; or &lt;code&gt;pip&lt;/code&gt;). Azure creates a Docker container and builds the environment inside it. You can create an environment in several ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;from a &lt;code&gt;Conda&lt;/code&gt; or &lt;code&gt;pip&lt;/code&gt; YAML specification file that you load:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Environment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_conda_specification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;training_env&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;./conda.yml&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;from existing &lt;code&gt;Conda&lt;/code&gt; environment:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Environment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_conda_environment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;training_env&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;conda_environment_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;py_env&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;specifying packages:&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;training_env&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CondaDependencies&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conda_packages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;pandas&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;numpy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                              &lt;span class="n"&gt;pip_packages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;azureml-defaults&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conda_dependencies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;deps&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Once created, you can register the environment in the workspace.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Retrieve it and assign it to a &lt;code&gt;ScriptRunConfig&lt;/code&gt; or an &lt;code&gt;Estimator&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;tr_env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Environment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;training_env&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;estimator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Estimator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;source_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;experiment_folder&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;entry_script&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;training_script.py&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;compute_target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;local&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;environment_definition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tr_env&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Compute targets&lt;/h2&gt;
&lt;p&gt;Compute targets are the physical or virtual computers on which experiments run. Types of compute:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;local compute&lt;/em&gt;: your workstation or a virtual machine&lt;/li&gt;
&lt;li&gt;&lt;em&gt;compute clusters&lt;/em&gt;: multi-node clusters of VMs that automatically scale up or down&lt;/li&gt;
&lt;li&gt;&lt;em&gt;inference clusters&lt;/em&gt;: to deploy models, they use containers to initiate computing&lt;/li&gt;
&lt;li&gt;&lt;em&gt;attached compute&lt;/em&gt;: attach a VM or Databricks cluster that you already use&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can create a compute target via AML studio or via the SDK. A &lt;strong&gt;managed&lt;/strong&gt; compute target is one managed by AML. To create one via the SDK:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Workspace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;compute_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;aml-cluster&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;compute_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AmlCompute&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;provisioning_configuration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;vm_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;STANDARD_DS12_V2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;min_nodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;max_nodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;vm_priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;dedicated&amp;#39;&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;aml_cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ComputeTarget&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compute_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compute_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;aml_cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wait_for_completion&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;An &lt;strong&gt;unmanaged&lt;/strong&gt; compute target is defined and managed outside AML. You can attach it via SDK:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Workspace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;compute_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;db-cluster&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;db_workspace_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;db_workspace&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;db_resource_group&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;db_resource_group&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;db_access_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;aocsinaocnasoivn&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;db_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DatabricksCompute&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attach_configuration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;resource_group&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db_resource_group&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;workspace_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db_workspace_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;access_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db_access_token&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;db_cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ComputeTarget&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compute_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;db_cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wait_for_completion&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can check whether a compute target already exists before creating it:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;compute_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;aml-cluster&amp;#39;&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;aml_cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ComputeTarget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;compute_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ComputeTargetException&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# create it&lt;/span&gt;
  &lt;span class="o"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
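&lt;p&gt;The create branch could be filled in along these lines. This is a sketch that assumes the &lt;code&gt;azureml-core&lt;/code&gt; (SDK v1) package is installed when the function is called. The function name &lt;code&gt;get_or_create_cluster&lt;/code&gt; and its default VM size and node counts are illustrative choices borrowed from the managed-cluster example, not values the SDK requires.&lt;/p&gt;

```python
def get_or_create_cluster(ws, compute_name, vm_size='STANDARD_DS12_V2', max_nodes=4):
    """Return the compute target called compute_name, creating it if it is missing.

    The helper name and the default sizes are assumptions made for this sketch.
    """
    # Imported inside the function so that defining it does not need azureml-core.
    from azureml.core.compute import AmlCompute, ComputeTarget
    from azureml.core.compute_target import ComputeTargetException

    try:
        # Raises ComputeTargetException when no target with this name exists.
        return ComputeTarget(workspace=ws, name=compute_name)
    except ComputeTargetException:
        config = AmlCompute.provisioning_configuration(
            vm_size=vm_size,
            min_nodes=0,
            max_nodes=max_nodes,
        )
        cluster = ComputeTarget.create(ws, compute_name, config)
        # Block until provisioning has finished.
        cluster.wait_for_completion(show_output=True)
        return cluster
```

&lt;p&gt;Calling &lt;code&gt;get_or_create_cluster(ws, 'aml-cluster')&lt;/code&gt; twice should provision the cluster at most once, which makes the setup code safe to re-run.&lt;/p&gt;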

&lt;p&gt;You can use a compute target in an experiment run by specifying it as a parameter:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;compute_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;aml-cluster&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;training_env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Environment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;training_env&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;estimator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Estimator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;source_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;experiment_folder&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;entry_script&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;training_script.py&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;environment_definition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;training_env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;compute_target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;compute_name&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# or specify a ComputeTarget object&lt;/span&gt;
&lt;span class="n"&gt;training_cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ComputeTarget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;compute_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;estimator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Estimator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;source_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;experiment_folder&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;entry_script&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;training_script.py&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;environment_definition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;training_env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;compute_target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;training_cluster&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h1&gt;Orchestrating with Pipelines&lt;/h1&gt;
&lt;p&gt;A &lt;em&gt;pipeline&lt;/em&gt; is a workflow of ML tasks in which each task is implemented as a &lt;em&gt;step&lt;/em&gt;. Steps can run sequentially or in parallel, and a pipeline can combine different compute targets. Common types of step:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;PythonScriptStep&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;EstimatorStep&lt;/em&gt;: runs an estimator&lt;/li&gt;
&lt;li&gt;&lt;em&gt;DataTransferStep&lt;/em&gt;: uses Azure Data Factory (ADF) to copy data between data stores&lt;/li&gt;
&lt;li&gt;&lt;em&gt;DatabricksStep&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;AdlaStep&lt;/em&gt;: runs a &lt;code&gt;U-SQL&lt;/code&gt; job in Azure Data Lake Analytics&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Define steps:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;step1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PythonScriptStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;prepare data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;source_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;scripts&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;script_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data_prep.py&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;compute_target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;aml-cluster&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;runconfig&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;run_config&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;step2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EstimatorStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;train model&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sk_estimator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;compute_target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;aml-cluster&amp;#39;&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Assign steps to pipeline:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;step1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;step2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# create experiment and run pipeline&lt;/span&gt;
&lt;span class="n"&gt;experiment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Experiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;training-pipeline&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_pipeline&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
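&lt;p&gt;As a rough illustration of how steps can end up sequential or parallel, the sketch below groups steps into batches based on their dependencies. It uses only the standard library and says nothing about how Azure ML schedules steps internally; the step names (including &lt;code&gt;compute stats&lt;/code&gt;) and the &lt;code&gt;execution_batches&lt;/code&gt; helper are hypothetical.&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Hypothetical step names, each mapped to the set of steps it depends on.
dependencies = {
    "prepare data": set(),
    "compute stats": set(),
    "train model": {"prepare data", "compute stats"},
}


def execution_batches(deps):
    """Group steps into batches; steps in the same batch do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are all done
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = execution_batches(dependencies)
print(batches)
```

&lt;p&gt;Steps that land in the same batch could run in parallel, while a later batch has to wait for the earlier ones.&lt;/p&gt;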

&lt;h2&gt;Pass data between steps&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;PipelineData&lt;/code&gt; object is a special kind of &lt;code&gt;DataReference&lt;/code&gt; that&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;references a location in a datastore&lt;/li&gt;
&lt;li&gt;creates a data dependency between pipeline steps&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To pass it&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;define a &lt;code&gt;PipelineData&lt;/code&gt; object that references a location in a data store&lt;/li&gt;
&lt;li&gt;specify the object as input or output for the steps that use it&lt;/li&gt;
&lt;li&gt;pass the &lt;code&gt;PipelineData&lt;/code&gt; object as a script parameter in steps that run scripts&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;raw_ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_by_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;raw_dataset&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Define object to pass data between steps&lt;/span&gt;
&lt;span class="n"&gt;data_store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_default_datastore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;prepped_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PipelineData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;prepped&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datastore&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data_store&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;step1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PythonScriptStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;prepare data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;source_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;scripts&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;script_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data_prep.py&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;compute_target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;aml-cluster&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;runconfig&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;run_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;# specify dataset&lt;/span&gt;
  &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;raw_ds&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;as_named_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;raw_data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
  &lt;span class="c1"&gt;# specify PipelineData as output&lt;/span&gt;
  &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;prepped_data&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="c1"&gt;# script reference&lt;/span&gt;
  &lt;span class="n"&gt;arguments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;--folder&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prepped_data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;step2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EstimatorStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;train model&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sk_estimator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;compute_target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;aml-cluster&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;# specify PipelineData&lt;/span&gt;
  &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;prepped_data&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="c1"&gt;# pass reference to estimator script&lt;/span&gt;
  &lt;span class="n"&gt;estimator_entry_script_arguments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;--folder&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prepped_data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Inside the script, you read the &lt;code&gt;PipelineData&lt;/code&gt; location from the script argument and use it like a local folder.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;--folder&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dest&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;folder&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;output_folder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;folder&lt;/span&gt;

&lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="c1"&gt;# save data to PipelineData location&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_folder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;output_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_folder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;prepped_data.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
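&lt;p&gt;The snippet above covers the step that writes the data. A downstream step would read the same folder from its own &lt;code&gt;--folder&lt;/code&gt; argument. The sketch below simulates both sides locally with a temporary directory and the standard &lt;code&gt;csv&lt;/code&gt; module instead of a DataFrame; the helper names and the &lt;code&gt;feature&lt;/code&gt;/&lt;code&gt;label&lt;/code&gt; columns are hypothetical.&lt;/p&gt;

```python
import argparse
import csv
import os
import tempfile


def parse_folder(argv):
    """Read the --folder argument, as both the writing and the reading script would."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--folder", type=str, dest="folder")
    return parser.parse_args(argv).folder


def save_prepped(folder, rows):
    """Producer side: write rows to prepped_data.csv inside the shared folder."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, "prepped_data.csv")
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["feature", "label"])
        writer.writeheader()
        writer.writerows(rows)
    return path


def load_prepped(folder):
    """Consumer side: read the rows back from the same folder."""
    path = os.path.join(folder, "prepped_data.csv")
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


# A temporary directory stands in for the PipelineData location.
with tempfile.TemporaryDirectory() as tmp:
    folder = parse_folder(["--folder", tmp])
    save_prepped(folder, [{"feature": "0.5", "label": "1"}])
    loaded = load_prepped(folder)

print(loaded)
```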

&lt;h2&gt;Reuse steps&lt;/h2&gt;
&lt;p&gt;By default, the step output from a previous pipeline run is reused without rerunning the step (if script, source directory and other params have not changed). You can control this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;step1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PythonScriptStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="c1"&gt;#...&lt;/span&gt;
  &lt;span class="n"&gt;allow_reuse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can force the steps to run regardless of individual configuration:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;regenerate_outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
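&lt;p&gt;One way to reason about these two switches is as a small decision rule. The sketch below is only a mental model, not the mechanism the SDK uses: it assumes a step is fingerprinted by hashing its script and parameters, and the functions &lt;code&gt;step_fingerprint&lt;/code&gt; and &lt;code&gt;should_rerun&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import hashlib
import json


def step_fingerprint(script_source, params):
    """Hash the script text and parameters (an assumed stand-in for change detection)."""
    payload = json.dumps({"script": script_source, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def should_rerun(previous, script_source, params, allow_reuse=True, regenerate_outputs=False):
    """Return True when the step has to run again under the rules described above."""
    if regenerate_outputs or not allow_reuse:
        return True  # reuse is disabled for this step or for the whole run
    return previous != step_fingerprint(script_source, params)


previous = step_fingerprint("print('prep')", {"folder": "data"})

unchanged = should_rerun(previous, "print('prep')", {"folder": "data"})
edited = should_rerun(previous, "print('prep v2')", {"folder": "data"})
no_reuse = should_rerun(previous, "print('prep')", {"folder": "data"}, allow_reuse=False)
forced = should_rerun(previous, "print('prep')", {"folder": "data"}, regenerate_outputs=True)

print(unchanged, edited, no_reuse, forced)
```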

&lt;h2&gt;Publish pipelines&lt;/h2&gt;
&lt;p&gt;You can publish a pipeline to create a REST endpoint through which the pipeline can be run on demand.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;published_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;training_pipeline&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Model training pipeline&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;1.0&amp;#39;&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can view it in ML Studio and get the endpoint:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;published_pipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You start a published pipeline by making an HTTP request to its endpoint. The request carries an authorisation header (with a token) and a JSON payload that specifies the experiment name. The pipeline runs asynchronously, so the response contains only the run ID.&lt;/p&gt;
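&lt;p&gt;A minimal sketch of such a request, built with the standard library and deliberately not sent. The endpoint URL and the token are placeholders, the &lt;code&gt;Bearer&lt;/code&gt; scheme is an assumption about the header format, and &lt;code&gt;build_pipeline_request&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
import json
import urllib.request


def build_pipeline_request(endpoint, token, experiment_name):
    """Build (but do not send) the POST request that would start a published pipeline."""
    headers = {
        "Authorization": "Bearer " + token,  # assumed header format
        "Content-Type": "application/json",
    }
    body = json.dumps({"ExperimentName": experiment_name}).encode("utf-8")
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


# Placeholder values; in practice the URL would come from published_pipeline.endpoint.
request = build_pipeline_request(
    "https://example.invalid/pipelines/placeholder",
    "PLACEHOLDER_TOKEN",
    "run_training_pipeline",
)

print(request.get_method(), request.full_url)
print(json.loads(request.data))
```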
&lt;h2&gt;Pipeline parameters&lt;/h2&gt;
&lt;p&gt;Create a &lt;code&gt;PipelineParameter&lt;/code&gt; object for each parameter. Example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;reg_param&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PipelineParameter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;reg_rate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;span class="n"&gt;step2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EstimatorStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;
  &lt;span class="n"&gt;estimator_entry_script_arguments&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;--folder&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prepped_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;--reg&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reg_param&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;After you publish a parametrised pipeline, you can pass parameter values in the JSON payload of the REST interface. Example&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;auth_header&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;ExperimentName&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;run_training_pipeline&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;ParameterAssignments&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="s1"&gt;&amp;#39;reg_rate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
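&lt;p&gt;On the script side, the parameter value arrives as an ordinary command-line argument. A minimal sketch of how the training script might read it, assuming the script falls back to the same default of 0.01; the helper &lt;code&gt;parse_training_args&lt;/code&gt; and the &lt;code&gt;reg_rate&lt;/code&gt; destination name are hypothetical.&lt;/p&gt;

```python
import argparse


def parse_training_args(argv):
    """Parse the arguments that the pipeline step passes to the training script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--folder", type=str, dest="folder")
    # Assumed fallback, mirroring the default_value of the PipelineParameter.
    parser.add_argument("--reg", type=float, dest="reg_rate", default=0.01)
    return parser.parse_args(argv)


# Simulate a run where the caller overrides the regularisation rate.
args = parse_training_args(["--folder", "prepped", "--reg", "0.1"])
# Simulate a run that relies on the default.
defaults = parse_training_args(["--folder", "prepped"])

print(args.folder, args.reg_rate, defaults.reg_rate)
```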

&lt;h2&gt;Schedule pipelines&lt;/h2&gt;
&lt;p&gt;Define a &lt;code&gt;ScheduleRecurrence&lt;/code&gt; and use it to create a &lt;code&gt;Schedule&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;daily&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ScheduleRecurrence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;frequency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Day&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline_schedule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Schedule&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Daily Training&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;train model every day&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;pipeline_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;published_pipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;experiment_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Training_Pipeline&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;recurrence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;daily&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To schedule a pipeline to run whenever &lt;strong&gt;data changes&lt;/strong&gt;, you must create a &lt;code&gt;Schedule&lt;/code&gt; that monitors a specific path on a datastore:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;training_datastore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Datastore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;blob_data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline_schedule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Schedule&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;
  &lt;span class="n"&gt;datastore&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;training_datastore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;path_on_datastore&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data/training&amp;#39;&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h1&gt;Deploy ML Models&lt;/h1&gt;
&lt;p&gt;You can deploy a model as a &lt;strong&gt;container&lt;/strong&gt; to several compute targets:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Azure ML compute instance&lt;/li&gt;
&lt;li&gt;Azure container instance&lt;/li&gt;
&lt;li&gt;Azure function&lt;/li&gt;
&lt;li&gt;Azure Kubernetes service&lt;/li&gt;
&lt;li&gt;IoT module&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Steps&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;register the model&lt;/li&gt;
&lt;li&gt;inference configuration&lt;/li&gt;
&lt;li&gt;deployment configuration&lt;/li&gt;
&lt;li&gt;deploy model&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;&lt;a name="registermodel"&gt;&lt;/a&gt;Register the model&lt;/h2&gt;
&lt;p&gt;After training, you must register the model in the Azure ML workspace.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;classification_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification_model&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;model.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A classification model&amp;#39;&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Or you can use the reference to the run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification_model&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outputs/model.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A classification model&amp;#39;&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;&lt;a name="scoringscript"&gt;&lt;/a&gt;Inference configuration&lt;/h2&gt;
&lt;p&gt;The model will be deployed as a service consisting of&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a script to load the model and return predictions for submitted data&lt;/li&gt;
&lt;li&gt;an environment in which the script will be run&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Create the &lt;em&gt;entry script&lt;/em&gt; (or &lt;em&gt;scoring script&lt;/em&gt;) as a Python file including 2 functions&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;init()&lt;/code&gt; called when service is initialised (load model from registry)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;run(raw_data)&lt;/code&gt; called when new data is submitted to the service (generate predictions)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
  &lt;span class="n"&gt;model_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_model_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification_model&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;# return predictions in any JSON-serializable format&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
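&lt;p&gt;Before deploying, it can help to exercise the &lt;code&gt;init()&lt;/code&gt;/&lt;code&gt;run()&lt;/code&gt; contract locally. The sketch below does that with a stand-in model so it runs without the registry, &lt;code&gt;joblib&lt;/code&gt; or NumPy; &lt;code&gt;ArgmaxModel&lt;/code&gt; is hypothetical and only mimics a &lt;code&gt;predict&lt;/code&gt; method.&lt;/p&gt;

```python
import json


class ArgmaxModel:
    """Hypothetical stand-in for the registered model: predicts the index of the largest feature."""

    def predict(self, rows):
        return [row.index(max(row)) for row in rows]


model = None


def init():
    # In the real entry script this would load the registered model from disk.
    global model
    model = ArgmaxModel()


def run(raw_data):
    # The service passes the request body as a JSON string with a 'data' field.
    data = json.loads(raw_data)["data"]
    return model.predict(data)


init()
payload = json.dumps({"data": [[0.1, 0.9], [0.8, 0.2]]})
predictions = run(payload)

# The return value must survive JSON serialisation.
print(json.dumps(predictions))
```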

&lt;p&gt;You can configure the environment using Conda. The &lt;code&gt;CondaDependencies&lt;/code&gt; class creates a default environment (including &lt;code&gt;azureml-defaults&lt;/code&gt; and other commonly used packages), to which you add any other required packages. You then serialize the environment to a string and save it to a file.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;myenv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CondaDependencies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;myenv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_conda_package&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;scikit-learn&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;env_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;service_files/env.yml&amp;#39;&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;w&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;myenv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;serialize_to_string&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;After creating the script and the environment, you combine them in an &lt;code&gt;InferenceConfig&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;classifier_inference_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;InferenceConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;python&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;source_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;service_files&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;entry_script&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;score.py&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;conda_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;env.yml&amp;#39;&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Deployment configuration&lt;/h2&gt;
&lt;p&gt;Now that you have the entry script and the environment, you configure the compute on which the service will run. If you deploy to an AKS cluster, you first need to create it:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;cluster_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;aks-cluster&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;compute_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AksCompute&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;provisioning_configuration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;eastus&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;production_cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ComputeTarget&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cluster_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compute_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;production_cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wait_for_completion&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You define the deployment configuration&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;classifier_deploy_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AksWebservice&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deploy_configuration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;cpu_cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;memory_gb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Deploy the model&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification_model&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;service&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deploy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification-service&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="n"&gt;inference_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;classifier_inference_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;deployment_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;classifier_deploy_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;deployment_target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;production_cluster&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wait_for_deployment&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Consuming a real-time inferencing service&lt;/h2&gt;
&lt;p&gt;For &lt;strong&gt;testing&lt;/strong&gt;, you can use the AML SDK to call a web service through the &lt;code&gt;run&lt;/code&gt; method of a &lt;code&gt;Webservice&lt;/code&gt; object. Typically, you send data to the &lt;code&gt;run&lt;/code&gt; method as JSON in this form:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;3.4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The response is a JSON document with a prediction for each case:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In &lt;strong&gt;production&lt;/strong&gt;, you use a REST endpoint. You find the endpoint of a deployed service in Azure ML studio, or by retrieving the &lt;code&gt;scoring_uri&lt;/code&gt; property of a &lt;code&gt;Webservice&lt;/code&gt; object:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scoring_uri&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;There are 2 kinds of &lt;strong&gt;authentication&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;key: requests are authenticated by specifying the key associated with the service&lt;/li&gt;
&lt;li&gt;token: requests are authenticated by providing a JSON Web Token (JWT)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By default, authentication is disabled for Azure Container Instances (ACI) services and set to key-based authentication for AKS services.&lt;/p&gt;
&lt;p&gt;To make an authenticated call to the REST endpoint, you include the key or the token in the request header.&lt;/p&gt;
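&lt;p&gt;As a rough sketch of such a call, the snippet below builds (but does not send) an authenticated request using only the standard library. The endpoint URL and key are placeholders, and &lt;code&gt;build_scoring_request&lt;/code&gt; is a made-up helper name. It assumes the key or token is sent as an &lt;code&gt;Authorization: Bearer&lt;/code&gt; header, so check that against your own service.&lt;/p&gt;

```python
import json
import urllib.request


def build_scoring_request(endpoint, key_or_token, rows):
    """Build a POST request for a scoring endpoint (hypothetical helper)."""
    body = json.dumps({"data": rows}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: the key or JWT is passed as a bearer credential.
        "Authorization": "Bearer " + key_or_token,
    }
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


# Placeholder values; in practice use service.scoring_uri and a real key or token.
request = build_scoring_request(
    "https://example.invalid/score",
    "my-key",
    [[0.1, 0.2, 3.4], [0.9, 8.2, 2.5]],
)

# Sending is left out on purpose:
# with urllib.request.urlopen(request) as response:
#     predictions = json.loads(response.read())
```

&lt;p&gt;Keeping the request construction separate from the network call makes it easy to inspect the headers and payload before pointing the code at a live service.&lt;/p&gt;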
&lt;h2&gt;Troubleshooting service deployment&lt;/h2&gt;
&lt;p&gt;You can&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;check the service state (should be &lt;em&gt;healthy&lt;/em&gt;): &lt;code&gt;service.state&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;review service logs: &lt;code&gt;service.get_logs()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;deploy to a local container&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Batch inference pipelines&lt;/h1&gt;
&lt;p&gt;A batch inference pipeline reads input data, loads a registered model, predicts labels, and writes the results.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="#registermodel"&gt;Register&lt;/a&gt; a model&lt;/li&gt;
&lt;li&gt;Create a &lt;a href="#scoringscript"&gt;scoring script&lt;/a&gt;. The &lt;code&gt;run(mini_batch)&lt;/code&gt; method makes the inference on each batch.&lt;/li&gt;
&lt;li&gt;Create a pipeline with ParallelRunStep&lt;/li&gt;
&lt;li&gt;Run the pipeline and retrieve the step output&lt;/li&gt;
&lt;/ol&gt;
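&lt;p&gt;A minimal sketch of the scoring script from step 2, assuming a &lt;code&gt;File&lt;/code&gt; dataset where each file holds one comma-separated row of features. The &lt;code&gt;ThresholdModel&lt;/code&gt; class is a stand-in invented here so the example runs without Azure ML; a real script would load the registered model inside &lt;code&gt;init()&lt;/code&gt; instead.&lt;/p&gt;

```python
import csv
import os
import tempfile


class ThresholdModel:
    """Stand-in for a registered model, so the sketch runs without Azure ML."""

    def predict(self, features):
        return 1 if sum(features) > 5 else 0


def init():
    # Called once per worker before any batch is processed.
    # A real script would load the registered model here.
    global model
    model = ThresholdModel()


def run(mini_batch):
    # Assumption: for a File dataset, mini_batch is a list of file paths.
    results = []
    for path in mini_batch:
        with open(path, newline="") as f:
            features = [float(v) for v in next(csv.reader(f))]
        results.append("{}: {}".format(os.path.basename(path), model.predict(features)))
    # One entry per processed file.
    return results


# Local dry run with two throwaway files (the pipeline would call init/run itself).
with tempfile.TemporaryDirectory() as folder:
    paths = []
    for name, row in [("a.csv", "0.1,0.2,3.4"), ("b.csv", "0.9,8.2,2.5")]:
        path = os.path.join(folder, name)
        with open(path, "w") as f:
            f.write(row + "\n")
        paths.append(path)
    init()
    demo_results = run(paths)
print(demo_results)
```

&lt;p&gt;Returning one result per input file keeps the output easy to reconcile with the inputs when the results are collated.&lt;/p&gt;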
&lt;p&gt;Azure ML provides a pipeline step that performs parallel batch inference. Using the &lt;code&gt;ParallelRunStep&lt;/code&gt; class, you can read batches of files from a &lt;code&gt;File&lt;/code&gt; dataset and write the output to a &lt;code&gt;PipelineData&lt;/code&gt; reference. You can set the &lt;code&gt;output_action&lt;/code&gt; to &lt;em&gt;"append_row"&lt;/em&gt; (ensuring all instances of the step will collate the result to a single output file named &lt;code&gt;parallel_run_step.txt&lt;/code&gt;).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;batch_data_set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;batch-data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# output location&lt;/span&gt;
&lt;span class="n"&gt;default_ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_default_datastore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;output_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PipelineData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;inferences&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;datastore&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_ds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;output_path_on_compute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;results&amp;#39;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;parallel_run_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ParallelRunConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;source_directory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;batch_scripts&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;entry_script&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;batch_scoring_script.py&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;mini_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;5&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;error_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;output_action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;append_row&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;compute_target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;aml_cluster&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;node_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;parallelrun_step&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ParallelRunStep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;batch-score&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;parallel_run_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;parallel_run_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;batch_data_set&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;as_named_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;batch_data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
  &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
  &lt;span class="n"&gt;allow_reuse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;parallelrun_step&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Run the pipeline and retrieve output.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Experiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;batch_prediction_pipeline&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline_run&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wait_for_completion&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;prediction_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pipeline_run&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_children&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;prediction_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prediction_run&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_output_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;inferences&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;prediction_output&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;results&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Publishing a batch inference pipeline&lt;/h2&gt;
&lt;p&gt;You can publish it as a &lt;strong&gt;REST&lt;/strong&gt; service.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;published_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipeline_run&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;publish_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Batch_Prediction_Pipeline&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Batch Pipeline&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;1.0&amp;#39;&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;rest_endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;published_pipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Once published, you can use the endpoint to initiate a batch inferencing job.&lt;/p&gt;
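&lt;p&gt;A hedged sketch of what that call could look like, again building the request without sending it. It assumes the endpoint accepts a POST whose JSON body names the experiment under &lt;code&gt;ExperimentName&lt;/code&gt;, with an Azure AD bearer token in the &lt;code&gt;Authorization&lt;/code&gt; header. The URL, token, and helper name are placeholders.&lt;/p&gt;

```python
import json
import urllib.request


def build_pipeline_trigger(rest_endpoint, aad_token, experiment_name):
    """Build the POST request that would start a published pipeline run (hypothetical helper)."""
    body = json.dumps({"ExperimentName": experiment_name}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: an Azure AD token is passed as a bearer credential.
        "Authorization": "Bearer " + aad_token,
    }
    return urllib.request.Request(rest_endpoint, data=body, headers=headers, method="POST")


# Placeholder values; in practice use published_pipeline.endpoint and a real token.
trigger = build_pipeline_trigger(
    "https://example.invalid/pipelines/123",
    "my-token",
    "Batch_Prediction",
)
print(trigger.get_method(), trigger.full_url)
```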
&lt;p&gt;You can also &lt;strong&gt;schedule&lt;/strong&gt; the published pipeline to have it run automatically.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;weekly&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ScheduleRecurrence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frequency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Week&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline_schedule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Schedule&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Weekly Predictions&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;batch inferencing&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;pipeline_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;published_pipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;experiment_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Batch_Prediction&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;recurrence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;weekly&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h1&gt;Tuning hyperparameters&lt;/h1&gt;
&lt;p&gt;Hyperparameter tuning is accomplished by training multiple models, using the same algorithm and training data but different hyperparameter values. You then evaluate the performance metric (e.g. accuracy) of each model and select the best-performing one.&lt;/p&gt;
&lt;p&gt;In Azure ML, you create an experiment that consists of a &lt;em&gt;hyperdrive&lt;/em&gt; run, which initiates a child run for each hyperparameter combination. Each child run uses a training script with parametrised hyperparameter values to train a model, and logs the target performance metric achieved by the trained model.&lt;/p&gt;
&lt;h2&gt;Define a search space&lt;/h2&gt;
&lt;p&gt;The search space depends on the type of hyperparameter:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;discrete&lt;/strong&gt;. Make a &lt;code&gt;choice&lt;/code&gt; out of
&lt;ul&gt;
&lt;li&gt;an explicit python &lt;code&gt;list&lt;/code&gt;: &lt;code&gt;choice([10, 20, 30])&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;a &lt;code&gt;range&lt;/code&gt;: &lt;code&gt;choice(range(1,10))&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
or select values from a discrete distribution: &lt;em&gt;qnormal, quniform, qlognormal, qloguniform&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;continuous&lt;/strong&gt;. Use any of these distributions: &lt;em&gt;normal, uniform, lognormal, loguniform&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Define a search space by creating a dictionary with parameter expressions for each hyperparameter.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;azureml.train.hyperdrive&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normal&lt;/span&gt;

&lt;span class="n"&gt;param_space&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;--batch_size&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;--learning_rate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Configuring sampling&lt;/h2&gt;
&lt;p&gt;The values used in a tuning run depend on the type of &lt;em&gt;sampling&lt;/em&gt; used.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Grid sampling.&lt;/strong&gt; Tries every possible combination of values; it can only be used when all hyperparameters are discrete.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;param_space&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;--batch_size&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;--learning_rate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;param_sampling&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GridParameterSampling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;param_space&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
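&lt;p&gt;To make "every possible combination" concrete, this plain-Python sketch enumerates the grid for the same two hyperparameters with &lt;code&gt;itertools.product&lt;/code&gt;. It only illustrates the idea and is not how hyperdrive is implemented.&lt;/p&gt;

```python
from itertools import product

# Same discrete values as the grid search space above.
search_space = {
    "--batch_size": [16, 32, 64],
    "--learning_rate": [10, 20],
}

names = list(search_space)
# One dict per combination: 3 batch sizes x 2 learning rates.
combinations = [dict(zip(names, values)) for values in product(*search_space.values())]

for combo in combinations:
    print(combo)
print(len(combinations), "child runs")
```

&lt;p&gt;The number of runs is the product of the number of values per hyperparameter, so a grid grows quickly as you add hyperparameters.&lt;/p&gt;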

&lt;p&gt;&lt;strong&gt;Random sampling.&lt;/strong&gt; Randomly select a value for each hyperparameter.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;param_space&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;--batch_size&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;--learning_rate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;param_sampling&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomParameterSampling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;param_space&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Bayesian sampling.&lt;/strong&gt; Based on a Bayesian optimisation algorithm that tries to select parameter combinations that improve on the performance of the previous selection.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;param_space&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;--batch_size&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;--learning_rate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;param_sampling&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BayesianParameterSampling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;param_space&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Bayesian sampling can only be used with &lt;em&gt;choice, uniform, quniform&lt;/em&gt; distributions and can't be combined with &lt;em&gt;early termination&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;Configuring an early termination&lt;/h2&gt;
&lt;p&gt;Typically, you set a maximum number of iterations, but this could still result in a large number of runs that don't result in a better model than a combination that has already been tried.&lt;/p&gt;
&lt;p&gt;To help prevent wasting time, you can set an &lt;em&gt;early termination&lt;/em&gt; policy that abandons runs that are unlikely to produce a better result than previously completed runs. The policy is evaluated at an &lt;em&gt;evaluation interval&lt;/em&gt; you specify, based on each time the target performance metric is logged. You can also set a &lt;em&gt;delay evaluation&lt;/em&gt; parameter to avoid evaluating the policy until a minimum number of iterations have been completed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note.&lt;/strong&gt; Early termination is particularly useful for deep learning scenarios where a deep neural network is trained iteratively over a number of epochs. The training script can report the target metric after each epoch, and if the run is significantly underperforming previous runs after the same number of intervals, it can be abandoned.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Bandit policy.&lt;/strong&gt; Stop a run if the target performance metric underperforms the best run so far by a specified margin.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;early_termination_policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BanditPolicy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;slack_amount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# abandon runs when metric is 0.2 or more worse than best run after the same number of intervals&lt;/span&gt;
  &lt;span class="n"&gt;evaluation_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;delay_evaluation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can also use a slack &lt;em&gt;factor&lt;/em&gt;, which compares the metric as a ratio rather than an absolute value.&lt;/p&gt;
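&lt;p&gt;The two variants can be sketched as a small decision function. This assumes a metric that is being maximised, and that a slack factor means a run is stopped when it falls below &lt;code&gt;best / (1 + slack_factor)&lt;/code&gt;. The function name and the handling of the exact boundary case are illustrative choices, not the SDK's code.&lt;/p&gt;

```python
def bandit_should_stop(metric, best_metric, slack_amount=None, slack_factor=None):
    """Return True if a run should be abandoned (metric is maximised)."""
    if (slack_amount is None) == (slack_factor is None):
        raise ValueError("give exactly one of slack_amount or slack_factor")
    if slack_amount is not None:
        # Absolute margin below the best run so far.
        threshold = best_metric - slack_amount
    else:
        # Relative margin: allowed distance expressed as a ratio.
        threshold = best_metric / (1 + slack_factor)
    return threshold > metric


best = 0.9
print(bandit_should_stop(0.65, best, slack_amount=0.2))  # below 0.9 - 0.2
print(bandit_should_stop(0.75, best, slack_amount=0.2))  # within the margin
print(bandit_should_stop(0.70, best, slack_factor=0.2))  # below 0.9 / 1.2
print(bandit_should_stop(0.80, best, slack_factor=0.2))  # within the margin
```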
&lt;p&gt;&lt;strong&gt;Median stopping policy.&lt;/strong&gt; Abandons runs where the target performance metric is worse than the median of the running averages for all runs.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;early_termination_policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MedianStoppingPolicy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;evaluation_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;delay_evaluation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
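&lt;p&gt;As an illustration of the median rule, the sketch below compares a run's best metric so far with the median of the other runs' running averages. It assumes a maximised metric and that the candidate run is excluded from the median. Both choices, and the function name, are assumptions made for the example.&lt;/p&gt;

```python
from statistics import mean, median


def median_should_stop(run_history, other_histories):
    """run_history: metrics logged so far by the candidate run.
    other_histories: metric lists of the other runs over the same intervals."""
    running_averages = [mean(history) for history in other_histories if history]
    if not running_averages:
        return False  # nothing to compare against yet
    return median(running_averages) > max(run_history)


others = [[0.6, 0.7], [0.8, 0.9], [0.5, 0.5]]  # running averages: 0.65, 0.85, 0.5
print(median_should_stop([0.4, 0.6], others))  # best 0.6 is below the median
print(median_should_stop([0.5, 0.7], others))  # best 0.7 is above the median
```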

&lt;p&gt;&lt;strong&gt;Truncation selection policy.&lt;/strong&gt; Cancels the lowest-performing &lt;em&gt;X%&lt;/em&gt; of runs at each evaluation interval, based on the &lt;em&gt;truncation_percentage&lt;/em&gt; value you specify for &lt;em&gt;X&lt;/em&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;early_termination_policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TruncationSelectionPolicy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;truncation_percentage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;evaluation_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;delay_evaluation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
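&lt;p&gt;A small sketch of truncation selection, assuming a maximised metric and that the number of cancelled runs is rounded down. The rounding rule and the function name are assumptions for illustration.&lt;/p&gt;

```python
def truncation_cancelled(run_metrics, truncation_percentage):
    """run_metrics maps run id to its latest metric; returns the run ids to cancel."""
    n_cancel = len(run_metrics) * truncation_percentage // 100
    ranked = sorted(run_metrics, key=run_metrics.get)  # worst first
    return ranked[:n_cancel]


# Ten hypothetical runs with increasing metric values.
metrics = {"run{}".format(i): 0.50 + 0.05 * i for i in range(10)}
cancelled = truncation_cancelled(metrics, 20)
print(cancelled)
```

&lt;p&gt;With ten runs and a truncation percentage of 20, the two lowest-scoring runs are selected for cancellation at that interval.&lt;/p&gt;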

&lt;h2&gt;Running a hyperparameter tuning experiment&lt;/h2&gt;
&lt;p&gt;In Azure ML, you tune hyperparameters by running a &lt;em&gt;hyperdrive&lt;/em&gt; experiment. You need to create a training script just the way you would for any other training experiment, except that you &lt;strong&gt;must&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;include an argument for each hyperparameter&lt;/li&gt;
&lt;li&gt;log the target performance metric.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This example script trains a logistic regression using a &lt;code&gt;--regularization&lt;/code&gt; argument (regularization rate), and logs the &lt;em&gt;accuracy&lt;/em&gt;.&lt;/p&gt;
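&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;azureml.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Run&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;

&lt;span class="c1"&gt;# read the hyperparameter value passed in by hyperdrive&lt;/span&gt;
&lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;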
&lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;--regularization&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dest&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;reg_rate&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;reg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reg_rate&lt;/span&gt;

&lt;span class="c1"&gt;# get experiment run context&lt;/span&gt;
&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_datasets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;training_data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_pandas_dataframe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;feature1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;feature2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;feature3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;feature4&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;label&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;
&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;reg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;liblinear&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# calculate and log accuracy&lt;/span&gt;
&lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Accuracy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# save trained model&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outputs&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outputs/model.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To prepare the hyperdrive experiment, you use a &lt;code&gt;HyperDriveConfig&lt;/code&gt; object to configure the experiment run.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;hyperdrive&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HyperDriveConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sklearn_estimator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;hyperparameter_sampling&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;param_sampling&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;primary_metric_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Accuracy&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;primary_metric_goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PrimaryMetricGoal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MAXIMIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;max_total_runs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;max_concurrent_runs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;experiment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Experiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hyperdrive_training&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;hyperdrive_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hyperdrive&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can monitor the hyperdrive experiment in Azure ML Studio. The experiment initiates a child run for each hyperparameter combination to be tried.&lt;/p&gt;
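&lt;p&gt;To make the idea of one child run per hyperparameter combination concrete, here is a minimal plain-Python sketch. It assumes a grid-style search over a hypothetical discrete search space (the names and values in &lt;code&gt;search_space&lt;/code&gt; are illustrative only) and does not call the Azure ML SDK.&lt;/p&gt;

```python
from itertools import product

# Hypothetical discrete search space (names and values are illustrative only).
search_space = {
    "regularization": [0.001, 0.01, 0.1],
    "solver": ["liblinear", "lbfgs"],
}


def expand_grid(space):
    """Return one dict of hyperparameter values per combination."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]


# Each entry stands for the arguments one child run would receive.
child_run_args = expand_grid(search_space)
for run_number, params in enumerate(child_run_args, start=1):
    print(run_number, params)
```

&lt;p&gt;With three values for one parameter and two for the other, the grid has six combinations, so a cap of six total runs would cover it exactly.&lt;/p&gt;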
&lt;h1&gt;Automate model selection&lt;/h1&gt;
&lt;p&gt;The visual interface for automated ML in Azure ML Studio is available in the &lt;em&gt;Enterprise&lt;/em&gt; edition only.&lt;/p&gt;
&lt;p&gt;You can use automated ML to train models for the tasks below. Azure ML supports common algorithms for these tasks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;classification&lt;ul&gt;
&lt;li&gt;logistic regression&lt;/li&gt;
&lt;li&gt;light gradient boosting machine&lt;/li&gt;
&lt;li&gt;decision tree&lt;/li&gt;
&lt;li&gt;random forest&lt;/li&gt;
&lt;li&gt;naive Bayes&lt;/li&gt;
&lt;li&gt;linear SVM&lt;/li&gt;
&lt;li&gt;XGBoost&lt;/li&gt;
&lt;li&gt;DNN classifier&lt;/li&gt;
&lt;li&gt;others&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;regression&lt;ul&gt;
&lt;li&gt;linear regression&lt;/li&gt;
&lt;li&gt;light gradient boosting machine&lt;/li&gt;
&lt;li&gt;decision tree&lt;/li&gt;
&lt;li&gt;random forest&lt;/li&gt;
&lt;li&gt;elastic net&lt;/li&gt;
&lt;li&gt;LARS Lasso&lt;/li&gt;
&lt;li&gt;XGBoost&lt;/li&gt;
&lt;li&gt;others&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;time series forecasting&lt;ul&gt;
&lt;li&gt;linear regression&lt;/li&gt;
&lt;li&gt;light gradient boosting machine&lt;/li&gt;
&lt;li&gt;decision tree&lt;/li&gt;
&lt;li&gt;random forest&lt;/li&gt;
&lt;li&gt;elastic net&lt;/li&gt;
&lt;li&gt;LARS Lasso&lt;/li&gt;
&lt;li&gt;XGBoost&lt;/li&gt;
&lt;li&gt;others&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By default, automated machine learning randomly selects from the full range of algorithms for the specified task. You can choose to &lt;strong&gt;block&lt;/strong&gt; individual algorithms from being selected.&lt;/p&gt;
&lt;h2&gt;Preprocessing and featurization&lt;/h2&gt;
&lt;p&gt;Automated ML (AutoML) can apply preprocessing transformations to your data.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;scaling and normalization&lt;/strong&gt; applied to numeric data &lt;strong&gt;automatically&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;optional featurization&lt;/strong&gt;&lt;ul&gt;
&lt;li&gt;missing value imputation&lt;/li&gt;
&lt;li&gt;categorical encoding&lt;/li&gt;
&lt;li&gt;dropping high-cardinality features (e.g. IDs)&lt;/li&gt;
&lt;li&gt;feature engineering (e.g. date parts from DateTime)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
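&lt;p&gt;As a rough illustration of what these transformations do, the sketch below applies mean imputation, min-max scaling, and one-hot encoding to a tiny hypothetical dataset in plain Python. The data, the helper names, and the choice of mean imputation and min-max scaling are assumptions made for the example; AutoML selects its own transformations internally.&lt;/p&gt;

```python
from statistics import mean

# Hypothetical rows: a numeric column with a missing value and a categorical column.
rows = [
    {"income": 30.0, "status": "single"},
    {"income": None, "status": "married"},
    {"income": 50.0, "status": "single"},
]


def impute_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid dividing by zero for constant columns
    return [(v - low) / span for v in values]


def one_hot(values):
    """Encode each category as a list of 0/1 indicators, one per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


income = min_max_scale(impute_mean([r["income"] for r in rows]))
status = one_hot([r["status"] for r in rows])

# One feature vector per row: scaled income followed by the status indicators.
features = [[x] + flags for x, flags in zip(income, status)]
print(features)
```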
&lt;h2&gt;Running AutoML experiment&lt;/h2&gt;
&lt;p&gt;You can use the Azure ML Studio UI or the SDK (through the &lt;code&gt;AutoMLConfig&lt;/code&gt; class).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;automl_run_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RunConfiguration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;framework&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;python&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;automl_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoMLConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;auto ml experiment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;primary_metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;AUC_weighted&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;compute_target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;aml_compute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;training_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;validation_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;label_column_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;label&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;featurization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;auto&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;max_concurrent_iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With Azure ML Studio, you can create or select an Azure ML &lt;em&gt;dataset&lt;/em&gt; to be used as input for your AutoML experiment. When using the SDK, you can submit data as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;specify a dataset or dataframe of &lt;em&gt;training data&lt;/em&gt; that includes features and label to be predicted&lt;/li&gt;
&lt;li&gt;optionally, specify a second &lt;em&gt;validation data&lt;/em&gt; dataset or dataframe. If this is not provided, Azure ML will apply cross-validation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Alternatively:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;specify a dataset, dataframe, or numpy array of &lt;em&gt;X&lt;/em&gt; values containing features with a corresponding &lt;em&gt;y&lt;/em&gt; array of label values&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;One of the most important settings you specify is &lt;strong&gt;primary_metric&lt;/strong&gt; (i.e. the target performance metric). Azure ML supports a set of named metrics for each type of task.&lt;/p&gt;
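&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;azureml.train.automl.utilities&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_primary_metrics&lt;/span&gt;

&lt;span class="c1"&gt;# list the metric names supported for the given task type&lt;/span&gt;
&lt;span class="n"&gt;get_primary_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;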
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can &lt;strong&gt;submit&lt;/strong&gt; an AutoML experiment like any other SDK-based experiment:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;automl_experiment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Experiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;automl_experiment&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;automl_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;automl_experiment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;automl_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can easily identify the best run in Azure ML Studio, and download or deploy the model it generated. Via the SDK:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;best_run&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fitted_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;automl_run&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_output&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;best_run_metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;best_run&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_metrics&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;metric_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;best_run_metrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;metric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;best_run_metrics&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metric_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;AutoML uses &lt;em&gt;scikit-learn&lt;/em&gt; pipelines. You can view the steps in the fitted model you obtained from the best run.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;fitted_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;named_steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h1&gt;Explain ML models&lt;/h1&gt;
&lt;p&gt;Model explainers use statistical techniques to calculate &lt;strong&gt;feature importance&lt;/strong&gt;. Explainers work by evaluating a test data set of feature cases and the labels the model predicts for them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Global feature importance&lt;/strong&gt; quantifies the relative importance of each feature in the test dataset as a whole: which feature in the dataset influences prediction?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Local feature importance&lt;/strong&gt; measures the influence of each feature value on a specific individual prediction. For example: will Sam default?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Prediction=0: Samuel won't default on the loan repayment&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;loan amount&lt;/em&gt;; support for 0: &lt;code&gt;0.9&lt;/code&gt;; support for 1: &lt;code&gt;-0.9&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;income&lt;/em&gt;; support for 0: &lt;code&gt;0.6&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;age&lt;/em&gt;; support for 0: &lt;code&gt;-0.2&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;marital status&lt;/em&gt;; support for 0: &lt;code&gt;0.1&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Because this is a &lt;em&gt;classification&lt;/em&gt; model, each feature gets a local importance value for each possible class, indicating the amount of support for that class based on the feature value.&lt;/p&gt;
&lt;p&gt;The most important feature for a prediction of class 1 is &lt;em&gt;loan amount&lt;/em&gt;. There could be multiple reasons why local importance for an individual prediction varies from global importance for the overall dataset. For example, Sam might have a lower income than average, but the loan amount in this case might be unusually small.&lt;/p&gt;
&lt;p&gt;For a multi-class classification model, a local importance value for each possible class is calculated for every feature, with the total across all classes always being 0.&lt;/p&gt;
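&lt;p&gt;The numbers from Sam's example can be worked through in a few lines of Python. This sketch assumes two things that the notes do not state explicitly: that the overall support for a class is the sum of the per-feature values, and that the predicted class is the one with the highest total. Support for class 1 is derived from the zero-sum property rather than read from the list.&lt;/p&gt;

```python
# Local importance values ("support for class 0") copied from Sam's example.
support_for_0 = {
    "loan amount": 0.9,
    "income": 0.6,
    "age": -0.2,
    "marital status": 0.1,
}

# Binary case: the values for a feature sum to 0 across classes,
# so support for class 1 is the negation of support for class 0.
support_for_1 = {name: -value for name, value in support_for_0.items()}

# Assumption: overall support for a class is the sum over features.
totals = {0: sum(support_for_0.values()), 1: sum(support_for_1.values())}
predicted_class = max(totals, key=totals.get)

# Rank features by the magnitude of their influence on this prediction.
ranked = sorted(support_for_0, key=lambda name: abs(support_for_0[name]), reverse=True)
print(predicted_class, ranked)
```

&lt;p&gt;Under these assumptions the totals favour class 0, which matches the prediction shown for Sam, and &lt;em&gt;loan amount&lt;/em&gt; has the largest magnitude for either class.&lt;/p&gt;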
&lt;p&gt;For a &lt;strong&gt;regression model&lt;/strong&gt;, the local importance values simply indicate the level of influence each feature has on the predicted scalar label.&lt;/p&gt;
&lt;h2&gt;Using explainers&lt;/h2&gt;
&lt;p&gt;You can use Azure ML SDK to create explainers for models even if they were not trained using an Azure ML experiment.&lt;/p&gt;
&lt;p&gt;You install the &lt;code&gt;azureml-interpret&lt;/code&gt; package. Types of explainer include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;MimicExplainer&lt;/code&gt; creates a &lt;em&gt;global surrogate model&lt;/em&gt; that approximates your trained model and can be used to generate explanations. This explainable model must have the same kind of architecture as your trained model (eg linear or tree-based)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TabularExplainer&lt;/code&gt; acts as a wrapper around various SHAP explainer algorithms, automatically choosing the one that is most appropriate for your model architecture&lt;/li&gt;
&lt;li&gt;&lt;code&gt;PFIExplainer&lt;/code&gt; (&lt;em&gt;Permutation Feature Importance&lt;/em&gt;) analyzes feature importance by shuffling feature values and measuring the impact on prediction performance&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example for a hypothetical model named &lt;code&gt;loan_model&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;mim_explainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MimicExplainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;loan_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;initialization_examples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;explainable_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DecisionTreeExplainableModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;loan_amount&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;income&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;age&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;marital_status&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;reject&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;approve&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tab_explainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TabularExplainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;loan_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;initialization_examples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;loan_amount&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;income&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;age&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;marital_status&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;reject&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;approve&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pfi_explainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PFIExplainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;loan_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;loan_amount&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;income&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;age&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;marital_status&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;reject&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;approve&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To retrieve &lt;strong&gt;global feature importance&lt;/strong&gt;, call the &lt;code&gt;explain_global()&lt;/code&gt; method of your explainer, and then use the &lt;code&gt;get_feature_importance_dict()&lt;/code&gt; method to get a dictionary of the feature importance values.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;global_mim_explanation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mim_explainer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;explain_global&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;global_mim_feature_importance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;global_mim_explanation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_feature_importance_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# same as MimicExplainer&lt;/span&gt;
&lt;span class="n"&gt;global_tab_explanation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tab_explainer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;explain_global&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;global_tab_feature_importance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;global_tab_explanation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_feature_importance_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# requires actual labels&lt;/span&gt;
&lt;span class="n"&gt;global_pfi_explanation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pfi_explainer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;explain_global&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;global_pfi_feature_importance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;global_pfi_explanation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_feature_importance_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To retrieve &lt;strong&gt;local feature importance&lt;/strong&gt; from a &lt;code&gt;MimicExplainer&lt;/code&gt; or a &lt;code&gt;TabularExplainer&lt;/code&gt;, you must call the &lt;code&gt;explain_local()&lt;/code&gt; method, specifying the subset of cases you want to explain. Then you use the &lt;code&gt;get_ranked_local_names()&lt;/code&gt; and &lt;code&gt;get_ranked_local_values()&lt;/code&gt; methods to retrieve the ranked feature names and their importance values.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# same for tab_explainer too&lt;/span&gt;
&lt;span class="n"&gt;local_mim_explanation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mim_explainer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;explain_local&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;local_mim_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;local_mim_explanation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_ranked_local_names&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;local_mim_importance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;local_mim_explanation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_ranked_local_values&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;PFIExplainer&lt;/code&gt; does not support local feature importance explanations.&lt;/p&gt;
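&lt;p&gt;A minimal plain-Python sketch of permutation feature importance may help show why it yields only global values. It is not the &lt;code&gt;PFIExplainer&lt;/code&gt; implementation: the toy model, the data, and the helper names are hypothetical, and accuracy is assumed as the performance metric. Each feature column is shuffled in turn and the drop in accuracy over the whole dataset is recorded, so the result is one score per feature rather than one per prediction.&lt;/p&gt;

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows where the prediction matches the true label."""
    hits = sum(1 for row, label in zip(X, y) if predict(row) == label)
    return hits / len(y)


def permutation_importance(predict, X, y, n_features, seed=0):
    """Drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by its shuffled values.
        shuffled = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
        importances.append(baseline - accuracy(predict, shuffled, y))
    return importances


def toy_predict(row):
    """Hypothetical model: returns feature 0 as the class and ignores feature 1."""
    return row[0]


# Hypothetical data in which the label equals feature 0.
X = [[1, 5], [0, 3], [1, 8], [0, 1], [1, 2], [0, 7]]
y = [1, 0, 1, 0, 1, 0]

importances = permutation_importance(toy_predict, X, y, n_features=2)
print(importances)
```

&lt;p&gt;Because the toy model ignores the second feature, shuffling it cannot change the accuracy, so its importance is zero. Note that the true labels &lt;code&gt;y&lt;/code&gt; are needed to measure accuracy, which is why this kind of explainer requires actual labels.&lt;/p&gt;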
&lt;h2&gt;Creating explanations&lt;/h2&gt;
&lt;p&gt;You can create an explainer and upload the explanation it generates to the run for later analysis.&lt;/p&gt;
&lt;p&gt;To create an explanation for the &lt;strong&gt;experiment script&lt;/strong&gt;, you'll need to ensure that the &lt;code&gt;azureml-interpret&lt;/code&gt; and &lt;code&gt;azureml-contrib-interpret&lt;/code&gt; packages are installed in the run environment. Then you can use these to create an explanation from your trained model and upload it to the run outputs.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# code to train model goes here&lt;/span&gt;

&lt;span class="c1"&gt;# get explanation&lt;/span&gt;
&lt;span class="n"&gt;explainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TabularExplainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;explanation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;explainer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;explain_global&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# get an explanation client and upload the explanation&lt;/span&gt;
&lt;span class="n"&gt;explain_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ExplanationClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;explain_client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;upload_model_explanation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;explanation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Tabular Explanation&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can view the explanation you created for your model in the &lt;em&gt;Explanations&lt;/em&gt; tab for the run in Azure ML Studio.&lt;/p&gt;
&lt;h2&gt;Visualizing explanations&lt;/h2&gt;
&lt;p&gt;Model explanations in Azure ML Studio include multiple visualizations that you can use to explore feature importance. Visualizations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;global feature importance&lt;/li&gt;
&lt;li&gt;summary importance: shows the distribution of individual importance values for each feature across the test dataset&lt;/li&gt;
&lt;li&gt;local feature importance by selecting an individual data point&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Monitor models&lt;/h1&gt;
&lt;p&gt;You can use Application Insights to capture and review telemetry from models published with Azure ML. You must have an Application Insights resource associated with your Azure ML workspace.&lt;/p&gt;
&lt;p&gt;When you create an Azure ML workspace, you can select an Application Insights resource. If you do not select an existing resource, a new one is created in the same resource group as your workspace.&lt;/p&gt;
&lt;p&gt;When deploying a new real-time service, you can &lt;strong&gt;enable&lt;/strong&gt; Application Insights in the deployment configuration for the service.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;dep_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AciWebservice&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deploy_configuration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;cpu_cores&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;memory_gb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;enable_app_insights&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you want to enable Application Insights for a service that is already deployed, you can modify the deployment configuration for AKS-based services in the Azure portal.&lt;/p&gt;
&lt;h2&gt;Capture and view telemetry&lt;/h2&gt;
&lt;p&gt;Application Insights automatically captures any information written to the standard output and error logs, and provides a query capability to view data in these logs.&lt;/p&gt;
&lt;p&gt;You can write any value to the standard output in the scoring script by using a &lt;code&gt;print&lt;/code&gt; statement:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Data: &amp;#39;&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39; - Predictions: &amp;#39;&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Azure ML creates a &lt;em&gt;custom dimension&lt;/em&gt; in the data model for the output you write.&lt;/p&gt;
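&lt;p&gt;To see what such a scoring function writes to the log, you can call it locally before deploying. The snippet below is a minimal sketch: &lt;code&gt;StubModel&lt;/code&gt; and &lt;code&gt;StubPredictions&lt;/code&gt; are hypothetical stand-ins for a trained model and its NumPy output, and the payload values are made up.&lt;/p&gt;

```python
import json


class StubPredictions(list):
    """Hypothetical stand-in for a NumPy array: a list with a tolist() method."""

    def tolist(self):
        return list(self)


class StubModel:
    """Hypothetical model: predicts the parity of each row's feature sum."""

    def predict(self, rows):
        return StubPredictions(sum(row) % 2 for row in rows)


model = StubModel()


def run(raw_data):
    # The service passes the request body as a JSON string
    data = json.loads(raw_data)['data']
    predictions = model.predict(data)
    # Anything printed here goes to standard output, which is what gets logged
    print('Data: ' + str(data) + ' - Predictions: ' + str(predictions))
    return predictions.tolist()


payload = json.dumps({'data': [[1, 2], [2, 2]]})
result = run(payload)
```

&lt;p&gt;The request body is a JSON document with a &lt;code&gt;data&lt;/code&gt; key holding one list of feature values per row, which is the shape the scoring script above expects.&lt;/p&gt;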
&lt;p&gt;You can use the Log Analytics query interface for Application Insights in the Azure portal to query this data. It supports a SQL-like query syntax.&lt;/p&gt;
&lt;h1&gt;Monitor data drift&lt;/h1&gt;
&lt;p&gt;Over time there may be trends that change the profile of the data, making your model less accurate. This change in data profiles between training and inferencing is known as &lt;em&gt;data drift&lt;/em&gt;.&lt;/p&gt;
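&lt;p&gt;As a rough illustration of the idea, the sketch below scores how far each feature has moved between a baseline sample and a target sample. The metric (shift of the mean, in units of the baseline standard deviation) is a deliberately simple stand-in and is not the metric Azure ML computes; the feature values are made up.&lt;/p&gt;

```python
from statistics import mean, stdev


def mean_shift(baseline, target):
    """Absolute shift of the target mean, in units of baseline standard deviation."""
    return abs(mean(target) - mean(baseline)) / stdev(baseline)


# Hypothetical feature samples: training data versus recent inference data
baseline = {'age': [30, 35, 40, 45, 50], 'bmi': [22.0, 24.0, 26.0, 28.0, 30.0]}
target = {'age': [31, 36, 39, 46, 51], 'bmi': [27.0, 29.0, 31.0, 33.0, 35.0]}

drift = {name: mean_shift(baseline[name], target[name]) for name in baseline}

# Rank features from most to least drifted
ranked = sorted(drift, key=drift.get, reverse=True)
for name in ranked:
    print(name, round(drift[name], 2))
```

&lt;p&gt;In this toy data the &lt;code&gt;bmi&lt;/code&gt; values have shifted much more than the &lt;code&gt;age&lt;/code&gt; values, so &lt;code&gt;bmi&lt;/code&gt; would be the first feature to inspect.&lt;/p&gt;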
&lt;p&gt;Azure ML supports data drift monitoring through the use of &lt;em&gt;datasets&lt;/em&gt;. You can compare two registered datasets to detect data drift, or you can capture new feature data submitted to a deployed model service and compare it to the dataset with which the model was trained.&lt;/p&gt;
&lt;p&gt;You register two datasets:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a &lt;em&gt;baseline&lt;/em&gt; dataset: the original training data&lt;/li&gt;
&lt;li&gt;a &lt;em&gt;target&lt;/em&gt; dataset that is compared to the baseline at regular time intervals. This dataset requires a column for each feature you want to compare, and a timestamp column&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You define a &lt;em&gt;dataset monitor&lt;/em&gt; to detect data drift and trigger alerts if the rate of drift exceeds a specified threshold. You can create dataset monitors using Azure ML Studio or by using the &lt;code&gt;DataDriftDetector&lt;/code&gt; class.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;monitor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DataDriftDetector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_from_datasets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;dataset-drift-monitor&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;baseline_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_ds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;target_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;new_data_ds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;compute_target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;aml-cluster&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;frequency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Week&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;feature_list&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;age&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;height&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;bmi&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="n"&gt;latency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can run a &lt;em&gt;backfill&lt;/em&gt; to immediately compare the baseline with data that already exists in the target dataset.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;datetime&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;dt&lt;/span&gt;
&lt;span class="n"&gt;backfill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;monitor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backfill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weeks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you have &lt;strong&gt;deployed a model&lt;/strong&gt; as a real-time web service, you can capture new inferencing data as it is submitted, and compare it to the original training data. This approach has the benefit of automatically collecting new target data as the deployed model is used.&lt;/p&gt;
&lt;p&gt;You include the training dataset in the model registration to provide a baseline.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;./model/model.pkl&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mymodel&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Scenario&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TRAINING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_ds&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You enable data collection for services in which the model is used. You use the &lt;code&gt;ModelDataCollector&lt;/code&gt; class in each service's scoring script, writing code to capture data and predictions and write them to the data collector (which will store them in Azure blob storage).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data_collect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predict_collect&lt;/span&gt;
  &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;mymodel&amp;#39;&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_model_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

  &lt;span class="c1"&gt;# enable collection of data and predictions&lt;/span&gt;
  &lt;span class="n"&gt;data_collect&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ModelDataCollector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;designation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;inputs&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;age&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;height&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;bmi&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;predict_collect&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ModelDataCollector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;designation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;predictions&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;prediction&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;data_collect&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;predict_collect&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With the data collection code in place in the scoring script, you can enable data collection in the deployment configuration.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;dep_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AksWebservice&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deploy_configuration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collect_model_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can configure &lt;strong&gt;data drift monitoring&lt;/strong&gt; by using a &lt;code&gt;DataDriftDetector&lt;/code&gt; class.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mymodel&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;datadrift&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DataDriftDetector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_from_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;my-svc&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="n"&gt;frequency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Week&amp;#39;&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Scheduling alerts&lt;/h2&gt;
&lt;p&gt;You can specify a threshold for the rate of data drift and an operator email for notifications.&lt;/p&gt;
&lt;p&gt;Monitoring works by running a comparison at a scheduled &lt;strong&gt;frequency&lt;/strong&gt; (day, week, or month) and calculating data drift metrics for the features. For dataset monitors, you can specify a &lt;strong&gt;latency&lt;/strong&gt;: the number of hours to allow for new data to be collected and added to the target dataset. For deployed-model data drift monitors, you can specify a &lt;code&gt;schedule_start&lt;/code&gt; time value to indicate when the data drift run should start (if omitted, the run starts at the current time).&lt;/p&gt;
&lt;p&gt;Data drift is measured using a calculated &lt;em&gt;magnitude&lt;/em&gt; of change in the statistical distributions of feature values over time. You can configure a &lt;strong&gt;threshold&lt;/strong&gt; for data drift magnitude.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;alert_email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AlertConfiguration&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data_scientist@contoso.com&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;monitor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DataDriftDetector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_from_datasets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;dataset-drift-detector&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;baseline_data_set&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;target_data_set&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;compute_target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cpu_cluster&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;frequency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Week&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;latency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;drift_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;alert_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;alert_email&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</content><category term="posts"></category></entry><entry><title>Error when restarting Databricks streaming job</title><link href="https://www.marcosantoni.com/error_restarting_databricks_streaming.html" rel="alternate"></link><published>2020-04-19T18:00:00+02:00</published><updated>2020-04-19T18:00:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2020-04-19:/error_restarting_databricks_streaming.html</id><summary type="html">&lt;p&gt;This is an error I encountered while running a Spark Streaming job on Databricks 6.1. Consider the case where I have to update a running streaming query. Databricks &lt;a href="https://docs.databricks.com/spark/latest/structured-streaming/production.html#configure-jobs-to-restart-streaming-queries-on-failure"&gt;recommends&lt;/a&gt; always starting (and restarting too?) a streaming query on a &lt;strong&gt;new&lt;/strong&gt; dedicated cluster. However, in some scenarios you …&lt;/p&gt;</summary><content type="html">&lt;p&gt;This is an error I encountered while running a Spark Streaming job on Databricks 6.1. Consider the case where I have to update a running streaming query. Databricks &lt;a href="https://docs.databricks.com/spark/latest/structured-streaming/production.html#configure-jobs-to-restart-streaming-queries-on-failure"&gt;recommends&lt;/a&gt; always starting (and restarting too?) a streaming query on a &lt;strong&gt;new&lt;/strong&gt; dedicated cluster. However, in some scenarios you might not be able to do so, and you may want to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;cancel the job run&lt;/li&gt;
&lt;li&gt;update the notebooks&lt;/li&gt;
&lt;li&gt;restart the job run&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When taking these steps, I encountered one of these errors:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Concurrent&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;update&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Multiple&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;streaming&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;detected&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;

&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;or&lt;/span&gt;

&lt;span class="n"&gt;Multiple&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;streaming&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;are&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;concurrently&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;using&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;checkpoint&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The errors did not occur every time I restarted the query, but most of the time. After restarting two or three times, the issue went away and the streaming query ran smoothly. Investigating further, we found that cancelling a job run via the Databricks CLI did not let the streaming query shut down gracefully: the running query was not closing its &lt;a href="https://docs.databricks.com/spark/latest/structured-streaming/production.html#enable-checkpointing"&gt;checkpoints&lt;/a&gt; cleanly. So, when a new job run started, it raised an error because it found a corrupted checkpoint.&lt;/p&gt;
&lt;h2&gt;Solution&lt;/h2&gt;
&lt;p&gt;You can either:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;upgrade to Databricks 6.3 and set &lt;a href="https://docs.databricks.com/release-notes/runtime/6.3.html#improvements"&gt;spark.sql.streaming.stopActiveRunOnRestart&lt;/a&gt; to &lt;code&gt;true&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;wait for Databricks 7 to be released, where this configuration is enabled by default&lt;/li&gt;
&lt;/ul&gt;</content><category term="posts"></category></entry><entry><title>New Work on atacmonitor.com</title><link href="https://www.marcosantoni.com/refactor_atacmonitor.html" rel="alternate"></link><published>2020-03-08T18:00:00+01:00</published><updated>2020-03-08T18:00:00+01:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2020-03-08:/refactor_atacmonitor.html</id><summary type="html">&lt;p&gt;&lt;img alt="atacmonitor chart" src="https://www.marcosantoni.com/images/atacmonitor_chart.png"&gt;&lt;/p&gt;
&lt;p&gt;My side project &lt;a href="http://www.atacmonitor.com/"&gt;atacmonitor&lt;/a&gt; features a new guise. Data is now being collected for &lt;strong&gt;all bus and tram&lt;/strong&gt; lines in Rome. Data pull is achieved via Python functions running on AWS Lambda. Data is then stored in MongoDB hosted in MongoDB Atlas. Atlas also provides the charts in the page …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="atacmonitor chart" src="https://www.marcosantoni.com/images/atacmonitor_chart.png"&gt;&lt;/p&gt;
&lt;p&gt;My side project &lt;a href="http://www.atacmonitor.com/"&gt;atacmonitor&lt;/a&gt; has a new look. Data is now being collected for &lt;strong&gt;all bus and tram&lt;/strong&gt; lines in Rome. The data is pulled by Python functions running on AWS Lambda and then stored in MongoDB, hosted on MongoDB Atlas. Atlas also provides the charts on the page. An overview of the new architecture is presented below.&lt;/p&gt;
&lt;p&gt;&lt;img alt="atacmonitor architecture" src="https://www.marcosantoni.com/images/atacmonitor_architecture_2.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://www.marcosantoni.com/monitoring_bus_frequencies_in_rome.html"&gt;Link&lt;/a&gt; to the post of the first release.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>The Pragmatic Programmer [Highlights]</title><link href="https://www.marcosantoni.com/the_pragmatic_programmer.html" rel="alternate"></link><published>2018-02-10T14:31:00+01:00</published><updated>2018-02-10T14:31:00+01:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2018-02-10:/the_pragmatic_programmer.html</id><summary type="html">&lt;blockquote&gt;
&lt;p&gt;Rather than construction, software is more like gardening— it is more organic than concrete. You plant many things in a garden according to an initial plan and conditions. Some thrive, others are destined to end up as compost. [...] You constantly monitor the health of the garden, and make adjustments (to …&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;blockquote&gt;
&lt;p&gt;Rather than construction, software is more like gardening— it is more organic than concrete. You plant many things in a garden according to an initial plan and conditions. Some thrive, others are destined to end up as compost. [...] You constantly monitor the health of the garden, and make adjustments (to the soil, the plants, the layout) as needed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;The Pragmatic Programmer: from Journeyman to Master&lt;/em&gt; by Andrew Hunt and David Thomas is a guide to software development best practices. A software developer is like a woodworker: there are good practices that help them achieve quality and efficiency in their work. Here I summarize some interesting hints from the book.&lt;/p&gt;
&lt;p&gt;The book was originally published in 1999, so the technologies and tools it mentions are quite outdated. However, the main principles remain surprisingly up to date.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The Pragmatic Programmer" src="https://www.marcosantoni.com/images/pragmatic_programmer.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;1. Don't Repeat Yourself&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;DRY— Don't Repeat Yourself The alternative is to have the same thing expressed in two or more places. If you change one, you have to remember to change the others [...]. It isn't a question of whether you'll remember: it's a question of &lt;strong&gt;when you'll forget&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;2. Coding over GUIs&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;A benefit of GUIs is WYSIWYG— what you see is what you get. The disadvantage is WYSIAYG— what you see is &lt;strong&gt;all&lt;/strong&gt; you get.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;3. One Editor for All&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;We think it is better to know one editor very well, and use it for all editing tasks: code, documentation, memos, system administration, and so on. Without a single editor, you face a potential modern day Babel of confusion.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;4. Always Source Control. Always.&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Always. Even if you are a single-person team on a one-week project. Even if it's a "throw-away" prototype. Even if the stuff you're working on isn't source code. Make sure that everything is under source code control— documentation, phone number lists, memos to vendors, makefiles, build and release procedures, that little shell script that burns the CD master— everything.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;5. Things can Happen&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;It goes THIS CAN NEVER HAPPEN... "This code won't be used 30 years from now, so two-digit dates are fine." "This application will never be used abroad, so why internationalize it?" "count can't be negative." "This printf can't fail.". Let's not practice this kind of self-deception, particularly when coding.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;6. Become a User&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;There's a simple technique for getting inside your users' requirements that isn't used often enough: become a user. Are you writing a system for the help desk? Spend a couple of days monitoring the phones with an experienced support person. Are you automating a manual stock control system? Work in the warehouse for a week. As well as giving you insight into how the system will really be used, you'd be amazed at how the request "May I sit in for a week while you do your job?" helps build trust and establishes a basis for communication with your users. Just remember not to get in the way!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;7. Web Docs over Files&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Web-based distribution also avoids the typical two-inch-thick binder entitled Requirements Analysis that no one ever reads and that becomes outdated the instant ink hits paper. If it's on the Web, the programmers may even read it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;8. Quality, quality, quality.&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Teams as a whole should not tolerate broken windows— those small imperfections that no one fixes. The team must take responsibility for the quality of the product, supporting developers who understand the no broken windows&lt;/p&gt;
&lt;p&gt;Some team methodologies have a quality officer— someone to whom the team delegates the responsibility for the quality of the deliverable. This is clearly &lt;strong&gt;ridiculous&lt;/strong&gt;: quality can come only from the individual contributions of all team members.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;9. Marketing the Project&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;There is a simple marketing trick that helps teams communicate as one: generate a brand. When you start a project, come up with a name for it, ideally something off-the-wall.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;10. Manual Ensures Errors&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;A great way to ensure both consistency and accuracy is to automate everything the team does.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Reference&lt;/h2&gt;
&lt;p&gt;Hunt, Andrew; Thomas, David. The Pragmatic Programmer: From Journeyman to Master. Pearson Education. Kindle Edition.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>6 Take-Aways after Reading "The Signal and The Noise"</title><link href="https://www.marcosantoni.com/the_signal_and_the_noise.html" rel="alternate"></link><published>2017-11-11T19:07:00+01:00</published><updated>2017-11-11T19:07:00+01:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2017-11-11:/the_signal_and_the_noise.html</id><summary type="html">&lt;p&gt;&lt;em&gt;The Signal and The Noise&lt;/em&gt; by Nate Silver is a must-read book for those interested in predictions. It is not a technical book. You will not learn any algorithm. However, it presents a series of real-world scenarios when predictions did work and where predictions did not work. The book is …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;em&gt;The Signal and The Noise&lt;/em&gt; by Nate Silver is a must-read book for those interested in predictions. It is not a technical book. You will not learn any algorithm. However, it presents a series of real-world scenarios when predictions did work and where predictions did not work. The book is well written and is full of valuable references to support its arguments.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The Signal and The Noise by Nate Silver" src="https://www.marcosantoni.com/images/signal_and_noise_book.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;1. Anyone can beat an index fund&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;After all, any investor can do as well as the average investor with almost no effort. All he needs to do is buy an index fund that tracks the average of the S&amp;amp;P500. In so doing he will come extremely close to replicating the average portfolio of every other trader, from Harvard MBAs to noise traders to George Soros' hedge fund manager. You have to be &lt;em&gt;really&lt;/em&gt; good -or foolhardy- to turn that proposition down.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;2. Bayesian statistics is less wrong&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Recently, however, some well-respected statisticians have begun to argue that frequentist statistics should no longer be taught to undergraduates.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Frequentist statistics emphasizes the purity of the experiment: every hypothesis could be tested to a perfect conclusion if only enough data were collected. These methods don't encourage us to think about the plausibility of our hypothesis.&lt;/p&gt;
&lt;h2&gt;3. A bug made Deep Blue beat Kasparov&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;But what had inspired Kasparov to commit this mistake? His anxiety over Deep Blue's forty-fourth move in the first game - the move in which the computer had moved its rook for no apparent purpose. Kasparov had concluded that the counterintuitive play must be a sign of superior intelligence. He had never considered that it was simply a bug.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;4. When predictions work - Weather&lt;/h2&gt;
&lt;p&gt;Weather predictions do not rely on statistics, nor on machine learning. They employ heavy simulations. The earth is split into cells, and the meteorological dynamics are simulated via well-known models. The first weather simulation was carried out by the English physicist Lewis Fry Richardson in 1916.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Richardson's Matrix" src="https://www.marcosantoni.com/images/richardson_grid.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;5. When predictions don't work - Earthquakes&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;These processes may not literally be random, but they are so irreducibly complex (right down to the last grain of sand) that it just won't be possible to predict them beyond a certain level.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;6. When predictions don't work - Economics&lt;/h2&gt;
&lt;p&gt;Raw data for economics isn't much good.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"Why do people [economists ed.] not give intervals? Because they're embarrassed"&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They are embarrassed because the intervals are just too large.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>My Talk about Superset [Python Milano Meetup]</title><link href="https://www.marcosantoni.com/talk_python_pills.html" rel="alternate"></link><published>2017-06-22T17:56:00+02:00</published><updated>2017-06-22T17:56:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2017-06-22:/talk_python_pills.html</id><summary type="html">&lt;p&gt;Yesterday, I gave a talk at the &lt;a href="https://www.meetup.com/Python-Milano/events/239846600/"&gt;Python Milano Meetup&lt;/a&gt;. The Meetup was designed as Python pills: three 20-minute talks in a row. The talks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Superset: data visualization at AirBnB - Marco Santoni&lt;/li&gt;
&lt;li&gt;Java Vs Python - Cesare Placanica&lt;/li&gt;
&lt;li&gt;pdb in action - &lt;a href="https://twitter.com/greenkey"&gt;Lorenzo Mele&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;Very nice talk of &lt;a href="https://twitter.com/Airbnb"&gt;@Airbnb&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/Superset?src=hash"&gt;#Superset&lt;/a&gt; with &lt;a href="https://twitter.com/MrSantoni"&gt;@MrSantoni&lt;/a&gt; at &lt;a href="https://twitter.com/hashtag/PythonMilano?src=hash"&gt;#PythonMilano …&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;Yesterday, I gave a talk at the &lt;a href="https://www.meetup.com/Python-Milano/events/239846600/"&gt;Python Milano Meetup&lt;/a&gt;. The Meetup was designed as Python pills: three 20-minute talks in a row. The talks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Superset: data visualization at AirBnB - Marco Santoni&lt;/li&gt;
&lt;li&gt;Java Vs Python - Cesare Placanica&lt;/li&gt;
&lt;li&gt;pdb in action - &lt;a href="https://twitter.com/greenkey"&gt;Lorenzo Mele&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;Very nice talk of &lt;a href="https://twitter.com/Airbnb"&gt;@Airbnb&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/Superset?src=hash"&gt;#Superset&lt;/a&gt; with &lt;a href="https://twitter.com/MrSantoni"&gt;@MrSantoni&lt;/a&gt; at &lt;a href="https://twitter.com/hashtag/PythonMilano?src=hash"&gt;#PythonMilano&lt;/a&gt;. I see juicy applications for us &lt;a href="https://twitter.com/hashtag/BIM?src=hash"&gt;#BIM&lt;/a&gt; guys. &lt;a href="https://t.co/Pf1r9nhNEd"&gt;https://t.co/Pf1r9nhNEd&lt;/a&gt;&lt;/p&gt;&amp;mdash; Chiara Rizzarda (@CrShelidon) &lt;a href="https://twitter.com/CrShelidon/status/877595912612311041"&gt;June 21, 2017&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;I presented &lt;a href="https://github.com/airbnb/superset"&gt;superset&lt;/a&gt;, the open source project by AirBnB. It is a data visualization platform developed in Python that lets you create interactive dashboards. The setup time is extremely short. It is interesting for enterprises because the package features deep and granular authorization policies. The dashboards can be designed by business users too: you can design them without writing SQL queries (though the option to write SQL is still there, of course). &lt;code&gt;superset&lt;/code&gt; can integrate with most SQL databases thanks to the &lt;code&gt;SQLAlchemy&lt;/code&gt; query layer. Furthermore, the &lt;code&gt;druid.io&lt;/code&gt; database is supported. I presented &lt;a href="http://www.marcosantoni.com/monitoring_bus_frequencies_in_rome.html"&gt;atacmonitor&lt;/a&gt; as an example of a &lt;code&gt;superset&lt;/code&gt; application.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Manufacturing. When data is not a commodity</title><link href="https://www.marcosantoni.com/datadriveninnovation17.html" rel="alternate"></link><published>2017-02-25T17:56:00+01:00</published><updated>2017-02-25T17:56:00+01:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2017-02-25:/datadriveninnovation17.html</id><summary type="html">&lt;p&gt;What does it mean to work as a data scientist in manufacturing? What is the value behind data? Data science has gained popularity in domains like the internet, but the industrial production domain has specific requirements.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Waiting times" src="https://www.marcosantoni.com/images/ddi_talk.jpg"&gt;&lt;/p&gt;
&lt;p&gt;I gave a talk at &lt;a href="http://2017.datadriveninnovation.org/"&gt;Data Driven Innovation&lt;/a&gt; about the specific challenges when doing data …&lt;/p&gt;</summary><content type="html">&lt;p&gt;What does it mean to work as a data scientist in manufacturing? What is the value behind data? Data science has gained popularity in domains like the internet, but the industrial production domain has specific requirements.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Waiting times" src="https://www.marcosantoni.com/images/ddi_talk.jpg"&gt;&lt;/p&gt;
&lt;p&gt;I gave a talk at &lt;a href="http://2017.datadriveninnovation.org/"&gt;Data Driven Innovation&lt;/a&gt; about the specific challenges when doing data science in manufacturing. I introduced the approach to data science that we are deploying at &lt;a href="http://www.brembo.com/en"&gt;Brembo&lt;/a&gt;. The talk was part of a track dedicated to Industry 4.0 and to IoT.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Weighted Random Sampling with PostgreSQL [Follow-up]</title><link href="https://www.marcosantoni.com/weighted_random_sampling_follow_up.html" rel="alternate"></link><published>2017-02-10T21:00:00+01:00</published><updated>2017-02-10T21:00:00+01:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2017-02-10:/weighted_random_sampling_follow_up.html</id><summary type="html">&lt;blockquote&gt;
&lt;p&gt;I received valuable feedback from &lt;a href="https://www.linkedin.com/in/decibel/"&gt;Jim Nasby&lt;/a&gt; regarding &lt;a href="http://www.marcosantoni.com/2016/08/23/weighted-random-sampling-with-postgresql.html"&gt;the post&lt;/a&gt; about weighted random sampling with PostgreSQL. I report Jim's email here.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sadly, Common Table Expressions (CTEs) are &lt;em&gt;insanely&lt;/em&gt; expensive, because
each one must be fully materialized. So in your example, you're
essentially creating 5 temp tables (one for …&lt;/p&gt;</summary><content type="html">&lt;blockquote&gt;
&lt;p&gt;I received valuable feedback from &lt;a href="https://www.linkedin.com/in/decibel/"&gt;Jim Nasby&lt;/a&gt; regarding &lt;a href="http://www.marcosantoni.com/2016/08/23/weighted-random-sampling-with-postgresql.html"&gt;the post&lt;/a&gt; about weighted random sampling with PostgreSQL. I report Jim's email here.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sadly, Common Table Expressions (CTEs) are &lt;em&gt;insanely&lt;/em&gt; expensive, because
each one must be fully materialized. So in your example, you're
essentially creating 5 temp tables (one for each CTE). Obviously that's
not a big deal with only 4 weights and 1000 samples, but for other use
cases that overhead could really add up. Note that this is not the same
as the &lt;code&gt;OFFSET 0&lt;/code&gt; trick...
You can get a similar breakdown of code by using subselects in &lt;code&gt;FROM&lt;/code&gt;
clauses. That would look something like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="k"&gt;JOIN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;cumulative_bounds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;sampling_cumulative_prob&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(....)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sampling_cumulative_prob&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cumulative_bounds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;ON&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Not as nice as &lt;code&gt;WITH&lt;/code&gt;, but not horrible. You can also create temporary
views for each of the intermediate steps.&lt;/p&gt;
&lt;p&gt;In &lt;code&gt;weights_with_sum&lt;/code&gt;, you can get rid of the &lt;code&gt;join&lt;/code&gt; in favor of
&lt;code&gt;sum(weight) OVER () AS weight_sum&lt;/code&gt;.&lt;/p&gt;
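&lt;p&gt;A minimal sketch of that window-function rewrite, which is not part of the email. It runs against an in-memory SQLite database from Python purely as a stand-in for PostgreSQL, so that the example is self-contained (an assumption; SQLite 3.25 or later is needed for window functions). The table and its values are those of the original post, and the function name &lt;code&gt;weights_with_sum&lt;/code&gt; simply mirrors the CTE name.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import sqlite3


def weights_with_sum():
    """Return (color, weight, weight_sum) rows without a self-join."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE weights (color TEXT PRIMARY KEY, weight REAL)")
    conn.executemany(
        "INSERT INTO weights (color, weight) VALUES (?, ?)",
        [("red", 8), ("blue", 3), ("green", 10), ("yellow", 10)],
    )
    # An empty OVER () turns the whole result set into a single window,
    # so every row carries the grand total next to its own weight.
    rows = conn.execute(
        "SELECT color, weight, sum(weight) OVER () AS weight_sum "
        "FROM weights ORDER BY color"
    ).fetchall()
    conn.close()
    return rows


for row in weights_with_sum():
    print(row)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Every row comes back with the same total, which is the value the cross join was computing.&lt;/p&gt;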
&lt;p&gt;Finally, &lt;code&gt;random()&lt;/code&gt; produces &lt;code&gt;0.0 &amp;lt;= x &amp;lt; 1.0&lt;/code&gt;, so the bounds on the &lt;code&gt;numrange&lt;/code&gt;
should be &lt;code&gt;'[)'&lt;/code&gt;, not &lt;code&gt;'(]'&lt;/code&gt;. Personally, I would just create the &lt;code&gt;numrange&lt;/code&gt;
immediately in &lt;code&gt;cumulative_bounds&lt;/code&gt;, but that's mostly just a matter of style.&lt;/p&gt;
&lt;p&gt;BTW, if you've got &lt;code&gt;plpythonu&lt;/code&gt; loaded there's probably an easier way to
generate the set of ranges, which could then be joined to the random
samples.&lt;/p&gt;
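&lt;p&gt;As a rough illustration of that idea, the ranges can be built in a few lines of plain Python. This is only a sketch: it is neither a PL/Python function nor part of the email, the names &lt;code&gt;cumulative_ranges&lt;/code&gt; and &lt;code&gt;sample_colors&lt;/code&gt; are hypothetical, and the weights are the ones from the original post. Each color gets a half-open range (lower bound included, upper bound excluded), and a uniform draw is mapped to the range that contains it.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import bisect
import itertools
import random

# Weights from the example table of the original post.
WEIGHTS = {"red": 8, "blue": 3, "green": 10, "yellow": 10}


def cumulative_ranges(weights):
    """Map each color to a half-open (lower, upper) slice of the unit interval."""
    total = sum(weights.values())
    uppers = [acc / total for acc in itertools.accumulate(weights.values())]
    lowers = [0.0] + uppers[:-1]
    return {color: (lo, hi) for color, lo, hi in zip(weights, lowers, uppers)}


def sample_colors(weights, n, rng=random):
    """Draw n colors with probability proportional to their weights."""
    colors = list(weights)
    uppers = [hi for _, hi in cumulative_ranges(weights).values()]
    # rng.random() never returns 1.0 and the last upper bound is exactly 1.0,
    # so bisect_right always yields a valid index; a draw that equals an
    # upper bound falls into the next range, as the half-open bounds require.
    return [colors[bisect.bisect_right(uppers, rng.random())] for _ in range(n)]


print(cumulative_ranges(WEIGHTS))
print(sample_colors(WEIGHTS, 5))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Passing a seeded &lt;code&gt;random.Random&lt;/code&gt; instance as &lt;code&gt;rng&lt;/code&gt; makes the draws reproducible.&lt;/p&gt;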
&lt;p&gt;BTW, &lt;code&gt;width_bucket(operand anyelement, thresholds anyarray)&lt;/code&gt; (see &lt;em&gt;second&lt;/em&gt;
instance on &lt;a href="https://www.postgresql.org/docs/current/static/functions-math.html"&gt;docs&lt;/a&gt;)
might be even faster; it'd definitely be simpler:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;colors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;width_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;thresholds&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;array_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;colors&lt;/span&gt;
&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;array_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cum_prod&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;thresholds&lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sampling_cumulative_prod&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</content><category term="posts"></category></entry><entry><title>Monitoring Bus Frequencies in Rome</title><link href="https://www.marcosantoni.com/monitoring_bus_frequencies_in_rome.html" rel="alternate"></link><published>2017-01-21T18:00:00+01:00</published><updated>2017-01-21T18:00:00+01:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2017-01-21:/monitoring_bus_frequencies_in_rome.html</id><summary type="html">&lt;p&gt;I have just launched &lt;a href="http://atacmonitor.com/"&gt;atacmonitor&lt;/a&gt;. It is a website providing information about the waiting time at bus stops in Rome.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Waiting times" src="https://www.marcosantoni.com/images/atacmonitor.gif"&gt;&lt;/p&gt;
&lt;h2&gt;Overview&lt;/h2&gt;
&lt;p&gt;The data source is the live bus waiting-time data of ATAC, Rome's public transport company. The transport office provides a &lt;a href="https://romamobilita.it/it/azienda/open-data/api-real-time"&gt;public API&lt;/a&gt; with real-time data.&lt;/p&gt;
&lt;p&gt;I have implemented a &lt;a href="https://github.com/Marco-Santoni/atacmonitor-data"&gt;simple …&lt;/a&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;I have just launched &lt;a href="http://atacmonitor.com/"&gt;atacmonitor&lt;/a&gt;. It is a website providing information about the waiting time at bus stops in Rome.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Waiting times" src="https://www.marcosantoni.com/images/atacmonitor.gif"&gt;&lt;/p&gt;
&lt;h2&gt;Overview&lt;/h2&gt;
&lt;p&gt;The data source is the live bus waiting-time data of ATAC, Rome's public transport company. The transport office provides a &lt;a href="https://romamobilita.it/it/azienda/open-data/api-real-time"&gt;public API&lt;/a&gt; with real-time data.&lt;/p&gt;
&lt;p&gt;I have implemented a &lt;a href="https://github.com/Marco-Santoni/atacmonitor-data"&gt;simple application&lt;/a&gt; that regularly pulls this data and stores it in a PostgreSQL database. The data is presented via AirBnB's &lt;a href="http://airbnb.io/superset/"&gt;Superset&lt;/a&gt;, an open source visualization platform. I deployed the application on the &lt;a href="https://www.heroku.com"&gt;Heroku&lt;/a&gt; PaaS.&lt;/p&gt;
&lt;p&gt;I have just kicked off the project, and only a few bus stops are monitored so far. The goal is to have all bus stops monitored soon.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Blog Migrated to Pelican on GitHub Pages</title><link href="https://www.marcosantoni.com/migrated_to_pelican.html" rel="alternate"></link><published>2016-12-28T15:38:00+01:00</published><updated>2016-12-28T15:38:00+01:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2016-12-28:/migrated_to_pelican.html</id><summary type="html">&lt;p&gt;I have migrated my blog. It is built with &lt;a href="http://blog.getpelican.com/"&gt;Pelican&lt;/a&gt;, a static site generator. It allows me to write posts as plain markdown or even Jupyter notebooks. I then use &lt;a href="https://pages.github.com/"&gt;GitHub Pages&lt;/a&gt; to version and publish the blog. I am continuing to use &lt;a href="https://www.aruba.it/home.aspx"&gt;Aruba&lt;/a&gt; as my domain provider. It is sufficient …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I have migrated my blog. It is built with &lt;a href="http://blog.getpelican.com/"&gt;Pelican&lt;/a&gt;, a static site generator. It allows me to write posts as plain markdown or even Jupyter notebooks. I then use &lt;a href="https://pages.github.com/"&gt;GitHub Pages&lt;/a&gt; to version and publish the blog. I am continuing to use &lt;a href="https://www.aruba.it/home.aspx"&gt;Aruba&lt;/a&gt; as my domain provider. It is sufficient to update the &lt;code&gt;CNAME&lt;/code&gt; and the &lt;code&gt;ANAME&lt;/code&gt; records to serve the blog under the &lt;code&gt;marcosantoni.com&lt;/code&gt; domain.&lt;/p&gt;
&lt;p&gt;The migration &lt;a href="http://mathamy.com/migrating-to-github-pages-using-pelican.html"&gt;from WordPress to Pelican&lt;/a&gt; was sped up by the &lt;code&gt;pelican-import&lt;/code&gt; tool. &lt;a href="https://fedoramagazine.org/make-github-pages-blog-with-pelican/"&gt;This blog post&lt;/a&gt; is a good reference for deploying a Pelican blog on GitHub Pages.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Insights from IEEE Big Data 16</title><link href="https://www.marcosantoni.com/ieee_big_data_16.html" rel="alternate"></link><published>2016-12-26T16:22:00+01:00</published><updated>2016-12-26T16:22:00+01:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2016-12-26:/ieee_big_data_16.html</id><summary type="html">&lt;p&gt;I have attended the IEEE Big Data 16 conference in Washington DC. I thank my company for sponsoring the trip. The conference included a &lt;a href="http://cci.drexel.edu/bigdata/bigdata2016/SpecialSymposium.html"&gt;special symposium&lt;/a&gt; dedicated to manufacturing. The symposium hosted some participants of the &lt;a href="https://www.kaggle.com/c/bosch-production-line-performance"&gt;Bosch Production Line Performance&lt;/a&gt; competition from Kaggle.&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;2016 IEEE International Conference on Big Data …&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;I have attended the IEEE Big Data 16 conference in Washington DC. I thank my company for sponsoring the trip. The conference included a &lt;a href="http://cci.drexel.edu/bigdata/bigdata2016/SpecialSymposium.html"&gt;special symposium&lt;/a&gt; dedicated to manufacturing. The symposium hosted some participants of the &lt;a href="https://www.kaggle.com/c/bosch-production-line-performance"&gt;Bosch Production Line Performance&lt;/a&gt; competition from Kaggle.&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;2016 IEEE International Conference on Big Data kicked off today in Washington, DC. Share highlights w/ hashtag &lt;a href="https://twitter.com/hashtag/IEEEBigData16?src=hash"&gt;#IEEEBigData16&lt;/a&gt; &amp;amp; we’ll RT!&lt;/p&gt;&amp;mdash; IEEE Big Data (@ieeebigdata) &lt;a href="https://twitter.com/ieeebigdata/status/805799488128425984"&gt;December 5, 2016&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;I'll list here a few notes I took during the conference.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Streaming Processing.&lt;/strong&gt; I heard about today's most popular architectures, and I highly recommend reading the blog posts by their authors:&lt;ul&gt;
&lt;li&gt;&lt;a href="http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html"&gt;Lambda architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.oreilly.com/ideas/questioning-the-lambda-architecture"&gt;Kappa architecture&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;K-Spectral Centroid.&lt;/strong&gt; The K-Spectral Centroid algorithm clusters time series by their shape, and finds the most representative shape (the cluster centroid) for each cluster.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;K-D Tree partition:&lt;/strong&gt; an algorithm for space partitioning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database Decay.&lt;/strong&gt; Interesting keynote by Michael Stonebraker. In short, large applications often share a centralized database used by different groups within a company. The DBA point of view:&lt;ul&gt;
&lt;li&gt;High Risk. When changing a DB schema, I need to find the affected applications across the company and update them accordingly (do I have budget for that?).&lt;/li&gt;
&lt;li&gt;Low Risk. No change in schema; I work around the problem in the data.&lt;/li&gt;
&lt;li&gt;Claim. DBAs want to lower the risk --&amp;gt; no change in schema --&amp;gt; ER diagram diverges from reality --&amp;gt; database decay.&lt;/li&gt;
&lt;li&gt;At some point, a total rewrite is the only way forward.&lt;/li&gt;
&lt;li&gt;If you work in analytics and get your data from operational DBs, you notice the data getting dirtier and dirtier.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PMML Scoring Engine.&lt;/strong&gt; Max Ferguson introduced the Predictive Model Markup Language (PMML). Basically, if you train a model and want to use it in a different application, PMML is a standard that defines how models should be stored as XML.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Uncertainty in RFs.&lt;/strong&gt; Random Forests can express uncertainty. One just needs to look at the distribution of predictions among the decision trees of the model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bosch.&lt;/strong&gt; Rumi Ghosh introduced the data science team at Bosch.&lt;ul&gt;
&lt;li&gt;Insight from production plants: plant managers prefer interpretable models (logistic regression or decision tree) over black box models.&lt;/li&gt;
&lt;li&gt;Research directions:&lt;ul&gt;
&lt;li&gt;Root cause analysis (via Bayesian inference)&lt;/li&gt;
&lt;li&gt;Class imbalance&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;3 Approaches in Kaggle Competition.&lt;/strong&gt; &lt;a href="https://www.kaggle.com/bpavlyshenko"&gt;Bohdan Pavlyshenko&lt;/a&gt; gave a talk on the three approaches he explored during the Kaggle competition about failure detection:&lt;ul&gt;
&lt;li&gt;Pure machine learning approach. Two levels of model ensembling, a pure black box.&lt;/li&gt;
&lt;li&gt;Generalized Linear Model with Lasso regularization. Informative about feature impact.&lt;/li&gt;
&lt;li&gt;Bayesian model in BUGS. It yields an estimate of the probability distribution for each coefficient.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;FTRL.&lt;/strong&gt; Follow the regularized leader: an online learning algorithm, presented here as a feature engineering method to convert all categorical features into one numerical feature.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CRF.&lt;/strong&gt; Conditional Random Fields are a class of predictive models used when the dataset is represented as a graph. Each node is a sample with a vector X and a target variable y.&lt;/li&gt;
&lt;/ul&gt;</content><category term="posts"></category></entry><entry><title>Weighted Random Sampling with PostgreSQL</title><link href="https://www.marcosantoni.com/2016/08/23/weighted-random-sampling-with-postgresql.html" rel="alternate"></link><published>2016-08-23T16:22:00+02:00</published><updated>2016-08-23T16:22:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2016-08-23:/2016/08/23/weighted-random-sampling-with-postgresql.html</id><summary type="html">&lt;p&gt;You have a table like the following:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;primary&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;INTO&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;red&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;blue&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;green&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;yellow&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The table lists the weights associated with certain colors. Imagine a
weight representing how much you like that color.&lt;/p&gt;
&lt;p&gt;Now …&lt;/p&gt;</summary><content type="html">&lt;p&gt;You have a table like the following:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;primary&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;INTO&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;red&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;blue&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;green&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;yellow&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The table lists the weights associated with certain colors. Think of
each weight as a measure of how much you like that color.&lt;/p&gt;
&lt;p&gt;Now, you want to add 1000 colored tiles to your website. You want the
color of each tile to be &lt;strong&gt;sampled at random&lt;/strong&gt; according to the
&lt;em&gt;weights&lt;/em&gt; table, so that a color with a higher weight shows up
proportionally more often.&lt;/p&gt;
&lt;p&gt;We'll write a PostgreSQL script that implements such random sampling.
I'll write the &lt;strong&gt;entire query first&lt;/strong&gt;, and then explain each part
separately.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sampled_colors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;weights_with_sum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;weight_sum&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;
&lt;span class="k"&gt;CROSS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;JOIN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;weight_sum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;sampling_probability&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;weight_sum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;weights_with_sum&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;sampling_cumulative_prob&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;OVER&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;order&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cum_prob&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sampling_probability&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;cumulative_bounds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;lag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cum_prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;OVER&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;BY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cum_prob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lower_cum_bound&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;cum_prob&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;upper_cum_bound&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sampling_cumulative_prob&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample_idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;color&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cumulative_bounds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;ON&lt;/span&gt;
&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;@&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;numrange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lower_cum_bound&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;upper_cum_bound&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;(]&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Let's look at one piece at a time.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;weights_with_sum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;weight_sum&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;
&lt;span class="k"&gt;CROSS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;JOIN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;weight_sum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;sampling_probability&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;weight_sum&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;weights_with_sum&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sampling_probability&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- output:&lt;/span&gt;
&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;
&lt;span class="c1"&gt;--------+--------------------&lt;/span&gt;
&lt;span class="n"&gt;red&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;258064516129032&lt;/span&gt;
&lt;span class="n"&gt;blue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0967741935483871&lt;/span&gt;
&lt;span class="n"&gt;green&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;32258064516129&lt;/span&gt;
&lt;span class="n"&gt;yellow&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;32258064516129&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here, we're just normalizing the weights. Each weight is divided by the
total sum of the weights. In this way, we are re-writing each weight as
a &lt;strong&gt;discrete probability&lt;/strong&gt; of that color being sampled.&lt;/p&gt;
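&lt;p&gt;To check the arithmetic outside PostgreSQL, the same normalization can be sketched in a few lines of Python. This is only an illustration: the dictionary copies the four rows of the &lt;em&gt;weights&lt;/em&gt; table, and the variable names are assumptions chosen to mirror the SQL column names.&lt;/p&gt;

```python
# Weights copied from the `weights` table (illustrative, outside the database).
weights = {"red": 8, "blue": 3, "green": 10, "yellow": 10}

# Total of all weights, the counterpart of weight_sum in the SQL.
weight_sum = sum(weights.values())

# Divide each weight by the total to get a discrete probability.
probs = {color: weight / weight_sum for color, weight in weights.items()}

for color, prob in probs.items():
    print(color, round(prob, 4))
```

&lt;p&gt;The rounded values should agree with the &lt;em&gt;prob&lt;/em&gt; column printed by the query, and the probabilities add up to 1.&lt;/p&gt;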
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;sampling_cumulative_prob&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;OVER&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;order&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cum_prob&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sampling_probability&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;cumulative_bounds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;lag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cum_prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;OVER&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;BY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cum_prob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lower_cum_bound&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;cum_prob&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;upper_cum_bound&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sampling_cumulative_prob&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cumulative_bounds&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- output:&lt;/span&gt;
&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lower_cum_bound&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;upper_cum_bound&lt;/span&gt;
&lt;span class="c1"&gt;--------+--------------------+--------------------&lt;/span&gt;
&lt;span class="n"&gt;blue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0967741935483871&lt;/span&gt;
&lt;span class="n"&gt;green&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0967741935483871&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;419354838709677&lt;/span&gt;
&lt;span class="n"&gt;red&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;419354838709677&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;67741935483871&lt;/span&gt;
&lt;span class="n"&gt;yellow&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;67741935483871&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this piece of code, we're representing the weights as a
&lt;strong&gt;cumulative&lt;/strong&gt; distribution function: each color gets the
interval between the previous cumulative probability and its own.&lt;/p&gt;
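&lt;p&gt;The running sum and the lagged lower bound can also be reproduced in plain Python. The sketch below assumes the same four weights and sorts the colors by name, as the window function does; the list names are illustrative and not part of the SQL script.&lt;/p&gt;

```python
from itertools import accumulate

weights = {"red": 8, "blue": 3, "green": 10, "yellow": 10}
weight_sum = sum(weights.values())

# Sort by color name, mirroring the ORDER BY color of the window function.
colors = sorted(weights)
probs = [weights[c] / weight_sum for c in colors]

# Running total of the probabilities: the upper bound of each interval.
upper_bounds = list(accumulate(probs))
# Each lower bound is the previous upper bound, starting from 0.
lower_bounds = [0.0] + upper_bounds[:-1]

for color, low, high in zip(colors, lower_bounds, upper_bounds):
    print(color, round(low, 4), round(high, 4))
```

&lt;p&gt;The intervals do not overlap and together cover the whole range from 0 to 1, which is what lets a single random number pick exactly one color.&lt;/p&gt;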
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample_idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;color&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cumulative_bounds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;ON&lt;/span&gt;
&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;@&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;numrange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lower_cum_bound&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;upper_cum_bound&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;(]&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In the last part, we draw 1000 random numbers between 0 and 1. We then
assign each sample to the color whose cumulative interval contains it. For
example, if the first sample is 0.45, it falls in the &lt;em&gt;'red'&lt;/em&gt; range
(0.42-0.68). Therefore, that sample will be &lt;em&gt;'red'&lt;/em&gt;.&lt;/p&gt;
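&lt;p&gt;This lookup is commonly known as inverse transform sampling. A minimal Python sketch of the same idea follows, assuming the same weights; the helper &lt;em&gt;sample_color&lt;/em&gt; and the fixed seed are hypothetical and only there for illustration. It uses a binary search over the cumulative weights in place of the range join.&lt;/p&gt;

```python
import bisect
import random
from collections import Counter
from itertools import accumulate

weights = {"red": 8, "blue": 3, "green": 10, "yellow": 10}
colors = sorted(weights)
# Cumulative weights play the role of the upper bounds (left unnormalized here).
cum_weights = list(accumulate(weights[c] for c in colors))
total = cum_weights[-1]

def sample_color(u):
    """Map a number u in [0, 1) to the color whose interval contains it."""
    # bisect_left finds the first cumulative weight that is not below u * total,
    # which matches the half-open '(]' intervals of the SQL query.
    return colors[bisect.bisect_left(cum_weights, u * total)]

rng = random.Random(42)  # fixed seed so the run is repeatable
samples = [sample_color(rng.random()) for _ in range(1000)]
print(Counter(samples))
```

&lt;p&gt;Calling &lt;em&gt;sample_color(0.45)&lt;/em&gt; returns 'red', the same assignment as in the example above.&lt;/p&gt;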
&lt;p&gt;The result of the query is a table filled with 1000 colors sampled at
random based on the weights.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sampled_colors&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- output:&lt;/span&gt;
&lt;span class="n"&gt;color&lt;/span&gt;
&lt;span class="c1"&gt;--------&lt;/span&gt;
&lt;span class="n"&gt;green&lt;/span&gt;
&lt;span class="n"&gt;green&lt;/span&gt;
&lt;span class="n"&gt;red&lt;/span&gt;
&lt;span class="n"&gt;yellow&lt;/span&gt;
&lt;span class="n"&gt;yellow&lt;/span&gt;
&lt;span class="n"&gt;green&lt;/span&gt;
&lt;span class="n"&gt;blue&lt;/span&gt;
&lt;span class="n"&gt;red&lt;/span&gt;
&lt;span class="n"&gt;red&lt;/span&gt;
&lt;span class="n"&gt;red&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Can we check that the result is correct? Were the weights really taken
into account?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sampled_colors&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;BY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- output:&lt;/span&gt;
&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;count&lt;/span&gt;
&lt;span class="c1"&gt;--------+-------&lt;/span&gt;
&lt;span class="n"&gt;yellow&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;309&lt;/span&gt;
&lt;span class="n"&gt;green&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;320&lt;/span&gt;
&lt;span class="n"&gt;red&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;276&lt;/span&gt;
&lt;span class="n"&gt;blue&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The proportion of samples is quite close to the proportion of the
weights. This similarity is clear if we compare this table with the
discrete probability table above.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Applied Bayesian Inference with PyMC [video]</title><link href="https://www.marcosantoni.com/2016/06/30/applied-bayesian-inference-with-pymc-video.html" rel="alternate"></link><published>2016-06-30T17:03:00+02:00</published><updated>2016-06-30T17:03:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2016-06-30:/2016/06/30/applied-bayesian-inference-with-pymc-video.html</id><content type="html">&lt;p&gt;I was glad to give an intro to Bayesian Inference at PyData Florence
2016. The video of the talk is now out.&lt;/p&gt;
&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/BX1MjMDKhXU" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;</content><category term="posts"></category></entry><entry><title>A Simple Machine Learning Pipeline</title><link href="https://www.marcosantoni.com/2016/06/19/a-simple-machine-learning-pipeline.html" rel="alternate"></link><published>2016-06-19T10:37:00+02:00</published><updated>2016-06-19T10:37:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2016-06-19:/2016/06/19/a-simple-machine-learning-pipeline.html</id><summary type="html">&lt;p&gt;This post contains the code that I used in my talk at Python Milano
Meetup on &lt;a href="http://www.meetup.com/Python-Milano/events/231710577/"&gt;June 22nd
2016&lt;/a&gt;. The talk
was a quick overview of &lt;strong&gt;Pipeline&lt;/strong&gt;, a nice API by &lt;em&gt;scikit-learn&lt;/em&gt; to
abstract your machine learning algorithm. It is based on the Boston
&lt;a href="https://archive.ics.uci.edu/ml/datasets/Housing"&gt;Housing Data Set&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We'll just load …&lt;/p&gt;</summary><content type="html">&lt;p&gt;This post contains the code that I used in my talk at Python Milano
Meetup on &lt;a href="http://www.meetup.com/Python-Milano/events/231710577/"&gt;June 22nd
2016&lt;/a&gt;. The talk
was a quick overview of &lt;strong&gt;Pipeline&lt;/strong&gt;, a nice API by &lt;em&gt;scikit-learn&lt;/em&gt; to
abstract your machine learning algorithm. It is based on the Boston
&lt;a href="https://archive.ics.uci.edu/ml/datasets/Housing"&gt;Housing Data Set&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We'll just load the data set from &lt;em&gt;sklearn&lt;/em&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_boston&lt;/span&gt;
&lt;span class="n"&gt;housing_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_boston&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt; &lt;span class="n"&gt;housing_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DESCR&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We might want to make it a Pandas dataframe to make things easier to
handle.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;housing_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;housing_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;feature_names&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;PRICE&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;housing_data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="table" src="https://www.marcosantoni.com/images/table.png"&gt;&lt;/p&gt;
&lt;p&gt;The goal is to predict the &lt;em&gt;PRICE&lt;/em&gt; variable given the other features.
How is this variable distributed?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PRICE&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;PRICE&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="download
(8)" src="https://www.marcosantoni.com/images/download-8.png"&gt;{.alignnone
.size-full .wp-image-74 width="378" height="271"}&lt;/p&gt;
&lt;p&gt;Let's split the dataframe into the usual ML notation: a feature matrix &lt;em&gt;X&lt;/em&gt; and a target vector &lt;em&gt;y&lt;/em&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;PRICE&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;PRICE&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We will now define the metric that assesses the accuracy of our
algorithm/pipeline. Let's use the mean squared error, estimated with good old cross-validation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cross_validation&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;algorithm&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Mean Squared Error&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_validation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;algorithm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mean_squared_error&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Accuracy: &lt;/span&gt;&lt;span class="si"&gt;%0.2f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
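
&lt;p&gt;To make the metric concrete, here is a minimal, library-free sketch of cross-validated mean squared error for a one-feature least-squares line. It uses three folds, matching the three scores printed in this post. The helper names (&lt;em&gt;fit_line&lt;/em&gt;, &lt;em&gt;k_fold_mse&lt;/em&gt;) and the toy data are assumptions made for illustration; this is not scikit-learn's implementation.&lt;/p&gt;

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def k_fold_mse(xs, ys, k=3):
    """Return one held-out mean squared error per fold (contiguous folds)."""
    n = len(xs)
    bounds = [round(i * n / k) for i in range(k + 1)]
    errors = []
    for start, stop in zip(bounds, bounds[1:]):
        # Train on everything outside the fold, evaluate on the fold itself.
        train_x = xs[:start] + xs[stop:]
        train_y = ys[:start] + ys[stop:]
        a, b = fit_line(train_x, train_y)
        fold = [(a * x + b - y) ** 2 for x, y in zip(xs[start:stop], ys[start:stop])]
        errors.append(sum(fold) / len(fold))
    return errors


# Toy data lying exactly on y = 2x + 1, so every fold error is zero.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x + 1.0 for x in xs]
fold_errors = k_fold_mse(xs, ys)
mean_error = sum(fold_errors) / len(fold_errors)
```

&lt;p&gt;Each entry of &lt;em&gt;fold_errors&lt;/em&gt; plays the role of one value in the printed score array, and &lt;em&gt;mean_error&lt;/em&gt; corresponds to the single number that &lt;em&gt;evaluate_model&lt;/em&gt; prints last.&lt;/p&gt;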

&lt;p&gt;We can now try a bunch of algorithms and compare them
by calling &lt;em&gt;evaluate_model&lt;/em&gt;. Before implementing the first
one, let's explore the data set a bit. Is there any pattern we
can exploit?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;RM&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Average number of rooms&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Housing price in \$1000&lt;/span&gt;&lt;span class="se"&gt;\&amp;#39;&lt;/span&gt;&lt;span class="s1"&gt;s&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="download" src="https://www.marcosantoni.com/images/download.png" width="610" height="438"&gt;&lt;/p&gt;
&lt;p&gt;As expected, there is a relation between the average number of rooms and
the median price. So, let's build a first algorithm: a linear regression on the &lt;em&gt;RM&lt;/em&gt; column alone.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FunctionTransformer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;just_RM_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;RM_col_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RM_col_index&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;FunctionTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;just_RM_column&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
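
&lt;p&gt;Conceptually, a pipeline applies each transformer in order and hands the result to the final estimator, both when fitting and when predicting. The sketch below is a toy illustration of that idea in plain Python. The names &lt;em&gt;TinyPipeline&lt;/em&gt;, &lt;em&gt;MeanRegressor&lt;/em&gt; and &lt;em&gt;just_second_column&lt;/em&gt; are hypothetical, and this is not how scikit-learn implements its pipelines.&lt;/p&gt;

```python
class MeanRegressor:
    """Toy estimator: always predicts the mean of the training targets."""

    def fit(self, rows, targets):
        self.mean_ = sum(targets) / len(targets)
        return self

    def predict(self, rows):
        return [self.mean_ for _ in rows]


class TinyPipeline:
    """Apply stateless row transforms in order, then a final estimator."""

    def __init__(self, transforms, estimator):
        self.transforms = transforms
        self.estimator = estimator

    def transform(self, rows):
        for step in self.transforms:
            rows = step(rows)
        return rows

    def fit(self, rows, targets):
        self.estimator.fit(self.transform(rows), targets)
        return self

    def predict(self, rows):
        # The same transforms run before prediction as before fitting.
        return self.estimator.predict(self.transform(rows))


def just_second_column(rows):
    """Keep only the column at index 1, as a list of one-element rows."""
    return [[row[1]] for row in rows]


rows = [[0.1, 6.0], [0.2, 7.0], [0.3, 5.0]]
pipe = TinyPipeline([just_second_column], MeanRegressor())
pipe.fit(rows, [20.0, 30.0, 10.0])
selected = pipe.transform(rows)
predictions = pipe.predict([[0.0, 6.5]])
```

&lt;p&gt;The point of the design is that the column selection travels with the model, so the caller always passes the full feature matrix.&lt;/p&gt;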

&lt;p&gt;How well does it perform?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;evaluate_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
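&lt;span class="sd"&gt;&amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;
&lt;span class="sd"&gt;Mean Squared Error&lt;/span&gt;
&lt;span class="sd"&gt;[43.19492771 41.72813479 46.89293772]&lt;/span&gt;
&lt;span class="sd"&gt;Accuracy: 43.94&lt;/span&gt;
&lt;span class="sd"&gt;&amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;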
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Can we visualize what the pipeline is actually doing?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_model_RM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_validation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# replace the RM column (index 5) with an evenly spaced grid,&lt;/span&gt;
    &lt;span class="c1"&gt;# so that the predictions trace the fitted curve&lt;/span&gt;
    &lt;span class="n"&gt;fake_X_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fake_X_train&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fake_X_train&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fake_X_train&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fake_X_train&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="n"&gt;fake_X_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fake_X_test&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fake_X_test&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fake_X_test&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fake_X_test&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# left panel: training points in blue, model predictions in red&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;RM&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fake_X_train&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fake_X_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Average number of rooms&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Housing price in \$1000&lt;/span&gt;&lt;span class="se"&gt;\&amp;#39;&lt;/span&gt;&lt;span class="s1"&gt;s&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Train Data Set&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# right panel: the same plot for the held-out test points&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;RM&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fake_X_test&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fake_X_test&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Average number of rooms&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Housing price in \$1000&lt;/span&gt;&lt;span class="se"&gt;\&amp;#39;&lt;/span&gt;&lt;span class="s1"&gt;s&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Test Data Set&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;plot_model_RM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="download (1)" src="https://www.marcosantoni.com/images/download-1.png" width="1173" height="449"&gt;&lt;/p&gt;
&lt;p&gt;We now do a bit of feature engineering: we append the square of each selected feature (here just &lt;em&gt;RM&lt;/em&gt;) as an extra column.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_squared_col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hstack&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;FunctionTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;just_RM_column&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;FunctionTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;add_squared_col&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
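
&lt;p&gt;Why does squaring help a linear model? With the columns [x, x squared] the model is still linear in its weights, yet its prediction is a parabola in x. The sketch below checks this on toy numbers in plain Python. The weights, the intercept and the helper &lt;em&gt;linear_predict&lt;/em&gt; are made up for illustration; they are not values fitted on the housing data.&lt;/p&gt;

```python
def add_squared_col(rows):
    """Pure-Python analogue of np.hstack((X, X**2)) for lists of rows."""
    return [row + [value ** 2 for value in row] for row in rows]


def linear_predict(rows, weights, intercept):
    """Prediction of a linear model: intercept + sum(w_i * x_i)."""
    return [intercept + sum(w * x for w, x in zip(weights, row)) for row in rows]


rooms = [[4.0], [6.0], [8.0]]        # a single RM-like column
features = add_squared_col(rooms)    # each row becomes [x, x**2]

# Hypothetical weights: price = 10 - 3 * x + 0.5 * x**2
predictions = linear_predict(features, weights=[-3.0, 0.5], intercept=10.0)

# Equal steps in x give unequal steps in the prediction: the fit is curved.
steps = [b - a for a, b in zip(predictions, predictions[1:])]
```

&lt;p&gt;Here &lt;em&gt;steps&lt;/em&gt; comes out as two different increments for the same change in x, which a straight line could never produce.&lt;/p&gt;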

&lt;p&gt;We evaluate this new pipeline.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;evaluate_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="sd"&gt;&amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;
&lt;span class="sd"&gt;Mean Squared Error&lt;/span&gt;
&lt;span class="sd"&gt;[ 40.31207562 36.75642688 40.75444834]&lt;/span&gt;
&lt;span class="sd"&gt;Accuracy: 39.27&amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And we plot how the algorithm fits the data set.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;plot_model_RM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="download (2)" src="https://www.marcosantoni.com/images/download-2.png" width="1165" height="449"&gt;&lt;/p&gt;
&lt;p&gt;We now try a different model: a &lt;em&gt;decision tree&lt;/em&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.tree&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DecisionTreeRegressor&lt;/span&gt;

&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;FunctionTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;just_RM_column&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;FunctionTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;add_squared_col&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;DecisionTreeRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;evaluate_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="sd"&gt;&amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;
&lt;span class="sd"&gt;Mean Squared Error&lt;/span&gt;
&lt;span class="sd"&gt;[ 57.28366371 61.5437311 84.32756118]&lt;/span&gt;
&lt;span class="sd"&gt;Accuracy: 67.72&lt;/span&gt;
&lt;span class="sd"&gt;&amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;plot_model_RM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
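
&lt;p&gt;A regression tree predicts a constant within each region of the feature space, so on a single feature its fit is a step function rather than a curve. As a rough illustration, the sketch below finds the single best split (a depth-1 tree) on toy data by minimising the squared error on both sides. The function names and the numbers are hypothetical, and this is not scikit-learn's implementation.&lt;/p&gt;

```python
def mean(values):
    return sum(values) / len(values)


def squared_error(values):
    """Sum of squared deviations from the mean of the values."""
    center = mean(values)
    return sum((value - center) ** 2 for value in values)


def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimising squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = squared_error(left) + squared_error(right)
        # Candidate threshold: midpoint between two neighbouring x values.
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or best[0] > cost:
            best = (cost, threshold, mean(left), mean(right))
    return best[1:]


rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
prices = [10.0, 12.0, 11.0, 30.0, 32.0]
threshold, low_price, high_price = best_split(rooms, prices)
```

&lt;p&gt;A tree with &lt;em&gt;max_depth=3&lt;/em&gt; repeats this kind of split up to three levels deep, which gives at most eight constant segments.&lt;/p&gt;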

&lt;p&gt;&lt;img alt="download (3)" src="https://www.marcosantoni.com/images/download-3.png" width="1165" height="449"&gt;&lt;/p&gt;
&lt;p&gt;We now explore a second feature: &lt;em&gt;INDUS&lt;/em&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;INDUS&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Proportion of non-retail business acres per town (INDUS)&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Housing price in \$1000&lt;/span&gt;&lt;span class="se"&gt;\&amp;#39;&lt;/span&gt;&lt;span class="s1"&gt;s&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="download (4)" src="https://www.marcosantoni.com/images/download-4.png" width="610" height="438"&gt;&lt;/p&gt;
&lt;p&gt;There is also a relation between &lt;em&gt;INDUS&lt;/em&gt; and &lt;em&gt;PRICE&lt;/em&gt;, so let's add
this second feature.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;RM_and_INDUS_cols&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;RM_col_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
    &lt;span class="n"&gt;INDUS_col_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RM_col_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;INDUS_col_index&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;FunctionTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RM_and_INDUS_cols&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;FunctionTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;add_squared_col&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;evaluate_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="sd"&gt;&amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;
&lt;span class="sd"&gt;Mean Squared Error&lt;/span&gt;
&lt;span class="sd"&gt;[ 32.3420789 31.4260901 35.95835866]&lt;/span&gt;
&lt;span class="sd"&gt;Accuracy: 33.24&lt;/span&gt;
&lt;span class="sd"&gt;&amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now, plotting a model in 3D needs a bit more effort.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_model_RM_INDUS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_validation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Axes3D&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;marker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;o&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;100.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;100.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;meshgrid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fake_X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;fake_X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;fake_X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;Z&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fake_X&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot_surface&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Z&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;INDUS&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;RM&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_zlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plot_model_RM_INDUS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="animation" src="https://www.marcosantoni.com/images/animation.gif" width="720" height="504"&gt;&lt;/p&gt;
&lt;p&gt;How pretty is that?&lt;/p&gt;
&lt;p&gt;The next step is to use all the available features, so we move to
a 13-dimensional feature vector.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;evaluate_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="sd"&gt;&amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;
&lt;span class="sd"&gt;Mean Squared Error&lt;/span&gt;
&lt;span class="sd"&gt;[ 20.50009513 22.42870192 27.88911654]&lt;/span&gt;
&lt;span class="sd"&gt;Accuracy: 23.61&amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
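&lt;p&gt;A note on reading this output: the figure printed as "Accuracy" appears to be the mean of the three per-fold mean squared errors, so lower is better. A minimal check in plain Python, using only the numbers printed above:&lt;/p&gt;

```python
# Per-fold mean squared errors copied from the printed output.
fold_mse = [20.50009513, 22.42870192, 27.88911654]

# Averaging the folds reproduces the value labelled "Accuracy".
mean_mse = sum(fold_mse) / len(fold_mse)
print(round(mean_mse, 2))
```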

&lt;p&gt;The error dropped noticeably. However, we cannot plot a model in
13 dimensions. We will now re-use the function that adds a squared
feature.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;FunctionTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;add_squared_col&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;evaluate_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="sd"&gt;&amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;
&lt;span class="sd"&gt;Mean Squared Error&lt;/span&gt;
&lt;span class="sd"&gt;[ 16.7819682 14.599869 18.17785453]&lt;/span&gt;
&lt;span class="sd"&gt;Accuracy: 16.52&amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Even better. Now we switch to a ridge regressor, combined with
standardization of the features.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Ridge&lt;/span&gt;

&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="n"&gt;FunctionTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;add_squared_col&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;Ridge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;evaluate_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="sd"&gt;&amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;
&lt;span class="sd"&gt;Mean Squared Error&lt;/span&gt;
&lt;span class="sd"&gt;[ 16.4292824 14.50522561 18.27167008]&lt;/span&gt;
&lt;span class="sd"&gt;Accuracy: 16.40&amp;#39;&amp;#39;&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</content><category term="posts"></category></entry><entry><title>Install a .deb file from terminal on Ubuntu</title><link href="https://www.marcosantoni.com/2016/05/23/install-a-deb-file-from-terminal-on-ubuntu.html" rel="alternate"></link><published>2016-05-23T08:18:00+02:00</published><updated>2016-05-23T08:18:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2016-05-23:/2016/05/23/install-a-deb-file-from-terminal-on-ubuntu.html</id><content type="html">&lt;p&gt;I use Ubuntu 16.04. Sometimes, when I double-click a &lt;em&gt;.deb&lt;/em&gt; file, the
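installer does not work. What often solves the problem is installing
the package from the terminal with &lt;code&gt;dpkg&lt;/code&gt;, then letting
&lt;code&gt;apt-get -f install&lt;/code&gt; pull in any missing dependencies.&lt;/p&gt;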
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;dpkg&lt;span class="w"&gt; &lt;/span&gt;-i&lt;span class="w"&gt; &lt;/span&gt;my_deb_file.deb
sudo&lt;span class="w"&gt; &lt;/span&gt;apt-get&lt;span class="w"&gt; &lt;/span&gt;-f&lt;span class="w"&gt; &lt;/span&gt;install
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</content><category term="posts"></category></entry><entry><title>Insights from Data Science Milan - 19/05/16</title><link href="https://www.marcosantoni.com/2016/05/20/insights-from-data-science-milan-190516.html" rel="alternate"></link><published>2016-05-20T17:56:00+02:00</published><updated>2016-05-20T17:56:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2016-05-20:/2016/05/20/insights-from-data-science-milan-190516.html</id><summary type="html">&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;&lt;a href="https://twitter.com/hashtag/DeepLearning?src=hash"&gt;#DeepLearning&lt;/a&gt; introduction and enterprise architectures using &lt;a href="https://twitter.com/hashtag/H2O?src=hash"&gt;#H2O&lt;/a&gt; - first &lt;a href="https://twitter.com/hashtag/DataScienceMilan?src=hash"&gt;#DataScienceMilan&lt;/a&gt; meetup! - &lt;a href="https://t.co/I8LsfaFJSu"&gt;https://t.co/I8LsfaFJSu&lt;/a&gt;&lt;/p&gt;&amp;mdash; Andrea Scarso (@andreaesseci) &lt;a href="https://twitter.com/andreaesseci/status/733044189349482496"&gt;May 18, 2016&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;A new &lt;strong&gt;Data Science meetup&lt;/strong&gt; is out in Milan. Two talks about Deep
Learning were given in the first event.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Neural Networks and Deep Learning: An
Introduction. &lt;a href="https://twitter.com/milanhightech"&gt;@MilanHighTech&lt;/a&gt;.&lt;/strong&gt; The
first …&lt;/p&gt;</summary><content type="html">&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;&lt;a href="https://twitter.com/hashtag/DeepLearning?src=hash"&gt;#DeepLearning&lt;/a&gt; introduction and enterprise architectures using &lt;a href="https://twitter.com/hashtag/H2O?src=hash"&gt;#H2O&lt;/a&gt; - first &lt;a href="https://twitter.com/hashtag/DataScienceMilan?src=hash"&gt;#DataScienceMilan&lt;/a&gt; meetup! - &lt;a href="https://t.co/I8LsfaFJSu"&gt;https://t.co/I8LsfaFJSu&lt;/a&gt;&lt;/p&gt;&amp;mdash; Andrea Scarso (@andreaesseci) &lt;a href="https://twitter.com/andreaesseci/status/733044189349482496"&gt;May 18, 2016&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;A new &lt;strong&gt;Data Science meetup&lt;/strong&gt; is out in Milan. Two talks about Deep
Learning were given in the first event.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Neural Networks and Deep Learning: An
Introduction. &lt;a href="https://twitter.com/milanhightech"&gt;@MilanHighTech&lt;/a&gt;.&lt;/strong&gt; The
first talk, by Valentino Zocca, was a quick intro to Deep Learning. The
speaker explained the role of the additional layers in a
neural network. Each layer learns something, and each one
learns a different representation of the input. In particular, each
additional layer learns a more abstract representation than the one
before it.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Face recognition" src="https://indico.io/blog/wp-content/uploads/2016/02/cnn_deeper.jpg" width="370" height="506"&gt;&lt;/p&gt;
&lt;p&gt;Each layer is learning a higher level of abstraction. In the example,
the first layer is learning the edges in the image; the second layer is
learning the parts of a face like the nose or the eye; the third layer
is learning large sections of a face. Ref: "&lt;em&gt;Convolutional Deep Belief
Networks for Scalable Unsupervised Learning of Hierarchical
Representations&lt;/em&gt;", Lee et al.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Bringing Deep Learning into production.&lt;/strong&gt;
&lt;a href="https://twitter.com/axlpado"&gt;@axlpado&lt;/a&gt;. The speaker gave his point of
view on deploying machine learning algorithms in production. There is a
variety of frameworks, and it's not always easy to choose which one to
adopt. He gave a series of interesting tips, and I'll list the
main ones here.&lt;/p&gt;
&lt;p&gt;You can write machine learning code in many languages, such as Python, Java,
R, Matlab, Scala, etc. A good guideline is: choose the one you know
best. Do not add the complexity of learning a new language to the
complexity of designing the algorithm.&lt;/p&gt;
&lt;p&gt;Different languages in different teams.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Data science languages" src="https://www.marcosantoni.com/images/20160519_193804-1.jpg" width="896" height="504"&gt;&lt;/p&gt;
&lt;p&gt;It can be a challenge to move machine learning models from one team to
another. The reason is that teams often work in different languages or
in different frameworks. This organization leads to complex deployment
processes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Tips for deployment" src="https://www.marcosantoni.com/images/20160519_194315.jpg" width="896" height="504"&gt;&lt;/p&gt;
&lt;p&gt;Paolo recommended having the entire team on the same framework. The
idea is to make the deployment pipeline as smooth as possible. It can
take some effort for the data scientists at the beginning to learn the data
engineer tools, but it can make the difference on the long term.&lt;/p&gt;</content><category term="posts"></category></entry><entry><title>Bayesian A/B Testing in Python</title><link href="https://www.marcosantoni.com/2016/05/15/bayesian-ab-testing-in-python.html" rel="alternate"></link><published>2016-05-15T15:33:00+02:00</published><updated>2016-05-15T15:33:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2016-05-15:/2016/05/15/bayesian-ab-testing-in-python.html</id><summary type="html">&lt;p&gt;Imagine you re-designing your e-commerce website. You have to decide
whether the "Buy Item" button should be blue or green. You decide to
setup an A/B test, so you build two versions of the item page:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Page A&lt;/strong&gt; which has a blue button;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Page B&lt;/strong&gt; which has a green …&lt;/li&gt;&lt;/ul&gt;</summary><content type="html">&lt;p&gt;Imagine you are redesigning your e-commerce website. You have to decide
whether the "Buy Item" button should be blue or green. You decide to
set up an A/B test, so you build two versions of the item page:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Page A&lt;/strong&gt; which has a blue button;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Page B&lt;/strong&gt; which has a green button.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Pages A and B are identical except for the color of the button. You want
to quantify the likelihood of a user clicking the "Buy Item" button when
she is on page A or on page B. So, you start the experiment by sending
each user either to page A or to page B. Each time, you monitor whether
she clicked "Buy Item" or not.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Frequentist vs Bayesian&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;One could simply approximate the effectiveness of each page by computing
the &lt;strong&gt;success rate&lt;/strong&gt; on the two pages. E.g. if N=1000 users visited page
A, and 50 of them clicked the button, one could say that the likelihood
of clicking the button on page A is 50/1000 = 5%. This is the
so-called &lt;strong&gt;Frequentist&lt;/strong&gt; approach, which interprets probability in
terms of event frequency. However, the following questions often arise in
practice:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What if N is small (e.g. N=50)? Can we still be confident by just
    computing the success rate?&lt;/li&gt;
&lt;li&gt;What if N is different between page A and page B? Let's say that 500
    users visited page A and 2000 users visited page B. How can we
    combine such imbalanced experiments?&lt;/li&gt;
&lt;li&gt;How large should N be to achieve a 90% confidence in my estimates?&lt;/li&gt;
&lt;/ul&gt;
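&lt;p&gt;To make the first question concrete, here is a minimal sketch (plain Python, not part of the original analysis) of how the standard error of an observed success rate, $\sqrt{\hat{p}(1-\hat{p})/N}$, shrinks as N grows. The function name, the 5% rate and the two sample sizes are illustrative assumptions.&lt;/p&gt;

```python
import math


def success_rate_std_error(p_hat, n):
    """Standard error of an observed success rate p_hat over n visitors."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n)


# Same observed 5% rate, very different uncertainty.
for n in (50, 1000):
    print(n, round(success_rate_std_error(0.05, n), 4))
```

&lt;p&gt;With N=50 the standard error is about 3 percentage points, which is comparable to the 5% rate itself, so the raw success rate alone says little.&lt;/p&gt;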
&lt;p&gt;We'll now introduce a simple &lt;strong&gt;Bayesian&lt;/strong&gt; solution that lets us run
the A/B test and handle the issues listed above. The code makes use
of the &lt;a href="https://pymc-devs.github.io/pymc/"&gt;PyMC&lt;/a&gt; package, and it was
inspired by reading "Bayesian Methods for Hackers" by &lt;a href="https://twitter.com/Cmrn_DP?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor"&gt;Cameron
Davidson-Pilon&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Evaluate Page A&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We'll first show how to evaluate the success rate on page A with a
Bayesian approach. The goal is to infer the probability of clicking the
"Buy Item" button on page A. We model each click event as a
&lt;a href="https://www.wikiwand.com/en/Bernoulli_distribution"&gt;Bernoulli&lt;/a&gt;
random variable with parameter $p_A$:&lt;/p&gt;
&lt;p&gt;$$P(\text{click} \mid \text{page}=A) =
\begin{cases}
p_A &amp;amp; \text{click}=1\\
1-p_A &amp;amp; \text{click}=0
\end{cases}$$&lt;/p&gt;
&lt;p&gt;So, $p_A$ is the parameter indicating the probability
of clicking the button on page A. This parameter is unknown and the goal
of the experiment is to infer it.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pymc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Uniform&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rbernoulli&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Bernoulli&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MCMC&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;matplotlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# true value of p_A (unknown)&lt;/span&gt;
&lt;span class="n"&gt;p_A_true&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;
&lt;span class="c1"&gt;# number of users visiting page A&lt;/span&gt;
&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1500&lt;/span&gt;
&lt;span class="n"&gt;occurrences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rbernoulli&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_A_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Click-BUY:&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;occurrences&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Observed frequency:&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;occurrences&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this code, we are simulating a realisation of the experiment where
1500 users visited page A. Here, &lt;em&gt;occurrences&lt;/em&gt; holds one outcome per
visitor, indicating whether that visitor clicked the button in this realisation.&lt;/p&gt;
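&lt;p&gt;If PyMC's &lt;em&gt;rbernoulli&lt;/em&gt; helper is not available, the same kind of simulation can be sketched with the standard library alone. This is an illustrative stand-in, not the code used for the results in this post; the seed is an arbitrary assumption added for reproducibility.&lt;/p&gt;

```python
import random

rng = random.Random(42)  # arbitrary seed, only for reproducibility

p_A_true = 0.05  # true click probability (unknown in a real experiment)
N = 1500         # number of users visiting page A

# One 0/1 outcome per visitor: 1 means the visitor clicked "Buy Item".
occurrences = [1 if p_A_true > rng.random() else 0 for _ in range(N)]

print('Click-BUY:', sum(occurrences))
print('Observed frequency:', sum(occurrences) / float(N))
```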
&lt;p&gt;The next step consists of defining our prior on the
$p_A$ parameter. The &lt;strong&gt;prior definition&lt;/strong&gt; is the
first step of Bayesian inference and is a way to indicate our prior
belief in the variable.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;p_A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;p_A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;obs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Bernoulli&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;obs&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;occurrences&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this section, we define the prior of $p_A$ to be a
uniform distribution. The &lt;em&gt;obs&lt;/em&gt; variable indicates the Bernoulli
distribution representing the observations of the click events (indeed
governed by the $p_A$ parameter). The two variables
are instances of &lt;em&gt;Uniform&lt;/em&gt; and &lt;em&gt;Bernoulli&lt;/em&gt;, which are stochastic variable
classes provided by PyMC. Each variable is associated with a string name
(&lt;em&gt;p_A&lt;/em&gt; and &lt;em&gt;obs&lt;/em&gt; in this case). The &lt;em&gt;obs&lt;/em&gt; variable has the &lt;em&gt;value&lt;/em&gt;
and &lt;em&gt;observed&lt;/em&gt; parameters set because we have observed the
realisations of the experiment.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# defining a Markov chain Monte Carlo (MCMC) model&lt;/span&gt;
&lt;span class="n"&gt;mcmc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MCMC&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;p_A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# run 20,000 iterations and discard the first 1,000 as burn-in&lt;/span&gt;
&lt;span class="n"&gt;mcmc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# the resulting posterior samples are available through the trace method&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mcmc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;p_A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[:])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this section, the MCMC model is initialised, and the variables &lt;em&gt;p_A&lt;/em&gt;
and &lt;em&gt;obs&lt;/em&gt; are given to it as input. The &lt;em&gt;sample&lt;/em&gt; method runs the
Monte Carlo simulation and updates the prior belief with the observed data.
The posterior distribution is accessible via the &lt;em&gt;trace&lt;/em&gt; method as
an array of samples. We can now visualise the result of the
inference.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mcmc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;p_A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[:],&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;histtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;stepfilled&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;density&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Probability of clicking BUY&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Density&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vlines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_A_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linestyle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;--&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;True p_A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="p_A_hist_N_1500" src="https://www.marcosantoni.com/images/p_A_hist_N_1500.png" width="800" height="700"&gt;&lt;/p&gt;
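&lt;p&gt;For intuition about how such posterior samples are produced, here is a minimal sketch of a random-walk Metropolis sampler for the same model (uniform prior, Bernoulli likelihood) in plain Python. It illustrates the general idea under stated assumptions and is not PyMC's actual implementation: the function name, the step size, the starting point and the click count of 75 are all hypothetical.&lt;/p&gt;

```python
import math
import random


def metropolis_p(clicks, n, iters=20000, burn=1000, step=0.01, seed=0):
    """Sample p from its posterior given `clicks` successes in `n` trials."""
    rng = random.Random(seed)

    def log_post(p):
        # The Uniform(0, 1) prior only adds a constant, so the
        # Bernoulli log-likelihood is enough for the acceptance ratio.
        return clicks * math.log(p) + (n - clicks) * math.log(1.0 - p)

    p = 0.5  # arbitrary starting point
    samples = []
    for i in range(iters):
        proposal = p + rng.gauss(0.0, step)
        # Proposals outside (0, 1) have zero prior probability: reject them.
        if 1.0 > proposal > 0.0:
            log_ratio = log_post(proposal) - log_post(p)
            # 1 - random() lies in (0, 1], so the logarithm is always defined.
            if log_ratio > math.log(1.0 - rng.random()):
                p = proposal
        if i >= burn:
            samples.append(p)  # keep only post-burn-in draws
    return samples


samples = metropolis_p(clicks=75, n=1500)
print(sum(samples) / len(samples))
```

&lt;p&gt;The retained samples play the same role as the array returned by the trace: their histogram approximates the posterior of $p_A$.&lt;/p&gt;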
&lt;p&gt;Then, we might want to answer the question: in which interval does
the true $p_A$ lie with 90% probability? That's easy to answer from the posterior samples.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;p_A_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mcmc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;p_A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[:]&lt;/span&gt;
&lt;span class="n"&gt;lower_bound&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_A_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;upper_bound&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_A_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;There is 90&lt;/span&gt;&lt;span class="si"&gt;%%&lt;/span&gt;&lt;span class="s1"&gt; probability that p_A is between &lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt; and &lt;/span&gt;&lt;span class="si"&gt;%s&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;
      &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lower_bound&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper_bound&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# There is 90% probability that p_A is between 0.0373019596856 and 0.0548052806892&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
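&lt;p&gt;As a cross-check that does not need MCMC: with a Uniform(0, 1) prior and k clicks out of N Bernoulli trials, the posterior of $p_A$ is known in closed form as Beta(k + 1, N - k + 1). The sketch below draws from that distribution with the standard library and reads off the 5th and 95th percentiles. The function name is hypothetical, and k = 69 is an assumed click count, since the simulated number of clicks changes from run to run.&lt;/p&gt;

```python
import random


def beta_credible_interval(k, n, level=0.90, draws=100000, seed=0):
    """Equal-tailed credible interval for p under a Uniform(0, 1) prior."""
    rng = random.Random(seed)
    # Posterior is Beta(k + 1, n - k + 1) for k successes in n trials.
    samples = sorted(rng.betavariate(k + 1, n - k + 1) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    lower = samples[int(tail * draws)]
    upper = samples[int((1.0 - tail) * draws) - 1]
    return lower, upper


lower, upper = beta_credible_interval(k=69, n=1500)
print('There is 90%% probability that p_A is between %.4f and %.4f' % (lower, upper))
```

&lt;p&gt;For a click count close to the one simulated above, the interval should land near the MCMC result, which is a useful sanity check on the sampler.&lt;/p&gt;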

&lt;p&gt;&lt;strong&gt;Comparing Page A and Page B&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We'll now repeat what we have done for page A, and we add a new
variable &lt;em&gt;delta&lt;/em&gt; indicating the difference
between $p_A$ and $p_B$.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pymc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Uniform&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rbernoulli&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Bernoulli&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MCMC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;deterministic&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;matplotlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;p_A_true&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;
&lt;span class="n"&gt;p_B_true&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.04&lt;/span&gt;
&lt;span class="n"&gt;N_A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1500&lt;/span&gt;
&lt;span class="n"&gt;N_B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;750&lt;/span&gt;

&lt;span class="n"&gt;occurrences_A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rbernoulli&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_A_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N_A&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;occurrences_B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rbernoulli&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_B_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;N_B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;print&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Observed frequency:&amp;#39;&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;A&amp;#39;&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt; &lt;span class="n"&gt;occurrences_A&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N_A&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;B&amp;#39;&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt; &lt;span class="n"&gt;occurrences_B&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N_B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;p_A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;p_A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;p_B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;p_B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@deterministic&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_A&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p_A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_B&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p_B&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;p_A&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;p_B&lt;/span&gt;

&lt;span class="n"&gt;obs_A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Bernoulli&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;obs_A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;occurrences_A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;obs_B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Bernoulli&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;obs_B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;occurrences_B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;observed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;mcmc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MCMC&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;p_A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p_B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obs_A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obs_B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;mcmc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;25000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;p_A_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mcmc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;p_A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[:]&lt;/span&gt;
&lt;span class="n"&gt;p_B_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mcmc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;p_B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[:]&lt;/span&gt;
&lt;span class="n"&gt;delta_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mcmc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;delta&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[:]&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_A_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;histtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;stepfilled&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;blue&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Posterior of p_A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vlines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_A_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linestyle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;--&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;True p_A (unknown)&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Probability of clicking BUY via A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_B_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;histtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;stepfilled&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;green&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Posterior of p_B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vlines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_B_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linestyle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;--&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;True p_B (unknown)&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Probability of clicking BUY via B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delta_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;histtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;stepfilled&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;red&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Posterior of delta&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vlines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p_A_true&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;p_B_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linestyle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;--&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;True delta (unknown)&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;p_A - p_B&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="A_and_B" src="https://www.marcosantoni.com/images/A_and_B.png"&gt;&lt;/p&gt;
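&lt;p&gt;As a rough cross-check on the sampler, note that the model above has a closed-form posterior: with a Uniform(0, 1) prior and Bernoulli observations, the posterior of each rate is a Beta distribution with parameters (clicks + 1, non-clicks + 1). The sketch below draws from those posteriors using only the standard library. The click counts and the helper names &lt;code&gt;beta_posterior_samples&lt;/code&gt; and &lt;code&gt;percentile&lt;/code&gt; are made up for illustration and are not taken from the run above.&lt;/p&gt;

```python
import random


def beta_posterior_samples(successes, trials, n_samples, rng):
    """Draw from Beta(successes + 1, failures + 1), the exact posterior
    of a Bernoulli rate under a Uniform(0, 1) prior."""
    failures = trials - successes
    return [rng.betavariate(successes + 1, failures + 1)
            for _ in range(n_samples)]


def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list (q in 0..100)."""
    index = int(round(q / 100.0 * (len(sorted_values) - 1)))
    return sorted_values[index]


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical observed counts (not the ones from the simulation above).
clicks_A, visits_A = 75, 1500
clicks_B, visits_B = 30, 750

p_A_draws = beta_posterior_samples(clicks_A, visits_A, 20000, rng)
p_B_draws = beta_posterior_samples(clicks_B, visits_B, 20000, rng)

# delta is a deterministic function of the two rates, sample by sample.
delta_draws = sorted(a - b for a, b in zip(p_A_draws, p_B_draws))

lower_bound = percentile(delta_draws, 5)
upper_bound = percentile(delta_draws, 95)
print('90%% credible interval for delta: %.4f to %.4f' % (lower_bound, upper_bound))
```

&lt;p&gt;If the MCMC traces and these direct draws disagree noticeably for the same counts, that is a hint that the chain needs more samples or a longer burn-in.&lt;/p&gt;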
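&lt;p&gt;With the posterior samples of &lt;em&gt;delta&lt;/em&gt;, we can answer a question like: what
is the probability that $p_A &amp;gt; p_B$? It is simply the fraction of samples in which delta is positive.&lt;/p&gt;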
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Probability that p_A &amp;gt; p_B:&amp;#39;&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delta_samples&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Probability that p_A &amp;gt; p_B&lt;/span&gt;
&lt;span class="c1"&gt;# 0.8919&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</content><category term="posts"></category></entry><entry><title>Insights from PyData Florence 16</title><link href="https://www.marcosantoni.com/2016/04/20/insights-from-pydata-florence-16.html" rel="alternate"></link><published>2016-04-20T06:05:00+02:00</published><updated>2016-04-20T06:05:00+02:00</updated><author><name>Marco Santoni</name></author><id>tag:www.marcosantoni.com,2016-04-20:/2016/04/20/insights-from-pydata-florence-16.html</id><summary type="html">&lt;p&gt;I have just joined &lt;a href="https://www.pycon.it/p3/schedule/pycon7/"&gt;PyData&lt;/a&gt;
conference in Florence, and I will briefly list some
interesting insights.&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;Oh my... We are already overcrowded &lt;a href="https://twitter.com/pyconit"&gt;@pyconit&lt;/a&gt; and it&amp;#39;s *just* the beginning!! 🎉🎉 good job guys! 🙌🏻 &lt;a href="https://twitter.com/hashtag/pycon7?src=hash"&gt;#pycon7&lt;/a&gt;&lt;/p&gt;&amp;mdash; (((Valerio Maggio))) (@leriomaggio) &lt;a href="https://twitter.com/leriomaggio/status/720894471060201472"&gt;April 15, 2016&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;&lt;strong&gt;Time Travel and Time Series Analysis with Pandas and Statsmodels,
&lt;a href="http://twitter.com/hendorf"&gt;@hendorf …&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;</summary><content type="html">&lt;p&gt;I have just attended the &lt;a href="https://www.pycon.it/p3/schedule/pycon7/"&gt;PyData&lt;/a&gt;
conference in Florence, and I will briefly list some
interesting insights.&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;Oh my... We are already overcrowded &lt;a href="https://twitter.com/pyconit"&gt;@pyconit&lt;/a&gt; and it&amp;#39;s *just* the beginning!! 🎉🎉 good job guys! 🙌🏻 &lt;a href="https://twitter.com/hashtag/pycon7?src=hash"&gt;#pycon7&lt;/a&gt;&lt;/p&gt;&amp;mdash; (((Valerio Maggio))) (@leriomaggio) &lt;a href="https://twitter.com/leriomaggio/status/720894471060201472"&gt;April 15, 2016&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;&lt;strong&gt;Time Travel and Time Series Analysis with Pandas and Statsmodels,
&lt;a href="http://twitter.com/hendorf"&gt;@hendorf.&lt;/a&gt;&lt;/strong&gt; The talk focused on time
series analysis. The speaker stressed a point that a data scientist
should not forget: the level of time aggregation must be chosen with
care. Do you take into account that February has only about 90% as many
days as March? If you compare, e.g., sales per month, you cannot just
ignore this fact. During the talk, I also found out that statsmodels has
some nice tools for trend and seasonality analysis.&lt;/p&gt;
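&lt;p&gt;To make the February point concrete, here is a minimal sketch that converts monthly totals into per-day averages before comparing them. The sales figures and the helper name &lt;code&gt;daily_average&lt;/code&gt; are invented for illustration; only the standard-library &lt;code&gt;calendar&lt;/code&gt; module is used, not pandas or statsmodels.&lt;/p&gt;

```python
import calendar


def daily_average(total, year, month):
    """Divide a monthly total by the number of days in that month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return total / float(days_in_month)


# Hypothetical monthly sales for 2016 (a leap year, so February has 29 days).
monthly_sales = {2: 2900.0, 3: 3100.0}

for month, total in sorted(monthly_sales.items()):
    print('%s: total=%.0f, per day=%.1f'
          % (calendar.month_name[month], total, daily_average(total, 2016, month)))
```

&lt;p&gt;In this made-up example March sells about 7% more than February in total, yet the per-day rate is identical, so the apparent growth is purely an effect of month length.&lt;/p&gt;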
&lt;p&gt;&lt;strong&gt;Machine learning and IoT for automatic presence detection of workers
on fall protection life lines,
&lt;a href="http://twitter.com/stefanoterna"&gt;@stefanoterna&lt;/a&gt;.&lt;/strong&gt; The talk was an
excellent overview of how TomorrowData is able to deploy machine
learning systems in the "real world". Their system uses neural networks
to detect a man walking on industrial cables. It was interesting to hear
about the different challenges that one has to consider in the Internet
of Things area due to hardware and environmental constraints. The fact
that they had to manually annotate the signals coming from an
accelerometer reminded me of &lt;a href="http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;amp;arnumber=7346953&amp;amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D7346953"&gt;my
work&lt;/a&gt;
on indoor localization. In areas like these, data collection is
indeed a challenge because of its manual cost (compared to the datasets
you can easily collect through a web app).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Introduzione a Orange Data Mining,
&lt;a href="http://twitter.com/ericbonfadini"&gt;@ericbonfadini&lt;/a&gt;.&lt;/strong&gt; Eric introduced
Orange Data Mining, which is both a Python library and a GUI for machine
learning projects. I found the GUI particularly interesting: it lets you
define pipelines of jobs to mine data, so you can quickly get insights
about the data and play around with machine learning models. I see this
tool as useful mainly for didactic purposes: it gives teachers a nice
graphical way to explain data mining and machine learning, and it is
well suited to lectures.&lt;/p&gt;
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;&amp;quot;Simple APIs and innovative documentation processes&amp;quot; keynote by &lt;a href="https://twitter.com/EGouillart"&gt;@EGouillart&lt;/a&gt; now live &lt;a href="https://twitter.com/PyData"&gt;@PyData&lt;/a&gt; &lt;a href="https://twitter.com/pyconit"&gt;@pyconit&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/pydatait?src=hash"&gt;#pydatait&lt;/a&gt; &lt;a href="https://t.co/Gt8cxIyafJ"&gt;pic.twitter.com/Gt8cxIyafJ&lt;/a&gt;&lt;/p&gt;&amp;mdash; PyData Italy (@pydatait) &lt;a href="https://twitter.com/pydatait/status/721235005746188289"&gt;April 16, 2016&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;

&lt;p&gt;&lt;strong&gt;Simple APIs and innovative documentation processes: looking back at
the success of Scientific Python,
&lt;a href="http://twitter.com/EGouillart"&gt;@EGouillart&lt;/a&gt;.&lt;/strong&gt; The talk gave the point
of view of a core developer of a scientific package like &lt;em&gt;scikit-image&lt;/em&gt;.
The speaker gave nice insights into the API design choices that need to
be made when you contribute to open source projects. One example is the
advantage of getting rid of most classes in your package and exposing
mainly functions. The idea is that, without the boilerplate of classes,
you are forced to accept and return plain NumPy arrays, which you can
then easily integrate with other tools in your pipeline, e.g.
scikit-learn. Another thing to take into account is that 54% of the
users of packages are running a Windows machine (although the developers
of such packages probably don't). So, you need to take into account the
tech gap between the developers and the end users. Finally, the speaker
mentioned the power of Sphinx as a documentation tool.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Building Data Pipelines in Python,
&lt;a href="http://twitter.com/marcobonzanini"&gt;@marcobonzanini&lt;/a&gt;.&lt;/strong&gt; Luigi is an
awesome tool simply because it makes you feel relaxed when you are
running a data pipeline. You can programmatically define arbitrary
dependencies between tasks, and Luigi will make sure that the
dependencies are fulfilled. Marco's talk was a really nice intro to the
tool.&lt;/p&gt;
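&lt;p&gt;The dependency idea can be illustrated without Luigi itself. The sketch below is not Luigi's API; it is a tiny stand-in, with hypothetical task names and a hypothetical &lt;code&gt;run_pipeline&lt;/code&gt; helper, that runs each task only after everything it requires has completed.&lt;/p&gt;

```python
def run_pipeline(requires, target, done=None, order=None):
    """Run `target` after all of its requirements, each task at most once.

    `requires` maps a task name to the list of task names it depends on.
    Returns the order in which tasks were executed.
    """
    if done is None:
        done = set()
    if order is None:
        order = []
    if target in done:
        return order
    for dependency in requires.get(target, []):
        run_pipeline(requires, dependency, done, order)
    done.add(target)
    order.append(target)  # a real task would do its work here
    return order


# Hypothetical pipeline: the report needs cleaned data and a trained model.
requires = {
    'report': ['clean_data', 'train_model'],
    'train_model': ['clean_data'],
    'clean_data': ['download'],
    'download': [],
}

execution_order = run_pipeline(requires, 'report')
print(execution_order)
```

&lt;p&gt;The sketch assumes the dependency graph has no cycles and does no error handling; it only shows why you can ask for the final task and let the tool work out everything upstream.&lt;/p&gt;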
&lt;p&gt;&lt;strong&gt;Going Functional in the Python Data Science Stack,
&lt;a href="http://twitter.com/data_hope"&gt;@data_hope&lt;/a&gt;.&lt;/strong&gt; The speaker explained
the directed acyclic graphs that are behind functional programming. It
was interesting to hear about the Dask package and its lazy evaluation
model. Dask allows you to abstract your code and perform operations on
datasets that do not fit in memory. The speaker pointed out that doing
functional programming means decoupling "how" from "what": you focus on
"what" your algorithm should do, and then you choose "how" it will be
executed (e.g. with Dask).&lt;/p&gt;
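&lt;p&gt;A small sketch can show what lazy, chunk-wise evaluation buys you. It uses plain Python generators rather than Dask, and the function names &lt;code&gt;read_chunks&lt;/code&gt; and &lt;code&gt;chunked_mean&lt;/code&gt; are hypothetical: the "what" (a mean) is written once, while the "how" (one chunk in memory at a time) is decided by the data source.&lt;/p&gt;

```python
def read_chunks(n_chunks, chunk_size):
    """Lazily yield chunks of numbers; nothing is produced until iterated.
    Stands in for reading pieces of a file too large for memory."""
    for chunk_index in range(n_chunks):
        start = chunk_index * chunk_size
        yield range(start, start + chunk_size)


def chunked_mean(chunks):
    """Compute a mean while holding only one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    return total / count


lazy_chunks = read_chunks(n_chunks=4, chunk_size=250)  # no work done yet
print(chunked_mean(lazy_chunks))  # evaluation happens here
```

&lt;p&gt;Dask generalizes this idea by building a task graph over the chunks and scheduling it for you, which this sketch does not attempt.&lt;/p&gt;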
&lt;p&gt;&lt;strong&gt;Reti Neurali in Python, &lt;a href="http://twitter.com/spiunno"&gt;@spiunno&lt;/a&gt;.&lt;/strong&gt; The
talk was a great overview of what neural networks are and how you can
implement them with Theano and Lasagne. The speaker managed to give a
talk that suited both beginners and a more intermediate audience. In
particular, the Q&amp;amp;A session was really active, and interesting topics
were discussed, e.g. preventing overfitting, computational costs,
gravitational waves, etc. Regarding overfitting prevention, I learnt
about "dropout", a technique that basically consists in randomly
dropping units of the network, together with their links, for each
training sample. The advantage, as I understood it, is that you prevent
overfitting and reduce the computational cost at the same time.&lt;/p&gt;
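&lt;p&gt;Here is a minimal sketch of dropout on a single layer's activations, assuming the common "inverted dropout" variant, which zeroes unit activations at random and rescales the survivors. It uses plain Python lists rather than Theano or Lasagne, and the function name &lt;code&gt;dropout&lt;/code&gt; and its parameters are illustrative, not taken from the talk.&lt;/p&gt;

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each activation with probability `rate`
    during training and rescale the survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return list(activations)
    keep_probability = 1.0 - rate
    return [value / keep_probability if rng.random() >= rate else 0.0
            for value in activations]


rng = random.Random(0)  # fixed seed for reproducibility
hidden = [0.5, 1.0, 1.5, 2.0]

print(dropout(hidden, 0.5, rng))                  # training: some units zeroed
print(dropout(hidden, 0.5, rng, training=False))  # inference: unchanged
```

&lt;p&gt;Rescaling during training keeps the expected activation the same, so nothing has to change at inference time.&lt;/p&gt;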
&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;&lt;a href="https://twitter.com/hendorf"&gt;@hendorf&lt;/a&gt; thank you for coming! enjoy your next conference :)&lt;/p&gt;&amp;mdash; PyCon Italy (@pyconit) &lt;a href="https://twitter.com/pyconit/status/722763833387966465"&gt;April 20, 2016&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src="//platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;</content><category term="posts"></category></entry></feed>