Marco Santonihttps://www.marcosantoni.com/2023-08-27T07:35:00+02:00Reading "The Design of Web APIs"2023-08-27T07:35:00+02:002023-08-27T07:35:00+02:00Marco Santonitag:www.marcosantoni.com,2023-08-27:/design_web_api.html<p>Why bother reading a book about the design of web APIs when you work in data science like I do? I found this book called <em>The Design of Web APIs</em> by <a href="https://apihandyman.io/about/">Arnaud Lauret</a> and decided to give it a try.</p>
<p><img alt="The Design of Web APIs" src="https://www.marcosantoni.com/images/bookshelf/webapi.jpg"></p>
<h2>Why read it</h2>
<p>Data science is shifting towards turning models and solutions into API products. And this can be true not only when you develop a product you actually sell publicly. Even when developing a data product inside an organization, you may want to expose your data service via APIs.</p>
<p>So, when you are at the point of developing an API, there are plenty of design decisions to make (e.g. which routes to expose, which result codes, which response payload, etc.). If you start this design process without some guidelines, you may spend plenty of energy trying to answer these design questions, or even risk introducing technical debt that you will pay for later.</p>
<h2>What is the book about</h2>
<p>The book states that, when you take the role of an <em>API designer</em>, you are just like a designer of real-world objects. An API is made for users and should help <strong>them</strong> to achieve <strong>their</strong> goals. API designers should keep internal details of the backend from affecting the design of the APIs. The focus of the designer is to simplify the job of the consumer.</p>
<blockquote>
<p>Usability is what distinguishes awesome APIs from mediocre or passable ones.</p>
</blockquote>
<p>The book is focused on shifting your point of view <strong>from the provider to the consumer</strong>. It may seem obvious and everyone might agree on it, but it is not that straightforward to make it happen because we may be biased or we may take design shortcuts that simplify the development of our backend. The author introduces methods like the <em>API goals canvas</em> to help us list the needs of the user and focus on them.</p>
<p><img alt="Example of design tradeoffs" src="https://www.marcosantoni.com/images/api_design_tradeoff.png"></p>
<p>In the book you will find charts or schemas like the one above that explain the design choices you can face on a daily basis. In this example, you may want to stick to full REST compliance with a <code>POST /orders</code>. Or you may want to relax this constraint via a non-REST design like <code>POST /cart/check-out</code> that might actually be more intuitive for the consumer developers.</p>
<h2>And more technicalities</h2>
<p>The book has a focus on these design choices (e.g. the <em>resource expansion</em> pattern for nested objects in API responses), but it is also a good source of knowledge about some technical details around APIs that you can use on a daily basis. For example, you will read chapters about</p>
<ul>
<li>OpenAPI Specification</li>
<li>OAuth2</li>
<li>features of HTTP you might not be using (e.g. there are around 200 different standard HTTP headers)</li>
<li>data format standards like <em>ISO 4217</em> for currencies or <em>ISO 8601</em> for date and time-related data</li>
<li>etc.</li>
</ul>
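<p>To make the data format standards above more concrete, here is a minimal sketch (not taken from the book) of a response payload that uses an ISO 4217 currency code and an ISO 8601 timestamp. The function name <code>build_order_payload</code> and the field names are my own assumptions for illustration, and the currency check only validates the shape of the code, not whether it is in the official ISO 4217 list.</p>

```python
import json
from datetime import datetime, timedelta, timezone
from decimal import Decimal


def build_order_payload(amount: Decimal, currency: str, created_at: datetime) -> dict:
    """Return a JSON-ready dict (hypothetical field names) using standard formats."""
    # ISO 4217 codes are three uppercase letters (e.g. EUR, USD).
    # This only checks the shape, not membership in the official list.
    if len(currency) != 3 or not currency.isalpha() or not currency.isupper():
        raise ValueError("currency must be a three-letter uppercase code")
    # Require a timezone so the timestamp is unambiguous for the consumer.
    if created_at.tzinfo is None:
        raise ValueError("created_at must be timezone-aware")
    return {
        "amount": str(amount),  # string keeps the exact decimal value
        "currency": currency,
        "created_at": created_at.isoformat(),  # ISO 8601 with UTC offset
    }


payload = build_order_payload(
    Decimal("19.90"),
    "EUR",
    datetime(2023, 8, 27, 7, 35, tzinfo=timezone(timedelta(hours=2))),
)
print(json.dumps(payload))
```

<p>Sending the amount as a string is just one possible design choice here; the point is that the consumer can parse the currency and the timestamp without guessing the format.</p>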
<h2>All about dev experience</h2>
<p>It is the first book I have read so far that is entirely dedicated to developer experience. How can we improve the productivity and the overall satisfaction of the developers using our APIs?</p>
<p>By reading it, you can learn an approach that goes beyond designing web APIs. You learn to focus on what simplifies the life of a developer, and I'm sure this thinking has an effect on how you write your code, your internal tools or even your docs.</p>Expectations from a Data Analyst2023-08-07T07:35:00+02:002023-08-07T07:35:00+02:00Marco Santonitag:www.marcosantoni.com,2023-08-07:/expectations_from_data_analyst.html<p>When you work as a data analyst or data scientist (I'll use the terms interchangeably) in a company, you may not be training predictive models every single day. A significant (and often interesting) part of your job is answering business questions via data mining, regardless of whether you do it with machine learning, descriptive statistics or whatever. You may start with a business question like:</p>
<blockquote>
<p>why are our revenues increasing in the last quarter?</p>
<p>what are common patterns between our loyal customers?</p>
</blockquote>
<p>Such simple questions often require complex work that goes beyond knowing statistics well. You need to know how your business works and what the expectations of your stakeholders are.</p>
<h2>What the analyst enjoys the most (and the least)</h2>
<p>Once a business question has arrived, where do we start from? Most data analysts would dive straight into the <strong>data exploration</strong> phase. This phase is usually the first one of the activity, and the data analysts look into distributions and patterns in the data. The goal here is to get a good comprehension of the data we are sitting on. And usually the data analyst has fun during this data exploration time. 🎉🙌 He or she is playing with charts and with some statistics from the dataset.</p>
<p>What does the data analyst usually <strong>not</strong> enjoy doing? 👎😭 Based on my experience, preparing the <strong>presentation</strong> about the results of the analysis is the part of the activity that most data analysts enjoy the least. And what does it imply?</p>
<p><img alt="Time dedicated to slides" src="https://www.marcosantoni.com/images/time_dedicated_to_slides_small.png"></p>
<p>Imagine you have <strong>10 days</strong> to work on this data analysis before the meeting with your stakeholders. As our dear data analysts enjoy playing with the data more than playing with PowerPoint, they would probably spend 9 days on mining the data and 1 day working on the presentation. And probably the 9 days do not depend on the actual complexity of the task. If the business question can be answered in 5 days with some basic descriptive statistics, the data analysts would probably invest more and more time trying some more advanced modelling technique or some fancier data visualization. Why? Because they enjoy it. So, the data analysis part of the activity fills all the available space like a gas in a room would do.</p>
<p>The last day (if not the very last hours) is usually left for working on the presentation.</p>
<h2>The wrong interpretation of the role</h2>
<p>I used to expect a data analyst to focus on the data mining, and that looked fine to me. She/he would share the data with other stakeholders (e.g. marketing staff), and THEY would get the insights because THEY are the domain experts. The data scientist would get the data, would let the data talk, and the business stakeholder would read the insights.</p>
<p>I thought it was OK to present an exploratory analysis. And I was wrong.</p>
<h2>Explanatory over exploratory</h2>
<p>Recently, I read <a href="https://www.storytellingwithdata.com/books">Storytelling with Data</a> by Cole Nussbaumer Knaflic. The author explains why data scientists should show <strong>explanatory</strong> analyses (rather than exploratory).</p>
<blockquote>
<p>If you are the one analyzing and communicating the data, you likely know it best—you are a subject matter expert. This puts you in a unique position to interpret the data and help lead people to understanding and action.</p>
</blockquote>
<p>Once the exploratory data mining phase is over, the data analyst should take the time and the effort to <strong>interpret</strong> the data. She/he should turn the data into information that can answer the need of the audience.</p>
<p>Why is it hard? We often believe that the audience is the subject matter expert and knows what the valuable information behind the data actually is. That's why working on the explanatory phase is an uncomfortable zone for a data scientist, but she/he should feel confident in making recommendations and observations.</p>
<p>If we entitle a data analyst to interpret the business insights of the data, there are at least 2 things he/she should take into consideration:</p>
<ol>
<li>take enough <strong>time to interpret</strong> the data</li>
<li><strong>review the data visualizations</strong> to explicitly communicate his/her interpretation</li>
</ol>
<p>Regarding the 1st point, looking for business insights is surprisingly time consuming. You cannot just dedicate the very last hours of your activity to looking for explanations behind data patterns. We should probably reconsider a classic sequential approach to the activity (e.g. explore, explain, present) in favor of an approach that organizes our time into quick iterations around business hypotheses, repeated multiple times before concluding our activity.</p>
<p>Regarding the 2nd point, I'll go a bit deeper with an example.</p>
<h2>Example: review your data viz</h2>
<p>You can of course find many examples in Knaflic's book. Let's look at one I picked from <a href="https://www.storytellingwithdata.com/makeovers">her website</a>. Imagine we're working in a hospital and are analyzing the lengths of hospital stays after a surgery. For each stay in 2019, we're given</p>
<ul>
<li>the quarter of the year</li>
<li>the length of stay</li>
</ul>
<p>When a data analyst is done with the first <em>exploratory</em> analysis, what could be the output?</p>
<p><img alt="Exploratory data analysis chart" src="https://www.marcosantoni.com/images/surgery_data_exploratory_small.png"></p>
<p>In this chart, data is presented to the audience. However, how easy is it to get valuable information out of it? You may notice some patterns (e.g. an increase in the frequency of <code><=24</code> stays over the year), but finding patterns is hard or requires quite some cognitive effort.</p>
<p>What if the data analyst made the effort of extracting the information out of the data? How should the presentation be revisited? She/he should be confident in highlighting what's actually valuable in the data and focus the attention of the reader on that.</p>
<p>In this example, the data analyst can make the key information explicit. She/he can find out that the <code><=24</code> stays have increased over the year and could know that this is considered a success. Why not emphasize it on the chart?</p>
<p>Let's look at what an <em>explanatory</em> chart would look like.</p>
<p><img alt="Explanatory data analysis chart" src="https://www.marcosantoni.com/images/surgery_data_explanatory_small.png"></p>
<p>The chart now has a clear message that is stated in the title and is fully described in the text next to the actual plot. The new chart looks clean because any visual component that is not useful to grab the chosen message is either hidden or grayed out. The data analyst in this case has focused on explaining why 2019 was a success rather than showing plain data. That's why the bars of the <code><=24</code> stays are highlighted in black, while, in contrast, the remaining bars are grayed out. The choice of colors captures the attention of the reader on the patterns and on the signal, rather than on the data itself.</p>
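<p>The highlighting idea described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the helper name <code>highlight_colors</code> and the category labels other than <code><=24</code> are made up, and the resulting list is meant to be passed to a plotting call such as the <code>color</code> argument of matplotlib's <code>bar</code>.</p>

```python
def highlight_colors(categories, focus, highlight="black", muted="lightgray"):
    """Return one color per category: the focus category stands out, the rest is grayed out."""
    return [highlight if category == focus else muted for category in categories]


# Hypothetical length-of-stay buckets; only "<=24" comes from the example above.
categories = ["<=24", "24-48", ">48"]
colors = highlight_colors(categories, focus="<=24")
print(colors)
```

<p>Keeping the choice of colors in a small function makes the message of the chart an explicit decision rather than a side effect of the default style.</p>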
<h2>Looking forward</h2>
<p>This article is mainly inspired by Knaflic's book and by my experience in interacting with stakeholders over the last years. I haven't researched the literature on the topic, so please consider this article as a set of opinionated recommendations on how a data scientist could maximize his/her impact when working on data mining activities. Agreeing with this approach means that a data scientist should dedicate energy to</p>
<ul>
<li>getting a deep knowledge of the business of the company she/he works at and the market where it competes</li>
<li>fine tuning and improving the data visualizations she/he produces by iterating over and over on them (not stopping at the default chart styles generated by statistical software)</li>
</ul>
<p>These thoughts do not apply fully to every single company of course. They make sense in teams or companies where data scientists spend a part of their time on data exploration and data mining activities to answer questions that business stakeholders ask them. I would appreciate any feedback or thoughts you have on it!</p>Guest at DaGrande podcast2023-07-23T06:35:00+02:002023-07-23T06:35:00+02:00Marco Santonitag:www.marcosantoni.com,2023-07-23:/guest_at_dagrande_podcast.html<p>I was recently a guest on a new podcast called "DaGrande". The podcast was launched by <a href="https://www.linkedin.com/in/stefano-bosisio1/">Stefano Bosisio</a> and aims at helping students who are close to finishing their studies. "DaGrande" consists of a series of interviews where professionals from a variety of industries share career tips or insights that they would have loved to hear when they were younger (e.g. when still at university).</p>
<p>I was the one interviewed in the second episode (see below), and I shared my advice for starting a career in the Data and AI world.</p>
<iframe style="border-radius:12px" src="https://open.spotify.com/embed/episode/0ePWYBwwqq71hGJw4H0woX?utm_source=generator&theme=0" width="100%" height="352" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>Learning by teaching2023-01-28T09:35:00+01:002023-01-28T09:35:00+01:00Marco Santonitag:www.marcosantoni.com,2023-01-28:/learning_by_teaching.html<p>The picture below was taken right at the beginning of the exam of the course called <em>Apache Spark for Data Analysis</em> at <a href="https://www.itsrizzoli.it/en/home-en/">ITS Rizzoli</a> in Milan in November 2022. I was the one taking the picture because I was actually the lecturer of this course. In this post, I'll tell you why I ended up teaching this course.</p>
<p><img alt="The day of the exam" src="https://www.marcosantoni.com/images/spark_exam.jpg"></p>
<h2>Learning</h2>
<p>I have been working daily with Apache Spark for three years so far, and I've been implementing a variety of batch and streaming data transformations with it. I felt I knew the basics of the framework well enough to be autonomous in creating new jobs. However, I wanted to go deeper in understanding how Spark works and what the best practices to follow (or the antipatterns to avoid) are.</p>
<p>Rather than studying a book about Spark by myself or something like that, I asked myself: "<em>why not teach an introductory course</em>"? And that was actually a good idea. I found that <strong>teaching</strong> has been an extremely effective way to <strong>learn</strong>. My course consisted of 44 hours of training spanning 11 lessons over 2 months. While that may not look like much, preparing 44 hours of training material and designing the lessons requires thorough preparation on the topic you are teaching. I decided to design the course with more practice than theory and with plenty of live coding.</p>
<p>So, preparing this course has been an amazing opportunity to actually learn how Apache Spark works. After the end of the course, I have the impression I've improved my coding skills in PySpark way more than I would have by attending any dedicated training.</p>
<h2>Impact</h2>
<p>The <a href="https://en.wikipedia.org/wiki/Istituto_tecnico_superiore">ITS</a> is a 2-year technical school dedicated to 19-20 year old students. It is an alternative to academic studies, and it is designed to be shorter and with a technical focus. These ITS schools often focus on areas or skills that are in high demand in the job market. Therefore, students are often able to find their first job soon.</p>
<p>My goal was to help young developers learn a key technology like Spark. Knowing Spark is almost a requirement for applying to Data Engineer positions, and the role of the Data Engineer is one of those with the highest demand in the tech job market. So, I decided to design the course I would have liked to follow 3 years ago to speed up learning these skills. I liked the idea of contributing to helping these young developers find their first job in the Data domain.</p>
<h2>Teaching</h2>
<p>Knowing a topic or a technology does not mean you are able to teach it. Teaching is a complex task where you cannot take anything for granted and need to find the right pace for the class. In a class of 23 students, I found a variety of skill levels and backgrounds, meaning that you need to balance them to teach at the right rhythm.</p>
<p>Another challenge is how to avoid making a lecture boring and how to have a good mix of theory and practice, because you'll find both students who look for more of one and students who look for more of the other. Teaching this course was then also an opportunity to improve my teaching skills, and these are not skills that you apply only during lectures. They are actually communication skills that you can apply and distill on a daily basis when working.</p>
<h2>Revenue</h2>
<p>I liked the idea of having a small second revenue, and, before starting to prepare the course, I thought teaching would be a great idea because I would be paid to learn. The salary of a lecturer in the tech domain can vary a lot depending on the context, but it generally ranges from 40 to 200 euros per hour (this is not an official statistic, it's just an approximation). However, this salary does <strong>not account</strong> for the preparation of the training. So, is <em>revenue</em> actually a good reason to teach? Probably not if you give this course only once or twice. The effort of the preparation is so large that the revenue will not compensate for it. If instead you have the opportunity to repeat the same training over and over, then it starts to make sense on the economic side too.</p>
<h2>Opportunity</h2>
<p>Three years ago I just would not have had the time to prepare a course like this. Why? I used to spend 2 hours per day commuting. I now have instead the chance to work from home quite often, and this gives me 1-2 hours of extra life. I was then able to prepare the course material incrementally over a couple of months before the course started.</p>
<p>The opportunity came when I <a href="https://open.spotify.com/episode/4OWbyxGWcEPcQULpNTiNqU">interviewed Andrea Biancini</a> on the Intervista Pythonista podcast. Thanks to him, I got to know a bit more about the tech education and training world and heard about ITS for the first time. Then, the idea stuck in my head because I was looking forward to experiencing being a lecturer for the first time.</p>
<h2>Course design</h2>
<p>When you prepare a course, the nice part is that you can actually design the course you would like to attend. My course then consisted mainly of live coding sessions that started with a brief introduction of a topic (e.g. Spark APIs, Streaming, etc.) and then ended with an exercise on that topic that the students could try to solve. I decided to <a href="https://github.com/Marco-Santoni/databricks-from-scratch/tree/main/training-spark">open source</a> the training material I prepared so that any other student or teacher may benefit from it when needed. To simplify the course setup, I ran the coding sessions on <a href="https://community.cloud.databricks.com/login.html">Databricks community edition</a> so that students only needed a browser and an internet connection to work on a Spark cluster.</p>
<p>What helped the design of the course was adopting a textbook. Having a textbook speeds up the design of the contents of the course and gives the students a reference resource in case they want to go deeper on the topic. I chose <a href="https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf">Learning Spark</a> second edition by Damji et al., which is made freely available by Databricks.</p>Speaker at PyCon IT 20222022-08-05T06:41:00+02:002022-08-05T06:41:00+02:00Marco Santonitag:www.marcosantoni.com,2022-08-05:/pyconit_2022.html<p>I went back to PyCon IT 2022 in Florence in June. I gave one talk called <em>Why Is Our Project Late?</em> where I introduced the mental and statistical biases that lead us to make wrong estimates when making a plan.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/zcDQwIQQwR4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>Furthermore, we held a live session of the <a href="https://intervistapythonista.com/">Intervista Pythonista</a> podcast, interviewing Fabio Pliger, the creator of PyScript.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/l5-ecdsBaHE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>My Webinar on Databricks and PySpark2022-06-11T19:35:00+02:002022-06-11T19:35:00+02:00Marco Santonitag:www.marcosantoni.com,2022-06-11:/webinar_databricks_biella.html<p>I was invited by <a href="https://pythonbiellagroup.it/it/">Python Biella</a> community to hold a webinar introducing PySpark on Databricks (in Italian). You can find the video below and the <a href="https://github.com/Marco-Santoni/databricks-from-scratch/tree/main/live_python_biella">code here</a>.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/m0OiFDBJ0Rw?start=114" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>My 6 Gems on Data Visualization2022-02-26T19:35:00+01:002022-02-26T19:35:00+01:00Marco Santonitag:www.marcosantoni.com,2022-02-26:/data_viz_hidden_gems.html<p>I have been working quite some time with charts and business intelligence in the last 5 years. When you spend time building business reports, you may perceive data visualization as a cold technical and business tool. However, there are <strong>6 hidden gems</strong> in data visualization that I found by chance. I realized data visualization is not as cold as I thought. Let me recap these 6 gems for you.</p>
<h2>1) The first chart ever</h2>
<p>William Playfair was a Scottish engineer and political scientist from the 18th century. He is considered the author of the very first chart:</p>
<p><img alt="By William Playfair - The Commercial and Political Atlas, 1786 (3th ed. edition 1801), Public Domain" src="https://www.marcosantoni.com/images/datavizhiddengems/playfair_first_chart_800.jpg"></p>
<p>The chart was published back in 1786. It shows the volumes of imports and exports of Scotland over one year on a scale of 10k pounds. Each country is given two bars: one for volume of imports, one for volume of exports.</p>
<p>I am so used to seeing bar charts that I never asked myself who the inventor was or when they first appeared. It's nice to find out that they were invented way before the invention of calculators and that they have changed so little since then.</p>
<h2>2) The best graphic ever</h2>
<p>Charles Minard represented 6 types of data about Napoleon's 1812 Russia campaign in one single chart. This visual was considered by <a href="https://www.nationalgeographic.com/culture/article/charles-minard-cartography-infographics-history">Edward Tufte</a> as "<em>the best statistical graphic ever produced</em>".</p>
<p><img alt="By Charles Minard (1869): map of Napoleon's disastrous Russian campaign of 1812" src="https://www.marcosantoni.com/images/datavizhiddengems/minardnapoleon_800.png"></p>
<p>Minard represented in two dimensions <a href="https://ageofrevolution.org/200-object/flow-map-of-napoleons-invasion-of-russia/">six types</a> of data: the number of Napoleon's troops; distance; temperature; the latitude and longitude; direction of travel; and location relative to specific dates.</p>
<h2>3) Non-neutrality: the Legarithmic scale</h2>
<p>Is data visualization a neutral discipline? Not really. Basic decisions like the choice of scale or of the limit of axes might change radically the information perceived by the reader. Take a look at the following tweet by Matteo Salvini (leader of "Lega" party) about results of a poll on popularity of Italian politicians:</p>
<blockquote class="twitter-tweet"><p lang="it" dir="ltr">Nonostante menzogne, attacchi e processi, milioni di Italiani credono, sperano, confidano nella Lega. <br>Eh già, e siamo ancora qua…<br>Non si molla mai, GRAZIE! <a href="https://t.co/DFMecxPFzC">pic.twitter.com/DFMecxPFzC</a></p>— Matteo Salvini (@matteosalvinimi) <a href="https://twitter.com/matteosalvinimi/status/1436662148709629952?ref_src=twsrc%5Etfw">September 11, 2021</a></blockquote>
<p><script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></p>
<p>Do you notice anything wrong with the chart? The y axis looks a bit tweaked. It does not seem to follow any reasonable scale (perhaps a "Legarithmic" scale?) since the differences in height between the 3 bars are not consistent. Here is how the same data looks when plotted in Excel.</p>
<p><img alt="Unbiased chart of the same data shown in Matteo Salvini's tweet" src="https://www.marcosantoni.com/images/datavizhiddengems/realchartfromtweet_800.png"></p>
<p>The effect on the reader is not the same, is it?</p>
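<p>One way to check a bar chart like this is to compare the drawn heights with the values they are supposed to represent. Below is a rough sketch of such a check. The function name <code>is_proportional</code> and all the numbers are hypothetical (they are not the figures from the tweet), and it assumes positive values and bars that start from a zero baseline.</p>

```python
def is_proportional(values, heights, tolerance=0.05):
    """Return True when every bar has (almost) the same height-per-unit ratio.

    Assumes positive values and bars drawn from a zero baseline.
    """
    ratios = [height / value for value, height in zip(values, heights)]
    reference = ratios[0]
    return all(abs(ratio - reference) / reference <= tolerance for ratio in ratios)


# Hypothetical poll values and bar heights measured in pixels.
values = [40, 30, 20]
honest_heights = [200, 150, 100]   # 5 pixels per unit for every bar
tweaked_heights = [200, 90, 40]    # smaller bars are shrunk more than they should be

print(is_proportional(values, honest_heights))
print(is_proportional(values, tweaked_heights))
```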
<h2>4) Beyond shapes: infographics</h2>
<p>Otto Neurath was one of the main contributors to the <em>picture language</em>, aka ISOTYPE (International System of Typographic Picture Education). This method consists of replacing classic shapes in data visualization (e.g. bars, circles, etc.) with a set of standardized symbols. Quantities are represented by repeating the same symbol over and over proportionally to the measure. Consider the following example by Otto Neurath from 1930.</p>
<p><img alt="Otto Neurath, Residential density in big cities - 1930" src="https://www.marcosantoni.com/images/datavizhiddengems/isotypeexample_800.png"></p>
<p>The chart represents the density of population in different cities. The information is represented as the number of persons that would live in a flat of 200 m2. The count of persons is not represented by a digit or by a bar, but by the repetition of a symbol as many times as the count of persons for that city. The result is effective. Density is no longer just a number, and you can <em>feel</em> the size of the measure. Infographics can turn cold numbers into tangible perceptions of a phenomenon.</p>
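<p>The principle of repeating a symbol proportionally to the measure is simple enough to sketch in plain text. The snippet below is only a toy illustration: the function name <code>pictogram</code>, the city names and the counts are invented and do not come from Neurath's chart.</p>

```python
def pictogram(counts, symbol="*", unit=1):
    """Return one text row per item, repeating `symbol` once per `unit` of the count."""
    width = max(len(name) for name in counts)
    rows = []
    for name, count in counts.items():
        repetitions = round(count / unit)
        rows.append(f"{name:<{width}} {symbol * repetitions}")
    return rows


# Hypothetical number of persons per flat in two made-up cities.
rows = pictogram({"City A": 3, "City B": 7})
for row in rows:
    print(row)
```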
<h2>5) Pie charts: bad by definition</h2>
<p>"Bad by definition" is the title of one of my <a href="https://www.data-to-viz.com/caveat/pie.html">favourite blog posts</a> about data visualization. This article is a clean explanation of why you should not use pie charts for most of the use cases. The article starts with this example.</p>
<p><img alt="Yan Holtz - The issue with pie chart" src="https://www.marcosantoni.com/images/datavizhiddengems/piechart_400.png"></p>
<p>Can you rank the slices of the pie by size? You'd probably struggle a bit trying to answer. The reason is that our brain is not used to measuring and comparing angles. It's funny to see pie charts being used every now and then in business reports. Most of the time, a basic bar chart would be way more effective in letting the user understand the numbers behind it. However, it seems that pie charts are now endemic in corporations, and there is still a long way to go before we get rid of them 😁</p>
<h2>6) What is data visualization?</h2>
<p>Is data visualization a branch of computer science? It turns out that data visualization belongs to a broader discipline: it is part of <a href="https://visme.co/blog/information-design/">information design</a>. Information design is the practice of presenting information in a way that fosters an efficient and effective understanding of the information.</p>
<p><img alt="Plain text representation of data" src="https://www.marcosantoni.com/images/datavizhiddengems/irpef_table_800.png"></p>
<p>Can the same data of a bar chart be represented in plain text? Yes.</p>
<p>Would plain text require the same effort from us to understand the information behind the numbers? Probably not.</p>
<p>Would we even be able to get such information from plain text? Probably not because visualizing information helps our brain to perceive what's going on.</p>
<p><img alt="Same representation of data via line chart" src="https://www.marcosantoni.com/images/datavizhiddengems/irpef_chart.png"></p>
<p>I recently wrote <a href="https://medium.com/@marcosantoni_39266/riforma-irpef-i-grafici-che-avrei-voluto-vedere-7a69f7577bc3">an article</a> on the impact of information design on journalism. The article starts from a recent tax reform in Italy. Most media outlets kept showing tables about the new tax rates, but I found it quite hard to get a clear and full picture of the reform. I was not able to find online a single data visualization about the data behind the reform. So, I made one myself, and it turned out the article was quite appreciated (with more than 2.3k reads at the time of this writing and plenty of positive feedback on social networks).</p>
<p>The reason why the article went so viral is that one single line chart was able to describe the reform way more effectively than the textual tables you could find online. I find this a decent example of the "<em>efficient and effective understanding of information</em>" that is the overall goal of information design.</p>
<h2>References</h2>
<p>This article is a collection of notes I took over the last couple of years. Historical charts are inspired by talks by <a href="https://twitter.com/pciuccarelli">Paolo Ciuccarelli</a>. The criticism of pie charts is inspired by the article by <a href="https://www.data-to-viz.com/caveat/pie.html">Yan Holtz</a>. Plenty of details are of course from Wikipedia.</p>
How I started podcasting2021-11-07T09:35:00+01:002021-11-07T09:35:00+01:00Marco Santonitag:www.marcosantoni.com,2021-11-07:/start_podcasting.html
<p>In May 2021, the first episode of my first podcast went live. The podcast is called <a href="http://intervistapythonista.com/">Intervista Pythonista</a> and is co-hosted with <a href="https://it.linkedin.com/in/cesare-placanica">Cesare Placanica</a>. Cesare and I are members of the <a href="http://milano.python.it/">Python Milano</a> community, which helped us kick off the idea.</p>
<h2>Why podcasting?</h2>
<p>I am a heavy podcast listener. I love podcasts because they are dense conversations on topics I love. These conversations let me hear the points of view of experts in the field and stay up to date with new trends.</p>
<p>I prefer podcasts over videos for two reasons. First, I can listen to them while I'm doing something else (usually low-attention tasks like dish-washing or running). Second, I don't need to sit in front of a screen after I've already spent 8+ hours of my working day in front of one.</p>
<h2>Why now?</h2>
<p>Cesare and I participated as panelists in a <a href="https://talks.codemotion.com/panel-online---stories-of-python-and-dat">community talk</a> at the last Codemotion conference. The panel was an informal discussion on topics like data team organization, learning tips, and the latest trends in data science.</p>
<p>We had a surprisingly high number of attendees during the panel. I noticed that an informal chat between experts is content that people enjoyed more than I expected. I suspect that people miss the <strong>informal chat</strong> they used to have during in-person meetups and conferences (which have been suspended since the beginning of the pandemic).</p>
<p>So, I got back to Cesare with the idea:</p>
<blockquote>
<p>Why don't we start podcasting?</p>
</blockquote>
<p>Cesare was like: "tell me more about it". The idea was to interview an expert in Python or in its neighborhood. The format was inspired by Michael Kennedy's <a href="https://talkpython.fm/">Talk Python to Me</a> podcast. I was thinking of a similar format, but narrowed to an Italian audience by running the interviews in Italian. The goal was not only to create valuable content for Italian Pythonistas, but also to give a voice to local community members. Getting to know the people behind a tech community through a direct interview helps the community grow by making it feel closer to you.</p>
<p>The decision was made. It was time to start.</p>
<h2>How to run a podcast?</h2>
<p>Neither Cesare nor I had ever run a podcast before. Neither of us was an expert in audio recording or audio post-processing. Fortunately, we live in a time where you can find plenty of user-friendly tools to create digital content. After doing some research, I found <a href="https://anchor.fm/">Anchor</a> by Spotify. Anchor defines itself as "<em>the easiest way to make a podcast</em>". And it probably is.</p>
<p>Anchor lets you start a new podcast in minutes for free. You can record, cut, merge, and publish episodes directly via the mobile app. The app lets you invite guests to join the recording too. Anchor will then take care of distributing the content on major podcasting platforms.</p>
<p>What is missing? A website and a logo! It turns out that Anchor creates a page for your podcast. I simply bought a domain and linked it to that page. Regarding the logo, I have to confess I designed it in PowerPoint.</p>
<p><img alt="Intervista Pythonista logo" src="https://www.marcosantoni.com/images/intervista_pythonista.png"></p>
<h2>Guests?</h2>
<p>Ok, we decided how to record and how to publish. It's time to record our first episode... who should we invite? Cesare and I started listing names of community members, colleagues, and even friends who could be interviewed. We soon had around 20 names, and our first choice was <a href="https://marcobonzanini.com/category/podcast/">Marco Bonzanini</a> (thanks again, Marco, for your availability!).</p>
<iframe src="https://anchor.fm/marco-santoni/embed/episodes/Ep-1-Diventare-imprenditori-di-se-stessi-con-NLP-e10a9g9/a-a5fjhcg" height="102px" width="400px" frameborder="0" scrolling="no"></iframe>
<p>We keep updating a kind of kanban board that lists potential guests, guests who have accepted the invitation, and those who have already been scheduled. We decided to have a fixed schedule for recording (every 2 weeks, on the same day, at the same time). Having a recurring schedule reduces complexity and makes things work.</p>
<p>At the end of every recording, we ask the guest to suggest 1 or 2 names of potential future guests. This recommendation helps us fill the list of future guests with new names, and it lets us meet new Pythonistas outside of our direct network.</p>
<h2>Some numbers</h2>
<p>Two days ago, we published the <a href="https://anchor.fm/marco-santoni/episodes/Ep-10-Demand-forecasting-con-serie-temporali-gerarchiche-e19q48p">10th episode</a>, and we have enough history to look back at the numbers. As of 7th November 2021, we had <em>1,364</em> plays. Our top episode had <em>167</em> plays. <em>84%</em> of listeners are from Italy, and 2 out of 3 listeners use their mobile device to listen to the podcast.</p>
<p>What I'm most glad about is not these numbers, but the messages we often receive via <a href="https://pythonmilano.herokuapp.com/">Slack</a> or <a href="https://www.linkedin.com/company/python-milano">LinkedIn</a>. Sometimes listeners write to us to say thanks for the valuable content they listened to. These messages are the highest reward for the time and effort we put into this podcast and the main reason we are doing this.</p>
Getting PSM I Scrum Certification2021-08-24T09:41:00+02:002021-08-24T09:41:00+02:00Marco Santonitag:www.marcosantoni.com,2021-08-24:/getting-psm-i-scrum-certification.html
<p>I've been working with the Scrum framework over the last 18 months, and I thought it was time to test whether what I was doing was real Scrum or kind-of-Scrum. I decided to take the <em>Professional Scrum Master I</em> certification exam to test my knowledge of the framework.</p>
<h2>Which certification?</h2>
<p>Where to start? It seems that the founders of Scrum have created 3 independent organizations that have 3 independent certification paths.</p>
<ul>
<li>Scrum.org</li>
<li>Scrum Alliance</li>
<li>Scrum Inc</li>
</ul>
<p>While <em>Scrum Alliance</em> and <em>Scrum Inc</em> require attending a class to take the exam, <em>Scrum.org</em> lets you take the exam directly, thus allowing self-study. I could not find an in-person class in my area anytime soon, so I decided to go for the <em>Scrum.org</em> exam. I did not consider attending an online class because I already spend most of my working time in front of a screen and prefer other ways of learning to online courses.</p>
<h2>How to prepare?</h2>
<p>In short, read the <a href="https://scrumguides.org/">Scrum Guide</a> at least 3-4 times. Focus on highlighting <strong>who</strong> is accountable for every artifact and activity (eg only the Developers are accountable for the Sprint Backlog, all the Scrum Team is accountable for the Sprint Goal, etc).</p>
<p>Repeat exercises that simulate exam questions (either official or not) a few times:</p>
<ul>
<li>Scrum.org <a href="https://www.scrum.org/open-assessments/scrum-open">Open Assessment</a></li>
<li>Great set of 80 questions by <a href="https://mlapshin.com/index.php/scrum-quizzes/">Mikhail Lapshin</a></li>
<li>A few free questions on <a href="https://www.volkerdon.com/courses/take/sm-po-scaled-scrum-3-in-1/quizzes/24259915-product-owner-free-assessment">Volkerdon</a></li>
</ul>
<p>I also enjoyed looking at some posters available on Scrum.org that help you visualize some aspects of the framework:</p>
<ul>
<li><a href="https://scrumorg-website-prod.s3.amazonaws.com/drupal/2021-01/Scrumorg-Scrum-Framework-tabloid.pdf">Scrum Framework Poster</a></li>
<li><a href="https://scrumorg-website-prod.s3.amazonaws.com/drupal/2018-05/ScrumValues-Tabloid.pdf">Scrum Values Poster</a></li>
</ul>
<h2>The exam</h2>
<p>The exam is an online quiz of 80 questions to be answered in 60 minutes. I suggest using the <em>Bookmark</em> feature of the quiz. It lets you bookmark questions you're doubtful about and review them later. It took me about 40-45 minutes to go quickly through all the questions. I then had approximately 15 minutes to review the bookmarked questions.</p>
<p>I've read on a few forums that people encountered performance issues with the exam webpage. However, I did not run into any issue and the exam ran smoothly.</p>
<p>You can have notes either printed or on your laptop because there are no controls like browser locks or similar ones. You are basically free to look at any resource you like during the exam. The time pressure is a decent guarantee against cheating.</p>
<p>When you complete the exam, you'll have a printed certificate and a badge like this:</p>
<p><img alt="PSM I badge" src="https://www.marcosantoni.com/images/psmi_badge.png"></p>
<p>Your certificate will also be available on your <a href="credly.com/">Credly</a> profile (if you have one).</p>
Notes from Designing Data-Intensive Applications2021-04-10T07:31:00+02:002021-04-10T07:31:00+02:00Marco Santonitag:www.marcosantoni.com,2021-04-10:/review_designing_data_intensive.html
<p><a href="https://dataintensive.net/">Designing Data-Intensive Applications</a> by Martin Kleppmann was not a quick read. Let me be clear, it is not such a long book (the paper version is 400 pages), but it is so dense with information that it takes some time to go through. The book indeed covers a broad spectrum of data technologies and is dense with details in every paragraph. So, be ready before starting the journey.</p>
<p><img alt="Ocean of distributed data" src="https://www.marcosantoni.com/images/data_map_600.jpg"></p>
<p>What did I learn from the book? I'll take a few quotes from my notes.</p>
<blockquote>
<p>An architecture that is appropriate for one level of load is unlikely to cope with 10 times that load. If you are working on a fast-growing service, it is therefore likely that you will need to rethink your architecture on every order of magnitude load increase — or perhaps even more often than that.</p>
</blockquote>
<p>We need to be able to test, develop, and change our architecture quickly. The book covers the main data solution designs, but you need a team and an organization that are able to adapt and improve the architecture constantly. And more importantly, avoid <a href="http://wiki.c2.com/?PrematureOptimization">premature optimization</a> as much as possible. Prefer simplicity over complexity.</p>
<blockquote>
<p>If the same query can be written in 4 lines in one query language but requires 29 lines in another, that just shows that different data models are designed to satisfy different use cases. It’s important to pick a data model that is suitable for your application.</p>
</blockquote>
<p>Don't focus only on data processing performance: data models and query languages matter too. The overall simplicity and readability of the solution design should be taken into account when choosing the data model.</p>
<blockquote>
<p>On the surface, a data warehouse and a relational OLTP database look similar, because they both have a SQL query interface. However, the internals of the systems can look quite different, because they are optimized for very different query patterns. Many database vendors now focus on supporting either transaction processing or analytics workloads, but not both.</p>
</blockquote>
<p>We experienced this difference in my team. We started by building a data warehouse on top of SQL, but we ran into performance issues quite soon. The statement by Kleppmann may seem obvious, but there are plenty of organizations building data warehouses on SQL for a variety of reasons.</p>
<blockquote>
<p>... we will explore some of the most common ways how data flows between processes: via databases, via service calls (eg REST and RPC), and via asynchronous message passing (eg MQTT, AMQP).</p>
</blockquote>
<p>I find this an amazing summary. In the end, any data flow architecture falls into one of these 3 categories, doesn't it?</p>
<blockquote>
<p>When you deploy a new version of your application (of a server-side application, at least), you may entirely replace the old version with the new version within a few minutes. The same is not true of database contents: the five-year-old data will still be there, in the original encoding, unless you have explicitly rewritten it since then. This observation is sometimes summed up as data outlives code.</p>
</blockquote>
<p>Migrating data is harder than updating an application (and there are richer tools available for deploying an application than migrating a database).</p>
<blockquote>
<p>May your application’s evolution be rapid and your deployments be frequent.</p>
</blockquote>
<p>I love this wish 😊</p>
<blockquote>
<p>All of the difficulty in replication lies in handling changes to replicated data, and that’s what this chapter is about. We will discuss three popular algorithms for replicating changes between nodes: <em>single-leader</em>, <em>multi-leader</em>, and <em>leaderless replication</em>. Almost all distributed databases use one of these three approaches.</p>
</blockquote>
<p>I found this quote in the introduction to the <em>Replication</em> chapter of the book. I had often heard these replication mechanisms mentioned, but this was the first time I did a deep dive into the topic (which is not as easy as I would have expected). Throughout the book, Kleppmann makes one thing clear: there are many things that can go wrong around data (timestamp alignment, networking, nodes down, etc), and they will go wrong at some point.</p>
<blockquote>
<p>Because of this risk of skew and hot spots, many distributed datastores use a hash function to determine the partition for a given key. A good hash function takes skewed data and makes it uniformly distributed.</p>
</blockquote>
<p>And fortunately this hashing is often managed under the hood by the datastores themselves, eg Azure Cosmos DB.</p>
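<p>To make the idea concrete, here is a minimal sketch of hash partitioning. It is not how any specific datastore does it: the function name <code>partition_for</code>, the choice of MD5, and the number of partitions are assumptions made for illustration. The keys are deliberately skewed (they all share the same prefix), yet they end up spread almost evenly.</p>

```python
import hashlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition number using a stable hash (hypothetical helper)."""
    # hashlib gives the same digest on every run and every machine,
    # unlike Python's built-in hash(), which is randomized per process for strings.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Skewed keys: sequential ids that would all land next to each other
# if we partitioned by key range instead of by hash.
keys = [f"user-{i:05d}" for i in range(10_000)]
counts = Counter(partition_for(key, 8) for key in keys)

for partition, count in sorted(counts.items()):
    print(f"partition {partition}: {count} keys")
```

<p>The modulo step keeps the sketch short. Assigning ranges of hash values to partitions is usually a better fit when partitions need to be rebalanced, because changing the number of partitions with a plain modulo moves most keys.</p>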
<blockquote>
<p>Atomicity, isolation, and durability are properties of the database, whereas consistency (in the ACID sense) is a property of the application. The application may rely on the database’s atomicity and isolation properties in order to achieve consistency, but it’s not up to the database alone. Thus, the letter C doesn’t really belong in ACID.</p>
</blockquote>
<p>Interesting to read that the <em>C</em> in such a popular acronym is there just to make the acronym work.</p>
<blockquote>
<p>Errors will inevitably happen, but many software developers prefer to think only about the happy path rather than the intricacies of error handling.</p>
</blockquote>
<p>True story, but experience helps you think a bit more about the <em>sad path</em>.</p>
<blockquote>
<p>Simply dumping data in its raw form allows for several such transformations. This approach has been dubbed the sushi principle: "raw data is better"</p>
</blockquote>
<p>I have been following the <em>sushi principle</em> in the last year without being aware of this definition. Nice name!</p>
<blockquote>
<p>Database triggers can be used to implement change data capture by registering triggers that observe all changes to data tables and add corresponding entries to a changelog table. However, they tend to be fragile and have significant performance overheads. Parsing the replication log can be a more robust approach, although it also comes with challenges, such as handling schema changes.</p>
</blockquote>
<p>I see replication log parsing as a growing trend. It enables the "take the data to the data lake and then we'll see what to do" approach. Furthermore, it fits streaming data applications too. Today, not all vendors support the publication of such change logs natively (eg I didn't find a simple solution for <em>SQL Server</em>).</p>
<p><img alt="Database state as integral of stream" src="https://www.marcosantoni.com/images/state_as_integral_of_stream_600.png"></p>
<blockquote>
<p>If you are mathematically inclined, you might say that the application state is what you get when you integrate an event stream over time, and a change stream is what you get when you differentiate the state by time, as shown in figure. The analogy has limitations (for example, the second derivative of state does not seem to be meaningful), but it’s a useful starting point for thinking about data.</p>
</blockquote>
<p>This brilliant analogy is the intro of the <strong>chapter I enjoyed the most</strong> in the entire book, ie the <em>Stream Processing</em> chapter. It presents a database as a cache of the latest state derived from the replication log (the opposite of the point of view we normally have).</p>
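<p>The analogy can be sketched in a few lines of code. This is only a toy illustration under my own assumptions: events are <code>(key, delta)</code> pairs on account balances, and the helper names <code>integrate</code> and <code>differentiate</code> are made up for the example.</p>

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int]  # (key, delta): a hypothetical change event


def integrate(events: List[Event]) -> Dict[str, int]:
    """Replay an event stream to obtain the current state (the 'integral')."""
    state: Dict[str, int] = {}
    for key, delta in events:
        state[key] = state.get(key, 0) + delta
    return state


def differentiate(before: Dict[str, int], after: Dict[str, int]) -> List[Event]:
    """Compare two states to recover the changes between them (the 'derivative')."""
    changes: List[Event] = []
    for key in sorted(set(before) | set(after)):
        delta = after.get(key, 0) - before.get(key, 0)
        if delta != 0:
            changes.append((key, delta))
    return changes


events = [("alice", 100), ("bob", 50), ("alice", -30)]

snapshot = integrate(events[:2])  # state after the first two events
state = integrate(events)         # state after replaying the whole log

print(state)                           # {'alice': 70, 'bob': 50}
print(differentiate(snapshot, state))  # [('alice', -30)]
```

<p>In this view the dictionary is just a cache: it can be thrown away and rebuilt at any time by replaying the log.</p>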
<blockquote>
<p>In the absence of widespread support for a good distributed transaction protocol, I believe that log-based derived data is the most promising approach for integrating different data systems.</p>
</blockquote>
<p>I have seen Kafka as a tool for stream processing so far. I was not thinking of it as a tool for integrating data systems. The last chapter of the book gives a hint on how <em>log-based derived data</em> may become a popular pattern soon.</p>
<blockquote>
<p>The trend has been to keep stateless application logic separate from state management (databases): not putting application logic in the database and not putting persistent state in the application. As people in the functional programming community like to joke, "We believe in the separation of Church and state"</p>
</blockquote>
<p>Good one.</p>
Models of Data Science teams: Chess vs Checkers2021-03-27T09:35:00+01:002021-03-27T09:35:00+01:00Marco Santonitag:www.marcosantoni.com,2021-03-27:/chess_vs_checkers_teams.html
<blockquote>
<p>How many data engineers should we hire? Are they too many compared to our data scientists?</p>
</blockquote>
<p>One of the key decisions to take when building a data science team is the <strong>mix of roles</strong>. This means choosing the right mix of backgrounds and activities that the members of the team should have. I'll compare two models of teams I've experienced so far and call them the <strong>chess-team</strong> model and the <strong>checkers-team</strong> model.</p>
<h2>Chess-Team Model</h2>
<p><img alt="Chess board" src="https://www.marcosantoni.com/images/chess_400.jpg"></p>
<p>The chess-team model is the common model we read about in the literature. In a chess-team, each member of the team has a <strong>specific role</strong>. Roles are usually: <em>data engineers</em>, <em>data scientists</em>, and <em>machine learning engineers</em>. These roles typically correspond to different sets of skills (eg ML and statistics vs coding and devops) and to different sets of activities (model selection vs data preparation vs model deployment).</p>
<p>Just as a chess piece has a clear role that differs from the other pieces, a member of a data science chess-team is assigned a subset of the tasks that are part of the development pipeline. Let's consider a simplistic development pipeline:</p>
<ul>
<li>data preparation -> data engineer</li>
<li>model development -> data scientist</li>
<li>model deployment -> machine learning engineer</li>
</ul>
<p>The three activities of this development pipeline correspond to the three roles of the team, and there is little space for confusion. A data engineer probably won't work a lot on the model development and selection, while a data scientist probably won't be the one deploying the model in production.</p>
<h2>Checkers-Team Model</h2>
<p><img alt="Checkers board" src="https://www.marcosantoni.com/images/checkers_400.jpg"></p>
<p>The checkers-team model is a definition of a team model that I introduce in this post. In a checkers-team, each member of the team does not have a specific role because he or she may be in charge of working on <strong>any step of the development</strong> pipeline. There are no roles like <em>data engineer</em> or <em>data scientist</em> because taking such a role implies limiting the scope of activities a team member should work on. Let's take an example. In a checkers-team, there is no <em>data scientist</em> because no one is in charge of model development <strong>only</strong>.</p>
<p>So, what is the role of someone working in a checkers-team? A member of the team can be defined as a <strong>full-stack data developer</strong>. A full-stack data developer is someone who, for example, works on data extraction <em>AND</em> model development <em>AND</em> model deployment. In a checkers-team, everyone can potentially work on every piece of the development lifecycle. In this sense, the team is more similar to checkers pieces. There is no move that one piece can make and another piece cannot. Similarly, there is no activity that any team member cannot do. For example, everyone can contribute to building devops pipelines and automation.</p>
<p>Of course, every team member has a different <strong>background</strong> and a different set of skills from his/her teammates. One can come from a software engineering experience, another can come from data science studies. However, the strategy of building a checkers-team is to invest in <strong>training</strong> team members to grow their set of skills <strong>horizontally</strong>.</p>
<h2>Pros and Cons</h2>
<p>Let's consider some key differences between a chess and a checkers team model.</p>
<p><strong>Flexibility.</strong> The balance of types of activities is not stable over time in a team. There can be times when there is a peak of work items in data engineering and little or no work items in ML model development. These peaks can be due to different phases of the data product development cycle or due to varying business requirements. A checkers-team is flexible and can adapt quickly to these peaks. A checkers-team could for example dedicate the entire team to develop data engineering pipelines in a Scrum sprint if needed. The same flexibility is not as easy in a chess-team model where you have constraints due to different skills and different responsibilities.</p>
<p><strong>Complexity.</strong> Not every data science team faces the same level of complexity in its projects. Imagine a team that is building an AI model for self-driving cars. It is a complex problem to solve that requires advanced skills in computer vision and AI. These skills cannot be learned quickly but usually need a specific education or career path. When facing such problems, you need team members who are specialists in areas like vision or AI. A chess-team is designed to host specialists in certain fields and to grow such skills vertically. In a checkers-team, there are no such specialists.</p>
<p><strong>Awareness.</strong> A member of a checkers-team knows every phase of the development cycle in detail. While he is designing an ML model, he is aware at the same time of how the release pipeline and the operations of the model work. He may take decisions during model selection that take into consideration where the model will be hosted and the possible constraints of the production platform. On the other hand, a data scientist in a chess-team knows fewer details (because he has not been working on it himself) of how the model will be deployed and run. This lower awareness may lead to assumptions made during model development, and these assumptions can bring more complexity to those in charge of deploying the model.</p>
<p><strong>Sense of Ownership.</strong> In a checkers-team, you are in charge of engineering data pipelines, developing models, and deploying them. Any issue that may occur in these phases is also <em>your</em> issue. You can't delegate too much, and, therefore, you naturally feel responsible for contributing to the resolution. Distributing the ownership makes every team member more active in improving the development life cycle.</p>
<h2>When is a Team Model Right?</h2>
<p>The answer depends on the context and the organization you work at. Is the data science team working on the <strong>core product</strong> of the company? If this is the case, the models that are developed may need a level of specialization that just can't be achieved by a checkers-team.</p>
<p>Or is the team rather working on adding tiny features or on improving the operations of the company? In this case, you probably won't be developing state-of-the-art AI models, and you can rely on existing <strong>libraries or SaaS</strong> that make life easier for you. As complexity is not an obstacle, going for a checkers-team may be a good option.</p>
<p>What is the size of your data science team? Or even how many teams do you have? Large organizations go for multiple data teams. These teams may be divided <strong>functionally</strong> (eg 1 team of data engineers + 1 separate team of data scientists) or they may be divided by <strong>business units</strong> (eg 1 data team for marketing and 1 data team for the recommender system). You can't of course adopt the checkers-team model in a large organization that designs its data teams by function, but you may still adopt this model in a large organization that creates multiple self-organized teams, each dedicated to a specific business unit.</p>
<p>A last point to consider is the <strong>IT architecture</strong>. A checkers-team requires the same person to work on very different tasks. This is viable only if the complexity of such tasks is small. Adopting <strong>SaaS and PaaS</strong> resources simplifies every task by hiding the complexity of managing and running the resources. They let you focus on your goal. For example, building an API endpoint hosted by a function-as-a-service is something feasible by a data scientist with a mathematical background. Doing the same from scratch on an on-premise server is not as feasible.</p>
<p><em>Images courtesy of <a href="https://unsplash.com/photos/DC-UrroFRr4">@pecanlie</a> and <a href="https://unsplash.com/photos/U_Kz2RnfFAk">@rafaelrex</a></em></p>
Choosing my next job title (in a data science career)2021-01-08T07:41:00+01:002021-01-08T07:41:00+01:00Marco Santonitag:www.marcosantoni.com,2021-01-08:/choosing_next_job_title.html
<p>I'm now part of a data and AI team in a fintech spinoff. When I joined the company, it did not make sense to spend time defining precise job titles because we were to build everything from scratch (software, teams, and organization). My job title was therefore a generic "<em>AI Practitioner</em>". One year later, teams and responsibilities are clearer, and it is now time to define my job title.</p>
<h2>What was I doing up to now?</h2>
<p>I have a background in data science and software engineering. I started my career in 2013 as "<em>Data Scientist and Software Developer</em>" (what we would call today a <em>Machine Learning Engineer</em>?) in a small startup. I was then defined as an "<em>Associate</em>" when working as a data scientist in a consulting firm. In the last 3 years, I worked in a manufacturing firm as "<em>Data Scientist</em>".</p>
<h2>What am I doing now?</h2>
<p>In the company I currently work at, I work in the data and AI team. My main activities include:</p>
<ul>
<li>planning and prioritizing our data solution</li>
<li>designing our data and software architecture</li>
<li>developing our data integrations, analytics reporting, ML models and data solutions hands-on</li>
<li>making sure our Scrum ceremonies run smoothly</li>
</ul>
<p>My job has a mix of coding, architecture design, and project/product management. Why such a variety of responsibilities? I work in a small team that is part of a company growing quickly from zero. Each team is quite autonomous in doing its work and takes end-to-end ownership of its activities. For example, in my data and AI team we handle our work end-to-end. We are responsible for the entire pipeline: defining roadmaps, development, deployment, and monitoring.</p>
<h2>My job title?</h2>
<p>It is now time to define a job title that can summarize my responsibilities listed above. These are some alternatives I took into consideration:</p>
<table>
<thead>
<tr>
<th>Job title</th>
<th>Comments</th>
</tr>
</thead>
<tbody>
<tr>
<td>Senior Data Scientist/Engineer</td>
<td>Too vertical on a piece of the pipeline compared to the spectrum of activities I work on</td>
</tr>
<tr>
<td>Data Architect</td>
<td>Nicely defines the technical activities of designing and scaling our data solutions, but lacks the ownership of the backlog and of the product roadmap</td>
</tr>
<tr>
<td>Data Product Owner</td>
<td>Clearly states the ownership of the product backlog, but I feel that the "Product Owner" title is too tied to a Scrum role and lacks technical responsibilities</td>
</tr>
<tr>
<td>Lead Data and AI</td>
<td>States the responsibility of leading a team of experts in a domain. However, it does not convey any ownership of the product roadmap. Furthermore, it states a clear hierarchy in the team that goes against our team and company culture (a culture of distributed ownership and flat organization)</td>
</tr>
</tbody>
</table>
<p>I was not satisfied with the job titles above. Then, I came up with <strong>"Data Product Manager"</strong>. I felt this job title was what I was looking for because:</p>
<ul>
<li>as a Product Manager, you are responsible for the product roadmap and strategy</li>
<li>the prefix "Data" adds a technical taste. By doing some research, I found that a TPM (Technical Product Manager) is a common job title that defines a product manager that is also in charge of the technical side of the product (architecture, etc)</li>
<li>it states the ownership of our data product but does not add any hierarchy-sounding adjectives</li>
<li>my end-to-end range of activities can fit well in this definition</li>
</ul>
<p>I shared these thoughts with my manager, who agreed both on the definition of my responsibilities and on the job title. Let's see if these notes can help those who are facing the same challenge of choosing their own job title.</p>What we expected from Covid on March 10th2020-12-26T09:35:00+01:002020-12-26T09:35:00+01:00Marco Santonitag:www.marcosantoni.com,2020-12-26:/what_we_expected_from_covid.html<p>The first Covid case in Italy was found on February 21st 2020. A couple of weeks later we were entering the lockdown with this number of new daily cases.</p>
<p><img alt="Cases up to March 10th" src="https://www.marcosantoni.com/images/cases_up_to_march_10.png"></p>
<p>The number of new Covid-19 cases was growing really fast every day. We had no clue about what was going to happen or when it would end. Was it going to end soon? How quickly was the virus spreading? I was wondering whether our feelings and <strong>expectations</strong> would turn out to be true or not. So, I ran a little <strong>experiment</strong> with 7 friends. I asked each of them the following 2 questions on March 10th 2020:</p>
<ol>
<li>What will the total number of Covid19 cases be by April 1st?</li>
<li>When will the number of new cases be smaller than 50 again?</li>
</ol>
<p>The goal of these questions was to investigate our ability as humans to roughly grasp the size and the duration of an unprecedented event like a global pandemic. Let's look at the answers we gave to these 2 questions.</p>
<h2>Total cases by April 1st</h2>
<p>The total number of Covid19 cases in Italy by April 1st turned out to be <code>110k</code> (precisely <code>110574</code>). These were our 7 predictions made on March 10th.</p>
<p><img alt="Cases up to March 10th" src="https://www.marcosantoni.com/images/prediction_cases_april_1.png"></p>
<p>We see that 5 out of 7 respondents predicted a number of cases below <code>60k</code> (with 2 respondents even below <code>25k</code>). Only 2 out of 7 respondents gave more realistic predictions (<code>110k</code> and <code>130k</code> respectively). Why were most respondents too <strong>optimistic</strong>? If we look at the very first chart, an exponential growth of new cases was already happening on March 10th. Perhaps the majority of respondents were perceiving the growth as linear.</p>
<p>Does our brain have <strong>misperceptions</strong> about exponential growth? My little experiment suggested so, but I was curious whether there is some scientific literature about this misperception. I found a <a href="https://link.springer.com/article/10.3758/BF03204114">paper</a> written back in 1975: <em>"Misperception of exponential growth"</em> by Wagenaar and Sagaria.</p>
<p><img alt="Experiment in 1975 paper" src="https://www.marcosantoni.com/images/prediction_cases_misperception_experiment.png"></p>
<p>In this paper, the researchers presented the beginning of an exponential time series covering the years 1970 to 1974. They presented this time series in different experiments, both as a series of numbers and as a graph (see chart above). They asked participants to predict the value of the series for 1979. A considerable <strong>underestimation</strong> of growth was encountered in all groups in all conditions.</p>
<p>The results of this paper helped me understand why most of my respondents notably underestimated the growth of Covid-19 cases in Italy. Our intuition seems to work well for linear growth, but not for exponential growth.</p>
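<p>To get a feeling for how large the gap between the two mental models can be, here is a minimal sketch in Python. The series is made up (daily counts doubling every 3 days) and is not the real Italian data; the function names and the 21-day horizon are assumptions chosen only for illustration.</p>

```python
# Hypothetical daily counts that double every 3 days (NOT the real Italian data).
observed = [100 * 2 ** (day / 3) for day in range(10)]  # days 0..9


def linear_forecast(series, horizon):
    """Extend the series with the average daily increment of the last 4 points."""
    recent = series[-4:]
    slope = (recent[-1] - recent[0]) / (len(recent) - 1)
    return series[-1] + slope * horizon


def exponential_forecast(series, horizon):
    """Extend the series with the average daily growth ratio of the whole series."""
    ratio = (series[-1] / series[0]) ** (1 / (len(series) - 1))
    return series[-1] * ratio ** horizon


horizon = 21  # days ahead, an arbitrary choice for this sketch
linear = linear_forecast(observed, horizon)
exponential = exponential_forecast(observed, horizon)

print(f"linear guess:      {linear:,.0f}")
print(f"exponential guess: {exponential:,.0f}")
print(f"the exponential guess is {exponential / linear:.1f} times larger")
```

<p>Both forecasts start from the same ten observations, yet the linear one ends up more than an order of magnitude below the exponential one after three weeks. That is the kind of gap we saw between most of our predictions and the actual number of cases.</p>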
<p>The following question naturally comes up: if the underestimation of the Covid-19 growth was common among the vast majority of citizens due to this unavoidable misperception, how did it affect micro and macro decisions when facing the pandemic?</p>
<h2>New cases smaller than 50 again</h2>
<p>We now move to the second question of my little experiment (asked on March 10th): <em>"When will the number of new cases be smaller than 50 again?"</em>. Plotting the answers:</p>
<p><img alt="Daily cases below 50" src="https://www.marcosantoni.com/images/prediction_cases_below_50.png"></p>
<p>I drew a black vertical line for each date given as an answer. We were too optimistic in this survey too. 5 out of 7 respondents expected the situation to be under control (the red horizontal line represents the threshold of 50 cases in the question) by May 1st. No one was expecting the high number of daily cases to last beyond <strong>June 18th</strong>. As of today (December 26th), the number of daily cases in Italy has not gone below 50 on a single day since then.</p>
<p>We were just starting to experience an extraordinary event, and we were not expecting it to last for that long. This bias in perceiving the pandemic as shorter than it was probably helped the social distancing policies. Changing your social habits is a sacrifice that you willingly make if you expect it to last for a short time. Imagine if we had known that Covid-19 would last for 9 months or even more.</p>
<p>Another question naturally comes up about the economic policies that were taken to tackle the pandemic: were they subject to the same short-term bias that was measured in this experiment?</p>Summary: Building AI Solutions with Azure ML2020-08-19T06:41:00+02:002020-08-19T06:41:00+02:00Marco Santonitag:www.marcosantoni.com,2020-08-19:/summary_building_ai_solutions_azure_ml.html<p>While studying for the <em>Azure Data Scientist Associate</em> certification, I took notes from the <a href="https://docs.microsoft.com/en-us/learn/paths/build-ai-solutions-with-azure-ml-service/">Building AI Solutions with Azure ML</a> course. In this single page, you'll find the entire content of the course (as of 18th August, 2020). This page is a small support for those preparing to earn the certification.</p>
<h1>Intro</h1>
<h2>Azure ML Workspace</h2>
<p>Workspaces are Azure resources. They include:</p>
<ul>
<li>compute</li>
<li>notebooks</li>
<li>pipelines</li>
<li>data</li>
<li>experiments</li>
<li>models</li>
</ul>
<p>created alongside</p>
<ul>
<li>storage account: files by WS + data</li>
<li>application insights</li>
<li>key vault</li>
<li>vm</li>
<li>container registry</li>
</ul>
<p>permission: RBAC</p>
<p>Editions:</p>
<ul>
<li>basic (no graphical designer)</li>
<li>enterprise</li>
</ul>
<h2>Tools</h2>
<p>Azure ML Studio</p>
<ul>
<li>designer (no-code ML model development)</li>
<li>automated ML</li>
</ul>
<p>Azure ML SDK</p>
<p>Azure ML CLI Extensions</p>
<p>Compute Instances</p>
<ul>
<li>choose VM</li>
<li>store notebooks independently of VMs</li>
</ul>
<p>VS Code - Azure ML Extension</p>
<h2>Experiments</h2>
<p>Azure ML tracks runs of experiments</p>
<div class="highlight"><pre><span></span><code><span class="o">...</span>
<span class="n">run</span> <span class="o">=</span> <span class="n">experiment</span><span class="o">.</span><span class="n">start_logging</span><span class="p">()</span>
<span class="o">...</span>
<span class="n">run</span><span class="o">.</span><span class="n">complete</span><span class="p">()</span>
</code></pre></div>
<ul>
<li>logging metrics. <code>run.log('name', value)</code>. You can review them via <code>RunDetails(run).show()</code></li>
<li>experiment output file. Example: trained models. <code>run.upload_file(..)</code>.</li>
</ul>
<p><strong>Script as an experiment</strong>. In the script, you can get the context: <code>run = Run.get_context()</code>. To run it, you define:</p>
<ul>
<li>RunConfiguration: python environment</li>
<li>ScriptRunConfig: associates RunConfiguration with script</li>
</ul>
<h1>Train a ML model</h1>
<h2>Estimators</h2>
<p>Estimator: encapsulates a run configuration and a script configuration in a single object. Save the trained model as a pickle file in the <code>outputs</code> folder</p>
<div class="highlight"><pre><span></span><code><span class="n">estimator</span> <span class="o">=</span> <span class="n">Estimator</span><span class="p">(</span>
<span class="n">source_directory</span><span class="o">=</span><span class="s1">'experiment'</span><span class="p">,</span>
<span class="n">entry_script</span><span class="o">=</span><span class="s1">'training.py'</span><span class="p">,</span>
<span class="n">compute_target</span><span class="o">=</span><span class="s1">'local'</span><span class="p">,</span>
<span class="n">conda_packages</span><span class="o">=</span><span class="p">[</span><span class="s1">'scikit-learn'</span><span class="p">]</span>
<span class="p">)</span>
<span class="n">experiment</span> <span class="o">=</span> <span class="n">Experiment</span><span class="p">(</span><span class="n">workspace</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s1">'train_experiment'</span><span class="p">)</span>
<span class="n">run</span> <span class="o">=</span> <span class="n">experiment</span><span class="o">.</span><span class="n">submit</span><span class="p">(</span><span class="n">config</span><span class="o">=</span><span class="n">estimator</span><span class="p">)</span>
</code></pre></div>
<p>Framework-specific estimators simplify configurations</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">azureml.train.sklearn</span> <span class="kn">import</span> <span class="n">SKLearn</span>
<span class="n">estimator</span> <span class="o">=</span> <span class="n">SKLearn</span><span class="p">(</span>
<span class="n">source_directory</span><span class="o">=</span><span class="s1">'experiment'</span><span class="p">,</span>
<span class="n">entry_script</span><span class="o">=</span><span class="s1">'training.py'</span><span class="p">,</span>
<span class="n">compute_target</span><span class="o">=</span><span class="s1">'local'</span>
<span class="p">)</span>
</code></pre></div>
<h2>Script parameters</h2>
<p>Use <code>argparse</code> to read the parameters in a script (eg regularization rate). To pass a parameter to an <code>Estimator</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">estimator</span> <span class="o">=</span> <span class="n">SKLearn</span><span class="p">(</span>
<span class="n">source_directory</span><span class="o">=</span><span class="s1">'experiment'</span><span class="p">,</span>
<span class="n">entry_script</span><span class="o">=</span><span class="s1">'training.py'</span><span class="p">,</span>
<span class="n">script_params</span><span class="o">=</span><span class="p">{</span><span class="s1">'--reg_rate'</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">},</span>
<span class="n">compute_target</span><span class="o">=</span><span class="s1">'local'</span>
<span class="p">)</span>
</code></pre></div>
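<p>On the script side, the parameter can be read with <code>argparse</code>. The following is a minimal sketch of what the top of a hypothetical <code>training.py</code> might look like. The function name <code>parse_args</code> and the default value are assumptions, only the <code>--reg_rate</code> name comes from the example above.</p>

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line parameters of a hypothetical training script."""
    parser = argparse.ArgumentParser(description="Hypothetical training script")
    # '--reg_rate' matches the key passed in script_params; the default is an assumption.
    parser.add_argument("--reg_rate", type=float, dest="reg_rate", default=0.01,
                        help="regularization rate")
    # parse_known_args tolerates extra arguments the script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


# In the real script you would call parse_args() without arguments so that
# sys.argv is used; an explicit list is passed here to keep the sketch self-contained.
args = parse_args(["--reg_rate", "0.1"])
print(f"Training with regularization rate {args.reg_rate}")
```

<p>Declaring <code>type=float</code> means the script receives a number rather than a string, whatever the caller passed on the command line.</p>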
<h2>Registering models</h2>
<p>Once the experiment <code>Run</code> has completed, you can retrieve its outputs (eg trained model).</p>
<div class="highlight"><pre><span></span><code><span class="n">run</span><span class="o">.</span><span class="n">download_file</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">'outputs/model.pkl'</span><span class="p">,</span> <span class="n">output_file_path</span><span class="o">=</span><span class="s1">'model.pkl'</span><span class="p">)</span>
</code></pre></div>
<p>Registering a model allows you to track multiple versions of a model.</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">Model</span><span class="o">.</span><span class="n">register</span><span class="p">(</span>
<span class="n">workspace</span><span class="o">=</span><span class="n">ws</span><span class="p">,</span>
<span class="n">model_name</span><span class="o">=</span><span class="s1">'classification_model'</span><span class="p">,</span>
<span class="n">model_path</span><span class="o">=</span><span class="s1">'model.pkl'</span><span class="p">,</span> <span class="c1">#local path</span>
<span class="n">description</span><span class="o">=</span><span class="s1">'a classification model'</span><span class="p">,</span>
<span class="n">tags</span><span class="o">=</span><span class="p">{</span><span class="s1">'dept'</span><span class="p">:</span> <span class="s1">'sales'</span><span class="p">},</span>
<span class="n">model_framework</span><span class="o">=</span><span class="n">Model</span><span class="o">.</span><span class="n">Framework</span><span class="o">.</span><span class="n">SCIKITLEARN</span><span class="p">,</span>
<span class="n">model_framework_version</span><span class="o">=</span><span class="s1">'0.20.3'</span>
<span class="p">)</span>
</code></pre></div>
<p>or register from run:</p>
<div class="highlight"><pre><span></span><code><span class="n">run</span><span class="o">.</span><span class="n">register_model</span><span class="p">(</span>
<span class="o">...</span>
<span class="n">model_path</span><span class="o">=</span><span class="s1">'outputs/model.pkl'</span>
<span class="o">...</span>
<span class="p">)</span>
</code></pre></div>
<h1>Datastores</h1>
<p>Abstractions of cloud data sources encapsulating the information required to connect.</p>
<p>You can register a data store</p>
<ul>
<li>via ML Studio</li>
<li>via SDK</li>
</ul>
<div class="highlight"><pre><span></span><code><span class="n">ws</span> <span class="o">=</span> <span class="n">Workspace</span><span class="o">.</span><span class="n">from_config</span><span class="p">()</span>
<span class="n">blob</span> <span class="o">=</span> <span class="n">Datastore</span><span class="o">.</span><span class="n">register_azure_blob_container</span><span class="p">(</span>
<span class="n">workspace</span><span class="o">=</span><span class="n">ws</span><span class="p">,</span>
<span class="n">datastore_name</span><span class="o">=</span><span class="s1">'blob_data'</span><span class="p">,</span>
<span class="n">container_name</span><span class="o">=</span><span class="s1">'data_container'</span><span class="p">,</span>
<span class="n">account_name</span><span class="o">=</span><span class="s1">'az_acct'</span><span class="p">,</span>
<span class="n">account_key</span><span class="o">=</span><span class="s1">'123456'</span>
<span class="p">)</span>
</code></pre></div>
<p>In the SDK, you can list data stores.</p>
<h2>Use datastores</h2>
<p>Most common: Azure blob and file</p>
<div class="highlight"><pre><span></span><code><span class="n">blob_ds</span><span class="o">.</span><span class="n">upload</span><span class="p">(</span>
<span class="n">src_dir</span><span class="o">=</span><span class="s1">'/files'</span><span class="p">,</span>
<span class="n">target_path</span><span class="o">=</span><span class="s1">'/data/files'</span><span class="p">,</span>
<span class="n">overwrite</span><span class="o">=</span><span class="kc">True</span>
<span class="p">)</span>
<span class="n">blob_ds</span><span class="o">.</span><span class="n">download</span><span class="p">(</span>
<span class="n">target_path</span><span class="o">=</span><span class="s1">'downloads'</span><span class="p">,</span>
<span class="n">prefix</span><span class="o">=</span><span class="s1">'/data'</span>
<span class="p">)</span>
</code></pre></div>
<p>You pass a data reference to the script to use a datastore. Data access modes:</p>
<ul>
<li>download: contents downloaded to the compute context of experiment</li>
<li>upload: files generated by experiment are uploaded after run</li>
<li>mount: path of datastore mounted as remote storage (only on remote compute target)</li>
</ul>
<p>Pass reference as script parameter:</p>
<div class="highlight"><pre><span></span><code><span class="n">data_ref</span> <span class="o">=</span> <span class="n">blob_ds</span><span class="o">.</span><span class="n">path</span><span class="p">(</span><span class="s1">'data/files'</span><span class="p">)</span><span class="o">.</span><span class="n">as_download</span><span class="p">(</span><span class="n">path_on_compute</span><span class="o">=</span><span class="s1">'training_data'</span><span class="p">)</span>
<span class="n">estimator</span> <span class="o">=</span> <span class="n">SKLearn</span><span class="p">(</span>
<span class="n">source_directory</span><span class="o">=</span><span class="s1">'experiment_folder'</span><span class="p">,</span>
<span class="n">entry_script</span><span class="o">=</span><span class="s1">'training_script.py'</span><span class="p">,</span>
<span class="n">compute_target</span><span class="o">=</span><span class="s1">'local'</span><span class="p">,</span>
<span class="n">script_params</span><span class="o">=</span><span class="p">{</span><span class="s1">'--data_folder'</span><span class="p">:</span> <span class="n">data_ref</span><span class="p">}</span>
<span class="p">)</span>
</code></pre></div>
<p>Retrieve it in script and use it like local folder:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">argparse</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="o">.</span><span class="n">ArgumentParser</span><span class="p">()</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s1">'--data_folder'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s1">'data_folder'</span><span class="p">)</span>
<span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse_args</span><span class="p">()</span>
<span class="n">data_files</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">data_folder</span><span class="p">)</span>
</code></pre></div>
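<p>Since the data reference behaves like a local folder, the rest of the script can use plain file APIs. A minimal sketch, assuming the folder contains CSV files with a header row (the helper name <code>load_rows</code> is hypothetical):</p>

```python
import csv
import os


def load_rows(data_folder):
    """Read every CSV file in data_folder and return all rows as dictionaries."""
    rows = []
    for file_name in sorted(os.listdir(data_folder)):
        if not file_name.endswith(".csv"):
            continue  # skip anything that is not a CSV file
        with open(os.path.join(data_folder, file_name), newline="") as handle:
            # DictReader uses the first line of each file as column names
            rows.extend(csv.DictReader(handle))
    return rows
```

<p>In the training script you would call it as <code>load_rows(args.data_folder)</code>.</p>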
<h2>Datasets</h2>
<p>Datasets are versioned packaged data objects consumed in experiments and pipelines. Types</p>
<ul>
<li>tabular: read as table</li>
<li>file: list of file paths</li>
</ul>
<p>You can create a dataset via Azure ML Studio or via SDK. File paths can have wildcards (<code>/files/*.csv</code>).</p>
<p>Once a dataset is created, you can <strong>register</strong> it in the workspace (available later too).</p>
<p>Tabular:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">azureml.core</span> <span class="kn">import</span> <span class="n">Dataset</span>
<span class="n">blob_ds</span> <span class="o">=</span> <span class="n">ws</span><span class="o">.</span><span class="n">get_default_datastore</span><span class="p">()</span>
<span class="n">csv_paths</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">(</span><span class="n">blob_ds</span><span class="p">,</span> <span class="s1">'data/files/current_data.csv'</span><span class="p">),</span>
<span class="p">(</span><span class="n">blob_ds</span><span class="p">,</span> <span class="s1">'data/files/archive/*.csv'</span><span class="p">)</span>
<span class="p">]</span>
<span class="n">tab_ds</span> <span class="o">=</span> <span class="n">Dataset</span><span class="o">.</span><span class="n">Tabular</span><span class="o">.</span><span class="n">from_delimited_files</span><span class="p">(</span><span class="n">path</span><span class="o">=</span><span class="n">csv_paths</span><span class="p">)</span>
<span class="n">tab_ds</span> <span class="o">=</span> <span class="n">tab_ds</span><span class="o">.</span><span class="n">register</span><span class="p">(</span><span class="n">workspace</span><span class="o">=</span><span class="n">ws</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s1">'csv_table'</span><span class="p">)</span>
</code></pre></div>
<p>File:</p>
<div class="highlight"><pre><span></span><code><span class="n">blob_ds</span> <span class="o">=</span> <span class="n">ws</span><span class="o">.</span><span class="n">get_default_datastore</span><span class="p">()</span>
<span class="n">file_ds</span> <span class="o">=</span> <span class="n">Dataset</span><span class="o">.</span><span class="n">File</span><span class="o">.</span><span class="n">from_files</span><span class="p">(</span><span class="n">path</span><span class="o">=</span><span class="p">(</span><span class="n">blob_ds</span><span class="p">,</span> <span class="s1">'data/files/images/*.jpg'</span><span class="p">))</span>
<span class="n">file_ds</span> <span class="o">=</span> <span class="n">file_ds</span><span class="o">.</span><span class="n">register</span><span class="p">(</span><span class="n">workspace</span><span class="o">=</span><span class="n">ws</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s1">'img_files'</span><span class="p">)</span>
</code></pre></div>
<p><strong>Retrieve</strong> a dataset</p>
<div class="highlight"><pre><span></span><code><span class="n">ws</span> <span class="o">=</span> <span class="n">Workspace</span><span class="o">.</span><span class="n">from_config</span><span class="p">()</span>
<span class="c1"># Get a dataset from workspace datasets collection</span>
<span class="n">ds1</span> <span class="o">=</span> <span class="n">ws</span><span class="o">.</span><span class="n">datasets</span><span class="p">[</span><span class="s1">'csv_table'</span><span class="p">]</span>
<span class="c1"># Get a dataset by name from the datasets class</span>
<span class="n">ds2</span> <span class="o">=</span> <span class="n">Dataset</span><span class="o">.</span><span class="n">get_by_name</span><span class="p">(</span><span class="n">ws</span><span class="p">,</span> <span class="s1">'img_files'</span><span class="p">)</span>
</code></pre></div>
<p>Datasets can be <strong>versioned</strong>. Create a new version by registering with the same name and the <code>create_new_version</code> property:</p>
<div class="highlight"><pre><span></span><code><span class="n">file_ds</span> <span class="o">=</span> <span class="n">file_ds</span><span class="o">.</span><span class="n">register</span><span class="p">(</span><span class="n">workspace</span><span class="o">=</span><span class="n">ws</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s1">'img_files'</span><span class="p">,</span> <span class="n">create_new_version</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre></div>
<p>Retrieve specific version:</p>
<div class="highlight"><pre><span></span><code><span class="n">img_ds</span> <span class="o">=</span> <span class="n">Dataset</span><span class="o">.</span><span class="n">get_by_name</span><span class="p">(</span><span class="n">workspace</span><span class="o">=</span><span class="n">ws</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s1">'img_files'</span><span class="p">,</span> <span class="n">version</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
</code></pre></div>
<h1>Compute Contexts</h1>
<p>The runtime context for each experiment consists of</p>
<ul>
<li><em>environment</em> for the script, which includes all packages</li>
<li><em>compute target</em> on which the environment will be deployed</li>
</ul>
<h2>Intro to Environments</h2>
<p>Python code runs in virtual environments (eg <code>Conda</code>, <code>pip</code>). Azure ML creates a Docker container and builds the environment inside it. You create environments by</p>
<ul>
<li><code>Conda</code> or <code>pip</code> yaml file and load it:</li>
</ul>
<div class="highlight"><pre><span></span><code><span class="n">env</span> <span class="o">=</span> <span class="n">Environment</span><span class="o">.</span><span class="n">from_conda_specification</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">'training_env'</span><span class="p">,</span> <span class="n">file_path</span><span class="o">=</span><span class="s1">'./conda.yml'</span><span class="p">)</span>
</code></pre></div>
<ul>
<li>from existing <code>Conda</code> environment:</li>
</ul>
<div class="highlight"><pre><span></span><code><span class="n">env</span> <span class="o">=</span> <span class="n">Environment</span><span class="o">.</span><span class="n">from_conda_environment</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">'training_env'</span><span class="p">,</span>
<span class="n">conda_environment_name</span><span class="o">=</span><span class="s1">'py_env'</span><span class="p">)</span>
</code></pre></div>
<ul>
<li>specifying packages:</li>
</ul>
<div class="highlight"><pre><span></span><code><span class="n">env</span> <span class="o">=</span> <span class="n">Environment</span><span class="p">(</span><span class="s1">'training_env'</span><span class="p">)</span>
<span class="n">deps</span> <span class="o">=</span> <span class="n">CondaDependencies</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">conda_packages</span><span class="o">=</span><span class="p">[</span><span class="s1">'pandas'</span><span class="p">,</span> <span class="s1">'numpy'</span><span class="p">],</span>
<span class="n">pip_packages</span><span class="o">=</span><span class="p">[</span><span class="s1">'azureml-defaults'</span><span class="p">])</span>
<span class="n">env</span><span class="o">.</span><span class="n">python</span><span class="o">.</span><span class="n">conda_dependencies</span> <span class="o">=</span> <span class="n">deps</span>
</code></pre></div>
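<p>For the first option, the <code>conda.yml</code> file is a regular Conda environment specification. A minimal sketch of what it could contain, written out from Python so that the <code>./conda.yml</code> path resolves. The package list is an assumption that mirrors the packages used elsewhere in these notes, and the helper name <code>write_conda_spec</code> is hypothetical.</p>

```python
from pathlib import Path

# Hypothetical specification: the package list is an assumption for illustration.
CONDA_SPEC = """\
name: training_env
dependencies:
  - python
  - scikit-learn
  - pandas
  - numpy
  - pip
  - pip:
      - azureml-defaults
"""


def write_conda_spec(path="conda.yml"):
    """Write the specification to disk and return the path that was written."""
    target = Path(path)
    target.write_text(CONDA_SPEC)
    return target
```

<p>Packages listed under the nested <code>pip</code> key are installed with pip, all the others with Conda.</p>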
<p>Once created, you can register the environment in the workspace.</p>
<div class="highlight"><pre><span></span><code><span class="n">env</span><span class="o">.</span><span class="n">register</span><span class="p">(</span><span class="n">workspace</span><span class="o">=</span><span class="n">ws</span><span class="p">)</span>
</code></pre></div>
<p>Retrieve and assign it to a <code>ScriptRunConfig</code> or an <code>Estimator</code></p>
<div class="highlight"><pre><span></span><code><span class="n">tr_env</span> <span class="o">=</span> <span class="n">Environment</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">workspace</span><span class="o">=</span><span class="n">ws</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s1">'training_env'</span><span class="p">)</span>
<span class="n">estimator</span> <span class="o">=</span> <span class="n">Estimator</span><span class="p">(</span>
<span class="n">source_directory</span><span class="o">=</span><span class="s1">'experiment_folder'</span><span class="p">,</span>
<span class="n">entry_script</span><span class="o">=</span><span class="s1">'training_script.py'</span><span class="p">,</span>
<span class="n">compute_target</span><span class="o">=</span><span class="s1">'local'</span><span class="p">,</span>
<span class="n">environment_definition</span><span class="o">=</span><span class="n">tr_env</span>
<span class="p">)</span>
</code></pre></div>
<h2>Compute targets</h2>
<p>Compute targets are physical or virtual computers on which experiments are run. Types of compute</p>
<ul>
<li><em>local compute</em>: your workstation or a virtual machine</li>
<li><em>compute clusters</em>: multi-node clusters of VMs that automatically scale up or down</li>
<li><em>inference clusters</em>: to deploy models, they use containers to initiate computing</li>
<li><em>attached compute</em>: attach a VM or Databricks cluster that you already use</li>
</ul>
<p>You can create a compute target via AML studio or via SDK. A <strong>managed</strong> compute target is one managed by AML. Via SDK</p>
<div class="highlight"><pre><span></span><code><span class="n">ws</span> <span class="o">=</span> <span class="n">Workspace</span><span class="o">.</span><span class="n">from_config</span><span class="p">()</span>
<span class="n">compute_name</span> <span class="o">=</span> <span class="s1">'aml-cluster'</span>
<span class="n">compute_config</span> <span class="o">=</span> <span class="n">AmlCompute</span><span class="o">.</span><span class="n">provisioning_configuration</span><span class="p">(</span>
<span class="n">vm_size</span><span class="o">=</span><span class="s1">'STANDARD_DS12_V2'</span><span class="p">,</span>
<span class="n">min_nodes</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">max_nodes</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
<span class="n">vm_priority</span><span class="o">=</span><span class="s1">'dedicated'</span>
<span class="p">)</span>
<span class="n">aml_cluster</span> <span class="o">=</span> <span class="n">ComputeTarget</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">ws</span><span class="p">,</span> <span class="n">compute_name</span><span class="p">,</span> <span class="n">compute_config</span><span class="p">)</span>
<span class="n">aml_cluster</span><span class="o">.</span><span class="n">wait_for_completion</span><span class="p">()</span>
</code></pre></div>
<p>An <strong>unmanaged</strong> compute target is defined and managed outside AML. You can attach it via SDK:</p>
<div class="highlight"><pre><span></span><code><span class="n">ws</span> <span class="o">=</span> <span class="n">Workspace</span><span class="o">.</span><span class="n">from_config</span><span class="p">()</span>
<span class="n">compute_name</span> <span class="o">=</span> <span class="s1">'db-cluster'</span>
<span class="n">db_workspace_name</span> <span class="o">=</span> <span class="s1">'db_workspace'</span>
<span class="n">db_resource_group</span> <span class="o">=</span> <span class="s1">'db_resource_group'</span>
<span class="n">db_access_token</span> <span class="o">=</span> <span class="s1">'aocsinaocnasoivn'</span>
<span class="n">db_config</span> <span class="o">=</span> <span class="n">DatabricksCompute</span><span class="o">.</span><span class="n">attach_configuration</span><span class="p">(</span>
<span class="n">resource_group</span><span class="o">=</span><span class="n">db_resource_group</span><span class="p">,</span>
<span class="n">workspace_name</span><span class="o">=</span><span class="n">db_workspace_name</span><span class="p">,</span>
<span class="n">access_token</span><span class="o">=</span><span class="n">db_access_token</span>
<span class="p">)</span>
<span class="n">db_cluster</span> <span class="o">=</span> <span class="n">ComputeTarget</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">ws</span><span class="p">,</span> <span class="n">compute_name</span><span class="p">,</span> <span class="n">db_config</span><span class="p">)</span>
<span class="n">db_cluster</span><span class="o">.</span><span class="n">wait_for_completion</span><span class="p">()</span>
</code></pre></div>
<p>You can check whether a compute target already exists before creating it:</p>
<div class="highlight"><pre><span></span><code><span class="n">compute_name</span> <span class="o">=</span> <span class="s1">'aml_cluster'</span>
<span class="k">try</span><span class="p">:</span>
    <span class="n">aml_cluster</span> <span class="o">=</span> <span class="n">ComputeTarget</span><span class="p">(</span><span class="n">workspace</span><span class="o">=</span><span class="n">ws</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="n">compute_name</span><span class="p">)</span>
<span class="k">except</span> <span class="n">ComputeTargetException</span><span class="p">:</span>
    <span class="c1"># create it</span>
    <span class="o">...</span>
</code></pre></div>
<p>You can use a compute target in an experiment run by specifying it as a parameter, either by name or as a <code>ComputeTarget</code> object:</p>
<div class="highlight"><pre><span></span><code><span class="n">compute_name</span> <span class="o">=</span> <span class="s1">'aml_cluster'</span>
<span class="n">training_env</span> <span class="o">=</span> <span class="n">Environment</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">workspace</span><span class="o">=</span><span class="n">ws</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s1">'training_env'</span><span class="p">)</span>
<span class="n">estimator</span> <span class="o">=</span> <span class="n">Estimator</span><span class="p">(</span>
<span class="n">source_directory</span><span class="o">=</span><span class="s1">'experiment_folder'</span><span class="p">,</span>
<span class="n">entry_script</span><span class="o">=</span><span class="s1">'training_script.py'</span><span class="p">,</span>
<span class="n">environment_definition</span><span class="o">=</span><span class="n">training_env</span><span class="p">,</span>
<span class="n">compute_target</span><span class="o">=</span><span class="n">compute_name</span>
<span class="p">)</span>
<span class="c1"># or specify a ComputeTarget object</span>
<span class="n">training_cluster</span> <span class="o">=</span> <span class="n">ComputeTarget</span><span class="p">(</span><span class="n">workspace</span><span class="o">=</span><span class="n">ws</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="n">compute_name</span><span class="p">)</span>
<span class="n">estimator</span> <span class="o">=</span> <span class="n">Estimator</span><span class="p">(</span>
<span class="n">source_directory</span><span class="o">=</span><span class="s1">'experiment_folder'</span><span class="p">,</span>
<span class="n">entry_script</span><span class="o">=</span><span class="s1">'training_script.py'</span><span class="p">,</span>
<span class="n">environment_definition</span><span class="o">=</span><span class="n">training_env</span><span class="p">,</span>
<span class="n">compute_target</span><span class="o">=</span><span class="n">training_cluster</span>
<span class="p">)</span>
</code></pre></div>
<h1>Orchestrating with Pipelines</h1>
<p>A <em>pipeline</em> is a workflow of ML tasks in which each task is implemented as a <em>step</em>. Steps can run sequentially or in parallel, and different steps can use different compute targets. Common types of step:</p>
<ul>
<li><em>PythonScriptStep</em></li>
<li><em>EstimatorStep</em>: runs an estimator</li>
<li><em>DataTransferStep</em>: uses ADF</li>
<li><em>DatabricksStep</em></li>
<li><em>AdlaStep</em>: runs a <code>U-SQL</code> job in Azure Data Lake Analytics</li>
</ul>
<p>Define steps:</p>
<div class="highlight"><pre><span></span><code><span class="n">step1</span> <span class="o">=</span> <span class="n">PythonScriptStep</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s1">'prepare data'</span><span class="p">,</span>
<span class="n">source_directory</span><span class="o">=</span><span class="s1">'scripts'</span><span class="p">,</span>
<span class="n">script_name</span><span class="o">=</span><span class="s1">'data_prep.py'</span><span class="p">,</span>
<span class="n">compute_target</span><span class="o">=</span><span class="s1">'aml-cluster'</span><span class="p">,</span>
<span class="n">runconfig</span><span class="o">=</span><span class="n">run_config</span>
<span class="p">)</span>
<span class="n">step2</span> <span class="o">=</span> <span class="n">EstimatorStep</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s1">'train model'</span><span class="p">,</span>
<span class="n">estimator</span><span class="o">=</span><span class="n">sk_estimator</span><span class="p">,</span>
<span class="n">compute_target</span><span class="o">=</span><span class="s1">'aml-cluster'</span>
<span class="p">)</span>
</code></pre></div>
<p>Assign steps to pipeline:</p>
<div class="highlight"><pre><span></span><code><span class="n">train_pipeline</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">(</span>
<span class="n">workspace</span><span class="o">=</span><span class="n">ws</span><span class="p">,</span>
<span class="n">steps</span><span class="o">=</span><span class="p">[</span><span class="n">step1</span><span class="p">,</span><span class="n">step2</span><span class="p">]</span>
<span class="p">)</span>
<span class="c1"># create experiment and run pipeline</span>
<span class="n">experiment</span> <span class="o">=</span> <span class="n">Experiment</span><span class="p">(</span><span class="n">workspace</span><span class="o">=</span><span class="n">ws</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s1">'training-pipeline'</span><span class="p">)</span>
<span class="n">pipeline_run</span> <span class="o">=</span> <span class="n">experiment</span><span class="o">.</span><span class="n">submit</span><span class="p">(</span><span class="n">train_pipeline</span><span class="p">)</span>
</code></pre></div>
<h2>Pass data between steps</h2>
<p>The <code>PipelineData</code> object is a special kind of <code>DataReference</code> that</p>
<ul>
<li>references a location in a datastore</li>
<li>creates a data dependency between pipeline steps</li>
</ul>
<p>To pass it</p>
<ul>
<li>define a <code>PipelineData</code> object that references a location in a data store</li>
<li>specify the object as input or output for the steps that use it</li>
<li>pass the <code>PipelineData</code> object as a script parameter in steps that run scripts</li>
</ul>
<p>Example</p>
<div class="highlight"><pre><span></span><code><span class="n">raw_ds</span> <span class="o">=</span> <span class="n">Dataset</span><span class="o">.</span><span class="n">get_by_name</span><span class="p">(</span><span class="n">ws</span><span class="p">,</span> <span class="s1">'raw_dataset'</span><span class="p">)</span>
<span class="c1"># Define object to pass data between steps</span>
<span class="n">data_store</span> <span class="o">=</span> <span class="n">ws</span><span class="o">.</span><span class="n">get_default_datastore</span><span class="p">()</span>
<span class="n">prepped_data</span> <span class="o">=</span> <span class="n">PipelineData</span><span class="p">(</span><span class="s1">'prepped'</span><span class="p">,</span> <span class="n">datastore</span><span class="o">=</span><span class="n">data_store</span><span class="p">)</span>
<span class="n">step1</span> <span class="o">=</span> <span class="n">PythonScriptStep</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s1">'prepare data'</span><span class="p">,</span>
<span class="n">source_directory</span><span class="o">=</span><span class="s1">'scripts'</span><span class="p">,</span>
<span class="n">script_name</span><span class="o">=</span><span class="s1">'data_prep.py'</span><span class="p">,</span>
<span class="n">compute_target</span><span class="o">=</span><span class="s1">'aml-cluster'</span><span class="p">,</span>
<span class="n">runconfig</span><span class="o">=</span><span class="n">run_config</span><span class="p">,</span>
<span class="c1"># specify dataset</span>
<span class="n">inputs</span> <span class="o">=</span> <span class="p">[</span><span class="n">raw_ds</span><span class="o">.</span><span class="n">as_named_input</span><span class="p">(</span><span class="s1">'raw_data'</span><span class="p">)],</span>
<span class="c1"># specify PipelineData as output</span>
<span class="n">outputs</span> <span class="o">=</span> <span class="p">[</span><span class="n">prepped_data</span><span class="p">],</span>
<span class="c1"># script reference</span>
<span class="n">arguments</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'--folder'</span><span class="p">,</span> <span class="n">prepped_data</span><span class="p">]</span>
<span class="p">)</span>
<span class="n">step2</span> <span class="o">=</span> <span class="n">EstimatorStep</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s1">'train model'</span><span class="p">,</span>
<span class="n">estimator</span><span class="o">=</span><span class="n">sk_estimator</span><span class="p">,</span>
<span class="n">compute_target</span><span class="o">=</span><span class="s1">'aml-cluster'</span><span class="p">,</span>
<span class="c1"># specify PipelineData</span>
<span class="n">inputs</span> <span class="o">=</span> <span class="p">[</span><span class="n">prepped_data</span><span class="p">],</span>
<span class="c1"># pass reference to estimator script</span>
<span class="n">estimator_entry_script_arguments</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'--folder'</span><span class="p">,</span> <span class="n">prepped_data</span><span class="p">]</span>
<span class="p">)</span>
</code></pre></div>
<p>Inside the script, you can get a reference to the <code>PipelineData</code> object from the script argument and use it like a local folder.</p>
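<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">argparse</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="o">.</span><span class="n">ArgumentParser</span><span class="p">()</span>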
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s1">'--folder'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s1">'folder'</span><span class="p">)</span>
<span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse_args</span><span class="p">()</span>
<span class="n">output_folder</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">folder</span>
<span class="c1"># ...</span>
<span class="c1"># save data to PipelineData location</span>
<span class="n">os</span><span class="o">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">output_folder</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">output_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">output_folder</span><span class="p">,</span> <span class="s1">'prepped_data.csv'</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">output_path</span><span class="p">)</span>
</code></pre></div>
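<p>To try the same pattern outside Azure ML, here is a minimal standalone sketch. It assumes a temporary directory stands in for the <code>PipelineData</code> location and uses the standard <code>csv</code> module instead of a DataFrame. The helper names <code>parse_args</code> and <code>save_prepped_data</code> and the sample rows are hypothetical.</p>

```python
import argparse
import csv
import os
import tempfile


def parse_args(argv=None):
    """Parse the --folder argument that the pipeline step passes to the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--folder', type=str, dest='folder', required=True)
    return parser.parse_args(argv)


def save_prepped_data(rows, output_folder):
    """Write rows to prepped_data.csv in the output folder and return the path."""
    os.makedirs(output_folder, exist_ok=True)
    output_path = os.path.join(output_folder, 'prepped_data.csv')
    with open(output_path, 'w', newline='') as f:
        csv.writer(f).writerows(rows)
    return output_path


# Simulate the argument a step would pass; the temporary directory is only
# a local stand-in for the PipelineData location.
demo_folder = os.path.join(tempfile.mkdtemp(), 'prepped')
args = parse_args(['--folder', demo_folder])
rows = [['feature', 'label'], [0.1, 0], [0.9, 1]]  # hypothetical sample data
output_path = save_prepped_data(rows, args.folder)
```

<p>Because the script only sees a folder path, the same code can be exercised locally before it is wired into a pipeline step.</p>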
<h2>Reuse steps</h2>
<p>By default, the step output from a previous pipeline run is reused without rerunning the step (if script, source directory and other params have not changed). You can control this:</p>
<div class="highlight"><pre><span></span><code><span class="n">step1</span> <span class="o">=</span> <span class="n">PythonScriptStep</span><span class="p">(</span>
<span class="c1">#...</span>
<span class="n">allow_reuse</span><span class="o">=</span><span class="kc">False</span>
<span class="p">)</span>
</code></pre></div>
<p>You can force the steps to run regardless of individual configuration:</p>
<div class="highlight"><pre><span></span><code><span class="n">pipeline_run</span> <span class="o">=</span> <span class="n">experiment</span><span class="o">.</span><span class="n">submit</span><span class="p">(</span><span class="n">train_pipeline</span><span class="p">,</span> <span class="n">regenerate_outputs</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre></div>
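<p>The reuse rule can be pictured as a cache keyed on what defines a step. The sketch below is a toy illustration of that idea, not how Azure ML implements it: <code>StepCache</code>, <code>step_fingerprint</code> and <code>double</code> are hypothetical names, and only the flag names <code>allow_reuse</code> and <code>regenerate_outputs</code> come from the SDK calls above.</p>

```python
import hashlib
import json


def step_fingerprint(script_text, params):
    """Hash the script contents and parameters that define a step."""
    payload = json.dumps({'script': script_text, 'params': params}, sort_keys=True)
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()


class StepCache:
    """Toy cache: reuse a step's output when its fingerprint is unchanged."""

    def __init__(self):
        self._outputs = {}
        self.executions = 0

    def run_step(self, script_text, params, compute,
                 allow_reuse=True, regenerate_outputs=False):
        key = step_fingerprint(script_text, params)
        if allow_reuse and not regenerate_outputs and key in self._outputs:
            return self._outputs[key]  # reuse the previous output
        self.executions += 1
        self._outputs[key] = compute(params)
        return self._outputs[key]


def double(params):
    """Hypothetical step logic."""
    return params['x'] * 2


cache = StepCache()
first = cache.run_step('data_prep v1', {'x': 2}, double)   # executes
second = cache.run_step('data_prep v1', {'x': 2}, double)  # reused
third = cache.run_step('data_prep v2', {'x': 2}, double)   # script changed: executes
fourth = cache.run_step('data_prep v2', {'x': 2}, double,
                        regenerate_outputs=True)           # forced: executes
```

<p>In this toy version the step body runs three times out of four calls: only the second call, with an unchanged script and parameters, is served from the cache.</p>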
<h2>Publish pipelines</h2>
<p>You can publish a pipeline to create a REST endpoint through which the pipeline can be run on demand.</p>
<div class="highlight"><pre><span></span><code><span class="n">published_pipeline</span> <span class="o">=</span> <span class="n">pipeline</span><span class="o">.</span><span class="n">publish</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s1">'training_pipeline'</span><span class="p">,</span>
<span class="n">description</span><span class="o">=</span><span class="s1">'Model training pipeline'</span><span class="p">,</span>
<span class="n">version</span><span class="o">=</span><span class="s1">'1.0'</span>
<span class="p">)</span>
</code></pre></div>
<p>You can view it in ML Studio and get the endpoint:</p>
<div class="highlight"><pre><span></span><code><span class="n">published_pipeline</span><span class="o">.</span><span class="n">endpoint</span>
</code></pre></div>
<p>You start a published pipeline by making an HTTP request to its endpoint. You pass an authorisation header (with a token) and a JSON payload specifying the experiment name. The pipeline runs asynchronously, so the response contains the run ID rather than the results.</p>
<h2>Pipeline parameters</h2>
<p>Create a <code>PipelineParameter</code> object for each parameter. Example:</p>
<div class="highlight"><pre><span></span><code><span class="n">reg_param</span> <span class="o">=</span> <span class="n">PipelineParameter</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">'reg_rate'</span><span class="p">,</span> <span class="n">default_value</span><span class="o">=</span><span class="mf">0.01</span><span class="p">)</span>
<span class="c1"># ...</span>
<span class="n">step2</span> <span class="o">=</span> <span class="n">EstimatorStep</span><span class="p">(</span>
<span class="c1"># ...</span>
<span class="n">estimator_entry_script_arguments</span><span class="o">=</span><span class="p">[</span>
<span class="s1">'--folder'</span><span class="p">,</span> <span class="n">prepped_data</span><span class="p">,</span>
<span class="s1">'--reg'</span><span class="p">,</span> <span class="n">reg_param</span>
<span class="p">]</span>
<span class="p">)</span>
</code></pre></div>
<p>After you publish a parametrised pipeline, you can pass parameter values in the JSON payload of the REST request. For example:</p>
<div class="highlight"><pre><span></span><code><span class="n">requests</span><span class="o">.</span><span class="n">post</span><span class="p">(</span>
<span class="n">endpoint</span><span class="p">,</span>
<span class="n">headers</span><span class="o">=</span><span class="n">auth_header</span><span class="p">,</span>
<span class="n">json</span><span class="o">=</span><span class="p">{</span>
<span class="s1">'ExperimentName'</span><span class="p">:</span> <span class="s1">'run_training_pipeline'</span><span class="p">,</span>
<span class="s1">'ParameterAssignments'</span><span class="p">:</span> <span class="p">{</span>
<span class="s1">'reg_rate'</span><span class="p">:</span> <span class="mf">0.1</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">)</span>
</code></pre></div>
<h2>Schedule pipelines</h2>
<p>Define a <code>ScheduleRecurrence</code> and use it to create a <code>Schedule</code>.</p>
<div class="highlight"><pre><span></span><code><span class="n">daily</span> <span class="o">=</span> <span class="n">ScheduleRecurrence</span><span class="p">(</span>
<span class="n">frequency</span><span class="o">=</span><span class="s1">'Day'</span><span class="p">,</span>
<span class="n">interval</span><span class="o">=</span><span class="mi">1</span>
<span class="p">)</span>
<span class="n">pipeline_schedule</span> <span class="o">=</span> <span class="n">Schedule</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
<span class="n">ws</span><span class="p">,</span>
<span class="n">name</span><span class="o">=</span><span class="s1">'Daily Training'</span><span class="p">,</span>
<span class="n">description</span><span class="o">=</span><span class="s1">'train model every day'</span><span class="p">,</span>
<span class="n">pipeline_id</span><span class="o">=</span><span class="n">published_pipeline</span><span class="o">.</span><span class="n">id</span><span class="p">,</span>
<span class="n">experiment_name</span><span class="o">=</span><span class="s1">'Training_Pipeline'</span><span class="p">,</span>
<span class="n">recurrence</span><span class="o">=</span><span class="n">daily</span>
<span class="p">)</span>
</code></pre></div>
<p>To schedule a pipeline to run whenever <strong>data changes</strong>, you must create a <code>Schedule</code> that monitors a specific path on a datastore:</p>
<div class="highlight"><pre><span></span><code><span class="n">training_datastore</span> <span class="o">=</span> <span class="n">Datastore</span><span class="p">(</span><span class="n">workspace</span><span class="o">=</span><span class="n">ws</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s1">'blob_data'</span><span class="p">)</span>
<span class="n">pipeline_schedule</span> <span class="o">=</span> <span class="n">Schedule</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
<span class="c1"># ...</span>
<span class="n">datastore</span><span class="o">=</span><span class="n">training_datastore</span><span class="p">,</span>
<span class="n">path_on_datastore</span><span class="o">=</span><span class="s1">'data/training'</span>
<span class="p">)</span>
</code></pre></div>
<h1>Deploy ML Models</h1>
<p>You can deploy a model as a <strong>container</strong> to several compute targets:</p>
<ul>
<li>Azure ML compute instance</li>
<li>Azure container instance</li>
<li>Azure function</li>
<li>Azure Kubernetes service</li>
<li>IoT module</li>
</ul>
<p>Steps</p>
<ol>
<li>register the model</li>
<li>inference configuration</li>
<li>deployment configuration</li>
<li>deploy model</li>
</ol>
<h2><a name="registermodel"></a>Register the model</h2>
<p>After training, you must register the model to Azure ML workspace.</p>
<div class="highlight"><pre><span></span><code><span class="n">classification_model</span> <span class="o">=</span> <span class="n">Model</span><span class="o">.</span><span class="n">register</span><span class="p">(</span>
<span class="n">workspace</span><span class="o">=</span><span class="n">ws</span><span class="p">,</span>
<span class="n">model_name</span><span class="o">=</span><span class="s1">'classification_model'</span><span class="p">,</span>
<span class="n">model_path</span><span class="o">=</span><span class="s1">'model.pkl'</span><span class="p">,</span>
<span class="n">description</span><span class="o">=</span><span class="s1">'A classification model'</span>
<span class="p">)</span>
</code></pre></div>
<p>Or you can use the reference to the run:</p>
<div class="highlight"><pre><span></span><code><span class="n">run</span><span class="o">.</span><span class="n">register_model</span><span class="p">(</span>
<span class="n">model_name</span><span class="o">=</span><span class="s1">'classification_model'</span><span class="p">,</span>
<span class="n">model_path</span><span class="o">=</span><span class="s1">'outputs/model.pkl'</span><span class="p">,</span>
<span class="n">description</span><span class="o">=</span><span class="s1">'A classification model'</span>
<span class="p">)</span>
</code></pre></div>
<h2><a name="scoringscript"></a>Inference configuration</h2>
<p>The model will be deployed as a service consisting of</p>
<ul>
<li>a script to load the model and return predictions for submitted data</li>
<li>an environment in which the script will be run</li>
</ul>
<p>Create the <em>entry script</em> (or <em>scoring script</em>) as a Python file including 2 functions</p>
<ul>
<li><code>init()</code> called when service is initialised (load model from registry)</li>
<li><code>run(raw_data)</code> called when new data is submitted to the service (generate predictions)</li>
</ul>
<p>Example</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">joblib</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">azureml.core.model</span> <span class="kn">import</span> <span class="n">Model</span>

<span class="k">def</span> <span class="nf">init</span><span class="p">():</span>
    <span class="c1"># load the registered model once, when the service starts</span>
    <span class="k">global</span> <span class="n">model</span>
    <span class="n">model_path</span> <span class="o">=</span> <span class="n">Model</span><span class="o">.</span><span class="n">get_model_path</span><span class="p">(</span><span class="s1">'classification_model'</span><span class="p">)</span>
    <span class="n">model</span> <span class="o">=</span> <span class="n">joblib</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">model_path</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="n">raw_data</span><span class="p">):</span>
    <span class="n">data</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">raw_data</span><span class="p">)[</span><span class="s1">'data'</span><span class="p">])</span>
    <span class="n">predictions</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
    <span class="c1"># return predictions in any JSON-serializable format</span>
    <span class="k">return</span> <span class="n">predictions</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span>
</code></pre></div>
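<p>The <code>init()</code>/<code>run(raw_data)</code> contract can be exercised locally before deployment. The sketch below is a standalone illustration under stated assumptions: <code>ThresholdModel</code> is a hypothetical stand-in for the registered model, and plain lists replace NumPy arrays so that it runs with the standard library only.</p>

```python
import json


class ThresholdModel:
    """Hypothetical stand-in for a trained model loaded from the registry."""

    def predict(self, rows):
        # Label a row 1 when the sum of its features exceeds 5, else 0.
        return [1 if sum(row) > 5 else 0 for row in rows]


model = None


def init():
    """Called once when the service starts: load the model into a global."""
    global model
    model = ThresholdModel()  # a real script would load the registered model file


def run(raw_data):
    """Called for each request: parse the JSON payload and return predictions."""
    data = json.loads(raw_data)['data']
    return model.predict(data)  # a plain list is JSON serializable


# Local check that mimics what the service does with one request.
init()
payload = json.dumps({'data': [[0.1, 0.2, 3.4], [0.9, 8.2, 2.5]]})
result = run(payload)
```

<p>Keeping the model in a module-level global means it is loaded once in <code>init()</code> rather than on every call to <code>run</code>.</p>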
<p>You can configure the environment using Conda. The <code>CondaDependencies</code> class creates a default environment (including <code>azureml-defaults</code> and other commonly used packages), to which you add any other required packages. You then serialize the environment to a string and save it to a file.</p>
<div class="highlight"><pre><span></span><code><span class="n">myenv</span> <span class="o">=</span> <span class="n">CondaDependencies</span><span class="p">()</span>
<span class="n">myenv</span><span class="o">.</span><span class="n">add_conda_package</span><span class="p">(</span><span class="s1">'scikit-learn'</span><span class="p">)</span>
<span class="n">env_file</span> <span class="o">=</span> <span class="s1">'service_files/env.yml'</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">env_file</span><span class="p">,</span> <span class="s1">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">myenv</span><span class="o">.</span><span class="n">serialize_to_string</span><span class="p">())</span>
</code></pre></div>
<p>After creating the script and the environment, you combine them in an <code>InferenceConfig</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">classifier_inference_config</span> <span class="o">=</span> <span class="n">InferenceConfig</span><span class="p">(</span>
<span class="n">runtime</span><span class="o">=</span><span class="s1">'python'</span><span class="p">,</span>
<span class="n">source_directory</span><span class="o">=</span><span class="s1">'service_files'</span><span class="p">,</span>
<span class="n">entry_script</span><span class="o">=</span><span class="s1">'score.py'</span><span class="p">,</span>
<span class="n">conda_file</span><span class="o">=</span><span class="s1">'env.yml'</span>
<span class="p">)</span>
</code></pre></div>
<h2>Deployment configuration</h2>
<p>Now that you have the entry script and the environment, you configure the compute to which the service will be deployed. If you deploy to an AKS cluster, you must create the cluster first:</p>
<div class="highlight"><pre><span></span><code><span class="n">cluster_name</span> <span class="o">=</span> <span class="s1">'aks-cluster'</span>
<span class="n">compute_config</span> <span class="o">=</span> <span class="n">AksCompute</span><span class="o">.</span><span class="n">provisioning_configuration</span><span class="p">(</span><span class="n">location</span><span class="o">=</span><span class="s1">'eastus'</span><span class="p">)</span>
<span class="n">production_cluster</span> <span class="o">=</span> <span class="n">ComputeTarget</span><span class="o">.</span><span class="n">create</span><span class="p">(</span><span class="n">ws</span><span class="p">,</span> <span class="n">cluster_name</span><span class="p">,</span> <span class="n">compute_config</span><span class="p">)</span>
<span class="n">production_cluster</span><span class="o">.</span><span class="n">wait_for_completion</span><span class="p">()</span>
</code></pre></div>
<p>You then define the deployment configuration:</p>
<div class="highlight"><pre><span></span><code><span class="n">classifier_deploy_config</span> <span class="o">=</span> <span class="n">AksWebservice</span><span class="o">.</span><span class="n">deploy_configuration</span><span class="p">(</span>
<span class="n">cpu_cores</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">memory_gb</span><span class="o">=</span><span class="mi">1</span>
<span class="p">)</span>
</code></pre></div>
<h2>Deploy the model</h2>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">ws</span><span class="o">.</span><span class="n">models</span><span class="p">[</span><span class="s1">'classification_model'</span><span class="p">]</span>
<span class="n">service</span> <span class="o">=</span> <span class="n">Model</span><span class="o">.</span><span class="n">deploy</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s1">'classification-service'</span><span class="p">,</span>
<span class="n">models</span><span class="o">=</span><span class="p">[</span><span class="n">model</span><span class="p">],</span>
<span class="n">inference_config</span><span class="o">=</span><span class="n">classifier_inference_config</span><span class="p">,</span>
<span class="n">deploy_config</span><span class="o">=</span><span class="n">classifier_deploy_config</span><span class="p">,</span>
<span class="n">deployment_target</span><span class="o">=</span><span class="n">production_cluster</span>
<span class="p">)</span>
<span class="n">service</span><span class="o">.</span><span class="n">wait_for_deployment</span><span class="p">()</span>
</code></pre></div>
<h2>Consuming a real-time inferencing service</h2>
<p>For <strong>testing</strong>, you can use the AML SDK to call a web service through the <code>run</code> method of a <code>Webservice</code> object. Typically, you send data to the <code>run</code> method as JSON like this:</p>
<div class="highlight"><pre><span></span><code><span class="p">{</span>
<span class="s1">'data'</span><span class="p">:[</span>
<span class="p">[</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">,</span> <span class="mf">3.4</span><span class="p">],</span>
<span class="p">[</span><span class="mf">0.9</span><span class="p">,</span> <span class="mf">8.2</span><span class="p">,</span> <span class="mf">2.5</span><span class="p">],</span>
<span class="o">...</span>
<span class="p">]</span>
<span class="p">}</span>
</code></pre></div>
<p>The response is JSON with a prediction for each case:</p>
<div class="highlight"><pre><span></span><code><span class="n">response</span> <span class="o">=</span> <span class="n">service</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">input_data</span><span class="o">=</span><span class="n">json_data</span><span class="p">)</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
</code></pre></div>
<p>In <strong>production</strong>, you use a REST endpoint. You can find the endpoint of a deployed service in Azure ML studio, or by retrieving the <code>scoring_uri</code> property of a <code>Webservice</code> object:</p>
<div class="highlight"><pre><span></span><code><span class="n">endpoint</span> <span class="o">=</span> <span class="n">service</span><span class="o">.</span><span class="n">scoring_uri</span>
</code></pre></div>
<p>There are 2 kinds of <strong>authentication</strong>:</p>
<ul>
<li>key: requests are authenticated by specifying the key associated with the service</li>
<li>token: requests are authenticated by providing a JSON Web Token (JWT)</li>
</ul>
<p>By default, authentication is disabled for Azure Container Instance services and set to key-based authentication for AKS services.</p>
<p>To make an authenticated call to the REST endpoint, you include the key or the token in the request header.</p>
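<p>As a rough sketch of such a call, the snippet below builds (but does not send) an authenticated scoring request with the standard library. It assumes the key or token is passed as a bearer credential in the <code>Authorization</code> header. The URL and the key are placeholders, and <code>build_scoring_request</code> is a hypothetical helper.</p>

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, credential, rows):
    """Build (without sending) an authenticated POST request to a scoring endpoint."""
    body = json.dumps({'data': rows}).encode('utf-8')
    headers = {
        'Content-Type': 'application/json',
        # Assumption: the key or token is sent as a bearer credential.
        'Authorization': 'Bearer ' + credential,
    }
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method='POST')


scoring_request = build_scoring_request(
    'https://example.invalid/score',  # placeholder for service.scoring_uri
    'placeholder-key',                # placeholder for the service key or token
    [[0.1, 0.2, 3.4], [0.9, 8.2, 2.5]],
)
# Sending it would look like: urllib.request.urlopen(scoring_request)
```

<p>Building the request separately from sending it makes it easy to inspect the headers and payload when a call is rejected.</p>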
<h2>Troubleshooting service deployment</h2>
<p>You can</p>
<ul>
<li>check the service state (should be <em>healthy</em>): <code>service.state</code></li>
<li>review service logs: <code>service.get_logs()</code></li>
<li>deploy to local container</li>
</ul>
<h1>Batch inference pipelines</h1>
<p>A batch inference pipeline reads input data, loads a registered model, predicts labels, and writes the results.</p>
<ol>
<li><a href="#registermodel">Register</a> a model</li>
<li>Create a <a href="#scoringscript">scoring script</a>. The <code>run(mini_batch)</code> method makes the inference on each batch.</li>
<li>Create a pipeline with ParallelRunStep</li>
<li>Run the pipeline and retrieve the step output</li>
</ol>
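<p>The batch scoring script from step 2 can be sketched as follows. This is a local illustration under stated assumptions: each item of <code>mini_batch</code> is taken to be a file path (as for a file-based dataset), <code>LengthModel</code> is a hypothetical stand-in for the registered model, and temporary files stand in for the input data.</p>

```python
import os
import tempfile


class LengthModel:
    """Hypothetical stand-in for a registered model."""

    def predict(self, text):
        # Toy rule: label a file 1 when it has more than 10 characters.
        return 1 if len(text) > 10 else 0


model = None


def init():
    """Called once per worker: load the model into a global."""
    global model
    model = LengthModel()  # a real script would load the registered model


def run(mini_batch):
    """Called for each mini-batch; returns one result line per input file."""
    results = []
    for file_path in mini_batch:
        with open(file_path) as f:
            prediction = model.predict(f.read())
        results.append('{}: {}'.format(os.path.basename(file_path), prediction))
    return results


# Local check with two temporary files standing in for a mini-batch.
folder = tempfile.mkdtemp()
paths = []
for name, content in [('a.txt', 'short'), ('b.txt', 'a much longer file')]:
    path = os.path.join(folder, name)
    with open(path, 'w') as f:
        f.write(content)
    paths.append(path)

init()
batch_results = run(paths)
```

<p>Returning one entry per input item keeps the results easy to collate when several workers process mini-batches in parallel.</p>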
<p>Azure ML provides a pipeline step that performs parallel batch inference. Using the <code>ParallelRunStep</code> class, you can read batches of files from a <code>File</code> dataset and write the output to a <code>PipelineData</code> reference. You can set the <code>output_action</code> to <em>"append_row"</em>, which ensures that all instances of the step collate their results into a single output file named <code>parallel_run_step.txt</code>.</p>
<div class="highlight"><pre><span></span><code><span class="n">batch_data_set</span> <span class="o">=</span> <span class="n">ws</span><span class="o">.</span><span class="n">datasets</span><span class="p">[</span><span class="s1">'batch-data'</span><span class="p">]</span>
<span class="c1"># output location</span>
<span class="n">default_ds</span> <span class="o">=</span> <span class="n">ws</span><span class="o">.</span><span class="n">get_default_datastore</span><span class="p">()</span>
<span class="n">output_dir</span> <span class="o">=</span> <span class="n">PipelineData</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s1">'inferences'</span><span class="p">,</span>
<span class="n">datastore</span><span class="o">=</span><span class="n">default_ds</span><span class="p">,</span>
<span class="n">output_path_on_compute</span><span class="o">=</span><span class="s1">'results'</span>
<span class="p">)</span>
<span class="n">parallel_run_config</span> <span class="o">=</span> <span class="n">ParallelRunConfig</span><span class="p">(</span>
<span class="n">source_directory</span><span class="o">=</span><span class="s1">'batch_scripts'</span><span class="p">,</span>
<span class="n">entry_script</span><span class="o">=</span><span class="s1">'batch_scoring_script.py'</span><span class="p">,</span>
<span class="n">mini_batch_size</span><span class="o">=</span><span class="s2">"5"</span><span class="p">,</span>
<span class="n">error_threshold</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
<span class="n">output_action</span><span class="o">=</span><span class="s2">"append_row"</span><span class="p">,</span>
<span class="n">environment</span><span class="o">=</span><span class="n">batch_env</span><span class="p">,</span>
<span class="n">compute_target</span><span class="o">=</span><span class="n">aml_cluster</span><span class="p">,</span>
<span class="n">node_count</span><span class="o">=</span><span class="mi">4</span>
<span class="p">)</span>
<span class="n">parallelrun_step</span> <span class="o">=</span> <span class="n">ParallelRunStep</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s2">"batch-score"</span><span class="p">,</span>
<span class="n">parallel_run_config</span><span class="o">=</span><span class="n">parallel_run_config</span><span class="p">,</span>
<span class="n">inputs</span><span class="o">=</span><span class="p">[</span><span class="n">batch_data_set</span><span class="o">.</span><span class="n">as_named_input</span><span class="p">(</span><span class="s1">'batch_data'</span><span class="p">)],</span>
<span class="n">output</span><span class="o">=</span><span class="n">output_dir</span><span class="p">,</span>
<span class="n">arguments</span><span class="o">=</span><span class="p">[],</span>
<span class="n">allow_reuse</span><span class="o">=</span><span class="kc">True</span>
<span class="p">)</span>
<span class="n">pipeline</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">(</span>
<span class="n">workspace</span><span class="o">=</span><span class="n">ws</span><span class="p">,</span>
<span class="n">steps</span><span class="o">=</span><span class="p">[</span><span class="n">parallelrun_step</span><span class="p">]</span>
<span class="p">)</span>
</code></pre></div>
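<p>To make the interplay between mini-batches and <em>"append_row"</em> more concrete, here is a minimal local simulation that does not use Azure ML at all. The <code>init()</code> and <code>run(mini_batch)</code> functions mirror the shape of a scoring script; the stand-in "model" and the helper <code>simulate_append_row</code> are hypothetical names introduced only for this sketch.</p>

```python
import os
import tempfile


def init():
    """Called once per worker; a real script would load the registered model here."""
    global model
    # Hypothetical stand-in model: "predicts" the number of characters in a file.
    model = len


def run(mini_batch):
    """Called once per mini-batch; for a File dataset, mini_batch is a list of file paths."""
    results = []
    for file_path in mini_batch:
        with open(file_path) as f:
            prediction = model(f.read())
        # One result row per input file.
        results.append(f"{os.path.basename(file_path)}: {prediction}")
    return results


def simulate_append_row(file_paths, mini_batch_size):
    """Split the files into mini-batches and collate every returned row into one list."""
    init()
    rows = []
    for start in range(0, len(file_paths), mini_batch_size):
        rows.extend(run(file_paths[start:start + mini_batch_size]))
    return rows
```

<p>With seven input files and a mini-batch size of 5, <code>run</code> is called twice and the collated output contains seven rows, which is roughly what ends up in the single output file.</p>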
<p>Run the pipeline and retrieve output.</p>
<div class="highlight"><pre><span></span><code><span class="n">pipeline_run</span> <span class="o">=</span> <span class="n">Experiment</span><span class="p">(</span><span class="n">ws</span><span class="p">,</span> <span class="s1">'batch_prediction_pipeline'</span><span class="p">)</span><span class="o">.</span><span class="n">submit</span><span class="p">(</span><span class="n">pipeline</span><span class="p">)</span>
<span class="n">pipeline_run</span><span class="o">.</span><span class="n">wait_for_completion</span><span class="p">()</span>
<span class="n">prediction_run</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="n">pipeline_run</span><span class="o">.</span><span class="n">get_children</span><span class="p">())</span>
<span class="n">prediction_output</span> <span class="o">=</span> <span class="n">prediction_run</span><span class="o">.</span><span class="n">get_output_data</span><span class="p">(</span><span class="s1">'inferences'</span><span class="p">)</span>
<span class="n">prediction_output</span><span class="o">.</span><span class="n">download</span><span class="p">(</span><span class="n">local_path</span><span class="o">=</span><span class="s1">'results'</span><span class="p">)</span>
</code></pre></div>
<h2>Publishing a batch inference pipeline</h2>
<p>You can publish a batch inference pipeline as a <strong>REST</strong> service.</p>
<div class="highlight"><pre><span></span><code><span class="n">published_pipeline</span> <span class="o">=</span> <span class="n">pipeline_run</span><span class="o">.</span><span class="n">publish_pipeline</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s1">'Batch_Prediction_Pipeline'</span><span class="p">,</span>
<span class="n">description</span><span class="o">=</span><span class="s1">'Batch Pipeline'</span><span class="p">,</span>
<span class="n">version</span><span class="o">=</span><span class="s1">'1.0'</span>
<span class="p">)</span>
<span class="n">rest_endpoint</span> <span class="o">=</span> <span class="n">published_pipeline</span><span class="o">.</span><span class="n">endpoint</span>
</code></pre></div>
<p>Once published, you can use the endpoint to initiate a batch inferencing job.</p>
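<p>As a rough sketch of what initiating a job could look like, the snippet below only <em>builds</em> the HTTP request and does not send it. It assumes the endpoint accepts a bearer token in the <code>Authorization</code> header and a JSON body with an <code>ExperimentName</code> field; the URL, the token and the helper name <code>build_pipeline_request</code> are placeholders, not values from a real workspace.</p>

```python
import json
import urllib.request


def build_pipeline_request(rest_endpoint, access_token, experiment_name):
    """Prepare (but do not send) the POST request that starts a published pipeline."""
    body = json.dumps({"ExperimentName": experiment_name}).encode("utf-8")
    return urllib.request.Request(
        rest_endpoint,
        data=body,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder endpoint and token, for illustration only.
request = build_pipeline_request(
    "https://example.invalid/pipelines/placeholder-id",
    "placeholder-token",
    "Batch_Prediction",
)
# Sending it would look like: urllib.request.urlopen(request)
```

<p>In practice you would pass <code>published_pipeline.endpoint</code> as the first argument and obtain the token from your Azure authentication flow.</p>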
<p>You can also <strong>schedule</strong> the published pipeline to have it run automatically.</p>
<div class="highlight"><pre><span></span><code><span class="n">weekly</span> <span class="o">=</span> <span class="n">ScheduleRecurrence</span><span class="p">(</span><span class="n">frequency</span><span class="o">=</span><span class="s1">'Week'</span><span class="p">,</span> <span class="n">interval</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">pipeline_schedule</span> <span class="o">=</span> <span class="n">Schedule</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
<span class="n">ws</span><span class="p">,</span>
<span class="n">name</span><span class="o">=</span><span class="s1">'Weekly Predictions'</span><span class="p">,</span>
<span class="n">description</span><span class="o">=</span><span class="s1">'batch inferencing'</span><span class="p">,</span>
<span class="n">pipeline_id</span><span class="o">=</span><span class="n">published_pipeline</span><span class="o">.</span><span class="n">id</span><span class="p">,</span>
<span class="n">experiment_name</span><span class="o">=</span><span class="s1">'Batch_Prediction'</span><span class="p">,</span>
<span class="n">recurrence</span><span class="o">=</span><span class="n">weekly</span>
<span class="p">)</span>
</code></pre></div>
<h1>Tuning hyperparameters</h1>
<p>Hyperparameter tuning is accomplished by training multiple models, using the same algorithm and training data but different hyperparameter values. The performance metric (eg accuracy) is then evaluated for each model, and the best-performing model is selected.</p>
<p>In Azure ML, you create an experiment that consists of a <em>hyperdrive</em> run, which initiates a child run for each hyperparameter combination. Each child run uses a training script with parametrised hyperparameter values to train a model, and logs the target performance metric achieved by the trained model.</p>
<h2>Define a search space</h2>
<p>The search space depends on the type of hyperparameter:</p>
<ul>
<li><strong>discrete</strong>.
<ul>
<li>make a <code>choice</code> out of an explicit Python <code>list</code>: <code>choice([10, 20, 30])</code></li>
<li>make a <code>choice</code> out of a <code>range</code>: <code>choice(range(1,10))</code></li>
<li>select values from a discrete distribution: <em>qnormal, quniform, qlognormal, qloguniform</em></li>
</ul>
</li>
<li><strong>continuous</strong>. Use any of these distributions: <em>normal, uniform, lognormal, loguniform</em></li>
</ul>
<p>Define a search space by creating a dictionary with parameter expressions for each hyperparameter.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">azureml.train.hyperdrive</span> <span class="kn">import</span> <span class="n">choice</span><span class="p">,</span> <span class="n">normal</span>
<span class="n">param_space</span> <span class="o">=</span> <span class="p">{</span>
<span class="s1">'--batch_size'</span><span class="p">:</span> <span class="n">choice</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
<span class="s1">'--learning_rate'</span><span class="p">:</span> <span class="n">normal</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div>
<h2>Configuring sampling</h2>
<p>The values used in a tuning run depend on the type of <em>sampling</em> used.</p>
<p><strong>Grid sampling.</strong> Tries every possible combination of values; it can only be used when all hyperparameters are discrete.</p>
<div class="highlight"><pre><span></span><code><span class="n">param_space</span> <span class="o">=</span> <span class="p">{</span>
<span class="s1">'--batch_size'</span><span class="p">:</span> <span class="n">choice</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
<span class="s1">'--learning_rate'</span><span class="p">:</span> <span class="n">choice</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">param_sampling</span> <span class="o">=</span> <span class="n">GridParameterSampling</span><span class="p">(</span><span class="n">param_space</span><span class="p">)</span>
</code></pre></div>
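<p>To see what "every possible combination" means for this search space, the following plain-Python sketch enumerates the grid with <code>itertools.product</code>. It is only an illustration of the idea, not how Azure ML implements grid sampling internally.</p>

```python
from itertools import product

# Same discrete search space as above, written as plain Python lists.
param_space = {
    "--batch_size": [16, 32, 64],
    "--learning_rate": [10, 20],
}

names = list(param_space)
# Cartesian product of all value lists: one dict per child run.
grid = [dict(zip(names, values)) for values in product(*param_space.values())]

for combination in grid:
    print(combination)
```

<p>Three batch sizes times two learning rates give six combinations, so a grid-sampled run over this space would need six child runs to cover it.</p>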
<p><strong>Random sampling.</strong> Randomly select a value for each hyperparameter.</p>
<div class="highlight"><pre><span></span><code><span class="n">param_space</span> <span class="o">=</span> <span class="p">{</span>
<span class="s1">'--batch_size'</span><span class="p">:</span> <span class="n">choice</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
<span class="s1">'--learning_rate'</span><span class="p">:</span> <span class="n">normal</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">param_sampling</span> <span class="o">=</span> <span class="n">RandomParameterSampling</span><span class="p">(</span><span class="n">param_space</span><span class="p">)</span>
</code></pre></div>
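<p>Conceptually, each child run then receives one random draw from this space. A minimal sketch with the standard <code>random</code> module, assuming <code>normal(10, 3)</code> means a mean of 10 and a standard deviation of 3 (the helper name <code>sample_combination</code> is made up for this example):</p>

```python
import random


def sample_combination(rng):
    """Draw one hyperparameter combination: a discrete choice plus a normal draw."""
    return {
        "--batch_size": rng.choice([16, 32, 64]),
        "--learning_rate": rng.gauss(10, 3),  # mean 10, standard deviation 3
    }


rng = random.Random(42)  # fixed seed so the sketch is reproducible
samples = [sample_combination(rng) for _ in range(6)]
```

<p>Unlike grid sampling, the number of draws is independent of the size of the search space, which is why random sampling also works with continuous hyperparameters.</p>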
<p><strong>Bayesian sampling.</strong> Based on a Bayesian optimisation algorithm that tries to select parameter combinations that will result in improved performance from the previous selection.</p>
<div class="highlight"><pre><span></span><code><span class="n">param_space</span> <span class="o">=</span> <span class="p">{</span>
<span class="s1">'--batch_size'</span><span class="p">:</span> <span class="n">choice</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
<span class="s1">'--learning_rate'</span><span class="p">:</span> <span class="n">uniform</span><span class="p">(</span><span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">param_sampling</span> <span class="o">=</span> <span class="n">BayesianParameterSampling</span><span class="p">(</span><span class="n">param_space</span><span class="p">)</span>
</code></pre></div>
<p>Bayesian sampling can only be used with <em>choice, uniform, quniform</em> distributions and can't be combined with an <em>early termination</em> policy.</p>
<h2>Configuring an early termination</h2>
<p>Typically, you set a maximum number of iterations, but this could still result in a large number of runs that don't result in a better model than a combination that has already been tried.</p>
<p>To avoid wasting time, you can set an <em>early termination</em> policy that abandons runs that are unlikely to produce a better result than previously completed runs. The policy is evaluated at an <em>evaluation interval</em> you specify, based on each time the target performance metric is logged. You can also set a <em>delay evaluation</em> parameter to avoid evaluating the policy until a minimum number of iterations have been completed.</p>
<p><strong>Note.</strong> Early termination is particularly useful for deep learning scenarios where a deep neural network is trained iteratively over a number of epochs. The training script can report the target metric after each epoch, and if the run is significantly underperforming previous runs after the same number of intervals, it can be abandoned.</p>
<p><strong>Bandit policy.</strong> Stop a run if the target performance metric underperforms the best run so far by a specified margin.</p>
<div class="highlight"><pre><span></span><code><span class="n">early_termination_policy</span> <span class="o">=</span> <span class="n">BanditPolicy</span><span class="p">(</span>
<span class="n">slack_amount</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span> <span class="c1"># abandon runs when metric is 0.2 or more worse than best run after the same number of intervals</span>
<span class="n">evaluation_interval</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">delay_evaluation</span><span class="o">=</span><span class="mi">5</span>
<span class="p">)</span>
</code></pre></div>
<p>You can also use a slack <em>factor</em>, which compares the metric as a ratio rather than an absolute value.</p>
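<p>The two variants of the bandit rule can be sketched as a small decision function. This is an illustration under assumptions rather than the SDK's implementation: it assumes a metric that should be maximised, and for the slack factor it assumes the threshold is <code>best / (1 + slack_factor)</code>. The function name <code>bandit_should_stop</code> is hypothetical.</p>

```python
def bandit_should_stop(metric, best_metric, slack_amount=None, slack_factor=None):
    """Decide whether a run should be abandoned, assuming a higher metric is better."""
    if slack_amount is not None:
        # Absolute margin: stop when the run trails the best run by slack_amount or more.
        return best_metric - metric >= slack_amount
    if slack_factor is not None:
        # Ratio margin (assumed formula): stop when the run falls below best / (1 + slack_factor).
        return best_metric / (1 + slack_factor) > metric
    raise ValueError("specify either slack_amount or slack_factor")


print(bandit_should_stop(0.6, 0.9, slack_amount=0.2))
print(bandit_should_stop(0.8, 0.9, slack_factor=0.2))
```

<p>With a best accuracy of 0.9, a run at 0.6 is abandoned under <code>slack_amount=0.2</code>, while a run at 0.8 survives under <code>slack_factor=0.2</code> because the threshold is 0.9 / 1.2 = 0.75.</p>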
<p><strong>Median stopping policy.</strong> Abandon runs where the target performance metric is worse than the median of the running averages of all runs.</p>
<div class="highlight"><pre><span></span><code><span class="n">early_termination_policy</span> <span class="o">=</span> <span class="n">MedianStoppingPolicy</span><span class="p">(</span>
<span class="n">evaluation_interval</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">delay_evaluation</span><span class="o">=</span><span class="mi">5</span>
<span class="p">)</span>
</code></pre></div>
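<p>A minimal sketch of the median stopping rule, assuming a metric that should be maximised and comparing a run's best value so far against the median of the other runs' running averages. The function name and the accuracy logs are made up for illustration.</p>

```python
from statistics import mean, median


def median_should_stop(run_metrics, other_runs_metrics):
    """Return True when a run's best metric so far is worse than the median of the
    other runs' running averages (higher metric assumed better)."""
    running_averages = [mean(metrics) for metrics in other_runs_metrics]
    return median(running_averages) > max(run_metrics)


others = [[0.70, 0.80], [0.60, 0.70], [0.80, 0.90]]  # made-up accuracy logs
print(median_should_stop([0.50, 0.55], others))
print(median_should_stop([0.70, 0.85], others))
```

<p>The running averages are roughly 0.75, 0.65 and 0.85, so the median is about 0.75: the first run (best value 0.55) would be stopped, the second (best value 0.85) would continue.</p>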
<p><strong>Truncation selection policy.</strong> Cancel the lowest-performing <em>X%</em> of runs at each evaluation interval, based on the <em>truncation_percentage</em> value you specify for <em>X</em>.</p>
<div class="highlight"><pre><span></span><code><span class="n">early_termination_policy</span> <span class="o">=</span> <span class="n">TruncationSelectionPolicy</span><span class="p">(</span>
<span class="n">truncation_percentage</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
<span class="n">evaluation_interval</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">delay_evaluation</span><span class="o">=</span><span class="mi">5</span>
<span class="p">)</span>
</code></pre></div>
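<p>The truncation rule can be sketched as ranking the runs by their latest metric and cutting off the bottom share. The rounding (floor) and the name <code>runs_to_cancel</code> are assumptions made for this example, and a higher metric is assumed to be better.</p>

```python
import math


def runs_to_cancel(latest_metrics, truncation_percentage):
    """Return the ids of the lowest-performing runs (higher metric assumed better)."""
    n_cancel = math.floor(len(latest_metrics) * truncation_percentage / 100)
    ranked = sorted(latest_metrics, key=latest_metrics.get)  # worst first
    return ranked[:n_cancel]


metrics = {f"run_{i}": i / 10 for i in range(10)}  # made-up metrics 0.0 .. 0.9
print(runs_to_cancel(metrics, truncation_percentage=10))
```

<p>With ten runs and <code>truncation_percentage=10</code>, exactly one run (the worst one, <code>run_0</code>) is cancelled at that interval.</p>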
<h2>Running a hyperparameter tuning experiment</h2>
<p>In Azure ML, you tune hyperparameters by running a <em>hyperdrive</em> experiment. You need to create a training script just the way you would do for any other training experiment, except that you <strong>must</strong>:</p>
<ul>
<li>include an argument for each hyperparameter</li>
<li>log the target performance metric.</li>
</ul>
<p>This example script trains a logistic regression using a <code>--regularization</code> argument (regularization rate), and logs the <em>accuracy</em>.</p>
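<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">argparse</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">joblib</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">azureml.core</span> <span class="kn">import</span> <span class="n">Run</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LogisticRegression</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="c1"># parse the hyperparameter passed by hyperdrive</span>
<span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="o">.</span><span class="n">ArgumentParser</span><span class="p">()</span>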
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s1">'--regularization'</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">float</span><span class="p">,</span> <span class="n">dest</span><span class="o">=</span><span class="s1">'reg_rate'</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="mf">0.01</span><span class="p">)</span>
<span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse_args</span><span class="p">()</span>
<span class="n">reg</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">reg_rate</span>
<span class="c1"># get experiment run context</span>
<span class="n">run</span> <span class="o">=</span> <span class="n">Run</span><span class="o">.</span><span class="n">get_context</span><span class="p">()</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">run</span><span class="o">.</span><span class="n">input_datasets</span><span class="p">[</span><span class="s1">'training_data'</span><span class="p">]</span><span class="o">.</span><span class="n">to_pandas_dataframe</span><span class="p">()</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">data</span><span class="p">[[</span><span class="s1">'feature1'</span><span class="p">,</span> <span class="s1">'feature2'</span><span class="p">,</span> <span class="s1">'feature3'</span><span class="p">,</span> <span class="s1">'feature4'</span><span class="p">]]</span><span class="o">.</span><span class="n">values</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s1">'label'</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.3</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">LogisticRegression</span><span class="p">(</span><span class="n">C</span><span class="o">=</span><span class="mi">1</span><span class="o">/</span><span class="n">reg</span><span class="p">,</span> <span class="n">solver</span><span class="o">=</span><span class="s1">'liblinear'</span><span class="p">)</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="c1"># calculate and log accuracy</span>
<span class="n">y_hat</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="n">acc</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">average</span><span class="p">(</span><span class="n">y_hat</span> <span class="o">==</span> <span class="n">y_test</span><span class="p">)</span>
<span class="n">run</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="s1">'Accuracy'</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">acc</span><span class="p">))</span>
<span class="c1"># save trained model</span>
<span class="n">os</span><span class="o">.</span><span class="n">makedirs</span><span class="p">(</span><span class="s1">'outputs'</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">joblib</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">value</span><span class="o">=</span><span class="n">model</span><span class="p">,</span> <span class="n">filename</span><span class="o">=</span><span class="s1">'outputs/model.pkl'</span><span class="p">)</span>
<span class="n">run</span><span class="o">.</span><span class="n">complete</span><span class="p">()</span>
</code></pre></div>
<p>To prepare the hyperdrive experiment, you use a <code>HyperDriveConfig</code> object to configure the experiment run.</p>
<div class="highlight"><pre><span></span><code><span class="n">hyperdrive</span> <span class="o">=</span> <span class="n">HyperDriveConfig</span><span class="p">(</span>
<span class="n">estimator</span><span class="o">=</span><span class="n">sklearn_estimator</span><span class="p">,</span>
<span class="n">hyperparameter_sampling</span><span class="o">=</span><span class="n">param_sampling</span><span class="p">,</span>
<span class="n">policy</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span>
<span class="n">primary_metric_name</span><span class="o">=</span><span class="s1">'Accuracy'</span><span class="p">,</span>
<span class="n">primary_metric_goal</span><span class="o">=</span><span class="n">PrimaryMetricGoal</span><span class="o">.</span><span class="n">MAXIMIZE</span><span class="p">,</span>
<span class="n">max_total_runs</span><span class="o">=</span><span class="mi">6</span><span class="p">,</span>
<span class="n">max_concurrent_runs</span><span class="o">=</span><span class="mi">4</span>
<span class="p">)</span>
<span class="n">experiment</span> <span class="o">=</span> <span class="n">Experiment</span><span class="p">(</span><span class="n">workspace</span><span class="o">=</span><span class="n">ws</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s1">'hyperdrive_training'</span><span class="p">)</span>
<span class="n">hyperdrive_run</span> <span class="o">=</span> <span class="n">experiment</span><span class="o">.</span><span class="n">submit</span><span class="p">(</span><span class="n">config</span><span class="o">=</span><span class="n">hyperdrive</span><span class="p">)</span>
</code></pre></div>
<p>You can monitor the hyperdrive experiment in Azure ML studio. The experiment will initiate a child run for each hyperparameter combination to be tried.</p>
<h1>Automate model selection</h1>
<p>The visual interface for automated ML in Azure ML Studio is available in the <em>Enterprise</em> edition only.</p>
<p>You can use automated ML to train models for the tasks below. Azure ML supports common algorithms for these tasks:</p>
<ul>
<li>classification
<ul>
<li>logistic regression</li>
<li>light gradient boosting machine</li>
<li>decision tree</li>
<li>random forest</li>
<li>naive Bayes</li>
<li>linear SVM</li>
<li>XGBoost</li>
<li>DNN classifier</li>
<li>others</li>
</ul>
</li>
<li>regression
<ul>
<li>linear regression</li>
<li>light gradient boosting machine</li>
<li>decision tree</li>
<li>random forest</li>
<li>elastic net</li>
<li>LARS Lasso</li>
<li>XGBoost</li>
<li>others</li>
</ul>
</li>
<li>time series forecasting
<ul>
<li>linear regression</li>
<li>light gradient boosting machine</li>
<li>decision tree</li>
<li>random forest</li>
<li>elastic net</li>
<li>LARS Lasso</li>
<li>XGBoost</li>
<li>others</li>
</ul>
</li>
</ul>
<p>By default, automated machine learning will randomly select from the full range of algorithms for the specified task. You can choose to <strong>block</strong> individual algorithms from being selected.</p>
<h2>Preprocessing and featurization</h2>
<p>Automated ML (AutoML) can apply preprocessing transformations to your data.</p>
<ul>
<li><strong>scaling and normalization</strong> applied to numeric data <strong>automatically</strong></li>
<li><strong>optional featurization</strong>
<ul>
<li>missing value imputation</li>
<li>categorical encoding</li>
<li>dropping high cardinality features (eg IDs)</li>
<li>feature engineering (eg date parts from DateTime)</li>
</ul>
</li>
</ul>
<h2>Running AutoML experiment</h2>
<p>You can use the Azure ML Studio UI or the SDK (using the <code>AutoMLConfig</code> class).</p>
<div class="highlight"><pre><span></span><code><span class="n">automl_run_config</span> <span class="o">=</span> <span class="n">RunConfiguration</span><span class="p">(</span><span class="n">framework</span><span class="o">=</span><span class="s1">'python'</span><span class="p">)</span>
<span class="n">automl_config</span> <span class="o">=</span> <span class="n">AutoMLConfig</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s1">'auto ml experiment'</span><span class="p">,</span>
<span class="n">task</span><span class="o">=</span><span class="s1">'classification'</span><span class="p">,</span>
<span class="n">primary_metric</span><span class="o">=</span><span class="s1">'AUC_weighted'</span><span class="p">,</span>
<span class="n">compute_target</span><span class="o">=</span><span class="n">aml_compute</span><span class="p">,</span>
<span class="n">training_data</span><span class="o">=</span><span class="n">train_dataset</span><span class="p">,</span>
<span class="n">validation_data</span><span class="o">=</span><span class="n">test_dataset</span><span class="p">,</span>
<span class="n">label_column_name</span><span class="o">=</span><span class="s1">'label'</span><span class="p">,</span>
<span class="n">featurization</span><span class="o">=</span><span class="s1">'auto'</span><span class="p">,</span>
<span class="n">iterations</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span>
<span class="n">max_concurrent_iterations</span><span class="o">=</span><span class="mi">4</span>
<span class="p">)</span>
</code></pre></div>
<p>With Azure ML Studio, you can create or select an Azure ML <em>dataset</em> to be used as input for your AutoML experiment. When using the SDK, you can submit data in one of two ways. Either:</p>
<ul>
<li>specify a dataset or dataframe of <em>training data</em> that includes features and label to be predicted</li>
<li>optionally, specify a second <em>validation data</em> dataset or dataframe. If this is not provided, Azure ML will apply cross-validation.</li>
</ul>
<p>Alternatively:</p>
<ul>
<li>specify a dataset, dataframe, or numpy array of <em>X</em> values containing features with a corresponding <em>y</em> array of label values</li>
</ul>
<p>One of the most important settings you specify is <strong>primary_metric</strong> (ie the target performance metric). Azure ML supports a set of named metrics for each type of task.</p>
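<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">azureml.train.automl.utilities</span> <span class="kn">import</span> <span class="n">get_primary_metrics</span>
<span class="n">get_primary_metrics</span><span class="p">(</span><span class="s1">'classification'</span><span class="p">)</span>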
</code></pre></div>
<p>You can <strong>submit</strong> an AutoML experiment like any other SDK-based experiment:</p>
<div class="highlight"><pre><span></span><code><span class="n">automl_experiment</span> <span class="o">=</span> <span class="n">Experiment</span><span class="p">(</span><span class="n">ws</span><span class="p">,</span> <span class="s1">'automl_experiment'</span><span class="p">)</span>
<span class="n">automl_run</span> <span class="o">=</span> <span class="n">automl_experiment</span><span class="o">.</span><span class="n">submit</span><span class="p">(</span><span class="n">automl_config</span><span class="p">)</span>
</code></pre></div>
<p>You can easily identify the best run in Azure ML studio, and download or deploy the model it generated. Via the SDK:</p>
<div class="highlight"><pre><span></span><code><span class="n">best_run</span><span class="p">,</span> <span class="n">fitted_model</span> <span class="o">=</span> <span class="n">automl_run</span><span class="o">.</span><span class="n">get_output</span><span class="p">()</span>
<span class="n">best_run_metrics</span> <span class="o">=</span> <span class="n">best_run</span><span class="o">.</span><span class="n">get_metrics</span><span class="p">()</span>
<span class="k">for</span> <span class="n">metric_name</span> <span class="ow">in</span> <span class="n">best_run_metrics</span><span class="p">:</span>
    <span class="n">metric</span> <span class="o">=</span> <span class="n">best_run_metrics</span><span class="p">[</span><span class="n">metric_name</span><span class="p">]</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">metric_name</span><span class="p">,</span> <span class="n">metric</span><span class="p">)</span>
</code></pre></div>
<p>AutoML uses <em>scikit-learn</em> pipelines. You can view the steps in the fitted model you obtained from the best run.</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span> <span class="n">step</span> <span class="ow">in</span> <span class="n">fitted_model</span><span class="o">.</span><span class="n">named_steps</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">step</span><span class="p">)</span>
</code></pre></div>
<h1>Explain ML models</h1>
<p>Model explainers use statistical techniques to calculate <strong>feature importance</strong>. Explainers work by evaluating a test data set of feature cases and the labels the model predicts for them.</p>
<p><strong>Global feature importance</strong> quantifies the relative importance of each feature in the test dataset as a whole: which feature in the dataset influences prediction?</p>
<p><strong>Local feature importance</strong> measures the influence of each feature value for a specific individual prediction. For example, will Sam default?</p>
<blockquote>
<p>Prediction=0: Sam won't default on the loan repayment</p>
</blockquote>
<p>Features:</p>
<ul>
<li><em>loan amount</em>; support for 0: <code>0.9</code>; support for 1: <code>-0.9</code></li>
<li><em>income</em>; support for 0: <code>0.6</code></li>
<li><em>age</em>; support for 0: <code>-0.2</code></li>
<li><em>marital status</em>; support for 0: <code>0.1</code></li>
</ul>
<p>Because this is a <em>classification</em> model, each feature gets a local importance value for each possible class, indicating the amount of support for that class based on the feature value.</p>
<p>The most important feature for a prediction of class 1 is <em>loan amount</em>. There could be multiple reasons why local importance for an individual prediction varies from global importance for the overall dataset. For example, Sam might have a lower income than average, but the loan amount in this case might be unusually small.</p>
<p>For a multi-class classification model, a local importance value for each possible class is calculated for every feature, with the total across all classes always being 0.</p>
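<p>A small arithmetic check on Sam's example may help here. The class-0 values are the ones listed above; the class-1 values are an assumption derived from the zero-sum property (with only two classes, support for class 1 is taken to be the negative of support for class 0, as shown for <em>loan amount</em>).</p>

```python
# Local importance values for class 0, taken from the example above.
support_for_0 = {
    "loan amount": 0.9,
    "income": 0.6,
    "age": -0.2,
    "marital status": 0.1,
}
# Assumption: with two classes, support for class 1 mirrors support for class 0.
support_for_1 = {feature: -value for feature, value in support_for_0.items()}

overall_0 = sum(support_for_0.values())
overall_1 = sum(support_for_1.values())
print(round(overall_0, 2), round(overall_1, 2))
```

<p>The overall support is about 1.4 for class 0 and about -1.4 for class 1, which is consistent with the prediction of class 0 and with the per-feature totals across classes summing to 0.</p>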
<p>For a <strong>regression model</strong>, the local importance values simply indicate the level of influence each feature has on the predicted scalar label.</p>
<h2>Using explainers</h2>
<p>You can use Azure ML SDK to create explainers for models even if they were not trained using an Azure ML experiment.</p>
<p>You install the <code>azureml-interpret</code> package. Types of explainer include:</p>
<ul>
<li><code>MimicExplainer</code> creates a <em>global surrogate model</em> that approximates your trained model and can be used to generate explanations. This explainable model must have the same kind of architecture as your trained model (eg linear or tree-based)</li>
<li><code>TabularExplainer</code> acts as a wrapper around various SHAP explainer algorithms, automatically choosing the one that is most appropriate for your model architecture</li>
<li><code>PFIExplainer</code> (<em>Permutation Feature Importance</em>) analyzes feature importance by shuffling feature values and measuring the impact on prediction performance</li>
</ul>
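<p>Of the three, permutation feature importance is the easiest to sketch by hand. The following pure-Python example is not the <code>PFIExplainer</code> implementation; it only illustrates the idea of shuffling one feature column at a time and measuring the drop in accuracy. The toy model, the data and the function names are all made up.</p>

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the right label."""
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_features, seed=0):
    """Accuracy drop observed after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)
        shuffled_rows = [row[:j] + [value] + row[j + 1:] for row, value in zip(rows, column)]
        importances.append(baseline - accuracy(model, shuffled_rows, labels))
    return importances


# Toy data: the label depends only on feature 0; feature 1 is noise.
data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [int(row[0] > 0.5) for row in rows]
toy_model = lambda row: int(row[0] > 0.5)  # hypothetical "trained" model

importances = permutation_importance(toy_model, rows, labels, n_features=2)
```

<p>Shuffling feature 0 destroys the relationship the model relies on, so its importance is large, while shuffling the noise feature leaves accuracy untouched and yields an importance of 0.</p>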
<p>Example for a hypothetical model named <code>loan_model</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">mim_explainer</span> <span class="o">=</span> <span class="n">MimicExplainer</span><span class="p">(</span>
<span class="n">model</span><span class="o">=</span><span class="n">loan_model</span><span class="p">,</span>
<span class="n">initialization_examples</span><span class="o">=</span><span class="n">X_test</span><span class="p">,</span>
<span class="n">explainable_model</span><span class="o">=</span><span class="n">DecisionTreeExplainableModel</span><span class="p">,</span>
<span class="n">features</span><span class="o">=</span><span class="p">[</span><span class="s1">'loan_amount'</span><span class="p">,</span> <span class="s1">'income'</span><span class="p">,</span> <span class="s1">'age'</span><span class="p">,</span> <span class="s1">'marital_status'</span><span class="p">],</span>
<span class="n">classes</span><span class="o">=</span><span class="p">[</span><span class="s1">'reject'</span><span class="p">,</span> <span class="s1">'approve'</span><span class="p">]</span>
<span class="p">)</span>
<span class="n">tab_explainer</span> <span class="o">=</span> <span class="n">TabularExplainer</span><span class="p">(</span>
<span class="n">model</span><span class="o">=</span><span class="n">loan_model</span><span class="p">,</span>
<span class="n">initialization_examples</span><span class="o">=</span><span class="n">X_test</span><span class="p">,</span>
<span class="n">features</span><span class="o">=</span><span class="p">[</span><span class="s1">'loan_amount'</span><span class="p">,</span> <span class="s1">'income'</span><span class="p">,</span> <span class="s1">'age'</span><span class="p">,</span> <span class="s1">'marital_status'</span><span class="p">],</span>
<span class="n">classes</span><span class="o">=</span><span class="p">[</span><span class="s1">'reject'</span><span class="p">,</span> <span class="s1">'approve'</span><span class="p">]</span>
<span class="p">)</span>
<span class="n">pfi_explainer</span> <span class="o">=</span> <span class="n">PFIExplainer</span><span class="p">(</span>
<span class="n">model</span><span class="o">=</span><span class="n">loan_model</span><span class="p">,</span>
<span class="n">features</span><span class="o">=</span><span class="p">[</span><span class="s1">'loan_amount'</span><span class="p">,</span> <span class="s1">'income'</span><span class="p">,</span> <span class="s1">'age'</span><span class="p">,</span> <span class="s1">'marital_status'</span><span class="p">],</span>
<span class="n">classes</span><span class="o">=</span><span class="p">[</span><span class="s1">'reject'</span><span class="p">,</span> <span class="s1">'approve'</span><span class="p">]</span>
<span class="p">)</span>
</code></pre></div>
<p>To retrieve <strong>global feature importance</strong>, call the <code>explain_global()</code> method of your explainer, and then use the <code>get_feature_importance_dict()</code> method to get a dictionary of the feature importance values.</p>
<div class="highlight"><pre><span></span><code><span class="n">global_mim_explanation</span> <span class="o">=</span> <span class="n">mim_explainer</span><span class="o">.</span><span class="n">explain_global</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">global_mim_feature_importance</span> <span class="o">=</span> <span class="n">global_mim_explanation</span><span class="o">.</span><span class="n">get_feature_importance_dict</span><span class="p">()</span>
<span class="c1"># same as MimicExplainer</span>
<span class="n">global_tab_explanation</span> <span class="o">=</span> <span class="n">tab_explainer</span><span class="o">.</span><span class="n">explain_global</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">global_tab_feature_importance</span> <span class="o">=</span> <span class="n">global_tab_explanation</span><span class="o">.</span><span class="n">get_feature_importance_dict</span><span class="p">()</span>
<span class="c1"># requires actual labels</span>
<span class="n">global_pfi_explanation</span> <span class="o">=</span> <span class="n">pfi_explainer</span><span class="o">.</span><span class="n">explain_global</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">global_pfi_feature_importance</span> <span class="o">=</span> <span class="n">global_pfi_explanation</span><span class="o">.</span><span class="n">get_feature_importance_dict</span><span class="p">()</span>
</code></pre></div>
<p>To retrieve <strong>local feature importance</strong> from a <code>MimicExplainer</code> or a <code>TabularExplainer</code>, call the <code>explain_local()</code> method, specifying the subset of cases you want to explain. Then use the <code>get_ranked_local_names()</code> and <code>get_ranked_local_values()</code> methods to retrieve the features ranked by importance and their importance values.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># same for tab_explainer too</span>
<span class="n">local_mim_explanation</span> <span class="o">=</span> <span class="n">mim_explainer</span><span class="o">.</span><span class="n">explain_local</span><span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">5</span><span class="p">])</span>
<span class="n">local_mim_features</span> <span class="o">=</span> <span class="n">local_mim_explanation</span><span class="o">.</span><span class="n">get_ranked_local_names</span><span class="p">()</span>
<span class="n">local_mim_importance</span> <span class="o">=</span> <span class="n">local_mim_explanation</span><span class="o">.</span><span class="n">get_ranked_local_values</span><span class="p">()</span>
</code></pre></div>
<p><code>PFIExplainer</code> does not support local feature importance explanations.</p>
<h2>Creating explanations</h2>
<p>You can create an explainer and upload the explanation it generates to the run for later analysis.</p>
<p>To create an explanation for the <strong>experiment script</strong>, you'll need to ensure that the <code>azureml-interpret</code> and <code>azureml-contrib-interpret</code> packages are installed in the run environment. Then you can use these to create an explanation from your trained model and upload it to the run outputs.</p>
<div class="highlight"><pre><span></span><code><span class="n">run</span> <span class="o">=</span> <span class="n">Run</span><span class="o">.</span><span class="n">get_context</span><span class="p">()</span>
<span class="c1"># code to train model goes here</span>
<span class="c1"># get explanation</span>
<span class="n">explainer</span> <span class="o">=</span> <span class="n">TabularExplainer</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">X_train</span><span class="p">,</span> <span class="n">features</span><span class="o">=</span><span class="n">features</span><span class="p">,</span> <span class="n">classes</span><span class="o">=</span><span class="n">labels</span><span class="p">)</span>
<span class="n">explanation</span> <span class="o">=</span> <span class="n">explainer</span><span class="o">.</span><span class="n">explain_global</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="c1"># get an explanation client and upload the explanation</span>
<span class="n">explain_client</span> <span class="o">=</span> <span class="n">ExplanationClient</span><span class="o">.</span><span class="n">from_run</span><span class="p">(</span><span class="n">run</span><span class="p">)</span>
<span class="n">explain_client</span><span class="o">.</span><span class="n">upload_model_explanation</span><span class="p">(</span><span class="n">explanation</span><span class="p">,</span> <span class="n">comment</span><span class="o">=</span><span class="s1">'Tabular Explanation'</span><span class="p">)</span>
<span class="n">run</span><span class="o">.</span><span class="n">complete</span><span class="p">()</span>
</code></pre></div>
<p>You can view the explanation you created for your model in the <em>Explanations</em> tab for the run in Azure ML Studio.</p>
<h2>Visualizing explanations</h2>
<p>Model explanations in Azure ML Studio include multiple visualizations that you can use to explore feature importance. Visualizations:</p>
<ul>
<li>global feature importance</li>
<li>summary importance: shows the distribution of individual importance values for each feature across the test dataset</li>
<li>local feature importance by selecting an individual data point</li>
</ul>
<h1>Monitor models</h1>
<p>You can use Application Insights to capture and review telemetry from models published with Azure ML. You must have an Application Insights resource associated with your Azure ML workspace.</p>
<p>When you create an Azure ML workspace, you can select an Application Insights resource. If you do not select an existing resource, a new one is created in the same resource group as your workspace.</p>
<p>When deploying a new real-time service, you can <strong>enable</strong> Application Insights in the deployment configuration for the service.</p>
<div class="highlight"><pre><span></span><code><span class="n">dep_config</span> <span class="o">=</span> <span class="n">AciWebservice</span><span class="o">.</span><span class="n">deploy_configuration</span><span class="p">(</span>
<span class="n">cpu_cores</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">memory_gb</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">enable_app_insights</span><span class="o">=</span><span class="kc">True</span>
<span class="p">)</span>
</code></pre></div>
<p>If you want to enable Application Insights for a service that is already deployed, you can modify the deployment configuration for AKS-based services in the Azure portal.</p>
<h2>Capture and view telemetry</h2>
<p>Application Insights automatically captures any information written to the standard output and error logs, and provides a query capability to view data in these logs.</p>
<p>You can write any value to the standard output in the scoring script by using a <code>print</code>:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">json</span>

<span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="n">raw_data</span><span class="p">):</span>
    <span class="c1"># model is assumed to be loaded as a global in init()</span>
    <span class="n">data</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">raw_data</span><span class="p">)[</span><span class="s1">'data'</span><span class="p">]</span>
    <span class="n">predictions</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="s1">'Data: '</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="o">+</span> <span class="s1">' - Predictions: '</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">predictions</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">predictions</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span>
</code></pre></div>
<p>Azure ML creates a <em>custom dimension</em> in the data model for the output you write.</p>
<p>You can use the Log Analytics query interface for Application Insights in the Azure portal. It supports a SQL-like query syntax.</p>
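<p>A possible refinement of the <code>print</code> approach, not taken from the course material, is to write one JSON object per scoring call, so that the logged text has a predictable shape when you read it back. The sketch below only shows the formatting step; the helper name <code>format_log_record</code> and the sample values are hypothetical.</p>

```python
import json


def format_log_record(data, predictions):
    """Serialize inputs and predictions as a single JSON line for stdout."""
    return json.dumps({"data": data, "predictions": predictions}, sort_keys=True)


# Inside a scoring script you would print the record for each call.
record = format_log_record([[1.0, 2.0]], [0])
print(record)
```

<p>In a real scoring script, the predictions would first need to be converted to plain lists (for example with <code>tolist()</code>) before being serialized.</p>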
<h1>Monitor data drift</h1>
<p>Over time there may be trends that change the profile of the data, making your model less accurate. This change in data profiles between training and inferencing is known as <em>data drift</em>.</p>
<p>Azure ML supports data drift monitoring through the use of <em>datasets</em>. You can compare two registered datasets to detect data drift, or you can capture new feature data submitted to a deployed model service and compare it to the dataset with which the model was trained.</p>
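<p>As a rough illustration of what comparing a baseline to a target dataset can mean for a single numeric feature, the sketch below computes the Population Stability Index (PSI). This is a commonly used drift score that I am swapping in for illustration; it is an assumption and not necessarily the metric Azure ML computes. The function name <code>psi</code>, the number of bins and the sample data are hypothetical.</p>

```python
import math


def psi(baseline, target, bins=5, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the baseline range into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Replace empty bins with a small epsilon so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    b, t = proportions(baseline), proportions(target)
    return sum((ti - bi) * math.log(ti / bi) for bi, ti in zip(b, t))


baseline = list(range(100))            # stand-in for a training feature
shifted = [v + 40 for v in baseline]   # same feature after the data has drifted

print(psi(baseline, baseline))  # identical samples give no drift
print(psi(baseline, shifted))   # a shifted sample gives a clearly positive score
```

<p>A score like this can then be compared against a threshold to decide whether the drift is large enough to raise an alert.</p>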
<p>You register 2 datasets:</p>
<ul>
<li>a <em>baseline</em> dataset: original training data</li>
<li>a <em>target</em> dataset that will be compared to the baseline on time intervals. This dataset requires a column for each feature you want to compare, and a timestamp column</li>
</ul>
<p>You define a <em>dataset monitor</em> to detect data drift and trigger alerts if the rate of drift exceeds a specified threshold. You can create dataset monitors using Azure ML Studio or by using the <code>DataDriftDetector</code> class.</p>
<div class="highlight"><pre><span></span><code><span class="n">monitor</span> <span class="o">=</span> <span class="n">DataDriftDetector</span><span class="o">.</span><span class="n">create_from_datasets</span><span class="p">(</span>
<span class="n">workspace</span><span class="o">=</span><span class="n">ws</span><span class="p">,</span>
<span class="n">name</span><span class="o">=</span><span class="s1">'dataset-drift-monitor'</span><span class="p">,</span>
<span class="n">baseline_data_set</span><span class="o">=</span><span class="n">train_ds</span><span class="p">,</span>
<span class="n">target_data_set</span><span class="o">=</span><span class="n">new_data_ds</span><span class="p">,</span>
<span class="n">compute_target</span><span class="o">=</span><span class="s1">'aml-cluster'</span><span class="p">,</span>
<span class="n">frequency</span><span class="o">=</span><span class="s1">'Week'</span><span class="p">,</span>
<span class="n">feature_list</span><span class="o">=</span><span class="p">[</span><span class="s1">'age'</span><span class="p">,</span> <span class="s1">'height'</span><span class="p">,</span> <span class="s1">'bmi'</span><span class="p">],</span>
<span class="n">latency</span><span class="o">=</span><span class="mi">24</span>
<span class="p">)</span>
</code></pre></div>
<p>You can <em>backfill</em> to immediately compare baseline to existing data in target.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">datetime</span> <span class="k">as</span> <span class="nn">dt</span>
<span class="n">backfill</span> <span class="o">=</span> <span class="n">monitor</span><span class="o">.</span><span class="n">backfill</span><span class="p">(</span><span class="n">dt</span><span class="o">.</span><span class="n">datetime</span><span class="o">.</span><span class="n">now</span><span class="p">()</span> <span class="o">-</span> <span class="n">dt</span><span class="o">.</span><span class="n">timedelta</span><span class="p">(</span><span class="n">weeks</span><span class="o">=</span><span class="mi">6</span><span class="p">),</span> <span class="n">dt</span><span class="o">.</span><span class="n">datetime</span><span class="o">.</span><span class="n">now</span><span class="p">())</span>
</code></pre></div>
<p>If you have <strong>deployed a model</strong> as a real-time web service, you can capture new inferencing data as it is submitted, and compare it to the original training data. This approach has the benefit of automatically collecting new target data as the deployed model is used.</p>
<p>You include the training dataset in the model registration to provide a baseline.</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">Model</span><span class="o">.</span><span class="n">register</span><span class="p">(</span>
<span class="n">workspace</span><span class="o">=</span><span class="n">ws</span><span class="p">,</span>
<span class="n">model_path</span><span class="o">=</span><span class="s1">'./model/model.pkl'</span><span class="p">,</span>
<span class="n">model_name</span><span class="o">=</span><span class="s1">'mymodel'</span><span class="p">,</span>
<span class="n">datasets</span><span class="o">=</span><span class="p">[(</span><span class="n">Dataset</span><span class="o">.</span><span class="n">Scenario</span><span class="o">.</span><span class="n">TRAINING</span><span class="p">,</span> <span class="n">train_ds</span><span class="p">)]</span>
<span class="p">)</span>
</code></pre></div>
<p>You enable data collection for services in which the model is used. You use the <code>ModelDataCollector</code> class in each service's scoring script, writing code to capture data and predictions and write them to the data collector (which will store them in Azure blob storage).</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">joblib</span>
<span class="kn">from</span> <span class="nn">azureml.core.model</span> <span class="kn">import</span> <span class="n">Model</span>
<span class="kn">from</span> <span class="nn">azureml.monitoring</span> <span class="kn">import</span> <span class="n">ModelDataCollector</span>

<span class="k">def</span> <span class="nf">init</span><span class="p">():</span>
    <span class="k">global</span> <span class="n">model</span><span class="p">,</span> <span class="n">data_collect</span><span class="p">,</span> <span class="n">predict_collect</span>
    <span class="n">model_name</span> <span class="o">=</span> <span class="s1">'mymodel'</span>
    <span class="n">model</span> <span class="o">=</span> <span class="n">joblib</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">Model</span><span class="o">.</span><span class="n">get_model_path</span><span class="p">(</span><span class="n">model_name</span><span class="p">))</span>
    <span class="c1"># enable collection of data and predictions</span>
    <span class="n">data_collect</span> <span class="o">=</span> <span class="n">ModelDataCollector</span><span class="p">(</span>
        <span class="n">model_name</span><span class="p">,</span>
        <span class="n">designation</span><span class="o">=</span><span class="s1">'inputs'</span><span class="p">,</span>
        <span class="n">feature_names</span><span class="o">=</span><span class="p">[</span><span class="s1">'age'</span><span class="p">,</span> <span class="s1">'height'</span><span class="p">,</span> <span class="s1">'bmi'</span><span class="p">]</span>
    <span class="p">)</span>
    <span class="n">predict_collect</span> <span class="o">=</span> <span class="n">ModelDataCollector</span><span class="p">(</span>
        <span class="n">model_name</span><span class="p">,</span>
        <span class="n">designation</span><span class="o">=</span><span class="s1">'predictions'</span><span class="p">,</span>
        <span class="n">feature_names</span><span class="o">=</span><span class="p">[</span><span class="s1">'prediction'</span><span class="p">]</span>
    <span class="p">)</span>

<span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="n">raw_data</span><span class="p">):</span>
    <span class="n">data</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">raw_data</span><span class="p">)[</span><span class="s1">'data'</span><span class="p">]</span>
    <span class="n">predictions</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
    <span class="c1"># log input data and predictions to the collectors</span>
    <span class="n">data_collect</span><span class="o">.</span><span class="n">collect</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
    <span class="n">predict_collect</span><span class="o">.</span><span class="n">collect</span><span class="p">(</span><span class="n">predictions</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">predictions</span><span class="o">.</span><span class="n">tolist</span><span class="p">()</span>
</code></pre></div>
<p>With the data collection code in place in the scoring script, you can enable data collection in the deployment configuration.</p>
<div class="highlight"><pre><span></span><code><span class="n">dep_config</span> <span class="o">=</span> <span class="n">AksWebservice</span><span class="o">.</span><span class="n">deploy_configuration</span><span class="p">(</span><span class="n">collect_model_data</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre></div>
<p>You can configure <strong>data drift monitoring</strong> by using a <code>DataDriftDetector</code> class.</p>
<div class="highlight"><pre><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">ws</span><span class="o">.</span><span class="n">models</span><span class="p">[</span><span class="s1">'mymodel'</span><span class="p">]</span>
<span class="n">datadrift</span> <span class="o">=</span> <span class="n">DataDriftDetector</span><span class="o">.</span><span class="n">create_from_model</span><span class="p">(</span>
<span class="n">ws</span><span class="p">,</span>
<span class="n">model</span><span class="o">.</span><span class="n">name</span><span class="p">,</span>
<span class="n">model</span><span class="o">.</span><span class="n">version</span><span class="p">,</span>
<span class="n">services</span><span class="o">=</span><span class="p">[</span><span class="s1">'my-svc'</span><span class="p">],</span>
<span class="n">frequency</span><span class="o">=</span><span class="s1">'Week'</span>
<span class="p">)</span>
</code></pre></div>
<h2>Scheduling alerts</h2>
<p>You can specify a threshold for the rate of data drift and an operator email for notifications.</p>
<p>Monitoring works by running a comparison at a scheduled <strong>frequency</strong> (day, week, or month), and calculating data drift metrics for the features. For dataset monitors, you can specify a <strong>latency</strong> indicating the number of hours to allow for new data to be collected and added to the target dataset. For deployed-model data drift monitors, you can specify a <code>schedule_start</code> time value to indicate when the data drift run should start (if omitted, the run will start at the current time).</p>
<p>Data drift is measured using a calculated <em>magnitude</em> of change in the statistical distributions of feature values over time. You can configure a <strong>threshold</strong> for data drift magnitude.</p>
<div class="highlight"><pre><span></span><code><span class="n">alert_email</span> <span class="o">=</span> <span class="n">AlertConfiguration</span><span class="p">(</span><span class="s1">'data_scientist@contoso.com'</span><span class="p">)</span>
<span class="n">monitor</span> <span class="o">=</span> <span class="n">DataDriftDetector</span><span class="o">.</span><span class="n">create_from_datasets</span><span class="p">(</span>
<span class="n">ws</span><span class="p">,</span>
<span class="s1">'dataset-drift-detector'</span><span class="p">,</span>
<span class="n">baseline_data_set</span><span class="p">,</span>
<span class="n">target_data_set</span><span class="p">,</span>
<span class="n">compute_target</span><span class="o">=</span><span class="n">cpu_cluster</span><span class="p">,</span>
<span class="n">frequency</span><span class="o">=</span><span class="s1">'Week'</span><span class="p">,</span>
<span class="n">latency</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">drift_threshold</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span>
<span class="n">alert_configuration</span><span class="o">=</span><span class="n">alert_email</span>
<span class="p">)</span>
</code></pre></div>
<h1>Error when restarting Databricks streaming job</h1>
<p>Marco Santoni, 2020-04-19</p>
<p>I encountered this error while running a Spark Streaming job on Databricks 6.1. Consider the case where I have to update a running streaming query. Databricks <a href="https://docs.databricks.com/spark/latest/structured-streaming/production.html#configure-jobs-to-restart-streaming-queries-on-failure">recommends</a> always starting (and restarting too?) a streaming query on a <strong>new</strong> dedicated cluster. However, in some scenarios you might not be able to do so, and you may want to:</p>
<ul>
<li>cancel the job run</li>
<li>update the notebooks</li>
<li>restart the job run</li>
</ul>
<p>After taking these steps, I encountered these errors:</p>
<div class="highlight"><pre><span></span><code><span class="n">Concurrent</span><span class="w"> </span><span class="k">update</span><span class="w"> </span><span class="k">to</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="nf">log</span><span class="p">.</span><span class="w"> </span><span class="n">Multiple</span><span class="w"> </span><span class="n">streaming</span><span class="w"> </span><span class="n">jobs</span><span class="w"> </span><span class="n">detected</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">...</span><span class="w"></span>
<span class="err">#</span><span class="w"> </span><span class="ow">or</span><span class="w"></span>
<span class="n">Multiple</span><span class="w"> </span><span class="n">streaming</span><span class="w"> </span><span class="n">queries</span><span class="w"> </span><span class="k">are</span><span class="w"> </span><span class="n">concurrently</span><span class="w"> </span><span class="k">using</span><span class="w"> </span><span class="p">...</span><span class="w"> </span><span class="o">[</span><span class="n">checkpoint</span><span class="o">]</span><span class="w"></span>
</code></pre></div>
<p>The errors did not occur every time I restarted the query, but most of the time. After restarting 2-3 times, the issue went away and the streaming query ran smoothly. Investigating the error further, we found that cancelling a job run via the Databricks CLI did not let the streaming query shut down gracefully. What happened? The running query was not closing the <a href="https://docs.databricks.com/spark/latest/structured-streaming/production.html#enable-checkpointing">checkpoints</a> cleanly. So, when a new job run started, it raised an error because it found a corrupted checkpoint.</p>
<h2>Solution</h2>
<p>You can</p>
<ul>
<li>upgrade to Databricks 6.3 and set <a href="https://docs.databricks.com/release-notes/runtime/6.3.html#improvements">spark.sql.streaming.stopActiveRunOnRestart</a> to true</li>
<li>wait for Databricks 7 to be released, where this configuration is enabled by default</li>
</ul>
<h1>New Work on atacmonitor.com</h1>
<p>Marco Santoni, 2020-03-08</p>
<p><img alt="atacmonitor chart" src="https://www.marcosantoni.com/images/atacmonitor_chart.png"></p>
<p>My side project <a href="http://www.atacmonitor.com/">atacmonitor</a> has a new look. Data is now being collected for <strong>all bus and tram</strong> lines in Rome. The data is pulled by Python functions running on AWS Lambda and then stored in MongoDB, hosted on MongoDB Atlas. Atlas also provides the charts on the page. An overview of the new architecture is presented below.</p>
<p><img alt="atacmonitor architecture" src="https://www.marcosantoni.com/images/atacmonitor_architecture_2.png"></p>
<p><a href="http://www.marcosantoni.com/monitoring_bus_frequencies_in_rome.html">Link</a> to the post of the first release.</p>
<h1>The Pragmatic Programmer [Highlights]</h1>
<p>Marco Santoni, 2018-02-10</p>
<blockquote>
<p>Rather than construction, software is more like gardening— it is more organic than concrete. You plant many things in a garden according to an initial plan and conditions. Some thrive, others are destined to end up as compost. [...] You constantly monitor the health of the garden, and make adjustments (to the soil, the plants, the layout) as needed.</p>
</blockquote>
<p><em>The Pragmatic Programmer: from Journeyman to Master</em> by Andrew Hunt and David Thomas is a guide to software development best practices. A software developer is like a woodworker: good practices help them achieve quality and efficiency in their work. I will summarize here some interesting hints that you can find in the book.</p>
<p>The book was originally published in 1999, so the technologies and tools it mentions are quite outdated. However, the main principles remain surprisingly up to date.</p>
<p><img alt="The Pragmatic Programmer" src="https://www.marcosantoni.com/images/pragmatic_programmer.jpg"></p>
<h2>1. Don't Repeat Yourself</h2>
<blockquote>
<p>DRY— Don't Repeat Yourself The alternative is to have the same thing expressed in two or more places. If you change one, you have to remember to change the others [...]. It isn't a question of whether you'll remember: it's a question of <strong>when you'll forget</strong>.</p>
</blockquote>
<h2>2. Coding over GUIs</h2>
<blockquote>
<p>A benefit of GUIs is WYSIWYG— what you see is what you get. The disadvantage is WYSIAYG— what you see is <strong>all</strong> you get.</p>
</blockquote>
<h2>3. One Editor for All</h2>
<blockquote>
<p>We think it is better to know one editor very well, and use it for all editing tasks: code, documentation, memos, system administration, and so on. Without a single editor, you face a potential modern day Babel of confusion.</p>
</blockquote>
<h2>4. Always Source Control. Always.</h2>
<blockquote>
<p>Always. Even if you are a single-person team on a one-week project. Even if it's a "throw-away" prototype. Even if the stuff you're working on isn't source code. Make sure that everything is under source code control— documentation, phone number lists, memos to vendors, makefiles, build and release procedures, that little shell script that burns the CD master— everything.</p>
</blockquote>
<h2>5. Things can Happen</h2>
<blockquote>
<p>It goes THIS CAN NEVER HAPPEN... "This code won't be used 30 years from now, so two-digit dates are fine." "This application will never be used abroad, so why internationalize it?" "count can't be negative." "This printf can't fail.". Let's not practice this kind of self-deception, particularly when coding.</p>
</blockquote>
<h2>6. Become a User</h2>
<blockquote>
<p>There's a simple technique for getting inside your users' requirements that isn't used often enough: become a user. Are you writing a system for the help desk? Spend a couple of days monitoring the phones with an experienced support person. Are you automating a manual stock control system? Work in the warehouse for a week. As well as giving you insight into how the system will really be used, you'd be amazed at how the request "May I sit in for a week while you do your job?" helps build trust and establishes a basis for communication with your users. Just remember not to get in the way!</p>
</blockquote>
<h2>7. Web Docs over Files</h2>
<blockquote>
<p>Web-based distribution also avoids the typical two-inch-thick binder entitled Requirements Analysis that no one ever reads and that becomes outdated the instant ink hits paper. If it's on the Web, the programmers may even read it.</p>
</blockquote>
<h2>8. Quality, quality, quality.</h2>
<blockquote>
<p>Teams as a whole should not tolerate broken windows— those small imperfections that no one fixes. The team must take responsibility for the quality of the product, supporting developers who understand the no broken windows</p>
<p>Some team methodologies have a quality officer— someone to whom the team delegates the responsibility for the quality of the deliverable. This is clearly <strong>ridiculous</strong>: quality can come only from the individual contributions of all team members.</p>
</blockquote>
<h2>9. Marketing the Project</h2>
<blockquote>
<p>There is a simple marketing trick that helps teams communicate as one: generate a brand. When you start a project, come up with a name for it, ideally something off-the-wall.</p>
</blockquote>
<h2>10. Manual Ensures Errors</h2>
<blockquote>
<p>A great way to ensure both consistency and accuracy is to automate everything the team does.</p>
</blockquote>
<h2>Reference</h2>
<p>Hunt, Andrew; Thomas, David. The Pragmatic Programmer: From Journeyman to Master. Pearson Education. Kindle Edition.</p>6 Take-Aways after Reading "The Signal and The Noise"2017-11-11T19:07:00+01:002017-11-11T19:07:00+01:00Marco Santonitag:www.marcosantoni.com,2017-11-11:/the_signal_and_the_noise.html<p><em>The Signal and The Noise</em> by Nate Silver is a must-read book for those interested in predictions. It is not a technical book. You will not learn any algorithm. However, it presents a series of real-world scenarios where predictions worked and where they did not. The book is well written and is full of valuable references to support its arguments.</p>
<p><img alt="The Signal and The Noise by Nate Silver" src="https://www.marcosantoni.com/images/signal_and_noise_book.jpg"></p>
<h2>1. Anyone can beat an index fund</h2>
<blockquote>
<p>After all, any investor can do as well as the average investor with almost no effort. All he needs to do is buy an index fund that tracks the average of the S&P500. In so doing he will come extremely close to replicating the average portfolio of every other trader, from Harvard MBAs to noise traders to George Soros' hedge fund manager. You have to be <em>really</em> good -or foolhardy- to turn that proposition down.</p>
</blockquote>
<h2>2. Bayesian statistics is less wrong</h2>
<blockquote>
<p>Recently, however, some well-respected statisticians have begun to argue that frequentist statistics should no longer be taught to undergraduates.</p>
</blockquote>
<p>Frequentist statistics emphasizes the purity of the experiment: every hypothesis could be tested to a perfect conclusion if only enough data were collected. These methods don't encourage us to think about the plausibility of our hypothesis.</p>
<h2>3. A bug made Deep Blue beat Kasparov</h2>
<blockquote>
<p>But what had inspired Kasparov to commit this mistake? His anxiety over Deep Blue's forty-fourth move in the first game - the move in which the computer had moved its rook for no apparent purpose. Kasparov had concluded that the counterintuitive play must be a sign of superior intelligence. He had never considered that it was simply a bug.</p>
</blockquote>
<h2>4. When predictions work - Weather</h2>
<p>Weather predictions do not rely on statistics, nor on machine learning. They employ heavy simulations. The earth is split into cells, and the meteorological dynamics are simulated via well-known models. The first weather simulation ever was done by the English physicist Lewis Fry Richardson in 1916.</p>
<p><img alt="Richardson's Matrix" src="https://www.marcosantoni.com/images/richardson_grid.jpg"></p>
<h2>5. When predictions don't work - Earthquakes</h2>
<blockquote>
<p>These processes may not literally be random, but they are so irreducibly complex (right down to the last grain of sand) that it just won't be possible to predict them beyond a certain level.</p>
</blockquote>
<h2>6. When predictions don't work - Economics</h2>
<p>Raw data for economics isn't much good.</p>
<blockquote>
<p>"Why do people [economists ed.] not give intervals? Because they're embarrassed"</p>
</blockquote>
<p>They are embarrassed because the intervals are just too large.</p>My Talk about Superset [Python Milano Meetup]2017-06-22T17:56:00+02:002017-06-22T17:56:00+02:00Marco Santonitag:www.marcosantoni.com,2017-06-22:/talk_python_pills.html<p>Yesterday, I gave a talk at the <a href="https://www.meetup.com/Python-Milano/events/239846600/">Python Milano Meetup</a>. The Meetup was designed as Python pills: three 20-minute talks in a row. The talks:</p>
<ul>
<li>Superset: data visualization at AirBnB - Marco Santoni</li>
<li>Java Vs Python - Cesare Placanica</li>
<li>pdb in action - <a href="https://twitter.com/greenkey">Lorenzo Mele</a></li>
</ul>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Very nice talk of <a href="https://twitter.com/Airbnb">@Airbnb</a> <a href="https://twitter.com/hashtag/Superset?src=hash">#Superset</a> with <a href="https://twitter.com/MrSantoni">@MrSantoni</a> at <a href="https://twitter.com/hashtag/PythonMilano?src=hash">#PythonMilano</a>. I see juicy applications for us <a href="https://twitter.com/hashtag/BIM?src=hash">#BIM</a> guys. <a href="https://t.co/Pf1r9nhNEd">https://t.co/Pf1r9nhNEd</a></p>— Chiara Rizzarda (@CrShelidon) <a href="https://twitter.com/CrShelidon/status/877595912612311041">June 21, 2017</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>I presented <a href="https://github.com/airbnb/superset">superset</a>, the open source project by AirBnB. It is a data visualization platform developed in Python. It lets you create interactive dashboards. The setup time is extremely short. It is interesting for enterprises because the package features deep and granular authorization policies. The dashboards can be designed by business users too. You can indeed design dashboards without writing SQL queries (but there's still the option to write SQL of course). <code>superset</code> integrates with most SQL databases through the <code>SQLAlchemy</code> query layer. Furthermore, the <code>druid.io</code> database is supported. I presented <a href="http://www.marcosantoni.com/monitoring_bus_frequencies_in_rome.html">atacmonitor</a> as an example of a <code>superset</code> application.</p>Manufacturing. When data is not a commodity2017-02-25T17:56:00+01:002017-02-25T17:56:00+01:00Marco Santonitag:www.marcosantoni.com,2017-02-25:/datadriveninnovation17.html<p>What does it mean to work as a data scientist in manufacturing? What is the value behind data? Data science has gained popularity in domains like the internet, but the industrial production domain has specific requirements.</p>
<p><img alt="Waiting times" src="https://www.marcosantoni.com/images/ddi_talk.jpg"></p>
<p>I gave a talk at <a href="http://2017.datadriveninnovation.org/">Data Driven Innovation</a> about the specific challenges when doing data science in manufacturing. I introduced the approach to data science that we are deploying at <a href="http://www.brembo.com/en">Brembo</a>. The talk was part of a track dedicated to Industry 4.0 and to IoT.</p>Weighted Random Sampling with PostgreSQL [Follow-up]2017-02-10T21:00:00+01:002017-02-10T21:00:00+01:00Marco Santonitag:www.marcosantoni.com,2017-02-10:/weighted_random_sampling_follow_up.html<blockquote>
<p>I received valuable feedback from <a href="https://www.linkedin.com/in/decibel/">Jim Nasby</a> regarding <a href="http://www.marcosantoni.com/2016/08/23/weighted-random-sampling-with-postgresql.html">the post</a> about weighted random sampling with PostgreSQL. I reproduce Jim's email here.</p>
</blockquote>
<p>Sadly, Common Table Expressions (CTEs) are <em>insanely</em> expensive, because
each one must be fully materialized. So in your example, you're
essentially creating 5 temp tables (one for each CTE). Obviously that's
not a big deal with only 4 weights and 1000 samples, but for other use
cases that overhead could really add up. Note that this is not the same
as the <code>OFFSET 0</code> trick...
You can get a similar breakdown of code by using subselects in <code>FROM</code>
clauses. That would look something like:</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span> <span class="n">color</span>
<span class="k">FROM</span> <span class="p">(</span><span class="o"><</span><span class="n">samples</span> <span class="n">code</span><span class="o">></span><span class="p">)</span> <span class="k">AS</span> <span class="n">samples</span>
<span class="k">JOIN</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="o"><</span><span class="n">cumulative_bounds</span> <span class="k">SELECT</span><span class="o">></span>
<span class="k">FROM</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="o"><</span><span class="n">sampling_cumulative_prob</span> <span class="k">SELECT</span><span class="o">></span>
<span class="k">FROM</span> <span class="p">(....)</span>
<span class="p">)</span> <span class="k">AS</span> <span class="n">sampling_cumulative_prob</span>
<span class="p">)</span> <span class="k">AS</span> <span class="n">cumulative_bounds</span> <span class="k">ON</span> <span class="p">...</span>
</code></pre></div>
<p>Not as nice as <code>WITH</code>, but not horrible. You can also create temporary
views for each of the intermediate steps.</p>
<p>In <code>weights_with_sum</code>, you can get rid of the <code>join</code> in favor of <code>sum(weight)
OVER() AS weight_sum</code>.</p>
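<p>A minimal sketch of that suggestion, run through Python's built-in <code>sqlite3</code> module instead of PostgreSQL. That substitution is an assumption made only to keep the snippet self-contained, and it needs an SQLite build with window-function support (3.25 or later). The table mirrors the <code>weights</code> table of the original post:</p>

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weights (color TEXT PRIMARY KEY, weight REAL)")
conn.executemany(
    "INSERT INTO weights (color, weight) VALUES (?, ?)",
    [("red", 8), ("blue", 3), ("green", 10), ("yellow", 10)],
)

# sum(weight) OVER () attaches the grand total to every row,
# so the join against a separate aggregate subquery is not needed.
rows = conn.execute(
    "SELECT color, weight, sum(weight) OVER () AS weight_sum "
    "FROM weights ORDER BY color"
).fetchall()

for color, weight, weight_sum in rows:
    print(color, weight, weight_sum)
```

<p>Every row carries the same <code>weight_sum</code>, which is what the normalization step of the original query needs.</p>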
<p>Finally, <code>random()</code> produces <code>0.0 <= x < 1.0</code>, so the bounds on the <code>numrange</code>
should be <code>'[)'</code>, not <code>'(]'</code>. Personally, I would just create the <code>numrange</code>
immediately in <code>cumulative_bounds</code>, but that's mostly just a matter of style.</p>
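<p>To make the point about bounds concrete, here is a small Python sketch. The helper names <code>in_range</code> and <code>matches</code> are hypothetical, and the bounds are rounded values taken from the original post's output. It mimics range membership for both bound styles and checks what happens to a sample of exactly <code>0.0</code>:</p>

```python
def in_range(x, lower, upper, bounds):
    """Mimic range membership for the two bound styles discussed above."""
    if bounds == "[)":
        return lower <= x < upper
    if bounds == "(]":
        return lower < x <= upper
    raise ValueError(f"unsupported bounds: {bounds}")


# Cumulative bounds from the original post (rounded for readability).
ranges = [
    ("blue", 0.0, 0.0968),
    ("green", 0.0968, 0.4194),
    ("red", 0.4194, 0.6774),
    ("yellow", 0.6774, 1.0),
]


def matches(x, bounds):
    """Return the colors whose range contains x under the given bound style."""
    return [color for color, lo, hi in ranges if in_range(x, lo, hi, bounds)]


print(matches(0.0, "[)"))  # the first range contains 0.0
print(matches(0.0, "(]"))  # no range contains 0.0, so the sample would be lost
```

<p>With <code>'[)'</code> every value that <code>random()</code> can return falls in exactly one range; with <code>'(]'</code> a sample of <code>0.0</code> matches nothing.</p>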
<p>BTW, if you've got <code>plpythonu</code> loaded there's probably an easier way to
generate the set of ranges, which could then be joined to the random
samples.</p>
<p>BTW, <code>width_bucket(operand anyelement, thresholds anyarray)</code> (see <em>second</em>
instance on <a href="https://www.postgresql.org/docs/current/static/functions-math.html">docs</a>)
might be even faster; it'd definitely be simpler:</p>
<div class="highlight"><pre><span></span><code><span class="c1">-- width_bucket returns 0 below the first threshold, hence the + 1 (arrays are 1-based)</span>
<span class="k">SELECT</span> <span class="n">colors</span><span class="p">[</span><span class="n">width_bucket</span><span class="p">(</span><span class="n">random</span><span class="p">(),</span> <span class="n">thresholds</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="k">AS</span> <span class="n">color</span>
<span class="k">FROM</span> <span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1000</span><span class="p">)</span>
<span class="p">,</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="n">array_agg</span><span class="p">(</span><span class="n">color</span><span class="p">)</span> <span class="k">AS</span> <span class="n">colors</span>
<span class="p">,</span> <span class="n">array_agg</span><span class="p">(</span><span class="n">cum_prob</span><span class="p">)</span> <span class="k">AS</span> <span class="n">thresholds</span>
<span class="k">FROM</span> <span class="n">sampling_cumulative_prob</span>
<span class="p">)</span> <span class="k">AS</span> <span class="n">prob</span><span class="p">;</span>
</code></pre></div>Monitoring Bus Frequencies in Rome2017-01-21T18:00:00+01:002017-01-21T18:00:00+01:00Marco Santonitag:www.marcosantoni.com,2017-01-21:/monitoring_bus_frequencies_in_rome.html<p>I have just launched <a href="http://atacmonitor.com/">atacmonitor</a>. It is a website providing information about the waiting time at bus stops in Rome.</p>
<p><img alt="Waiting times" src="https://www.marcosantoni.com/images/atacmonitor.gif"></p>
<h2>Overview</h2>
<p>The data source is live bus waiting-time data from ATAC, Rome's public transport company. The transport office provides a <a href="https://romamobilita.it/it/azienda/open-data/api-real-time">public API</a> with real-time data.</p>
<p>I have implemented a <a href="https://github.com/Marco-Santoni/atacmonitor-data">simple application</a> that regularly pulls this data and stores it in a PostgreSQL database. The data is presented via AirBnB's <a href="http://airbnb.io/superset/">Superset</a>, an open source visualization platform. I deployed the application on the <a href="https://www.heroku.com">Heroku</a> PaaS.</p>
<p>I have kicked off the project, and only a few bus stops are being monitored so far. The goal is to have all bus stops monitored soon.</p>Blog Migrated to Pelican on GitHub Pages2016-12-28T15:38:00+01:002016-12-28T15:38:00+01:00Marco Santonitag:www.marcosantoni.com,2016-12-28:/migrated_to_pelican.html<p>I have migrated my blog. It is built with <a href="http://blog.getpelican.com/">Pelican</a>, a static site generator. It allows me to write posts as plain markdown or even Jupyter notebooks. I then use <a href="https://pages.github.com/">GitHub Pages</a> to version and publish the blog. I am continuing to use <a href="https://www.aruba.it/home.aspx">Aruba</a> as domain provider. It is sufficient to set the <code>CNAME</code> and the <code>ANAME</code> records to serve the blog under the <code>marcosantoni.com</code> domain.</p>
<p>The migration <a href="http://mathamy.com/migrating-to-github-pages-using-pelican.html">from WordPress to Pelican</a> was sped up by the <code>pelican-import</code> plugin. <a href="https://fedoramagazine.org/make-github-pages-blog-with-pelican/">This blog post</a> is a good reference for deploying a Pelican blog on GitHub Pages.</p>Insights from IEEE Big Data 162016-12-26T16:22:00+01:002016-12-26T16:22:00+01:00Marco Santonitag:www.marcosantoni.com,2016-12-26:/ieee_big_data_16.html<p>I attended the IEEE Big Data 16 conference in Washington DC. I thank my company for sponsoring the trip. The conference included a <a href="http://cci.drexel.edu/bigdata/bigdata2016/SpecialSymposium.html">special symposium</a> dedicated to manufacturing. The symposium hosted some participants of the <a href="https://www.kaggle.com/c/bosch-production-line-performance">Bosch Production Line Performance</a> competition from Kaggle.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">2016 IEEE International Conference on Big Data kicked off today in Washington, DC. Share highlights w/ hashtag <a href="https://twitter.com/hashtag/IEEEBigData16?src=hash">#IEEEBigData16</a> & we’ll RT!</p>— IEEE Big Data (@ieeebigdata) <a href="https://twitter.com/ieeebigdata/status/805799488128425984">December 5, 2016</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>I'll list here a few notes I took during the conference.</p>
<ul>
<li><strong>Stream Processing.</strong> I heard about today's most popular architectures, and I highly recommend reading the blog posts by their authors:<ul>
<li><a href="http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html">Lambda architecture</a></li>
<li><a href="https://www.oreilly.com/ideas/questioning-the-lambda-architecture">Kappa architecture</a></li>
</ul>
</li>
<li><strong>K-Spectral Centroid.</strong> The K-Spectral Centroid algorithm clusters time series by their shape, and finds the most representative shape (the cluster centroid) for each cluster.</li>
<li><strong>K-D Tree partition:</strong> an algorithm for space partitioning.</li>
<li><strong>Database Decay.</strong> Interesting keynote by Michael Stonebraker. In short, large applications often share a centralized database used by different groups within a company. The DBA's point of view:<ul>
<li>High Risk. When changing a DB schema, I need to find every application across the company that depends on it and update it accordingly (do I have the budget for that?).</li>
<li>Low Risk. I leave the schema unchanged and work around the problem in the data.</li>
<li>Claim. DBAs want to lower the risk --> no change in schema --> the ER diagram diverges from reality --> database decay.</li>
<li>At some point, a total rewrite is the only way forward.</li>
<li>If you work in analytics and get your data from an operational DB, you realize the data is getting dirtier and dirtier.</li>
</ul>
</li>
<li><strong>PMML Scoring Engine.</strong> Max Ferguson introduced the Predictive Model Markup Language (PMML). Basically, if you train a model and want to share it with a different application, PMML is a standard that defines how models should be stored as XML.</li>
<li><strong>Uncertainty in RFs.</strong> Random Forests can express uncertainty. One just needs to look at the distribution of predictions across the decision trees of the model.</li>
<li><strong>Bosch.</strong> Rumi Ghosh introduced the data science team at Bosch.<ul>
<li>Insight from production plants: plant managers prefer interpretable models (logistic regression or decision tree) over black box models.</li>
<li>Research directions:<ul>
<li>Root cause analysis (via Bayesian inference)</li>
<li>Class imbalance</li>
</ul>
</li>
</ul>
</li>
<li><strong>3 Approaches in Kaggle Competition.</strong> <a href="https://www.kaggle.com/bpavlyshenko">Bohdan Pavlyshenko</a> gave a talk on the three approaches he explored during the Kaggle competition about failure detection:<ul>
<li>Pure machine learning approach. Two levels of model ensembling, a pure black box.</li>
<li>Generalized Linear Model with Lasso regularization. Informative about feature impact.</li>
<li>Bayesian model in BUGS. It yields an estimate of the probability distribution of each coefficient.</li>
</ul>
</li>
<li><strong>FTRL.</strong> Follow the Regularized Leader: an online learning algorithm, presented here as a feature engineering step that converts all categorical features into one numerical feature.</li>
<li><strong>CRF.</strong> Conditional Random Fields are a class of predictive models used when the dataset is represented as a graph. Each node is a sample with a feature vector X and a target variable y.</li>
</ul>Weighted Random Sampling with PostgreSQL2016-08-23T16:22:00+02:002016-08-23T16:22:00+02:00Marco Santonitag:www.marcosantoni.com,2016-08-23:/2016/08/23/weighted-random-sampling-with-postgresql.html<p>You have a table like the following:</p>
<div class="highlight"><pre><span></span><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">weights</span> <span class="p">(</span>
<span class="n">color</span> <span class="nb">varchar</span> <span class="k">primary</span> <span class="k">key</span><span class="p">,</span>
<span class="n">weight</span> <span class="nb">float</span>
<span class="p">);</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">weights</span> <span class="p">(</span><span class="n">color</span><span class="p">,</span> <span class="n">weight</span><span class="p">)</span>
<span class="k">VALUES</span>
<span class="p">(</span><span class="s1">'red'</span><span class="p">,</span> <span class="mi">8</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'blue'</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'green'</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'yellow'</span><span class="p">,</span> <span class="mi">10</span><span class="p">);</span>
</code></pre></div>
<p>The table lists the weights associated with certain colors. Imagine a
weight representing how much you like that color.</p>
<p>Now, you want to add 1000 colored tiles to your website. You want the
color of the tiles to be <strong>sampled at random</strong> according to the
<em>weights</em> table.</p>
<p>We'll write a PostgreSQL script that implements such random sampling.
I'll write the <strong>entire query first</strong>, and then explain each part
separately.</p>
<div class="highlight"><pre><span></span><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">sampled_colors</span> <span class="k">AS</span>
<span class="k">WITH</span> <span class="n">weights_with_sum</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">color</span><span class="p">,</span>
<span class="n">weight</span><span class="p">,</span>
<span class="n">weight_sum</span>
<span class="k">FROM</span> <span class="n">weights</span>
<span class="k">CROSS</span> <span class="k">JOIN</span> <span class="p">(</span><span class="k">SELECT</span> <span class="k">sum</span><span class="p">(</span><span class="n">weight</span><span class="p">)</span> <span class="k">AS</span> <span class="n">weight_sum</span> <span class="k">FROM</span> <span class="n">weights</span><span class="p">)</span> <span class="n">s</span>
<span class="p">),</span>
<span class="n">sampling_probability</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">color</span><span class="p">,</span>
<span class="n">weight</span> <span class="o">/</span> <span class="n">weight_sum</span> <span class="k">AS</span> <span class="n">prob</span>
<span class="k">FROM</span> <span class="n">weights_with_sum</span>
<span class="p">),</span>
<span class="n">sampling_cumulative_prob</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">color</span><span class="p">,</span>
<span class="k">sum</span><span class="p">(</span><span class="n">prob</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">order</span> <span class="k">by</span> <span class="n">color</span><span class="p">)</span> <span class="k">AS</span> <span class="n">cum_prob</span>
<span class="k">FROM</span> <span class="n">sampling_probability</span>
<span class="p">),</span>
<span class="n">cumulative_bounds</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">color</span><span class="p">,</span>
<span class="n">COALESCE</span><span class="p">(</span>
<span class="n">lag</span><span class="p">(</span><span class="n">cum_prob</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="n">cum_prob</span><span class="p">,</span> <span class="n">color</span><span class="p">),</span>
<span class="mi">0</span>
<span class="p">)</span> <span class="k">AS</span> <span class="n">lower_cum_bound</span><span class="p">,</span>
<span class="n">cum_prob</span> <span class="k">AS</span> <span class="n">upper_cum_bound</span>
<span class="k">FROM</span> <span class="n">sampling_cumulative_prob</span>
<span class="p">),</span>
<span class="n">samples</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span> <span class="k">AS</span> <span class="n">sample_idx</span><span class="p">,</span>
<span class="n">random</span><span class="p">()</span> <span class="k">AS</span> <span class="n">sample</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">color</span>
<span class="k">FROM</span> <span class="n">samples</span>
<span class="k">JOIN</span> <span class="n">cumulative_bounds</span> <span class="k">ON</span>
<span class="n">sample</span><span class="p">::</span><span class="nb">numeric</span> <span class="o"><@</span> <span class="n">numrange</span><span class="p">(</span><span class="n">lower_cum_bound</span><span class="p">::</span><span class="nb">numeric</span><span class="p">,</span>
<span class="n">upper_cum_bound</span><span class="p">::</span><span class="nb">numeric</span><span class="p">,</span> <span class="s1">'(]'</span><span class="p">);</span>
</code></pre></div>
<p>Let's look at one piece at a time.</p>
<div class="highlight"><pre><span></span><code><span class="k">WITH</span> <span class="n">weights_with_sum</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">color</span><span class="p">,</span>
<span class="n">weight</span><span class="p">,</span>
<span class="n">weight_sum</span>
<span class="k">FROM</span> <span class="n">weights</span>
<span class="k">CROSS</span> <span class="k">JOIN</span> <span class="p">(</span><span class="k">SELECT</span> <span class="k">sum</span><span class="p">(</span><span class="n">weight</span><span class="p">)</span> <span class="k">AS</span> <span class="n">weight_sum</span> <span class="k">FROM</span> <span class="n">weights</span><span class="p">)</span> <span class="n">s</span>
<span class="p">),</span>
<span class="n">sampling_probability</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">color</span><span class="p">,</span>
<span class="n">weight</span> <span class="o">/</span> <span class="n">weight_sum</span> <span class="k">AS</span> <span class="n">prob</span>
<span class="k">FROM</span> <span class="n">weights_with_sum</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="n">sampling_probability</span><span class="p">;</span>
<span class="c1">-- output:</span>
<span class="n">color</span> <span class="o">|</span> <span class="n">prob</span>
<span class="c1">--------+--------------------</span>
<span class="n">red</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">258064516129032</span>
<span class="n">blue</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0967741935483871</span>
<span class="n">green</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">32258064516129</span>
<span class="n">yellow</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">32258064516129</span>
</code></pre></div>
<p>Here, we're just normalizing the weights. Each weight is divided by the
total sum of the weights. In this way, we are re-writing each weight as
a <strong>discrete probability</strong> of that color being sampled.</p>
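<p>For readers who prefer to check the arithmetic outside the database, the same normalization can be sketched in a few lines of Python. The dictionary simply copies the <code>weights</code> table, and the variable names are illustrative:</p>

```python
# Weights copied from the `weights` table above.
weights = {"red": 8, "blue": 3, "green": 10, "yellow": 10}

# Dividing each weight by the total turns it into a probability.
total = sum(weights.values())
probs = {color: weight / total for color, weight in weights.items()}

for color, prob in probs.items():
    print(f"{color:>6} | {prob:.6f}")
```

<p>The probabilities sum to 1, and red gets 8/31, which agrees with the query output above.</p>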
<div class="highlight"><pre><span></span><code><span class="p">...</span>
<span class="n">sampling_cumulative_prob</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">color</span><span class="p">,</span>
<span class="k">sum</span><span class="p">(</span><span class="n">prob</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">order</span> <span class="k">by</span> <span class="n">color</span><span class="p">)</span> <span class="k">AS</span> <span class="n">cum_prob</span>
<span class="k">FROM</span> <span class="n">sampling_probability</span>
<span class="p">),</span>
<span class="n">cumulative_bounds</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">color</span><span class="p">,</span>
<span class="n">COALESCE</span><span class="p">(</span>
<span class="n">lag</span><span class="p">(</span><span class="n">cum_prob</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="k">ORDER</span> <span class="k">BY</span> <span class="n">cum_prob</span><span class="p">,</span> <span class="n">color</span><span class="p">),</span>
<span class="mi">0</span>
<span class="p">)</span> <span class="k">AS</span> <span class="n">lower_cum_bound</span><span class="p">,</span>
<span class="n">cum_prob</span> <span class="k">AS</span> <span class="n">upper_cum_bound</span>
<span class="k">FROM</span> <span class="n">sampling_cumulative_prob</span>
<span class="p">)</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="n">cumulative_bounds</span><span class="p">;</span>
<span class="c1">-- output:</span>
<span class="n">color</span> <span class="o">|</span> <span class="n">lower_cum_bound</span> <span class="o">|</span> <span class="n">upper_cum_bound</span>
<span class="c1">--------+--------------------+--------------------</span>
<span class="n">blue</span> <span class="o">|</span> <span class="mi">0</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0967741935483871</span>
<span class="n">green</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0967741935483871</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">419354838709677</span>
<span class="n">red</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">419354838709677</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">67741935483871</span>
<span class="n">yellow</span> <span class="o">|</span> <span class="mi">0</span><span class="p">.</span><span class="mi">67741935483871</span> <span class="o">|</span> <span class="mi">1</span>
</code></pre></div>
<p>In this piece of code, we're representing the weights as a
<strong>cumulative</strong> distribution function: each color gets the interval between the previous cumulative probability and its own.</p>
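<p>As a rough Python counterpart of the two CTEs above, the running sum and the lagged lower bound could be computed as follows. The fractions are assumed values that reproduce the printed probabilities.</p>

```python
from itertools import accumulate

# Assumed probabilities (they reproduce the values printed above),
# listed in alphabetical order of color like the window function.
probabilities = [
    ("blue", 3 / 31),
    ("green", 10 / 31),
    ("red", 8 / 31),
    ("yellow", 10 / 31),
]

colors = [color for color, _ in probabilities]

# Running sum of the probabilities: the upper bound of each color.
upper = list(accumulate(prob for _, prob in probabilities))

# The lower bound is the previous upper bound, and 0 for the first row,
# which mirrors COALESCE(lag(cum_prob), 0) in the query.
lower = [0.0] + upper[:-1]

bounds = list(zip(colors, lower, upper))
for row in bounds:
    print(row)
```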
<div class="highlight"><pre><span></span><code><span class="p">...</span>
<span class="n">samples</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">generate_series</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span> <span class="k">AS</span> <span class="n">sample_idx</span><span class="p">,</span>
<span class="n">random</span><span class="p">()</span> <span class="k">AS</span> <span class="n">sample</span>
<span class="p">)</span>
<span class="k">SELECT</span>
<span class="n">color</span>
<span class="k">FROM</span> <span class="n">samples</span>
<span class="k">JOIN</span> <span class="n">cumulative_bounds</span> <span class="k">ON</span>
<span class="n">sample</span><span class="p">::</span><span class="nb">numeric</span> <span class="o"><@</span> <span class="n">numrange</span><span class="p">(</span><span class="n">lower_cum_bound</span><span class="p">::</span><span class="nb">numeric</span><span class="p">,</span>
<span class="n">upper_cum_bound</span><span class="p">::</span><span class="nb">numeric</span><span class="p">,</span> <span class="s1">'(]'</span><span class="p">);</span>
</code></pre></div>
<p>In the last part, we're drawing 1000 random numbers between 0
and 1. We then assign each sample to the color whose cumulative
range contains it. For example, if the first sample
is 0.45, it will match the <em>'red'</em> range (0.42-0.68). Therefore, that
sample will be <em>'red'</em>.</p>
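<p>The range lookup can be sketched in Python with a binary search over the upper bounds. This is not what PostgreSQL does internally, just an equivalent way to find the matching range. The bounds are rounded from the table above, and the seed is an arbitrary choice.</p>

```python
import random
from bisect import bisect_left

colors = ["blue", "green", "red", "yellow"]
# Upper cumulative bounds, rounded from the table above.
upper_bounds = [0.0968, 0.4194, 0.6774, 1.0]


def color_for(sample):
    """Return the color whose (lower, upper] range contains the sample.

    Unlike the SQL range, a sample of exactly 0 maps to the first color.
    """
    # bisect_left finds the first upper bound that is >= sample.
    return colors[bisect_left(upper_bounds, sample)]


rng = random.Random(42)  # fixed seed only to make the sketch repeatable
sampled_colors = [color_for(rng.random()) for _ in range(1000)]
print(color_for(0.45), sampled_colors[:5])
```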
<p>The result of the query is a table filled with 1000 colors sampled at
random based on the weights.</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="n">sampled_colors</span>
<span class="k">LIMIT</span> <span class="mi">10</span><span class="p">;</span>
<span class="c1">-- output:</span>
<span class="n">color</span>
<span class="c1">--------</span>
<span class="n">green</span>
<span class="n">green</span>
<span class="n">red</span>
<span class="n">yellow</span>
<span class="n">yellow</span>
<span class="n">green</span>
<span class="n">blue</span>
<span class="n">red</span>
<span class="n">red</span>
<span class="n">red</span>
</code></pre></div>
<p>Can we check that the result is correct? Were the weights really taken
into account?</p>
<div class="highlight"><pre><span></span><code><span class="k">SELECT</span>
<span class="n">color</span><span class="p">,</span>
<span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span>
<span class="k">FROM</span> <span class="n">sampled_colors</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="mi">1</span><span class="p">;</span>
<span class="c1">-- output:</span>
<span class="n">color</span> <span class="o">|</span> <span class="k">count</span>
<span class="c1">--------+-------</span>
<span class="n">yellow</span> <span class="o">|</span> <span class="mi">309</span>
<span class="n">green</span> <span class="o">|</span> <span class="mi">320</span>
<span class="n">red</span> <span class="o">|</span> <span class="mi">276</span>
<span class="n">blue</span> <span class="o">|</span> <span class="mi">95</span>
</code></pre></div>
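<p>The same kind of check can be sketched in Python by comparing observed frequencies with the expected probabilities. Here <code>random.choices</code> stands in for the SQL sampling, and the probabilities are assumed values matching the table above.</p>

```python
import random
from collections import Counter

# Assumed probabilities, matching the discrete probability table above.
probabilities = {"blue": 3 / 31, "green": 10 / 31, "red": 8 / 31, "yellow": 10 / 31}

rng = random.Random(0)  # arbitrary seed, only for repeatability
sampled = rng.choices(
    list(probabilities), weights=list(probabilities.values()), k=1000
)

counts = Counter(sampled)
observed = {color: counts[color] / len(sampled) for color in probabilities}

for color, prob in probabilities.items():
    print(f"{color}: expected {prob:.3f}, observed {observed[color]:.3f}")
```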
<p>The proportion of samples is quite close to the proportion of the
weights. This similarity is clear if we compare this table with the
discrete probability table above.</p>
<h1>Applied Bayesian Inference with PyMC [video]</h1>
<p>Marco Santoni, 2016-06-30</p>
<p>I was glad to give an intro to Bayesian Inference at PyData Florence
2016. The video of the talk is now out.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/BX1MjMDKhXU" frameborder="0" allowfullscreen></iframe>
<h1>A Simple Machine Learning Pipeline</h1>
<p>Marco Santoni, 2016-06-19</p>
<p>This post contains the code that I used in my talk at Python Milano
Meetup on <a href="http://www.meetup.com/Python-Milano/events/231710577/">June 22nd
2016</a>. The talk
was a quick overview of <strong>Pipeline</strong>, a nice API by <em>scikit-learn</em> to
abstract your machine learning algorithm. It is based on the Boston
<a href="https://archive.ics.uci.edu/ml/datasets/Housing">Housing Data Set</a>.</p>
<p>We'll just load the data set from <em>sklearn</em>.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">load_boston</span>
<span class="n">housing_data</span> <span class="o">=</span> <span class="n">load_boston</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="n">housing_data</span><span class="o">.</span><span class="n">DESCR</span><span class="p">)</span>
</code></pre></div>
<p>We might want to make it a Pandas dataframe to make things easier to
handle.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">housing_data</span><span class="o">.</span><span class="n">data</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="n">housing_data</span><span class="o">.</span><span class="n">feature_names</span>
<span class="n">df</span><span class="p">[</span><span class="s1">'PRICE'</span><span class="p">]</span> <span class="o">=</span> <span class="n">housing_data</span><span class="o">.</span><span class="n">target</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div>
<p><img alt="table" src="https://www.marcosantoni.com/images/table.png"></p>
<p>The goal is to predict the <em>PRICE</em> variable given the other features.
How does this variable distribute?</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="n">df</span><span class="o">.</span><span class="n">PRICE</span><span class="o">.</span><span class="n">hist</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'PRICE'</span><span class="p">)</span>
</code></pre></div>
<p><img alt="download (8)" src="https://www.marcosantoni.com/images/download-8.png"></p>
<p>Let's turn the dataframe into a ML-friendly notation.</p>
<div class="highlight"><pre><span></span><code><span class="n">X</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s1">'PRICE'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">'PRICE'</span><span class="p">]</span>
</code></pre></div>
<p>We will now define the metric that assesses the accuracy of our
algorithm/pipeline. Let's use the good old cross validation, scored with the mean squared error (lower is better).</p>
<div class="highlight"><pre><span></span><code><span class="c1"># sklearn.cross_validation was replaced by sklearn.model_selection (scikit-learn 0.18+)</span>
<span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">model_selection</span>

<span class="k">def</span> <span class="nf">evaluate_model</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">algorithm</span><span class="p">):</span>
    <span class="nb">print</span><span class="p">(</span><span class="s1">'Mean Squared Error'</span><span class="p">)</span>
    <span class="c1"># The scorer returns the negated MSE, hence the sign flips below; 3 folds match the outputs shown later</span>
    <span class="n">scores</span> <span class="o">=</span> <span class="n">model_selection</span><span class="o">.</span><span class="n">cross_val_score</span><span class="p">(</span><span class="n">algorithm</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span>
        <span class="n">scoring</span><span class="o">=</span><span class="s1">'neg_mean_squared_error'</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="o">-</span><span class="n">scores</span><span class="p">)</span>
    <span class="c1"># 'Accuracy' is the mean MSE over the folds: lower is better</span>
    <span class="nb">print</span><span class="p">(</span><span class="s1">'Accuracy: </span><span class="si">%0.2f</span><span class="s1">'</span> <span class="o">%</span> <span class="o">-</span><span class="n">scores</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
</code></pre></div>
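<p>To make the idea behind the metric concrete, here is a minimal pure-Python sketch of a cross-validated mean squared error for a one-feature linear fit. It is not how scikit-learn implements <em>cross_val_score</em>. The helper names, the contiguous three-fold split and the toy data are all assumptions made for illustration.</p>

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    return a, mean_y - a * mean_x


def cross_val_mse(xs, ys, n_folds=3):
    """Return one mean squared error per fold, using contiguous folds."""
    n = len(xs)
    scores = []
    for k in range(n_folds):
        start, stop = k * n // n_folds, (k + 1) * n // n_folds
        # Train on everything outside the fold, test on the fold itself.
        a, b = fit_line(xs[:start] + xs[stop:], ys[:start] + ys[stop:])
        errors = [(a * x + b - y) ** 2 for x, y in zip(xs[start:stop], ys[start:stop])]
        scores.append(sum(errors) / len(errors))
    return scores


# Made-up toy data, not the Boston data set.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

scores = cross_val_mse(xs, ys)
print(scores, sum(scores) / len(scores))
```

<p>Each fold is held out once while the line is fitted on the remaining points, so every score measures the error on data the fit has not seen.</p>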
<p>So, now, we can try a bunch of algorithms and see which one works best
by calling <em>evaluate_model</em>. It is now time to implement a first
algorithm. So, let's explore the data set a bit. Is there any pattern we
can exploit?</p>
<div class="highlight"><pre><span></span><code><span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">7</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'RM'</span><span class="p">],</span> <span class="n">y</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'Average number of rooms'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">'Housing price in \$1000</span><span class="se">\'</span><span class="s1">s'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div>
<p><img alt="download" src="https://www.marcosantoni.com/images/download.png"></p>
<p>As expected, there is a relation between the average number of rooms and
the median price. So, let's build the first algorithm.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">sklearn.pipeline</span> <span class="kn">import</span> <span class="n">make_pipeline</span>
<span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">FunctionTransformer</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LinearRegression</span>

<span class="k">def</span> <span class="nf">just_RM_column</span><span class="p">(</span><span class="n">X</span><span class="p">):</span>
    <span class="n">RM_col_index</span> <span class="o">=</span> <span class="mi">5</span>
    <span class="c1"># np.asarray lets the positional slice work for DataFrames as well as arrays</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">X</span><span class="p">)[:,</span> <span class="p">[</span><span class="n">RM_col_index</span><span class="p">]]</span>

<span class="n">pipe</span> <span class="o">=</span> <span class="n">make_pipeline</span><span class="p">(</span>
<span class="n">FunctionTransformer</span><span class="p">(</span><span class="n">just_RM_column</span><span class="p">),</span>
<span class="n">LinearRegression</span><span class="p">()</span>
<span class="p">)</span>
</code></pre></div>
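<p>Conceptually, a pipeline applies its transformation steps in order and hands the result to the final model, both when fitting and when predicting. The toy classes below are a sketch of that idea in plain Python. They are hypothetical stand-ins (<code>MiniPipeline</code>, <code>MeanModel</code>, <code>keep_column</code>) and not scikit-learn's implementation.</p>

```python
class MiniPipeline:
    """Toy stand-in for a pipeline: transform steps, then a final model."""

    def __init__(self, transforms, model):
        self.transforms = transforms
        self.model = model

    def _transform(self, X):
        # Apply every transformation step in order.
        for transform in self.transforms:
            X = transform(X)
        return X

    def fit(self, X, y):
        self.model.fit(self._transform(X), y)
        return self

    def predict(self, X):
        # The same transformations are replayed at prediction time.
        return self.model.predict(self._transform(X))


class MeanModel:
    """Hypothetical model that always predicts the mean training target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def keep_column(index):
    """Build a transformation that keeps a single column of each row."""
    return lambda X: [[row[index]] for row in X]


pipe = MiniPipeline([keep_column(1)], MeanModel())
pipe.fit([[0, 5.0], [0, 7.0]], [20.0, 30.0])
predictions = pipe.predict([[0, 6.0]])
print(predictions)
```

<p>Because the selection step lives inside the pipeline, the caller passes the full feature matrix to both <code>fit</code> and <code>predict</code>, which is the same convenience the real pipeline offers.</p>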
<p>How well does it perform?</p>
<div class="highlight"><pre><span></span><code><span class="n">evaluate_model</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">pipe</span><span class="p">)</span>
<span class="sd">'''</span>
<span class="sd">Mean Squared Error</span>
<span class="sd">[43.19492771 41.72813479 46.89293772]</span>
<span class="sd">Accuracy: 43.94</span>
<span class="sd">'''</span>
</code></pre></div>
<p>Can we visualize what the pipeline is actually doing?</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>

<span class="k">def</span> <span class="nf">plot_model_RM</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">pipe</span><span class="p">):</span>
    <span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span>
        <span class="n">X</span><span class="p">,</span>
        <span class="n">y</span><span class="p">,</span>
        <span class="n">test_size</span><span class="o">=</span><span class="mf">0.33</span><span class="p">,</span>
        <span class="n">random_state</span><span class="o">=</span><span class="mi">5</span>
    <span class="p">)</span>
    <span class="n">pipe</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
    <span class="c1"># Replace RM (column 5) with an evenly spaced grid so the fitted curve can be drawn</span>
    <span class="n">fake_X_train</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
    <span class="n">fake_X_train</span><span class="p">[:,</span> <span class="mi">5</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">fake_X_train</span><span class="p">[:,</span> <span class="mi">5</span><span class="p">]),</span>
        <span class="nb">max</span><span class="p">(</span><span class="n">fake_X_train</span><span class="p">[:,</span> <span class="mi">5</span><span class="p">]),</span> <span class="n">num</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">fake_X_train</span><span class="p">[:,</span> <span class="mi">5</span><span class="p">]))</span>
    <span class="n">fake_X_test</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
    <span class="n">fake_X_test</span><span class="p">[:,</span> <span class="mi">5</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">fake_X_test</span><span class="p">[:,</span> <span class="mi">5</span><span class="p">]),</span>
        <span class="nb">max</span><span class="p">(</span><span class="n">fake_X_test</span><span class="p">[:,</span> <span class="mi">5</span><span class="p">]),</span> <span class="n">num</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">fake_X_test</span><span class="p">[:,</span> <span class="mi">5</span><span class="p">]))</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span><span class="mi">7</span><span class="p">))</span>
    <span class="c1"># Left panel: training data (blue) and model predictions (red)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="s1">'RM'</span><span class="p">],</span> <span class="n">y_train</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">fake_X_train</span><span class="p">[:,</span> <span class="mi">5</span><span class="p">],</span> <span class="n">pipe</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">fake_X_train</span><span class="p">),</span>
        <span class="n">color</span><span class="o">=</span><span class="s1">'r'</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'Average number of rooms'</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">'Housing price in \$1000</span><span class="se">\'</span><span class="s1">s'</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">'Train Data Set'</span><span class="p">)</span>
    <span class="c1"># Right panel: the same plot for the held-out test data</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="s1">'RM'</span><span class="p">],</span> <span class="n">y_test</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">fake_X_test</span><span class="p">[:,</span> <span class="mi">5</span><span class="p">],</span> <span class="n">pipe</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">fake_X_test</span><span class="p">),</span>
        <span class="n">color</span><span class="o">=</span><span class="s1">'r'</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'Average number of rooms'</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">'Housing price in \$1000</span><span class="se">\'</span><span class="s1">s'</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s1">'Test Data Set'</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>

<span class="n">plot_model_RM</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">pipe</span><span class="p">)</span>
</code></pre></div>
<p><img alt="download (1)" src="https://www.marcosantoni.com/images/download-1.png"></p>
<p>We now do a bit of feature engineering: we add the square of each selected feature as an extra column.</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">add_squared_col</span><span class="p">(</span><span class="n">X</span><span class="p">):</span>
    <span class="c1"># Append the element-wise square of X as extra columns</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">((</span><span class="n">X</span><span class="p">,</span> <span class="n">X</span><span class="o">**</span><span class="mi">2</span><span class="p">))</span>

<span class="n">pipe</span> <span class="o">=</span> <span class="n">make_pipeline</span><span class="p">(</span>
<span class="n">FunctionTransformer</span><span class="p">(</span><span class="n">just_RM_column</span><span class="p">),</span>
<span class="n">FunctionTransformer</span><span class="p">(</span><span class="n">add_squared_col</span><span class="p">),</span>
<span class="n">LinearRegression</span><span class="p">()</span>
<span class="p">)</span>
</code></pre></div>
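<p>To see what the extra step produces, here is a plain-Python sketch of the same transformation on a tiny made-up input (lists instead of NumPy arrays, values chosen arbitrarily).</p>

```python
def add_squared_col(rows):
    """Append the square of every feature to each row."""
    return [row + [value ** 2 for value in row] for row in rows]


# Two made-up rows with a single feature each.
rows = [[2.0], [3.0]]
expanded = add_squared_col(rows)
print(expanded)
```

<p>With both the feature and its square available, the linear regression can fit a parabola in the original feature instead of a straight line.</p>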
<p>We evaluate this other pipeline.</p>
<div class="highlight"><pre><span></span><code><span class="n">evaluate_model</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">pipe</span><span class="p">)</span>
<span class="sd">'''</span>
<span class="sd">Mean Squared Error</span>
<span class="sd">[ 40.31207562 36.75642688 40.75444834]</span>
<span class="sd">Accuracy: 39.27'''</span>
</code></pre></div>
<p>And we see how the algorithm is fitting the data set.</p>
<div class="highlight"><pre><span></span><code><span class="n">plot_model_RM</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">pipe</span><span class="p">)</span>
</code></pre></div>
<p><img alt="download (2)" src="https://www.marcosantoni.com/images/download-2.png"></p>
<p>We now try a different model, a <em>decision tree</em>.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">sklearn.tree</span> <span class="kn">import</span> <span class="n">DecisionTreeRegressor</span>
<span class="n">pipe</span> <span class="o">=</span> <span class="n">make_pipeline</span><span class="p">(</span>
<span class="n">FunctionTransformer</span><span class="p">(</span><span class="n">just_RM_column</span><span class="p">),</span>
<span class="n">FunctionTransformer</span><span class="p">(</span><span class="n">add_squared_col</span><span class="p">),</span>
<span class="n">DecisionTreeRegressor</span><span class="p">(</span><span class="n">max_depth</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">evaluate_model</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">pipe</span><span class="p">)</span>
<span class="sd">'''</span>
<span class="sd">Mean Squared Error</span>
<span class="sd">[ 57.28366371 61.5437311 84.32756118]</span>
<span class="sd">Accuracy: 67.72</span>
<span class="sd">'''</span>
<span class="n">plot_model_RM</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">pipe</span><span class="p">)</span>
</code></pre></div>
<p><img alt="download (3)" src="https://www.marcosantoni.com/images/download-3.png"></p>
<p>We now explore a second feature: <em>INDUS</em>.</p>
<div class="highlight"><pre><span></span><code><span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">7</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'INDUS'</span><span class="p">],</span> <span class="n">y</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'Proportion of non-retail business acres per town'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">'Housing price in \$1000</span><span class="se">\'</span><span class="s1">s'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div>
<p><img alt="download (4)" src="https://www.marcosantoni.com/images/download-4.png"></p>
<p>So, we see another relation between <em>INDUS</em> and <em>PRICE</em>. So, let's add
this second feature.</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">RM_and_INDUS_cols</span><span class="p">(</span><span class="n">X</span><span class="p">):</span>
    <span class="n">RM_col_index</span> <span class="o">=</span> <span class="mi">5</span>
    <span class="n">INDUS_col_index</span> <span class="o">=</span> <span class="mi">2</span>
    <span class="c1"># np.asarray lets the positional slice work for DataFrames as well as arrays</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">X</span><span class="p">)[:,</span> <span class="p">[</span><span class="n">RM_col_index</span><span class="p">,</span> <span class="n">INDUS_col_index</span><span class="p">]]</span>

<span class="n">pipe</span> <span class="o">=</span> <span class="n">make_pipeline</span><span class="p">(</span>
<span class="n">FunctionTransformer</span><span class="p">(</span><span class="n">RM_and_INDUS_cols</span><span class="p">),</span>
<span class="n">FunctionTransformer</span><span class="p">(</span><span class="n">add_squared_col</span><span class="p">),</span>
<span class="n">LinearRegression</span><span class="p">()</span>
<span class="p">)</span>
<span class="n">evaluate_model</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">pipe</span><span class="p">)</span>
<span class="sd">'''</span>
<span class="sd">Mean Squared Error</span>
<span class="sd">[ 32.3420789 31.4260901 35.95835866]</span>
<span class="sd">Accuracy: 33.24</span>
<span class="sd">'''</span>
</code></pre></div>
<p>Now, plotting a model in 3D needs a bit more effort.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="k">def</span> <span class="nf">plot_model_RM_INDUS</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">pipe</span><span class="p">):</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span>
<span class="n">X</span><span class="p">,</span>
<span class="n">y</span><span class="p">,</span>
<span class="n">test_size</span><span class="o">=</span><span class="mf">0.33</span><span class="p">,</span>
<span class="n">random_state</span><span class="o">=</span><span class="mi">5</span>
<span class="p">)</span>
<span class="n">pipe</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">7</span><span class="p">))</span>
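<span class="c1"># On recent matplotlib, the '3d' projection is available without importing p3</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">fig</span><span class="o">.</span><span class="n">add_subplot</span><span class="p">(</span><span class="mi">111</span><span class="p">,</span> <span class="n">projection</span><span class="o">=</span><span class="s1">'3d'</span><span class="p">)</span>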
<span class="n">x</span> <span class="o">=</span> <span class="n">X_test</span><span class="p">[:,</span> <span class="mi">2</span><span class="p">]</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">X_test</span><span class="p">[:,</span> <span class="mi">5</span><span class="p">]</span>
<span class="n">z</span> <span class="o">=</span> <span class="n">y_test</span>
<span class="n">ax</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="s1">'r'</span><span class="p">,</span> <span class="n">marker</span><span class="o">=</span><span class="s1">'o'</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="nb">max</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="p">(</span><span class="nb">max</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">-</span> <span class="nb">min</span><span class="p">(</span><span class="n">x</span><span class="p">))</span> <span class="o">/</span> <span class="mf">100.0</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">y</span><span class="p">),</span> <span class="nb">max</span><span class="p">(</span><span class="n">y</span><span class="p">),</span> <span class="p">(</span><span class="nb">max</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="o">-</span> <span class="nb">min</span><span class="p">(</span><span class="n">y</span><span class="p">))</span> <span class="o">/</span> <span class="mf">100.0</span><span class="p">)</span>
<span class="n">X</span><span class="p">,</span> <span class="n">Y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">meshgrid</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">Z</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">fake_X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]):</span>
    <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]):</span>
        <span class="n">fake_X</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span>
        <span class="n">fake_X</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">5</span><span class="p">]</span> <span class="o">=</span> <span class="n">Y</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span>
        <span class="n">Z</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">pipe</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">fake_X</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">ax</span><span class="o">.</span><span class="n">plot_surface</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">,</span> <span class="n">Z</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.2</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s1">'INDUS'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s1">'RM'</span><span class="p">)</span>
<span class="n">ax</span><span class="o">.</span><span class="n">set_zlabel</span><span class="p">(</span><span class="s1">'Price'</span><span class="p">)</span>
<span class="n">plot_model_RM_INDUS</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">pipe</span><span class="p">)</span>
</code></pre></div>
<p><img alt="animation" src="https://www.marcosantoni.com/images/animation.gif"></p>
<p>How pretty is that?</p>
<p>The next step is to use all the available features, so we move to
a 13-dimensional feature vector.</p>
<div class="highlight"><pre><span></span><code><span class="n">pipe</span> <span class="o">=</span> <span class="n">make_pipeline</span><span class="p">(</span>
    <span class="n">LinearRegression</span><span class="p">()</span>
<span class="p">)</span>
<span class="n">evaluate_model</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">pipe</span><span class="p">)</span>
<span class="sd">'''</span>
<span class="sd">Mean Squared Error</span>
<span class="sd">[ 20.50009513 22.42870192 27.88911654]</span>
<span class="sd">Accuracy: 23.61'''</span>
</code></pre></div>
<p>The error is considerably smaller. However, we cannot plot the model in
13 dimensions. We will now reuse the function that adds a squared
feature.</p>
<div class="highlight"><pre><span></span><code><span class="n">pipe</span> <span class="o">=</span> <span class="n">make_pipeline</span><span class="p">(</span>
    <span class="n">FunctionTransformer</span><span class="p">(</span><span class="n">add_squared_col</span><span class="p">),</span>
    <span class="n">LinearRegression</span><span class="p">()</span>
<span class="p">)</span>
<span class="n">evaluate_model</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">pipe</span><span class="p">)</span>
<span class="sd">'''</span>
<span class="sd">Mean Squared Error</span>
<span class="sd">[ 16.7819682 14.599869 18.17785453]</span>
<span class="sd">Accuracy: 16.52'''</span>
</code></pre></div>
<p>Even better. Now we will switch to a ridge regressor, combined with
standardization of the features.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">StandardScaler</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">Ridge</span>
<span class="n">pipe</span> <span class="o">=</span> <span class="n">make_pipeline</span><span class="p">(</span>
    <span class="n">StandardScaler</span><span class="p">(),</span>
    <span class="n">FunctionTransformer</span><span class="p">(</span><span class="n">add_squared_col</span><span class="p">),</span>
    <span class="n">Ridge</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">evaluate_model</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">pipe</span><span class="p">)</span>
<span class="sd">'''</span>
<span class="sd">Mean Squared Error</span>
<span class="sd">[ 16.4292824 14.50522561 18.27167008]</span>
<span class="sd">Accuracy: 16.40'''</span>
</code></pre></div>Install a .deb file from terminal on Ubuntu2016-05-23T08:18:00+02:002016-05-23T08:18:00+02:00Marco Santonitag:www.marcosantoni.com,2016-05-23:/2016/05/23/install-a-deb-file-from-terminal-on-ubuntu.html<p>I use Ubuntu 16.04. Sometimes, when I double-click a <em>.deb</em> file, the
installation program does not work. What often solves the problem is
installing it from the terminal.</p>
<div class="highlight"><pre><span></span><code>sudo dpkg -i my_deb_file.deb
sudo apt-get -f install
</code></pre></div>Insights from Data Science Milan - 19/05/162016-05-20T17:56:00+02:002016-05-20T17:56:00+02:00Marco Santonitag:www.marcosantoni.com,2016-05-20:/2016/05/20/insights-from-data-science-milan-190516.html<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/hashtag/DeepLearning?src=hash">#DeepLearning</a> introduction and enterprise architectures using <a href="https://twitter.com/hashtag/H2O?src=hash">#H2O</a> - first <a href="https://twitter.com/hashtag/DataScienceMilan?src=hash">#DataScienceMilan</a> meetup! - <a href="https://t.co/I8LsfaFJSu">https://t.co/I8LsfaFJSu</a></p>— Andrea Scarso (@andreaesseci) <a href="https://twitter.com/andreaesseci/status/733044189349482496">May 18, 2016</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>A new <strong>Data Science meetup</strong> has launched in Milan. Two talks about Deep
Learning were given at the first event.</p>
<p><strong>Neural Networks and Deep Learning: An
Introduction. <a href="https://twitter.com/milanhightech">@MilanHighTech</a>.</strong> The
first talk, by Valentino Zocca, was a quick intro to Deep Learning. The
speaker explained the role of the additional layers in a
neural network. Each layer learns a different representation of the
input, and each additional layer learns a more abstract representation
than the one before it.</p>
<p><img alt="Face recognition" src="https://indico.io/blog/wp-content/uploads/2016/02/cnn_deeper.jpg"></p>
<p>Each layer is learning a higher level of abstraction. In the example,
the first layer is learning the edges in the image; the second layer is
learning the parts of a face like the nose or the eye; the third layer
is learning large sections of a face. Ref: "<em>Convolutional Deep Belief
Networks for Scalable Unsupervised Learning of Hierarchical
Representations</em>", Lee et al.</p>
<p><strong>Bringing Deep Learning into production.</strong>
<a href="https://twitter.com/axlpado">@axlpado</a>. The speaker gave his point of
view on deploying machine learning algorithms in production. There is a
variety of frameworks, and it's not always easy to choose which one to
adopt. He gave a series of interesting tips, and I'll list the
main ones here.</p>
<p>You can write machine learning code in many languages, such as Python, Java,
R, Matlab, Scala, etc. A good guideline is to choose the one you know
best. Do not add the complexity of learning a new language to the
complexity of designing the algorithm.</p>
<p>Different languages in different teams.</p>
<p><img alt="Data science languages" src="https://www.marcosantoni.com/images/20160519_193804-1.jpg"></p>
<p>It can be a challenge to move machine learning models from one team to
another. The reason is that teams often work in different languages or
in different frameworks. This organization leads to complex deployment
processes.</p>
<p><img alt="Tips for deployment" src="https://www.marcosantoni.com/images/20160519_194315.jpg"></p>
<p>Paolo recommended keeping the entire team on the same framework. The
idea is to make the deployment pipeline as smooth as possible. It can be
an effort for the data scientists at the beginning to learn the data
engineering tools, but it can make a difference in the long term.</p>Bayesian A/B Testing in Python2016-05-15T15:33:00+02:002016-05-15T15:33:00+02:00Marco Santonitag:www.marcosantoni.com,2016-05-15:/2016/05/15/bayesian-ab-testing-in-python.html<p>Imagine you are redesigning your e-commerce website. You have to decide
whether the "Buy Item" button should be blue or green. You decide to
set up an A/B test, so you build two versions of the item page:</p>
<ul>
<li><strong>Page A</strong> which has a blue button;</li>
<li><strong>Page B</strong> which has a green button.</li>
</ul>
<p>Pages A and B are identical except for the color of the button. You want
to quantify the likelihood of a user clicking the "Buy Item" button when
she is on page A or on page B. So, you start the experiment by sending
each user either to page A or to page B. Each time, you monitor whether
she clicked "Buy Item" or not.</p>
<p><strong>Frequentist vs Bayesian</strong></p>
<p>One could simply approximate the effectiveness of each page by computing
the <strong>success rate</strong> on the two pages. E.g. if N=1000 users visited page
A, and 50 of them clicked the button, one could say that the likelihood
of clicking the button on page A is 50/1000 = 5%. This is the
so-called <strong>Frequentist</strong> approach, which interprets probability in
terms of event frequency. However, the following issues often arise in
practice:</p>
<ul>
<li>What if N is small (e.g. N=50)? Can we still be confident by just
computing the success rate?</li>
<li>What if N is different between page A and page B? Let's say that 500
users visited page A and 2000 users visited page B. How can we
combine such imbalanced experiments?</li>
<li>How large should N be to achieve a 90% confidence in my estimates?</li>
</ul>
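<p>To make the first two issues concrete, here is a minimal sketch that computes the success rate together with its approximate standard error (normal approximation for a proportion). The function name and the click counts are hypothetical and only serve as an illustration.</p>

```python
import math


def success_rate_with_error(clicks, visitors):
    """Return the observed success rate and its approximate standard error."""
    rate = clicks / float(visitors)
    # Standard error of a proportion under the normal approximation.
    std_error = math.sqrt(rate * (1.0 - rate) / visitors)
    return rate, std_error


# Hypothetical counts: the same 4% observed rate with very different sample sizes.
for clicks, visitors in [(2, 50), (80, 2000)]:
    rate, std_error = success_rate_with_error(clicks, visitors)
    print("N=%d rate=%.3f standard error=%.4f" % (visitors, rate, std_error))
```

<p>Both experiments report the same rate, but the standard error for N=50 is several times larger than for N=2000, so the two estimates do not deserve the same level of trust.</p>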
<p>We'll now introduce a simple <strong>Bayesian</strong> solution that lets us run
the A/B test and handle the issues listed above. The code makes use
of the <a href="https://pymc-devs.github.io/pymc/">PyMC</a> package, and it was
inspired by reading "Bayesian Methods for Hackers" by <a href="https://twitter.com/Cmrn_DP?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor">Cameron
Davidson-Pilon</a>.</p>
<p><strong>Evaluate Page A</strong></p>
<p>We'll first show how to evaluate the success rate on page A with a
Bayesian approach. The goal is to infer the probability of clicking the
"Buy Item" button on page A. We model each click with a
<a href="https://www.wikiwand.com/en/Bernoulli_distribution">Bernoulli</a>
distribution with parameter <span class="math">\(p_A\)</span>:</p>
<div class="math">$$P(click | \text{page}=A) =
\begin{cases}
p_A & click=1\\
1-p_A & click=0\\
\end{cases}$$</div>
<p>So, <span class="math">\(p_A\)</span> is the parameter indicating the probability
of clicking the button on page A. This parameter is unknown and the goal
of the experiment is to infer it.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pymc</span> <span class="kn">import</span> <span class="n">Uniform</span><span class="p">,</span> <span class="n">rbernoulli</span><span class="p">,</span> <span class="n">Bernoulli</span><span class="p">,</span> <span class="n">MCMC</span>
<span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="c1"># true value of p_A (unknown)</span>
<span class="n">p_A_true</span> <span class="o">=</span> <span class="mf">0.05</span>
<span class="c1"># number of users visiting page A</span>
<span class="n">N</span> <span class="o">=</span> <span class="mi">1500</span>
<span class="n">occurrences</span> <span class="o">=</span> <span class="n">rbernoulli</span><span class="p">(</span><span class="n">p_A_true</span><span class="p">,</span> <span class="n">N</span><span class="p">)</span>
<span class="nb">print</span> <span class="s1">'Click-BUY:'</span>
<span class="nb">print</span> <span class="n">occurrences</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
<span class="nb">print</span> <span class="s1">'Observed frequency:'</span>
<span class="nb">print</span> <span class="n">occurrences</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span> <span class="o">/</span> <span class="nb">float</span><span class="p">(</span><span class="n">N</span><span class="p">)</span>
</code></pre></div>
<p>In this code, we are simulating a realisation of the experiment where
1500 users visited page A. Here, <em>occurrences</em> is an array with one entry per
visitor, indicating whether that visitor clicked the button in this realisation.</p>
<p>The next step consists of defining our prior on the
<span class="math">\(p_A\)</span> parameter. The <strong>prior definition</strong> is the
first step of Bayesian inference and is a way to express our prior
belief about the variable.</p>
<div class="highlight"><pre><span></span><code><span class="n">p_A</span> <span class="o">=</span> <span class="n">Uniform</span><span class="p">(</span><span class="s1">'p_A'</span><span class="p">,</span> <span class="n">lower</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">upper</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">obs</span> <span class="o">=</span> <span class="n">Bernoulli</span><span class="p">(</span><span class="s1">'obs'</span><span class="p">,</span> <span class="n">p_A</span><span class="p">,</span> <span class="n">value</span><span class="o">=</span><span class="n">occurrences</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre></div>
<p>In this section, we define the prior of <span class="math">\(p_A\)</span> to be a
uniform distribution. The <em>obs</em> variable is the Bernoulli
distribution representing the observations of the click events
(governed by the <span class="math">\(p_A\)</span> parameter). The two variables
are instances of <em>Uniform</em> and <em>Bernoulli</em>, which are stochastic variable
classes provided by PyMC. Each variable is associated with a string name
(<em>p_A</em> and <em>obs</em> in this case). The <em>obs</em> variable has the <em>value</em>
and <em>observed</em> parameters set because we have observed the
realisations of the experiment.</p>
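<p>For this particular model the posterior is also available in closed form, which gives a handy sanity check on any sampler output. As a sketch, writing <span class="math">\(k\)</span> for the number of observed clicks and <span class="math">\(N\)</span> for the number of visitors (symbols introduced here only for illustration):</p>
<div class="math">$$P(p_A \mid \text{data}) \propto p_A^{k}(1-p_A)^{N-k} \cdot 1
\quad\Longrightarrow\quad
p_A \mid \text{data} \sim \text{Beta}(k+1,\ N-k+1)$$</div>
<p>The first factor is the Bernoulli likelihood of the observations and the factor 1 is the uniform prior density. The posterior mean is <span class="math">\((k+1)/(N+2)\)</span>, which approaches the observed frequency <span class="math">\(k/N\)</span> as <span class="math">\(N\)</span> grows.</p>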
<div class="highlight"><pre><span></span><code><span class="c1"># defining a Markov chain Monte Carlo (MCMC) model</span>
<span class="n">mcmc</span> <span class="o">=</span> <span class="n">MCMC</span><span class="p">([</span><span class="n">p_A</span><span class="p">,</span> <span class="n">obs</span><span class="p">])</span>
<span class="c1"># drawing 20k samples and discarding the first 1k as burn-in</span>
<span class="n">mcmc</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">20000</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)</span>
<span class="c1"># the resulting posterior distribution is stored in the trace variable</span>
<span class="nb">print</span> <span class="n">mcmc</span><span class="o">.</span><span class="n">trace</span><span class="p">(</span><span class="s1">'p_A'</span><span class="p">)[:]</span>
</code></pre></div>
<p>In this section, the MCMC model is initialised, and the variables <em>p_A</em>
and <em>obs</em> are given to it as input. The <em>sample</em> method runs the
Monte Carlo simulation and updates the prior belief with the observed data.
The posterior distribution is accessible via the <em>trace</em> method as
an array of samples. We can now visualise the result of the
inference.</p>
<div class="highlight"><pre><span></span><code><span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">7</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">mcmc</span><span class="o">.</span><span class="n">trace</span><span class="p">(</span><span class="s1">'p_A'</span><span class="p">)[:],</span> <span class="n">bins</span><span class="o">=</span><span class="mi">35</span><span class="p">,</span> <span class="n">histtype</span><span class="o">=</span><span class="s1">'stepfilled'</span><span class="p">,</span>
<span class="n">normed</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'Probability of clicking BUY'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s1">'Density'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">vlines</span><span class="p">(</span><span class="n">p_A_true</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">90</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s1">'--'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'True p_A'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div>
<p><img alt="p_A_hist_N_1500" src="https://www.marcosantoni.com/images/p_A_hist_N_1500.png"></p>
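<p>To make the sampling step behind this histogram less of a black box, here is a minimal random-walk Metropolis sampler for the same model (uniform prior, Bernoulli observations). It is only a sketch: PyMC chooses and tunes its own step methods, and the function names, step size and click counts below are assumptions made for illustration.</p>

```python
import math
import random


def log_posterior(p, clicks, visitors):
    """Log of the unnormalised posterior: Bernoulli likelihood times a Uniform(0, 1) prior."""
    if p <= 0.0 or p >= 1.0:
        return float("-inf")  # outside the support of the prior
    return clicks * math.log(p) + (visitors - clicks) * math.log(1.0 - p)


def metropolis(clicks, visitors, n_iter=20000, burn=1000, step=0.01, seed=0):
    """Random-walk Metropolis sampler for p; returns the samples kept after burn-in."""
    rng = random.Random(seed)
    current = 0.5
    current_lp = log_posterior(current, clicks, visitors)
    samples = []
    for i in range(n_iter):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, clicks, visitors)
        # Accept with probability min(1, posterior ratio), computed in log space.
        if math.log(1.0 - rng.random()) <= proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        if i >= burn:
            samples.append(current)
    return samples


# Hypothetical data: 75 clicks out of 1500 visitors.
samples = metropolis(clicks=75, visitors=1500)
print(sum(samples) / len(samples))
```

<p>The list returned by this sketch plays the same role as the array obtained from the trace: a histogram of it approximates the posterior of <span class="math">\(p_A\)</span>.</p>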
<p>Then, we might want to answer the question: in which interval can I be 90%
confident that the true <span class="math">\(p_A\)</span> lies? That's easy to answer.</p>
<div class="highlight"><pre><span></span><code><span class="n">p_A_samples</span> <span class="o">=</span> <span class="n">mcmc</span><span class="o">.</span><span class="n">trace</span><span class="p">(</span><span class="s1">'p_A'</span><span class="p">)[:]</span>
<span class="n">lower_bound</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">percentile</span><span class="p">(</span><span class="n">p_A_samples</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
<span class="n">upper_bound</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">percentile</span><span class="p">(</span><span class="n">p_A_samples</span><span class="p">,</span> <span class="mi">95</span><span class="p">)</span>
<span class="nb">print</span> <span class="s1">'There is 90</span><span class="si">%%</span><span class="s1"> probability that p_A is between </span><span class="si">%s</span><span class="s1"> and </span><span class="si">%s</span><span class="s1">'</span> <span class="o">%</span> <span class="p">(</span><span class="n">lower_bound</span><span class="p">,</span> <span class="n">upper_bound</span><span class="p">)</span>
<span class="c1"># There is 90% probability that p_A is between 0.0373019596856 and 0.0548052806892</span>
</code></pre></div>
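<p>Because a uniform prior combined with Bernoulli observations yields a Beta posterior, Beta(k+1, N-k+1) for k clicks out of N visitors, such an interval can be cross-checked without MCMC. A minimal sketch using only the standard library; the function names and the click counts are hypothetical and not taken from the experiment above.</p>

```python
import random


def beta_posterior_samples(clicks, visitors, n_samples=20000, seed=0):
    """Draw samples from Beta(clicks + 1, visitors - clicks + 1), the posterior under a uniform prior."""
    rng = random.Random(seed)
    return [rng.betavariate(clicks + 1, visitors - clicks + 1) for _ in range(n_samples)]


def credible_interval(samples, mass=0.90):
    """Equal-tailed interval containing roughly `mass` of the samples."""
    ordered = sorted(samples)
    tail = (1.0 - mass) / 2.0
    lower = ordered[int(tail * len(ordered))]
    upper = ordered[int((1.0 - tail) * len(ordered)) - 1]
    return lower, upper


# Hypothetical data: 75 clicks out of 1500 visitors.
lower_bound, upper_bound = credible_interval(beta_posterior_samples(75, 1500))
print("90%% credible interval: %.4f to %.4f" % (lower_bound, upper_bound))
```

<p>If the MCMC trace and this direct sampling give clearly different intervals for the same counts, the chain probably needs more samples or a longer burn-in.</p>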
<p><strong>Comparing Page A and Page B</strong></p>
<p>We'll now repeat for page B what we did for page A, and add a new
variable <em>delta</em> representing the difference
between <span class="math">\(p_A\)</span> and <span class="math">\(p_B\)</span>.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pymc</span> <span class="kn">import</span> <span class="n">Uniform</span><span class="p">,</span> <span class="n">rbernoulli</span><span class="p">,</span> <span class="n">Bernoulli</span><span class="p">,</span> <span class="n">MCMC</span><span class="p">,</span> <span class="n">deterministic</span>
<span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">p_A_true</span> <span class="o">=</span> <span class="mf">0.05</span>
<span class="n">p_B_true</span> <span class="o">=</span> <span class="mf">0.04</span>
<span class="n">N_A</span> <span class="o">=</span> <span class="mi">1500</span>
<span class="n">N_B</span> <span class="o">=</span> <span class="mi">750</span>
<span class="n">occurrences_A</span> <span class="o">=</span> <span class="n">rbernoulli</span><span class="p">(</span><span class="n">p_A_true</span><span class="p">,</span> <span class="n">N_A</span><span class="p">)</span>
<span class="n">occurrences_B</span> <span class="o">=</span> <span class="n">rbernoulli</span><span class="p">(</span><span class="n">p_B_true</span><span class="p">,</span> <span class="n">N_B</span><span class="p">)</span>
<span class="nb">print</span> <span class="s1">'Observed frequency:'</span>
<span class="nb">print</span> <span class="s1">'A'</span>
<span class="nb">print</span> <span class="n">occurrences_A</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span> <span class="o">/</span> <span class="nb">float</span><span class="p">(</span><span class="n">N_A</span><span class="p">)</span>
<span class="nb">print</span> <span class="s1">'B'</span>
<span class="nb">print</span> <span class="n">occurrences_B</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span> <span class="o">/</span> <span class="nb">float</span><span class="p">(</span><span class="n">N_B</span><span class="p">)</span>
<span class="n">p_A</span> <span class="o">=</span> <span class="n">Uniform</span><span class="p">(</span><span class="s1">'p_A'</span><span class="p">,</span> <span class="n">lower</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">upper</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">p_B</span> <span class="o">=</span> <span class="n">Uniform</span><span class="p">(</span><span class="s1">'p_B'</span><span class="p">,</span> <span class="n">lower</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">upper</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="nd">@deterministic</span>
<span class="k">def</span> <span class="nf">delta</span><span class="p">(</span><span class="n">p_A</span><span class="o">=</span><span class="n">p_A</span><span class="p">,</span> <span class="n">p_B</span><span class="o">=</span><span class="n">p_B</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">p_A</span> <span class="o">-</span> <span class="n">p_B</span>
<span class="n">obs_A</span> <span class="o">=</span> <span class="n">Bernoulli</span><span class="p">(</span><span class="s1">'obs_A'</span><span class="p">,</span> <span class="n">p_A</span><span class="p">,</span> <span class="n">value</span><span class="o">=</span><span class="n">occurrences_A</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">obs_B</span> <span class="o">=</span> <span class="n">Bernoulli</span><span class="p">(</span><span class="s1">'obs_B'</span><span class="p">,</span> <span class="n">p_B</span><span class="p">,</span> <span class="n">value</span><span class="o">=</span><span class="n">occurrences_B</span><span class="p">,</span> <span class="n">observed</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">mcmc</span> <span class="o">=</span> <span class="n">MCMC</span><span class="p">([</span><span class="n">p_A</span><span class="p">,</span> <span class="n">p_B</span><span class="p">,</span> <span class="n">obs_A</span><span class="p">,</span> <span class="n">obs_B</span><span class="p">,</span> <span class="n">delta</span><span class="p">])</span>
<span class="n">mcmc</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">25000</span><span class="p">,</span> <span class="mi">5000</span><span class="p">)</span>
<span class="n">p_A_samples</span> <span class="o">=</span> <span class="n">mcmc</span><span class="o">.</span><span class="n">trace</span><span class="p">(</span><span class="s1">'p_A'</span><span class="p">)[:]</span>
<span class="n">p_B_samples</span> <span class="o">=</span> <span class="n">mcmc</span><span class="o">.</span><span class="n">trace</span><span class="p">(</span><span class="s1">'p_B'</span><span class="p">)[:]</span>
<span class="n">delta_samples</span> <span class="o">=</span> <span class="n">mcmc</span><span class="o">.</span><span class="n">trace</span><span class="p">(</span><span class="s1">'delta'</span><span class="p">)[:]</span>
<span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">p_A_samples</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">35</span><span class="p">,</span> <span class="n">histtype</span><span class="o">=</span><span class="s1">'stepfilled'</span><span class="p">,</span> <span class="n">normed</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">color</span><span class="o">=</span><span class="s1">'blue'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Posterior of p_A'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">vlines</span><span class="p">(</span><span class="n">p_A_true</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">90</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s1">'--'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'True p_A (unknown)'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'Probability of clicking BUY via A'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">p_B_samples</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">35</span><span class="p">,</span> <span class="n">histtype</span><span class="o">=</span><span class="s1">'stepfilled'</span><span class="p">,</span> <span class="n">normed</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">color</span><span class="o">=</span><span class="s1">'green'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Posterior of p_B'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">vlines</span><span class="p">(</span><span class="n">p_B_true</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">90</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s1">'--'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'True p_B (unknown)'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'Probability of clicking BUY via B'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">3</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">delta_samples</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">35</span><span class="p">,</span> <span class="n">histtype</span><span class="o">=</span><span class="s1">'stepfilled'</span><span class="p">,</span> <span class="n">density</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
         <span class="n">color</span><span class="o">=</span><span class="s1">'red'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'Posterior of delta'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">vlines</span><span class="p">(</span><span class="n">p_A_true</span> <span class="o">-</span> <span class="n">p_B_true</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">90</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s1">'--'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s1">'True delta (unknown)'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s1">'p_A - p_B'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div>
<p><img alt="A_and_B" src="https://www.marcosantoni.com/images/A_and_B.png"></p>
<p>Then, we can answer a question like: what is the probability that
<span class="math">\( p_A > p_B\)</span>?</p>
<div class="highlight"><pre><span></span><code><span class="nb">print</span><span class="p">(</span><span class="s1">'Probability that p_A > p_B:'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">((</span><span class="n">delta_samples</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
<span class="c1"># Probability that p_A > p_B</span>
<span class="c1"># 0.8919</span>
</code></pre></div>
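<p>As a rough cross-check of this kind of question, the sketch below estimates the same probability without MCMC. It assumes independent uniform priors on both rates, so each posterior is a Beta distribution that can be sampled directly. The function name and the click counts are hypothetical and are not taken from the experiment above.</p>

```python
import random


def prob_a_beats_b(clicks_a, visits_a, clicks_b, visits_b, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(p_A > p_B) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_draws):
        # With a uniform prior, the posterior of a Bernoulli rate is
        # Beta(1 + successes, 1 + failures).
        p_a = rng.betavariate(1 + clicks_a, 1 + visits_a - clicks_a)
        p_b = rng.betavariate(1 + clicks_b, 1 + visits_b - clicks_b)
        wins += p_a > p_b
    return wins / n_draws


# Hypothetical counts, for illustration only.
estimate = prob_a_beats_b(clicks_a=75, visits_a=1500, clicks_b=30, visits_b=750)
print(estimate)
```

<p>The returned value plays the same role as <code>(delta_samples > 0).mean()</code>: it is the fraction of posterior draws in which A beats B.</p>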
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Insights from PyData Florence 162016-04-20T06:05:00+02:002016-04-20T06:05:00+02:00Marco Santonitag:www.marcosantoni.com,2016-04-20:/2016/04/20/insights-from-pydata-florence-16.html<p>I have just attended the <a href="https://www.pycon.it/p3/schedule/pycon7/">PyData</a>
conference in Florence, and I will briefly list some
interesting insights.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Oh my... We are already overcrowded <a href="https://twitter.com/pyconit">@pyconit</a> and it's *just* the beginning!! 🎉🎉 good job guys! 🙌🏻 <a href="https://twitter.com/hashtag/pycon7?src=hash">#pycon7</a></p>— (((Valerio Maggio))) (@leriomaggio) <a href="https://twitter.com/leriomaggio/status/720894471060201472">April 15, 2016</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p><strong>Time Travel and Time Series Analysis with Pandas and Statsmodels,
<a href="http://twitter.com/hendorf">@hendorf.</a></strong> The talk focused on time
series analysis. The speaker pointed out something that a data scientist
should not forget: the level of time aggregation must be chosen with
care. Do you take into account that February has only about 90% as many
days as March? If you compare e.g. sales per month, you cannot just
ignore this fact. During the talk, I also found out that statsmodels has
some nice tools for trend and seasonality analysis.</p>
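<p>The February versus March point can be sketched in a few lines. This is a minimal example using only the standard library. The function name and the sales figures are made up for illustration and were not shown in the talk.</p>

```python
import calendar


def sales_per_day(monthly_sales, year):
    """Convert {month number: total sales} into average sales per calendar day."""
    per_day = {}
    for month, total in monthly_sales.items():
        days = calendar.monthrange(year, month)[1]  # number of days in that month
        per_day[month] = total / days
    return per_day


# Hypothetical totals for February and March 2015 (28 and 31 days).
monthly_sales = {2: 2800, 3: 3100}
per_day = sales_per_day(monthly_sales, 2015)
print(per_day)
```

<p>The raw monthly totals differ by roughly 10%, yet the daily rate is identical in both months, which is exactly the effect that a naive month-over-month comparison would misread as growth.</p>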
<p><strong>Machine learning and IoT for automatic presence detection of workers
on fall protection life lines,
<a href="http://twitter.com/stefanoterna">@stefanoterna</a>.</strong> The talk was an
excellent overview of how TomorrowData is able to deploy machine
learning systems in the "real world". Their system uses neural networks
to detect a man walking on industrial cables. It was interesting to hear
about the different challenges that one has to consider in the Internet
of Things area due to hardware and environmental constraints. The fact
that they had to manually annotate the signals coming from an
accelerometer reminded me of <a href="http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=7346953&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D7346953">my
work</a>
on indoor localization. In areas like this, data collection is
indeed a challenge because of its manual cost (compared to the datasets
you can easily collect through a web app).</p>
<p><strong>Introduzione a Orange Data Mining,
<a href="http://twitter.com/ericbonfadini">@ericbonfadini</a>.</strong> Eric introduced
Orange Data Mining, which is both a Python library and a GUI for machine
learning projects. I found the GUI interesting: it lets you define
pipelines of jobs to mine data, so you can quickly get insights about
data and play around with machine learning models. I see this tool as
useful mainly for didactic purposes. I think it can help teachers
explain data mining and machine learning in a graphical way, which makes
it well suited to lectures.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">"Simple APIs and innovative documentation processes" keynote by <a href="https://twitter.com/EGouillart">@EGouillart</a> now live <a href="https://twitter.com/PyData">@PyData</a> <a href="https://twitter.com/pyconit">@pyconit</a> <a href="https://twitter.com/hashtag/pydatait?src=hash">#pydatait</a> <a href="https://t.co/Gt8cxIyafJ">pic.twitter.com/Gt8cxIyafJ</a></p>— PyData Italy (@pydatait) <a href="https://twitter.com/pydatait/status/721235005746188289">April 16, 2016</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p><strong>Simple APIs and innovative documentation processes: looking back at
the success of Scientific Python,
<a href="http://twitter.com/EGouillart">@EGouillart</a>.</strong> The talk presented the point
of view of a core developer of a scientific package like <em>scikit-image</em>.
The speaker gave nice insights into the API design choices that need to
be made when you contribute to open source projects. For example, what
is the advantage of getting rid of most classes in your package and
mainly exposing functions? The idea is that, if you get rid of the
boilerplate of classes, you are forced to expose and return just numpy
arrays, which you can then easily integrate with other tools in your
pipeline, e.g. scikit-learn. Another thing to take into account is that
54% of the users of such packages run a Windows machine (although
the developers of those packages probably don't). So, you need to take
into account the tech gap between the developers and the end users.
Finally, the speaker mentioned the power of Sphinx as a documentation
tool.</p>
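<p>To make the "functions over classes" idea concrete, here is a small sketch of what such an API could look like. The two functions are hypothetical and are not part of scikit-image. The only point is that both take and return plain numpy arrays, so they compose with any other array-based tool.</p>

```python
import numpy as np


def stretch_contrast(image):
    """Linearly rescale pixel values to the range [0, 1]; returns a new ndarray."""
    image = np.asarray(image, dtype=float)
    lo, hi = image.min(), image.max()
    if hi == lo:
        # A constant image carries no contrast to stretch.
        return np.zeros_like(image)
    return (image - lo) / (hi - lo)


def binarize(image, level):
    """Return a boolean ndarray that is True where the pixel exceeds `level`."""
    return np.asarray(image) > level


# Hypothetical 2x2 "image": no wrapper objects, just arrays in and arrays out.
image = np.array([[0, 50], [100, 200]])
mask = binarize(stretch_contrast(image), 0.4)
print(mask)
```

<p>Because <code>mask</code> is an ordinary array, it can be passed straight to other libraries without any conversion step.</p>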
<p><strong>Building Data Pipelines in Python,
<a href="http://twitter.com/marcobonzanini">@marcobonzanini</a>.</strong> Luigi is an
awesome tool simply because it lets you relax while you are
running a data pipeline. You can programmatically define arbitrary
dependencies between tasks, and Luigi will make sure that the
dependencies are fulfilled. Marco's talk was a really nice intro to the
tool.</p>
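<p>The core idea of dependency resolution can be illustrated with a toy runner. This is not Luigi's API (in Luigi, tasks are classes that declare <code>requires()</code>, <code>output()</code> and <code>run()</code>). The task names and the function below are hypothetical and only show how upstream tasks get executed first, each at most once.</p>

```python
def run_with_dependencies(task, dependencies, actions, done=None):
    """Run `task` after all of its upstream tasks, executing each task at most once."""
    done = set() if done is None else done
    if task in done:
        return done
    for upstream in dependencies.get(task, []):
        run_with_dependencies(upstream, dependencies, actions, done)
    actions[task]()
    done.add(task)
    return done


log = []
# Hypothetical pipeline: each task lists the tasks it depends on.
dependencies = {
    "report": ["clean", "train"],
    "train": ["clean"],
    "clean": ["download"],
}
# Each action just records its own name so the execution order can be inspected.
actions = {
    name: (lambda name=name: log.append(name))
    for name in ["download", "clean", "train", "report"]
}

run_with_dependencies("report", dependencies, actions)
print(log)
```

<p>Asking only for <code>report</code> is enough: the runner walks the dependencies and executes <code>download</code>, <code>clean</code> and <code>train</code> before it.</p>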
<p><strong>Going Functional in the Python Data Science Stack,
<a href="http://twitter.com/data_hope">@data_hope</a>.</strong> The speaker explained
the directed acyclic graphs that are behind functional programming. It
was interesting to hear about the Dask package and its lazy evaluation
model. Dask allows you to abstract your code and perform
operations on datasets that do not fit in memory. The speaker pointed
out that doing functional programming means decoupling "how" from
"what". You can just focus on "what" your algorithm should do, then you
choose "how" it will do it (e.g. Dask).</p>
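<p>The split between "what" and "how" can be sketched with a toy task graph. The dictionary layout below is loosely modeled on the way Dask describes computations, but the evaluator is a hypothetical, naive stand-in written for this example and not Dask itself.</p>

```python
from operator import add, mul

# "What": a task graph. Each value is either plain data or a tuple of
# (function, *keys of the arguments). Nothing is computed at this point.
graph = {
    "x": 2,
    "y": 3,
    "sum": (add, "x", "y"),
    "result": (mul, "sum", "sum"),
}


def evaluate(graph, key, cache=None):
    """'How': a naive recursive scheduler that computes `key` only when asked."""
    cache = {} if cache is None else cache
    if key not in cache:
        node = graph[key]
        if isinstance(node, tuple):
            func, *arg_keys = node
            cache[key] = func(*(evaluate(graph, k, cache) for k in arg_keys))
        else:
            cache[key] = node
    return cache[key]


print(evaluate(graph, "result"))  # (2 + 3) * (2 + 3) = 25
```

<p>Since the graph is only a description, the same "what" could be handed to a different "how", such as a parallel or out-of-core scheduler, without touching the algorithm.</p>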
<p><strong>Reti Neurali in Python, <a href="http://twitter.com/spiunno">@spiunno</a>.</strong> The
talk was a great overview of what neural networks are and how you can
implement them with Theano and Lasagne. The speaker was able to give a
talk that suited both beginners and an intermediate
audience. In particular, the Q&A session was really active, and
interesting topics were discussed, e.g. preventing overfitting,
computational costs, gravitational waves, etc. Regarding overfitting
prevention, I learnt about "dropout", a nice technique that
consists basically in dropping out links of the network at random for
each sample. The advantage is that you prevent overfitting and reduce
the computational cost at the same time.</p>
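<p>Here is a minimal numpy sketch of dropout, written for this post and not taken from the talk or from Lasagne. It assumes the common "inverted dropout" variant, which zeroes whole unit activations (and therefore all of their outgoing links) and rescales the surviving ones. The function name and the batch shape are hypothetical.</p>

```python
import numpy as np


def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability drop_prob, rescale the rest."""
    if not training or drop_prob == 0.0:
        # At prediction time the layer is left untouched.
        return activations
    keep_prob = 1.0 - drop_prob
    # An independent random mask is drawn for every sample and every unit.
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
hidden = np.ones((4, 5))  # hypothetical batch: 4 samples, 5 hidden units
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
print(dropped)
```

<p>Dividing by <code>keep_prob</code> keeps the expected activation unchanged, so no extra rescaling is needed when dropout is switched off at prediction time.</p>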
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/hendorf">@hendorf</a> thank you for coming! enjoy your next conference :)</p>— PyCon Italy (@pyconit) <a href="https://twitter.com/pyconit/status/722763833387966465">April 20, 2016</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>