Do you know the real tragedy of statistics education in most schools? It's boring! Teachers spend hours wading through derivations, equations, and theorems. Then, when you finally get to the best part, applying concepts to actual numbers, it's with irrelevant, unimaginative examples like rolling dice. It's a shame because stats can be engaging if you skip the derivations (which you'll likely never need) and focus on using the concepts to solve interesting problems.
So let's look at Poisson processes and the Poisson distribution, two important probability concepts. After highlighting the relevant theory, we'll work through a real-world example.
A Poisson process is a model for a series of discrete events where the average time between events is known, but the exact timing of events is random. The arrival of an event is independent of the event before (waiting time between events is memoryless). For example, suppose we own a website that our content delivery network (CDN) tells us goes down on average once per 60 days, but one failure doesn't affect the probability of the next. All we know is the average time between failures. The failures are a Poisson process that looks like:
We know the average time between events, but the events are randomly spaced in time (stochastic). We might have back-to-back failures, but we could also go years between failures because the process is stochastic.
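A Poisson process meets the following criteria (in reality, many phenomena modeled as Poisson processes don't precisely match these but can be approximated as such): events are independent of each other, the average rate of events (events per time period) is constant, and two events cannot occur at the same time.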
The last point (events are not simultaneous) means we can think of each sub-interval in a Poisson process as a Bernoulli trial, that is, either a success or a failure. With our website, the entire interval in consideration is 60 days, but within each sub-interval (one day), our website either goes down or it doesn't.
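To make the Bernoulli-trial picture concrete, here is a minimal sketch that treats each of the 60 days as an independent trial. The per-day failure probability of 1/60 is my assumption (it is just "once per 60 days" spread evenly over the days), and the function name is hypothetical.

```python
import random

def failures_in_window(rng, n_days=60, p_fail=1 / 60):
    """Count failures when each day is an independent Bernoulli trial."""
    return sum(rng.random() < p_fail for _ in range(n_days))

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Simulate many 60-day windows and average the number of failures.
counts = [failures_in_window(rng) for _ in range(10_000)]
mean_failures = sum(counts) / len(counts)

print(f"average failures per 60-day window: {mean_failures:.2f}")
print(f"windows with no failure at all: {counts.count(0)}")
```

The average should land close to one failure per window, while individual windows range from no failures to several.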
Common examples of Poisson processes are customers calling a help center, visitors to a website, radioactive decay in atoms, photons arriving at a space telescope, and movements in a stock price. Poisson processes are generally associated with time, but they don't have to be. In the case of stock prices, we might know the average movements per day (events per time), but we could also have a Poisson process for the number of trees in an acre (events per area).
One example of a Poisson process we often see is bus arrivals (or trains). However, this isn't a proper Poisson process because the arrivals aren't independent of one another. Even for bus systems that run on time, a late arrival from one bus can impact the next bus's arrival time. Jake VanderPlas has a great article on applying a Poisson process to bus arrival times, which works better with made-up data than real-world data.
The Poisson process is the model we use for describing randomly occurring events and, by itself, isn't that useful. We need the Poisson distribution to do interesting things like find the probability of a given number of events in a time period or find the probability of waiting some time until the next event.
The Poisson distribution probability mass function (pmf) gives the probability of observing k events in a time period given the length of the period and the average events per time:
The pmf is a little convoluted, and we can simplify events/time * time period into a single parameter, lambda (λ), the rate parameter. With this substitution, the Poisson distribution probability function now has one parameter:
We can think of lambda as the expected number of events in the interval. (We'll switch to calling this an interval because, remember, the Poisson process doesn't always use a time period). I like to write out lambda to remind myself the rate parameter is a function of both the average events per time and the length of the time period, but you'll most commonly see it as above. (The discrete nature of the Poisson distribution is why this is a probability mass function and not a density function.)
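For reference, the standard Poisson pmf is P(k events in interval) = e^(-λ) * λ^k / k!, where λ = (events / time) * (time period). A minimal Python sketch of both pieces follows; the function names are mine, and the website numbers (one failure per 60 days, observed over 60 days) come from the earlier example.

```python
import math

def rate_parameter(events_per_time, period):
    """lambda = average events per unit time * length of the interval."""
    return events_per_time * period

def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Website example: one failure per 60 days, observed over a 60-day interval.
lam = rate_parameter(events_per_time=1 / 60, period=60)

for k in range(4):
    print(f"P({k} failures in 60 days) = {poisson_pmf(k, lam):.4f}")
```

Writing the rate parameter as its own function is just a reminder that it depends on both the event rate and the interval length.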
As we change the rate parameter, λ, we change the probability of seeing different numbers of events in one interval. The graph below is the probability mass function of the Poisson distribution and shows the probability (y-axis) of a number of events (x-axis) occurring in one interval with different rate parameters.
The most likely number of events in one interval for each curve is the curve's rate parameter. This makes sense because the rate parameter is the expected number of events in one interval. Therefore, the rate parameter represents the number of events with the greatest probability when the rate parameter is an integer (in that case, the count one below it is exactly as likely). When the rate parameter is not an integer, the highest probability number of events will be the rate parameter rounded down to the nearest integer. (The rate parameter is also the mean and variance of the distribution, which don't need to be integers.)
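A quick numerical check of which counts carry the highest probability for a few rate parameters can be sketched as follows. The helper `most_likely_counts` is a hypothetical name, and the search is capped at 50 events, which is an assumption that only holds for small rate parameters like these.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def most_likely_counts(lam, k_max=50):
    """Return every count whose probability ties for the maximum."""
    probs = [poisson_pmf(k, lam) for k in range(k_max + 1)]
    best = max(probs)
    # isclose guards against floating-point noise when two counts tie.
    return [k for k, p in enumerate(probs) if math.isclose(p, best, rel_tol=1e-9)]

for lam in (0.5, 2.9, 5, 12):
    print(f"lambda = {lam}: most likely count(s) = {most_likely_counts(lam)}")
```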
We can use the Poisson distribution pmf to find the probability of observing a number of events over an interval generated by a Poisson process. Another use of the mass function equation (as we'll see later) is to find the probability of waiting a given amount of time between events.
We could continue with website failures to illustrate a problem solvable with a Poisson distribution, but I propose something grander. When I was a child, my father would sometimes take me into our yard to observe (or try to observe) meteor showers. We weren't space geeks, but watching objects from outer space burn up in the sky was enough to get us outside, even though meteor showers always seemed to occur in the coldest months.
We can model the number of meteors seen as a Poisson distribution because the meteors are independent, the average number of meteors per hour is constant (in the short term), and (this is an approximation) meteors don't occur at the same time.
All we need to characterize the Poisson distribution is the rate parameter, the number of events per interval * interval length. In a typical meteor shower, we can expect five meteors per hour on average or one every 12 minutes. Due to the limited patience of a young child (especially on a freezing night), we never stayed out more than 60 minutes, so we'll use that as the time period. From these values, we get:
Five meteors expected means that five is the most likely number of meteors we'd observe in an hour. According to my pessimistic dad, that meant we'd see three meteors in an hour, tops. To test his prediction against the model, we can use the Poisson pmf to find the probability of seeing exactly three meteors in one hour:
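Plugging the meteor numbers into the pmf can be sketched in a few lines of Python. The variable names are mine and not taken from the original notebook.

```python
import math

meteors_per_hour = 5   # average rate from the text
hours_observed = 1     # we never stayed out more than 60 minutes

# Rate parameter: events per interval * interval length.
lam = meteors_per_hour * hours_observed

# Poisson pmf evaluated at k = 3 meteors.
k = 3
p_three = math.exp(-lam) * lam**k / math.factorial(k)

print(f"lambda = {lam}")
print(f"P(exactly 3 meteors in one hour) = {p_three:.4f}")
```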
We get 14 percent or about 1/7. If we went outside and observed for one hour every night for a week, then we could expect my dad to be right once! We can use other values in the equation to get the probability of different numbers of events and construct the pmf. Doing this by hand is tedious, so we'll use Python for calculation and visualization (which you can see in this Jupyter Notebook).
The graph below shows the probability mass function for the number of meteors in an hour with an average of 12 minutes between meteors (which is the same as saying five meteors expected in an hour, the rate parameter).
The most likely number of meteors is five, the rate parameter of the distribution. (Due to a quirk of the numbers, four and five have the same probability, 18 percent). There is one most likely value as with any distribution, but there is also a wide range of possible values. For example, we could see zero meteors or see more than 10 in one hour. To find the probabilities of these events, we use the same equation but, this time, calculate sums of probabilities (see notebook for details).
We already calculated the chance of seeing precisely three meteors as about 14 percent. The chance of seeing three or fewer meteors in one hour is 27 percent, which means the probability of seeing more than three is 73 percent. Likewise, the probability of more than five meteors is 38.4 percent, while we could expect to see five or fewer meteors in 61.6 percent of hours. Although it's small, there is a 1.4 percent chance of observing more than 10 meteors in an hour!
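The sums of probabilities can be sketched by adding up pmf values. This is a minimal version under the same rate parameter of five; `poisson_cdf` is a hypothetical helper name, not a function from the original notebook.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def poisson_cdf(k, lam):
    """Probability of k or fewer events: sum the pmf from 0 through k."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

lam = 5  # five meteors expected in one hour

p_three_or_fewer = poisson_cdf(3, lam)
p_more_than_five = 1 - poisson_cdf(5, lam)
p_more_than_ten = 1 - poisson_cdf(10, lam)

print(f"P(3 or fewer)  = {p_three_or_fewer:.3f}")
print(f"P(more than 5) = {p_more_than_five:.3f}")
print(f"P(more than 10) = {p_more_than_ten:.4f}")
```

"More than k" is computed as one minus "k or fewer", which avoids summing an infinite tail.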
To visualize these possible scenarios, we can run an experiment by having our sister record the number of meteors she sees every hour for 10,000 hours. The results are in the histogram below:
(This is just a simulation. No sisters were harmed in the making of this article.)
On a few lucky nights, we'd see 10 or more meteors in an hour, although more often, we'd see four or five meteors.
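One way to sketch this kind of simulation with only the standard library is to draw exponential gaps between meteors (mean 12 minutes) and count how many land inside each hour. This is my own construction and not necessarily how the original notebook generated its histogram.

```python
import random
from collections import Counter

def meteors_in_window(rng, minutes=60.0, mean_gap=12.0):
    """Count meteors in one window by adding up random exponential gaps."""
    t = rng.expovariate(1 / mean_gap)  # time of the first meteor
    count = 0
    while t <= minutes:
        count += 1
        t += rng.expovariate(1 / mean_gap)  # wait for the next meteor
    return count

rng = random.Random(42)  # fixed seed so the sketch is repeatable
counts = [meteors_in_window(rng) for _ in range(10_000)]
histogram = Counter(counts)
mean_count = sum(counts) / len(counts)

print(f"average meteors per hour: {mean_count:.2f}")
for k in sorted(histogram):
    print(f"{k:2d} meteors: {histogram[k]} hours")
```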
The rate parameter, λ, is the only number we need to define the Poisson distribution. However, since it's a product of two parts (events/interval * interval length), there are two ways to change it: we can increase or decrease the events/interval, and we can increase or decrease the interval length.
First, let's change the rate parameter by increasing or decreasing the number of meteors per hour to see how those shifts affect the distribution. For this graph, we're keeping the time period constant at 60 minutes.
In each case, the most likely number of meteors in one hour is the expected number of meteors, the rate parameter. For example, at 12 meteors per hour (MPH), our rate parameter is 12, and there's an 11 percent chance of observing exactly 12 meteors in one hour. If our rate parameter increases, we should expect to see more meteors per hour.
Another option is to increase or decrease the interval length. Here's the same plot, but this time we're keeping the number of meteors per hour constant at five and changing the length of time we observe.
It's no surprise that we expect to see more meteors the longer we stay out.
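Both ways of changing the rate parameter can be sketched side by side. The specific rates and durations below are illustrative choices of mine, apart from the 12 meteors per hour and five meteors per hour figures from the text.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# 1) Change the events per interval, keep the interval at one hour.
for meteors_per_hour in (1, 5, 12):
    lam = meteors_per_hour * 1
    print(f"{meteors_per_hour:2d} per hour for 1 h: lambda = {lam}, "
          f"P(no meteors) = {poisson_pmf(0, lam):.4f}")

# 2) Keep five meteors per hour, change how long we stay outside.
p_none_by_hours = {}
for hours in (0.5, 1, 2):
    lam = 5 * hours
    p_none_by_hours[hours] = poisson_pmf(0, lam)
    print(f"5 per hour for {hours} h: lambda = {lam}, "
          f"P(no meteors) = {p_none_by_hours[hours]:.4f}")

# The 12-meteors-per-hour case from the text.
p_twelve = poisson_pmf(12, 12)
print(f"P(exactly 12 meteors | lambda = 12) = {p_twelve:.3f}")
```

Either change moves the whole distribution, because only the product of the two numbers enters the pmf.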
An intriguing part of a Poisson process involves figuring out how long we have to wait until the next event (sometimes called the interarrival time). Consider the situation: meteors appear once every 12 minutes on average. How long can we expect to wait to see the next meteor if we arrive at a random time? My dad always (this time optimistically) claimed we only had to wait six minutes for the first meteor, which agrees with our intuition. Let's use statistics to see if our intuition is correct.
I won't go into the derivation (it comes from the probability mass function equation), but the time we can expect to wait between events is a decaying exponential. The probability of waiting a given amount of time between successive events decreases exponentially as time increases. The following equation shows the probability of waiting more than a specified time.
With our example, we have one event per 12 minutes, and if we plug in the numbers, we get a 60.65 percent chance of waiting more than six minutes. So much for my dad's guess! We can expect to wait more than 30 minutes about 8.2 percent of the time. (Note this is the time between each successive pair of events. The waiting times between events are memoryless, so the time between two events has no effect on the time between any other events. This memorylessness is also known as the Markov property.)
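The waiting-time probability is the chance of seeing zero events in a window of length t, which the pmf gives as P(T > t) = e^(-(events / time) * t). A minimal sketch with the meteor numbers, using a function name of my own:

```python
import math

def prob_wait_longer_than(t_minutes, mean_gap=12.0):
    """P(T > t): chance that no meteor shows up during t minutes."""
    rate = 1 / mean_gap  # meteors per minute
    return math.exp(-rate * t_minutes)

p_over_6 = prob_wait_longer_than(6)
p_over_30 = prob_wait_longer_than(30)

print(f"P(wait > 6 min)  = {p_over_6:.4f}")
print(f"P(wait > 30 min) = {p_over_30:.4f}")
```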
A graph helps us to visualize the exponentially decaying probability of waiting time:
There is a 100 percent chance of waiting more than zero minutes, which drops off to a near-zero percent chance of waiting more than 80 minutes. Again, as this is a distribution, there's a wide range of possible interarrival times.
Rearranging the equation, we can use it to find the probability of waiting less than or equal to a time:
We can expect to wait six minutes or less to see a meteor 39.4 percent of the time. We can also find the probability of waiting a length of time that falls within a given range: There's a 57.72 percent probability of waiting between 5 and 30 minutes to see the next meteor.
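The rearranged form is P(T <= t) = 1 - e^(-(events / time) * t), and the probability of a wait between two times is the difference of two such values. A short sketch with hypothetical function names:

```python
import math

def prob_wait_at_most(t_minutes, mean_gap=12.0):
    """P(T <= t): chance the next meteor arrives within t minutes."""
    return 1 - math.exp(-t_minutes / mean_gap)

def prob_wait_between(t_low, t_high, mean_gap=12.0):
    """P(t_low < T <= t_high) as a difference of cumulative probabilities."""
    return prob_wait_at_most(t_high, mean_gap) - prob_wait_at_most(t_low, mean_gap)

p_within_6 = prob_wait_at_most(6)
p_5_to_30 = prob_wait_between(5, 30)

print(f"P(wait <= 6 min)       = {p_within_6:.4f}")
print(f"P(5 < wait <= 30 min) = {p_5_to_30:.4f}")
```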
To visualize the distribution of waiting times, we can once again run a (simulated) experiment. We simulate watching for 100,000 minutes with an average rate of one meteor per 12 minutes. Then we find the waiting time between each meteor we see and plot the distribution.
The most likely waiting time is one minute, but that's distinct from the average waiting time. Let's try to answer the question: On average, how long can we expect to wait between meteor observations?
To answer the average waiting time question, we'll run 10,000 separate trials, each time watching the sky for 100,000 minutes, and record the time between each meteor. The graph below shows the distribution of the average waiting time between meteors from these trials:
The average of the 10,000 runs is 12.003 minutes. Surprisingly, this average is also the average waiting time to see the first meteor if we arrive at a random time. At first, this may seem counterintuitive: if events occur on average every 12 minutes, then why do we have to wait the entire 12 minutes before seeing one event? The answer is we are calculating an average waiting time, taking into account all possible situations.
If the meteors came precisely every 12 minutes with no randomness in arrivals, then the average time we'd have to wait to see the first one would be six minutes. However, because waiting time follows an exponential distribution, sometimes we show up and have to wait an hour, which outweighs the more frequent times when we wait fewer than 12 minutes. The average time to see the first meteor, averaged over all the occurrences, will be the same as the average time between events. This result for the average first-event waiting time in a Poisson process is known as the waiting time paradox.
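The paradox can be checked with a small simulation: build one long stream of meteors, show up at random moments, and measure the wait until the next one. This is a sketch under my own choices (a single run of one million minutes, 50,000 random arrivals, a fixed seed) and not the 10,000-trial setup described above.

```python
import bisect
import random

rng = random.Random(0)   # fixed seed so the sketch is repeatable
mean_gap = 12.0          # minutes between meteors on average
horizon = 1_000_000.0    # total minutes simulated (my choice)

# Meteor times: cumulative sums of exponential gaps.
events = []
t = rng.expovariate(1 / mean_gap)
while t < horizon:
    events.append(t)
    t += rng.expovariate(1 / mean_gap)

# Average time between successive meteors.
gaps = [later - earlier for earlier, later in zip(events, events[1:])]
mean_gap_observed = sum(gaps) / len(gaps)

# Arrive at random times and wait for the next meteor.
waits = []
for _ in range(50_000):
    arrival = rng.uniform(0, events[-1])
    next_meteor = events[bisect.bisect_left(events, arrival)]
    waits.append(next_meteor - arrival)
mean_wait_random = sum(waits) / len(waits)

# Comparison: meteors on a fixed 12-minute schedule.
fixed_waits = [mean_gap - (rng.uniform(0, horizon) % mean_gap) for _ in range(50_000)]
mean_wait_fixed = sum(fixed_waits) / len(fixed_waits)

print(f"average gap between meteors:     {mean_gap_observed:.2f} min")
print(f"average wait, random arrivals:   {mean_wait_random:.2f} min")
print(f"average wait, fixed schedule:    {mean_wait_fixed:.2f} min")
```

The random-arrival average should sit near the 12-minute gap, while the fixed schedule gives about half of that.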
As a final visualization, let's do a random simulation of one hour of observation.
Well, this time we got precisely the result we expected: five meteors. We had to wait 15 minutes for the first one, then 12 minutes for the next. In this case, it'd be worth going out of the house for celestial observation!
The next time you find yourself losing focus in statistics, you have my permission to stop paying attention to the teacher. Instead, find an interesting problem and solve it using the statistics you're trying to learn. Applying technical concepts helps you learn the material and better appreciate how stats help us understand the world. Above all, stay curious: There are many amazing phenomena in the world, and data science is an excellent tool for exploring them.
This article was originally published on Towards Data Science.