Posts

Showing posts from September, 2020

Chi Square Test

  HYPOTHESIS TESTING Chi Square Test In the last exercise, we looked at data where customers visited a website and either made a purchase or did not make a purchase. What if we also wanted to understand if the probability of making a purchase depends on some other categorical variable, like gender? If we want to understand whether the outcomes of two categorical variables are  associated , we should use a Chi Square test. It is useful in situations like: An A/B test where half of users were shown a green submit button and the other half were shown a purple submit button. Was one group more likely to click the submit button? People under and over age 40 were given a survey asking “Which of the following three products is your favorite?” Did these age groups have significantly different preferences? In SciPy, you can use the function  chi2_contingency  to perform a Chi Square test. The input to  chi2_contingency  is a contingency table where: The columns are ...

Binomial Test in Python

  Binomial Test Let’s imagine that we are analyzing the percentage of customers who make a purchase after visiting a website. 1000 customers visited the site this month, and 58 of them made a purchase. The marketing department reports that historical data suggests about 72 of every 1000 visitors make a purchase. Thus, they estimate that the probability of any particular customer making a purchase is 7.2%. We would like to know if this month’s number, 58 purchases, is significantly different from normal or a reasonable fluctuation due to random chance. In previous exercises, we collected samples of numerical information (eg. order price) and then used the mean and standard deviation of those samples to make comparisons. In contrast, we now have a sample where each unit (a visitor) falls into one of two discrete categories: “made a purchase” “did not make a purchase” Instead of comparing sample means, we want to compare the percent in the “made a purchase” category to some expectatio...

Turkey Test Range

  HYPOTHESIS TESTING Tukey's Range Test Let’s say that we have performed ANOVA to compare three sets of data from the three VeryAnts stores. We received the result that there is some significant difference between datasets. Now, we have to find out  which  datasets are different. We can perform a Tukey’s Range Test to determine the difference between datasets. If we feed in three datasets, such as the sales at the VeryAnts store locations A, B, and C, Tukey’s Test can tell us which pairs of locations are distinguishable from each other. The function to perform Tukey’s Range Test is  pairwise_tukeyhsd , which is found in  statsmodel , not  scipy . We have to provide the function with one list of all of the data and a list of labels that tell the function which elements of the list are from which set. We also provide the significance level we want, which is usually  0.05 . For example, if we were looking to compare mean scores of movies that are dramas, ...

ANOVA Python

HYPOTHESIS TESTING ANOVA In the last exercise, we saw that the probability of making a Type I error got dangerously high as we performed more t-tests. When comparing more than two numerical datasets, one way to preserve a Type I error probability of  0.05  is to use  ANOVA . ANOVA (Analysis of Variance) tests the null hypothesis that all of the samples come from populations with the same mean. If we reject the null hypothesis with ANOVA, we’re saying that at least one pair of populations (from which the samples were drawn) have different means; however, we cannot determine exactly which pair(s). We can use the SciPy function  f_oneway  to perform ANOVA on multiple datasets.  f_oneway  takes in each dataset as a different input and returns the F-statistic and the p-value. For example, if we were comparing scores on a videogame between math majors, writing majors, and psychology majors, we could run an ANOVA test with this line: fstat , pval = f_oneway...

Two - Sample T test

  2 Sample T-Test Suppose that a company has recently updated their website to make it more colorful and inviting. The company wants to know whether the new design is resulting in visitors staying on the site for a longer period of time. A sample of 100 visitors who saw the old design spent an average of 25 minutes on the site. A second sample of 100 visitors who saw the new version spent an average of 28 minutes on the site. Did the average time spent per visitor vary across groups? Or is this difference attributable to random chance? One way of testing whether this difference is significant is by using a  2 Sample T-Test . A 2 Sample T-Test compares two sets of numerical data. The null hypothesis of a 2 Sample T-Test is that the two observed samples come from populations with the same mean. In the example above, this means: if we could observe all site visitors in two alternate universes (one where they see each version of the site), the average visiting times in these unive...

One Sample T Test in Python

    1 Sample T Test , which compares a sample mean to a hypothetical population mean. ttest_1samp  requires two inputs, a sample distribution (eg. the list of the 50 observed purchase prices) and a mean to test against (eg.  1000 ): tstat , pval = ttest_1samp ( example_distribution , expected_mean ) print pval It also returns two outputs: the t-statistic (which we won’t cover in this course), and the p-value — telling us how confident we can be that the sample of values came from a distribution with the specified mean. One Sample T-Test II In the last exercise, we got a p-value that was much higher than  0.05 , so we cannot reject the null hypothesis. If we conduct another experiment and take a new sample of orders, will we get the same result? Not necessarily! Just because we don’t have enough data to detect a difference doesn’t mean that there isn’t one. Generally, the larger the sample(s) we have, the smaller a difference we’ll be able to detect. You can l...

SQL Case Study

Format for Germany SELECT     -- Cast PickupDate as a date and display as a German date FORMAT(CAST(PickupDate AS date), 'd', 'de-de') AS 'PickupDate', Zone.Borough,     -- Display TotalDistance in the German format FORMAT(SUM(TripDistance), 'n', 'de-de') AS 'TotalDistance',     -- Display TotalRideTime in the German format FORMAT(SUM(DATEDIFF(minute, PickupDate, DropoffDate)), 'n', 'de-de') AS 'TotalRideTime',     -- Display TotalFare in German currency TotalRideTime(SUM(TotalAmount), 'c', 'de-de') AS 'TotalFare' FROM YellowTripData INNER JOIN TaxiZoneLookup AS Zone  ON PULocationID = Zone.LocationID  GROUP BY CAST(PickupDate as date),     Zone.Borough  ORDER BY CAST(PickupDate as date),     Zone.Borough; NYC Borough statistics SP It's time to apply what that you have learned in this course and write a stored procedure to solve the first objective of the Taxi Ride business c...

Execute Stored Procedures

  EXECUTE with OUTPUT parameter Execute the  dbo.cuspSumRideHrsSingleDay  stored procedure and capture the output parameter. DECLARE @RideHrs AS numeric(18,0) -- Execute the stored procedure EXEC dbo.cuspSumRideHrsSingleDay     -- Pass the input parameter @DateParm = '3/1/2018',     -- Store the output in @RideHrs @RideHrsOut =  @RideHrs OUTPUT -- Select @RideHrs SELECT @RideHrs AS RideHours EXECUTE with the return value Execute  dbo.cuspRideSummaryUpdate  to change the  RideHours  to  300  for  '3/1/2018' . Store the return code from the stored procedure. DECLARE @ReturnStatus AS int -- Execute the SP, storing the result in @ReturnStatus EXEC @ReturnStatus = dbo.cuspRideSummaryUpdate     -- Specify @DateParm @DateParm= '3/1/2018',     -- Specify @RideHrs @RideHrs = 300 -- Select the columns of interest SELECT @ReturnStatus AS ReturnStatus,     Date,     RideHours...

Stored Procedure

  Create a Stored Procedure named  cuspSumRideHrsSingleDay  in the  dbo  schema that accepts a date and returns the total ride hours for the date passed. CREATE PROCEDURE dbo.cuspSumRideHrsSingleDay     -- Declare the input parameter @DateParm date,     -- Declare the output parameter @RideHrsOut numeric OUTPUT AS -- Don't send the row count  SET NOCOUNT ON BEGIN -- Assign the query result to @RideHrsOut SELECT @RideHrsOut = SUM(DATEDIFF(second, StartDate, EndDate))/3600 FROM CapitalBikeShare -- Cast StartDate as date and compare with @DateParm WHERE CAST(StartDate AS date) = @DateParm RETURN END CRUD  Create a stored procedure named  cusp_RideSummaryCreate  in the  dbo  schema that will insert a record into the  RideSummary  table. CREATE PROCEDURE dbo.cusp_RideSummaryCreate      (@DateParm date, @RideHrsParm numeric) AS BEGIN SET NOCOUNT ON -- Insert into the Date and RideHours co...