Week Two
Program:
#We are using Spyder, which will help us run our code in Python.
#When we open the program, the script editor is where we write our code. This window is on the left of the screen.
#On the lower right is the IPython console, where the commands are executed. We will work with these two windows this week.
#First, set the working directory to the folder where we want to keep the files for this class.
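#In Spyder the working directory can be set from the toolbar, but it can also be set in code. A minimal sketch; the folder below is just a placeholder (a temporary directory is used only so the snippet runs on any machine), so swap in the real path to your class folder:

```python
import os
import tempfile

# Placeholder for your class folder: replace this with the actual path
# where you keep the course files. tempfile.gettempdir() is used here
# only so the example works anywhere.
class_folder = tempfile.gettempdir()
os.chdir(class_folder)
print(os.getcwd())  # confirm the working directory changed
```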
#The script editor color-codes your program: blue font is for keywords, green is for strings, and red is for numbers.
#Key/value pairs are in purple.
#This is a comment; it won't be run by Python.
#If you write code and nothing is colored, you did something wrong; a red symbol will appear in the margin to help you debug your code.
#Python is case sensitive, so you need to be careful about which words you capitalize.
#Let's import the libraries we need for data analysis. Common libraries are pandas and numpy; we import both with the code below.
#(Spyder may show a warning triangle next to the imports until they are used; that can be ignored.)
import numpy
import pandas
#Now let's download the data. YOU HAVE to download it into the working directory; instructions are in the course's getting-started file.
#We name the resulting data frame "data".
#We pass low_memory=False because mixed column types can otherwise cause issues.
data = pandas.read_csv('data.csv', low_memory=False)
#Save the program: go to the menu and click Save.
#This code tells us the number of observations and the number of columns in our data set. The first number printed is the 43093 participants (rows); the second is the 3010 variables (columns) measured on them.
print(len(data))
print(len(data.columns))
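#An equivalent, slightly more compact check is data.shape, which returns (rows, columns) in one call. A sketch on a toy data frame (made-up values, not the NESARC data):

```python
import io
import pandas

# Toy data frame: 2 observations measured on 3 variables
toy = pandas.read_csv(io.StringIO("A,B,C\n1,2,3\n4,5,6\n"))
print(len(toy))          # 2 observations
print(len(toy.columns))  # 3 variables
print(toy.shape)         # (2, 3): both numbers at once
```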
#Sometimes numeric variables get reformatted as string (character) variables, and this creates problems, especially when you try to perform numeric operations on variables that have numeric responses but that Python considers non-numeric.
#So you have to check the format of your data. Type the code below to see the data type of a variable:
#Run it for each variable of interest. "data" is the data frame we named above; the name in brackets is the variable of interest.
#A result of int64 means Python considers the variable numeric. If that is the case, just leave it and move on.
#data['TAB12MDX'].dtype
data['S2AQ4B'].dtype #how often drank coolers in last 12 months
data['S7Q1'].dtype #ever had a strong fear or avoidance of social situation
data['S2AQ16A'].dtype # age when started drinking, not counting small tastes or sips
#This sets the variables you will be working with to numeric (if you are converting, do this for each variable you will be working with):
#data['TAB12MDX'] = pandas.to_numeric(data['TAB12MDX'])
data['S7Q1'] = pandas.to_numeric(data['S7Q1'])
data['S2AQ16A'] = pandas.to_numeric(data['S2AQ16A'])
data['S2AQ4B'] = pandas.to_numeric(data['S2AQ4B'])
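#If pandas.to_numeric raises an error on these variables (skipped answers in the file can be stored as blanks, which are not parseable as numbers), passing errors='coerce' converts anything unparseable to NaN instead. A sketch on made-up values:

```python
import pandas

# Made-up answers: two ages, a blank (space) response, and the 99 code
raw = pandas.Series(['18', '21', ' ', '99'])
cleaned = pandas.to_numeric(raw, errors='coerce')
print(cleaned)  # 18.0, 21.0, NaN, 99.0: the blank becomes NaN
```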
#Lesson three: running your program and examining frequency distributions
#Remember, we are looking at one variable at a time; this is called univariate or descriptive analysis. To convert raw data into useful information we need to summarize and examine the distribution of each variable of interest.
#By the distribution of a variable we mean what values the variable takes and how often it takes those values.
#For example, we ask a group of students how they feel about their body: overweight, about right, or underweight. Three say overweight, five say underweight, and seven say about right. With a handful of students we can count by hand, but with a massive sample we need a way to see how many of our observations fall within each category.
#So we use code to do this.
#We do this by forming a table with our categories (overweight, underweight, about right), the count (the number of students who fall within each category), and, most importantly, the relative frequency: the percentage of students falling in each category.
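#The body-image example above can be computed directly. A sketch using the made-up counts (3 overweight, 5 underweight, 7 about right):

```python
import pandas

# 15 made-up student answers matching the example above
answers = pandas.Series(
    ['overweight'] * 3 + ['underweight'] * 5 + ['about right'] * 7
)
counts = answers.value_counts()                         # count per category
percents = answers.value_counts(normalize=True) * 100   # percentage per category
print(counts)
print(percents)  # about right ~46.7%, underweight ~33.3%, overweight 20%
```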
# so this is how we generate a frequency distribution table:
print("how often drank coolers in last 12 months")
c1 = data["S2AQ4B"].value_counts(sort=False)
print(c1)
print("ever had a strong fear or avoidance of social situation, yes=1")
c2 = data["S7Q1"].value_counts(sort=False)
print(c2)
print("age when started drinking, not counting small tastes or sips")
c3 = data["S2AQ16A"].value_counts(sort=False)
print(c3)
#calculating the percentages
print("how often drank coolers in last 12 months")
p1 = data["S2AQ4B"].value_counts(sort=False, normalize=True)
print(p1)
print("ever had a strong fear or avoidance of social situation, yes=1")
p2 = data["S7Q1"].value_counts(sort=False, normalize=True)
print(p2)
print("age when started drinking, not counting small tastes or sips")
p3 = data["S2AQ16A"].value_counts(sort=False, normalize=True)
print(p3)
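#Counts and percentages can also be shown side by side in one table with pandas.concat. A sketch on made-up yes/no/unknown codes (not the real variables):

```python
import pandas

# Made-up responses using NESARC-style codes: 1=yes, 2=no, 9=unknown
s = pandas.Series([1, 2, 2, 2, 9, 1, 2])
table = pandas.concat(
    [s.value_counts(sort=False),
     s.value_counts(sort=False, normalize=True) * 100],
    axis=1,
    keys=['count', 'percent'],  # column names for the combined table
)
print(table)
```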
#frequency distribution using groupby, an alternative way to get the same counts and percentages
ct1= data.groupby('S2AQ4B').size()
print(ct1)
pt1= data.groupby('S2AQ4B').size() * 100 / len(data)
print(pt1)
ct2= data.groupby('S7Q1').size()
print(ct2)
pt2= data.groupby('S7Q1').size() * 100 / len(data)
print(pt2)
ct3= data.groupby('S2AQ16A').size()
print(ct3)
pt3= data.groupby('S2AQ16A').size() * 100 / len(data)
print(pt3)
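#A note on the ordering in the output below: values like 5 appear after 49 because the column values appear to still be stored as strings, and strings sort alphabetically. Converting to numeric first restores the natural order; a sketch on made-up ages:

```python
import pandas

# As strings, "5" sorts after "49" (alphabetical); as numbers, 5 comes first
ages = pandas.Series(['10', '5', '21', '18'])
print(ages.value_counts().sort_index())                     # 10, 18, 21, 5
print(pandas.to_numeric(ages).value_counts().sort_index())  # 5, 10, 18, 21
```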
Output
ct1:
S2AQ4B
35178
1 60
10 2632
2 48
3 116
4 239
5 449
6 724
7 1032
8 673
9 1928
99 14
dtype: int64
pt1:
S2AQ4B
81.632748
1 0.139234
10 6.107721
2 0.111387
3 0.269185
4 0.554614
5 1.041933
6 1.680087
7 2.394821
8 1.561739
9 4.474045
99 0.032488
dtype: float64
ct2:
S7Q1
1 3854
2 37784
9 1455
dtype: int64
pt2:
S7Q1
1 8.943448
2 87.680134
9 3.376418
dtype: float64
ct3:
S2AQ16A
8266
10 132
11 79
12 382
13 532
14 1020
15 1649
16 3301
17 3214
18 7042
19 2547
20 2661
21 4469
22 1250
23 721
24 501
25 1152
26 262
27 223
28 242
29 102
30 598
31 59
32 105
33 71
34 65
35 231
36 51
37 26
38 44
39 32
40 186
41 18
42 26
43 27
44 9
45 68
46 17
47 18
48 20
49 9
5 240
50 70
51 9
52 8
53 8
54 9
55 17
56 8
57 4
58 8
59 5
6 30
60 39
61 6
62 14
63 5
64 5
65 9
66 1
67 2
68 3
69 8
7 57
70 10
71 5
72 2
73 4
74 3
75 6
78 1
79 2
8 76
81 1
82 2
83 1
9 52
99 936
dtype: int64
pt3:
S2AQ16A
19.181770
10 0.306314
11 0.183324
12 0.886455
13 1.234539
14 2.366974
15 3.826608
16 7.660177
17 7.458288
18 16.341401
19 5.910473
20 6.175017
21 10.370594
22 2.900703
23 1.673126
24 1.162602
25 2.673288
26 0.607987
27 0.517485
28 0.561576
29 0.236697
30 1.387696
31 0.136913
32 0.243659
33 0.164760
34 0.150837
35 0.536050
36 0.118349
37 0.060335
38 0.102105
39 0.074258
40 0.431625
41 0.041770
42 0.060335
43 0.062655
44 0.020885
45 0.157798
46 0.039450
47 0.041770
48 0.046411
49 0.020885
5 0.556935
50 0.162439
51 0.020885
52 0.018565
53 0.018565
54 0.020885
55 0.039450
56 0.018565
57 0.009282
58 0.018565
59 0.011603
6 0.069617
60 0.090502
61 0.013923
62 0.032488
63 0.011603
64 0.011603
65 0.020885
66 0.002321
67 0.004641
68 0.006962
69 0.018565
7 0.132272
70 0.023206
71 0.011603
72 0.004641
73 0.009282
74 0.006962
75 0.013923
78 0.002321
79 0.004641
8 0.176363
81 0.002321
82 0.004641
83 0.002321
9 0.120669
99 2.172047
dtype: float64
The unlabeled first line in ct1, pt1, ct3, and pt3 is the blank (missing) responses. Ignoring those blanks, pt1 peaks at value 10 (about 6.1%), pt2 at value 2 (about 87.7%), and pt3 at age 18 (about 16.3%); the 99 code in pt3 is an unknown/missing code, not the most common age.