Week Two
Program:
#We are using Spyder, which will help us run our code in Python.
#When we open the program, the script editor is where we write our code. This window is on the left of the screen.
#On the lower right is the IPython console, where the commands are executed. We will work with these two windows this week.
#First, set the working directory to the folder where we want to keep the files for this class.
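#In Spyder the working directory can be set from the toolbar, but it can also be set in code. A minimal sketch; the folder below is just a placeholder (a temporary directory is used only so the snippet runs on any machine), so swap in the real path to your class folder:

```python
import os
import tempfile

# Placeholder for your class folder: replace this with the actual path
# where you keep the course files. tempfile.gettempdir() is used here
# only so the example works anywhere.
class_folder = tempfile.gettempdir()
os.chdir(class_folder)
print(os.getcwd())  # confirm the working directory changed
```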
#The script editor color-codes your program: blue font is for keywords, green is for strings, and red is for numbers.
#Key/value pairs are in purple.
#This is a comment; it won't be run by Python.
#If you write code and nothing is colored, you did something wrong; a red symbol will appear in the margin to help you debug your code.
#Python is case sensitive, so you need to be careful about which words you capitalize.
#Let's import the libraries we need for data analysis. Common libraries are pandas and numpy; we import both with the code below.
#(Spyder may show a warning triangle next to the imports until they are used; that can be ignored.)
import numpy
import pandas
#Now let's download the data. YOU HAVE to download it into the working directory; instructions are in the course's getting-started file.
#We name the resulting data frame "data".
#We pass low_memory=False because mixed column types can otherwise cause issues.
data = pandas.read_csv('data.csv', low_memory=False)
#Save the program: go to the menu and click Save.
#This code tells us the number of observations and the number of columns in our data set. The first number printed is the 43093 participants (rows); the second is the 3010 variables (columns) measured on them.
print(len(data))
print(len(data.columns))
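#An equivalent, slightly more compact check is data.shape, which returns (rows, columns) in one call. A sketch on a toy data frame (made-up values, not the NESARC data):

```python
import io
import pandas

# Toy data frame: 2 observations measured on 3 variables
toy = pandas.read_csv(io.StringIO("A,B,C\n1,2,3\n4,5,6\n"))
print(len(toy))          # 2 observations
print(len(toy.columns))  # 3 variables
print(toy.shape)         # (2, 3): both numbers at once
```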
#Sometimes numeric variables get reformatted as string (character) variables, and this creates problems, especially when you try to perform numeric operations on variables that have numeric responses but that Python considers non-numeric.
#So you have to check the format of your data. Type the code below to see the data type of a variable:
#Run it for each variable of interest. "data" is the data frame we named above; the name in brackets is the variable of interest.
#A result of int64 means Python considers the variable numeric. If that is the case, just leave it and move on.
#data['TAB12MDX'].dtype
data['S2AQ4B'].dtype #how often drank coolers in last 12 months
data['S7Q1'].dtype #ever had a strong fear or avoidance of social situation
data['S2AQ16A'].dtype # age when started drinking, not counting small tastes or sips
#This sets the variables you will be working with to numeric (if you are converting, do this for each variable you will be working with):
#data['TAB12MDX'] = pandas.to_numeric(data['TAB12MDX'])
data['S7Q1'] = pandas.to_numeric(data['S7Q1'])
data['S2AQ16A'] = pandas.to_numeric(data['S2AQ16A'])
data['S2AQ4B'] = pandas.to_numeric(data['S2AQ4B'])
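#If pandas.to_numeric raises an error on these variables (skipped answers in the file can be stored as blanks, which are not parseable as numbers), passing errors='coerce' converts anything unparseable to NaN instead. A sketch on made-up values:

```python
import pandas

# Made-up answers: two ages, a blank (space) response, and the 99 code
raw = pandas.Series(['18', '21', ' ', '99'])
cleaned = pandas.to_numeric(raw, errors='coerce')
print(cleaned)  # 18.0, 21.0, NaN, 99.0: the blank becomes NaN
```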
#Lesson three: running your program and examining frequency distributions
#Remember, we are looking at one variable at a time; this is called univariate or descriptive analysis. To convert raw data into useful information we need to summarize and examine the distribution of each variable of interest.
#By the distribution of a variable we mean what values the variable takes and how often it takes those values.
#For example, we ask a group of students how they feel about their body: overweight, about right, or underweight. Three say overweight, five say underweight, and seven say about right. With a handful of students we can count by hand, but with a massive sample we need a way to see how many of our observations fall within each category.
#So we use code to do this.
#We do this by forming a table with our categories (overweight, underweight, about right), the count (the number of students who fall within each category), and, most importantly, the relative frequency: the percentage of students falling in each category.
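#The body-image example above can be computed directly. A sketch using the made-up counts (3 overweight, 5 underweight, 7 about right):

```python
import pandas

# 15 made-up student answers matching the example above
answers = pandas.Series(
    ['overweight'] * 3 + ['underweight'] * 5 + ['about right'] * 7
)
counts = answers.value_counts()                         # count per category
percents = answers.value_counts(normalize=True) * 100   # percentage per category
print(counts)
print(percents)  # about right ~46.7%, underweight ~33.3%, overweight 20%
```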
# so this is how we generate a frequency distribution table:
print("how often drank coolers in last 12 months")
c1 = data["S2AQ4B"].value_counts(sort=False)
print(c1)
print("ever had a strong fear or avoidance of social situation, yes=1")
c2 = data["S7Q1"].value_counts(sort=False)
print(c2)
print("age when started drinking, not counting small tastes or sips")
c3 = data["S2AQ16A"].value_counts(sort=False)
print(c3)
#calculating the percentages
print("how often drank coolers in last 12 months")
p1 = data["S2AQ4B"].value_counts(sort=False, normalize=True)
print(p1)
print("ever had a strong fear or avoidance of social situation, yes=1")
p2 = data["S7Q1"].value_counts(sort=False, normalize=True)
print(p2)
print("age when started drinking, not counting small tastes or sips")
p3 = data["S2AQ16A"].value_counts(sort=False, normalize=True)
print(p3)
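#Counts and percentages can also be shown side by side in one table with pandas.concat. A sketch on made-up yes/no/unknown codes (not the real variables):

```python
import pandas

# Made-up responses using NESARC-style codes: 1=yes, 2=no, 9=unknown
s = pandas.Series([1, 2, 2, 2, 9, 1, 2])
table = pandas.concat(
    [s.value_counts(sort=False),
     s.value_counts(sort=False, normalize=True) * 100],
    axis=1,
    keys=['count', 'percent'],  # column names for the combined table
)
print(table)
```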
#frequency distribution using groupby, an alternative way to get the same counts and percentages
ct1= data.groupby('S2AQ4B').size()
print(ct1)
pt1= data.groupby('S2AQ4B').size() * 100 / len(data)
print(pt1)
ct2= data.groupby('S7Q1').size()
print(ct2)
pt2= data.groupby('S7Q1').size() * 100 / len(data)
print(pt2)
ct3= data.groupby('S2AQ16A').size()
print(ct3)
pt3= data.groupby('S2AQ16A').size() * 100 / len(data)
print(pt3)
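#A note on the ordering in the output below: values like 5 appear after 49 because the column values appear to still be stored as strings, and strings sort alphabetically. Converting to numeric first restores the natural order; a sketch on made-up ages:

```python
import pandas

# As strings, "5" sorts after "49" (alphabetical); as numbers, 5 comes first
ages = pandas.Series(['10', '5', '21', '18'])
print(ages.value_counts().sort_index())                     # 10, 18, 21, 5
print(pandas.to_numeric(ages).value_counts().sort_index())  # 5, 10, 18, 21
```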
Output
ct1:
S2AQ4B
35178
1 60
10 2632
2 48
3 116
4 239
5 449
6 724
7 1032
8 673
9 1928
99 14
dtype: int64
pt1:
S2AQ4B
81.632748
1 0.139234
10 6.107721
2 0.111387
3 0.269185
4 0.554614
5 1.041933
6 1.680087
7 2.394821
8 1.561739
9 4.474045
99 0.032488
dtype: float64
ct2:
S7Q1
1 3854
2 37784
9 1455
dtype: int64
pt2:
S7Q1
1 8.943448
2 87.680134
9 3.376418
dtype: float64
ct3:
S2AQ16A
8266
10 132
11 79
12 382
13 532
14 1020
15 1649
16 3301
17 3214
18 7042
19 2547
20 2661
21 4469
22 1250
23 721
24 501
25 1152
26 262
27 223
28 242
29 102
30 598
31 59
32 105
33 71
34 65
35 231
36 51
37 26
38 44
39 32
40 186
41 18
42 26
43 27
44 9
45 68
46 17
47 18
48 20
49 9
5 240
50 70
51 9
52 8
53 8
54 9
55 17
56 8
57 4
58 8
59 5
6 30
60 39
61 6
62 14
63 5
64 5
65 9
66 1
67 2
68 3
69 8
7 57
70 10
71 5
72 2
73 4
74 3
75 6
78 1
79 2
8 76
81 1
82 2
83 1
9 52
99 936
dtype: int64
pt3:
S2AQ16A
19.181770
10 0.306314
11 0.183324
12 0.886455
13 1.234539
14 2.366974
15 3.826608
16 7.660177
17 7.458288
18 16.341401
19 5.910473
20 6.175017
21 10.370594
22 2.900703
23 1.673126
24 1.162602
25 2.673288
26 0.607987
27 0.517485
28 0.561576
29 0.236697
30 1.387696
31 0.136913
32 0.243659
33 0.164760
34 0.150837
35 0.536050
36 0.118349
37 0.060335
38 0.102105
39 0.074258
40 0.431625
41 0.041770
42 0.060335
43 0.062655
44 0.020885
45 0.157798
46 0.039450
47 0.041770
48 0.046411
49 0.020885
5 0.556935
50 0.162439
51 0.020885
52 0.018565
53 0.018565
54 0.020885
55 0.039450
56 0.018565
57 0.009282
58 0.018565
59 0.011603
6 0.069617
60 0.090502
61 0.013923
62 0.032488
63 0.011603
64 0.011603
65 0.020885
66 0.002321
67 0.004641
68 0.006962
69 0.018565
7 0.132272
70 0.023206
71 0.011603
72 0.004641
73 0.009282
74 0.006962
75 0.013923
78 0.002321
79 0.004641
8 0.176363
81 0.002321
82 0.004641
83 0.002321
9 0.120669
99 2.172047
dtype: float64
The unlabeled first line in ct1, pt1, ct3, and pt3 is the blank (missing) responses. Ignoring those blanks, pt1 peaks at value 10 (about 6.1%), pt2 at value 2 (about 87.7%), and pt3 at age 18 (about 16.3%); the 99 code in pt3 is an unknown/missing code, not the most common age.