# Module V: Sampling Methods with Prof. Dr. Pascale Salameh

# Summary

Enrich your understanding of advanced clinical statistics in this comprehensive on-demand teaching session. The presenter will guide you on the principles of data sampling, a critical element in clinical trials and other medical research studies. You will receive clear explanations and easily-understandable examples of statistical measurements including the win ratio, relative risk, hazard ratio, and odds ratio, as well as their implications in the context of medical treatments and interventions. Additionally, you will learn vital computer-aided techniques for sample selection utilizing the SPSS statistical software, with dedicated sections on random and non-random sampling. This course is an essential resource for medical professionals seeking to deepen their knowledge of statistics and enhance their research efficacy.

Generated by MedBot

# Learning objectives

- Understand and calculate the win ratio method used in clinical trials and how it differs from relative risk and the odds ratio.
- Understand the principles of sampling, why it is fundamental to medical research, and its advantages and disadvantages.
- Understand and apply the concept of 95% confidence interval and how it impacts the reliability of research findings and their extrapolation to the population.
- Learn and execute sample size calculation using statistical software and understand its importance in conducting medical research.
- Recognize the difference between random and quota sampling, their applications, and the implications for sample representativeness.

#### Computer generated transcript

Warning!

The following transcript was generated automatically from the content and has not been checked or corrected manually.

Hello. We can go if you want. I'm going live now. OK. Yes, now we are certainly live. We can start.

All right. Hello everyone. I hope you're fine today. I hope you had time to rest and maybe to review a little of what we did yesterday. I didn't receive any questions except the one that we did not solve yesterday. For the person who asked it, it was about the win ratio, if you remember. Actually, it's not a big deal. The win ratio is a method of calculation that is used for composite scales. It's used mainly in clinical trials, and it's a way of calculating how much people are really benefiting from a certain intervention. It's the ratio of the number of people who are benefiting over the number of people who are not. It's somehow similar to the number needed to treat and the number needed to harm, all these special ways of doing calculations that are a little bit different from the relative risk and the odds ratio but eventually mean the same. I don't know if you need more information; I guess it's as simple as that. It's not that difficult, and it has not been applied everywhere. This is why, for example, you cannot calculate it except manually. It doesn't exist in the known statistical software where you can easily calculate the relative risk, the hazard ratio, the odds ratio and their adjusted counterparts. Until now it is not included in the statistical software; you have to calculate it manually. I hope this answers your question.

Now I will move to my second part, which is your module number five. We will talk about some sampling principles and sample size calculation.
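
Going back to the win ratio answered at the start of the session: since it is described as something to calculate by hand, a minimal Python sketch may help. It assumes a single outcome score per patient where a higher score is better, compares every treated patient with every control patient, and divides the number of wins by the number of losses. The scores are made up, and real trials usually rank several outcomes in a hierarchy, which this sketch does not attempt.

```python
def win_ratio(treated, control):
    """Wins divided by losses over all treated-versus-control pairs.

    Assumption: one score per patient, higher is better. Ties count
    as neither a win nor a loss.
    """
    wins = sum(1 for t in treated for c in control if t > c)
    losses = sum(1 for t in treated for c in control if t < c)
    if losses == 0:
        raise ValueError("win ratio is undefined when there are no losses")
    return wins / losses


# Hypothetical scores, for illustration only.
treated_scores = [3, 5, 4]
control_scores = [2, 4, 4]
ratio = win_ratio(treated_scores, control_scores)
print(ratio)
```

A value above one means the treated group wins more pairwise comparisons than it loses.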
So I will start by talking about sampling, and then I will show you on software how to do a sample size calculation, because this is really very important whenever you want to work in an appropriate manner and reach a result that is useful and statistically significant when necessary.

Let us start with the sampling methods. We do sampling because we want to collect data and we are unable to work on populations. Unfortunately, working on a full population is very difficult and requires lots of resources, so most of the time we work on samples. This has the advantage of being less expensive and less time consuming. However, it has disadvantages; in particular, it increases the risk of not being representative. If your sample is not representative of your population, this will lead you to draw erroneous conclusions. So this is my first problem. My second problem is sample fluctuation. There are many types of errors that might occur, and sample fluctuation is a type of random error that means my conclusion is not 100% sure. Because I work on samples, all my conclusions will be more or less trustworthy. This is why they invented what we call the 95% confidence interval. The word confidence is how much you trust your result and how much you can extrapolate the results from the sample to the population.

The population is the complete set of items to which the survey results are to be extrapolated. As we said, the ideal is to work on a full population. However, this is mostly impossible, and this is why we go to a sample, where we select some members of this population and carry out our study on these members. Now, you have to know that most of the time you have to use what we call a sampling frame. What is a frame? It's a list of items.
It's a list of individuals. It's a list of observations. It's a list of statistical units, actually, from which my sample is selected. If I'm working on people, and most of the time this is the case, it's the list of all people within the population. If I'm working on rats, it's the list of rats. If I'm working on samples or objects, it's the list of these samples or objects. So the ideal is to have this sampling frame from which I can draw my sample. And here is another problem: sometimes the list is available, but sometimes it is not, and this is where problems start to occur. If I don't have a list, how can I know what my population is in order to draw a sample from it? This is not really easy. We will talk about it.

There are also survey steps: sampling can be done in one step, in two steps or in three steps. One quick example: in the first step you have a frame that is the list of municipalities; in the second step you take the list of inhabitants per selected municipality. This is a way of overcoming the absence of a list of inhabitants living all over the country. This is particularly true for Lebanon. I think you all know that in Lebanon there is no list of the people who live in Lebanon, for many reasons, political, religious, things like that. Since we don't have a list of individuals, we go to this kind of two-step or two-stage sampling in order to come up with an acceptable sample for our study.

What is a random sample? Each person in the population has a known probability of being selected. This is what we call a random sample. It's very simple: imagine you have a list in your hand and you just do a random draw from this list.
You can put all the names on small pieces of paper, put them in a container and just draw the names from the container. This is if you want to do it physically. But of course, nowadays, with sampling software, we don't need this type of work. We just have our list within the software and we ask the software to do the sampling for us. I will show you how we do it.

You also have non-random sampling. Here you are not starting from a list, because you don't have the sampling frame. A non-random sample is based most of the time on the judgment of the surveyor or the researcher. For example, a person is included in the sample if he or she is considered representative. This is particularly true when we talk about the quota method. The quota method is when you divide your population into parts and you say: I want to take, let's say, 50 individuals who are between 20 and 30, then I want to take 40 individuals who are between 40 and 50, and so on. So you divide your sample according to percentages that you think are good, based on percentages that you may get from the literature or from the central statistical agency in Lebanon. This is an institution that is related to the ministry [inaudible]; in fact, it is directly related to [inaudible]. They can tell you the percentages of people who live in Lebanon: the percentage by gender, males and females, the age categories, the percentage of educated versus uneducated, things like that. You can rely on these percentages to come up with a representative distribution for your sample. Then you go and do your quota sampling, and you try to assemble a sample that is similar in structure to these quota percentages.
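
The quota method described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fill_quotas`, the age categories and the respondent list are hypothetical, and respondents are simply accepted in the order they arrive until each category is full, which is exactly why the result is not a random sample.

```python
from collections import Counter


def fill_quotas(respondents, quotas):
    """Accept respondents in arrival order until each category's quota is full.

    respondents: list of (person_id, category) pairs.
    quotas: dict mapping category to the number of people wanted.
    """
    counts = Counter()
    accepted = []
    for person_id, category in respondents:
        # Categories without a quota are never accepted.
        if counts[category] < quotas.get(category, 0):
            counts[category] += 1
            accepted.append(person_id)
    return accepted


# Hypothetical quotas and arrivals, for illustration only.
quotas = {"20-30": 2, "40-50": 1}
arrivals = [(1, "20-30"), (2, "40-50"), (3, "40-50"),
            (4, "20-30"), (5, "20-30"), (6, "60+")]
accepted = fill_quotas(arrivals, quotas)
print(accepted)
```

No list of the population is needed here, only the target structure, which is the point the lecture makes about quota samples.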
One word about what we call an estimate. Of course, you know that from the sample we will calculate a mean, a standard deviation, a percentage, a prevalence. All these are considered estimates, and they are used to estimate or extrapolate to the population, to see what is going on in the population.

So I will just open my SPSS. OK, just two seconds. All right. This is SPSS. I guess the majority of you know SPSS, particularly those who finished the fourth year; we've already worked on SPSS. You have the variable view and you have the data view. I will open a database that you might know, one that I use a lot during my courses. It's a database about a representative sample taken from Lebanon. The study had the objective of calculating the prevalence of COPD in Lebanon. It was a national study that we did. We went over the Lebanese regions with a portable spirometer, we took spirometric measures for people, and we decided whether they had or did not have COPD. Of course, there are lots of values that we took.

Suppose that I want to take a random sample of this big list. It's 1800. What should I do? I go to Data, and here I choose Select Cases, and look, it's just here: a random sample of cases. You press here and you ask SPSS to draw for you either a certain percentage of cases, let's say 20% of cases, or an exact number, let's say 100 cases out of the total, which is a little bit more than 1800. We say Continue and we say OK. What happens is that it selects whatever is needed, and what is not needed is just hidden, so you can do all your calculations on your random sample of this big sample. This is one way of using it.
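
For readers without SPSS at hand, the same idea (drawing either an exact number of cases or a percentage of cases at random) can be sketched with Python's standard `random` module. This is not what SPSS does internally; the function names are hypothetical, and the 1800 case numbers are a stand-in for the database shown in the session.

```python
import random


def draw_exact(case_ids, n, seed=None):
    """Draw exactly n distinct cases, each with the same chance of selection."""
    rng = random.Random(seed)
    return sorted(rng.sample(case_ids, n))


def draw_fraction(case_ids, fraction, seed=None):
    """Draw a given fraction of the cases (for example 0.20 for 20%)."""
    n = round(len(case_ids) * fraction)
    return draw_exact(case_ids, n, seed)


# Stand-in for the list of cases: IDs 1 to 1800 (an assumption).
cases = list(range(1, 1801))
sample_100 = draw_exact(cases, 100, seed=1)
sample_20pct = draw_fraction(cases, 0.20, seed=1)
print(len(sample_100), len(sample_20pct))
```

Fixing the seed makes the draw reproducible, which is convenient when the selection has to be documented.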
Another way, as we said, is when we want to draw a sample out of a full population. I told you that it's not possible, because the last census in Lebanon was in 1932. If you look at your ID, you will find it written that it is based on the 1932 census. This was the last census that happened in Lebanon. This is why we don't have a full list of individuals. But what we have is something interesting. The Central Agency of Statistics [inaudible] published what is known as the list of Lebanese communities. It's a list that is available on their website, so you can just download it in Excel or in Word format. From Excel you can easily open it with SPSS. And what do you have? The list of all the villages and all the communities in the Lebanese regions. In Lebanon, it is known that we have 2789 communities.

So if I want to do a study similar to the one I told you about, the national study on COPD, I cannot choose the people, but I can choose the villages and communities. What we did was take 100 communities from this list. We went to the same place and selected a random sample of 100 cases from the 2000, I think 800 or 900, whatever. You say Continue and OK, and you will have here the list of 100 communities that are selected for you. Then you go to every community that was selected and you try to see the municipality, or whoever you want, somebody who is known to be a key informant in his village or community, and you ask him to tell you who lives there. In general, each of these people has, in his own community, a list of the people who live there.
The second step is to choose, among those who live there, who you want to interview, for example. And this is the way, as we said, to proceed. So this is a way of doing a kind of sample that is good. It's a random sample. It's not the best way, but we have no other options.

Now I will take these different types of sampling and go a little bit into the details. The simple random sample, as its name indicates, is simple. You have a population and you have a random number generator, as I showed you using SPSS or others, and this allows you to select a sample of size small n from the population of size big N. As we said, every individual in the population has a known probability, and the same probability, of being selected. You do the selection regardless of any other factor: regardless of gender, age, religion, region. You don't look at anything. It's just simple random. But of course, there are really thousands of samples that can be drawn from one population, and you might have some fluctuation between these samples. In addition, if you have a subgroup that is small, it might, just by chance, not be represented in your sample. To avoid this small drawback, we go to other ways, as we will see. So how do I select it? As we said, you have a sampling list or sampling frame of the N members of the population, and you use a random process to generate n numbers. Very simple, very straightforward.

Now, if your frame or your list is not in Excel or in any other software, you have it on paper, and it's difficult to copy it into your software. So what you do is what we call a systematic sample. How do you do this systematic sample? You choose the first individual randomly, and then for the rest you just keep adding a fixed interval that we call a step.
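
The systematic selection just described (a random start, then a fixed step) can be written as a short Python sketch. The function name and the numbered frame are assumptions for illustration; on paper the same thing is done by counting down the list.

```python
import random


def systematic_sample(frame, n, seed=None):
    """Pick a random start within the first step, then every step-th unit."""
    if not 0 < n <= len(frame):
        raise ValueError("n must be between 1 and the size of the frame")
    step = len(frame) // n                       # the fixed interval
    start = random.Random(seed).randrange(step)  # random position 0..step-1
    return [frame[start + i * step] for i in range(n)]


# Hypothetical frame: individuals numbered 1 to 1000, drawing 100 of them.
frame = list(range(1, 1001))
selected = systematic_sample(frame, 100, seed=3)
print(selected[:5])
```

With 1000 individuals and a sample of 100 the step is 10, so the selected numbers are always 10 apart.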
For example, I have a population of 1000 individuals and I want to draw 100 individuals from it. I want to draw 100 out of 1000, that is, one out of 10. So from every 10 individuals I need one. What I do is take 10 pieces of paper on which I write 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. I randomly select one of these papers; let's say I obtain seven. Then I select all individuals whose number ends in seven: 7, 17, 27, 37, 47, 57 and so on. When I finish, my sample will represent my population. It's a good way to avoid copying the full list onto your computer, and it is a more or less acceptable way of drawing this kind of random sample.

Now, I told you that whatever you use, either a simple random sample or a systematic one, there is a chance that you might miss small subgroups, subgroups that are not very numerous in your population. So what do we do to guarantee that these small subgroups will be selected? We divide our population into what we call strata, and from each stratum I take a number of people that is more or less representative of this stratum. There are many ways of choosing strata. For example, I can take administrative strata, such as geographic regions, or I can take rural and urban areas, or anything else. Then I go and search for a sampling frame in each stratum. This method has very good accuracy if you have good homogeneity within every stratum and heterogeneity between strata.

In general, you can have several methods of allocation. Either you take a haphazard allocation, which means I don't know how many people I will take from every stratum, for example from every geographical region; I take whatever comes. The second way is to have an equal allocation.
That is an equal number of subjects in each stratum. The third way, and I think that this one is the best one, is proportional allocation: the proportion of each stratum in the sample is the same as its proportion in the population. Here, just let me arrange this one; it should be like that. This is known to be an optimal allocation, one that allows me to gain in terms of accuracy.

This is an example. This one is an equal allocation: whatever the size of the strata within the population, you take a sample that includes the same number from each. This is not the best way. In this other one, if you have in your population, let's say, 40, 35, 15 and 10, you try to get a sample that includes 40, 35, 15 and 10, the same distribution. This is much better, and it gives you a much more accurate representation of your population. Why is it the best choice? Because the sample is self-weighted, so it's more representative, and the estimation calculations are simple.

If you have a stratum that is too small, you might prefer equal allocation, or you might overrepresent the stratum that is too small. For example, there may be a region with a very low percentage of people that you still want to include and represent. So you increase a little the number of people that you take from there. If, for example, it represents 1% of the population, in your sample you put it at 5%, and then you apply a weighting adjustment when you do the analysis. This is how you get much better results.

Now, one additional thing. I told you about simple sampling, systematic sampling and stratified sampling. They are good, they are practical, they allow me to do excellent work.
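
The proportional allocation and the weighting of an overrepresented stratum can be illustrated with a small Python sketch. The stratum names, the total sample of 200 and the helper names are assumptions; the percentages 40, 35, 15 and 10 and the 1% versus 5% case are the ones mentioned above. The weight used here, population share divided by sample share, is the usual way to bring an overrepresented stratum back to its real size at analysis.

```python
def proportional_allocation(stratum_percent, n):
    """Number of subjects per stratum, matching the population percentages."""
    return {name: n * pct // 100 for name, pct in stratum_percent.items()}


def stratum_weight(population_percent, sample_percent):
    """Analysis weight for a stratum that was over- or underrepresented."""
    return population_percent / sample_percent


# Hypothetical strata with the percentages used in the lecture example.
population = {"A": 40, "B": 35, "C": 15, "D": 10}
allocation = proportional_allocation(population, 200)
print(allocation)

# A stratum that is 1% of the population but 5% of the sample.
weight = stratum_weight(1, 5)
print(weight)
```

Each subject from the overrepresented stratum then counts for one fifth of a subject in the overall estimates.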
However, if you don't have a sampling frame from the beginning, you cannot apply them: neither the simple, nor the systematic, nor the stratified. What do we do in this case? We do cluster sampling, as I showed you in SPSS. Instead of choosing individuals, you choose groups of individuals. For example, instead of choosing people all over Lebanon, we chose villages, and then in every village we went and did our study. Another example: if you are working on schoolchildren, going to every school in Lebanon and choosing one or two children from each is time consuming. So from the list of schools you choose a certain number of schools, let's say 20, and then from every school you take a certain number of children. The same goes for universities, and the same for hospitals. Whatever you are doing, you take a list of clusters, you select clusters from this list, and you select people from these clusters.

So whenever the population is large, or you have difficulty constructing an accurate frame, or having this sampling frame is very expensive, or your population is very dispersed and you might have logistic problems of time, budget and everything, you choose cluster sampling, and it will be a very good way of doing your study, because we are still talking about a random sample. You divide your population into clusters, or groups of individuals. Then you do a random sampling of a certain number of clusters, through simple, systematic, stratified or probability-proportional-to-size sampling. Then, at the second step, you do a random selection of subjects in the selected clusters. Yes, you can take the same number of subjects in each cluster. But even if it's not possible, it's OK.
Even if you have to take different numbers of subjects, it's OK; you can do so. So here, this is your population, these are your clusters, you choose a certain number of representative clusters, and then from every cluster you choose individuals to include in your sample.

Now, of course, you have to know that using cluster sampling has a cost. It's less expensive than a simple random or a systematic sample, and less expensive than a stratified sample. However, you have to keep something in mind: whenever you choose individuals from one cluster, there are high chances that these individuals are similar in lots of ways. Since they are similar, they might have the same characteristics, exposures, disease status, lifestyle behaviors and things like that. So they are homogeneous within every cluster, and this is not that good. To take into account this high homogeneity within clusters, what we do is multiply the total sample by two. If for a simple random sample you would need, let's say, 300 individuals, for a cluster sample you have to multiply that by two. This is what we call the sample design effect or the study design effect.

These kinds of sampling are random samples that are very good from the scientific point of view, because they allow me to be more or less confident that the sample I draw will be representative. Now, how to do it: I showed you how we did it, so I will not repeat it. This is a manual way of doing this kind of sampling.

Every type of random sampling has its own advantages and disadvantages. The simple random sample has good accuracy, it is simple in conception, and it is the theoretical basis of all general statistics. However, it might have some disadvantages. First, it might be impractical, because you need a complete list of the population and, as we said, it's not always available. Second, some subgroups may be either over- or underrepresented.
Here you might want to change the simple random sampling for something that gives you a better result.

The systematic sample is similar to the simple one, with a better uniform distribution of selected subjects all over the list. It is useful for simplifying the simple random sampling selection procedure, particularly, as I told you, if your list or sampling frame is not computerized and you have it on paper.

Third, the stratified sample. You have good accuracy. The sample will include subjects from all strata, particularly the small ones that you are afraid will not appear in a simple or a systematic sample. It allows you to have estimates per stratum. You can decide ahead of time how many you need from every stratum. Sometimes it's really very simple, even simpler than the simple random sample. But sometimes it can be operationally challenging, because here also you need a list within every stratum.

And the cluster sample. This is the one that we like, particularly in Lebanon. It has several advantages. It is convenient, because you don't need the list of the whole population; it's enough to have the list of the clusters. It is cost effective and the most likely to be used. As we said, you don't need a lot of time to go and search for individuals who are dispersed all over the country. However, it is less accurate than other random sampling methods, because you have too much homogeneity within the cluster. The accuracy will be better if you have intracluster heterogeneity and intercluster homogeneity. So how can we become more accurate? As usual, accuracy is gained whenever we increase the sample size.

Now, the cluster effect, or the design effect, as I told you, is a multiplier: for cluster sampling it is equal to two, and for simple random sampling it is equal to one.
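
The design effect rule stated above is a one-line calculation; the sketch below only makes it explicit. The function name is hypothetical, and the value of two is the rule of thumb used in this lecture (in practice the design effect depends on how similar people are within a cluster and on the cluster size).

```python
import math


def adjust_for_design(n_simple_random, design_effect=1.0):
    """Multiply the simple-random-sample size by the design effect."""
    return math.ceil(n_simple_random * design_effect)


# Lecture example: 300 subjects under simple random sampling.
n_srs = adjust_for_design(300)                          # design effect 1
n_cluster = adjust_for_design(300, design_effect=2.0)   # design effect 2
print(n_srs, n_cluster)
```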
So just remember one thing: whenever you're doing any cluster sampling, you put a two. What does it mean? You multiply the sample size that you calculate by two, in order to have a good distribution and, at the same time, a confidence interval that is not very wide. This allows you to decrease the width of the confidence interval.

Now, in addition to these random samples, which are well known to be very interesting from the scientific point of view, and which allow me to have a good discussion and say that I did what I had to do and chose a random sample, you have non-random samples that are sometimes used. Let me tell you that we started using lots of non-random samples particularly during the COVID-19 pandemic, when we started doing online surveys, and online surveys most of the time are not based on lists. Let us be clear: if you have a list of individuals, a WhatsApp group for example, and you know that you have a closed number of individuals and you are targeting these individuals, it's OK; you can consider that it is a random sample. But most of the time, particularly if you're working with the general population, there is no way you can say it is a random sample. You put a link to the survey on Facebook, on Instagram, on LinkedIn, and you call people that you know: please, can you fill out my survey? All this is a non-random sample.

You might choose a quota sample. The quota sample resembles the stratified sample a little. You decide ahead of time the characteristics of the people that you want to answer, and you try to balance the age categories, the gender categories, the regions and things like that. But what's the difference between this one and the stratified sample? The stratified sample is a random sample.
In every stratum you have a list. In the quota sample you don't have a list; you just decide: I want somebody of a certain age and a certain gender, and I want him to answer my survey or my study.

Another way of doing it is what we call snowball sampling. This one is also used a lot in electronic surveys. You start your survey with a small group of people, then you ask these people to pass it on to the people that they know, and you tell them to ask those people to pass it on to other people that they know. This is what we call the snowball sample. Of course, there will always be a bias toward a social group, because if you start with a group of people who are well educated, for example, you will most probably obtain a full sample of well-educated people. This is what is going on with us in the majority of the surveys that we conduct electronically: we always obtain more women, more educated people, with a better socioeconomic status. So is it representative? No. Is it the best way to calculate the prevalence of a disease, the prevalence of certain types of exposure, or an incidence? No. The only thing that you can do with this kind of nonrepresentative sample is to work on associations, because associations are not affected by the sample representativity. This is good, or else our work would be irrelevant.

So if you want to do an electronic survey, you have to pay attention. If you have a list, great, start from it. If you don't have a list, you can still do it, but you cannot claim that you are measuring either a prevalence or an incidence. You cannot calculate a national prevalence of diabetes or COPD if you are using this kind of sampling. Now, when is it good to use snowball sampling? If you have a rare disease or a rare exposure and you really want people who have this rare disease or exposure, yes, you can use it.
Or, as we said, for online or electronic surveys you can use snowball sampling.

In addition to that, you might go for what we call samples of volunteers. Is it a good way? Not that good in general. This is a nonprofessional sampling technique. It's useful for preliminary research, just to choose around seven or eight people you know to try your questionnaire, for example. Yes, you can do so, but not more than that. And there is the accidental or convenience sample. This is whenever you target the most accessible individuals, and it has a questionable scientific value. For example, if you go to a mall and try to fill out some questionnaires there, I don't know how representative the people in the mall are. So all these types of non-random samples might be useful sometimes, but they are not the best, particularly not for measuring the prevalence and incidence of diseases and exposures. You cannot do so. You can measure associations; it's OK. You can also validate scales, more or less; it's also OK.

Now that we have looked at the different types of samples, we also have to know that it's not only the type of the sample that matters; the size of the sample matters too, and this is what we will do now. But before I move to the second part, I would like to ask you: do you have any questions about what we did until now? I don't think there are any questions. We can give the participants one minute; maybe they are writing. OK. All right. It seems that things are clear.

Now, I would like you to know something. All these sampling methods are mainly important for cross-sectional studies. If we are talking about case-control studies, in general the issue is easier, because cases will be taken from all the people who have the disease and controls will be taken from people who do not have the disease. Do I have to include everybody?
Most of the time I might include everybody, and I will hardly be able to reach the sample size even when I include everybody who would like to work with me. But let me tell you one thing. Sometimes you might have a register that includes, let's say, thousands of individuals who have the disease, and you don't want these thousands; you only want 500, let's say. Here you would choose either a simple random sample or a stratified sample. All the methods that we stated might apply in this case.

The same applies for cohorts. Most of the time you can hardly reach the minimal sample size with the individuals and the population that are available. However, if for a certain reason you have a huge population that you want to follow up and you don't want to follow up everyone, here again you will apply these sampling methods in order to select the people that you want to include in your study.

My last idea is about interventions. If I have an experimental study, again, I can hardly reach the sample size. But if I have too many, I will apply these sampling methods. What is also important is to differentiate between a random sample and randomization to the intervention. We use the same wording, random, but pay attention. A random sample is a random sample of individuals taken from a list. After that, there is another step where you allocate or assign the interventions, for example drug versus placebo, randomly. Again, it's random, but here the randomization is something different. It's whenever you assign people to the comparison groups: some will take the medication, some will take the placebo, for example. This is just a parenthesis so that you don't mix up the two, as they both contain the word random. OK. All right. So now we move to the power and sample size calculation.
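
To keep the two uses of the word random apart, here is a small Python sketch that performs them as two separate steps. Everything in it is illustrative: the register of 5000 numbered patients, the function names and the two arm labels are assumptions, and the allocation is a simple shuffle with alternating arms rather than any particular trial's randomization scheme.

```python
import random


def random_sample(frame, n, rng):
    """Step 1, random sampling: who enters the study, drawn from a list."""
    return rng.sample(frame, n)


def randomize(participants, rng, arms=("drug", "placebo")):
    """Step 2, randomization: which intervention each participant receives."""
    shuffled = list(participants)
    rng.shuffle(shuffled)
    # Alternate arms over the shuffled order to get balanced groups.
    return {person: arms[i % len(arms)] for i, person in enumerate(shuffled)}


rng = random.Random(42)
register = list(range(1, 5001))            # hypothetical disease register
participants = random_sample(register, 500, rng)
allocation = randomize(participants, rng)
print(sum(arm == "drug" for arm in allocation.values()))
```

The first step is about representativeness of the sample; the second is about comparability of the groups being compared.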
And actually, as I told you, the nature of the sample is important, but its size is also important. The nature of the sample is related to its representativeness, and if it is not representative, it will cause what we call selection bias. For the size, however, the problem is totally different. If you don't reach the minimal sample size that is necessary for your study, you will have a random error, and this random error will cause a low power to detect whatever you want to detect through your study. You want to detect a difference between two groups, you want to detect an association between two variables, you want to detect a correlation: whatever you want to do, if you want to be able to do it statistically, you need a minimal sample size that is really powerful enough to do so. And calculating the minimum sample size will allow us to test any hypothesis appropriately. Now, of course, we should fix the risks of error, which are called alpha, or type one error, and beta, or type two error. I will explain a little bit more what alpha and beta mean. We were saying that if you work on a sample, you want to take your results from the sample and extrapolate them to the population. This extrapolation will always include a risk of error, and in general there are two types of error, alpha and beta. The alpha error is fixed at 5% and the beta error has a maximum of 20%. We wish both alpha and beta to be lower, and the way to minimize the error is by increasing the sample size; there's no other way. But you cannot increase your sample size indefinitely, because increasing your sample size means that you need more money, more time, more energy, more resources. And this is why we calculate a minimal sample size.
So we strike a certain equilibrium between how much I can pay and how much I can consume in terms of resources, versus the minimal risk of error that I can afford, which is 5% for alpha and 20% for beta. In order to do this calculation, there is another aspect that I need to know: what is the minimal difference or association that you would like to see? In other words, how powerful do you want your study to be? Well, the ideal is to have a super powerful study. Yes, but that is if you have $1 million and 10 years to do it, which is not the case most of the time. You don't have a million dollars and you don't have 10 years; you have a limited amount of time, a limited amount of energy, and no money or a minimal amount of money. So this is why you can be modest. It's OK. You can say: I will go for a minimal difference, or a minimal association, that makes sense at least from the clinical point of view. This is what we call an effect size. For example, an odds ratio of two is good. Now, we can go for a smaller odds ratio and come up with significant results with a very small odds ratio. However, this would need a lot. So if you want to be fair, you choose a minimal difference or association that makes clinical sense, and this is what we call a moderate effect size. Then, after you fix alpha, beta and the minimal difference or association, you apply either a mathematical formula or you use software. There are many software packages: some are available online and very simple, some are downloadable for free, and I will show you Epi Info and G*Power. These are the two that I mostly use, and a lot of people use them. Once you apply the formula, you obtain a minimal sample size, which you should then respect.
Respecting this minimal sample size will increase your confidence in your results, particularly if you have negative results. I mean, if the p-value that you obtain is higher than 0.05 and you have a good sample size, then you will be at ease. You will say: I did what I had to do, I respected a minimal sample size, and still my results were negative. So maybe there is no association, or maybe there is no difference. But if you don't respect your minimal sample size, you have every chance of obtaining a negative result, a p-value that is not significant, and then you simply cannot conclude. So, one word about the power of a test. Based on what we said, let's get a little deeper into the detail. The type one error, or alpha error, is the risk of rejecting the null hypothesis while it is true. What is the null hypothesis? It is the hypothesis that you put in your mind saying that there is no difference between the groups I want to compare, or that there is no association between the variables that I am trying to link. The type two error, or beta error, is the opposite: it is the risk of accepting the null hypothesis while it is false. Both are errors, and we will try to minimize both. Now, there is something that we call the power of the study, which is equal to one minus beta. It is the capacity of a study to see a small difference. Actually, the smaller the difference that you want to observe, the higher the sample size needs to be, because you need a very powerful study to see a small difference. And if your study is not powerful enough, you will falsely conclude that H0 is true. This is why we said that we have to have a beta of at most 20%, that is, a minimal power of 80%, in order to be at ease with the result. And as I told you, in order to decrease alpha and beta and to increase the power of the study, the best way is to increase the sample size. There's no other way.
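The link between sample size, beta and power described above can be made concrete with a minimal sketch. It assumes a two-sided one-sample test and uses a normal approximation rather than the exact calculation a dedicated tool would perform; the function name `approx_power` and the chosen effect size of 0.5 are illustrative, not taken from the session.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided one-sample test.

    Normal approximation: the tiny probability of rejecting in the
    wrong direction is ignored.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(effect_size * sqrt(n) - z_crit)


# Power rises (and beta falls) as the sample size grows.
for n in (10, 20, 32, 64):
    power = approx_power(0.5, n)
    print(f"n={n:3d}  power={power:.2f}  beta={1 - power:.2f}")
```

Under this approximation, a moderate effect size of 0.5 first reaches the conventional 80% power at about 32 participants; smaller samples leave beta above 20%.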
So this is a table that is interesting because it summarizes what we are saying. You have the conclusion from your statistical analysis, which tells you either to accept the null hypothesis or to reject the null hypothesis. And you have the true state of nature, the truth: either the null hypothesis is true or it is false. If the truth is that H0 is true, and the results you obtained from your statistical analysis tell you the same, then you're fine, you have a correct conclusion. If your conclusion tells you that you should reject the null hypothesis, and in reality the null hypothesis is false, then you are here, below and to the right, and this is also correct. But what if the truth is that the null hypothesis is true, and your analysis gives you a result that tells you to reject the null hypothesis? This is what we call a type one error. It should be a maximum of 5%. And the other way around: if the null hypothesis is false, but your statistical analysis tells you to accept the null hypothesis, then you have the type two error, because you are accepting a false null hypothesis. In other words, there is a difference that you could not see. So you are here, at the level of type two. If we want to decrease both types of errors, we should increase the sample size. So in all cases, you should calculate a sample size: for proportions, for means, if you want to compare proportions, if you want to compare means. If you want to do anything statistically, you first want to be able to calculate a minimal sample size. So how to do it? Well, there are many ways of doing it. The first one is by using mathematical formulas. They rely on the population size, the expected mean or proportion, Z, which is a value that you get from a statistical table, and the margin of error that you can accept.
You include all of these in your formula and you obtain the minimal sample size. But if you want an easier way, it's better to go to software and do the work there, because you can repeat it as much as you want, and software will of course decrease the risk of calculation error. So I will start with Epi Info. As I told you, this is software that is available for free; you download it from the CDC in Atlanta. There is a tool here that is called StatCalc. Suppose, for example, that you want to calculate for a population survey, where you want to estimate the prevalence of a disease. What to do? First you have the population size, which you know. Then it will ask you for the expected frequency. What is the expected frequency of, let's say, the prevalence of diabetes? So you go to the literature, you search for studies that were done in the region, previous studies done in Lebanon, or you search for the worldwide average, whatever, and you come up with the value that you think is the closest to what you are looking for. Second, you fix your acceptable margin of error. Against an expected 10%, a 2% margin is more or less acceptable. But if you put 5% and the true value is 10%, it will be too much: allowing yourself a 5% error around 10% is too much. This is why we try to decrease it, and the more you decrease it, the higher the sample size will be. And then look at what you have here: the design effect. You remember when we talked about cluster sampling: if you have clusters, you put the number of clusters here and you put the design effect equal to two. This will double the sample size that is needed to give you the minimal sample size for a certain survey. Now I will go and show you based on Epi Info. So this is Epi Info, StatCalc, population survey. I will put it here so that you can see it better.
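The inputs just listed can be combined in a short sketch. It assumes a standard sample size formula for a proportion with a finite population correction, multiplied by the design effect and rounded up per cluster; whether the CDC calculator uses exactly this formula is an assumption, although it reproduces the figures quoted in the hypertension example of this session (3 million adults, 15% expected frequency, and a design effect of two, inferred from the "double the sample size" remark). The function name `survey_sample_size` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def survey_sample_size(population, expected_freq, margin,
                       design_effect=1.0, clusters=1, confidence=0.95):
    """Return (per-cluster size, total size) for a prevalence survey.

    Assumed formula: n = deff * N*p*q / ((d^2 / z^2) * (N - 1) + p*q)
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    pq = expected_freq * (1 - expected_freq)
    n = design_effect * population * pq / (
        (margin ** 2 / z ** 2) * (population - 1) + pq
    )
    per_cluster = ceil(n / clusters)  # round up within each cluster
    return per_cluster, per_cluster * clusters


# Hypertension example: 3% margin, 10 clusters.
print(survey_sample_size(3_000_000, 0.15, 0.03, design_effect=2, clusters=10))
# Tighter margins (2%, then 1%) with 20 clusters.
print(survey_sample_size(3_000_000, 0.15, 0.02, design_effect=2, clusters=20))
print(survey_sample_size(3_000_000, 0.15, 0.01, design_effect=2, clusters=20))
```

Halving the margin of error roughly quadruples the required sample, which is why a very stringent margin quickly becomes unaffordable.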
I'll take another example. Let's say I want to calculate the prevalence of hypertension in Lebanon. And let's say I know that there are, I don't know, 4 million people. I don't want to include children, right? For adults only, let's say we have 3 million. OK. What is the expected frequency of hypertension? Well, based on worldwide values, or I might go and check the region, the neighboring countries, let's say I found out it's 15%. For the margin of error, against 15%, 5% is one third; it's too much. I will decrease it a little and put 3%. Now, the design effect: if I want to do a cluster sample, let's say I choose 10 clusters and I take the design effect equal to two. So look at it: it is telling me that I will need a minimal sample size of 1,090 individuals at a 95% confidence level, and from every cluster I will take 109 individuals. If you change the number of clusters, you will have almost the same sample size, but it will tell you that from every cluster you will take 55 individuals. If you want to decrease the margin of error to 2%, look at it: 2,460 in total, 123 from every cluster. If you want to decrease it still more, to a 1% margin of error, then it is 489 in every cluster and the total will be huge: 9,780. So you can see, you have to be modest. You have to be rational and choose a margin of error that is not so stringent. Let us accept 3%; it will give you a sample size that is acceptable.

OK, then if we are talking about a mean, again there is a formula that you can apply manually that includes the expected mean, the margin of error, the standard deviation and the Z-score. But you can also use other software. This time you cannot use Epi Info; you will use software called G*Power. So I will open G*Power and close these ones. This is G*Power. Suppose that I want to compare a mean to a fixed value. So this is it: difference from constant, the one-sample case.
Now you can see that G*Power is really very, very large. It can do any type of power calculation and of minimal sample size calculation. Look here at how many types of calculations you can do with G*Power. So it will all depend on the type of analysis for which you want to calculate the power. Going back to our one mean that we want to compare to a fixed value: I'll go here, and in general I will always choose a two-tailed test. And look at what appears here: the effect size conventions. What is the effect size? It's how much difference I want to see. Sometimes you go back to the literature and check what difference is expected. How much difference do I want to see between the groups, or between one group and a fixed value? If it is a small difference, I should put 0.2; if it is moderate, I put 0.5; and if it is large, I put 0.8. In general, if I have no idea what to put, I put 0.5; this is a moderate difference. If I want to be more stringent, I go to a small effect size. If I want to be, let's say, more lenient, I go to a large effect size. So I will keep it here. Alpha is 5%, and the power, as we said, in general we put at 0.8, and I will ask G*Power to calculate a minimal sample size. Sometimes I go to the literature and I will have, let's say, a fixed value of 50, and I want to compare my sample to it. I check the standard deviation available in the literature, I calculate the effect size from the literature values, I transfer it to the main window and I redo the calculation. So it all depends on the effect size that I want: what is the difference between the sample that I obtain and the fixed value? As you can see, these tools are very powerful and they allow me to work flexibly. As I told you, you have to be, not lenient, but at least modest in the way you choose your effect size. If you choose it to be very small, you will need a huge sample size.
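To see how strongly the chosen effect size drives the sample size, here is a minimal sketch based on the usual normal approximation, n = ((z_alpha + z_beta) / d)^2, for a two-tailed one-sample comparison. This is an approximation and not the calculation G*Power performs (it works with the t distribution), so the values G*Power displays are slightly larger. The function name `n_one_sample` is illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_one_sample(effect_size, alpha=0.05, power=0.80):
    """Approximate minimal n to compare one mean with a fixed value."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile matching 1 - beta
    return ceil(((z_alpha + z_beta) / effect_size) ** 2)


# Conventional small, moderate and large effect sizes, then a very small one.
for d in (0.8, 0.5, 0.2, 0.1):
    print(f"effect size {d}: about {n_one_sample(d)} participants")
```

Because the effect size enters squared in the denominator, dividing it by two multiplies the required sample size by about four.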
Suppose, for example, I put 0.1 as an effect size; I'm very stringent. You will need 787. Extremely stringent, a 0.05 effect size: 3,142. So the smaller the difference that you want to observe, the higher the sample size that you will need. OK. This tool is also available for free; you download it from the internet. If you want to compare two proportions, again you can apply the formula, or you can go to Epi Info and compare these proportions. Let's say I have, for example, a cross-sectional study. In a cross-sectional study, of course, you have a proportion of disease and a proportion of exposure; you have lots of proportions. First, you fix your confidence level, 95%; you fix your power, 80%; then you fix the ratio of unexposed to exposed. Where do you get this ratio? Well, you get it from the literature. Suppose that you are working on the effect of smoking on cardiovascular disease, let's say. The ratio of unexposed to exposed is based on the prevalence of smoking in your population. If you know, for example, that 20% of your population smoke, your ratio is four, because for every four people who don't smoke, you have one person who smokes. Then, what is the percentage of outcome in the unexposed group? That is, what is the known cardiovascular disease prevalence among those who do not smoke? Let's suppose that it is 5%. And then there is one additional thing: what is the minimal risk ratio that you want to be able to demonstrate by using this sample size? In general, as I told you, we choose two. When we choose two, this means that we have a modest effect size. And here, this is what you get: in a cohort or a cross-sectional study, you need 274 exposed people, 1,094 unexposed, and a total of 1,368. If you decide to be more stringent, you want to demonstrate a smaller ratio, such as 1.5.
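The two-proportion calculation can be sketched as follows. The code assumes the Fleiss formula with continuity correction; which formula the CDC calculator actually reports is an assumption on my part, although this one reproduces the figures quoted in the session for the smoking example (ratio of four, 5% outcome in the unexposed). The function name `cohort_sample_size` and its parameter names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def cohort_sample_size(ratio, p_unexposed, risk_ratio, alpha=0.05, power=0.80):
    """Return (exposed, unexposed, total) for a cohort or cross-sectional study.

    ratio is the number of unexposed people per exposed person.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_exposed = risk_ratio * p_unexposed
    diff = abs(p_exposed - p_unexposed)
    p_bar = (p_exposed + ratio * p_unexposed) / (ratio + 1)

    # Uncorrected Fleiss sample size for the exposed group.
    n = (z_alpha * sqrt((ratio + 1) * p_bar * (1 - p_bar))
         + z_beta * sqrt(ratio * p_exposed * (1 - p_exposed)
                         + p_unexposed * (1 - p_unexposed))) ** 2
    n /= ratio * diff ** 2

    # Continuity correction.
    n_cc = n / 4 * (1 + sqrt(1 + 2 * (ratio + 1) / (n * ratio * diff))) ** 2

    exposed = ceil(n_cc)
    unexposed = ceil(ratio * n_cc)
    return exposed, unexposed, exposed + unexposed


# Smaller risk ratios to detect require much larger samples.
for rr in (2.5, 2.0, 1.5):
    print(rr, cohort_sample_size(ratio=4, p_unexposed=0.05, risk_ratio=rr))
```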
So look at what we get: 925 and 3,697. The smaller the risk ratio, or the smaller the association that I would like to detect, the higher the sample size I would need. This is how it is. If you want to be more lenient, for example 2.5, you obtain numbers that are really easily accessible: 140 exposed, 559 unexposed. So it's up to you, finally, to decide how lenient or how stringent you want to be with yourself, and how powerful you would like your study to be. If you want to be modest, if it's your first study, if you're working on your thesis, for example, don't be super ambitious. Two is OK: 274 versus 1,094. If you think it's too high, well, this one is easily achievable. OK. So what I'm telling you is that you can be really flexible at this level. The other values you bring from the literature, and you have to cite your reference and say where you brought these figures from. The only exception is here, where you put a value according to what you feel you can achieve, and your minimal sample size will be a reflection of what you feel you can achieve. So this is what we just did. And if you want to compare two means, you have to go to G*Power again, and in G*Power you will again have to fix your effect size, your alpha and your beta. Let us see. So, still a t-test, but I want to compare two independent means. Two tails. Effect size: if you have no idea, if you don't have any literature results, you choose 0.5. But if you have an idea, or if you have some literature, you go to the literature and you fill in what was obtained by researchers in other countries, for example, on the same topic: the mean in group one, the mean in group two, the standard deviation in group one and in group two. You calculate the effect size and you transfer it here.
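The two steps just described, deriving an effect size from literature values and then sizing two independent groups, can be sketched as below. The effect size is Cohen's d with the two standard deviations pooled as for equal group sizes, and the sample size uses a normal approximation rather than the t-based calculation G*Power runs, so G*Power's own output may differ by a few participants. The function names are illustrative; the input values (means of 20 and 22, standard deviations of 3 and 5, an allocation ratio of four) are the ones quoted in the session.

```python
from math import ceil, sqrt
from statistics import NormalDist


def cohens_d(mean1, mean2, sd1, sd2):
    """Effect size from two means and SDs (pooling assumes equal group sizes)."""
    pooled_sd = sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return abs(mean1 - mean2) / pooled_sd


def n_two_groups(effect_size, alpha=0.05, power=0.80, allocation_ratio=1.0):
    """Approximate (n1, n2) for a two-tailed comparison of independent means.

    allocation_ratio is n2 / n1; n1 is rounded up first, then n2 follows.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n1 = ceil((1 + 1 / allocation_ratio)
              * ((z_alpha + z_beta) / effect_size) ** 2)
    n2 = ceil(allocation_ratio * n1)
    return n1, n2


print(round(cohens_d(20, 22, 3, 5), 3))          # effect size from the literature values
print(n_two_groups(0.5))                         # equal groups
print(n_two_groups(0.5, allocation_ratio=4.0))   # one group four times larger
```

Unequal allocation keeps the smaller group manageable but increases the total number of participants compared with two equal groups.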
Now, here by default you put the power at 0.8, and your allocation ratio is one, which means that you take two groups of the same size. But as I told you, you might go up to four, and here you will have a total sample size that is acceptable: 40 versus 160. So it's achievable in a more or less short period of time. All of this is to show any difference. Now, if you want to see only a positive result, you might go and choose a one-tailed test; it will be more difficult to see. The p-value that we all accept is less than 0.05. And the effect size, as we said, depends on the variable type, and it is generally calculated automatically unless you have some literature and you enter the results of the literature. One word about the effect size: it's a quantitative measure of the magnitude of the experimental effect. As I told you and as I showed you, the larger the effect size, the stronger the relationship and the easier it will be to see it. So you will need a smaller sample size to see a bigger effect size, and vice versa: you will need a bigger sample size to see a small effect size. There is a formula for the effect size. Most of the time you don't need to use the formula manually, because as I showed you, it can be calculated immediately, just here. If you enter, let's say, means of 20 and 22, and let's say you have a standard deviation of three here and five here, this is your effect size: you calculate it and you apply it. So you don't need to do this calculation manually; it's just for you to know that there is a formula that can be applied. And one last word about what we call a negative results trial. If you want to demonstrate that two treatments are equally effective, this is what we call an equivalence trial. It's a little bit different from the positive trial. You're not trying to show that there is a difference; you're trying to show that there is no difference. What to do in this case?
Well, you will fix the minimal difference that is clinically meaningful, and you will try to show that the difference that you obtain between the two groups, and its full confidence interval, are within the clinically meaningful difference that you know. This is very important. Having your negative result within this interval is good, but if it's not, you cannot conclude. So it's important to have a confidence interval that is within the interval that has some clinical meaning, or else it doesn't work. It's not enough to have a p-value that is not significant to say that the two treatments, let's say, are alike. This is not true. It doesn't work like that. And that's all I had to tell you regarding sample size calculation. So my question to you is: is everything clear?

Thank you a lot, doctor. It was really a very insightful session. I found the links for G*Power and for the other one. Yes, yes, exactly. I found the links for the free downloads and I will send them on the group. Perfect. OK, perfect. They will be happy to have them. Good. Thank you a lot, doctor. If anyone has any question, please, you can ask it now, and I will send the feedback form now. OK. What would you like us to do? We stop for five minutes before we proceed with the last session? Yes, I think we can wait for 5 to 10 minutes, and you have to enter through the other link. Exactly. OK. And for everyone else, please, you just have to click on the second email, the Module Six one, to be able to do the second session. OK, perfect. Then see you in around seven minutes, let's say. Yes. OK, in seven. Perfect. Thank you. Bye bye.