How to subset data from an existing data set (random draw)

Question

Hi,

Can anyone tell me how to randomly draw subsets from an existing data set?

For instance, I have a variable with 100 observations, and I want to keep only 50 of them (randomly drawn). It would be good if the subset mimicked the main features of the original data.

I did not find any procedure that deals with this specifically. Can you tell me how you handle this in general?

Many thanks!

3 Answers

aptech · Answer 1

GAUSS allows you to index into a matrix with a vector of indices, for example:

x = { 5 1,
      2 9,
      3 7,
      6 4,
      8 0 };

idx = { 2,
        4,
        5 };

z = x[idx, .];

After the code above, z will equal the second, fourth and fifth rows of x:

     2 9
z =  6 4
     8 0
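
A small sketch of the same indexing, under the assumption (not stated in the post above) that an index vector may list the same row more than once. The names idx_rep and z_rep are made up for illustration:

```gauss
x = { 5 1,
      2 9,
      3 7,
      6 4,
      8 0 };

// Row 2 appears twice in the index vector
idx_rep = { 2,
            2,
            5 };

// z_rep holds row 2, row 2 again, then row 5 of x
z_rep = x[idx_rep, .];
```

This matters for random draws: an index vector with repeated values gives a sample drawn with replacement.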

Now all we need to do is to create some random integers between 1 and the number of observations in our dataset to draw randomly from it. You can do that by multiplying a series of uniform random numbers by the number of observations in your data and rounding up. Here is an example:

//create a dataset for this example
my_dataset = rndn(100, 5);

//how many observations to draw at a time
num_draws = 50;

//create index for random draws: 'num_draws' integers
//between 1 and the number of rows in the dataset
//(edited to fix bug reported in this thread)
idx = ceil(rows(my_dataset) * rndu(num_draws, 1));

//draw sample
my_sub_sample = my_dataset[idx, .];
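
Indices built from ceil and rndu can repeat, so the draw above is with replacement. If each observation should appear at most once, one possible sketch (not part of the original answer) is to sort a vector of uniform random numbers and keep the first num_draws positions of the sort order. It relies on sortind, meanc and stdc; the variable name perm is illustrative. The last lines print column means and standard deviations side by side as a rough check of how well the subsample mimics the full data:

```gauss
// Example data: 100 observations, 5 variables
my_dataset = rndn(100, 5);
num_draws = 50;

// Random permutation of 1..n: the sort order of n uniform random numbers
perm = sortind(rndu(rows(my_dataset), 1));

// The first 'num_draws' entries are distinct row indices
idx = perm[1:num_draws];

// Subsample with no repeated rows
my_sub_sample = my_dataset[idx, .];

// Column means: full data (left) next to subsample (right)
print "means (full data ~ subsample):";
print meanc(my_dataset)~meanc(my_sub_sample);

// Column standard deviations, same layout
print "std devs (full data ~ subsample):";
print stdc(my_dataset)~stdc(my_sub_sample);
```

With only 50 rows the subsample statistics will differ somewhat from the full-data values; how close is close enough depends on the application.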

applegrass · Answer 2

Hi, Aptech

I do not think the program you provided works correctly. When I ran it, the subsample size was still 100, the same as the number of observations in my_dataset. I want to keep 50 observations in the subsample.

Also, should

//create index for random draws
idx = ceil(num_draws * rndu(rows(my_dataset), 1));

be

//create index for random draws
idx = ceil(num_draws * rndu(rows(num_draws), 1));

instead? Although the size would be correct using the code above, it does not actually make sense to me.

Could you please explain more? Maybe I misunderstood your program.

Thank you very much!!

aptech · Answer 3

Yes, there is a bug in that post. That code draws a random sample the same size as your original data, which is not what you want. Because num_draws in that snippet is a scalar, rows(num_draws) returns 1, so the code you proposed:

idx = ceil(num_draws * rndu(rows(num_draws), 1));

will draw a random sample of only one observation. What you actually want is num_draws random numbers, scaled by the number of rows in the dataset so that every row can be selected:

idx = ceil(rows(my_dataset) * rndu(num_draws, 1));

I think if I break the assignment of idx into separate statements, it will be clear what is going on. The corrected line could be rewritten as follows:

//Create 'num_draws' uniform random numbers between 0 and 1
r = rndu(num_draws, 1);

//Change the scale of our uniform random numbers
//from 0-1 to 0-'rows(my_dataset)'
r_scaled = rows(my_dataset) * r;

//Force the scaled uniform random numbers
//to integers from 1 to 'rows(my_dataset)'
idx = ceil(r_scaled);
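
One way to convince yourself that an index vector is sensible is to print its length and range. This is only a sketch, assuming the goal is num_draws indices that can reach any row of the dataset; n_obs is a helper name introduced here for illustration:

```gauss
my_dataset = rndn(100, 5);
num_draws = 50;

// Number of rows available to draw from
n_obs = rows(my_dataset);

// 'num_draws' random indices scaled to the range of row numbers
idx = ceil(n_obs * rndu(num_draws, 1));

// Number of indices drawn: should equal num_draws
print "number of draws:" rows(idx);

// Smallest and largest index: should lie between 1 and n_obs
print "min index:" minc(idx);
print "max index:" maxc(idx);
```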

Let us know if this clears up the issue or if you have any more questions!

