Hi,

Can anyone tell me how to randomly draw subsets from an existing data set?

For instance, I have a variable with 100 observations, and I want to keep only 50 of them (randomly drawn). Ideally, the subset should mimic the main features of the original data.

I did not find any program that deals with this specifically. Can you tell me how you handle this in general?

Many thanks!

## 3 Answers


GAUSS allows you to index into a matrix with a vector of indices, for example:

    x = { 5 1,
          2 9,
          3 7,
          6 4,
          8 0 };

    idx = { 2, 4, 5 };

    z = x[idx, .];

After the code above, `z` will equal the second, fourth and fifth rows of `x`:

        2 9
    z = 6 4
        8 0

Now all we need to do is to create some random integers between 1 and the number of observations in our dataset to draw randomly from it. You can do that by multiplying a series of uniform random numbers by the number of observations in your data and rounding up. Here is an example:

    //create a dataset for this example
    my_dataset = rndn(100, 5);

    //how many observations to draw at a time
    num_draws = 50;

    //create index for random draws: 'num_draws' integers
    //between 1 and the number of rows in the dataset
    //(edited to fix bug reported in this thread)
    idx = ceil(rows(my_dataset) * rndu(num_draws, 1));

    //draw sample
    my_sub_sample = my_dataset[idx, .];
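As for whether the subset mimics the main features of the original data, one simple check is to compare a few summary statistics of the two. The following is only a minimal sketch: it repeats the example setup so that it runs on its own, draws the indices across all rows of the dataset, and uses column means and standard deviations as the "features" to compare, which is an assumption about what matters for your data.

```gauss
//example data and sub-sample size
my_dataset = rndn(100, 5);
num_draws = 50;

//draw 'num_draws' row indices between 1 and rows(my_dataset)
idx = ceil(rows(my_dataset) * rndu(num_draws, 1));
my_sub_sample = my_dataset[idx, .];

//compare column means and standard deviations
print "full-sample means:";
print meanc(my_dataset)';
print "sub-sample means:";
print meanc(my_sub_sample)';

print "full-sample std devs:";
print stdc(my_dataset)';
print "sub-sample std devs:";
print stdc(my_sub_sample)';
```

With only 50 draws the two sets of numbers will not match exactly, so this is a rough sanity check rather than a formal test.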


Hi, Aptech

I do not think the program you provided works correctly. When I ran it, the sub-sample size was still 100, the same as the number of observations in `my_dataset`. I want to keep 50 observations in the sub-sample.

Also, should

    //create index for random draws
    idx = ceil(num_draws * rndu(rows(my_dataset), 1));

be

    //create index for random draws
    idx = ceil(num_draws * rndu(rows(num_draws), 1));

instead?

Although the size will be correct using the code above, this does not actually make sense to me.

Could you please explain more? Maybe I misunderstood your program.

Thank you very much!!


Yes, there is a bug in that post. That code will draw a random sample that is the same size as your original data, which is not what you want. Since the variable `num_draws` in that code snippet is a scalar, `rows(num_draws)` will return 1. The code you proposed:

    idx = ceil(num_draws * rndu(rows(num_draws), 1));

will draw a random sample of only one observation. The number of draws belongs in the first argument of `rndu`, and the scale factor should be the number of observations in the dataset, so that the indices can reach every row. What you actually want the line to read is:

    idx = ceil(rows(my_dataset) * rndu(num_draws, 1));

I think if I break the assignment of `idx` into separate statements, then it will be clear to you what is going on. The corrected line could be rewritten as follows:

    //Create 'num_draws' uniform random numbers between 0 and 1
    r = rndu(num_draws, 1);

    //Change the scale of our uniform random numbers
    //from 0-1 to 0-'rows(my_dataset)'
    r_scaled = rows(my_dataset) * r;

    //Force the scaled uniform random numbers
    //to integers from 1-'rows(my_dataset)'
    idx = ceil(r_scaled);
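One thing to be aware of: indices built from independent uniform draws are sampled with replacement, so the same row can appear more than once in the sub-sample. If you want to keep 50 distinct observations, one possible approach (a sketch, not the only way to do it) is to shuffle the row numbers by sorting a vector of uniform random numbers and then keep the first `num_draws` of them. The names `perm` and `idx_norep` below are just illustrative.

```gauss
//example data and sub-sample size
my_dataset = rndn(100, 5);
num_draws = 50;

//one uniform random number per observation
r = rndu(rows(my_dataset), 1);

//the sort order of the random numbers is a random
//permutation of the row numbers 1 to rows(my_dataset)
perm = sortind(r);

//keep the first 'num_draws' row numbers; no row can repeat
idx_norep = perm[1:num_draws, .];

//draw sample without replacement
my_sub_sample = my_dataset[idx_norep, .];
```

Because every row number appears exactly once in `perm`, the resulting sub-sample contains `num_draws` different rows of the original data.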

Let us know if this clears up the issue or if you have any more questions!