# How to subset data from an existing data set (random draw)

Hi,

Can anyone tell me how to randomly draw subsets from an existing data set?

For instance, I have a variable with 100 observations, and I want to keep only 50 of them (randomly drawn). It would be good if the subset mimicked the main features of the original data.

I did not find any program that deals with this specifically. Can you tell me how you handle this in general?

Many thanks!

GAUSS allows you to index into a matrix with a vector of indices, for example:

```
x = { 5 1,
      2 9,
      3 7,
      6 4,
      8 0 };

idx = { 2,
        4,
        5 };

z = x[idx, .];
```

After the code above, z will equal the second, fourth and fifth rows of x:

```
     2 9
z =  6 4
     8 0
```
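Nothing requires the entries of the index vector to be distinct. As a small sketch (the repeated index here is only for illustration), listing a row number twice returns that row twice, which is what makes drawing with replacement possible:

```
x = { 5 1,
      2 9,
      3 7,
      6 4,
      8 0 };

//row 2 is listed twice in the index
idx = { 2,
        2,
        5 };

//z holds row 2, row 2 again, then row 5 of x
z = x[idx, .];
```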

Now all we need to do is create some random integers between 1 and the number of observations in our dataset and use them as row indices. You can do that by multiplying a series of uniform random numbers by the number of observations in your data and rounding up. Note that this draws with replacement, so the same row can be picked more than once. Here is an example:

```
//create a dataset for this example
my_dataset = rndn(100, 5);

//how many observations to draw at a time
num_draws = 50;

//create index for random draws:
//'num_draws' integers between 1 and rows(my_dataset)
//(edited to fix bug reported in this thread)
idx = ceil(rows(my_dataset) * rndu(num_draws, 1));

//draw sample
my_sub_sample = my_dataset[idx, .];
```

aptech


Hi, Aptech

I do not think the program you provided works correctly. When I ran it, the sub-sample size was still 100, the same as the number of observations in my_dataset. I want to keep 50 observations in each sub-sample.

And,

//create index for random draws idx = ceil(num_draws * rndu(rows(my_dataset), 1));

should be

//create index for random draws idx = ceil(num_draws * rndu(rows(num_draws), 1)); ?

Although the size will be correct using the above code, this does not actually make sense.

Thank you very much!!

Yes, there is a bug in that post. That code will draw a random sample that is the same size as your original data, which is not what you want. Since the variable num_draws in that code snippet is a scalar, rows(num_draws) will return 1. The code you proposed:

```
idx = ceil(num_draws * rndu(rows(num_draws), 1));
```

will draw a random sample of only one observation. What you actually want the line to read is:

```
idx = ceil(rows(my_dataset) * rndu(num_draws, 1));
```

The first argument of rndu sets how many indices are drawn, and the multiplier sets the range they are drawn from. I think if I break the assignment of idx into separate statements, it will be clear to you what is going on. The corrected line could be rewritten as follows:

```
//Create 'num_draws' uniform random numbers between 0 and 1
r = rndu(num_draws, 1);

//Change the scale of our uniform random numbers
//from 0-1 to 0-rows(my_dataset)
r_scaled = rows(my_dataset) * r;

//Force the scaled uniform random numbers
//to integers from 1-rows(my_dataset)
idx = ceil(r_scaled);
```
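An index built this way is drawn with replacement, so a row can appear more than once in the sub-sample. If you would rather keep 50 distinct observations, one possible approach is to shuffle the row numbers and keep the first num_draws of them. The following is only a sketch under that assumption: the names u and shuffled are made up for this illustration, and the shuffle relies on sortind returning the sort order of a vector of uniform random numbers. The printed column means are just a rough way to see whether the sub-sample resembles the full data.

```
//example data, as in the earlier post
my_dataset = rndn(100, 5);
num_draws = 50;

//one uniform random number per row of the data
u = rndu(rows(my_dataset), 1);

//sortind returns the row order that sorts 'u', which is
//a random ordering of 1, 2, ..., rows(my_dataset)
shuffled = sortind(u);

//keep the first 'num_draws' entries: distinct row indices
idx = shuffled[1:num_draws];

//draw sample without replacement
my_sub_sample = my_dataset[idx, .];

//compare column means of the full data and the sub-sample
print "full data means:";
print meanc(my_dataset)';
print "sub-sample means:";
print meanc(my_sub_sample)';
```

Because every row number appears exactly once in shuffled, no observation can be selected twice.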

Let us know if this clears up the issue or if you have any more questions!

aptech

