# How to subset data from an existing data set (randomly draw)

Hi,

Can anyone tell me how to randomly draw subsets from an existing data set?

For instance, I have a variable with 100 observations, and I want to keep only 50 of them (randomly drawn). Ideally, the subset should mimic the main features of the original data.

I could not find any code that deals with this specifically. Can you tell me how you handle this in general?

Many thanks!

GAUSS allows you to index into a matrix with a vector of indices, for example:

```
x = { 5 1,
      2 9,
      3 7,
      6 4,
      8 0 };

idx = { 2,
        4,
        5 };

z = x[idx, .];
```

After the code above, z will equal the second, fourth and fifth rows of x:

```
     2 9
z =  6 4
     8 0
```

Now all we need to do is to create some random integers between 1 and the number of observations in our dataset to draw randomly from it. You can do that by multiplying a series of uniform random numbers by the number of observations in your data and rounding up. Here is an example:

```
//create a dataset for this example
my_dataset = rndn(100, 5);

//how many observations to draw at a time
num_draws = 50;

//create index for random draws:
//'num_draws' random integers between 1 and the number of rows in the dataset
//(edited to fix bug reported in this thread)
idx = ceil(rows(my_dataset) * rndu(num_draws, 1));

//draw sample
my_sub_sample = my_dataset[idx, .];
```
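
Indexing with random integers samples with replacement, so the same row can appear more than once in the sub-sample. If the goal is to keep 50 distinct observations out of 100, one option is to shuffle the row indices and keep the first `num_draws` of them. The following is a minimal sketch of that idea, not the method from the post above; the names `r`, `perm` and `my_sub_sample` are illustrative, and it relies on `sortind` returning the indices that sort a column vector in ascending order.

```
//create a dataset for this example
my_dataset = rndn(100, 5);

//how many observations to keep
num_draws = 50;

//one uniform random number per row of the dataset
r = rndu(rows(my_dataset), 1);

//the indices that sort 'r' form a random permutation
//of 1, 2, ..., rows(my_dataset)
perm = sortind(r);

//the first 'num_draws' entries of the permutation
//are distinct row indices
idx = perm[1:num_draws, .];

//draw sample; each row of my_dataset appears at most once
my_sub_sample = my_dataset[idx, .];
```

Because every index in `perm` is unique, `my_sub_sample` has exactly `num_draws` rows and no duplicates.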
aptech

Hi, Aptech

I do not think the program you provided works correctly. When I ran it, the sub-sample size was still 100, the same as the number of observations in my_dataset. I want to keep 50 observations in the sub-sample.

Also, should the line

//create index for random draws
idx = ceil(num_draws * rndu(rows(my_dataset), 1));

be

//create index for random draws
idx = ceil(num_draws * rndu(rows(num_draws), 1));

instead? Although the size will be correct when using the code above, it does not actually make sense to me.

Thank you very much!!

Yes, there is a bug in that post. That code will draw a random sample that is the same size as your original data, which is not what you want. Because the variable num_draws in that code snippet is a scalar, rows(num_draws) will return 1. The code you proposed:

```
idx = ceil(num_draws * rndu(rows(num_draws), 1));
```

will draw a random sample of only one observation. What you actually want is num_draws random numbers, scaled by the number of rows in the dataset so that every observation can be selected:

```
idx = ceil(rows(my_dataset) * rndu(num_draws, 1));
```

I think if I break the assignment of idx into separate statements, then it will be clear to you what is going on. The new corrected line could be rewritten as follows:

```
//Create 'num_draws' uniform random numbers between 0 and 1
r = rndu(num_draws, 1);

//Change the scale of our uniform random numbers
//from 0-1 to 0-'rows(my_dataset)'
r_scaled = rows(my_dataset) * r;

//Force the scaled uniform random numbers
//to integers from 1-'rows(my_dataset)'
idx = ceil(r_scaled);
```
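
To confirm that the sub-sample has the intended size and roughly resembles the full data, a quick check along the following lines may help. This is only a sketch: it scales the random numbers by `rows(my_dataset)` so that every row can be selected, and the printed summaries (column means and standard deviations) are just one possible way to compare the two data sets.

```
//create a dataset and draw a sub-sample
my_dataset = rndn(100, 5);
num_draws = 50;
idx = ceil(rows(my_dataset) * rndu(num_draws, 1));
my_sub_sample = my_dataset[idx, .];

//number of rows in the sub-sample; should equal 'num_draws'
print "rows in sub-sample:" rows(my_sub_sample);

//smallest and largest index drawn; both should lie
//between 1 and rows(my_dataset)
print "smallest index:" minc(idx);
print "largest index:" maxc(idx);

//column means: full data in the first column,
//sub-sample in the second
print "column means (full ~ sub-sample):";
print meanc(my_dataset)~meanc(my_sub_sample);

//column standard deviations, laid out the same way
print "column standard deviations (full ~ sub-sample):";
print stdc(my_dataset)~stdc(my_sub_sample);
```

With only 50 draws the sub-sample statistics will differ somewhat from those of the full data; the gap typically shrinks as `num_draws` grows.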

Let us know if this clears up the issue or if you have any more questions!

aptech