STATISTICA







STATISTICA BASIC Program RandomSample.stb

{ When a data set has a very large number of cases, it is often useful to analyze a randomly selected subset of the cases and to draw additional randomly selected subsets for cross-validation purposes. This program fills a sample identifier variable with codes that uniquely identify random, non-overlapping subsamples of a user-specified size drawn from the cases in the data file. Cases left over after the last full subsample keep a missing identifier.
The data file must already contain a variable to hold the sample identifier codes. The program will prompt you for this variable, and any values it already holds are overwritten.
The Error Level specified from the Options pull-down menu may need to be increased (e.g., to 100,000) before running this program on very large data files. The number specified for the Error Level option is the acceptable number of missing data assignments that can occur during execution of a STATISTICA BASIC program; if this limit is exceeded, an error message is displayed and execution of the program stops.
Program written, modified, or edited at StatSoft, Inc.
}

RandomAccess;
NoDataFileVariableNames;
{determine sample size}
size:=10;
if DisplayNumericInputBox ('Enter desired sample size', 'n:', size)=0 then exit;
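{check if sample size is larger than zero}
if size<=0 then begin
   DisplayMessageBox (MB_OK, 'Invalid parameter', 'The sample size cannot be <= 0.');
   exit;
end;
{check that at least one full sample fits in the data file; otherwise no sample would be created}
if size>NCases then begin
   DisplayMessageBox (MB_OK, 'Invalid parameter', 'The sample size cannot exceed the number of cases.');
   exit;
end;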
{determine variable to contain sample identifier values}
if SelectVariables1 ('Select a variable to contain the sample identifiers', 1, 1, VarList1, Count1,
  'Sample identifier:')=0 then exit;
{initialize the specified variable}
for i:=1 to NCases do
  if Valid(Data(i, VarList1)) then
    Data(i, VarList1):=Missing;
{determine how many samples will be created}
nsamples:=trunc(NCases/size);
{dimension the arrays that will be used for sorting}
ReDim array1(NCases);
ReDim array2(NCases);
{start main loop for creating the subsamples}
for i:=1 to nsamples do begin
  {start loop for assigning random numbers from the Standard Normal Distribution to each case
  that has not already been used in a previous subsample}
  for j:=1 to NCases do begin
    array1(j):=Missing;
    array2(j):=Missing;
    if not Valid(Data(j,VarList1)) then begin
      array1(j):=VNormal ( Rnd (1), 0, 1);
      array2(j):=j;
    end;
  end; {end loop}
  {sort the arrays based on random numbers in ascending order}
  VectorDualSort(array1, array2, SORT_ASCENDING);
  {fill the specified variable with sample identifier values}
  for j:= 1 to size do
    Data(array2(j), VarList1):=i;
end; {end main loop}
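
For readers without access to STATISTICA, the procedure above can be sketched in Python. This is only an illustrative translation under stated assumptions, not part of the StatSoft program: the function name `assign_sample_ids`, the `seed` parameter, and the use of `None` in place of a missing value are all hypothetical choices, and cases are numbered from 0 rather than 1. Unlike the listing, the sketch leaves already assigned cases out of the sort instead of giving them missing sort keys.

```python
import random


def assign_sample_ids(n_cases, size, seed=None):
    """Return one sample identifier per case (None = not in any sample)."""
    if size <= 0:
        raise ValueError("The sample size cannot be <= 0.")
    rng = random.Random(seed)            # seed is an assumption, for repeatable runs
    ids = [None] * n_cases               # stands in for the identifier variable
    n_samples = n_cases // size          # same as trunc(NCases/size)
    for sample in range(1, n_samples + 1):
        # Pair every case not used in an earlier sample with a
        # standard normal random sort key.
        keyed = [(rng.gauss(0.0, 1.0), case)
                 for case in range(n_cases) if ids[case] is None]
        keyed.sort()                     # ascending by random key
        # The first `size` cases in random order form this sample.
        for _, case in keyed[:size]:
            ids[case] = sample
    return ids


ids = assign_sample_ids(n_cases=25, size=10, seed=1)
```

With 25 cases and a sample size of 10, two full samples are created and the remaining 5 cases keep no identifier, which matches the `trunc(NCases/size)` step in the listing.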



StatSoft, Inc.
2300 East 14th Street, Tulsa, OK 74104
Phone: (918) 749-1119; Fax: (918) 749-2217

E-mail: info@statsoft.com

©Copyright StatSoft, Inc., 1984-2004.
StatSoft, StatSoft logo, STATISTICA, SEWSS, SEDAS, Data Miner, SEPATH and GTrees are trademarks of StatSoft, Inc.