Splitting and Merging Files
Recoding, Verifying, and Cleaning Data
On the Data menu, there are two commands for splitting a spreadsheet: Subset and Random Sampling.
Subset creates a new spreadsheet based on specified variables and case selection conditions.
Random Sampling creates a new spreadsheet based on a simple random or stratified random sampling.
Select either Save or Save As from the File menu to save the new spreadsheet. The original spreadsheet is not in danger of being overwritten unless you use the same name.
Yes, you can. Select Random Sampling from the Data menu to display the Create a Random Sample dialog. Sampling types include simple, systematic, split node, and stratified. Various options are available for each sampling type.
Select Merge from the Data menu to display the Merge Options dialog.
Either variables (columns of data) or cases (rows of data) can be merged. In other words, the second file can be appended to either the "right side" (as variables) or "bottom" (as cases) of the first file. STATISTICA provides additional options for different merge types.
Yes. Select Merge from the Data menu to display the Merge Options dialog, and select the Text Labels tab.
On this tab, you can select the way in which the text values from the two files are to be merged.
Select Merge from the Data menu to display the Merge Options dialog, and select the Cases tab. After selecting the two data files in the lower section of the dialog, select the Match by variable name check box in the Mode box, and then click the OK button to execute the merge.
Yes, you can select multiple variables by which to match when using the Match variables mode (accessible in the Merge Options dialog - Variables tab, Mode group box). The variables between the two data sets will be paired together in the (column) order that they appear in the data set.
Variables in STATISTICA Spreadsheets can be defined by formulas that support a wide selection of mathematical, logical, and statistical functions.
When you enter a long variable name (in the Long name field on the Variable dialog) that starts with an equal sign, STATISTICA will assume that it is a formula and will verify it for formal correctness. If the formula is formally correct, you will then be given the choice to recalculate the variable now or later. You can use variable formulas to verify data, transform a variable, recode a variable or create values of the variable based on logical conditions (e.g., =(v0<=100)*1 + (v0>100)*2 will assign a value of 1 to cases number 1 through 100 and 2 to cases above 100). Refer to variables by their names (e.g., Test1, Income) or numbers (e.g., v1, v2, v3, ...); v0 is the case number. A comment may be added to a formula following a semicolon.
Access the Function Browser by clicking the Functions button in the Variable dialog. The Function Browser helps you quickly access different functions and their syntax and descriptions. Functions are grouped by Category for easier searching.
Yes. Select the Auto-recalculate when the data change check box in the Recalculate Spreadsheet Formulas dialog (accessed by clicking the Recalculate button on the toolbar, or selecting Recalculate Spreadsheet Formulas from the Data menu, or pressing Shift+F9).
STATISTICA will automatically recalculate all spreadsheet formulas when data are changed in the spreadsheet. Alternatively, you can press F9 to manually recalculate the formulas in the spreadsheet.
You can use either of the data transformation facilities: spreadsheet formulas or STATISTICA Visual Basic. However, often the quickest way to create values of a new variable based on conditions met by other variables is to use the data recoding facility, Recode Values of Variable dialog, which is accessible at any point from the spreadsheet by clicking the Vars button and selecting Recode or selecting Recode from the Data menu. Note that the currently selected variable does not have to be included in the text of the recoding conditions. Thus, you can use this facility to create values of a variable based on conditions met by other variables.
For example, you can add a new (empty) variable to the data file, and then use this facility to create the new values. For instance, the recoding conditions could be used to assign 1s to the new variable for all "male subjects 18 to 25 years old with cholesterol levels below 200," 2s to "male subjects 18 to 25 years old with cholesterol levels above 200," and assign the missing data value to all other subjects.
Spreadsheets have a quick recoding facility. Select Recode from the spreadsheet Vars toolbar button to display the Recode Values of Variable dialog. You can scroll down in this dialog to define new values (text or numeric) of the current variable (see the groups New Value 1, New Value 2, New Value 3, etc., below) depending on the specific conditions that you define (see the Category 1, Category 2, Category 3, etc., groups below).
For example, the recoding conditions specified above would "translate" the values less than or equal to 3 of the current variable (Quality) into Low, the values greater than 3 and less than or equal to 6 into Medium, and the values greater than 6 and less than or equal to 10 into High. Note that any other values that occur in the Quality variable would be "translated" into the missing data code because the MD Code option button is selected in the Other group box. Note also that STATISTICA first finds all cases that meet Category 1 and recodes them; then STATISTICA only searches through the remaining cases that have not been recoded to see if they meet Category 2, etc. Hence, you do not have to specify Quality > 3 AND Quality < = 6 in Category 2.
When specifying conditions, follow the standard syntax conventions common in STATISTICA to all those procedures that involve any operation of "selecting cases" based on their values.
Select Verify Data from the Data - Verify Data submenu to access an interactive data-verification and cleaning facility. Use the options in the Verify Data dialog to enter the conditions to be met by the data.
Follow the standard syntax conventions common in STATISTICA to all those procedures that involve any operation of selecting cases based on their values. You can also save the current verification condition to a text file or open a file with previously saved conditions.
The verification can be as simple as checking whether values in a variable are "legal" (e.g., only 1 and 2 might be allowed for Gender) or whether they fall within allowed ranges of values (e.g., Age must be more than 0 and less than 100). It can also be as complex as checking multiple logical conditions that some values must meet in relation to other values.
Consider this example of conditional verification: If a person is a male or less than 10 years old, then the number of pregnancies for that person cannot be more than zero. In order to apply these conditions, you would specify (for example):
Invalid if: (v1='MALE' or AGE<10) and PREGN>0
Once you have entered your verification condition(s), click either the Find First button to select the first invalid case in your data file (after this first case has been selected, you can find the next case by selecting Find Next Invalid Case from the Data - Verify Data submenu) or click the Mark All button to mark all of the invalid cases in the data file according to the Marked Cells spreadsheet layout.
Yes. Select Sort from the Data menu to display the Sort Options dialog, which contains options for sorting variables. You can create up to 14 sorting keys in this dialog
Select Rank from the Vars button menu on the spreadsheet toolbar or from the Data menu to display the Rank Order Values dialog.
Then click the Variables button to select the variables to be ranked. Note that this operation replaces the original values with their respective ranks based on the sorted series. Optionally, you can also specify a subset of cases to be affected by the operation (click the Cases button), use case weights (click the Weight button), and set a number of options to perform specific types of ranking.
From the Data - Transpose submenu, select either Block or File.
Transposing a block of data. The block transposing command affects only the contents of cells in the block currently selected in the spreadsheet (the block must be square); the variable names and case names will not be affected. For example, the Transpose - Block command executed on the following square block of data:
produces the following result:
Transposing a data file. Data file transposing restructures the entire file. For example, transposing the data file shown above (before the block was transposed), changes the number of cases and variables in the data file:
Note that when you transpose a file, case names become variable names and variable names become case names.
STATISTICA has a variety of operations that extrapolate (AutoFill) ranges.
A horizontal or vertical series in a block can be extrapolated by dragging the block's Fill Handle (a small solid black box located on the lower-right corner of the block). After selecting the block, point at the Fill Handle. When the mouse pointer changes to a plus sign, drag to the last cell(s) to which you want to extrapolate the data. STATISTICA can create series of values such as sequential numbers, linear extrapolations, and dates (e.g., you can extend a series such as 1, 2, 3 to include 4, 5, 6).
You can extrapolate (AutoFill) a block in the following ways:
Select a block of values in the spreadsheet, right-click, and select either Fill/Standardize Block - Standardize Columns or Fill/Standardize Block - Standardize Rows from the shortcut menu. The standardized values are computed as follows:
Standard Value = (raw value - mean of highlighted row/column)/standard deviation.
You can also standardize selected variables by selecting Standardize from the Data menu to display the Standardization of Values dialog. This procedure operates independent of the currently selected block, but takes into account the current case selection conditions and weights.
Every analysis definition dialog in STATISTICA contains at least one Variables button that is used to specify variables to be analyzed. Click the button (or press v) to display the variable selection dialog. If you forget to specify variables and click the OK button to start the analysis, STATISTICA prompts you to specify the variables to be analyzed. The variable selection dialog supports various ways of selecting variables, and it offers various shortcuts and options to review the contents of the data file.
Yes. If you select a block in the spreadsheet, the variables included in the block are automatically preselected for the next analysis. This shortcut is designed to limit the chance of producing unintended results in the following ways:
Before an analysis begins (i.e., before the data are processed), you can instruct STATISTICA to select only cases (i.e., rows in the spreadsheet) that meet specific selection criteria. A facility to define and manage case selection conditions can be accessed by selecting Edit from the Tools - Selection Conditions menu or by clicking on the status bar field Sel, which shows the current status of the Case Selection Conditions. Note that when case selection conditions are disabled, this field reads Sel:OFF, when they are enabled, it reads Sel:ON).
Note that there are two types of case selection conditions in STATISTICA - permanent (saved with the data file) and temporary (associated with the current analysis only).
Global case selection conditions (definitions of subsets of data) are attached to a spreadsheet and are called Spreadsheet Case Selection Conditions. They are accessed from the Tools - Selection Conditions menu or by clicking the status bar field Sel. Analysis/Graph Case Selection Conditions can be locally connected to an analysis or graph. They are accessed by clicking the Select Cases button that is available on all analysis/graph definition dialogs. You can save and open either type of selection conditions (and maintain libraries of case selection conditions) using the Open or Save As buttons in the appropriate Case Selection Conditions dialog. Note that case weights are specified in a similar manner.
You can either select Descriptive statistics from the Basic Statistics and Tables Startup Panel (accessible from the Statistics menu) or click the Values/Stats button in the Variable specifications dialog. To display the Variable specifications dialog, double-click on the variable name in the spreadsheet. Alternatively, right-click on the variable name and select Variable Specs from the shortcut menu.
When you click the Values/Stats button, the Values/Stats dialog displays information about the selected variable, a sorted list of its values, and descriptive statistics. The descriptive statistics can be copied to the Clipboard by clicking the Copy button (located in the lower-right corner of the dialog).
Click the Values/Stats button in the Variable specifications dialog to display a sorted list of all unique numeric values and corresponding text labels (if there are any) for that variable. The Variable specifications dialog is accessed by double-clicking the variable name in the spreadsheet.
Note that the assignments between the numeric values and text labels for the variable can be edited in the Text Labels Editor.
Statistics for each row or column in a selected block can be computed and added to the spreadsheet (i.e., appended at the end) by selecting the desired Statistics of Block Data from the shortcut menu.
For example, the selection shown above computes the sums for each row in the selected block. The selected statistic (Sum) is displayed in a new spreadsheet.
If you select Block Columns, the results of the selected statistic are appended as a new case, and the case name contains the name of the selected statistic and the range description, as shown below.
In addition, a (customizable) list of predefined graphs is also available from this shortcut menu.
These Graphs of Block Data enable you to visualize the data in the selected block either by row or by column.
Yes. STATISTICA supports a variety of matrix file types as input data (e.g., for Multiple Regression, Factor Analysis, Reliability/Item Analysis, and other modules). Matrix files can be edited in the spreadsheet (as if they were raw data files); however, in order to be properly interpreted as matrices, they need to meet specific content and format conditions (depending on the matrix type). STATISTICA also supports the multiple matrix data file format for input in procedures (such as Structural Equation Modeling) that support matrix data for groups. Note that matrix data that meet the specific content and format conditions are automatically saved in STATISTICA Matrix Spreadsheet (*.smx) format.
Select Cases buttons are available on all analysis and graph specification dialogs. Click this button to display the Analysis/Graph Case Selection Conditions dialog, which contains options to create conditions for which cases will be included (or excluded) in the current analysis. By default, whenever case selection conditions are used, a subtitle is appended to the graph that displays the selection condition. Options controlling the display of selection conditions in graphs are located in the Analyses/Graphs: Display options pane of the Options dialog, accessible by selecting Options from the Tools menu.