Xem mẫu

Chapter 2 Data Information systems are combinations of computer technologies that help us to store, find, and otherwise handle the information we need. In practice, the central component of an information system is a database. However, a database, by itself, is not enough for niche modeling. Many people rely on spreadsheets for handling their data. In contrast, for larger datasets and multi-user installations, the conventional approach to information systems is three-tiered, consisting of the following: • a database, • an analysis system, and • a presentation system. For example, a low-cost system for niche modeling might use a database, the statistics language R for analytics, and a browser for presentation of results. While there are advantages in partitioning operations in this way, inefficien-cies and errors are introduced when transferring data between applications, particularly in passing data in and out of databases. In keeping with the main theme of this book – successful niche modeling and recognition of sources of error – eliminating a possible source of errors is one of the advantages of a single system like a spreadsheet. Fully integrated systems can also be very efficient. A novel approach de-veloped for the financial industry integrates database and analysis functions into the one, memory resident application. The vector-based K language has produced very fast trading applications – over 100 times faster than equivalent three-tiered applications [Whi04]. In this section we demonstrate the use of R as a database. While not rec-ommended for large datasets, it may be possible to dispense with an external database by replicating relational database operations including select and join in R. As well as simplicity and efficiency for smaller systems, a sim-ple database example helps to build knowledge of the R’s powerful indexing operations. 23 © 2007 by Taylor and Francis Group, LLC 24 Niche Modeling SQL or structured query language is one of the main languages used in databases. While not going into the syntax of this language, we replicate expressions in SQL with operations written in R. 2.1 Creating Data must be moved in and out of a database. Where one might use import or mysqlimport in MySQL, here we read a small local file of locations where a species was sighted that might have been saved in a spreadsheet. The operation read.table also permits reading from a URI, so data can be read directly from the web. > locations <- read.table("obs.txt", header = TRUE) TABLE 2.1: Example data consisting of field observations with locations. id X Y 11 ML240 110.00 111.00 2 ML240 211.00 102.00 3 ML240 123.40 114.30 A database is composed of a set of tables. The R data structure called a dataframe serves the purpose of tables in a relational database. A data frame is really a list of vectors (columns) of equal length, ideal for storing all kinds of information. To simulate a relational database table, the data frame name serves as the table name, and column names serve as the table attributes. Comprehensive information about a data frame, or any R object, can be listed with the command attributes. > attributes(locations) $names [1] "id" "X" "Y" $class [1] "data.frame" © 2007 by Taylor and Francis Group, LLC Data 25 $row.names [1] "11" "2" "3" 2.2 Entering data Manipulating and entering data in R is perhaps easier than most databases. To insert rows at the bottom of a table we use rbind. To insert a column, say of number of animals seen at each observation, we would use cbind. > locations <- rbind(locations, data.frame(id = "PC101", + X = 113.4, Y = 114.3)) > locations <- cbind(locations, Number = c(2, 1, 3, + 1)) > locations id X Y Number 11 ML240 110.0 111.0 2 2 ML240 211.0 102.0 1 3 ML240 123.4 114.3 3 1 PC101 113.4 114.3 1 Altering tables is easy in R. Whereas a table is modified in a relational database with the drop command, in R a column is deleted by assignment of NULL. Assignment also changes the name of a table, and the old locations object can be deleted with rm. R automatically assigns names to rows but not the ones we want. A unique index for each observation could be added to the table as row names as shown below. Changing the names of columns is also easy in R, as shown. > row.names(locations) <- c("12", "13", "14", "15") > obs <- locations > rm(locations) > names(obs) <- c("id", "X", "Y", "n") > obs id X Y n 12 ML240 110.0 111.0 2 13 ML240 211.0 102.0 1 14 ML240 123.4 114.3 3 15 PC101 113.4 114.3 1 © 2007 by Taylor and Francis Group, LLC 26 Niche Modeling 2.3 Queries Of course the previous operations are preliminaries relative to the main purpose of a relational database, which is to perform queries. As before, R can easily be coerced to replicate SQL statements for queries. The following are examples of how concisely the indexing in R can mimic the corresponding SQL select statement. 2.3.0.1 SQL: SELECT * FROM obs > obs id X Y n 12 ML240 110.0 111.0 2 13 ML240 211.0 102.0 1 14 ML240 123.4 114.3 3 15 PC101 113.4 114.3 1 2.3.0.2 SQL: SELECT * FROM obs WHERE rownum == 2 or rownum ==3 > obs[2:3, ] id X Y n 13 ML240 211.0 102.0 1 14 ML240 123.4 114.3 3 > obs[c(FALSE, TRUE, TRUE, FALSE), ] id X Y n 13 ML240 211.0 102.0 1 14 ML240 123.4 114.3 3 Operations on columns have a shorthand form. The dollar selects a column from a table by name. 2.3.0.3 SQL: SELECT id FROM obs > obs$id [1] ML240 ML240 ML240 PC101 Levels: ML240 PC101 © 2007 by Taylor and Francis Group, LLC Data 27 A more flexible form of selection with an arbitrary list of column names as an argument is subset. 2.3.0.4 SQL: SELECT X,Y FROM obs > subset(obs, select = c("X", "Y")) X Y 12 110.0 111.0 13 211.0 102.0 14 123.4 114.3 15 113.4 114.3 > subset(obs, select = c(2, 3)) X Y 12 110.0 111.0 13 211.0 102.0 14 123.4 114.3 15 113.4 114.3 The following example illustrates the combination of row and column selec-tion. 2.3.0.5 SQL: SELECT * FROM obs WHERE id = PC101 > obs[obs$id == "PC101", ] id X Y n 15 PC101 113.4 114.3 1 > subset(obs, id == "PC101") id X Y n 15 PC101 113.4 114.3 1 Another important feature of database languages is functions for aggregat-ing data, such as summing and counting. Say we wanted to find out how many animals of each species had been observed. In SQL we would use a combination of select and the count or sum functions to perform aggregation. The following are examples of R implementations of typical SQL queries using the dim and aggregate functions. The first counts the number of unique observations. The second sums the number of animals of each species observed at a site. © 2007 by Taylor and Francis Group, LLC ... - tailieumienphi.vn
nguon tai.lieu . vn