MSF_503_Chapter_4.pdf

CHAPTER 4 DATABASES and SQL 4.1 Database Management Systems Financial analysis requires data—historical price data, fundamental data, calculated data, economic data, news data, sentiment data. Python is an efficient platform for creating financial applications that interact with databases. In general there are two types of databases used in financial markets: operational databases and analytical databases. Operational databases store dynamic data such as portfolio and position information. Analytical databases hold static data such as historical data. A database management system (DBMS) is a software package that controls the organization, storage, management, and retrieval of data. It provides the ability to access (or “query”), add, delete, and modify the data. The four types of DBMSs are: hierarchical, network, relational, and object models. The best one to use depends upon the particular data and the user needs. The user typically balances transaction rate (speed), reliability, maintainability, scalability, and cost. Of concern for us, is that both operational and analytical databases we will use the relational database model (RDM). The most popular large-scale RDBMSs are from Microsoft SQL Server, Oracle, and Sybase. In the RDM, data is held in tables, which are made up of columns, called fields, and rows, called records. A field, or column, represents a characteristic of a record, and have names, data types, and (usually) lengths. Data in databases can be alphanumeric, numeric, or date/time. Also, fields can contain distinct or multipart values and may have values that are calculated. A record, or row, holds the actual data in a table. A single record in a table is made up of one row containing all of the columns in the table including a primary key that uniquely identifies a record. Connections between different tables are defined in relationships, which ensure data integrity, consistency, and accuracy. Relationships happen through keys. A primary key is a special field in a table that uniquely identifies a record. Now two records in the table can have the same value for the primary key field. Every table in a relational database must have a primary key and no two tables should have the same primary key. © 2024 Ben Van Vliet 71 Foreign keys establish relationships between pairs of tables. Relationships between two tables arise when the primary key column in one table is identical to the foreign key column in the other. A basic difference between primary key and foreign key is foreign keys allow null values and duplicate values, and it refers to a primary key in another table. A relationship is said to be one-to-one if a single record in the first table is related to a single record in the second table, and vice versa. A relationship is said to be one-to- many if a single record in the first table can be related to several records in the second table, but at the same time a single record in the second table can only be related to only a single record in the first table. A relationship is said to be many-to-many if a single record in the first table is related to many records in the second table, and vice versa. In the case of a many-to-many relationship, we need to create a linking table by copying the primary key from each table into the new table. A schema defines the formal, or even theoretical, structure of a database. Its how the the database is constructed, or conceptualized, using tables, fields, relationships, etc. Integrity constraints ensure compatibility between different parts of the schema, or structure. A schema can contain formulas that represent the integrity constraints for a specific application. We often represent a database’s schema graphically. Normalization is a process of analyzing the database schemas to minimize redundancies and problems associated inserting, deleting and updating data. If a schema doesn’t accomplish these things, we often decomposed it into smaller schemas. Thus, normalization is often the process of breaking down a large table or tables into smaller tables. A normal form is a set of rules that test a table structure to ensure it is sound and free of errors. There are at least five normal forms used to test for specific sets of problems. Tables we will use are in at least third normal for since each one has a primary key that uniquely identifies each record. © 2024 Ben Van Vliet 72 LAB 4.1: Working with SQLite Databases SQLite is an open source RDBMS. You do not need to have SQLite software installed on your computer to interface with SQLite databases through Python. According the SQLite website (https://www.sqlite.org/about.html), “SQLite is an in- process package that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. [They claim that] SQLite is the most widely deployed database in the world. SQLite reads and writes directly to ordinary disk files. A complete SQL database with multiple tables, indices, triggers, and views, is contained in a single disk file,” which has an.sqlite extension. In Python, we can create and access SQLite databases. Let’s create a test.sqlite database and use SQL statements to create a table, insert rows into the table, and query the database. import sqlite3 as lite import os # if the database already exists, don’t try to create it if not os.path.isfile( 'C:\\Python\\test.sqlite' ): con = lite.connect( 'C:\\Python\\test.sqlite' ) with con: cur = con.cursor() cur.execute("CREATE TABLE Prices(Symbol TEXT, Date DATE, Price FLOAT, Volume INT32)") cur.execute("INSERT INTO Prices VALUES('IBM', '01/02/2017', 101.1, 500 )" ) cur.execute("INSERT INTO Prices VALUES('WMT', '01/02/2017', 84.56, 400 )" ) cur.execute("INSERT INTO Prices VALUES('MSFT', '01/02/2017', 45.68, 1500 )" ) con.close() # fetchall gets a result set of an SQL query and returns a list of tuples con = lite.connect( 'C:\\Python\\test.sqlite' ) with con: © 2024 Ben Van Vliet 73 cur = con.cursor() cur.execute( 'SELECT * FROM Prices' ) rows = cur.fetchall() for row in rows: print( row ) con.close() © 2024 Ben Van Vliet 74 LAB 4.2: Working with the Finance.sqlite Database Finance.sqlite is analytical database that uses flat files to hold daily historical price data for 13 stocks and the S&P 500. The individual data tables in Finance.sqlite are named AXP, GE, GM, IBM, INTC, JNJ, KO, MCD, MO, MRK, MSFT, SUNW, WMT and SPX. In addition, there is a validation table named Tickers, which contains the 13 stock ticker symbols shown. The 14 data tables consist of the primary key column, the Date, and open, high, low, close, and volume fields. Date OpenPrice HighPrice LowPrice ClosePrice Volume 2-Jan-90 23.54 24.38 23.48 24.35 1760600 3-Jan-90 24.53 24.72 24.44 24.56 2369400 4-Jan-90 24.6 24.94 24.56 24.84 2423600 5-Jan-90 24.81 25.25 24.72 24.78 1893900 8-Jan-90 24.66 25.06 24.66 24.94 1159800 2-Jan-90 23.54 24.38 23.48 24.35 1760600 The Tickers validation table consists of a single column named Symbols, which holds the ticker symbols for each of the 13 stocks. Here is a sample of the Tickers table: Symbols AXP GE GM IBM etc. Let’s connect to Finance.sqlite. import sqlite3 as lite con = lite.connect( 'C:\\Python\\Finance.sqlite' ) with con: cur = con.cursor() cur.execute( 'SELECT * FROM IBM' ) rows = cur.fetchall() for row in rows: print( row ) © 2024 Ben Van Vliet 75 con.close() © 2024 Ben Van Vliet 76 LAB 4.3: Working with the Options.sqlite Database The Options.sqlite operational database uses a relational database structure to hold information about stocks and options as well as stock trades and option trades. In fact, there are four tables in the Options.sqlite database representing each of these things—Stocks, OptionContracts, StockTrades and OptionTrades. As we saw earlier, the relationships between two tables in a relational database are made possible by common primary and foreign keys. In Options.sqlite, for example, the Stock and StockTrades tables are related through a StockSymbol primary key in the Stock table and the foreign key StockSymbol column in the StockTrades table. Here is a diagram showing the structure or schema of the Option.sqlite database. In this diagram, the relationships are represented by arrows. © 2024 Ben Van Vliet 77 All of the relationships in the Options.sqlite database are one-to-many. As you may be able to gather from the diagram, a one-to-many relationship exists between the Stock and OptionContracts tables. Clearly, a single stock can have many options contracts on it. But in the opposite direction, it is not the same. A single option contract can have only one underlying stock associated with it. import sqlite3 as lite con = lite.connect( 'C:\\Python\\Options.sqlite' ) with con: cur = con.cursor() cur.execute( 'SELECT * FROM OptionTrades' ) rows = cur.fetchall() for row in rows: print( row ) con.close() © 2024 Ben Van Vliet 78 4.2 Structured Query Language Structured Query Language (SQL) is an ANSI/ISO-standard language for communication and interaction with databases. It was created to be a cross-platform syntax to extract and manipulate data from disparate database systems. So, in theory the same SQL queries written for an Oracle database will work on a Sybase database or an Access database and so on. However, database vendors have also developed their own versions of SQL such as Transact-SQL and Oracle’s PL/SQL which extend ANSI/ISO SQL. This chapter will focus on writing standard SQL and will not use any vendor specific SQL syntax. We can embed SQL statements into our Python programs to perform everything from simple data retrieval to high-level operations on databases. The SQL statement that we may most often be concerned with when developing quantitative trading or risk management systems are those that retrieve data, called data query language (DQL). However, we will at times also need to write, change or delete data in a database. These types of SQL statements are referred to as data manipulation language (DML). Also, however, SQL can be used to actually change the structure of the database itself. These SQL statements are called data definition language (DDL). Data control language (DCL) includes SQL queries like GRANT and REVOKE, which are commands used to provide and remove access privileges to database users. 4.2.1 Data Manipulation Language We use DML to retrieve and otherwise work with the actual data held within a database. SELECT Reading data is the most common task we want to perform against a database. A SELECT statement queries the database and retrieves selected data that matches the criteria that we specify. The SELECT statement has five main clauses, although a FROM © 2024 Ben Van Vliet 79 clause is the only required one. Each of the clauses has a wide array of options and parameters. A SELECT statement means that we want to choose columns from a table. When selecting multiple columns, a comma must delimit each of them except for the last column. Also, be aware that as with Python, SQL is not case sensitive. Upper or lower case letters will do just fine. Be aware too that most, but not all, databases require the SQL statement to be terminated by a semi-colon. Before we get too in depth, let’s create a Python program to test out the SQL statements we look at as we go along. Create a new Python file named SQL_example. import sqlite3 as lite con = lite.connect('C:\\Python\\Options.sqlite') with con: SQLstatement = "SELECT * FROM OptionTrades" cur = con.cursor() cur.execute( SQLstatement ) rows = cur.fetchall() # rows is a list for row in rows: print( row ) con.close() Now, we will test out several SQL statements. Run the program and embed the SQL statement shown. SELECT OptionSymbol,StockSymbol,Year,Month,Strike,Bid,Ask,OpenInt FROM OptionContracts Note that the columns are displayed in the order that they appear in the SELECT statement. If all columns from a table were needed to be part of the result set, we do not © 2024 Ben Van Vliet 80 need to explicitly specify them. Rather, in the case where all columns are to be selected, we can use the * symbol. WHERE Clause The previous example retrieved a result set that included all the rows in the table from the specified columns. However, we may want to filter some rows out according to some condition or based upon some comparison. This is where the WHERE clause comes in. For comparison in SQL, we use the following operators: Comparison Description Operator < Contents of the field are less than the value. Contents of the field are greater than the value. >= Contents of the field are greater than or equal to the value. = Contents of the field are equal to the value. Contents of the field are not equal to the value. BETWEEN Contents of the field fall between a range of values. LIKE Contents of the field match a certain pattern. IN Contents of the field match one of a number of criteria. So, if we were interested in only the option contracts with open interest greater than 1,000, our SQL would look like this: SELECT * FROM OptionContracts WHERE OpenInt > 10000 The WHERE clause can also have multiple conditions using AND or OR. If we wanted to see all contracts where open interest is over 1,000 and the bid is greater than 0, it would look like this: SELECT * FROM OptionContracts WHERE OpenInt > 10000 AND Bid > 0 If we needed to build a WHERE clause for such a field, SQLite requires that we use single quotes for string comparison like this: © 2024 Ben Van Vliet 81 SELECT * FROM OptionContracts WHERE StockSymbol = 'IBM' Date comparison sometimes requires the use of the pound sign, # or ‘, but SQLite uses single quotes. For example, if we wanted to see all of the options trades done in February of 2003: SELECT * FROM OptionTrades WHERE TradeDateTime >= '2003-02-01' AND TradeDateTime

MSF_503_Chapter_4.pdf

Document Details

Related

Full Transcript