SDS Midterm Review.pdf
SDS Midterm Review

Week 1

What is Spatial Data Science?

Data is any information that can be collected and analyzed.

Qualitative Data
This data type is descriptive and often deals with characteristics, qualities, or attributes.
- Non-numerical: includes things like names, colors, textures, and descriptions.

Quantitative Data
This type of data is numerical and represents quantities or measurements. It can be further divided into:
- Discrete data (distinct values: income classes)
- Continuous data (measurable values: temperature)

What is Data Science?
Data science is a multidisciplinary field that covers:
- Data Collection and Preparation
- Data Analysis and Exploration
- Data Visualization
- Machine Learning and Predictive Modeling
- Big Data Analytics

Spatial Data Science Process
1. Data Collection
2. Data Engineering & Integration
3. Modeling and Scripting
4. Analytics
5. Visualization, Storytelling, Sharing

Geodesign Framework (Professor Carl Steinitz)

Organizing Education for GeoDesign
Describes a collaborative activity that is not the exclusive territory of any design profession, geographic science, or information technology. Each participant must know and be able to contribute something that the others cannot or do not.

GeoDesign problems share six questions:
1. How should the context be described? (DATA)
2. How does the context function? (KNOWLEDGE)
3. Is the context working well? (VALUES)
4. How might the context be altered? (DATA)
5. What differences might the changes cause? (KNOWLEDGE)
6. How should the context be changed? (VALUES)
WHY? WHAT, WHERE, WHEN? HOW?

Beginnings of Geodesign (reading)
By Carl Steinitz
Geodesign is conceived as an iterative design method that uses stakeholder input, geospatial modeling, impact simulations, and real-time feedback to facilitate holistic designs and smart decisions. This paper aims to lay bare the beginnings of geodesign as such from 1965 onwards.
It offers a personal historical perspective of Carl Steinitz, one of the protagonists in the field of geodesign. The paper describes some important milestones and influential people in a joint effort to bridge geo-information technology, spatial design, and planning. It showcases the ongoing effort to employ the potential power of GIS to link different model types and ways of designing to make better plans.

“Geodesign is a method which tightly couples the creation of proposals for change with impact simulations informed by geographic contexts and systems thinking, and normally supported by digital technology.” (Michael Flaxman and Stephen Ervin, 2010)

Understanding of decision processes
- It distinguishes between land-use demands and evaluations of their locational attractiveness, and site resources and evaluations of their vulnerabilities.
- It assesses risk and impacts and proposes generating plans with the rules of a simulation model.

The paper divides the earliest years of GIS and its applications into three stages:
1. In the middle 1960s, we used computers and computer graphics to do things we already knew how to do using non-computer technologies.
a. We acquired data, encoded it, and produced maps. The analytic capabilities of the time were primitive, typically limited to applied studies on landscape classifications, sieve maps, or overlay combinations, all of which could have been accomplished with hand-drawn methods.
b. Spatial and statistical analyses were difficult; professional acceptance was low, and public cynicism was high regarding analyses and the resultant graphics produced by computers.
2. In the later 1960s, the emphasis shifted to substantially more sophisticated GIS analyses:
a. the merging of mapping and statistical techniques,
b. the introduction of more sophisticated spatial analysis methods,
c. the introduction of graphic displays more diverse than two-dimensional maps.
d. 
A strong research effort in theoretical geography was organized and directed by William Warntz and related to the theory of surfaces, the macro-geography of social and economic phenomena, and central place theory.
3. In the early 1970s, the laboratory saw important interaction with other disciplines and professions, particularly the scientific and engineering professions.
a. We had the self-criticism that recognized the need for more predictable analysis and for better models.
b. The view throughout this third stage was that information could and should influence design decisions.
c. A critical professional role would be to organize that information, have it available and adaptable to questions, and thus provide decision makers with information relevant to decisions at hand.
d. The focus on aiding decisions rather than making decisions increased both public and professional interest and acceptance.

GIS Shapefiles and Its Components
A shapefile is a simple, nontopological format for storing the geometric location and attribute information of geographic features. Geographic features in a shapefile can be represented by points, lines, or polygons (areas). The workspace containing shapefiles may also contain dBASE tables, which can store additional attributes that can be joined to a shapefile's features.
Video: Explore GIS Shapefiles and Its Components

Shapefile file size limitations
Each component file of a shapefile is limited to 2 GB. Therefore, .dbf files cannot exceed 2 GB and .shp files cannot exceed 2 GB (these are the only files that are likely to be huge). The total size of all the component files can exceed 2 GB.

What are the shapefile files?
Here is a list of all the files that make up a shapefile, including SHP, SHX, DBF, PRJ, XML, SBN, SBX, and CPG.

Main File (.SHP) - mandatory
SHP is a mandatory Esri file that gives features their geometry. Every shapefile has its own .shp file that represents spatial vector data.
For example, the data could be points, lines, or polygons in a map.

Index File (.SHX) - mandatory
SHX is a mandatory Esri and AutoCAD shape index file that stores index positions. This type of file is used to search forward and backward.

dBASE File (.DBF) - mandatory
DBF is a standard database file used to store attribute data and object IDs. A .dbf file is mandatory for shapefiles. You can open DBF files in Microsoft Access or Excel.

Projection File (.PRJ) - optional
PRJ is an optional file that contains the metadata associated with the shapefile's coordinate and projection system. If this file does not exist, you will get the error “unknown coordinate system”. To fix this error, use the “define projection” tool, which generates .prj files.

Extensible Markup Language File (.XML) - optional
XML files contain the metadata associated with the shapefile. If you delete this file, you essentially delete your metadata. You can open and edit this optional file type (.xml) in any text editor.

Spatial Index File (.SBN) - optional
SBN files are optional spatial index files that optimize spatial queries. This file type is saved together with a .sbx file. These two files make up a shape index to speed up spatial queries.

Spatial Index File (.SBX) - optional
SBX files are similar to .sbn files and speed up loading times. They work with .sbn files to optimize spatial queries. In a test, loading took 27.3 seconds with the .sbn and .sbx files versus 33.3 seconds without them, a difference of 6 seconds.

Code Page File (.CPG) - optional
CPG files are optional plain text files that describe the encoding applied to create the shapefile. If your shapefile doesn’t have a .cpg file, then it has the system default encoding.

Mandatory and optional files for shapefiles
If you need to move shapefile files in Windows Explorer, you should drag and drop all the mandatory and optional files.
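Before moving or sharing a shapefile by hand, the mandatory/optional split listed above can be checked with a short script. The following is only a minimal sketch: the function name `check_shapefile` is hypothetical, and it assumes the component files sit next to the .shp file and share its base name (with the metadata file named either `name.xml` or `name.shp.xml`).

```python
from pathlib import Path

# Extensions taken from the list of shapefile component files in these notes.
MANDATORY = (".shp", ".shx", ".dbf")
OPTIONAL = (".prj", ".xml", ".sbn", ".sbx", ".cpg")


def check_shapefile(shp_path):
    """Return (present, missing): which component files exist, and which
    mandatory ones are absent. Assumes all components share one base name."""
    stem = Path(shp_path).with_suffix("")  # e.g. data/roads.shp -> data/roads
    present = {}
    for ext in MANDATORY + OPTIONAL:
        candidates = [stem.with_name(stem.name + ext)]
        if ext == ".xml":
            # Assumption: metadata may also be stored as <name>.shp.xml
            candidates.append(stem.with_name(stem.name + ".shp.xml"))
        present[ext] = any(p.is_file() for p in candidates)
    missing = [ext for ext in MANDATORY if not present[ext]]
    return present, missing
```

For example, `check_shapefile("data/roads.shp")` (a made-up path) returns an empty `missing` list only when the .shp, .shx, and .dbf files are all present; a `False` entry for `.prj` would be a hint to expect the “unknown coordinate system” error.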
If you are in ArcCatalog, it will move all the mandatory and optional files for you.

Introduction to the American Community Survey (ACS)
How is ACS data collected?
https://www.census.gov/data/academy/courses/discovering-the-american-community-survey.html
- Provides local statistics on critical planning topics such as age, children, commuting, education, and employment.
- Household sample of 3.5 million addresses each year
- Estimates on over 40 topics
- Four main types of topics:
o Social
o Demographic
o Economic
o Housing
- Decennial Census: every 10 years (short form only since 2010)
- ACS: every year since 2005
- ACS was fully implemented in 2005 (testing from 1996-2004)

Data collection process:
- ACS data is collected through the Internet, mail, and in-person visits
- Data collection for each monthly panel takes place over a three-month period

Data release schedule:
- Typically released one year after data is collected
- Supplemental estimates are simplified versions of ACS tables
- The ACS is conducted every month and the data are released every year

How to Get NHGIS Data (Exam Material)
Download U.S. Census Data Tables & Mapping Files
The National Historical Geographic Information System (NHGIS) provides easy access to summary tables and time series of population, housing, agriculture, and economic data, along with GIS-compatible mapping files, for years from 1790 through the present and for all levels of U.S. census geography, including states, counties, tracts, and blocks.

Week 2 – Geography and Geospatial Data

Getting Started with OnTheMap (Reading)
OnTheMap is an online mapping application that shows where people work and where workers live.

What is OnTheMap?
OnTheMap has been developed through a unique partnership between the U.S. Census Bureau and 50 partner states (plus the District of Columbia) through the Local Employment Dynamics (LED) partnership.
The employment data used in this application are derived from payroll tax (unemployment insurance) payment records maintained by each state. The states assign employer locations (QCEW data), while individual worker home locations are assigned by the U.S. Census Bureau using data from multiple Federal agencies. Age, earnings, and industry profiles are compiled using each state's records along with other supplemental Census Bureau source data. Final compilations and confidentiality modeling are performed by the Census Bureau. OnTheMap contains annual historical data starting in 2002 for most participating states.

Home Area or Workplace Area?
Maps can be produced that display where workers live or where workers are employed, and also where workers are living or working within a selected area. You can reorient a map by simply changing a few settings and resubmitting the analysis.

What Type of Analysis?
OnTheMap allows users to produce several different types of analyses that provide a wide variety of output results. All analysis types have their own associated map overlays, charts, and reports.
1. Area Profile Analysis generates results showing the location and characteristics of workers living or working inside the selected study area.
2. Area Comparison Analysis generates results showing the count and characteristics of workers employed or living in locations contained by the selected study area. The "Areas to Compare:" dropdown determines the type of locations to be compared.
3. Distance/Direction Analysis generates results showing the distance and direction totals between residence and employment locations for workers employed or living in the selected study area.
4. Destination Analysis generates results showing the home or work destinations of workers employed or living in the selected study area. Select the geographic destination type (i.e. counties, cities, tracts) using the "Destination Type:" dropdown.
5. 
Inflow/Outflow Analysis generates results showing the count and characteristics of worker flows into, out of, and within the selected study area.
6. Paired Area Analysis generates results showing the location and characteristics of workers that share the selected home and work areas.

Map Overlays
Home and work locations are displayed in map overlays consisting of:
- Point Overlays (round dots)
o Points show where workers are clustered on the map, with each dot representing a specific home or work location. The larger the dot, the more workers live or work at that census block.
- Thermal Overlays (shaded contours, similar to those used in weather maps)
o Thermals show the density of workers measured in terms of workers per square mile.
- Thematic Overlays (shaded geographic areas)
o Thematic Overlays display counts of workers employed or living in geographic features from the selected destination or comparison layer type.
- Spoke Overlays (commute lines)
o Spokes appear in the Destination Analysis results and represent commuting from the selection area to each of the destination areas.
- Flow Overlays (worker flow arrows)
o The Flow Overlay represents the flow of workers entering, leaving, or staying within the selection area to work.

Detailed Reports
Each analysis type provides a detailed report in HTML, PDF, or Excel spreadsheet formats. The Area Profile and Area Comparison Analysis reports contain information on workers employed or living in the area of analysis, regardless of where these workers commute to/from. The other analysis types provide information on the home/work connection of each job living or working in the area of analysis.

ACS Geography Handbook (Reading)

1. 
Geographic Areas Covered in the ACS

The U.S. Census Bureau divides the nation into two main types of geographic areas:
- Legal: defined specifically by law, usually represented by elected officials
o State
o Local
o Tribal government units
o Some specially defined administrative areas like congressional districts
- Statistical: defined directly by the Census Bureau and state, regional, or local authorities for the purpose of presenting data
o Census designated places, census tracts, urban areas, and metropolitan statistical areas

Geographic areas are organized in a hierarchy. In the American Community Survey (ACS), block groups are the lowest (smallest) level of geography published. Block group data are only available in the ACS 5-year data products. The ACS does not produce data at the block level.

In Figure 1.1, the geographic types connected by lines are nested within each other. For example, a line extends from counties to census tracts because a county is completely comprised of census tracts, and a single census tract cannot cross a county boundary. If there is no line joining two geographic types, then an absolute and predictable relationship does not exist between them. For example, although many places (cities and towns) are confined to one county, some places, such as New York City, extend over more than one county (see Figure 1.2). Therefore, an absolute hierarchical relationship does not exist between counties and places.

Geographic Summary Levels and Codes
There are two main types of identifiers that the Census Bureau uses for geographic areas:

Summary levels - represent a geographic type
o Summary levels range from very large reporting units, such as “State,” to much smaller reporting units such as “Census Tract.” Each summary level has an assigned three-digit summary-level code to help programmers link each summary level to its appropriate use in a table, map, or other data summarization format.
o Here are some common summary levels used to identify types of geographic areas:
010  Nation
020  Region
030  Division
040  State
050  State-County
140  State-County-Census Tract
155  State-Place-County
160  State-Place
250  American Indian Area/Alaska Native Area/Hawaiian Home Land
310  Metropolitan Statistical Area/Micropolitan Statistical Area
500  State-Congressional District
o Summary levels may cross between two or more geographic hierarchies to produce units that are only portions of geographic areas. For example, summary level “State-Place-County” crosses the “State-Place” hierarchy with the “State-County” hierarchy and may create units that cover only a portion of one county.

Geographic identifiers (GEOIDs) - used to identify individual geographic areas
o The Census Bureau and other state and federal agencies are responsible for assigning codes, or GEOIDs, to geographic areas.
o GEOIDs are numeric codes that uniquely identify each legal or statistical geographic area for which the Census Bureau tabulates data.
o GEOIDs are useful for sorting names of geographic areas for presentation purposes or analysis, merging ACS data with data from other sources, identifying areas as legally or statistically defined entities, and describing the classification category of the area.
o The most frequently used code systems are Federal Information Processing Series (FIPS) and American National Standards Institute (ANSI) codes.
o To identify a geographic area that is nested within a larger area, such as a state or the nation, one or more higher-level codes may be required. Census tract codes are unique within counties, and county codes are unique within states. Therefore, a complete set of state, county, and tract codes is needed to uniquely identify a particular census tract.

For example, State-County (summary level 050) represents the concept of a county within a state, while the GEOID for Madison County, Texas, is 48313 (state code ‘48’ combined with county code ‘313’).
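The nesting rule above (state code, then county code, then tract code) can be sketched as simple string concatenation. This is an illustrative sketch rather than an official Census Bureau tool: the function names are hypothetical, and the code widths (2-digit state, 3-digit county, 6-digit tract) are the usual FIPS conventions, which are an assumption not stated in these notes. Codes are kept as strings so leading zeros survive.

```python
# Assumed FIPS code widths: 2-digit state, 3-digit county, 6-digit tract.
WIDTHS = (("state", 2), ("county", 3), ("tract", 6))


def build_geoid(state, county=None, tract=None):
    """Concatenate zero-padded codes from the largest area down."""
    if tract is not None and county is None:
        raise ValueError("a tract code needs its county code")
    parts = []
    for (_, width), value in zip(WIDTHS, (state, county, tract)):
        if value is None:
            break
        parts.append(str(value).zfill(width))
    return "".join(parts)


def parse_geoid(geoid):
    """Split a state, state-county, or state-county-tract GEOID into parts."""
    parts, pos = {}, 0
    for name, width in WIDTHS:
        if pos >= len(geoid):
            break
        parts[name] = geoid[pos:pos + width]
        pos += width
    return parts


print(build_geoid(48, 313))   # Madison County, Texas -> 48313
print(parse_geoid("48313"))   # {'state': '48', 'county': '313'}
```

Storing GEOIDs as text rather than integers matters when joining tables: a state code such as `01` would otherwise lose its leading zero and no longer match.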
These codes can be used to identify geographic areas in the ACS and many other public data sources.

Population Thresholds for Geographic Areas
Each year, the Census Bureau publishes ACS 1-year estimates for geographic areas with populations of 65,000 or more. The 65,000-population threshold ensures that 1-year data are available for all regions, divisions, states, the District of Columbia, Puerto Rico, congressional districts, and Public Use Microdata Areas (PUMAs), as well as many large counties and county equivalents, metropolitan and micropolitan statistical areas, cities, school districts, and American Indian areas.
o PUMAs are collections of counties (or census tracts within counties) with approximately 100,000 people each. PUMAs do not cross state lines. PUMAs were initially adopted by the ACS because they were the only wall-to-wall geographic entities below the state level that met the minimum population threshold of 65,000 required to disseminate ACS 1-year estimates.

The 1-year Supplemental Estimates (simplified versions of popular ACS tables) are also available for geographic areas with at least 20,000 people.

For geographic areas with smaller populations, the ACS samples too few housing units to provide reliable single-year estimates. For these areas, several years of data are pooled together to create more precise multiyear estimates. Since 2010, the ACS has published 5-year data (beginning with 2005–2009 estimates) for geographic areas down to the census tract and block group levels.

For areas with populations of 65,000 or more, you have a choice between the 1-year and 5-year data series. Which data should be used and why?
o The 1-year estimates for an area reflect the most current data but have larger margins of error (MOEs), indicating less reliability or precision, than the 5-year estimates because they are based on a smaller sample.
o The 5-year estimates for an area have a larger sample and smaller MOEs than the 1-year estimates.
However, they are less current because the larger samples include data that were collected in earlier years. The main advantage of using multiyear estimates is the increased statistical reliability for smaller geographic areas and small population groups.

Key Geographic Areas in the ACS
Congressional districts, counties, PUMAs, and census tracts.
- Congressional districts are redrawn after each census for the purpose of electing the members of the U.S. House of Representatives.
- Counties are also important because they are the primary legal subdivision within each state.
- The Census Bureau also divides each state into a series of PUMAs, each of which has a minimum population of 100,000. PUMAs are constructed based on county and census tract boundaries and do not cross state lines. PUMAs provide nationwide coverage for 1-year and 5-year data and can be aggregated to create custom geographic areas. PUMAs are updated after each decennial census. Typically, counties with large populations are subdivided into multiple PUMAs, while PUMAs in more rural areas are made up of groups of adjacent counties.
- Census tracts, small subdivisions of counties that typically have between 1,200 and 8,000 residents, are commonly used to present information for small towns, rural areas, and neighborhoods.
- There are also more than 300 ACS data tables available for block groups, subdivisions of census tracts that include between 600 and 3,000 people each. In the ACS, block groups are the lowest (smallest) level of geography published. Block group data are only available in the ACS 5-year data products.

User-Defined Areas
There are instances where analysts might want to show data for a custom, user-defined geographic area. Advanced users who are aggregating ACS estimates can use the Census Bureau’s Variance Replicate Tables to produce MOEs for selected ACS 5-year Detailed Tables. Users can calculate MOEs for aggregated data by using the variance replicates.
Unlike available approximation formulas, this method exactly matches MOEs published on data.census.gov by including a covariance term.

TIP: When aggregating ACS estimates across different geographic areas or population subgroups, data users should avoid combining ACS 1-year estimates with ACS 5-year estimates. That is, 1-year estimates should only be combined with other 1-year estimates, and 5-year estimates should only be combined with other 5-year estimates. When such derived estimates are generated, the user must also calculate the associated MOE.

2. Geographic Boundaries, Vintages, and Frequency of Updates

The American Community Survey (ACS) publishes estimates using vintages (the latest available geographic boundaries). For the ACS 5-year estimates, the vintage is the last year of the multiyear period. For example, the 2017 ACS 1-year estimates and 2013–2017 ACS 5-year estimates use the same vintage (2017) of geographic boundaries. ACS data generally reflect the geographic boundaries of legal areas as of January 1 of the estimate year.

The Census Bureau does not revise ACS data for previous years to reflect changes in geographic boundaries. Given the major changes to congressional district boundaries after each census, a comparison of congressional district data between 2011 and 2012 is not feasible.

TIP: In some cases, a geographic boundary may change but the GEOID may remain the same, so data users need to pay attention to year-to-year changes to make sure the data are comparable over time.

3. Accessing and Mapping ACS Data

Data.census.gov is the Census Bureau's primary tool for accessing population, housing, and economic data from the American Community Survey (ACS), the Puerto Rico Community Survey, the decennial census, and many other Census Bureau data sets.
Other specialized tools, such as My Congressional District and Census Business Builder, provide users with quick and easy access to statistics for particular geographic areas and topics. More advanced users also have several options to access more detailed ACS data through the downloadable Summary File, the Public Use Microdata Sample (PUMS) files, or the Census Bureau’s Application Programming Interface (API).

Topologically Integrated Geographic Encoding and Referencing (TIGER) Data and Products
TIGER products are spatial extracts from the Census Bureau's Master Address File (MAF)/TIGER database (MTDB), designed for use with geographic information system (GIS) software. The data contain features, such as roads, railroads, and rivers, as well as legal and statistical geographic areas. TIGER products include the following:
o TIGERweb is a Web-based system that allows users to visualize TIGER data in several ways, such as viewing spatial data online or streaming to mapping applications.
o TIGER/Line with Selected Demographic and Economic Data are geodatabases (or shapefiles for some 2010 Census data) joined with selected attributes (including population and housing unit counts, demographic characteristics such as sex by age, and socio-economic characteristics such as poverty) from the 2010 Census, 2006–2010 through current ACS 5-year estimates, and County Business Patterns for selected geographic areas.
o TIGER/Line Shapefiles provide legal boundaries, roads, address ranges, water features, and more. These files do not include demographic information but can be linked to data from demographic tables using the GEOID.
o TIGER/Line Geodatabases are spatial extracts from the Census Bureau’s MTDB. The geodatabases contain national coverage (for geographic boundaries or features) or state coverage (boundaries within state). These files do not include demographic data, but they contain GEOIDs that can be linked to the Census Bureau’s demographic data.
o Cartographic Boundary Shapefiles are small-scale (limited detail) mapping projects clipped to shoreline. These files are designed for thematic mapping using GIS and are available for a limited set of geographic types.
o Keyhole Markup Language (KML) Cartographic Boundary Files are simplified representations of selected geographic areas from the Census Bureau’s MAF/TIGER system. These boundary files are specifically designed for small-scale thematic mapping using an online tool such as Google Earth or Google Maps.

Understanding and Using ACS Single-Year and Multiyear Estimates (Reading)

Understanding Period Estimates
Single-year and multiyear estimates from the ACS are all “period” estimates derived from a sample collected over a period of time, as opposed to “point-in-time” estimates such as those from past decennial censuses. For example, the 2000 Census “long form” sampled the resident U.S. population as of April 1, 2000.

While an ACS 1-year estimate includes information collected over a 12-month period, an ACS 5-year estimate includes data collected over a 60-month period.
- In the case of ACS 1-year estimates, the period is the calendar year (e.g., the 2018 ACS covers the period from January 2018 through December 2018).
- In the case of ACS multiyear estimates, the period is 5 calendar years (e.g., the 2014–2018 ACS estimates cover the period from January 2014 through December 2018).

Therefore, ACS estimates based on data collected from 2014–2018 should not be labeled “2016,” even though that is the midpoint of the 5-year period. Multiyear estimates should be labeled to indicate clearly the full period of time (e.g., “The child poverty rate in 2014–2018 was X percent.”). They do not describe any specific day, month, or year within that time period.

Multiyear estimates require some considerations that single-year estimates do not. For example, multiyear estimates released in consecutive years consist mostly of overlapping years and shared data.
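The overlap between consecutive multiyear releases can be made concrete with a few lines of arithmetic on the period years. This is only an illustrative sketch with hypothetical function names; it treats a 5-year period as the five calendar years ending in the release year, as described above, and writes the period label with a plain hyphen.

```python
def period_years(end_year, span=5):
    """Calendar years covered by a multiyear estimate ending in end_year."""
    return set(range(end_year - span + 1, end_year + 1))


def period_label(end_year, span=5):
    """Label for the full period, e.g. 2014-2018 (never the midpoint year)."""
    return f"{end_year - span + 1}-{end_year}"


def shared_years(end_a, end_b, span=5):
    """Years of data that two multiyear estimates have in common."""
    return sorted(period_years(end_a, span) & period_years(end_b, span))


print(period_label(2018))        # 2014-2018
print(shared_years(2017, 2018))  # [2014, 2015, 2016, 2017]
print(shared_years(2013, 2018))  # []
```

The 2013-2017 and 2014-2018 estimates share four of their five years, so a difference between them mostly reflects the one year that dropped out and the one that was added. The 2009-2013 and 2014-2018 periods share no years.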
The primary advantage of using multiyear estimates is the increased statistical reliability of the data compared with that of single-year estimates, particularly for small geographic areas and small population subgroups.

Deciding Which ACS Estimate to Use
For data users interested in obtaining detailed ACS data for small geographic areas (areas with fewer than 65,000 residents), ACS 5-year estimates are the only option.

Week 3 – Data Exploration and Visualization

Visual Analysis Best Practices
It is vital that your visualization has a purpose and that you are selective about what you include in it to fulfill that purpose.

Trends over Time
- Area and bar charts are great at amplifying total trends over time and showing how individual sectors contribute to the total over time.
o Area chart treats each sector as a single pattern
o Bar chart focuses on each unit of time as a single pattern

Comparison and ranking
- Bar chart is great for comparison and ranking because it encodes quantitative values as length on the same baseline, making it easy to compare values

Correlation
- Correlation analysis is a great place to start in identifying relationships between measures
- Correlation doesn’t guarantee a relationship
- Visualized through scatter plot or stacked line/bar charts

Distribution
- Distribution analysis shows how your quantitative values are distributed across their full quantitative range
- Box plots (box and whisker)
o Identify low values, 25th percentile values, medians, 75th percentiles, and maximum values
- Histogram
o Break data out by bins and count the number of data points in each segment
o Shows us the peak or most common values

Part to Whole
Although pie charts are commonly used in this type of situation, we suggest avoiding them for two reasons: 1) The human visual system is not very good at estimating area and 2) You can only compare slices right next to each other.
Alternative: stacked bar chart

Geographical Data
- When you want to show location, use a map
- Maps are often best when paired with another chart that details what the map displays

Emphasize the most important data
- Choosing where to put each measure depends on what kind of analysis you are doing and what you are trying to emphasize
- Rule of thumb is to put the most important data on the X or Y axis and less important data on color, size, or shape

Orient your views for legibility
- If you find yourself with a view that has long labels that only fit vertically, try rotating the view

Organize your View
- Bullet chart combines a bar chart with reference lines to create a visual comparison between actual and target numbers

Avoid overloading your Views
- Instead of stacking many measures and dimensions into one condensed view, break them down into small multiples
- Limit the number of colors and shapes in a single view
o 7-10 max so you can distinguish them and see important patterns
- Use interactive views only when necessary: you need to guide a story, encourage user exploration, or there is too much detail to show all at once

The following guidelines will help you design great dashboards:
- Place the most important view at the top of your dashboard, or in the upper left corner. When looking at a dashboard, your eye is usually drawn to that corner first.
- If your visualization has chained interactivity (the first view filters the next view, which filters the last view, etc.), structure them from top to bottom and left to right. That way, the final view to be filtered will be on the bottom, or bottom right.
- Unless there is an absolute need to add more, limit the number of views in your dashboard to three or four. If you add too many views, the big picture can get lost in the details. Remember, you can always use multiple dashboards to tell one story!
- Avoid using multiple color schemes in a dashboard, unless there are natural and independent color schemes in your data.
- If you have multiple filters, try to group them together with a layout container. A light border around them gives a subtle visual cue that they have shared features. The right, top, and left sides of the dashboard are all great places to put your filters.
- If a legend applies to all of your views, place it together with all of your filters. If a legend applies to one or a few views, place it as close to those views as possible.
- Highlights let you quickly show relationships between values in a specific area or category, even across multiple views
- Filters let you slice data from different angles or drill down to a more detailed level. They enable multi-level data exploration and user-driven data analysis

Color
- Try to use no more than two color palettes
- Consider how color will be interpreted; create semantically meaningful colors
- The midpoint of diverging color palettes should be meaningful
- Avoid adding color encoding to more than 12 distinct values

Which chart or graph is right for you?

Bar Chart
Quickly compare data across categories, highlight differences, show trends and outliers, and reveal historical highs and lows at a glance.
- Especially effective when you have data that can be split into multiple categories
Line Chart
- Connects several distinct data points, presenting them as one continuous evolution
- Use line charts to view trends in data, usually over time
- Not just limited to time
Pie Chart
- Add detail to other visualizations
- Alone, a pie chart doesn’t give the viewer a way to quickly and accurately compare information, so key points can get lost
Maps
- Visualize location information
Density Map
- Reveal patterns or relative concentrations that might otherwise be hidden due to overlapping marks on a map
- Help identify locations with greater or fewer numbers of data points
- Most effective when working with a data set containing many data points in a small geographic area
Scatter Plots
- Investigate the relationship between different variables, showing if one variable is a good predictor of another or if they tend to change independently
- Present lots of distinct data points on a single chart
- Can be enhanced with analytics like cluster analysis or trend lines
Gantt Chart
- Display a project schedule or show changes in activity over time
- Shows steps that need to be completed before others can begin, along with resource allocation
- Not limited to projects
  o Represent any data related to a time series with this chart
Bubble Chart
- Adds detail to scatter plots or maps to show the relationship between three or more measures
- Varying the size and color of circles creates visually compelling charts that present large volumes of data at once
Histogram Chart
- Show how your data is distributed across distinct groups
- Group your data into specific categories known as “bins”
- Assign a bar that is proportional to the number of records in each category
Bullet Chart
- Quickly compare progress against a goal
- Variation of a bar chart
- Designed to replace dashboard gauges, meters, and thermometers
- Shows more information and provides more points of comparison while using less space
- Doesn’t display history
- Best suited for quick “how are we doing” dashboards, rather than deep analysis
Highlight Table
- Take heat maps one step further
- Uses color to grab the viewer’s attention, while still presenting precise figures
Tree Map
- Relate different segments of your data to the whole
- Each rectangle in a tree map is subdivided into smaller rectangles, or sub-branches, based on its proportion to the whole
- Make efficient use of space to show percent of total for each category
Box-and-whisker Plot
- Also known as boxplots
- Common way to show distributions of data
- The box contains the median of the data along with the 1st and 3rd quartiles (the points bounding the 25% of the data just below and just above the median)
- The whiskers typically represent data within 1.5 times the interquartile range (the difference between the 1st and 3rd quartiles)
  o The whiskers can also be used to show the maximum and minimum points within the data.
Candlestick Chart
- Commonly used for financial analysis to show metrics about a financial instrument over a period of time
- Shows the open, close, high, and low values of an instrument over time
Good enough to Great
Charts
- Comparing Categories – bar charts are best when you have a single measure
- Checking Progress – bullet charts, reference lines, bands and distributions focus attention on targets
- Distribution – histograms and box plots show where your data is clustered, and can compare categories
- Regional Analysis – visualize data on geographical maps to answer location-specific questions, or aid geographical exploration (not just because it looks nice)
- Custom Shapes – use subject matter shapes to tell a more compelling story
Color
- The first thing we notice
- Immediately highlights specific insights
- Data should drive the use of color to make a point
- Differentiation – don’t use similar colors, or too many colors. Don’t re-use colors for different dimensions or measures on the same dashboard
- Measurable – does the color scale match my data? Does it move from light to dark, or is it stepped in the best way to represent what you’re measuring?
- Relatable – semantically-resonant colors help people process information faster
Size
- The bigger the object, the bolder it looks
- Bold shapes and colors might work well with bar charts and area charts, but they may also look gaudy when used in a different chart like a treemap
- Use size to draw emphasis to your key message, not obscure it
- Line and bar charts – if the difference between data points is very minimal or very great, size may not always be a good encoding tool, as the visuals may become hard to read
- Map charts – mark size should be based on the range of values on the map
Text
- Readability is essential
- Make the most important information stand out
- Titles – keep them short but powerful. Convey the point, message or story in the fewest words possible
- Labels – too many mark labels can be very distracting. Try labeling the most recent mark, or min/max, and save additional information for tooltips
Dashboard layout
- Purpose is to help guide the reader’s eye through more than one visualization, tell the story of each insight, and reveal how they’re connected
- Guide the user
- Rule of three – don’t make a lot of important info compete for attention.
- Tell a story – connect different visualizations with story points
Map Evaluation Guidelines
Week 4 – Data Engineering
Data Governance Literacy
What is Data Governance Literacy?
- Data Literacy is the ability to read, understand, create, and communicate data as information
- Data Governance is the orchestration of people, processes, and technology to manage the company’s critical data assets by using roles, responsibilities, policies, and procedures to ensure the data is accurate, consistent, secure, and aligns with the overall company objectives
- Data governance literacy is the ability to define data and the parameters of data quality, understand the provenance of the data, and champion protection to serve as a catalyst in a data-driven organization
Where does it fit in?
- Data governance literacy intends to help your organization either become data-driven or maintain its data-driven status
  o Data governance literacy is about the greater good
- Data literacy intends to increase self-service.
  o Data literacy is (usually) about improving individual skills for the greater good
What’s the Goal?
- The goal of data governance literacy is to ensure that your data owners, data stewards, and others who work in data governance have the skills they need to define data, set parameters, understand lineage (and its implications), and champion protection.
How to Get Started?
- Assess. What do you have that you can reuse? Start with content from your data catalog or data literacy programs
  o Check what you already have to support your data governance team
  o Look at your data literacy program
  o Keep in mind you can create short, simple content to put on your learning management system so that, as projects add data stewards, they get trained right away
- Go to the experts. Ask your data stewards what resources would be helpful to them in their journey
  o Data stewards already doing the job know what it took to get skilled
  o New data stewards have perspectives as to what they need to learn to be successful
  o The data owner likely has insight on how to be successful
  o Look for training resource groups in your organization
- Start a backlog.
Keeping track of the content you want to build for your data governance literacy program should be part of your overall data governance strategy
  o Begin a checklist for what you want to build
  o Break it out into smaller, iterative work efforts & review with executive stakeholders
  o Add the backlog to your project for prioritization
Data Governance Literacy should help train and upskill the key roles in data governance. Preparing them to define, manage, and communicate about the data is a step often above and beyond for data governance but will improve the likelihood of success.
The Pillars of Data Governance
Four Operational Pillars:
1. Increasing data usage
2. Improving data quality
3. Identifying data lineage
4. Ensuring data protection
For far too long data governance has been focused on risk aversion, or command and control. A more modern way of approaching data governance, and honestly any aspect of data and analytics delivery, is to focus on increasing usage of the assets that our organizations work so hard to create.
Disrupting Data Governance
- Increasing usage – the point is to increase the usage of data so that our business partners and stakeholders can use it to drive decisions
  o Focusing on increasing data usage will drive support for your data governance efforts, and when you get support and bring value, you no longer have to defend every dollar.
- Improving quality – helps create trust between provider and consumer
  o If you want to get support for data governance, focus your efforts on data quality because most organizations struggle with good or even good enough data quality.
- Identifying data lineage – understanding what happens to the data from creation to archival is more than just metadata
  o The ability to do a regression analysis (for example) to isolate what happens to that data in support of data quality remediation is critically important (and yes, requires metadata).
All these things support data governance efforts, but it should not be about metadata alone.
- Ensuring data protection – protecting our data assets is the cost of doing business. Do that well with InfoSec and Compliance
  o When you do any kind of data management function you are responsible for that data. But the reality is the accountability for the protection of data and the associated data assets should clearly fall under your Privacy, Information Security, and/or Compliance teams.
  o Data governance should support this effort by helping to deploy rules about the data to ensure that the actionable aspects of a policy or procedure exist within the data repository. But we are not the ones that should be driving the creation of the policies or the procedures.
How to Operationalize
- You must deliver something in each of these four areas to be doing data governance well.
- Operationalizing means applying the rules that you have created through data governance functions (whether definitions or data quality parameters) in a repository that supports automated functions for monitoring (among other things), so that it is a repeatable mechanism.
- Just creating definitions or just creating data quality parameters without actually encoding them into your repositories is not - I repeat not - data governance
Week 6 - Spatial Analysis, Spatial Joins, Site Suitability Models
The Language of Spatial Analysis
An introduction to spatial analysis
- The first step in spatial analysis is understanding where
- Second development phase: navigation
- Third step in cognitive development is understanding spatial relationships and patterns
Definition: Spatial analysis is how we understand our world: mapping where things are, how they relate, what it all means, and what actions to take.
The language of spatial analysis
The six categories of spatial analysis
- Understanding where
- Measuring size, shape, and distribution
- Determining how places are related
- Finding the best locations and paths
- Detecting and quantifying patterns
- Making predictions
Understanding where
Understanding where includes geocoding your data, putting it on a map, and symbolizing it in ways that can help you visualize and understand your data. Within the taxonomy of spatial analysis, the first category of understanding where contains three types of questions.
TYPES
1. Understanding where things are (location maps)
   a. The map is often the visualization medium, and the human brain does the analysis.
2. Understanding where the variations and patterns in values are (comparative maps)
   a. With comparative mapping you can visualize the highs and lows and their distribution across space. These comparative analyses can be accomplished using historical, current, and even real-time analytical maps.
3. Understanding where and when things change
   a. Mapping the changing conditions in a place over time, such as loss of vegetation, can help us anticipate future conditions and implement policies that will positively impact our world.
Measuring size, shape, and distribution
When there are multiple objects, the set of objects takes on additional properties, including extent, central tendency, and other characteristics that collectively define the distribution of the entire dataset. The process of measuring and describing these characteristics constitutes the second category of spatial analysis questions.
TYPES
4. Calculating individual feature geometries
5. Calculating geometries and distributions of feature collections
Determining how places are related
Answering spatial questions often requires not only an understanding of context (understanding where), but also an understanding of the relationships between features. Take any two objects: How are they related in space? How are they related in time?
These relationships in space and time include associations such as proximity, coincidence, intersection, overlap, visibility, and accessibility. Determining how places are related includes a set of questions that help describe and quantify the relationships between two or more features.
TYPES
6. Determining what is nearby or coincident
7. Determining and summarizing what is within one or more areas
8. Determining what is closest
9. Determining what is visible from one or more given locations
10. Determining overlapping relationships in space and time
Finding the best locations and paths
A very common type of spatial analysis, and probably the one you are most familiar with, is optimization and finding the best of something. You might be looking for the best route to travel, the best path to ride a bicycle, the best corridor to build a pipeline, or the best location to site a new store. Using multiple input variables or a set of decision criteria for finding the best locations and paths can help you make more informed decisions using your spatial data.
TYPES
11. Finding the best locations that satisfy a set of criteria
12. Finding the best allocation of resources to geographic areas
13. Finding the best route, path, or flow along a network
14. Finding the best route, path, or corridor across open terrain
15. Finding the best supply locations given known demand and a travel network
Detecting and quantifying patterns
In the fifth category of the spatial analysis taxonomy, the keyword is patterns. These spatial analysis questions go beyond visualization and human interpretation of data (from the understanding where category) to mathematically detecting and quantifying patterns in data. For example, spatial statistics can be used to find hot spots and outliers; data mining techniques can be used to find natural data clusters; and both approaches can be used to analyze changes in patterns over time.
TYPES
16. Where are the significant hot spots, anomalies, and outliers?
17. What are the local, regional, and global spatial trends?
18. Which features/pixels are similar, and how can they be grouped together?
19. Are spatial patterns changing over time?
Making predictions
The last category of the taxonomy includes those questions that use powerful modeling techniques to make predictions and aid understanding. These techniques can be used to predict and interpolate data values between sample points, find the factors related to complex phenomena, and make predictions in the future or over new geographies. Many specialized modeling approaches also build on the physical, economic, and social sciences to predict how objects will interact, flow, and disperse. Despite their differences, all these questions share the same principles: they are used to predict behavior and outcomes and to help us better understand our world.
TYPES
20. Given a success case, identifying, ranking, and predicting similar locations
21. Finding the factors that explain observed spatial patterns and making predictions
22. Interpolating a continuous surface and trends from discrete sample observations
23. Predicting how and where objects spatially interact (attraction and decay)
24. Predicting how and where objects affect wave propagation
25. Predicting where phenomena will move, flow, or spread
26. Predicting what-if
The seven steps to successful spatial analysis
1. Ask questions: Formulate hypotheses and spatial questions.
2. Explore the data: Examine the data quality, completeness, and measurement limitations (scale and resolution) to determine the level of analysis and interpretation that can be supported.
3. Analyze and model: Break the problem down into solvable components that can be modeled. Quantify and evaluate the spatial questions.
4. Interpret the results: Evaluate and analyze the results in the context of the question posed, data limitations, accuracy, and other implications.
5. Repeat as necessary: Spatial analysis is a continuous and iterative process that often leads to further questions and refinements.
6. Present the results: The best information and analysis becomes increasingly valuable when it can be effectively presented and shared with a larger audience.
7. Make a decision: Spatial analysis and GIS are used to support the decision-making process. A successful spatial analysis process often leads to the understanding necessary to drive decisions and action.
The benefits of spatial analysis
- Achieve objectives
- Improve program outcomes
- Reduce costs
- Avoid costs
- Increase efficiency and productivity
- Increase revenue
- Assure revenue
- Protect staff and citizens (health and safety)
- Support regulatory compliance
- Improve customer service
- Enhance customer satisfaction
- Enhance competitive advantage
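Worked example (not from the course materials): question type 11 in the taxonomy, "finding the best locations that satisfy a set of criteria", is the idea behind the site suitability models named in the Week 6 title. One common way to approach it is a weighted overlay: exclude sites that fail a hard constraint, rescale each remaining criterion to a 0-1 score, and combine the scores with weights. The sketch below is only an illustration under assumptions. The site names, criteria (slope, distance to road, floodplain), thresholds, and weights are all hypothetical, and a real analysis would normally run on raster or vector layers in a GIS rather than on a short list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    slope_pct: float      # terrain slope in percent (hypothetical criterion)
    dist_road_km: float   # distance to nearest road in km (hypothetical criterion)
    in_floodplain: bool   # hard constraint: floodplain sites are excluded


# Hypothetical criterion weights; they sum to 1 so scores stay in 0..1.
WEIGHTS = {"slope": 0.6, "road": 0.4}


def rescale(value: float, worst: float, best: float) -> float:
    """Linearly map value onto 0..1 (best -> 1, worst -> 0), clamped to that range."""
    score = (value - worst) / (best - worst)
    return max(0.0, min(1.0, score))


def suitability(site: Site) -> float:
    """Weighted-overlay score; a site failing the hard constraint scores 0."""
    if site.in_floodplain:
        return 0.0
    # Assumed thresholds: 0% slope is ideal, 30% or more is unusable.
    slope_score = rescale(site.slope_pct, worst=30.0, best=0.0)
    # Assumed thresholds: 0 km from a road is ideal, 10 km or more is unusable.
    road_score = rescale(site.dist_road_km, worst=10.0, best=0.0)
    return WEIGHTS["slope"] * slope_score + WEIGHTS["road"] * road_score


def rank_sites(sites: list[Site]) -> list[Site]:
    """Return sites ordered from most to least suitable."""
    return sorted(sites, key=suitability, reverse=True)


# Made-up candidate sites, for illustration only.
SITES = [
    Site("A", slope_pct=5.0, dist_road_km=2.0, in_floodplain=False),
    Site("B", slope_pct=0.0, dist_road_km=0.0, in_floodplain=True),
    Site("C", slope_pct=15.0, dist_road_km=1.0, in_floodplain=False),
]

if __name__ == "__main__":
    for site in rank_sites(SITES):
        print(f"{site.name}: {suitability(site):.2f}")
```

Site B has the best raw values but sits in the floodplain, so the hard constraint drops it to the bottom of the ranking. Changing the weights or thresholds changes the ranking, which is why step 4 of the seven steps (interpreting results in light of the question and data limitations) matters for models like this.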