Databases Notes.docx

Full Transcript

Notes for DataBases

**Structured data is labeled whereas unstructured is unlabeled**.

Example: **Structured** - labeled - numbers and values

Example: **Unstructured** - unlabeled - audio files, photos

**How is it a New Oil** - everything in a digital world runs on data, and data is extremely valuable. It needs to be collected and managed.

**How is it not like Oil** - Data is infinite; it's all around us. Oil is a raw material.

**Impediments of Conducting Research**

1. 2. 3.

**5 Vs**

1. 2. 3. 4. 5.

**8/29/2024**

![](media/image29.png)

**Data is raw; it needs management**

![](media/image62.png)

**Metadata** - data about data

**Data Governance:**

- **Data Quality**: Ensuring accuracy, completeness, and reliability of data.
- **Data Security**: Protecting data from unauthorized access and breaches.
- **Data Stewardship**: Assigning roles and responsibilities for managing data.
- **Data Policies and Standards**: Establishing rules and guidelines for data handling.
- **Compliance**: Adhering to legal and regulatory requirements related to data.

**Principles of Data Management** - Maximize the value of data while minimizing risks

Why's:

1. 2. 3. 4. 5. 6. 7. 8. 9. 10.

"Better data is better than better models."

**CSVs -** Comma-separated values

**Codebook/Data Dictionary -** shows the structure and content of the data file, including metadata: source, type, variable levels

![](media/image53.png)

The documentation should provide a careful guide so that others can follow each of the steps taken by the analyst. It ensures reproducibility. Write well-commented code.

**Data Cleaning and Screening**

**Data Cleaning:** a set of procedures designed to detect and correct errors in the dataset to ensure data quality and integrity. Check for: errors, duplicates, and inconsistencies

**Data Screening:** identify and designate appropriate identifiers for missing data and determine whether there are extreme values that might affect an analysis.
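The cleaning and screening checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names, the use of 99 as a missing-data code, and the "more than 5 median absolute deviations" cutoff for extreme values are all made up for the example and are not from the course material.

```python
from statistics import median

# Hypothetical survey records; field names and values are invented for illustration.
records = [
    {"id": 1, "state": "NY", "age": 34},
    {"id": 2, "state": "ny ", "age": 99},   # 99 = assumed "Refused to Answer" code
    {"id": 2, "state": "ny ", "age": 99},   # exact duplicate row
    {"id": 3, "state": "CA", "age": None},  # missing value
    {"id": 4, "state": "CA", "age": 41},
    {"id": 5, "state": "TX", "age": 38},
    {"id": 6, "state": "TX", "age": 430},   # likely a data-entry error
]

MISSING_CODES = {99}  # assumption: the codebook declares 99 as a missing-data code


def clean(rows):
    """Data cleaning: normalize inconsistent text and drop exact duplicates."""
    seen, out = set(), []
    for row in rows:
        fixed = dict(row, state=row["state"].strip().upper())
        key = tuple(sorted(fixed.items(), key=lambda kv: kv[0]))
        if key not in seen:
            seen.add(key)
            out.append(fixed)
    return out


def is_missing(value):
    """A value counts as missing if it is absent or a declared missing code."""
    return value is None or value in MISSING_CODES


def screen(rows, field, cutoff=5):
    """Data screening: return ids with missing values and ids with extreme values."""
    missing = [r["id"] for r in rows if is_missing(r[field])]
    valid = [r[field] for r in rows if not is_missing(r[field])]
    med = median(valid)
    mad = median(abs(v - med) for v in valid)  # median absolute deviation
    extreme = [
        r["id"] for r in rows
        if not is_missing(r[field]) and mad and abs(r[field] - med) / mad > cutoff
    ]
    return missing, extreme


cleaned = clean(records)
missing, extreme = screen(cleaned, "age")
print(len(cleaned), missing, extreme)
```

With this sample data, cleaning leaves six rows, screening flags ids 2 and 3 as missing, and id 6 as extreme. Treating 99 as missing before looking for extreme values matters: otherwise the code itself would distort the median.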
In a survey, 99 could be the code for *Refused to Answer*.

Check for: missing values, extreme values

Missing Values:

![](media/image73.png)

- **MCAR** (missing completely at random): missing by accident
- **MAR** (missing at random): missing because of something else we can observe / on purpose
- **MNAR** (missing not at random): missing because of the questions themselves or the patients themselves

![](media/image20.png)

**9/3/2024**

![](media/image74.png)

**DE - Data Engineering**

![](media/image22.png)

![](media/image18.png)![](media/image54.png)

**Data Wrangling: Discover, Clean, Normalize, Enrich**

![](media/image33.png)

![](media/image28.png)

![](media/image19.png)![](media/image21.png)

1. a. b. c. 2. d. 3. e. f.

1. 2. 3. 4.

Issues with: Acquiring, Provisioning, and Maintaining

**Cloud computing model**

Infrastructure as software

1. 2.

Software solutions

**8/10/2024**

**Ingestion**

![](media/image10.png)

![](media/image47.png)![](media/image56.png)

**Amazon DMS** - -

**Amazon Kinesis** - - - - - - - - - - - - - - - - - - - - - - - -

![](media/image70.png)

**AWS Snow Family** - - -

**Transformation**

**AWS Lambda** - - -

![](media/image69.png)

**AWS Glue** - - - - -

![](media/image63.png)

![](media/image7.png)

**9/17/2024**

![](media/image31.png)

**C)**

![](media/image23.png)

**B)**

**A)**

![](media/image8.png)

**Can't rename a bucket; bucket names allow no capital letters, no underscores, and no symbols (hyphens and periods are allowed)**

![](media/image43.png)

**IA - Infrequent Access**

**Eby - Ebonizer (Dog's name)**

![](media/image42.png)

**Considerations for Serving Data**

1. a. b. c. d. 2. e. 3. f. 4. g. 5. h. i. j.

![](media/image25.png)![](media/image49.png)

**Principles of Orchestration**

1. 2. 3. 4. 5. 6. 7. 8. 9.

![](media/image51.png)

![](media/image32.png)

**Data governance:**

![](media/image40.png)

Exam question: What does PII stand for? Personally identifiable information, such as SSN, IP address, medical info, etc.

The principle of least privilege is giving a user enough access to do their job and no more.
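As a rough illustration of least privilege, the sketch below grants each role only the actions it needs and denies everything else by default. The role names, action strings, and the `is_allowed` helper are hypothetical and chosen only for this example; they do not correspond to any particular cloud service's permission model.

```python
# Hypothetical roles and the minimum actions each one needs to do its job.
ROLE_PERMISSIONS = {
    "analyst": {"read:reports"},
    "engineer": {"read:reports", "read:raw", "write:raw"},
}


def is_allowed(role, action):
    """Deny by default: allow only actions explicitly granted to the role."""
    return action in ROLE_PERMISSIONS.get(role, set())


print(is_allowed("analyst", "read:reports"))  # granted to analysts
print(is_allowed("analyst", "write:raw"))     # not granted, so denied
print(is_allowed("intern", "read:reports"))   # unknown role, so denied
```

The design choice worth noticing is the default: a role or action that is not listed gets no access, so adding access is always a deliberate step.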
Ways to secure data:

![](media/image48.png)

Anonymizing data: refers to removing PII

![](media/image16.png)

![](media/image61.png)

![](media/image64.png)

![](media/image14.png)
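A minimal sketch of anonymizing a record by removing PII, assuming the PII fields are known in advance. The field names and the sample record are invented for illustration, loosely following the PII examples in these notes (SSN, IP address, medical info).

```python
# Assumed set of PII field names; a real dataset would list these in its codebook.
PII_FIELDS = {"ssn", "ip_address", "medical_info"}


def anonymize(record):
    """Return a copy of the record with the PII fields removed."""
    return {key: value for key, value in record.items() if key not in PII_FIELDS}


# Hypothetical record with placeholder values.
patient = {
    "ssn": "000-00-0000",
    "ip_address": "192.0.2.10",
    "medical_info": "example note",
    "visit_count": 3,
    "region": "Northeast",
}

anonymized = anonymize(patient)
print(anonymized)
```

The original record is left unchanged, and only the non-PII fields (`visit_count` and `region` here) remain in the copy.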
