Databases Notes.docx
Notes for DataBases

**Structured data is labeled whereas unstructured is unlabeled**.

Example: **Structured** - labeled - numbers and values

Example: **Unstructured** - unlabeled - audio files, photos

**How is it a New Oil** - everything in a digital world runs on data, and data is extremely valuable. It needs to be collected and managed.

**How is it not like Oil** - Data is infinite, it's all around us. Oil is a raw material.

**Impediments of Conducting Research** 1. 2. 3.

**5 Vs** 1. 2. 3. 4. 5.

**8/29/2024**

**Data is raw; it needs management**

**Metadata** - data about data

**Data Governance:**

**Data Quality**: Ensuring accuracy, completeness, and reliability of data.

**Data Security**: Protecting data from unauthorized access and breaches.

**Data Stewardship**: Assigning roles and responsibilities for managing data.

**Data Policies and Standards**: Establishing rules and guidelines for data handling.

**Compliance**: Adhering to legal and regulatory requirements related to data.

**Principles of Data Management** - Maximize the value of data while minimizing risks

Why's: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.

"Better data is better than better models."

**CSVs -** Comma-separated values

**Codebook/Data Dictionary -** shows the structure and content of the data file, including metadata: source, type, variable levels. The documentation should provide a careful guide so that others can follow each of the steps taken by the analyst. It ensures reproducibility. Write well-commented code.

**Data Cleaning and Screening**

**Data Cleaning:** a set of procedures designed to detect and correct errors in the dataset to ensure data quality and integrity. Check for: errors, duplicates, and inconsistencies

**Data Screening:** identify and designate appropriate identifiers for missing data and determine whether there are extreme values that might affect an analysis.
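A minimal sketch of the cleaning and screening checks described above, using only the Python standard library. The sample data, the column names, the plausible age range of 0 to 120, and the choice of `99` as a missing-value code are all assumptions made up for illustration; they are not from the course material.

```python
import csv
import io

# Hypothetical survey extract; columns and values are invented for this example.
RAW = """id,state,age
1,VA,34
2,va,29
2,va,29
3,MD,99
4,MD,212
"""

MISSING_CODES = {"99"}  # assumption: 99 is the code for a missing answer


def clean_and_screen(text):
    seen = set()
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # Cleaning: fix inconsistent capitalization in a categorical column.
        row["state"] = row["state"].strip().upper()
        # Cleaning: drop exact duplicate records.
        key = tuple(row.values())
        if key in seen:
            continue
        seen.add(key)
        # Screening: replace the missing-value code with an explicit None.
        age = None if row["age"] in MISSING_CODES else int(row["age"])
        row["age"] = age
        # Screening: flag extreme values (assumed plausible range: 0 to 120).
        row["extreme"] = age is not None and not (0 <= age <= 120)
        rows.append(row)
    return rows


rows = clean_and_screen(RAW)
for r in rows:
    print(r)
```

The flagged rows are kept rather than deleted, so the analyst can decide later how to treat them.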
Example: 99 could be *Refused to Answer* in a survey

Check for: missing values, extreme values

Missing Values:

**MCAR** (missing completely at random): missing by accident

**MAR** (missing at random): missing because of something else we can observe / on purpose

**MNAR** (missing not at random): missing because of the questions themselves or the patients themselves

**9/3/2024**

**DE - Data Engineering**

**Data Wrangling: Discover, Clean, Normalize, Enrich**

1. a. b. c. 2. d. 3. e. f.

1. 2. 3. 4.

Issues with: Acquiring, Provisioning, and Maintaining

**Cloud computing model**

Infrastructure as software 1. 2.

Software solutions

**8/10/2024**

**Ingestion**

**Amazon DMS** - -

**Amazon Kinesis** - - - - - - - - - - - - - - - - - - - - - - - -

**AWS Snow Family** - - -

**Transformation**

**AWS Lambda** - - -

**AWS Glue** - - - - -

**9/17/2024**

**C)**

**B)**

**A)**

**Can't rename a bucket, no capitalization, no underscores, no symbols**

**IA - Infrequent Access**

**Eby - Ebonizer (Dog's name)**

**Considerations for Serving Data** 1. a. b. c. d. 2. e. 3. f. 4. g. 5. h. i. j.

**Principles of Orchestration** 1. 2. 3. 4. 5. 6. 7. 8. 9.

**Data governance:**

Exam question: What does PII stand for? Personally identifiable information, such as SSN, IP address, medical info, etc.

The principle of least privilege means giving a user enough access to do their job and no more.

Ways to secure data:

Anonymizing data: removing PII from the data
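A minimal sketch of anonymizing a record by removing PII, as described above. The field names (`ssn`, `ip_address`, `medical_info`) and the sample record are hypothetical, chosen to match the PII examples in these notes. Dropping direct identifiers like this is only a first step; the remaining fields may still need review.

```python
# Hypothetical field names, based on the PII examples in the notes.
PII_FIELDS = {"ssn", "ip_address", "medical_info"}


def anonymize(record, pii_fields=PII_FIELDS):
    """Return a copy of the record with the PII fields removed."""
    return {k: v for k, v in record.items() if k not in pii_fields}


# Made-up record for illustration only.
record = {
    "ssn": "000-00-0000",
    "ip_address": "192.0.2.1",
    "visits": 3,
    "state": "VA",
}
safe = anonymize(record)
print(safe)
```

The function returns a new dictionary and leaves the original record untouched, so the raw data can stay in a restricted store while only the anonymized copy is shared.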