Augmented Data Management for Public Data Exploration
Lecture 1. Introduction to Data Analytics and ADM
1. Augmented data management
1 What is Data Management and When is it Augmented?
Data analytics deals with extracting information from data and using it to predict trends and behavioral patterns.
Extract information from data (Data Management). Data are facts and have the lowest level of abstraction. E.g. house location: 5 km; mode of transport: bike. → augmented: sentiment: I am happy; mode of transport: bike. Information is data with context added. E.g. my house is 5 km from work and I bike to work → augmented: I am happy and I bike to work. Techniques: data mining, parsing, etc.
Predict trends and behaviour (Data Analytics). Understanding what happens in the future. E.g. people who live about 5 km from work bike to work → augmented: people who bike to work are happy. Techniques: econometrics, ML
What is “augmented”? Incorporates novel techniques(e.g., web scraping, APIs, LLMs) to work with messy, real-world data—text, images, reviews, posts. Enables decisions, predictions, and insights in new domains like health, governance, and media.
2 Data Management and Analytics Process
Objective > Data Prep > Analytics > Validation
Define objective Interpret a business problem as an analytical need – Need for deep domain knowledge Selecting the tool and overall approach appropriate for solving the problem: Do you even need a predictive analytical solution?
Data collection and management Understand the data –what, where, and how Data collection and cleaning Creating appropriate data structures – from unstructured to structured
Data model and analytics Statistical Modeling - econometric models and ML models
Validation Validity and accuracy Analyze results
2. ADM and analytics process
1 The Wild Web of Data Analytics
Health, Media, Government, Agriculture
2 Data Management Considerations for Organizations – Big data
Big data management - when traditional data processing methods cannot handle the volume, velocity, variety and veracity of data
Volume - Petabytes (10^15). Do you at least have enough memory to open it? Autonomous vehicles – several TB generated each day by each vehicle. Googol – 10^100, Googolplex = 10^Googol
Velocity – Real time needs, sensors Autonomous vehicles – response within a fraction of a second
Variety – Many forms and formats Autonomous vehicles – radar, lidar and camera
Veracity – Is it accurate? Autonomous vehicles – billboard image of a person walking vs. a person walking on to the road
➔ Rule of thumb – you are dealing with big data, when the data is too much and too quick to be stored and processed at a later time and at the same time requires complex processing because of its variety and veracity
3 Data Management Considerations – Social Media Augmentation
Social media analytics involves extracting information from semi-structured and unstructured social media data to predict trends and behavioral patterns
Text analytics Reviews and opinions Techniques: Sentiment analysis, tone analysis
Speech analytics Speech recognition: ~1971 to Siri and Alexa – Are they listening to us? Techniques: Voice identification, tone analysis
Image analytics Image recognition Techniques: Object identification, collision detection ImageNet project – Fei-Fei Li in 2006 - 48,000 MTurk workers across 167 countries to produce a labelled set of a billion images - Neural network – 24 m nodes, 140 m parameters, 15 b connections - Current state of image recognition: 2,000 categories, roughly the level of a 3-year-old child
1 Data Collection and Integration
Convert unstructured data to structured data Structured – tabular, predictable Unstructured – unpredictable, often visual Create data structures from the collected data
2 Web Data Collection – Structured vs. Unstructured Data
Unstructured – unpredictable, often visual. Data Collection Approach – scraping. Structured – tabular, predictable. Data Collection Approach – Web APIs
3 Structured Web Data Collection – Web APIs
Web application programming interface (API) When there is a need for applications to interact with the website and use the data Type of consumer - User (User Interface) vs. Application (Application programming interface)
Access the web API service by making HTTP requests to the specific API URLs. Instead of HTML pages, web APIs provide data in a more structured format that is easier for programs to consume, such as JSON and XML
JSON (JavaScript Object Notation) is especially well suited for data exchange and is commonly used in APIs: easy for humans to read and write; easy for machines to parse and generate
4 Unstructured Data – Web Scraping
While web pages are easy for humans to read, they are quite unstructured. Surrounded by ads and extraneous content, extracting their data can get a little complicated. E.g. http://www.tilburguniversity.edu/
Web crawling Web crawler or spider – crawls through the internet starting from a set of URLs, searches for new URLs and gets the data. Used for archiving or indexing
Search engine bots are web crawlers (e.g. googlebot, bingbot) Step 1: Crawling: Finding out what pages exist on the web Step 2: Indexing: Understand what the page is about Step 3: Serving and ranking: Find the relevant answer from its index based on many factors
Web scraping Crawl the webpage to find the right location where the data is located Scraping – get some specific information from a website, which may be in HTML/CSS or some other format
1 Prediction Models – AI vs. ML
Artificial Intelligence: Branch of science that revolves around developing machines that mimic human intelligence
2 ML vs. Econometrics
ML – Prediction ML is a subset of AI techniques that allows machines to learn from data for specific prediction tasks Objective: To predict an outcome; 𝑦̂ given a set of predictors 𝑥 (supervised) Example: Predict the future income of students
Econometric Approaches – Estimation Revolve around parameter estimation: produce good estimates of β that explain the relationship between x and y Example: Understand how university education impacts future wages Objective: Identify the effect size β̂ (e.g., y = β̂x + e)
ML (as of now) belongs in the part of the toolbox marked ŷ rather than in the β̂ compartment
3 Prediction Models
Supervised Learning Start with a pre-classified set of data (a sample of data for which we know the actual outcomes) – training data Depending on the outcome we can have Regression (continuous outcomes) or Classification (discrete/categorical outcomes) Example 1 Regression: Predict the future income of students Example 2 Classification: sentiment analysis, where a labelled data set of sentences with positive and negative sentiment is available
Un-supervised Learning No-labelled data available Find patterns in the data Clustering (find different groups in data) and association rule mining (find different associations)
Where do we use unsupervised learning? Example 1: Find different groups or clusters – are there different learning styles in students? Example 2: Market basket analysis – if you buy baby diapers on Friday there is a high likelihood that you will buy beer
4 Estimation Models – Linear Regression
Relationship between the dependent variable (Y) and independent variables (X) is best described using a straight line It is represented by the equation y = α + β·x + e, where α is the intercept, β is the slope of the line and e is the error term Example: avg. income = 150 + 250 × university grades + e
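A minimal sketch of fitting such a line in Python with statsmodels; the grades and the noise level below are made-up numbers for illustration, not course data.

import numpy as np
import statsmodels.api as sm

grades = np.array([6.0, 6.5, 7.0, 7.5, 8.0, 9.0])          # university grades (x)
income = 150 + 250 * grades + np.random.normal(0, 40, 6)   # simulated avg. income (y)

X = sm.add_constant(grades)        # adds the intercept (alpha) column
model = sm.OLS(income, X).fit()    # OLS picks alpha and beta that minimize squared errors
print(model.params)                # [alpha_hat, beta_hat], close to 150 and 250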
Question: What analytics approach is suitable for the following tasks ? (a) Identify how much effect age has on dementia (b) Identify which factors cause dementia (c) Identify possible lifestyle patterns of people who get dementia (a) regression (b) supervised learning (c) unsupervised learning
1 Issues With Prediction Models
Overfitting – the curse of high dimensionality Model relying on idiosyncrasies in the training data Difficult to generalize: the model too closely follows the sample data and fails at predicting future observations Ensemble techniques, bootstrapping can reduce overfitting
Consistency The estimates from complex prediction models are rarely consistent Small changes to the data create large changes in the prediction model One of the big limitations of decision trees, NN and other ‘path’ based algorithms
2 Validation of Prediction Models
Measuring Performance How complete is the model? How accurate is the model? Goodness of fit – Is the model complete? How much of the variance is explained by the model Metrics: R-square, Adjusted R-square, error term Precision scores – Is the prediction accurate? Here you divide your data set into two groups (train and test) Metrics: Accuracy, recall, F1 scores
3 Evaluation Metrics (Classification Models)
Confusion Matrix: provides a more complete picture of how a classification model performs and can help in understanding the trade-off between different types of errors.
Confusion matrix layout (rows: predicted labels; columns: actual labels):
Predicted Positive, Actual Positive → TP (true positive)
Predicted Positive, Actual Negative → FP (false positive, Type 1 error)
Predicted Negative, Actual Positive → FN (false negative, Type 2 error)
Predicted Negative, Actual Negative → TN (true negative)
Accuracy Objective: Maximize the number of correct (true) predictions. Accuracy = total correct predictions / total predictions = (TP + TN) / (TP + TN + FP + FN)
Precision Objective: Maximize the ability to correctly predict positive labels (correctly diagnose the disease). Precision = total correct positives / total predicted positives = TP / (TP + FP)
Recall Objective: Maximize the ability to correctly recall positive labels. Recall = total correct positives / total positive labels = TP / (TP + FN)
F score Objective: Harmonic mean of precision and recall. F1 = 2 × precision × recall / (precision + recall)
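A minimal sketch that computes these four metrics from hypothetical confusion-matrix counts (the numbers are invented for illustration):

TP, FP, FN, TN = 40, 10, 5, 45   # hypothetical counts from a confusion matrix

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)                   # correct positives among predicted positives
recall    = TP / (TP + FN)                   # correct positives among actual positives
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)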
4 Is the model good or bad ?
In the game of roulette, a ball is dropped on a spinning wheel and eventually lands in one of 38 slots. Using visual features (the spin of the ball, the position of the wheel when the ball was dropped, the height of the ball over the wheel), an ML model can predict the slot that the ball will land in with an accuracy of 4%. This ML model is making predictions far better than chance; a random guess would be correct 1/38 of the time—yielding an accuracy of 2.6%. Although the model's accuracy is "only" 4%, the benefits of success far outweigh the disadvantages of failure
An expensive robotic chicken crosses a very busy road a thousand times per day. An ML model evaluates traffic patterns and predicts when this chicken can safely cross the street with an accuracy of 99.99%. A 99.99% accuracy value on a very busy road strongly suggests that the ML model is far better than chance. In some settings, however, the cost of making even a small number of mistakes is still too high. 99.99% accuracy means that the expensive chicken will need to be replaced, on average, every 10 days
A sentiment analysis model predicts the sentiment of a sentence on a scale from −5 to +5. The accuracy of the model is 50%. Accuracy is a poor metric here; the model may actually be pretty good. Since the outcome can take about 10 values, random chance is roughly 10%. What do you do if the outcome is continuous?
When the outcome is continuous (regression models): Loss and MSE E.g. a prediction model for stock prices (stock prices are continuous) Minimize loss in prediction - loss is the penalty for a bad prediction Mean Square Error: average squared loss per example over the whole dataset MSE = (1/N) Σ (y − f̂(x))², where y is the actual outcome, x is the set of features, f̂(x) is the prediction of the outcome based on the given x, and N is the number of examples
Which model is better? 𝑀𝑆𝐸𝑎=0.4 𝑀𝑆𝐸𝑏=0.8
Besides reducing your absolute MSE, reduce your MSE relative to your label values. You're predicting prices of two stocks that have mean prices of 5 and 100. In both cases, the MSE is 5. In the first case the MSE is 100% of your mean price, which is clearly a large error. In the second case, the MSE is 5% of your mean price, which is a reasonable error
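A small sketch of both ideas, absolute and relative MSE, using invented prices:

import numpy as np

y    = np.array([5.2, 4.8, 5.1, 4.9])   # actual prices (mean around 5)
yhat = np.array([5.0, 5.0, 5.3, 4.6])   # model predictions

mse = np.mean((y - yhat) ** 2)           # average squared loss per example
print(mse)                               # absolute MSE
print(mse / np.mean(y))                  # MSE relative to the mean label value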
1 Causality and Experimental design
Correlation vs. Causation Correlation: Existence of the relationship (Describing a phenomenon) Causality: Causation indicates that one event is the result of the occurrence of the other event (Explain the phenomenon)
Experiments can identify causality
Experiments can explore causality and are quite robust (e.g. drug testing) Randomly allocating participants to the treatment and control groups Can you think of an experiment to study the influence of university education on wages?
Experiments can be time consuming, hard and sometimes cross ethical boundaries What do you do when you can not conduct an experiment?
2 Identification Strategy
Observational (secondary and primary) data Grade and wage data of all the past students at Tilburg University However, there may be many confounding factors. E.g. parents income Can you control (keep constant) all the confounding factors?
Identification Strategy The manner in which observational data approximates a real experiment Is there a natural experiment? Policy change (10-point grading to pass/fail)
3. Quick Recap
Predictive analytics Answered the question - what is predictive analytics? Studied the new considerations for predictive analytics – Big data and social media
Predictive analytics process Learnt the predictive analytics process which includes- Define objective; Data preparation; Modelling and analytics; Validation and deployment
Tools for predictive analytics What is a programming language? What are the different programming concepts?
Lecture 2. Web Data Collection Process
1. Understanding a Web Browser
1 Client-Server Model
Data exchange through the world wide web involves two entities – client and server Client: any user/application that wants to access a web resource like content from a webpage Webserver: computer/data center which stores data (webpages, files or other resources) that a client can access - Each webpage/resource is located in some webserver or other - More than 13 million web servers (in 1993 there were ~500) - Google has 900,000 web servers Web browser – the application - Software application for retrieving, presenting and traversing information resources on the World Wide Web - The first browser, Nexus, appeared in 1990
2 Client-Server Model – Key Terms
URL – Uniform Resource Locator A way to uniquely represent a server and a resource on that server E.g. https://www.tilburguniversity.edu/campus - https (protocol); tilburguniversity.edu (domain); /campus (resource)
HTTP – Hyper Text Transfer Protocol A specification (protocol) for web clients and servers to interchange requests and responses Hypertext - cross-referencing between related sections of text and associated graphic material (i.e. text + links)
HTML – Hyper Text Markup Language A language used for creating web pages Markup languages are designed for the processing, definition and presentation of text
Client-Server Model
3 What Does a Web Browser Do?
First: The browser creates an HTTP request message using the URL and sends it to the appropriate web server Second: Waits for the HTTP response message from the webserver, which contains the web resource Third: Interprets the response received, which contains the data used to display the website
The response received has three parts Status line – the status of the HTTP request and the response - 2xx: Everything went well, xx gives some additional details - 1xx : additional information, 3xx : redirection, 4xx : client error, 5xx : server error Header information – some metadata which provides information about the response data Message body – Website data in HTML, pictures, video etc.
HTTP Response Packet
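A quick sketch with Python's requests module that shows the three parts of a response; the URL is the course example and any page would do:

import requests

r = requests.get("https://www.tilburguniversity.edu/")
print(r.status_code)                   # status line: 200 = everything went well
print(r.headers.get("Content-Type"))   # header metadata describing the body
print(r.text[:200])                    # start of the message body (HTML)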
2. Unstructured Web Data Collection
1 Structure of web data
Unstructured data Unpredictable, often visual Mixed with images, text, video, audio etc. Sources: Websites, Social media Methods to access: Scraping, parsing
Structured data Tabular, predictable Often with metadata Sources : Databases, files Methods to access : Download in excel, querying, Web API
1 Unstructured Data – Web Scraping
While webpages are easy for humans to read, they are quite unstructured. Surrounded by ads and extraneous content, extracting their data can get a little complicated
Web crawling Web crawler or spider – crawls through the internet starting from a set of URLs, searches for new URLs and gets the data. Used for archiving or indexing (e.g. search engine bots)
Web scraping Crawl the webpage to find the right location where the data is located Scraping – get some specific information from a website, which may be in HTML/CSS or some other format
1 Unstructured web data collection process
Step 1: Understand the webpage and its limitations Are you allowed to collect the data? Check the robots.txt file Step 2: Inspect the webpage data structure Study the HTML page Find the unique “tag” that identifies the data that you need Step 3: Make HTTP request and get the HTML data Get the HTTP response data Step 4: Convert the HTML data into a format that Python can understand and search Parsing: process of converting string data to program readable data structure Step 5: Find and store the data you need Search the parsed HTML data to locate the information that you need (based on step 2)
Step 1: Understand the Webpage and Its Limitations
Read /robots.txt: text file that is used to instruct search engine bots on how to crawl and index website pages - Important for search engine optimization (SEO) Before scraping check if it is allowed in the robots.txt file. Do not abuse or overload web servers Avoid scraping on shared public IP addresses
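Python's standard library can read robots.txt for you; a minimal sketch (the page checked is just an example):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.tilburguniversity.edu/robots.txt")
rp.read()
# True if a generic crawler ("*") may fetch this page according to robots.txt
print(rp.can_fetch("*", "https://www.tilburguniversity.edu/campus"))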
Step 2: Inspect the webpage data
Web Content vs. HTML/CSS Code
Step 3: Make HTTP request and get the HTML data
Make a HTTP request and get the response data Get the HTML data from the message body Python Module: requests
Step 4: Convert the HTML data into a format that Python can understand
Find the info you need by locating the unique HTML Tag, and Attribute Python module: BeautifulSoup
Step 5: Find the data
Find and store the data you need Search the parsed HTML data to locate the information that you need (based on step 2)
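A minimal sketch of Steps 3-5 with requests and BeautifulSoup; the tag name and class used in Step 5 are placeholders - the real ones come from inspecting the page in Step 2:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.tilburguniversity.edu/")   # Step 3: HTTP request
soup = BeautifulSoup(response.text, "html.parser")              # Step 4: parse the HTML
items = soup.find_all("h2", class_="news-title")                # Step 5: hypothetical tag/class
for item in items:
    print(item.get_text(strip=True))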
3. Structured Web Data Collection
1 Structured Data – Web APIs
Web application programming interface (API) When there is a need for applications to interact with the website and use the data Type of consumer - User (User Interface) vs. Application (Application programming interface)
Access the web API service by making HTTP requests to specific API URLs Web APIs provide data in a more structured format that is easier for programs to consume, such as JSON and XML
JSON (JavaScript Object Notation) is especially well suited for data exchange and is commonly used in APIs: easy for humans to read and write; easy for machines to parse and generate HTML -> Webpage/Scraping ; JSON -> Web API
2 Structured Data – JSON
Webpage how it looks for us vs. JSON data that is accessed through web APIs (Structured)
1 Data Collection Using REST API
REST API (Representational State Transfer) With a REST API, you would typically gather the data by accessing multiple endpoints (URLs) Uses URLs to represent resources Uses HTTP methods: GET → fetch data, POST → create data, PUT → update data Returns data in JSON format Simple, widely used, but can return too much or too little data E.g. On GitHub, for IBM get (a) the location of the organization (b) all project descriptions
Example: (a) location of the organization Endpoint 1 (Get organization information): api.github.com/orgs/ibm
Example: (b) project (repo)descriptions Endpoint 2 (Get project/repo information): api.github.com/orgs/ibm/repos
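A minimal sketch of both calls with requests; the endpoints are the ones above, and unauthenticated requests are subject to GitHub's rate limits:

import requests

org = requests.get("https://api.github.com/orgs/ibm").json()          # endpoint 1
print(org["location"])                                                # (a) organization location

repos = requests.get("https://api.github.com/orgs/ibm/repos").json()  # endpoint 2
for repo in repos:                                                    # (b) repo descriptions
    print(repo["name"], "-", repo["description"])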
2 Facebook issues with REST and the development of GraphQL
Facebook's need: Improve the user experience for Facebook users on mobile by building a native mobile application from scratch. “The decision to adopt HTML5 on mobile was the worst decision in Facebook history” – Mark Zuckerberg
Technical solution- new API engine designed to be suitable for mobile Data requirements organized in a tree like structure Design around allowing multiple queries to be served as one Design around single sources of truth
In Oct 2016, GitHub announced a major shift of their public APIs from the old RESTful style to GraphQL. GitHub was the first to build public APIs on GraphQL, although Facebook and other companies had been using it internally for a long time.
3 Data Collection Using GraphQL API
GraphQL was developed to cope with the need for more flexibility and efficiency Uses a query to represent exactly what you need: WYSIWYG Send a single query to the GraphQL server that includes the concrete data requirements Returns data in JSON format A bit more complex than REST but very efficient E.g. On GitHub, for IBM get (a) the location of the organization (b) all project descriptions
Lecture 3. Embedding Techniques
1. REST vs. GraphQL
1 Data Collection Using REST API
REST API (Representational State Transfer) With a REST API, you would typically gather the data by accessing multiple endpoints (URLs) Uses URLs to represent resources Uses HTTP methods: GET → fetch data, POST → create data, PUT → update data Returns data in JSON format Simple, widely used, but can return too much or too little data E.g. On GitHub, for IBM get (a) the location of the organization (b) all project descriptions
Example: (a) location of the organization Endpoint 1 (Get organization information): api.github.com/orgs/ibm
Example: (b) project (repo)descriptions Endpoint 2 (Get project/repo information): api.github.com/orgs/ibm/repos
Issues Underfetching – two endpoints are needed to get the organization and repo information Overfetching – each endpoint is static and returns a lot of information, much of which we do not need
3 Data Collection Using GraphQL API
GraphQL was developed to cope with the need for more flexibility and efficiency Uses a query to represent exactly what you need: WYSIWYG Send a single query to the GraphQL server that includes the concrete data requirements Returns data in JSON format A bit more complex than REST but very efficient E.g. On GitHub, for IBM get (a) the location of the organization (b) all project descriptions
Issues: Complex – need to know the schema to understand the query you need to make
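A sketch of the same request as a single GraphQL query; the field names follow GitHub's public GraphQL schema as I understand it, and the token is a placeholder you would replace with your own:

import requests

query = """
{
  organization(login: "ibm") {
    location
    repositories(first: 10) {
      nodes { description }
    }
  }
}
"""
headers = {"Authorization": "bearer YOUR_TOKEN"}   # placeholder personal access token
r = requests.post("https://api.github.com/graphql",
                  json={"query": query}, headers=headers)
print(r.json())   # only the fields asked for: location + 10 repo descriptions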
1 An Analytical Problem
Predicting an outcome for media content How do you create a relationship between repo descriptions and the number of stars a project has? How do you create a relationship between plot descriptions and their ratings? How do you create a relationship between images and their sentiment?
2 The Analytical Approach
A class of algorithms that are data-driven: unlike "normal" algorithms, it is the data that "tells" what the "good answer" is No requirement of a hardcoded definition of the rating of a movie It can figure out what the rating should be by learning from examples Given some data to begin with, train a model by identifying the relationship that best predicts the outcome
3 Finding x in Text, Images and other ‘Media’ Objects
Embeddings numerical representations of data Convert complex data into vectors of numbers, which are easier for machine learning models to process
Tokenization task of chopping text up into pieces (e.g. words), called tokens In case of sentence tokens could be words (bag of words) In case of images tokens could be pixels
Encoding selecting the right features and determining how to encode them Decision 1: What tokens to use (e.g. one word at a time – unigrams; two words at a time – bigrams) Decision 2: How to encode them? (counts, TF-IDF)
2. Bag-of-Words
1 Tokenization : Bag of Words
Bag of Words model is a way to tokenize text data by: Ignoring grammar and word order, Treating each document as a “bag” of words (i.e., just the words that appear), Counting how often each word appears.
Text-sentiment example My name is Poonacha → Negative Poonacha is bad → Negative The weather is good today → Positive
2 Encoding Using Count Vectorizer
Unique words : [‘my', 'name', 'is', 'Poonacha', 'bad', ‘the', 'weather', 'good'] My name is Poonacha (Negative) → (1, 1, 1, 1, 0, 0, 0, 0) Poonacha is bad (Negative) → (0, 0, 1, 1, 1, 0, 0, 0) The weather is good today (Positive) → (0, 0, 1, 0, 0, 1, 1, 1)
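A minimal sketch with scikit-learn's CountVectorizer on the three example sentences; note that sklearn lowercases the text and will also include "today" in its vocabulary, so its matrix has one extra column compared to the slide (assumes a recent sklearn version):

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["My name is Poonacha",
             "Poonacha is bad",
             "The weather is good today"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                          # one count vector per sentence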
3 Training
Establish relationship between the sentences and its labels through its individual words Example model for sentiment sentence: sentiment = -0.03 * (1 if “poonacha” is present else 0) - 0.003 e.g., If the sentence contains the words ‘poonacha’ or ‘bad’ it is likely that the sentiment of the sentence is negative.
4 Pros and Cons of CountVectorizer
(+) Simplicity: straightforward technique, easy to implement and understand. (+) Interpretability: Easy to interpret the results as they are a matrix representation of word frequencies.
(-) Word semantics: Ignores the meaning of words, which can limit its performance in tasks requiring semantic understanding. e.g., on IMDB the word "hit" has a positive sentiment (-) High dimensionality: The resulting matrix can become extremely high dimensional, leading to computational challenges.
Each word in the corpus (training data) is a dimension (-) Common words get a lot of weight, uncommon words very little weight
1 TF-IDF
TF-IDF balances common and rare words to highlight the most meaningful terms: TF-IDF(w, d) = TF(w, d) × IDF(w)
Term Frequency (TF) Measures how often a word appears in a sentence (called a document). If a word appears frequently in a document, it is likely relevant to the document's content. TF(w, d) = (number of times w occurs in document d) / (total number of words in document d)
Inverse Document Frequency (IDF) Measures how important a term is across all documents. Rare terms get higher scores. IDF(w) = log(N / D(w)), where N = total number of documents and D(w) = number of documents in which word w appears
2 TF-IDF Calculation
[‘my', 'name', 'is', 'Poonacha', 'bad', ‘the', 'weather', 'good'] My name is Poonacha (Negative) → (1, 1, 1, 1, 0, 0, 0, 0) Poonacha is bad (Negative) → (0, 0, 1, 1, 1, 0, 0, 0) The weather is good today (Positive) → (0, 0, 1, 0, 0, 1, 1, 1)
TF for the document “My name is Poonacha” (4 words, each occurring once): TF(my, d1) = TF(name, d1) = TF(is, d1) = TF(poonacha, d1) = 1/4 = 0.25
IDF for the document “My name is Poonacha”: IDF(my) = IDF(name) = ln(3/1) ≈ 1.10; IDF(is) = ln(3/3) = 0; IDF(poonacha) = ln(3/2) ≈ 0.41
TF-IDF for the document “My name is Poonacha”: TF-IDF(my, d1) = TF(my, d1) × IDF(my) = 0.25 × 1.10 ≈ 0.275; TF-IDF(name, d1) = 0.25 × 1.10 ≈ 0.275; TF-IDF(is, d1) = 0.25 × 0 = 0; TF-IDF(poonacha, d1) = 0.25 × 0.41 ≈ 0.101
Count vectorizer: My name is Poonacha = (1, 1, 1, 1, 0, 0, 0, 0); TF-IDF: My name is Poonacha = (0.275, 0.275, 0, 0.101, 0, 0, 0, 0)
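A small sketch that reproduces the calculation above directly from the two formulas (natural log, as in the slides):

import math

docs = [["my", "name", "is", "poonacha"],
        ["poonacha", "is", "bad"],
        ["the", "weather", "is", "good", "today"]]

def tf(word, doc):
    return doc.count(word) / len(doc)                 # term frequency within one document

def idf(word, docs):
    containing = sum(1 for d in docs if word in d)    # number of documents containing the word
    return math.log(len(docs) / containing)

d1 = docs[0]
for w in d1:
    print(w, round(tf(w, d1) * idf(w, docs), 3))      # my/name ~0.275, is 0.0, poonacha ~0.101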
3 Pros and Cons of TF-IDF
(+) Considers both word frequency and document specificity, providing a better representation of the document's content. (+) Dimensionality: It focuses more on the significant terms, hence TF-IDF has a reduced dimensionality → frequently occurring words can get dropped
(-) Word semantics: Ignores the meaning of words, which can limit its performance in tasks requiring semantic understanding. (-) Complexity: Could be a more complex model to implement (-) Word Order: Not ideal to capture the word order or context which may limit its performance in tasks like sentiment analysis.
4 Bag of Words – Pre-processing – Removing Stopwords
Often, as a pre-processing step in the BOW approach, we remove stopwords to reduce the dimensionality Stopwords: words in a language which do not add much meaning to a sentence and can safely be ignored without sacrificing the meaning of the sentence e.g., my, is, the, but, than, …
3. Word2Vec
1 How do we represent the meaning of the word?
E.g., what does the word “pulchritudinous” mean?
Let's see examples of its use: As the sun set behind the mountains, the pulchritudinous view left us all speechless. The actress walked down the red carpet in a pulchritudinous gown that took everyone’s breath away. He was captivated by the pulchritudinous landscape that stretched across the valley. Despite her pulchritudinous appearance, it was her kindness that truly made her stand out.
Distribution similarity “You shall know a word by the company it keeps” - J. R. Firth 1957 “the meaning of a word is its use in the language.” More specifically, you need to look at its surrounding words or context Can we create a vector representation of the word such that it encodes its meaning?
Distributed Representation of Words One of the most successful ideas of NLP :“To represent the meaning of a word (or any object) we need a vector representation of a word that is good at identifying the words that often go with it (or words that are close to it)!”
2 Computational Objective
We need a vector representation of a word so that it is Good at predicting what other words appear in its context Specifically: Words that occur together should have similar vectors
3 Embedding as a Learning Task
The word embedding is learned - Given a set of words predict the likely next word Source of Text > Training Sample > Vocabulary (V) > Learned Embedding (Dimension V X N)
From a simple objective to a miracle With this simple objective the word embeddings can capture deeper analogies from our language: vector(“queen”) − vector(“woman”) + vector(“man”) ≈ vector(“king”)
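A minimal gensim sketch of learning such vectors; the toy corpus is far too small for the king/queen analogy to actually emerge, it only shows the API:

from gensim.models import Word2Vec

corpus = [["the", "queen", "rules", "the", "country"],
          ["the", "king", "rules", "the", "country"],
          ["the", "woman", "walks", "in", "the", "park"],
          ["the", "man", "walks", "in", "the", "park"]]

model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv["queen"][:5])                                    # part of the learned vector
print(model.wv.most_similar(positive=["queen", "man"], negative=["woman"]))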
Lecture 4. Causal Inference - Randomized Experiments & OLS
1. Introduction
1 Causal Inference
the process of using data to figure out whether a change in one thing causes a change in another Using causal inference we can answer many questions! e.g., Does a new vaccine reduce flu cases? Did a marketing campaign cause an increase in sales? Do smaller class sizes improve student grades?
2 Causal Inference vs Machine Learning
ML Main goal: make accurate predictions Key question: What will happen? Focus: Finding patterns in the data Performance Metric: Accuracy, AUC
Causal Inference Main goal: Understand cause-and-effect relationships Key question: Why did this happen? / What would happen if we did something differently? Focus: Estimating the impact of actions or interventions Performance Metric: Credibility of the causal estimate
1 Causal Inference Example
CASE 1) A new strain of the flu is spreading quickly, and a pharmaceutical company has just released a vaccine they claim can reduce infections. Health researchers want to find out: Does the vaccine work? (After the outbreak, is the vaccine effective or not?)
CASE 2) A group of researchers wants to understand whether getting more years of education leads to higher salaries. They notice that individuals with more schooling tend to earn more, but they want to find out: Do more years of education cause higher earnings?
2 Correlation vs. Causation
Observation: Ice cream sales and drownings increase together. Causation? Eating ice cream somehow increases the risk of drowning. Reality: Both happen more often in summer - seasonality drives both.
Observation: More firefighters are present at bigger fires. Causation? Having more firefighters causes fires to become bigger. Reality: Larger fires cause more firefighters to be called.
Observation: Coffee drinkers show higher rates of heart attacks. Wrong thinking: Drinking coffee directly causes heart attacks. Reality: Coffee drinkers might smoke more - smoking is the true risk factor.
Just because two things happen together does not mean one causes the other Correlation measures association, not causality. Without controlling for confounders, or without an experimental design, we cannot conclude that one variable causes changes in another.
3. Methodology to Explore Causal Inference
1 Experimental Design
CASE 1) A new strain of the flu is spreading quickly, and a pharmaceutical company has just released a vaccine they claim can reduce infections. Health researchers want to find out: Does the vaccine actually work?
What do we need to measure? → Infection rates with and without the vaccine (placebo). What data do we need? → Infection outcomes and vaccination status. How do we collect the data?
2 Motivation for Randomized Experiments
Self-selection bias Let people choose whether to get vaccinated? No! Problem: People who choose vaccination may already be more health-conscious, have better access to healthcare, or take fewer risks. Why this matters: If people differ systematically between groups, we can no longer tell whether differences in flu rates are due to the vaccine or to these pre-existing differences (self-selection bias)
Attrition bias Inform people about their assigned condition? No! Problem: Participants may drop out after learning whether they received the vaccine or not. Why this matters: Those disappointed by their assignment (e.g., not getting the vaccine) might be more likely to leave the study. If dropouts are related to health, risk, or other factors, the remaining groups will no longer be comparable, leading to biased estimates of the vaccine’s effect
Confounding Assign based on characteristics (e.g., high-risk individuals)? No! Problem: Groups would differ systematically from the start. For example, high-risk individuals (e.g., older adults or those with chronic conditions) may be more likely to receive the vaccine. Why this matters: If the groups differ from the start, such as one being more vulnerable than the other, we cannot tell whether differences in flu outcomes are due to the vaccine or to these underlying risk differences (confounding)
Randomized Experiments How does this help Avoids self-selection bias: People don’t choose whether they get the vaccine, so personal traits (e.g., health-consciousness) don’t influence group assignment. Protects against attrition bias: If people drop out, dropout is not related to knowing their group or making a personal choice, because they didn’t choose their condition. Reduces confounding: Any differences in risk factors are balanced across groups by random chance, making the groups comparable at baseline.
3 Randomized Experiments – Example
The health researchers wanted to find out if a new flu vaccine could actually reduce flu infections. To test this, they conducted a randomized experiment with 200 volunteers. Each volunteer was randomly assigned
to one of two groups: one group received the new vaccine, and the other did not.
4 Randomized Experiments - Feasibility
Are randomized experiments always feasible?
Impossible to Implement can be impossible to implement because we cannot control nature, history, or genetics Example: You cannot randomly assign people different genes to study whether and which genetics affect health - genes are determined at birth and cannot be manipulated.
Practical Constraints can be expensive, time-consuming, or logistically difficult to run. Example: Testing a new education program across many schools might require resources and coordination that aren't available.
Ethical Concerns It may be unethical to randomly assign treatments that could cause harm or deny people access to something beneficial. e.g., we cannot randomly assign people to smoke to study health effects
What if we need a relationship between variables but we cannot implement a randomized experiment?
1 Regression Example
CASE 2) A group of researchers wants to understand whether getting more years of education leads to higher salaries. They notice that individuals with more schooling tend to earn more, but they want to find out: Do more years of education cause higher earnings?
What data do we need? → Earnings and years of schooling for each individual How do we collect the data? Randomized experiment? → Gather survey or administrative data. How do we analyze the data?
Estimate the correlation between earnings and years of schooling? No! Problem: People who choose to get more education might already be different (more motivated, smarter, wealthier). Why this matters: Higher earnings might not be caused by education itself - they could be due to the pre-existing differences (confounding) Without random assignment, we have to make strong assumptions to believe that differences in education cause differences in earnings.
2 Regression models
Regression models help us control for observable differences: y = β0 + β1 x1 + β2 x2 + ⋯ + βk xk + u, where y = dependent variable (observable); x1, …, xk = independent variables (observable); β1, …, βk = slope parameters (estimated); β0 = intercept parameter (estimated); k = the number of independent variables (i.e., controls); u = error term (unobservable)
→ Future Earnings = β̂0 + β̂1 Education + û, where β̂1 = slope parameter (estimated); β̂0 = intercept parameter (estimated); û = residual
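A minimal sketch of estimating the sample regression function with statsmodels on made-up observational data:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({                               # hypothetical sample
    "education": [10, 12, 12, 14, 16, 16, 18, 21],
    "earnings":  [4.5, 5.4, 6.0, 6.8, 7.9, 8.2, 9.1, 10.5],
})

model = smf.ols("earnings ~ education", data=df).fit()
print(model.params)     # beta0_hat (Intercept) and beta1_hat (education)
print(model.rsquared)   # share of the variation in earnings explained by education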
3 Regression Example
Population Regression Function (PRF) Suppose that we have data on every single individual in the population. If we estimate the relationship between the years of schooling and the future earnings (for the entire population), then we get the PRF (Population Regression Function). Why do we have error terms? Remember that the regression forces a linear relationship BUT reality can be more complicated!
Sample Regression Function (SRF) Suppose that we only have data on a subset of individuals from the population (sample). If we estimate the relationship between years of schooling and future earnings using this sample, then we get the SRF (Sample Regression Function). Why do we have residuals? The residual captures the difference between the observed outcome and the outcome predicted by our sample regression.
Hope: SRF = PRF “on average” or “when n goes to infinity”
y_i: The actual/observed value of y for observation i
ŷ_i: The fitted value of y for observation i (predicted by our sample regression)
ȳ: The sample average of variable y
û_i = y_i − ŷ_i: the residual for observation i
Total sum of squares (total variation in y_i): SST = Σ_{i=1}^{n} (y_i − ȳ)² = SSE + SSR
Explained sum of squares (SSE): the variation in ŷ_i explained by the model, SSE = Σ_{i=1}^{n} (ŷ_i − ȳ)². Residual sum of squares (SSR): the variation in y_i not explained by the model, SSR = Σ_{i=1}^{n} (y_i − ŷ_i)²
SST = SSE + SSR. OLS (Ordinary Least Squares): choose the parameter estimates that minimize SSR
3 R-Squared: A Goodness-of-Fit Measure
How well does x explain y? How well does the OLS regression line fit the data? We may use the fraction of variation in y that is explained by x (or by the SRF) as a measure. R-squared (coefficient of determination): R² = SSE/SST = 1 − SSR/SST. The larger R² is, the better the fit; 0 ≤ R² ≤ 1
4 Regression Example Interpretation
Future Earnings = β̂0 + β̂1 Education + û Slope of 0.54: each additional year of education increases future earnings by $0.54 Intercept of −0.9: the “fitted wage of a person with 0 years of education”? The SRF does poorly at low levels of education Predicted future earnings for a person with 10 years of education: −0.9 + 0.54 × 10 = 4.5 R² = 0.165: 16.5% of the variation in wages is explained by years of education
5 Regression - Causal Assumptions
But can we establish causality? We have to make some important assumptions! → We have made the groups comparable enough to trust the causal story
No omitted variables (Unconfoundedness) After accounting for observable differences, there are no unobserved factors that simultaneously influence both the level of the IV and the DV. e.g., Years of Education ← Parents’ Income → Future Earnings Why it's important: If unobserved confounders exist, the apparent relationship between the independent variable and the outcome might be spurious, leading to a biased estimate of the true effect. Relation to linear regression: If unmeasured confounders are present, the coefficient for the independent variable might not reflect the true causal effect.
No reverse causality The direction of the causal effect is assumed to flow from the IV to the DV, and not the other way around. Income → Health: Having more money allows you access to better healthcare, food, etc. BUT reverse causality is possible because healthy people can work more, be more productive, and earn more! Why it's important: If the outcome influences the independent variable, the estimated relationship will be a mix of both effects, obscuring the true impact of the independent variable on the outcome. Relation to linear regression: If reverse causality is present, the coefficient for the independent variable will capture this bidirectional relationship, not just the effect in the assumed direction.
Overlap (Common Support) For each combination of observed characteristics (e.g., age, background), there are individuals with different levels of the independent variable (e.g., years of education). Suppose you're studying whether playing in the top levels of football or basketball leads to higher salaries. We collect data on athletes’ salaries from European football leagues and the NBA and want to estimate the causal effect of sport choice on earnings. But here’s the problem: - Football players in your data are all based in Europe - NBA players are all based in North America So we are not comparing apples to apples; differences in earnings might reflect regional factors (like league size, taxes, or cost of living), not the sport itself. We don’t have comparable individuals across the groups we are trying to compare. Why it's important: Without overlap, we compare fundamentally different groups, making it hard to isolate the effect of the IV. Relation to linear regression: Linear regression can still produce estimates without overlap, but the causal interpretation of the coefficient becomes weak (since we are comparing individuals who are not truly comparable in their observed characteristics).
➔ If we use linear regression, we also make technical assumptions about how well the model fits the data - like assuming a simple relationship and that we have enough variation. ➔ In this course, we’ll focus mainly on the causal assumptions, because if they don’t hold, no model can save us!
Lecture 5. Advanced Causal Inference Techniques
1. Data Enrichment
1 Concept
Data enrichment: the process of enhancing a dataset by adding new information that helps improve analysis, interpretation, or modelling especially for identifying causal relationships! Derive new variables from existing data Merging the existing data with external data
Example 1
Suppose you work as a data analyst at TripAdvisor. Your manager asks you to investigate what makes certain reviews more helpful than others. Your goal is to understand which characteristics of a review make users more likely to find it useful. How do you go about this?
Step 1: Collect the content and number of helpful votes of each review Time between stay and review Review photo characteristics Review textual content (e.g., length, complexity, sentiment) Look into management’s response
Step 2: Derive new variables using the review content! Step 3: Run your model
Example 2
Suppose you are working on Funda, and your goal is to understand which factors influence the listing price of homes in the NL. You know that you can collect standard property features, such as square meters, number of bedrooms, and year built. How do you proceed?
Step 1: Collect the listing price and standard property features of each home Step 2: Merge the existing data with external data Step 3: Run your model
You suspect that location-based and neighborhood characteristics also play a big role To build a more complete model, you decide to enrich your data using external sources This allows you to control for things like proximity to public transport or parks, school quality, or local infrastructure investments, which may explain part of the price differences between similar homes Data sources: Google Maps, NS, Centraal Bureau voor de Statistiek etc.
Ebay Example
Suppose you want to collect data from eBay for sold sports trading cards; eBay only allows you to view cards sold within the past three months Web scraper issues: number of returned results, page links, product links The data that you have for every card includes: sold price, sale date, seller identifier, seller reputation, card title, and product section information – data that is unclean.
How can we enrich our data? Use the card title to identify the card player and year Use the card player and combine our data with external sources to identify groups of players, e.g., active/inactive players, hall of famers Run deeper analyses
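A pandas sketch of both enrichment steps for the eBay case, deriving a new variable from the card title and merging with an external player table; the titles, regex, and player table are invented for illustration:

import pandas as pd

sales = pd.DataFrame({
    "card_title": ["1996 Kobe Bryant Rookie", "2003 LeBron James Rookie"],
    "sold_price": [1200.0, 950.0],
})
# Derive a new variable (player name) from the existing card title
sales["player"] = sales["card_title"].str.extract(r"\d{4} (.+?) Rookie", expand=False)

players = pd.DataFrame({                     # hypothetical external source
    "player": ["Kobe Bryant", "LeBron James"],
    "hall_of_famer": [True, False],
})

enriched = sales.merge(players, on="player", how="left")   # merge with external data
print(enriched)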
2. Differences-in-Differences
1 Case and Questions
[Case] In 2022, Amsterdam introduced stricter regulations on short-term rentals like Airbnb - limiting the number of days per year an apartment can be rented. What was the effect of the new regulations on long-term rental prices?
Do the long-term rental prices in Amsterdam drop because of the shock? Or would they have dropped despite the shock? E.g., demand shifts (e.g., tourism), economic downturn. Need a valid counterfactual! Parallel trends before the shock & unaffected by the shock
2 Difference-in-differences (DiD) specification
y_it = β0 + β1 Treatment_i + β2 AfterShock_t + β3 Treatment_i × AfterShock_t + u
y_it = dependent variable (observable) Treatment_i = binary variable equal to 1 if observation i belongs to the treatment group, 0 otherwise AfterShock_t = binary variable equal to 1 if the observation occurred after the shock, 0 otherwise u = error term (unobservable)
β0: The baseline outcome for the control group before the shock β1: The difference between the treatment and control groups before the shock, i.e., any pre-existing gap β2: The change over time in the control group (from before to after the shock)
In the figure: left panel, β2 = 0; right panel, β2 (ΔPc) ≠ 0. ΔPt: the change over time in the treatment group (from before to after the shock), here ≠ 0. β3 = ΔPt − ΔPc: the difference-in-differences, i.e., how much more (or less) the outcome changed in the treatment group relative to the control group. The DiD estimate is the causal effect of the treatment
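A minimal sketch of this specification with statsmodels' formula interface; the panel below is invented (Amsterdam as the treated city, another city as control):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "price":     [1500, 1520, 1480, 1490, 1530, 1545, 1590, 1610],
    "treatment": [1, 1, 0, 0, 1, 1, 0, 0],    # 1 = Amsterdam (treated)
    "after":     [0, 0, 0, 0, 1, 1, 1, 1],    # 1 = after the regulation
})

did = smf.ols("price ~ treatment + after + treatment:after", data=df).fit()
print(did.params["treatment:after"])          # beta3: the difference-in-differences estimate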
1 Minimum wage ~ Employment
Q. What is the effect of the minimum wage on employment? In April 1992, New Jersey increased the minimum wage from $4.25 to $5.05, but in the neighboring state Pennsylvania the minimum wage stayed at $4.25 Fast food restaurants in NJ (treatment group) and PA (control group) Panel data: same individuals observed at multiple times (say February 1992 and November 1992) Difference 1: Difference within individuals, after the treatment minus before; NJ in Nov 92 − NJ in Feb 92, PA in Nov 92 − PA in Feb 92
Difference 2: Difference across individuals; Difference in NJ – Difference in PA → Differences-in-Differences
What would have happened in NJ if the minimum wage had not increased? Assume NJ and PA are: equal in expectation (parallel trends assumption)
2 Parallel Trends Assumption
It might look like the parallel trends assumption holds - and in some cases, that visual check might be enough But is it really? Can we test it more formally? Leads and Lags Model: break the time periods before and after the shock into smaller, equally spaced intervals, so we can check how the outcome evolves over time - both before and after treatment
Leads and Lags Model, Placebo Test, Time Trend Test
3 Leads and Lags
Specification: y_it = β0 + β1 Treatment_i + γ−2 T−2 + γ0 T0 + γ1 T1 + δ−2 Treatment_i × T−2 + δ0 Treatment_i × T0 + δ1 Treatment_i × T1 + u. In general: y_it = β0 + β1 Treatment_i + Σ_{t=−2, t≠−1}^{1} γ_t T_t + Σ_{t=−2, t≠−1}^{1} δ_t Treatment_i × T_t + u
t ≠ -1 (this period is used as the baseline)
We choose the time periods (e.g., days, weeks, months) We find no evidence that the parallel trends assumption is violated if the interaction term coefficients (δ_t) for the pre-treatment periods are not statistically significant
4 Placebo Test
Running a DiD-style analysis only on the pre-treatment period, pretending the treatment occurred earlier than it actually did. Only focus on the pre-shock window Select a placebo shock date (usually the middle of the pre-shock window) Run a DiD model using the placebo shock date: y_it = β0 + β1 Treatment_i + β2 AfterPlacebo_t + β3 Treatment_i × AfterPlacebo_t + u
β3: If not significant, the treatment and control groups were evolving similarly before the shock, and there is no evidence that the parallel trends assumption is violated
In the figure: left, β3 not significant → parallel trends assumption not violated; right, β3 significant → parallel trends assumption violated
5 Time Trend Test
The time trend test checks whether the treatment and control groups were following similar trends over time before the treatment. Only focus on the pre-shock window Create a continuous variable that indicates the time difference (e.g., days, months) between each observation and the end of the pre-shock window (time trend) Run a DiD model using the time trend instead of a shock date: y_it = β0 + β1 Treatment_i + β2 TimeTrend_t + β3 Treatment_i × TimeTrend_t + u
β3: If not significant, the treatment and control groups were evolving similarly before the shock, and there is no evidence that the parallel trends assumption is violated
In the figure: left, β3 not significant → parallel trends assumption not violated; right, β3 significant → parallel trends assumption violated
3. Propensity Score Matching
1 Smoking ~ Life Expectancy Case
A public health researcher wants to estimate the causal effect of smoking on life expectancy. They have access to health and mortality records from a large national health database. They ask: Do smokers live shorter lives because of smoking?
Can we simply compare smokers and non-smokers? No! (Confounders) Randomized experiment? No! (Ethical reasons)
2 Propensity Score & Matching (PSM)
Can we compare individuals who look similar in all observable variables except for receiving the treatment? → Propensity Score Matching (PSM) It tries to mimic randomized experiments by matching treated and untreated units that have similar observable characteristics
Propensity score: The probability that a unit receives the treatment, given its observed characteristics
Estimated using a model (usually logistic regression): e(X) = P(Treatment = 1 | X)
Matching: For each treated unit, find one (or more) untreated unit(s) with similar propensity scores
3 Balance Assessment
Matching Success: Matching only works if the treatment and control groups become similar after matching (i.e., they are balanced on key covariates)
How do we check that? The distribution of covariates (e.g., age, income, education) should look nearly the same in the two groups (Standardized Mean Differences (SMD), t-tests, etc.) If it is, this means the comparison is now apples to apples! Only then does it make sense to compare outcomes between matched treated and control units to get an estimate of the treatment effect
Keep in mind Only adjusts for observed differences - not unmeasured confounders Matching quality depends on the richness of your data May discard many units (especially if poor overlap)
4 Propensity Score Matching Process
Select observed (pre-treatment!) covariates to use for matching Estimate the propensity score for each unit Match treated units to control units with similar propensity scores Check the balance of the matched sample (treated vs. control) across the matched covariates Estimate the treatment effect by comparing the average outcome of interest of the treated and the control groups
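A compact sketch of steps 1-3 and 5 (the balance check of step 4 is omitted) using scikit-learn; all numbers are invented and the matching is 1-nearest-neighbour on the propensity score:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

df = pd.DataFrame({
    "smoker":   [1, 1, 1, 0, 0, 0, 0, 0],
    "age":      [45, 60, 50, 44, 61, 52, 38, 70],
    "income":   [30, 25, 40, 32, 24, 41, 50, 20],
    "life_exp": [72, 68, 74, 80, 75, 82, 85, 73],
})

# Step 2: estimate the propensity score from pre-treatment covariates
X = df[["age", "income"]]
df["pscore"] = LogisticRegression().fit(X, df["smoker"]).predict_proba(X)[:, 1]

# Step 3: match each treated unit to the closest control unit on the propensity score
treated = df[df["smoker"] == 1]
control = df[df["smoker"] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
_, idx = nn.kneighbors(treated[["pscore"]])
matched = control.iloc[idx.ravel()]

# Step 5: compare average outcomes of treated vs. matched controls
print(treated["life_exp"].mean() - matched["life_exp"].mean())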
Lecture 6. Data Quality and Generative AI
1. Foundations of Data Analytics and Its Evolution
1 Extracting Value from Complexity
Systematic process of examining and interpreting data in order to extract meaningful information. Aiming to uncover Patterns: recurring behaviors or sequences in data Trends: directional movements or changes over time Relationships: correlations or connections Insights: actionable knowledge that adds value Anomalies: outliers or unexpected behaviors
2 Turning Data into Actionable Intelligence
Transform raw data into valuable input for understanding and optimizing processes, systems, or behaviors Informed decision-making: providing evidence-based insights to guide actions Problem-solving: identifying root causes and designing solutions Strategic planning: anticipating future trends and preparing accordingly Performance improvement: monitoring and enhancing efficiency and outcomes Innovation: discovering new opportunities or approaches
3 Preparing for Data Analytics Initiative
Analytics projects can have diverse aims (as introduced in the previous slides) Before starting a data analytics initiative, it is essential to clearly define the goals. This is important because each type of data analytics serves a specific purpose and requires the use of appropriate tools and techniques to be effective
Each type can be described by its key characteristics, common usages, and an example:
Type A :: Descriptive Data Analytics
The process of using current and historical data to identify relationships and trends Provides a clear and concise understanding of what has happened in the past
Key Characteristics Simplest and most straightforward form of data analysis Focuses on what and when something happened Does not explore underlying causes; gives a clear and concise summary of past events Tools: Excel, Google Charts, Tableau, Power BI, etc.
Common usages Parse and organize data Identify relationships and trends between variables Present data visually for easier understanding
Example – Netflix Gathers data on users’ in-platform behavior Analyze it to determine the trending series and movies Displays trending titles on the home screen
Type B :: Diagnostic Data Analytics
analyzing data to understand why a trend or pattern occurred, uncovering the root causes behind observed events or outcomes
Key Characteristics Explains why something happened by identifying the underlying causes Follows descriptive analytics by delving deeper into data to explore root causes Provides actionable insights to help businesses address underlying issues and formulate strategies for improvement
Tools: Data Mining Techniques, such as regression, clustering, and decision trees
Example – HelloFresh The company notices a decline in subscribers in a particular region over the past few months Regression analysis: examine the correlation between the decline in subscribers and external factors such as changes in meal offerings, subscription price adjustments, or competitor promotions
Type C :: Predictive Data Analytics
Forecasts future trends and outcomes using historical data and statistical modeling Identifies patterns to uncover potential risks and opportunities Supports strategic decision-making by providing data-driven insights Without predictive analytics, businesses risk making decisions based on assumptions rather than evidence, potentially leading to costly mistakes
Tools: advanced platforms (e.g., IBM Watson Studio, Google Cloud AI), which combine machine learning, optimization, etc.
Example – Healthcare Hospitals aim to identify patients at risk of readmission They analyze historical patient data, such as medical history, length of stay, diagnosis, and treatment plans ML models are used to predict which patients are most likely to be readmitted within 30 days of discharge
Type D :: Prescriptive Data Analytics
Determines and recommends the optimal course of action Goes beyond explanation and prediction by suggesting what to do next
Key Characteristics Recommends actions to influence or change the outcome Models multiple scenarios to identify the best possible decisions Often uses AI and optimization to adapt and improve over time
Tools: IBM Watson Studio, Google Cloud AI, and Gurobi; these extend beyond prediction by using optimization and simulation to recommend the best actions
Example – DHL The company needs to optimize its delivery routes for thousands of parcels every day, ensuring timely deliveries while minimizing fuel costs, and maximizing operational efficiency Real-time recommendations using Gurobi or IBM Watson
Different types of analytics rely on different tools that range in complexity The sequence of the four types often reflects: 1. an increasing level of complexity 2. their chronological order of use in many real-world scenarios
Information → (increasing complexity) → Optimization Descriptive analytics: what happened? / visualization, data mining → Diagnostic analytics: why did it happen? / explanations, causality, what-if analysis → Predictive analytics: what will happen? / probabilistic models, regression, simulations → Prescriptive analytics: how to make it happen? / mathematical optimization
2. Understanding the Journey to LLMs
1 Language Modeling Problem
Goal of the Language Model Predict what comes next, based on the given words P( next word | “The moon is” ) Each time: New probability distribution over vocabulary words Each time: Use word with highest probability
2 Language Models
LMs are behind many real-world applications: Google search, email auto-completion, phone keyboards → in use for over 20 years! What changed in the last years? Not the problem, but the power to solve it: faster processors, bigger data, smarter models
1 n-gram Language Model Example
“Once upon a time, there was a little girl who wore a beautiful red cloak that her grandma had made for her. It had a big red hood, so everyone called the girl Little Red Riding Hood.” Consider word sequences → n-grams Collect statistics about how frequent the different n-grams are, and use these to predict the next word: P(next word | word sequence) = count(word sequence + next word) / count(word sequence)
Unigram probability (1-gram) P(girl) = count(“girl”) / count-all = 2 / 37 ≈ 0.054 P(grandma) = count(“grandma”) / count-all = 1 / 37 ≈ 0.027
n-gram probability P( time | once upon a ) = count ("once upon a time") /count("once upon a") = 1
Sentences sampled from a 1-gram language model (i.e., random combinations of words): “girl red a had” “hood called her big” Sentences sampled from a 2-gram language model (i.e., word pairs): “red cloak that her” “had a big red” Sentences sampled from a 4-gram language model: “once upon a time” “a big red hood” “little red riding hood”
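A minimal bigram (2-gram) sketch over the story snippet above, assuming simple lower-casing and whitespace tokenization (punctuation handling is deliberately simplified).

from collections import Counter, defaultdict

text = ("once upon a time there was a little girl who wore a beautiful red cloak "
        "that her grandma had made for her it had a big red hood so everyone "
        "called the girl little red riding hood")
tokens = text.split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    bigram_counts[prev][nxt] += 1

def p_next(prev, nxt):
    """P(next word | previous word) = count(prev, next) / count(prev)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

print(p_next("red", "cloak"))               # share of times "cloak" follows "red"
print(bigram_counts["a"].most_common(3))    # most likely words after "a"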
2 Limitations of n-gram LM
Data Sparsity: Many word combinations don't appear in the training data, especially as n increases Memory Inefficiency: Requires storing large n-gram tables, which can become resource-heavy Poor Generalization: n-grams tend to memorize data, struggling with unseen sequences or new word patterns
1 Neural Language Model
Uses neural networks to predict the next word in a sequence of words Neural networks cannot process raw text and need numerical input, thus - Words are converted into vectors - Each vector is a list of numbers
Example cat → [0.62, 0.75, 0.10] dog → [0.65, 0.72, 0.12] (cat and dog are similar because both are animals) banana → [0.95, 0.10, 0.15]
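A quick numeric check of the intuition above, using cosine similarity on the toy vectors from the example.

import numpy as np

vec = {
    "cat":    np.array([0.62, 0.75, 0.10]),
    "dog":    np.array([0.65, 0.72, 0.12]),
    "banana": np.array([0.95, 0.10, 0.15]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vec["cat"], vec["dog"]))       # close to 1 -> similar meanings
print(cosine(vec["cat"], vec["banana"]))    # noticeably smaller -> less related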
2 NLM Components
Embedding layer Converts words (e.g., "dog", "cat") into vectors Captures semantic similarities between words Hidden layers Learn complex language patterns and relationships Enable the model to understand deeper, contextual meanings in text Output: Next-word prediction or another task (e.g., classification)
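A bare-bones numpy sketch of these three pieces (embedding layer, hidden layer, output over the vocabulary). The weights are random and untrained, so the prediction is meaningless; the point is only the shape of the computation.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["once", "upon", "a", "time", "girl", "red", "hood"]
V, d, h = len(vocab), 8, 16        # vocabulary size, embedding dim, hidden dim

E  = rng.normal(size=(V, d))       # embedding layer: word id -> vector
W1 = rng.normal(size=(d, h))       # hidden layer weights
W2 = rng.normal(size=(h, V))       # output layer: hidden state -> vocabulary scores

def predict_next(word):
    x = E[vocab.index(word)]                        # look up the word embedding
    hidden = np.tanh(x @ W1)                        # hidden layer (here: untrained)
    scores = hidden @ W2
    probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary
    return vocab[int(np.argmax(probs))]

print(predict_next("red"))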
3 Role of NLM
Richer Representations: Learn more detailed & meaningful representations of words compared to n-grams Long-range Dependencies: Understand relationships between words that are far apart Better Generalization: Generalize better to unseen words, or contexts by learning patterns from large datasets rather than relying on fixed, predefined rules
4 Main Problems with Neural Language Models
High Computational Costs: Training requires significant computational power, making the process both expensive and time-consuming Large Datasets & Hardware Demands: Models need massive datasets and advanced hardware to perform effectively → limiting accessibility Risk of Overfitting: When trained on small or insufficiently diverse data, the learning patterns don’t generalize well to new, unseen data
1 Key Breakthroughs
Transformer Model: Efficient learning of complex language patterns by stacking deep layers e.g., ChatGPT (T: Transformer) Self-Attention Mechanism: Learns language patterns better by using repeated layers that highlight important words and their relationships
2 Self-Attention Mechanism
Evaluates relationships between all words in a sentence, independently of their position Compares each word with all others to assess its importance in the current context, i.e., in relation to the other words Generates an attention score to determine which words should be given more weight (to understand the context) 1. Captures contextual relationships more effectively 2. Long-range dependencies are easier to capture (see the sketch below)
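A minimal single-head, scaled dot-product attention sketch over toy word vectors, to show how every word is compared with every other word and weighted by an attention score. The random embeddings and the absence of learned query/key/value projections are simplifications for illustration.

import numpy as np

words = ["once", "upon", "a", "time"]
X = np.random.default_rng(0).normal(size=(len(words), 4))   # toy word embeddings

d_k = X.shape[1]
scores = X @ X.T / np.sqrt(d_k)                  # compare each word with all others
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax per row
contextual = weights @ X                         # context-adjusted vector per word

for w, row in zip(words, weights):
    print(w, np.round(row, 2))                   # how much each word attends to the others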
Example “Once upon a time, there was a little girl who wore a beautiful red cloak that her grandma had made for her. It had a big red hood so everyone called the girl Little Red Riding Hood.”
All words are converted to vectors Initial vector for “time” → [0.43, 0.51, 0.01, …] The “attention” operation allows the model to adjust the representation of each word based on the context of the surrounding words → Finally, “time” is understood as part of a narrative context and its vector is adjusted to capture that specific meaning
3 Transformer Architecture
Previous Issue: data was processed sequentially → slow and inefficient training Transformers enable parallel processing of words in a sentence, improving training speed The model does not process words one-by-one but processes all words in a sequence simultaneously Can handle larger datasets & more complex models, making them ideal for complex language tasks
4 From Transformers to LLMs
Large Language Models (LLMs) based on the transformer architecture Leverage the attention mechanism and achieve greater performance by: - Increasing the number of parameters: e.g., more layers, which gives the model more capacity to understand complex relationships in data - Training on larger datasets: a diverse and extensive collection of examples to learn from
3. Rethinking Data Analytics: The Future with LLMs
1 LLMs and Their Role in Data Analytics
The primary goal of LLMs is to understand language and generate human-like responses → Not created specifically for Data Analytics BUT LLMs’ text-based capabilities can be aligned with data analytics in various ways: Interact with humans over datasets using natural language (e.g., no need for SQL statements and / or Python code) Improve integration of large volumes of unstructured data, such as customer reviews, emails, and social media content Extract sentiment, trends, or future actions from text-based data to inform decision-making Explain the results of complex data mining techniques Suggest how to modify input variables to influence or optimize outcomes in algorithms or decision models
2 Complex Tasks with Limited Training
ChatGPT (GP: Generative Pre-trained) Zero-Shot Learning: The model performs the task without having seen any examples or any additional training One-Shot Learning: The model uses a single example as a reference to understand how to handle similar tasks Few-Shot Learning: The model performs the task after seeing only 2-10 examples; the labeled examples are used to generalize the learning and make predictions for new, unseen data
Few-Shot Learning Example The following sentences are classified as positive or negative: 1. “This movie was fantastic!” (positive) 2. “I hate waiting in lines.” (negative) → Now classify the sentence “This movie was good.”
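A small sketch of how that few-shot prompt could be assembled as plain text before being sent to an LLM; the API call itself is omitted, and the example sentences are the ones above.

examples = [
    ("This movie was fantastic!", "positive"),
    ("I hate waiting in lines.", "negative"),
]
query = "This movie was good."

prompt_lines = ["The following sentences are classified as positive or negative:"]
for i, (sentence, label) in enumerate(examples, start=1):
    prompt_lines.append(f'{i}. "{sentence}" ({label})')
prompt_lines.append(f'Now classify the sentence: "{query}"')

prompt = "\n".join(prompt_lines)
print(prompt)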
3 Next-Gen: Modern Solutions to Classic Tasks
Translations become more accurate and fluent Programming tasks with natural language interfaces Understanding user intent, summarizing content, and reasoning across documents Assisting with cleaning, labeling, and converting data formats using simple instructions Expert-level tasks through pre-trained knowledge, e.g., legal, medical, financial
4 Entity Resolution for Data Integration
Entities encode a large part of our knowledge Valuable asset for numerous (Web) applications Many names, descriptions, or IDs (URIs) are used to describe the same real-world objects
Issue A: Converting the problem to a Language Task
Requires creating language descriptions for each entity and task Often referred to as using named prompts E.g.-1: Person Albert Einstein, born on 14-03-1879, born place Ulm, Germany, died on 18-04-1955, etc. E.g.-2: Do the following two entity descriptions refer to the same real-world entity?
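A sketch of Issue A: serializing two structured entity records into a natural-language matching question. The attribute names and the second record are hypothetical.

def serialize(entity: dict) -> str:
    return ", ".join(f"{k}: {v}" for k, v in entity.items())

e1 = {"name": "Albert Einstein", "born": "14-03-1879", "birthplace": "Ulm, Germany"}
e2 = {"name": "A. Einstein",     "born": "1879-03-14", "birthplace": "Ulm"}

prompt = (
    "Do the following two entity descriptions refer to the same real-world entity? "
    "Answer Yes or No.\n"
    f"Entity 1: {serialize(e1)}\n"
    f"Entity 2: {serialize(e2)}"
)
print(prompt)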
Issue B: Creating the Task Demonstrations
Show the model how the task should be completed E.g., it should generate a Yes / No answer 1. Random sampling of examples from a labeled dataset 2. Manually constructing examples that optimize performance on a validation set (typically 10% of the original labeled dataset) → More costly (requires more time) but improves performance when examples are carefully constructed
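A sketch of strategy 1 above (random sampling of demonstrations from a labeled dataset); the labeled pairs are hypothetical placeholders.

import random

labeled_pairs = [
    ("iPhone 13 128GB",  "Apple iPhone 13 (128 GB)", "Yes"),
    ("iPhone 13 128GB",  "Samsung Galaxy S21",       "No"),
    ("Dell XPS 13 9310", "XPS 13 (9310) by Dell",    "Yes"),
    ("Dell XPS 13 9310", "Dell Inspiron 15",         "No"),
]

random.seed(0)
demos = random.sample(labeled_pairs, k=2)   # strategy 1: random sampling of examples
demo_text = "\n".join(
    f"Entity 1: {a}\nEntity 2: {b}\nAnswer: {label}" for a, b, label in demos
)
print(demo_text)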
Narayan et al., PVLDB 2022
Work done in 2022; F1-scores for entity matching Zero-shot performance is significantly lower than few-shot, suggesting that demonstrations are very important for this particular task w/ Attr. Select.: sub-selecting attributes during row serialization w/o Example Select.: using randomly selected demonstrations
Results Attribute sub-selection boosts performance by removing noisy attributes that hurt performance Prompt formatting (e.g., word choice, punctuation) can have a significant impact on model performance Examples need to be carefully crafted to learn new tasks
L7. Data Quality Dimensions and their Impact on Analytics in an LLM-Driven World
1. When LLMs Mislead
1 Understanding the limits
GenAI generates “facts”, it doesn’t retrieve them May generate links or sources that don’t exist Can misinterpret context and user intent Does not truly understand its errors: lacks awareness or correction mechanisms It is probabilistic, not deterministic: not governed by fixed rules, i.e., it learns them from patterns in data
2 Data Quality in Analytics
We need to understand how it works and not treat it as a black box Data quality affects every phase of the pipeline Is the input the right one (and expressed as it should be)? Is the intended context understood by the system? Is the necessary data available to the system (or can it be discovered)? How is the system generating the answer, i.e., what is happening internally? What data sources are being used, and how do they influence the output? What is the system producing, in what format, and is it interpreted correctly?
Key Boundaries in LLMs
Inconsistencies: Can produce conflicting outputs for very similar prompts Hallucinations: May generate text that seems realistic and plausible but is actually inaccurate Lack of Long-Term Memory: Cannot retain information from previous chats or update knowledge in real time Limited Reasoning: Struggles with tasks requiring complex reasoning or multi-step problem-solving Outdated Information: Unable to provide up-to-date statistics (limited to information in the training data) Computational Limits: Cannot hold too much text in its “working memory” at once Bias and Fairness: Inherent biases based on the data the model was trained on can result in outputs that reinforce stereotypes or exhibit discrimination Dependence on Input Quality: The quality of the output depends heavily on the quality and specificity of the input prompt Ambiguity Handling: LLMs can struggle with ambiguity, often misinterpreting prompts with unclear intent or context, leading to incorrect answers → Knowing and Improving Data Quality: For any data retrieval or data analytics task, it is critically important to know the quality of your data
2. Data Quality Dimensions
Dimensions: accuracy, completeness, consistency, timeliness, validity, uniqueness, efficiency, clarity, usefulness, importance, …
Data Accuracy
The extent to which the data values correctly reflect the real-world phenomena, objects, or events they are intended to represent Common Causes of Inaccuracy Manual data entry errors Outdated reference sources Misconfigured sensors or systems Incorrect assumptions during transformation Misunderstandings in meaning or definition e.g., “start date” is interpreted as contract date but actually refers to the project kickoff
Data Completeness
The extent to which all required data values and / or instances are present to meet the intended use or answer a given question Common Causes of Data Incompleteness Missing values or fields Incomplete records from failed data collection, system outages, or truncated imports Poorly designed forms or interfaces that don’t enforce required inputs Data loss or mismatch during integration, migration, or merging from incompatible systems Privacy constraints or anonymization that remove or mask key information
Data Consistency
The degree to which a data collection does not contain contradictions and is presented in a uniform and coherent manner Common Causes of Data Consistency Different formats in a single column Different values for the same information E.g., more than one city for the same zip code Duplicate entries in the data
Data Timeliness
The degree to which data reflects the current state of the real world at the time it is used Common Causes of Poor Data Timeliness Delayed data entry or manual updates Lack of real-time data integration System or network latency Human bottlenecks in approval or validation Unclear or missing data refresh schedules
Data Validity
The extent to which data conforms to defined rules, formats, or standards that determine what values are acceptable or expected Common Causes of Invalid Data Incorrect data formats (e.g., strings in date attributes) Out-of-range values (e.g., age set to 300) Violations of business rules (e.g., end date before start date) Data corruption during transfer or processing
Data Uniqueness
The degree to which each real-world object or event is represented only once in a dataset, without unnecessary duplication Common Causes of Poor Uniqueness Lack of unique identifiers Integrating data from multiple sources without proper handling Failure to enforce constraints System errors during data synchronization
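A small pandas sketch of rule-based checks for several of the dimensions above (completeness, validity, uniqueness, consistency); the table and its columns are hypothetical.

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "age":         [34, 300, 28, None],
    "zip_code":    ["10001", "10001", "10001", "94105"],
    "city":        ["New York", "New York", "Boston", "San Francisco"],
})

completeness = df.isna().mean()                              # share of missing values per column
validity     = df[(df["age"] < 0) | (df["age"] > 120)]       # out-of-range ages
uniqueness   = df[df.duplicated("customer_id", keep=False)]  # duplicate identifiers
consistency  = df.groupby("zip_code")["city"].nunique()      # >1 city per zip code = contradiction

print(completeness, validity, uniqueness, consistency[consistency > 1], sep="\n\n")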
3. Improving and Enhancing Data Quality in Practice
1 Cleaning and Organizing Data
Additional motivation: cleaning and organizing data is the task that takes the most time and that data scientists like doing the least Schema mapping, Deduplication, Coping with Evolution, Data Rotting
2 Schema Mapping
Transforms or maps the structure of one data schema into another schema Also applicable to (semi-)unstructured data
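A minimal sketch of schema mapping as a column rename plus a value transformation; the source and target column names are hypothetical.

import pandas as pd

source = pd.DataFrame({"fname": ["Ada"], "lname": ["Lovelace"], "dob": ["10/12/1815"]})

mapping = {"fname": "first_name", "lname": "last_name", "dob": "birth_date"}
target = source.rename(columns=mapping)                       # map source schema to target schema
target["birth_date"] = pd.to_datetime(target["birth_date"], format="%d/%m/%Y")

print(target.dtypes)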
3 Record Linkage & Deduplication
Identify different representations of the same real-world entity and merge them into one unified record Relation w/ duplicates → Compute pair-wise similarity → Cluster similar records → Merge clusters → Clean relation
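A sketch of that pipeline on toy strings: pair-wise similarity, a naive similarity-threshold "clustering", and keeping one representative per cluster as the merge step. The records and the threshold are hypothetical.

from difflib import SequenceMatcher

records = ["Jon Smith, NYC", "John Smith, New York City", "Maria Garcia, Madrid"]

def similar(a: str, b: str, threshold: float = 0.6) -> bool:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

clusters = []
for rec in records:                      # compute pair-wise similarity and group
    for cluster in clusters:
        if any(similar(rec, other) for other in cluster):
            cluster.append(rec)
            break
    else:
        clusters.append([rec])

merged = [cluster[0] for cluster in clusters]   # "merge clusters" = keep a representative
print(merged)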
4 Coping with Evolution
Building ladders in Wikipedia for role discovery Comparing historical versions makes it possible to recover what the answer was at earlier points in time
5 Data Rotting
Not all data is important! People fear losing potentially important data Already now, sometimes there is really no choice The database must selectively forget data on its own initiative for the sake of storage management and responsiveness, as well as to ensure the truthfulness of the results produced by generative AI based on that data
Sample Exam Questions
Part 1
Describe the steps you would follow to scrape a website (2 Marks)
What is the difference between structured and unstructured data?
How are REST APIs different from GraphQL? (2 marks)
What is parsing? How does the parsing approach for web scraping differ
from the parsing approach for Web APIs? (2 marks)
Following is the HTML code of IMDB: Can you complete the scraping
code after inspecting the HTML data? (2 marks)
What are the two new data management considerations for
organizations? (2 marks)
What are two main issues with predictive models? (2 marks) Can you
think of a way to solve them? (1 mark)
What is the main difference between Machine learning and
Econometrics? (1 Mark)
A) The models are always different
B) They use different algorithms
C) Their objectives are different
D) They are identical
What does the "Bag of Words" model ignore? (1 Mark)
A) The frequency of words
B) The order and context of words
C) The vocabulary size
D) Rare terms in the document
What are some of the benefits of using TF-IDF? (2 marks)
Sentence completion algorithm (5 marks)
Context: Sentence completion algorithms are crucial in various
applications like text editors, chatbots, and predictive typing tools. These
algorithms suggest the most likely next words in a sentence based on
the given context
Task: Describe the approach to design a sentence completion
algorithm that generates the top three word suggestions for a given
input word.
Example: If the input word is 'have', the algorithm should suggest
words like 'you', 'we', 'been'. For 'looking', the suggestions might include
'forward', 'at', 'ahead’.
Requirements
- Data Collection and Preparation: Explain the types of text data
needed and how you would prepare this data for the algorithm
- Embedding: Describe how you would transform the input data into a
format suitable for the model.
- Training the Model: Outline the steps for training your chosen model
with the preprocessed data.
- Prediction: Explain how the model will generate and select the top
three word suggestions.
- Evaluation and Refinement: Discuss how you would assess the
performance of the algorithm and refine it for better accuracy.
Please Note: You are NOT required to write the actual code. You can
describe your algorithm/approach using descriptive steps and
pseudocode (partial code), wherever possible.
Part 2
Consists primarily of multiple-choice questions that check your understanding of the
concepts we cover in this part of the course
There will be no coding questions.
Part 3
No code from LLMs
Open questions and multiple-choice questions (similar to the quiz)
e.g., give examples of “data accuracy”
Programming related (10 marks)
Conceptual (~20-25 marks)
Lecture 1. Introduction to Data Analytics and ADM
1. Augmented data management
1. Concept and Process
1 What is Data Management and When is it Augmented?
2 Data Management and Analytics Process
2. ADM and analytics process
1) Define Objective
1 The Wild Web of Data Analytics
2 Data Management Considerations for Organizations – Big data
3 Data Management Considerations – Social Media Augmentation
2) Data Collection and Integration
1 Data Collection and Integration
2 Web Data Collection – Structured vs. Unstructured Data
3 Structured Web Data Collection – Web APIs
4 Unstructured Data – Web Scraping
3) Data Modeling and analytics
1 Prediction Models – AI vs. ML
2 ML vs. Econometrics
3 Prediction Models
4 Estimation Models – Linear Regression
4) Validation – Prediction Models
1 Issues With Prediction Models
2 Validation of Prediction Models
3 Evaluation Metrics (Classification Models)
4 Is the model good or bad ?
5) Validation – Estimation Models
1 Causality and Experimental design
2 Identification Strategy
3. Quick Recap
Lecture 2. Web Data Collection Process
1. Understanding a Web Browser
1) Client-Server Architecture
1 Client-Server Model
2 Client-Server Model – Key Terms
3 What Does a Web Browser Do?
2. Unstructured Web Data Collection
1) Structured vs. Unstructured data
1 Structure of web data
2) Crawling and Scraping
1 Unstructured Data – Web Scraping
3) Unstructured data collection process
1 Unstructured web data collection process
Step 1: Understand the Webpage and Its Limitations
Step 2: Inspect the webpage data
Step 3: Make HTTP request and get the HTML data
Step 4: Convert the HTML data into a format that Python can understand
Step 5: Find the data
3. Structured Web Data Collection
1) Web APIs
1 Structured Data – Web APIs
2 Structured Data – JSON
2) REST vs. GraphQL
1 Data Collection Using REST API
2 Facebook issues with REST and the development of GraphQL
3 Data Collection Using GraphQL API
Lecture 3. Embedding Techniques
1. REST vs. GraphQL
1) From the last lecture and lab
1 Data Collection Using REST API
3 Data Collection Using GraphQL API
2) After collecting data
1 An Analytical Problem
2 The Analytical Approach
3 Finding x in Text, Images and other ‘Media’ Objects
2. Bag-of-Words
1) Count Vectorizer
1 Tokenization : Bag of Words
2 Encoding Using Count Vectorizer
3 Training
4 Pros and Cons of CountVectorizer
2) TF-IDF
1 TF-IDF
2 TF-IDF Calculation
3 Pros and Cons of TF- IDF
4 Bag of Words – Pre-processing – Removing Stopwords
3. Word2Vec
1) Encode the meaning of the word
1 How do we represent the meaning of the word?
2 Computational Objective
3 Embedding as a Learning Task
Lecture 4. Causal Inference - Randomized Experiments & OLS
1. Introduction
1) Causal inference
1 Causal Inference
2 Causal Inference vs Machine Learning
2) Examples of Causal Inference
1 Causal Inference Example
2 Correlation vs. Causation
3. Methodology to Explore Causal Inference
1) Randomized Experiments
1 Experimental Design
2 Motivation for Randomized Experiments
3 Randomized Experiments – Example
4 Randomized Experiments - Feasibility
2) Regression
1 Regression Example
2 Regression models
3 Regression Example
3 R-Squared: A Goodness-of-Fit Measure
4 Regression Example Interpretation
5 Regression - Causal Assumptions
Lecture 5. Advanced Causal Inference Techniques
1. Data Enrichment
1) Concept and Example
1 Concept
Example 1
Example 2
2) Data Collection & Enrichment
Ebay Example
2. Differences-in-Differences
1) DiD
1 Case and Questions
2 Difference-in-differences (DiD) specification
2) DiD Assumptions
1 Minimum wage ~ Employment
2 Parallel Trends Assumption
3 Leads and Lags
4 Placebo Test
5 Time Trend Test
3. Propensity Score Matching
1 Smoking ~ Life Expectancy Case
2 Propensity Score & Matching (PSM)
3 Balance Assessment
4 Propensity Score Matching Process
Lecture 6. Data Quality and Generative AI
1. Foundations of Data Analytics and Its Evolution
1) Data Analytics
1 Extracting Value from Complexity
2 Turning Data into Actionable Intelligence
3 Preparing for Data Analytics Initiative
2) Types of Data Analytics
Type A :: Descriptive Data Analytics
Type B :: Diagnostic Data Analytics
Type C :: Predictive Data Analytics
Type D :: Prescriptive Data Analytics
3) Summary
2. Understanding the Journey to LLMs
1) Language Models
1 Language Modeling Problem
2 Language Models
2) n-gram Language Model
1 n-gram Language Model Example
2 Limitations of n-gram LM
3) Neural Language Model
1 Neural Language Model
2 NLM Components
3 Role of NLM
4 Main Problems with Neural Language Models
4) Attention is All You Need
1 Key Breakthroughs
2 Self-Attention Mechanism
3 Transformer Architecture
4 From Transformers to LLMs
3. Rethinking Data Analytics: The Future with LLMs
1) Future with LLMs
1 LLMs and Their Role in Data Analytics
2 Complex Tasks with Limited Training
3 Next-Gen: Modern Solutions to Classic Tasks
4 Entity Resolution for Data Integration
2) Solution with LLMs
Issue A: Converting the problem to a Language Task
Issue B: Creating the Task Demonstrations
Narayan et al., PVLDB 2022
L7. Data Quality Dimensions and their Impact on Analytics in an LLM-Driven World
1. When LLMs Mislead
1 Understanding the limits
2 Data Quality in Analytics
Key Boundaries in LLMs
2. Data Quality Dimensions
Data Accuracy
Data Completeness
Data Consistency
Data Timeliness
Data Validity
Data Uniqueness
3. Improving and Enhancing Data Quality in Practice
1 Cleaning and Organizing Data
2 Schema Mapping
3 Record Linkage & Deduplication
4 Coping with Evolution
5 Data Rotting