ChatGPT Data Extraction: A quick demonstration

10,982 views

Brandon Roberts

1 day ago

A demonstration of how to use ChatGPT to extract data from text documents. I run through the various ways to perform AI-powered structured data extraction, show some tricks I learned, explain where it falls short, and discuss how this applies to data journalism.
This is an extension of an OpenNews article I wrote: source.opennew...
You can learn more about journalism technology at my website: bxroberts.org

Comments: 18
@ianmatiello 6 months ago
I have no words to thank you. I'm a Brazilian lawyer and for months, maybe even more than a year, I've been looking for ways to automate the boring, repetitive, analogue tasks in my work, which only waste my time, with my limited programming knowledge (almost zero). You were the first to teach a method that is easily understandable and applicable to those who don't have much knowledge of programming and, most importantly, is useful. I work in a union, and here there is a culture almost against technology, but perhaps that is due more to ignorance of its possibilities than anything else. To give you an idea, we calculate how much a union member has to receive in a lawsuit. The information is contained in a financial statement provided by the City Hall. Until now, the calculation has been done "by hand", that is, by reading the information from the table and manually entering it into an Excel spreadsheet, which takes around one or two hours per calculation.
@Back_at_Bardot 1 year ago
❤ Finally! I've been searching for content on this sort of topic.
@biraescudero 1 year ago
Great! I usually have this kind of problem and your approach is very good! Thanks for sharing!
@lorenzoleongutierrez7927 1 year ago
Thanks for sharing!
@retrogamingplayback 1 year ago
Don't forget on long outputs, typing "continue" is your friend. It should resume where it left off.
@bxroberts 1 year ago
While that does work, for the purposes of data extraction I found the "continue" prompt to cause more problems than it fixed. When outputting code, asking ChatGPT to "continue" usually caused it to completely re-output the entire JSON, often in a different format and ignoring the schema. The further in a context ChatGPT gets from the prompt, the less likely it is to obey it. For long schemas, it would be better to split the schema in two and ask two separate times. Just my experience, though!
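A minimal sketch of the "split the schema in two and ask twice" idea, assuming a flat schema that maps field names to descriptions. The function names and the `ask_model` callable are hypothetical stand-ins (any function that takes a prompt string and returns the model's JSON reply as text would do); none of this is taken from the video.

```python
import json


def split_schema(schema, parts=2):
    """Split a flat {field: description} schema into roughly equal chunks."""
    items = list(schema.items())
    # Ceiling division, with a floor of 1 so an empty schema does not break range().
    size = max(1, -(-len(items) // parts))
    return [dict(items[i:i + size]) for i in range(0, len(items), size)]


def build_prompt(document, schema):
    """Build one self-contained extraction prompt for a schema chunk."""
    return (
        "Extract the following fields from the document and reply with JSON only.\n"
        f"Schema: {json.dumps(schema)}\n\n"
        f"Document:\n{document}"
    )


def extract(document, schema, ask_model, parts=2):
    """Ask once per schema chunk and merge the partial JSON replies.

    ask_model is a hypothetical callable: prompt string in, JSON text out.
    """
    merged = {}
    for chunk in split_schema(schema, parts):
        reply = ask_model(build_prompt(document, chunk))
        merged.update(json.loads(reply))
    return merged
```

Each request carries the full document plus only its share of the schema, so every reply stays close to its own prompt instead of relying on "continue".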
@retrogamingplayback 1 year ago
@@bxroberts Makes perfect sense, appreciate the tip for splitting JSON schema - hadn't thought of that.
@andre-le-bone-aparte 1 year ago
Excellent Content - Another sub for you sir!
@TheHavyxon 1 year ago
Do you think that the police reports are intentionally written so they are this difficult to read specifically because of data extraction?
@bxroberts 1 year ago
The documents in question were written over a 20-year time span, between 2000 and 2020, so they were written before most people even knew about automated extraction. So I don't think it's intentional. The records suffer from a few things that trip up ChatGPT: 1) they're really messy and OCR isn't perfect, 2) many of them are excerpts from large email chains, making context difficult to figure out, and 3) there is a mix of documents (use-of-force reports, reprimands and memos of termination) and they're all written totally differently.
@TheHavyxon 1 year ago
@@bxroberts well I think the term "machine readable" is kinda old
@13statistician13 4 months ago
No. Police reports are public and FOIA-able. An even easier method than doing all this programming in Python and extraction with ChatGPT is to go to the source database. You'd be amazed how much easier you can make your life by simply picking up the phone and contacting the information technology department at your local police department. You can usually ask them to supply you with an electronic copy of their police reports in a machine-readable format, and they will oblige. Typically, you can ask for CSV files or even a copy of their database, but more often than not they will simply provide you with CSV files rather than their database, since their DB design may be proprietary. You'll want to limit the result set by providing a date range. In many cases, you'll get several tables. In that case, you'll simply need to write some basic SQL code to join the tables, but that's super easy to learn. You could use R, SAS or other statistical programming languages to accomplish that as well. In general, the only reason you might not get data in an easy-to-use format is that the particular PD's IT department is incompetent or resource-constrained, not that they are attempting to hide anything. One final note: if you do request the information, you might be expected to pay a nominal fee for the service. It's usually significantly cheaper to pay this fee than to spend the time building out data extraction processes, which are often unreliable.
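For anyone wondering what "basic SQL code to join the tables" might look like, here is a small sketch using Python's built-in `sqlite3` and `csv` modules. The two tables, their column names, and the date range are made up for illustration; a real export from a department will have its own layout.

```python
import csv
import io
import sqlite3

# Hypothetical CSV exports; real files and column names will differ.
INCIDENTS_CSV = """incident_id,officer_id,incident_date
1,10,2019-03-01
2,11,2019-03-02
3,10,2019-04-15
"""

OFFICERS_CSV = """officer_id,officer_name
10,A. Smith
11,B. Jones
"""


def load_csv(conn, table, text):
    """Create a table from CSV text, using the header row as column names."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}"' for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', data)


conn = sqlite3.connect(":memory:")
load_csv(conn, "incidents", INCIDENTS_CSV)
load_csv(conn, "officers", OFFICERS_CSV)

# Join the two tables on the shared key and limit the result to a date range.
query = """
SELECT i.incident_id, i.incident_date, o.officer_name
FROM incidents AS i
JOIN officers AS o ON o.officer_id = i.officer_id
WHERE i.incident_date BETWEEN '2019-03-01' AND '2019-03-31'
ORDER BY i.incident_id
"""
joined = conn.execute(query).fetchall()
for row in joined:
    print(row)
```

With real files, the CSV text would come from `open(path, newline="")` instead of the inline strings. Dates stored as ISO `YYYY-MM-DD` text compare correctly as strings, which is why the `BETWEEN` filter works here.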
@13statistician13 4 months ago
@@TheHavyxon Huh? No. Machine readable is a very modern term used by cloud engineers, data engineering teams, data scientists, and statisticians to this very day.
@sarahbratt5178 1 year ago
Awesome video! Could you link to PDF Plumber?
@bxroberts 1 year ago
Sure! It's on GitHub here: github.com/jsvine/pdfplumber
@ain92ru 1 year ago
I guess the occasional mistakes might have been caused by the temperature being set too high (unfortunately, I don't know how to change it in the ChatGPT interface because I don't use it)
@bxroberts 1 year ago
Hello! Temperature is exposed by the GPT-3 API, but you can't change it in ChatGPT (as of Mar 2023). You can definitely improve the hallucination rate using some of the other APIs and params, but then you also need to invest more time in prompt engineering. Ultimately, even the best-tuned temperature will still exhibit some hallucination, but you're def right that it can be controlled a bit by tuning the params.
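As a rough illustration of why temperature matters, here is a sketch of temperature-scaled softmax, the way temperature is commonly defined for language model sampling. The scores are invented and this is not OpenAI's code; it only shows the general mechanism, in which a lower temperature concentrates probability on the top-scoring token and so makes the output less random.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up scores for three candidate tokens.
logits = [2.0, 1.0, 0.5]
for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Even at a very low temperature the top-scoring token can still be the wrong one, which fits the point above that tuning temperature reduces but does not eliminate hallucination.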