How to Build a Q&A Chatbot Using ChatGPT

Building a Q&A bot using OpenAI's free ChatGPT 3.5 model


ChatGPT is the AI sensation of the year. Like everyone else, I was curious to try using it to simplify day-to-day activities. Here is a simple project I did in R: collect data from Oracle Forums and answer as many questions as possible using ChatGPT.

Project

Answer as many questions as possible in selected communities on Oracle Forums.

(Figure: Oracle Forums – user stats)

Solution

ChatGPT limitations

Before discussing the solution in detail, we need to keep in mind that ChatGPT has some limitations:

  • Hallucination – it may generate a completely made-up response that is irrelevant or incorrect
  • The OpenAI REST API allows only 3 requests per minute on the free tier (a simple throttling sketch follows this list)
  • The model used is gpt-3.5-turbo, the freely available model from OpenAI at the time of writing this article
  • ChatGPT 3.5 is trained on data up to September 2021, so it only knows the Oracle tech stack and releases up to that date
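
To respect the rate limit, the simplest approach is to pause between calls. A minimal sketch, assuming a flat budget of 3 requests per minute (the 21-second pause is 60s / 3 plus a small buffer):

# Throttled wrapper: sleep after every request so we never exceed ~3 requests/minute
throttled_req_gpt <- function(prompt) {
  response <- make_req_gpt(prompt)  # defined later in this post
  Sys.sleep(21)                     # assumed pause; tune to your actual rate limit
  response
}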

R Limitations

Memory limits and object size – these are specific to the machine and the R version installed on it.

Oracle Forums Limitations

Since OTN was recently migrated to Oracle APEX cloud, much of the site's functionality is error-prone, and pages often throw unexpected errors even when loaded manually.

Data Collection

Select topics you are familiar with, so you can verify and judge whether a ChatGPT response is correct.

I collected 50 HTML pages manually, related to these topics: Oracle BI Publisher, OBIEE, SQL, REST API, OAuth, Integration, ML, and AI.

I tried automating data collection with RSelenium and R, but in vain: the task was tedious and time-consuming, and because Oracle Forums had recently been migrated to Oracle APEX, the migration caused a lot of downtime and unexpected or unavailable functionality in most forum threads.

Data Preparation

Select questions/posts wisely, based on post date and post tags. Skim through the collected data and remove posts about the latest versions released after September 2021, since the model knows nothing about them. A filtering sketch follows.
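
For illustration, a minimal filtering sketch; the posts data frame and its post_date and tags columns are hypothetical placeholders for however the scraped posts are stored:

library(dplyr)

# Hypothetical posts data frame with post_date (Date) and tags (character) columns.
# Keep only posts created on or before the model's knowledge cutoff,
# and drop posts tagged with releases that came out after it (assumed tag names).
cutoff <- as.Date("2021-09-30")
posts_in_scope <- posts %>%
  filter(post_date <= cutoff) %>%
  filter(!grepl("23c|23ai", tags, ignore.case = TRUE))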

Prerequisites to use ChatGPT 3.5

Create a user account at OpenAI.com; you can log in using a Google/Apple/Microsoft account.

Generate an OpenAI key (token) – go to https://platform.openai.com/account/api-keys to create a new secret key.

Save this newly generated secret key; it is required to authenticate each API call we are about to make.

Install the libraries (rvest, httr, jsonlite) needed to scrape pages and make REST API calls. A short setup sketch follows.
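
A minimal setup sketch; reading the key from an environment variable (assumed here to be named OPENAI_API_KEY) keeps it out of the script itself:

# One-time package installation
install.packages(c("httr", "rvest", "jsonlite"))

# Read the key from an environment variable instead of hard-coding it
# (set OPENAI_API_KEY in ~/.Renviron or your shell profile first)
OPENAI_API_KEY <- Sys.getenv("OPENAI_API_KEY")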

Running the Code

The full code is uploaded to GitHub. You can see the key snippets here:

library(httr)
library(rvest)
#library(tidyr)
#library(xml2)
library(jsonlite)
library(RSelenium)

#setwd(dir = "/Users/arunamalla/Documents/GitHub/arunamalla.github.io/_posts/_site/Chatgpt_ora_forums/Data")
OPENAI_API_KEY = "<>"

# read the file urls_txt.txt and feed this to chatgpt 3.5 using an API KEY

prompt <- "Write a simple linear regression program in R" # text prompt

# Send a single prompt to the Chat Completions endpoint and return the raw httr response
make_req_gpt <- function(prompt) {

  headers <- c(
    'Content-Type' = 'application/json',
    'Authorization' = paste0('Bearer ', OPENAI_API_KEY)
  )

  # Request body: model, a system message, the user prompt; temperature 0 for deterministic answers
  data <- list(
    'model' = 'gpt-3.5-turbo',
    'messages' = list(
      list('role' = 'system', 'content' = 'You are a helpful assistant.'),
      list('role' = 'user', 'content' = prompt)
    ),
    'temperature' = 0.0
  )

  json_data <- toJSON(data, auto_unbox = TRUE)

  response <- POST(
    url = 'https://api.openai.com/v1/chat/completions',
    body = json_data,
    add_headers(.headers = headers),
    encode = "json"
  )
  return(response)
}
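
For example, the assistant's reply can be pulled out of the JSON body like this (a sketch that assumes the request returned status 200):

resp <- make_req_gpt(prompt)
if (status_code(resp) == 200) {
  # choices[[1]]$message$content holds the assistant's answer
  answer <- parse_json(content(resp, "text", encoding = "UTF-8"))$choices[[1]]$message$content
  cat(answer)
}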

Client Code

Load our custom library (the make_req_gpt function from the collect_chatgpt-.R file), then read and prep the data from all the .html pages we scraped from Oracle Forums.

In addition, parallel processing is implemented to make the R code and the overall solution a tad faster.

# read each html page in the Data folder and retrieve post URLs from them
# get the post details for each URL and store them in a data frame
# feed these post details to chatgpt

library(readr)
library(dplyr)
library(httr)
library(rvest)
library(jsonlite)
library(foreach)

source("/Users/arunamalla/Documents/GitHub/Bgit/R Packages/collect_chatgpt-.R")
# set working directory
#setwd(dir = "/Users/arunamalla/Documents/GitHub/arunamalla.github.io/_posts/_site/Chatgpt_ora_forums/Data")

# Start with an empty data frame so no junk row is created
pgurls_df <- data.frame(url = character(), pagenum = integer())
data.files <- list.files(path = getwd(), pattern = "*.html", full.names = TRUE)
for (i in seq_along(data.files)) {
  # Parse the saved forum listing page and extract the per-post URLs
  response_body_html <- read_html(read_file(file = data.files[i]))
  pgUrls <- response_body_html %>% html_nodes(".rw-ListItem-primaryText a") %>% html_attr(name = "href")
  pgUrls <- pgUrls %>% as.data.frame.character() %>% mutate(pagenum = i)
  colnames(pgUrls) <- c("url", "pagenum")
  # Full outer merge accumulates the URLs from every listing page
  pgurls_df <- merge.data.frame(pgurls_df, pgUrls, by = intersect(names(pgurls_df), names(pgUrls)), all.x = TRUE, all.y = TRUE)
}

# remove unused objects from memory
rm(list = c("pgUrls", "response_body_html", "i"))
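
The run_parallel function below reads post URLs from a df_unique table; a minimal sketch of building it, assuming it is simply the de-duplicated pgurls_df:

# Assumed: one row per distinct post URL, in the same shape as pgurls_df
df_unique <- unique(pgurls_df)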

# Load required libraries
library(foreach)
library(iterators)
library(doParallel)

# Create a cluster with two worker processes and register it with foreach
cl1 <- makeCluster(2)
registerDoParallel(cl = cl1)

# Define the run_parallel function: fetch one post, scrape its fields, and ask ChatGPT
run_parallel <- function(i, df) {
  gurl <- as.character(df_unique[i, "url"])
  df$url <- gurl

  # Fetch the post page and extract the question title, attributes, and body
  gurl_data <- httr::GET(gurl, user_agent("Mozilla/5.0"))
  gurl_data_html <- content(gurl_data, as = "text", encoding = "UTF-8") %>% read_html()
  qatitle <- gurl_data_html %>% html_nodes(".ds-Question-text .ds-Question-title") %>% html_text()
  qaattrs <- gurl_data_html %>% html_nodes(".ds-Question-text .ds-Question-attrs") %>% html_text2()
  qabody <- gurl_data_html %>% html_nodes(".ds-Question-text .ds-Question-body") %>% html_text()
  qalock <- gurl_data_html %>% html_nodes(".ds-ContentRowReport-itemDescription") %>% html_text()
  if (length(qalock) > 0) {
    # Locked/reported posts carry a notice instead of question content
    df$qa_lock <- as.character(qalock[1])
  } else {
    df$qa_lock <- as.character("NA")
    if (length(qatitle) > 0) df$qa_title <- as.character(qatitle)
    if (length(qaattrs) > 0) df$qa_attrs <- as.character(qaattrs)
    if (length(qabody) > 0) df$qa_body <- as.character(qabody)
  }

  tryCatch({
    if (length(qabody) > 0) {
      # Build the prompt from the scraped title and body, then call the API
      gpt_prompt <- paste0("Question title:\n", df$qa_title, "\nquestion details:", df$qa_body)
      gpt_response <- make_req_gpt(gpt_prompt)
      if (gpt_response$status_code != 200) {
        # On failure, pull the error message out of the response body
        err <- parse_json(content(gpt_response, "text", encoding = "UTF-8"))$error
        warning(paste0("Request completed with error. Code: ", gpt_response$status_code,
                       ", message: ", err$message))
        df$error = err$message
        df$message = err$message
        df$status_code = gpt_response$status_code
        print("error occurred - sleeping for 20 sec")
        Sys.sleep(time = 20)
      } else {
        response_content <- parse_json(content(gpt_response, "text", encoding = "UTF-8"))$choices[[1]]$message$content
        df$prompt_content = gpt_prompt
        df$chatgpt_response = response_content
      }
      # print(paste0("gpt response - ", response_content))
      # Throttle to stay under the 3-requests-per-minute limit
      Sys.sleep(time = 20)
    }
  },
  error = function(e) print(e),
  warning = function(w) print(w),
  finally = print(paste0("processed url - ", i)))

  return(df)
}

# Create an empty core_output object to seed each row
core_output <- list(url = character(), qa_title = character(), qa_attrs = character(), qa_body = character())

# Run the parallel loop over a first batch of URLs and retrieve the updated core_output
system.time({
  df1 <- foreach(j = 851:950, .combine = 'rbind', .init = core_output,
                 .packages = c("httr", "rvest", "jsonlite", "dplyr"),
                 .export = c("make_req_gpt", "df_unique", "OPENAI_API_KEY")) %dopar% {
    run_parallel(j, core_output)
  }
})

# Run the parallel loop over all URLs, call the chatgpt api, and retrieve the updated core_output
system.time({
  df1 <- foreach(j = 1:1069, .combine = 'rbind', .init = core_output,
                 .packages = c("httr", "rvest", "jsonlite", "dplyr"),
                 .export = c("make_req_gpt", "df_unique", "OPENAI_API_KEY")) %dopar% {
    run_parallel(j, core_output)
  }
})

# Stop the cluster
stopCluster(cl1)
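
To support the manual review described next, it helps to persist the prompts and responses; a minimal sketch, assuming df1 rbinds cleanly into a tabular shape (the file name is an arbitrary choice):

# Save prompts and responses for offline review before posting anything
write.csv(as.data.frame(df1), file = "chatgpt_responses.csv", row.names = FALSE)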

Post responses in Oracle Forums

Manually verify each question and its ChatGPT response: check that the responses are authentic and correct, and that they align with Oracle Forums policies and guidelines.

Post the responses as comments on each question, manually.

Further improvements

Data collection/scraping – can be automated using RSelenium, R, and GeckoDriver, provided the recent Oracle Forums migration to Oracle APEX cloud stabilizes and causes no issues when accessing data through web drivers.

Data Preparation

Set and store information such as topics, the corresponding community thread URLs, and the latest release effective on or before September 2021.

Pre-process each post title and post content to decide whether it relates to any of the topics/releases stored in the previous step. A keyword-matching sketch follows.
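
As one possible approach, a crude keyword match against the stored topic list; the topics vector and the post_title/post_body fields are hypothetical:

# Hypothetical topic list; keep posts whose title or body mentions a known topic
topics <- c("BI Publisher", "OBIEE", "SQL", "REST API", "OAUTH")
pattern <- paste(topics, collapse = "|")
in_scope <- grepl(pattern, paste(post_title, post_body), ignore.case = TRUE)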

Posting responses in Oracle Forums

This can be automated. However, verifying whether a response is correct should still be done manually.

Conclusion

Although one can see a significant improvement in productivity by solving issues at a faster pace, one cannot avoid manual intervention: each generated response must be verified, and one must check whether the suggested solution actually works. This demands a certain level of familiarity with the topics in question; in my case, OBIEE, Oracle BI Publisher, R, Python, ML, AI, SQL, etc.
