Creating A Job Posting Search Engine Using OpenAI Embeddings

I recently worked on a job posting search engine and wanted to share how I approached it and some findings.

Motivation

I had a data set of job postings and wanted to provide a way to find jobs using natural language queries. So a user could say something like “job posting for remote Ruby on Rails engineer at a startup that values diversity” and the search engine would return relevant job postings.

This would enable the user to search for jobs without having to know what filters to use. For example, if you wanted to search for remote jobs, typically you would have to check the “remote” box. But if you could just say “remote” in your query, that would be much easier. Also, you could query for more abstract terms like “has good work/life balance” or some of the attributes that something like { key: values } would give.

Approach

We could potentially use something like Elasticsearch or create our own job search engine with rules, but I wanted to see how well embeddings would work. These models are typically trained on internet-scale data, so they might capture some nuances of job postings that would be difficult for us to model.

When you embed a string of text, you get a vector that represents the meaning of the text. You can then compare the embeddings of two strings to see how similar they are. So my approach was to first get embeddings for a set of job postings. This could be done once per posting. Then, when a user enters a query, I would embed the user’s query and find the job posting vectors that were closest using cosine similarity.

One nice thing about ordering by similarity is that the most relevant job posting should be first, and then other similar job postings would be next. This matches how other search engines work.
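
To make the ranking step concrete, here is a minimal sketch of ordering postings by cosine similarity to a query. It is not the exact code I used: the function names, the posting ids, and the toy 3-dimensional vectors are all made up for illustration, and real embeddings would come from the embedding API and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_postings(query_embedding, posting_embeddings, top_n=3):
    """Return (posting_id, similarity) pairs, most similar first."""
    scored = [
        (posting_id, cosine_similarity(query_embedding, embedding))
        for posting_id, embedding in posting_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors standing in for real embeddings (hypothetical data)
posting_embeddings = {
    "rails-remote": [0.9, 0.1, 0.0],
    "ml-research": [0.1, 0.9, 0.2],
    "consultant-bellevue": [0.2, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]

results = rank_postings(query_embedding, posting_embeddings)
for posting_id, score in results:
    print(posting_id, round(score, 3))
```

For a larger data set you would likely stack the embeddings into a matrix and compute all similarities in one vectorized operation, but the idea is the same.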

OpenAI recently came out with the text-embedding-ada-002 embedding engine, which is significantly cheaper and higher performing than previous versions. Notably, the maximum input length was also increased to 8191 tokens, which means we can embed whole job postings. So I decided to use it for creating the embeddings.

The job postings data set I had included some additional data, like company name. I wanted to embed that too, so the information could be used when comparing against the user’s query:

# truncate to 8000 characters since more is not likely to yield signal and makes it less likely we'll run into token length issues
# could also do this by using tiktoken and truncating to 8191 tokens for that engine
df['for_embedding'] = df \
  .apply(lambda x: f"Job posting\nCompany: {x['company_name']}\nTitle: {x['title']}\nBody: {x['body'].strip()}"[:8000],
         axis=1)
df['embedding'] = df['for_embedding'].apply(lambda x: cached_get_embedding(x, engine='text-embedding-ada-002'))
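
The token-based alternative mentioned in the comment above could look roughly like the sketch below. The helper name truncate_to_tokens and the WhitespaceEncoding class are hypothetical and exist only so the example runs on its own. With tiktoken installed, you would pass tiktoken.get_encoding("cl100k_base") (the encoding used by text-embedding-ada-002) in place of the stand-in, since it exposes the same encode/decode methods.

```python
def truncate_to_tokens(text, encoding, max_tokens=8191):
    """Truncate text to at most max_tokens tokens under the given encoding."""
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return encoding.decode(tokens[:max_tokens])


class WhitespaceEncoding:
    """Hypothetical stand-in tokenizer: one token per space-separated word."""

    def encode(self, text):
        return text.split(" ")

    def decode(self, tokens):
        return " ".join(tokens)


# With tiktoken installed: encoding = tiktoken.get_encoding("cl100k_base")
encoding = WhitespaceEncoding()

truncated = truncate_to_tokens(
    "Job posting for a remote Rails engineer", encoding, max_tokens=4
)
print(truncated)
```

Truncating by tokens uses the full input budget, whereas the character cutoff is simpler and leaves a safety margin. For this data set the character cutoff was enough.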

Results

For my example query at the beginning of the post (“job posting for remote Ruby on Rails engineer at a startup that values diversity”), the search engine returned the following job posting body as the top result (emphasis mine):

… We are a fast-paced, user-first, technology company that’s passionate about building responsibly. We believe the future of work is a regenerative corporate environment where giving and receiving is in balance. When we build we don’t just think about maximizing profit, we believe you can be wildly profitable while also being socially and environmentally conscious. Our fully-remote team is comprised of 13 awesome people (and quickly growing!) in New York, Texas, and North Carolina. We are committed to developing diverse teams. Our current team is 35% POC and 60% women, and we continuously strive to add more diversity on our team.

Job Requirements and Responsibilities:

  • Strong front end experience and familiarity working in a Rails system
  • Design, build and test end-to-end features using Rails

Candidate Qualifications:

  • Familiarity with our stack: Rails and Angular sitting on top of Heroku using Postgres, Elasticsearch, Redis, and a variety of AWS services.
  • You have startup experience and you enjoy working in small teams

What You Get:

  • Fully remote role, so you can work from home
  • Stock Options

Pretty great fit! (Here’s a link to it, in case you’re interested!)

Some other interesting queries I ran:

“job posting for software engineer at consultancy in Washington State”

The first result was a job posting for a consultant in Bellevue, which is in Washington State. The posting didn’t mention Washington State anywhere. This is a good example of something that would be hard to do with traditional document search, but works well with embeddings trained on internet data. There must be some signal in the embeddings that captures the fact that Bellevue is located in Washington State.

“job posting for software engineer at <company name>”

The top results for this were indeed job postings for that company. This reinforces the decision to embed some metadata about the job posting.

“remote machine learning and product engineer”

One useful result had “You’d work on product-oriented research for generative natural language detection, and tackle cutting-edge deep learning and NLP problems with an emphasis on classification and adversarial methods.” Seems interesting!

Queries around eligibility (visa, citizenship, etc.)

These seemed to work OK. It was sometimes hard to tell whether a posting ranked highly because it actually satisfied the eligibility requirement or just because it mentioned the topic. It was also sometimes hard to tell which country a citizenship requirement referred to.

Asking for specific salary ranges

This didn’t seem to consistently work well. Many postings didn’t list salary information. Also, it would sometimes get confused by other compensation numbers or revenue numbers (“$10M ARR”).

Overall

This was a fun project and I was impressed with the results. It only cost me a few dollars to create the embeddings, and the search engine was pretty fast. Also, it only took a couple of hours thanks to using an off-the-shelf embedding engine.

Resources

I found the following resources helpful for implementing this approach:
